Figures and data

Task, stimuli and the WPPM.
(A) 3AFC oddity task. On each trial, participants viewed a triplet of stimuli (two identical references and one different comparison) and identified the odd one out. (B) Stimuli were constrained to lie in the isoluminant plane through the display’s gray point. Data were represented and fit in a transformation of this plane, which we refer to as model space. The grid of dots illustrates the transformation between the plane in RGB space and in model space. (C) Example of a smoothly varying covariance matrix field produced by the WPPM. The field was generated by sampling from a finite-basis Wishart random process with a smooth prior (ϵ = 0.5; see Prior over the weight matrix). Although the field is illustrated on a 7 × 7 grid, it specifies a covariance matrix 




Threshold results and validation.
(A) Adaptively sampled trials. AEPsych-driven stimulus pairs that were most informative for estimating thresholds across the entire psychometric field. Of the 6,000 trials, the first 900 were Sobol’-sampled; the remaining 5,100 (shown) were adaptively selected using a non-parametric GP model, updated every 20 trials, and the EAVC acquisition function. (B) Discrimination threshold contours read out from the WPPM fit (66.7% correct) for a representative participant (CH), based on fits to the 6,000 AEPsych trials and the fallback trials (Appendix 2). (C) Group summary of WPPM readouts (N = 8). Summary of regression slopes and correlation coefficients for all participants. Error bars: 95% confidence intervals. As a benchmark, the same analysis was performed on simulated data using CIELab ΔE 94 as ground truth. (D) Validation trials for the same participant. Reference stimuli and chromatic directions were Sobol’-sampled uniquely for each participant. (E) Comparison of thresholds. Ellipses represent discrimination threshold contours read out from the WPPM fit (same fit as in (B)), evaluated at the 25 reference stimuli used in the validation trials. The black bars at the end of each gray line show the 95% bootstrapped confidence interval for the corresponding threshold. (F) Comparison of psychometric functions. Black lines represent the Weibull functions fit to the validation trials (black points), with 95% bootstrapped confidence intervals (gray regions). Colored lines represent the psychometric functions from the WPPM fit, with the full range of 10 bootstraps shown as colored shaded regions. (G) Linear regression of thresholds read out from the WPPM fit against validation thresholds. Horizontal and vertical error bars represent 95% confidence intervals for the validation thresholds from 120 bootstraps and the full range from 10 bootstraps of the WPPM fits, respectively. (H) Summary of regression slopes and correlation coefficients for all participants. Error bars: 95% confidence intervals. As a benchmark, the same analysis was performed on simulated data using CIELab ΔE 94 as ground truth.
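The regression in (G) and the summary in (H) compare two sets of thresholds measured at the same conditions. The following is a minimal sketch of how a slope and a correlation coefficient can be computed for such paired values. It assumes ordinary least squares with an intercept, which the caption does not specify, and the numbers are made up for illustration rather than taken from the study.

```python
import math

def slope_and_correlation(x, y):
    """Return (OLS slope of y on x, Pearson correlation) for paired samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Hypothetical thresholds (illustrative values, not data from the study).
validation = [0.020, 0.030, 0.045, 0.060]
wppm = [0.022, 0.029, 0.047, 0.058]
slope, r = slope_and_correlation(validation, wppm)
```

A slope near 1 together with a correlation near 1 would indicate that the two sets of thresholds agree in both scale and rank order.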

Comparison of color discrimination thresholds with previous measurements.
(A) MacAdam 1942. Top: MacAdam’s original threshold contours, magnified by 10× for visualization. Bottom: Threshold contours from one participant in our study, transformed from the model space into CIE 1931 chromaticity space. Reference stimuli were sampled from a 5 × 5 grid evenly spaced from –0.75 to 0.75 along each dimension of the model space. To reduce visual clutter, MacAdam ellipses that fall within the gamut of our isoluminant plane (parallelogram) are shown as red arrows indicating only their major axes. For visual comparability, our ellipses are magnified 2× to approximate the size of those in MacAdam’s data. Triangle: gamut of our monitor. (B) Danilova and Mollon 2025. Left: Original threshold contours (79.4% correct) from their study, magnified by 4×. Right: Threshold contours from one participant in our study (colored ellipses), transformed from the model space into a scaled MacLeod–Boynton space. Reference points were sampled on a 5 × 5 grid ranging from –0.7 to 0.7. As in (A), to reduce visual clutter, their ellipses that fall within the gamut of our isoluminant plane (parallelogram) are shown as red arrows indicating only their major axes. For visual comparability, our ellipses are magnified by 1.5×. (C) Krauskopf and Karl 1992. Left: Original threshold contours from their study (79.4% correct, the convergence point of a three-down-one-up staircase; Fig. 14 from that study, reproduced under Creative Commons CC BY-NC-ND 4.0). Right: Threshold contours from one participant in our study, transformed into a stretched DKL space. All contours are shown at their original sizes. (D) CIELab ΔE 76, ΔE 94, and ΔE 2000. ΔE values were converted into percent correct using a Weibull psychometric function, and threshold is defined as ΔE = 2.5. Colored lines represent the measured thresholds from one participant, shown at their original sizes. For visual comparability, the predicted threshold contours from each CIELab metric (black lines) were magnified by factors of 5, 2.5, and 2.5 for the ΔE 76, ΔE 94, and ΔE 2000 metrics, respectively, to approximately match the scale of the measured thresholds in our study. See Appendix 6 to Appendix 9 for additional details.

The finite-basis Wishart Process Psychophysical Model (WPPM).
In our implementation, we used a set of 5 × 5 two-dimensional Chebyshev polynomial basis functions, denoted ϕi,j(x), where i, j ∈ {0, 1, …, 4}. These basis functions were linearly combined using a learnable weight matrix W to produce an overcomplete representation Uk,l(x), where k ∈ {1, 2} and l ∈ {1, 2, 3}. The resulting representation Uk,l was then combined with its own transpose to produce a field of symmetric, positive semi-definite covariance matrices. Each matrix specifies the internal noise in terms of the variance along the two model dimensions 
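The construction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the weights are drawn from a standard normal purely for demonstration (the smooth prior used in the paper is not reproduced), and the names `N_BASIS`, `chebyshev`, and `covariance_at` are hypothetical.

```python
import math
import random

N_BASIS = 5        # 5 x 5 basis functions, i, j in {0, ..., 4}
ROWS, COLS = 2, 3  # U(x) is 2 x 3, so U(x) U(x)^T is 2 x 2

def chebyshev(n, x):
    """Chebyshev polynomial of the first kind, T_n(x), for x in [-1, 1]."""
    return math.cos(n * math.acos(x))

def covariance_at(W, x1, x2):
    """Return the 2 x 2 matrix U(x) U(x)^T at the point x = (x1, x2)."""
    U = [[0.0] * COLS for _ in range(ROWS)]
    for i in range(N_BASIS):
        for j in range(N_BASIS):
            phi = chebyshev(i, x1) * chebyshev(j, x2)  # phi_{i,j}(x)
            for k in range(ROWS):
                for l in range(COLS):
                    U[k][l] += W[i][j][k][l] * phi
    # Multiplying U by its transpose yields a symmetric PSD matrix.
    return [[sum(U[a][l] * U[b][l] for l in range(COLS)) for b in range(ROWS)]
            for a in range(ROWS)]

# Illustrative weights only (assumption): standard normal, fixed seed.
rng = random.Random(0)
W = [[[[rng.gauss(0.0, 1.0) for _ in range(COLS)] for _ in range(ROWS)]
      for _ in range(N_BASIS)] for _ in range(N_BASIS)]
sigma = covariance_at(W, 0.25, -0.5)
```

Because every matrix has the form U Uᵀ, positive semi-definiteness holds at every point by construction, and the field varies smoothly with x because the basis functions do.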

Corner vertices in the DKL, LMS, RGB, and model spaces.

Transformation matrices between DKL, RGB and model spaces.


AEPsych-driven trials (900 Sobol’-sampled and 5,100 adaptively sampled), fallback trials, and WPPM predictions for all participants.
Each row represents data from one participant. Note that for participant CH, no pre-generated Sobol’ trials were used, as the fallback strategy was implemented later in the study to maintain experimental continuity and reduce delays between trials.

Task timing and real-time trial scheduling.
(A) Trial sequence: a 0.5 s fixation cross was followed by a 0.2 s blank interval, then a 1 s presentation of three blobby stimuli. Participants responded at their own pace to identify the odd one out, after which a 0.2 s blank screen and 0.5 s of feedback were shown. The inter-trial interval (ITI) was 1.5 s. (B) A schematic representation of the trial timing and computational responsibilities of the two computers.

Validation for participant ME.
Same format as Figure 2D-G in the main text.

Validation for participant SG.

Validation for participant DK.

Validation for participant BH.

Validation for participant FM.

Validation for participant HG.

Validation for participant FW.

Validation for participant CH.

Threshold residuals.
Data are pooled across all validation conditions and all participants (N = 8). For all panels, color indicates the surface color of the reference stimulus, and the y-axis limits are set to ± the mean of the validation thresholds. (A) Residuals as a function of the absolute angular difference between the major axis of the elliptical threshold contours read out from the WPPM fits and the chromatic direction of the validation condition. (B) Residuals as a function of the aspect ratio (major/minor axis) of the WPPM threshold contours. (C) Residuals as a function of thresholds estimated from validation trials.

Linear regression results assessing the relationship between WPPM–validation threshold discrepancies and three predictors: (1) the absolute angular difference between the chromatic direction of the validation condition and the major axis of the contours read out from the WPPM fits, (2) the aspect ratio of the contours, and (3) the magnitude of the validation threshold.
This analysis was done on human data.

Catch trial performance summary across all sessions.
The proportion correct reflects the total number of correct responses divided by the total number of catch trials. Lower and upper bounds indicate the participant’s lowest and highest session-level performance, respectively.

Derivation of the ground-truth Wishart fits based on CIELab ΔE 94.
(A–B) Comparison stimuli at the iso-distance contours in the isoluminant plane, shown in both RGB and model spaces. Note that the reference grid and fixed set of directions shown here are for illustration only; the actual sampling did not use a fixed grid or evenly spaced chromatic directions. (C) The Weibull psychometric function used to simulate binary (correct or incorrect) responses given ΔE values. (D) Sampled reference-comparison stimulus pairs. Reference colors and chromatic directions were sampled using Sobol’ sequences, and comparison stimuli were jittered around the iso-distance contour. A total of 18,000 trials were simulated; only the first 200 are shown here for clarity. (E) Comparison between readouts from the WPPM fit and from CIELab ΔE 94. The WPPM fit was subsequently treated as the ground truth for simulating AEPsych and validation trials.
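The simulation step in (C) and (D) can be sketched as follows. The parameterization is an assumption: a Weibull function with a guess rate of 1/3 (chance in a 3AFC task), scaled so that 66.7% correct falls at ΔE = 2.5. The slope value `BETA` is hypothetical, and the function names are illustrative rather than taken from the paper.

```python
import math
import random

GUESS_RATE = 1.0 / 3.0  # chance level in a 3AFC task

def weibull_p_correct(delta_e, alpha, beta):
    """Probability of a correct response for a given Delta E."""
    return 1.0 - (1.0 - GUESS_RATE) * math.exp(-((delta_e / alpha) ** beta))

def alpha_for_threshold(threshold_delta_e, beta, p_threshold=2.0 / 3.0):
    """Scale parameter that places p_threshold at threshold_delta_e."""
    k = -math.log((1.0 - p_threshold) / (1.0 - GUESS_RATE))
    return threshold_delta_e / k ** (1.0 / beta)

def simulate_response(delta_e, alpha, beta, rng):
    """Draw one binary response: 1 = correct, 0 = incorrect."""
    return 1 if rng.random() < weibull_p_correct(delta_e, alpha, beta) else 0

BETA = 1.5  # hypothetical slope; the value used in the paper is not given here
ALPHA = alpha_for_threshold(2.5, BETA)
rng = random.Random(1)
responses = [simulate_response(2.5, ALPHA, BETA, rng) for _ in range(2000)]
```

With this scaling, the proportion of simulated correct responses at ΔE = 2.5 should hover around two thirds, and it falls to chance as ΔE approaches zero.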

AEPsych-driven trials and WPPM readouts for a simulated participant.
Note that the ground-truth thresholds shown in (C) are the same WPPM readouts as in Figure S12E.

Validation trials and WPPM readouts for a simulated participant.

Threshold residuals for a simulated dataset.
For all panels, color indicates the surface color of the reference stimulus, and the y-axis limits are set to ± the mean of the validation thresholds. (A) Residuals as a function of the absolute angular difference between the major axis of the elliptical threshold contours read out from the WPPM fits and the chromatic direction of the validation condition. (B) Residuals as a function of the aspect ratio (major/minor axis) of the WPPM threshold contours. (C) Residuals as a function of thresholds estimated from validation trials.

Linear regression results for the simulated dataset.

Deviation of WPPM estimates from the ground truth.
(A) BW distance between WPPM-estimated thresholds and the ground-truth ellipses. The upper limit of the color map (0.17) corresponds to the maximum BW distance between each ground-truth ellipse and a reference circle whose radius equals the largest major axis length among all ground-truth ellipses. The maximum BW distance between WPPM estimates and the ground truth (0.03) is substantially lower than this reference value. (B) Difference in major axis length between WPPM readouts and ground-truth ellipses. The color map limits (±0.17) correspond to ± the maximum ground-truth major axis length. Again, the maximum deviation observed (0.03) is small relative to this range.
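Assuming that BW here denotes the Bures-Wasserstein distance between the 2 × 2 covariance matrices that define the ellipses, the distance has a closed form in two dimensions. The sketch below uses the identity tr(M^(1/2)) = sqrt(tr(M) + 2 sqrt(det(M))) for a 2 × 2 matrix with non-negative eigenvalues; how the paper scales the matrices before computing the distance is not stated in this caption.

```python
import math

def bw_distance(A, B):
    """Bures-Wasserstein distance between two 2 x 2 PSD matrices.

    d^2 = tr(A) + tr(B) - 2 tr((A^(1/2) B A^(1/2))^(1/2)),
    where the last trace depends only on tr(AB) and det(A) det(B).
    """
    tr_a = A[0][0] + A[1][1]
    tr_b = B[0][0] + B[1][1]
    det_a = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    det_b = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    tr_ab = sum(A[i][j] * B[j][i] for i in range(2) for j in range(2))
    cross = math.sqrt(max(tr_ab + 2.0 * math.sqrt(max(det_a * det_b, 0.0)), 0.0))
    # Clamp at zero to absorb floating-point round-off.
    return math.sqrt(max(tr_a + tr_b - 2.0 * cross, 0.0))

# Two axis-aligned ellipses with swapped axes (illustrative values).
A = [[4.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 4.0]]
distance = bw_distance(A, B)  # sqrt(2) for these two matrices
```

The distance is zero only when the two matrices coincide, so it penalizes differences in size, shape, and orientation at once.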

Comparison with MacAdam 1942.
Left: MacAdam’s original ellipses, enlarged 10× for visualization. Red arrows indicate major axis directions at unsampled reference locations, inferred from nearby ellipses. Right: Threshold ellipses from our measurements, magnified 2× for visualization. Triangle: gamut of our monitor; parallelogram: gamut of our isoluminant plane.

Comparison with Danilova and Mollon 2025 in the scaled MacLeod–Boynton space.
Top left: threshold contours from their study (black ellipses), enlarged by 4×. Remaining panels: threshold contours from all participants in our study (colored ellipses; N = 8). We sampled a grid of reference points evenly spaced from –0.7 to 0.7 (5 steps) in our model space, read out the corresponding threshold contours, and transformed them into the same scaled MacLeod–Boynton space. The parallelogram indicates the gamut of the isoluminant plane. To reduce visual clutter, ellipses from Danilova & Mollon that fall within our gamut are represented by red arrows indicating only their major axes. For visual comparability, our ellipses are enlarged by 1.5× to roughly match the size of those in their study.

Transformation from the model space to a stretched DKL space used in Krauskopf and Karl 1992 for participant CH.
(A) Model space. Threshold contours were read out in this space based on each participant’s WPPM fit. Notably, our data were collected over a much larger region of the isoluminant plane than Krauskopf and Karl characterized. (B) The intermediate, unstretched DKL space. Transformations between this space and both the model space and the stretched DKL space are affine. (C) Stretched DKL space, in which the cardinal axes of the original DKL space are rescaled such that the threshold at the achromatic reference point is normalized to one.
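Because the transformations between these spaces are affine, an elliptical contour stays elliptical after the mapping. A minimal sketch, assuming each contour is written as (x − c)ᵀ S⁻¹ (x − c) = 1 with center c and shape matrix S: under y = M x + t, the center becomes M c + t and the shape matrix becomes M S Mᵀ. The matrix `M` and offset `t` below are hypothetical, chosen to mimic a rescaling of two axes.

```python
def matmul2(P, Q):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transform_ellipse(center, shape, M, t):
    """Map the ellipse (x - c)^T S^-1 (x - c) = 1 through y = M x + t."""
    M_T = [[M[0][0], M[1][0]], [M[0][1], M[1][1]]]
    new_center = [M[i][0] * center[0] + M[i][1] * center[1] + t[i]
                  for i in range(2)]
    new_shape = matmul2(matmul2(M, shape), M_T)
    return new_center, new_shape

# Hypothetical axis rescaling and offset (not the paper's matrices).
M = [[2.0, 0.0], [0.0, 0.5]]
t = [0.1, -0.2]
center, shape = transform_ellipse([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], M, t)
```

Here a unit circle at the origin becomes an ellipse with semi-axes 2 and 0.5 centered at (0.1, −0.2), which is the kind of stretching that a per-axis rescaling introduces.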

Comparison with Krauskopf and Karl 1992 across participants.
Top left: original threshold contours reported by Krauskopf and Karl 1992 (reproduced under Creative Commons CC BY-NC-ND 4.0). Remaining panels: threshold contours for the remaining participants, transformed into the stretched DKL space using participant-specific scaling of the cardinal axes. All contours are plotted at their original sizes.

Comparison with CIELab ΔE 94 (McDonald and Smith, 1995) predictions.
The predictions are scaled by a factor of 2.5 to approximately match the scale of the measured thresholds in our study, which are shown at their original scale.

Comparison with CIELab ΔE 2000 (Sharma et al., 2005) predictions.
The predictions are scaled by a factor of 2.5 to approximately match the scale of the measured thresholds in our study, which are shown at their original scale.

Comparison with CIELab ΔE 76 (Robertson et al., 1977) predictions.
The predictions are scaled by a factor of 5 to approximately match the scale of the measured thresholds in our study, which are shown at their original scale.

Stimuli and equipment used for calibration.
(A) The stimulus setup during calibration was identical to that used in the main experiment. The surface color of both the cubic room and the blobby stimulus (shown here as the top-position stimulus) was varied during the calibration procedure. The shaded gray circular region on the stimulus indicates the area measured by the spectroradiometer’s lens. (B) A SpectraScan PR-670 used for all calibration measurements.

Calibration results.
(A) Gamma functions for the red, green, and blue primaries. Note that Unity’s internal correction places them above the identity line. (B) Spectral power distributions (SPDs) of the three primaries across a range of intensity levels. (C) The chromaticity of each primary in the CIE chromaticity diagram at different intensity levels. (D) Normalized SPDs for each primary, showing spectral shape stability across intensity levels. (E) Linearity tests comparing predicted and measured chromaticity and luminance across two independent measurement runs. (F) Deviations from linearity. (G) Effect of the cubic room’s background color on the SPD of the blobby stimulus, showing no detectable influence.

Comparison of calibration results across the three blobby stimuli.
(A) Spectral power distributions (SPDs) for each stimulus location: Ref Cal (bottom right), Cal 2 (bottom left), and Cal 3 (top). (B) Ambient light SPDs measured during calibration. (C) Gamma functions for each primary (red, green, blue) across all three stimulus locations. (D) Differences in normalized output for each pairwise comparison of stimulus locations, plotted separately for each primary. (E) Chromaticity coordinates of each primary in the CIE diagram, shown for all three stimulus locations.

Gamma correction.
(A) Measured gamma functions and their corresponding inverse functions for the red, green, and blue primaries, used to construct the gamma correction lookup table. (B) Gamma functions re-measured after applying the correction in Unity, showing close alignment with the identity line for all three primaries.
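One common way to build such a lookup table is to invert the measured gamma function by linear interpolation. The sketch below is illustrative only: the "measurement" is a synthetic power law with exponent 2.2, the 256-entry table size is an assumption, and the function names are hypothetical. The procedure actually used in the study may differ.

```python
def invert_by_interpolation(settings, outputs, target):
    """Return the setting whose output equals target.

    settings and outputs must both be in ascending order; values between
    measured points are found by linear interpolation.
    """
    if target <= outputs[0]:
        return settings[0]
    if target >= outputs[-1]:
        return settings[-1]
    for i in range(1, len(outputs)):
        if target <= outputs[i]:
            frac = (target - outputs[i - 1]) / (outputs[i] - outputs[i - 1])
            return settings[i - 1] + frac * (settings[i] - settings[i - 1])

def build_lut(settings, outputs, n_entries=256):
    """Lookup table mapping a desired linear output to a device setting."""
    return [invert_by_interpolation(settings, outputs, k / (n_entries - 1))
            for k in range(n_entries)]

# Synthetic "measured" gamma function (assumption): output = setting ** 2.2.
settings = [k / 16 for k in range(17)]
outputs = [s ** 2.2 for s in settings]
lut = build_lut(settings, outputs)
```

Passing each table entry back through the forward gamma function should return values close to the identity line, which is the check shown in panel (B).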

Comparison between the initial and repeated calibration one month into data collection.
(A) Spectral power distributions (SPDs) from two calibration sessions at the bottom-right blobby stimulus location: Ref Cal (initial calibration prior to the experiment) and Cal 2 (follow-up calibration). (B) Ambient light SPDs measured during each calibration. (C) Gamma functions for the red, green, and blue primaries across both sessions, with gamma correction applied. (D) Chromaticity coordinates of each primary plotted in the CIE diagram for both calibration runs.

Stimuli and equipment used for calibration.
(A) The stimulus setup during calibration was identical to that used in the main experiment. The surface color of both the cubic room and the blobby stimulus (shown here as the top-position stimulus) was varied across trials during the calibration procedure. The shaded gray circular region on the stimulus indicates the area measured by the spectroradiometer’s lens. (B) A SpectraScan PR-670 used for all calibration measurements.

Evidence of spatial dithering by Unity’s standard shader when the surface texture of the stimulus is being modified.
(A) Spatial dithering by Unity’s standard shader is suggested by comparing the luminance measurements from the Klein K-10A (averaged across a circular region on the blobby object) with the RGB values stored in the frame buffer. The measured luminance shows small incremental changes as the RGB settings increase in steps of 1/1023. These measurements are consistent with what we obtain by averaging over pixels in a saved image of the frame buffer (saved from Unity in .exr format). The averaged pixel values exhibit 10-bit quantization even though individual pixel values exhibit 8-bit quantization. (B) Top row: mean R channel values averaged vertically within a horizontal slice of the blobby object. Bottom row: differences in the R channel values between the minimum target R channel setting and each of the remaining settings. Different shades of gray represent different target R settings. For illustration, only a portion of the horizontal slice is shown, and solid lines in the bottom row are scaled by a factor of 0.1. Dashed lines: the mean difference averaged across all pixels within each slice.
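The principle behind this observation can be illustrated with a toy example: mixing two adjacent 8-bit levels across many pixels yields a spatial average that resolves finer steps than any single pixel can. The sketch below is not Unity's shader algorithm, which is not described here; the pixel count and the function name `dither_to_8bit` are assumptions.

```python
import math

def dither_to_8bit(target, n_pixels):
    """Approximate target in [0, 1] with n_pixels 8-bit values.

    Only the two 8-bit levels adjacent to the target are used; the share
    of pixels at the upper level sets the spatial average. A real dither
    would interleave the two levels spatially instead of grouping them.
    """
    scaled = target * 255.0
    low = math.floor(scaled)
    high = min(low + 1, 255)
    n_high = round((scaled - low) * n_pixels)
    return [high] * n_high + [low] * (n_pixels - n_high)

target = 513 / 1023  # a 10-bit setting that falls between two 8-bit levels
pixels = dither_to_8bit(target, 1024)
mean_value = sum(pixels) / len(pixels) / 255.0
```

Each pixel remains an 8-bit integer, yet the mean tracks the target far more closely than the 1/255 spacing between individual levels, consistent with averaged values showing 10-bit quantization.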