Discovering and deciphering relationships across disparate data modalities

  1. Joshua T Vogelstein  Is a corresponding author
  2. Eric W Bridgeford
  3. Qing Wang
  4. Carey E Priebe
  5. Mauro Maggioni
  6. Cencheng Shen
  1. Johns Hopkins University, United States
  2. Child Mind Institute, United States
  3. University of Delaware, United States
6 figures, 5 tables and 4 additional files


Illustration of Multiscale Graph Correlation (Mgc) on simulated cloud density (x_i) and grass wetness (y_i).

We present two different relationships: linear (top) and nonlinear spiral (bottom; see Materials and methods for simulation details). (A) Scatterplots of the raw data using 50 pairs of samples for each scenario. Samples 1, 2, and 3 (black) are highlighted; arrows show x distances between these pairs of points while their y distances are almost 0. (B) Scatterplots of all pairs of distances comparing x and y distances. Distances are linearly correlated in the linear relationship, whereas they are not in the spiral relationship. Dcorr uses all distances (gray dots) to compute its test statistic and p-value, whereas Mgc chooses the local scale and then uses only the local distances (green dots). (C) Heatmaps characterizing the strength of the generalized correlation at all possible scales (ranging from 2 to n for both x and y). For the linear relationship, the global scale is optimal, which is the scale that Mgc selects and results in a p-value identical to Dcorr. For the nonlinear relationship, the optimal scale is local in both x and y, so Mgc achieves a far larger test statistic, and a correspondingly smaller and significant p-value. Thus, Mgc uniquely detects dependence and characterizes the geometry in both relationships.
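The local-correlation idea in panel (B) can be sketched in a few lines. Below is a minimal Python illustration, not the authors' implementation: it uses simple column centering and a hard rank threshold, whereas the published Mcorr statistic uses a modified centering and Mgc selects its scale via a smoothed maximum over all (k, l).

```python
import numpy as np

def dist_matrix(x):
    # Pairwise Euclidean distances; x may be (n,) or (n, p).
    x = np.atleast_2d(x.T).T
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)

def row_ranks(D):
    # Rank of each distance within its row (0 = nearest, i.e. the point itself).
    return D.argsort(axis=1).argsort(axis=1)

def local_correlation(x, y, k, l):
    """Local correlation at scale (k, l): only distances to each point's
    k (for x) or l (for y) nearest neighbors contribute. Setting
    k = l = n recovers a global, Dcorr-style statistic."""
    A, B = dist_matrix(x), dist_matrix(y)
    RA, RB = row_ranks(A), row_ranks(B)
    A = A - A.mean(axis=0)            # simplified centering (see note above)
    B = B - B.mean(axis=0)
    near = (RA < k) & (RB < l)        # pairs that are locally close in both x and y
    cov = (A * B * near).sum()
    var_a = (A * A * (RA < k)).sum()
    var_b = (B * B * (RB < l)).sum()
    return cov / np.sqrt(var_a * var_b)
```

Scanning `local_correlation` over all 2 ≤ k, l ≤ n yields a heatmap analogous to the ones in panel (C); picking the scale with the largest (suitably regularized) value gives the green dot.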
Figure 2 with 4 supplements
An extensive benchmark suite of 20 different relationships spanning polynomial, trigonometric, geometric, and other relationships demonstrates that Mgc empirically nearly dominates eight other methods across dependencies and dimensionalities ranging from 1 to 1000 (see Materials and methods and Figure 2—figure supplement 1 for details).

Each panel shows the testing power of other methods relative to the power of Mgc (e.g. power of Mcorr minus the power of Mgc) at significance level α=0.05 versus dimensionality for n=100. Any line below zero at any point indicates that that method’s power is less than Mgc’s power for the specified setting and dimensionality. Mgc achieves empirically better (or similar) power than all other methods in almost all relationships and all dimensions. For the independent relationship (#20), all methods yield power 0.05 as they should. Note that Mgc is always plotted ‘on top’ of the other methods, therefore, some lines are obscured.
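The power curves here are Monte Carlo estimates: for each replicate, compute a permutation p-value, and report the fraction of replicates rejecting at α = 0.05. A minimal sketch, using |Pearson r| as a hypothetical stand-in for the statistics being compared:

```python
import numpy as np

def perm_pvalue(x, y, stat, n_perms=200, rng=None):
    """Permutation p-value: permuting y breaks any dependence on x."""
    rng = rng if rng is not None else np.random.default_rng(0)
    observed = stat(x, y)
    null = [stat(x, rng.permutation(y)) for _ in range(n_perms)]
    return (1 + sum(t >= observed for t in null)) / (1 + n_perms)

def empirical_power(sample, stat, alpha=0.05, n_reps=100):
    """Fraction of sampled replicates whose p-value falls below alpha."""
    rng = np.random.default_rng(1)
    hits = sum(perm_pvalue(*sample(rng), stat, rng=rng) <= alpha
               for _ in range(n_reps))
    return hits / n_reps

def linear_with_noise(rng, n=100):
    # An illustrative noisy linear dependence (not one of the paper's 20).
    x = rng.uniform(-1, 1, n)
    return x, x + 0.5 * rng.standard_normal(n)

abs_pearson = lambda x, y: abs(np.corrcoef(x, y)[0, 1])
```

Under independence the same estimate hovers near α, which is what relationship #20 checks.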
Figure 2—figure supplement 1
Visualization of the 20 dependencies at p=q=1.

For each, n=100 points are sampled with noise (κ=1) to show the actual sample data used for one-dimensional relationships (gray dots). For comparison purposes, n=1000 points are sampled without noise (κ=0) to highlight each underlying dependency (black dots). Note that only black points are plotted for type 19 and 20, as they do not have the noise parameter κ.
Figure 2—figure supplement 2
The same power plots as in Figure 2, except the 20 dependencies are one-dimensional with noise, and the x-axis shows sample size increasing from 5 to 100.

Mgc empirically achieves similar or better power than the previous state-of-the-art approaches on most problems. Note that Mic is included in the 1D case; Rv and Cca both equal Pearson in 1D; Kendall and Spearman are omitted from the plots because their power is nearly identical to Pearson's.
Figure 2—figure supplement 3
The same set-ups as in Figure 2, comparing different Mgc implementations versus its global counterparts.

The default Mgc builds upon Mcorr throughout the paper; we further consider an Mgc version of Mantel to illustrate the generalization. The magenta line shows the power difference between Mcorr and Mgc, and the cyan line shows the power difference between Mantel and the Mgc version of Mantel. Indeed, Mgc improves upon its global counterpart in testing power under nonlinear dependencies, and maintains similar power under linear and independent relationships.
Figure 2—figure supplement 4
The same power plots as in Figure 3, except the 20 dependencies are one-dimensional with noise, and the x-axis shows sample size increasing from 5 to 100.
Figure 3 with 1 supplement
The Mgc-Map characterizes the geometry of the dependence function.

For each of the 20 panels, the abscissa and ordinate denote the number of neighbors for X and Y, respectively, and the color denotes the magnitude of each local correlation. For each simulation, the sample size is 60, and both X and Y are one-dimensional. Each dependency has a different Mgc-Map characterizing the geometry of dependence, and the optimal scale is shown in green. In linear or close-to-linear relationships (first row), the optimal scale is global, that is, the green dot is in the top right corner. Otherwise the optimal scale is non-global, which holds for the remaining dependencies. Moreover, similar dependencies often share similar Mgc-Maps and similar optimal scales, such as (10) logarithmic and (11) fourth root, the trigonometric functions in (12) and (13), (16) circle and (17) ellipse, and (14) square and (18) diamond. The Mgc-Maps for high-dimensional simulations are provided in Figure 3—figure supplement 1.
Figure 3—figure supplement 1
The Mgc-Map for the 20 panels for high-dimensional dependencies.

For each simulation, the sample size is 100, and the dimension is chosen such that Mgc has a testing power above 0.5. The maps behave and are interpreted like the one-dimensional maps in Figure 3: the optimal scales of linear relationships are global, and similar dependencies share similar Mgc-Maps.
Demonstration that Mgc successfully detects dependency, distinguishes linearity from nonlinearity, and identifies the most informative feature in a variety of real data experiments.

(A) The Mgc-Map for brain activity versus personality. Mgc has a large test statistic and a significant p-value at the optimal scale (13, 4), while the global counterpart is non-significant. That the optimal scale is non-global implies a strongly nonlinear relationship. (B) The Mgc-Map for brain connectivity versus creativity. The image is similar to that of a linear relationship, and the optimal scale equals the global scale, thus both Mgc and Mcorr are significant in this case. (C) For each peptide, the x-axis shows the p-value for testing dependence between pancreatic and healthy subjects by Mgc, and the y-axis shows the p-value for testing dependence between pancreatic and all other subjects by Mgc. At critical level 0.05, Mgc identifies a unique protein after multiple testing adjustment. (D) The true and false positive counts from a k-nearest-neighbor leave-one-out classification (choosing the best k ∈ [1, 10]) using only the significant peptides identified by each testing method. The peptide identified by Mgc achieves the best true and false positive rates, as compared to the peptides identified by Hsic or Hhg.
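The classifier behind panel (D) can be sketched directly: leave-one-out k-nearest-neighbor classification, choosing the best k ∈ [1, 10]. A minimal Python sketch on hypothetical features (the actual analysis uses only the peptides each test selects):

```python
import numpy as np

def loo_knn_accuracy(X, labels, k):
    """Leave-one-out accuracy of a k-nearest-neighbor classifier."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    n = len(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # exclude the held-out point itself
    correct = 0
    for i in range(n):
        neighbors = labels[np.argsort(D[i])[:k]]
        # Majority vote among the k nearest neighbors.
        vals, counts = np.unique(neighbors, return_counts=True)
        correct += vals[counts.argmax()] == labels[i]
    return correct / n

def best_k_accuracy(X, labels, ks=range(1, 11)):
    """Report the best accuracy over k in [1, 10], as in panel (D)."""
    return max(loo_knn_accuracy(X, labels, k) for k in ks)
```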
Appendix 1—figure 1
We demonstrate that Mgc is a valid test that does not inflate the false positives in screening and variable selection.

This figure shows the density estimate for the false positive rates of applying Mgc to select the 'falsely significant' brain regions versus independent noise experiments; dots indicate the false positive rate of each experiment. The mean ± standard deviation is 0.0538 ± 0.0394.
Author response image 1
We compute the test statistics of MGC, DCORR, and HSIC for 100 replicates, and then plot the average running time on a log scale (clocked using Matlab 2017a on a Windows 10 machine with an i7 six-core CPU).

The sample data are repeatedly generated using the quadratic relationship in the Appendix, the sample size increases from 25 to 500, and the dimensionality is fixed at p = 1 on the left and p = 1000 on the right. In either panel, the three lines differ by constants on the log scale, suggesting the same running-time complexity with different constants. MGC has a higher intercept than the other two: at n = 500 it is roughly 6 times slower than DCORR and 3 times slower than HSIC for p = 1, and roughly 3 times slower at p = 1000.
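A timing harness of this shape reproduces the comparison in spirit. The statistic below is a simplified O(n²) distance-covariance sketch in Python, not any of the actual Matlab implementations; the quadratic data follow the Appendix setup:

```python
import time
import numpy as np

def dcov_stat(x, y):
    """A simplified O(n^2) distance-covariance-style statistic."""
    A = np.abs(x[:, None] - x[None, :])
    B = np.abs(y[:, None] - y[None, :])
    A = A - A.mean(0) - A.mean(1)[:, None] + A.mean()
    B = B - B.mean(0) - B.mean(1)[:, None] + B.mean()
    return (A * B).mean()

def time_stat(stat, n, reps=10):
    """Average wall-clock time of `stat` over `reps` replicates."""
    rng = np.random.default_rng(0)
    start = time.perf_counter()
    for _ in range(reps):
        x = rng.uniform(-1, 1, n)
        y = x ** 2 + 0.5 * rng.standard_normal(n)  # quadratic relationship
        stat(x, y)
    return (time.perf_counter() - start) / reps

for n in (25, 100, 500):
    print(f"n={n}: {time_stat(dcov_stat, n):.2e} s")
```

Plotting these averages on a log scale for several statistics reveals both the shared complexity (parallel lines) and the constant-factor gaps (different intercepts).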


Table 1
The median sample size for each method to achieve power 85% at type one error level 0.05, grouped into monotone (type 1–5) and non-monotone relationships (type 6–19) for both one- and ten-dimensional settings, normalized by the number of samples required by Mgc.

In other words, a 2.0 indicates that the method requires double the sample size to achieve 85% power relative to Mgc. Pearson, Rv, and Cca all achieve the same performance, as do Spearman and Kendall. Mgc requires the fewest number of samples in all settings, and for high-dimensional non-monotonic relationships, all other methods require about double or triple the number of samples Mgc requires.
(The first three value columns correspond to the one-dimensional setting, the last three to the ten-dimensional setting.)

Dependency type    | Monotone | Non-Mono | Average | Monotone | Non-Mono | Average
Pearson / Rv / Cca | 1        | >10      | >10     | 0.8      | >10      | >10
Spearman / Kendall | 1        | >10      | >10     | n/a      | n/a      | n/a
Table 1—source data 1

Testing power sample size data in one dimension.
Table 1—source data 2

Testing power sample size data in high-dimensions.
Table 2
The p-values for brain imaging vs mental properties.

Mgc always uncovers the existence of significant relationships and discovers the underlying optimal scales. Bold indicates significant p-value per dataset.
Testing pairs / Methods    | Mgc   | Dcorr | Mcorr | Hhg   | Hsic
Activity vs Personality    | 0.043 | 0.667 | 0.441 | 0.059 | 0.124
Connectivity vs Creativity | 0.011 | 0.010 | 0.011 | 0.031 | 0.092
Table 2—source data 1

p-value data for activity vs personality.
Table 2—source data 2

p-value data for connectivity vs creativity.
Appendix 1—table 1
Results for cancer peptide screening.

The first two rows report the p-values for the tests of interest based on all peptides. The next four rows report the number of significant proteins from individual peptide tests; the Benjamini-Hochberg procedure is used to locate the significant peptides by controlling the false discovery rate at 0.05.
Testing pairs / Methods       | Mgc    | Mantel | Dcorr  | Mcorr  | Hhg
1 Ovar vs. Norm: p-value      | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001
2 Ovar vs. Norm: # peptides   | 218    | 190    | 186    | 178    | 225
3 Pancr vs. Norm: p-value     | 0.0082 | 0.0685 | 0.0669 | 0.0192 | 0.0328
4 Panc vs. Norm: # peptides   | 9      | 7      | 6      | 7      | 11
5 Panc vs. All: # peptides    | 1      | 0      | 0      | 0      | 3
6 # peptides unique to Panc   | 1      | 0      | 0      | 0      | 2
7 # false positives for Panc  | 0      | n/a    | n/a    | n/a    | 1
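The Benjamini-Hochberg step-up procedure used throughout this screen is short enough to sketch (a generic implementation, not the authors' code):

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.05):
    """Boolean mask of p-values declared significant while controlling
    the false discovery rate at `fdr` (Benjamini-Hochberg step-up)."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)
    thresholds = fdr * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up: reject everything up to the largest i with p_(i) <= i*q/m.
        cutoff = below.nonzero()[0].max()
        significant[order[:cutoff + 1]] = True
    return significant
```

Applied per peptide, the mask's count gives the "# peptides" rows above.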
Appendix 1—table 1—source data 1

Ovarian testing results.
Appendix 1—table 1—source data 2

Pancreatic testing results.
Appendix 1—table 1—source data 3

Peptide screening results for pancreatic.
Appendix 1—table 2
For each of Mgc, Dcorr, Mcorr, Hhg, Hsic, Mantel, Pearson, and Mic, list the top four peptides identified for Panc vs All and the respective corrected p-value using Benjamini-Hochberg.

Bold indicates a significant peptide at type 1 error level 0.05. The top candidates are largely alike, except for Mic. In particular, neurogranin is consistently among the top candidates for all methods, but is only significant under Mgc, Hsic, and Hhg; there are two other significant proteins from Hsic and Hhg, but they do not further improve the classification performance compared to using neurogranin alone. Note that the p-values from Mantel and Pearson are always 1 after Benjamini-Hochberg correction, so their respective top peptides are identified using raw p-values without correction.
Method  | Top four identified peptides
Mgc     | neurogranin | fibrinogen protein 1 | tropomyosin alpha-3 | ras suppressor protein 1
Dcorr   | neurogranin | fibrinogen protein 1 | kinase 6 | twinfilin-2
Mcorr   | neurogranin | fibrinogen protein 1 | kinase 6 | tropomyosin alpha-3
Hsic    | neurogranin | tropomyosin alpha-3 | kinase 6 | tripeptidyl-peptidase 2
Hhg     | neurogranin | fibrinogen protein 1 | tropomyosin alpha-3 | platelet basic protein
Mantel  | neurogranin | adenylyl cyclase | tropomyosin alpha-3 | alpha-actinin-1
Pearson | neurogranin | adenylyl cyclase | tropomyosin alpha-3 | alpha-actinin-1
Mic     | kinase B | S100-A9 | ERF3A | thymidine
Appendix 1—table 3
The actual testing time (in seconds) on real data.

Additional files

Transparent reporting form


Discovering and deciphering relationships across disparate data modalities
eLife 8:e41690.