Computational and Systems Biology

Discovering Root Causal Genes with High Throughput Perturbations

Eric V Strobl
Eric R Gamazon

University of Pittsburgh, Pittsburgh, United States
Vanderbilt University Medical Center, Nashville, United States

https://doi.org/10.7554/eLife.100949.2

Open access

Figures and data

(a) Toy example where a variable E₂ simultaneously models genetic and non-genetic root causes that jointly have a large causal effect on a diagnosis Y through gene expression . E₂ first affects the gene expression level , or the root causal gene. The root causal gene then affects other downstream levels during pathogenesis, including the core (or direct causal) gene , to ultimately induce a diagnosis Y. (b) We hypothesize that the causal effects of most root causes are small, but a few are large (red ellipse), in each patient with disease. As a result, the distribution of these root causal effects tends to be right skewed in disease.

Method overview and synthetic data results. (a) We consider a latent causal graph over the true counts . (b) We augment the graph with error terms E such that each E_i ∈ E in red has an edge directed towards . (c) The RCS of , denoted by Φ₂, quantifies the magnitude of the conditional root causal effect, or the strength of the causal effect from E₂ to Y conditional on . (d) We cannot observe in practice but instead observe the noisy surrogates X in blue corrupted by Poisson measurement error. (e) Perturbing a variable such as changes the marginal distributions of downstream variables shown in green under mild conditions. (f) RCSP thus uses the perturbation data to identify (an appropriate superset of) the surrogate parents for each variable in order to compute Φ. (g) Violin plots show that RCSP achieved the smallest RMSE to the ground truth RCS values in the synthetic data. (h) RCSP also took about the same amount of time to complete as multivariate regression. Univariate regression only took 11 seconds on average, so its bar is not visible. Error bars denote 95% confidence intervals of the mean. (i) Finally, RCSP maintained low RMSE values regardless of the number of clusters considered.
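
The measurement model in panel (d) can be sketched as follows. This is a minimal illustration, not the authors' simulation code: it assumes each observed count is drawn from a Poisson distribution whose mean is the latent true count scaled by a hypothetical `efficiency` factor. The names `sample_poisson`, `observe_counts`, and `efficiency`, and the example values, are assumptions introduced here.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's method (adequate for small rates)."""
    if rate <= 0:
        return 0
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    # Multiply uniforms until the running product drops below exp(-rate).
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def observe_counts(true_counts, efficiency, rng):
    """Corrupt latent true counts with Poisson measurement error (assumed model)."""
    return [sample_poisson(c * efficiency, rng) for c in true_counts]


rng = random.Random(0)
true_counts = [2.0, 5.0, 10.0]  # hypothetical latent expression levels
efficiency = 0.5                # hypothetical sequencing efficiency
draws = [observe_counts(true_counts, efficiency, rng) for _ in range(5000)]

# Averaged over many draws, each noisy surrogate tracks its scaled latent count.
mean_observed = [sum(column) / len(column) for column in zip(*draws)]
```

Under this assumed model the surrogates are unbiased for the scaled latent counts, but any single draw can deviate substantially when counts are low.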

Analysis of AMD. (a) The distribution of the RCS scores of age deviated away from zero and had a composite D-RCS of 0.46. (b) However, the majority of gene D-RCS scores concentrated around zero, whereas the majority of gene D-SD scores concentrated around the relatively larger value of 0.10. Furthermore, the D-RCS scores of the genes in (d) mapped onto the “amino acid transport across the plasma membrane” pathway known to be involved in the pathogenesis of AMD in (c). Blue bars survived 5% FDR correction. (e) Drug enrichment analysis revealed four significant drugs, the latter three of which have therapeutic potential. (f) Hierarchical clustering revealed four clear clusters according to the elbow method, which we plot by UMAP dimensionality reduction in (g). The RCS scores of the top genes in (d) increased only from left to right on the first UMAP dimension (x-axis); we provide an example of SLC7A5 in (h) and one of three detected exceptions in (i). We therefore performed pathway enrichment analysis on the black cluster in (g) containing the largest RCS scores. (j) The amino acid transport pathway had a larger degree of enrichment in the black cluster as compared to the global analysis in (c).
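
The 5% FDR threshold mentioned above can be illustrated with a short sketch. The caption does not state which correction procedure was used, so this example assumes the Benjamini-Hochberg step-up procedure. The function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses pass BH at level alpha."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= alpha * k / m (step-up rule).
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    # Every hypothesis ranked at or below the cutoff is rejected.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical pathway p-values, not taken from the article.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
survivors = benjamini_hochberg(example_p, alpha=0.05)
```

Because the rule is step-up, a p-value that misses its own rank threshold can still be rejected when a larger p-value meets a later threshold.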

Analysis of MS. (a) The distribution of the RCS scores of age deviated away from zero with a composite D-RCS of 0.55. (b) The distribution of D-RCS concentrated around zero, whereas the distribution of D-SD concentrated around 0.3. (d) RCSP identified many genes with large D-RCS scores that in turn mapped onto known pathogenic pathways in MS in (c). Hierarchical clustering revealed three clusters in (e), which we plot in two dimensions with UMAP in (f). Top genes did not correlate with either dimension of the UMAP embedding; we provide an example of the MNT gene in (g). (h) Drug enrichment analysis in the green cluster implicated multiple cathepsin inhibitors. Finally, EPH-ephrin signaling survived FDR correction in (c) and was enriched in the pink cluster in (i), which contained more MS patients with the relapsing-remitting subtype in (j); subtypes include relapsing-remitting (RR), primary progressive (PP), secondary progressive (SP), clinically isolated syndrome (CIS), and radiologically isolated syndrome (RIS).

In this example, two root causal genes and affect many downstream genes and ultimately cause Y. Thus all genes correlate with Y, but only and have large root causal effects on Y. The omnigenic root causal model posits that only a few root causal genes affect many downstream genes, so that nearly all genes are correlated with Y. Causal genetic variants can directly cause Y or cause any gene expression level that causes Y – including those with small root causal effects – but only and have large root causal effects on Y due to genetic and non-genetic root causes modeled by E₁ and E₂. In contrast, the core gene model assumes only a few direct causal genes . These core genes do not account for the deleterious causal effects of E₁ and E₂ on and .

An example of a DAG over augmented with the error terms E. The observed vertices X denote counts corrupted by batch B effects and Poisson measurement error.

Mean RMSE to the ground truth RCS values across different mean sequencing depths and normalization strategies. The no normalization strategy achieved low RMSEs at lower mean sequencing depths, but the performances of all methods converged as the mean sequencing depths increased. Error bars denote 95% confidence intervals of the mean RMSE.
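
The summary statistics in this caption can be sketched as follows. The caption does not say how the confidence intervals were computed, so this example assumes a normal approximation (mean ± 1.96 standard errors). The helper names `rmse` and `mean_with_ci` and the example values are hypothetical.

```python
import math
import statistics


def rmse(estimates, truths):
    """Root mean squared error between estimated and ground truth values."""
    if len(estimates) != len(truths) or not truths:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


def mean_with_ci(values, z=1.96):
    """Mean and normal-approximation 95% confidence interval of the mean."""
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - z * sem, mean + z * sem


# Hypothetical RMSE values from three repeated simulations.
per_run_rmse = [0.1, 0.2, 0.3]
mean_rmse, ci_low, ci_high = mean_with_ci(per_run_rmse)
```

With few repetitions, a t-based interval would be wider than this normal approximation.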

Mean RMSE values to the ground truth error term values across different sample sizes. The accuracies of ANM and LiNGAM do not improve with increasing sample sizes.

RCSP achieved the lowest RMSE in cyclic graphs as well. However, error terms can influence ancestors in the cyclic case, so the interpretation of the RCS remains unclear when cycles exist.

The performance of RCSP degrades gracefully as the percent of samples from the alternate DAG increases.

Results with a sink or non-sink target Y. RCSP estimated the RCS scores less accurately with a non-sink target, indicating that the algorithm is sensitive to violations of the sink target assumption.

Mean RMSE (blue, left) and percent sign incongruence (green, right) of the expected root causal effects and signed RCS values, respectively. The RMSE continues to decrease with increasing sample size but reaches a floor of around 0.05. Similarly, the percent sign incongruence decreases but reaches a floor of around 5%.
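
One plausible reading of percent sign incongruence is the share of values whose estimated sign disagrees with the sign of the ground truth. The caption does not give the exact definition, so the sketch below is an assumption. The function names and example values are hypothetical.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def percent_sign_incongruence(estimates, truths):
    """Percent of pairs whose signs disagree (assumed definition)."""
    if len(estimates) != len(truths) or not truths:
        raise ValueError("inputs must be non-empty and of equal length")
    mismatches = sum(sign(e) != sign(t) for e, t in zip(estimates, truths))
    return 100.0 * mismatches / len(truths)


# Hypothetical signed RCS estimates and expected root causal effects.
signed_rcs = [0.4, -0.1, 0.2, -0.3]
expected_effects = [0.5, 0.1, -0.2, -0.3]
incongruence = percent_sign_incongruence(signed_rcs, expected_effects)
```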

Comparison of the algorithms in age-related macular degeneration.

Mean sequencing depth of each gene plotted against its D-RCS score in AMD. Genes with the largest D-RCS scores (red ellipse) had a variety of sequencing depths.

Full pathway enrichment analysis results for all patients in the AMD dataset. We list the Entrez gene IDs of up to the top three leading edge genes in the right-most column.

Additional UMAP embedding results for AMD. (a) The UMAP dimensions did not correlate with AMD severity as assessed by the MGS score. Many genes correlated with the first UMAP dimension in (b), but only three genes correlated with the second UMAP dimension in (c). Blue bars passed an FDR threshold of 5%, and error bars denote 95% confidence intervals.

Drug enrichment analysis results by cluster in Figure 3g. The analyses recovered similar drugs across clusters, but the results for the green cluster in (c) were supra-significant.

Comparison of the algorithms in multiple sclerosis.

Mean sequencing depth of each gene plotted against its D-RCS score in MS. Genes with the largest D-RCS scores (red ellipse) again had a variety of sequencing depths.

Full pathway enrichment analysis results for all patients in the MS dataset. We again list up to the top three leading edge genes in the right-most column.

Pathway enrichment analysis results by cluster consistently revealed EPH-ephrin signaling as well as an additional pathway implicating T cell pathology.

Additional analyses of the UMAP embedding for MS. (a) The UMAP dimensions did not correlate with MS severity as assessed by EDSS. However, lower ranked genes such as TRIP10 correlated with both dimensions in (b). We expanded the analysis to the top 30 genes and plot the genes with the highest correlations to UMAP dimension one and two in (c) and (d), respectively.
