Macroscopic Analyses of RNA-Seq Data to Reveal Chromatin Modifications in Aging and Disease

Achal Mahajan
Francesca Ratti
Ban Wang
Hana El-Samad
James H Kaufman
Vishrawas Gopalakrishnan

Institute of Computation, Altos Labs Inc, San Francisco, United States

https://doi.org/10.7554/eLife.107396.1

Open access

Figures and data

Cell state definitions in the LINE-1 dataset.
The table lists the cell states and their descriptions as used in Della Valle et al. (2022): Wild Type (WT), Hutchinson-Gilford Progeria Syndrome treated with scrambled ASO (HGPS-SCR), untreated HGPS (HGPS-NT), HGPS treated with LINE-1 ASO (HGPS-L1-ASO), Werner Syndrome treated with scrambled ASO (WRN-SCR), untreated Werner Syndrome (WRN-NT), and Werner Syndrome treated with LINE-1 ASO (WRN-L1-ASO).

Intra-chromosomal gene correlation length ℓ^* (measured through autocorrelation as a function of lag) for genes in the (a) HAT (red) and (b) LAT (green) classes located on chromosome 1.
The plots are shown for the WT-1 sample in the L1 dataset. The intra-chromosomal gene correlation length is defined as the lag at which the autocorrelation falls below the 95% confidence limit. This value is normalized as described in "Intra-chromosomal gene correlation length" in Methods. The correlation length is longer for genes in the LAT class than for genes in the HAT class.
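The definition above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the 95% limit is taken to be the usual white-noise approximation 1.96/sqrt(n), and the normalization described in Methods is not reproduced.

```python
import math

def autocorrelation(values, max_lag):
    """Sample autocorrelation of `values` for lags 0..max_lag."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered)
    if variance == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / variance
        for lag in range(max_lag + 1)
    ]

def correlation_length(values, max_lag=None):
    """First lag at which the autocorrelation drops below the 95% limit.

    Assumption: the limit is 1.96 / sqrt(n), the standard approximation
    for uncorrelated data. Returns None if the limit is never crossed.
    """
    n = len(values)
    if max_lag is None:
        max_lag = n - 1
    limit = 1.96 / math.sqrt(n)
    for lag, rho in enumerate(autocorrelation(values, max_lag)):
        if rho < limit:
            return lag
    return None

# Hypothetical expression values ordered by position along a chromosome.
blocky = [1] * 8 + [-1] * 8      # neighbours tend to agree
alternating = [1, -1] * 8        # neighbours always disagree
print(correlation_length(blocky), correlation_length(alternating))
```

A series whose neighbouring values agree over long stretches yields a larger lag than one whose neighbours alternate, which is the sense in which a longer ℓ^* reflects longer-range co-expression along the chromosome.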

Distance correlation coefficient 𝒞 (Δℓ) measured based on abundances for binned pairs of genes plotted against Δℓ, where each bin contains pairs of genes separated by Δℓ.
The red line represents the correlations for the HAT class, while the green line indicates the correlations for the LAT class. The plot depicts data from the WT-1 sample in the L1 dataset, focusing on genes located on chromosome 1. The inset illustrates the difference in the correlations over a reduced x-axis range.

Similarity measured using Manhattan distance for different samples with respect to WT for (a) HGPS and (b) WRN in the L1 dataset, and (c) the Fleischer dataset.
The blue and red lines represent the similarity based on gene expression and on ℓ^*, respectively. The distance measured using the intra-chromosomal gene correlation length ℓ^* (red) follows a trend similar to the distance measured from gene expression α_t (blue).

Box plots showing the variation of intra-chromosomal gene correlation length (ℓ^*) across different chromosomes in the Fleischer dataset.
(a) Samples are grouped into age groups ordered from young to aged. The mean of each box plot is labeled. (b) ℓ^* for each sample, ordered by donor age. The center bar denotes the mean across all chromosomes, and the error bar shows the variation.

Differential gene expression analysis results for (a) HGPS-NT vs. WT and (b) WRN-NT vs. WT in the L1 dataset.
The number of differentially expressed genes (DEGs) among the top 200 hits, the total number of genes, and the fraction of DEGs per chromosome are listed. Entries are sorted by the fraction of DEGs, and the top 5 results are shown. Chromosome 6 has the largest fraction of DEGs in both cases.
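The per-chromosome tabulation described in this caption can be sketched as follows. This assumes the fraction is the number of top-hit DEGs on a chromosome divided by the total number of genes on that chromosome; the function name, gene identifiers, and chromosome assignments are hypothetical.

```python
from collections import Counter

def deg_fraction_by_chromosome(top_hits, gene_to_chrom):
    """Return (chromosome, n_degs, n_genes, fraction) rows, largest fraction first.

    top_hits: gene identifiers among the top differential-expression hits.
    gene_to_chrom: mapping from every gene identifier to its chromosome.
    """
    genes_per_chrom = Counter(gene_to_chrom.values())
    degs_per_chrom = Counter(gene_to_chrom[gene] for gene in top_hits)
    rows = [
        (chrom, degs_per_chrom[chrom], total, degs_per_chrom[chrom] / total)
        for chrom, total in genes_per_chrom.items()
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Hypothetical toy input: six genes on two chromosomes, three of them top hits.
gene_to_chrom = {
    "g1": "chr1", "g2": "chr1", "g3": "chr1", "g4": "chr1",
    "g5": "chr6", "g6": "chr6",
}
for row in deg_fraction_by_chromosome(["g1", "g5", "g6"], gene_to_chrom):
    print(row)
```

Sorting by fraction rather than by raw count keeps gene-rich chromosomes from dominating the table simply because they carry more genes.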

Intra-chromosomal gene correlation length (ℓ^*) measured in the LINE-1 dataset for chromosome 6.
Error bars are computed from within-group replicates. The figure illustrates that diseased samples exhibit higher values of ℓ^* than the ASO-treated and WT samples on chromosome 6.

Hierarchical clustering of the L1 dataset and the Fleischer dataset.
(a) Heat map with dendrogram using the proposed intra-chromosomal gene correlation length (ℓ^*), showing a clear clustering of the samples by treatment and disease at the second level (highlighted by pink shading). (b) Hierarchical clustering of the Fleischer dataset. Samples were binned into age groups, and ℓ^* clusters the age groups in the expected order at the second level (highlighted by pink shading).

Gene expression (α_t) as a function of rank (r_t) for chromosome 1 in (a) the LINE-1 dataset and (b) the Fleischer dataset.
The main plot shows that the normalized expression is conserved on a log-log scale while rank changes as a function of disease or treatment. The inset shows two examples of how rank changes across experiments, for the genes IRF6 (blue circles) and AK4 (red circles). Pearson’s correlation coefficients and p-values are annotated on the figure. The samples along the x-axis of the inset are ordered by the rank of IRF6, from low to high.

Chromatin energy landscape for chromosome 1 based on rank transitions and gene expression, applying the Arrhenius equation to the L1 data.
The magnitude (intensity of each point) is based on the change in abundance for each rank change and replicate pair, given the two sample types in each sub-figure. Ordering the experimental classes by class(x) ⇒ class(y), the sub-figures correspond to: (a) WT ⇒ WT, (b) HGPS-ASO ⇒ WT, (c) HGPS-NT ⇒ WT, and (d) HGPS-NT ⇒ HGPS-ASO. The upper triangle shows the energy changes associated with genes moving up in rank; the lower triangle shows the energy changes for movement down in rank. With the exception of panel (a), the landscapes are not symmetric.
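For orientation, the Arrhenius equation in its standard textbook form relates a transition rate k to an activation energy E_a. How the paper maps rank transitions and abundance changes onto k and the prefactor A is not stated in this caption, so the inversion below is only the generic relation, not the paper's Equation 1.

```latex
% Standard Arrhenius relation: rate k, prefactor A, activation energy E_a,
% Boltzmann constant k_B, temperature T.
k = A \exp\!\left(-\frac{E_a}{k_B T}\right)
\quad\Longleftrightarrow\quad
E_a = -k_B T \,\ln\!\left(\frac{k}{A}\right)
```

Under this relation a rarer transition (smaller k relative to A) corresponds to a larger energy barrier, which is consistent with reading asymmetric upper and lower triangles as direction-dependent barriers.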

Chromatin transition energy barrier summarized across all chromosomes for transitions between samples up and down the ranks, computed using the integral expression (Equation 1).
The blue triangle markers show that the energy required to move up the ranks is relatively higher for diseased-to-WT/ASO transitions, whereas WT-to-ASO transitions are equally probable, in (a) L1 (HGPS) and (b) L1 (WRN). In (c), the Fleischer dataset, moving from the young to the old state requires the most energy, with higher variance than transitions from other age groups.

An unbiased spectral clustering procedure is used to segregate all genes, based on their expression values (α_t), into one of two groups: HAT and LAT, represented by red and green markers respectively. The plots depict clustering results for (a) WT with 3 replicates in the L1 dataset and (b) samples from the 0 − 20 age group, both for chromosome 1. The x-axis represents the rank of genes, determined by sorting their expression values in descending order.

Comparison of gene expression activity for chromosome 6 between healthy (WT) samples (a) and diseased (HGPS) samples (b) from the LINE-1 dataset, and between a 51-year-old sample (c) and a 94-year-old sample (d) from the Fleischer dataset. The x-axis in each subplot represents the genomic coordinates of the chromosome. The healthier and “younger” samples show sharply defined distributions, whereas the diseased and “older” samples show flatter distributions spanning multiple coordinates. The distributions are obtained by fitting a Gaussian mixture model and using a Dirichlet process to estimate the number of components. The flatter distributions spanning multiple coordinates indicate increased activation of “nearby” genes relative to the intended target genes during RNA polymerase transcription. The sharper distributions observed in the healthier/younger population point to a relatively intact regulatory mechanism that can activate the intended target genes more precisely.

Comparison of hierarchical clustering using intra-chromosomal gene correlation length (ℓ^*) in different scenarios for the L1 dataset: (a) considering only HAT genes, (b) considering only LAT genes, and (c) considering all genes. Clustering using HAT genes, shown in (a), yields the best results.

Hierarchical clustering results using SVD with (a) size = 10 and (b) size = 20 in the FL dataset.
Clustering using SVD yields poor results. The x-tick labels are color-coded: cyan for the 0 − 20 age group, green for 21 − 40, red for 41 − 60, magenta for 61 − 80, and blue for ≥ 81. The grey dendrogram lines mark instances where samples from different age groups are merged.

Box plots showing variation of intra-chromosomal gene correlation length (ℓ^*) across different chromosomes.
No significant differences were observed among untreated or scrambled-treated HGPS samples, those treated with LINE-1 ASO, and WT samples, suggesting an alternative epigenetic modification mechanism in pathological aging. The line in each box plot represents the mean rather than the median.

Differential gene expression analysis across three experimental conditions, compared as (a) HGPS-NT vs. WT and (b) WRN-NT vs. WT, using PyDESeq2 to identify differentially expressed genes.
The dark markers show the top 200 hits based on the log₂ fold change, of which 38 (HGPS-NT vs. WT) and 29 (WRN-NT vs. WT) are found on chromosome 6 and annotated in the figure.

Chromatin energy landscape for chromosome 6 in the case of HGPS for the L1 dataset, similar to Fig. 8 in the main text.
The energy landscapes for all other chromosomes look similar.

Chromatin energy landscape for chromosome 1 in the case of WRN for the L1 dataset, similar to Fig. 8 in the main text.
The energy landscapes for all other chromosomes look similar.

Comparison of (a) Manhattan (city block) distance and (b) Euclidean distance when used in hierarchical clustering. The range of values produced by the Manhattan distance is larger than that of the Euclidean distance, which leads to more effective discrimination between nearby points.
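The difference in range between the two metrics can be seen in a small worked example. The per-chromosome ℓ^* values below are hypothetical and the helper names are illustrative only; the one general fact relied on is that the Manhattan distance between two vectors is never smaller than their Euclidean distance.

```python
import math

def manhattan(u, v):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    """Straight-line distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical per-chromosome correlation lengths for a reference and a sample.
wt = [0.10, 0.20, 0.15]
sample = [0.13, 0.24, 0.15]

print(f"Manhattan: {manhattan(wt, sample):.2f}")  # 0.03 + 0.04 + 0.00
print(f"Euclidean: {euclidean(wt, sample):.2f}")  # sqrt(0.03**2 + 0.04**2)
```

Because the Manhattan distance adds coordinate differences without squaring them, many small per-chromosome differences accumulate linearly, so it spreads nearby samples over a wider range of values than the Euclidean distance does.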

Chromatin transition energy barrier summarized across all chromosomes for transitions between samples up and down the ranks, for different sizes of Gaussian filter, in (a) L1 (HGPS), (b) L1 (WRN), and (c) the Fleischer dataset.
The trend is similar to Fig. 9 (filter size = 3) in the main text, suggesting that the results are robust to the choice of filter size.

Chromosomal variation of ℓ^* for the Fleischer dataset.
Error bars represent the standard deviation across different samples within each age group.
