Extrachromosomal DNA (ecDNA) are large, functional, circular double-stranded DNA molecules that are enriched for oncogenes, highly amplified, and frequently observed in a wide variety of cancer types1, 2. ecDNAs lack centromeres and are asymmetrically segregated into daughter cells during cell division, driving intratumoral genetic heterogeneity, accelerated evolution, and rapid treatment resistance3, 4. Further, recent studies demonstrate strong positive selection for ecDNA during tumor progression5. ecDNAs also exhibit highly accessible chromatin and altered cis- and trans-regulation, including cooperative intramolecular interactions6, promoting elevated expression of oncogenic transcriptional programs79, further contributing to poor outcome for patients2.

The recent development of computational tools that enable detection of ecDNA from whole genome sequencing data, has facilitated analyses of well-curated, publicly available datasets, including The Cancer Genome Atlas (TCGA), thereby providing an important opportunity to identify transcriptional repertoires that are preferentially detected in bona fide, clinical ecDNA-containing tumors. To shed new light on the gene expression patterns that may enhance ecDNA development and progression, we examined global transcriptional analysis of ecDNA-containing tumors.


A recent analysis utilized the tools AmpliconArchitect and AmpliconClassifier on 1,921 tumors from The Cancer Genome Atlas (TCGA) to suggest that ecDNA prevalence ranges from 0% to 59.6% across multiple tumor tissue subtypes2. Using AmpliconClassifier (AC), the analysis classified tumor samples into five subtypes: ecDNA(+), Breakage Fusion Bridge (BFB), complex non-cyclic, linear, and no-amplification. However, due to limitations imposed by short-read sequencing, AC may classify some ecDNA(+) structures as complex non-cyclic when breakpoints are missed. Secondly, BFB cycles can give rise to ecDNA formation, making discernment of the two modes of amplification difficult. To limit false-negative ecDNA classifications in the ecDNA(-) set, we treated samples with only a linear or no-amplification status as ecDNA(-), removing complex non-cyclic and BFB(+) samples from the analysis. In order to understand the transcriptional programs active in maintaining ecDNA, we selected 870 samples from 14 tumor types with at least three ecDNA(+) samples each, and compared the gene expression data of the resulting 234 ecDNA(+) and 636 ecDNA(-) samples (Table S1).

Machine learning identifies candidate genes for ecDNA maintenance

In lieu of identifying genes that are highly differentially expressed between ecDNA(+) and ecDNA(-) samples but driven by a small subset of cases (e.g. gene A in Figure S1a), we sought to identify genes (e.g. gene B) whose expression level was predictive of ecDNA presence. We assumed that genes that were persistently over-expressed or under-expressed in ecDNA(+) samples relative to ecDNA(-) samples were more likely to be involved in ecDNA biogenesis or maintenance, or in mediating the cellular response to the presence of ecDNA. For exposition purposes, we refer to these functions as “maintenance” activities below.

To identify a minimal set of genes whose expression values were consistently predictive of ecDNA presence, we used Boruta,10 an automated feature selection algorithm (Fig. 1a and Methods). Given the unequal representation of ecDNA(+) and ecDNA(-) samples within each of the 14 tumor types, we performed Boruta on 200 datasets, each consisting of a random selection of 80% of the 870 samples (Fig. 1a), and chose the criterion of a gene being labeled as a Boruta gene in at least 10 of the 200 runs to be selected for downstream analysis. The Boruta analysis identified a set of 408 genes with persistent differential expression, hereafter denoted as the Core gene set.

Genes predictive of ecDNA status

(a) The feature selection algorithm, Boruta, was applied to 200 datasets of randomly selected subsets consisting of 80% of all samples. Genes selected by Boruta in at least 10 of the 200 runs were identified as the Core set of genes (408) involved in ecDNA maintenance. (b) Identification of highly co-expressed and stable gene clusters using pvclust expanded the Core set by an additional 235 genes to the final list of 643 CorEx genes. (c) Out of 354 clusters, the majority (344) of clusters contained 1 or 2 Core genes. (d) Most clusters were small, with only 7 clusters containing more than 10 genes.

Extending the Core set with co-expressed genes

We note that the Core gene set is not a comprehensive list of discriminatory genes, using a toy example. Consider gene “B”, a member of the core gene set, and another gene, “C”, whose expression values across all samples are nearly identical to the expression values of core gene B. The Boruta analysis would not need to assign gene C to the core set in addition to gene B, because adding both genes incurs the same predictive power as adding one. However, either, or both genes may play an important role in ecDNA maintenance. To correct this, we ran pvclust11 to cluster all gene expression values, and to identify stable clusters using multiscale bootstrap resampling (Fig. 1b; Methods). We used an approximately unbiased (AU) confidence value of 0.95 to select the most highly co-expressed gene clusters. An AU confidence value of 0.95 represents the rejection of the null hypothesis that a group of genes fail to form a stable cluster at a significance level of 0.05. Recomputing the number of Boruta runs that members of a cluster were selected in, we selected clusters that appeared in at least 10 of the 200 Boruta runs (Methods). This resulted in the selection of 354 recurring clusters (Table S2).

Notably, among the 354 clusters, only 2 clusters (with 14 total genes) did not contain any Core genes. As most genes do not have completely identical expression patterns, we would expect one gene to be consistently picked as a Boruta gene over another co-expressed gene. Consistent with this hypothesis, most (344/354) clusters contained only 1 or 2 Core genes (Fig. 1c). When selecting clusters that contained at least 1 Core and 1 co-expressed gene, 53 of 71 clusters contained 1 to 3 Core genes (Fig. S1b), confirming that a few genes per co-expressed cluster provide sufficient predictive value, but other co-expressed genes might still play an important role in ecDNA maintenance. This is true for clusters of various sizes, including the 2-member cluster #74 and the 21-member cluster #3. In cluster #74, CSTF1 had similar expression values to the Core gene RAE1, which is a mitotic checkpoint regulator implicated in tumor progression12 (Fig. S1c; Table S3). While not necessarily increasing the predictive value, CSTF1 is also a proto-oncogene involved in aberrant alternative splicing events13. In cluster #3, 12 genes were highly co-expressed with 9 Core genes (Fig. S1d; Table S3), and were enriched in cell-cycle related biological processes (Methods). Importantly, the total number of genes per cluster was also small (Fig. 1d), with only 7 of 354 clusters carrying more than 10 genes. This suggests that genes involved in ecDNA maintenance have specific roles that cannot be accomplished by multiple other genes.

Summarizing, the 354 clusters contained 643 genes, which included 408 Core genes and 235 additional genes (Fig. 1b). Together, we define these genes as the CorEx (Core+co-expressed) genes (Table 1). The remaining manuscript investigates the functional properties of these genes.

CorEx gene expression correlates strongly with ecDNA occurrence

We validated the relevance of CorEx genes in ecDNA maintenance by running cross-validation experiments (Fig. 2a; Methods) to test the predictive power of CorEx gene expression in determining the ecDNA status of the sample. For exposition, only the maximum recall was plotted for experiments with similar precision (Fig. 2a,b), with all data points shown in Fig. S2a,b. Expectedly, the predictive performance did not change when switching between Core genes and CorEx genes, because each of the non-core gene in the CorEx list had an expression pattern similar to at least one Core gene (Fig. 2b, S2a). While the precision and recall numbers were not high enough for the gene expression to be used directly in predicting the ecDNA status, they were significantly better than a random subset of genes. For precision values at least 0.7, the maximum (median) recall using CorEx genes was 0.65 (0.48) (Fig. 2b, S2b). In contrast, the maximum (median) recall for randomly selected genes at the same precision level was 0.37 (0.24) (p-value 1.8e-35; Mann-Whitney U test).

Validation of CorEx genes

(a,b) Cross-validation experiments validating the predictive value of CorEx genes. Precision denotes the fraction of predicted samples that were truly ecDNA(+). Recall refers to the fraction of ecDNA(+) samples that were predicted correctly. For multiple points with a similar precision, the maximum recall is plotted. (a) The curves for CorEx and Core genes overlap, suggesting similar predictive power. (b) CorEx genes have higher predictive rates compared to 643 randomly selected genes and the top 643 differentially expressed genes based on logarithmic fold changes from a DESeq2 analysis (Top-|LFC| genes). (c) CorEx genes were consistently up- or down-regulated in ecDNA(+) samples across tumor types, with the exception of SARC. (d) Of the 643 Top-|LFC| genes, 240 were up-regulated while 403 were down-regulated in ecDNA(+) samples. Of the CorEx genes, 325 were up-regulated while 318 were down-regulated. The absolute LFC values of the Top-|LFC| gene set was significantly greater than that of the CorEx genes (p-value 1.83e-158). (e) The normalized gene expression values of the CorEx genes were significantly higher than that of the Top-|LFC| gene set (p-value < 2e-308). ***p-value < 0.001.

To test the persistence of CorEx genes across tumor types, we re-computed Cliff’s delta values14, 15 for each of the 11 TCGA tumor types that had at least 10 ecDNA(+) and at least 10 ecDNA(-) samples. The directionality of gene expression patterns was significantly similar to TCGA in each tissue type, with one exception (Fig. 2c; Table S4). The sole exception was the tumor type of Sarcoma (SARC). It is notable that the TCGA-SARC samples included many liposarcomas. In addition to containing ecDNA, liposarcoma samples are known to have extensively rearranged structures indicative of chromothripsis and neo-chromosome formation16. For other tissue types, the p-values against a null hypothesis of no match to the pan-cancer prediction ranged from 4.2e-12 to 3.5e-85 (Fisher’s exact test) for the significant associations (Methods). The results were similar if we tested using only Core genes (Fig. S2c). Summarizing, the 643 CorEx genes are differentially expressed across a multitude of tumor types, and have consistently higher or lower expression in ecDNA(+) samples relative to ecDNA(-) samples. These results are consistent with a pan-cancer role of CorEx genes in ecDNA maintenance.

CorEx genes are very different from the most differentially expressed genes

We asked if the CorEx gene list was significantly different from a list of genes with the highest change in expression between the two classes. We performed a differential expression analysis using DESeq217 and picked the 643 most significantly differentially expressed genes in terms of the absolute value of their shrunken log-fold change estimate (LFC; Methods). Using the sign of the LFC value as the determinant for directionality, 240 of these genes were up-regulated, while 403 were down-regulated. Notably, only 86 of these top-|LFC| genes overlapped with the CorEx gene set (Table S5; Fig. S2d). As discussed above, these highly differentially expressed genes could be driven by a small subset of samples, and therefore, were not as predictive of ecDNA status as the CorEx genes. For precision values of at least 0.7, the maximum recall using top-|LFC| genes was 0.52 (median: 0.39), significantly less than the recall of 0.65 (median: 0.48) using CorEx genes (Mann-Whitney U, p-value 4.8e-21; Fig. 2b, S2b).

The top-|LFC| genes were also different from the CorEx genes by other metrics. Not surprisingly, the log-fold change (LFC) values of the top-|LFC| genes were higher than the LFC values of the CorEx genes (Fig. 2d, MWU p-value 1.83e-158). However, much of the LFC change was due to the very low expression of the top-|LFC| genes in either ecDNA(+), or ecDNA(-) samples. In fact, the CorEx genes had higher expression in both ecDNA(+) and ecDNA(-) samples compared to the Differentially Expressed (DE) genes (Fig. 2e, MWU p-value < 2e-308). While the absolute log fold-change in expression of CorEx genes between ecDNA(+) and ecDNA(-) samples was not that high (median: 0.30, mean: 0.41), it was persistent across all samples (variance: 0.14, standard deviation: 0.37).

For example, the genes ITLN1 and PNMT had the second and eighth-highest absolute LFC values of 3.92 and 2.89 in the top-|LFC| list. However, their normalized expression values in most ecDNA(+) samples were low. ITLN1 had a normalized RSEM expression value ≤ 8 (21st percentile) in 210/234 ecDNA(+) samples. Similarly, the normalized RSEM expression value of PNMT in 223/234 ecDNA(+) samples was less than 8.5 (rank percentile: 41.1%). For PNMT, the differential expression was mediated by 11 ecDNA(+) samples having an expression value ≥ 11, and 5 of the 11 samples contained PNMT on an ecDNA amplicon (Fig. S2e). Similarly, 3 samples with high RSEM contained ITLN1 on an amplicon (Fig. S2f), partly accounting for the high |LFC| value. In contrast, the CorEx gene, RAE1, had a high normalized expression value in both ecDNA(+) and ecDNA(-) samples (average 9.72, rank percentile 74.3%), with a small but persistent LFC value of 0.33.

The results confirm our intuition that differential expression can arise due to multiple reasons, including low expression of the gene in a majority of samples, or the copy number amplification of a gene in a few samples. In contrast, the CorEx genes were selected based on persistent over- or under-expression in ecDNA(+) samples.

CorEx genes primarily up-regulate three biological processes: Cell Cycle, Cell division, and DNA Damage Response

To identify enriched biological processes specific to either up-regulated or down-regulated genes in ecDNA(+) samples, we combined two metrics of effect size, Cliff’s delta14, 15, and log fold change17 to determine the directionality of CorEx genes (Table S6; Methods). The two effect size metrics were mostly in agreement in terms of directionality. Of the 7,288 genes that passed the negligible effect size thresholds in both metrics, only 14 were not in concordance. This more stringent approach, in comparison to a simple directionality based on the sign of a single effect size value, was applied given that gene set enrichment is dependent on not only the number of up- or down-regulated genes but also the degree of overlap with genes under a specific biological process term (Methods). Using this approach, among the 643 CorEx genes, 262 genes were found to be up-regulated in ecDNA(+), while 271 were found to be down-regulated (Table 1; Methods). 110 genes did not make the effect size cut-off. The numbers were similar for the 408 Core genes, with 190 up-regulated, 196 down-regulated, and 22 genes not making the cut-off.

We performed a gene set enrichment analysis to identify the Gene Ontology (GO) biological processes that are enriched in CorEx genes (Methods). Briefly, we applied a one-sided Fisher’s exact test using gene sets from MSigDB1820, using a false discovery rate of 5% (Benjamini-Hochberg procedure). The UP-regulated genes were enriched for 187 Biological processes (Table S7). Note that the GO-biological process (BP) terms are not independent, because of their hierarchical organization, and sharing of genes across different GO terms. Therefore, we used an approach similar to that used in DAVID21 to cluster the biological processes enriched by the UP-regulated genes into 11 broad categories (Table S8; Fig S3; Methods). The 11 categories were assigned a name using manual inspection of the constituent GO terms, or called “Other.” The 11 categories (including “Other”) are shown in a waterfall plot to explain the contribution of each gene to a category (Fig. 3a).

Up-regulated CorEx genes

(a) GO biological processes enriched in up-regulated genes were clustered into 11 broad categories. (b) Genes up- or down-regulated in processes involved in major double-strand break (DSB) damage repair pathways. Many critical genes in the c-NHEJ pathway were down-regulated in ecDNA(+) samples relative to ecDNA(-) samples.

The 10 categories included expected participation of biological processes involved in (a) cell-cycle regulation (Mitotic/Meiotic Cell Cycle, G1/S, G2/M) (b) cell-division (Spindle Organization, Cell Division, Chromosome Condensation, Chromosome Segregation), (c) DNA Damage response (DNA Repair), and (d) the HOX Gene cluster. Indeed, one of the largest clusters, cluster #3, containing 9 Core genes and 12 co-expressed genes, was enriched in GO-BP terms related to the cell cycle (Table S9). Notably, the enriched categories also included a role for the HOX genes with 17 members of the HOX family up-regulated in ecDNA(+) cancers (Table S8). Many recent reports have associated HOX genes with cancer, including an association with their phenotypic “hallmarks”22. Genes involved in angiogenesis (HOXA2, HOXC5), genome instability (HOXC5, HOXC11), deregulating cellular energetics (HOXA4, HOXC5), and metastasis (HOXA2) were all up-regulated in ecDNA(+) cancers.

While Meiotic cell-cycle was also enriched, only 7 genes were allocated specifically to the group of enriched terms: SEPP1, SNRPA1, TMEM203, RNF114, TAF4, TNFAIP6, and PTX3 (Table S10). Though these genes have roles in the meiotic cell cycle, they have also been implicated in cancer and other inflammatory diseases. The spliceosomal protein SNRPA1 is a pro-metastatic splicing enhancer23. TNFAIP6, together with PTX3, activates the Wnt/β-catenin pathway to promote gastric carcinoma cell invasion24. TMEM203 is a STING-centered signaling regulator implicated in inflammatory diseases25. RNF114 is a zinc-binding protein whose over-expression is an indicator of epithelial inflammation and implicated in various tumors26. TAF4, a transcription initiation factor, when overexpressed, is implicated in ovarian cancer by playing a role in dedifferentiation that promotes metastasis and chemoresistance27. Finally, in looking at the genes in the “Other” category, SHCBP1 was the only gene unique to it. A member of the neural precursor cell proliferation process, SHCBP1 is reported to promote tumor cell signaling and proliferation28.

Taken together, the 11 biological process categories explain 165 of the 262 up-regulated CorEx genes, and suggest that Mitotic Cell-Division, Cell cycle regulation, and DNA Damage response are the three broad categories of biological processes up-regulated in ecDNA maintenance, along with an up-regulation of genes in the HOX cluster.

CorEx genes upregulate specific Double-strand break repair pathways

The 16 enriched DNA damage response GO terms contained terms “double-strand break repair” and “recombination”, but none contained the terms “single-strand”, “nucleotide-excision”, or “mismatch-repair” (Table S8), suggesting that the CorEx genes are largely composed of genes involved in multiple double-strand break (DSB) repair pathways, which include classical non-homologous end-joining (c-NHEJ), Alternative end-joining (Alt-EJ), single-strand annealing (SSA), or homology directed repair (HDR)29.

The choice of these varied DSB repair mechanisms for ecDNA maintenance is not well understood. We compiled and hand-curated a list of 129 genes involved in DSB repair and marked them for their role in one or more of these 4 pathways (Table S11). Of these genes, a high number (67) were up-regulated in ecDNA(+), while a smaller number (15) were down-regulated, relative to ecDNA(-) samples. This breakdown of 129 DDR genes contrasts with an analysis using all genes where a near identical number of genes (5,256, and 5,251) were up- and down-regulated in ecDNA(+) samples, confirming that DDR genes are significantly up-regulated relative to all differentially expressed genes (p-value < 0.0001; Fisher’s exact test). When broken down to the roles of genes in individual DSB repair pathways, we found that Alt-EJ with 11 up-regulated and 1 down-regulated genes (p-value 0.0063), SSA (11 up, 1 down (p-value 0.0063)), and HR (46 up, 8 down; p-value <0.00001) were all up-regulated. However, classical NHEJ (14 up, 7 down; p-value: 0.19) was not significantly up-regulated in ecDNA(+) samples relative to ecDNA(-) samples (Methods, Table S12, Table S13).

The expression of key genes in these pathways raises the possibility of an increased role of non-classical-NHEJ processes in ecDNA development or progression, relative to c-NHEJ (Fig. 3b; Table S11). A number of genes involved in c-NHEJ were downregulated in ecDNA-containing tumors relative to non-ecDNA tumors. These included XLF/NHEJ1 (MWU p-value 2.05e-03), which is a key member of the ligase complex required for c-NHEJ; LIG4, another member of the ligase complex (MWU p-value 0.03), PNKP, which generates 5΄-phosphate/3΄-hydroxyl DNA termini required for ligation (MWU p-value 3.90e-06); and also, DNA polymerases λ (POLL; MWU p-value 1.09e-21) and μ (POLM; MWU p-value 0.01), which promotes the ligation of terminally compatible overhangs requiring fill-in synthesis and promotes the ligation of incompatible 3ʹ overhangs30 in a template independent manner, respectively. This does not imply a defect in these repair processes, but rather, potentially additional or preferential utilization of alternative DSB repair pathways in ecDNA-containing tumors.

TP53BP1 is key to blocking resection and promoting the c-NHEJ pathway choice, but is displaced by BRCA1 and the MRN complex to initiate resection in the broken strands31, 32. BRCA1 was significantly up-regulated in ecDNA(+) samples (Fig. 3b), while TP53BP-1 was significantly down-regulated (MWU p-value 0.016), although with negligible effect size (Table S11). Supporting the role of alternative pathway choice for DDR, key genes in the Alt-EJ pathway, including PARP-1, DNA polymerase θ (POLQ), LIG1, LIG3, FEN1 were all significantly up-regulated in ecDNA(+) samples. Homology directed repair is the preferred pathway when a sister chromatid is available to act as a template. HDR is initiated by additional and extensive resection. The genes BLM, EXO1, RPA1, RPA3 which promote additional resection, as well as BRCA1, BRCA2, RAD51, and others that support HDR were all found to be significantly up-regulated. We can conclude that the specific pathway choice for DSB repair in ecDNA(+) samples is dominated by alt-EJ and homology directed repair pathways, while c-NHEJ is not a preferred choice.

CorEx genes primarily down-regulate immune system processes

Using methodology similar to the analysis of the up-regulated genes, the down-regulated genes enriched 73 GO terms (Table S14), and could be clustered into seven broad categories, including “Other” (Fig. 4a; Table S15; Fig. S4). Surprisingly, all categories were immunomodulatory. The most enriched broad category contained 75 CorEx genes relating to the Lymphocyte activation pathway. It included genes enriching “T-cell activation” (28 CorEx genes; p-value 2.58e-05), and “Positive regulation of cell-cell adhesion” (16 CorEx genes; p-value 4.49e-03). Other down-regulated pathways included Cytokine activation, especially for genes in the IL-12 pathway (6 CorEx genes, p-value 7.17e-03), TNF super-family (12 CorEx genes, p-value 7.24e-03), and Inflammation, including, for example, down-regulation of Toll-like receptor 2 signaling (4 CorEx genes, p-value 4.49e-03). Finally, the broad category of Leukocyte chemotaxis was also enriched among the down-regulated genes. The chemotaxis genes include many chemokines and their receptors involved in trafficking of T cells to the site of the tumor. The remaining down-regulated genes included four fucosyltransferases, and the category marked “Other.” FUT2 silencing is associated with reduced adhesion and increased metastatic potential33. Notably, the category marked “Other” was dominated by genes in NF-κB pathway regulation (14 CorEx genes, p-value 2.28e-02).

Down-regulated CorEx genes

(a) GO biological processes enriched in down-regulated genes were clustered into 7 broad categories. (b) Four of these categories map to steps in the cancer-immunity cycle. CorEx genes in three of the four categories were significantly down-regulated compared to all genes (Fisher’s exact test).

NF-κB signaling represents a prototypical, proinflammatory pathway34 with multiple roles, including apoptosis. Specifically, 8 of the 14 down-regulated genes involved caspase activation (Table S16), representing the pro-apoptotic arm of NF-κB signaling. A parallel pathway for sensing endogenous ligands secreted in cell death and cancer is mediated by Toll-like receptor (TLR) proteins35. Remarkably, all ten TLRs were significantly down-regulated in ecDNA(+) tumors. They included TLRs expressed on the cell membrane that bind lipids and proteins as well as TLRs expressed on endosomal membranes that bind DNA. The CorEx down-regulated genes also included many involved in TLR signaling, such as TLR3, CYBA, LYN, and TIRAP.

Four of the seven broad categories mapped to facets of the cancer immune cycle36 (Fig. 4b). We tested if CorEx genes in these categories were more likely to be down-regulated rather than up-regulated, when compared to the non-CorEx differentially expressed genes. The Inflammation category, which mapped to the “Cancer antigen presentation” facet, showed 18 up-regulated and 76 down-regulated CorEx genes (p-value 0.005, Fisher exact test, Table S15). Similarly, the CorEx genes related to the “Trafficking of T cells” facet (“Leukocyte migration and chemotaxis” category, 13 up, 49 down-regulated; p-value 0.03) and “Infiltration and recognition of tumor cells by cytotoxic T cells” facet (“Lymphocyte activation” category, 23 up, 75 down-regulated; p-value 0.0075) were also significantly down-regulated. However, down-regulation in the “Priming and activation” facet (“Cytokine production” category, 6 up, 28 down-regulated) was not significant at the 5% level.

As the RNA data were bulk-sequenced, transcripts were sampled from tumor cells and cells from the tumor microenvironment. Thorsson et al.37 mined immune cell expression signatures to identify six immune subtypes: wound healing (C1), IFN-γ dominant (C2), inflammatory (C3), lymphocyte depleted (C4), immunologically quiet (C5), and TGF-β dominant (C6). A recent study analyzing the tumor microenvironment (TME) of ecDNA(+) vs. ecDNA(-) samples in seven tumor subtypes revealed an association of ecDNA presence with immune evasion38. Our results (Fig. S5), which used an updated version of the classification method for these ecDNA(+) samples, were broadly consistent with those from the Wu et al38. study. Our results suggested an increase in C1 and C2 subtypes and a depletion of C3 and C6 between ecDNA(+) and ecDNA(-) categories (p-value 3.96e-03, Chi-squared test). Notably, the C3 (inflammatory) subtype is associated with lower levels of somatic copy number alterations, and C6 with high lymphocyte infiltration, while C1 is associated with elevated levels of angiogenic genes. These are consistent with our findings of increased somatic copy numbers, increased expression of angiogenic genes on ecDNA(+) samples, and reduced lymphocyte infiltration.

ecDNA(+) samples carry a higher mutational burden relative to ecDNA(-) samples

In order to understand if the change in transcriptional program was driven by mutations to the genes, we checked if ecDNA(+) samples have differential levels of mutation relative to ecDNA(-). Intriguingly, we found that the total mutation burden was significantly higher in ecDNA(+) samples relative to ecDNA(-) samples (Fig. 5a). The result was significant also when mutations were limited to deleterious substitutions as measured by SIFT or PolyPhen2, and high-impact insertions and deletions (Fig. S6a). However, when controlling for cancer type, only glioblastoma (GBM; lower mutations in ecDNA(+)), low-grade gliomas (LGG; higher mutations in ecDNA(+)), and uterine corpus endometrial carcinoma (UCEC; lower mutations in ecDNA(+)) continued to show differential total mutational burden (Fig. S6b). Next, we tested if specific genes were differentially mutated between the two classes (Fig. 5b). For deleterious/high-impact mutations, TP53 was the only gene whose mutational patterns were significantly higher in ecDNA(+) compared to ecDNA(-) (OR 2.67, Bonferroni adjusted p-value 4.22e-07). BRAF mutations, however, were more common in ecDNA(-) samples and were significant to an adjusted p-value < 0.1 (OR 0.27). The excess of TP53 mutations in ecDNA(+) samples provides additional support to the hypothesis that mutations in DNA damage response or cell cycle checkpoints are important for ecDNA(+) formation and maintenance. Other genes that are differentially mutated with nominal significance (unadjusted p-value < 0.005) are shown in Table S17.

Mutational characteristics of ecDNA-containing tumors

(a) Total mutation burden of ecDNA(+) and ecDNA(-) samples. ecDNA(+) samples have significantly higher mutation burden than the ecDNA(-) samples (p-value < 0.0001, Mann Whitney test). (b) Odds ratios of differentially mutated genes in ecDNA(+) and ecDNA(-) (p-value < 0.005). The size of the dot indicates whether the corresponding gene belongs to the Cancer Gene Census (CGC) or not (Non-CGC). Only TP53 and BRAF showed significance at the level of FDR < 0.1 (Benjamini-Hochberg).

We also tested if a collection of gene mutations could predict ecDNA status using XGBoost39, which uses an adaptive boosting of “weak classifiers” to predict class. Here, each mutated gene was treated as a weak classifier of ecDNA status. However, the two classes could not be separated with high accuracy (Fig. S6c). An unsupervised principal component analysis did not separate the two classes either. Only the first principal component explained a significant proportion (14%) of the total variance (Fig. S6d) and did not separate the bulk of the samples. Finally, we recapitulated earlier findings that ecDNA(+) samples enrich for APOBEC activity through the presence of the mutation signatures SBS2 and SBS1340, 41 (Fig. S7). The enrichment in TP53 mutations was also consistent with previous findings5. On balance, however, collections of gene mutations did not distinguish ecDNA(+) samples from ecDNA(-) samples, at least at a pan-cancer level, in contrast to the gene expression data.

Persistently occurring genes in ecDNA(+) samples represent potential vulnerabilities

To rank CorEx genes by importance, we computed harmonic mean rank values based on three categories: a) the average GINI importance statistic from the trained random forest models; b) the number of Boruta runs that a gene was selected in; and c) the number of Boruta runs (out of 200) that a gene was selected in when counting by cluster (Methods). 65 genes that were up-regulated (47 genes) or down-regulated (18 genes) had a harmonic rank lower than 3 (Table 1). The next highest ranked gene had harmonic rank exceeding 17. These 65 genes represent the most persistent differentially expressed CorEx genes. Expectedly, the high-ranked up-regulated genes impacted cell division (16 genes), cell cycle regulation (10 genes), DNA damage response (16 genes), with only 12 of the 47 genes not specifically enriching any known GO biological process. Many of these genes were from small CorEx clusters with less than 3 members, but we also found 6 genes from the HOX gene cluster (cluster #17), and another cluster of 21 genes (cluster #3). Members of cluster #3 appeared in all 200 Boruta runs; however, there were three genes all involved in cell-division (TPX2, KIF2C, and AURKA), each of which appeared in at least 180 Boruta runs. High expression among these three genes is associated with poor prognosis42, and due to the highly persistent nature of their differential expression across ecDNA(+) samples, they represent a possible widespread vulnerability for ecDNA(+) samples.

Intriguingly, 14 of the 18 down-regulated genes with low harmonic rank came from a single cluster (#2; Table S2), and 13 of the 18 genes did not specifically enrich any specific BP ontology. Six of the down-regulated genes appeared in 180 or more Boruta runs (CHMP7, XPO7, INTS9, TACR1, KIAA1967, and PCM1). Some of these genes (CHMP7, XPO7, KIAA1967) are reported to be tumor suppressor genes4345. However, the exact functional role of down-regulating these genes in ecDNA(+) samples remains to be elucidated.


ecDNA is increasingly recognized as a major cause of oncogene amplification, intratumoral genetic heterogeneity, accelerated evolution, and treatment resistance, but many of the underlying processes involved in its formation, function, and progression are not fully understood. The ability to conduct multi-omic studies of well-curated, bona fide clinical tumor samples, such as the TCGA, presents an opportunity to learn about differentially regulated gene expression programs that may be involved in ecDNA biogenesis or maintenance, and in worse outcomes for patients79. Using a relatively intuitive set of principles, we have developed a machine learning approach that identifies differentially expressed, co-regulated genes in ecDNA-containing tumors, highlighting four main biological processes: non-c-NHEJ DSB repair, cell cycle, proliferation control, and immune regulation.

The GO analysis revealed three core biological processes that were up-regulated and only the immune system processes as being down-regulated. These observations strengthen the case for targeting proteins involved in mitotic cell-division46, cell-cycle regulation, and DNA damage response in ecDNA(+) cancers, but also reveal roles for the HOX cluster of genes. Also, in this paper, we did not extensively study the role of ncRNA in ecDNA maintenance. We do note that HOTAIR, encoded in the HOXC locus, is independently associated with metastasis and poor outcomes47. Further experiments are needed to provide a mechanistic basis for the role of HOX cluster genes in ecDNA maintenance, as also for involvement of ncRNA.

The DNA damage genes are broadly up-regulated in ecDNA(+) samples, especially in double-strand break repair. Within this broad category of mechanisms, our analysis suggests that alternative DSB repair pathways such as Alt-EJ are preferred relative to classical NHEJ. This is consistent with previous observations of small microhomologies at breakpoint junctions2, 48, and has important implications in therapeutic selection that will need to be validated in future experimental studies. We note, however, the microhomology analyses typically study breakpoint junctions, and might ignore double-strand breaks in non-junctional sequences which could be observed, for example at replication-transcription junctions.

The down-regulated genes were primarily immunomodulatory in nature, in addition to a few persistently down-regulated tumor suppressor genes. Lowered expression of immunomodulatory genes in ecDNA(+) samples has been previously reported2, 38, but not mechanistically explained. Remarkably, the down-regulated immunomodulatory genes encompassed most aspects of the cancer immune cycle, suggesting impaired recognition of tumor DNA and proteins as foreign in ecDNA(+) tumors. Sensing of foreign DNA, including tumor DNA, is often mediated by the cGAS/STING pathway49, 50. Intriguingly, cGAS was significantly up-regulated in ecDNA(+) samples, while STING was significantly down-regulated, suggesting a role for STING agonists in intervention. Finally, in addition to the down-regulation of genes in the toll-like receptor family, we observed a down-regulation of genes involved in regulating TLR signaling pathways that were part of the CorEx list. Understanding the mechanisms of broad down-regulation of TLRs could provide insight into vulnerabilities of ecDNA(+) tumors.

Mutation data alone does not provide as clear a picture of the genes involved in ecDNA maintenance. We did observe that the total mutation burden (TMB) was higher in ecDNA(+) samples. However, that relationship is much less clear after controlling for cancer type. High TMB has been positively correlated with sensitivity to immunotherapy51, and better patient outcomes; however, the gene expression patterns suggest that immunomodulatory genes are down-regulated in ecDNA(+) samples, and patients with ecDNA(+) tumors have worse outcomes2. Notably, other results have suggested that the correlation between TMB and response to immunotherapy is not uniform, and it can vary across different tumor subtypes52. In general, no collection of gene mutations was predictive of ecDNA status, although mutations in TP53 were more likely in ecDNA(+) samples, and perhaps are an important driver for ecDNA formation5.

These results suggest that cancer cells that contain ecDNA have profound alterations in their global transcriptional patterns. Importantly, these transcriptional differences do not arise solely from genes on the ecDNAs themselves, but rather suggest that fundamental global processes involved in DSB repair, cell cycle control, and immune regulation contribute to ecDNA formation and pathogenesis.


TCGA sample ecDNA status classification

Amplicon Classifier (version 0.4.9, classified amplicons detected in 1,921 TCGA samples into five sub-types: ecDNA, BFB, complex non-cyclic, linear, and no-amplification. When classifying a sample with multiple amplicons, the order of preference is as follows: ecDNA, BFB, complex non-cyclic, linear, and no-amplification. Given the challenges of detecting ecDNA from short read data, and to avoid possible false-negative ecDNA classifications, samples with a BFB or complex non-cyclic status which were not called ecDNA(+), were removed from the analysis. We treated samples with the linear amplification and no-amplification classifications as ecDNA(-). Of the 1,921 samples, 1,535 samples classified as ecDNA(+) and ecDNA(-) had RNA-seq data, including 1,406 primary solid tumor samples, 95 tumor metastasis samples, and 34 primary blood derived cancer – peripheral blood samples. Removing metastases results in a total of 1,440 samples, including 243 ecDNA(+) and 1,197 ecDNA(-) samples. While the set of 1,440 samples represented 24 tumor types, ten of these tumor types had insufficient numbers of ecDNA(+) samples, including four tumor types with no ecDNA(+) samples. To prevent the 561 ecDNA(-) samples representing these tumors from skewing the analysis, we removed 570 samples representing tumor types with less than three ecDNA(+) samples. This resulted in a total of 870 samples representing 14 tumor types, of which 234 were classified as ecDNA(+) and 636 were classified as ecDNA(-).

Gene expression datasets

Gene expression data for 32 studies part of the TCGA Pan-cancer Atlas was downloaded from cBioPortal (01.05.2021) ( The cBioPortal “data_RNA_Seq_v2_expression_median.txt” data is sourced from the file “EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv” (synapse id: syn4976363). Briefly, the matrices contain batch corrected values of the upper-quartile (UQ) normalized RSEM estimated counts data from Broad firehose (tumor.uncv2.mRNAseq_RSEM_all.txt). Missing values due to the batch effect correction process were imputed using K-nearest neighbors (KNN). For each tumor type, values were imputed based on gene vectors under the assumption that genes are similarly expressed between samples of the same tumor type. For genes with less than 60% of samples with missing values, values were imputed using the logarithmic (base 2) of the gene expression value plus one, and subsequently back-transformed when writing the imputed matrices to file. The resulting gene expression matrix used for the Boruta analysis described below consisted of 870 TCGA samples and 16,309 protein-coding genes (based on “hgnc_complete_set.txt” downloaded from HGNC on 7.24.2018). To generate a RSEM raw counts matrix for the DESeq2 analysis described below, mRNAseq_Preprocess.Level_3 data was downloaded from Broad Firehose (tumor.uncv2.mRNAseq_raw_counts.txt).

Boruta analysis

To identify a minimal set of genes whose expression values were predictive of the sample being ecDNA(+), we used an automated feature selection algorithm, Boruta10. Given a gene expression matrix, M, of dimension r x c, where r is the number of samples and c is the number of genes, Boruta generates c shadow feature vectors by random shuffling of gene vectors in matrix M, generating a new matrix M’ of dimension r x 2c. A random forest algorithm is then used to quantify the importance of each gene in separating ecDNA(+) from ecDNA(-) samples. In a given Boruta run, for each iteration in, where i = {1, 2, 3, …, n}, there are two possible outcomes for a gene: 1) if the gene scores higher than the best scoring shadow feature, the gene is considered a “hit”, and 2) if the gene scores lower than the best scoring shadow feature, the gene is considered a “non-hit”. The number of hits, kg, for each gene, g, after i iterations is stored in a list, Li. Similar to a coin toss event of n trials, the probability that a gene is a “hit” or “non-hit” in each iteration is 0.5, and follows the binomial distribution.

After i iterations, the probability that a gene has k or more hits, P(X > x), where x = k − 1, can be computed using the binomial distribution survival function. This probability is computed for each gene, g, with kg number of hits and stored in list Pi. As Boruta involves testing multiple genes per iteration and testing the same genes repeatedly over i iterations, two multiple hypothesis testing corrections are applied to list Pi. First, the Benjamini-Hochberg procedure is applied to control the False Discovery Rate (FDR) at 5%, and second, the Bonferroni correction is applied to control for the family-wise error rate at α = 0. 05. If the probability of a gene satisfies both corrections, the gene is accepted as a Boruta gene. Separately, the probability that a gene, g, has kg or lower hits, P(Xkg), can be computed using the binomial cumulative distribution function. For genes that are not accepted, if the probability satisfies both corrections, the gene is rejected as a Boruta gene.

The Boruta script used for this analysis is a revised version of the BorutaPy python package (6.21.2021; with the addition of early stop parameters (≤50 tentative feature counts stagnant for 5 iterations) and output of hits per iteration. We randomly split the 870 samples into training (80%) and testing (20%) data to enable the evaluation of selected features. Given the unequal representation of ecDNA(+) and ecDNA(-) samples within each of the tumor types, we opted to perform the following procedure 200 times. For each of the 200 runs, Boruta is performed on the training data. In short, the Boruta method takes as input the original expression matrix, M, appends shadow features that are randomized versions of gene features in M, performs random forest classification on the data (class_weight=balanced_subsample, max_depth=7), selects the highest feature importance score, s, of shadow features as a cut off, and returns features with a higher score than s. Of the 941 genes identified in the Boruta analysis, 408 genes were selected in at least 10 of the 200 Boruta runs, and subsequently defined as the Core set of genes in downstream analyses.

Highly co-expressed genes

To identify genes co-expressed with the core set of Boruta genes, hierarchical clustering of the 16,309 genes was performed using the R package pvclust11 (ver. 2.2-0; dist.method=correlation, method=ward.D2, nboot=1000). A total of 843 significant clusters (AU > 0.95) with at least 1 Boruta gene were selected, consisting of 1,375 genes. To obtain the final list of CorEx genes, we apply a minimal count of 10 runs for the gene or 10 runs for the cluster of genes seen in 200 Boruta runs. A cluster is determined to be seen in a Boruta run if at least one of its members is selected in the run. This results in 354 clusters, with a total number of 643 genes, of which 408 are Core genes.

Evaluation of CorEx genes

To evaluate the 408 Core genes as predictive of ecDNA presence in tumor samples, a random forest classifier was trained on the 80% data sets and tested on the 20% data sets defined previously. Hyper-parameters were tuned using RandomizedSearchCV followed by GridSearchCV (k_fold =5, f1 score). Furthermore, we performed a similar evaluation using 408 randomly selected genes instead of the Core genes for each of the 200 training and testing data sets. A similar evaluation was performed using 643 randomly selected genes instead of the CorEx genes. We further evaluated a set of 643 most differentially expressed genes based on the absolute log-fold change estimates from a conventional DE analysis using DESeq217 as described below.

Precision, recall curve

To select data points (i.e., (precision, recall) values for a training/testing data pair described above) representative of the predictive power of a gene list (i.e., CorEx genes, Core genes, Top-|LFC| genes, or randomly selected genes), for each consecutive window of precision, we selected the data point with the highest recall (sensitivity) value. Specifically, using 16 consecutive windows of precision with a step-size of 0.0625 from 0.0 to 1.0, we selected 8 representative data points for Core genes, 8 for CorEx genes, 8 for Top-|LFC| genes, and 11 for randomly selected genes. The representative points were then used to interpolate 500 new points using the piecewise cubic hermite interpolating polynomial (pchip) function in the Scipy python package. For the figures generated, we only showed data points with a precision value greater or equal to 0.55.

Default DE analysis

We performed a default DESeq217 (R package, ver. 1.36.0) analysis to obtain shrunken maximum a posteriori (MAP) log-fold change estimates for effect size (i.e., LFC). Specifically, to obtain i) LFC effect size values per gene for integration with its Cliff’s delta effect size value when determining if a gene is up- or down-regulated in ecDNA(+) samples, and ii) a list of n top-ranked genes by absolute value of the LFC (with application of an adjusted p-value <0.05 cutoff and LFC threshold of log2(1. 1) = 0. 13) for use in comparison against genes selected as important in the prediction of ecDNA in samples. For comparisons against Core genes, n is set to 408, and for comparisons against CorEx genes, n, is set to 643.

To obtain the LFC effect size metric between ecDNA(+) vs. ecDNA(-) samples for each gene’s expression, we fed as input to DESeq2 a matrix of raw RSEM estimated counts. To take into account batch effects, we included the center and platform information of samples, downloaded from synapse id syn4976363 (EB++GeneExpAnnotation.tsv), in the design of the DESeq object:

To compute results, the lfcThreshold was set to log2 (1. 1) for an accurate computation of p-values and the contrast set to c(“condition”, “ecDNA(+)”, “ecDNA(-)”) to obtain the logarithmic fold change of the form . By setting the lfcThreshold, the null hypothesis tested is that |LFC|≤θ, where θ = (1. 1), and the alternative hypothesis is that |LFC| > θ. A log2 (1. 1) value is chosen as the minimal value/negligible effect size threshold as it represents a 10% fold-change, and anything below this fold-change would likely not be of biological interest53. The specific commands run are as follows:

To obtain the shrunken MAP log-fold change estimates, we used the lfcShrink function provided in DESeq2, using the default apeglm method for the empirical Bayes shrinkage procedure54:

Up- or Down-regulated genes in ecDNA(+) samples

To categorize genes as “up-” or “down-” regulated in ecDNA(+) samples, we integrated two effect size metrics, Cliff’s delta (d)14, 15 and the DESeq2 shrunken MAPlog-fold change estimate (LFC17). Effect size is a measure of the magnitude of deviation from the null hypothesis, and unlike p-values, has the advantage of not being impacted by sample size55. This property is especially useful when comparing effect size values of a gene between tests where sample sizes differ. Comparing p-values between such tests would be invalid.

Cliff’s delta, d, is a non-parametric measure of the separation between two distributions and ranges from −1 to 1. Given two distributions, X = {i1, i2,…,im} and Y = {j1, j2, …, jn}, comparisons are made between each of m values in X and n values in Y. Cliff’s delta is computed as , where #(i > j) is the number of times a member of X is greater than a member of Y, and #(i < j) is the number of times a member of X is less than a member of Y14. A negative d indicates that values in Y tend to be higher than X, while a positive d indicates that values in X tend to be higher than Y. The magnitude of the effect size of Cliff’s delta can be separated into four levels: |d| < 0. 147 for negligible effects, 0. 147≤|d| < 0. 33 for small effects, 0. 33≤|d| < 0. 474 for medium effects, and |d| ≥ 0. 474 for large effects55. The python package used to compute Cliff’s delta values can be accessed at The input values used to compute Cliff’s delta are log-transformed normalized gene expression values plus one as described in the “Gene expression datasets” section of methods.

The DESeq2 LFC is computed as described above. To allow integration of the Cliff’s delta effect size with the DESeq2 LFC, we also separated the LFC values into four levels: |LFC| < log2 (1. 1) for negligible effects, log2 (1. 1)≤|LFC| < log2 (1. 5) for small effects, log2 (1. 5)≤|LFC| < log2 (2) or medium effects, and |LFC|≥log2 (2) for large effects.

For each gene, g, whether the gene is up-regulated or down-regulated in ecDNA(+) samples is determined by the signage and magnitude of its effect sizes dg and LFCg . The initial criteria for dg and LFCg to be used as a determinant in the direction of a gene is for it to have a magnitude larger than that of a negligible effect. If the signage of dg and LFCg are both positive, gene g is considered up-regulated in ecDNA(+). If only a single value has a magnitude larger than the negligible effect threshold (e.g, dg > 0 and LFCg =− 0. 1), gene g is considered up-regulated in ecDNA(+). In the case of conflicting signages between the two values, the effect size with a larger magnitude takes precedence. For example, if dg =− 0. 2 and LFCg = 0. 847, given that dg has a small negative effect and LFCg has a large positive effect, gene g is considered up-regulated in ecDNA(+) samples.

Tumor heatmap

A Cliff’s delta effect size matrix representing 643 CorEx genes was generated to compare TCGA with tumor expression patterns. For each of the 11 tumor types with at least 10 ecDNA(+) and at least 10 ecDNA(-) samples, we re-computed Cliff’s delta. Using a Fisher’s exact test (fisher_exact function from the scipy.stats python package; alternative hypothesis: two-sided), we tested the null-hypothesis of whether the up- and down-directionality of CorEx genes in TCGA vs. each tumor were independent of each other. The contingency table is as below. The directionality of a gene (up or down) is based solely on the signage of the gene’s Cliff’s delta effect size value.

Gene set enrichment analysis

To identify gene ontology biological processes that are enriched in our CorEx set of genes, we applied a hypergeometric test (one-sided Fisher’s exact test; fisher_exact function from the scipy.stats python package) using gene set to gene mappings from msigdb (c5.go.bp.v7.5.1.entrez) and a universe of N=16k for the number of genes measured in the RNAseq data. The false discovery rate was controlled at 5% and adjusted p-values were computed using the Benjamini-Hochberg procedure (fdr correction from python statsmodels package). A final set of GOBP terms with adjusted p-value<0.05 was used for downstream analysis. The contingency table is as follows, where N is the total number of genes in the universe, n is the number of DE genes (either up- or down-regulated in ecDNA(+)), m is the number of genes in the gene set, and k is the number of DE genes in the gene set.

Clustering gene sets

To cluster enriched gene sets into categories for visualization purposes, Cohen’s kappa coefficient (python sklearn cohen_kappa_score) was used to determine term-term “connectivity” (agreement of term-term pairs) – an approach described in the DAVID paper21. Given a r×c binary matrix, where enriched GOBP terms are rows, CorEx genes are columns, and values are 1 if a CorEx gene is part of the GOBP term or 0 otherwise, kappa scores were computed between each pair of terms, where term tit and i = {1, 2, 3, …. n}: Kappa_Score(tx, ty), where xi and y ∈i.

Each term, ti, formed an initial seeding group, gi, where a term tx is part of gi if Kappa_Score(ti, tx)≥Score threshold. If at least 50% of term-term pairs in gi have a Kappa_Scorescore threshold, the initial seeding group gi is retained for the next step. The econd criteria ensures that terms within the same seeding group have strong interconnectivity. An iterative merging of seeding groups then follows: groups sharing p% or more members are merged. The representative term for each group was determined as the member with the highest interconnectivity score with other members of the group. After the automatic grouping process, a manual inspection leads to the merging of outliers or smaller groups into representative groups. We used a kappa score threshold of 0.5, and condition of p≥25% of shared members when merging for the down-regulated genes, and a kappa score threshold of 0.6 and condition of p≥50% of shared members when merging for the up-regulated genes.

DDR pathway genes

We hand-curated 88 genes for double-stranded break DNA damage repair pathways (a-EJ, HR, c-NHEJ, SSA) via an extensive literature search, and added an additional 51 genes from the following MSigDB (c5.go.bp.v2022.1.Hs.entrez.gmt) GO biological process terms: GO:0097680 (double strand break repair via classical nonhomologous end joining), GO:0097681 (double strand break repair via alternative nonhomologous end joining), GO:1905168 (positive regulation of double strand break repair via homologous recombination), and GO:0045002 (double strand break repair via single strand annealing). This resulted in a final list of 129 genes. The directionality of the genes, as either up- or down-regulated in ecDNA(+) samples, is based on the full set of 1,440 samples, consisting of 243 ecDNA(+) and 1,197 ecDNA(-) samples representing 24 tumor types (Table S12; Table S13).

To test whether the number of genes passing our effect size thresholds for all genes 5,256 (UP) and 5,251 (DOWN) was significantly different from the up-/down-regulated genes implicated in each of the 4 pathways, we performed a Fisher’s exact test (fisher_exact function from the scipy.stats python package) on the contingency table below, where c and d are the up- and down-regulated genes for each of the pathways tested.

Contingency table:

Pathway values:

Physical presence on amplicons

To determine physical presence of a gene on an amplicon, gene coordinates listed in the gene annotations file GRCh37/human_hg19_september_2011/Genes_July_2010_hg19.gff, downloaded from the AA repo on 3/21/2022, were mapped to amplicon genomic intervals (bed files). A gene is determined to be physically present on an amplicon if its genomic coordinates are fully encompassed within the amplicon genomic intervals.

Mutational analysis

We pulled the list of mutations from the open-access version of MC3 dataset (, then investigated differences between ecDNA(+) and ecDNA(-) samples. Synonymous mutations were excluded when calculating the mutation burden. For the differentially mutated gene analysis, only damaging mutations were selected by using snpEff56 annotation which is originally included in the MC3 dataset. First, mutations annotated as HIGH in the IMPACT column were selected to obtain frameshift INDELs and stop gain SNVs. Next, mutations predicted to be damaging by SIFT57 and PolyPhen258, were selected to obtain damaging missense mutations. Finally, we generated 2-by-2 contingency tables for each gene with cells a, b, c, and d representing the number of individuals with and without damaging mutations in ecDNA(+) and (-) tumors. The odds ratios were computed as , where a is the number of individuals in ecDNA(+) with the mutation, b is the number of individuals in ecDNA(+) without the mutation, c is the number of individuals in ecDNA(-) with the mutation, and d is the number of individuals in ecDNA(-) without the mutation. To determine if a gene contained mutations that were implicated in cancer, we checked genes against the Cancer Gene Census (CGC) database (v97)59, marking genes in the database as CGC and those that were not as non-CGC.

Classification with mutational status

First, we created a binary matrix representing whether a gene is damaged or not, from the MC3 damaging mutation set described as above. Then we divided the whole matrix into 80% of the training set and 20% of the test set. Hyperopt was applied on the training set to select the best parameters for the XGBoost39 model. The optimal parameters estimated by Hyperopt (eta = 0.1, max_depth = 6, min_child_weight = 3.0, scale_pos_weight = 4.9) were input to the XGBoost model, and the ecDNA status of each sample were also input as an answer set. The performance of the model was checked by inputting the 20% test set to the model and comparing the output result with the answer. Principal component analysis was also performed with the same mutational matrix as above, using the scikit.learn package.

Ranking of CorEx genes

To rank CorEx genes by importance, we computed harmonic mean rank values based on three categories: a) the average Gini importance statistic or mean decrease impurity (MDI) MDI (feature importance) values extracted from the trained random forest models on 200 training sets during the evaluation method described above, b) the number of Boruta runs (out of 200) that a gene is selected in, rounded to 2-digits, and c) the number of Boruta runs (out of 200) that a gene is selected in when counting by cluster, rounded to 2-digits. The ranks for each category are adjusted separately so that genes with the same value share the same rank value. For example, if using the Boruta run count as the rank value:

The harmonic mean rank is defined as:

Tumor immune subtype

We classified ecDNA(+) and ecDNA(-) samples into the 6 immune subtype categories (Thorsson et al., 201837) provided in Table S6 from Bagaev et al., 202160.

Author Contributions

M.S.L., P.S.M, and V.B. conceived and designed the experiments and core narrative of the manuscript. All analyses were performed by M.S.L., apart from the following ones. Mutational analysis was performed by S.J. The amplicon classifications were performed by J.L. S.W, V.B, H.Y.C. and P.S.M. provided guidance on the analysis and interpretation of results. M.S.L., S.J., J.L., and V.B. wrote the manuscript with feedback from all authors.


This project was supported by Cancer Grand Challenges CGCSDF-2021\100007 with support from Cancer Research UK and the National Cancer Institute (V.B, H.Y.C., P.S.M.) M.S.L., J.L., and V.B. were supported in part by grants U24CA264379, R01GM114362, and OT2CA278635 from the NIH and CGCATF-2021/100025 from CRUK. S.J. was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI19C1330).

Competing Interests

V.B. is a co-founder, paid consultant, SAB member and has equity interest in Boundless Bio, inc. and Abterra, Inc. H.Y.C. is a co-founder of Accent Therapeutics, Boundless Bio, Cartography Biosciences, Orbital Therapeutics, and an advisor of 10x Genomics, Arsenal Biosciences, Chroma Medicine, and Spring Discovery. P.S.M. is a co-founder and advisor of Boundless Bio. J.L. receives compensation as a consultant for Boundless Bio. The remaining authors declare no competing interests.