Structure-guided isoform identification for the human transcriptome

  1. Markus J Sommer (corresponding author)
  2. Sooyoung Cha
  3. Ales Varabyou
  4. Natalia Rincon
  5. Sukhwan Park
  6. Ilia Minkin
  7. Mihaela Pertea
  8. Martin Steinegger (corresponding author)
  9. Steven L Salzberg (corresponding author)
  1. Department of Biomedical Engineering, Johns Hopkins School of Medicine and Whiting School of Engineering, United States
  2. Center for Computational Biology, Johns Hopkins University, United States
  3. School of Biological Sciences, Seoul National University, Republic of Korea
  4. Artificial Intelligence Institute, Seoul National University, Republic of Korea
  5. Department of Computer Science, Johns Hopkins University, United States
  6. Institute of Molecular Biology and Genetics, Seoul National University, Republic of Korea
  7. Department of Biostatistics, Johns Hopkins University, United States
9 figures, 1 video and 4 additional files

Figures

Predicted local distance difference test (pLDDT) distribution across the human transcriptome.

Two-dimensional joint histograms comparing pLDDT to protein amino acid length (a) and expression (b) measured in transcripts per million (TPM). For each protein-coding gene, only the isoform found …

Acetylserotonin O-methyltransferase (ASMT) isoform comparison.

Comparison of predicted structures of ASMT, showing the 373aa isoform from Matched Annotation from NCBI and EMBL-EBI (MANE) (CHS.57426.2, RefSeq NM_001171038.2, GENCODE ENST00000381241.9) on the …

CRYGN isoform comparison.

(a) Predicted protein structure for the Matched Annotation from NCBI and EMBL-EBI (MANE) isoform (CHS.52273.5, RefSeq NM_144727.3, GENCODE ENST00000337323.3) of gamma-N crystallin (CRYGN), colored …

CRYGN intron-exon structure.

Comparison of gamma-N crystallin (CRYGN) transcript structures in frog, mouse, and human. Exons 1, 2, and 3 are highly conserved across all species. Exon 4 is missing from the poorly folding Matched …

TXNDC8 isoform comparison.

Predicted protein structures for seven distinct human isoforms of thioredoxin domain-containing protein 8 (TXNDC8), as well as the primary cattle transcript and a novel mouse transcript. Alternate …

IL36B isoform comparison.

Comparison of predicted structures for interleukin 36 beta (IL36B) for the Matched Annotation from NCBI and EMBL-EBI (MANE) isoform (CHS.30565.1, RefSeq NM_014438.5, GENCODE ENST00000259213.9) and …

PGAP2 isoform comparison.

Comparison of the structure of the Matched Annotation from NCBI and EMBL-EBI (MANE) isoform (CHS.7860.58, RefSeq NM_014489.4, GENCODE ENST00000278243.9) versus the highest scoring alternate isoform …

Vascular endothelial growth factor B (VEGFB) isoform comparison.

VEGFB isoforms VEGFB-186 (a) and VEGFB-167 (b). The inclusion of a heparin binding domain in VEGFB-167 results in sequestration to the cell surface while VEGFB-186 remains freely soluble. Relying …

TXNDC8 human and mouse comparison.

Intron-exon and predicted protein structure for TXNDC8 in human (a and b) and mouse (c, d, and e). Exons are colored according to their average predicted local distance difference test (pLDDT) …

Videos

Video 1
CRYGN comparison.

A three-dimensional (3D) animation comparing the predicted protein structure of the Matched Annotation from NCBI and EMBL-EBI (MANE) isoform (CHS.52273.5, RefSeq NM_144727.3, GENCODE …

Additional files

Supplementary file 1

All isoform summary.

ColabFold folding scores for each transcript in a preliminary new build of the Comprehensive Human Expressed SequenceS (CHESS) database that contained a protein-coding sequence (CDS) under 1000aa in length. For transcripts already contained in the released CHESS v3.0 database, the identifier from that database is provided. If the transcript maps to a known gene locus X but is a novel isoform, it is shown with the identifier CHS.X.altY. If a transcript occurs at a novel locus X, the identifier is hypothetical.X.Y, where Y identifies the isoform number. Additional columns show the gene name, the RefSeq ID (release 110), the GENCODE ID (release 40), the predicted local distance difference test (pLDDT) (folding) score, and a flag indicating whether all intron boundaries (for multi-exon genes) are conserved in the mouse genome.

https://cdn.elifesciences.org/articles/82556/elife-82556-supp1-v3.csv
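
The three identifier forms described for Supplementary file 1 can be told apart with a few regular expressions. The following is a minimal sketch, not part of the published pipeline: the function name classify_identifier, the category labels, and the assumption that X and Y are plain integers are all hypothetical, and the identifiers CHS.52273.alt1 and hypothetical.12.3 are invented examples of the described patterns.

```python
import re

# Patterns are assumptions inferred from the description of Supplementary file 1;
# the real file may contain identifier variants that are not covered here.
_PATTERNS = [
    ("novel_isoform", re.compile(r"^CHS\.(\d+)\.alt(\d+)$")),       # CHS.X.altY
    ("novel_locus", re.compile(r"^hypothetical\.(\d+)\.(\d+)$")),   # hypothetical.X.Y
    ("released", re.compile(r"^CHS\.(\d+)\.(\d+)$")),               # CHESS v3.0 identifier
]


def classify_identifier(identifier):
    """Return (kind, locus, isoform) for a transcript identifier.

    kind is one of "released", "novel_isoform", "novel_locus", or "unknown".
    locus and isoform are integers, or None when the identifier is not recognized.
    """
    for kind, pattern in _PATTERNS:
        match = pattern.match(identifier)
        if match:
            return kind, int(match.group(1)), int(match.group(2))
    return "unknown", None, None


if __name__ == "__main__":
    # CHS.57426.2 appears in the figure captions; the other two are invented.
    for example in ("CHS.57426.2", "CHS.52273.alt1", "hypothetical.12.3"):
        print(example, classify_identifier(example))
```

Splitting identifiers this way makes it straightforward to separate transcripts already in CHESS v3.0 from novel isoforms and novel loci before comparing their pLDDT scores.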
Supplementary file 2

Matched Annotation from NCBI and EMBL-EBI (MANE) comparison summary.

Folding scores and additional data for all Comprehensive Human Expressed SequenceS (CHESS) transcripts that match genes in the MANE v1.0 dataset, limited to protein sequences under 1000aa in length. Transcripts must overlap the annotated CDS of the MANE transcript to be included. Columns include:

  CHESS_ID_isoform: the CHESS identifier of the alternate isoform transcript.
  CHESS_ID_MANE: the CHESS identifier of the MANE transcript at the same locus.
  gene: the gene name.
  aa_length_isoform: the amino acid length of the alternate isoform’s CDS.
  aa_length_MANE: the amino acid length of the MANE transcript’s CDS.
  length_ratio: the ratio of the alternate isoform length to the MANE isoform length.
  pLDDT_isoform: the predicted folding score of the alternate isoform.
  pLDDT_MANE: the predicted folding score of the MANE isoform.
  pLDDT_ratio: the ratio of the alternate isoform folding score to the MANE isoform folding score.
  GTEx_samples_observed_isoform: the total number of GTEx samples where the alternate isoform was observed at least once.
  GTEx_samples_observed_MANE: the total number of GTEx samples where the MANE isoform was observed at least once.
  GTEx_top_tissue_name_isoform: the name of the tissue in which the alternate isoform was observed in the highest number of samples.
  GTEx_top_tissue_name_MANE: the name of the tissue in which the MANE isoform was observed in the highest number of samples.
  GTEx_top_tissue_TPM_isoform: the average TPM of the alternate isoform in the named tissue.
  GTEx_top_tissue_TPM_MANE: the observed transcripts per million (TPM) of the MANE isoform in the named tissue.
  introns_conserved_in_mouse_isoform: an indicator of whether introns are conserved between the alternate human isoform and any annotated isoform in the GRCm38 mouse reference genome.
  introns_conserved_in_mouse_MANE: an indicator of whether introns are conserved between the MANE human isoform and any annotated isoform in the GRCm38 mouse reference genome.

https://cdn.elifesciences.org/articles/82556/elife-82556-supp2-v3.csv
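
The two ratio columns in Supplementary file 2 are defined as alternate isoform over MANE, so they can be recomputed from the length and pLDDT columns as a consistency check. The sketch below assumes the CSV header uses exactly the column names listed in the description; the helper name recompute_ratios is hypothetical, and the single data row is made up for illustration and is not taken from the file.

```python
import csv
import io


def recompute_ratios(row):
    """Return (length_ratio, pLDDT_ratio), each computed as alternate isoform / MANE."""
    length_ratio = float(row["aa_length_isoform"]) / float(row["aa_length_MANE"])
    plddt_ratio = float(row["pLDDT_isoform"]) / float(row["pLDDT_MANE"])
    return length_ratio, plddt_ratio


# Made-up row for illustration only; the identifiers and values are not real data.
SAMPLE_CSV = """CHESS_ID_isoform,CHESS_ID_MANE,aa_length_isoform,aa_length_MANE,pLDDT_isoform,pLDDT_MANE
CHS.0.alt1,CHS.0.1,150,100,90.0,60.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
length_ratio, plddt_ratio = recompute_ratios(rows[0])

if __name__ == "__main__":
    print(f"length_ratio={length_ratio}, pLDDT_ratio={plddt_ratio}")
```

To run the same check on the real file, the in-memory sample would be replaced by an open handle to the downloaded CSV. A pLDDT_ratio above 1 marks an alternate isoform whose predicted folding score exceeds that of the MANE isoform.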
Supplementary file 3

Matched Annotation from NCBI and EMBL-EBI (MANE) comparison summary, filtered subset.

A filtered set of Comprehensive Human Expressed SequenceS (CHESS) transcripts compared to MANE according to the criteria detailed in the ‘Filtering MANE comparisons’ section of the Materials and methods. Uses the same column names as Supplementary file 2.

https://cdn.elifesciences.org/articles/82556/elife-82556-supp3-v3.csv
MDAR checklist
https://cdn.elifesciences.org/articles/82556/elife-82556-mdarchecklist1-v3.pdf
