Predicting Disorder from the Proteome
(A) Workflow of the IDR prediction algorithm. IUPred scores were computed for the entire proteome, and the output was split into two sets: scores for residues inside annotated structured domains and scores for residues in unannotated regions of the proteome. These two sets of IUPred scores were used to train a hidden Markov model (HMM) that assigns sequences in the proteome to “structured” and “unstructured” states. A Viterbi path was computed from the HMM to give a binary prediction for each position. The plot shows the output of the algorithm for the Daughterless TF. The IUPred “long” scores are plotted in black, and the Viterbi path from our HMM is shown in red. The green box at the top of the figure denotes the annotated structured domain from SMART for this protein, extracted from the FlyBase GFF file. Beneath the figure is a schematic of the linear protein structure (modified from SMART (Letunic et al., 2015; Schultz et al., 1998)), with IDRs indicated in purple and a helix-loop-helix binding domain in green. The IDR isolated for this study is shown in orange.
(B) Histogram showing the distribution of IUPred “long” scores in regions of the proteome annotated as structured domains by Pfam and/or SMART (green) vs. regions outside of known domains (red).
(C) Number of amino acids in the proteome classified as structured (blue) vs. unstructured (red) by the Viterbi path of our HMM, shown separately for annotated Pfam/SMART domains and for regions of the proteome outside of known domains.
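
The workflow in panel (A) can be illustrated with a minimal two-state HMM sketch. This is not the implementation used in the study: the legend does not specify how emissions or transitions were parameterized, so the sketch assumes that emission probabilities are histogram estimates over binned scores in [0, 1], that the initial state distribution is uniform, and that a single self-transition probability `p_stay` applies to both states. The names `bin_index`, `emission_log_probs`, `viterbi`, `N_BINS`, and `p_stay` are hypothetical.

```python
import math

N_BINS = 10  # assumption: disorder scores in [0, 1] are discretized into 10 bins


def bin_index(score, n_bins=N_BINS):
    """Map a score in [0, 1] to a histogram bin index."""
    return min(int(score * n_bins), n_bins - 1)


def emission_log_probs(scores, n_bins=N_BINS, pseudocount=1.0):
    """Estimate log P(bin | state) from one training set of scores.

    A pseudocount keeps unseen bins from having zero probability.
    """
    counts = [pseudocount] * n_bins
    for s in scores:
        counts[bin_index(s, n_bins)] += 1
    total = sum(counts)
    return [math.log(c / total) for c in counts]


def viterbi(scores, emit_structured, emit_unstructured, p_stay=0.95):
    """Return the most likely state path: 0 = structured, 1 = unstructured.

    p_stay is an assumed self-transition probability shared by both states.
    """
    if not scores:
        return []
    emit = (emit_structured, emit_unstructured)
    log_stay = math.log(p_stay)
    log_switch = math.log(1.0 - p_stay)

    # Initialization with a uniform prior over the two states.
    first = bin_index(scores[0])
    best = [math.log(0.5) + emit[k][first] for k in (0, 1)]
    back = []  # back[t][k] = best predecessor of state k at position t + 1

    # Recursion: keep the best-scoring predecessor for each state.
    for s in scores[1:]:
        b = bin_index(s)
        new_best, pointers = [], []
        for k in (0, 1):
            cands = [best[j] + (log_stay if j == k else log_switch)
                     for j in (0, 1)]
            j_best = 0 if cands[0] >= cands[1] else 1
            new_best.append(cands[j_best] + emit[k][b])
            pointers.append(j_best)
        best = new_best
        back.append(pointers)

    # Traceback from the best final state.
    state = 0 if best[0] >= best[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

In this sketch, the two emission tables would be built by calling `emission_log_probs` once on the scores from annotated domains and once on the scores from unannotated regions, and `viterbi` would then be run on the per-residue scores of each protein. A high `p_stay` makes the path resistant to isolated score spikes, which is one plausible reason to prefer a Viterbi call over a simple per-residue threshold.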