Toward Compilation of Balanced Protein Stability Data Sets: Flattening the ΔΔG Curve through Systematic Enrichment
Often studies analyzing stability data sets and/or predictors ignore neutral mutations and use a binary classification scheme labeling only destabilizing and stabilizing mutations. Recognizing that highly concentrated neutral mutations interfere with data set quality, we have explored three protein...
Gespeichert in:
Veröffentlicht in: | Journal of chemical information and modeling 2022-03, Vol.62 (5), p.1345-1355 |
---|---|
Hauptverfasser: | , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Often studies analyzing stability data sets and/or predictors ignore neutral mutations and use a binary classification scheme labeling only destabilizing and stabilizing mutations. Recognizing that highly concentrated neutral mutations interfere with data set quality, we have explored three protein stability data sets: S2648, PON-tstab, and the symmetric Ssym that differ in size and quality. A characteristic leptokurtic shape in the ΔΔG distributions of all three data sets including the curated and symmetric ones was reported due to concentrated neutral mutations. To further investigate the impact of neutral mutations on ΔΔG predictions, we have comprehensively assessed the performance of 11 predictors on the PON-tstab data set. Correlation and error analyses showed that all of the predictors performed the best on the neutral mutations, while their performance became gradually worse as the ΔΔG of the mutations departed further from the neutral zone regardless of the direction, implying a bias toward dense mutations. To this end, after unraveling the role of concentrated neutral mutations in biases of stability data sets, we described a systematic enrichment approach to balance the ΔΔG distributions. Before enrichment, mutations were clustered based on their biochemical and/or structural features, and then three mutations were selected from every 2 kcal/mol of each cluster. Upon implementation of this approach by distinct clustering schemes, we generated five subsets varying in size and ΔΔG distributions. All subsets showed improved ΔΔG and frequency distributions. We ultimately reported that the errors toward enriched subsets were higher than those toward the parent data sets, confirming the enrichment of difficult-to-predict mutations in the subsets. In summary, we elaborated the prediction bias toward a concentrated neutral zone and also implemented a rational strategy to tackle this and other forms of biases. Ultimately, this study equipping us with an extended view of shortcomings of stability data sets is a step taken toward development of an unbiased predictor. |
---|---|
ISSN: | 1549-9596 1549-960X |
DOI: | 10.1021/acs.jcim.2c00054 |