Optimizing Protein Fitness and Function with Sparse Experimental Data

Bibliographic details
Author:	Shaw, Ada Y
Format:	Dissertation
Language:	English
Subjects:	Applied mathematics; Artificial intelligence; Biology

Description
Abstract:	The quest to create customized protein sequences with specific functions holds great promise across diverse fields, from healthcare to sustainable energy. While Next Generation Sequencing (NGS) allows for experimental evaluation of millions of protein sequences, that number is dwarfed by the vast space of possible residue combinations. Recent advances in unsupervised generative models offer potential solutions, yet they need comprehensive evaluation of their generalizability to different types of data. This work addresses the biases and limitations of current protein design methods, emphasizing the importance of systematic evaluation. We explore protein sequence and structure models, particularly in the context of deep mutational scans. Chapter 1 investigates the biases of unsupervised protein sequence models and presents a method to alleviate these biases. This chapter aids in ranking diverse protein sequences, enhancing their prioritization for testing. Chapter 2 delves into the predictions of various structure models for mutational effect analysis. Spatially local residue preference models are found to prevail in certain cases, guiding local sequence optimization without additional experimental labor. Chapter 3 focuses on predicting enzyme pH optima using sequence embeddings from large language models. This benchmark study enhances our understanding of using unsupervised models to predict enzyme characteristics. Chapter 4 explores methods to predict protein function and fitness using sparse and disparate experimental data, shedding light on leveraging diverse information sources for predictive modeling. This work underscores the importance of evaluating designs on experimental data while highlighting the strengths of unsupervised models. Future work will involve experimental validation of the presented ideas.