GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:The international journal of high performance computing applications 2023-11, Vol.37 (6), p.683-705
Hauptverfasser: Zvyagin, Maxim, Brace, Alexander, Hippe, Kyle, Deng, Yuntian, Zhang, Bin, Bohorquez, Cindy Orozco, Clyde, Austin, Kale, Bharat, Perez-Rivera, Danilo, Ma, Heng, Mann, Carla M., Irvin, Michael, Ozgulbas, Defne G., Vassilieva, Natalia, Pauloski, James Gregory, Ward, Logan, Hayot-Sasson, Valerie, Emani, Murali, Foreman, Sam, Xie, Zhen, Lin, Diangen, Shukla, Maulik, Nie, Weili, Romero, Josh, Dallago, Christian, Vahdat, Arash, Xiao, Chaowei, Gibbs, Thomas, Foster, Ian, Davis, James J., Papka, Michael E., Brettin, Thomas, Stevens, Rick, Anandkumar, Anima, Vishwanath, Venkatram, Ramanathan, Arvind
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole-genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.
ISSN:1094-3420
1741-2846
DOI:10.1177/10943420231201154