De-MISTED: Image-based classification of erroneous multiple sequence alignments using convolutional neural networks
Published in: | Applied intelligence (Dordrecht, Netherlands), 2023-08, Vol.53 (15), p.18806-18820 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Summary: | The widespread use of high-throughput genome sequencing technologies has resulted in a significant increase in the number of available sequences, creating new challenges for genome annotation and prediction of protein-coding genes in terms of error detection and quality control. Multiple Sequence Alignments (MSAs) of the predicted protein sequences provide important contextual information that can be used to distinguish errors (caused by artifacts in the raw genome data, badly predicted gene sequences, or the alignment methods themselves) from true biological events. This can be achieved either by human expertise or by statistical analysis of the sequence data. Here, we propose a new approach that uses visual representations of MSAs as inputs for Convolutional Neural Networks (CNNs) to classify MSAs into erroneous and non-erroneous categories. The MSAs are extracted from a unique in-house dataset, in which errors are carefully identified. Our model, called De-MISTED (Deep learning for MultIple Sequence alignmenTs Error Detection), identifies MSAs containing erroneous sequences with high accuracy (87%) and sensitivity (92%). Visual explanation techniques show that our model correctly identifies the position of multiple errors of different types (insertions, deletions and mismatches). Close examination of the data showed that our model can also identify errors that were not previously annotated in the data. The De-MISTED method thus contributes to a more robust exploitation of the genome data. |
---|---|
ISSN: | 0924-669X, 1573-7497 |
DOI: | 10.1007/s10489-022-04390-7 |
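
The summary describes feeding visual representations of MSAs to a CNN but does not say how the images are built. The following is a minimal sketch of one plausible encoding, not the authors' method: each aligned sequence becomes one pixel row and each residue one coloured pixel. The colour scheme, the function names `residue_colour` and `msa_to_image`, and the toy alignment are all assumptions made for illustration.

```python
# Hypothetical colour scheme: the summary does not state how residues are drawn.
GAP_COLOUR = (255, 255, 255)   # gap symbols rendered as white pixels
UNKNOWN_COLOUR = (0, 0, 0)     # anything outside the 20 standard residues
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_colour(symbol):
    """Map one alignment symbol to an RGB triple (arbitrary but distinct colours)."""
    symbol = symbol.upper()
    if symbol in ("-", "."):
        return GAP_COLOUR
    index = AMINO_ACIDS.find(symbol)
    if index < 0:
        return UNKNOWN_COLOUR
    # The first channel alone already differs for each of the 20 residues.
    return (40 + 10 * index, (index * 53) % 256, 250 - 10 * index)


def msa_to_image(sequences):
    """Return a nested list of shape height x width x 3.

    One row per aligned sequence, one pixel per alignment column.
    """
    if not sequences:
        raise ValueError("empty alignment")
    width = len(sequences[0])
    if any(len(seq) != width for seq in sequences):
        raise ValueError("aligned sequences must all have the same length")
    return [[residue_colour(ch) for ch in seq] for seq in sequences]


# Toy alignment (invented for illustration): the second sequence carries an
# insertion at column 3 and the third a mismatch at column 1.
toy_msa = ["MKV-LA", "MKVQLA", "MRV-LA"]
image = msa_to_image(toy_msa)
```

In an image like this, an insertion appears as a coloured pixel in an otherwise white column and a mismatch as an off-colour pixel in an otherwise uniform column, which is the kind of local pattern a convolutional filter can pick up. A real pipeline would also need to resize or pad the images to a fixed input size; how De-MISTED handles that is not stated here.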