Pinstripe: a suite of programs for integrating transcriptomic and proteomic datasets identifies novel proteins and improves differentiation of protein-coding and non-coding genes

Bibliographic details
Published in:	Bioinformatics 2012-12, Vol.28 (23), p.3042-3050
Main authors:	GASCOIGNE, Dennis K, CHEETHAM, Seth W, CATTENOZ, Pierre B, CLARK, Michael B, AMARAL, Paulo P, TAFT, Ryan J, WILHELM, Dagmar, DINGER, Marcel E, MATTICK, John S
Format:	Article
Language:	English
Subjects:	Annotations; Bioinformatics; Biological and medical sciences; Computational Biology; Computational Biology - methods; Databases, Protein; Exons; Fundamental and applied biological sciences. Psychology; Gene Expression Profiling; Gene Expression Profiling - methods; Gene Library; General aspects; Genes; Genome; Genomes; Genomics; Genomics - methods; Humans; Life Sciences; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Molecular Sequence Annotation; Open Reading Frames; Peptides; Proteins; Proteins - genetics; Proteomics; Proteomics - methods; Quantitative Methods; RNA, Long Noncoding; RNA, Long Noncoding - genetics; RNA, Messenger; RNA, Messenger - genetics; Sequence Analysis, RNA; Software; Translations
Online access:	Full text

Description
Abstract:	Comparing transcriptomic data with proteomic data to identify protein-coding sequences is a long-standing challenge in molecular biology, one that is exacerbated by the increasing size of high-throughput datasets. To address this challenge, and thereby to improve the quality of genome annotation and understanding of genome biology, we have developed an integrated suite of programs, called Pinstripe. We demonstrate its application, utility and discovery power using transcriptomic and proteomic data from publicly available datasets. To demonstrate the efficacy of Pinstripe for large-scale analysis, we applied Pinstripe's reverse peptide mapping pipeline to a transcript library including de novo assembled transcriptomes from the human Illumina Body Atlas (IBA2) and GENCODE v10 gene annotations, and the EBI Proteomics Identifications Database (PRIDE) peptide database. This analysis identified 736 canonical open reading frames (ORFs) supported by three or more PRIDE peptide fragments that are positioned outside any known coding DNA sequence (CDS). Because of the unfiltered nature of the PRIDE database and high probability of false discovery, we further refined this list using independent evidence for translation, including the presence of a Kozak sequence or functional domains, synonymous/non-synonymous substitution ratios and ORF length. Using this integrative approach, we observed evidence of translation from a previously unknown let7e primary transcript, the archetypical lncRNA H19, and a homolog of RD3. Reciprocally, by exclusion of transcripts with mapped peptides or significant ORFs (>80 codons), we identify 32 187 loci with RNAs longer than 2000 nt that are unlikely to encode proteins. Pinstripe (pinstripe.matticklab.com) is freely available as source code or a Mono binary. Pinstripe is written in C# and runs under the Mono framework on Linux or Mac OS X, and under both Mono and .NET on Windows.
Contact: m.dinger@garvan.org.au or j.mattick@garvan.org.au. Supplementary data are available at Bioinformatics online.
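
The abstract's two filtering ideas (keeping ORFs supported by three or more peptides, and flagging transcripts with neither mapped peptides nor an ORF longer than 80 codons as unlikely to encode proteins) can be illustrated with a short Python sketch. This is not Pinstripe's code, which is written in C#; the function names (`find_orfs`, `supported_orfs`, `likely_noncoding`), the forward-strand-only scan, the exact-substring peptide matching and the toy sequence are all assumptions made for illustration. The check against known CDS positions and the secondary evidence (Kozak sequence, functional domains, substitution ratios) are omitted.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate a nucleotide string codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X")
                   for i in range(0, len(nt) - 2, 3))


def find_orfs(transcript):
    """Return (start, end, protein) for each ATG...stop ORF on the forward strand.

    Assumption: only the first ATG in each open stretch is used, and ORFs
    without an in-frame stop codon are ignored.
    """
    seq = transcript.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                orfs.append((start, i + 3, translate(seq[start:i])))
                start = None
    return orfs


def supported_orfs(transcript, peptides, min_peptides=3):
    """Keep ORFs whose translation contains at least min_peptides distinct peptides."""
    hits = []
    for start, end, protein in find_orfs(transcript):
        matched = sorted({p for p in peptides if p in protein})
        if len(matched) >= min_peptides:
            hits.append((start, end, protein, matched))
    return hits


def likely_noncoding(transcript, peptides, max_codons=80):
    """True if no ORF exceeds max_codons and no peptide maps to any ORF."""
    orfs = find_orfs(transcript)
    has_long_orf = any(len(protein) > max_codons for _, _, protein in orfs)
    has_peptide = any(pep in protein for _, _, protein in orfs for pep in peptides)
    return not (has_long_orf or has_peptide)


# Toy data (hypothetical, not from IBA2, GENCODE or PRIDE).
toy_transcript = "GGATGAAAACTGCTTATATTGCTAAACAACGTTAACC"
toy_peptides = ["MKTA", "YIAK", "AKQR", "WWWW"]

if __name__ == "__main__":
    for hit in supported_orfs(toy_transcript, toy_peptides):
        print(hit)
    print(likely_noncoding(toy_transcript, ["WWWW"]))
```

Exact substring matching keeps the sketch short; a real peptide-mapping step would also need to handle both strands, isoleucine/leucine ambiguity in mass-spectrometry data, and false-discovery control, which the abstract notes is a concern for the unfiltered PRIDE data.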
ISSN:	1367-4803 1367-4811 1460-2059
DOI:	10.1093/bioinformatics/bts582