Proteogenomics is an area of research at the interface of proteomics and genomics. The term was initially used to describe studies in which proteomic data are used to improve genome annotation and to characterize protein-coding potential. It has since been broadened to cover any application in which a proteogenomic-like approach is used to interpret tandem mass spectrometry (MS/MS) spectra. In this approach, customized protein sequence databases are used to help identify novel peptides from MS-based proteomic data. In turn, the proteomic data provide protein-level evidence of gene expression and help refine gene models.
Fig.1 The concept of proteogenomics. (Nesvizhskii, 2014)
As shown in Fig.1, a proteogenomic approach identifies novel peptides by searching MS/MS spectra against customized protein sequence databases that contain predicted novel protein sequences and sequence variants. These databases are generated from genomic and transcriptomic sequence information.
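To illustrate why a customized database makes novel peptides identifiable, the following minimal sketch digests protein sequences in silico and reports the peptides that occur only in the customized entries. It is not a spectrum search: a real workflow would use an MS/MS search engine. The function names, the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages), and the example sequences are all assumptions made for illustration.

```python
import re


def tryptic_digest(protein, min_len=7):
    """Return the set of peptides from a simplified in-silico tryptic digest.

    Assumption: cleave after K or R unless the next residue is P,
    with no missed cleavages.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {p for p in pieces if len(p) >= min_len}


def candidate_novel_peptides(custom_db, reference_db, min_len=7):
    """Map each peptide found only in custom_db to the entries that contain it.

    Both arguments are dicts of {entry_name: protein_sequence}.
    """
    reference_peptides = set()
    for seq in reference_db.values():
        reference_peptides |= tryptic_digest(seq, min_len)

    novel = {}
    for name, seq in custom_db.items():
        for pep in tryptic_digest(seq, min_len) - reference_peptides:
            novel.setdefault(pep, []).append(name)
    return novel


# Hypothetical example: a reference protein and a single-residue variant.
reference = {"ref1": "MAGTKLLSPEERVVDLK"}
custom = {"var1": "MAGTKLLSPEDRVVDLK"}
print(candidate_novel_peptides(custom, reference, min_len=5))
```

Only peptides that differ from every reference peptide are reported, so a spectrum matching one of them would count as evidence for the predicted variant or novel sequence rather than for the annotated protein.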
Fig.2 Types of peptides identified in proteogenomics. (Nesvizhskii, 2014)
The main idea behind the proteogenomic approach is to identify peptides by comparing MS/MS data with protein databases that contain predicted protein sequences. The peptides identified in proteogenomics can be classified as intergenic or intragenic. Intergenic peptides map to regions located between annotated gene models, whereas intragenic peptides map to genomic regions contained within, or in close proximity to, an annotated gene model. The protein databases can be generated from genomic and transcriptomic data in several ways, some of which are described below:
Generation of Customized Protein Sequence Databases
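As one illustration of deriving candidate protein sequences from genomic data, the sketch below performs a six-frame translation of a DNA sequence. Six-frame translation is a commonly used option that this text does not itself name, so treat it as an assumed example. The function names, the stop-to-stop definition of a candidate sequence, and the `min_len` cutoff are illustrative choices, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_len=10):
    """Yield (strand, frame, peptide) for stop-to-stop stretches of at least min_len residues."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len:
                    yield strand, frame, segment


# Hypothetical toy sequence; real inputs would be chromosomes or contigs.
for strand, frame, peptide in six_frame_orfs("ATGGCCAAATAA", min_len=3):
    print(strand, frame, peptide)
```

The resulting sequences could be written out in FASTA format and searched alongside the reference proteome. Because all six frames are translated, such a database is much larger than the annotated proteome, which is one reason length cutoffs like `min_len` are applied.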
Considerations for Peptide Identification Using Customized Protein Databases
The feasibility of various proteogenomic applications has been demonstrated in multiple studies in humans and in many model organisms, including Plasmodium falciparum, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana and Anopheles gambiae. Recent improvements in proteomic technologies, coupled with the wide availability of high-throughput DNA and transcriptome sequencing data, have led to a resurgence of proteogenomic studies.