Have you read these 21 must-read proteomics articles?

At Nonlinear we get a lot of questions on the whole analysis process for proteomics data, from experimental design through to statistical analysis, QC, and database searching for protein and compound identities. For our own software and approaches, you may well find the answers in our FAQs, and we’re always happy to help. However, we often get questions that go beyond the ‘number crunching’ into the details of some of these wider concepts. With that in mind, I thought I’d put together a mini reading list of starting points for learning more about the concepts surrounding the analytical workflow, for anyone new to the field. Of course this is just one selection of topics, but the articles may be worth a look.

QC approaches

This whole blog entry was prompted first and foremost by an excellent recent review of QC for proteomics LC-MS/MS, a topic we covered in a recent post on our own QC metrics. Since that post was written, Bereman [1] has published a review that I would really recommend, although it requires a subscription to Proteomics. It provides a good grounding in the approaches one can take and the various software tools available, including SimpatiQCo and QuaMeter. An interesting application of QuaMeter itself was also recently provided by Wang et al. [2]. In this work, the authors developed multivariate QC metrics (independent of MS/MS identifications) to identify outlier data by dissimilarity analysis, investigating the effects of different runs, mass spectrometers, laboratories and the application of SOPs. Amidan et al. [3] is another good example, in which classification models were used to develop ongoing composite control metrics. Both papers either use freely available data or have made their own data available, and are well worth a read.
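To make the idea of identification-independent, dissimilarity-based outlier detection a little more concrete, here is a minimal sketch in Python. It is not the method from Wang et al. [2], nor QuaMeter's implementation; the metric names, the values, the scaling choice and the threshold factor are all invented purely for illustration.

```python
import math
import statistics

# Hypothetical per-run QC metrics (made-up values, not real data):
# (median peak width in s, MS1 total ion current, number of MS2 scans)
runs = {
    "run_01": (12.1, 1.00e9, 41000),
    "run_02": (11.8, 1.05e9, 40500),
    "run_03": (12.4, 0.98e9, 41800),
    "run_04": (12.0, 1.02e9, 40900),
    "run_05": (19.5, 0.40e9, 22000),  # deliberately unusual run
}

def robust_scale(rows):
    """Centre each metric on its median and divide by its MAD,
    so that no single metric dominates the distances."""
    scaled_cols = []
    for col in zip(*rows):
        med = statistics.median(col)
        mad = statistics.median(abs(v - med) for v in col) or 1.0
        scaled_cols.append([(v - med) / mad for v in col])
    return list(zip(*scaled_cols))

def dissimilarity_scores(runs):
    """Score each run by its median Euclidean distance to all other runs."""
    names = list(runs)
    scaled = robust_scale([runs[n] for n in names])
    scores = {}
    for i, name in enumerate(names):
        dists = [math.dist(scaled[i], scaled[j])
                 for j in range(len(names)) if j != i]
        scores[name] = statistics.median(dists)
    return scores

def flag_outliers(scores, factor=3.0):
    """Flag runs whose score exceeds `factor` times the typical score.
    The factor is an arbitrary illustrative choice."""
    typical = statistics.median(scores.values())
    return sorted(n for n, s in scores.items() if s > factor * typical)

print(flag_outliers(dissimilarity_scores(runs)))
```

With these invented numbers only the deliberately unusual run is flagged. Real metric sets are larger and noisier, which is where the approaches in the papers above come in.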

Data sharing

On the topic of quality, there is also a need to share, and standardise the sharing of, proteomics data. Ternent et al. [4] produced a very useful overview of the process for uploading to a key repository, ProteomeXchange, via PRIDE; further recent overviews of ProteomeXchange have been provided by Vizcaíno et al. [5] and Römpp et al. [6]; and a wide-ranging overview of the databases currently available has been provided by Perez-Riverol et al. [7].

File formats and interconversions

As you’ll know, there is a huge array of file formats in mass spectrometry; Deutsch [8] summarised these very well, discussing both the formats themselves and the issues raised by their diversity. Tools for interconverting data between formats, such as ProteoWizard, are also discussed in that review.

This also links in to data sharing, as commonality of formats can aid this process. The development of standardised open exchange file formats by the HUPO-PSI group is described in a series of freely available papers [9, 10, 11]. This also points back to QC: Walzer et al. [12] recently provided a good overview of the qcML format, which will provide an expandable but standardised means of reporting quality metrics.

Experimental design and statistics

Karp and Lilley [13] published a review, “Design and Analysis Issues in Quantitative Proteomics Studies”, on this topic a while back – it’s a great starting point and looks at a number of the issues we’re commonly asked about. The consequences of improper experimental design can be critical – Ioannidis [14] published a strikingly titled paper in 2005 discussing aspects of this problem, and the 2012 Institute of Medicine report on the evolution of translational ‘omics has some food for thought in the form of several very interesting case studies [15].

Missing values

We’ve blogged on the issue of missing values, which our software helps to avoid. If you’re interested in learning a bit more about them and how they may be handled when present, then I recommend a look at Karpievitch et al. [16].

Protein & peptide identification

Nesvizhskii published a very in-depth review of computational approaches to MS/MS-based identification in 2010 [17].

Law and Lim [18] have also published a very good summary of recent technical approaches to improving peptide and protein identification coverage, such as DIA (Data Independent Analysis, also widely called data-independent acquisition). Their summary covers developments such as MSE, SWATH and AIF. Sajic et al. also produced a general overview of DIA methods, which then goes on to focus on SWATH in particular [19]. Of course these methods have relevance for quantitation as well, and they create challenges for the software used to analyse their output data; both points are also described in those two reviews.

Protein inference

Given peptide identities in bottom-up proteomics, it is then not trivial to assemble these correctly into protein identifications, not least because a peptide sequence can be shared by more than one protein. Two papers that summarise the issues encountered, and look at a range of approaches, are Nesvizhskii & Aebersold [20] and Li & Radivojac [21]. Our own approaches and options are described in an FAQ.
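As a toy illustration of why this assembly step is not trivial, the sketch below applies a simple greedy parsimony heuristic to a handful of shared peptides. The peptide and protein identifiers are hypothetical, and the greedy rule is just one simple heuristic chosen for illustration; it is not the approach of either paper above, nor the one used in our software.

```python
# Hypothetical peptide -> candidate protein mapping (invented identifiers).
# PEP_B and PEP_C are "shared" peptides that match more than one protein.
peptide_to_proteins = {
    "PEP_A": {"PROT_1"},
    "PEP_B": {"PROT_1", "PROT_2"},
    "PEP_C": {"PROT_2", "PROT_3"},
    "PEP_D": {"PROT_3"},
}

def greedy_parsimony(peptide_to_proteins):
    """Repeatedly pick the protein that explains the most still-unexplained
    peptides, until every peptide is explained (a greedy set cover)."""
    # Invert the mapping: protein -> set of peptides it could explain.
    protein_to_peptides = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)

    unexplained = set(peptide_to_proteins)
    chosen = []
    while unexplained:
        # Sorting first makes ties resolve alphabetically, so the result
        # is deterministic.
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & unexplained))
        gained = protein_to_peptides[best] & unexplained
        if not gained:
            break  # remaining peptides match no candidate protein
        chosen.append(best)
        unexplained -= gained
    return chosen

print(greedy_parsimony(peptide_to_proteins))
```

Here PROT_1 and PROT_3 are enough to explain all four peptides, so PROT_2 is left out even though two identified peptides match it. Whether PROT_2 is really absent from the sample is exactly the kind of question the two reviews discuss.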

If you’d prefer to view a full list of the articles mentioned in this post, please see our references page.

I hope some of these pointers are of use or interest to you! As I said, we’re always happy to help with any questions you have on our approach, so do get in touch on that, but these recommendations are designed to range more widely than our own software.

Happy reading! 🙂