How to choose the best reference run for data analysis with Progenesis LC-MS

Update: Since this blog post was first published, a new version of Progenesis LC-MS has been released with an option to automatically select an alignment reference. The run chosen will be the one that gives the best results from automatic alignment, thereby taking a lot of the subjectivity and hard work out of the equation. Download it today to try it out.

“How do I know which reference run to choose for my proteomics data analysis?” This is a question often asked by people new to Progenesis LC-MS, and it’s a good one! Which reference run you choose has a major effect on the quality of run alignment and peak picking (feature detection), and therefore on the reliability of quantification and identification.

It is hard to give a one-size-fits-all answer to this question, because it very much depends on your own experimental details and aims. However, a recent publication can provide a helpful guide.

Sandin et al. at Lund University have published a paper titled “Generic workflow for quality assessment of quantitative label-free LC-MS analysis” [1]. It describes a universal approach to checking the performance of detection, alignment and quantification in your data analysis software, and it included Progenesis LC-MS as one of the packages under test. Here, I’ll focus on one section of the results, where two separate metrics, feature detection recall and alignment precision, were measured using two very different reference runs:

  1. Reference run from within the data set. So, one of a replicate set of injected samples that, in theory, should have most, if not all features in common with other runs.
  2. Reference run from a different data set. This run shared only ~25% of its features with the runs in the first data set.

The Results

To help you understand the graph below I’ve pulled out the definitions for the metrics used to measure the effect of choosing the different reference runs:

  • Feature detection recall measures the ratio of detected peptides to the expected number. In this case, the expected number was the number of peptides identified by MS/MS in all the files. A peak recall value of 1.0 means an identified peptide (i.e. one that fragmented and generated a positive ID from database searching) was detected in all the files.
  • Alignment precision measures how well the found features have been aligned and does not take missing features into account, i.e. a precision value of 1.0 means that all the found features have been correctly aligned, but says nothing about how many features were actually detected.
  • Mapping intervals correspond to an increasing retention time interval that was applied to even out the possibility of bias due to varying retention time formats between the software packages. So a higher mapping interval increases the tolerance for each value on the y-axis.
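
To make these definitions concrete, here is a minimal Python sketch of how the two metrics could be computed on toy data. The function names, the data layout and the use of the mapping interval as a plain retention time tolerance are my own assumptions for illustration; they are not taken from the paper or from Progenesis LC-MS, and a real comparison would also match on m/z.

```python
def peak_recall(expected, detected, mapping_interval):
    """Fraction of expected peptides that have a detected feature nearby.

    expected: dict of peptide name -> retention time in minutes (hypothetical layout)
    detected: list of retention times of the detected features
    mapping_interval: retention time tolerance in minutes (assumed meaning)
    """
    if not expected:
        return 0.0
    hits = sum(
        1
        for rt in expected.values()
        if any(abs(rt - d) <= mapping_interval for d in detected)
    )
    return hits / len(expected)


def alignment_precision(aligned_pairs):
    """Fraction of found features that were aligned to the correct peptide.

    aligned_pairs: list of (peptide in run, peptide it was aligned to).
    Missing features never appear in the list, so they do not lower the score.
    """
    if not aligned_pairs:
        return 0.0
    correct = sum(1 for in_run, aligned_to in aligned_pairs if in_run == aligned_to)
    return correct / len(aligned_pairs)


# Toy data: four identified peptides, three detected features.
expected = {"PEPA": 10.0, "PEPB": 20.0, "PEPC": 30.0, "PEPD": 40.0}
detected = [10.1, 20.4, 31.0]

# A wider mapping interval tolerates larger retention time differences.
for interval in (0.2, 0.5, 1.5):
    print(interval, peak_recall(expected, detected, interval))

pairs = [("PEPA", "PEPA"), ("PEPB", "PEPB"), ("PEPC", "PEPD"), ("PEPE", "PEPE")]
print(alignment_precision(pairs))
```

Note how recall can only stay the same or rise as the mapping interval grows, while precision is unaffected by the peptide that was never detected.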

These graphs, reconstructed from graphs in the paper [1], show the effect that choosing reference run 1. or 2., described above, has on feature detection (peak) recall and alignment precision:


Peak recall values when using reference run 1. or 2. Using reference run 1., where most, if not all, features are in common with the rest of the runs, produced consistent peak recall values >0.9 across all mapping intervals (solid line 1.). Using a reference run from a different data set, with only ~25% common features, has a negative effect on performance (dotted line 2.).


Alignment precision values when using reference run 1. or 2. Using reference run 1., where most or all features are in common with the rest of the runs, produced alignment precision values >0.95 across all mapping intervals (solid line 1.). Using a reference run from a different data set, with only ~25% common features, has a strong negative effect on performance (dotted line 2.).


This illustrates the guiding principle when it comes to choosing your reference run:

Selecting the reference run that is most similar to the rest of your runs will generate the best results from alignment, peak picking and protein quantification.
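
One simple way to picture “most similar” is to score each run by how many features it shares with every other run and pick the highest scorer. The sketch below does this with a mean Jaccard overlap on sets of feature identifiers. This is only an illustrative proxy under my own assumptions (the function names and the idea of pre-matched feature identifiers are hypothetical); it is not how Progenesis LC-MS selects a reference.

```python
def jaccard(a, b):
    """Shared features divided by all features seen in either run."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_reference(runs):
    """Return the name of the run with the highest mean overlap with the others.

    runs: dict of run name -> set of feature identifiers (hypothetical layout;
    real features would first need matching by m/z and retention time).
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")

    def mean_similarity(name):
        others = [feats for other, feats in runs.items() if other != name]
        return sum(jaccard(runs[name], feats) for feats in others) / len(others)

    return max(runs, key=mean_similarity)


# Toy data: three replicate-like runs and one run from a different data set.
runs = {
    "replicate_A": {1, 2, 3, 4},
    "replicate_B": {1, 2, 3, 4, 5},
    "replicate_C": {1, 2, 3},
    "other_set_X": {1, 9, 10, 11},
}
print(pick_reference(runs))
```

In this toy example the run from the other data set shares just one feature with the replicates, so it scores far below any of them, mirroring the poor results seen with reference run 2.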

The publication also highlighted some other useful points to consider:

  • “Technical or biological replicates are necessary in a label-free workflow in order to obtain satisfying quantification when the number of files is more than a handful, due to missing values”
  • “To get satisfying results the user has to be familiar with the software workflow and parameter settings, rather than the underlying algorithms”
  • The other software packages under test required a lot of manual “tweaking” to achieve the best quality of results

Progenesis LC-MS has been developed to address these challenges:

  • Peak modelling for fast analysis of high numbers of replicates and multiple groups, with an analysis approach that produces a data set with NO missing values
  • Ease of use and a workflow that is quick to become familiar with, two things Progenesis LC-MS has become recognised for by users around the world, along with proven performance
  • An objective workflow that was run “parameterless” in the publication

Reference run selection is one of the few places in the workflow where a choice is required when you perform parameterless analysis. Hopefully this post and the publication can make that choice easier and less subjective for you.

If you want to see how parameterless analysis can save you time and effort, download Progenesis LC-MS today, or contact us and we’ll show you how to get the most from your label-free LC-MS data.

1. Sandin M, Krogh M, Hansson K, Levander F. Generic workflow for quality assessment of quantitative label-free LC-MS analysis. Proteomics. 2011 Mar;11(6):1114-24