Missing values: the Progenesis co-detection solution

In my last blog I described the problem of missing values in discovery omics analysis and how it adversely affects the statistics. Now I’ll describe the Progenesis co-detection solution to this problem.

First, a quick recap: the problem is caused by an inefficient workflow in which the feature ion signals are detected independently on each sample. This creates different detection patterns, even for technical replicates (the same sample run multiple times), so that matching the ions to ensure you are comparing ‘like with like’ across all samples becomes very difficult. This generates many “missing values” in the ion quantity matrix, on which multivariate statistical analysis is then performed in order to find the truly significant expression changes. The practical impact of those missing values is that a ‘like with like’ comparison is impossible for many features.

This means the multivariate statistics have to be applied to a restricted number of features, and as a consequence the analysis generates both false positives and false negatives. We examined the consequences of missing values in more detail in our blog post: Missing values: the hard truths.

Progenesis, however, takes a unique alternative approach to data extraction in which ion signals are essentially “matched” before detection takes place, by aligning the pixel patterns of the 2D ion maps (see figure below). This compensates for any retention time differences between samples. The pre-matched ions can then be co-detected so that a single detection pattern is created for all the samples in the experiment, resulting in 100% matching of ions and no missing values!

Here is a schematic of how Progenesis QI works:

Schematic of how Progenesis QI works

How does this approach help?

Well, let’s consider a comparison of two very similar samples from a small discovery omics experiment.

The traditional approach

Figure 1A below shows zoomed-in ion-map views of the same m/z / RT region from the two samples, so you can see how visually similar they are, allowing for some vertical retention time drift between them. In Figures 1B and 1C, you can see how the conventional (and inefficient) analysis workflow handles this task:

  • First, the feature ions are detected independently on each sample (1B).
  • Then, the detected feature ions are vertically aligned to compensate for the retention time drift and feature ions are “matched” between the samples using the mono-isotopic m/z and adjusted retention time as reference (1C).

Zoomed in ion-map views
The degree of ion matching between the samples is best shown by the arrow markers, which indicate ions that are present on one sample but not on the other. In fact, out of 108 ions detected on sample 1, 31 are not detected on sample 2, while 19 out of 98 detected on sample 2 are not detected on sample 1. This means that out of 129 unique ions detected across both samples, almost 40% are only detected on one sample and therefore generate a missing value in the data. What’s more, in addition to the 50 unmatched ions, there are more which are detected quite differently on the two samples in terms of their isotope numbers, chromatographic peak width, or both. In a real experiment with multiple samples in two or more groups, these detection differences increase the variance in quantitation of any ion across different samples within a single comparison group, making it more difficult to find true statistically significant differences between groups.

The Progenesis approach

Now let’s look at how Progenesis analyses the same data used in the traditional approach. In this case the first step is to align the signals on the ion maps by creating a series of alignment vectors, as shown in Figure 2B. You can see that the effect of this is to reduce two signal patterns (shown in purple and green in 2B(i) and 2B(ii)) to one. This single signal pattern (formed by aggregation of both samples) is then used for peak “co-detection” (2C), in which a single detection pattern is created that applies to both samples (2D).

Zoomed in Progenesis ion-map views

Using the same detection algorithm as in the conventional workflow, but co-detecting from an aggregated ion map rather than detecting individually on each sample, Progenesis has detected a total of 154 feature ions, all of which are detected in the same way on both samples. In a real experiment this increases the statistical power in the following ways:

  1. Co-detection generates a complete data matrix with no missing values, eliminating the need to filter out ions with too few real values or to impute model values, practices that can themselves lead to false positive or false negative results.
  2. By detecting each ion on all samples in the same way, co-detection minimizes variance in ion quantitation across samples in the same comparison group, making it easier to find true statistically significant differences between the groups.

In addition to the above benefits, co-detection also increases the sensitivity and reliability of ion detection by increasing the signal-to-noise ratio. Even with co-detection of just two samples, we can see this in the detection of 25 (= 154 - 129) ions that were not detected in either of the samples individually. As we co-detect from more samples, very faint and/or fragmented signals that cannot be reliably detected on individual samples, but are consistently present, will become more distinct and easily detected from the aggregated data.

Progenesis co-detection in action

Finally, let’s take a look at how the Progenesis co-detection workflow helps us to easily extract powerful statistical information from a 3 vs. 3 experiment that includes the two samples we’ve already looked at. The figure below shows quantitative data for two different ions extracted from the experiment, one in which a significant expression change is detected and another in which no change is detected. The figure also illustrates another powerful benefit of the co-detection workflow – the ability to visually confirm expression change results (p-values and fold changes) at the “raw data” level, a great way to increase confidence in your results!

Progenesis co-detection workflow

So, there you have it. The unique Progenesis QI workflow really does eliminate missing values at the analysis stage.

Would you like to try Progenesis QI on ALL your data? Download now and complete your analysis with confidence.

Identification scoring in Progenesis QI

One of the advantages of using Progenesis QI is its ability to combine results from multiple search methods and databases. Progenesis QI uses a common scale to score results from all the databases and search methods it supports, so you can compare search results obtained from different search methods. This post explains the scoring method we use in Progenesis QI, and how you can improve your search scores by searching additional dimensions of your data.

Progenesis QI search methods

At the time of writing, Progenesis QI supports these search methods and databases:

Progenesis MetaScope
Searches SDF and MSP files from any source. Supports retention time, CCS, theoretical fragmentation and spectral libraries.
METLIN batch metabolite search
Exports data for use with the METLIN batch search interface, and reads METLIN batch CSV files.
LipidBlast
Searches the LipidBlast MS/MS database provided by the Metabolomics Fiehn Lab.
Elemental composition
Produces putative formulae for compounds based on mass, isotope profile, and the Seven Golden Rules.
ChemSpider
Searches the ChemSpider structure database. Supports theoretical fragmentation, isotope similarity filtering, and elemental composition filtering.
NIST MS/MS Library (requires purchase)
Searches the NIST MS/MS library for spectral matches.

You can find out more about each of these search methods in the search methods and databases FAQ. This blog post, however, will focus on how we calculate scores so that identifications from different search methods can be compared.

The Progenesis scoring method

For any given search, there are five possible properties that can contribute to the overall score:

  1. Mass error
  2. Isotope distribution similarity
  3. Retention time error
  4. CCS error
  5. Fragmentation score

Each of these individual scores is on a scale from 0 to 100. If your search criteria do not include a given piece of data, the score for that piece of data is 0. The overall score is the mean of these five scores.

Note that the more search criteria you use, the higher the maximum possible score becomes, as described in the following example.


Suppose we have searched ChemSpider using theoretical fragmentation. For a given compound we find Identification A, with these scores:

Identification A Score
Mass error 95.2
Isotope distribution similarity 99.2
Retention time error 0
CCS error 0
Fragmentation score 87.1
Overall score 56.3

Note that the scores for retention time and CCS errors are 0, because ChemSpider does not support searching those properties.

If we then perform a MetaScope search, this time including a CCS constraint, we might obtain the following scores for Identification B:

Identification B Score
Mass error 95.2
Isotope distribution similarity 99.2
Retention time error 0
CCS error 94.1
Fragmentation score 87.1
Overall score 75.12

We have identical scores for the mass error, isotope distribution, and fragmentation. However, we also have an extra piece of information in the CCS score. This provides additional evidence for Identification B, so it is given a higher score than Identification A.

Note that in the ChemSpider case, if an identification scores 100 on all 3 items, it obtains a score of 60. In the MetaScope case, if an identification scores 100 on all items, it obtains a score of 80. So for each additional piece of data we include in our search, the maximum score increases by 20.
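The worked examples above follow directly from the five-way mean. Here is a minimal sketch reproducing them; `overall_score` is a hypothetical helper name for illustration, not part of the Progenesis QI software:

```python
def overall_score(mass_error, isotope_similarity, rt_error, ccs_error, fragmentation):
    """Mean of the five component scores; properties that were not
    searched simply contribute a score of 0."""
    components = [mass_error, isotope_similarity, rt_error, ccs_error, fragmentation]
    return sum(components) / len(components)

# Identification A: ChemSpider search (no retention time or CCS support)
print(round(overall_score(95.2, 99.2, 0, 0, 87.1), 2))     # 56.3
# Identification B: MetaScope search including a CCS constraint
print(round(overall_score(95.2, 99.2, 0, 94.1, 87.1), 2))  # 75.12
# Each additional search constraint raises the maximum score by 20 points
print(overall_score(100, 100, 0, 0, 100))                  # 60.0
```

Because unsearched properties score 0 rather than being excluded from the mean, adding a constraint can only raise an identification's ceiling, never lower it.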

The component scores

Here we’ll briefly describe how the five component scores that make up the final score are calculated.

Mass error, retention time error, and CCS error

These are all functions of the magnitude of the relative error, Δ:

Figure 1: The score profile for mass error, retention time error and CCS error.

For the mass error, Δ is the ppm mass error and N = 4000. For the retention time and CCS errors, Δ is the percentage error, and N = 20.
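The exact profile is defined by the curve in Figure 1, which is not reproduced here. As an illustration only, the sketch below assumes a Gaussian-style falloff, score = 100·exp(−Δ²/N); that functional form is an assumption on my part, and only the N values come from the text:

```python
import math

def error_score(delta, n):
    """Hypothetical score profile: 100 at zero error, decaying with the
    magnitude of the relative error. The Gaussian shape is an assumption;
    only the N values (4000 for ppm mass error, 20 for percentage RT and
    CCS errors) are taken from the text."""
    return 100 * math.exp(-(delta ** 2) / n)

print(error_score(0, 4000))            # 100.0 for a perfect mass match
print(round(error_score(5, 4000), 1))  # a 5 ppm mass error still scores highly
print(round(error_score(2, 20), 1))    # a 2% RT or CCS error scores lower
```

Whatever the true curve, the large N for mass error reflects that ppm errors are numerically tiny compared with percentage RT/CCS errors, so both scales end up comparably forgiving.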

Isotope distribution similarity score

This compares the intensities of each isotope between observed and theoretical distributions. A total intensity difference of 0 gives a score of 100, which falls linearly to 0 when the total intensity difference is equal to the maximum isotope intensity.
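That linear relationship can be sketched directly from the description. This is a sketch only; `isotope_similarity_score` is a hypothetical name, and the intensities are assumed to be on a common normalised scale:

```python
def isotope_similarity_score(observed, theoretical):
    """Score 100 when observed and theoretical isotope intensities match
    exactly, falling linearly to 0 when the total intensity difference
    equals the maximum isotope intensity."""
    total_diff = sum(abs(o - t) for o, t in zip(observed, theoretical))
    max_intensity = max(max(observed), max(theoretical))
    return max(0.0, 100 * (1 - total_diff / max_intensity))

# A perfect match scores 100; mismatches reduce the score linearly
print(isotope_similarity_score([100, 45, 11], [100, 45, 11]))           # 100.0
print(round(isotope_similarity_score([100, 40, 11], [100, 45, 11]), 1)) # 95.0
```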

Fragmentation score

The fragmentation score is more complicated and depends on the fragmentation method used. The FAQs describe how scoring works for theoretical fragmentation and database fragmentation.

Improving identification scores

The best way to improve the scores of your identifications and your confidence in them is to use more search constraints.


In general, most searches will be able to produce a mass error score and an isotope similarity score. With just these two pieces of information, the maximum score for any identification is only 40/100. In this example we’ve identified Warfarin using only mass error and isotope similarity.


Including fragmentation data in your search criteria (either theoretical fragmentation or a fragmentation database) increases the possible score for identifications to 60/100. Here we’ve added theoretical fragmentation to our search parameters.


Finally, if you use an appropriate data source (e.g. an SDF and additional properties file) you can add search constraints for retention time and CCS, giving a maximum score of 100/100. Here we don’t have CCS information, but have added retention time to our search parameters for a maximum of 80/100.

Future improvements

Currently Progenesis gives equal weight to the five component scores – mass error, isotope similarity, fragmentation score, retention time error, and CCS error. In some cases this might not be ideal, so if you have any suggestions for different weightings we’d love to hear from you in the comments section below.

As always, if you have any further questions, check our FAQ or get in touch.

Missing values: the hard truths


“I could burst into tears… I spent weeks of time and effort on sample collection, instrument optimisation, sample running, and data generation from a very expensive LC-MS setup that has the resolving power to find tiny but significant differences between my conditions, only to find that my data analysis led me down a dead end with many false positives. I’m certain that real significant differences are in that dataset, but my analysis workflow just isn’t picking them up… What will I tell my boss?”

Many know how real that scenario is, but what are the possible consequences of missing values in your data?

  • False negatives – are you missing true positive results, as in the example above?
  • Time spent on researching false positives – does this slow down department research progress?
  • Wasted money on investigating false positives – how easy is it to then get follow up funding?
  • Acceptance of research results – is your journal of choice going to accept data with missing values?
  • Reduced return on investment – do your research results bring value from either publications or commercial gain?

Hopefully you’ve been lucky and the impact was not as drastic as the above; however, those outcomes are all very possible.

Frustrated researcher by his mass spec with his head in his hands

In our ‘Back to basics’ blog post, we mentioned a number of the problems that missing values can cause, such as:

  1. Reduced effectiveness of statistical analysis techniques
  2. Misleading statistics (impacting false positive and false negative results)
  3. Problems with multivariate statistics as observed by the visualisations (such as PCA)

Before we move on to the issues that arise from the various methods of imputing missing values in your data matrix, let’s remember how many data points are actually affected:

  • Missing values are commonly reported as occurring at rates of approximately 20%[1,2] and affecting up to 80% of variables[2] in LC-MS data.
  • “Further investigation of the peaks detected as significantly different between biological groups showed that substantial proportions of these peaks were comprised of those which initially had missing data.”[2]

Let’s expand upon the problems mentioned above, but first a quick note on terminology: Hrydziuszko and Viant refer to missing data where we refer to missing values. Missing values have also been referred to as ‘missing or lost data points’. To simplify the language, we will continue to refer to missing values, except in the case of citations.

A – Reduced effectiveness of statistical analysis techniques

To have confidence in conclusions drawn from comparative results between conditions, we look to statistical power. In the ‘Missing values: what’s the problem?’ blog post, Dr. Goulding described how increasing the number of biological replicates should stabilise the inherent differences between biological samples and so increase statistical power. However, additional samples bring additional missing values into the experiment’s quantitative results, which limits the hoped-for increase in statistical power. In other words, much of the gain in statistical power from adding biological replicates is thrown away by the incorporation of even more missing values, and with low statistical power we will probably miss real experimental differences.

B – Misleading statistics

Typically, 1 in 5 data points are missing; can’t we just impute them with one of the many existing methods? Yes, but not without the danger of leading your investigation completely astray. Can we measure the impact of imputation on our significant discovery rate? Hrydziuszko and Viant evaluated[2] the impact of eight imputation methods; they found:

  • “Different imputation methods ultimately yielded quite diverse data analysis outcomes. Specifically the number of peaks identified as significantly different between groups varied considerably between the eight estimation methods”.
  • “It is quite possible that when an inappropriate missing value estimation method is used we may not only lose the knowledge of which peaks are significant or not, but we may introduce further bias by identifying non-significant peaks as significantly different between groups.”
  • “Overall the results presented here provide substantial evidence that the choice of missing value estimation method has a substantial effect on the outcome and interpretation of univariate statistical analysis.”

C – Problems with multivariate statistics as observed by visualisations (such as Principal Component Analysis (PCA))

Complications with useful multivariate statistics caused by missing values were also seen when Hrydziuszko and Viant[2] looked at the application of PCA. They found differences similar to, and in fact larger than, those discussed above for univariate analysis: depending on which imputation method was used, the PCA plots varied from 3 distinct separations to no separation between the 3 conditions. This suggests that multivariate analysis could be even more sensitive to imputation.

What can be done about this serious skewing of data?

Hrydziuszko and Viant[2] concluded their paper with “a three step process recommended in order to determine optimal method selection for missing value estimation for a given dataset and analytical platform that includes: assessing the nature of the missing data, analysing the impact of the missing data treatments on the final data analysis outcome, and analysing the performance of missing data algorithms on the ‘complete’ dataset if available.”

The proposed approach is very thorough, but it is time-consuming, and each dataset must be assessed on its own basis, with the three steps repeated for each new study.

There is another way!

“Co-detection in Progenesis is a very attractive feature that we have come to trust. We do a lot of follow up Western blots to reproduce our quantitative proteomics findings and the blots do a great job of building confidence in the co-detection feature and the Progenesis system as a whole. We’ve had success reproducing changes in protein expression levels observed in Progenesis using the traditional Western approach. That’s extremely important since a lot of reviewers do not speak mass spec but love to flaunt how fluent they are in the language of Western blot.”

Paul Langlais, Mayo Clinic Arizona, Arizona, USA

Progenesis offers you the ability to take your data straight from your mass spectrometer and find the significantly changing compounds or proteins in your dataset without any of the problems associated with missing values. As long as the differences are present in your data file, Progenesis will maximise your chances of picking them up. Working on data with no missing values improves the experimental specificity, sensitivity and therefore reproducibility of your research, allowing you to quickly and, more importantly, confidently, quantify the compounds or proteins of interest.

An issue Dr. Goulding discussed in ‘Missing values: what’s the problem?’ was data filtering, in which ions whose matching rates fall below predefined thresholds are discarded. With the Progenesis co-detection approach, you do not have to worry about this, as you will always be comparing the same features across all the samples in your dataset. This means that when you use multivariate statistical visualisations such as PCA, you will have an unfiltered and therefore unbiased data representation. In turn, this gives you the ability to QC your data by assessing for outliers, to ensure the differences you see in your data are real.

PCA plot from Progenesis showing scores and loadings for all variables

Dr. Goulding will be following up soon on how Progenesis achieves no missing values; in the meantime why not download Progenesis QI and speak to one of our specialists about analysing your own data? You too can have peace of mind that the conclusions you are submitting are based on the unbiased analysis of a complete data matrix.


1. Gromski et al, “Metabolites” 2014, 4, 433-452

2. Hrydziuszko and Viant, “Metabolomics” 2012, 8, Supplement 1, 161-174

Come and see us at ProteoMMX 4.0!

Two years ago, I attended my first conference for Nonlinear, ProteoMMX 3.0. ProteoMMX 4.0 is fast approaching (5th – 7th April, at The Queen Hotel, Chester, UK) and I’m excited to say I’m lucky enough to be attending again. While this will be my second time at ProteoMMX, and one of many conferences I’ve been to, this will be the first for my Guide Dog, Winston.

Photograph of Chester Cross, courtesy of Matty Ring Chester is one of the best preserved medieval walled cities in the UK. (Photograph courtesy of Matty Ring.)

As with previous years, the conference will be preceded by the Quant 4.0 event. Quant 4.0 is a quantitative proteomics and data analysis training course, partly intended as a good introduction to the field for any newcomers, before attending the more detailed lectures at ProteoMMX. Agnès, one of my colleagues, will be attending this event (as well as ProteoMMX) and will be delivering a session on Progenesis QI for proteomics on the Tuesday morning.

Once again, ProteoMMX is offering great opportunities for early career researchers to deliver presentations, as several slots have been reserved for short presentations based on elevated abstracts. There will be plenty of talks from more experienced researchers too – you can check out the full programme on the ProteoMMX 4.0 website.

As well as having the opportunity to hear about current work in this field, I’d like to encourage you to come and speak with us – we’ll be on the Waters table when not in lectures, but I should be easy to spot as I’ll be the one with a golden retriever in tow. Book an appointment now to guarantee yourself a timeslot – we’d love to speak with you, whether you’re already a Progenesis user, or are interested in a demo. We hope to see you soon! Smile

Missing values: what’s the problem?

Missing values are a major problem in LC-MS based discovery ‘omics analysis and could be the difference between a successful research project and a failure. Whether you run a 3 vs. 3 experiment on a model biological system or a much larger clinical study, missing values will adversely affect the results; some expression changes which are actually present in your data, will be missed. But why is this? How and why are missing values generated and how do they affect the results? To clarify, let’s look at how discovery ‘omics analysis works.

Biological “noise” and statistical power

In discovery ‘omics we’re looking for differences in the relative quantities of analytes between two or more groups or conditions, such as control vs. treated or healthy vs. diseased. But in all biological systems there is inherent biological variation caused by both genetic and environmental factors, so that the relative quantity of any given analyte ion will vary across samples from different specimens within a given condition. This can be further complicated in clinical studies where there is no control over external factors such as diet, fitness etc. which contribute towards the final phenotype.

This inherent variation can be seen as biological “noise” which we must cut through in order to find the consistent condition-related differences we’re looking for. To do this, we must run multiple biological replicate samples from different specimens of the same species and condition and compare the resulting sample groups to find the analyte ions that are displaying statistically significant differences between conditions. The more biological replicate samples we run, the greater the ability or statistical power of our experiment to find the significantly changing ions.

Missing values

The relationship between the number of replicate samples and the statistical power of the experiment is strongest when data from all the runs is available for statistical analysis. However, due to limitations in the conventional workflow used by most analysis software, this is usually not the case. In the conventional workflow the ions are first detected, one sample at a time, and the detected ions are then “matched” across the samples so that the same ion is being compared between samples. This approach often results in different patterns of detection and different numbers of ions detected on each sample. Ions that are detected on one or more samples may not be detected on others for reasons such as:

  • The ion is actually not present, or is below the limits of detection, in those samples.
  • The ion may be “missing” due to some instrument error or ionisation issue.
  • The ion may be detected differently due to fragmentation of signal, failure to detect the monoisotopic peak, differences in chromatography, or other reasons (see figure below).

Image showing differences in detection pattern for technical replicate samples when co-detection isn't performed

For the above reasons, when we create a matrix of ion quantities for statistical analysis, we often find a number of gaps in the data where ions detected on one or more samples could not be found on others. We call these “missing values”, four of which can be seen in the data from a simple 3 vs. 3 experiment shown below.

Table showing missing values in abundance measurements

Missing values, which are commonly reported as occurring at rates of approximately 20%[1,2] and affecting up to 80% of variables[2] in LC-MS data, decrease the statistical power of our experiment by reducing the number of values available for statistical analysis. What’s worse, the probability of missing values occurring increases with the number of biological replicate samples, so we find that we have to run many more samples in order to gain a relatively small increase in the statistical power of our experiment. This is a fundamental problem in discovery ‘omics analysis, since we may then miss expression changes that are actually present and waiting to be found in our data.

Since running more samples only increases the chances of encountering missing values, and samples are not in endless supply, the only real solution is to find a better way to analyse the data.

What is the solution?

The solution used in many ‘omics discovery analysis workflows is a combination of data filtering and data imputation. First, the workflow may ask you to define a threshold % of values for each ion; so if you set a 60% threshold for a 10 vs. 10 experiment, all ions with more than 4 missing values in a group are eliminated from the analysis. If these include the potential biomarkers you’re looking for, you’ll miss them entirely!

For the remaining ions, any missing values are replaced with imputed model values, the most commonly used option being the mean of the values present, while zeros are also used in some workflows. These approaches are dangerous, since statistics must take full account of both the mean and the variance of the data in each group being compared: the former approach may produce false positives by reducing the variance, while the latter may produce false positives or negatives by skewing both the mean and the variance.

Another way to see the invalidity of these approaches is to imagine using them in other types of statistical experiment. Let’s say we’re measuring the effect of a high salt environment on the mature height of a certain type of plant. You plant 10 seedlings in high salt and 10 in low salt soil, but due to a problem with the automatic irrigation system, 4 of the low salt plants receive no water and die. Is it valid to insert the mean height of the other 6 low salt plants as a model value for the 4 missing plants? You would essentially be creating 4 virtual plants while artificially reducing the variance in the low salt group.
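The plant analogy is easy to make concrete. In this sketch (the heights are invented for illustration), mean-imputing the four dead plants leaves the group mean untouched but artificially shrinks the variance, which is exactly how mean imputation inflates false positives:

```python
import statistics

# Heights (cm) of the 6 surviving low-salt plants (illustrative numbers)
measured = [52.0, 48.5, 50.2, 53.1, 47.8, 49.9]
group_mean = statistics.mean(measured)

# Mean imputation: create 4 "virtual plants" at exactly the group mean
imputed = measured + [group_mean] * 4

print(statistics.mean(imputed) == group_mean)   # True: the mean is unchanged
print(round(statistics.variance(measured), 2))  # sample variance of the real data
print(round(statistics.variance(imputed), 2))   # artificially reduced variance
```

The imputed values contribute zero squared deviation while increasing n, so any downstream test sees a group that looks more consistent than it really is.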

Real plant measurements vs. imputed plant measurements

Would this be considered an optimal way to perform the statistics? Of course not, yet it’s precisely what is done in some discovery ‘omics analysis workflows and has become so routine that it often isn’t even mentioned in publications[1].

How will this affect your work?

Missing values can have profound implications for your research projects. False negative results are serious enough with the potential to miss important biomarker candidates, but false positives may be even worse, leading to much wasted time, effort and resource investigating false biomarker candidates. Furthermore, it’s worth repeating that all along, the evidence of real biomarker candidates is actually there in your data and waiting to be found. At the outset you carefully design the experiment, select subjects and prepare samples which you run on the best equipment available after painstakingly optimising the running conditions – only to find your results underwhelming or even worse, misleading! So how do you ensure that missing values don’t jeopardise your chances of research success?

The Progenesis co-detection solution

The handling of missing values has been described in the literature as “an absolutely vital step in data pre-processing” and one “to which special consideration should be given”[1]. However, what if the problem of missing values did not exist? What if there was a workflow for LC-MS discovery ‘omics analysis that eliminated missing values and maximised statistical power in experiments of any size, without resorting to dubious data imputation practices? This is the Progenesis co-detection solution, and I’ll be telling you how it works in my next blog post.

1. Gromski et al, “Metabolites” 2014, 4, 433-452

2. Hrydziuszko and Viant, “Metabolomics” 2012, 8, Supplement 1, 161-174

Why do people buy Progenesis QI when there is freeware available?

It’s an interesting question and there are many of our users out there with various answers. We decided to ask our users some questions about why they bought Progenesis QI and what difference it has made to their research. Here’s what Research Professor Jace W. Jones had to say on the matter:

Please can you briefly describe your area of research?

Jace operating the Synapt G2-S

Our research involves development of mass spectrometry-based platforms that couple biomarker discovery to quantitative validation, from circulating and tissue lipids. In particular, the use of high resolution tandem mass spectrometry to structurally elucidate, identify, and quantify biologically active lipids to further understand disease/injury mechanisms of action and provide insight for drug development targets. To this end, we first design untargeted liquid chromatography tandem mass spectrometry (LC-MS/MS) experiments to identify differentially expressed plasma and tissue-bound lipids using in vivo models. Our discovery–based instrument platform of choice is the Waters UPLC coupled to a Synapt G2-S operated in HDMSE acquisition mode. Our typical LC conditions elute lipids over a 20-minute gradient using a UPLC C18 column. The HDMSE data is acquired in both positive and negative ion modes. Experimental parameters vary depending on the particular in vivo model under study but involve multiple biological replicates per condition, per time point. In addition, quality control samples and addition of internal standards are standard operational procedure. The resulting output from this type of workflow is a tremendous amount of analytical data per sample that ideally generates a list of identified lipids that are differentially expressed between the conditions under study.

What problems did you experience prior to using Progenesis?

The data generated from the UPLC-HDMSE workflow is highly complex and results in thousands of m/z values being characterised by a number of analytical parameters, such as retention time, drift time, accurate mass precursor ions, and diagnostic product ions. In order to expedite biomarker discovery and fully utilise the multidimensional data generated on the UPLC HDMSE platform, we realised there was an immediate need for a bioinformatics solution that could efficiently process multidimensional datasets.

What made you convert to Progenesis QI?

We decided to go with Progenesis QI for its ability to handle multidimensional datasets, especially HDMSE workflows. In addition, a primary goal with our discovery/untargeted mass spectrometry experiments is to generate lipid markers that can then be pipelined for targeted, high-throughput assays. Progenesis QI is an efficient bioinformatics solution that allows us to make the transition from discovery to validation. The ability to process multi-vendor data was also a major selling point.

What difference has Progenesis QI made to your research?

Progenesis QI enables us to efficiently process multidimensional lipidomic datasets in a systematic and straightforward manner. We can also now process HDMSE data on a single software platform.

One of the biggest differences we have seen is our ability to incorporate more biological replicates in a single analysis, including multiple time points and conditions. This gives us the ability to bolster our statistical significance and conduct experiments where we can evaluate potential biomarkers across time and across varied conditions.

Please can you give a specific example of the success that Progenesis QI has helped you to achieve?

Progenesis QI has enabled us to expand our lipidomic workflow while increasing the amount of analytical data per sample. Because our data processing has been streamlined with Progenesis QI, we now spend more time on optimizing chromatography (e.g. orthogonal column chemistries) and mass spectrometry acquisition (e.g. ion mobility with tandem mass spectrometry) for more confident lipid identification.

How will it help you in your future research?

The demand for lipidomic experiments from not only our existing collaborators but also from outside researchers has grown steadily over the past couple of years. Progenesis QI has enabled us to keep pace with that demand by allowing us to efficiently and confidently process multidimensional lipidomic datasets. This, in turn, expedites the experimental process of generating potential lipid biomarker candidates.

What advice would you give to a metabolomics/lipidomics scientist struggling with similar problems?

The amount of data generated by metabolomic/lipidomic workflows means a tremendous reliance on data processing. Often, the data processing aspect of ‘omics data is time-consuming and beyond the expertise of the scientist performing the experiments. Consequently, having a bioinformatics solution that is efficient, versatile, and reliable is a valuable investment and allows researchers to focus on optimization of their experimental approach and validation studies for potential targets. I highly recommend the use of Progenesis QI as your bioinformatics solution.


If you are a Progenesis QI user and would like to tell us about your research, please contact us – we’d love to hear from you.

6 ways Progenesis QI can help with your compound identification

Visualisation of the results from a theoretical fragmentation search, as done in MetaScope

One of the biggest challenges in metabolomics is compound identification – it’s a topic that comes up continually, and something we at Nonlinear HQ are constantly trying to help with. The recent releases of Progenesis QI have focussed on improving the process of compound identification, but do you know just how many tools are available within the software?


MetaScope

MetaScope is a tool unique to Progenesis QI, and is the most versatile identification plugin we offer. It can be used to perform a neutral mass search, but also allows searching using retention time and CCS values.

For the neutral mass search, MetaScope can read libraries in SDF, CSV, XLS or XLSX format, to give flexibility in the source of your chosen libraries. The ability to search SDFs means you can make use of publicly available libraries, such as HMDB. Thanks to Progenesis SDF Studio, you can customise existing databases by merging multiple files, removing entries or fixing errors. Having the option to search from a CSV / Excel file means you can make your own library without the need to construct an SDF.
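To illustrate the CSV option, here’s a minimal sketch of building a small compound library in Python. The column names and the retention-time/CCS values below are purely illustrative assumptions, not the exact headers Progenesis QI requires – check the MetaScope documentation for the expected format.

```python
import csv

# Hypothetical compound library entries:
# (name, formula, monoisotopic neutral mass, retention time in min, CCS)
# The numeric RT and CCS values here are made up for illustration.
compounds = [
    ("Glucose",    "C6H12O6",    180.06339, 1.2, 140.1),
    ("Tryptophan", "C11H12N2O2", 204.08988, 3.4, 151.9),
]

# Write the library as a CSV file; column headers are illustrative only.
with open("my_library.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["Name", "Formula", "NeutralMass", "RetentionTime", "CCS"])
    writer.writerows(compounds)
```

The same rows could equally be saved as XLS/XLSX from a spreadsheet; the point is simply that a flat table of compounds is enough to start searching, with no SDF construction needed.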

MetaScope can also make use of fragmentation data by searching a fragment database or by doing theoretical fragmentation. If you’d like to build your own fragment database, Progenesis QI can help you do that too.


ChemSpider search

ChemSpider is a web-based chemical structure database with access to over 32 million structures from hundreds of data sources. This tool makes use of ChemSpider’s web services, automatically exporting data from Progenesis QI to ChemSpider for searching according to the parameters you select, importing the results, and assigning them against the correct compounds within the software.

As well as being able to define which of the 600+ libraries to search from, and set parameters for precursor tolerance, you can also perform theoretical fragmentation on the search hits, and filter the search by elemental composition and isotope similarity score. Just as for MetaScope and the elemental composition tools, parameter sets can be saved for use with subsequent experiments.


METLIN

Don’t have access to your own library and don’t want to download one? You can make use of METLIN, a metabolite database containing over 240,000 compounds. METLIN, developed by the Scripps Center for Metabolomics, provides information on names, formulae, theoretical masses, and a link to a webpage detailing identifiers for the compound on various other databases, such as KEGG and HMDB.


LipidBlast

LipidBlast is a computer-generated MS/MS database produced by the Fiehn Lab. Since theoretical fragmentation searching can be unsuitable for lipids, due to the specificity of their bond breakages, LipidBlast is a useful alternative.

Elemental Composition Estimation

When you can’t find a database match for your compounds, it may be useful to see the theoretical molecular formulae that match the measured masses and isotope distributions. This tool can also help you to filter down a set of potential hits retrieved from a database search.

Progenesis QI has 3 pre-defined parameter sets: small molecules, lipids, and CHNO (optimised for simple organic compounds), but you can also create your own, which can be saved for future use.

Once you have the theoretical formula, you can search this manually in online databases such as PubChem to return potential IDs.
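To make the idea behind elemental composition estimation concrete, here is a minimal brute-force sketch – not the algorithm Progenesis QI itself uses – that finds CHNO formulae matching a measured neutral mass within a ppm tolerance. The atom-count limits are arbitrary assumptions for the example.

```python
# Monoisotopic masses of the CHNO elements.
MASS = {"C": 12.0, "H": 1.0078250319, "N": 14.0030740052, "O": 15.9949146221}

def candidate_formulae(neutral_mass, tol_ppm=5.0, max_atoms=(40, 80, 10, 10)):
    """Return (C, H, N, O) counts whose mass lies within tol_ppm of neutral_mass."""
    tol = neutral_mass * tol_ppm / 1e6
    max_c, max_h, max_n, max_o = max_atoms
    hits = []
    for c in range(1, max_c + 1):
        for n in range(0, max_n + 1):
            for o in range(0, max_o + 1):
                base = c * MASS["C"] + n * MASS["N"] + o * MASS["O"]
                # Solve for the hydrogen count directly rather than looping.
                h = round((neutral_mass - base) / MASS["H"])
                if 0 <= h <= max_h:
                    if abs(base + h * MASS["H"] - neutral_mass) <= tol:
                        hits.append((c, h, n, o))
    return hits

# Glucose has monoisotopic mass ~180.06339; C6H12O6 should appear among the hits.
print(candidate_formulae(180.06339))
```

A real implementation would also score candidates against the observed isotope distribution and apply chemical plausibility rules (e.g. ring/double-bond equivalents), which is why several formulae can match a single mass within tolerance.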

NIST MS/MS Library

The NIST MS/MS library search plugin bundles the NIST 14 LC-MS/MS libraries and performs a combination of neutral mass and MS/MS-based searches. This can provide a higher degree of confidence than using theoretical fragmentation alone, and saves the time spent creating your own MS/MS library.

Please note that this plugin comes at an additional cost – please contact us for more information.

What next?

We’re always looking for more ways we can improve the identification process, so if there’s a tool you’d like us to link up with, get in touch.

Season’s Greetings from all at Nonlinear!

It’s that time of year again when we close the office for the festive period, and we’d like to take a moment to wish everyone a Merry Christmas and a Happy New Year.

It’s been another busy year for us, with a few highlights worth mentioning:

  • We released 2 updates to Progenesis QI, with v2.0 released in March and v2.1 following in August.
  • We brought out a brand new product, which is FREE to download and use: Progenesis SDF Studio, releasing v1.0 in July following some great feedback to the Beta release of v0.9 in April.
  • We’ve been to conferences all over the world, including the Czech Republic, Canada, various states of the USA and Germany.
  • We welcomed back Gavin Hope to Nonlinear, who re-joined us as a software developer back in September.
  • We also gained a new member of the team, who is proving to be possibly our most popular “employee” yet: Winston the Guide Dog, who is my service dog:

Winston the Guide Dog

The office will be closed from Christmas Day, reopening on Monday 4th January.

Barking up the right tree: characterising Garcinia buchananii extracts with Progenesis QI

It’s often said that plants are a rich source of dietary supplements, medicines, and other useful bioactive phytochemicals. Many traditional remedies are derived from plants, but often from one specific part of a plant, or via a historical means of preparation. How, then, do we know if this is the best method of obtaining the target compounds? Are they the ‘best’ compounds that plant has to offer? How do different parts of the plant differ from each other as sources of bioactive metabolites? The answers to these questions could help both to obtain better yields of such compounds, and to assess whether there is real medical benefit on offer.

Progenesis QI, which we think is a versatile piece of software, is beginning to assist this process, and it turns out that two of its strengths are key to this. Firstly, the ability to rapidly quantify and effectively identify compounds in complex metabolomes; secondly, integrated statistics that allow rapid and robust discovery of biological changes between samples.

These strengths have been brought to bear on Garcinia buchananii, the source of a traditional sub-Saharan African remedy for diarrhoea that has also been claimed to represent a rich source of antioxidants – specifically in its stem bark. However, Dr Timo Stark at Technische Universität München decided to pose several questions: was bark extract truly providing the ‘best’ antioxidant activity? If not, which part of the tree would represent the best source of bioactive antioxidant compounds? And how did leaf, root, and stem bark extracts differ from each other in their metabolite profiles?

To do this, he analysed G. buchananii extracts from those sources comprehensively, using a Waters technology workflow of an Acquity UPLC coupled to a Synapt G2-S operated in HDMSE mode. This generated a vast array of metabolite data, carrying those twin challenges of identification – always a bottleneck in metabolomics – and accurate, quantitative statistical analysis. However, with Progenesis QI, these need not be intimidating. Our quantify-then-identify co-detection approach with no missing values, multivariate statistical visualisations that can reveal subtle co-ordinated trends in data, a flexible and comprehensive range of identification approaches, and user-friendly OPLS-DA (discriminant analysis) via an optional integrated analytical package (EZinfo 3.0, Umetrics) all combine to make complex analyses much more straightforward. Dr Stark was able to rapidly determine the organs richest in known, literature-corroborated antioxidants, differentiate the profile of antioxidants and other compounds associated with each organ, and identify several antioxidant species novel to G. buchananii. In the course of one study, a great deal was revealed about the bioactive profile of the plant.

As Dr Stark put it:

“With Progenesis QI we were able to analyse data in a reasonably short time that had previously proved too difficult to analyse. Progenesis is summarizing and illustrating the data, there are direct links to online databases, fragmentation tools can help to verify/identify compounds. It is straightforward.

The power and speed of Progenesis analysis means we can not only get better results from existing experiments but can also analyse larger experiments with more biological replicates to further improve quality of results. Faster hints on compound identification.”

Figure 1

Figure 1. Progenesis QI allowed the detection of antioxidant compounds enriched in particular G. buchananii tissues; in this case, (2R,3S)-morelloflavone in leaf.*

I won’t reiterate the full details of his paper and results here, as there is a better option! Dr Stark himself is presenting a webinar where he will describe his work with Garcinia buchananii and Progenesis QI, and his discoveries, on December the 9th (08:00 PST / 11:00 EST / 16:00 GMT / 17:00 CET) and I would encourage you to register for what promises to be a very interesting presentation. In preparation for that, why not have a read of his paper yourself?

Enjoy the webinar, and if you would like to hear more about how Progenesis QI can assist and improve your own metabolomics studies, please do get in touch.

* Reprinted (adapted) with permission from Figure 5 (B), “UPLC-ESI-TOF MS-Based Metabolite Profiling of the Antioxidative Food Supplement Garcinia buchananii”, Timo D. Stark, Sofie Lösch, Junichiro Wakamatsu, et al. Journal of Agricultural and Food Chemistry 63:7169-79; DOI: 10.1021/acs.jafc.5b02544. Copyright 2015 American Chemical Society.

Hi-N Quantitation For Clinical Discovery Proteomics

Progenesis QI for proteomics provides untargeted absolute quantitation of all identified proteins via the Hi-N method. This post explains the method and how it can be a useful tool for discovery proteomics in a clinical setting.

What is Hi-N?

Graph from Silva et al. (2006)

Hi-N is a label-free quantitation method allowing absolute quantitation of all identified proteins in a sample, using just a single un-labelled internal standard. Other literature variously describes the method as Top3 or Hi3.

The method relies on a discovery made by Silva et al. (2006) that the average integrated signal intensity of the top 3 most intense tryptic peptides is proportional to the absolute amount of a protein present in a sample. Their graph (right) shows a linear relationship over 2 orders of magnitude for 6 proteins on 6 samples (R2 = 0.9939).

The Hi-N method in Progenesis QI for proteomics chooses, for each protein, the N peptides with the highest abundance and averages their abundances to produce a Hi-N measurement. The number of peptides to consider (N) is configurable, but defaults to 3 as per the majority of literature. By incorporating a known amount of a single internal standard to your samples, the absolute amount of all other proteins can then be calibrated:

Absolute amount of protein A = (Hi-N value for protein A) / (Hi-N value for internal standard) × (absolute amount of internal standard)
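In code, that calibration amounts to just a few lines. The peptide abundances and spiked amount below are made-up numbers purely for illustration:

```python
def hi_n(peptide_abundances, n=3):
    """Average of the N most abundant peptides for a protein (default N=3)."""
    top = sorted(peptide_abundances, reverse=True)[:n]
    return sum(top) / len(top)

# Hypothetical peptide abundances for an internal standard and a target protein.
standard_peps = [9.0e5, 8.0e5, 7.0e5, 1.0e5]
target_peps   = [4.5e5, 4.0e5, 3.5e5, 2.0e5, 0.5e5]

standard_amount_fmol = 100.0        # known spiked amount of the standard
standard_hi3 = hi_n(standard_peps)  # mean of the top 3: 8.0e5
target_hi3   = hi_n(target_peps)    # mean of the top 3: 4.0e5

# Absolute amount of target = (Hi-N target / Hi-N standard) * amount of standard
target_amount = target_hi3 / standard_hi3 * standard_amount_fmol
print(target_amount)  # 50.0 fmol
```

Note that proteins with fewer than N identified peptides simply average whatever peptides are available, which is one reason quantitation is most reliable for proteins with at least N confidently identified peptides.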

How well does it work?

The principle that the method relies upon (i.e. that the average abundance of the top 3 peptides is proportional to the absolute amount of protein) has been verified by a number of studies, using different instruments and data collection techniques:

The relationship between the average MS signal response of the three most intense tryptic peptides and the absolute quantity of protein can be immediately inferred from the relative ratio of the average MS signal responses. The relative ratios of the average MS signal responses are proportional to the absolute quantities of each protein present in the sample.

Silva et al. (2006) [Waters Q-TOF with LCMSE]

We show that only the Top3 method is directly proportional to protein abundance over the full quantification range and is the preferred method in the absence of reference protein measurements.

Ahrné et al. (2013) [Thermo Orbitrap]

Fig. 2 shows a linearity between the average of the three most intense MS signals of tryptic peptides of one protein and the protein abundance.

Grossmann et al. (2010) [Thermo LTQ-FT-ICR]

Further studies have shown the method to have a good dynamic range, high reproducibility and excellent correlation with typical clinical quantitation methods such as routine immunoassays:

The dynamic range of protein abundances spanned four orders of magnitude. The correlation between technical replicates of the ten biological samples was R2 = 0.9961 ± 0.0036 (95% CI = 0.9940 – 0.9992) and the technical CV averaged 7.3 ± 6.7% (95% CI = 6.87 – 7.79%). This represents the most sophisticated label-free profiling of skeletal muscle to date.

Burniston et al. (2014)

One of the key factors required for accurate quantification is high reproducibility of abundance (intensity) measurements. The abundance coefficient of variation (CV) was calculated for all detected peptides in the three data sets (Fig. 6). The average CVs were 0.08 ± 0.1, 0.26 ± 0.09, and 0.18 ± 0.09 for the 4-protein mixture, serum, and tissue data sets, respectively (mean±standard deviation).

Levin et al. (2011)
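The CV figures quoted in these studies are simply the standard deviation of replicate abundance measurements divided by their mean. A minimal sketch, with made-up replicate values:

```python
import statistics

def cv(values):
    """Coefficient of variation: sample standard deviation over the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical abundances of one peptide across four technical replicates.
replicates = [1.00e6, 1.05e6, 0.95e6, 1.02e6]
print(f"CV = {cv(replicates):.3f}")  # ~0.042, i.e. about 4% variation
```

A low CV across technical replicates is what makes small between-condition abundance changes statistically detectable.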

Our study demonstrates that LCMSE allows reproducible untargeted quantitation of abundant plasma proteins. It gives fair to excellent correlation with immunoassays, and is achieved at low setup costs, without costly isotope-labelled standards used in targeted proteomics approaches. Reasonable variability compared to these targeted-approaches also gives confidence with regard to using this method.

Kramer et al. (2015)

This high correlation with the “gold-standard” of immunoassays suggests that discoveries made using Hi-N will transfer well to further validation studies using targeted methods such as MRM or immunoassays. This makes it a good candidate for quantitation of large numbers of proteins in clinical discovery proteomics.

How do I use it?

By default, Progenesis QI for proteomics provides relative Hi-N values calculated without an internal standard. This provides you with an abundance measure that is proportional to the absolute amount of protein in your samples, without any additional processing or sample preparation steps.

To obtain absolute measurements for all proteins in your experiment, you simply need to add a known amount of internal standard to each sample. Then it’s a simple case of telling Progenesis the accession and amount of your internal standard added. Progenesis will automatically re-calibrate your abundance values to provide absolute measurements (in fmol). So with just the addition of a single internal standard, you get absolute quantitation of all proteins in your sample effectively “for free” – with no additional analysis steps.

Protein quant options in the automatic processing tool Protein quantitation options in Progenesis QI for proteomics

You can configure your internal standard (referred to as “calibrant” in Progenesis QI for proteomics) either in the automatic processing set-up wizard, or later in the workflow when reviewing your identified proteins.

Why should I use it?

The label-free Hi-N method provides quantitative precision similar to labelled methods, without the greater expense, preparation time and variability the labelling process brings. The label-free approach is applicable to any kind of sample, in comparison to some labelled approaches – not all labelling methods are applicable in all scenarios, and in some methods only a subset of proteins are actually labelled.

Quantitative measurements in label-free proteomics have typically only allowed for relative “cross-run” comparison. Such measurements can only be validly compared for a single protein across runs. The linearity of the Hi-N method allows, in addition, comparison between proteins in the same run, providing much more information about the relative amounts of different proteins in your samples.


In conclusion, the Hi-N method provides a useful tool for quantitation in clinical discovery proteomics. The measurements obtained correlate well with routine immunoassays and labelled approaches, making it likely that discoveries will transfer well to MRM/immunoassay validation studies. The only extra sample preparation required is the addition of a known amount of a single (non-labelled) internal standard to each sample.

Progenesis QI for proteomics performs Hi-N quantitation (without an internal standard) by default. Absolute quantitation using an internal standard is simply a case of entering the standard’s accession and spiked amount. If you’d like to find out more, get in touch, or download Progenesis QI for proteomics and try it for yourself.


Silva, J. C., Gorenstein, M. V., Li, G. Z., Vissers, J. P. C., & Geromanos, S. J. (2006). Absolute quantification of proteins by LCMSE – a virtue of parallel MS acquisition. Molecular & Cellular Proteomics, 5(1), 144-156.

Ahrné, E., Molzahn, L., Glatter, T., & Schmidt, A. (2013). Critical assessment of proteome-wide label-free absolute abundance estimation strategies. Proteomics, 13(17), 2567-2578.

Grossmann, J., Roschitzki, B., Panse, C., Fortes, C., Barkow-Oesterreicher, S., Rutishauser, D., & Schlapbach, R. (2010). Implementation and evaluation of relative and absolute quantification in shotgun proteomics with label-free methods. Journal of proteomics, 73(9), 1740-1746.

Burniston, J. G., Connolly, J., Kainulainen, H., Britton, S. L., & Koch, L. G. (2014). Label-free profiling of skeletal muscle using high-definition mass spectrometry. Proteomics, 14(20), 2339-2344.

Levin, Y., Hradetzky, E., & Bahn, S. (2011). Quantification of proteins using data-independent analysis (MSE) in simple and complex samples: A systematic evaluation. Proteomics, 11(16), 3273-3287.

Kramer, G., Woolerton, Y., van Straalen, J. P., Vissers, J. P. C., Dekker, N., Langridge, J. I., Benyon, R. J., Speijer, D., Sturk, A. & Aerts, J. M. F. G. (2015). Accuracy and Reproducibility in Quantification of Plasma Protein Concentrations by Mass Spectrometry without the Use of Isotopic Standards. PloS one, 10(10), e0140097.