Progenesis plugins: gotta catch ’em all!

Here at Nonlinear Dynamics, we’ve always strived to keep Progenesis QI and Progenesis QI for proteomics vendor agnostic.

This allows our users to utilise a single software package to analyse data from all of their instruments, and interface with a wide range of search methods and pathways tools.

We achieve this through our plugin architecture, which allows you to install and update your supported data formats, search methods, and pathways tools independently of Progenesis.

What are the advantages of the plugin system?

Distributing vendor specific functionality as plugins confers a number of advantages. Progenesis users can:

  • interface with multiple vendors using a single piece of software – a key distinguishing feature versus other analysis software.
  • remain up to date with new file formats and/or changes to existing file formats, without having to install a new version of Progenesis.
  • apply novel search methods and pathways tools to their existing data analyses, thus staying up to date with developments in the scientific community.

What plugins are available?

Data import plugins

Progenesis allows you to import raw data from a number of different vendors and machines. All imported data is converted to Progenesis’s unique internal peak models, so all types of data can be analysed using a consistent workflow. You can even combine data from different vendors in the same experiment (although this isn’t recommended as you may have trouble aligning the data).

Data file format | Plugin FAQs | Availability
Waters (.raw) | QI, QI for proteomics | Provided as standard
Thermo (.raw) | QI, QI for proteomics | Provided as standard
UNIFI Export Packages (.uep) | QI | Provided as standard (only available in QI)
AB SCIEX (.wiff) | QI, QI for proteomics | Provided as standard
Agilent (.d) | QI, QI for proteomics | Free download
Bruker Daltonics (.d) | QI, QI for proteomics | Free download
mzXML files | QI, QI for proteomics | Provided as standard
NetCDF files | QI, QI for proteomics | Provided as standard in QI for proteomics; free download for QI

Search plugins (QI)

These plugins allow you to search for small molecules or lipids in your data set, using a wide variety of data sources. Elemental composition even enables you to elucidate compound composition without the use of a dedicated compound database. Progenesis MetaScope allows you to search SDF and MSP files from any source you choose, e.g. HMDB or PubChem.

Search method | Availability
Progenesis MetaScope | Provided as standard
METLIN batch metabolite search | Provided as standard
LipidBlast | Provided as standard
Elemental composition | Provided as standard
ChemSpider | Provided as standard
NIST MS/MS Library | Contact us for access

Search plugins (QI for proteomics)

Progenesis QI for proteomics can perform peptide search and protein inference using a number of different plugins. These encompass both database search methods like Mascot, and de novo sequencing methods such as PEAKS Studio.

Search method | Alternative versions | Availability
Scaffold | v3.0 and v4.0 | Free download
Mascot | - | Provided as standard
Phenyx | - | Provided as standard
SEQUEST | dta and out files; dta and pepXml files; sqt and ms2 files | Dta plugins provided as standard; free download for sqt plugin
PLGS | v2.4 and v2.5; v2.3 and v3.0 | Free download
Proteome Discoverer | v1.3 (.xls); pepXml | Free download
ProteinPilot | - | Free download
Spectrum Mill | - | Free download
PEAKS Studio | pepXml import only | Free download
EasyProt | - | Free download
Byonic | - | Free download

Inclusion list plugins

Inclusion list plugins in both QI and QI for proteomics allow you to target your MS/MS data collection for greater MS/MS coverage. Importantly, you can import new LC/MS runs into an existing experiment without having to repeat peak picking and other analysis steps. This makes the inclusion list workflow a powerful tool for increasing MS/MS coverage in DDA experiments.

Inclusion list format | Plugin FAQs | Availability
AB SCIEX | QI, QI for proteomics | Provided as standard in QI for proteomics; free download in QI
MassLynx | QI, QI for proteomics | Provided as standard
Thermo Finnigan | QI, QI for proteomics | Provided as standard
Thermo Finnigan (4 d.p.) | QI, QI for proteomics | Free download
Thermo Q Exactive | QI, QI for proteomics | Free download
Agilent preferred MSMS table | QI, QI for proteomics | Free download
Agilent targeted MSMS table | QI, QI for proteomics | Free download
Bruker Maxis | QI, QI for proteomics | Free download

Pathways plugins

Progenesis provides reliable quantitative information about the changes in your experimental conditions. A number of pathways tools exist to translate such quantitative results into biologically relevant conclusions. Progenesis supports the following pathways tools, including the widely used IPA, and the multi-omics approach of IMPaLA.

Pathways tool | Plugin FAQs | Availability
IMPaLA | QI, QI for proteomics | Provided as standard
PANTHER classification system | QI for proteomics | Provided as standard (only available in QI for proteomics)
IPA | QI, QI for proteomics | Provided as standard

Recent plugin updates

These are just a few examples of recent plugin releases we have made. As you can see, we regularly produce updates to Progenesis plugins, and develop new plugins when requested by customers.

  • In April 2016 we released an updated version of the mzML reader for Progenesis QI, introducing the ability to read indexed mzML files, as requested by our customers.
  • In January 2016 we released the IPA plugin for Progenesis QI for proteomics, giving users of Progenesis QI for proteomics v2.0 easy integration with this widely used pathway tool.
  • In November 2015 we released a new version of the Proteome Discoverer plugin, to support the newly released Proteome Discoverer v2.0 and v2.1.
  • In November 2015 we also released a brand new Thermo Q Exactive inclusion list plugin for both QI and QI for proteomics, since the Q Exactive machine uses a different inclusion list format to other Thermo machines.

Future plugins

Here at Nonlinear Dynamics we are committed to ensuring Progenesis remains vendor agnostic and supports the widest range of third party integrations possible.

As such, we’re always happy to hear from customers if they wish to use Progenesis with a third party piece of software for which a plugin does not exist. Please get in touch if you have any ideas for new plugins, or improvements to existing plugins.

Symphony: The right product at the right time

Here at Nonlinear, we are very pleased to have been given a treat of a new product to sell, Symphony. It is available now.

I asked the product manager, Dr Rob Tonge of Waters, a few questions about its inception, what it does and why people are so enthusiastic about it.

Here is the interview:

“How did Symphony come about?”

Initially we were working with The Phenome Centre in London on large scale population studies and gained insight into the problems that scientists experience when performing high throughput metabolomics.

Dr Rob Tonge explains to Juliet Evans why Symphony was so well received at ASMS

“What problems were these?”

Research groups are increasingly trying to perform larger and larger experiments, in areas such as Personalised or Precision Medicine. Big Data is the flavour of the day. The field is being led by genomics and next generation sequencing technologies, but proteins and metabolites are also of interest to provide a more holistic picture of the biology under study.

The scale of experiments is moving from tens, to hundreds, to thousands of samples as we push towards population-scale investigations. However, many of the methods developed for omics have been built with a research scale in mind, many with very manual processes, and these can be prone to errors when used at a larger scale. Thus, for larger experiments, automated informatics workflows are essential for efficiency, accuracy, and sensitivity.

“Do all these groups want the same solution?”

No.  When we talk to customers in the research environment, many of them have very varied needs, often from one project to the next. One size really does not fit all and labs want to use the latest cutting edge methods and algorithms, and the potential to future-proof their operations.

“So the challenge was to create something flexible enough to allow for a variety of workflows.”

Exactly. Researchers want to experiment with ideas and require informatics systems that enable creativity, not constrain it.

“Presumably if you are talking about automation, you are talking about time saving, as well as reduced errors?”

Absolutely.  In today’s world, time certainly IS money. People have more and more to do in less and less time and do not have time to waste on repetitive tasks when automated protocols could greatly accelerate their work.

“Were there any other factors that you took into account when designing Symphony?”

Yes, we now live in highly connected communities, where social media platforms such as Twitter, Facebook and LinkedIn bring like-minded people together. There are great benefits to be had from working together to share ideas, share applications and code, and be catalytic on each other’s thinking.

“OK, I understand the background to the product now; tell me more about the Symphony solution”

Symphony is a client/server application that allows automation of tasks. It is a framework into which different tasks can be plugged, so, as in the example below, we have built a data processing pipeline with 4 tasks (blue, yellow, green and red) and we are processing our blue incoming data into the resultant green, transformed data at the bottom.

The first version of Symphony is initiated by MassLynx at sample acquisition, and typical tasks that can be applied include moving a file to a server, de-noising, compressing, renaming, making a copy, running a series of executables, etc., etc.

Diagram illustrating the way that data is passed automatically through various tasks via Symphony

“I see. So you can customise it to whatever workflow you design?”

Yes, Symphony is built with flexibility, creativity and efficiency in mind. It accepts a wide range of tasks and it is very easy to construct a pipeline sequence by dragging and dropping task icons together, and we have the facility to run conditional tasks that are able to collect data from one task and use it in another.

Tasks and Pipelines can be saved into a library for future use and pipelines can be configured to work across multiple PCs and across networks. Symphony has an excellent trouble-shooting system to allow a user to diagnose pipeline configurations and comes with a Home Page that allows us to send information to a user such as news items, information about latest builds, and items from the Symphony Community.

“Where do you see this product being used?”

I’ve just returned from ASMS, at which we launched Symphony.  It was well received by the community there. The main benefit to all users is that Symphony saves hands on time in data processing. That can be research labs and also higher throughput labs like DMPK CROs. Data processing can be initiated as soon as the file is recorded by the instrument and can be done automatically to save time and allow out of hours working.

As well as efficiency, automation brings the additional benefit of reducing the errors that can easily creep in when performing repetitive tasks. And, as many customers require today, Symphony allows the implementation of Personalised Data Processing – that is, the data processing that THEY need in THEIR laboratories.

“That really does sound great!  Let’s finish with some feedback from three of our users”

“By automating routine data-processing steps, Symphony saves our operator time, and allows us to conduct the most time-consuming parts of the informatics workflow in parallel to acquisition. Best case, it can save MONTHS of processing time, and in combination with noise-reduction, petabytes of storage. We see great value in the modular nature of Symphony, allowing us to rapidly develop and test new processes for handling experimental data, including real-time QC, prospective fault detection, and tools for ensuring data-integrity.”

Jake Pearce, Informatics Manager, National Phenome Centre, London, UK.

“Symphony offers a solution to address many challenges, providing a platform with automated, flexible and adaptable workflows for high-throughput handling of proteomic data. Just the simple step of being able to seamlessly and automatically copy raw files to a remote file location whilst a column is conditioning, maximises the time we can use the instrument for analysis. Previously, the instrument could be idle for 1-2 hours whilst data is copied to a filestore in preparation for processing. With three Synapts generating data 24/7 in our laboratory, this alone is a major advance.

Symphony’s flexibility of being able to execute sample specific workflows directly from the MassLynx sample list will have a major impact on our productivity. The scalable client-server architecture makes Symphony perfect for large scale high-throughput MS data processing, where the processing of highly complex data can only be addressed by calling on a range of computational resources.”

Paul Skipp, Lecturer in Proteomics and Centre Director, Centre for Proteomics Research, University of Southampton, UK.

"New approaches are continuously being developed to extract increasing amounts of data from very data-rich ion mobility-assisted HDMSE experiments. Plugging new algorithms into an automated Symphony pipeline provides the ingredients for exponential growth in information content that can be extracted from both new and archived samples. Automation brings the possibilities of finding optimal parameter settings and reducing the possibility of errors, without significant time penalties. I was amazed at the level of detail that I can see using these approaches!"

Maarten Dhaenens, Laboratory for Pharmaceutical Biotechnology, University of Gent

Want to learn more?

Contact us if you would like to try Symphony.

Out now – Progenesis QI for proteomics v3.0

If you attended ASMS 2016, you may have been lucky enough to see a preview of Progenesis QI for proteomics v3.0, and today I’m pleased to announce that it is now available to download. This release is focussed on peptide level information, with a few other treats as well.

What’s New?

  • Improved access to peptide level information: ions are now charge-state deconvoluted so you can see whole peptide quantitation and expression profiles at Review Proteins; there is also a new peptide export available. Progenesis QI for proteomics now displays a correlation score for peptides and peptide ions which can be used to focus in on modified peptides, as well as to aid confidence with identifications. We’ve also added a new modification quick tag for peptide ions.
  • Improvements to responsiveness
  • Resolve conflicts is no longer part of the main workflow: following the implementation of Hi-N in v2.0 which handles conflicts by distributing the ion abundance, and based on customer feedback, we have taken the Resolve Conflicts screen out of the main workflow. You can still access this screen from Identify Peptides, in the same way that Normalisation can be reviewed or altered from Filtering.
  • Support for ProteinPilot v5.0: continuing with our multi-vendor support, we have released a plugin to allow import of peptide search results from v5.0 of ProteinPilot from SCIEX.

Other minor improvements / bug fixes

  • The mzML importer now supports indexed files.
  • Software no longer crashes when connection to an external drive is lost (issue was limited to experiments saved externally).
  • Improvements to peak picking for low intensity peptide ions, and heavy peptide ions (those with a mass greater than 3000Da).
  • Other minor bug fixes.

Where can I download it?

If you’re an existing customer with an up to date coverwise plan, this upgrade is totally free of charge and very simple – you will receive an email with a direct download link as well as specific instructions on how to upgrade your dongle. In addition, if your Progenesis PC is connected to the internet, there should be a message in the Experiments list sidebar notifying you of this new version – if you click this, and your dongle is plugged in, you’ll be sent to the download page.

Screenshot of Recent Experiments screen with upgrade notice highlighted

If you’re thinking of trying Progenesis QI for proteomics for the first time, you can download the software from here.

How will I know how to get the most out of the new features?

We’ve expanded our FAQs to cover the new features, as well as updating any previously available FAQs to correctly reflect new behaviour.

We’ve also updated our user guide if you’re looking for a step-by-step guide from start to finish.

Bringing the analysis to the sample: Progenesis QI helps beat food fraud

I recently watched a recorded webinar and was so impressed I decided to blog about it.

Addressing complex and critical food integrity issues using the latest analytical technologies

Prof. Chris Elliott
Professor of Food Safety and Director of the Institute for Global Food Security
Queen’s University Belfast

Dr. Sara Stead
Senior Strategic Collaborations Manager, Food & Environmental
Waters Corporation

The bad news

The food supply system is incredibly complex, sustaining 6 billion people with ingredients and processed foods. The problem arises when things go wrong, whether accidentally or deliberately; there can be catastrophic consequences. The melamine milk scandal in China resulted in 54,000 babies being hospitalised, and 6 dying. How prevalent is food fraud? The answer is that we don’t know; it has been estimated at US$40 billion. Not only is food fraud dangerous, it also erodes the trust between consumers and businesses. Sometimes companies unwittingly buy fraudulent products from a supplier, damaging their reputations, in many cases to the point of the company’s collapse.

Chris went on to talk about the red meat supply, which is hugely complicated, having multiple points of vulnerability.  He mentioned how sub-contracting within the food supply chain has made it possible to substitute far cheaper horse meat for beef; the complexity of the sub-contracting has made finding the culprits difficult.

He talked about herbs and spices, mentioning the cumin scandal, whereby similarly coloured peanut shells are deliberately substituted to bulk out the cumin.  This act is particularly malicious because of peanut allergies potentially leading to fatalities.  ‘Pure’ oregano has been found to contain leaves of citrus, olive and myrtle.

‘Pure’ oregano has been found to contain leaves of citrus, olive and myrtle

The good news

The UK Government asked Chris to run an enquiry which led to the Elliott Review into the Integrity and Assurance of Food Supply Networks – Final Report. His review mentions the ‘Eight Pillars of Food Integrity’. I like the first pillar, ‘Always put the consumer first’. The fourth pillar is where Progenesis QI comes in: ‘Laboratory Testing’.

As part of ‘Laboratory testing’, Chris has been working with Dr Sara Stead of Waters on producing ‘fingerprints’ for different foods, using metabolomic profiling and the very simple-to-operate Rapid Evaporative Ionisation Mass Spectrometry (“REIMS”) research system, incorporating the iKnife. Progenesis QI is used to analyse the data produced. Its ‘Quantify then Identify’ workflow makes it a natural choice for finding unknown changes in complex samples. Sara explained how much promise this new technology shows; a huge advantage is that there are results in seconds and no sample preparation is needed. It’s a revolutionary approach, ‘bringing the analysis to the sample’. So long as your sample has conductivity, you can analyse it.

Sara explained how high levels of fraud have been found in the fish industry, for example, tuna being substituted by escolar, which can cause steatorrhea. She talked about substitutions causing problems for people with seafood allergies and her work on coffee, Belgian butter, herbs and spices and the botanical origin of honey.

Figure 1: Which species of fish are present? (Photograph by Javier Lastras.)

The group are building models to help identify geographical origins and seasonal variation.  All this from a test that takes seconds.  Amazing!

Figure 2: PCA analysis showing separation of different fish species by metabolic profiling

The future for food fraud testing lies in ‘holistic profiling’, being able to detect the unexpected and verify unique product markers.  Chris still thinks that more vigilance in the food industry is needed and there is plenty more work to do.  He revealed that product complexity makes detection more difficult and believes the best place to start testing is the raw ingredients.

He is using this game-changing, sensitive technology alongside other techniques. The speed and simplicity of the workflow make it a very attractive option when compared to, for example, DNA testing, which is prohibitively costly and complex. One more piece of good news is that once the models are built, the analysis is highly reproducible between laboratory sites.

Watching this webinar really inspired me! I’m truly impressed by the boldness of this technological approach; I came away thinking “This revolutionary technology really could change the world!”

As you will see in the webinar, Progenesis QI is the software being used for comparative profiling of food varieties. If you’d like to use Progenesis to profile your own data, download it today.

Missing values: the Progenesis co-detection solution

In my last blog I described the problem of missing values in discovery omics analysis and how it adversely affects the statistics. Now I’ll describe the Progenesis co-detection solution to this problem.

First, a quick recap: the problem is caused by an inefficient workflow in which the feature ion signals are detected independently on each sample. This creates different detection patterns, even for technical replicates (same sample run multiple times), so that matching the ions to ensure you are comparing ‘like with like’ across all samples becomes very difficult. This leads to the generation of many “missing values” in the ion quantity matrix. Multivariate statistical analysis is then performed on the ion quantity matrix, in order to find the truly significant expression changes. In practice, missing values in the ion quantity matrix mean that it is not possible to do a ‘like with like’ comparison on many features.

This means the multivariate statistics have to be applied to a restricted number of features; consequently, false positives and false negatives are generated through the applied multivariate analysis. We examined the consequences of missing values in more detail in our blog post: Missing Values: The hard truths.

Progenesis, however, takes a unique alternative approach to data extraction, in which ion signals are essentially “matched” before detection takes place by aligning the pixel patterns of the 2D ion maps (see figure below). This compensates for any retention time differences between samples. The pre-matched ions can then be co-detected so that a single detection pattern is created for all the samples in the experiment, resulting in 100% matching of ions and no missing values!

Here is a schematic of how Progenesis QI works:

Schematic of how Progenesis QI works

How does this approach help?

Well, let’s consider a comparison of two very similar samples from a small discovery omics experiment.

The traditional approach

Figure 1A below shows zoomed in ion-map views of the same m/z / RT region from the two samples so you can see how visually similar they are, allowing for some vertical retention time drift between them. In Figure 1 B and C, you can see how the conventional (and inefficient) analysis workflow handles this task:

  • First, the feature ions are detected independently on each sample (1B).
  • Then, the detected feature ions are vertically aligned to compensate for the retention time drift and feature ions are “matched” between the samples using the mono-isotopic m/z and adjusted retention time as reference (1C).

Zoomed in ion-map views

The degree of ion matching between the samples is best shown by the arrow markers which indicate ions that are present on one sample but not present on the other. In fact out of 108 ions detected on sample 1, 31 are not detected on sample 2 while 19 out of 98 detected on sample 2 are not detected on sample 1. This means that out of 129 unique ions detected across both samples, almost 40% are only detected on one sample and therefore generate a missing value in the data. What’s more, in addition to the 50 unmatched ions, there are more which are detected quite differently on the two samples in terms of their isotope numbers, chromatographic peak width, or both. In a real experiment with multiple samples in two or more groups, these detection differences increase the variance in quantitation of any ion across different samples within a single comparison group, making it more difficult to find true statistically significant differences between different groups.
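
The “almost 40%” figure follows directly from the counts quoted above. As a quick check, here is a minimal Python sketch that uses only the numbers given in the text (the variable names are ours, purely for illustration):

```python
# Counts quoted in the paragraph above; nothing here is re-derived from raw data.
only_on_sample_1 = 31   # detected on sample 1 but not on sample 2
only_on_sample_2 = 19   # detected on sample 2 but not on sample 1
unique_ions = 129       # unique ions detected across both samples

# Every ion seen on only one sample produces a missing value on the other.
unmatched = only_on_sample_1 + only_on_sample_2
missing_fraction = unmatched / unique_ions

print(f"{unmatched} unmatched ions = {missing_fraction:.1%} of {unique_ions}")
```

This gives 50 unmatched ions, or 38.8% of the 129 unique ions.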

The Progenesis approach

Now let’s look at how Progenesis analyses the same data used in the traditional approach. In this case the first step is to align the signals on the ion maps by creating a series of alignment vectors as shown in Figure 2B. You can see that the effect of this is to reduce two signal patterns (shown in purple and green in 2B(i) and 2B(ii)) to one. This single signal pattern (formed by aggregation of both samples) is then used for peak “co-detection” (2C) in which a single detection pattern is created that applies to both samples (2D).

Zoomed in Progenesis ion-map views

Using the same detection algorithm as in the conventional workflow, but co-detecting from an aggregated ion map rather than detecting individually on each sample, Progenesis has detected a total of 154 feature ions, all of which are detected in the same way on both samples. In a real experiment this increases the statistical power in the following ways:

  1. Co-detection generates a complete data matrix with no missing values, eliminating the need to filter out ions with too few real values present or to impute model values, possibly leading to false positive or false negative results.
  2. By detecting each ion on all samples in the same way, co-detection minimizes variance in ion quantitation across samples in the same comparison group, making it easier to find true statistically significant differences between the groups.

In addition to the above benefits, co-detection also increases sensitivity and reliability of ion detection by increasing the signal to noise ratio. Even with co-detection of just two samples, we can see this in the detection of 25 (=154-129) ions that were not detected in either of the samples individually. As we co-detect from more samples, very faint and/or fragmented signals that cannot be reliably detected on individual samples but are consistently present, will become more distinct and easily detected from the aggregated data.
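
To give a feel for why aggregation improves the signal to noise ratio, here is a toy Python sketch. It is not the Progenesis algorithm: it assumes the runs are already perfectly aligned, that noise is independent Gaussian noise in each run, and that aggregation is a simple point-by-point average. The peak shape, noise level and number of runs are all made up for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable


def noisy_trace(signal, noise_sd):
    """One simulated run: the true signal plus independent Gaussian noise."""
    return [s + random.gauss(0.0, noise_sd) for s in signal]


def aggregate(traces):
    """Average aligned traces point by point (a stand-in for an aggregate map)."""
    return [statistics.fmean(points) for points in zip(*traces)]


# Hypothetical faint peak: height 1.0 in the middle of a flat baseline.
signal = [0.0] * 200 + [1.0] * 5 + [0.0] * 200
runs = [noisy_trace(signal, noise_sd=1.0) for _ in range(16)]

baseline = slice(0, 200)  # region that contains no true signal
single_noise = statistics.pstdev(runs[0][baseline])
aggregate_noise = statistics.pstdev(aggregate(runs)[baseline])

print(f"noise in one run:     {single_noise:.2f}")
print(f"noise in 16-run mean: {aggregate_noise:.2f}")
```

Under these assumptions the peak height is unchanged by averaging while the baseline noise shrinks roughly with the square root of the number of runs, so a peak that is lost in the noise of a single run stands out more clearly in the aggregate.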

Progenesis co-detection in action

Finally, let’s take a look at how the Progenesis co-detection workflow helps us to easily extract powerful statistical information from a 3 vs 3 experiment that includes the two samples we’ve already looked at. The figure below shows quantitative data for two different ions extracted from the experiment, one in which a significant expression change is detected and another in which no change is detected. The figure also illustrates another powerful benefit of the co-detection workflow – the ability to visually confirm expression change results (p-values and fold changes) at the “raw data” level, a great way to increase confidence in your results!

Progenesis co-detection workflow

So, there you have it. The unique Progenesis QI workflow really does eliminate missing values at the analysis stage.

Would you like to try Progenesis QI on ALL your data? Download now and complete your analysis with confidence.

Identification scoring in Progenesis QI

One of the advantages of using Progenesis QI is its ability to combine results from multiple search methods and databases. Progenesis QI uses a common scale to score results from all the databases and search methods it supports, so you can compare search results obtained from different search methods. This post explains the scoring method we use in Progenesis QI, and how you can improve your search scores by searching additional dimensions of your data.

Progenesis QI search methods

At the time of writing, Progenesis QI supports these search methods and databases:

Progenesis MetaScope
Searches SDF and MSP files from any source. Supports retention time, CCS, theoretical fragmentation and spectral libraries.
METLIN batch metabolite search
Exports data for use with the METLIN batch search interface, and reads METLIN batch CSV files.
LipidBlast
Searches the LipidBlast MS/MS database provided by Metabolomics Fiehn Lab.
Elemental composition
Produces putative formulae for compounds based on mass, isotope profile, and the Seven Golden Rules.
ChemSpider
Searches the ChemSpider structure database. Supports theoretical fragmentation, isotope similarity filtering, and elemental composition filtering.
NIST MS/MS Library (requires purchase)
Searches the NIST MS/MS library for spectral matches.

You can find out more about each of these search methods in the search methods and databases FAQ. This blog post, however, will focus on how we calculate scores so that identifications from different search methods can be compared.

The Progenesis scoring method

For any given search, there are up to five properties that can contribute to the overall score:

  1. Mass error
  2. Isotope distribution similarity
  3. Retention time error
  4. CCS error
  5. Fragmentation score

Each of these individual scores is on a scale from 0-100. If your search criteria do not include a given piece of data, the score for that piece of data is 0. The overall score is the mean of these 5 scores.

Note that the more search criteria you use, the higher the maximum possible score becomes, as described in the following example.

Example

Suppose we have searched ChemSpider using theoretical fragmentation. For a given compound we find Identification A, with these scores:

Identification A | Score
Mass error | 95.2
Isotope distribution similarity | 99.2
Retention time error | 0
CCS error | 0
Fragmentation score | 87.1
Overall score | 56.3

Note that the scores for retention time and CCS errors are 0, because ChemSpider does not support searching those properties.

If we then perform a MetaScope search, this time including a CCS constraint, we might obtain the following scores for Identification B:

Identification B | Score
Mass error | 95.2
Isotope distribution similarity | 99.2
Retention time error | 0
CCS error | 94.1
Fragmentation score | 87.1
Overall score | 75.12

We have identical scores for the mass error, isotope distribution, and fragmentation. However, we also have an extra piece of information in the CCS score. This provides additional evidence for Identification B, so it is given a higher score than Identification A.

Note that in the ChemSpider case, if an identification scores 100 on all 3 items, it obtains a score of 60. In the MetaScope case, if an identification scores 100 on all items, it obtains a score of 80. So for each additional piece of data we include in our search, the maximum score increases by 20.
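
The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch of the rule described above (the mean of five component scores, with 0 for anything not searched). The function and key names are hypothetical and are not part of Progenesis QI.

```python
# The five components listed above; the key names are ours, for illustration only.
COMPONENTS = (
    "mass_error",
    "isotope_similarity",
    "retention_time_error",
    "ccs_error",
    "fragmentation",
)


def overall_score(scores):
    """Mean of the five component scores, counting unsearched components as 0."""
    return sum(scores.get(name, 0.0) for name in COMPONENTS) / len(COMPONENTS)


# Identification A: ChemSpider with theoretical fragmentation (no RT or CCS).
identification_a = {
    "mass_error": 95.2,
    "isotope_similarity": 99.2,
    "fragmentation": 87.1,
}
# Identification B: the same scores plus a CCS constraint from MetaScope.
identification_b = dict(identification_a, ccs_error=94.1)

print(round(overall_score(identification_a), 2))  # 56.3
print(round(overall_score(identification_b), 2))  # 75.12
```

Because the divisor is always 5, each component that scores a perfect 100 contributes 20 points to the overall score, which is why three searchable properties cap the score at 60 and four cap it at 80.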

The component scores

Here we’ll briefly describe how the five component scores that make the final score are calculated.

Mass error, retention time error, and CCS error

These are all functions of the magnitude of the relative error, Δ:

Figure 1: The score profile for mass error, retention time error and CCS error.

For the mass error, Δ is the ppm mass error and N = 4000. For the retention time and CCS errors, Δ is the percentage error, and N = 20.

Isotope distribution similarity score

This compares the intensities of each isotope between observed and theoretical distributions. A total intensity difference of 0 gives a score of 100, which falls linearly to 0 when the total intensity difference is equal to the maximum isotope intensity.
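As a rough sketch of that linear relationship, the Python function below scores two isotope distributions. Several details are assumptions made for illustration rather than the documented Progenesis calculation: the two distributions are assumed to be already normalised to the same scale, the "total intensity difference" is taken to be the sum of absolute per-isotope differences, the "maximum isotope intensity" is taken from the theoretical distribution, and the score is clamped at 0 for larger differences.

```python
def isotope_similarity_score(observed, theoretical):
    """Linear score: 100 for identical distributions, 0 once the total
    difference reaches the maximum isotope intensity.

    Assumptions (not confirmed by the source): inputs share a common scale,
    differences are absolute, and the maximum comes from the theoretical
    distribution.
    """
    if len(observed) != len(theoretical):
        raise ValueError("distributions must have the same number of isotopes")
    max_intensity = max(theoretical)
    if max_intensity <= 0:
        raise ValueError("theoretical distribution must have a positive maximum")
    total_difference = sum(abs(o - t) for o, t in zip(observed, theoretical))
    return max(0.0, 100.0 * (1.0 - total_difference / max_intensity))


# Made-up intensities for a three-isotope pattern.
theoretical = [100.0, 11.0, 1.0]
print(isotope_similarity_score([100.0, 11.0, 1.0], theoretical))
print(isotope_similarity_score([100.0, 21.0, 1.0], theoretical))
```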

Fragmentation score

The fragmentation score is more complicated and depends on the fragmentation method used. The FAQs describe how scoring works for theoretical fragmentation and database fragmentation.

Improving identification scores

The best way to improve the scores of your identifications and your confidence in them is to use more search constraints.

37.9/100

In general, most searches will be able to produce a mass error score and an isotope similarity score. With just these two pieces of information, the maximum score for any identification is only 40/100. In this example we’ve identified Warfarin using only mass error and isotope similarity.

55.4/100

Including fragmentation data in your search criteria (either theoretical fragmentation or a fragmentation database) increases the maximum possible score for identifications to 60/100. Here we’ve added theoretical fragmentation to our search parameters.

70.8/80

Finally, if you use an appropriate data source (e.g. an SDF and additional properties file) you can add search constraints for retention time and CCS, giving a maximum score of 100/100. Here we don’t have CCS information, but have added retention time to our search parameters for a maximum of 80/100.

Future improvements

Currently Progenesis gives equal weight to the five component scores – mass error, isotope similarity, fragmentation score, retention time error, and CCS error. In some cases this might not be ideal, so if you have any suggestions for different weightings we’d love to hear from you in the comments section below.

As always, if you have any further questions, check our FAQ or get in touch.

Missing values: the hard truths

 

“I could burst into tears… I spent weeks of time and effort on sample collection, instrument optimisation, sample running, and data generation from a very expensive LC-MS setup that has the resolving power to find tiny but significant differences between my conditions, only to find that my data analysis led me down a dead end with many false positives. I’m certain that real significant differences are in that dataset, but my analysis workflow just isn’t picking them up… What will I tell my boss?”

Many know how real that scenario is, but what are the possible consequences resulting from the impact of missing values in your data?

  • False negatives – are you missing true positive results, as in the example above?
  • Time spent on researching false positives – does this slow down department research progress?
  • Wasted money on investigating false positives – how easy is it to then get follow up funding?
  • Acceptance of research results – is your journal of choice going to accept data with missing values?
  • Reduced return on investment – do your research results bring value from either publications or commercial gain?

Hopefully you’ve been lucky and the impact was not as drastic as the above. However, those outcomes are all very possible.

Frustrated researcher by his mass spec with his head in his hands

In our ‘Back to basics’ blog post, we mentioned a number of the problems that missing values can cause, such as:

  1. Reduced effectiveness of statistical analysis techniques
  2. Misleading statistics (impacting false positive and false negative results)
  3. Problems with multivariate statistics as observed by the visualisations (such as PCA)

Before we move onto the issues that arise from various imputation methods of missing values in your data matrix, let’s remember how many data points are actually affected:

  • Missing values are commonly reported as occurring at rates of approximately 20%[1,2] and affecting up to 80% of variables[2] in LC-MS data.
  • “Further investigation of the peaks detected as significantly different between biological groups showed that substantial proportions of these peaks were comprised of those which initially had missing data.”[2]

Let’s expand upon the problems mentioned above, but first a quick note on terminology: Hrydziuszko and Viant refer to missing data where we refer to missing values. Missing values have also been referred to as ‘missing or lost data points’. To simplify the language, we will continue to refer to missing values, except in the case of citations.

A – Reduced effectiveness of statistical analysis techniques

To have confidence in conclusions drawn from comparative results between conditions, we look to statistical power. In the ‘Missing values: what’s the problem?’ blog post, Dr. Goulding described how increasing the number of biological replicates should even out the inherent differences between biological samples and so increase statistical power. However, each additional sample brings additional missing values into the experiment’s quantitative results, which limits the increase in statistical power that is actually achieved. In effect, we throw away most of the statistical power gained from adding biological replicates because we incorporate even more missing values. With low statistical power, real experimental differences are likely to be missed.

B – Misleading statistics

Typically, 1 in 5 data points are missing; can’t we just impute them with one of the many existing methods? Yes, but not without the danger of leading your investigation completely astray. Can we measure the impact of imputation on our significant discovery rate? Hrydziuszko and Viant evaluated[2] the impact of eight imputation methods; they found:

  • “Different imputation methods ultimately yielded quite diverse data analysis outcomes. Specifically the number of peaks identified as significantly different between groups varied considerably between the eight estimation methods.”
  • “It is quite possible that when an inappropriate missing value estimation method is used we may not only lose the knowledge of which peaks are significant or not, but we may introduce further bias by identifying non-significant peaks as significantly different between groups.”
  • “Overall the results presented here provide substantial evidence that the choice of missing value estimation method has a substantial effect on the outcome and interpretation of univariate statistical analysis.”

C – Problems with multivariate statistics as observed by visualisations (such as Principal Component Analysis (PCA))

Complications with multivariate statistics caused by missing values were also seen when Hrydziuszko and Viant[2] looked at the application of PCA. They found differences similar to, but larger than, those discussed above for univariate analysis. Depending on which imputation method was used, PCA plots varied from clear separation of the 3 conditions to no separation at all. This suggests that multivariate analysis could be even more sensitive to imputation.

What can be done about this serious skewing of data?

Hrydziuszko and Viant[2] concluded their paper with “a three step process recommended in order to determine optimal method selection for missing value estimation for a given dataset and analytical platform that includes: assessing the nature of the missing data, analysing the impact of the missing data treatments on the final data analysis outcome, and analysing the performance of missing data algorithms on the ‘complete’ dataset if available.”

The proposed approach is very thorough, but it is likely to be time-consuming: each dataset must be assessed on its own merits, so the three steps have to be repeated for each new study.

There is another way!

“Co-detection in Progenesis is a very attractive feature that we have come to trust. We do a lot of follow up Western blots to reproduce our quantitative proteomics findings and the blots do a great job of building confidence in the co-detection feature and the Progenesis system as a whole. We’ve had success reproducing changes in protein expression levels observed in Progenesis using the traditional Western approach. That’s extremely important since a lot of reviewers do not speak mass spec but love to flaunt how fluent they are in the language of Western blot.”

Paul Langlais, Mayo Clinic Arizona, Arizona, USA

Progenesis offers you the ability to take your data straight from your mass spectrometer and find the significantly changing compounds or proteins in your dataset without any of the problems associated with missing values. As long as the differences are present in your data file, Progenesis will maximise your chances of picking them up. Working on data with no missing values improves the experimental specificity, sensitivity and therefore reproducibility of your research, allowing you to quickly and, more importantly, confidently, quantify the compounds or proteins of interest.

An issue Dr. Goulding discussed in ‘Missing values: what’s the problem?’ was the data filtering, where missing values fall below predefined matching thresholds. With the Progenesis co-detection approach, you do not have to worry about this as you will always be comparing the same features across all the samples in your dataset. This means that when you use the multivariate statistical visualisations such as PCA, you will have unfiltered and therefore unbiased data representation. In turn, this gives you the ability to QC your data by assessing for outliers to ensure the differences you see in your data are real.

PCA plot from Progenesis showing scores and loadings for all variables

Dr. Goulding will be following up soon on how Progenesis achieves no missing values; in the meantime why not download Progenesis QI and speak to one of our specialists about analysing your own data? You too can have peace of mind that the conclusions you are submitting are based on the unbiased analysis of a complete data matrix.

References

1. Gromski et al, “Metabolites” 2014, 4, 433-452

2. Hrydziuszko and Viant, “Metabolomics” 2012, 8, Supplement 1, 161-174

Come and see us at ProteoMMX 4.0!

Two years ago, I attended my first conference for Nonlinear, ProteoMMX 3.0. ProteoMMX 4.0 is fast approaching (5th – 7th April, at The Queen Hotel, Chester, UK) and I’m excited to say I’m lucky enough to be attending again. While this will be my second time at ProteoMMX, and one of many conferences I’ve been to, this will be the first for my Guide Dog, Winston.

Chester is one of the best preserved medieval walled cities in the UK. (Photograph of Chester Cross, courtesy of Matty Ring.)

As with previous years, the conference will be preceded by the Quant 4.0 event. Quant 4.0 is a quantitative proteomics and data analysis training course, partly intended as a good introduction to the field for any newcomers, before attending the more detailed lectures at ProteoMMX. Agnès, one of my colleagues, will be attending this event (as well as ProteoMMX) and will be delivering a session on Progenesis QI for proteomics on the Tuesday morning.

Once again, ProteoMMX is offering great opportunities for early career researchers to deliver presentations, as several slots have been reserved for short presentations based on elevated abstracts. There will be plenty of talks from more experienced researchers too – you can check out the full programme on the ProteoMMX 4.0 website.

As well as having the opportunity to hear about current work in this field, I’d like to encourage you to come and speak with us – we’ll be on the Waters table when not in lectures, but I should be easy to spot as I’ll be the one with a golden retriever in tow. Book an appointment now to guarantee yourself a timeslot – we’d love to speak with you, whether you’re already a Progenesis user, or are interested in a demo. We hope to see you soon!

Missing values: what’s the problem?

Missing values are a major problem in LC-MS based discovery ‘omics analysis and could be the difference between a successful research project and a failure. Whether you run a 3 vs. 3 experiment on a model biological system or a much larger clinical study, missing values will adversely affect the results; some expression changes that are actually present in your data will be missed. But why is this? How and why are missing values generated and how do they affect the results? To clarify, let’s look at how discovery ‘omics analysis works.

Biological “noise” and statistical power

In discovery ‘omics we’re looking for differences in the relative quantities of analytes between two or more groups or conditions, such as control vs. treated or healthy vs. diseased. But in all biological systems there is inherent biological variation caused by both genetic and environmental factors, so that the relative quantity of any given analyte ion will vary across samples from different specimens within a given condition. This can be further complicated in clinical studies where there is no control over external factors such as diet, fitness etc. which contribute towards the final phenotype.

This inherent variation can be seen as biological “noise” which we must cut through in order to find the consistent condition-related differences we’re looking for. To do this, we must run multiple biological replicate samples from different specimens of the same species and condition and compare the resulting sample groups to find the analyte ions that are displaying statistically significant differences between conditions. The more biological replicate samples we run, the greater the ability or statistical power of our experiment to find the significantly changing ions.

Missing values

The relationship between the number of replicate samples and the statistical power of the experiment is strongest when data from all the runs is available for statistical analysis. However, due to limitations in the conventional workflow used by most analysis software, this is usually not the case. In the conventional workflow the ions are first detected, one sample at a time, and the detected ions are then “matched” across the samples so that the same ion is being compared between samples. This approach often results in different patterns of detection and different numbers of ions detected on each sample. Ions that are detected on one or more samples may not be detected on others for reasons such as:

  • The ion is actually not present or is below the limits of detection in those samples.
  • The ion may be “missing” due to some instrument error or ionisation issue.
  • The ion may be differently detected due to fragmentation of signal, failure to detect monoisotopic peak, differences in chromatography, or other reasons (see figure below).

Image showing differences in detection pattern for technical replicate samples when co-detection isn't performed

For the above reasons, when we create a matrix of ion quantities for statistical analysis, we often find a number of gaps in the data where ions detected on one or more samples could not be found on others. We call these “missing values”, four of which can be seen in the data from a simple 3 vs. 3 experiment shown below.

Table showing missing values in abundance measurements

Missing values, which are commonly reported as occurring at rates of approximately 20%[1,2] and affecting up to 80% of variables[2] in LC-MS data, decrease the statistical power of our experiment by reducing the number of values available for statistical analysis. What’s worse, the probability of missing values occurring increases with the number of biological replicate samples, so we find that we have to run many more samples in order to gain a relatively small benefit in the statistical power of our experiment. This is a fundamental problem in discovery ‘omics analysis since we may then miss expression changes that are actually present and waiting to be found in our data.
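To get a feel for why more samples mean more gaps, here is a deliberately simplified Python sketch. It assumes every value is missing independently with the same probability, using the roughly 20% rate quoted above. That independence is an assumption made purely for illustration; real missingness tends to depend on abundance and on the sample. The function name is hypothetical.

```python
def probability_complete(n_samples, missing_rate=0.2):
    """Chance that one ion has a value in every sample, assuming each value
    is missing independently with the same probability (a simplification)."""
    return (1.0 - missing_rate) ** n_samples


# Total number of runs in 3 vs. 3, 6 vs. 6 and 10 vs. 10 experiments.
for n_samples in (6, 12, 20):
    print(n_samples, round(probability_complete(n_samples), 3))
```

Under these assumptions, only about a quarter of ions would have a complete set of values in a 3 vs. 3 experiment, and the proportion falls quickly as runs are added.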

Since running more samples only increases the chances of encountering missing values, and samples are not in endless supply, the only real solution is to find a better way to analyse the data.

What is the solution?

The solution used in many ‘omics discovery analysis workflows is to use a combination of data filtering and data imputation. First, the workflow may ask you to define a threshold percentage of values that must be present for each ion, so if you set a 60% threshold for a 10 vs. 10 experiment, all ions with more than 4 missing values in a group are eliminated from the analysis. If these include the potential biomarkers you’re looking for then you’ll miss them entirely!

For the remaining ions, any missing values are replaced with imputed model values, the most commonly used option being the mean of the values present, while zeros are also used in some workflows. These approaches are dangerous since statistics must take full account of both the mean and the variance of the data in each group being compared. Therefore the former approach may produce false positives by reducing the variance, while the latter may produce false positives or negatives by skewing both the mean and the variance.

Another way to see the invalidity of these approaches is to imagine using them in other types of statistical experiment. Let’s say you’re measuring the effect of a high salt environment on the mature height of a certain type of plant. You plant 10 seedlings in high salt and 10 in low salt soil, but due to a problem with the automatic irrigation system, 4 of the low salt plants receive no water and die. Is it valid to insert the mean height of the other 6 low salt plants as a model value for the 4 missing plants? You would essentially be creating 4 virtual plants while artificially reducing the variance in the low salt group.

Real plant measurements vs. imputed plant measurements

Would this be considered an optimal way to perform the statistics? Of course not, yet it’s precisely what is done in some discovery ‘omics analysis workflows and has become so routine that it often isn’t even mentioned in publications[1].
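The filtering threshold and the variance effect of mean imputation can both be sketched in a few lines of Python. The plant heights below are invented purely for illustration, and the function names are hypothetical; nothing here reflects how any particular analysis package implements these steps.

```python
from statistics import mean, variance


def passes_threshold(values, threshold=0.6):
    """True if at least `threshold` of the values in a group are present (not None)."""
    present = sum(v is not None for v in values)
    return present / len(values) >= threshold


def impute_with_mean(values):
    """Replace each missing value (None) with the mean of the values present."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


# Invented heights (cm) for the low salt group: 6 surviving plants, 4 that died.
low_salt = [31.0, 35.0, 28.0, 40.0, 33.0, 37.0, None, None, None, None]

measured = [v for v in low_salt if v is not None]
imputed = impute_with_mean(low_salt)

print(passes_threshold(low_salt))  # 4 missing out of 10 just meets a 60% threshold
print(mean(measured), variance(measured))
print(mean(imputed), variance(imputed))
```

The mean is unchanged by the imputation, but the sample variance shrinks because the four "virtual plants" sit exactly on the mean while still being counted as observations. A test comparing this group with the high salt group would therefore see less spread than was really measured.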

How will this affect your work?

Missing values can have profound implications for your research projects. False negative results are serious enough with the potential to miss important biomarker candidates, but false positives may be even worse, leading to much wasted time, effort and resource investigating false biomarker candidates. Furthermore, it’s worth repeating that all along, the evidence of real biomarker candidates is actually there in your data and waiting to be found. At the outset you carefully design the experiment, select subjects and prepare samples which you run on the best equipment available after painstakingly optimising the running conditions – only to find your results underwhelming or even worse, misleading! So how do you ensure that missing values don’t jeopardise your chances of research success?

The Progenesis co-detection solution

The handling of missing values has been described in the literature as “an absolutely vital step in data pre-processing” and one “to which special consideration should be given”[1]. However, what if the problem of missing values did not exist? What if there was a workflow for LC-MS discovery ‘omics analysis that eliminated missing values and maximised statistical power in experiments of any size without resorting to dubious data imputation practices? This is the Progenesis co-detection solution and I’ll be telling you how it works in my next blog post.

1. Gromski et al, “Metabolites” 2014, 4, 433-452

2. Hrydziuszko and Viant, “Metabolomics” 2012, 8, Supplement 1, 161-174

Why do people buy Progenesis QI when there is freeware available?

It’s an interesting question and there are many of our users out there with various answers. We decided to ask our users some questions about why they bought Progenesis QI and what difference it has made to their research. Here’s what Research Professor Jace W. Jones had to say on the matter:

Please can you briefly describe your area of research?

Jace operating the Synapt G2-S

Our research involves development of mass spectrometry-based platforms that couple biomarker discovery to quantitative validation, from circulating and tissue lipids. In particular, the use of high resolution tandem mass spectrometry to structurally elucidate, identify, and quantify biologically active lipids to further understand disease/injury mechanisms of action and provide insight for drug development targets. To this end, we first design untargeted liquid chromatography tandem mass spectrometry (LC-MS/MS) experiments to identify differentially expressed plasma and tissue-bound lipids using in vivo models. Our discovery–based instrument platform of choice is the Waters UPLC coupled to a Synapt G2-S operated in HDMSE acquisition mode. Our typical LC conditions elute lipids over a 20-minute gradient using a UPLC C18 column. The HDMSE data is acquired in both positive and negative ion modes. Experimental parameters vary depending on the particular in vivo model under study but involve multiple biological replicates per condition, per time point. In addition, quality control samples and addition of internal standards are standard operational procedure. The resulting output from this type of workflow is a tremendous amount of analytical data per sample that ideally generates a list of identified lipids that are differentially expressed between the conditions under study.

What problems did you experience prior to using Progenesis?

The data generated from the UPLC-HDMSE workflow is highly complex and results in 1000s of m/z values being identified by a number of analytical parameters, such as retention time, drift time, accurate mass precursor ions, and diagnostic product ions. In order to expedite biomarker discovery and fully utilise the multidimensional data generated on the UPLC HDMSE platform, we realised there was an immediate need for a bioinformatics solution that could efficiently process multidimensional datasets.

What made you convert to Progenesis QI?

We decided to go with Progenesis QI for its ability to handle multidimensional datasets, especially HDMSE workflows. In addition, a primary goal with our discovery/un-targeted mass spectrometry experiments is to generate lipid markers that can then be pipelined for targeted, high-throughput assays. Progenesis QI is an efficient bioinformatics solution that allows us to make the transition from discovery to validation. The ability to process multi-vendor data was also a major selling point.

What difference has Progenesis QI made to your research?

Progenesis QI enables us to efficiently process multidimensional lipidomic datasets in a systematic and straightforward manner. We can also now process HDMSE data on a single software platform.

One of the biggest differences we have seen is our ability to incorporate more biological replicates at the same time including temporal time points and multiple conditions. This gives us the ability to bolster our statistical significance and conduct experiments where we can evaluate potential biomarkers across time over varied conditions.

Please can you give a specific example of the success that Progenesis QI has helped you to achieve?

Progenesis QI has enabled us to increase our lipidomic workflow while increasing the amount of analytical data per sample. Because our data processing has been streamlined with Progenesis QI, we now spend more time on optimizing chromatography (e.g. orthogonal column chemistries) and mass spectrometry acquisition (e.g. ion mobility with tandem mass spectrometry) for more confident lipid identification.

How will it help you in your future research?

The demand for lipidomic experiments from not only our existing collaborators but also from outside researchers has grown steadily over the past couple years. Progenesis QI has enabled us to keep pace with that demand by allowing us to efficiently and confidently process multidimensional lipidomic datasets. This, in turn, expedites the experimental process of generating potential lipid biomarker candidates.

What advice would you give to a metabolomics/lipidomics scientist struggling with similar problems?

The amount of data generated by metabolomic/lipidomic workflows means a tremendous reliance on data processing. Often, the data processing aspect of ‘omics data is time-consuming and beyond the expertise of the scientist performing the experiments. Consequently, having a bioinformatics solution that is efficient, versatile, and reliable is a valuable investment and allows researchers to focus on optimization of their experimental approach and validation studies for potential targets. I highly recommend the use of Progenesis QI as your bioinformatics solution.

 

If you are a Progenesis QI user and would like to tell us about your research, please contact us – we’d love to hear from you.