Identification scoring in Progenesis QI

With so much information available today, important and helpful material can easily be overlooked. I’d like to take this opportunity to repost this blog post about identification scoring in Progenesis QI, as many of our customers find it very useful in their research and still refer to it today.

One of the advantages of using Progenesis QI is its ability to combine results from multiple search methods and databases. Progenesis QI uses a common scale to score results from all the databases and search methods it supports, so you can compare search results obtained from different search methods. This post explains the scoring method we use in Progenesis QI, and how you can improve your search scores by searching additional dimensions of your data.

Progenesis QI search methods

At the time of writing, Progenesis QI supports these search methods and databases:

Progenesis MetaScope

Searches SDF and MSP files from any source. Supports retention time, CCS, theoretical fragmentation and spectral libraries.

METLIN™  MS/MS Library (requires purchase)

The Waters® METLIN™ MS/MS Library for Progenesis QI contains a local copy of the METLIN database and allows you to search this copy rapidly.

LipidBlast

Searches the LipidBlast MS/MS database provided by Metabolomics Fiehn Lab.

Elemental composition

Produces putative formulae for compounds based on mass, isotope profile, and the Seven Golden Rules.

ChemSpider

Searches the ChemSpider structure database. Supports theoretical fragmentation, isotope similarity filtering, and elemental composition filtering.

NIST MS/MS Library (requires purchase)

Searches the NIST MS/MS library for spectral matches.

You can find out more about each of these search methods in the search methods and databases FAQ. This blog post, however, will focus on how we calculate scores so that identifications from different search methods can be compared.

The Progenesis scoring method

For any given search, up to five properties can contribute to the overall score:

  1. Mass error
  2. Isotope distribution similarity
  3. Retention time error
  4. CCS error
  5. Fragmentation score

Each of these individual scores is on a scale from 0-100. If your search criteria do not include a given piece of data, the score for that piece of data is 0. The overall score is the mean of these 5 scores.

Note that the more search criteria you use, the higher the maximum possible score becomes, as described in the following example.

Example

Suppose we have searched ChemSpider using theoretical fragmentation. For a given compound we find Identification A, with these scores:

Note that the scores for retention time and CCS errors are 0, because ChemSpider does not support searching those properties.

If we then perform a MetaScope search, this time including a CCS constraint, we might obtain the following scores for Identification B:

We have identical scores for the mass error, isotope distribution, and fragmentation. However, we also have an extra piece of information in the CCS score. This provides additional evidence for Identification B, so it is given a higher score than Identification A.

Note that in the ChemSpider case, if an identification scores 100 on all 3 items, it obtains a score of 60. In the MetaScope case, if an identification scores 100 on all items, it obtains a score of 80. So, for each additional piece of data we include in our search, the maximum score increases by 20.
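As a rough illustration of the arithmetic in this example, the following Python sketch averages the five component scores and treats any property that was not searched as 0. The function and key names are hypothetical and chosen for this post; this is not Progenesis QI code.

```python
# Hypothetical names for the five component scores described in this post.
COMPONENTS = (
    "mass_error",
    "isotope_similarity",
    "retention_time_error",
    "ccs_error",
    "fragmentation",
)


def overall_score(component_scores):
    """Return the mean of the five component scores (each 0-100).

    Components missing from the dict are counted as 0, mirroring the rule
    that a property not included in the search contributes a score of 0.
    """
    total = 0.0
    for name in COMPONENTS:
        value = component_scores.get(name, 0.0)
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
        total += value
    return total / len(COMPONENTS)


# A perfect identification on three properties (the ChemSpider case).
three_perfect = overall_score(
    {"mass_error": 100, "isotope_similarity": 100, "fragmentation": 100}
)

# A perfect identification on four properties (the MetaScope case with CCS).
four_perfect = overall_score(
    {
        "mass_error": 100,
        "isotope_similarity": 100,
        "fragmentation": 100,
        "ccs_error": 100,
    }
)
```

Here three_perfect evaluates to 60 and four_perfect to 80, matching the maximum scores quoted above: each additional property that is searched raises the ceiling by 20.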

The component scores

Here we’ll briefly describe how the five component scores that make the final score are calculated.

Mass error, retention time error, and CCS error

These are all functions of the magnitude of the relative error, Δ:

Figure 1: The score profile for mass error, retention time error and CCS error.

For the mass error, Δ is the ppm mass error and N = 4000. For the retention time and CCS errors, Δ is the percentage error, and N = 20.

Isotope distribution similarity score

This compares the intensities of each isotope between observed and theoretical distributions. A total intensity difference of 0 gives a score of 100, which falls linearly to 0 when the total intensity difference is equal to the maximum isotope intensity.
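The linear fall-off described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the total intensity difference is taken as the sum of absolute per-isotope differences, both distributions are assumed to be on the same intensity scale, and the "maximum isotope intensity" is taken from the theoretical distribution. The post does not specify these details, and the function name is hypothetical.

```python
def isotope_similarity_score(observed, theoretical):
    """Score 100 for identical distributions, falling linearly to 0.

    Assumptions (not confirmed by the post): the total difference is the sum
    of absolute per-isotope differences, and the normalising maximum is the
    largest theoretical isotope intensity.
    """
    if not observed or len(observed) != len(theoretical):
        raise ValueError("distributions must be non-empty and of equal length")
    max_intensity = max(theoretical)
    if max_intensity <= 0:
        raise ValueError("theoretical distribution must have a positive maximum")
    total_difference = sum(abs(o - t) for o, t in zip(observed, theoretical))
    score = 100.0 * (max_intensity - total_difference) / max_intensity
    # Differences larger than the maximum intensity are clamped to a score of 0.
    return max(0.0, score)


# Made-up intensities, scaled so the most intense theoretical isotope is 100.
identical = isotope_similarity_score([100, 10, 5], [100, 10, 5])
slightly_off = isotope_similarity_score([100, 20, 5], [100, 10, 5])
very_different = isotope_similarity_score([0, 110, 5], [100, 10, 5])
```

With these made-up values, a total difference of 10 against a maximum of 100 loses 10 points, while a total difference at or beyond the maximum intensity scores 0.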

Fragmentation score

The fragmentation score is more complicated and depends on the fragmentation method used. The FAQs describe how scoring works for theoretical fragmentation and database fragmentation.

Improving identification scores

The best way to improve the scores of your identifications and your confidence in them is to use more search constraints.

37.9/100

In general, most searches will be able to produce a mass error score and an isotope similarity score. With just these two pieces of information, the maximum score for any identification is only 40/100. In this example we’ve identified Warfarin using only mass error and isotope similarity.

55.4/100

Including fragmentation data in your search criteria (either theoretical fragmentation or a fragmentation database) increases the possible score for identifications to 60/100. Here we’ve added theoretical fragmentation to our search parameters.

70.8/80

Finally, if you use an appropriate data source (e.g. an SDF and additional properties file) you can add search constraints for retention time and CCS, giving a maximum score of 100/100. Here we don’t have CCS information but have added retention time to our search parameters for a maximum of 80/100.

Future improvements

Currently Progenesis gives equal weight to the five component scores – mass error, isotope similarity, fragmentation score, retention time error, and CCS error. In some cases, this might not be ideal, so if you have any suggestions for different weightings we’d love to hear from you in the comments section below.

As always, if you have any further questions, check our FAQ or get in touch.

How do you know your raw materials are as they should be?

Agnès Corbin of Nonlinear Dynamics gives us an overview of what’s needed to maintain high manufacturing standards.

We all know that reproducibility is one of the key parameters to master in maintaining a product’s quality. As customers, we all like our favorite products to be of consistently high quality; as manufacturers, we want to preserve our quality and our customers’ satisfaction.

That starts with the supply chain of the raw materials and ingredients used to manufacture a finished product.

A non-conformity, be it cross-contamination, adulteration or degradation, can have huge economic, clinical and sanitary consequences, especially with high-cost raw materials.

With this in mind, you might be interested in the below application note, produced in collaboration with Robertet Group, the world leader in sustainable natural raw materials for fragrance and flavor.

Click on the image to download the Vetiver essential oils application note

It describes how Progenesis QI was used to spot an ‘out-of-the-blue’ potential non-conformity in Vetiver essential oil, using an untargeted metabolomics profiling approach with LC-HRMS and a variety of ionization techniques.

Progenesis QI helped to detect and identify adulteration with Castor Oil, a non-volatile compound, in a new batch of Vetiver Essential Oil. It would have been missed with the use of classical and common GC-MS techniques applied on volatile compounds.

A combination of the LC separation systems UPC² and UPLC, with different ionization sources (ESI, APCI, ASAP), was used to get a better understanding of the product’s composition. Suppliers of natural raw materials must increase their phytochemical knowledge of their products, as per the recent change in the REACH regulation concerning Natural Complex Substances (NCS).

The easy-to-set-up LC-MS techniques for non-volatile compounds can be considered complementary to GC-MS for QC of volatile compounds.

Robertet could have missed out if they hadn’t used the Progenesis QI software.

Are you missing out by not using it?

Please contact us for an evaluation today and we can help you with your research.

The importance of the surfaceome and its interactors

We love to hear how our customers are using Progenesis QI and Progenesis QI for proteomics. It’s great to learn what they are researching and how Progenesis can help them. It’s also nice to find out more about the people behind the research. Our latest blog post features Dr Maria Pavlou and her interesting work with Dualsystems Biotech AG. First, here’s some background about Maria:

Dr Maria Pavlou

Maria Pavlou received her PhD in translational proteomics from the Department of Laboratory Medicine and Pathobiology at University of Toronto, Canada. Upon PhD completion, Maria moved to Switzerland to pursue a post-doctoral fellowship in the Institute of Molecular Systems Biology at the Swiss Federal Institute of Technology (ETH) in Zurich focusing on host-pathogen interactions. In 2017, she joined Dualsystems Biotech AG as a senior scientist and a year later, she was promoted to Chief Scientific Officer leading the research team to develop further the Ligand-based Receptor Capture (LRC) methodology and establish new services.

Now, onto the research:

The importance of the surfaceome and its interactors

If the plasma membrane is considered the gateway through which cells communicate and interact with their environment, then proteins associated with the surface – referred to as the surfaceome – can be seen as the gatekeepers. The surfaceome largely dictates the shape, polarity, differentiation and motility of cells. It also mediates cellular behaviors such as cell-cell communication, self and non-self recognition, and cell signalling. Given the crucial role of surface-associated proteins in every aspect of cellular life, it is not surprising that they are the molecular targets for roughly 70% of FDA approved drugs [1].

The original concept, depicting the plasma membrane as a homogeneous fluid bilayer with freely diffusing proteins, has evolved into one depicting a highly organized and crowded mosaic of interacting lipids and glycoproteins. This higher organization modulates the biological processes occurring on the cell surface, exemplified by receptors being active only when they form dimers, hetero-dimers or higher order oligomers [2]. Interactions of proteins in the cell membrane of the same cell (cis), and interactions with proteins of neighboring cells, the extracellular matrix and circulating ligands (trans) are collectively referred to as extracellular protein-protein interactions (ePPIs).

Elucidating ePPIs in a systemic fashion is pivotal to gain a better understanding of the surfaceome function. More specifically, identifying the targets of key ligands on the cell surface provides valuable mechanistic information about signal transduction, drug action or off-target effects. For instance, pathogen or growth factor interactions are important for developing novel therapies. Additionally, numerous ligands exist – both biologics and small molecules – involved in biological functions mediated at the cell surface through still unknown protein targets.

Towards target identification, an advanced cell-based chemo-proteomic approach has been developed, namely ligand-based receptor capture (LRC) [3-4]. In this approach, the endogenous receptor repertoire of a given cell serves as an existing bait library that can be probed for ligand interaction. The key component of the LRC methodology is a trifunctional compound (TriCEPS or its latest development named HATRIC [3-4]) that utilizes the extensive glycosylation displayed by the majority of cell surface proteins to capture receptor interactions on living cells. Experimentally, the first arm of TriCEPS is conjugated with the primary amines of a ligand and the conjugates are added to living, mildly oxidized cells. There the ligand binds to its target(s) and the second arm of TriCEPS is covalently crosslinked to the glycans of the binding partner. The third arm facilitates target purification for mass spectrometric analysis.

In a typical LRC-TriCEPS experiment, at least two treatment arms are performed in parallel: one with the ligand of interest and a second with a control ligand (that is, a ligand with a known target). Upon identification, the relative abundance of cell surface proteins in the ligand samples is compared with that in the control samples using MS1-based label-free quantification. Randomly identified cell surface proteins are expected to have equal abundance in both samples, whereas the corresponding receptors are found enriched in the ligand sample.
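To make the comparison concrete, here is a minimal Python sketch of one common way to express enrichment: the log2 ratio of mean abundances between the ligand and control arms. It is an illustration only. The function name and abundance values are hypothetical, and the sketch does not reproduce the statistics that Progenesis QIP itself runs.

```python
import math
from statistics import mean


def log2_fold_change(ligand_abundances, control_abundances):
    """Log2 ratio of mean abundance in ligand samples versus control samples.

    A value near 0 means equal abundance in both arms (expected for randomly
    captured surface proteins); a clearly positive value suggests enrichment
    in the ligand arm (expected for a true receptor).
    """
    ligand_mean = mean(ligand_abundances)
    control_mean = mean(control_abundances)
    if ligand_mean <= 0 or control_mean <= 0:
        raise ValueError("mean abundances must be positive")
    return math.log2(ligand_mean / control_mean)


# Hypothetical abundances for two proteins across replicate samples.
background_protein = log2_fold_change([500, 500], [500, 500])
candidate_receptor = log2_fold_change([800, 1200], [250, 250])
```

In this made-up example the background protein has a log2 fold change of 0, while the candidate receptor is fourfold more abundant in the ligand arm (log2 fold change of 2). In practice a significance test across replicates would accompany any such ratio.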

Progenesis QIP gives the user full control but does not require advanced computational skills

Progenesis QI for proteomics (QIP) has been the workhorse when it comes to data analysis. Performing MS1-based label-free quantitation in Progenesis is extremely straightforward through an intuitive user-friendly interface.

The alignment of features is performed by sophisticated algorithms, but at the same time the software lets the user visually inspect the whole procedure. This is extremely useful as the user has full control of the data and a better understanding of how the samples are processed. It can also reveal technical issues related to sample quality or to the liquid chromatography separation prior to mass spectrometric analysis. Notably, proper feature alignment is pivotal for robust quantitation. Moreover, the feature-picking algorithm (peak picking) has been developed to minimize missing values and therefore the need for imputation, another asset for robust quantitation.

Through the various filtering options, the user can eliminate features that are not of interest (such as polymers or contaminants) focusing on what really matters. The QC metrics tab provides a qualitative overview of the experiment giving the opportunity (once more) to assess whether the LC and MS parameters used were optimal.

Upon protein inference and calculation of relative abundances, the user can easily review protein characteristics (such as number of peptides, peptide sequence and modifications, expression profiles, see figure 1) and confirm results or flag outliers. Once more, this option gives the user full control over the data and eliminates, to a great extent, experimental artefacts.

LRC-TriCEPS analysis to identify the receptors of Insulin and Transferrin on HEK293 cells; a screenshot of Progenesis QIP.

Two LRC experiments were performed using insulin and transferrin as ligands of interest on HEK293 cells with receptor capture at two different pH (6.5 and 7.4).

(A) Following the intuitive and straightforward Progenesis pipeline, the identification and quantitation (MS1-based) was completed within 6 hours.

(B) A total of approximately 300 surface proteins were identified and quantified across all samples. Transferrin receptor (TFR1), the known target of Transferrin, was also identified with roughly 40 unique peptides.

(C) Using the protein filter the relative abundance of TFR1 across the four conditions was visualized; TFR1 is clearly enriched in the Transferrin samples.

(D) For more detailed information the user can check the quantitation of every peptide identified and spot any irregularities.

(E) The user can use the statistics run by Progenesis or export the data for post-hoc analysis.

As statistical testing is incorporated in the software, the final outcome of an analysis immediately shows which proteins are significantly regulated. However, there is still the option to export all necessary information in order to perform post-hoc statistical analysis using different tools. This greatly increases the user’s flexibility.

Finally, Progenesis QIP is readily scalable when it comes to number of samples and performs analysis in a time-efficient manner, allowing for complete label-free quantitation in the course of a working day. This is extremely important given that data analysis is usually the beginning of a series of experiments aiming to verify and interpret the identified quantitative differences. It provides a variety of different plots and graphs that can be readily used for publications or reports.

In summary, the analysis of LRC-TriCEPS data with Progenesis QIP offers unique advantages. The software is user-friendly and intuitive, so it can be used by researchers experienced in data analysis as well as by users who are just starting or do not perform data analysis daily. Progenesis QIP provides the user with full control over data analysis, which is very important for spotting and resolving experimental artefacts and for understanding how the final outcome is reached. At the same time, the sophisticated algorithms provide high quality label-free quantitation and robust results. Finally, the visualization features generate high quality graphs that can be used to communicate the results of each study.

1. Uhlén M, et al. Tissue-based map of the human proteome. Science. 2015.

2. Milligan G. G protein-coupled receptor dimerisation: molecular basis and relevance to function. Biochim Biophys Acta. 2007;1768(4):825–35.

3. Frei AP, Moest H, Novy K, Wollscheid B. Ligand-based receptor identification on living cells and tissues using TRICEPS. Nat Protoc. 2013;8:1321–36.

4. Sobotzki N, et al. HATRIC-based identification of receptors for orphan ligands. Nat Commun. 2018;9:1–16.

Thank you to Maria for a very interesting blogpost. Finally, please get in touch:

• If you are a user with an interesting research project using Progenesis; we are keen to share user stories via our blog.

• If you would like to try Progenesis on your own data

Thank you

Acknowledgement: Maria Pavlou, PhD, Paul Helbling, PhD

A Progenesis QI workflow in Exposomics

Following on from the previous post about our 3 Progenesis QI lunchtime presentations at IMSC 2018, we are proud to present the talk given by Emilien Jamin from Toxalim, the Research Centre in Food Toxicology, Toulouse University, available to view here.

Emilien’s work in contaminant discovery and analysis shows how Progenesis QI can be used very effectively for untargeted analysis in the Exposomics field. This workflow goes beyond suspect screening, which requires prior knowledge. Emilien uses several examples to show how Progenesis QI was used to discriminate between different populations, and finally touches on a proof of concept on lipid peroxidation.

You can view the talk for yourself and you can read an overview of his presentation below.

Metabolomic profiling of reactive metabolites in toxicology by MSE and Progenesis

Emilien Jamin, Robin Costantino, Jean-François Martin, Françoise Guéraud, Laurent Debrauwer

Toxalim (Research Centre in Food Toxicology) Toulouse university, INRA, ENVT, INP-Purpan, UPS, F-31027 Toulouse, France.

Axiom Platform, MetaToul-MetaboHUB, National Infrastructure for Metabolomics and Fluxomics, F-31027 Toulouse, France

In food safety, current exposure assessment approaches are based on food consumption data crossed with food contamination data or biomonitoring data. This allows evaluating exposure only in a targeted way on a few families of compounds. Based on our previous results in exposomics [1], food or environmental toxicology should focus on the exposure to a mixture of compounds (contaminant cocktails), mostly at low doses, and in an untargeted way to detect/identify unknown compounds. And among these numerous known and unknown metabolites, it seems a priority to focus on potentially toxic compounds.

In this context, we developed an untargeted method using high resolution mass spectrometry coupled to liquid chromatography to specifically profile electrophilic metabolites, in parallel with a classic untargeted metabolomic study. This allows the study of the exposure to potentially toxic compounds on one hand, and the study of the effects of this exposure on the endogenous metabolites on the other hand. More precisely, we used the MSE mode of a Synapt G2-Si mass spectrometer to detect all the metabolites displaying a neutral loss specific to metabolites conjugated with mercapturic acid. Data from MSE and from untargeted HRMS analyses were processed with Progenesis QI, to highlight discriminant reactive metabolites, as well as endogenous metabolites.
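The idea of flagging metabolites by a characteristic neutral loss can be sketched in Python as below. This is an illustration under assumptions, not the authors' processing method. The default neutral loss of 129.0426 Da (C5H7NO3, commonly reported for mercapturic acid conjugates in negative ion mode) is not stated in the abstract, and the function name, the 10 ppm tolerance and the m/z values are hypothetical. Singly charged ions are assumed.

```python
def has_neutral_loss(precursor_mz, fragment_mzs,
                     neutral_loss=129.0426, tolerance_ppm=10.0):
    """Return True if any fragment sits at precursor m/z minus the neutral loss.

    Assumes singly charged ions. The default neutral loss is an assumption
    (C5H7NO3), not a value taken from the abstract above.
    """
    expected_mz = precursor_mz - neutral_loss
    if expected_mz <= 0:
        return False
    tolerance = expected_mz * tolerance_ppm / 1e6
    return any(abs(mz - expected_mz) <= tolerance for mz in fragment_mzs)


# Hypothetical precursor/fragment lists, for illustration only.
flagged = has_neutral_loss(318.1380, [143.1070, 189.0954])
not_flagged = has_neutral_loss(318.1380, [143.1070, 200.0000])
```

The first hypothetical precursor is flagged because one fragment lies 129.0426 below it within tolerance; the second is not.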

As a proof of concept, this approach has been applied to the study of different groups of rats fed diets containing various oils. According to our previous results on lipid peroxidation [2], these diets led to the production of different aldehydes conjugated to mercapturic acid. The best known is DHN-MA, which corresponds to the mercapturate conjugate of 4-hydroxynonenal (4-HNE) and is commonly used as a biomarker of lipid peroxidation [2]. Using our methodology, we were able to detect, without a priori knowledge, dozens of mercapturate conjugates, including DHN-MA and other known conjugated aldehydes. Furthermore, our approach also allowed the detection of conjugates of unexpected aldehydes, and of other chemical classes, for which putative identifications have been proposed based on complementary structural analyses. Interestingly, multivariate statistical analyses of the HRMS signals carried out on the mercapturate conjugates yielded a better characterization of the studied animal groups compared to results obtained from a classic untargeted metabolomic approach.

[1] Jamin E.L. et al. Anal Bioanal Chem (2014) 406:1149–1161

[2] Guéraud F. Free Radic Biol Med (2017) 111:196-208

Progenesis QI is a powerful tool for contaminant analysis and has been used in the food, cosmetics, natural products, chemical materials, sports doping, biopharma, metabolomics and proteomics fields.

Why not download the software and see how it can help you in your research? Progenesis QI or Progenesis QI for proteomics

Can Progenesis QI impact your research project?

At IMSC 2018, we were lucky to have not one, not two, but three researchers give their presentations at our Progenesis QI lunchtime seminar.

Progenesis–Three personal accounts showing the power of Progenesis QI

  • Untargeted metabolomics using Progenesis QI for small molecules: Developing ion-chromatography-mass spectrometry for the investigation of cancer metabolism – James S.O. McCullagh, University of Oxford, UK
  • Metabolomic profiling of reactive metabolites in toxicology by MSE and Progenesis – Emilien Jamin, Toxalim (Research Centre in Food Toxicology) Toulouse university, INRA, France
  • Novel strategies for discovery of cardiovascular biomarkers in human plasma – Donald JL Jones, Leicester Cancer Research Centre, RKCSB, University of Leicester, UK

These were recorded so we’d like to draw your attention to the interesting and varied presentations over the next few blog posts.

As one of the presenters is awaiting publication, we will present these in reverse order, starting with a lively 23-minute presentation by Prof Don Jones of the University of Leicester.

Below is a short written summary of Don’s talk.  Even better, watch it for yourself and learn which features of Progenesis QI for proteomics Don found so helpful in this ambitious project.  It really is 23 minutes well spent!

Screenshot of the title page for the talk

Novel Strategies for Discovery of Cardiovascular Biomarkers in Human Plasma

Donald JL Jones (1,2), Sanjay Bhandari (2), Paulene Quinn (2), Jatinderpal Sandhu (2) and Leong L Ng (2)

(1) Leicester Cancer Research Centre, RKCSB, University of Leicester, Leicester, LE2 7LX, United Kingdom

(2) Department of Cardiovascular Sciences and NIHR Leicester Cardiovascular Biomedical Research Unit, Glenfield Hospital, Leicester, LE3 9QP, United Kingdom

Background: The search for blood-based biomarkers is particularly compelling in the cardiovascular clinical arena. Whilst understanding the genetic basis of cardiovascular disease will provide a clear indication of risk, phenotypic markers represent the pathological changes that occur during disease processes. Methods for investigating the plasma proteome have ostensibly relied on complex pre-analytical protocols that are expensive and limit throughput.

Methods: Samples from 100 coronary heart disease patients and 20 healthy controls were analyzed on the SYNAPT G2-Si, using label-free data-independent acquisition LC-MS with ion mobility optimized (HDMSE). Samples were treated with Calcium Silicate matrix (CSM). Raw data was then analyzed using Progenesis QI for Proteomics. Models of panels of markers were developed using SPSS and RapidMiner.

Results: From 50 µL of plasma, in excess of 1800 proteins are detected that can be reliably observed between samples. Of these, >1100 are quantified. The data shows high reproducibility with known differences predictably demonstrated. New markers are revealed which can be strongly aligned with potential novel mechanisms of coronary artery disease (CAD). The method is shown to be highly reproducible.

Conclusion: We demonstrate that CSM provides sufficient coverage to enable single shot analysis of plasma, historically a very challenging proteomic sample to analyze, and can provide potential markers for CAD which could feasibly be extended to several classes of disease. This provides a method that can run alongside other omic technologies to profile large-scale numbers of patients individually and thus usher in a new era of precision medicine. Importantly, there are advantageous savings to be made in terms of cost and throughput, which mean that for the first time large scale cardiovascular cohorts, conducted in a realistic timeframe, can be analyzed using proteomics.

If you would like to try the Progenesis QI software on your own data then please don’t hesitate to get in touch.

Acknowledgments

Professor Donald JL Jones

Can Progenesis adapt to your application?

Adaptability is a critical factor in research success.  How adaptable is the Progenesis QI and Progenesis QI for proteomics software? The answer is very adaptable. You’ll be surprised to see what different market segments and applications the software has been used for.

Food and environment

In this market sector, the software has been used for food authenticity, which is now a major concern across the globe. The blogpost “Don’t get stung by your Manuka Honey!” highlights the use of the Progenesis QI software in laboratories that test food to examine the differences between genuine and fraudulent products. For further reading, “Bringing the analysis to the sample: Progenesis QI helps beat food fraud” discusses Professor Chris Elliott’s webinar where he talks about many instances of food fraud including milk, horse meat, oregano, and cumin fraud. Chris speaks about how he works with Progenesis QI to analyze data produced at the Institute for Global Food Security.

Figure 1. Review of standardized abundance profiles and assignment of identity for three markers of Manuka honey as displayed in Progenesis QI software.

Natural products

As with the Food and Environment sector, adulteration is also being seen in the natural products market sector. This blogpost aptly named “Just because it’s natural doesn’t mean it’s safe” describes the webinar on Authenticated Herbal Supplements with a metabolomics approach.  This looks at correctly identifying the marker compounds of each species of Hoodia, Terminalia, and Chamomile to ensure they are the correct substance. The work was carried out with collaboration from Waters and the University of Mississippi. For further reading you can read the relevant paper: Metabolic Profiling of Hoodia, Chamomile, Terminalia Species and Evaluation of Commercial Preparations Using Ultrahigh-Performance Liquid Chromatography Quadrupole-Time-of-Flight Mass Spectrometry.

Figure 2: Screenshot of the Authenticated Herbal Supplements with a metabolomics approach webinar.

Chemical and materials science

What value can Progenesis QI provide in the world of co-polymer characterization? Polymeric materials are being used more and more in everyday life. They are being used as structural materials for cars and airplanes, fabrics for clothing, and packaging materials for food and medicine, to name but a few. These new materials must be properly characterized in order to manufacture the polymers reproducibly and thus achieve the required characteristics.

This blogpost explores using Pyrolysis-Gas Chromatography/Mass Spectrometry (Py-GC/MS) coupled with Progenesis QI’s multivariate analysis of the data, to provide novel insight into the polymer structure.

Figure 3. Following replicate analysis of the two co-polymer samples the data was aligned and peak picked using the workflow presented by Progenesis QI. The resulting data was analyzed using an OPLS-DA model to compare the samples. The scores plot resulting from that analysis is shown here where it can be seen that the two co-polymer types are clearly discriminated.

Biopharma

An exciting application is the use of Progenesis QI for proteomics for Host Cell Protein (HCP) analysis. The blogpost “Progenesis QI for proteomics speeds up biopharmaceutical purification!” explains how using the software greatly simplifies the user interaction with HCP datasets. Extracting mass chromatograms and calculating peak areas for a multitude of peptide precursors (like the 113 peptides from the MIX-4 spectral library) can be a tedious process. The data from the individual sample replicates need to be compared in order to obtain the peptide level HCP trends. The HCP peptide level results then need to be translated into HCP protein levels. All these steps are automated and are performed rapidly in the Progenesis QI for proteomics software without significant user intervention. This saves a significant amount of time spent on data analysis, producing rapid results.

Figure 4. Four HCPs (highlighted by red arrows) were identified by Progenesis QIP in the highly-purified NIST mAb.

Proteomics

Dr. Maartens Dhaenens shared his expertise in using the software when he presented his webinar “Proteomics: A peptides Journey to Emergence”. The blogpost of the same name explains his thoughts on how all the different levels of complexity in proteomics data can be unraveled by the different functionalities of Progenesis QI for proteomics. He extols Progenesis QI for proteomics as a powerful tool for resolving the many challenges in automated data analysis and protein inference.

Figure 5. Screenshot of the title for the Webinar

Metabolomics

We have many happy customers, including Dr. Daniel Carrizo, who uses the Progenesis QI software for his metabolomics research. Daniel uses the Progenesis QI software to assess exposure to persistent organic pollutants (POPs). Daniel comments “The data generated is too complex to analyze without specific software like Progenesis. When you have 3000 or 4000 ions of interest and 300 samples, it is impossible to manage this amount of data with normal software. I have found Progenesis QI is robust and easy to use and the technical support is excellent.” He goes on to say “The most important aspect is the power of the analysis and robustness of the data generated, as well the easy design for setting up experiments within the software.” If you would like to see Daniel’s full blogpost you can read it here.

Customers

Happy customers mean you have a product to be proud of. We are certainly very proud of the Progenesis QI and Progenesis QI for proteomics software. This blogpost entitled “What are our customers saying?” highlights how our customers have used the software successfully in their day-to-day research. Comments like “with Progenesis QI for proteomics, we practically started getting publishable data within the first few hours.” from Dr. Lam Yun Wah at the Department of Biology and Chemistry, City University of Hong Kong and “Progenesis QI for proteomics provided me an excellent tool to profile hundreds of proteins with incredible precision to map proteins in cellular compartments of the photoreceptor cells of retina.” from Nikolai Skiba, Assistant Professor at the Albert Eye Research Institute, Duke University in the US, keep us inspired, knowing we are making a difference in people’s research.

More information

Learn more from the resources we have on our website. This useful summary, “Helping you to help yourself”, highlights all the valuable resources we have available, giving you the information you need to use the software more successfully. With user guides, a starter pack and FAQs, you can instantly get answers to the more common questions asked about the software.

Were you surprised by the breadth of Progenesis applications? It’s great to look back but even better to look towards the future. From everyone here at Nonlinear Dynamics, we wish you a successful New Year and look forward to hearing how the Progenesis software has helped you. If you haven’t used the software and would like to, then simply get in touch and we’ll get you started.

Progenesis QI for proteomics speeds up biopharmaceutical purification!

Most recombinant protein biopharmaceuticals are produced in specially designed expression systems, typically using CHO (Chinese Hamster Ovary) cells. Many CHO proteins are expressed along with high amounts of the desired biopharmaceutical and need to be removed by multi-step purification processes. Residual host cell proteins (HCPs) are low-level (1-100 ppm) process-related impurities that might be present in protein biopharmaceuticals even after extensive purification. HCPs could produce unwanted immunogenic responses in patients, reduce the efficacy or the stability of the drug, or be responsible for drug degradation. For these reasons, the regulatory agencies require that all HCPs are identified and quantified prior to drug approval. The biopharmaceutical industry relies on ELISA (enzyme-linked immunosorbent assay) for measuring the total HCP concentration expressed in ppm (or ng HCPs/mg biopharmaceutical). Mass spectrometry-based HCP analysis has emerged in recent years as a powerful alternative to ELISA [1-4] because it provides more extensive (proteome-wide) HCP coverage and is able to measure individual HCP levels.

Any LC-MS workflow for HCP analysis has three major goals: 1) identification of unknown HCPs; 2) reporting of the individual HCP quantification results expressed in ng HCP/mg biopharmaceutical (ppm concentrations); 3) monitoring of the HCP levels across multiple biopharmaceutical preparations. To accomplish these goals, two different LC/MS assays are required as illustrated by the workflow displayed in figure 1.

Figure 1. Workflow of the HCP analysis.

The Discovery HCP assay is performed in SONAR™ mode in order to identify the unknown HCPs present in the purified biopharmaceutical, and Progenesis QI for proteomics (QIP) is used for a proteome-wide database search to reveal the identity of these HCPs. For example, in the case of the NIST mAb, four HCPs and three spiked proteins (ADH, PHO and BSA) were identified, as illustrated by the screenshot displayed in Figure 2:

Figure 2. Four HCPs (highlighted by red arrows) were identified by Progenesis QIP in the highly-purified NIST mAb.

A different type of LC-MS assay is required when multiple samples, produced from the bioprocessing of the same protein biotherapeutic, need to be analyzed with increased sample throughput, for the purpose of investigating HCP clearance. In this situation, the information gained from the HCP Discovery assay can be used to speed up the HCP identification and quantification process.

Using Progenesis QIP, the MS/MS fragmentation spectra of HCP peptides identified by SONAR acquisition can be assembled into spectral libraries, containing peptide precursors, charge states, retention times and relevant fragment ions. A list of HCP peptides sequenced from the NIST mAb is presented in Figure 3.

Figure 3. HCP peptides identified in the NIST mAb using SONAR acquisition.

The MS/MS fragmentation spectra of these peptides were assembled in a spectral library using Progenesis QIP. Peptides are sorted in increasing order of their precursors. Two MS/MS spectra were recorded for each of the four highlighted peptides, following fragmentation of their doubly and triply charged precursors.

Higher-throughput HCP Monitoring assays relying on 30 min peptide separations and employing MSE data acquisition are used for screening biopharmaceutical samples taken at every step of the purification process. The entire LC/MSE dataset is searched with Progenesis QIP against a spectral library for HCP identification, quantification, and monitoring.

To simulate an HCP monitoring assay, three protein digest standards (ADH, yeast alcohol dehydrogenase; BSA, bovine serum albumin; and PHO, rabbit phosphorylase b) were spiked at four different concentration levels in four NIST mAb digests, while one protein digest (CLP-B, E. coli chaperone protein) was spiked at the same concentration in all four samples. The LC/MSE data was searched in Progenesis QIP against a spectral library of 113 SONAR™ fragmentation spectra of MIX-4 peptides (ADH, BSA, CLP-B, and PHO). Spiked proteins were easily tracked down to the lowest spiked levels (~20 ppm) across all five samples (20 LC/MSE runs), as exemplified by the graphs shown in Figure 4.

Figure 4. (A) Example of protein level results obtained for the HCP Monitoring assay: the levels of spiked ADH were accurately measured in five NIST mAb samples; (B) Peptide level results of the HCP monitoring assay.

Eleven ADH peptides showed identical trend plots across all 20 runs. The four spiked samples, identified by letters A-D in this figure, contained ADH, BSA, PHO and CLP-B protein digests spiked into the NIST mAb digest. The sample labeled “Blk” corresponds to the non-spiked NIST mAb digest. Each sample was analyzed with four replicates.

Protein measurements were obtained from multiple peptides and excellent correlation was obtained between the spiked and measured fold changes with RSDs under 10% for all measurements.
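As a rough illustration of the RSD figure quoted above, the sketch below computes a relative standard deviation (sample standard deviation divided by the mean, as a percentage) for one protein across four replicate runs. The abundance values and the function name `rsd_percent` are hypothetical and are not taken from the HCP dataset or from Progenesis QIP.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation: sample SD divided by the mean, in percent."""
    m = mean(values)
    if m == 0:
        raise ValueError("RSD is undefined when the mean is zero")
    return 100.0 * stdev(values) / m


# Hypothetical abundances for one protein across four replicate runs.
replicates = [1020.0, 980.0, 1005.0, 995.0]
rsd = rsd_percent(replicates)
print(f"RSD = {rsd:.2f}%")
```

A value under 10% for every protein and sample is the kind of result the paragraph above describes.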

Progenesis QIP greatly simplifies the user interaction with HCP datasets. Extracting mass chromatograms and calculating peak areas for a multitude of peptide precursors (like the 113 peptides from the MIX-4 spectral library) can be a tedious process. In addition, the data from the individual sample replicates need to be compared in order to obtain the peptide level HCP trends. Finally, the HCP peptide level results need to be translated into HCP protein levels. All these steps are automated and performed rapidly in Progenesis QIP without significant user intervention, which saves a significant amount of time on data analysis.

The experiment with spiked protein digests described above can be easily performed as a QC test to demonstrate the capability of the entire LC/MS platform to provide reliable HCP clearance results in a timely fashion.

Our collaborators from EMD Millipore asked us to test this capability for “real” mAb samples: they wanted to know which one of their four SCX (strong cation exchange) purification protocols produced “cleaner” purifications, with lower HCP content. The results are shown in Figure 5 and one of their protocols indeed worked better than the other three.

Figure 5. Peptide level monitoring of three HCP peptides across five mAb preparations (one Protein A eluate and four SCX (strong cation exchange) chromatographic purifications using four different protocols, A-D). As illustrated here, Protocol D provided the best results.

Progenesis QIP allows purification laboratories to develop and test novel purification procedures in a relatively short time.

References:

  1. Doneanu CE, Anderson M, Williams BJ, Lauber MA, Chakraborty A, Chen W. Enhanced Detection of Low-Abundance Host-Cell Protein Impurities in High-Purity Monoclonal Antibodies Down to 1 ppm Using Ion Mobility Mass Spectrometry Coupled with Multidimensional Liquid Chromatography, Anal Chem, 2015, 87, 10283-10291.
  2. Huang L, Wang N, Mitchell CE, Brownlee T, Maple SR, De Felippis MR. A Novel Sample Preparation for Shotgun Proteomics Characterization of HCPs in Antibodies, Anal Chem, 2017, 89, 5436-5444.
  3. Weibin C, Doneanu CE, Lauber MA, Koza S, Prakash K, Stapels M, Fountain KJ. Improved Identification and Quantification of Host Cell Proteins (HCPs) in Biotherapeutics Using Liquid Chromatography-Mass Spectrometry, book chapter in Technologies for Therapeutic Monoclonal antibody characterization, Vol 3, ACS Symposium Series, 2015, 357-393.
  4. Doneanu C, Lennon S, Anderson M, Reah I, Ross M, Anderson S, Morns I, Yu YQ, Chakraborty A, Denbigh L, Chen W. A Comprehensive Approach for HCP Identification, Quantification and Monitoring Based on a Single Dimension (1D) LC Separation, Waters application note 720006262en, 2018.

Acknowledgments

Catalin Doneanu, Waters Corporation, Milford, MA, USA

What a ConFirenze! Explore. Dream. Discover.

When I heard that IMSC was in Florence I stuck my hand up to be on the exhibition booth.  It’s a city I’ve wanted to visit for years.

Ironically, we were so busy I didn’t get to look around! My colleague, Mark Bennett, and I were attending for Nonlinear and, from the opening reception, we had a lot of interest in Progenesis.

This varied from researchers in nuclear physics to researchers analysing skin for anti-malarial research. Progenesis really does cover a large breadth of scientific research, its strength being the ability to seek out differences without identification.

Quantify, then identify, is fundamental in the Progenesis workflow.

Mark Bennett demonstrating Progenesis to interested researchers

In addition to busy ‘booth traffic’ and demonstrations, we had a workshop on the Thursday lunchtime.

Three speakers gave interesting accounts of how Progenesis QI and Progenesis QI for proteomics have helped them in their research:

Progenesis3 – Three personal accounts showing the power of Progenesis QI

  • Untargeted metabolomics using Progenesis QI for small molecules: Developing ion-chromatography-mass spectrometry for the investigation of cancer metabolism – James S.O. McCullagh, University of Oxford, UK
  • Metabolomic profiling of reactive metabolites in toxicology by MSE and Progenesis – Emilien Jamin, Toxalim (Research Centre in Food Toxicology), Toulouse University, INRA, France
  • Novel strategies for discovery of cardiovascular biomarkers in human plasma – Donald JL Jones, Leicester Cancer Research Centre, RKCSB, University of Leicester, UK

All three talks were engaging and we had good attendance.  The speakers had questions for each other and were engaged in conversation long after the workshop had finished.

Don Jones, Emilien Jamin and James McCullagh in conversation post workshop

The workshop was so uplifting for me; it’s great when you hear customers picking out things that they really like in your product.

Don’t worry if you missed these presentations: we recorded the whole session and the recordings will be released soon!

If you’d like to be notified about these recordings, please email us and we’ll inform you when they become available.

So then it was on to the conference dinner, which completely exceeded expectations (which were not low).

We gathered at the Villa Viviani, located on the hill of Settignano, with a perfect view of Florence, which sits nestled in a natural bowl, just as the sun was setting.

The villa was home to Mark Twain for a time and one of my favourite quotes of his, “Explore. Dream. Discover.” seemed particularly apt for the conference and the wonderful evening.

It was so beautiful to stand there, listening to the live band and watching the dusk progress, an unforgettable memory.

Later, we were participants in an Italian birthday song game that had us up and down out of seats non-stop.  The next day my legs were really aching!

The sun setting over Florence at the gala dinner

Finally, we all dispersed home, the Progenesis researchers we spoke to heading back to their labs all over the globe.

It was a great ConFirenze!! Now I have to go back to Florence and see it properly, as a tourist.

When is a Biomarker not a biomarker? (part 2)

In my last blog, I discussed interpretation of data from two model experiments using univariate statistical analysis (p-values and false discovery rates). It was concluded that the use of p-values alone can potentially lead to dramatic misinterpretation of results and many false discoveries, so false discovery rates (FDRs) from q-values are a vital tool to avoid this. In this blog I’ll use the same model experiments to discuss multivariate statistical analysis, specifically Orthogonal Projections to Latent Structures Discriminant Analysis (OPLS-DA), a method commonly used to extract biomarkers in discovery metabolomics analysis.

First, a brief re-cap of the details of our model experiments. Experiment 1 consists of 12 human urine samples in conditions B and C (Fig. 1, (i)) where C is normal patients and B is patients who’ve been given a high dosage of a mixture of analgesic drugs. In this case the PCA scores (samples, shown as coloured dots) show tight clustering within the conditions, indicating some highly significant differences between the conditions resulting from the presence of the drugs or their metabolites in condition B. Experiment 2 comprises the same data, but re-arranged into two “mixed” conditions called BC and CB (Fig. 1, (ii)) for which the PCA scores show no condition-related clustering, indicating (as we’d expect) that there are no differences between the conditions. After automatic processing of the data through Progenesis QI (data alignment, co-detection and adduct deconvolution) there were 5,333 compounds detected across all 12 samples with no missing values.

Figure 1: Experimental design and PCA bi-plot for model experiment 1 (i) and experiment 2 (ii)

As mentioned in part 1 of this blog, PCA is a non-discriminant type of analysis which takes no account of the conditions of the experiment and just arranges the samples (scores) and compounds (loadings) according to how similar (or different) their expression behaviour is. In the case of the scores, therefore, samples in which the compounds exhibit similar expression behaviour are clustered closer together while those with less similar behaviour are further apart on the plot. The loadings are arranged similarly according to their expression behaviour. In addition, the clustering of scores and loadings is linked, in that compounds (loadings) which show significant up-regulation in a condition are clustered closest to the samples (scores) of that condition (see Figure 1). PCA is also useful for identifying outliers in the data.

In contrast to PCA, OPLS-DA is a “discriminant” analysis which does take account of the conditions of the experiment and builds a model that best represents the differences between the conditions. The data can then be plotted in a way which represents how well each sample and compound fits the model. From Progenesis QI, our experiment 1 data can be automatically exported into the EZinfo statistical package in which OPLS-DA can be performed before importing the results back into Progenesis for further review. In EZinfo we can easily create our OPLS-DA model and initially view a bi-plot which looks quite similar to PCA (Figure 2). However, instead of representing degrees of variance in the data, the axes now represent values related to the model of the difference between the conditions and how the scores (samples) and loadings (compounds) fit into the model. So, how does this type of analysis help us to extract good candidate biomarkers from our experiment?

Figure 2: OPLS-DA bi-plot for experiment 1

If we change the data scaling from “unit variance” (where each compound abundance is divided by the compound standard deviation) to “Pareto” (where it’s divided by the square root of the standard deviation) we can create an “S-plot” of the compounds (loadings), which takes its name from the characteristic S-shape in which the “best” biomarkers are located towards the extremes of the plot. In the S-plot (Fig. 3), the vertical axis defines the p(corr) correlation to the model while the horizontal axis defines the p(1) contribution to the variance between the conditions. This means that compounds located towards the vertical extremes conform best to the B vs C difference model and are essentially the compounds where the difference between conditions B and C is most clear, while those located towards the horizontal extremes contribute most to the overall variance between the conditions, meaning they are highly abundant, have a large fold change, or both. In the case of experiment 1, we know there to be many expression-changing compounds, mainly up-regulated in condition B, where the drugs were administered. The S-plot supports this in that there are many compounds located towards the lower left extreme of the plot, indicating they are up-regulated in condition B, while there are very few located towards the other extreme where the “up in C” compounds should be. We can see more clearly how the location of compounds on the S-plot relates to their expression behaviour by selecting groups of them and importing them back into Progenesis QI as “tagged groups”, which enables us to select them using filters and visualise their expression behaviour using the Progenesis QI tools. In this case, 4 groups of compounds have been selected, indicated as A, B, C and D in Figure 3.

Figure 3: S-plot for experiment 1
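The difference between the two scalings mentioned above can be sketched in a few lines. This is a minimal illustration, not EZinfo's implementation: the function names and the abundance values are hypothetical, and mean-centring before scaling is an assumption (it is the usual convention, but the text above only describes the division step).

```python
from math import sqrt
from statistics import mean, stdev


def unit_variance_scale(abundances):
    """Mean-centre, then divide by the standard deviation (assumed convention)."""
    m, s = mean(abundances), stdev(abundances)
    return [(x - m) / s for x in abundances]


def pareto_scale(abundances):
    """Mean-centre, then divide by the square root of the standard deviation."""
    m, s = mean(abundances), stdev(abundances)
    return [(x - m) / sqrt(s) for x in abundances]


# Two hypothetical compounds with the same two-fold change between
# conditions but a 100-fold difference in abundance.
high = [12000.0, 11800.0, 12200.0, 6000.0, 6100.0, 5900.0]
low = [120.0, 118.0, 122.0, 60.0, 61.0, 59.0]

for name, values in (("high", high), ("low", low)):
    uv_spread = stdev(unit_variance_scale(values))
    pareto_spread = stdev(pareto_scale(values))
    print(f"{name}: unit-variance spread = {uv_spread:.2f}, Pareto spread = {pareto_spread:.2f}")
```

After unit variance scaling both compounds have the same spread, so abundance no longer influences the model. Pareto scaling leaves the more abundant compound with the larger spread, which is why abundance as well as fold change pushes compounds towards the horizontal extremes of the S-plot.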

Back in Progenesis QI, we can view the expression profiles for all of the compounds imported as tagged groups from EZinfo and in this way we can see how their location on the S-plot relates to their expression behaviour. Group A were the 3 compounds at the extreme bottom left of the plot and as such should have excellent correlation to the model along with a high contribution to the variance, making them the very best candidate biomarkers. Figure 4, A confirms this since the clean step shape of the profiles shows very clear distinction between the conditions and the accompanying table shows a combination of very low p-values and CVs, with high abundance and fold changes. Group B were not so far out as group A horizontally but equally far out vertically, so they should have similar correlation to the model but less influence on the overall difference. The step-shaped profiles in Figure 4, Bi confirm high correlation with the model and the table shows that these compounds have lower abundance than those in group A. The generally higher fold changes of this group compared to group A indicate that compound abundance is more important than fold change in determining the overall influence of the compounds on the variance between the conditions. The expression profiles shown in Figure 4, A, Bi, C and D are “standardised” profiles in which the data is mean-centred and the variance normalised to 1. This results in the data being scaled to optimally display the shape of the profiles without taking account of the actual abundances. If we view group B as “normal” (unscaled) profiles (Figure 4, Bii) we see that the abundances of the compounds in the highest condition (B) actually vary from <1,000 to >12,000, which accounts for their relative positions on the S-plot.

Figure 4: Expression profiles and univariate statistical data for groups A, B, C and D from the S-plot

Group C were in only moderately extreme positions both horizontally and vertically, so are likely to be less good candidate biomarkers, and this is seen in Figure 4, C in which the discrimination between the conditions is now minimal. Interestingly, the table shows that the group C compounds have much higher abundances than those of group A yet are much further from the horizontal extreme of the S-plot, showing the effect of the fold changes, which in this case are very low and therefore limit the influence of the compounds on the model. Finally, group D are towards the top right extreme of the S-plot, indicating they are up-regulated in condition C. However, the profiles show less clear distinction between the conditions than in groups A or B (though more than in group C), while the table shows moderately high abundances but low fold changes, as we might expect from our experiment.

We’ve established that OPLS-DA, and particularly the S-plot, can help us to extract the “best” candidate biomarkers from our experiment 1, in terms of compounds displaying a combination of good conformation to the difference model, high abundance and high fold changes. But how does OPLS-DA handle the data from experiment 2? Perhaps a little surprisingly, despite there being no real expression changes in this data according to univariate analysis and PCA (see part 1 of blog), we still initially see a bi-plot in which there appears to be clear separation between the conditions (Figure 5, A). However, this is not a result of any real differences between the conditions, but rather the OPLS-DA tool essentially “forcing” them into the best model which represents a difference between them. We also see an S-plot that approximates to the characteristic shape seen with the experiment 1 data, which is potentially misleading. So what kind of behaviour do the compounds towards the extremes of this S-plot have?

Figure 5: OPLS-DA bi-plot (A) and S-plot (B) for experiment 2
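The “forcing” effect can be reproduced with pure noise. The sketch below is not OPLS-DA: as a much simpler stand-in, it projects each sample onto the difference between the two group mean profiles, which is also a supervised direction chosen with knowledge of the labels. The data are random numbers generated for illustration (12 samples and 5,333 compounds, to mirror the dimensions of the model experiment), and all names are hypothetical.

```python
import random

random.seed(0)
N_SAMPLES, N_COMPOUNDS = 12, 5333

# Pure noise: by construction no compound differs between the groups.
data = [[random.gauss(0.0, 1.0) for _ in range(N_COMPOUNDS)]
        for _ in range(N_SAMPLES)]
labels = ["BC"] * 6 + ["CB"] * 6


def column_means(rows):
    """Mean abundance of each compound across the given samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


group_bc = [row for row, lab in zip(data, labels) if lab == "BC"]
group_cb = [row for row, lab in zip(data, labels) if lab == "CB"]

# Supervised direction: difference between the two group mean profiles.
direction = [a - b for a, b in zip(column_means(group_bc), column_means(group_cb))]


def score(row):
    """Projection of one sample onto the supervised direction."""
    return sum(x * w for x, w in zip(row, direction))


scores_bc = [score(row) for row in group_bc]
scores_cb = [score(row) for row in group_cb]
perfectly_separated = min(scores_bc) > max(scores_cb)
print("Groups separated on the supervised axis:", perfectly_separated)
```

With thousands of compounds and only 12 samples, a direction fitted to the labels separates the groups even though the labels are meaningless, which is why an apparently clean bi-plot on its own is not evidence of a real difference.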

Groups A and B are both located towards (but not at) the vertical extremes of the plot so should have the best correlation to the model of any of the data. However, in both cases the expression profiles show a lot of variance within the conditions and not such clear distinction between the conditions (Figure 6). What’s more, the tables show that the p-values are only moderately low while the q-values (and therefore the false discovery rates) are very high. Combined with low abundances and relatively low fold changes, none of these compounds could be good candidate biomarkers, as we’d expect from our previous knowledge of the data.

Figure 6: Expression profiles and uni-variate statistical data for groups A (A) and B (B) from S-plot of experiment 2

Groups C and D, which are further from the vertical but more towards the horizontal extremes, have profiles indicating even less difference between the conditions, particularly in group D (Figure 7). This is confirmed by the very high p- and q-values plus very low fold changes shown in the tables. The reason for their location towards the horizontal extremes of the plot is their relatively high abundances, which mean they will have a relatively high influence on the data model.

Figure 7: Expression profiles and uni-variate statistical data for groups C (A) and D (B) from S-plot of experiment 2

From the evidence of our two model experiments, it’s clear that we need to be cautious when using OPLS-DA and the S-plot to select candidate biomarkers, since there is potential to select compounds which do not in fact have any of the characteristics we are looking for. It’s important to remember that OPLS-DA will always try to create the best model which represents the differences between the conditions in the experiment and that this may lead to bi-plots and S-plots which appear to show differences even when there are none there. The best way to check this is to view the selected compounds in Progenesis QI or similar software that will display the compound expression profiles and the univariate statistics such as p- and q-values, since these together will tell you if the selected compounds really do have the characteristics we would associate with good candidate biomarkers.

When is a Biomarker not a biomarker? (part 1)

Statistics have a longstanding reputation for being potentially misleading and unreliable. It was in the 19th century that British Prime Minister Benjamin Disraeli said “There are three kinds of lies: lies, damn lies and statistics” while in the mid-20th century, Winston Churchill added “The only statistics you can trust are the ones you have falsified yourself”. Things haven’t improved much recently, as evidenced by a Google search for the term “Statistics are unreliable” which returns no fewer than 6.8 million results! Discovery omics analysis and particularly p-values, which play a prominent role in the discovery of potential biomarkers, are no exception to this issue, with a Google search for “p-values are unreliable” producing about a quarter of a million results.

The huge complexity of discovery omics data, on the one hand, makes statistics vital in extracting results, but on the other makes interpretation of those statistics more problematic. In this article I’ll describe a simple “model” discovery omics experiment in the Progenesis QI software that highlights how misinterpretation of statistics can lead, not just to overstatement of success in an experiment, but potentially to conclusions that are the direct opposite of the reality. I’ll also discuss how you can avoid these misinterpretations and ensure that all your results are reliable. Please note that while the model experiment used here is metabolomics data, all the conclusions can equally be applied to proteomics or lipidomics analysis.  NB.  All the figures in this blog post are taken from the Progenesis QI software.

Figure 1. Original experimental design setup

Our “model” experiment uses a metabolomics data set of 12 human urine samples in two conditions B and C, as shown in the experimental design (Fig. 1). Condition C are from normal individuals while condition B are from individuals who’ve been given a high dose of a mixture of analgesic drugs. The 6 samples in each condition are technical replicates which enhances the relative differences between the conditions, but as we’re not interested in biological results, only what the statistics tell us, this is OK for our test. After automatic processing through Progenesis QI (data alignment, co-detection and adduct deconvolution) there were 5,333 compounds detected across all 12 samples with no missing values.

If we look at univariate statistics data (Fig. 2,a), we see that many compounds have extremely low p-values (some < 10⁻¹⁶) which might lead us to conclude a real expression change exists in those compounds. In fact, there are more than 300 compounds with p-values of < 0.0001 in this analysis, indicating the presence of many significantly changing compounds (candidate biomarkers) between our two conditions. In many compounds, the fold change is also very high, including some “infinity” fold changes where the compound is detectable in condition B and not in condition C.

This situation is confirmed if we now look at the PCA, a type of non-discriminant cluster analysis in which all samples are treated the same with no prior knowledge of the conditions they belong to. The samples (scores) cluster in multi-dimensional space according to how similar they are. By colour coding them by condition (Fig. 2,b), we see that the samples have clustered within their conditions and with very clear separation between conditions along the horizontal axis of principal component (PC) 1, which accounts for >21% of the total variance in the data. These, then, are the kind of statistical results we expect to see where there are very distinct differences between our conditions.

Figure 2. From the original experiment:
a) Univariate statistical data table, including p and q values
b) PCA analysis plot

So far so good. Now, let’s look at an experiment in which there are no significant differences between the conditions and see how this affects the statistics. To do this, we’re going to use the same samples, but randomly mix them up and re-assign them to two arbitrary conditions which we’ll call BC and CB (Fig. 3,a). It’s now evident from the PCA clustering pattern (Fig. 3,b) that there are no significant differences between these new conditions. But do the other statistical results support this?

Figure 3. From the arbitrary experiment:
a) Experimental design setup
b) PCA analysis plot

If we again look at our univariate statistics (Fig. 4), we can see that although the p-values are generally much higher than before, there are still a number of compounds where p < 0.05, which is often used (incorrectly) as a threshold of significance in discovery omics experiments. In fact there are 197 compounds with p<0.05, 25 with p < 0.005, and 4 with p < 0.001! Are any of these compounds really changing expression in a statistically significant way? The answer is no and when we consider that our original conditions have been randomly mixed together, this is the answer we might expect. So, why do we still get such low p-values when there are no actual expression changes occurring? To answer this we need to consider the experiment as a whole and not just the individual compounds.

Figure 4. Univariate statistics data table, including p and q values from the arbitrary experiment

The misuse of p<0.05 as a suitable significance threshold in discovery omics is usually the result of an incorrect definition of p-values. They are often referred to as “the probability that there is no expression change occurring in the data” which, if true, would mean that p<0.05 would indicate a <5% probability of no expression change occurring (or 95% probability of one occurring) and would therefore be a very suitable threshold. However, the p-value is actually the probability of observing data at least as extreme as ours if no real difference existed (i.e., how likely such a result is to occur by random chance). The number of low p-values we should expect by chance therefore depends on the number of compounds tested in the experiment, which is referred to as the “multiple testing problem”.

In an experiment where only 10 compounds are detected and measured, p<0.05 may be a suitable threshold, since we’d then expect only 0.5 compounds (10 x 0.05) to have p<0.05 by random chance, meaning any compounds in this p-value range are likely to be changing significantly and are therefore potential biomarkers. In discovery omics analysis we typically detect and measure >1,000 compounds, so in this case we expect >50 (1,000 x 0.05) to have p<0.05 by random chance, and using it as a threshold would produce at least that many false discoveries.

In our experiment we detected and measured 5,333 compounds, so we’d actually expect as many as 266 compounds to have p<0.05, 26 to have p<0.005 and 5 to have p<0.001 by random chance. Comparing this with the actual results (197, 25 and 4 compounds respectively), none of which exceeds what chance alone predicts, we can conclude that all the results are false discoveries that have come about by random chance alone.
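The arithmetic behind these expectations is simply the number of compounds multiplied by the p-value threshold. The sketch below reproduces it using the compound count and observed counts quoted in this post; the function and variable names are illustrative only.

```python
def expected_chance_hits(n_compounds, alpha):
    """Expected number of compounds with p < alpha if nothing is really changing."""
    return n_compounds * alpha


N_COMPOUNDS = 5333  # compounds detected and measured in the experiment
# Observed counts from the randomly mixed (BC vs CB) experiment, keyed by threshold.
OBSERVED = {0.05: 197, 0.005: 25, 0.001: 4}

for alpha, observed in OBSERVED.items():
    expected = expected_chance_hits(N_COMPOUNDS, alpha)
    verdict = "within" if observed <= expected else "above"
    print(f"p < {alpha}: observed {observed}, expected {expected:.1f} ({verdict} chance level)")
```

The exact products are 266.65, 26.665 and 5.333, which the text above quotes as whole numbers. At every threshold the observed count is no higher than the count expected by chance.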

So how do we check our p-value thresholds to see if they’re suitable for our experiments? A systematic way of doing this is to use the q-values calculated in Progenesis QI to estimate a false discovery rate (FDR). We do this by reading the highest q-value (corresponding to the highest p-value) in the subset of features we extract using our p-value threshold. If we do this for our original experiment (Fig. 5, a), we see that using a threshold of 0.0001 gives us a q-value of 0.000942, or an FDR of just below 0.1%, meaning <1 false discovery from the subset of 300 discoveries. However, using a threshold of 0.05 gives us a q-value of 0.128, or 12.8% FDR, translating to as many as 147 false discoveries from a total of 1,151 discovered compounds. With our “mixed” data set, we get far too many false discoveries no matter what threshold we use, with a threshold of 0.05 giving us a >99.95% FDR and even a threshold of 0.001 giving an FDR of 90% for only 4 discoveries.
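The procedure just described can be sketched in a few lines. This is a hedged illustration, not Progenesis QI code: it assumes the p- and q-values have already been obtained from the software and are held as a list of `(p_value, q_value)` pairs, a layout chosen here purely for convenience. The function names are hypothetical, and the four-row `toy_results` list is made-up data used only to show the call.

```python
def fdr_at_threshold(results, p_threshold):
    """Estimate the FDR for the features passing a p-value threshold.

    results: list of (p_value, q_value) pairs, one per compound.
    Returns (number of discoveries, FDR, expected false discoveries).
    """
    selected = [(p, q) for p, q in results if p < p_threshold]
    if not selected:
        return 0, 0.0, 0.0
    # The FDR estimate is the highest q-value within the selected subset.
    fdr = max(q for _, q in selected)
    return len(selected), fdr, fdr * len(selected)


def expected_false_discoveries(fdr, n_discoveries):
    """Expected number of false discoveries in a subset of a given size."""
    return fdr * n_discoveries


if __name__ == "__main__":
    # Made-up values for illustration only; not taken from the experiment.
    toy_results = [(0.00001, 0.0005), (0.0004, 0.004), (0.03, 0.1), (0.2, 0.4)]
    print(fdr_at_threshold(toy_results, 0.05))

    # Figures quoted in the text: (q-value, number of discoveries).
    for q, n in [(0.000942, 300), (0.128, 1151), (0.90, 4)]:
        print(f"FDR {q:.4%} of {n} discoveries: "
              f"{expected_false_discoveries(q, n):.2f} expected false discoveries")
```

Multiplying the FDR by the size of the subset gives the expected number of false discoveries, which is how a q-value of 0.128 across 1,151 compounds translates into roughly 147 false discoveries.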

Figure 5. Tables showing the difference in FDR between the two experiments

In this study of model omics experiments we’ve seen examples of how misinterpretation of univariate statistics can lead to experimental features (in this case metabolomics compounds) being assigned as potential biomarkers when, in fact, they are nothing of the kind. However, we’ve also seen that by using appropriate safeguards (false discovery rates) these issues can be avoided, ensuring that all your results are of high confidence and reliability.

In the second part of this blog we’ll use the same data to look at the issues of interpreting multivariate statistics and how we can avoid making false discoveries using that approach.

If you would like to know more about Progenesis QI or Progenesis QI for proteomics, don’t hesitate to get in touch. More information can be found here.