When is a Biomarker not a biomarker? (part 1)

Statistics have a longstanding reputation for being potentially misleading and unreliable. It was in the 19th century that British Prime Minister Benjamin Disraeli said “There are three kinds of lies: lies, damned lies and statistics”, while in the mid-20th century Winston Churchill added “The only statistics you can trust are the ones you have falsified yourself”. Things haven’t improved much recently, as evidenced by a Google search for the term “Statistics are unreliable”, which returns no fewer than 6.8 million results! Discovery omics analysis is no exception, particularly the p-values that play a prominent role in the discovery of potential biomarkers: a Google search for “p-values are unreliable” produces about a quarter of a million results.

The huge complexity of discovery omics data, on the one hand, makes statistics vital in extracting results, but on the other makes interpretation of those statistics more problematic. In this article I’ll describe a simple “model” discovery omics experiment in the Progenesis QI software that highlights how misinterpretation of statistics can lead, not just to overstatement of success in an experiment, but potentially to conclusions that are the direct opposite of the reality. I’ll also discuss how you can avoid these misinterpretations and ensure that all your results are reliable. Please note that while the model experiment used here is metabolomics data, all the conclusions can equally be applied to proteomics or lipidomics analysis. NB: All the figures in this blog post are taken from the Progenesis QI software.

Figure 1. Original experimental design setup

Our “model” experiment uses a metabolomics data set of 12 human urine samples in two conditions, B and C, as shown in the experimental design (Fig. 1). The condition C samples are from normal individuals, while the condition B samples are from individuals who’ve been given a high dose of a mixture of analgesic drugs. The 6 samples in each condition are technical replicates, which enhances the relative differences between the conditions, but as we’re not interested in biological results, only in what the statistics tell us, this is OK for our test. After automatic processing through Progenesis QI (data alignment, co-detection and adduct deconvolution) there were 5,333 compounds detected across all 12 samples with no missing values.

If we look at univariate statistics data (Fig. 2,a), we see that many compounds have extremely low p-values (some < 10^-16) which might lead us to conclude a real expression change exists in those compounds. In fact, there are more than 300 compounds with p-values of < 0.0001 in this analysis, indicating the presence of many significantly changing compounds (candidate biomarkers) between our two conditions. In many compounds, the fold change is also very high, including some “infinity” fold changes where the compound is detectable in condition B and not in condition C.

This situation is confirmed if we now look at the PCA, a type of non-discriminant (unsupervised) analysis in which all samples are treated the same, with no prior knowledge of the conditions they belong to. The samples (scores) cluster in multi-dimensional space according to how similar they are. By colour coding them by condition (Fig. 2,b), we see that the samples have clustered within their conditions, with very clear separation between conditions along the horizontal axis of principal component (PC) 1, which accounts for >21% of the total variance in the data. These, then, are the kind of statistical results we expect to see where there are very distinct differences between our conditions.

Figure 2. From the original experiment:
a) Univariate statistical data table, including p and q values
b) PCA analysis plot

So far so good. Now, let’s look at an experiment in which there are no significant differences between the conditions and see how this affects the statistics. To do this, we’re going to use the same samples, but randomly mix them up and re-assign them to two arbitrary conditions which we’ll call BC and CB (Fig. 3,a). It’s now evident from the PCA clustering pattern (Fig. 3,b) that there are no significant differences between these new conditions. But do the other statistical results support this?

Figure 3. From the arbitrary experiment:
a) Experimental design setup
b) PCA analysis plot

If we again look at our univariate statistics (Fig. 4), we can see that although the p-values are generally much higher than before, there are still a number of compounds where p < 0.05, which is often used (incorrectly) as a threshold of significance in discovery omics experiments. In fact there are 197 compounds with p<0.05, 25 with p < 0.005, and 4 with p < 0.001! Are any of these compounds really changing expression in a statistically significant way? The answer is no and when we consider that our original conditions have been randomly mixed together, this is the answer we might expect. So, why do we still get such low p-values when there are no actual expression changes occurring? To answer this we need to consider the experiment as a whole and not just the individual compounds.

Figure 4. Univariate statistics data table, including p and q values from the arbitrary experiment

The misuse of p<0.05 as a suitable significance threshold in discovery omics is usually the result of an incorrect definition of p-values. They are often referred to as “the probability that there is no expression change occurring in the data” which, if true, would mean that p<0.05 would indicate a <5% probability of no expression change occurring (or 95% probability of one occurring) and would therefore be a very suitable threshold. However, the p-value is actually the probability of observing data at least as extreme as ours if no real difference existed (i.e., how likely such a result is to occur by random chance). The meaning of any given p-value threshold therefore depends on the number of tests performed in the experiment, an issue referred to as the “multiple testing problem”.

In an experiment where only 10 compounds are detected and measured, p<0.05 may be a suitable threshold since we’d then expect only 0.5 compounds (10 x 0.05) to have p<0.05 by random chance, meaning any compounds in this p-value range are likely to be changing significantly and therefore to be potential biomarkers. In discovery omics analysis we typically detect and measure >1,000 compounds, so in this case we expect >50 (1,000 x 0.05) to have p<0.05 by random chance, and using it as a threshold would produce at least that many false discoveries.
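The arithmetic behind these expectations can be sketched in a few lines of Python. The function name below is hypothetical, and the calculation assumes that no compound is truly changing, so that the chance of any one compound falling below a threshold is equal to the threshold itself.

```python
def expected_chance_hits(n_compounds: int, threshold: float) -> float:
    """Expected number of compounds with p < threshold if nothing truly changes.

    Assumes each compound's p-value is uniformly distributed under the null,
    so the probability of p < threshold is simply the threshold.
    """
    return n_compounds * threshold


# The two scenarios described in the text: a small targeted experiment
# and a typical discovery omics experiment.
for n_compounds in (10, 1_000):
    hits = expected_chance_hits(n_compounds, 0.05)
    print(f"{n_compounds:>5} compounds: ~{hits:.1f} expected with p < 0.05 by chance")
```

The expectation grows in direct proportion to the number of compounds tested, which is why a threshold that is reasonable for 10 compounds becomes unsuitable for thousands.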

In our experiment we detected and measured 5,333 compounds, so we’d actually expect as many as 266 compounds to have p<0.05, 26 to have p<0.005 and 5 to have p<0.001 by random chance. Compare this with the actual results (197, 25 and 4 compounds respectively) and we see that none of the thresholds yields more compounds than chance alone predicts, so we can conclude that these results are false discoveries that came about by random chance.
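To get a feel for how this plays out, here is a minimal simulation sketch. It is not a re-analysis of the data in this article: it simply draws 5,333 p-values from a uniform distribution, which is how p-values behave when no real difference exists, and counts how many fall below each threshold. The function names and the seed are hypothetical, and the simulation assumes the tests are independent, which real compounds measured in the same samples generally are not.

```python
import random


def simulate_null_p_values(n_compounds, seed):
    """Draw one p-value per compound, assuming no compound truly changes.

    Under the null hypothesis a well-behaved p-value is uniform on [0, 1).
    Independence between compounds is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_compounds)]


def count_below(p_values, threshold):
    """Count the p-values that fall below the given threshold."""
    return sum(1 for p in p_values if p < threshold)


p_values = simulate_null_p_values(5_333, seed=1)

# Counts reported in the article for the randomly re-assigned experiment.
reported = {0.05: 197, 0.005: 25, 0.001: 4}

for threshold, observed in reported.items():
    simulated = count_below(p_values, threshold)
    expected = 5_333 * threshold
    print(
        f"p < {threshold}: simulated {simulated}, "
        f"expected ~{expected:.0f}, reported in the article {observed}"
    )
```

Changing the seed gives different counts each time, but they stay in the same region as the expected values, even though every one of the simulated "discoveries" is false by construction.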

So how do we check our p-value thresholds to see if they’re suitable for our experiments? A systematic way of doing this is to use the q-values calculated in Progenesis QI to estimate a false discovery rate (FDR). We do this by reading the highest q-value (corresponding to the highest p-value) in the subset of features we extract using our p-value threshold. If we do this for our original experiment (Fig. 5, a), we see that using a threshold of 0.0001 gives us a q-value of 0.000942, or an FDR of just below 0.1%, meaning <1 false discovery from the subset of 300 discoveries. However, using a threshold of 0.05 gives us a q-value of 0.128, or 12.8% FDR, translating to as many as 147 false discoveries from a total of 1,151 discovered compounds. With our “mixed” data set, we get far too many false discoveries no matter what threshold we use, with a threshold of 0.05 giving us a >99.95% FDR and even a threshold of 0.001 giving an FDR of 90% for only 4 discoveries.
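The procedure described above can be sketched outside the software as follows. Please treat this as an illustration only: I am using the Benjamini-Hochberg adjustment as a stand-in for the q-value calculation, which is an assumption and may not be the method Progenesis QI uses. The function names and the toy p-values are hypothetical and are not taken from the experiment.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, one common way to define q-values.

    This is a stand-in chosen for illustration, not necessarily the
    calculation performed by Progenesis QI.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q never decreases as p increases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def fdr_at_threshold(p_values, q_values, threshold):
    """Return (discoveries, estimated FDR, expected false discoveries).

    The FDR estimate is the highest q-value among features with p < threshold.
    """
    selected = [q for p, q in zip(p_values, q_values) if p < threshold]
    if not selected:
        return 0, 0.0, 0.0
    fdr = max(selected)
    return len(selected), fdr, fdr * len(selected)


# Hypothetical toy p-values, NOT taken from the experiment in this article.
toy_p = [0.00001, 0.0004, 0.003, 0.02, 0.04, 0.2, 0.5, 0.7, 0.9, 0.95]
toy_q = bh_q_values(toy_p)
n, fdr, n_false = fdr_at_threshold(toy_p, toy_q, 0.05)
print(f"{n} discoveries at p < 0.05, FDR ~{fdr:.1%}, ~{n_false:.2f} expected false")

# The same multiplication applied to the figures quoted in the text.
print(f"~{1151 * 0.128:.0f} expected false discoveries among 1,151 at 12.8% FDR")
```

The last step is the one that matters in practice: multiplying the number of discoveries by the FDR converts an abstract rate into a count of compounds you should expect to be wrong.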

Tables showing the difference in FDR between the two experiments Figure 5. Tables showing the difference in FDR between the two experiments

In this study of model omics experiments we’ve seen examples of how misinterpretation of univariate statistics can lead to experimental features (in this case metabolomics compounds) being assigned as potential biomarkers when, in fact, they are nothing of the kind. However, we’ve also seen that by using appropriate safeguards (false discovery rates) these issues can be avoided, ensuring that all your results are of high confidence and reliability.

In the second part of this blog we’ll use the same data to look at the issues of interpreting multivariate statistics and how we can avoid making false discoveries using that approach.

If you would like to know more about the Progenesis QI or the Progenesis QI for proteomics software then don’t hesitate to get in touch. More information can be found here.