When is a Biomarker not a biomarker? (part 2)

In my last blog, I discussed the interpretation of data from two model experiments using univariate statistical analysis (p-values and false discovery rates). The conclusion was that p-values alone can lead to dramatic misinterpretation of results and many false discoveries, so false discovery rates (FDRs) derived from q-values are a vital tool for avoiding this. In this blog I’ll use the same model experiments to discuss multivariate statistical analysis, specifically Orthogonal Projections to Latent Structures Discriminant Analysis (OPLS-DA), a method commonly used to extract biomarkers in discovery metabolomics analysis.

First, a brief recap of our model experiments. Experiment 1 consists of 12 human urine samples in conditions B and C (Fig. 1, (i)), where C is normal patients and B is patients who’ve been given a high dosage of a mixture of analgesic drugs. In this case the PCA scores (samples, shown as coloured dots) show tight clustering within the conditions, indicating some highly significant differences between the conditions resulting from the presence of the drugs or their metabolites in condition B. Experiment 2 comprises the same data, but re-arranged into two “mixed” conditions called BC and CB (Fig. 1, (ii)), for which the PCA scores show no condition-related clustering, indicating (as we’d expect) that there are no differences between the conditions. After automatic processing of the data through Progenesis QI (data alignment, co-detection and adduct deconvolution), 5,333 compounds were detected across all 12 samples with no missing values.

Figure 1: Experimental design and PCA bi-plot for model experiment 1 (i) and experiment 2 (ii)

As mentioned in part 1 of this blog, PCA is a non-discriminant (unsupervised) type of analysis: it takes no account of the conditions of the experiment and simply arranges the samples (scores) and compounds (loadings) according to how similar (or different) their expression behaviour is. For the scores, samples in which the compounds exhibit similar expression behaviour are clustered closer together, while those with less similar behaviour are further apart on the plot. The loadings are arranged similarly according to their expression behaviour. In addition, the clustering of scores and loadings is linked, in that compounds (loadings) which show significant up-regulation in a condition are clustered closest to the samples (scores) of that condition (see Figure 1). PCA is also useful for identifying outliers in the data.
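To make the scores and loadings idea concrete, here is a minimal sketch of PCA using NumPy's singular value decomposition. The function name and the toy data are hypothetical and only for illustration; this is not how Progenesis QI computes its PCA. Note that the condition labels are never passed in, which is what makes PCA non-discriminant.

```python
import numpy as np

def pca_scores_loadings(X, n_components=2):
    """Return PCA scores (samples) and loadings (compounds).

    X is a samples x compounds abundance matrix. No condition labels are used.
    """
    Xc = X - X.mean(axis=0)                            # mean-centre each compound
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # one row per sample
    loadings = Vt[:n_components].T                     # one row per compound
    return scores, loadings

# Hypothetical toy data: 6 samples x 4 compounds,
# with compound 0 strongly raised in the first 3 samples.
rng = np.random.default_rng(0)
X = rng.normal(100.0, 1.0, size=(6, 4))
X[:3, 0] += 50.0

scores, loadings = pca_scores_loadings(X)
print(np.round(scores[:, 0], 1))    # first component splits the two sets of samples
print(np.round(loadings[:, 0], 2))  # compound 0 dominates the first component
```

In this toy case the samples separate along the first component purely because compound 0 really does differ between them, and the same compound has the largest loading on that component.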

In contrast to PCA, OPLS-DA is a “discriminant” analysis which does take account of the conditions of the experiment and builds a model that best represents the differences between the conditions. The data can then be plotted in a way which represents how well each sample and compound fits the model. From Progenesis QI, our experiment 1 data can be automatically exported into the EZinfo statistical package, in which OPLS-DA can be performed before importing the results back into Progenesis QI for further review. In EZinfo we can easily create our OPLS-DA model and initially view a bi-plot which looks quite similar to PCA (Figure 2). However, instead of representing degrees of variance in the data, the axes now represent values related to the model of the difference between the conditions and how the scores (samples) and loadings (compounds) fit into the model. So, how does this type of analysis help us to extract good candidate biomarkers from our experiment?

Figure 2: OPLS-DA bi-plot for experiment 1

If we change the data scaling from “unit variance” (where each compound abundance is divided by the compound standard deviation) to “Pareto” (where it’s divided by the square root of the standard deviation) we can create an “S-plot” of the compounds (loadings), which takes its name from the characteristic S-shape in which the “best” biomarkers are located towards the extremes of the plot. In the S-plot (Fig. 3), the vertical axis defines the p(corr) correlation to the model while the horizontal axis defines the p(1) contribution to the variance between the conditions. This means that compounds located towards the vertical extremes conform best to the B vs C difference model and are essentially the compounds where the difference between conditions B and C is most clear, while those located towards the horizontal extremes contribute most to the overall variance between the conditions, meaning they are highly abundant, have a large fold change, or both. In the case of experiment 1, we know there to be many compounds with changing expression, mainly up-regulated in condition B, where the drugs were administered. The S-plot supports this: there are many compounds located towards the lower left extreme of the plot, indicating they are up-regulated in condition B, while there are very few located towards the other extreme where the “up in C” compounds should be. We can see more clearly how the location of compounds on the S-plot relates to their expression behaviour by selecting groups of them and importing them back into Progenesis QI as “tagged groups”, which enables us to select them using filters and visualise their expression behaviour using the Progenesis QI tools. In this case, four groups of compounds have been selected, labelled A, B, C and D in Figure 3.

Figure 3: S-plot for experiment 1
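The two scaling options and the two S-plot axes can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names and toy abundances are hypothetical, the score vector comes from a simple one-component projection onto the class labels (a stand-in, not the OPLS-DA algorithm used by EZinfo), and p(1) and p(corr) are taken as the covariance and correlation of each compound with that score vector, which is one common definition.

```python
import numpy as np

def scale(X, method="pareto"):
    """Mean-centre each compound, then divide by SD ("uv") or sqrt(SD) ("pareto")."""
    Xc = X - X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    if method == "uv":
        return Xc / sd
    if method == "pareto":
        return Xc / np.sqrt(sd)
    raise ValueError(f"unknown scaling method: {method}")

def s_plot_coordinates(Xs, y):
    """Return (p1, pcorr) per compound from a single predictive score vector.

    Assumption: the score vector t is a one-component projection onto the
    class labels, used here as a simplified stand-in for OPLS-DA.
    """
    yc = y - y.mean()
    w = Xs.T @ yc                      # weight of each compound
    w /= np.linalg.norm(w)
    t = Xs @ w                         # one score per sample
    tc = t - t.mean()
    p1 = (Xs.T @ tc) / (len(t) - 1)    # covariance with t (Xs is already centred)
    pcorr = p1 / (tc.std(ddof=1) * Xs.std(axis=0, ddof=1))  # correlation with t
    return p1, pcorr

# Hypothetical toy data: 6 samples in condition B (coded 0), then 6 in C (coded 1).
y = np.array([0] * 6 + [1] * 6)
in_b = (y == 0).astype(float)
noise = np.tile([1.0, -1.0, 0.0], 4)
alt = np.tile([1.0, -1.0], 6)
X = np.column_stack([
    1000 + 50 * noise + 9000 * in_b,   # abundant, 10-fold up in B
    100 + 5 * noise + 900 * in_b,      # low abundance, same fold change
    5000 + 250 * alt,                  # abundant, no difference between B and C
])

p1, pcorr = s_plot_coordinates(scale(X, "pareto"), y)
print(np.round(p1, 1), np.round(pcorr, 3))
```

With this label coding, the two “up in B” compounds land at the same vertical extreme (p(corr) close to -1), but the more abundant one sits further out horizontally, while the unchanged compound sits at the centre. The sign of both axes depends only on how the conditions are coded.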

Back in Progenesis QI, we can view the expression profiles for all of the compounds imported as tagged groups from EZinfo, and in this way we can see how their location on the S-plot relates to their expression behaviour. Group A were the 3 compounds at the extreme bottom left of the plot and as such should have excellent correlation to the model along with a high contribution to the variance, making them the very best candidate biomarkers. Figure 4, A confirms this, since the clean step shape of the profiles shows very clear distinction between the conditions and the accompanying table shows a combination of very low p-values and CVs with high abundance and fold changes. Group B were not so far out as group A horizontally but equally far out vertically, so they should have similar correlation to the model but less influence on the overall difference. The step-shaped profiles in Figure 4, Bi confirm high correlation with the model and the table shows that these compounds have lower abundance than those in group A. The generally higher fold changes of this group compared to group A indicate that compound abundance is more important than fold change in determining the overall influence of the compounds on the variance between the conditions. The expression profiles shown in Figure 4, A, Bi, C and D are “standardised” profiles in which the data is mean-centred and the variance normalised to 1. This results in the data being scaled to optimally display the shape of the profiles without taking account of the actual abundances. If we view group B as “normal” (unscaled) profiles (Figure 4, Bii) we see that the abundances of the compounds in the highest condition (B) actually vary from <1,000 to >12,000, which accounts for their relative positions on the S-plot.

Figure 4: Expression profiles and univariate statistical data for groups A, B, C and D from the S-plot

Group C were in only moderately extreme positions both horizontally and vertically, so are likely to be poorer candidate biomarkers, and this is seen in Figure 4, C, in which the discrimination between the conditions is now minimal. Interestingly, the table shows that the group C compounds have much higher abundances than those of group A yet are much further from the horizontal extreme of the S-plot. This shows the effect of the fold changes, which in this case are very low and therefore limit the influence of the compounds on the model. Finally, group D are towards the top right extreme of the S-plot, indicating they are up-regulated in condition C. However, the profiles show less clear distinction between the conditions than in groups A or B (though more than in group C), while the table shows moderately high abundances but low fold changes, as we might expect from our experiment.

We’ve established that OPLS-DA, and particularly the S-plot, can help us to extract the “best” candidate biomarkers from our experiment 1, in terms of compounds displaying a combination of good conformity to the difference model, high abundance and high fold changes. But how does OPLS-DA handle the data from experiment 2? Perhaps a little surprisingly, despite there being no real expression changes in this data according to univariate analysis and PCA (see part 1 of this blog), we still initially see a bi-plot in which there appears to be clear separation between the conditions (Figure 5, A). However, this is not a result of any real differences between the conditions, but rather of the OPLS-DA tool essentially “forcing” them into the best model which represents a difference between them. We also see an S-plot that approximates to the characteristic shape seen with the experiment 1 data, which is potentially misleading. So what kind of behaviour do the compounds towards the extremes of this S-plot have?

Figure 5: OPLS-DA bi-plot (A) and S-plot for experiment 2

Groups A and B are both located towards (but not at) the vertical extremes of the plot, so should have the best correlation to the model of any of the data. However, in both cases the expression profiles show a lot of variance within the conditions and no clear distinction between the conditions (Figure 6). What’s more, the tables show that the p-values are only moderately low while the q-values (and therefore the false discovery rates) are very high. Combined with low abundances and relatively low fold changes, this means none of these compounds could be good candidate biomarkers, as we’d expect from our previous knowledge of the data.

Figure 6: Expression profiles and uni-variate statistical data for groups A (A) and B (B) from S-plot of experiment 2

Groups C and D, which are further from the vertical but more towards the horizontal extremes, have profiles indicating even less difference between the conditions, particularly in group D (Figure 7). This is confirmed by the very high p- and q-values plus very low fold changes shown in the tables. The reason for their location towards the horizontal extremes of the plot is their relatively high abundances, which mean they will have a relatively high influence on the data model.

Figure 7: Expression profiles and uni-variate statistical data for groups C (A) and D (B) from S-plot of experiment 2

From the evidence of our two model experiments, it’s clear that we need to be cautious when using OPLS-DA and the S-plot to select candidate biomarkers, since there is potential to select compounds which do not in fact have any of the characteristics we are looking for. It’s important to remember that OPLS-DA will always try to create the best model which represents the differences between the conditions in the experiment, and that this may lead to bi-plots and S-plots which appear to show differences even when there are none. The best way to check this is to view the selected compounds in Progenesis QI or similar software that will display the compound expression profiles and univariate statistics such as p- and q-values, since these together will tell you whether the selected compounds really do have the characteristics we would associate with good candidate biomarkers.
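The “forcing” effect is easy to reproduce with pure noise. The sketch below is a hypothetical simulation, not our experiment 2 data: it generates random values for 12 samples and 5,333 compounds, assigns the samples to two arbitrary groups, and projects them onto the direction that best covaries with the group labels. That one-component projection is a simplified stand-in for a discriminant model such as OPLS-DA, not the EZinfo implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_compounds = 12, 5333   # same dimensions as the model experiments

# Pure noise: by construction, no compound differs between the two groups.
X = rng.normal(size=(n_samples, n_compounds))
y = np.array([-1.0] * 6 + [1.0] * 6)   # arbitrary group labels

Xc = X - X.mean(axis=0)                # mean-centre each compound
w = Xc.T @ y                           # direction that best covaries with the labels
w /= np.linalg.norm(w)
t = Xc @ w                             # supervised score for each sample

print("group 1 scores:", np.round(t[:6], 1))
print("group 2 scores:", np.round(t[6:], 1))
print("gap between groups:", round(float(t[6:].min() - t[:6].max()), 1))
```

Even though the data contain no real differences, the two groups come out cleanly separated on the supervised axis, simply because there are far more compounds than samples for the model to draw on. This is why an apparently convincing separation needs to be checked against the expression profiles and univariate statistics of the selected compounds.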
