“I could burst into tears… I spent weeks of time and effort on sample collection, instrument optimisation, sample running, and data generation from a very expensive LC-MS setup that has the resolving power to find tiny but significant differences between my conditions, only to find that my data analysis led me down a dead end with many false positives. I’m certain that real significant differences are in that dataset, but my analysis workflow just isn’t picking them up… What will I tell my boss?”
Many know how real that scenario is, but what are the possible consequences of missing values in your data?
- False negatives – are you missing true positive results, as in the example above?
- Time spent on researching false positives – does this slow down department research progress?
- Wasted money on investigating false positives – how easy is it to then get follow-up funding?
- Acceptance of research results – is your journal of choice going to accept data with missing values?
- Reduced return on investment – do your research results bring value from either publications or commercial gain?
Hopefully you’ve been lucky and the impact was not as drastic as the above; however, those outcomes are all entirely possible.
In our ‘Back to basics’ blog post, we mentioned a number of the problems that missing values can cause, such as:
- Reduced effectiveness of statistical analysis techniques
- Misleading statistics (impacting false positive and false negative results)
- Problems with multivariate statistics, as observed in visualisations (such as PCA)
Before we move on to the issues that arise from the various methods of imputing missing values in your data matrix, let’s remember how many data points are actually affected:
- Missing values are commonly reported as occurring at rates of approximately 20%[1,2] and affecting up to 80% of variables in LC-MS data.
- “Further investigation of the peaks detected as significantly different between biological groups showed that substantial proportions of these peaks were comprised of those which initially had missing data.”
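To get a feel for how a per-value missing rate of around 20% can touch such a large share of variables, here is a minimal Python sketch. It assumes, purely for illustration, that every value goes missing independently and with the same probability, which real LC-MS data will not strictly obey. The function name `fraction_of_features_affected` is our own and not part of any library.

```python
def fraction_of_features_affected(missing_rate: float, n_runs: int) -> float:
    """Probability that a feature has at least one missing value across n_runs.

    Illustrative assumption: each value is missing independently with
    probability `missing_rate` (missing completely at random).
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    # A feature is complete only if all n_runs values were observed.
    return 1.0 - (1.0 - missing_rate) ** n_runs


# Hypothetical experiment sizes, all at a 20% per-value missing rate.
for n_runs in (3, 6, 12):
    share = fraction_of_features_affected(0.2, n_runs)
    print(f"{n_runs} runs: {share:.1%} of features have at least one gap")
```

Under that simplifying assumption, roughly 49% of features would have at least one gap with 3 runs, about 74% with 6 runs and about 93% with 12 runs, so the share of affected variables climbs quickly as runs are added.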
Let’s expand upon the problems mentioned above, but first a quick note on terminology: Hrydziuszko and Viant refer to missing data where we refer to missing values. Missing values have also been referred to as ‘missing or lost data points’. To simplify the language, we will continue to refer to missing values, except in the case of citations.
A – Reduced effectiveness of statistical analysis techniques
To have confidence in conclusions drawn from comparing conditions, we look to statistical power. In the ‘Missing values: what’s the problem?’ blog post, Dr. Goulding described how increasing the number of biological replicates should help to stabilise the inherent variation between biological samples and so increase statistical power. However, each additional sample brings additional missing values into the experiment’s quantitative results, which limits the increase in power you were hoping for. In effect, we throw away most of the statistical power gained from adding biological replicates because we incorporate even more missing values. Low statistical power, in turn, makes it likely that real experimental differences will be missed.
B – Misleading statistics
Typically, 1 in 5 data points are missing; can’t we just impute them with one of the many existing methods? Yes, but not without the danger of leading your investigation completely astray. Can we measure the impact of imputation on the rate of significant discoveries? Hrydziuszko and Viant evaluated the impact of eight imputation methods; they found:
- “Different imputation methods ultimately yielded quite diverse data analysis outcomes. Specifically the number of peaks identified as significantly different between groups varied considerably between the eight estimation methods”.
- “It is quite possible that when an inappropriate missing value estimation method is used we may not only lose the knowledge of which peaks are significant or not, but we may introduce further bias by identifying non-significant peaks as significantly different between groups.”
- “Overall the results presented here provide substantial evidence that the choice of missing value estimation method has a substantial effect on the outcome and interpretation of univariate statistical analysis.”
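The effect described in these quotes can be illustrated with a small Python sketch. It is not the analysis from the paper: the intensities are invented, the three fill rules (`zero`, `half_min`, `mean`) are simply common simple choices, and the helper names `impute` and `welch_t` are our own. It fills the same gaps in three ways and compares the resulting Welch t statistic for a single peak.

```python
from math import sqrt
from statistics import mean, variance


def impute(values, method):
    """Fill None entries using one of three simple, illustrative rules."""
    observed = [v for v in values if v is not None]
    if method == "zero":
        fill = 0.0
    elif method == "half_min":
        fill = min(observed) / 2
    elif method == "mean":
        fill = mean(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Invented intensities for one peak; None marks a missing value.
control = [100.0, 104.0, 98.0, 102.0, 101.0]
treated = [121.0, None, 118.0, None, 124.0]

for method in ("zero", "half_min", "mean"):
    t = welch_t(impute(treated, method), control)
    print(f"{method:>8}: t = {t:.2f}")
```

On this toy data, mean imputation gives a large positive t statistic, while zero and half-minimum imputation pull the treated mean below the control mean and give a negative one. The numbers are made up; the point is only that the same peak can look strongly significant or not significant at all depending on how the gaps are filled.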
C – Problems with multivariate statistics as observed by visualisations (such as Principal Component Analysis (PCA))
Missing values also complicated multivariate statistics when Hrydziuszko and Viant applied PCA. The differences between imputation methods were similar to, and larger than, those discussed above for univariate analysis: depending on which imputation method was used, PCA plots ranged from three distinct groups to no separation at all between the three conditions. This suggests that multivariate analysis could be even more sensitive to imputation.
What can be done about this serious skewing of data?
Hrydziuszko and Viant concluded their paper with “a three step process recommended in order to determine optimal method selection for missing value estimation for a given dataset and analytical platform that includes: assessing the nature of the missing data, analysing the impact of the missing data treatments on the final data analysis outcome, and analysing the performance of missing data algorithms on the ‘complete’ dataset if available.”
The proposed approach is very thorough, but it is likely to be time-consuming: each dataset must be assessed on its own merits, so the three steps have to be repeated for every new study.
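To give an idea of what the last of those three steps might involve, here is a rough Python sketch of scoring imputation methods against a ‘complete’ dataset: hide some known values, impute them, and measure how far the estimates land from the truth. This is our own simplified illustration, not the authors’ procedure. The data, the choice of hidden cells and the names `masked_rmse`, `mean_fill` and `half_min_fill` are all hypothetical.

```python
from math import sqrt
from statistics import mean


def mean_fill(row):
    """Replace None with the mean of the observed values in the row."""
    fill = mean(v for v in row if v is not None)
    return [fill if v is None else v for v in row]


def half_min_fill(row):
    """Replace None with half of the smallest observed value in the row."""
    fill = min(v for v in row if v is not None) / 2
    return [fill if v is None else v for v in row]


def masked_rmse(complete, hidden, imputer):
    """Hide the (row, column) cells in `hidden`, impute, and return the RMSE.

    Each row is one feature; each column is one run. Every row must keep
    at least one visible value so the imputer has something to work from.
    """
    squared_errors = []
    for r, row in enumerate(complete):
        masked = [None if (r, c) in hidden else v for c, v in enumerate(row)]
        filled = imputer(masked)
        squared_errors += [
            (filled[c] - v) ** 2 for c, v in enumerate(row) if (r, c) in hidden
        ]
    return sqrt(mean(squared_errors))


# Invented 'complete' matrix: 3 features measured in 4 runs.
complete = [
    [10.0, 12.0, 11.0, 13.0],
    [200.0, 210.0, 190.0, 205.0],
    [55.0, 50.0, 52.0, 58.0],
]
hidden = {(0, 1), (1, 2), (2, 3)}  # cells we pretend were not detected

for name, imputer in (("mean", mean_fill), ("half_min", half_min_fill)):
    print(f"{name:>8}: RMSE = {masked_rmse(complete, hidden, imputer):.2f}")
```

On this toy matrix the mean fill recovers the hidden values more closely than the half-minimum fill, but that ranking says nothing about real data, where values are often missing because they are low in abundance rather than at random. The cells you choose to hide should mimic the missingness pattern you actually expect, which is part of why the assessment has to be redone for each study.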
There is another way!
“Co-detection in Progenesis is a very attractive feature that we have come to trust. We do a lot of follow up Western blots to reproduce our quantitative proteomics findings and the blots do a great job of building confidence in the co-detection feature and the Progenesis system as a whole. We’ve had success reproducing changes in protein expression levels observed in Progenesis using the traditional Western approach. That’s extremely important since a lot of reviewers do not speak mass spec but love to flaunt how fluent they are in the language of Western blot.”
Paul Langlais, Mayo Clinic Arizona, Arizona, USA
Progenesis offers you the ability to take your data straight from your mass spectrometer and find the significantly changing compounds or proteins in your dataset without any of the problems associated with missing values. As long as the differences are present in your data file, Progenesis will maximise your chances of picking them up. Working on data with no missing values improves the experimental specificity, sensitivity and therefore reproducibility of your research, allowing you to quickly and, more importantly, confidently, quantify the compounds or proteins of interest.
An issue Dr. Goulding discussed in ‘Missing values: what’s the problem?’ was data filtering, where missing values fall below predefined matching thresholds. With the Progenesis co-detection approach, you do not have to worry about this, as you will always be comparing the same features across all the samples in your dataset. This means that when you use multivariate statistical visualisations such as PCA, you will have an unfiltered and therefore unbiased representation of your data. In turn, this lets you QC your data by checking for outliers, to ensure the differences you see in your data are real.
Dr. Goulding will be following up soon on how Progenesis achieves no missing values; in the meantime why not download Progenesis QI and speak to one of our specialists about analysing your own data? You too can have peace of mind that the conclusions you are submitting are based on the unbiased analysis of a complete data matrix.
1. Gromski et al., “Metabolites” 2014, 4, 433-452
2. Hrydziuszko and Viant, “Metabolomics” 2012, 8, Supplement 1, 161-174