Missing values: what’s the problem?

Missing values are a major problem in LC-MS-based discovery ‘omics analysis and can be the difference between a successful research project and a failure. Whether you run a 3 vs. 3 experiment on a model biological system or a much larger clinical study, missing values will adversely affect the results: some expression changes that are actually present in your data will be missed. But why is this? How and why are missing values generated, and how do they affect the results? To clarify, let’s look at how discovery ‘omics analysis works.

Biological “noise” and statistical power

In discovery ‘omics we’re looking for differences in the relative quantities of analytes between two or more groups or conditions, such as control vs. treated or healthy vs. diseased. But in all biological systems there is inherent biological variation, caused by both genetic and environmental factors, so the relative quantity of any given analyte ion will vary across samples from different specimens within a given condition. This is further complicated in clinical studies, where there is no control over external factors such as diet and fitness which contribute towards the final phenotype.

This inherent variation can be seen as biological “noise” which we must cut through in order to find the consistent condition-related differences we’re looking for. To do this, we must run multiple biological replicate samples from different specimens of the same species and condition and compare the resulting sample groups to find the analyte ions that are displaying statistically significant differences between conditions. The more biological replicate samples we run, the greater the ability or statistical power of our experiment to find the significantly changing ions.

Missing values

The relationship between the number of replicate samples and the statistical power of the experiment is strongest when data from all the runs is available for statistical analysis. However, due to limitations in the conventional workflow used by most analysis software, this is usually not the case. In the conventional workflow the ions are first detected, one sample at a time, and the detected ions are then “matched” across the samples so that the same ion is being compared between samples. This approach often results in different patterns of detection and different numbers of ions detected in each sample. Ions that are detected in one or more samples may not be detected in others for reasons such as:

  • The ion is genuinely absent, or is below the limit of detection, in those samples.
  • The ion may be “missing” due to an instrument error or ionisation issue.
  • The ion may be detected differently due to signal fragmentation, failure to detect the monoisotopic peak, differences in chromatography, or other reasons (see figure below).

(Figure: differences in detection pattern for technical replicate samples when co-detection isn’t performed.)
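To see how detect-then-match produces these gaps, here is a minimal sketch in Python (toy ion IDs and sample names, not real data): each run is detected independently, and matching the per-run ion lists leaves a hole wherever an ion found in one run has no counterpart in another.

```python
# Hypothetical per-sample detection results: each run yields its own set of ion IDs.
detections = {
    "sample_1": {"ion_1", "ion_2", "ion_3"},
    "sample_2": {"ion_1", "ion_3"},        # ion_2 not detected in this run
    "sample_3": {"ion_1", "ion_2"},        # ion_3 not detected in this run
}

# Matching across samples: every ion seen in any run gets a row, and the
# row has a gap (False) wherever that ion was not detected in a run.
all_ions = sorted(set().union(*detections.values()))
matrix = {ion: {s: ion in ions for s, ions in detections.items()} for ion in all_ions}

gaps = sum(1 for row in matrix.values() for present in row.values() if not present)
print(gaps)  # 2 gaps -> 2 missing values in the quantity matrix
```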

For the above reasons, when we create a matrix of ion quantities for statistical analysis, we often find a number of gaps in the data where ions detected on one or more samples could not be found on others. We call these “missing values”, four of which can be seen in the data from a simple 3 vs. 3 experiment shown below.

(Table: missing values in abundance measurements.)

Missing values, which are commonly reported as occurring at rates of approximately 20% [1,2] and affecting up to 80% of variables [2] in LC-MS data, decrease the statistical power of our experiment by reducing the number of values available for statistical analysis. Worse, the probability of missing values occurring increases with the number of biological replicate samples, so we find we have to run many more samples in order to gain a relatively small increase in statistical power. This is a fundamental problem in discovery ‘omics analysis, since we may then miss expression changes that are actually present and waiting to be found in our data.
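To make the cost concrete, here is a minimal sketch (hypothetical abundance values) of how missing values shrink the number of measurements actually available when comparing one ion between two groups:

```python
# Hypothetical abundances for one ion in a 3 vs. 3 experiment;
# None marks a missing value (the ion could not be found in that run).
control = [152_000.0, None, 148_500.0]
treated = [98_000.0, 101_200.0, None]

# Only values that are actually present can enter the statistical test.
usable_control = [v for v in control if v is not None]
usable_treated = [v for v in treated if v is not None]

# Effective group size drops from 3 to 2, reducing statistical power.
print(len(usable_control), len(usable_treated))  # 2 2
```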

Since running more samples only increases the chances of encountering missing values, and samples are not in endless supply, the only real solution is to find a better way to analyse the data.

What is the solution?

The approach used in many discovery ‘omics analysis workflows is a combination of data filtering and data imputation. First, the workflow may ask you to define a threshold percentage of values that must be present for each ion; if you set a 60% threshold for a 10 vs. 10 experiment, all ions with more than 4 missing values in a group are eliminated from the analysis. If these include the potential biomarkers you’re looking for, you’ll miss them entirely! For the remaining ions, any missing values are replaced with imputed model values: the most commonly used option is the mean of the values present, while zeros are also used in some workflows.

These approaches are dangerous, since statistics must take full account of both the mean and the variance of the data in each group being compared. Mean imputation may produce false positives by artificially reducing the variance, while zero imputation may produce false positives or negatives by skewing both the mean and the variance.

Another way to see the invalidity of these approaches is to imagine using them in another type of statistical experiment. Let’s say we’re measuring the effect of a high-salt environment on the mature height of a certain type of plant. You plant 10 seedlings in high-salt and 10 in low-salt soil, but due to a problem with the automatic irrigation system, 4 of the low-salt plants receive no water and die. Is it valid to insert the mean height of the other 6 low-salt plants as a model value for the 4 missing plants? You would essentially be creating 4 virtual plants while artificially reducing the variance in the low-salt group.

(Figure: real plant measurements vs. imputed plant measurements.)

Would this be considered a valid way to perform the statistics? Of course not, yet it’s precisely what is done in some discovery ‘omics analysis workflows, and it has become so routine that it often isn’t even mentioned in publications [1].
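The filter-and-impute recipe described above can be sketched in a few lines of Python. This is an illustration with made-up numbers, assuming a 60% presence threshold and mean imputation; it shows the key distortion: the group mean stays put, but the spread shrinks because every “virtual” measurement sits exactly on the mean.

```python
from statistics import mean, pstdev

def filter_and_impute(group, threshold=0.6):
    """Drop the ion if fewer than `threshold` of its values are present;
    otherwise replace each missing value with the group mean."""
    present = [v for v in group if v is not None]
    if len(present) / len(group) < threshold:
        return None                      # ion eliminated from the analysis
    m = mean(present)
    return [m if v is None else v for v in group]

# 10 replicate measurements with 4 missing values (just meets a 60% threshold).
group = [10.1, 9.8, None, 10.4, None, 9.9, None, 10.2, None, 10.0]

imputed = filter_and_impute(group)
observed = [v for v in group if v is not None]

# The 4 imputed values sit exactly on the mean, artificially reducing
# the variance and so inflating apparent significance downstream.
print(round(pstdev(observed), 3))  # 0.197 -> spread of the 6 real values
print(round(pstdev(imputed), 3))   # 0.153 -> smaller spread after imputation
```

With one more missing value (5 of 10), the same ion would fall below the 60% threshold and be silently removed from the analysis altogether.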

How will this affect your work?

Missing values can have profound implications for your research projects. False negative results are serious enough, with the potential to miss important biomarker candidates, but false positives may be even worse, leading to much wasted time, effort and resources investigating false biomarker candidates. Furthermore, it’s worth repeating that all along, the evidence of real biomarker candidates is actually there in your data, waiting to be found. At the outset you carefully design the experiment, select subjects and prepare samples which you run on the best equipment available after painstakingly optimising the running conditions, only to find your results underwhelming or, even worse, misleading! So how do you ensure that missing values don’t jeopardise your chances of research success?

The Progenesis co-detection solution

The handling of missing values has been described in the literature as “an absolutely vital step in data pre-processing” and one “to which special consideration should be given” [1]. However, what if the problem of missing values did not exist? What if there was a workflow for LC-MS discovery ‘omics analysis that eliminated missing values and maximised statistical power in experiments of any size, without resorting to dubious data imputation practices? This is the Progenesis co-detection solution, and I’ll be telling you how it works in my next blog post.

1. Gromski et al., Metabolites 2014, 4, 433–452.

2. Hrydziuszko and Viant, Metabolomics 2012, 8 (Suppl. 1), 161–174.