MathDAMP Mathematica package for differential analysis of metabolite profiles

Looking for outliers within multiple datasets

The DAMPOutliers function uses the DAMPNormalizeGroup function to align and normalize a group of datasets, and then highlights outlying signals. Two types of results are generated: a z-score map and a quartile-based result. Optionally, both types of results can be generated for each individual dataset. Overall results are generated by default: for each of the two approaches, a single resulting dataset is produced by using the most outlying datapoint from every set of corresponding signal intensities. This simplifies the generation of chromatograms/electropherograms for candidates, since these do not have to be generated separately for each dataset's results.
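
The collapse into one overall result can be illustrated with a minimal Python sketch. It is not the package's Mathematica code. It assumes that per-dataset scores have already been computed and that "most outlying" means the score with the largest absolute value at each position; the name `overall_result` is hypothetical.

```python
def overall_result(score_maps):
    """Collapse per-dataset score lists into a single list.

    Assumption: each inner list holds the scores of one dataset, and all
    lists are aligned so that position i refers to the same signal.
    For every position, the score with the largest magnitude is kept,
    sign included.
    """
    return [max(column, key=abs) for column in zip(*score_maps)]


# Three aligned datasets with three signal positions each (made-up scores).
scores = [
    [0.0, 1.5, -0.2],
    [0.3, -4.0, 0.1],
    [-0.1, 2.0, 0.0],
]
combined = overall_result(scores)
```

Keeping the sign lets a single combined map show both unusually high and unusually low signals.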

DAMPDropOutliers is used internally by the DAMPOutliers function to remove a selected number of outliers from a set of values. It is employed in the z-score calculation. The presence of one or more outliers influences both the mean and the standard deviation of a set, which may lead to a relatively low z-score for an outlier. It is therefore desirable to calculate the mean and the standard deviation (which are used for calculating the z-scores) from a set of values that does not contain the outliers. The OutliersToDrop option of the DAMPOutliers function determines the number of outliers to drop from every set of corresponding signal intensities. When the number is set to more than one, the dropping is performed iteratively: in each step the mean is calculated and the single value most distant from it is removed from the set. The same specified number of values is dropped from every set, regardless of whether those values would actually be classified as outliers.
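
The effect of dropping values before computing the z-scores can be sketched in Python as follows. This is an illustration under stated assumptions rather than the package's implementation: the function names are hypothetical, the sample standard deviation is used, and at least two values are always kept so that a standard deviation can be computed.

```python
from statistics import mean, stdev


def drop_outliers(values, n_drop):
    """Iteratively remove the value farthest from the current mean."""
    remaining = list(values)
    for _ in range(n_drop):
        if len(remaining) <= 2:
            break  # assumption: keep two values so stdev is defined
        m = mean(remaining)  # mean is recalculated in every iteration
        farthest = max(remaining, key=lambda v: abs(v - m))
        remaining.remove(farthest)  # removes one occurrence only
    return remaining


def z_scores(values, n_drop=1):
    """Z-scores of all values, using mean and stdev of the reduced set."""
    kept = drop_outliers(values, n_drop)
    m, s = mean(kept), stdev(kept)
    if s == 0:
        raise ValueError("standard deviation of the reduced set is zero")
    # Every original value is scored, including the dropped ones.
    return [(v - m) / s for v in values]


intensities = [10, 11, 9, 10, 30]  # made-up corresponding intensities
z_without_drop = z_scores(intensities, n_drop=0)
z_with_drop = z_scores(intensities, n_drop=1)
```

For these made-up values, the z-score of 30 is about 1.8 when nothing is dropped and about 24.5 after one value is dropped, which shows how an outlier can mask itself by inflating the standard deviation.
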
The result of the quartile-based calculation is set to 0 if the tested signal intensity lies between the first and the third quartile of the set of corresponding signal intensities from all datasets (in contrast to the z-score calculation, no 'outliers' are dropped in this case). If the tested signal intensity is greater than the third quartile, the result is the difference between the tested signal intensity and the third quartile, divided by the interquartile range (the difference between the third and the first quartile). If the tested signal intensity is less than the first quartile, the result is the difference between the first quartile and the tested signal intensity, divided by the interquartile range, with a negative sign assigned to indicate a 'negative' outlier.
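
The quartile-based rule can be written as a short Python sketch. It is illustrative only: the function name is hypothetical, and the quartile definition (Python's "inclusive" method) is an assumption that may differ from the one Mathematica uses, so exact values can vary slightly.

```python
from statistics import quantiles


def quartile_score(x, values):
    """Signed distance of x from the interquartile band, in IQR units."""
    # Assumption: quartiles via the "inclusive" interpolation method.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    if q1 <= x <= q3:
        return 0.0  # inside the band between the first and third quartile
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero")
    if x > q3:
        return (x - q3) / iqr  # 'positive' outlier
    return -(q1 - x) / iqr  # 'negative' outlier


intensities = [10, 11, 9, 10, 30]  # made-up corresponding intensities
scores = [quartile_score(v, intensities) for v in intensities]
```

No values are removed before the quartiles are computed, which matches the description above: the quartiles are already robust to a few extreme values.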