Institute for Advanced Biosciences Keio University
MathDAMP Mathematica package for differential analysis of metabolite profiles
04-MathDAMP-Outliers


This notebook provides a template for identifying outliers within a group of datasets with the MathDAMP package. The main approaches employed are the analysis of z-scores and the analysis of quartiles for groups of corresponding signal intensities.
Additional notebooks from the MathDAMP package (03-MathDAMP-TwoDatasets.nb, 05-MathDAMP-TwoGroups.nb, and 06-MathDAMP-MultipleGroups.nb) provide templates for the comparison of two datasets, two groups of replicate datasets, and multiple groups of replicate datasets, respectively. The notebook 02-MathDAMP-Elements.nb demonstrates the basic functionality of the MathDAMP package.

Step 1: Loading the Data

First, the MathDAMP package has to be loaded. Please assign the path leading to MathDAMP files to the MathDAMPPath variable. Due to the size of the datasets and results, the global variable $HistoryLength is set to 1 to save memory.

$HistoryLength = 1 ;

MathDAMPPath = "/home/baran/math/ms/MathDAMP.1.0.0/" ;

<< (MathDAMPPath<>"MathDAMP.m")

MathDAMP version 1.0.0 loaded (2006/04/26)

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Datasets acquired by capillary electrophoresis coupled to a quadrupole mass spectrometer (CE-QMS) operated in selected ion monitoring mode (SIM) will be used for the demonstration in this notebook. The datafiles are part of the MathDAMP package.

data = DAMPImportMS[#] &/@FileNames[MathDAMPPath<>"/data/*.ms"] ;

Optional: Exploring the data, locating the peak of the internal standard in the reference dataset

Step 2: Performing the Differential Analysis

The function DAMPOutliers performs the calculation of z-scores and the quartile-based analysis for a group of datasets. The DAMPNormalizeGroup function is used internally to align and normalize the datasets along with the annotation tables. Please refer to the MathDAMP.nb notebook for more details about the implementation of the functions DAMPNormalizeGroup and DAMPOutliers. Execute ?FunctionName to display a brief description of the respective function and its available options.

? DAMPNormalizeGroup

? DAMPOutliers

The loaded datasets are preprocessed prior to the differential analysis by applying baseline subtraction. Noise removal is not necessary, since calculations based on z-scores and quartiles are much less sensitive to noise in the data than, for example, the calculation of the relative difference between two datasets.

ppdata = DAMPSubtractBaselines[#] &/@data ;

Most of the options for DAMPOutliers and DAMPNormalizeGroup are specified explicitly in the following command so that they can be edited easily. The annotation table for the cation mode CE-MS analysis is used. This table was assembled according to a CE-TOFMS analysis of a mixture of standard compounds. Methionine sulfone is used as the internal standard. Its short name in the annotation table, 363, is passed to the DAMPNormalizeGroup function via the InternalStandard option. The location of the peak of the internal standard is then extrapolated from the aligned annotation table. Overlaid electropherograms of the vicinities of the expected internal standard peaks are plotted, along with indicators of the beginning and the end of the blindly integrated regions, for visual confirmation. To specify the location of the internal standard peak in the reference dataset explicitly, use the notation {mz,{starttime,endtime}} instead of the short name. In this case it would be {182,{13.3,13.7}} (according to the electropherogram at the end of the optional section).
The first dataset from the ppdata list is used as the reference dataset (as specified by the option Reference->1). The annotation table is reduced to contain only items with m/z values relevant to the analyzed datasets.
The DAMPOutliers function generates two types of results: a z-score and a quartile-based result. The results are calculated based on the most outlying signal intensity from within the corresponding set of signal intensities. Optionally, the results may be generated individually for every dataset. For more details on the implementation of the DAMPOutliers function, please refer to the MathDAMP.nb notebook.
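The two kinds of scores can be illustrated on a single set of corresponding signal intensities (one value per dataset). The following is only a minimal sketch using built-in Mathematica functions, not the implementation used by DAMPOutliers (see MathDAMP.nb for that). The helper names, the sample values, and the exact scoring formulas are assumptions made for illustration; in particular, the effect of the OutliersToDrop option is not modelled.

```mathematica
(* One set of corresponding signal intensities, one value per dataset (made-up values) *)
intensities = {10., 11., 9., 10., 12., 10., 30.};

(* z-score of every value relative to the mean and standard deviation of the whole set *)
zScores[x_List] := (x - Mean[x])/StandardDeviation[x];

(* Assumed quartile-based score: signed distance beyond the nearer quartile,
   in units of the interquartile range; 0 for values between Q1 and Q3.
   The interquartile range must be non-zero. *)
quartileScores[x_List] := Module[{q1, q2, q3, iqr},
  {q1, q2, q3} = Quartiles[x];
  iqr = q3 - q1;
  Map[Function[v, Which[v > q3, (v - q3)/iqr, v < q1, (v - q1)/iqr, True, 0.]], x]];

(* Signed score of the most outlying value, i.e. the one with the largest absolute score *)
mostOutlying[scores_List] := scores[[First[Ordering[Abs[scores], -1]]]];

mostOutlyingZ[x_List] := mostOutlying[zScores[x]];
mostOutlyingQ[x_List] := mostOutlying[quartileScores[x]];

{mostOutlyingZ[intensities], mostOutlyingQ[intensities]}
```

In this example the value 30. is the most outlying one under both criteria. Because the quartiles are barely affected by a single extreme value, the quartile-based score reacts much more strongly to it than the z-score does, whose mean and standard deviation are themselves inflated by the outlier.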

rslt = DAMPOutliers[ppdata, NormalizeGroupOptions -> {Reference -> 1, AlignmentTime ... ts -> None}, OutliersToDrop -> 1, IndividualZs -> False, IndividualQs -> True] ;

IS normalization coefficients : {1., 1.44175, 1.07612, 1.43676, 1.38857, 1.33193, 0.953661}
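The idea behind such coefficients can be sketched as follows: the internal standard peak is integrated blindly over a fixed time window in every dataset, and each dataset is scaled so that its peak area matches that of the reference. This is only an illustration under stated assumptions. The helper names and synthetic traces are hypothetical, trapezoidal integration is an assumed choice, and the direction of the ratio (reference area divided by dataset area) is inferred only from the fact that the reference dataset has a coefficient of 1.

```mathematica
(* Hypothetical helper: trapezoidal area of a trace, given as {time, intensity}
   pairs, restricted to the time window {t1, t2} *)
peakArea[trace_List, {t1_, t2_}] := Module[{pts},
  pts = Select[trace, t1 <= First[#] <= t2 &];
  Total[Differences[pts[[All, 1]]]*MovingAverage[pts[[All, 2]], 2]]];

(* Synthetic electropherograms: Gaussian peaks at 13.5 min with different heights *)
times = Range[13., 14., 0.01];
makeTrace[h_] := Transpose[{times, h*Exp[-((times - 13.5)/0.05)^2]}];
traces = Map[makeTrace, {1000., 700., 900.}];

(* Area of the internal standard peak in the window {13.3, 13.7} for every dataset *)
areas = Map[Function[t, peakArea[t, {13.3, 13.7}]], traces];

(* Assumed convention: coefficient = area in the reference (dataset 1) / area in the dataset *)
coefficients = First[areas]/areas
```

With these synthetic traces, the reference obtains a coefficient of 1 and datasets with a smaller internal standard peak obtain coefficients above 1, which mirrors the pattern of the list printed above.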

Step 3: Exploring the Results, Listing the Candidates

The parallel plot provides an overall view of the aligned datasets. Differences in particular datasets may be spotted as discontinuities in the vertical bands (representing aligned peaks corresponding to an identical m/z value from different datasets). Black horizontal bands are visible on the plot for one of the datasets (at migration times 28 to 30 min). They are present because the electropherograms had to be extrapolated after alignment: the datapoints corresponding to this region in the reference dataset were not present in the extrapolated dataset. The remaining results of the DAMPOutliers function will therefore be cropped to 28 min for visualization.
Annotation is not shown on the plots below. To show the annotation labels or to modify the appearance of the plots, please refer to the 02-MathDAMP-Elements.nb notebook.

DAMPParallelPlot[NormalizedDatasets/.rslt, MaxScale -> 200000, PlotOptions -> {AspectRatio -> .8}] ;


Below are the density plots of the overall z-score and the overall quartile-based result. These results were calculated for the most outlying signal intensity from the set of corresponding signal intensities from all datasets. The result datasets are smoothed (by applying a moving average filter) to suppress strong signals arising from 'lucky' constellations within a particular set of corresponding noise-related signal intensities (without any strong signals in the neighbourhood).

DAMPDensityPlot[DAMPSmooth[DAMPCrop[ZScores/.rslt, TimeRange -> {0, 28}]], MaxScale -> 10, Palette -> DAMPGradientPalette[ColorPositions -> {.5, .75}]] ;

DAMPDensityPlot[DAMPSmooth[DAMPCrop[QuartileResult/.rslt, TimeRange -> {0, 28}]], MaxScale -> 10, Palette -> DAMPGradientPalette[ColorPositions -> {.25, .6}]] ;
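The effect of the smoothing step can be illustrated with the built-in MovingAverage function on a synthetic row of scores. This is a sketch only: the window width of 5 points and the sample data are assumptions, and DAMPSmooth may use a different window or different edge handling.

```mathematica
(* Synthetic scores along the migration time axis: an isolated one-point spike
   (noise-like) followed by a five-point plateau (peak-like), both of height 10 *)
scores = Join[ConstantArray[0., 10], {10.}, ConstantArray[0., 20],
   ConstantArray[10., 5], ConstantArray[0., 10]];

(* Moving average over 5 neighbouring points (assumed window width) *)
smoothed = MovingAverage[scores, 5];

(* Maximum in the region of the isolated spike and in the region of the plateau *)
spikeMax = Max[smoothed[[1 ;; 20]]];
plateauMax = Max[smoothed[[21 ;; -1]]];
{spikeMax, plateauMax}
```

After smoothing, the isolated spike is reduced to a fifth of its original height (2), whereas the plateau, which is supported by strong signals in its neighbourhood, still reaches 10.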



Quartile-based results were also generated for all individual datasets. Below are the respective density plots.

DAMPDensityPlot[DAMPSmooth[DAMPCrop[#, TimeRange -> {0, 28}]], MaxScale -> 10, Pale ... DAMPGradientPalette[ColorPositions -> {.25, .6}]] &/@(QuartileResultIndividual/.rslt) ;

Overlaid electropherograms in the vicinities of the most significant outlier candidates (from a particular result) may be plotted in descending order of significance for visual confirmation (and for the rejection of false positives). Below are the electropherograms of the top 20 candidates from the overall z-score result. Further below are the top 20 candidates from the overall quartile-based result. The vertical dashed line indicates the position of the most significant result according to the selected criteria.

DAMPPlotCandidates[NormalizedDatasets/.rslt, DAMPSmooth[DAMPCrop[ZScores/.rslt, TimeRange ... amOptions -> {AnnotationTable -> (AlignedAnnotationTables/.rslt)[[1]]}] ;

DAMPPlotCandidates[NormalizedDatasets/.rslt, DAMPSmooth[DAMPCrop[QuartileResult/.rslt, TimeR ... amOptions -> {AnnotationTable -> (AlignedAnnotationTables/.rslt)[[1]]}] ;
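Listing candidates in descending order of significance amounts to ranking the positions of a score matrix by the magnitude of their scores. The sketch below shows this ranking with built-in functions on a small synthetic matrix. The helper name and the data are hypothetical, and the way DAMPPlotCandidates actually selects its candidates (for example, how it treats neighbouring datapoints that belong to the same peak) is not reproduced here.

```mathematica
(* Synthetic score matrix: rows stand for m/z channels, columns for time points *)
scoreMatrix = {{0.5, 1.2, 9.1, 0.3},
               {7.4, 0.1, 0.2, 3.3},
               {0.0, 2.2, 0.4, 8.8}};

(* Hypothetical helper: {row, column} positions of the n largest absolute
   scores, most significant first *)
topCandidates[m_?MatrixQ, n_Integer] := Module[{cols, flat, idx},
  cols = Last[Dimensions[m]];
  flat = Flatten[Abs[m]];
  idx = Reverse[Ordering[flat, -n]];
  Map[Function[i, {Quotient[i - 1, cols] + 1, Mod[i - 1, cols] + 1}], idx]];

topCandidates[scoreMatrix, 3]
```

For this matrix the result is {{1, 3}, {3, 4}, {2, 1}}, which corresponds to the scores 9.1, 8.8, and 7.4. A practical implementation would typically also exclude the neighbourhood of an already selected candidate so that a single broad peak is not reported several times.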