|Institute for Advanced Biosciences||Keio University|
|MathDAMP||Mathematica package for differential analysis of metabolite profiles|
Peaks found in the first 5 electropherograms from the ctrl dataset are listed above. The layouts of picked peaks may be displayed on peak layout plots for comparison or for visual inspection of the alignment quality.
The peak lists may also be converted to an annotation table format for display on chromatograms or on density plots.
Sometimes a large number of peaks is erroneously picked from a single chromatogram/electropherogram (noise above the peak-picking threshold), or an overwhelming number of redundant signals is present along the m/z dimension at a certain retention/migration time. Such signals may bias the alignment, so it can be desirable to select a representative set of peaks from a peak list. In addition, the alignment procedure takes longer for peak lists with many peaks. The function DAMPSelectRepresentativePeaks performs the representative peak selection.
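The idea of thinning a peak list can be illustrated with a minimal Python sketch. It is not the algorithm used by DAMPSelectRepresentativePeaks, which is not described here. The sketch assumes peaks are stored as (m/z, time, intensity) tuples and keeps only the most intense peak(s) in each time bin; the function name, the bin width, and the tuple layout are all hypothetical.

```python
from collections import defaultdict


def select_representative(peaks, bin_width=0.5, per_bin=1):
    """Keep the `per_bin` most intense peaks in each time bin.

    peaks: list of (mz, time_min, intensity) tuples (assumed layout).
    """
    bins = defaultdict(list)
    for mz, time_min, intensity in peaks:
        bins[int(time_min // bin_width)].append((mz, time_min, intensity))
    kept = []
    for key in sorted(bins):
        strongest_first = sorted(bins[key], key=lambda p: p[2], reverse=True)
        kept.extend(strongest_first[:per_bin])
    return kept


# Three redundant signals near 10.3 min and one separate peak at 12.9 min.
peaks = [
    (136.0, 10.30, 900.0),
    (137.0, 10.31, 400.0),
    (138.0, 10.32, 100.0),
    (182.0, 12.90, 700.0),
]
representative = select_representative(peaks)
print(representative)
```

With these made-up values, the three signals near 10.3 min collapse to the strongest one, so fewer peaks enter the alignment.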
Annotation table manipulation
Annotation tables are intended to facilitate easier identification of peaks. The annotation table may be constructed according to an analysis of a mixture of standard compounds. The table is expected to consist of 5 columns: m/z, retention/migration time, short compound name/id, full compound name, and relative position of the text with respect to the label on the density plots (1 - right, 2 - top, 3 - left, 4 - bottom, 1.5 - top right, etc).
|71.0637||10.351||1||3-Aminopropionitrile monofumarate salt||4|
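The five-column layout can be made concrete with a short Python sketch that parses the row above. The class and function names are hypothetical. The conversion of the position code to an angle is an interpretation of the codes listed above (1 - right, 2 - top, 1.5 - top right), not something taken from the package.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    mz: float              # m/z value
    time_min: float        # retention/migration time
    short_name: str        # short compound name/id
    full_name: str         # full compound name
    label_position: float  # 1 right, 2 top, 3 left, 4 bottom


def parse_row(row: str) -> Annotation:
    """Split a '|a||b|...' row into the five annotation columns."""
    cells = row.strip().strip("|").split("||")
    if len(cells) != 5:
        raise ValueError(f"expected 5 columns, got {len(cells)}")
    mz, time_min, short_name, full_name, position = cells
    return Annotation(float(mz), float(time_min), short_name, full_name, float(position))


def label_angle_degrees(position: float) -> float:
    """Assumed mapping: 1 -> 0 deg (right), 2 -> 90 deg (top), 1.5 -> 45 deg."""
    return ((position - 1.0) * 90.0) % 360.0


row = parse_row("|71.0637||10.351||1||3-Aminopropionitrile monofumarate salt||4|")
print(row.short_name, label_angle_degrees(row.label_position))  # prints: 1 270.0
```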
Multiple annotation tables can easily be displayed on a single density plot. Their appearance may be modified as well. The following example shows the above loaded (and unaligned) annotation table along with the labels for picked peaks (shown as gray dots). The alignment of annotation tables to MathDAMP datasets is demonstrated in the next section.
To align two datasets, the parameters of a (custom) function describing the time shifts between corresponding peaks in the two datasets are optimized. A combination of global optimization and dynamic programming is used for this purpose. The function is then used to rescale the timescale of one of the datasets, the chromatograms/electropherograms are interpolated, and timepoints identical to those in the reference dataset are selected. For details regarding the dataset alignment procedure, please refer to the MathDAMP.nb notebook.
DAMPFitShiftFunction performs the parameter optimization for the retention/migration time shift function.
Only a subset of peaks (selected by DAMPSelectRepresentativePeaks) was used to achieve faster alignment. A function derived by Reijenga et al. (see the MathDAMP.nb notebook for details) is used by default as the retention/migration time shift function, owing to the predominant use of capillary electrophoresis based techniques at the authors' institution (Institute for Advanced Biosciences, Keio University). Any function may be passed to DAMPFitShiftFunction as the retention/migration time shift function, as demonstrated below with a second-order polynomial.
The peak layouts may be shown for visual confirmation of the alignment (alignment done with the default time shift function). The layout of peaks prior to the alignment is shown in the previous section.
After finding the time shift function, the aligned dataset is created using the function DAMPAlign.
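The rescale-and-interpolate step can be sketched in a few lines of Python. This is only an illustration and does not reproduce DAMPAlign. It assumes a second-order polynomial shift function whose coefficients are already known, a single trace, linear interpolation, and clamping at the ends of the trace. All names are hypothetical.

```python
from bisect import bisect_right


def shift_poly(t, coeffs):
    """Second-order polynomial time shift: t' = c0 + c1*t + c2*t^2."""
    c0, c1, c2 = coeffs
    return c0 + c1 * t + c2 * t * t


def interpolate(xs, ys, x):
    """Linear interpolation at x; xs must be strictly increasing.

    Values outside the range of xs are clamped to the end points.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def align_trace(times, intensities, coeffs, reference_times):
    """Rescale the time axis, then resample at the reference timepoints."""
    rescaled = [shift_poly(t, coeffs) for t in times]
    return [interpolate(rescaled, intensities, t) for t in reference_times]


times = [0.0, 1.0, 2.0, 3.0, 4.0]
trace = [0.0, 0.0, 10.0, 0.0, 0.0]
# A constant shift of +0.5 min (c1 = 1, c2 = 0) moves the peak from 2.0 to 2.5 min.
aligned = align_trace(times, trace, (0.5, 1.0, 0.0), times)
print(aligned)  # prints: [0.0, 0.0, 5.0, 5.0, 0.0]
```

After resampling, the aligned trace shares its timepoints with the reference, so the two can be compared point by point.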
Annotation tables may be aligned in a similar fashion. The step below demonstrates the robustness of the alignment procedure. Even a relatively small number of corresponding peaks is sufficient for finding the optimal alignment. The time shifts of the unaligned annotation labels were substantial (over 5 min), as can be seen by comparing the density plots below with the one at the end of the previous section.
Plots of chromatograms/electropherograms may be annotated as well. Full compound names are used instead of short names/ids in this case.
The DAMPNormalizeGroup function aligns and normalizes multiple datasets to a selected reference dataset. It combines the steps described in this section with the intensity normalization step described in the next section. The DAMPNormalizeGroup function is used by the functions for common types of differential analysis of metabolite profiles demonstrated in the notebooks 03-MathDAMP-TwoDatasets.nb, 04-MathDAMP-Outliers.nb, 05-MathDAMP-TwoGroups, and 06-MathDAMP-MultipleGroups.nb.
Often the signal intensity values of the datasets have to be normalized according to the peak of the internal standard. MathDAMP implements very simple peak integration functionality: a specified range of a chromatogram/electropherogram is integrated blindly. When normalizing multiple datasets using the DAMPNormalizeGroup function (mentioned at the end of the previous section), the location of the peak of the internal standard in the reference dataset either has to be specified explicitly or can be derived from the aligned annotation table. In the latter case, only the short name/id of the internal standard is specified and the peak is located automatically. For more details about the DAMPNormalizeGroup function, please refer to the MathDAMP.nb notebook and the 03-MathDAMP-TwoDatasets.nb, 04-MathDAMP-Outliers.nb, 05-MathDAMP-TwoGroups, and 06-MathDAMP-MultipleGroups.nb notebooks.
Signal intensity normalization of the alignedsmpl dataset to the ppctrl dataset according to the area of the methionine sulfone peak is shown below. The location of the peak is specified explicitly.
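Blind integration over a fixed range, followed by rescaling, might look like the following Python sketch. The trapezoidal rule, the function names, and the choice to scale the sample so that its internal standard area matches the reference are assumptions made for illustration; the package's own integration routine may differ.

```python
def integrate_range(times, intensities, t_start, t_end):
    """Trapezoidal area over the sample points with t_start <= t <= t_end."""
    pts = [(t, y) for t, y in zip(times, intensities) if t_start <= t <= t_end]
    return sum((t1 - t0) * (y0 + y1) / 2.0
               for (t0, y0), (t1, y1) in zip(pts, pts[1:]))


def normalize_to_reference(sample_traces, reference_area, sample_area):
    """Scale every intensity so the internal standard areas match."""
    factor = reference_area / sample_area
    return [[y * factor for y in trace] for trace in sample_traces]


times = [0.0, 1.0, 2.0, 3.0, 4.0]
reference_is = [0.0, 0.0, 4.0, 0.0, 0.0]  # internal standard trace, reference
sample_is = [0.0, 0.0, 2.0, 0.0, 0.0]     # internal standard trace, sample

reference_area = integrate_range(times, reference_is, 1.0, 3.0)  # 4.0
sample_area = integrate_range(times, sample_is, 1.0, 3.0)        # 2.0
normalized = normalize_to_reference([sample_is], reference_area, sample_area)
print(normalized)  # prints: [[0.0, 0.0, 4.0, 0.0, 0.0]]
```

Because the range is integrated blindly, no baseline correction or peak detection takes place; the window has to enclose the internal standard peak and nothing else.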
Comparing normalized datasets
One way to compare the normalized datasets is to interlace their chromatograms/electropherograms into each other and plot the resulting dataset on a parallel plot. Here, the electropherograms from the datasets corresponding to identical m/z values are plotted next to each other. Differences would appear as half-bands (like for m/z 136 at about 9.5 min or for m/z 137 between 12 and 13 min).
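Interlacing amounts to alternating the rows of the two signal intensity matrices. A minimal Python sketch follows, assuming each dataset is a list of traces ordered by identical m/z values; the function name is hypothetical.

```python
def interlace(ctrl_traces, smpl_traces):
    """Return ctrl[0], smpl[0], ctrl[1], smpl[1], ... as one list of traces."""
    if len(ctrl_traces) != len(smpl_traces):
        raise ValueError("datasets must contain the same number of traces")
    interlaced = []
    for ctrl_trace, smpl_trace in zip(ctrl_traces, smpl_traces):
        interlaced.append(ctrl_trace)
        interlaced.append(smpl_trace)
    return interlaced


ctrl = [[0, 5, 0], [0, 0, 7]]  # traces for two m/z values
smpl = [[0, 5, 0], [0, 0, 0]]  # second trace lacks the peak
combined = interlace(ctrl, smpl)
print(combined)  # prints: [[0, 5, 0], [0, 5, 0], [0, 0, 7], [0, 0, 0]]
```

In the combined dataset, a peak present in only one of two neighbouring rows is what would show up as a half-band on the plot.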
Additionally, simple arithmetic operations may be performed on the signal intensity matrices of the two normalized datasets to highlight differences between them. Subtraction provides a dataset representing the difference in signal intensities. The normasmpl dataset contains identical timepoints to the ctrl dataset (via the DAMPAlign function) so the m/z value list as well as the list of timepoints is taken from the ctrl dataset.
The result contains some signals indicating either a positive (yellow/red) or a negative (cyan/blue) difference between the datasets. Some ambiguous signals (red and blue in close proximity) appear in the result as well. These may be caused by partial misalignments of the corresponding peaks or by small relative differences in large peaks (a small relative difference can still be significant in absolute terms). For instance, there are two ambiguous signals in the topmost lane (m/z 182) around a migration time of 13 min on the density plot above. These correspond to the peaks shown on the last electropherogram in the previous section. The signals on the density plot are due to an imperfect overlap of the corresponding peaks. The ambiguous signals may be ruled out as false positives after visual inspection of overlaid chromatograms/electropherograms (which can be generated automatically in ranked order as described in the next section). Alternatively, these signals may be suppressed using the different visualization approaches described below.
A relative difference between the two datasets can be calculated in a way similar to the absolute difference. In this case, the difference between the corresponding signal intensities is divided by the larger of the two (or by the absolute value of the difference between the two signal intensities if one of them is negative). The signal intensities in the resulting dataset fall within the range -1 to 1.
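For a single pair of corresponding signal intensities, the rule above can be written out as a small Python function. The direction of the subtraction (sample minus control) and the value returned when both intensities are equal are assumptions of this sketch; the function name is hypothetical.

```python
def relative_difference(ctrl, smpl):
    """Relative difference of two intensities, bounded to [-1, 1]."""
    diff = smpl - ctrl
    if diff == 0:
        return 0.0  # assumed convention; also avoids dividing by zero
    if ctrl < 0 or smpl < 0:
        # One intensity is negative: divide by |diff|, giving -1 or +1.
        return diff / abs(diff)
    return diff / max(ctrl, smpl)


print(relative_difference(100.0, 50.0))  # prints: -0.5
print(relative_difference(50.0, 100.0))  # prints: 0.5
print(relative_difference(-10.0, 30.0))  # prints: 1.0
```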
Tiny peaks (often noise-related signals that evaded preprocessing) may produce strong signals in the relative result. A signal intensity threshold of 5000 suppresses these influences in the previous result. In spite of this, numerous misleading signals still remain. A simple way to suppress both the signals originating from small relative changes in huge peaks (scoring high on the absolute difference plot) and those originating from large relative differences in tiny noise-related peaks (scoring high on the relative difference plot) is to multiply the absolute and relative difference results (below). Differences significant in both absolute and relative terms tend to be highlighted. As shown below, ambiguous signals become suppressed and signals coming from actual differences acquire better visibility. This holds even if the threshold for the relative difference is set to 0.
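The effect of the multiplication can be checked on three made-up intensity pairs. The Python sketch below takes the product literally for a single pair of values; the numbers and names are illustrative only, all three differences are positive, and how the package treats the sign of the product is not addressed here.

```python
def relative_difference(ctrl, smpl):
    """Difference divided by the larger value (or by |diff| if one is negative)."""
    diff = smpl - ctrl
    if diff == 0:
        return 0.0
    if ctrl < 0 or smpl < 0:
        return diff / abs(diff)
    return diff / max(ctrl, smpl)


def abs_times_rel(ctrl, smpl):
    """Product of the absolute and the relative difference."""
    return (smpl - ctrl) * relative_difference(ctrl, smpl)


cases = {
    "small change in a huge peak": (100000.0, 101000.0),
    "tiny noise-related peak": (0.0, 50.0),
    "actual difference": (20000.0, 60000.0),
}
scores = {name: abs_times_rel(c, s) for name, (c, s) in cases.items()}
for name, score in scores.items():
    print(f"{name}: {score:.1f}")
```

The huge peak scores highest in absolute terms (1000 versus 50) and the noise peak highest in relative terms (1.0), but only the actual difference scores high on the product.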
The availability of replicate datasets allows statistical tests to be applied to all corresponding signal intensities. Examples may be found in the notebooks 04-MathDAMP-Outliers.nb (looking for outliers within multiple datasets using z-scores and by analyzing quartiles), 05-MathDAMP-TwoGroups (comparing two groups of replicates, including the t-test), and 06-MathDAMP-MultipleGroups.nb (comparing multiple groups of replicates using the F ratio). Noise removal proves not to be necessary when using these approaches. However, the resulting datasets are usually smoothed (by applying a moving average filter) to suppress strong signals originating from 'lucky' constellations of a particular set of corresponding noise-related signal intensities (without any strong signals in their neighborhood).
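Why a moving average suppresses isolated signals but preserves signals with strong neighbors can be seen in a small Python sketch. The window size and the shrinking of the window at the edges are assumptions; the filter used in the package may be parameterized differently.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


isolated = [0.0, 0.0, 9.0, 0.0, 0.0]  # one 'lucky' point, no strong neighbors
broad = [0.0, 6.0, 6.0, 6.0, 0.0]     # a signal supported by its neighbors

print(moving_average(isolated))  # prints: [0.0, 3.0, 3.0, 3.0, 0.0]
print(moving_average(broad))     # prints: [3.0, 4.0, 6.0, 4.0, 3.0]
```

The isolated point drops from 9 to 3, whereas the maximum of the broader signal stays at 6.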
Any resulting datasets may be further combined (in a way similar to the absolute×relative result) or used as filtering criteria for other results. Below is an example of selecting only those datapoints from the relative difference where at least one of the corresponding signal intensities in the ctrl and smpl datasets exceeds a threshold (10000). Using the DAMPFilter function may prove especially useful when the result of a statistical test (like the t-test for two groups of replicates) is used as the criteria dataset (to filter the absolute×relative result between the averaged groups, for instance).
The DAMPApplyFunctionToGroup function applies a specified pure function to corresponding signal intensities in a group of datasets. Above, the function was used to create a dataset containing the maxima of corresponding signal intensities in the ctrl and smpl datasets (by using the Max function as the specified pure function).
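The two steps (building a criteria dataset from elementwise maxima, then filtering a result with it) can be mimicked in plain Python. The functions below are hypothetical stand-ins and do not reflect the signatures of DAMPApplyFunctionToGroup or DAMPFilter. The sketch assumes that datapoints failing the criterion are set to zero.

```python
def apply_to_group(func, datasets):
    """Apply func to corresponding intensities across same-shaped matrices."""
    return [[func(values) for values in zip(*rows)] for rows in zip(*datasets)]


def filter_by_criteria(result, criteria, threshold, fill=0.0):
    """Keep result values where the criteria value exceeds the threshold."""
    return [[r if c > threshold else fill for r, c in zip(result_row, criteria_row)]
            for result_row, criteria_row in zip(result, criteria)]


ctrl = [[12000.0, 300.0], [500.0, 800.0]]
smpl = [[6000.0, 900.0], [20000.0, 400.0]]
relative = [[-0.5, 0.667], [0.975, -0.5]]  # made-up relative differences

maxima = apply_to_group(max, [ctrl, smpl])
filtered = filter_by_criteria(relative, maxima, threshold=10000.0)
print(maxima)    # prints: [[12000.0, 900.0], [20000.0, 800.0]]
print(filtered)  # prints: [[-0.5, 0.0], [0.975, 0.0]]
```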
Listing the overlaid chromatograms/electropherograms in the vicinities of the most significant differences
For the visual confirmation of significant differences between the datasets (and for the rejection of false positives), overlaid chromatograms/electropherograms are plotted in descending order of significance. Below are the electropherograms of the top 12 differences from the absolute×relative difference result from above. The vertical dashed line indicates the position of the most significant difference according to the selected criteria.
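Ranking the datapoints of a result by significance can be sketched as follows in Python. The sketch simply sorts all datapoints by absolute value and returns their m/z and time coordinates. It is an assumption that significance equals the absolute value, and the sketch does not merge neighbouring datapoints belonging to the same peak, which a practical implementation would probably have to do. The function name is hypothetical.

```python
import heapq


def top_differences(result, mz_values, times, k):
    """Return the k datapoints with the largest |value| as (mz, time, value)."""
    points = [
        (mz_values[i], times[j], value)
        for i, row in enumerate(result)
        for j, value in enumerate(row)
    ]
    return heapq.nlargest(k, points, key=lambda point: abs(point[2]))


result = [[0.0, 5.0, -1.0],
          [-8.0, 0.5, 2.0]]
mz_values = [136, 137]
times = [9.5, 12.0, 12.5]

ranked = top_differences(result, mz_values, times, k=2)
print(ranked)  # prints: [(137, 9.5, -8.0), (136, 12.0, 5.0)]
```

Each returned coordinate pair identifies the trace and the time window around which the overlaid chromatograms/electropherograms would be plotted.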
This notebook demonstrated the core functionality of the MathDAMP package. For more convenient use, the core functions are assembled into modules for common types of differential analysis of metabolite profiles. Examples can be found in the additional notebooks (03-MathDAMP-TwoDatasets.nb, 04-MathDAMP-Outliers.nb, 05-MathDAMP-TwoGroups, and 06-MathDAMP-MultipleGroups.nb) from the MathDAMP package.