Institute for Advanced Biosciences Keio University
MathDAMP Mathematica package for differential analysis of metabolite profiles
Home Overview Examples Downloads TriDAMP References Contact
MathDAMP > Examples > MathDAMP source > Core Functionality > Peak picking

Peak picking

Peak picking is performed to find a set of peaks in every dataset and use these to align the datasets (find the parameters of the time shift function, subsection below). Thanks to the robustness of the alignment approach, even few corresponding peaks between the datasets are sufficient to find the best parameters for the alignment. For this reason, the demands on the quality of peak-picking algorithm are low. The peak lists may contain many peaks without corresponding counterparts across the lists or erroneous peak picks as well.
The implementation of the peak picking algorithm was inspired by Wallace's et al. [2] use of the Douglas-Peucker algorithm. In their approach, the chromatogram/electropherogram is segmented into portions defined by strategic points. First, the endpoints of the chromatogram/electropherogram are the only two strategic points. Additional strategic points are found by recursively finding a point with the biggest orthogonal distance from the line connecting two neighboring strategic points (and the distance being above a threshold). The difference in the current implementation is that vertical (not orthogonal) distance of a datapoint from a line connecting two strategic points is used as a criteria for new strategic points. The advantage of this approach is that it is not necessary to normalize either the signal intensity or time dimension. The threshold may be therefore specified in the units of the signal intensity scale. Using the vertical distance, however, may lead to too many datapoints fulfilling the criteria of a strategic point (having the distance from the line connecting two adjacent strategic points above the threshold). This is partially overcome by setting the minimal distance between the strategic points. In other words, a new strategic point cannot be selected in the specified vicinity of an existing strategic point. A good setting for this parameter is the 1/4 of expected peak width at its base, so that the restricted areas around the peak-top strategic point and peak base strategic points partially or almost overlap. The peaks are then selected from the list of strategic points according to criteria, that peak-top strategic point must have a higher signal intensity than the two neighboring strategic points. This approach may sometimes lead to artifacts - e.g. selecting a high point in the baseline as a peak, however, these do not cause harm to the purpose thanks to the robustness of the alignment algorithm as discussed above (and in the next subsection). Applying additional quality control on the selected peaks was therefore not implemented to save computational time. The peaks' retention/migration times are then calculated as a centroid in the vicinity of the peak-top strategic points. Smoothing the data before peak picking could prove beneficial in some cases.
The function DAMPFindNodesFast selects strategic points from a chromatogram/electropherogram. The function returns a list of strategic points' retention times and signal intensities. If a new strategic point can not be found between two other strategic points (the restricted areas of the two overlap) or the distance is not above threshold, {{0,0}} is returned. This is to comply with the restrictions set on compiled functions in Mathematica. The {0,0} values are then filtered out in the DAMPPickChromatogramPeaks function.
The DAMPPickChromatogramPeaks function takes the minimal distance between two strategic points (in minutes) via the NodeRestrictedVicinity option and converts it to approximate number of datapoints for the DAMPFindNodesFast. The conversion is done to allow faster processing.
DAMPGetPeakCentroid calculates the peak centroid from datapoints around the specified retention time. The selection of datapoints for calculation is determined by the option CentroidVicinity (in minutes, datapoints within this distance in both directions are selected). Note that the retention/migration time values coming from the compiled function DAMPFindNodesFast may differ from the values in the original chromatogram, since the compiled function converts the retention time values, which may be rational, to real. Therefore, it is necessary to use the Select function in DAMPGetCentroid instead of trying to find the position of the retention time value (using the Position function) and selecting a certain number of datapoints in the neighborhood.
DAMPPickPeaks[msdata,options] picks peaks from all chromatograms/electropherograms in msdata and returns them in a list {{mz,peaklist},{mz,peaklist},...}. The function does not have any default options, all options passed to DAMPPickPeaks are passed further to DAMPPickChromatogramPeaks.

DAMPFindNodesFast = Compile[{{chrom, _Real, 2}, {thresh, _Real}, {nrv, _Integer}}, Module[{c ... 62371; {{0, 0}} ] ], {{DAMPFindNodesFast[_, _, _], _Real, 2}} ] 

Options[DAMPGetPeakCentroid] = {Global`CentroidVicinity.05} 

DAMPGetPeakCentroid[chrom_, rt_, opts___] := Module[{vicinity, selpoints}, vicinity  ... 5; #〚2〛&/@selpoints)/Plus @@ selpoints〚All, 2〛] 

Options[DAMPPickChromatogramPeaks] = {Global`Threshold1000, Global`NodeRestrictedVicinity.05, Global`CentroidVicinity.05} 

DAMPPickChromatogramPeaks[chrom_, opts___] := Module[{threshold, nrv, centvic, nodes}, ɯ ... 2315;, #〚3, 2〛}]  {1, 1} &] 〚All, 2〛] ] 

DAMPPickPeaks[msdata_, opts___] := Module[{},  {#, DAMPPickChromatogramPeaks[DAMPGetChromatogram[msdata, #], opts]} &/@msdata〚2〛]

The peaklists may be used to annotate the density plots. The function DAMPPeakListToAnnotationTable converts the peaklist in a format returned by DAMPPickPeaks to an annotation table format. The names (both short and long) are assigned to peaks from a sequence of integers (starting with 1).

Options[DAMPPeakListToAnnotationTable] = {Global`LabelTextPosition1.5} 

DAMPPeakListToAnnotationTable[peaks_, opts___] := Module[{ltpos, mz, tmplist, atlen}, ᡝ ... ose[tmplist], {#, #} &[ToString[#] &/@Range[atlen]], {Table[ltpos, {atlen}]}]] ]

A stack of peaks with different m/z values may be present at certain retention/migration times. This may be due to post-source fragmentation of a single compound, isotopic peaks, or due to a group of multiple compounds reaching the detector at the same time. If there is an overwhelming number of stacked peaks, these may have significant influence on the dataset alignment (implemented in the next subsection) and bias the alignment. This is undesirable, especially in the case of CE analysis, where there are numerous peaks stacked at migration time corresponding the position of EOF peak. These peaks often do not seem to follow the migration time shift trend of the remaining compounds. Their influence may be removed by (manually) specifying a time range, which does not contain the stacked peaks, for fitting the migration time shifts. Another possibility to limit their influence (and to avoid manual specification of the time range) is to automatically select a set of representative peaks: a certain number of highest peaks from every chromatogram/electropherogram and a certain number of highest peaks from every time interval of specified size. This may provide peaks more evenly distributed on the retention/migration time vs m/z plane and eliminate most bias-causing stacked peaks. Given the robustness of the alignment method, a minority of bias-causing peaks should not hijack the alignment. DAMPSelectRepresentativePeaks can also be used to select peak picks from a certain time interval only.

Options[DAMPSelectRepresentativePeaks] = {Global`PeaksPerChromatogramAll, Global`PeaksPerIntervalAll, Global`IntervalSize1, Global`TimeRangeAll} 

DAMPSelectRepresentativePeaks[peaklist_, opts___] := Module[{perchrom, perint, intsize, time ... 314;All, 2〛}) &/@newlist〚All, 1〛 ; ] ; newlist]

The layouts of peaks from different datasets picked by the DAMPPickPeaks function may be plotted both prior to and after the alignment for visual inspection.
Graphics::realu was turned off since it started appearing for higher number of peaklists (probably PlotLegend related). The result does not seem to be affected when the error is present.

Options[DAMPPlotPeakLayout] = {Global`mzGridLineFreq1, Global`mzTickFreq1, G ... otSymbol[Diamond, 3, FilledFalse], PlotSymbol[Box, 3, FilledFalse]}}} ; 

DAMPPlotPeakLayout[peaklists_, opts___] := Module[{mzgridfreq, mztickfreq, leglabels, plotop ... ndPosition {1, -.4}] ; On[Graphics :: "realu"] ; rslt]