Institute for Advanced Biosciences Keio University
MathDAMP Mathematica package for differential analysis of metabolite profiles
Home Overview Examples Downloads TriDAMP References Contact
MathDAMP > Examples > MathDAMP source

MathDAMP
Mathematica package for Differential Analysis of Metabolite Profiles
version 1.0.0

MathDAMP package facilitates the differential analysis of metabolite profiles. A visual approach is employed: (i) Individual datasets' retention/migration times and signal intensities are normalized. (ii) Arithmetic operations are applied to to all corresponding datapoints in the datasets to highlight differences or patterns of interest. (iii) The results are displayed on annotated density plots. Additionally, sets of overlaid chromatograms/electropherograms in the vicinities of the most significant differences from selected results are displayed in descending order of significance for visual confirmation.
The text notes throughout the code in this notebook are intended to provide brief comments on the implementation of particular functions. Actual reference documentation for each function is provided via the usage messages defined in the Usage Description Section. This information can be displayed by executing ?FunctionName (where FunctionName stands for the name of the function of interest) in any notebook after loading the MathDAMP package.
The package provides basic building blocks for performing differential analysis along with assemblies of core functions to perform common tasks. Specific examples about using the package for different types of differential analysis (comparing two datasets, detecting outliers within a group of datasets, comparing two groups of replicate datasets, comparing multiple groups of replicate datasets) along with additional information can be found at http://mathdamp.iab.keio.ac.jp/

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

BeginPackage["MathDAMP`"]

Needs["Graphics`MultipleListPlot`"]

Needs["Statistics`NonlinearFit`"]

Needs["Statistics`DescriptiveStatistics`"]

Usage Description Section

This section contains a list of explanations for MathDAMP's public functions. This information is shown upon executing the ?FunctionName in any notebook after loading the MathDAMP package.

DAMPStrInternalFormat = "{matrix of signal intensities, list of m/z values, list of timepoints, additional information (list of rules)}"

DAMPStrResolution = "specifies the resolution to which the data will be binned along the m/z axis"

DAMPImportMS :: usage = DAMPImportMS[filename] reads and processes an Agilent ChemStation MS ... nd scan mode datafiles. The data are binned to 1 m/z unit resolution by averaging."

DAMPImportCSV :: usage = DAMPImportCSV[filename,samplename,options] reads and processes a cs ... h current time every 100 processed chromatograms/electropherograms (default: True)"

DAMPImportMZXML :: usage = DAMPImportMZXML[filename,samplename,options] reads and processes  ... d for scan data, Selective mode is recommended for SIM data. (default: Sequential)"

DAMPImportCDF :: usage = DAMPImportCDF[filename,options] reads and processes a CDF file spec ... nOptions:\nResolution - "<>DAMPStrResolution<>" (default: 1)"

DAMPImportBDT :: usage = DAMPImportBDT[filename,samplename] reads and processes a binary BDT ... bout the sample name, this has to be specified as a second parameter (samplename)."

DAMPExportBDT :: usage = DAMPExportBDT[filename,msdata] exports data in a MathDAMP format &l ...  m/z values which can be converted to the range representation should be exported."

DAMPAppendToSampleName :: usage = "DAMPAppendToSampleName[msdata4,string] appends strin ... om previous modifiers by two vertical bars with a space in between (default: True)"

DAMPLoadAnnotationTable :: usage = "DAMPLoadAnnotationTable[filename] loads a list of a ... ve to the label (1 - right, 2 - top, 3 - left, 4 - bottom, 1.5 - top right, etc.)."

DAMPGetIntensities :: usage = "DAMPGetIntensities[msdata,mz] returns a list of signal intensities from msdata corresponding to m/z mz."

DAMPGetChromatogram :: usage = "DAMPGetChromatogram[msdata,mz] returns a chromatogram/e ...  (a list of {time,signal intensity} elements) from msdata corresponding to m/z mz."

DAMPCrop :: usage = "DAMPCrop[msdata,options] reduces the msdata dataset to datapoints  ... )\nTimeRange - two element list specifying the cropping time range (default: All)."

DAMPSelectMZs :: usage = "DAMPSelectMZs[msdata,mzs] reduces the msdata datased to datapoints corresponding to m/z values specified in the mzs list."

DAMPDropMZs :: usage = "DAMPDropMZs[msdata,mzs] reduces the msdata datased by eliminating datapoints corresponding to m/z values specified in the mzs list."<br />

DAMPPlotChromatogram :: usage = "DAMPPlotChromatogram[{msdata,msdata,...},mz,options] p ... specified, the SampleName from each msdata is used as a label (default: Automatic)"

DAMPGradient :: usage = "DAMPGradient[pos1,pos2,color1,color2,backgroundgraylevel] retu ... s1, and color2 at pos2, and finally to darkened color2 at the relative position 1."

DAMPGradientPalette :: usage = "DAMPGradientPalette[options] - generates a color palett ... olor gradients, palette entry for value 0 is shared between the two (default: 100)"

DAMPDrawAnnotation :: usage = "DAMPDrawAnnotation[annotationtable,options] draws the an ... ropriately capture the dimensions of peaks on the ListDensityPlot (default: {1,1})"

DAMPDensityPlot :: usage = "DAMPDensityPlot[msdata,options] plots the msdata using the  ... ons must correspond to the number of passed annotation tables (default: Automatic)"

DAMPParallelPlot :: usage = "DAMPParallelPlot[msdatas,options] plots msdatas (a list of ... ly to the DAMPDensityPlot function which is used internally for plotting the data."

DAMPApplyFunctionToSingle :: usage = "DAMPApplyFunctionToSingle[msdata,function,options ...  to keep track of modifications performed on the dataset (default: \"\")"

DAMPBinChromatogram :: usage = "DAMPBinChromatogram[chromatogram,binsize,options] bins  ...  be applied to signal intensities and timepoints in every interval (default: Mean)"

DAMPBinChromatogramFast :: usage = "DAMPBinChromatogramFast[chromatogram,binsize] bins  ... function DAMPBinChromatogram for a more flexible (and slower) way to bin the data."

DAMPRobustPolynomialFit :: usage = "DAMPRobustPolynomialFit[chromatogram,options] fits  ...  the data for polynomial fitting (default: (DAMPBinChromatogramFast[#,0.25]&))"

DAMPSubtractBaselines :: usage = "DAMPSubtractBaselines[msdata,options] subtracts basel ... ep the track of modifications performed on the dataset (default: \"bs\")"

DAMPRemoveNoise :: usage = "DAMPRemoveNoise[msdata,options] removes noise from every ch ... ep the track of modifications performed on the dataset (default: \"nr\")"

DAMPRemoveSpikes :: usage = "DAMPRemoveSpikes[msdata,options] levels to 0 all signal in ... ep the track of modifications performed on the dataset (default: \"sr\")"

DAMPThreshold :: usage = "DAMPThreshold[msdata,threshold,options] levels to 0 all signa ... to keep track of modifications performed on the dataset (default: \"t\")"

DAMPMovingAverageFast :: usage = "DAMPMovingAverageFast[intensities,windowsize] applies ... ed for the calculation of the average at the beginning or at the end for the list."

DAMPSmooth :: usage = "DAMPSmooth[msdata,options] applies a smoothing function to all c ... to keep track of modifications performed on the dataset (default: \"s\")"

DAMPPickChromatogramPeaks :: usage = "DAMPPickChromatogramPeaks[chromatogram,options] p ... c point for the calculation of centroided retention/migration time (default: 0.05)"

DAMPPickPeaks :: usage = "DAMPPickPeaks[msdata,options] picks peaks from all chromatogr ...  is used internally to pick peaks from individual chromatograms/electropherograms."

DAMPPeakListToAnnotationTable :: usage = "DAMPPeakListToAnnotationTable[peaklist,option ... he position of the text label relative to the annotation label mark (default: 1.5)"

DAMPSelectRepresentativePeaks :: usage = "DAMPSelectRepresentativePeaks[peaklist,option ... eRange - select peaks from this retention/migration time range only (default: All)"

DAMPPlotPeakLayout :: usage = "DAMPPlotPeakLayout[peaklists,options] plots the position ... PlotSymbol[Diamond,3,FilledFalse],PlotSymbol[Box,3,FilledFalse]}})"

DAMPDPScore :: usage = "DAMPDPScore[peaklist1,peaklist2,options] calculates the dynamic ... s function.\nOptions:\nGapPenalty - gap penalty (in minutes) for DP (default: 0.5)"

DAMPFitShiftFunction :: usage = "DAMPFitShiftFunction[peaklist1,peaklist2,options] opti ... n time range of peaks from peaklist1 to be used for scoring (default: {0,∞})"

DAMPAlignPeakList :: usage = "DAMPAlignPeakList[peaklist,timeshiftfunction] applies the ... n the peaklist (expected to have format as returned by the function DAMPPickPeaks)"

DAMPAnnotationTableToPeakList :: usage = "DAMPAnnotationTableToPeakList[annotationtable ... ich resolution the m/z values in the annotation table will be rounded (default: 1)"

DAMPAlignAnnotationTable :: usage = "DAMPAlignAnnotationTable[peaklist,annotationtable, ... .1 m/z resolution, the resolution option should be set to 0.1 as well (default: 1)"

DAMPAlign :: usage = "DAMPAlign[msdata,shiftfunction,timepoints,options] aligns msdata  ... to keep track of modifications performed on the dataset (default: \"a\")"

DAMPNormalize :: usage = "DAMPNormalize[msdata,coefficient,options] multiplies the sign ... eep the track of modifications performed on the dataset (default: \"n\")"

DAMPIntegrate :: usage = "DAMPIntegrate[chromatogram,options] calculates the area below ... ange (default: None). If set to None, baseline is set to signal intensity value 0."

DAMPFilter :: usage = "DAMPFilter[msdata,criteriamsdata,threshold,options] sets to 0 th ... eep the track of modifications performed on the dataset (default: \"f\")"

DAMPTrendFilter :: usage = "DAMPTrendFilter[group1msdatas,group2msdatas,tofiltermsdata, ... ep the track of modifications performed on the dataset (default: \"tf\")"

DAMPApplyFunctionToGroup :: usage = "DAMPApplyFunctionToGroup[msdatas,function,options] ...  the track of the modifications performed on the dataset (default: \"\")"

DAMPCheckDir :: usage = "DAMPCheckDir[directory] checks if directory exists, creates it if it does not. The parent directory must exist, error message is shown otherwise."

DAMPGenColors :: usage = "DAMPGenColors[number] generates a list of color specification ...  hue range is split proportionally and the colors are assigned from this sequence."

DAMPNormalizeGroup :: usage = "DAMPNormalizeGroup[msdatas,options] aligns msdatas (a li ... ities are rounded to integers in internal calculations and results (default: True)"

DAMPTwoDatasets :: usage = "DAMPTwoDatasets[msdata1,msdata2,options] generates datasets ...  from msdata1 and msdata2 are equal to or greater than this threshold (default: 0)"

DAMPOutliers :: usage = "DAMPOutliers[msdatas,options] highlights the presence of outly ... ed for every dataset in msdatas in addition to the overall result (default: False)"

DAMPTwoGroups :: usage = "DAMPTwoGroups[msdatas1,msdatas2,options] generates datasets r ...  to filter out results originating from individual spikes or outliers (default: 2)"

DAMPMultiGroups :: usage = "DAMPMultiGroups[msdatas,replicates,options] performs the co ...  the SampleName of the first dataset from every group is used (default: Automatic)"

DAMPPlotCandidates :: usage = "DAMPPlotCandidates[msdatas,criteriamsdata,options] plots ... which is used internally to plot the chromatograms/electropherograms (default: {})"

Clear[DAMPStrInternalFormat]

Clear[DAMPStrResolution]

Public Variables

DAMPTextStyle contains the default text formatting specifiers for plots. Since different font sizes seem to be suitable for printing and on screen inspection, the font settings are central (and have Global` context) and can be easily overridden in particular notebooks. Redefining the DAMPTextStyle in the analysis notebook makes all subsequently generated plots use the settings. Text settings can be redefined on an individual basis for each plot via the TextStyle option in the PlotOptions option for the particular plot.
DAMPCETOFMSDensityPlotOptions, DAMPCETOFMSPeakLayoutOptions, DAMPCEQMSDensityPlotOptions, and DAMPCEQMSPeakLayoutOptions are lists of options for optimal display of data from different sources. Since the data dimensions of CE-TOFMS and CE-QMS datasets differ significantly, different settings for density plots and peak layout plots are desirable (tickmark/gridline frequency, plot size, etc.) These lists can then be specified as a whole or parts overridden by joining a list with overriding/additional options with the list represented by any of these global variables. The lists are empty for the CE-QMS data since the options for underlying function (DAMPPlotDensityPlot, DAMPPlotPeakLayout) are set by default to the optimum for CE-QMS data.

Global`DAMPTextStyle = {FontFamily"Helvetica", FontSize8} ;

Global`DAMPCETOFMSDensityPlotOptions = {Global`MaxScale20000, Global`FrameTickFreqs& ... ineFreq10, Global`PlotOptions {ImageSize940, AspectRatio1.43}} ;

Global`DAMPCETOFMSPeakLayoutOptions = {Global`mzGridLineFreq50, Global`mzTickFreq50} ;

Global`DAMPCEQMSDensityPlotOptions = {} ;

Global`DAMPCEQMSPeakLayoutOptions = {} ;

Begin Private Section

Begin["`Private`"]

Data Import/Export

Imported datasets are processed into a nested list (further referred to as msdata). The first element is a matrix containing ion signal intensities, rows corresponding to individual chromatograms/electropherograms. The second element is a list of m/z values and the third element is a list of scan timepoints (in minutes). The dimensions of the signal intensity matrix are determined by the length of the lists containing the m/z values and timepoints. The fourth element is a list of rules containing additional information about the dataset. Currently, it is used to store the sample name only.

Sample dataset with 2 m/z values and 6 timepoints in MathDAMP internal representation: {{{0,0,1,3,1,0},{0,0,0,0,0,0}},{100.,150.},{1,2,3,4,5,6},{SampleName→"Dummy Profile"}}

The datasets may be imported in two modes: sequential and selective. The modes determine the way in which the m/z values are generated (along with the matrix of signal intensities). In the sequential mode, the list of m/z values will contain an ordered list of m/z values in which the difference between two neighboring m/z values (step size) will be fixed. The values of the first and the last element in the m/z list are determined by the minimum and maximum m/z value in the imported dataset (rounded according to specified resolution as described in the next paragraph). In the selective mode, the list of m/z values will contain only those m/z values, which are present in the imported dataset (and rounded according to the selected resolution as described in the next paragraph). Selective mode import is recommended for datasets acquired by selected ion monitoring (SIM) for instance. Selective mode works for scan data also, using sequential mode leads to faster data import. Additionally, using selective mode for scan data may lead to an omission of some chromatograms (leading to a non-sequential m/z scale) if no signal intensities are present within a particular m/z interval (as determined by resolution).

The raw data is binned along the m/z axis during import. The binning interval is determined by the option Resolution. All signal intensities falling within the binning interval at a particular timepoint are averaged. For example, if resolution is set to 1, the signal intensities at a particular timepoint falling within an interval <mz-0.5,mz+0.5) will be averaged (mz represents a whole number in this case). For resolution 0.5, the binning interval would be <mz-0.25,mz+0.25) (with mz representing a whole number multiple of 0.5), etc...

Scan binning

The functions DAMPBinScanSequential and DAMPBinScanSelective facilitate the binning along the m/z dimension as described above. The functions are used internally by functions importing datasets in different data formats.

DAMPBinScanSequential = Compile[{{rec, _Real, 2}, {dims, _Real, 1}}, Module[{tbl, cnts, tmp2 ... 315;]]) &/@rec ; cnts = If[# == 0., 1., #] &/@cnts ; tbl/cnts]] 

DAMPBinScanSelective = Compile[{{data, _Real, 2}, {allmzs, _Real, 1}}, Module[{tbl, cnttbl,  ... ;/@data ; cnttbl = If[#0., 1., #] &/@cnttbl ; tbl/cnttbl]] ;

Agilent ChemStation MS format

The DAMPImportMS function imports the raw data from an Agilent ChemStation *.MS file and processes it into the MathDAMP internal format. Although the description of the file format is not available (?) and the format was reverse-engineered, the imported data appear to perfectly match the data in a csv file exported using the ChemStation software.
Part of the code from  the DAMPImportMS function was moved to a compiled function DAMPProcessDataMS[mzintensitypairs,scanpositions] to speed up the data processing.
The signal intensities are binned to 1 m/z unit resolution upon loading.

DAMPProcessDataMS = Compile[{{data, _Integer, 1}, {positions, _Integer, 2}}, Module[{procdat ... amp;/@cnttbl ; tbl/cnttbl) &/@procdata ; Join[{allmzs}, findata] ]] 

DAMPImportMS[filename_] := Module[{data, msnr, pos, positions, trheaders, timepoints, findat ... op[Flatten[Transpose[#[mainhead, 2^8] &/@{Quotient, Mod}]], 1], #≠0&]]}} ]

Analyst QS CSV format

DAMPImportCSV function imports data from a csv file generated by Analyst QS software (for Agilent TOFMS) and processes it into MathDAMP internal format.  Since the csv file does not contain information about the sample name, the name has to be specified via the second parameter samplename. The option Resolution determines the binning of the data along the m/z dimension.

Options[DAMPImportCSV] = {Global`Resolution1, VerboseTrue} ; 

DAMPImportCSV[filename_, samplename_, opts___] := Module[{verbs, dims, allmzs, timepoints, s ...  {Transpose[findata], allmzs, timepoints, {Global`SampleName->samplename}} )]

mzXML format

DAMPImportMZXML imports the first level MS scans from an mzXML datafile. The import may be performed in either sequential or selective mode (see beginning of this section for explanation).
The function DAMP6bitTo8bit facilitates the conversion of base64 encoded m/z and signal intensity pairs to binary data.

base64enctbl = #〚1〛#〚2〛&/@Transpose[{Join[CharacterR ... ["0", "9"], {"+", "/", "="}], Range[0, 64]}] ;

DAMP6bitTo8bit = Compile[{{data, _Integer, 1}}, Module[{todrop}, todrop = Length[Cas ... 12314;3〛, 2^2] 2^6 + #〚4〛} &/@Partition[data, 4]], -todrop]] ]]

Options[DAMPImportMZXML] = {Global`Resolution1., Global`ImportModeSequential} ; <br />

DAMPImportMZXML[filename_, samplename_, opts___] := Module[{resol, md, scans, prec, todrop,  ... uot;}]] &/@scans〚All, 1〛)/60, {Global`SampleNamesamplename}} ]

CDF format

DAMPImportCDF[filename,options] function imports NetCDF data and processes it into MathDAMP internal format. The function was not tested thoroughly with NetCDF format files from different sources and may therefore not be generic. The option Resolution specifies the data binning along the m/z axis.

CDFnctypesizes = {"Byte"1, "Character8"1, "Integer16"2, "Real64"8, _4} ;

CDFReadPadded[st_, nctp_, len_] := Module[{rslt}, rslt = BinaryRead[st, Table[nctp,  ... [#4, 0, #] &[4 - Mod[len (nctp/.CDFnctypesizes), 4]]] ; rslt] <br />

CDFReadString[st_] := Module[{strlen}, strlen = BinaryRead[st, "UnsignedInteger ... ;1] ; StringJoin @@ CDFReadPadded[st, "Character8", strlen] ] 

CDFReadDimArrayElem[st_] := Module[{},  {CDFReadString[st], BinaryRead[st, "UnsignedInteger32", ByteOrdering1]} ] <br />

CDFnctypes = {1"Byte", 2"Character8", 3"Int ... 2754;"Integer32", 5"Real32", 6"Real64"} ; 

CDFReadAttArrayElem[st_] := Module[{nm, nctype, nelems}, nm = CDFReadString[st] ; &# ... Ordering1] ;  {nm, nctype, CDFReadPadded[st, nctype, nelems]} ] 

CDFReadAttArray[st_] := Module[{attarraymagic, attarraynelems}, attarraymagic = Bina ... 62754;1] ; Table[CDFReadAttArrayElem[st], {attarraynelems}] ] ] 

CDFReadVarArrayElem[st_] := Module[{nm, nelems, dimids, vattarray}, nm = CDFReadStri ... ering1], BinaryRead[st, "UnsignedInteger32", ByteOrdering1]} ]

Options[DAMPImportCDF] = {Global`Resolution1.} 

DAMPImportCDF[filename_, opts___] := Module[{resol, strm, magic, versionbyte, numrecs, dimar ... uot;&] 〚1, 3〛, -1]}} ; ] ; Close[strm] ; rslt]

BDT format

DAMPImportBDT[filename,samplename] function imports data from a BDT file and processes it into MathDAMP internal format. The BDT is a binary datafile used internally by Institute for Advanced Biosciences (Keio University). Separate in house software for preprocessing of the Analyst QS generated csv files stores the data in the BDT format. The organization of data in the BDT format is similar to the internal MathDAMP organization what allows a fast data import into Mathematica. Since the csv file does not contain the information about the sample name, this has to be specified via the second parameter samplename.

DAMPImportBDT[filename_, samplename_] := Module[{tmp}, tmp = BinaryReadList[filename ... ]], Take[tmp, {5, 4 + tmp〚4〛}], {Global`SampleNamesamplename}} ] ;

DAMPExportBDT[filename,msdata] function exports data in MathDAMP internal format into a BDT file format specified by filename.  Since the m/z values in the data are not saved one by one but as a range specified by min, max and step, only data having a regular list of m/z values which can be converted to the range representation should be exported.

DAMPExportBDT[filename_, msdata_] := Module[{strm}, strm = OpenWrite[filename, Binar ... 12314;1〛], "Real64", ByteOrdering -1] ; Close[strm] ; ]

Core Functionality

Modifying the SampleName in msdata

When msdata is processed by some functions (baseline subtraction, smoothing, etc) it may be desirable to keep track of modifications performed on the dataset by appending a short string to the SampleName of the original dataset. The functions use the private function DAMPAppendToSampleName for this purpose. The function adds vertical bars to separate multiple modification identifiers. The vertical bars are shared between multiple modifiers unless the ShareBars option is set to False.

DAMPAppendToSampleName[msdata4_, str_, opts___] := Module[{sname}, sname = Global`Sa ... obal`ShareBars/.{opts}/.{Global`ShareBarsTrue}, "", " |"]] ]

Annotation table manipulation

The DAMPLoadAnnotationTable function loads a compound annotation table from a csv file and selects records having relevant retention/migration times and m/z values. The organization of the records is assumed to be as follows: 1) m/z value, 2) retention/migration time, 3) short compound name/id (density plot label text), 4) full compound name (chromatogram/electropherogam annotation text), and 5) label's text position on the density plots relative to the label (1 - right, 2 - top, 3 - left, 4 - bottom, 1.5 - top right, etc). The text label can be placed anywhere around the label on the density plot by using real numbers like 1.5 (position at 45 degrees from the right position).

DAMPLoadAnnotationTable[filename_] := Select[Import[filename], (NumericQ[#〚1〛] ... #〚2〛] &&#〚1〛>0&&#〚2〛>0) &] ;

Extracting chromatograms/electropherograms from data

DAMPGetIntensities[msdata,mz] returns a list of signal intensities from msdata corresponding to m/z mz.
DAMPGetChromatogram[msdata,mz] returns a chromatogram/electropherogram (list of {time,signal intensity}) from msdata corresponding to m/z mz. Since the m/z values in msdata are of type Real, the mz parameter is multiplied by 1. to convert it to a Real type value (for cases when it is passed as an Integer).
The _?(Chop[1. mz-#]==0&) construct is the argument of the Position function to match values slightly different due to roundoff or truncation errors.

DAMPGetIntensities[msdata_, mz_] := msdata〚1, Position[msdata〚2〛, _ ? (Chop[1. mz - #] 0&)] 〚1, 1〛〛

DAMPGetChromatogram[msdata_, mz_] := Transpose[{msdata〚3〛, DAMPGetIntensities[msdata, mz]}]

Dataset cropping and particular m/z(s) dropping/selection

DAMPCrop[msdata,options] selects datapoints from msdata falling within the m/z and time ranges specified by options mzRange and TimeRange.
DAMPSelectMZs[msdata,mzs] reduces the msdata to those datapoints only, which correspond to m/z values specified in the  mzs list.
DAMPDropMZs[msdata,mzs] drops datapoints from msdata corresponding to m/z values specified in the  mzs list.
The mzs parameters are multiplied by 1. to make sure they are of Real  type. This is necesarry for looking up the m/z values in the datasets' lists of m/z values which are of type Real.

Options[DAMPCrop] = {Global`TimeRangeAll, Global`mzRangeAll} 

DAMPCrop[msdata_, opts___] := Module[{timerange, mzrange, selmzposs, seltpposs, newints, new ... ltpposs〛]] ;  {newints, newmzs, newtimepoints, msdata〚4〛} ]

DAMPSelectMZs[msdata_, mzs_] := Module[{newmzs, selmzposs}, newmzs = Intersection[ms ... )]) &/@newmzs) 〛, newmzs, msdata〚3〛, msdata〚4〛} ]

DAMPDropMZs[msdata_, mzs_] := Module[{newmzs, selmzposs}, newmzs = Complement[msdata ... )]) &/@newmzs) 〛, newmzs, msdata〚3〛, msdata〚4〛} ]

Plotting chromatograms/electropherograms

DAMPPlotChromatogram[msdatas,mz,options] plots overlaid chromatograms/electropherograms from msdatas (a list of msdata datasets) corresponding to m/z mz. mz may be a list of m/z values as well. The chromatograms/electropherograms may be annotated. Annotation table is passed via the AnnotationTable option. The Resolution option specifies to which resolution the data were binned along the m/z dimension. The m/z values in the annotation table are then rounded accordingly to ensure the appropriate annotation labels are selected for annotating the chromatogram/electropherogram.
The legend is drawn explicitly. For an unknown reason, this is considerably faster than Mathematica's PlotLegend function. Due to this, significant amount of time can be saved when plotting numerous chromatograms/electropherograms (using the DAMPPlotCandidates function for example).
Legend elements can be specified either explicitly (assigning a list in the form {{color,label},...} to the LegendData option) or created automatically (LegendData->Automatic). In the latter case, SampleName from each msdata is used as a legend label.
MultipleListPlot is used for plotting the chromatograms/electropherograms. Options for the MultipleListPlot function can be passed via the PlotOptions option. These options then override the options determined by options passed to the DAMPPlotChromatogram function. This approach of using the PlotOptions option to pass options to the function used for plotting the data is used throughout the MathDAMP package. It can also be found in functions like DAMPDensityPlot and DAMPPlotPeakLayout.

Options[DAMPPlotChromatogram] = {Global`Resolution1, Global`AnnotationTable  ... 54;670, SymbolShapeNone, AspectRatio.23, TextStyleGlobal`DAMPTextStyle}}

DAMPPlotChromatogram[msdatas_/;Depth[N[msdatas]] 5, mz_, opts___] := Module[{annotta ...  AspectRatioar, ImageSizeis, DisplayFunctiondispfunc] ] 

DAMPPlotChromatogram[msdatas_/;Depth[N[msdatas]] 4, mz_, opts___] := DAMPPlotChromatogram[{msdatas}, mz, opts]

Plotting Density Plots

The functions in this section facilitate the visualization of msdata on density plots. The axes represent the retention/migration time and m/z values. The peaks appear as colored spots. The color is assigned to a signal intensity value from a color gradient. The color is not calculated for each intensity during the generation of the plot but taken from a pre-generated palette instead. In other words, the pure function passed via the ColorFunction option of the ListDensityPlot function does not compute the color value but picks it from a pre-generated palette. This speeds up the process considerably.
DAMPGradient
returns a function which calculates the color parameters for the Hue directive corresponding to a relative position within the color gradient (the returned function takes this position as the only parameter <0;1>) . The gradient consists of three subgradients: backgroundgraylevel (at relative position 0) to color1 (at pos1), color1 (at pos1) to color2 (at pos2), and color2 (at pos2) to darkened color2 (at relative position 1). The pos1 and pos2 parameters represent the relative positions of the endpoints of the middle gradient (0;1); color1 and color2 parameters represent the hue specification <0;1> of the endpoints of the middle gradient; and the backgroundgraylevel parameter specifies the color (graylevel) of the background area <0;1> (black to white scale). The returned function is probably compiled for historical reasons only. Faster possibilities for the plot generation were investigated by increasing the speed of the function which was used as a color function in the ListDensityPlot function. The speed was not satisfactory, so a switch to pre-generating the color palette was made so that now the colorfunction just picks the appropriate color from the palette and does not calculate the color every time. The DAMPGradient still returns a compiled function although it does not make a significant difference in the current context.
DAMPGradientPalette uses the DAMPGradient function to generate a palette of two adjacent gradients (intended for positive and negative range). The palette will be used by the colorfunction of the ListDensityPlot to assign appropriate colors to signal intensities and to draw a legend. The option ColorPositions determines the relative position of gradient endpoints and is used for both positive and negative ranges (default: {.075,.4}). PositiveColors and NegativeColors hold the hue color specifications for the endpoints of the middle gradients in both positive and negative ranges (defaults: {1/6,0} - yellow to red; {1/2,2/3} - cyan to blue). BackgroundGrayLevel specifies the graylevel of the background area (default: 0.7 - 30% gray). PaletteSize specifies the number of palette entries (subdivisions) for both positive and negative range color gradients, entry for value 0 is shared between the two ranges.

DAMPGradient[pos1_, pos2_, color1_, color2_, backgroundgraylevel_] := Compile[{col}, {If[col ...  + ((1 - backgroundgraylevel) col)/pos1, If[col<pos2, 1, 0.2 + (0.8 (1 - col))/(1 - pos2)]]}]

Options[DAMPGradientPalette] = {Global`ColorPositions {0.075, 0.4}, Global`PositiveC ... eColors {1/2, 2/3}, Global`BackgroundGrayLevel.7, Global`PaletteSize200}

DAMPGradientPalette[opts___] := Module[{bgl, palsize, colposs, negfunc, posfunc}, bg ... erse[negfunc[#] &/@Drop[#, 1]], posfunc[#] &/@#] &[Range[0, 1, 1/palsize]] ]

DAMPDrawAnnotation[atdata,options] is used by the DAMPDensityPlot function to annotate the density plots of datasets and allow easier identification of peaks. The annotation table is passed via the atdata parameter. For the annotation table format, please refer to the description preceding the DAMPLoadAnnotationTable in the Annotation Table Manipulation subsection of the Core Functionality section. The TimeRange and mzs options determine the selection of annotation table items and their proper positioning on the density plots. The Resolution option specifies the m/z dimension coverage of each electropherogram on the density plot. To ensure the proper selection and display of the annotation labels, the m/z values in the annotation table are rounded to the resolution. Usually, the data are binned to 1 m/z resolution, so the default value for the Resolution option is set to 1. The option ScaleCoefficients ensures the correct positioning of the labels and their size reproducibility. The ScaleCoefficients are used for the scale conversion to the scale of the ListDensityPlot where the time scale is numbered sequentially for every datapoint. The positioning along the m/z dimension is done according to the respective m/z value's position within the mzs list, so the second element in the ScaleCoefficients is not relevant at the moment and is set to 1 by default. The scale conversion value for the time dimension is calculated in the DAMPDensityPlot function according to the msdata's timepoints. The option TextStyle determines the appearance of the annotation label's text, it is set to DAMPTextStyle by default, the FontSize is overriden to a smaller size (4) to fit better into the crowded density plots. The LabelStyle and LabelSize options determine the label mark's appearance and size. The size units are relative, {1,1} by default. The LabelShape option specifies a pure function for drawing the annotation label symbol. Two parameters are passed to the function, the first is the position on the ListDensityPlot's coordinate system {retention/migration time,m/z value} and the second the size of the size {xsize,ysize} in the same units. The scale conversion is set so, that the default relative size of the annotation label (LabelSize->{1,1}) draws an ellipse which appears to properly mark the peaks on the density plot. The {xsize,ysize} correspond to the ellipse's radii in this case.

Options[DAMPDrawAnnotation] = {Global`TimeRange {0, 50}, Global`MZsRange[50. ... ness[.25]}, Global`LabelShape (Circle[#1, #2] &), Global`LabelSize {1, 1}} ;

DAMPDrawAnnotation[atdata_, opts___] := Module[{at, timerange, mzrange, resol, scalecoefs, l ... le, {labelshape[position, {.2, .8} labelsize scalecoefs]}]}) &/@annottoshow ; tmpgr]

DAMPDensityPlot plots msdata using the ListDensityPlot function. The axes are intended to represent the migration/retention time and m/z. Signal intensities are encoded by colors. The color palette is generated using the DAMPGradientPalette function. The ColorFunction of the ListDensity plot then picks the closest palette entry for a given signal intensity. This implementation improves the speed of the plot generation (compared to calculating the color for every signal intensity by the color function). The plot legend is drawn explicitly and placed below the plot. Options for the ListDensityPlot may be passed via the option PlotOptions. The option MaxScale determines the size of the signal intensity scale. It is Automatic by default meaning the signal intensity scale is determined by the dataset's biggest positive or negative signal intensity value. The option FrameTickFreq determines the frequency of frameticks on the time and m/z axes. The value for the time axis specifies the frequency of tickmarks in minutes, for the m/z axis the frequency in the number actual m/z values from the dataset's m/z list. Default is {1,1} meaning a frametick will be placed for every minute on the time axis and for every single m/z value from the msdata's m/z list. FrameTickOffsets option specifies from which position to start to place the tickmarks on both axes. The mzTickShift option specifies the shift of every tickmark along the m/z axis. ListDensityPlot places the tickmarks at the beginning of the colored rectangle not in its middle. The -.5 shift places the tickmark in the middle of the colored rectangle. It may seem redundant with the FrameTickOffsets, but the functions of the two are different. mzTickShift can be used to appropriately place the tickmarks on the parallel plots where it is desirable to place the tickmark in the middle of the multiple parallel lanes corresponding to the same m/z. For this reason, the shifting on the m/z axis is not fixed to -.5 as is the case for the tickmarks on the time axis. The options mzGridLineFreq and mzGridLineStyle specify the distribution and appearance of horizontal gridlines. The horizontal gridlines are intended for better orientation on the plots in terms of estimating the m/z values of peaks/spots. Solid gridlines alternate with dashed ones by default. More complex patterns are possible by entering multiple style specifications in the mzGridLineStyle option. AnnotationTables option specifies the list of annotation tables and AnnotationOptions can be used to pass lists of options to the DAMPDrawAnnotation function and control the appearance of annotation labels for each annotation table. The signal intensities can be displayed using a logarithmic scale by setting the LogScale option to True.
mzFrameTicks option was added to facilitate easier display of asymmetric tickmarks on the m/z axis. This is used by the DAMPParallelPlot function if datasets of non-perfect overlap in terms of their m/z values are displayed. mzTickShift may be becoming obsolete due to this.

Options[DAMPDensityPlot] = {Global`MaxScaleAutomatic, Global`FrameTickFreqs  ... AMPTextStyle}, Global`AnnotationOptionsAutomatic, Global`LogScaleFalse} 

DAMPDensityPlot[msdata_, opts___] := Module[{tmax, gridfreq, palette, pallen, frametickfreq, ... ] &[(DisplayFunction/.{plotoptions}/.{DisplayFunction$DisplayFunction})])] ]

DAMPParallelPlot plots multiple msdatas in a parallel format. The datasets are interlaced into one dataset so that chromatograms/electropherograms corresponding to the same m/z value appear next to each other on the density plot. The chromatograms/electropherograms do not have to correspond to an identical set of m/z values in every dataset. The corresponding chromatograms/electropherograms are paired automatically and horizontal gridlines separate lanes of parallel chromatograms/electropherograms corresponding to the same m/z value. Some lanes may be narrower/wider depending on the presence of data corresponding to the particular m/z value in all datasets. Although aligned datasets are most appropriate to plot using DAMPParallelPlot, unaligned datasets may be plotted as well. In this case, the time axis is labeled according to the first dataset in msdatas and the chromatograms/electropherograms from the remaining datasets are simply concatenated or extended with 0 signal intensities to accommodate them to the dimensions of the first dataset. Please note that the time axis may be misleading in this case. The possibility to plot unaligned datasets is granted in spite of this to allow quick parallel visualizations of initial data prior to processing.

DAMPParallelPlot[msdatas_, opts___] := Module[{mzs, tplen, mzitm, msdataitm, mesodata, parda ... a, opts, Global`mzFrameTicksmzticks, Global`mzGridLineFreqmzgridlines] ]

Applying a function to a single dataset

DAMPApplyFunctionToSingle applies a pure function function to the signal intensities of msdata (msdata[[1]]). This generic function could be wrapped by functions like DAMPSubtractBaselines, DAMPThreshold, DAMPRemoveSpikes, DAMPSmooth, etc. However, it is not done so for simplicity (at the expense of having redundant code though).

Options[DAMPApplyFunctionToSingle] = {Global`SampleNameSuffix""} 

DAMPApplyFunctionToSingle[msdata_, func_, opts___] := Module[{tmpdat}, tmpdat = func ... , Global`SampleNameSuffix/.{opts}/.Options[DAMPApplyFunctionToSingle]] ; tmpdat]

Baseline subtraction / noise removal / smoothing

Currently, only robust nonlinear regression to a polynomial is implemented as a baseline fitting algorithm. Any custom baseline fitting function may be passed as an option to the DAMPSubtractBaselines function which subtracts baselines from all electropherograms in the passed msdata.
The implementation of DAMPFitBaselineRobRegr was inspired by Ruckstuhl et al [1]. The weights are calculated in the same way, although the regression is global. Binning chromatograms speeds up the baseline fitting. For this purpose, two functions were implemented, a versatile DAMPBinChromatogram for which the binning function can be passed as an options (default: Mean) and which bins precisely to the time intervals specified (in minutes), and a fast compiled version DAMPBinChromatogramFast with limitations in the two mentioned factors. The function for binning the data for baseline estimation can be passed to the DAMPRobustNonlinearFit function as an option. It is set to (DAMPBinChromatogramFast[#,0.25]&) by default.

Options[DAMPBinChromatogram] = {Global`BinningFunctionMean} ;

DAMPBinChromatogram[chrom_, timeint_, opts___] := Module[{binfunc, predat}, binfunc  ...  -1, 1〛, timeint}] ; binfunc[#] &/@Transpose[#] &/@predat] <br />

DAMPBinChromatogramFast = Compile[{{chrom, _Real, 2}, {timeint, _Real}}, Module[{},  ... hrom] - 1)/(chrom〚 -1, 1〛 - chrom〚1, 1〛) timeint]] ]] <br />

Options[DAMPRobustPolynomialFit] = {Global`BinningFunction (DAMPBinChromatogramFast[ ... 2754;1, Global`ConvergenceCriteria.001, MaxIterations10, Global`b4.05} ;

DAMPRobustPolynomialFit[chrom_, opts___] := Module[{binfunc, poldeg, cc, maxit, bval, ssqpre ... m〚All, 1〛, (BestFit/.rgr)/.xchrom〚All, 1〛}] ] <br />

Options[DAMPSubtractBaselines] = {Global`BaselineFittingFunctionDAMPRobustPolynomialFit, Global`SampleNameSuffix"bs"} <br />

DAMPSubtractBaselines[msdata_, opts___] := Module[{bfunc}, bfunc = Global`BaselineFi ... ta〚4〛, Global`SampleNameSuffix/.{opts}/.Options[DAMPSubtractBaselines]]} ]

Noise removal is done by calculating the standard deviation of signal intensities in the selected area of each chromatogram/electropherogram (where no signals are expected) and leveling to 0 of all signal intensities within a specified multiple of the calculated standard deviation. The chromatograms/electropherograms should be baseline subtracted before applying this noise removal. Some noise related signals may exceed the threshold and stay in the dataset. If no negative peaks are expected (or they are artifacts without significance), all negative signal intensities may be leveled to 0 by setting the LevelNegativeSignals option to True (default). Also, applying the DAMPRemoveSpikes function on the noise reduced data may remove additional noise-related signals which exceeded the threshold during noise removal. DAMPThreshold may be applied to msdata to level to 0 all signal intensities within ±threshold.

Options[DAMPRemoveNoise] = {Global`TimeRange {1, 3}, Global`SDThreshold5, Global`LevelNegativeSignalsTrue, Global`SampleNameSuffix"nr"} 

DAMPRemoveNoise[msdata_, opts___] := Module[{timerange, sdthreshold, levelnegs, posrange, ne ... #12314;4〛, Global`SampleNameSuffix/.{opts}/.Options[DAMPRemoveNoise]]} ] 

DAMPRemoveSpikesFromInts = Compile[{{ints, _Real, 1}}, Module[{}, Join[{ints〚 ... nts〚i〛], {i, 2, Length[ints] - 1}], {ints〚 -1〛}] ]] 

Options[DAMPRemoveSpikes] = {Global`SampleNameSuffix"sr"} <br />

DAMPRemoveSpikes[msdata_, opts___] := Module[{},  {DAMPRemoveSpikesFromInts[#] & ... 〚4〛, Global`SampleNameSuffix/.{opts}/.Options[DAMPRemoveSpikes]]} ] <br />

Options[DAMPThreshold] = {Global`SampleNameSuffix"t"} 

DAMPThreshold[msdata_, threshold_, opts___] := Module[{},  {Chop[1. msdata〚1& ... ame[msdata〚4〛, Global`SampleNameSuffix/.{opts}/.Options[DAMPThreshold]]} ]

Chromatogram/electropherogram smoothing is done via the DAMPSmooth function which takes the smoothing function as one of its parameters. It is set to DAMPMovingAverageFast[#,9]& by default. DAMPMovingAverageFast smoothes the list of signal intensities by averaging every signal intensity with an equal number of preceding and succeeding signal intensities. The signal intensities at the beginning and at the end of the list are averaged with the available intensities within the window only. This ensures the smoothed list to have an equal number of elements as the original list.

DAMPMovingAverageFast = Compile[{{ints, _Real, 1}, {winsize, _Integer}}, Module[{vicdist, in ... e[ints, {Max[1, i - vicdist], Min[i + vicdist, intslen]}]], {i, 1, intslen}] ]] 

Options[DAMPSmooth] = {Global`SmoothingFunction (DAMPMovingAverageFast[#, 9] &), Global`SampleNameSuffix"s"} 

DAMPSmooth[msdata_, opts___] := Module[{smoothingfunc}, smoothingfunc = Global`Smoot ... leName[msdata〚4〛, Global`SampleNameSuffix/.{opts}/.Options[DAMPSmooth]]} ]

Peak picking

Peak picking is performed to find a set of peaks in every dataset and use these to align the datasets (find the parameters of the time shift function, subsection below). Thanks to the robustness of the alignment approach, even few corresponding peaks between the datasets are sufficient to find the best parameters for the alignment. For this reason, the demands on the quality of peak-picking algorithm are low. The peak lists may contain many peaks without corresponding counterparts across the lists or erroneous peak picks as well.
The implementation of the peak picking algorithm was inspired by Wallace's et al. [2] use of the Douglas-Peucker algorithm. In their approach, the chromatogram/electropherogram is segmented into portions defined by strategic points. First, the endpoints of the chromatogram/electropherogram are the only two strategic points. Additional strategic points are found by recursively finding a point with the biggest orthogonal distance from the line connecting two neighboring strategic points (and the distance being above a threshold). The difference in the current implementation is that vertical (not orthogonal) distance of a datapoint from a line connecting two strategic points is used as a criteria for new strategic points. The advantage of this approach is that it is not necessary to normalize either the signal intensity or time dimension. The threshold may be therefore specified in the units of the signal intensity scale. Using the vertical distance, however, may lead to too many datapoints fulfilling the criteria of a strategic point (having the distance from the line connecting two adjacent strategic points above the threshold). This is partially overcome by setting the minimal distance between the strategic points. In other words, a new strategic point cannot be selected in the specified vicinity of an existing strategic point. A good setting for this parameter is the 1/4 of expected peak width at its base, so that the restricted areas around the peak-top strategic point and peak base strategic points partially or almost overlap. The peaks are then selected from the list of strategic points according to criteria, that peak-top strategic point must have a higher signal intensity than the two neighboring strategic points. This approach may sometimes lead to artifacts - e.g. selecting a high point in the baseline as a peak, however, these do not cause harm to the purpose thanks to the robustness of the alignment algorithm as discussed above (and in the next subsection). Applying additional quality control on the selected peaks was therefore not implemented to save computational time. The peaks' retention/migration times are then calculated as a centroid in the vicinity of the peak-top strategic points. Smoothing the data before peak picking could prove beneficial in some cases.
The function DAMPFindNodesFast selects strategic points from a chromatogram/electropherogram. The function returns a list of strategic points' retention times and signal intensities. If a new strategic point can not be found between two other strategic points (the restricted areas of the two overlap) or the distance is not above threshold, {{0,0}} is returned. This is to comply with the restrictions set on compiled functions in Mathematica. The {0,0} values are then filtered out in the DAMPPickChromatogramPeaks function.
The DAMPPickChromatogramPeaks function takes the minimal distance between two strategic points (in minutes) via the NodeRestrictedVicinity option and converts it to approximate number of datapoints for the DAMPFindNodesFast. The conversion is done to allow faster processing.
DAMPGetPeakCentroid calculates the peak centroid from datapoints around the specified retention time. The selection of datapoints for calculation is determined by the option CentroidVicinity (in minutes, datapoints within this distance in both directions are selected). Note that the retention/migration time values coming from the compiled function DAMPFindNodesFast may differ from the values in the original chromatogram, since the compiled function converts the retention time values, which may be rational, to real. Therefore, it is necessary to use the Select function in DAMPGetCentroid instead of trying to find the position of the retention time value (using the Position function) and selecting a certain number of datapoints in the neighborhood.
DAMPPickPeaks[msdata,options] picks peaks from all chromatograms/electropherograms in msdata and returns them in a list {{mz,peaklist},{mz,peaklist},...}. The function does not have any default options, all options passed to DAMPPickPeaks are passed further to DAMPPickChromatogramPeaks.

DAMPFindNodesFast = Compile[{{chrom, _Real, 2}, {thresh, _Real}, {nrv, _Integer}}, Module[{c ... 62371; {{0, 0}} ] ], {{DAMPFindNodesFast[_, _, _], _Real, 2}} ] 

Options[DAMPGetPeakCentroid] = {Global`CentroidVicinity.05} 

DAMPGetPeakCentroid[chrom_, rt_, opts___] := Module[{vicinity, selpoints}, vicinity  ... 5; #〚2〛&/@selpoints)/Plus @@ selpoints〚All, 2〛] 

Options[DAMPPickChromatogramPeaks] = {Global`Threshold1000, Global`NodeRestrictedVicinity.05, Global`CentroidVicinity.05} 

DAMPPickChromatogramPeaks[chrom_, opts___] := Module[{threshold, nrv, centvic, nodes}, ɯ ... 2315;, #〚3, 2〛}]  {1, 1} &] 〚All, 2〛] ] 

DAMPPickPeaks[msdata_, opts___] := Module[{},  {#, DAMPPickChromatogramPeaks[DAMPGetChromatogram[msdata, #], opts]} &/@msdata〚2〛]

The peaklists may be used to annotate the density plots. The function DAMPPeakListToAnnotationTable converts the peaklist in a format returned by DAMPPickPeaks to an annotation table format. The names (both short and long) are assigned to peaks from a sequence of integers (starting with 1).

Options[DAMPPeakListToAnnotationTable] = {Global`LabelTextPosition1.5} 

DAMPPeakListToAnnotationTable[peaks_, opts___] := Module[{ltpos, mz, tmplist, atlen}, ᡝ ... ose[tmplist], {#, #} &[ToString[#] &/@Range[atlen]], {Table[ltpos, {atlen}]}]] ]

A stack of peaks with different m/z values may be present at certain retention/migration times. This may be due to post-source fragmentation of a single compound, isotopic peaks, or due to a group of multiple compounds reaching the detector at the same time. If there is an overwhelming number of stacked peaks, these may have significant influence on the dataset alignment (implemented in the next subsection) and bias the alignment. This is undesirable, especially in the case of CE analysis, where there are numerous peaks stacked at migration time corresponding the position of EOF peak. These peaks often do not seem to follow the migration time shift trend of the remaining compounds. Their influence may be removed by (manually) specifying a time range, which does not contain the stacked peaks, for fitting the migration time shifts. Another possibility to limit their influence (and to avoid manual specification of the time range) is to automatically select a set of representative peaks: a certain number of highest peaks from every chromatogram/electropherogram and a certain number of highest peaks from every time interval of specified size. This may provide peaks more evenly distributed on the retention/migration time vs m/z plane and eliminate most bias-causing stacked peaks. Given the robustness of the alignment method, a minority of bias-causing peaks should not hijack the alignment. DAMPSelectRepresentativePeaks can also be used to select peak picks from a certain time interval only.

Options[DAMPSelectRepresentativePeaks] = {Global`PeaksPerChromatogramAll, Global`PeaksPerIntervalAll, Global`IntervalSize1, Global`TimeRangeAll} 

DAMPSelectRepresentativePeaks[peaklist_, opts___] := Module[{perchrom, perint, intsize, time ... 314;All, 2〛}) &/@newlist〚All, 1〛 ; ] ; newlist]

The layouts of peaks from different datasets picked by the DAMPPickPeaks function may be plotted both prior to and after the alignment for visual inspection.
Graphics::realu was turned off since it started appearing for higher number of peaklists (probably PlotLegend related). The result does not seem to be affected when the error is present.

Options[DAMPPlotPeakLayout] = {Global`mzGridLineFreq1, Global`mzTickFreq1, G ... otSymbol[Diamond, 3, FilledFalse], PlotSymbol[Box, 3, FilledFalse]}}} ; 

DAMPPlotPeakLayout[peaklists_, opts___] := Module[{mzgridfreq, mztickfreq, leglabels, plotop ... ndPosition {1, -.4}] ; On[Graphics :: "realu"] ; rslt]

Dataset alignment

A combination of global optimization and dynamic programming (DP) is used for dataset alignment. DP score serves as a measure of goodness of the alignment between two datasets (their peaklists). Parameters of a mathematical function, which is assumed to be able to fit the time shifts (as a function of retention/migration time in one of the samples) of corresponding peaks between two datasets, are optimized to achieve the lowest DP score.
For a discussion on using DP in Mathematica, please refer to [3].

Options[DAMPDPChromatogramScore] = {Global`GapPenalty.5} ; 

DAMPDPChromatogramScore[pl1_, pl2_, opts___] := Module[{mpos, gappenalty}, gappenalt ... enalty, mpos[i, j - 1] + gappenalty] ; mpos[Length[pl1], Length[pl2]] ] 

Options[DAMPDPScore] = {Global`GapPenalty.5} ; 

DAMPDPScore[pls1_, pls2_, opts___] := Module[{gappenalty, itm}, gappenalty = (Global ... &[Select[pls2, #〚1〛itm〚1〛&]]) &/@pls1) ]

DAMPFitShiftFunction optimizes the parameters of a retention/migration time shift function to find the optimum alignment between two peak lists. The parameters are optimized (using the NMinimize function), the timescale on one of the peaklists is modified accordingly, and the goodness of the alignment evaluated using DAMPDPScore. The DAMPDPScore is used as the objective function to be minimized.
The default function to fit the migration time shifts is the one derived by Reijenga [4] for normalizing migration times in capillary electrophoresis. The default choice is influenced by the predominant use of CE-based approaches in the authors' institute (Institute for Advanced Biosciences, Keio University). Any custom migration time shift function can be passed to the DAMPFitShiftFunction via the ShiftFunction option. Parameters for the retention/migration time shift function with lower and upper bounds for the optimization range can be passed via the ShiftFunctionParameters option. If ShiftFunctionParameters is set to Automatic, the parameter names are extracted from the shift function automatically and are assigned no bounds for the optimization range. The peaks to be used for fitting the time shift function may be limited to those (for the peaklist pls1) falling within a specific time range set by the TimeRange option.
The GapPenalty option may hold a list of gap penalty values to be used iteratively for fitting the time shift function. This may be desirable when a small gap penalty value is required for good alignment. If a big gap penalty value is used, noncorresponding peaks from the two peaklists which are close enough to fall within the gap penalty value may negatively affect the alignment. However, if only a small gap penalty value is used and time shifts between the two peaklists are significant, NMinimize may not find the region of convergence to the global minimum. Subproblem scores are assigned to gap penalty values for a wider range of shift function parameters so the objective function does not change (if all scores are assigned the gap penalty values) or may converge to a local minimum (if noncorresponding peaks are within the gap penalty distance). A reasonable approach in cases, where time shifts are significant, appears to be a two-step fitting of the time shift function. First, the fitting is performed with a big gap penalty value to find an approximate alignment. Then, second fitting is performed with a small gap penalty value with initial regions of parameters set to the neighborhood of optimized parameter values from the first fitting.

DAMPAlignPeakList[peaklist_, shiftfunc_] := Module[{}, Transpose[{peaklist〚Al ... 1〛], #〚All, 2〛}] &/@peaklist〚All, 2〛}] ] 

Options[DAMPFitShiftFunction] = {Global`ShiftFunction (1/(1/(Global`α #) + Glob ... mizeOptions {MaxIterations1000}, Global`TimeRange {0, ∞}} 

DAMPFitShiftFunction[pls1_, pls2_, opts___] := Module[{shiftfunc, params, gappenalty, autopa ... FitShiftFunction], Global`NMinimizeOptions/.Options[DAMPFitShiftFunction]]]] ] ]

Annotation table may be converted to a peak list format for displaying it on the peak layout plots and also for purposes of aligning the annotation table onto a reference dataset. There are no default options for the DAMPAlignAnnotationTable, all options passed to the DAMPAlignAnnotationTable are passed further to the DAMPFitShiftFunction, which is used internally.

Options[DAMPAnnotationTableToPeakList] = {Global`Resolution1} 

DAMPAnnotationTableToPeakList[atbl_, opts___] := Module[{resol, tmpannot, mz}, resol ... 2314;1〛mz&]}) &/@Union[tmpannot〚All, 1〛] ] 

DAMPAlignAnnotationTable[peaklist_, atbl_, opts___] := Module[{atpl, shiftfunc, newatbl}, &# ... atbl〚All, 2〛 = shiftfunc[newatbl〚All, 2〛] ; newatbl]

Sample dataset is aligned to a reference dataset by applying the migration time shift function to the timepoints of the sample dataset, interpolating all resulting chromatograms in the sample dataset, and selecting timepoints (from this interpolation) identical to timepoints in the reference dataset. This will lead to datasets having corresponding datapoints. Direct datapoint-by-datapoint arithmetic operation will therefore be possible. Signal intensities are adjusted to compensate for peak broadening/compression and conserve peaks' areas. Due to interpolation and the discrete nature of the data, minor differences in the areas before and after rescaling are observed. These seemed to stay within 2% for our datasets, what could be acceptable for visual approach.

Options[DAMPAlign] = {Global`SampleNameSuffix"a"} 

DAMPAlign[msdata_, shiftfunc_, timepoints_, opts___] := Module[{compensationcoefs, ifunc, rs ... opts}/.Options[DAMPAlign]]} ; On[InterpolatingFunction :: dmval] ; rslt]

Dataset normalization

Datasets are normalized by multiplying all the signal intensities by a normalization coefficient. The coefficient value may come from some external source (e.g. from sample weights) or from integrating the peaks of internal standard in the datasets.
A simple integration was implemented (used for the peaks of internal standards only). No peak detection is performed. The area below a linearly interpolated part of interest of a  chromatogram/electropherogram is calculated. The part of interest is specified as a time range. Optionally, the baseline may be calculated as an average of signal intensities from a time range where no peaks are expected. Please note that both the peak integration and baseline estimation is blind in the case of the DAMPIntegrate function. Therefore, the chromatograms of integrated peaks as well as the integration results are displayed for visual inspection during automated differential analysis located in the Assemblies of Core Functions for Common Tasks section.

Options[DAMPNormalize] = {Global`SampleNameSuffix"n"} 

DAMPNormalize[msdata_, coef_, opts___] := Module[{}, DAMPApplyFunctionToSingle[msdat ... 14;3〛, #〚4〛} &), Join[opts, Options[DAMPNormalize]]] ] 

Options[DAMPIntegrate] := {Global`TimeRange {0, ∞}, Global`BaselineFromTimeRangeNone} 

DAMPIntegrate[chrom_, opts___] := Module[{bsln, bslnfrotr, timerange}, bsln = 0 ; &# ... 2314;1〛≤#〚1〛≤timerange〚2〛&], 2, 1]) ]

Dataset filtering

The DAMPFilter function uses one dataset as a criteria to filer out datapoints from a different dataset (although the two could be identical). A pure function to process the signal intensity matrix of the criteria dataset is passed as a parameter. The signal intensity values in the processed criteria dataset determine the fate of signal intensities in the filtered dataset. Signal intensities in the filtered dataset for which the corresponding signal intensities in the processed criteria dataset are zero are leveled to zero as well (or left unchanged if the corresponding signal intensities from the criteria dataset are non-zero).
A simpler version of the DAMPFilter function which accepts a numeric threshold value instead of the processing function is defined. Signal intensities in the filtered dataset are leveled to 0 if the absolute values of the corresponding signal intensities in the criteria dataset do not exceed the threshold value.
The dataset to be filtered and the criteria dataset must have identical dimensions (in other words, be aligned or identical).

Options[DAMPFilter] = {Global`SampleNameSuffix"f"} <br />

DAMPFilter[msdata_, criteriamsdata_, threshold_ ? NumericQ, opts___] := Module[{}, DAMPFilter[msdata, criteriamsdata, Chop[1.#, threshold] &, opts] ] 

DAMPFilter[msdata_, criteriamsdata_, filtfunc_, opts___] := Module[{},  {msdata{ ... leName[msdata〚4〛, Global`SampleNameSuffix/.{opts}/.Options[DAMPFilter]]} ]

The DAMPTrendFilter function filters out those datapoints from the result of averages of two groups, where at least a certain number of signal intensities from both groups does not follow the same trend (greater or smaller than the average of the other group) as the averages of the groups.

Options[DAMPTrendFilter] = {Global`SampleNameSuffix"tf"} 

DAMPTrendFilter[msdatas1_, msdatas2_, tofiltermsdata_, filtnum_, opts___] := Module[{tmpset, ... ermsdata〚4〛, Global`SampleNameSuffix/.{opts}/.Options[DAMPTrendFilter]]} ]

Applying a function to a group of datasets

The DAMPApplyFunctionToGroup function applies a pure function function to all corresponding signal intensities of msdatas (a list of msdata datasets).

Options[DAMPApplyFunctionToGroup] = {Global`ApplyToIntensitiesOnlyTrue, Global`SampleNameSuffix"", Global`ResultSampleNameAutomatic} 

DAMPApplyFunctionToGroup[msdatas_, func_, opts___] := Module[{rsltsamplename}, rslts ... SampleNameSuffix/.{opts}/.Options[DAMPApplyFunctionToGroup]] ; tmpdat] ]

Miscellaneous functions

DAMPCheckDir checks if a directory dir exists and creates it, if it does not exist. Parent directory must exist, error message is displayed otherwise. This function is used in some template notebooks to create directories for the saving/export of the results.

DAMPCheckDir[dir_] := Module[{curdir, createrslt}, curdir = Directory[] ; If ... 2371;SetDirectory[curdir] ; ] ; On[SetDirectory :: "cdir"] ; ]

DAMPGenColors generates a list of color specifications having the a specified number of elements. If the number is 6 or less elements, the colors are assigned from the following sequence: blue, red, green, yellow, cyan, magenta. If the number is more than 6, the hue range is split proportionally and the colors are assigned  from this sequence. The function is mostly used to assign comprehensive colors to overlaid chromatogram/electropherogram plots.

DAMPGenColors[nr_] := Module[{}, If[nr≤6, Take[{Hue[2/3], Hue[0], Hue[1/3], Hue[1/6], Hue[1/2], Hue[5/6]}, nr], Table[Hue[i/nr], {i, 0, nr - 1}]] ]

DAMPCalcRelativeDifference calculates a relative difference between two corresponding signal intensities. The difference between the signal intensities is divided by the larger of the two (or an absolute value of their difference if this is bigger than any of the two, when one of the signal intensities is negative).

DAMPCalcRelativeDifference[ref_, smpl_, threshold_:0] := N[(smpl - ref)/(Chop[1.Max[smpl, ref, Abs[smpl - ref]], threshold]/.{0∞, 0.∞})]

Assemblies of Core Functions for Common Tasks

Normalizing multiple datasets

The DAMPNormalizeGroup function aligns msdatas datasets to a reference dataset (one picked from msdatas specified by the Reference option) and normalizes them according to the area of the peak of the internal standard. Additional normalization coefficients may be passed to DAMPNormalizeGroup via the ExternalNormalizationCoefficients options (to account for different sample weights, etc.). Annotation tables may be passed to the function via the AnnotationTables option. These are aligned to the reference dataset as well. For automatic internal standard localization, its short name in the first annotation table must be specified by the InternalStandard option. The annotation table is aligned to the reference dataset and the vicinity of the location of the peak of the selected internal standard, as extrapolated from the aligned annotation table, is integrated. This integration is blind, overlaid chromatograms/electropherograms with the integrated area are shown for visual confirmation. The location of the peak of the internal standard in the reference dataset may also be specified explicitly. Also, peak picks, representative peak selection, aligned peaks and aligned annotation table layouts are shown for visual confirmation.
RepresentativePeakOptions were set to rather low values for the peak numbers to allow fast alignment with default options.
With many groups, Mathematica may run into memory problems. To limit this, signal intensities in internal calculations may be converted to integers to save memory. This is determined by the SaveMemory option (default: True).

Options[DAMPNormalizeGroup] = {Global`Reference1, Global`AlignmentTimeRangeA ... 62754;None, Global`AutoISIntegrationVicinity {-.25, .25}, Global`SaveMemoryTrue}

DAMPNormalizeGroup[msdatas_, opts___] := Module[{reference, aligntr, rpopts, ppopts, fsfopts ... edDatasetsaligneddatasets, Global`AlignedAnnotationTables->alignedannotbls} ]

Comparing two datasets

The DAMPTwoDatasets function uses the DAMPNormalizeGroup function to align and normalize two datasets. Datasets representing the absolute, relative, and absolute×relative differences are returned along with the normalized datasets and aligned annotation tables as a list of rules.

Options[DAMPTwoDatasets] = {Global`NormalizeGroupOptions {}, Global`ThresholdForRelative0} <br />

DAMPTwoDatasets[msdata1_, msdata2_, opts___] := Module[{normrslt, ndat, annottbls, combsname ... al`AbsoluteRelativeabsreldif, Global`AlignedAnnotationTablesannottbls} ]

Looking for outliers within multiple datasets

The DAMPOutliers function uses the DAMPNormalizeGroup function to align and normalize a group of datasets and highlights outlying signals. Two types of results are generated: a z-score map and a quartile-based result. Both types of results may be generated for every single dataset (optional). Overall results are generated by default. In this case, only one resulting dataset is generated (for both types of approaches) by using the most outlying datapoint from every set of corresponding signal intensities for result calculation. This simplifies the generation of chromatograms/electropherograms for candidates as these do not have to be generated for each dataset's results.

DAMPDropOutliers is used internally by the function DAMPOutliers to remove a selected number of outliers from a set of values. The function is employed in z-score calculation. The presence of an outlier(s) influences the values of both the mean and the standard deviation of a set. This may lead to a relatively low z-score value for an outlier. It is therefore desirable to calculate the mean and the standard deviation (which are used for calculating the z-scores) from a set of values not containing an outlier(s). The DAMPOutliers function's option OutliersToDrop determines the number of outliers to drop from every set of corresponding signal intensities. When the number is set to more than one, the outlier dropping is performed iteratively by calculating the mean and removing the single most distant value from the set. The same specified number of values is dropped from every set regardless of whether the values would be classified as outliers.
The result for a quartile-based calculation is set to 0 if the tested signal intensity is between the first and the third quartile of the set of corresponding signal intensities from all datasets (no 'outliers' are dropped in this case in contrast to the z-score calculation). If the tested signal intensity is greater than the third quartile, the result is calculated as the difference between the tested signal intensity and the third quartile and this difference is divided by the interquartile range (the difference between the third and the first quartile). If the tested signal intensity is less than the first quartile, the result is calculated as the difference between the first quartile and the tested signal intensity value and this difference is divided by the interquartile range. A negative sign is assigned to the latter result to indicate a 'negative' outlier.

DAMPDropOutliers = Compile[{{dt, _Real, 1}, {nr, _Integer}}, Module[{avg = 0., rslt = {}}, & ... avg]]}]] ; If[nr>1, rslt = DAMPDropOutliers[rslt, nr - 1]] ; rslt]] ;

Options[DAMPOutliers] = {Global`NormalizeGroupOptions {}, Global`OutliersToDrop1, Global`IndividualZsFalse, Global`IndividualQsFalse} 

DAMPOutliers[msdatas_, opts___] := Module[{iz, iq, outlierstodrop, normrslt, ndat, annottbls ... l`QuartileResultIndividualiq, Global`AlignedAnnotationTablesannottbls} ]

Comparing two groups of replicates

The DAMPTwoGroups function uses the DAMPNormalizeGroup function to align and normalize the datasets of the two groups of replicates to be compared. One approach to the comparison is to average the datasets in every group and continue the comparison as with two datasets. Absolute, relative, and absolute×relative differences are generated for the averaged datasets. These results can be further filtered with DAMPTrendFilter to remove those results, where at least a certain number of signal intensities in both groups does not follow the trend of the averages of the groups. In the current implementation, this is done for the absolute×relative result only. Additionally to the comparison of the averages of the groups, t-scores are calculated for the groups of corresponding signal intensities.

Options[DAMPTwoGroups] = {Global`NormalizeGroupOptions {}, Global`ThresholdForRelative0, Global`GroupNamesAutomatic, Global`AbsRelTrendFilter2}

DAMPTwoGroups[ctrls_, smpls_, opts___] := Module[{relthreshold, groupnames, ctrllen, smpllen ... roupCounts {ctrllen, smpllen}, Global`GroupNames {ctrlname, smplname}} ]

Comparing multiple groups of replicates

One-way ANOVA is used to locate the differences among multiple groups of datasets. To reduce the amount of memory necessary to perform the calculations, the F ratio is calculated for each corresponding group of signal intensities separately (instead of doing it in a whole dataset fashion by transposing them appropriately - this approach proves to be memory demanding). A compiled function DAMPCalcFRatioFast for calculating the F ratio was implemented. It is used instead of Mathematica's ANOVA function to gain speed. However, the groups must contain an identical number of replicates as mentioned below.
Possibility to use an analogy of the absolute×relative plot (average of control datasets vs. the most distant average) was investigated, but the results did not meet the expectations for the datasets tested. This approach is therefore not used.

DAMPCalcFRatioFast calculates an F-ratio (one-way ANOVA) for data. Data is expected to contain a flat list of values. These are then grouped into groups each containing equal number of values determined by replicates. Every group must contain an equal number of replicates - a constraint enforced by this F ratio calculation procedure. This function is used by the DAMPMultiGroups function.

DAMPCalcFRatioFast = Compile[{{data, _Real, 1}, {replicates, _Integer}}, Module[{gro ... (groups - 1)/#] &[totals〚 -1〛/(groups (replicates - 1))] ] ] ;

Options[DAMPMultiGroups] = {Global`NormalizeGroupOptions {}, Global`GroupNamesAutomatic} 

DAMPMultiGroups[msdatas_, replicates_, opts___] := Module[{normopts, groupnames, rslt, finda ... " : "<>#) &/@groupnames)))}}, Global`GroupNamesgroupnames} ]

Overlaid chromatograms/electropherograms for candidate differences

For the confirmation of signals (and rejection of false positives) from plots highlighting differences between selected datasets overlaid chromatograms/electropherograms may be plotted. Normalized original datasets along with a dataset representing the differences between them (result dataset) are passed to the function DAMPPlotCandidates and overlaid chromatograms/electropherograms of the most significant differences are generated. To find the most significant differences, the result dataset is searched for a highest absolute signal intensity value. This datapoint, along with its neighborhood (of specified size in the retention/migration time dimension) is set to 0 and the search for additional significant differences continues iteratively.

Options[DAMPPlotCandidates] = {Global`PlotCount12, Global`TimeRangeAll, Glob ... 4; {-.6, .6}, Global`DropVicinity.2, Global`PlotChromatogramOptions {}} 

DAMPPlotCandidates[datamsdatas_, critmsdata_, opts___] := Module[{plotcount, timerange, plot ... ; , {Global`PlotCount/.{opts}/.Options[DAMPPlotCandidates]}] ; resplots]

References

[1] Ruckstuhl, A. F., Jacobson, M. P., Field, R. W., Dodd, J. A. (2001) Baseline subtraction using robust local regression estimation. J. Quant. Spectrosc. Radiat. Transfer, 68, 179-193
[2] Wallace, W. E., Kearsley, A. J., Guttman, C. M. (2004) An Operator-Independent Approach to Mass Spectral Peak Identification and Integration. Anal. Chem., 76, 2446-2452
[3] Wagner, D. B. (1995) Dynamic Programming. Mathematica J., 5, 42-51
[4] Reijenga, J. C., Martens, J. H. P. A., Giuliani, A., Chiari, M. (2002) Pherogram normalization in capillary electrophoresis and micellar electrokinetic chromatography analyses in cases of sample matrix-induced migration time shifts. J. Chromatogr. B, 770, 45-51

Epilog

End[]

Print["MathDAMP version 1.0.0 loaded (2006/04/26)"]

Print["This program is distributed in the hope that it will be useful, but WITHOUT ANY  ... without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE."]

EndPackage[]