Computer‐Assisted Structure Identification: Predictive Models and Automation
Presented at METABOLOMICS 2011
The CASI (Computer-Assisted Structure Identification) approach is used at Philip Morris International R&D to identify small molecules in complex matrices analyzed with GCxGC-TOF-MS. In the CASI approach, structure candidates and their associated match factors for a mass spectrum are obtained using NIST MS Search. To refine the NIST MS Search results, we developed quantitative structure-property relationship models that predict the two retention times of a GCxGC-TOF-MS instrument. A Kovats index model was built for the first dimension, and a second model was developed for the second dimension using relative retention times that are specific to the GCxGC-TOF-MS instrument configuration: non-polar (1st dimension) x polar (2nd dimension). The models can be adapted to different column combinations. For each type of model, results obtained with k-nearest neighbors, multiple linear regression, and support vector machines were compared. For each algorithm, the best sets of descriptors were selected using genetic algorithms. The process is fully automated using Java and several other standard tools: NIST MS Search for searches in a mass spectral database, Dragon for computing molecular descriptors, RapidMiner for applying the predictive retention models, Pipeline Pilot for normalizing chemical structures, and ACD/Labs PhysChem Batch for computing boiling points. The CASI web interface proposes a list of the best-matching structure candidates, allowing users to easily check and correct structure assignments. CASI also enables users to easily add new instruments, analytical columns, and retention models to the platform.
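As an illustration of how a retention index can refine a spectral library hit list, the following is a minimal Python sketch and not the CASI implementation (which the abstract describes as Java-based). It assumes the classical isothermal Kovats formula with logarithmic interpolation between bracketing n-alkanes; the abstract does not state which retention index variant is used. The function names, the candidate tuple layout, and the tolerance of 50 index units are hypothetical choices made only for this example.

```python
import math


def kovats_index(t_x, alkanes):
    """Isothermal Kovats index of an analyte (assumed formula).

    t_x     -- adjusted retention time of the analyte
    alkanes -- dict mapping n-alkane carbon number to its adjusted
               retention time on the same column
    """
    carbons = sorted(alkanes)
    # Find the pair of reference n-alkanes that brackets the analyte.
    for n, m in zip(carbons, carbons[1:]):
        t_n, t_m = alkanes[n], alkanes[m]
        if t_n <= t_x <= t_m:
            frac = (math.log(t_x) - math.log(t_n)) / (math.log(t_m) - math.log(t_n))
            return 100.0 * (n + (m - n) * frac)
    raise ValueError("retention time outside the n-alkane reference range")


def rerank(candidates, observed_index, tolerance=50.0):
    """Drop candidates whose predicted index is far from the observed one.

    candidates -- list of (name, match_factor, predicted_index) tuples;
                  this layout is hypothetical
    tolerance  -- maximum accepted |predicted - observed| (assumed value)
    Returns the surviving candidates, best match factor first.
    """
    kept = [c for c in candidates if abs(c[2] - observed_index) <= tolerance]
    return sorted(kept, key=lambda c: c[1], reverse=True)


if __name__ == "__main__":
    # Made-up reference times and candidates, for illustration only.
    reference = {10: 100.0, 11: 200.0, 12: 400.0}
    observed = kovats_index(150.0, reference)
    hits = [("candidate A", 900, 1300.0), ("candidate B", 850, 1060.0)]
    print(observed, rerank(hits, observed))
```

In this sketch the predicted index acts as a hard filter applied before ranking by match factor. Combining the two scores into a single weighted rank would be an equally plausible design, and the abstract does not say which strategy CASI uses.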