Four distinct tasks are performed by the SIGNATURE program (cf. figure).

The purpose of the first task, the signature equation, is to calculate the list of molecular fragments and interfragment bonds that constitute the models. Roughly, the signature equation matches qualitative structural data with quantitative structural data in order to compute an exhaustive and non-overlapping list of molecular fragments and interfragment bonds. In mathematical terms, the signature equation is an integer linear programming (ILP) problem in which the unknowns are the number of each molecular fragment and of each interfragment bond. Quantitative structural data are not exact values; a standard deviation is associated with each datum, and it is the task of the expert using the SIGNATURE program to input these standard deviations. Furthermore, if the molecular formula of the studied compound is unknown, the user of the program inputs the average number of atoms. Most of the time, several lists of molecular fragments and interfragment bonds correspond to the given sets of 2D data and standard deviations. The goal of the signature equation is to determine the "best" list, i.e., the list that minimizes the deviation between the model and the 2D quantitative data. Once a list of molecular fragments and interfragment bonds is determined, a structural formula can be obtained by connecting the fragments with the corresponding interfragment bonds. At that stage, the structure to be constructed is much like a jigsaw puzzle: one knows the pieces of the puzzle and the ways these pieces connect. Generally, several structural formulas can be constructed.
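
The idea of choosing integer fragment counts that best reproduce noisy quantitative data can be illustrated with a toy example. The sketch below is not the SIGNATURE solver: the three fragment types, their contributions, the observed values, and the standard deviations are all hypothetical, interfragment bonds are left out, and the objective (sum of absolute deviations, each divided by its standard deviation) is an assumed form of "the deviation". A brute-force search stands in for a real ILP solver.

```python
from itertools import product

# Hypothetical data. Rows are observables, columns are fragment types.
# CONTRIB[i][j] = amount that one copy of fragment j adds to observable i.
CONTRIB = [
    [6, 0, 0],  # aromatic carbons
    [0, 1, 1],  # aliphatic carbons
    [5, 3, 2],  # hydrogens
]
OBSERVED = [12.0, 5.0, 22.0]  # measured values (assumed)
SIGMA = [1.0, 0.5, 2.0]       # standard deviations supplied by the expert (assumed)


def best_fragment_list(contrib, observed, sigma, max_count=6):
    """Return (counts, score) minimizing the sigma-weighted absolute deviation.

    Every integer vector with entries in 0..max_count is tried, which is
    only feasible for a handful of fragment types.
    """
    n_fragments = len(contrib[0])
    best_counts, best_score = None, float("inf")
    for counts in product(range(max_count + 1), repeat=n_fragments):
        score = 0.0
        for row, target, s in zip(contrib, observed, sigma):
            predicted = sum(a * x for a, x in zip(row, counts))
            score += abs(predicted - target) / s
        if score < best_score:
            best_counts, best_score = counts, score
    return best_counts, best_score


if __name__ == "__main__":
    counts, score = best_fragment_list(CONTRIB, OBSERVED, SIGMA)
    print("fragment counts:", counts, "weighted deviation:", score)
```

With these made-up numbers exactly one list reproduces the data with zero deviation; with real data several lists usually fit within the standard deviations, which is why a "best" list has to be selected.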

The second task, the structure generation, determines how many structural formulas have to be constructed. When the studied compound contains a small number of fragments, it is possible to use a deterministic technique and, therefore, to construct all the structural formulas that correspond to the list of fragments and interfragment bonds computed by the signature equation. The SIGNATURE program offers a deterministic algorithm, based on the symmetries of the fragments, to generate all the structural formulas. However, as already mentioned, deterministic techniques cannot resolve the problem of structure elucidation for large molecular compounds. In such an instance one has to use a stochastic structure generation. The purpose of a stochastic structure generation is to approximate the number of possible structural formulas, and to generate a sample of these formulas that statistically represents the entire population of possibilities. The stochastic technique used by the SIGNATURE program to approximate the number of possible structural formulas is based on the Knuth algorithm. Although the Knuth algorithm was devised for other purposes, it can be used to compute an unbiased estimator of the number of possible structural formulas. The sample of structural formulas is then generated using several stochastic techniques: Random Sampling, Monte-Carlo, Simulated Annealing, and Genetic Algorithm. All the structural formulas generated are constructed in three-dimensional space. During the generation process, the expert using the system inputs the sample size and can impose structural constraints, such as avoiding the formation of double bonds or forcing the generator to build five- or six-membered rings.
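
To show how a random-probe estimator can count possibilities without enumerating them, here is a minimal sketch. It assumes that "the Knuth algorithm" refers to Knuth's random-probe method for estimating the size of a backtrack tree, and it uses a deliberately simplified, hypothetical model: each free bonding site is labelled with the fragment it belongs to, a structure is a complete pairing of sites, and a site may not bond to a site on the same fragment. It counts labelled pairings, not structures distinct up to symmetry. The function names are illustrative only.

```python
import random


def partners(sites):
    """Indices of the sites that may bond with sites[0] (a different fragment)."""
    return [i for i in range(1, len(sites)) if sites[i] != sites[0]]


def count_exact(sites):
    """Exhaustive (deterministic) count of complete pairings."""
    if not sites:
        return 1
    total = 0
    for i in partners(sites):
        total += count_exact(sites[1:i] + sites[i + 1:])
    return total


def knuth_probe(sites, rng):
    """One random walk down the search tree.

    Returns the product of the branching factors if the walk ends in a
    complete pairing, and 0 if it hits a dead end. The expected value of
    this quantity equals the exact number of complete pairings.
    """
    weight = 1
    while sites:
        options = partners(sites)
        if not options:
            return 0
        weight *= len(options)
        i = rng.choice(options)
        sites = sites[1:i] + sites[i + 1:]
    return weight


def estimate(sites, probes, seed=0):
    """Average of many independent probes."""
    rng = random.Random(seed)
    return sum(knuth_probe(sites, rng) for _ in range(probes)) / probes


if __name__ == "__main__":
    sites = (0, 0, 1, 1, 2, 2)  # three fragments with two free sites each
    print("exact:", count_exact(sites))
    print("estimated:", estimate(sites, probes=2000))
```

For this small case the exhaustive count is cheap, so the estimate can be checked against it; for large compounds only the probe-based estimate remains practical.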

Once the sample of models is constructed, the third task, the 3D simulations, submits each model to molecular orbital calculations or molecular simulations. After the optimized 3D models are produced, 3D physical properties are calculated for the models and compared to the corresponding 3D analytical data. These properties are the density, the pore volume distribution, the surface area, and the fractal dimension of the surface. The methods employed by the SIGNATURE program to simulate the three-dimensional physical characteristics are based on finite element theory.
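
As a rough illustration of computing a 3D property by dividing space into small elements, the sketch below estimates the volume occupied by a set of atoms on a regular grid of cubic cells and converts it to a density. This is a crude stand-in, not the finite element method used by SIGNATURE: atoms are treated as hard spheres, the radii and the grid step are assumptions, and the function names are hypothetical.

```python
import math


def occupied_volume(atoms, step=0.1):
    """Volume (in cubic angstroms) of the union of atomic spheres.

    atoms: list of (x, y, z, radius) tuples in angstroms.
    A cubic cell counts as occupied when its center lies inside any sphere.
    """
    lo = [min(a[k] - a[3] for a in atoms) for k in range(3)]
    hi = [max(a[k] + a[3] for a in atoms) for k in range(3)]
    n = [max(1, math.ceil((hi[k] - lo[k]) / step)) for k in range(3)]
    filled = 0
    for i in range(n[0]):
        x = lo[0] + (i + 0.5) * step
        for j in range(n[1]):
            y = lo[1] + (j + 0.5) * step
            for k in range(n[2]):
                z = lo[2] + (k + 0.5) * step
                if any((x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2 <= r * r
                       for ax, ay, az, r in atoms):
                    filled += 1
    return filled * step ** 3


def density(mass_amu, volume_A3):
    """Density in g/cm^3 (1 amu = 1.66054e-24 g, 1 cubic angstrom = 1e-24 cm^3)."""
    return mass_amu * 1.66054 / volume_A3


if __name__ == "__main__":
    # Two overlapping spheres with an assumed radius of 1.7 angstroms.
    atoms = [(0.0, 0.0, 0.0, 1.7), (1.5, 0.0, 0.0, 1.7)]
    volume = occupied_volume(atoms)
    print("volume:", volume, "density:", density(24.0, volume))
```

A smaller step gives a more accurate volume at a cost that grows with the cube of the grid resolution.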

Finally, the sample is statistically analyzed by the fourth task. If the statistical technique used by the SIGNATURE program is random sampling, the optimal sample size needed for statistical significance can be determined. Furthermore, the calculations performed with the sample can be extrapolated to the entire population of possible models.
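
The text does not say how the optimal sample size is computed. As one plausible illustration, the sketch below uses the standard textbook formula for estimating a mean from a simple random sample to within a chosen margin of error, with an optional finite population correction. The function name, the default confidence factor, and the example numbers are assumptions.

```python
import math


def sample_size(sigma, margin, z=1.96, population=None):
    """Number of models to sample so that the sample mean of a property
    lies within `margin` of the population mean.

    sigma      : estimated standard deviation of the property
    margin     : acceptable error, in the same units as sigma
    z          : normal quantile (1.96 corresponds to about 95% confidence)
    population : total number of possible models, if known or estimated
    """
    n0 = (z * sigma / margin) ** 2
    if population is not None:
        # Finite population correction: fewer samples are needed when the
        # sample is a sizeable fraction of the whole population.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


if __name__ == "__main__":
    # Hypothetical: density spread of 0.05 g/cm^3, target error of 0.01 g/cm^3.
    print(sample_size(0.05, 0.01))
    print(sample_size(0.05, 0.01, population=500))
```

The population argument is where an estimate of the number of possible structural formulas would enter the calculation.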

For more information, e-mail jfaulon@gmail.com

JLF 1996