Data Mining in Chemistry

Markus C. Hemmer, Johann Gasteiger

Computer-Chemie-Centrum, Institut für Organische Chemie, Universität Erlangen-Nürnberg, Nägelsbachstr. 25, 91052 Erlangen
Phone: +49-9131-85-26570, Fax: +49-9131-85-26566



The analysis of data sets is one of the most important tasks in the investigation of properties of chemical compounds. Especially in drug design, methods are used to characterize complete sets of chemical compounds instead of describing individual molecules. Data mining, i.e. the exploration of large amounts of data in search of consistent patterns, correlations and other systematic relationships, can be a helpful tool to evaluate "hidden" information in a set of molecules. Finding adequate descriptors for the representation of chemical structures is one of the most important problems in chemical data mining. A special descriptor of chemical structures has already been used successfully in the Internet project TeleSpec and will now be applied in a new project.
Data Mining Service - Chemistry (DMSC) is a project for the development of a centralized service for the exploration of chemical data sets. With this service it will be possible to analyze chemical data sets for molecular patterns and systematic relationships using statistical methods, artificial neural networks, genetic algorithms and expert systems.

The results of these data analyses are presented with modern Web technology using VRML, XML and the special languages CML and MathML, and can be downloaded in several graphical formats as well as in more than 30 different chemical file formats.
Data Mining Service Chemistry (DMSC) opens a new way of chemical information processing using the newest WWW techniques to visualize complex trends, patterns and relationships in chemical datasets in a most effective way.

The Problem - Data Scale in Chemistry

With the progressive specialization in the sciences and the extensive use of computational methods, the steady increase of data is barely manageable even by a team of scientists. As a result, interest in specific information is pushed into the background, while global or higher-level information about complete sets of data is becoming more and more important. Thus, recognizing higher-level information in complete data sets becomes one of the most important tasks for information management in science. As a consequence, the quality of information retrieval no longer depends on the quantity and size of primary information objects, but on an automated, intelligent analysis of primary sources. The development of new techniques such as Combinatorial Chemistry and Chemometrics reflects the same trend in chemistry.

In chemistry, the investigation of molecular structures and their properties is one of the most fascinating topics. Chemistry has its own language and naming conventions for molecular structures, which have been evolving from the first alchemical experiments to modern times. With the rise of computational information processing, several conventions and formats for chemical information have been developed.

However, in one of the most important communication media of modern times, the Internet, the chemical language has been used only in a few specific applications. While a few databases are accessible via the World Wide Web, no service exists that allows data mining of chemical data sets by means of this specific language.

Extraction of hidden information in chemical data sets

The task of Data Mining in a chemical context is to evaluate "hidden" information in a set of chemical data. One difference between Data Mining and conventional database queries is the production of new information that is used to characterize chemical data in a more general way. Generally, it is not possible to hold all of the potentially required information in a data set of chemical structures. Thus, the extraction of relevant information and the production of reliable secondary information are important.

In recent decades, methods have been developed to describe Quantitative Structure Activity Relationships (QSAR) and Quantitative Structure Property Relationships (QSPR), which model the relationship between structure and chemical or biological properties. Assessing the similarity of two compounds with respect to their biological activity is one of the central tasks in the development of pharmaceutical products. A typical application is the retrieval of structures with a defined biological activity from a database. Biological activity is of special interest in the development of drugs. On the one hand, the diversity of structures in a data set of drugs can be of interest for the synthesis of new compounds: as the variety of structures in a data set increases, so does the chance of finding a new synthesis route to a compound with similar biological properties. On the other hand, the similarity of certain structural features is important for retrieving a compound with similar biological properties. In fact, the term "similarity" can have quite different meanings in chemical approaches. It does not simply mean similarity of structural features (which is, in fact, easy to determine). Similarity in a chemical context must include additional properties that are sometimes hard to describe as a simple feature.

Therefore, finding an adequate descriptor for the representation of chemical structures is one of the basic problems in chemical data mining. Several methods have been developed in recent decades for the description of molecules including their chemical or physicochemical properties. These so-called structure coding methods are subject to some basic restrictions. In particular, they have to be independent of the size of a molecule.

A special descriptor for chemical structures, the Radial Distribution Function Codes (RDF Codes), has already been used in the Internet project "TeleSpec", which was sponsored by the German Research Network (DFN). The success of this method prompted us to consider the use of this descriptor in a more complex context: the task of Data Mining in Chemistry.

The Internet project Data Mining Service Chemistry (DMSC) is intended to provide centralized access to a wide variety of data mining methods, such as statistical processing, simulation and prediction of chemical properties. With this service it will be possible to submit data sets or to compile a data set by extracting structures from chemical databases via the Internet. For submitted or compiled data sets, descriptors can be calculated with an extensive set of options. On the basis of these descriptors, several methods of data analysis can be performed on the data set.

The need for a unique characterization of molecules

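Molecules are normally represented as 2D formulas or 3D molecular models. While the three-dimensional coordinates of the atoms in a molecule are sufficient to describe their spatial arrangement, they lack two features: a representation of fixed size that is independent of the number of atoms, and information about the properties of the molecule.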

The first feature is most important for computational analysis of data. Even a simple statistical function, e.g., a correlation coefficient, requires two equally sized information sets (i.e. vectors). The solution to this problem is a mathematical transformation of the Cartesian coordinates of a molecule to a vector of fixed length. The second point can be overcome by including the desired properties into the transformation algorithm.
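The idea of such a transformation can be illustrated with a minimal sketch in the spirit of the radial distribution function codes mentioned above. The functional form used here (a sum over atom pairs of property products, smoothed by a Gaussian around each interatomic distance), the function name rdf_code, the parameter values and the example coordinates are all assumptions made for illustration; they are not taken from the DMSC implementation.

```python
import math


def rdf_code(coords, props, n_points=32, r_max=8.0, smoothing=25.0):
    """Transform atoms (any number) into a vector of exactly n_points values.

    coords: list of (x, y, z) tuples, one per atom
    props:  one atomic property value per atom (e.g. a partial charge)
    """
    if len(coords) != len(props):
        raise ValueError("one property value per atom is required")
    step = r_max / (n_points - 1)
    code = [0.0] * n_points
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r_ij = math.dist(coords[i], coords[j])  # interatomic distance
            weight = props[i] * props[j]            # property term for the pair
            for k in range(n_points):
                r = k * step
                # Gaussian contribution of this atom pair at distance r
                code[k] += weight * math.exp(-smoothing * (r - r_ij) ** 2)
    return code


# Hypothetical coordinates and property values, for illustration only.
three_atoms = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
five_atoms = three_atoms + [(1.5, 1.5, 0.0), (0.0, 0.0, 1.1)]

small = rdf_code(three_atoms, [-0.8, 0.4, 0.4])
large = rdf_code(five_atoms, [-0.8, 0.4, 0.4, 0.1, -0.1])
print(len(small), len(large))  # both vectors have n_points entries
```

Because both molecules end up as vectors of the same length, they can be compared directly, and the property values enter the result through the pair weights.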

Such descriptors have been developed in several research projects in our group. They will be used in the DMSC project for the characterization of structures including their properties. An example of a descriptor is shown in the following figures.


Figure 1a. 2D structure of Cholic Acid, as it is submitted by a user

Figure 1b. 3D model with a chemically adequate orientation of the atoms

Figure 1c. Van der Waals surface colored by the electrostatic potential based on the partial charges of the atoms

Figure 1d. Projection of the molecular surface onto a neural Kohonen map.

Fig. 1a shows the 2D structure of cholic acid, a compound found in the bile of most vertebrates. Such a 2D structure is a common representation of a molecule in the chemical language, but it does not include any information about the spatial arrangement of the atoms. Some atoms in this molecule can appear in different orientations, leading to different molecules (the enantiomers) that can also differ completely in biological activity. An example of the dramatic consequences of this fact is thalidomide (Contergan™), a compound that acts as a soporific drug in one of its enantiomeric forms and causes severe malformations of foetuses in the other.

The chemically adequate 3D model is shown in Fig. 1b. This representation is based on Cartesian coordinates (an xyz triple) for each atom.

Fig. 1c shows the surface of the molecule as it can be "seen" by an active biological center. This representation is calculated by a special program and colored according to the electronic properties of the surface.

Finally, Fig. 1d shows an example of a descriptor: a pattern of the electrostatic potential mapped onto a 2D plane, as can be done by artificial neural networks. This is a visual representation of one of the descriptors that are used in the DMSC project.
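How a Kohonen network can map points of a surface onto a 2D grid may be sketched as follows. This is a minimal, generic self-organizing map, not the program used to produce Fig. 1d: the grid size, the learning schedule, the function names and the artificial "surface" (random points on a unit sphere whose z coordinate stands in for the electrostatic potential) are assumptions chosen for illustration.

```python
import math
import random


def winner(weights, point):
    """Grid cell whose weight vector is closest to the input point."""
    return min(weights, key=lambda cell: math.dist(weights[cell], point))


def train_kohonen_map(points, rows=6, cols=6, epochs=10, seed=0):
    """Fit a rows x cols grid of neurons to a set of 3D points."""
    rng = random.Random(seed)
    dim = len(points[0])
    weights = {(r, c): [rng.uniform(-1.0, 1.0) for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total = epochs * len(points)
    step = 0
    for _ in range(epochs):
        for point in rng.sample(points, len(points)):  # random presentation order
            remaining = 1.0 - step / total
            rate = 0.5 * remaining                      # shrinking learning rate
            radius = 1.0 + 0.5 * max(rows, cols) * remaining
            win = winner(weights, point)
            for (r, c), w in weights.items():
                grid_dist = math.hypot(r - win[0], c - win[1])
                if grid_dist <= radius:
                    # neurons near the winner are pulled towards the point
                    pull = rate * math.exp(-grid_dist ** 2 / (2.0 * radius ** 2))
                    for d in range(dim):
                        w[d] += pull * (point[d] - w[d])
            step += 1
    return weights


def project_property(weights, points, values):
    """Average the property values of all points that land in each cell."""
    sums, counts = {}, {}
    for point, value in zip(points, values):
        cell = winner(weights, point)
        sums[cell] = sums.get(cell, 0.0) + value
        counts[cell] = counts.get(cell, 0) + 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Hypothetical "surface": points on a unit sphere, property = z coordinate.
rng = random.Random(42)
surface = []
for _ in range(200):
    x, y, z = (rng.gauss(0.0, 1.0) for _ in range(3))
    norm = math.sqrt(x * x + y * y + z * z)
    surface.append((x / norm, y / norm, z / norm))
potential = [p[2] for p in surface]

som = train_kohonen_map(surface)
pattern = project_property(som, surface, potential)  # {(row, col): mean value}
```

The resulting dictionary assigns a mean property value to each occupied grid cell; coloring the cells by this value gives a 2D pattern of the kind described above.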

Let us compare again the original 2D structure in Fig. 1a (as it can be provided by a structure editor and, thus, by a user of DMSC) with the descriptor in Fig. 1d. The information content of the descriptor is much higher than that of the 2D structure. Even if the descriptor seems to be an unusual kind of chemical structure, it is nothing other than a molecule in another chemical language. Additionally, this descriptor is much easier to use in computational processing.

By transforming a query structure and all of the molecules of a data set into their descriptor, it becomes possible to search in a data set not only for structures, but also for properties.
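A descriptor-based search of this kind can be sketched in a few lines. The example below ranks the entries of a data set by their Euclidean distance to a query descriptor; the distance measure, the function name rank_by_similarity and the descriptor values are assumptions for illustration, and other similarity measures could be used in the same way.

```python
import math


def rank_by_similarity(query, dataset):
    """Return (name, distance) pairs, most similar descriptor first."""
    ranked = []
    for name, descriptor in dataset.items():
        if len(descriptor) != len(query):
            raise ValueError("descriptors must have the same length")
        ranked.append((name, math.dist(query, descriptor)))
    return sorted(ranked, key=lambda item: item[1])


# Hypothetical fixed-length descriptors, for illustration only.
query = [0.1, 0.8, 0.3, 0.0]
dataset = {
    "compound_a": [0.9, 0.1, 0.0, 0.4],
    "compound_b": [0.1, 0.7, 0.3, 0.1],
    "compound_c": [0.5, 0.5, 0.5, 0.5],
}

for name, distance in rank_by_similarity(query, dataset):
    print(f"{name}: {distance:.3f}")
```

Since the descriptor encodes properties as well as structure, the closest hits are the compounds with the most similar property pattern, not necessarily those with the most similar 2D formula.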

The Outline of Data Mining Service Chemistry

As Data Mining can be defined as a process of exploration of large amounts of data in search for consistent patterns, correlations and other systematic relationships between variables, the tasks of the Data Mining Service Chemistry can be divided into the following sections:

Search and Processing of Raw Data. The most convenient raw data for molecules are connection tables (CT). A connection table simply consists of a list of the connectivity of the atoms, i.e., basic information about each atom (atom symbol or number) and its bonds to other atoms. From the basic information in a bond list, other types of information can be calculated or derived. Starting with a connection table, the two- and three-dimensional models of a molecule can be calculated. These models are the basis for the calculation of secondary data, such as physicochemical properties of the atoms in a molecule. These properties are calculated for the specific spatial arrangement of the atoms and, thus, differ not only from atom to atom but also from molecule to molecule.
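A connection table and the derivation of further information from it can be sketched as follows. The class layout, the method names and the small table of standard valences are assumptions for illustration; real connection table formats carry more information (charges, stereo flags, coordinates).

```python
from dataclasses import dataclass

# Simplifying assumption: one standard valence per element.
STANDARD_VALENCE = {"C": 4, "N": 3, "O": 2}


@dataclass
class ConnectionTable:
    atoms: list  # element symbols; the list index is the atom number
    bonds: list  # (atom_a, atom_b, bond_order) tuples

    def neighbors(self, index):
        """Atoms bonded to the given atom, derived from the bond list."""
        found = []
        for a, b, _order in self.bonds:
            if a == index:
                found.append(b)
            elif b == index:
                found.append(a)
        return sorted(found)

    def implicit_hydrogens(self, index):
        """Hydrogens needed to fill the atom's standard valence."""
        used = sum(order for a, b, order in self.bonds if index in (a, b))
        return STANDARD_VALENCE[self.atoms[index]] - used


# Ethanol without explicit hydrogens: C-C-O
ethanol = ConnectionTable(atoms=["C", "C", "O"], bonds=[(0, 1, 1), (1, 2, 1)])
hydrogens = [ethanol.implicit_hydrogens(i) for i in range(len(ethanol.atoms))]
print(ethanol.neighbors(1), hydrogens)
```

For ethanol the bond list alone yields three, two and one hydrogen atoms on the three heavy atoms, i.e. the molecular formula C2H6O, which is a simple case of information derived from the connectivity.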

Calculation of Descriptors. Using the three-dimensional arrangement of the atoms in a molecule together with the physicochemical properties of these atoms, it is possible to calculate a descriptor of the molecule. Here a descriptor is defined as a mathematical vector of fixed length that describes a molecule including its properties.

Analysis by Statistical Methods. Because of their fixed length, descriptors are valuable representations of molecules for use in statistical calculations. The most important methods for chemical descriptors will be linear and non-linear regression, correlation methods and correlation matrices. With the aid of these tools, similarities or diversities in structure/property data can easily be found.
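As a minimal sketch of one of these tools, the following example builds a correlation matrix from a small set of descriptors using the Pearson correlation coefficient. The function names and the descriptor values are assumptions for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally sized vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0.0 or dev_y == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (dev_x * dev_y)


def correlation_matrix(descriptors):
    """Pairwise correlations between all descriptors of a data set."""
    return [[pearson(a, b) for b in descriptors] for a in descriptors]


# Hypothetical descriptors of three molecules, for illustration only.
descriptors = [
    [0.1, 0.8, 0.3, 0.0, 0.5],
    [0.2, 0.9, 0.4, 0.1, 0.6],
    [0.9, 0.1, 0.6, 0.8, 0.2],
]

for row in correlation_matrix(descriptors):
    print(" ".join(f"{value:6.3f}" for value in row))
```

The length check in pearson makes explicit why descriptors of fixed length are required: the coefficient is only defined for two vectors of the same size.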

Analysis by Artificial Neural Networks. If statistical methods fail to solve a chemical problem, artificial neural networks can be used, in particular for analyzing non-linear and complex relationships between descriptors. Some of these neural networks are self-adaptive, auto-associative systems, i.e. they learn about the relationships within a data set by processing a set of training data. The important tasks for neural networks in Data Mining are:

Optimization by Genetic Algorithms. Genetic Algorithms are computer algorithms used for the optimization of fuzzy data. They are modelled on biological genetic processes, i.e. recombination, mutation and "natural" selection of data.
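The three operations named above can be sketched with a small, generic genetic algorithm on bit strings. Everything specific in this example is an assumption for illustration: the bit-string encoding, tournament selection, one-point crossover, keeping the best individual unchanged, and the stand-in fitness function that simply counts the bits set to 1. In a chemical application the bits might, for instance, select components of a descriptor, and the fitness would measure the quality of the resulting model.

```python
import random


def evolve(fitness, n_bits=16, pop_size=20, generations=40,
           mutation_rate=0.05, seed=1):
    """Maximize fitness over bit strings; returns (best, best-per-generation)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_generation = [population[0][:]]  # keep the best individual
        while len(next_generation) < pop_size:
            # selection: the fitter of three random individuals becomes a parent
            parent_1 = max(rng.sample(population, 3), key=fitness)
            parent_2 = max(rng.sample(population, 3), key=fitness)
            # recombination: one-point crossover
            cut = rng.randint(1, n_bits - 1)
            child = parent_1[:cut] + parent_2[cut:]
            # mutation: flip each bit with a small probability
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            next_generation.append(child)
        population = next_generation
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


def count_ones(bits):
    """Stand-in fitness function (hypothetical): number of bits set to 1."""
    return sum(bits)


best, history = evolve(count_ones)
print(best, history[0], history[-1])
```

Because the best individual of each generation is carried over unchanged, the best fitness recorded in history can never decrease from one generation to the next.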

Expert Systems. While the previously mentioned systems work automatically, Expert Systems depend on the analysis and interaction of an expert. Expert Systems are the final step in data analysis and can be provided for reaction prediction and synthesis planning.

Visualization of data and interactivity. One of the tasks of a Data Mining Service must be to visualize the revealed correlations and patterns and to validate the chosen parameters by applying the methods to new subsets of data. The visualization of complex relationships between individual data and data sets is of profound importance to the usability of the DMSC project. While it is easy to present statistical data in graphs and correlation matrices, complex results of data analysis must be presented in an interactive 3D environment. The Virtual Reality Modelling Language (VRML) will be used as a mining tool for data sets, in which the results of data analysis are linked to the representations of the source data as well as to representations of related data within the same graph. Additionally, preformatted reports will be implemented for a clear representation of the results.

Archiving. Effective access to raw data saved in previous experiments is important for repeating experiments and for avoiding unnecessary calculations. Information that has proved useful can be saved and retrieved for later use with other data sets. For this purpose, raw data (input data), descriptors and query data will be archived separately. Separating the primary structure data from the secondary data that describe the structures and from the query data of a user allows simple retrieval of a previous experiment as well as the reuse of already calculated descriptors.

The following figure shows a scheme of the modular system of the DMSC project.

Figure 2. Scheme of the data mining interface. (ES Expert System; SS Statistical System; NN Neural Networks; GA Genetic Algorithm)

The Developers

For 25 years, the research team of Prof. J. Gasteiger at the Computer-Chemie-Centrum of the University of Erlangen-Nürnberg has been developing programs that help organic chemists solve their problems in research and development. Some of these programs will be used for the Data Mining Service, among them the knowledge-based expert systems WODCA for synthesis planning and EROS for reaction prediction, the 3D structure generator CORINA, the calculator PETRA and the descriptor generator ARC.

The chemical information system CACTVS provides a series of tools for the processing of structural information. The modules of CACTVS that will be used in the development of the Data Mining Service are a 2D structure editor and browser, tools for the graphical representation of spectra, search engines for structure and substructure searches, and about 45 chemical file converters.

Prof. Dr. Jure Zupan of the National Institute of Chemistry in Ljubljana, Slovenia, co-author of the book about Artificial Neural Networks in Chemistry and Drug Design, is working with his team on a further descriptor for the mathematical representation of chemical structures, which will be implemented in the project. In the team of Prof. Ana Lobo of the Departamento de Química of the Universidade Nova de Lisboa, Portugal, Dr. Joao Aires de Sousa is working on a descriptor for the representation of chirality that is able to distinguish between the above-mentioned enantiomers of a molecule.

Selected Literature

The following literature compilation gives an overview of the scientific background of the project. A complete list of the approximately 200 scientific publications of the research team can be found at
An overview of the methods is also given on the website of the above-mentioned book at

Secondary data systems

  1. CORINA: Automatic Generation of High-Quality 3D-Molecular Models for Application in QSAR
    J. Sadowski, M. Wagener, J. Gasteiger, "10th European Symposium on Structure-Activity Relationships: QSAR and Molecular Modelling", Editor: F. Sanz, Prous Science Publishers, 1994
  2. Empirical Methods for the Calculation of Physicochemical Data of Organic Compounds
    J. Gasteiger, in: "Physical Property Prediction in Organic Chemistry", Editors: C. Jochum, M. G. Hicks, J. Sunkel, Springer-Verlag, Heidelberg, 1988, pp. 119-138


  1. Chemical Information in 3D Space
    J. Gasteiger, J. Sadowski, J. Schuur, P. Selzer, L. Steinhauer, V. Steinhauer, J. Chem. Inf. Comput. Sci., 36, 1030-1037 (1996)
  2. 3D-MoRSE Code - A Method for Coding the 3D Structure of Molecules
    J. Schuur, J. Gasteiger, in: "Software-Entwicklung in der Chemie 10", Editor: J. Gasteiger, GDCh, Frankfurt/Main, 1996, pp. 67-80
  3. Overcoming the Limitations of a Connection Table Description: A Universal Representation of Chemical Species
    S. Bauerschmidt, J. Gasteiger, J. Chem. Inf. Comput. Sci., 37, 705-714 (1997)
  4. 3D Structure Descriptors for Biological Activity
    J. Gasteiger, S. Handschuh, M. C. Hemmer, T. Kleinöder, C. H. Schwab, A. Teckentrup, J. Sadowski, M. Wagener in "Molecular Modelling and Prediction of Bioactivity", S. Jorgensen, K. Gundertofte (Eds.), Plenum Press, Sept. 1998, in press

Expert Systems

  1. EROS - A Computer Program for Generating Sequences of Reactions
    J. Gasteiger, C. Jochum, Topics Curr. Chem. 74, 93-126 (1978)
  2. The WODCA System
    J. Gasteiger, W. D. Ihlenfeldt, in: "Software-Development in Chemistry 4", Editor: J. Gasteiger, Springer-Verlag, Heidelberg, 1990, 57-65

Artificial Neural Networks / Genetic Algorithms

  1. The Determination of Maximum Common Substructures by a Genetic Algorithm: Application in Synthesis Design and for the Structural Analysis of Biological Activity.
    M. Wagener, J. Gasteiger, Angew. Chem. Int. Ed. Engl., 33, 1189-1192 (1994)
  2. Ähnlichkeitsanalyse biologisch aktiver Verbindungen unter Einsatz genetischer Algorithmen und neuronaler Netze
    J. Gasteiger, J. Sadowski, M. Wagener, P. Levi, A. Zell, H. Bauknecht, T. Will, G. Klebe, T. Mietzner, F. Weber, G. Barnickel, S. Anzali, M. Krug, BMBF Statusseminar "Bioinformatik", Editors: G. Wolf, R. Schmidt, M. van der Meer, Projektträger DLR, Berlin, 1995, pp. 79-103
  3. The Use of Self-Organizing Neural Networks in Drug Design
    S. Anzali, J. Gasteiger, U. Holzgrabe, J. Polanski, J. Sadowski, A. Teckentrup, M. Wagener, in: "3D QSAR in Drug Design - Volume 2", Editors: H. Kubinyi, G. Folkers, Y. C. Martin, Kluwer/ESCOM, Dordrecht, NL, 1998, pp. 273-299

Data Mining / Chemical Information Systems

  1. Treasure Hunt: On the Track of Reusable Structural Information in the World Wide Web
    W. D. Ihlenfeldt, J. Gasteiger, in: "Software-Entwicklung in der Chemie 10", Editor: J. Gasteiger, GDCh, Frankfurt/Main, 1996, pp. 403-413
  2. German Experts' View and Ideas about Information on the Internet
    K. Voigt, J. Gasteiger, W. D. Ihlenfeldt, B. Page, K. Specht, W. Umstätter, Online & CDROM Review, 20, 125-132 (1996)
  3. Database Mining: From Information to Knowledge
    J. Gasteiger, Proceed. 1997 Chem. Inf. Conf., Editor: H. Collier, Infonortics Ltd., Calne, UK, 1997, pp. 1-6