
ChemSilico Methods


This page describes the ChemSilico methods. Select a topic from the list below to go to the section of interest.


    ChemSilico Methods

      Data Selection

      Neural Network Analysis (NNA)

      Cross-Validation and Final Predictors

      Explanation of Data Handling and Statistics

Data Selection:


Selection of data is difficult to define in a generic sense, since several data selection techniques were employed across the many different datasets used to construct the ChemSilico predictors. In general, however, after removal of duplicated compounds within large datasets:

  • Noisy data are removed by a first-pass neural net convergence test that uses all 519 descriptors to fit the data with R2 > 0.95.  Compounds with absolute errors greater than 2.5 times the MAE for the dataset are removed.  This elimination normally amounts to less than 5% of the total in large datasets (> 2000 compounds).

The removal process does not completely eliminate inaccuracies or inconsistencies in reported values.  That is not feasible.  Nonetheless, the resultant dataset is well described by the set of descriptors, which are used as inputs to the neural net for the training/test process.
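
As an illustration of this filtering step, the short Python sketch below removes compounds whose first-pass fitting error exceeds 2.5 times the MAE; the function name and default threshold are illustrative only, not ChemSilico's actual code.

    import numpy as np

    def remove_noisy_compounds(y_exp, y_fit, k=2.5):
        """Drop compounds whose absolute error from the first-pass fit
        exceeds k times the mean absolute error (MAE) of that fit.

        y_exp : 1-D array of experimental property values
        y_fit : 1-D array of values fitted by the first-pass neural net
        Returns a boolean mask marking the compounds to keep.
        """
        abs_err = np.abs(np.asarray(y_exp) - np.asarray(y_fit))
        mae = abs_err.mean()
        return abs_err <= k * mae   # reject errors > 2.5 x MAE by default

    # keep = remove_noisy_compounds(y_exp, y_fit)
    # X_clean, y_clean = X[keep], y_exp[keep]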

Neural Network Analysis (NNA):


Artificial neural networks (ANNs) can ferret out nonlinear dependencies between input variables and compound properties that classical statistical methods and multivariate linear analysis cannot.  Both linear and nonlinear relationships can be established for a large dataset of compounds using neural network analysis.  There are many different ANN architectures; the most common is the back-propagation network, which is used in ChemSilico QSAR predictor modeling.


The methodology for constructing robust NN QSAR models is to rank the descriptors, reduce their number, and maximize the number of compounds per neural network weight, while at the same time achieving a maximal R2 value (a measure of the goodness of fit) for the training set and a maximal Q2 value (a measure of the goodness of prediction) for the validation sets, without overfitting the data.


Proprietary systems were used to "prune" the molecular descriptor set.  As the NN QSAR models are developed, they are continually tested for predictive accuracy.  This is accomplished using a portion of the data, randomly selected as a "withheld set," that is not used to select the final algorithm.  At each iteration the least important or irrelevant inputs are removed so that only the fittest survive this exhaustive process.  The pruning is iterative and reduces the inputs to an essential set that must perform well on the withheld set before moving on to the final testing phase.  In this manner, the number of inputs is reduced from 515 to a more manageable value between 7 and 70, and this variable reduction is achieved with high confidence.  The final number of molecular indices depends on the size of the dataset.  Ultimately, the emphasis in producing the final predictive QSAR model is on maximizing the Q2 value.
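
The pruning systems themselves are proprietary; the following is only a rough sketch of the general idea, using scikit-learn's MLPRegressor and permutation importance as stand-ins, with all names, architectures, and settings assumed rather than taken from ChemSilico.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.metrics import r2_score

    def prune_descriptors(X_train, y_train, X_withheld, y_withheld,
                          target_n=70, drop_per_round=10):
        """Iteratively discard the least informative descriptor columns,
        checking each reduced set against the withheld compounds.
        Returns the indices of the surviving descriptors."""
        keep = np.arange(X_train.shape[1])
        while len(keep) > target_n:
            net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                               random_state=0)
            net.fit(X_train[:, keep], y_train)
            q2 = r2_score(y_withheld, net.predict(X_withheld[:, keep]))
            # Rank the surviving descriptors by permutation importance on the
            # withheld set and drop the weakest ones this round.
            imp = permutation_importance(net, X_withheld[:, keep], y_withheld,
                                         n_repeats=5, random_state=0)
            order = np.argsort(imp.importances_mean)   # least important first
            drop = min(drop_per_round, len(keep) - target_n)
            keep = keep[order[drop:]]
            print(f"{len(keep)} descriptors retained, withheld-set Q2 = {q2:.3f}")
        return keep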


Validation, the resultant Q2 of a model when applied to a dataset not used to construct it, is the principal statistical parameter used to assess the predictive capability of a QSAR model.  A well-known problem in QSAR model development is the algorithmic bias that arises from the specific structural characteristics of the compounds in a given training set: the model may work well within the chemical-descriptor space from which it was built, but not outside that space.  It must therefore be demonstrated that the chemical-descriptor space used is sufficiently broad to predict the biopharmaceutical properties of new chemical entities with reasonable accuracy.  Enhanced classical cross-validation techniques have been employed for all ChemSilico predictors whose datasets are of sufficient size (>800 compounds).  CSlogP, CSlogWS, CSGenoTox, CSpKa, CSBBB, and CSPB have undergone cross-validation.  The Q2 for CSLogD comes from an external validation set.  Although not all CSpKa predictors for ionizable groups underwent cross-validation, owing to dataset size limitations, all the major pKa groups (CO2H, ROH, N1R, ArN, N3R) were cross-validated.

Cross-Validation and Final Predictors:


R2 (the goodness of fit) is inflationary and approaches unity (1.0) as the number of variables (network inputs) increases.  Q2 (the goodness of prediction), in contrast, is not inflationary: it reaches a plateau and then declines as the complexity of the QSAR model increases.  ChemSilico predictors are finalized at a maximal Q2 value with a minimal number of variables.  Enhanced cross-validation and the final biopharmaceutical property predictor are interrelated as follows:

  1. The dataset is repeatedly split into either 5 or 10 folds such that the compounds in any fold are unique, not contained in any other fold.

  2. A new NN QSAR model is built from the remaining 80% or 90% of the compounds in the reduced dataset and then used to predict the biopharmaceutical properties of the compounds in the 10% or 20% cross-validation fold.  The molecular descriptors used in the new model come from the initial training-set pool; however, not all descriptors are used in the final model.  Hence this is an enhanced cross-validation technique, not simply a new set of weights assigned to the same descriptors.

  3. The process is repeated for each fold, so either 5 or 10 QSAR models are built; the number of models depends on the initial size of the dataset.  In this manner, the possibility of bias from a single random split of the data is eliminated, and all the experimental property data contribute to the final predictive accuracy through an equal contribution from each model.

  4. In the final ADMET property predictor, a new compound is processed through these NN-based models.  The individual results from all models are averaged, and the result is reported as the predicted value. 
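
A minimal Python sketch of this fold-and-average scheme follows, again using scikit-learn as a stand-in for the proprietary networks; fold count, network architecture, and function names are illustrative assumptions.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.neural_network import MLPRegressor

    def build_fold_ensemble(X, y, n_folds=10, seed=0):
        """Train one network per fold (on the other folds' compounds) and
        collect the held-out predictions from which Q2 can be computed."""
        models, cv_pred = [], np.empty(len(y), dtype=float)
        folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
        for train_idx, test_idx in folds.split(X):
            net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                               random_state=seed)
            net.fit(X[train_idx], y[train_idx])
            cv_pred[test_idx] = net.predict(X[test_idx])   # held-out predictions
            models.append(net)
        return models, cv_pred

    def predict_new_compound(models, X_new):
        """Final predictor: average the outputs of all fold models."""
        return np.mean([m.predict(X_new) for m in models], axis=0)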

Explanation of Data Handling and Statistics:


Correlation Coefficient:

The correlation between experimental values and those generated during various phases of the modeling process is assessed using the Pearson product-moment correlation coefficient, defined below.

X  is the experimental (observed) value
Y  is the value generated by the predictor
n  is each individual observation
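
In standard notation, consistent with the definitions above (with n indexing the individual observations), the Pearson coefficient is:

    r = \frac{\sum_{n}\left(X_n - \bar{X}\right)\left(Y_n - \bar{Y}\right)}{\sqrt{\sum_{n}\left(X_n - \bar{X}\right)^2}\;\sqrt{\sum_{n}\left(Y_n - \bar{Y}\right)^2}}

where \bar{X} and \bar{Y} are the means of the experimental and predicted values over the set.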

The Pearson equation is applied to sets of X and Y values arising from three different computational environments.  These environments differ in the extent to which the compounds behind the Y values contributed to the development of the model that generates those values.

R2: the square of the correlation coefficient between the calculated and experimental values, derived from calculated results.  All of the compounds that contribute to R2 were used both in variable selection and in the generation of the model.

Q2: the square of the correlation coefficient for cross-validation between the calculated and experimental values, derived from predicted results.  The compounds that contribute to Q2 were used in variable selection, but not in the generation of the model.

Q2val: the square of the correlation coefficient for external validation between the predicted and experimental values, derived from predicted results.  The compounds that contribute to Q2val were used in neither variable selection nor the generation of the model.

Mean Absolute Error (MAE)

The mean absolute error statistic is calculated for both the calculated and predicted (validation) results.
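
In standard notation, with X the experimental value, Y the predicted value, i the ith compound, and N the number of compounds (as defined under Additional Statistics below):

    \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|X_i - Y_i\right|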

Additional Statistics

A number of additional statistics are calculated to help define the quality of the ChemSilico family of predictors.

X  is the experimental (observed) value
Y  is the value generated by the predictor
i  is the i th compound
N  is the number of compounds in the dataset

Standard Deviation (s)

The standard deviation is calculated for regression when the number of degrees of freedom is known.

X  is the experimental (observed) value
Y  is the value generated by the predictor
i  is the ith compound
Ndf  is N - (number of regression variables) - 1
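
In standard notation, consistent with the definitions above:

    s = \sqrt{\frac{\sum_{i=1}^{N}\left(X_i - Y_i\right)^2}{N_{df}}}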

Average Relative Error (RAE)

The average relative error gives the average of the absolute error expressed as a percent of the experimental value.

X  is the experimental (observed) value
Y  is the value generated by the predictor
i  is the ith compound
N  is the number of compounds in the dataset
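
A form consistent with the description above (the absolute error expressed as a percent of the experimental value, averaged over the dataset) is:

    \mathrm{RAE} = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{X_i - Y_i}{X_i}\right|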

Root Mean Square (RMS)

The root mean square is a corollary statistic to s and is calculated for validation where the number of degrees of freedom is undefined.

X  is the experimental (observed) value
Y  is the value generated by the predictor
i  is the ith compound
N  is the number of compounds in the dataset
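
In standard notation, consistent with the definitions above:

    \mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_i - Y_i\right)^2}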
