Leading Edge Predictors for Drug Discovery



     ...Calculation and Prediction

This page covers the following CSLogWS Calculation and Prediction topics:

    CSLogWS Calculation and Prediction

      Calculated Results From CSLogWS Development

      Cross Validation Predicted Results

      External Validation Predicted Results

      CSLogWS Representative Compounds

Development of the CSLogWS Predictor.

The CSLogWS predictor is based on topological structure descriptors and was developed using artificial neural networks.  Neural network analysis was applied to select descriptors and then to optimize the relationship between experimental Log WS values and those calculated by the CSLogWS predictor.  The resulting predictor was cross-validated by the leave-group-out method and then validated on an external test set.

(Please see "Neural Network Analysis" on our Methods page for additional information)

 In each phase of development, a correlation coefficient was calculated as a measure of the quality of the predictor.


R2 gives the correlation between calculated and experimental values for the compounds in the training set.  Every compound in the training set contributed to descriptor selection and predictor development.


Q2 gives the correlation coefficient between experimental values and predicted values from a 10-fold leave-10%-out cross-validation.  The compounds that generate Q2 contribute to descriptor selection, but the predicted values arise from calculations made when the compounds were not part of the training set for predictor development.


Q2valid gives the square of the correlation coefficient between the predicted and experimental values.  The compounds used to generate the Q2valid statistic were not used for either descriptor selection or predictor development.

Several additional statistics are given as a measure of predictor performance:
MAE gives the mean absolute error.
s gives the standard deviation for regression.

RMS gives the root mean square error, used in place of s for validation, where degrees of freedom are undefined.

(Please see "Data Handling and Statistics" on our Methods page for additional information)
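The page does not give formulas for these statistics. As a rough sketch, assuming the usual definitions (squared Pearson correlation for R2, Q2 and Q2valid, and s computed with n - p - 1 degrees of freedom, where the parameter count p is an assumption the caller must supply), they could be computed as shown here. The function name fit_statistics and its arguments are hypothetical and are not part of CSLogWS.

```python
import math


def fit_statistics(experimental, predicted, n_params=None):
    """Return squared correlation, MAE, RMS and (optionally) s.

    n_params is the assumed number of fitted parameters; s is only
    reported when it is given, because the degrees of freedom are
    otherwise undefined.
    """
    n = len(experimental)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")

    mean_x = sum(experimental) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in experimental)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(experimental, predicted))
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient

    residuals = [y - x for x, y in zip(experimental, predicted)]
    sse = sum(e * e for e in residuals)  # sum of squared errors

    stats = {
        "r2": r * r,                                 # R2, Q2 or Q2valid
        "mae": sum(abs(e) for e in residuals) / n,   # mean absolute error
        "rms": math.sqrt(sse / n),                   # root mean square error
    }
    if n_params is not None:
        dof = n - n_params - 1                       # assumed convention
        stats["s"] = math.sqrt(sse / dof)            # std. dev. for regression
    return stats


if __name__ == "__main__":
    # Toy values only, not CSLogWS data.
    print(fit_statistics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5], n_params=1))
```

The same function would serve for all three correlation statistics; only the origin of the predicted values (training fit, cross-validation, or external validation) differs.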

Calculation of Intrinsic Aqueous Solubility

Calculated Results from CSLogWS Predictor Development

A correlation of the CSLogWS (calculated) values with known Log WS (experimental) values over an activity range of 16 orders of magnitude gave the following statistics:

R2 = 0.94

MAE = 0.40 (mean absolute error)

s = 0.55 log units

The results are shown in the plot below.

Calculated Results on 5650 Heterogeneous Compounds

Prediction of Intrinsic Aqueous Solubility

Cross-Validation on 5650 Compounds

Cross-validation testing was conducted by setting up a series of 10 cross-validation test sets (VTS), each containing approximately 565 (~10%) of the 5650 compounds.  The test sets were disjoint: each compound appeared in exactly one VTS.

For each VTS, a new neural network-based QSAR model was developed on the remaining 5085 compounds in the training set and applied to the VTS to predict Log WS.  The process was repeated 10 times, once for each VTS.
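The fold construction described above can be sketched as follows. This is a minimal illustration only: the random shuffling, the function names, and the trivial mean-value model standing in for the neural network are all assumptions, since the page does not say how compounds were assigned to test sets or how the networks were trained.

```python
import random


def make_folds(n_compounds, n_folds=10, seed=0):
    """Split compound indices into disjoint test sets (one per fold)."""
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)  # assumed: random assignment
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(x, y, fit, n_folds=10, seed=0):
    """Predict every compound from a model that never saw it in training."""
    predictions = [None] * len(y)
    for fold in make_folds(len(y), n_folds, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in fold:
            predictions[i] = model(x[i])
    return predictions


def fit_mean(train_x, train_y):
    """Hypothetical placeholder model: always predicts the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda _descriptors: mean


if __name__ == "__main__":
    folds = make_folds(5650)
    # 10 folds of 565 compounds, leaving 5085 for training each time.
    print([len(f) for f in folds], 5650 - len(folds[0]))
```

Any callable that takes training descriptors and values and returns a predictor could replace fit_mean here.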

A correlation of the predicted CSLogWS values with Log WS (experimental) values gave the following statistics:

Q2 = 0.89

MAE = 0.51 log units

s = 0.72 log units

The quality of this result attests to the diversity of the chemical space and the descriptors employed.  The CSLogWS predictor averages the results from all 10 neural nets when predicting for new NCEs.

The results are shown in the plot below.

Predicted Results on 5650 Heterogeneous Compounds
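The averaging over the 10 cross-validation networks mentioned above could look like the sketch below. It assumes a plain unweighted mean and represents each trained network as a Python callable; both are assumptions, and ensemble_predict is a hypothetical name rather than a CSLogWS function.

```python
def ensemble_predict(models, compound):
    """Average the predictions of several models for one compound."""
    if not models:
        raise ValueError("at least one model is required")
    predictions = [model(compound) for model in models]
    return sum(predictions) / len(predictions)  # assumed: unweighted mean


if __name__ == "__main__":
    # Ten toy "networks" that differ only by a constant offset.
    toy_models = [lambda value, k=k: value + 0.1 * k for k in range(10)]
    print(ensemble_predict(toy_models, -3.0))
```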

External Validation of Intrinsic Aqueous Solubility

External Validation on 420 Compounds

The external validation set consisted of 377 compounds that were neutral at pH 7.4 and 43 compounds that were considered neutral at the measured pH, based on calculated pKa values, but were within 1 pH unit of being charged.  All experimental aqueous solubility values came from a compendium (1) built from published experimental results.  The average MW of the 420 compounds was 508.8, with a range of 179.2 to 1035.2.  The dataset was reasonably diverse (see CSLogWS Representative Compounds), ranging from highly extended structures to small semi-rigid structures, as seen below in compounds 266 and 1180, respectively.

ID 266

ID 1180

(1) Mario Lobell and Stephen East, OSI Pharmaceuticals, Watlington Road, Oxford OX4 6LT

Statistical Results from External Validation

Intrinsic solubility (WSo) was predicted for 420 compounds, resulting in the following validation statistics:

Q2valid = 0.84

MAE = 0.47 log units

RMS = 0.61 log units

With Q2valid = 0.84, the goodness of the prediction can be considered excellent.

A scatter plot, a casewise plot, and a Q2 histogram of intrinsic solubilities (WSo) are shown below.

Predicted Results on 420 External Validation Compounds

External Validation Casewise Plot

Several features of the casewise plot illustrate the strength of the CSLogWS predictor.

95%, or 398 of the 420 predictions, fall within the limit defined by ± 2 RMS (1.22 log units).
No prediction falls outside 3 RMS units (corollary to having no calculated values outside of 3 standard deviations).
The compounds in the validation test set show relatively even distribution through the activity space.
The predictor performs equally well through the range of activity covered by the external validation test set.
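The ± 2 RMS and 3 RMS checks in the list above can be reproduced with a short calculation. The sketch below assumes the residual is simply predicted minus experimental; casewise_summary is a hypothetical helper and the sample values are toy data, not the 420 validation compounds.

```python
import math


def casewise_summary(experimental, predicted):
    """Count predictions whose error lies within 2 and 3 RMS units."""
    residuals = [p - e for e, p in zip(experimental, predicted)]
    n = len(residuals)
    rms = math.sqrt(sum(r * r for r in residuals) / n)
    within_2 = sum(1 for r in residuals if abs(r) <= 2 * rms)
    within_3 = sum(1 for r in residuals if abs(r) <= 3 * rms)
    return {
        "rms": rms,
        "within_2_rms": within_2,
        "within_3_rms": within_3,
        "fraction_within_2_rms": within_2 / n,
    }


if __name__ == "__main__":
    # Arithmetic behind the reported figures.
    print(round(2 * 0.61, 2))          # width of the 2 RMS limit in log units
    print(round(100 * 398 / 420, 1))   # percentage of predictions inside it

    # Toy data with one large error among ten predictions.
    print(casewise_summary([0.0] * 10, [0.1] * 9 + [1.0]))
```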

The scatter and casewise plots complement each other.  The scatter plot shows the straight-line fit of the data, while the casewise plot tracks the pattern of the match between experimental and predicted Log(1/WSo) values.  The quality of the aqueous solubility predictions made by CSLogWS is evident from the histogram above, which uses a bin width of 0.05 to give Q2 as a function of frequency (number of compounds per bin).  Approximately 92% (380 compounds) of the 420 entities were found within the first two bins, with the first bin (i.e., 0.96 to 1) accounting for ~72% of the total population.

The diversity of chemical space represented by these compounds (see CSLogWS Representative Compounds) and the twofold difference in average MW between the external validation set and the training set used to develop CSLogWS attest to the robustness of its predictive ability.

CSLogWS  Representative Compounds

Compounds from the CSLogWS External Validation Test Set

Follow the link below to a set of 40 representative compounds from the 420 used in external validation testing of CSLogWS.  Each structure is shown with its experimental value alongside the predicted Log WSo.

Go to: CSLogWS  Compounds



Copyright © 2003 ChemSilico LLC All Rights Reserved


ChemSilico is a registered trademark of ChemSilico LLC, Tewksbury, MA 01876