Robert D. Clark
Non-parametric Methods for Identifying the Robust QSAR Models
Echeminfo

The vast majority of Quantitative Structure-Activity Relationship (QSAR) models employed in pharmaceutical and agrochemical discovery and development are based on some subset of the (very large) number of numerical descriptors available. Unfortunately, the datasets of interest share some problematic attributes. Firstly, the descriptors used are indirect to a greater or lesser degree, so that the “true” independent variables behind the model are latent – i.e., they are approximated by some linear (or nonlinear) combination of the descriptors at hand. Secondly, the descriptors used are usually intrinsically interdependent in a statistical sense, though that interdependence may be of rather complex form, especially if those descriptors have been pre-selected or constructed to not be linearly correlated.

Thirdly, the underlying “universe” of chemical structures is inherently discrete, biased and non-redundant. The non-redundancy means that a structure will almost never appear twice in any “real world” data set. The bias reflects the fact that the likelihood of any particular structure appearing in a training set depends on the other structures in that training set – something that might be called the “chemical series problem.”

In classical parametric perturbation analysis, a small amount of independent and identically distributed (IID) noise is added to each parameter in turn, and the response of the model's performance to that noise is monitored. Unfortunately, the characteristics of the data cited above (latent variables, interdependent descriptors, and a discrete, biased, non-redundant structure space) combine to make this approach poorly suited to most QSAR models. Alternative non-parametric perturbation methods – i.e., methods that make no assumptions about the underlying intercorrelation structure of the data – will be described. These include applications in descriptor space (molecular holograms (HQSAR) and comparative molecular field analysis (CoMFA)) as well as the more broadly applicable response-space method known as progressive scrambling.
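
To make the idea of perturbing in response space concrete, the following is a minimal sketch only, and is not the progressive scrambling procedure itself. It assumes a simplified scheme in which responses are sorted, split into blocks of similar value, and permuted within each block, so that fewer (larger) blocks give a stronger perturbation. The function names blockwise_scramble and fit_r2, the synthetic data, and the ordinary least-squares stand-in model are all hypothetical choices made for illustration.

```python
import numpy as np

def blockwise_scramble(y, n_blocks, rng):
    """Permute responses only within blocks of similar (sorted) value.

    Simplified illustration: many small blocks perturb y slightly,
    a single block is a full random scramble of the responses.
    """
    y = np.asarray(y, dtype=float)
    order = np.argsort(y)                 # indices from lowest to highest response
    y_scrambled = y.copy()
    for block in np.array_split(order, n_blocks):
        # Reassign the responses of this block among its own members.
        y_scrambled[block] = y[rng.permutation(block)]
    return y_scrambled

def fit_r2(X, y):
    """In-sample R^2 of a least-squares fit with intercept (stand-in model)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Synthetic data (assumption): 60 "compounds", 3 descriptors, noisy linear response.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.3, size=60)

# Increase the perturbation by using fewer, larger blocks, and record how
# the fit to the scrambled responses changes.
results = []
for n_blocks in (30, 10, 3, 1):
    y_s = blockwise_scramble(y, n_blocks, rng)
    r_yy = np.corrcoef(y, y_s)[0, 1]      # how much of the original response survives
    results.append((n_blocks, r_yy, fit_r2(X, y_s)))

for n_blocks, r_yy, r2 in results:
    print(f"blocks={n_blocks:2d}  corr(y, y_scrambled)={r_yy:6.3f}  R^2={r2:6.3f}")
```

Note that no noise distribution is assumed and the descriptors are never altered: only the assignment of responses to structures is perturbed. A model whose fit to the scrambled responses stays high even as the scrambled responses lose their correlation with the originals would be suspect.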