Non-parametric Methods for Identifying the Robust QSAR Models
The
vast majority of Quantitative Structure-Activity Relationship
(QSAR) models employed in pharmaceutical and agrochemical
discovery and development are based on some subset of the
(very large) number of numerical descriptors available.
Unfortunately, the datasets of interest share some problematic
attributes. Firstly, the descriptors used are indirect to
a greater or lesser degree, so that the “true”
independent variables behind the model are latent –
i.e., they are approximated by some linear (or nonlinear)
combination of the descriptors at hand. Secondly, the descriptors
used are usually intrinsically interdependent in a statistical
sense, though that interdependence may be of rather complex
form, especially if those descriptors have been pre-selected
or constructed so as not to be linearly correlated.
Thirdly, the underlying “universe” of chemical
structures is inherently discrete, biased and non-redundant.
The non-redundancy means that a structure will almost never
appear twice in any “real world” data set. The
bias reflects the fact that the likelihood of any particular
structure appearing in a training set is dependent on the
other structures in that training set – something
that might be called the “chemical series problem.”
In classical parametric perturbation analysis, a small amount
of IID (independent and identically distributed) noise is added
to each parameter in turn, and the response of model
performance to that noise is monitored. Unfortunately, the
data structure characteristics cited above combine to make
this approach more or less unsuitable for most QSAR models.
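
The classical procedure just described can be sketched as follows. This is a minimal illustration only, under several assumptions not stated in the abstract: "parameter" is taken to mean a descriptor column, the model is an ordinary least-squares fit, the noise is Gaussian with a standard deviation proportional to each column's spread, and performance is measured by R² of the unrefitted model on the perturbed descriptors. The function names and the `noise_scale` and `n_repeats` arguments are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ [1, X] (intercept first)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r_squared(coef, X, y):
    """Coefficient of determination of a fixed model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def perturbation_profile(X, y, noise_scale=0.1, n_repeats=50, seed=0):
    """Mean drop in R^2 when IID noise is added to one descriptor at a time."""
    rng = np.random.default_rng(seed)
    coef = fit_ols(X, y)
    baseline = r_squared(coef, X, y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        sd = noise_scale * X[:, j].std()   # assumed noise level for column j
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] += rng.normal(0.0, sd, size=len(X))
            drops[j] += baseline - r_squared(coef, Xp, y)
        drops[j] /= n_repeats
    return baseline, drops
```

Note that the sketch perturbs each column independently of all the others, which is precisely the assumption that interdependent descriptors violate.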
Alternative nonparametric perturbation methods – i.e.,
methods that make no assumptions about the underlying intercorrelation
structure of the data – will be described. These include
applications in descriptor space (molecular holograms (HQSAR)
and comparative molecular field analysis (CoMFA)) as well
as the more broadly applicable response-space method known
as progressive scrambling.
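
To give a sense of what perturbation in response space can look like, here is a simplified sketch of partial response scrambling. It is not the progressive scrambling procedure itself, whose details are not given in this abstract. The assumptions are: an ordinary least-squares model, leave-one-out q² as the performance measure (computed from the hat matrix), and perturbation by permuting the responses of a randomly chosen fraction of the compounds. The names `loo_q2` and `scrambling_curve` and all arguments are hypothetical.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated q^2 for an ordinary least-squares model."""
    A = np.column_stack([np.ones(len(X)), X])
    H = A @ np.linalg.pinv(A)                  # hat matrix
    resid = y - H @ y
    # For OLS, the leave-one-out residual is e_i / (1 - h_ii).
    press = np.sum((resid / (1.0 - np.diag(H))) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def scrambling_curve(X, y, fractions=(0.0, 0.25, 0.5, 0.75, 1.0),
                     n_repeats=20, seed=0):
    """For each fraction, return (fraction, mean r(y, y_scrambled), mean q^2)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    curve = []
    for frac in fractions:
        k = int(round(frac * n))
        r_vals, q2_vals = [], []
        for _ in range(n_repeats):
            y_s = y.copy()
            if k >= 2:
                # Shuffle the responses among k randomly chosen compounds.
                idx = rng.choice(n, size=k, replace=False)
                y_s[idx] = y[rng.permutation(idx)]
            r_vals.append(np.corrcoef(y, y_s)[0, 1])
            q2_vals.append(loo_q2(X, y_s))
        curve.append((frac, float(np.mean(r_vals)), float(np.mean(q2_vals))))
    return curve
```

Because only the responses are shuffled, the descriptor matrix and whatever intercorrelation structure it carries are left untouched, so no assumption about that structure is needed. A model whose q² falls off steadily as the scrambled responses decorrelate from the originals is behaving as one would hope; a q² that stays high under heavy scrambling would suggest chance correlation.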