Assessing the Importance of Variables in Database Response Models
Bruce Ratner, Ph.D.

The classic approach
for assessing the statistical significance of a variable considered for
model inclusion is the well-known null hypothesis significance testing
procedure, which is based on the reduction in prediction error (actual
response minus predicted response) associated with the variable in
question. The statistical apparatus of the formal testing procedure for
logistic regression analysis consists of the log likelihood function
(LL), the G statistic, degrees of freedom, and the p-value. The
procedure uses this apparatus within a theoretical framework that rests
on weighty and untenable assumptions. From a purist point of view, this
could cast doubt on findings that actually have statistical
significance. Even if findings of statistical significance are accepted
as correct, they may not be of practical importance or have noticeable
value to the study at hand. For the data analyst with a pragmatic
slant, the limitations and lack of scalability inherent in the classic
system cannot be overlooked, especially in big data settings.
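
To make the apparatus concrete, here is a minimal sketch of the formal
test for one candidate variable, assuming a pandas data frame and the
statsmodels and scipy libraries; the function name and arguments are
illustrative, not part of the classic procedure itself.

    import statsmodels.api as sm
    from scipy.stats import chi2

    def g_test(data, response, base_vars, candidate):
        """Likelihood-ratio (G) test for adding one variable to a logistic model."""
        y = data[response]
        X_reduced = sm.add_constant(data[base_vars])             # model without the variable
        X_full = sm.add_constant(data[base_vars + [candidate]])  # model with the variable
        ll_reduced = sm.Logit(y, X_reduced).fit(disp=0).llf      # LL of the reduced model
        ll_full = sm.Logit(y, X_full).fit(disp=0).llf            # LL of the full model
        G = -2.0 * (ll_reduced - ll_full)  # reduction in -2LL due to the variable
        dof = 1                            # one parameter added to the model
        p_value = chi2.sf(G, dof)          # tail area of the chi-square reference
        return G, dof, p_value

The classic rule then declares the candidate statistically significant
when the p-value falls below a preset level, commonly 0.05.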

In contrast, the data mining approach uses the LL units, the G statistic,
and degrees of freedom in an informal data-guided search for variables
that suggest a noticeable reduction in prediction error.
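
A sketch of that informal search, reusing the illustrative g_test
function above: rather than compare p-values against a fixed cutoff,
the analyst ranks the candidates by the size of G per degree of freedom
and judges which reductions are large enough to be worthy of notice.

    def rank_candidates(data, response, base_vars, candidates):
        """Rank candidate variables by G per degree of freedom, largest first."""
        scored = []
        for var in candidates:
            G, dof, _ = g_test(data, response, base_vars, var)  # p-value not used
            scored.append((var, G / dof))  # G per df is comparable across candidates
        # The top of the list holds the variables suggesting the biggest
        # reduction in prediction error, i.e., the most noticeably important.
        return sorted(scored, key=lambda t: t[1], reverse=True)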

One point worth noting is that the informality of the data mining
approach calls for a suitable change in terminology: a result is
declared not statistically significant but worthy of notice, or
noticeably important. In this article I describe the data mining
approach to variable assessment for building database response models.