Statistics versus Machine Learning: A Significant Difference for Database Response Modeling

Bruce Ratner, Ph.D.

The regnant statistical paradigm for database response modeling is this: the data analyst fits the data to the presumedly true logistic regression model (LRM), whose form (equation) states that the log of the odds of response is the sum of weighted predictor variables. The predictor variables are determined by a mixture of well-established variable selection methods and the will of the data analyst to re-express the original variables and construct new ones (data mining). The weights, better known as the regression coefficients, are determined by the pre-programmed machine-crunching method of calculus. The purpose of this article is to show a significant difference for database response modeling when implementing the antithetical machine learning paradigm: the data suggest the “true” model form, as the machine learning process acquires knowledge of the form without being explicitly programmed.
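As a minimal sketch of the LRM form just described, the following Python computes the log of the odds of response as a weighted sum of predictor variables and then converts it to a probability. The coefficients, the intercept, and the two predictor values are hypothetical numbers chosen only for illustration; they do not come from any fitted model in this article.

```python
import math

def logit(weights, intercept, x):
    """Log of the odds of response: intercept plus the weighted sum of predictors."""
    return intercept + sum(w * xi for w, xi in zip(weights, x))

def response_probability(weights, intercept, x):
    """Invert the logit to obtain the predicted probability of response."""
    return 1.0 / (1.0 + math.exp(-logit(weights, intercept, x)))

# Hypothetical regression coefficients and one individual's predictor values.
weights = [0.8, -0.5]
intercept = -1.0
individual = [2.0, 1.0]

log_odds = logit(weights, intercept, individual)              # about 0.1
prob = response_probability(weights, intercept, individual)   # about 0.525

print(f"log-odds = {log_odds:.3f}, probability = {prob:.3f}")
```

In the statistical paradigm the analyst presumes this equation form in advance; only the values of the weights are estimated from the data.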

I use the machine learning GenIQ Model© and the LRM to build a database response model, which predicts the rank-order likelihood of response, to illustrate the advantages and the singular weakness of the machine learning paradigm. Specifically, the GenIQ Model shows the superiority of the machine learning paradigm over the statistical paradigm: it not only specifies the true model form (a computer program) but also simultaneously performs variable selection and data mining. The singular weakness is the difficulty of interpreting the computer program, which often accounts for the limited use of the machine learning paradigm.
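To make "rank-order likelihood of response" concrete, here is a small sketch under stated assumptions: the five individuals and their model scores are hypothetical, and the scores are treated as log-odds. It illustrates that a response model used for ranking needs only the order of the scores, so ranking by raw score and ranking by the converted probability give the same list.

```python
import math

# Hypothetical model scores (treated as log-odds) for five individuals.
scores = {"A": -1.2, "B": 0.4, "C": 2.1, "D": -0.3, "E": 0.9}

def to_probability(log_odds):
    """Monotonic conversion from log-odds to probability of response."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Rank individuals from most to least likely to respond.
rank_by_score = sorted(scores, key=scores.get, reverse=True)
rank_by_prob = sorted(scores, key=lambda k: to_probability(scores[k]), reverse=True)

print(rank_by_score)
print(rank_by_score == rank_by_prob)
```

Because only the ordering matters, the output of any model form, whether an equation or a computer program, can be used in the same way, provided that it yields a score on which individuals can be sorted.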

Outline of this Article

Outline of Another Article as an Accompaniment