Hybrid Statistics-Machine Learning Paradigm for Database Response Modeling

Bruce Ratner, Ph.D.

The regnant statistical paradigm for database response modeling is: The data analyst fits the data to the presumedly true logistic regression model (LRM), whose form (equation) is the sum of weighted predictor variables. The weights (better known as regression coefficients) are the main appeal of the statistical paradigm, as they provide the key to interpreting what the equation means. The well-established LRM variable selection methodology, which identifies the predictor variables for the LRM, is the inherent weakness in the statistical paradigm. The variable selection leaves no room for the data analyst's will and ability to construct new variables with potential predictive power (data mining).
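To make the LRM form concrete, the following is a minimal sketch in Python of how a fitted LRM scores one individual: the logit is the intercept plus the sum of weighted predictor variables, and the logistic function turns that logit into a response probability. The function name, the intercept, and the weights are illustrative assumptions, not values from any model in this article.

```python
import math

def lrm_probability(intercept, weights, predictors):
    """Score one individual with a logistic regression model (LRM).

    The logit is the sum of weighted predictor variables plus an
    intercept; the logistic function maps it to a probability.
    All numeric values passed in are hypothetical illustrations.
    """
    logit = intercept + sum(w * x for w, x in zip(weights, predictors))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical regression coefficients for two predictor variables.
example_weights = [1.0, -2.0]

# An individual whose weighted sum is exactly zero sits at probability 0.5.
p_neutral = lrm_probability(0.0, example_weights, [2.0, 1.0])

# Raising the first predictor raises the logit, hence the probability.
p_higher = lrm_probability(0.0, example_weights, [3.0, 1.0])
```

Because each weight multiplies exactly one predictor variable, its sign and size can be read directly, which is the interpretability the statistical paradigm trades on.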

The antithetical machine learning (ML) paradigm is: The data suggests the "true" model form (a computer program), as the ML process acquires knowledge of the form without being explicitly programmed. The strengths of the ML paradigm are its flexibility within a nonparametric, assumption-free openwork that accommodates big data, and its serviceability as a data mining tool. The weakness in the ML paradigm is the difficulty in interpreting the abstruse computer program; this surely has accounted for the limited use of ML methods.

The purpose of this article is to present a hybrid statistics-ML paradigm - integrating the best features of the two paradigms - to yield a utile alternative for database response modeling. The proposed paradigm is: The data analyst fits the data to the LRM with the LRM variable selection among the original variables and the constructed variables, which are the subroutines of the computer program. Thus, the hybrid LRM-ML database response model includes a) the redoubtable regression coefficients, which provide the necessary comfort level for model acceptance, and b) the probable inclusion of powerful ML-predictor variables. I use the machine learning GenIQ Model© and LRM to build a hybrid LRM-GenIQ database response model, which predicts the rank-order likelihood of response, to illustrate the promise of the proposed hybrid approach. For a preview of the 9-step modeling process of GenIQ, click here. For FAQs about GenIQ, click here.
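The hybrid idea can be illustrated with a small, self-contained sketch in Python. It is not the GenIQ Model and does not reproduce the article's variable selection step: the constructed variable here is simply the product x1*x2, a hypothetical stand-in for a subroutine that an ML program might supply, and the toy data, function names, and gradient-ascent fitting routine are all assumptions made for illustration. The sketch fits one LRM on the original variables alone and one on the original variables plus the constructed variable, then compares how well each rank-orders responders.

```python
import math

def score(weights, x):
    """LRM probability; weights[0] is the intercept."""
    logit = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-logit))

def fit_lrm(rows, labels, steps=500, lr=0.5):
    """Fit LRM weights by plain batch gradient ascent on the log-likelihood."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = y - score(weights, x)
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        weights = [w + lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def concordance(scores, labels):
    """Share of responder/non-responder pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): response depends only on the sign of x1 * x2.
levels = [-1.0, -0.5, 0.5, 1.0]
original = [(a, b) for a in levels for b in levels]
labels = [1 if a * b > 0 else 0 for a, b in original]

# Hybrid design: original variables plus one constructed variable.
# x1 * x2 is an assumed stand-in for an ML-built subroutine.
hybrid = [(a, b, a * b) for a, b in original]

w_original = fit_lrm(original, labels)
w_hybrid = fit_lrm(hybrid, labels)

auc_original = concordance([score(w_original, x) for x in original], labels)
auc_hybrid = concordance([score(w_hybrid, x) for x in hybrid], labels)

print("rank-order concordance, original variables:", auc_original)
print("rank-order concordance, hybrid variables:  ", auc_hybrid)
print("hybrid regression coefficients:", w_hybrid)
```

On this deliberately contrived data, the original variables carry no linear signal, so the first model cannot separate responders from non-responders, while the hybrid model ranks them well and still reports an ordinary regression coefficient for the constructed variable. Real data would rarely be this clean, and a fuller treatment would run variable selection over the pooled candidates.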


For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1, or e-mail br@dmstat1.com.

DM STAT-1 CONSULTING / br@dmstat1.com
574 Flanders Drive / North Woodmere, NY 11581 / USA
Voice 1-516-791-3544 / Fax 1-516-791-5075 / Toll Free 1 800 DM STAT-1