A Non-Imputation Methodology for Database Modeling with Missing Data

The problem of modeling data with missing values is well known to data analysts. Almost all standard statistical modeling techniques require complete data, so analysts routinely discard individuals with missing values. They make every effort to impute the missing values, but are mindful that, even so, a sizable sample of discarded individuals can remain. The consequence of this discards sample is that a model implemented on a database can provide estimated target scores only for individuals with complete data. To score the entire database, analysts must assign extra-model scores (e.g., the mean estimated target value of the scored individuals) to individuals with missing data. This article presents a non-imputation methodology (NIM) for database modeling with missing data. NIM includes a new method for modeling the individuals discarded because of missing values. It combines this discards model with the model built on the reduced complete-data sample, yielding a single model for scoring the entire database. NIM is particularly useful in big data settings with large amounts of missing data. For example, data analysts often append external data to the sample data to provide extra candidate predictor variables when building a model. The potential predictive power of the extra variables is offset by a large amount of missing data in the appended sample, because the match rate between the sample and external files is typically low. The new methodology offers a wider selection of candidate variables and a more useful model, as it permits the scoring of an entire database without regard to the problem of missing data.
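To make the conventional workaround concrete, the following is a minimal sketch of complete-case modeling with an extra-model score, as described above. It is not NIM itself: the abstract does not specify how the discards model is built, so none is attempted here. The names fit_line and score_database, the single predictor x, and the toy data are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def score_database(train, database):
    """Score every row; rows with a missing predictor get an extra-model score."""
    # Complete-case sample: individuals with a missing predictor are discarded.
    complete = [(r["x"], r["y"]) for r in train if r["x"] is not None]
    intercept, slope = fit_line([x for x, _ in complete],
                                [y for _, y in complete])

    # The model can score only database rows with complete data.
    model_scores = {
        i: intercept + slope * r["x"]
        for i, r in enumerate(database)
        if r["x"] is not None
    }

    # Extra-model score: mean estimated target value of the scored individuals.
    fallback = mean(list(model_scores.values()))
    return [model_scores.get(i, fallback) for i in range(len(database))]


# Hypothetical toy data: None marks a missing predictor value.
train = [{"x": 1, "y": 2}, {"x": 2, "y": 4}, {"x": 3, "y": 6}, {"x": None, "y": 5}]
database = [{"x": 1}, {"x": 3}, {"x": None}]
print(score_database(train, database))
```

In this sketch the third database row receives the same fallback score regardless of anything else known about it, which is the limitation a separate discards model is meant to address.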


Bruce Ratner, Ph.D.

For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1, or e-mail br@dmstat1.com.
DM STAT-1 CONSULTING / br@dmstat1.com 574 Flanders Drive / North Woodmere, NY 11581 / U S A Voice 1-516-791-3544 / Fax 1-516-791-5075 Toll Free 1 800 DM STAT-1