Missing Value Analysis: A Machine-learning Approach

Bruce Ratner, Ph.D.

The problem of modeling data with missing values is well known to data analysts. Almost all standard statistical modeling techniques perform complete-case analysis, which discards every case with at least one missing value. Analysts make every effort to impute the missing values, mindful that falling back on complete-case analysis can leave a meager sample. The implication of complete-case analysis is twofold. First, the results must be treated with caution (i.e., potential prediction bias), as the complete-case sample is questionably representative of the population under consideration. Second, a model built on a complete-case sample has limited utility, as it provides estimated target scores only for individuals with complete data. Data analysts must assign extra-model scores to individuals with missing data if they want an entire database scored. This article presents a machine-learning, genetic-based, assumption-free imputation method for database modeling based on all-case analysis, which includes all cases regardless of missingness. This method should be a welcome addition to the data analysis arsenal for building a better database model, as it has the distinctive feature of softening the effects of missing data by minimizing likely prediction bias and maximizing the model's utility.
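
To make the cost of complete-case analysis concrete, the following minimal sketch counts how many cases survive when every case with at least one missing value is discarded. The data, the field names (`age`, `income`, `tenure`), and the helper `is_complete` are made up for illustration; this is not the article's imputation method, only a picture of the problem that method is meant to address.

```python
# Hypothetical toy database: None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "tenure": 5},
    {"age": None, "income": 61000, "tenure": 2},
    {"age": 45, "income": None, "tenure": 9},
    {"age": 29, "income": 48000, "tenure": None},
    {"age": 51, "income": 75000, "tenure": 12},
]


def is_complete(row):
    """Return True when no field in the case is missing."""
    return all(value is not None for value in row.values())


# Complete-case analysis keeps only the fully observed cases.
complete_cases = [row for row in rows if is_complete(row)]

# These cases would need extra-model scores under complete-case analysis.
incomplete_cases = [row for row in rows if not is_complete(row)]

# Share of the database that a complete-case model can score directly.
coverage = len(complete_cases) / len(rows)

print(f"complete cases: {len(complete_cases)} of {len(rows)}")
print(f"cases left unscored: {len(incomplete_cases)}")
print(f"coverage: {coverage:.0%}")
```

In this toy example each variable is missing in only one case out of five, yet only two of the five cases are complete. An all-case analysis would retain all five.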