A Database Marketing Model
for Zero-inflated Data
Bruce Ratner, Ph.D.
The problem of modeling data with missing values is well known to data analysts. Almost all standard statistical modeling techniques require complete data and accordingly discard individuals with missing values. To avoid losing those individuals, analysts make every effort to impute the missing values. A common approach is to "zero-inflate" the data by replacing missing values with zeros. For binary variables and dummified categorical variables, say, those representing participation in lifestyle activities, which take the value 1 if an individual participates in a given activity and 0 otherwise, missing-value individuals are assigned zeros. The working assumption is that missing-value individuals are nonparticipants in the corresponding lifestyle activities. Similarly, for continuous variables, say, those representing a count activity (e.g., number of visits) or a dollar amount, missing-value individuals are assigned zeros, implying that they have no activity or a zero dollar value. Zero-inflated data clearly do not meet the bell-shaped distributional assumption of the standard statistical modeling techniques. Nevertheless, the zero-inflated data approach has been justified empirically: it produces good model results in the majority of cases.
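As a minimal sketch of the zero-inflation step described above, the following Python fragment replaces missing values with zeros and reports the resulting share of zeros. The variable names (plays_golf, num_visits), the data values, and the helper functions zero_inflate and zero_share are hypothetical and serve only as illustration. They are not taken from the case studies.

```python
from typing import List, Optional, Union

Number = Union[int, float]


def zero_inflate(values: List[Optional[Number]]) -> List[Number]:
    """Replace missing entries (None) with 0; observed values are unchanged."""
    return [0 if v is None else v for v in values]


def zero_share(values: List[Number]) -> float:
    """Return the fraction of entries equal to zero (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(1 for v in values if v == 0) / len(values)


# Hypothetical illustration data: None marks a missing value.
plays_golf = [1, None, 0, None, 1, None]   # binary lifestyle indicator
num_visits = [3, None, None, 7, 0, None]   # count variable

# Missing-value individuals are treated as nonparticipants / zero activity.
golf_filled = zero_inflate(plays_golf)
visits_filled = zero_inflate(num_visits)

print(golf_filled, zero_share(golf_filled))
print(visits_filled, zero_share(visits_filled))
```

After imputation, an observed zero and an imputed zero are indistinguishable, and the pile-up of zeros is what moves the variable away from a bell-shaped distribution.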
The purpose of this article is to present a distribution-free alternative to regression modeling with zero-inflated data, whether the zeros are due to imputation, as discussed above, or are actually observed. The GenIQ Model, which is based on the machine learning method of genetic programming, theoretically accepts zero-inflated data, and thus offers optimal model results. Two case studies are presented using response and profit database marketing models.