A Simple Data Cleaning Method for Boosting the Reliability and Performance of Database Models

Common sense dictates that the reliability of a model is directly related to the quality of the data on which it is built: the better the data quality, the more reliable the model. One way to increase data quality is to perform data cleaning, and two cleaning steps are commonly taken. In the first step, the data analyst looks for impossible and implausible data. Impossible data contain values that simply cannot exist and are therefore considered “dirty.” Implausible data contain values that are possible but suspect. Once such values are located, the analyst establishes rules for their treatment. In the second step, the data analyst determines the percentages and patterns of missing data for all variables and decides how to handle the affected records. Subsequent steps are as varied as the individual analysts’ approaches to cleaning their own data. The purpose of this article is to present a simple yet powerful data cleaning method for removing the effects of dirty and doubtful values, and thus increasing model reliability. The method cleans the data by ranking and symmetrizing the values in a single sweep across all non-categorical variables. Because its underpinnings are in concert with the performance criterion of database models, the method offers a prepatent boost in the performance of database models. I illustrate the new method with database model case studies.
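
The abstract does not spell out how the ranking and symmetrizing are carried out, so the following Python sketch is only one plausible reading, not the article's actual procedure. It assumes that missing values are coded as None, that "ranking" means replacing each value with its average rank, and that "symmetrizing" can be approximated by mapping those ranks to standard normal scores (a rank-based inverse normal transform, swapped in here purely for illustration). The function names and the sample records are hypothetical.

```python
from statistics import NormalDist


def missing_percent(rows, columns):
    """Percentage of missing (None) values for each column."""
    n = len(rows)
    return {c: 100.0 * sum(r[c] is None for r in rows) / n for c in columns}


def average_ranks(values):
    """1-based ranks; tied values share the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_and_symmetrize(values):
    """Rank the non-missing values, then map ranks to standard normal scores.

    Assumption: the normal-score mapping stands in for the article's
    unspecified symmetrizing step. Missing values (None) are left as None.
    """
    present = [i for i, v in enumerate(values) if v is not None]
    ranks = average_ranks([values[i] for i in present])
    n = len(present)
    std_normal = NormalDist()
    out = [None] * len(values)
    for i, r in zip(present, ranks):
        # (r - 0.5) / n lies strictly between 0 and 1, so inv_cdf is defined.
        out[i] = std_normal.inv_cdf((r - 0.5) / n)
    return out


def clean_numeric_columns(rows, numeric_columns):
    """Single sweep: apply rank_and_symmetrize to every listed column."""
    cleaned = [dict(r) for r in rows]
    for c in numeric_columns:
        scores = rank_and_symmetrize([r[c] for r in rows])
        for row, s in zip(cleaned, scores):
            row[c] = s
    return cleaned


if __name__ == "__main__":
    # Hypothetical records; the last income is an implausibly large value.
    data = [
        {"age": 34, "income": 52_000},
        {"age": None, "income": 48_000},
        {"age": 51, "income": 61_000},
        {"age": 29, "income": 9_999_999},
    ]
    print(missing_percent(data, ["age", "income"]))
    for row in clean_numeric_columns(data, ["age", "income"]):
        print(row)
```

Under these assumptions an extreme value keeps its position in the ordering but loses its leverage: however large the suspect income is, its transformed score depends only on its rank among the non-missing values.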


Bruce Ratner, Ph.D.

For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1, or e-mail br@dmstat1.com.

DM STAT-1 CONSULTING / 574 Flanders Drive / North Woodmere, NY 11581 / USA / Voice 1-516-791-3544 / Fax 1-516-791-5075 / Toll Free 1 800 DM STAT-1