The 80/20 Rule:
Revised for Data Preparation
Bruce Ratner, Ph.D.
In marketing and sales the mantra is: 80 percent of profits are generated by 20 percent of sales. This is an application of the 80/20 Rule, discovered in 1897 by the Italian economist Vilfredo Pareto while he was searching for patterns of wealth and income in England. He found that 80 percent of the wealth was enjoyed by only 20 percent of the population. After Pareto made his observation and created his ratio, many others observed similar phenomena in their own areas of expertise. Stated generally, 80 percent of your results (output) in a given activity are generated by 20 percent of your effort (input).
In database marketing, data analysts misstate the 80/20 Rule as: 80% of the project timeline goes to data preparation, leaving 20% of the timeline for analysis and model building. Data preparation can be defined as getting acquainted with the data in order to understand what they tell you. You must 1) ensure there are no impossible or improbable values (e.g., an age of 120 years, or a boy named Sue, respectively), and 2) audit missing and zero values. The audit may call for imputation of missing values. Importantly, data preparation also includes coming face-to-face with the data distribution (shape), looking for 1) a clump: a mass of data (spike) at a single value (often at zero), or a quantity of data cohering together so as to make one body of indefinite shape; and 2) a gap: an intervening space between two adjacent values that are not consecutive. The purpose of this article is to reduce the “80% for data preparation” by presenting a new, efficient, and highly effective data preparation procedure, thus giving analysis and model building more time, which is always welcome. The method spreads out the clumps, closes in the gaps, and reshapes the data into the desirable and reliable bell-shaped curve.
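The audit steps described above (impossible or improbable values, missing and zero values, clumps, and gaps) can be illustrated with a minimal Python sketch. This is not the article's procedure for reshaping the data, which is not specified here. It is only a hypothetical audit helper, and the function name `audit_column`, the sample `ages` list, and the plausible range of 1 to 110 are all assumptions made for illustration.

```python
from collections import Counter

def audit_column(values, low, high):
    """Summarize one numeric column: missing, zero, out-of-range, clump, gap.

    Hypothetical helper; `low` and `high` are an assumed plausible range.
    Missing values are represented as None.
    """
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("column has no non-missing values to audit")
    counts = Counter(present)
    # Clump: the single most frequent value and its share of non-missing rows.
    clump_value, clump_count = counts.most_common(1)[0]
    # Gap: the widest space between adjacent distinct values once sorted,
    # reported as (width, left value, right value).
    distinct = sorted(counts)
    widest_gap = max(
        ((b - a, a, b) for a, b in zip(distinct, distinct[1:])),
        default=(0, None, None),
    )
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "zeros": counts.get(0, 0),
        "out_of_range": sum(1 for v in present if not low <= v <= high),
        "clump_value": clump_value,
        "clump_share": clump_count / len(present),
        "widest_gap": widest_gap,
    }

# Made-up sample data: a spike at zero, two missing values, one age of 120.
ages = [0, 0, 0, 0, 23, 25, 31, None, 38, 41, 120, None]
report = audit_column(ages, low=1, high=110)
print(report)
```

A report like this only flags where to look. Deciding whether a zero is a true zero or a disguised missing value, and whether to impute, remains a judgment call for the analyst.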