Data Mining 101
Bruce Ratner, Ph.D.

The essence of data mining is found in John Tukey’s seminal book Exploratory Data Analysis (EDA). Tukey’s own words best describe EDA and, in part, data mining itself: “Exploratory data analysis is detective work – numerical detective work – or counting detective work – or graphical detective work. It is about looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights.” Tukey provides us with a set of techniques to do our detective work. I list the most popular and useful techniques below.

1. Stem-and-Leaf Display - a compact display of a data distribution built from the digits of the numeric values themselves.

2. Box-and-Whisker Plot - five-number summaries of data samples.

3. X-Y Plotting - the simplest of graphical displays, used to show the relationship between two variables.

4. Smooth Scatterplot - X-Y (scatter)plot in which the errors in the data are smoothed out to better reveal the true relationship between X and Y.

5. Resistant Method for Fitting a Straight Line - an alternative to ordinary regression for fitting a straight line that is resistant to the potential presence of outliers.

6. Median Polish of Two-Way Tables - a method for discovering patterns in two-way tables.

7. Rootogram - histogram-like display based on the square roots of class frequencies to compare the data with a standard shape.

8. Bubble Charts - coded X-Y scatterplots where the symbol size represents the value of an additional quantitative variable.

9. Radar/Spider Plots - technique for comparing several samples of multivariate data.

10. Scatterplot Matrices - organized arrays of two-variable scatterplots.
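
To make the first two techniques in the list concrete, here is a minimal Python sketch. It is not taken from Tukey's book or from any particular package: the function names stem_and_leaf and five_number_summary are hypothetical, the data are assumed to be a non-empty list of non-negative integers, and the quartile-like values are computed as Tukey-style hinges (the median of each half, with the middle value shared by both halves when the count is odd).

```python
from collections import defaultdict
from statistics import median


def stem_and_leaf(values):
    """Return display rows, using tens as the stem and units as the leaf.

    Assumes a non-empty collection of non-negative integers.
    Stems with no leaves are kept so that gaps in the data stay visible.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    lo, hi = min(rows), max(rows)
    return [
        f"{stem:>2} | {''.join(str(d) for d in rows.get(stem, []))}"
        for stem in range(lo, hi + 1)
    ]


def five_number_summary(values):
    """Return (minimum, lower hinge, median, upper hinge, maximum)."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2  # each half includes the middle value when n is odd
    lower, upper = xs[:half], xs[n - half:]
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])


# Illustrative values only; they do not come from a real data set.
sample = [12, 15, 21, 34, 34, 38]
for row in stem_and_leaf(sample):
    print(row)
print(five_number_summary(sample))
```

The five values returned by five_number_summary are the ones a box-and-whisker plot draws: the hinges mark the ends of the box, the median is the line inside it, and the extremes bound the whiskers in the simplest version of the plot.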

In this article, I will discuss the ten methods in a database marketing setting, illustrating how each display is constructed and showing how to draw the necessary inferences when data mining.
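
As a small preview of what constructing one of these methods involves, the following Python sketch applies a median polish to a two-way table. It is an illustration under stated assumptions rather than a definitive implementation: the function name median_polish and the sweeps parameter are hypothetical, a fixed number of sweeps is run instead of testing for convergence, and the table is assumed to be rectangular with no missing cells.

```python
from statistics import median


def median_polish(table, sweeps=10):
    """Split a two-way table into overall + row effect + column effect + residual.

    Each sweep subtracts row medians, then column medians, from the residuals
    and moves what was removed into the corresponding effects.
    """
    resid = [[float(x) for x in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(sweeps):
        # Row sweep: move each row's median into its row effect.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [x - m for x in resid[i]]
        # Re-center the column effects and move their median into the overall term.
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]

        # Column sweep: move each column's median into its column effect.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        # Re-center the row effects and move their median into the overall term.
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]

    return overall, row_eff, col_eff, resid


# Illustrative table built as 10 + row effect + column effect, so the
# residuals should vanish; the numbers are made up for the example.
table = [[7, 9, 11], [8, 10, 12], [9, 11, 13]]
overall, row_eff, col_eff, resid = median_polish(table)
print(overall, row_eff, col_eff)
```

In a real table the residuals will not vanish, and the pattern left in them (for example, a few large cells) is where the detective work begins.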

