Genetic Data Mining Method for the
Proper Use of the Correlation Coefficient

Bruce Ratner, Ph.D.

Assessing the relationship between a predictor variable and a target variable is an essential task in the model building process. If the relationship is identified and tractable, then the predictor variable is re-expressed to reflect the uncovered relationship, and subsequently tested for inclusion in the model. Most methods of variable assessment are based on the well-known correlation coefficient, which is often misused because its linearity assumption is not tested. The purpose of this article is to illustrate a genetic data mining method, the GenIQ Model©, that is perhaps the best "data-straightener" available today. I use the third pair of x and y values from the well-known Anscombe data.

OUTLINE

I. Anscombe Data

ID     x       y
 1    10    7.46
 2     8    6.77
 3    13   12.74
 4     9    7.11
 5    11    7.81
 6    14    8.84
 7     6    6.08
 8     4    5.39
 9    12    8.15
10     7    6.42
11     5    5.73

II. GenIQ Model

The GenIQ Model (Tree Display)

[Figure: GenIQ Model tree display]

The GenIQ Model (Code)

    x1 = .6550772;
    x2 = x;
    If x1 NE 0 Then x1 = x2 / x1; Else x1 = 1;
    x2 = x;
    x3 = x;
    x2 = x2 + x3;
    x2 = Cos(x2);
    x1 = x1 + x2;
    GenIQvar(y) = x1;

III. GenIQ Model Results

The results of the GenIQ Model as a data-straightener are in Table 2. The table is ordered by descending GenIQ Model score, GenIQvar(y), and the y values fall in exactly the same descending order: a perfect rank-order prediction.

Table 2. GenIQ Model Rank-order Prediction

ID     x       y   GenIQvar(y)
 3    13   12.74       20.4919
 6    14    8.84       20.4089
 9    12    8.15       18.7426
 5    11    7.81       15.7920
 1    10    7.46       15.6735
 4     9    7.11       14.3992
 2     8    6.77       11.2546
10     7    6.42       10.8225
 7     6    6.08       10.0031
11     5    5.73        6.7936
 8     4    5.39        5.9607

Perhaps the best way of illustrating the GenIQ Model as a data-straightener and a data mining tool is the pair of plots below.

[Figure: Plot y*x]

[Figure: Plot GenIQ*x]

IV. Summary

Perhaps the GenIQ Model is an excellent data-straightener and data mining tool all in one. What do you think?

Oh, two things: the correlation coefficients between y and x, and between GenIQvar(y) and x, are 0.81629 and 0.9895, respectively.
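The model code in Section II and the figures quoted in the Summary can be checked with a short script. What follows is a minimal Python sketch, not the GenIQ software itself: it translates the listed code line by line (which reduces algebraically to x / 0.6550772 + cos(2x), with the cosine argument assumed to be in radians), then recomputes the rank order and the two correlation coefficients. The names `geniq_var`, `pearson`, `SCORES`, and `ranked_ids` are illustrative choices and do not come from the article.

```python
import math

# Third Anscombe x-y pair, in ID order 1..11 as listed in Section I.
IDS = list(range(1, 12))
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def geniq_var(x):
    """Line-by-line translation of the GenIQ Model code in Section II."""
    x1 = 0.6550772
    x2 = x
    x1 = x2 / x1 if x1 != 0 else 1
    x2 = x
    x3 = x
    x2 = x2 + x3
    x2 = math.cos(x2)  # assumption: angle is in radians
    x1 = x1 + x2
    return x1


def pearson(a, b):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    ss_a = sum((ai - mean_a) ** 2 for ai in a)
    ss_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(ss_a * ss_b)


# Model score for every observation, in ID order.
SCORES = [geniq_var(x) for x in X]

# IDs sorted by descending model score, as in Table 2.
ranked_ids = sorted(IDS, key=lambda i: SCORES[i - 1], reverse=True)

if __name__ == "__main__":
    for i in ranked_ids:
        print(f"{i:2d} {X[i - 1]:3d} {Y[i - 1]:6.2f} {SCORES[i - 1]:8.4f}")
    print("r(y, x)           =", round(pearson(X, Y), 5))
    print("r(GenIQvar(y), x) =", round(pearson(X, SCORES), 4))
```

Run as a script, this prints the rows in the same order as Table 2, and the two correlations should agree with the 0.81629 and 0.9895 quoted in the Summary up to rounding.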