Genetic Data Mining Method for the Proper Use of the Correlation Coefficient

Bruce Ratner, Ph.D.

Assessing the relationship between a predictor variable and a target variable is an essential task in the model-building process.
If the relationship is identified and tractable, then the predictor variable is re-expressed to reflect the uncovered relationship
and subsequently tested for inclusion in the model. Most methods of variable assessment are based on the well-known
correlation coefficient, which is often misused because its linearity assumption is not tested. The purpose of this article is to
illustrate a genetic data mining method, the GenIQ Model©, that is perhaps the best “data-straightener” available today.
As the illustration, I use the third pair of x and y values from the well-known Anscombe data.

OUTLINE

I. Anscombe Data

ID     x        y

1     10     7.46
2      8     6.77
3     13    12.74
4      9     7.11
5     11     7.81
6     14     8.84
7      6     6.08
8      4     5.39
9     12     8.15
10     7     6.42
11     5     5.73

II. GenIQ Model (Tree Display)

[Figure: GenIQ Model tree display]

The GenIQ Model (Code)

x1 = .6550772;
        x2 = x;
   If x1 NE 0 Then x1 = x2 / x1; Else x1 = 1;
       x2 = x;
          x3 = x;
     x2 = x2 + x3;
     x2 = Cos(x2);
x1 = x1 + x2;
GenIQvar(y) = x1;
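
Read step by step, the code above appears to reduce to a single expression: x divided by the constant 0.6550772, plus the cosine of x + x (taken in radians). The Python transcription below is only a sketch of that reading, not GenIQ output; the function name `geniq_score` is an assumption introduced for illustration.

```python
import math


def geniq_score(x: float) -> float:
    """Hypothetical Python transcription of the GenIQ code listed above."""
    x1 = 0.6550772
    x2 = x
    # Protected division: fall back to 1 if the divisor is zero.
    x1 = x2 / x1 if x1 != 0 else 1.0
    # x2 = x; x3 = x; x2 = x2 + x3; x2 = Cos(x2), with the angle in radians.
    x2 = math.cos(x + x)
    return x1 + x2


# Score for ID 3 of the Anscombe data (x = 13).
print(round(geniq_score(13), 4))
```

For x = 13 this gives approximately 20.4919, and for x = 4 approximately 5.9607.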

III. GenIQ Model Results
The results of the GenIQ Model as a data-straightener are in Table 2, which is ordered by descending GenIQ Model score, GenIQvar(y).
The rank-order prediction is perfect: the y values fall in exactly the same descending order as the scores.

Table 2. GenIQ Model Rank-order Prediction

ID     x        y    GenIQvar(y)

3     13    12.74    20.4919
6     14     8.84    20.4089
9     12     8.15    18.7426
5     11     7.81    15.7920
1     10     7.46    15.6735
4      9     7.11    14.3992
2      8     6.77    11.2546
10     7     6.42    10.8225
7      6     6.08    10.0031
11     5     5.73     6.7936
8      4     5.39     5.9607
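
The rank-order claim can be checked mechanically. The sketch below assumes the score is x / 0.6550772 + cos(2x), which is my reading of the GenIQ code in Section II; the names `DATA`, `geniq_score`, and `ids_sorted_by` are illustrative and not part of GenIQ.

```python
import math

# Third Anscombe pair, copied from the table in Section I: (ID, x, y).
DATA = [
    (1, 10, 7.46), (2, 8, 6.77), (3, 13, 12.74), (4, 9, 7.11),
    (5, 11, 7.81), (6, 14, 8.84), (7, 6, 6.08), (8, 4, 5.39),
    (9, 12, 8.15), (10, 7, 6.42), (11, 5, 5.73),
]


def geniq_score(x):
    # Assumed closed form of the GenIQ code: x / 0.6550772 + cos(2x), radians.
    return x / 0.6550772 + math.cos(2 * x)


def ids_sorted_by(key):
    """Return the IDs ordered from the largest to the smallest key value."""
    return [row[0] for row in sorted(DATA, key=key, reverse=True)]


order_by_score = ids_sorted_by(lambda row: geniq_score(row[1]))
order_by_y = ids_sorted_by(lambda row: row[2])

# True when sorting by score and sorting by y give the same ID sequence.
print(order_by_score == order_by_y)
```

Comparing ID sequences rather than raw values keeps the check about rank order only, which is all the table claims.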

Perhaps the best way of illustrating the GenIQ Model as a data-straightener and a data mining tool is with two plots:
y against x (Plot y*x) and GenIQvar(y) against x (Plot GenIQ*x).
[Figure: Plot y*x and Plot GenIQ*x]

IV. Summary
Perhaps the GenIQ Model is an excellent data-straightener and data mining tool, all in one? What do you think?
Oh, two things: the correlation coefficients between y and x, and between GenIQvar(y) and x, are 0.81629 and 0.9895, respectively.
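
For readers who want to recompute the two coefficients, here is a minimal sketch of the Pearson correlation in plain Python. It assumes the GenIQ score is x / 0.6550772 + cos(2x), my reading of the code in Section II; the helper name `pearson` and the variable names are illustrative.

```python
import math

# Third Anscombe pair, in ID order (Section I).
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Assumed closed form of the GenIQ code, with the cosine in radians.
scores = [x / 0.6550772 + math.cos(2 * x) for x in X]

r_yx = pearson(Y, X)       # correlation between y and x
r_gx = pearson(scores, X)  # correlation between the GenIQ score and x
print(f"{r_yx:.5f} {r_gx:.4f}")
```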

For more information about this article, call Bruce Ratner at 516.791.3544, 1 800 DM STAT-1, or e-mail at br@dmstat1.com.