DM Stat-1 Articles

Genetic Data Mining Method for the
Proper Use of the Correlation Coefficient
Bruce Ratner, Ph.D.

Assessing the relationship between a predictor variable and a target variable is an essential task in the model-building process.
If the relationship is identified and tractable, the predictor variable is re-expressed to reflect the uncovered relationship
and then tested for inclusion in the model. Most methods of variable assessment are based on the well-known
correlation coefficient, which is often misused because its linearity assumption is not tested. The purpose of this article is to
illustrate a genetic data mining method, the GenIQ Model©, that is perhaps the best “data-straightener” available today.
I use the third pair of x and y values from the well-known Anscombe data. (For an eye-opening preview of the 9-step modeling process
of GenIQ, click here. For FAQs about GenIQ, click here.)


OUTLINE

I. Anscombe Data

ID     x       y

 1    10    7.46
 2     8    6.77
 3    13   12.74
 4     9    7.11
 5    11    7.81
 6    14    8.84
 7     6    6.08
 8     4    5.39
 9    12    8.15
10     7    6.42
11     5    5.73
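
To see why the untested linearity assumption matters for this particular data set, here is a minimal Python sketch that computes the correlation coefficient of x and y twice: once on all eleven observations and once with observation ID 3 left out. The helper name `pearson_r` is an assumption of this illustration and is not part of GenIQ.

```python
import math

# Third Anscombe data set, copied from the table above (IDs 1..11).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)

r_all = pearson_r(x, y)

# Leave out observation ID 3 (x = 13, y = 12.74), which sits far above
# the line traced by the other ten points.
keep = [i for i in range(len(x)) if i != 2]
r_without_id3 = pearson_r([x[i] for i in keep], [y[i] for i in keep])

print(f"r(x, y), all 11 points: {r_all:.5f}")
print(f"r(x, y), without ID 3:  {r_without_id3:.5f}")
```

The ten remaining points are very nearly collinear, so a single observation accounts for almost all of the gap between the coefficient and 1. A coefficient read without a plot would hide that.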


II. GenIQ Model (Tree Display)

[Figure: GenIQ Model tree display]
The GenIQ Model (Code)

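x1 = .6550772;                                /* model constant                  */
x2 = x;
If x1 NE 0 Then x1 = x2 / x1; Else x1 = 1;    /* protected division: x/.6550772  */
x2 = x;
x3 = x;
x2 = x2 + x3;                                 /* x + x = 2x                      */
x2 = Cos(x2);                                 /* Cos(2x), argument in radians    */
x1 = x1 + x2;                                 /* x/.6550772 + Cos(2x)            */
GenIQvar(y) = x1;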
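
Read top to bottom, the code above reduces to the single expression x / .6550772 + Cos(2x). The following Python transcription is a sketch for checking that reading, not output produced by GenIQ; the function name `geniq_var` is an assumption of this example.

```python
import math

def geniq_var(x):
    """Step-by-step Python transcription of the GenIQ Model code."""
    x1 = 0.6550772
    x2 = x
    # Protected division: fall back to 1 if the divisor were zero.
    x1 = x2 / x1 if x1 != 0 else 1
    x2 = x + x           # x2 = x; x3 = x; x2 = x2 + x3
    x2 = math.cos(x2)    # argument in radians
    return x1 + x2

# Compare the transcription with the collapsed one-line form.
for x in (13, 14, 4):
    closed_form = x / 0.6550772 + math.cos(2 * x)
    print(x, round(geniq_var(x), 4), round(closed_form, 4))
```

The two printed columns should agree for every x, and the values for x = 13, 14 and 4 can be compared with the GenIQvar(y) scores reported in Table 2.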


III. GenIQ Model Results
The results of the GenIQ Model as a data-straightener are in Table 2. The rank-order prediction is perfect: the table is sorted by
descending GenIQ model score, GenIQvar(y), and the y values fall in descending order as well.


Table 2. GenIQ Model Rank-order Prediction

ID     x       y    GenIQvar(y)

 3    13   12.74      20.4919
 6    14    8.84      20.4089
 9    12    8.15      18.7426
 5    11    7.81      15.7920
 1    10    7.46      15.6735
 4     9    7.11      14.3992
 2     8    6.77      11.2546
10     7    6.42      10.8225
 7     6    6.08      10.0031
11     5    5.73       6.7936
 8     4    5.39       5.9607
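
The rank-order claim can be checked mechanically. The sketch below, in Python, sorts the observations three ways (by model score, by y, and by raw x) and compares the resulting ID orderings. It assumes the model code collapses to x / .6550772 + Cos(2x); the names `score` and `ids_sorted_by` are hypothetical helpers for this illustration.

```python
import math

# (ID, x, y) rows of the third Anscombe data set.
rows = [(1, 10, 7.46), (2, 8, 6.77), (3, 13, 12.74), (4, 9, 7.11),
        (5, 11, 7.81), (6, 14, 8.84), (7, 6, 6.08), (8, 4, 5.39),
        (9, 12, 8.15), (10, 7, 6.42), (11, 5, 5.73)]

def score(x):
    # Assumed collapsed form of the GenIQ code: x / .6550772 + Cos(2x).
    return x / 0.6550772 + math.cos(2 * x)

def ids_sorted_by(key):
    """IDs in descending order of the given key function."""
    return [r[0] for r in sorted(rows, key=key, reverse=True)]

ids_by_score = ids_sorted_by(lambda r: score(r[1]))
ids_by_y = ids_sorted_by(lambda r: r[2])
ids_by_x = ids_sorted_by(lambda r: r[1])

print("by GenIQvar(y):", ids_by_score)
print("by y:          ", ids_by_y)
print("by x:          ", ids_by_x)
print("score order matches y order:", ids_by_score == ids_by_y)
print("x order matches y order:    ", ids_by_x == ids_by_y)
```

Sorting by raw x places ID 6 (x = 14) ahead of ID 3 (x = 13), whereas y ranks ID 3 first. In this sketch the cosine term is what swaps those two observations into the order seen in Table 2.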


Perhaps the best way of illustrating the GenIQ Model as both a data-straightener and a data mining tool is with the plots below:
Plot y*x and Plot GenIQ*x.
[Figure: Plot y*x and Plot GenIQ*x]


IV. Summary 
Perhaps the GenIQ Model is an excellent data-straightener and data mining tool, all in one? What do you think?
Oh, two things: the correlation coefficient between y and x is 0.81629, and the one between GenIQvar(y) and x is 0.9895.
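
Both coefficients can be recomputed from the data. The Python sketch below does so under the assumption that GenIQvar(y) equals x / .6550772 + Cos(2x), as the model code implies; `pearson_r` is a hypothetical helper written for this example.

```python
import math

# Third Anscombe data set (IDs 1..11).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# Assumed collapsed form of the GenIQ code, applied to every x.
geniq = [v / 0.6550772 + math.cos(2 * v) for v in x]

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)

r_yx = pearson_r(y, x)
r_gx = pearson_r(geniq, x)

print(f"corr(y, x)           = {r_yx:.5f}")
print(f"corr(GenIQvar(y), x) = {r_gx:.4f}")
```

The printed values should land close to the 0.81629 and 0.9895 quoted above, up to rounding of the model constant.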







For more information about this article, call Bruce Ratner at 516.791.3544, 1 800 DM STAT-1, or e-mail at br@dmstat1.com.