A New Method for Including Qualitative Information in Database Models

The classic approach to including qualitative information, namely nominal-level categorical variables, in the modeling process is dummy variable coding. A categorical variable with k classes of qualitative (non-numerical) information is replaced by a set of k-1 quantitative dummy variables. Each dummy variable is defined by the presence or absence of a class value. The class left out is called the reference class, to which the other classes are compared when interpreting the effects of the dummy variables on the response. The classic approach instructs that the complete set of k-1 dummy variables be included in the model regardless of the number of dummy variables that are declared non-significant. This approach is problematic when the number of classes is large, which is typically the case in big data applications. By chance alone, as the number of classes increases, the probability of one or more dummy variables being declared non-significant increases. Putting all the dummy variables in the model effectively adds "noise," or unreliability, to the model, as non-significant variables are known to be "noisy." Intuitively, a large set of inseparable dummy variables poses a difficulty in model building, in that they quickly "fill up" the model, leaving no room for other variables. The purpose of this article is to present a new method that upgrades the complete set of nominal-level dummy variables into a smaller set of smooth (reliable) interval-level quantitative variables that retains a large percentage of the original information. Thus, the new variables not only offer greater reliability in the database model but also make room for other variables.
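
To make the classic coding concrete, the following is a minimal sketch in plain Python of how a categorical variable with k classes becomes k-1 dummy variables. It illustrates only the classic approach described above, not the new method. The function name `dummy_code`, the variable `region`, and its values are hypothetical and chosen purely for illustration.

```python
def dummy_code(values, reference):
    """Replace a categorical variable with k-1 0/1 dummy variables.

    The reference class gets no column of its own, so an observation
    in the reference class is coded as all zeros.
    """
    classes = sorted(set(values))
    if reference not in classes:
        raise ValueError(f"reference class {reference!r} not found in data")
    # Keep every class except the reference class: k-1 columns in total.
    kept = [c for c in classes if c != reference]
    # One row per observation; 1 marks presence of the class, 0 absence.
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


# Hypothetical data: a categorical variable with k = 4 classes.
region = ["north", "south", "east", "west", "south", "north"]
columns, dummies = dummy_code(region, reference="north")

print(columns)
for value, row in zip(region, dummies):
    print(value, row)
```

With k = 4 classes and "north" as the reference class, the sketch yields three dummy variables. An observation in the reference class is all zeros, which is why the effects of the other classes are read relative to it. As k grows, so does the number of columns that must enter the model together.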
