DM Stat-1 Articles

Data Preparation: Never Drop Original Variables, Always Create Copies of Them
Bruce Ratner, Ph.D.

Preparing big data with hundreds of variables inevitably requires recoding some, if not many, of them. Recoding means modifying the values of an existing variable. Common ways of creating a new variable (named VAR_) from an original variable (named VAR) include replacing missing values, trimming outliers, applying arithmetic or functions as transformations, and applying conditions (e.g., "if-then" rules). The data analyst must never recode a variable without keeping the original intact: first attempts are prone to human error, and recoding is often an iterative process that requires several takes. Recoding the original variable directly, without first creating a copy, is a misstep that forces the time-wasting task of recreating the original dataset. The purpose of this article is to provide a device for the data preparation tool kit: a SAS-code program that semi-automatically creates copies of original variables. This handy implement is a welcome addition to the data analyst’s tool kit when many variables require recoding.

********** SAS-code Program **********

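/* Example input: ID is the key; a, b, and c are the original variables. */
data IN;
input ID a b c;
cards;
1 1 2 3
2 4 5 6
3 7 8 9
;
run;

/* Collect the variable names of IN in dataset NAME. */
proc contents data=IN out=NAME (keep=name) noprint;
run;

/* Build the rename list (pairs of the form a=a_) in macro variable LIST. */
proc sql noprint;
select trim(left(name))||'='||trim(left(name))||'_' into :list separated by ' '
from NAME;
quit;

/* Optional: write the resolved value of &list to the log. */
options symbolgen;

/* VARS_ holds the copies; the RENAME= option restores the key name ID. */
data VARS_ (rename=(ID_=ID));
set IN;
rename &list;
run;

/* Merge originals and copies by ID; both datasets must be sorted by ID. */
data IN_with_VARS_;
merge IN VARS_;
by ID;
run;

proc print data=IN_with_VARS_;
run;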
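
Once the copies exist, recoding can be done on the VAR_ versions while the originals stay untouched. The following is a minimal sketch of that step, not part of the program above: the output dataset name RECODED, the fill value 0, and the cap of 8 are hypothetical choices made only for illustration.

```sas
/* Illustrative recoding on the copies; a, b, and c are left intact. */
data RECODED;                      /* RECODED is a hypothetical dataset name */
set IN_with_VARS_;
if missing(a_) then a_ = 0;        /* replace missing values (0 is an assumed fill value) */
if b_ > 8 then b_ = 8;             /* trim outliers (8 is an assumed cap) */
if c_ > 0 then c_ = log(c_);       /* transformation: natural log of positive values */
run;

proc print data=RECODED;
run;
```

If a recode turns out to be wrong, the copy can be reset from the original (for example, a_ = a;) and the recode rerun, with no need to recreate the dataset.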



For more information about this article, call Bruce Ratner at 516.791.3544,
1 800 DM STAT-1, or e-mail at br@dmstat1.com.