Data Preparation: Never Drop Original Variables, Always Create Copies of Them

Bruce Ratner, Ph.D.

Preparing big data, with its hundreds of variables, inevitably leads to recoding some, if not many, of those variables. Recoding means modifying the values of an existing variable. Common ways of creating a new variable (named VAR_) from an original variable (named VAR) include replacing missing values, trimming outliers, applying arithmetic or functional transformations, and imposing conditions (e.g., "if-then" rules). The data analyst must never recode a variable without keeping the original intact, for two reasons: initial attempts are prone to human error, and recoding is often an iterative process that requires several takes. Recoding the original variable directly, without first creating a copy of it, is a misstep that makes recreating the original dataset an unavoidable, time-wasting task. The purpose of this article is to provide a device for the data preparation tool kit: a SAS program that semi-automatically creates copies of the original variables. This handy implement is a welcome addition to the data analyst's tool kit when many variables require recoding.

********** SAS-code Program **********

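/* Sample input dataset: a key variable ID and three variables to copy. */
data IN;
input ID a b c;
cards;
1 1 2 3
2 4 5 6
3 7 8 9
;
run;

/* Write the variable names of IN to the dataset NAME. */
proc contents data=IN out=NAME (keep=name) noprint;
run;

/* Build rename pairs of the form VAR=VAR_ and store them in macro variable LIST. */
proc sql noprint;
select trim(left(name))||'='||trim(left(name))||'_' into :list separated by ' '
from NAME;
quit;

/* Display the resolved value of &list in the log. */
options symbolgen;

/* Rename every variable to its underscore copy, then restore the key name ID. */
data VARS_ (rename=(ID_=ID));
set IN;
rename &list;
run;

/* Merge the originals with their copies by ID (both datasets are already in ID order). */
data IN_with_VARS_;
merge IN VARS_;
by ID;
run;

proc print data=IN_with_VARS_;
run;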
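
Once the copies exist, recoding is applied to the underscore variables only, so the originals remain available for comparison or a fresh attempt. The following is a minimal sketch and not part of the program above: the dataset DEMO, the variable x, the fill value of 0, the cap of 100, and the derived variable x_log are all hypothetical choices made for illustration.

```sas
/* Hypothetical data: original variable x and its copy x_. */
data DEMO;
input ID x;
x_ = x;                        /* the copy; x itself is never modified */
cards;
1 10
2 .
3 250
4 12
;
run;

/* Recode the copy only. */
data DEMO_RECODED;
set DEMO;
if missing(x_) then x_ = 0;    /* replace missing values (assumed fill value) */
if x_ > 100 then x_ = 100;     /* trim outliers (assumed cap) */
x_log = log(x_ + 1);           /* transformation stored in a new variable */
run;

proc print data=DEMO_RECODED;
run;
```

If a recoding rule proves wrong, rerunning the second data step with a revised rule is enough, because x still holds the values as originally read.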