|
Data Preparation for Big Data
Bruce Ratner, Ph.D.
Data preparation can be defined as your acquaintance with the data to understanding what they tell you. You must 1] insure there are no impossible or improbable values (e.g., age of 120 years, or a boy named Sue, respectively), and 2] audit missing and zero values. When the data at hand are BIG (e.g., hundreds of numeric and character variables), then the auditing of missing values can be onerous. It is not uncommon to have too many variables, each with all missing values, thus rendering any preliminary statistical measure (e.g., the average correlation coefficient) useless, as the complete-case analysis sample size is either minikin or zero. The purpose of this article is to provide a devise for the data preparation tool kit - a SAS-code program that deletes variables, each having all missing values. This handy implement is a welcomed addition to the data analyst’s tool kit . ********** SAS-code Program ********** DATA IN; input ID 2. GENDER $1. MARITAL $1.AGE 2. INCOME 3.; cards; 01 S 123 02 M 345 03 4 04 05 S 08 M 234 07 7 08 M 09 S 34 10 D ; RUN; proc print; run; DATA missing; set IN; options symbolgen; DATA _null_; set missing end=finished; array num_vars[*] _NUMERIC_; array char_vars[*] _CHARACTER_; x= dim(num_vars); y= dim(char_vars); call symput('num_vars',trim(left(put(x,8.)))); call symput('char_vars',trim(left(put(y,8.)))); RUN; DATA _null_; set IN end=finished; array num_vars[*] _NUMERIC_; Array char_vars[*] _CHARACTER_; array num_miss [&num_vars] $ (&num_vars*'missing'); array char_miss [%eval(&char_vars+1)] $ (%eval(&char_vars+1)*'missing'); do i=1 to dim(num_vars); if num_vars(i) ne . then num_miss(i)='non-miss'; end; do i=1 to dim(char_vars); if char_vars(i) ne ' ' then char_miss(i)='non-miss'; end; /* make sure the length is large enough*/ length list $ 50; if finished then do; do i= 1 to dim(num_vars); if num_miss(i) = 'missing' then list=trim(list)||' '||trim(vname(num_vars(i))); end; do i= 1 to dim(char_vars); if char_miss(i) = 'missing' then list=trim(list)||' '||trim(vname(char_vars(i))); end; len=length(list); put len=; call symput('mlist',list); end; RUN; DATA notmiss_vars; set missing; drop &mlist; RUN; proc contents;run; DATA miss_vars; set missing; keep &mlist; RUN; proc contents;run; 1 800 DM STAT-1, or e-mail at br@dmstat1.com. |
|