Data Preparation for Determining Sample Size

Data preparation can be defined as getting acquainted with the data in order to understand what they tell you. You must (1) ensure there are no impossible or improbable values (e.g., an age of 120 years, or a boy named Sue, respectively), and (2) audit missing and zero values. When the data at hand are BIG (e.g., hundreds of variables), auditing missing values can be onerous. It is not uncommon to have variables with different percentages of non-missing values (also known as “coverage”). For example, the variable INCOME has small (poor) coverage, typically 20%. That is, 20% of the sample has INCOME values, and the remaining 80% has missing values for INCOME. As another example, the variable AGE has large (good) coverage, typically 90%. Thus, 90% of the sample has AGE values, and the remaining 10% has missing values for AGE.
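
As a minimal sketch of how coverage can be measured, the SAS step below computes the share of non-missing values per variable. The dataset TOY and its five rows are hypothetical and serve only as an illustration; they are not taken from the article's data.

```sas
/* Hypothetical toy data: a period denotes a missing numeric value */
data TOY;
  input AGE INCOME;
  datalines;
34 52000
51 .
. .
29 .
45 61000
;
run;

/* COUNT(variable) counts non-missing values; COUNT(*) counts all rows */
proc sql;
  select count(AGE)    / count(*) as AGE_COVERAGE    format=percent8.1,
         count(INCOME) / count(*) as INCOME_COVERAGE format=percent8.1
  from TOY;
quit;
```

On this toy data, AGE is non-missing in 4 of 5 rows (80% coverage) and INCOME in 2 of 5 rows (40% coverage).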

When coverage varies across the variables at hand, say, between 15% and 100%, the complete-case analysis (CCA) sample size, which counts only the observations with no missing value on any variable, can be very small, rendering the intended analysis useless. The data analyst must decide on a single acceptable minimum coverage level for all variables, one that ensures a reliable imputation of the missing values. This yields a stable CCA-sample size and a viable dataset, ensuring a workable analysis.
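
To illustrate how quickly the CCA-sample size can shrink, the sketch below counts the rows that have no missing value on any variable. It assumes a hypothetical five-row dataset TOY with one well-covered variable (AGE) and one poorly covered variable (INCOME); the names and values are illustrative only.

```sas
/* Hypothetical toy data: AGE has good coverage, INCOME has poor coverage */
data TOY;
  input AGE INCOME;
  datalines;
34 52000
51 .
. .
29 .
45 61000
;
run;

/* A row is a complete case when none of the listed variables is missing */
data _null_;
  set TOY end=last;
  if nmiss(AGE, INCOME) = 0 then complete + 1;  /* sum statement: starts at 0 and is retained */
  if last then put 'Complete cases: ' complete 'of ' _n_;
run;
```

Here only 2 of the 5 rows are complete cases, because the CCA-sample size can be no larger than the non-missing count of the most poorly covered variable. Dropping INCOME would raise the count to 4, at the cost of losing a predictor.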

Varying levels of Coverage require finding the best balance between Number of Variables and CCA-Sample Size. The relationship among Coverage, Number of Variables, and CCA-Sample Size is described below:

1. As the desired level of Coverage increases:


   - The Number of Variables decreases.
   - The CCA-Sample Size increases.
   - The Model Handicap is that fewer predictors are available, producing a biased model (with over- or under-prediction estimates).

2. As the desired level of Coverage decreases:

   - The Number of Variables increases.
   - The CCA-Sample Size decreases.
   - The Model Handicap is that the utility of the model is curbed and the model is unstable (i.e., has large error variance).

The purpose of this article is to provide a device for the data preparation tool kit: a SAS-code program that allows the data analyst to determine a stable sample size by setting a single acceptable minimum coverage level. The program drops variables that cannot be kept because of their poor coverage, rendering a usable dataset after successful imputation of the missing values. This handy implement is a welcome addition to the data analyst’s tool kit.

******** SAS-code Program ******

    /* Count the non-missing values (N) of every numeric variable in dataset IN */
    PROC MEANS data=IN N noprint;
      var _NUMERIC_;
      output out=COVERAGE (drop=_TYPE_ _FREQ_) N=;
    run;

    /* Build macro variable DN: the names of the variables to drop */
    DATA _NULL_;
      set COVERAGE end=last;
      array all_nums[*] _NUMERIC_;
      /* make sure the length is long enough to hold all dropped names */
      length DN $ 300;
      do i = 1 to dim(all_nums);
        /* 1000 is the minimum acceptable non-missing count; adjust as needed */
        if all_nums[i] < 1000 then DN = trim(DN) || ' ' || vname(all_nums[i]);
      end;
      call symput('DN', DN);
      /* AS holds the number of numeric variables audited */
      call symput('AS', put(dim(all_nums), 8.-L));
      if last then do;
        LN = length(DN);
        put 'length ' LN;
      end;
    RUN;

    %put &DN;
    %put &AS;

    /* Dataset KEEP_VARS includes all character variables in dataset IN,
       irrespective of sample size */
    DATA KEEP_VARS;
      set IN;
      drop &DN;
    RUN;

    PROC MEANS data=KEEP_VARS n nmiss;
      title2 ' KEEP-VARIABLES N ge 1000';
    RUN;

    /* Dataset DROP_VARS holds only the variables that fell below the threshold */
    data DROP_VARS;
      set IN;
      keep &DN;
    run;

    PROC MEANS data=DROP_VARS n nmiss;
      title2 ' DROP-VARIABLES var N lt 1000';
    RUN;

For more information about this article, call Bruce Ratner, Ph.D., at 516.791.3544 or 1 800 DM STAT-1, or e-mail br@dmstat1.com.

DM STAT-1 CONSULTING / br@dmstat1.com
574 Flanders Drive / North Woodmere, NY 11581 / U S A
Voice 1-516-791-3544 / Fax 1-516-791-5075
Toll Free 1 800 DM STAT-1