When Data Are Too Large to Handle in the Memory of Your Computer

Bruce Ratner, Ph.D.

A growing area of inquiry in data preparation concerns procedures for data that are too large to be handled in the memory of your computer. The prevailing approach to handling big data is to subsample the original data in a way that does not sacrifice accuracy. The purpose of this article is to acquaint the data analyst with the latest such subsampling procedures, which use partitioning and bagging. The procedures have two advantages: 1) the subsample size can be set to whatever amount of the original data your computer can comfortably handle, and 2) through "committee averaging" of models across the subsamples, the procedures can potentially achieve better accuracy than a single model built on all the data. The procedures are:

1. Disjoint Partition: Divide the original data into N disjoint partitions, each 1/N-th the size of the original data, by assigning elements at random. Thus, replication can occur neither within nor across the partitions.
2. Small Bags: Create N “bags,” namely, resamples with replacement of the original data, each 1/N-th the size of the original data. Each bag is created independently by random sampling with replacement. Thus, replication can occur both within and across the bags.
3. No Replication-Small Bags: Similar to small bags in #2, but each individual bag is sampled without replacement. Thus, there is no replication within a bag, but replication may occur across bags.
4. Disjoint Bags: Begin with the disjoint partitions in #1; then, independently for each partition, randomly select a number of its elements with replacement and add them to that partition. The number of added elements equals the average number of repeated elements in a small bag in #2.
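
The four procedures above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code behind the procedures: it uses NumPy, it works on row indices rather than on the data themselves (so that only the index arrays need to fit in memory), and all function names are hypothetical. For the disjoint bags, the "average number of repeated elements in a small bag" is taken here to be the expected number of repeated draws when sampling with replacement, which is one reasonable reading of the description.

```python
import numpy as np

def disjoint_partitions(n_rows, n_parts, rng):
    """Shuffle the row indices once, then cut them into n_parts disjoint pieces."""
    return np.array_split(rng.permutation(n_rows), n_parts)

def small_bags(n_rows, n_parts, rng):
    """Each bag: n_rows // n_parts indices drawn WITH replacement from all rows."""
    size = n_rows // n_parts
    return [rng.integers(0, n_rows, size=size) for _ in range(n_parts)]

def no_replication_small_bags(n_rows, n_parts, rng):
    """Each bag: drawn WITHOUT replacement, but independently of the other bags."""
    size = n_rows // n_parts
    return [rng.choice(n_rows, size=size, replace=False) for _ in range(n_parts)]

def expected_repeats(n_rows, size):
    """Expected number of repeated draws in one small bag (assumed interpretation)."""
    expected_unique = n_rows * (1.0 - (1.0 - 1.0 / n_rows) ** size)
    return int(round(size - expected_unique))

def disjoint_bags(n_rows, n_parts, rng):
    """Disjoint partitions, each padded with elements resampled from itself."""
    extra = expected_repeats(n_rows, n_rows // n_parts)
    bags = []
    for part in disjoint_partitions(n_rows, n_parts, rng):
        added = rng.choice(part, size=extra, replace=True)
        bags.append(np.concatenate([part, added]))
    return bags

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for name, maker in [("disjoint partitions", disjoint_partitions),
                        ("small bags", small_bags),
                        ("no replication-small bags", no_replication_small_bags),
                        ("disjoint bags", disjoint_bags)]:
        sizes = [len(b) for b in maker(1000, 10, rng)]
        print(name, sizes)
```

Each function returns a list of N index arrays; the rows they point to would then be read from disk one subsample at a time.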

These latest procedures for handling data too large for your computer will be a welcome addition to the data analyst’s data preparation toolkit.
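
The "committee averaging" mentioned earlier can be illustrated in the same spirit. The sketch below is an assumption-laden example, not a prescribed method: it fits an ordinary least-squares model on each subsample and averages the resulting predictions. The choice of a linear model and the function names are hypothetical, and the data are held in memory here only to keep the example short.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply fitted coefficients (intercept first) to new rows."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def committee_predict(X, y, subsamples, X_new):
    """Fit one model per subsample of row indices and average their predictions."""
    preds = [predict_linear(fit_linear(X[idx], y[idx]), X_new)
             for idx in subsamples]
    return np.mean(preds, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(10_000, 2))
    y = 1.0 + X @ np.array([2.0, -0.5]) + rng.normal(scale=0.1, size=10_000)
    # Disjoint partitions of the row indices serve as the subsamples here.
    subsamples = np.array_split(rng.permutation(len(X)), 10)
    print(committee_predict(X, y, subsamples, X[:3]))
```

Any of the four subsampling procedures could supply the `subsamples` argument; disjoint partitions are used in the demonstration only because they are the simplest.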

Related Articles:
1. Creating A Bootstrap Sample
2. A Simple Bootstrap Variable Selection Method for Building Database Marketing Models
3. Bootstrapping In Direct Marketing: A New Approach for Validating Response Models


For more information about this article, call Bruce at 516.791.3544, 1 800 DM STAT-1, or e-mail at br@dmstat1.com.
DM STAT-1 CONSULTING / 574 Flanders Drive / North Woodmere, NY 11581 / USA / Voice 1-516-791-3544 / Fax 1-516-791-5075 / Toll Free 1 800 DM STAT-1