Big Data and R
This blog post presents tips on computing with Big Data in R, using Revolution R Enterprise 7.0 and RevoScaleR, Revolution's R package for high-performance analytics (HPA), as introduced on the Revolution Analytics blog. For more detailed information, take a look at Tips on Computing with Big Data in R.
1 Upgrade your hardware
Bigger is better: increasing memory and adding as many cores as R can use helps considerably. Also try to avoid the bottlenecks of disk I/O and RAM speed, so that the additional cores can actually be kept busy.
2 Upgrade your software
Since R allows its core math libraries to be replaced, any function that relies on computational linear algebra can get a performance boost simply by linking in faster libraries. Revolution R Enterprise, for example, links in the Intel Math Kernel Library.
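As a quick check of which math libraries a given R build uses, and of what a BLAS-heavy operation costs, a sketch like the following can help (timings will of course vary by machine):

```r
# Print the linked BLAS/LAPACK libraries (shown by sessionInfo() in R >= 3.4)
sessionInfo()

# Time a BLAS-dominated operation; optimized libraries such as MKL or
# OpenBLAS typically run this many times faster than the reference BLAS
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(m %*% m)
```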
3 Minimize copies of the data
R does quite a bit of automatic copying. For example, when a data frame is passed into a function, a copy of the data is made if the data frame is modified, and putting a data frame into a list can also cause a copy to be made. Moreover, many basic analysis algorithms, such as lm and glm, produce multiple copies of a data set as the computations progress. Memory management is therefore important.
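Base R's tracemem() can reveal when these copies happen; a minimal sketch:

```r
df <- data.frame(x = rnorm(1e6))
tracemem(df)          # ask R to report whenever this object is duplicated

f <- function(d) {
  d$x[1] <- 0         # modifying the argument triggers a copy of the data
  d
}
df2 <- f(df)          # tracemem() prints a message as the copy is made
```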
4 Process data in chunks
Processing data a chunk at a time can scale computations without increasing memory requirements. Several CRAN packages, including biglm, bigmemory, ff, and ffbase, implement external memory algorithms or help with writing them. Revolution R Enterprise's RevoScaleR package takes chunking algorithms to the next level by automatically taking advantage of the available computational resources to run its algorithms in parallel.
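As an illustration of the pattern in base R, the following sketch computes the mean of one column of a large CSV a chunk at a time; the file name and column name are assumptions:

```r
# Compute the mean of column "x" in big.csv without loading the whole file
con <- file("big.csv", open = "r")
hdr <- strsplit(readLines(con, n = 1), ",")[[1]]  # read the header row once

sum_x <- 0
n_x   <- 0
repeat {
  # read.csv() on an open connection resumes where the last chunk ended;
  # it raises an error when no lines remain, which we treat as end-of-file
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = hdr, nrows = 100000),
    error = function(e) NULL
  )
  if (is.null(chunk)) break
  sum_x <- sum_x + sum(chunk$x)
  n_x   <- n_x + nrow(chunk)
}
close(con)
sum_x / n_x
```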
5 Compute in parallel across cores or nodes
To scale computations to big data, the CRAN package foreach provides easy-to-use tools for executing R functions in parallel, both on a single computer and across multiple computers. The foreach() function is particularly useful for "embarrassingly parallel" computations that do not involve communication among tasks.
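A minimal sketch using foreach with the doParallel backend; the number of workers and the per-task computation are placeholders:

```r
library(doParallel)             # also loads foreach and parallel

cl <- makeCluster(4)            # start 4 local worker processes
registerDoParallel(cl)

# Each iteration is independent, so the tasks need no communication
res <- foreach(i = 1:8, .combine = c) %dopar% {
  mean(rnorm(1e6))              # placeholder for a real per-task computation
}

stopCluster(cl)
res
```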
The statistical functions and machine learning algorithms in the RevoScaleR package are all Parallel External Memory Algorithms (PEMAs). They automatically take advantage of all of the cores available on a machine or on a cluster (including LSF and Hadoop clusters).
6 Take advantage of integers
In R, the two choices for "continuous" data are numeric, an 8-byte (double) floating point number, and integer, a 4-byte integer. There are circumstances where storing and processing integer data provides the dual advantages of using less memory and decreasing processing time. For example, when working with integers, a tabulation is generally much faster than sorting and gives exact values for all empirical quantiles. Even when you are not working with integers, scaling and converting to integers can produce fast and accurate estimates of quantiles. As an example, if the data consists of floating point values in the range from 0 to 1,000, converting to integers and tabulating will bound the median, or any other quantile, to within two adjacent integers. Interpolation within that interval can then get you an even closer approximation.
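A minimal sketch of the idea, using simulated data in the range 0 to 1,000:

```r
x  <- runif(1e7, min = 0, max = 1000)   # simulated floating point data
xi <- as.integer(round(x))              # scale/convert to integers

# tabulate() counts occurrences of 1, 2, ..., nbins, so shift by 1 to
# allow for the value 0; this is far faster than sorting 10 million values
counts <- tabulate(xi + 1L, nbins = 1001L)

# The median lies in the first bin where the cumulative count passes n/2
cum <- cumsum(counts)
median_int <- which(cum >= length(xi) / 2)[1] - 1L
median_int                              # bounds the true median within 1 unit
```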
7 Store data efficiently
When big data has to be accessed efficiently from disk, appropriate data types should be used to save storage space and access time. Prefer integers when they can represent the data, and prefer 32-bit floats to 64-bit doubles on disk: a float carries about 7 decimal digits of precision, which is more than enough for most data, and takes up half the space of a double. Save the 64-bit doubles for computations.
8 Only read the data needed
Reading from disk only the variables needed for the computations and analysis, instead of reading the whole data set, can speed up the analysis considerably.
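With read.csv(), for example, unneeded columns can be skipped at read time via colClasses; the file layout here is an assumption:

```r
# Suppose big.csv has five columns but the analysis needs only two of them;
# "NULL" entries tell read.csv() to skip those columns entirely
keep <- c("NULL", "numeric", "NULL", "NULL", "factor")
df <- read.csv("big.csv", colClasses = keep)
```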
9 Avoid loops when transforming data
Loops in R can be very slow compared with R's core vector operations, which are typically written in C, C++, or Fortran, so they should be avoided when transforming data.
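A quick comparison of the two styles:

```r
x <- rnorm(1e7)

# Explicit loop: each element handled by interpreted R code
system.time({
  y <- numeric(length(x))
  for (i in seq_along(x)) y[i] <- x[i]^2
})

# Vectorized: a single call into compiled code, typically orders
# of magnitude faster
system.time(y2 <- x^2)
```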
10 Use C, C++, or Fortran for critical functions
Since R integrates easily with other languages, including C, C++, and Fortran, you can pass R data objects to another language, do some computations, and return the results in R data objects. This makes compiled languages a good choice for critical functions. The CRAN package Rcpp, for example, makes it easy to call C and C++ code from R.
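A minimal sketch with Rcpp's cppFunction(), which compiles a C++ function and binds it into the R session; the function itself is just an illustration:

```r
library(Rcpp)

# Compile a small C++ function and expose it to R
cppFunction('
double sum_sq(NumericVector x) {
  double total = 0.0;
  for (int i = 0; i < x.size(); ++i) {
    total += x[i] * x[i];       // tight loop runs at C++ speed
  }
  return total;
}')

sum_sq(rnorm(10))
```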
11 Process data transformations in batches
To avoid the overhead of making multiple passes over large data sets, write chunking algorithms that apply all of the transformations to each chunk. RevoScaleR's rxDataStep() function is designed for this kind of one-pass processing, permitting multiple data transformations to be performed on each chunk.
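A sketch of what batching transformations with rxDataStep() can look like; the file names and variables are assumptions, and the call should be checked against the RevoScaleR documentation:

```r
# Apply several transformations in a single pass over an .xdf file;
# each expression in `transforms` is evaluated on every chunk
rxDataStep(inData  = "flights.xdf",          # assumed input file
           outFile = "flights_out.xdf",      # assumed output file
           transforms = list(
             DelayHrs  = ArrDelay / 60,      # assumed variable ArrDelay
             LongDelay = ArrDelay > 120
           ))
```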
12 Use row-oriented data transformations where possible
When writing chunking algorithms, try to avoid algorithms that cross chunk boundaries. In general, data transformations for a single row of data should not be dependent on values in other rows. The key idea is that a transformation expression should give the same result even if only some of the rows of data are in memory at one time. Data manipulations requiring lags can be done but require special handling.
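A small illustration of the distinction, with assumed column names and a chunk already in memory as a data frame:

```r
# Chunk-safe: each row's result depends only on values in that row
chunk$bmi <- chunk$weight / chunk$height^2

# Not chunk-safe: each row depends on the previous row, so the first
# row of every chunk would need the last row of the preceding chunk
chunk$change <- c(NA, diff(chunk$value))
```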
13 Handle categorical variables efficiently and with care
Working with categorical or factor variables in big data sets can be challenging. For example, using R's factor() function in a transformation on a chunk of data without explicitly specifying all of the levels present in the entire data set can produce incompatible factor levels from chunk to chunk. Also, building models with factors that have hundreds of levels may cause hundreds of dummy variables to be created, which eat up memory. The functions in the RevoScaleR package that deal with factors minimize memory use and generally do not explicitly create dummy variables to represent factors.
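A minimal sketch of the pitfall and the fix:

```r
chunk1 <- c("low", "high")
chunk2 <- c("medium")

# Calling factor() per chunk yields incompatible level sets
factor(chunk1)   # levels: high, low
factor(chunk2)   # levels: medium

# Specifying the full set of levels up front keeps the coding consistent
all_levels <- c("low", "medium", "high")    # assumed full set of categories
factor(chunk1, levels = all_levels)
factor(chunk2, levels = all_levels)
```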
14 Be aware of output with the same number of rows as your input
When output has the same number of rows as the data, for example, when computing predictions and residuals, the output should be written out to a file rather than kept in memory.
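A sketch of the pattern; get_chunk() is a hypothetical helper that returns the next chunk of the data set (or NULL when done), and fit is an already-fitted model:

```r
con <- file("predictions.csv", open = "w")
first <- TRUE
while (!is.null(chunk <- get_chunk())) {      # hypothetical chunk reader
  out <- data.frame(pred = predict(fit, newdata = chunk))
  # Append this chunk's predictions to disk instead of holding them in memory
  write.table(out, con, sep = ",",
              row.names = FALSE, col.names = first)
  first <- FALSE                              # write the header only once
}
close(con)
```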
15 Think Twice Before Sorting
Since sorting is a time-intensive operation, implementations of algorithms that avoid it should be preferred. In R, the RevoScaleR function rxDTree() avoids sorting by working with histograms of the data rather than with the raw data itself.