kmeans.big.matrix {bigmemory}		R Documentation

bigmemory's memory-efficient k-means

Description

k-means cluster analysis without the memory overhead, and possibly in parallel using shared memory.

Usage

kmeans.big.matrix(x, centers, iter.max = 10, nstart = 1,
                  algorithm = "MacQueen", tol = 1e-8,
                  parallel = NA, nwssleigh = NULL)

Arguments

x a big.matrix object.
centers a scalar denoting the number of clusters, or for k clusters, a k by ncol(x) matrix.
iter.max the maximum number of iterations.
nstart number of random starts, to be done in parallel if possible.
algorithm only MacQueen's algorithm has been implemented at this point.
tol the convergence tolerance, not used at this point.
parallel "nws" for NetWorkSpaces; we couldn't include "snow" because CRAN doesn't distribute it for Windows and so we ran into R CMD check problems.
nwssleigh the NWS sleigh (which should be limited to this workstation and could have multiple processors).
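As a small illustrative sketch (hypothetical data; assumes bigmemory is installed and loaded), centers can be given either as a scalar k or as an explicit k by ncol(x) matrix of starting centers:

```
library(bigmemory)

# Hypothetical sketch: a small big.matrix of 2-d points.
x <- big.matrix(1000, 2, init=0, type="double")
x[,] <- rnorm(2000)

# Passing a scalar asks for k = 3 clusters with random starts:
ans1 <- kmeans.big.matrix(x, centers=3)

# Alternatively, a 3 x ncol(x) matrix supplies three explicit starting centers:
starts <- matrix(c(-1, 0, 1,
                   -1, 0, 1), nrow=3)
ans2 <- kmeans.big.matrix(x, centers=starts, iter.max=25)
```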

Details

The real benefit is the lack of memory overhead compared to the standard kmeans function. With a big.matrix, kmeans.big.matrix() requires essentially no extra memory beyond the data itself (just a vector recording the cluster memberships), whereas kmeans() makes at least two extra copies of the data, and more still when multiple starts (nstart>1) are used. The difference can be surprising, as the examples below illustrate.

If nstart>1 and you are using kmeans.big.matrix() in parallel, a vector of cluster memberships will need to be stored for each random starting point, which could be memory-intensive for large data. This isn't a problem if you are running the multiple starts sequentially.
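As a rough back-of-the-envelope calculation (a sketch assuming each membership is stored as a 4-byte integer; the exact storage may differ), the extra memory for parallel multiple starts grows as nrow(x) * nstart * 4 bytes:

```
# Rough estimate of the extra memory needed to hold one membership
# vector per random start when running in parallel.
n      <- 1e8   # rows in the big.matrix
nstart <- 8     # random starts run in parallel
bytes  <- n * nstart * 4   # assuming 4-byte integer memberships
bytes / 2^30               # roughly 3 GB of additional memory
```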

Unless you have a really big data set (where a single run of kmeans not only burns memory but takes more than a few seconds), using NWS for parallel computing of multiple random starts is unlikely to be much faster than running iteratively.

Value

An object of class kmeans, just as produced by kmeans().

Note

You should make sure you have used NetWorkSpaces successfully with a simple example from their help pages before trying kmeans.big.matrix(). Also, bear in mind that the shared-memory parallel algorithm works only on the local machine (it does not share resources across a cluster of machines). It doesn't make sense to allocate more clients than the number of available cores.

If you are unfamiliar with NetWorkSpaces, it requires both the nws package (available from CRAN) and a NetWorkSpaces server (freely available but somewhat more work to get started). We will keep a page updated with links and information to help you get started: http://www.stat.yale.edu/~jay/nws/.

Author(s)

John W. Emerson and Michael J. Kane

See Also

big.matrix

Examples

# Simple example (with one processor, because we don't want to require the
# installation of package nws here):

  library(bigmemory)
  x <- big.matrix(100000, 3, init=0, type="double")
  x[seq(1,100000,by=2),] <- rnorm(150000)
  x[seq(2,100000,by=2),] <- rnorm(150000, 5, 1)
  head(x)
  ans <- kmeans.big.matrix(x, 2, nstart=5)    # Sequential multiple starts.

  # To use NWS, try something like the following:
  ## Not run: 
    library(nws)
    s <- sleigh(nwsHost='yourhostname.xxx.yyy.zzz', workerCount=2)
    ans <- kmeans.big.matrix(x, 2, nstart=5, parallel='nws', nwssleigh=s)
    stopSleigh(s)
  
## End(Not run)

  # Both the following are run iteratively, but with less memory overhead using
  # kmeans.big.matrix.  Note that this first gc() doesn't reflect the C++
  # memory usage for the big.matrix, but the maximum memory used is about
  # 35 MB after kmeans.big.matrix().
  gc(reset=TRUE)
  time.new <- system.time(print(kmeans.big.matrix(x, 2, nstart=5)$centers))
  gc()
  y <- x[,]
  rm(x)
  # In contrast, the regular kmeans() really burns through the memory:
  gc(reset=TRUE)
  time.old <- system.time(print(kmeans(y, 2, nstart=5)$centers))
  gc()
  # The kmeans.big.matrix() centers should match the kmeans() centers, with
  # less memory overhead and typically a faster run. The slowdown isn't in the
  # guts of the kmeans() implementation (the algorithm is in C, well-implemented),
  # but in the traditional C/R interface and the R code managing the objects
  # and nstart:
  time.new
  time.old

[Package bigmemory version 2.3 Index]