XStringSet-class {Biostrings}    R Documentation

XStringSet objects

Description

The BStringSet class is a container for storing a set of BString objects and for making their manipulation easy and efficient.

Similarly, the DNAStringSet (or RNAStringSet, or AAStringSet) class is a container for storing a set of DNAString (or RNAString, or AAString) objects.

All those containers derive directly (and with no additional slots) from the XStringSet virtual class.

Usage

## Constructors:
BStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
DNAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
RNAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
AAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)

## Accessor-like methods:
## S4 method for signature 'character'
width(x)
## S4 method for signature 'XStringSet'
nchar(x, type="chars", allowNA=FALSE)

## ... and more (see below)

Arguments

x Either a character vector (with no NAs), or an XString, XStringSet or XStringViews object.
start,end,width Either NA, a single integer, or an integer vector of the same length as x specifying how x should be "narrowed" (see ?narrow for the details).
use.names TRUE or FALSE. Should names be preserved?
type,allowNA Ignored.

Details

The BStringSet, DNAStringSet, RNAStringSet and AAStringSet functions are constructors that can be used to turn input x into an XStringSet object of the desired base type.

They also allow the user to "narrow" the sequences contained in x via proper use of the start, end and/or width arguments. In this context, "narrowing" means dropping a prefix and/or a suffix of each sequence in x. The "narrowing" capabilities of these constructors can be illustrated by the following property: if x is a character vector (with no NAs), or an XStringSet (or XStringViews) object, then the following 3 transformations are equivalent:

BStringSet(x, start=mystart, end=myend, width=mywidth)
subseq(BStringSet(x), start=mystart, end=myend, width=mywidth)
BStringSet(subseq(x, start=mystart, end=myend, width=mywidth))

Note that, besides being more convenient, the first form is also more efficient on character vectors.
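
The equivalence can be checked on a toy input with one start value per sequence. This is a minimal sketch, assuming the Biostrings package is attached; the vector x and the values of mystart and mywidth are made up for illustration.

```r
library(Biostrings)

# Hypothetical toy input.
x <- c("ACGTACGT", "GGCCTTAA", "TTTTCCCC")
mystart <- c(1L, 3L, 5L)   # one start per sequence
mywidth <- 4L              # a single value, applied to every sequence

# The three forms described above.
form1 <- BStringSet(x, start=mystart, width=mywidth)
form2 <- subseq(BStringSet(x), start=mystart, width=mywidth)
form3 <- BStringSet(subseq(x, start=mystart, width=mywidth))

# Compare the resulting sequences as plain character vectors.
all(as.character(form1) == as.character(form2))
all(as.character(form1) == as.character(form3))
```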

Accessor-like methods

In the code snippets below, x is an XStringSet object.

length(x): The number of sequences in x.
width(x): A vector of non-negative integers containing the number of letters for each element in x. Note that width(x) is also defined for a character vector with no NAs and is equivalent to nchar(x, type="bytes").
names(x): NULL or a character vector of the same length as x containing a short user-provided description or comment for each element in x. These are the only data in an XStringSet object that can safely be changed by the user. All the other data are immutable! As a general recommendation, the user should never try to modify an object by accessing its slots directly.
alphabet(x): Return NULL, DNA_ALPHABET, RNA_ALPHABET or AA_ALPHABET depending on whether x is a BStringSet, DNAStringSet, RNAStringSet or AAStringSet object.
nchar(x): The same as width(x).
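
The accessors above can be tried on a small object. This is a minimal sketch, assuming the Biostrings package is attached; the object dna and its names are hypothetical.

```r
library(Biostrings)

# Hypothetical toy input; any DNA strings would do.
dna <- DNAStringSet(c(seqA="ACGT", seqB="GGATTC", seqC="A"))

length(dna)   # number of sequences
width(dna)    # number of letters in each sequence
nchar(dna)    # same as width(dna)
names(dna)    # user-provided names, or NULL if there are none

# Names are the one part of the object that is meant to be replaced.
names(dna)[3] <- "single_base"
names(dna)

# The alphabet depends on the base type of the container.
alphabet(dna)

# width() also accepts a plain character vector with no NAs.
width(c("ACGT", "GG"))
```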

Subsequence extraction and related transformations

In the code snippets below, x is a character vector (with no NAs), or an XStringSet (or XStringViews) object.

subseq(x, start=NA, end=NA, width=NA): Applies subseq on each element in x. See ?subseq for the details.

Note that this is similar to what substr does on a character vector. However, there are some notable differences:

(1) the arguments are start and stop for substr;

(2) the SEW (start/end/width) interface of subseq is richer (e.g. support for negative start or end values); and

(3) subseq checks that the specified start/end/width values are valid, i.e., unlike substr, it throws an error if they define "out of limits" subsequences or subsequences with a negative width.

narrow(x, start=NA, end=NA, width=NA, use.names=TRUE): Same as subseq. The only differences are: (1) narrow has a use.names argument; and (2) the kinds of objects they work on (IRanges, XStringSet or XStringViews objects for narrow; XVector or XStringSet objects for subseq). On an XStringSet object, both work and do the same thing.
threebands(x, start=NA, end=NA, width=NA): Like the method for IRanges objects, the threebands methods for character vectors and XStringSet objects extend the capability of narrow by returning the 3 sets of subsequences (the left, middle and right subsequences) associated with the narrowing operation. See ?threebands in the IRanges package for the details.
subseq(x, start=NA, end=NA, width=NA) <- value: A vectorized version of the subseq<- method for XVector objects. See ?`subseq<-` for the details.
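
The operations in this section can be sketched as follows, assuming the Biostrings package is attached. The object x is a made-up toy input, and the contrast with substr is only meant to illustrate the validity check described above.

```r
library(Biostrings)

# Hypothetical toy input.
x <- BStringSet(c("abcdefghij", "0123456789"))

# Negative positions count from the end of each sequence:
# this drops the first letter and the last two letters.
subseq(x, start=2, end=-3)

# On an XStringSet object, narrow() does the same thing.
narrow(x, start=2, end=-3)

# substr() silently truncates an out-of-limits request ...
substr("abcdefghij", 8, 15)

# ... whereas subseq() signals an error, captured here with try().
res <- try(subseq(x, start=8, end=15), silent=TRUE)
inherits(res, "try-error")

# threebands() returns the left, middle and right parts of the narrowing.
bands <- threebands(x, start=2, end=-3)
bands

# The replacement form rewrites the same region of every sequence.
subseq(x, start=1, end=3) <- "___"
x
```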

Subsetting and appending

In the code snippets below, x and values are XStringSet objects, and i should be an index specifying the elements to extract.

x[i]: Return a new XStringSet object made of the selected elements.
x[[i]]: Extract the i-th XString object from x.
append(x, values, after=length(x)): Add sequences in values to x.
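
A minimal sketch of subsetting and appending, assuming the Biostrings package is attached; the objects x and values and their names are hypothetical.

```r
library(Biostrings)

# Hypothetical toy inputs.
x <- DNAStringSet(c(a="ACGT", b="GGCC", c="TTAA"))
values <- DNAStringSet(c(d="CATG"))

x[2:3]          # still a DNAStringSet, with 2 elements
x[c("a", "c")]  # subsetting by name
x[[2]]          # a single DNAString object

# Append at the end (the default) or after the first element.
append(x, values)
append(x, values, after=1)
```

Note the difference between x[2], which is a set of length 1, and x[[2]], which is the sequence itself.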

Ordering and related methods

In the code snippets below, x is an XStringSet object.

is.unsorted(x, strictly=FALSE): Return a logical value specifying whether x is unsorted. The strictly argument takes a logical value indicating whether the check should be for _strictly_ increasing values.
order(x): Return a permutation which rearranges x into ascending or descending order.
sort(x): Sort x into ascending order (equivalent to x[order(x)]).
rank(x): Rank x in ascending order.
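
The ordering methods can be sketched on a small set that contains a repeated sequence. This assumes the Biostrings package is attached; the object x is made up for illustration.

```r
library(Biostrings)

# Hypothetical toy input with one repeated sequence.
x <- DNAStringSet(c("TTGA", "ACGT", "GGCC", "ACGT"))

is.unsorted(x)   # is x out of order?
o <- order(x)    # permutation that sorts x
x[o]
sort(x)          # same sequences as x[order(x)]
rank(x)          # rank of each element

# A sorted set that contains a tie is not strictly increasing.
s <- sort(x)
is.unsorted(s)
is.unsorted(s, strictly=TRUE)
```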

Duplicated and unique methods

In the code snippets below, x is an XStringSet object.

duplicated(x): Return a logical vector whose elements denote duplicates in x.
unique(x): Return an XStringSet containing the unique values in x.
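
A minimal sketch, assuming the Biostrings package is attached and using a made-up object x:

```r
library(Biostrings)

# Hypothetical toy input in which "ACGT" occurs three times.
x <- DNAStringSet(c("ACGT", "GGCC", "ACGT", "ACGT"))

duplicated(x)      # flags every repeat of an earlier element
unique(x)          # keeps the first occurrence of each sequence
x[!duplicated(x)]  # the same selection, written as a subset
```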

Set operations

In the code snippets below, x and y are XStringSet objects.

union(x, y, ...): Union of x and y.
intersect(x, y, ...): Intersection of x and y.
setdiff(x, y, ...): Asymmetric set difference of x and y.
setequal(x, y): Set equality of x to y.
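
The set operations can be sketched on two small sets that share one sequence. This assumes the Biostrings package is attached; x and y are hypothetical.

```r
library(Biostrings)

# Hypothetical toy inputs sharing the sequence "TTAA".
x <- DNAStringSet(c("ACGT", "GGCC", "TTAA"))
y <- DNAStringSet(c("TTAA", "CATG"))

union(x, y)          # every distinct sequence found in x or y
intersect(x, y)      # sequences found in both
setdiff(x, y)        # sequences in x that are not in y
setdiff(y, x)        # the difference is asymmetric
setequal(x, rev(x))  # element order does not matter
```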

Identical value matching

In the code snippets below, x is a character vector, XString, or XStringSet object and table is an XStringSet object.

x %in% table: Returns a logical vector indicating which elements in x match identically with an element in table.
match(x, table, nomatch = NA_integer_, incomparables = NULL): Returns an integer vector containing the first positions of an identical match in table for the elements in x.
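
A minimal sketch of identical value matching, assuming the Biostrings package is attached. The names x and tbl are made up; tbl is used instead of table only to avoid confusion with the base R function of that name.

```r
library(Biostrings)

# Hypothetical toy inputs; "ACGT" occurs twice in the lookup set.
tbl <- DNAStringSet(c("ACGT", "GGCC", "ACGT"))
x <- DNAStringSet(c("GGCC", "ACGT", "TTTT"))

x %in% tbl     # is each element of x present in tbl?
match(x, tbl)  # position of the first identical match, or NA

# A character vector can be used on the left-hand side as well.
match("ACGT", tbl)
```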

Other methods

In the code snippets below, x is an XStringSet object.

unlist(x): Turns x into an XString object by combining the sequences in x together. Fast equivalent to do.call(c, as.list(x)).
as.character(x, use.names): Convert x to a character vector of the same length as x. use.names controls whether or not names(x) should be used to set the names of the returned vector (default is TRUE).
as.matrix(x, use.names): Return a character matrix containing the "exploded" representation of the strings. This can only be used on an XStringSet object with equal-width strings. use.names controls whether or not names(x) should be used to set the row names of the returned matrix (default is TRUE).
toString(x): Equivalent to toString(as.character(x)).
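
The conversions above can be sketched as follows, assuming the Biostrings package is attached. The object x is a made-up set of equal-width sequences so that as.matrix applies.

```r
library(Biostrings)

# Hypothetical toy input; all sequences have width 4.
x <- DNAStringSet(c(s1="ACGT", s2="GGCC", s3="TTAA"))

unlist(x)                         # one DNAString of length 12
as.character(x)                   # character vector carrying the names
as.character(x, use.names=FALSE)  # same strings, names dropped
as.matrix(x)                      # 3 x 4 matrix, one letter per cell
toString(x)                       # single comma-separated string
```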

Author(s)

H. Pages

See Also

XString-class, XStringViews-class, XStringSetList-class, subseq, narrow, substr, compact, XVectorList-class

Examples

  ## ---------------------------------------------------------------------
  ## A. USING THE XStringSet CONSTRUCTORS ON A CHARACTER VECTOR OR FACTOR
  ## ---------------------------------------------------------------------
  ## Note that there is no XStringSet() constructor, but an XStringSet
  ## family of constructors: BStringSet(), DNAStringSet(), RNAStringSet(),
  ## etc...
  x0 <- c("#CTC-NACCAGTAT", "#TTGA", "TACCTAGAG")
  width(x0)
  x1 <- BStringSet(x0)
  x1

  ## 3 equivalent ways to obtain the same BStringSet object:
  BStringSet(x0, start=4, end=-3)
  subseq(x1, start=4, end=-3)
  BStringSet(subseq(x0, start=4, end=-3))

  dna0 <- DNAStringSet(x0, start=4, end=-3)
  dna0
  names(dna0)
  names(dna0)[2] <- "seqB"
  dna0

  ## When the input vector contains a lot of duplicates, turning it into
  ## a factor first before passing it to the constructor will produce an
  ## XStringSet object that is more compact in memory:
  library(hgu95av2probe)
  x2 <- sample(hgu95av2probe$sequence, 999000, replace=TRUE)
  dna2a <- DNAStringSet(x2)
  dna2b <- DNAStringSet(factor(x2))  # slower but result is more compact
  object.size(dna2a)
  object.size(dna2b)

  ## ---------------------------------------------------------------------
  ## B. USING THE XStringSet CONSTRUCTORS ON A SINGLE SEQUENCE (XString
  ##    OBJECT OR CHARACTER STRING)
  ## ---------------------------------------------------------------------
  x3 <- "abcdefghij"
  BStringSet(x3, start=2, end=6:2)  # behaves like 'substring(x3, 2, 6:2)'
  BStringSet(x3, start=-(1:6))
  x4 <- BString(x3)
  BStringSet(x4, end=-(1:6), width=3)

  ## Randomly extract 1 million 40-mers from C. elegans chrI:
  extractRandomReads <- function(subject, nread, readlength)
  {
      if (!is.integer(readlength))
          readlength <- as.integer(readlength)
      start <- sample(length(subject) - readlength + 1L, nread,
                      replace=TRUE)
      DNAStringSet(subject, start=start, width=readlength)
  }
  library(BSgenome.Celegans.UCSC.ce2)
  rndreads <- extractRandomReads(Celegans$chrI, 1000000, 40)
  ## Notes:
  ## - This takes only 2 or 3 seconds versus several hours for a solution
  ##   using substring() on a standard character string.
  ## - The short sequences in 'rndreads' can be seen as the result of a
  ##   simulated high-throughput sequencing experiment. A non-realistic
  ##   one though because:
  ##     (a) It assumes that the underlying technology is perfect (the
  ##         generated reads have no technology induced errors).
  ##     (b) It assumes that the sequenced genome is exactly the same as the
  ##         reference genome.
  ##     (c) The simulated reads can contain IUPAC ambiguity letters only
  ##         because the reference genome contains them. In a real
  ##         high-throughput sequencing experiment, the sequenced genome
  ##         of course doesn't contain those letters, but the sequencer
  ##         can introduce them in the generated reads to indicate ambiguous
  ##         base-calling.
  ##     (d) The simulated reads come from the plus strand only of a single
  ##         chromosome.
  ## - See the getSeq() function in the BSgenome package for how to
  ##   circumvent (d) i.e. how to generate reads that come from the whole
  ##   genome (plus and minus strands of all chromosomes).

  ## ---------------------------------------------------------------------
  ## C. USING THE XStringSet CONSTRUCTORS ON AN XStringSet OBJECT
  ## ---------------------------------------------------------------------
  library(drosophila2probe)
  probes <- DNAStringSet(drosophila2probe)
  probes

  RNAStringSet(probes, start=2, end=-5)  # does NOT copy the sequence data!

  ## ---------------------------------------------------------------------
  ## D. USING subseq() ON AN XStringSet OBJECT
  ## ---------------------------------------------------------------------
  subseq(probes, start=2, end=-5)

  subseq(probes, start=13, end=13) <- "N"
  probes

  ## Add/remove a prefix:
  subseq(probes, start=1, end=0) <- "--"
  probes
  subseq(probes, end=2) <- ""
  probes

  ## Do more complicated things:
  subseq(probes, start=4:7, end=7) <- c("YYYY", "YYY", "YY", "Y")
  subseq(probes, start=4, end=6) <- subseq(probes, start=-2:-5)
  probes

  ## ---------------------------------------------------------------------
  ## E. UNLISTING AN XStringSet OBJECT
  ## ---------------------------------------------------------------------
  library(drosophila2probe)
  probes <- DNAStringSet(drosophila2probe)
  unlist(probes)

  ## ---------------------------------------------------------------------
  ## F. COMPACTING AN XStringSet OBJECT
  ## ---------------------------------------------------------------------
  ## As a particular type of XVectorList object, XStringSet objects can
  ## optionally be compacted. Compacting is typically done before
  ## serialization. See ?compact for more information.
  library(drosophila2probe)
  probes <- DNAStringSet(drosophila2probe)

  y <- subseq(probes[1:12], start=5)
  probes@pool
  y@pool
  object.size(probes)
  object.size(y)

  y0 <- compact(y)
  y0@pool
  object.size(y0)

[Package Biostrings version 2.20.1 Index]