letterFrequency {Biostrings} | R Documentation |
Given a biological sequence (or a set of biological sequences), the alphabetFrequency function computes the frequency of each letter in the (base) alphabet.
The letterFrequencyInSlidingView function is a more specialized version of alphabetFrequency that computes the frequencies of a set of letters in a view (or window) that is conceptually sliding along the input sequence.
The consensusMatrix function computes the consensus matrix of a set of sequences, and the consensusString function creates the consensus sequence based on a 50% + 1 vote from the consensus matrix (using the "?" letter to represent the lack of consensus).
In this man page we call "DNA input" (or "RNA input") an XString, XStringSet, XStringViews or MaskedXString object of base type DNA (or RNA).
alphabetFrequency(x, as.prob=FALSE, freq=FALSE, ...)
hasOnlyBaseLetters(x)
uniqueLetters(x)
letterFrequencyInSlidingView(x, view.width, letters, OR="|")
consensusMatrix(x, as.prob=FALSE, freq=FALSE, shift=0L, width=NULL, ...)
## S4 method for signature 'matrix':
consensusString(x, ambiguityMap="?", threshold=0.5)
## S4 method for signature 'DNAStringSet':
consensusString(x, ambiguityMap=IUPAC_CODE_MAP, threshold=0.25, shift=0L, width=NULL)
## S4 method for signature 'RNAStringSet':
consensusString(x, ambiguityMap=as.character(RNAStringSet(DNAStringSet(IUPAC_CODE_MAP))), threshold=0.25, shift=0L, width=NULL)
x |
An XString, XStringSet, XStringViews or MaskedXString object for alphabetFrequency or uniqueLetters.
DNA or RNA input for
An XString object for
A character vector, or an XStringSet or XStringViews
object for
A consensus matrix (as returned by |
as.prob |
If TRUE then probabilities are reported,
otherwise counts (the default).
|
freq |
This argument is deprecated.
Please use the as.prob argument instead.
|
view.width |
For letterFrequencyInSlidingView, the constant (e.g. 35, 48, 1000) size of the "window" to slide along x.
The specified letters are tabulated in each window of length view.width.
The rows of the result (see value) correspond to the various windows.
|
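The windowing described for view.width can be emulated in base R on a plain character string. The following is a minimal sketch for illustration only, not the Biostrings implementation; the helper name sliding_count is hypothetical, and it counts a single letter rather than a set of letters.

```r
# Hypothetical helper (not part of Biostrings): count occurrences of one
# letter in every window of width 'view.width' along the string 's'.
sliding_count <- function(s, view.width, letter) {
  chars <- strsplit(s, "")[[1]]
  n_windows <- length(chars) - view.width + 1L
  stopifnot(n_windows >= 1L)
  hits <- as.integer(chars == letter)
  # Prefix sums let each window be counted with a single subtraction.
  cs <- c(0L, cumsum(hits))
  starts <- seq_len(n_windows)
  cs[starts + view.width] - cs[starts]
}

# One count per window: "ACGT", "CGTA", "GTAC", "TACC", "ACCA".
counts <- sliding_count("ACGTACCA", view.width = 4, letter = "C")
```

The result has L - view.width + 1 elements for a string of length L, which matches the number of rows described for the real function.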
letters |
For letterFrequencyInSlidingView, a character vector (e.g. "C", "CG", c("C", "G")) giving the letters to tabulate.
Unless OR=0, each multi-character element of letters (nchar > 1) is treated as a group of letters to be tabulated together ("or"-ed), as if the alphabetFrequency counts of its letters were added.
The columns of the result (see value) correspond to the individual letters and the groups of letters, each of which is counted separately.
Unrelated (and, with some post-processing, related) counts may of course be obtained in separate calls.
|
OR |
For letterFrequencyInSlidingView, the string (default "|") to use as a separator in forming names for the "grouped" columns, e.g. "C|G".
The otherwise exceptional value 0 (zero) disables or'ing and is provided for convenience, allowing a single multi-character string (or several strings) of letters that should be counted separately.
If some but not all letters are to be counted separately, they must reside in separate elements of letters (with nchar 1 unless they are to be grouped with other letters), and OR cannot be 0.
|
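The way letters and OR together determine the columns of the result can be sketched in base R. This is an illustrative reading of the description above, not the package's code; column_groups is a hypothetical helper that only builds the column names and letter groups, without counting anything.

```r
# Hypothetical helper (not part of Biostrings): map 'letters' and 'OR' to
# named groups of letters, one group per column of the result.
column_groups <- function(letters, OR = "|") {
  split_chars <- strsplit(letters, "")
  if (is.numeric(OR) && OR == 0) {
    # OR = 0 disables grouping: every single letter gets its own column.
    singles <- unlist(split_chars)
    return(setNames(as.list(singles), singles))
  }
  # Otherwise each element of 'letters' is one column; multi-character
  # elements are named by joining their letters with the separator.
  names(split_chars) <- vapply(split_chars, paste, character(1), collapse = OR)
  split_chars
}

grouped  <- column_groups(c("A", "CG"))          # columns "A" and "C|G"
separate <- column_groups(c("A", "CG"), OR = 0)  # columns "A", "C", "G"
```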
ambiguityMap |
Either a single character to use when agreement is not reached or
a named character vector where the names are the ambiguity characters
and the values are the combinations of letters that comprise the
ambiguity (e.g. IUPAC_CODE_MAP).
|
threshold |
The minimum probability threshold for an agreement to be declared.
When ambiguityMap is a single character, threshold
is a single number in (0, 1].
When ambiguityMap is a named character vector
(e.g. IUPAC_CODE_MAP), threshold
is a single number in (0, 1/sum(rowSums(x) > 0)].
|
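For the single-character ambiguityMap case, the role of threshold can be illustrated with a small base-R sketch. It is not the Biostrings implementation: simple_consensus is a hypothetical helper, it works on equal-length plain strings, and it assumes that agreement means exactly one letter reaches the threshold in a column.

```r
# Hypothetical helper (not part of Biostrings): consensus of equal-length
# strings, falling back to 'ambiguity' when no agreement is reached.
simple_consensus <- function(seqs, ambiguity = "?", threshold = 0.5) {
  chars <- do.call(rbind, strsplit(seqs, ""))  # one row per sequence
  consensus <- apply(chars, 2, function(col) {
    probs <- table(col) / length(col)
    winners <- names(probs)[probs >= threshold]
    # Assumption: agreement requires exactly one letter at or above threshold.
    if (length(winners) == 1L) winners else ambiguity
  })
  paste(consensus, collapse = "")
}

# Column 3 is a 50/50 tie and column 4 has no letter at 50%, so under the
# assumption above both fall back to the ambiguity character.
result <- simple_consensus(c("ACGT", "ACGA", "ACTC", "AGTG"))
```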
... |
Further arguments to be passed to or from other methods.
For the XStringViews and XStringSet methods,
the
For DNA or RNA input, the |
shift |
An integer vector (recycled to the length of x) specifying how
each sequence in x should be (horizontally) shifted with respect
to the first column of the consensus matrix to be returned.
By default (shift=0), each sequence in x has its
first letter aligned with the first column of the matrix.
A positive shift value means that the corresponding sequence
must be shifted to the right, and a negative shift value
that it must be shifted to the left.
For example, a shift of 5 means that it must be shifted 5 positions
to the right (i.e. the first letter in the sequence must be aligned
with the 6th column of the matrix), and a shift of -3 means that
it must be shifted 3 positions to the left (i.e. the 4th letter in
the sequence must be aligned with the first column of the matrix).
|
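The effect of shift on the columns of a consensus matrix can be sketched in base R. This is an illustration under simplifying assumptions, not the package's code: shifted_counts is a hypothetical helper, the alphabet is restricted to A, C, G and T, and the number of columns is chosen to be just enough to hold the rightmost shifted letter.

```r
# Hypothetical helper (not part of Biostrings): per-column letter counts
# after shifting each string horizontally by the corresponding 'shift'.
shifted_counts <- function(seqs, shift = 0L, alphabet = c("A", "C", "G", "T")) {
  shift <- rep_len(as.integer(shift), length(seqs))  # recycle to length(seqs)
  ncol_out <- max(nchar(seqs) + shift)
  m <- matrix(0L, nrow = length(alphabet), ncol = ncol_out,
              dimnames = list(alphabet, NULL))
  for (k in seq_along(seqs)) {
    chars <- strsplit(seqs[k], "")[[1]]
    cols <- seq_along(chars) + shift[k]  # letter i lands in column i + shift
    # Letters shifted past the left edge, or outside the alphabet, are dropped.
    keep <- cols >= 1L & cols <= ncol_out & chars %in% alphabet
    idx <- cbind(match(chars[keep], alphabet), cols[keep])
    m[idx] <- m[idx] + 1L
  }
  m
}

# The second sequence is shifted one position to the right.
cm <- shifted_counts(c("ACG", "ACG"), shift = c(0L, 1L))
```

Summing the columns of cm gives the number of sequences covering each position.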
width |
The number of columns of the returned matrix for the consensusMatrix
method for XStringSet objects.
When width=NULL (the default), this method returns a matrix
that has just enough columns to have its last column aligned
with the rightmost letter of all the sequences in x after
those sequences have been shifted (see the shift argument above).
This ensures that any wider consensus matrix would be a "padded with zeros"
version of the matrix returned when width=NULL.
The length of the returned sequence for the |
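The "padded with zeros" property described for the width argument of consensusMatrix can be illustrated in base R. The helper pad_width is hypothetical and only covers the case where the requested width is at least the default one; it is a sketch, not the package's code.

```r
# Hypothetical helper (not part of Biostrings): widen a count matrix to
# 'width' columns by appending all-zero columns on the right.
pad_width <- function(m, width) {
  stopifnot(width >= ncol(m))
  pad <- matrix(0L, nrow = nrow(m), ncol = width - ncol(m))
  cbind(m, pad)
}

# A 2 x 2 count matrix standing in for the width=NULL result.
narrow <- matrix(c(1L, 0L, 0L, 1L), nrow = 2,
                 dimnames = list(c("A", "C"), NULL))
wide <- pad_width(narrow, 4)
```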
alphabetFrequency and letterFrequencyInSlidingView are generic functions defined in the Biostrings package.
letterFrequencyInSlidingView is a much lighter alternative to calling alphabetFrequency, without collapse, on the hypothetical XStringViews object consisting of every interval of length view.width on x.
If x is masked (MaskedXString), it is treated as the XStringSet of its visible segments.
To include the masked regions (as well as the intervals of length view.width-1 which immediately precede them), use unmasked(x) or DNAString(x) as the subject.
alphabetFrequency returns an integer vector when x is an XString or MaskedXString object. When x is an XStringSet or XStringViews object, then it returns an integer matrix with length(x) rows where the i-th row contains the frequencies for x[[i]].
If x is a DNA or RNA input, then the returned vector is named with the letters in the alphabet. If the baseOnly argument is TRUE, then the returned vector has only 5 elements: 4 elements corresponding to the 4 nucleotides + the 'other' element.
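The shape of the baseOnly=TRUE result (a named vector for one sequence, a matrix with one row per sequence for several) can be mimicked in base R. This is a sketch on plain character strings, not the Biostrings implementation; base_only_freq is a hypothetical helper and DNA bases are assumed.

```r
# Hypothetical helper (not part of Biostrings): counts of the 4 DNA bases
# plus an 'other' element for every remaining letter.
base_only_freq <- function(s, bases = c("A", "C", "G", "T")) {
  chars <- strsplit(s, "")[[1]]
  counts <- vapply(bases, function(b) sum(chars == b), integer(1))
  c(counts, other = length(chars) - sum(counts))
}

# One sequence: a named integer vector of length 5.
one <- base_only_freq("ACGTNNA")

# Several sequences: one row per sequence, one column per element.
many <- t(vapply(c("ACGT", "AANN"), base_only_freq, integer(5)))
```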
letterFrequencyInSlidingView returns for each XString element of x, say s of length L, an integer matrix with L-view.width+1 rows, the i-th of which holds the various letter frequencies in the i-th "window along s", i.e. substring(s, i, i+view.width-1).
hasOnlyBaseLetters returns TRUE or FALSE indicating whether or not x contains only base letters (i.e. As, Cs, Gs and Ts for DNA input and As, Cs, Gs and Us for RNA input).
uniqueLetters returns a vector of 1-letter or empty strings. The empty string is used to represent the nul character if x happens to contain any. Note that this can only happen if the base class of x is BString.
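The two checks described above can be approximated on plain character strings in base R. The helpers has_only_base and unique_letters are hypothetical names for this sketch, which ignores the nul-character case and makes no claim about the order in which the real uniqueLetters returns its letters.

```r
# Hypothetical helpers (not part of Biostrings), operating on plain strings.

# TRUE if every letter of 's' is one of the given base letters.
has_only_base <- function(s, bases = c("A", "C", "G", "T")) {
  all(strsplit(s, "")[[1]] %in% bases)
}

# Distinct letters of 's', in order of first appearance.
unique_letters <- function(s) {
  unique(strsplit(s, "")[[1]])
}

clean <- has_only_base("ACGTTGCA")   # only A, C, G and T
mixed <- has_only_base("ACGNNT")     # contains the non-base letter N
seen  <- unique_letters("ACGNNT")
```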
An integer matrix with letters as row names for consensusMatrix.
A standard character string for consensusString.
H. Pages and P. Aboyoun; H. Jaffee for letterFrequencyInSlidingView
alphabet, coverage, oligonucleotideFrequency, countPDict, XString-class, XStringSet-class, XStringViews-class, MaskedXString-class, strsplit
## ---------------------------------------------------------------------
## alphabetFrequency()
## ---------------------------------------------------------------------
data(yeastSEQCHR1)
yeast1 <- DNAString(yeastSEQCHR1)
alphabetFrequency(yeast1)
alphabetFrequency(yeast1, baseOnly=TRUE)
hasOnlyBaseLetters(yeast1)
uniqueLetters(yeast1)

## With input made of multiple sequences:
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
alphabetFrequency(probes[1:50], baseOnly=TRUE)
alphabetFrequency(probes, baseOnly=TRUE, collapse=TRUE)

## ---------------------------------------------------------------------
## letterFrequencyInSlidingView()
## ---------------------------------------------------------------------
data(yeastSEQCHR1)
x <- DNAString(yeastSEQCHR1)
view.width <- 48
letters <- c("A", "CG")
two_columns <- letterFrequencyInSlidingView(x, view.width, letters)
head(two_columns)
tail(two_columns)
three_columns <- letterFrequencyInSlidingView(x, view.width, letters, OR=0)
head(three_columns)
tail(three_columns)
stopifnot(identical(two_columns[ , "C|G"],
                    three_columns[ , "C"] + three_columns[ , "G"]))

## Note that, alternatively, 'three_columns' can also be obtained by
## creating the views on 'x' (as a Views object) and by calling
## alphabetFrequency() on it. But, of course, that is *much* less
## efficient (both in terms of memory and speed) than using
## letterFrequencyInSlidingView():
v <- Views(x, start=seq_len(length(x) - view.width + 1), width=view.width)
v
three_columns2 <- alphabetFrequency(v, baseOnly=TRUE)[ , c("A", "C", "G")]
stopifnot(identical(three_columns2, three_columns))

## Set the width of the view to length(x) to get the global frequencies:
letterFrequencyInSlidingView(x, letters="ACGTN", view.width=length(x), OR=0)

## ---------------------------------------------------------------------
## consensus*()
## ---------------------------------------------------------------------
## Read in ORF data:
file <- system.file("extdata", "someORF.fa", package="Biostrings")
orf <- read.DNAStringSet(file)

## To illustrate, the following example assumes the ORF data
## to be aligned for the first 10 positions (patently false):
orf10 <- DNAStringSet(orf, end=10)
consensusMatrix(orf10, baseOnly=TRUE)

## The following example assumes the first 10 positions to be aligned
## after some incremental shifting to the right (patently false):
consensusMatrix(orf10, baseOnly=TRUE, shift=0:6)
consensusMatrix(orf10, baseOnly=TRUE, shift=0:6, width=10)

## For the character matrix containing the "exploded" representation
## of the strings, do:
as.matrix(orf10, use.names=FALSE)

## consensusMatrix() can be used to just compute the alphabet frequency
## for each position in the input sequences:
consensusMatrix(probes, baseOnly=TRUE)

## After sorting, the first 5 probes might look similar (at least on
## their first bases):
consensusString(sort(probes)[1:5])
consensusString(sort(probes)[1:5], ambiguityMap = "N", threshold = 0.5)

## ---------------------------------------------------------------------
## C. RELATIONSHIP BETWEEN consensusMatrix() AND coverage()
## ---------------------------------------------------------------------
## Applying colSums() on a consensus matrix gives the coverage that
## would be obtained by piling up (after shifting) the input sequences
## on top of an (imaginary) reference sequence:
cm <- consensusMatrix(orf10, shift=0:6, width=10)
colSums(cm)

## Note that this coverage can also be obtained with:
as.integer(coverage(IRanges(rep(1, length(orf)), width(orf)), shift=0:6, width=10))