pairwiseAlignment {Biostrings} | R Documentation |
Solves (Needleman-Wunsch) global alignment, (Smith-Waterman) local alignment, and (ends-free) overlap alignment problems.
pairwiseAlignment(pattern, subject, ...) ## S4 method for signature 'XStringSet,XStringSet': pairwiseAlignment(pattern, subject, patternQuality = PhredQuality(22L), subjectQuality = PhredQuality(22L), type = "global", substitutionMatrix = NULL, fuzzyMatrix = NULL, gapOpening = -10, gapExtension = -4, scoreOnly = FALSE) ## S4 method for signature 'QualityScaledXStringSet,QualityScaledXStringSet': pairwiseAlignment(pattern, subject, type = "global", substitutionMatrix = NULL, fuzzyMatrix = NULL, gapOpening = -10, gapExtension = -4, scoreOnly = FALSE)
pattern |
a character vector of any length, an XString , or
an XStringSet object. |
subject |
a character vector of length 1 or an XString
object. |
patternQuality, subjectQuality |
objects of class
XStringQuality representing the respective quality scores for
pattern and subject that are used in a quality-based method
for generating a substitution matrix. These two arguments are ignored if
!is.null(substitutionMatrix) or if its respective string set
(pattern , subject ) is of class
QualityScaledXStringSet . |
type |
type of alignment. One of "global" , "local" ,
"overlap" , "global-local" , and "local-global" where
"global" = align whole strings with end gap penalties,
"local" = align string fragments,
"overlap" = align whole strings without end gap penalties,
"global-local" = align whole strings with end gap penalties on
pattern and without end gap penalties on subject
"local-global" = align whole strings without end gap penalties on
pattern and with end gap penalties on subject . |
substitutionMatrix |
substitution matrix representing the fixed
substitution scores for an alignment. It cannot be used in conjunction with
patternQuality and subjectQuality arguments. |
fuzzyMatrix |
fuzzy match matrix for quality-based alignments. It takes values between 0 and 1; where 0 is an unambiguous mismatch, 1 is an unambiguous match, and values in between represent a fraction of "matchiness". (See details section below.) |
gapOpening |
the cost for opening a gap in the alignment. |
gapExtension |
the incremental cost incurred along the length of the gap in the alignment. |
scoreOnly |
logical to denote whether or not to return just the scores of the optimal pairwise alignment. |
... |
optional arguments to generic function to support additional methods. |
Quality-based alignments are based on the paper the Bioinformatics article by
Ketil Malde listed in the Reference section below. Let \epsilon_i be the
probability of an error in the base read. For "Phred"
quality measures
Q in [0, 99], these error probabilities are given by
\epsilon_i = 10^{-Q/10}. For "Solexa"
quality measures Q in
[-5, 99], they are given by \epsilon_i = 1 - 1/(1 + 10^{-Q/10}).
Assuming independence within and between base reads, the combined error
probability of a mismatch when the underlying bases do match is
\epsilon_c = \epsilon_1 + \epsilon_2 - (n/(n-1)) * \epsilon_1 * \epsilon_2,
where n is the number of letters in the underlying alphabet. Using
\epsilon_c, the substitution score is given by when two bases match is given by
b * \log_2(\gamma_{x,y} * (1 - \epsilon_c) * n + (1 - \gamma_{x,y}) * \epsilon_c * (n/(n-1))),
where b is the bit-scaling for the scoring and \gamma_{x,y} is the
probability that characters x and y represents the same underlying
information (e.g. using IUPAC, \gamma_{A,A} = 1 and \gamma_{A,N} = 1/4.
In the arguments listed above fuzzyMatch
represents \gamma_{x,y}
and patternQuality
and subjectQuality
represents \epsilon_1
and \epsilon_2 respectively.
If scoreOnly == FALSE
, the pairwise alignment with the maximum alignment
score is returned. If more than one pairwise alignment has the maximum alignment
score exists, the first alignment along the subject is returned. If there are
multiple pairwise alignments with the maximum alignment score at the chosen
subject location, then at each location along the alignment mismatches are given
preference to insertions/deletions. For example, pattern: [1] ATTA;
subject: [1] AT-A
is chosen above pattern: [1] ATTA; subject: [1] A-TA
if they both have the maximum alignment score.
If scoreOnly == FALSE
, an instance of class
PairwiseAlignedXStringSet
or
PairwiseAlignedFixedSubject
is returned.
If scoreOnly == TRUE
, a numeric vector containing the scores for the
optimal pairwise alignments is returned.
Use matchPattern
or vmatchPattern
if you need to
find all the occurrences (eventually with indels) of a given pattern in a
reference sequence or set of sequences.
Use matchPDict
if you need to match a (big) set of patterns
against a reference sequence.
P. Aboyoun and H. Pages
R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis, Cambridge UP 1998, sec 2.3.
B. Haubold, T. Wiehe, Introduction to Computational Biology, Birkhauser Verlag 2006, Chapter 2.
K. Malde, The effect of sequence quality on sequence alignment, Bioinformatics 2008 24(7):897-900.
stringDist
,
PairwiseAlignedXStringSet-class,
XStringQuality-class,
substitution.matrices,
matchPattern
## Nucleotide global, local, and overlap alignments s1 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG") s2 <- DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC") # First use a fixed substitution matrix mat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -3, baseOnly = TRUE) globalAlign <- pairwiseAlignment(s1, s2, substitutionMatrix = mat, gapOpening = -5, gapExtension = -2) localAlign <- pairwiseAlignment(s1, s2, type = "local", substitutionMatrix = mat, gapOpening = -5, gapExtension = -2) overlapAlign <- pairwiseAlignment(s1, s2, type = "overlap", substitutionMatrix = mat, gapOpening = -5, gapExtension = -2) # Then use quality-based method for generating a substitution matrix pairwiseAlignment(s1, s2, patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))), subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))), scoreOnly = TRUE) # Now assume can't distinguish between C/T and G/A pairwiseAlignment(s1, s2, patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))), subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))), type = "local") mapping <- diag(4) dimnames(mapping) <- list(DNA_BASES, DNA_BASES) mapping["C", "T"] <- mapping["T", "C"] <- 1 mapping["G", "A"] <- mapping["A", "G"] <- 1 pairwiseAlignment(s1, s2, patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))), subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))), fuzzyMatrix = mapping, type = "local") ## Amino acid global alignment pairwiseAlignment(AAString("PAWHEAE"), AAString("HEAGAWGHEE"), substitutionMatrix = "BLOSUM50", gapOpening = 0, gapExtension = -8)