grep {base} | R Documentation |
grep, grepl, regexpr and gregexpr search for matches to argument pattern within a character vector: they differ in the format of and amount of detail in the results.
sub and gsub perform replacement of the first and all matches respectively.
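As a quick orientation, the following sketch runs each function on the same input so the different result shapes can be compared side by side. The vector x and the variable names are illustrative only.

```r
x <- c("apple", "banana", "cherry")

idx   <- grep("an", x)                # integer indices of matching elements
vals  <- grep("an", x, value = TRUE)  # the matching elements themselves
hits  <- grepl("an", x)               # one logical value per element
first <- regexpr("an", x)             # start of the first match, -1 if none
all_m <- gregexpr("an", x)            # list: starts of every match per element
one   <- sub("an", "AN", x)           # replaces the first match only
every <- gsub("an", "AN", x)          # replaces all matches

print(idx)
print(hits)
print(one)
print(every)
```

Here "banana" contains the pattern twice, so sub and gsub give different results for that element, and gregexpr reports two start positions for it.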
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
      fixed = FALSE, useBytes = FALSE)
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
     fixed = FALSE, useBytes = FALSE)
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)
pattern
: character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr and gregexpr.
x, text
: a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.
ignore.case
: if FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching.
perl
: logical. Should perl-compatible regexps be used? Has priority over extended.
value
: if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.
fixed
: logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.
useBytes
: logical. If TRUE the matching is done byte-by-byte rather than character-by-character. See ‘Details’.
invert
: logical. If TRUE return indices or values for elements that do not match.
replacement
: a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.
Arguments which should be character strings or character vectors are coerced to character if possible.
Each of these functions operates in one of three modes:
fixed = TRUE
: use exact matching.
perl = TRUE
: use Perl-style regular expressions.
fixed = FALSE, perl = FALSE
: use POSIX 1003.2 extended regular expressions.
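The three modes can be contrasted on a pattern containing a metacharacter. This is a minimal sketch; the input vector and variable names are made up for illustration, and the lookahead is used only because it is a construct specific to Perl-style expressions.

```r
x <- c("a.c", "abc", "A.C")

# Default mode: "." is a metacharacter matching any single character.
regex_hit <- grepl("a.c", x)

# fixed = TRUE: the pattern is matched as a literal string.
fixed_hit <- grepl("a.c", x, fixed = TRUE)

# perl = TRUE: Perl-style constructs such as lookahead become available;
# this matches an "a" that is immediately followed by a literal dot.
perl_hit <- grepl("a(?=\\.)", x, perl = TRUE)

print(rbind(regex_hit, fixed_hit, perl_hit))
```

Only the default mode matches "abc", since there the dot stands for any character.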
The two *sub functions differ only in that sub replaces only the first occurrence of a pattern whereas gsub replaces all occurrences.
For regexpr and gregexpr it is an error for pattern to be NA; otherwise NA is permitted and gives an NA match.
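The handling of missing values can be checked with a short sketch such as the following. The input vector is illustrative, and try() is used only so that the error from regexpr does not stop the script.

```r
x <- c("abc", NA, "xyz")

na_in_x  <- grepl("a", x)     # an NA element is reported as not matching
kept_na  <- sub("a", "A", x)  # NA elements of x stay NA
na_repl  <- sub("a", NA, x)   # NA replacement: matched elements become NA
na_pat   <- grep(NA, x)       # an NA pattern is permitted for grep

# An NA pattern is an error for regexpr (and gregexpr).
res <- try(regexpr(NA, x), silent = TRUE)
is_error <- inherits(res, "try-error")

print(na_in_x)
print(kept_na)
print(na_repl)
print(is_error)
```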
The main effect of useBytes is to avoid errors/warnings about invalid inputs and spurious matches in multibyte locales, but for regexpr it changes the interpretation of the output. As from R 2.10.0 it inhibits the conversion of inputs with marked encodings.
Caseless matching does not make much sense for bytes in a multibyte locale, and you should expect it only to work for ASCII characters if useBytes = TRUE.
grep(value = FALSE) returns an integer vector of the indices of the elements of x that yielded a match (or not, for invert = TRUE).
grep(value = TRUE) returns a character vector containing the selected elements of x (after coercion, preserving names but no other attributes).
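A small sketch of the two grep return forms, including invert = TRUE and the preservation of names under value = TRUE. The named vector is a made-up example.

```r
x <- c(first = "alpha", second = "beta", third = "gamma")

idx      <- grep("m", x)                               # index of the match
idx_inv  <- grep("m", x, invert = TRUE)                # indices of non-matches
vals     <- grep("m", x, value = TRUE)                 # matching element, name kept
vals_inv <- grep("m", x, value = TRUE, invert = TRUE)  # non-matching elements

print(idx)
print(idx_inv)
print(vals)
print(vals_inv)
```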
grepl returns a logical vector (match or not for each element of x).
sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding). If useBytes = FALSE a non-ASCII substituted result will often be in UTF-8 with a marked encoding (e.g. if there is a UTF-8 input, and in a multibyte locale unless fixed = TRUE).
regexpr returns an integer vector of the same length as text giving the starting position of the first match or -1 if there is none, with attribute "match.length", an integer vector giving the length of the matched text (or -1 for no match). The match positions and lengths are in characters unless useBytes = TRUE is used, when they are in bytes.
gregexpr returns a list of the same length as text each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.
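One common use of these return values is to extract the matched text. The following is a minimal sketch using substring together with the "match.length" attribute; the input strings and variable names are illustrative only.

```r
text <- c("id=42", "no id here", "id=7 and id=1234")

# First match per element: start positions plus a "match.length" attribute.
m      <- regexpr("[0-9]+", text)
starts <- as.vector(m)
lens   <- attr(m, "match.length")

# Extract the first match from the elements that have one (start > 0).
hit        <- starts > 0
first_nums <- substring(text[hit], starts[hit], starts[hit] + lens[hit] - 1)

# All matches in the third element, from the corresponding gregexpr list entry.
g        <- gregexpr("[0-9]+", text)[[3]]
g_starts <- as.vector(g)
all_nums <- substring(text[3], g_starts, g_starts + attr(g, "match.length") - 1)

print(first_nums)
print(all_nums)
```

Elements without a match must be filtered out first, since both the position and the length are -1 for them.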
POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g. pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).
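A sketch of the recommended perl = TRUE form for word-boundary patterns, on an ASCII-only input chosen for illustration:

```r
x <- "hello world"

# Insert a marker at every word boundary (zero-length matches).
marked <- gsub("\\b", "|", x, perl = TRUE)

# Upper-case the first letter after each word boundary.
caps <- gsub("\\b(\\w)", "\\U\\1", x, perl = TRUE)

print(marked)
print(caps)
```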
If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).
If you are working in a single-byte locale and have marked UTF-8 strings that are representable in that locale, convert them first, as just one UTF-8 string will force all the matching to be done in Unicode, which attracts a penalty of around 3x for the default POSIX 1003.2 mode.
If you can make use of useBytes = TRUE, the strings will not be checked before matching, and the actual matching will be faster. Often byte-based matching suffices in a UTF-8 locale since byte patterns of one character never match part of another.
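The effect of useBytes = TRUE on reported positions can be seen in a small sketch. It assumes the input is stored as UTF-8 (forced here with enc2utf8), so that each accented letter occupies two bytes; the string itself is just an example.

```r
# "ete" with acute accents on both e's, written with Unicode escapes
# and explicitly converted to UTF-8.
x <- enc2utf8("\u00e9t\u00e9")

n_chars <- nchar(x, type = "chars")   # number of characters
n_bytes <- nchar(x, type = "bytes")   # number of bytes in UTF-8

pos_chars <- as.vector(regexpr("t", x))                   # position in characters
pos_bytes <- as.vector(regexpr("t", x, useBytes = TRUE))  # position in bytes

print(c(chars = n_chars, bytes = n_bytes))
print(c(chars = pos_chars, bytes = pos_bytes))
```

The ASCII pattern "t" is found either way, but its position is counted in different units, so byte-based positions should not be passed to character-based functions such as substring.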
Prior to R 2.11.0 there was an argument extended which could be used to select ‘basic’ regular expressions: this was often used when fixed = TRUE would be preferable. In the actual implementation (as distinct from the POSIX standard) the only difference was that ?, +, {, |, (, and ) were not interpreted as metacharacters.
The C code for POSIX-style regular expression matching has changed over the years. As from R 2.10.0 the TRE library of Ville Laurikari (http://laurikari.net/tre/) is used. From 2005 to R 2.9.2, code based on glibc was used (and before that, code from GNU grep). The POSIX standard does give some room for interpretation, especially in the handling of invalid regular expressions and the collation of character ranges, so the results will have changed slightly.
For Perl-style matching PCRE (http://www.pcre.org) is used.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole (grep)
regular expression (aka regexp) for the details of the pattern specification.
glob2rx to turn wildcard matches into regular expressions.
agrep for approximate matching.
enc2native to re-encode the result of sub.
tolower, toupper and chartr for character translations.
charmatch, pmatch, match.
apropos uses regexps and has more examples.
grep("[a-z]", letters)

txt <- c("arm","foot","lefroo", "bafoobar")
if(length(i <- grep("foo",txt)))
   cat("'foo' appears at least once in\n\t",txt,"\n")
i # 2 and 4
txt[i]

## Double all 'a' or 'b's; "\" must be escaped, i.e., 'doubled'
gsub("([ab])", "\\1_\\1_", "abc and ABC")

txt <- c("The", "licenses", "for", "most", "software", "are",
  "designed", "to", "take", "away", "your", "freedom",
  "to", "share", "and", "change", "it.",
  "", "By", "contrast,", "the", "GNU", "General", "Public", "License",
  "is", "intended", "to", "guarantee", "your", "freedom", "to",
  "share", "and", "change", "free", "software", "--",
  "to", "make", "sure", "the", "software", "is",
  "free", "for", "all", "its", "users")
( i <- grep("[gu]", txt) ) # indices
stopifnot( txt[i] == grep("[gu]", txt, value = TRUE) )

## Note that in locales such as en_US this includes B as the
## collation order is aAbBcCdDeE ...
(ot <- sub("[b-e]",".", txt))
txt[ot != gsub("[b-e]",".", txt)] #- gsub does "global" substitution

txt[gsub("g","#", txt) !=
    gsub("g","#", txt, ignore.case = TRUE)] # the "G" words

regexpr("en", txt)

gregexpr("e", txt)

## trim trailing white space
str <- 'Now is the time '
sub(' +$', '', str) ## spaces only
sub('[[:space:]]+$', '', str) ## white space, POSIX-style
sub('\\s+$', '', str, perl = TRUE) ## Perl-style white space

## capitalizing
txt <- "a test of capitalizing"
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", txt, perl=TRUE)
gsub("\\b(\\w)", "\\U\\1", txt, perl=TRUE)

txt2 <- "useRs may fly into JFK or laGuardia"
gsub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", txt2, perl=TRUE)
 sub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", txt2, perl=TRUE)