iconv {base} | R Documentation |
This uses system facilities to convert a character vector between encodings: the ‘i’ stands for ‘internationalization’.
iconv(x, from = "", to = "", sub = NA, mark = TRUE)
iconvlist()
x |
A character vector, or an object to be converted to a character
vector by as.character. |
from |
A character string describing the current encoding. |
to |
A character string describing the target encoding. |
sub |
character string. If not NA it is used to replace
any non-convertible bytes in the input. (This would normally be a
single character, but can be more.) If "byte", the indication is
"<xx>" with the hex code of the byte. |
mark |
logical, for expert use. Should encodings be marked? |
The names of encodings and which ones are available are platform-dependent. All R platforms support "" (for the encoding of the current locale), "latin1" and "UTF-8". Generally case is ignored when specifying an encoding.
On many platforms, including Windows, iconvlist provides an alphabetical list of the supported encodings. On others, the information is on the man page for iconv(5) or elsewhere in the man pages (but beware that the system command iconv may not support the same set of encodings as the C functions R calls). Unfortunately, the names are rarely common across platforms.
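Because encoding names vary by platform, it can help to check a name before relying on it. The following is a minimal sketch, not part of base R: has_encoding is a hypothetical helper that compares names case-insensitively against iconvlist() and treats a missing list as "nothing known".

```r
# Hypothetical helper (an assumption for illustration, not a base R function):
# report whether each encoding name appears in iconvlist(), ignoring case.
has_encoding <- function(name) {
  # iconvlist() may fail on systems that do not provide a list
  known <- tryCatch(iconvlist(), error = function(e) character(0))
  toupper(name) %in% toupper(known)
}

has_encoding(c("UTF-8", "latin1", "ISO_8859-2"))
```

A FALSE result only means the name is not in the list; the underlying iconv may still accept it under an alias, so this is a hint rather than a guarantee.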
Elements of x which cannot be converted (perhaps because they are invalid or because they cannot be represented in the target encoding) will be returned as NA unless sub is specified.
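Since failed conversions come back as NA when sub is left at its default, is.na() can be used to locate the elements that did not convert. A small sketch, assuming the input bytes really are Latin-1:

```r
# Two elements: one pure ASCII, one containing a Latin-1 c-cedilla (byte E7)
x <- c("plain", "fa\xE7ile")

# With sub = NA (the default), unconvertible elements are returned as NA
converted <- iconv(x, "latin1", "ASCII")
failed <- is.na(converted)
which(failed)

# Supplying sub replaces the offending bytes instead of returning NA
iconv(x, "latin1", "ASCII", sub = "?")
```

Elements that were already NA in x would also be flagged by is.na(), so a stricter check would compare against is.na(x) as well.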
Most versions of iconv will allow transliteration by appending //TRANSLIT to the to encoding: see the examples.
Encoding "ASCII" is also accepted, but prior to R 2.10.0 conversion to "ASCII" on Windows might have involved dropping accents.
Any encoding bits (see Encoding) on elements of x are ignored: they will always be translated as if from encoding from, even if declared otherwise.
"UTF8" will be accepted as meaning the (more correct) "UTF-8".
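To illustrate that the declared encoding plays no part, the sketch below deliberately gives a Latin-1 string the wrong mark and still converts it correctly by naming the true encoding in from. The mismarking is done only for demonstration.

```r
# Latin-1 bytes for "facile" with a c-cedilla (byte E7)
x <- "fa\xE7ile"

# Deliberately declare the wrong encoding (for demonstration only)
Encoding(x) <- "UTF-8"

# iconv() uses 'from', not the declared encoding, so the conversion
# still treats the bytes as Latin-1
y <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(y)
```

In UTF-8 the c-cedilla is the two-byte sequence c3 a7, so the raw vector should be one byte longer than the input.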
A character vector of the same length and the same attributes as x (after conversion).
If mark = TRUE (the default) the elements of the result have a declared encoding if to is "latin1" or "UTF-8", or if to = "" and the current locale's encoding is detected as Latin-1 or UTF-8.
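The effect of mark can be inspected with Encoding(). A short sketch, converting a Latin-1 string to UTF-8 with and without marking:

```r
# A Latin-1 string containing a non-ASCII byte (E7)
x <- "fa\xE7ile"
Encoding(x) <- "latin1"

# Default: the result carries a declared encoding
marked <- iconv(x, "latin1", "UTF-8")
Encoding(marked)

# mark = FALSE: same bytes, but no declared encoding
unmarked <- iconv(x, "latin1", "UTF-8", mark = FALSE)
Encoding(unmarked)

# The underlying bytes are identical either way
identical(charToRaw(marked), charToRaw(unmarked))
```

Strings that contain only ASCII characters are never marked, so the difference is visible only for elements with non-ASCII content.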
iconv was optional before R 2.10.0, but its absence was deprecated in R 2.5.0.
There are three main implementations of iconv in use. glibc (as used on Linux) contains one. Several platforms supply GNU libiconv, including Mac OS X and Cygwin. On Windows we use a version of Yukihiro Nakadaira's win_iconv, which is based on Windows' codepages (but libiconv can be used by swapping a DLL). All three have iconvlist, ignore case in encoding names and support //TRANSLIT (but with different results, and for win_iconv currently a ‘best fit’ strategy is used except for to = "ASCII").
Most commercial Unixes contain an implementation of iconv, but none we have encountered has supported the encoding names we need: the “R Installation and Administration Manual” recommends installing libiconv on Solaris and AIX, for example.
There are other implementations, e.g. NetBSD uses one from the Citrus project (which does not support //TRANSLIT) and there is an older FreeBSD port (libiconv is usually used there): it has not been reported whether or not these work with R.
## not all systems have iconvlist
try(utils::head(iconvlist(), n = 50))

## Not run: 
## convert from Latin-2 to UTF-8: two of the glibc iconv variants.
iconv(x, "ISO_8859-2", "UTF-8")
iconv(x, "LATIN2", "UTF-8")

## End(Not run)

## Both x below are in latin1 and will only display correctly in a
## locale that can represent and display latin1.
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
x
charToRaw(xx <- iconv(x, "latin1", "UTF-8"))
xx

iconv(x, "latin1", "ASCII")          # NA
iconv(x, "latin1", "ASCII", "?")     # "fa?ile"
iconv(x, "latin1", "ASCII", "")      # "faile"
iconv(x, "latin1", "ASCII", "byte")  # "fa<e7>ile"

# Extracts from R help files
x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x) <- "latin1"
x
try(iconv(x, "latin1", "ASCII//TRANSLIT"))  # platform-dependent
iconv(x, "latin1", "ASCII", sub = "byte")