java.lang.Object
↳ java.nio.charset.Charset
A charset is a named mapping between Unicode characters and byte sequences. Every Charset can decode, converting a byte sequence into a sequence of characters, and some can also encode, converting a sequence of characters into a byte sequence. Use the method canEncode() to find out whether a charset supports both.
In the context of this class, character always refers to a Java character: a Unicode
code point in the range U+0000 to U+FFFF. (Java represents supplementary characters using surrogates.)
Not all byte sequences will represent a character, and not
all characters can necessarily be represented by a given charset. The method contains(Charset)
can be used to determine whether every character representable by one charset can also be
represented by another (meaning that a lossless transformation is possible from the contained
to the container).
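As a minimal sketch of how canEncode() and contains(Charset) might be combined, the hypothetical helper below (canConvertLosslessly is not part of the API) checks whether text stored in one charset could be re-encoded in another without loss. The results in the comments are what common implementations report; since contains(Charset) is allowed to be conservative, a false result is not proof that conversion would lose data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetCapabilities {
    /** Hypothetical helper: true if text in {@code from} can be re-encoded in {@code to} without loss. */
    static boolean canConvertLosslessly(Charset from, Charset to) {
        // to.contains(from) means every character representable by `from`
        // is also representable by `to`.
        return to.canEncode() && to.contains(from);
    }

    public static void main(String[] args) {
        // US-ASCII is a subset of UTF-8, so this direction is lossless.
        System.out.println(canConvertLosslessly(StandardCharsets.US_ASCII, StandardCharsets.UTF_8)); // true
        // UTF-8 can represent characters that US-ASCII cannot.
        System.out.println(canConvertLosslessly(StandardCharsets.UTF_8, StandardCharsets.US_ASCII)); // false
    }
}
```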
There are many possible ways to represent Unicode characters as byte sequences. See UTR#17: Unicode Character Encoding Model for detailed discussion.
The most important mappings capable of representing every character are the Unicode Transformation Format (UTF) charsets. Of those, UTF-8 and the UTF-16 family are the most common. UTF-8 (described in RFC 3629) encodes a code point using 1 to 4 bytes. UTF-16 uses exactly 2 bytes per Java character, so a supplementary character (a surrogate pair) takes 4 bytes; this potentially wastes space, but allows efficient random access into BMP text. UTF-32 uses exactly 4 bytes per code point, trading off even more space for efficient random access into text that includes supplementary characters.
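The size differences can be checked directly. The sketch below (the class and method names are illustrative) measures the encoded length of an ASCII letter, a BMP character, and a supplementary character. UTF-16BE is used instead of UTF-16 so that no BOM is counted.

```java
import java.nio.charset.StandardCharsets;

final class EncodedLengths {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        // The BE variant writes no BOM, so only the text itself is measured.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "A";                                         // U+0041
        String euro = "\u20AC";                                     // U+20AC, in the BMP
        String supplementary = new String(Character.toChars(0x1F600)); // surrogate pair

        System.out.println(utf8Length(ascii) + " / " + utf16Length(ascii));                 // 1 / 2
        System.out.println(utf8Length(euro) + " / " + utf16Length(euro));                   // 3 / 2
        System.out.println(utf8Length(supplementary) + " / " + utf16Length(supplementary)); // 4 / 4
    }
}
```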
UTF-16 and UTF-32 encode characters directly, using their code point as a two- or four-byte
integer. This means that any given UTF-16 or UTF-32 byte sequence is either big- or
little-endian. To assist decoders, Unicode includes a special byte order mark (BOM)
character U+FEFF used to determine the endianness of a sequence. The corresponding byte-swapped
code point U+FFFE is guaranteed never to be assigned. If a UTF-16 decoder sees 0xfe, 0xff, for example,
it knows it's reading a big-endian byte sequence, while 0xff, 0xfe would indicate a
little-endian byte sequence.
UTF-8 can contain a BOM, but since the UTF-8 encoding of a character always uses the same
byte sequence, there is no information about endianness to convey. Seeing the bytes
corresponding to the UTF-8 encoding of U+FEFF (0xef, 0xbb, 0xbf) would only serve to
suggest that you're reading UTF-8. Note that a UTF-8 BOM is decoded as the U+FEFF character and
will appear in the output character sequence. A disadvantage of including a BOM
in UTF-8 is therefore that most applications that use UTF-8 do not expect to see one. (Not needing a BOM is also a
reason to prefer UTF-8: it's one less complication to worry about.)
Because a BOM indicates how the data that follows should be interpreted, a BOM should occur as the first character in a character sequence.
See the Byte Order Mark (BOM) FAQ for more about dealing with BOMs.
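Because a decoded UTF-8 BOM stays in the output, callers that may receive BOM-prefixed input often drop it themselves. A minimal sketch of that approach follows; decodeStrippingBom is a hypothetical helper, not part of Charset.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bom {
    /** Hypothetical helper: decodes UTF-8 and removes one leading U+FEFF if present. */
    static String decodeStrippingBom(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        // The BOM is only meaningful as the very first character.
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // Plain decoding keeps the BOM as a character, so the length is 3.
        System.out.println(new String(withBom, StandardCharsets.UTF_8).length()); // 3
        System.out.println(decodeStrippingBom(withBom));                          // hi
    }
}
```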
The following tables show the endianness and BOM behavior of the UTF-16 variants.
This table shows what the encoder writes. "BE" means that the byte sequence is big-endian,
"LE" means little-endian. "BE BOM" means a big-endian BOM (that is, 0xfe, 0xff).
Charset | Encoder writes |
---|---|
UTF-16BE | BE, no BOM |
UTF-16LE | LE, no BOM |
UTF-16 | BE, with BE BOM |
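The encoder table can be reproduced by encoding a single character with each variant and printing the bytes. This is only a sketch; hex is a small illustrative helper.

```java
import java.nio.charset.StandardCharsets;

final class Utf16EncoderBom {
    /** Illustrative helper: formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```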
The next table shows how each variant's decoder behaves when reading a byte sequence.
The exact meaning of "failure" in the table depends on the CodingErrorAction configured on the decoder with onMalformedInput(CodingErrorAction) and reported by malformedInputAction(), so
"BE, failure" means "the byte sequence is treated as big-endian, and a little-endian BOM
triggers the malformedInputAction".
The phrase "includes BOM" means that the output includes the U+FEFF byte order mark character.
Charset | BE BOM | LE BOM | No BOM |
---|---|---|---|
UTF-16BE | BE, includes BOM | BE, failure | BE |
UTF-16LE | LE, failure | LE, includes BOM | LE |
UTF-16 | BE | LE | BE |
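A few cells of the decoder table can be checked with a short sketch like the one below. It covers only the non-failure cases, since the outcome of a "failure" depends on the configured CodingErrorAction. The class and constant names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf16DecodeDemo {
    // "A" in little-endian UTF-16, preceded by a little-endian BOM.
    static final byte[] LE_WITH_BOM = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
    // "A" in big-endian UTF-16 with no BOM.
    static final byte[] BE_NO_BOM = {0x00, 0x41};

    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        // UTF-16 uses the BOM to pick the byte order and does not emit it.
        System.out.println(decode(LE_WITH_BOM, StandardCharsets.UTF_16).length());   // 1
        // UTF-16LE keeps the matching BOM in the output as U+FEFF.
        System.out.println(decode(LE_WITH_BOM, StandardCharsets.UTF_16LE).length()); // 2
        // Without a BOM, UTF-16 assumes big-endian.
        System.out.println(decode(BE_NO_BOM, StandardCharsets.UTF_16));              // A
    }
}
```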
A charset has a canonical name, returned by name(). Most charsets will
also have one or more aliases, returned by aliases(). A charset can be looked up
by canonical name or any of its aliases using forName(String).
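The relationship between name(), aliases() and forName(String) can be illustrated with a short sketch. The helper allNamesResolve is hypothetical; it simply looks a charset up under every name it advertises and checks that the same charset comes back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AliasLookup {
    /** Hypothetical helper: true if the canonical name and every alias resolve back to {@code cs}. */
    static boolean allNamesResolve(Charset cs) {
        if (!Charset.forName(cs.name()).equals(cs)) {
            return false;
        }
        for (String alias : cs.aliases()) {
            if (!Charset.forName(alias).equals(cs)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        System.out.println(utf8.name() + " aliases: " + utf8.aliases());
        System.out.println(allNamesResolve(utf8));
    }
}
```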
The following charsets are available on every Java implementation: ISO-8859-1, US-ASCII, UTF-16, UTF-16BE, UTF-16LE, and UTF-8.
All of these charsets support both decoding and encoding. The charsets whose names begin "UTF" can represent all characters, as mentioned above. The "ISO-8859-1" and "US-ASCII" charsets can only represent small subsets of these characters. Except when required to do otherwise for compatibility, new code should use one of the UTF charsets listed above. The platform's default charset is UTF-8. (This is in contrast to some older implementations, where the default charset depended on the user's locale.)
Most implementations will support hundreds of charsets. Use availableCharsets() or
isSupported(String) to see what's available. If you intend to use the charset if it's
available, just call forName(String) and catch the exceptions it throws if the charset isn't
available.
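The "call forName(String) and catch" pattern might look like the sketch below. The helper forNameOrDefault and the charset names passed to it are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

final class CharsetLookup {
    /** Hypothetical helper: returns the named charset, or {@code fallback} if it cannot be used. */
    static Charset forNameOrDefault(String name, Charset fallback) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Illegal name syntax, or a legal name that this runtime does not support.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(forNameOrDefault("UTF-8", StandardCharsets.US_ASCII));
        // A syntactically legal but (presumably) unsupported name.
        System.out.println(forNameOrDefault("x-no-such-charset-demo", StandardCharsets.UTF_8));
        // A space is not a legal charset name character.
        System.out.println(forNameOrDefault("bad name!", StandardCharsets.UTF_8));
    }
}
```

This does a single lookup, whereas calling isSupported(String) first and then forName(String) would do two.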
Additional charsets can be made available by configuring one or more charset
providers through provider configuration files. Such files are always named
"java.nio.charset.spi.CharsetProvider" and located in the
"META-INF/services" directory of one or more classpath entries. The files should be
encoded in UTF-8. Each line specifies the class name of a
charset provider, which must extend CharsetProvider.
A line should end with '\r', '\n' or '\r\n'. Leading and trailing whitespace
is trimmed. Blank lines, and lines (after trimming) starting with "#" which are
regarded as comments, are both ignored. Duplicates of names already found are also
ignored. Both the configuration files and the provider classes will be loaded
using the thread context class loader.
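As a rough sketch of what such a provider could look like, the code below defines a hypothetical charset named "x-example-utf8" (alias "x-example") and a provider that serves it. Every name here is invented for illustration. The charset takes a shortcut by delegating to UTF-8's decoder and encoder, so the coders it returns report UTF-8 as their charset; a real implementation would supply its own CharsetDecoder and CharsetEncoder subclasses.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

/** Hypothetical charset that reuses UTF-8's coders (a shortcut for illustration). */
final class ExampleCharset extends Charset {
    ExampleCharset() {
        super("x-example-utf8", new String[] {"x-example"});
    }

    @Override public boolean contains(Charset cs) {
        return StandardCharsets.UTF_8.contains(cs);
    }

    @Override public CharsetDecoder newDecoder() {
        return StandardCharsets.UTF_8.newDecoder();
    }

    @Override public CharsetEncoder newEncoder() {
        return StandardCharsets.UTF_8.newEncoder();
    }
}

/** Hypothetical provider; a real one must be public so the service loader can instantiate it. */
final class ExampleCharsetProvider extends CharsetProvider {
    private final Charset example = new ExampleCharset();

    @Override public Iterator<Charset> charsets() {
        return Collections.singletonList(example).iterator();
    }

    @Override public Charset charsetForName(String charsetName) {
        // Charset names are matched case-insensitively.
        if (charsetName.equalsIgnoreCase(example.name())) {
            return example;
        }
        for (String alias : example.aliases()) {
            if (charsetName.equalsIgnoreCase(alias)) {
                return example;
            }
        }
        return null; // Not one of ours.
    }
}
```

To register such a provider, the configuration file META-INF/services/java.nio.charset.spi.CharsetProvider would contain one line with the provider's fully qualified class name.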
Although this class is thread-safe, the CharsetDecoder and CharsetEncoder instances
it returns are inherently stateful.
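One way to respect that statefulness is to give each thread its own decoder rather than sharing one. The following is a sketch under that assumption; PerThreadDecoder and decodeUtf8 are illustrative names, and the checked exception is wrapped only to keep the example short.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

final class PerThreadDecoder {
    // The Charset itself can be shared; each thread gets its own stateful decoder.
    private static final ThreadLocal<CharsetDecoder> UTF8_DECODER =
            ThreadLocal.withInitial(() -> StandardCharsets.UTF_8.newDecoder());

    /** Illustrative helper: strict UTF-8 decoding using this thread's decoder. */
    static String decodeUtf8(byte[] bytes) {
        try {
            // decode(ByteBuffer) performs a complete operation and resets the decoder first.
            return UTF8_DECODER.get().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(bytes));
    }
}
```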
Protected Constructors | |
---|---|
Charset(String canonicalName, String[] aliases) | Constructs a Charset object. |

Public Methods | |
---|---|
aliases() | Returns an unmodifiable set of this charset's aliases. |
availableCharsets() | Returns an immutable case-insensitive map from canonical names to Charset instances. |
canEncode() | Returns true if this charset supports encoding, false otherwise. |
compareTo(Charset charset) | Compares this charset with the given charset. |
contains(Charset charset) | Determines whether this charset is a superset of the given charset. |
decode(ByteBuffer buffer) | Returns a new CharBuffer containing the characters decoded from buffer. |
defaultCharset() | Returns the system's default charset. |
displayName() | Returns the name of this charset for the default locale. |
displayName(Locale l) | Returns the name of this charset for the specified locale. |
encode(CharBuffer buffer) | Returns a new ByteBuffer containing the bytes encoding the characters from buffer. |
encode(String s) | Returns a new ByteBuffer containing the bytes encoding the characters from s. |
equals(Object obj) | Determines whether this charset is equal to the given object. |
forName(String charsetName) | Returns a Charset instance for the named charset. |
hashCode() | Gets the hash code of this charset. |
isRegistered() | Returns true if this charset is known to be registered in the IANA Charset Registry. |
isSupported(String charsetName) | Determines whether the specified charset is supported by this runtime. |
name() | Returns the canonical name of this charset. |
newDecoder() | Returns a new instance of a decoder for this charset. |
newEncoder() | Returns a new instance of an encoder for this charset. |
toString() | Gets a string representation of this charset. |
Inherited Methods |
---|
From class java.lang.Object |
From interface java.lang.Comparable |
Constructs a Charset object. Duplicated aliases are ignored.
Parameters | |
---|---|
canonicalName | the canonical name of the charset. |
aliases | an array containing all aliases of the charset. May be null. |

Throws | |
---|---|
IllegalCharsetNameException | if an illegal value is supplied for canonicalName or for any element of aliases. |
Returns an unmodifiable set of this charset's aliases.
Returns an immutable case-insensitive map from canonical names to Charset instances.
If multiple charsets have the same canonical name, it is unspecified which is returned in
the map. This method may be slow. If you know which charset you're looking for, use
forName(String).
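A typical use of the map is enumeration, for instance to populate a chooser. The sketch below collects the canonical names of all charsets that can encode; encodableCharsetNames is a hypothetical helper.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class ListCharsets {
    /** Hypothetical helper: canonical names of every available charset that supports encoding. */
    static List<String> encodableCharsetNames() {
        List<String> names = new ArrayList<>();
        // The map is keyed by canonical name; the values are the Charset instances.
        for (Charset cs : Charset.availableCharsets().values()) {
            if (cs.canEncode()) {
                names.add(cs.name());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = encodableCharsetNames();
        System.out.println(names.size() + " charsets can encode, including UTF-8: "
                + names.contains("UTF-8"));
    }
}
```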
Returns true if this charset supports encoding, false otherwise.
Compares this charset with the given charset. The comparison is based on the charsets' canonical names, ignoring case.
Parameters | |
---|---|
charset | the given object to be compared with. |
Determines whether this charset is a superset of the given charset. A charset C1 contains charset C2 if every character representable by C2 is also representable by C1. This means that lossless conversion is possible from C2 to C1 (but not necessarily the other way round). It does not imply that the two charsets use the same byte sequences for the characters they share.
Note that this method is allowed to be conservative, and some implementations may return false when this charset does contain the other charset. Android's implementation is precise, and will always return true in such cases.
Parameters | |
---|---|
charset | a given charset. |
Returns a new CharBuffer containing the characters decoded from buffer.
This method uses CodingErrorAction.REPLACE.
Applications should generally create a CharsetDecoder using newDecoder() for performance.
Parameters | |
---|---|
buffer | the byte buffer containing the content to be decoded. |
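The difference between this convenience method and an explicitly configured decoder can be sketched as follows. The class and method names are illustrative; the input contains the byte 0xFF, which is never valid in UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecode {
    /** Uses Charset.decode, which replaces malformed input instead of failing. */
    static String lenient(byte[] bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString();
    }

    /** Uses a decoder configured to report errors, so malformed input is detected. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] malformed = {'a', (byte) 0xFF, 'b'};
        System.out.println(lenient(malformed).equals("a\uFFFDb")); // true: 0xFF became U+FFFD
        System.out.println(isValidUtf8(malformed));                 // false
    }
}
```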
Returns the system's default charset. This is determined during VM startup, and will not change thereafter. On Android, the default charset is UTF-8.
Returns the name of this charset for the default locale.
The default implementation returns the canonical name of this charset. Subclasses may return a localized display name.
Returns the name of this charset for the specified locale.
The default implementation returns the canonical name of this charset. Subclasses may return a localized display name.
Returns a new ByteBuffer containing the bytes encoding the characters from buffer.
This method uses CodingErrorAction.REPLACE.
Applications should generally create a CharsetEncoder using newEncoder() for performance.
Parameters | |
---|---|
buffer | the character buffer containing the content to be encoded. |
Returns a new ByteBuffer containing the bytes encoding the characters from s.
This method uses CodingErrorAction.REPLACE.
Applications should generally create a CharsetEncoder using newEncoder() for performance.
Parameters | |
---|---|
s | the string to be encoded. |
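Because encode uses CodingErrorAction.REPLACE, characters the charset cannot represent are silently substituted. The sketch below shows this with US-ASCII, whose encoder substitutes '?', and pairs it with a check based on CharsetEncoder.canEncode(CharSequence). The class and method names are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LossyEncode {
    /** Encodes with Charset.encode; unmappable characters are replaced rather than reported. */
    static byte[] toAsciiLossy(String s) {
        ByteBuffer buf = StandardCharsets.US_ASCII.encode(s);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    /** True if every character of {@code s} can be represented in US-ASCII. */
    static boolean isPureAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        System.out.println(new String(toAsciiLossy(text), StandardCharsets.US_ASCII)); // caf?
        System.out.println(isPureAscii(text));                                         // false
    }
}
```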
Determines whether this charset is equal to the given object. Two charsets are considered equal if they have the same canonical name.
Parameters | |
---|---|
obj | the given object to be compared with. |
Returns a Charset instance for the named charset.
Parameters | |
---|---|
charsetName | a charset name (either canonical or an alias) |

Throws | |
---|---|
IllegalCharsetNameException | if the specified charset name is illegal. |
UnsupportedCharsetException | if the desired charset is not supported by this runtime. |
Gets the hash code of this charset.
Returns true if this charset is known to be registered in the IANA Charset Registry.
Determines whether the specified charset is supported by this runtime.
Parameters | |
---|---|
charsetName | the name of the charset. |

Throws | |
---|---|
IllegalCharsetNameException | if the specified charset name is illegal. |
Returns the canonical name of this charset.
If a charset is in the IANA registry, this will be the MIME-preferred name (a charset may have multiple IANA-registered names). Otherwise the canonical name will begin with "x-" or "X-".
Returns a new instance of a decoder for this charset.
Returns a new instance of an encoder for this charset.
Gets a string representation of this charset. Usually this contains the canonical name of the charset.