Plywood encourages the use of UTF-8-encoded text and is generally forgiving of encoding errors.
Plywood provides the
TextFormat class for compatibility with other encodings. The
createExporter() functions convert UTF-8-encoded text to and from other formats such as UTF-16, ISO 8859-1 or Windows-1252, allowing you to work exclusively with UTF-8 at runtime. There's also a
TextFormat::autodetect() function to detect source encodings at load time.
StringView classes are generally meant to hold UTF-8-encoded text, but there's nothing to prevent you from storing other kinds of data in them as well. The exact encoding requirements depend on the operations performed on the string. For example, the
trim() functions will work with any 8-bit text encoding, including UTF-8, ISO 8859-1, Windows-1252 and ASCII. On the other hand,
reversedUTF8() does expect UTF-8.
StringWriter classes, which do things like read alphanumeric identifiers and convert numbers to text, will generally work with any 8-bit encoding that is compatible with ASCII, including UTF-8, ISO 8859-1 and Windows-1252.
At the lowest level, Plywood provides functions to encode and decode text one character (code point) at time through a variety of encodings. These functions reside in
TextEncoding.h and include
UTF8::decodePoint(). A decoded character is always represented as a Unicode point stored in a 32-bit integer. These functions are tolerant of encoding errors as described in the following section.
How Plywood Handles Encoding Errors
In UTF-8, many byte sequences are considered invalid. For example, a byte with a hexadecimal value between
0xdf (inclusive) is expected to be followed by a byte with a hexadecimal value between
0xbf (inclusive). Anything else is considered an error according to the Unicode standard.
[FIXME: add diagrams]
In general, Plywood's character decoding functions, such as
UTF8::decodePoint(), take the following approach to encoding errors: Invalid byte sequences are always decoded one code unit at a time, with each code unit converted to a Unicode point of the same value. (In UTF-8, a code unit is a byte.) That way, if encoding errors are ignored, invalid UTF-8 sequences will simply be interpreted as ISO-8859-1. When using the low-level decoding functions, such errors can be detected by checking the
validEncoding member of
In UTF-16, there are many invalid code point sequences as well. (In UTF-16, a code point is a 16-bit integer.) For example, a code point with a hexadecimal value between
0xdbff (inclusive) is expected to be followed by a code point with a hexadecimal value between
0xdfff (inclusive). Anything else is considered an error according to the Unicode standard. In Plywood, the
UTF16_Native::decodePoint() function decodes such errors one code unit at a time, which means that a Unicode point between
0xdbff is returned, and the error can be detected by checking the
validEncoding member of the return value. The Unicode standard says that Unicode points between
0xdbff are not valid, but Plywood returns them anyway, since such values can be trivially encoded back to UTF-16 to reconstruct the original string exactly.
Note that when importing a UTF-8 file through
TextFormat, no decoding step is performed at all, and errors won't be detected. For example, if you read a line of text from a UTF-8 file using
StringReader::readString<fmt::Line>(), Plywood preserves the original UTF-8 text byte-for-byte, with any encoding errors left unchanged. That way, if the text is written back to a UTF-8 file, the original contents are preserved exactly.
Finally, note that for many Plywood functions, UTF-8 encoding errors are irrelevant. For example, most of the
StringView member functions, such as
trim(), simply view the string as a sequence of bytes, and are therefore indifferent to UTF-8 encoding errors. Most of the
StringReader parse functions only give special treatment to ASCII code points, which are represented as a single byte in UTF-8 and therefore can't be encoded incorrectly. And if any incorrectly-encoded UTF-8 sequences are passed to
String::format(), either in the format string on in an argument, they are simply passed through unchanged.
In Plywood, all
FileSystem functions, such as
openForWrite(), work with UTF-8 paths and filenames, even on Windows.
On Windows, the NTFS filesystem stores filenames using 16-bit characters. Plywood currently interprets such filenames as UTF-16 and converts them to and from UTF-8 according to the above rules. This approach has the convenient property that all possible NTFS filenames have a unique UTF-8 representation that converts back to the original filename exactly.
See Manipulating Paths for more information.