Using Unicode in C++ source code

What is the standard encoding of C++ source code? Does the C++ standard even say something about this? Can I write C++ source in Unicode?
For example, can I use non-ASCII characters such as Chinese characters in comments? If so, is full Unicode allowed or just a subset? (e.g., only that 16-bit first plane, the Basic Multilingual Plane, or whatever it's called.)
Furthermore, can I use Unicode for strings? For example:
std::wstring str = L"Strange chars: â Țđ ě €€";

Encoding in C++ is quite complicated. Here is my understanding of it.
Every implementation has to support characters from the basic source character set. These include common characters listed in §2.2/1 (§2.3/1 in C++11). These characters should all fit into one char. In addition, implementations have to support a way to name other characters using universal-character-names, which look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).
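To make that concrete, here is a minimal sketch (the variable names are mine) showing the same character written directly and as a universal-character-name; per the standard wording quoted below, a conforming implementation must treat the two spellings equivalently.
#include <string>

int main() {
    std::wstring direct  = L"€";       // extended character typed directly in the source
    std::wstring escaped = L"\u20AC";  // the same character as a universal-character-name
    int caf\u00e9 = 1;                 // UCNs from the Annex E ranges may appear in identifiers
    (void)direct; (void)escaped; (void)caf\u00e9;
    return 0;
}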
This is all nice, but the mapping from characters in the file to source characters (used at compile time) is implementation-defined. This constitutes the encoding used. Here is what the standard says literally (C++98 version):
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)
For gcc, you can change the input encoding using the option -finput-charset=charset. Additionally, you can change the execution character set used to represent values at runtime. The proper option for this is -fexec-charset=charset for char (it defaults to UTF-8) and -fwide-exec-charset=charset for wchar_t (which defaults to either UTF-16 or UTF-32 depending on the size of wchar_t).
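As a rough illustration of how those flags interact, here is a sketch (the file name, variable name, and invocation are only examples, and the byte count assumes the default -fexec-charset=UTF-8):
// sketch.cpp -- illustrative only; assumes -fexec-charset=UTF-8 (GCC's default)
#include <cstdio>

int main() {
    const char euro[] = "\u20AC";                       // U+20AC re-encoded into the execution charset
    std::printf("sizeof(euro) = %zu\n", sizeof(euro));  // 4 with UTF-8: 0xE2 0x82 0xAC + NUL
    return 0;
}
// Possible invocation (illustrative):
//   g++ -finput-charset=UTF-8 -fexec-charset=UTF-8 sketch.cpp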

The C++ standard doesn't say anything about source-code file encoding, so far as I know.
The usual encoding is (or used to be) 7-bit ASCII -- some compilers (Borland's, for instance) would balk at ASCII characters that used the high-bit. There's no technical reason that Unicode characters can't be used, if your compiler and editor accept them -- most modern Linux-based tools, and many of the better Windows-based editors, handle UTF-8 encoding with no problem, though I'm not sure that Microsoft's compiler will.
EDIT: It looks like Microsoft's compilers will accept Unicode-encoded files, but will sometimes emit warnings about non-ASCII characters too:
warning C4819: The file contains a character that cannot be represented
in the current code page (932). Save the file in Unicode format to prevent
data loss.

In addition to litb's post, MSVC++ supports Unicode too. I understand it gets the Unicode encoding from the BOM. It definitely supports code like int (*♫)(); or const std::set<int> ∅;
If you're really into code obfuscation:
typedef void ‼; // Also known as \u203C
class ooɟ {
    operator ‼() {}
};

There are two issues at play here. The first is what characters are allowed in C++ code (and comments), such as variable names. The second is what characters are allowed in strings and string literals.
As noted, C++ compilers must support a very restricted ASCII-based character set for the characters allowed in code and comments. In practice, this character set didn't work very well with some European character sets (and especially with some European keyboards that didn't have a few characters -- like square brackets -- available), so the concept of digraphs and trigraphs was introduced. Many compilers accept more than this character set at this time, but there isn't any guarantee.
As for strings and string literals, C++ has the concept of a wide character and wide character string. However, the encoding for that character set is undefined. In practice it's almost always Unicode, but I don't think there's any guarantee here. Wide character string literals look like L"string literal", and these can be assigned to std::wstring objects.
C++11 added explicit support for Unicode strings and string literals, encoded as UTF-8, UTF-16, and UTF-32 (stored in the platform's native byte order).
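A small sketch of those literal forms (assuming the source file itself is saved as UTF-8, and compiled as C++11/14/17; in C++20 the element type of a u8 literal changes from char to char8_t):
#include <string>

int main() {
    const char     narrow_utf8[] = u8"Füße";  // UTF-8 code units
    const char16_t utf16[]       = u"Füße";   // UTF-16 code units
    const char32_t utf32[]       = U"Füße";   // UTF-32 code units, one per code point
    std::u16string s16 = u"Füße";
    std::u32string s32 = U"Füße";
    (void)narrow_utf8; (void)utf16; (void)utf32;
    return 0;
}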

For encoding in strings I think you are meant to use the \u notation, e.g.:
std::wstring str = L"\u20AC"; // Euro character

It's also worth noting that wide characters in C++ aren't really Unicode strings as such. They are just strings of larger characters, usually 16 bits but sometimes 32. This is implementation-defined, though; IIRC you can even have an 8-bit wchar_t. You have no real guarantee as to the encoding in them, so if you are trying to do something like text processing, you will probably want a typedef to the integer type most suitable for your Unicode entity.
C++11 has additional Unicode support in the form of UTF-8 string literals (u8"text"), and UTF-16 and UTF-32 data types (char16_t and char32_t, IIRC) as well as corresponding string constants (u"text" and U"text"). The encoding of characters specified without \uxxxx or \Uxxxxxxxx escapes is still implementation-defined, though (and there is no encoding-aware support for complex string types outside the literals).

In this context, if you get MSVC++ warning C4819, just change the source file encoding to "UTF-8 with BOM".
GCC 4.1 doesn't support this, but GCC 4.4 does, and the latest Qt version uses GCC 4.4, so use "UTF-8 with BOM" as the source file encoding.

AFAIK it's not standardized, as you can put any type of characters in wide strings.
You just have to check that your compiler is set up for Unicode source code to make it work right.

Related

How can one properly declare char8_t for diacritical letters?

I attempt to initialise some diacritical Latin letters using the new char8_t type:
constexpr char8_t french_letter_A_1 = 'À';//does not function properly
However, Visual Studio 2019 gives me the following warning: "character represented by universal-character-name "\u(the name)" cannot be represented in the current code page", and the character cannot be displayed properly. If I try to explicitly declare the character as a u8 one, like:
constexpr char8_t french_letter_A_2 = u8'Â';//has error
It even throws an error: "a UTF-8 character literal value cannot occupy more than one code unit"; but non-diacritical letters can be successfully interpreted as UTF-8:
constexpr char8_t french_letter_A_0 = u8'A';//but ASCII letters are fine
I am wondering how I can properly declare a UTF-8 character with Visual C++... or do I misunderstand the concept of char8_t and should use something else instead?
Edit: I have come to understand that char8_t does not support those non-ASCII characters. What character type should I use instead?
char8_t, like char, signed char, and unsigned char, has a size of 1 byte. On most platforms (but not all!), that means it is an 8-bit type only capable of holding 256 discrete values. Unicode 12.1 defines 137,994 characters. Clearly, they can't all fit in a single char8_t value!
The C and C++ "character" types are, regrettably, poorly named. If we were designing a new language with modern terminology, we would name them some variation of code_unit as that better reflects how they are actually used. char32_t is the only character type that is currently guaranteed to be able to hold a code point value for every character in its associated character set (the C and C++ standards claim that wchar_t can too, but that contradicts existing practice).
Looking at your example, À is U+00C0 {LATIN CAPITAL LETTER A WITH GRAVE} (or is it actually A U+0041 {LATIN CAPITAL LETTER A} followed by ̀ U+0300 {COMBINING GRAVE ACCENT}? Unicode is tricky that way). The UTF-8 encoding of U+00C0 is 0xC3 0x80. What value should french_letter_A_1 hold? It can't hold both code unit values. And if the value were to be the code point, then we're either in the situation that only 256 characters can be (portably) supported or, worse, that sometimes values of char8_t are code points and sometimes they are code units.
The reality is that C and C++ character literals are limited to just a few more characters than are in the basic source character set. That is sufficient if one is writing an English-only application. But for modern applications, character literals have limited use.
As Nicol already stated, working with most characters outside the basic source character set requires doing real text processing on strings. Unfortunately, the C and C++ standards do not provide much help there. That is something that SG16 is working to improve.
UTF-8 is an encoding for Unicode codepoints. In UTF-8, a codepoint is broken down into one or more "octets" (8-bit words) called UTF-8 code units. The C++20 type that represents a UTF-8 code unit is char8_t.
A single char8_t is only one UTF-8 code unit. Therefore, it can only represent a Unicode codepoint whose UTF-8 encoding only takes up 1 code unit. Visual Studio is telling you that the "character" you are trying to store in a char8_t requires more than 1 code unit and therefore cannot be stored in such a type. The only Unicode code points that UTF-8 encodes in a single code unit are the ASCII code points.
When dealing with UTF-8 (or any Unicode encoding that isn't UTF-32 for that matter), you do not deal in "characters"; you deal in strings: contiguous sequences of code units. Anytime you want to deal with UTF-8, you should be using some kind of char8_t-based string type.
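A short C++20 sketch of that point (the names are illustrative): the single-character form only works for ASCII, while the string form holds however many code units the character needs.
// C++20 sketch: a non-ASCII character needs a char8_t *string*, not a single char8_t.
constexpr char8_t ascii_a   = u8'A';    // fine: exactly one UTF-8 code unit
// constexpr char8_t a_grave = u8'À';   // rejected: 'À' needs two code units
constexpr char8_t a_grave[] = u8"À";    // OK: { 0xC3, 0x80, 0x00 }
static_assert(sizeof(a_grave) == 3, "two code units plus the terminator");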

What determines the normalized form of a Unicode string in C++?

When creating string literals in C++, I would like to know how the strings are encoded -- I can specify the encoding form (UTF-8, 16, or 32), but I want to know how the compiler determines the unspecified parts of the encoding.
For UTF-8 the byte-ordering is not relevant, and I would assume the byte ordering of UTF-16 and UTF-32 is, by default, the system byte-ordering. This leaves the normalization. As an example:
std::string u8foo = u8"Föo";
std::u16string u16foo = u"Föo";
std::u32string u32foo = U"Föo";
In all three cases, there are at least two possible encodings -- decomposed or composed. For more complex characters there might be multiple possible encodings, but I would assume that the compiler would generate one of the normalized forms.
Is this a safe assumption? Can I know in advance in what normalization the text in u8foo and u16foo is stored? Can I specify it somehow?
I am of the impression this is not defined by the standard, and that it is implementation specific. How does GCC handle it? Other compilers?
The interpretation of character strings outside of the basic source character set is implementation-dependent. (Standard quote below.) So there is no definitive answer; an implementation is not even obliged to accept source characters outside of the basic set.
Normalisation involves a mapping of possibly multiple source codepoints to possibly multiple internal codepoints, including the possibility of reordering the source character sequence (if, for example, diacritics are not in the canonical order). Such transformations are more complex than the source→internal transformation anticipated by the standard, and I suspect that a compiler which attempted them would not be completely conformant. In any event, I know of no compiler which does so.
So, in general, you should ensure that the source code you provide to the compiler is normalized as per your desired normalization form, if that matters to you.
In the particular case of GCC, the compiler interprets the source according to the default locale's encoding, unless told otherwise (with the -finput-charset command-line option). It will recode if necessary to Unicode codepoints. But it does not alter the sequence of codepoints. So if you give it a normalized UTF-8 string, that's what you get. And if you give it an unnormalized string, that's also what you get.
In this example on coliru, the first string is composed and the second one decomposed (although they are both in some normalization form). (The rendering of the second example string in coliru seems to be browser-dependent. On my machine, chrome renders them correctly, while firefox shifts the diacritics one position to the left. YMMV.)
The C++ standard defines the basic source character set (in §2.3/1) to be letters, digits, five whitespace characters (space, newline, tab, vertical tab and formfeed) and the symbols:
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
It gives the compiler a lot of latitude as to how it interprets the input, and how it handles characters outside of the basic source character set. §2.2 paragraph 1 (from C++14 draft n4527):
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
It's worth adding that diacritics are characters, from the perspective of the C++ standard. So the composed ñ (\u00f1) is one character and the decomposed ñ (\u006e \u0303) is two characters, regardless of how it looks to you.
A close reading of the above paragraph from the standard suggests that normalization or other transformations which are not strictly 1-1 are not permitted, although the compiler may be able to reject an input which contains characters outside the basic source character set.
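A tiny compile-time sketch of that point, using the ñ example above: whichever spelling you write is the one you get, so the two forms produce arrays of different sizes.
static_assert(sizeof(u8"\u00F1")       == 3, "composed:   0xC3 0xB1 + NUL");
static_assert(sizeof(u8"\u006E\u0303") == 4, "decomposed: 0x6E 0xCC 0x83 + NUL");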
Microsoft Visual C++ will keep the normalization used in the source file.
The main problem you have when doing this cross-platform is making sure the compilers are using the right encodings. Here is how MSVC handles it:
Source file encoding
The compiler has to read your source file with the right encoding.
MSVC historically had no option to specify the source encoding on the command line (recent versions add /source-charset and /utf-8) and instead relies on the BOM to detect the encoding, so it can read the following encodings:
UTF-16 with BOM, if the file starts with that BOM
UTF-8, if the file starts with "\xef\xbb\xbf" (the UTF-8 "BOM")
in all other cases, the file is read using an ANSI code page dependent on your system language setting. In practice this means you can only use ASCII characters in your source files.
Output encoding
Your Unicode strings will be converted to some encoding before being written to your executable as byte strings.
Wide literals (L"...") are always written as UTF-16.
In MSVC 2010 you can use #pragma execution_character_set("utf-8") to have char strings encoded as UTF-8. By default they are encoded in your local code page. That pragma is apparently missing from MSVC 2012, but it's back in MSVC 2013.
#pragma execution_character_set("utf-8")
const char a[] = "ŦεŞŧ";
Support for the Unicode literals (u"..." and friends) was only introduced with MSVC 2015.

Is the u8 string literal necessary in C++11

From Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
I'm wondering what exactly this means for writing portable applications. Is there any difference between writing this
const char[] str = "Test String";
or this?
const char str[] = u8"Test String";
Is there any reason not to use the latter for every string literal in your code?
What happens when there are non-ASCII characters inside the test string?
The encoding of "Test String" is the implementation-defined system encoding (the narrow, possibly multibyte one).
The encoding of u8"Test String" is always UTF-8.
The examples aren't terribly telling. If you included some Unicode literals (such as \U0010FFFF) into the string, then you would always get those (encoded as UTF-8), but whether they could be expressed in the system-encoded string, and if yes what their value would be, is implementation-defined.
If it helps, imagine you're authoring the source code on an EBCDIC machine. Then the literal "Test String" is always EBCDIC-encoded in the source file itself, but the u8-initialized array contains UTF-8 encoded values, whereas the first array contains EBCDIC-encoded values.
You quote Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
Well, the “For the purpose of” is not true. char has always been guaranteed to be at least 8 bits; that is, CHAR_BIT has always been required to be ≥ 8, due to the range required for char in the C standard, which is (quote C++11 §17.5.1.5/1) “incorporated” into the C++ standard.
If I should guess about the purpose of that change of wording, it would be to just clarify things for those readers unaware of the dependency on the C standard.
Regarding the effect of the u8 literal prefix, it affects the encoding of the string in the executable, but unfortunately it does not affect the type.
Thus, in both cases "tørrfisk" and u8"tørrfisk" you get a char const[n]. But in the former literal the encoding is whatever is selected for the compiler, e.g. with Latin-1 (or Windows ANSI Western) that would be 8 bytes for the characters plus a null byte, for array size 9. In the latter literal the encoding is guaranteed to be UTF-8, where the “ø” is encoded with 2 bytes, for array size 10.
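A compile-time sketch of that size difference (assuming the source file is saved and read as UTF-8):
// The u8 literal has a fixed, known size; the plain literal's size depends on
// the execution character set, so it is deliberately not asserted here.
static_assert(sizeof(u8"tørrfisk") == 10, "UTF-8: 'ø' takes two bytes, plus the NUL");
// sizeof("tørrfisk") would be 9 under a Latin-1 execution character set, for example.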
If the execution character set of the compiler is set to UTF-8, it makes no difference if u8 is used or not, since the compiler converts the characters to UTF-8 in both cases.
However, if the compiler's execution character set is the system's non-UTF-8 code page (the default for e.g. Visual C++), then non-ASCII characters might not be handled properly when u8 is omitted. For example, the conversion to wide strings will crash, e.g. in VS15:
#include <codecvt>  // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>   // std::wstring_convert
#include <string>
std::string narrowJapanese("スタークラフト"); // bytes depend on the execution character set
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convertWindows;
std::wstring wide = convertWindows.from_bytes(narrowJapanese); // Unhandled C++ exception in xlocbuf.
The compiler chooses a native encoding natural to the platform. On typical POSIX systems it will probably choose ASCII and something possibly depending on environment's setting for character values outside the ASCII range. On mainframes it will probably choose EBCDIC. Comparing strings received, e.g., from files or the command line will probably work best with the native character set. When processing files explicitly encoded using UTF-8 you are, however, probably best off using u8"..." strings.
That said, with the recent changes relating to character encodings a fundamental assumption of string processing in C and C++ got broken: each internal character object (char, wchar_t, etc.) used to represent one character. This is clearly not true anymore for a UTF-8 string where each character object just represents a byte of some character. As a result all the string manipulation, character classification, etc. functions won't necessarily work on these strings. We don't have any good library lined up to deal with such strings for inclusion into the standard.

UTF-8 Compatibility in C++

I am writing a program that needs to be able to work with text in all languages. My understanding is that UTF-8 will do the job, but I am experiencing a few problems with it.
Am I right to say that UTF-8 can be stored in a simple char in C++? If so, why do I get the following warning when I use a program with char, string and stringstream: warning C4566: character represented by universal-character-name '\uFFFD' cannot be represented in the current code page (1252). (I do not get that error when I use wchar_t, wstring and wstringstream.)
Additionally, I know that UTF is variable length. When I use the at or substr string methods would I get the wrong answer?
To use UTF-8 string literals you need to prefix them with u8, otherwise you get the implementation's character set (in your case, it seems to be Windows-1252): u8"\uFFFD" is a null-terminated sequence of bytes with the UTF-8 representation of the replacement character (U+FFFD). It has type char const[4].
Since UTF-8 has variable length, all kinds of indexing will index in code units, not code points. It is not possible to do random access on code points in a UTF-8 sequence because of its variable-length nature. If you want random access you need to use a fixed-length encoding, like UTF-32. For that you can use the U prefix on strings.
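A small sketch of that code-unit behaviour (the names are mine, and the euro sign is written with hex escapes so the example does not depend on the source or execution encoding):
#include <cassert>
#include <string>

int main() {
    std::string s = "\xE2\x82\xAC" "uro";   // "€uro": the euro sign is 3 code units
    assert(s.size() == 6);                   // 3 bytes for '€' plus 3 ASCII bytes
    unsigned char first = s[0];              // 0xE2: only the first byte of '€'
    std::string tail = s.substr(1);          // starts mid-sequence: no longer valid UTF-8
    (void)first; (void)tail;
    return 0;
}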
Yes, the UTF-8 encoding can be used with char, string, and stringstream. A char will hold a single UTF-8 code unit, of which up to four may be required to represent a single Unicode code point.
However, there are a few issues using UTF-8 specifically with Microsoft's compilers. C++ implementations use an 'execution character set' for a number of things, such as encoding character and string literals. VC++ always uses the system locale encoding as the execution character set, and Windows does not support UTF-8 as the system locale encoding, therefore UTF-8 can never be the execution character set.
This means that VC++ never intentionally produces UTF-8 character and string literals. Instead the compiler must be tricked.
The compiler will convert from the known source code encoding to the execution encoding. That means that if the compiler uses the locale encoding for both the source and execution encodings then no conversion is done. If you can get UTF-8 data into the source code but have the compiler think that the source uses the locale encoding, then character and string literals will use the UTF-8 encoding. VC++ uses the so-called 'BOM' to detect the source encoding, and uses the locale encoding if no BOM is detected. Therefore you can get UTF-8 encoded string literals by saving all your source files as "UTF-8 without signature".
There are caveats with this method. First, you cannot use UCNs with narrow character and string literals. Universal Character Names have to be converted to the execution character set, which isn't UTF-8. You must either write the character literally so it appears as UTF-8 in the source code, or you can use hex escapes where you manually write out a UTF-8 encoding. Second, in order to produce wide character and string literals the compiler performs a similar conversion from the source encoding to the wide execution character set (which is always UTF-16 in VC++). Since we're lying to the compiler about the encoding, it will perform this conversion to UTF-16 incorrectly. So in wide character and string literals you cannot use non-ASCII characters literally, and instead you must use UCNs or hex escapes.
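For the narrow-literal case, the hex-escape trick looks roughly like this (a sketch; the array name is mine):
// The UTF-8 bytes are written out by hand, so the narrow literal contains UTF-8
// regardless of the execution character set. U+00E9 (é) encodes as 0xC3 0xA9.
const char cafe_utf8[] = "caf\xC3\xA9";
static_assert(sizeof(cafe_utf8) == 6, "3 ASCII bytes + 2-byte sequence + NUL");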
UTF-8 is variable length (as is UTF-16). The indices used with at() and substr() are code units rather than character or code point indices. So if you want a particular code unit then you can just index into the string or array or whatever as normal. If you need a particular code point then you either need a library that can understand composing UTF-8 code units into code points (such as the Boost Unicode iterators library), or you need to convert the UTF-8 data into UTF-32. If you need actual user perceived characters then you need a library that understands how code points are composed into characters. I imagine ICU has such functionality, or you could implement the Default Grapheme Cluster Boundary Specification from the Unicode standard.
The above consideration of UTF-8 only really matters for how you write Unicode data in the source code. It has little bearing on the program's input and output.
If your requirements allow you to choose how to do input and output then I would still recommend using UTF-8 for input. Depending on what you need to do with the input you can either convert it to another encoding that's easy for you to process, or you can write your processing routines to work directly on UTF-8.
If you want to ever output anything via the Windows console then you'll want a well defined module for output that can have different implementations, because internationalized output to the Windows console will require a different implementation from either outputting to a file on Windows or console and file output on other platforms. (On other platforms the console is just another file, but the Windows console needs special treatment.)
The reason you get the warning about \uFFFD is that you're trying to fit a character that needs more than one byte into the current single-byte code page; as you noted, UTF-8 works on chars and is variable length.
If you use at or substr, you will possibly get wrong answers, since these methods assume that one byte is one character. This is not the case with UTF-8. Notably, with at you could end up with a single byte of a multi-byte character sequence; with substr you could break a sequence and end up with an invalid UTF-8 string (it would start or end with a broken character, typically rendered as �, \uFFFD, the very character you're apparently trying to use, and the broken character would be lost).
I would recommend that you use wchar_t to store Unicode strings. Since the type is at least 16 bits on most platforms, many more characters can fit in a single "unit".

Is a wide character string literal starting with L like L"Hello World" guaranteed to be encoded in Unicode?

I've recently tried to get the full picture about what it takes to create platform-independent C++ applications that support Unicode. A thing that is confusing to me is that most how-tos and such equate the character encoding (i.e. ANSI or Unicode) and the character type (char or wchar_t). As I've learned so far, these are different things: there may be a character sequence encoded in Unicode but represented by std::string, as well as a character sequence encoded in ANSI but represented as std::wstring, right?
So the question that comes to my mind is whether the C++ standard gives any guarantee about the encoding of string literals starting with L, or does it just say they're of type wchar_t with an implementation-specific character encoding?
If there is no such guarantee, does that mean I need some sort of external resource system to provide non-ASCII string literals for my application in a platform-independent way?
What is the preferred way to do this? A resource system, or proper encoding of source files plus proper compiler options?
The L symbol in front of a string literal simply means that each character in the string will be stored as a wchar_t. But this doesn't necessarily imply Unicode. For example, you could use a wide character string to encode GB 18030, a character set used in China which is similar to Unicode. The C++03 standard doesn't have anything to say about Unicode, (however C++11 defines Unicode char types and string literals) so it's up to you to properly represent Unicode strings in C++03.
Regarding string literals, Chapter 2 (Lexical Conventions) of the C++ standard mentions a "basic source character set", which is basically equivalent to ASCII. So this essentially guarantees that "abc" will be represented as a 3-byte string (not counting the null), and L"abc" will be represented as a 3 * sizeof(wchar_t)-byte string of wide-characters.
The standard also mentions "universal-character-names" which allow you to refer to non-ASCII characters using the \uXXXX hexadecimal notation. These "universal-character-names" usually map directly to Unicode values, but the standard doesn't guarantee that they have to. However, you can at least guarantee that your string will be represented as a certain sequence of bytes by using universal-character-names. This will guarantee Unicode output provided the runtime environment supports Unicode, has the appropriate fonts installed, etc.
As for string literals in C++03 source files, again there is no guarantee. If you have a Unicode string literal in your code which contains characters outside of the ASCII range, it is up to your compiler to decide how to interpret these characters. If you want to explicitly guarantee that the compiler will "do the right thing", you'd need to use \uXXXX notation in your string literals.
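A minimal sketch of that \uXXXX approach (the strings and names are illustrative):
#include <string>

int main() {
    std::wstring greeting = L"Se\u00F1or";  // "Señor" with no non-ASCII bytes in the source
    std::wstring sigma    = L"\u03A3";      // U+03A3 GREEK CAPITAL LETTER SIGMA
    (void)greeting; (void)sigma;
    return 0;
}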
The C++03 standard does not mention Unicode (the upcoming C++0x does). Currently you have to either use external libraries (ICU, UTF-CPP, etc.) or build your own solution using platform-specific code. As others have mentioned, wchar_t encoding (or even size) is not specified. Consequently, string literal encoding is implementation-specific. However, you can give Unicode code points in string literals by using \x, \u and \U escapes.
Typically Unicode apps on Windows use wchar_t (with UTF-16 encoding) as the internal character format, because it makes using Windows APIs easier, as Windows itself uses UTF-16. Unix/Linux Unicode apps in turn usually use char (with UTF-8 encoding) internally. If you want to exchange data between different platforms, UTF-8 is the usual choice for the data transfer encoding.
The standard makes no mention of encoding formats for strings.
Take a look at ICU from IBM (it's free): http://site.icu-project.org/