Using Unicode in a C++ program

I want strings containing Unicode characters to be handled correctly in my file synchronizer application, but I don't know how this kind of encoding works.
In a Unicode string, I can see that a Unicode character has the form "\uxxxx", where the x's are digits. How does a normal C or C++ program interpret this kind of character? (Why is there a 'u' after the '\'? What is its effect?)
On the internet I see examples using "wide strings" or wchar_t.
So, what is the suitable object to handle Unicode characters? In RapidJSON (which supports Unicode: UTF-8, UTF-16, UTF-32), we can use const char* to store a JSON document that could have "wide characters", but those characters take more than one byte to represent... I don't understand.
This is the temporary workaround I have found for the moment (Unicode -> UTF-8? ASCII?; listFolder is a std::string):
boost::replace_all(listFolder, "\\u00e0", "à");
boost::replace_all(listFolder, "\\u00e2", "â");
boost::replace_all(listFolder, "\\u00e4", "ä");
...

The suitable object to handle Unicode strings in C++ is icu::UnicodeString (check "API References, ICU4C" in the sidebar), at least if you want to really handle Unicode strings (as opposed to just passing them from one point of your application to another).
wchar_t was an early attempt at handling international character sets, which turned out to be a failure because Microsoft's definition of wchar_t as two bytes turned out to be insufficient once Unicode was extended beyond code point 0x10000. Linux defines wchar_t as four bytes, but the inconsistency makes it (and its derived std::wstring) rather useless for portable programming.
TCHAR is a Microsoft define that resolves to char by default and to WCHAR if UNICODE is defined, with WCHAR in turn being wchar_t behind a level of indirection... yeah.
C++11 brought us char16_t and char32_t as well as the corresponding string classes, but those are still instances of basic_string<>, and as such have their shortcomings e.g. when trying to uppercase / lowercase characters that have more than one replacement character (e.g. the German ß would require to be extended to SS in uppercase; the standard library cannot do that).
ICU, on the other hand, goes the full way. For example, it provides normalization and decomposition, which the standard strings do not.
\uxxxx and \UXXXXXXXX are unicode character escapes. The xxxx are a 16-bit hexadecimal number representing a UCS-2 code point, which is equivalent to a UTF-16 code point within the Basic Multilingual Plane.
The XXXXXXXX are a 32-bit hex number, representing a UTF-32 code point, which may be any plane.
How those character escapes are handled depends on the context in which they appear (narrow / wide string, for example), making them somewhat less than perfect.
C++11 introduced "proper" Unicode literals:
u8"..." is always a const char[] in UTF-8 encoding (a const char8_t[] since C++20).
u"..." is always a const char16_t[] in UTF-16 encoding.
U"..." is always a const char32_t[] in UTF-32 encoding.
If you use \uxxxx or \UXXXXXXXX within one of those three, the character literal will always be expanded to the proper code unit sequence.
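To make the "proper code unit sequence" concrete, here is a small compile-time sketch. The helper name code_units is made up for illustration; it simply reports the array length of a literal minus the terminating null, so the static_asserts document how many code units each prefix produces for the same escape.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00E0 (a with grave): two UTF-8 units, one UTF-16 unit, one UTF-32 unit.
static_assert(code_units(u8"\u00e0") == 2, "UTF-8");
static_assert(code_units(u"\u00e0") == 1, "UTF-16");
static_assert(code_units(U"\u00e0") == 1, "UTF-32");

// U+1F600 lies outside the BMP: four UTF-8 units, a surrogate pair in UTF-16.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32");
```

Because the escapes are used instead of the raw characters, the result does not depend on the encoding of the source file.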
Note that storing UTF-8 in a std::string is possible, but hazardous. You need to be aware of many things: .length() is not the number of characters in your string. .substr() can lead to partial and invalid sequences. .find_first_of() will not work as expected. And so on.
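As a minimal sketch of the first caveat, the function below counts code points instead of bytes. The name count_code_points is hypothetical, and the function assumes its input is already well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold well-formed UTF-8 by
// skipping continuation bytes (bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;  // count only bytes that start a sequence
    }
    return n;
}
```

Even this number is not "characters" in the sense a user would expect: a letter followed by a combining accent is two code points. Counting user-perceived characters needs grapheme segmentation, which is what a library like ICU is for.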
That being said, in my opinion UTF-8 is the only sane encoding choice for any stored text. There are cases to be made for handling texts as UTF-16 in-memory (the way ICU does), but on file, don't accept anything but UTF-8. It's space-efficient, endianness-independent, and allows for semi-sane handling even by software that is blissfully unaware of Unicode matters (see caveats above).

In a Unicode string, I can see that a Unicode character has the form "\uxxxx", where the x's are digits. How does a normal C or C++ program interpret this kind of character? (Why is there a 'u' after the '\'? What is its effect?)
That is a unicode character escape sequence. It will be interpreted as a unicode character. The u after the escape character is part of the syntax and it's what differentiates it from other escape sequences. Read the documentation for more information.
So, what is the suitable object to handle Unicode characters?
char for UTF-8
char16_t for UTF-16
char32_t for UTF-32
The size of wchar_t is platform dependent, so you cannot make portable assumptions of which encoding it suits.
we can use const char* to store a JSON document that could have "wide characters", but those characters take more than one byte to represent...
If you mean that you can store multi-byte utf-8 characters in a char string, then you're correct.
This is the temporary workaround I have found for the moment (Unicode -> UTF-8? ASCII?; listFolder is a std::string)
What you're attempting to do there is to replace some Unicode escape sequences with characters that have a platform-defined encoding. If you have other Unicode characters besides those, then you end up with a string that has mixed encodings. Also, in some cases it may accidentally replace parts of other byte sequences. I recommend using a library to convert between encodings or to do any other manipulation on encoded strings.
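For illustration only, here is a rough sketch of what a general replacement for the chain of replace_all calls could look like: it turns every literal \uXXXX in a std::string into UTF-8 bytes, including surrogate pairs. The function names are made up, and the sketch is not a JSON parser (for instance, it does not treat an escaped backslash before the u specially). A JSON library normally decodes these escapes itself while parsing, so in practice you would let it do so rather than post-process the text.

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of code point cp to out.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Parses four hex digits starting at text[pos]; returns false if they are missing.
bool parse_hex4(const std::string& text, std::size_t pos, char32_t& value) {
    if (pos + 4 > text.size()) return false;
    value = 0;
    for (std::size_t i = pos; i < pos + 4; ++i) {
        const char c = text[i];
        value <<= 4;
        if (c >= '0' && c <= '9') value |= static_cast<char32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') value |= static_cast<char32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') value |= static_cast<char32_t>(c - 'A' + 10);
        else return false;
    }
    return true;
}

// Replaces every literal \uXXXX in text with the UTF-8 bytes of that code point.
std::string decode_u_escapes(const std::string& text) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        char32_t cp = 0;
        if (text[i] == '\\' && i + 1 < text.size() && text[i + 1] == 'u' &&
            parse_hex4(text, i + 2, cp)) {
            i += 6;
            char32_t low = 0;
            // A high surrogate followed by an escaped low surrogate is one code point.
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size() &&
                text[i] == '\\' && text[i + 1] == 'u' &&
                parse_hex4(text, i + 2, low) && low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 6;
            }
            // A surrogate left unpaired cannot be encoded; substitute U+FFFD.
            if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;
            append_utf8(out, cp);
        } else {
            out += text[i++];
        }
    }
    return out;
}
```

The result is consistently UTF-8, whatever characters appear in the input, which avoids the mixed-encoding problem described above.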

Related

How can one properly declare char8_t for diacritical letters?

I attempt to initialise some diacritical Latin letters using the new char8_t type:
constexpr char8_t french_letter_A_1 = 'À';//does not function properly
However, Visual Studio 2019 gives me the following message: “character represented by universal-character-name "\u(the name)" cannot be represented in the current code page”, and the character cannot be displayed properly. If I try to explicitly declare the character as a u8 one, like:
constexpr char8_t french_letter_A_2 = u8'Â';//has error
it even reports the error "a UTF-8 character literal value cannot occupy more than one code unit", but non-diacritical letters are interpreted as UTF-8 without problems:
constexpr char8_t french_letter_A_0 = u8'A';//but ASCII letters are fine
I am wondering how I can properly declare a UTF-8 character with Visual C++... or do I misunderstand the concept of char8_t, and should I use something else instead?
Edit: I now understand that a single char8_t cannot hold those non-ASCII characters. What character type should I use instead?
char8_t, like char, signed char, and unsigned char, has a size of 1 byte. On most platforms (but not all!), that means it is an 8-bit type only capable of holding 256 discrete values. Unicode 12.1 defines 137,994 characters. Clearly, they can't all fit in a single char8_t value!
The C and C++ "character" types are, regrettably, poorly named. If we were designing a new language with modern terminology, we would name them some variation of code_unit as that better reflects how they are actually used. char32_t is the only character type that is currently guaranteed to be able to hold a code point value for every character in its associated character set (the C and C++ standards claim that wchar_t can too, but that contradicts existing practice).
Looking at your example, À is U+00C0 {LATIN CAPITAL LETTER A WITH GRAVE} (or is it actually A U+0041 {LATIN CAPITAL LETTER A} followed by ̀ U+0300 {COMBINING GRAVE ACCENT}? Unicode is tricky that way). The UTF-8 encoding of U+00C0 is 0xC3 0x80. What value should french_letter_A_1 hold? It can't hold both code unit values. And if the value were to be the code point, then we're either in the situation that only 256 characters can be (portably) supported or, worse, that sometimes values of char8_t are code points and sometimes they are code units.
The reality is that C and C++ character literals are limited to just a few more characters than are in the basic source character set. That is sufficient if one is writing an English-only application. But for modern applications, character literals have limited use.
As Nicol already stated, working with most characters outside the basic source character set requires doing real text processing on strings. Unfortunately, the C and C++ standards do not provide much help there. That is something that SG16 is working to improve.
UTF-8 is an encoding for Unicode codepoints. In UTF-8, a codepoint is broken down into one or more "octets" (8-bit words) called UTF-8 code units. The C++20 type that represents a UTF-8 code unit is char8_t.
A single char8_t is only one UTF-8 code unit. Therefore, it can only represent a Unicode codepoint whose UTF-8 encoding only takes up 1 code unit. Visual Studio is telling you that the "character" you are trying to store in a char8_t requires more than 1 code unit and therefore cannot be stored in such a type. The only Unicode code points that UTF-8 encodes in a single code unit are the ASCII code points.
When dealing with UTF-8 (or any Unicode encoding that isn't UTF-32 for that matter), you do not deal in "characters"; you deal in strings: contiguous sequences of code units. Anytime you want to deal with UTF-8, you should be using some kind of char8_t-based string type.
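A small sketch of what "dealing in strings" means in practice: each code point is itself a short run of one to four code units. The helper names below are hypothetical, the input is assumed to be well-formed UTF-8, and plain std::string is used instead of a char8_t-based string so the sketch also compiles before C++20.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length of a UTF-8 sequence as announced by its lead byte (1 for stray bytes).
std::size_t sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // continuation or invalid byte
}

// Splits text, assumed to be well-formed UTF-8, into one string per code point.
std::vector<std::string> split_code_points(const std::string& text) {
    std::vector<std::string> parts;
    std::size_t i = 0;
    while (i < text.size()) {
        const std::size_t len = sequence_length(static_cast<unsigned char>(text[i]));
        parts.push_back(text.substr(i, len));  // substr clamps at the end of text
        i += len;
    }
    return parts;
}
```

For the U+00C0 example above, the first element of the result is the two-byte string 0xC3 0x80, which is exactly what does not fit into a single char8_t.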

Does `std::wregex` support utf-16/unicode or only UCS-2?

C++11 introduced the regex library into the standard library.
On the Windows/MSVC platform, wchar_t has a size of 2 bytes (16 bits) and wchar_t* is normally UTF-16 when interfacing with the system/platform (e.g. CreateFileW).
However, it seems that std::regex isn't UTF-8 or does not support it, so I'm wondering whether std::wregex supports UTF-16 or just UCS-2.
I cannot find any mention of this (Unicode or the like) in the documentation. In other languages normalization takes place.
The question is:
Is std::wregex representing UCS-2 when wchar_t has a size of 2?
The C++ standard doesn't enforce any encoding on std::string and std::wstring. They're simply a sequence of CharT. Only std::u8string, std::u16string and std::u32string have a defined encoding.
What encoding does std::string.c_str() use?
Does std::string in c++ has encoding format
Similarly, std::regex and std::wregex also wrap around std::basic_string and CharT. Their constructors accept std::basic_string, and the encoding being used for std::basic_string will also be used for std::basic_regex. So what you said, "std::regex isn't UTF-8 or does not support it", is wrong. If the current locale is UTF-8 then std::regex and std::string will be UTF-8 (yes, modern Windows does support a UTF-8 locale).
On Windows std::wstring uses UTF-16, so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2 because UTF-16 is self-synchronizing and a proper needle string can never match from the middle of a character in the haystack. The same holds for UTF-8. If the tool doesn't understand UTF-16 then it's highly likely that it doesn't know that UTF-8 is variable-length either, and will truncate the UTF-8 in the middle.
Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.
https://en.wikipedia.org/wiki/UTF-8#Description
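The "back up at most 3 bytes" property from the quote can be sketched in a few lines. The function name sequence_start is made up, and the sketch assumes well-formed UTF-8 and an index inside the string.

```cpp
#include <cstddef>
#include <string>

// Returns the index of the lead byte of the sequence containing text[pos].
// Assumes well-formed UTF-8 and pos < text.size().
std::size_t sequence_start(const std::string& text, std::size_t pos) {
    while (pos > 0 && (static_cast<unsigned char>(text[pos]) & 0xC0) == 0x80) {
        --pos;  // 10xxxxxx is a continuation byte: keep backing up
    }
    return pos;
}
```

Because lead bytes and continuation bytes never share values, the loop needs no state beyond the current byte, and on well-formed input it steps back at most three times.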
The only things you need to care about are: avoid truncating in the middle of a character, and normalize the string before matching if necessary. The former issue can be avoided in UCS-2-only regex engines if you never use characters outside the BMP in a character class, as noted in the comments. Replace them with a group instead.
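As an illustration of the first point, here is one possible way to shorten a UTF-16 string without cutting a surrogate pair in half. The name truncate_utf16 is hypothetical, and the sketch only guards against a dangling high surrogate at the cut; it does not consider combining sequences.

```cpp
#include <cstddef>
#include <string>

// Truncates text to at most max_units UTF-16 code units without leaving a
// high surrogate dangling at the end.
std::u16string truncate_utf16(const std::u16string& text, std::size_t max_units) {
    if (text.size() <= max_units) return text;
    std::size_t end = max_units;
    // If the last kept unit is a high surrogate, its partner was cut off.
    if (end > 0 && text[end - 1] >= 0xD800 && text[end - 1] <= 0xDBFF) {
        --end;
    }
    return text.substr(0, end);
}
```

The same idea applies to UTF-8, where the cut has to be moved back to the nearest lead byte.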
In other languages normalization takes place.
This is wrong. Some languages may do normalization before matching a regex, but that definitely doesn't apply to all "other languages".
If you want a little more assurance, you could use std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively, although the standard library only provides std::regex_traits specializations for char and wchar_t, so you would have to supply your own traits class. You'll still need a UTF-16-aware library though, otherwise that'll still only work for regex strings that only contain words.
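To see what "not Unicode-aware" means in practice, consider this small experiment with UTF-8 held in std::string. The wrapper name matches is made up; it only forwards to std::regex_match, which compares one code unit at a time.

```cpp
#include <regex>
#include <string>

// True if the whole of text matches pattern, compared code unit by code unit.
bool matches(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

With "caf\xC3\xA9" (the UTF-8 bytes of "café") as the text, the pattern "caf." does not match, because the dot consumes a single byte and é occupies two; "caf.." does match, and so does the literal pattern "caf\xC3\xA9". Literal needles work, while anything that is supposed to operate on whole characters (the dot, character classes, case-insensitive matching) does not.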
The better solution may be changing to another library like ICU regex. You can check Comparison of regular expression engines for some suggestions. It even has a column indicating native UTF-16 support for each library
Related:
Do C++11 regular expressions work with UTF-8 strings?
How well is Unicode supported in C++11?
How do I properly use std::string on UTF-8 in C++?
How to use Unicode range in C++ regex
See also
Unicode Regular Expressions
Unicode Support in the Standard Library

What is the efficient, standards-compliant mechanism for processing Unicode using C++17?

Short version:
If I wanted to write a program that can efficiently perform operations on Unicode characters and can read and write files in UTF-8 or UTF-16 encodings, what is the appropriate way to do this with C++?
Long version:
C++ predates Unicode, and both have evolved significantly since. I need to know how to write standards-compliant C++ code that is leak-free. I need clear answers to:
Which string container should I pick?
std::string with UTF-8?
std::wstring (don't really know much about it)
std::u16string with UTF-16?
std::u32string with UTF-32?
Should I stick entirely to one of the above containers or change them when needed?
Can I use non-english characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc?
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
What happens when I do the following?
std::string s = u8"foo";
s += 'x';
What are differences between wchar_t and other multi-byte character types? Is wchar_t character or wchar_t string literal capable of storing UTF encodings?
Which string container should I pick?
That is really up to you to decide, based on your own particular needs. Any of the choices you have presented will work, and they each have their own advantages and disadvantages. Generally, UTF-8 is good to use for storage and communication purposes, and is backwards compatible with ASCII, whereas UTF-16/32 is easier to use when processing Unicode data.
std::wstring (don't really know much about it)
The size of wchar_t is compiler-dependent and even platform-dependent. For instance, on Windows, wchar_t is 2 bytes, making std::wstring usable for UTF-16 encoded strings. On other platforms, wchar_t may be 4 bytes instead, making std::wstring usable for UTF-32 encoded strings instead. That is why wchar_t/std::wstring is generally not used in portable code, and why char16_t/std::u16string and char32_t/std::u32string were introduced in C++11. Even char can have portability issues for UTF-8, since char can be either signed or unsigned at the discretion of the compiler vendors, which is why char8_t/std::u8string was introduced in C++20 for UTF-8.
Should I stick entirely to one of the above containers or change them when needed?
Use whatever containers suit your needs.
Typically, you should use one string type throughout your code. Perform data conversions only at the boundaries where string data enters/leaves your program. For instance, when reading/writing files, network communications, platform system calls, etc.
How to properly convert between them?
There are many ways to handle that.
C++11 and later have std::wstring_convert/std::wbuffer_convert. But these were deprecated in C++17.
There are 3rd party Unicode conversion libraries, such as ICONV, ICU, etc.
There are C library functions, platform system calls, etc.
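If none of those options is available, a conversion can also be written by hand. The following is a minimal sketch of one direction only, UTF-16 to UTF-32; the function name and the policy of replacing unpaired surrogates with U+FFFD are my own choices for illustration, not something prescribed by the standard library.

```cpp
#include <cstddef>
#include <string>

// Converts UTF-16 to UTF-32. Surrogate pairs become one code point; an
// unpaired surrogate is replaced with U+FFFD (REPLACEMENT CHARACTER).
std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        out.push_back(cp);
    }
    return out;
}
```

In line with the advice above, a function like this would sit at a program boundary (for example, right after receiving UTF-16 from a platform API), so that the rest of the code sees only one string type.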
Can I use non-english characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc?
Yes, if you use appropriate string literal prefixes:
u8 for UTF-8.
L for UTF-16 or UTF-32 (depending on compiler/platform).
u for UTF-16.
U for UTF-32.
Also, be aware that the charset you use to save your source files can affect how the compiler interprets string literals. So make sure that whatever charset you choose to save your files in, like UTF-8, that you tell your compiler what that charset is, or else you may end up with the wrong string values at runtime.
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
Each char in the string may be a single-byte character, or part of a multi-byte representation of a Unicode codepoint. It depends on the encoding of the string and on the character being encoded.
Likewise, std::wstring (when wchar_t is 2 bytes) and std::u16string can hold strings containing supplementary characters outside of the Unicode BMP, which require UTF-16 surrogates to encode.
When a string container contains a UTF encoded string, each "character" is just a UTF encoded codeunit. UTF-8 encodes a Unicode codepoint as 1-4 codeunits (1-4 chars in a std::string). UTF-16 encodes a codepoint as 1-2 codeunits (1-2 wchar_ts/char16_ts in a std::wstring/std::u16string). UTF-32 encodes a codepoint as 1 codeunit (1 char32_t in a std::u32string).
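The code unit counts follow directly from the value of the codepoint. As a sketch (the function names are made up, and the input is assumed to be a valid codepoint, i.e. at most 0x10FFFF and not a surrogate):

```cpp
// Number of code units needed to encode codepoint cp in each UTF.
constexpr int utf8_units(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
constexpr int utf16_units(char32_t cp) { return cp < 0x10000 ? 1 : 2; }
constexpr int utf32_units(char32_t) { return 1; }

static_assert(utf8_units(U'A') == 1 && utf16_units(U'A') == 1, "ASCII");
static_assert(utf8_units(0x0105) == 2 && utf16_units(0x0105) == 1, "U+0105");
static_assert(utf8_units(0x20AC) == 3 && utf16_units(0x20AC) == 1, "U+20AC, in the BMP");
static_assert(utf8_units(0x1F600) == 4 && utf16_units(0x1F600) == 2, "outside the BMP");
```

U+0105 is the Polish letter ą from the question: it occupies two chars in a UTF-8 std::string but a single element in a std::u16string or std::u32string.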
What happens when I do the following?
std::string s = u8"foo";
s += 'x';
Exactly what you would expect. A std::string holds char elements. Regardless of encoding, operator+=(char) will simply append a single char to the end of the std::string.
How can I distinguish UTF char[] and non-UTF char[] or std::string?
You would need to have outside knowledge of the string's original encoding, or else perform your own heuristic analysis of the char[]/std::string data to see if it conforms to a UTF or not.
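A heuristic check of this kind could look like the sketch below, which tests whether a byte string is well-formed UTF-8. The name is_valid_utf8 is hypothetical. Keep in mind that a positive answer is only evidence, not proof: pure ASCII text passes, and so can short strings in other encodings.

```cpp
#include <cstddef>
#include <string>

// Returns true if data is well-formed UTF-8: correct continuation bytes, no
// overlong forms, no surrogates, nothing above U+10FFFF.
bool is_valid_utf8(const std::string& data) {
    std::size_t i = 0;
    while (i < data.size()) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t cp = 0;        // codepoint being assembled
        char32_t min = 0;       // smallest codepoint allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;                           // stray continuation or invalid lead
        if (i + extra >= data.size()) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += extra + 1;
    }
    return true;
}
```

Text in a legacy single-byte encoding that contains accented letters usually fails this test quickly, because a lone byte such as 0xE0 is a lead byte with no continuation bytes after it.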
What are differences between wchar_t and other multi-byte character types?
Byte size and UTF encoding.
char = ANSI/MBCS or UTF-8
wchar_t = DBCS, UTF-16 or UTF-32, depending on compiler/platform
char8_t = UTF-8
char16_t = UTF-16
char32_t = UTF-32
Is wchar_t character or wchar_t string literal capable of storing UTF encodings?
Yes, UTF-16 or UTF-32, depending on compiler/platform. In case of UTF-16, a single wchar_t can only hold a codepoint value that is in the BMP. A single wchar_t in UTF-32 can hold any codepoint value. A wchar_t string can encode all codepoints in either encoding.
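The difference between "a single code unit" and "a string of code units" can be sketched with a small encoder. The function name encode_utf16 is made up for illustration, char16_t is used rather than wchar_t so the result does not depend on the platform, and the input is assumed to be a valid codepoint.

```cpp
#include <string>

// Encodes one codepoint (assumed valid: <= 0x10FFFF, not a surrogate) as UTF-16.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));  // BMP: a single code unit
    } else {
        const char32_t v = cp - 0x10000;           // 20 bits to split in two
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

A codepoint outside the BMP such as U+1F600 comes out as the pair 0xD83D 0xDE00, which is why it fits in a UTF-16 string but not in a single 2-byte wchar_t.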
How to properly manipulate UTF strings (such as toupper/tolower conversion) and be compatible with locales simultaneously?
That is a very broad topic, worthy of its own separate question by itself.

Is the u8 string literal necessary in C++11

From Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
I'm wondering what exactly this means for writing portable applications. Is there any difference between writing this
const char str[] = "Test String";
or this?
const char str[] = u8"Test String";
Is there any reason not to use the latter for every string literal in your code?
What happens when there are non-ASCII characters inside the test string?
The encoding of "Test String" is the implementation-defined system encoding (the narrow, possibly multibyte one).
The encoding of u8"Test String" is always UTF-8.
The examples aren't terribly telling. If you included some Unicode literals (such as \U0010FFFF) into the string, then you would always get those (encoded as UTF-8), but whether they could be expressed in the system-encoded string, and if yes what their value would be, is implementation-defined.
If it helps, imagine you're authoring the source code on an EBCDIC machine. Then the literal "Test String" is always EBCDIC-encoded in the source file itself, but the u8-initialized array contains UTF-8 encoded values, whereas the first array contains EBCDIC-encoded values.
You quote Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
Well, the “For the purpose of” is not true. char has always been guaranteed to be at least 8 bits, that is, CHAR_BIT has always been required to be ≥8, due to the range required for char in the C standard. Which is (quote C++11 §17.5.1.5/1) “incorporated” into the C++ standard.
If I should guess about the purpose of that change of wording, it would be to just clarify things for those readers unaware of the dependency on the C standard.
Regarding the effect of the u8 literal prefix, it
affects the encoding of the string in the executable, but
unfortunately it does not affect the type.
Thus, in both cases "tørrfisk" and u8"tørrfisk" you get a char const[n]. But in the former literal the encoding is whatever is selected for the compiler, e.g. with Latin 1 (or Windows ANSI Western) that would be 8 bytes for the characters plus a null byte, for array size 9. In the latter literal the encoding is guaranteed to be UTF-8, where the “ø” (U+00F8) is encoded with 2 bytes, for an array size of 10.
If the execution character set of the compiler is set to UTF-8, it makes no difference if u8 is used or not, since the compiler converts the characters to UTF-8 in both cases.
However, if the compiler's execution character set is the system's non-UTF-8 code page (the default for e.g. Visual C++), then non-ASCII characters might not be handled properly when u8 is omitted. For example, the conversion to wide strings will crash, e.g. in VS15:
std::string narrowJapanese("スタークラフト");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convertWindows;
std::wstring wide = convertWindows.from_bytes(narrowJapanese); // Unhandled C++ exception in xlocbuf.
The compiler chooses a native encoding natural to the platform. On typical POSIX systems it will probably choose ASCII and something possibly depending on environment's setting for character values outside the ASCII range. On mainframes it will probably choose EBCDIC. Comparing strings received, e.g., from files or the command line will probably work best with the native character set. When processing files explicitly encoded using UTF-8 you are, however, probably best off using u8"..." strings.
That said, with the recent changes relating to character encodings a fundamental assumption of string processing in C and C++ got broken: each internal character object (char, wchar_t, etc.) used to represent one character. This is clearly not true anymore for a UTF-8 string where each character object just represents a byte of some character. As a result all the string manipulation, character classification, etc. functions won't necessarily work on these strings. We don't have any good library lined up to deal with such strings for inclusion into the standard.

Is a wide character string literal starting with L like L"Hello World" guaranteed to be encoded in Unicode?

I've recently tried to get the full picture of what steps it takes to create platform-independent C++ applications that support Unicode. A thing that is confusing to me is that most how-tos equate the character encoding (i.e. ANSI or Unicode) with the character type (char or wchar_t). As I've learned so far, these are different things, and there may exist a character sequence encoded in Unicode but represented by std::string as well as a character sequence encoded in ANSI but represented as std::wstring, right?
So the question that comes to my mind is whether the C++ standard gives any guarantee about the encoding of string literals starting with L, or does it just say it's of type wchar_t with an implementation-specific character encoding?
If there is no such guarantee, does that mean I need some sort of external resource system to provide non-ASCII string literals for my application in a platform-independent way?
What is the preferred way to do this? A resource system, or proper encoding of source files plus proper compiler options?
The L symbol in front of a string literal simply means that each character in the string will be stored as a wchar_t. But this doesn't necessarily imply Unicode. For example, you could use a wide character string to encode GB 18030, a character set used in China which is similar to Unicode. The C++03 standard doesn't have anything to say about Unicode, (however C++11 defines Unicode char types and string literals) so it's up to you to properly represent Unicode strings in C++03.
Regarding string literals, Chapter 2 (Lexical Conventions) of the C++ standard mentions a "basic source character set", which is basically equivalent to ASCII. So this essentially guarantees that "abc" will be represented as a 3-byte string (not counting the null), and L"abc" will be represented as a 3 * sizeof(wchar_t)-byte string of wide-characters.
The standard also mentions "universal-character-names" which allow you to refer to non-ASCII characters using the \uXXXX hexadecimal notation. These "universal-character-names" usually map directly to Unicode values, but the standard doesn't guarantee that they have to. However, you can at least guarantee that your string will be represented as a certain sequence of bytes by using universal-character-names. This will guarantee Unicode output provided the runtime environment supports Unicode, has the appropriate fonts installed, etc.
As for string literals in C++03 source files, again there is no guarantee. If you have a Unicode string literal in your code which contains characters outside of the ASCII range, it is up to your compiler to decide how to interpret these characters. If you want to explicitly guarantee that the compiler will "do the right thing", you'd need to use \uXXXX notation in your string literals.
The C++03 standard does not mention Unicode (the future C++0x does). Currently you have to either use external libraries (ICU, UTF-CPP, etc.) or build your own solution using platform-specific code. As others have mentioned, the wchar_t encoding (or even size) is not specified. Consequently, string literal encoding is implementation-specific. However, you can give Unicode codepoints in string literals by using \x, \u and \U escapes.
Typically Unicode apps on Windows use wchar_t (with UTF-16 encoding) as the internal character format, because it makes using Windows APIs easier, as Windows itself uses UTF-16. Unix/Linux Unicode apps in turn usually use char (with UTF-8 encoding) internally. If you want to exchange data between different platforms, UTF-8 is the usual choice for the data transfer encoding.
The standard makes no mention of encoding formats for strings.
Take a look at ICU from IBM (it's free). http://site.icu-project.org/