Here I have some simple code:
#include <iostream>
#include <cstdint>
int main()
{
    const unsigned char utf8_string[] = u8"\xA0";
    std::cout << std::hex << "Size: " << sizeof(utf8_string) << std::endl;
    for (int i = 0; i < sizeof(utf8_string); i++) {
        std::cout << std::hex << (uint16_t)utf8_string[i] << std::endl;
    }
}
I see different behavior here with MSVC and GCC.
MSVC treats "\xA0" as a not-yet-encoded Unicode code point and encodes it to UTF-8.
So in MSVC the output is:
C2A0
which is the correct UTF-8 encoding of the Unicode code point U+00A0.
But with GCC nothing happens: it treats the string as plain bytes. There is no change even if I remove the u8 prefix from the string literal.
Both compilers encode to UTF-8 and output C2A0 if the string is changed to u8"\u00A0".
Why do compilers behave differently and which actually does it right?
Software used for test:
GCC 8.3.0
MSVC 19.00.23506
C++ 11
They're both wrong.
As far as I can tell, the C++17 standard says here that:
The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
Although there are other hints, this seems to be the strongest indication that escape sequences are not multi-byte and that MSVC's behaviour is wrong.
There are tickets for this which are currently marked as Under Investigation:
https://developercommunity.visualstudio.com/content/problem/225847/hex-escape-codes-in-a-utf8-literal-are-treated-in.html
https://developercommunity.visualstudio.com/content/problem/260684/escape-sequences-in-unicode-string-literals-are-ov.html
However it also says here about UTF-8 literals that:
If the value is not representable with a single UTF-8 code unit, the program is ill-formed.
Since 0xA0 is not a valid UTF-8 character, the program should not compile.
Note that:
UTF-8 literals starting with u8 are defined as being narrow.
\xA0 is an escape sequence
\u00A0 is considered a universal character name and not an escape sequence
This is CWG issue 1656.
It has been resolved in the current standard draft through P2029R4: numeric escape sequences are taken by value as a single code unit, not as a Unicode code point that is then encoded to UTF-8, even if this results in an invalid UTF-8 sequence.
Therefore GCC's behavior is/will be correct.
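For illustration, here is a minimal sketch of the behavior after that resolution (assuming a compiler that implements it and compiling as C++17, where u8 literals are still arrays of char; the byte values in the comments are what the resolution prescribes):
#include <cstdio>
int main()
{
    // Numeric escape sequence: taken as a single code unit by value,
    // even though a lone 0xA0 is not valid UTF-8 (CWG 1656 / P2029).
    const unsigned char esc[] = u8"\xA0";   // { 0xA0, 0x00 }, sizeof == 2
    // Universal-character-name: the code point U+00A0 is encoded to UTF-8.
    const unsigned char ucn[] = u8"\u00A0"; // { 0xC2, 0xA0, 0x00 }, sizeof == 3

    std::printf("%zu %zu\n", sizeof esc, sizeof ucn); // 2 3
    for (std::size_t i = 0; i < sizeof esc; ++i) std::printf("%02X ", esc[i]);
    std::printf("\n");
    for (std::size_t i = 0; i < sizeof ucn; ++i) std::printf("%02X ", ucn[i]);
    std::printf("\n");
}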
I can't tell you which way is true to the standard.
The way MSVC does it is at least logically consistent and easily explainable. The three escape sequences \x, \u, and \U behave identically except for the number of hex digits they pull from the input: 2, 4, or 8 (strictly, \x consumes as many hex digits as follow). Each defines a Unicode code point that must then be encoded to UTF-8. Embedding a byte without encoding it opens up the possibility of creating an invalid UTF-8 sequence.
Why do compilers behave differently and which actually does it right?
Compilers behave differently because of the way they decided to implement the C++ standard:
GCC uses strict rules and implements the standard as is
MSVC uses loose rules and implements the standard in a more practical "real-world" kind of way
So things that fail in GCC will usually work in MSVC because it's more permissive, and MSVC handles some of these issues automatically.
Here is a similar example:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33167.
It follows the standard, but it's not what you would expect.
As to which does it right, depends on what your definition of "right" is.
Related
I have a simple line of code:
std::cout << std::hex << static_cast<int>('©');
This character is the Copyright Sign; its code point is a9, but the app writes c2a9. The same happens with lots of Unicode characters. Another example: ™ (which is 2122) suddenly returns e284a2. Why does C++ return the wrong codes for some characters, and how do I fix this?
Note: I'm using Microsoft Visual Studio, and the file with my code is saved as UTF-8.
An ordinary character literal (one without prefix) usually has type char and can store only elements of the execution character set that are representable as a single byte.
If the character is not representable in this way, the character literal is only conditionally-supported, with type int and an implementation-defined value. Compilers typically warn when this happens (depending on which warning flags you have enabled), since it is a mistake most of the time.
A byte is typically 8 bits, and therefore it is impossible to store all of Unicode in it. I don't know what execution character set your implementation uses, but clearly neither © nor ™ are in it.
It also seems that your implementation chose to support the non-representable character by encoding it in UTF-8 and using that as the value of the literal. You are seeing a representation of the numeric value of the UTF-8 encoding of the two characters.
If you want the numeric value of the Unicode code point for the character, use a character literal with the U prefix. Its value is the character's UTF-32 value, with type char32_t, which is large enough to hold all Unicode code points:
std::cout << std::hex << static_cast<std::uint_least32_t>(U'©');
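For completeness, here is a minimal sketch contrasting the two literals (assuming the source file is saved as UTF-8; the ordinary-literal value is implementation-defined, and the commented results are what MSVC and GCC typically produce):
#include <cstdint>
#include <iostream>
int main()
{
    // Ordinary literal: '©' does not fit in one byte, so the value is
    // implementation-defined; typical compilers use the UTF-8 bytes C2 A9.
    std::cout << std::hex << static_cast<int>('©') << "\n";                   // c2a9 (typical)

    // UTF-32 literal: the value is the Unicode code point U+00A9 itself.
    std::cout << std::hex << static_cast<std::uint_least32_t>(U'©') << "\n";  // a9
}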
I come from Python, where you can use string[10] to access a character in a sequence, and if the string is Unicode it gives the expected results. However, when I use indexing on a string in C++ it works as long as the characters are ASCII, but when I put a Unicode character inside the string and use indexing, the output is an octal escape like \201.
For example:
string ramp = "ÐðŁłŠšÝýÞþŽž";
cout << ramp << "\n";
cout << ramp[5] << "\n";
Output:
ÐðŁłŠšÝýÞþŽž
\201
Why this is happening and how can I access that character in the string representation or how can I convert the octal representation to the actual character?
Standard C++ is not equipped for proper handling of Unicode, giving you problems like the one you observed.
The problem here is that C++ predates Unicode by a comfortable margin. This means that even that string literal of yours will be interpreted in an implementation-defined manner, because those characters are not defined in the Basic Source Character set (which is, basically, the ASCII-7 characters minus @, $, and the backtick).
C++98 does not mention Unicode at all. It mentions wchar_t, and wstring being based on it, specifying wchar_t as being capable of "representing any character in the current locale". But that did more damage than good...
Microsoft defined wchar_t as 16 bit, which was enough for the Unicode code points at that time. However, since then Unicode has been extended beyond the 16-bit range... and Windows' 16-bit wchar_t is not "wide" anymore, because you need two of them to represent characters beyond the BMP -- and the Microsoft docs are notoriously ambiguous as to whether wchar_t means UTF-16 (multibyte encoding with surrogate pairs) or UCS-2 (wide encoding with no support for characters beyond the BMP).
All the while, a Linux wchar_t is 32 bit, which is wide enough for UTF-32...
C++11 made significant improvements to the subject, adding char16_t and char32_t including their associated string variants to remove the ambiguity, but still it is not fully equipped for Unicode operations.
Just as one example, try to convert e.g. German "Fuß" to uppercase and you will see what I mean. (The single letter 'ß' would need to expand to 'SS', which the standard functions -- handling one character in, one character out at a time -- cannot do.)
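As a small illustration of that one-character-in, one-character-out limitation (the locale name below is a guess and may not exist on a given system; the point is only that std::toupper cannot return two characters):
#include <cctype>
#include <clocale>
#include <cstdio>
int main()
{
    std::setlocale(LC_ALL, "de_DE.ISO-8859-1"); // hypothetical locale name
    unsigned char eszett = 0xDF;                // 'ß' in Latin-1
    // std::toupper maps one character to one character; it cannot expand
    // 'ß' to "SS", so it returns 'ß' unchanged (or some single substitute).
    std::printf("%02X\n", static_cast<unsigned>(std::toupper(eszett))); // typically DF
}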
However, there is help. The International Components for Unicode (ICU) library is fully equipped to handle Unicode in C++. As for specifying special characters in source code, you will have to use u8"", u"", and U"" to enforce interpretation of the string literal as UTF-8, UTF-16, and UTF-32 respectively, using octal / hexadecimal escapes or relying on your compiler implementation to handle non-ASCII-7 encodings appropriately.
And even then you will get an integer value for std::cout << ramp[5], because for C++, a character is just an integer with semantic meaning. ICU's ustream.h provides operator<< overloads for the icu::UnicodeString class, but ramp[5] is just a 16-bit unsigned integer (1), and people would look askance at you if their unsigned short were suddenly interpreted as a character. You need the C-API u_fputs() / u_printf() / u_fprintf() functions for that.
#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/ustdio.h>
#include <iostream>
int main()
{
    // make sure your source file is UTF-8 encoded...
    icu::UnicodeString ramp( icu::UnicodeString::fromUTF8( "ÐðŁłŠšÝýÞþŽž" ) );
    std::cout << ramp << "\n";
    std::cout << ramp[5] << "\n";
    u_printf( "%C\n", ramp[5] );
}
Compiled with g++ -std=c++11 testme.cpp -licuio -licuuc.
ÐðŁłŠšÝýÞþŽž
353
š
(1) ICU uses UTF-16 internally, and UnicodeString::operator[] returns a code unit, not a code point, so you might end up with one half of a surrogate pair. Look up the API docs for the various other ways to index a unicode string.
C++ has no useful native Unicode support. You almost certainly will need an external library like ICU.
To access codepoints individually, use u32string, which represents a string as a sequence of UTF-32 code units of type char32_t.
u32string ramp = U"ÐðŁłŠšÝýÞþŽž";
// Note: std::cout has no operator<< for u32string, so the string itself cannot
// be streamed directly; the indexed code point prints as its numeric value.
cout << ramp[5] << "\n"; // 353, i.e. U+0161 'š'
In my opinion, the best solution is to do any task with strings using iterators. I can't imagine a scenario where one really has to index strings: if you need indexing like ramp[5] in your example, then the 5 is usually computed in another part of the code, and you usually scan all the preceding characters anyway. That's why the Standard Library uses iterators in its API.
A similar problem comes up if you want to get the size of a string. Should it be character (or code point) count or merely number of bytes? Usually you need the size to allocate a buffer so byte count is more desirable. You only very, very rarely have to get Unicode character count.
If you want to process UTF-8 encoded strings using iterators then I would definitely recommend UTF8-CPP.
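To make the byte-versus-code-point distinction concrete, here is a small hand-rolled sketch (no library, and it assumes the input is well-formed UTF-8) that counts code points by skipping continuation bytes:
#include <cstddef>
#include <iostream>
#include <string>

// Count code points in a well-formed UTF-8 string: continuation bytes have
// the bit pattern 10xxxxxx, every other byte starts a new code point.
std::size_t codePointCount(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}

int main()
{
    std::string ramp = u8"ÐðŁłŠšÝýÞþŽž";
    std::cout << "bytes: " << ramp.size() << "\n";                // 24
    std::cout << "code points: " << codePointCount(ramp) << "\n"; // 12
}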
As to what is going on, cplusplus.com makes it clear:
Note that this class handles bytes independently of the encoding used: If used to handle sequences of multi-byte or variable-length characters (such as UTF-8), all members of this class (such as length or size), as well as its iterators, will still operate in terms of bytes (not actual encoded characters).
About the solution, others had it right: ICU if you are not using C++11; u32string if you are.
From Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
I'm wondering what exactly this means for writing portable applications. Is there any difference between writing this
const char str[] = "Test String";
or this?
const char str[] = u8"Test String";
Is there any reason not to use the latter for every string literal in your code?
What happens when there are non-ASCII characters inside the string?
The encoding of "Test String" is the implementation-defined system encoding (the narrow, possibly multibyte one).
The encoding of u8"Test String" is always UTF-8.
The examples aren't terribly telling. If you included some Unicode escapes (such as \U0010FFFF) in the string, then you would always get those (encoded as UTF-8) in the u8 version, but whether they could be expressed in the system-encoded string, and if so what their value would be, is implementation-defined.
If it helps, imagine you're authoring the source code on an EBCDIC machine. Then the literal "Test String" is always EBCDIC-encoded in the source file itself, but the u8-initialized array contains UTF-8 encoded values, whereas the first array contains EBCDIC-encoded values.
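A small sketch of that difference (pre-C++20, where u8 literals are still char-based; the first array's bytes depend on the compiler's execution character set, while the second array's bytes are guaranteed; the string "Grüße" is just an illustrative example):
#include <cstdio>
int main()
{
    const char narrow[] = "Grüße";   // encoding is implementation-defined (e.g. Latin-1 or UTF-8)
    const char utf8[]   = u8"Grüße"; // always UTF-8

    // With a Latin-1 execution character set: 5 vs 7 bytes (ü and ß take two
    // bytes each in UTF-8). With a UTF-8 execution character set: 7 vs 7.
    std::printf("%zu %zu\n", sizeof narrow - 1, sizeof utf8 - 1);
}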
You quote Wikipedia:
For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.
Well, the “For the purpose of” is not true. char has always been guaranteed to be at least 8 bits, that is, CHAR_BIT has always been required to be ≥ 8, due to the range required for char in the C standard, which is (quoting C++11 §17.5.1.5/1) “incorporated” into the C++ standard.
If I should guess about the purpose of that change of wording, it would be to just clarify things for those readers unaware of the dependency on the C standard.
Regarding the effect of the u8 literal prefix, it
affects the encoding of the string in the executable, but
unfortunately it does not affect the type.
Thus, in both cases "tørrfisk" and u8"tørrfisk" you get a char const[n]. But in the former literal the encoding is whatever is selected for the compiler, e.g. with Latin-1 (or Windows ANSI Western) that would be 8 bytes for the characters plus a null byte, for array size 9. In the latter literal the encoding is guaranteed to be UTF-8, where the “ø” is encoded with 2 bytes, for array size 10.
If the execution character set of the compiler is set to UTF-8, it makes no difference if u8 is used or not, since the compiler converts the characters to UTF-8 in both cases.
However, if the compiler's execution character set is the system's non-UTF-8 codepage (the default for e.g. Visual C++), then non-ASCII characters might not be handled properly when u8 is omitted. For example, the conversion to wide strings will crash, e.g. in VS15:
std::string narrowJapanese("スタークラフト");
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convertWindows;
std::wstring wide = convertWindows.from_bytes(narrowJapanese); // Unhandled C++ exception in xlocbuf.
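For contrast, here is a sketch of the same conversion with the u8 prefix (pre-C++20, where u8 literals are still char-based, and assuming the source file is saved as UTF-8; std::wstring_convert is deprecated since C++17 but was the idiomatic tool at the time):
#include <codecvt>
#include <locale>
#include <string>
int main()
{
    // With u8 the bytes are guaranteed to be valid UTF-8, so from_bytes()
    // succeeds even when the execution character set is a legacy codepage.
    std::string narrowJapanese(u8"スタークラフト");
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convertWindows;
    std::wstring wide = convertWindows.from_bytes(narrowJapanese); // no exception
}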
The compiler chooses a native encoding natural to the platform. On typical POSIX systems it will probably choose ASCII, plus possibly something depending on the environment's settings for character values outside the ASCII range. On mainframes it will probably choose EBCDIC. Comparing strings received, e.g., from files or the command line will probably work best with the native character set. When processing files explicitly encoded using UTF-8 you are, however, probably best off using u8"..." strings.
That said, with the recent changes relating to character encodings a fundamental assumption of string processing in C and C++ got broken: each internal character object (char, wchar_t, etc.) used to represent one character. This is clearly not true anymore for a UTF-8 string where each character object just represents a byte of some character. As a result all the string manipulation, character classification, etc. functions won't necessarily work on these strings. We don't have any good library lined up to deal with such strings for inclusion into the standard.
Why is there no UTF-8 character literal in C11 or C++11 even though there are UTF-8 string literals? I understand that, generally-speaking, a character literal represents a single ASCII character which is identical to a single-octet UTF-8 code point, but neither C nor C++ says the encoding has to be ASCII.
Basically, if I read the standard right, there's no guarantee that '0' will represent the integer 0x30, yet u8"0" must represent the char sequence 0x30 0x00.
EDIT:
I'm aware that not every UTF-8 code point would fit in a char. Such a literal would only be useful for single-octet code points (i.e. ASCII), so I guess calling it an "ASCII character literal" would be more fitting, but the question still stands. I just chose to frame the question around UTF-8 because there are UTF-8 string literals. The only way I can imagine portably guaranteeing ASCII values would be to write a constant for each character, which wouldn't be so bad considering there are only 128, but still...
It is perfectly acceptable to write non-portable C code, and this is one of many good reasons to do so. Feel free to assume that your system uses ASCII or some superset thereof, and warn your users that they shouldn't try to run your program on an EBCDIC system.
If you are feeling very generous, you can encode a check. The gperf program is known to generate code that includes such a check.
_Static_assert('0' == 48, "must be ASCII-compatible");
Or, for pre-C11 compilers,
extern int must_be_ascii_compatible['0' == 48 ? 1 : -1];
If you are on C11, you can use the u or U prefix on character constants, but not the u8 prefix...
/* This is useless, doesn't do what you want... */
_Static_assert(0, "this code is broken everywhere");
if (c == '々') ...
/* This works as long as wchar_t is UTF-16 or UTF-32 or UCS-2... */
/* Note: you shouldn't be using wchar_t, though... */
_Static_assert(__STDC_ISO_10646__, "wchar_t must be some form of Unicode");
if (c == L'々') ...
/* This works as long as char16_t is UTF-16 or UCS-2... */
_Static_assert(__STDC_UTF_16__, "char16_t must be UTF-16");
if (c == u'々') ...
/* This works as long as char32_t is UTF-32... */
_Static_assert(__STDC_UTF_32__, "char32_t must be UTF-32");
if (c == U'々') ...
There are some projects that are written in very portable C and have been ported to non-ASCII systems (example). This required a non-trivial amount of porting effort, and there's no real reason to make the effort unless you know you want to run your code on EBCDIC systems.
On standards: The people writing the C standard have to contend with every possible C implementation, including some downright bizarre ones. There are known systems where sizeof(char) == sizeof(long), CHAR_BIT != 8, integral types have trap representations, sizeof(void *) != sizeof(int *), sizeof(void *) != sizeof(void (*)()), va_list are heap-allocated, etc. It's a nightmare.
Don't beat yourself up trying to write code that will run on systems you've never even heard of, and don't search too hard for guarantees in the C standard.
For example, as far as the C standard is concerned, the following is a valid implementation of malloc:
void *malloc(size_t size) { return NULL; }
Note that while u8"..." constants are guaranteed to be UTF-8, u"..." and U"..." have no guarantees except that the encoding is 16 bits and 32 bits per character, respectively, and the actual encoding must be documented by the implementation.
Summary: Safe to assume ASCII compatibility in 2012.
A UTF-8 character literal would have to have variable length: for most code points it's not possible to store a single character in a char or wchar_t, so what type should it have? As we don't have variable-length types in C or C++, except for arrays of fixed-size types, the only reasonable type for it would be const char * -- and C strings are required to be null-terminated, so it wouldn't change anything.
As for the edit:
Quote from the C++11 standard:
The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
(footnote at 2.3.1).
I think that's a good reason for not guaranteeing it. Although, as you noted in a comment, for most (or every) mainstream compiler the ASCII-ness of character literals is guaranteed by the implementation.
For C++ this has been addressed by Evolution Working Group issue 119: Adding u8 character literals whose Motivation section says:
We have five encoding-prefixes for string-literals (none, L, u8, u, U)
but only four for character literals -- the missing one is u8. If the
narrow execution character set is not ASCII, u8 character literals
would provide a way to write character literals with guaranteed ASCII
encoding (the single-code-unit u8 encodings are exactly ASCII). Adding
support for these literals would add a useful feature and make the
language slightly more consistent.
EWG discussed the idea of adding u8 character literals in Rapperswil and accepted the change. This paper provides wording for that
extension.
This was incorporated into the working draft using the wording from N4267: Adding u8 character literals, and we can find the wording in the latest draft standard at this time, N4527, which in section 2.14.3 says they are limited to code points that fit into a single UTF-8 code unit:
A character literal that begins with u8, such as u8'w', is a character
literal of type char, known as a UTF-8 character literal. The value of
a UTF-8 character literal is equal to its ISO10646 code point value,
provided that the code point value is representable with a single
UTF-8 code unit (that is, provided it is a US-ASCII character). A
UTF-8 character literal containing multiple c-chars is ill-formed.
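A small illustration of what this gives you (requires C++17 or later, where u8 character literals exist; the assertion is exactly the guarantee quoted above):
#include <iostream>
int main()
{
    // u8'0' is guaranteed to have the ISO 10646 (i.e. ASCII) value 0x30,
    // regardless of the narrow execution character set.
    static_assert(u8'0' == 0x30, "u8 character literals have ASCII values");

    std::cout << std::hex << static_cast<int>(u8'0') << "\n"; // 30
}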
If you don't trust that your compiler will treat '0' as ASCII character 0x30, then you could use static_cast<char>(0x30) instead.
As you are aware, UTF-8-encoded characters may need several octets, and thus several chars, so the natural type for them is char[], which is indeed the type of a u8-prefixed string literal! So C11 is right on track here; it just sticks to its syntax convention of using " for a string, which must then be used as an array of char, rather than your implied semantics-based proposal of using ' instead.
About "0" versus u8"0", you are reading it right: only the latter is guaranteed to be identical to { 0x30, 0 }, even on EBCDIC systems. By the way, the very fact that the former is not can be handled conveniently in your code, if you pay attention to the __STDC_MB_MIGHT_NEQ_WC__ predefined identifier.
I've noticed the length method of std::string returns the length in bytes and the same method in std::u16string returns the number of 2-byte sequences.
I've also noticed that when a character or code point is outside of the BMP, length returns 4 rather than 2.
Furthermore, the Unicode escape sequence is limited to \unnnn, so any code point above U+FFFF cannot be inserted by an escape sequence.
In other words, there doesn't appear to be support for surrogate pairs or code points outside of the BMP.
Given this, is the accepted or recommended practice to use a non-standard string manipulation library that understands UTF-8, UTF-16, surrogate pairs, and so on?
Does my compiler have a bug, or am I using the standard string manipulation methods incorrectly?
Example:
/*
* Example with the Unicode code points U+0041, U+4061, U+10196 and U+10197
*/
#include <iostream>
#include <string>
int main(int argc, char* argv[])
{
    std::string example1 = u8"A䁡𐆖𐆗";
    std::u16string example2 = u"A䁡𐆖𐆗";
    std::cout << "Escape Example: " << "\u0041\u4061\u10196\u10197" << "\n";
    std::cout << "Example: " << example1 << "\n";
    std::cout << "std::string Example length: " << example1.length() << "\n";
    std::cout << "std::u16string Example length: " << example2.length() << "\n";
    return 0;
}
Here is the result I get when compiled with GCC 4.7:
Escape Example: A䁡မ6မ7
Example: A䁡𐆖𐆗
std::string Example length: 12
std::u16string Example length: 6
std::basic_string is code unit oriented, not character oriented. If you need to deal with code points you can convert to char32_t, but there's nothing in the standard for more advanced Unicode functionality yet.
Also you can use the \UNNNNNNNN escape sequence for non-BMP code points, in addition to typing them in directly (assuming you're using a source encoding that supports them).
Depending on your needs this may be all the Unicode support you need. A lot of software doesn't need to do more than basic manipulations of strings, such as those that can easily be done on code units directly. For slightly higher level needs you can convert code units to code points and work on those. For higher level needs, such as working on grapheme clusters, additional support will be needed.
I would say this means there's adequate support in the standard for representing Unicode data and performing basic manipulation. Whatever third party library is used for higher level functionality should build on the standard library. As time goes on the standard is likely to subsume more of that higher level functionality as well.
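To illustrate the code-unit/code-point distinction on the example string, here is a minimal sketch (it assumes well-formed UTF-16) that counts code points in a u16string by skipping low surrogates:
#include <cstddef>
#include <iostream>
#include <string>

// Every UTF-16 code unit that is not a low (trailing) surrogate starts a
// new code point, so counting those gives the code point count.
std::size_t codePoints(const std::u16string& s)
{
    std::size_t n = 0;
    for (char16_t u : s)
        if (u < 0xDC00 || u > 0xDFFF)
            ++n;
    return n;
}

int main()
{
    std::u16string example = u"A䁡𐆖𐆗";       // U+0041 U+4061 U+10196 U+10197
    std::cout << example.length() << "\n";    // 6  code units
    std::cout << codePoints(example) << "\n"; // 4  code points
}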
At the risk of judging prematurely, it seems to me that the language used in the standards is slightly ambiguous (although the final conclusion is clear, see at the end):
In the description of char16_t literals (i.e. the u"..." ones like in your example), the size of a literal is defined as:
The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u’\0’.
And the footnote further clarifies:
[ Note: The size of a char16_t string literal is the number of code units, not the number of characters. —end note ]
This implies a definition of character and code unit. A surrogate pair is one
character, but two code units.
However, the description of the length() method of std::basic_string (of which std::u16string is a specialization) claims:
Returns the number of characters in the string, i.e. std::distance(begin(), end()). It is the same as size().
As it appears, the description of length() uses the word character to mean what the definition of char16_t string literals calls a code unit.
However, the conclusion of all of this is: The length is defined as code units, hence your compiler complies with the standard, and there will be continued demand for special libraries to provide proper counting of characters.
I used the references below:
For the definition of the size of char16_t literals: Here
For the description of std::basic_string::length(): Here
Given this, is the accepted or recommended practice to use a non-standard string manipulation library that understands UTF-8, UTF-16, surrogate pairs, and so on?
It's hard to talk about recommended practice for a language standard that was created a few months ago and isn't fully implemented yet, but in general I would agree: the locale and Unicode features in C++11 are still hopelessly inadequate (although they obviously got a lot better), and for serious work, you should drop them and use ICU or Boost.Locale instead.
The addition of Unicode strings and conversion functions to C++11 are the first step towards real Unicode support; time will tell whether they turn out to be useful or whether they will be forgotten.