How do I correctly initialize a wide character string? - c++

I am trying to figure out wide characters in C. For example, I test a string that contains the single letter "Ē", which is encoded as 0xC4 0x92 in UTF-8.
char* T1 = "Ē";
//This is the resulting array { 0xc4, 0x92, 0x00 }
wchar_t* T2 = L"Ē";
//This is the resulting array { 0x00c4, 0x2019, 0x0000 }
I expected that the second array would be {0xc492, 0x0000}, instead it contains an extra character that just wastes space in my opinion. Can anyone help me understand what is going on with this?

What you've managed to do here is mojibake. Your source code is written in UTF-8 but it was interpreted in Windows codepage 1252 (i.e. the compiler source character set was CP1252).
The wide string contents are the Windows codepage 1252 characters of the UTF-8 bytes 0xC4 0x92 converted to UCS-2. The easiest way out is to just use an escape instead:
wchar_t* T2 = L"\x112";
or
wchar_t* T2 = L"\u0112";
The larger problem is that to my knowledge neither C nor C++ has a mechanism for specifying the source character set within the code itself, so it is always a setting or option external to the code, not something you can easily copy and paste along with it.

Your compiler is misinterpreting your source code file (which is saved as UTF-8) as Windows-1252 (commonly called ANSI). It does not interpret the byte sequence C4 92 as the one-character UTF-8 string "Ē", but as the two-character Windows-1252 string "Ä’". The Unicode codepoint of "Ä" is U+00C4, and the Unicode codepoint of "’" is U+2019. This is exactly what you see in your wide character string.
The 8-bit string only works, because the misinterpretation of the string does not matter, as it is not converted during compilation. The compiler reads the string as Windows-1252 and emits the string as Windows-1252 (so it does not need to convert anything, and considers both to be "Ä’"). You interpret the source code and the data in the binary as UTF-8, so you consider both to be "Ē".
To have the compiler treat your source code as UTF-8, use the switch /utf-8.
BTW: The correct UTF-16 encoding (which is the encoding MSVC uses for wide character strings) to be observed in a wide-character string is not {0xc492, 0x0000}, but {0x0112, 0x0000}, because "Ē" is U+0112.
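A minimal sketch to verify the code units, spelling the character with a universal character name so it does not depend on how the source encoding is detected (wchar_t is UTF-16 on MSVC and UTF-32 on most other platforms, but Ē fits in a single code unit either way):
#include <cstdio>
#include <cwchar>

int main() {
    const wchar_t* t2 = L"\u0112";   // "Ē" as a universal character name
    // Print every code unit, including the terminator.
    for (std::size_t i = 0; i <= std::wcslen(t2); ++i)
        std::printf("0x%04x ", (unsigned)t2[i]);
    std::printf("\n");               // expected: 0x0112 0x0000
}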

Related

byte representation of ASCII symbols in std::wstring with different locales

Windows C++ app. We have a string that contains only ASCII symbols: std::wstring(L"abcdeABCDE ... any other ASCII symbol"). Note that this is std::wstring, which uses wchar_t.
Question: does the byte representation of this string depend on the localization settings, or anything else? Can I assume that if I receive such a string (for example, from the Windows API) while the app is running, its bytes will be the same as on my PC?
In general, for characters (not escape sequences) wchar_t and wstring have to use the same codes as ASCII (just extended to 2 bytes).
But I am not sure about codes less than 32, and of course codes greater than 128 can have a different meaning than in ASCII at the moment of output, so to avoid problems on output, set a particular locale explicitly, e.g.:
locale("en_US.UTF-8")
for standard output
wcout.imbue(locale("en_US.UTF-8"));
UPDATE:
I found one more suggestion: add
std::ios_base::sync_with_stdio(false);
before setting the locale with imbue;
see details on How can I use std::imbue to set the locale for std::wcout?
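Putting the two suggestions together, a minimal sketch (the locale name "en_US.UTF-8" is an assumption; constructing std::locale throws if it is not available on the system):
#include <iostream>
#include <locale>

int main() {
    // Turn off C-stdio synchronization before imbuing, as the linked answer suggests.
    std::ios_base::sync_with_stdio(false);
    std::wcout.imbue(std::locale("en_US.UTF-8"));
    std::wcout << L"abcdeABCDE \u0112" << L"\n";
}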
The byte representation of the literal string does not depend on the environment. It is hard-coded as the binary data saved by your editor. However, the way that binary data is interpreted depends on the current code page, so you can end up with different results when it is converted at runtime to a wide string (as opposed to defining the string with a leading L, in which case the wide characters are set at compile time).
To be safe, use setlocale() to guarantee the encoding used for conversion. Then you don't have to worry about the environment.
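A minimal sketch of that idea, converting at runtime under an explicitly chosen locale (the locale name "en_US.UTF-8" is an assumption and has to be available on the target system):
#include <clocale>
#include <cstddef>
#include <cstdlib>

int main() {
    // Fix the conversion encoding instead of inheriting it from the environment.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const char* narrow = "abcdeABCDE \xC4\x92";   // UTF-8 bytes for "abcdeABCDE Ē"
    wchar_t wide[64];
    std::size_t n = std::mbstowcs(wide, narrow, 64);
    if (n == (std::size_t)-1) {
        // invalid byte sequence for the chosen locale
    }
}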
This might help: "By definition, the ASCII character set is a subset of all multibyte-character sets. In many multibyte character sets, each character in the range 0x00 – 0x7F is identical to the character that has the same value in the ASCII character set. For example, in both ASCII and MBCS character strings, the 1-byte NULL character ('\0') has value 0x00 and indicates the terminating null character."
From:
Visual Studio Character Sets 'Not set' vs 'Multi byte character set'

How does string work with non-ASCII symbols while char does not?

I understand that char in C++ is just an integer type that stores ASCII symbols as numbers ranging from 0 to 127. The Scandinavian letters 'æ', 'ø', and 'å' are not among the 128 symbols in the ASCII table.
So naturally when I try char ch1 = 'ø' I get a compiler error, however string str = "øæå" works fine, even though a string makes use of chars right?
Does string somehow switch over to Unicode?
In C++ there is the source character set and the execution character set. The source character set is what you can use in your source code; but this doesn't have to coincide with which characters are available at runtime.
It's implementation-defined what happens if you use characters in your source code that aren't in the source character set. Apparently 'ø' is not in your compiler's source character set, otherwise you wouldn't have gotten an error; this means that your compiler's documentation should include an explanation of what it does for both of these code samples. Probably you will find that str does have some sort of sequence of bytes in it that form a string.
To avoid this you could use character literals instead of embedding characters in your source code, in this case '\xF8'. If you need to use characters that aren't in the execution character set either, you can use wchar_t and wstring.
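A short sketch of both options; the '\xF8' value assumes the execution character set is Latin-1 or Windows-1252, and the wide string uses universal character names so it does not depend on the source file's encoding:
#include <string>

int main() {
    char ch1 = '\xF8';                        // 'ø' in a Latin-1/CP1252 execution charset
    std::wstring str = L"\u00F8\u00E6\u00E5"; // "øæå" as universal character names
    (void)ch1;
    (void)str;
}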
Compiling the source code char c = 'ø'; produces:
source_file.cpp:2:12: error: character too large for enclosing character literal type
char c = '<U+00F8>';
^
What's happening here is that the compiler is converting the character from the source code encoding and determining that there's no representation of that character using the execution encoding that fits inside a single char. (Note that this error has nothing to do with the initialization of c; it would happen with any such character literal.)
When you put such characters into a string literal rather than a character literal, however, the compiler's conversion from the source encoding to the execution encoding is perfectly happy to use multi-byte representations of the characters when the execution encoding is multi-byte, such as UTF-8 is.
To better understand what compilers do in this area you should start by reading clauses 2.3 [lex.charsets], 2.14.3 [lex.ccon], and 2.14.5 [lex.string] in the C++ standard.
What's likely happening here is that your source file is encoded as UTF-8 or some other multi-byte character encoding, and the compiler is simply treating it as a sequence of bytes. A single char can only be a single byte, but a string is perfectly happy to be as many bytes as are required.
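A minimal sketch of what ends up in the string, with the UTF-8 bytes written out explicitly so it does not depend on any source or execution character set:
#include <cstdio>
#include <string>

int main() {
    std::string str = "\xC3\xB8\xC3\xA6\xC3\xA5";   // "øæå" as UTF-8: two bytes per letter
    std::printf("size() = %zu\n", str.size());      // 6 bytes for 3 letters
    for (unsigned char c : str)
        std::printf("%02X ", c);                    // C3 B8 C3 A6 C3 A5
    std::printf("\n");
}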
ASCII defines only 128 characters.
'ø' is character 248 (0xF8) in 8-bit "extended ASCII" character sets such as Latin-1, which include the 7-bit ASCII range as a subset.
You can try char ch1 = '\xF8';

UTF-8 Compatibility in C++

I am writing a program that needs to be able to work with text in all languages. My understanding is that UTF-8 will do the job, but I am experiencing a few problems with it.
Am I right to say that UTF-8 can be stored in a simple char in C++? If so, why do I get the following warning when I use a program with char, string and stringstream: warning C4566: character represented by universal-character-name '\uFFFD' cannot be represented in the current code page (1252). (I do not get that error when I use wchar_t, wstring and wstringstream.)
Additionally, I know that UTF is variable length. When I use the at or substr string methods would I get the wrong answer?
To use UTF-8 string literals you need to prefix them with u8, otherwise you get the implementation's character set (in your case, it seems to be Windows-1252): u8"\uFFFD" is a null-terminated sequence of bytes with the UTF-8 representation of the replacement character (U+FFFD). It has type char const[4].
Since UTF-8 has variable length, all kinds of indexing will index in code units, not codepoints. It is not possible to do random access on codepoints in a UTF-8 sequence because of its variable-length nature. If you want random access you need to use a fixed-length encoding, like UTF-32. For that you can use the U prefix on strings.
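A small sketch of both prefixes, assuming C++17 or earlier (in C++20 the element type of a u8 literal changes from char to char8_t):
#include <cstdio>

int main() {
    const char     s8[]  = u8"\u0112";   // UTF-8: 0xC4 0x92 plus the terminator
    const char32_t s32[] = U"\u0112";    // UTF-32: one element per code point
    std::printf("u8 literal: %zu bytes\n", sizeof s8 - 1);                        // 2
    std::printf("U  literal: %zu code points\n", sizeof s32 / sizeof s32[0] - 1); // 1
}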
Yes, the UTF-8 encoding can be used with char, string, and stringstream. A char will hold a single UTF-8 code unit, of which up to four may be required to represent a single Unicode code point.
However, there are a few issues using UTF-8 specifically with Microsoft's compilers. C++ implementations use an 'execution character set' for a number of things, such as encoding character and string literals. VC++ always uses the system locale encoding as the execution character set, and Windows does not support UTF-8 as the system locale encoding, therefore UTF-8 can never be the execution character set.
This means that VC++ never intentionally produces UTF-8 character and string literals. Instead the compiler must be tricked.
The compiler will convert from the known source code encoding to the execution encoding. That means that if the compiler uses the locale encoding for both the source and execution encodings then no conversion is done. If you can get UTF-8 data into the source code but have the compiler think that the source uses the locale encoding, then character and string literals will use the UTF-8 encoding. VC++ uses the so-called 'BOM' to detect the source encoding, and uses the locale encoding if no BOM is detected. Therefore you can get UTF-8 encoded string literals by saving all your source files as "UTF-8 without signature".
There are caveats with this method. First, you cannot use UCNs with narrow character and string literals. Universal Character Names have to be converted to the execution character set, which isn't UTF-8. You must either write the character literally so it appears as UTF-8 in the source code, or you can use hex escapes where you manually write out a UTF-8 encoding. Second, in order to produce wide character and string literals the compiler performs a similar conversion from the source encoding to the wide execution character set (which is always UTF-16 in VC++). Since we're lying to the compiler about the encoding, it will perform this conversion to UTF-16 incorrectly. So in wide character and string literals you cannot use non-ASCII characters literally, and instead you must use UCNs or hex escapes.
UTF-8 is variable length (as is UTF-16). The indices used with at() and substr() are code units rather than character or code point indices. So if you want a particular code unit then you can just index into the string or array or whatever as normal. If you need a particular code point then you either need a library that can understand composing UTF-8 code units into code points (such as the Boost Unicode iterators library), or you need to convert the UTF-8 data into UTF-32. If you need actual user perceived characters then you need a library that understands how code points are composed into characters. I imagine ICU has such functionality, or you could implement the Default Grapheme Cluster Boundary Specification from the Unicode standard.
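A minimal sketch of code-unit indexing versus a naive code-point count (the UTF-8 bytes are written out explicitly; the count just skips continuation bytes and knows nothing about grapheme clusters):
#include <cstdio>
#include <string>

int main() {
    std::string s = "h\xC3\xA9llo";                    // "héllo": 'é' is the two code units 0xC3 0xA9

    std::printf("size() = %zu\n", s.size());           // 6 code units, not 5 characters
    std::printf("s.at(1) = 0x%02X\n",
                (unsigned char)s.at(1));               // 0xC3: half of 'é'

    std::size_t codepoints = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)                        // don't count UTF-8 continuation bytes
            ++codepoints;
    std::printf("code points = %zu\n", codepoints);    // 5
}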
The above consideration of UTF-8 only really matters for how you write Unicode data in the source code. It has little bearing on the program's input and output.
If your requirements allow you to choose how to do input and output then I would still recommend using UTF-8 for input. Depending on what you need to do with the input you can either convert it to another encoding that's easy for you to process, or you can write your processing routines to work directly on UTF-8.
If you want to ever output anything via the Windows console then you'll want a well defined module for output that can have different implementations, because internationalized output to the Windows console will require a different implementation from either outputting to a file on Windows or console and file output on other platforms. (On other platforms the console is just another file, but the Windows console needs special treatment.)
The reason you get the warning about \uFFFD is that you're trying to fit a character that code page 1252 cannot represent into a narrow (char) string; as you noted, UTF-8 works on chars and is variable length, but here the compiler converts literals to the current code page rather than to UTF-8.
If you use at or substr, you will possibly get wrong answers, since these methods assume that one byte is one character. This is not the case with UTF-8. Notably, with at you could end up with a single byte out of a multi-byte sequence; with substr you could split a sequence and end up with an invalid UTF-8 string (when displayed it would start or end with �, \uFFFD, the same replacement character you're apparently trying to use, and the broken character would be lost).
I would recommend that you use wchar_t to store Unicode strings. Since the type is at least 16 bits, many more characters can fit in a single "unit".

How to UTF-8 encode a character/string

I am using a Twitter API library to post a status to Twitter. Twitter requires that the post be UTF-8 encoded. The library contains a function that URL encodes a standard string, which works perfectly for all special characters such as !@#$%^&*() but is the incorrect encoding for accented characters (and other UTF-8).
For example, 'é' gets converted to '%E9' rather than '%C3%A9' (it pretty much only converts to a hexadecimal value). Is there a built-in function that could take something like 'é' and return something like '%C3%A9'?
edit: I am fairly new to UTF-8 in case what I am requesting makes no sense.
edit: if I have a
string foo = "bar é";
I would like to convert it to
"bar %C3%A9"
Thanks
If you have a wide character string, you can encode it in UTF8 with the standard wcstombs() function. If you have it in some other encoding (e.g. Latin-1) you will have to decode it to a wide string first.
Edit: ... but wcstombs() depends on your locale settings, and it looks like you can't select a UTF8 locale on Windows. (You don't say what OS you're using.) WideCharToMultiByte() might be more useful on Windows, as you can specify the encoding in the call.
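On Windows, a minimal sketch of such a conversion (to_utf8 is just an illustrative helper name; with CP_UTF8 the last two arguments must be null):
#include <windows.h>
#include <string>

// Convert a UTF-16 std::wstring to a UTF-8 std::string via WideCharToMultiByte.
std::string to_utf8(const std::wstring& w) {
    if (w.empty()) return std::string();
    // First call asks for the required buffer size, second call does the conversion.
    int len = WideCharToMultiByte(CP_UTF8, 0, w.data(), (int)w.size(),
                                  nullptr, 0, nullptr, nullptr);
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.data(), (int)w.size(),
                        &out[0], len, nullptr, nullptr);
    return out;
}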
To understand what needs to be done, you have to first understand a bit of background. Different encodings use different values for the "same" character. Latin-1, for example, says "é" is a single byte with value E9 (hex), while UTF-8 says "é" is the two byte sequence C3 A9, and yet UTF-16 says that same character is the single double-byte value 00E9 – a single 16-bit value rather than two 8-bit values as in UTF-8. (Unicode, which isn't an encoding, actually uses the same codepoint value, U+E9, as Latin-1.)
To convert from one encoding to another, you must first take the encoded value, decode it to a value independent of the source encoding (i.e. Unicode codepoint), then re-encode it in the target encoding. If the target encoding doesn't support all of the source encoding's codepoints, then you'll either need to translate or otherwise handle this condition.
This re-encoding step requires knowing both the source and target encodings.
Your API function is not converting encodings; it appears to be URL-escaping an arbitrary byte string. The authors of the function apparently assume you will have already converted to UTF-8.
In order to convert to UTF-8, you must know what encoding your system is using and be able to map to Unicode codepoints. From there, the UTF-8 encoding is trivial.
Depending on your system, this may be as easy as converting the "native" character set (which has "é" as E9 for you, so probably Windows-1252, Latin-1, or something very similar) to wide characters (which is probably UTF-16 or UCS-2 if sizeof(wchar_t) is 2, or UTF-32 if sizeof(wchar_t) is 4) and then to UTF-8. Wcstombs, as Martin answers, may be able to handle the second part of this conversion, but this is system-dependent. However, I believe Latin-1 is a subset of Unicode, so conversion from this source encoding can skip the wide character step. Windows-1252 is close to Latin-1, but replaces some control characters with printable characters.
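A minimal sketch of the whole pipeline, assuming the input really is Latin-1 (Windows-1252 would additionally need a small table for the bytes 0x80–0x9F, which map to different code points); latin1_to_utf8 and url_encode are just illustrative helper names:
#include <cctype>
#include <cstdio>
#include <string>

std::string latin1_to_utf8(const std::string& in) {
    // Latin-1 maps 1:1 onto the first 256 Unicode code points, so no tables
    // are needed: bytes below 0x80 pass through, the rest become two bytes.
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += (char)c;
        } else {
            out += (char)(0xC0 | (c >> 6));
            out += (char)(0x80 | (c & 0x3F));
        }
    }
    return out;
}

std::string url_encode(const std::string& in) {
    // Percent-encode everything except unreserved URL characters.
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += (char)c;
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

int main() {
    std::string foo = "bar \xE9";                         // "bar é" in Latin-1
    std::puts(url_encode(latin1_to_utf8(foo)).c_str());  // prints bar%20%C3%A9
}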

c++: getting ascii value of a wide char

Let's say I have a char array like "äa".
Is there a way to get the ASCII value (e.g. 228) of the first char, which is a multibyte character?
Even if I cast my array to a wchar_t* array, I'm not able to get the ASCII value of "ä", because it's 2 bytes long.
Is there a way to do this? I've been trying for 2 days now :(
I'm using gcc.
Thanks!
You're contradicting yourself. International characters like ä are (by definition) not in the ASCII character set, so they don't have an "ascii value".
It depends on the exact encoding of your two-character array, if you can get the code point for a single character or not, and if so which format it will be in.
You are very confused. ASCII only has values smaller than 128. Value 228 corresponds to ä in 8 bit character sets ISO-8859-1, CP1252 and some others. It also is the UCS value of ä in the Unicode system. If you use string literal "ä" and get a string of two characters, the string is in fact encoded in UTF-8 and you may wish to parse the UTF-8 coding to acquire Unicode UCS values.
More likely what you really want to do is convert from one character set to another. How to do this heavily depends on your operating system, so more information is required. You also need to specify what exactly you want: a std::string or char* in ISO-8859-1, perhaps?
There is a standard C++ template function to do that conversion, ctype::narrow(). It is part of the localization library. It will convert the wide character to the equivalent char value for your current locale, if possible. As the other answers have pointed out, there isn't always a mapping, which is why ctype::narrow() takes a default character that it will return if there is no mapping.
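A minimal sketch of that facet in use (the locale name "en_US.ISO-8859-1" is an assumption and has to be installed on the system; if there is no narrow mapping, the default '?' comes back):
#include <iostream>
#include <locale>

int main() {
    std::locale loc("en_US.ISO-8859-1");   // any locale whose narrow charset contains 'ä'
    wchar_t wc = L'\u00E4';                // 'ä'
    char n = std::use_facet<std::ctype<wchar_t>>(loc).narrow(wc, '?');
    std::cout << (int)(unsigned char)n << "\n";   // 228 if the mapping exists, 63 ('?') otherwise
}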
Depends on the encoding used in your char array.
If your char array is Latin 1 encoded, then it is 2 bytes long (plus maybe a NUL terminator, we don't care), and those 2 bytes are:
0xE4 (lower-case a umlaut)
0x61 (lower-case a).
Note that Latin 1 is not ASCII, and 0xE4 is not an ASCII value, it's a Latin 1 (or Unicode) value.
You would get the value like this:
int i = (unsigned char) my_array[0];
If your char array is UTF-8 encoded, then it is three bytes long, and those bytes are:
binary 11000011 (first byte of UTF-8 encoded 0xE4)
binary 10100100 (second byte of UTF-8 encoded 0xE4)
0x61 (lower-case a)
To recover the Unicode value of a character encoded with UTF-8, you either need to implement it yourself based on http://en.wikipedia.org/wiki/UTF-8#Description (usually a bad idea in production code), or else you need to use a platform-specific unicode-to-wchar_t conversion routine. On linux this is mbstowcs or iconv, although for a single character you can use mbtowc provided that the multi-byte encoding defined for the current locale is in fact UTF-8:
wchar_t i;
// mbtowc() interprets the bytes according to the current locale, so a UTF-8
// locale must be in effect (e.g. set earlier with setlocale(LC_ALL, "")).
if (mbtowc(&i, my_array, 3) == -1) {
    // handle error
}
If it's SHIFT-JIS then this doesn't work...
What you want is called transliteration: converting letters of one language to another. It has nothing to do with Unicode and wchars; you need to have a mapping table.