C++ std::printf formatting breaks with the Cyrillic alphabet

I've noticed somewhat unexpected behavior when using std::printf() with field width specifiers like %10ls in conjunction with wchar_t (for Cyrillic text).
The code example I use:
void printHeader() {
    printDelim();
    std::printf("\n|%15ls|%15ls|%15ls|%15ls|%15ls|", L"Имя", L"Континент", L"Длина", L"Глубина", L"Приток");
}
A simple function that prints a delimiter (a bunch of "-") and should print a formatted line of titles (in Russian) separated by "|", so that each field is 15 characters wide and looks pretty.
Actual output: | Имя|Континент| Длина|Глубина| Приток |
Notice:
The locale is set like this: setlocale(LC_ALL, ""), and Russian is present there.
If the parameters passed to printf() are in English, it works fine.
Just in case, here is the output of setlocale():
Locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=ru_RU.UTF-8;LC_TIME=ru_RU.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=ru_RU.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=ru_RU.UTF-8;LC_NAME=ru_RU.UTF-8;LC_ADDRESS=ru_RU.UTF-8;LC_TELEPHONE=ru_RU.UTF-8;LC_MEASUREMENT=ru_RU.UTF-8;LC_IDENTIFICATION=ru_RU.UTF-8
I also tried it with std::wprintf(), but that does not print anything at all.
std::printf() with %15s and the same strings without the L prefix pads in the same "broken length" manner, and the Cyrillic strings themselves print correctly.
I'm extremely curious why this happens with wchar_t.
P.S. - I'm aware that this code is almost literally C in C++, which is bad practice. Unfortunately, it is required in this case.

Let's look at cppreference.com's description of the %ls format specifier, because it explains one part of what's happening here in a very clear way:
If the l specifier is used, the argument must be a pointer to the initial element of an array of wchar_t, which is converted to char array as if by a call to wcrtomb with zero-initialized conversion state.
The key take-away is that %ls converts the wchar_t string to plain, narrow characters, as the first order of business. Basically, std::printf works with non-wide "characters", and allegedly-wide character strings get converted to non-wide "character" strings, before anything else happens.
Now that the input's domain consists of non-wide characters, we can make further progress:
Referencing "characters" in the context of the width specifier: it's really specifying the number of bytes. 15 bytes. That's what it really means:
Имя
This is not three "characters" as far as printf is concerned. This is a six-character sequence; here they are: d0 98 d0 bc d1 8f.
Just in case - output of the setlocale():
Locale: LC_CTYPE=en_US.UTF-8 ...
Your system uses UTF-8 encoding, which uses more than one byte to encode non-Latin characters.
printf is a little bit dumb. It doesn't know anything about your locale or your encoding. Every reference to character counts and field widths in printf's documentation really means bytes. %15s, or %15ls, really means not 15 characters but 15 bytes to format here. So it counts off 15 bytes and spits them out. But, when interpreted as UTF-8 characters, those bytes don't really take up 15 character positions on the screen.
Before Unicode, before the modern world with many alphabets and funny-looking characters, there was only the Latin alphabet; characters and bytes were pretty much the same thing, and printf's documentation harkens back to that era. This is not true any more, but printf is still living in the past.
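If all you need is for the columns to line up, one workaround (a rough sketch of my own, not part of the answer above) is to widen each field by the number of extra bytes the UTF-8 text uses. The printPadded helper below is hypothetical; it assumes a UTF-8 locale and that every code point occupies a single screen column (true for Cyrillic, not for every script):

#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: pad a UTF-8 string to `width` columns by widening the
// printf field with the extra bytes the multibyte characters occupy.
void printPadded(const char* s, int width) {
    std::size_t bytes = std::strlen(s);
    std::size_t chars = std::mbstowcs(nullptr, s, 0);   // code points under a UTF-8 locale
    if (chars == static_cast<std::size_t>(-1))          // invalid sequence: fall back to bytes
        chars = bytes;
    std::printf("%*s", width + static_cast<int>(bytes - chars), s);
}

int main() {
    std::setlocale(LC_ALL, "");   // needed so mbstowcs understands the UTF-8 encoding
    std::printf("|");
    printPadded("Имя", 15);       // assumes the source file itself is UTF-8 encoded
    std::printf("|");
    printPadded("Континент", 15);
    std::printf("|\n");
}

With this, the Cyrillic headers end up the same visual width as the English ones, because the field width passed to printf now accounts for the multibyte overhead.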

Related

Get decimal value of Unicode Character C++

How do I get the decimal value of a Unicode character such as "Ồ"?
#include <iostream>
#include <string>
int main() {
    std::string a = "Ồ";
    unsigned char c = a[0];
    long val = long(c);
    std::cout << val << std::endl;
}
OUTPUT
7,891;
Your question may look pretty straightforward, but as we delve into it we'll find it isn't as simple as it might first appear.
The first problem is that std::string is defined as std::basic_string<char>, which isn't really compatible with "Ồ". Thus the results you get from your code will probably depend on the compiler you use and/or the environment and OS you are running on. For example, my copy of Visual Studio treats "Ồ" as an invalid ASCII character and puts "?" (or 0x3F) in a[0].
The second problem is that the character "Ồ" is more than eight bits wide, so it may not fit into the variable c. Whatever the compiler puts into a[0], the variable c will only hold CHAR_BIT (typically eight) bits of that value. Again, the results you get are likely to change depending on the compiler you use and/or the environment you run in.
Leaving that aside, let's start by assuming the character "Ồ" is LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND GRAVE (0x1ED2). With that assumption, one might imagine that the answer we are seeking is 0x1ED2, right? But not necessarily.
There are several ways to encode a Unicode character. The UTF-32 encoding is 0x1ED2 (or 0x00001ED2 if we include all the leading zeros to get thirty-two bits). The UTF-8 encoding is 0xE1BB92.
So the decimal value of "Ồ" is 7,890 if it is encoded in UTF-32, or 14,793,618 if it is encoded in UTF-8 (I'm ignoring the effects of endianness to keep things simple).
The Unicode site has a FAQ on encodings and Wikipedia has a page too.
As you can see, the answer to your question (to some extent) depends on the encoding you want to use. One C++ way to deal with encodings is std::codecvt. Another solution is to just treat your string as a sequence of bytes - which your code attempts to do - but that rather depends on you knowing how your system encodes strings, what endianness you are dealing with, etc. And the code won't necessarily be portable.
Another wrinkle to consider is that, in the general case, "Ồ" might not be one character. Obviously it is one character in your code. But if you read a string in from a disk file, say, and that file produces "Ồ" when printed or displayed, we can't assume the file contains a single "Ồ" character.
Unicode defines COMBINING CIRCUMFLEX ACCENT (0x0302) and COMBINING GRAVE ACCENT (0x0300) as separate characters which can be combined with other characters. And it defines precomposed intermediate characters like LATIN CAPITAL LETTER O WITH CIRCUMFLEX, so there are actually several ways you can create a string in memory (or in a disk file) that would give you the same effect as the character "Ồ".
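To make the two interpretations concrete, here is a small sketch of my own (it assumes the source file and the execution character set are UTF-8, so "Ồ" occupies the three bytes 0xE1 0xBB 0x92):

#include <iostream>
#include <string>

int main() {
    // Assumes a UTF-8 source and execution character set.
    std::string a = "Ồ";

    // UTF-8 value, read byte by byte: 0xE1BB92 == 14,793,618
    unsigned long utf8 = 0;
    for (unsigned char byte : a)
        utf8 = (utf8 << 8) | byte;
    std::cout << "UTF-8 bytes as a number: " << utf8 << '\n';

    // UTF-32 value: the code point itself, U+1ED2 == 7,890
    char32_t cp = U'Ồ';
    std::cout << "UTF-32 code point: " << static_cast<unsigned long>(cp) << '\n';
}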

Get number of characters in string?

I have an application accepting a UTF-8 string of at most 255 characters.
If the characters are all ASCII, the number of characters equals the size in bytes.
If the characters are not all ASCII and contain Japanese letters, for example, then given the size in bytes, how can I get the number of characters in the string?
Input: char *data, int bytes_no
Output: int char_no
You can use mblen to count the length, or use mbstowcs.
source:
http://www.cplusplus.com/reference/cstdlib/mblen/
http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
The number of characters can be counted in C in a portable way using mbstowcs(NULL,s,0). This works for UTF-8 like for any other supported encoding, as long as the appropriate locale has been selected. A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80 – 0xBF, because these are just continuation bytes and not characters of their own. However, the need to count characters arises surprisingly rarely in applications.
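As an illustration of the "hard-wired" technique from the quote, here is a minimal sketch using the question's parameter names (utf8_char_count is a name I made up); it assumes the buffer already holds valid UTF-8:

// Count code points in a UTF-8 buffer by skipping continuation bytes
// (0x80 – 0xBF). Invalid input will give a meaningless count.
int utf8_char_count(const char* data, int bytes_no) {
    int char_no = 0;
    for (int i = 0; i < bytes_no; ++i) {
        unsigned char b = static_cast<unsigned char>(data[i]);
        if (b < 0x80 || b > 0xBF)   // not a continuation byte
            ++char_no;
    }
    return char_no;
}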
You can store a Unicode character in a wide character, wchar_t.
There's no such thing as "character".
Or, more precisely, what "character" is depends on whom you ask.
If you look in the Unicode glossary you will find that the term has several not fully compatible meanings. As the smallest component of written language that has semantic value (the first meaning), á is a single character. If you take á and count the basic units of encoding for the Unicode character encoding (the third meaning) in it, you may get either one or two, depending on which exact representation (normalized or denormalized) is being used.
Or maybe not. This is a very complicated subject and nobody really knows what they are talking about.
Coming down to earth, you probably need to count code points, which is essentially the same as characters (meaning 3). mblen is one method of doing that, provided your current locale uses the UTF-8 encoding. Modern C++ offers more C++-ish methods; however, they are not supported by some popular implementations. Boost has something of its own and is more portable. Then there are specialized libraries like ICU, which you may want to consider if your needs are much more complicated than counting characters.
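Here is a sketch of the mblen approach mentioned above (mb_char_count is a hypothetical helper; the example assumes your environment's locale is UTF-8 and the source file is saved as UTF-8):

#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Count multibyte characters with mblen(). Requires a matching (UTF-8)
// locale to be active; returns -1 on an invalid or truncated sequence.
int mb_char_count(const char* s) {
    std::mblen(nullptr, 0);             // reset the internal conversion state
    int count = 0;
    std::size_t remaining = std::strlen(s);
    while (remaining > 0) {
        int len = std::mblen(s, remaining);
        if (len <= 0) return -1;        // invalid or truncated sequence
        s += len;
        remaining -= len;
        ++count;
    }
    return count;
}

int main() {
    std::setlocale(LC_ALL, "");                       // pick up the user's (UTF-8) locale
    std::printf("%d\n", mb_char_count("日本語"));      // 3 under a UTF-8 locale
}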

Determine if a byte array contains an ANSI or Unicode string?

Say I have a function that receives a byte array:
void fcn(byte* data)
{
...
}
Does anyone know a reliable way for fcn() to determine if data is an ANSI string or a Unicode string?
Note that I'm intentionally NOT passing a length arg; all I receive is the pointer to the array. A length arg would be a great help, but I don't receive it, so I must do without.
This article mentions an OLE API that apparently does it, but of course they don't tell you WHICH API function: http://support.microsoft.com/kb/138142
First, a word on terminology. There is no such thing as an ANSI string; there are ASCII strings, and ASCII is a character encoding. ASCII was developed by ANSI, but they're not interchangeable.
Also, there is no such thing as a Unicode string. There are Unicode encodings, but those are only a part of Unicode itself.
I will assume that by "Unicode string" you mean "UTF-8 encoded codepoint sequence." And by ANSI string, I'll assume you mean ASCII.
If so, then every ASCII string is also a UTF-8 string, by the definition of UTF-8's encoding. ASCII only defines characters up to 0x7F, and all UTF-8 code units (bytes) up to 0x7F mean the same thing as they do under ASCII.
Therefore, your concern would be for the other 128 possible values. That is... complicated.
The only reason you would ask this question is if you have no control over the encoding of the string input. And therefore, the problem is that ASCII and UTF-8 are not the only possible choices.
There's Latin-1, for example. There are many strings out there that are encoded in Latin-1, which takes the other 128 bytes that ASCII doesn't use and defines characters for them. That's bad, because those other 128 bytes will conflict with UTF-8's encoding.
There are also code pages. Many strings were encoded against a particular code page; this is particularly so on Windows. Decoding them requires knowing what codepage you're working on.
If you are in a situation where you are certain that a string is either ASCII (7-bit, with the high bit always 0) or UTF-8, then you can make the determination easily. Either the string is ASCII (and therefore also UTF-8), or one or more of the bytes will have the high bit set to 1. In which case, you must use UTF-8 decoding logic.
Unless you are truly certain that these are the only possibilities, you are going to need to do a bit more. You can validate the data by trying to run it through a UTF-8 decoder. If it runs into an invalid code unit sequence, then you know it isn't UTF-8. The problem is that it is theoretically possible to create a Latin-1 string that is technically valid UTF-8. You're kinda screwed at that point. The same goes for code page-based strings.
Ultimately, if you don't know what encoding the string is, there's no guarantee you can display it properly. That's why it's important to know where your strings come from and what they mean.
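For the ASCII-versus-UTF-8 case described above, a rough sketch of such a check might look like this (classify is a made-up name, and this version only checks the structural byte patterns; it does not reject overlong encodings or surrogate code points):

#include <cstdio>

// A NUL-terminated buffer is plain ASCII if every byte is <= 0x7F; otherwise
// we check whether the bytes form structurally well-formed UTF-8. Anything
// else is some single-byte legacy encoding we cannot identify from the bytes.
const char* classify(const unsigned char* p) {
    bool ascii = true;
    while (*p) {
        if (*p < 0x80) { ++p; continue; }
        ascii = false;
        int extra;                                  // continuation bytes expected
        if      ((*p & 0xE0) == 0xC0) extra = 1;
        else if ((*p & 0xF0) == 0xE0) extra = 2;
        else if ((*p & 0xF8) == 0xF0) extra = 3;
        else return "not UTF-8";                    // invalid lead byte
        ++p;
        for (int i = 0; i < extra; ++i, ++p)
            if ((*p & 0xC0) != 0x80)                // must be 10xxxxxx
                return "not UTF-8";
    }
    return ascii ? "ASCII (also valid UTF-8)" : "valid UTF-8";
}

int main() {
    std::printf("%s\n", classify((const unsigned char*)"hello"));          // ASCII
    std::printf("%s\n", classify((const unsigned char*)"h\xC3\xA9llo"));   // UTF-8 "héllo"
    std::printf("%s\n", classify((const unsigned char*)"h\xE9llo"));       // Latin-1 "héllo"
}

As the answer warns, the Latin-1 example here merely happens to be detectable; other Latin-1 byte sequences can pass the UTF-8 check by coincidence.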

Can anyone tell me what this means

This is a very basic question, but please clarify it for me.
#define TLVTAG_APPLICATIONMESSAGE_V "\xDF01"
printf("%s\n", TLVTAG_APPLICATIONMESSAGE_V);
What will be printed?
To go step by step (using the C++ standard, 2.13.2 and 2.13.4 as references):
The #define means that you substitute the second thing wherever the first appears, so the printf is processed as printf("%s\n", "\xDF01");.
The "\xDF01" is a string of one character (plus the zero-byte terminator), and the \x means to take the next characters as a hex value, so it attempts to treat DF01 as a number in hex, and fit it into a char.
Since a standard quoted string contains chars, not wchar_ts, and you're almost certainly working with an 8-bit char, the result is implementation-defined, and without the documentation for your implementation it's really impossible to speculate further.
Now, if the string were L"\xDF01", its elements would be wchar_ts, which are wide characters, normally 16 or 32 bits, and the DF01 value would become one (presumably Unicode) character value. The print statement would then print the bytes \xDF and \x01, not necessarily in that order, since printf prints chars, not wchar_ts. wprintf would print out the whole wchar_t.
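A small sketch of my own to illustrate the wide-string case (the exact behaviour depends on your compiler, since the size of wchar_t varies):

#include <cstdio>
#include <cwchar>

int main() {
    // L"\xDF01" stores a single wchar_t with the value 0xDF01 (plus the
    // terminating L'\0'); the narrow "\xDF01" is typically rejected or
    // truncated because 0xDF01 does not fit in an 8-bit char.
    const wchar_t* w = L"\xDF01";
    std::printf("value: %#x, length: %zu\n",
                static_cast<unsigned>(w[0]), std::wcslen(w));   // value: 0xdf01, length: 1
}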
It seems somebody is trying to print a Unicode character -> �

How does the UTF-8 support of TinyXML work?

I'm using TinyXML to parse/build XML files. Now, according to the documentation this library supports multibyte character sets through UTF-8. So far so good, I think. But the only API that the library provides (for getting/setting element names, attribute names and values, ... everything where a string is used) is through std::string or const char*. This has me doubting my own understanding of multibyte character set support. How can a string that only supports 8-bit characters contain a 16-bit character (unless it uses a code page, which would negate the 'supports Unicode' claim)? I understand that you could theoretically take a 16-bit code point and split it over two chars in a std::string, but that wouldn't make the std::string a 'Unicode' string; it would make it invalid for most purposes, and it might only accidentally work when written to a file and read in by another program.
So, can somebody explain to me how a library can offer an '8-bit interface' (std::string or const char*) and still support 'Unicode' strings?
(I probably mixed up some Unicode terminology here; sorry about any confusion coming from that).
First, UTF-8 is stored in const char* strings, as #quinmars said. And it's not only a superset of 7-bit ASCII (code points <= 127 are always encoded in a single byte as themselves); it's furthermore careful that bytes with those values are never used as part of the encoding of the multibyte sequences for code points >= 128. So if you see a byte equal to 60 (0x3C), it's a '<' character, etc. All of the metachars in XML are in 7-bit ASCII. So one can just parse the XML, breaking strings where the metachars say to, sticking the fragments (possibly including non-ASCII chars) into a char* or std::string, and the returned fragments remain valid UTF-8 strings even though the parser didn't specifically know UTF-8.
Further (not specific to XML, but rather clever), even more complex things generally just work (tm). For example, if you sort UTF-8 lexicographically by bytes, you get the same answer as sorting it lexicographically by code points, despite the variation in the number of bytes used, because the prefix bytes introducing the longer (and hence higher-valued) code points are numerically greater than those for lesser values.
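A quick way to convince yourself of the sorting claim (my own example, assuming a UTF-8 encoded source file):

#include <iostream>
#include <string>

int main() {
    // "é" is U+00E9 (UTF-8 bytes C3 A9); "ア" is U+30A2 (UTF-8 bytes E3 82 A2).
    // std::string compares bytes as unsigned char, and the lead byte of the
    // higher code point is larger, so byte-wise order matches code-point order.
    std::string a = "é";
    std::string b = "ア";
    std::cout << std::boolalpha << (a < b) << '\n';   // true: U+00E9 < U+30A2
}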
UTF-8 is compatible with 7-bit ASCII code. If the value of a byte is larger than 127, it means a multibyte character starts. Depending on the value of the first byte you can see how many bytes the character will take; that can be 2-4 bytes including the first byte (technically 5 or 6 are also possible, but they are not valid UTF-8). Here is a good resource about UTF-8: the UTF-8 and Unicode FAQ; the Wikipedia page for UTF-8 is also very informative. Since UTF-8 is char-based and 0-terminated, you can use the standard string functions for most things. The only important thing is that the character count can differ from the byte count. Functions like strlen() return the byte count but not necessarily the character count.
By using between 1 and 4 chars to encode one Unicode code point.
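To make that concrete, here is a sketch of how one code point maps to 1-4 chars (to_utf8 is a made-up helper and skips error handling, e.g. rejecting surrogate code points):

#include <iostream>
#include <string>

// Encode a single Unicode code point into 1-4 chars of UTF-8.
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main() {
    std::string s = to_utf8(U'Ồ');                   // U+1ED2
    std::cout << s.size() << " bytes:" << std::hex;
    for (unsigned char b : s) std::cout << ' ' << (unsigned)b;
    std::cout << '\n';                               // 3 bytes: e1 bb 92
}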