Can anyone tell me what this means? (C++)

This is a very basic question, but please clarify it for me.
#define TLVTAG_APPLICATIONMESSAGE_V "\xDF01"
printf("%s\n", TLVTAG_APPLICATIONMESSAGE_V);
What will be printed?

To go step by step (using the C++ standard, 2.13.2 and 2.13.4 as references):
The #define means that you substitute the second thing wherever the first appears, so the printf is processed as printf("%s\n", "\xDF01");.
The "\xDF01" is a string of one character (plus the zero-byte terminator); the \x means that all the following hex digits are taken as a single hex value, so it attempts to treat DF01 as one number and fit it into a char.
Since a standard quoted string contains chars, not wchar_ts, and you're almost certainly working with an 8-bit char, the result is implementation-defined, and without the documentation for your implementation it's really impossible to speculate further.
Now, if the string were L"\xDF01", its elements would be wchar_ts, which are wide characters, normally 16 or 32 bits, and the value DF01 would become one character (presumably Unicode). The printf would then print the bytes \xDF and \x01, not necessarily in that order, since printf prints char, not wchar_t; wprintf would print the whole wchar_t.

It seems somebody is trying to print a Unicode character -> �

Related

C++ std::printf formatting breaks with cyrillic alphabet

I've noticed somewhat unexpected behavior when using std::printf() with field width specifiers like %10ls in conjunction with wchar_t (for Cyrillic text).
Code example I use:
void printHeader() {
    printDelim();
    std::printf("\n|%15ls|%15ls|%15ls|%15ls|%15ls|", L"Имя", L"Континент", L"Длина", L"Глубина", L"Приток");
}
A simple function that prints a delimiter (a bunch of "-") and should print a formatted line of titles (in Russian) separated by "|", so that each field is padded to 15 characters and looks tidy.
Actual output: | Имя|Континент| Длина|Глубина| Приток |
Notice:
The locale is set like this: setlocale(LC_ALL, ""), and Russian is present there.
If the parameters passed to printf() are in English, it works fine.
Just in case, the output of setlocale():
Locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=ru_RU.UTF-8;LC_TIME=ru_RU.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=ru_RU.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=ru_RU.UTF-8;LC_NAME=ru_RU.UTF-8;LC_ADDRESS=ru_RU.UTF-8;LC_TELEPHONE=ru_RU.UTF-8;LC_MEASUREMENT=ru_RU.UTF-8;LC_IDENTIFICATION=ru_RU.UTF-8
I also tried std::wprintf(), but it does not print anything at all.
std::printf() with %15s and the same strings without the L prefix prints with the same broken widths, though the Cyrillic strings themselves come out correct.
I'm extremely curious why this happens with wchar_t.
P.S. I'm aware that this code is almost literally C in C++, which is bad practice. Unfortunately, it is required in this case.
Let's look at cppreference.com's description of the %ls format specifier, because it explains one part of what's happening here in a very clear way:
If the l specifier is used, the argument must be a pointer to the initial element of an array of wchar_t, which is converted to char array as if by a call to wcrtomb with zero-initialized conversion state.
The key take-away is that %ls converts the wchar_t string to plain, narrow characters, as the first order of business. Basically, std::printf works with non-wide "characters", and allegedly-wide character strings get converted to non-wide "character" strings, before anything else happens.
Now that the input's domain consists of non-wide characters we make further progress:
Wherever the width specifier's documentation says "characters", it really means bytes: 15 bytes. Consider the first title:
Имя
As far as printf is concerned, this is not three "characters" but a six-byte sequence: d0 98 d0 bc d1 8f.
Just in case - output of the setlocale():
Locale: LC_CTYPE=en_US.UTF-8 ...
Your system uses UTF-8 encoding, which uses more than one byte to encode non-Latin characters.
printf is a little bit dumb: it doesn't know anything about your locale or your encoding. Every reference to character counts and field widths in printf's documentation really means bytes. %15s, or %15ls, means not 15 characters but 15 bytes to format. So it counts off 15 bytes and spits them out; but, interpreted as UTF-8, those bytes don't take up 15 columns on the screen.
Before Unicode, before the modern world with many alphabets and funny-looking characters, there was only the Latin alphabet, and characters and bytes were pretty much the same thing; printf's documentation harkens back to that era. This is not true any more, but printf is still living in the past.

Required to convert a String to UTF8 string

Problem Statement:
I am required to convert a generated string to a UTF-8 string; this generated string has extended ASCII characters, and I am on a Linux system (2.6.32-358.el6.x86_64).
A POC is still in progress so I can only provide small code samples
and complete solution can be posted only once ready.
Why I require UTF-8: I have extended ASCII characters that must be stored in a UTF-8 string.
How I am proceeding:
Convert generated string to wchar_t string.
Please look at the below sample code
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <iconv.h>

int main() {
    char CharString[] = "Prova";
    iconv_t cd;
    wchar_t WcharString[255];
    size_t size = mbstowcs(WcharString, CharString, strlen(CharString));
    wprintf(L"%ls\n", WcharString);
    wprintf(L"%s\n", WcharString);
    printf("\n%zu\n", size);
}
One question here:
Output is
Prova?????
s
Why is the size not printed here?
Why does the second wprintf print only one character?
If I print size before both strings, only 5 is printed and both strings are missing from the console.
Moving on to Second Part:
Now that I have a wchar_t string, I want to convert it to a UTF-8 string.
While searching around I found that iconv should help here.
Question here
These are the methods I found in manual
iconv_t iconv_open(const char *tocode, const char *fromcode);
size_t iconv(iconv_t cd, char **inbuf, size_t *inbytesleft, char **outbuf, size_t *outbytesleft);
int iconv_close(iconv_t cd);
Do I need to convert the wchar_t array back to a char array before feeding it to iconv?
Please provide suggestions on the above issues.
For the extended ASCII I am talking about, please see the marked letters in the snapshot below.
For your first question (which I am interpreting as "why is all the output not what I expect"):
Where does the '?????' come from? In the call mbstowcs(WcharString, CharString, strlen(CharString)), the last argument (strlen(CharString)) is the length of the output buffer, not the length of the input string. mbstowcs will not write more than that number of wide characters, including the NUL terminator. Since the conversion requires 6 wide characters including the terminator, and you are only allowing it to write 5 wide characters, the resulting wide character string is not NUL terminated, and when you try to print it out you end up printing garbage after the end of the converted string. Hence the ?????. You should use the size of the output buffer in wchar_t's (255, in this case) instead.
Why does the second wprintf only print one character? When you call wprintf with a wide character string argument, you must use the %ls format code (or, more accurately, the %s conversion needs to be qualified with an l length modifier). If you use %s without the l, then wprintf will interpret the argument as a char*, converting each character to a wchar_t as it outputs it. However, since the argument is actually a wide character string, the first wchar_t in it is L'P', which is the number 0x50 in some integer size. That means that the second byte of that wchar_t (in memory order, since you have a little-endian architecture) is 0, so if you treat the string as a string of chars, it is terminated immediately after the P. So only one character is printed.
Why doesn't the last printf print anything? In C, an output stream can either be a wide stream or a byte stream, but you don't specify that when you open the stream. (And, in any case, standard output is already opened for you.) This is called the orientation of the stream. A newly opened stream is unoriented, and the orientation is fixed when you first output to the stream. If the first output call is a wide call, like wprintf, then the stream is a wide stream; otherwise, it is a byte stream. Once set, the orientation is fixed and you can't use output calls of the wrong orientation. So the printf is illegal, and it does nothing other than raise an error.
Now, let's move on to your second question: What do I do about it?
The first thing is that you need to be clear about what format the input is in, and how you want to output it. On Linux, it is somewhat unlikely that you will want to use wchar_t at all. The most likely cases for the input string are that it is already UTF-8, or that it is in some ISO-8859-x encoding. And the most likely cases for the output are the same: either it is UTF-8, or it is some ISO-8859-x encoding.
Unfortunately, there is no way for your program to know what encoding the console is expecting. The output may not even be going to a console. Similarly, there is really no way for your program to know which ISO-8859-x encoding is being used in the input string. (If it is a string literal, the encoding might be specified when you invoke the compiler, but there is no standard way of providing the information.)
If you are having trouble viewing output because non-ascii characters aren't displaying properly, you should start by making sure that the console is configured to use the same encoding as the program is outputting. If the program is sending UTF-8 to a console which is displaying, say, ISO-8859-15, then the text will not display properly. In theory, your locale setting includes the encoding used by your console, but if you are using a remote console (say, through PuTTY from a Windows machine), then the console is not part of the Linux environment and the default locale may be incorrect. The simplest fix is to configure your console correctly, but it is also possible to change the Linux locale.
The fact that you are using mbstowcs from a byte string suggests that you believe that the original string is in UTF-8. So it seems unlikely that the problem is that you need to convert it to UTF-8.
You can certainly use iconv to convert a string from one encoding to another; you don't need to go through wchar_t to do so. But you do need to know the actual input encoding and the desired output encoding.
It's not a good idea to use iconv for UTF-8. Just implement the definition of UTF-8 yourself; that is quite easily done in C from the description at https://en.wikipedia.org/wiki/UTF-8.
You don't even need wchar_t; just use uint32_t for your characters.
You will learn a lot by implementing it yourself, and your program will gain speed from not using the mb or iconv functions.

How does string work with non-ascii symbols while char does not?

I understand that char in C++ is just an integer type that stores ASCII symbols as numbers ranging from 0 to 127. The Scandinavian letters 'æ', 'ø', and 'å' are not among the 128 symbols in the ASCII table.
So naturally when I try char ch1 = 'ø' I get a compiler error; however, string str = "øæå" works fine, even though a string uses chars, right?
Does string somehow switch over to Unicode?
In C++ there is the source character set and the execution character set. The source character set is what you can use in your source code; but this doesn't have to coincide with which characters are available at runtime.
It's implementation-defined what happens if you use characters in your source code that aren't in the source character set. Apparently 'ø' is not in your compiler's source character set, otherwise you wouldn't have gotten an error; this means that your compiler's documentation should include an explanation of what it does for both of these code samples. Probably you will find that str does have some sort of sequence of bytes in it that form a string.
To avoid this you could use character literals instead of embedding characters in your source code, in this case '\xF8'. If you need to use characters that aren't in the execution character set either, you can use wchar_t and wstring.
From the source code char c = 'ø';:
source_file.cpp:2:12: error: character too large for enclosing character literal type
char c = '<U+00F8>';
^
What's happening here is that the compiler converts the character from the source encoding and determines that there is no representation of it in the execution encoding that fits in a single char. (Note that this error has nothing to do with the initialization of c; it would happen with any such character literal.)
When you put such characters into a string literal rather than a character literal, however, the compiler's conversion from the source encoding to the execution encoding is perfectly happy to use multi-byte representations of the characters when the execution encoding is multi-byte, such as UTF-8 is.
To better understand what compilers do in this area you should start by reading clauses 2.3 [lex.charsets], 2.14.3 [lex.ccon], and 2.14.5 [lex.string] in the C++ standard.
What's likely happening here is that your source file is encoded as UTF-8 or some other multi-byte character encoding, and the compiler is simply treating it as a sequence of bytes. A single char can only be a single byte, but a string is perfectly happy to be as many bytes as are required.
ASCII proper has only 128 characters.
'ø' comes from extended ASCII: code 248 of 255, which uses the full 8 bits rather than ASCII's 7, so it is not a plain ASCII character value.
You can try char ch1 = '\xF8';

std::string and UTF-8 encoded unicode

If I understand well, it is possible to use both string and wstring to store UTF-8 text.
With char, ASCII characters take a single byte, some chinese characters take 3 or 4, etc. Which means that str[3] doesn't necessarily point to the 4th character.
With wchar_t same thing, but the minimal amount of bytes used per characters is always 2 (instead of 1 for char), and a 3 or 4 byte wide character will take 2 wchar_t.
Right ?
So, what if I want to use string::find_first_of() or string::compare(), etc. with such a weirdly encoded string? Will it work? Does the string class handle the fact that characters have a variable size? Or should I only use them as dummy feature-less byte arrays, in which case I'd rather go for a wchar_t[] buffer?
If std::string doesn't handle that, a second question: are there libraries providing string classes that could handle UTF-8 encoding, so that str[3] actually points to the 3rd character (which would be a byte sequence of length 1 to 4)?
You are talking about Unicode. A Unicode character can be represented in 32 bits; however, since that wastes memory, there are more compact encodings. UTF-8 is one such encoding: it uses byte units and maps each Unicode character to 1, 2, 3, or 4 bytes. UTF-16 is another, using 16-bit words as units and mapping each Unicode character to 1 or 2 words (2 or 4 bytes).
You can use such encodings with both string and wstring. UTF-8 tends to be more compact for English text and numbers.
Some things will work regardless of the encoding and type used (e.g. compare). However, all functions that need to understand one character will be broken, because the 5th character is not always the 5th entry in the underlying array. It might look like it's working with certain examples, but it will eventually break.
string::compare will work, but do not expect to get alphabetical ordering; that is language-dependent.
string::find_first_of will work for some inputs but not all. Long strings will likely work just because they are long, while shorter ones might be confused by character alignment and generate very hard-to-find bugs.
Best thing is to find a library that handles it for you and ignore the type underneath (unless you have strong reasons to pick one or the other).
You can't handle Unicode with std::string or any other tools from the Standard Library. Use an external library such as http://utfcpp.sourceforge.net/
You are correct for those:
...Which means that str[3] doesn't necessarily point to the 4th character...only use them as dummy feature-less byte arrays...
C++'s string only handles bytes; this is unlike Java's String, which understands Unicode characters. You can store the encoded bytes of Chinese characters in a string (char in C/C++ is just a byte), but this is of limited use, as string treats those bytes as individual chars, so you cannot use string's functions to process them as characters.
wstring may be what you need.
One thing should be clarified: UTF-8 is just an encoding method for Unicode characters (transforming characters to/from byte format).

c++: getting ascii value of a wide char

Let's say I have a char array like "äa".
Is there a way to get the ASCII value (e.g. 228) of the first char, which is a multibyte character?
Even if I cast my array to wchar_t *, I'm not able to get the value of "ä", because it is 2 bytes long.
Is there a way to do this? I've been trying for 2 days now :(
I'm using gcc.
Thanks!
You're contradicting yourself. International characters like ä are (by definition) not in the ASCII character set, so they don't have an "ascii value".
It depends on the exact encoding of your two-character array, if you can get the code point for a single character or not, and if so which format it will be in.
You are very confused. ASCII only has values smaller than 128. Value 228 corresponds to ä in the 8-bit character sets ISO-8859-1, CP1252, and some others; it is also the UCS value of ä in Unicode. If you used the string literal "ä" and got a string of two characters, the string is in fact encoded in UTF-8, and you may wish to parse the UTF-8 encoding to recover Unicode UCS values.
More likely what you really want to do is converting from one character set to another. How to do this heavily depends on your operating system, so more information is required. You also need to specify what exactly you want? A std::string or char* of ISO-8859-1, perhaps?
There is a standard C++ facility to do that conversion, ctype::narrow(). It is part of the localization library. It will convert the wide character to the equivalent char value for your current locale, if possible. As the other answers have pointed out, there isn't always a mapping, which is why ctype::narrow() takes a default character to return when there is no mapping.
Depends on the encoding used in your char array.
If your char array is Latin-1 encoded, then it is 2 bytes long (plus maybe a NUL terminator, which we don't care about), and those 2 bytes are:
0xE4 (lower-case a umlaut)
0x61 (lower-case a).
Note that Latin 1 is not ASCII, and 0xE4 is not an ASCII value, it's a Latin 1 (or Unicode) value.
You would get the value like this:
int i = (unsigned char) my_array[0];
If your char array is UTF-8 encoded, then it is three bytes long, and those bytes are:
binary 11000011 (first byte of UTF-8 encoded 0xE4)
binary 10100100 (second byte of UTF-8 encoded 0xE4)
0x61 (lower-case a)
To recover the Unicode value of a character encoded with UTF-8, you either need to implement it yourself based on http://en.wikipedia.org/wiki/UTF-8#Description (usually a bad idea in production code), or else you need to use a platform-specific unicode-to-wchar_t conversion routine. On linux this is mbstowcs or iconv, although for a single character you can use mbtowc provided that the multi-byte encoding defined for the current locale is in fact UTF-8:
wchar_t i;
if (mbtowc(&i, my_array, 3) == -1) {
    // handle error
}
If it's SHIFT-JIS then this doesn't work...
What you want is called transliteration: converting the letters of one language to another. It has nothing to do with Unicode or wchars; you need a table of mappings.