Multi-Byte to Widechar conversion using mbsnrtowcs - c++

I'm trying to convert a multi-byte (UTF-8) string to a wide-character string, and mbsnrtowcs always fails. Here are the input and expected strings:
const char* pInputMultiByteString = "A quick brown Fox jumps \xC2\xA9 over the lazy Dog.";
const wchar_t* pExpectedWideString = L"A quick brown Fox jumps \x00A9 over the lazy Dog.";
The special character is the copyright symbol.
This conversion works fine when I use the Windows MultiByteToWideChar routine, but since that API is not available on Linux, I have to use mbsnrtowcs, which is failing. I've tried other characters as well and it always fails. The only exception is that when the input string contains only ASCII characters, mbsnrtowcs works fine. What am I doing wrong?

UTF is not a multi-byte string (although it is true that Unicode characters can be represented using more than one byte). A multi-byte string is a string that uses a certain codepage to represent characters, some of which take more than one byte.
Since you are combining ANSI characters and Unicode characters, you should use UTF-8.
So trying to convert UTF to wchar_t (which on Windows is UTF-16 and on Linux is UTF-32) using mbsnrtowcs just can't be done.
If you use UTF-8 you should look into a Unicode handling library. For most tasks I recommend UTF8-CPP from http://utfcpp.sourceforge.net/
You can read more about Unicode and UTF-8 on Wikipedia.

MultiByteToWideChar has a parameter where you specify the code page, but mbsnrtowcs doesn't. On Linux, have you set LC_CTYPE in your locale to specify UTF-8?

SOLUTION: By default every C program uses the "C" locale, so I had to call setlocale(LC_CTYPE, ""). The empty string means the environment's locale is used (en_US.utf8 in my case), and the conversion worked.
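For reference, a minimal sketch of that fix; it assumes a UTF-8 locale such as en_US.utf8 is available in the environment and uses the POSIX mbsnrtowcs (the buffer size and error handling are illustrative only):

#include <clocale>   // std::setlocale
#include <cstdio>    // std::perror
#include <cstring>   // std::strlen
#include <wchar.h>   // mbsnrtowcs, mbstate_t, wprintf (POSIX)

int main()
{
    // Without this call the program stays in the default "C" locale and
    // mbsnrtowcs rejects the multi-byte copyright sequence \xC2\xA9.
    std::setlocale(LC_CTYPE, "");   // use the environment's locale, e.g. en_US.utf8

    const char* input = "A quick brown Fox jumps \xC2\xA9 over the lazy Dog.";
    const char* src   = input;
    wchar_t dest[128];
    mbstate_t state = mbstate_t();

    size_t n = mbsnrtowcs(dest, &src, std::strlen(input), 127, &state);
    if (n == (size_t)-1) {
        std::perror("mbsnrtowcs");
        return 1;
    }
    dest[n] = L'\0';
    wprintf(L"%ls\n", dest);
    return 0;
}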

Related

Playing cards Unicode printing in C++

According to this wiki link, the playing cards have Unicode code points of the form U+1F0A1.
I wanted to create an array in C++ to store the 52 standard playing cards, but I noticed this code point is longer than 2 bytes.
So my simple example below does not work; how do I store a Unicode character that is longer than 2 bytes?
wchar_t t = '\u1f0a1';
printf("%lc",t);
The above code truncates t to \u1f0a
How do I store a Unicode character that is longer than 2 bytes?
You can use char32_t with the prefix U, but there's no way to print it to the console. Besides, you don't need char32_t at all; UTF-16 is enough to encode that character. wchar_t t = L'\u2660'; you need the prefix L to specify that it's a wide character.
If you are using Windows with the Visual C++ compiler, I recommend the following:
Save your source file with UTF-8 encoding.
Set the compiler option /utf-8, reference here.
Use a console that supports UTF-8, such as Git Bash, to see the result.
On Windows, wchar_t stores a UTF-16 code unit, so you have to store your string as UTF-16 (using a string literal with the L prefix). This doesn't help you either, since the Windows console can only output characters up to 0xFFFF. See this:
How to use unicode characters in Windows command line?
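Putting the answers above together, here is a small hypothetical sketch (the to_utf8 helper is mine, not from the answers): store the card in a char32_t, encode the code point as UTF-8 by hand, and print it on a UTF-8-capable terminal such as Git Bash.

#include <cstdio>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
std::string to_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main()
{
    char32_t ace_of_spades = U'\U0001F0A1';               // fits comfortably in 32 bits
    std::printf("%s\n", to_utf8(ace_of_spades).c_str());  // needs a UTF-8 capable terminal and font
}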

What is the equivalent of mb_convert_encoding in GCC?

Currently I do the following in PHP to convert from UTF-8 to ASCII, but with the UTF-8 characters HTML-encoded. For instance, the German ß character would be converted to &szlig;.
$sData = mb_convert_encoding($sData, 'HTML-ENTITIES','UTF-8');
In GCC C++ on Linux, what is the equivalent way to do this?
Actually, I can also accept an alternative form: instead of &szlig;, the &#xNNN; format will do.
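One possible sketch (not a canonical answer): reuse the locale-based wide-character conversion from the first question, then replace every non-ASCII code point with its &#xNNN; entity. It assumes a UTF-8 locale, Linux's 32-bit wchar_t, and valid input; error handling is minimal.

#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <string>

std::string html_entities(const char* utf8)
{
    size_t n = std::mbstowcs(nullptr, utf8, 0);   // length in wide characters
    if (n == (size_t)-1) return "";               // input invalid for the current locale
    std::wstring wide(n, L'\0');
    std::mbstowcs(&wide[0], utf8, n);

    std::string out;
    for (wchar_t wc : wide) {
        if (static_cast<unsigned>(wc) < 0x80) {
            out += static_cast<char>(wc);         // ASCII passes through unchanged
        } else {
            char buf[16];                         // non-ASCII becomes a numeric entity
            std::snprintf(buf, sizeof buf, "&#x%X;", static_cast<unsigned>(wc));
            out += buf;
        }
    }
    return out;
}

int main()
{
    std::setlocale(LC_CTYPE, "");                 // assumes a UTF-8 locale such as en_US.utf8
    std::printf("%s\n", html_entities("Stra\xC3\x9F" "e").c_str());   // prints Stra&#xDF;e
}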

CoreFoundation printing Unicode characters

I have the code below and it does seem to work, except that CFShow doesn't translate the Unicode UTF-8 encoding of \u00e9 into é.
#include <CoreFoundation/CoreFoundation.h>

int main()
{
    const char *s = "This is a test of unicode support: fiancée\n";
    CFTypeRef cfs = CFStringCreateWithCString(NULL, s, kCFStringEncodingUTF8);
    CFShow(cfs);
}
Output is
This is a test of unicode support: fianc\u00e9e
The é doesn't output properly.
How do I instruct CFShow that it is Unicode? printf handles it fine when it is a C string.
CFShow() is only for debugging. It's deliberately converting non-ASCII to escape codes in order to avoid ambiguity. For example, "é" can be expressed in two ways: as U+00E9 LATIN SMALL LETTER E WITH ACUTE or as U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT. If CFShow() were to emit the UTF-8 sequence, your terminal would likely present it as "é" and you wouldn't be able to tell which variant was in the string. That would undermine the usefulness of CFShow() for debugging.
Why do you care what the output of CFShow() is, so long as you understand what the content of the string is?
It appears to me that CFShow knows that the string is Unicode, but doesn't know how to format Unicode for the console. I doubt that you can do anything but look for an alternative, perhaps NSLog.
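If plain output (rather than CFShow's debug dump) is acceptable, one workaround sketch is to copy the string back out as UTF-8 with CFStringGetCString and hand it to printf; the fixed-size buffer here is illustrative only.

#include <CoreFoundation/CoreFoundation.h>
#include <cstdio>

int main()
{
    // é written as its UTF-8 bytes \xC3\xA9 so the source encoding doesn't matter.
    const char *s = "This is a test of unicode support: fianc\xC3\xA9" "e\n";
    CFStringRef cfs = CFStringCreateWithCString(NULL, s, kCFStringEncodingUTF8);

    char buffer[256];
    if (CFStringGetCString(cfs, buffer, sizeof buffer, kCFStringEncodingUTF8))
        std::printf("%s", buffer);   // the terminal renders the é, no \u00e9 escape

    CFRelease(cfs);
    return 0;
}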

How to use extended Unicode characters in C++ in Visual Studio?

We are using a Korean font and the FreeType library and trying to display a Korean character, but it displays some other characters instead of the intended glyph.
Code:
std::wstring text3 = L"놈";
Are there any tricks to typing the Korean characters?
For maximum portability, I'd suggest avoiding encoding Unicode characters directly in your source code and using \u escape sequences instead. The character 놈 is Unicode code point U+B188, so you could write this as:
std::wstring text3 = L"\uB188";
The question is what the encoding of the source file is.
It is likely UTF-8, which is one of the reasons not to use wstring; use a regular string instead. For more information on my way of handling characters, see http://utf8everywhere.org.
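A small sketch of that suggestion: keep the text in a plain std::string holding UTF-8 bytes, written here as hex escapes so the result does not depend on the source file's encoding (printing still requires a UTF-8-capable console and font).

#include <cstdio>
#include <string>

int main()
{
    // UTF-8 bytes of U+B188 (놈); no wstring needed.
    // (With C++11..17 the same bytes can also be written as u8"\uB188".)
    std::string text3 = "\xEB\x86\x88";
    std::printf("%s\n", text3.c_str());
}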

How to get Unicode code points for character strings (UTF-8) in C or C++ (Linux)

I am working on an application in which I need to know the Unicode code points of characters in order to classify them as Chinese characters, Japanese characters (Kanji, Katakana, Hiragana), Latin, Greek, etc.
The given string is in UTF-8 format.
Is there any way to get the Unicode code point of a UTF-8 character? For example:
The character '≠' has the Unicode value U+2260.
The character '建' has the Unicode value U+5EFA.
UTF-8 is a variable-width encoding of Unicode. Each Unicode code point is encoded in one to four chars.
To decode a char* string and extract a single code point, read one byte. If its most significant bit is clear, that byte is the code point itself; otherwise the code point is encoded in multiple chars, and the number of leading 1 bits in the first byte indicates how many chars are used.
This table explains how to make the conversion:
UTF-8 (char*)                        | Unicode (21 bits)
-------------------------------------+----------------------
0xxxxxxx                             | 00000000000000xxxxxxx
-------------------------------------+----------------------
110yyyyy 10xxxxxx                    | 0000000000yyyyyxxxxxx
-------------------------------------+----------------------
1110zzzz 10yyyyyy 10xxxxxx           | 00000zzzzyyyyyyxxxxxx
-------------------------------------+----------------------
11110www 10zzzzzz 10yyyyyy 10xxxxxx  | wwwzzzzzzyyyyyyxxxxxx
Based on that, the code is relatively straightforward to write (see the sketch below). If you don't want to write it, you can use a library that does the conversion for you. There are many available under Linux: libiconv, ICU, glib, ...
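As an illustration of the table (a sketch only: it assumes well-formed UTF-8 and does no validation), here is a decoder that extracts one code point at a time and, as in the question, checks whether it falls in the CJK Unified Ideographs range:

#include <cstdio>

// Decode one code point starting at p and advance p past its bytes.
unsigned decode_utf8(const unsigned char*& p)
{
    unsigned cp;
    int extra;
    if      (*p < 0x80) { cp = *p;        extra = 0; }   // 0xxxxxxx
    else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }   // 110yyyyy
    else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }   // 1110zzzz
    else                { cp = *p & 0x07; extra = 3; }   // 11110www
    ++p;
    while (extra-- > 0)
        cp = (cp << 6) | (*p++ & 0x3F);                  // 10xxxxxx continuation bytes
    return cp;
}

int main()
{
    // "≠ 建" written as UTF-8 bytes.
    const unsigned char* s =
        reinterpret_cast<const unsigned char*>("\xE2\x89\xA0 \xE5\xBB\xBA");
    while (*s) {
        unsigned cp = decode_utf8(s);
        std::printf("U+%04X%s\n", cp,
                    cp >= 0x4E00 && cp <= 0x9FFF ? " (CJK ideograph)" : "");
    }
}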
libiconv can help you with converting the UTF-8 string to UTF-16 or UTF-32. UTF-32 would be the safest option if you really want to support every possible Unicode code point.
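A sketch of that library route, using glibc's built-in iconv; it assumes a little-endian machine (so "UTF-32LE" matches the host layout of uint32_t) and trims error handling.

#include <iconv.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    char in[] = "\xE2\x89\xA0\xE5\xBB\xBA";          // "≠建" as UTF-8 bytes
    std::uint32_t out[16] = {0};

    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");    // arguments are (to, from)
    char*  inbuf   = in;
    char*  outbuf  = reinterpret_cast<char*>(out);
    size_t inleft  = std::strlen(in);
    size_t outleft = sizeof out;
    iconv(cd, &inbuf, &inleft, &outbuf, &outleft);   // converts until input is consumed
    iconv_close(cd);

    size_t n = (sizeof out - outleft) / sizeof(std::uint32_t);
    for (size_t i = 0; i < n; ++i)
        std::printf("U+%04X\n", static_cast<unsigned>(out[i]));   // U+2260, U+5EFA
}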