Playing cards Unicode printing in C++ - c++

According to this wiki link, the play cards have Unicode of form U+1f0a1.
I wanted to create an array in c++ to sore the 52 standard playing cards but I notice this Unicode is longer that 2 bytes.
So my simple example below does not work, how do I store a Unicode character that is longer than 2 bytes?
wchar_t t = '\u1f0a1';
printf("%lc",t);
The above code truncates t to \u1f0a

how do I store a longer that 2 byte unicode character?
you can use char32_t with prefix U, but there's no way to print it to console. Besides, you don't need char32_t at all, utf-16 is enough to encode that character. wchar_t t = L'\u2660', you need the prefix L to specify it's a wide char.
If you are using Windows with Visual C++ compiler, I recommend a way:
Save your source file with utf-8 encoding
set compile parameter /utf-8, reference here.
use a console supports utf-8 encoded like Git Bash to see the result.

On Windows wchar_t stores a UTF-16 code-unit, you have to store your string as UTF-16 (using a string-literal with prefix) This doesn't help you either since the windows console can only output characters up to 0xFFFF. See this:
How to use unicode characters in Windows command line?

Related

Unicode in wxWidgets

I'm creating a calculator application in C++ wxWidgets using Visual Studio 2019. I have created a custom button class that I want to use for all mathematical operations and symbols.
How can I set the button's label to √ instead of sqrt? If I do that, a ? symbol is displayed instead. I also need to display these symbols on a wxTextCtrl, if I do it I get the following error when I try to compile: (ignore App.razor, the picture is not mine)
Do I need to change che current character set from ASCII to Unicode? How do you do that?
For a single character, you can just use wxUniChar. You create a wxUniChar with a value in hexadecimal of the Unicode code point for the desired character. Since the Unicode code point of the square root character is U+221A, you can create a wxUniChar for this character like so:
wxUniChar c(0x221A);
wxUnichar is implicitly convertible to wxString, so (assuming wxWidgets was built in Unicode mode), you can use wxUniChar variables exactly as you would use a wxString. For example you could do something like:
control->SetLabel(c);
or
dc.DrawText(c,0,0);
The answer by #New-Pagodi (sorry, don't know how to tag people with spaces in their names) works, but just saving your file in UTF-8 encoding, as MSVS proposes you to do is a much nicer solution. Even in this case notice that you still need to either use wxString::FromUTF8("√") or explicitly set your locale encoding to UTF-8 by using setlocale(), which is (finally) supported by the recent Windows versions, in which case you can use just "√", or use wide strings, i.e. L"√"`.
I.e. you must both have the correct bytes (e2 88 9a for UTF-8-encoded representation of U+221A) in the file and use the correct encoding when creating wxString from it if you're using char* strings. By default this encoding is not UTF-8 under Windows, so using just wxString("√") doesn't work.

std::string with different encoding to QString

Is there any way to detect std::string encoding?
My problem: I have an external web services which give data in different encodings. Also I have a library witch parse that data and store it in std::string. Than I want to display data in Qt GUI. The problem is that std::string can have different encodings. Some string can be converted using QString::fromAscii(), some QString::fromUtf8().
I haven't looked into it but I did use some Qt3.3 in the past.
ASCII vs Unicode + UTF-8
Utf8 is 8-bit, ascii 7-bit. I guess you can try to look into the values of string array
http://doc.qt.digia.com/3.3/qstring.html#ascii and http://doc.qt.digia.com/3.3/qstring.html#utf8
it seems ascii returns an 8-bit ASCII representation of the string, still I think it should have values from 0 to 127 or something like that. you must compare more characters in the string.

Does Visual Studio 2010 Supports C++ Source Code in Unicode with Unicode Char in String Literal

I want to directly embed non-ASCII Unicode characters in string literals and use them in printf. This implies my source codes must be saved in utf-8 or utf-16. Visual Studio 2010 does support editing and saving C++ source files in either format. But when compiled & executed, it does not produce the correct unicode characters. Does the compiler support string literals with unicode characters embedded?
e.g.
wprintf(L" chinese characters:中文字\n"); the trailing chinese characters cannot be displayed
I don't have a Chinese version of Windows to test with, so this is complete speculation.
The console and file output functions are aware that files are not coded in UTF-16, so they attempt to convert the characters to a code page before output. Just as the default locale is "C" rather than anything based on your system settings, so too the default code page is probably an inappropriate one that does not include Chinese characters.
There is a function SetConsoleOutputCP to change the code page for the console. It is not clear if this function changes the code page used by the actual console window, or if it only affects conversions from Unicode within the program.
The easy way to test wide literals is to skip the formatting part of printf, and give your string straight to the OS: WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L" chinese characters:中文字", ....
It's possible that #pragma setlocale may be what you need.

How to get Unicode for Chracter strings(UTF-8) in c or c++ language (Linux)

I am working on one application in which i need to know Unicode of Characters to classify them like Chinese Characters, Japanese Characters(Kanji,Katakana,Hiragana) , Latin , Greek etc .
The given string is in UTF-8 Format.
If there is any way to know Unicode for UTF-8 Character? For example:
Character '≠' has U+2260 Unicode value.
Character '建' has U+5EFA Unicode value.
The utf-8 encoding is a variable width encoding of unicode. Each unicode code point can be encoded from one to four char.
To decode a char* string and extract a single code point, you read one byte. If the most significant bit is set then, the code point is encoded on multiple characters, otherwise it is the unicode code point. The number of bits set counting from the most-significant bit indicate how many char are used to encode the unicode code point.
This table explain how to make the conversion:
UTF-8 (char*) | Unicode (21 bits)
------------------------------------+--------------------------
0xxxxxxx | 00000000000000000xxxxxxx
------------------------------------+--------------------------
110yyyyy 10xxxxxx | 0000000000000yyyyyxxxxxx
------------------------------------+--------------------------
1110zzzz 10yyyyyy 10xxxxxx | 00000000zzzzyyyyyyxxxxxx
------------------------------------+--------------------------
11110www 10zzzzzz 10yyyyyy 10xxxxxx | 000wwwzzzzzzyyyyyyxxxxxx
Based on that, the code is relatively straightforward to write. If you don't want to write it, you can use a library that does the conversion for you. There are many available under Linux : libiconv, icu, glib, ...
libiconv can help you with converting the utf-8 string to utf-16 or utf-32. Utf-32 would be the savest option if you really want to support every possible unicode codepoint.

How to convert a single-byte const char* to a UTF-8 encoding

I have a function which requires me to pass a UTF-8 string pointed by a char*, and I have the char pointer to a single-byte string. How can I convert the string to UTF-8 encoding in C++? Is there any code I can use to do this?
Thanks!
Assuming Linux, you're looking for iconv. When you open the converter (iconv_open), you pass from and to encoding. If you pass an empty string as from, it'll convert from the locale used on your system which should match the file system.
On Windows, you have pretty much the same with MultiByteToWideChar where you pass CP_ACP as the codepage. But on Windows you can simply call the Unicode version of the functions to get Unicode straight away and then convert to UTF-8 with WideCharToMultiByte and CP_UTF8.
To convert a string to a different character encoding, use any of various character encoding libraries. A popular choice is iconv (the standard on most Linux systems).
However, to do this you first need to figure out the encoding of your input. There is unfortunately no general solution to this. If the input does not specify its encoding (like e.g. web pages generally do), you'll have to guess.
As to your question: You write that you get the string from calling readdir on a FAT32 file system. I'm not quite sure, but I believe readdir will return the file names as they are stored by the file system. In the case of FAT/FAT32:
The short file names are encoded in some DOS code page - which code page depends on how the files where written, there's no way to tell from just the file system AFAIK.
The long file names are in UTF-16.
If you use the standard vfat Linux kernel module to access the FAT32 partition, you should get long file names from readdir (unless a file only has an 8.3 name). These can be decoded as UTF-16. FAT32 stores the long file names in UTF-16 internally. The vfat driver will convert them to the encoding given by the iocharset= mount parameter (with the default being the default system encoding, I believe).
Additional information:
You may have to play with the mount options codepage and iocharset (see http://linux.die.net/man/8/mount ) to get filenames right on the FAT32 volume. Try to mount such that filenames are shown correctly in a Linux console, then proceed. There is some more explanation here: http://www.nslu2-linux.org/wiki/HowTo/MountFATFileSystems
I guess the top bit is set on the 1 byte string so the function you're passing that to is expecting more than 1 byte to be passed.
First, print the string out in hex.
i.e.
unsigned char* str = "your string";
for (int i = 0; i < strlen(str); i++)
printf("[%02x]", str[i]);
Now have a read of the wikipedia article on UTF8 encoding which explains it well.
http://en.wikipedia.org/wiki/UTF-8
UTF-8 is variable width where each character can occupy from 1 to 4 bytes.
Therefore, convert the hex to binary and see what the code point is.
i.e. if the first byte starts 11110 (in binary) then it's expecting a 4 byte string. Since ascii is 7-bit 0-127 the top bit is always zero so there should be only 1 byte. By the way, the bytes following the first byte in a wide character of a UTF8 string will start "10..." for the top bits. These are the continuation bytes... that's what your function is complaining about... i.e. the continuation bytes are missing when expected.
So the string is not quite true ascii as you thought it was.
You can convert using as someone suggested iconv, or perhaps this library http://utfcpp.sourceforge.net/