How do I create a Unicode filename in Linux? - C++

I heard fopen supports UTF-8, but I don't know how to convert an array of shorts to UTF-8.
How do I create a file with Unicode letters in its name? I'd prefer to use only built-in libraries (no Boost, which is not installed on the Linux box). I do need to use fopen, but it's pretty simple to use.

fopen(3) accepts any valid byte sequence; the encoding is unimportant to it. Use nl_langinfo(3) with CODESET to find out which charset you should use for the encoding, and libiconv or ICU for the actual conversion.
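For example, if the filename arrives as an array of shorts, a rough sketch might look like this (assuming the shorts hold UTF-16LE code units on a little-endian machine; the name16 data is made up, and error checking is omitted for brevity):

#include <cstdio>
#include <clocale>
#include <iconv.h>
#include <langinfo.h>

int main() {
    setlocale(LC_ALL, "");                       // pick up the user's locale
    const char* codeset = nl_langinfo(CODESET);  // e.g. "UTF-8"

    unsigned short name16[] = { 0x00E9, 0x0074, 0x00E9, 0 };  // "été", hypothetical input

    iconv_t cd = iconv_open(codeset, "UTF-16LE");
    char out[256];
    char* inbuf = (char*)name16;
    size_t inleft = 3 * sizeof(unsigned short);  // bytes, excluding the terminator
    char* outbuf = out;
    size_t outleft = sizeof(out) - 1;
    iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    *outbuf = '\0';
    iconv_close(cd);

    FILE* f = fopen(out, "w");                   // fopen just takes the bytes
    if (f) fclose(f);
    return 0;
}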

Related

Unicode in wxWidgets

I'm creating a calculator application in C++ with wxWidgets, using Visual Studio 2019. I have created a custom button class that I want to use for all mathematical operations and symbols.
How can I set the button's label to √ instead of sqrt? If I do that, a ? symbol is displayed instead. I also need to display these symbols in a wxTextCtrl; when I do, I get an error when I try to compile.
Do I need to change the current character set from ASCII to Unicode? How do I do that?
For a single character, you can just use wxUniChar. You create a wxUniChar with the hexadecimal value of the Unicode code point for the desired character. Since the Unicode code point of the square root character is U+221A, you can create a wxUniChar for this character like so:
wxUniChar c(0x221A);
wxUniChar is implicitly convertible to wxString, so (assuming wxWidgets was built in Unicode mode) you can use wxUniChar variables exactly as you would use a wxString. For example, you could do something like:
control->SetLabel(c);
or
dc.DrawText(c,0,0);
The answer by @New-Pagodi (sorry, don't know how to tag people with spaces in their names) works, but just saving your file in UTF-8 encoding, as MSVS proposes, is a much nicer solution. Even in this case, notice that you still need to either use wxString::FromUTF8("√"), or explicitly set your locale encoding to UTF-8 using setlocale(), which is (finally) supported by recent Windows versions, in which case you can use just "√", or use wide strings, i.e. L"√".
That is, you must both have the correct bytes (e2 88 9a, the UTF-8-encoded representation of U+221A) in the file and use the correct encoding when creating a wxString from them if you're using char* strings. By default this encoding is not UTF-8 under Windows, so just wxString("√") doesn't work.
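For instance (a minimal sketch; myButton is a made-up control, and this assumes a Unicode build of wxWidgets with the source file saved as UTF-8):

myButton->SetLabel(wxString::FromUTF8("\xe2\x88\x9a"));  // the bytes e2 88 9a are UTF-8 for √
// or sidestep the execution charset entirely with a wide literal:
myButton->SetLabel(L"\u221A");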

Playing cards Unicode printing in C++

According to this wiki link, the playing cards have Unicode code points of the form U+1F0A1.
I wanted to create an array in C++ to store the 52 standard playing cards, but I noticed this code point is longer than 2 bytes.
So my simple example below does not work; how do I store a Unicode character that is longer than 2 bytes?
wchar_t t = '\u1f0a1';
printf("%lc",t);
The above code truncates t to \u1f0a (the \u escape takes exactly four hex digits, so the trailing 1 is parsed as a separate character).
How do I store a Unicode character longer than 2 bytes?
You can use char32_t with the U prefix, but there's no direct way to print it to the console. Besides, you don't need char32_t at all for a character like the spade suit symbol; UTF-16 is enough to encode it: wchar_t t = L'\u2660'; (you need the L prefix to specify a wide character).
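If you do want to keep the full playing-card code points, you can store them in char32_t and encode them to UTF-8 yourself for output. A minimal sketch (hand-rolled encoder, no validation, assuming a UTF-8 terminal):

#include <cstdio>

// Encode one Unicode code point as UTF-8; returns the number of bytes written.
int to_utf8(char32_t cp, char out[4]) {
    if (cp < 0x80)    { out[0] = (char)cp; return 1; }
    if (cp < 0x800)   { out[0] = (char)(0xC0 | (cp >> 6));
                        out[1] = (char)(0x80 | (cp & 0x3F)); return 2; }
    if (cp < 0x10000) { out[0] = (char)(0xE0 | (cp >> 12));
                        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
                        out[2] = (char)(0x80 | (cp & 0x3F)); return 3; }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

int main() {
    char32_t card = U'\U0001F0A1';  // ace of spades, outside the BMP
    char buf[4];
    int n = to_utf8(card, buf);
    fwrite(buf, 1, n, stdout);      // prints the card on a UTF-8 terminal
    putchar('\n');
    return 0;
}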
If you are using Windows with the Visual C++ compiler, I recommend this approach:
Save your source file with UTF-8 encoding.
Set the compiler option /utf-8 (reference here).
Use a console that supports UTF-8 output, like Git Bash, to see the result.
On Windows wchar_t stores a UTF-16 code unit, so you have to store your string as UTF-16 (using a string literal with the L prefix). That doesn't help you here either, since the Windows console can only output characters up to 0xFFFF. See this:
How to use unicode characters in Windows command line?

utf8mb4 encode/decode in c++

A third-party server echoes strings to my client program. The strings contain both UTF-8 data and Unicode emoji (listed here).
I googled for some time and found that this is called utf8mb4 encoding, which is used in SQL applications.
I found some articles about utf8mb4 in MySQL/Python/Ruby/etc., but none for C++.
Is there any C++ library that can encode/decode utf8mb4?
MySQL calls utf8mb4 what is in truth UTF-8:
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character and supports supplementary characters.
so any library that supports UTF-8 will give you utf8mb4. In this question it was asked what solutions exist in C++ for converting to/from UTF-8: How to work with UTF-8 in C++, Conversion from other Encodings to UTF-8. The three solutions given are ICU (International Components for Unicode), Boost.Locale and C++11.
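For example, with just C++11 (a minimal sketch; note std::codecvt_utf8 was later deprecated in C++17, so compile this as C++11/14):

#include <codecvt>
#include <locale>
#include <string>

int main() {
    // "utf8mb4" data is plain UTF-8, including 4-byte sequences such as emoji.
    std::string utf8 = u8"caf\u00E9 \U0001F600";

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string codepoints = conv.from_bytes(utf8);  // decode UTF-8 to code points
    std::string roundtrip = conv.to_bytes(codepoints);  // encode back to UTF-8
    return 0;
}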

How to convert a single-byte const char* to a UTF-8 encoding

I have a function that requires me to pass a UTF-8 string via a char*, and I have a char pointer to a single-byte-encoded string. How can I convert the string to UTF-8 encoding in C++? Is there any code I can use to do this?
Thanks!
Assuming Linux, you're looking for iconv. When you open the converter (iconv_open), you pass the from and to encodings. If you pass an empty string as from, it'll convert from the locale used on your system, which should match the file system.
On Windows you have pretty much the same thing with MultiByteToWideChar, where you pass CP_ACP as the code page. But on Windows you can simply call the Unicode version of the functions to get Unicode straight away, and then convert to UTF-8 with WideCharToMultiByte and CP_UTF8.
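A sketch of that Windows two-step conversion (no error handling; AnsiToUtf8 is just an illustrative name):

#include <windows.h>
#include <cstring>
#include <string>

std::string AnsiToUtf8(const char* ansi) {
    int alen = (int)strlen(ansi);

    // Step 1: current ANSI code page (CP_ACP) -> UTF-16.
    int wlen = MultiByteToWideChar(CP_ACP, 0, ansi, alen, nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi, alen, &wide[0], wlen);

    // Step 2: UTF-16 -> UTF-8.
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, nullptr, 0, nullptr, nullptr);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, &utf8[0], ulen, nullptr, nullptr);
    return utf8;
}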
To convert a string to a different character encoding, use any of various character encoding libraries. A popular choice is iconv (the standard on most Linux systems).
However, to do this you first need to figure out the encoding of your input. Unfortunately, there is no general solution to this. If the input does not specify its encoding (as e.g. web pages generally do), you'll have to guess.
As to your question: You write that you get the string from calling readdir on a FAT32 file system. I'm not quite sure, but I believe readdir will return the file names as they are stored by the file system. In the case of FAT/FAT32:
The short file names are encoded in some DOS code page; which code page depends on how the files were written, and there's no way to tell from just the file system, AFAIK.
The long file names are in UTF-16.
If you use the standard vfat Linux kernel module to access the FAT32 partition, you should get long file names from readdir (unless a file only has an 8.3 name). These can be decoded as UTF-16. FAT32 stores the long file names in UTF-16 internally. The vfat driver will convert them to the encoding given by the iocharset= mount parameter (with the default being the default system encoding, I believe).
Additional information:
You may have to play with the mount options codepage and iocharset (see http://linux.die.net/man/8/mount) to get the filenames right on the FAT32 volume. Try to mount such that the filenames are shown correctly in a Linux console, then proceed. There is some more explanation here: http://www.nslu2-linux.org/wiki/HowTo/MountFATFileSystems
I guess the top bit is set in the single-byte string, so the function you're passing it to is expecting more than one byte.
First, print the string out in hex.
i.e.
const char* str = "your string";
for (size_t i = 0; i < strlen(str); i++)
    printf("[%02x]", (unsigned char)str[i]);
Now have a read of the Wikipedia article on UTF-8 encoding, which explains it well:
http://en.wikipedia.org/wiki/UTF-8
UTF-8 is a variable-width encoding in which each character occupies from 1 to 4 bytes.
Therefore, convert the hex to binary and see what the code point is.
i.e. if the first byte starts with 11110 (in binary) then it's the start of a 4-byte sequence. Since ASCII is 7-bit (0-127), its top bit is always zero, so an ASCII character occupies only 1 byte. The bytes following the first byte of a multi-byte UTF-8 character start with "10" in their top bits; these are the continuation bytes, and that's what your function is complaining about: the continuation bytes it expected are missing.
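To make those byte patterns concrete, here is a tiny helper (just a sketch) that reports how many bytes a sequence should have, judging from its lead byte:

// Expected length of a UTF-8 sequence, judged from its lead byte.
// Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
int utf8_seq_len(unsigned char b) {
    if (b < 0x80)           return 1;  // 0xxxxxxx: plain ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}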
So the string is not quite the plain ASCII you thought it was.
You can convert using iconv, as someone suggested, or perhaps this library: http://utfcpp.sourceforge.net/

How to convert ISO-8859-1 to UTF-8 using libiconv in C++

I'm using libcurl to fetch some HTML pages.
The HTML pages contain some character references like: סלקום
When I read this using libxml2 I'm getting: ׳₪׳¨׳˜׳ ׳¨
Is it the ISO-8859-1 encoding?
If so, how do I convert it to UTF-8 to get the correct word?
Thanks
EDIT: I got the solution, MSalters was right, libxml2 does use UTF-8.
I added this to eclipse.ini
-Dfile.encoding=utf-8
and finally I got Hebrew characters on my Eclipse console.
Thanks
Have you seen the libxml2 page on i18n? It explains how libxml2 solves these problems.
You will get a ס from libxml2. However, you said that you get something like ׳₪׳¨׳˜׳ ׳¨. Why do you think you got that? You get an xmlChar*. How did you convert that pointer into the string above? Did you perhaps use a debugger? Does that debugger know how to render an xmlChar*? My bet is that the xmlChar* is correct, but you used a debugger that cannot render the Unicode in an xmlChar*.
To answer your last question: an xmlChar* is already UTF-8 and needs no further conversion.
No. Those entities correspond to the decimal values of the Unicode code points of your characters. See this page for example.
You can therefore store your Unicode values as integers and use an algorithm to transform those integers into UTF-8 multibyte characters. See the UTF-8 specification for this.
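For instance, a sketch for one of the entities above (&#1505; is ס, U+05E1, which falls in the two-byte UTF-8 range; a full encoder would branch over the 1- to 4-byte ranges):

#include <cstdio>
#include <cstdlib>

int main() {
    const char* entity = "&#1505;";                    // ס as a decimal character reference
    unsigned long cp = strtoul(entity + 2, nullptr, 10);

    // U+0080..U+07FF encodes as two bytes: 110xxxxx 10xxxxxx.
    char utf8[3];
    utf8[0] = (char)(0xC0 | (cp >> 6));
    utf8[1] = (char)(0x80 | (cp & 0x3F));
    utf8[2] = '\0';
    printf("%s\n", utf8);                              // prints ס on a UTF-8 terminal
    return 0;
}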
This answer was given under the assumption that the encoded text is returned as UTF-16, which, as it turns out, isn't the case.
I would guess the encoding is UTF-16 or UCS-2. Specify this as the input encoding for iconv. There might also be an endianness issue; have a look here.
The C-style way would be (no error checking, for clarity):
iconv_t ic = iconv_open("UTF-8", "UCS-2");   // arguments are (to-encoding, from-encoding)
char* in = myUCS2_Text;   size_t inLeft = inputSize;
char* out = myUTF8_Text;  size_t outLeft = outputSize;
iconv(ic, &in, &inLeft, &out, &outLeft);
iconv_close(ic);