What is the equivalent of mb_convert_encoding in GCC? - c++

Currently I do the following in PHP to convert from UTF-8 to ASCII, with the non-ASCII characters HTML-encoded. For instance, the German ß character would be converted to &szlig;.
$sData = mb_convert_encoding($sData, 'HTML-ENTITIES','UTF-8');
In GCC C++ on Linux, what is the equivalent way to do this?
Actually, an alternative form will also do: instead of &szlig;, the &#xNNN; format is fine.
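There is no single standard-library call for this in C++; a minimal hand-rolled sketch for GCC on Linux (it assumes the input is valid UTF-8, and the function name utf8_to_entities is my own, not an existing API):
#include <cstdio>
#include <string>

// Decode each UTF-8 sequence to a code point and emit &#xNNNN; for anything beyond ASCII.
std::string utf8_to_entities(const std::string& in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ) {
        unsigned char c = in[i];
        unsigned cp;
        int len;
        if (c < 0x80)      { cp = c;        len = 1; }
        else if (c < 0xE0) { cp = c & 0x1F; len = 2; }
        else if (c < 0xF0) { cp = c & 0x0F; len = 3; }
        else               { cp = c & 0x07; len = 4; }
        for (int k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "&#x%X;", cp);
            out += buf;
        }
    }
    return out;
}
For example, utf8_to_entities("\xC3\x9F") returns "&#xDF;", the numeric-entity form of ß; iconv or ICU can do the UTF-8 decoding step for you if you prefer not to do it by hand.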

Related

Playing cards Unicode printing in C++

According to this wiki link, the playing cards have Unicode code points of the form U+1F0A1.
I wanted to create an array in C++ to store the 52 standard playing cards, but I noticed that these code points do not fit in 2 bytes.
So my simple example below does not work; how do I store a Unicode character that needs more than 2 bytes?
wchar_t t = '\u1f0a1';
printf("%lc",t);
The above code truncates t to \u1f0a.
How do I store a Unicode character that needs more than 2 bytes?
You can use char32_t with the U prefix, but there's no portable way to print it to the console. Besides, you may not need char32_t at all: the card-suit symbols sit in the Basic Multilingual Plane and fit in a single wchar_t, e.g. wchar_t t = L'\u2660'; (the L prefix makes it a wide-character literal).
If you are using Windows with the Visual C++ compiler, I recommend the following:
Save your source file with UTF-8 encoding.
Set the compiler option /utf-8 (reference here).
Use a console that supports UTF-8, such as Git Bash, to see the result.
On Windows, wchar_t stores a UTF-16 code unit, so you would have to store the character as a surrogate pair in a string literal (with the L prefix). That doesn't help you either, since the Windows console can only output characters up to 0xFFFF. See this:
How to use unicode characters in Windows command line?
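On Linux with GCC, the simplest route is to hold the code point in a char32_t and print a UTF-8 narrow string; a minimal sketch (it assumes GCC's default UTF-8 execution charset and a UTF-8 terminal, so the Windows console caveats above still apply there):
#include <cstdio>

int main() {
    // U+1F0A1 does not fit in 16 bits, so use char32_t with \U and 8 hex digits.
    char32_t ace = U'\U0001F0A1';
    // For printing, a plain narrow string literal is encoded as UTF-8 by GCC by default.
    const char* aceUtf8 = "\U0001F0A1";
    std::printf("code point 0x%X renders as %s\n", static_cast<unsigned>(ace), aceUtf8);
}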

Encode gives wrong value of Japanese kanji

As part of a scraper, I need to encode kanji into URLs, but I can't even get the correct output for a single character, and I'm lost after everything I've tried so far from various Stack Overflow posts.
The document is set to UTF-8.
sampleText=u'ル'
print sampleText
print sampleText.encode('utf-8')
print urllib2.quote(sampleText.encode('utf-8'))
It gives me the values:
ル
ル
%E3%83%AB
But as far as I understand, it should give me:
ル
XX
%83%8B
What am I doing wrong? Are there some settings I haven't configured correctly? Because as far as I understand it, the output of encode() should not be ル.
The code you show works correctly. The character ル is KATAKANA LETTER RU, Unicode code point U+30EB. When encoded to UTF-8, you get the Python bytestring '\xe3\x83\xab', which prints back out as ル when your console encoding is UTF-8. When you URL-escape those three bytes, you get %E3%83%AB.
The value you seem to be expecting, %83%8B, is the Shift-JIS encoding of ル rather than the UTF-8 encoding. For a long time there was no standard for how to encode non-ASCII text in a URL, and as this Wikipedia section notes, many programs simply assumed a particular encoding (often without specifying it). The newer standard of Internationalized Resource Identifiers (IRIs), however, says that you should always convert Unicode text to UTF-8 bytes before percent-encoding.
So, if you're generating your encoded string for a new program that wants to meet the current standards, stick with the UTF-8 value you're getting now. I would only use the Shift-JIS version if you need it for backwards compatibility with specific old websites or other software that expects that the data you send will have that encoding. If you have any influence over the server (or other program), see if you can update it to use IRIs too!
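For completeness in C++ (the language of the surrounding questions), the same UTF-8 percent-encoding is easy to reproduce; a minimal sketch, where percent_encode is my own helper name:
#include <cctype>
#include <cstdio>
#include <string>

// Percent-encode every byte outside the RFC 3986 unreserved set.
std::string percent_encode(const std::string& utf8) {
    std::string out;
    for (unsigned char c : utf8) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X", c);
            out += buf;
        }
    }
    return out;
}
Here percent_encode("\xE3\x83\xAB") yields "%E3%83%AB", matching the urllib2.quote output above.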

How to use extended Unicode characters in C++ in Visual Studio?

We are using a Korean font and the FreeType library to display a Korean character, but it displays some other characters instead of the correct glyph.
Code:
std::wstring text3 = L"놈";
Are there any tricks to typing Korean characters?
For maximum portability, I'd suggest avoiding encoding Unicode characters directly in your source code and using \u escape sequences instead. The character 놈 is Unicode code point U+B188, so you could write this as:
std::wstring text3 = L"\uB188";
The real question is what the encoding of your source file is.
It is likely UTF-8, which is one more reason not to use wstring; use a regular string. For more information on my way of handling characters, see http://utf8everywhere.org.
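A short sketch contrasting the two suggestions (with MSVC you will likely also want the /utf-8 switch mentioned in the playing-card answer above, so the narrow literal really is stored as UTF-8):
#include <string>

// Option 1: ASCII-only source plus a \u escape in a wide string.
std::wstring wide = L"\uB188";   // U+B188, the character 놈

// Option 2: keep the text as UTF-8 in a regular std::string (the utf8everywhere.org approach).
std::string utf8 = "\uB188";     // bytes EB 86 88 when the execution charset is UTF-8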

Multi-Byte to Widechar conversion using mbsnrtowcs

I'm trying to convert a multi-byte (UTF-8) string to a wide-character string, and mbsnrtowcs always fails. Here are the input and expected strings:
char* pInputMultiByteString = "A quick brown Fox jumps \xC2\xA9 over the lazy Dog.";
wchar_t* pExpectedWideString = L"A quick brown Fox jumps \x00A9 over the lazy Dog.";
The special character is the copyright symbol.
This conversion works fine when I use the Windows MultiByteToWideChar routine, but since that API is not available on Linux, I have to use mbsnrtowcs, which keeps failing. I've tried other characters as well and it always fails. The only exception is that a purely ASCII input string converts fine. What am I doing wrong?
UTF is not itself a multibyte string (although it is true that Unicode characters may be represented using more than one byte). A multibyte string is a string that uses a certain codepage to represent characters, some of which take more than one byte.
Since you are combining ANSI characters and Unicode characters, you should use UTF-8.
So converting that text to wchar_t (which on Windows is UTF-16 and on Linux is UTF-32) with mbsnrtowcs will not work unless the locale is set up for UTF-8.
If you use UTF-8, you should look into a Unicode handling library for this. For most tasks I recommend UTF8-CPP from http://utfcpp.sourceforge.net/
You can read more about Unicode and UTF-8 on Wikipedia.
MultiByteToWideChar has a parameter where you specify the code page, but mbsnrtowcs doesn't. On Linux, have you set LC_CTYPE in your locale to specify UTF-8?
SOLUTION: By default every C program starts in the "C" locale, so I had to call setlocale(LC_CTYPE, ""). The empty string means the environment's locale is used (en_US.utf8 in my case), and the conversion worked.
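Putting that fix together, a minimal sketch for GCC on Linux using the standard sibling mbsrtowcs (error handling kept to one check):
#include <clocale>
#include <cwchar>

int main() {
    std::setlocale(LC_CTYPE, "");   // adopt the environment's locale, e.g. en_US.utf8
    const char* pInput = "A quick brown Fox jumps \xC2\xA9 over the lazy Dog.";
    wchar_t wide[128];
    std::mbstate_t state = std::mbstate_t();
    const char* src = pInput;
    size_t n = std::mbsrtowcs(wide, &src, 128, &state);
    if (n != static_cast<size_t>(-1))        // (size_t)-1 signals an invalid multibyte sequence
        std::wprintf(L"%ls\n", wide);
    return 0;
}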

How to convert ISO-8859-1 to UTF-8 using libiconv in C++

I'm using libcurl to fetch some HTML pages.
The HTML pages contain some character references like: &#1505;&#1500;&#1511;&#1493;&#1501;
When I read this using libxml2 I'm getting: ׳₪׳¨׳˜׳ ׳¨
Is it the ISO-8859-1 encoding?
If so, how do I convert it to UTF-8 to get the correct word?
Thanks
EDIT: I got the solution. MSalters was right: libxml2 does use UTF-8.
I added this to eclipse.ini
-Dfile.encoding=utf-8
and finally I got Hebrew characters on my Eclipse console.
Thanks
Have you seen the libxml2 page on i18n? It explains how libxml2 solves these problems.
You will get a ס from libxml2. However, you said that you get something like ׳₪׳¨׳˜׳ ׳¨. Why do you think you got that? You get an xmlChar*; how did you convert that pointer into the string above? Did you perhaps use a debugger? Does that debugger know how to render an xmlChar*? My bet is that the xmlChar* is correct, but you used a debugger that cannot render the Unicode in an xmlChar*.
To answer your last question: an xmlChar* is already UTF-8 and needs no further conversion.
No. Those entities correspond to the decimal values of the Unicode code points of your characters. See this page for an example.
You can therefore store your Unicode values as integers and use an algorithm to transform those integers into UTF-8 multibyte sequences. See the UTF-8 specification for this.
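A minimal sketch of that transformation (codepoint_to_utf8 is my own name; it assumes a valid code point below U+110000):
#include <string>

// Turn one Unicode code point (e.g. 1505 from a decimal entity) into its UTF-8 bytes.
std::string codepoint_to_utf8(unsigned cp) {
    std::string s;
    if (cp < 0x80) {
        s += static_cast<char>(cp);
    } else if (cp < 0x800) {
        s += static_cast<char>(0xC0 | (cp >> 6));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        s += static_cast<char>(0xE0 | (cp >> 12));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (cp >> 18));
        s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return s;
}
For example, codepoint_to_utf8(1505) returns the bytes 0xD7 0xA1, i.e. the letter ס.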
This answer was given on the assumption that the encoded text is returned as UTF-16, which, as it turns out, isn't the case.
I would guess the encoding is UTF-16 or UCS-2. Specify this as the input encoding for iconv. There might also be an endianness issue; have a look here.
The C-style way would be (no error checking, for clarity):
iconv_t ic = iconv_open("UTF-8", "UCS-2");   // to-encoding first, from-encoding second
char* pIn = reinterpret_cast<char*>(myUCS2_Text);
char* pOut = myUTF8_Text;
iconv(ic, &pIn, &inputSize, &pOut, &outputSize);   // inputSize/outputSize are the remaining byte counts (size_t)
iconv_close(ic);