utf-8 encoding a std::string? - c++

I use a drawing API which takes a const char* to a UTF-8 encoded string. Doing myStdString.c_str() does not work; the API fails to draw the string.
For example:
std::string someText = "■□▢▣▤▥▦▧▨▩▪▫▬▭▮▯▰▱";
will render ???????????????
So how do I get the std::string to act properly?
Thanks

Try using std::codecvt_utf8 to write the string to a stringstream and then pass the result (stringstream::str) to the API.

There are so many variables here it's hard to know where to begin.
First, verify that the API really and truly supports UTF-8 input, and that it doesn't need some special setup or O/S support to do so. In particular make sure it's using a font with full Unicode support.
Your compiler will be responsible for converting the source file into a string. It probably does not do UTF-8 encoding by default, and may not have an option to do so no matter what. In that case you may have to declare the string as a std::wstring and convert it to UTF-8 from there. Alternatively you can look up each character beyond the first 128 and encode them as hex values in the string, but that's a hassle and makes for an unreadable source.
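The hex-escape approach mentioned above can be sketched roughly as follows. This is only an illustration, assuming the API wants plain UTF-8 bytes: kSquares and count_code_points are made-up names, and the byte values are the UTF-8 encodings of U+25A0 and U+25A1 (the first two characters of the question's string).

```cpp
#include <cstddef>
#include <string>

// "■□" (U+25A0, U+25A1) spelled out as UTF-8 bytes, so the result does not
// depend on how the compiler interprets the source file's encoding.
const std::string kSquares = "\xE2\x96\xA0" "\xE2\x96\xA1";

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Useful as a sanity check that the string holds what you expect.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

kSquares.c_str() could then be handed to the drawing call. If the output is still question marks, the font or the API setup is the more likely culprit.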

Related

Reading input from file with Chinese Characters that got mangled

I'm stuck trying to convert an input string in a char* to Chinese character encoding. An application accepts a Chinese string input, e.g. "啊说到", and when it is written into a file it turns into "°¡Ëµµ½". I'm able to take this input and feed it to _mbstowcs_s_l(), but the solution needs to be locale independent, so I'm forced to use either mbstowcs() or WideCharToMultiByte(). It looks like both would work for me only if the input had already gone through an MBCS to UTF-8 conversion, which in our case it hasn't.
The project is using Multibyte Character Set, and I'm struggling to understand what is going on. One other thing: the input comes from a different application, which stores it in a file.
The application that accepted the Chinese input is an MFC application set to Multibyte Character Set, and the OS was set to regional Chinese Simplified. The UI accepts the input and places it in a CString, which is copied to a char*. This is the part where I don't know what's going on with the encoding: this application stores it in a file, then we read it into a char* using the other application, and that's when the characters turn into "°¡Ëµµ½".
The question is, how can I turn this encoded char* "°¡Ëµµ½" back into its Chinese encoding "啊说到" without setting the locale in _mbstowcs_s_l()? The problem is that we could be reading strings from other regional settings, and the application wouldn't know which character map to use unless we tell it.
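For what it is worth, the mangled text is consistent with GBK/GB2312 bytes simply being displayed as Latin-1: the byte sequence B0 A1 CB B5 B5 BD, read one byte per character, is exactly "°¡Ëµµ½". The sketch below only demonstrates that reading; latin1_to_utf8 is a made-up helper, and the assumption that the file holds GBK bytes is mine, not something the post confirms.

```cpp
#include <string>

// Re-encodes a Latin-1 (ISO-8859-1) byte string as UTF-8. Each input byte
// is treated as the code point with the same numeric value.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            // Code points U+0080..U+00FF need two UTF-8 bytes.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

If that reading is right, the bytes in the file are intact and only their interpretation is wrong, so the reader still has to be told (or has to store alongside the text) which code page the writer used.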

Print special character from utf-8 encoded string

I'm having trouble dealing with encoding in Python:
I get some strings from a csv that I open using pandas.read_csv(), they are encoded in unicode so I encode it to utf-8 doing the following
# data is from my csv
string = data.encode('utf-8')
print string
However, when I print it, I get
"Parc d'Activit\xc3\xa9s des Gravanches"
and I would like to get
"Parc d'Activités des Gravanches"
It seems like an easy issue but I'm quite new to python and did not find anything close enough to my problem.
Note: I am using Python 2.7 and my file starts with
#!/usr/bin/env python2.7
# coding: utf8
EDIT: I just saw that you are using Python 2. Okay, I think the answer below is still valuable though.
In Python 2 this is even more complicated and inconsistent. Here you have str and unicode, and the default str doesn't support unicode stuff.
Anyways, the situation is more or less the same, use decode instead of encode to convert from str to unicode. That should fix it.
More info at: https://pythonhosted.org/kitchen/unicode-frustrations.html
This is a common source of confusion. The issue is a bit complex, but I'll try to simplify it. I'm talking about Python 3 here; I believe there are several differences from Python 2.
There's two types of what you would call a string: str and bytes.
str is the general string type in Python. It supports Unicode seamlessly in Python 3, and the way it stores the actual data is not relevant; it's an object.
bytes is a byte array, like char* in C. It's a sequence of bytes.
Strings can be represented both ways, but you need to specify an encoding standard to translate between the two, as bytes needs to be interpreted, because it's just, again, a raw array of bytes.
encode converts a str into bytes; that's the mistake you are making. Of course, if you print bytes it will just show its raw data, i.e. the string encoded as UTF-8.
decode does the opposite operation, that may be what you need.
However, if you open the file normally (open(file_name, 'r')) instead of in byte mode (open(file_name, 'rb')), which I doubt you are doing, you shouldn't need to do anything; printing data should just work as you want it to.
More info at: https://docs.python.org/3/howto/unicode.html

what locale does wstring support?

In my program I used wstring to print out the text I needed, but it gave me random ciphers (the kind caused by a different encoding scheme). For example, I have this block of code:
wstring text;
text.append(L"Some text");
Then I use DirectX to render it on screen. I used to use wchar_t, but I heard it has portability problems, so I switched to wstring. wchar_t worked fine, but from what I can tell it only accepted English characters (the printout just ignored any non-English characters entered), which was fine, until I switched to wstring: I only got random ciphers that looked like Chinese and Korean mixed together. Interestingly, my computer's locale for non-Unicode text is Chinese. Based on what I saw, I suspected that it would render Chinese characters correctly, so I tried, and it does display the character correctly but with a square in front (which is still an incorrect display). I then guessed the encoding might depend on the language locale, so I switched the locale to English (US) (I use Win8), restarted, and saw that my Chinese test character in the source file had become some random stuff (my file is not saved in a Unicode format, since all the text is English). Then I tried with English characters, but no luck: the display seemed exactly the same and had nothing to do with the locale. I don't understand why it doesn't display correctly and looks like Asian characters (even when I use the English locale).
Is there some conversion that should be done, or should I save my file in a different encoding? The problem is that I want to display English characters correctly, which is the default.
In the absence of code that demonstrates your problem, I will give you a correspondingly general answer.
You are trying to display English characters, but see Chinese characters. That is what happens when you pass 8 bit ANSI text to an API that receives UTF-16 text. Look for somewhere in your program where you cast from char* to wchar_t*.
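A rough sketch of why that happens, assuming a little-endian UTF-16 consumer as on Windows (misread_as_utf16le and widen_ascii are made-up names for illustration): two 8-bit characters get fused into one 16-bit code unit, and for ASCII letters the result usually lands in the CJK range U+4E00..U+9FFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Shows what a UTF-16LE consumer sees when handed 8-bit text through a
// pointer cast: every pair of bytes is read as one 16-bit code unit.
std::vector<std::uint16_t> misread_as_utf16le(const std::string& narrow) {
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < narrow.size(); i += 2) {
        const auto lo = static_cast<unsigned char>(narrow[i]);
        const auto hi = static_cast<unsigned char>(narrow[i + 1]);
        units.push_back(static_cast<std::uint16_t>(lo | (hi << 8)));
    }
    return units;
}

// Widens ASCII-only text properly, one character per wide character.
// Non-ASCII input would need a real conversion routine instead.
std::wstring widen_ascii(const std::string& narrow) {
    return std::wstring(narrow.begin(), narrow.end());
}
```

For "Some text", the first misread unit is 0x6F53 ('S' is 0x53, 'o' is 0x6F), which is a CJK ideograph, whereas widen_ascii gives the expected L"Some text".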
First of all, what type of file are you trying to store the text in? Normal txt files are stored in ANSI by default (so does Excel). So when you try to write a Unicode character to an ANSI file, it will print junk. There are two ways of overcoming this problem:
try to open the file in UTF-8 or UTF-16 mode and then write
convert Unicode to ANSI before writing to the file. If you are using Windows, MSDN documents a specific API for Unicode to ANSI conversion and vice versa. If you are using Linux, Google for conversion of Unicode to ANSI. There are lots of solutions out there.
Hope this helps!!!
std::wstring does not have any locale/internationalisation support at all. It is just a container for storing sequences of wchar_t.
The problem with wchar_t is that its encoding is unspecified. It might be Unicode UTF-16, or Unicode UTF-32, or Shift-JIS, or something completely different. There is no way to tell from within a program.
You will have the best chances of getting things to work if you ensure that the encoding of your source code is the same as the encoding used by the locale under which the program will run.
But, the use of third-party libraries (like DirectX) can place additional constraints due to possible limitations in what encodings those libraries expect and support.
Bug solved: it turned out to be a CASTING problem (not a rendering problem, as I previously said).
The bugged text is an intermediate product of an internal conversion process using wstringstream (which I forgot to mention). The code is as follows:
wstringstream wss;
wstring text;
text.append(L"some text");
wss << timer->getTime();
text.append(wss.str());
Right after this process the debugger shows the text as a bunch of random stuff but later somehow it converts back so it's readable. But the problem appears at rendering stage using DirectX. I somehow left the casting for wchar_t*, which results in the incorrect rendering.
old:
LPCWSTR lpcwstrText = (LPCWSTR)textToDraw->getText();
new:
LPCWSTR lpcwstrText = (*textToDraw->getText()).c_str();
Changing that solves the problem.
So this was caused by a bad cast, as some kind people pointed out when correcting my earlier statement.
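The difference between the old and new lines can be sketched portably like this. TextItem and text_for_api are hypothetical stand-ins; the only thing taken from the post is that getText() returns a pointer to a std::wstring, and const wchar_t* plays the role of LPCWSTR.

```cpp
#include <string>

// Hypothetical stand-in for the object in the post: getText() hands back a
// pointer to a std::wstring, not a pointer to characters.
struct TextItem {
    std::wstring text;
    const std::wstring* getText() const { return &text; }
};

// Correct: ask the string for its character buffer.
// A C-style cast such as (const wchar_t*)item.getText() would compile, but
// it would point at the std::wstring object itself, not at its characters.
const wchar_t* text_for_api(const TextItem& item) {
    return item.getText()->c_str();
}
```

The C-style cast silences the compiler error that would otherwise have flagged the mismatch, which is why the bug only showed up at render time.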

How to convert ISO-8859-1 to UTF-8 using libiconv in C++

I'm using libcurl to fetch some HTML pages.
The HTML pages contain some character references like: סלקום
When I read this using libxml2 I'm getting: ׳₪׳¨׳˜׳ ׳¨
Is it the ISO-8859-1 encoding?
If so, how do I convert it to UTF-8 to get the correct word?
Thanks
EDIT: I got the solution, MSalters was right, libxml2 does use UTF-8.
I added this to eclipse.ini
-Dfile.encoding=utf-8
and finally I got Hebrew characters on my Eclipse console.
Thanks
Have you seen the libxml2 page on i18n? It explains how libxml2 solves these problems.
You will get a ס from libxml2. However, you said that you get something like ׳₪׳¨׳˜׳ ׳¨. Why do you think that you got that? You get an xmlChar*. How did you convert that pointer into the string above? Did you perhaps use a debugger? Does that debugger know how to render an xmlChar*? My bet is that the xmlChar* is correct, but you used a debugger that cannot render the Unicode in an xmlChar*.
To answer your last question, an xmlChar* is already UTF-8 and needs no further conversion.
No. Those entities correspond to the decimal value of the Unicode code point of each of your characters. See this page for example.
You can therefore store your Unicode values as integers and use an algorithm to transform those integers into UTF-8 multibyte characters. See the UTF-8 specification for this.
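A minimal sketch of such an algorithm, assuming the code points have already been parsed out of the character references (encode_utf8 is an illustrative name, not part of libxml2 or libiconv). For example, decimal 1505 is U+05E1, the letter ס, and encodes to the two bytes D7 A1.

```cpp
#include <cstdint>
#include <string>

// Encodes one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate, invalid
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

As the other answer notes, libxml2 already returns UTF-8, so a routine like this should only be needed when the numeric values are handled by hand.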
This answer was given on the assumption that the encoded text is returned as UTF-16, which, as it turns out, isn't the case.
I would guess the encoding is UTF-16 or UCS2. Specify this as input for iconv. There might also be an endian issue, have a look here
The c-style way would be (no checking for clarity):
iconv_t ic = iconv_open("UTF-8", "UCS-2"); /* iconv_open(tocode, fromcode) */
iconv(ic, &myUCS2_Text, &inputSize, &myUTF8_Text, &outputSize); /* iconv takes pointers and advances them */
iconv_close(ic);

Rule for handling UTF-8 characters in cookie for CGI applications?

I was told to always URL-encode a UTF-8 string before placing it in a cookie.
So when a CGI application reads this cookie, it has to URL-decode the string to get the original UTF-8 string.
Is this the right way to handle UTF-8 characters in cookies?
Is there a better way to do this?
There is no one standard scheme for encapsulating Unicode characters into a cookie.
URL-encoding the UTF-8 representation is certainly a common and sensible way of doing it, not least because it can be read easily into a Unicode string from JavaScript (using decodeURIComponent). But there's no reason you couldn't choose some other scheme if you prefer.
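A rough sketch of that scheme on the CGI side, assuming the string is already UTF-8 and that escaping everything outside the RFC 3986 unreserved set is acceptable. percent_encode and percent_decode are illustrative names, not functions from any CGI library.

```cpp
#include <cstddef>
#include <string>

// True for the RFC 3986 "unreserved" characters, which are left as-is.
bool is_unreserved(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '_' || c == '.' || c == '~';
}

// Percent-encodes every byte that is not unreserved.
std::string percent_encode(const std::string& utf8) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : utf8) {
        if (is_unreserved(c)) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        }
    }
    return out;
}

// Returns 0..15 for a hex digit, or -1 otherwise.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Reverses percent_encode. Malformed escapes are copied through unchanged.
std::string percent_decode(const std::string& encoded) {
    std::string out;
    for (std::size_t i = 0; i < encoded.size(); ++i) {
        if (encoded[i] == '%' && i + 2 < encoded.size()) {
            const int hi = hex_value(encoded[i + 1]);
            const int lo = hex_value(encoded[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;
                continue;
            }
        }
        out += encoded[i];
    }
    return out;
}
```

The output of percent_encode is the same form that decodeURIComponent reads back on the JavaScript side.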
Generally, this is the easiest way. You could use another binary encoding such as base64, though I'm not sure whether base64 output includes reserved characters. %uXXXX, where XXXX is the hex Unicode value, is most appropriate.