How to convert UTF-16 to UTF-8 using C++? - c++

I already know 'codecvt', 'WideCharToMultiByte', and someone.
I use korean language. For example. '안녕하세요'.
It message can insert normal string class. right?
But in my case. If i have file :: 'test.txt' {in :: '안녕하세요'}
And read 'test.txt', and getline(),
(test.txt file read)
string temp;
getline(file pointer, temp);
cout<<temp;
Now i use cout. Ta-Da! message are broken!
I know that is WideCharacter problem. so i tring MultiByteToWideChar method.
Ok. It is work well.
But i not want this.
Finally I want reading widecharcter files, and save 'string' Variable.
So, I question for you.
How to convert UTF-16 (widecharcter/wstring) to UTF-8 (multibyte/string) when 'Not change message' ?
:: I want this style
wstring temp = "안녕하세요"
string temp2 = convert_to_string(temp);
->
string temp2 = "안녕하세요"

As mentioned in the comment, you can see Convert C++ std::string to UTF-16-LE encoded string for the code on how to do the conversion.
But given you assumed you have wstring to hold your Korean string, you avoided the trouble of distinguishing UTF-16-LE and UTF-16-BE and you can readily find the Unicode code point of each Korean character in the string. So your problem boils down to find the UTF-8 representation of any code point. It would not be hard, see page 3 of https://www.rfc-editor.org/rfc/rfc3629 (also Wikipedia https://en.wikipedia.org/wiki/UTF-8).
A sample code is in
Convert Unicode code points to UTF-8 and UTF-32

Related

Print special character from utf-8 encoded string

I'm having trouble dealing with encoding in Python:
I get some strings from a csv that I open using pandas.read_csv(), they are encoded in unicode so I encode it to utf-8 doing the following
# data is from my csv
string = data.encode('utf-8')
print string
However, when I print it, i get
"Parc d'Activit\xc3\xa9s des Gravanches"
and i would like to return
"Parc d'Activités des Gravanches"
It seems like an easy issue but I'm quite new to python and did not find anything close enough to my problem.
Note: I am using Python 2.7 and my file starts with
#!/usr/bin/env python2.7
# coding: utf8
EDIT: I just say that you are using Python 2, okay, I think the answer below is still valuable though.
In Python 2 this is even more complicated and inconsistent. Here you have str and unicode, and the default str doesn't support unicode stuff.
Anyways, the situation is more or less the same, use decode instead of encode to convert from str to unicode. That should fix it.
More info at: https://pythonhosted.org/kitchen/unicode-frustrations.html
This is a common source of confusion.The issue is a bit complex, but I'll try to simplify it. I'm talking about Python 3 here, I believe there's several differences with Python 2.
There's two types of what you would call a string: str and bytes.
str is the general string type form Python, it supports unicode seamlessly in Python 3, but the way it encodes the actual data is not relevant, it's an object.
bytes is a byte array, like char* in C. It's a sequence of bytes.
Strings can be represented both ways, but you need to specify an encoding standard to translate between the two, as bytes needs to be interpreted, because it's just, again, a raw array of bytes.
encode converts a str into bytes, that's the mistake you make. Of course, if you print bytes it will just show it's raw data, AKA, the string encoded as utf-8.
decode does the opposite operation, that may be what you need.
However, if you open the file normally (open(file_name, 'r')) instead of in byte mode (open(file_name, 'b'), which I doubt you are doing, you shouldn't need to do anything, printing data should just work as you want it to.
More info at: https://docs.python.org/3/howto/unicode.html

Decoding %E6%B0%94%E6%97%8B%E5%93%88%E5%88%A9.txt to a valid string

I am trying to decode a filename*= field of content disposition header. I get a string something like:
%E6%B0%94%E6%97%8B%E5%93%88%E5%88%A9.txt
What I have figured out that replacing % to \x works fine and I get the correct file name:
气旋哈利.txt
Is there a standard way of doing this in C++? Is there any library available to decode this?
I tried
boost::replace_all(name, "%x","\\x");
std::locale::generator gen;
std::locale locl = gen.generate("en_US.utf-8");
decoded_data = boost::locale::conv::from_utf( encoded_data, locl);
But it prints the replaced string instead of chinese characters.
\xE6\xB0\x94\xE6\x97\x8B\xE5\x93\x88\xE5\x88\xA9.txt
Any Idea where am I going wrong?
Replacing escape code like "\xE6" only work in string and character literals, not generally in strings. That's because it's handled by the compiler when it compiles the program.
However, it's not very hard to do yourself, using a simple loop that check for the '%' character, gets the next two characters and convert them to a number and use that number as a "character".

Converting Hexadecimal(\x) in a string to unicode (\u)

I'm encountering a problem currently.
I'm getting a string from a url, I'm decoding this string via curl_easy_unescape and I'm getting a decoded string. So far so good.
Now is where the problem is. For example, when the url had the "counterpart" to ü inside his header, curl_easy_unescape turns the counterpart of ü in \xfc. Now my String has \xfc.
I need it as a "ü".
I need a written "ü" in my string, or I'm getting an error that my string is not utf8 formatted. And i need it inside a string. For example
"Hallü howre yoü"
with curl_easy_escape this turns into
"Hall\xfc+howre+you\xfc"
And i want to revert the \xfc into "ü"s or into "\u00fc"s
My solutions i tried have been:
changing the \x to \u00 . That would work and do the trick. But replacing doesn't work.
encoding the string in utf 8
getting the decimal value of xFC and doing char = valueofFC.
I don't have a clue, how i could resolve that issue.

How to read the … character and french accents from a text file

I am given a text file that contains a couple character per line. I have to read it, line by line, and apply a lexical analyzer on each character. Then, I write my analysis in another file.
With the following code, I have no problem reading french accents, but I realized that the character '…' (this is one character not 3 dots) is turned into a '&'.
Note: My lexical analyzer must use strings, that's why I converted back the wstring to a string.
wfstream SourceFile;
ofstream ResultFile (ResultFileName);
locale utf8_locale(std::locale(), new codecvt_utf8<wchar_t>);
SourceFile.imbue(utf8_locale);
SourceFile.open(SourceFileName);
while(getline(SourceFile, wLineBuffer))
{
string LineBuffer( wLineBuffer.begin(), wLineBuffer.end() );
...
Edit: Raymond Chen figured that the character is lost because of my conversion from wstring to string.
So the new question is now : How do I convert from a wstring to a string without transforming the characters ?
Edit: file sample
"stringééé"
"ccccccccccccccccccccccccccccccccccccccccccccccccccccccccc"
Identificateur1
Identificateur2
// Commentaire22
/**/
/*
Autre commentaire
…
*/
You need a proper Unicode support library. Forget using the broken Standard functions. They were not designed to support Unicode, don't support Unicode, and cannot be extended to support it properly. Look into using ICU or Boost.Locale or something like that.

base64 encode null terminator

Hi I am currently trying to encode a string using the base64 encoding method in C++.
The string itself encodes fine however I would like to have an extra null character at the end of the decoded string (so the null character would also show up in the text file I want to save the decoded string into).
I am using this base64 code here -> http://www.adp-gmbh.ch/cpp/common/base64.html
I hope you can give me some advices what I can do here to make this possible (I tried already writing two null characters at the end of the string I am encoding but it seems as if the encoding method only reads to the first occurence of a null character).
A cursory lookat the encoding function does not seem to show any special handling of NUL. And neither does the decoding function, are you sure the issue is not in the way that you test for NUL in the decoded string?