Converting hexadecimal (\x) in a string to Unicode (\u) - C++

I'm currently encountering a problem.
I'm getting a string from a URL and decoding it with curl_easy_unescape, which gives me a decoded string. So far so good.
Here is the problem: when the URL had the percent-encoded counterpart of ü inside its header, curl_easy_unescape turns that counterpart into \xfc. Now my string contains \xfc.
I need it as a "ü".
I need a literal "ü" in my string, otherwise I get an error that my string is not UTF-8 encoded. And I need it inside a string. For example,
"Hallü howre yoü"
with curl_easy_escape turns into
"Hall\xfc+howre+yo\xfc"
and I want to turn the \xfc back into "ü"s, or into "\u00fc"s.
The solutions I have tried:
changing the \x to \u00. That would work and do the trick, but the replacing doesn't work.
encoding the string as UTF-8 (see the sketch below for what I mean by this).
getting the decimal value of 0xFC and doing char = valueofFC.
I don't have a clue how to resolve this issue.
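A minimal sketch of the second idea (encoding the string as UTF-8), assuming the bytes coming out of curl_easy_unescape are Latin-1 / ISO-8859-1, which is what \xfc for ü suggests; the helper name latin1_to_utf8 is illustrative only:

    #include <string>

    // Sketch under the assumption that the decoded string is Latin-1:
    // every byte >= 0x80 is re-encoded as the two-byte UTF-8 sequence
    // for that code point, e.g. 0xFC ('ü') becomes 0xC3 0xBC.
    std::string latin1_to_utf8(const std::string& in)
    {
        std::string out;
        out.reserve(in.size() * 2);
        for (unsigned char c : in) {
            if (c < 0x80) {
                out += static_cast<char>(c);                 // ASCII passes through
            } else {
                out += static_cast<char>(0xC0 | (c >> 6));   // lead byte
                out += static_cast<char>(0x80 | (c & 0x3F)); // continuation byte
            }
        }
        return out;
    }

Under that assumption, latin1_to_utf8("Hall\xfc howre yo\xfc") yields a string whose bytes are the valid UTF-8 encoding of "Hallü howre yoü".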

Related

How to convert UTF-16 to UTF-8 using C++?

I already know about 'codecvt' and 'WideCharToMultiByte', among others.
I use the Korean language, for example '안녕하세요'.
That message can be stored in a normal string class, right?
But in my case I have a file 'test.txt' that contains '안녕하세요',
and I read it with getline():
string temp;
getline(file, temp);
cout << temp;
Now I use cout. Ta-da! The message comes out broken!
I know this is a wide-character problem, so I tried the MultiByteToWideChar method.
OK, it works well.
But that is not what I want.
In the end I want to read wide-character files and store the contents in a 'string' variable.
So my question for you is:
how do I convert UTF-16 (wide character / wstring) to UTF-8 (multibyte / string) without changing the message?
This is the style I want:
wstring temp = L"안녕하세요";
string temp2 = convert_to_string(temp);
// temp2 now holds "안녕하세요"
As mentioned in the comment, see Convert C++ std::string to UTF-16-LE encoded string for code showing how to do the conversion.
But given that you have a wstring holding your Korean string, you have avoided the trouble of distinguishing UTF-16-LE from UTF-16-BE, and you can readily find the Unicode code point of each Korean character in the string. So your problem boils down to finding the UTF-8 representation of any code point. That is not hard; see page 3 of https://www.rfc-editor.org/rfc/rfc3629 (also Wikipedia: https://en.wikipedia.org/wiki/UTF-8).
Sample code is in
Convert Unicode code points to UTF-8 and UTF-32
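A minimal sketch of that conversion, assuming wchar_t holds UTF-16 code units (as on Windows); the helper name utf16_to_utf8 is illustrative, and the byte patterns follow the table in RFC 3629:

    #include <string>

    // Sketch: convert a UTF-16 wstring to a UTF-8 std::string.
    std::string utf16_to_utf8(const std::wstring& in)
    {
        std::string out;
        for (std::size_t i = 0; i < in.size(); ++i) {
            char32_t cp = in[i];
            // Combine a surrogate pair into a single code point.
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
                char32_t lo = in[i + 1];
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                    ++i;
                }
            }
            // Emit the UTF-8 bytes for the code point (RFC 3629, page 3).
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }

Used in the style of the question, wstring temp = L"안녕하세요"; string temp2 = utf16_to_utf8(temp); leaves temp2 holding the UTF-8 bytes of "안녕하세요", which cout can print on a UTF-8 terminal.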

Decoding %E6%B0%94%E6%97%8B%E5%93%88%E5%88%A9.txt to a valid string

I am trying to decode the filename*= field of a Content-Disposition header. I get a string something like:
%E6%B0%94%E6%97%8B%E5%93%88%E5%88%A9.txt
What I have figured out is that replacing % with \x works fine and I get the correct file name:
气旋哈利.txt
Is there a standard way of doing this in C++? Is there any library available to decode this?
I tried
boost::replace_all(name, "%", "\\x");
boost::locale::generator gen;
std::locale locl = gen.generate("en_US.utf-8");
decoded_data = boost::locale::conv::from_utf(encoded_data, locl);
But it prints the replaced string instead of the Chinese characters:
\xE6\xB0\x94\xE6\x97\x8B\xE5\x93\x88\xE5\x88\xA9.txt
Any idea where I am going wrong?
Escape codes like "\xE6" only work in string and character literals, not in strings in general. That's because they are handled by the compiler when it compiles the program.
However, it's not very hard to do it yourself, using a simple loop that checks for the '%' character, takes the next two characters, converts them to a number, and uses that number as a "character".
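A minimal sketch of that loop (illustrative only; it handles just the %XX case and copies every other character through unchanged):

    #include <string>

    // Sketch: replace every "%XX" with the byte it encodes. The resulting
    // bytes are already UTF-8, so the string can be written out as-is.
    std::string percent_decode(const std::string& in)
    {
        auto hex = [](char c) -> int {
            if (c >= '0' && c <= '9') return c - '0';
            if (c >= 'A' && c <= 'F') return c - 'A' + 10;
            if (c >= 'a' && c <= 'f') return c - 'a' + 10;
            return -1;
        };

        std::string out;
        for (std::size_t i = 0; i < in.size(); ++i) {
            if (in[i] == '%' && i + 2 < in.size()
                && hex(in[i + 1]) >= 0 && hex(in[i + 2]) >= 0) {
                out += static_cast<char>(hex(in[i + 1]) * 16 + hex(in[i + 2]));
                i += 2;                  // skip the two hex digits
            } else {
                out += in[i];            // ordinary character, copy through
            }
        }
        return out;
    }

percent_decode("%E6%B0%94%E6%97%8B%E5%93%88%E5%88%A9.txt") then yields the UTF-8 bytes of 气旋哈利.txt, with no locale machinery involved.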

base64 encode null terminator

Hi, I am currently trying to encode a string using the base64 encoding method in C++.
The string itself encodes fine; however, I would like to have an extra null character at the end of the decoded string (so the null character would also show up in the text file I want to save the decoded string into).
I am using this base64 code here -> http://www.adp-gmbh.ch/cpp/common/base64.html
I hope you can give me some advice on what I can do here to make this possible (I already tried writing two null characters at the end of the string I am encoding, but it seems the encoding method only reads up to the first occurrence of a null character).
A cursory look at the encoding function does not seem to show any special handling of NUL, and neither does the decoding function. Are you sure the issue is not in the way that you test for NUL in the decoded string?
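A small, self-contained illustration of where an embedded NUL typically gets lost before the encoder even sees it; this is only an assumed cause, not code from the linked page:

    #include <cstring>
    #include <iostream>
    #include <string>

    int main()
    {
        // The trailing '\0' survives only if the length is given explicitly;
        // the const char* constructor stops at the first NUL.
        std::string with_nul("hello\0", 6);  // size() == 6, last byte is NUL
        std::string from_cstr = "hello\0";   // size() == 5, NUL is dropped

        std::cout << with_nul.size() << ' ' << from_cstr.size() << '\n'; // 6 5

        // If the buffer is later handed to an encoder via c_str()/strlen(),
        // the NUL is lost again:
        std::cout << std::strlen(with_nul.c_str()) << '\n';              // 5
        return 0;
    }

If the encode function from the linked page is called with a pointer and an explicit length, passing data() and size() of such a string (rather than c_str() and strlen()) should keep the trailing NUL in the encoded output.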

PHP: How to get rid of � symbol inside text?

I can't figure out how to remove this � symbol from a string.
The string is in UTF-8 format.
What should I do? :(
This removes the whole string:
preg_replace('/\W/','',utf8_decode(substr(utf8_encode($ad['description']),0,125)))
Thanks ;)
Update:
Using:
header('Content-Type: text/html; charset=utf-8');
After the replacement I call exit() right away.
U+FFFD REPLACEMENT CHARACTER is used when a character has no representation in the current charset encoding. Declare your encodings properly as UTF-8 and use UTF-8 strings, and it will not show up on most platforms.
The problem here is that your string is not actually in UTF-8 format. You pretend it is and handle the data accordingly, but the string probably contains ANSI characters. It is not enough to send the charset=utf-8 header; your content needs to be converted to UTF-8 before it is sent as well.
You could try utf8_decode('string'); or utf8_encode('string');
but you should really try to find the actual problem: make sure the headers are set correctly, the document type is right, and the text is encoded in the right format when it is saved.

Unicode Woes! MS Access 97 migration to MS Access 2007

The problem is categorized in two steps:
Problem Step 1. The Access 97 db contains XML strings that are encoded in UTF-8.
The problem boils down to this: the Access 97 db contains XML strings that are encoded in UTF-8, so I created a patch tool to convert the XML strings from UTF-8 to Unicode separately. In order to convert a UTF-8 string to Unicode, I have used the function
MultiByteToWideChar(CP_UTF8, 0, PChar(OriginalName), -1, @newName, Size); (where newName is declared as "newName : Array[0..2048] of WideChar;").
This function works well in most cases; I have checked it with Spanish and Arabic characters. But when I work on Greek and Chinese characters it chokes.
For some Greek characters like "Ευγ. ΚαÏαβιά" (as stored in Access 97), the resulting new string contains null characters in between, and when it is stored to a wide string the characters get clipped.
For some Chinese characters like "?¢»?µ?" (as stored in Access 97), the result is totally absurd, like "?¢»?µ?".
Problem Step 2. Access 97 db text strings: the application GUI takes Unicode input and saves it in Access 97.
First I checked with Arabic and Spanish characters, and it seemed that no explicit character encoding was required. But again the problem comes with Greek and Chinese characters.
I tried the same function mentioned above for the text conversion (is that correct?), and the result was again disappointing. The Spanish characters, which are fine without conversion, get their Unicode characters either lost or converted to plain ASCII letters.
The Greek and Chinese characters show the same behaviour as mentioned in Step 1.
Please guide me. Am I taking the right approach? Is there some other way around this?
Right now I am confused and full of questions :)
There is no special requirement for working with Greek characters. The real problem is that the characters were stored in an encoding that Access doesn't recognize in the first place. When the application stored the UTF-8 values in the database, it tried to convert every single byte to the equivalent byte in the database's codepage. Every character that had no correspondence in that encoding was replaced with '?'. That may mean that the Greek text is OK, while the Chinese text may be gone.
In order to convert the data to something readable, you have to know the codepage it is stored in. Using that, you can get the actual bytes and then convert them to Unicode.
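A minimal sketch of that recovery path using the Win32 API; the source codepage is an assumption you have to supply (1253 for Greek Windows is only an example), and the helper names are illustrative:

    #include <windows.h>
    #include <string>

    // Sketch: reinterpret the raw bytes using the codepage they were actually
    // stored in, producing a proper UTF-16 string.
    std::wstring bytes_to_wide(const std::string& raw, UINT codepage /* e.g. 1253 */)
    {
        int len = MultiByteToWideChar(codepage, 0, raw.data(),
                                      static_cast<int>(raw.size()), nullptr, 0);
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(codepage, 0, raw.data(),
                            static_cast<int>(raw.size()), &wide[0], len);
        return wide;
    }

    // Sketch: re-encode the UTF-16 string as UTF-8 if that is what the
    // target database or XML needs.
    std::string wide_to_utf8(const std::wstring& wide)
    {
        int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                                      static_cast<int>(wide.size()),
                                      nullptr, 0, nullptr, nullptr);
        std::string utf8(len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                            static_cast<int>(wide.size()),
                            &utf8[0], len, nullptr, nullptr);
        return utf8;
    }

Note that this can only recover characters that survived the original round-trip; bytes that Access already replaced with '?' are gone for good.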