C++ Conversion UTF-8 to String - c++

I have a web server developed in C++. In this web server, the data is received from the client side and stored in the database.
Some of this data is in Persian, which is converted to Unicode UTF-8 format.
For example:
the data string on the client side is "سلام",
but when the data arrives at the web server it looks like this:
"D8%B3%D9%84%D8%A7%D9%85"
I want to convert this UTF-8 code to a C++ string. How can I do this conversion?

Your string is not raw UTF-8. It uses percent-encoding, the same scheme used for HTTP URL query parameters.
A % indicates that the next two characters encode a single byte in hexadecimal. Scan the input for %; when you find one, interpret the next two characters as a hex-encoded byte. Otherwise copy the character over unchanged. The resulting bytes are the UTF-8 encoding of the text, and a std::string can hold them as-is.
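The parsing described above could be sketched roughly as follows. The names percentDecode and hexValue are made up for this illustration, and copying a malformed % sequence through literally is an assumption; a stricter server might reject such input instead.

```cpp
#include <cstddef>
#include <string>

// Returns the numeric value of a hex digit, or -1 if c is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decodes every %XX sequence into one raw byte and copies all other
// characters unchanged. A '%' that is not followed by two hex digits is
// copied literally (an assumption of this sketch).
std::string percentDecode(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            const int hi = hexValue(in[i + 1]);
            const int lo = hexValue(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out.push_back(in[i]);
    }
    return out;
}
```

Calling percentDecode("%D8%B3%D9%84%D8%A7%D9%85") yields the eight bytes D8 B3 D9 84 D8 A7 D9 85, which is "سلام" in UTF-8. If the data comes from an HTML form, a '+' may additionally stand for a space; this sketch does not handle that.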

Related

Is there a way to send a POST through WinHttpSendRequest(); with the unicode representation of characters ie: \u0420?

I need to send a UTF-8 string with Unicode characters in it to a web API from a c/c++ application, I am sending it with a Content-Type of application/json.
For example I have a string that is "щь".
Currently, what gets sent is the byte-wise representation of "щь" (each individual UTF-8 byte: D1 89 D1 8C).
However I would like it to send the Unicode number instead "\u0449\u044c"
Is there a way to do this using WinHTTPRequest?

AES encrypted password in UTF-8

My application receives a UTF-16 string as a password, which should be encrypted and then saved in the database with UTF-8 encoding. I'm taking the following steps:
1. Take the input password as a wstring (UTF-16).
2. Reinterpret this password as unsigned char * using reinterpret_cast.
3. Encrypt the step 2 password using AES_cbc_encrypt, which outputs unsigned char *.
4. Convert the step 3 output to a wstring (UTF-16).
5. Convert the wstring to UTF-8 using Poco's UnicodeConverter class, and save this UTF-8 string in the database.
Is this the correct way of saving an AES-encrypted password? Please suggest a better way if there is one.
Depending on your requirements, you might want to consider first encoding the string to UTF-8 and then encrypting it.
The advantage of this approach is that the encrypted value stored in the DB is based on a binary format that is independent of endianness.
With UTF-16 you usually need to deal with endianness when you have clients on different systems implemented in different programming languages.
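A minimal sketch of the "encode to UTF-8 first" step, written by hand so it needs no library. It takes a std::u16string rather than a wstring so that the input is UTF-16 on every platform; the names utf16ToUtf8 and appendUtf8 are hypothetical, and replacing an unpaired surrogate with U+FFFD is an assumption (for passwords you may prefer to reject such input).

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of the code point cp to out.
static void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Converts UTF-16 code units to UTF-8 bytes. Surrogate pairs are combined
// into one code point; an unpaired surrogate becomes U+FFFD (assumption).
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed as well
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

The resulting byte string is the same on little-endian and big-endian machines, so it can be handed to the encryption routine without any byte-order handling.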
I think you'd be much better off converting the encrypted password to hex digits or to Base64. This way you're guaranteed to have no weird or illegal UTF-16 symbols, nor will you have \n, \r or \t in your UTF-8. The converted text will be larger (twice the size for hex, about a third larger for Base64); hope that's not a big deal.
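As an illustration of the hex option, here is a small sketch that turns arbitrary cipher bytes into plain ASCII text. The function name toHex and the choice of lowercase digits are assumptions of this example, not part of any library mentioned above.

```cpp
#include <cstddef>
#include <string>

// Encodes arbitrary bytes as lowercase hex, two characters per input byte.
// The result contains only [0-9a-f], so it is valid ASCII and valid UTF-8.
std::string toHex(const unsigned char* data, std::size_t len) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(len * 2);
    for (std::size_t i = 0; i < len; ++i) {
        out.push_back(digits[data[i] >> 4]);    // high nibble
        out.push_back(digits[data[i] & 0x0F]);  // low nibble
    }
    return out;
}
```

The output can be stored in any text column regardless of its character set, at the cost of doubling the length.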

Converting character encoding within c++

I have a website which allows users to input usernames.
The problem here is that the C++ code assumes the browser encoding is Western European and converts the string received from the username text box into Unicode to compare it with the string stored in the database.
With the right browser encoding set, the string úser is received as %FAser and converted properly to úser within the program.
However, with the browser set to UTF-8, the string is received as %C3%BAser and then converted to Ãºser, because the code converts C3 and BA as separate characters.
Is there a way to convert the example %C3%BA to ú while ensuring the right conversions are being made?
You can use the ICU library to convert between almost all usable encodings. This library also provides lots of string manipulation facilities.
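If pulling in ICU is not an option, one possible heuristic (not something the answer above prescribes) is to check whether the bytes, after the %XX sequences have been decoded, form well-formed UTF-8, and to treat them as Latin-1 otherwise. The sketch below assumes the "Western Europe" encoding is ISO-8859-1; all function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8 (no overlong forms, no surrogates,
// nothing above U+10FFFF).
bool isValidUtf8(const std::string& s) {
    static const unsigned long minCp[] = {0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;  // number of continuation bytes expected
        unsigned long cp;
        if (c < 0x80)                { extra = 0; cp = c; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else return false;
        if (i + extra >= s.size()) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < minCp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

// Converts ISO-8859-1 bytes to UTF-8: each byte is the code point U+00XX.
std::string latin1ToUtf8(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Keeps valid UTF-8 as it is and treats everything else as Latin-1.
std::string normalizeToUtf8(const std::string& decodedBytes) {
    return isValidUtf8(decodedBytes) ? decodedBytes : latin1ToUtf8(decodedBytes);
}
```

With this, both the bytes FA 73 65 72 (from %FAser) and C3 BA 73 65 72 (from %C3%BAser) come out as the UTF-8 form of úser. The heuristic can guess wrong for Latin-1 text that happens to be valid UTF-8, so having the form declare its encoding explicitly is the more robust fix where that is possible.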

UTF 8 encoded Japanese string in XML

I am trying to create a SOAP call with Japanese string. The problem I faced is that when I encode this string to UTF8 encoded string, it has many control characters in it (e.g. 0x1B (Esc)). If I remove all such control characters to make it a valid SOAP call then the Japanese content appears as garbage on server side.
How can I create a valid SOAP request for Japanese characters? Any suggestion is highly appreciated.
I am using C++ with MS-DOM.
With Best Regards.
If I remember correctly, it's true: most of the first 32 Unicode code points (all except tab, line feed and carriage return) are not allowed as characters in XML 1.0 documents, even escaped with &#. Not sure whether they're allowed in HTML or not, but certainly the server thinks they're not allowed in your requests, and it gets the only meaningful vote.
I notice that your document claims to be encoded in iso-2022-jp, not utf-8. And indeed, the sequence of characters ESC $ B that appears in your document is valid iso-2022-jp. It indicates that the data is switching encodings (from ASCII to a 2-byte Japanese encoding called JIS X 0208-1983).
But somewhere in the process of constructing your request, something has seen that 0x1B byte and interpreted it as a character U+001B, not realising that it's intended as one byte in data that's already encoded in the document encoding. So, it has XML-escaped it as a "best effort", even though that's not valid XML.
Probably, whatever is serializing your XML document doesn't know that the encoding is supposed to be iso-2022-jp. I imagine it thinks it's supposed to be serializing the document as ASCII, ISO-Latin-1, or UTF-8, and the <meta> element means nothing to it (that's an HTML way of specifying the encoding anyway, it has no particular significance in XML). But I don't know MS-DOM, so I don't know how to correct that.
If you just remove the ESC characters from iso-2022-jp data, then you conceal the fact that the data has switched encodings, and so the decoder will continue to interpret all that 7nMK stuff as ASCII, when it's supposed to be interpreted as JIS X 0208-1983. Hence, garbage.
Something else strange -- the iso-2022-jp code to switch back to ASCII is ESC ( B, but I see |(B</font> in your data, when I'd expect the same thing to happen to the second ESC character as happened to the first: &#0x1B(B</font>. Similarly, $B#M#S(B and $BL#D+(B are mangled attempts to switch from ASCII to JIS X 0208-1983 and back, and again the ESC characters have just disappeared rather than being escaped.
I have no explanation for why some ESC characters have disappeared and one has been escaped, but it cannot be coincidence that what you generate looks almost, but not quite, like valid iso-2022-jp. I think iso-2022-jp is a 7 bit encoding, so part of the problem might be that you've taken iso-2022-jp data, and run it through a function that converts ISO-Latin-1 (or some other 8 bit encoding for which the lower half matches ASCII, for example any Windows code page) to UTF-8. If so, then this function leaves 7 bit data unchanged, it won't convert it to UTF-8. Then when interpreted as UTF-8, the data has ESC characters in it.
If you want to send the data as UTF-8, then first of all you need to actually convert it out of iso-2022-jp (to wide characters or to UTF-8, whichever your SOAP or XML library expects). Secondly you need to label it as UTF-8, not as iso-2022-jp. Finally you need to serialize the whole document as UTF-8, although as I've said you might already be doing that.
As pointed out by Steve Jessop, it looks like you have encoded the text as iso-2022-jp, not UTF-8. So the first thing to do is to check that and ensure that you have proper UTF-8.
If the problem still persists, consider encoding the text.
The simplest option is "hex encoding" where you just write the hex value of each byte as ASCII digits. e.g. the 0x1B byte becomes "1B", i.e. 0x31, 0x42.
If you want to be fancy you could use MIME or even UUENCODE.

URL encoding for multibyte character string in c++

I am trying to URL-encode some of my strings in C++. The strings can contain multibyte characters like ™, ®, ©, etc.
Input text: Something ™
Output should be: Something%20%E2%84%A2
I can achieve URL encode or decode in JS with encodeURIComponent and decodeURIComponent,
but I have some native code in c++ and hence need to encode some text via c++.
Any help here would be great relief for me.
It's not too hard to do manually if you can't find a library. First encode the string as UTF-8 (there are other posts on SO about using the standard library to do that if the string is in another encoding). Then replace every byte with a value above 127, and every byte that is restricted in URLs, with its percent encoding (a percent sign followed by the two hexadecimal digits representing the byte's value).
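A minimal sketch of that manual approach, assuming the input is already UTF-8. The name percentEncode is made up for this example, and the set of characters left untouched is the RFC 3986 "unreserved" set, which is a choice of this sketch.

```cpp
#include <string>

// Percent-encodes every byte except the RFC 3986 unreserved characters
// (A-Z a-z 0-9 - . _ ~). The input is assumed to be UTF-8 already.
std::string percentEncode(const std::string& utf8) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(utf8.size() * 3);
    for (unsigned char c : utf8) {
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back('%');
            out.push_back(digits[c >> 4]);    // high nibble
            out.push_back(digits[c & 0x0F]);  // low nibble
        }
    }
    return out;
}
```

For the input "Something ™" (with ™ stored as the UTF-8 bytes E2 84 A2) this produces Something%20%E2%84%A2. Note that it is slightly stricter than JavaScript's encodeURIComponent, which also leaves ! ' ( ) * unescaped.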