Converting character encoding within c++ - c++

I have a website which allows users to input usernames.
The problem is that the C++ code assumes the browser encoding is Western European and converts the string received from the username text box into Unicode to compare it with the string stored in the database.
With the right browser encoding set, the string úser is received as %FAser and converted properly to úser within the program.
However, with the browser set to UTF-8, the string is received as %C3%BAser and then converted to Ãºser, because the code treats C3 and BA as separate characters.
Is there a way to convert the example %C3%BA to ú while ensuring the right conversions are being made?

You can use the ICU library to convert between almost all usable encodings. This library also provides lots of string manipulation facilities.
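If pulling in ICU is more than you need, the two cases from the question can also be told apart by hand. The following is only a minimal sketch, not the asker's code: it percent-decodes the value, keeps the bytes if they form valid UTF-8, and otherwise assumes they are ISO-8859-1 (Latin-1) and transcodes them to UTF-8. All function names here (percentDecode, isValidUtf8, latin1ToUtf8, decodeFormValue) are made up for illustration, and the Latin-1 fallback is an assumption based on the "Western Europe" setting mentioned in the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Value of one hex digit, or -1 if c is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Turn "%C3%BAser" into raw bytes. Malformed escapes are copied through unchanged.
std::string percentDecode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            int hi = hexValue(in[i + 1]);
            int lo = hexValue(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;
                continue;
            }
        }
        out.push_back(in[i]);
    }
    return out;
}

// Strict UTF-8 check: rejects overlong forms, surrogates and values above U+10FFFF.
bool isValidUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        std::uint32_t cp = 0;
        if (b < 0x80) { extra = 0; cp = b; }
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; cp = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; }
        else return false;
        if (i + extra >= s.size()) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (extra == 2 && cp < 0x800) return false;
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        i += extra + 1;
    }
    return true;
}

// Latin-1 bytes map one-to-one onto U+0000..U+00FF, so transcoding is simple.
std::string latin1ToUtf8(const std::string& s) {
    std::string out;
    for (unsigned char b : s) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

// Hypothetical entry point: always returns UTF-8.
std::string decodeFormValue(const std::string& raw) {
    std::string bytes = percentDecode(raw);
    return isValidUtf8(bytes) ? bytes : latin1ToUtf8(bytes);
}
```

With this, both %FAser and %C3%BAser come out as the same UTF-8 bytes (C3 BA 73 65 72). Keep in mind that the check is a heuristic: a few Latin-1 strings also happen to be valid UTF-8, so declaring the charset on the page or form is more robust where that is possible. The sketch also does not translate '+' into a space.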

Related

Reading input from file with Chinese Characters that got mangled

I'm stuck trying to convert an input string in a char* to a Chinese character encoding. An application accepts a Chinese string input, e.g. "啊说到", and when it is written into a file it turns into "°¡Ëµµ½". I'm able to take this input and feed it to _mbstowcs_s_l(), but the solution needs to be locale independent, so I'm forced to use either mbstowcs() or WideCharToMultiByte(). It looks like both would only work for me if the input had already been converted from MBCS to UTF-8, which in our case it hasn't.
The project is using the Multibyte Character Set, and I'm struggling to understand what is going on. One other thing: the input comes from a different application, which stores it in a file.
The application that accepted the Chinese input is an MFC application set to the Multibyte Character Set, and the OS was set to the Chinese (Simplified) region. The UI accepts the input and places it in a CString, which is copied to a char*. This is the part where I don't know what is going on with the encoding. That application stores the string in a file, then we read it using the other application into a char*, and that is when the characters appear as "°¡Ëµµ½".
The question is: how can I turn this encoded char* "°¡Ëµµ½" back into its Chinese encoding "啊说到" without setting the locale in _mbstowcs_s_l()? The problem is that we could be reading strings from other regional settings, and the application wouldn't know which character map to use unless we tell it.

Text encoding of Protocol Buffers string fields

If a C++ program receives a Protocol Buffers message that has a Protocol Buffers string field, which is represented by a std::string, what is the encoding of text in that field? Is it UTF-8?
Protobuf strings are always valid UTF-8 strings.
See the Language Guide:
A string must always contain UTF-8 encoded or 7-bit ASCII text.
(And ASCII is always also valid UTF-8.)
Not all protobuf implementations enforce this, but if I recall correctly, at least the Python library refuses to decode non-unicode strings.

Encode gives wrong value of Japanese kanji

As a part of a scraper, I need to encode kanji to URLs, but I just can't seem to even get the correct output from a simple sign, and I'm currently blinded by everything I've tried thus far from various Stack Overflow posts.
The document is set to UTF-8.
sampleText=u'ル'
print sampleText
print sampleText.encode('utf-8')
print urllib2.quote(sampleText.encode('utf-8'))
It gives me the values:
ル
ã«
%E3%83%AB
But as far as I understand, it should give me:
ル
XX
%83%8B
What am I doing wrong? Are there some settings I don't have correct? Because as far as I understand it, my output from the encode() should not be ã«.
The code you show works correctly. The character ル is KATAKANA LETTER RU, Unicode code point U+30EB. When encoded to UTF-8, you get the Python bytestring '\xe3\x83\xab', which prints as mojibake such as ã« if your console encoding is Latin-1. When you URL-escape those three bytes, you get %E3%83%AB.
The value you seem to be expecting, %83%8B is the Shift-JIS encoding of ル, rather than UTF-8 encoding. For a long time there was no standard for how to encode non-ASCII text in a URL, and as this Wikipedia section notes, many programs simply assumed a particular encoding (often without specifying it). The newer standard of Internationalized Resource Identifiers (IRIs) however says that you should always convert Unicode text to UTF-8 bytes before performing percent encoding.
So, if you're generating your encoded string for a new program that wants to meet the current standards, stick with the UTF-8 value you're getting now. I would only use the Shift-JIS version if you need it for backwards compatibility with specific old websites or other software that expects that the data you send will have that encoding. If you have any influence over the server (or other program), see if you can update it to use IRIs too!
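To see where the three bytes come from, here is a small C++ sketch of the UTF-8 encoding step for a single code point. It is only an illustration of the bit layout, and the names encodeUtf8 and hexBytes are made up for this example.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate, not encodable
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Render bytes as space-separated uppercase hex, for inspection.
std::string hexBytes(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) out.push_back(' ');
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    return out;
}
```

Calling hexBytes(encodeUtf8(0x30EB)) gives "E3 83 AB", which matches the %E3%83%AB in the output above.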

URL encoding for multibyte character string in c++

I am trying to achieve URL encoding for some of my strings via C++. Strings can contain multibyte characters like ™, ®, ©, etc.
Input text: Something ™
Output should be: Something%20%E2%84%A2
I can achieve URL encode or decode in JS with encodeURIComponent and decodeURIComponent,
but I have some native code in c++ and hence need to encode some text via c++.
Any help here would be great relief for me.
It's not too hard to do manually if you can't find a library. First encode the string as UTF-8 (there are other posts on SO about using the standard library to do that if the string is in another encoding), then replace every byte with a value above 127, and every character that's restricted in URLs, with its percent encoding (a percent sign followed by the two hexadecimal digits of the byte's value).
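A minimal sketch of that manual approach, assuming the input string is already UTF-8. The function name percentEncode is made up for this example, and the set of characters left unescaped is the "unreserved" set from RFC 3986 (letters, digits, '-', '_', '.', '~'), which is slightly stricter than what encodeURIComponent leaves alone.

```cpp
#include <string>

// Percent-encode a UTF-8 string byte by byte.
// Everything except RFC 3986 unreserved characters is escaped.
std::string percentEncode(const std::string& utf8) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : utf8) {
        bool unreserved = (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z') ||
                          (b >= '0' && b <= '9') ||
                          b == '-' || b == '_' || b == '.' || b == '~';
        if (unreserved) {
            out.push_back(static_cast<char>(b));
        } else {
            out.push_back('%');
            out.push_back(digits[b >> 4]);    // high nibble
            out.push_back(digits[b & 0x0F]);  // low nibble
        }
    }
    return out;
}
```

For the example in the question, percentEncode("Something \xE2\x84\xA2") (the UTF-8 bytes of "Something ™") yields "Something%20%E2%84%A2".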

Unicode Woes! Ms-Access 97 migration to Ms-Access 2007

Problem is categorized in two steps:
Problem Step 1. Access 97 db containing XML strings that are encoded in UTF-8.
The problem boils down to this: the Access 97 db contains XML strings that are encoded in UTF-8. So I created a patch tool to convert the XML strings separately from UTF-8 to Unicode. To convert a UTF-8 string to Unicode, I have used the function
MultiByteToWideChar(CP_UTF8, 0, PChar(OriginalName), -1, #newName, Size); (where newName is declared as "newName : Array[0..2048] of WideChar;").
This function works well in most cases; I have checked it with Spanish and Arabic characters. But with Greek and Chinese characters it is choking.
For some Greek characters like "Ευγ. ΚαÏαβιά" (as stored in Access 97), the resulting string contains null characters in between, and when it is stored in a wide string the characters get clipped.
For some Chinese characters like "?¢»?µ?" (as stored in Access 97), the result is totally absurd, like "?¢»?µ?".
Problem Step 2. Access 97 db Text Strings, Application GUI takes unicode input and saved in Access-97
First I checked with Arabic and Spanish characters, and it seemed that no explicit character encoding is required. But again the problem comes with Greek and Chinese characters.
I tried the same function mentioned above for the text conversion (is that correct?), and the result was again disappointing. The Spanish characters, which are fine without conversion, are either lost or converted to regular ASCII letters.
The Greek and Chinese characters show similar behaviour to that mentioned in step 1.
Please guide me. Am I taking the right approach? Is there some other way around this?
Right now I am confused and full of questions :)
There is no special requirement for working with Greek characters. The real problem is that the characters were stored in an encoding that Access doesn't recognize in the first place. When the application stored the UTF-8 values in the database, it tried to convert every single byte to the equivalent byte in the database's codepage. Every character that had no correspondence in that encoding was replaced with "?". That may mean that the Greek text is OK, while the Chinese text may be gone.
In order to convert the data to something readable you have to know the codepage they are stored in. Using this you can get the actual bytes and then convert them to Unicode.
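As a rough illustration of "getting the actual bytes back", here is a C++ sketch that assumes the codepage in question is Windows-1252. That is only a guess for this example; the real database codepage has to be determined first. It maps each character of the garbled text back to the byte it came from, and the resulting bytes can then be decoded as UTF-8 (for instance with MultiByteToWideChar and CP_UTF8). The names kCp1252High and textToCp1252Bytes are made up for this example.

```cpp
#include <cstdint>
#include <string>

// Unicode code points for Windows-1252 bytes 0x80..0x9F (0 = unassigned slot).
static const std::uint32_t kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

// Recover the raw bytes behind text that was wrongly decoded as Windows-1252.
// Returns false if some character has no Windows-1252 byte, which means the
// text did not come from that codepage or data was already lost.
bool textToCp1252Bytes(const std::u32string& text, std::string& bytes) {
    bytes.clear();
    for (char32_t cp : text) {
        // ASCII and U+00A0..U+00FF map to the byte with the same value.
        if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) {
            bytes.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
            continue;
        }
        bool found = false;
        for (int i = 0; i < 32; ++i) {
            if (kCp1252High[i] != 0 && kCp1252High[i] == cp) {
                bytes.push_back(static_cast<char>(0x80 + i));
                found = true;
                break;
            }
        }
        if (!found) return false;
    }
    return true;
}
```

For example, the garbled text "Î•Ï…" maps back to the bytes CE 95 CF 85, which are the UTF-8 encoding of "Ευ". Note that the five unassigned slots (0x81, 0x8D, 0x8F, 0x90, 0x9D) cannot be recovered this way. That may be what happened to the ρ in the sample above: its UTF-8 form is CF 81, and only the "Ï" (CF) survived.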