My application receives a UTF-16 string as a password, which should be saved, after encryption, in the database with UTF-8 encoding. I'm taking the following steps:
1. Take the input password as a wstring (UTF-16).
2. Reinterpret this password as unsigned char * using reinterpret_cast.
3. Encrypt the step-2 buffer using AES_cbc_encrypt, which produces unsigned char * output.
4. Convert the step-3 output to a wstring (UTF-16).
5. Convert the wstring to UTF-8 using Poco's UnicodeConverter class, and save this UTF-8 string in the database.
Is this the correct way of saving an AES-encrypted password? Please suggest a better way if there is one.
Depending on your requirements, you might want to consider first encoding the string to UTF-8 and then encrypting it.
The advantage of this approach is that the encrypted value stored in the DB is based on a binary format that is independent of endianness.
With UTF-16 you usually need to deal with endianness when you have clients on different systems implemented in different programming languages.
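A minimal sketch of that order of operations, reusing AES_cbc_encrypt from the question and Poco::UnicodeConverter (which the question already depends on) on Windows, where std::wstring holds UTF-16; key and IV management and error handling are deliberately left out:

    #include <algorithm>
    #include <string>
    #include <vector>
    #include <openssl/aes.h>
    #include <Poco/UnicodeConverter.h>

    // Convert the UTF-16 password to UTF-8 first, then run AES-CBC over the UTF-8 bytes.
    std::vector<unsigned char> encryptPassword(const std::wstring& password,
                                               const unsigned char key[32],
                                               unsigned char iv[AES_BLOCK_SIZE])
    {
        // 1. UTF-16 -> UTF-8: an endianness-independent byte sequence
        std::string utf8;
        Poco::UnicodeConverter::toUTF8(password, utf8);

        // 2. Pad to a multiple of the AES block size (PKCS#7-style)
        std::size_t padded = (utf8.size() / AES_BLOCK_SIZE + 1) * AES_BLOCK_SIZE;
        std::vector<unsigned char> in(padded,
            static_cast<unsigned char>(padded - utf8.size()));
        std::copy(utf8.begin(), utf8.end(), in.begin());

        // 3. Encrypt the UTF-8 bytes; the result is raw binary, not text
        AES_KEY aesKey;
        AES_set_encrypt_key(key, 256, &aesKey);
        std::vector<unsigned char> out(padded);
        AES_cbc_encrypt(in.data(), out.data(), padded, &aesKey, iv, AES_ENCRYPT);
        return out;
    }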
I think you'd be much better off converting the encrypted password to hex digits or to Base64 encoding. This way you're guaranteed to have no weird or illegal UTF-16 symbols, nor will you have \n, \r or \t in your UTF-8. The converted text will be somewhat larger; hope that's not a big deal.
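For example, with Poco (already a dependency in the question) the raw ciphertext bytes can be turned into plain ASCII before being stored; a sketch, assuming the ciphertext is held in a std::vector<unsigned char>:

    #include <iomanip>
    #include <sstream>
    #include <string>
    #include <vector>
    #include <Poco/Base64Encoder.h>

    // Base64: compact, reversible, and free of control or non-ASCII characters.
    // (Poco's encoder may wrap long output lines; a short ciphertext fits on one line.)
    std::string toBase64(const std::vector<unsigned char>& cipher)
    {
        std::ostringstream oss;
        Poco::Base64Encoder encoder(oss);
        encoder.write(reinterpret_cast<const char*>(cipher.data()),
                      static_cast<std::streamsize>(cipher.size()));
        encoder.close();
        return oss.str();
    }

    // Hex digits: about twice the size of the input, but trivial to inspect.
    std::string toHex(const std::vector<unsigned char>& cipher)
    {
        std::ostringstream oss;
        oss << std::hex << std::setfill('0');
        for (unsigned char b : cipher)
            oss << std::setw(2) << static_cast<int>(b);
        return oss.str();
    }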
I have a mobile app that stores data in DynamoDB tables. There is a group of users in Taiwan who attempted to store their names in the database. When the data is stored it becomes garbled. I have researched this and see that it is because DynamoDB uses UTF encoding while Traditional Chinese uses Big5 text encoding. How do I set up DynamoDB so that it will store and recall the proper characters?
So you start with a string in your head. It's a sequence of Unicode characters. There's no inherent byte encoding to the characters. The same string could be encoded into bytes in a variety of ways. Big5 is one. UTF-8 is another.
When you say that Traditional Chinese uses Big5, that's not entirely true. It may be commonly encoded in Big5, but it could be in UTF-8 instead, and UTF-8 has this cool property that it can encode all Unicode character sequences. That's why it's become the standard encoding for situations where you don't want to optimize for one character set.
So your challenge is to make sure you carefully control the characters and encodings so that you're sending UTF-8 sequences to DynamoDB. The standard SDKs will do this correctly as long as you're creating the strings as ordinary strings in them.
You also have to make sure you're not confusing yourself when you look at the data. If you look at UTF-8 bytes but interpret them as Big5, it's going to look like gibberish, and vice versa.
You don't say how they're loading the data. If they're starting with a file, that could be the problem. You'd want to read the file while telling your language that it's Big5; then you'll have the string version, and then you can send the string version and rely on the SDK to correctly translate it to UTF-8 on the wire.
I remember when I first learned this stuff it was all kind of confusing. The thing to remember is that a capital A exists as an idea (and is a defined character in Unicode), and there are a whole lot of mechanisms you could use to put that letter into ones and zeros on disk. Each of those ways is an encoding. ASCII is popular, but EBCDIC was another contender from the past, and UTF-16 is yet another contender now. Traditional Chinese is a character set (a set of characters), and you can encode each of those characters in a bunch of ways too. It's just a question of how you map characters to bits and bytes and back again.
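If the data does start out in a Big5-encoded file, the decode-then-re-encode step could look like the sketch below, which uses ICU purely as an example converter; the function name is illustrative and the DynamoDB put call itself is omitted:

    #include <string>
    #include <unicode/unistr.h>  // ICU

    // Interpret raw bytes as Big5, then re-encode as UTF-8 before handing the
    // value to the DynamoDB SDK as an ordinary string. Error checking omitted.
    std::string big5ToUtf8(const std::string& big5Bytes)
    {
        icu::UnicodeString unicode(big5Bytes.data(),
                                   static_cast<int32_t>(big5Bytes.size()),
                                   "Big5");        // the "this is Big5" step
        std::string utf8;
        unicode.toUTF8String(utf8);                // UTF-8 for the wire
        return utf8;
    }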
WinAPI uses UTF-16LE encoding, so if I call some WinAPI function that returns a string, it will be UTF-16LE encoded.
So I'm thinking of using UTF-16LE encoding for strings in my program, and when it's time to send the data over the network, I convert it to UTF-8, and on the other side I convert it back to UTF-16LE. This way there is less data to send.
Is there a reason why I shouldn't do that?
With UTF-8 encoding, you'll use:
1 byte for ASCII chars
2 bytes for Unicode chars between U+0080 and U+07FF
more bytes if necessary
So, if your text is in a western language, in most cases it will probably be shorter in UTF-8 than in UTF-16LE encoding: the western alphabets are encoded between U+0000 and U+0590.
Conversely, if your text is Asian, then the UTF-8 encoding might significantly inflate your data. The Asian character sets are beyond U+07FF and hence require at least 3 bytes per character.
In the UTF-8 Everywhere article you can find some (basic) statistics about the length of text in each encoding, as well as other arguments supporting the use of UTF-8.
One that comes to mind for networking is that the UTF-8 representation is the same on all platforms, whereas with UTF-16 you have LE and BE variants, depending on the OS and CPU architecture.
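If you do convert only at the network boundary, a minimal sketch using the Win32 conversion functions; since UTF-8 has no byte-order variants, no BOM or endianness negotiation is needed on the wire:

    #include <string>
    #include <windows.h>

    // UTF-16LE (wchar_t on Windows) -> UTF-8 for sending
    std::string toUtf8(const std::wstring& utf16)
    {
        if (utf16.empty()) return std::string();
        int size = WideCharToMultiByte(CP_UTF8, 0, utf16.data(),
                                       static_cast<int>(utf16.size()),
                                       nullptr, 0, nullptr, nullptr);
        std::string utf8(size, '\0');
        WideCharToMultiByte(CP_UTF8, 0, utf16.data(), static_cast<int>(utf16.size()),
                            &utf8[0], size, nullptr, nullptr);
        return utf8;
    }

    // UTF-8 -> UTF-16LE after receiving
    std::wstring toUtf16(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        int size = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                       static_cast<int>(utf8.size()), nullptr, 0);
        std::wstring utf16(size, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), static_cast<int>(utf8.size()),
                            &utf16[0], size);
        return utf16;
    }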
Is there any way to detect std::string encoding?
My problem: I have external web services which give data in different encodings. I also have a library which parses that data and stores it in std::string. Then I want to display the data in a Qt GUI. The problem is that std::string can have different encodings. Some strings can be converted using QString::fromAscii(), some using QString::fromUtf8().
I haven't looked into it in depth, but I did use some Qt 3.3 in the past.
ASCII vs Unicode + UTF-8
UTF-8 is 8-bit, ASCII is 7-bit. I guess you can try to look at the values in the string's byte array:
http://doc.qt.digia.com/3.3/qstring.html#ascii and http://doc.qt.digia.com/3.3/qstring.html#utf8
It seems ascii() returns an 8-bit ASCII representation of the string; still, I think it should only contain values from 0 to 127 or thereabouts. You would have to check several characters in the string.
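A sketch of that byte-value check; it only tells you whether a conversion is even needed, it cannot prove that non-ASCII bytes are UTF-8 rather than some other 8-bit encoding:

    #include <string>

    // True if every byte is 0x00-0x7F, i.e. plain 7-bit ASCII.
    bool isPlainAscii(const std::string& s)
    {
        for (unsigned char c : s)
            if (c > 0x7F)
                return false;
        return true;
    }

With that you could call QString::fromAscii() for plain-ASCII input and fall back to QString::fromUtf8() otherwise, though a heuristic like this cannot reliably distinguish UTF-8 from other 8-bit encodings.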
I have a website which allows users to input usernames.
The problem here is that the C++ code assumes the browser encoding is Western European and converts the string received from the username text box into Unicode to compare with the string stored in the database.
With the right browser encoding set, the string úser is received as %FAser and converted properly to úser within the program.
However, with the browser encoding set to UTF-8, the string is received as %C3%BAser and then converted to Ãºser, because the code converts C3 and BA as separate characters.
Is there a way to convert the example %C3%BA to ú while ensuring the right conversions are being made?
You can use the ICU library to convert between almost all usable encodings. This library also provides lots of string manipulation facilities.
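A sketch of that ICU-based approach, assuming the percent-decoded bytes are UTF-8 (as the browser sent them) and the rest of the program expects ISO-8859-1, matching the "Western Europe" path; the encoding names are illustrative:

    #include <string>
    #include <unicode/unistr.h>  // ICU

    // "%C3%BA" percent-decodes to the bytes 0xC3 0xBA, which are UTF-8 for U+00FA;
    // converting that to ISO-8859-1 yields the single byte 0xFA ("ú").
    std::string utf8ToLatin1(const std::string& utf8Bytes)
    {
        if (utf8Bytes.empty()) return std::string();

        icu::UnicodeString unicode = icu::UnicodeString::fromUTF8(utf8Bytes);

        std::string latin1(static_cast<std::size_t>(unicode.length()) * 4, '\0');
        int32_t written = unicode.extract(0, unicode.length(),
                                          &latin1[0],
                                          static_cast<uint32_t>(latin1.size()),
                                          "ISO-8859-1");
        latin1.resize(static_cast<std::size_t>(written));
        return latin1;
    }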
I was told to always URL-encode a UTF-8 string before placing it in a cookie.
So when a CGI application reads this cookie, it has to URL-decode the string to get the original UTF-8 string.
Is this the right way to handle UTF-8 characters in cookies?
Is there a better way to do this?
There is no one standard scheme for encapsulating Unicode characters into a cookie.
URL-encoding the UTF-8 representation is certainly a common and sensible way of doing it, not least because it can be read easily into a Unicode string from JavaScript (using decodeURIComponent). But there's no reason you couldn't choose some other scheme if you prefer.
Generally, this is the easiest way. You could use another binary encoding; I'm not sure whether Base64 includes reserved characters... %uXXXX, where XXXX is the hex Unicode value, is most appropriate.
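As a concrete example of the URL-encoding approach in C++, a sketch using Poco (which appears elsewhere on this page); the reserved-character set passed to the encoder is illustrative, so adjust it to whatever your cookie parser treats as special:

    #include <string>
    #include <Poco/URI.h>

    // Percent-encode the UTF-8 value before putting it in a Set-Cookie header.
    std::string encodeCookieValue(const std::string& utf8Value)
    {
        std::string encoded;
        Poco::URI::encode(utf8Value, ";,= ", encoded);   // e.g. UTF-8 "úser" -> "%C3%BAser"
        return encoded;
    }

    // Reverse it when the CGI application reads the cookie back.
    std::string decodeCookieValue(const std::string& encodedValue)
    {
        std::string utf8;
        Poco::URI::decode(encodedValue, utf8);
        return utf8;
    }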