parsing utf8 string from server response - c++

I implemented an app on a device that sends data to and receives data from a server.
Data from server would usually come in this form:
"1;username;someInteger;"
Parsing was easy: as you can imagine, I was using strtok to retrieve the individual values from that string (1, username, and someInteger).
But now the server may send me a Unicode string as the username.
I think a good idea is to have the username encoded as a UTF-8 string (am I right?). What do you recommend: how should I parse it from the above string? Which symbol should I use as a separator (e.g., instead of ";"), and which functions should I use to extract the username from the above string?
As this is an embedded device, I want to avoid installing third-party libraries on it (which might not even be possible), so more "pure" approaches would be preferable.

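The character ';' is the same in UTF-8 as it is in ASCII, because the first 128 code points (0 to 127) are encoded identically in both. In addition, every byte of a multi-byte UTF-8 sequence has its high bit set, so a ';' byte can never appear in the middle of another character. That means you can still use strtok to split on the ';'.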

The nice thing about UTF-8 is that you hardly have to do anything at all. ASCII characters still encode as the same ASCII bytes they always did, so if you just continue to use semicolon separators, nothing else needs to change.
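As a rough illustration of the point made in both answers, here is a minimal sketch of splitting such a message with only the standard library. The helper name splitFields is made up for this example, and the sketch assumes the message fits comfortably in memory on the device. Unlike strtok, it does not modify its input and it keeps empty fields.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split a message such as "1;username;someInteger;"
// on a single-byte ASCII separator. Every byte of a multi-byte UTF-8
// sequence is >= 0x80, so a byte equal to ';' (0x3B) is always a real
// semicolon and never part of another character.
std::vector<std::string> splitFields(const std::string& message,
                                     char separator = ';') {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (start < message.size()) {
        std::string::size_type end = message.find(separator, start);
        if (end == std::string::npos) {
            end = message.size();  // last field has no trailing separator
        }
        fields.push_back(message.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}
```

The username field comes back as raw UTF-8 bytes; nothing in the splitting step needs to understand the encoding.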

Related

Using Traditional Chinese with AWS DynamoDB

I have a mobile app that stores data in DynamoDB tables. A group of users in Taiwan attempted to store their names in the database, but when the data is stored it becomes garbled. I have researched this and believe it is because DynamoDB uses UTF encoding while Traditional Chinese uses Big5 text encoding. How do I set up DynamoDB so that it will store and recall the proper characters?
So you start with a string in your head. It's a sequence of Unicode characters. There's no inherent byte encoding to the characters. The same string could be encoded into bytes in a variety of ways. Big5 is one. UTF-8 is another.
When you say that Traditional Chinese uses Big5, that's not entirely true. It may be commonly encoded in Big5, but it could be in UTF-8 instead, and UTF-8 has this cool property that it can encode all Unicode character sequences. That's why it's become the standard encoding for situations where you don't want to optimize for one character set.
So your challenge is to carefully control the characters and encodings so that you're sending UTF-8 sequences to DynamoDB. The standard SDKs would do this correctly as long as you're creating the strings as basic strings in them.
You also have to make sure you're not confusing yourself when you look at the data. If you look at UTF-8 bytes but in a way where you're interpreting them as Big5 then it's going to look like gibberish, or vice versa.
You don't say how they're loading the data. If they're starting with a file, that could be the cause. You'd want to read the file in your programming language while telling it that the content is Big5; then you'll have the string version, and you can send that and rely on the SDK to correctly translate it to UTF-8 on the wire.
I remember when I first learned this stuff it was all kind of confusing. The thing to remember is that a capital A exists as an idea (and is a defined character in Unicode), and there are a whole lot of mechanisms you could use to put that letter into ones and zeros on disk. Each of those ways is an encoding. ASCII is popular but EBCDIC was another contender from the past, and UTF-16 is yet another contender now. Traditional Chinese is a character set (a set of characters) and you can encode each of those characters a bunch of ways too. It's just a question of how you map characters to bits and bytes and back again.
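To make the "character versus bytes" distinction concrete, here is a small sketch of one such mapping: turning a Unicode code point into its UTF-8 byte sequence. The function name encodeUtf8 is invented for this example, and returning an empty string for invalid input is just one possible convention.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Surrogates and values above U+10FFFF are not valid code points for
// encoding, so they yield an empty string here.
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);  // ASCII range: one byte, unchanged
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogate range is reserved for UTF-16
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this mapping, capital A (U+0041) stays the single byte 0x41, while the Traditional Chinese character U+4E2D becomes the three bytes E4 B8 AD. Another encoding such as Big5 would map the same character to different bytes.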

UTF 8 encoded Japanese string in XML

I am trying to create a SOAP call with a Japanese string. The problem I face is that when I encode this string as UTF-8, the result has many control characters in it (e.g. 0x1B (Esc)). If I remove all such control characters to make it a valid SOAP call, then the Japanese content appears as garbage on the server side.
How can I create a valid SOAP request for Japanese characters? Any suggestion is highly appreciated.
I am using C++ with MS-DOM.
With Best Regards.
If I remember correctly it's true: most of the first 32 Unicode code points are not allowed as characters in XML 1.0 documents, even escaped with &#. The exceptions are tab (U+0009), line feed (U+000A) and carriage return (U+000D). Not sure whether they're allowed in HTML or not, but certainly the server thinks they're not allowed in your requests, and it gets the only meaningful vote.
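A quick way to see whether a payload contains such characters before it goes into a request is a check like the following. It is only a sketch: hasXmlForbiddenControl is a made-up name, and it only looks at the control range below U+0020, where XML 1.0 permits nothing except tab, line feed and carriage return.

```cpp
#include <string>

// Hypothetical helper: report whether the text contains a control byte
// that XML 1.0 does not allow (anything below 0x20 other than tab,
// line feed and carriage return).
bool hasXmlForbiddenControl(const std::string& text) {
    for (unsigned char byte : text) {
        if (byte < 0x20 && byte != 0x09 && byte != 0x0A && byte != 0x0D) {
            return true;
        }
    }
    return false;
}
```

If this returns true for text that is supposed to be UTF-8 Japanese, that is a strong hint the bytes are in some other encoding, since Japanese text encoded as UTF-8 contains no ESC bytes.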
I notice that your document claims to be encoded in iso-2022-jp, not utf-8. And indeed, the sequence of characters ESC $ B that appears in your document is valid iso-2022-jp. It indicates that the data is switching encodings (from ASCII to a 2-byte Japanese encoding called JIS X 0208-1983).
But somewhere in the process of constructing your request, something has seen that 0x1B byte and interpreted it as a character U+001B, not realising that it's intended as one byte in data that's already encoded in the document encoding. So, it has XML-escaped it as a "best effort", even though that's not valid XML.
Probably, whatever is serializing your XML document doesn't know that the encoding is supposed to be iso-2022-jp. I imagine it thinks it's supposed to be serializing the document as ASCII, ISO-Latin-1, or UTF-8, and the <meta> element means nothing to it (that's an HTML way of specifying the encoding anyway, it has no particular significance in XML). But I don't know MS-DOM, so I don't know how to correct that.
If you just remove the ESC characters from iso-2022-jp data, then you conceal the fact that the data has switched encodings, and so the decoder will continue to interpret all that 7nMK stuff as ASCII, when it's supposed to be interpreted as JIS X 0208-1983. Hence, garbage.
Something else strange -- the iso-2022-jp code to switch back to ASCII is ESC ( B, but I see |(B</font> in your data, when I'd expect the same thing to happen to the second ESC character as happened to the first: &#0x1B(B</font>. Similarly, $B#M#S(B and $BL#D+(B are mangled attempts to switch from ASCII to JIS X 0208-1983 and back, and again the ESC characters have just disappeared rather than being escaped.
I have no explanation for why some ESC characters have disappeared and one has been escaped, but it cannot be coincidence that what you generate looks almost, but not quite, like valid iso-2022-jp. I think iso-2022-jp is a 7 bit encoding, so part of the problem might be that you've taken iso-2022-jp data, and run it through a function that converts ISO-Latin-1 (or some other 8 bit encoding for which the lower half matches ASCII, for example any Windows code page) to UTF-8. If so, then this function leaves 7 bit data unchanged, it won't convert it to UTF-8. Then when interpreted as UTF-8, the data has ESC characters in it.
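To illustrate why such a conversion leaves iso-2022-jp data untouched, here is a sketch of what an ISO-Latin-1 to UTF-8 converter does. The name latin1ToUtf8 is invented for this example; the real function in your code path may differ in details.

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Bytes below 0x80 are copied as-is; bytes 0x80..0xFF become two-byte
// UTF-8 sequences for U+0080..U+00FF.
std::string latin1ToUtf8(const std::string& latin1) {
    std::string utf8;
    for (unsigned char byte : latin1) {
        if (byte < 0x80) {
            utf8 += static_cast<char>(byte);
        } else {
            utf8 += static_cast<char>(0xC0 | (byte >> 6));
            utf8 += static_cast<char>(0x80 | (byte & 0x3F));
        }
    }
    return utf8;
}
```

Because iso-2022-jp uses only 7-bit bytes, every byte takes the first branch, so the ESC sequences and the two-byte JIS codes pass through unchanged and the output is still iso-2022-jp rather than UTF-8.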
If you want to send the data as UTF-8, then first of all you need to actually convert it out of iso-2022-jp (to wide characters or to UTF-8, whichever your SOAP or XML library expects). Secondly you need to label it as UTF-8, not as iso-2022-jp. Finally you need to serialize the whole document as UTF-8, although as I've said you might already be doing that.
As pointed out by Steve Jessop, it looks like you have encoded the text as iso-2022-jp, not UTF-8. So the first thing to do is to check that and ensure that you have proper UTF-8.
If the problem still persists, consider encoding the text.
The simplest option is "hex encoding" where you just write the hex value of each byte as ASCII digits. e.g. the 0x1B byte becomes "1B", i.e. 0x31, 0x42.
If you want to be fancy you could use MIME or even UUENCODE.
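A minimal sketch of the hex encoding mentioned above, using only the standard library. The name toHex is made up for this example, and the receiving side would need a matching decoder.

```cpp
#include <string>

// Hypothetical helper: write each byte as two uppercase hex digits,
// so the result contains only the ASCII characters 0-9 and A-F.
std::string toHex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char byte : bytes) {
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

The cost is that the payload doubles in size, which is the trade-off against denser schemes such as base64.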

asp-classic Request.Cookies brings this value "ϑ" for 1 cookie instead of "ÅÙÏ‘‹„‰Š„‹"

This is happening in one cookie that has keys, and in only one of its keys.
The value should be "ÅÙÏ‘‹„‰Š„‹".
The value should be "ÅÙÏ‘‹„‰Š„‹".
Erm, really? That looks like the corrupted, wrong-character set version to me! :-) Either way, “ϑ” is what you get when you save that string in Windows Western European encoding (cp1252) and then read it back in as UTF-8, removing all the ‘invalid character’ codes that result because it's not a valid UTF-8 string. So you've got a classic reading-and-writing-using-different-encodings problem.
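The "not a valid UTF-8 string" part can be checked mechanically. The following is a sketch of a strict UTF-8 validator in C++ (the thread is about ASP classic, so this only illustrates the rule, not something you would drop into that code). The name isValidUtf8 is invented for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return true if the bytes form well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// surrogates and values above U+10FFFF.
bool isValidUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        unsigned long cp = 0;
        unsigned long minCp = 0;
        if (lead < 0x80) {
            ++i;
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; minCp = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; minCp = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; minCp = 0x10000;
        } else {
            return false;  // continuation byte or invalid lead byte
        }
        if (n - i <= extra) {
            return false;  // sequence is cut off at the end of the input
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;  // expected a 10xxxxxx continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;  // overlong, out of range, or surrogate
        }
        i += extra + 1;
    }
    return true;
}
```

Text saved as cp1252 with accented letters typically fails this check, because bytes such as 0xC5 followed by 0xD9 look like a lead byte without a valid continuation byte.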
As a general rule you can't get away with putting non-ASCII characters in a cookie (name or value) directly. You'll need an application-level encoding mechanism of some sort; one of the most popular ways is to URL-encode the UTF-8 representation of the characters you want, similarly to how JavaScript's encodeURIComponent does it.
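As an illustration of that approach, here is a sketch of percent-encoding the UTF-8 bytes of a value, written in C++ rather than ASP. The name percentEncode is made up, and the set of characters left unescaped is the RFC 3986 "unreserved" set; JavaScript's encodeURIComponent additionally leaves ! * ' ( ) unescaped.

```cpp
#include <string>

// Hypothetical helper: percent-encode every byte that is not an
// RFC 3986 unreserved character (letters, digits, '-', '_', '.', '~').
// The input is assumed to already be UTF-8.
std::string percentEncode(const std::string& utf8) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : utf8) {
        const bool alnum = (byte >= 'A' && byte <= 'Z') ||
                           (byte >= 'a' && byte <= 'z') ||
                           (byte >= '0' && byte <= '9');
        if (alnum || byte == '-' || byte == '_' || byte == '.' || byte == '~') {
            out += static_cast<char>(byte);
        } else {
            out += '%';
            out += digits[byte >> 4];
            out += digits[byte & 0x0F];
        }
    }
    return out;
}
```

The result is pure ASCII, so it survives a cookie round trip regardless of which code page the server or the file happens to use.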
(Unfortunately ASP classic has very poor support for handling Unicode.)
Final Solution:
Save As different file with "correct" encoding
Changed encoding
From "Unicode (UTF-8 with signature) -Codepage 65001"
To "Western European (Windows) - Codepage 1252"
We're using encoding on our cookies and some of the resulting characters can cause problems. So what we did is take the cookie string and encode it in HEX. - Problems Solved.

Rule for handling UTF-8 characters in cookie for CGI applications?

I was told to always URL-encode a UTF-8 string before placing it in a cookie.
So when a CGI application reads this cookie, it has to URL-decode the string to get the original UTF-8 string.
Is this the right way to handle UTF-8 characters in cookies?
Is there a better way to do this?
There is no one standard scheme for encapsulating Unicode characters into a cookie.
URL-encoding the UTF-8 representation is certainly a common and sensible way of doing it, not least because it can be read easily into a Unicode string from JavaScript (using decodeURIComponent). But there's no reason you couldn't choose some other scheme if you prefer.
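For the reading side in a CGI program, the URL-decoding step might look like the following sketch. The names hexDigitValue and percentDecode are invented for this example, and leaving malformed escapes untouched is just one possible policy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: value of one hex digit, or -1 if it is not one.
int hexDigitValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: turn "%XX" escapes back into raw bytes. The
// result is the original UTF-8 byte string. Malformed escapes are
// copied through unchanged.
std::string percentDecode(const std::string& encoded) {
    std::string out;
    for (std::size_t i = 0; i < encoded.size(); ++i) {
        if (encoded[i] == '%' && i + 2 < encoded.size()) {
            const int hi = hexDigitValue(encoded[i + 1]);
            const int lo = hexDigitValue(encoded[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out += encoded[i];
    }
    return out;
}
```

Note that this sketch does not treat '+' as a space; that convention belongs to form encoding, not to encodeURIComponent-style output.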
Generally, this is the easiest way. You could use another binary-to-text encoding instead, though I'm not sure whether base64 includes reserved characters. In my view %uXXXX, where XXXX is the hex Unicode value, is most appropriate.

JSON character encoding - is UTF-8 well-supported by browsers or should I use numeric escape sequences?

I am writing a webservice that uses json to represent its resources, and I am a bit stuck thinking about the best way to encode the json. Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8. But the rfc also describes a string escaping mechanism for specifying characters. I assume this would generally be used to escape non-ascii characters, thereby making the resulting utf-8 valid ascii.
So let's say I have a json string that contains unicode characters (code-points) that are non-ascii. Should my webservice just utf-8 encode that and return it, or should it escape all those non-ascii characters and return pure ascii?
I'd like browsers to be able to execute the results using jsonp or eval. Does that affect the decision? My knowledge of various browsers' javascript support for utf-8 is lacking.
EDIT: I wanted to clarify that my main concern about how to encode the results is really about browser handling of the results. What I've read indicates that browsers may be sensitive to the encoding when using JSONP in particular. I haven't found any really good info on the subject, so I'll have to start doing some testing to see what happens. Ideally I'd like to only escape those few characters that are required and just utf-8 encode the results.
The JSON spec requires UTF-8 support by decoders. As a result, all JSON decoders can handle UTF-8 just as well as they can handle the numeric escape sequences. This is also the case for Javascript interpreters, which means JSONP will handle the UTF-8 encoded JSON as well.
The ability for JSON encoders to use the numeric escape sequences instead just offers you more choice. One reason you may choose the numeric escape sequences would be if a transport mechanism in between your encoder and the intended decoder is not binary-safe.
Another reason you may want to use numeric escape sequences is to prevent certain characters appearing in the stream, such as <, & and ", which may be interpreted as HTML sequences if the JSON code is placed without escaping into HTML or a browser wrongly interprets it as HTML. This can be a defence against HTML injection or cross-site scripting (note: some characters MUST be escaped in JSON, including " and \).
Some frameworks, including PHP's json_encode() (by default), always do the numeric escape sequences on the encoder side for any character outside of ASCII. This is a mostly unnecessary extra step intended for maximum compatibility with limited transport mechanisms and the like. However, this should not be interpreted as an indication that any JSON decoders have a problem with UTF-8.
So, I guess you just could decide which to use like this:
Just use UTF-8, unless any software you are using for storage or transport between encoder and decoder isn't binary-safe.
Otherwise, use the numeric escape sequences.
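For the second case, here is a sketch of what "use the numeric escape sequences" involves on the encoder side: decode each UTF-8 sequence to a code point and write it as \uXXXX, using a surrogate pair above U+FFFF. The function names are invented for this example, and the input is assumed to be valid UTF-8 (no validation is done).

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: append "\uXXXX" for one 16-bit code unit.
void appendUnicodeEscape(std::string& out, unsigned int unit) {
    char buf[7];  // backslash, 'u', four hex digits, terminating NUL
    std::snprintf(buf, sizeof buf, "\\u%04x", unit);
    out += buf;
}

// Hypothetical helper: escape the contents of a JSON string so that the
// output is pure ASCII. Assumes the input is valid UTF-8.
std::string jsonEscapeToAscii(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        unsigned long cp = lead;
        std::size_t extra = 0;
        if (lead >= 0xF0) { cp = lead & 0x07; extra = 3; }
        else if (lead >= 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if (lead >= 0xC0) { cp = lead & 0x1F; extra = 1; }
        for (std::size_t k = 1; k <= extra && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        i += extra + 1;
        if (cp == '"' || cp == '\\') {
            out += '\\';
            out += static_cast<char>(cp);
        } else if (cp > 0xFFFF) {
            cp -= 0x10000;  // split into a UTF-16 surrogate pair
            appendUnicodeEscape(out, static_cast<unsigned int>(0xD800 + (cp >> 10)));
            appendUnicodeEscape(out, static_cast<unsigned int>(0xDC00 + (cp & 0x3FF)));
        } else if (cp < 0x20 || cp >= 0x80) {
            appendUnicodeEscape(out, static_cast<unsigned int>(cp));
        } else {
            out += static_cast<char>(cp);
        }
    }
    return out;
}
```

The output is several times larger than the UTF-8 original for non-ASCII text, which is one reason to prefer plain UTF-8 when the transport is binary-safe.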
I had a problem there.
When I JSON-encode a string with a character like "é", every browser will return the same "é", except IE, which will return "\u00e9".
Then PHP's json_decode() will fail if it finds "é", so for Firefox, Opera, Safari and Chrome, I have to call utf8_encode() before json_decode().
Note: in my tests, IE and Firefox are using their native JSON object; other browsers are using json2.js.
ASCII isn't in it any more. Using UTF-8 encoding means that you aren't using ASCII encoding. What you should use the escaping mechanism for is what the RFC says:
All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
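Following that rule, a minimal escaper only has to touch three kinds of bytes and can pass everything else, including multi-byte UTF-8 sequences, straight through. This is a sketch in C++ with an invented name, jsonEscapeMinimal; it uses \u00XX for all control characters rather than the shorter forms such as \n.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: escape only what the JSON grammar requires inside
// a string: quotation mark, reverse solidus, and control characters
// U+0000 through U+001F. All other bytes, including UTF-8 multi-byte
// sequences, are copied unchanged.
std::string jsonEscapeMinimal(const std::string& utf8) {
    std::string out;
    for (unsigned char byte : utf8) {
        if (byte == '"' || byte == '\\') {
            out += '\\';
            out += static_cast<char>(byte);
        } else if (byte < 0x20) {
            char buf[7];  // backslash, 'u', four hex digits, terminating NUL
            std::snprintf(buf, sizeof buf, "\\u%04x",
                          static_cast<unsigned int>(byte));
            out += buf;
        } else {
            out += static_cast<char>(byte);
        }
    }
    return out;
}
```

Working byte by byte is safe here for the same reason semicolon splitting is safe: bytes inside a multi-byte UTF-8 sequence are always 0x80 or higher, so they can never be mistaken for a quote, a backslash or a control character.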
I was facing the same problem. It works for me. Please check this.
json_encode($array,JSON_UNESCAPED_UNICODE);
Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8.
FYI, RFC 4627 is no longer the official JSON spec. It was obsoleted in 2014 by RFC 7159, which was then obsoleted in 2017 by RFC 8259, which is the current spec.
RFC 8259 states:
8.1. Character Encoding
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].
Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.
Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
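On the parsing side, ignoring a byte order mark amounts to skipping the three bytes EF BB BF if they appear at the very start of the UTF-8 text. A sketch, with an invented function name:

```cpp
#include <string>

// Hypothetical helper: drop a leading UTF-8 byte order mark (the bytes
// EF BB BF, which encode U+FEFF) if present, otherwise return the text
// unchanged.
std::string stripUtf8Bom(const std::string& text) {
    if (text.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        return text.substr(3);
    }
    return text;
}
```

A parser that wants to be lenient could call this once on the raw input before handing it to the JSON decoder.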
I had a similar problem with the é char. I think the comment "it's possible that the text you're feeding it isn't UTF-8" is probably close to the mark here. I have a feeling the default collation in my instance was something else until I realized and changed it to utf8. The problem is that the data was already there, so I'm not sure whether it converted the data when I changed it; it displays fine in MySQL Workbench. The end result is that PHP will not json-encode the data and just returns false. It doesn't matter what browser you use, as it's the server causing my issue: PHP will not parse the data as utf8 if this char is present. Like I say, I'm not sure if it is due to converting the schema to utf8 after data was present or just a PHP bug. In this case use json_encode(utf8_encode($string));