I have a mobile app that stores data in DynamoDB tables. A group of users in Taiwan attempted to store their names in the database, but when the data is stored it becomes garbled. I have researched this and see that it is because DynamoDB uses UTF encoding while Traditional Chinese uses Big5 text encoding. How do I set up DynamoDB so that it will store and recall the proper characters?
So you start with a string in your head. It's a sequence of Unicode characters. There's no inherent byte encoding to the characters. The same string could be encoded into bytes in a variety of ways. Big5 is one. UTF-8 is another.
When you say that Traditional Chinese uses Big5, that's not entirely true. It may be commonly encoded in Big5, but it could be in UTF-8 instead, and UTF-8 has this cool property that it can encode all Unicode character sequences. That's why it's become the standard encoding for situations where you don't want to optimize for one character set.
So your challenge is to make sure you carefully control the characters and encodings so that you're sending UTF-8 sequences to DynamoDB. The standard SDKs will do this correctly as long as you're creating the strings as ordinary strings in them.
You also have to make sure you're not confusing yourself when you look at the data. If you look at UTF-8 bytes but in a way where you're interpreting them as Big5 then it's going to look like gibberish, or vice versa.
You don't say how the users are loading the data. If they're starting with a file, that could be the problem. You'd want to read the file while telling your language it's Big5; then you'll have the string version, and then you can send the string version and rely on the SDK to correctly translate it to UTF-8 on the wire.
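For example, here is a minimal sketch of that flow in Python using boto3 (the AWS SDK for Python); the file name, table name, key, and attribute names are placeholders, not anything from your app:

import boto3

with open("names.txt", "rb") as f:
    raw = f.read()                       # raw Big5 bytes from the user's file
name = raw.decode("big5")                # now a plain Unicode string, no encoding attached

table = boto3.resource("dynamodb").Table("Users")
table.put_item(Item={"userId": "123", "name": name})   # the SDK sends UTF-8 on the wire
item = table.get_item(Key={"userId": "123"})["Item"]
print(item["name"])                      # the original characters come back intact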
I remember when I first learned this stuff it was all kind of confusing. The thing to remember is that a capital A exists as an idea (and is a defined character in Unicode), and there are a whole lot of mechanisms you could use to put that letter into ones and zeros on disk. Each of those mechanisms is an encoding. ASCII is popular, but EBCDIC was another contender from the past, and UTF-16 is yet another contender now. Traditional Chinese is a character set (a set of characters), and you can encode each of those characters a bunch of ways too. It's just a question of how you map characters to bits and bytes and back again.
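A quick way to see the difference between a character and its encodings, sketched in Python (the byte values shown are simply what these standard encodings produce):

print("A".encode("ascii"))      # b'A' - one byte
print("A".encode("utf-16-le"))  # b'A\x00' - two bytes for the same character
print("中".encode("big5"))      # two bytes under Big5
print("中".encode("utf-8"))     # b'\xe4\xb8\xad' - three bytes under UTF-8

Same character, different bytes; the bytes only mean something once you know which encoding produced them.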
I am building a language analysis program. I have a program which counts the words in a text and gives the ratio of every word in the text as output, but it cannot work on a file containing Urdu text. How can I make it work?
Encoding
Urdu may be presented in two¹ forms: Unicode and Code Page 868. This is convenient to you because the two ranges do not overlap. It is inconvenient because the Unicode code range is U+0600 – U+06FF, which means encoding is an issue:
CP-868 will encode each one as a single-byte value in the range 128–252
UTF-8 will encode each one as a two-byte sequence with bits 110x xxxx and 10xx xxxx
UTF-16 encodes every character as two-byte entities
UTF-32 encodes every character as four-byte entities
This means that you should be aware of encoding issues, and for an easy life, use UTF-16 internally (std::u16string), and accept files as (default) UTF-8 / CP-868, or as UTF-16/32 if there is a BOM indicating such.
Your other option is to simply require all input to be UTF-8 / CP-868.
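To make the byte widths above concrete, here is a small Python illustration using ARABIC LETTER ALEF (U+0627) as a stand-in for any letter in the U+0600–U+06FF block (the answer's code is C++; Python is used here only for brevity):

alef = "\u0627"
print(alef.encode("utf-8"))      # b'\xd8\xa7' - two bytes: 1101 1000, 1010 0111
print(alef.encode("utf-16-le"))  # 27 06 - two bytes per character
print(alef.encode("utf-32-le"))  # 27 06 00 00 - four bytes per character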
¹ AFAIK. There may be other ways of storing Urdu text.
Three forms. See comments below.
Word separation
As you know, the end of a word is generally marked with a special letter form.
So, all you need is a table of end-of-word letters listing letters in both the CP-868 range and the Unicode Arabic text range.
Then, every time you find a space or a letter in that table you know you have found the end of a word.
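A sketch of that scanning loop, in Python for brevity (END_OF_WORD is a hypothetical set you would fill with the end-of-word letters from both the CP-868 range and the Unicode range):

END_OF_WORD = set()   # fill with end-of-word letters from your table

def words(text):
    current = []
    for ch in text:
        current.append(ch)
        if ch.isspace() or ch in END_OF_WORD:
            word = "".join(current).strip()
            if word:
                yield word
            current = []
    tail = "".join(current).strip()
    if tail:
        yield tail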
Histogram
As you read words, store them in a histogram. For C++ a map <u16string, size_t> will do. The actual content of each word does not matter.
After that you have all the information necessary to print stats about the text.
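The same histogram idea sketched in Python, reusing the words() generator above (in C++ the map <u16string, size_t> plays the role of Counter):

from collections import Counter

text = open("input.txt", encoding="utf-8").read()   # or decode CP-868 input yourself
histogram = Counter(words(text))                     # word -> occurrence count
total = sum(histogram.values())
for word, count in histogram.most_common():
    print(word, count, count / total)                # word, count, ratio of the whole text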
Edit
The approach presented above is designed to be simple at the cost of some correctness. If you are doing something for the workplace, for example, and assuming it matters, you should also consider:
Normalizing word forms
For example, the same word may be presented in standard Arabic text codes or using the Urdu-specific codes. If you do not convert to the Urdu equivalent characters then you will have two words that should compare equal but do not.
Use something internally consistent. I recommend UZT, as it is the most complete Urdu text representation. You will also need an additional lookup for the original text representation from the UZT representation.
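One hedged way to do the character-level part of this in Python is a translation table from the generic Arabic letters to their Urdu-preferred equivalents; the two pairs below are common examples, a real table would be larger, and the UZT route suggested above is an alternative:

ARABIC_TO_URDU = str.maketrans({
    "\u064A": "\u06CC",   # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",   # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

def normalize(word):
    return word.translate(ARABIC_TO_URDU)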
Dictionaries
Get as complete a dictionary of Urdu words as you can (as an unordered_set <u16string>).
This is how it is done with languages like Japanese, for example, to find breaks between words.
Then use the dictionary to find all the words you can, and fall back on letterform recognition and/or spaces for what remains.
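A sketch of dictionary-driven segmentation using greedy longest match, in Python for brevity; dictionary is the word set described above and max_len the length of its longest entry. The fallback here is a single character, where you would instead apply the letterform or space rules:

def segment(text, dictionary, max_len):
    i, result = 0, []
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                result.append(text[i:j])   # longest dictionary word starting at i
                i = j
                break
        else:
            result.append(text[i])         # no dictionary word starts here
            i += 1
    return result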
As a part of a scraper, I need to encode kanji to URLs, but I just can't seem to even get the correct output from a simple sign, and I'm currently blinded by everything I've tried thus far from various Stack Overflow posts.
The document is set to UTF-8.
sampleText=u'ル'
print sampleText
print sampleText.encode('utf-8')
print urllib2.quote(sampleText.encode('utf-8'))
It gives me the values:
ル
ル
%E3%83%AB
But as far as I understand, it should give me:
ル
XX
%83%8B
What am I doing wrong? Are there some settings I don't have correct? Because as far as I understand it, my output from the encode() should not be ル.
The code you show works correctly. The character ル is KATAKANA LETTER RU, and is Unicode codepoint U+30EB. When encoded to UTF-8, you'll get the Python bytestring '\xe3\x83\xab', which prints out as ル if your console encoding is Latin-1. When you URL-escape those three bytes, you get %E3%83%AB.
The value you seem to be expecting, %83%8B is the Shift-JIS encoding of ル, rather than UTF-8 encoding. For a long time there was no standard for how to encode non-ASCII text in a URL, and as this Wikipedia section notes, many programs simply assumed a particular encoding (often without specifying it). The newer standard of Internationalized Resource Identifiers (IRIs) however says that you should always convert Unicode text to UTF-8 bytes before performing percent encoding.
So, if you're generating your encoded string for a new program that wants to meet the current standards, stick with the UTF-8 value you're getting now. I would only use the Shift-JIS version if you need it for backwards compatibility with specific old websites or other software that expects that the data you send will have that encoding. If you have any influence over the server (or other program), see if you can update it to use IRIs too!
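For comparison, here are both forms side by side in Python 3 (your snippet is Python 2); urllib.parse.quote percent-encodes text as UTF-8 by default, and you only get the Shift-JIS form by encoding the bytes yourself first:

from urllib.parse import quote

ru = "\u30EB"                         # KATAKANA LETTER RU
print(quote(ru))                      # %E3%83%AB - UTF-8, the IRI-friendly form
print(quote(ru.encode("shift_jis")))  # %83%8B    - the legacy Shift-JIS form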
I have implemented an app on a device which deals with sending and receiving data from a server.
Data from server would usually come in this form:
"1;username;someInteger;"
Parsing was easy, and I was using strtok as you can imagine to retrieve individual values from that string such as: 1, username, and someInteger.
But now a situation may occur where the server will send me a Unicode string as the username.
I think a good idea is to use the username encoded as a UTF-8 string (am I right?). What do you recommend: how should I parse it from the above string? What symbol should I use as a separator, for example (instead of ";"), and which functions should I use to extract the username from the above string?
As this is an embedded device, I want to avoid installing third-party libraries there (which might not even be possible), so more "pure" ways would be more desirable.
The character ';' is the same in UTF-8 as it is in ASCII, because the first 128 characters in both encodings are the same. That means you can still use strtok to split on the ';'.
The nice thing about UTF-8 is that you hardly have to do anything at all. ASCII characters still encode as the same ASCII bytes they always would, and the bytes of a multi-byte sequence are never in the ASCII range, so if you just continue to use semicolon separators you don't have to change anything.
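To see it concretely (shown in Python rather than C, since it is quick to demonstrate; the record contents here are made up), splitting the raw UTF-8 bytes on ';' cannot cut a character in half:

record = "1;田中;42;".encode("utf-8")
print(record.split(b";"))                     # [b'1', b'\xe7\x94\xb0\xe4\xb8\xad', b'42', b'']
print(record.split(b";")[1].decode("utf-8"))  # 田中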
I am trying to create a SOAP call with a Japanese string. The problem I faced is that when I encode this string as UTF-8, it has many control characters in it (e.g. 0x1B (Esc)). If I remove all such control characters to make it a valid SOAP call, then the Japanese content appears as garbage on the server side.
How can I create a valid SOAP request for Japanese characters? Any suggestion is highly appreciated.
I am using C++ with MS-DOM.
With Best Regards.
If I remember correctly, the first 32 Unicode code points are not allowed as characters in XML documents, even escaped with &#. I'm not sure whether they're allowed in HTML or not, but certainly the server thinks they're not allowed in your requests, and it gets the only meaningful vote.
I notice that your document claims to be encoded in iso-2022-jp, not utf-8. And indeed, the sequence of characters ESC $ B that appears in your document is valid iso-2022-jp. It indicates that the data is switching encodings (from ASCII to a 2-byte Japanese encoding called JIS X 0208-1983).
But somewhere in the process of constructing your request, something has seen that 0x1B byte and interpreted it as a character U+001B, not realising that it's intended as one byte in data that's already encoded in the document encoding. So, it has XML-escaped it as a "best effort", even though that's not valid XML.
Probably, whatever is serializing your XML document doesn't know that the encoding is supposed to be iso-2022-jp. I imagine it thinks it's supposed to be serializing the document as ASCII, ISO-Latin-1, or UTF-8, and the <meta> element means nothing to it (that's an HTML way of specifying the encoding anyway, it has no particular significance in XML). But I don't know MS-DOM, so I don't know how to correct that.
If you just remove the ESC characters from iso-2022-jp data, then you conceal the fact that the data has switched encodings, and so the decoder will continue to interpret all that 7nMK stuff as ASCII, when it's supposed to be interpreted as JIS X 0208-1983. Hence, garbage.
Something else strange -- the iso-2022-jp code to switch back to ASCII is ESC ( B, but I see |(B</font> in your data, when I'd expect the same thing to happen to the second ESC character as happened to the first: &#x1B;(B</font>. Similarly, $B#M#S(B and $BL#D+(B are mangled attempts to switch from ASCII to JIS X 0208-1983 and back, and again the ESC characters have just disappeared rather than being escaped.
I have no explanation for why some ESC characters have disappeared and one has been escaped, but it cannot be coincidence that what you generate looks almost, but not quite, like valid iso-2022-jp. I think iso-2022-jp is a 7 bit encoding, so part of the problem might be that you've taken iso-2022-jp data, and run it through a function that converts ISO-Latin-1 (or some other 8 bit encoding for which the lower half matches ASCII, for example any Windows code page) to UTF-8. If so, then this function leaves 7 bit data unchanged, it won't convert it to UTF-8. Then when interpreted as UTF-8, the data has ESC characters in it.
If you want to send the data as UTF-8, then first of all you need to actually convert it out of iso-2022-jp (to wide characters or to UTF-8, whichever your SOAP or XML library expects). Secondly you need to label it as UTF-8, not as iso-2022-jp. Finally you need to serialize the whole document as UTF-8, although as I've said you might already be doing that.
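A hedged illustration of that conversion step in Python (with C++ / MS-DOM you would do the equivalent conversion before building the document; the sample text is arbitrary):

jis_bytes = "日本語".encode("iso2022_jp")   # starts with the ESC $ B sequence seen in your document
text = jis_bytes.decode("iso2022_jp")       # back to plain characters, no ESC bytes left
utf8_bytes = text.encode("utf-8")           # what should actually go on the wire
print(utf8_bytes)                           # b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'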
As pointed out by Steve Jessop, it looks like you have encoded the text as iso-2022-jp, not UTF-8. So the first thing to do is to check that and ensure that you have proper UTF-8.
If the problem still persists, consider encoding the text.
The simplest option is "hex encoding" where you just write the hex value of each byte as ASCII digits. e.g. the 0x1B byte becomes "1B", i.e. 0x31, 0x42.
If you want to be fancy you could use MIME or even UUENCODE.
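A quick sketch of the hex option in Python; any language can do the same with a byte-to-hex table:

data = "こんにちは".encode("utf-8")
hex_text = data.hex().upper()                         # pure ASCII digits, safe in any transport
print(hex_text)                                       # E38193E38293E381ABE381A1E381AF
original = bytes.fromhex(hex_text).decode("utf-8")    # reverses the encoding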
I am writing a webservice that uses JSON to represent its resources, and I am a bit stuck thinking about the best way to encode the JSON. Reading the JSON RFC (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is UTF-8. But the RFC also describes a string escaping mechanism for specifying characters. I assume this would generally be used to escape non-ASCII characters, thereby making the resulting UTF-8 valid ASCII.
So let's say I have a JSON string that contains Unicode characters (code points) that are non-ASCII. Should my webservice just UTF-8 encode that and return it, or should it escape all those non-ASCII characters and return pure ASCII?
I'd like browsers to be able to execute the results using JSONP or eval. Does that affect the decision? My knowledge of various browsers' JavaScript support for UTF-8 is lacking.
EDIT: I wanted to clarify that my main concern about how to encode the results is really about browser handling of the results. What I've read indicates that browsers may be sensitive to the encoding when using JSONP in particular. I haven't found any really good info on the subject, so I'll have to start doing some testing to see what happens. Ideally I'd like to only escape those few characters that are required and just utf-8 encode the results.
The JSON spec requires UTF-8 support by decoders. As a result, all JSON decoders can handle UTF-8 just as well as they can handle the numeric escape sequences. This is also the case for Javascript interpreters, which means JSONP will handle the UTF-8 encoded JSON as well.
The ability for JSON encoders to use the numeric escape sequences instead just offers you more choice. One reason you may choose the numeric escape sequences would be if a transport mechanism in between your encoder and the intended decoder is not binary-safe.
Another reason you may want to use numeric escape sequences is to prevent certain characters appearing in the stream, such as <, & and ", which may be interpreted as HTML sequences if the JSON code is placed without escaping into HTML or a browser wrongly interprets it as HTML. This can be a defence against HTML injection or cross-site scripting (note: some characters MUST be escaped in JSON, including " and \).
Some frameworks, including PHP's json_encode() (by default), always do the numeric escape sequences on the encoder side for any character outside of ASCII. This is a mostly unnecessary extra step intended for maximum compatibility with limited transport mechanisms and the like. However, this should not be interpreted as an indication that any JSON decoders have a problem with UTF-8.
So, I guess you could just decide which to use like this:
Just use UTF-8, unless any software you are using for storage or transport between encoder and decoder isn't binary-safe.
Otherwise, use the numeric escape sequences.
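The two output styles side by side, sketched with Python's json module (PHP's json_encode defaults to the second form, as mentioned above):

import json

print(json.dumps({"name": "café"}, ensure_ascii=False))  # {"name": "café"} - raw UTF-8
print(json.dumps({"name": "café"}))                       # {"name": "caf\u00e9"} - numeric escapes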
I had a problem with this.
When I JSON encode a string with a character like "é", every browser will return the same "é", except IE, which will return "\u00e9".
Then with PHP json_decode(), it will fail if it finds "é", so for Firefox, Opera, Safari and Chrome I have to call utf8_encode() before json_decode().
Note: in my tests, IE and Firefox are using their native JSON object; the other browsers are using json2.js.
ASCII doesn't come into it anymore. Using UTF-8 encoding means that you aren't using ASCII encoding. What you should use the escaping mechanism for is what the RFC says:
All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)
I was facing the same problem. This works for me; please check it:
json_encode($array, JSON_UNESCAPED_UNICODE);
Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8.
FYI, RFC 4627 is no longer the official JSON spec. It was obsoleted in 2014 by RFC 7159, which was then obsoleted in 2017 by RFC 8259, which is the current spec.
RFC 8259 states:
8.1. Character Encoding
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].
Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.
Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
I had a similar problem with the é character... I think the comment "it's possible that the text you're feeding it isn't UTF-8" is probably close to the mark here. I have a feeling the default collation in my instance was something else until I realized it and changed it to utf8; the problem is the data was already there, so I'm not sure whether it converted the data or not when I changed it, and it displays fine in MySQL Workbench. The end result is that PHP will not JSON-encode the data, it just returns false. It doesn't matter what browser you use, as it's the server causing my issue: PHP will not parse the data to UTF-8 if this character is present. As I say, I'm not sure if it is due to converting the schema to utf8 after the data was present or just a PHP bug. In this case, use json_encode(utf8_encode($string));