I am trying to write a fully UTF-8 compliant application with CouchDB as the back-end. I use C++ with the Casablanca REST SDK to send my requests to Couch version 1.6.1. To test that the application can handle various Unicode characters, I have a test string in a JSON object that I want to PUT to Couch. The string is formatted as such (C++):
const string_t InternationalText =
    L"Hello world!123##%\n\r\v\t\f Å i åa ä e ö "
    L"\u00c5 \u00fc \u03bb \u0416 \u4e16\u754c\u548c\u5e73 \U00013080";
The last character in the string, \U00013080 (Eye of Horus), is giving me trouble. I get a 400 Bad Request from Couch, and if I look in the log I see the error "lexical error: invalid character inside string."
I've done some sniffing using RawCap to capture the request - response cycle and the important parts of my request are:
PUT *address*
Content-Type: application/json;charset=utf-8
Body: *complex JSON object containing the string as such*
{"description":"Hello world!123##% Å i åa ä e ö Å ü λ Ж 世界和平 𓂀",...}
If I look at the hex of the request, the Eye of Horus character is encoded as F0 93 82 80, which according to https://codepoints.net/U+13080 is correct. Still, I get the UTF-8 error. What am I missing? Does CouchDB have a problem dealing with characters from plane 1 and above of the Unicode standard?
Almost needless to say, everything works fine if I remove the hieroglyph.
I found the problem. It turns out \v is illegal inside a JSON string (per https://www.rfc-editor.org/rfc/rfc7159, control characters U+0000 through U+001F must be escaped), and removing it solves my issue. I was thrown off by some strange behavior in Visual Studio's unit test framework, which passed the test when I removed the last character in my test string even though there were still errors in the call.
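In case it helps anyone else, here is a minimal sketch (plain C++, not the exact Casablanca code; the function name is my own) of escaping every raw control character in the test string as a \uXXXX sequence, so the string survives as JSON without dropping anything:

#include <cwchar>
#include <string>

// Escape what RFC 7159 forbids as raw characters inside a JSON string:
// the quote, the backslash, and the control characters U+0000..U+001F.
std::wstring EscapeForJson(const std::wstring& in)
{
    std::wstring out;
    for (wchar_t c : in)
    {
        switch (c)
        {
        case L'"':  out += L"\\\""; break;
        case L'\\': out += L"\\\\"; break;
        case L'\n': out += L"\\n";  break;
        case L'\r': out += L"\\r";  break;
        case L'\t': out += L"\\t";  break;
        case L'\f': out += L"\\f";  break;
        case L'\b': out += L"\\b";  break;
        default:
            if (static_cast<unsigned>(c) < 0x20)   // remaining control characters, e.g. \v (U+000B)
            {
                wchar_t buf[8];
                std::swprintf(buf, 8, L"\\u%04x", static_cast<unsigned>(c));
                out += buf;
            }
            else
            {
                out += c;   // everything else, including non-ASCII, is legal as-is
            }
        }
    }
    return out;
}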
Related
We have a PEGA front end, from which we key in double-byte characters like Japanese that are sent to a distributed Java webservice through Axis. This works fine when we send single-byte characters and only fails when using double-byte characters. The XML being passed uses UTF-8 encoding. The double-byte characters are rendered properly in the PEGA front end, and even the PEGA logs show the characters intact.
Axis version: 2
PEGA gets the following response while invoking the webservice:
Error: problem accessing the parser. Parser already accessed!
I did various combination testing and found the following:
single byte - working
Chinese - working
Japanese:
Hiragana - working
Katakana - working
Kanji - not working
For kanji, PEGA is not even hitting the distributed code; it fails with the parser error "problem accessing the parser. Parser already accessed!"
Any pointers would be helpful...
PEGA has issues parsing Chinese, Japanese, and other Asian characters that take two bytes of storage. They have an HFIX for this in many versions, and as far as I know they were fixing the issue in version 7.2.2.
I would raise a defect with them to get the HFIX for your PEGA version.
I am using string tokenizer and transform APIs to convert kanji characters to hiragana.
The code in the question (What is the replacement for Language Analysis framework's Morpheme analysis deprecated APIs) converts most kanji characters to hiragana, but these APIs fail to convert kanji words of 3-4 characters.
For example:
a) 現人神 is converted to Latin as 'gen ren shen' and to hiragana as 'げんじんしん',
whereas it should be, in Latin, 'Arahitogami' and, in hiragana, 'あらひとがみ'.
b) 安本丹 is converted to Latin as 'an ben dan' and to hiragana as 'やすもとまこと',
whereas it should be, in Latin, 'Yasumoto makoto' and, in hiragana, 'あんぽんたん'.
My main purpose is to obtain the ruby text for given Japanese text. I can't use the Language Analysis framework as it's unavailable in 64-bit.
Any suggestions? Are there other APIs to perform such string conversion?
So in both cases your API uses on'yomi but shouldn't. I assume it just guesses: "3 or more characters? On'yomi should be more appropriate in most cases, so I use it." It sounds like an actual dictionary is needed for your problem, which you can download.
Names (for b)) should still be a problem, though. I don't see how a computer could derive the correct name reading from kanji, as even native Japanese speakers sometimes fail at it. jisho.org doesn't even find a single name for 安本丹.
(Btw you mixed up the hiragana in b) and the Latin for 'あんぽんたん'. I can't write comments yet with my rep, so I'm leaving this here.)
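To make the dictionary suggestion concrete, here is a toy C++ sketch (the function name and the two hard-coded entries are only for illustration; the readings are the correct ones quoted above, and a real dictionary such as a downloaded JMdict-style file would be loaded from disk): look the whole word up first, and only fall back to per-character readings when the word is missing.

#include <string>
#include <unordered_map>

// Whole-word lookup: returns the dictionary reading, or "" so the caller can
// fall back to per-character (on'yomi) conversion.
std::string ReadingFor(const std::string& word)
{
    static const std::unordered_map<std::string, std::string> dictionary = {
        { "現人神", "あらひとがみ" },
        { "安本丹", "あんぽんたん" },
    };
    auto it = dictionary.find(word);
    return it != dictionary.end() ? it->second : std::string();
}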
I can't figure out how to remove this � symbol from a string.
The string is in UTF-8 format.
What should I do? :(
This removes the whole string:
preg_replace('/\W/','',utf8_decode(substr(utf8_encode($ad['description']),0,125)))
Thanks ;)
Update:
I am using:
header('Content-Type: text/html; charset=utf-8');
and I call exit() right after the replacement.
U+FFFD REPLACEMENT CHARACTER is used when a character does not have a representation in the current charset encoding. Declare your encodings properly as UTF-8 and use UTF-8 strings, and it will not show up on most platforms.
The problem here is that your string is not in UTF-8 format. You pretend it is and handle the data accordingly, but the string probably contains ANSI characters. It is not enough to send the Content-Type header with charset=utf-8; your content needs to be converted to UTF-8 before it is sent as well.
You could try utf8_decode('string'); or utf8_encode('string');
but you should really try to find the actual problem: make sure the headers are set correctly, the document type is right, and the text is encoded in the right format when it is saved.
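For what it's worth, that last suggestion is essentially what utf8_encode() does: it assumes the input is ISO-8859-1 and re-encodes it as UTF-8. A minimal sketch of the same conversion (written in C++ rather than PHP, and only valid if the source text really is Latin-1, which you need to verify first):

#include <string>

// Re-encode a Latin-1 (ISO-8859-1) string as UTF-8. Bytes below 0x80 pass
// through; bytes 0x80..0xFF become a two-byte UTF-8 sequence.
std::string Latin1ToUtf8(const std::string& in)
{
    std::string out;
    for (unsigned char b : in)
    {
        if (b < 0x80)
        {
            out += static_cast<char>(b);
        }
        else
        {
            out += static_cast<char>(0xC0 | (b >> 6));    // lead byte 110xxxxx
            out += static_cast<char>(0x80 | (b & 0x3F));  // trail byte 10xxxxxx
        }
    }
    return out;
}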
The problem is categorized in two steps:
Problem Step 1: the Access 97 db contains XML strings that are encoded in UTF-8.
The problem boils down to this: the Access 97 db contains XML strings that are encoded in UTF-8, so I created a patch tool to separately convert the XML strings from UTF-8 to Unicode. To convert a UTF-8 string to Unicode I have used the function
MultiByteToWideChar(CP_UTF8, 0, PChar(OriginalName), -1, @newName, Size); (where newName is an array declared as "newName : Array[0..2048] of WideChar;").
This function works well in most cases; I have checked it with Spanish and Arabic characters. But when I work with Greek and Chinese characters it chokes.
For some Greek characters like "Ευγ. ΚαÏαβιά" (as stored in Access 97), the resulting string contains null characters in between, and when it is stored to a wide string the characters get clipped.
For some Chinese characters like "?¢»?µ?" (as stored in Access 97), the result is totally absurd, like "?¢»?µ?".
Problem Step 2: Access 97 db text strings; the application GUI takes Unicode input, which is saved in Access 97.
First I checked with Arabic and Spanish characters; it seemed then that no explicit character encoding was required. But again the problem comes with Greek and Chinese characters.
I tried the same function mentioned above for the text conversion (is it correct?), and the result was again disappointing. The Spanish characters, which are OK without conversion, get their Unicode characters either lost or converted to regular ASCII letters.
The Greek and Chinese characters show behaviour similar to what is mentioned in step 1.
Please guide me. Am I taking the right approach? Is there some other way around this?
Right now I am confused and full of questions :)
There is no special requirement for working with Greek characters. The real problem is that the characters were stored in an encoding that Access doesn't recognize in the first place. When the application stored the UTF-8 values in the database, it tried to convert every single byte to the equivalent byte in the database's codepage. Every character that had no correspondence in that encoding was replaced with '?'. That may mean that the Greek text is OK, while the Chinese text may be gone.
In order to convert the data to something readable you have to know the codepage they are stored in. Using this you can get the actual bytes and then convert them to Unicode.
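If it helps, this is the usual two-call pattern for that conversion, sketched in C++/Win32 rather than the question's Delphi. The codepage argument is the part you have to get right: CP_UTF8 only if the stored bytes really are UTF-8, otherwise the ANSI codepage the data was actually written through (for example 1253 for Greek).

#include <windows.h>
#include <string>

std::wstring BytesToWide(const std::string& bytes, UINT codePage)
{
    if (bytes.empty()) return std::wstring();

    // First call: ask how many UTF-16 code units the result needs.
    int needed = MultiByteToWideChar(codePage, 0, bytes.data(),
                                     static_cast<int>(bytes.size()), nullptr, 0);
    if (needed == 0) return std::wstring();   // conversion failed; GetLastError() has details

    std::wstring wide(needed, L'\0');
    // Second call: perform the conversion into a buffer of exactly that size.
    MultiByteToWideChar(codePage, 0, bytes.data(),
                        static_cast<int>(bytes.size()), &wide[0], needed);
    return wide;
}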
I am writing a webservice that uses JSON to represent its resources, and I am a bit stuck thinking about the best way to encode the JSON. Reading the JSON RFC (http://www.ietf.org/rfc/rfc4627.txt), it is clear that the preferred encoding is UTF-8. But the RFC also describes a string escaping mechanism for specifying characters. I assume this would generally be used to escape non-ASCII characters, thereby making the resulting UTF-8 valid ASCII.
So let's say I have a JSON string that contains Unicode characters (code points) that are non-ASCII. Should my webservice just UTF-8 encode that and return it, or should it escape all those non-ASCII characters and return pure ASCII?
I'd like browsers to be able to execute the results using JSONP or eval. Does that affect the decision? My knowledge of various browsers' JavaScript support for UTF-8 is lacking.
EDIT: I wanted to clarify that my main concern about how to encode the results is really about browser handling of the results. What I've read indicates that browsers may be sensitive to the encoding when using JSONP in particular. I haven't found any really good info on the subject, so I'll have to start doing some testing to see what happens. Ideally I'd like to only escape those few characters that are required and just utf-8 encode the results.
The JSON spec requires UTF-8 support by decoders. As a result, all JSON decoders can handle UTF-8 just as well as they can handle the numeric escape sequences. This is also the case for Javascript interpreters, which means JSONP will handle the UTF-8 encoded JSON as well.
The ability for JSON encoders to use the numeric escape sequences instead just offers you more choice. One reason you may choose the numeric escape sequences would be if a transport mechanism in between your encoder and the intended decoder is not binary-safe.
Another reason you may want to use numeric escape sequences is to prevent certain characters appearing in the stream, such as <, & and ", which may be interpreted as HTML sequences if the JSON code is placed without escaping into HTML or a browser wrongly interprets it as HTML. This can be a defence against HTML injection or cross-site scripting (note: some characters MUST be escaped in JSON, including " and \).
Some frameworks, including PHP's json_encode() (by default), always do the numeric escape sequences on the encoder side for any character outside of ASCII. This is a mostly unnecessary extra step intended for maximum compatibility with limited transport mechanisms and the like. However, this should not be interpreted as an indication that any JSON decoders have a problem with UTF-8.
So, I guess you just could decide which to use like this:
Just use UTF-8, unless any software you are using for storage or transport between encoder and decoder isn't binary-safe.
Otherwise, use the numeric escape sequences.
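For the second option, here is a rough sketch of what the numeric escaping amounts to (illustrative C++ with a made-up function name; encoders such as PHP's json_encode already do this for you by default): every code point outside printable ASCII becomes a \uXXXX escape, with a surrogate pair for anything above U+FFFF.

#include <cstdio>
#include <string>

// Turn a sequence of Unicode code points into an ASCII-only JSON string body.
std::string EscapeNonAscii(const std::u32string& codePoints)
{
    std::string out;
    char buf[8];
    auto emit = [&](unsigned codeUnit) {          // one \uXXXX escape per UTF-16 code unit
        std::snprintf(buf, sizeof buf, "\\u%04x", codeUnit);
        out += buf;
    };
    for (char32_t cp : codePoints)
    {
        if (cp >= 0x20 && cp < 0x7F && cp != U'"' && cp != U'\\')
        {
            out += static_cast<char>(cp);          // printable ASCII passes through
        }
        else if (cp <= 0xFFFF)
        {
            emit(static_cast<unsigned>(cp));       // BMP character: single escape
        }
        else
        {
            char32_t v = cp - 0x10000;             // astral character: surrogate pair
            emit(0xD800 + static_cast<unsigned>(v >> 10));
            emit(0xDC00 + static_cast<unsigned>(v & 0x3FF));
        }
    }
    return out;
}
// e.g. U+00E9 (é) becomes \u00e9 and U+13080 becomes \ud80c\udc80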
I ran into a related problem.
When I JSON-encode a string with a character like "é", every browser returns the same "é", except IE, which returns "\u00e9".
Then PHP json_decode() fails if it finds "é", so for Firefox, Opera, Safari and Chrome I have to call utf8_encode() before json_decode().
Note: in my tests, IE and Firefox are using their native JSON object; the other browsers are using json2.js.
ASCII isn't in it any more. Using UTF-8 encoding means that you aren't using ASCII encoding. What you should use the escaping mechanism for is what the RFC says:
All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)
I was facing the same problem. This works for me; please check it:
json_encode($array,JSON_UNESCAPED_UNICODE);
Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8.
FYI, RFC 4627 is no longer the official JSON spec. It was obsoleted in 2014 by RFC 7159, which was then obsoleted in 2017 by RFC 8259, which is the current spec.
RFC 8259 states:
8.1. Character Encoding
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].
Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.
Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
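As a practical note on that last paragraph: a parser that wants to be lenient can simply strip a leading UTF-8 encoded BOM (the bytes EF BB BF) before handing the text to the JSON decoder. A tiny sketch (C++, invented function name):

#include <string>

// Remove a UTF-8 byte order mark, if present, from the start of a JSON text.
std::string StripUtf8Bom(const std::string& jsonText)
{
    static const std::string bom = "\xEF\xBB\xBF";
    if (jsonText.compare(0, bom.size(), bom) == 0)
        return jsonText.substr(bom.size());
    return jsonText;
}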
I had a similar problem with the é character... I think the comment "it's possible that the text you're feeding it isn't UTF-8" is probably close to the mark here. I have a feeling the default collation in my instance was something else until I realized it and changed it to utf8; the problem is that the data was already there, so I am not sure whether it converted the data or not when I changed it. It displays fine in MySQL Workbench. The end result is that PHP will not JSON-encode the data, it just returns false. It doesn't matter which browser you use, as it's the server causing my issue: PHP will not parse the data to UTF-8 if this character is present. Like I say, I am not sure if it is due to converting the schema to utf8 after the data was present or just a PHP bug. In this case use json_encode(utf8_encode($string));