I am trying to write a fully UTF-8 compliant application with CouchDB as the back-end. I use C++ with the Casablanca REST SDK to send my requests to CouchDB 1.6.1. To test that the application can handle various Unicode characters, I have a test string in a JSON object that I want to PUT to CouchDB. The string is formatted as follows (C++):
const string_t InternationalText =
    L"Hello world!123##%\n\r\v\t\f Å i åa ä e ö "
    L"\u00c5 \u00fc \u03bb \u0416 \u4e16\u754c\u548c\u5e73 \U00013080";
The last character in the string, \U00013080 (the Eye of Horus), is giving me trouble. I get a 400 Bad Request from CouchDB, and if I look in the log I see the error "lexical error: invalid character inside string."
I've done some sniffing using RawCap to capture the request-response cycle, and the important parts of my request are:
PUT *address*
Content-Type: application/json;charset=utf-8
Body: *Complex Json object containing the string as such*
{"description"="Hello world!123##% Å i åa ä e ö Å ü λ Ж 世界和平 𓂀",...}
If I look at the hex of the request, the Eye of Horus character is encoded as F0 93 82 80, which according to https://codepoints.net/U+13080 is correct. Still, I get the UTF-8 error. What am I missing? Does CouchDB have a problem dealing with characters from plane 1 and above in the Unicode standard?
Almost needless to say, everything works fine if I remove the hieroglyph.
I found the problem. It turns out \v is an illegal character inside a JSON string (control characters U+0000 through U+001F must be escaped, see https://www.rfc-editor.org/rfc/rfc7159), and removing it solves my issue. I was thrown off by some strange behavior in Visual Studio's unit test framework, which passed the test when I removed the last character in my test string even though there were still errors in the call.
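The same rule is easy to demonstrate outside CouchDB. For illustration only (Python 3's json module standing in for CouchDB's parser), a raw vertical tab inside a JSON string is rejected, while the escaped \u000B form is accepted:

import json

# A raw (unescaped) vertical tab is a control character (U+000B), which
# RFC 7159 forbids inside JSON strings -- strict parsers reject it.
raw = '"line one\x0bline two"'
try:
    json.loads(raw)
except json.JSONDecodeError as err:
    print("rejected:", err)          # Invalid control character ...

# The same character written as a \u000B escape is perfectly legal JSON.
escaped = '"line one\\u000bline two"'
print(repr(json.loads(escaped)))     # 'line one\x0bline two'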
As part of a scraper, I need to encode kanji into URLs, but I can't even seem to get the correct output for a single character, and at this point I'm going in circles after everything I've tried from various Stack Overflow posts.
The document is set to UTF-8.
# -*- coding: utf-8 -*-
import urllib2

sampleText = u'ル'
print sampleText
print sampleText.encode('utf-8')
print urllib2.quote(sampleText.encode('utf-8'))
It gives me the values:
ル
ル
%E3%83%AB
But as far as I understand, it should give me:
ル
XX
%83%8B
What am I doing wrong? Is there some setting I don't have right? Because as far as I understand it, my output from encode() should not be ル.
The code you show works correctly. The character ル is KATAKANA LETTER RU, and is Unicode codepoint U+30EB. When encoded to UTF-8, you'll get the Python bytestring '\xe3\x83\xab', which prints out as ル if your console encoding is Latin-1. When you URL-escape those three bytes, you get %E3%83%AB.
The value you seem to be expecting, %83%8B, is the Shift-JIS encoding of ル rather than the UTF-8 encoding. For a long time there was no standard for how to encode non-ASCII text in a URL, and as this Wikipedia section notes, many programs simply assumed a particular encoding (often without specifying it). The newer standard of Internationalized Resource Identifiers (IRIs), however, says that you should always convert Unicode text to UTF-8 bytes before percent-encoding it.
So, if you're generating your encoded string for a new program that wants to meet the current standards, stick with the UTF-8 value you're getting now. I would only use the Shift-JIS version if you need it for backwards compatibility with specific old websites or other software that expects that the data you send will have that encoding. If you have any influence over the server (or other program), see if you can update it to use IRIs too!
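To make the two encodings concrete, here is the same character percent-encoded both ways. This is Python 3 (urllib2.quote from the question became urllib.parse.quote), shown purely as an illustration:

from urllib.parse import quote

ru = 'ル'  # KATAKANA LETTER RU, U+30EB

# UTF-8 bytes e3 83 ab -> the value you are currently getting
print(quote(ru.encode('utf-8')))      # %E3%83%AB

# Shift-JIS bytes 83 8b -> the value you were expecting
print(quote(ru.encode('shift_jis')))  # %83%8B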
I need to figure out a way to remove any non-standard ASCII characters from a feed I am getting from a partner's system.
The issue is they are sending a mix of data: HTML formatting and hard returns along with bad characters.
I am already using the UDF DeMoronize(text) to pull any Microsoft Latin-1 "extensions" characters out.
The feed is coming over in UTF-8, and we have a tag on the page to ensure we are processing the feed as UTF-8.
Is there a simple way to code this to remove any non-UTF-8 characters?
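For what it's worth, the usual approach is to decode the incoming bytes and simply drop whatever is not valid in the target encoding. A minimal sketch of the idea in Python (purely illustrative, since the actual feed handling is ColdFusion, and clean_feed is just a made-up helper name):

def clean_feed(raw_bytes):
    # Drop any byte sequences that are not valid UTF-8 ...
    text = raw_bytes.decode('utf-8', errors='ignore')
    # ... and optionally strip everything outside plain ASCII as well.
    return text.encode('ascii', errors='ignore').decode('ascii')

print(clean_feed(b'Hello \xe2\x80\x94 world \x9d'))  # 'Hello  world '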
I am writing a webservice that uses JSON to represent its resources, and I am a bit stuck thinking about the best way to encode the JSON. Reading the JSON RFC (http://www.ietf.org/rfc/rfc4627.txt), it is clear that the preferred encoding is UTF-8. But the RFC also describes a string-escaping mechanism for specifying characters. I assume this would generally be used to escape non-ASCII characters, thereby making the resulting UTF-8 valid ASCII.
So let's say I have a JSON string that contains Unicode characters (code points) that are non-ASCII. Should my webservice just UTF-8 encode that and return it, or should it escape all those non-ASCII characters and return pure ASCII?
I'd like browsers to be able to execute the results using JSONP or eval. Does that affect the decision? My knowledge of various browsers' JavaScript support for UTF-8 is lacking.
EDIT: I wanted to clarify that my main concern about how to encode the results is really about browser handling of the results. What I've read indicates that browsers may be sensitive to the encoding when using JSONP in particular. I haven't found any really good info on the subject, so I'll have to start doing some testing to see what happens. Ideally I'd like to only escape those few characters that are required and just utf-8 encode the results.
The JSON spec requires UTF-8 support by decoders. As a result, all JSON decoders can handle UTF-8 just as well as they can handle the numeric escape sequences. This is also the case for Javascript interpreters, which means JSONP will handle the UTF-8 encoded JSON as well.
The ability for JSON encoders to use the numeric escape sequences instead just offers you more choice. One reason you may choose the numeric escape sequences would be if a transport mechanism in between your encoder and the intended decoder is not binary-safe.
Another reason you may want to use numeric escape sequences is to prevent certain characters appearing in the stream, such as <, & and ", which may be interpreted as HTML sequences if the JSON code is placed without escaping into HTML or a browser wrongly interprets it as HTML. This can be a defence against HTML injection or cross-site scripting (note: some characters MUST be escaped in JSON, including " and \).
Some frameworks, including PHP's json_encode() (by default), always emit numeric escape sequences on the encoder side for any character outside of ASCII. This is a mostly unnecessary extra step intended for maximum compatibility with limited transport mechanisms and the like. However, it should not be interpreted as an indication that JSON decoders have a problem with UTF-8.
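Python's json module makes the same default choice, which is a convenient way to see both forms side by side (Python 3, shown here only as an illustration of the trade-off):

import json

data = {"description": "Å ü λ Ж 世界和平"}

# Default behaviour (like PHP's json_encode()): every non-ASCII character
# becomes a \uXXXX escape, so the output is pure ASCII.
print(json.dumps(data))
# {"description": "\u00c5 \u00fc \u03bb \u0416 \u4e16\u754c\u548c\u5e73"}

# Opting out (like PHP's JSON_UNESCAPED_UNICODE flag): the characters are
# left alone and the result is UTF-8 text.
print(json.dumps(data, ensure_ascii=False))
# {"description": "Å ü λ Ж 世界和平"}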
So, I guess you could just decide which to use like this:
Just use UTF-8, unless any software you are using for storage or transport between encoder and decoder isn't binary-safe.
Otherwise, use the numeric escape sequences.
I had a problem with this.
When I JSON-encode a string with a character like "é", every browser returns the same "é", except IE, which returns "\u00e9".
Then PHP's json_decode() fails if it finds "é", so for Firefox, Opera, Safari and Chrome I have to call utf8_encode() before json_decode().
Note: in my tests, IE and Firefox use their native JSON object; the other browsers use json2.js.
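For what it's worth, the underlying issue is the same in any strict JSON decoder: the input has to be UTF-8, so a Latin-1 "é" needs to be transcoded first. A small illustration in Python 3 (json.loads standing in for json_decode(), and the decode/encode pair standing in for utf8_encode()):

import json

# "é" as a single Latin-1 byte (0xE9), as an old page or browser might send it
latin1_payload = b'{"name": "caf\xe9"}'

try:
    json.loads(latin1_payload)           # the decoder assumes UTF-8
except UnicodeDecodeError as err:
    print("fails:", err)

# Transcode Latin-1 -> UTF-8 first, then decode the JSON
fixed = latin1_payload.decode('latin-1').encode('utf-8')
print(json.loads(fixed))                 # {'name': 'café'}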
ASCII isn't in it any more. Using UTF-8 encoding means that you aren't using ASCII encoding. What you should use the escaping mechanism for is what the RFC says:
All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)
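To see that rule in action, here is what a typical encoder escapes when it is not forcing everything to ASCII (Python's json.dumps with ensure_ascii=False, purely as an illustration):

import json

# Only the quotation mark, the backslash and the control characters
# (U+0000 through U+001F) are escaped; other Unicode characters, like
# the "é" here, may appear literally.
print(json.dumps('say "hi"\tback\\slash é', ensure_ascii=False))
# "say \"hi\"\tback\\slash é"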
I was facing the same problem, and this works for me. Please check this:
json_encode($array, JSON_UNESCAPED_UNICODE);
Reading the JSON RFC (http://www.ietf.org/rfc/rfc4627.txt), it is clear that the preferred encoding is UTF-8.
FYI, RFC 4627 is no longer the official JSON spec. It was obsoleted in 2014 by RFC 7159, which was then obsoleted in 2017 by RFC 8259, which is the current spec.
RFC 8259 states:
8.1. Character Encoding
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].
Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.
Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
I had a similar problem with the "é" character... I think the comment "it's possible that the text you're feeding it isn't UTF-8" is probably close to the mark here. I have a feeling the default collation in my instance was something else until I realized it and changed it to utf8; the problem is the data was already there, so I'm not sure whether it converted the data or not when I changed it, and it displays fine in MySQL Workbench. The end result is that PHP will not JSON-encode the data, it just returns false. It doesn't matter which browser you use, since it's the server causing my issue: PHP will not parse the data as UTF-8 if this character is present. Like I say, I'm not sure if it is due to converting the schema to utf8 after the data was present or just a PHP bug. In this case, use json_encode(utf8_encode($string));
I'm trying to auto-generate a plain text email with a trademark symbol in it. I've tried everything I can think of but it's still not going through.
<cfmail from="#x#" to="#y#" subject="test" charset="UTF-8">
™
™
#Chr(153)#
</cfmail>
This is an encoding issue.
You state the mail is encoded as UTF-8, but Chr(153) does not return a trademark symbol in Unicode. It does in Windows-1252, but Chr() works with Unicode code points.
Use Chr(8482) to nail it to the Unicode TM symbol.
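A quick way to see the difference between the two code points (sketched in Python here, but the mapping is the same in any language):

# Code point 153 (U+0099) is a C1 control character in Unicode, not a
# printable symbol; the trademark sign sits at byte 0x99 only in Windows-1252.
print(repr(chr(153)))                   # '\x99'   -- not a printable symbol
print(b'\x99'.decode('windows-1252'))   # ™        -- byte 0x99 in Windows-1252
print(chr(8482), hex(ord('™')))         # ™ 0x2122 -- the Unicode code point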
I've found an info page that outlines the issue nicely.
By the way, writing the literal ™ symbol works for me as well. But this assumes your .cfm files are in fact encoded as Windows-1252 and that the ColdFusion runtime is configured to expect this (both of which are the default on the Windows systems I tested on; analogous rules apply to other systems). ColdFusion converts all strings to Unicode internally, so maybe something is broken in this chain of expectations in your setup.
I think this is not so much an issue with CFMail but rather an issue with email clients displaying character codes in plain-text messages literally rather than converting them to their corresponding characters.
Using CFMail in HTML mode should provide the result you're looking for.