Python Raw-Unicode-Escape encoding - python-2.7

I am reading documentation of python 2.7, I just don't understand Raw-Unicode-Escape encoding. Original documentation is below:
For experts, there is also a raw mode just like the one for normal strings. You have to prefix the opening quote with ‘ur’ to have Python use the Raw-Unicode-Escape encoding. It will only apply the above \uXXXX conversion if there is an uneven number of backslashes in front of the small ‘u’.
And I wonder why the required number of backslashes is uneven. Is it just a rule or due to anything else?

\uXXXX escapes are handled specially in raw strings, as the text you quoted describes. ur'\\\\' is a string containing four backslashes, while ur'\\\u0020\\' is four backslashes and a space. If I had to guess why there have to be an uneven number of backslashes for the \u to be recognized, I'd guess that it was because the non-raw string parser works like that too (I haven't looked at the source to be sure).
The question of why probably comes down to "because that's the way it was defined" for python 2. Python 3 doesn't do that anymore - r'\\\u0020\\' is the same as
'\\\\\\u0020\\\\'.

Related

Replacing unicode characters with ascii characters in Python/Django

I'm using Python 2.7 here (which is very relevant).
Let's say I have a string containing an "em" dash, "—". This isn't encoded in ASCII. Therefore, when my Django app processes it, it complains. A lot.
I want to to replace some such characters with unicode equivalents for string tokenization and use with a spell-checking API (PyEnchant, which considers non-ASCII apostrophes to be misspellings), for example by using the shorter "-" dash instead of an em dash. Here's what I'm doing:
s = unicode(s).replace(u'\u2014', '-').replace(u'\u2018', "'").replace(u'\u2019', "'").replace(u'\u201c', '"').replace(u'\u201d', '"')
Unfortunately, this isn't actually replacing any of the unicode characters, and I'm not sure why.
I don't really have time to upgrade to Python 3 right now, importing unicode_literals from future at the top of the page or setting the encoding there does not let me place actual unicode literals in the code, as it should, and I have tried endless tricks with encode() and decode().
Can anyone give me a straightforward, failsafe way to do this in Python 2.7?
Oh boy... false alarm, here! It actually works, but I entered some incorrect character codes. I'm going to leave the question up since that code is the only thing that seemed to let me complete this particular task in this environment.

Range of UTF-8 Characters in C++11 Regex

This question is an extension of Do C++11 regular expressions work with UTF-8 strings?
#include <regex>
if (std::regex_match ("中", std::regex("中") )) // "\u4e2d" also works
std::cout << "matched\n";
The program is compiled on Mac Mountain Lion with clang++ with the following options:
clang++ -std=c++0x -stdlib=libc++
The code above works. This is a standard range regex "[一-龠々〆ヵヶ]" for matching any Japanese Kanji or Chinese character. It works in Javascript and Ruby, but I can't seem to get ranges working in C++11, even with using a similar version [\u4E00-\u9fa0]. The code below does not match the string.
if (std::regex_match ("中", std::regex("[一-龠々〆ヵヶ]")))
std::cout << "range matched\n";
Changing locale hasn't helped either. Any ideas?
EDIT
So I have found that all ranges work if you add a + to the end. In this case [一-龠々〆ヵヶ]+, but if you add {1} [一-龠々〆ヵヶ]{1} it does not work. Moreover, it seems to overreach it's boundaries. It won't match latin characters, but it will match は which is \u306f and ぁ which is \u3041. They both lie below \u4E00
nhahtdh also suggested regex_search which also works without adding + but it still runs into the same problem as above by pulling values outside of its range. Played with the locales a bit as well. Mark Ransom suggests it treats the UTF-8 string as a dumb set of bytes, I think this is possibly what it is doing.
Further pushing the theory that UTF-8 is getting jumbled some how, [a-z]{1} and [a-z]+ matches a, but only [一-龠々〆ヵヶ]+ matches any of the characters, not [一-龠々〆ヵヶ]{1}.
Encoded in UTF-8, the string "[一-龠々〆ヵヶ]" is equal to this one: "[\xe4\xb8\x80-\xe9\xbe\xa0\xe3\x80\x85\xe3\x80\x86\xe3\x83\xb5\xe3\x83\xb6]". And this is not the droid character class you are looking for.
The character class you are looking for is the one that includes:
any character in the range U+4E00..U+9FA0; or
any of the characters 々, 〆, ヵ, ヶ.
The character class you specified is the one that includes:
any of the "characters" \xe4 or \xb8; or
any "character" in the range \x80..\xe9; or
any of the "characters" \xbe, \xa0, \xe3, \x80, \x85, \xe3 (again), \x80 (again), \x86, \xe3 (again), \x83, \xb5, \xe3 (again), \x83 (again), \xb6.
Messy isn't it? Do you see the problem?
This will not match "latin" characters (which I assume you mean things like a-z) because in UTF-8 those all use a single byte below 0x80, and none of those is in that messy character class.
It will not match "中" either because "中" has three "characters", and your regex matches only one "character" out of that weird long list. Try assert(std::regex_match("中", std::regex("..."))) and you will see.
If you add a + it works because "中" has three of those "characters" in your weird long list, and now your regex matches one or more.
If you instead add {1} it does not match because we are back to matching three "characters" against one.
Incidentally "中" matches "中" because we are matching the three "characters" against the same three "characters" in the same order.
That the regex with + will actually match some undesired things because it does not care about order. Any character that can be made from that list of bytes in UTF-8 will match. It will match "\xe3\x81\x81" (ぁ U+3041) and it will even match invalid UTF-8 input like "\xe3\xe3\xe3\xe3".
The bigger problem is that you are using a regex library that does not even have level 1 support for Unicode, the bare minimum required. It munges bytes and there isn't much your precious tiny regex can do about it.
And the even bigger problem is that you are using a hardcoded set of characters to specify "any Japanese Kanji or Chinese character". Why not use the Unicode Script property for that?
R"(\p{Script=Han})"
Oh right, this won't work with C++11 regexes. For a moment there I almost forgot those are annoyingly worse than useless with Unicode.
So what should you do?
You could decode your input into a std::u32string and use char32_t all over for the matching. That would not give you this mess, but you would still be hardcoding ranges and exceptions when you mean "a set of characters that share a certain property".
I recommend you forget about C++11 regexes and use some regular expression library that has the bare minimum level 1 Unicode support, like the one in ICU.

Perl Regular Expression cannot find fancy quotes “

I was trying to find the fancy quotes “ from a string using the following Perl regular expression but it returns false.
$text = "NBN “a joint venture with Telstra”";
if ($text =~ m/“/)
{
print "found";
}
I also tried using "\x93" ascii code but still does not work. I am stuck here.
Any help is appreciated.
Regards,
Allen
Depending on the encoding of the string you are trying to match, you might need to do different things. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
If the input string is encoded in UTF-8, then you need to specify that encoding in your perl script - one way to do that is with use encoding 'UTF-8'.
You can also specify use utf8 if you want the encoding of the script itself to be UTF-8. You are probably better off, though, knowing the code point of the character you are checking for, and specifying it directly:
use utf8;
use encoding 'UTF-8';
$text = "NBN “a joint venture with Telstra”"; # Make sure to quote this string properly
if ($text =~ m/\N{U+201C}/) # “ is the same as U+201C LEFT DOUBLE QUOTATION MARK
{
print "found";
}
See the "Demoroniser" and for your specific problem, the discussion of just the "smart" quotes bit of it on Perlmonks Re^3: Reg Ex to strip MS smart quotes.
This advice is assuming - perhaps incorrectly - that your database's "fancy quotes" have come from some piece of Microsoft software producing Windows-1252 encoded text - if you've got UTF-8 instead, Avi's already pointed you in the right direction.
I recently came across some smart quotes which I couldn't eliminate using the regex-es mentioned in the above posts only. I had to do a trick which I found out entirely by trial and error:
First convert to iso-8859-1 using Encode::encode.
Next, convert the fancy quotes (using the 4 regular expressions mentioned above).
Next convert the string to UTF-8 using Encode::encode (I needed this since I was using the string in an iOS app and reading it from a SQLite database using “NSString stringWithUTF8String:” - may not be relevant to you).
Hope this helps someone.

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for utf-8 mode in regexes,
You're better of on a whole with doing an iconv('utf-8','ascii//TRANSLIT',$string) on both name & search and comparing those.
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds on insane modifiers for internationalized text in regexps.

JSON character encoding - is UTF-8 well-supported by browsers or should I use numeric escape sequences?

I am writing a webservice that uses json to represent its resources, and I am a bit stuck thinking about the best way to encode the json. Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8. But the rfc also describes a string escaping mechanism for specifying characters. I assume this would generally be used to escape non-ascii characters, thereby making the resulting utf-8 valid ascii.
So let's say I have a json string that contains unicode characters (code-points) that are non-ascii. Should my webservice just utf-8 encoding that and return it, or should it escape all those non-ascii characters and return pure ascii?
I'd like browsers to be able to execute the results using jsonp or eval. Does that effect the decision? My knowledge of various browser's javascript support for utf-8 is lacking.
EDIT: I wanted to clarify that my main concern about how to encode the results is really about browser handling of the results. What I've read indicates that browsers may be sensitive to the encoding when using JSONP in particular. I haven't found any really good info on the subject, so I'll have to start doing some testing to see what happens. Ideally I'd like to only escape those few characters that are required and just utf-8 encode the results.
The JSON spec requires UTF-8 support by decoders. As a result, all JSON decoders can handle UTF-8 just as well as they can handle the numeric escape sequences. This is also the case for Javascript interpreters, which means JSONP will handle the UTF-8 encoded JSON as well.
The ability for JSON encoders to use the numeric escape sequences instead just offers you more choice. One reason you may choose the numeric escape sequences would be if a transport mechanism in between your encoder and the intended decoder is not binary-safe.
Another reason you may want to use numeric escape sequences is to prevent certain characters appearing in the stream, such as <, & and ", which may be interpreted as HTML sequences if the JSON code is placed without escaping into HTML or a browser wrongly interprets it as HTML. This can be a defence against HTML injection or cross-site scripting (note: some characters MUST be escaped in JSON, including " and \).
Some frameworks, including PHP's json_encode() (by default), always do the numeric escape sequences on the encoder side for any character outside of ASCII. This is a mostly unnecessary extra step intended for maximum compatibility with limited transport mechanisms and the like. However, this should not be interpreted as an indication that any JSON decoders have a problem with UTF-8.
So, I guess you just could decide which to use like this:
Just use UTF-8, unless any software you are using for storage or transport between encoder and decoder isn't binary-safe.
Otherwise, use the numeric escape sequences.
I had a problem there.
When I JSON encode a string with a character like "é", every browsers will return the same "é", except IE which will return "\u00e9".
Then with PHP json_decode(), it will fail if it find "é", so for Firefox, Opera, Safari and Chrome, I've to call utf8_encode() before json_decode().
Note : with my tests, IE and Firefox are using their native JSON object, others browsers are using json2.js.
ASCII isn't in it any more. Using UTF-8 encoding means that you aren't using ASCII encoding. What you should use the escaping mechanism for is what the RFC says:
All Unicode characters may be placed
within the quotation marks except
for the characters that must be
escaped: quotation mark, reverse
solidus, and the control characters
(U+0000 through U+001F)
I was facing the same problem. It works for me. Please check this.
json_encode($array,JSON_UNESCAPED_UNICODE);
Reading the json rfc (http://www.ietf.org/rfc/rfc4627.txt) it is clear that the preferred encoding is utf-8.
FYI, RFC 4627 is no longer the official JSON spec. It was obsoleted in 2014 by RFC 7159, which was then obsoleted in 2017 by RFC 8259, which is the current spec.
RFC 8259 states:
8.1. Character Encoding
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].
Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.
Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
I had a similar problem with é char... I think the comment "it's possible that the text you're feeding it isn't UTF-8" is probably close to the mark here. I have a feeling the default collation in my instance was something else until I realized and changed to utf8... problem is the data was already there, so not sure if it converted the data or not when i changed it, displays fine in mysql workbench. End result is that php will not json encode the data, just returns false. Doesn't matter what browser you use as its the server causing my issue, php will not parse the data to utf8 if this char is present. Like i say not sure if it is due to converting the schema to utf8 after data was present or just a php bug. In this case use json_encode(utf8_encode($string));