Python 2.7 range regex matching unicode emoticons - regex

How do I count the number of Unicode emoticons in a string using a Python 2.7 regex? I tried the first answer posted for this question, but it fails:
re.findall(u'[\U0001f600-\U0001f650]', s.decode('utf-8')) is not working and raises an invalid expression error
How to find and count emoticons in a string using python?
"Thank you for helping out ๐Ÿ˜Š(Emoticon1) Smiley emoticon rocks!๐Ÿ˜‰(Emoticon2)"
Count : 2

The problem is probably due to using a "narrow build" of Python 2. That is, if you fire up your interpreter, you'll find that sys.maxunicode == 0xffff is True.
This site has a few interesting notes on wide builds of Python (which are commonly found on Linux, but not, as the link suggests, on OS X in my experience). These builds use UCS-4 internally to encode characters, and as a result seem to have saner support for higher range Unicode code points, such as the ranges you are talking about. Narrow builds apparently use UTF-16 internally, and as a result encode these higher code points using "surrogate pairs". I presume this is the reason you see a bad character range error when you try and compile this regular expression.
The only solutions I know of are to switch to Python >= 3.3, which no longer has the wide/narrow distinction, or to install a wide Python build.
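If neither is an option, a commonly cited workaround on narrow builds is to match the UTF-16 surrogate pairs directly. A minimal sketch (Python 2.7, narrow builds only; the sample byte string is an assumption for the demo):
# -*- coding: utf-8 -*-
import re

# On a narrow build, U+1F600..U+1F650 are stored as surrogate pairs.
# Both endpoints share the high surrogate \ud83d, so the whole range
# collapses into a single pattern.
EMOTICON = re.compile(u'\ud83d[\ude00-\ude50]')

s = "Thank you for helping out \xf0\x9f\x98\x8a Smiley emoticon rocks!\xf0\x9f\x98\x89"
print(len(EMOTICON.findall(s.decode('utf-8'))))  # -> 2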

Related

Replacing unicode characters with ascii characters in Python/Django

I'm using Python 2.7 here (which is very relevant).
Let's say I have a string containing an "em" dash, "—". This isn't encoded in ASCII. Therefore, when my Django app processes it, it complains. A lot.
I want to replace some such characters with ASCII equivalents for string tokenization and use with a spell-checking API (PyEnchant, which considers non-ASCII apostrophes to be misspellings), for example by using the shorter "-" dash instead of an em dash. Here's what I'm doing:
s = unicode(s).replace(u'\u2014', '-').replace(u'\u2018', "'").replace(u'\u2019', "'").replace(u'\u201c', '"').replace(u'\u201d', '"')
Unfortunately, this isn't actually replacing any of the unicode characters, and I'm not sure why.
I don't really have time to upgrade to Python 3 right now. Importing unicode_literals from __future__ at the top of the file, or setting the encoding there, does not let me place actual Unicode literals in the code as it should, and I have tried endless tricks with encode() and decode().
Can anyone give me a straightforward, failsafe way to do this in Python 2.7?
Oh boy... false alarm, here! It actually works, but I entered some incorrect character codes. I'm going to leave the question up since that code is the only thing that seemed to let me complete this particular task in this environment.
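For anyone who lands here with the same task: the chain of replace() calls works, but a table-driven single pass is easier to extend. A minimal sketch (Python 2.7; the asciify name is made up for illustration):
# -*- coding: utf-8 -*-
import re

REPLACEMENTS = {
    u'\u2014': u'-',   # em dash
    u'\u2018': u"'",   # left single quotation mark
    u'\u2019': u"'",   # right single quotation mark
    u'\u201c': u'"',   # left double quotation mark
    u'\u201d': u'"',   # right double quotation mark
}
# One alternation over the table's keys instead of five chained replaces.
PATTERN = re.compile(u'|'.join(REPLACEMENTS))

def asciify(s):
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], s)

print(asciify(u'\u201cquoted\u201d \u2014 it\u2019s fine'))  # "quoted" - it's fine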

Regex to match Egyptian Hieroglyphics [closed]

I want to know a regex to match the Egyptian Hieroglyphics. I am completely clueless and need your help.
I cannot post the letters, as Stack Overflow doesn't seem to recognize them.
So can anyone let me know the unicode range for these characters.
TLDNR: \p{Egyptian_Hieroglyphs}
JavaScript
Egyptian_Hieroglyphs belong to the "astral" plane, which uses more than 16 bits to encode a character. JavaScript, as of ES5, doesn't support astral planes (more on that), therefore you have to use surrogate pairs. The first surrogate pair is
U+13000 = d80c dc00
the last one is
U+1342E = d80d dc2e
that gives
re = /(\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2E])+/g
t = document.getElementById("pyramid").innerHTML
document.write("<h1>Found</h1>" + t.match(re))
<div id="pyramid">
some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮
</div>
This is what it looks like with Noto Sans Egyptian Hieroglyphs installed (screenshot omitted).
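As an aside, the surrogate arithmetic behind those pairs is easy to check from any language; a quick Python sketch:
# How U+13000 becomes the pair d80c dc00 (and U+1342E becomes d80d dc2e).
def surrogate_pair(cp):
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

print([hex(u) for u in surrogate_pair(0x13000)])  # ['0xd80c', '0xdc00']
print([hex(u) for u in surrogate_pair(0x1342E)])  # ['0xd80d', '0xdc2e']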
Other languages
On platforms that support UCS-4 you can use Egyptian codepoints 13000 to 1342F directly, but the syntax differs from system to system. For example, in Python (3.3 up) it will be [\U00013000-\U0001342E]:
>>> s = "some \U00013000 really \U00013001 old \U0001342C stuff \U0001342D \U0001342E"
>>> s
'some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮'
>>> import re
>>> re.findall('[\U00013000-\U0001342E]', s)
['𓀀', '𓀁', '𓐬', '𓐭', '𓐮']
Finally, if your regex engine supports unicode properties, you can (and should) use these instead of hardcoded ranges. For example in php/pcre:
$str = " some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮";
preg_match_all('~\p{Egyptian_Hieroglyphs}~u', $str, $m);
print_r($m);
prints
Array
(
    [0] => Array
        (
            [0] => 𓀀
            [1] => 𓀁
            [2] => 𓐬
            [3] => 𓐭
            [4] => 𓐮
        )
)
Unicode encodes Egyptian hieroglyphs in the range from U+13000 to U+1342F (beyond the Basic Multilingual Plane).
In this case, there are 2 ways to write the regex:
By specifying a character range from U+13000 to U+1342F.
While specifying a character range in regex for characters in BMP is as easy as [a-z], depending on the language support, doing so for characters in astral planes might not be as simple.
By specifying Unicode block for Egyptian hieroglyphs
Since we are matching any character in the Egyptian hieroglyphs block, this is the preferred way to write the regex where support is available.
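For instance, in Python the third-party regex module (not the stdlib re) understands Unicode property names. A sketch, assuming Python 3 with the regex package installed:
import regex  # pip install regex

s = "some \U00013000 really \U00013001 old \U0001342C stuff"
print(regex.findall(r'\p{Egyptian_Hieroglyphs}', s))  # ['𓀀', '𓀁', '𓐬']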
Java
(Currently, I don't have any idea how other implementations of the Java Class Libraries deal with astral plane characters in their Pattern classes.)
Sun/Oracle implementation
I'm not sure if it makes sense to talk about matching characters in astral planes in Java 1.4, since support for characters beyond BMP was only added in Java 5 by retrofitting the existing String implementation (which uses UCS-2 for its internal String representation) with code point-aware methods.
Since Java continues to allow lone surrogates (ones which can't form a pair with another surrogate) to be specified in a String, it resulted in a mess, since surrogates are not real characters, and lone surrogates are invalid in UTF-16.
The Pattern class saw a major overhaul from Java 1.4.x to Java 5, as the class was rewritten to provide support for matching Unicode characters in astral planes: the pattern string is converted to an array of code points before it is parsed, and the input string is traversed by the code point-aware methods in the String class.
You can read more about the madness in Java regex in this answer by tchrist.
I have written a detailed explanation on how to match a range of character which involves astral plane characters in this answer, so I am only going to include the code here. It also includes a few counter-examples of incorrect attempts to write regex to match astral plane characters.
Java 5 (and above)
"[\uD80C\uDC00-\uD80D\uDC2F]"
Java 7 (and above)
"[\\uD80C\\uDC00-\\uD80D\\uDC2F]"
"[\\x{13000}-\\x{1342F}]"
Since we are matching any code point that belongs to the Unicode block, it can also be written as:
"\\p{InEgyptian_Hieroglyphs}"
"\\p{InEgyptian Hieroglyphs}"
"\\p{InEgyptianHieroglyphs}"
"\\p{block=EgyptianHieroglyphs}"
"\\p{blk=Egyptian Hieroglyphs}"
Java has supported the \p syntax for Unicode blocks since 1.4, but support for the Egyptian Hieroglyphs block was only added in Java 7.
PCRE (used in PHP)
The PHP example is already covered in georg's answer:
'~\p{Egyptian_Hieroglyphs}~u'
Note that the u flag is mandatory if you want to match by code points instead of by code units.
Not sure if there is a better post on Stack Overflow, but I have written some explanation on the effect of the u flag (UTF mode) in this answer of mine.
One thing to note is that Egyptian_Hieroglyphs is only available from PCRE 8.02 (or a version no earlier than PCRE 7.90).
As an alternative, you can specify a character range with \x{h...hh} syntax:
'~[\x{13000}-\x{1342F}]~u'
Note the mandatory u flag.
The \x{h...hh} syntax is supported from at least PCRE 4.50.
JavaScript (ECMAScript)
ES5
The character range method (which is the only way to do this in vanilla JavaScript) is already covered in georg's answer. The regex is modified a bit to cover the whole block, including the reserved unassigned code point.
/(?:\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2F])/
The solution above demonstrates the technique for matching a range of characters in an astral plane, and also the limitations of JavaScript RegExp.
JavaScript also suffers from the same problem of string representation as Java. While Java fixed the Pattern class in Java 5 to let it work with code points, JavaScript RegExp is still stuck in the days of UCS-2, forcing us to work with code units instead of code points in the regular expression.
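The code unit vs. code point distinction is easy to see from another language; a quick Python 3 sketch:
# One astral code point occupies two UTF-16 code units (a surrogate pair).
s = '\U00013000'
print(len(s))                           # 1 (Python 3 counts code points)
print(len(s.encode('utf-16-le')) // 2)  # 2 (UTF-16 code units)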
ES6
Finally, support for code point matching was added in ECMAScript 6, made available via the u flag to prevent breaking existing implementations in previous versions of ECMAScript.
ES6 Specification - 21.2 RegExp (Regular Expression) Objects
Unicode-aware regular expressions in ECMAScript 6
Check the Support section from the second link above for the list of browsers providing experimental support for ES6 RegExp.
With the introduction of \u{h...hh} syntax in ES6, the character range can be rewritten in a manner similar to Java 7:
/[\u{13000}-\u{1342F}]/u
Or you can directly specify the characters in the RegExp literal, though the intent is not as clear-cut as [a-z]:
/[𓀀-𓐯]/u
Note the u modifier in both regexes above.
Still stuck with ES5? Don't worry, you can transpile ES6 Unicode RegExp to ES5 RegExp with regexpu.

Beginner C++ Windows D2D1Circle Sample from MSDN Problem

So I was just going through the basic Windows Programming guide over at MSDN and attempted to do the D2D1Circle Sample in Module 3. The problem I encountered was an error my VC++ 2008 was throwing.
" 'CreateWindowExA' : cannot convert parameter 2 from 'PCWSTR' to 'LPCSTR'"
So, figuring that I had made a slight error while typing the code in, I downloaded the sample code rar and opened it up, and it threw the exact same error. Any ideas on how I can fix this so it will work? Also, does the fact that I'm programming on an x64 machine have anything to do with why it won't work? I know pointers carry different sized values depending on the machine, and both of the parameters being passed are pointers.
Update @Jollymorphic: In the first few modules, the MSDN tutorial was saying that there really isn't any reason to continue using ASCII, since Unicode covers ASCII and also supports all other languages like Chinese, Japanese, etc. Wouldn't implementing your solution cause my program to only support ASCII and subsequently not allow support for East Asian languages?
A PCWSTR is a pointer to wide (16-bit) characters. An LPCSTR is a pointer to regular (8-bit) characters. Your project probably is set to generate code based on the UNICODE character set. If you open the properties for your project in Visual Studio, and then navigate to the "General" page, you'll see a "Character Set" property. If it is currently set to "Use Unicode character set," then you can change it to "Use Multi-Byte character set," and your string literals will be generated as 8-bit character strings.

C++ encode string to Unicode - ICU library

I need to convert a bunch of bytes in ISO-2022-JP and ISO-2022-JP-2 (and other variations of ISO-2022) into Unicode. I am trying to use ICU, but the following code doesn't work.
#include <unicode/ucnv.h> // ICU converter API used below
std::string input = "\x1B\x28\x4A" "ABC\xA6\xA7"; //the first 3 chars are the escape sequence to use the JIS_X201 character set in GL/GR
UErrorCode status = U_ZERO_ERROR;
UConverter *conv;
// set up the converter
conv = ucnv_open("ISO-2022-JP", &status);
if (status != U_ZERO_ERROR) return false; //couldn't find character set
UChar * convDest = new UChar[2*input.length()]; //ucnv_toUChars will use up to 2*length
// convert to Unicode
int resultLen = (int)ucnv_toUChars(conv, convDest, 2*input.length(), input.c_str(), input.length(), &status);
This doesn't work. The result contains '?' characters for anything I put in that was above ASCII. The status has no error. What am I doing wrong?
On top of that, I was having trouble compiling version 4.4 of the library, as the MSVC 9 project would not convert to an MSVC 10 project.
I am also aware of the libiconv open source library, but I couldn't compile that one on Windows. If anyone has any advice on a different library, that's also welcome.
Thanks.
EDIT
The escape sequence I originally used was wrong. So now ICU takes the string, strips out the escape sequence - which is a step in the right direction. But the result still contains '?' chars.
EDIT2 The reason I couldn't convert to an MSVC 10 project was that the x64 platform wasn't installed (it isn't by default). Alternatively, I could open all the projects in a text editor and remove all mention of the x64 target.
This doesn't resemble an ISO 2022 encoding. The high bits are supposed to be zero. The escape sequence looks somewhat recognizable, but it starts with ESC, 0x1b, not 0xb0. No idea what those byte values really mean.
(This question looks familiar, Hi again.)
A minor, minor nit: You want to check the error status with if(U_FAILURE(status)) (or conversely, U_SUCCESS(status)).
I couldn't get the conversion to work for the JIS_X201 character set in the ISO-2022-JP encoding. And I couldn't generate a "valid" one using any tools at my disposal - I tried Java (ICU and non-ICU implementations of ISO-2022) and C++.
So I basically just wrote a function to do a code lookup and convert to Unicode using the table on Wikipedia.
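A rough sketch of what such a lookup can look like (in Python for brevity; the mapping shown is the standard JIS X 0201 layout, and the function name is made up):
# JIS X 0201: the right half (0xA1-0xDF) maps linearly onto halfwidth
# katakana U+FF61-U+FF9F; the Roman half is ASCII except 0x5C and 0x7E.
def jisx0201_to_unicode(data):
    out = []
    for b in data:  # data is a bytes object
        if 0xA1 <= b <= 0xDF:
            out.append(chr(0xFF61 + b - 0xA1))  # halfwidth katakana
        elif b == 0x5C:
            out.append('\u00a5')                # yen sign
        elif b == 0x7E:
            out.append('\u203e')                # overline
        elif b < 0x80:
            out.append(chr(b))                  # plain ASCII
        else:
            out.append('\ufffd')                # byte outside the set
    return ''.join(out)

print(jisx0201_to_unicode(b"ABC\xa6\xa7"))  # ABCｦｧ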
EDIT
As I started filling out the bug report, I wanted to include the RFC for ISO-2022-JP. Then I found this line in the RFC: "The Kana set of JIS X 0201 is not used in ISO-2022-JP messages." So it appears that the standard doesn't actually define the upper bits. ISO-2022-JP-3 WILL map the upper bits, but to the lower plane. So I would have to take each byte and subtract 0x80 from it and pass it through an ISO-2022-JP-3 converter, and take the other bytes < 128 and pass them through an ISO-2022-JP converter for the full JIS_X201 character set. Well, it's a lot easier to just do it myself.
So strictly speaking I would say it's not a bug. It's a huge headache though.
P.S. The whole messed-up stream that I'm trying to decode comes from DICOM. See page 107 of the PDF to see what they consider acceptable.

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/m(.{1})ori/i', $page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for UTF-8 mode in regexes.
You're better off on the whole doing an iconv('utf-8', 'ascii//TRANSLIT', $string) on both the name and the search term and comparing those.
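A rough Python analogue of that iconv transliteration idea, for comparison (Python 3; the fold_diacritics name is made up):
import unicodedata

def fold_diacritics(s):
    # NFD splits 'ā' into 'a' + a combining macron; drop the marks.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(c))

print(fold_diacritics('M\u0101ori'))  # Maori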
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to use two dots there to catch it instead, or {1,4}. And even then you're going to have to verify that the up-to-four bytes you grab between the M and the o form a single valid UTF-8 character. This is all moot if PHP does Unicode right; I haven't used it in years, so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways: one as a single character (U+0101) and one as TWO Unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely only ever going to get the former, but be aware that the latter is also possible.
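That difference is easy to demonstrate; a quick Python 3 sketch using unicodedata:
import unicodedata

composed = '\u0101'     # ā as a single code point
decomposed = 'a\u0304'  # 'a' plus a combining macron
print(composed == decomposed)                                # False
print(unicodedata.normalize('NFC', decomposed) == composed)  # True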
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexps.