Cannot compile 8 digit unicode regex ranges in Python 2.7 re - regex

Using Python 2.7, re
I'm trying to compile Unicode character classes. I can get it to work with 4-digit ranges (u'\uxxxx') but not with 8-digit ones (u'\Uxxxxxxxx').
The following works:
re.compile(u'[\u0010-\u0012]')
The following does not:
re.compile(u'[\U00010000-\U00010001]')
The resultant error is:
Traceback (most recent call last):
File "", line 1, in
File "C:\Python27\lib\re.py", line 190, in compile
return _compile(pattern, flags)
File "C:\Python27\lib\re.py", line 242, in _compile
raise error, v # invalid expression
error: bad character range
It appears to be an issue with 8 digit ranges only as the following works:
re.compile(u'\U00010000')
Separate question: I am new to Stack Overflow and I am really struggling with how to post questions. I would expect that traceback to appear on multiple lines, not on one line. I would also like to be able to paste in content copied from the interpreter, but this UI makes a mess of '>>>'.
Don't know how to add this in a comment editing question.
The expression I really want to compile is:
re.compile(u'[\U00010000-\U0010FFFF]')
Expanding it with list(u'[\U00010000-\U0010FFFF]') looks pretty intractable as far as extending the suggested workaround:
>>> list(u'[\U00010000-\U0010FFFF]')
[u'[', u'\ud800', u'\udc00', u'-', u'\udbff', u'\udfff', u']']

Depending on the compilation option, Python 2 may store Unicode strings as UTF-16 code units, and thus \U00010000 is actually a two-code-unit string:
>>> list(u'[\U00010000-\U00010001]')
[u'[', u'\ud800', u'\udc00', u'-', u'\ud800', u'\udc01', u']']
The regex parser thus sees the character class containing \udc00-\ud800 which is a "bad character range". In this setting I can't think of a solution other than to match the surrogate pairs explicitly (after ensuring sys.maxunicode == 0xffff):
>>> r = re.compile(u'\ud800[\udc00-\udc01]')
>>> r.match(u'\U00010000')
<_sre.SRE_Match object at 0x10cf6f440>
>>> r.match(u'\U00010001')
<_sre.SRE_Match object at 0x10cf4ed98>
>>> r.match(u'\U00010002')
>>> r.match(u'\U00020000')
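The surrogate arithmetic behind this can be sketched as follows. This is a minimal illustration written for Python 3, not something taken from the answer above: to_surrogate_pair and full_range are hypothetical names, and the resulting pattern text has not been tried on a narrow Python 2.7 build.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


first = to_surrogate_pair(0x10000)    # start of the range in the question
last = to_surrogate_pair(0x10FFFF)    # end of the range in the question

# The whole supplementary range is "any high surrogate followed by any low
# surrogate", so it can be written as two 4-digit character classes.
# This shortcut only works because the range covers every surrogate pair.
full_range = "[\\u%04x-\\u%04x][\\u%04x-\\u%04x]" % (
    first[0], last[0], first[1], last[1])
print(full_range)
```

The low surrogate of the first character (0xdc00) is greater than the high surrogate of the last one (0xdbff), which is why the parser sees a reversed range when both pairs are placed inside a single character class.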

Related

I have a string "192.192", why am I not able to match this using '(\d{1,3})\.\1'?

String in question:
ipAddressString = "192.192.10.5/24"
I'm trying to match 192.192 in the above string.
a) The code below gives an error; I don't understand why \1 is not matching the second 192:
>>> print re.search('(\d{1,3})\.\1',ipAddressString).group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
I was expecting the output to be : 192.192
b) Whereas when I use the regex below, it matches 192.192 as expected. As I understand it, the regex in point a) should have yielded the same .group() output:
>>> print re.search('(\d{1,3})\.(\d{1,3})',ipAddressString).group()
192.192
List of escape sequences available in Python 3
Those are the escapes Python interprets when it parses a string literal.
Any other backslash sequence (such as \d or \.) is left in the string as is.
So, if you give it a string like '(\d{1,3})\.\1'
it interprets the \1 as a character with an octal value of 1.
\ooo Character with octal value ooo
So this is what you get
>>> import re
>>> ipAddressString = "192.192.10.5/24"
>>> hh = re.search('(\d{1,3})\.\1',ipAddressString)
>>> print (hh)
None
>>> print ('(\d{1,3})\.\1')
(\d{1,3})\.☺
The regex engine sees this (\d{1,3})\.☺ which is not an error
but it doesn't match what you want.
Ways around this:
Escape the escape on the octal
'(\d{1,3})\.\\1'
Make the string a raw string with syntax
either a raw double r"(\d{1,3})\.\1" or a raw single r'(\d{1,3})\.\1'
Using the first method we get:
>>> import re
>>> ipAddressString = "192.192.10.5/24"
>>> hh = re.search('(\d{1,3})\.\\1',ipAddressString)
>>> print (hh)
<re.Match object; span=(0, 7), match='192.192'>
>>> print ('(\d{1,3})\.\\1')
(\d{1,3})\.\1
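The second method (a raw string) is not shown above, so here is a minimal sketch of it for Python 3. The variable names are only illustrative.

```python
import re

ip_address_string = "192.192.10.5/24"

# In a normal literal, \1 is an octal escape: one control character, U+0001.
assert "\1" == "\x01" and len("\1") == 1

# In a raw literal the backslash survives, so the regex engine receives
# the two characters '\' and '1' and treats them as a back reference.
pattern = r"(\d{1,3})\.\1"
assert pattern.endswith("\\1")

match = re.search(pattern, ip_address_string)
print(match.group())
```

Writing every pattern as a raw string avoids having to remember which backslash sequences Python itself interprets.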
Just a side note: most regex engines also recognize octal sequences.
But to differentiate an octal from a back reference, an engine usually requires a leading \0 followed by a 2 or 3 digit octal (\0000-\0377, for example); some engines don't require it and will accept both forms.
Thus, there is a gray area of overlap.
Some engines will flag the item (for example \2) when they find
an ambiguity, then, when finished parsing the regex, revisit the item
and mark it as a back reference if the group exists, or an octal
if it doesn't. Perl is famous for this.
In general, each engine handles the issue of octal vs back reference
in its own bizarre way. It's always a gotcha waiting to happen.

Replace all emojis from a given unicode string

I have a list of Unicode symbols from the emoji package. My end goal is to create a function that takes a Unicode string as input, e.g. some👩😌thing, and removes all emojis, giving "something". Below is a demonstration of what I want to achieve:
from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ... = 'something'
I have been trying to do the above, and in that process, I came across a strange behavior which I demonstrate below, as you can see. I believe if the code below is fixed, then I will be able to achieve my end goal.
import regex as re
print u'\U0001F469' # 👩
print u'\U0001F60C' # 😌
print u'\U0001F469\U0001F60C' # 👩😌
text = u'some\U0001F469\U0001F60Cthing'
print text # some👩😌thing
# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
In most builds of Python 2.7, Unicode code points above 0xFFFF are stored as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469'), which returns 2 on such a build.
The best way to solve this is to move to a version of Python that properly treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 for this, and the recent versions of Python 3 will do it automatically.
To create a regular expression to use for the replace, simply join all the characters together with |. Since the list of characters already is encoded with surrogate pairs it will create the proper string.
subs = u'|'.join(exclude_list)
print re.sub(subs, u'', text)
The old 2.7 regex engine gets confused because:
Narrow builds of Python 2.7 store Unicode strings as 16-bit code units, so code points above U+FFFF are automatically replaced by surrogate pairs.
Before the regex "sees" your Python string, Python has already split each large code point into two separate code units (each one half of a surrogate pair, not a complete character on its own).
That means that [\U0001f469]+ is a character class of 2 code units, and it removes either of them wherever it occurs. Both occur in 👩, but 😌 shares only the first one (the high surrogate), so half of 😌 is removed and the other half is left behind. That leads to your badly formed output.
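The effect can be reproduced on Python 3 by working on UTF-16 code units directly. This is only an emulation of what a narrow build does, under the assumption that the character class behaves like a set of code units; utf16_units is a hypothetical helper.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<{}H".format(len(data) // 2), data))


text_units = utf16_units("some\U0001F469\U0001F60Cthing")

# On a narrow build, [\U0001f469] is a class holding these two code units.
class_units = set(utf16_units("\U0001F469"))

# Emulate the substitution: drop every code unit that is in the class.
remaining = [u for u in text_units if u not in class_units]

# Any surrogate still present has lost its partner.
leftover = [hex(u) for u in remaining if 0xD800 <= u <= 0xDFFF]
print(leftover)
```

The single unit that survives is the low surrogate of 😌, which is what shows up as the replacement character in the output.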
This fixes it:
print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text) # something
# Removing only "👩" did not work with the character class ...
print re.sub(ur'(\U0001f469)+', u'', text)
# ... and now it does:
some😌thing
because now the regex engine sees the exact same sequence of characters – surrogate pairs or otherwise – that you are looking for.
If you want to remove all emoji from the exclude_list, you can explicitly loop over its contents and replace one by one:
exclude_list = UNICODE_EMOJI.keys()
for bad in exclude_list: # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
    if bad in text:
        print 'Removing '+bad
        text = text.replace(bad, '')
Removing 👩
Removing 😌
something
(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
To remove all emojis from the input string using the current approach, use
import re
from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
rx = ur"(?:{})+".format("|".join(map(re.escape,exclude_list)))
print re.sub(rx, u'', text)
# => u'something'
If you do not re.escape the emoji chars, you will get a "nothing to repeat" error, because some entries contain characters that the regex engine treats as operators inside the group, so map(re.escape,exclude_list) is required.
Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2.
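For comparison, a sketch of the same escaped-alternation approach on Python 3, where each emoji is a single code point. The short exclude_list here is a hypothetical stand-in for the keys of the emoji package's table, and remove_emojis is an illustrative name; sorting longest first is an added assumption so that multi-character entries are tried before their prefixes.

```python
import re

# Hypothetical stand-in for the emoji list; the third entry is a keycap
# sequence that starts with a literal '*', which must be escaped.
exclude_list = ["\U0001F469", "\U0001F60C", "*\uFE0F\u20E3"]

def remove_emojis(text, emojis):
    """Remove every run of the given emoji strings from text."""
    # Longest entries first, so a sequence wins over any shorter prefix.
    alternatives = sorted(emojis, key=len, reverse=True)
    rx = re.compile("(?:{})+".format("|".join(map(re.escape, alternatives))))
    return rx.sub("", text)


result = remove_emojis("some\U0001F469\U0001F60Cthing", exclude_list)
print(result)
```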

Python 3 regex from file

I'm trying to parse barely formatted text into a price list.
I store a bunch of regex patterns in a file looking like this:
[^S](7).*(\+|(plus)|➕).*(128)
When I attempt to verify whether there is a match like this:
def trMatch(line):
    for tr in trs:
        nr = re.compile(tr.nameReg, re.IGNORECASE)
        cr = re.compile(tr.colourReg, re.IGNORECASE)
        if (nr.search(line.text) is not None): doStuff()
I get an error
File "<stdin>", line 1, in <module>
File "<stdin>", line 10, in go
File "<stdin>", line 3, in trMatch
File "/usr/lib/python3.5/re.py", line 224, in compile
return _compile(pattern, flags)
File "/usr/lib/python3.5/re.py", line 292, in _compile
raise TypeError("first argument must be string or compiled pattern")
TypeError: first argument must be string or compiled pattern
I assume it can't compile the pattern because it is missing the 'r' prefix.
Is there a proper way to make this method cooperate?
Thanks!
The r"" syntax is not mandatory for working with regular expressions - this is just a helper syntax for escaping fewer characters, but it results in the same string. See What exactly do "u" and "r" string flags do, and what are raw string literals?
I'm not sure what trs is in your code, but it's a good guess that tr.nameReg and tr.colourReg are not strings: try to debug or print them and make sure they have the correct value.
Turns out re.compile doesn't skip null patterns as I assumed. I added a simple check that there is a valid pattern and a string to search in.
Works like a charm.
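A check of that kind might look like the sketch below. The asker's actual code is not shown, so compile_if_present and the sample values are hypothetical.

```python
import re

def compile_if_present(pattern, flags=re.IGNORECASE):
    """Return a compiled regex, or None when there is no usable pattern."""
    if isinstance(pattern, str) and pattern:
        return re.compile(pattern, flags)
    return None


# The pattern from the question, and a missing one (e.g. an empty field).
name_reg = compile_if_present(r"[^S](7).*(\+|(plus)|➕).*(128)")
missing = compile_if_present(None)

line_text = "iPhone 7 plus 128GB"   # made-up input line
matched = name_reg is not None and name_reg.search(line_text) is not None
print(matched, missing)
```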

Error while searching for a Pattern in the input using Python 3

I am trying to find out the number of patterns which are of form 1[0]1 in the input number. The pattern is, there can be any number of zeros in between two 1's as in, 89310001898.
I wrote a code to execute this in Python 3.5 using regular expression (re) which is as follows:
>>> import re
>>> input1 = 3787381001
>>> pattern = re.compile('[10*1]')
>>> pattern.search(input1)
But this throws me the following error:
Traceback (most recent call last):
File "<pyshell#8>", line 1, in <module>
pattern.match(input1)
TypeError: expected string or bytes-like object
Is there some workaround to clearly identify if the above pattern 1[0]1 is present in the input number?
The [10*1] pattern matches a single char that is equal to a 1, 0 or *. Also, a regex engine only looks for matches inside a text, it needs a string as the input argument.
Remove the square brackets and pass a string to the re, not an integer.
import re
input1 = '3787381001'
pattern = re.compile('10*1')
m = pattern.search(input1)
if m:
    print(m.group())
Note: if you need to get multiple occurrences, with overlapping matches (e.g. if you need to get 1001 and 10001 from 23100100013), you need to use re.findall(r'(?=(10*1))', input1).
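To make the difference concrete, here is a short sketch that runs both forms on the sample number from the note, converting the integer to a string first. The variable names are illustrative.

```python
import re

input1 = 23100100013
digits = str(input1)   # the regex engine needs a string, not an int

# Plain findall consumes each match, so the shared '1' cannot start a second one.
non_overlapping = re.findall(r'10*1', digits)

# A lookahead consumes nothing, so every starting position is tried.
overlapping = re.findall(r'(?=(10*1))', digits)

print(non_overlapping, overlapping)
print(len(overlapping))   # number of 1[0]1 patterns, counting overlaps
```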

Unicode support in regular expression during group capturing in python

I am currently using re2, re and pcre for regular expression matching in Python. When I use a regular expression such as re.compile("(?P(\S*))") it compiles without error, but when I use a Unicode character in the group name, as in re.compile("(?P<årsag>(\S*))"), there is an error and the pattern cannot be compiled. Is there any Python library that supports Unicode completely?
Edit: please refer to my output:
>>> import regex
>>> m = regex.compile(r"(?P<årsag>(\S*))")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/regex.py", line 331, in compile
return _compile(pattern, flags, kwargs)
File "/usr/local/lib/python2.7/site-packages/regex.py", line 499, in _compile
caught_exception.pos)
_regex_core.error: bad character in group name at position 10
You need to use the external regex module. The regex module supports Unicode characters in the names of named capturing groups.
>>> import regex
>>> m = regex.compile(r"(?P<årsag>(\S*))")
>>> m.search('foo').group('årsag')
'foo'
>>> m.search('foo bar').group('årsag')
'foo'
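As a side note, on Python 3 the standard re module also accepts such a name in a str pattern, since a group name only has to be a valid Python identifier. A minimal sketch, assuming Python 3; the group name is written with an escape here only to keep the source ASCII.

```python
import re

# "\u00e5rsag" is the group name 'årsag' from the question.
group_name = "\u00e5rsag"
assert group_name.isidentifier()

pattern = re.compile(r"(?P<{}>\S*)".format(group_name))
captured = pattern.search("foo bar").group(group_name)
print(captured)
```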