Unicode support in regular expression during group capturing in python

I am currently using re2, re and pcre for regular expression matching in Python. A pattern such as re.compile("(?P(\S*))") compiles without error, but when the group name contains a Unicode character, as in re.compile("(?P<årsag>(\S*))"), compilation fails with an error. Is there any Python library that fully supports Unicode in group names?
Edit: please see my output:
>>> import regex
>>> m = regex.compile(r"(?P<årsag>(\S*))")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/regex.py", line 331, in compile
    return _compile(pattern, flags, kwargs)
  File "/usr/local/lib/python2.7/site-packages/regex.py", line 499, in _compile
    caught_exception.pos)
_regex_core.error: bad character in group name at position 10

You need to use the external regex module, which supports Unicode characters in the names of named capturing groups.
>>> import regex
>>> m = regex.compile(r"(?P<årsag>(\S*))")
>>> m.search('foo').group('årsag')
'foo'
>>> m.search('foo bar').group('årsag')
'foo'
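As a point of comparison, and assuming Python 3 rather than the Python 2.7 shown in the question, the standard re module accepts any valid Python identifier as a group name in a str pattern, so the same pattern should compile without the external module. A minimal sketch (the sample text and variable names are made up for illustration):

```python
import re

# Assumes Python 3: group names in str patterns may be any valid identifier,
# including non-ASCII ones such as "årsag".
pattern = re.compile(r"(?P<årsag>(\S*))")

match = pattern.search("foo bar")
captured = match.group("årsag")    # first run of non-whitespace characters
named_groups = match.groupdict()   # maps the Unicode group name to its capture
print(captured, named_groups)
```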

Related

Python 3 regex from file

I'm trying to parse barely formatted text into a price list.
I store a bunch of regex patterns in a file, looking like this:
[^S](7).*(\+|(plus)|➕).*(128)
When I try to check whether there is a match like this:
def trMatch(line):
    for tr in trs:
        nr = re.compile(tr.nameReg, re.IGNORECASE)
        cr = re.compile(tr.colourReg, re.IGNORECASE)
        if nr.search(line.text) is not None:
            doStuff()
I get an error
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 10, in go
  File "<stdin>", line 3, in trMatch
  File "/usr/lib/python3.5/re.py", line 224, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python3.5/re.py", line 292, in _compile
    raise TypeError("first argument must be string or compiled pattern")
TypeError: first argument must be string or compiled pattern
I assume it can't compile the pattern because it is missing the 'r' flag.
Is there a proper way to make this method cooperate?
Thanks!

The r"" syntax is not mandatory for working with regular expressions - this is just a helper syntax for escaping fewer characters, but it results in the same string. See What exactly do "u" and "r" string flags do, and what are raw string literals?
I'm not sure what trs is in your code, but it's a good guess that tr.nameReg and tr.colourReg are not strings: try to debug or print them and make sure they have the correct value.

It turns out that re.compile (and re.search) does not skip null patterns, as I had assumed. I added a simple check that there is a valid pattern and a string to search in.
Works like a charm.
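The check described above might look something like the following sketch. The helper names compile_or_none and line_matches are hypothetical, not from the original code, and the sample pattern is a simplified version of the one in the question:

```python
import re

def compile_or_none(pattern):
    """Hypothetical helper: compile a pattern, or return None if it is missing."""
    # re.compile(None) raises TypeError, so filter out non-strings and empty strings.
    if not isinstance(pattern, str) or not pattern:
        return None
    return re.compile(pattern, re.IGNORECASE)

def line_matches(pattern, text):
    """Hypothetical helper: True only when both pattern and text are usable and match."""
    compiled = compile_or_none(pattern)
    if compiled is None or not isinstance(text, str):
        return False
    return compiled.search(text) is not None

missing = line_matches(None, "iPhone 7 Plus 128")
found = line_matches(r"(7).*(\+|plus).*(128)", "iPhone 7 Plus 128")
print(missing, found)
```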

Why is it that input throws a different exception type when given non english text

In Python 2.7, input is not meant for strings and will throw a NameError when the user feeds it Latin text.
If it is instead given non-English text (say Cyrillic), it throws a SyntaxError.
Why is that?

Think of input as an expression evaluator (which is unsafe, like eval, and has been replaced by a strict string input in Python 3).
So it's like typing identifiers in the console:
>>> fac
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
NameError: name 'fac' is not defined
>>>
fac is a valid identifier (it is plain ASCII, among other considerations) but it is not defined, hence the NameError.
Now, using French accents:
>>> féc
File "<interactive input>", line 1
fÃ©c
^
SyntaxError: invalid syntax
Note that parsing stopped at the first non-ASCII character, so no identifier was looked up: the SyntaxError came first.
So it seems you need raw_input() to be able to enter strings (or enter your Cyrillic strings between quotes).
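Python 2's input() is documented as equivalent to eval(raw_input()), so the behaviour can be emulated with eval. The sketch below is written for Python 3 so that it runs today; exception_for is a hypothetical helper. One assumption to keep in mind: Python 3 accepts non-ASCII identifiers, so an unquoted Cyrillic word gives a NameError there rather than the SyntaxError seen in Python 2.7.

```python
def exception_for(text):
    """Hypothetical helper: evaluate text as Python 2's input() would and report the outcome."""
    try:
        eval(text, {})   # unsafe on untrusted input; shown for illustration only
    except NameError:
        return "NameError"
    except SyntaxError:
        return "SyntaxError"
    return "no error"

results = {
    "fac": exception_for("fac"),             # valid identifier, but undefined
    "'привет'": exception_for("'привет'"),   # quoted, so it is a string literal
    "1 +": exception_for("1 +"),             # not a complete expression
}
print(results)
```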

Error while searching for a Pattern in the input using Python 3

I am trying to count the occurrences of patterns of the form 1[0]1 in an input number, that is, any number of zeros between two 1's, as in 89310001898.
I wrote code to do this in Python 3.5 using regular expressions (re), as follows:
>>> import re
>>> input1 = 3787381001
>>> pattern = re.compile('[10*1]')
>>> pattern.search(input1)
But this throws the following error:
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    pattern.match(input1)
TypeError: expected string or bytes-like object
Is there a workaround to clearly identify whether the pattern 1[0]1 is present in the input number?

The [10*1] pattern is a character class: it matches a single character that is a 1, a 0 or a *. Also, a regex engine only looks for matches inside text, so it needs a string as the input argument.
Remove the square brackets and pass a string to re, not an integer.
import re
input1 = '3787381001'
pattern = re.compile('10*1')
m = pattern.search(input1)
if m:
    print(m.group())
See the Python demo
Note: if you need to get multiple occurrences, with overlapping matches (e.g. if you need to get 1001 and 10001 from 23100100013), you need to use re.findall(r'(?=(10*1))', input1).
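To count the occurrences, as the question asks, the lookahead form can be wrapped in a small function that also converts the number to a string first. This is a sketch; find_one_zero_one is a hypothetical name:

```python
import re

def find_one_zero_one(number):
    """Hypothetical helper: return every '1, zero or more 0s, 1' run, overlaps included."""
    digits = str(number)   # re needs a string, not an int
    # The lookahead consumes no characters, so a 1 that ends one match can start the next.
    return re.findall(r'(?=(10*1))', digits)

single = find_one_zero_one(3787381001)
overlapping = find_one_zero_one(23100100013)
print(len(single), len(overlapping))
```

Note that 10*1 also matches two adjacent 1's (zero zeros); use 10+1 if at least one zero is required.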

Cannot compile 8 digit unicode regex ranges in Python 2.7 re

I'm using Python 2.7 and the re module.
I'm trying to compile Unicode character classes. I can get it to work with 4-digit ranges (u'\uxxxx') but not with 8-digit ranges (u'\Uxxxxxxxx').
The following works:
re.compile(u'[\u0010-\u0012]')
The following does not:
re.compile(u'[\U00010000-\U00010001]')
The resultant error is:
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
error: bad character range
It appears to be an issue with 8 digit ranges only as the following works:
re.compile(u'\U00010000')
Separate question: I am new to Stack Overflow and I am really struggling with how to post questions. I would expect that Traceback to appear on multiple lines, not on one line. I would also like to be able to paste in content copied from the interpreter, but this UI makes a mess of '>>>'.
I don't know how to add this in a comment, so I am editing the question.
The expression I really want to compile is:
re.compile(u'[\U00010000-\U0010FFFF]')
Expanding it with list(u'[\U00010000-\U0010FFFF]') makes extending the suggested workaround look pretty intractable:
>>> list(u'[\U00010000-\U0010FFFF]')
[u'[', u'\ud800', u'\udc00', u'-', u'\udbff', u'\udfff', u']']

Depending on the compilation option, Python 2 may store Unicode strings as UTF-16 code units, and thus \U00010000 is actually a two-code-unit string:
>>> list(u'[\U00010000-\U00010001]')
[u'[', u'\ud800', u'\udc00', u'-', u'\ud800', u'\udc01', u']']
The regex parser thus sees the character class containing \udc00-\ud800 which is a "bad character range". In this setting I can't think of a solution other than to match the surrogate pairs explicitly (after ensuring sys.maxunicode == 0xffff):
>>> r = re.compile(u'\ud800[\udc00-\udc01]')
>>> r.match(u'\U00010000')
<_sre.SRE_Match object at 0x10cf6f440>
>>> r.match(u'\U00010001')
<_sre.SRE_Match object at 0x10cf4ed98>
>>> r.match(u'\U00010002')
>>> r.match(u'\U00020000')
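The surrogate pairs shown above follow from the standard UTF-16 arithmetic, which can be sketched as below. The function name to_surrogates is hypothetical, and the code is plain integer arithmetic, so it runs on any Python build:

```python
def to_surrogates(code_point):
    """Hypothetical helper: split a code point above U+FFFF into its UTF-16 (high, low) pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low

first = to_surrogates(0x10000)
last = to_surrogates(0x10FFFF)
print([hex(unit) for unit in first + last])
```

Since the full range runs from the pair (d800, dc00) to the pair (dbff, dfff), on a narrow build it can presumably be matched as any high surrogate followed by any low surrogate, i.e. u'[\ud800-\udbff][\udc00-\udfff]', rather than by enumerating pairs.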

Regular expression in Python 2.7

Another question:
I'm trying to search for a specific pattern in a file, but I have to deal with the following case.
This line is interpreted correctly:
f27 = re.findall( b'\x03\x00\x00\x27''(.*?)''\xF7\x00\xF0', s)
but this one is interpreted badly, because \x28 is the '(' parenthesis character:
f28 = re.findall( b'\x03\x00\x00\x28''(.*?)''\xF7\x00\xF0', s)
Traceback (most recent call last):
  File "", line 1, in
  File "D:\Portable Python 2.7.2.1\App\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
  File "D:\Portable Python 2.7.2.1\App\lib\re.py", line 244, in _compile
    raise error, v # invalid expression
error: unbalanced parenthesis
I tried several escapes with '\' and '/', but nothing worked.
Is there a solution?
Thanks

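Try using raw bytestrings. In a normal literal, Python turns \x28 into a literal ( before re sees it, which unbalances the pattern. In a raw literal the backslash sequence reaches re untouched, and the re module itself understands escape sequences, so it treats \x28 as a literal byte.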
f28 = re.findall(br'\x03\x00\x00\x28(.*?)\xF7\x00\xF0', s)
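A small sketch of the difference, written for Python 3 with made-up sample bytes. The re.DOTALL flag and the re.escape variant are additions for illustration, not part of the original answer:

```python
import re

data = b'\x03\x00\x00\x28payload\xF7\x00\xF0'   # made-up sample bytes

# Raw literal: re receives the characters \x28 and treats them as one literal byte.
# DOTALL is an assumption here, so that binary payloads containing newlines still match.
raw_pattern = re.compile(br'\x03\x00\x00\x28(.*?)\xF7\x00\xF0', re.DOTALL)
raw_result = raw_pattern.findall(data)

# Alternative: escape the literal delimiters, then add the capturing group.
escaped = re.escape(b'\x03\x00\x00\x28') + b'(.*?)' + re.escape(b'\xF7\x00\xF0')
escaped_result = re.findall(escaped, data, re.DOTALL)

# Non-raw literal: \x28 is already "(" when re sees it, so the pattern is unbalanced.
try:
    re.compile(b'\x03\x00\x00\x28(.*?)\xF7\x00\xF0')
    non_raw_compiles = True
except re.error:
    non_raw_compiles = False

print(raw_result, escaped_result, non_raw_compiles)
```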
