Regular expression in Python 2.7 - regex

Another question:
I'm trying to search for a specific pattern in a file, but I have to deal with the following case.
This line is interpreted correctly:
f27 = re.findall( b'\x03\x00\x00\x27''(.*?)''\xF7\x00\xF0', s)
but this one is misinterpreted, because \x28 is the '(' character:
f28 = re.findall( b'\x03\x00\x00\x28''(.*?)''\xF7\x00\xF0', s)
Traceback (most recent call last):
File "", line 1, in
File "D:\Portable Python 2.7.2.1\App\lib\re.py", line 177, in findall
return _compile(pattern, flags).findall(string)
File "D:\Portable Python 2.7.2.1\App\lib\re.py", line 244, in _compile
raise error, v # invalid expression
error: unbalanced parenthesis
I tried several escapes with '\' and '/', but nothing worked.
Is there a solution?
Thanks

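Try using raw bytestrings. In a normal bytes literal, Python turns \x28 into '(' before the re module ever sees the pattern. In a raw literal the backslash sequence reaches re intact, and the re module itself understands such escape sequences.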
f28 = re.findall(br'\x03\x00\x00\x28(.*?)\xF7\x00\xF0', s)
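A minimal sketch of the difference, written in Python 3 syntax. The sample buffer `s` and the variable names are made up for illustration. The last lines show `re.escape` as one possible alternative when the delimiters are already available as bytes objects.

```python
import re

# Hypothetical sample data: header 03 00 00 28, a payload, trailer F7 00 F0.
s = b'\x03\x00\x00\x28payload\xF7\x00\xF0'

# Non-raw literal: Python converts \x28 to '(' first, so re sees "((.*?)".
try:
    re.compile(b'\x03\x00\x00\x28(.*?)\xF7\x00\xF0')
    non_raw_compiles = True
except re.error:
    non_raw_compiles = False

# Raw literal: the characters \x28 reach re, which reads them as one literal byte.
f28 = re.findall(br'\x03\x00\x00\x28(.*?)\xF7\x00\xF0', s)

# Alternative: escape delimiters that are held as ordinary bytes objects.
header, trailer = b'\x03\x00\x00\x28', b'\xF7\x00\xF0'
f28_escaped = re.findall(re.escape(header) + b'(.*?)' + re.escape(trailer), s)

print(non_raw_compiles, f28, f28_escaped)
```

Note that `.` does not match a newline byte unless `re.DOTALL` is passed, which may matter for binary payloads.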

Related

Python 3 regex from file

I'm trying to parse barely formatted text into a price list.
I store a bunch of regex patterns in a file that looks like this:
[^S](7).*(\+|(plus)|➕).*(128)
When I attempt to check whether there is a match like this:
def trMatch(line):
    for tr in trs:
        nr = re.compile(tr.nameReg, re.IGNORECASE)
        cr = re.compile(tr.colourReg, re.IGNORECASE)
        if nr.search(line.text) is not None:
            doStuff()
I get an error
File "<stdin>", line 1, in <module>
File "<stdin>", line 10, in go
File "<stdin>", line 3, in trMatch
File "/usr/lib/python3.5/re.py", line 224, in compile
return _compile(pattern, flags)
File "/usr/lib/python3.5/re.py", line 292, in _compile
raise TypeError("first argument must be string or compiled pattern")
TypeError: first argument must be string or compiled pattern
I assume it can't compile a pattern because it is missing the 'r' prefix.
Is there a proper way to make this method cooperate?
Thanks!
The r"" syntax is not mandatory for working with regular expressions - this is just a helper syntax for escaping fewer characters, but it results in the same string. See What exactly do "u" and "r" string flags do, and what are raw string literals?
I'm not sure what trs is in your code, but it's a good guess that tr.nameReg and tr.colourReg are not strings: try to debug or print them and make sure they have the correct value.
Turns out the re module doesn't skip null patterns as I assumed. I added a simple check that there is a valid pattern and a string to search in.
Works like a charm.
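The check described above might look roughly like this. It is only a sketch: the `Rule` tuple is a hypothetical stand-in for whatever objects `trs` holds, and `first_matching_rule` is an illustrative name, not the poster's code.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the objects stored in `trs`.
Rule = namedtuple("Rule", ["nameReg", "colourReg"])

def first_matching_rule(text, rules):
    """Return the first rule whose nameReg matches text, skipping empty patterns."""
    if not text:
        return None
    for rule in rules:
        # re.compile(None) raises TypeError, so skip anything
        # that is not a non-empty string.
        if not isinstance(rule.nameReg, str) or not rule.nameReg:
            continue
        if re.search(rule.nameReg, text, re.IGNORECASE):
            return rule
    return None

rules = [Rule(None, None), Rule(r"[^S](7).*(\+|(plus)).*(128)", None)]
hit = first_matching_rule("iPhone 7 Plus 128GB", rules)
miss = first_matching_rule("no match here", rules)
```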

Why is it that input throws a different exception type when given non english text

input is not meant for strings (Python 2.7) and will throw a NameError when the user feeds it Latin text.
If it is instead given non-English text (say Cyrillic), it will throw a SyntaxError.
Why is that?
Think of input as an expression evaluator (it is unsafe, like eval, and has been replaced by a strict string input in Python 3).
So it's like you're typing identifiers in the console:
>>> fac
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
NameError: name 'fac' is not defined
>>>
fac is a valid identifier (it is plain ASCII, among other requirements) but it is not defined, hence the NameError.
Now, using French accents:
>>> féc
File "<interactive input>", line 1
féc
^
SyntaxError: invalid syntax
Note that parsing stopped at the first non-ASCII character, so no identifier was ever looked up: the SyntaxError came first.
So it seems you need raw_input() to be able to enter strings (or enter your Cyrillic strings between quotes).
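Python 2's input() behaves like eval(raw_input()), so the parse-before-lookup order can be imitated with eval. The sketch below runs on Python 3, where non-ASCII identifiers are legal and féc would therefore raise NameError rather than SyntaxError. For that reason it uses "fa c" as a stand-in for text that fails to parse. The helper name `classify` is made up for illustration.

```python
def classify(text):
    """Evaluate text roughly as Python 2's input() would and name the outcome."""
    try:
        eval(text, {})
    except SyntaxError:
        # The text could not even be parsed, so no name was looked up.
        return "SyntaxError"
    except NameError:
        # The text parsed as an identifier, but nothing is bound to it.
        return "NameError"
    return "ok"

results = [classify("fac"), classify("fa c"), classify("'fac'")]
print(results)
```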

Unicode support in regular expression during group capturing in python

I am currently using re2, re and pcre for regular expression matching in Python. When I use a regular expression such as re.compile("(?P(\S*))") it compiles without error, but when I use a Unicode character in the group name, as in re.compile("(?P<årsag>(\S*))"), it fails to compile. Is there any Python library that supports Unicode completely?
Edit: please see my output:
>>> import regex
>>> m = regex.compile(r"(?P<årsag>(\S*))")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/regex.py", line 331, in compile
return _compile(pattern, flags, kwargs)
File "/usr/local/lib/python2.7/site-packages/regex.py", line 499, in _compile
caught_exception.pos)
_regex_core.error: bad character in group name at position 10
You need to use the external regex module, which supports Unicode characters in the names of named capturing groups:
>>> import regex
>>> m = regex.compile(r"(?P<årsag>(\S*))")
>>> m.search('foo').group('årsag')
'foo'
>>> m.search('foo bar').group('årsag')
'foo'
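As a side note, on Python 3 the standard re module should also accept this group name for str patterns, since group names there only need to be valid identifiers. A minimal sketch, assuming Python 3 and a str pattern (not the Python 2.7 setup from the question):

```python
import re

# Assumes Python 3: group names in str patterns may be non-ASCII identifiers.
m = re.compile(r"(?P<årsag>\S*)")
first = m.search("foo bar").group("årsag")
print(first)
```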

Cannot compile 8 digit unicode regex ranges in Python 2.7 re

Using Python 2.7, re.
I'm trying to compile Unicode character classes. I can get it to work with 4-digit ranges (u'\uxxxx') but not 8-digit ones (u'\Uxxxxxxxx').
The following works:
re.compile(u'[\u0010-\u0012]')
The following does not:
re.compile(u'[\U00010000-\U00010001]')
The resultant error is:
Traceback (most recent call last):
File "", line 1, in
File "C:\Python27\lib\re.py", line 190, in compile
return _compile(pattern, flags)
File "C:\Python27\lib\re.py", line 242, in _compile
raise error, v # invalid expression
error: bad character range
It appears to be an issue with 8-digit ranges only, as the following works:
re.compile(u'\U00010000')
Separate question: I am new to Stack Overflow and I am really struggling with how to post questions. I would expect that traceback to appear on multiple lines, not on one line. I would also like to be able to paste in content copied from the interpreter, but this UI makes a mess of '>>>'.
I don't know how to add this in a comment, so I am editing the question.
The expression I really want to compile is:
re.compile(u'[\U00010000-\U0010FFFF]')
Expanding it with list(u'[\U00010000-\U0010FFFF]') suggests that extending the suggested workaround to this range would be pretty intractable:
>>> list(u'[\U00010000-\U0010FFFF]')
[u'[', u'\ud800', u'\udc00', u'-', u'\udbff', u'\udfff', u']']
Depending on the compilation option, Python 2 may store Unicode strings as UTF-16 code units, and thus \U00010000 is actually a two-code-unit string:
>>> list(u'[\U00010000-\U00010001]')
[u'[', u'\ud800', u'\udc00', u'-', u'\ud800', u'\udc01', u']']
The regex parser thus sees the character class containing \udc00-\ud800 which is a "bad character range". In this setting I can't think of a solution other than to match the surrogate pairs explicitly (after ensuring sys.maxunicode == 0xffff):
>>> r = re.compile(u'\ud800[\udc00-\udc01]')
>>> r.match(u'\U00010000')
<_sre.SRE_Match object at 0x10cf6f440>
>>> r.match(u'\U00010001')
<_sre.SRE_Match object at 0x10cf4ed98>
>>> r.match(u'\U00010002')
>>> r.match(u'\U00020000')
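The surrogate values in the listings above follow from the UTF-16 encoding rule, which the sketch below reproduces. The function name `to_surrogates` is made up for illustration. Because U+10000 to U+10FFFF spans every high surrogate (D800 to DBFF) and every low surrogate (DC00 to DFFF), the analogous narrow-build workaround for the full range would presumably be u'[\ud800-\udbff][\udc00-\udfff]'. That is an assumption, not something tested here. The last lines assume Python 3.3 or later, where strings are sequences of code points and the range compiles directly.

```python
import re

def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

low_end = to_surrogates(0x10000)
high_end = to_surrogates(0x10FFFF)
print([hex(v) for v in low_end + high_end])

# Python 3.3+ only: no surrogate pairs are involved, so this compiles as written.
astral = re.compile('[\U00010000-\U0010FFFF]')
```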

How to replace regex with 2 groups

I have a problem with a regex.
My code is:
self.file = re.sub(r'([^;{}]{1}\s*)[\n]|([;{}]\s*[\n])',r'\1\2',self.file)
I need to replace this:
TJumpMatchArray *skipTableMatch
);
void computeCharJumps(string *str
with this:
TJumpMatchArray *skipTableMatch );
void computeCharJumps(string *str
I need to preserve whitespace, and I need to replace all newlines '\n' that do not follow {, } or ; with ''.
I think the problem may be that Python (I'm using Python 3.2.3) does not evaluate the alternatives in parallel, so if the first group doesn't match it fails with this:
File "cha.py", line 142, in <module>
maker.editFileContent()
File "cha.py", line 129, in editFileContent
self.file = re.sub(r'([^;{}]{1}\s*)[\n]|([;{}]\s*[\n])',r'\1|\2',self.file)
File "/usr/local/lib/python3.2/re.py", line 167, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/local/lib/python3.2/re.py", line 286, in filter
return sre_parse.expand_template(template, match)
File "/usr/local/lib/python3.2/sre_parse.py", line 813, in expand_template
raise error("unmatched group")
In this online regex tool it works: Example here
The reason why I use:
|([;{}]\s*[\n])
is that if I have:
'; \n'
it replaces the:
' \n'
with '', and I need to keep the same format after {};.
Is there any way to fix this?
The problem is that for every match found, only one of the two groups participates; the other is None, and the template r'\1\2' cannot expand a group that did not match.
Consider this simplified example:
>>> import re
>>>
>>> def replace(match):
... print(match.groups())
... return "X"
...
>>> re.sub("(a)|(b)", replace, "-ab-")
('a', None)
(None, 'b')
'-XX-'
As you can see, the replacement function is called twice: once with the second group set to None, and once with the first set to None.
If you use a function to replace your matches (as in my example), you can easily check which of the groups matched.
Example:
re.sub(r'([^;{}]{1}\s*)[\n]|([;{}]\s*[\n])', lambda m: m.group(1) or m.group(2), self.file)
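Here is a self-contained version of that call, run on the sample from the question. The wrapper name `join_lines` and the exact sample string are assumptions for illustration. In particular, any indentation in the original file is not reproduced.

```python
import re

PATTERN = r'([^;{}]{1}\s*)[\n]|([;{}]\s*[\n])'

def join_lines(text):
    """Drop newlines that do not follow ';', '{' or '}'; keep the others."""
    # Exactly one of the two groups took part in each match; the other is None.
    return re.sub(PATTERN, lambda m: m.group(1) or m.group(2), text)

sample = "TJumpMatchArray *skipTableMatch\n);\nvoid computeCharJumps(string *str\n"
joined = join_lines(sample)
print(repr(joined))
```

Since Python 3.5, unmatched groups in a replacement template are replaced with an empty string, so the plain r'\1\2' template from the question should also work on newer interpreters.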