Specifying high Unicode codepoints in Python regex

In Python 3.3 I have no trouble using ranges of Unicode codepoints within regular expressions:
>>> import re
>>> to_delete = '[\u0020-\u0090\ufb00-\uffff]'
>>> s = 'abcdABCD¯˘¸ðﺉ﹅ffl你我他𐀀𐌈𐆒'
>>> print(s)
abcdABCD¯˘¸ðﺉ﹅ffl你我他𐀀𐌈𐆒
>>> print(re.sub(to_delete, '', s))
¯˘¸ð你我他𐀀𐌈𐆒
It's clean and simple. But if I include a codepoint of five hex digits, that is to say anything above \uffff, such as \u1047f, as the end of a range that begins with a four-digit codepoint, I get an error:
>>> to_delete = '[\u0020-\u0090\ufb00-\u1047f]'
>>> print(re.sub(to_delete, '', s))
...
sre_constants.error: bad character range
There is no error if I start a new five-digit range, but I also do not get the expected behavior:
>>> to_delete = '[\u0020-\u0090\ufb00-\uffff\u10000-\u1047f]'
>>> print(re.sub(to_delete, '', s))
你我他𐀀𐌈𐆒
(The symbols 𐀀𐌈𐆒 are codepoints U+10000, U+10308, and U+10192, respectively, and should have been replaced in the last re.sub operation.)
Following the instructions of the accepted answer:
>>> to_delete = '[\u0020-\u0090\ufb00-\uffff\U00010000-\U0001047F]'
>>> print(re.sub(to_delete, '', s))
¯˘¸ð你我他
Perfect. Uglesome in the extreme, but perfect.

\u only supports 16-bit codepoints. You need to use the 32-bit version, \U. Note that it requires 8 digits, so you have to prepend a few 0s (e.g. \U00010D2B).
Source: http://docs.python.org/3/howto/unicode.html#unicode-literals-in-python-source-code
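If the eight-digit escapes feel too ugly, one alternative is to build the character class from chr() so no escape syntax is needed at all. A sketch (it assumes none of the range endpoints is a regex metacharacter such as ] or \, which would need escaping):

```python
import re

# Build the character class from codepoint pairs instead of \U escapes.
# Assumes no endpoint is a regex metacharacter such as ']' or '\'.
ranges = [(0x0020, 0x0090), (0xFB00, 0xFFFF), (0x10000, 0x1047F)]
to_delete = '[' + ''.join(chr(lo) + '-' + chr(hi) for lo, hi in ranges) + ']'

print(re.sub(to_delete, '', 'abcd\ufb00\U00010000\u4f60'))  # → 你
```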

Related

Matching Windows-1251 encoding character set in RegEx

I need to create a regular expression, that would match only the characters NOT in the windows-1251 encoding character set, to detect if there are any characters in a given piece of text that would violate the encoding. I tried to do it through the [^\u0000-\u044F]+ expression, however it is also matching some characters that are actually in line with the encoding.
Appreciate any help on the issue
No language was specified, but in Python there is no need for a regex: use sets. Create a set of all Unicode code points that are members of Windows-1251 and subtract it from the set of the text. Note that only byte 98h is unused in the Windows-1251 encoding:
>>> # Create the set of characters in code page 1251
>>> cp1251 = set(bytes(range(256)).decode('cp1251',errors='ignore'))
>>> set('This is a test \x98 马') - cp1251
{'\x98', '马'}
As a regular expression:
>>> import re
>>> text = ''.join(cp1251) # string of all Windows-1251 codepoints from previous set
>>> text
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xa0¤¦§©«¬\xad®°±µ¶·»ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяёђѓєѕіїјљњћќўџҐґ–—‘’‚“”„†‡•…‰‹›€№™'
>>> not_cp1251 = re.compile(r'[^\x00-\x7f\xa0\xa4\xa6\xa7\xa9\xab-\xae\xb0\xb1\xb5-\xb7\xbb\u0401-\u040c\u040e-\u044f\u0451-\u045c\u045e\u045f\u0490\u0491\u2013\u2014\u2018-\u201a\u201c-\u201e\u2020-\u2022\u2026\u2030\u2039\u203a\u20ac\u2116\u2122]')
>>> not_cp1251.findall(text) # all cp1251 text finds no outliers
[]
>>> not_cp1251.findall(text+'\x98') # adding known outlier
['\x98']
>>> not_cp1251.findall('马克'+text+'\x98') # adding other outliers
['马', '克', '\x98']
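The hand-written ranges in that pattern can also be derived from the codec itself. A sketch (Python 3) that collapses the cp1251 character set into ranges automatically:

```python
import re

# Derive the "not in cp1251" pattern from the codec itself instead of
# hand-writing the ranges.
cps = sorted(ord(c) for c in bytes(range(256)).decode('cp1251', errors='ignore'))

# Collapse consecutive codepoints into (start, end) runs.
runs = []
start = prev = cps[0]
for cp in cps[1:]:
    if cp != prev + 1:
        runs.append((start, prev))
        start = cp
    prev = cp
runs.append((start, prev))

cls = ''.join(
    re.escape(chr(a)) if a == b else re.escape(chr(a)) + '-' + re.escape(chr(b))
    for a, b in runs
)
not_cp1251 = re.compile('[^' + cls + ']')

print(not_cp1251.findall('This is a test \x98 \u9a6c'))  # ['\x98', '马']
```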

Regex matches string, but doesn't group correctly [duplicate]

While matching an email address, after I match something like yasar@webmail, I want to capture one or more of (\.\w+) (what I am doing is a little bit more complicated; this is just an example). I tried adding (\.\w+)+, but it only captures the last match. For example, yasar@webmail.something.edu.tr matches but only includes .tr after the yasar@webmail part, so I lost the .something and .edu groups. Can I do this in Python regular expressions, or would you suggest matching everything at first and splitting the subpatterns later?
The re module doesn't support repeated captures (the third-party regex module does):
>>> m = regex.match(r'([.\w]+)@((\w+)(\.\w+)+)', 'yasar@webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']
In your case I'd go with splitting the repeated subpatterns later. It leads to simple and readable code; e.g., see the code in @Li-aung Yip's answer.
You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+)
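A short sketch of that fix: the non-capturing (?:...) repeats, the outer group keeps the whole repeated tail, and you can split it afterwards:

```python
import re

# (?:...) repeats without capturing; the outer (...) keeps the whole tail.
m = re.match(r'([\w.]+)@(\w+)((?:\.\w+)+)', 'yasar@webmail.something.edu.tr')
print(m.groups())                         # ('yasar', 'webmail', '.something.edu.tr')
print(m.group(3).lstrip('.').split('.'))  # ['something', 'edu', 'tr']
```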
This will work:
>>> regexp = r"[\w\.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "william.adama@galactica.caprica.fleet.mil"
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)
But it's limited to a maximum of six subgroups. A better way to do this would be:
>>> m = re.match(r"[\w\.]+@(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']
Note that regexps are fine so long as the email addresses are simple - but there are all kinds of things that this will break for. See this question for a detailed treatment of email address regexes.
This is what you are looking for:
>>> import re
>>> s="yasar@webmail.something.edu.tr"
>>> r=re.compile(r"\.\w+")
>>> m=r.findall(s)
>>> m
['.something', '.edu', '.tr']

I have a string "192.192", why am I not able to match this using '(\d{1,3})\.\1'?

String in question:
ipAddressString = "192.192.10.5/24"
I'm trying to match 192.192 in the above string.
a) The code below gives an error; I don't understand why \1 is not matching the second 192:
>>> print re.search('(\d{1,3})\.\1',ipAddressString).group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
I was expecting the output to be : 192.192
b) Whereas when I use the regex below, it matches 192.192 as expected. As I understand it, the regex in point a) should have yielded the same .group() output as this one:
>>> print re.search('(\d{1,3})\.(\d{1,3})',ipAddressString).group()
192.192
See the list of escape sequences available in Python 3. Those escapes are interpolated when Python parses a string literal; any unrecognized escape is left as-is, backslash and all.
So, if you give it a string like '(\d{1,3})\.\1',
Python interpolates the \1 as a character with an octal value of 1:
\ooo Character with octal value ooo
So this is what you get
>>> import re
>>> ipAddressString = "192.192.10.5/24"
>>> hh = re.search('(\d{1,3})\.\1',ipAddressString)
>>> print (hh)
None
>>> print ('(\d{1,3})\.\1')
(\d{1,3})\.☺
The regex engine sees this (\d{1,3})\.☺ which is not an error
but it doesn't match what you want.
Ways around this:
Escape the escape on the octal
'(\d{1,3})\.\\1'
Make the string a raw string with syntax
either a raw double r"(\d{1,3})\.\1" or a raw single r'(\d{1,3})\.\1'
Using the first method we get:
>>> import re
>>> ipAddressString = "192.192.10.5/24"
>>> hh = re.search('(\d{1,3})\.\\1',ipAddressString)
>>> print (hh)
<re.Match object; span=(0, 7), match='192.192'>
>>> print ('(\d{1,3})\.\\1')
(\d{1,3})\.\1
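The raw-string form behaves identically and is usually preferred:

```python
import re

# In a raw string the backslash survives, so the regex engine sees \1
# as a backreference to group 1.
m = re.search(r'(\d{1,3})\.\1', '192.192.10.5/24')
print(m.group())  # 192.192
```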
Just a side note, most regex engines also recognize octal sequences.
But to differentiate an octal from a backreference, it usually requires a leading \0 then a 2- or 3-digit octal (\00-\0377, for example), but sometimes it doesn't and will accept both.
Thus, there is a gray area of overlap.
Some engines will mark the back reference (example \2) when it finds
an ambiguity, then when finished parsing the regex, revisit the item
and mark it as a back reference if the group exists, or an octal
if it doesn't. Perl is famous for this.
In general, each engine handles the issue of octal vs. backreference
in its own bizarre way. It's always a gotcha waiting to happen.
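Python's re module is documented to resolve the ambiguity by digit count: \number is a backreference, but if it starts with 0 or is exactly three octal digits, it is an octal character escape. A quick sketch:

```python
import re

# Backreference: group 1 exists and \1 is a single digit, so it refers back.
print(re.search(r'(a)\1', 'aa').group())  # aa

# Octal escape: exactly three octal digits, so \141 is the character 'a'.
print(re.search(r'\141', 'a').group())    # a
```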

How to interpret Unicode notation in Python?

How to convert formal Unicode notation like 'U+1F600' into something like this: '\U0001F600', which I saw represented as 'Python Src' at websites online?
My end-goal is to use Unicode for emojis in Python(2.x) and I am able to achieve it in this way:
unicode_string = '\U0001F600'
unicode_string.decode('unicode-escape')
I would appreciate if you could mention the different character sets involved in the above problem.
The simplest way to do it is to just treat the notation as a string:
>>> s = 'U+1F600'
>>> s[2:] # chop off the U+
'1F600'
>>> s[2:].rjust(8, '0') # pad it to 8 characters with 0s
'0001F600'
>>> r'\U' + s[2:].rjust(8, '0') # prepend the `\U`
'\\U0001F600'
It might be a bit cleaner to parse the string as hex and then format the resulting number back out:
>>> int(s[2:], 16)
128512
>>> n = int(s[2:], 16)
>>> rf'\U{n:08X}'
'\\U0001F600'
… but I'm not sure it's really any easier to understand that way.
If you need to extract these from a larger string, you probably want a regular expression.
We want to match a literal U+ followed by 1 to 8 hex digits, right? So, that's U\+[0-9a-fA-F]{1,8}. Except we really don't need to include the U+ in the match just to slice it off with [2:], so let's group the rest of it: U\+([0-9a-fA-F]{1,8}).
>>> s = 'Hello U+1F600 world'
>>> re.search(r'U\+([0-9a-fA-F]{1,8})', s)
<_sre.SRE_Match object; span=(6, 13), match='U+1F600'>
>>> re.search(r'U\+([0-9a-fA-F]{1,8})', s).group(1)
'1F600'
Now, we can use re.sub with a function to apply the \U prepending and rjust padding:
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', lambda match: r'\U' + match.group(1).rjust(8, '0'), s)
'Hello \\U0001F600 world'
That's probably more readable if you define the function out-of-line:
>>> def padunimatch(match):
... return r'\U' + match.group(1).rjust(8, '0')
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', padunimatch, s)
'Hello \\U0001F600 world'
Or, if you prefer to do it numerically:
>>> def padunimatch(match):
... n = int(match.group(1), 16)
... return rf'\U{n:08X}'
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', padunimatch, s)
'Hello \\U0001F600 world'
And of course you already know how to do the last part, because it's in your question, right? Well, not quite: you can't call decode on a string, only on a bytes. The simplest way around this is to use the codec directly:
>>> import codecs
>>> x = 'Hello \\U0001F600 world'
>>> codecs.decode(x, 'unicode_escape')
'Hello 😀 world'
… unless you're using Python 2. In that case, the str type isn't a Unicode string, it's a byte-string, so decode actually works fine. But in Python 2, you'll run into other problems, unless all of your text is pure ASCII (with any non-ASCII characters encoded as U+xxxx sequences).
For example, let's say your input was:
>>> s = 'Hej U+1F600 världen'
In Python 3, that's fine. That s is a Unicode string. Under the covers, my console is sending Python UTF-8-encoded bytes to standard input and expecting to get UTF-8-encoded bytes back from standard output, but that just works like magic. (Well, not quite magic—you can print(sys.stdin.encoding, sys.stdout.encoding) to see that Python knows my console is UTF-8 and uses that to decode and encode on my behalf.)
In Python 2, it's not. If my console is UTF-8, what I've actually done is equivalent to:
>>> s = 'Hej U+1F600 v\xc3\xa4rlden'
… and after running the re.sub from above, what I actually have is the byte string (in a Python 2 str literal, \U is not an escape, so the backslash survives):
>>> s = 'Hej \U0001F600 v\xc3\xa4rlden'
If I try to decode that as unicode-escape, Python 2 treats those \xc3 and \xa4 bytes as Latin-1 rather than UTF-8, so what you end up with is:
>>> s.decode('unicode_escape')
u'Hej \U0001f600 v\xc3\xa4rlden'
>>> print(s.decode('unicode_escape'))
Hej 😀 världen
But what if you try to decode it as UTF-8 first, and then decode that as unicode_escape?
>>> s.decode('utf-8')
u'Hej \\U0001F600 v\xe4rlden'
>>> print(s.decode('utf-8'))
Hej \U0001F600 världen
>>> s.decode('utf-8').decode('unicode-escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 16: ordinal not in range(128)
Unlike Python 3, which just won't let you call decode on a Unicode string, Python 2 lets you do it—but it handles it by trying to encode to ASCII first, so it has something to decode, and that obviously fails here.
And you can't just use the codec directly, the way you can in Python 3:
>>> codecs.decode(s.decode('utf-8'), 'unicode_escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 16: ordinal not in range(128)
You could decode the UTF-8, then unicode-escape the result, then un-unicode-escape everything, but even that isn't quite right:
>>> print(s.decode('utf-8').encode('unicode_escape').decode('unicode_escape'))
Hej \U0001F600 världen
Why? Because unicode-escape, while fixing our existing Unicode character, also escaped our backslash!
If you know you definitely have no \U escapes in the original source that you didn't want parsed, there's a quick fix for this: just replace the escaped backslash:
>>> print(s.decode('utf-8').encode('unicode_escape').replace(r'\\U', r'\U').decode('unicode_escape'))
Hej 😀 världen
If this all seems like a huge pain… well, yeah, that's why Python 3 exists, because dealing with Unicode properly in Python 2 (and notice that I didn't even really deal with it properly…) is a huge pain.
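For completeness: in Python 3 you can skip the escape-string round-trip entirely and convert each match straight to its character with chr(). A sketch (the helper name uplus_to_char is my own, not from any library):

```python
import re

def uplus_to_char(text):
    # Replace each U+XXXX token with the actual character it names.
    return re.sub(r'U\+([0-9a-fA-F]{1,8})',
                  lambda m: chr(int(m.group(1), 16)),
                  text)

print(uplus_to_char('Hej U+1F600 världen'))  # Hej 😀 världen
```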

Remove '\x' from string in a text file in Python

This is my first time posting on Stack. I would really appreciate if someone could assist me with this.
I’m trying to remove Unicode characters (\x3a in my case) from a text file containing the following:
10\x3a00\x3a00
The final output is supposed to be:
100000
Basically, we are being instructed to delete all traces of \xXX where X can be any of the following: 0123456789ABCDEF. I tried using regular expressions as follows to delete any \xXX.
re.sub('\\\x[a-fA-F0-9]{2}', "", a)
Where "a" is a line of a text file.
When I try that, I get an error saying "invalid \x escape".
I’ve been struggling with this for hours. What’s wrong with my regular expression?
The escape "\x3a" is not a multi-character sequence in the stored string. It is the single ASCII character ":". Once you have written the literal "\x3a", it is stored internally as ":" — Python isn't seeing any "\" at all. So you can't strip out "\x3a" as a four-character sequence, because Python only sees the single ASCII character ":".
$ python
>>> '\x3a' == ':'
True
>>> "10\x3a00\x3a00" == "10:00:00"
True
Check out the description section of the Wikipedia article on UTF-8. See that characters in the range U+0000-U+007F are encoded as a single ASCII character.
If you want to strip non-ASCII characters then do following:
>>> print u'R\xe9n\xe9'
Réné
>>> ''.join([x for x in u'R\xe9n\xe9' if ord(x) < 127])
u'Rn'
>>> ''.join([x for x in 'Réné' if ord(x) < 127])
'Rn'
If you want to retain European characters but discard Unicode characters with higher code points, then change the 127 in ord(x) < 127 to some higher value.
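For instance, keeping the full Latin-1 range (the cutoff 256 here is an assumption about which "European characters" you want to keep):

```python
# Keep codepoints below 256 (Latin-1); drop everything higher.
s = u'R\xe9n\xe9 \u4e2d'
print(repr(''.join(c for c in s if ord(c) < 256)))  # 'Réné ' (CJK char dropped)
```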
The post "replace 3 byte unicode" has another approach. You can also strip out code point ranges with:
>>> str = u'[\uE000-\uFFFF]'
>>> len(str)
5
>>> import re
>>> pattern = re.compile(u'[\uE000-\uFFFF]', re.UNICODE)
>>> pattern.sub('?', u'ab\uFFFDcd')
u'ab?cd'
Notice that working with \u may be easier than working with \x for specifying characters.
On the other hand, you could have the string "\\x3a" which you could strip out. Of course, that string isn't actually a multi-byte Unicode character but rather 4 ASCII characters.
$ python
>>> print '\\x3a'
\x3a
>>> '\\x3a' == ':'
False
>>> '\\x3a' == '\\' + 'x3a'
True
>>> (len('\x3a'), len('\\x3a'))
(1, 4)
You can also strip out the ASCII character ":":
>>> "10:00:00".replace(":", "")
'100000'
>>> "10\x3a00\x3a00".replace(":", "")
'100000'
>>> "10\x3a00\x3a00".replace("\x3a", "")
'100000'
Try this:
import re

# match a literal backslash, an 'x', and two hex digits
tagRe = re.compile(r'\\x[0-9a-fA-F]{2}')
normalText = tagRe.sub('', myText)
Replace myText with your string. Note that this only helps if the text really contains the four characters \xXX; if Python has already turned \x3a into ':', see the answer above.