re.compile(r' [Б]$').search(' Б') returns None - python-2.7

It seems that re.compile(r' [Б]$').search(' Б') returns None even though it should return <_sre.SRE_Match object; span=(0, 2), match=' Б'>.
This happens when running it on python2, but not on python3, and it happens only with a Unicode symbol (I tried Cyrillic and Chinese). It works fine with Latin symbols.
sashoalm#HP:~/$ python2
Python 2.7.17 (default)
>>> import re
>>> print(re.compile(r' [Б]$').search(' Б'))
None
Any idea what is happening? Is it a real bug or is it supposed to fail?

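OK, I realized what is happening after reading https://snarky.ca/why-python-3-exists/ (the part about bytes). In Python 2 a plain string literal is a byte string, so the UTF-8-encoded Б is stored as two bytes, \xd0\x91. The character class [Б] is therefore really [\xd0\x91], which matches exactly one of those two bytes. The pattern has to match a space, then only one of the two bytes, then the end of the string, and that fails because a second byte is still left over.
This means that print(re.compile(r' [Б][Б]$').search(' Б')) will match. The cleaner fix is to make both the pattern and the subject unicode strings, e.g. re.compile(ur' [Б]$').search(u' Б').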
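The byte-level explanation above can be reproduced in Python 3 by encoding both the pattern and the subject to UTF-8 by hand. This is only a sketch of the mechanism, not the original Python 2 session; the variable names are made up for illustration.

```python
import re

text = " Б"
raw = text.encode("utf-8")  # the bytes a Python 2 str literal would hold

# Byte pattern: the class [Б] becomes [\xd0\x91], i.e. two one-byte alternatives.
byte_pattern = re.compile(" [Б]$".encode("utf-8"))
byte_match = byte_pattern.search(raw)

# Text pattern: the class holds a single code point.
text_match = re.compile(r" [Б]$").search(text)

print(raw)                # b' \xd0\x91'
print(byte_match)         # None
print(text_match.span())  # (0, 2)
```

The byte pattern fails for the same reason as in Python 2: after the class consumes one byte, a second byte still stands between the match and the end of the string.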

Related

Replace all emojis from a given unicode string

I have a list of unicode symbols from the emoji package. My end goal is to create a function that takes a unicode string as input, e.g. some👩😌thing, and removes all emojis, giving "something". Below is a demonstration of what I want to achieve:
from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ... = 'something'
While trying to do this, I came across some strange behavior, demonstrated below. I believe that if the code below is fixed, I will be able to achieve my end goal.
import regex as re
print u'\U0001F469' # 👩
print u'\U0001F60C' # 😌
print u'\U0001F469\U0001F60C' # 👩😌
text = u'some\U0001F469\U0001F60Cthing'
print text # some👩😌thing
# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
In most builds of Python 2.7, Unicode codepoints above 0xFFFF are encoded as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469'), which returns 2 on such builds.
The best way to solve this is to move to a version of Python that properly treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 for this, and the recent versions of Python 3 will do it automatically.
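For reference, the surrogate pair for a given code point follows from the UTF-16 encoding rules. The helper below is a minimal sketch (the name surrogate_pair is made up for illustration) and runs on Python 3:

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are stored as surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

woman = surrogate_pair(0x1F469)
relieved = surrogate_pair(0x1F60C)
print([hex(unit) for unit in woman])     # ['0xd83d', '0xdc69']
print([hex(unit) for unit in relieved])  # ['0xd83d', '0xde0c']
```

Note that both emoji from the question share the same high surrogate, 0xD83D, and differ only in the low one.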
To create a regular expression to use for the replace, simply join all the characters together with |. Since the list of characters already is encoded with surrogate pairs it will create the proper string.
subs = u'|'.join(exclude_list)
print re.sub(subs, u'', text)
The old 2.7 regex engine gets confused because:
On narrow builds, Python 2.7 stores Unicode strings as 16-bit words, so codepoints above U+FFFF are automatically substituted by surrogate pairs.
Before the regex "sees" your Python string, Python has already helpfully split your large Unicode codepoints into two separate characters (each on its own a valid, but incomplete, single Unicode character).
That means that [\U0001f469]+ is really a character class of 2 characters (the high and the low surrogate), and it removes any run of those two. Because 😌 starts with the same high surrogate as 👩 but has a different low surrogate, half of 😌 gets removed and the other half stays behind. That leads to your badly formed output.
This fixes it:
print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text) # something
# Removing only "👩" now works as well:
print re.sub(ur'(\U0001f469)+', u'', text) # some😌thing
because now the regex engine sees the exact same sequence of characters – surrogate pairs or otherwise – that you are looking for.
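The difference between the character class and the group can be simulated on Python 3 by spelling strings out as UTF-16 code units, which is roughly what a narrow 2.7 build stores. This is a sketch under that assumption; to_code_units is a hypothetical helper, not part of any library.

```python
import re

def to_code_units(s):
    """Return a str with one element per UTF-16 code unit (narrow-build view)."""
    data = s.encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join(chr(unit) for unit in units)

text = to_code_units("some\U0001F469\U0001F60Cthing")
woman = to_code_units("\U0001F469")

# Character class: matches either surrogate on its own, so it eats half of the
# second emoji as well.
class_result = re.sub("[" + woman + "]+", "", text)
# Group: matches only the exact two-unit sequence.
group_result = re.sub("(?:" + woman + ")+", "", text)

print(len(woman))                                            # 2
print(class_result == "some\ude0cthing")                     # True
print(group_result == to_code_units("some\U0001F60Cthing"))  # True
```

The lone \ude0c left by the class version is the half character that shows up as � in the question's output.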
If you want to remove all emoji from the exclude_list, you can explicitly loop over its contents and replace one by one:
exclude_list = UNICODE_EMOJI.keys()
for bad in exclude_list: # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
    if bad in text:
        print 'Removing '+bad
        text = text.replace(bad, '')
print text
Removing 👩
Removing 😌
something
(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
To remove all emojis from the input string using the current approach, use
import re
from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
rx = ur"(?:{})+".format("|".join(map(re.escape,exclude_list)))
print re.sub(rx, u'', text)
# => u'something'
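If you do not re.escape the emoji chars, you will get a "nothing to repeat" error: some entries contain characters that are regex metacharacters (such as a literal * in a keycap sequence), and these end up directly after an alternation operator inside the group. That is why map(re.escape,exclude_list) is required.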
Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2.
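For comparison, here is roughly the same approach on Python 3. To keep the sketch self-contained it does not import the emoji package; exclude_list is a small hand-written stand-in for UNICODE_EMOJI.keys(), and the keycap sequence in it is only there to show why escaping matters. Sorting longer entries first is an extra precaution of mine, not something the answer above does.

```python
import re

# Hypothetical stand-in for UNICODE_EMOJI.keys(): two emoji and the keycap "*".
exclude_list = ["\U0001F469", "\U0001F60C", "*\ufe0f\u20e3"]

# Without escaping, the leading "*" follows "|" directly and cannot be compiled.
try:
    re.compile("|".join(exclude_list))
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Longer alternatives first, so multi-character sequences win over their prefixes.
alternatives = sorted(exclude_list, key=len, reverse=True)
rx = re.compile("(?:{})+".format("|".join(map(re.escape, alternatives))))

cleaned = rx.sub("", "some\U0001F469\U0001F60Cthing *\ufe0f\u20e3 else")
print(unescaped_ok)   # False
print(repr(cleaned))  # 'something  else'
```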

How to interpret Unicode notation in Python?

How to convert formal Unicode notation like 'U+1F600' into something like this: '\U0001F600', which I saw represented as 'Python Src' at websites online?
My end goal is to use Unicode for emojis in Python (2.x), and I am able to achieve it in this way:
unicode_string = '\U0001F600'
unicode_string.decode('unicode-escape')
I would appreciate if you could mention the different character sets involved in the above problem.
The simplest way to do it is to just treat the notation as a string:
>>> s = 'U+1F600'
>>> s[2:] # chop off the U+
'1F600'
>>> s[2:].rjust(8, '0') # pad it to 8 characters with 0s
'0001F600'
>>> r'\U' + s[2:].rjust(8, '0') # prepend the `\U`
'\\U0001F600'
It might be a bit cleaner to parse the string as hex and then format the resulting number back out:
>>> int(s[2:], 16)
128512
>>> n = int(s[2:], 16)
>>> rf'\U{n:08X}'
'\\U0001F600'
… but I'm not sure it's really any easier to understand that way.
If you need to extract these from a larger string, you probably want a regular expression.
We want to match a literal U+ followed by 1 to 8 hex digits, right? So, that's U\+[0-9a-fA-F]{1,8}. Except we really don't need to include the U+ just to pull it off with [2:], so let's group the rest of it: U\+([0-9a-fA-F]{1,8}).
>>> import re
>>> s = 'Hello U+1F600 world'
>>> re.search(r'U\+([0-9a-fA-F]{1,8})', s)
<_sre.SRE_Match object; span=(6, 13), match='U+1F600'>
>>> re.search(r'U\+([0-9a-fA-F]{1,8})', s).group(1)
'1F600'
Now, we can use re.sub with a function to apply the \U prepending and rjust padding:
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', lambda match: r'\U' + match.group(1).rjust(8, '0'), s)
'Hello \\U0001F600 world'
That's probably more readable if you define the function out-of-line:
>>> def padunimatch(match):
...     return r'\U' + match.group(1).rjust(8, '0')
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', padunimatch, s)
'Hello \\U0001F600 world'
Or, if you prefer to do it numerically:
>>> def padunimatch(match):
...     n = int(match.group(1), 16)
...     return rf'\U{n:08X}'
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', padunimatch, s)
'Hello \\U0001F600 world'
And of course you already know how to do the last part, because it's in your question, right? Well, not quite: you can't call decode on a string, only on a bytes. The simplest way around this is to use the codec directly:
>>> import codecs
>>> x = 'Hello \\U0001F600 world'
>>> codecs.decode(x, 'unicode_escape')
'Hello 😀 world'
… unless you're using Python 2. In that case, the str type isn't a Unicode string, it's a byte-string, so decode actually works fine. But in Python 2, you'll run into other problems, unless all of your text is pure ASCII (with any non-ASCII characters encoded as U+xxxx sequences).
For example, let's say your input was:
>>> s = 'Hej U+1F600 världen'
In Python 3, that's fine. That s is a Unicode string. Under the covers, my console is sending Python UTF-8-encoded bytes to standard input and expecting to get UTF-8-encoded bytes back from standard output, but that just works like magic. (Well, not quite magic—you can print(sys.stdin.encoding, sys.stdout.encoding) to see that Python knows my console is UTF-8 and uses that to decode and encode on my behalf.)
In Python 2, it's not. If my console is UTF-8, what I've actually done is equivalent to:
>>> s = 'Hej U+1F600 v\xc3\xa4rlden'
After the U+ substitution, that string becomes:
>>> s = 'Hej \U0001F600 v\xc3\xa4rlden'
… and if I try to decode that as unicode-escape, Python 2 will treat those \xc3 and \xa4 bytes as Latin-1 bytes, rather than UTF-8, so what you end up with is:
>>> s.decode('unicode_escape')
u'Hej \U0001f600 v\xc3\xa4rlden'
>>> print(s.decode('unicode_escape'))
Hej 😀 vÃ¤rlden
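The same Latin-1 misreading can be reproduced on Python 3 by starting from UTF-8 bytes, since the unicode_escape codec there also reads every non-escape byte as Latin-1. A small sketch, with variable names chosen for illustration:

```python
# UTF-8 bytes containing a literal backslash-U escape, like the Python 2 str above.
raw = "Hej \\U0001F600 världen".encode("utf-8")

# unicode_escape expands the \U escape but reads the two bytes of "ä" as Latin-1.
misread = raw.decode("unicode_escape")

print(b"\xc3\xa4" in raw)                              # True
print(misread == "Hej \U0001F600 v\u00c3\u00a4rlden")  # True
```

The two characters U+00C3 and U+00A4 are what shows up as "Ã¤" in place of "ä".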
But what if you try to decode it as UTF-8 first, and then decode that as unicode_escape?
>>> s.decode('utf-8')
u'Hej \\U0001F600 v\xe4rlden'
>>> print(s.decode('utf-8'))
Hej \U0001F600 världen
>>> s.decode('utf-8').decode('unicode-escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 16: ordinal not in range(128)
Unlike Python 3, which just won't let you call decode on a Unicode string, Python 2 lets you do it—but it handles it by trying to encode to ASCII first, so it has something to decode, and that obviously fails here.
And you can't just use the codec directly, the way you can in Python 3:
>>> codecs.decode(s.decode('utf-8'), 'unicode_escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 16: ordinal not in range(128)
You could decode the UTF-8, then unicode-escape the result, then un-unicode-escape everything, but even that isn't quite right:
>>> print(s.decode('utf-8').encode('unicode_escape').decode('unicode_escape'))
Hej \U0001F600 världen
Why? Because unicode-escape, while fixing our existing Unicode character, also escaped our backslash!
If you know you definitely have no \U escapes in the original source that you didn't want parsed, there's a quick fix for this: just replace the escaped backslash:
>>> print(s.decode('utf-8').encode('unicode_escape').replace(r'\\U', r'\U').decode('unicode_escape'))
Hej 😀 världen
If this all seems like a huge pain… well, yeah, that's why Python 3 exists, because dealing with Unicode properly in Python 2 (and notice that I didn't even really deal with it properly…) is a huge pain.
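On Python 3, one way to sidestep the unicode_escape round trip entirely is to turn each U+XXXX match straight into a character with chr(). This is a different technique from the one walked through above, sketched under the assumption that the input is already a proper str; expand_u_notation is a made-up name.

```python
import re

U_NOTATION = re.compile(r"U\+([0-9a-fA-F]{1,8})")

def expand_u_notation(text):
    """Replace each U+XXXX sequence with the character it names."""
    def to_char(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            # Not a valid code point: leave the text untouched.
            return match.group(0)
        return chr(code_point)
    return U_NOTATION.sub(to_char, text)

result = expand_u_notation("Hej U+1F600 världen")
print(result == "Hej \U0001F600 världen")  # True
```

As with the regex in the answer, the {1,8} quantifier is greedy, so hex digits that directly follow a code (as in U+1F600abc) would be swallowed into it.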

Python: ascii codec can't encode en-dash

I'm trying to print a poem from the Poetry Foundation's daily poem RSS feed with a thermal printer that supports an encoding of CP437. This means I need to translate some characters; in this case an en-dash to a hyphen. But python won't even encode the en dash to begin with. When I try to decode the string and replace the en-dash with a hyphen I get the following error:
Traceback (most recent call last):
  File "pftest.py", line 46, in <module>
    str = str.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 140: ordinal not in range(128)
And here is my code:
#!/usr/bin/python
#-*- coding: utf-8 -*-
# This string is actually a variable entitled d['entries'][1].summary_detail.value
str = """Love brought by night a vision to my bed,
One that still wore the vesture of a child
But eighteen years of age – who sweetly smiled"""
str = str.decode('utf-8')
str = str.replace("\u2013", "-") #en dash
str = str.replace("\u2014", "--") #em dash
print (str)
I can actually print the output using the following code without errors in my terminal window (Mac), but my printer spits out sets of 3 CP437 characters:
str = u''.str.encode('utf-8')
I'm using Sublime Text as my editor, and I've saved the page with UTF-8 encoding, but I'm not sure that will help things. I would greatly appreciate any help with this code. Thank you!
I don't fully understand what's happening in your code, but I've also been trying to replace en-dashes with hyphens in a string I got from the Web, and here's what's working for me. My code is just this:
txt = re.sub(u"\u2013", "-", txt)
I'm using Python 2.7 and Sublime Text 2, but I don't bother setting -*- coding: utf-8 -*- in my script, as I'm trying not to introduce any new encoding issues. (Even though my variables may contain Unicode I like to keep my code pure ASCII.) Do you need to include Unicode in your .py file, or was that just to help with debugging?
I'll note that my txt variable is already a unicode string, i.e.
print type(txt)
produces
<type 'unicode'>
I'd be curious to know what type(str) would produce in your case.
One thing I noticed in your code is
str = str.replace("\u2013", "-") #en dash
Are you sure that does anything? My understanding is that \u only means "unicode character" inside a u"" string, and what you've created there is a string with 6 characters: a backslash, a "u", a "2", a "0", etc. (The backslash stays because \u has no special meaning in a plain string, unlike '\n' or '\t', and Python leaves escapes it doesn't recognize unchanged.)
Also, the fact that you get 3 CP437 characters from your printer makes me suspect that you still have an en-dash in your string. The UTF-8 encoding of an en-dash is 3 bytes: 0xe2 0x80 0x93. When you call str.encode('utf-8') on a unicode string that contains an en-dash you get those three bytes in the returned string. I'm guessing that your terminal knows how to interpret that as an en-dash and that's what you're seeing.
If you can't get my first method to work, I'll mention that I also had success with this:
txt = txt.encode('utf-8')
txt = re.sub("\xe2\x80\x93", "-", txt)
Maybe that re.sub() would work for you if you put it after your call to encode(). And in that case you might not even need that call to decode() at all. I'll confess that I really don't understand why it's there.
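The three-character symptom and the replacement can both be checked in a few lines. The sketch below uses Python 3, where the text is already unicode, and assumes the printer simply wants CP437 bytes; the variable names are illustrative only.

```python
line = "But eighteen years of age \u2013 who sweetly smiled"

# An en dash is three bytes in UTF-8; a CP437 device shows each byte separately.
utf8_bytes = "\u2013".encode("utf-8")
misread = utf8_bytes.decode("cp437")

# Replace the dashes first, then encode for the printer.
printable = line.replace("\u2013", "-").replace("\u2014", "--")
printer_bytes = printable.encode("cp437")

print(utf8_bytes)     # b'\xe2\x80\x93'
print(len(misread))   # 3
print(printer_bytes)  # b'But eighteen years of age - who sweetly smiled'
```

Encoding to CP437 without the replacement would raise UnicodeEncodeError, because that code page has no en dash.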

Cannot compile 8 digit unicode regex ranges in Python 2.7 re

Using Python 2.7, re
I'm trying to compile unicode character classes. I can get it to work with 4-digit ranges (u'\uxxxx') but not with 8-digit ones (u'\Uxxxxxxxx').
The following works:
re.compile(u'[\u0010-\u0012]')
The following does not:
re.compile(u'[\U00010000-\U00010001]')
The resultant error is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
error: bad character range
It appears to be an issue with 8 digit ranges only as the following works:
re.compile(u'\U00010000')
Separate question: I am new to Stack Overflow and I am really struggling with how to post questions. I would expect that traceback to appear on multiple lines, not on one line. I would also like to be able to paste in content copied from the interpreter, but this UI makes a mess out of '>>>'.
Don't know how to add this in a comment editing question.
The expression I really want to compile is:
re.compile(u'[\U00010000-\U0010FFFF]')
Expanding it with list(u'[\U00010000-\U0010FFFF]') looks pretty intractable as far as extending the suggested workaround:
>>> list(u'[\U00010000-\U0010FFFF]')
[u'[', u'\ud800', u'\udc00', u'-', u'\udbff', u'\udfff', u']']
Depending on the compilation option, Python 2 may store Unicode strings as UTF-16 code units, and thus \U00010000 is actually a two-code-unit string:
>>> list(u'[\U00010000-\U00010001]')
[u'[', u'\ud800', u'\udc00', u'-', u'\ud800', u'\udc01', u']']
The regex parser thus sees the character class containing \udc00-\ud800 which is a "bad character range". In this setting I can't think of a solution other than to match the surrogate pairs explicitly (after ensuring sys.maxunicode == 0xffff):
>>> r = re.compile(u'\ud800[\udc00-\udc01]')
>>> r.match(u'\U00010000')
<_sre.SRE_Match object at 0x10cf6f440>
>>> r.match(u'\U00010001')
<_sre.SRE_Match object at 0x10cf4ed98>
>>> r.match(u'\U00010002')
>>> r.match(u'\U00020000')
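For the full range the question asks about, the same idea generalizes to "any high surrogate followed by any low surrogate". The sketch below runs on Python 3, where the direct range also compiles, and simulates narrow-build strings with a hypothetical to_code_units helper.

```python
import re

def to_code_units(s):
    """Return a str with one element per UTF-16 code unit (narrow-build view)."""
    data = s.encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join(chr(unit) for unit in units)

# Works where strings hold whole code points (Python 3, wide Python 2 builds).
direct = re.compile("[\U00010000-\U0010FFFF]")
# Narrow-build equivalent: one high surrogate followed by one low surrogate.
surrogate_pairs = re.compile("[\ud800-\udbff][\udc00-\udfff]")

print(direct.match("\U0001F600") is not None)                              # True
print(direct.match("a") is None)                                           # True
print(surrogate_pairs.fullmatch(to_code_units("\U0001F600")) is not None)  # True
print(surrogate_pairs.match("ab") is None)                                 # True
```

Checking sys.maxunicode first, as suggested above, tells you which of the two patterns applies on a given interpreter.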

python different result from IDLE and python script

I have tried the following in Python 2.7 shell:
>>> from nltk.stem.isri import ISRIStemmer
>>> st = ISRIStemmer()
>>> string = u'\u062D\u064E\u062F\u0651\u064E\u062B\u064E\u0646\u064E\u0627'
>>> st.stem(string)
u'\u062d\u062f\u062b'
So basically, I am trying to obtain:
u'\u062d\u062f\u062b'
from
u'\u062D\u064E\u062F\u0651\u064E\u062B\u064E\u0646\u064E\u0627'
using nltk's arabic stemmer, which works!
However, when I try to accomplish the exact thing through a python script, it fails to stem any of the words in the list, tokens :
#!/c/Python27/python
# -*- coding: utf8 -*-
import nltk
import nltk.data
from nltk.stem.isri import ISRIStemmer
#In my script, I tokenize the following string
commasection = '\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
#The tokenizing works
tokens = nltk.word_tokenize(commasection)
st = ISRIStemmer()
for word in tokens:
    #But the stemming of each word in tokens doesn't work????
    print st.stem(word)
#Should display
#u'\u0623\u062e\u0628\u0631'
#u'\u0628\u0634\u0631'
#u'\u0628\u0646'
#u'\u0647\u0644\u0644'
#But it just shows whatever is in commasection
I need my python code to stem all words in tokens. But I don't get how the simpler example running in python shell works but not this script.
I have noticed that in the shell scenario, there is that 'u' in front of the sequence of unicode, so I tried all sorts of encodings/decodings and read a lot about it all night long (pulled an all-nighter on this one), but this python script is just not stemming word from tokens like the python shell!!!
If anyone can please help me make my script display the correct result I would be super super appreciative
Unicode escapes only work in unicode literals.
commasection = u'\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
Ignacio is correct that I have to have unicode literals in order for the stemming to work, but since I am grabbing this string dynamically, I had to find a way to convert what I get dynamically
i.e. '\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
into a unicode string, i.e. the equivalent of
u'\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
(notice the u in front)
This can be done with unichr() (http://infohost.nmt.edu/tcc/help/pubs/python/web/unichr-function.html):
word = "".join([unichr(int(x, 16)) for x in word.split("\\u") if x !=""])
So basically I split each word on the literal \u, convert every hex code to the unicode character it stands for, and join the characters back together. And my stemmer works!
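A regex-based variant of the same conversion is sketched below for Python 3, where chr() plays the role of unichr(). Unlike the split-based one-liner it leaves spaces and any other text between the escapes alone, so it can be applied before tokenizing. The name unescape_u is made up for illustration.

```python
import re

U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unescape_u(text):
    """Turn literal backslash-u escapes in text into the characters they name."""
    return U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Raw string: these are literal backslashes, as in dynamically fetched text.
escaped = r"\u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F"
decoded = unescape_u(escaped)

print(len(escaped), len(decoded))  # 61 11
```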