python different result from IDLE and python script - python-2.7

I have tried the following in Python 2.7 shell:
>>> from nltk.stem.isri import ISRIStemmer
>>> st = ISRIStemmer()
>>> string = u'\u062D\u064E\u062F\u0651\u064E\u062B\u064E\u0646\u064E\u0627'
>>> st.stem(string)
u'\u062d\u062f\u062b'
So basically, I am trying to obtain:
u'\u062d\u062f\u062b'
from
u'\u062D\u064E\u062F\u0651\u064E\u062B\u064E\u0646\u064E\u0627'
using nltk's arabic stemmer, which works!
However, when I try to accomplish the exact same thing in a Python script, it fails to stem any of the words in the list tokens:
#!/c/Python27/python
# -*- coding: utf8 -*-
import nltk
import nltk.data
from nltk.stem.isri import ISRIStemmer
#In my script, I tokenize the following string
commasection = '\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
#The tokenizing works
tokens = nltk.word_tokenize(commasection)
st = ISRIStemmer()
for word in tokens:
    # But the stemming of each word in tokens doesn't work????
    print st.stem(word)
#Should display
#u'\u0623\u062e\u0628\u0631'
#u'\u0628\u0634\u0631'
#u'\u0628\u0646'
#u'\u0647\u0644\u0644'
#But it just shows whatever is in commasection
I need my Python code to stem all the words in tokens, but I don't get why the simpler example works in the Python shell and this script doesn't.
I have noticed that in the shell scenario there is that 'u' in front of the sequence of unicode escapes, so I tried all sorts of encodings/decodings and read a lot about it all night long (pulled an all-nighter on this one), but this Python script just won't stem the words in tokens the way the Python shell does!
If anyone can help me make my script display the correct result, I would be super appreciative.

Unicode escapes only work in unicode literals.
commasection = u'\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
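A quick shell check makes the difference visible (a minimal illustration; in a Python 2 byte string the escape stays as six literal characters):
>>> len('\u062D')   # byte string: backslash, 'u', '0', '6', '2', 'D'
6
>>> len(u'\u062D')  # unicode literal: one character
1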

Ignacio is correct that I have to have unicode literals for the stemming to work, but since I am grabbing this string dynamically, I had to find a way to convert what I get at runtime,
i.e. '\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
into a unicode string with the escapes interpreted, i.e.
u'\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
(notice the u in front)
This can be done with unichr() (http://infohost.nmt.edu/tcc/help/pubs/python/web/unichr-function.html):
word = "".join([unichr(int(x, 16)) for x in word.split("\\u") if x !=""])
So basically I grab the numeric codes and form the unicode character while maintaining the unicode escape. And my stemmer works!
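For reference, the same conversion can be done in one step with Python 2's built-in unicode_escape codec (an alternative sketch, shown here only for comparison):
>>> raw = '\\u062D\\u064E\\u062F'  # literal backslash-u sequences
>>> raw.decode('unicode_escape')
u'\u062d\u064e\u062f'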

Related

Matching Windows-1251 encoding character set in RegEx

I need to create a regular expression that would match only the characters NOT in the Windows-1251 character set, to detect whether a given piece of text contains any characters that would violate the encoding. I tried the [^\u0000-\u044F]+ expression, but it also matches some characters that are actually in line with the encoding.
Appreciate any help on the issue
No language was specified, but in Python there is no need for a regex; use sets. Create the set of all Unicode code points that are members of Windows-1251 and subtract it from the set of characters in the text. Note that only byte 98h is unused in the Windows-1251 encoding:
>>> # Create the set of characters in code page 1251
>>> cp1251 = set(bytes(range(256)).decode('cp1251',errors='ignore'))
>>> set('This is a test \x98 马') - cp1251
{'\x98', '马'}
As a regular expression:
>>> import re
>>> text = ''.join(sorted(cp1251)) # string of all Windows-1251 codepoints from the previous set
>>> text
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xa0¤¦§©«¬\xad®°±µ¶·»ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяёђѓєѕіїјљњћќўџҐґ–—‘’‚“”„†‡•…‰‹›€№™'
>>> not_cp1251 = re.compile(r'[^\x00-\x7f\xa0\xa4\xa6\xa7\xa9\xab-\xae\xb0\xb1\xb5-\xb7\xbb\u0401-\u040c\u040e-\u044f\u0451-\u045c\u045e\u045f\u0490\u0491\u2013\u2014\u2018-\u201a\u201c-\u201e\u2020-\u2022\u2026\u2030\u2039\u203a\u20ac\u2116\u2122]')
>>> not_cp1251.findall(text) # all cp1251 text finds no outliers
[]
>>> not_cp1251.findall(text+'\x98') # adding known outlier
['\x98']
>>> not_cp1251.findall('马克'+text+'\x98') # adding other outliers
['马', '克', '\x98']

Split using regex in python

I want to split the file name out of the given path using the regex function re.split. Please find the details below:
SVC_DC = 'JHN097567898_01102019_050514_svc_dc.tar'
My solution:
import regex as re
ans=re.split(os.sep,SVC_DC)
Error: re.error: bad escape (end of pattern) at position 0
Thanks in advance
If you want a filename, regexes are not your answer.
Python has the pathlib module dedicated to handling filepaths. Its objects, besides having methods to get the isolated filename handling all possible corner cases, also have methods to open, list files, and do everything one normally does with a file.
To get the base filename from a path, just use its automatic properties:
In [1]: import pathlib
In [2]: name = pathlib.Path("/home/user/JHN097567898_01102019_050514_svc_dc.tar")
In [3]: name.name
Out[3]: 'JHN097567898_01102019_050514_svc_dc.tar'
In [4]: name.parent
Out[4]: PosixPath('/home/user')
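Related properties give the other common pieces (these are standard pathlib attributes):
In [5]: name.suffix
Out[5]: '.tar'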
Otherwise, even if you did not use pathlib, os.path.sep is a single character, so there would be no advantage in using re.split at all - a normal string.split would do. Actually, there is os.path.split as well, which, predating pathlib, has always done the same thing:
In [6]: name = "/home/user/JHN097567898_01102019_050514_svc_dc.tar"
In [7]: import os
In [8]: os.path.split(name)[-1]
Out[8]: 'JHN097567898_01102019_050514_svc_dc.tar'
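os.path.basename is a shorthand for the same final component:
In [9]: os.path.basename(name)
Out[9]: 'JHN097567898_01102019_050514_svc_dc.tar'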
And last (and in this case, actually least), the reason for the error is that you are on Windows, and your os.path.sep character is "\" - this character alone is not a full regular expression, as the regex engine expects a character indicating a special sequence to come after the "\". To use it without an error, you'd need to do:
re.split(re.escape(os.path.sep), "myfilepath")
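For instance, with a hard-coded Windows-style separator (hard-coded here because the live result depends on your platform):
>>> import re
>>> re.split(re.escape('\\'), 'C:\\dir\\file.tar')
['C:', 'dir', 'file.tar']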
The reason for your failure is a detail of regular expressions, namely escaping.
E.g. under Windows os.sep is '\\', i.e. a single backslash.
But the backslash has a special meaning in a regex (it escapes special characters), so in order to use it literally, you have to write it twice.
Try the following code:
import re
import os
SVC_DC = 'JHN097567898_01102019_050514_svc_dc.tar'
print(re.split(os.sep * 2, SVC_DC))
The result is:
['JHN097567898_01102019_050514_svc_dc.tar']
As the source string does not contain any backslashes, the result
is a list containing only one item (the whole source string).
Edit
To make the regex working under both Windows and Unix, you can try:
print(re.split('\\' + os.sep, SVC_DC))
Note that this regex contains:
a hard-coded backslash as the escape character,
the path separator used in the current operating system.
Note that the forward slash (in the Unix case) does not require escaping, but escaping it is still acceptable (not needed, but working).
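A quick check on a Unix system (an assumption; there os.sep is '/'):
>>> import re, os
>>> re.split('\\' + os.sep, '/home/user/file.tar')
['', 'home', 'user', 'file.tar']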

Replace all emojis from a given unicode string

I have a list of unicode symbols from the emoji package. My end goal is to create a function that takes a unicode string as input, e.g. some👩😌thing, and removes all emojis, leaving "something". Below is a demonstration of what I want to achieve:
from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ...  # desired: 'something'
I have been trying to do the above, and in the process I came across the strange behavior demonstrated below. I believe that if the code below is fixed, I will be able to achieve my end goal.
import regex as re
print u'\U0001F469' # 👩
print u'\U0001F60C' # 😌
print u'\U0001F469\U0001F60C' # 👩😌
text = u'some\U0001F469\U0001F60Cthing'
print text # some👩😌thing
# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
In narrow builds of Python 2.7 (a common default), Unicode codepoints above 0xFFFF are stored as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469').
The best way to solve this is to move to a version of Python that treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 as a wide build for this, and recent versions of Python 3 do it automatically.
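For example, on a narrow build (wide builds and Python 3.3+ report a length of 1):
>>> s = u'\U0001F469'
>>> len(s)
2
>>> s[0], s[1]
(u'\ud83d', u'\udc69')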
To create a regular expression to use for the replace, simply join all the characters together with |. Since the characters in the list are already encoded as surrogate pairs, this creates the proper pattern string.
subs = u'|'.join(exclude_list)
print re.sub(subs, u'', text)
The old 2.7 regex engine gets confused because:
narrow builds of Python 2.7 use 16-bit Unicode storage, in which codepoints above 0xFFFF are automatically represented by surrogate pairs;
before the regex "sees" your Python string, Python has already helpfully split your large Unicode codepoints into two separate characters (each on its own a valid, but incomplete, single Unicode character).
That means that [\U0001f469]+ is a character class of 2 characters (the two surrogate halves); it matches runs of either half on its own, so it also consumes the leading surrogate of the next emoji and leaves a lone trailing surrogate behind. That leads to your badly formed output.
This fixes it:
print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text) # something
# Removing only "👩" with a character class doesn't work:
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
# .. but with a group it does:
print re.sub(ur'(\U0001f469)+', u'', text) # some😌thing
because now the regex engine sees the exact same sequence of characters – surrogate pairs or otherwise – that you are looking for.
If you want to remove all emoji from the exclude_list, you can explicitly loop over its contents and replace one by one:
exclude_list = UNICODE_EMOJI.keys()
for bad in exclude_list: # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
    if bad in text:
        print 'Removing ' + bad
        text = text.replace(bad, '')
Removing 👩
Removing 😌
something
(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
To remove all emojis from the input string using the current approach, use
import re
from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
rx = ur"(?:{})+".format("|".join(map(re.escape,exclude_list)))
print re.sub(rx, u'', text)
# => u'something'
If you do not re.escape the emoji chars, you will get a nothing to repeat error due to literal chars interfering with the alternation operators inside the group, so map(re.escape, exclude_list) is required.
Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2.

How to decode UTF-16 to japanese in Python 2.7

I am just starting to learn Python, and I want to decode the URL info into Japanese words.
>>> s1 = '\u4e2d\u99ac\u8fbc\u30cf\u30a4\u30c4'
>>> print s1
\u4e2d\u99ac\u8fbc\u30cf\u30a4\u30c4
>>> print u'\u4e2d\u99ac\u8fbc\u30cf\u30a4\u30c4'
中馬込ハイツ
I think it is a really basic problem, and I have searched for utf-16, but it didn't work out. How can I print s1 and get the Japanese words?
UPDATE: An even better way:
import codecs
s1 = '\u4e2d\u99ac\u8fbc\u30cf\u30a4\u30c4'
print (codecs.decode(s1,'unicode-escape'))
(from here)
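This prints 中馬込ハイツ, the same result as the unicode literal in the shell example above.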
Original answer:
What about adding u before your string? like this:
s1 = u'\u4e2d\u99ac\u8fbc\u30cf\u30a4\u30c4'
print s1
If you already have the string, as in your question, I would do this:
s1 = '\u4e2d\u99ac\u8fbc\u30cf\u30a4\u30c4'
string = eval ("u'"+s1+"'")
print (string)
# or you can do this:
print (eval ("u'"+s1+"'"))
There might be a better way, but this works.
Note that some terminals won't display unicode characters like this. It works for me under Ubuntu, but not under Windows 10.
try:
    print(eval("u'" + s1 + "'"))
except:
    print(eval(s1))
This will work for sure; I was stuck on a similar issue.
Please do vote if it works.

regular expression with special chars

I need a regular expression to validate strings made up of one or more of these characters:
a-z
A-Z
'
àòèéùì
simple white space
FOR EXAMPLE these strings are valid:
D' argon calabrò
maryòn l' Ancol
these strings are NOT valid:
hello38239
my_house
work [tab] with me
I tried this:
re.match(r"^[a-zA-Z 'òàèéìù]+$", string )
It seems to work in my python shell but in Django I get this error:
SyntaxError at /home/
("Non-ASCII character '\\xc3' ...
Why ?
Edit:
I have added # -*- coding: utf-8 -*- at the top of my forms.py, but the strings with à, è, ò, ù, é or ì still never match.
This is my forms.py clean method:
def clean_title(self):
    if re.match(r"^[a-zA-Z 'òàèéìù]+$", self.cleaned_data['title'].strip()):
        return self.cleaned_data['title'].strip()
    raise forms.ValidationError(_("This title is not valid."))
If you use non-ASCII characters in your Python source files, you should add a proper encoding declaration at the top of the source file, like this:
# -*- coding: utf-8 -*-
utf_string='čćžđšp'
Defining Python Source Code Encodings
This seems to work fine for me:
>>> import re
>>> mystring = "D' argon calabrò"
>>> matched = re.match(r"^([a-zA-Z 'òàèéìù]+)$", mystring)
>>> print matched.groups()
("D' argon calabr\xc3\xb2",)
Well, those are pretty much all non-ASCII characters, so I'd figure that it's using plain ASCII as the character encoding. Maybe you need to configure it to use UTF-8?
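A sketch of a likely fix (assuming, as is the case in Django, that cleaned_data holds unicode strings): with the coding declaration in place, make the pattern a unicode literal too, because in Python 2 a UTF-8 byte pattern will not match a unicode string containing ò:
# -*- coding: utf-8 -*-
import re
title = u"D' argon calabrò"                     # unicode, as Django provides it
print re.match(r"^[a-zA-Z 'òàèéìù]+$", title)   # None: byte pattern vs unicode string
print re.match(ur"^[a-zA-Z 'òàèéìù]+$", title)  # matches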