I have a list of unicode symbols from the emoji package. My end goal is to create a function that takes as input a unicode a string, i.e. some👩😌thing, and then removes all emojis, i.e. "something". Below is a demonstration of what I want to achieve:
from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ... = 'something'
I have been trying to do the above, and in that process, I came across a strange behavior which I demonstrate below, as you can see. I believe if the code below is fixed, then I will be able to achieve my end goal.
import regex as re
print u'\U0001F469' # 👩
print u'\U0001F60C' # 😌
print u'\U0001F469\U0001F60C' # 👩😌
text = u'some\U0001F469\U0001F60Cthing'
print text # some👩😌thing
# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
In most builds of Python 2.7, Unicode codepoints above 0x10000 are encoded as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469').
The best way to solve this is to move to a version of Python that properly treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 for this, and the recent versions of Python 3 will do it automatically.
To create a regular expression to use for the replace, simply join all the characters together with |. Since the list of characters already is encoded with surrogate pairs it will create the proper string.
subs = u'|'.join(exclude_list)
print re.sub(subs, u'', text)
The old 2.7 regex engine gets confused because:
Python 2.7 uses a forced word-based Unicode storage, in which certain Unicode codepoints are automatically substituted by surrogate pairs.
Before the regex "sees" your Python string, Python already helpfully parsed your large Unicode codepoints into two separate characters (each on its own a valid – but incomplete – single Unicode character).
That means that [\U0001f469]+' replaces something (a character class of 2 characters), but one of them is in your string and the other is not. That leads to your badly formed output.
This fixes it:
print re.sub(ur'(\U0001f469|U0001F60C)+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'(\U0001f469)+', u'', text) # some�thing
# .. and now it does:
some😌thing
because now the regex engine sees the exact same sequence of characters – surrogate pairs or otherwise – that you are looking for.
If you want to remove all emoji from the exclude_list, you can explicitly loop over its contents and replace one by one:
exclude_list = UNICODE_EMOJI.keys()
for bad in exclude_list: # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
if bad in text:
print 'Removing '+bad
text = text.replace(bad, '')
Removing 👩
Removing 😌
something
(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
To remove all emojis from the input string using the current approach, use
import re
from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
rx = ur"(?:{})+".format("|".join(map(re.escape,exclude_list)))
print re.sub(rx, u'', text)
# => u'something'
If you do not re.escape the emoji chars, you will get nothing to repeat error due to the literal chars messing up with the alternation operators inside the group, so map(re.escape,exclude_list) is required.
Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2.
Related
I wan to split file name from the given file using regex function re.split.Please find the details below:
SVC_DC = 'JHN097567898_01102019_050514_svc_dc.tar"
My solution:
import regex as re
ans=re.split(os.sep,SVC_DC)
Error: re.error: bad escape (end of pattern) at position 0
Thanks in advance
If you want a filename, regexes are not your answer.
Python has the pathlib module dedicated to handling filepaths, and its objects, besides having methods to get the isolated filename handlign all possible corner-cases, also have methods to open, list files, and do everything one normally does to a file.
To get the base filename from a path, just use its automatic properties:
In [1]: import pathlib
In [2]: name = pathlib.Path("/home/user/JHN097567898_01102019_050514_svc_dc.tar")
In [3]: name.name
Out[3]: 'JHN097567898_01102019_050514_svc_dc.tar'
In [4]: name.parent
Out[4]: PosixPath('/home/user')
Otherwise, even if you would not use pathlib, os.path.sep being a single character, there would be no advantage in using re.split at all - normal string.split would do. Actually, there is os.path.split as well, that, predating pathlib, would always do the samething:
In [6]: name = "/home/user/JHN097567898_01102019_050514_svc_dc.tar"
In [7]: import os
In [8]: os.path.split(name)[-1]
Out[8]: 'JHN097567898_01102019_050514_svc_dc.tar'
And last (and in this case, actually least), the reason of the error is that you are on windows, and your os.path.sep character is "\" - this character alone is not a full regular expression, as the regex engine expects a character indicating a special sequence to come after the "\". For it to be used withour error, you'd need to do:
re.split(re.escape(os.path.sep), "myfilepath")
The reason of your failure are details concerning regular expressions,
namely the quotation issue.
E.g. under Windows os.sep = '\\', i.e. a single backslash.
But the backslash in regex has special meaning, just to escape special characters,
so in order to use it literally, you have to write it twice.
Try the following code:
import re
import os
SVC_DC = 'JHN097567898_01102019_050514_svc_dc.tar'
print(re.split(os.sep * 2, SVC_DC))
The result is:
['JHN097567898_01102019_050514_svc_dc.tar']
As the source string does not contain any backslashes, the result
is a list containing only one item (the whole source string).
Edit
To make the regex working under both Windows and Unix, you can try:
print(re.split('\\' + os.sep, SVC_DC))
Note that this regex contains:
a hard-coded backslash as the escape character,
the path separator used in the current operating system.
Note that the forward slash (in Unix case) does not require quotation,
but using quotation here is still acceptable (not needed, but working).
My code takes a list of strings from a static website.
It then traverses through each character in the list and uses the .replace method to replace any non utf-8 character:
foo.replace('\\u2019', "'")
It doesn't replace the character in the list correctly and ends up looking like the following:
before
u'What\u2019s with the adverts?'
after
u'What\u2019s with the adverts?'
Why is it
Python 2.7 interprets string literals as ASCII, not unicode, and so even though you've tried to include unicode characters in your argument to foo.replace, replace is just seeing ASCII {'\', 'u', '2', '0', '1', '9'}. This is because Python doesn't assign a special meaning to "\u" unless it is parsing a unicode literal.
To tell Python 2.7 that this is a unicode string, you have to prefix the string with a u, as in foo.replace(u'\u2017', "'").
Additionally, in order to indicate the start of a unicode code, you need \u, not \\u - the latter indicates that you want an actual '\' in the string followed by a 'u'.
Finally, note that foo will not change as a result of calling replace. Instead, replace will return a value which you must assign to a new variable, like this:
bar = foo.replace(u'\u2017', "'")
print bar
(see stackoverflow.com/q/26943256/4909087)
yeah. If your string is foo = r'What\u2019s with the adverts?' will ok with foo.replace('\\u2019', "'"). It is a raw string and begins with r''. And with u'' is Unicode.
Hope to help you.
I'm trying to extract hangul, english, number from string input.
hangul = re.compile('[^a-zA-Z0-9\u3131-\u3163\uac00-\ud7a3]+')
s = u'abcd 가나다라 1234'
print hangul.sub('', s)
this give me u'abcd1234'
why does it ignore \uac00-\ud7a3 ?
I'm the developer for python jamo. If you use Python 3,
then you can use functions such as jamo.is_hangul_char. Otherwise, you could use the source code to help you out (you're missing a few Korean characters in your regex).
If you don't want to miss out on some of the older Hangul jamo display characters, then you want to use 3131-\u3163\u3165-\u318E to match all Hangul compatibility jamo. If you're only concerned about modern display characters, then you would use \u3131-\u314E\u314F-\u3163 to match all modern Hangul compatibility jamo.
Use a Unicode string in the re.compile; otherwise, \u3163 is not treated as a Unicode escape.
Although not required, '' in the .sub should be Unicode as well. There is an implicit conversion to Unicode in Python 2 otherwise and Python 3 requires it.
#coding:utf8
import re
hangul = re.compile(u'[^a-zA-Z0-9\u3131-\u3163\uac00-\ud7a3]+')
s = u'abcd 가나다라 1234'
print hangul.sub(u'', s)
Output:
abcd가나다라1234
I have tried the following in Python 2.7 shell:
>>> from nltk.stem.isri import ISRIStemmer
>>> st = ISRIStemmer()
>>> string = u'\u062D\u064E\u062F\u0651\u064E\u062B\u064E\u0646\u064E\u0627'
>>> st.stem(string)
u'\u062d\u062f\u062b'
So basically, I am trying to obtain:
u'\u062d\u062f\u062b'
from
u'\u062D\u064E\u062F\u0651\u064E\u062B\u064E\u0646\u064E\u0627'
using nltk's arabic stemmer, which works!
However, when I try to accomplish the exact thing through a python script, it fails to stem any of the words in the list, tokens :
#!/c/Python27/python
# -*- coding: utf8 -*-
import nltk
import nltk.data
from nltk.stem.isri import ISRIStemmer
#In my script, I tokenize the following string
commasection = '\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
#The tokenizing works
tokens = nltk.word_tokenize(commasection)
st = ISRIStemmer()
for word in tokens:
#But the stemming of each word in tokens doesn't work????
print st.stem(word)
#Should display
#u'u0623\u062e\u0628\u0631'
#u'\u0628\u0634\u0631'
#u'\u0628\u0646'
#u'\u0647\u0644\u0644'
#But it just shows whatever is in commasection
I need my python code to stem all words in tokens. But I don't get how the simpler example running in python shell works but not this script.
I have noticed that in the shell scenario, there is that 'u' in front of the sequence of unicode, so I tried all sorts of encodings/decodings and read a lot about it all night long (pulled an all-nighter on this one), but this python script is just not stemming word from tokens like the python shell!!!
If anyone can please help me make my script display the correct result I would be super super appreciative
Unicode escapes only work in unicode literals.
commasection = u'\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
Ignacio is correct that I have to have unicode literals in order for the stemming to work, but since I am grabbing this string dynamically, I had to find a way to convert what I get dynamically
i.e. '\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
into a string literal with a unicode escapes i.e.
u'\u0623\u064E\u062E\u0652\u0628\u064E\u0631\u064E\u0646\u064E\u0627 \u0628\u0650\u0634\u0652\u0631\u064F \u0628\u0652\u0646\u064F \u0647\u0650\u0644\u0627\u064E\u0644\u064D'
(notice the u in front)
This can be done with the following unichr() http://infohost.nmt.edu/tcc/help/pubs/python/web/unichr-function.html:
word = "".join([unichr(int(x, 16)) for x in word.split("\\u") if x !=""])
So basically I grab the numeric codes and form the unicode character while maintaining the unicode escape. And my stemmer works!
Once upon a time, I found this question interesting.
Today I decided to play around with the text of that book.
I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.
#!/usr/bin/env python3.2
# coding=UTF-8
import sys, re
for file in sys.argv[1:]:
f = open(file)
fs = f.read()
regexnl = re.compile('[^\s\w.,?!:;-]')
rstuff = regexnl.sub('', f)
f.close()
print(rstuff)
Something very similar has already been done in this answer.
Basically, I just want to be able to specify a set of characters that are not alphabetic, alphanumeric, or punctuation or whitespace.
This doesn't exactly answer your question, but the regex module has much much better unicode support than the built-in re module. e.g. regex supports the \p{Cyrillic} property and its negation \P{Cyrillic} (as well as a huge number of other unicode properties). Also, it handles unicode case-insensitivity correctly.
You can specify the unicode range pretty easily: \u0400-\u0500. See also here.
Here's an example with some text from the Russian wikipedia, and also a sentence from the English wikipedia containing a single word in cyrillic.
#coding=utf-8
import re
ru = u"Владивосток находится на одной широте с Сочи, однако имеет среднегодовую температуру почти на 10 градусов ниже."
en = u"Vladivostok (Russian: Владивосток; IPA: [vlədʲɪvɐˈstok] ( listen); Chinese: 海參崴; pinyin: Hǎishēnwǎi) is a city and the administrative center of Primorsky Krai, Russia"
cyril1 = re.findall(u"[\u0400-\u0500]+", en)
cyril2 = re.findall(u"[\u0400-\u0500]+", ru)
for x in cyril1:
print x
for x in cyril2:
print x
output:
Владивосток
------
Владивосток
находится
на
одной
широте
с
Сочи
однако
имеет
среднегодовую
температуру
почти
на
градусов
ниже
Addition:
Two other ways that should also work, and in a bit less hackish fashion than specifying a unicode range:
re.findall("(?u)\w+", text) should match Cyrillic as well as Latin word characters.
re.findall("\w+", text, re.UNICODE) is equivalent
So, more specifically for your problem:
* re.compile('[^\s\w.,?!:;-], re.UNICODE') should do the trick.
See here (point 7)
For practical reasons I suggest using the exact Modern Russian subset of glyphs, instead of general Cyrillic. This is because Russian websites never use the full Cyrillic subset, which includes Belarusian, Ukrainian, Slavonic and Macedonian glyphs. For historical reasons I am keeping "u\0463".
//Basic Cyr Unicode range for use on Russian websites.
0401,0406,0410,0411,0412,0413,0414,0415,0416,0417,0418,0419,041A,041B,041C,041D,041E,041F,0420,0421,0422,0423,0424,0425,0426,0427,0428,0429,042A,042B,042C,042D,042E,042F,0430,0431,0432,0433,0434,0435,0436,0437,0438,0439,043A,043B,043C,043D,043E,043F,0440,0441,0442,0443,0444,0445,0446,0447,0448,0449,044A,044B,044C,044D,044E,044F,0451,0462,0463
Using this subset on a multilingual website will save you 60% of bandwidth, in comparison to using the original full range, and will increase page loading speed accordingly.