Split using regex in Python

I want to split the file name from the given file using the regex function re.split. Please find the details below:
SVC_DC = 'JHN097567898_01102019_050514_svc_dc.tar'
My solution:
import regex as re
import os
ans = re.split(os.sep, SVC_DC)
Error: re.error: bad escape (end of pattern) at position 0
Thanks in advance

If you want a filename, regexes are not your answer.
Python has the pathlib module dedicated to handling file paths, and its objects, besides having methods to get the isolated filename while handling all possible corner cases, also have methods to open and list files and to do everything one normally does with a file.
To get the base filename from a path, just use its automatic properties:
In [1]: import pathlib
In [2]: name = pathlib.Path("/home/user/JHN097567898_01102019_050514_svc_dc.tar")
In [3]: name.name
Out[3]: 'JHN097567898_01102019_050514_svc_dc.tar'
In [4]: name.parent
Out[4]: PosixPath('/home/user')
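Besides name and parent, a few of the other conveniences mentioned above look roughly like this (a minimal sketch; the path is the same hypothetical one, so exists() and iterdir() only return something useful if it really exists on disk):

import pathlib

p = pathlib.Path("/home/user/JHN097567898_01102019_050514_svc_dc.tar")

print(p.stem)      # 'JHN097567898_01102019_050514_svc_dc' (filename without the last suffix)
print(p.suffix)    # '.tar'
print(p.exists())  # False unless the file actually exists on this machine

# Listing the sibling files (only meaningful if the directory exists):
# for entry in p.parent.iterdir():
#     print(entry.name)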
Otherwise, even if you would not use pathlib, os.path.sep being a single character, there would be no advantage in using re.split at all - a normal string.split would do. Actually, there is os.path.split as well, which, predating pathlib, has always done the same thing:
In [6]: name = "/home/user/JHN097567898_01102019_050514_svc_dc.tar"
In [7]: import os
In [8]: os.path.split(name)[-1]
Out[8]: 'JHN097567898_01102019_050514_svc_dc.tar'
And last (and in this case, actually least), the reason for the error is that you are on Windows, and your os.path.sep character is "\" - this character alone is not a full regular expression, as the regex engine expects a character indicating a special sequence to come after the "\". For it to be used without an error, you'd need to do:
re.split(re.escape(os.path.sep), "myfilepath")

The reason for your failure lies in details concerning regular expressions,
namely the quoting issue.
E.g. under Windows os.sep == '\\', i.e. a single backslash.
But the backslash has a special meaning in a regex, namely to escape special characters,
so in order to use it literally, you have to write it twice.
Try the following code:
import re
import os
SVC_DC = 'JHN097567898_01102019_050514_svc_dc.tar'
print(re.split(os.sep * 2, SVC_DC))
The result is:
['JHN097567898_01102019_050514_svc_dc.tar']
As the source string does not contain any backslashes, the result
is a list containing only one item (the whole source string).
Edit
To make the regex work under both Windows and Unix, you can try:
print(re.split('\\' + os.sep, SVC_DC))
Note that this regex contains:
a hard-coded backslash as the escape character,
the path separator used on the current operating system.
Note that the forward slash (in the Unix case) does not require escaping,
but escaping it is still acceptable (not needed, but harmless).
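An equivalent, arguably cleaner way to build the same pattern portably is to let re.escape do the quoting, as the earlier answer already suggested (a minimal sketch combining both ideas; nothing here is specific to one operating system):

import os
import re

SVC_DC = 'JHN097567898_01102019_050514_svc_dc.tar'

# re.escape adds the backslash only where the separator needs it,
# so the same line works with '\\' on Windows and '/' on Unix.
print(re.split(re.escape(os.sep), SVC_DC))
# ['JHN097567898_01102019_050514_svc_dc.tar']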

Related

Improve performance of large document text tokenization through Python + RegEx

I'm currently trying to process a large number of very big (>10k words) text files. In my data pipeline, I identified the gensim tokenize function as my bottleneck; the relevant part is provided in my MWE below:
import re
import urllib.request

url = 'https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/genesis/english-web.txt'
doc = urllib.request.urlopen(url).read().decode('utf-8')

PAT_ALPHABETIC = re.compile('(((?![\d])\w)+)', re.UNICODE)

def tokenize(text):
    text.strip()
    for match in PAT_ALPHABETIC.finditer(text):
        yield match.group()

def preprocessing(doc):
    tokens = [token for token in tokenize(doc)]
    return tokens

foo = preprocessing(doc)
Calling the preprocessing function for the given example takes roughly 66 ms, and I would like to improve this number. Is there anything I can still optimize in my code? Or is my hardware (a mid-2010s consumer notebook) the issue? I would be interested in the runtimes from people with more recent hardware as well.
Thank you in advance
You may use
PAT_ALPHABETIC = re.compile(r'[^\W\d]+')

def tokenize(text):
    for match in PAT_ALPHABETIC.finditer(text):
        yield match.group()
Note:
\w matches letters, digits, underscores, some other connector punctuation and diacritics in Python 3.x by default; you do not need to use the re.UNICODE or re.U option.
To "exclude" (or "subtract") digit matching from \w, the ((?!\d)\w)+ looks like overkill; all you need to do is "convert" the \w into an equivalent negated character class, [^\W], and add \d there: [^\W\d]+.
Note the extraneous text.strip(): Python strings are immutable, so if you do not assign the result to a variable, calling text.strip() has no effect. Since whitespace in the input string does not interfere with the regex [^\W\d]+, you may simply drop the text.strip() call from your code.
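If you want to verify the speed-up on your own machine, a rough timing sketch along these lines should work (the numbers will of course vary with hardware; the URL is the one from the question, everything else is just scaffolding for the comparison):

import re
import timeit
import urllib.request

url = 'https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/genesis/english-web.txt'
doc = urllib.request.urlopen(url).read().decode('utf-8')

OLD = re.compile('(((?![\d])\w)+)', re.UNICODE)
NEW = re.compile(r'[^\W\d]+')

def run(pattern):
    return [m.group() for m in pattern.finditer(doc)]

# Sanity check: both patterns should yield the same tokens
assert run(OLD) == run(NEW)

print('old pattern:', timeit.timeit(lambda: run(OLD), number=20))
print('new pattern:', timeit.timeit(lambda: run(NEW), number=20))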

Making a text file which will contain my list items and applying regular expression to it

I am supposed to write code that will read a text file containing some words with some common linguistic features, apply some regular expressions to all of the words, and write one file which will have the changed words.
For now let's say my text file named abcd.txt has these words
king
sing
ping
cling
booked
looked
cooked
packed
My first question starts here: how should I write these words in my simple text file to get the above-mentioned results? Shall I write them line-separated or comma-separated?
This is the code provided by user palvarez.
import re

with open("new_abcd", "w+") as new, open("abcd") as original:
    for word in original:
        new_word = re.sub("ing$", "xyz", word)
        new.write(new_word)
Can I add something like -
with open("new_abcd", "w+") as file, open("abcd") as original:
for word in original:
new_aword = re.sub("ed$", "abcd", word)
new.write(new_aword)
in the same code file? I want something like -
kabc
sabc
pabc
clabc
bookxyz
lookxyz
cookxyz
packxyz
PS - I don't know whether mentioning this is necessary or not, but I am supposed to do this for a Unicode-supported script, Devanagari. I didn't use it here in my examples because many of us here can't read the script. Additionally, that script uses some diacritics, e.g. 'का' has one consonant character 'क' and one vowel sign 'ा' which together make 'का'. In my regular expressions I need to account for the diacritics.
I think the approach you have, with one word per line, is better since you don't have to trouble yourself with delimiters and stripping.
With a file like this:
king
sing
ping
cling
booked
looked
cooked
packed
And a code like this, using re.sub to replace a pattern:
import re

with open("new_abcd.txt", "w") as new, open("abcd.txt") as original:
    for word in original:
        new_word = re.sub("ing$", "xyz", word)
        new_word = re.sub("ed$", "abcd", new_word)
        new.write(new_word)
It creates a resulting file:
kxyz
sxyz
pxyz
clxyz
bookabcd
lookabcd
cookabcd
packabcd
I tried out with the diacritic you gave us and it seems to work fine:
print(re.sub("ा$", "ing", "का"))
>>> कing
EDIT: added multiple replacements. You can put your replacements into a list and iterate over it to do re.sub as follows.
import re

# List of (pattern, replacement string) pairs
replacements = [("ing$", "xyz"), ("ed$", "abcd")]

with open("new_abcd.txt", "w") as new, open("abcd.txt") as original:
    for word in original:
        new_word = word
        for pattern, replacement in replacements:
            new_word = re.sub(pattern, replacement, word)
            if new_word != word:
                break
        new.write(new_word)
This limits it to one modification per word: only the first replacement that modifies the word is applied.
For starters, it is recommended to use the with context manager to open your files; this way you do not need to explicitly close the file once you are done with it.
Another advantage is that you can then process the file line by line, which is very useful when you are working with larger sets of data. Whether to write the output as single lines or in CSV format will depend on the requirements of your output and how you want to further process it.
As an example, to read from a file and, say, substitute a substring, you can use re.sub:
import re

with open('abcd.txt', 'r') as f:
    for line in f:
        # do something here
        print(re.sub("ing$", 'ring', line.strip()))
>>
kring
sring
pring
clring
Another nifty trick is to manage both the input and the output with the same context manager, like:
import re

with open('abcd.txt', 'r') as f, open('out_abcd.txt', 'w') as o:
    for line in f:
        # notice that we add '\n' to write each output on a new line
        o.write(re.sub("ing$", 'ring', line.strip()) + '\n')
This creates an output file with your new contents in a very memory-efficient way.
If you'd like to write to a CSV file or any other specific format, I highly suggest you spend some time understanding Python's input and output functions. If linguistic text is what you are going for, then also take time to understand the encoding of different languages and to further study Python's regex operations.
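If CSV output is what you eventually need, a minimal sketch with the standard csv module could look like this (the two-column layout and the output file name are just assumptions for illustration):

import csv
import re

with open('abcd.txt') as original, open('new_abcd.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['original', 'changed'])  # hypothetical header row
    for line in original:
        word = line.strip()
        changed = re.sub("ing$", "xyz", word)
        changed = re.sub("ed$", "abcd", changed)
        writer.writerow([word, changed])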

Replace all emojis from a given unicode string

I have a list of unicode symbols from the emoji package. My end goal is to create a function that takes a unicode string as input, e.g. some👩😌thing, and then removes all emojis, yielding "something". Below is a demonstration of what I want to achieve:
from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ... = 'something'
I have been trying to do the above, and in that process I came across a strange behavior, which I demonstrate below. I believe that if the code below is fixed, then I will be able to achieve my end goal.
import regex as re
print u'\U0001F469' # 👩
print u'\U0001F60C' # 😌
print u'\U0001F469\U0001F60C' # 👩😌
text = u'some\U0001F469\U0001F60Cthing'
print text # some👩😌thing
# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
In most builds of Python 2.7 (so-called narrow builds), Unicode codepoints at or above 0x10000 are stored as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469').
The best way to solve this is to move to a version of Python that treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 as a wide build for this, and recent versions of Python 3 do it automatically.
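For comparison, under Python 3 the same substitution behaves as expected; a small sketch using the sample string from the question:

import re

text = 'some\U0001F469\U0001F60Cthing'

print(len('\U0001F469'))                             # 1 - the emoji is a single character
print(re.sub('[\U0001F469]+', '', text))             # some😌thing
print(re.sub('[\U0001F469\U0001F60C]+', '', text))   # something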
To create a regular expression to use for the replacement, simply join all the characters together with |. Since the list of characters is already encoded with surrogate pairs, it will create the proper pattern string.
subs = u'|'.join(exclude_list)
print re.sub(subs, u'', text)
The old 2.7 regex engine gets confused because:
Python 2.7 (in a narrow build) uses 16-bit Unicode storage, in which codepoints outside the Basic Multilingual Plane are automatically represented by surrogate pairs.
Before the regex "sees" your Python string, Python has already helpfully split your large Unicode codepoints into two separate characters (each on its own a valid – but incomplete – Unicode code unit).
That means that [\U0001f469]+ is really a character class of those 2 surrogate code units, and substituting it also strips the high surrogate that belongs to 😌, leaving a lone low surrogate behind. That leads to your badly formed output.
This fixes it:
print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text)  # something

# Removing only "👩" with an explicit group instead of a character class now works as well:
print re.sub(ur'(\U0001f469)+', u'', text)  # some😌thing
because now the regex engine sees the exact same sequence of characters – surrogate pairs or otherwise – that you are looking for.
If you want to remove all the emoji in the exclude_list, you can explicitly loop over its contents and replace them one by one:
exclude_list = UNICODE_EMOJI.keys()

for bad in exclude_list:  # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
    if bad in text:
        print 'Removing ' + bad
        text = text.replace(bad, '')
Removing 👩
Removing 😌
something
(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
To remove all emojis from the input string using the current approach, use
import re
from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
rx = ur"(?:{})+".format("|".join(map(re.escape,exclude_list)))
print re.sub(rx, u'', text)
# => u'something'
If you do not re.escape the emoji chars, you will get a nothing to repeat error because the literal chars interfere with the alternation operators inside the group, so map(re.escape, exclude_list) is required.
Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2.

How to use lists in new regex when the items have re.escape characters?

I'm trying to use the List option in the new regex library. The List works fine if the items in the list do not contain any regex special characters, but it doesn't work when they do, as shown below. What is the right way to use Lists in this case?
import regex as re
print('=====Test case0=====')
mys="|abcxyz |defblah"
l={re.escape('|abc'),re.escape('|def')}
p=re.compile(r'(?=\|abc|\|def)\|.+?(?:\b|$)') #WORKS
print(p.findall(mys))
p=re.compile(r'(?=\L<l>)\|.+?(?:\b|$)',l=l) #DOESN'T WORK
print(p.findall(mys))
print('=====Test case1=====')
mys="$abcxyz $defblah"
l={re.escape('$abc'),re.escape('$def')}
p=re.compile(r'(?=\$abc|\$def)\$.+?(?:\b|$)') #WORKS
print(p.findall(mys))
p=re.compile(r'(?=\L<l>)\$.+?(?:\b|$)',l=l) #DOESN'T WORK
print(p.findall(mys))
print('=====Test case2=====')
mys="abcxyz defblah"
l={re.escape('abc'),re.escape('def')}
p=re.compile(r'(?=abc|def).+?(?:\b|$)') #WORKS
print(p.findall(mys))
p=re.compile(r'(?=\L<l>).+?(?:\b|$)',l=l) #WORKS
print(p.findall(mys))
output
=====Test case0=====
['|abcxyz', '|defblah']
[]
=====Test case1=====
['$abcxyz', '$defblah']
[]
=====Test case2=====
['abcxyz', 'defblah']
['abcxyz', 'defblah']
That happens because the strings you pass to \L<...> are re.escaped by default (they are treated as literal strings), so when you pre-escape them, they get escaped twice.
Just use
l=['|abc','|def']
and then pass the l list to the \L operator.
See the Python demo.
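A reconstruction of that demo, based on the test cases above (a sketch, not the original linked snippet):

import regex as re

mys = "|abcxyz |defblah"
l = ['|abc', '|def']   # no re.escape here - \L treats the items as literals

p = re.compile(r'(?=\L<l>)\|.+?(?:\b|$)', l=l)
print(p.findall(mys))  # ['|abcxyz', '|defblah']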
The documentation is a bit vague, but this behavior is hinted at:
\L<name>
There are occasions where you may want to include a list (actually, a set) of options in a regex.
One way is to build the pattern like this:
>>> p = regex.compile(r"first|second|third|fourth|fifth")
but if the list is large, parsing the resulting regex can take considerable time, and care must also be taken that the strings are properly escaped if they contain any character that has a special meaning in a regex, and that if there is a shorter string that occurs initially in a longer string, the longer string is listed before the shorter one, for example, "cats" before "cat".
The new alternative is to use a named list:
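A minimal sketch of how such a named list is used (the option strings are the ones from the documentation excerpt above; the sample sentence is made up):

import regex

# The keyword argument name must match the name used inside \L<...>;
# the strings are matched literally.
options = {"first", "second", "third", "fourth", "fifth"}
p = regex.compile(r"\L<options>", options=options)

print(p.findall("first things first, fifth of fourth"))
# ['first', 'first', 'fifth', 'fourth']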

How can I specify Cyrillic character ranges in a Python 3.2 regex?

Once upon a time, I found this question interesting.
Today I decided to play around with the text of that book.
I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.
#!/usr/bin/env python3.2
# coding=UTF-8
import sys, re

for file in sys.argv[1:]:
    f = open(file)
    fs = f.read()
    regexnl = re.compile('[^\s\w.,?!:;-]')
    rstuff = regexnl.sub('', fs)
    f.close()
    print(rstuff)
Something very similar has already been done in this answer.
Basically, I just want to be able to specify a set of characters that are not alphabetic, alphanumeric, or punctuation or whitespace.
This doesn't exactly answer your question, but the regex module has much better Unicode support than the built-in re module; e.g. regex supports the \p{Cyrillic} property and its negation \P{Cyrillic} (as well as a huge number of other Unicode properties). It also handles Unicode case-insensitivity correctly.
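For instance, something along these lines should work with the third-party regex module (a small sketch, assuming the module is installed; the sample sentence is the Russian one used further below):

import regex

ru = u"Владивосток находится на одной широте с Сочи."
for word in regex.findall(r"\p{Cyrillic}+", ru):
    print(word)
# Владивосток, находится, на, одной, широте, с, Сочи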
You can specify the unicode range pretty easily: \u0400-\u0500. See also here.
Here's an example with some text from the Russian wikipedia, and also a sentence from the English wikipedia containing a single word in cyrillic.
# coding=utf-8
import re

ru = u"Владивосток находится на одной широте с Сочи, однако имеет среднегодовую температуру почти на 10 градусов ниже."
en = u"Vladivostok (Russian: Владивосток; IPA: [vlədʲɪvɐˈstok]; Chinese: 海參崴; pinyin: Hǎishēnwǎi) is a city and the administrative center of Primorsky Krai, Russia"

cyril1 = re.findall(u"[\u0400-\u0500]+", en)
cyril2 = re.findall(u"[\u0400-\u0500]+", ru)

for x in cyril1:
    print x
for x in cyril2:
    print x
output:
Владивосток
------
Владивосток
находится
на
одной
широте
с
Сочи
однако
имеет
среднегодовую
температуру
почти
на
градусов
ниже
Addition:
Two other ways that should also work, and in a bit less hackish fashion than specifying a unicode range:
re.findall("(?u)\w+", text) should match Cyrillic as well as Latin word characters.
re.findall("\w+", text, re.UNICODE) is equivalent
So, more specifically for your problem:
re.compile('[^\s\w.,?!:;-]', re.UNICODE) should do the trick.
See here (point 7)
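To illustrate the fixed pattern (a small sketch with a made-up mixed sample; Cyrillic letters survive while the stray symbols are stripped):

# coding=utf-8
import re

regexnl = re.compile('[^\s\w.,?!:;-]', re.UNICODE)
sample = u"Привет, мир! (тест) №5"
print(regexnl.sub('', sample))
# Привет, мир! тест 5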
For practical reasons I suggest using the exact Modern Russian subset of glyphs, instead of general Cyrillic. This is because Russian websites never use the full Cyrillic subset, which includes Belarusian, Ukrainian, Slavonic and Macedonian glyphs. For historical reasons I am keeping "\u0463" (yat).
//Basic Cyr Unicode range for use on Russian websites.
0401,0406,0410,0411,0412,0413,0414,0415,0416,0417,0418,0419,041A,041B,041C,041D,041E,041F,0420,0421,0422,0423,0424,0425,0426,0427,0428,0429,042A,042B,042C,042D,042E,042F,0430,0431,0432,0433,0434,0435,0436,0437,0438,0439,043A,043B,043C,043D,043E,043F,0440,0441,0442,0443,0444,0445,0446,0447,0448,0449,044A,044B,044C,044D,044E,044F,0451,0462,0463
Using this subset on a multilingual website will save you 60% of bandwidth, in comparison to using the original full range, and will increase page loading speed accordingly.
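If you want to turn that list into a usable pattern, note that 0410-044F is one contiguous run, so the whole set collapses into a short character class (a sketch; the class below covers exactly the codepoints listed above):

# coding=utf-8
import re

# 0401 (Ё), 0406 (І), 0410-044F (А-я), 0451 (ё), 0462/0463 (Ѣ/ѣ)
russian = re.compile(u"[\u0401\u0406\u0410-\u044F\u0451\u0462\u0463]+")

for word in russian.findall(u"Владивосток находится на одной широте с Сочи"):
    print(word)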