Improve performance of large document text tokenization through Python + RegEx

I'm currently trying to process a large number of very big (>10k words) text files. In my data pipeline, I identified the gensim tokenize function as my bottleneck; the relevant part is provided in my MWE below:
import re
import urllib.request

url = 'https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/genesis/english-web.txt'
doc = urllib.request.urlopen(url).read().decode('utf-8')

PAT_ALPHABETIC = re.compile('(((?![\d])\w)+)', re.UNICODE)

def tokenize(text):
    text.strip()
    for match in PAT_ALPHABETIC.finditer(text):
        yield match.group()

def preprocessing(doc):
    tokens = [token for token in tokenize(doc)]
    return tokens

foo = preprocessing(doc)
Calling the preprocessing function for the given example takes roughly 66ms and I would like to improve this number. Is there anything I can still optimize in my code? Or is my hardware (Mid 2010s Consumer Notebook) the issue? I would be interested in the runtimes from people with some more recent hardware as well.
Thank you in advance

You may use
PAT_ALPHABETIC = re.compile(r'[^\W\d]+')

def tokenize(text):
    for match in PAT_ALPHABETIC.finditer(text):
        yield match.group()
Note:
\w matches letters, digits, underscores, and some other connector punctuation and diacritics by default in Python 3.x, so you do not need the re.UNICODE or re.U option.
To "exclude" (or "subtract") digit matching from \w, the ((?!\d)\w)+ looks like overkill; all you need to do is "convert" the \w into an equivalent negated character class, [^\W], and add a \d there: [^\W\d]+.
Note the extraneous text.strip(): Python strings are immutable, so if you do not assign the result to a variable, the call has no effect. Since whitespace in the input string does not interfere with the regex [^\W\d]+, you may simply strip that text.strip() from your code.
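As a rough way to compare the two patterns on your own machine, here is a minimal timing sketch; it downloads the same Genesis text as in the question (any large string works), and number=10 is an arbitrary choice:

import re
import timeit
import urllib.request

url = 'https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/genesis/english-web.txt'
doc = urllib.request.urlopen(url).read().decode('utf-8')

pat_old = re.compile(r'(((?!\d)\w)+)')   # pattern from the question
pat_new = re.compile(r'[^\W\d]+')        # simplified pattern

# Both patterns should produce identical tokens on this input.
assert [m.group() for m in pat_old.finditer(doc)] == pat_new.findall(doc)

for name, pat in (('old', pat_old), ('new', pat_new)):
    seconds = timeit.timeit(lambda: [m.group() for m in pat.finditer(doc)], number=10)
    print(name, seconds / 10)

On most machines the simplified class should be measurably faster because it avoids a lookahead per character, but the exact numbers will of course depend on your hardware.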

Related

How to remove/replace special characters from a 'dynamic' regex/string in Ruby?

I had this code working for a few months already. Let's say I have a table called Categories, which has a string column called name. I receive a string and want to know whether any category was mentioned (a mention occurs when the string contains the substring #name_of_a_category). The approach I followed was something like this:
categories.select { |category_i| content_received.downcase.match(/##{category_i.downcase}/)}
That worked pretty well until today, when I suddenly started to receive an unmatched close parenthesis exception. I realized that category names can contain special characters, so I decided not to consider special characters or spaces anymore (I don't want to add restrictions for the user, and at the same time I don't want to deal with those cases, so the policy is simply to ignore them).
So the question: is there a clean way of removing these special characters (keeping the #) and matching the string (I don't want to modify the data, just ignore those characters while looking for mentions)?
You can also use
prep_content_received = content_received.gsub(/[^\w\s]|_/, '')
p categories.select { |c|
  prep_content_received.match?(/\b#{c.gsub(/[^\w\s]|_/, '').strip}\b/i)
}
Details:
The prep_content_received = content_received.gsub(/[^\w\s]|_/, '') line creates a copy of content_received with no special characters or underscores. Computing it once reduces overhead if there are a lot of categories.
Then you iterate over the categories list and each time check whether prep_content_received matches \b (word boundary) + the category with all special characters, underscores and leading/trailing whitespace stripped from it + \b, in a case-insensitive way (see the /i flag; no need to .downcase).
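Most of this thread is Python, so for readers who want to prototype the same normalize-then-word-boundary-match idea there, here is a rough Python sketch (the helper name strip_specials and the sample strings are just for illustration):

import re

def strip_specials(s):
    # Remove everything that is not a word character or whitespace, plus underscores,
    # mirroring the Ruby gsub(/[^\w\s]|_/, '') call.
    return re.sub(r'[^\w\s]|_', '', s)

content_received = 'pepe is watching a #comedy :)'
categories = ['comedy :)', 'terror']

prep = strip_specials(content_received)
mentioned = [c for c in categories
             if re.search(r'\b' + re.escape(strip_specials(c).strip()) + r'\b', prep, re.I)]
print(mentioned)  # ['comedy :)']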
So after looking around I found some answers on the platform, but nothing with my specific requirements (maybe I missed something; if so, please let me know). This is how I fixed it for my case:
content_received = 'pepe is watching a #comedy :)'
categories = ['comedy :)', 'terror']

temp_content = content_received.downcase
categories.select { |category_i|
  temp_content.gsub(/[^\sa-zA-Z0-9]/, '#' => '#')
              .match?(/##{category_i.downcase.gsub(/[^\sa-zA-Z0-9]/, '')}/)
}
For the sake of the example, I reduced the categories to a simple array of strings. Basically, the first gsub removes any character that is not a letter, a number or whitespace (i.e. any special character), while the '#' => '#' hash keeps each # in place; the second gsub is a simpler version of the first one.

Split using regex in python

I want to split the file name from the given path using the regex function re.split. Please find the details below:
SVC_DC = 'JHN097567898_01102019_050514_svc_dc.tar'
My solution:
import os
import regex as re

ans = re.split(os.sep, SVC_DC)
Error: re.error: bad escape (end of pattern) at position 0
Thanks in advance
If you want a filename, regexes are not your answer.
Python has the pathlib module dedicated to handling filepaths, and its objects, besides having methods to get the isolated filename handling all possible corner cases, also have methods to open, list files, and do everything one normally does to a file.
To get the base filename from a path, just use its automatic properties:
In [1]: import pathlib
In [2]: name = pathlib.Path("/home/user/JHN097567898_01102019_050514_svc_dc.tar")
In [3]: name.name
Out[3]: 'JHN097567898_01102019_050514_svc_dc.tar'
In [4]: name.parent
Out[4]: PosixPath('/home/user')
Otherwise, even if you did not use pathlib, os.path.sep being a single character, there would be no advantage in using re.split at all - a normal str.split would do. Actually, there is os.path.split as well, which predates pathlib and always does the same thing:
In [6]: name = "/home/user/JHN097567898_01102019_050514_svc_dc.tar"
In [7]: import os
In [8]: os.path.split(name)[-1]
Out[8]: 'JHN097567898_01102019_050514_svc_dc.tar'
And last (and in this case, actually least), the reason for the error is that you are on Windows, and your os.path.sep character is "\" - this character alone is not a valid regular expression, as the regex engine expects a character indicating a special sequence to come after the "\". For it to be used without error, you'd need to do:
re.split(re.escape(os.path.sep), "myfilepath")
The reason for your failure lies in details concerning regular expressions,
namely the escaping issue.
E.g. under Windows os.sep == '\\', i.e. a single backslash.
But the backslash has a special meaning in regex, namely to escape special characters,
so in order to use it literally, you have to write it twice.
Try the following code:
import re
import os
SVC_DC = 'JHN097567898_01102019_050514_svc_dc.tar'
print(re.split(os.sep * 2, SVC_DC))
The result is:
['JHN097567898_01102019_050514_svc_dc.tar']
As the source string does not contain any backslashes, the result
is a list containing only one item (the whole source string).
Edit
To make the regex work under both Windows and Unix, you can try:
print(re.split('\\' + os.sep, SVC_DC))
Note that this regex contains:
a hard-coded backslash as the escape character,
the path separator used in the current operating system.
Note that the forward slash (in the Unix case) does not require escaping,
but escaping it here is still acceptable (not needed, but it works).
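For completeness, a small sketch of the re.escape approach from the first answer, which avoids hard-coding the backslash. The Windows-style sample path is hypothetical, so the separator is written out explicitly here; on a real Windows box you could pass re.escape(os.sep) instead:

import re

path = r'C:\home\user\JHN097567898_01102019_050514_svc_dc.tar'
sep = '\\'  # stands in for os.sep on Windows

print(re.split(re.escape(sep), path))
# ['C:', 'home', 'user', 'JHN097567898_01102019_050514_svc_dc.tar']
print(re.split(re.escape(sep), path)[-1])
# JHN097567898_01102019_050514_svc_dc.tar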

Regular expression that accepts only characters with accents

I need a regular expression that accepts only characters having accents. For the moment I'm using this one:
[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöœøùúûüýþÿ]*$
Is there another expression, which is clearer than my expression?
I think this will solve your problem:
[œÀ-ÖØ-öø-ÿ]*$
Since all the characters except œ lie between code points 192 (À) and 255 (ÿ), you could do something like looking ahead and checking that you don't match any of the characters in that range that you don't want. I'm not sure it improves anything compared to yours, but it's a bit shorter and maybe, just maybe, clearer.
(?![÷×])[À-ÿœ]
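A quick sanity check of that lookahead idea in Python (just a sketch; the test string is made up for illustration):

import re

pattern = re.compile(r'(?:(?![÷×])[À-ÿœ])+')

print(pattern.findall('café × naïve ÷ œuvre'))
# ['é', 'ï', 'œ'] -- accented letters and œ match, while × and ÷ are rejected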
Regex isn't always the clearest way to handle text, even if it is the fastest.
You could assign your regular expression to a variable, then insert it via text interpolation:
accent_chars = '[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöœøùúûüýþÿ]'
my_regex = '^...%s*...$' % accent_chars
You can also use these ranges:
[œÀ-ÖØ-öø-ÿ]
Demonstration using Python 3:
>>> import re
>>> s = 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöœøùúûüýþÿ'
>>> ''.join(re.findall('[œÀ-ÖØ-öø-ÿ]', s))
'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöœøùúûüýþÿ'
>>> len(''.join(re.findall('[œÀ-ÖØ-öø-ÿ]', s))) == len(s)
True
The downside is that it is not immediately clear to someone unfamiliar with Unicode that this covers every desired case.
You could also try using the POSIX bracket expression [:alpha:].
Then just prune the alphabetic characters from your string.

Finding if a string matches a pattern

At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:
Hi there, John.
What a lovely day today!
Lovely sunset today, John, isn't it?
Will you be meeting Linda today, John?
Most (not all) of these strings are from pre-defined patterns as follows:
"Hi there, %s."
"What a lovely day today!"
"Lovely sunset today, %s, isn't it?"
"Will you be meeting %s today, %s?"
This library of patterns is ever-expanding (currently at around 1,500), but is manually maintained. The input strings though (the first group) are largely unpredictable. Though most of them will match one of the patterns, some of them will not.
So, here's my question: Given a string (from the first group) as input, I need to know which of the patterns (known second group) it matched. If nothing matched, it needs to tell me that.
I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.
Note: The strings I've given here are for illustration purposes. In reality, the strings aren't human generated, but are computer-generated human-friendly strings as shown above from systems I don't control. Since they aren't manually typed in, we don't need to worry about things like typos and other human errors. Just need to find which pattern it matches.
Note 2: I could modify the patterns library to be some other format, if that makes it easier to construct the regexes. The current structure, with the printf style %s, isn't set in stone.
I am looking at this as a parsing problem. The idea is that the parser function takes a string and determines if it is valid or not.
The string is valid if you can find it among the given patterns. That means you need an index of all the patterns. The index must be a full-text index. It must also match according to word position, e.g. it should short-circuit if the first word of the input is not found among the first words of the patterns. It should take care of the "any" match, i.e. %s in the pattern.
One solution is to put the patterns in an in-memory database (e.g. Redis) and do a full-text index on it (this will not match according to word position), but you should be able to narrow down to the correct pattern by splitting the input into words and searching. The searches will be very fast because you have a small in-memory database. Also note that you are looking for the closest match: one or more words will not match, and the pattern with the highest number of matching words is the one you want.
An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave as a JavaScript object.
{
  "Hi": {"there": {"%s": null}},
  "What": {"a": {"lovely": {"day": {"today": null}}}},
  "Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
  "Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}
This index descends recursively according to word position. Search for the first word; if found, search for the next word within the object returned by the first, and so on. The same word at a given level has only one key. You should also handle the "any" (%s) case. This should be blindingly fast in memory; a rough lookup sketch is shown below.
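A minimal Python sketch of that nested-dictionary index; the tokenization and the way %s consumes a word are simplified assumptions for illustration, not part of the original answer:

import re

patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]

def words(s):
    # Very crude word splitter: keep %s as a token, drop punctuation.
    return re.findall(r'%s|\w+', s)

# Build the nested-dict index; the None key marks the end of a pattern.
index = {}
for p in patterns:
    node = index
    toks = words(p)
    for i, w in enumerate(toks):
        node = node.setdefault(w, {} if i < len(toks) - 1 else {None: p})

def lookup(sentence):
    node = index
    for w in words(sentence):
        if w in node:
            node = node[w]
        elif '%s' in node:      # wildcard slot consumes this word
            node = node['%s']
        else:
            return None
    return node.get(None)

print(lookup("Will you be meeting Linda today, John?"))  # Will you be meeting %s today, %s?
print(lookup("Goobledigoock"))                           # None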
My first thought would be to have the regexp engine take all the trouble of handling this. They're usually optimised to handle large amounts of text so it shouldn't be that much of a performance hassle. It's brute force but the performance seems to be okay. And you could split the input into pieces and have multiple processes handle them. Here's my moderately tested solution (in Python).
import random
import string
import re


def create_random_sentence():
    nwords = random.randint(4, 10)
    sentence = []
    for i in range(nwords):
        sentence.append("".join(random.choice(string.lowercase) for x in range(random.randint(3, 10))))
    ret = " ".join(sentence)
    print ret
    return ret


patterns = [r"Hi there, [a-zA-Z]+.",
            r"What a lovely day today!",
            r"Lovely sunset today, [a-zA-Z]+, isn't it?",
            r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?"]

for i in range(95):
    patterns.append(create_random_sentence())

monster_pattern = "|".join("(%s)" % x for x in patterns)
print monster_pattern
print "--------------"

monster_regexp = re.compile(monster_pattern)

inputs = ["Hi there, John.",
          "What a lovely day today!",
          "Lovely sunset today, John, isn't it?",
          "Will you be meeting Linda today, John?",
          "Goobledigoock"] * 2000

for i in inputs:
    ret = monster_regexp.search(i)
    if ret:
        print ".",
    else:
        print "x",
I've created a hundred patterns. This is the maximum limit of the python regexp library. 4 of them are your actual examples and the rest are random sentences just to stress performance a little.
Then I combined them into a single regexp with 100 groups. (group1)|(group2)|(group3)|.... I'm guessing you'll have to sanitise the inputs for things that can have meanings in regular expressions (like ? etc.). That's the monster_regexp.
Testing one string against this tests it against 100 patterns in a single shot. There are methods that fetch out the exact group which was matched. I test 10000 strings, 80% of which should match and 20% of which will not. It short-circuits, so if there's a success, it will be comparatively quick. Failures have to run through the whole regexp, so they will be slower. You can order things based on the frequency of input to get some more performance out of it.
I ran this on my machine and this is my timing.
python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total
which is not too bad.
However, to run a pattern against such a large regexp and fail will take longer so I changed the inputs to have lots of randomly generated strings that won't match and then tried. 10000 strings none of which match the monster_regexp and I got this.
python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
Similar to Noufal's solution, but returns the matched pattern or None.
import re

patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it",
    "Will you be meeting %s today, %s?"
]

def make_re_pattern(pattern):
    # Characters like . ? etc. have special meaning in regular expressions.
    # Escape the string to avoid interpreting them differently.
    # The re.escape function escapes even %, so replace %s with XXX first to avoid that.
    p = re.escape(pattern.replace("%s", "XXX"))
    return p.replace("XXX", r"\w+")

# Join all the patterns into a single regular expression.
# Each pattern is enclosed in () to remember the match.
# This will help us to find the matched pattern.
rx = re.compile("|".join("(" + make_re_pattern(p) + ")" for p in patterns))

def match(s):
    """Given an input string, returns the matched pattern or None."""
    m = rx.match(s)
    if m:
        # Find the index of the matching group.
        index = (i for i, group in enumerate(m.groups()) if group is not None).next()
        return patterns[index]

# Testing with a couple of patterns
print match("Hi there, John.")
print match("Will you be meeting Linda today, John?")
Python solution. JS should be similar.
>>> import re2
>>> re2.compile('^ABC(.*)E$').search('ABCDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABCDDDDDDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABX') == None
True
>>>
The trick is to use ^ and $ to bound your pattern, making it a "template". Use (.*) or (.+) or whatever it is that you want to "search" for.
The main bottleneck for you, imho, will be iterating through a list of these patterns. Regex searches are computationally expensive.
If you want a "does any pattern match" result, build a massive OR based regex and let your regex engine handle the 'OR'ing for you.
Also, if you have only prefix patterns, check out the TRIE data structure.
This could be a job for sscanf, there is an implementation in js: http://phpjs.org/functions/sscanf/; the function being copied is this: http://php.net/manual/en/function.sscanf.php.
You should be able to use it without changing the prepared strings much, but I have doubts about the performance.
The problem isn't clear to me. Do you want to take the patterns and build regexes out of them?
Most regex engines have a "quoted string" option (\Q ... \E). So you could take the string and make it
^\QHi there,\E(?:.*)\Q.\E$
These will be regexes that match exactly the string you want outside your variables.
If you want a single regex that covers multiple patterns, you can put them into grouped alternatives to find out which one matched, but that will not give you EVERY match, just the first one.
If you use a proper parser (I've used PEG.js), it might be more maintainable though. So that's another option if you think you might get stuck in regex hell.
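Python's re has no \Q ... \E, but re.escape gives the same effect. Here is a small sketch (not from any of the answers above) that builds one anchored template per pattern and returns every pattern that matches, not just the first; using (.+?) for the %s slots is an assumption of this sketch:

import re

templates = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]

def to_regex(template):
    # Escape everything literally, then re-open a wildcard where each %s was.
    return re.compile('^' + re.escape(template).replace(re.escape('%s'), '(.+?)') + '$')

compiled = [(t, to_regex(t)) for t in templates]

def all_matches(s):
    # Return every template that matches, not only the first.
    return [t for t, rx in compiled if rx.match(s)]

print(all_matches("Will you be meeting Linda today, John?"))
# ['Will you be meeting %s today, %s?']
print(all_matches("Goodbye, John."))
# []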

How can I specify Cyrillic character ranges in a Python 3.2 regex?

Once upon a time, I found this question interesting.
Today I decided to play around with the text of that book.
I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.
#!/usr/bin/env python3.2
# coding=UTF-8
import sys, re

for file in sys.argv[1:]:
    f = open(file)
    fs = f.read()
    regexnl = re.compile('[^\s\w.,?!:;-]')
    rstuff = regexnl.sub('', fs)
    f.close()
    print(rstuff)
Something very similar has already been done in this answer.
Basically, I just want to be able to specify a set of characters that are not alphabetic, alphanumeric, or punctuation or whitespace.
This doesn't exactly answer your question, but the regex module has much, much better Unicode support than the built-in re module. For example, regex supports the \p{Cyrillic} property and its negation \P{Cyrillic} (as well as a huge number of other Unicode properties). It also handles Unicode case-insensitivity correctly.
You can specify the Unicode range pretty easily: \u0400-\u0500.
Here's an example with some text from the Russian wikipedia, and also a sentence from the English wikipedia containing a single word in cyrillic.
# coding=utf-8
import re

ru = u"Владивосток находится на одной широте с Сочи, однако имеет среднегодовую температуру почти на 10 градусов ниже."
en = u"Vladivostok (Russian: Владивосток; IPA: [vlədʲɪvɐˈstok] ( listen); Chinese: 海參崴; pinyin: Hǎishēnwǎi) is a city and the administrative center of Primorsky Krai, Russia"

cyril1 = re.findall(u"[\u0400-\u0500]+", en)
cyril2 = re.findall(u"[\u0400-\u0500]+", ru)

for x in cyril1:
    print x
print "------"
for x in cyril2:
    print x
output:
Владивосток
------
Владивосток
находится
на
одной
широте
с
Сочи
однако
имеет
среднегодовую
температуру
почти
на
градусов
ниже
Addition:
Two other ways that should also work, and in a bit less hackish fashion than specifying a unicode range:
re.findall("(?u)\w+", text) should match Cyrillic as well as Latin word characters.
re.findall("\w+", text, re.UNICODE) is equivalent
So, more specifically for your problem:
* re.compile(r'[^\s\w.,?!:;-]', re.UNICODE) should do the trick.
See here (point 7)
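A quick check of that corrected expression on a Cyrillic sample (a small sketch; the sample sentence is reused from the answer above):

import re

regexnl = re.compile(r'[^\s\w.,?!:;-]', re.UNICODE)

text = "Владивосток находится на одной широте с Сочи."
print(regexnl.sub('', text))
# Владивосток находится на одной широте с Сочи.  (the Cyrillic letters survive)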
For practical reasons I suggest using the exact Modern Russian subset of glyphs, instead of general Cyrillic. This is because Russian websites never use the full Cyrillic range, which includes Belarusian, Ukrainian, Slavonic and Macedonian glyphs. For historical reasons I am keeping "\u0463".
//Basic Cyr Unicode range for use on Russian websites.
0401,0406,0410,0411,0412,0413,0414,0415,0416,0417,0418,0419,041A,041B,041C,041D,041E,041F,0420,0421,0422,0423,0424,0425,0426,0427,0428,0429,042A,042B,042C,042D,042E,042F,0430,0431,0432,0433,0434,0435,0436,0437,0438,0439,043A,043B,043C,043D,043E,043F,0440,0441,0442,0443,0444,0445,0446,0447,0448,0449,044A,044B,044C,044D,044E,044F,0451,0462,0463
Using this subset on a multilingual website will save you 60% of bandwidth, in comparison to using the original full range, and will increase page loading speed accordingly.
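If you want to turn that list of code points into a usable character class, here is a small sketch (the sample string is made up for illustration):

import re

codepoints = ("0401,0406,0410,0411,0412,0413,0414,0415,0416,0417,0418,0419,041A,041B,041C,041D,"
              "041E,041F,0420,0421,0422,0423,0424,0425,0426,0427,0428,0429,042A,042B,042C,042D,"
              "042E,042F,0430,0431,0432,0433,0434,0435,0436,0437,0438,0439,043A,043B,043C,043D,"
              "043E,043F,0440,0441,0442,0443,0444,0445,0446,0447,0448,0449,044A,044B,044C,044D,"
              "044E,044F,0451,0462,0463")

# Build a character class from the hex code points; none of these characters
# are special inside [], so no extra escaping is needed.
chars = ''.join(chr(int(cp, 16)) for cp in codepoints.split(','))
russian_words = re.compile('[' + chars + ']+')

print(russian_words.findall("Vladivostok (Владивосток) is a city"))
# ['Владивосток']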