match words with few differences allowed

match words with few differences allowed - regex

I was wondering is there is any tool to match almost the same word for a bash terminal.
In the following file, called list.txt contain 1 word per line:
ban
1ban
12ban
12ban3
It is easy to find words containing "ban"
grep -E "*ban*" list.txt
Question:
How to actually match words that are have x letters differences?
With the search word "ban", I expect the match "1ban" for X=1.
Concerning the notion of distance, I want to have maximum:
X deletion
or X substitutions
or X insertions
Any tool, but preferentially something you could call as command-line on a bash terminal.
NOTE: The Levenshtein Distance will count an insertion of 2 letter as 1 difference. This is not what I want.

You may use Python PyPi regex class that supports fuzzy matching.
Since you actually want to match words with maximum X difference (1 deletion OR 1 substitution OR 1 deletion), you may create a Python script like
#!/usr/bin/env python3
import regex, io, sys
def main(argv):
if len(argv) < 3:
# print("USAGE: fuzzy_search -searchword -xdiff -file")
exit(-1)
search=argv[0]
xdiff=argv[1]
file=argv[2]
# print("Searching for {} in {} with {} differences...".format(search, file, xdiff))
with open(file, "r") as f:
contents = f.read()
print(regex.findall(r"\b(?:{0}){{s<={1},i<={1},d<={1}}}\b".format(regex.escape(search), xdiff), contents))
if __name__ == "__main__":
main(sys.argv[1:])
Here, {s<=1,i<=1,d<=1} means we allow the word we search for 1 or 0 substitutions (s<=1), 1 or 0 insertions (i<=1) or 1 or 0 deletions (d<=1).
The \b are word boundaries, thanks to that construct, only whole words are matched (no cat in vacation will get matched).
Save as fuzzy_search.py.
Then, you may call it as
python3 fuzzy_search.py "ban" 1 file
where "ban" is the word the fuzzy search is being performed for and 1 is the higher limit of differences.
The result I get is
['ban', '1ban']
You may change the format of the output to line only:
print("\n".join(regex.findall(r"\b(?:{0}){{s<={1},i<={1},d<={1}}}\b".format(regex.escape(search), xdiff), contents)))
Then, the result is
ban
1ban

You can check the difference as shown below by checking each character using python,
def is_diff(str1, str2):
diff = False
for char1, char2 in zip(str1, str2):
if char1 != char2:
if diff:
return False
else:
diff = True
return diff
with open('list.txt') as f:
data = f.readlines()
for line in data:
print is_diff('ban', line)

Related

Iteration over the string replace function in python [duplicate]

I have two lists:
a list of about 750K "sentences" (long strings)
a list of about 20K "words" that I would like to delete from my 750K sentences
So, I have to loop through 750K sentences and perform about 20K replacements, but ONLY if my words are actually "words" and are not part of a larger string of characters.
I am doing this by pre-compiling my words so that they are flanked by the \b word-boundary metacharacter:
compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]
Then I loop through my "sentences":
import re
for sentence in sentences:
for word in compiled_words:
sentence = re.sub(word, "", sentence)
# put sentence into a growing list
This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.
Is there a way to using the str.replace method (which I believe is faster), but still requiring that replacements only happen at word boundaries?
Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping over re.sub if the length of my word is > than the length of my sentence, but it's not much of an improvement.
I'm using Python 3.5.2

TLDR
Use this method if you want the fastest regex-based solution. For a dataset similar to the OP's, it's approximately 1000 times faster than the accepted answer.
If you don't care about regex, use this set-based version, which is 2000 times faster than a regex union.
Optimized Regex with Trie
A simple Regex union approach becomes slow with many banned words, because the regex engine doesn't do a very good job of optimizing the pattern.
It's possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren't really human-readable, but they do allow for very fast lookup and match.
Example
['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']
The list is converted to a trie:
{
'f': {
'o': {
'o': {
'x': {
'a': {
'r': {
'': 1
}
}
},
'b': {
'a': {
'r': {
'': 1
},
'h': {
'': 1
}
}
},
'z': {
'a': {
'': 1,
'p': {
'': 1
}
}
}
}
}
}
}
And then to this regex pattern:
r"\bfoo(?:ba[hr]|xar|zap?)\b"
The huge advantage is that to test if zoo matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying the 5 words. It's a preprocess overkill for 5 words, but it shows promising results for many thousand words.
Note that (?:) non-capturing groups are used because:
foobar|baz would match foobar or baz, but not foobaz
foo(bar|baz) would save unneeded information to a capturing group.
Code
Here's a slightly modified gist, which we can use as a trie.py library:
import re
class Trie():
"""Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
The corresponding Regex should match much faster than a simple Regex union."""
def __init__(self):
self.data = {}
def add(self, word):
ref = self.data
for char in word:
ref[char] = char in ref and ref[char] or {}
ref = ref[char]
ref[''] = 1
def dump(self):
return self.data
def quote(self, char):
return re.escape(char)
def _pattern(self, pData):
data = pData
if "" in data and len(data.keys()) == 1:
return None
alt = []
cc = []
q = 0
for char in sorted(data.keys()):
if isinstance(data[char], dict):
try:
recurse = self._pattern(data[char])
alt.append(self.quote(char) + recurse)
except:
cc.append(self.quote(char))
else:
q = 1
cconly = not len(alt) > 0
if len(cc) > 0:
if len(cc) == 1:
alt.append(cc[0])
else:
alt.append('[' + ''.join(cc) + ']')
if len(alt) == 1:
result = alt[0]
else:
result = "(?:" + "|".join(alt) + ")"
if q:
if cconly:
result += "?"
else:
result = "(?:%s)?" % result
return result
def pattern(self):
return self._pattern(self.dump())
Test
Here's a small test (the same as this one):
# Encoding: utf-8
import re
import timeit
import random
from trie import Trie
with open('/usr/share/dict/american-english') as wordbook:
banned_words = [word.strip().lower() for word in wordbook]
random.shuffle(banned_words)
test_words = [
("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
("First word", banned_words[0]),
("Last word", banned_words[-1]),
("Almost a word", "couldbeaword")
]
def trie_regex_from_words(words):
trie = Trie()
for word in words:
trie.add(word)
return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
def find(word):
def fun():
return union.match(word)
return fun
for exp in range(1, 6):
print("\nTrieRegex of %d words" % 10**exp)
union = trie_regex_from_words(banned_words[:10**exp])
for description, test_word in test_words:
time = timeit.timeit(find(test_word), number=1000) * 1000
print(" %s : %.1fms" % (description, time))
It outputs:
TrieRegex of 10 words
Surely not a word : 0.3ms
First word : 0.4ms
Last word : 0.5ms
Almost a word : 0.5ms
TrieRegex of 100 words
Surely not a word : 0.3ms
First word : 0.5ms
Last word : 0.9ms
Almost a word : 0.6ms
TrieRegex of 1000 words
Surely not a word : 0.3ms
First word : 0.7ms
Last word : 0.9ms
Almost a word : 1.1ms
TrieRegex of 10000 words
Surely not a word : 0.1ms
First word : 1.0ms
Last word : 1.2ms
Almost a word : 1.2ms
TrieRegex of 100000 words
Surely not a word : 0.3ms
First word : 1.2ms
Last word : 0.9ms
Almost a word : 1.6ms
For info, the regex begins like this:
(?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|toir(?:(?:\'s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\'s|es))?|y(?:(?:\'s|s))?)|ot(?:(?:\'s|t(?:\'s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|y(?:\'s)?|\é(?:(?:\'s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?))|om(?:en(?:(?:\'s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\'s|s))?)|or(?:(?:\'s|s))?|s))?|l(?:\'s)?))|e(?:(?:\'s|am|l(?:(?:\'s|ard|son(?:\'s)?))?|r(?:deen(?:\'s)?|nathy(?:\'s)?|ra(?:nt|tion(?:(?:\'s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\'s|s))?|d)|ing|or(?:(?:\'s|s))?)|s))?|yance(?:\'s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\'s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\'s)?)|gail|l(?:ene|it(?:ies|y(?:\'s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\'s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\'s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\'s|s))?|y)|m\'s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\'s)?))|r(?:\'s)?)|ormal(?:(?:it(?:ies|y(?:\'s)?)|ly))?)|o(?:ard|de(?:(?:\'s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\'s|ist(?:(?:\'s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|r(?:igin(?:al(?:(?:\'s|s))?|e(?:(?:\'s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\'s|ist(?:(?:\'s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\'s|board))?)|r(?:a(?:cadabra(?:\'s)?|d(?:e[ds]?|ing)|ham(?:\'s)?|m(?:(?:\'s|s))?|si(?:on(?:(?:\'s|s))?|ve(?:(?:\'s|ly|ness(?:\'s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\'s|s))?|[ds]))?|ing|ment(?:(?:\'s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\'s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\'s)?))?)|s(?:alom|c(?:ess(?:(?:\'s|e[ds]|ing))?|issa(?:(?:\'s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\'s|s))?|t(?:(?:e(?:e(?:(?:\'s|ism(?:\'s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\'s|e(?:\'s)?))?|o(?:l(?:ut(?:e(?:(?:\'s|ly|st?))?|i(?:on(?:\'s)?|sm(?:\'s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\'s)?|t(?:(?:\'s|s))?)|d)|ing|s))?|pti...
It's really unreadable, but for a list of 100000 banned words, this Trie regex is 1000 times faster than a simple regex union!
Here's a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi:

TLDR
Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it's approximately 2000 times faster than the accepted answer.
If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.
Theory
If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.
If you save all the banned words into a set, it will be very fast to check if another word is included in that set.
Pack the logic into a function, give this function as argument to re.sub and you're done!
Code
import re
with open('/usr/share/dict/american-english') as wordbook:
banned_words = set(word.strip().lower() for word in wordbook)
def delete_banned_words(matchobj):
word = matchobj.group(0)
if word.lower() in banned_words:
return ""
else:
return word
sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
"GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000
word_pattern = re.compile('\w+')
for sentence in sentences:
sentence = word_pattern.sub(delete_banned_words, sentence)
Converted sentences are:
' . !
.
GiraffeElephantBoat
sfgsdg sdwerha aswertwe
Note that:
the search is case-insensitive (thanks to lower())
replacing a word with "" might leave two spaces (as in your code)
With python3, \w+ also matches accented characters (e.g. "ångström").
Any non-word character (tab, space, newline, marks, ...) will stay untouched.
Performance
There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.
In comparison, Liteye's answer needed 160s for 10 thousand sentences.
With n being the total amound of words and m the amount of banned words, OP's and Liteye's code are O(n*m).
In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).
Regex union test
What's the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?
It's pretty hard to grasp the way the regex engine works, so let's write a simple test.
This code extracts 10**i random english words into a list. It creates the corresponding regex union, and tests it with different words :
one is clearly not a word (it begins with #)
one is the first word in the list
one is the last word in the list
one looks like a word but isn't
import re
import timeit
import random
with open('/usr/share/dict/american-english') as wordbook:
english_words = [word.strip().lower() for word in wordbook]
random.shuffle(english_words)
print("First 10 words :")
print(english_words[:10])
test_words = [
("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
("First word", english_words[0]),
("Last word", english_words[-1]),
("Almost a word", "couldbeaword")
]
def find(word):
def fun():
return union.match(word)
return fun
for exp in range(1, 6):
print("\nUnion of %d words" % 10**exp)
union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
for description, test_word in test_words:
time = timeit.timeit(find(test_word), number=1000) * 1000
print(" %-17s : %.1fms" % (description, time))
It outputs:
First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']
Union of 10 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 0.7ms
Almost a word : 0.7ms
Union of 100 words
Surely not a word : 0.7ms
First word : 1.1ms
Last word : 1.2ms
Almost a word : 1.2ms
Union of 1000 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 9.6ms
Almost a word : 10.1ms
Union of 10000 words
Surely not a word : 1.4ms
First word : 1.8ms
Last word : 96.3ms
Almost a word : 116.6ms
Union of 100000 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 1227.1ms
Almost a word : 1404.1ms
So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:
O(1) best case
O(n/2) average case, which is still O(n)
O(n) worst case
These results are consistent with a simple loop search.
A much faster alternative to a regex union is to create the regex pattern from a trie.

One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".
Because re relies on C code to do the actual matching, the savings can be dramatic.
As #pvg pointed out in the comments, it also benefits from single pass matching.
If your words are not regex, Eric's answer is faster.

One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.
This should be faster, because to process a sentence, you just have to step through each of the words and check if it's a match.
Currently the regex search is having to go through the entire string again each time, looking for word boundaries, and then "discarding" the result of this work before the next pass.

Well, here's a quick and easy solution, with test set.
Best strategy:
re.sub("\w+",repl,sentence) searches for words.
"repl" can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.
This is the simplest and fastest solution (see function replace4 in example code below).
Second best strategy:
The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.
Implementation: (see function replace3 in example code below).
Timings for example functions:
replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240,000/s with PyPy)
...and code:
#! /bin/env python3
# -*- coding: utf-8
import time, random, re
def replace1( sentences ):
for n, sentence in enumerate( sentences ):
for search, repl in patterns:
sentence = re.sub( "\\b"+search+"\\b", repl, sentence )
def replace2( sentences ):
for n, sentence in enumerate( sentences ):
for search, repl in patterns_comp:
sentence = re.sub( search, repl, sentence )
def replace3( sentences ):
pd = patterns_dict.get
for n, sentence in enumerate( sentences ):
#~ print( n, sentence )
# Split the sentence on non-word characters.
# Note: () in split patterns ensure the non-word characters ARE kept
# and returned in the result list, so we don't mangle the sentence.
# If ALL separators are spaces, use string.split instead or something.
# Example:
#~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
#~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
words = re.split(r"([^\w]+)", sentence)
# and... done.
sentence = "".join( pd(w,w) for w in words )
#~ print( n, sentence )
def replace4( sentences ):
pd = patterns_dict.get
def repl(m):
w = m.group()
return pd(w,w)
for n, sentence in enumerate( sentences ):
sentence = re.sub(r"\w+", repl, sentence)
# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]
# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]
def test( func, num ):
t = time.time()
func( test_sentences[:num] )
print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))
print( "Sentences", len(test_sentences) )
print( "Words ", len(test_words) )
test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )
EDIT: You can also ignore lowercase when checking if you pass a lowercase list of Sentences and edit repl
def replace4( sentences ):
pd = patterns_dict.get
def repl(m):
w = m.group()
return pd(w.lower(),w)

Perhaps Python is not the right tool here. Here is one with the Unix toolchain
sed G file |
tr ' ' '\n' |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'
assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double spaced, split each sentence to one word per line, mass delete the blacklist words from the file, and merge back the lines.
This should run at least an order of magnitude faster.
For preprocessing the blacklist file from words (one word per line)
sed 's/.*/\\b&\\b/' words > blacklist

How about this:
#!/usr/bin/env python3
from __future__ import unicode_literals, print_function
import re
import time
import io
def replace_sentences_1(sentences, banned_words):
# faster on CPython, but does not use \b as the word separator
# so result is slightly different than replace_sentences_2()
def filter_sentence(sentence):
words = WORD_SPLITTER.split(sentence)
words_iter = iter(words)
for word in words_iter:
norm_word = word.lower()
if norm_word not in banned_words:
yield word
yield next(words_iter) # yield the word separator
WORD_SPLITTER = re.compile(r'(\W+)')
banned_words = set(banned_words)
for sentence in sentences:
yield ''.join(filter_sentence(sentence))
def replace_sentences_2(sentences, banned_words):
# slower on CPython, uses \b as separator
def filter_sentence(sentence):
boundaries = WORD_BOUNDARY.finditer(sentence)
current_boundary = 0
while True:
last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
yield sentence[last_word_boundary:current_boundary] # yield the separators
last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
word = sentence[last_word_boundary:current_boundary]
norm_word = word.lower()
if norm_word not in banned_words:
yield word
WORD_BOUNDARY = re.compile(r'\b')
banned_words = set(banned_words)
for sentence in sentences:
yield ''.join(filter_sentence(sentence))
corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
output.write(sentence.encode('utf-8'))
output.write(b' .')
print('time:', time.time() - start)
These solutions splits on word boundaries and looks up each word in a set. They should be faster than re.sub of word alternates (Liteyes' solution) as these solutions are O(n) where n is the size of the input due to the amortized O(1) set lookup, while using regex alternates would cause the regex engine to have to check for word matches on every characters rather than just on word boundaries. My solutiona take extra care to preserve the whitespaces that was used in the original text (i.e. it doesn't compress whitespaces and preserves tabs, newlines, and other whitespace characters), but if you decide that you don't care about it, it should be fairly straightforward to remove them from the output.
I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from the Gutenberg Project, and banned_words.txt is 20000 words randomly picked from Ubuntu's wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I've defined sentences as anything separated by ". ".
$ # replace_sentences_1()
$ python3 filter_words.py
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py
number of sentences: 862462
time: 15.9370770454
$ # replace_sentences_2()
$ python3 filter_words.py
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py
number of sentences: 862462
time: 13.1190629005
PyPy particularly benefit more from the second approach, while CPython fared better on the first approach. The above code should work on both Python 2 and 3.

Practical approach
A solution described below uses a lot of memory to store all the text at the same string and to reduce complexity level. If RAM is an issue think twice before use it.
With join/split tricks you can avoid loops at all which should speed up the algorithm.
Concatenate a sentences with a special delimeter which is not contained by the sentences:
merged_sentences = ' * '.join(sentences)
Compile a single regex for all the words you need to rid from the sentences using | "or" regex statement:
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag
Subscript the words with the compiled regex and split it by the special delimiter character back to separated sentences:
clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
Performance
"".join complexity is O(n). This is pretty intuitive but anyway there is a shortened quotation from a source:
for (i = 0; i < seqlen; i++) {
[...]
sz += PyUnicode_GET_LENGTH(item);
Therefore with join/split you have O(words) + 2*O(sentences) which is still linear complexity vs 2*O(N2) with the initial approach.
b.t.w. don't use multithreading. GIL will block each operation because your task is strictly CPU bound so GIL have no chance to be released but each thread will send ticks concurrently which cause extra effort and even lead operation to infinity.

Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here's one) to locate all your "bad" words. Traverse the file, replacing each bad word, updating the offsets of found words that follow etc.

Regex in python trouble

I have a text file that I would like to search through it to see how many of a certain word is in it. I'm getting the wrong count for the words.
File is here
code:
import re
with open('SysLog.txt', 'rt') as myfile:
for line in myfile:
m = re.search('guest', line, re.M|re.I)
if m is not None:
m.group(0)
print( "Found it.")
print('Found',len(m.group()), m.group(),'s')
break
for line in myfile:
n = re.search('Worm', line)
if n is not None:
n.group(0)
print("\n\tNext Match.")
print('Found', len(n.group()), n.group(), 's')
break
for line in myfile:
o = re.search('anonymous', line)
if o is not None:
o.group(0)
print("\n\tNext Match.")
print('Found', len(o.group()), o.group(), 's')
break

There is no need to use a regex, you can use str.count() to make the process much more simple:
with open('SysLog.txt', 'rt') as myfile:
text = myfile.read()
for word in ('guest', 'Worm', 'anonymous'):
print("\n\tNext Match.")
print('Found', text.count(word), word, 's')
To test this, I downloaded the file and ran the code above, and got the output:
Next Match.
Found 4 guest s
Next Match.
Found 91 Worm s
Next Match.
Found 18 anonymous s
which is correct if you do a find on the document in a text editor!
*As a sidenote, I'm not sure why you want to print a tab (\t) before 'Next Match' each time as it just looks weird in the output but it doesn't matter :)

There are multiple problems with your code:
re.search will only give you the first match, if any; this does not have to be a problem, though, as it seems like the word is only expected to appear once per line; otherwise, use re.findall
the line n.group(0) does not do anything without an assignment
len(n.group()) does not give you the number of matches, but the length of the matched string
you break after the first line in the file
myfile is an iterator, so once the first for line in myfile loop has finished, the other two won't have any lines left to loop (it will never finish because of the break anyway, though)
as already noted, you do not need regular expression at all
One (among many) possible ways of doing this would be this (not tested):
counts = {"worm": 0, "guest": 0, "anonymous": 0}
for line in myfile:
for word in counts:
if word in line:
counts[word] += 1

How do i delete first 2 lines which match with a text given by me ( using sed )?

How do i delete first 2 lines which match with a text given by me ( using sed ! )
E.g :
#file.txt contains following lines :
abc
def
def
abc
abc
def
And i want to delete first 2 "abc"

Using "sed"
While #EdMorton has pointed out that sed is not the best tool for this job (if you wonder why exactly, see my answer below and compare it to the awk code), my research showed that the solution to the generalized problem
Delete occurences "N" through "M" of a line matching a given pattern using sed
indeed is a very tricky one in my opinion. There seem to be many suggestions for how to replace the "N"th occurence of a matching pattern with sed, but I found that deleting a specific matching line (or a range of lines) is a much more complex undertaking.
While the generalized problem with arbitrary values for N, M, and the pattern would probably be solved best by writing a "sed script generator" on the basis of a Finite State Machine, the solution to the special case asked by the OP is still simple enough to be coded by hand. I must admit that I wasn't very familiar with the obfuscated intricacies of the sed command syntax before, but I found this challenge to be quite useful for gaining more experience with non-trivial sed usage.
Anyway, here's my solution for deleting the first two occurences of a line containing "abc" in a file. If there's a simpler approach, I'm eager to learn about it, as this has taken me some time now.
A final caveat: this assumes GNU sed, as I was unable to find a solution with POSIX sed:
sed -n ':1;/abc/{n;b2;};p;$b4;n;b1;:2;/abc/{n;b3;};p;$b4;n;b2;:3;p;$b4;n;b3;:4;q' file
or, in more verbose syntax:
sed -n '
# BEGIN - look for first match
:first;
/abc/ {
# First match found. Skip line and jump to second section
n; bsecond;
};
# Line does not match. Print it and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bfirst;
# END - look for first match
# BEGIN - look for second match
:second;
/abc/ {
# Second match found. Skip line and jump to final section
n; bfinal;
}
# Line does not match. Print it and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bsecond;
# END - look for second match
# BEGIN - both matches found; print remaining lines
:final;
# Print line and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bfinal;
# END - print remaining lines
# QUIT
:end;
q;
' file

sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk:
$ awk '!(/abc/ && ++c<3)' file
def
def
abc
def

How to replace a number in a file name but subtract 1 from it?

I have lots of files in a directory (Linux). For example:
"/data/2014/file300.data.20141231.MC.0930.vgf.img"
Here 0930 represents the hour and change from 1 to 24 (30 does not change), the date also changes. The hours are represented as
.0130. .0230. .0330. .0430. ...2330... ...2430.
I want to replace this part (only this part) in the file name by subtracting the hour by 1
.0030. .0130. .0230. .0330. ... .2230.
and do not touch any other number in the file name. So
.0130. becomes .0030.
.0230. becomes .0130.
and so on
2430. becomes .2330.
I tried this:
rename -n 's/(\d+)(\.vgf.img)/($1-1).$2/e' file300.data.20141231.MC.0930.vgf.img
but it returned this:
file300.data.20141231.MC.929.vgf.img
so .0930. became .929. which is not what I am looking for. I'm looking for .0830.

Regular Expressions are not great for such a task, and it would be easier to use awk or a perl script. However, you can still do it in sed if you really want! =)
There is no trivial way of decrementing number in sed, but you can emulate it:
#!/bin/sed -f
# Replace number XX from line
# "/data/2014/file300.data.20141231.MC.XX30.vgf.img"
# with decremented number (XX-1)
# zero is not changed
# copy filename to hold space
h
# remove everything that is not a number
s/.*MC\.//
s/30\.vgf.*//
# ensure that we don't have leading zeroes
s/^0*//
# here all the magic begins, decrementing
# we need to move all trailing zeroes to begin of number
# we do it using cycle:
# clear test condition
t b
: b
# if we have zero - move it
/0$/{
# remove from end
s/0$//
# append to begin
s/^/0/
}
# if substitution was made - continue cycle
t b
# now we have nonzero at the end, decrement it
s/1$/0/
s/2$/1/
s/3$/2/
s/4$/3/
s/5$/4/
s/6$/5/
s/7$/6/
s/8$/7/
s/9$/8/
# here we change number of digits in our number, this needs to be done only
# when number was of type 10*, in that case after all our permutations it is
# represented as line of all zeroes - just remove one.
/^0*$/s/0//
# another cycle to put zeroes back at end
t e
: e
/^0/{
# remove from beginning
s/^0//
# add to end, as 9
s/$/9/
}
t e
# Now we have decremented number in pattern space and original filename in hold
# format number as two-digit:
s/^$/00/
s/^.$/0&/
# append it to hold space
H
# switch hold and pattern
x
# now we manipulate string like "$filename\nXX" where XX is our decremented
# number.
# Replace number in filename with decremented one
s/\(.*MC\.\)..\(30.*\).\(..\)/\1\3\2/

I had initially written this as a somewhat tongue-in-cheek response and did not intend to post, but seeing Yury's solution (which is brilliant!) I feel compelled to give this as at least potentially usable.
You haven't specified the problem adequately, but assuming that your files all have the ending "${timestamp}.vgf.img" (really, just assuming the existence of two dots in the name after the timestamp):
echo /data/2014/file300.data.20141231.MC.0930.vgf.img |
awk '{a=substr($(NF-2),0,2); $(NF-2)=(a-1)"30"} 1' FS=. OFS=.

You need to subtract 100 instead of 1 to get 830 from 930. Then to format the result with leading zeros you can use sprintf. The below command will work as expected
~$ rename -n 's/(\d+)(\.vgf.img)/(sprintf("%04d", ($1 - 100))).$2/e' file300.data.20141231.MC.0930.vgf.img
rename(file300.data.20141231.MC.0930.vgf.img, file300.data.20141231.MC.0830.vgf.img)

Vim: How to delete repetition in a line

I am having a log file for analysis, in that few of the line will have repetition of it own, but not complete repetition, say
Alex is here and Alex is here and we went out
We bothWe both went out
I want to remove the first occurrence and get
Alex is here and we went out
We both went out
Please share a regex to do in Vim in Windows.

I don't recommend trying to use regex magic to solve this problem. Just write an external filter and use that.
Here's an external filter written in Python. You can use this to pre-process the log file, like so:
python prefix_chop.py logfile.txt > chopped.txt
But it also works by standard input:
cat logfile.txt | prefix_chop.py > chopped.txt
This means you can use it in vim with the ! command. Try these commands: goto line 1, then pipe from current line through the last line through the external program prefix_chop.py:
1G
!Gprefix_chop.py<Enter>
Or you can do it from ex mode:
:1,$!prefix_chop.py<Enter>
Here's the program:
#!/usr/bin/python
import sys
infile = sys.stdin if len(sys.argv) < 2 else open(sys.argv[1])
def repeated_prefix_chop(line):
"""
Check line for a repeated prefix string. If one is found,
return the line with that string removed, else return the
line unchanged.
"""
# Repeated string cannot be more than half of the line.
# So, start looking at mid-point of the line.
i = len(line) // 2 + 1
while True:
# Look for longest prefix that is found in the string after pos 0.
# The prefix starts at pos 0 and always matches itself, of course.
pos = line.rfind(line[:i])
if pos > 0:
return line[pos:]
i -= 1
# Stop testing before we hit a length-1 prefix, in case a line
# happens to start with a word like "oops" or a number like "77".
if i < 2:
return line
for line in infile:
sys.stdout.write(repeated_prefix_chop(line))
I put a #! comment on the first line, so this will work as a stand-alone program on Linux, Mac OS X, or on Windows if you are using Cygwin. If you are just using Windows without Cygwin, you might need to make a batch file to run this, or just type the whole command python prefix_chop.py. If you make a macro to run this you don't have to do the typing yourself.
EDIT: This program is pretty simple. Maybe it could be done in "vimscript" and run purely inside vim. But the external filter program can be used outside of vim... you can set things up so that the log file is run through the filter once per day every day, if you like.

Regex:\b(.*)\1\b
Replace with:\1 or $1
If you want to deal with more than two repeating sentences you can try this
\b(.+?\b)\1+\b
--
|->avoids matching individual characters in word like xxx
NOTE
Use \< and \> instead of \b

You could do it by matching as much as possible at the beginning of the line and then using a backreference to match the repeated bit.
For example, this command solves the problem you describe:
:%s/^\(.*\)\(\1.*\)/\2

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

match words with few differences allowed - regex

Related

Iteration over the string replace function in python [duplicate]

Regex in python trouble

How do i delete first 2 lines which match with a text given by me ( using sed )?

How to replace a number in a file name but subtract 1 from it?

Vim: How to delete repetition in a line

Categories

Resources