Matching word boundaries in RegEx (Python 2.7)

I have the following code that returns each line of a text in which a certain word occurs:
import re

with open('/Users/Statistical_NLP/Project/text.txt') as f:
    haystack = f.read()
with open('/Users/Statistical_NLP/Project/test.txt') as f:
    for line in f:
        needle = line.strip()
        pattern = '^.*{}.*$'.format(re.escape(needle))
        for match in re.finditer(pattern, haystack, re.MULTILINE):
            print match.group(0)
How can I search for a word and return not the whole line, but just the three words before and the three words after that word?
Something has to be changed in this line in my code:
pattern = '^.*{}.*$'.format(re.escape(needle))
Thanks a lot

The following regex will help you achieve what you want.
((?:\w+\s+){3}YOUR_WORD_HERE(?:\s+\w+){3})
For a better understanding of the regex, I suggest you go to the following page and experiment with it.
https://regex101.com/r/eS8zW5/3
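This matches exactly three words before the target word, the word itself, and three words after it, so it fails when fewer than three words are available on either side.
The following variant matches up to three words on each side, if they exist: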
((?:\w+\s+){0,3}YOUR_WORD_HERE(?:\s+\w+){0,3})
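A minimal sketch of how the second pattern could be built from a search word, assuming Python 3. The helper name `context_windows`, the `width` parameter, and the sample text are hypothetical; the `\b` anchors and the `re.escape` call are additions that are not part of the answer above, and they assume the search word starts and ends with a word character.

```python
import re

def context_windows(needle, haystack, width=3):
    """Return each occurrence of needle with up to `width` words on either side."""
    # \b keeps 'five' from matching inside a longer word; re.escape guards
    # against regex metacharacters in the search word.
    pattern = r'(?:\w+\s+){0,%d}\b%s\b(?:\s+\w+){0,%d}' % (
        width, re.escape(needle), width)
    return [m.group(0) for m in re.finditer(pattern, haystack)]

# Hypothetical sample text, not taken from the question's files.
text = "one two three four five six seven eight nine"
print(context_windows("five", text))
# ['two three four five six seven eight']
```

Because the quantifier is `{0,3}`, a word near the start or end of the text is still returned with whatever context exists.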

Related

Python Regex to add a "?" to the beginning of a word in a word list

import re

word_list = []
# open text file
with open('words') as f:
    for line in f.readlines():
        # pull out all 3 letter words using regular expression and add to wordlist
        word_list += re.findall(r'\b(\w{3})\b', line)
I use this to find all 3 letter words in a dictionary. From there, I want to add a question mark to the beginning of each word. I assume I need the re.sub function, but can't seem to get the syntax right.
You can do this a few ways: collect all the three-letter words first and update them afterwards, or extend a list as you go, along the lines of what you are already doing. There is no real need for re.sub here if the goal is a list of three-letter words prefixed with ?.
Sample words file:
the quick brown fox called bob jumped over the lazy dog
and went straight to bed
cos bob needed to sleep right now
Sample code:
import re

word_list = []
with open('words') as fin:
    for line in fin:
        matches = re.findall(r'\b(\w{3})\b', line)
        word_list.extend(f'?{word}' for word in matches)
Sample word_list after run:
['?the',
'?fox',
'?bob',
'?the',
'?dog',
'?and',
'?bed',
'?cos',
'?bob',
'?now']
You can use re.sub, where \1 refers to the first capture group:
re.sub(r'\b(\w{3})\b', r'?\1', line)
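As a sketch, here is that substitution applied to the first line of the sample words file (Python 3). Note that it returns the whole line with the three-letter words prefixed, not a list of words; the variable name `prefixed` is only illustrative.

```python
import re

line = "the quick brown fox called bob jumped over the lazy dog"
# \1 re-inserts the three-letter word captured by the group.
prefixed = re.sub(r'\b(\w{3})\b', r'?\1', line)
print(prefixed)
# ?the quick brown ?fox called ?bob jumped over ?the lazy ?dog
```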
Alternatively, compile the pattern first:
pattern = re.compile(r'\b(\w{3})\b')
and then use it on each line like this:
word_list += ['?' + word for word in pattern.findall(line)]

Python : trying to match and count words with regex (expected string or buffer)

I am trying to read a file and match the words in it that have more than 6 characters, but I keep getting this error:
Traceback (most recent call last):
File "dummy.py", line 9, in <module>
matches = re.findall("\w{6,}", f.read().split())
File "/usr/lib/python2.7/re.py", line 181, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
And I can't figure out why I am getting this error? The code is pasted below
import re
with open('test.txt', 'r') as f:
    matches = re.findall("\w{6,}", f.read().split())
nr_long_words = len(matches)
print (matches)
f.read().split() gives a list of strings, but re.findall expects a single string, thus the TypeError: expected string or buffer. You could apply the regex to each of the substrings in a loop or list comprehension, but you do not need to split() at all:
matches = re.findall("\w{6,}", f.read())
Note that if the file is very large, f.read() might not be a good idea (for text files this is probably not an issue, as those are rarely larger than a few megabytes). In that case, you could read the file line by line and sum up the long words per line:
nr_long_words = sum(len(re.findall(r"\w{6,}", line)) for line in f)
Also, as noted in comments, \w{6,} might not be the best regex for "long words" to start with. \w will, e.g., also match digits and the underscore _. If you want to match exclusively (ASCII) letters, better use [A-Za-z], but this might cause problems with non-ASCII letters, such as umlauts, accented characters, Arabic, etc. Also, you might want to include word boundaries, i.e. \b, to make sure that the six letters are not part of a longer, non-word sequence, i.e. use a regex like r'\b[A-Za-z]{6,}\b'.
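To make the difference concrete, here is a small Python 3 sketch comparing the two patterns. The sample string and the variable names `loose` and `strict` are made up for illustration.

```python
import re

# Hypothetical sample text containing an identifier and a number.
text = "version_2024 released: performance improved 1234567"

# \w also accepts digits and underscores.
loose = re.findall(r"\w{6,}", text)
# Letters only, and the whole token must consist of letters.
strict = re.findall(r"\b[A-Za-z]{6,}\b", text)

print(loose)   # ['version_2024', 'released', 'performance', 'improved', '1234567']
print(strict)  # ['released', 'performance', 'improved']
```

The `\b` anchors are what stop the strict pattern from picking `version` out of `version_2024`, since the underscore counts as a word character.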
Try:
import re
nr_long_words = 0
with open('input.txt', 'r') as f:
    for line in f:
        matches = re.findall(r"\w{6,}", line)
        nr_long_words += len(matches)
print(nr_long_words)
It should print the number of words with 6 or more characters in the file.

Regex Search Pattern

I am searching a text file consisting of single words on each line for the following:
Lines that have two consecutive a’s in them but which don’t start with an a
import re
import sys
pattern = '^[^Aa][A-Za-z]*[Aa]{2}'
regexp = re.compile(pattern)
inFile = open('words.txt', 'r')
outFile = open('exercise04.log', 'w')
for line in inFile:
    match = regexp.search(line)
    if match:
        outFile.write(line)
inFile.close()
outFile.close()
My main concern is my regex search pattern rather than the python itself. I understand the ^[^Aa] at the start stops the first character from being 'A' or 'a', but is there a better way of breaking out of this statement to check for two consecutive 'a's in each word than I have used?
Your pattern looks fine.
If you want to make sure the first character is a letter, use
pattern = '^[B-Zb-z][A-Za-z]*[Aa]{2}'
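A quick way to compare the two patterns is to run them over a few sample words. This is only a sketch in Python 3; the word list and the names `original` and `letters_only` are invented for illustration.

```python
import re

original = re.compile(r'^[^Aa][A-Za-z]*[Aa]{2}')
letters_only = re.compile(r'^[B-Zb-z][A-Za-z]*[Aa]{2}')

# Hypothetical test words: '1aa' starts with a non-letter.
words = ["bazaar", "aardvark", "Isaac", "banana", "1aa"]

original_hits = [w for w in words if original.search(w)]
letters_only_hits = [w for w in words if letters_only.search(w)]

print(original_hits)      # ['bazaar', 'Isaac', '1aa']
print(letters_only_hits)  # ['bazaar', 'Isaac']
```

The only difference shows up for words whose first character is not a letter, which `[^Aa]` accepts and `[B-Zb-z]` rejects.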

Python Extract every sentence that contains Parenthesis

import re

with open(searchfile) as f:
    pattern = "\.?(?P<sentence>.*?\(([A-Za-z0-9_]+)\).*?)\."
    for line in f:
        match = re.search(pattern, line)
        if match is not None:
            print match.group("sentence")
I am trying to extract every sentence that contains an acronym in parentheses (essentially 2-4 letters, all caps, in parentheses).
In: Here is an (ABC) example. Do not include this sentence. Include this (AB) one. And (AVCD) this one.
Out: Here is an (ABC) example. Include this (AB) one. And (AVCD) this one.
You can use this:
[^.]*?\([A-Z]{2,4}\)[^.]*\.
But note that this is a particularly inefficient approach, since the pattern starts with a very permissive subpattern. You can improve that a little by adding a kind of anchor at the beginning:
(?:(?<=\.)|^)[^.]*?\([A-Z]{2,4}\)[^.]*\.
Unfortunately, even with this anchor, the regex engine must check the two alternatives for most of the characters of the string.
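For reference, a minimal Python 3 sketch of the first, simpler pattern applied to the sample input from the question. The variable name `sentences` and the `strip()` call, which removes the space left at the start of each later match, are additions for illustration.

```python
import re

txt = ('Here is an (ABC) example. Do not include this sentence. '
       'Include this (AB) one. And (AVCD) this one.')

# Each match runs from the start of a sentence to its closing dot,
# provided the sentence contains a 2-4 letter upper-case acronym.
sentences = [s.strip()
             for s in re.findall(r'[^.]*?\([A-Z]{2,4}\)[^.]*\.', txt)]
print(sentences)
# ['Here is an (ABC) example.', 'Include this (AB) one.', 'And (AVCD) this one.']
```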
A better approach might be to match either a sentence terminator or a substring that runs from the acronym to the end of its sentence, and then to slice the text using the end offset of each match:
#!/usr/bin/python
import re
txt = 'Here is an (ABC) example. Do not include this sentence. Include this (AB) one. And (AVCD) this one.'
pattern = re.compile(r'([!.?])(?=\s)|\([A-Z]{2,4}\)[^.]*(?:\.|$)')
offset = 0
result = ''
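for m in pattern.finditer(txt):
    # group 1 is None when the acronym alternative matched
    if m.group(1) is None:
        result += txt[offset:m.end()]
    # move past this match, whether or not the sentence was kept
    offset = m.end()
print result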
Note: you can't be sure that a dot marks the end of a sentence; it can be something else (for example, part of an abbreviation or a decimal number).
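A slightly more efficient pattern, using possessive quantifiers (++), which Python's built-in re module only supports from version 3.11 onward: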
([^.(]++\([^.)]++\)[^.)]++\.)

Python extract words from a txt file

Is it possible to search for a series of words and extract the next word? For example, in a txt file, search for the word 'Test' and then return the word directly after it.
Test.txt
This is a test to test the function of the python code in the test environ_ment
I'm looking to get the results:-
to, the, environ_ment
You can use a regular expression for this:
import re
txt = "This is a test to test the function of the python code in the test environ_ment"
print re.findall("test\s+(\S+)", txt) # ['to', 'the', 'environ_ment']
The regular expression matches with "test" when it is followed by white space (\s+) and a series of non-white space characters \S+. The latter matches the words you are looking for and is put in a capture group (with parentheses) in order to return that part of the matches.
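A Python 3 variant of the same idea, as a sketch. The `\b` anchor and the `re.IGNORECASE` flag are additions not present in the answer above, and the sample string is changed slightly (one capitalized "Test") to show the flag at work.

```python
import re

txt = ("This is a Test to test the function of the python code "
       "in the test environ_ment")

# \b stops 'test' from matching inside words such as 'contest';
# re.IGNORECASE also accepts 'Test'.
following = re.findall(r"\btest\s+(\S+)", txt, flags=re.IGNORECASE)
print(following)  # ['to', 'the', 'environ_ment']
```

Since `\S+` takes everything up to the next whitespace, trailing punctuation would be included in the captured word; `(\w+)` could be used instead if that matters.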