Python extract words from a txt file - python-2.7

Is it possible to search for a series of words and extract the word that follows each one? For example, search a txt file for the word 'Test' and return the word directly after it.
Test.txt
This is a test to test the function of the python code in the test environ_ment
I'm looking to get the results:-
to, the, environ_ment

You can use a regular expression for this:
import re
txt = "This is a test to test the function of the python code in the test environ_ment"
print(re.findall(r"test\s+(\S+)", txt))  # ['to', 'the', 'environ_ment']
The regular expression matches "test" when it is followed by white space (\s+) and a series of non-white-space characters (\S+). The latter matches the words you are looking for and is put in a capture group (with parentheses) so that findall returns only that part of each match.
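The question asks for 'Test' while the sample text contains 'test', so the same idea can be wrapped in a small helper that ignores case. This is only a sketch: the name next_words and the case-insensitive, whole-word matching are my assumptions, not part of the answer above.

```python
import re

def next_words(text, keyword):
    """Return the word that directly follows each occurrence of keyword."""
    # re.escape guards against regex metacharacters in keyword;
    # \b keeps 'test' from matching inside e.g. 'contest'.
    pattern = r"\b" + re.escape(keyword) + r"\s+(\S+)"
    return re.findall(pattern, text, flags=re.IGNORECASE)

txt = "This is a test to test the function of the python code in the test environ_ment"
print(next_words(txt, "Test"))
```

For a file, the text argument could be the result of reading Test.txt in full.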

Related

Python Regex to add a "?" to the beginning of a word in a word list

# open text file
with open('words') as f:
    for line in f.readlines():
        # pull out all 3 letter words using regular expression and add to wordlist
        word_list += re.findall(r'\b(\w{3})\b', line)
I use this to find all 3 letter words in a dictionary. From there, I want to add a question mark to the beginning of each word. I assume I need the re.sub function, but can't seem to get the syntax right.
You can do this a few ways. One is to collect all the 3-letter words first and prefix them afterwards; another is to keep the shape of your current loop and extend the list as you go. There is no real need for re.sub if what you want in the end is a list of 3-letter words prefixed with ?.
Sample words file:
the quick brown fox called bob jumped over the lazy dog
and went straight to bed
cos bob needed to sleep right now
Sample code:
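import re

word_list = []
with open('words') as fin:
    for line in fin:
        # collect every 3-letter word on this line
        matches = re.findall(r'\b(\w{3})\b', line)
        # prefix each one with '?' (f-strings need Python 3.6+)
        word_list.extend(f'?{word}' for word in matches)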
Sample word_list after run:
['?the',
'?fox',
'?bob',
'?the',
'?dog',
'?and',
'?bed',
'?cos',
'?bob',
'?now']
You can use re.sub, where \1 refers to the first capture group:
re.sub(r'\b(\w{3})\b', r'?\1', line)
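To make the difference from the findall approach concrete, here is a small sketch on an in-memory line (the sample line is made up for illustration). re.sub returns the whole line with each 3-letter word prefixed, rather than a list of words:

```python
import re

line = "the quick brown fox called bob"
# \1 re-inserts whatever the first capture group matched.
prefixed = re.sub(r'\b(\w{3})\b', r'?\1', line)
print(prefixed)  # ?the quick brown ?fox called ?bob
```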
First compile the pattern:
pattern = re.compile(r'\b(\w{3})\b')
and then use it like this:
word_list += ['?' + word for word in pattern.findall(line)]

python:How to extract a word before and after the match using regex

Consider the following data as a sample:
input_corpus = "this is an example.\n I am trying to extract it.\n"
I am trying to extract exactly 2 words before and after .\n with the following code
for m in re.finditer('(?:\S+\s+){2,}[\.][\n]\s*(?:\S+\b\s*){0,2}',input_corpus):
    print(m)
Expected output :
an example. I am
extract it.
Actual output: Nothing gets captured
Can someone point out what is wrong with the regex?
You may use this regex:
r'(?:^|\S+\s+\S+)\n(?:\s*\S+\s+\S+|$)'
Code:
>>> input_corpus = "this is an example.\n I am trying to extract it.\n"
>>> print(re.findall(r'(?:^|\S+\s+\S+)\n(?:\s*\S+\s+\S+|$)', input_corpus))
['an example.\n I am', 'extract it.\n']
Details:
(?:^|\S+\s+\S+): Match the preceding 2 words, or the start of the string
\n: Match a newline
(?:\s*\S+\s+\S+|$): Match the next 2 words, or the end of the string
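The expected output in the question has the newline collapsed into a single space, whereas findall returns the raw matches. A minimal sketch of how the results could be normalized that way (the helper name context_around_newlines is hypothetical):

```python
import re

# Same regex as in the answer above, compiled once.
CONTEXT_RE = re.compile(r'(?:^|\S+\s+\S+)\n(?:\s*\S+\s+\S+|$)')

def context_around_newlines(text):
    """Return each match with all runs of whitespace collapsed to one space."""
    return [' '.join(m.split()) for m in CONTEXT_RE.findall(text)]

input_corpus = "this is an example.\n I am trying to extract it.\n"
print(context_around_newlines(input_corpus))
```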

Apply regular expression to the second word in a "|"-separated string in a Flume interceptor config

My requirement is to apply regular expression to the data coming from kafka.
The data is as follows:
abc|def|mnq|xyz
abc1|def1|mnq1|xyz1
abc2|def2|mnq2|xyz2
I want to apply a regular expression to the second word, i.e. (def), from the first string using a Flume interceptor.
The regular expression should filter words and decimal numbers.
Can someone help with this?
The following Python code matches the second word on every line:
import re
# "||" is used to join multiple lines into one string
parent = """abc|def|mnq|xyz||
abc1|def1|mnq1|xyz1||
abc2|def2|mnq2|xyz2"""
pattern = re.compile(r"\w+\|(.*?)\|\w+", re.MULTILINE)
m = pattern.findall(parent)
print(m)
which outputs:
['def', 'def1', 'def2']
Note: escape '|' with a backslash ('\|'); an unescaped '|' means alternation in a regular expression.
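The question also asks to filter on the second field. The sketch below is not Flume configuration, only a Python illustration of the idea: split each record on '|' and test the second field. The names FIELD_RE and second_field_matches are hypothetical, the pattern for "a word or a decimal number" is my guess at the requirement, and the sample records with 12.5 and !! are made up.

```python
import re

# Assumed meaning of "words and decimal numbers"; adjust to the real requirement.
FIELD_RE = re.compile(r'^(?:[A-Za-z]\w*|\d+(?:\.\d+)?)$')

def second_field_matches(line):
    """True if the record has a second '|'-separated field that fits FIELD_RE."""
    fields = line.split('|')
    return len(fields) > 1 and FIELD_RE.match(fields[1]) is not None

lines = ["abc|def|mnq|xyz", "abc1|12.5|mnq1|xyz1", "abc2|!!|mnq2|xyz2"]
kept = [line for line in lines if second_field_matches(line)]
print(kept)
```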

matching word boundaries in RegEx python 2.7

I have the following code that can return a line from text where a certain word exists
with open('/Users/Statistical_NLP/Project/text.txt') as f:
    haystack = f.read()
with open('/Users/Statistical_NLP/Project/test.txt') as f:
    for line in f:
        needle = line.strip()
        pattern = '^.*{}.*$'.format(re.escape(needle))
        for match in re.finditer(pattern, haystack, re.MULTILINE):
            print(match.group(0))
How can I search for a word and return not the whole line, but just the three words before and the three words after it?
Something has to be changed in this line in my code:
pattern = '^.*{}.*$'.format(re.escape(needle))
Thanks a lot
The following regex will help you achieve what you want.
((?:\w+\s+){3}YOUR_WORD_HERE(?:\s+\w+){3})
For a better understanding of the regex, I suggest you go to the following page and experiment with it.
https://regex101.com/r/eS8zW5/3
This matches the three words before the search word, the word itself, and the three words after it.
The following variant matches up to three words before and after, so it still works when fewer are available:
((?:\w+\s+){0,3}YOUR_WORD_HERE(?:\s+\w+){0,3})
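A minimal sketch of how the second regex could be built from a search word in Python. The helper name context_pattern, its n parameter, and the sample haystack are assumptions for illustration; note that \w+ does not span punctuation, so words next to commas or full stops are not counted as context.

```python
import re

def context_pattern(needle, n=3):
    """Build a regex matching needle with up to n words on either side."""
    escaped = re.escape(needle)  # treat the search word literally
    return re.compile(r'(?:\w+\s+){0,%d}%s(?:\s+\w+){0,%d}' % (n, escaped, n))

haystack = "one two three four five six seven eight nine"
match = context_pattern("five").search(haystack)
print(match.group(0))  # two three four five six seven eight
```

In the loop from the question, the pattern line could be replaced by a call to context_pattern(needle), followed by finditer over haystack.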

Find missing entries in one file

I've got two files:
1st: Entries.txt
confirmation.resend
send
confirmation.showResendForm
login.header
login.loginBtn
2nd: Used_Entries.txt
confirmation.showResendForm = some value
login.header = some other value
I want to find all entries from the first file (Entries.txt) that have not been assigned a value in the 2nd file (Used_Entries.txt).
In this example I'd like the following result:
confirmation.resend
send
login.loginBtn
In the result, confirmation.showResendForm and login.header do not show up because they exist in Used_Entries.txt.
How do I do this? I've been playing around with regular expressions but haven't been able to solve it. A bash script or something similar would be much appreciated!
You can do this with a regex, but a single regex cannot read two files at once, and we do want to match against both contents at once. So first concatenate the contents of the two files with at least one newline in between; that takes a little code in whatever language you are using.
This regex solution expects your string to be matched to be in this format:
text (no equals sign)
text
text
...
key (no equals sign) ␣ (optional whitespace) = (literal equal) whatever (our regex will skip this part.)
key=whatever
key=whatever
Do I have your attention? Yes? Please see the following regex (using techniques accessible to most regex engines):
/(^[^=\n]+$)(?!(?s).*^\1\s*=)/m
Inspired by a recent answer I saw from zx81: you can switch on the (?s) flag in the middle of the pattern to enter DOTALL mode from that point on, which lets . match across lines. Using this technique and the format above, here is what the regex does:
(^[^=\n]+$) Goes through all the text (no equals sign) elements. Enforces no equals signs or newlines in the capture. This means our regex hits every text element as a line, and tries to match it appropriately.
(?! Opens a negative lookahead group. Asserts that this match will not locate the following:
(?s).* Any number of characters, including newlines. Because the match is greedy, it first jumps to the very end of the string and then backtracks, so the key lines at the end of the document are reached quickly.
^\1\s*= The captured key, followed by optional whitespace and an equals sign, on its own line.
) Ends our group.
I'm stupid. I could have just put this:
/(^[^=\n]+$)(?!.*^\1\s*=)/sm
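For reference, a minimal Python sketch of the concatenate-then-match idea. The helper name missing_entries and the inline sample strings are mine. It passes the flags as arguments (the /sm form) rather than switching (?s) on mid-pattern, which Python 3.11 and later reject, and it assumes every line of the second file contains an equals sign.

```python
import re

# Same pattern as above, with DOTALL and MULTILINE given as flags.
MISSING_RE = re.compile(r'(^[^=\n]+$)(?!.*^\1\s*=)', re.DOTALL | re.MULTILINE)

def missing_entries(entries_text, used_text):
    """Return lines of entries_text that never appear as 'key =' in used_text."""
    # Join the two contents with a blank line in between, as the answer suggests.
    combined = entries_text.rstrip('\n') + '\n\n' + used_text
    return MISSING_RE.findall(combined)

entries = ("confirmation.resend\nsend\nconfirmation.showResendForm\n"
           "login.header\nlogin.loginBtn\n")
used = ("confirmation.showResendForm = some value\n"
        "login.header = some other value\n")
print(missing_entries(entries, used))
```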
I was making this a bit too complex, and in the end just solved it with a small script in Scala:
import scala.io.Source
object HelloWorld {
  def main(args: Array[String]) {
    val entries = (for(line <- Source.fromFile("Entries.txt").getLines()) yield {
      line
    }).toList
    val usedEntries = (for(line <- Source.fromFile("Used_Entries.txt").getLines()) yield {
      line.dropRight(line.length - line.indexOf(' '))
    }).toList
    println(entries)
    println(usedEntries)
    val missingEntries = (for {
      entry <- entries
      if !usedEntries.exists(_ == entry)
    } yield {
      entry
    }).toList
    println(missingEntries)
    println("Missing Entries: ")
    println()
    for {
      missingEntry <- missingEntries
    } yield {
      println(missingEntry)
    }
  }
}
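Alternatively, in Python:
import re

# Read the candidate entries and the raw text of the used-entries file.
with open("Entries.txt") as e:
    entries = e.readlines()
with open("Used_Entries.txt") as u:
    used_text = u.read()

# Drop everything from "=" onward so only the keys remain.
used_keys = set(k.strip() for k in re.sub(r"\s*=.*", "", used_text).split("\n"))

# Print every entry that has no assigned value.
for entry in entries:
    if entry.strip() not in used_keys:
        print(entry.strip())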