Parse text between multiple lines - Python 2.7 and re Module - regex

I have a text file I want to parse. The file has multiple items I want to extract. I want to capture everything between a colon ":" and a particular word. Take the following example.
Description : a pair of shorts
amount : 13 dollars
requirements : must be blue
ID1 : 199658
----
The following code parses the information out.
import re
f = open ("parse.txt", "rb")
fileRead = f.read()
Description = re.findall("Description :(.*?)amount", fileRead, re.DOTALL)
amount = re.findall("amount :(.*?)requirements", fileRead, re.DOTALL)
requirements = re.findall("requirements :(.*?)ID1", fileRead, re.DOTALL)
ID1 = re.findall("ID1 :(.*?)-", fileRead, re.DOTALL)
print Description[0]
print amount[0]
print requirements[0]
print ID1[0]
f.close()
The problem is that sometimes the text file will contain extra line breaks, such as this:
Description
: a pair of shorts
amount
: 13 dollars
requirements: must be blue
ID1: 199658
----
In this case my code will not work because it is unable to find "Description :" because it is now separated into a new line. If I choose to change the search to ":(.*?)requirements" it will not return just the 13 dollars, it will return a pair of shorts and 13 dollars because all of that text is in between the first colon and the word, requirements. I want to have a way of parsing out the information no matter if there is a line break or not. I have hit a road block and your help would be greatly appreciated.

You can use a regex like this:
Description[^:]*:\s*(.*)
^--- use the keyword you want
The [^:]* part skips everything up to the colon, line breaks included, and (.*) captures the rest of that line.
Applied to your code:
import re
f = open("parse.txt", "rb")
fileRead = f.read()
Description = re.findall(r"Description[^:]*:\s*(.*)", fileRead)
amount = re.findall(r"amount[^:]*:\s*(.*)", fileRead)
requirements = re.findall(r"requirements[^:]*:\s*(.*)", fileRead)
ID1 = re.findall(r"ID1[^:]*:\s*(.*)", fileRead)
print Description[0]
print amount[0]
print requirements[0]
print ID1[0]
f.close()

You can simply do this:
import re
f = open ("new.txt", "rb")
fileRead = f.read()
# raw string, so that \b is a regex word boundary and not a backspace character
keyvals = {k.strip():v.strip() for k,v in dict(re.findall(r'([^:]*):(.*)(?=\b[^:]*:|$)',fileRead,re.M)).iteritems()}
print(keyvals)
f.close()
Output:
{'amount': '13 dollars', 'requirements': 'must be blue', 'Description': 'a pair of shorts', 'ID1': '199658'}
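For readers on Python 3, the same idea can be sketched with one search per key. This is only an illustration: the helper name parse_fields and the inline sample text are assumptions, not part of the original answers. It allows optional whitespace, including a line break, between the key and its colon.

```python
import re

def parse_fields(text, keys):
    """Return {key: value} for each key found in text (hypothetical helper)."""
    result = {}
    for key in keys:
        # \s* lets spaces or a newline sit between the key and its colon;
        # (.*) then captures only the remainder of that line.
        pattern = r"^" + re.escape(key) + r"\s*:[ \t]*(.*)$"
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            result[key] = match.group(1).strip()
    return result

# Sample text in the "broken across lines" layout from the question.
sample = (
    "Description\n: a pair of shorts\n"
    "amount\n: 13 dollars\n"
    "requirements: must be blue\n"
    "ID1: 199658\n----\n"
)
fields = parse_fields(sample, ["Description", "amount", "requirements", "ID1"])
print(fields)
```

Because each key is searched for separately, the order of the fields in the file does not matter, and a missing key is simply left out of the result.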

Related

rstrip, split and sort a list from input text file

I am new to Python. I am trying to rstrip each line, split it into words, append the words to a list, and then sort the list alphabetically. I don't know what I am doing wrong.
fname = input("Enter file name: ")
fh = open(fname)
lst = list(fh)
for line in lst:
    line = line.rstrip()
    y = line.split()
    i = lst.append()
    k = y.sort()
print y
I was able to fix my code and get the expected output.
This is what I was hoping to write:
name = input('Enter file: ')
handle = open(name, 'r')
wordlist = list()
for line in handle:
    words = line.split()
    for word in words:
        if word in wordlist: continue
        wordlist.append(word)
wordlist.sort()
print(wordlist)
If you are using Python 2.7, I believe you need to use raw_input(); in Python 3.x, input() is correct. Also, you are not calling append() correctly: it is a list method and needs the item to add as its argument.
fname = raw_input("Enter filename: ") # Stores the filename given by the user input
fh = open(fname,"r") # Here we are adding 'r' as the file is opened as read mode
lines = fh.readlines() # This will create a list of the lines from the file
# Sort the lines alphabetically
lines.sort()
# Rstrip each line of the lines list
y = [l.rstrip() for l in lines]
# Print out the result
print y
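As a rough Python 3 sketch of the word-level version the asker describes (strip, split, de-duplicate, sort), a set can do the duplicate check. The function name sorted_unique_words and the two sample lines are made up for illustration; in practice the lines would come from the opened file.

```python
def sorted_unique_words(lines):
    """Collect every distinct whitespace-separated word and return them sorted."""
    words = set()
    for line in lines:
        # rstrip() drops the trailing newline; split() breaks on any whitespace
        words.update(line.rstrip().split())
    return sorted(words)

# Stand-in for the lines of a file (assumed sample data).
sample_lines = ["But soft what light\n", "It is the east and Juliet is the sun\n"]
print(sorted_unique_words(sample_lines))
```

Note that sorted() orders uppercase letters before lowercase ones, so capitalised words come first unless a key such as str.lower is passed.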

Text processing to get if else type condition from a string

First of all, I am sorry about the weird question heading. Couldn't express it in one line.
So, the problem statement is,
If I am given the following string --
"('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
I have to parse it as
list1 = ["'James Gosling'", 'jamesgosling', 'jame gosling']
list2 = ["'SUN Microsystem'", 'sunmicrosystem']
list3 = [ list1, list2, keyword]
So that, if I enter James Gosling Sun Microsystem keyword it should tell me that what I have entered is 100% correct
And if I enter J Gosling Sun Microsystem keyword it should say I am only 66.66% correct.
This is what I have tried so far.
import re
def main():
    print("starting")
    sentence = "('James Gosling'/jamesgosling/jame gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
    splited = sentence.split(",")
    number_of_primary_keywords = len(splited)
    #print(number_of_primary_keywords, "primary keywords length")
    number_of_brackets = 0
    inside_quotes = ''
    inside_quotes_1 = ''
    inside_brackets = ''
    for n in range(len(splited)):
        #print(len(re.findall('\w+', splited[n])), "length of splitted")
        inside_brackets = splited[n][splited[n].find("(") + 1: splited[n].find(")")]
        synonyms = inside_brackets.split("/")
        for x in range(len(synonyms)):
            try:
                inside_quotes_1 = synonyms[x][synonyms[x].find("\"") + 1: synonyms[n].find("\"")]
                print(inside_quotes_1)
            except:
                pass
            try:
                inside_quotes = synonyms[x][synonyms[x].find("'") + 1: synonyms[n].find("'")]
                print(inside_quotes)
            except:
                pass
            #print(synonyms[x])
        number_of_brackets += 1
    print(number_of_brackets)
if __name__ == '__main__':
    main()
Output is as follows
'James Gosling
jamesgoslin
jame goslin
'SUN Microsystem
SUN Microsystem
sunmicrosyste
sunmicrosyste
3
As you can see, the last letters of some words are missing.
So, if you read this far, I hope you can help me in getting the expected output
Your code has a logic issue that I could not pin down completely, but it is probably in these lines:
inside_quotes_1 = synonyms[x][synonyms[x].find("\"") + 1: synonyms[n].find("\"")]
inside_quotes = synonyms[x][synonyms[x].find("'") + 1: synonyms[n].find("'")]
which, by the way, you can also write with hex escapes for the quote characters:
inside_quotes_1 = synonyms[x][synonyms[x].find("\x22") + 1: synonyms[n].find("\x22")]
inside_quotes = synonyms[x][synonyms[x].find("\x27") + 1: synonyms[n].find("\x27")]
Other than that, you seem to want to extract the words along with their positions. The words themselves can be extracted with a basic expression:
(\w+)
You would then need a simple way to locate where each word occurs and associate each word with the desired index.
Example Test
# -*- coding: UTF-8 -*-
import re
string = "('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
expression = r'(\w+)'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else:
    print('🙀 Sorry! No matches! Something is not right! Call 911 👮')
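One likely cause of the missing last letters in the question's output: str.find returns -1 when the quote character is absent, so the slice ends at index -1 and drops the final character (the end index also uses synonyms[n] where synonyms[x] was probably intended). A different approach is sketched below in Python 3. It is only an illustration under stated assumptions: the names parse_groups and match_percentage are hypothetical, surrounding quotes are stripped from the synonyms, and "correct" is taken to mean that at least one synonym of a group appears, case-insensitively, as a substring of the entered text.

```python
import re

def parse_groups(spec):
    """Split the spec into groups of synonyms; a bare word forms a one-item group."""
    groups = []
    # Each item is either "( ... )" or a run of characters without commas or brackets.
    for inside, bare in re.findall(r"\(([^)]*)\)|([^,()]+)", spec):
        raw = inside.split("/") if inside else [bare]
        # Drop surrounding whitespace and quote characters; skip empty pieces.
        synonyms = [s.strip().strip("'\"") for s in raw if s.strip()]
        if synonyms:
            groups.append(synonyms)
    return groups

def match_percentage(groups, entered):
    """Percentage of groups with at least one synonym found in the entered text."""
    if not groups:
        return 0.0
    text = entered.lower()
    hits = sum(any(s.lower() in text for s in synonyms) for synonyms in groups)
    return 100.0 * hits / len(groups)

spec = "('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
groups = parse_groups(spec)
print(groups)
print(match_percentage(groups, "James Gosling Sun Microsystem keyword"))
print(round(match_percentage(groups, "J Gosling Sun Microsystem keyword"), 2))
```

Substring matching is a deliberately simple choice; matching on whole words would need a stricter comparison.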

How to read a text file then convert it to a list of tuples

I want to convert a text file that contains, for example, this:
Alex
Gheith
40
John
Stewart
33
into:
[('Alex','Gheith','40'),('John','Stewart','33')]
Current code:
records =[]
f10 = open("PlayerRecords.txt","r")
for line in f10:
    line = line.strip()
    records.append(line)
t = ()
f10.close()
t = [(x,) for x in records]
print t
Current output:
[('Alex','Gheith',40),('John','Stewart',33)]
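Try something like the code below. I have taken str1 to be the file contents as a multi-line string.
list1 = [line.strip() for line in str1.splitlines()]
l_iter = iter(list1)
# zipping the same iterator three times yields consecutive groups of three lines
mapped = zip(l_iter,l_iter,l_iter)
# list() keeps the file order; a set would lose it
mapped = list(mapped)
print (mapped)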
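The grouping step can be wrapped in a small Python 3 function. This is a sketch, with lines_to_records as a made-up name and the sample list standing in for the lines of PlayerRecords.txt; it assumes records are always exactly three lines long and that blank lines should be ignored.

```python
def lines_to_records(lines, size=3):
    """Group consecutive non-blank lines into tuples of `size` items, in order."""
    cleaned = [line.strip() for line in lines if line.strip()]
    it = iter(cleaned)
    # zip() pulls from the same iterator `size` times per tuple,
    # so each tuple holds the next `size` lines.
    return list(zip(*[it] * size))

# Stand-in for the lines read from the file (assumed sample data).
sample = ["Alex\n", "Gheith\n", "40\n", "John\n", "Stewart\n", "33\n"]
print(lines_to_records(sample))
```

A trailing group with fewer than three lines is silently dropped by zip(), which may or may not be what you want.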

How can I perform multiple re.sub() on a file?

I am attempting to perform multiple regex alterations of a file but I'm not sure how to do this while retaining the previous alterations. I have found several ways to do this but I'm new to coding and couldn't get them to work in my code.
import re
import sys
if len(sys.argv) != 3:
    sys.exit('Error: One input and one output file is required')
fasta = open(sys.argv[1],'r')
output = open(sys.argv[2],'r+')
output1 = re.sub(r'^>\w+\|(\d+)\|.*LOXAF.*', r'>Loxodonta africana, \1, MW =',fasta)
output2 = re.sub(r'^>\w+\|(\d+)\|.*DUGDU.*', r'>Dendrohyrax dorsalis, \1, MW =',output1)
output3 = re.sub(r'(^[A-Z].*)\n', r'\1',output2)
print(output3)
Ideally, I would write all of the regex to the output file instead of just printing it. I put an example of changes I'd like to make below (I cut the number and length of sequences down for simplicity).
>gi|75074720|sp|Q9TA19.1|NU5M_LOXAF RecName: Full=NADH-ubiquinone oxidoreductase chain 5; AltName: Full=NADH dehydrogenase subunit 5
MKVINLIPTLMLTSLIILTLPIITTLLQNNKTNCFLYITKTAVTYAFAISLIPTLLFIQSNQEAYISNWH
WMTIHTLKLSMSFKLDFFSLTFMPIALFITWSIM
>gi|75068112|sp|Q9TA29.1|NU1M_LOXAF RecName: Full=NADH-ubiquinone oxidoreductase chain 1; AltName: Full=NADH dehydrogenase subunit 1
MFLINVLTVTLPILLAVAFLTLVERKALGYMQLRKGPNVVGPYGLLQPIADAIKLFTKEPIYPQTSSKFL
FTVAPILALTLALTVWAPLPMPYPLINLNLSL
>gi|24418335|sp|Q8W9N2.1|ATP8_DUGDU RecName: Full=ATP synthase protein 8; AltName: Full=A6L; AltName: Full=F-ATPase subunit 8
MPQLDTTTWFITILSMLITLFILFQTKLLNYTYPLNALPISPNVTNHLTPWKMKWTKTYLPLSLPLQ
Output:
>Loxodonta africana, 75074720, MW =
MKVINLIPTLMLTSLIILTLPIITTLLQNNKTNCFLYITKTAVTYAFAISLIPTLLFIQSNQEAYISNWHWMTIHTLKLSMSFKLDFFSLTFMPIALFITWSIM
>Loxodonta africana, 75068112, MW =
MFLINVLTVTLPILLAVAFLTLVERKALGYMQLRKGPNVVGPYGLLQPIADAIKLFTKEPIYPQTSSKFLFTVAPILALTLALTVWAPLPMPYPLINLNLSL
>Dendrohyrax dorsalis, 24418335, MW =
MPQLDTTTWFITILSMLITLFILFQTKLLNYTYPLNALPISPNVTNHLTPWKMKWTKTYLPLSLPLQ
Thanks for all of your help!
FASTA files can be very large, so it isn't a good idea to load the whole file into a variable. I suggest working line by line, which uses less memory.
A FASTA file has a defined format; it is not arbitrary text. Understanding and using that format will help you extract the information you want without resorting to three blind regex replacements.
Suggestion:
import re
import sys
from itertools import takewhile
if len(sys.argv) != 3:
    sys.exit('Error: One input and one output file is required')
# 'w' creates or truncates the output file ('r+' fails if it does not exist yet)
with open(sys.argv[1], 'r') as fi, open(sys.argv[2], 'w') as fo:
    species = {
        'LOXAF': 'Loxodonta africana',
        'DUGDU': 'Dendrohyrax dorsalis'
    }
    sep = re.compile(r'[|_ ]')
    recF = ">{}, {}, MW =\n{}"
    def getSeq(f):
        # join the sequence lines up to the next blank line
        return ''.join([line.rstrip() for line in takewhile(lambda x: x!="\n", f)])
    for line in fi:
        if line.startswith('>'):
            # header fields: parts[1] is the numeric id, parts[5] the species code
            parts = sep.split(line, 6)
            print(recF.format(species[parts[5]], parts[1], getSeq(fi)), file=fo)
You can try something like this:
import re
import sys
if len(sys.argv) != 3:
    sys.exit('Error: One input and one output file is required')
else:
    fasta = open(sys.argv[1],'r')
    fasta_content = fasta.read()
    print(fasta)
    output = open(sys.argv[2],'w')
    output1 = re.sub(r'>\w+\|(\d+)\|.*LOXAF.*', r'>Loxodonta africana, \1, MW =',fasta_content)
    print(output1)
    output2 = re.sub(r'>\w+\|(\d+)\|.*DUGDU.*', r'>Dendrohyrax dorsalis, \1, MW =',output1)
    print(output2)
    output3 = re.sub(r'([A-Z]+)\n', r'\1',output2)
    print(output3)
    output.write(output3)
    output.close()
    fasta.close()
First of all, re.sub() needs to operate on text, not on the file object, so read() is needed.
To write to the output file you can use output.write(), but the file has to be opened with the 'w' option.
The regexes in the question would not match as intended because each one starts with ^, which by default matches only at the very beginning of the string (unless you read line by line), and read() returns the whole text as a single string.
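If you would rather keep the ^ anchors, the re.MULTILINE flag makes ^ match at the start of every line, so the substitutions can be chained on the full text. The Python 3 sketch below illustrates this; the sequences are shortened stand-ins rather than the real data, and the lookahead in the third pattern is an added assumption so that the last sequence line of a record is not glued onto the next header.

```python
import re

# Shortened stand-in for the FASTA input (headers trimmed, sequences invented).
fasta_text = (
    ">gi|75074720|sp|Q9TA19.1|NU5M_LOXAF RecName: Full=NADH-ubiquinone oxidoreductase chain 5\n"
    "MKVINLIPTL\n"
    "WMTIHTLKLS\n"
    ">gi|24418335|sp|Q8W9N2.1|ATP8_DUGDU RecName: Full=ATP synthase protein 8\n"
    "MPQLDTTTWF\n"
)

substitutions = [
    (r"^>\w+\|(\d+)\|.*LOXAF.*", r">Loxodonta africana, \1, MW ="),
    (r"^>\w+\|(\d+)\|.*DUGDU.*", r">Dendrohyrax dorsalis, \1, MW ="),
    # Remove the newline after a sequence line only when another sequence line follows.
    (r"^([A-Z]+)\n(?=[A-Z])", r"\1"),
]

result = fasta_text
for pattern, replacement in substitutions:
    # Each substitution runs on the output of the previous one.
    result = re.sub(pattern, replacement, result, flags=re.MULTILINE)

print(result)
```

Feeding each result into the next call is what retains the earlier alterations; the final string can then be written once with output.write().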

Python RE: Find all matches of pattern b following pattern a

I have a text file that looks like this:
Warning-[blah1]
few lines
Warning-[blah2]
few more lines
Total warnings: 2
few more lines
Warning-[blah3]
more of random lines
Warning-[blah4]
My objective is to find all matches of Warnings that come after the line "Total warnings: 2".
So far I have tried two approaches:
regex = re.compile('Total\swarnings.*(Warning-[\S+])',re.DOTALL)
regex = re.compile('Total\swarnings.*?(Warning-[\S+])',re.DOTALL)
The first approach gives me the greedy result i.e. matches only blah4 and the second matches only blah3. How can I get it to match both?
I am using findall.
import re
with open('sample.txt') as f:
    f = f.read()
f = f.split('Total warnings: 2')
# keep only the text after the marker line
f = f[1:]
for el in f:
    el = el.split("\n")
    el = [x for x in el if re.match(r'Warning\-\[.*?\]',x,flags=re.IGNORECASE)]
    print el
You could try splitting the text file on "Total warnings", and then only processing the second half of the file:
import re
with open('yourfile.txt') as f:
    halves = f.read().split('Total warnings')
regex = re.compile(r'Warning-\[(\S+)\]')
matches = re.findall(regex, halves[1])
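A self-contained Python 3 variant of the same idea is sketched below, using str.partition so that a missing marker does not raise an IndexError. The function name warnings_after is hypothetical, and the inline log_text simply reproduces the sample from the question.

```python
import re

log_text = """Warning-[blah1]
few lines
Warning-[blah2]
few more lines
Total warnings: 2
few more lines
Warning-[blah3]
more of random lines
Warning-[blah4]
"""

def warnings_after(text, marker="Total warnings"):
    """Return the bracketed names of all warnings that appear after the marker."""
    # partition() splits at the first occurrence of the marker only
    _, found, tail = text.partition(marker)
    if not found:
        return []
    return re.findall(r"Warning-\[([^\]]+)\]", tail)

print(warnings_after(log_text))  # ['blah3', 'blah4']
```

Searching only the tail avoids the greedy versus non-greedy problem entirely, because findall then returns every warning in that part of the text.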