How can I extract information in the lines between two headers? - python-2.7

I am new to python and am attempting to use this currently nonfunctioning code to extract information between two headers from a text file.
with open('toysystem.txt','r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    i = 0
    lines = f.readlines()
    for line in lines:
        if line == start:
            keywords = lines[i+1]
        i += 1
For reference, the text file looks like this:
<Keywords>
GTO
</Keywords>
Any ideas on what might be wrong with the code? Or perhaps a different way to approach this problem?
Thank you!

Lines read from a file keep the trailing newline character, so they should be stripped before being compared with the header.
A file object is its own iterator, so there is no need to call its readlines method here.
So we can write something like
with open('toysystem.txt', 'r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    keywords = []
    for line in f:
        if line.rstrip() == start:
            break
    for line in f:
        if line.rstrip() == end:
            break
        keywords.append(line)
gives us
>>> keywords
['GTO\n']
If you don't need the newline at the end of each keyword either, strip those lines too:
with open('toysystem.txt', 'r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    keywords = []
    for line in f:
        if line.rstrip() == start:
            break
    for line in f:
        if line.rstrip() == end:
            break
        keywords.append(line.rstrip())
gives
>>> keywords
['GTO']
But in this case it is better to create a generator of stripped lines, like
with open('toysystem.txt', 'r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    keywords = []
    stripped_lines = (line.rstrip() for line in f)
    for line in stripped_lines:
        if line == start:
            break
    for line in stripped_lines:
        if line == end:
            break
        keywords.append(line)
which does the same.
Finally, if you need the original lines later in the script, we can call the file's readlines method and build the stripped-lines generator from its result:
with open('toysystem.txt', 'r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    keywords = []
    lines = f.readlines()
    # the generator already strips each line, so no further rstrip calls are needed
    stripped_lines = (line.rstrip() for line in lines)
    for line in stripped_lines:
        if line == start:
            break
    for line in stripped_lines:
        if line == end:
            break
        keywords.append(line)
gives us
>>> lines
['<Keywords>\n', 'GTO\n', '</Keywords>\n']
>>> keywords
['GTO']
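As a rough sketch of how the same two-loop idea could be wrapped up for reuse, the function below takes any iterable of lines. The name lines_between is hypothetical (it is not from the answer above), and io.StringIO stands in for the real file so that the example runs on its own under Python 3.

```python
import io

def lines_between(lines, start, end):
    """Yield stripped lines found after `start` and before `end`."""
    stripped = (line.rstrip() for line in lines)
    # first loop: skip everything up to and including the start header
    for line in stripped:
        if line == start:
            break
    # second loop: the same generator continues from where the first stopped
    for line in stripped:
        if line == end:
            break
        yield line

# in-memory stand-in for toysystem.txt
sample = io.StringIO("<Keywords>\nGTO\n</Keywords>\n")
keywords = list(lines_between(sample, '<Keywords>', '</Keywords>'))
print(keywords)
```

Because both loops consume one shared generator, the second loop starts right after the header found by the first. If the start header never appears, nothing is yielded.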
Further reading
file objects,
iterators (including file iterators),
list comprehension,
generator expression

Use Python's re module instead and parse the file with a regular expression:
import re
with open('toysystem.txt','r') as f:
    contents = f.read()
    # Find every <Keywords>...</Keywords> block in the file and return a list of the values captured by the (). You can extend the expression according to your needs.
    # re.DOTALL lets .*? span several lines; the tag names must match the case used in the file.
    keywords = re.findall(r'<Keywords>\s*(.*?)\s*</Keywords>', contents, re.DOTALL)
    print(keywords)
For your file it will print
['GTO']
For more about regular expressions in Python, check Tutorialspoint, for Python 3 and Python 2.
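To illustrate how the regex approach behaves when a file holds several blocks, or blocks with more than one line, here is a small sketch. The sample text and the names blocks and keywords_per_block are made up for the illustration; the pattern assumes the tags are spelled exactly <Keywords> and </Keywords>.

```python
import re

# made-up sample: two blocks, the first with two lines
text = (
    "<Keywords>\nGTO\nLEO\n</Keywords>\n"
    "other text\n"
    "<Keywords>\nMEO\n</Keywords>\n"
)

# re.DOTALL lets '.' match newlines, so one block may span several lines;
# the lazy '.*?' stops at the first closing tag instead of the last one
blocks = re.findall(r'<Keywords>\s*(.*?)\s*</Keywords>', text, re.DOTALL)

# split each captured block back into its individual lines
keywords_per_block = [block.splitlines() for block in blocks]
print(keywords_per_block)
```

Without the lazy quantifier the match would run from the first opening tag to the last closing tag and swallow everything in between.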

Related

To find some words in a text file using regex and later print them in a different text file

I need to find some words such as inherited, INHERITANCE, Inheritable, etc., using regex, in a text file (origin.txt), and then write them, along with the number of the line where each was found, to a new text file (origin_spp.txt).
This is my code
re_pattern_string = r'(?:inherit|INHERIT|Inherit)*\w'
print('Opening origin.txt')
with open('origin.txt', 'r') as in_stream:
    print('Opening origin_spp.txt')
    with open('origin_spp.txt', 'w') as out_stream:
        for num, line in enumerate (in_stream):
            re_pattern_object = re.compile(re_pattern_string)
            line = line.strip()
            inherit_list = line.split()
            temp_list = re_pattern_object.findall('line')
            complete = origin_list.append('temp_list')
            for word in temp_list:
                out_stream.write(str(num) + '\t{0}\n'.format(word))
print("Done!")
print('origin.txt is closed?', in_stream.closed)
print('origin_spp.txt is closed?', out_stream.closed)
if __name__ == '__main__':
    print(temp_list)
Can you help me, please? I am not getting any output and I do not know where the error is.
Thank you in advance
I need to print the words that I want to find in the origin.txt in a different text file.
This new file must contain the number of the line in the origin.txt plus the word/s.
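Your code had some problems:
It's redundant to call re.compile inside the for loop; compile the pattern once before the loop.
In re_pattern_object.findall('line') and origin_list.append('temp_list'), don't wrap the variable names in quotes; quoted, they are string literals rather than variables.
findall scans the whole string it is given, so it could be applied to the entire text at once; the code below still loops over the lines because the line numbers are needed.
Because you didn't provide sample input and output, I am just guessing what you want: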
import re
re_pattern_string = r'((?:inherit|INHERIT|Inherit)(\w*))'
originmain_list = []
re_pattern_object = re.compile(re_pattern_string)
print('Opening origin.txt')
with open('origin.txt', 'r') as in_stream:
    print('Opening origin_spp.txt')
    with open('origin_spp.txt', 'w') as out_stream:
        for num, line in enumerate(in_stream):
            temp_list = re_pattern_object.findall(line)
            for word in temp_list:
                out_stream.write(str(num) + '\t{0}\n'.format(word[0]))
                originmain_list.append((num, word[0]))
print("Done!")
print('origin.txt is closed?', in_stream.closed)
print('origin_spp.txt is closed?', out_stream.closed)
print(originmain_list)
if origin.txt contains:
inheritxxxxxxx some text INHERITccccc some text
Inheritzzzzzzzz some text
inherit some text INHERIT some text
Inherit some text
the output in the origin_spp.txt will be
0 inheritxxxxxxx
0 INHERITccccc
1 Inheritzzzzzzzz
2 inherit
2 INHERIT
3 Inherit
The command line output will be:
Opening origin.txt
Opening origin_spp.txt
Done!
origin.txt is closed? True
origin_spp.txt is closed? True
[(0, 'inheritxxxxxxx'), (0, 'INHERITccccc'), (1, 'Inheritzzzzzzzz'), (2, 'inherit'), (2, 'INHERIT'), (3, 'Inherit')]
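As a possible variation (a sketch, not the answer's method), the three spellings can be folded into one with re.IGNORECASE. This is slightly broader than the original pattern: it also accepts mixed case such as "iNherit", and the \b word boundary means "inherit" inside a longer word like "disinherit" is not reported. The function name find_words and the sample lines are hypothetical.

```python
import re

# one case-insensitive pattern instead of listing inherit|INHERIT|Inherit
pattern = re.compile(r'\binherit\w*', re.IGNORECASE)

def find_words(lines):
    """Return (line_number, word) pairs for every match in an iterable of lines."""
    return [
        (num, word)
        for num, line in enumerate(lines)
        for word in pattern.findall(line)
    ]

# made-up sample lines standing in for origin.txt
sample = [
    "inheritxxxxxxx some text INHERITccccc some text\n",
    "Inheritzzzzzzzz some text\n",
    "nothing to see here\n",
]
matches = find_words(sample)
print(matches)
```

With no capture groups in the pattern, findall returns plain strings, so there is no need to index into a tuple as in word[0].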

AttributeError: 'dict' object has no attribute 'append' on line 9?

Q.)8.4 Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order.
this code is giving AttributeError: 'dict' object has no attribute 'append' on line 9
fname = input("Enter file name: ")
fh = open(fname)
lst = {}
for line in fh:
    line = line.rstrip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
print(sorted(lst))
A Python dictionary has no append method.
append is a list method. Make lst a list, not a dictionary. I have made a minor change to your code below, changing
lst = {} #creation of an empty dictionary
to
lst = [] #creation of an empty list
The full code:
fname = input("Enter file name: ")
fh = open(fname)
lst = []
for line in fh:
    line = line.rstrip()
    words = line.split()
    for word in words:
        if word not in lst:
            lst.append(word)
print(sorted(lst))
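The exercise asks for a list, so the fix above is the direct answer. As an aside, here is a sketch of the same task with a set, which drops duplicates by itself and has an add/update API instead of append. The function name unique_sorted_words and the sample lines are made up for the illustration.

```python
def unique_sorted_words(lines):
    """Collect the distinct words from an iterable of lines and sort them."""
    seen = set()
    for line in lines:
        # update() adds every word of the line; duplicates are ignored
        seen.update(line.split())
    return sorted(seen)

# made-up sample standing in for the lines of romeo.txt
sample = [
    "But soft what light\n",
    "It is the east and Juliet is the sun\n",
]
words = unique_sorted_words(sample)
print(words)
```

Note that sorted() orders capitalized words before lowercase ones, because it compares the strings character by character by code point.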

rstrip, split and sort a list from input text file

I am new to Python. I am trying to rstrip whitespace, split each line into words, append the words to a list, and then sort them in alphabetical order. I don't know what I am doing wrong.
fname = input("Enter file name: ")
fh = open(fname)
lst = list(fh)
for line in lst:
    line = line.rstrip()
    y = line.split()
    i = lst.append()
    k = y.sort()
print y
I have been able to fix my code and the expected result output.
This is what I was hoping to code:
name = input('Enter file: ')
handle = open(name, 'r')
wordlist = list()
for line in handle:
    words = line.split()
    for word in words:
        if word in wordlist: continue
        wordlist.append(word)
wordlist.sort()
print(wordlist)
If you are using Python 2.7, you need raw_input(); in Python 3.x, input() is correct. Also, you are not using append() correctly: it is a list method and takes the item to add as its argument.
fname = raw_input("Enter filename: ") # Stores the filename given by the user input
fh = open(fname,"r") # Here we are adding 'r' as the file is opened as read mode
lines = fh.readlines() # This will create a list of the lines from the file
# Sort the lines alphabetically
lines.sort()
# Rstrip each line of the lines list
y = [l.rstrip() for l in lines]
# Print out the result
print y
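One more detail from the question's code that may be worth a quick illustration: k = y.sort() stores None, because list.sort() sorts the list in place and returns nothing, while sorted() returns a new list. A tiny sketch with a made-up sentence:

```python
words = "the quick brown fox".split()

# list.sort() reorders `words` itself and returns None
result = words.sort()
print(result)
print(words)

# sorted() leaves its argument alone and returns a new sorted list
ordered = sorted("the quick brown fox".split())
print(ordered)
```

So either call y.sort() on its own line and then use y, or write k = sorted(y).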

How can I perform multiple re.sub() on a file?

I am attempting to perform multiple regex alterations of a file but I'm not sure how to do this while retaining the previous alterations. I have found several ways to do this but I'm new to coding and couldn't get them to work in my code.
import re
import sys
if len(sys.argv) != 3:
    sys.exit('Error: One input and one output file is required')
fasta = open(sys.argv[1],'r')
output = open(sys.argv[2],'r+')
output1 = re.sub(r'^>\w+\|(\d+)\|.*LOXAF.*', r'>Loxodonta africana, \1, MW =',fasta)
output2 = re.sub(r'^>\w+\|(\d+)\|.*DUGDU.*', r'>Dendrohyrax dorsalis, \1, MW =',output1)
output3 = re.sub(r'(^[A-Z].*)\n', r'\1',output2)
print(output3)
Ideally, I would write all of the regex to the output file instead of just printing it. I put an example of changes I'd like to make below (I cut the number and length of sequences down for simplicity).
>gi|75074720|sp|Q9TA19.1|NU5M_LOXAF RecName: Full=NADH-ubiquinone oxidoreductase chain 5; AltName: Full=NADH dehydrogenase subunit 5
MKVINLIPTLMLTSLIILTLPIITTLLQNNKTNCFLYITKTAVTYAFAISLIPTLLFIQSNQEAYISNWH
WMTIHTLKLSMSFKLDFFSLTFMPIALFITWSIM
>gi|75068112|sp|Q9TA29.1|NU1M_LOXAF RecName: Full=NADH-ubiquinone oxidoreductase chain 1; AltName: Full=NADH dehydrogenase subunit 1
MFLINVLTVTLPILLAVAFLTLVERKALGYMQLRKGPNVVGPYGLLQPIADAIKLFTKEPIYPQTSSKFL
FTVAPILALTLALTVWAPLPMPYPLINLNLSL
>gi|24418335|sp|Q8W9N2.1|ATP8_DUGDU RecName: Full=ATP synthase protein 8; AltName: Full=A6L; AltName: Full=F-ATPase subunit 8
MPQLDTTTWFITILSMLITLFILFQTKLLNYTYPLNALPISPNVTNHLTPWKMKWTKTYLPLSLPLQ
Output:
>Loxodonta africana, 75074720, MW =
MKVINLIPTLMLTSLIILTLPIITTLLQNNKTNCFLYITKTAVTYAFAISLIPTLLFIQSNQEAYISNWHWMTIHTLKLSMSFKLDFFSLTFMPIALFITWSIM
>Loxodonta africana, 75068112, MW =
MFLINVLTVTLPILLAVAFLTLVERKALGYMQLRKGPNVVGPYGLLQPIADAIKLFTKEPIYPQTSSKFLFTVAPILALTLALTVWAPLPMPYPLINLNLSL
>Dendrohyrax dorsalis, 24418335, MW =
MPQLDTTTWFITILSMLITLFILFQTKLLNYTYPLNALPISPNVTNHLTPWKMKWTKTYLPLSLPLQ
Thanks for all of your help!
FASTA files can be very large, so it isn't a good idea to load the whole file into a variable. I suggest working line by line, which uses less memory.
A FASTA file has a defined format; it is not arbitrary text. Understanding and using that format lets you extract the information you want without three blind regex replacements.
Suggestion:
import re
import sys
from itertools import takewhile
if len(sys.argv) != 3:
    sys.exit('Error: One input and one output file is required')
# 'w' creates the output file if it does not exist ('r+' would fail in that case)
with open(sys.argv[1], 'r') as fi, open(sys.argv[2], 'w') as fo:
    species = {
        'LOXAF': 'Loxodonta africana',
        'DUGDU': 'Dendrohyrax dorsalis'
    }
    # header fields are separated by '|', '_' or a space
    sep = re.compile(r'[|_ ]')
    recF = ">{}, {}, MW =\n{}"
    def getSeq(f):
        # join the sequence lines up to the next blank line
        # (this assumes records are separated by a blank line)
        return ''.join([line.rstrip() for line in takewhile(lambda x: x!="\n", f)])
    for line in fi:
        if line.startswith('>'):
            parts = sep.split(line, 6)
            print(recF.format(species[parts[5]], parts[1], getSeq(fi)), file=fo)
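If the records are not separated by blank lines, the line-by-line idea can still be used by treating each '>' line as the start of a new record. The following is only a sketch of that variant; read_fasta is a hypothetical helper name and the sample lines are made up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith('>'):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line, []
        elif line:
            # non-empty, non-header lines are pieces of the current sequence
            chunks.append(line)
    # emit the last record once the input is exhausted
    if header is not None:
        yield header, ''.join(chunks)

# made-up sample lines standing in for an open FASTA file
sample = [">first\n", "MKVI\n", "NLIP\n", ">second\n", "MPQL\n"]
records = list(read_fasta(sample))
print(records)
```

Only one record is held in memory at a time, so this keeps the low memory use that motivated reading line by line.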
You can try something like this:
import re
import sys
if len(sys.argv) != 3:
    sys.exit('Error: One input and one output file is required')
else:
    fasta = open(sys.argv[1],'r')
    fasta_content = fasta.read()
    print(fasta)
    output = open(sys.argv[2],'w')
    output1 = re.sub(r'>\w+\|(\d+)\|.*LOXAF.*', r'>Loxodonta africana, \1, MW =',fasta_content)
    print(output1)
    output2 = re.sub(r'>\w+\|(\d+)\|.*DUGDU.*', r'>Dendrohyrax dorsalis, \1, MW =',output1)
    print(output2)
    # join wrapped sequence lines; the lookahead keeps the newline in front of the next header
    output3 = re.sub(r'([A-Z]+)\n(?=[A-Z])', r'\1',output2)
    print(output3)
    output.write(output3)
    output.close()
    fasta.close()
First of all, re.sub operates on text, not on a file object, so read() is needed.
To write to the output file you can use output.write(), but the file has to be opened with the 'w' option.
The regexes here didn't work because each one starts with the start-of-string anchor (^). With read() you get the whole text as a single string, so ^ matches only at the very beginning of the text unless you read line by line or pass the re.MULTILINE flag.
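One way to keep each alteration while applying the next is to feed the result of every re.sub into the following call, for example by looping over a list of (pattern, replacement) pairs. The sketch below keeps the ^ anchors from the question and passes re.MULTILINE so they match at every line start. The names SUBSTITUTIONS and apply_all are hypothetical, and the sample is a shortened, made-up version of the question's data.

```python
import re

# (pattern, replacement) pairs, applied in this order
SUBSTITUTIONS = [
    (r'^>\w+\|(\d+)\|.*LOXAF.*', r'>Loxodonta africana, \1, MW ='),
    (r'^>\w+\|(\d+)\|.*DUGDU.*', r'>Dendrohyrax dorsalis, \1, MW ='),
    # join wrapped sequence lines, but keep the newline before a header
    (r'^([A-Z]+)\n(?=[A-Z])', r'\1'),
]

def apply_all(text):
    """Run every substitution on the output of the previous one."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text

# shortened stand-in for the FASTA input
sample = (
    ">gi|75074720|sp|Q9TA19.1|NU5M_LOXAF RecName: Full=example\n"
    "MKVINL\n"
    "WMTIHT\n"
    ">gi|24418335|sp|Q8W9N2.1|ATP8_DUGDU RecName: Full=example\n"
    "MPQLDT\n"
)
result = apply_all(sample)
print(result)
```

Adding another species then only means adding one more pair to the list.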

Python exercise error

I am learning Python.
For code:
def main():
    fileName = raw_input("file name ")
    infile = open(fileName, "r")
    sm = 0.0
    ct = 0
    line = infile.readline()
    while line != "":
        sm = sm + eval(line)
        ct = ct + 1
        line = infile.readline()
    print "\nAverage is ", sm/ct

main()
it results in the following error:
Traceback (most recent call last):
  File "/home/sorin/avg6.py", line 13, in <module>
    main()
  File "/home/sorin/avg6.py", line 8, in main
    sm = sm + eval(line)
  File "<string>", line 1
    ^
SyntaxError: unexpected EOF while parsing
I don't understand why. Please help. Thank you.
The eval function expects the string you pass it to be a valid Python expression, but it got an empty string (or, given the while loop's condition, perhaps a string with only whitespace). That doesn't have a value, so it raises an error.
You may want to look in the data file to see if there are any blank lines (and remove them, if you can). Or you could modify the code to ignore invalid strings:
while line != "":
    try:
        sm = sm + eval(line)
        ct = ct + 1
    except SyntaxError:
        pass
    line = infile.readline()
You might also want to catch other kinds of errors, if you go that route.
Another option is to explicitly check for specific invalid strings that might come up (like just a bare newline):
while line != "":
    if line != "\n": # or maybe use "if line.strip()", to reject whitespace-only lines
        sm = sm + eval(line)
        ct = ct + 1
    line = infile.readline()
One final suggestion is to use a for loop on the file object rather than a while loop with calls to readline. This won't prevent the kinds of error you're getting, but it generally results in nicer looking code, which might be easier to debug.
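A minimal sketch of that for-loop version follows. It makes two changes of its own that are not in the code above: it swaps eval for float(), on the assumption that every non-blank line holds a single number, and it skips blank lines. The function name average is hypothetical, and io.StringIO stands in for the data file so that the example runs on its own under Python 3.

```python
import io

def average(lines):
    """Average the numbers found one per line, ignoring blank lines."""
    total = 0.0
    count = 0
    for line in lines:
        text = line.strip()
        if not text:
            # blank or whitespace-only line: nothing to add
            continue
        total += float(text)
        count += 1
    # None signals that there was nothing to average
    return total / count if count else None

# in-memory stand-in for the data file, with a blank line in the middle
sample = io.StringIO("1\n2\n\n3\n")
result = average(sample)
print(result)
```

float() raises ValueError on a line that is not a number, which is usually easier to diagnose than the SyntaxError coming out of eval.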
What does your input file look like? If your input file has a blank line, it would explain your error.