How can I perform multiple re.sub() on a file? - regex

I am attempting to perform multiple regex alterations of a file but I'm not sure how to do this while retaining the previous alterations. I have found several ways to do this but I'm new to coding and couldn't get them to work in my code.
import re
import sys
if len(sys.argv) != 3:
    sys.exit('Error: One input and one output file is required')
fasta = open(sys.argv[1],'r')
output = open(sys.argv[2],'r+')
output1 = re.sub(r'^>\w+\|(\d+)\|.*LOXAF.*', r'>Loxodonta africana, \1, MW =',fasta)
output2 = re.sub(r'^>\w+\|(\d+)\|.*DUGDU.*', r'>Dendrohyrax dorsalis, \1, MW =',output1)
output3 = re.sub(r'(^[A-Z].*)\n', r'\1',output2)
print(output3)
Ideally, I would write all of the regex to the output file instead of just printing it. I put an example of changes I'd like to make below (I cut the number and length of sequences down for simplicity).
>gi|75074720|sp|Q9TA19.1|NU5M_LOXAF RecName: Full=NADH-ubiquinone oxidoreductase chain 5; AltName: Full=NADH dehydrogenase subunit 5
MKVINLIPTLMLTSLIILTLPIITTLLQNNKTNCFLYITKTAVTYAFAISLIPTLLFIQSNQEAYISNWH
WMTIHTLKLSMSFKLDFFSLTFMPIALFITWSIM
>gi|75068112|sp|Q9TA29.1|NU1M_LOXAF RecName: Full=NADH-ubiquinone oxidoreductase chain 1; AltName: Full=NADH dehydrogenase subunit 1
MFLINVLTVTLPILLAVAFLTLVERKALGYMQLRKGPNVVGPYGLLQPIADAIKLFTKEPIYPQTSSKFL
FTVAPILALTLALTVWAPLPMPYPLINLNLSL
>gi|24418335|sp|Q8W9N2.1|ATP8_DUGDU RecName: Full=ATP synthase protein 8; AltName: Full=A6L; AltName: Full=F-ATPase subunit 8
MPQLDTTTWFITILSMLITLFILFQTKLLNYTYPLNALPISPNVTNHLTPWKMKWTKTYLPLSLPLQ
Output:
>Loxodonta africana, 75074720, MW =
MKVINLIPTLMLTSLIILTLPIITTLLQNNKTNCFLYITKTAVTYAFAISLIPTLLFIQSNQEAYISNWHWMTIHTLKLSMSFKLDFFSLTFMPIALFITWSIM
>Loxodonta africana, 75068112, MW =
MFLINVLTVTLPILLAVAFLTLVERKALGYMQLRKGPNVVGPYGLLQPIADAIKLFTKEPIYPQTSSKFLFTVAPILALTLALTVWAPLPMPYPLINLNLSL
>Dendrohyrax dorsalis, 24418335, MW =
MPQLDTTTWFITILSMLITLFILFQTKLLNYTYPLNALPISPNVTNHLTPWKMKWTKTYLPLSLPLQ
Thanks for all of your help!

FASTA files can be very large, so it isn't a good idea to load the whole file into a variable. I suggest working line by line, which uses less memory.
A FASTA file has a defined format; it is not arbitrary text. Understanding and using that format lets you extract the information you want without three blind regex replacements.
Suggestion (this assumes that records are separated by a blank line, which is where takewhile stops reading a sequence):
import re
import sys
from itertools import takewhile
if len(sys.argv) != 3:
    sys.exit('Error: One input and one output file is required')
# 'w' creates the output file if it does not exist yet ('r+' would fail)
with open(sys.argv[1], 'r') as fi, open(sys.argv[2], 'w') as fo:
    species = {
        'LOXAF': 'Loxodonta africana',
        'DUGDU': 'Dendrohyrax dorsalis'
    }
    sep = re.compile(r'[|_ ]')
    recF = ">{}, {}, MW =\n{}"
    def getSeq(f):
        # read sequence lines until the first empty line and join them
        return ''.join([line.rstrip() for line in takewhile(lambda x: x!="\n", f)])
    for line in fi:
        if line.startswith('>'):
            # parts[1] is the numeric id, parts[5] the species code
            parts = sep.split(line, 6)
            print(recF.format(species[parts[5]], parts[1], getSeq(fi)), file=fo)

You can try something like this:
import re
import sys
if len(sys.argv) != 3:
    sys.exit('Error: One input and one output file is required')
else:
    fasta = open(sys.argv[1],'r')
    fasta_content = fasta.read()
    print(fasta)
    output = open(sys.argv[2],'w')
    output1 = re.sub(r'>\w+\|(\d+)\|.*LOXAF.*', r'>Loxodonta africana, \1, MW =',fasta_content)
    print(output1)
    output2 = re.sub(r'>\w+\|(\d+)\|.*DUGDU.*', r'>Dendrohyrax dorsalis, \1, MW =',output1)
    print(output2)
    # the lookahead keeps the newline in front of the next '>' header
    output3 = re.sub(r'([A-Z]+)\n(?=[A-Z])', r'\1',output2)
    print(output3)
    output.write(output3)
    output.close()
    fasta.close()
First of all, you need to operate on the text, so read() is needed.
To write to the output file you can use output.write(), but the file has to be opened with the 'w' option.
Your regexes didn't work because each one starts with ^, which by default matches only at the very beginning of the string. With read() you get the whole text as a single string, so ^ has to be dropped, as above, or combined with the re.MULTILINE flag so that it matches at the start of every line.
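The point about ^ can also be handled with the re.MULTILINE flag, which makes ^ match at the start of every line instead of only at the start of the string. The sketch below chains the substitutions on text that has already been read with read(). The helper name rewrite_fasta and the shortened sample records are made up for illustration. Joining wrapped sequence lines with a lookbehind and a lookahead is an assumption about the desired output, based on the example in the question.

```python
import re

# Header rewrites: (pattern, replacement) pairs applied in order.
# re.MULTILINE lets ^ match at the start of every line of the text.
HEADER_RULES = [
    (r'^>\w+\|(\d+)\|.*LOXAF.*', r'>Loxodonta africana, \1, MW ='),
    (r'^>\w+\|(\d+)\|.*DUGDU.*', r'>Dendrohyrax dorsalis, \1, MW ='),
]

def rewrite_fasta(text):
    """Apply each substitution to the result of the previous one."""
    for pattern, replacement in HEADER_RULES:
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    # Join wrapped sequence lines: drop a newline only when both the
    # character before it and the character after it are capital letters.
    return re.sub(r'(?<=[A-Z])\n(?=[A-Z])', '', text)

# Shortened, made-up sample in the same layout as the question's input.
sample = (
    ">gi|75074720|sp|Q9TA19.1|NU5M_LOXAF RecName: Full=NADH\n"
    "MKVINLIPTLM\n"
    "WMTIHTLKLSM\n"
    ">gi|24418335|sp|Q8W9N2.1|ATP8_DUGDU RecName: Full=ATP synthase\n"
    "MPQLDTTTWFI\n"
)
result = rewrite_fasta(sample)
print(result)
```

To write the result instead of printing it, the string can be passed to write() on a file opened with 'w', as in the answer above.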

Related

To find some words in a text file using regex and later print them in a different text file

I need to find some words such as inherited, INHERITANCE, Inheritable, etc., using regex, in a text file (origin.txt), and then I want to print them, together with the line where they were found, in a new text file (origin_spp.txt).
This is my code
re_pattern_string = r'(?:inherit|INHERIT|Inherit)*\w'
print('Opening origin.txt')
with open('origin.txt', 'r') as in_stream:
    print('Opening origin_spp.txt')
    with open('origin_spp.txt', 'w') as out_stream:
        for num, line in enumerate (in_stream):
            re_pattern_object = re.compile(re_pattern_string)
            line = line.strip()
            inherit_list = line.split()
            temp_list = re_pattern_object.findall('line')
            complete = origin_list.append('temp_list')
            for word in temp_list:
                out_stream.write(str(num) + '\t{0}\n'.format(word))
print("Done!")
print('origin.txt is closed?', in_stream.closed)
print('origin_spp.txt is closed?', out_stream.closed)
if __name__ == '__main__':
    print(temp_list)
Can you help me, please? I am not getting anything and I do not know where is the error.
Thank you in advance
I need to print the words that I want to find in the origin.txt in a different text file.
This new file must contain the number of the line in the origin.txt plus the word/s.
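Your code had some problems:
Calling re.compile inside the for loop is redundant; compile the pattern once, before the loop.
In re_pattern_object.findall('line') and origin_list.append('temp_list'), don't wrap the variable names in quotes: 'line' is a string literal, not the variable line.
findall works on a whole text, so you don't have to iterate over lines just to use it (the code below still iterates because it needs the line numbers).
Because you didn't provide sample input and output, I am guessing at what you want: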
import re
re_pattern_string = r'((?:inherit|INHERIT|Inherit)(\w*))'
originmain_list = []
re_pattern_object = re.compile(re_pattern_string)
print('Opening origin.txt')
with open('origin.txt', 'r') as in_stream:
    print('Opening origin_spp.txt')
    with open('origin_spp.txt', 'w') as out_stream:
        for num, line in enumerate(in_stream):
            temp_list = re_pattern_object.findall(line)
            for word in temp_list:
                out_stream.write(str(num) + '\t{0}\n'.format(word[0]))
                originmain_list.append((num, word[0]))
print("Done!")
print('origin.txt is closed?', in_stream.closed)
print('origin_spp.txt is closed?', out_stream.closed)
print(originmain_list)
if origin.txt contains:
inheritxxxxxxx some text INHERITccccc some text
Inheritzzzzzzzz some text
inherit some text INHERIT some text
Inherit some text
the output in the origin_spp.txt will be
0 inheritxxxxxxx
0 INHERITccccc
1 Inheritzzzzzzzz
2 inherit
2 INHERIT
3 Inherit
The command line output will be:
Opening origin.txt
Opening origin_spp.txt
Done!
origin.txt is closed? True
origin_spp.txt is closed? True
[(0, 'inheritxxxxxxx'), (0, 'INHERITccccc'), (1, 'Inheritzzzzzzzz'), (2, 'inherit'), (2, 'INHERIT'), (3, 'Inherit')]
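If the three spellings in the pattern are meant to cover any capitalisation, one possible variant is to use re.IGNORECASE, and to start enumerate at 1 if the line numbers should match what a text editor shows. Both choices are assumptions about what is wanted. The \b anchor is also an addition: it restricts matches to words that begin with "inherit". The function name find_inherit_words is made up for this sketch, and it accepts any iterable of lines so that it can be tried without the files.

```python
import re

# \b keeps the match at the start of a word; \w* takes the rest of the word.
INHERIT_RE = re.compile(r'\binherit\w*', re.IGNORECASE)

def find_inherit_words(lines):
    """Return (line_number, word) pairs, counting lines from 1."""
    found = []
    for num, line in enumerate(lines, start=1):
        for word in INHERIT_RE.findall(line):
            found.append((num, word))
    return found

# Made-up sample lines standing in for the contents of origin.txt.
sample_lines = [
    "inheritxxxxxxx some text INHERITccccc some text\n",
    "Inheritzzzzzzzz some text\n",
    "no match on this line\n",
    "iNhErItance some text\n",
]
pairs = find_inherit_words(sample_lines)
print(pairs)
```

An open file object can be passed directly, for example find_inherit_words(in_stream) inside the with block.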

How can I extract information in the lines between two headers?

I am new to python and am attempting to use this currently nonfunctioning code to extract information between two headers from a text file.
with open('toysystem.txt','r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    i = 0
    lines = f.readlines()
    for line in lines:
        if line == start:
            keywords = lines[i+1]
        i += 1
For reference, the text file looks like this:
<Keywords>
GTO
</Keywords>
Any ideas on what might be wrong with the code? Or perhaps a different way to approach this problem?
Thank you!
Lines read from a file keep the newline character at the end, so we should strip it before comparing.
The file object f is an iterator, so we don't need to call its readlines method here.
So we can write something like
with open('toysystem.txt', 'r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    keywords = []
    for line in f:
        if line.rstrip() == start:
            break
    for line in f:
        if line.rstrip() == end:
            break
        keywords.append(line)
gives us
>>> keywords
['GTO\n']
If you don't need newline at the end of keywords as well – strip them too:
with open('toysystem.txt', 'r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    keywords = []
    for line in f:
        if line.rstrip() == start:
            break
    for line in f:
        if line.rstrip() == end:
            break
        keywords.append(line.rstrip())
gives
>>> keywords
['GTO']
But in this case it will be better to create stripped lines generator like
with open('toysystem.txt', 'r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    keywords = []
    stripped_lines = (line.rstrip() for line in f)
    for line in stripped_lines:
        if line == start:
            break
    for line in stripped_lines:
        if line == end:
            break
        keywords.append(line)
which does the same.
Finally, if you need the lines in later parts of the script, we can combine the file's readlines method with the stripped-lines generator:
with open('toysystem.txt', 'r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    keywords = []
    lines = f.readlines()
    stripped_lines = (line.rstrip() for line in lines)
    for line in stripped_lines:
        if line == start:
            break
    for line in stripped_lines:
        if line == end:
            break
        keywords.append(line)
gives us
>>> lines
['<Keywords>\n', 'GTO\n', '</Keywords>\n']
>>> keywords
['GTO']
Further reading
file objects,
iterators (including file iterators),
list comprehension,
generator expression
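The two-loop pattern above can also be wrapped in a small function, which makes it reusable for other tag pairs. This is only a sketch: the name lines_between is made up, and it assumes the markers sit on lines of their own, as in the sample file. io.StringIO stands in for the opened file so that the example runs without toysystem.txt (Python 3).

```python
import io

def lines_between(lines, start, end):
    """Collect stripped lines found after `start` and before `end`."""
    stripped = (line.strip() for line in lines)
    # Consume the iterator up to and including the start marker.
    for line in stripped:
        if line == start:
            break
    collected = []
    # Continue on the same iterator, so reading resumes after the marker.
    for line in stripped:
        if line == end:
            break
        collected.append(line)
    return collected

# io.StringIO behaves like a file opened for reading in text mode.
fake_file = io.StringIO("<Keywords>\nGTO\n</Keywords>\n")
keywords = lines_between(fake_file, '<Keywords>', '</Keywords>')
print(keywords)
```

If the start marker never appears, the first loop exhausts the iterator and the function returns an empty list.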
Alternatively, use Python's re module and parse the file with a regular expression:
import re
with open('toysystem.txt','r') as f:
    contents = f.read()
# Finds every <Keywords>...</Keywords> pair in the file and returns a list of the text captured by the (). You can extend the expression according to your needs.
# re.IGNORECASE lets the lowercase pattern match <Keywords>; re.DOTALL lets the value span several lines.
keywords = re.findall(r'<keywords>\s*(.*?)\s*</keywords>', contents, re.IGNORECASE | re.DOTALL)
print(keywords)
For your file it will print
['GTO']
For more about regular expressions in Python, see the Tutorialspoint tutorials for Python 3 and Python 2.

Extracting data using regular expressions: Python

The basic outline of this problem is to read the file, look for integers using the re.findall(), looking for a regular expression of [0-9]+ and then converting the extracted strings to integers and summing up the integers.
I am finding trouble in appending the list. From my below code, it is just appending the first(0) index of the line. Please help me. Thank you.
import re
hand = open ('a.txt')
lst = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('[0-9]+', line)
    if len(stuff)!= 1 : continue
    num = int (stuff[0])
    lst.append(num)
print sum(lst)
import re
ls = []
text = open('C:/Users/pvkpu/Desktop/py4e/file1.txt')
for line in text:
    line = line.rstrip()
    l = re.findall('[0-9]+', line)
    if len(l) == 0:
        continue
    ls += l
for i in range(len(ls)):
    ls[i] = int(ls[i])
print(sum(ls))
Great, thank you for including the whole txt file! Your main problem was the if len(stuff)... line: it skipped a line not only when stuff was empty but also when it held 2, 3, or more matches, so you were only keeping stuff lists of length 1. I put comments in the code, but please ask if something is unclear.
import re
hand = open ('a.txt')
str_num_lst = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('[0-9]+', line)
    #If we didn't find anything on this line then continue
    if len(stuff) == 0: continue
    #if len(stuff)!= 1: continue #<-- This line was wrong as it skips lists with more than 1 element
    #If we did find something, stuff will be a list of strings:
    #(i.e. stuff = ['9607', '4292', '4498'] or stuff = ['4563'])
    #For now let's just add this list onto our str_num_lst
    #without worrying about converting to int.
    #We use '+=' instead of 'append' since both stuff and str_num_lst are lists
    str_num_lst += stuff
#Print out the str_num_lst to check if everything's ok
print(str_num_lst)
#Get an overall sum by looping over the string numbers in the str_num_lst
#Can convert to int inside the loop
overall_sum = 0
for str_num in str_num_lst:
    overall_sum += int(str_num)
#Print sum
print('Overall sum is:')
print(overall_sum)
EDIT:
You are right, reading in the entire file as one line is a good solution, and it's not difficult to do. Check out this post. Here is what the code could look like.
import re
hand = open('a.txt')
all_lines = hand.read() #Reads in all lines as one long string
all_str_nums_as_one_line = re.findall('[0-9]+',all_lines)
hand.close() #<-- can close the file now since we've read it in
#Go through all the matches to get a total
tot = 0
for str_num in all_str_nums_as_one_line:
    tot += int(str_num)
print('Overall sum is:',tot) #editing to add ()
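The loop that accumulates the total can be folded into a single call to sum() with a generator expression. A minimal sketch, with a made-up helper name and an inline sample string instead of a.txt:

```python
import re

def sum_of_integers(text):
    """Add up every run of digits found in the text."""
    return sum(int(number) for number in re.findall(r'[0-9]+', text))

# Made-up sample text; in the real script this would come from hand.read().
sample = (
    "Why should you learn 9607 to write programs?\n"
    "4292 and 4498 follow\n"
    "no digits here\n"
)
total = sum_of_integers(sample)
print(total)  # 9607 + 4292 + 4498 = 18397
```

Because findall returns an empty list when nothing matches, text without digits simply gives a sum of 0 and needs no special case.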

Parse text between multiple lines - Python 2.7 and re Module

I have a text file i want to parse. The file has multiple items I want to extract. I want to capture everything in between a colon ":" and a particular word. Let's take the following example.
Description : a pair of shorts
amount : 13 dollars
requirements : must be blue
ID1 : 199658
----
The following code parses the information out.
import re
f = open ("parse.txt", "rb")
fileRead = f.read()
Description = re.findall("Description :(.*?)amount", fileRead, re.DOTALL)
amount = re.findall("amount :(.*?)requirements", fileRead, re.DOTALL)
requirements = re.findall("requirements :(.*?)ID1", fileRead, re.DOTALL)
ID1 = re.findall("ID1 :(.*?)-", fileRead, re.DOTALL)
print Description[0]
print amount[0]
print requirements[0]
print ID1[0]
f.close()
The problem is that sometimes the text file will have a new line such as this
Description
: a pair of shorts
amount
: 13 dollars
requirements: must be blue
ID1: 199658
----
In this case my code will not work because it is unable to find "Description :" because it is now separated into a new line. If I choose to change the search to ":(.*?)requirements" it will not return just the 13 dollars, it will return a pair of shorts and 13 dollars because all of that text is in between the first colon and the word, requirements. I want to have a way of parsing out the information no matter if there is a line break or not. I have hit a road block and your help would be greatly appreciated.
You can use a regex like this:
Description[^:]*(.*)
^--- use the keyword you want
Working demo
Quoting your code you could use:
import re
f = open ("parse.txt", "rb")
fileRead = f.read()
Description = re.findall("Description[^:]*(.*)", fileRead)
amount = re.findall("amount[^:]*(.*)", fileRead)
requirements = re.findall("requirements[^:]*(.*)", fileRead)
ID1 = re.findall("ID1[^:]*(.*)", fileRead)
print Description[0]
print amount[0]
print requirements[0]
print ID1[0]
f.close()
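One detail to be aware of: in Description[^:]*(.*) the group starts at the colon itself, because [^:]* stops just before it, so the captured text begins with ':'. A possible variant, sketched here with a hypothetical helper named field_value, consumes the colon and the padding after it before the group starts. It assumes that a value never continues onto a second line and that the keyword does not also occur earlier inside another value.

```python
import re

def field_value(keyword, text):
    """Return the text after `keyword` and its colon, or None if absent."""
    # [^:]* skips spaces or a line break between the keyword and the colon;
    # [ \t]* drops the padding after the colon; (.*) takes the rest of the line.
    match = re.search(re.escape(keyword) + r'[^:]*:[ \t]*(.*)', text)
    return match.group(1).strip() if match else None

# The second layout from the question, with line breaks before some colons.
sample = (
    "Description\n"
    ": a pair of shorts\n"
    "amount\n"
    ": 13 dollars\n"
    "requirements: must be blue\n"
    "ID1: 199658\n"
    "----\n"
)
for key in ("Description", "amount", "requirements", "ID1"):
    print(key, "->", field_value(key, sample))
```

The same function handles the first layout as well, since [^:]* also matches the single space in "Description : a pair of shorts".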
You can simply do this:
import re
f = open ("new.txt", "rb")
fileRead = f.read()
keyvals = {k.strip():v.strip() for k,v in dict(re.findall('([^:]*):(.*)(?=\b[^:]*:|$)',fileRead,re.M)).iteritems()}
print(keyvals)
f.close()
Output:
{'amount': '13 dollars', 'requirements': 'must be blue', 'Description': 'a pair of shorts', 'ID1': '199658'}

regex for detecting subtitle errors

I'm having some issues with subtitles and need a way to detect specific errors. I think regular expressions would help, but I need help figuring this one out. In this example of an SRT-formatted subtitle, line #13 ends at 00:01:10,130 and line #14 begins at 00:01:10,129.
13
00:01:05,549 --> 00:01:10,130
some text here.
14
00:01:10,129 --> 00:01:14,109
some other text here.
The problem is that the next line can't begin before the current one is over; the embedding algorithm doesn't work when that happens. I need to check my SRT files and correct this, but looking for it manually in about 20 videos, each an hour long, just isn't an option. Especially since I need it 'yesterday' (:
Format for SRT subtitles is very specific:
XX
START --> END
TEXT
EMPTY LINE
[line number (digits)][new line character]
[start and end times in 00:00:00,000 format, separated by _space__minusSign__minusSign__greaterThanSign__space_][new line character]
[text - can be any character - letter, digit, punctuation sign.. pretty much anything][new line character]
[new line character]
I need to check whether the END time of a subtitle is greater than the START time of the following subtitle. Help would be appreciated.
PS. I can work with Notepad++, Eclipse (Aptana), python or javascript...
Regular expressions can help with what you want, but they can't do it on their own. Regular expressions match patterns, not numerical ranges.
If I were you, I would do the following:
Parse the file and place the start and end times in one data structure (call it DS_A) and the text in another (call it DS_B).
Sort DS_A in ascending order. This should guarantee that you will not have overlapping ranges. (This previous SO post should point you in the right direction).
Iterate over both and write the following to your file: j <newline> DS_A[i] --> DS_A[i + 1] <newline> DS_B[j], where i is a loop counter for DS_A and j is a loop counter for DS_B.
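For the detection part of the question, a regular expression can pull out the timestamps and ordinary integer comparison can do the rest. The sketch below is one possible approach, not the method described in the steps above: it converts each time to milliseconds and reports every timing line that starts before the previous subtitle has ended. The function names are made up, and it assumes well-formed HH:MM:SS,mmm timestamps.

```python
import re

# One timing line, e.g. "00:01:05,549 --> 00:01:10,130"
TIMING_RE = re.compile(
    r'(\d+):(\d{2}):(\d{2}),(\d{3}) --> (\d+):(\d{2}):(\d{2}),(\d{3})'
)

def to_ms(hours, minutes, seconds, millis):
    """Convert the four timestamp fields (strings) to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)

def find_overlaps(srt_text):
    """Return the timing lines whose start precedes the previous end."""
    overlaps = []
    previous_end = 0
    for match in TIMING_RE.finditer(srt_text):
        fields = match.groups()
        start, end = to_ms(*fields[:4]), to_ms(*fields[4:])
        if start < previous_end:
            overlaps.append(match.group(0))
        previous_end = end
    return overlaps

# The two subtitles from the question plus one made-up, correct subtitle.
sample = (
    "13\n00:01:05,549 --> 00:01:10,130\nsome text here.\n\n"
    "14\n00:01:10,129 --> 00:01:14,109\nsome other text here.\n\n"
    "15\n00:01:14,109 --> 00:01:18,000\nmore text.\n"
)
print(find_overlaps(sample))
```

Working in plain milliseconds avoids any dependence on date or time types, so the hour field can be of any size.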
I ended up writing a short script to fix this. Here it is:
# -*- coding: utf-8 -*-
from datetime import datetime
import getopt, re, sys
count = 0  # number of start times that had to be corrected
def fix_srt(inputfile):
    global count
    parsed_file, errors_file = '', ''
    try:
        with open( inputfile , 'r') as f:
            srt_file = f.read()
            parsed_file, errors_file = parse_srt(srt_file)
    except:
        pass
    finally:
        # '.'.join keeps any dots that are part of the file name itself
        outputfile1 = '.'.join( inputfile.split('.')[:-1] ) + '_fixed.srt'
        outputfile2 = '.'.join( inputfile.split('.')[:-1] ) + '_error.srt'
        with open( outputfile1 , 'w') as f:
            f.write(parsed_file)
        with open( outputfile2 , 'w') as f:
            f.write(errors_file)
        print('Detected %s errors in "%s". Fixed file saved as "%s" (Errors only as "%s").' % ( count, inputfile, outputfile1, outputfile2 ))
previous_end_time = datetime.strptime("00:00:00,000", "%H:%M:%S,%f")
def parse_times(times):
    global previous_end_time
    global count
    _error = False
    _times = []
    for time_code in times:
        t = datetime.strptime(time_code, "%H:%M:%S,%f")
        _times.append(t)
    # a subtitle must not start before the previous one has ended
    if _times[0] < previous_end_time:
        _times[0] = previous_end_time
        count += 1
        _error = True
    previous_end_time = _times[1]
    # %f prints microseconds; [:12] cuts the string back to milliseconds
    _times[0] = _times[0].strftime("%H:%M:%S,%f")[:12]
    _times[1] = _times[1].strftime("%H:%M:%S,%f")[:12]
    return _times, _error
def parse_srt(srt_file):
    parsed_srt = []
    parsed_err = []
    for srt_group in re.sub('\r\n', '\n', srt_file).split('\n\n'):
        lines = srt_group.split('\n')
        if len(lines) >= 3:
            times = lines[1].split(' --> ')
            correct_times, error = parse_times(times)
            if error:
                clean_text = map( lambda x: x.strip(' '), lines[2:] )
                srt_group = lines[0].strip(' ') + '\n' + ' --> '.join( correct_times ) + '\n' + '\n'.join( clean_text )
                parsed_err.append( srt_group )
            parsed_srt.append( srt_group )
    return '\r\n'.join( parsed_srt ), '\r\n'.join( parsed_err )
def main(argv):
    inputfile = None
    try:
        options, arguments = getopt.getopt(argv, "hi:", ["input="])
    except getopt.GetoptError:
        print('Usage: test.py -i <input file>')
        sys.exit(2)  # without this, 'options' would be undefined below
    for o, a in options:
        if o == '-h':
            print('Usage: test.py -i <input file>')
            sys.exit()
        elif o in ['-i', '--input']:
            inputfile = a
    if inputfile is None:
        print('Usage: test.py -i <input file>')
        sys.exit(2)
    fix_srt(inputfile)
if __name__ == '__main__':
    main( sys.argv[1:] )
If someone needs it, save the code as srtfix.py, for example, and use it from the command line:
python srtfix.py -i "my srt subtitle.srt"
I was lazy and used the datetime module to process the timecodes, so I'm not sure the script will work for subtitles longer than 24h (: I'm also not sure when milliseconds were added to Python's datetime module. I'm using version 2.7.5; it's possible the script won't work on earlier versions because of this.