I've got a simple routine that never quits
import sys
infile = sys.argv[1]
outfile = sys.argv[2]
count=1
print 'Input file is ', infile
print 'Output file is ', outfile
instream = open(infile,'r')
while True:
    line=instream.readline()
    if line[0:5]=='<?xml':
        print 'new record', count
        count=count+1
    if line == "eof":
        print 'end'
        break
This reads the input file... but never ends. What do I need to do?
readline() doesn't return the string "eof" to signal the end of the file; it returns the empty string. (Note that readline() keeps the newline that terminates each line, so a blank line comes back as '\n'; a truly empty string can therefore only mean that no data is left.)
You could write
while True:
    line = instream.readline()
    # ...
    if line == "":
        break
but usually you simply treat the file as an iterator:
for line in instream:
    if line[0:5] == '<?xml':
        print 'new record', count
        count = count + 1
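To see the end-of-file behaviour in isolation, here is a minimal Python 3 sketch. It uses io.StringIO as an in-memory stand-in for a real file, and count_records is a hypothetical helper name chosen for illustration; neither comes from the question.

```python
import io

def count_records(stream):
    """Count lines that start with '<?xml' by iterating over the stream."""
    count = 0
    for line in stream:  # iteration stops by itself at end of file
        if line.startswith('<?xml'):
            count += 1
    return count

# In-memory stand-in for a file: two records separated by a blank line.
sample = '<?xml version="1.0"?>\n\n<?xml version="1.0"?>\n'

stream = io.StringIO(sample)
first = stream.readline()   # the first record, newline included
blank = stream.readline()   # a blank line comes back as '\n'
stream.readline()           # the second record
at_eof = stream.readline()  # nothing left, so this is ''

print(repr(blank), repr(at_eof))
print(count_records(io.StringIO(sample)))
```

With a real file you would pass the object returned by open() instead of the StringIO object.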
I have two text files, file1 and file2. I am trying to compare them line by line and print or write to a third file only the content that does not match.
I have tried difflib.unified_diff, but its output contains a lot of unnecessary information. The requirement is just to print the text of file1 that is not in file2.
The following is the code for my attempt.
import difflib

def file_byline_comp(f1,f2,f3):
    # Read the first line from the files
    file1= open(f1)
    file2= open(f2)
    result_output_file= open(f3,'w')
    file1_line = file1.readline()
    file2_line = file2.readline()
    # Initialise counter for line number
    line_no = 1
    # Loop if either file1 or file2 has not reached EOF
    while file1_line != '' or file2_line != '':
        # Strip the trailing whitespace (including the newline)
        file1_line = file1_line.rstrip()
        file2_line = file2_line.rstrip()
        # Compare the lines from both file
        if file1_line != file2_line:
            if file2_line == '' and file1_line != '':
                # print("Line-%d" % line_no, file1_line)
                print("Line-%d" % line_no)
                print difflib.unified_diff(file1_line, file2_line,fromfile='f1', tofile='f2',lineterm='')
                result_output_file.write("Line-%d " % (line_no))
                result_output_file.write(file1_line)
            # otherwise output the line on file1
            elif file1_line != '':
                #print("Line-%d" % line_no, file1_line)
                print("Line-%d" % line_no)
                for line in difflib.unified_diff(file1_line, file2_line,fromfile='f1', tofile='f2',lineterm=''):
                    print line
                result_output_file.write("Line-%d " % (line_no))
                result_output_file.write(file1_line)
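If the goal is only the text of file1 that differs from the same-numbered line of file2, difflib may not be needed at all. The following is a minimal Python 3 sketch under that assumption; differing_lines and write_differences are hypothetical names, and the "Line-N text" output format simply mirrors the attempt above.

```python
from itertools import zip_longest

def differing_lines(lines1, lines2):
    """Return (line_no, text) for each line of lines1 that differs from lines2."""
    result = []
    # zip_longest pads the shorter input with '' so both are read to the end.
    pairs = zip_longest(lines1, lines2, fillvalue='')
    for line_no, (a, b) in enumerate(pairs, start=1):
        a, b = a.rstrip(), b.rstrip()  # drop trailing whitespace and newline
        if a != b and a != '':
            result.append((line_no, a))
    return result

def write_differences(path1, path2, out_path):
    """Write the differing lines of path1 to out_path, one per line."""
    with open(path1) as f1, open(path2) as f2, open(out_path, 'w') as out:
        for line_no, text in differing_lines(f1, f2):
            out.write("Line-%d %s\n" % (line_no, text))

print(differing_lines(['a\n', 'b\n', 'c\n'], ['a\n', 'x\n']))
```

This compares by position only; if lines can be inserted or reordered between the two files, a set-based or difflib-based comparison would be a better fit.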
I am stuck on an exercise from a Coursera Python course. This is the question:
"Open the file mbox-short.txt and read it line by line. When you find a line that starts with 'From ' like the following line:
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
You will parse the From line using split() and print out the second word in the line (i.e. the entire address of the person who sent the message). Then print out a count at the end.
Hint: make sure not to include the lines that start with 'From:'.
You can download the sample data at http://www.pythonlearn.com/code/mbox-short.txt"
Here is my code:
fname = raw_input("Enter file name: ")
if len(fname) < 1 : fname = "mbox-short.txt"
fh = open(fname)
count = 0
for line in fh:
    words = line.split()
    if len(words) > 2 and words[0] == 'From':
        print words[1]
        count = count + 1
    else:
        continue
print "There were", count, "lines in the file with From as the first word"
The output should be a list of email addresses followed by their count, but it doesn't work and I don't know why. The actual output is "There were 0 lines in the file with From as the first word".
I ran your code against the file downloaded from the link, and I get this output:
There were 27 lines in the file with From as the first word
Have you checked that the downloaded file is in the same location as the code file?
fname = input("Enter file name: ")
counter = 0
fh = open(fname)
for line in fh :
    line = line.rstrip()
    if not line.startswith('From '): continue
    words = line.split()
    print (words[1])
    counter +=1
print ("There were", counter, "lines in the file with From as the first word")
fname = input("Enter file name: ")
fh = open(fname)
count = 0
for line in fh :
    if line.startswith('From '): # consider the lines which start from the word "From "
        y=line.split() # we split the line into words and store it in a list
        print(y[1]) # print the word present at index 1
        count=count+1 # increment the count variable
print("There were", count, "lines in the file with From as the first word")
I have commented every line in case anyone has difficulty; feel free to contact me if you need help. I think this is about the simplest way to write it. Hope you benefit from my answer.
fname = input('Enter the file name:')
fh = open(fname)
count = 0
for line in fh:
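    if line.startswith('From '): # the trailing space skips the 'From:' lines
        linesplit =line.split()
        print(linesplit[1])
        count = count +1
print("There were", count, "lines in the file with From as the first word")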
fname = input("Enter file name: ")
if len(fname) < 1 : fname = "mbox-short.txt"
fh = open(fname)
count = 0
for i in fh:
    i=i.rstrip()
    if not i.startswith('From '): continue
    word=i.split()
    count=count+1
    print(word[1])
print("There were", count, "lines in the file with From as the first word")
fname = input("Enter file name: ")
if len(fname) < 1 : fname = "mbox-short.txt"
fh = open(fname)
count = 0
for line in fh:
    if line.startswith('From '): # the trailing space skips the 'From:' lines
        line=line.rstrip()
        lt=line.split()
        if len(lt)>=2:
            print(lt[1])
            count=count+1
print("There were", count, "lines in the file with From as the first word")
My code looks like this and works like a charm:
fname = input("Enter file name: ")
if len(fname) < 1:
    fname = "mbox-short.txt"
fh = open(fname)
count = 0 #initialize the counter to 0 for the start
for line in fh: #iterate the document line by line
    words = line.split() #split the lines in words
    if not len(words) < 2 and words[0] == "From": #check for lines whose first word is "From" and that have at least 2 words
        print(words[1]) #print the words on position 1 from the list
        count += 1 # count
    else:
        continue
print("There were", count, "lines in the file with From as the first word")
It is a nice exercise from Dr. Chuck's course.
There is also another way: you can store the found words in an initially empty list and then print the length of the list. It gives the same result.
My tested code is as follows:
fname = input("Enter file name: ")
if len(fname) < 1:
    fname = "mbox-short.txt"
fh = open(fname)
newl = list()
for line in fh:
    words = line.split()
    if not len(words) < 2 and words[0] == 'From':
        newl.append(words[1])
    else:
        continue
print(*newl, sep = "\n")
print("There were", len(newl), "lines in the file with From as the first word")
I passed the exercise with it as well. Enjoy, and keep up the good work. Python is so much fun for me, even though I always hated programming.
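For comparison, the same logic can be wrapped in a function that takes any iterable of lines, which makes it easy to try without the data file. This is only a sketch: from_addresses is a hypothetical name, and the sample lines use made-up example.com addresses rather than real data from mbox-short.txt.

```python
def from_addresses(lines):
    """Return the sender address from every line that starts with 'From '."""
    addresses = []
    for line in lines:
        # The trailing space excludes header lines that start with 'From:'.
        if line.startswith('From '):
            words = line.split()
            if len(words) > 1:  # guard against a bare 'From ' line
                addresses.append(words[1])
    return addresses

# Made-up sample lines in the same shape as the mbox file.
sample = [
    "From alice@example.com Sat Jan  5 09:14:16 2008\n",
    "From: alice@example.com\n",
    "Subject: hello\n",
]
found = from_addresses(sample)
print(*found, sep="\n")
print("There were", len(found), "lines in the file with From as the first word")
```

To run it on the real file, pass the open file object, for example inside a with open("mbox-short.txt") as fh: block.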
I have an input file like:
input.txt:
to
the
cow
eliphant
pigen
then
enthosiastic
I want to remove the words whose length is <= 4 characters, and if a word has more than 8 characters, write only its first 8 characters to the new file.
The output should be like:
output.txt:
eliphant
pigen
enthosia
This is my code:
f2 = open('output.txt', 'w+')
x2 = open('input.txt', 'r').readlines()
for y in x2:
    if (len(y) <= 4):
        y = y.replace(y, '')
        f2.write(y)
    elif (len(y) > 8):
        y = y[0:8]
        f2.write(y)
    else:
        f2.write(y)
f2.close()
print "Done!"
When I run it, it gives output like:
eliphantpigen
then
enthosia
It also writes the 4-character word. I don't understand what the problem is, or how to write the code to limit the length of the words in a text file.
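Use with when working with files; it guarantees that the file will be closed.
You have "then" in your results because you are reading lines, not words.
Each line ends with the newline character '\n', so when you read the word you actually get the string
'then\n', and the length of this string is 5. For the same reason 'eliphant\n' has length 9, so it is cut to 'eliphant' and loses its newline, which is why it runs into the next word.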
with open('output.txt', 'w+') as ofp, open('input.txt', 'r') as ifp:
    for line in ifp:
        line = line.strip()
        if len(line) > 8:
            line = line[:8]
        elif len(line) <= 4:
            continue
        ofp.write(line + '\n')
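The effect of the trailing newline on len() is easy to check on its own. The sketch below uses the words from the question as an in-memory list instead of a file; filter_words and its min_len and max_len parameters are hypothetical names introduced for illustration.

```python
def filter_words(lines, min_len=5, max_len=8):
    """Drop words shorter than min_len and truncate the rest to max_len."""
    kept = []
    for line in lines:
        word = line.strip()  # remove the trailing '\n' before measuring
        if len(word) < min_len:
            continue
        kept.append(word[:max_len])
    return kept

words = ["to\n", "the\n", "cow\n", "eliphant\n", "pigen\n", "then\n", "enthosiastic\n"]

# The newline counts as a character until it is stripped.
print(len("then\n"), len("then\n".strip()))
print(filter_words(words))
```

The first print shows 5 and 4, and the second shows the three words expected in output.txt.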
I have a text file with some site links. I want to remove the part of each string that comes before the site name.
Here is the input file:
input.txt:
http://www.site1.com/
http://site2.com/
https://www.site3333.co.uk/
site44.com/
http://www.site5.com/
site66.com/
The output file should look like:
site1.com/
site2.com/
site3333.co.uk/
site44.com/
site5.com/
site66.com/
Here is my code:
bad_words = ['https://', 'http://', 'www.']
with open('input.txt') as oldfile, open('output.txt', 'w') as newfile:
    for line in oldfile:
        if not any(bad_word in line for bad_word in bad_words):
            newfile.write(line)
print './ done'
When I run this code, it completely removes the lines that contain any of the bad_words, leaving only:
site44.com/
site66.com/
What should I change in the code to get the result I want?
Thanks all, I have solved this. The code should be:
fin = open("input.txt")
fout = open("output.txt", "w+")
delete_list = ['https://', 'http://', 'www.']
for line in fin:
    for word in delete_list:
        line = line.replace(word, "")
    fout.write(line)
fin.close()
fout.close()
print './ done'
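One caveat with str.replace is that it removes the substring wherever it occurs, not only at the start of the line. If that matters for your data, a prefix-only variant could look like the sketch below; strip_prefixes and PREFIXES are hypothetical names, and the last URL is a made-up example.

```python
PREFIXES = ('https://', 'http://', 'www.')

def strip_prefixes(url, prefixes=PREFIXES):
    """Remove any of the given prefixes from the start of url, repeatedly."""
    stripped = True
    while stripped:
        stripped = False
        for prefix in prefixes:
            if url.startswith(prefix):
                url = url[len(prefix):]
                stripped = True
    return url

urls = ["http://www.site1.com/", "http://site2.com/", "https://www.site3333.co.uk/",
        "site44.com/", "http://www.site5.com/", "site66.com/"]
print([strip_prefixes(u) for u in urls])

# 'www.' in the middle of the host name is left alone here,
# whereas replace('www.', '') would remove it.
print(strip_prefixes("http://news.www.example.com/"))
```

For the six lines in the question both approaches give the same output; they differ only when one of the strings appears somewhere other than the beginning.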
I am trying to retrieve particular parts of each line in a text file such as the one below, and I would like to save them to a text file in MATLAB.
Original text file
D 1m8ea_ 1m8e A: d.174.1.1 74583 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=74583
D 1m8eb_ 1m8e B: d.174.1.1 74584 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=74584
D 3e7ia1 3e7i A:77-496 d.174.1.1 158052 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=158052
D 3e7ib1 3e7i B:77-496 d.174.1.1 158053 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=158053
D 2bhja1 2bhj A:77-497 d.174.1.1 128533 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=128533
So basically, I would like to retrieve the PDB code ("1m8e"), the chain id ("A"), the start value ("77") and the stop value ("496"), and pass all of these values to an fprintf statement.
Is there a way to tell regexp which index to start at, so that I can retrieve those strings based on their position in each line of the text file?
In the end, all I want in the fprintf statement is 1m8e, A, 77, 496.
So far I have two fopen calls, one that opens the file for reading and one that opens a new file for writing, a loop that reads line by line, and an fprintf statement:
pdbcode = '';
chainid = '';
start = '';
stop = '';
fin = fopen('dir.cla.scop.txt_1.75.txt', 'r');
fout = fopen('output_scop.txt', 'w');
% TODO: Add error check!
while true
    line = fgetl(fin); % Get the next line from the file
    if ~ischar(line)
        % End of file
        break;
    end
    % Print result into output_scop.txt file
    fprintf(fout, 'INSERT INTO cath_domains (scop_pdbcode, scop_chainid, scopbegin, scopend) VALUES("%s", %s, %s, %s);\n', pdbcode, chainid, start, stop);
end
fclose(fin);
fclose(fout);
Thank you.
You should be able to strsplit on whitespace, take the third element ("1m8e") and the fourth element ("A:77-496"), then split the fourth element on ":", and finally split the second of those two pieces on "-". That's one approach. For example, you could do:
% split on spaces and tabs; runs of delimiters are collapsed by default
tokens = strsplit(strtrim(line), {' ', '\t'});
pdbcode = tokens{3};
% split fourth token from previous split on colon
parts = strsplit(tokens{4}, ':');
chainid = parts{1};
% split second part from previous split on dash; lines such as "A:" have no range
range = strsplit(parts{2}, '-');
if numel(range) == 2
    start = range{1};
    stop = range{2};
else
    start = '';
    stop = '';
end
If you really want to use regular expressions, you could try the following:
pattern = '\S+\s+\S+\s+(\S+)\s+([A-Za-z]+):([0-9]+)-([0-9]+)';
% 'once' returns the tokens of the first match as a flat cell array
tok = regexp(line, pattern, 'tokens', 'once');
if ~isempty(tok) % lines without a start-stop range do not match
    pdbcode = tok{1};
    chainid = tok{2};
    start = tok{3};
    stop = tok{4};
end
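For readers who want to check the splitting steps outside MATLAB, the same split-based approach can be sketched in Python. This is an illustration only, not MATLAB code: parse_scop_line is a hypothetical name, and it assumes the field layout shown in the question and non-negative start values.

```python
def parse_scop_line(line):
    """Return (pdbcode, chainid, start, stop); start and stop are '' if absent."""
    fields = line.split()                        # split on any whitespace
    pdbcode = fields[2]                          # third field, e.g. '1m8e'
    chainid, _, span = fields[3].partition(':')  # 'A:77-496' -> 'A', '77-496'
    start, _, stop = span.partition('-')         # '77-496' -> '77', '496'
    return pdbcode, chainid, start, stop

with_range = "D 3e7ia1 3e7i A:77-496 d.174.1.1 158052 cl=53931,cf=56511"
without_range = "D 1m8ea_ 1m8e A: d.174.1.1 74583 cl=53931,cf=56511"

print(parse_scop_line(with_range))
print(parse_scop_line(without_range))
```

The second sample line shows the case where the chain has no range, which yields empty strings for start and stop.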