how to skip multiple header lines using python - python-2.7

I am new to Python. I am trying to write a script that will use the numeric columns from a file which also contains a header. Here is an example of the file:
#File_Version: 4
PROJECTED_COORDINATE_SYSTEM
#File_Version____________-> 4
#Master_Project_______->
#Coordinate_type_________-> 1
#Horizon_name____________->
sb+
#Horizon_attribute_______-> STRUCTURE
474457.83994 6761013.11978
474482.83750 6761012.77069
474507.83506 6761012.42160
474532.83262 6761012.07251
474557.83018 6761011.72342
474582.82774 6761011.37433
474607.82530 6761011.02524
I'd like to skip the header. Here is what I tried. It works, of course, if I know which characters will appear in the header, like "#" and "#". But how can I skip all lines containing any letter character?
import re
in_file1 = open(input_file1_short, 'r')
out_file1 = open(output_file1_short, "w")
lines = in_file1.readlines()
x = []
y = []
for line in lines:
    if "#" not in line and "#" not in line:
        strip_line = line.strip()
        replace_split = re.split(r'[ ,|;"\t]+', strip_line)
        x = (replace_split[0])
        y = (replace_split[1])
        out_file1.write("%s\t%s\n" % (str(x), str(y)))
in_file1.close()
out_file1.close()
Thank you very much!

I think you could use some built-ins like this:
import string
for line in lines:
    if any([letter in line for letter in string.ascii_letters]):
        print "there is an ascii letter somewhere in this line"
This only looks for ASCII letters, however.
You could also do this:
import unicodedata
for line in lines:
    if any([unicodedata.category(unicode(letter)).startswith('L') for letter in line]):
        print "there is a unicode letter somewhere in this line"
but only if I understand my Unicode categories correctly.
Even cleaner (using suggestions from other answers; this works for both unicode lines and plain strings):
for line in lines:
    if any([letter.isalpha() for letter in line]):
        print "there is a letter somewhere in this line"
But, interestingly, if you do:
In [57]: u'\u2161'.isdecimal()
Out[57]: False
In [58]: u'\u2161'.isdigit()
Out[58]: False
In [59]: u'\u2161'.isalpha()
Out[59]: False
The Unicode character for the Roman numeral two is none of those, but unicodedata.category(u'\u2161') does return 'Nl', indicating a numeric character (and u'\u2161'.isnumeric() is True).

This checks the first character of each line and skips every line that doesn't start with a digit:
for line in lines:
    if line[0].isdigit():
        # we've got a line starting with a digit
        pass  # process the line here

Use a generator pipeline to filter your input stream.
This takes the lines from your original input, but passes a line on only after checking that there are no letters anywhere in it.
input_stream = (line for line in lines if
                reduce((lambda x, y: (not y.isalpha()) and x), line, True))
for line in input_stream:
    strip_line = ...
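Putting the answers above together, here is a minimal sketch of the whole task. It assumes that any line containing a letter is a header and that the data columns are separated by whitespace. The names `has_letter` and `extract_xy` and the inline sample lines are illustrative and do not come from the original posts. Note that a value written in exponent notation, such as 1e5, contains a letter and would be skipped by this test.

```python
def has_letter(line):
    """Return True if any character in the line is alphabetic."""
    return any(ch.isalpha() for ch in line)


def extract_xy(lines):
    """Yield (x, y) string pairs from lines that contain no letters."""
    for line in lines:
        fields = line.split()
        # Skip headers (lines with letters) and lines with fewer than two columns.
        if has_letter(line) or len(fields) < 2:
            continue
        yield fields[0], fields[1]


# Hypothetical sample mirroring the file shown in the question.
sample = [
    "#File_Version: 4\n",
    "PROJECTED_COORDINATE_SYSTEM\n",
    "sb+\n",
    "#Coordinate_type_________-> 1\n",
    "474457.83994 6761013.11978\n",
    "474482.83750 6761012.77069\n",
]

pairs = list(extract_xy(sample))
for x, y in pairs:
    print("%s\t%s" % (x, y))
```

In a real script, the lines would come from the opened input file and each pair would be written to the output file instead of printed.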

Related

how to not remove space in file

How do I keep the spaces between the words?
My code deletes them and prints the characters in a column. How can I print them in a row, with the spaces?
s = '[]'
f = open('q4.txt', "r")
for line in f:
    for word in line:
        b = word.strip()
        c = list(b)
        for j in b:
            if ord(j) == 32:
                print ord(33)
            if ord(j) == 97:
                print ord(123)
            if ord(j) == 65:
                print ord(91)
            chr_nums = chr(ord(j) - 1)
            print chr_nums
f.close()
Short answer: remove the word.strip() command - that's deleting the space. Then put a comma after the print operation to prevent a newline: print chr_nums,
There are several problems with your code aside from what you ask about here:
ord() takes a string (character) not an int, so ord(33) will fail.
for word in line: will be iterating over characters, not words, so word will be a single character and for j in b is unnecessary.
Take a look at the first for loop:
for line in f:
Here the variable named 'line' is a line from the text file you are reading, so it is a string. Now take a look at the second for loop:
for word in line:
Here you are looping over the string 'line' from the previous loop. So the variable named 'word' will not hold a word, but single characters, one at a time. Let me demonstrate this with a simple example:
for word in "how are you?":
    print(word)
The output of this code will be as follows (the two space characters are also printed, each on its own line, but are omitted here):
h
o
w
a
r
e
y
o
u
?
You are getting individual characters from the line, so you don't need another for loop like your 'for j in b:'. I hope this helps.
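Combining the points above: if the goal is to shift each character by one code point while keeping the spaces and printing the result in a row, one option is to build the whole output string and print it once. This is only a sketch under that assumption. The function name `shift_line`, the `offset` parameter, and the sample string are illustrative, and leaving spaces unchanged is a guess at the intended behavior.

```python
def shift_line(line, offset=-1):
    """Shift every non-space character by `offset` code points; keep spaces."""
    return "".join(ch if ch == " " else chr(ord(ch) + offset) for ch in line)


# Hypothetical input; each letter is one code point above the intended one.
shifted = shift_line("ifmmp xpsme")
print(shifted)
```

Because the result is joined into a single string, there is no need for a trailing comma after print, and nothing is stripped along the way.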

regex excluding newline

I have a simple word counter that works with one exception. It is splitting on the \n character.
The small sample text file is:
'''
A tree is a woody perennial plant,typically with branches.
I added this second line,just to add eleven more words.
'''
Line #1 has ten words, line #2 has eleven. Total word count = 21.
This code yields a count of 22 because it is including the \n character at the end of line #1:
import re
testfile = "d:\\python\\workbook\\words2.txt"
number_of_words = 0
with open(testfile, "r") as datafile:
    for line in datafile:
        number_of_words += len(re.split(",|\s", line))
print(number_of_words)
If I change my regex to: number_of_words += len(re.split(",|^\n|\s", line))
the word count (22) remains unchanged.
My question is: why is exclude newline [^\n] failing, or more broadly, what should be the correct way to code my regex so that I exclude the trailing \n and have the above code arrive at the correct word total of 21?
You can simply use:
number_of_words = 0
with open(testfile, "r") as datafile:
    for line in datafile:
        number_of_words += len(re.findall(r'\w+', line))
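To see why the original count comes out one too high: re.split returns an empty string after a trailing separator, so the \n at the end of line 1 adds one extra element to the list. In the pattern ",|^\n|\s", the ^ outside a character class is a start-of-line anchor rather than a negation, and \s still matches the newline anyway. The sketch below uses the two sample lines as in-memory strings (the variable names are illustrative) and assumes the last line of the file has no trailing newline.

```python
import re

line1 = "A tree is a woody perennial plant,typically with branches.\n"
line2 = "I added this second line,just to add eleven more words."

# Splitting on the trailing "\n" leaves an empty string at the end of the list.
split_parts = re.split(r",|\s", line1)
print(split_parts[-1] == "")

# Counting matches of \w+ ignores the separators entirely.
split_total = sum(len(re.split(r",|\s", ln)) for ln in (line1, line2))
findall_total = sum(len(re.findall(r"\w+", ln)) for ln in (line1, line2))
print(split_total)
print(findall_total)
```

The split-based total is 22 and the findall-based total is 21, matching the numbers in the question.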

access char without indexing in python

lines = []
while True:
    line = raw_input()
    if line:
        lines.append(line)
    else:
        break
print lines
This reads the input line by line into a list. The output is:
In [27]: lines
Out[27]: ['x-xx', 'y->y', '-z->']
How do I access the next letter, while currently being at a letter, in the following code:
count = 0  # to check how many '->' are there in each line
for sentence in lines:
    for letter in sentence:
        if letter == '-':
            # check if the next character is '>' (How to code this line)
            # and if so, increment count
            pass
        else:
            break
Is there a way out for this kind of for loop, where you don't index letter but iterate on letter itself directly?
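One way to look at the next character without indexing is to pair each character with its successor using zip(sentence, sentence[1:]). The sketch below assumes the goal is simply to count the occurrences of '->' in each line; the helper name `count_arrows` is made up for illustration. For this particular pattern, sentence.count('->') gives the same number.

```python
def count_arrows(sentence):
    """Count occurrences of '->' by pairing each character with the next one."""
    return sum(1 for cur, nxt in zip(sentence, sentence[1:])
               if cur == '-' and nxt == '>')


lines = ['x-xx', 'y->y', '-z->']
counts = [count_arrows(s) for s in lines]
print(counts)
```

zip stops at the shorter sequence, so the last character is never paired with anything and no bounds check is needed.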

Extracting data using regular expressions: Python

The basic outline of this problem is to read the file, look for integers using re.findall() with the regular expression [0-9]+, convert the extracted strings to integers, and sum them up.
I am having trouble appending to the list. With my code below, only the first (index 0) match of a line is appended. Please help me. Thank you.
import re
hand = open('a.txt')
lst = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('[0-9]+', line)
    if len(stuff) != 1: continue
    num = int(stuff[0])
    lst.append(num)
print sum(lst)
import re
ls = []
text = open('C:/Users/pvkpu/Desktop/py4e/file1.txt')
for line in text:
    line = line.rstrip()
    l = re.findall('[0-9]+', line)
    if len(l) == 0:
        continue
    ls += l
for i in range(len(ls)):
    ls[i] = int(ls[i])
print(sum(ls))
Great, thank you for including the whole txt file! Your main problem was in the if len(stuff)... line, which skipped a line both when stuff was empty and when it had 2, 3 or more items. You were only keeping stuff lists of length 1. I put comments in the code, but please ask if something is unclear.
import re
hand = open('a.txt')
str_num_lst = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('[0-9]+', line)
    # If we didn't find anything on this line then continue
    if len(stuff) == 0: continue
    # if len(stuff) != 1: continue  # <-- This line was wrong as it skips lists with more than 1 element
    # If we did find something, stuff will be a list of strings:
    # (i.e. stuff = ['9607', '4292', '4498'] or stuff = ['4563'])
    # For now let's just add this list onto our str_num_lst
    # without worrying about converting to int.
    # We use '+=' instead of 'append' since both stuff and str_num_lst are lists
    str_num_lst += stuff
# Print out the str_num_lst to check if everything's ok
print str_num_lst
# Get an overall sum by looping over the string numbers in the str_num_lst
# Can convert to int inside the loop
overall_sum = 0
for str_num in str_num_lst:
    overall_sum += int(str_num)
# Print sum
print 'Overall sum is:'
print overall_sum
EDIT:
You are right, reading in the entire file as one line is a good solution, and it's not difficult to do. Check out this post. Here is what the code could look like.
import re
hand = open('a.txt')
all_lines = hand.read()  # Reads in all lines as one long string
all_str_nums_as_one_line = re.findall('[0-9]+', all_lines)
hand.close()  # <-- can close the file now since we've read it in
# Go through all the matches to get a total
tot = 0
for str_num in all_str_nums_as_one_line:
    tot += int(str_num)
print('Overall sum is:', tot)  # editing to add ()
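The same idea can be written more compactly with a generator expression. This is a minimal sketch that works on an in-memory string so it runs without a file; the function name `sum_numbers` and the sample text are hypothetical and are not the asker's data.

```python
import re


def sum_numbers(text):
    """Sum every run of digits found anywhere in the text."""
    return sum(int(num) for num in re.findall(r'[0-9]+', text))


# Hypothetical sample: one line with a single number, one with none, one with two.
sample_text = "Why 3 hours?\nNo numbers here\nLines 12 and 7 have two\n"
total = sum_numbers(sample_text)
print(total)
```

To apply it to a file, read the contents with read() inside a with block and pass the resulting string to the function.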

If line not present in the text file - python

I have a list with a set of strings and another dynamic list:
arr = ['sample1','sample2','sample3']
applist=[]
I am reading a text file line by line, and if a line starts with any of the strings in arr, then I append it to applist, as follows:
for line in open('test.txt').readlines():
    for word in arr:
        if line.startswith(word):
            applist.append(line)
Now, if I do not have a line with any of the strings in the arr list, then I want to append 'NULL' to applist instead. I tried:
for line in open('test.txt').readlines():
    for word in arr:
        if line.startswith(word):
            applist.append(line)
        elif word not in 'test.txt':
            applist.append('NULL')
But it obviously doesn't work (it inserts many unnecessary NULLs). How do I go about it? Also, there are other lines in the text file besides the three lines starting with the strings in arr. But I want to append only these three lines. Thanks in advance!
for line in open('test.txt').readlines():
    found = False
    for word in arr:
        if line.startswith(word):
            applist.append(line)
            found = True
            break
    if not found: applist.append('NULL')
I think this might be what you are looking for:
found1 = 'NULL'
found2 = 'NULL'
found3 = 'NULL'
for line in open('test.txt').readlines():
    if line.startswith(arr[0]):
        found1 = line
    elif line.startswith(arr[1]):
        found2 = line
    elif line.startswith(arr[2]):
        found3 = line
applist = [found1, found2, found3]
You could clean that up and make it better looking, but that should give you the logic you're going for.
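The three separate variables generalize to a loop over arr. The following is a sketch under the assumption that one entry per prefix is wanted, in the order of arr, with the string 'NULL' when no line starts with that prefix, and that the first matching line wins. The function name `match_prefixes` and the sample lines are illustrative.

```python
def match_prefixes(lines, prefixes):
    """Return, for each prefix, the first line starting with it, or 'NULL'."""
    found = {}
    for line in lines:
        for prefix in prefixes:
            # Keep only the first line seen for each prefix.
            if prefix not in found and line.startswith(prefix):
                found[prefix] = line
                break
    return [found.get(prefix, 'NULL') for prefix in prefixes]


arr = ['sample1', 'sample2', 'sample3']
# Hypothetical file contents; 'sample2' is deliberately missing.
lines = ["header line\n", "sample3 c\n", "sample1 a\n", "other\n"]
applist = match_prefixes(lines, arr)
print(applist)
```

Lines that match none of the prefixes are ignored, so the result always has exactly one entry per string in arr.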