python 2.7: reading a file only up to a known line

If I want to read a file starting from a given line, I can do:
with open(myfile) as f:
    for x in range(from_here):
        next(f)
    for line in f:
        # do stuff
How can I do the opposite: read only up to a given line?
I was thinking about a for loop: is there another way?

The obvious answer is to use a loop that just counts:
with open(myfile) as f:
    for i in xrange(number_of_wanted_lines):
        line = next(f)
        # do stuff with line
Regarding the second part of your question, you can also read the full file into a list of lines and then use slices:
with open(myfile) as f:
    lines = f.readlines()[start_line_number:end_line_number+1]
for line in lines:
    # do stuff with line
If you don't want to load the whole file into memory, you can also use islice (from itertools) instead of list slices:
import itertools

with open(myfile) as f:
    for line in itertools.islice(f, start_line_number, end_line_number + 1):
        # do stuff with line

with open(myfile) as f:
    for x in range(until_here):
        line = next(f)
        # do stuff with line
    # do stuff with the rest of f
or
import itertools as it

with open(myfile) as f:
    for line in it.islice(f, until_here):
        # do stuff
    # do stuff with the rest of f
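For example, here is a self-contained sketch (the filename "data.txt" and the line range are made-up illustrations, not from the question) that lazily prints only lines 5 through 9 of a file:
import itertools

# Hypothetical example: print lines 5-9 (zero-based) of "data.txt"
# without loading the rest of the file into memory.
with open("data.txt") as f:
    for line in itertools.islice(f, 5, 10):
        print line.rstrip()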

Related

How to direct this to file

This is not directing the output to file p:
with open('/var/tmp/out3') as f:
    before = collections.deque(maxlen=1)
    for line in f:
        if 'disk#g5000cca025a1ee6c' in line:
            sys.stdout.writelines(before)
            p.write(before)
Try this:
import sys

filename = '/var/tmp/out3'
expression = 'disk#g5000cca025a1ee6c'
with open(filename, 'r') as f:
    with open('p', 'w') as p_file:
        previous = next(f)
        for line in f:
            if expression in line:
                p_file.write(previous)
            previous = line
If the expression is found, you should find a file 'p' in your current directory containing the line that precedes each match.
It worked when I tried it on Python 2.7.10. I took the code from this answer: Refer to previous line when iterating through file with Python.
Hope this helps.
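If you need more than one line of context before each match, a variation on the question's collections.deque idea might look like this (a sketch; the maxlen of 3 and the output filename 'p' are illustrative assumptions):
import collections

expression = 'disk#g5000cca025a1ee6c'
before = collections.deque(maxlen=3)  # keep the last 3 lines seen so far
with open('/var/tmp/out3') as f, open('p', 'w') as p_file:
    for line in f:
        if expression in line:
            p_file.writelines(before)  # write the buffered context lines
        before.append(line)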

python readline from big text file

When I run this:
import os.path
import pyproj

srcProj = pyproj.Proj(proj='longlat', ellps='GRS80', datum='NAD83')
dstProj = pyproj.Proj(proj='longlat', ellps='WGS84', datum='WGS84')

f = file(os.path.join("DISTAL-data", "countries.txt"), "r")
heading = f.readline()  # Ignore field names.

with open('C:\Python27\DISTAL-data\geonames_20160222\countries.txt', 'r') as f:
    for line in f.readlines():
        parts = line.rstrip().split("|")
        featureName = parts[1]
        featureClass = parts[2]
        lat = float(parts[9])
        long = float(parts[10])
        if featureClass == "Populated Place":
            long, lat = pyproj.transform(srcProj, dstProj, long, lat)
f.close()
I get this error:
File "C:\Python27\importing world datacountriesfromNAD83 toWGS84.py", line 13, in <module>
    for line in f.readlines():
MemoryError
I have downloaded the countries file from http://geonames.nga.mil/gns/html/namefiles.html as the entire country file dataset.
Please help me get past this error.
readlines() creates a large structure in memory for large files; you can try iterating over the file instead:
f = open('somefilename', 'r')
for line in f:
    dosomething()
The answer given by Yael is helpful; I would like to improve it. A good way to read a file, even a large one:
with open(filename) as f:
    for line in f:
        print line
I like to use the 'with' statement, which ensures the file will be properly closed.
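Applied to the question's code, a minimal sketch that avoids readlines() entirely might look like this (the path and the pipe-delimited field indices are taken from the question; treat the rest as assumptions):
import pyproj

srcProj = pyproj.Proj(proj='longlat', ellps='GRS80', datum='NAD83')
dstProj = pyproj.Proj(proj='longlat', ellps='WGS84', datum='WGS84')

# Iterate over the file object directly, so only one line is in memory at a time.
with open(r'C:\Python27\DISTAL-data\geonames_20160222\countries.txt', 'r') as f:
    f.readline()  # skip the header row with the field names
    for line in f:
        parts = line.rstrip().split("|")
        if parts[2] == "Populated Place":
            lat, lon = float(parts[9]), float(parts[10])
            lon, lat = pyproj.transform(srcProj, dstProj, lon, lat)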

Python read and write in same function

My code currently takes in a CSV file and outputs to a text file. In the piece of code below, I search the CSV for a keyword like 'issues', and I want every row containing that word written to a text file. Currently I have it printing to a JSON file, but it is all on one line, like this:
"something,something1,something2,something3,something4,something5,something6,something7\r\n""something,something1,something2,something3,something4,something5,something6,something7\r\n"
But I want it to print out like this:
"something,something1,something2,something3,something4,something5,something6,something7"
"something,something1,something2,something3,something4,something5,something6,something7"
Here is the code I have so far:
def search(self, filename):
    with open(filename, 'rb') as searchfile, open("weekly_test.txt", 'w') as text_file:
        for line in searchfile:
            if 'PBI 43125' in line:
                # print(line)
                json.dump(line, text_file, sort_keys=True, indent=4)
So again, I just need a little guidance on how to get my JSON file formatted the way I want.
Just replace print line with print >>file, line
def search(self, filename):
    with open('test.csv', 'r') as searchfile, open('weekly_test.txt', 'w') as search_results_file:
        for line in searchfile:
            if 'issue' in line:
                print >>search_results_file, line
    # At this point, both files will be closed automatically
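If the output really needs to remain JSON, one alternative (a sketch of the JSON Lines convention, not the answerer's code) is to dump each matching row as its own JSON document followed by a newline:
import json

def search(filename):
    # Sketch: one JSON document per matching row ("JSON Lines"),
    # so each row lands on its own line instead of being concatenated.
    with open(filename, 'r') as searchfile, open('weekly_test.txt', 'w') as text_file:
        for line in searchfile:
            if 'issue' in line:
                json.dump(line.rstrip('\r\n'), text_file)
                text_file.write('\n')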

Remove whitespace from specific part of file

code:
with open(filename) as f:
    file_list = f.readlines()
file_list = [line.strip() for line in file_list]  # remove whitespace from each line of the file
# code to process data between start and end tags (these tags can have
# whitespace around them, which is why I removed it above)
This code works fine for me, but if the file is too big I don't think it is sensible to copy the whole file into a list and then strip whitespace from each line.
How can I remove whitespace from only a specific part of the file, so that only that part is saved in the list?
I tried:
with open(filename) as f:
    for line in f.readlines():
        if line.strip() == "start":
            start = f.readlines.index("start")
        if line.strip() == "end":
            end = f.readlines.index("end")
    file_list = f.readlines[start:end]
But it gives this error:
    start = f.readlines.index("start")
AttributeError: 'builtin_function_or_method' object has no attribute 'index'
I just want an efficient version of the code at the top of this post.
The problem with your code is that the file object f is an iterator, and once you call f.readlines() it is exhausted, so finding the index of a line by calling f.readlines() again can't work. Also, calling readlines() at all negates your effort of storing only the interesting parts of the file, as readlines() would read the entire file into memory anyways.
Instead, just memorize whether you've already seen the start-line and add the following lines to the list until you see the end-line.
with open(filename) as f:
    started, lines = False, []
    for line in f:
        stripped = line.strip()
        if stripped == "end": break
        if started: lines.append(stripped)
        if stripped == "start": started = True
Alternatively, you could also use itertools.takewhile to get all the lines up to the end-line.
import itertools

with open(filename) as f:
    for line in f:
        if line.strip() == "start":
            lines = itertools.takewhile(lambda l: l.strip() != "end", f)
            lines = map(str.strip, lines)
            break
Or even shorter, using another takewhile to read (and discard) the lines before the start-line:
with open("test.txt") as f:
list(itertools.takewhile(lambda l: l.strip() != "start", f))
lines = itertools.takewhile(lambda l: l.strip() != "end", f)
lines = map(str.strip, lines)
In all cases, lines holds the (stripped) lines between the start- and the end-line, both exclusive.
Tobias's first answer can be modified a bit with continue ...
with open(filename) as f:
    started = False
    for line in f:
        stripped = line.strip()
        if stripped == "end": break
        if stripped == "start":
            started = True
            continue
        if not started: continue
        # process line here, no need to store it in a list ...
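The same idea can also be packaged as a generator, so callers can iterate over the interesting region without building a list (a sketch; the helper name lines_between is mine):
def lines_between(f, start="start", end="end"):
    # Hypothetical helper: lazily yield the stripped lines strictly
    # between the start and end tags, one line at a time.
    started = False
    for line in f:
        stripped = line.strip()
        if stripped == end:
            break
        if started:
            yield stripped
        elif stripped == start:
            started = True

# Usage sketch:
# with open(filename) as f:
#     for line in lines_between(f):
#         process(line)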

How to use list of strings as training data for svm using scikit.learn?

I am using scikit-learn to train an SVM on data where each observation (X) is a list of words. The tags for each observation (Y) are floating-point values. I have tried following the example given in the scikit-learn documentation (http://scikit-learn.org/stable/modules/svm.html) for multi-class classification.
Here is my code:
from __future__ import division
from sklearn import svm
import os.path
import numpy
import re

'''
The stanford-postagger was included to see how it tags the words and to see if it would help in getting just the names
of the ingredients. Turns out it's pointless.
'''
#from nltk.tag.stanford import POSTagger
mainDirectory = './nyu/PROJECTS/Epicurious/DATA/ingredients'
#st = POSTagger('/usr/share/stanford-postagger/models/english-bidirectional-distsim.tagger','/usr/share/stanford-postagger/stanford-postagger.jar')

'''
This is where we read each line of the file and then run a regex match on it to get all the words before
the first tab. (These are the names of the ingredients. Some of them may have adjectives like fresh, peeled, cut etc.
Not sure what to do about them yet.)
'''
def getFileDetails(_filename, _fileDescriptor):
    rankingRegexMatch = re.match('([0-9](?:\_)[0-9]?)', _filename)
    if len(rankingRegexMatch.group(0)) == 2:
        ranking = float(rankingRegexMatch.group(0)[0])
    else:
        ranking = float(rankingRegexMatch.group(0)[0] + '.' + rankingRegexMatch.group(0)[2])
    _keywords = []
    for line in _fileDescriptor:
        m = re.match('(\w+\s*\w*)(?=\t[0-9])', line)
        if m:
            _keywords.append(m.group(0))
    return [_keywords, ranking]

'''
Open each file in the directory and pass the name and file descriptor to getFileDetails
'''
def this_is_it(files):
    _allKeywords = []
    _allRankings = []
    for eachFile in files:
        fullFilePath = mainDirectory + '/' + eachFile
        f = open(fullFilePath)
        XandYForThisFile = getFileDetails(eachFile, f)
        _allKeywords.append(XandYForThisFile[0])
        _allRankings.append(XandYForThisFile[1])
    #_allKeywords = numpy.array(_allKeywords, dtype=object)
    svm_learning(_allKeywords, _allRankings)

def svm_learning(x, y):
    clf = svm.SVC()
    clf.fit(x, y)

'''
This just prints the directory path and then calls the callback x on files
'''
def print_files(x, dir_path, files):
    print dir_path
    x(files)

'''
code starts here
'''
os.path.walk(mainDirectory, print_files, this_is_it)
When the svm_learning(x,y) method is called, it throws me an error:
Traceback (most recent call last):
  File "scan for files.py", line 72, in <module>
    os.path.walk(mainDirectory, print_files, this_is_it)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/posixpath.py", line 238, in walk
    func(arg, top, names)
  File "scan for files.py", line 68, in print_files
    x(files)
  File "scan for files.py", line 56, in this_is_it
    svm_learning(_allKeywords, _allRankings)
  File "scan for files.py", line 62, in svm_learning
    clf.fit(x, y)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/svm/base.py", line 135, in fit
    X = atleast2d_or_csr(X, dtype=np.float64, order='C')
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 116, in atleast2d_or_csr
    "tocsr")
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 96, in _atleast2d_or_sparse
    X = array2d(X, dtype=dtype, order=order, copy=copy)
  File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 80, in array2d
    X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
  File "/Library/Python/2.7/site-packages/numpy-1.8.0.dev_bbcfcf6_20130307-py2.7-macosx-10.8-intel.egg/numpy/core/numeric.py", line 331, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Can anyone help? I am new to scikit and could not find any help in the documentation.
You should take a look at: Text feature extraction. You are going to want to use either a TfidfVectorizer, a CountVectorizer, or a HashingVectorizer (if your data is very large). These components take your text in and output feature matrices that are acceptable to classifiers. Be advised that they work on lists of strings, with one string per example, so if you have a list of lists of strings (i.e. you have already tokenized), you may need to either join() the tokens to get a list of strings, or skip tokenization.
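A minimal sketch of that advice (the toy data, and the choice of SVR for the float-valued targets, are my assumptions, not part of the question):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm

# Toy data mirroring the question's setup: each observation is a list of
# already-tokenized words, each target a float ranking.
observations = [['fresh', 'basil', 'leaves'], ['peeled', 'garlic'], ['cut', 'onion']]
rankings = [4.5, 3.0, 2.5]

# join() each token list back into one string per example, as suggested above.
texts = [' '.join(words) for words in observations]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # sparse feature matrix the estimator accepts

# SVR rather than SVC, since the targets are floats (regression, not classification).
clf = svm.SVR()
clf.fit(X, rankings)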