how to improve the speed of the python script - python-2.7

I'm very new to Python. I work in hydrology and want to learn Python to help me process hydrological data.
I have written a script to extract pieces of information from a large data set. I have three CSV files:
Complete_borelist.csv
Borelist_not_interested.csv
Elevation_info.csv
I want to create a file that has all the bores that are in Complete_borelist.csv but not in Borelist_not_interested.csv. I also want to grab some information from Complete_borelist.csv and Elevation_info.csv for the bores that satisfy the first criterion.
My script is as follows:
not_interested_list =[]
outfile1 = open('output.csv','w')
outfile1.write('Station_ID,Name,Easting,Northing,Location_name,Elevation')
outfile1.write('\n')
with open ('Borelist_not_interested.csv','r') as f1:
    for line in f1:
        if not line.startswith('Station'): #ignore header
            line = line.rstrip()
            words = line.split(',')
            station = words[0]
            not_interested_list.append(station)
with open('Complete_borelist.csv','r') as f2:
    next(f2) #ignore header
    for line in f2:
        line= line.rstrip()
        words = line.split(',')
        station = words[0]
        if not station in not_interested_list:
            loc_name = words[1]
            easting = words[4]
            northing = words[5]
            outfile1.write(station+','+easting+','+northing+','+loc_name+',')
            with open ('Elevation_info.csv','r') as f3:
                next(f3) #ignore header
                for line in f3:
                    line = line.rstrip()
                    data = line.split(',')
                    bore_id = data[0]
                    if bore_id == station:
                        elevation = data[4]
                        outfile1.write(elevation)
                        outfile1.write ('\n')
outfile1.close()
I have two issues with the script:
The first is that Elevation_info.csv doesn't have information for every bore in Complete_borelist.csv. When my loop reaches a station for which it can't find an elevation record, the script doesn't write "null" but continues writing the information for the next station on the same line. Can anyone help me fix this, please?
The second is that my complete bore list has more than 200,000 rows and my script runs through them very slowly. Does anyone have a suggestion to make it run faster?
Any help is very much appreciated, and sorry if my question is too long.

Performance-wise, this has a couple of problems. The first is that you are opening and re-reading the elevation info for every line of the complete file. Read the elevation info into a dictionary keyed on bore_id before you open the complete file. Then you can test the dictionary very quickly to see whether station is in it, instead of re-reading the file.
The second performance issue is that you don't stop searching the elevation file once you find a match. The dictionary idea solves that too, but otherwise a break once you have a match would help a little.
For the null printing problem, you just need to outfile1.write("\n") even when the bore_id is not found. With the dictionary, an else clause on the dictionary test does that. In the current code, an else clause on the for loop would do it (this only works together with a break on a match, otherwise the else branch runs every time), or even changing the indentation of that last write("\n").
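A rough sketch of the dictionary approach described above. The function name build_output and the five-column header are my own choices, not from the question; the column positions (0, 1, 4, 5, and elevation in column 4) are copied from the original script. It also swaps the not-interested list for a set, which is an extra change the answer doesn't mention, and writes the text "null" when no elevation record exists. It is written to run on both Python 2.7 and Python 3.

```python
def build_output(complete_path, excluded_path, elevation_path, output_path):
    # Station IDs to skip; a set makes "in" tests fast.
    with open(excluded_path) as f:
        next(f)  # skip header
        excluded = set(line.split(',')[0].strip() for line in f if line.strip())

    # Read the elevation file once: bore_id -> elevation.
    elevations = {}
    with open(elevation_path) as f:
        next(f)  # skip header
        for line in f:
            data = line.rstrip().split(',')
            if len(data) > 4:
                elevations[data[0]] = data[4]

    with open(complete_path) as f, open(output_path, 'w') as out:
        next(f)  # skip header
        # Assumed header: one name per column actually written below.
        out.write('Station_ID,Easting,Northing,Location_name,Elevation\n')
        for line in f:
            words = line.rstrip().split(',')
            if len(words) < 6 or words[0] in excluded:
                continue
            # Fall back to the text "null" when there is no elevation record.
            elevation = elevations.get(words[0], 'null')
            out.write(','.join([words[0], words[4], words[5], words[1], elevation]) + '\n')
```

With the file names from the question, this would be called as build_output('Complete_borelist.csv', 'Borelist_not_interested.csv', 'Elevation_info.csv', 'output.csv'). Each input file is read exactly once, so the run time grows roughly linearly with the number of rows.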

Related

Asking user for raw_input to open a file, when attempting to run program comes back with mode 'r'

I am trying to run the following code:
fname = raw_input ('Enter file name:')
fh = open (fname)
count = 0
for line in fh:
    if not line.startswith ('X-DSPAM-Confidence:') : continue
    else:
        count = count + 1
new = fh #this new = fh is supposed to be fh stripped of the non- x-dspam lines
for line in new: # this separates the lines in new and allows finding the floats on each line
    numpos = new.find ('0')
    endpos = new.find ('5', numpos)
    num = new[numpos:endpos + 1]
    float (num)
# should now have a list of floats
print num
The intention of this code is to prompt the user for a file name, open the file, read through the file, compile all the lines that start with X-DSPAM, and extract the float number on these lines. I am fairly new to coding so I realise I may have committed a number of errors, but currently when I try to run it, after putting in the file name I get the return:
I looked around and I have seen that mode 'r' refers to different file modes in python in relation to how the end of the line is handled. However the code I am trying to run is similar to other code I have formulated and it does not have any non-text files inside, the file being opened is a .txt file. Is it something to do with converting a list of strings line by line to a list of float numbers?
Any ideas on what I am doing wrong would be appreciated.

The default mode of handling a file is 'r' - which means 'read', which is what you want. It means the program is going to read the file (as opposed to 'w' - write, or 'a' - append, for example - which would allow you to overwrite the file or append to it, which you don't want in this case).
There are some bugs in your code, which I've tried to indicate in the edited code below.
You don't need to assign new = fh; you're not grabbing lines and passing them to a new file. Rather, you're checking each line against the 'X-DSPAM' criterion and, if it's a match, you can proceed to parse out the desired number. If not, you ignore it and go to the next line.
With that in mind, you can move all of the code from the for line in new loop into the original if not ... else block.
How you find the end of the number is also a bit off. You set endpos by searching for an occurrence of the character '5', but what I think you want is a position 5 characters from the start position (numpos + 5).
(There are other ways to parse the line and pull the number, but I'm going to stick with your logic as indicated by your code, so nothing fancy here.)
You can convert to float in the same statement where you slice the number from the line (as below). It's acceptable to do:
num = line[numpos:endpos+1]
float_num = float(num)
but not necessary. In any event, you want to assign the conversion (float(num)) to a variable - just having float(num) doesn't allow you to pass the converted value to another statement (including print).
You say that you should have 'a list of floats'. The corrected code below will display all the floats, but if you want an actual Python list, there are other steps involved. I don't think you wanted a Python list, but just in case:
numlist = [] # at the beginning, declare a new, empty list
...
# after converting to float, append number to list
numlist.append(num)
print numlist # at end of program, to print full list
In any event, this edited code works for me with an appropriate file of test data, and outputs the desired float numbers:
fname = raw_input ('Enter file name:')
fh = open (fname)
count = 0
for line in fh:
    if not line.startswith ('X-DSPAM-Confidence:') : continue
    else:
        # there's no need to create the 'new' variable
        # any lines that meet the criteria can be processed for numbers
        count = count + 1
        numpos = line.find ('0')
        # i think what you want here is to set an endpoint 5 positions to the right
        # but your code was looking for the position of a '5' in the line
        endpos = numpos + 5
        # you can convert to float and slice in the same statement
        num = float(line[numpos:endpos+1])
        print num
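For comparison, here is one of the "other ways to parse the line" hinted at above, as a minimal sketch. It is not the asker's logic: instead of searching for a '0', it takes everything after the prefix and converts it. The function name extract_confidences is made up for illustration, and it assumes each matching line holds nothing but the number after the colon. It runs on Python 2.7 and Python 3.

```python
def extract_confidences(lines):
    """Return the floats found on lines starting with 'X-DSPAM-Confidence:'."""
    prefix = 'X-DSPAM-Confidence:'
    values = []
    for line in lines:
        if line.startswith(prefix):
            # Everything after the prefix is assumed to be the number.
            values.append(float(line[len(prefix):].strip()))
    return values
```

Because a file object yields its lines when iterated, extract_confidences(open(fname)) would give the full list of floats, and len() of the result replaces the count variable.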

Removing Duplicate Lines by Title Only

I am trying to modify a script so that it will remove duplicate lines from a text file using only the title portion of that line.
To clarify the text file lines look something like this:
Title|Image Url|Description|Page Url
At the moment the script does remove duplicates, but it does so by reading the entire line, not just the first part. All the lines in the file are not going to be 100% the same, but a few will be very similar.
I want to remove all of the lines that contain the same "title", regardless of what the rest of the line contains.
This is the script I am working with:
import sys
from collections import OrderedDict
infile = "testfile.txt"
outfile = "outfile.txt"
inf = open(infile,"r")
lines = inf.readlines()
inf.close()
newset = list(OrderedDict.fromkeys(lines))
outf = open(outfile,"w")
lstline = len(newset)
for i in range(0,lstline):
    ln = newset[i]
    outf.write(ln)
outf.close()
So far I have tried using .split() to split the lines in the list. I have also tried .readline(lines[0:25]) in hopes of using a character limit to achieve the desired results, but no luck so far. I also can't seem to find any documentation on my exact problem so I'm stuck.
I am using Windows 8 and Python 2.7.9 for this project if that helps.

I made a few changes to the program you had set up. First, I changed your file handling to use "with" statements, which are very convenient and automatically take care of closing the files. Second, I used a set instead of an OrderedDict, because you were basically just emulating set behaviour (uniqueness of elements) by using the keys of an OrderedDict. If the title hasn't been seen yet, the code adds it to the set so it can't be used again and writes the line to the output file. If it has been seen, the line is skipped. I hope this helps you!
with open("testfile.txt") as infile:
    with open("outfile.txt",'w') as outfile:
        titleset = set()
        for line in infile:
            title = line.split('|')[0]
            if title not in titleset:
                titleset.add(title)
                outfile.write(line)

How do i print each line here in a for loop

thanks for the follow :)
hii... if u want to make a new friend just add me on facebook! :) xx
Just wanna say if you ever feel lonely or sad or bored, just come and talk to me. I'm free anytime :)
I hope she not a spy for someone. I hope she real on neautral side. Because just her who i trust. :-)
not always but sometimes maybe :)
\u201c Funny how you get what you want and pray for when you want the same thing God wants. :)
Thank you :) can you follow me on Twitter so I can DM you?
RT dj got us a fallin in love and yeah earth number one m\u00fcsic listen thank you king :-)
found a cheeky weekend for \u00a380 return that's flights + hotel.. middle of april, im still looking pal :)
RT happy birthday mary ! Hope you have a good day :)
Thank god twitters not blocked on the school computers cause all my data is gone on my phone :(
enjoy tmrro. saw them earlier this wk here in tokyo :)
UPDATE:
OK, maybe my question was wrong. I have to do this:
Open the file and read from it.
Remove some links, names and other stuff from it (I have used regex, but I don't know if that is the right way to do it).
Once I have clean text (only tweets with a sad face or a happy face), I have to go through the lines one by one, because I have to loop over each one like this:
for line in tweets:
    if ':)' in line:
        cl.train(line,'happy')
    elif ':(' in line:
        cl.train(line,'sad')
My code so far you see here, but it doesn't work yet.
import re
from pprint import pprint
tweets = []
tweets = open('englishtweet.txt').read()
regex_username = '#[^\s]*' # regex to detect username in file
regex_url = 'http[^\s]*' # regex to detect url in file
regex_names = '#[^\s]*' # regex to detect # in file
for username in re.findall(regex_username, tweets):
    tweets = tweets.replace(username, '')
for url in re.findall(regex_url, tweets):
    tweets = tweets.replace(url, '')
for names in re.findall(regex_names, tweets):
    tweets = tweets.replace(names, '')

If you want to read the first line, use next
with open("englishtweet.txt","r") as infile:
    print next(infile).strip()
    # this prints the first line only, and consumes the first value from the
    # generator so this:
    for line in infile:
        print line.strip()
    # will print every line BUT the first (since the first has been consumed)
I'm also using a context manager here, which automatically closes the file once you exit the with block, so you don't have to remember to call tweets.close(). It also covers the error case: depending on what else you're doing with the file, an exception may be raised that prevents you from ever reaching the .close() statement.
If your file is very small, you could use .readlines:
with open("englishtweet.txt","r") as infile:
    tweets = infile.readlines()
# tweets is now a list, each element is a separate line from the file
print tweets[0] # so element 0 is the first line
for line in tweets[1:]: # the rest of the lines:
    print line.strip()
However, reading a whole file into memory is not really recommended: with some files it can be a huge waste of memory, especially if you only need the first line. There is no reason to read the whole thing into memory in that case.
That said, since it looks like you may be iterating over these lines more than once, maybe readlines IS the best approach.

You almost have it. Just remove the .read() when you originally open the file. Then you can loop through the lines.
tweets = open('englishtweet.txt','r')
for line in tweets:
    print line
tweets.close()
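Putting the steps from the question together (clean each line, then decide happy or sad), a line-by-line version might look like the sketch below. This is only an illustration: the function name clean_and_label is made up, the pattern treats anything starting with '@', '#' or 'http' as noise (my assumption about what "links, names and stuff" means), and the emoticons checked are the ones visible in the sample tweets. Instead of calling cl.train directly, it returns (text, label) pairs so it can be tried without the classifier.

```python
import re

# Assumed noise: '@name' mentions, '#tag' hashtags and 'http...' links.
NOISE = re.compile(r'(?:@|#|http)\S*')

def clean_and_label(lines):
    """Return (cleaned_text, label) pairs for lines containing a smiley."""
    labelled = []
    for line in lines:
        # Remove the noise, then collapse the leftover whitespace.
        text = ' '.join(NOISE.sub('', line).split())
        if ':)' in text or ':-)' in text:
            labelled.append((text, 'happy'))
        elif ':(' in text or ':-(' in text:
            labelled.append((text, 'sad'))
    return labelled
```

Each pair could then be passed on, for example cl.train(text, label) for every (text, label) in the result. Lines without either emoticon are dropped.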

How to replay a list of event consistently

I have a file containing a list of events spaced apart in time. Here is an example:
0, Hello World
0.5, Say Hi
2, Say Bye
I would like to be able to replay this sequence of events. The first column is the delta between two consecutive events (the first starts immediately, the second happens 0.5 s later, the third 2 s after that, ...).
How can I do that on Windows? Is there anything that can ensure the timing is very accurate? The idea is to be as close as what you would get when listening to music: you don't want an audio event to happen close to the right time, but exactly on time.

This can be done easily by using the sleep function from the time module. The exact code should work like this:
import time
# Change data.txt to the name of your file
data_file = open("data.txt", "r")
# Get rid of blank lines (often the last line of the file)
vals = [i for i in data_file.read().split('\n') if i]
data_file.close()
for i in vals:
    i = i.split(',')
    i[1] = i[1][1:]
    time.sleep(float(i[0]))
    print i[1]
This is an imperfect algorithm, but it should give you an idea of how this can be done. We read the file, split it on newlines into a list, then go through each comma-delimited pair, sleeping for the specified number of seconds and then printing the specified string.

You're looking for time.sleep(...) in Python.
Load the file as a list of lines, then sleep before printing each value:
import time
with open("datafile.txt", "r") as infile:
    lines = infile.read().split('\n')
for line in lines:
    if not line.strip():
        continue  # skip blank lines, such as the one after a trailing newline
    wait, response = line.split(',', 1)
    time.sleep(float(wait))
    print response.strip()
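If accumulated drift matters, one option (not from either answer above, just a sketch) is to keep a running target time and sleep only for whatever is left until that target, so the time spent parsing or printing does not add up from event to event. The names parse_events and replay are made up for illustration, and the clock and sleep parameters exist mainly so the timing logic can be tried without actually waiting. time.sleep is only as precise as the operating system's timer, so this reduces drift rather than guaranteeing exact timing.

```python
import time

def parse_events(lines):
    """Turn 'delay, message' lines into (delay_seconds, message) pairs."""
    events = []
    for line in lines:
        if line.strip():  # ignore blank lines
            wait, message = line.split(',', 1)
            events.append((float(wait), message.strip()))
    return events

def replay(events, clock=time.time, sleep=time.sleep, emit=None):
    """Fire each message at its scheduled time; return the messages in order."""
    fired = []
    target = clock()  # absolute time at which the next event is due
    for delay, message in events:
        target += delay
        remaining = target - clock()
        if remaining > 0:
            sleep(remaining)  # sleep only for the time still left
        fired.append(message)
        if emit is not None:
            emit(message)
    return fired
```

Passing a function as emit lets you do something with each message as it fires (write it to a log, trigger a sound, and so on). A higher-resolution clock can be passed as the clock argument if one is available on your Python version.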

Select random group of items from txt file

I'm working on a simple Python game where the computer tries to guess a number you think of. Every time it guesses the right answer, it saves the answer to a txt file. When the program is run again, it will guess the old answers first (if they're in the range the user specifies).
try:
    f = open("OldGuesses.txt", "a")
    r = open("OldGuesses.txt", "r")
except IOError as e:
    f = open("OldGuesses.txt", "w")
    r = open("OldGuesses.txt", "r")
data = r.read()
number5 = random.choice(data)
print number5
When I run that to pull the old answers, it grabs a single character. For example, if I have the numbers 200, 1242, and 1343, with spaces to tell them apart, it will pick either a space or a single digit. Any idea how to grab the full number (like 200) and/or avoid picking spaces?

The r.read() call reads the entire contents of r and returns it as a single string, so random.choice picks one character from that string. What you can do is use a list comprehension in combination with r.readlines(), like this:
data = [int(x) for x in r.readlines()]
which breaks up the file into lines and converts each line to an integer. Note that this assumes the file holds one number per line.
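Since the question says the numbers are separated by spaces, a variant that splits on any whitespace may be closer to what is needed. This is a minimal sketch, not part of the answer above: the function name pick_old_guess and the low/high parameters (standing in for "the range the user specifies") are my own, and it assumes the file contains only whole numbers and whitespace.

```python
import random

def pick_old_guess(text, low, high):
    """Pick a random saved number within [low, high], or None if there is none."""
    # str.split() with no argument splits on spaces and newlines alike.
    numbers = [int(token) for token in text.split()]
    candidates = [n for n in numbers if low <= n <= high]
    if not candidates:
        return None
    return random.choice(candidates)
```

Here text would be the result of r.read(). Returning None when nothing fits lets the game fall back to its normal guessing.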
