overuse of lists and string formatting - python-2.7

This works, but is very un-Pythonic. I am sure I am overusing string formatting and lists. Can BeautifulSoup do this natively?
from bs4 import BeautifulSoup
import urllib2

xmlurl = "http://forecast.weather.gov/MapClick.php?lat=33.37110&lon=-104.529&unit=0&lg=english&FcstType=dwml"

def get_forecast():
    soup = BeautifulSoup(urllib2.urlopen(xmlurl))
    temps = soup.find_all("temperature")
    maxtemps = str(temps[0])
    maxlist = maxtemps.split('\n')
    maxvalue = str(maxlist[2]).lstrip()
    maxvalue = maxvalue.replace('<value>', '')
    maxvalue = maxvalue.replace('</value>', '')
    mintemps = str(temps[1])
    minlist = mintemps.split('\n')
    minvalue = str(minlist[2]).lstrip()
    minvalue = minvalue.replace('<value>', '')
    minvalue = minvalue.replace('</value>', '')
    print maxvalue
    print minvalue

if __name__ == '__main__':
    get_forecast()
temps comes back addressable as a list:
[<temperature time-layout="k-p24h-n7-1" type="maximum" units="Fahrenheit">
<name>Daily Maximum Temperature</name>
<value>65</value>
<value>75</value>
<value>88</value>
<value>92</value>
<value>92</value>
<value>89</value>
<value>83</value>
</temperature>, <temperature time-layout="k-p24h-n6-2" type="minimum" units="Fahrenheit">
<name>Daily Minimum Temperature</name>
<value>38</value>
<value>47</value>
<value>53</value>
<value>55</value>
<value>56</value>
<value>56</value>
</temperature>, <temperature time-layout="k-p1h-n1-1" type="apparent" units="Fahrenheit"> <value>53</value> </temperature>]
I then proceed to manipulate it (poorly) until I beat it into submission....
I have read through so many pages of documentation on Python and BeautifulSoup that I can't see straight. I'm sure BS4 can probably do this, but I haven't messed with XML enough to get the syntax right.
All I want is the first Daily Maximum Temperature (65) and the first Minimum Temperature (38).

You are not so much overusing lists as overthinking the approach. See the following code:
Code:
from bs4 import BeautifulSoup as bsoup
xml = """[<temperature time-layout="k-p24h-n7-1" type="maximum" units="Fahrenheit">
<name>Daily Maximum Temperature</name>
<value>65</value>
<value>75</value>
<value>88</value>
<value>92</value>
<value>92</value>
<value>89</value>
<value>83</value>
</temperature>, <temperature time-layout="k-p24h-n6-2" type="minimum" units="Fahrenheit">
<name>Daily Minimum Temperature</name>
<value>38</value>
<value>47</value>
<value>53</value>
<value>55</value>
<value>56</value>
<value>56</value>
</temperature>, <temperature time-layout="k-p1h-n1-1" type="apparent" units="Fahrenheit"> <value>53</value> </temperature>]"""
soup = bsoup(xml)
temps = soup.find_all("temperature")
max_temp = temps[0].find_all("value")[0].get_text()
print "Max temp: ", max_temp
min_temp = temps[1].find_all("value")[0].get_text()
print "Min temp: ", min_temp
Result:
Max temp: 65
Min temp: 38
[Finished in 0.6s]
You said you just want the first max and min temperatures, right? The way this works is that we create a soup first. Next, we search the soup for the temperature tags; the feed contains three (maximum, minimum, and apparent), and the first two are the ones we need.
The final step -- extracting the first value of each -- is the same for both. We get the maximum temperature tag with temps[0], find all the elements with the value tag, and take the first one with the [0] index; get_text() returns the inner text of that element. To get the first minimum temperature, we just change temps[0] to temps[1] to access the second temperature tag.
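As a possible shortcut (a sketch, not tested against the live feed): BeautifulSoup can also filter on tag attributes directly, so you can skip the indexing and ask for the maximum and minimum tags by their type attribute:
# Assuming the same soup as above; bs4's find() accepts attribute filters.
max_temp = soup.find("temperature", type="maximum").find("value").get_text()
min_temp = soup.find("temperature", type="minimum").find("value").get_text()
This relies on the type="maximum" and type="minimum" attributes visible in your dump, so it keeps working even if the order of the temperature tags changes.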
Let us know if this helps.

Related

How to put colons in between elements of a big number using Python

My question is in my title. I want to put colons into the number 2034820 so that it looks like 2:03:48:20.
This is my time data in HHMMSSMS format, i.e. hours, minutes, seconds, and milliseconds. I want to plot other data against this time format, with the data on the y-axis and the time in the given format on the x-axis.
import numpy
import matplotlib.pyplot as plt

data = numpy.genfromtxt('inputfile.dat')
fig = plt.figure()
ax1 = plt.subplot(111)
sat1 = ax1.plot(data[:, 1], 'b', linewidth=1, label='SVID-127')
sat2 = ax1.plot(data[:, 2], 'm-', linewidth=1, label='SVID-128')
Any help is highly appreciated.
Thanks
You can parse the time with datetime.strptime and then re-format it:
from datetime import datetime
tme = datetime.strptime('{:08d}'.format(2034820), '%H%M%S%f').time()
strg = '{0:%H:%M:%S:%f}'.format(tme)
print(strg[:-4]) # cut the trailing '0000'
# 02:03:48:20
This assumes your input is an integer, which is converted to a zero-padded string of length 8 with '{:08d}'.format(2034820); if the data comes as a string, you need to convert it to an int first: '{:08d}'.format(int('2034820')).
From your comments, you seem to be getting the number of seconds that have passed since midnight. For those you could do this:
from datetime import time

def convert(timefloat):
    hours, rest = divmod(timefloat, 3600)
    mins, rest = divmod(rest, 60)
    secs, rest = divmod(rest, 1)
    microsecs = int(10**6 * rest)
    tme = time(int(hours), int(mins), int(secs), microsecs)
    return '{0:%H:%M:%S:%f}'.format(tme)[:-4]
which gives for your test data:
for d in data:
    print(convert(d))
#23:59:59:58
#23:59:59:80
#23:59:59:99
#00:00:00:20
#00:00:00:40
#00:00:00:60
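To address the plotting part of the question, here is one hedged sketch (assuming matplotlib, and assuming the first column of inputfile.dat holds the raw time values): convert each time with convert() and use the resulting strings as x-tick labels:
import numpy
import matplotlib.pyplot as plt

data = numpy.genfromtxt('inputfile.dat')
labels = [convert(t) for t in data[:, 0]]  # assumes column 0 holds the times

fig, ax = plt.subplots()
ax.plot(data[:, 1], 'b', linewidth=1, label='SVID-127')
ax.plot(data[:, 2], 'm-', linewidth=1, label='SVID-128')
# Label only every nth tick so the axis stays readable.
step = max(1, len(labels) // 10)
ax.set_xticks(range(0, len(labels), step))
ax.set_xticklabels(labels[::step], rotation=45)
ax.legend()
plt.show()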

New to Python - trying to choose individual columns from transposed matrix

So presently my code is as follows:
table = []
with open("harrytest.csv") as f:
    for line in f:
        data = line.split(",")
        table.append(data)
transposed = [[table[j][i] for j in range(len(table))] for i in range(len(table[0]))]
openings = transposed[1][1:-1]
openings = [float(i) for i in openings]
mean = sum(openings) / len(openings)
print mean
minimum = min(openings)
print minimum
maximum = max(openings)
print maximum
range1 = maximum - minimum
print range1
This only prints one of the 7 columns for me, and it also leaves out the bottom line. We are not allowed to use the csv module, numpy, or pandas; the only modules allowed are os, sys, math, and datetime.
How do I write the code so as to get the median, first, and last values for any column?
Change this line:
openings = transposed[1][1:-1]
to this:
openings = transposed[1][1:]
and the last row should appear. Your calculations for mean, min, max, and range look correct.
For the median you have to sort the values and select the single middle element, or the average of the two middle elements. The first and last elements are just row[0] and row[-1].
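A minimal sketch of that logic (assuming the column has already been converted to floats, as with openings above):
def median(values):
    # Sort a copy, then take the middle element, or the
    # average of the two middle elements for even lengths.
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0

print median(openings)
print openings[0], openings[-1]  # first and last values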

Python: How to calculate tf-idf for a large data set

I have the following data frame df, which I converted from an SFrame:
URI name text
0 <http://dbpedia.org/resource/Digby_M... Digby Morrell digby morrell born 10 october 1979 i...
1 <http://dbpedia.org/resource/Alfred_... Alfred J. Lewy alfred j lewy aka sandy lewy graduat...
2 <http://dbpedia.org/resource/Harpdog... Harpdog Brown harpdog brown is a singer and harmon...
3 <http://dbpedia.org/resource/Franz_R... Franz Rottensteiner franz rottensteiner born in waidmann...
4 <http://dbpedia.org/resource/G-Enka> G-Enka henry krvits born 30 december 1974 i...
I have done the following:
from textblob import TextBlob as tb
import math

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

bloblist = []
for i in range(0, df.shape[0]):
    bloblist.append(tb(df.iloc[i, 2]))

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
But this is taking a lot of time as there are 59000 documents.
Is there a better way to do it?
I am a bit confused about this subject myself, but I found a few solutions on the internet that use Spark. You can have a look here:
https://www.linkedin.com/pulse/understanding-tf-idf-first-principle-computation-apache-asimadi
On the other hand, I tried the following method and the results were not bad. Maybe you want to try it:
I have a word list. This list contains each word and its count.
I found the average of these word counts.
I selected a lower limit and an upper limit based on the average value
(e.g. lower bound = average / 2 and upper bound = average * 5).
Then I created a new word list using the upper and lower bounds (a sketch of this step follows below).
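A minimal sketch of that pruning step (the function name and factors are my own illustration, assuming the counts live in a plain dict mapping word to count):
def prune_by_average(counts, low_factor=0.5, high_factor=5.0):
    # counts: dict mapping word -> occurrence count.
    average = sum(counts.values()) / float(len(counts))
    lower, upper = average * low_factor, average * high_factor
    # Keep only the words whose counts fall inside the band around the mean.
    return {w: c for w, c in counts.items() if lower <= c <= upper}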
With this I got these results:
Before normalization word vector length : 11880
Mean : 19, lower bound : 9, upper bound : 95
After normalization word vector length : 1595
The cosine similarity results were also better.

Need help in improving the speed of my code for duplicate columns removal in Python

I have written code that takes a text file as input and prints only the variants which repeat more than once. By variants I mean chr positions in the text file.
The input file looks like this:
chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254
As you can see, rows 2 and 3 repeat. I'm just taking the first three columns and checking whether they are the same. Here, chr1 1049083 1049083 appears in both row 2 and row 3, so I print out that there is one duplicate, along with its position.
I have written the code below. Though it does what I want, it's quite slow: it takes about 5 minutes to run on a file which has 700,000 rows. I wanted to know if there is a way to speed things up.
Thanks!
#!/usr/bin/env python
""" takes in an input file and
prints out only the variants that occur more than once """

import shlex
import collections

rows = open('variants.txt', 'r').read().split("\n")
# removing the header and storing it in a new variable
header = rows.pop()

indices = []
for row in rows:
    var = shlex.split(row)
    indices.append("_".join(var[0:3]))

dup_list = []
ind_tuple = collections.Counter(indices).items()
for x, y in ind_tuple:
    if y > 1:
        dup_list.append(x)

print dup_list
print len(dup_list)
Note: in this case the entire row 2 is a duplicate of row 3, but that is not necessarily always the case. Duplicated chr positions (the first three columns) are what I'm looking for.
EDIT:
Edited the code as per the suggestion of damienfrancois. Below is my new code:
import shlex

f = open('variants.txt', 'r')
indices = {}
for line in f:
    row = line.rstrip()
    var = shlex.split(row)
    index = "_".join(var[0:3])
    if indices.has_key(index):
        indices[index] = indices[index] + 1
    else:
        indices[index] = 1

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
I used time to see how long both versions of the code take.
My original code:
time run remove_dup.py
14428
CPU times: user 181.75 s, sys: 2.46 s, total: 184.20 s
Wall time: 209.31 s
Code after modification:
time run remove_dup2.py
14428
CPU times: user 177.99 s, sys: 2.17 s, total: 180.16 s
Wall time: 222.76 s
I don't see any significant improvement in the time.
Some suggestions:
Do not read the whole file at once; read it line by line and process it on the fly, and you'll save on memory operations.
Let indices be a defaultdict and increment the value at key "_".join(var[0:3]); this saves the costly (guessing here, you should use a profiler) collections.Counter(indices).items() step. See the sketch after this list.
Try PyPy or a Python compiler.
Split your data into as many subsets as your computer has cores, apply the program to each subset in parallel, then merge the results.
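A minimal sketch of the defaultdict idea (untested; it also swaps shlex.split for the much cheaper str.split, assuming the columns are plain whitespace-separated fields):
from collections import defaultdict

indices = defaultdict(int)
with open('variants.txt') as f:
    for line in f:
        # str.split() on whitespace is far cheaper than shlex.split()
        var = line.split()
        indices["_".join(var[0:3])] += 1

dup_list = [k for k, v in indices.items() if v > 1]
print len(dup_list)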
HTH
A big time sink is probably the if..has_key() portion of the code. In my experience, try-except is a lot faster...
f = open('variants.txt', 'r')
indices = {}
for line in f:
    var = line.split()
    index = "_".join(var[0:3])
    try:
        indices[index] += 1
    except KeyError:
        indices[index] = 1
f.close()

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
Another option would be to replace the four try-except lines with:
indices[index] = 1 + indices.get(index, 0)
This approach only tells you how many of the lines are duplicated, not how many times each is repeated. (So if one line is duplicated 3x, it will still count as one...)
If you are only trying to count the duplicates, and not delete or note them, you can tally the lines of the file as you go and compare that to the length of the indices dictionary; the difference is the number of duplicate lines (instead of looping back through and re-counting). This might save a little time, but gives a different answer:
#!/usr/bin/env python
f = open('variants.txt', 'r')
indices = {}
total_len = 0
for line in f:
    total_len += 1
    var = line.split()
    index = "_".join(var[0:3])
    indices[index] = 1 + indices.get(index, 0)
f.close()
print "Number of duplicated lines:", total_len - len(indices.keys())
I'd be curious to hear what your benchmarks are for code that does not include the has_key() test...

Selecting elements in numpy array using regular expressions

One may select elements in numpy arrays as follows:
import numpy as np

a = np.random.rand(100)
sel = a > 0.5  # select elements that are greater than 0.5
a[sel] = 0  # do something with the selection

b = np.array(list('abc abc abc'))
b[b == 'a'] = 'A'  # convert all the a's to A's
This property is used by the np.where function to retrieve indices:
indices = np.where(a>0.9)
What I would like is to be able to use regular expressions in such element selection. For example, to select the elements of b above that match the [Ab] regexp, I need to write the following code:
import re

regexp = '[Ab]'
selection = np.array([bool(re.search(regexp, element)) for element in b])
This looks too verbose to me. Is there any shorter, more elegant way to do this?
There's some setup involved here, but unless numpy has some kind of direct support for regular expressions that I don't know about, this is the most "numpythonic" solution. It tries to make iteration over the array more efficient than standard Python iteration.
import numpy as np
import re
r = re.compile('[Ab]')
vmatch = np.vectorize(lambda x: bool(r.match(x)))
A = np.array(list('abc abc abc'))
sel = vmatch(A)
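For completeness, the resulting mask can be used exactly like the comparison-based selections above (the 'B' replacement is just an illustration):
# sel is a boolean mask aligned with A, so it can be used
# exactly like the comparison-based selections above.
A[sel] = 'B'  # replaces every 'b' (the only [Ab] matches here) with 'B'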