Extracting integers from strings - python-2.7

I need some help extracting an integer from strings created by Beautiful Soup. This is the code I have:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

# San Pedro
url = "http://www.ndbc.noaa.gov/mobile/station.php?station=46222"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
g_data = soup.find_all("body")
for item in g_data:
    # Weather Conditions
    print item.contents[6]
    # Wave Summary
    print item.contents[9]
I am given the following output.
9:19 pm PST 01-Dec-2015
Seas: 3.9 ft (1.2 m)
Peak Period: 15 sec
Water Temp: 65 °F (18 °C)
9:19 pm PST 01-Dec-2015
Swell: 3.0 ft (0.9 m)
Period: 15.4 sec
Direction: W
Wind Wave: 2.3 ft (0.7 m)
Period: 9.9 sec
Direction: W
HTML Version
<p>9:19 pm PST 01-Dec-2015<br>
<b>Seas:</b> 3.9 ft (1.2 m)<br>
<b>Peak Period:</b> 15 sec<br>
<b>Water Temp:</b> 65 °F (18 °C)<br>
</br></br></br></br></p>
<p>
9:19 pm PST 01-Dec-2015<br>
<b>Swell:</b> 3.0 ft (0.9 m)<br>
<b>Period:</b> 15.4 sec<br>
<b>Direction:</b> W<br>
<b>Wind Wave:</b> 2.3 ft (0.7 m)<br>
<b>Period:</b> 9.9 sec<br>
<b>Direction:</b> W<br>
</br></br></br></br></br></br></br></p>
I need to get the individual value for each category, i.e. end up with a value such as swell = 3.0.
Thanks

Not sure this is the best solution, but something like this should do:
def extractInt(string):
    stringToReturn = ""
    for x in string:
        if x.isdigit() or x == " ":
            stringToReturn = stringToReturn + x
    return stringToReturn
Just call this function on the string you want.
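For example, on one of the scraped lines (a rough illustration; note that the function keeps spaces as well as digits, so you may still want to pass the result through int()):
>>> extractInt("Peak Period: 15 sec")
'  15 '
>>> int(extractInt("Peak Period: 15 sec"))
15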

If you want numbers with a decimal point, you want floats instead of integers. To simply get a number from a string, do:
if your_string.isdigit():
    your_float = float(your_string)
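If the goal is to end up with a float such as swell = 3.0 from a line like "Swell: 3.0 ft (0.9 m)", a regular expression is another option. This is only a sketch assuming the line format shown in the question, and extract_float is a made-up helper name:
import re

def extract_float(text):
    # return the first decimal number found in text as a float, or None
    match = re.search(r'-?\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None

swell = extract_float("Swell: 3.0 ft (0.9 m)")   # 3.0
period = extract_float("Period: 15.4 sec")       # 15.4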

Related

Regression analysis using statsmodels

Please help me get output from this code. Why is the output of this code NaN? What is my mistake?
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
import math
import datetime as dt
#importing Data
es_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/hbrbcpe.txt'
vs_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/h_vstoxx.txt'
#creating DataFrame
cols=['SX5P','SX5E','SXXP','SXXE','SXXF','SXXA','DK5f','DKXF']
es=pd.read_csv(es_url,index_col=0,parse_dates=True,sep=';',dayfirst=True,header=None,skiprows=4,names=cols)
vs=pd.read_csv(vs_url,index_col=0,header=2,parse_dates=True,sep=',',dayfirst=True)
data=pd.DataFrame({'EUROSTOXX' : es['SX5E'][es.index > dt.datetime(1999,1,1)]},dtype=float)
data=data.join(pd.DataFrame({'VSTOXX' : vs['V2TX'][vs.index > dt.datetime(1999,1,1)]},dtype=float))
data=data.fillna(method='ffill')
rets=(((data/data.shift(1))-1)*100).round(2)
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']
#regression analysis
model = smf.ols('ydat ~ xdat',data=rets).fit()
print model.summary()
The problem is that when you compute rets, you divide by zero, which produces inf. Also, shift introduces NaNs, so you have missing values that need to be handled in some way before proceeding to the regression.
Walk through this example using your data and see:
df = data.loc['2016-03-20':'2016-04-01'].copy()
df looks like:
EUROSTOXX VSTOXX
2016-03-21 3048.77 35.6846
2016-03-22 3051.23 35.6846
2016-03-23 3042.42 35.6846
2016-03-24 2986.73 35.6846
2016-03-25 0.00 35.6846
2016-03-28 0.00 35.6846
2016-03-29 3004.87 35.6846
2016-03-30 3044.10 35.6846
2016-03-31 3004.93 35.6846
2016-04-01 2953.28 35.6846
Shifting by 1 and dividing:
df = (((df/df.shift(1))-1)*100).round(2)
Prints out:
EUROSTOXX VSTOXX
2016-03-21 NaN NaN
2016-03-22 0.080688 0.0
2016-03-23 -0.288736 0.0
2016-03-24 -1.830451 0.0
2016-03-25 -100.000000 0.0
2016-03-28 NaN 0.0
2016-03-29 inf 0.0
2016-03-30 1.305547 0.0
2016-03-31 -1.286751 0.0
2016-04-01 -1.718842 0.0
Take-aways: shifting by 1 always creates a NaN in the first row. Dividing 0.00 by 0.00 produces a NaN, and dividing a non-zero value by 0.00 produces an inf.
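A tiny self-contained illustration of both effects (not your data, just a sketch):
import pandas as pd

s = pd.Series([1.0, 0.0, 0.0, 2.0])
print(s / s.shift(1))
# 0    NaN   <- shift always puts a NaN in the first row
# 1    0.0
# 2    NaN   <- 0.0 / 0.0 gives NaN
# 3    inf   <- 2.0 / 0.0 gives inf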
One possible solution to handle missing values:
...
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']
# handle missing values
messed_up_indices = xdat[xdat.isin([-np.inf, np.inf, np.nan]) == True].index
xdat[messed_up_indices] = xdat[messed_up_indices].replace([-np.inf, np.inf], np.nan)
xdat[messed_up_indices] = xdat[messed_up_indices].fillna(xdat.mean())
ydat[messed_up_indices] = ydat[messed_up_indices].fillna(0.0)
#regression analysis
model = smf.ols('ydat ~ xdat',data=rets, missing='raise').fit()
print(model.summary())
Notice I added the missing='raise' parameter to ols to see what's going on.
End result prints out:
OLS Regression Results
==============================================================================
Dep. Variable: ydat R-squared: 0.259
Model: OLS Adj. R-squared: 0.259
Method: Least Squares F-statistic: 1593.
Date: Wed, 03 Jan 2018 Prob (F-statistic): 5.76e-299
Time: 12:01:14 Log-Likelihood: -13856.
No. Observations: 4554 AIC: 2.772e+04
Df Residuals: 4552 BIC: 2.773e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.1608 0.075 2.139 0.033 0.013 0.308
xdat -1.4209 0.036 -39.912 0.000 -1.491 -1.351
==============================================================================
Omnibus: 4280.114 Durbin-Watson: 2.074
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4021394.925
Skew: -3.446 Prob(JB): 0.00
Kurtosis: 148.415 Cond. No. 2.11
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
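As an aside, if you would rather let statsmodels drop the bad rows instead of imputing them, the formula API also accepts missing='drop'. This is an alternative sketch, not part of the answer above; note that inf values still have to be converted to NaN first, because only NaN counts as missing:
import numpy as np

# work on the named columns of rets so the cleaned frame is actually used
clean = rets.replace([np.inf, -np.inf], np.nan)
model = smf.ols('VSTOXX ~ EUROSTOXX', data=clean, missing='drop').fit()
print(model.summary())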

Python: How to calculate tf-idf for a large data set

I have the following data frame df, which I converted from an SFrame:
URI name text
0 <http://dbpedia.org/resource/Digby_M... Digby Morrell digby morrell born 10 october 1979 i...
1 <http://dbpedia.org/resource/Alfred_... Alfred J. Lewy alfred j lewy aka sandy lewy graduat...
2 <http://dbpedia.org/resource/Harpdog... Harpdog Brown harpdog brown is a singer and harmon...
3 <http://dbpedia.org/resource/Franz_R... Franz Rottensteiner franz rottensteiner born in waidmann...
4 <http://dbpedia.org/resource/G-Enka> G-Enka henry krvits born 30 december 1974 i...
I have done the following:
from __future__ import division  # ensures true division in tf() on Python 2; harmless on Python 3
import math

from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

bloblist = []
for i in range(0, df.shape[0]):
    bloblist.append(tb(df.iloc[i, 2]))

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
But this is taking a lot of time as there are 59000 documents.
Is there a better way to do it?
I am a bit confused about this subject, but I found a few solutions on the internet that use Spark. Have a look here:
https://www.linkedin.com/pulse/understanding-tf-idf-first-principle-computation-apache-asimadi
On the other hand, I tried the following method and did not get bad results. Maybe you want to try it:
I have a word list. This list contains each word and its count.
I found the average of these word counts.
I selected a lower limit and an upper limit based on the average value
(e.g. lower bound = average / 2 and upper bound = average * 5).
Then I created a new word list keeping only the words within those bounds (a sketch of this filtering follows below).
With these settings I got this result:
Before normalization, word vector length: 11880
Mean: 19, lower bound: 9, upper bound: 95
After normalization, word vector length: 1595
The cosine similarity results were also better.
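A minimal sketch of that frequency-based filtering, assuming you already have a dict mapping each word to its count (the names below are illustrative, not from the answer):
def filter_by_frequency(word_counts, low_factor=0.5, high_factor=5.0):
    # keep only words whose count lies between mean*low_factor and mean*high_factor
    mean = sum(word_counts.values()) / float(len(word_counts))
    lower, upper = mean * low_factor, mean * high_factor
    return {w: c for w, c in word_counts.items() if lower <= c <= upper}

filtered = filter_by_frequency({'the': 500, 'surfing': 12, 'xyzzy': 1})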

Structure the output

I am trying to get properly structured output into a CSV.
Input:
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525
Code:
import pandas as pd
from datetime import datetime,time
import numpy as np
fn = r'00_Dart.csv'
cols = ['UserID','StartTime','StopTime', 'gps1', 'gps2']
df = pd.read_csv(fn, header=None, names=cols)
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
# building reporting DF: `r`
freq = '1H' # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.datetime(1970,1,1)).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # i've slightly simplified the calculations of m and d
    # by getting rid of division by 2,
    # because it can be done eliminating common terms
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    r.ix[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
r['Day'] = pd.to_datetime(r.start, unit='s').dt.weekday_name.str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
#df.to_csv((r[r.LogCount > 0])'example.csv')
#print(r[r.LogCount > 0]) -- This gives the correct count and unique count but I want to write the output in a structure.
print (r['StartTime'], ['EndTime'], ['Day'], ['LogCount'], ['UniqueIDCount'])
Output: this is the output I am getting, which is not what I am looking for.
(2004-01-05 00:00:00 00:00:00
2004-01-05 01:00:00 01:00:00
2004-01-05 02:00:00 02:00:00
2004-01-05 03:00:00 03:00:00
2004-01-05 04:00:00 04:00:00
2004-01-05 05:00:00 05:00:00
2004-01-05 06:00:00 06:00:00
2004-01-05 07:00:00 07:00:00
2004-01-05 08:00:00 08:00:00
2004-01-05 09:00:00 09:00:00
The expected output headers are:
StartTime, EndTime, Day, Count, UniqueIDCount
How do I structure the write statement in the code to get the above-mentioned columns in my output CSV?
Try this:
rout = r[['StartTime', 'EndTime', 'Day', 'LogCount', 'UniqueIDCount']]
print rout
rout.to_csv('results.csv', index=False)
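If you also want the header to read Count rather than LogCount, to match the expected headers in the question, rename the column before writing (a small addition, not part of the original answer):
rout = rout.rename(columns={'LogCount': 'Count'})
rout.to_csv('results.csv', index=False)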

Basic pyephem code

I have some very simple code to get the longitude of the Sun, but when I compare the output to Astrolog and Astrodienst it is incorrect; there is a 13 minute difference. I have not added an Observer, as I think the default is midnight GMT (which is what I want). What am I doing wrong?
import ephem
start = ephem.date('2015/01/01')
end = ephem.date('2015/12/31')
f2 = open("Sun", 'w')
while start <= end:
    sun = ephem.Sun(start)
    ecl = ephem.Ecliptic(sun)
    f2.write(str(ephem.date(start)) + ' ' + str(ecl.lon) + '\n')
    start += 1
f2.close()
Example of results for 2015/12/30:
code - 2015/12/30 00:00:00 277:43:36.6
Astrodienst - 7°56'39 Cap
Thanks
The reason for the 13 minute difference is the epoch setting. When I added
sun = ephem.Sun(start, epoch = start)
the results were the same as the Swiss Ephemeris.
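For completeness, here is how the whole loop might look with the epoch-of-date fix applied through compute(), which is the documented way to pass an epoch (a sketch based on the code above):
import ephem

start = ephem.date('2015/01/01')
end = ephem.date('2015/12/31')
f2 = open("Sun", 'w')
while start <= end:
    sun = ephem.Sun()
    # epoch=start requests coordinates in the equinox of date instead of the default J2000
    sun.compute(start, epoch=start)
    ecl = ephem.Ecliptic(sun)
    f2.write(str(ephem.date(start)) + ' ' + str(ecl.lon) + '\n')
    start = ephem.date(start + 1)
f2.close()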

Converting a pandas DataFrame column containing negative strings into float

I have a pandas DataFrame df that contains negative number strings and I would like to convert them to float:
NY_resitor1 NY_resitor2 SF_type SF_resitor2
45 "-36" Resis 40
47 "36" curr 34
. . . .
49 "39" curr 39
45 "-11" curr 12
12 "-200" Resis 45
This is the code I wrote
df["NY_resitor2 "]=df["NY_resitor2 "].astype(float)
but I get the error:
ValueError: could not convert string to float: "-32"
What is the problem?
I think this might be a case of having a strange unicode version of "-" somewhere in your string data. For example, this should work:
>>> import pandas as pd
>>> ser = pd.Series(['-36', '36'])
>>> ser.astype(float)
0 -36
1 36
dtype: float64
But this doesn't, because I've replaced the standard minus sign with a U+2212 minus sign:
>>> ser2 = pd.Series(['−32', '36'])
>>> ser2.astype(float)
...
ValueError: could not convert string to float: '−32'
You could address this by specifically getting rid of the offending characters, using str.replace():
>>> ser2.str.replace('−', '-').astype(float)
0 -32
1 36
dtype: float64
If that's not the issue, then I don't know what is!
Edit: another possibility is that your strings could have quotes within them, e.g.
>>> ser3 = pd.Series(['"-36"', '"36"'])
>>> ser3.astype(float)
...
ValueError: could not convert string to float: '"-36"'
In this case, you need to strip these out first:
>>> ser3.str.replace('"', '').astype(float)
0 -36
1 36
dtype: float64
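If both issues are present at once (surrounding quotes and a non-ASCII minus sign), the replacements can be chained before the conversion. A sketch using the column name from the question:
df["NY_resitor2"] = (df["NY_resitor2"]
                     .str.replace('"', '')
                     .str.replace(u'\u2212', '-')   # U+2212 MINUS SIGN
                     .astype(float))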