Please help me get output from this code. Why is the output of this code NaN? What is my mistake?
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
import math
import datetime as dt
#importing Data
es_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/hbrbcpe.txt'
vs_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/h_vstoxx.txt'
#creating DataFrame
cols=['SX5P','SX5E','SXXP','SXXE','SXXF','SXXA','DK5f','DKXF']
es=pd.read_csv(es_url,index_col=0,parse_dates=True,sep=';',dayfirst=True,header=None,skiprows=4,names=cols)
vs=pd.read_csv(vs_url,index_col=0,header=2,parse_dates=True,sep=',',dayfirst=True)
data=pd.DataFrame({'EUROSTOXX' : es['SX5E'][es.index > dt.datetime(1999,1,1)]},dtype=float)
data=data.join(pd.DataFrame({'VSTOXX' : vs['V2TX'][vs.index > dt.datetime(1999,1,1)]},dtype=float))
data=data.fillna(method='ffill')
rets=(((data/data.shift(1))-1)*100).round(2)
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']
#regression analysis
model = smf.ols('ydat ~ xdat',data=rets).fit()
print(model.summary())
The problem is that when you compute rets, you divide by zero, which produces an inf. Also, shift introduces NaNs, so you have missing values that need to be handled in some way before proceeding to the regression.
Walk through this example using your data and see:
df = data.loc['2016-03-20':'2016-04-01'].copy()
df looks like:
EUROSTOXX VSTOXX
2016-03-21 3048.77 35.6846
2016-03-22 3051.23 35.6846
2016-03-23 3042.42 35.6846
2016-03-24 2986.73 35.6846
2016-03-25 0.00 35.6846
2016-03-28 0.00 35.6846
2016-03-29 3004.87 35.6846
2016-03-30 3044.10 35.6846
2016-03-31 3004.93 35.6846
2016-04-01 2953.28 35.6846
Shifting by 1 and dividing:
df = (((df/df.shift(1))-1)*100).round(2)
Prints out:
EUROSTOXX VSTOXX
2016-03-21 NaN NaN
2016-03-22 0.080688 0.0
2016-03-23 -0.288736 0.0
2016-03-24 -1.830451 0.0
2016-03-25 -100.000000 0.0
2016-03-28 NaN 0.0
2016-03-29 inf 0.0
2016-03-30 1.305547 0.0
2016-03-31 -1.286751 0.0
2016-04-01 -1.718842 0.0
Take-aways: shifting by 1 always creates a NaN in the first row. Dividing 0.00 by 0.00 produces a NaN, and dividing a nonzero value by 0.00 produces an inf.
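Here is a minimal standalone demonstration of those cases (the numbers are made up to mirror the data above):
import pandas as pd

s = pd.Series([2986.73, 0.00, 0.00, 3004.87])
# row 0: NaN (from the shift), row 1: -100.0, row 2: NaN (0/0), row 3: inf (nonzero/0)
print(((s / s.shift(1)) - 1) * 100)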
One possible solution to handle missing values:
...
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']
# handle missing values
messed_up_indices = xdat[xdat.isin([-np.inf, np.inf, np.nan])].index
xdat[messed_up_indices] = xdat[messed_up_indices].replace([-np.inf, np.inf], np.nan)
xdat[messed_up_indices] = xdat[messed_up_indices].fillna(xdat.mean())
ydat[messed_up_indices] = ydat[messed_up_indices].fillna(0.0)
#regression analysis
model = smf.ols('ydat ~ xdat',data=rets, missing='raise').fit()
print(model.summary())
Notice I added the missing='raise' parameter to ols to see what's going on.
End result prints out:
OLS Regression Results
==============================================================================
Dep. Variable: ydat R-squared: 0.259
Model: OLS Adj. R-squared: 0.259
Method: Least Squares F-statistic: 1593.
Date: Wed, 03 Jan 2018 Prob (F-statistic): 5.76e-299
Time: 12:01:14 Log-Likelihood: -13856.
No. Observations: 4554 AIC: 2.772e+04
Df Residuals: 4552 BIC: 2.773e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.1608 0.075 2.139 0.033 0.013 0.308
xdat -1.4209 0.036 -39.912 0.000 -1.491 -1.351
==============================================================================
Omnibus: 4280.114 Durbin-Watson: 2.074
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4021394.925
Skew: -3.446 Prob(JB): 0.00
Kurtosis: 148.415 Cond. No. 2.11
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
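As an aside, if you would rather drop the bad rows than impute them, a more concise sketch (my variation, not part of the fix above) is:
clean = rets.replace([np.inf, -np.inf], np.nan).dropna()
model = smf.ols('VSTOXX ~ EUROSTOXX', data=clean).fit()
print(model.summary())
Dropping discards observations while the fill approach above keeps them all; which is appropriate depends on your analysis.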
I have the DataFrame below with "start_datetime" as the index. The index values are of type pandas._libs.tslib.Timestamp:
col1 col2
start_datetime
2017-12-27 01:50:00 0.000000 0.0
2017-12-27 01:55:00 0.000000 0.0
2017-12-27 02:15:00 0.000000 0.0
2017-12-27 02:20:00 0.000000 0.0
2017-12-27 02:25:00 0.000000 0.0
... ... ...
2018-01-15 21:30:00 0.000000 0.0
2018-01-15 21:35:00 0.000000 0.0
2018-01-15 21:40:00 0.000000 0.0
2018-01-15 21:45:00 0.000000 0.0
2018-01-15 21:50:00 0.000000 0.0
2018-01-15 21:55:00 0.000000 0.0
2018-01-15 22:00:00 0.000000 0.0
I want to slice using the datetime index:
start = pd.to_datetime('2018-01-01-00-00') # class'pandas._libs.tslib.Timestamp'
df = df[start: ]
Below is what I got:
col1 col2
start_datetime
2018-01-02 00:00:00 0.0 0.0
2018-01-02 00:05:00 0.0 0.0
2018-01-02 00:10:00 0.0 0.0
2018-01-02 00:15:00 0.0 0.0
2018-01-02 00:20:00 0.0 0.0
Questions:
Why did it slice at "2018-01-02 00:00:00" instead of "2018-01-01 00:00:00"?
How can I slice to include "2018-01-01 00:00:00" ?
I have tried:
df = df[start: ]
df = df.loc[(df.index >= start)]
I also reset the index and tried df = df.loc[(df.start_datetime >= start)], and even hard-coded df = df["2018-01-01 00:00:00": ].
But none of them sliced at "2018-01-01 00:00:00".
Any ideas?
In my opinion the problem is that 2018-01-01 does not exist in the index. You can check it:
print (df['2018-01-01'])
#return unique days by floor
idx = df.index.floor('d').unique()
#print (idx)
#get datetimes between
print (idx[(idx >= '2017-12-30') & (idx <= '2018-01-02')])
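If you just want to confirm where a slice will actually start, a quick check (a sketch reusing the names from the question) is:
start = pd.to_datetime('2018-01-01-00-00')
print(df.index[df.index >= start].min())
# if this prints 2018-01-02 00:00:00, there is no 2018-01-01 row to include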
I'm new to machine learning and created a logistic model using sklearn, but I can't find any documentation on how to get p-values for my feature variables or for the model. I have checked the linked Stack Overflow question but don't get the required output. Please help. Thanks in advance.
One can use the regressors package for this. The following code is from https://regressors.readthedocs.io/en/latest/usage.html:
import numpy as np
from sklearn import datasets
boston = datasets.load_boston()
which_betas = np.ones(13, dtype=bool)
which_betas[3] = False # Eliminate dummy variable
X = boston.data[:, which_betas]
y = boston.target
from sklearn import linear_model
from regressors import stats
ols = linear_model.LinearRegression()
ols.fit(X, y)
# To calculate the p-values of beta coefficients:
print("coef_pval:\n", stats.coef_pval(ols, X, y))
# to print summary table:
print("\n=========== SUMMARY ===========")
xlabels = boston.feature_names[which_betas]
stats.summary(ols, X, y, xlabels)
Output:
coef_pval:
[2.66897615e-13 4.15972994e-04 1.36473287e-05 4.67064962e-01
1.70032518e-06 0.00000000e+00 7.67610259e-01 1.55431223e-15
1.51691918e-07 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00]
=========== SUMMARY ===========
Residuals:
Min 1Q Median 3Q Max
-26.3743 -1.9207 0.6648 2.8112 13.3794
Coefficients:
Estimate Std. Error t value p value
_intercept 36.925033 4.915647 7.5117 0.000000
CRIM -0.112227 0.031583 -3.5534 0.000416
ZN 0.047025 0.010705 4.3927 0.000014
INDUS 0.040644 0.055844 0.7278 0.467065
NOX -17.396989 3.591927 -4.8434 0.000002
RM 3.845179 0.272990 14.0854 0.000000
AGE 0.002847 0.009629 0.2957 0.767610
DIS -1.485557 0.180530 -8.2289 0.000000
RAD 0.327895 0.061569 5.3257 0.000000
TAX -0.013751 0.001055 -13.0395 0.000000
PTRATIO -0.991733 0.088994 -11.1438 0.000000
B 0.009827 0.001126 8.7256 0.000000
LSTAT -0.534914 0.042128 -12.6973 0.000000
---
R-squared: 0.73547, Adjusted R-squared: 0.72904
F-statistic: 114.23 on 12 features
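If you would rather avoid the extra dependency, the coefficient p-values can be cross-checked with statsmodels. Here is a sketch reusing the X and y defined above:
import statsmodels.api as sm

X_const = sm.add_constant(X)  # statsmodels does not add an intercept on its own
sm_ols = sm.OLS(y, X_const).fit()
print(sm_ols.pvalues)  # intercept first, then one p-value per feature
The values should match the regressors summary, since both compute classical OLS statistics.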
I ran a regression in Stata:
reg y I.ind1990#I.year, nocons r
Then I exported the coefficient vector from Stata using
matrix x = e(b)
esttab matrix(x) using "xx.csv", replace plain
and loaded it in Python and pandas using
df = pd.read_csv('xx.csv', skiprows=1, index_col=[0]).T.dropna()
df.index.name = 'interaction'
df = df.reset_index()
ind1990 and year are numeric. But I have some odd values in my csv (year and ind are manually pulled out of interaction):
interaction y1 ind year
0 0b.ind1990#2001b.year 0.000000 0b 2001b
1 0b.ind1990#2002.year 0.320578 0b 2002
2 0b.ind1990#2003.year 0.304471 0b 2003
3 0b.ind1990#2004.year 0.271429 0b 2004
4 0b.ind1990#2005.year 0.295347 0b 2005
I believe that 0b is how Stata translates missing values aka NIU. But I can't make sense of the other non-numeric values.
Here's what I get for years (there are both b and o as unexpected suffixes):
array(['2001b', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
'2009', '2010', '2011', '2012', '2013', '2014', '2015', '2004o',
'2008o', '2012o', '2003o', '2005o', '2006o', '2007o', '2009o',
'2010o', '2011o', '2013o', '2014o', '2015o', '2002o'], dtype=object)
and for ind1990 (where 0b is apparently NIU, but there are also o suffixes that I can't make sense of):
array(['0b', '10', '11', '12', '20', '31', '32', '40', '41', '42', '50',
'60', '100', '101', '102', '110', '111', '112', '120', '121', '122',
'122o', '130', '130o', '132', '140', '141', '142', '150', '151',
'152', '152o', '160', '161', '162', '171', '172', '180', '181',
'182', '190', '191', '192', '200', '201', '201o', '210', '211',
'220', '220o', '221', '221o', '222', '222o', '230', '231', '232',
'241', '242', '250', '251', '252', '261', '262', '270', '271',
'272o', '272'], dtype=object)
What do the b and o suffixes mean at the end of values of the interaction column?
This isn't an answer, but it won't go well as a comment and it may clarify the question.
The example here isn't reproducible without #FooBar's data. Here is another one that (a) Stata users can reproduce and (b) Python users can, I think, import:
. sysuse auto, clear
(1978 Automobile Data)
. regress mpg i.foreign#i.rep78, nocons r
note: 1.foreign#1b.rep78 identifies no observations in the sample
note: 1.foreign#2.rep78 identifies no observations in the sample
Linear regression Number of obs = 69
F(7, 62) = 364.28
Prob > F = 0.0000
R-squared = 0.9291
Root MSE = 6.1992
-------------------------------------------------------------------------------
| Robust
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign#rep78 |
Domestic#2 | 19.125 1.311239 14.59 0.000 16.50387 21.74613
Domestic#3 | 19 .8139726 23.34 0.000 17.37289 20.62711
Domestic#4 | 18.44444 1.520295 12.13 0.000 15.40542 21.48347
Domestic#5 | 32 1.491914 21.45 0.000 29.01771 34.98229
Foreign#1 | 0 (empty)
Foreign#2 | 0 (empty)
Foreign#3 | 23.33333 1.251522 18.64 0.000 20.83158 25.83509
Foreign#4 | 24.88889 .8995035 27.67 0.000 23.09081 26.68697
Foreign#5 | 26.33333 3.105666 8.48 0.000 20.1252 32.54147
-------------------------------------------------------------------------------
. matrix b = e(b)
. esttab matrix(b) using b.csv, plain
(output written to b.csv)
The file b.csv looks like this:
"","b","","","","","","","","",""
"","0b.foreign#1b.rep78","0b.foreign#2.rep78","0b.foreign#3.rep78","0b.foreign#4.rep78","0b.foreign#5.rep78","1o.foreign#1b.rep78","1o.foreign#2o.rep78","1.foreign#3.rep78","1.foreign#4.rep78","1.foreign#5.rep78"
"y1","0","19.125","19","18.44444","32","0","0","23.33333","24.88889","26.33333"
Stata's notation here is accessible to non-Stata users; see the Stata documentation on factor variables.
I don't use esttab (a user-written Stata program) or Python (that's ignorance, not prejudice), so I can't comment beyond that.
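For what it's worth, in Stata's factor-variable notation a b suffix marks the base (reference) level and an o suffix marks an omitted coefficient, which matches the (empty) cells above. Under that reading, a pandas sketch for recovering the numeric part (column names taken from the question) might be:
# keep the leading digits, dropping any trailing 'b'/'o' marker;
# rows flagged b (base) or o (omitted) carry structural zeros,
# so you may prefer to drop them rather than convert them
df['year_num'] = df['year'].str.extract(r'^(\d+)', expand=False).astype(int)
df['ind_num'] = df['ind'].str.extract(r'^(\d+)', expand=False).astype(int)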
I am trying to get properly structured output into a CSV.
Input:
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525
Code:
import pandas as pd
from datetime import datetime,time
import numpy as np
fn = r'00_Dart.csv'
cols = ['UserID','StartTime','StopTime', 'gps1', 'gps2']
df = pd.read_csv(fn, header=None, names=cols)
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
# building reporting DF: `r`
freq = '1H' # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.datetime(1970,1,1)).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
for i, row in r.iterrows():
# intervals overlap test
# https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
# i've slightly simplified the calculations of m and d
# by getting rid of division by 2,
# because it can be done eliminating common terms
u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
r.ix[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
r['Day'] = pd.to_datetime(r.start, unit='s').dt.weekday_name.str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
#df.to_csv((r[r.LogCount > 0])'example.csv')
#print(r[r.LogCount > 0]) -- This gives the correct count and unique count but I want to write the output in a structure.
print (r['StartTime'], ['EndTime'], ['Day'], ['LogCount'], ['UniqueIDCount'])
Output: this is the output that I am getting, which is not what I am looking for.
(2004-01-05 00:00:00 00:00:00
2004-01-05 01:00:00 01:00:00
2004-01-05 02:00:00 02:00:00
2004-01-05 03:00:00 03:00:00
2004-01-05 04:00:00 04:00:00
2004-01-05 05:00:00 05:00:00
2004-01-05 06:00:00 06:00:00
2004-01-05 07:00:00 07:00:00
2004-01-05 08:00:00 08:00:00
2004-01-05 09:00:00 09:00:00
And the expected output headers are:
StartTime, EndTime, Day, Count, UniqueIDCount
How do I structure the write statement in the code to get the above-mentioned columns in my output CSV?
Try this:
rout = r[['StartTime', 'EndTime', 'Day', 'LogCount', 'UniqueIDCount']]
print(rout)
rout.to_csv('results.csv', index=False)
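If the header should read Count rather than LogCount, as in the expected headers above, rename before writing:
rout = rout.rename(columns={'LogCount': 'Count'})
rout.to_csv('results.csv', index=False)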
I need some help extracting an integer from strings created by Beautiful Soup. This is the code I have:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
#San Pedro
url = "http://www.ndbc.noaa.gov/mobile/station.php?station=46222"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
g_data = soup.find_all("body")
for item in g_data:
    # Weather Conditions
    print(item.contents[6])
    # Wave Summary
    print(item.contents[9])
This gives the following output:
9:19 pm PST 01-Dec-2015
Seas: 3.9 ft (1.2 m)
Peak Period: 15 sec
Water Temp: 65 °F (18 °C)
9:19 pm PST 01-Dec-2015
Swell: 3.0 ft (0.9 m)
Period: 15.4 sec
Direction: W
Wind Wave: 2.3 ft (0.7 m)
Period: 9.9 sec
Direction: W
HTML Version
<p>9:19 pm PST 01-Dec-2015<br>
<b>Seas:</b> 3.9 ft (1.2 m)<br>
<b>Peak Period:</b> 15 sec<br>
<b>Water Temp:</b> 65 °F (18 °C)<br>
</br></br></br></br></p>
<p>
9:19 pm PST 01-Dec-2015<br>
<b>Swell:</b> 3.0 ft (0.9 m)<br>
<b>Period:</b> 15.4 sec<br>
<b>Direction:</b> W<br>
<b>Wind Wave:</b> 2.3 ft (0.7 m)<br>
<b>Period:</b> 9.9 sec<br>
<b>Direction:</b> W<br>
</br></br></br></br></br></br></br></p>
I need to get the individual value of each category, i.e. end up with something like swell = 3.0.
Thanks
Not sure this is the best solution, but something like this should do:
def extractInt(string):
    stringToReturn = ""
    for x in string:
        if x.isdigit() or x == " ":
            stringToReturn = stringToReturn + x
    return stringToReturn
Just call this function on the string you want.
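For example (note the decimal point is dropped, because only digits and spaces pass the test, so 3.0 comes through as 30):
print(extractInt("Swell: 3.0 ft (0.9 m)"))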
If you want to have numbers with floating points, you want floats instead of integers. Note that isdigit() returns False for a string like "3.0" because the decimal point is not a digit, so wrapping float() in a try/except is a safer way to get a number from a string:
try:
    your_float = float(your_string)
except ValueError:
    pass  # the string was not a number
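If the goal is the labeled values themselves (e.g. swell = 3.0), a regex sketch may be closer to what the question asks for; the pattern here is my assumption, not from the answers above:
import re

text = "Swell: 3.0 ft (0.9 m)"
m = re.search(r'Swell:\s*([\d.]+)', text)  # capture the number after the label
if m:
    swell = float(m.group(1))
    print(swell)  # 3.0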