Calculate P-value in Sklearn using python? - python-2.7

I'm new to machine learning and created a logistic regression model using sklearn, but I can't find any documentation on how to get p-values for my feature variables or for the model as a whole. I have checked the Stack Overflow link, but it didn't give the required output. Please help. Thanks in advance.

You can use the regressors package for this. The following code is from: https://regressors.readthedocs.io/en/latest/usage.html
import numpy as np
from sklearn import datasets, linear_model
from regressors import stats

boston = datasets.load_boston()
which_betas = np.ones(13, dtype=bool)
which_betas[3] = False  # eliminate CHAS, the dummy variable
X = boston.data[:, which_betas]
y = boston.target

ols = linear_model.LinearRegression()
ols.fit(X, y)

# To calculate the p-values of the beta coefficients:
print("coef_pval:\n", stats.coef_pval(ols, X, y))

# To print the summary table:
print("\n=========== SUMMARY ===========")
xlabels = boston.feature_names[which_betas]
stats.summary(ols, X, y, xlabels)
Output:
coef_pval:
[2.66897615e-13 4.15972994e-04 1.36473287e-05 4.67064962e-01
1.70032518e-06 0.00000000e+00 7.67610259e-01 1.55431223e-15
1.51691918e-07 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00]
=========== SUMMARY ===========
Residuals:
Min 1Q Median 3Q Max
-26.3743 -1.9207 0.6648 2.8112 13.3794
Coefficients:
Estimate Std. Error t value p value
_intercept 36.925033 4.915647 7.5117 0.000000
CRIM -0.112227 0.031583 -3.5534 0.000416
ZN 0.047025 0.010705 4.3927 0.000014
INDUS 0.040644 0.055844 0.7278 0.467065
NOX -17.396989 3.591927 -4.8434 0.000002
RM 3.845179 0.272990 14.0854 0.000000
AGE 0.002847 0.009629 0.2957 0.767610
DIS -1.485557 0.180530 -8.2289 0.000000
RAD 0.327895 0.061569 5.3257 0.000000
TAX -0.013751 0.001055 -13.0395 0.000000
PTRATIO -0.991733 0.088994 -11.1438 0.000000
B 0.009827 0.001126 8.7256 0.000000
LSTAT -0.534914 0.042128 -12.6973 0.000000
---
R-squared: 0.73547, Adjusted R-squared: 0.72904
F-statistic: 114.23 on 12 features
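Note that the example above is a linear regression, while the question mentions a logistic model. For logistic regression, statsmodels reports p-values directly; a minimal sketch, where X_train and y_train are placeholder names (not from the original) for your feature matrix and binary target:
import statsmodels.api as sm

X_const = sm.add_constant(X_train)  # add an intercept column
logit_model = sm.Logit(y_train, X_const).fit()
print(logit_model.summary())  # coefficient table with z-scores and p-values
print(logit_model.pvalues)    # the p-values alone, as a Series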

Related

dataframe from 3d list

This is what I have (a 3D list):
[ STOCK NAME
last_price price2 price3
0 0.00 0.0 0.0
1 870.95 7650.0 2371500.0
2 870.95 7650.0 2371500.0
3 870.95 7650.0 2371500.0
4 877.30 7650.0 2371500.0
5 879.20 6800.0 2381700.0]
I want to create a dataframe exactly like the list above. How do I do so? I tried pd.DataFrame(the_list), but it gave me this error: ValueError: Must pass 2-d input. shape=(190, 6, 3). Thank you.
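A pandas DataFrame is 2-D, so a (190, 6, 3) structure cannot be passed to pd.DataFrame directly. A minimal sketch, assuming the_list holds 190 blocks that each convert to a 6x3 table (the column names are taken from the display above; using the stock names as keys would work the same way):
import pandas as pd

frames = [pd.DataFrame(block, columns=['last_price', 'price2', 'price3'])
          for block in the_list]
# stack the blocks vertically with a key per block, giving a (stock, row) MultiIndex
result = pd.concat(frames, keys=range(len(frames)), names=['stock', 'row'])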

python 2: Slicing pandas dataframe using datetime index skips one day from the wanted date

I have the df below, with "start_datetime" as the index; the index values are of type pandas._libs.tslib.Timestamp:
col1 col2
start_datetime
2017-12-27 01:50:00 0.000000 0.0
2017-12-27 01:55:00 0.000000 0.0
2017-12-27 02:15:00 0.000000 0.0
2017-12-27 02:20:00 0.000000 0.0
2017-12-27 02:25:00 0.000000 0.0
... ... ...
2018-01-15 21:30:00 0.000000 0.0
2018-01-15 21:35:00 0.000000 0.0
2018-01-15 21:40:00 0.000000 0.0
2018-01-15 21:45:00 0.000000 0.0
2018-01-15 21:50:00 0.000000 0.0
2018-01-15 21:55:00 0.000000 0.0
2018-01-15 22:00:00 0.000000 0.0
I want to slice using the datetime index:
start = pd.to_datetime('2018-01-01-00-00')  # a pandas._libs.tslib.Timestamp
df = df[start: ]
Below is what I got:
col1 col2
start_datetime
2018-01-02 00:00:00 0.0 0.0
2018-01-02 00:05:00 0.0 0.0
2018-01-02 00:10:00 0.0 0.0
2018-01-02 00:15:00 0.0 0.0
2018-01-02 00:20:00 0.0 0.0
Questions:
Why did it slice at "2018-01-02 00:00:00" instead of "2018-01-01 00:00:00" ?
How can I slice to include "2018-01-01 00:00:00" ?
I have tried:
df = df[start: ]
df = df.loc[(df.index >= start)]
I also reset the index and tried df = df.loc[(df.start_datetime >= start)], and even hard-coded df = df["2018-01-01 00:00:00": ].
But none of these sliced at "2018-01-01 00:00:00".
Any ideas?
In my opinion, the problem is that 2018-01-01 does not exist in your index, so the slice starts at the first timestamp that does exist after it. You can check it:
print(df['2018-01-01'])

# return the unique days present in the index, by flooring to day precision
idx = df.index.floor('d').unique()
# print(idx)

# get the days present between two dates
print(idx[(idx >= '2017-12-30') & (idx <= '2018-01-02')])
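A small self-contained demonstration of the same behavior, using made-up data: slicing a sorted DatetimeIndex with a label that is absent simply starts at the next timestamp that exists.
import pandas as pd

idx = pd.to_datetime(['2017-12-31 23:55:00',
                      '2018-01-02 00:00:00',
                      '2018-01-02 00:05:00'])
toy = pd.DataFrame({'col1': [1.0, 2.0, 3.0]}, index=idx)

# nothing exists on 2018-01-01, so the slice begins at 2018-01-02 00:00:00
print(toy[pd.Timestamp('2018-01-01'):])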

Regression analysis,using statsmodels

Please help me get output from this code. Why is the output of this code NaN? What is my mistake?
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
import math
import datetime as dt
#importing Data
es_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/hbrbcpe.txt'
vs_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/h_vstoxx.txt'
#creating DataFrame
cols=['SX5P','SX5E','SXXP','SXXE','SXXF','SXXA','DK5f','DKXF']
es=pd.read_csv(es_url,index_col=0,parse_dates=True,sep=';',dayfirst=True,header=None,skiprows=4,names=cols)
vs=pd.read_csv(vs_url,index_col=0,header=2,parse_dates=True,sep=',',dayfirst=True)
data=pd.DataFrame({'EUROSTOXX' : es['SX5E'][es.index > dt.datetime(1999,1,1)]},dtype=float)
data=data.join(pd.DataFrame({'VSTOXX' : vs['V2TX'][vs.index > dt.datetime(1999,1,1)]},dtype=float))
data=data.fillna(method='ffill')
rets=(((data/data.shift(1))-1)*100).round(2)
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']
#regression analysis
model = smf.ols('ydat ~ xdat',data=rets).fit()
print model.summary()
The problem is that when you compute rets, you divide by zero, which produces inf values. Also, shift introduces NaNs, so there are missing values that need to be handled in some way before proceeding to the regression.
Walk through this example using your data and see:
df = data.loc['2016-03-20':'2016-04-01'].copy()
df looks like:
EUROSTOXX VSTOXX
2016-03-21 3048.77 35.6846
2016-03-22 3051.23 35.6846
2016-03-23 3042.42 35.6846
2016-03-24 2986.73 35.6846
2016-03-25 0.00 35.6846
2016-03-28 0.00 35.6846
2016-03-29 3004.87 35.6846
2016-03-30 3044.10 35.6846
2016-03-31 3004.93 35.6846
2016-04-01 2953.28 35.6846
Shifting by 1 and dividing:
df = (((df/df.shift(1))-1)*100).round(2)
Prints out:
EUROSTOXX VSTOXX
2016-03-21 NaN NaN
2016-03-22 0.080688 0.0
2016-03-23 -0.288736 0.0
2016-03-24 -1.830451 0.0
2016-03-25 -100.000000 0.0
2016-03-28 NaN 0.0
2016-03-29 inf 0.0
2016-03-30 1.305547 0.0
2016-03-31 -1.286751 0.0
2016-04-01 -1.718842 0.0
Take-aways: shifting by 1 always creates a NaN in the first row. Dividing 0.00 by 0.00 produces a NaN, while dividing a non-zero price by 0.00 produces an inf.
One possible solution to handle missing values:
...
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']
# handle missing values: find the rows where xdat is +/-inf or NaN,
# map the infs to NaN, fill xdat's NaNs with the mean and ydat's with 0
messed_up_indices = xdat[xdat.isin([-np.inf, np.inf, np.nan])].index
xdat[messed_up_indices] = xdat[messed_up_indices].replace([-np.inf, np.inf], np.nan)
xdat[messed_up_indices] = xdat[messed_up_indices].fillna(xdat.mean())
ydat[messed_up_indices] = ydat[messed_up_indices].fillna(0.0)
#regression analysis
model = smf.ols('ydat ~ xdat',data=rets, missing='raise').fit()
print(model.summary())
Notice I added the missing='raise' parameter to ols to see what's going on.
End result prints out:
OLS Regression Results
==============================================================================
Dep. Variable: ydat R-squared: 0.259
Model: OLS Adj. R-squared: 0.259
Method: Least Squares F-statistic: 1593.
Date: Wed, 03 Jan 2018 Prob (F-statistic): 5.76e-299
Time: 12:01:14 Log-Likelihood: -13856.
No. Observations: 4554 AIC: 2.772e+04
Df Residuals: 4552 BIC: 2.773e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.1608 0.075 2.139 0.033 0.013 0.308
xdat -1.4209 0.036 -39.912 0.000 -1.491 -1.351
==============================================================================
Omnibus: 4280.114 Durbin-Watson: 2.074
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4021394.925
Skew: -3.446 Prob(JB): 0.00
Kurtosis: 148.415 Cond. No. 2.11
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
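For completeness, a more compact alternative (a sketch of my own, not part of the answer above) is to convert the infs to NaN and drop the affected rows before fitting. Note this discards those observations instead of imputing them, so the estimates will differ slightly:
import numpy as np

rets_clean = rets.replace([np.inf, -np.inf], np.nan).dropna()
model = smf.ols('VSTOXX ~ EUROSTOXX', data=rets_clean).fit()
print(model.summary())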

Exporting Stata's coefficient vector: Meaning of suffixes in interaction column

I ran a regression in Stata:
reg y i.ind1990#i.year, nocons r
Then I exported the coefficient vector from Stata using
matrix x = e(b)
esttab matrix(x) using "xx.csv", replace plain
and loaded it in Python and pandas using
df = pd.read_csv('xx.csv', skiprows=1, index_col=[0]).T.dropna()
df.index.name = 'interaction'
df = df.reset_index()
ind1990 and year are numeric. But I have some odd values in my csv (year and ind are manually pulled out of interaction):
interaction y1 ind year
0 0b.ind1990#2001b.year 0.000000 0b 2001b
1 0b.ind1990#2002.year 0.320578 0b 2002
2 0b.ind1990#2003.year 0.304471 0b 2003
3 0b.ind1990#2004.year 0.271429 0b 2004
4 0b.ind1990#2005.year 0.295347 0b 2005
I believe that 0b is how Stata translates missing values aka NIU. But I can't make sense of the other non-numeric values.
Here's what I get for the years (both b and o appear as unexpected suffixes):
array(['2001b', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
'2009', '2010', '2011', '2012', '2013', '2014', '2015', '2004o',
'2008o', '2012o', '2003o', '2005o', '2006o', '2007o', '2009o',
'2010o', '2011o', '2013o', '2014o', '2015o', '2002o'], dtype=object)
and for ind1990 (where 0b is apparently NIU, but there are also o suffixes that I can't make sense of):
array(['0b', '10', '11', '12', '20', '31', '32', '40', '41', '42', '50',
'60', '100', '101', '102', '110', '111', '112', '120', '121', '122',
'122o', '130', '130o', '132', '140', '141', '142', '150', '151',
'152', '152o', '160', '161', '162', '171', '172', '180', '181',
'182', '190', '191', '192', '200', '201', '201o', '210', '211',
'220', '220o', '221', '221o', '222', '222o', '230', '231', '232',
'241', '242', '250', '251', '252', '261', '262', '270', '271',
'272o', '272'], dtype=object)
What do the b and o suffixes mean at the end of values of the interaction column?
This isn't an answer, but it won't go well as a comment and it may clarify the question.
The example here isn't reproducible without @FooBar's data. Here is another one that (a) Stata users can reproduce and (b) Python users can, I think, import:
. sysuse auto, clear
(1978 Automobile Data)
. regress mpg i.foreign#i.rep78, nocons r
note: 1.foreign#1b.rep78 identifies no observations in the sample
note: 1.foreign#2.rep78 identifies no observations in the sample
Linear regression Number of obs = 69
F(7, 62) = 364.28
Prob > F = 0.0000
R-squared = 0.9291
Root MSE = 6.1992
-------------------------------------------------------------------------------
| Robust
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign#rep78 |
Domestic#2 | 19.125 1.311239 14.59 0.000 16.50387 21.74613
Domestic#3 | 19 .8139726 23.34 0.000 17.37289 20.62711
Domestic#4 | 18.44444 1.520295 12.13 0.000 15.40542 21.48347
Domestic#5 | 32 1.491914 21.45 0.000 29.01771 34.98229
Foreign#1 | 0 (empty)
Foreign#2 | 0 (empty)
Foreign#3 | 23.33333 1.251522 18.64 0.000 20.83158 25.83509
Foreign#4 | 24.88889 .8995035 27.67 0.000 23.09081 26.68697
Foreign#5 | 26.33333 3.105666 8.48 0.000 20.1252 32.54147
-------------------------------------------------------------------------------
. matrix b = e(b)
. esttab matrix(b) using b.csv, plain
(output written to b.csv)
The file b.csv looks like this:
"","b","","","","","","","","",""
"","0b.foreign#1b.rep78","0b.foreign#2.rep78","0b.foreign#3.rep78","0b.foreign#4.rep78","0b.foreign#5.rep78","1o.foreign#1b.rep78","1o.foreign#2o.rep78","1.foreign#3.rep78","1.foreign#4.rep78","1.foreign#5.rep78"
"y1","0","19.125","19","18.44444","32","0","0","23.33333","24.88889","26.33333"
Stata's notation here is accessible to non-Stata users; see Stata's factor-variables documentation (help fvvarlist).
I don't use esttab (a user-written Stata program) or Python (that's ignorance, not prejudice), so I can't comment beyond that.
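As a practical follow-up on the Python side (my sketch, not part of the answer): in Stata's factor-variable notation, b marks a base level and o marks an omitted level, so the suffixes can be split from the numeric codes with a regex, assuming the interaction labels look like those in the question:
import pandas as pd

pattern = (r'(?P<ind1990>\d+)(?P<ind_flag>[bo]?)\.ind1990'
           r'#(?P<year>\d+)(?P<year_flag>[bo]?)\.year')
parts = df['interaction'].str.extract(pattern)
parts[['ind1990', 'year']] = parts[['ind1990', 'year']].astype(int)
print(parts.head())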

Padding a multindex pandas dataframe

I have a dataframe of US opinion poll data that I'm trying to pad out on a daily basis. I can't figure out how to do it.
Here's the original data (the dataframe doesn't have to be a multiindex).
Democratic Other Republican
Date State
2008-11-04 AZ 0.451153 0.012495 0.536352
2012-05-20 AZ 0.462500 0.000000 0.537500
...
2008-11-04 WI 0.562178 0.014686 0.423137
2012-11-03 WI 0.515152 0.000000 0.484848
I want to pad it out so it looks something like this:
Democratic Other Republican
Date State
2008-11-04 AZ 0.451153 0.012495 0.536352
2008-11-05 AZ 0.451153 0.012495 0.536352
...
2012-05-20 AZ 0.462500 0.000000 0.537500
2012-05-21 AZ 0.462500 0.000000 0.537500
...
2012-11-06 AZ 0.462500 0.000000 0.537500
...
2008-11-04 WI 0.562178 0.014686 0.423137
2008-11-05 WI 0.562178 0.014686 0.423137
...
2012-11-03 WI 0.515152 0.000000 0.484848
2012-11-04 WI 0.515152 0.000000 0.484848
2012-11-05 WI 0.515152 0.000000 0.484848
2012-11-06 WI 0.515152 0.000000 0.484848
I tried doing this:
election_range = pd.date_range('2008-11-06', '2012-11-06')
dailies.reindex(election_range, method='pad')
but I get this error:
ValueError: cannot include dtype 'M' in a buffer
I tried just indexing on the date, but I got an error that the index wasn't unique.
The obvious thing to do is to split the frame state-by-state, reindex, and combine the frames, but there must be a better way of doing it. Does anyone have any ideas?
Try:
start = df.index.levels[0].min()  # earliest Date in the index
end = df.index.levels[0].max()    # latest Date in the index
days = pd.date_range(start, end)
df.unstack().reindex(days).ffill().stack().sort_index(level=[1, 0])
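The trick is that unstack() moves State out of the index and into the columns, leaving a plain DatetimeIndex that reindex() and ffill() can operate on for all states at once; stack() then restores the (Date, State) MultiIndex, so there is no need to split the frame state by state.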