Padding a multiindex pandas dataframe - python-2.7

I have a dataframe of US opinion poll data that I'm trying to pad out on a daily basis. I can't figure out how to do it.
Here's the original data (the dataframe doesn't have to be a multiindex).
                  Democratic     Other  Republican
Date       State
2008-11-04 AZ       0.451153  0.012495    0.536352
2012-05-20 AZ       0.462500  0.000000    0.537500
...
2008-11-04 WI       0.562178  0.014686    0.423137
2012-11-03 WI       0.515152  0.000000    0.484848
I want to pad it out so it looks something like this:
                  Democratic     Other  Republican
Date       State
2008-11-04 AZ       0.451153  0.012495    0.536352
2008-11-05 AZ       0.451153  0.012495    0.536352
...
2012-05-20 AZ       0.462500  0.000000    0.537500
2012-05-21 AZ       0.462500  0.000000    0.537500
...
2012-11-06 AZ       0.462500  0.000000    0.537500
...
2008-11-04 WI       0.562178  0.014686    0.423137
2008-11-05 WI       0.562178  0.014686    0.423137
...
2012-11-03 WI       0.515152  0.000000    0.484848
2012-11-04 WI       0.515152  0.000000    0.484848
2012-11-05 WI       0.515152  0.000000    0.484848
2012-11-06 WI       0.515152  0.000000    0.484848
I tried doing this:
election_range = pd.date_range('2008-11-06', '2012-11-06')
dailies.reindex(election_range, method='pad')
but I get this error:
ValueError: cannot include dtype 'M' in a buffer
I tried just indexing on the date, but I got an error that the index wasn't unique.
The obvious thing to do is to split the frame state-by-state, reindex, and combine the frames, but there must be a better way of doing it. Does anyone have any ideas?

Try:
# build a daily range spanning the earliest to the latest poll date
start = df.index.levels[0].min()
end = df.index.levels[0].max()
days = pd.date_range(start, end)
# move State into the columns, reindex to daily frequency, forward-fill,
# then restore the (Date, State) MultiIndex, sorted by state first
df.unstack().reindex(days).ffill().stack().sort_index(level=[1, 0])
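If the wide intermediate frame that unstack() creates is a concern, the same padding can be done group-by-group. A minimal sketch, assuming the (Date, State) MultiIndex shown above (pad_state and padded are illustrative names, not part of the original answer):
import pandas as pd

# full daily range from the earliest to the latest poll date
days = pd.date_range(df.index.levels[0].min(), df.index.levels[0].max())

def pad_state(g):
    # drop the State level, stretch the group to the full daily range,
    # and carry the last known poll values forward
    g = g.reset_index('State', drop=True)
    return g.reindex(days).ffill()

padded = df.groupby(level='State').apply(pad_state)
padded.index.names = ['State', 'Date']
# restore the (Date, State) order used above, sorted by state first
padded = padded.swaplevel('State', 'Date').sort_index(level=[1, 0])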

Related

python 2: Slicing pandas dataframe using datetime index skips one day from the wanted date

I have the below df with "start_datetime" as its index; the index values are of type <class 'pandas._libs.tslib.Timestamp'>:
                         col1  col2
start_datetime
2017-12-27 01:50:00  0.000000   0.0
2017-12-27 01:55:00  0.000000   0.0
2017-12-27 02:15:00  0.000000   0.0
2017-12-27 02:20:00  0.000000   0.0
2017-12-27 02:25:00  0.000000   0.0
...                       ...   ...
2018-01-15 21:30:00  0.000000   0.0
2018-01-15 21:35:00  0.000000   0.0
2018-01-15 21:40:00  0.000000   0.0
2018-01-15 21:45:00  0.000000   0.0
2018-01-15 21:50:00  0.000000   0.0
2018-01-15 21:55:00  0.000000   0.0
2018-01-15 22:00:00  0.000000   0.0
I want to slice using the datetime index:
start = pd.to_datetime('2018-01-01-00-00')  # <class 'pandas._libs.tslib.Timestamp'>
df = df[start: ]
Below is what I got:
                     col1  col2
start_datetime
2018-01-02 00:00:00   0.0   0.0
2018-01-02 00:05:00   0.0   0.0
2018-01-02 00:10:00   0.0   0.0
2018-01-02 00:15:00   0.0   0.0
2018-01-02 00:20:00   0.0   0.0
Questions:
Why did it slice at "2018-01-02 00:00:00" instead of "2018-01-01 00:00:00" ?
How can I slice to include "2018-01-01 00:00:00" ?
I have tried:
df = df[start: ]
df = df.loc[(df.index >= start)]
I also reset the index and tried df = df.loc[(df.start_datetime >= start)], and even hard-coded df = df["2018-01-01 00:00:00": ].
But none of them sliced at "2018-01-01 00:00:00".
Any ideas?
In my opinion, the problem is that 2018-01-01 simply does not exist in the index. You can check it:
# if 2018-01-01 is missing, this raises a KeyError
print (df['2018-01-01'])

# return the unique days by flooring the index to day frequency
idx = df.index.floor('d').unique()
#print (idx)

# get the days between two dates
print (idx[(idx >= '2017-12-30') & (idx <= '2018-01-02')])
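To see why the slice starts at 2018-01-02, a small self-contained sketch (toy data mimicking the gap described above) reproduces the behavior:
import pandas as pd

# toy frame with a gap: no rows at all on 2018-01-01
idx = pd.to_datetime(['2017-12-31 23:55:00',
                      '2018-01-02 00:00:00',
                      '2018-01-02 00:05:00'])
df = pd.DataFrame({'col1': [1.0, 2.0, 3.0]}, index=idx)

start = pd.to_datetime('2018-01-01')
print (df[start:])
# the result begins at 2018-01-02 00:00:00: slicing selects existing
# labels at or after `start`, it does not invent missing timestamps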

Regression analysis using statsmodels

Please help me get output from this code. Why is the output of this code NaN? What is my mistake?
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
import math
import datetime as dt

# importing data
es_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/hbrbcpe.txt'
vs_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/h_vstoxx.txt'

# creating DataFrame
cols = ['SX5P', 'SX5E', 'SXXP', 'SXXE', 'SXXF', 'SXXA', 'DK5f', 'DKXF']
es = pd.read_csv(es_url, index_col=0, parse_dates=True, sep=';', dayfirst=True,
                 header=None, skiprows=4, names=cols)
vs = pd.read_csv(vs_url, index_col=0, header=2, parse_dates=True, sep=',', dayfirst=True)

data = pd.DataFrame({'EUROSTOXX': es['SX5E'][es.index > dt.datetime(1999, 1, 1)]}, dtype=float)
data = data.join(pd.DataFrame({'VSTOXX': vs['V2TX'][vs.index > dt.datetime(1999, 1, 1)]}, dtype=float))
data = data.fillna(method='ffill')

rets = (((data / data.shift(1)) - 1) * 100).round(2)
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']

# regression analysis
model = smf.ols('ydat ~ xdat', data=rets).fit()
print model.summary()
The problem is that when you compute rets, you divide by zero, which produces inf. Also, shift introduces NaNs, so you have missing values that need to be handled in some way before proceeding to the regression.
Walk through this example using your data and see:
df = data.loc['2016-03-20':'2016-04-01'].copy()
df looks like:
            EUROSTOXX   VSTOXX
2016-03-21    3048.77  35.6846
2016-03-22    3051.23  35.6846
2016-03-23    3042.42  35.6846
2016-03-24    2986.73  35.6846
2016-03-25       0.00  35.6846
2016-03-28       0.00  35.6846
2016-03-29    3004.87  35.6846
2016-03-30    3044.10  35.6846
2016-03-31    3004.93  35.6846
2016-04-01    2953.28  35.6846
Shifting by 1 and dividing:
df = (((df/df.shift(1))-1)*100).round(2)
Prints out:
             EUROSTOXX  VSTOXX
2016-03-21         NaN     NaN
2016-03-22    0.080688     0.0
2016-03-23   -0.288736     0.0
2016-03-24   -1.830451     0.0
2016-03-25 -100.000000     0.0
2016-03-28         NaN     0.0
2016-03-29         inf     0.0
2016-03-30    1.305547     0.0
2016-03-31   -1.286751     0.0
2016-04-01   -1.718842     0.0
Take-aways: shifting by 1 always creates a NaN in the first row. Dividing 0.00 by 0.00 produces NaN, and dividing a non-zero value by 0.00 produces inf (see 2016-03-28 and 2016-03-29 above).
One possible solution to handle missing values:
...
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']

# handle missing values: find the rows where xdat is +/-inf or NaN
messed_up_indices = xdat[xdat.isin([-np.inf, np.inf, np.nan])].index
# turn the infs into NaNs, then fill them with the column mean
xdat[messed_up_indices] = xdat[messed_up_indices].replace([-np.inf, np.inf], np.nan)
xdat[messed_up_indices] = xdat[messed_up_indices].fillna(xdat.mean())
ydat[messed_up_indices] = ydat[messed_up_indices].fillna(0.0)

# regression analysis
model = smf.ols('ydat ~ xdat', data=rets, missing='raise').fit()
print(model.summary())
Notice I added the missing='raise' parameter to ols to see what's going on.
End result prints out:
                            OLS Regression Results
==============================================================================
Dep. Variable:                   ydat   R-squared:                       0.259
Model:                            OLS   Adj. R-squared:                  0.259
Method:                 Least Squares   F-statistic:                     1593.
Date:                Wed, 03 Jan 2018   Prob (F-statistic):          5.76e-299
Time:                        12:01:14   Log-Likelihood:                -13856.
No. Observations:                4554   AIC:                         2.772e+04
Df Residuals:                    4552   BIC:                         2.773e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.1608      0.075      2.139      0.033       0.013       0.308
xdat          -1.4209      0.036    -39.912      0.000      -1.491      -1.351
==============================================================================
Omnibus:                     4280.114   Durbin-Watson:                   2.074
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          4021394.925
Skew:                          -3.446   Prob(JB):                         0.00
Kurtosis:                     148.415   Cond. No.                         2.11
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
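An alternative worth mentioning (a sketch, not part of the answer above): instead of imputing the bad rows, replace the infs with NaN and drop them, then fit on the cleaned frame:
import numpy as np
import statsmodels.formula.api as smf

# replace +/-inf with NaN, then drop every row with a missing value;
# the column names follow the `rets` frame built above
clean = rets.replace([np.inf, -np.inf], np.nan).dropna()
model = smf.ols('VSTOXX ~ EUROSTOXX', data=clean).fit()
print(model.summary())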

Calculate P-value in Sklearn using python?

I'm new to machine learning and created a logistic model using sklearn, but I can't find any documentation on how to calculate the p-values for my feature variables or for the model. I have checked the linked Stack Overflow question but didn't get the required output. Please help. Thanks in advance.
One can use the regressors package for this. The following code is from https://regressors.readthedocs.io/en/latest/usage.html:
import numpy as np
from sklearn import datasets
boston = datasets.load_boston()
which_betas = np.ones(13, dtype=bool)
which_betas[3] = False # Eliminate dummy variable
X = boston.data[:, which_betas]
y = boston.target
from sklearn import linear_model
from regressors import stats
ols = linear_model.LinearRegression()
ols.fit(X, y)
# To calculate the p-values of beta coefficients:
print("coef_pval:\n", stats.coef_pval(ols, X, y))
# to print summary table:
print("\n=========== SUMMARY ===========")
xlabels = boston.feature_names[which_betas]
stats.summary(ols, X, y, xlabels)
Output:
coef_pval:
[2.66897615e-13 4.15972994e-04 1.36473287e-05 4.67064962e-01
 1.70032518e-06 0.00000000e+00 7.67610259e-01 1.55431223e-15
 1.51691918e-07 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00]

=========== SUMMARY ===========

Residuals:
     Min      1Q  Median      3Q      Max
-26.3743 -1.9207  0.6648  2.8112  13.3794

Coefficients:
              Estimate  Std. Error  t value   p value
_intercept   36.925033    4.915647   7.5117  0.000000
CRIM         -0.112227    0.031583  -3.5534  0.000416
ZN            0.047025    0.010705   4.3927  0.000014
INDUS         0.040644    0.055844   0.7278  0.467065
NOX         -17.396989    3.591927  -4.8434  0.000002
RM            3.845179    0.272990  14.0854  0.000000
AGE           0.002847    0.009629   0.2957  0.767610
DIS          -1.485557    0.180530  -8.2289  0.000000
RAD           0.327895    0.061569   5.3257  0.000000
TAX          -0.013751    0.001055 -13.0395  0.000000
PTRATIO      -0.991733    0.088994 -11.1438  0.000000
B             0.009827    0.001126   8.7256  0.000000
LSTAT        -0.534914    0.042128 -12.6973  0.000000
---
R-squared:  0.73547,    Adjusted R-squared:  0.72904
F-statistic: 114.23 on 12 features
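If installing an extra package is not an option, a common alternative is to refit the same design with statsmodels, which reports p-values directly. A sketch under the same setup as above:
import numpy as np
import statsmodels.api as sm
from sklearn import datasets

boston = datasets.load_boston()
which_betas = np.ones(13, dtype=bool)
which_betas[3] = False  # eliminate the dummy variable, as above
X = boston.data[:, which_betas]
y = boston.target

# add_constant supplies the intercept that LinearRegression fits implicitly
results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.pvalues)    # one p-value per coefficient, intercept first
print(results.summary())  # full table, comparable to the output above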

Strange behaviour when adding columns

I'm using Python 2.7.8 | Anaconda 2.1.0, and I'm wondering why the strange behavior below occurs.
I create a pandas dataframe with two columns, then add a third column by summing the first two columns
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
x['c'] = x[['a', 'b']].sum(axis=1)  # or x['c'] = x['a'] + x['b']
Out[7]:
          a         b         c
0 -1.644246  0.851602 -0.792644
1 -0.129092  0.237140  0.108049
2  0.623160  0.105494  0.728654
3  0.737803 -1.612189 -0.874386
4  0.340671 -0.113334  0.227337
All good so far. Now I want to set the values of column c to zero if they are negative:
x[x['c']<0] = 0
Out[9]:
          a         b         c
0  0.000000  0.000000  0.000000
1 -0.129092  0.237140  0.108049
2  0.623160  0.105494  0.728654
3  0.000000  0.000000  0.000000
4  0.340671 -0.113334  0.227337
This gives the desired result in column 'c', but for some reason columns 'a' and 'b' have been modified as well, which I don't want. Why is this happening, and how can I fix this behavior?
You have to specify you only want the 'c' column:
x.loc[x['c']<0, 'c'] = 0
When you just index with a boolean array/series, this will select full rows, as you can see in this example:
In [46]: x['c']<0
Out[46]:
0     True
1    False
2    False
3     True
4    False
Name: c, dtype: bool

In [47]: x[x['c']<0]
Out[47]:
          a         b         c
0 -0.444493 -0.592318 -1.036811
3 -1.363727 -1.572558 -2.936285
Because you are setting all the columns to zero, not just 'c'. You should set it only for column 'c':
x['c'][x['c']<0] = 0
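Note that this chained form can raise a SettingWithCopyWarning in later pandas versions. Two equivalent ways to touch only column 'c' (a sketch of the same fix):
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
x['c'] = x['a'] + x['b']

# explicit row *and* column selection -- the recommended form
x.loc[x['c'] < 0, 'c'] = 0

# or clip the column at zero, which does the same thing here
x['c'] = x['c'].clip(lower=0)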

Convert data frame column to float and perform operation in Pandas

I have a data frame that contains the following values, which are imported as strings:
df3 = pd.DataFrame(data = {
    'Column1': ['10/1','9/5','7/4','12/3','18/7','14/2']})
I tried to convert to float and do the division. The following didn't work well:
for i, v in enumerate(df3.Column1):
    df3['Column2'] = float(v[:-2]) / float(v[-1])
print df3.Column2
This is the output that I am trying to achieve
df3 = pd.DataFrame(data = {
    'Column1': ['10/1','9/5','7/4','12/3','18/7','14/2'],
    'Column2': ['10.0','1.8','1.75','4.0','2.57142857143','7.0']})
df3
The following would work: define a function that performs the casting to float and returns the result of the division, then assign that result to your new column:
In [10]:
df3 = pd.DataFrame(data = {
    'Column1': ['10/1','9/5','7/4','12/3','18/7','14/2']})

def func(x):
    return float(x[:-2]) / float(x[-1])

df3['Column2'] = df3['Column1'].apply(func)
df3
Out[10]:
  Column1    Column2
0    10/1  10.000000
1     9/5   1.800000
2     7/4   1.750000
3    12/3   4.000000
4    18/7   2.571429
5    14/2   7.000000
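For larger frames, a vectorized variant avoids the Python-level apply and the fixed-width slicing. A sketch, assuming every entry has the 'numerator/denominator' form:
import pandas as pd

df3 = pd.DataFrame({'Column1': ['10/1','9/5','7/4','12/3','18/7','14/2']})

# split on '/', cast both parts to float, and divide column-wise;
# unlike v[-1], this also copes with multi-digit denominators
parts = df3['Column1'].str.split('/', expand=True).astype(float)
df3['Column2'] = parts[0] / parts[1]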
If, and ONLY IF, you do not have input/data from an untrusted source, here's a shortcut:
In [46]: df3
Out[46]:
  Column1
0    10/1
1     9/5
2     7/4
3    12/3
4    18/7
5    14/2
In [47]: df3.Column1.map(eval)
Out[47]:
0    10.000000
1     1.800000
2     1.750000
3     4.000000
4     2.571429
5     7.000000
Name: Column1, dtype: float64
But seriously...be careful with eval.