Structure the output - python-2.7

I am trying to get properly structured output into a CSV.
Input:
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525
Code:
import pandas as pd
from datetime import datetime,time
import numpy as np
fn = r'00_Dart.csv'
cols = ['UserID','StartTime','StopTime', 'gps1', 'gps2']
df = pd.read_csv(fn, header=None, names=cols)
df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime
# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)
# building reporting DF: `r`
freq = '1H' # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.datetime(1970,1,1)).total_seconds().astype(np.int64)
# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1
r['LogCount'] = 0
r['UniqueIDCount'] = 0
for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # i've slightly simplified the calculations of m and d
    # by getting rid of the division by 2,
    # because the common terms can be eliminated
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]
r['Day'] = pd.to_datetime(r.start, unit='s').dt.weekday_name.str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time
#df.to_csv((r[r.LogCount > 0])'example.csv')
#print(r[r.LogCount > 0]) -- This gives the correct count and unique count but I want to write the output in a structure.
print (r['StartTime'], ['EndTime'], ['Day'], ['LogCount'], ['UniqueIDCount'])
Output: This is the output I am getting, which is not what I am looking for.
(2004-01-05 00:00:00 00:00:00
2004-01-05 01:00:00 01:00:00
2004-01-05 02:00:00 02:00:00
2004-01-05 03:00:00 03:00:00
2004-01-05 04:00:00 04:00:00
2004-01-05 05:00:00 05:00:00
2004-01-05 06:00:00 06:00:00
2004-01-05 07:00:00 07:00:00
2004-01-05 08:00:00 08:00:00
2004-01-05 09:00:00 09:00:00
The expected output headers are:
StartTime, EndTime, Day, Count, UniqueIDCount
How do I structure the write statement in the code so that the output CSV has the columns above?

Try This:
rout = r[['StartTime', 'EndTime', 'Day', 'LogCount', 'UniqueIDCount']]
print rout
rout.to_csv('results.csv', index=False)
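If you only want the hours that actually have activity (the r[r.LogCount > 0] filter mentioned in the question), the same pattern applies; a minimal sketch, reusing the example.csv name from the commented-out attempt:
cols_out = ['StartTime', 'EndTime', 'Day', 'LogCount', 'UniqueIDCount']
# keep only non-empty hours, select the report columns, write without the index
r[r.LogCount > 0][cols_out].to_csv('example.csv', index=False)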

Related

How to create a measure which counts the # of row divide by the total # of row

I have a table (named Table) like this:
Name  HasEntry  Time
A     true      jan 22
A     false     jan 22
A     true      jan 22
A     true      jan 22
B     true      jan 22
B     false     jan 22
B     true      jan 22
I want a measure which gives Ratio = (# of HasEntry = true) / (# of rows for each name);
that means for A the ratio is 3/4 = 0.75, and for B it is 2/3 ≈ 0.67.
I tried doing
Ratio = DIVIDE(COUNTROWS(FILTER(Table, Table[HasEntry] = TRUE)), COUNT(Table[HasEntry]))
But when I use the ratio on the y-axis of my line chart, I get the error "Can't display the visual: The function COUNT cannot work with values of type BOOLEAN."
So how to count the # of row for each name in my measure?
Use COUNTA() instead; COUNT does not work on boolean columns.
https://dax.guide/counta/
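Putting that together, the corrected measure might look like this (an untested sketch, simply swapping COUNTA in for COUNT in the question's formula):
Ratio =
DIVIDE(
    COUNTROWS(FILTER(Table, Table[HasEntry] = TRUE())),
    COUNTA(Table[HasEntry])
)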

pandas - group by: create aggregation function using multiple columns

I have the following data frame:
id my_year my_month waiting_time target
001 2018 1 95 1
002 2018 1 3 3
003 2018 1 4 0
004 2018 1 40 1
005 2018 2 97 1
006 2018 2 3 3
007 2018 3 4 0
008 2018 3 40 1
I want to group by my_year and my_month, then within each group compute my_rate as
(# of records with waiting_time <= 90 and target = 1) / total_records in the group
i.e. I am expecting output like:
my_year my_month my_rate
2018 1 0.25
2018 2 0.0
2018 3 0.5
I wrote the following code to compute the desired value my_rate:
def my_rate(data):
    waiting_time_list = data['waiting_time']
    target_list = data['target']
    total = len(data)
    my_count = 0
    for i in range(len(data)):
        if waiting_time_list[i] <= 90 and target_list[i] == 1:
            my_count += 1
    rate = float(my_count)/float(total)
    return rate

df.groupby(['my_year','my_month']).apply(my_rate)
However, I got the following error:
KeyError 0
KeyErrorTraceback (most recent call last)
<ipython-input-29-5c4399cefd05> in <module>()
17
---> 18 df.groupby(['my_year','my_month']).apply(my_rate)
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/groupby.pyc in apply(self, func, *args, **kwargs)
714 # ignore SettingWithCopy here in case the user mutates
715 with option_context('mode.chained_assignment', None):
--> 716 return self._python_apply_general(f)
717
718 def _python_apply_general(self, f):
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/groupby.pyc in _python_apply_general(self, f)
718 def _python_apply_general(self, f):
719 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 720 self.axis)
721
722 return self._wrap_applied_output(
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/groupby.pyc in apply(self, f, data, axis)
1727 # group might be modified
1728 group_axes = _get_axes(group)
-> 1729 res = f(group)
1730 if not _is_indexed_like(res, group_axes):
1731 mutated = True
<ipython-input-29-5c4399cefd05> in conversion_rate(data)
8 #print total_waiting_time_list[i], target_list[i]
9 #print i, total_waiting_time_list[i], target_list[i]
---> 10 if total_waiting_time_list[i] <= 90:# and target_list[i] == 1:
11 convert_90_count += 1
12 #print 'convert ', convert_90_count
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
599 key = com._apply_if_callable(key, self)
600 try:
--> 601 result = self.index.get_value(self, key)
602
603 if not is_scalar(result):
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in get_value(self, series, key)
2426 try:
2427 return self._engine.get_value(s, k,
-> 2428 tz=getattr(series.dtype, 'tz', None))
2429 except KeyError as e1:
2430 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4363)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4046)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13913)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13857)()
KeyError: 0
Any idea what I did wrong here? And how do I fix it? Thanks!
I believe it is better to use the mean of a boolean mask per group:
def my_rate(x):
    return ((x['waiting_time'] <= 90) & (x['target'] == 1)).mean()

df = df.groupby(['my_year','my_month']).apply(my_rate).reset_index(name='my_rate')
print (df)
   my_year  my_month  my_rate
0     2018         1     0.25
1     2018         2     0.00
2     2018         3     0.50
Any idea what I did wrong here?
The problem is that waiting_time_list and target_list are not lists but Series:
waiting_time_list = data['waiting_time']
target_list = data['target']
print (type(waiting_time_list))
<class 'pandas.core.series.Series'>
print (type(target_list))
<class 'pandas.core.series.Series'>
So positional indexing fails, because the second group has indices 4 and 5, not 0 and 1:
if waiting_time_list[i] <= 90 and target_list[i] == 1:
To avoid this, convert each Series to a list:
waiting_time_list = data['waiting_time'].tolist()
target_list = data['target'].tolist()
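An equivalent variant without apply, if a loop-free version reads better to you (a sketch; the helper column name 'ok' is my own, not from the original answer):
# boolean mask as its own column, then a plain groupby mean
df['ok'] = (df['waiting_time'] <= 90) & (df['target'] == 1)
out = df.groupby(['my_year','my_month'])['ok'].mean().reset_index(name='my_rate')
print (out)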

Regression analysis,using statsmodels

Please help me get output from this code. Why is the output of this code NaN? What is my mistake?
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
import math
import datetime as dt
#importing Data
es_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/hbrbcpe.txt'
vs_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/h_vstoxx.txt'
#creating DataFrame
cols=['SX5P','SX5E','SXXP','SXXE','SXXF','SXXA','DK5f','DKXF']
es=pd.read_csv(es_url,index_col=0,parse_dates=True,sep=';',dayfirst=True,header=None,skiprows=4,names=cols)
vs=pd.read_csv(vs_url,index_col=0,header=2,parse_dates=True,sep=',',dayfirst=True)
data=pd.DataFrame({'EUROSTOXX' : es['SX5E'][es.index > dt.datetime(1999,1,1)]},dtype=float)
data=data.join(pd.DataFrame({'VSTOXX' : vs['V2TX'][vs.index > dt.datetime(1999,1,1)]},dtype=float))
data=data.fillna(method='ffill')
rets=(((data/data.shift(1))-1)*100).round(2)
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']
#regression analysis
model = smf.ols('ydat ~ xdat',data=rets).fit()
print model.summary()
The problem is that when you compute rets, you divide by zero, which causes an inf. Also, when you use shift you introduce NaNs, so you have missing values that need to be handled in some way before proceeding to the regression.
Walk through this example using your data and see:
df = data.loc['2016-03-20':'2016-04-01'].copy()
df looks like:
EUROSTOXX VSTOXX
2016-03-21 3048.77 35.6846
2016-03-22 3051.23 35.6846
2016-03-23 3042.42 35.6846
2016-03-24 2986.73 35.6846
2016-03-25 0.00 35.6846
2016-03-28 0.00 35.6846
2016-03-29 3004.87 35.6846
2016-03-30 3044.10 35.6846
2016-03-31 3004.93 35.6846
2016-04-01 2953.28 35.6846
Shifting by 1 and dividing:
df = (((df/df.shift(1))-1)*100).round(2)
Prints out:
EUROSTOXX VSTOXX
2016-03-21 NaN NaN
2016-03-22 0.080688 0.0
2016-03-23 -0.288736 0.0
2016-03-24 -1.830451 0.0
2016-03-25 -100.000000 0.0
2016-03-28 NaN 0.0
2016-03-29 inf 0.0
2016-03-30 1.305547 0.0
2016-03-31 -1.286751 0.0
2016-04-01 -1.718842 0.0
Take-aways: shifting by 1 always creates a NaN in the first row; dividing a nonzero value by 0.00 produces an inf, and 0.00 divided by 0.00 produces a NaN.
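As an aside, a common way to sidestep both problems at once (my own sketch, not part of the original answer) is to compute the returns with pct_change and drop the bad rows before fitting:
# pct_change is equivalent to data/data.shift(1) - 1; scale to percent
rets = (data.pct_change() * 100).round(2)
# turn the divide-by-zero infs into NaN, then drop incomplete rows
rets = rets.replace([np.inf, -np.inf], np.nan).dropna()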
One possible solution to handle missing values:
...
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']
# handle missing values
messed_up_indices = xdat[xdat.isin([-np.inf, np.inf, np.nan]) == True].index
xdat[messed_up_indices] = xdat[messed_up_indices].replace([-np.inf, np.inf], np.nan)
xdat[messed_up_indices] = xdat[messed_up_indices].fillna(xdat.mean())
ydat[messed_up_indices] = ydat[messed_up_indices].fillna(0.0)
#regression analysis
model = smf.ols('ydat ~ xdat',data=rets, missing='raise').fit()
print(model.summary())
Notice I added the missing='raise' parameter to ols to see what's going on.
End result prints out:
OLS Regression Results
==============================================================================
Dep. Variable: ydat R-squared: 0.259
Model: OLS Adj. R-squared: 0.259
Method: Least Squares F-statistic: 1593.
Date: Wed, 03 Jan 2018 Prob (F-statistic): 5.76e-299
Time: 12:01:14 Log-Likelihood: -13856.
No. Observations: 4554 AIC: 2.772e+04
Df Residuals: 4552 BIC: 2.773e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.1608 0.075 2.139 0.033 0.013 0.308
xdat -1.4209 0.036 -39.912 0.000 -1.491 -1.351
==============================================================================
Omnibus: 4280.114 Durbin-Watson: 2.074
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4021394.925
Skew: -3.446 Prob(JB): 0.00
Kurtosis: 148.415 Cond. No. 2.11
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

percentage bins based on predefined buckets

I have a series of numbers and I would like to know the % of numbers falling into every bucket of a dataframe.
df['cuts'] has 10, 20 and 50 as values. Specifically, I would like to know what % of the series falls in the [0-10], (10-20] and (20-50] bins, and this should be appended to the df dataframe.
I wrote the following code. I definitely feel that it could be improved. Any help is appreciated.
bin_cuts = [-1] + list(df['cuts'].values)
out = pd.cut(series, bins = bin_cuts)
df_pct_bins = pd.value_counts(out, normalize= True).reset_index()
df_pct_bins = pd.concat([df_pct_bins['index'].str.split(', ', expand = True), df_pct_bins['cuts']], axis = 1)
df_pct_bins[1] = df_pct_bins[1].str[:-1].astype(str)
df['cuts'] = df['cuts'].astype(str)
df_pct_bins = pd.merge(df, df_pct_bins, left_on= 'cuts', right_on= 1)
Consider the sample data df and s
df = pd.DataFrame(dict(cuts=[10, 20, 50]))
s = pd.Series(np.random.randint(50, size=1000))
Option 1
np.searchsorted
c = df.cuts.values
df.assign(
    pct=df.cuts.map(
        pd.value_counts(
            c[np.searchsorted(c, s)],
            normalize=True
        )))

   cuts    pct
0    10  0.216
1    20  0.206
2    50  0.578
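To see why Option 1 buckets correctly, note that np.searchsorted maps each sample value to the index of the first cut that is >= it, i.e. its bucket's upper edge. A quick check, using the same c as above:
c = np.array([10, 20, 50])
# 5 and 10 land in bucket 10; 11 and 20 in bucket 20; 35 in bucket 50
print (np.searchsorted(c, [5, 10, 11, 20, 35]))  # -> [0 0 1 1 2]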
Option 2
pd.cut
c = df.cuts.values
df.assign(
    pct=df.cuts.map(
        pd.cut(
            s,
            np.append(-np.inf, c),
            labels=c
        ).value_counts(normalize=True)
    ))

   cuts    pct
0    10  0.216
1    20  0.206
2    50  0.578

python2 pandas: how to merge a part of another dataframe to a dataframe

I have a dataframe (df1) as follows:
datetime m d 1d 2d 3d
2014-01-01 1 1 2 2 3
2014-01-02 1 2 3 4 3
2014-01-03 1 3 1 2 3
...........
2014-12-01 12 1 2 2 3
2014-12-31 12 31 2 2 3
I also have another dataframe (df2) as follows:
datetime m d
2015-01-02 1 2
2015-01-03 1 3
...........
2015-12-01 12 1
2015-12-31 12 31
I want to merge the 1d, 2d and 3d column values of df1 into df2.
There are two conditions:
(1) rows merge only when m and d are the same in both df1 and df2;
(2) if a df2 row's index satisfies index % 30 == 0, don't merge; the 1d 2d 3d values for those rows can be NaN.
I mean I want the new df2 dataframe to look like the following:
datetime m d 1d 2d 3d
2015-01-02 1 2 Nan Nan Nan
2015-01-03 1 3 1 2 3
...........
2015-12-01 12 1 2 2 3
2015-12-31 12 31 2 2 3
Thanks in advance!
I think you need to add NaNs by loc and then merge with a left join:
np.random.seed(10)
N = 365
rng = pd.date_range('2015-01-01', periods=N)
df_tr_2014 = pd.DataFrame(np.random.randint(10, size=(N, 3)), index=rng).reset_index()
df_tr_2014.columns = ['datetime','7d','15d','20d']
df_tr_2014.insert(1,'month', df_tr_2014['datetime'].dt.month)
df_tr_2014.insert(2,'day_m', df_tr_2014['datetime'].dt.day)
#print (df_tr_2014.head())
N = 366
rng = pd.date_range('2016-01-01', periods=N)
df_te = pd.DataFrame(index=rng)
df_te['month'] = df_te.index.month
df_te['day_m'] = df_te.index.day
df_te = df_te.reset_index()
#print (df_te.tail())
df2 = df_te.copy()
df1 = df_tr_2014.copy()
df1 = df1.set_index('datetime')
df1.index += pd.offsets.DateOffset(years=1)
#correct 29 February
y = df1.index[0].year
df1 = df1.reindex(pd.date_range(pd.datetime(y,1,1), pd.datetime(y,12,31)))
idx = df1.index[(df1.index.month == 2) & (df1.index.day == 29)]
df1.loc[idx, :] = df1.loc[idx - pd.Timedelta(1, unit='d'), :].values
df1.loc[idx, 'day_m'] = idx.day
df1[['month','day_m']] = df1[['month','day_m']].astype(int)
df1[['7d','15d', '20d']] = df1[['7d','15d', '20d']].astype(float)
df1.loc[np.arange(len(df1.index)) % 30 == 0, ['7d','15d','20d']] = 0
df1 = df1.reset_index()
print (df1.iloc[57:62])
index month day_m 7d 15d 20d
57 2016-02-27 2 27 2.0 0.0 1.0
58 2016-02-28 2 28 2.0 3.0 5.0
59 2016-02-29 2 29 2.0 3.0 5.0
60 2016-03-01 3 1 0.0 0.0 0.0
61 2016-03-02 3 2 7.0 6.0 9.0
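For reference, the left join described above, written directly against the question's column names, might look like this sketch (my own illustration; it assumes each (m, d) pair appears at most once in df1):
# bring df1's value columns into df2 by matching (m, d); unmatched rows get NaN
out = df2.merge(df1[['m', 'd', '1d', '2d', '3d']], on=['m', 'd'], how='left')
# blank out every 30th row, per condition (2)
out.loc[out.index % 30 == 0, ['1d', '2d', '3d']] = np.nan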
Why don't you just remove the rows in df1 that don't match (m, d) pairs in df2?
df_new = df2.drop(df2[(not ((df2.m == df1.m) & (df2.d == df1.d)).any()) or (df2.index % 30 == 0)].index)
Or something along those lines.
Link to a related answer.
I'm not enormously familiar with Pandas and have not tested the above example.
Here df_te is df2, df_tr_2014 is df1, and 7d, 15d, 20d correspond to 1d, 2d, 3d in the question. size_df_te is the length of df_te, and month and day_m are m and d in df2.
df_te['7d'] = 0
df_te['15d'] = 0
df_te['20d'] = 0
mj = 0
dj = 0
for i in range(size_df_te):
    if i % 30 != 0:
        m = df_te.loc[i,'month']
        d = df_te.loc[i,'day_m']
        if (m == 2) & (d == 29):
            m = 2
            d = 28
        dk_7 = df_tr_2014.loc[(df_tr_2014['month']==m) & (df_tr_2014['day_m']==d)]['7d']
        dk_15 = df_tr_2014.loc[(df_tr_2014['month']==m) & (df_tr_2014['day_m']==d)]['15d']
        dk_20 = df_tr_2014.loc[(df_tr_2014['month']==m) & (df_tr_2014['day_m']==d)]['20d']
        df_te.loc[i,'7d'] = float(dk_7)
        df_te.loc[i,'15d'] = float(dk_15)
        df_te.loc[i,'20d'] = float(dk_20)
EDIT:
Sample data:
np.random.seed(10)
N = 365
rng = pd.date_range('2014-01-01', periods=N)
df_tr_2014 = pd.DataFrame(np.random.randint(10, size=(N, 3)), index=rng).reset_index()
df_tr_2014.columns = ['datetime','7d','15d','20d']
df_tr_2014.insert(1,'month', df_tr_2014['datetime'].dt.month)
df_tr_2014.insert(2,'day_m', df_tr_2014['datetime'].dt.day)
#print (df_tr_2014.head())
N = 365
rng = pd.date_range('2015-01-01', periods=N)
df_te = pd.DataFrame(index=rng)
df_te['month'] = df_te.index.month
df_te['day_m'] = df_te.index.day
df_te = df_te.reset_index()
#print (df_te.head())