percentage bins based on predefined buckets - python-2.7

I have a series of numbers and I would like to know what % of the numbers falls into each bucket defined in a dataframe.
df['cuts'] has 10, 20 and 50 as values. Specifically, I would like to know what % of the series is in the [0-10], (10-20] and (20-50] bins, and this should be appended to the df dataframe.
I wrote the following code. I definitely feel that it could be improved. Any help is appreciated.
import pandas as pd

bin_cuts = [-1] + list(df['cuts'].values)  # prepend a lower edge so pd.cut covers 0
out = pd.cut(series, bins=bin_cuts)
df_pct_bins = pd.value_counts(out, normalize=True).reset_index()  # share per interval
# split the interval label "(lo, hi]" into two columns, keep the upper edge as text
df_pct_bins = pd.concat([df_pct_bins['index'].astype(str).str.split(', ', expand=True), df_pct_bins['cuts']], axis=1)
df_pct_bins[1] = df_pct_bins[1].str[:-1].astype(str)
df['cuts'] = df['cuts'].astype(str)
# attach the percentages back onto df by matching the upper edge
df_pct_bins = pd.merge(df, df_pct_bins, left_on='cuts', right_on=1)

Consider the sample data df and s:
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(cuts=[10, 20, 50]))
s = pd.Series(np.random.randint(50, size=1000))
Option 1
np.searchsorted

c = df.cuts.values
df.assign(
    pct=df.cuts.map(
        pd.value_counts(
            c[np.searchsorted(c, s)],
            normalize=True
        )))

   cuts    pct
0    10  0.216
1    20  0.206
2    50  0.578
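For intuition on the searchsorted step, here is a small illustrative sketch (the example values are invented): np.searchsorted returns, for each value, the index of the first cut it could be inserted before while keeping c sorted, so indexing c with that result labels every value with its bucket's upper edge.

import numpy as np

c = np.array([10, 20, 50])
vals = np.array([5, 10, 11, 20, 35])
idx = np.searchsorted(c, vals)  # side='left' by default, so edge values stay in their own bucket
print(idx)     # [0 0 1 1 2]
print(c[idx])  # [10 10 20 20 50] -> each value tagged with its bucket's upper edge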
Option 2
pd.cut

c = df.cuts.values
df.assign(
    pct=df.cuts.map(
        pd.cut(
            s,
            np.append(-np.inf, c),
            labels=c
        ).value_counts(normalize=True)
    ))

   cuts    pct
0    10  0.216
1    20  0.206
2    50  0.578
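As a quick boundary check (an illustrative sketch with made-up values): pd.cut's intervals are right-closed by default, so a value sitting exactly on an edge falls into the lower bucket, which matches the [0-10], (10-20] and (20-50] scheme in the question.

import numpy as np
import pandas as pd

edges = np.array([10, 20, 50])
binned = pd.cut(pd.Series([10, 11, 20, 21]), np.append(-np.inf, edges), labels=edges)
print(list(binned))  # [10, 20, 20, 50]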

Related

Create list from pandas dataframe

I have a function that takes all (non-distinct) MatchId rows and the paired (xG_Team1, xG_Team2) values and gives an output as an array, which is then summed up to an sse value.
The problem with the function is that it iterates through each row, duplicating MatchId. I want to stop this.
For each distinct MatchId I need the corresponding home and away goal times as lists, i.e. Home_Goal and Away_Goal to be used in each iteration, taken from the Home_Goal_Time and Away_Goal_Time columns of the dataframe. The list below doesn't seem to work.
MatchId Event_Id EventCode Team1 Team2 Team1_Goals
0 842079 2053 Goal Away Huachipato Cobresal 0
1 842079 2053 Goal Away Huachipato Cobresal 0
2 842080 1029 Goal Home Slovan lava 3
3 842080 1029 Goal Home Slovan lava 3
4 842080 2053 Goal Away Slovan lava 3
5 842080 1029 Goal Home Slovan lava 3
6 842634 2053 Goal Away Rosario Boca Juniors 0
7 842634 2053 Goal Away Rosario Boca Juniors 0
8 842634 2053 Goal Away Rosario Boca Juniors 0
9 842634 2054 Cancel Goal Away Rosario Boca Juniors 0
Team2_Goals xG_Team1 xG_Team2 CurrentPlaytime Home_Goal_Time Away_Goal_Time
0 2 1.79907 1.19893 2616183 0 87
1 2 1.79907 1.19893 3436780 0 115
2 1 1.70662 1.1995 3630545 121 0
3 1 1.70662 1.1995 4769519 159 0
4 1 1.70662 1.1995 5057143 0 169
5 1 1.70662 1.1995 5236213 175 0
6 2 0.82058 1.3465 2102264 0 70
7 2 0.82058 1.3465 4255871 0 142
8 2 0.82058 1.3465 5266652 0 176
9 2 0.82058 1.3465 5273611 0 0
For example MatchId = 842079, Home_goal =[], Away_Goal = [87, 115]
x1 = [1, 0, 0]
x2 = [0, 1, 0]
x3 = [0, 0, 1]
m = 1  # arbitrary constant used to optimise sse
k = 196
total_timeslot = 196
Home_Goal = []  # No Goal
Away_Goal = []  # No Goal

def sum_squared_diff(x1, x2, x3, y):
    ssd = []
    for k in range(total_timeslot):  # k will take multiple values
        if k in Home_Goal:
            ssd.append(sum((x2 - y) ** 2))
        elif k in Away_Goal:
            ssd.append(sum((x3 - y) ** 2))
        else:
            ssd.append(sum((x1 - y) ** 2))
    return ssd

def my_function(row):
    xG_Team1 = row.xG_Team1
    xG_Team2 = row.xG_Team2
    return np.array([1 - (xG_Team1*m + xG_Team2*m)/k, xG_Team1*m/k, xG_Team2*m/k])

results = df.apply(lambda row: sum_squared_diff(x1, x2, x3, my_function(row)), axis=1)
results
sum(results.sum())
For the three matches above the desired outcome should look like the following.
If I need an individual sse, sum(sum_squared_diff(x1, x2, x3, y)) gives me the following:
MatchId = 842079: 3.984053038520635
MatchId = 842080: 7.882189570700502
MatchId = 842634: 5.929085973050213
Given the size of the original data, realistically I am after the total sum of the sse. For the above sample data, simply adding up the values gives total sse = 17.79532858227135. Once I achieve this, I will try to optimise the sse based on this figure by updating the arbitrary value m.
Here are the lists I hoped the function would iterate over:
Home_scored = s.groupby('MatchId')['Home_Goal_Time'].apply(list)
Away_scored = s.groupby('MatchId')['Away_Goal_Time'].apply(list)
type(Home_scored)
pandas.core.series.Series
Then convert them to lists:
Home_Goal = Home_scored.tolist()
Away_Goal = Away_scored.tolist()
type(Home_Goal)
list
Home_Goal
Out[303]: [[0, 0], [121, 159, 0, 175], [0, 0, 0, 0]]
Away_Goal
Out[304]: [[87, 115], [0, 0, 169, 0], [70, 142, 176, 0]]
But the function still takes Home_Goal and Away_Goal as empty lists.
If you only want to consider one MatchId at a time, you should .groupby('MatchId') first:
df.groupby('MatchId').apply(...)
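A minimal sketch of that idea, reusing my_function, x1, x2, x3 and total_timeslot from the question; the helper name match_sse and the treatment of 0 as "no goal" are assumptions:

import numpy as np

def match_sse(grp):
    # goal times for this match only; 0 is assumed to mean "no goal"
    home_goal = [t for t in grp['Home_Goal_Time'] if t != 0]
    away_goal = [t for t in grp['Away_Goal_Time'] if t != 0]
    y = my_function(grp.iloc[0])  # xG values are constant within a match
    ssd = 0.0
    for k in range(total_timeslot):
        if k in home_goal:
            ssd += sum((np.array(x2) - y) ** 2)
        elif k in away_goal:
            ssd += sum((np.array(x3) - y) ** 2)
        else:
            ssd += sum((np.array(x1) - y) ** 2)
    return ssd

per_match = df.groupby('MatchId').apply(match_sse)  # one sse per match
total_sse = per_match.sum()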

load multiple csv files into DataFrame: column names issue

I have multiple csv files with the same format (14 rows, 4 columns).
I tried to load all of them into a single DataFrame, and to use each file's name to rename the values of the first column (1-14):
1 500 0 0
2 350 0 1
3 500 1 0
.............
13 600 0 0
14 800 0 0
I tried the following code but I am not getting what I am expecting:
filenames = os.listdir('Threshold/')
Y = pd.DataFrame()  # empty df
# file names are in the following format: "subx_ICA_thre.csv"
# need to get x (subject number to be used later for renaming column values)
Sub_list = []
for filename in filenames:
    s = int(''.join(filter(str.isdigit, filename)))
    Sub_list.append(int(s))
S_Sub_list = sorted(Sub_list)

for x in S_Sub_list:  # get the file according to the subject number
    temp = pd.read_csv('sub' + str(x) + '_ICA_thre.csv')
    df = pd.concat([Y, temp])  # concat the obtained frame with the empty frame
    df.columns = ['id', 'data', 'isEB', 'isEM']
    # replace the column values using subject id
    for sub in range(1, 15):
        df['id'].replace(sub, 'sub' + str(x) + '_ICA_' + str(sub), inplace=True)
    print(df)
output:
id data isEB isEM
0 sub1_ICA_2 200 0 0
1 sub1_ICA_3 275 0 0
2 sub1_ICA_4 500 1 0
................................
11 sub1_ICA_13 275 0 0
12 sub1_ICA_14 300 0 0
id data isEB isEM
0 sub2_ICA_2 275 0 0
1 sub2_ICA_3 500 0 0
2 sub2_ICA_4 400 0 0
.................................
11 sub2_ICA_13 300 0 0
12 sub2_ICA_14 450 0 0
First, it seems that the code makes separate DataFrames, not a single one. Second, the first row is removed (sub1_ICA_1 is missing; it may have been consumed as the header row).
I couldn't find the problem in the loop that I am using.
I think you need to create a list of DataFrames first, then concat with the keys parameter so the subject numbers become a new MultiIndex level, then modify column id, and finally remove the MultiIndex by reset_index.
The names parameter was also added to read_csv for custom column names.
n = ['id', 'data', 'isEB', 'isEM']
Y = []
for x in S_Sub_list:
    temp = pd.read_csv('sub' + str(x) + '_ICA_thre.csv', names=n)
    Y.append(temp)

#list comprehension alternative
#Y = [pd.read_csv('sub' + str(x) + '_ICA_thre.csv', names=n) for x in S_Sub_list]

df = pd.concat(Y, keys=range(1, len(S_Sub_list) + 1))
df['id'] = 'sub' + df.index.get_level_values(0).astype(str) + '_ICA_' + df['id'].astype(str)
df = df.reset_index(drop=True)
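For intuition, a self-contained sketch of the keys trick with two in-memory frames instead of CSV files (the data values are invented for the demo); the first MultiIndex level carries the subject number:

import pandas as pd

a = pd.DataFrame({'id': [1, 2], 'data': [500, 350], 'isEB': [0, 0], 'isEM': [0, 1]})
b = pd.DataFrame({'id': [1, 2], 'data': [275, 500], 'isEB': [0, 0], 'isEM': [0, 0]})

df = pd.concat([a, b], keys=[1, 2])  # level 0 of the index is now the subject number
df['id'] = 'sub' + df.index.get_level_values(0).astype(str) + '_ICA_' + df['id'].astype(str)
print(df.reset_index(drop=True)['id'].tolist())
# ['sub1_ICA_1', 'sub1_ICA_2', 'sub2_ICA_1', 'sub2_ICA_2']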

python2 pandas: how to merge part of another dataframe into a dataframe

I have a dataframe (df1) as follows:
datetime m d 1d 2d 3d
2014-01-01 1 1 2 2 3
2014-01-02 1 2 3 4 3
2014-01-03 1 3 1 2 3
...........
2014-12-01 12 1 2 2 3
2014-12-31 12 31 2 2 3
I also have another dataframe (df2) as follows:
datetime m d
2015-01-02 1 2
2015-01-03 1 3
...........
2015-12-01 12 1
2015-12-31 12 31
I want to merge the 1d, 2d and 3d column values of df1 into df2.
There are two conditions:
(1) rows merge only where m and d are the same in both df1 and df2;
(2) if the index of df2 satisfies index % 30 == 0, don't merge; the 1d, 2d, 3d values at these indices can be NaN.
That is, I want the new df2 to look like the following:
datetime m d 1d 2d 3d
2015-01-02 1 2 NaN NaN NaN
2015-01-03 1 3 1 2 3
...........
2015-12-01 12 1 2 2 3
2015-12-31 12 31 2 2 3
Thanks in advance!
I think you need to add NaNs by loc and then merge with a left join:
import numpy as np
import pandas as pd

np.random.seed(10)
N = 365
rng = pd.date_range('2015-01-01', periods=N)
df_tr_2014 = pd.DataFrame(np.random.randint(10, size=(N, 3)), index=rng).reset_index()
df_tr_2014.columns = ['datetime','7d','15d','20d']
df_tr_2014.insert(1,'month', df_tr_2014['datetime'].dt.month)
df_tr_2014.insert(2,'day_m', df_tr_2014['datetime'].dt.day)
#print (df_tr_2014.head())
N = 366
rng = pd.date_range('2016-01-01', periods=N)
df_te = pd.DataFrame(index=rng)
df_te['month'] = df_te.index.month
df_te['day_m'] = df_te.index.day
df_te = df_te.reset_index()
#print (df_te.tail())
df2 = df_te.copy()
df1 = df_tr_2014.copy()
df1 = df1.set_index('datetime')
df1.index += pd.offsets.DateOffset(years=1)
#correct 29 February
y = df1.index[0].year
df1 = df1.reindex(pd.date_range(pd.datetime(y,1,1), pd.datetime(y,12,31)))
idx = df1.index[(df1.index.month == 2) & (df1.index.day == 29)]
df1.loc[idx, :] = df1.loc[idx - pd.Timedelta(1, unit='d'), :].values
df1.loc[idx, 'day_m'] = idx.day
df1[['month','day_m']] = df1[['month','day_m']].astype(int)
df1[['7d','15d', '20d']] = df1[['7d','15d', '20d']].astype(float)
df1.loc[np.arange(len(df1.index)) % 30 == 0, ['7d','15d','20d']] = 0
df1 = df1.reset_index()
print (df1.iloc[57:62])
index month day_m 7d 15d 20d
57 2016-02-27 2 27 2.0 0.0 1.0
58 2016-02-28 2 28 2.0 3.0 5.0
59 2016-02-29 2 29 2.0 3.0 5.0
60 2016-03-01 3 1 0.0 0.0 0.0
61 2016-03-02 3 2 7.0 6.0 9.0
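The left-join step the prose describes could then look like the following sketch; the column subset and the NaN masking of every 30th df2 row are assumptions based on condition (2) in the question:

df2 = pd.merge(df2, df1[['month', 'day_m', '7d', '15d', '20d']],
               on=['month', 'day_m'], how='left')
df2.loc[df2.index % 30 == 0, ['7d', '15d', '20d']] = np.nan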
Why don't you just remove the rows in df2 that don't match (m, d) pairs in df1?
df_new = df2.drop(df2[~((df2.m == df1.m) & (df2.d == df1.d)) | (df2.index % 30 == 0)].index)
Or something along those lines.
Link to a related answer.
I'm not enormously familiar with Pandas and have not tested the above example.
df_te is df2.
df_tr_2014 is df1.
7d, 15d, 20d are 1d, 2d, 3d respectively in the question; size_df_te is the length of df_te, and month and day_m are m, d in df2.
size_df_te = len(df_te)  # as described above

df_te['7d'] = 0
df_te['15d'] = 0
df_te['20d'] = 0

for i in range(size_df_te):
    if i % 30 != 0:
        m = df_te.loc[i, 'month']
        d = df_te.loc[i, 'day_m']
        if (m == 2) & (d == 29):
            m = 2
            d = 28
        dk_7 = df_tr_2014.loc[(df_tr_2014['month'] == m) & (df_tr_2014['day_m'] == d)]['7d']
        dk_15 = df_tr_2014.loc[(df_tr_2014['month'] == m) & (df_tr_2014['day_m'] == d)]['15d']
        dk_20 = df_tr_2014.loc[(df_tr_2014['month'] == m) & (df_tr_2014['day_m'] == d)]['20d']
        df_te.loc[i, '7d'] = float(dk_7)
        df_te.loc[i, '15d'] = float(dk_15)
        df_te.loc[i, '20d'] = float(dk_20)
EDIT:
Sample data:
np.random.seed(10)
N = 365
rng = pd.date_range('2014-01-01', periods=N)
df_tr_2014 = pd.DataFrame(np.random.randint(10, size=(N, 3)), index=rng).reset_index()
df_tr_2014.columns = ['datetime','7d','15d','20d']
df_tr_2014.insert(1,'month', df_tr_2014['datetime'].dt.month)
df_tr_2014.insert(2,'day_m', df_tr_2014['datetime'].dt.day)
#print (df_tr_2014.head())
N = 365
rng = pd.date_range('2015-01-01', periods=N)
df_te = pd.DataFrame(index=rng)
df_te['month'] = df_te.index.month
df_te['day_m'] = df_te.index.day
df_te = df_te.reset_index()
#print (df_te.head())

Python remove outliers from data

I have a data frame as follows:
ID Value
A 70
A 80
B 75
C 10
B 50
A 1000
C 60
B 2000
.. ..
I would like to group this data by ID, remove the outliers from the grouped data (the ones we see in the boxplot) and then calculate the mean.
So far
grouped = df.groupby('ID')
statBefore = pd.DataFrame({'mean': grouped['Value'].mean(), 'median': grouped['Value'].median(), 'std' : grouped['Value'].std()})
How can I find the outliers, remove them and get the statistics?
I believe the method you're referring to is to remove values more than 1.5 * the interquartile range away from the median. So first, calculate your initial statistics:
statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25), \
'median': grouped['Value'].median(), 'q3' : grouped['Value'].quantile(.75)})
And then determine whether values in the original DF are outliers:
def is_outlier(row):
    iq_range = statBefore.loc[row.ID]['q3'] - statBefore.loc[row.ID]['q1']
    median = statBefore.loc[row.ID]['median']
    if row.Value > (median + (1.5 * iq_range)) or row.Value < (median - (1.5 * iq_range)):
        return True
    else:
        return False

#apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis=1)
#filter to only non-outliers:
df_no_outliers = df[~(df.outlier)]
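After filtering, the grouped statistics the question asks for could be recomputed like this (a small follow-up sketch, not part of the original answer):

stat_after = df_no_outliers.groupby('ID')['Value'].agg(['mean', 'median', 'std'])
print(stat_after)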
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
# keep rows inside the Tukey fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
data = df[~((df['Value'] < (Q1 - 1.5 * IQR)) | (df['Value'] > (Q3 + 1.5 * IQR)))]
Just do:
In [187]: df[df < 100].groupby('ID').agg(['mean', 'median', 'std'])
Out[187]:
   Value
    mean  median        std
ID
A   75.0    75.0   7.071068
B   62.5    62.5  17.677670
C   35.0    35.0  35.355339

Subtract value in one data frame from the next value in a second data frame

I have a data frame that is composed of several datasets (about 146 and counting). Two of my columns are labeled "start_time" and "stop_time", which represent the start and stop of a response (i.e., the total duration of the response).
I need to get the "inter-response time", i.e. each stop_time subtracted from the next value in start_time. Basically, if:
start_time = [1,4,7]
stop_time = [2,5,8]
I need:
start_time[1] - stop_time[0]
start_time[2] - stop_time[1]
in order to get:
iri = [2,2]
My code looks like this:
iri_t = []
def grps():
    for grp in lset2_name_grps.groups:
        beg_eng_t = pd.DataFrame([lset2_name_grps.stop_time, lset2_name_grps.start_time], columns=['end_t', 'beg_t'])
        end_t = [i for i in lset2_name_grps.stop_time]
        beg_t = [i for i in lset2_name_grps.start_time]
        beg_t = np.insert(beg_t, len(beg_t), 0)
        end_t = np.insert(end_t, 0, 0)
        iri_t.append(np.subtract(end_t, beg_t))
        # for i, j in zip(end_t, beg_t):
        #     iri_t.append(np.subtract(i, j))
        # lset2_name_grps['iri'] = iri_t
grps()
Essentially, it doesn't do anything close to what I'm trying to accomplish and the only out I get is either "Not Implemented" or an error.
How about something like this:
import pandas as pd

starts = pd.Series([1, 4, 7])
stops = pd.Series([2, 5, 8])
iri_t = [0]
for i in range(1, len(starts)):
    iri_t.append(starts[i] - stops[i - 1])
times_df = pd.concat([starts, stops, pd.Series(iri_t)], axis=1)
This creates the following data frame:
   0  1  2
0  1  2  0
1  4  5  2
2  7  8  2
I think what you're asking (correct me if I'm wrong) is best accomplished by putting the two columns in a single dataframe, using shift to offset one of your columns, then doing an ordinary subtraction:
df = pd.DataFrame({'start_time': [1, 4, 7], 'stop_time': [2, 5, 8]})
df.start_time - df.stop_time.shift()
Out[5]:
0    NaN
1    2.0
2    2.0
dtype: float64
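If the result should live alongside the original data, it could be attached as a column (a small usage sketch under the same assumptions):

df['iri'] = df.start_time - df.stop_time.shift()  # first row is NaN: no prior response
print(df)
#    start_time  stop_time  iri
# 0           1          2  NaN
# 1           4          5  2.0
# 2           7          8  2.0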