Next/Prev operation in DataFrame group-by - python-2.7

I want to get the next (second) entry from a given dataframe after grouping it by certain columns. If no such entry exists, it should return NaN/NaT, depending on the column's type. Consider the following example:
>>> df1 = pd.DataFrame({'School': {0: 'DEF', 1: 'ABC', 2: 'PQR', 3: 'DEF', 4: 'PQR', 5: 'PQR'}, 'OpenTime': {0: '08:00:00.000', 1: '09:00:00.000', 2: '10:00:23.563', 3: '09:30:05.908', 4: '07:15:50.100', 5: '08:15:00.000'}, 'CloseTime': {0: '13:00:00.000', 1: '14:00:00.000', 2: '13:30:00.100', 3: '15:00:00.768', 4: '13:00:00.500', 5: '15:50:32.534'}, 'IsTopper':{0:'1',1:'1',2:'1',3:'1',4:'1',5:'-1'}})
>>> df1
      CloseTime IsTopper      OpenTime School
0  13:00:00.000        1  08:00:00.000    DEF
1  14:00:00.000        1  09:00:00.000    ABC
2  13:30:00.100        1  10:00:23.563    PQR
3  15:00:00.768        1  09:30:05.908    DEF
4  13:00:00.500        1  07:15:50.100    PQR
5  15:50:32.534       -1  08:15:00.000    PQR
Getting the first value is simple and can be achieved with either of the following:
>>> df1.groupby(['School', 'IsTopper'])['OpenTime'].first()
OR
>>> (df1.groupby(['School', 'IsTopper'])).apply(lambda x:x.iloc[0])['OpenTime']
Getting the next (second) value using ...iloc[1] would throw an error in the above case, because some groups contain only one row.
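For example, with the df1 above the apply version fails on the single-row ('ABC', '1') group; the traceback ends with an IndexError along these lines:
>>> (df1.groupby(['School', 'IsTopper'])).apply(lambda x:x.iloc[1])['OpenTime']
IndexError: single positional indexer is out-of-bounds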
Finally, I am trying to get the following output for the above example:
  School IsTopper      OpenTime Next_OpenTime
0    DEF        1  08:00:00.000  09:30:05.908
1    ABC        1  09:00:00.000
2    PQR        1  10:00:23.563  07:15:50.100
3    DEF        1  09:30:05.908
4    PQR        1  07:15:50.100
5    PQR       -1  08:15:00.000

>>> df1['Next_OpenTime'] = (df1.groupby(['School', 'IsTopper']))['OpenTime'].shift(-1)
>>> df1
  IsTopper      OpenTime School Next_OpenTime
0        1  08:00:00.000    DEF  09:30:05.908
1        1  09:00:00.000    ABC           NaN
2        1  10:00:23.563    PQR  07:15:50.100
3        1  09:30:05.908    DEF           NaN
4        1  07:15:50.100    PQR           NaN
5       -1  08:15:00.000    PQR           NaN

Python: max occurrence of consecutive days

I have an Input file:
ID,ROLL_NO,ADM_DATE,FEES
1,12345,01/12/2016,500
2,12345,02/12/2016,200
3,987654,01/12/2016,1000
4,12345,03/12/2016,0
5,12345,04/12/2016,0
6,12345,05/12/2016,100
7,12345,06/12/2016,0
8,12345,07/12/2016,0
9,12345,08/12/2016,0
10,987654,02/12/2016,150
11,987654,03/12/2016,300
I'm trying to find the maximum count of consecutive days where FEES is 0 for a particular ROLL_NO. If FEES is never 0 on consecutive days for a ROLL_NO, the max count should be 0 for that ROLL_NO.
Expected Output:
ID,ROLL_NO,MAX_CNT -- the first occurrence of ID for a particular ROLL_NO should appear as ID in the output
1,12345,3
3,987654,0
This is what I've come up with so far:
import pandas as pd
df = pd.read_csv('I5.txt')
df['COUNT'] = df.groupby(['ROLL_NO','ADM_DATE'])['ROLL_NO'].transform(pd.Series.value_counts)
print df
But I don't believe this is the right way to approach this.
Could someone help out a python newbie out here?
You can use:
#consecutive groups
r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
print (a)
ID
1 1
2 1
3 1
4 2
5 2
6 3
7 4
8 4
9 4
10 5
11 5
dtype: int32
#filter rows with 0 FEES, count group sizes, get the max per first level, and last add missing ROLL_NOs by reindex
mask = df['FEES'].eq(0)
df = (df[mask].groupby(['ROLL_NO', a[mask]])
              .size()
              .max(level=0)
              .reindex(df['ROLL_NO'].unique(), fill_value=0)
              .reset_index(name='MAX_CNT'))
print (df)
   ROLL_NO  MAX_CNT
0    12345        3
1   987654        0
Explanation:
First compare the FEES column with 0 (eq is the same as ==) and multiply the mask by the ROLL_NO column:
mask = df['FEES'].eq(0)
r = df['ROLL_NO'] * mask
print (r)
0 0
1 0
2 0
3 12345
4 12345
5 0
6 12345
7 12345
8 12345
9 0
10 0
dtype: int64
Get consecutive groups by comparing r with its shifted self and taking the cumulative sum (ne flags every position where the value changes; cumsum turns those change points into running group ids):
a = r.ne(r.shift()).cumsum()
print (a)
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
8 4
9 5
10 5
dtype: int32
Filter only the rows with 0 in FEES and groupby with size, also filtering a to the same indexes:
print (df[mask].groupby(['ROLL_NO',a[mask]]).size())
ROLL_NO
12345    2    2
         4    3
dtype: int64
Get max values per first level of MultiIndex:
print (df[mask].groupby(['ROLL_NO',a[mask]]).size().max(level=0))
ROLL_NO
12345 3
dtype: int64
Last, add back the missing ROLL_NOs (those without any 0 FEES) by reindex:
print (df[mask].groupby(['ROLL_NO',a[mask]])
               .size()
               .max(level=0)
               .reindex(df['ROLL_NO'].unique(), fill_value=0))
ROLL_NO
12345     3
987654    0
dtype: int64
And to get columns back from the index, use reset_index.
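A forward-compatibility note (my addition, not part of the original answer): later pandas versions deprecated Series.max(level=0); an equivalent spelling there is an explicit groupby on the first index level:
print (df[mask].groupby(['ROLL_NO',a[mask]]).size().groupby(level=0).max())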
EDIT:
For the first ID, use drop_duplicates with insert and map:
r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
s = df.drop_duplicates('ROLL_NO').set_index('ROLL_NO')['ID']
mask = df['FEES'].eq(0)
df1 = (df[mask].groupby(['ROLL_NO',a[mask]])
               .size()
               .max(level=0)
               .reindex(df['ROLL_NO'].unique(), fill_value=0)
               .reset_index(name='MAX_CNT'))
df1.insert(0, 'ID', df1['ROLL_NO'].map(s))
print (df1)
ID ROLL_NO MAX_CNT
0 1 12345 3
1 3 987654 0
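For comparison, here is an alternative sketch (my addition, not from the original answer): the longest zero-run can also be computed with a plain Python helper applied per ROLL_NO. It assumes the rows are already ordered by date within each ROLL_NO, as in the input file:
def max_zero_run(fees):
    # length of the longest run of consecutive zeros
    best = cur = 0
    for v in fees:
        cur = cur + 1 if v == 0 else 0
        best = max(best, cur)
    return best

out = (df.groupby('ROLL_NO', sort=False)['FEES']
         .apply(max_zero_run)
         .reset_index(name='MAX_CNT'))
This trades the vectorized pandas idioms above for readability; on large frames the cumsum approach will be faster.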

Adding derived Timedelta to DateTime

I am trying to add NewTime, the midpoint between OpenTime and CloseTime, to my dataframe df1, and it does not seem to work. Please see the code below. Any ideas?
>>> df1 = pd.DataFrame({'School': {0: 'ABC', 1: 'DEF', 2: 'GHI', 3: 'JKL', 4: 'MNO', 5: 'PQR'}, 'OpenTime': {0: '08:00:00.000', 1: '09:00:00.000', 2: '10:00:23.563', 3: '09:30:05.908', 4: '07:15:50.100', 5: '08:15:00.000'}, 'CloseTime': {0: '13:00:00.000', 1: '14:00:00.000', 2: '13:30:00.100', 3: '15:00:00.768', 4: '13:00:00.500', 5: '14:15:00.000'}, 'TimeZone':{0:'Europe/Vienna',1:'Europe/Brussels',2:'Europe/London',3:'Pacific/Auckland' ,4:'Asia/Seoul',5:'Europe/London'}})
>>> df1['OpenTime'] = pd.to_datetime(df1['OpenTime'])
>>> df1['CloseTime'] = pd.to_datetime(df1['CloseTime'])
>>> df1['Offset'] = df1.apply(lambda x:1/2*(x['CloseTime'] - x['OpenTime']), axis=1)
>>> df1
CloseTime OpenTime School TimeZone Offset
0 2016-11-22 13:00:00.000 2016-11-22 08:00:00.000 ABC Europe/Vienna 0 days
1 2016-11-22 14:00:00.000 2016-11-22 09:00:00.000 DEF Europe/Brussels 0 days
2 2016-11-22 13:30:00.100 2016-11-22 10:00:23.563 GHI Europe/London 0 days
3 2016-11-22 15:00:00.768 2016-11-22 09:30:05.908 JKL Pacific/Auckland 0 days
4 2016-11-22 13:00:00.500 2016-11-22 07:15:50.100 MNO Asia/Seoul 0 days
5 2016-11-22 14:15:00.000 2016-11-22 08:15:00.000 PQR Europe/London 0 days
>>> df1['NewTime'] = df1['OpenTime'] + df1['Offset']
>>> df1
CloseTime OpenTime School TimeZone Offset NewTime
0 2016-11-22 13:00:00.000 2016-11-22 08:00:00.000 ABC Europe/Vienna 0 days 2016-11-22 08:00:00.000
1 2016-11-22 14:00:00.000 2016-11-22 09:00:00.000 DEF Europe/Brussels 0 days 2016-11-22 09:00:00.000
2 2016-11-22 13:30:00.100 2016-11-22 10:00:23.563 GHI Europe/London 0 days 2016-11-22 10:00:23.563
3 2016-11-22 15:00:00.768 2016-11-22 09:30:05.908 JKL Pacific/Auckland 0 days 2016-11-22 09:30:05.908
4 2016-11-22 13:00:00.500 2016-11-22 07:15:50.100 MNO Asia/Seoul 0 days 2016-11-22 07:15:50.100
5 2016-11-22 14:15:00.000 2016-11-22 08:15:00.000 PQR Europe/London 0 days 2016-11-22 08:15:00.000
>>>
However, if I remove the 1/2 from my lambda function it seems to work. So essentially I am not able to multiply/divide a timedelta by a number.
It is quite critical for me to use a lambda function because I am doing this iteratively to generate many times (not just the mid-time).
Did you try
df1['Offset'] = df1.apply(lambda x: (x['CloseTime'] - x['OpenTime']) / 2, axis=1)
I just did that in my console and it worked fine. Putting the 1/2 in front is what causes the problem: in Python 2, 1/2 is integer division and evaluates to 0, so your lambda multiplies every timedelta by zero.
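A minimal sketch of the pitfall itself, assuming Python 2 semantics:
from datetime import timedelta

delta = timedelta(hours=5)
print (1/2 * delta)   # 0:00:00 -- 1/2 is integer division and equals 0
print (delta / 2)     # 2:30:00 -- dividing the timedelta by an int works
A vectorized alternative avoids apply altogether and sidesteps the issue:
df1['NewTime'] = df1['OpenTime'] + (df1['CloseTime'] - df1['OpenTime']) / 2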

pandas dataframe update or set column[y] = x where column[z] = 'abc'

I'm new to python pandas and haven't found an answer to this in the documentation. I have an existing dataframe and I've added a new column Y. I want to set the value of column Y to 'abc' in all rows in which column Z = 'xyz'. In SQL this would be a simple
update table set colY = 'abc' where colZ = 'xyz'
Is there a similar way to do this update in pandas?
Thanks!
You can use loc, or numpy.where if you need to set another value too:
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[1,2,3],
                   'Z':['xyz',5,6],
                   'C':[7,8,9]})
print (df)
C X Z
0 7 1 xyz
1 8 2 5
2 9 3 6
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
print (df)
C X Z Y
0 7 1 xyz abc
1 8 2 5 NaN
2 9 3 6 NaN
df['Y1'] = np.where(df.Z == 'xyz', 'abc', 'klm')
print (df)
C X Z Y Y1
0 7 1 xyz abc abc
1 8 2 5 NaN klm
2 9 3 6 NaN klm
You can use another column's values as the replacement too:
df['Y2'] = np.where(df.Z == 'xyz', 'abc', df.C)
print (df)
C X Z Y Y2
0 7 1 xyz abc abc
1 8 2 5 NaN 8
2 9 3 6 NaN 9
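For completeness, Series.where can express the same conditional update; this is a small sketch of my own, assuming column Y already exists. Unlike the loc assignment, it keeps the existing values in the untouched rows instead of leaving NaN:
df['Y'] = df['Y'].where(df.Z != 'xyz', 'abc')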

Conditional join with pandas.DataFrame

I am trying to partially join two dataframes:
import pandas
import numpy
entry1= pandas.datetime(2014,6,1)
entry2= pandas.datetime(2014,6,2)
df1=pandas.DataFrame(numpy.array([[1,1],[2,2],[3,3],[3,3]]), columns=['zick','zack'], index=[entry1, entry1, entry2, entry2])
df2=pandas.DataFrame(numpy.array([[2,3],[3,3]]), columns=['eins','zwei'], index=[entry1, entry2])
I tried
df1 = df1[(df1['zick']>= 2) & (df1['zick'] < 4)].join(df2['eins'])
but this doesn't work. After joining, the values of df1['eins'] are expected to be [NaN, 2, 3, 3].
How can I do it? I'd like to do it in place, without copies of the dataframes.
I think this is what you actually meant to use:
df1 = df1.join(df2['eins'])
mask = (df1['zick']>= 2) & (df1['zick'] < 4)
df1.loc[~mask, 'eins'] = numpy.nan
df1
yielding:
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
The issue you were having is that you were joining the filtered dataframe, not the original one, so there was no place for NaN to appear (every remaining cell satisfied your filter).
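The same result can be written slightly more compactly (a sketch equivalent to the three lines above); Series.where with no replacement argument leaves NaN wherever the mask is False:
df1 = df1.join(df2['eins'])
df1['eins'] = df1['eins'].where((df1['zick'] >= 2) & (df1['zick'] < 4))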
EDIT:
Considering new inputs in the comments below, here is another approach.
Create an empty column that will later be updated with values from the second dataframe:
df1['eins'] = numpy.nan
print df1
print df2
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 NaN
2014-06-02 3 3 NaN
2014-06-02 3 3 NaN
eins zwei
2014-06-01 2 3
2014-06-02 3 3
Set the filter and make the values in the column to be updated ('eins') that satisfy the filter equal to 0:
mask = (df1['zick']>= 2) & (df1['zick'] < 4)
df1.loc[(mask & (df1['eins'].isnull())), 'eins'] = 0
print df1
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 0
2014-06-02 3 3 0
2014-06-02 3 3 0
Update your df1 in place with df2's values (only values equal to 0 will be updated):
df1.update(df2, filter_func=lambda x: x == 0)
print df1
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
Now if you want to change the filter and do the update again it will not change previously updated values:
mask = (df1['zick']>= 1) & (df1['zick'] == 1)
df1.loc[(mask & (df1['eins'].isnull())), 'eins'] = 0
print df1
zick zack eins
2014-06-01 1 1 0
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
df1.update(df2, filter_func=lambda x: x == 0)
print df1
zick zack eins
2014-06-01 1 1 2
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3

Get row-index of the last non-NaN value in each column of a pandas data frame

How can I return the row index location of the last non-NaN value for each column of a pandas dataframe, and return the locations as a pandas dataframe?
Use notnull, and specifically idxmax, to get the index values of the non-NaN values:
In [22]:
df = pd.DataFrame({'a':[0,1,2,np.NaN], 'b':[np.NaN, 1, np.NaN, 3]})
df
Out[22]:
a b
0 0 NaN
1 1 1
2 2 NaN
3 NaN 3
In [29]:
df[pd.notnull(df)].idxmax()
Out[29]:
a 2
b 3
dtype: int64
EDIT
Actually, as correctly pointed out by @Caleb, you can use last_valid_index, which is designed for exactly this (idxmax returns the index of the maximum value, which only coincidentally matched the last valid row in the example above):
In [3]:
df = pd.DataFrame({'a':[3,1,2,np.NaN], 'b':[np.NaN, 1,np.NaN, -1]})
df
Out[3]:
a b
0 3 NaN
1 1 1
2 2 NaN
3 NaN -1
In [6]:
df.apply(pd.Series.last_valid_index)
Out[6]:
a 2
b 3
dtype: int64
If you want the row index of the last non-NaN (and non-None) value, here is a one-liner:
>>> df = pd.DataFrame({
...     'a':[5,1,2,np.NaN],
...     'b':[np.NaN, 6, np.NaN, 3]})
>>> df
a b
0 5 NaN
1 1 6
2 2 NaN
3 NaN 3
>>> df.apply(lambda column: column.dropna().index[-1])
a 2
b 3
dtype: int64
Explanation:
df.apply in this context applies a function to each column of the dataframe. I am passing it a function that takes a column as its argument and returns the column's last non-null index.
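Since the question asked for the locations as a pandas dataframe, one more minimal sketch (the column name last_valid_row is my choice) wraps the resulting Series:
result = df.apply(pd.Series.last_valid_index).to_frame(name='last_valid_row')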