Adding derived Timedelta to DateTime - python-2.7

I am trying to add NewTime, the mid-time between OpenTime and CloseTime, to my dataframe df1, and it doesn't seem to be working. Please see the code below. Any ideas?
>>> df1 = pd.DataFrame({'School': {0: 'ABC', 1: 'DEF', 2: 'GHI', 3: 'JKL', 4: 'MNO', 5: 'PQR'}, 'OpenTime': {0: '08:00:00.000', 1: '09:00:00.000', 2: '10:00:23.563', 3: '09:30:05.908', 4: '07:15:50.100', 5: '08:15:00.000'}, 'CloseTime': {0: '13:00:00.000', 1: '14:00:00.000', 2: '13:30:00.100', 3: '15:00:00.768', 4: '13:00:00.500', 5: '14:15:00.000'}, 'TimeZone':{0:'Europe/Vienna',1:'Europe/Brussels',2:'Europe/London',3:'Pacific/Auckland' ,4:'Asia/Seoul',5:'Europe/London'}})
>>> df1['OpenTime'] = pd.to_datetime(df1['OpenTime'])
>>> df1['CloseTime'] = pd.to_datetime(df1['CloseTime'])
>>> df1['Offset'] = df1.apply(lambda x:1/2*(x['CloseTime'] - x['OpenTime']), axis=1)
>>> df1
CloseTime OpenTime School TimeZone Offset
0 2016-11-22 13:00:00.000 2016-11-22 08:00:00.000 ABC Europe/Vienna 0 days
1 2016-11-22 14:00:00.000 2016-11-22 09:00:00.000 DEF Europe/Brussels 0 days
2 2016-11-22 13:30:00.100 2016-11-22 10:00:23.563 GHI Europe/London 0 days
3 2016-11-22 15:00:00.768 2016-11-22 09:30:05.908 JKL Pacific/Auckland 0 days
4 2016-11-22 13:00:00.500 2016-11-22 07:15:50.100 MNO Asia/Seoul 0 days
5 2016-11-22 14:15:00.000 2016-11-22 08:15:00.000 PQR Europe/London 0 days
>>> df1['NewTime'] = df1['OpenTime'] + df1['Offset']
>>> df1
CloseTime OpenTime School TimeZone Offset NewTime
0 2016-11-22 13:00:00.000 2016-11-22 08:00:00.000 ABC Europe/Vienna 0 days 2016-11-22 08:00:00.000
1 2016-11-22 14:00:00.000 2016-11-22 09:00:00.000 DEF Europe/Brussels 0 days 2016-11-22 09:00:00.000
2 2016-11-22 13:30:00.100 2016-11-22 10:00:23.563 GHI Europe/London 0 days 2016-11-22 10:00:23.563
3 2016-11-22 15:00:00.768 2016-11-22 09:30:05.908 JKL Pacific/Auckland 0 days 2016-11-22 09:30:05.908
4 2016-11-22 13:00:00.500 2016-11-22 07:15:50.100 MNO Asia/Seoul 0 days 2016-11-22 07:15:50.100
5 2016-11-22 14:15:00.000 2016-11-22 08:15:00.000 PQR Europe/London 0 days 2016-11-22 08:15:00.000
>>>
However, if I remove the 1/2 from my lambda function, it works. So essentially I am not able to multiply or divide a timedelta by a number.
It is quite critical for me to use a lambda function because I am doing this iteratively to generate many derived times (not just the mid-time).

Did you try
df1['Offset'] = df1.apply(lambda x:((x['CloseTime'] - x['OpenTime']))/2, axis=1)
I just did that in my console and it worked fine. The problem is the 1/2 itself: under Python 2, / between two integers is integer division, so 1/2 evaluates to 0 and your original lambda multiplies every timedelta by zero. Dividing the timedelta by 2 instead avoids the integer division.
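For completeness, a minimal sketch of the usual fixes, assuming the same df1 as above:
from __future__ import division #makes 1/2 evaluate to 0.5 on Python 2
df1['Offset'] = df1.apply(lambda x:1/2*(x['CloseTime'] - x['OpenTime']), axis=1)
#or multiply by a float literal instead of relying on division:
df1['Offset'] = df1.apply(lambda x:0.5*(x['CloseTime'] - x['OpenTime']), axis=1)
#or skip apply entirely, since timedelta arithmetic is vectorized:
df1['NewTime'] = df1['OpenTime'] + (df1['CloseTime'] - df1['OpenTime'])/2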

Selecting dataframe rows based on values in other dataframe

I have the following two dataframes:
df1:
name
abc
lmn
pqr
df2:
m_name n_name loc
abc tyu IND
bcd abc RSA
efg poi SL
lmn ert AUS
nne bnm ENG
pqr lmn NZ
xyz asd BAN
I want to generate a new dataframe on the following conditions:
keep a row if df2.m_name or df2.n_name matches any df1.name
eliminate duplicate rows
Following is the desired output:
m_name n_name loc
abc tyu IND
bcd abc RSA
lmn ert AUS
pqr lmn NZ
Can I get any suggestions on how to achieve this?
Use (note df2 here has an extra duplicate row, to demonstrate the dedup step):
print (df2)
m_name n_name loc
0 abc tyu IND
1 abc tyu IND
2 bcd abc RSA
3 efg poi SL
4 lmn ert AUS
5 nne bnm ENG
6 pqr lmn NZ
7 xyz asd BAN
df3 = df2.filter(like='name')
#another solution is to filter columns by column names in a list
#df3 = df2[['m_name','n_name']]
df = df2[df3.isin(df1['name'].tolist()).any(axis=1)]
df = df.drop_duplicates(df3.columns)
print (df)
m_name n_name loc
0 abc tyu IND
2 bcd abc RSA
4 lmn ert AUS
6 pqr lmn NZ
Details:
Select all columns containing 'name' by filter:
print (df2.filter(like='name'))
m_name n_name
0 abc tyu
1 abc tyu
2 bcd abc
3 efg poi
4 lmn ert
5 nne bnm
6 pqr lmn
7 xyz asd
Compare by DataFrame.isin:
print (df2.filter(like='name').isin(df1['name'].tolist()))
m_name n_name
0 True False
1 True False
2 False True
3 False False
4 True False
5 False False
6 True True
7 False False
Get at least one True per row by any:
print (df2.filter(like='name').isin(df1['name'].tolist()).any(axis=1))
0 True
1 True
2 True
3 False
4 True
5 False
6 True
7 False
dtype: bool
Filter by boolean indexing:
df = df2[df2.filter(like='name').isin(df1['name'].tolist()).any(axis=1)]
print (df)
m_name n_name loc
0 abc tyu IND
1 abc tyu IND
2 bcd abc RSA
4 lmn ert AUS
6 pqr lmn NZ
And finally remove duplicates with drop_duplicates (if you need to remove dupes by all the name columns, add the subset parameter):
df = df.drop_duplicates(subset=df3.columns)
print (df)
m_name n_name loc
0 abc tyu IND
2 bcd abc RSA
4 lmn ert AUS
6 pqr lmn NZ
Use:
In [56]: df2[df2.m_name.isin(df1.name) | df2.n_name.isin(df1.name)]
Out[56]:
m_name n_name loc
0 abc tyu IND
1 bcd abc RSA
3 lmn ert AUS
5 pqr lmn NZ
Or using query
In [58]: df2.query('m_name in @df1.name or n_name in @df1.name')
Out[58]:
m_name n_name loc
0 abc tyu IND
1 bcd abc RSA
3 lmn ert AUS
5 pqr lmn NZ
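Note that the @ prefix in query refers to a Python variable from the calling scope. A tiny sketch with the names pulled into a plain list first:
names = df1['name'].tolist()
df2.query('m_name in @names or n_name in @names')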

Python: max occurrence of consecutive days

I have an Input file:
ID,ROLL_NO,ADM_DATE,FEES
1,12345,01/12/2016,500
2,12345,02/12/2016,200
3,987654,01/12/2016,1000
4,12345,03/12/2016,0
5,12345,04/12/2016,0
6,12345,05/12/2016,100
7,12345,06/12/2016,0
8,12345,07/12/2016,0
9,12345,08/12/2016,0
10,987654,02/12/2016,150
11,987654,03/12/2016,300
I'm trying to find the maximum count of consecutive days where FEES is 0 for a particular ROLL_NO. If FEES is never 0 for a particular ROLL_NO, its max count will be zero.
Expected Output:
ID,ROLL_NO,MAX_CNT -- First occurrence of ID for a particular ROLL_NO should come as ID in output
1,12345,3
3,987654,0
This is what I've come up with so far,
import pandas as pd
df = pd.read_csv('I5.txt')
df['COUNT'] = df.groupby(['ROLLNO','ADM_DATE'])['ROLLNO'].transform(pd.Series.value_counts)
print df
But I don't believe this is the right way to approach this.
Could someone help a python newbie out here?
You can use:
#consecutive groups
r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
print (a)
ID
1 1
2 1
3 1
4 2
5 2
6 3
7 4
8 4
9 4
10 5
11 5
dtype: int32
#filter rows with 0 FEES, count group sizes, get max per first level and finally add missing roll numbers by reindex
mask = df['FEES'].eq(0)
df = (df[mask].groupby(['ROLL_NO',a[mask]])
.size()
.max(level=0)
.reindex(df['ROLL_NO'].unique(), fill_value=0)
.reset_index(name='MAX_CNT'))
print (df)
ROLL_NO MAX_CNT
0 12345 3
1 987654 0
Explanation:
First compare the FEES column with 0 (eq is the same as ==) and multiply the mask by the column ROLL_NO:
mask = df['FEES'].eq(0)
r = df['ROLL_NO'] * mask
print (r)
0 0
1 0
2 0
3 12345
4 12345
5 0
6 12345
7 12345
8 12345
9 0
10 0
dtype: int64
Get consecutive groups by comparing the Series r with its shifted version and taking the cumsum:
a = r.ne(r.shift()).cumsum()
print (a)
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
8 4
9 5
10 5
dtype: int32
Filter only the 0 FEES rows and groupby with size, also filtering a to the same indexes:
print (df[mask].groupby(['ROLL_NO',a[mask]]).size())
ROLL_NO
12345 2 2
4 3
dtype: int64
Get max values per first level of MultiIndex:
print (df[mask].groupby(['ROLL_NO',a[mask]]).size().max(level=0))
ROLL_NO
12345 3
dtype: int64
Last, add back missing ROLL_NOs that have no 0 fees by reindex:
print (df[mask].groupby(['ROLL_NO',a[mask]])
.size()
.max(level=0)
.reindex(df['ROLL_NO'].unique(), fill_value=0))
ROLL_NO
12345 3
987654 0
dtype: int64
and to turn the index back into columns use reset_index.
EDIT:
For first ID use drop_duplicates with insert and map:
r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
s = df.drop_duplicates('ROLL_NO').set_index('ROLL_NO')['ID']
mask = df['FEES'].eq(0)
df1 = (df[mask].groupby(['ROLL_NO',a[mask]])
.size()
.max(level=0)
.reindex(df['ROLL_NO'].unique(), fill_value=0)
.reset_index(name='MAX_CNT'))
df1.insert(0, 'ID', df1['ROLL_NO'].map(s))
print (df1)
ID ROLL_NO MAX_CNT
0 1 12345 3
1 3 987654 0
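For reference, the same idea can be condensed into a single groupby-apply. A sketch, assuming df is read fresh from the CSV so ID, ROLL_NO and FEES are plain columns:
#longest run of zeros per ROLL_NO: group the zero-mask by a counter that
#increments at every nonzero fee, then take the largest per-group sum
max_cnt = (df.groupby('ROLL_NO', sort=False)['FEES']
.apply(lambda s: s.eq(0).groupby(s.ne(0).cumsum()).sum().max()))
out = df.groupby('ROLL_NO', sort=False)['ID'].first().to_frame()
out['MAX_CNT'] = max_cnt
print (out.reset_index()[['ID','ROLL_NO','MAX_CNT']])
ID ROLL_NO MAX_CNT
0 1 12345 3
1 3 987654 0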

Next/Prev operation in DataFrame group-by

I want to get the next (second) entry from a given dataframe after grouping it by certain columns. If it doesn't exist then it should return NaN/NaT depending on the dtype. Consider the following example:
>>> df1 = pd.DataFrame({'School': {0: 'DEF', 1: 'ABC', 2: 'PQR', 3: 'DEF', 4: 'PQR', 5: 'PQR'}, 'OpenTime': {0: '08:00:00.000', 1: '09:00:00.000', 2: '10:00:23.563', 3: '09:30:05.908', 4: '07:15:50.100', 5: '08:15:00.000'}, 'CloseTime': {0: '13:00:00.000', 1: '14:00:00.000', 2: '13:30:00.100', 3: '15:00:00.768', 4: '13:00:00.500', 5: '15:50:32.534'}, 'IsTopper':{0:'1',1:'1',2:'1',3:'1',4:'1',5:'-1'}})
>>> df1
CloseTime IsTopper OpenTime School
0 13:00:00.000 1 08:00:00.000 DEF
1 14:00:00.000 1 09:00:00.000 ABC
2 13:30:00.100 1 10:00:23.563 PQR
3 15:00:00.768 1 09:30:05.908 DEF
4 13:00:00.500 1 07:15:50.100 PQR
5 15:50:32.534 -1 08:15:00.000 PQR
Getting the first value is simple and can be achieved by either of the following:
>>> df1.groupby(['School', 'IsTopper'])['OpenTime'].first()
OR
>>> (df1.groupby(['School', 'IsTopper'])).apply(lambda x:x.iloc[0])['OpenTime']
Getting the next (second) value using ...iloc[1] would throw an error in the above case, since some groups have only one row.
Finally, I am trying to get the following output for the above example:
School IsTopper OpenTime Next_OpenTime
0 DEF 1 08:00:00.000 09:30:05.908
1 ABC 1 09:00:00.000
2 PQR 1 10:00:23.563 07:15:50.100
3 DEF 1 09:30:05.908
4 PQR 1 07:15:50.100
5 PQR -1 08:15:00.000
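groupby with shift(-1) does exactly this: it pulls each group's next OpenTime into the current row and leaves NaN where no next entry exists.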
>>> df1['Next_OpenTime'] = (df1.groupby(['School', 'IsTopper']))['OpenTime'].shift(-1)
>>> df1
IsTopper OpenTime School Next_OpenTime
0 1 08:00:00.000 DEF 09:30:05.908
1 1 09:00:00.000 ABC NaN
2 1 10:00:23.563 PQR 07:15:50.100
3 1 09:30:05.908 DEF NaN
4 1 07:15:50.100 PQR NaN
5 -1 08:15:00.000 PQR NaN
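If you also need the literal second entry per group (where the ...iloc[1] attempt above raises an IndexError for one-row groups), nth is a safer sketch; note that the exact index of the result varies across pandas versions:
>>> df1.groupby(['School', 'IsTopper'])['OpenTime'].nth(1)
This keeps the second OpenTime for DEF/1 and PQR/1 and silently drops the one-row groups instead of raising.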

ValueError: The truth value of a Series is ambiguous

>>> df.head()
№ Summer Gold Silver Bronze Total № Winter \
Afghanistan (AFG) 13 0 0 2 2 0
Algeria (ALG) 12 5 2 8 15 3
Argentina (ARG) 23 18 24 28 70 18
Armenia (ARM) 5 1 2 9 12 6
Australasia (ANZ) [ANZ] 2 3 4 5 12 0
Gold.1 Silver.1 Bronze.1 Total.1 № Games Gold.2 \
Afghanistan (AFG) 0 0 0 0 13 0
Algeria (ALG) 0 0 0 0 15 5
Argentina (ARG) 0 0 0 0 41 18
Armenia (ARM) 0 0 0 0 11 1
Australasia (ANZ) [ANZ] 0 0 0 0 2 3
Silver.2 Bronze.2 Combined total
Afghanistan (AFG) 0 2 2
Algeria (ALG) 2 8 15
Argentina (ARG) 24 28 70
Armenia (ARM) 2 9 12
Australasia (ANZ) [ANZ] 4 5 12
Not sure why I see this error:
>>> df['Gold'] > 0 | df['Gold.1'] > 0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ankuragarwal/data_insight/env/lib/python2.7/site-packages/pandas/core/generic.py", line 917, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What's ambiguous here?
But this works:
>>> (df['Gold'] > 0) | (df['Gold.1'] > 0)
Assuming we have the following DF:
In [35]: df
Out[35]:
a b c
0 9 0 1
1 7 7 4
2 1 8 9
3 6 7 5
4 1 4 6
The following command:
df.a > 5 | df.b > 5
because | has higher precedence than > (as specified in the Operator precedence table), will be translated to:
df.a > (5 | df.b) > 5
which, due to Python's comparison chaining (a > b > c means (a > b) and (b > c)), will be translated to:
df.a > (5 | df.b) and (5 | df.b) > 5
step by step:
In [36]: x = (5 | df.b)
In [37]: x
Out[37]:
0 5
1 7
2 13
3 7
4 5
Name: b, dtype: int32
In [38]: df.a > x
Out[38]:
0 True
1 False
2 False
3 False
4 False
dtype: bool
In [39]: x > 5
Out[39]:
0 False
1 True
2 True
3 True
4 False
Name: b, dtype: bool
but the last operation won't work:
In [40]: (df.a > x) and (x > 5)
---------------------------------------------------------------------------
...
skipped
...
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The error message above might lead inexperienced users to do something like this:
In [12]: (df.a > 5).all() | (df.b > 5).all()
Out[12]: False
In [13]: df[(df.a > 5).all() | (df.b > 5).all()]
...
skipped
...
KeyError: False
But in this case you just need to set your precedence explicitly in order to get the expected result:
In [10]: (df.a > 5) | (df.b > 5)
Out[10]:
0 True
1 True
2 True
3 True
4 False
dtype: bool
In [11]: df[(df.a > 5) | (df.b > 5)]
Out[11]:
a b c
0 9 0 1
1 7 7 4
2 1 8 9
3 6 7 5
This is the real reason for the error:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the boolean operations and, or, and not. It is not clear what the result of
>>> if pd.Series([False, True, False]):
...
should be. Should it be True because it’s not zero-length? False because there are False values? It is unclear, so instead, pandas raises a ValueError:
>>> if pd.Series([False, True, False]):
print("I was true")
Traceback
...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
If you see that, you need to explicitly choose what you want to do with it (e.g., use any(), all() or empty). Or you might want to check whether the pandas object is None.
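A short sketch of the explicit choices, assuming s is any boolean Series:
import pandas as pd
s = pd.Series([False, True, False])
s.any()    #True, at least one element is True
s.all()    #False, not every element is True
s.empty    #False, the Series has three elements
s is None  #False, identity check, never ambiguous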

pandas dataframe update or set column[y] = x where column[z] = 'abc'

I'm new to python pandas and haven't found an answer to this in the documentation. I have an existing dataframe and I've added a new column Y. I want to set the value of column Y to 'abc' in all rows in which column Z = 'xyz'. In SQL this would be a simple
update table set colY = 'abc' where colZ = 'xyz'
Is there a similar way to do this update in pandas?
Thanks!
You can use loc, or numpy.where if you need to set another value too:
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[1,2,3],
'Z':['xyz',5,6],
'C':[7,8,9]})
print (df)
C X Z
0 7 1 xyz
1 8 2 5
2 9 3 6
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
print (df)
C X Z Y
0 7 1 xyz abc
1 8 2 5 NaN
2 9 3 6 NaN
df['Y1'] = np.where(df.Z == 'xyz', 'abc', 'klm')
print (df)
C X Z Y Y1
0 7 1 xyz abc abc
1 8 2 5 NaN klm
2 9 3 6 NaN klm
You can use another column's values too:
df['Y2'] = np.where(df.Z == 'xyz', 'abc', df.C)
print (df)
C X Z Y Y2
0 7 1 xyz abc abc
1 8 2 5 NaN 8
2 9 3 6 NaN 9
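If you prefer to stay inside pandas, Series.mask sketches the same fallback logic; it replaces values where the condition is True:
#same result as the Y2 example: 'abc' where Z == 'xyz', otherwise column C
df['Y3'] = df['C'].mask(df.Z == 'xyz', 'abc')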