Selecting dataframe rows based on values in other dataframe - python-2.7

I have the following two dataframes:
df1:
name
abc
lmn
pqr
df2:
m_name n_name loc
abc tyu IND
bcd abc RSA
efg poi SL
lmn ert AUS
nne bnm ENG
pqr lmn NZ
xyz asd BAN
I want to generate a new dataframe based on the following condition:
if df2.m_name==df1.name or df2.n_name==df1.name
eliminate duplicate rows
The following is the desired output:
m_name n_name loc
abc tyu IND
bcd abc RSA
lmn ert AUS
pqr lmn NZ
Can I get any suggestions on how to achieve this??

Use:
print (df2)
m_name n_name loc
0 abc tyu IND
1 abc tyu IND
2 bcd abc RSA
3 efg poi SL
4 lmn ert AUS
5 nne bnm ENG
6 pqr lmn NZ
7 xyz asd BAN
df3 = df2.filter(like='name')
# another solution is to select the columns by a list of column names
#df3 = df2[['m_name','n_name']]
df = df2[df3.isin(df1['name'].tolist()).any(axis=1)]
df = df.drop_duplicates(subset=df3.columns)
print (df)
m_name n_name loc
0 abc tyu IND
2 bcd abc RSA
4 lmn ert AUS
6 pqr lmn NZ
Details:
Select all columns whose label contains "name" with filter:
print (df2.filter(like='name'))
m_name n_name
0 abc tyu
1 abc tyu
2 bcd abc
3 efg poi
4 lmn ert
5 nne bnm
6 pqr lmn
7 xyz asd
Compare by DataFrame.isin:
print (df2.filter(like='name').isin(df1['name'].tolist()))
m_name n_name
0 True False
1 True False
2 False True
3 False False
4 True False
5 False False
6 True True
7 False False
Get at least one True per row by any:
print (df2.filter(like='name').isin(df1['name'].tolist()).any(axis=1))
0 True
1 True
2 True
3 False
4 True
5 False
6 True
7 False
dtype: bool
Filter by boolean indexing:
df = df2[df2.filter(like='name').isin(df1['name'].tolist()).any(axis=1)]
print (df)
m_name n_name loc
0 abc tyu IND
1 abc tyu IND
2 bcd abc RSA
4 lmn ert AUS
6 pqr lmn NZ
Finally, remove duplicates with drop_duplicates (to treat rows as duplicates based only on the name columns, pass the subset parameter):
df = df.drop_duplicates(subset=df3.columns)
print (df)
m_name n_name loc
0 abc tyu IND
2 bcd abc RSA
4 lmn ert AUS
6 pqr lmn NZ
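The steps above can be collected into one self-contained script. This is a minimal sketch that rebuilds the sample frames from the question; the variable names name_cols, mask and result are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (df2 includes the duplicated first row)
df1 = pd.DataFrame({"name": ["abc", "lmn", "pqr"]})
df2 = pd.DataFrame({
    "m_name": ["abc", "abc", "bcd", "efg", "lmn", "nne", "pqr", "xyz"],
    "n_name": ["tyu", "tyu", "abc", "poi", "ert", "bnm", "lmn", "asd"],
    "loc":    ["IND", "IND", "RSA", "SL", "AUS", "ENG", "NZ", "BAN"],
})

name_cols = ["m_name", "n_name"]  # illustrative helper, not from the answer

# True for rows where at least one name column holds a value listed in df1
mask = df2[name_cols].isin(df1["name"].tolist()).any(axis=1)

# Keep matching rows, then drop rows that repeat the same pair of names
result = df2[mask].drop_duplicates(subset=name_cols)
print(result)
```

Selecting the columns by an explicit list instead of filter(like='name') avoids accidentally picking up unrelated columns whose label happens to contain "name".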

Use
In [56]: df2[df2.m_name.isin(df1.name) | df2.n_name.isin(df1.name)]
Out[56]:
m_name n_name loc
0 abc tyu IND
1 bcd abc RSA
3 lmn ert AUS
5 pqr lmn NZ
Or using query
In [58]: df2.query('m_name in @df1.name or n_name in @df1.name')
Out[58]:
m_name n_name loc
0 abc tyu IND
1 bcd abc RSA
3 lmn ert AUS
5 pqr lmn NZ
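For comparison, here is a small runnable sketch of both variants on the question's data (without the duplicated row). It assumes the names are first copied into a local list called names, which is a hypothetical helper; inside query, a local Python variable is referenced with the @ prefix.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["abc", "lmn", "pqr"]})
df2 = pd.DataFrame({
    "m_name": ["abc", "bcd", "efg", "lmn", "nne", "pqr", "xyz"],
    "n_name": ["tyu", "abc", "poi", "ert", "bnm", "lmn", "asd"],
    "loc":    ["IND", "RSA", "SL", "AUS", "ENG", "NZ", "BAN"],
})

names = df1["name"].tolist()  # hypothetical helper variable

# Variant 1: combine two boolean masks with | (element-wise OR)
by_mask = df2[df2["m_name"].isin(names) | df2["n_name"].isin(names)]

# Variant 2: the same condition as a query string; @names refers to the local list
by_query = df2.query("m_name in @names or n_name in @names")

print(by_mask)
print(by_query)
```

Neither variant removes duplicate rows, so drop_duplicates is still needed if df2 can contain repeats.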

Related

SAS count unique observation by group

I am trying to figure out how many customers get their product from a certain store. The problem is that each prod_id can have up to 12 weeks of data for each customer. I have tried a multitude of approaches: some add up all of the observations for each customer, while others, like the one below, remove all but the last observation.
proc sort data= have; BY Prod_ID cust; run;
Data want;
Set have;
by Prod_Id cust;
if (last.Prod_Id and last.cust);
count= +1;
run;
data have
prod_id cust week store
1 A 7/29 ABC
1 A 8/5 ABC
1 A 8/12 ABC
1 A 8/19 ABC
1 B 7/29 ABC
1 B 8/5 ABC
1 B 8/12 ABC
1 B 8/19 ABC
1 B 8/26 ABC
1 C 7/29 XYZ
1 C 8/5 XYZ
1 F 7/29 XYZ
1 F 8/5 XYZ
2 A 7/29 ABC
2 A 8/5 ABC
2 A 8/12 ABC
2 A 8/19 ABC
2 C 7/29 EFG
2 C 8/5 EFG
2 C 8/12 EFG
2 C 8/19 EFG
2 C 8/26 EFG
What I want it to look like:
prod_id store count
1 ABC 2
1 XYZ 2
2 ABC 1
2 EFG 2
First, read up on how the IF statement works in a data step.
I've just edited your code to make it work:
proc sort data=have;
by prod_id store cust;
run;
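data want(drop=cust week);
set have;
by prod_id store cust;
retain count;
/* reset the counter at the start of each prod_id/store group */
if first.store then count = 0;
/* count each customer once, on that customer's last row */
if last.cust then count = count + 1;
/* write one row per prod_id/store group */
if last.store then output;
run;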
If you have questions, ask.
The only place where the result of the COUNT() aggregate function in SQL might be confusing is that it will not count missing values of the variable.
proc sql;
select prod_id
, store
, count(distinct cust) as count
, count(distinct cust)+max(missing(cust)) as count_plus_missing
from have
group by prod_id, store
;
quit;
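For readers following the pandas threads on this page, the same count-distinct-per-group logic can be sketched in pandas. This is only an illustration of the idea, not part of the SAS answers; the groups list and the frame names have and want are assumptions that mirror the sample data (the week column is omitted because it does not affect the count).

```python
import pandas as pd

# (prod_id, cust, store, number of weekly rows) taken from the sample data
groups = [
    (1, "A", "ABC", 4), (1, "B", "ABC", 5),
    (1, "C", "XYZ", 2), (1, "F", "XYZ", 2),
    (2, "A", "ABC", 4), (2, "C", "EFG", 5),
]
have = pd.DataFrame(
    [(p, c, s) for p, c, s, n in groups for _ in range(n)],
    columns=["prod_id", "cust", "store"],
)

# nunique counts each customer once per prod_id/store group and,
# like COUNT(DISTINCT ...), ignores missing values by default
want = (
    have.groupby(["prod_id", "store"])["cust"]
        .nunique()
        .reset_index(name="count")
)
print(want)
```

With the sample rows as posted, prod_id 2 and store EFG contain a single customer (C), so any count-distinct approach reports one customer for that group.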

SAS retain value and assign to new variable

I have the following data
EMPID XVAR SRC
ABC PER1 1
ABC 2
XYZ PER1 1
XYZ 2
LMN PER1 1
LMN 2
LMN PER2 1
LMN 2
LMN 2
LMN PER3 1
LMN 2
I need to create a new variable _XVAR for records where SRC=2 based on the value for XVAR on the previous record (where SRC=1)
The output should be like:
EMPID XVAR SRC _XVAR
ABC PER1 1
ABC 2 PER1
XYZ PER1 1
XYZ 2 PER1
LMN PER1 1
LMN 2 PER1
LMN PER2 1
LMN 2 PER2
LMN 2 PER2
LMN PER3 1
LMN 2 PER3
I am trying the following, but it isn't working:
data t003;
set t003;
by EMPID;
retain XVAR;
if SRC eq 2 then _XVAR=XVAR;
run;
It can also be done by saving XVAR in a new variable (last_XVAR), retaining it, and dropping it (you don't want it in the output). Then use that variable to assign _XVAR. Note that you need to set last_XVAR after the IF; otherwise the current XVAR is used in the assignment of _XVAR.
Your code, edited:
data t003;
set t003;
by EMPID;
length _XVAR last_XVAR $ 10;
if SRC eq 2 then _XVAR = last_XVAR;
last_XVAR = XVAR;
retain last_XVAR;
drop last_XVAR;
run;
You can use LAG to retrieve prior row values and conditionally use that value in an assignment.
Sample data
data have; input
EMPID $ XVAR $ SRC; datalines;
ABC PER1 1
ABC . 2
XYZ PER1 1
XYZ . 2
LMN PER1 1
LMN . 2
LMN PER2 1
LMN . 2
LMN . 2
LMN PER3 1
LMN . 2
run;
Example code
data want;
set have;
lag_xvar = lag(xvar);
if src eq 2 then do;
if lag_xvar ne '' then _xvar = lag_xvar;
end;
else
_xvar = ' ';
retain _xvar;
drop lag_xvar;
run;
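The carry-forward idea behind both SAS answers can also be sketched in pandas, for readers who work in both tools. This is an illustrative translation under the assumption that missing XVAR values are stored as None; the name carried is hypothetical. Note that, unlike the single-row LAG, a forward fill covers several consecutive SRC=2 rows.

```python
import pandas as pd

have = pd.DataFrame({
    "EMPID": ["ABC", "ABC", "XYZ", "XYZ", "LMN", "LMN",
              "LMN", "LMN", "LMN", "LMN", "LMN"],
    "XVAR": ["PER1", None, "PER1", None, "PER1", None,
             "PER2", None, None, "PER3", None],
    "SRC": [1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2],
})

# Carry the last non-missing XVAR forward within each EMPID
carried = have.groupby("EMPID", sort=False)["XVAR"].ffill()

# Keep the carried value only on SRC == 2 rows; other rows stay missing
have["_XVAR"] = carried.where(have["SRC"] == 2)
print(have)
```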

Next/Prev operation in Dataframe group-by

I want to get the next (second) entry from a given dataframe after grouping it by certain columns. If no such entry exists, it should return NaN/NaT depending upon the type. Consider the following example:
>>> df1 = pd.DataFrame({'School': {0: 'DEF', 1: 'ABC', 2: 'PQR', 3: 'DEF', 4: 'PQR', 5: 'PQR'}, 'OpenTime': {0: '08:00:00.000', 1: '09:00:00.000', 2: '10:00:23.563', 3: '09:30:05.908', 4: '07:15:50.100', 5: '08:15:00.000'}, 'CloseTime': {0: '13:00:00.000', 1: '14:00:00.000', 2: '13:30:00.100', 3: '15:00:00.768', 4: '13:00:00.500', 5: '15:50:32.534'}, 'IsTopper':{0:'1',1:'1',2:'1',3:'1',4:'1',5:'-1'}})
>>> df1
CloseTime IsTopper OpenTime School
0 13:00:00.000 1 08:00:00.000 DEF
1 14:00:00.000 1 09:00:00.000 ABC
2 13:30:00.100 1 10:00:23.563 PQR
3 15:00:00.768 1 09:30:05.908 DEF
4 13:00:00.500 1 07:15:50.100 PQR
5 15:50:32.534 -1 08:15:00.000 PQR
Getting first value is simple and can be achieved by either of the following
>>> df1.groupby(['School', 'IsTopper'])['OpenTime'].first()
OR
>>> (df1.groupby(['School', 'IsTopper'])).apply(lambda x:x.iloc[0])['OpenTime']
Getting the next (second) value using ...iloc[1] would throw an error in the above case, because some groups contain only one row.
Finally, I am trying to get following output in case of above example:
School IsTopper OpenTime Next_OpenTime
0 DEF 1 08:00:00.000 09:30:05.908
1 ABC 1 09:00:00.000
2 PQR 1 10:00:23.563 07:15:50.100
3 DEF 1 09:30:05.908
4 PQR 1 07:15:50.100
5 PQR -1 08:15:00.000
>>> df1['Next_OpenTime'] = (df1.groupby(['School', 'IsTopper']))['OpenTime'].shift(-1)
>>> df1
IsTopper OpenTime School Next_OpenTime
0 1 08:00:00.000 DEF 09:30:05.908
1 1 09:00:00.000 ABC NaN
2 1 10:00:23.563 PQR 07:15:50.100
3 1 09:30:05.908 DEF NaN
4 1 07:15:50.100 PQR NaN
5 -1 08:15:00.000 PQR NaN
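A self-contained sketch of the shift approach follows, assuming the question's data with only the relevant columns. The Prev_OpenTime and Open_td columns are additions for illustration: shift(1) gives the previous entry in the group, and converting the strings with pd.to_timedelta makes the missing entries NaT instead of NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "School": ["DEF", "ABC", "PQR", "DEF", "PQR", "PQR"],
    "IsTopper": ["1", "1", "1", "1", "1", "-1"],
    "OpenTime": ["08:00:00.000", "09:00:00.000", "10:00:23.563",
                 "09:30:05.908", "07:15:50.100", "08:15:00.000"],
})
keys = ["School", "IsTopper"]

# shift(-1) pulls the next row's value within each group; shift(1) the previous one
df["Next_OpenTime"] = df.groupby(keys)["OpenTime"].shift(-1)
df["Prev_OpenTime"] = df.groupby(keys)["OpenTime"].shift(1)

# With a timedelta column, rows that have no next entry become NaT
df["Open_td"] = pd.to_timedelta(df["OpenTime"])
df["Next_Open_td"] = df.groupby(keys)["Open_td"].shift(-1)
print(df)
```

Because shift works on row order inside each group, sort the frame first if "next" should mean the next time rather than the next row.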

Adding derived Timedelta to DateTime

I am trying to add NewTime as the mid-time between OpenTime and CloseTime to my dataframe df1 and it seems to be not working. Please see the code below. Any ideas?
>>> df1 = pd.DataFrame({'School': {0: 'ABC', 1: 'DEF', 2: 'GHI', 3: 'JKL', 4: 'MNO', 5: 'PQR'}, 'OpenTime': {0: '08:00:00.000', 1: '09:00:00.000', 2: '10:00:23.563', 3: '09:30:05.908', 4: '07:15:50.100', 5: '08:15:00.000'}, 'CloseTime': {0: '13:00:00.000', 1: '14:00:00.000', 2: '13:30:00.100', 3: '15:00:00.768', 4: '13:00:00.500', 5: '14:15:00.000'}, 'TimeZone':{0:'Europe/Vienna',1:'Europe/Brussels',2:'Europe/London',3:'Pacific/Auckland' ,4:'Asia/Seoul',5:'Europe/London'}})
>>> df1['OpenTime'] = pd.to_datetime(df1['OpenTime'])
>>> df1['CloseTime'] = pd.to_datetime(df1['CloseTime'])
>>> df1['Offset'] = df1.apply(lambda x:1/2*(x['CloseTime'] - x['OpenTime']), axis=1)
>>> df1
CloseTime OpenTime School TimeZone Offset
0 2016-11-22 13:00:00.000 2016-11-22 08:00:00.000 ABC Europe/Vienna 0 days
1 2016-11-22 14:00:00.000 2016-11-22 09:00:00.000 DEF Europe/Brussels 0 days
2 2016-11-22 13:30:00.100 2016-11-22 10:00:23.563 GHI Europe/London 0 days
3 2016-11-22 15:00:00.768 2016-11-22 09:30:05.908 JKL Pacific/Auckland 0 days
4 2016-11-22 13:00:00.500 2016-11-22 07:15:50.100 MNO Asia/Seoul 0 days
5 2016-11-22 14:15:00.000 2016-11-22 08:15:00.000 PQR Europe/London 0 days
>>> df1['NewTime'] = df1['OpenTime'] + df1['Offset']
>>> df1
CloseTime OpenTime School TimeZone Offset NewTime
0 2016-11-22 13:00:00.000 2016-11-22 08:00:00.000 ABC Europe/Vienna 0 days 2016-11-22 08:00:00.000
1 2016-11-22 14:00:00.000 2016-11-22 09:00:00.000 DEF Europe/Brussels 0 days 2016-11-22 09:00:00.000
2 2016-11-22 13:30:00.100 2016-11-22 10:00:23.563 GHI Europe/London 0 days 2016-11-22 10:00:23.563
3 2016-11-22 15:00:00.768 2016-11-22 09:30:05.908 JKL Pacific/Auckland 0 days 2016-11-22 09:30:05.908
4 2016-11-22 13:00:00.500 2016-11-22 07:15:50.100 MNO Asia/Seoul 0 days 2016-11-22 07:15:50.100
5 2016-11-22 14:15:00.000 2016-11-22 08:15:00.000 PQR Europe/London 0 days 2016-11-22 08:15:00.000
>>>
However if I remove 1/2 from my lambda function this seems to be working. So essentially I am not able to multiply/divide timedelta with any number.
It is quite critical for me to use lambda function because I am doing this iteratively to generate many times (not just midtime)
Did you try
df1['Offset'] = df1.apply(lambda x:((x['CloseTime'] - x['OpenTime']))/2, axis=1)
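I just did that in my console and it worked fine. Putting the 1/2 in front is what causes the problem: in Python 2, 1/2 is integer division and evaluates to 0, so every timedelta is multiplied by 0. Dividing the timedelta by 2 (or multiplying by 0.5) avoids this.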
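As a sketch of the same calculation without apply, the midpoint can be computed column-wise. The example below assumes two rows of the question's data with full timestamps written out, and QuarterTime is a hypothetical extra column showing an arbitrary fraction. The first line reproduces the integer-division pitfall with //, which behaves like Python 2's 1/2.

```python
import pandas as pd

# Floor division of 1 by 2 is 0; this is what 1/2 evaluates to in Python 2
print(1 // 2)

df = pd.DataFrame({
    "School": ["ABC", "DEF"],
    "OpenTime": pd.to_datetime(["2016-11-22 08:00:00", "2016-11-22 09:00:00"]),
    "CloseTime": pd.to_datetime(["2016-11-22 13:00:00", "2016-11-22 14:00:00"]),
})

# Subtracting two datetime columns gives a timedelta column
span = df["CloseTime"] - df["OpenTime"]

# Timedeltas can be divided by a number or scaled by a float
df["NewTime"] = df["OpenTime"] + span / 2
df["QuarterTime"] = df["OpenTime"] + span * 0.25
print(df)
```

If a lambda is required, the same expressions work row by row, as long as the fraction is written as a float or applied as a division of the timedelta.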

pandas dataframe update or set column[y] = x where column[z] = 'abc'

I'm new to python pandas and haven't found an answer to this in the documentation. I have an existing dataframe and I've added a new column Y. I want to set the value of column Y to 'abc' in all rows in which column Z = 'xyz'. In sql this would be a simple
update table set colY = 'abc' where colZ = 'xyz'
Is there a similar way to do this update in pandas?
Thanks!
You can use loc, or numpy.where if you also need to set a value for the non-matching rows:
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[1,2,3],
'Z':['xyz',5,6],
'C':[7,8,9]})
print (df)
C X Z
0 7 1 xyz
1 8 2 5
2 9 3 6
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
print (df)
C X Z Y
0 7 1 xyz abc
1 8 2 5 NaN
2 9 3 6 NaN
df['Y1'] = np.where(df.Z == 'xyz', 'abc', 'klm')
print (df)
C X Z Y Y1
0 7 1 xyz abc abc
1 8 2 5 NaN klm
2 9 3 6 NaN klm
You can also use the values of another column for the non-matching rows:
df['Y2'] = np.where(df.Z == 'xyz', 'abc', df.C)
print (df)
C X Z Y Y2
0 7 1 xyz abc abc
1 8 2 5 NaN 8
2 9 3 6 NaN 9
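The question mentions that column Y already exists, which is the closest match to a SQL UPDATE. A minimal sketch of that case follows, with made-up values old1 to old3 standing in for the existing contents of Y; the frame names updated and rebuilt are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Z": ["xyz", "other", "xyz"],
    "Y": ["old1", "old2", "old3"],  # made-up existing values
})

# UPDATE ... SET Y = 'abc' WHERE Z = 'xyz': only matching rows are touched
updated = df.copy()
updated.loc[updated["Z"] == "xyz", "Y"] = "abc"

# np.where rebuilds the whole column, so pass the old column as the
# fallback to keep the non-matching rows unchanged
rebuilt = df.copy()
rebuilt["Y"] = np.where(rebuilt["Z"] == "xyz", "abc", rebuilt["Y"])

print(updated)
print(rebuilt)
```

Both frames end up identical here; loc is usually the more direct choice when only the matching rows should change.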