Updating a pandas dataframe by matching two columns from two dataframes - python-2.7

I have a dataframe C and another dataframe S. I want to change the values in one of the columns of C wherever two columns in C and S have the same values.
Please consider the example given below:
C.head(3)
   id1  id2  title  val
0    1    0  'abc'    0
1    2    0  'bcd'    0
2    3    0  'efg'    0
S.head(3)
   id1  id2
0    1    1
1    3    0
I want to assign the value 1 to the column 'val' in C only for the rows where C.id1 = S.id1 and C.id2 = S.id2.
The combinations (C.id1, C.id2) and (S.id1, S.id2) are unique in their respective tables.
In the above case, I want the result as
C.head(3)
   id1  id2  title  val
0    1    0  'abc'    0
1    2    0  'bcd'    0
2    3    0  'efg'    1
as only the third row of C matches one of the rows of S on the columns id1 and id2.

I think you need merge with a left join and the indicator parameter, then convert the boolean mask to 0 and 1:
#if the join columns have the same names in both dfs, the on parameter can be omitted
df = C.merge(S, indicator=True, how='left')
#if the dfs share multiple column names, specify the join keys explicitly
#df = C.merge(S, indicator=True, how='left', on=['id1', 'id2'])
df['val'] = (df['_merge'] == 'both').astype(int)
df = df.drop('_merge', axis=1)
print (df)
   id1  id2 title  val
0    1    0   abc    0
1    2    0   bcd    0
2    3    0   efg    1
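If you want to update C in place rather than build a merged copy, a minimal alternative sketch (my own suggestion, not from the answer above; it assumes the same C and S) is to test (id1, id2) pair membership with a MultiIndex:
import pandas as pd

#build (id1, id2) pair indexes for both dataframes (idx_c, idx_s are hypothetical names)
idx_c = pd.MultiIndex.from_arrays([C['id1'], C['id2']])
idx_s = pd.MultiIndex.from_arrays([S['id1'], S['id2']])

#rows of C whose (id1, id2) pair also appears in S get val = 1
C['val'] = idx_c.isin(idx_s).astype(int)
This keeps C's original columns and row order untouched.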

Related

Create boolean dataframe showing existence of each element in a dictionary of lists

I have a dictionary of lists, and I have constructed a dataframe where the index is the dictionary keys and the columns are the set of possible values contained within the lists. The dataframe values represent the existence of each column value in each list of the dictionary. What is the most efficient way to construct this? Below is how I have done it using for loops, but I am sure there is a more efficient way using either vectorization or concatenation.
import pandas as pd
data = {0:[1,2,3,4], 1:[2,3,4], 2:[3,4,5,6]}
cols = sorted(set([x for y in data.values() for x in y]))
df = pd.DataFrame(0, index=data.keys(), columns=cols)
for row in df.iterrows():
    for col in cols:
        if col in data[row[0]]:
            df.loc[row[0], col] = 1
        else:
            df.loc[row[0], col] = 0
print(df)
Output:
   1  2  3  4  5  6
0  1  1  1  1  0  0
1  0  1  1  1  0  0
2  0  0  1  1  1  1
Use MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data.values()),
                  columns=mlb.classes_,
                  index=data.keys())
print (df)
   1  2  3  4  5  6
0  1  1  1  1  0  0
1  0  1  1  1  0  0
2  0  0  1  1  1  1
A pure pandas, but much slower, solution with str.get_dummies:
df = pd.Series(data).astype(str).str.strip('[]').str.get_dummies(', ')
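Another pure pandas sketch (my own addition, not from the original answer) avoids the string round trip by stacking the dict into a long Series and collapsing the dummies per key:
#wide frame of list elements -> long Series -> dummies -> max per original key
s = pd.Series(data).apply(pd.Series).stack().astype(int)
df = pd.get_dummies(s).groupby(level=0).max()
print (df)
   1  2  3  4  5  6
0  1  1  1  1  0  0
1  0  1  1  1  0  0
2  0  0  1  1  1  1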

Transform categorical column into dummy columns using Power Query M

Using the Power Query "M" language, how would you transform a categorical column containing discrete values into multiple "dummy" columns? I come from the Python world, where there are several ways to do this; one way is shown below:
>>> import pandas as pd
>>> dataset = pd.DataFrame(list('ABCDACDEAABADDA'),
...                        columns=['my_col'])
>>> dataset
   my_col
0       A
1       B
2       C
3       D
4       A
5       C
6       D
7       E
8       A
9       A
10      B
11      A
12      D
13      D
14      A
>>> pd.get_dummies(dataset)
    my_col_A  my_col_B  my_col_C  my_col_D  my_col_E
0          1         0         0         0         0
1          0         1         0         0         0
2          0         0         1         0         0
3          0         0         0         1         0
4          1         0         0         0         0
5          0         0         1         0         0
6          0         0         0         1         0
7          0         0         0         0         1
8          1         0         0         0         0
9          1         0         0         0         0
10         0         1         0         0         0
11         1         0         0         0         0
12         0         0         0         1         0
13         0         0         0         1         0
14         1         0         0         0         0
Interesting question. Here's an easy, scalable method I've found:
Create a custom column of all ones (Add Column > Custom Column > Formula = 1).
Add an index column (Add Column > Index Column).
Pivot on the custom column (select my_col > Transform > Pivot Column).
Replace null values with 0 (select all columns > Transform > Replace Values).
Here's what the M code looks like for this process:
#"Added Custom" = Table.AddColumn(#"Previous Step", "Custom", each 1),
#"Added Index" = Table.AddIndexColumn(#"Added Custom", "Index", 0, 1),
#"Pivoted Column" = Table.Pivot(#"Added Index", List.Distinct(#"Added Index"[my_col]), "my_col", "Custom"),
#"Replaced Value" = Table.ReplaceValue(#"Pivoted Column",null,0,Replacer.ReplaceValue,Table.ColumnNames(#"Pivoted Column"))
Once you've completed the above, you can remove the index column if desired.

Python: max occurrence of consecutive days

I have an Input file:
ID,ROLL_NO,ADM_DATE,FEES
1,12345,01/12/2016,500
2,12345,02/12/2016,200
3,987654,01/12/2016,1000
4,12345,03/12/2016,0
5,12345,04/12/2016,0
6,12345,05/12/2016,100
7,12345,06/12/2016,0
8,12345,07/12/2016,0
9,12345,08/12/2016,0
10,987654,02/12/2016,150
11,987654,03/12/2016,300
I'm trying to find the maximum count of consecutive days where FEES is 0 for a particular ROLL_NO. If FEES is never 0 for a particular ROLL_NO, its max count should be 0.
Expected Output:
ID,ROLL_NO,MAX_CNT -- the first occurrence of ID for a particular ROLL_NO should come as the ID in the output
1,12345,3
3,987654,0
This is what I've come up with so far:
import pandas as pd
df = pd.read_csv('I5.txt')
df['COUNT'] = df.groupby(['ROLL_NO','ADM_DATE'])['ROLL_NO'].transform(pd.Series.value_counts)
print df
But I don't believe this is the right way to approach this.
Could someone help a python newbie out here?
You can use:
#consecutive groups
r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
print (a)
0     1
1     1
2     1
3     2
4     2
5     3
6     4
7     4
8     4
9     5
10    5
dtype: int32
#filter rows with 0 FEES, count per consecutive group, get max per ROLL_NO, and last add missing ROLL_NOs by reindex
mask = df['FEES'].eq(0)
df = (df[mask].groupby(['ROLL_NO',a[mask]])
              .size()
              .max(level=0)
              .reindex(df['ROLL_NO'].unique(), fill_value=0)
              .reset_index(name='MAX_CNT'))
print (df)
  ROLL_NO  MAX_CNT
0   12345        3
1  987654        0
Explanation:
First compare the FEES column with 0 (eq is the same as ==) and multiply the boolean mask by the column ROLL_NO:
mask = df['FEES'].eq(0)
r = df['ROLL_NO'] * mask
print (r)
0         0
1         0
2         0
3     12345
4     12345
5         0
6     12345
7     12345
8     12345
9         0
10        0
dtype: int64
Get consecutive groups by comparing r with its shifted self and taking the cumsum:
a = r.ne(r.shift()).cumsum()
print (a)
0     1
1     1
2     1
3     2
4     2
5     3
6     4
7     4
8     4
9     5
10    5
dtype: int32
Filter only the rows with 0 in FEES and groupby with size (a is filtered to the same indexes):
print (df[mask].groupby(['ROLL_NO',a[mask]]).size())
ROLL_NO
12345    2    2
         4    3
dtype: int64
Get max values per the first level of the MultiIndex:
print (df[mask].groupby(['ROLL_NO',a[mask]]).size().max(level=0))
ROLL_NO
12345    3
dtype: int64
Last, add the missing ROLL_NO values (those with no 0 in FEES) by reindex:
print (df[mask].groupby(['ROLL_NO',a[mask]])
               .size()
               .max(level=0)
               .reindex(df['ROLL_NO'].unique(), fill_value=0))
ROLL_NO
12345     3
987654    0
dtype: int64
and for columns from the index use reset_index.
EDIT:
For the first ID, use drop_duplicates with insert and map:
r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
s = df.drop_duplicates('ROLL_NO').set_index('ROLL_NO')['ID']
mask = df['FEES'].eq(0)
df1 = (df[mask].groupby(['ROLL_NO',a[mask]])
               .size()
               .max(level=0)
               .reindex(df['ROLL_NO'].unique(), fill_value=0)
               .reset_index(name='MAX_CNT'))
df1.insert(0, 'ID', df1['ROLL_NO'].map(s))
print (df1)
   ID  ROLL_NO  MAX_CNT
0   1    12345        3
1   3   987654        0
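A different sketch of the same idea (my own addition, not from the original answer; it assumes the original df as read from the file): compute the longest zero run per ROLL_NO with a plain groupby and a helper function, which can be easier to read even if slower:
def max_zero_run(fees):
    #a new run id starts at every non-zero fee
    runs = fees.ne(0).cumsum()
    #count the zeros inside each run and take the longest run
    return int(fees.eq(0).groupby(runs).sum().max())

print (df.groupby('ROLL_NO', sort=False)['FEES'].apply(max_zero_run))
ROLL_NO
12345     3
987654    0
Name: FEES, dtype: int64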

In pandas, I have 2 columns (ID and Name). If an ID is assigned to more than one name, how do I replace duplicates with the first occurrence?

[Image of the csv file with the two columns]
You can use:
df.drop_duplicates('Salesperson_1')
Or maybe you need:
df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
Sample:
df = pd.DataFrame({'Salesperson_1':['a','a','b'],
                   'Salesperson_1_ID':[4,5,6]})
print (df)
  Salesperson_1  Salesperson_1_ID
0             a                 4
1             a                 5
2             b                 6
df1 = df.drop_duplicates('Salesperson_1')
print (df1)
  Salesperson_1  Salesperson_1_ID
0             a                 4
2             b                 6
df.Salesperson_1_ID = df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
print (df)
  Salesperson_1  Salesperson_1_ID
0             a                 4
1             a                 4
2             b                 6
Pandas.groupby.first():
If your DataFrame is called df, you could just do this:
df.groupby('Salesperson_1_ID').first()
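A follow-up note (my addition, not in the original answer): groupby(...).first() moves the grouping key into the index, so with the sample above, grouping by the name column while keeping a flat frame would look like:
print (df.groupby('Salesperson_1', as_index=False).first())
  Salesperson_1  Salesperson_1_ID
0             a                 4
1             b                 6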

adding all list values to a dataframe cell

I have a list consisting of several strings, as shown below: list = ['abc','cde','fgh']. I want to add all the values to a particular cell of a dataframe. I am trying it with the code
df1.ix[1,3]=[list]
df1.to_csv('test.csv', sep=',')
I want all the values inserted as-is, i.e. ['abc','cde','fgh'], at position 1,3 in the dataframe. I don't want to convert it to a string or any other format. But it is giving me an error. What am I doing wrong here?
I think you can use:
df1.ix[1,3] = L
Also, it is not recommended to use the variable name list, because it shadows a built-in name in Python.
Sample:
df1 = pd.DataFrame({'a':[1,2,3], 'b':[1,2,3], 'c':[1,2,3], 'd':[1,2,3]})
print (df1)
   a  b  c  d
0  1  1  1  1
1  2  2  2  2
2  3  3  3  3
L = ['abc','cde','fgh']
df1.ix[1,3]= L
print (df1)
   a  b  c                d
0  1  1  1                1
1  2  2  2  [abc, cde, fgh]
2  3  3  3                3
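A side note (my addition, not in the original answer): .ix was removed in later pandas versions. There, one way to put a list into a single cell is to cast the column to object dtype first and assign by label with .at:
#object dtype lets a single cell hold a Python list
df1['d'] = df1['d'].astype(object)
df1.at[1, 'd'] = L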
I think you meant to use 1:3, not 1,3.
Consider the pd.DataFrame df:
df = pd.DataFrame(dict(A=list('xxxxxxx')))
Use loc or ix:
df.loc[1:3, 'A'] = ['abc', 'cde', 'fgh']
or
df.ix[1:3, 'A'] = ['abc', 'cde', 'fgh']
yields
     A
0    x
1  abc
2  cde
3  fgh
4    x
5    x
6    x