pandas dataframe match column from external file - python-2.7

I have a df with
col0 col1 col2 col3
a 1 2 text1
b 1 2 text2
c 1 3 text3
and another text file with
col0 col1 col2
met1 a text1
met2 b text2
met3 c text3
How do I match the row values from col3 in my first df against col2 in the text file, and add only the col0 string to the original df, without changing the structure of the df?
desired output:
col0 col1 col2 col3 col4
a 1 2 text1 met1
b 1 2 text2 met2
c 1 3 text3 met3

You can use pandas.DataFrame.merge(). E.g.:
df.merge(df2.loc[:, ['col0', 'col2']], left_on='col3', right_on='col2')

print(df)
col0 col1 col2 col3
0 a 1 2 text1
1 b 1 2 text2
2 c 1 3 text3
print(df2)
col0 col1 col2
0 met1 a text1
1 met2 b text2
2 met3 c text3
Merge df and df2
df3 = df.merge(df2, left_on='col3', right_on='col2',suffixes=('','_1'))
Housekeeping... renaming columns etc...
df3 = df3.rename(columns={'col0_1':'col4'}).drop(['col1_1','col2_1'], axis=1)
print(df3)
col0 col1 col2 col3 col4
0 a 1 2 text1 met1
1 b 1 2 text2 met2
2 c 1 3 text3 met3
And, reassign to df if you wish.
df = df3
OR
df = df.assign(col4=df.merge(df2, left_on='col3', right_on='col2',suffixes=('','_1'))['col0_1'])
print(df)
col0 col1 col2 col3 col4
0 a 1 2 text1 met1
1 b 1 2 text2 met2
2 c 1 3 text3 met3

Call your df df1. First load the text file into a dataframe using df2 = pd.read_csv('filename.txt'). Now rename the columns in df2 so that the column you want to merge on has the same name in both dataframes:
df2.columns = ['new_col1', 'new_col2', 'col3']
Then:
pd.merge(df1, df2, on='col3')
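Putting either answer together end to end, here is a minimal runnable sketch (the frames are reconstructed from the question; renaming df2's col0 to col4 before the merge is one way to land on the desired column name):

```python
import pandas as pd

# Reconstruct the two frames from the question
df1 = pd.DataFrame({'col0': ['a', 'b', 'c'],
                    'col1': [1, 1, 1],
                    'col2': [2, 2, 3],
                    'col3': ['text1', 'text2', 'text3']})
df2 = pd.DataFrame({'col0': ['met1', 'met2', 'met3'],
                    'col1': ['a', 'b', 'c'],
                    'col2': ['text1', 'text2', 'text3']})

# Keep only the key and the wanted string from df2, merge on the text key,
# then drop the duplicate key column brought in from df2
out = (df1.merge(df2[['col0', 'col2']].rename(columns={'col0': 'col4'}),
                 left_on='col3', right_on='col2', suffixes=('', '_1'))
          .drop('col2_1', axis=1))
print(out)
```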

Sum columns based on several other columns - SAS

I'm trying to sum some columns based on several other columns, and then produce a new table containing the results.
Say I have the following data:
Col1 Col2 Col3 Col4 Col5 Col6
AAAA BBBB CCCC DDDD 3    1
AAAA BBBB CCCC DDDD 5    1
WWWW XXXX YYYY ZZZZ 1    4
WWWW XXXX YYYY ZZZZ 8    2
And I want to sum Col5 and Col6 (separately) where Col1-Col4 are the same, i.e. the output I want is:
Col1 Col2 Col3 Col4 Col5 Col6
AAAA BBBB CCCC DDDD 8    2
WWWW XXXX YYYY ZZZZ 9    6
I've put my code below, but it's giving me the following:
Col1 Col2 Col3 Col4 Col5 Col6
AAAA BBBB CCCC DDDD 8    2
AAAA BBBB CCCC DDDD 8    2
WWWW XXXX YYYY ZZZZ 9    6
WWWW XXXX YYYY ZZZZ 9    6
Any help would be greatly appreciated to:
a) get this code to work, and
b) show me a better (more efficient?) way of doing this - I think I've massively(!) overcomplicated it (I'm very new to SAS!).
--- Code ---
data XXX;
input Col1 $ Col2 $ Col3 $ Col4 $ Col5 Col6;
datalines;
AAAA BBBB CCCC DDDD 3 1
AAAA BBBB CCCC DDDD 5 1
WWWW XXXX YYYY ZZZZ 1 4
WWWW XXXX YYYY ZZZZ 8 2
;
run;
data test1;
set XXX;
groupID = put(md5(upcase(catx('|',Col1,Col2,Col3,Col4))),hex32.);
run;
proc sort data = test1;
by groupID;
run;
proc summary data = test1;
var Col5 Col6;
by groupID;
Output out = want sum=;
run;
proc sql;
create table test1_results as
select b.Col1,b.Col2,b.Col3,b.Col4, a.*
from want as a
left join test1 as b
on a.groupID = b.groupID;
run;
data Final_table;
set test1_results;
Keep Col1 Col2 Col3 Col4 Col5 Col6;
run;
I think you need PROC SUMMARY alone; the remaining steps are unnecessary.
Key concept: BY and CLASS statements accept multiple variables.
data XXX;
input Col1 $ Col2 $ Col3 $ Col4 $ Col5 Col6;
datalines;
AAAA BBBB CCCC DDDD 3 1
AAAA BBBB CCCC DDDD 5 1
WWWW XXXX YYYY ZZZZ 1 4
WWWW XXXX YYYY ZZZZ 8 2
;
run;
proc summary data=xxx NWAY noprint;
class col1 col2 col3 col4;
var Col5 Col6;
Output out=want (drop=_type_ _freq_) sum=;
run;
proc print data=want;run;
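For readers following the pandas questions on this page, the same keyed aggregation can be sketched in pandas too (an illustrative aside, not part of the SAS answer; the frame is reconstructed from the datalines):

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['AAAA', 'AAAA', 'WWWW', 'WWWW'],
                   'Col2': ['BBBB', 'BBBB', 'XXXX', 'XXXX'],
                   'Col3': ['CCCC', 'CCCC', 'YYYY', 'YYYY'],
                   'Col4': ['DDDD', 'DDDD', 'ZZZZ', 'ZZZZ'],
                   'Col5': [3, 5, 1, 8],
                   'Col6': [1, 1, 4, 2]})

# Group on the four key columns and sum the value columns --
# the pandas analogue of PROC SUMMARY with a CLASS statement
want = df.groupby(['Col1', 'Col2', 'Col3', 'Col4'], as_index=False)[['Col5', 'Col6']].sum()
print(want)
```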

Updating a pandas dataframe by matching two columns from two dataframes

I have a dataframe C and another dataframe S. I want to change the values in one of the columns of C where two columns in C and S have the same values.
Please consider the example given below,
C.head(3)
id1 id2 title val
0 1 0 'abc' 0
1 2 0 'bcd' 0
2 3 0 'efg' 0
S.head(3)
id1 id2
0 1 1
1 3 0
I want to assign the value of 1 to the column 'val' in C corresponding only to the rows where C.id1 = S.id1 and C.id2 = S.id2
The combination of (C.id1, C.id2) and (S.id1, S.id2) is unique in respective tables
In the above case, I want the result as
C.head(3)
id1 id2 title val
0 1 0 'abc' 0
1 2 0 'bcd' 0
2 3 0 'efg' 1
as only the third row of C matches one of the rows of S on columns id1 and id2.
You can merge with a left join and the indicator parameter, then convert the boolean mask to 0 and 1:
# if the join columns are the only shared columns, the on parameter can be omitted
df = C.merge(S, indicator=True, how='left')
# with other shared columns, name the keys explicitly:
#df = C.merge(S, indicator=True, how='left', on=['id1', 'id2'])
df['val'] = (df['_merge'] == 'both').astype(int)
df = df.drop('_merge', axis=1)
print (df)
id1 id2 val
0 1 0 0
1 2 0 0
2 3 0 1
The solution also works nicely with the new data (including the title column):
df = C.merge(S, indicator=True, how='left')
# with other shared columns, name the keys explicitly:
#df = C.merge(S, indicator=True, how='left', on=['id1', 'id2'])
df['val'] = (df['_merge'] == 'both').astype(int)
df = df.drop('_merge', axis=1)
print (df)
id1 id2 title val
0 1 0 abc 0
1 2 0 bcd 0
2 3 0 efg 1
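A self-contained sketch of the same approach, with C and S reconstructed from the question; assigning the mask back into C keeps C's structure unchanged:

```python
import pandas as pd

C = pd.DataFrame({'id1': [1, 2, 3],
                  'id2': [0, 0, 0],
                  'title': ['abc', 'bcd', 'efg'],
                  'val': [0, 0, 0]})
S = pd.DataFrame({'id1': [1, 3],
                  'id2': [1, 0]})

# Left-merge with indicator: '_merge' is 'both' only where (id1, id2) exists in S
merged = C.merge(S, on=['id1', 'id2'], how='left', indicator=True)
C['val'] = (merged['_merge'] == 'both').astype(int)
print(C)
```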

Merge two duplicate rows with imputing values from each other

I have a dataframe (df1) in which one column (col1) has identical values while the other columns have missing values scattered across the rows, for example as follows:
df1
--------------------------------------------------------------------
col1 col2 col3 col4 col5 col6
--------------------------------------------------------------------
0| 1234 NaT 120 NaN 115 XYZ
1| 1234 2015/01/12 120 Abc 115 NaN
2| 1234 2015/01/12 NaN NaN NaN NaN
I would like to merge the three rows with identical col1 values into one row, filling each missing value with the value that exists in one of the other rows. The resulting df will look like this:
result_df
--------------------------------------------------------------------
col1 col2 col3 col4 col5 col6
--------------------------------------------------------------------
0| 1234 2015/01/12 120 Abc 115 XYZ
Can anyone help me with this issue? Thanks in advance!
First, if the column names are duplicated (col3 and col4 repeated), make them unique:
s = df.columns.to_series()
df.columns = (s + '.' + s.groupby(s).cumcount().replace({0:''}).astype(str)).str.strip('.')
print (df)
col1 col2 col3 col4 col3.1 col4.1
0 1234 NaT 120.0 NaN 115.0 XYZ
1 1234 2015-01-12 120.0 Abc 115.0 NaN
2 1234 2015-01-12 NaN NaN NaN NaN
Then aggregate with first(), which takes the first non-missing value per group:
df = df.groupby('col1', as_index=False).first()
print (df)
col1 col2 col3 col4 col3.1 col4.1
0 1234 2015-01-12 120.0 Abc 115.0 XYZ
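With the unique column names shown in the question, the deduplication step is unnecessary and the aggregation alone suffices (a minimal sketch, reconstructing df1 from the example):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1234, 1234, 1234],
                    'col2': [pd.NaT, pd.Timestamp('2015-01-12'),
                             pd.Timestamp('2015-01-12')],
                    'col3': [120.0, 120.0, None],
                    'col4': [None, 'Abc', None],
                    'col5': [115.0, 115.0, None],
                    'col6': ['XYZ', None, None]})

# first() skips NaN/NaT, so each column keeps its first non-missing value
result_df = df1.groupby('col1', as_index=False).first()
print(result_df)
```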

DataFrame Multiple Column Comparison using LAMBDA

I have a dataframe containing some integer values. I want to keep a row in a new dataframe only if the columns [col1, col3, col4] are not ALL zero. Example:
col1 col2 col3 col4 col5 col6
0 0 text1 3 0 22 0
1 9 text2 13 11 22 1
2 0 text3 0 0 22 0 # not valid
3 9 text4 13 11 22 0
4 0 text5 0 1 12 4
I am not sure if this is possible with a single lambda.
There's no need for any custom function at all. We can just select the columns we want, do our boolean comparison, and then use that to index into your dataframe:
In [28]: df[["col1", "col3", "col4"]] == 0
Out[28]:
col1 col3 col4
0 True False True
1 False False False
2 True True True
3 False False False
4 True True False
In [29]: (df[["col1", "col3", "col4"]] == 0).all(axis=1)
Out[29]:
0 False
1 False
2 True
3 False
4 False
dtype: bool
In [30]: df.loc[~(df[["col1", "col3", "col4"]] == 0).all(axis=1)]
Out[30]:
col1 col2 col3 col4 col5 col6
0 0 text1 3 0 22 0
1 9 text2 13 11 22 1
3 9 text4 13 11 22 0
4 0 text5 0 1 12 4
There are lots of equivalent ways to write it ("not all zero" is the same as "any nonzero", etc.).
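For example, one such rewrite filters with ne(0).any(axis=1) directly (a sketch using the question's data):

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 9, 0, 9, 0],
                   'col2': ['text1', 'text2', 'text3', 'text4', 'text5'],
                   'col3': [3, 13, 0, 13, 0],
                   'col4': [0, 11, 0, 11, 1],
                   'col5': [22, 22, 22, 22, 12],
                   'col6': [0, 1, 0, 0, 4]})

# Keep rows where at least one of the three columns is nonzero
filtered = df[df[['col1', 'col3', 'col4']].ne(0).any(axis=1)]
print(filtered)
```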

Sorting two-dimensional dataframe using Pandas

I have a two-dimensional DataFrame, for simplicity it looks like:
df = pd.DataFrame([(1,2.2,5),(2,3,-1)], index=['row1', 'row2'], columns = ["col1","col2",'col3'])
with the output:
col1 col2 col3
row1 1 2.2 5
row2 2 3.0 -1
What's the best way to order it by values to get:
RowName ColName Value
row2 col3 -1
row1 col1 1
row2 col1 2
row1 col2 2.2
row2 col2 3.0
row1 col3 5
I did try using .stack() but didn't get very far; constructing this with nested for loops is possible, but inelegant.
Any ideas here?
melt is essentially the inverse of unstack:
In [6]: df
Out[6]:
col1 col2 col3
row1 1 2.2 5
row2 2 3.0 -1
In [7]: pd.melt(df.reset_index(),id_vars='index')
Out[7]:
index variable value
0 row1 col1 1.0
1 row2 col1 2.0
2 row1 col2 2.2
3 row2 col2 3.0
4 row1 col3 5.0
5 row2 col3 -1.0
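The melted frame is not yet ordered by value; sorting it with sort_values then matches the requested output (a runnable sketch):

```python
import pandas as pd

df = pd.DataFrame([(1, 2.2, 5), (2, 3, -1)],
                  index=['row1', 'row2'],
                  columns=['col1', 'col2', 'col3'])

# Melt to long form, then order by the value column
melted = pd.melt(df.reset_index(), id_vars='index').sort_values('value')
print(melted)
```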
stack() plus sort() (replaced by sort_values() in modern pandas) gives the desired output:
In [35]: df
Out[35]:
col1 col2 col3
row1 1 2.2 5
row2 2 3.0 -1
In [36]: stacked = df.stack()
In [38]: stacked.sort()
In [39]: stacked
Out[39]:
row2 col3 -1.0
row1 col1 1.0
row2 col1 2.0
row1 col2 2.2
row2 col2 3.0
row1 col3 5.0
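Note that Series.sort() was removed in later pandas versions; the modern equivalent chains sort_values():

```python
import pandas as pd

df = pd.DataFrame([(1, 2.2, 5), (2, 3, -1)],
                  index=['row1', 'row2'],
                  columns=['col1', 'col2', 'col3'])

# stack() moves the columns into an inner index level;
# sort_values() then orders the resulting Series by value
stacked = df.stack().sort_values()
print(stacked)
```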