Merge duplicate rows by imputing values from each other - python-2.7

I have a dataframe (df1) in which one column (col1) has identical values across rows, while the other columns have missing values, for example:
df1
--------------------------------------------------------------
    col1  col2        col3  col4  col5  col6
--------------------------------------------------------------
0|  1234  NaT         120   NaN   115   XYZ
1|  1234  2015/01/12  120   Abc   115   NaN
2|  1234  2015/01/12  NaN   NaN   NaN   NaN
I would like to merge the three rows with identical col1 values into a single row, filling each missing value with the corresponding value from whichever row has one. The resulting df would look like this:
result_df
--------------------------------------------------------------
    col1  col2        col3  col4  col5  col6
--------------------------------------------------------------
0|  1234  2015/01/12  120   Abc   115   XYZ
Can anyone help me with this issue? Thanks in advance!

First, make the duplicated column names (col3 and col4) unique:
s = df.columns.to_series()
df.columns = (s + '.' + s.groupby(s).cumcount().replace({0:''}).astype(str)).str.strip('.')
print (df)
col1 col2 col3 col4 col3.1 col4.1
0 1234 NaT 120.0 NaN 115.0 XYZ
1 1234 2015-01-12 120.0 Abc 115.0 NaN
2 1234 2015-01-12 NaN NaN NaN NaN
Then group by col1 and aggregate with first:
df = df.groupby('col1', as_index=False).first()
print (df)
col1 col2 col3 col4 col3.1 col4.1
0 1234 2015-01-12 120.0 Abc 115.0 XYZ
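Putting the two steps together, a minimal end-to-end sketch (the toy frame with duplicated col3/col4 names is an assumption that mirrors the column layout in the output above):

import pandas as pd
import numpy as np

# Toy frame with duplicated column names, mirroring the answer's setup
df = pd.DataFrame([[1234, pd.NaT, 120, np.nan, 115, 'XYZ'],
                   [1234, '2015/01/12', 120, 'Abc', 115, np.nan],
                   [1234, '2015/01/12', np.nan, np.nan, np.nan, np.nan]],
                  columns=['col1', 'col2', 'col3', 'col4', 'col3', 'col4'])

# Step 1: de-duplicate the column names (second col3 becomes col3.1, etc.)
s = df.columns.to_series()
df.columns = (s + '.' + s.groupby(s).cumcount().replace({0: ''}).astype(str)).str.strip('.')

# Step 2: collapse the duplicate rows, taking the first non-missing value per column
result_df = df.groupby('col1', as_index=False).first()
print(result_df)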

Related

Measures with Multiple Categories to Compute New Column in PowerBI

I need to combine several measures to compute the values for a new column in Power BI.
My data is structured like this:
Col1 Col2 Col3 Col4
A Type1 2 4
B Type2 5 9
A Type3 9 12
B Type1 4 2
A Type3 1 2
B Type2 9 8
A Type2 7 3
I'm trying to find the proportion, or 'share', of the difference between Col4 and Col3 within each (Col1, Col2) category. For example, Col5 for row 1 would be that row's Col4 - Col3 divided by sum(Col4) - sum(Col3) taken over all rows where Col1 = 'A' AND Col2 = Type1. Ultimately, I want something like this:
Col1 Col2 Col3 Col4 Col5
A Type1 2 4 66.6%
B Type2 5 9 133.3%
A Type2 9 12 -300%
B Type1 4 2 100%
A Type1 1 2 33.3%
B Type2 9 8 -33.3%
A Type2 7 3 400%
(e.g. all values in Col5 where Col2 = Type1 AND Col1 = 'A' should sum to 100%)
Given this data, I tried to create a measure 'Total' that was sum(Col4)-sum(Col3) (using proper PowerBI notation for column references), then created a new column Col5, hoping to apply that total per category: Col5 = (Col4-Col3)/'Total'
But, I got something like this:
Col1 Col2 Col3 Col4 Col5
A Type1 2 4 0%
B Type2 5 9 -100%
A Type3 9 12 0%
B Type1 4 2 -100%
A Type3 1 2 0%
B Type2 9 8 -100%
A Type2 7 3 -100%
When I try to use a quick measure, it can only do one categorization (i.e. either Col1 OR Col2) - but I want to classify totals by both at once.
OK. Please try this one!
Col5 =
VAR Tbl =
    ADDCOLUMNS (
        SUMMARIZE ( YourTable, YourTable[Col1], YourTable[Col2] ),
        "RowTotal", SUMX ( YourTable, [Col4] - [Col3] ),
        "GroupTotal",
            CALCULATE (
                SUMX ( YourTable, [Col4] - [Col3] ),
                ALLEXCEPT ( YourTable, YourTable[Col1], YourTable[Col2] )
            )
    )
RETURN
    SUMX ( Tbl, DIVIDE ( [RowTotal], [GroupTotal] ) )
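Not a Power BI answer, but if you want to sanity-check the two-key grouping logic outside DAX, the same share calculation is short in pandas (toy frame built from the question's sample rows; each (Col1, Col2) group's Col5 sums to 100%):

import pandas as pd

# Sample rows from the question
df = pd.DataFrame({'Col1': ['A', 'B', 'A', 'B', 'A', 'B', 'A'],
                   'Col2': ['Type1', 'Type2', 'Type3', 'Type1', 'Type3', 'Type2', 'Type2'],
                   'Col3': [2, 5, 9, 4, 1, 9, 7],
                   'Col4': [4, 9, 12, 2, 2, 8, 3]})

diff = df['Col4'] - df['Col3']
# Total difference within each (Col1, Col2) group, broadcast back to the rows
group_total = diff.groupby([df['Col1'], df['Col2']]).transform('sum')
df['Col5'] = diff / group_total  # shares sum to 1.0 (100%) per group
print(df)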

Sum columns based on several other columns - SAS

I'm trying to sum some columns based on several other columns, and then produce a new table with the results.
Say I have the following data:
Col1  Col2  Col3  Col4  Col5  Col6
AAAA  BBBB  CCCC  DDDD  3     1
AAAA  BBBB  CCCC  DDDD  5     1
WWWW  XXXX  YYYY  ZZZZ  1     4
WWWW  XXXX  YYYY  ZZZZ  8     2
And I want to sum Col5 and Col6 (separately) where Col1-Col4 are the same, i.e. the output I want is:
Col1  Col2  Col3  Col4  Col5  Col6
AAAA  BBBB  CCCC  DDDD  8     2
WWWW  XXXX  YYYY  ZZZZ  9     6
I've put my code below, but it's giving me the following:
Col1  Col2  Col3  Col4  Col5  Col6
AAAA  BBBB  CCCC  DDDD  8     2
AAAA  BBBB  CCCC  DDDD  8     2
WWWW  XXXX  YYYY  ZZZZ  9     6
WWWW  XXXX  YYYY  ZZZZ  9     6
Any help would be greatly appreciated to:
a) get this code to work, and
b) show me a better (more efficient?) way of doing this. I think I've massively(!) overcomplicated this (I'm very new to SAS!).
--- Code ---
data XXX;
input Col1 $ Col2 $ Col3 $ Col4 $ Col5 Col6;
datalines;
AAAA BBBB CCCC DDDD 3 1
AAAA BBBB CCCC DDDD 5 1
WWWW XXXX YYYY ZZZZ 1 4
WWWW XXXX YYYY ZZZZ 8 2
;
run;
data test1;
set XXX;
groupID = put(md5(upcase(catx('|',Col1,Col2,Col3,Col4))),hex32.);
run;
proc sort data = test1;
by groupID;
run;
proc summary data = test1;
var Col5 Col6;
by groupID;
Output out = want sum=;
run;
proc sql;
create table test1_results as
select b.Col1,b.Col2,b.Col3,b.Col4, a.*
from want as a
left join test1 as b
on a.groupID = b.groupID;
run;
data Final_table;
set test1_results;
Keep Col1 Col2 Col3 Col4 Col5 Col6;
run;
I think you need Proc SUMMARY. The remaining steps are unnecessary.
Key concept - BY or CLASS statements take multiple variables.
data XXX;
input Col1 $ Col2 $ Col3 $ Col4 $ Col5 Col6;
datalines;
AAAA BBBB CCCC DDDD 3 1
AAAA BBBB CCCC DDDD 5 1
WWWW XXXX YYYY ZZZZ 1 4
WWWW XXXX YYYY ZZZZ 8 2
;
run;
proc summary data=xxx NWAY noprint;
  /* NWAY keeps only the groups formed by all four class variables together */
  class col1 col2 col3 col4;
  var Col5 Col6;
  output out=want (drop=_type_ _freq_) sum=;
run;
proc print data=want;run;
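For readers coming from the pandas questions above, the same multi-key aggregation is a single groupby/sum; a sketch with a hypothetical DataFrame df holding the same rows:

import pandas as pd

# Same rows as the SAS datalines, in a hypothetical DataFrame
df = pd.DataFrame([['AAAA', 'BBBB', 'CCCC', 'DDDD', 3, 1],
                   ['AAAA', 'BBBB', 'CCCC', 'DDDD', 5, 1],
                   ['WWWW', 'XXXX', 'YYYY', 'ZZZZ', 1, 4],
                   ['WWWW', 'XXXX', 'YYYY', 'ZZZZ', 8, 2]],
                  columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6'])

# Group on all four key columns at once, like CLASS with multiple variables
want = df.groupby(['Col1', 'Col2', 'Col3', 'Col4'], as_index=False)[['Col5', 'Col6']].sum()
print(want)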

Set value of different columns in Pandas DataFrame with indexes

I have extracted the indexes of certain rows that I'd like to update with values from another DataFrame, using the following command:
indexes = df1[df1.iloc[:, 0].isin(df2.iloc[:, 0].values)].index.values
What I'd like to do is assign values from certain columns of df2 to certain columns of df1, for the rows whose indexes I have.
For example:
df1:
index | col1 | col2 | col3
0 | ABC | DEF | GHI
1 | JKL | MNO | PQR
2 | STU | VWX | YZ
df2:
index | colA | colB | colC
0 | WHAT | EVER | 123
2 | 111 | 222 | 333
What I'd like to do now for example is to assign the value of colB (df2) to col3 (df1) according to the indexes. So the result should be:
df1:
index | col1 | col2 | col3
0 | ABC | DEF | EVER <- value of colB (df2)
1 | JKL | MNO | PQR
2 | STU | VWX | 222 <- value of colB (df2)
I'm aware that I can set values with the .iloc (integer location) indexer, but I can't figure out how to do this with the corresponding indexes.
Also, I'd appreciate a pointer to a good Pandas guide (as you can see, I'm new to Pandas).
Greetings,
Frame
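One way to do this, sketched with the example frames above; this assumes (as in the example) that df2's index labels are a subset of df1's, so .loc can align the rows by label:

import pandas as pd

# Example frames from the question
df1 = pd.DataFrame({'col1': ['ABC', 'JKL', 'STU'],
                    'col2': ['DEF', 'MNO', 'VWX'],
                    'col3': ['GHI', 'PQR', 'YZ']}, index=[0, 1, 2])
df2 = pd.DataFrame({'colA': ['WHAT', 111],
                    'colB': ['EVER', 222],
                    'colC': [123, 333]}, index=[0, 2])

# .loc aligns on index labels, so rows 0 and 2 of df1 receive df2's colB values
df1.loc[df2.index, 'col3'] = df2['colB']
# With only the plain array of indexes, df1.loc[indexes, 'col3'] = df2['colB'].values
# also works, provided both are the same length and in the same order
print(df1)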

pandas dataframe match column from external file

I have a df with
col0 col1 col2 col3
a 1 2 text1
b 1 2 text2
c 1 3 text3
and another text file with
col0 col1 col2
met1 a text1
met2 b text2
met3 c text3
How do I match the row values of col3 in my first df to col2 of the text file, and add only the col0 string to the previous df, without changing the structure of the df?
desired output:
col0 col1 col2 col3 col4
a 1 2 text1 met1
b 1 2 text2 met2
c 1 3 text3 met3
You can use pandas.DataFrame.merge(). E.g.:
df.merge(df2.loc[:, ['col0', 'col2']], left_on='col3', right_on='col2')
print(df)
col0 col1 col2 col3
0 a 1 2 text1
1 b 1 2 text2
2 c 1 3 text3
print(df2)
col0 col1 col2
0 met1 a text1
1 met2 b text2
2 met3 c text3
Merge df and df2
df3 = df.merge(df2, left_on='col3', right_on='col2',suffixes=('','_1'))
Housekeeping... renaming columns etc...
df3 = df3.rename(columns={'col0_1':'col4'}).drop(['col1_1','col2_1'], axis=1)
print(df3)
col0 col1 col2 col3 col4
0 a 1 2 text1 met1
1 b 1 2 text2 met2
2 c 1 3 text3 met3
And, reassign to df if you wish.
df = df3
OR
df = df.assign(col4=df.merge(df2, left_on='col3', right_on='col2',suffixes=('','_1'))['col0_1'])
print(df)
col0 col1 col2 col3 col4
0 a 1 2 text1 met1
1 b 1 2 text2 met2
2 c 1 3 text3 met3
Call your df df1. Then first load the text file into a dataframe using df2 = pd.read_csv('filename.txt') (adding a separator argument such as delim_whitespace=True if the file is whitespace-delimited). Now, you want to rename the columns in df2 so that the column on which you want to merge has the same name in both dataframes:
df2.columns = ['new_col1', 'new_col2', 'col3']
Then:
pd.merge(df1, df2, on='col3')
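Putting the pieces together, a runnable sketch (the file name 'filename.txt' and the whitespace separator are assumptions about how the text file is stored):

import pandas as pd

# Read the external text file; adjust the separator to match the real file
df2 = pd.read_csv('filename.txt', delim_whitespace=True)

# Keep only the lookup columns and pre-name the column we want to add
lookup = df2[['col0', 'col2']].rename(columns={'col0': 'col4'})

# Merge on the text key, then drop the duplicated key column from the file
df = df.merge(lookup, left_on='col3', right_on='col2', suffixes=('', '_file'))
df = df.drop(['col2_file'], axis=1)
print(df)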

Sorting two-dimensional dataframe using Pandas

I have a two-dimensional DataFrame, for simplicity it looks like:
df = pd.DataFrame([(1,2.2,5),(2,3,-1)], index=['row1', 'row2'], columns = ["col1","col2",'col3'])
with the output:
col1 col2 col3
row1 1 2.2 5
row2 2 3.0 -1
What's the best way to order it by values to get:
RowName ColName Value
row2 col3 -1
row1 col1 1
row2 col1 2
row1 col2 2.2
row2 col2 3.0
row1 col3 5
I did try using .stack() but didn't get very far; constructing this using nested for loops is possible, but inelegant.
Any ideas here?
melt is a reverse unstack
In [6]: df
Out[6]:
col1 col2 col3
row1 1 2.2 5
row2 2 3.0 -1
In [7]: pd.melt(df.reset_index(),id_vars='index')
Out[7]:
index variable value
0 row1 col1 1.0
1 row2 col1 2.0
2 row1 col2 2.2
3 row2 col2 3.0
4 row1 col3 5.0
5 row2 col3 -1.0
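To get exactly the requested column names and ordering, the melted frame can then be sorted and renamed (sort_values assumes a reasonably recent pandas):

out = pd.melt(df.reset_index(), id_vars='index')
out = out.sort_values('value').rename(columns={'index': 'RowName',
                                               'variable': 'ColName',
                                               'value': 'Value'})
print(out)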
stack() plus sort() appears to give the desired output
In [35]: df
Out[35]:
col1 col2 col3
row1 1 2.2 5
row2 2 3.0 -1
In [36]: stacked = df.stack()
In [38]: stacked.sort()
In [39]: stacked
Out[39]:
row2 col3 -1.0
row1 col1 1.0
row2 col1 2.0
row1 col2 2.2
row2 col2 3.0
row1 col3 5.0
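Note that Series.sort() (which sorted in place) has since been removed from pandas; in current versions the equivalent is sort_values, e.g.:

stacked = df.stack().sort_values()  # returns a new, sorted Series
print(stacked)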