Sorting a two-dimensional DataFrame by value using pandas - python-2.7

I have a two-dimensional DataFrame, for simplicity it looks like:
df = pd.DataFrame([(1,2.2,5),(2,3,-1)], index=['row1', 'row2'], columns = ["col1","col2",'col3'])
with the output:
      col1  col2  col3
row1     1   2.2     5
row2     2   3.0    -1
What's the best way to order it by values to get:
RowName  ColName  Value
row2     col3      -1
row1     col1       1
row2     col1       2
row1     col2       2.2
row2     col2       3.0
row1     col3       5
I did try using .stack() but didn't get very far. Constructing this with nested for loops is possible, but inelegant.
Any ideas?

melt is a reverse unstack
In [6]: df
Out[6]:
      col1  col2  col3
row1     1   2.2     5
row2     2   3.0    -1
In [7]: pd.melt(df.reset_index(),id_vars='index')
Out[7]:
  index variable  value
0  row1     col1    1.0
1  row2     col1    2.0
2  row1     col2    2.2
3  row2     col2    3.0
4  row1     col3    5.0
5  row2     col3   -1.0
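To get the value ordering asked for in the question, the melted frame can then be sorted on the value column. A sketch using sort_values (which replaced the older sort methods in later pandas versions):

```python
import pandas as pd

df = pd.DataFrame([(1, 2.2, 5), (2, 3, -1)],
                  index=['row1', 'row2'],
                  columns=['col1', 'col2', 'col3'])

# melt the frame, keeping the row labels, then order by value
melted = pd.melt(df.reset_index(), id_vars='index')
result = melted.sort_values('value').reset_index(drop=True)
print(result)
```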

stack() plus sort() appears to give the desired output
In [35]: df
Out[35]:
      col1  col2  col3
row1     1   2.2     5
row2     2   3.0    -1
In [36]: stacked = df.stack()
In [38]: stacked.sort()
In [39]: stacked
Out[39]:
row2  col3   -1.0
row1  col1    1.0
row2  col1    2.0
row1  col2    2.2
row2  col2    3.0
row1  col3    5.0
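Note that the in-place Series.sort() shown above is from the pandas of the python-2.7 era; it was later removed. The modern equivalent is sort_values(), which returns a sorted copy:

```python
import pandas as pd

df = pd.DataFrame([(1, 2.2, 5), (2, 3, -1)],
                  index=['row1', 'row2'],
                  columns=['col1', 'col2', 'col3'])

# stack() turns the frame into a Series with a (row, column) MultiIndex;
# sort_values() returns a sorted copy instead of sorting in place
stacked = df.stack().sort_values()
print(stacked)
```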

Related

Measures with Multiple Categories to Compute New Column in PowerBI

I need to compute several measures to derive the values for a new column in Power BI.
My data is structured like this:
Col1  Col2   Col3  Col4
A     Type1     2     4
B     Type2     5     9
A     Type3     9    12
B     Type1     4     2
A     Type3     1     2
B     Type2     9     8
A     Type2     7     3
I'm trying to find the proportion, or 'share', of the difference between Col4 and Col3 within each combination of Col1 and Col2. For example, Col5 for row1 would be the difference of Col4 and Col3 divided by the total difference, sum(Col4) - sum(Col3), over all rows that are both Type1 AND A. Ultimately, I want something like this:
Col1  Col2   Col3  Col4  Col5
A     Type1     2     4  66.6%
B     Type2     5     9  133.3%
A     Type2     9    12  -300%
B     Type1     4     2  100%
A     Type1     1     2  33.3%
B     Type2     9     8  -33.3%
A     Type2     7     3  400%
(e.g. all values in Col5 where Col2 = Type1 AND Col1 = 'A' should sum to 100%)
Given this data, I tried to create a measure 'Total' defined as sum(Col4) - sum(Col3) (using proper Power BI notation for column references), then created a new column Col5, hoping to apply it per category: Col5 = (Col4 - Col3) / 'Total'
But, I got something like this:
Col1  Col2   Col3  Col4  Col5
A     Type1     2     4  0%
B     Type2     5     9  -100%
A     Type3     9    12  0%
B     Type1     4     2  -100%
A     Type3     1     2  0%
B     Type2     9     8  -100%
A     Type2     7     3  -100%
When I try to use a quick measure, it can only do one categorization (i.e. either Col1 OR Col2) - but I want to classify totals by both at once.
OK. Please try this one!
Col5 =
VAR Tbl =
    ADDCOLUMNS (
        SUMMARIZE ( YourTable, YourTable[Col1], YourTable[Col2] ),
        "RowTotal", SUMX ( YourTable, [Col4] - [Col3] ),
        "GroupTotal",
            CALCULATE (
                SUMX ( YourTable, [Col4] - [Col3] ),
                ALLEXCEPT ( YourTable, YourTable[Col1], YourTable[Col2] )
            )
    )
RETURN
    SUMX ( Tbl, DIVIDE ( [RowTotal], [GroupTotal] ) )

Sum columns based on several other columns - SAS

I'm trying to sum some columns based on several other columns, and then produce a new table with the results in.
Say I have the following data:
Col1  Col2  Col3  Col4  Col5  Col6
AAAA  BBBB  CCCC  DDDD     3     1
AAAA  BBBB  CCCC  DDDD     5     1
WWWW  XXXX  YYYY  ZZZZ     1     4
WWWW  XXXX  YYYY  ZZZZ     8     2
And I want to sum Col5 and Col6 (separately) where Col1-Col4 are the same, i.e. the output I want is:
Col1  Col2  Col3  Col4  Col5  Col6
AAAA  BBBB  CCCC  DDDD     8     2
WWWW  XXXX  YYYY  ZZZZ     9     6
I've put my code below, but it's giving me the following:
Col1  Col2  Col3  Col4  Col5  Col6
AAAA  BBBB  CCCC  DDDD     8     2
AAAA  BBBB  CCCC  DDDD     8     2
WWWW  XXXX  YYYY  ZZZZ     9     6
WWWW  XXXX  YYYY  ZZZZ     9     6
Any help would be greatly appreciated to:
a) get this code to work.
b) show me a better (more efficient?) way of doing this? I think I've massively(!) overcomplicated this (I'm very new to SAS!).
--- Code ---
data XXX;
    input Col1 $ Col2 $ Col3 $ Col4 $ Col5 Col6;
    datalines;
AAAA BBBB CCCC DDDD 3 1
AAAA BBBB CCCC DDDD 5 1
WWWW XXXX YYYY ZZZZ 1 4
WWWW XXXX YYYY ZZZZ 8 2
;
run;

data test1;
    set XXX;
    groupID = put(md5(upcase(catx('|',Col1,Col2,Col3,Col4))),hex32.);
run;

proc sort data = test1;
    by groupID;
run;

proc summary data = test1;
    var Col5 Col6;
    by groupID;
    output out = want sum=;
run;

proc sql;
    create table test1_results as
    select b.Col1, b.Col2, b.Col3, b.Col4, a.*
    from want as a
    left join test1 as b
    on a.groupID = b.groupID;
quit;

data Final_table;
    set test1_results;
    keep Col1 Col2 Col3 Col4 Col5 Col6;
run;
I think you need PROC SUMMARY; the remaining steps are unnecessary. (Your duplicates come from the left join: test1 still has one row per original observation, so each summarized groupID matches multiple rows.)
Key concept: BY and CLASS statements take multiple variables.
data XXX;
    input Col1 $ Col2 $ Col3 $ Col4 $ Col5 Col6;
    datalines;
AAAA BBBB CCCC DDDD 3 1
AAAA BBBB CCCC DDDD 5 1
WWWW XXXX YYYY ZZZZ 1 4
WWWW XXXX YYYY ZZZZ 8 2
;
run;

proc summary data=xxx NWAY noprint;
    class col1 col2 col3 col4;
    var Col5 Col6;
    output out=want (drop=_type_ _freq_) sum=;
run;

proc print data=want;
run;
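For comparison, the same one-step aggregation (group on Col1-Col4, sum Col5 and Col6) would look like this in pandas:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['AAAA', 'AAAA', 'WWWW', 'WWWW'],
    'Col2': ['BBBB', 'BBBB', 'XXXX', 'XXXX'],
    'Col3': ['CCCC', 'CCCC', 'YYYY', 'YYYY'],
    'Col4': ['DDDD', 'DDDD', 'ZZZZ', 'ZZZZ'],
    'Col5': [3, 5, 1, 8],
    'Col6': [1, 1, 4, 2],
})

# group on the four key columns and sum the two value columns
want = df.groupby(['Col1', 'Col2', 'Col3', 'Col4'],
                  as_index=False)[['Col5', 'Col6']].sum()
print(want)
```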

Merge two duplicate rows with imputing values from each other

I have a dataframe (df1) with only one column (col1) having identical values while other columns have missing values, for example as follows:
df1
   col1  col2        col3  col4  col5  col6
0  1234  NaT          120  NaN   115   XYZ
1  1234  2015/01/12   120  Abc   115   NaN
2  1234  2015/01/12   NaN  NaN   NaN   NaN
I would like to merge the three rows with identical col1 values into one row, filling each missing value with the corresponding value from whichever row has it. The resulting df will look like this:
result_df
   col1  col2        col3  col4  col5  col6
0  1234  2015/01/12   120  Abc   115   XYZ
Can anyone help me with this issue? Thanks in advance!
First rename the duplicated column names col3 and col4 (this assumes the frame really has repeated labels, as in the printed output below):
s = df.columns.to_series()
df.columns = (s + '.' + s.groupby(s).cumcount().replace({0:''}).astype(str)).str.strip('.')
print (df)
   col1        col2   col3 col4  col3.1 col4.1
0  1234         NaT  120.0  NaN   115.0    XYZ
1  1234  2015-01-12  120.0  Abc   115.0    NaN
2  1234  2015-01-12    NaN  NaN     NaN    NaN
And then aggregate first:
df = df.groupby('col1', as_index=False).first()
print (df)
   col1        col2   col3 col4  col3.1 col4.1
0  1234  2015-01-12  120.0  Abc   115.0    XYZ
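If the column labels really are unique, as in the question's table (col1 through col6), the renaming step isn't needed and a single groupby/first does the whole job. A sketch:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'col1': [1234, 1234, 1234],
    'col2': [pd.NaT, pd.Timestamp('2015-01-12'), pd.Timestamp('2015-01-12')],
    'col3': [120, 120, np.nan],
    'col4': [np.nan, 'Abc', np.nan],
    'col5': [115, 115, np.nan],
    'col6': ['XYZ', np.nan, np.nan],
})

# first() takes the first non-missing value per column within each group
result_df = df1.groupby('col1', as_index=False).first()
print(result_df)
```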

pandas dataframe match column from external file

I have a df with
col0  col1  col2  col3
a        1     2  text1
b        1     2  text2
c        1     3  text3
and I have another text file with
col0  col1  col2
met1  a     text1
met2  b     text2
met3  c     text3
How do I match the row values of col3 in my first df against col2 of the text file and add only the col0 string to the df, without changing its structure?
desired output:
col0  col1  col2  col3   col4
a        1     2  text1  met1
b        1     2  text2  met2
c        1     3  text3  met3
You can use pandas.DataFrame.merge(). E.g.:
df.merge(df2.loc[:, ['col0', 'col2']], left_on='col3', right_on='col2')
print(df)
  col0  col1  col2   col3
0    a     1     2  text1
1    b     1     2  text2
2    c     1     3  text3
print(df2)
   col0 col1   col2
0  met1    a  text1
1  met2    b  text2
2  met3    c  text3
Merge df and df2:
df3 = df.merge(df2, left_on='col3', right_on='col2',suffixes=('','_1'))
Housekeeping... renaming columns etc...
df3 = df3.rename(columns={'col0_1':'col4'}).drop(['col1_1','col2_1'], axis=1)
print(df3)
  col0  col1  col2   col3  col4
0    a     1     2  text1  met1
1    b     1     2  text2  met2
2    c     1     3  text3  met3
And, reassign to df if you wish.
df = df3
OR
df = df.assign(col4=df.merge(df2, left_on='col3', right_on='col2',suffixes=('','_1'))['col0_1'])
print(df)
  col0  col1  col2   col3  col4
0    a     1     2  text1  met1
1    b     1     2  text2  met2
2    c     1     3  text3  met3
Call your df df1. Then first load the text file into a dataframe using df2 = pd.read_csv('filename.txt'). Now rename the columns in df2 so that the column you want to merge on has the same name in both frames:
df2.columns = ['new_col1', 'new_col2', 'col3']
Then:
pd.merge(df1, df2, on='col3')
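To be sure the original df keeps its rows and row order ("without changing the structure"), one variant is a left merge after renaming just the two columns of interest in the lookup frame. A sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'col0': ['a', 'b', 'c'],
                    'col1': [1, 1, 1],
                    'col2': [2, 2, 3],
                    'col3': ['text1', 'text2', 'text3']})
df2 = pd.DataFrame({'col0': ['met1', 'met2', 'met3'],
                    'col1': ['a', 'b', 'c'],
                    'col2': ['text1', 'text2', 'text3']})

# keep only the lookup columns, renamed so the merge key matches df1
# and the carried-over column already has its final name
lookup = df2[['col2', 'col0']].rename(columns={'col2': 'col3', 'col0': 'col4'})
df1 = df1.merge(lookup, on='col3', how='left')
print(df1)
```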

DataFrame Multiple Column Comparison using LAMBDA

I have a dataframe which contains some integer values. I want to create a new dataframe that keeps a row only if the columns [col1, col3, col4] are not ALL zeroes. Example:
   col1   col2  col3  col4  col5  col6
0     0  text1     3     0    22     0
1     9  text2    13    11    22     1
2     0  text3     0     0    22     0   # not valid
3     9  text4    13    11    22     0
4     0  text5     0     1    12     4
I am not sure if it's possible to do this with a single lambda.
There's no need for any custom function at all. We can just select the columns we want, do our boolean comparison, and then use that to index into your dataframe:
In [28]: df[["col1", "col3", "col4"]] == 0
Out[28]:
    col1   col3   col4
0   True  False   True
1  False  False  False
2   True   True   True
3  False  False  False
4   True   True  False
In [29]: (df[["col1", "col3", "col4"]] == 0).all(axis=1)
Out[29]:
0    False
1    False
2     True
3    False
4    False
dtype: bool
In [30]: df.loc[~(df[["col1", "col3", "col4"]] == 0).all(axis=1)]
Out[30]:
   col1   col2  col3  col4  col5  col6
0     0  text1     3     0    22     0
1     9  text2    13    11    22     1
3     9  text4    13    11    22     0
4     0  text5     0     1    12     4
There are lots of similar ways to rewrite this ("not all zero" is the same as "any nonzero", etc.).
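For instance, the "any nonzero" phrasing written out directly:

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 9, 0, 9, 0],
                   'col2': ['text1', 'text2', 'text3', 'text4', 'text5'],
                   'col3': [3, 13, 0, 13, 0],
                   'col4': [0, 11, 0, 11, 1],
                   'col5': [22, 22, 22, 22, 12],
                   'col6': [0, 1, 0, 0, 4]})

# keep a row if any of the three columns is nonzero
valid = df[(df[['col1', 'col3', 'col4']] != 0).any(axis=1)]
print(valid)
```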