double ranking in python after a groupby - python-2.7

I have done a groupby which resulted in a DataFrame similar to the example below.
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B'], 'b': ['A1', 'A2', 'A3', 'B1', 'B2', 'B3'], 'c': ['2', '3', '4', '5', '6', '1']})
>>> df
   a   b  c
0  A  A1  2
1  A  A2  3
2  A  A3  4
3  B  B1  5
4  B  B2  6
5  B  B3  1
Desired output:
>>> df
   a   b  c
4  B  B2  6
3  B  B1  5
5  B  B3  1
2  A  A3  4
1  A  A2  3
0  A  A1  2
As you can see, it is a double ranking based on column a and then column b: we first start with the highest group, which is B, and within B we also start with the highest, which is B2.
How can I do that in Python?

You can first find the maximum in each group, then sort your DataFrame descending by this local maximum and by column c:
In [49]: (df.assign(x=df.groupby('a')['c'].transform('max'))
            .sort_values(['x','c'], ascending=[0,0])
            .drop('x',1))
Out[49]:
   a   b  c
4  B  B2  6
3  B  B1  5
5  B  B3  1
2  A  A3  4
1  A  A2  3
0  A  A1  2
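Note that column c in the sample frame holds strings, so the sorts in these answers rely on lexicographic order, which happens to match numeric order for single-digit values. Assuming the data is meant to be numeric, a safer first step is:
df['c'] = df['c'].astype(int)  # otherwise '10' would sort before '2'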

Use
In [1072]: df.sort_values(by=['a', 'c'], ascending=[False, False])
Out[1072]:
   a   b  c
4  B  B2  6
3  B  B1  5
5  B  B3  1
2  A  A3  4
1  A  A2  3
0  A  A1  2
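Note that this simpler sort orders the groups alphabetically by a, which coincides with ordering by each group's maximum only because group B here also holds the overall maximum. A small hypothetical counterexample (data invented for illustration) where the two orderings diverge:
import pandas as pd

# Group A holds the overall maximum, yet sorting by 'a' descending puts B first;
# the max-based answers would put group A first instead.
df2 = pd.DataFrame({'a': ['A', 'B'], 'b': ['A1', 'B1'], 'c': [9, 5]})
print(df2.sort_values(by=['a', 'c'], ascending=[False, False]))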

I think you need to first get the max values by aggregating, then create an ordered Categorical whose category order comes from the max-sorted index, and finally sort_values works as you need:
c = df.groupby('a')['c'].max().sort_values(ascending=False)
print (c)
a
B    6
A    4
Name: c, dtype: object
df['a'] = pd.Categorical(df['a'], categories=c.index, ordered=True)
df = df.sort_values(by=['a', 'c'], ascending=[True, False])
print (df)
   a   b  c
4  B  B2  6
3  B  B1  5
5  B  B3  1
2  A  A3  4
1  A  A2  3
0  A  A1  2
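As a quick sanity check of the order the Categorical imposes (using the df and c from above):
print(df['a'].cat.categories)  # Index(['B', 'A'], dtype='object'): B sorts before A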


Return all values using LOOKUPVALUE, not just matches

I have two tables with related fields. I am trying to return the enrollment# from table A into a column in table B.
table A
Serial#  Enrollment#
A        1
B        2
C        3
D        4
E        5
table B
Serial#  Enrollment#
A        1
B        20
C        3
D        4
E        50
I want this calculated column in table B:
Serial#  Enrollment#  tableAEnrollment#
A        1            1
B        20           2
C        3            3
D        4            4
E        50           5
However, this is what I am getting:
Serial#  Enrollment#  tableAEnrollment#
A        1            1
B        20
C        3            3
D        4            4
E        50
My function is:
tableAEnrollment# = LOOKUPVALUE(A[Enrollment #], A[Serial #], B[Serial #])
It's only bringing back values where the enrollment numbers match. What am I doing wrong?
Thanks in advance!
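For reference, the intended lookup semantics (match on Serial# alone and pull Enrollment# from table A) can be sketched in pandas, the language used elsewhere on this page; this only illustrates the expected result, it is not a DAX fix:
import pandas as pd

# Reconstruction of the two tables from the question.
table_a = pd.DataFrame({'Serial#': list('ABCDE'), 'Enrollment#': [1, 2, 3, 4, 5]})
table_b = pd.DataFrame({'Serial#': list('ABCDE'), 'Enrollment#': [1, 20, 3, 4, 50]})

# Match rows on Serial# only, so differing enrollment numbers do not matter.
table_b['tableAEnrollment#'] = table_b['Serial#'].map(
    table_a.set_index('Serial#')['Enrollment#'])
print(table_b)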

pandas dataframe category codes from two columns

I have a pandas DataFrame where two columns correspond to names of people. The columns are related, and the same name means the same person. I want to assign category codes such that they are valid across the whole "name" space.
For example my data frame is
df = pd.DataFrame({"P1":["a","b","c","a"], "P2":["b","c","d","c"]})
>>> df
  P1 P2
0  a  b
1  b  c
2  c  d
3  a  c
I want it to be replaced by the corresponding category codes, such that
>>> df
   P1  P2
0   1   2
1   2   3
2   3   4
3   1   3
The categories are in fact derived from the concatenated array ["a","b","c","d"] and applied to the individual columns separately. How can I achieve this?
Use stack to collapse both columns into a single Series (so the dense ranks are computed over the combined name space), then unstack to restore the shape:
print (df.stack().rank(method='dense').astype(int).unstack())
   P1  P2
0   1   2
1   2   3
2   3   4
3   1   3
EDIT:
For a more general solution I used another answer, because of a problem with duplicate values in the index:
df = pd.DataFrame({"P1":["a","b","c","a"],
"P2":["b","c","d","c"],
"A":[3,4,5,6]}, index=[2,2,3,3])
print (df)
   A P1 P2
2  3  a  b
2  4  b  c
3  5  c  d
3  6  a  c
cols = ['P1','P2']
df[cols] = (pd.factorize(df[cols].values.ravel())[0]+1).reshape(-1, len(cols))
print (df)
   A  P1  P2
2  3   1   2
2  4   2   3
3  5   3   4
3  6   1   3
You can do
In [465]: pd.DataFrame((pd.factorize(df.values.ravel())[0]+1).reshape(df.shape),
                       columns=df.columns)
Out[465]:
   P1  P2
0   1   2
1   2   3
2   3   4
3   1   3
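Both factorize-based snippets number values by order of first appearance, which happens to match the desired output here. If the codes must come from the sorted union ["a","b","c","d"] as the question describes, a sketch with an explicit Categorical (codes are 0-based, hence the +1):
import numpy as np
import pandas as pd

df = pd.DataFrame({"P1": ["a", "b", "c", "a"], "P2": ["b", "c", "d", "c"]})
cats = np.unique(df[['P1', 'P2']].values.ravel())  # sorted union: ['a' 'b' 'c' 'd']
for col in ['P1', 'P2']:
    df[col] = pd.Categorical(df[col], categories=cats).codes + 1
print(df)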

Python pandas groupby object apply method adds index

This question is an extension from reading "Python pandas groupby object apply method duplicates first group".
I understood the answer there and tried some experiments of my own, e.g.:
import pandas as pd
from cStringIO import StringIO
s = '''c1 c2 c3
1 2 3
4 5 6'''
df = pd.read_csv(StringIO(s), sep=' ')
print df
def f2(df):
    print df.iloc[:]
    print "--------"
    return df.iloc[:]
df2 = df.groupby(['c1']).apply(f2)
print "======"
print df2
gives as expected:
   c1  c2  c3
0   1   2   3
1   4   5   6
   c1  c2  c3
0   1   2   3
--------
   c1  c2  c3
0   1   2   3
--------
   c1  c2  c3
1   4   5   6
--------
======
   c1  c2  c3
0   1   2   3
1   4   5   6
However, when I try to return df.iloc[0:] instead:
def f3(df):
    print df.iloc[0:]
    print "--------"
    return df.iloc[0:]
df3 = df.groupby(['c1']).apply(f3)
print "======"
print df3
I get an additional index:
   c1  c2  c3
0   1   2   3
--------
   c1  c2  c3
0   1   2   3
--------
   c1  c2  c3
1   4   5   6
--------
======
      c1  c2  c3
c1
1  0   1   2   3
4  1   4   5   6
I did some searching and suspect this may mean a different code path is taken?
The difference is that iloc[:] returns the object itself, while iloc[0:] returns a view of the object. Take a look at this:
>>> df.iloc[:] is df
True
>>> df.iloc[0:] is df
False
Where this makes a difference is that within the groupby, each group has a name attribute that reflects the grouping. When your function returns an object with this name attribute, no index is added to the result, while if you return an object without this name attribute, an index is added to track which group each came from.
Interestingly, you can force the iloc[:] behavior for iloc[0:] by explicitly setting the name attribute of the group before returning:
def f(x):
    out = x.iloc[0:]
    out.name = x.name
    return out
df.groupby('c1').apply(f)
#    c1  c2  c3
# 0   1   2   3
# 1   4   5   6
My guess is that the no-index behavior with named output is basically a special case meant to make df.groupby(col).apply(lambda x: x) be a no-op.
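A quick check of that guess, using the df defined above (this reflects the pandas versions of that era; later releases changed group_keys handling):
out = df.groupby(['c1']).apply(lambda x: x)
print(out.equals(df))  # True: the identity apply returns the frame unchanged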

Pandas: How to get a new dataframe filled with unions of 2 or 3 or X other dataframes?

I have a long DataFrame with daily dates starting from 1999. I apply a filter to the original_dataframe to create new_dataframe_1 and another filter to create new_dataframe_2.
How do I create a third dataframe which contains only the rows that new_dataframe_1 and new_dataframe_2 have in common?
new_dataframe_1
   A  B  C  D
1  a  b  c  d
2  a  b  c  d
3  a  b  c  d
4  a  b  c  d
new_dataframe_2
   A  B  C  D
3  a  b  c  d
4  a  b  c  d
5  a  b  c  d
6  a  b  c  d
new_dataframe_3 = rows common to new_dataframe_1 and new_dataframe_2 (strictly an intersection, not a union)
   A  B  C  D
3  a  b  c  d
4  a  b  c  d
If you want the columns from both DataFrames joined together, do an inner join:
import pandas as pd
df1 = pd.DataFrame({'A': range(5)}, index=list('abcde'))
df2 = pd.DataFrame({'B': range(10,20,2)}, index=list('AbCdE'))
print(df1)
#    A
# a  0
# b  1
# c  2
# d  3
# e  4
print(df2)
#     B
# A  10
# b  12
# C  14
# d  16
# E  18
print(df1.join(df2, how='inner'))
yields
   A   B
b  1  12
d  3  16
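Equivalently, pd.merge on the indices gives the same result, since merge defaults to an inner join:
print(pd.merge(df1, df2, left_index=True, right_index=True))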
If you only wish to select the columns from one of the DataFrames,
do a reindex on the intersection of the indices:
import pandas as pd
df1 = pd.DataFrame({'A': range(5)}, index=list('abcde'))
df2 = pd.DataFrame({'A': range(5)}, index=list('AbCdE'))
print(df1)
#    A
# a  0
# b  1
# c  2
# d  3
# e  4
print(df2)
#    A
# A  0
# b  1
# C  2
# d  3
# E  4
print(df1.reindex(df1.index.intersection(df2.index)))
yields
   A
b  1
d  3
There are also df1.loc and df1.ix (note that .ix has since been deprecated), but df1.reindex appears to be faster:
In [33]: idx1 = df1.index
In [34]: idx2 = df2.index
In [35]: %timeit df1.loc[idx1.intersection(idx2)]
1000 loops, best of 3: 269 µs per loop
In [36]: %timeit df1.ix[idx1.intersection(idx2)]
1000 loops, best of 3: 276 µs per loop
In [37]: %timeit df1.reindex(idx1.intersection(idx2))
10000 loops, best of 3: 186 µs per loop
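Another option, untimed here, is boolean indexing with Index.isin, which avoids materializing the intersection as a separate index:
print(df1[df1.index.isin(df2.index)])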

R combine list to special format of table

I have a list comprising sub-lists with different numbers of entries, as follows:
x <- list(
c("a1", "a2", "a3", "a4", "a5", "a6", "a7"),
c("b1","b2","b3","b4"),
c("c1","c2","c3"),
c("d1")
)
I want to convert this list to a data frame with three columns: the 1st column is the sequence number of the sub-list, i.e. 1 to 4; the 2nd column is the entries; the 3rd stands for my stop code, so I used 1 for every line. The final result is as follows:
1 a1 1
1 a2 1
1 a3 1
1 a4 1
1 a5 1
1 a6 1
1 a7 1
2 b1 1
2 b2 1
2 b3 1
2 b4 1
3 c1 1
3 c2 1
3 c3 1
4 d1 1
I tried to use cbind; however, it seems to only work for sub-lists with the same number of entries. Is there a smarter way of doing this?
Here is an example:
data.frame(
  x=rep(1:length(x), sapply(x, length)),
  y=unlist(x),
  z=1
)
library(reshape2)
x <- melt(x) ## Done...
## Trivial...
x$stop <- 1
x <- x[c(2,1,3)]
One option is to use the split-apply-combine functionality in package plyr. In this case you need ldply, which will take a list and combine the elements into a data.frame:
library(plyr)
ldply(seq_along(x), function(i)data.frame(n=i, x=x[[i]], stop=1))
    n  x stop
1   1 a1    1
2   1 a2    1
3   1 a3    1
4   1 a4    1
5   1 a5    1
6   1 a6    1
7   1 a7    1
8   2 b1    1
9   2 b2    1
10  2 b3    1
11  2 b4    1
12  3 c1    1
13  3 c2    1
14  3 c3    1
15  4 d1    1