This question is an extension after reading "Python pandas groupby object apply method duplicates first group".
I understood the answer there and tried some experiments of my own, e.g.:
import pandas as pd
from cStringIO import StringIO

s = '''c1 c2 c3
1 2 3
4 5 6'''
df = pd.read_csv(StringIO(s), sep=' ')
print df

def f2(df):
    print df.iloc[:]
    print "--------"
    return df.iloc[:]

df2 = df.groupby(['c1']).apply(f2)
print "======"
print df2
gives as expected:
c1 c2 c3
0 1 2 3
1 4 5 6
c1 c2 c3
0 1 2 3
--------
c1 c2 c3
0 1 2 3
--------
c1 c2 c3
1 4 5 6
--------
======
c1 c2 c3
0 1 2 3
1 4 5 6
However, when I try to return df.iloc[0:] instead:
def f3(df):
    print df.iloc[0:]
    print "--------"
    return df.iloc[0:]

df3 = df.groupby(['c1']).apply(f3)
print "======"
print df3
I get an additional index level:
c1 c2 c3
0 1 2 3
--------
c1 c2 c3
0 1 2 3
--------
c1 c2 c3
1 4 5 6
--------
======
c1 c2 c3
c1
1 0 1 2 3
4 1 4 5 6
I did some searching and suspect this may mean a different code path is taken?
The difference is that iloc[:] returns the object itself, while iloc[0:] returns a view of the object. Take a look at this:
>>> df.iloc[:] is df
True
>>> df.iloc[0:] is df
False
Where this makes a difference is that within the groupby, each group has a name attribute that reflects the grouping. When your function returns an object with this name attribute, no index is added to the result, while if you return an object without this name attribute, an index is added to track which group each came from.
Interestingly, you can force the iloc[:] behavior for iloc[0:] by explicitly setting the name attribute of the group before returning:
def f(x):
    out = x.iloc[0:]
    out.name = x.name
    return out
df.groupby('c1').apply(f)
# c1 c2 c3
# 0 1 2 3
# 1 4 5 6
My guess is that the no-index behavior with named output is basically a special case meant to make df.groupby(col).apply(lambda x: x) be a no-op.
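To illustrate, a minimal sketch (the exact behavior is version dependent: on older pandas the identity apply below is a no-op, while recent versions add the group keys unless you pass group_keys=False, which suppresses the extra index level regardless of the name attribute):
import pandas as pd

df = pd.DataFrame({'c1': [1, 4], 'c2': [2, 5], 'c3': [3, 6]})

# On older pandas this is a no-op: each returned group keeps its name
# attribute, so no group-key index level is added
print(df.groupby('c1').apply(lambda g: g))

# group_keys=False suppresses the group-key index level explicitly,
# even when the returned object lacks the name attribute (e.g. iloc[0:])
print(df.groupby('c1', group_keys=False).apply(lambda g: g.iloc[0:]))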
Related
I have done a groupby, which resulted in a dataframe similar to the example below.
df = pd.DataFrame({'a': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'b': ['A1', 'A2', 'A3', 'B1', 'B2', 'B3'],
                   'c': ['2', '3', '4', '5', '6', '1']})
>>> df
a b c
0 A A1 2
1 A A2 3
2 A A3 4
3 B B1 5
4 B B2 6
5 B B3 1
desired output
>>> df
a b c
4 B B2 6
3 B B1 5
5 B B3 1
2 A A3 4
1 A A2 3
0 A A1 2
As you can see, it is a double ranking: first on column a, starting with the highest group (B), and then within each group on column c, again starting with the highest (within B that is B2, whose c is 6).
How can I do that in Python?
You can first find the maximum in each group and sort your DataFrame descending by this local maximum and by column c:
In [49]: (df.assign(x=df.groupby('a')['c'].transform('max'))
            .sort_values(['x', 'c'], ascending=[False, False])
            .drop(columns='x'))
Out[49]:
a b c
4 B B2 6
3 B B1 5
5 B B3 1
2 A A3 4
1 A A2 3
0 A A1 2
Use
In [1072]: df.sort_values(by=['a', 'c'], ascending=[False, False])
Out[1072]:
a b c
4 B B2 6
3 B B1 5
5 B B3 1
2 A A3 4
1 A A2 3
0 A A1 2
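Note that this plain two-key sort matches the desired output here partly by coincidence: group B both sorts first under descending a and holds the overall maximum of c. A quick sketch with hypothetical data where the two orderings diverge:
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['A1', 'A2', 'B1', 'B2'],
                   'c': [9, 3, 6, 5]})

# Descending sort on a puts group B first, even though group A holds the max c (9)
print(df.sort_values(by=['a', 'c'], ascending=[False, False]))

# The transform-based approach orders groups by their own maxima, so A comes first
print(df.assign(x=df.groupby('a')['c'].transform('max'))
        .sort_values(['x', 'c'], ascending=[False, False])
        .drop(columns='x'))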
I think you need to first get the max values by aggregating, then create an ordered Categorical whose category order follows those maxima, and finally sort_values works as you need:
c = df.groupby('a')['c'].max().sort_values(ascending=False)
print (c)
a
B 6
A 4
Name: c, dtype: object
df['a'] = pd.Categorical(df['a'], categories=c.index, ordered=True)
df = df.sort_values(by=['a', 'c'], ascending=[True, False])
print (df)
a b c
4 B B2 6
3 B B1 5
5 B B3 1
2 A A3 4
1 A A2 3
0 A A1 2
(The question included an image of a CSV file with two columns, Salesperson_1 and Salesperson_1_ID.)
You can use:
df.drop_duplicates('Salesperson_1')
Or maybe you need:
df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
Sample:
df = pd.DataFrame({'Salesperson_1': ['a', 'a', 'b'],
                   'Salesperson_1_ID': [4, 5, 6]})
print (df)
Salesperson_1 Salesperson_1_ID
0 a 4
1 a 5
2 b 6
df1 = df.drop_duplicates('Salesperson_1')
print (df1)
Salesperson_1 Salesperson_1_ID
0 a 4
2 b 6
df.Salesperson_1_ID = df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
print (df)
Salesperson_1 Salesperson_1_ID
0 a 4
1 a 4
2 b 6
Use pandas groupby().first(). If your DataFrame is called df, you could just do this:
df.groupby('Salesperson_1').first()
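For example, a minimal sketch (assuming the goal is one row per salesperson, keeping the first ID; as_index=False keeps the grouping column as a regular column rather than the index):
import pandas as pd

df = pd.DataFrame({'Salesperson_1': ['a', 'a', 'b'],
                   'Salesperson_1_ID': [4, 5, 6]})

# One row per salesperson, keeping the first ID that appears
print(df.groupby('Salesperson_1', as_index=False).first())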
I have a .csv file named fileOne.csv that contains many unnecessary strings and records. I want to delete unnecessary records / rows and strings based on multiple condition / criteria using a Python or R script and save the records into a new .csv file named resultFile.csv.
What I want to do is as follows:
Delete the first column.
Split column BB into two columns named a_id and b_id. Separate the value on _ (underscore); the left side goes to a_id and the right side goes to b_id.
Keep only records that have the .csv file extension in column BB, but do not contain No Bi in column CC.
Assign a new name to each of the columns.
Delete the records that contain strings like less in column CC.
Trim all other unnecessary strings from the records.
Delete the remaining fields of each row after finding "Mi" in that row.
My fileOne.csv is as follows:
AA BB CC DD EE FF GG
1 1_1.csv (=0 =10" 27" =57 "Mi"
0.97 0.9 0.8 NaN 0.9 od 0.2
2 1_3.csv (=0 =10" 27" "Mi" 0.5
0.97 0.5 0.8 NaN 0.9 od 0.4
3 1_6.csv (=0 =10" "Mi" =53 cnt
0.97 0.9 0.8 NaN 0.9 od 0.6
4 2_6.csv No Bi 000 000 000 000
5 2_8.csv No Bi 000 000 000 000
6 6_9.csv less 000 000 000 000
7 7_9.csv s(=0 =26" =46" "Mi" 121
My first expected results file would be as follows:
a_id b_id CC DD EE FF GG
1 1 0 10 27 57 Mi
1 3 0 10 27 Mi 0.5
1 6 0 10 Mi 53 cnt
7 9 0 26 46 Mi 121
My final expected results file would be as follows:
a_id b_id CC DD EE FF GG
1 1 0 10 27 57
1 3 0 10 27
1 6 0 10
7 9 0 26 46
This can be achieved with the following Python script:
import csv
import re
import string

output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']

# Python 2 translate tables: delete everything except digits
sanitise_table = string.maketrans("", "")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)

def sanitise_cell(cell):
    return cell.translate(sanitise_table, nodigits_table)    # Keep digits only

with open('fileOne.csv') as f_input, open('resultFile.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    input_header = next(f_input)    # Skip the original header row
    csv_output.writerow(output_header)

    for row in csv_input:
        bb = re.match(r'(\d+)_(\d+)\.csv', row[1])
        # Keep only rows with an x_y.csv file name whose CC column
        # is not 'No Bi' or 'less'
        if bb and row[2] not in ['No Bi', 'less']:
            # Remove all columns after 'Mi' if present
            try:
                mi = row.index('Mi')
                row[:] = row[:mi] + [''] * (len(row) - mi)
            except ValueError:
                pass
            row[:] = [sanitise_cell(col) for col in row]
            row[0] = bb.group(1)
            row[1] = bb.group(2)
            csv_output.writerow(row)
To simply remove Mi columns from an existing file the following can be used:
import csv

with open('input.csv') as f_input, open('output.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    for row in csv_input:
        # Blank out every field from 'Mi' onwards, if present
        try:
            mi = row.index('Mi')
            row[:] = row[:mi] + [''] * (len(row) - mi)
        except ValueError:
            pass
        csv_output.writerow(row)
Tested using Python 2.7.9
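Note that string.maketrans and str.translate with a delete-characters argument are Python 2 only. Under Python 3 a minimal equivalent of the digit-keeping sanitiser might look like this (a sketch, not part of the tested script):
import string

def sanitise_cell(cell):
    # Keep only digit characters, dropping quotes, '=' signs, etc.
    return ''.join(ch for ch in cell if ch in string.digits)

print(sanitise_cell('=10"'))   # prints: 10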
Define:
dats <- list(df1 = data.frame(A = sample(1:3), B = sample(11:13)),
             df2 = data.frame(AA = sample(1:3), BB = sample(11:13)))
such that:
> dats
$df1
A B
1 2 12
2 3 11
3 1 13
$df2
AA BB
1 1 13
2 2 12
3 3 11
I would like to change all variable names from all caps to lower. I can do this with a loop but somehow cannot get this lapply call to work:
dats <- lapply(dats, function(x)
  names(x) <- tolower(names(x)))
which results in:
> dats
$df1
[1] "a" "b"
$df2
[1] "aa" "bb"
while the desired result is:
> dats
$df1
a b
1 2 12
2 3 11
3 1 13
$df2
aa bb
1 1 13
2 2 12
3 3 11
If you don't use return at the end of a function, the last evaluated expression is returned; here that is the value of the assignment, not the modified data frame. So you need to return x:
dats <- lapply(dats, function(x) {
  names(x) <- tolower(names(x))
  x
})
I have a list comprising sub-lists with different numbers of entries, as follows:
x <- list(
  c("a1", "a2", "a3", "a4", "a5", "a6", "a7"),
  c("b1", "b2", "b3", "b4"),
  c("c1", "c2", "c3"),
  c("d1")
)
I want to convert this list to a data frame with three columns (the 1st column is the sequence number of the sub-list, i.e. 1 to 4; the 2nd column is the entries; the 3rd stands for my stop code, so I used 1 for every line). The final result is as follows:
1 a1 1
1 a2 1
1 a3 1
1 a4 1
1 a5 1
1 a6 1
1 a7 1
2 b1 1
2 b2 1
2 b3 1
2 b4 1
3 c1 1
3 c2 1
3 c3 1
4 d1 1
I tried to use cbind; however, it seems to only work for sub-lists with the same number of entries. Is there a smarter way of doing this?
Here is an example:
data.frame(
  x = rep(1:length(x), sapply(x, length)),
  y = unlist(x),
  z = 1
)
library(reshape2)
x <- melt(x) ## Done...
## Trivial...
x$stop <- 1
x <- x[c(2,1,3)]
One option is to use the split-apply-combine functionality in the plyr package. In this case you need ldply, which takes a list and combines the elements into a data.frame:
library(plyr)
ldply(seq_along(x), function(i) data.frame(n = i, x = x[[i]], stop = 1))
n x stop
1 1 a1 1
2 1 a2 1
3 1 a3 1
4 1 a4 1
5 1 a5 1
6 1 a6 1
7 1 a7 1
8 2 b1 1
9 2 b2 1
10 2 b3 1
11 2 b4 1
12 3 c1 1
13 3 c2 1
14 3 c3 1
15 4 d1 1