Is row order within groups preserved by groupby/apply? - python-2.7

Given the code below, I am trying to figure out whether the row order within a particular group always remains the same as in the original dataframe.
The order within each group appears to be preserved in my small example, but what if I have a dataframe with ~1 million records? Does pandas guarantee this, or do I need to take care of it myself?
Code:
import numpy as np
import pandas as pd
N = 10
df = pd.DataFrame(index = xrange(N))
df['A'] = map(lambda x: int(x) / 5, np.random.randn(N) * 10.0)
df['B'] = map(lambda x: int(x) / 5, np.random.randn(N) * 10.0)
df['v'] = np.random.randn(N)
def show_x(x):
    print x
    print "----------------"
df.groupby('A').apply(show_x)
print "==============="
print df
Output:
A B v
6 -4 -1 -2.047354
[1 rows x 3 columns]
----------------
A B v
6 -4 -1 -2.047354
[1 rows x 3 columns]
----------------
A B v
8 -3 0 -1.190831
[1 rows x 3 columns]
----------------
A B v
0 -1 -1 0.456397
9 -1 -2 -1.329169
[2 rows x 3 columns]
----------------
A B v
1 0 0 0.663928
2 0 2 0.626204
7 0 -3 -0.539166
[3 rows x 3 columns]
----------------
A B v
4 2 2 -1.115721
5 2 1 -1.905266
[2 rows x 3 columns]
----------------
A B v
3 4 -1 0.751016
[1 rows x 3 columns]
----------------
===============
A B v
0 -1 -1 0.456397
1 0 0 0.663928
2 0 2 0.626204
3 4 -1 0.751016
4 2 2 -1.115721
5 2 1 -1.905266
6 -4 -1 -2.047354
7 0 -3 -0.539166
8 -3 0 -1.190831
9 -1 -2 -1.329169
[10 rows x 3 columns]

If you are using apply, not only is the order not guaranteed, but, as you've found, it can also call the function on the same group more than once (to decide which "path" to take / what type of result to return). So if your function has side effects, don't do this!
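That double-call pitfall is easy to demonstrate with a counter. This is a minimal sketch; exact call counts vary by pandas version, which is precisely why side effects are dangerous here:

```python
import pandas as pd

# Record which group key the applied function sees on each call.
calls = []
df = pd.DataFrame({'A': [1, 1, 2], 'v': [10, 20, 30]})
df.groupby('A').apply(lambda g: calls.append(g.name))
# On older pandas the first group may appear twice in `calls`;
# each group is seen at least once on any version.
print(calls)
```

If `calls` were feeding an accumulator instead of a list, the extra invocation would silently double-count.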
I recommend simply iterating through the groupby object!
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
1 1 4
2 5 6
In [13]: g = df.groupby('A')
In [14]: for key, sub_df in g:
             print("key =", key)
             print(sub_df)
             print('')  # apply whatever function you want
key = 1
A B
0 1 2
1 1 4
key = 5
A B
2 5 6
Note that the iteration is ordered (the same as the levels); see g.grouper._get_group_keys():
In [21]: g.grouper.levels
Out[21]: [Int64Index([1, 5], dtype='int64')]
Groups are sorted by default (there's a sort kwarg on groupby), though it's not entirely clear what sorting means for a non-numeric dtype.
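For the original question about row order: later pandas documentation does state that groupby preserves the order of rows within each group (sort only affects the order of the groups themselves). A quick sanity check, as a sketch rather than a proof:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'A': np.random.randint(0, 3, 20), 'v': np.arange(20)})

# If within-group order follows the original frame, each group's index
# (the original row positions) must be strictly increasing.
for key, sub_df in df.groupby('A'):
    assert sub_df.index.is_monotonic_increasing
print("within-group order preserved for this sample")
```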

Related

Create boolean dataframe showing existence of each element in a dictionary of lists

I have a dictionary of lists, and I have constructed a dataframe where the index is the dictionary keys and the columns are the set of possible values contained within the lists. The dataframe values indicate whether each column value exists in the corresponding list. What is the most efficient way to construct this? Below is how I do it now using for loops, but I am sure there is a more efficient way using either vectorization or concatenation.
import pandas as pd
data = {0:[1,2,3,4],1:[2,3,4],2:[3,4,5,6]}
cols = sorted(list(set([x for y in data.values() for x in y])))
df = pd.DataFrame(0,index=data.keys(),columns=cols)
for row in df.iterrows():
    for col in cols:
        if col in data[row[0]]:
            df.loc[row[0], col] = 1
        else:
            df.loc[row[0], col] = 0
print(df)
Output:
1 2 3 4 5 6
0 1 1 1 1 0 0
1 0 1 1 1 0 0
2 0 0 1 1 1 1
Use MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data.values()),
                  columns=mlb.classes_,
                  index=data.keys())
print (df)
1 2 3 4 5 6
0 1 1 1 1 0 0
1 0 1 1 1 0 0
2 0 0 1 1 1 1
Pure pandas, but a much slower solution, with str.get_dummies:
df = pd.Series(data).astype(str).str.strip('[]').str.get_dummies(', ')
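On newer pandas (0.25+, which adds Series.explode) there is also a tidy pure-pandas route, sketched here: explode each list to one row per element, one-hot encode, and collapse back to one row per key:

```python
import pandas as pd

data = {0: [1, 2, 3, 4], 1: [2, 3, 4], 2: [3, 4, 5, 6]}

# One row per (key, element), then indicator columns, then max() per key.
s = pd.Series(data).explode()
df = pd.get_dummies(s).groupby(level=0).max().astype(int)
print(df)
```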

pandas dataframe category codes from two columns

I have a pandas dataframe where two columns contain names of people. The columns are related, and the same name means the same person. I want to assign category codes that are valid across the whole "name" space.
For example my data frame is
df = pd.DataFrame({"P1":["a","b","c","a"], "P2":["b","c","d","c"]})
>>> df
P1 P2
0 a b
1 b c
2 c d
3 a c
I want it to be replaced by the corresponding category codes, such that
>>> df
P1 P2
0 1 2
1 2 3
2 3 4
3 1 3
The categories are in fact derived from the concatenated array ["a","b","c","d"] and applied to the individual columns separately. How can I achieve this?
Use:
print (df.stack().rank(method='dense').astype(int).unstack())
P1 P2
0 1 2
1 2 3
2 3 4
3 1 3
EDIT:
For a more general solution I used the approach from another answer, because of a problem with duplicates in the index:
df = pd.DataFrame({"P1":["a","b","c","a"],
                   "P2":["b","c","d","c"],
                   "A":[3,4,5,6]}, index=[2,2,3,3])
print (df)
A P1 P2
2 3 a b
2 4 b c
3 5 c d
3 6 a c
cols = ['P1','P2']
df[cols] = (pd.factorize(df[cols].values.ravel())[0]+1).reshape(-1, len(cols))
print (df)
A P1 P2
2 3 1 2
2 4 2 3
3 5 3 4
3 6 1 3
You can do
In [465]: pd.DataFrame((pd.factorize(df.values.ravel())[0]+1).reshape(df.shape),
                       columns=df.columns)
Out[465]:
P1 P2
0 1 2
1 2 3
2 3 4
3 1 3
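If you want the shared "name" space to be explicit, a pd.Categorical built from the union of both columns gives the same codes. A sketch (codes are offset by 1 to match the desired output):

```python
import pandas as pd

df = pd.DataFrame({"P1": ["a", "b", "c", "a"], "P2": ["b", "c", "d", "c"]})

# One category list shared by both columns, applied to each column separately.
cats = sorted(pd.unique(df[["P1", "P2"]].values.ravel()))
for col in ["P1", "P2"]:
    df[col] = pd.Categorical(df[col], categories=cats).codes + 1
print(df)
```

Unlike factorize, this makes the code-to-name mapping reproducible: the same `cats` list can be reused on new data.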

Python Pandas Data frame creation

I tried to create a data frame df using the below code :
import numpy as np
import pandas as pd
index = [0,1,2,3,4,5]
s = pd.Series([1,2,3,4,5,6],index= index)
t = pd.Series([2,4,6,8,10,12],index= index)
df = pd.DataFrame(s,columns = ["MUL1"])
df["MUL2"] =t
print df
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
While trying to create the same data frame using the syntax below, I get a weird output.
df = pd.DataFrame([s,t],columns = ["MUL1","MUL2"])
print df
MUL1 MUL2
0 NaN NaN
1 NaN NaN
Please explain why NaN is displayed in the dataframe when both Series are non-empty, and why only two rows are displayed and not the rest.
Also, please show the correct way to create the same data frame as above using the columns argument of the pandas DataFrame constructor.
One of the correct ways would be to stack the array data from the input list holding those series into columns -
In [161]: pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"])
Out[161]:
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
Behind the scenes, the stacking creates a 2D array, which is then converted to a dataframe. Here's what the stacked array looks like -
In [162]: np.c_[s,t]
Out[162]:
array([[ 1, 2],
[ 2, 4],
[ 3, 6],
[ 4, 8],
[ 5, 10],
[ 6, 12]])
If we remove the columns argument, we get:
df = pd.DataFrame([s,t])
print (df)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 2 4 6 8 10 12
Then define columns - if a column does not exist, we get a NaN column:
df = pd.DataFrame([s,t], columns=[0,'MUL2'])
print (df)
0 MUL2
0 1.0 NaN
1 2.0 NaN
Better is to use a dictionary:
df = pd.DataFrame({'MUL1':s,'MUL2':t})
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
And if you need to change the column order, add the columns parameter:
df = pd.DataFrame({'MUL1':s,'MUL2':t}, columns=['MUL2','MUL1'])
print (df)
MUL2 MUL1
0 2 1
1 4 2
2 6 3
3 8 4
4 10 5
5 12 6
More information is in dataframe documentation.
Another solution uses concat - the DataFrame constructor is not necessary:
df = pd.concat([s,t], axis=1, keys=['MUL1','MUL2'])
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
A pandas.DataFrame takes in the parameter data that can be of type ndarray, iterable, dict, or dataframe.
If you pass in a list it will assume each member is a row. Example:
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame([a, b], columns = ["Col1","Col2", "Col3"])
# output 1:
Col1 Col2 Col3
0 1 2 3
1 2 4 6
You are getting NaN because when you pass a list of Series, each Series becomes a row and its index labels (0 through 5) become the columns; the requested columns MUL1 and MUL2 don't exist among those labels, so they are filled with NaN, and you get two rows because you passed two Series.
To get the shape you want, first transpose the data:
data = np.array([a, b]).transpose()
How to create a pandas dataframe
import pandas as pd
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame(dict(Col1=a, Col2=b))
Output:
Col1 Col2
0 1 2
1 2 4
2 3 6
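Given the row-wise behaviour described above, another variant of the fix is to label the rows and transpose, so the series end up as columns. A sketch using the same s and t as the question:

```python
import pandas as pd

index = [0, 1, 2, 3, 4, 5]
s = pd.Series([1, 2, 3, 4, 5, 6], index=index)
t = pd.Series([2, 4, 6, 8, 10, 12], index=index)

# [s, t] builds two rows; naming and transposing turns them into columns.
df = pd.DataFrame([s, t], index=["MUL1", "MUL2"]).T
print(df)
```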

Pandas replacing values depending on prior row

I'm pretty new to pandas and would like your input on how to tackle my problem. I've got the following data frame:
df = pd.DataFrame({'A' : ["me","you","you","me","me","me","me"],
                   'B' : ["Y","X","X","X","X","X","Z"],
                   'C' : ["1","2","3","4","5","6","7"]
                   })
I need to transform it based on the row values in columns A and B. The logic is: as soon as the values in columns A and B are the same on consecutive rows, the first row in the sequence should be kept, but the following rows should have 'A' set in column B.
For example: the values in columns A and B are the same in rows 1 and 2, so the value in column B of row 2 should be replaced with 'A'. This is my expected output:
df2 = pd.DataFrame({'A' : ["me","you","you","me","me","me","me"],
                    'B' : ["Y","X","A","X","A","A","Z"],
                    'C' : ["1","2","3","4","5","6","7"]})
You can first concatenate columns A and B (string concatenation, since both hold strings):
a = df.A + df.B
Then compare with the shifted version:
print (a != a.shift())
0 True
1 True
2 False
3 True
4 False
5 False
6 True
dtype: bool
Create unique group ids by cumsum:
print ((a != a.shift()).cumsum())
0 1
1 2
2 2
3 3
4 3
5 3
6 4
dtype: int32
Get a boolean mask where values are duplicated:
print ((a != a.shift()).cumsum().duplicated())
0 False
1 False
2 True
3 False
4 True
5 True
6 False
dtype: bool
Two solutions for replacing the True values with 'A':
df.loc[(a != a.shift()).cumsum().duplicated(), 'B'] = 'A'
print (df)
A B C
0 me Y 1
1 you X 2
2 you A 3
3 me X 4
4 me A 5
5 me A 6
6 me Z 7
df.B = df.B.mask((a != a.shift()).cumsum().duplicated(), 'A')
print (df)
A B C
0 me Y 1
1 you X 2
2 you A 3
3 me X 4
4 me A 5
5 me A 6
6 me Z 7
print (df2.equals(df))
True
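An equivalent mask, sketched without the concatenation step: compare each column with its own shifted version directly. This sidesteps a theoretical pitfall of string concatenation, where 'ab' + 'c' and 'a' + 'bc' would collide.

```python
import pandas as pd

df = pd.DataFrame({'A': ["me", "you", "you", "me", "me", "me", "me"],
                   'B': ["Y", "X", "X", "X", "X", "X", "Z"],
                   'C': ["1", "2", "3", "4", "5", "6", "7"]})

# A row that repeats the previous row's (A, B) pair gets 'A' in column B.
mask = df.A.eq(df.A.shift()) & df.B.eq(df.B.shift())
df.loc[mask, 'B'] = 'A'
print(df)
```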

pandas dataframe update or set column[y] = x where column[z] = 'abc'

I'm new to python pandas and haven't found an answer to this in the documentation. I have an existing dataframe to which I've added a new column Y. I want to set the value of column Y to 'abc' in all rows in which column Z = 'xyz'. In SQL this would be a simple
update table set colY = 'abc' where colZ = 'xyz'
Is there a similar way to do this update in pandas?
Thanks!
You can use loc, or numpy.where if you also need to set another value:
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[1,2,3],
                   'Z':['xyz',5,6],
                   'C':[7,8,9]})
print (df)
C X Z
0 7 1 xyz
1 8 2 5
2 9 3 6
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
print (df)
C X Z Y
0 7 1 xyz abc
1 8 2 5 NaN
2 9 3 6 NaN
df['Y1'] = np.where(df.Z == 'xyz', 'abc', 'klm')
print (df)
C X Z Y Y1
0 7 1 xyz abc abc
1 8 2 5 NaN klm
2 9 3 6 NaN klm
You can use existing column values as the fallback too:
df['Y2'] = np.where(df.Z == 'xyz', 'abc', df.C)
print (df)
C X Z Y Y2
0 7 1 xyz abc abc
1 8 2 5 NaN 8
2 9 3 6 NaN 9
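Series.mask (and its mirror, Series.where) reads closest to SQL's CASE WHEN and is another way to express the Y2 column above, sketched here:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3],
                   'Z': ['xyz', 5, 6],
                   'C': [7, 8, 9]})

# Where the condition holds, use 'abc'; elsewhere keep the values from C.
df['Y2'] = df['C'].mask(df.Z == 'xyz', 'abc')
print(df)
```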