I have a dataframe whose index values are a mixture of strings and numbers separated by underscores, in this form:
sub_int1_ICA_int2
I would like to sort the index by int1 first and then by int2.
The expected output would be:
sub_1_ICA_1
sub_1_ICA_2
sub_1_ICA_3
...........
sub_2_ICA_1
sub_2_ICA_2
...........
I tried to use convert_numeric, as I saw in many posts, but I get an error:
X.convert_objects(convert_numeric=True).sort_values(['id'], ascending=[True], inplace=True)
>>(KeyError: 'id')
Any help would be nice!
Build a dictionary that maps each index label to a tuple of its two integers, sort the labels using that dictionary as the key, and reindex:
print (df)
a
sub_1_ICA_0 4
sub_1_ICA_1 8
sub_1_ICA_10 7
sub_1_ICA_11 3
sub_1_ICA_12 2
sub_1_ICA_2 6
sub_1_ICA_3 6
sub_2_ICA_1 1
sub_2_ICA_2 3
a = df.index.tolist()
b = {}
for x in a:
    i = x.split('_')
    b[x] = (int(i[1]), int(i[-1]))
print (b)
{'sub_1_ICA_10': (1, 10), 'sub_1_ICA_11': (1, 11),
'sub_1_ICA_1': (1, 1), 'sub_2_ICA_2': (2, 2),
'sub_1_ICA_0': (1, 0), 'sub_1_ICA_12': (1, 12),
'sub_1_ICA_3': (1, 3), 'sub_1_ICA_2': (1, 2),
'sub_2_ICA_1': (2, 1)}
c = sorted(a, key=lambda x: b[x])
print (c)
['sub_1_ICA_0', 'sub_1_ICA_1', 'sub_1_ICA_2', 'sub_1_ICA_3',
'sub_1_ICA_10', 'sub_1_ICA_11', 'sub_1_ICA_12', 'sub_2_ICA_1', 'sub_2_ICA_2']
df = df.reindex(c)
print (df)
a
sub_1_ICA_0 4
sub_1_ICA_1 8
sub_1_ICA_2 6
sub_1_ICA_3 6
sub_1_ICA_10 7
sub_1_ICA_11 3
sub_1_ICA_12 2
sub_2_ICA_1 1
sub_2_ICA_2 3
Another pure pandas solution:
#create MultiIndex by split index, convert to DataFrame
df1 = df.index.str.split('_', expand=True).to_frame()
#set columns and index to original df
df1.columns = list('abcd')
df1.index = df.index
#convert columns to int and sort
df1[['b','d']] = df1[['b','d']].astype(int)
df1 = df1.sort_values(['b','d'])
print (df1)
a b c d
sub_1_ICA_0 sub 1 ICA 0
sub_1_ICA_1 sub 1 ICA 1
sub_1_ICA_2 sub 1 ICA 2
sub_1_ICA_3 sub 1 ICA 3
sub_1_ICA_10 sub 1 ICA 10
sub_1_ICA_11 sub 1 ICA 11
sub_1_ICA_12 sub 1 ICA 12
sub_2_ICA_1 sub 2 ICA 1
sub_2_ICA_2 sub 2 ICA 2
df = df.reindex(df1.index)
print (df)
a
sub_1_ICA_0 4
sub_1_ICA_1 8
sub_1_ICA_2 6
sub_1_ICA_3 6
sub_1_ICA_10 7
sub_1_ICA_11 3
sub_1_ICA_12 2
sub_2_ICA_1 1
sub_2_ICA_2 3
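The split-and-convert idea above can also be written with a regular expression. This is only a minimal sketch, assuming every label matches the pattern sub_<int>_ICA_<int> and the index has no duplicates; the group names sub and ica are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'a': [4, 8, 7, 6, 1]},
    index=['sub_1_ICA_0', 'sub_1_ICA_1', 'sub_1_ICA_10',
           'sub_1_ICA_2', 'sub_2_ICA_1'],
)

# Pull the two integers out of each label; named groups become column names.
parts = (df.index.to_series()
           .str.extract(r'^sub_(?P<sub>\d+)_ICA_(?P<ica>\d+)$')
           .astype(int))

# Sort the helper frame numerically, then reorder df by the resulting labels.
result = df.reindex(parts.sort_values(['sub', 'ica']).index)
print(result)
```

The helper frame keeps the original labels as its index, so the sorted index can be passed straight to reindex.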
And last version with natsort:
from natsort import natsorted
df = df.reindex(natsorted(df.index))
print (df)
a
sub_1_ICA_0 4
sub_1_ICA_1 8
sub_1_ICA_2 6
sub_1_ICA_3 6
sub_1_ICA_10 7
sub_1_ICA_11 3
sub_1_ICA_12 2
sub_2_ICA_1 1
sub_2_ICA_2 3
EDIT:
If the index contains duplicate values, reindex fails, so split the index into columns, convert the numeric parts to int, sort, and join the parts back together:
print (df)
a
sub_1_ICA_0 4
sub_1_ICA_0 4
sub_1_ICA_1 8
sub_1_ICA_10 7
sub_1_ICA_11 3
sub_1_ICA_12 2
sub_1_ICA_2 6
sub_1_ICA_3 6
sub_2_ICA_1 1
sub_2_ICA_2 3
df.index = df.index.str.split('_', expand=True)
df = df.reset_index()
df[['level_1','level_3']] = df[['level_1','level_3']].astype(int)
df = df.sort_values(['level_1','level_3']).astype(str)
df = df.set_index(['level_0','level_1','level_2','level_3'])
df.index = df.index.map('_'.join)
print (df)
a
sub_1_ICA_0 4
sub_1_ICA_0 4
sub_1_ICA_1 8
sub_1_ICA_2 6
sub_1_ICA_3 6
sub_1_ICA_10 7
sub_1_ICA_11 3
sub_1_ICA_12 2
sub_2_ICA_1 1
sub_2_ICA_2 3
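Another way to cope with duplicate labels is to sort row positions instead of labels and select with iloc. The following is a sketch under the same label-format assumption (integers in the second and last underscore-separated fields); the helper name numeric_key is hypothetical.

```python
import pandas as pd

def numeric_key(label):
    """Return (int1, int2) for a label shaped like sub_int1_ICA_int2."""
    parts = label.split('_')
    return int(parts[1]), int(parts[-1])

df = pd.DataFrame(
    {'a': [4, 4, 8, 7, 6, 1]},
    index=['sub_1_ICA_0', 'sub_1_ICA_0', 'sub_1_ICA_1',
           'sub_1_ICA_10', 'sub_1_ICA_2', 'sub_2_ICA_1'],
)

# Sort positions 0..n-1 by the numeric key of the label at each position.
# sorted() is stable, so duplicate labels keep their original relative order.
order = sorted(range(len(df)), key=lambda pos: numeric_key(df.index[pos]))

# Positional selection does not care whether labels repeat.
result = df.iloc[order]
print(result)
```

Because nothing is looked up by label, this does not need the index to be unique.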
I have a pandas dataframe in which two columns hold names of people. The columns are related: the same name means the same person. I want to assign category codes that are valid across the whole "name" space.
For example, my data frame is
df = pd.DataFrame({"P1":["a","b","c","a"], "P2":["b","c","d","c"]})
>>> df
P1 P2
0 a b
1 b c
2 c d
3 a c
I want it to be replaced by the corresponding category codes, such that
>>> df
P1 P2
0 1 2
1 2 3
2 3 4
3 1 3
The categories are in fact derived from the concatenated array ["a","b","c","d"] and applied to the individual columns separately. How can I achieve this?
Use:
print (df.stack().rank(method='dense').astype(int).unstack())
P1 P2
0 1 2
1 2 3
2 3 4
3 1 3
EDIT:
For a more general solution I used the approach from another answer, because the stack/unstack approach fails when the index contains duplicates:
df = pd.DataFrame({"P1":["a","b","c","a"],
"P2":["b","c","d","c"],
"A":[3,4,5,6]}, index=[2,2,3,3])
print (df)
A P1 P2
2 3 a b
2 4 b c
3 5 c d
3 6 a c
cols = ['P1','P2']
df[cols] = (pd.factorize(df[cols].values.ravel())[0]+1).reshape(-1, len(cols))
print (df)
A P1 P2
2 3 1 2
2 4 2 3
3 5 3 4
3 6 1 3
You can do
In [465]: pd.DataFrame((pd.factorize(df.values.ravel())[0]+1).reshape(df.shape),
columns=df.columns)
Out[465]:
P1 P2
0 1 2
1 2 3
2 3 4
3 1 3
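One detail worth checking with the factorize approach is how codes are assigned: by default they follow order of first appearance, not sorted order. The sketch below uses a made-up frame where the two differ, and passes sort=True to get codes that follow the sorted unique values.

```python
import pandas as pd

df = pd.DataFrame({"P1": ["b", "a", "c", "b"],
                   "P2": ["a", "c", "d", "c"]})

# Flatten row by row so both columns share one code space.
flat = df.to_numpy().ravel()

# Default: codes follow order of first appearance (b=1, a=2, c=3, d=4).
appearance = pd.factorize(flat)[0] + 1

# sort=True: codes follow the sorted unique values (a=1, b=2, c=3, d=4).
alphabetical = pd.factorize(flat, sort=True)[0] + 1

by_appearance = pd.DataFrame(appearance.reshape(df.shape),
                             columns=df.columns, index=df.index)
by_sorted = pd.DataFrame(alphabetical.reshape(df.shape),
                         columns=df.columns, index=df.index)
print(by_appearance)
print(by_sorted)
```

In the question's data the names already appear in alphabetical order, which is why both variants give the same codes there.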
I tried to create a data frame df using the code below:
import numpy as np
import pandas as pd
index = [0,1,2,3,4,5]
s = pd.Series([1,2,3,4,5,6],index= index)
t = pd.Series([2,4,6,8,10,12],index= index)
df = pd.DataFrame(s,columns = ["MUL1"])
df["MUL2"] =t
print(df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
While trying to create the same data frame using the syntax below, I am getting a weird output.
df = pd.DataFrame([s,t],columns = ["MUL1","MUL2"])
print(df)
MUL1 MUL2
0 NaN NaN
1 NaN NaN
Please explain why NaN is displayed in the dataframe when both Series are non-empty, and why only two rows are displayed and not the rest.
Also, please provide the correct way to create the same data frame as above using the columns argument of the pandas DataFrame constructor.
One of the correct ways would be to stack the array data from the input list holding those series into columns -
In [161]: pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"])
Out[161]:
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
Behind the scenes, the stacking creates a 2D array, which is then converted to a dataframe. Here's what the stacked array looks like -
In [162]: np.c_[s,t]
Out[162]:
array([[ 1, 2],
[ 2, 4],
[ 3, 6],
[ 4, 8],
[ 5, 10],
[ 6, 12]])
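A caveat with stacking into a plain array is that the Series index is dropped. The sketch below uses np.column_stack, which behaves like np.c_ for 1-D inputs, with made-up index labels 10, 20, 30 to make the effect visible.

```python
import numpy as np
import pandas as pd

labels = [10, 20, 30]
s = pd.Series([1, 2, 3], index=labels)
t = pd.Series([2, 4, 6], index=labels)

# A plain 2D array: one column per Series, no index labels attached.
values = np.column_stack([s.to_numpy(), t.to_numpy()])

# Without index=, the frame gets a fresh 0..n-1 index.
stacked = pd.DataFrame(values, columns=['MUL1', 'MUL2'])

# Pass the original index explicitly to keep the labels.
labelled = pd.DataFrame(values, columns=['MUL1', 'MUL2'], index=s.index)
print(stacked)
print(labelled)
```

With the default 0..5 index from the question the two results look identical, so this only matters when the Series carry meaningful labels.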
If you remove the columns argument, you get:
df = pd.DataFrame([s,t])
print (df)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 2 4 6 8 10 12
If you then pass columns, any requested column that does not exist is filled with NaN:
df = pd.DataFrame([s,t], columns=[0,'MUL2'])
print (df)
0 MUL2
0 1.0 NaN
1 2.0 NaN
A better way is to use a dictionary:
df = pd.DataFrame({'MUL1':s,'MUL2':t})
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
If you need to change the column order, add the columns parameter:
df = pd.DataFrame({'MUL1':s,'MUL2':t}, columns=['MUL2','MUL1'])
print (df)
MUL2 MUL1
0 2 1
1 4 2
2 6 3
3 8 4
4 10 5
5 12 6
More information is in the DataFrame documentation.
Another solution uses concat, so the DataFrame constructor is not necessary:
df = pd.concat([s,t], axis=1, keys=['MUL1','MUL2'])
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
A pandas.DataFrame takes in the parameter data that can be of type ndarray, iterable, dict, or dataframe.
If you pass in a list it will assume each member is a row. Example:
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame([a, b], columns = ["Col1","Col2", "Col3"])
# output 1:
Col1 Col2 Col3
0 1 2 3
1 2 4 6
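You are getting NaN because each Series in the list becomes a row, and its index labels [0,1,2,3,4,5] become the column labels. The requested columns MUL1 and MUL2 match none of them, so they are filled with NaN, and there are only two rows because the list has two members.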
To get the shape you want, first transpose the data:
data = np.array([a, b]).transpose()
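The row-versus-column behaviour can be checked directly on the Series from the question. This is a small sketch; the variable names rows, missing and wanted are chosen here for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])
t = pd.Series([2, 4, 6, 8, 10, 12])

# Each Series becomes one row; its index labels 0..5 become the columns.
rows = pd.DataFrame([s, t])
print(rows.shape)

# Asking for columns that do not exist among 0..5 yields only NaN.
missing = pd.DataFrame([s, t], columns=['MUL1', 'MUL2'])
print(missing)

# Name the two rows, then transpose so they become columns.
rows.index = ['MUL1', 'MUL2']
wanted = rows.T
print(wanted)
```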
How to create a pandas dataframe
import pandas as pd
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame(dict(Col1=a, Col2=b))
Output:
Col1 Col2
0 1 2
1 2 4
2 3 6
Image with the csv file with the two columns
You can use:
df.drop_duplicates('Salesperson_1')
Or maybe need:
df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
Sample:
df = pd.DataFrame({'Salesperson_1':['a','a','b'],
'Salesperson_1_ID':[4,5,6]})
print (df)
Salesperson_1 Salesperson_1_ID
0 a 4
1 a 5
2 b 6
df1 = df.drop_duplicates('Salesperson_1')
print (df1)
Salesperson_1 Salesperson_1_ID
0 a 4
2 b 6
df.Salesperson_1_ID = df.groupby('Salesperson_1')['Salesperson_1_ID'].transform('first')
print (df)
Salesperson_1 Salesperson_1_ID
0 a 4
1 a 4
2 b 6
Pandas.groupby.first()
If your DataFrame is called df, you could just do this:
df.groupby('Salesperson_1_ID').first()
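If the goal is one ID per salesperson, the first ID seen for each name can also be used as a lookup table. This sketch assumes that goal and reuses the sample column names from above; canonical_id is a hypothetical column name.

```python
import pandas as pd

df = pd.DataFrame({'Salesperson_1': ['a', 'a', 'b'],
                   'Salesperson_1_ID': [4, 5, 6]})

# First ID seen for each name, as a Series indexed by name.
first_ids = df.groupby('Salesperson_1')['Salesperson_1_ID'].first()

# Map every row's name to that ID; the original column is left untouched.
df['canonical_id'] = df['Salesperson_1'].map(first_ids)
print(df)
```

This gives the same values as transform('first') but keeps the lookup Series around for reuse.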
I'm new to Python pandas and haven't found an answer to this in the documentation. I have an existing dataframe and I've added a new column Y. I want to set the value of column Y to 'abc' in all rows in which column Z = 'xyz'. In SQL this would be a simple
update table set colY = 'abc' where colZ = 'xyz'
Is there a similar way to do this update in pandas?
Thanks!
You can use loc, or numpy.where if you also need to set a value for the other rows:
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[1,2,3],
'Z':['xyz',5,6],
'C':[7,8,9]})
print (df)
C X Z
0 7 1 xyz
1 8 2 5
2 9 3 6
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
print (df)
C X Z Y
0 7 1 xyz abc
1 8 2 5 NaN
2 9 3 6 NaN
df['Y1'] = np.where(df.Z == 'xyz', 'abc', 'klm')
print (df)
C X Z Y Y1
0 7 1 xyz abc abc
1 8 2 5 NaN klm
2 9 3 6 NaN klm
You can also take the other values from an existing column:
df['Y2'] = np.where(df.Z == 'xyz', 'abc', df.C)
print (df)
C X Z Y Y1 Y2
0 7 1 xyz abc abc abc
1 8 2 5 NaN klm 8
2 9 3 6 NaN klm 9
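The same loc pattern extends to a WHERE clause with several conditions. The sketch below is an illustration only: the extra condition on X and the starting value 'old' are assumptions, not part of the question.

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3],
                   'Z': ['xyz', 'xyz', 'other'],
                   'Y': ['old', 'old', 'old']})

# Rough equivalent of:
#   UPDATE table SET Y = 'abc' WHERE Z = 'xyz' AND X > 1
# Each comparison needs its own parentheses because & binds tighter than ==.
condition = (df['Z'] == 'xyz') & (df['X'] > 1)
df.loc[condition, 'Y'] = 'abc'
print(df)
```

Rows that do not match keep their existing value, as they would in SQL.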
I am trying to partially join two dataframes:
import datetime
import pandas
import numpy
# pandas.datetime is no longer available in recent pandas versions
entry1 = datetime.datetime(2014, 6, 1)
entry2 = datetime.datetime(2014, 6, 2)
df1=pandas.DataFrame(numpy.array([[1,1],[2,2],[3,3],[3,3]]), columns=['zick','zack'], index=[entry1, entry1, entry2, entry2])
df2=pandas.DataFrame(numpy.array([[2,3],[3,3]]), columns=['eins','zwei'], index=[entry1, entry2])
I tried
df1 = df1[(df1['zick']>= 2) & (df1['zick'] < 4)].join(df2['eins'])
but this doesn't work. After joining, the values of df1['eins'] are expected to be [NaN,2,3,3].
How can I do it? I'd like to do it in place, without copying the dataframes.
I think this is what you actually meant to use:
df1 = df1.join(df2['eins'])
mask = (df1['zick']>= 2) & (df1['zick'] < 4)
df1.loc[~mask, 'eins'] = np.nan
df1
yielding:
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
The issue was that you were joining the filtered dataframe rather than the original one, so there was no place for NaN to appear (every remaining row satisfied your filter).
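The same result can be reached without join by looking up df2 for every row label of df1 and blanking the rows that fail the filter. This is a sketch, not the only way to do it; the names looked_up and mask are chosen here, and the frames are rebuilt so the example runs on its own.

```python
import numpy as np
import pandas as pd

idx1 = pd.to_datetime(['2014-06-01', '2014-06-01', '2014-06-02', '2014-06-02'])
idx2 = pd.to_datetime(['2014-06-01', '2014-06-02'])
df1 = pd.DataFrame({'zick': [1, 2, 3, 3], 'zack': [1, 2, 3, 3]}, index=idx1)
df2 = pd.DataFrame({'eins': [2, 3], 'zwei': [3, 3]}, index=idx2)

# Plain boolean array, so the repeated dates in df1's index cause no alignment issues.
mask = ((df1['zick'] >= 2) & (df1['zick'] < 4)).to_numpy()

# df2's index is unique, so it can be reindexed to df1's repeated labels.
looked_up = df2['eins'].reindex(df1.index).to_numpy()

# Keep the looked-up value where the filter holds, NaN elsewhere.
df1['eins'] = np.where(mask, looked_up, np.nan)
print(df1)
```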
EDIT:
Considering new inputs in the comments below, here is another approach.
Create an empty column that will need to be updated with values from second dataframe:
df1['eins'] = np.nan
print(df1)
print(df2)
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 NaN
2014-06-02 3 3 NaN
2014-06-02 3 3 NaN
eins zwei
2014-06-01 2 3
2014-06-02 3 3
Define the filter, then set the still-empty cells of the column to be updated to 0 wherever the filter holds; 0 acts as a marker for "update this cell":
mask = (df1['zick']>= 2) & (df1['zick'] < 4)
df1.loc[(mask & (df1['eins'].isnull())), 'eins'] = 0
print(df1)
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 0
2014-06-02 3 3 0
2014-06-02 3 3 0
Update df1 in place with the values from df2 (only cells equal to 0 are updated):
df1.update(df2, filter_func=lambda x: x == 0)
print(df1)
zick zack eins
2014-06-01 1 1 NaN
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
Now, if you change the filter and run the update again, previously updated values are left unchanged:
mask = (df1['zick']>= 1) & (df1['zick'] == 1)
df1.loc[(mask & (df1['eins'].isnull())), 'eins'] = 0
print(df1)
zick zack eins
2014-06-01 1 1 0
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
df1.update(df2, filter_func=lambda x: x == 0)
print(df1)
zick zack eins
2014-06-01 1 1 2
2014-06-01 2 2 2
2014-06-02 3 3 3
2014-06-02 3 3 3
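The repeated filter-then-fill passes above can be wrapped in a small helper that avoids the 0 marker by testing for NaN directly. This is a sketch under the assumption that NaN means "not yet filled"; fill_missing is a hypothetical name, not a pandas function.

```python
import numpy as np
import pandas as pd

def fill_missing(target, source, column, mask):
    """Fill NaN cells of target[column] from source[column] where mask is True."""
    # Look up the source value for every row label of target (labels may repeat).
    looked_up = source[column].reindex(target.index).to_numpy()
    # Only touch rows that pass the filter and are still empty.
    todo = np.asarray(mask, dtype=bool) & target[column].isna().to_numpy()
    target.loc[todo, column] = looked_up[todo]

idx1 = pd.to_datetime(['2014-06-01', '2014-06-01', '2014-06-02', '2014-06-02'])
idx2 = pd.to_datetime(['2014-06-01', '2014-06-02'])
df1 = pd.DataFrame({'zick': [1, 2, 3, 3], 'zack': [1, 2, 3, 3]}, index=idx1)
df2 = pd.DataFrame({'eins': [2, 3], 'zwei': [3, 3]}, index=idx2)
df1['eins'] = np.nan

# First pass: rows with 2 <= zick < 4.
fill_missing(df1, df2, 'eins', (df1['zick'] >= 2) & (df1['zick'] < 4))
after_first = df1['eins'].tolist()

# Second pass with a different filter; cells filled earlier are left alone.
fill_missing(df1, df2, 'eins', df1['zick'] == 1)
after_second = df1['eins'].tolist()
print(df1)
```

A genuine 0 in the data would be indistinguishable from the marker in the update-based version; testing for NaN sidesteps that.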