I'm practicing .apply() in pandas, but something goes wrong when I use Series.mean() inside the applied function.
Here is my code:
In[1]: import pandas as pd

       column = ['UserInfo_2', 'UserInfo_4', 'info_1', 'info_2', 'info_3', 'target']
       value = [['a', 'b', 'a', 'c', 'b', 'a'],
                ['a', 'c', 'b', 'c', 'b', 'b'],
                range(0, 11, 2),
                range(1, 12, 2),
                range(15, 21),
                [0, 0, 1, 0, 1, 0]]
       master_train = pd.DataFrame(dict(zip(column, value)))
In[2]: def f(group):
           return pd.DataFrame({'original': group, 'demand': group - group.mean()})
In[3]: master_train.groupby('UserInfo_2')['info_1'].apply(f)
Out[3]:
demand original
0 -4.666667 0
1 -3.000000 2
2 -0.666667 4
3 0.000000 6
4 3.000000 8
5 5.333333 10
I am confused because the mean of info_1 is actually 5, but from the result above, the implied mean (original minus demand) varies from 4.666667 to 6 rather than being 5 everywhere.
What's wrong??
I think it is now clear: you compute the mean of column info_1 (the original values) per group from column UserInfo_2:
def f(group):
    return pd.DataFrame({'original': group,
                         'groups': group.name,
                         'demand': group - group.mean(),
                         'mean': group.mean()})

print (master_train.groupby('UserInfo_2')['info_1'].apply(f))
demand groups mean original
0 -4.666667 a 4.666667 0
1 -3.000000 b 5.000000 2
2 -0.666667 a 4.666667 4
3 0.000000 c 6.000000 6
4 3.000000 b 5.000000 8
5 5.333333 a 4.666667 10
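As an aside (my addition, not part of the original answer): if the per-group demand column is all you need, groupby().transform avoids apply entirely by broadcasting each group's mean back onto the original index. A minimal sketch:

# Sketch: per-group demeaning without apply; transform('mean') returns
# a Series aligned with master_train's original index.
group_mean = master_train.groupby('UserInfo_2')['info_1'].transform('mean')
master_train['demand'] = master_train['info_1'] - group_mean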
But I think you want the mean of the whole column info_1:
def f(group):
    return pd.DataFrame({'original': group,
                         'demand': group - master_train['info_1'].mean(),
                         'mean': master_train['info_1'].mean()})

print (master_train.groupby('UserInfo_2')['info_1'].apply(f))
demand mean original
0 -5.0 5.0 0
1 -3.0 5.0 2
2 -1.0 5.0 4
3 1.0 5.0 6
4 3.0 5.0 8
5 5.0 5.0 10
EDIT:
For testing, you can add print(group) to function f; it shows that f receives a Series from column info_1 for each group from column UserInfo_2:
def f(group):
    print (group)
    return pd.DataFrame({'original': group,
                         'groups': group.name,
                         'demand': group - group.mean(),
                         'mean': group.mean()})

print (master_train.groupby('UserInfo_2')['info_1'].apply(f))
0 0
2 4
5 10
Name: a, dtype: int32
1 2
4 8
Name: b, dtype: int32
3 6
Name: c, dtype: int32
demand groups mean original
0 -4.666667 a 4.666667 0
1 -3.000000 b 5.000000 2
2 -0.666667 a 4.666667 4
3 0.000000 c 6.000000 6
4 3.000000 b 5.000000 8
5 5.333333 a 4.666667 10
And if you need the mean of the whole column info_1:
print (master_train['info_1'])
0 0
1 2
2 4
3 6
4 8
5 10
Name: info_1, dtype: int32
print (master_train['info_1'].mean())
5.0
Related
I want to add the values of two dataframes that have the same format.
For example:
>>> my_dataframe1
class1 score
subject 1 2 3
student
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
>>> my_dataframe2
class2 score
subject 1 2 3
student
0 4 2 2
1 4 4 14
2 8 7 7
3 1 2 NaN
4 NaN 2 3
As you can see, the two dataframes have multi-level columns: the top level is the 'class score' label and the sub-level is 'subject'.
What I want is the summed dataframe, which should look like this:
score
subject 1 2 3
student
0 5 4 7
1 6 7 23
2 16 14 9
3 4 6 7
4 6 9 10
Actually, I could get this dataframe with:

for i in my_dataframe1['class1 score'].index:
    my_dataframe1['class1 score'].loc[i, :] = my_dataframe1['class1 score'].loc[i, :].add(
        my_dataframe2['class2 score'].loc[i, :], fill_value=0)

but when the dimensions increase it takes a tremendous amount of time, and I don't think it is a good way to solve the problem.
If you add the values (the underlying NumPy array) of the second dataframe, pandas skips column alignment, so the differing top-level labels don't get in the way:

# .values bypasses label alignment; astype(int) restores integer dtype
# after the NaNs (handled by fill_value=0) promote the result to float.
my_dataframe1.add(my_dataframe2.values, fill_value=0).astype(int)
class1 score
subject 1 2 3
student
0 5 4 7
1 6 7 23
2 16 14 9
3 4 6 7
4 6 9 10
Setup
import numpy as np
import pandas as pd

my_dataframe1 = pd.DataFrame([
    [1, 2, 5],
    [2, 3, 9],
    [8, 7, 2],
    [3, 4, 7],
    [6, 7, 7]
], index=pd.RangeIndex(5, name='student'),
   columns=pd.MultiIndex.from_product([['class1 score'], [1, 2, 3]],
                                      names=[None, 'subject']))

my_dataframe2 = pd.DataFrame([
    [4, 2, 2],
    [4, 4, 14],
    [8, 7, 7],
    [1, 2, np.nan],
    [np.nan, 2, 3]
], index=pd.RangeIndex(5, name='student'),
   columns=pd.MultiIndex.from_product([['class2 score'], [1, 2, 3]],
                                      names=[None, 'subject']))
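For context (my note, using the setup above): calling add directly on the two frames aligns on the full column labels, and because the top levels differ ('class1 score' vs 'class2 score') nothing is actually summed, which is why .values is passed instead:

# Without .values, alignment keeps the two top-level labels apart, so
# the result has six untouched columns instead of three sums.
misaligned = my_dataframe1.add(my_dataframe2, fill_value=0)
print(misaligned.columns.get_level_values(0).unique())
# Index(['class1 score', 'class2 score'], dtype='object')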
IIUC:
df_out = my_dataframe1['class1 score'].add(my_dataframe2['class2 score'],
                                           fill_value=0).add_prefix('scores_')
df_out.columns = df_out.columns.str.split('_', expand=True)
df_out
Output:
scores
1 2 3
student
0 5.0 4 7.0
1 6.0 7 23.0
2 16.0 14 9.0
3 4.0 6 7.0
4 6.0 9 10.0
The way I would approach this is to keep the data in the same dataframe. You could concatenate the two you have already:
big_df = pd.concat([my_dataframe1, my_dataframe2], axis=1)
Then sum over the larger dataframe, specifying level:
big_df.sum(axis=1, level='subject')
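On newer pandas versions, where DataFrame.sum no longer accepts level=, the same reduction can be written with a column-wise groupby (my note; a sketch under that version assumption):

# Group the columns by their 'subject' level and sum within each group.
summed = big_df.groupby(level='subject', axis=1).sum()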
I have a 3 dimensional numpy array, (z, x, y). z is a time dimension and x and y are coordinates.
I want to convert this to a multiindexed pandas.DataFrame. I want the row index to be the z dimension
and each column to have values from a unique x, y coordinate (and so, each column would be multi-indexed).
The simplest case (not multi-indexed):
>>> array.shape
(500L, 120L, 100L)
>>> df = pd.DataFrame(array[:,0,0])
>>> df.shape
(500, 1)
I've been trying to pass the whole array into a multiindexed dataframe using pd.MultiIndex.from_arrays, but I'm getting an error:
NotImplementedError: > 1 ndim Categorical are not supported at this time
It looks like it should be fairly simple, but I can't figure it out.
I find that a Series with a MultiIndex is the most analogous pandas datatype for a numpy array with arbitrarily many dimensions (presumably 3 or more).
Here is some example code:
import pandas as pd
import numpy as np
time_vals = np.linspace(1, 50, 50)
x_vals = np.linspace(-5, 6, 12)
y_vals = np.linspace(-4, 5, 10)
measurements = np.random.rand(50,12,10)
#setup multiindex
mi = pd.MultiIndex.from_product([time_vals, x_vals, y_vals], names=['time', 'x', 'y'])
#connect multiindex to data and save as multiindexed Series
sr_multi = pd.Series(index=mi, data=measurements.flatten())
#pull out a dataframe of x, y at time=22
sr_multi.xs(22, level='time').unstack(level=0)
#pull out a dataframe of y, time at x=3
sr_multi.xs(3, level='x').unstack(level=1)
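And to get exactly the layout the question asks for, rows indexed by time with an (x, y) MultiIndex on the columns, you can unstack the two inner levels (my addition to the example above):

# Sketch: wide frame, time down the rows, (x, y) pairs across the columns.
df_wide = sr_multi.unstack(level=['x', 'y'])
print(df_wide.shape)  # (50, 120) for the sizes above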
I think you can use Panel, and then call to_frame to get a MultiIndex DataFrame:
np.random.seed(10)
arr = np.random.randint(10, size=(5,3,2))
print (arr)
[[[9 4]
[0 1]
[9 0]]
[[1 8]
[9 0]
[8 6]]
[[4 3]
[0 4]
[6 8]]
[[1 8]
[4 1]
[3 6]]
[[5 3]
[9 6]
[9 1]]]
df = pd.Panel(arr).to_frame()
print (df)
0 1 2 3 4
major minor
0 0 9 1 4 1 5
1 4 8 3 8 3
1 0 0 9 0 4 9
1 1 0 4 1 6
2 0 9 8 6 3 9
1 0 6 8 6 1
Also transpose can be useful:
df = pd.Panel(arr).transpose(1,2,0).to_frame()
print (df)
0 1 2
major minor
0 0 9 0 9
1 1 9 8
2 4 0 6
3 1 4 3
4 5 9 9
1 0 4 1 0
1 8 0 6
2 3 4 8
3 8 1 6
4 3 6 1
Another possible solution with concat:
arr = arr.transpose(1,2,0)
df = pd.concat([pd.DataFrame(x) for x in arr], keys=np.arange(arr.shape[2]))
print (df)
0 1 2 3 4
0 0 9 1 4 1 5
1 4 8 3 8 3
1 0 0 9 0 4 9
1 1 0 4 1 6
2 0 9 8 6 3 9
1 0 6 8 6 1
np.random.seed(10)
arr = np.random.randint(10, size=(500,120,100))
df = pd.Panel(arr).transpose(2,0,1).to_frame()
print (df.shape)
(60000, 100)
print (df.index.max())
(499, 119)
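Note that Panel was removed in pandas 1.0; on current versions the same frame can be rebuilt with a reshape and MultiIndex.from_product (a sketch under that assumption, my addition):

# Equivalent of pd.Panel(arr).transpose(2, 0, 1).to_frame() for the
# (500, 120, 100) array above: move the last axis to the front, then
# flatten the other two axes into a (major, minor) MultiIndex.
t = arr.transpose(2, 0, 1)
idx = pd.MultiIndex.from_product([range(t.shape[1]), range(t.shape[2])],
                                 names=['major', 'minor'])
df = pd.DataFrame(t.reshape(t.shape[0], -1).T, index=idx,
                  columns=range(t.shape[0]))
print(df.shape)  # (60000, 100)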
I have a pandas dataframe which has lists as values. I would like to transform it into the format shown under "Expected result" below. The dataframe is large (1 million rows).
import pandas as pd
import numpy as np
df = pd.DataFrame(
    [[['A', 'Second'], [], 'N/A', [6]],
     [[2, 3], [3, 4, 6], [3, 4, 5, 7], [2, 6, 3, 4]]],
    columns=list('ABCD')
)
df.replace('N/A', np.NaN, inplace=True)
df
df
A B C D
0 [A,Second] [] NaN [6]
1 [2,3] [3,4,6] [3,4,5,7] [2,6,3,4]
Expected result
0 A A
0 A Second
0 D 6
1 A 2
1 A 3
1 B 3
1 B 4
1 B 6
1 C 3
1 C 4
1 C 5
1 C 7
1 D 2
1 D 6
1 D 3
1 D 4
You can use a double stack:

df1 = df.stack()
df = (pd.DataFrame(df1.values.tolist(), index=df1.index).stack()
        .reset_index(level=2, drop=True).reset_index())
df.columns = list('abc')
print (df)
a b c
0 0 A A
1 0 A Second
2 0 D 6
3 1 A 2
4 1 A 3
5 1 B 3
6 1 B 4
7 1 B 6
8 1 C 3
9 1 C 4
10 1 C 5
11 1 C 7
12 1 D 2
13 1 D 6
14 1 D 3
15 1 D 4
Or in one chain, with apply(pd.Series) doing the list-to-columns expansion:

df.stack().apply(pd.Series).stack().reset_index(2, True) \
  .rename_axis(['a', 'b']).reset_index(name='c')
a b c
0 0 A A
1 0 A Second
2 0 D 6
3 1 A 2
4 1 A 3
5 1 B 3
6 1 B 4
7 1 B 6
8 1 C 3
9 1 C 4
10 1 C 5
11 1 C 7
12 1 D 2
13 1 D 6
14 1 D 3
15 1 D 4
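On pandas 1.1 or later (my addition; melt's ignore_index and explode are assumptions about your version), the same flattening can be written without stacking twice:

# Sketch: melt keeps the original row labels, explode unpacks the lists,
# and dropna removes the empty-list/NaN leftovers.
out = (df.melt(ignore_index=False, var_name='b', value_name='c')
         .explode('c')
         .dropna(subset=['c'])
         .rename_axis('a')
         .reset_index()
         .sort_values(['a', 'b'], ignore_index=True))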
I have a dataframe with multiindexed columns. I want to select on the first level based on the column name, and then return all columns but the last one, and assign a new value to all these elements.
Here's a sample dataframe:
In [1]: mydf = pd.DataFrame(np.random.random_integers(low=1, high=5, size=(4, 9)),
                            columns=pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']]))
Out[1]:
A B C
a b c a b c a b c
0 4 1 2 1 4 2 1 1 3
1 4 4 1 2 3 4 2 2 3
2 2 3 4 1 2 1 3 2 3
3 1 3 4 2 3 4 1 5 1
I want to be able to assign to these elements, for example:
In [2]: mydf.loc[:,('A')].iloc[:,:-1]
Out[2]:
A
a b
0 4 1
1 4 4
2 2 3
3 1 3
If I wanted to modify one column only, I know how to select it properly with a tuple so that the assigning works:
In [3]: mydf.loc[:,('A','a')] = 0
In [4]: mydf.loc[:,('A','a')]
Out[4]:
0 0
1 0
2 0
3 0
Name: (A, a), dtype: int32
So that worked well.
Now the following doesn't work...
In [5]: mydf.loc[:,('A')].ix[:,:-1] = 6 - mydf.loc[:,('A')].ix[:,:-1]
In [6]: mydf.loc[:,('A')].iloc[:,:-1] = 6 - mydf.loc[:,('A')].iloc[:,:-1]
Sometimes I will, and sometimes I won't, get the warning that a value is trying to be set on a copy of a slice from a DataFrame, but in both cases nothing is actually assigned.
I've pretty much tried everything I can think of, and I still can't figure out how to mix label and integer indexing in order to set the values correctly.
Any idea please?
Versions:
Python 2.7.9
Pandas 0.16.1
This is not directly supported, as .loc MUST have labels and NOT positions. In theory .ix could support this with multi-index slicers, but there are the usual complications of figuring out what is 'meant' by the user (e.g. is it a label or a position?).
In [63]: df = pd.DataFrame(np.random.random_integers(low=1, high=5, size=(4, 9)),
                           columns=pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']]))
In [64]: df
Out[64]:
A B C
a b c a b c a b c
0 4 4 4 4 3 2 5 1 4
1 1 2 1 3 2 1 1 4 5
2 3 2 4 4 2 2 3 1 4
3 5 1 1 3 1 1 5 5 5
So we compute the indexer for the 'A' block; np.r_ turns the slice returned by get_loc into an actual array of positions; then we select one element (0 here, the block's first column). This feeds into .iloc.
In [65]: df.iloc[:,np.r_[df.columns.get_loc('A')][0]] = 0
In [66]: df
Out[66]:
A B C
a b c a b c a b c
0 0 4 4 4 3 2 5 1 4
1 0 2 1 3 2 1 1 4 5
2 0 2 4 4 2 2 3 1 4
3 0 1 1 3 1 1 5 5 5
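Extending that idea to the original ask, all but the last column of the 'A' block, is a small step (my sketch, not part of the original answer):

# Positions of the 'A' block, minus the last one, fed to .iloc.
pos = np.r_[df.columns.get_loc('A')][:-1]  # array([0, 1]) here
df.iloc[:, pos] = 6 - df.iloc[:, pos]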
I want to create a pivot table from a pandas dataframe
using dataframe.pivot()
and include not only dataframe columns but also the data within the dataframe index.
Couldn't find any docs that show how to do that.
Any tips?
Use reset_index to make the index a column:
In [45]: df = pd.DataFrame({'y': [0, 1, 2, 3, 4, 4], 'x': [1, 2, 2, 3, 1, 3]}, index=np.arange(6)*10)
In [46]: df
Out[46]:
x y
0 1 0
10 2 1
20 2 2
30 3 3
40 1 4
50 3 4
In [47]: df.reset_index()
Out[47]:
index x y
0 0 1 0
1 10 2 1
2 20 2 2
3 30 3 3
4 40 1 4
5 50 3 4
Now pivot picks up the former index as the values column:
In [48]: df.reset_index().pivot(index='y', columns='x')
Out[48]:
index
x 1 2 3
y
0 0 NaN NaN
1 NaN 10 NaN
2 NaN 20 NaN
3 NaN NaN 30
4 40 NaN 50
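If you want to drop the extra 'index' level on the columns of that output, name the values column explicitly (my note):

# Sketch: passing values= keeps a flat set of columns keyed by x.
df.reset_index().pivot(index='y', columns='x', values='index')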