Pandas data-frame ungrouping functionality - python-2.7

I have a dataframe with 3 columns:
df1 = pd.DataFrame([[2, 2, 5, 7], [2, 5, 7.5, 10], [2, 5, 1, 3]]).T
df1.columns = ['col1', 'col2', 'col3']
df1
col1 col2 col3
0 2 2.0 2
1 2 5.0 5
2 5 7.5 1
3 7 10.0 3
Now I want to ungroup on the 3rd column: each row should be repeated col3 times, with a new counter column col4, giving the longer dataframe df2 shown below:
df2 = pd.DataFrame([[2, 2, 2, 2, 2, 2, 2, 5, 7, 7, 7], [2, 2, 5, 5, 5, 5, 5, 7.5, 10, 10, 10], [2, 2, 5, 5, 5, 5, 5, 1, 3, 3, 3], [1, 2, 1, 2, 3, 4, 5, 1, 1, 2, 3]]).T
df2.columns = ['col1', 'col2', 'col3', 'col4']
df2
col1 col2 col3 col4
0 2 2.0 2 1
1 2 2.0 2 2
2 2 5.0 5 1
3 2 5.0 5 2
4 2 5.0 5 3
5 2 5.0 5 4
6 2 5.0 5 5
7 5 7.5 1 1
8 7 10.0 3 1
9 7 10.0 3 2
10 7 10.0 3 3

Here is one way, using groupby with reindex:
import numpy as np

# custom apply function: pad each single-row group out to col3 rows, forward-filling the values
def func(group):
    return group.reset_index(drop=True).reindex(np.arange(int(group.col3.iloc[0]))).fillna(method='ffill')

# groupby apply
result = df1.groupby(level=0).apply(func)
result
col1 col2 col3
0 0 2 2.0 2
1 2 2.0 2
1 0 2 5.0 5
1 2 5.0 5
2 2 5.0 5
3 2 5.0 5
4 2 5.0 5
2 0 5 7.5 1
3 0 7 10.0 3
1 7 10.0 3
2 7 10.0 3
result['col4'] = result.index.get_level_values(1) + 1
result.reset_index(drop=True)
col1 col2 col3 col4
0 2 2.0 2 1
1 2 2.0 2 2
2 2 5.0 5 1
3 2 5.0 5 2
4 2 5.0 5 3
5 2 5.0 5 4
6 2 5.0 5 5
7 5 7.5 1 1
8 7 10.0 3 1
9 7 10.0 3 2
10 7 10.0 3 3
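A more direct approach (a sketch, not from the original answer) uses Index.repeat together with groupby/cumcount, assuming col3 holds whole numbers:
n = df1['col3'].astype(int)
out = df1.loc[df1.index.repeat(n)].copy()  # repeat each row col3 times
out['col4'] = out.groupby(level=0).cumcount() + 1  # 1..col3 counter per original row
out.reset_index(drop=True)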

You can also use numpy for faster calculation:
import numpy as np
import pandas as pd
df = pd.DataFrame([[2, 2, 5, 7], [2, 5, 7.5, 10], [2, 5, 1, 3]]).T
df.columns = ['col1', 'col2', 'col3']
x = df.values
n = df.iloc[:,-1].astype(int).values
data = np.repeat(x,n,axis=0)
df1 = pd.DataFrame(data)
df1.loc[:,3] = n.repeat(n)
df1.columns = ['col1','col2','col3','col4']
print(df1)
Gives:
col1 col2 col3 col4
0 2.0 2.0 2.0 2
1 2.0 2.0 2.0 2
2 2.0 5.0 5.0 5
3 2.0 5.0 5.0 5
4 2.0 5.0 5.0 5
5 2.0 5.0 5.0 5
6 2.0 5.0 5.0 5
7 5.0 7.5 1.0 1
8 7.0 10.0 3.0 3
9 7.0 10.0 3.0 3
10 7.0 10.0 3.0 3
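Note that col4 here holds the repeat count itself, not the 1..col3 counter requested in df2. If the counter is needed, one option (a sketch building on the n array above) is:
# within-group counter 1..n, fully vectorized
df1['col4'] = np.arange(n.sum()) - np.repeat(np.cumsum(n) - n, n) + 1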

Related

Replace values with NA based on condition

I am currently working on my first dataset as a PhD student. I have a dataset where several conditions have not been finished. In the dataset, this is visible when 4 or more consecutive columns in a row have the value "1" (see example below). I want all the "1" values which do not represent "real" numbers (they are in fact NAs) to be replaced by NA.
Any suggestions on how I could succeed?
library(tibble)

example <- tibble(
  a = c(1, 2, 3, 4, 5, 6, 7, 3, 4, 2, 7, 1),
  b = c(1, 1, 1, 1, 1, 1, 2, 3, 4, 5, 6, 2),
  c = c(3, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1),
  d = c(5, 1, 2, 3, 1, 1, 1, 1, 1, 4, 1, 5),
  e = c(4, 1, 3, 4, 1, 1, 1, 1, 2, 3, 7, 5),
  f = c(3, 7, 6, 1, 1, 1, 1, 2, 1, 1, 1, 1))
This means I have this:
a b c d e f
1 1 1 3 5 4 3
2 2 1 1 1 1 7
3 3 1 1 2 3 6
4 4 1 1 3 4 1
5 5 1 1 1 1 1
6 6 1 4 1 1 1
7 7 2 1 1 1 1
8 3 3 1 1 1 2
9 4 4 1 1 2 1
10 2 5 1 4 3 1
11 7 6 1 1 7 1
12 1 2 1 5 5 1
And I need this:
a b c d e f
1 1 1 3 5 4 3
2 2 NA NA NA NA 7
3 3 1 1 2 3 6
4 4 1 1 3 4 1
5 5 1 NA NA NA NA
6 6 1 4 1 1 1
7 7 2 NA NA NA NA
8 3 3 1 1 1 2
9 4 4 1 1 2 1
10 2 5 1 4 3 1
11 7 6 1 1 7 1
12 1 2 1 5 5 1
Thank you very much!!

Conditional mutation across rows (by group/id)?

I have a large dataset that I would like some help with. An example is given below:
id id_row material
1 1 1 1
2 1 2 1
3 1 3 1
4 2 1 1
5 2 2 2
6 2 3 1
7 3 1 2
8 3 2 2
9 3 3 2
10 4 1 1
11 4 2 2
I would like to add a new column based on the values in material for the same id (across rows). In the new column, every id that has both the values 1 and 2 in material (across rows) should be flagged (e.g. with the value 99); if both are not present, the existing material value (1 or 2) should be returned.
Something like this:
id id_row material new_column
1 1 1 1 1
2 1 2 1 1
3 1 3 1 1
4 2 1 1 99
5 2 2 2 99
6 2 3 1 99
7 3 1 2 2
8 3 2 2 2
9 3 3 2 2
10 4 1 1 99
11 4 2 2 99
I have been looking online for a solution without any luck, and have also tried dplyr with group_by, mutate and ifelse, again without luck. Thank you in advance!
Try this approach:
library(tidyverse)

tribble(
  ~id, ~id_row, ~material,
  1, 1, 1,
  1, 2, 1,
  1, 3, 1,
  2, 1, 1,
  2, 2, 2,
  2, 3, 1,
  3, 1, 2,
  3, 2, 2,
  3, 3, 2,
  4, 1, 1,
  4, 2, 2
) |>
  group_by(id) |>
  mutate(new_column = if_else(any(material == 2) & any(material == 1), 99, NA_real_),
         new_column = if_else(is.na(new_column), material, new_column))
#> # A tibble: 11 × 4
#> # Groups: id [4]
#> id id_row material new_column
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 1
#> 2 1 2 1 1
#> 3 1 3 1 1
#> 4 2 1 1 99
#> 5 2 2 2 99
#> 6 2 3 1 99
#> 7 3 1 2 2
#> 8 3 2 2 2
#> 9 3 3 2 2
#> 10 4 1 1 99
#> 11 4 2 2 99
Created on 2022-05-25 by the reprex package (v2.0.1)
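For comparison with the pandas questions on this page, a rough pandas equivalent of this grouped logic might look like the sketch below (column names taken from the question, everything else assumed):
import pandas as pd
df = pd.DataFrame({'id':       [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
                   'id_row':   [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2],
                   'material': [1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 2]})
# flag ids whose material values contain both 1 and 2
both = df.groupby('id')['material'].transform(lambda s: (s == 1).any() and (s == 2).any())
df['new_column'] = df['material'].where(~both, 99)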

(Python2) Combining pandas dataframes with multilayer columns

I want to add the values of two dataframes which have the same format.
For example:
>>> my_dataframe1
class1 score
subject 1 2 3
student
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
>>> my_dataframe2
class2 score
subject 1 2 3
student
0 4 2 2
1 4 4 14
2 8 7 7
3 1 2 NaN
4 NaN 2 3
As you can see, the two dataframes have multi-layer columns: the top level is the class score and the sub-level is 'subject'. What I want is the summed dataframe shown below:
score
subject 1 2 3
student
0 5 4 7
1 6 7 23
2 16 14 9
3 4 6 7
4 6 9 10
Actually, I could get this dataframe with
for i in my_dataframe1['class1 score'].index:
    my_dataframe1['class1 score'].loc[i, :] = my_dataframe1['class1 score'].loc[i, :].add(my_dataframe2['class2 score'].loc[i, :], fill_value=0)
but when the dimensions increase it takes a tremendous amount of time to get the result dataframe, and I don't think this is a good way to solve the problem.
If you add the values (the underlying NumPy array) from the second dataframe, pandas ignores the indexing and adds positionally:
# astype(int) is cosmetic here: fill_value=0 handles the NaNs, but the result dtype is float
my_dataframe1.add(my_dataframe2.values, fill_value=0).astype(int)
class1 score
subject 1 2 3
student
0 5 4 7
1 6 7 23
2 16 14 9
3 4 6 7
4 6 9 10
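If you would rather keep label-based alignment, a hedged alternative is to rename the top column level first so the labels match before adding:
# sketch: make the top levels identical, then let pandas align the columns
aligned = my_dataframe2.rename(columns={'class2 score': 'class1 score'})
my_dataframe1.add(aligned, fill_value=0).astype(int)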
Setup
my_dataframe1 = pd.DataFrame([
[1, 2, 5],
[2, 3, 9],
[8, 7, 2],
[3, 4, 7],
[6, 7, 7]
], pd.RangeIndex(5, name='student'), pd.MultiIndex.from_product([['class1 score'], [1, 2, 3]], names=[None, 'subject']))
my_dataframe2 = pd.DataFrame([
[4, 2, 2],
[4, 4, 14],
[8, 7, 7],
[1, 2, np.nan],
[np.nan, 2, 3]
], pd.RangeIndex(5, name='student'), pd.MultiIndex.from_product([['class2 score'], [1, 2, 3]], names=[None, 'subject']))
If I understand correctly:
df_out = my_dataframe1['class1 score'].add(my_dataframe2['class2 score'], fill_value=0).add_prefix('scores_')
df_out.columns = df_out.columns.str.split('_',expand=True)
df_out
Output:
scores
1 2 3
student
0 5.0 4 7.0
1 6.0 7 23.0
2 16.0 14 9.0
3 4.0 6 7.0
4 6.0 9 10.0
The way I would approach this is to keep the data in the same dataframe. You could concatenate the two you have already:
big_df = pd.concat([my_dataframe1, my_dataframe2], axis=1)
Then sum over the larger dataframe, specifying level:
big_df.sum(axis=1, level='subject')
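In recent pandas versions the level argument of sum was removed; assuming such a version, an equivalent sketch groups on the column level explicitly:
# group the transposed frame on the 'subject' level, then transpose back
big_df.T.groupby(level='subject').sum().T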

Reshape pandas dataframe which has lists as values

I have a pandas dataframe which has lists as values. I would like to transform this dataframe into the format shown under expected result below. The dataframe is large (1 million rows), so performance matters.
import pandas as pd
import numpy as np
df = pd.DataFrame(
[[['A', 'Second'], [], 'N/A', [6]],
[[2, 3], [3, 4, 6], [3, 4, 5, 7], [2, 6, 3, 4]]],
columns=list('ABCD')
)
df.replace('N/A',np.NaN, inplace=True)
df
A B C D
0 [A,Second] [] NaN [6]
1 [2,3] [3,4,6] [3,4,5,7] [2,6,3,4]
Expected result
0 A A
0 A Second
0 D 6
1 A 2
1 A 3
1 B 3
1 B 4
1 B 6
1 C 3
1 C 4
1 C 5
1 C 7
1 D 2
1 D 6
1 D 3
1 D 4
You can use a double stack:
df1 = df.stack()
df = (pd.DataFrame(df1.values.tolist(), index=df1.index)
        .stack()
        .reset_index(level=2, drop=True)
        .reset_index())
df.columns = list('abc')
print (df)
a b c
0 0 A A
1 0 A Second
2 0 D 6
3 1 A 2
4 1 A 3
5 1 B 3
6 1 B 4
7 1 B 6
8 1 C 3
9 1 C 4
10 1 C 5
11 1 C 7
12 1 D 2
13 1 D 6
14 1 D 3
15 1 D 4
Alternatively, apply(pd.Series) between two stacks does the same job:
df.stack().apply(pd.Series).stack().reset_index(2, True) \
  .rename_axis(['a', 'b']).reset_index(name='c')
a b c
0 0 A A
1 0 A Second
2 0 D 6
3 1 A 2
4 1 A 3
5 1 B 3
6 1 B 4
7 1 B 6
8 1 C 3
9 1 C 4
10 1 C 5
11 1 C 7
12 1 D 2
13 1 D 6
14 1 D 3
15 1 D 4
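For a frame of this size (1 million rows), apply(pd.Series) can be slow. In newer pandas (0.25 and later) a likely faster sketch uses Series.explode:
# stack to get (row, column) pairs holding lists, then explode the lists
s = df.stack().explode().dropna()
out = s.reset_index()
out.columns = ['a', 'b', 'c']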

something wrong in pandas.Series.mean() or .apply()

I'm practicing with the .apply() function in pandas, but something goes wrong when I use Series.mean() inside the applied function.
Here is my code:
In[1]: import pandas as pd
column = ['UserInfo_2', 'UserInfo_4', 'info_1', 'info_2', 'info_3', 'target']
value = [['a', 'b', 'a', 'c', 'b', 'a'],
         ['a', 'c', 'b', 'c', 'b', 'b'],
         range(0, 11, 2),
         range(1, 12, 2),
         range(15, 21),
         [0, 0, 1, 0, 1, 0]]
master_train = pd.DataFrame(dict(zip(column, value)))
In[2]: def f(group):
    return pd.DataFrame({'original': group, 'demand': group - group.mean()})
In[3]: master_train.groupby('UserInfo_2')['info_1'].apply(f)
Out[3]:
demand original
0 -4.666667 0
1 -3.000000 2
2 -0.666667 4
3 0.000000 6
4 3.000000 8
5 5.333333 10
I am confused because the mean of info_1 is actually 5, but from the result above, the implied mean varies from row to row (4.666667, 5 and 6) instead of being 5.
What's wrong?
I think it is now clear: the mean of column info_1 (the original column) is computed per group, with the groups taken from column UserInfo_2:
def f(group):
    return pd.DataFrame({'original': group,
                         'groups': group.name,
                         'demand': group - group.mean(),
                         'mean': group.mean()})
print (master_train.groupby('UserInfo_2')['info_1'].apply(f))
demand groups mean original
0 -4.666667 a 4.666667 0
1 -3.000000 b 5.000000 2
2 -0.666667 a 4.666667 4
3 0.000000 c 6.000000 6
4 3.000000 b 5.000000 8
5 5.333333 a 4.666667 10
I think you want the mean of the whole column info_1:
def f(group):
    return pd.DataFrame({'original': group,
                         'demand': group - master_train['info_1'].mean(),
                         'mean': master_train['info_1'].mean()})
print (master_train.groupby('UserInfo_2')['info_1'].apply(f))
demand mean original
0 -5.0 5.0 0
1 -3.0 5.0 2
2 -1.0 5.0 4
3 1.0 5.0 6
4 3.0 5.0 8
5 5.0 5.0 10
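If only the group-wise demand column is needed, transform is a more idiomatic sketch than building a DataFrame inside apply:
# demand relative to the per-group mean of info_1
master_train['demand'] = (master_train['info_1']
                          - master_train.groupby('UserInfo_2')['info_1'].transform('mean'))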
EDIT:
For testing, it is possible to add print(group) to function f; it shows that each group is a Series from column info_1, keyed by the groups from column UserInfo_2:
def f(group):
    print (group)
    return pd.DataFrame({'original': group,
                         'groups': group.name,
                         'demand': group - group.mean(),
                         'mean': group.mean()})
print (master_train.groupby('UserInfo_2')['info_1'].apply(f))
0 0
2 4
5 10
Name: a, dtype: int32
1 2
4 8
Name: b, dtype: int32
3 6
Name: c, dtype: int32
demand groups mean original
0 -4.666667 a 4.666667 0
1 -3.000000 b 5.000000 2
2 -0.666667 a 4.666667 4
3 0.000000 c 6.000000 6
4 3.000000 b 5.000000 8
5 5.333333 a 4.666667 10
And if you need the mean of the whole column info_1:
print (master_train['info_1'])
0 0
1 2
2 4
3 6
4 8
5 10
Name: info_1, dtype: int32
print (master_train['info_1'].mean())
5.0