Reshape pandas dataframe which has lists as values - list

I have a pandas dataframe which has lists as values. I would like to transform this dataframe into the format shown in the expected result below. The dataframe is large (1 million rows).
import pandas as pd
import numpy as np
df = pd.DataFrame(
[[['A', 'Second'], [], 'N/A', [6]],
[[2, 3], [3, 4, 6], [3, 4, 5, 7], [2, 6, 3, 4]]],
columns=list('ABCD')
)
df.replace('N/A', np.nan, inplace=True)
df
A B C D
0 [A,Second] [] NaN [6]
1 [2,3] [3,4,6] [3,4,5,7] [2,6,3,4]
Expected result
0 A A
0 A Second
0 D 6
1 A 2
1 A 3
1 B 3
1 B 4
1 B 6
1 C 3
1 C 4
1 C 5
1 C 7
1 D 2
1 D 6
1 D 3
1 D 4

You can use a double stack:
df1 = df.stack()
# wrap the chain in parentheses so it can continue across lines
df = (pd.DataFrame(df1.values.tolist(), index=df1.index)
        .stack()
        .reset_index(level=2, drop=True)
        .reset_index())
df.columns = list('abc')
print (df)
a b c
0 0 A A
1 0 A Second
2 0 D 6
3 1 A 2
4 1 A 3
5 1 B 3
6 1 B 4
7 1 B 6
8 1 C 3
9 1 C 4
10 1 C 5
11 1 C 7
12 1 D 2
13 1 D 6
14 1 D 3
15 1 D 4

Another option, applied to the original dataframe, uses apply(pd.Series):
(df.stack().apply(pd.Series).stack()
   .reset_index(level=2, drop=True)
   .rename_axis(['a', 'b']).reset_index(name='c'))
a b c
0 0 A A
1 0 A Second
2 0 D 6
3 1 A 2
4 1 A 3
5 1 B 3
6 1 B 4
7 1 B 6
8 1 C 3
9 1 C 4
10 1 C 5
11 1 C 7
12 1 D 2
13 1 D 6
14 1 D 3
15 1 D 4
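On pandas 0.25 or later, Series.explode offers a shorter route to the same long format. The following is a minimal sketch under that version assumption, not part of the original answers; the variable name long_df is only illustrative. The dropna call removes both the original NaN cell and the NaN that explode produces for an empty list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[["A", "Second"], [], np.nan, [6]],
     [[2, 3], [3, 4, 6], [3, 4, 5, 7], [2, 6, 3, 4]]],
    columns=list("ABCD"),
)

long_df = (
    df.stack()                  # one row per (row label, column label)
      .explode()                # one row per list element
      .dropna()                 # drop NaN cells and exploded empty lists
      .rename_axis(["a", "b"])  # name the two index levels
      .reset_index(name="c")    # turn the index levels into columns
)
print(long_df)
```

Because explode works on the stacked Series directly, no intermediate wide DataFrame is built, which may help with a million rows.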

Related

Conditional mutation across rows (by group/id)?

I have a large dataset that I would like some help with. An example is given below:
id id_row material
1 1 1 1
2 1 2 1
3 1 3 1
4 2 1 1
5 2 2 2
6 2 3 1
7 3 1 2
8 3 2 2
9 3 3 2
10 4 1 1
11 4 2 2
I would like to add a new column based on the values in material for the same id (across rows). In the new column, every id that has both values 1 and 2 in material (across rows) should be flagged (e.g. with the value 99); if only one of the two values is present, the column should return that value (1 or 2).
Something like this:
id id_row material new_column
1 1 1 1 1
2 1 2 1 1
3 1 3 1 1
4 2 1 1 99
5 2 2 2 99
6 2 3 1 99
7 3 1 2 2
8 3 2 2 2
9 3 3 2 2
10 4 1 1 99
11 4 2 2 99
I have been looking online for a solution and have tried dplyr with group_by, mutate and ifelse, all without any luck. Thank you in advance!
Try this approach:
library(tidyverse)
tribble(
~id, ~id_row, ~material,
1, 1, 1,
1, 2, 1,
1, 3, 1,
2, 1, 1,
2, 2, 2,
2, 3, 1,
3, 1, 2,
3, 2, 2,
3, 3, 2,
4, 1, 1,
4, 2, 2
) |>
group_by(id) |>
mutate(new_column = if_else(any(material == 2) & any(material == 1), 99, NA_real_),
new_column = if_else(is.na(new_column), material, new_column))
#> # A tibble: 11 × 4
#> # Groups: id [4]
#> id id_row material new_column
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 1
#> 2 1 2 1 1
#> 3 1 3 1 1
#> 4 2 1 1 99
#> 5 2 2 2 99
#> 6 2 3 1 99
#> 7 3 1 2 2
#> 8 3 2 2 2
#> 9 3 3 2 2
#> 10 4 1 1 99
#> 11 4 2 2 99
Created on 2022-05-25 by the reprex package (v2.0.1)
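For readers working in pandas rather than dplyr, the same group-wise rule can be sketched as follows. This is an assumed translation of the logic above, not part of the original answer; the helper names has_1, has_2 and mixed are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id":       [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    "id_row":   [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2],
    "material": [1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 2],
})

# Per-row flags, broadcast back from each id group
has_1 = df["material"].eq(1).groupby(df["id"]).transform("any")
has_2 = df["material"].eq(2).groupby(df["id"]).transform("any")
mixed = has_1 & has_2

# Keep material where the group is not mixed, otherwise write 99
df["new_column"] = df["material"].where(~mixed, 99)
print(df)
```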

3 dimensional numpy array to multiindex pandas dataframe

I have a 3 dimensional numpy array, (z, x, y). z is a time dimension and x and y are coordinates.
I want to convert this to a multiindexed pandas.DataFrame. I want the row index to be the z dimension
and each column to have values from a unique x, y coordinate (and so, each column would be multi-indexed).
The simplest case (not multi-indexed):
>>> array.shape
(500L, 120L, 100L)
>>> df = pd.DataFrame(array[:,0,0])
>>> df.shape
(500, 1)
I've been trying to pass the whole array into a multiindex dataframe using pd.MultiIndex.from_arrays but I'm getting an error:
NotImplementedError: > 1 ndim Categorical are not supported at this time
It looks like it should be fairly simple, but I can't figure it out.
I find that a Series with a MultiIndex is the most analogous pandas datatype for a numpy array with arbitrarily many dimensions (presumably 3 or more).
Here is some example code:
import pandas as pd
import numpy as np
time_vals = np.linspace(1, 50, 50)
x_vals = np.linspace(-5, 6, 12)
y_vals = np.linspace(-4, 5, 10)
measurements = np.random.rand(50,12,10)
#setup multiindex
mi = pd.MultiIndex.from_product([time_vals, x_vals, y_vals], names=['time', 'x', 'y'])
#connect multiindex to data and save as multiindexed Series
sr_multi = pd.Series(index=mi, data=measurements.flatten())
#pull out a dataframe of x, y at time=22
sr_multi.xs(22, level='time').unstack(level=0)
#pull out a dataframe of y, time at x=3
sr_multi.xs(3, level='x').unstack(level=1)
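To get the layout the question asks for (time as the row index, one column per (x, y) pair), the MultiIndex Series can be unstacked on two levels at once. A small sketch with made-up dimensions; the sizes and the name wide are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

time_vals = np.arange(1, 6)   # 5 time steps
x_vals = np.arange(-1, 2)     # 3 x coordinates
y_vals = np.arange(0, 2)      # 2 y coordinates
rng = np.random.default_rng(0)
measurements = rng.random((5, 3, 2))

mi = pd.MultiIndex.from_product([time_vals, x_vals, y_vals],
                                names=["time", "x", "y"])
sr_multi = pd.Series(measurements.flatten(), index=mi)

# Move the x and y levels into the columns; time stays as the row index
wide = sr_multi.unstack(level=["x", "y"])
print(wide.shape)
```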
I think you can use Panel, and then add to_frame to get a MultiIndex DataFrame (note that pd.Panel was removed in pandas 0.25, so this only works on older versions):
np.random.seed(10)
arr = np.random.randint(10, size=(5,3,2))
print (arr)
[[[9 4]
[0 1]
[9 0]]
[[1 8]
[9 0]
[8 6]]
[[4 3]
[0 4]
[6 8]]
[[1 8]
[4 1]
[3 6]]
[[5 3]
[9 6]
[9 1]]]
df = pd.Panel(arr).to_frame()
print (df)
0 1 2 3 4
major minor
0 0 9 1 4 1 5
1 4 8 3 8 3
1 0 0 9 0 4 9
1 1 0 4 1 6
2 0 9 8 6 3 9
1 0 6 8 6 1
Also transpose can be useful:
df = pd.Panel(arr).transpose(1,2,0).to_frame()
print (df)
0 1 2
major minor
0 0 9 0 9
1 1 9 8
2 4 0 6
3 1 4 3
4 5 9 9
1 0 4 1 0
1 8 0 6
2 3 4 8
3 8 1 6
4 3 6 1
Another possible solution with concat:
arr = arr.transpose(1,2,0)
# one key per 2-D slice, i.e. one per element along the first axis
df = pd.concat([pd.DataFrame(x) for x in arr], keys=np.arange(arr.shape[0]))
print (df)
0 1 2 3 4
0 0 9 1 4 1 5
1 4 8 3 8 3
1 0 0 9 0 4 9
1 1 0 4 1 6
2 0 9 8 6 3 9
1 0 6 8 6 1
np.random.seed(10)
arr = np.random.randint(10, size=(500,120,100))
df = pd.Panel(arr).transpose(2,0,1).to_frame()
print (df.shape)
(60000, 100)
print (df.index.max())
(499, 119)
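Since pd.Panel is no longer available in current pandas, a Panel-free sketch of the same conversion may be useful. It relies only on NumPy's default C-order reshape and MultiIndex.from_product; the names wide and panel_like are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
arr = rng.integers(10, size=(5, 3, 2))   # (z, x, y)
nz, nx, ny = arr.shape

# C-order reshape flattens (x, y) with y varying fastest,
# which matches the ordering produced by from_product
cols = pd.MultiIndex.from_product([range(nx), range(ny)], names=["x", "y"])
wide = pd.DataFrame(arr.reshape(nz, nx * ny),
                    index=pd.RangeIndex(nz, name="z"),
                    columns=cols)

# Transposing gives the (x, y)-indexed rows that Panel.to_frame() produced
panel_like = wide.T
print(wide.shape, panel_like.shape)
```

With rows as z and MultiIndex columns, a single coordinate's time series is available as wide[(x, y)].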

How to add a number to a portion of dataframe column in pandas?

I have a dataframe with two columns A and B.
A B
1 0
2 0
3 1
4 2
5 0
6 3
What I want to do is add column A to column B, but only in the rows where column B is non-zero, and put the result in column B.
A B
1 0
2 0
3 4
4 6
5 0
6 9
Thank you in advance for your help and suggestions.
Use .loc with a boolean mask:
In [49]:
df.loc[df['B'] != 0, 'B'] = df['A'] + df['B']
df
Out[49]:
A B
0 1 0
1 2 0
2 3 4
3 4 6
4 5 0
5 6 9
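A self-contained version of the same idea, written so that both sides of the assignment are restricted to the masked rows. This is only a sketch; the variable name mask is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6],
                   "B": [0, 0, 1, 2, 0, 3]})

mask = df["B"] != 0                     # True where B is non-zero
df.loc[mask, "B"] += df.loc[mask, "A"]  # add A only in those rows
print(df)
```

The one-liner in the answer works as well, because pandas aligns the right-hand side on the index and only writes to the rows selected by the mask.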

Pandas grouping values to column

I have a dataframe that looks like this:
a b c d e
0 0 1 2 1 0
1 3 0 0 4 3
2 3 4 0 4 2
3 4 1 0 4 3
4 2 1 3 4 3
5 3 2 0 3 3
6 2 1 1 1 0
7 1 1 0 3 3
8 3 3 3 3 4
9 2 3 4 2 2
I run the following command:
df.groupby('a').sum()
And I get:
b c d e
a
0 1 2 1 0
1 1 0 3 3
2 5 8 7 5
3 9 3 14 12
4 1 0 4 3
After that I want to access the group labels on the result:
labels = df['a']
But I get an error that there is no such column.
Does pandas have some syntax to get something like this?
a b c d e
0 0 1 2 1 0
1 1 1 0 3 3
2 2 5 8 7 5
3 3 9 3 14 12
4 4 1 0 4 3
I need to sum all values of columns b, c, d, e for each value of column a, keeping a as a regular column.
After the groupby, column 'a' becomes the index of the result. You can access it with .index and add it back into your dataframe as another column.
grouped_df = df.groupby('a').sum()
grouped_df['a'] = grouped_df.index
grouped_df.sum(axis=1)
Alternatively, groupby has an as_index option to keep the column 'a':
df.groupby('a', as_index=False).sum()
Or, after the groupby, you can use reset_index to put the column 'a' back.
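Both options can be checked against the data from the question. A short sketch, with the names summed and same chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0, 3, 3, 4, 2, 3, 2, 1, 3, 2],
    "b": [1, 0, 4, 1, 1, 2, 1, 1, 3, 3],
    "c": [2, 0, 0, 0, 3, 0, 1, 0, 3, 4],
    "d": [1, 4, 4, 4, 4, 3, 1, 3, 3, 2],
    "e": [0, 3, 2, 3, 3, 3, 0, 3, 4, 2],
})

# Option 1: keep 'a' as a column while grouping
summed = df.groupby("a", as_index=False).sum()

# Option 2: group normally, then move the index back into a column
same = df.groupby("a").sum().reset_index()

labels = summed["a"]   # now available as an ordinary column
print(summed)
```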

Assigning value to multiindexed pandas dataframe based on mix of integer and labels indexing

I have a dataframe with multiindexed columns. I want to select on the first level based on the column name, and then return all columns but the last one, and assign a new value to all these elements.
Here's a sample dataframe:
In [1]: mydf = pd.DataFrame(np.random.random_integers(low=1,high=5,size=(4,9)),
columns = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']]))
Out[1]:
A B C
a b c a b c a b c
0 4 1 2 1 4 2 1 1 3
1 4 4 1 2 3 4 2 2 3
2 2 3 4 1 2 1 3 2 3
3 1 3 4 2 3 4 1 5 1
I want to be able to assign to these elements, for example:
In [2]: mydf.loc[:,('A')].iloc[:,:-1]
Out[2]:
A
a b
0 4 1
1 4 4
2 2 3
3 1 3
If I wanted to modify one column only, I know how to select it properly with a tuple so that the assigning works:
In [3]: mydf.loc[:,('A','a')] = 0
In [4]: mydf.loc[:,('A','a')]
Out[4]:
0 0
1 0
2 0
3 0
Name: (A, a), dtype: int32
So that worked well.
Now the following doesn't work...
In [5]: mydf.loc[:,('A')].ix[:,:-1] = 6 - mydf.loc[:,('A')].ix[:,:-1]
In [6]: mydf.loc[:,('A')].iloc[:,:-1] = 6 - mydf.loc[:,('A')].iloc[:,:-1]
Sometimes I will, and sometimes I won't, get the warning that a value is trying to be set on a copy of a slice from a DataFrame. But in both cases it doesn't actually assign.
I've pretty much tried everything I could think, I still can't figure out how to mix both label and integer indexing in order to set the value correctly.
Any idea please?
Versions:
Python 2.7.9
Pandas 0.16.1
This is not directly supported, as .loc MUST have labels and NOT positions. In theory .ix could support this with multi-index slicers, but there are the usual complications of figuring out what the user 'means' (e.g. is it a label or a position).
In [63]: df = pd.DataFrame(np.random.random_integers(low=1,high=5,size=(4,9)),
columns = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']]))
In [64]: df
Out[64]:
A B C
a b c a b c a b c
0 4 4 4 4 3 2 5 1 4
1 1 2 1 3 2 1 1 4 5
2 3 2 4 4 2 2 3 1 4
3 5 1 1 3 1 1 5 5 5
so we compute the indexer for the 'A' block; np.r_ turns this slice into an actual indexer; then we select the element (e.g. 0 in this case). This feeds into .iloc.
In [65]: df.iloc[:,np.r_[df.columns.get_loc('A')][0]] = 0
In [66]: df
Out[66]:
A B C
a b c a b c a b c
0 0 4 4 4 3 2 5 1 4
1 0 2 1 3 2 1 1 4 5
2 0 2 4 4 2 2 3 1 4
3 0 1 1 3 1 1 5 5 5
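The same get_loc idea extends to the case in the question, where every column of the 'A' block except the last should be updated. A hedged sketch: the names positions and target are illustrative, and default_rng is used here in place of random_integers purely as an assumption about a current NumPy setup.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = pd.MultiIndex.from_product([["A", "B", "C"], ["a", "b", "c"]])
mydf = pd.DataFrame(rng.integers(1, 6, size=(4, 9)), columns=cols)
before = mydf.copy()

# get_loc on the first level gives a slice (or boolean mask) for the block;
# indexing an arange with it yields plain integer positions
positions = np.arange(len(mydf.columns))[mydf.columns.get_loc("A")]
target = positions[:-1]   # every column of the block except the last

# Purely positional assignment; to_numpy() sidesteps label alignment
mydf.iloc[:, target] = 6 - mydf.iloc[:, target].to_numpy()
print(mydf)
```

Resolving the label to integer positions first keeps the whole assignment inside a single .iloc call, which avoids the chained indexing that caused the assignment to be lost.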