(Python2) Combining pandas dataframes with multilayer columns - python-2.7

I want to add the values of two dataframes that have the same format.
For example:
>>> my_dataframe1
class1 score
subject 1 2 3
student
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
>>> my_dataframe2
class2 score
subject 1 2 3
student
0 4 2 2
1 4 4 14
2 8 7 7
3 1 2 NaN
4 NaN 2 3
As you can see, both dataframes have multi-layer columns: the main column is the class score ('class1 score' or 'class2 score') and the sub-columns are the values of 'subject'.
What I want is a summed dataframe that looks like this:
score
subject 1 2 3
student
0 5 4 7
1 6 7 23
2 16 14 9
3 4 6 7
4 6 9 10
Actually, I can get this dataframe with
for i in my_dataframe1['class1 score'].index:
    my_dataframe1['class1 score'].loc[i,:] = my_dataframe1['class1 score'].loc[i,:].add(my_dataframe2['class2 score'].loc[i,:], fill_value = 0)
But as the dimensions grow, this takes a tremendous amount of time, and I don't think it is a good way to solve the problem.

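If you add the raw values of the second dataframe (my_dataframe2.values), the addition ignores the index and column labels and works purely by position.
# `astype(int)` is optional; it only converts the float result back to integers.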
my_dataframe1.add(my_dataframe2.values, fill_value=0).astype(int)
class1 score
subject 1 2 3
student
0 5 4 7
1 6 7 23
2 16 14 9
3 4 6 7
4 6 9 10
Setup
import numpy as np
import pandas as pd
my_dataframe1 = pd.DataFrame([
[1, 2, 5],
[2, 3, 9],
[8, 7, 2],
[3, 4, 7],
[6, 7, 7]
], pd.RangeIndex(5, name='student'), pd.MultiIndex.from_product([['class1 score'], [1, 2, 3]], names=[None, 'subject']))
my_dataframe2 = pd.DataFrame([
[4, 2, 2],
[4, 4, 14],
[8, 7, 7],
[1, 2, np.nan],
[np.nan, 2, 3]
], pd.RangeIndex(5, name='student'), pd.MultiIndex.from_product([['class2 score'], [1, 2, 3]], names=[None, 'subject']))

IIUC:
df_out = my_dataframe1['class1 score'].add(my_dataframe2['class2 score'],fill_value=0).add_prefix('scores_')
df_out.columns = df_out.columns.str.split('_',expand=True)
df_out
Output:
scores
1 2 3
student
0 5.0 4 7.0
1 6.0 7 23.0
2 16.0 14 9.0
3 4.0 6 7.0
4 6.0 9 10.0
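As a hedged alternative to the prefix-and-split step, the summed frame can be wrapped under a single top-level label with pd.concat and a dict. This is only a sketch: the helper make_frame and the label 'score' are assumptions made for illustration, and the data is copied from the question.

```python
import numpy as np
import pandas as pd

students = pd.RangeIndex(5, name='student')

def make_frame(top, rows):
    # hypothetical helper for this sketch: one top-level label over subjects 1..3
    cols = pd.MultiIndex.from_product([[top], [1, 2, 3]], names=[None, 'subject'])
    return pd.DataFrame(rows, index=students, columns=cols)

my_dataframe1 = make_frame('class1 score',
                           [[1, 2, 5], [2, 3, 9], [8, 7, 2], [3, 4, 7], [6, 7, 7]])
my_dataframe2 = make_frame('class2 score',
                           [[4, 2, 2], [4, 4, 14], [8, 7, 7], [1, 2, np.nan], [np.nan, 2, 3]])

# selecting the top level leaves plain 'subject' columns, so the two frames align by label
total = my_dataframe1['class1 score'].add(my_dataframe2['class2 score'], fill_value=0)

# the dict key becomes the new top column level
result = pd.concat({'score': total}, axis=1)
print(result)
```

Because the addition aligns on labels, this keeps working if the rows or subjects of the two frames are in a different order.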

The way I would approach this is to keep the data in the same dataframe. You could concatenate the two you already have:
big_df = pd.concat([my_dataframe1, my_dataframe2], axis=1)
Then sum over the larger dataframe, specifying level:
big_df.sum(axis=1, level='subject')
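In recent pandas releases the level argument of sum is no longer available, so the call above may fail there. A minimal sketch of an equivalent that should also work on newer versions is to transpose, group the rows by the 'subject' level, sum, and transpose back. The variable names idx, cols1, cols2 and summed are assumptions for this example; the data is the question's.

```python
import numpy as np
import pandas as pd

idx = pd.RangeIndex(5, name='student')
cols1 = pd.MultiIndex.from_product([['class1 score'], [1, 2, 3]], names=[None, 'subject'])
cols2 = pd.MultiIndex.from_product([['class2 score'], [1, 2, 3]], names=[None, 'subject'])

my_dataframe1 = pd.DataFrame([[1, 2, 5], [2, 3, 9], [8, 7, 2], [3, 4, 7], [6, 7, 7]],
                             index=idx, columns=cols1)
my_dataframe2 = pd.DataFrame([[4, 2, 2], [4, 4, 14], [8, 7, 7], [1, 2, np.nan], [np.nan, 2, 3]],
                             index=idx, columns=cols2)

big_df = pd.concat([my_dataframe1, my_dataframe2], axis=1)

# after transposing, 'subject' is a row level, so an ordinary groupby can sum over it;
# sum() skips NaN, which plays the role of fill_value=0
summed = big_df.T.groupby(level='subject').sum().T
print(summed)
```

The result has one column per subject and no top-level label; it can be wrapped in one afterwards if the layout from the question is needed.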


Replace values with NA based on condition

I am currently working on my first dataset as a PhD student. In this dataset, several conditions were not finished. This is visible when 4 or more columns in a row have the value "1" (see example below). I want all the "1" values that do not represent "real" numbers (they are actually NAs) to be replaced by NA.
Any suggestions on how I could succeed?
example <- tibble(
a = c(1, 2, 3, 4, 5, 6, 7, 3, 4, 2, 7, 1),
b = c(1, 1, 1, 1, 1, 1, 2, 3, 4, 5, 6, 2),
c = c(3, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1),
d = c(5, 1, 2, 3, 1, 1, 1, 1, 1, 4, 1, 5),
e = c(4, 1, 3, 4, 1, 1, 1, 1, 2, 3, 7, 5),
f = c(3, 7, 6, 1, 1, 1, 1, 2, 1, 1, 1, 1))
This means I have this:
a b c d e f
1 1 1 3 5 4 3
2 2 1 1 1 1 7
3 3 1 1 2 3 6
4 4 1 1 3 4 1
5 5 1 1 1 1 1
6 6 1 4 1 1 1
7 7 2 1 1 1 1
8 3 3 1 1 1 2
9 4 4 1 1 2 1
10 2 5 1 4 3 1
11 7 6 1 1 7 1
12 1 2 1 5 5 1
And I need this:
a b c d e f
1 1 1 3 5 4 3
2 2 NA NA NA NA 7
3 3 1 1 2 3 6
4 4 1 1 3 4 1
5 5 1 NA NA NA NA
6 6 1 4 1 1 1
7 7 2 NA NA NA NA
8 3 3 1 1 1 2
9 4 4 1 1 2 1
10 2 5 1 4 3 1
11 7 6 1 1 7 1
12 1 2 1 5 5 1
Thank you very much!!

C++ Sort vector by index

I need to sort a std::vector by index. Let me explain it with an example:
Imagine I have a std::vector of 12 positions (but can be 18 for example) filled with some values (it doesn't have to be sorted):
Vector Index: 0 1 2 3 4 5 6 7 8 9 10 11
Vector Values: 3 0 2 3 2 0 1 2 2 4 5 3
I want to reorder it in blocks of 3 indices. This means: the first 3 [0-2] stay, then come [6-8], and then the others. So it will end up like this (new index 3 has the value of previous index 6):
Vector Index: 0 1 2 3 4 5 6 7 8 9 10 11
Vector Values: 3 0 2 1 2 2 3 2 0 4 5 3
I'm trying to make it in one line using std::sort + lambda but I can't get it. Also discovered the std::partition() function and tried to use it but the result was really bad hehe
Found also this similar question which orders by odd and even index but can't figure out how to make it in my case or even if it is possible: Sort vector by even and odd index
Thank you so much!
Note 0: No, my vector is not always sorted. It was just an example. I've changed the values
Note 1: I know it sounds strange... think of the vector positions like this: yes yes yes no no no yes yes yes no no no yes yes yes... so the 'yes' positions keep their order but go before the 'no' positions
Note 2: If there isn't a way with a lambda, I thought of doing it with a loop and auxiliary variables, but I think that's uglier.
Note 3: Another example:
Vector Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Vector Values: 3 0 2 3 2 0 1 2 2 4 5 3 2 3 0 0 2 1
Sorted Values: 3 0 2 1 2 2 2 3 0 3 2 0 4 5 3 0 2 1
The final vector values, in terms of old indices, are ordered: 0 1 2 6 7 8 12 13 14 3 4 5 9 10 11 15 16 17
You can imagine those indices in 2 columns, so I want the left ones first and then the right ones:
0 1 2 | 3 4 5
6 7 8 | 9 10 11
12 13 14 | 15 16 17
You don't want std::sort, you want std::rotate.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>
std::vector<int> v = {20, 21, 22, 23, 24, 25,
                      26, 27, 28, 29, 30, 31};
auto b = std::next(std::begin(v), 3); // skip first three elements
auto const re = std::end(v); // keep track of the actual end
auto e = std::next(b, 6); // the end of our current block
while(e < re) {
    auto mid = std::next(b, 3);
    std::rotate(b, mid, e); // swap the two halves of the current block
    b = e;
    if (std::distance(e, re) < 6) break; // never advance an iterator past the end
    std::advance(e, 6);
}
// print the results
std::copy(std::begin(v), std::end(v), std::ostream_iterator<int>(std::cout, " "));
This code assumes you always do two groups of 3 for each rotation, but you could obviously work with whichever arbitrary ranges you wanted.
The output looks like what you'd want:
20 21 22 26 27 28 23 24 25 29 30 31
Update: @Blastfurnace pointed out that std::swap_ranges would work as well. The rotate call can be replaced with the following line:
std::swap_ranges(b, mid, mid); // passing mid twice on purpose
With the range-v3 library, you can write this quite conveniently, and it's very readable. Assuming your original vector is called input:
namespace rs = ranges;
namespace rv = ranges::views;
// input [3, 0, 2, 3, 2, 0, 1, 2, 2, 4, 5, 3, 2, 3, 0, 0, 2, 1]
auto by_3s = input | rv::chunk(3); // [[3, 0, 2], [3, 2, 0], [1, 2, 2], [4, 5, 3], [2, 3, 0], [0, 2, 1]]
auto result = rv::concat(by_3s | rv::stride(2), // [[3, 0, 2], [1, 2, 2], [2, 3, 0]]
                         by_3s | rv::drop(1) | rv::stride(2)) // [[3, 2, 0], [4, 5, 3], [0, 2, 1]]
              | rv::join
              | rs::to<std::vector<int>>; // [3, 0, 2, 1, 2, 2, 2, 3, 0, 3, 2, 0, 4, 5, 3, 0, 2, 1]
Here's a demo.

3 dimensional numpy array to multiindex pandas dataframe

I have a 3 dimensional numpy array, (z, x, y). z is a time dimension and x and y are coordinates.
I want to convert this to a multiindexed pandas.DataFrame. I want the row index to be the z dimension
and each column to have values from a unique x, y coordinate (and so, each column would be multi-indexed).
The simplest case (not multi-indexed):
>>> array.shape
(500L, 120L, 100L)
>>> df = pd.DataFrame(array[:,0,0])
>>> df.shape
(500, 1)
I've been trying to pass the whole array into a multiindex dataframe using pd.MultiIndex.from_arrays but I'm getting an error:
NotImplementedError: > 1 ndim Categorical are not supported at this time
Looks like it should be fairly simple but I can't figure it out.
I find that a Series with a MultiIndex is the most analogous pandas datatype for a numpy array with arbitrarily many dimensions (presumably 3 or more).
Here is some example code:
import pandas as pd
import numpy as np
time_vals = np.linspace(1, 50, 50)
x_vals = np.linspace(-5, 6, 12)
y_vals = np.linspace(-4, 5, 10)
measurements = np.random.rand(50,12,10)
#setup multiindex
mi = pd.MultiIndex.from_product([time_vals, x_vals, y_vals], names=['time', 'x', 'y'])
#connect multiindex to data and save as multiindexed Series
sr_multi = pd.Series(index=mi, data=measurements.flatten())
#pull out a dataframe of x, y at time=22
sr_multi.xs(22, level='time').unstack(level=0)
#pull out a dataframe of y, time at x=3
sr_multi.xs(3, level='x').unstack(level=1)
I think you can use Panel, and then call to_frame to get a MultiIndex DataFrame:
np.random.seed(10)
arr = np.random.randint(10, size=(5,3,2))
print (arr)
[[[9 4]
[0 1]
[9 0]]
[[1 8]
[9 0]
[8 6]]
[[4 3]
[0 4]
[6 8]]
[[1 8]
[4 1]
[3 6]]
[[5 3]
[9 6]
[9 1]]]
df = pd.Panel(arr).to_frame()
print (df)
0 1 2 3 4
major minor
0 0 9 1 4 1 5
1 4 8 3 8 3
1 0 0 9 0 4 9
1 1 0 4 1 6
2 0 9 8 6 3 9
1 0 6 8 6 1
Also transpose can be useful:
df = pd.Panel(arr).transpose(1,2,0).to_frame()
print (df)
0 1 2
major minor
0 0 9 0 9
1 1 9 8
2 4 0 6
3 1 4 3
4 5 9 9
1 0 4 1 0
1 8 0 6
2 3 4 8
3 8 1 6
4 3 6 1
Another possible solution with concat:
arr = arr.transpose(1,2,0)
df = pd.concat([pd.DataFrame(x) for x in arr], keys=np.arange(arr.shape[0]))
print (df)
0 1 2 3 4
0 0 9 1 4 1 5
1 4 8 3 8 3
1 0 0 9 0 4 9
1 1 0 4 1 6
2 0 9 8 6 3 9
1 0 6 8 6 1
np.random.seed(10)
arr = np.random.randint(10, size=(500,120,100))
df = pd.Panel(arr).transpose(2,0,1).to_frame()
print (df.shape)
(60000, 100)
print (df.index.max())
(499, 119)
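pd.Panel has been removed from later pandas releases, so the Panel-based snippets above only run on old versions. A minimal sketch of the layout the question describes (rows indexed by z, one MultiIndex column per (x, y) pair) that needs only reshape and MultiIndex.from_product could look like this. The small array shape and the level names 'z', 'x', 'y' are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
arr = rng.integers(10, size=(5, 3, 2))  # (z, x, y); small stand-in for (500, 120, 100)
nz, nx, ny = arr.shape

# from_product enumerates (x, y) pairs with y varying fastest,
# which matches the C-order layout produced by reshape
columns = pd.MultiIndex.from_product([range(nx), range(ny)], names=['x', 'y'])

df = pd.DataFrame(arr.reshape(nz, nx * ny),
                  index=pd.RangeIndex(nz, name='z'),
                  columns=columns)
print(df)
```

Each column df[(x, y)] is then the time series arr[:, x, y], and df[x] selects all y columns for one x.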

Pandas data-frame ungrouping functionality

I have a dataframe with 3 columns:
df1 = pd.DataFrame([[2, 2, 5, 7], [2, 5, 7.5, 10], [2, 5, 1, 3]]).T
df1.columns = ['col1', 'col2', 'col3']
df1
col1 col2 col3
0 2 2.0 2
1 2 5.0 5
2 5 7.5 1
3 7 10.0 3
Now I want to ungroup the 3rd column and get a longer dataframe with a new column col4, as shown below in df2:
df2 = pd.DataFrame([[2, 2, 2, 2, 2, 2, 2, 5, 7, 7, 7], [2, 2, 5, 5, 5, 5, 5, 7.5, 10, 10, 10], [2, 2, 5, 5, 5, 5, 5, 1, 3, 3, 3], [1, 2, 1, 2, 3, 4, 5, 1, 1, 2, 3]]).T
df2.columns = ['col1', 'col2', 'col3', 'col4']
df2
col1 col2 col3 col4
0 2 2.0 2 1
1 2 2.0 2 2
2 2 5.0 5 1
3 2 5.0 5 2
4 2 5.0 5 3
5 2 5.0 5 4
6 2 5.0 5 5
7 5 7.5 1 1
8 7 10.0 3 1
9 7 10.0 3 2
10 7 10.0 3 3
Here is one way to use groupby with reindex.
# custom apply function
def func(group):
    # each group is a single row; repeat it col3 times and forward-fill the new rows
    n = int(group['col3'].iloc[0])
    return group.reset_index(drop=True).reindex(np.arange(n)).ffill()
# groupby apply
result = df1.groupby(level=0).apply(func)
col1 col2 col3
0 0 2 2.0 2
1 2 2.0 2
1 0 2 5.0 5
1 2 5.0 5
2 2 5.0 5
3 2 5.0 5
4 2 5.0 5
2 0 5 7.5 1
3 0 7 10.0 3
1 7 10.0 3
2 7 10.0 3
result['col4'] = result.index.get_level_values(1) + 1
result.reset_index(drop=True)
col1 col2 col3 col4
0 2 2.0 2 1
1 2 2.0 2 2
2 2 5.0 5 1
3 2 5.0 5 2
4 2 5.0 5 3
5 2 5.0 5 4
6 2 5.0 5 5
7 5 7.5 1 1
8 7 10.0 3 1
9 7 10.0 3 2
10 7 10.0 3 3
You can also use numpy for faster calculation:
import numpy as np
import pandas as pd
df = pd.DataFrame([[2, 2, 5, 7], [2, 5, 7.5, 10], [2, 5, 1, 3]]).T
df.columns = ['col1', 'col2', 'col3']
x = df.values
n = df.iloc[:,-1].astype(int).values
data = np.repeat(x,n,axis=0)
df1 = pd.DataFrame(data)
df1.loc[:,3] = n.repeat(n)
df1.columns = ['col1','col2','col3','col4']
print(df1)
Gives:
col1 col2 col3 col4
0 2.0 2.0 2.0 2
1 2.0 2.0 2.0 2
2 2.0 5.0 5.0 5
3 2.0 5.0 5.0 5
4 2.0 5.0 5.0 5
5 2.0 5.0 5.0 5
6 2.0 5.0 5.0 5
7 5.0 7.5 1.0 1
8 7.0 10.0 3.0 3
9 7.0 10.0 3.0 3
10 7.0 10.0 3.0 3
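Note that the NumPy version above fills col4 with the repeat count itself, not with the running counter 1..n shown in the question. A hedged sketch that produces the counter, using Index.repeat and groupby(...).cumcount(), follows; the variable name out is an assumption, and the frame is built from a dict so col1 and col3 keep their integer dtype.

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 2, 5, 7],
                   'col2': [2, 5, 7.5, 10],
                   'col3': [2, 5, 1, 3]})

# repeat each row label col3 times, then select those (duplicated) labels
out = df.loc[df.index.repeat(df['col3'])].copy()

# rows that came from the same original row share an index label,
# so cumcount numbers them 0..n-1 within each original row
out['col4'] = out.groupby(level=0).cumcount() + 1

out = out.reset_index(drop=True)
print(out)
```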

pandas pivot table using index data of dataframe

I want to create a pivot table from a pandas dataframe using dataframe.pivot() and include not only dataframe columns but also the data within the dataframe index.
Couldn't find any docs that show how to do that.
Any tips?
Use reset_index to make the index a column:
In [45]: df = pd.DataFrame({'y': [0, 1, 2, 3, 4, 4], 'x': [1, 2, 2, 3, 1, 3]}, index=np.arange(6)*10)
In [46]: df
Out[46]:
x y
0 1 0
10 2 1
20 2 2
30 3 3
40 1 4
50 3 4
In [47]: df.reset_index()
Out[47]:
index x y
0 0 1 0
1 10 2 1
2 20 2 2
3 30 3 3
4 40 1 4
5 50 3 4
So pivot uses the index as values:
In [48]: df.reset_index().pivot(index='y', columns='x')
Out[48]:
index
x 1 2 3
y
0 0 NaN NaN
1 NaN 10 NaN
2 NaN 20 NaN
3 NaN NaN 30
4 40 NaN 50
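A minimal sketch of the same idea with values= given explicitly, so the result has flat columns instead of an extra 'index' level. The name 'index' is what reset_index uses by default for an unnamed index; the variable name wide is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'y': [0, 1, 2, 3, 4, 4], 'x': [1, 2, 2, 3, 1, 3]},
                  index=np.arange(6) * 10)

# reset_index moves the old index into a column called 'index',
# which can then be used as the values of the pivot
wide = df.reset_index().pivot(index='y', columns='x', values='index')
print(wide)
```

pivot requires every (y, x) pair to be unique, which holds for this data.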