Pandas grouping values to column - python-2.7

I have a dataframe that looks like this:
a b c d e
0 0 1 2 1 0
1 3 0 0 4 3
2 3 4 0 4 2
3 4 1 0 4 3
4 2 1 3 4 3
5 3 2 0 3 3
6 2 1 1 1 0
7 1 1 0 3 3
8 3 3 3 3 4
9 2 3 4 2 2
I run the following command:
df.groupby('a').sum()
and I get:
b c d e
a
0 1 2 1 0
1 1 0 3 3
2 5 8 7 5
3 9 3 14 12
4 1 0 4 3
After that I want to access the labels:
labels = df['a']
but I get an error saying there is no such column.
So does pandas have some syntax to get something like this?
a b c d e
0 0 1 2 1 0
1 1 1 0 3 3
2 2 5 8 7 5
3 3 9 3 14 12
4 4 1 0 4 3
I need the sums of columns b, c, d and e grouped by column a, with a kept as a regular column alongside the relevant index.

You can just access the index with df.index and add it back into your dataframe as another column:
grouped_df = df.groupby('a').sum()
grouped_df['a'] = grouped_df.index
Alternatively, groupby has an as_index option that keeps 'a' as a regular column:
df.groupby('a', as_index=False).sum()
Or, after the groupby, you can use reset_index to put the column 'a' back.
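A minimal end-to-end sketch of the as_index/reset_index variants (assuming the lowercase column names from the question):
import pandas as pd

df = pd.DataFrame({'a': [0, 3, 3, 4, 2, 3, 2, 1, 3, 2],
                   'b': [1, 0, 4, 1, 1, 2, 1, 1, 3, 3],
                   'c': [2, 0, 0, 0, 3, 0, 1, 0, 3, 4],
                   'd': [1, 4, 4, 4, 4, 3, 1, 3, 3, 2],
                   'e': [0, 3, 2, 3, 3, 3, 0, 3, 4, 2]})

out = df.groupby('a', as_index=False).sum()   # 'a' stays a regular column
# equivalently: out = df.groupby('a').sum().reset_index()
labels = out['a']                             # no KeyError now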

Related

Pandas: count when condition is met in subgroups

I have a dataframe that looks like:
subgroup value
0 1 0
1 1 1
2 1 1
3 1 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 1
9 3 0
10 3 0
I need to add a column with a running count that increases by 1 whenever a subgroup contains at least one non-zero value. Please note that if the value 1 is repeated more than once in the same subgroup, it doesn't affect the count.
The result should be:
subgroup value count
0 1 0 1
1 1 1 1
2 1 1 1
3 1 0 1
4 2 0 1
5 2 0 1
6 2 0 1
7 3 0 2
8 3 1 2
9 3 0 2
10 3 0 2
Thank you in advance for your help!
Using shift with -1 and 1, then cumsum on the resulting mask:
mask=(df.value.ne(df.value.shift()))&(df.value.ne(df.value.shift(-1)))
mask.cumsum()
Out[18]:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 2
9 2
10 2
Name: value, dtype: int32
Using merge and groupby
df.merge(df.groupby('subgroup').value.sum().gt(0).cumsum().reset_index(name='out'))
subgroup value out
0 1 0 1
1 1 1 1
2 1 1 1
3 1 0 1
4 2 0 1
5 2 0 1
6 2 0 1
7 3 0 2
8 3 1 2
9 3 0 2
10 3 0 2
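A more direct variant (my own sketch, not from the answers above): flag each subgroup that contains a non-zero value with any, take a cumulative count over the subgroups, and map it back onto the rows:
import pandas as pd

df = pd.DataFrame({'subgroup': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'value':    [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]})

# one entry per subgroup: cumulative number of subgroups containing a non-zero
flags = df.groupby('subgroup')['value'].any().astype(int).cumsum()
df['count'] = df['subgroup'].map(flags)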

How to add a number to a portion of dataframe column in pandas?

I have a dataframe with two columns A and B.
A B
1 0
2 0
3 1
4 2
5 0
6 3
What I want to do is add column A to column B, but only for the rows where column B is non-zero, and put the result in column B.
A B
1 0
2 0
3 4
4 6
5 0
6 9
Thank you for your help and suggestions in advance.
use .loc with a boolean mask:
In [49]:
df.loc[df['B'] != 0, 'B'] = df['A'] + df['B']
df
Out[49]:
A B
0 1 0
1 2 0
2 3 4
3 4 6
4 5 0
5 6 9
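If you prefer a single expression, np.where does the same thing (a sketch, not part of the original answer):
import numpy as np
df['B'] = np.where(df['B'] != 0, df['A'] + df['B'], df['B'])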

Assigning value to multiindexed pandas dataframe based on mix of integer and labels indexing

I have a dataframe with multiindexed columns. I want to select on the first level based on the column name, and then return all columns but the last one, and assign a new value to all these elements.
Here's a sample dataframe:
In [1]: mydf = pd.DataFrame(np.random.random_integers(low=1,high=5,size=(4,9)),
columns = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']]))
Out[1]:
A B C
a b c a b c a b c
0 4 1 2 1 4 2 1 1 3
1 4 4 1 2 3 4 2 2 3
2 2 3 4 1 2 1 3 2 3
3 1 3 4 2 3 4 1 5 1
I want to be able to assign to these elements, for example:
In [2]: mydf.loc[:,('A')].iloc[:,:-1]
Out[2]:
A
a b
0 4 1
1 4 4
2 2 3
3 1 3
If I wanted to modify one column only, I know how to select it properly with a tuple so that the assigning works:
In [3]: mydf.loc[:,('A','a')] = 0
In [4]: mydf.loc[:,('A','a')]
Out[4]:
0 0
1 0
2 0
3 0
Name: (A, a), dtype: int32
So that worked well.
Now the following doesn't work...
In [5]: mydf.loc[:,('A')].ix[:,:-1] = 6 - mydf.loc[:,('A')].ix[:,:-1]
In [6]: mydf.loc[:,('A')].iloc[:,:-1] = 6 - mydf.loc[:,('A')].iloc[:,:-1]
Sometimes I will, and sometimes I won't, get the warning that a value is trying to be set on a copy of a slice from a DataFrame, but in both cases it doesn't actually assign.
I've pretty much tried everything I could think of, and I still can't figure out how to mix label and integer indexing in order to set the values correctly.
Any ideas?
Versions:
Python 2.7.9
Pandas 0.16.1
This is not directly supported, as .loc MUST have labels and NOT positions. In theory .ix could support this with multi-index slicers, but there are the usual complications of figuring out what is 'meant' by the user (e.g. is it a label or a position?).
In [63]: df = pd.DataFrame(np.random.random_integers(low=1,high=5,size=(4,9)),
columns = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']]))
In [64]: df
Out[64]:
A B C
a b c a b c a b c
0 4 4 4 4 3 2 5 1 4
1 1 2 1 3 2 1 1 4 5
2 3 2 4 4 2 2 3 1 4
3 5 1 1 3 1 1 5 5 5
So we compute the indexer for the 'A' block: df.columns.get_loc('A') returns a slice, np.r_ turns that slice into an actual array of positions, and then we select an element (e.g. 0 in this case). This feeds into .iloc.
In [65]: df.iloc[:,np.r_[df.columns.get_loc('A')][0]] = 0
In [66]: df
Out[66]:
A B C
a b c a b c a b c
0 0 4 4 4 3 2 5 1 4
1 0 2 1 3 2 1 1 4 5
2 0 2 4 4 2 2 3 1 4
3 0 1 1 3 1 1 5 5 5
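Extending this to the question's actual goal, all but the last column of the 'A' block, is a small step. A sketch along the same lines (using np.random.randint in place of the long-deprecated random_integers):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 6, size=(4, 9)),
                  columns=pd.MultiIndex.from_product([['A', 'B', 'C'],
                                                      ['a', 'b', 'c']]))
# positions of the 'A' block, minus the last one
cols = np.r_[df.columns.get_loc('A')][:-1]
df.iloc[:, cols] = 6 - df.iloc[:, cols]   # purely positional, so it assigns in place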

pandas - Perform computation against a reference record within groups

For each row of data in a DataFrame I would like to compute the number of unique values in columns A and B for that particular row and a reference row within the group identified by another column ID. Here is a toy dataset:
d = {'ID' : pd.Series([1,1,1,2,2,2,2,3,3])
,'A' : pd.Series([1,2,3,4,5,6,7,8,9])
,'B' : pd.Series([1,2,3,4,11,12,13,14,15])
,'REFERENCE' : pd.Series([1,0,0,0,0,1,0,1,0])}
data = pd.DataFrame(d)
The data looks like this:
In [3]: data
Out[3]:
A B ID REFERENCE
0 1 1 1 1
1 2 2 1 0
2 3 3 1 0
3 4 4 2 0
4 5 11 2 0
5 6 12 2 1
6 7 13 2 0
7 8 14 3 1
8 9 15 3 0
Now, within each group defined using ID I want to compare each record with the reference record and I want to compute the number of unique A and B values for the combination. For instance, I can compute the value for data record 3 by taking len(set([4,4,6,12])) which gives 3. The result should look like this:
A B ID REFERENCE CARDINALITY
0 1 1 1 1 1
1 2 2 1 0 2
2 3 3 1 0 2
3 4 4 2 0 3
4 5 11 2 0 4
5 6 12 2 1 2
6 7 13 2 0 4
7 8 14 3 1 2
8 9 15 3 0 4
The only way I can think of implementing this is using for loops that loop over each grouped object and then each record within the grouped object and computes it against the reference record. This is non-pythonic and very slow. Can anyone please suggest a vectorized approach to achieve the same?
I would create a new column combining A and B into a tuple, group by ID, turn the grouped object into a dict with groups = dict(list(grouped)), and then get the length of each frame using len().
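For a vectorized approach, here is a sketch (my own, not part of the answer above): broadcast each group's reference A and B values onto its rows, then count distinct values row-wise with nunique:
import pandas as pd

d = {'ID': pd.Series([1, 1, 1, 2, 2, 2, 2, 3, 3]),
     'A': pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9]),
     'B': pd.Series([1, 2, 3, 4, 11, 12, 13, 14, 15]),
     'REFERENCE': pd.Series([1, 0, 0, 0, 0, 1, 0, 1, 0])}
data = pd.DataFrame(d)

# one reference row per ID, indexed by ID for alignment
ref = data[data['REFERENCE'] == 1].set_index('ID')
data['REF_A'] = data['ID'].map(ref['A'])
data['REF_B'] = data['ID'].map(ref['B'])

# number of distinct values among {A, B, REF_A, REF_B} per row
data['CARDINALITY'] = data[['A', 'B', 'REF_A', 'REF_B']].nunique(axis=1)
data = data.drop(['REF_A', 'REF_B'], axis=1)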

Find all puddles on the square (algorithm)

The problem is defined as follows:
You're given a square. The square is covered with flat flagstones of size 1m x 1m. Grass surrounds the square. Flagstones may be at different heights. It starts raining. Determine where puddles will be created and compute how much water they will contain. Water doesn't flow through the corners. Any area of grass can soak up any volume of water at any time.
Input:
width height
width*height non-negative numbers describing a height of each flagstone over grass level.
Output:
Total volume of water in the puddles.
width*height characters marking the places where puddles will form and the places where they won't.
. - no puddle
# - puddle
Examples
Input:
8 8
0 0 0 0 0 1 0 0
0 1 1 1 0 1 0 0
0 1 0 2 1 2 4 5
0 1 1 2 0 2 4 5
0 3 3 3 3 3 3 4
0 3 0 1 2 0 3 4
0 3 3 3 3 3 3 0
0 0 0 0 0 0 0 0
Output:
11
........
........
..#.....
....#...
........
..####..
........
........
Input:
16 16
8 0 1 0 0 0 0 2 2 4 3 4 5 0 0 2
6 2 0 5 2 0 0 2 0 1 0 3 1 2 1 2
7 2 5 4 5 2 2 1 3 6 2 0 8 0 3 2
2 5 3 3 0 1 0 3 3 0 2 0 3 0 1 1
1 0 1 4 1 1 2 0 3 1 1 0 1 1 2 0
2 6 2 0 0 3 5 5 4 3 0 4 2 2 2 1
4 2 0 0 0 1 1 2 1 2 1 0 4 0 5 1
2 0 2 0 5 0 1 1 2 0 7 5 1 0 4 3
13 6 6 0 10 8 10 5 17 6 4 0 12 5 7 6
7 3 0 2 5 3 8 0 3 6 1 4 2 3 0 3
8 0 6 1 2 2 6 3 7 6 4 0 1 4 2 1
3 5 3 0 0 4 4 1 4 0 3 2 0 0 1 0
13 3 6 0 7 5 3 2 21 8 13 3 5 0 13 7
3 5 6 2 2 2 0 2 5 0 7 0 1 3 7 5
7 4 5 3 4 5 2 0 23 9 10 5 9 7 9 8
11 5 7 7 9 7 1 0 17 13 7 10 6 5 8 10
Output:
103
................
..#.....###.#...
.......#...#.#..
....###..#.#.#..
.#..##.#...#....
...##.....#.....
..#####.#..#.#..
.#.#.###.#..##..
...#.......#....
..#....#..#...#.
.#.#.......#....
...##..#.#..##..
.#.#.........#..
......#..#.##...
.#..............
................
I tried different approaches: floodfill from the max value, then from the min value, but it either doesn't work for every input or requires complicated code. Any ideas?
I'm interested in an algorithm with complexity O(n^2) or O(n^3).
Summary
I would be tempted to try and solve this using a disjoint-set data structure.
The algorithm would be to iterate over all heights in the map performing a floodfill operation at each height.
Details
For each height x (starting at 0)
Connect all flagstones of height x to their neighbours if the neighbour height is <= x (storing connected sets of flagstones in the disjoint set data structure)
Remove any sets that connected to the grass
Mark all flagstones of height x in still remaining sets as being puddles
Add the total count of flagstones in remaining sets to a total t
At the end, t gives the total volume of water. (A code sketch follows the worked example below.)
Worked Example
0 0 0 0 0 1 0 0
0 1 1 1 0 1 0 0
0 1 0 2 1 2 4 5
0 1 1 2 0 2 4 5
0 3 3 3 3 3 3 4
0 3 0 1 2 0 3 4
0 3 3 3 3 3 3 0
0 0 0 0 0 0 0 0
Connect all flagstones of height 0 into sets A,B,C,D,E,F
A A A A A 1 B B
A 1 1 1 A 1 B B
A 1 C 2 1 2 4 5
A 1 1 2 D 2 4 5
A 3 3 3 3 3 3 4
A 3 E 1 2 F 3 4
A 3 3 3 3 3 3 A
A A A A A A A A
Remove flagstones connecting to the grass, and mark remaining as puddles
1
1 1 1 1
1 C 2 1 2 4 5 #
1 1 2 D 2 4 5 #
3 3 3 3 3 3 4
3 E 1 2 F 3 4 # #
3 3 3 3 3 3
Count remaining set size t=4
Connect all of height 1
G
C C C G
C C 2 D 2 4 5 #
C C 2 D 2 4 5 #
3 3 3 3 3 3 4
3 E E 2 F 3 4 # #
3 3 3 3 3 3
Remove flagstones connecting to the grass, and mark remaining as puddles
2 2 4 5 #
2 2 4 5 #
3 3 3 3 3 3 4
3 E E 2 F 3 4 # # #
3 3 3 3 3 3
t=4+3=7
Connect all of height 2
A B 4 5 #
A B 4 5 #
3 3 3 3 3 3 4
3 E E E E 3 4 # # #
3 3 3 3 3 3
Remove flagstones connecting to the grass, and mark remaining as puddles
4 5 #
4 5 #
3 3 3 3 3 3 4
3 E E E E 3 4 # # # #
3 3 3 3 3 3
t=7+4=11
Connect all of height 3
4 5 #
4 5 #
E E E E E E 4
E E E E E E 4 # # # #
E E E E E E
Remove flagstones connecting to the grass, and mark remaining as puddles
4 5 #
4 5 #
4
4 # # # #
After doing this for heights 4 and 5 nothing will remain.
A preprocessing step to create lists of all locations with each height should mean that the algorithm is close to O(n^2).
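Here is a compact Python sketch of that algorithm (my own illustration; it keeps a virtual "grass" node in a union-find instead of explicitly deleting sets, and is written for clarity rather than speed):
from collections import defaultdict

def puddles(grid):
    h, w = len(grid), len(grid[0])
    GRASS = h * w                              # virtual node for the surrounding grass
    parent = list(range(h * w + 1))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]      # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        if ra == GRASS:                        # keep the grass node as root
            parent[rb] = ra
        else:
            parent[ra] = rb

    by_height = defaultdict(list)
    for i in range(h):
        for j in range(w):
            by_height[grid[i][j]].append((i, j))

    active = []                                # cells with height <= current level
    is_active = [[False] * w for _ in range(h)]
    puddle = [[False] * w for _ in range(h)]
    water = 0

    for level in range(max(max(row) for row in grid) + 1):
        for (i, j) in by_height.get(level, []):
            is_active[i][j] = True
            active.append((i, j))
            if i in (0, h - 1) or j in (0, w - 1):
                union(i * w + j, GRASS)        # border cells drain into the grass
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w and is_active[ni][nj]:
                    union(i * w + j, ni * w + nj)
        # Every active cell whose set has not reached the grass holds one more
        # unit of water at this level. (Rescanning all active cells per level
        # keeps the sketch simple; tracking set sizes in the union-find would
        # avoid it.)
        grass_root = find(GRASS)
        for (i, j) in active:
            if find(i * w + j) != grass_root:
                water += 1
                puddle[i][j] = True
    return water, puddle

On the first example above this yields water == 11 and marks the same cells as the expected output.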
This seems to be working nicely. The idea is that it is a recursive function that checks whether there is an "outward flow" that will allow the water to escape to the edge; cells that do not have such an escape will puddle. I tested it on your two input files and it works quite nicely. I copied the output for these two files for you. Pardon my nasty use of global variables and whatnot; I figured it was the concept behind the algorithm that mattered, not good style :)
#include <fstream>
#include <iostream>
using namespace std;

int SIZE_X;
int SIZE_Y;
bool **result;
int **INPUT;

// Returns true if water sitting at height 'value' can escape from (x, y)
// to the edge of the square.
bool flowToEdge(int x, int y, int value, bool* visited) {
    if (x < 0 || x == SIZE_X || y < 0 || y == SIZE_Y) return true;
    if (visited[(x * SIZE_Y) + y]) return false;  // note SIZE_Y: row-major indexing
    if (value < INPUT[x][y]) return false;        // blocked by a taller flagstone
    visited[(x * SIZE_Y) + y] = true;
    bool left  = flowToEdge(x - 1, y, value, visited);
    bool right = flowToEdge(x + 1, y, value, visited);
    bool up    = flowToEdge(x, y + 1, value, visited);
    bool down  = flowToEdge(x, y - 1, value, visited);
    return (left || up || down || right);
}

int main() {
    ifstream myReadFile;
    myReadFile.open("test.txt");
    myReadFile >> SIZE_X;
    myReadFile >> SIZE_Y;
    INPUT  = new int*[SIZE_X];
    result = new bool*[SIZE_X];
    for (int i = 0; i < SIZE_X; i++) {
        INPUT[i]  = new int[SIZE_Y];
        result[i] = new bool[SIZE_Y];
        for (int j = 0; j < SIZE_Y; j++) {
            myReadFile >> INPUT[i][j];
            result[i][j] = false;
        }
    }
    for (int i = 0; i < SIZE_X; i++) {
        for (int j = 0; j < SIZE_Y; j++) {
            // Fresh zero-initialised visited array for each starting cell.
            // (You could avoid this by memoising results across cells.)
            bool* visited = new bool[SIZE_X * SIZE_Y]();
            result[i][j] = flowToEdge(i, j, INPUT[i][j], visited);
            delete[] visited;
        }
    }
    for (int i = 0; i < SIZE_X; i++) {
        cout << endl;
        for (int j = 0; j < SIZE_Y; j++)
            cout << result[i][j];
    }
    cout << endl;
}
The 16 by 16 file:
1111111111111111
1101111100010111
1111111011101011
1111000110101011
1011001011101111
1110011111011111
1100000101101011
1010100010110011
1110111111101111
1101101011011101
1010111111101111
1110011010110011
1010111111111011
1111110110100111
1011111111111111
1111111111111111
The 8 by 8 file
11111111
11111111
11011111
11110111
11111111
11000011
11111111
11111111
You could optimize this algorithm easily and considerably by doing several things. Returning true immediately upon finding a route would speed it up considerably. You could also share results globally, so that any given point would only have to find a flow path to an already-known escaping point, and not all the way to the edge.
As for the work involved: each cell may have to examine every node, so in the worst case this is still an O(n^3) algorithm, although with the optimizations above we should be able to get it much lower than n^2 for most cases. Building that properly would be very difficult, though (with the right optimization logic... dynamic programming for the win!).
EDIT:
The modified code works for the following circumstances:
8 8
1 1 1 1 1 1 1 1
1 0 0 0 0 0 0 1
1 0 1 1 1 1 0 1
1 0 1 0 0 1 0 1
1 0 1 1 0 1 0 1
1 0 1 1 0 1 0 1
1 0 0 0 0 1 0 1
1 1 1 1 1 1 1 1
And these are the results:
11111111
10000001
10111101
10100101
10110101
10110101
10000101
11111111
Now when we remove that 1 at the bottom we want to see no puddling.
8 8
1 1 1 1 1 1 1 1
1 0 0 0 0 0 0 1
1 0 1 1 1 1 0 1
1 0 1 0 0 1 0 1
1 0 1 1 0 1 0 1
1 0 1 1 0 1 0 1
1 0 0 0 0 1 0 1
1 1 1 1 1 1 0 1
And these are the results
11111111
11111111
11111111
11111111
11111111
11111111
11111111
11111111