pandas: slice a MultiIndex DataFrame by range of secondary index - python-2.7

It has been posted that slicing on the second index can be done on a multi-indexed pandas Series:
import numpy as np
import pandas as pd
buckets = np.repeat(range(3), [3,5,7])
sequence = np.hstack(map(range,[3,5,7]))
s = pd.Series(np.random.randn(len(sequence)),
index=pd.MultiIndex.from_tuples(zip(buckets, sequence)))
print s
0 0 0.021362
1 0.917947
2 -0.956313
1 0 -0.242659
1 0.398657
2 0.455909
3 0.200061
4 -1.273537
2 0 0.747849
1 -0.012899
2 1.026659
3 -0.256648
4 0.799381
5 0.064147
6 0.491336
Then to get the first three rows for the first index=1, you simply say:
s[1].ix[range(3)]
0 -0.242659
1 0.398657
2 0.455909
This works fine for 1-dimensional Series, but not for DataFrames:
buckets = np.repeat(range(3), [3,5,7])
sequence = np.hstack(map(range,[3,5,7]))
d = pd.DataFrame(np.random.randn(len(sequence),2),
index=pd.MultiIndex.from_tuples(zip(buckets, sequence)))
print d
0 1
0 0 1.217659 0.312286
1 0.559782 0.686448
2 -0.143116 1.146196
1 0 -0.195582 0.298426
1 1.504944 -0.205834
2 0.018644 -0.979848
3 -0.387756 0.739513
4 0.719952 -0.996502
2 0 0.065863 0.481190
1 -1.309163 0.881319
2 0.545382 2.048734
3 0.506498 0.451335
4 0.872743 -0.070985
5 -1.160473 1.082550
6 0.331796 -0.366597
d[1].ix[range(3)]
0 0 0.312286
1 0.686448
2 1.146196
Name: 1
It gives you the "1th" column of data, and the first three rows, irrespective of the first index level. How can you get the first three rows for the first index=1 for a multi-indexed DataFrame?

d.xs(1)[0:3]
0 1
0 -0.716206 0.119265
1 -0.782315 0.097844
2 2.042751 -1.116453

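.loc can do this in a single step, selecting on both index levels at once.
d.loc[pd.IndexSlice[1, :2], :] returns the rows where level 0 equals 1 and level 1 runs from 0 to 2. Label slices include their end point, so :3 would return four rows (0 through 3), not three.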
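The two approaches above can be compared side by side. This is a minimal sketch written for Python 3 and a recent pandas, not the original poster's code: the frame is rebuilt with deterministic values (an assumption, to keep the result reproducible), and the names by_label and by_position are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same index layout as in the question: groups of 3, 5 and 7 rows.
buckets = np.repeat(np.arange(3), [3, 5, 7])
sequence = np.concatenate([np.arange(n) for n in (3, 5, 7)])
index = pd.MultiIndex.from_arrays([buckets, sequence])

# Deterministic values instead of random ones (assumption for reproducibility).
d = pd.DataFrame(np.arange(len(sequence) * 2).reshape(-1, 2), index=index)

# Label-based: level 0 == 1, level 1 from 0 to 2 inclusive.
# This requires the MultiIndex to be sorted, which it is here.
by_label = d.loc[pd.IndexSlice[1, 0:2], :]

# Position-based: take the cross-section for level 0 == 1, then the first 3 rows.
by_position = d.xs(1).iloc[:3]

print(by_label)
print(by_position)
```

Both select the same three rows. The .loc result keeps both index levels, while the xs result drops the first level.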

Related

How to add a number to a portion of dataframe column in pandas?

I have a dataframe with two columns A and B.
A B
1 0
2 0
3 1
4 2
5 0
6 3
What I want to do is add column A to column B, but only where column B is non-zero, and store the result in column B.
A B
1 0
2 0
3 4
4 6
5 0
6 9
Thank you in advance for your help and suggestions.
Use .loc with a boolean mask:
In [49]:
df.loc[df['B'] != 0, 'B'] = df['A'] + df['B']
df
Out[49]:
A B
0 1 0
1 2 0
2 3 4
3 4 6
4 5 0
5 6 9

Assigning value to multiindexed pandas dataframe based on mix of integer and labels indexing

I have a dataframe with multiindexed columns. I want to select on the first level based on the column name, and then return all columns but the last one, and assign a new value to all these elements.
Here's a sample dataframe:
In [1]: mydf = pd.DataFrame(np.random.random_integers(low=1,high=5,size=(4,9)),
columns = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']]))
Out[1]:
A B C
a b c a b c a b c
0 4 1 2 1 4 2 1 1 3
1 4 4 1 2 3 4 2 2 3
2 2 3 4 1 2 1 3 2 3
3 1 3 4 2 3 4 1 5 1
I want to be able to assign to these elements, for example:
In [2]: mydf.loc[:,('A')].iloc[:,:-1]
Out[2]:
A
a b
0 4 1
1 4 4
2 2 3
3 1 3
If I wanted to modify one column only, I know how to select it properly with a tuple so that the assigning works:
In [3]: mydf.loc[:,('A','a')] = 0
In [4]: mydf.loc[:,('A','a')]
Out[4]:
0 0
1 0
2 0
3 0
Name: (A, a), dtype: int32
So that worked well.
Now the following doesn't work...
In [5]: mydf.loc[:,('A')].ix[:,:-1] = 6 - mydf.loc[:,('A')].ix[:,:-1]
In [6]: mydf.loc[:,('A')].iloc[:,:-1] = 6 - mydf.loc[:,('A')].iloc[:,:-1]
Sometimes I will, and sometimes I won't, get the warning that a value is trying to be set on a copy of a slice from a DataFrame. But in both cases it doesn't actually assign.
I've tried pretty much everything I could think of, and I still can't figure out how to mix label and integer indexing in order to set the values correctly.
Any idea please?
Versions:
Python 2.7.9
Pandas 0.16.1
This is not directly supported, as .loc MUST have labels and NOT positions. In theory .ix could support this with multi-index slicers, but that runs into the usual complication of figuring out what the user 'meant' (e.g. is it a label or a position?).
In [63]: df = pd.DataFrame(np.random.random_integers(low=1,high=5,size=(4,9)),
columns = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']]))
In [64]: df
Out[64]:
A B C
a b c a b c a b c
0 4 4 4 4 3 2 5 1 4
1 1 2 1 3 2 1 1 4 5
2 3 2 4 4 2 2 3 1 4
3 5 1 1 3 1 1 5 5 5
So we compute the indexer for the 'A' block with get_loc (a slice here); np.r_ turns this slice into an array of integer positions; then we select the element we want (e.g. 0 in this case). This feeds into .iloc.
In [65]: df.iloc[:,np.r_[df.columns.get_loc('A')][0]] = 0
In [66]: df
Out[66]:
A B C
a b c a b c a b c
0 0 4 4 4 3 2 5 1 4
1 0 2 1 3 2 1 1 4 5
2 0 2 4 4 2 2 3 1 4
3 0 1 1 3 1 1 5 5 5
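The same idea can be extended to the original request, assigning to every column of the 'A' block except the last. The sketch below targets Python 3 and a recent pandas and reuses the values printed in Out[64]. The name positions is made up for illustration, and the sketch assumes the columns are sorted so that get_loc returns a slice.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([["A", "B", "C"], ["a", "b", "c"]])
df = pd.DataFrame(
    [[4, 4, 4, 4, 3, 2, 5, 1, 4],
     [1, 2, 1, 3, 2, 1, 1, 4, 5],
     [3, 2, 4, 4, 2, 2, 3, 1, 4],
     [5, 1, 1, 3, 1, 1, 5, 5, 5]],
    columns=columns,
)

# get_loc('A') gives a slice covering the 'A' block (assumes sorted columns);
# np.r_ expands it to integer positions, and [:-1] drops the last one.
positions = np.r_[df.columns.get_loc("A")][:-1]

# Assign through .iloc on the original frame, so nothing is set on a copy.
# to_numpy() sidesteps any label alignment on the right-hand side.
df.iloc[:, positions] = 6 - df.iloc[:, positions].to_numpy()

print(df)
```

Because the assignment goes through a single .iloc call on df itself, it avoids the chained indexing that triggered the copy warning in the question.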

pandas - Perform computation against a reference record within groups

For each row of data in a DataFrame I would like to compute the number of unique values in columns A and B for that particular row and a reference row within the group identified by another column ID. Here is a toy dataset:
d = {'ID' : pd.Series([1,1,1,2,2,2,2,3,3])
,'A' : pd.Series([1,2,3,4,5,6,7,8,9])
,'B' : pd.Series([1,2,3,4,11,12,13,14,15])
,'REFERENCE' : pd.Series([1,0,0,0,0,1,0,1,0])}
data = pd.DataFrame(d)
The data looks like this:
In [3]: data
Out[3]:
A B ID REFERENCE
0 1 1 1 1
1 2 2 1 0
2 3 3 1 0
3 4 4 2 0
4 5 11 2 0
5 6 12 2 1
6 7 13 2 0
7 8 14 3 1
8 9 15 3 0
Now, within each group defined using ID I want to compare each record with the reference record and I want to compute the number of unique A and B values for the combination. For instance, I can compute the value for data record 3 by taking len(set([4,4,6,12])) which gives 3. The result should look like this:
A B ID REFERENCE CARDINALITY
0 1 1 1 1 1
1 2 2 1 0 2
2 3 3 1 0 2
3 4 4 2 0 3
4 5 11 2 0 4
5 6 12 2 1 2
6 7 13 2 0 4
7 8 14 3 1 2
8 9 15 3 0 3
The only way I can think of implementing this is using for loops that loop over each grouped object and then each record within the grouped object and computes it against the reference record. This is non-pythonic and very slow. Can anyone please suggest a vectorized approach to achieve the same?
I would create a new column that combines A and B into a tuple, and then group the data. After that, use groups = dict(list(groupby)) to get one frame per group, and get the length of each frame using len().
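A different route that avoids explicit loops is to attach each group's reference record to every row with a merge, then count distinct values across the four columns. This is only a sketch, not the approach described above: it assumes exactly one REFERENCE == 1 row per ID, and the names ref, merged, A_REF and B_REF are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "ID":        [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "A":         [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "B":         [1, 2, 3, 4, 11, 12, 13, 14, 15],
    "REFERENCE": [1, 0, 0, 0, 0, 1, 0, 1, 0],
})

# One reference row per ID (assumption), with its A/B values renamed.
ref = (data.loc[data["REFERENCE"] == 1, ["ID", "A", "B"]]
           .rename(columns={"A": "A_REF", "B": "B_REF"}))

# A left merge keeps the original row order and adds the reference values.
merged = data.merge(ref, on="ID", how="left")

# Count distinct values per row across the row's own and the reference values.
data["CARDINALITY"] = (
    merged[["A", "B", "A_REF", "B_REF"]].nunique(axis=1).to_numpy()
)

print(data)
```

Under the rule stated in the question, the last row evaluates to 4, since len(set([9, 15, 8, 14])) is 4, rather than the 3 shown in the posted table.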

Create a dummy variable for the last rows based on another variable

I would like to create a dummy variable that will look at the variable "count" and label the rows as 1 starting from the last row of each id. As an example ID 1 has count of 3 and the last three rows of this id will have such pattern: 0,0,1,1,1 Similarly, ID 4 which has a count of 1 will have 0,0,0,1. The IDs have different number of rows. The variable "wish" shows what I want to obtain as a final output.
input byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
For future questions, you should provide your failed attempts. This shows that you have done your part, namely, research your problem.
One way is:
clear
set more off
*----- example data -----
input ///
byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
list, sepby(id)
*----- what you want -----
bysort id: gen wish2 = _n > (_N - count)
list, sepby(id)
I assume you already sorted your date variable within ids.
One way to accomplish this would be to use within-group row numbers using 'bysort'-type logic:
***Create variable of within-group row numbers.
bysort id: gen obsnum = _n
***Calculate total number of rows within each group.
by id: egen max_obsnum = max(obsnum)
***Subtract the count variable from the group row count.
***This is the number of rows where we want the dummy to equal zero.
gen max_obsnum_less_count = max_obsnum - count
***Create the dummy to equal one when the row number is
***greater than this last variable.
gen dummy = (obsnum > max_obsnum_less_count)
***Clean up.
drop obsnum max_obsnum max_obsnum_less_count
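For readers working in pandas rather than Stata, the same within-group logic can be sketched as follows. This is an illustrative translation, not part of either answer: the column name wish2 mirrors the Stata variable, pos and size are made-up names, and only ids 1 and 4 from the example data are included to keep it short. The rows are assumed to be sorted by date within each id already.

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 1, 4, 4, 4, 4],
    "count": [3, 3, 3, 3, 3, 1, 1, 1, 1],
})

# 1-based row number within each id, analogous to Stata's _n.
pos = df.groupby("id").cumcount() + 1

# Number of rows in each id, analogous to Stata's _N.
size = df.groupby("id")["id"].transform("size")

# Flag the last `count` rows of each id.
df["wish2"] = (pos > size - df["count"]).astype(int)

print(df)
```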

Function that given these 2 values produces this third value?

I'm trying to write a function that when given 2 arguments, the 2 leftmost columns, produces the third column as a result:
0 0 0
1 0 3
2 0 2
3 0 1
0 1 1
1 1 0
2 1 3
3 1 2
0 2 2
1 2 1
2 2 0
3 2 3
0 3 3
1 3 2
2 3 1
3 3 0
I know there will be a modulus involved but I can't quite figure it out.
I'm trying to figure out if 4 people are sitting at a table, given the person and target, from the person's perspective which seat is the target sitting in?
Thanks
If a and b are the positions of the two persons, their "distance" is:
(4+b-a) % 4
This also shows that the fourth block in your example is wrong.
Assuming that last block of numbers is wrong, I think you're looking for c = (4 + b - a) % 4 (for columns a, b, c).
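As a quick check, the formula can be written as a small Python function. This is a sketch: the function name relative_seat and the seats parameter are made up for illustration, and a and b are the first two columns as in the answers above. In Python the % operator already returns a non-negative result for a positive modulus, so the leading 4 + is not needed there. It is needed in languages such as C or Java, where % can return a negative value.

```python
def relative_seat(a, b, seats=4):
    """Offset from seat a to seat b around a table with `seats` places."""
    return (b - a) % seats


# Enumerate (a, b, c) in the same order as the table in the question.
table = [(a, b, relative_seat(a, b)) for b in range(4) for a in range(4)]

for row in table:
    print(*row)
```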