How to shuffle data in pandas? [duplicate] - python-2.7

What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e. how to write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times.
Edit: key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index that loses all that information. I want the resulting df to be the same as the original except with the order of rows or order of columns different.
Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't have the same associations between a and b as you do if you just re-order each row as a whole. Something like:
for 1...n:
    for each col in df: shuffle column
return new_df
But hopefully more efficient than naive looping. This does not work for me:
def shuffle(df, n, axis=0):
    shuffled_df = df.copy()
    for k in range(n):
        shuffled_df.apply(np.random.shuffle(shuffled_df.values),axis=axis)
    return shuffled_df
df = pandas.DataFrame({'A':range(10), 'B':range(10)})
shuffle(df, 5)

Use numpy's random.permutation function:
In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [2]: df
Out[2]:
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
A B
0 0 0
5 5 5
6 6 6
3 3 3
8 8 8
7 7 7
9 9 9
1 1 1
2 2 2
4 4 4
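The same idea extends to columns. The following is a minimal sketch, not the answer's own code: it permutes positions with iloc instead of calling reindex, and the helper name shuffle_labels and its seed parameter are made up for illustration. It reorders whole rows or whole columns, so labels stay attached to their data (the behaviour asked for in the first edit, not the independent per-column shuffle of the second edit).

```python
import numpy as np
import pandas as pd


def shuffle_labels(df, axis=0, seed=None):
    """Return df with whole rows (axis=0) or whole columns (axis=1) reordered.

    Hypothetical helper for illustration; labels stay attached to their data.
    """
    rng = np.random.default_rng(seed)
    # Permute positions rather than labels so duplicate labels are handled too.
    order = rng.permutation(df.shape[axis])
    return df.iloc[order] if axis == 0 else df.iloc[:, order]


df = pd.DataFrame({"A": range(10), "B": range(10, 20)})
rows_shuffled = shuffle_labels(df, axis=0, seed=0)
cols_shuffled = shuffle_labels(df, axis=1, seed=0)
```

Passing a seed only makes the result reproducible; leave it as None for a fresh order on each call.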

Sampling randomizes, so just sample the entire data frame.
df.sample(frac=1)
As @Corey Levinson notes, you have to be careful when you reassign, because assignment aligns on the index:
df['column'] = df['column'].sample(frac=1).reset_index(drop=True)
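The alignment pitfall can be seen in a small sketch. The column names and data are invented for illustration, and random_state is passed only so the example is reproducible.

```python
import pandas as pd

df = pd.DataFrame({"column": [10, 20, 30, 40, 50], "other": list("abcde")})

# random_state is passed only to make the sketch reproducible.
shuffled = df["column"].sample(frac=1, random_state=0)

# Assigning the shuffled Series directly aligns on the index labels,
# so every value lands back on its original row.
aligned = df.copy()
aligned["column"] = shuffled

# Dropping the shuffled index lets the values be assigned in their shuffled order.
reassigned = df.copy()
reassigned["column"] = shuffled.reset_index(drop=True)
```

Here aligned is identical to df, while reassigned carries the shuffled values. This relies on df having a default 0..n-1 index; with another index, assigning shuffled.values would be the safer variant.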

In [16]: def shuffle(df, n=1, axis=0):
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:
In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [18]: shuffle(df)
In [19]: df
Out[19]:
A B
0 8 5
1 1 7
2 7 3
3 6 2
4 3 4
5 0 1
6 9 0
7 4 6
8 2 8
9 5 9

You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support Pandas data frames):
# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))
# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
df: A B
1 1 1
0 0 0
3 3 3
4 4 4
2 2 2
Then you can use df.reset_index() to reset the index column, if need be:
df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 1 1
1 0 0
2 4 4
3 2 2
4 3 3

A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:
df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
df.apply(lambda x: x.sample(frac=1).values)
a b
0 4 2
1 1 6
2 6 5
3 5 3
4 2 4
5 3 1
You must use .values so that you return a numpy array and not a Series; otherwise the returned Series aligns to the original DataFrame's index and nothing changes:
df.apply(lambda x: x.sample(frac=1))
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
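For the independent shuffle the question asks about, one alternative is NumPy's Generator.permuted (available from NumPy 1.20), which shuffles every slice along an axis on its own. This is a sketch under stated assumptions, not the answer's method: the helper name shuffle_independently and the seed parameter are hypothetical, and it assumes all columns share one dtype (mixed dtypes would be converted to object).

```python
import numpy as np
import pandas as pd


def shuffle_independently(df, axis=0, seed=None):
    """Shuffle the values inside each column (axis=0) or each row (axis=1) independently.

    Hypothetical helper; row and column labels are kept as they are.
    """
    rng = np.random.default_rng(seed)
    # Generator.permuted shuffles every 1-D slice along `axis` on its own.
    values = rng.permuted(df.to_numpy(), axis=axis)
    return pd.DataFrame(values, index=df.index, columns=df.columns)


df = pd.DataFrame({"a": range(100), "b": range(100)})
by_column = shuffle_independently(df, axis=0, seed=0)
```

Repeating a uniform shuffle n times does not make the result any more random, so the sketch has no n parameter.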

From the docs use sample():
In [79]: s = pd.Series([0,1,2,3,4,5])
# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]:
0 0
dtype: int64
# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]:
5 5
2 2
4 4
dtype: int64
# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]:
5 5
4 4
1 1
dtype: int64

I resorted to adapting @root's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing, but it works perfectly for just shuffling the data.
In [1]: import numpy
In [2]: import pandas
In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})
In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
1000 loops, best of 3: 406 µs per loop
In [5]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 1):
   ...:     numpy.random.shuffle(view)
   ...:
10000 loops, best of 3: 22.8 µs per loop
In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
1000 loops, best of 3: 746 µs per loop
In [7]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 0):
   ...:     numpy.random.shuffle(view)
   ...:
10000 loops, best of 3: 23.4 µs per loop
Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions. That is, if we want to shuffle along the first dimension (columns), we need to roll the second dimension to the front, so that we apply the shuffling to views over the first dimension.
In [8]: numpy.rollaxis(df.values, 0).shape
Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)
In [9]: numpy.rollaxis(df.values, 1).shape
Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)
Your final function then uses a trick to bring the result in line with the expectation for applying a function to an axis:
def shuffle(df, n=1, axis=0):
    df = df.copy()
    axis = int(not axis) # pandas.DataFrame is always 2D
    for _ in range(n):
        for view in numpy.rollaxis(df.values, axis):
            numpy.random.shuffle(view)
    return df

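This might be more useful when you want your index shuffled.
import random
def shuffle(df):
    index = list(df.index)
    random.shuffle(index)
    df = df.loc[index]  # .ix has been removed from pandas; .loc selects by label
    return df.reset_index(drop=True)  # reset_index returns a new frame, so return it
It selects the rows in the shuffled index order and then resets the index.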

I know the question is about a pandas df, but when the shuffle happens within each row (column order changed, row order unchanged), the column names no longer matter, so it may be worth using an np.array instead. In that case np.apply_along_axis() is what you are looking for.
If that is acceptable, the following should help. Note that it is easy to switch the axis along which the data is shuffled.
If your pandas data frame is named df, you can:
get the values of the dataframe with values = df.values,
create an np.array from values,
apply the method shown below to shuffle the np.array by row or column,
recreate a new (shuffled) pandas df from the shuffled np.array.
Original array
a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
Keep row order, shuffle columns within each row
print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
[22 21 20]
[31 30 32]
[40 41 42]]
Keep column order, shuffle rows within each column
print(np.apply_along_axis(np.random.permutation, 0, a))
[[40 41 32]
[20 31 42]
[10 11 12]
[30 21 22]]
Original array is unchanged
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
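The four steps listed above (extract the values, shuffle, rebuild) can be sketched end to end as follows. The column names x, y, z are invented for illustration; after a within-row shuffle they no longer describe the values underneath them, as noted above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]],
    columns=["x", "y", "z"],
)

# Steps 1 and 2: pull the values out as a NumPy array.
values = df.to_numpy()

# Step 3: shuffle within each row (axis=1); use axis=0 to shuffle within each column.
shuffled = np.apply_along_axis(np.random.permutation, 1, values)

# Step 4: rebuild a DataFrame, reusing the original labels.
shuffled_df = pd.DataFrame(shuffled, index=df.index, columns=df.columns)
```

Because np.random.permutation returns a new array for each slice, df itself is left untouched.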

Here is a workaround I found if you want to shuffle only a subset of the DataFrame:
shuffle_to_index = 20
df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])
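A variant of the same workaround builds a single positional order and indexes once, which avoids the concat. This is only a sketch: the 30-row frame is invented, and the seed is fixed just for reproducibility.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(30), "B": range(30)})
shuffle_to_index = 20  # number of leading rows to shuffle

rng = np.random.default_rng(0)
# Build one positional order: a permutation of the head followed by the untouched tail.
order = np.concatenate([
    rng.permutation(shuffle_to_index),
    np.arange(shuffle_to_index, len(df)),
])
partly_shuffled = df.iloc[order]
```

The first 20 rows come back in random order and the remaining rows keep their place.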

Related

3 dimensional numpy array to multiindex pandas dataframe

I have a 3 dimensional numpy array, (z, x, y). z is a time dimension and x and y are coordinates.
I want to convert this to a multiindexed pandas.DataFrame. I want the row index to be the z dimension
and each column to have values from a unique x, y coordinate (and so, each column would be multi-indexed).
The simplest case (not multi-indexed):
>>> array.shape
(500L, 120L, 100L)
>>> df = pd.DataFrame(array[:,0,0])
>>> df.shape
(500, 1)
I've been trying to pass the whole array into a multiindex dataframe using pd.MultiIndex.from_arrays but I'm getting an error:
NotImplementedError: > 1 ndim Categorical are not supported at this time
Looks like it should be fairly simple but I can't figure it out.
I find that a Series with a MultiIndex is the most analogous pandas datatype for a numpy array with arbitrarily many dimensions (presumably 3 or more).
Here is some example code:
import pandas as pd
import numpy as np
time_vals = np.linspace(1, 50, 50)
x_vals = np.linspace(-5, 6, 12)
y_vals = np.linspace(-4, 5, 10)
measurements = np.random.rand(50,12,10)
#setup multiindex
mi = pd.MultiIndex.from_product([time_vals, x_vals, y_vals], names=['time', 'x', 'y'])
#connect multiindex to data and save as multiindexed Series
sr_multi = pd.Series(index=mi, data=measurements.flatten())
#pull out a dataframe of x, y at time=22
sr_multi.xs(22, level='time').unstack(level=0)
#pull out a dataframe of y, time at x=3
sr_multi.xs(3, level='x').unstack(level=1)
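If the layout from the question is wanted directly (one row per z, one multi-indexed column per (x, y) pair), reshaping the array and pairing it with MultiIndex.from_product is one possible sketch. The small array sizes and the level names x, y, z are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the (z, x, y) array from the question.
n_time, n_x, n_y = 4, 3, 2
array = np.arange(n_time * n_x * n_y).reshape(n_time, n_x, n_y)

# One column per (x, y) pair, in the same C order that reshape uses.
columns = pd.MultiIndex.from_product([range(n_x), range(n_y)], names=["x", "y"])
wide = pd.DataFrame(
    array.reshape(n_time, n_x * n_y),
    index=pd.RangeIndex(n_time, name="z"),
    columns=columns,
)
```

With this layout, wide[(2, 1)] gives the time series at x=2, y=1, and wide.xs(1, level="x", axis=1) gives all y columns for x=1.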
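I think you can use Panel and then, for a MultiIndex DataFrame, add to_frame (note that pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so this only runs on older versions):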
np.random.seed(10)
arr = np.random.randint(10, size=(5,3,2))
print (arr)
[[[9 4]
[0 1]
[9 0]]
[[1 8]
[9 0]
[8 6]]
[[4 3]
[0 4]
[6 8]]
[[1 8]
[4 1]
[3 6]]
[[5 3]
[9 6]
[9 1]]]
df = pd.Panel(arr).to_frame()
print (df)
0 1 2 3 4
major minor
0 0 9 1 4 1 5
1 4 8 3 8 3
1 0 0 9 0 4 9
1 1 0 4 1 6
2 0 9 8 6 3 9
1 0 6 8 6 1
Also transpose can be useful:
df = pd.Panel(arr).transpose(1,2,0).to_frame()
print (df)
0 1 2
major minor
0 0 9 0 9
1 1 9 8
2 4 0 6
3 1 4 3
4 5 9 9
1 0 4 1 0
1 8 0 6
2 3 4 8
3 8 1 6
4 3 6 1
Another possible solution with concat:
arr = arr.transpose(1,2,0)
df = pd.concat([pd.DataFrame(x) for x in arr], keys=np.arange(arr.shape[0]))
print (df)
0 1 2 3 4
0 0 9 1 4 1 5
1 4 8 3 8 3
1 0 0 9 0 4 9
1 1 0 4 1 6
2 0 9 8 6 3 9
1 0 6 8 6 1
np.random.seed(10)
arr = np.random.randint(10, size=(500,120,100))
df = pd.Panel(arr).transpose(2,0,1).to_frame()
print (df.shape)
(60000, 100)
print (df.index.max())
(499, 119)
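pd.Panel is no longer available in recent pandas releases, so the Panel-based lines above only run on old versions. A rough equivalent of pd.Panel(arr).to_frame() can be sketched with a reshape and MultiIndex.from_product; the level names major and minor are copied from the output above, and the rest is an assumption about the intended layout.

```python
import numpy as np
import pandas as pd

np.random.seed(10)
arr = np.random.randint(10, size=(5, 3, 2))
n_items, n_major, n_minor = arr.shape

# Rows are (major, minor) pairs, columns are the first axis of the array.
index = pd.MultiIndex.from_product(
    [range(n_major), range(n_minor)], names=["major", "minor"]
)
df = pd.DataFrame(arr.reshape(n_items, -1).T, index=index, columns=range(n_items))
```

Each row of df holds arr[:, major, minor], which matches how the to_frame output above is laid out.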

Drop all rows before first occurrence of a value

I have a df like so:
Year ID Count
1997 1 0
1998 2 0
1999 3 1
2000 4 0
2001 5 1
and I want to remove all rows before the first occurrence of 1 in Count which would give me:
Year ID Count
1999 3 1
2000 4 0
2001 5 1
I can remove all rows AFTER the first occurrence like this:
df=df.loc[: df[(df['Count'] == 1)].index[0], :]
but I can't seem to follow the slicing logic to make it do the opposite.
I'd do:
df[(df.Count == 1).idxmax():]
df.Count == 1 returns a boolean Series. idxmax() returns the index label of the maximum value. The maximum is True, and when there are several True values idxmax() returns the label of the first one, which is exactly what you want. Here that label is 2, so the expression slices the dataframe from 2 onward, i.e. df[2:]. The answer above does all of that in one line.
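One caveat: when the value never occurs, the mask is all False and idxmax() still returns the first label, so nothing is dropped. The sketch below guards against that and works by position, so it does not depend on a default integer index. The helper name drop_before_first is hypothetical.

```python
import pandas as pd


def drop_before_first(df, column, value):
    """Keep rows from the first row where df[column] == value onward.

    Hypothetical helper; returns an empty frame when the value never occurs.
    """
    mask = df[column] == value
    if not mask.any():
        # idxmax would report the first label here, which keeps every row.
        return df.iloc[0:0]
    # argmax gives the position of the first True, so this works for any index.
    first = mask.to_numpy().argmax()
    return df.iloc[first:]


df = pd.DataFrame({
    "Year": [1997, 1998, 1999, 2000, 2001],
    "ID": [1, 2, 3, 4, 5],
    "Count": [0, 0, 1, 0, 1],
})
trimmed = drop_before_first(df, "Count", 1)
```

Returning an empty frame for a missing value is a design choice made here; raising an error would be an equally reasonable alternative.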
You can use the cumsum() method:
In [13]: df[(df.Count == 1).cumsum() > 0]
Out[13]:
Year ID Count
2 1999 3 1
3 2000 4 0
4 2001 5 1
Explanation:
In [14]: (df.Count == 1).cumsum()
Out[14]:
0 0
1 0
2 1
3 1
4 2
Name: Count, dtype: int32
Timing against 500K rows DF:
In [18]: df = pd.concat([df] * 10**5, ignore_index=True)
In [19]: df.shape
Out[19]: (500000, 3)
In [20]: %timeit df[(df.Count == 1).idxmax():]
100 loops, best of 3: 3.7 ms per loop
In [21]: %timeit df[(df.Count == 1).cumsum() > 0]
100 loops, best of 3: 16.4 ms per loop
In [22]: %timeit df.loc[df[(df['Count'] == 1)].index[0]:, :]
The slowest run took 4.01 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 7.02 ms per loop
Conclusion: @piRSquared's idxmax() solution is a clear winner...
Using np.where:
df[np.where(df['Count']==1)[0][0]:]
Timings
Timings were performed on a larger version of the DataFrame:
df = pd.concat([df]*10**5, ignore_index=True)
Results:
%timeit df[np.where(df['Count']==1)[0][0]:]
100 loops, best of 3: 2.74 ms per loop
%timeit df[(df.Count == 1).idxmax():]
100 loops, best of 3: 6.18 ms per loop
%timeit df[(df.Count == 1).cumsum() > 0]
10 loops, best of 3: 26.6 ms per loop
%timeit df.loc[df[(df['Count'] == 1)].index[0]:, :]
100 loops, best of 3: 11.2 ms per loop
Just slice the other way.
If idx is your index, do:
df.loc[idx:]
instead of
df.loc[:idx]
That means:
df.loc[df[(df['Count'] == 1)].index[0]:, :]

Assigning value to multiindexed pandas dataframe based on mix of integer and labels indexing

I have a dataframe with multiindexed columns. I want to select on the first level based on the column name, and then return all columns but the last one, and assign a new value to all these elements.
Here's a sample dataframe:
In [1]: mydf = pd.DataFrame(np.random.random_integers(low=1,high=5,size=(4,9)),
columns = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']]))
Out[1]:
A B C
a b c a b c a b c
0 4 1 2 1 4 2 1 1 3
1 4 4 1 2 3 4 2 2 3
2 2 3 4 1 2 1 3 2 3
3 1 3 4 2 3 4 1 5 1
I want to be able to assign to these elements, for example:
In [2]: mydf.loc[:,('A')].iloc[:,:-1]
Out[2]:
A
a b
0 4 1
1 4 4
2 2 3
3 1 3
If I wanted to modify one column only, I know how to select it properly with a tuple so that the assigning works:
In [3]: mydf.loc[:,('A','a')] = 0
In [4]: mydf.loc[:,('A','a')]
Out[4]:
0 0
1 0
2 0
3 0
Name: (A, a), dtype: int32
So that worked well.
Now the following doesn't work...
In [5]: mydf.loc[:,('A')].ix[:,:-1] = 6 - mydf.loc[:,('A')].ix[:,:-1]
In [6]: mydf.loc[:,('A')].iloc[:,:-1] = 6 - mydf.loc[:,('A')].iloc[:,:-1]
Sometimes I will, and sometimes I won't, get the warning that a value is trying to be set on a copy of a slice from a DataFrame. But in both cases it doesn't actually assign.
I've tried pretty much everything I could think of, and I still can't figure out how to mix label and integer indexing in order to set the value correctly.
Any idea please?
Versions:
Python 2.7.9
Pandas 0.16.1
This is not directly supported, as .loc MUST have labels and NOT positions. In theory .ix could support this with multi-index slicers, but there are the usual complications of figuring out what the user 'meant' (e.g. is it a label or a position).
In [63]: df = pd.DataFrame(np.random.random_integers(low=1,high=5,size=(4,9)),
columns = pd.MultiIndex.from_product([['A', 'B', 'C'], ['a', 'b', 'c']]))
In [64]: df
Out[64]:
A B C
a b c a b c a b c
0 4 4 4 4 3 2 5 1 4
1 1 2 1 3 2 1 1 4 5
2 3 2 4 4 2 2 3 1 4
3 5 1 1 3 1 1 5 5 5
so we compute the indexer for the 'A' block; np.r_ turns this slice into an actual indexer; then we select the element (e.g. 0 in this case). This feeds into .iloc.
In [65]: df.iloc[:,np.r_[df.columns.get_loc('A')][0]] = 0
In [66]: df
Out[66]:
A B C
a b c a b c a b c
0 0 4 4 4 3 2 5 1 4
1 0 2 1 3 2 1 1 4 5
2 0 2 4 4 2 2 3 1 4
3 0 1 1 3 1 1 5 5 5
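The same get_loc idea can be stretched to the case in the question, assigning to every column of the 'A' block except the last. This is a sketch, not the answer's own code: the names block and target are illustrative, and the random data is generated with default_rng instead of the older random_integers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(1, 6, size=(4, 9)),
    columns=pd.MultiIndex.from_product([["A", "B", "C"], ["a", "b", "c"]]),
)
before = df.copy()

# get_loc returns a slice (or a boolean mask) covering the 'A' block;
# indexing an arange with it yields the integer column positions.
block = np.arange(df.shape[1])[df.columns.get_loc("A")]
target = block[:-1]  # every column of the block except the last

# Use a plain array on the right-hand side so no label alignment takes place.
df.iloc[:, target] = 6 - df.iloc[:, target].to_numpy()
```

Only columns ('A', 'a') and ('A', 'b') change; a single .iloc assignment avoids the chained indexing that silently wrote to a copy in the question.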

pandas pivot table using index data of dataframe

I want to create a pivot table from a pandas dataframe
using dataframe.pivot()
and include not only dataframe columns but also the data within the dataframe index.
Couldn't find any docs that show how to do that.
Any tips?
Use reset_index to make the index a column:
In [45]: df = pd.DataFrame({'y': [0, 1, 2, 3, 4, 4], 'x': [1, 2, 2, 3, 1, 3]}, index=np.arange(6)*10)
In [46]: df
Out[46]:
x y
0 1 0
10 2 1
20 2 2
30 3 3
40 1 4
50 3 4
In [47]: df.reset_index()
Out[47]:
index x y
0 0 1 0
1 10 2 1
2 20 2 2
3 30 3 3
4 40 1 4
5 50 3 4
So pivot uses the index as values:
In [48]: df.reset_index().pivot(index='y', columns='x')
Out[48]:
index
x 1 2 3
y
0 0 NaN NaN
1 NaN 10 NaN
2 NaN 20 NaN
3 NaN NaN 30
4 40 NaN 50
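Naming the index first and passing values= explicitly gives a flatter result without the extra "index" column level. This is a small sketch of that variation; the index name obs is invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"y": [0, 1, 2, 3, 4, 4], "x": [1, 2, 2, 3, 1, 3]},
    index=pd.Index(np.arange(6) * 10, name="obs"),  # "obs" is an illustrative name
)

# reset_index turns the named index into an ordinary column called "obs",
# which pivot can then use as the values.
table = df.reset_index().pivot(index="y", columns="x", values="obs")
```

table has y as its rows and x as its columns, and each cell holds the original index value or NaN where no (y, x) pair exists.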

pandas: slice a MultiIndex DataFrame by range of secondary index

It has been posted that slicing on the second index can be done on a multi-indexed pandas Series:
import numpy as np
import pandas as pd
buckets = np.repeat(range(3), [3,5,7])
sequence = np.hstack(map(range,[3,5,7]))
s = pd.Series(np.random.randn(len(sequence)),
index=pd.MultiIndex.from_tuples(zip(buckets, sequence)))
print s
0 0 0.021362
1 0.917947
2 -0.956313
1 0 -0.242659
1 0.398657
2 0.455909
3 0.200061
4 -1.273537
2 0 0.747849
1 -0.012899
2 1.026659
3 -0.256648
4 0.799381
5 0.064147
6 0.491336
Then to get the first three rows for the first index=1, you simply say:
s[1].ix[range(3)]
0 -0.242659
1 0.398657
2 0.455909
This works fine for 1-dimensional Series, but not for DataFrames:
buckets = np.repeat(range(3), [3,5,7])
sequence = np.hstack(map(range,[3,5,7]))
d = pd.DataFrame(np.random.randn(len(sequence),2),
index=pd.MultiIndex.from_tuples(zip(buckets, sequence)))
print d
0 1
0 0 1.217659 0.312286
1 0.559782 0.686448
2 -0.143116 1.146196
1 0 -0.195582 0.298426
1 1.504944 -0.205834
2 0.018644 -0.979848
3 -0.387756 0.739513
4 0.719952 -0.996502
2 0 0.065863 0.481190
1 -1.309163 0.881319
2 0.545382 2.048734
3 0.506498 0.451335
4 0.872743 -0.070985
5 -1.160473 1.082550
6 0.331796 -0.366597
d[1].ix[range(3)]
0 0 0.312286
1 0.686448
2 1.146196
Name: 1
It gives you the "1th" column of data, and the first three rows, irrespective of the first index level. How can you get the first three rows for the first index=1 for a multi-indexed DataFrame?
d.xs(1)[0:3]
0 1
0 -0.716206 0.119265
1 -0.782315 0.097844
2 2.042751 -1.116453
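.loc is more efficient, since both index levels are evaluated in one indexing step:
s.loc[pd.IndexSlice[1], :3] returns the entries whose level-0 label is 1 and whose second-level label is at most 3. Label slices include their end point, so use :2 to get only the first three rows.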
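For the DataFrame case in the question, both approaches can be sketched in Python 3 as follows. The setup mirrors the question's data, except that the random values come from default_rng with a fixed seed, and the names by_label and by_position are illustrative.

```python
import numpy as np
import pandas as pd

# Python 3 version of the setup from the question.
sizes = [3, 5, 7]
buckets = np.repeat(range(3), sizes)
sequence = np.concatenate([np.arange(n) for n in sizes])
index = pd.MultiIndex.from_arrays([buckets, sequence])
d = pd.DataFrame(np.random.default_rng(0).standard_normal((len(index), 2)), index=index)

# Label-based: level-0 label 1, second-level labels 0 through 2 (inclusive).
by_label = d.loc[pd.IndexSlice[1, 0:2], :]

# Position-based: take the level-0 block, then its first three rows.
by_position = d.xs(1).iloc[:3]
```

by_label keeps both index levels, while by_position drops the outer level. The label-based slice needs a sorted MultiIndex, which holds for this data.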