Find Indexes of Non-NaN Values in Pandas DataFrame - python-2.7

I have a very large dataset (roughly 200000x400), however I have it filtered and only a few hundred values remain, the rest are NaN. I would like to create a list of indexes of those remaining values. I can't seem to find a simple enough solution.
0 1 2
0 NaN NaN 1.2
1 NaN NaN NaN
2 NaN 1.1 NaN
3 NaN NaN NaN
4 1.4 NaN 1.01
For instance, I would like a list of [(0,2), (2,1), (4,0), (4,2)].

Convert the DataFrame to its equivalent NumPy array representation and check which entries are NaN. Then negate that mask (so it marks the non-null entries) and get their indices with numpy.argwhere. Since the required output is a list of tuples, map tuple over the rows of the resulting array.
>>> list(map(tuple, np.argwhere(~np.isnan(df.values))))
[(0, 2), (2, 1), (4, 0), (4, 2)]

assuming that your column names are of int dtype:
In [73]: df
Out[73]:
0 1 2
0 NaN NaN 1.20
1 NaN NaN NaN
2 NaN 1.1 NaN
3 NaN NaN NaN
4 1.4 NaN 1.01
In [74]: df.columns.dtype
Out[74]: dtype('int64')
In [75]: df.stack().reset_index().drop(0, 1).apply(tuple, axis=1).tolist()
Out[75]: [(0, 2), (2, 1), (4, 0), (4, 2)]
if your column names are of object dtype:
In [81]: df.columns.dtype
Out[81]: dtype('O')
In [83]: df.stack().reset_index().astype(int).drop(0,1).apply(tuple, axis=1).tolist()
Out[83]: [(0, 2), (2, 1), (4, 0), (4, 2)]
Timing for 50K rows DF:
In [89]: df = pd.concat([df] * 10**4, ignore_index=True)
In [90]: df.shape
Out[90]: (50000, 3)
In [91]: %timeit list(map(tuple, np.argwhere(~np.isnan(df.values))))
10 loops, best of 3: 144 ms per loop
In [92]: %timeit df.stack().reset_index().drop(0, 1).apply(tuple, axis=1).tolist()
1 loop, best of 3: 1.67 s per loop
Conclusion: Nickil Maveli's solution is about 12 times faster for this test DataFrame.
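For completeness, a small variation of my own (not part of either answer above): the index of the stacked Series already holds the (row, column) label pairs, so you can read them off directly; the int() cast also covers the object-dtype column case.
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, np.nan, 1.2],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 1.1, np.nan],
                   [np.nan, np.nan, np.nan],
                   [1.4, np.nan, 1.01]])

# stack() drops NaN cells by default, so its MultiIndex is exactly the set of
# (row label, column label) pairs of the remaining values.
pairs = [(row, int(col)) for row, col in df.stack().index]
print(pairs)  # [(0, 2), (2, 1), (4, 0), (4, 2)]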

Related

How to shuffle data in pandas? [duplicate]

What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e. how to write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times.
Edit: key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index that loses all that information. I want the resulting df to be the same as the original except with the order of rows or order of columns different.
Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't have the same associations between a and b as you do if you just re-order each row as a whole. Something like:
for 1...n:
    for each col in df: shuffle column
return new_df
But hopefully more efficient than naive looping. This does not work for me:
def shuffle(df, n, axis=0):
    shuffled_df = df.copy()
    for k in range(n):
        shuffled_df.apply(np.random.shuffle(shuffled_df.values), axis=axis)
    return shuffled_df
df = pandas.DataFrame({'A':range(10), 'B':range(10)})
shuffle(df, 5)
Use numpy's random.permutation function:
In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [2]: df
Out[2]:
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
A B
0 0 0
5 5 5
6 6 6
3 3 3
8 8 8
7 7 7
9 9 9
1 1 1
2 2 2
4 4 4
Sampling randomizes, so just sample the entire data frame.
df.sample(frac=1)
As @Corey Levinson notes, you have to be careful when you reassign:
df['column'] = df['column'].sample(frac=1).reset_index(drop=True)
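A minimal illustration of my own (not from the original answer) of why the reset_index(drop=True) matters: without it, the assignment realigns on the index and the column comes back unchanged.
import pandas as pd

df = pd.DataFrame({'column': range(5)})

# Without reset_index the shuffled Series keeps its old index labels, and the
# assignment realigns on them, silently undoing the shuffle.
unchanged = df.copy()
unchanged['column'] = unchanged['column'].sample(frac=1)

# Dropping the index first makes the assignment purely positional.
shuffled = df.copy()
shuffled['column'] = shuffled['column'].sample(frac=1).reset_index(drop=True)

print(unchanged['column'].tolist())  # [0, 1, 2, 3, 4] -- unchanged
print(shuffled['column'].tolist())   # a permutation of [0, 1, 2, 3, 4]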
In [16]: def shuffle(df, n=1, axis=0):
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:
In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [18]: shuffle(df)
In [19]: df
Out[19]:
A B
0 8 5
1 1 7
2 7 3
3 6 2
4 3 4
5 0 1
6 9 0
7 4 6
8 2 8
9 5 9
You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support Pandas data frames):
# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))
# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
df: A B
1 1 1
0 0 0
3 3 3
4 4 4
2 2 2
Then you can use df.reset_index() to reset the index column, if need be:
df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 1 1
1 0 0
2 4 4
3 2 2
4 3 3
A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:
df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
df.apply(lambda x: x.sample(frac=1).values)
a b
0 4 2
1 1 6
2 6 5
3 5 3
4 2 4
5 3 1
You must use .values so that you return a NumPy array and not a Series; otherwise the returned Series will align to the original DataFrame, not changing a thing:
df.apply(lambda x: x.sample(frac=1))
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
From the docs use sample():
In [79]: s = pd.Series([0,1,2,3,4,5])
# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]:
0 0
dtype: int64
# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]:
5 5
2 2
4 4
dtype: int64
# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]:
5 5
4 4
1 1
dtype: int64
I resorted to adapting @root's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing but it works perfectly for just shuffling the data.
In [1]: import numpy
In [2]: import pandas
In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})
In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
1000 loops, best of 3: 406 µs per loop
In [5]: %%timeit
    ...: for view in numpy.rollaxis(df.values, 1):
    ...:     numpy.random.shuffle(view)
    ...:
10000 loops, best of 3: 22.8 µs per loop
In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
1000 loops, best of 3: 746 µs per loop
In [7]: %%timeit
    ...: for view in numpy.rollaxis(df.values, 0):
    ...:     numpy.random.shuffle(view)
    ...:
10000 loops, best of 3: 23.4 µs per loop
Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions, i.e., if we want to shuffle along the first dimension (columns), we need to roll the second dimension to the front, so that we apply the shuffling to views over the first dimension.
In [8]: numpy.rollaxis(df.values, 0).shape
Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)
In [9]: numpy.rollaxis(df.values, 1).shape
Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)
Your final function then uses a trick to bring the result in line with the expectation for applying a function to an axis:
def shuffle(df, n=1, axis=0):
    df = df.copy()
    axis = int(not axis)  # pandas.DataFrame is always 2D
    for _ in range(n):
        for view in numpy.rollaxis(df.values, axis):
            numpy.random.shuffle(view)
    return df
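A quick usage sketch of my own for the function above (it relies on the same assumption as the answer, namely that .values exposes a writable view for a single-dtype frame):
import numpy
import pandas

df = pandas.DataFrame({"A": range(10), "B": range(10)})

# axis=0 mirrors df.apply(numpy.random.shuffle, axis=0): each column is
# shuffled independently, so the A/B pairing within rows is broken.
shuffled = shuffle(df, n=3, axis=0)
print(shuffled)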
This might be more useful when you want your index shuffled.
import random

def shuffle(df):
    index = list(df.index)
    random.shuffle(index)
    df = df.loc[index]      # .ix is deprecated; .loc selects by label
    df = df.reset_index()   # assign the result; reset_index does not work in place
    return df
It selects a new df using the shuffled index, then resets it.
I know the question is about a pandas df, but if the shuffle happens within each row (column order changed, row order unchanged), then the column names do not matter anymore and it could be interesting to use an np.array instead; np.apply_along_axis() will then be what you are looking for.
If that is acceptable, this would be helpful; note that it is easy to switch the axis along which the data is shuffled.
If your pandas data frame is named df, maybe you can:
get the values of the dataframe with values = df.values,
create an np.array from values,
apply the method shown below to shuffle the np.array by row or column,
recreate a new (shuffled) pandas df from the shuffled np.array (a combined sketch follows the array example below).
Original array
a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
Keep row order, shuffle columns within each row
print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
[22 21 20]
[31 30 32]
[40 41 42]]
Keep column order, shuffle rows within each column
print(np.apply_along_axis(np.random.permutation, 0, a))
[[40 41 32]
[20 31 42]
[10 11 12]
[30 21 22]]
Original array is unchanged
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
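Putting the listed steps together for a DataFrame, a minimal sketch of my own (only np.apply_along_axis itself comes from the answer above; the column names are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40],
                   'b': [11, 21, 31, 41],
                   'c': [12, 22, 32, 42]})

# 1. pull the raw values out of the frame as an np.array
values = df.values

# 2./3. shuffle within each row (axis=1); use axis 0 to shuffle within each column
shuffled = np.apply_along_axis(np.random.permutation, 1, values)

# 4. rebuild a DataFrame, reusing the original labels since only positions moved
df_shuffled = pd.DataFrame(shuffled, index=df.index, columns=df.columns)
print(df_shuffled)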
Here is a workaround I found if you want to shuffle only a subset of the DataFrame:
shuffle_to_index = 20
df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])
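A slightly more general helper along the same lines (my own sketch; shuffle_slice is a made-up name, not an existing pandas function):
import numpy as np
import pandas as pd

def shuffle_slice(df, start, stop):
    # Return a copy of df with only the rows in positions [start, stop) shuffled.
    perm = np.random.permutation(np.arange(start, stop))
    return pd.concat([df.iloc[:start], df.iloc[perm], df.iloc[stop:]])

# e.g. shuffle only the first 20 rows, as in the snippet above:
# df = shuffle_slice(df, 0, 20)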

Matlab: find small islands of numbers surrounded by NaN

I have a lengthy vector of numeric data, with some sequences of NaNs here and there. Most of the NaNs come in large chunks, but sometimes the segments of NaNs are close together, creating islands of numbers surrounded by NaNs like this:
...NaN 1 2 3 5 ... 9 4 2 NaN...
I would like to find all islands of data that are between 1 - 15000 elements in size and replace them by a solid block of NaNs.
I've tried a few things, but there are some problems: the data set is HUGE, hence converting it to a string and using a regular expression to do:
[found start end] = regexp(num2str(isnan(data)),'10{1,7}1','match','start','end')
is out of the question because it takes prohibitively long to run num2str(isnan(data)). So I need a numeric way to find all NaN-numbers-NaN runs where the number of numbers is between 1 and 15000.
Here is an example how you can do that:
% generate random data
data = rand(1,20)
data(data>0.5) = NaN
% add NaN before and after the original array
% for easier detection of the numerical block
% starting at 1st element and finishing at the last one
datatotest = [ NaN data NaN ];
NumBlockStart = find( ~isnan(datatotest(2:end)) & isnan(datatotest(1:end-1)) )+0
NumBlockEnd = find( ~isnan(datatotest(1:end-1)) & isnan(datatotest(2:end)) )-1
NumBlockLength = NumBlockEnd - NumBlockStart + 1
In this example, NumBlockStart contains the start index of each numeric block, NumBlockEnd the last index of each block, and NumBlockLength the length of each block.
Now you can do whatever you want to them :)
Here is possible output
data =
0.0382 0.3767 0.8597 0.2743 0.6276 0.2974 0.2587 0.8577 0.8319 0.1408 0.9288 0.0990 0.7653 0.7806 0.8576 0.8032 0.8340 0.1600 0.4937 0.7784
data =
0.0382 0.3767 NaN 0.2743 NaN 0.2974 0.2587 NaN NaN 0.1408 NaN 0.0990 NaN NaN NaN NaN NaN 0.1600 0.4937 NaN
NumBlockStart =
1 4 6 10 12 18
NumBlockEnd =
2 4 7 10 12 19
NumBlockLength =
2 1 2 1 1 2
UPDATE1
This is a more efficient version:
data = rand(1,19)
data(data>0.5) = NaN
test2 = diff( ~isnan([ NaN data NaN ]) );
NumBlockStart = find( test2>0 )-0
NumBlockEnd = find( test2<0 )-1
NumBlockLength = NumBlockEnd - NumBlockStart + 1

Selecting data from an HDFStore by floating-point data_column

I have a table in an HDFStore with a column of floats f stored as a data_column. I would like to select a subset of rows where, e.g., f==0.6.
I'm running in to trouble that I'm assuming is related to a floating-point precision mismatch somewhere. Here is an example:
In [1]: f = np.arange(0, 1, 0.1)
In [2]: s = f.astype('S')
In [3]: df = pd.DataFrame({'f': f, 's': s})
In [4]: df
Out[4]:
f s
0 0.0 0.0
1 0.1 0.1
2 0.2 0.2
3 0.3 0.3
4 0.4 0.4
5 0.5 0.5
6 0.6 0.6
7 0.7 0.7
8 0.8 0.8
9 0.9 0.9
[10 rows x 2 columns]
In [5]: with pd.get_store('test.h5', mode='w') as store:
...: store.append('df', df, data_columns=True)
...:
In [6]: with pd.get_store('test.h5', mode='r') as store:
...: selection = store.select('df', 'f=f')
...:
In [7]: selection
Out[7]:
f s
0 0.0 0.0
1 0.1 0.1
2 0.2 0.2
4 0.4 0.4
5 0.5 0.5
8 0.8 0.8
9 0.9 0.9
[7 rows x 2 columns]
I would like the query to return all of the rows but instead several are missing. A query with where='f=0.3' returns an empty table:
In [8]: with pd.get_store('test.h5', mode='r') as store:
selection = store.select('df', 'f=0.3')
...:
In [9]: selection
Out[9]:
Empty DataFrame
Columns: [f, s]
Index: []
[0 rows x 2 columns]
I'm wondering whether this is the intended behavior, and if so, whether there is a simple workaround, such as setting a precision limit for floating-point queries in pandas. I'm using version 0.13.1:
In [10]: pd.__version__
Out[10]: '0.13.1-55-g7d3e41c'
I don't think so, no. Pandas is built around numpy, and I have never seen any tools for approximate float equality except testing utilities like assert_allclose, and that won't help here.
The best you can do is something like:
In [17]: with pd.get_store('test.h5', mode='r') as store:
selection = store.select('df', '(f > 0.2) & (f < 0.4)')
....:
In [18]: selection
Out[18]:
f s
3 0.3 0.3
If this is a common idiom for you, make a function for it. You can even get fancy by incorporating numpy float precision.
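For example, a minimal sketch of such a helper (select_float and its tolerance heuristic are my own; only store.select and its range-query syntax come from pandas):
import numpy as np
import pandas as pd

def select_float(store, key, column, value, tol=None):
    # Select rows where `column` lies within `tol` of `value`.
    if tol is None:
        # Heuristic: scale machine epsilon by the magnitude of the target value.
        tol = max(abs(value), 1.0) * np.finfo(np.float64).eps * 10
    where = '({col} > {lo!r}) & ({col} < {hi!r})'.format(
        col=column, lo=value - tol, hi=value + tol)
    return store.select(key, where)

# Usage, assuming 'test.h5' was written as in the question:
# with pd.get_store('test.h5', mode='r') as store:
#     print(select_float(store, 'df', 'f', 0.3))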

pandas pivot table using index data of dataframe

I want to create a pivot table from a pandas dataframe
using dataframe.pivot()
and include not only dataframe columns but also the data within the dataframe index.
Couldn't find any docs that show how to do that.
Any tips?
Use reset_index to make the index a column:
In [45]: df = pd.DataFrame({'y': [0, 1, 2, 3, 4, 4], 'x': [1, 2, 2, 3, 1, 3]}, index=np.arange(6)*10)
In [46]: df
Out[46]:
x y
0 1 0
10 2 1
20 2 2
30 3 3
40 1 4
50 3 4
In [47]: df.reset_index()
Out[47]:
index x y
0 0 1 0
1 10 2 1
2 20 2 2
3 30 3 3
4 40 1 4
5 50 3 4
So pivot then uses the former 'index' column as the values:
In [48]: df.reset_index().pivot(index='y', columns='x')
Out[48]:
index
x 1 2 3
y
0 0 NaN NaN
1 NaN 10 NaN
2 NaN 20 NaN
3 NaN NaN 30
4 40 NaN 50
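For a flatter result you can also tell pivot explicitly which column to use as the values; a small variation of my own on the example above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'y': [0, 1, 2, 3, 4, 4], 'x': [1, 2, 2, 3, 1, 3]},
                  index=np.arange(6) * 10)

# Selecting the former index column via values= gives a single level of
# column labels instead of the ('index', x) MultiIndex shown above.
pivoted = df.reset_index().pivot(index='y', columns='x', values='index')
print(pivoted)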

Split Data knowing its common ID

I want to split this data,
ID x y
1 2.5 3.5
1 85.1 74.1
2 2.6 3.4
2 86.0 69.8
3 25.8 32.9
3 84.4 68.2
4 2.8 3.2
4 24.1 31.8
4 83.2 67.4
I was able to match them with their partners, like:
ID x y ID x y
1 2.5 3.5 1 85.1 74.1
2 2.6 3.4 2 86.0 69.8
3 25.8 32.9
4 24.1 31.8
However, as you can see, some of the rows for ID 4 were placed wrong, because they just got added in the next few rows. I want to split them properly without the complex logic I am already using... Can someone give me an algorithm or idea?
It should look like:
ID x y ID x y ID x y
1 2.5 3.5 1 85.1 74.1 3 25.8 32.9
2 2.6 3.4 2 86.0 69.8 4 24.1 31.8
4 2.8 3.2 3 84.4 68.2
4 83.2 67.4
It seems that your question is really about clustering, and that the ID column has nothing to do with determining which points correspond to which.
A common algorithm to achieve that would be k-means clustering. However, your question implies that you don't know the number of clusters in advance. This complicates matters, and a lot of questions have already been asked here on Stack Overflow regarding this issue:
Kmeans without knowing the number of clusters?
compute clustersize automatically for kmeans
How do I determine k when using k-means clustering?
How to optimal K in K - Means Algorithm
K-Means Algorithm
Unfortunately, there is no "right" solution for this. Two clusters in one specific problem could be indeed considered as one cluster in another problem. This is why you'll have to decide that for yourself.
Nevertheless, if you're looking for something simple (and probably inaccurate), you can use Euclidean distance as a measure. Compute the distances between points (e.g. using pdist), and group points where the distance falls below a certain threshold.
Example
%// Sample input
A = [1, 2.5, 3.5;
1, 85.1, 74.1;
2, 2.6, 3.4;
2, 86.0, 69.8;
3, 25.8, 32.9;
3, 84.4, 68.2;
4, 2.8, 3.2;
4, 24.1, 31.8;
4, 83.2, 67.4];
%// Cluster points
pairs = nchoosek(1:size(A, 1), 2); %// Rows of pairs
d = sqrt(sum((A(pairs(:, 1), :) - A(pairs(:, 2), :)) .^ 2, 2)); %// d = pdist(A)
thr = d < 10; %// Distances below threshold
kk = 1;
idx = 1:size(A, 1);
C = cell(size(idx)); %// Preallocate memory
while any(idx)
x = unique(pairs(pairs(:, 1) == find(idx, 1) & thr, :));
C{kk} = A(x, :);
idx(x) = 0; %// Remove indices from list
kk = kk + 1;
end
C = C(~cellfun(@isempty, C)); %// Remove empty cells
The result is a cell array C, each cell representing a cluster:
C{1} =
1.0000 2.5000 3.5000
2.0000 2.6000 3.4000
4.0000 2.8000 3.2000
C{2} =
1.0000 85.1000 74.1000
2.0000 86.0000 69.8000
3.0000 84.4000 68.2000
4.0000 83.2000 67.4000
C{3} =
3.0000 25.8000 32.9000
4.0000 24.1000 31.8000
Note that this simple approach has the flaw of restricting the cluster radius to the threshold. However, you wanted a simple solution, so bear in mind that it gets complicated as you add more "clustering logic" to the algorithm.