python pandas DataFrames: add a column depending on the values of two other columns (python-2.7)

I have two columns in a DataFrame (height, upper) whose values are either 1 or 0. Together they give 4 possible combinations, and I am trying to create a third column that records which of the 4 combinations each row has, but I cannot figure out what is going wrong. My code is as follows:
def quad(clasif):
    if (raw['upper']==0 and raw['height']==0):
        return 1
    if (raw['upper']==1 and raw['height']==0):
        return 2
    if (raw['upper']==0 and raw['height']==1):
        return 3
    if (raw['upper']==1 and raw['height']==1):
        return 4
raw['cuatro']=raw.apply(lambda clasif: quad(clasif), axis=1)
I am getting the following error:
'The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', u'occurred at index 0'
Could someone help?

Assuming that upper and height can only be 0 or 1, you can rewrite this as a simple addition:
raw['cuatro'] = 1 + raw['upper'] + 2 * raw['height']
The reason you see this error is that raw['upper'] == 0 is a Boolean Series, which you can't use with Python's and operator: and needs a single True or False, and a Series holds one value per row. See the "gotcha" section of the docs.
I think you're missing the fundamentals of apply: when it is passed the Series clasif (one row, with axis=1), your function should do something with clasif. At the moment, the function body makes no mention of it and refers to the whole DataFrame raw instead.
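As a minimal sketch of the two points above (the sample DataFrame is made up for illustration, it is not the asker's data), the vectorized formula can be checked directly, and the same ValueError can be reproduced by using and on two Boolean Series. The element-wise operator & is the usual replacement for and in this situation.

```python
import pandas as pd

# Hypothetical sample data covering all four combinations.
raw = pd.DataFrame({'upper': [0, 0, 1, 1], 'height': [0, 1, 0, 1]})

# Vectorized: the whole column is computed at once, no apply needed.
raw['cuatro'] = 1 + raw['upper'] + 2 * raw['height']

# Element-wise "and" between Boolean Series is spelled & (note the parentheses).
both_zero = (raw['upper'] == 0) & (raw['height'] == 0)

# Python's `and` asks each Series for a single truth value and raises ValueError.
try:
    (raw['upper'] == 0) and (raw['height'] == 0)
    raised = False
except ValueError:
    raised = True

print(raw)
print(both_zero.tolist(), raised)
```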

You have to pass the function to apply.
import pandas as pd

def quad(clasif):
    # clasif is one row of the DataFrame (a Series), so each comparison is a single bool
    if (clasif['upper']==0 and clasif['height']==0):
        return 1
    if (clasif['upper']==1 and clasif['height']==0):
        return 2
    if (clasif['upper']==0 and clasif['height']==1):
        return 3
    if (clasif['upper']==1 and clasif['height']==1):
        return 4

raw = pd.DataFrame({'upper': [0, 0, 1, 1], 'height': [0, 1, 0, 1]})
raw['cuatro']=raw.apply(quad, axis=1)
print(raw)
   height  upper  cuatro
0       0      0       1
1       1      0       3
2       0      1       2
3       1      1       4
Andy Hayden's answer is better suited for your case.

Related

Intuition behind working with `k` to find the kth-symbol in the grammar

I took part in a coding contest wherein I encountered the following question:
On the first row, we write a 0. Now in every subsequent row, we look at the previous row and replace each occurrence of 0 with 01, and each occurrence of 1 with 10. Given row N and index K, return the K-th indexed symbol in row N. (The values of K are 1-indexed.)
While solving the question, I treated it like a level-order traversal of a tree, trying to form the new string at each level. Unfortunately, it timed out. I then tried caching the results, etc., with no luck.
One of the highly upvoted solutions is like this:
class Solution {
public:
    int kthGrammar(int N, int K) {
        if (N == 1) return 0;
        if (K % 2 == 0) return (kthGrammar(N - 1, K / 2) == 0) ? 1 : 0;
        else return (kthGrammar(N - 1, (K + 1) / 2) == 0) ? 0 : 1;
    }
};
My question is simple - what is the intuition behind working with the value of K (especially, the parities of K)? (I hope to be able to identify such questions when I encounter them in future).
Thanks.

Look at the sequence recursively. In generating a new row, the first half is identical to the process you used to get the previous row, so that part of the expansion is already done. The second half is merely the same sequence inverted (0 for 1, 1 for 0). This is one classic way to generate a parity map: flip all the bits and append, representing adding a 1 to the start of each binary number. Thinking of expanding the sequence 0-3 to 0-7, we start with
00 => 0
01 => 1
10 => 1
11 => 0
We now replicate the 2-digit sequence twice: first with a leading 0, which preserves the original parity; second with a leading 1, which inverts the parity.
000 => 0
001 => 1
010 => 1
011 => 0
100 => 1
101 => 0
110 => 0
111 => 1
Is that an intuition that works for you?
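To make the "flip all the bits and append" view concrete, here is a small sketch in Python (the function names build_row and build_row_by_rule are made up for this illustration). It builds each row both ways, by appending the inverted previous row and by applying the 0 -> 01, 1 -> 10 rule from the question, so the two can be compared.

```python
def build_row(n):
    """Row n (1-indexed), built by appending the inverted previous row."""
    row = [0]
    for _ in range(n - 1):
        row = row + [1 - bit for bit in row]
    return row


def build_row_by_rule(n):
    """Row n (1-indexed), built with the replacement rule from the question."""
    row = [0]
    for _ in range(n - 1):
        # 0 becomes 0 1, and 1 becomes 1 0
        row = [out for bit in row for out in ((0, 1) if bit == 0 else (1, 0))]
    return row


for n in range(1, 6):
    print(n, build_row(n), build_row(n) == build_row_by_rule(n))
```

Building whole rows takes time and memory proportional to 2^(N-1), which is why this is useful for building intuition rather than as a contest solution.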

Just for fun, as a different way to solve this, consider that the nth row (0-indexed) has 2^n elements in it, and the value of the kth (0-indexed) element can be determined solely from the parity of the number of bits set in k.
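A short sketch of that bit-counting idea, assuming the 1-indexed N and K of the original question (so the 0-indexed position is K - 1). The function name kth_symbol is made up here, and n is only used to validate k.

```python
def kth_symbol(n, k):
    """K-th (1-indexed) symbol of row n (1-indexed) of the grammar."""
    if not 1 <= k <= 2 ** (n - 1):
        raise ValueError("k is outside row n")
    # Parity of the number of set bits in the 0-indexed position.
    return bin(k - 1).count("1") % 2


print([kth_symbol(4, k) for k in range(1, 9)])
```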

The check for parity in the code you posted is just to make the division by two correct; there's no advanced math or mystery hiding here :) Since the pattern is akin to a tree, where the pattern size doubles with each added row, dividing correctly points to the element's parent. The indexes in this question are said to be "1-indexed": if the index is 2, dividing by two yields the parent index (1) in the row before; and if the index is 1, dividing (1+1) by two yields that same parent index. I'll leave it to the reader to generalize that to k's parity. After finding the parent, the code follows the rule stated in the question: if the parent is 0, the left child must be 0 and the right child 1, and vice versa.
0
0 1
0 1 1 0
0 1 1 0 1 0 0 1
0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0
a a b a b b a
0 01 0110 01101001 0110100110010110
a b b a b a a b
0110100110010110 1001011001101001
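The parent-index argument can also be written out in Python. This is only a sketch of the same recursion as the posted C++ solution (the name kth_grammar is chosen here): with integer division, (k + 1) // 2 gives the parent for both odd and even k, so the parity of k is only needed to decide whether the element is a left or a right child.

```python
def kth_grammar(n, k):
    """K-th (1-indexed) symbol of row n (1-indexed)."""
    if n == 1:
        return 0
    # (k + 1) // 2 equals k // 2 for even k, so one formula covers both parities.
    parent = kth_grammar(n - 1, (k + 1) // 2)
    is_left_child = (k % 2 == 1)
    # 0 -> 0 1 and 1 -> 1 0: the left child copies the parent, the right child flips it.
    return parent if is_left_child else 1 - parent


print([kth_grammar(4, k) for k in range(1, 9)])
```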

python dataframe - lambda X function - more efficient implementation possible?

In a previous thread (Pandas: reshaping data), a brilliant response was given to the following problem.
The goal is to reshape a pandas series containing lists into a pandas dataframe in the following way:
In [9]: s = Series([list('ABC'),list('DEF'),list('ABEF')])
In [10]: s
Out[10]:
0       [A, B, C]
1       [D, E, F]
2    [A, B, E, F]
dtype: object
should be shaped into this:
Out[11]:
   A  B  C  D  E  F
0  1  1  1  0  0  0
1  0  0  0  1  1  1
2  1  1  0  0  1  1
That is, a dataframe is created where every element in the lists of the series becomes a column. For every element in the series, a row in the dataframe is created. For every element in the lists, a 1 is assigned to the corresponding dataframe column (and 0 otherwise). I know that the wording may be cumbersome, but hopefully the example above is clear.
The brilliant response by user Jeff (https://stackoverflow.com/users/644898/jeff) was to write this simple yet powerful line of code:
In [11]: s.apply(lambda x: Series(1,index=x)).fillna(0)
That turns [10] into out[11].
That line of code served me extremely well, however I am running into memory issues with a series of roughly 50K elements and about 100K different elements in all lists. My machine has 16G of memory. Before resorting to a bigger machine, I would like to think of a more efficient implementation of the function above.
Does anyone know how to re-implement the above line:
In [11]: s.apply(lambda x: Series(1,index=x)).fillna(0)
to make it more efficient, in terms of memory usage?

You could try breaking your dataframe into chunks and writing to a file as you go, something like this:
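import numpy as np
import pandas as pd

chunksize = 10000
# df is the Series of lists from the question (called s there)
columns = sorted({label for labels in df for label in labels})  # same columns in every chunk

def to_indicators(chunk):
    # one row per list, 1 where the label occurs, 0 elsewhere
    return chunk.apply(lambda x: pd.Series(1, index=x)).reindex(columns=columns).fillna(0)

with open('out.csv','w') as out:
    out.write(pd.DataFrame(columns=columns).to_csv())  # write the header
    for _, chunk in df.groupby(np.arange(len(df))//chunksize):
        out.write(to_indicators(chunk).to_csv(header=False))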

If memory use is the issue, it seems like a sparse matrix solution would be better. Pandas doesn't really have sparse matrix support, but you could use scipy.sparse like this:
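import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

data = pd.Series([list('ABC'),list('DEF'),list('ABEF')])
# cols: sorted unique labels; ind: column index of every label occurrence
cols, ind = np.unique(np.concatenate(data.values), return_inverse=True)
# ind[indptr[i]:indptr[i+1]] holds the column indices of row i
indptr = np.cumsum([0] + list(map(len, data)))
vals = np.ones_like(ind)
M = csr_matrix((vals, ind, indptr))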
This sparse matrix now contains the same data as the pandas solution, but the zeros are not explicitly stored. We can confirm this by converting the sparse matrix to a dataframe:
>>> pd.DataFrame(M.toarray(), columns=cols)
   A  B  C  D  E  F
0  1  1  1  0  0  0
1  0  0  0  1  1  1
2  1  1  0  0  1  1
Depending on what you're doing with the data from here, having it in a sparse form may help solve your problem without using excessive memory.
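As a rough sketch of what "working in sparse form" can look like, the example below rebuilds the small matrix from this answer and then answers two typical questions without ever calling toarray(): how often each label occurs, and which labels a given row contains. The variable names label_counts and row2_labels are chosen for this illustration only.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

data = pd.Series([list('ABC'), list('DEF'), list('ABEF')])
cols, ind = np.unique(np.concatenate(data.values), return_inverse=True)
indptr = np.cumsum([0] + [len(labels) for labels in data])
M = csr_matrix((np.ones_like(ind), ind, indptr))

# Column sums give the number of rows in which each label occurs.
counts = np.asarray(M.sum(axis=0)).ravel()
label_counts = dict(zip(cols, counts))

# The stored column indices of one row are exactly the labels in that row.
row2_labels = cols[M[2].indices]

print(label_counts)
print(row2_labels)
```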

How to shuffle data in pandas? [duplicate]

What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e. how to write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times.
Edit: key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index that loses all that information. I want the resulting df to be the same as the original except with the order of rows or order of columns different.
Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't have the same associations between a and b as you do if you just re-order each row as a whole. Something like:
for 1...n:
    for each col in df: shuffle column
return new_df
But hopefully more efficient than naive looping. This does not work for me:
def shuffle(df, n, axis=0):
    shuffled_df = df.copy()
    for k in range(n):
        shuffled_df.apply(np.random.shuffle(shuffled_df.values),axis=axis)
    return shuffled_df
df = pandas.DataFrame({'A':range(10), 'B':range(10)})
shuffle(df, 5)

Use numpy's random.permutation function:
In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [2]: df
Out[2]:
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
A B
0 0 0
5 5 5
6 6 6
3 3 3
8 8 8
7 7 7
9 9 9
1 1 1
2 2 2
4 4 4

Sampling randomizes, so just sample the entire data frame.
df.sample(frac=1)
As @Corey Levinson notes, you have to be careful when you reassign:
df['column'] = df['column'].sample(frac=1).reset_index(drop=True)
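The reason for that caution is index alignment. The sketch below uses a made-up five-row frame (the column names and the random_state value are assumptions for reproducibility): assigning a shuffled Series straight back puts every value back on its original label, while dropping the shuffled index first lets the new order stick.

```python
import pandas as pd

df = pd.DataFrame({'column': [10, 20, 30, 40, 50], 'other': list('abcde')})

# Shuffle whole rows: the labels travel with the rows.
shuffled_rows = df.sample(frac=1, random_state=0)

# Shuffle one column: the shuffled Series still carries the original labels.
shuffled_col = df['column'].sample(frac=1, random_state=0)

aligned = df.copy()
aligned['column'] = shuffled_col  # assignment aligns on the index, so nothing changes

reassigned = df.copy()
reassigned['column'] = shuffled_col.reset_index(drop=True)  # new 0..n-1 index keeps the shuffled order

print(aligned)
print(reassigned)
```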

In [16]: def shuffle(df, n=1, axis=0):
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:
In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [18]: shuffle(df)
In [19]: df
Out[19]:
A B
0 8 5
1 1 7
2 7 3
3 6 2
4 3 4
5 0 1
6 9 0
7 4 6
8 2 8
9 5 9

You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support Pandas data frames):
# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))
# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
df: A B
1 1 1
0 0 0
3 3 3
4 4 4
2 2 2
Then you can use df.reset_index() to reset the index column, if need be:
df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 1 1
1 0 0
2 4 4
3 2 2
4 3 3

A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:
df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
df.apply(lambda x: x.sample(frac=1).values)
a b
0 4 2
1 1 6
2 6 5
3 5 3
4 2 4
5 3 1
You must use .values so that you return a numpy array and not a Series; otherwise the returned Series is aligned back to the original DataFrame's index and nothing changes:
df.apply(lambda x: x.sample(frac=1))
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6

From the docs use sample():
In [79]: s = pd.Series([0,1,2,3,4,5])
# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]:
0 0
dtype: int64
# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]:
5 5
2 2
4 4
dtype: int64
# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]:
5 5
4 4
1 1
dtype: int64

I resorted to adapting @root's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing, but it works perfectly for just shuffling the data.
In [1]: import numpy
In [2]: import pandas
In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})
In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
1000 loops, best of 3: 406 µs per loop
In [5]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 1):
   ...:     numpy.random.shuffle(view)
   ...:
10000 loops, best of 3: 22.8 µs per loop
In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
1000 loops, best of 3: 746 µs per loop
In [7]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 0):
   ...:     numpy.random.shuffle(view)
   ...:
10000 loops, best of 3: 23.4 µs per loop
Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions, i.e., if we want to shuffle along the first dimension (columns), we need to roll the second dimension to the front, so that we apply the shuffling to views over the first dimension.
In [8]: numpy.rollaxis(df.values, 0).shape
Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)
In [9]: numpy.rollaxis(df.values, 1).shape
Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)
Your final function then uses a trick to bring the result in line with the expectation for applying a function to an axis:
def shuffle(df, n=1, axis=0):
    df = df.copy()
    axis = int(not axis) # pandas.DataFrame is always 2D
    for _ in range(n):
        for view in numpy.rollaxis(df.values, axis):
            numpy.random.shuffle(view)
    return df
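The function above relies on df.values being a writable view into the frame, which may not hold for every pandas version or for frames with mixed dtypes. As an alternative sketch, not the answerer's method, the same independent shuffling can be done on a copy with numpy's Generator.permuted (NumPy 1.20 or later) and wrapped back into a DataFrame with the original labels. The function name shuffle_independently and the seed parameter are made up here, and the frame is assumed to have a single dtype.

```python
import numpy as np
import pandas as pd


def shuffle_independently(df, axis=0, seed=None):
    """Return a copy of df with each column (axis=0) or each row (axis=1) shuffled on its own."""
    rng = np.random.default_rng(seed)
    # permuted() shuffles every 1-D slice along `axis` independently and returns a new array.
    values = rng.permuted(df.to_numpy(), axis=axis)
    # Row and column labels are kept exactly as they were.
    return pd.DataFrame(values, index=df.index, columns=df.columns)


df = pd.DataFrame({'A': range(10), 'B': range(10, 20)})
by_column = shuffle_independently(df, axis=0, seed=0)
by_row = shuffle_independently(df, axis=1, seed=0)
print(by_column)
print(by_row)
```

Because every call starts from the original values and draws a fresh permutation, calling it once is statistically the same as shuffling n times, so the n parameter is left out of this sketch.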

This might be more useful when you want your index shuffled.
import random

def shuffle(df):
    index = list(df.index)
    random.shuffle(index)
    df = df.loc[index]                # select the rows in the shuffled order
    return df.reset_index(drop=True)  # renumber the rows 0..n-1
It selects a new df using the shuffled index, then resets the index.

I know the question is about a pandas df, but if the shuffle is done by row (column order changed, row order unchanged), then the column names no longer matter and it could be interesting to use an np.array instead; np.apply_along_axis() is then what you are looking for.
If that is acceptable, the following will be helpful. Note that it is easy to switch the axis along which the data is shuffled.
If your pandas data frame is named df, you can:
get the values of the dataframe with values = df.values,
create an np.array from values,
apply the method shown below to shuffle the np.array by row or column,
recreate a new (shuffled) pandas df from the shuffled np.array.
Original array
a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
Keep row order, shuffle columns within each row
print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
[22 21 20]
[31 30 32]
[40 41 42]]
Keep column order, shuffle rows within each column
print(np.apply_along_axis(np.random.permutation, 0, a))
[[40 41 32]
[20 31 42]
[10 11 12]
[30 21 22]]
Original array is unchanged
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]

Here is a workaround I found if you only want to shuffle a subset of the DataFrame:
shuffle_to_index = 20
df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])

How to create a list out of integers list?

I have a list of integers that looks like this when performing the print command:
0
1
0
1
1
I want to create a list out of it: [0, 1, 0, 1, 1]
How to do it?
I cannot find such information anywhere :(

s = """0
1
0
1
1"""
integers = list(map(int, s.splitlines()))  # list() is needed on Python 3, where map returns an iterator
idea taken from https://stackoverflow.com/a/27171335/1644901

pandas: slice a MultiIndex DataFrame by range of secondary index

It has been posted that slicing on the second index can be done on a multi-indexed pandas Series:
import numpy as np
import pandas as pd
buckets = np.repeat(range(3), [3,5,7])
sequence = np.hstack(map(range,[3,5,7]))
s = pd.Series(np.random.randn(len(sequence)),
              index=pd.MultiIndex.from_tuples(zip(buckets, sequence)))
print s
0  0    0.021362
   1    0.917947
   2   -0.956313
1  0   -0.242659
   1    0.398657
   2    0.455909
   3    0.200061
   4   -1.273537
2  0    0.747849
   1   -0.012899
   2    1.026659
   3   -0.256648
   4    0.799381
   5    0.064147
   6    0.491336
Then to get the first three rows for the first index=1, you simply say:
s[1].ix[range(3)]
0 -0.242659
1 0.398657
2 0.455909
This works fine for 1-dimensional Series, but not for DataFrames:
buckets = np.repeat(range(3), [3,5,7])
sequence = np.hstack(map(range,[3,5,7]))
d = pd.DataFrame(np.random.randn(len(sequence),2),
                 index=pd.MultiIndex.from_tuples(zip(buckets, sequence)))
print d
            0         1
0 0  1.217659  0.312286
  1  0.559782  0.686448
  2 -0.143116  1.146196
1 0 -0.195582  0.298426
  1  1.504944 -0.205834
  2  0.018644 -0.979848
  3 -0.387756  0.739513
  4  0.719952 -0.996502
2 0  0.065863  0.481190
  1 -1.309163  0.881319
  2  0.545382  2.048734
  3  0.506498  0.451335
  4  0.872743 -0.070985
  5 -1.160473  1.082550
  6  0.331796 -0.366597
d[1].ix[range(3)]
0  0    0.312286
   1    0.686448
   2    1.146196
Name: 1
It gives you the "1th" column of data, and the first three rows, irrespective of the first index level. How can you get the first three rows for the first index=1 for a multi-indexed DataFrame?

d.xs(1)[0:3]
          0         1
0 -0.716206  0.119265
1 -0.782315  0.097844
2  2.042751 -1.116453

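.loc is label-based and selects on both index levels in a single call.
s.loc[pd.IndexSlice[1, :2]] returns the entries whose level-0 label is 1 and whose level-1 label is 0 through 2 (label slices include the end point); for the DataFrame, use d.loc[pd.IndexSlice[1, :2], :].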
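A runnable sketch comparing the two answers on a frame shaped like the one in the question (rebuilt here with from_arrays so it also runs on Python 3; the random values will differ from those printed above). The label-based .loc selection keeps both index levels, while xs(1) drops the first level before the positional slice.

```python
import numpy as np
import pandas as pd

buckets = np.repeat(range(3), [3, 5, 7])
sequence = np.concatenate([np.arange(n) for n in [3, 5, 7]])
index = pd.MultiIndex.from_arrays([buckets, sequence])
d = pd.DataFrame(np.random.randn(len(sequence), 2), index=index)

# Label-based: level 0 equal to 1, level 1 from label 0 through label 2 (inclusive).
by_label = d.loc[pd.IndexSlice[1, 0:2], :]

# Cross-section on level 0, then the first three rows by position.
by_position = d.xs(1).iloc[0:3]

print(by_label)
print(by_position)
```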
