python dataframe - lambda X function - more efficient implementation possible?

In a previous thread (Pandas: reshaping data), a brilliant response was given to the following problem.
The goal is to reshape a pandas series containing lists into a pandas dataframe in the following way:
In [9]: s = Series([list('ABC'),list('DEF'),list('ABEF')])
In [10]: s
Out[10]:
0 [A, B, C]
1 [D, E, F]
2 [A, B, E, F]
dtype: object
should be shaped into this:
Out[11]:
A B C D E F
0 1 1 1 0 0 0
1 0 0 0 1 1 1
2 1 1 0 0 1 1
That is, a dataframe is created where every element in the lists of the series becomes a column. For every element in the series, a row in the dataframe is created. For every element in the lists, a 1 is assigned to the corresponding dataframe column (and 0 otherwise). I know that the wording may be cumbersome, but hopefully the example above is clear.
The brilliant response by user Jeff (https://stackoverflow.com/users/644898/jeff) was to write this simple yet powerful line of code:
In [11]: s.apply(lambda x: Series(1,index=x)).fillna(0)
That turns [10] into out[11].
That line of code served me extremely well; however, I am running into memory issues with a series of roughly 50K elements and about 100K distinct elements across all lists. My machine has 16 GB of memory. Before resorting to a bigger machine, I would like to think of a more efficient implementation of the function above.
Does anyone know how to re-implement the above line:
In [11]: s.apply(lambda x: Series(1,index=x)).fillna(0)
to make it more efficient, in terms of memory usage?

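You could try breaking your series into chunks and writing to a file as you go, something like this:
import numpy as np
import pandas as pd
chunksize = 10000
# Collect every distinct label first so that all chunks share the same columns.
columns = sorted(set(label for labels in s for label in labels))
def encode(chunk):
    # One row per list: 1 where the label occurs, 0 elsewhere.
    return chunk.apply(lambda x: pd.Series(1, index=x)).reindex(columns=columns).fillna(0).astype(int)
with open('out.csv', 'w') as out:
    out.write(pd.DataFrame(columns=columns).to_csv())  # write the header
    for _, chunk in s.groupby(np.arange(len(s)) // chunksize):
        out.write(encode(chunk).to_csv(header=False))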

If memory use is the issue, it seems like a sparse matrix solution would be better. Pandas doesn't really have sparse matrix support, but you could use scipy.sparse like this:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
data = pd.Series([list('ABC'),list('DEF'),list('ABEF')])
cols, ind = np.unique(np.concatenate(data), return_inverse=True)
indptr = np.cumsum([0] + list(map(len, data)))
vals = np.ones_like(ind)
M = csr_matrix((vals, ind, indptr))
This sparse matrix now contains the same data as the pandas solution, but the zeros are not explicitly stored. We can confirm this by converting the sparse matrix to a dataframe:
>>> pd.DataFrame(M.toarray(), columns=cols)
A B C D E F
0 1 1 1 0 0 0
1 0 0 0 1 1 1
2 1 1 0 0 1 1
Depending on what you're doing with the data from here, having it in a sparse form may help solve your problem without using excessive memory.
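To put rough numbers on it: a dense 50,000 x 100,000 table has 5 x 10^9 cells, which is about 40 GB at 8 bytes per value, whereas the CSR form only stores the ones. The sketch below is one possible way to check the footprint and, if a DataFrame-like object is still wanted, to wrap the matrix without densifying it. It assumes pandas 0.25 or newer (for the DataFrame.sparse accessor); the variable names sparse_bytes, dense_bytes and sdf are only illustrative.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

data = pd.Series([list('ABC'), list('DEF'), list('ABEF')])

# Build the CSR indicator matrix in the same way as above.
cols, ind = np.unique(np.concatenate(data.to_list()), return_inverse=True)
indptr = np.cumsum([0] + [len(labels) for labels in data])
M = csr_matrix((np.ones_like(ind), ind, indptr), shape=(len(data), len(cols)))

# Memory held by the three CSR arrays versus an equivalent dense array.
sparse_bytes = M.data.nbytes + M.indices.nbytes + M.indptr.nbytes
dense_bytes = M.shape[0] * M.shape[1] * M.dtype.itemsize

# Assumption: pandas >= 0.25, which provides the DataFrame.sparse accessor.
sdf = pd.DataFrame.sparse.from_spmatrix(M, columns=cols)
```

On a toy input like this the two byte counts are similar; the gap only opens up when most cells are zero, as in the 50K x 100K case.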


Armadillo C++ : Linear Combination with modulus calculations

I want to extract linear combinations from matrices, with all arithmetic performed modulo a prime.
Let us consider calculations modulo 5. We then have the following table for addition:
+ | 0 1 2 3 4
--+-----------
0 | 0 1 2 3 4
1 | 1 2 3 4 0
2 | 2 3 4 0 1
3 | 3 4 0 1 2
4 | 4 0 1 2 3
and this table for the multiplication:
* | 0 1 2 3 4
--+-----------
0 | 0 0 0 0 0
1 | 0 1 2 3 4
2 | 0 2 4 1 3
3 | 0 3 1 4 2
4 | 0 4 3 2 1
As an example, consider the following matrix:
E = 2 1 3 2 0
    4 3 0 1 1
Then we can obtain the triangular matrix by applying an LU decomposition (https://en.wikipedia.org/wiki/LU_decomposition) (or a Gaussian elimination), which gives:
T = 1 0 0 0 0
    2 1 0 0 0
and finally, the matrix that I want to extract, which is the one storing the linear combinations:
CL = 3 2 3 0 3
     0 1 1 3 4
     0 0 1 0 0
     0 0 0 1 0
     0 0 0 0 1
So basically, the algorithm should work as follows:
Input: a matrix E with n rows and m columns, and p, a prime number.
* We perform a Gaussian elimination/LU decomposition to obtain the lower-triangular matrix T.
All calculations are done modulo p.
Output: T (with the same size as E: n rows, m columns).
CL (m rows, m columns), which is the identity matrix to which we applied all the operations that were performed on E to obtain T.
Now that we have the context, let me explain the problem.
I started to do it using the Armadillo library (http://arma.sourceforge.net/), but I did not find any way in the library to perform the calculations over the finite field of integers modulo p. I easily found the LU method to obtain the lower-triangular matrix, but the calculations are performed over the reals.
#include <iostream>
#include <armadillo>
using namespace arma;
using namespace std;
int main(int argc,char** argv)
{
    mat A = mat({{2,1,3,2,0},{4,3,0,1,1}});
    mat L, U, P;
    lu(L, U, P, A);
    cout << L << endl;
    return 0;
}
With this code, you obtain the lower-triangular matrix 'L', but computed over the reals. Thus you obtain:
T' = 1   0
     1/2 1
Is there any technique to perform the computation modulo p?
EDIT: The Armadillo library is not able to do it. I developed my own LU decomposition modulo p, but there is still a bug in it. I asked a new question (Linear Combination C++ in modulus), hoping to solve it.
First of all: drop the using namespace directives; code can become completely unreadable with them.
I haven't used Armadillo yet. But I have looked at the documentation, and it seems templated to me.
Now things are getting a bit wild. The type you use, arma::mat, seems to be a typedef for arma::Mat<double>.
The high-level function arma::lu isn't properly documented. It obviously does an LU decomposition, but I don't know if the function is templated. If it is, i.e., if you can call it not just with double mats but also with other types, you might have a shot using a custom type representing the field of calculations modulo 5 (this works because 5 is prime; otherwise the integers modulo n are not a field and you'd be completely lost). That means you write a class, let's call it IntMod5, and define all required operators for this class, meaning all operators that the library applies to its elements. For example, you'd need to define operator/(), e.g. by making a table of inverses of the 4 nonzero elements of the field (0 has none), i.e. 1->1, 2->3, 3->2, 4->4, and define
IntMod5 operator/(const IntMod5& o) const
{
    return IntMod5((this->value*inverse(o.value))%5);
}
This is just one example; you likely need to define all arithmetic operators, binary and unary, and possibly more, such as comparison (an LU decomposition may compare elements when searching for good pivots). If you're lucky and the library is written in a way that it works for any field, not just floating point, you have a chance.
Before you go through all the work, you should use a trivial class simply wrapping double and check whether arma::Mat or arma::lu do any type checks that lock you out.
If either of these fails, you'll likely have to write your own LU decomposition modulo 5 or find another library that supports it.
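If it comes to writing the elimination by hand, the idea from the question can be sketched in a few lines. The sketch is in Python with NumPy rather than C++ purely for brevity, and it makes assumptions the thread does not confirm: it works with column operations (so that E times CL equals T modulo p), picks the first nonzero entry as pivot, and the name column_reduce_mod_p is hypothetical. Division is replaced by multiplication with the modular inverse, which exists for every nonzero element because p is prime.

```python
import numpy as np

def column_reduce_mod_p(E, p):
    """Column-reduce E modulo a prime p; returns (T, CL) with (E @ CL) % p == T."""
    T = np.array(E, dtype=np.int64) % p
    n, m = T.shape
    CL = np.eye(m, dtype=np.int64)
    col = 0  # index of the next pivot column
    for row in range(n):
        # Look for a nonzero entry in this row, at or to the right of the pivot column.
        candidates = [j for j in range(col, m) if T[row, j] != 0]
        if not candidates:
            continue
        j = candidates[0]
        # Move that column into the pivot position (same swap on CL).
        T[:, [col, j]] = T[:, [j, col]]
        CL[:, [col, j]] = CL[:, [j, col]]
        # Scale the pivot to 1 with the modular inverse (pow with exponent -1 needs Python 3.8+).
        inv = pow(int(T[row, col]), -1, p)
        T[:, col] = (T[:, col] * inv) % p
        CL[:, col] = (CL[:, col] * inv) % p
        # Clear the rest of the row by subtracting multiples of the pivot column.
        for k in range(col + 1, m):
            factor = T[row, k]
            if factor:
                T[:, k] = (T[:, k] - factor * T[:, col]) % p
                CL[:, k] = (CL[:, k] - factor * CL[:, col]) % p
        col += 1
    return T, CL

E = [[2, 1, 3, 2, 0], [4, 3, 0, 1, 1]]
T, CL = column_reduce_mod_p(E, 5)
```

For the matrix E from the question and p = 5, this yields the T and CL shown there. The same loop structure should carry over to C++ with plain integer matrices, without needing a custom element type.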

python pandas dataframes add column depending on values other 2 col

I finally got to a message that I expected could solve my problem. I have two columns in a DataFrame (height, upper) with values of either 1 or 0. Together they give 4 possible combinations, and I am trying to create a third column that encodes which of the 4 combinations each row has, but I cannot figure out what is going wrong. My code is as follows:
def quad(clasif):
    if (raw['upper']==0 and raw['height']==0):
        return 1
    if (raw['upper']==1 and raw['height']==0):
        return 2
    if (raw['upper']==0 and raw['height']==1):
        return 3
    if (raw['upper']==1 and raw['height']==1):
        return 4
raw['cuatro']=raw.apply(lambda clasif: quad(clasif), axis=1)
I am getting the following error:
'The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', u'occurred at index 0'
Could someone help?
Assuming that upper and height can only be 0 or 1, you can rewrite this as a simple addition:
raw['cuatro'] = 1 + raw['upper'] + 2 * raw['height']
The reason you see this error is that raw['upper'] == 0 is a Boolean Series, which you can't combine with and. See the "gotcha" section of the docs.
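For cases where the labels do not reduce to simple arithmetic, a vectorised version can combine the comparisons element-wise with & instead of and. The following is a minimal sketch, assuming the same 0/1 columns as in the question; the use of np.select and the names both_zero, conditions and shortcut are illustrative choices, not something from the thread.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({'upper': [0, 0, 1, 1], 'height': [0, 1, 0, 1]})

# Element-wise AND on Boolean Series uses `&`; each comparison needs its own parentheses.
both_zero = (raw['upper'] == 0) & (raw['height'] == 0)

# np.select picks, per row, the value belonging to the first condition that is True.
conditions = [
    both_zero,
    (raw['upper'] == 1) & (raw['height'] == 0),
    (raw['upper'] == 0) & (raw['height'] == 1),
    (raw['upper'] == 1) & (raw['height'] == 1),
]
raw['cuatro'] = np.select(conditions, [1, 2, 3, 4], default=0)

# The arithmetic shortcut gives the same labels when both columns are strictly 0/1.
shortcut = 1 + raw['upper'] + 2 * raw['height']
```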
I think you're missing the fundamentals of apply: when passed the Series clasif, your function should do something with clasif (at the moment, the function body makes no mention of it).
You also have to pass the function itself to apply.
import pandas as pd
def quad(clasif):
    if (clasif['upper']==0 and clasif['height']==0):
        return 1
    if (clasif['upper']==1 and clasif['height']==0):
        return 2
    if (clasif['upper']==0 and clasif['height']==1):
        return 3
    if (clasif['upper']==1 and clasif['height']==1):
        return 4
raw = pd.DataFrame({'upper': [0, 0, 1, 1], 'height': [0, 1, 0, 1]})
raw['cuatro']=raw.apply(quad, axis=1)
print(raw)
height upper cuatro
0 0 0 1
1 1 0 3
2 0 1 2
3 1 1 4
Andy Hayden's answer is better suited for your case.

How to shuffle data in pandas? [duplicate]

What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e. how to write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times.
Edit: key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index that loses all that information. I want the resulting df to be the same as the original except with the order of rows or order of columns different.
Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't have the same associations between a and b as you do if you just re-order each row as a whole. Something like:
for 1...n:
    for each col in df: shuffle column
return new_df
But hopefully more efficient than naive looping. This does not work for me:
def shuffle(df, n, axis=0):
    shuffled_df = df.copy()
    for k in range(n):
        shuffled_df.apply(np.random.shuffle(shuffled_df.values),axis=axis)
    return shuffled_df
df = pandas.DataFrame({'A':range(10), 'B':range(10)})
shuffle(df, 5)
Use numpy's random.permutation function:
In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [2]: df
Out[2]:
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
A B
0 0 0
5 5 5
6 6 6
3 3 3
8 8 8
7 7 7
9 9 9
1 1 1
2 2 2
4 4 4
Sampling randomizes, so just sample the entire data frame.
df.sample(frac=1)
As @Corey Levinson notes, you have to be careful when you reassign (assignment aligns on the index, so the shuffled values need a fresh index first):
df['column'] = df['column'].sample(frac=1).reset_index(drop=True)
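The difference between shuffling whole rows and reassigning a shuffled column can be seen in a small sketch. It assumes a default integer index; random_state is set only to make the sketch repeatable, and the names rows_shuffled, unchanged and changed are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': range(10), 'B': range(10)})

# Shuffle whole rows: the index labels travel with their rows.
rows_shuffled = df.sample(frac=1, random_state=0)

# Assigning a shuffled Series back aligns on the index, so nothing changes...
unchanged = df.copy()
unchanged['A'] = unchanged['A'].sample(frac=1, random_state=0)

# ...unless the index is reset (or the raw values are used) before assigning.
changed = df.copy()
changed['A'] = changed['A'].sample(frac=1, random_state=0).reset_index(drop=True)
```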
In [16]: def shuffle(df, n=1, axis=0):
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:
In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [18]: shuffle(df)
In [19]: df
Out[19]:
A B
0 8 5
1 1 7
2 7 3
3 6 2
4 3 4
5 0 1
6 9 0
7 4 6
8 2 8
9 5 9
You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support Pandas data frames):
# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))
# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
df: A B
1 1 1
0 0 0
3 3 3
4 4 4
2 2 2
Then you can use df.reset_index() to reset the index column, if need be:
df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 1 1
1 0 0
2 4 4
3 2 2
4 3 3
A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:
df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
df.apply(lambda x: x.sample(frac=1).values)
a b
0 4 2
1 1 6
2 6 5
3 5 3
4 2 4
5 3 1
You must use .values so that you return a numpy array and not a Series, or else the returned Series will align to the original DataFrame, not changing a thing:
df.apply(lambda x: x.sample(frac=1))
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
From the docs use sample():
In [79]: s = pd.Series([0,1,2,3,4,5])
# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]:
0 0
dtype: int64
# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]:
5 5
2 2
4 4
dtype: int64
# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]:
5 5
4 4
1 1
dtype: int64
I resorted to adapting @root's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing, but it works perfectly for just shuffling the data.
In [1]: import numpy
In [2]: import pandas
In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})
In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
1000 loops, best of 3: 406 µs per loop
In [5]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 1):
   ...:     numpy.random.shuffle(view)
   ...:
10000 loops, best of 3: 22.8 µs per loop
In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
1000 loops, best of 3: 746 µs per loop
In [7]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 0):
   ...:     numpy.random.shuffle(view)
   ...:
10000 loops, best of 3: 23.4 µs per loop
Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions, i.e., if we want to shuffle along the first dimension (within each column), we need to roll the second dimension to the front, so that we apply the shuffling to views over the first dimension.
In [8]: numpy.rollaxis(df.values, 0).shape
Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)
In [9]: numpy.rollaxis(df.values, 1).shape
Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)
Your final function then uses a trick to bring the result in line with the expectation for applying a function to an axis:
def shuffle(df, n=1, axis=0):
    df = df.copy()
    axis = int(not axis) # pandas.DataFrame is always 2D
    for _ in range(n):
        for view in numpy.rollaxis(df.values, axis):
            numpy.random.shuffle(view)
    return df
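This might be more useful when you want your index shuffled.
import random
def shuffle(df):
    index = list(df.index)
    random.shuffle(index)
    df = df.loc[index]  # .ix was removed from pandas; .loc selects by label
    return df.reset_index(drop=True)  # reset_index returns a new frame, so its result must be used
It selects a new df using the shuffled index, then resets the index.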
I know the question is about a pandas df, but if the shuffle occurs by row (column order changed, row order unchanged), then the column names do not matter anymore and it could be interesting to use an np.array instead; np.apply_along_axis() will then be what you are looking for.
If that is acceptable then this would be helpful; note that it is easy to switch the axis along which the data is shuffled.
If your pandas data frame is named df, maybe you can:
* get the values of the dataframe with values = df.values,
* create an np.array from values,
* apply the method shown below to shuffle the np.array by row or column,
* recreate a new (shuffled) pandas df from the shuffled np.array.
Original array
a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
Keep row order, shuffle columns within each row
print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
[22 21 20]
[31 30 32]
[40 41 42]]
Keep column order, shuffle rows within each column
print(np.apply_along_axis(np.random.permutation, 0, a))
[[40 41 32]
[20 31 42]
[10 11 12]
[30 21 22]]
Original array is unchanged
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
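The four steps above (extract values, shuffle, rebuild the frame) can also be wrapped into one function that keeps the row and column labels. This is only a sketch: it assumes NumPy 1.20 or newer for Generator.permuted, swaps that call in for np.apply_along_axis, and the function name shuffle_independently and its seed parameter are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(10), 'B': range(10)})

def shuffle_independently(df, axis=0, seed=None):
    """Return a copy of df with every column (axis=0) or every row (axis=1) shuffled on its own."""
    rng = np.random.default_rng(seed)
    # permuted() shuffles each 1-D slice along `axis` independently and returns a new array.
    values = rng.permuted(df.to_numpy(), axis=axis)
    # The labels are reattached unchanged, so index and column names survive.
    return pd.DataFrame(values, index=df.index, columns=df.columns)

shuffled = shuffle_independently(df, axis=0, seed=0)
```

Because the original frame is never modified, calling the function n times on its own output gives the "shuffle n times" behaviour asked for in the question.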
Here is a work around I found if you want to only shuffle a subset of the DataFrame:
shuffle_to_index = 20
df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])

Optimise conversion to integer - pandas

I have a DataFrame with 80,000 rows. One column 'prod_prom' contains either null values or string representations of numbers, i.e. including ','. I need to convert these to integers. So far I have been doing this:
for row in DF.index:
    if pd.notnull(DF.loc[row, 'prod_prom']):
        DF.loc[row, 'prod_prom'] = int(''.join([char for char in DF.loc[row, 'prod_prom'] if char != ',']))
But it is extremely slow. Would it be quicker to do this in list comprehension, or with an apply function? What is best practice for this kind of operation?
Thanks
So if I understand right, you have data like the following:
data = """
A,B
100,"5,000"
200,"10,000"
300,"100,000"
400,
500,"2,000"
"""
If that is the case probably the easiest thing is to use the thousands option in read_csv (the type will be float instead of int because of the missing value):
from io import StringIO
df = pd.read_csv(StringIO(data), thousands=',')
A B
0 100 5000
1 200 10000
2 300 100000
3 400 NaN
4 500 2000
If that is not possible you can do something like the following:
print(df)
A B
0 100 5,000
1 200 10,000
2 300 100,000
3 400 NaN
4 500 2,000
df['B'] = df['B'].str.replace(r',','').astype(float)
print(df)
A B
0 100 5000
1 200 10000
2 300 100000
3 400 NaN
4 500 2000
I changed the type to float because there are no NaN integers in pandas.
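If integers are really needed despite the missing values, newer pandas releases offer a nullable integer dtype that can hold them. A minimal sketch, assuming pandas 0.24 or newer and a small made-up 'prod_prom' column in place of the real data:

```python
import pandas as pd

df = pd.DataFrame({'prod_prom': ['5,000', '10,000', None, '2,000']})

# Strip the thousands separators in one vectorised pass; missing values stay missing.
cleaned = df['prod_prom'].str.replace(',', '', regex=False)

# Assumption: pandas >= 0.24, where the nullable 'Int64' dtype (capital I) is available.
df['prod_prom'] = pd.to_numeric(cleaned).astype('Int64')
```

This avoids the Python-level loop over rows from the question, which is most likely what makes the original approach slow.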

Given two dynamic R x C matrixes, how can I interleave the rows to produce one 2R x C matrix?

Using eigen2, and given a matrix A
a_0_0, a_0_1, a_0_2, ...
a_1_0, a_1_1, a_1_2, ...
...
and a matrix B:
b_0_0, b_0_1, b_0_2, ...
b_1_0, b_1_1, b_1_2, ...
...
and where A and B have the same dimensions, I would like to interleave the rows, producing:
a_0_0, a_0_1, a_0_2, ...
b_0_0, b_0_1, b_0_2, ...
a_1_0, a_1_1, a_1_2, ...
b_1_0, b_1_1, b_1_2, ...
...
Obviously I can write a function that will construct an output matrix of the proper dimensions, then loop over each of the input matrices and assign elements to the result. I'd rather not re-invent the wheel though, so if eigen2 already has a mechanism to express this sort of matrix surgery elegantly I'd much prefer to use it.
I did look through the eigen2 docs and nothing jumped out at me as obviously correct. The closest thing I found was MatrixBase::select, but that does 'element from a or element from b', where what I want is 'element from a then element from b in the next row'.
Efficiency is not of paramount concern since I don't need to do this in the fast path, only at initialization.
I apologize for the formatting if there is a better way to represent matrices.
Left-multiply each R x C matrix by a 2R x R matrix consisting of zeroes, with ones on the appropriate diagonal, then add the two products.
Matrix 1
1 0 0 0 ...
0 0 0 0 ...
0 1 0 0 ...
0 0 0 0 ...
Matrix 2
0 0 0 0 ...
1 0 0 0 ...
0 0 0 0 ...
0 1 0 0 ...
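The selection-matrix idea can be checked numerically. The sketch below uses Python with NumPy instead of Eigen, purely as an illustration; the function name interleave_rows and the sample matrices are made up for the example.

```python
import numpy as np

def interleave_rows(A, B):
    """Interleave the rows of two equally shaped matrices using selection matrices."""
    A, B = np.asarray(A), np.asarray(B)
    R = A.shape[0]
    P1 = np.zeros((2 * R, R), dtype=A.dtype)
    P2 = np.zeros((2 * R, R), dtype=A.dtype)
    P1[2 * np.arange(R), np.arange(R)] = 1      # ones at (0,0), (2,1), (4,2), ...
    P2[2 * np.arange(R) + 1, np.arange(R)] = 1  # ones at (1,0), (3,1), (5,2), ...
    return P1 @ A + P2 @ B

A = np.array([[1, 2, 3], [4, 5, 6]])
B = np.array([[10, 20, 30], [40, 50, 60]])
C = interleave_rows(A, B)
```

Rows 0, 2, ... of C come from A and rows 1, 3, ... from B. The two matrix products cost far more than a plain copy, which matches the question's remark that efficiency is not a concern at initialization.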
Not sure if this is specific to Eigen3, but you can interleave rows using the Map and Stride objects.
MatrixXi C(A.rows()+B.rows(),A.cols());
Map<MatrixXi,0,Stride<Dynamic,2> >(C.data(),A.rows(),A.cols(),Stride<Dynamic,2>(2*A.rows(),2)) = A;
Map<MatrixXi,0,Stride<Dynamic,2> >(C.data()+1,B.rows(),B.cols(),Stride<Dynamic,2>(2*B.rows(),2)) = B;