Iterate over DataFrame rows by index value and find max - python-2.7

I need to iterate over the rows of df based on the index. Within each range of row indexes (sub_1_ica_1 ---> sub_1_ica_n), I need to find the max of column p1 and fill it into the output dataframe (along with the max p1 value), and do the same for column p2. Each range must contain exactly one 1 and one 2, and the remaining entries must be filled with zeros. That is why the operation has to be done range by range.
I tried splitting the index name and building a counter per subject to drive the iteration over the rows, but I feel I am going about this the wrong way!
from collections import Counter
a = df.id.tolist()
indlist = []
for x in a:
    i = x.split('_')
    b = int(i[1])
    indlist.insert(-1,b)
c=Counter(indlist)
keyInd = c.keys()
Any ideas?
EDIT: using Jerazel's example, my desired output would look like this.
First I find the max of the p1 and p2 columns; those become 1 and 2 in the new df, and the remaining fields are zeros.

I think you need numpy.argmax together with max; if you also need the column names, use idxmax:
import pandas as pd
idx = ['sub_1_ICA_0','sub_1_ICA_1','sub_1_ICA_2','sub_2_ICA_0','sub_2_ICA_1','sub_2_ICA_2']
df = pd.DataFrame({'p0':[7,8,9,4,2,3],
                   'p1':[1,3,5,7,1,0],
                   'p2':[5,9,6,1,2,4]}, index=idx)
print (df)
cols = ['p0','p1','p2']
# position of the row-wise maximum among cols
df['a'] = df[cols].values.argmax(axis=1)
# value of the row-wise maximum
df['b'] = df[cols].max(axis=1)
# name of the column holding the row-wise maximum
df['c'] = df[cols].idxmax(axis=1)
print (df)
             p0  p1  p2  a  b   c
sub_1_ICA_0   7   1   5  0  7  p0
sub_1_ICA_1   8   3   9  2  9  p2
sub_1_ICA_2   9   5   6  0  9  p0
sub_2_ICA_0   4   7   1  1  7  p1
sub_2_ICA_1   2   1   2  0  2  p0
sub_2_ICA_2   3   0   4  2  4  p2
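
The question asks for something slightly different from a row-wise maximum: within each subject, one row should be marked 1 (max of p1) and one row marked 2 (max of p2), with zeros elsewhere. The desired output is not shown in the question, so the following is only a sketch of one possible reading. The `label` name and the rule that the subject is everything before the last two underscore-separated parts of the index are assumptions made for illustration.

```python
import pandas as pd

idx = ['sub_1_ICA_0', 'sub_1_ICA_1', 'sub_1_ICA_2',
       'sub_2_ICA_0', 'sub_2_ICA_1', 'sub_2_ICA_2']
df = pd.DataFrame({'p0': [7, 8, 9, 4, 2, 3],
                   'p1': [1, 3, 5, 7, 1, 0],
                   'p2': [5, 9, 6, 1, 2, 4]}, index=idx)

# Assumed grouping key: 'sub_1_ICA_0' -> 'sub_1'
subject = [name.rsplit('_', 2)[0] for name in df.index]

# Start with zeros everywhere (hypothetical output column)
label = pd.Series(0, index=df.index, name='label')

# Row label of the p1 maximum within each subject gets a 1
label.loc[df.groupby(subject)['p1'].idxmax().to_numpy()] = 1
# Row label of the p2 maximum within each subject gets a 2
# (if both maxima fall on the same row, the 2 overwrites the 1)
label.loc[df.groupby(subject)['p2'].idxmax().to_numpy()] = 2

print(label)
```

If the maxima of p1 and p2 can land on the same row, the overwrite rule in the last assignment would need to be decided explicitly.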

Related

Magic Square in Python Debugging

The problem originally comes from this link (the statement is reproduced below). I wrote Python code for it but scored only 64 points out of 100, which means my code is missing something. It passes 11 of the 16 test cases, but the other 5 fail. Could you tell me what my code gets wrong and how I can fix it?
import math
m = int(raw_input())
liste = []
y_liste = []
md = 0
ad = 0
sum = 0
sum2 = 0
for k in range(m):
    temp = str(raw_input())
    liste.append(temp)
    liste[k] = liste[k].split(" ")
    liste[k] = [int(i) for i in liste[k]]
for k in range(m):
    md += liste[k][k]
    ad += liste[k][m-k-1]
if md == ad:
    print 0
else:
    for k in range(m):
        for l in range(m):
            sum2 += liste[l][k]
            sum += liste[k][l]
        if sum2 != md and -(k+1) is not y_liste:
            y_liste.append(-(k+1))
        if sum != md and (k+1) is not y_liste:
            y_liste.append(k+1)
        sum2 = 0
        sum = 0
    if md != ad:
        y_liste.append(0)
    print len(y_liste)
    y_liste.sort()
    for i in y_liste:
        print i
Problem Statement
Magic Square
Johnny designed a magic square (square of numbers with the same sum for all rows, columns and diagonals i.e. both the main diagonal - meaning the diagonal that leads from the top-left corner towards bottom-right corner - and the antidiagonal - meaning the diagonal that leads from top-right corner towards bottom-left corner). Write a program to test it.
Task
Write a program that will check if the given square is magic (i.e. has the same sum for all rows, columns and diagonals).
Input
First line: N , the size of the square (1 <= N <= 600).
Next N lines: The square, N space-separated integers per line, representing the entries of each row of the square.
Output
First line: M , the number of lines that do not sum up to the sum of the main diagonal (i.e. the one that contains the first element of the square). If the Square is magic, the program should output 0.
Next M lines: A sorted (in incremental order ) list of the lines that do not sum up to the sum of the main diagonal. The rows are numbered 1,2,…,N; the columns are numbered -1,-2,…,-N; and the antidiagonal is numbered zero.
Note: There is a newline character at the end of the last line of the output.
Sample Input 1
3
8 1 6
3 5 7
4 9 2
Sample Output 1
0
Sample Input 2
4
16 3 2 13
5 10 11 8
6 9 7 12
4 15 14 1
Sample Output 2
3
-2
-1
0
Explanation of Sample Output 2
The input square looks as follows: http://i.stack.imgur.com/JyMgc.png
(Sorry for link but I cannot add image due to reputation)
The square has 4 rows (labeled from 1 to 4 in orange) and 4 columns (labeled from -1 to -4 in green) as depicted in the image above. The main diagonal and antidiagonal of the square are highlighted in red and blue respectively.
The main diagonal has sum = 16 + 10 + 7 +1 = 34.
The antidiagonal has sum = 13 + 11 + 9 + 4 = 37. This is different to the sum of the main diagonal so value 0 corresponding to the antidiagonal should be reported.
Row 1 has sum = 16 + 3 + 2 + 13 = 34.
Row 2 has sum = 5 + 10 + 11 + 8 = 34.
Row 3 has sum = 6 + 9 + 7 + 12 = 34.
Row 4 has sum = 4 + 15 + 14 + 1 = 34.
Column -1 has sum = 16 + 5 + 6 + 4 = 31. This is different to the sum of the main diagonal so value -1 should be reported.
Column -2 has sum = 3 + 10 + 9 + 15 = 37. This is different to the sum of the main diagonal so value -2 should be reported.
Column -3 has sum = 2 + 11 + 7 + 14 = 34.
Column -4 has sum = 13 + 8 + 12 + 1 = 34.
Based on the above, there are 3 lines that do not sum up to the sum of the elements of the main diagonal. Since they should be sorted in incremental order, the output should be:
3
-2
-1
0

Your explanation doesn't discuss this clause, which is a potential source of error:
if md == ad:
    print 0
else:
It says that if the main diagonal and antidiagonal add up to the same value, print just a 0 (no bad lines), indicating that the magic square is valid (as distinct from reporting a 0 in the list of bad lines). Consider this valid magic square:
9 6 3 16
4 15 10 5
14 1 8 11
7 12 13 2
If I swap 13 and 11, the diagonals still equal each other but the square is invalid: rows 3 and 4 and columns -3 and -4 no longer sum to 34. So the above code doesn't appear to be correct. In the else clause of the above if statement, you test:
if md != ad:
    y_liste.append(0)
a fact you already know to be true from the outer test, so your code seems to be at odds with itself.
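
One way to avoid the problem is to check every row and column regardless of whether the two diagonals agree. The sketch below follows the numbering from the problem statement (rows 1..N, columns -1..-N, antidiagonal 0); the function name `bad_lines` is made up for illustration, it is written for Python 3 rather than the Python 2 of the question, and input parsing is left out.

```python
def bad_lines(square):
    """Return the sorted ids of lines whose sum differs from the main diagonal."""
    n = len(square)
    target = sum(square[i][i] for i in range(n))       # main diagonal sum
    bad = []
    # The antidiagonal is reported as 0
    if sum(square[i][n - 1 - i] for i in range(n)) != target:
        bad.append(0)
    for k in range(n):
        # Rows are numbered 1..N
        if sum(square[k]) != target:
            bad.append(k + 1)
        # Columns are numbered -1..-N
        if sum(row[k] for row in square) != target:
            bad.append(-(k + 1))
    return sorted(bad)


sample_2 = [[16, 3, 2, 13], [5, 10, 11, 8], [6, 9, 7, 12], [4, 15, 14, 1]]
swapped = [[9, 6, 3, 16], [4, 15, 10, 5], [14, 1, 8, 13], [7, 12, 11, 2]]

for square in (sample_2, swapped):
    result = bad_lines(square)
    print(len(result))
    for line_id in result:
        print(line_id)
```

The second square is the one from this answer with 13 and 11 swapped: its diagonals agree, yet four lines are reported.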

How to shuffle data in pandas? [duplicate]

What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e. how to write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times.
Edit: key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index that loses all that information. I want the resulting df to be the same as the original except with the order of rows or order of columns different.
Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each row independently. So if you have two columns a and b, I want each row shuffled on its own, so that you don't have the same associations between a and b as you do if you just re-order each row as a whole. Something like:
for 1...n:
    for each col in df: shuffle column
return new_df
But hopefully more efficient than naive looping. This does not work for me:
def shuffle(df, n, axis=0):
    shuffled_df = df.copy()
    for k in range(n):
        shuffled_df.apply(np.random.shuffle(shuffled_df.values),axis=axis)
    return shuffled_df
df = pandas.DataFrame({'A':range(10), 'B':range(10)})
shuffle(df, 5)

Use numpy's random.permutation function:
In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [2]: df
Out[2]:
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
A B
0 0 0
5 5 5
6 6 6
3 3 3
8 8 8
7 7 7
9 9 9
1 1 1
2 2 2
4 4 4

Sampling randomizes, so just sample the entire data frame:
df.sample(frac=1)
As @Corey Levinson notes, you have to be careful when you reassign, because assignment aligns on the index:
df['column'] = df['column'].sample(frac=1).reset_index(drop=True)
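
The reason for the extra care is index alignment: a shuffled Series keeps its original labels, and assigning it back to a column puts every value back where it came from. A minimal sketch of the effect, using a made-up five-row frame and a fixed `random_state` only so the run is repeatable:

```python
import pandas as pd

df = pd.DataFrame({'column': [10, 20, 30, 40, 50]})

# The values are reordered, but each value keeps its original index label
shuffled = df['column'].sample(frac=1, random_state=0)

aligned = df.copy()
aligned['column'] = shuffled                # matched by label: original order comes back

positional = df.copy()
positional['column'] = shuffled.to_numpy()  # plain array, no labels: shuffled order is kept

print(aligned['column'].tolist())
print(positional['column'].tolist())
```

Dropping the labels with `.to_numpy()` (or `.values`) has the same effect here as `reset_index(drop=True)` does when the frame has the default 0..n-1 index.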

In [16]: def shuffle(df, n=1, axis=0):
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:
In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [18]: shuffle(df)
In [19]: df
Out[19]:
A B
0 8 5
1 1 7
2 7 3
3 6 2
4 3 4
5 0 1
6 9 0
7 4 6
8 2 8
9 5 9

You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support Pandas data frames):
# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))
# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
df: A B
1 1 1
0 0 0
3 3 3
4 4 4
2 2 2
Then you can use df.reset_index() to reset the index column, if need be:
df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 1 1
1 0 0
2 4 4
3 2 2
4 3 3

A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:
df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
df.apply(lambda x: x.sample(frac=1).values)
a b
0 4 2
1 1 6
2 6 5
3 5 3
4 2 4
5 3 1
You must use .values so that you return a NumPy array and not a Series; otherwise the returned Series is aligned back to the original DataFrame's index and nothing changes:
df.apply(lambda x: x.sample(frac=1))
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6

From the docs use sample():
In [79]: s = pd.Series([0,1,2,3,4,5])
# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]:
0 0
dtype: int64
# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]:
5 5
2 2
4 4
dtype: int64
# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]:
5 5
4 4
1 1
dtype: int64

I resorted to adapting @root's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing, but it works perfectly for just shuffling the data.
In [1]: import numpy
In [2]: import pandas
In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})
In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
1000 loops, best of 3: 406 µs per loop
In [5]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 1):
   ...:     numpy.random.shuffle(view)
   ...:
10000 loops, best of 3: 22.8 µs per loop
In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
1000 loops, best of 3: 746 µs per loop
In [7]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 0):
   ...:     numpy.random.shuffle(view)
   ...:
10000 loops, best of 3: 23.4 µs per loop
Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions, i.e., if we want to shuffle along the first dimension (columns), we need to roll the second dimension to the front, so that we apply the shuffling to views over the first dimension.
In [8]: numpy.rollaxis(df.values, 0).shape
Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)
In [9]: numpy.rollaxis(df.values, 1).shape
Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)
Your final function then uses a trick to bring the result in line with the expectation for applying a function to an axis:
def shuffle(df, n=1, axis=0):
    df = df.copy()
    axis = int(not axis) # pandas.DataFrame is always 2D
    for _ in range(n):
        for view in numpy.rollaxis(df.values, axis):
            numpy.random.shuffle(view)
    return df
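
To see why iterating over the rolled axis shuffles in place, it may help to try the same loop on a plain NumPy array, where the views are guaranteed to share memory with the original. This is only an illustration on a made-up 3x2 array; whether `df.values` hands back a view of the DataFrame's data depends on the frame's dtypes and the pandas version, which is an assumption the function above relies on.

```python
import numpy as np

arr = np.arange(6).reshape(3, 2)   # 3 rows, 2 columns: [[0, 1], [2, 3], [4, 5]]

# Rolling axis 1 to the front yields one 1-D view per column
column_views = np.rollaxis(arr, 1)
print(column_views.shape)          # (2, 3)

for view in column_views:
    # Each view shares memory with arr, so shuffling it reorders arr's column
    np.random.shuffle(view)

print(arr)
```

After the loop, each column of `arr` still holds its own values, only in a new order.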

This might be more useful when you want your index shuffled.
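import random
def shuffle(df):
    index = list(df.index)
    random.shuffle(index)
    # .ix has been removed from pandas; .loc selects by label
    df = df.loc[index]
    # reset_index returns a new frame, so return its result
    return df.reset_index(drop=True)
It selects the rows in the shuffled index order, then resets the index.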

I know the question is about a pandas df, but when the shuffle happens by row (column order changed, row order unchanged), the column names no longer matter, and it could be interesting to use an np.array instead; np.apply_along_axis() is then what you are looking for.
If that is acceptable then this would be helpful. Note that it is easy to switch the axis along which the data is shuffled.
If your pandas data frame is named df, you can:
get the values of the dataframe with values = df.values,
create an np.array from values,
apply the method shown below to shuffle the np.array by row or column,
recreate a new (shuffled) pandas df from the shuffled np.array.
Original array
a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
Keep row order, shuffle columns within each row
print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
[22 21 20]
[31 30 32]
[40 41 42]]
Keep column order, shuffle rows within each column
print(np.apply_along_axis(np.random.permutation, 0, a))
[[40 41 32]
[20 31 42]
[10 11 12]
[30 21 22]]
Original array is unchanged
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
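
The four steps listed above (extract the values, shuffle the array, rebuild the frame) can be sketched as one helper. This is an illustrative variant, not the code from this answer: it swaps `np.apply_along_axis` for `Generator.permuted`, which needs NumPy 1.20 or later, and the names `shuffle_values` and `seed` are made up here.

```python
import numpy as np
import pandas as pd


def shuffle_values(df, axis=0, seed=None):
    """Shuffle values independently along `axis`, keeping index and column labels."""
    rng = np.random.default_rng(seed)
    values = df.to_numpy(copy=True)             # steps 1-2: plain array, original untouched
    shuffled = rng.permuted(values, axis=axis)  # step 3: independent shuffle per slice
    # step 4: rebuild a frame around the shuffled values with the original labels
    return pd.DataFrame(shuffled, index=df.index, columns=df.columns)


df = pd.DataFrame({'A': range(10), 'B': range(10, 20)})
within_columns = shuffle_values(df, axis=0, seed=0)  # each column shuffled on its own
within_rows = shuffle_values(df, axis=1, seed=0)     # each row shuffled on its own
print(within_columns)
```

With columns of mixed dtypes, `to_numpy()` falls back to a common dtype (often `object`), so this approach suits frames whose columns share one type.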

Here is a workaround I found if you only want to shuffle a subset of the DataFrame (here, the first 20 rows):
shuffle_to_index = 20
df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])

matlab - selecting rows when importing

I have a list of values (product codes, like '1123','4356'...), call it LIST, and I want to select from a matrix M only the corresponding rows. I.e., the first column of the matrix M contains the codes, the other columns contain the data, and I have an additional vector LIST that contains the codes to select.
Ex.
LIST      MATRIX            I WANT
[123;     [000 1 2 3 ;      [123 3 5 6 ;
 456]      123 3 5 6 ;       456 1 4 6 ]
           000 5 6 7 ;
           456 1 4 6 ]
Is there an efficient way to do it?

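list = [123; 456];
mat = [000 1 2 3; 123 3 5 6; 000 5 6 7; 456 1 4 6];
iwant = [123 3 5 6 ; 456 1 4 6];
% look the codes up in the first column only; b holds the matching row numbers
[a,b] = ismember(list, mat(:,1));
% keep only the codes that were found (b is 0 where there is no match)
iwant2 = mat(b(a),:);
% check that the result matches the expected rows
isequal(iwant, iwant2)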

Stata moving products

Using Stata I want a formula (line of code) that takes all of the previous entries for a given group G at a given cell and returns the product for all of the values at that cell and above. For example:
G X Y
1 1 1
1 2 2
1 6 12
1 3 36
2 2 2
2 4 8
3 2 2
4 2 2
4 11 22
4 7 154
G = Group ID, X = Value, Y = Moving Product
The way I have been doing this is pretty long and involves creating a good number of variables. There must be a way in Stata to just have it do a moving product by group ID (G).
Any insight is helpful

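Here is the solution:
sort G, stable
by G: gen moving_product = exp(sum(ln(X)))
This should make moving_product equal Y. It relies on every X being positive, because the product is computed as the exponential of a running sum of logarithms. The stable option keeps the original order of the observations within each group.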
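
For readers who want to check the exp-of-summed-logs trick outside Stata, here is a small Python sketch (not Stata code) that applies the same idea to the table from the question. The function name `moving_product` and the list layout are assumptions made for illustration, and the results are rounded because the logarithm round trip is only accurate up to floating-point error.

```python
import math
from itertools import groupby

# (G, X) pairs and the expected Y column, copied from the question
rows = [(1, 1), (1, 2), (1, 6), (1, 3), (2, 2), (2, 4),
        (3, 2), (4, 2), (4, 11), (4, 7)]
expected_y = [1, 2, 12, 36, 2, 8, 2, 2, 22, 154]


def moving_product(rows):
    """Running product of X within each group, computed as exp(running sum of ln X)."""
    out = []
    # rows are already ordered by group, as after `sort G` in Stata
    for _, group in groupby(rows, key=lambda r: r[0]):
        log_total = 0.0
        for _, x in group:
            log_total += math.log(x)   # only defined for x > 0
            out.append(math.exp(log_total))
    return out


result = [round(value) for value in moving_product(rows)]
print(result)
```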

Transposing a sparse matrix using linked lists (Traversal problems)

I'm trying to transpose a sparse matrix in C++. I'm struggling with the traversal of the new transposed matrix. I want to enter everything from the first row of the matrix into the first column of the new matrix.
Each row stores, for every entry, the column index the number belongs in and the number itself.
Input format:
colInd num colInd num colInd num
Input:
1 1 2 2 3 3
1 4 2 5 3 6
1 7 2 8 3 9
Output:
1 1 2 4 3 7
1 2 2 5 3 8
1 3 2 6 3 9
How do I make the list traverse down the first column, inserting the first element as it goes, then go back to the top and insert down the second column? Apologies if this is too hard to follow. All I want help with is traversing the transposed matrix so that I am in the right place at the right time to insert an nz (non-zero) object.
Here is my code:
list<singleRow> tran;
//Finshed reading so transpose
for (int i = 0; i < rows.size(); i++){ // Initialize transposed matrix
    singleRow trow;
    tran.push_back(trow);
}
list<singleRow>::const_iterator rit;
list<singleRow>::const_iterator trowit;
int rowind;
for (rit = rows.begin(), rowind = 1; rit != rows.end(); rit++, rowind++){//rit = row iterator
    singleRow row = *rit;
    singleRow::const_iterator nzit;
    trowit = tran.begin(); //Start at the beginning of the list of rows
    trow = *trowit;
    for (nzit = row.begin(); nzit != row.end(); nzit++){//nzit = non zero iterator
        int col = nzit->getCol();
        double val = nzit->getVal();
        trow.push_back(nz(rowind,val)); //How do I attach this to tran so that it goes in the right place?
        trowit++;
    }
}

Your representation of the matrix is inefficient: it doesn't use the fact that the matrix is sparse. I say so because it includes all the rows of the matrix, even if most of them are zero (empty), as usually happens with sparse matrices.
Your representation is also hard to work with. So I suggest converting the representation first (to a regular 2-D array), transposing the matrix, and converting back.
(Edited:)
Alternatively, you can change the representation, for example, like this:
Input: rowInd colInd num
1 1 1
1 2 2
1 3 3
2 1 4
2 2 5
2 3 6
3 1 7
3 2 8
3 3 9
Output:
1 1 1
2 1 2
3 1 3
1 2 4
2 2 5
3 2 6
1 3 7
2 3 8
3 3 9
The code would be something like this:
#include <list>
#include <utility> // std::swap
using std::list;
struct singleElement {int row, col; double val;};
list<singleElement> matrix_input, matrix_output;
...
// Read input matrix from file or some such
list<singleElement>::const_iterator i;
for (i = matrix_input.begin(); i != matrix_input.end(); ++i)
{
    singleElement e = *i;        // copy the element
    std::swap(e.row, e.col);     // transposing swaps row and column
    matrix_output.push_back(e);
}

Your choice of a list-of-lists representation for a sparse matrix is poor for transposition. Sometimes, when considering algorithms and data structures, it is better to take the hit of transforming your data structure into one better suited to your algorithm than to mangle your algorithm to work with the wrong data structure.
In this case you could, for example, read your matrix into a coordinate list representation which would be very easy to transpose, then write into whatever representation you like. If space is a challenge, then you might need to do this chunk by chunk, allocating new columns in your target representation 1 by 1 and deallocating columns in your old representation as you go.
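
As a rough illustration of the coordinate-list idea, here is a short Python sketch (Python rather than the question's C++, to keep it compact). Each non-zero entry is stored as a (row, col, value) triple; the function name `transpose_coo` and the decision to sort the result by row and then column are choices made for this example.

```python
def transpose_coo(entries):
    """Transpose a matrix stored as a list of (row, col, value) triples."""
    # Transposing a coordinate list only swaps the two indices of every entry
    swapped = [(col, row, val) for row, col, val in entries]
    # Sorting by (row, col) makes it easy to write the result out row by row
    return sorted(swapped, key=lambda entry: (entry[0], entry[1]))


# The 3x3 matrix from the question, one triple per non-zero entry
matrix = [(1, 1, 1), (1, 2, 2), (1, 3, 3),
          (2, 1, 4), (2, 2, 5), (2, 3, 6),
          (3, 1, 7), (3, 2, 8), (3, 3, 9)]

transposed = transpose_coo(matrix)
for row, col, val in transposed:
    print(row, col, val)
```

The first three triples of the result correspond to the first output row in the question (1 1 2 4 3 7), read as column index and value pairs.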
