Efficiently walking through pandas dataframe index - python-2.7

import pandas as pd
from numpy.random import randn
oldn = pd.DataFrame(randn(10, 4), columns=['A', 'B', 'C', 'D'])
I want to make a new DataFrame with rows 0..9 and a single column "avg", whose value for row N = average(oldn['A'][N], oldn['B'][N] .. oldn['D'][N]).
I'm not very familiar with pandas, so all my ideas for how to do this are gross for-loops and the like. What is the efficient way to create and populate the new table?

Call mean on your df and pass param axis=1 to calculate the mean row-wise, you can then pass this as data to the DataFrame ctor:
In [128]:
new_df = pd.DataFrame(data = oldn.mean(axis=1), columns=['avg'])
new_df
Out[128]:
avg
0 0.541550
1 0.525518
2 -0.492634
3 0.163784
4 0.012363
5 0.514676
6 -0.468888
7 0.334473
8 0.669139
9 0.736748
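As a variation on the constructor call above, the row-wise mean can be promoted to a one-column frame with Series.to_frame. This is only a sketch: the seeded RandomState is an assumption added here so the numbers are reproducible, and it is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Seeded stand-in for the question's random frame (the seed is an assumption).
rng = np.random.RandomState(0)
oldn = pd.DataFrame(rng.randn(10, 4), columns=["A", "B", "C", "D"])

# mean(axis=1) gives one value per row; to_frame names the resulting column.
new_df = oldn.mean(axis=1).to_frame("avg")
print(new_df.shape)
```

The index of new_df is the same 0..9 index as oldn, so the two frames stay aligned.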

If you want the average of specific columns, use the following. Otherwise you can use the answer provided by @EdChum.
oldn['Avg'] = oldn.apply(lambda v: ((v['A']+v['B']+v['C']+v['D']) / 4.), axis=1)
or
oldn['Avg'] = oldn.apply(lambda v: ((v[['A','B','C','D']]).sum() / 4.), axis=1)
print oldn
A B C D Avg
0 -0.201468 -0.832845 0.100299 0.044853 -0.222290
1 1.510688 -0.955329 0.239836 0.767431 0.390657
2 0.780910 0.335267 0.423232 -0.678401 0.215252
3 0.780518 2.876386 -0.797032 -0.523407 0.584116
4 0.438313 -1.952162 0.909568 -0.465147 -0.267357
5 0.145152 -0.836300 0.352706 -0.794815 -0.283314
6 -0.375432 -1.354249 0.920052 -1.002142 -0.452943
7 0.663149 -0.064227 0.321164 0.779981 0.425017
8 -1.279022 -2.206743 0.534943 0.794929 -0.538973
9 -0.339976 0.636516 -0.530445 -0.832413 -0.266579
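The apply calls above run a Python function once per row. A vectorized sketch of the same idea selects the wanted columns first and then takes the row-wise mean. The small frame and the choice of columns "A" and "C" are hypothetical, picked only to keep the arithmetic easy to check.

```python
import pandas as pd

oldn = pd.DataFrame(
    {"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [5.0, 6.0], "D": [7.0, 8.0]}
)

cols = ["A", "C"]  # hypothetical subset of columns to average

# Select the subset, then average across each row in one vectorized call.
oldn["Avg"] = oldn[cols].mean(axis=1)
print(oldn)
```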

Related

Drop rows based on one column values

I've a dataframe which looks like this:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
As is evident in the table above, some of the values in the mad and median columns are very large (outliers), so I want to remove the rows that contain them.
For example, in row 3 the value of mad is 30.408377, which is very large, so I want to drop this row. I know that I can use the following one-liner
to remove these values from the column, but it doesn't remove the complete row:
df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
I want to remove the complete row.
How can I do that?
Predicates like what you've given will remove entire rows. But none of your data is outside of 3 standard deviations. If you tone it down to just one standard deviation, rows are removed with your example data.
Here's an example using your data:
import pandas as pd
import numpy as np
columns = ["wave", "mean", "median", "mad"]
data = [
[4050.32, -0.016182, -0.011940, 0.008885],
[4208.98, 0.023707, 0.007189, 0.032585],
[4508.28, 3.662293, 0.001414, 7.193139],
[4531.62, -15.459313, -0.001523, 30.408377],
[4551.65, 0.009028, 0.007581, 0.005247],
[4554.46, 0.001861, 0.010692, 0.027969],
[6828.60, -10.604568, -0.000590, 21.084799],
[6839.84, -0.003466, -0.001870, 0.010169],
[6842.04, -32.751551, -0.002514, 65.118329],
[6842.69, 18.293519, -0.002158, 36.385884],
[6843.66, 0.006386, -0.002468, 0.034995],
[6855.72, 0.020803, 0.000886, 0.040529],
]
df = pd.DataFrame(np.array(data), columns=columns)
print("ORIGINAL: ")
print(df)
print()
res = df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())]
print("REMOVED: ")
print(res)
this outputs:
ORIGINAL:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
REMOVED:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
Observe that rows indexed 8 and 9 are now gone.
Be sure you're reassigning the output of df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())] as shown above. The operation is not done in place.
Evaluating df[np.abs(df['mad']-df['mad'].mean()) <= (3*df['mad'].std())] on its own will not change the dataframe.
Assign the result back to df instead (bracket access also avoids a clash with the DataFrame.mad method that existed in pandas before 2.0):
df = df[np.abs(df['mad']-df['mad'].mean()) <= (3*df['mad'].std())]
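The filtering discussed above can be wrapped in a small helper so the threshold is easy to vary. This is a minimal sketch: drop_outliers and its n_std parameter are hypothetical names introduced here, and only the mad column from the question is reproduced.

```python
import pandas as pd

def drop_outliers(df, column, n_std=3.0):
    """Return the rows whose `column` value lies within n_std sample
    standard deviations of the column mean. The input is not modified."""
    deviation = (df[column] - df[column].mean()).abs()
    return df[deviation <= n_std * df[column].std()].copy()

mad = [0.008885, 0.032585, 7.193139, 30.408377, 0.005247, 0.027969,
       21.084799, 0.010169, 65.118329, 36.385884, 0.034995, 0.040529]
df = pd.DataFrame({"mad": mad})

filtered = drop_outliers(df, "mad", n_std=1.0)  # rows 8 and 9 are dropped
unchanged = drop_outliers(df, "mad", n_std=3.0)  # nothing exceeds 3 std here
print(filtered.index.tolist())
```

Because the helper returns a new frame, the caller decides whether to rebind df or keep both versions.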

numpy array to pandas pivot table

I'm new to pandas and am trying to create a pivot table from a numpy array.
variable npArray is just that, a numpy array:
>>> npArray
array([(1, 3), (4, 3), (1, 3), ..., (1, 4), (1, 12), (1, 12)],
dtype=[('MATERIAL', '<i4'), ('DIVISION', '<i4')])
I'd like to count occurrences of each material by division, with division being rows and material being columns.
What I have:
#numpy array to pandas data frame
pandaDf = pandas.DataFrame (npArray)
#pivot table - guessing here
pandas.pivot_table (pandaDf, index = "DIVISION",
columns = "MATERIAL",
aggfunc = numpy.sum) #<--- want count, not sum
Results:
Empty DataFrame
Columns: []
Index: []
Sample of pandaDf:
>>> print pandaDf
MATERIAL DIVISION
0 1 3
1 4 3
2 1 3
3 1 3
4 1 3
5 1 3
6 1 3
7 1 3
8 1 3
9 1 3
10 1 3
11 1 3
12 4 3
... ... ...
3845291 1 4
3845292 1 4
3845293 1 4
3845294 1 12
3845295 1 12
[3845296 rows x 2 columns]
Any help would be appreciated.
Something similar has already been asked: https://stackoverflow.com/a/12862196/9754169
Bottom line, just do aggfunc=lambda x: len(x)
@GerardoFlores is correct. Another solution I found was adding a column for frequency.
#numpy array to pandas data frame
pandaDf = pandas.DataFrame (npArray)
print "adding frequency column"
pandaDf ["FREQ"] = 1
#pivot table
pivot = pandas.pivot_table (pandaDf, values = "FREQ",
index = "DIVISION", columns = "MATERIAL",
aggfunc = "count")
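Another way to get the same counts, without a helper column, might be pandas.crosstab, which tabulates how often each pair of values occurs. The six-record array below is a made-up stand-in for the 3.8 million row array in the question.

```python
import numpy as np
import pandas as pd

# Small structured array with the same dtype as in the question (values are made up).
np_array = np.array(
    [(1, 3), (4, 3), (1, 3), (1, 4), (1, 12), (1, 12)],
    dtype=[("MATERIAL", "<i4"), ("DIVISION", "<i4")],
)
df = pd.DataFrame(np_array)

# Rows are divisions, columns are materials, cells are occurrence counts.
counts = pd.crosstab(df["DIVISION"], df["MATERIAL"])
print(counts)
```

Combinations that never occur show up as 0 rather than NaN, which is usually what a frequency table should contain.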

Concatenate pandas dataframe with result of apply(lambda) where lambda returns another dataframe

A dataframe stores some values in columns; passing those values to a function returns another dataframe. I'd like to concatenate the returned dataframe's columns to the original dataframe.
I tried to do something like
i = pd.concat([i, i[['cid', 'id']].apply(lambda x: xy(*x), axis=1)], axis=1)
but it failed with the error:
ValueError: cannot copy sequence with size 2 to array axis with dimension 1
So I did it like this:
def xy(x, y):
    return pd.DataFrame({'x': [x*2], 'y': [y*2]})
df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
print('df1:\n{}'.format(df1))
df2 = pd.DataFrame()
for _, row in df1.iterrows():
    nr = xy(row['cid'], row['id'])
    nr['cid'] = row['cid']
    nr['id'] = row['id']
    df2 = df2.append(nr, ignore_index=True)
print('df2:\n{}'.format(df2))
Output:
df1:
cid id
0 4 6
1 4 10
df2:
x y cid id
0 8 12 4 6
1 8 20 4 10
The code does not look nice and is probably slow.
Is there a pandas/Pythonic way to do this properly and quickly?
python 2.7
Option 0
Most directly with pd.DataFrame.assign. Not very generalizable.
df1.assign(x=df1.cid * 2, y=df1.id * 2)
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 1
Use pd.DataFrame.join to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.join(df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 2
Use pd.DataFrame.assign to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.assign(**df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 3
However, if your function really is just multiplying by 2
df1.join(df1.mul(2).rename(columns=dict(cid='x', id='y')))
Or
df1.assign(**df1.mul(2).rename(columns=dict(cid='x', id='y')))
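If the function cannot be rewritten and really has to return a one-row DataFrame, as xy does in the question, one possible sketch is to collect the pieces in a list, concatenate once, and join the result back. The names pieces and extra are introduced here for illustration only.

```python
import pandas as pd

def xy(x, y):
    # Same helper as in the question: returns a one-row DataFrame.
    return pd.DataFrame({"x": [x * 2], "y": [y * 2]})

df1 = pd.DataFrame({"cid": [4, 4], "id": [6, 10]})

# Call xy once per row and keep the one-row frames in a list.
pieces = [xy(cid, id_) for cid, id_ in zip(df1["cid"], df1["id"])]

# A single concat avoids growing a DataFrame inside the loop.
extra = pd.concat(pieces, ignore_index=True)
extra.index = df1.index  # align row by row with the original frame

df2 = df1.join(extra)
print(df2)
```

This still calls the function once per row, so it is unlikely to match the speed of the vectorized options above, but it avoids the repeated append.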

writing to columns in same row in csv file (python)

I'm trying to write values to a csv file such that the results of every two iterations end up in the same row, and the next values are then written to a new row. Any help would be greatly appreciated. Thank you!
This is what I have so far:
import csv
import math
savePath = '/home/dehaoliu/opencv_test/Engineering_drawings_outputs/'
with open(str(savePath) +'outputsTest.csv','w') as f1:
    writer=csv.writer(f1, delimiter='\t',lineterminator='\n',)
    temp = []
    for k in range(0,2):
        temp = []
        for i in range(0,4):
            a = 2 +i
            b = 3+ i
            list = [a,b]
            temp.append(list)
        writer.writerow(temp)
The result I am getting now is
[2 3][3 4][4 5][5 6]
[2 3][3 4][4 5][5 6]
But I would like to get this (without the brackets) where each number in a row is in a separate column:
2 3 3 4
4 5 5 6
Try the following:
import csv
import math
savePath = '/home/dehaoliu/opencv_test/Engineering_drawings_outputs/'
with open(str(savePath) +'outputsTest.csv','w') as f1:
    writer=csv.writer(f1, delimiter='\t',lineterminator='\n',)
    temp = [2, 3]
    for i in range(2):
        temp = [x + i for x in temp]
        additional = [y+1 for y in temp]
        writer.writerow(temp + additional)
        temp = additional[:]
This should return:
# 2 3 3 4
# 4 5 5 6
You start with a temporary list containing the numbers 2 and 3. Then you loop over the indices 0 and 1 (range(2) excludes 2). At every iteration, you increment the values of the temporary list by the current index and then create an additional list whose values are one greater than those of the temporary list. Once that's done, you join the two lists together and write the result out to your file. At this point, you set the temporary list to a copy of the additional list before moving on to the next iteration.
I hope this helps.
The way you present it you can do it with a simple seed and increment:
import csv
import os
save_path = "/home/dehaoliu/opencv_test/Engineering_drawings_outputs/"
with open(os.path.join(save_path, "outputsTest.csv"), "w") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    temp = [2, 3, 3, 4] # init seed
    increment = len(temp) // 2 # how many pairs we have, used to increase our seed each row
    for _ in range(2): # how many rows do you need, any positive integer will do
        writer.writerow(temp) # write the current value
        temp = [x + increment for x in temp] # add 'increment' to the elements
Resulting in:
2 3 3 4
4 5 5 6
But if your seed is: temp = [2, 3, 3, 4, 4, 5] and you decide to generate 4 rows, it will still adapt:
2 3 3 4 4 5
5 6 6 7 7 8
8 9 9 10 10 11
11 12 12 13 13 14
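A third sketch stays closer to the question's original loop: build the [a, b] pairs first, then write them out two pairs per row by flattening each chunk. It assumes Python 3 and writes to an in-memory io.StringIO buffer in place of the real output file. pairs_per_row is a hypothetical name introduced here.

```python
import csv
import io

# The same pairs the question's inner loop produces: [2, 3], [3, 4], [4, 5], [5, 6].
pairs = [[2 + i, 3 + i] for i in range(4)]
pairs_per_row = 2  # how many iterations share one csv row

buffer = io.StringIO()  # stands in for the file opened in the question
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")

for start in range(0, len(pairs), pairs_per_row):
    chunk = pairs[start:start + pairs_per_row]
    # Flatten the chunk so each number lands in its own column.
    row = [value for pair in chunk for value in pair]
    writer.writerow(row)

output = buffer.getvalue()
print(output)
```

Swapping the buffer for an open file handle should give the same rows on disk.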

Using Pandas to subset data from a dataframe based on multiple columns?

I am new to Python. I have to extract a subset of a pandas dataframe based on 2 lists corresponding to 2 columns in that dataframe. The lists are paired by position: a row should be selected only if both of its values match the entries at the same index in the two lists. I have tried the "isin" function, but obviously it doesn't work with combinations.
from pandas import *
d = {'A' : ['a', 'a', 'c', 'a','b'] ,'B' : [1, 2, 1, 4,1]}
df = DataFrame(d)
list1 = ['a','b']
list2 = [1,2]
print df
A B
0 a 1
1 a 2
2 c 1
3 a 4
4 b 1
### Using isin function
df[(df.A.isin(list1)) & (df.B.isin(list2)) ]
A B
0 a 1
1 a 2
4 b 1
###Desired outcome
d2 = {'A' : ['a'], 'B':[1]}
DataFrame(d2)
A B
0 a 1
Please let me know if this can be done without using loops and if there is a way to do it in a single step.
A quick and dirty way to do this is using zip:
df['C'] = list(zip(df['A'], df['B']))
list3 = list(zip(list1, list2))
d2 = df[df['C'].isin(list3)]
print(d2)
A B C
0 a 1 (a, 1)
You can of course drop the newly created column after you're done filtering on it.
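An alternative that avoids the helper column might be an inner merge against a small frame built from the two lists, so that only rows matching a full (A, B) pair survive. The name wanted is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "c", "a", "b"], "B": [1, 2, 1, 4, 1]})
list1 = ["a", "b"]
list2 = [1, 2]

# One row per wanted (A, B) combination, paired by position in the lists.
wanted = pd.DataFrame({"A": list1, "B": list2})

# The inner merge keeps only rows of df whose (A, B) pair appears in wanted.
result = df.merge(wanted, on=["A", "B"], how="inner")
print(result)
```

Note that merge returns a fresh 0-based index, so the original row labels of df are not preserved.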