Drop rows based on one column values

Drop rows based on one column values - python-2.7

I've a dataframe which looks like this:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
As the table above shows, some of the values in the columns mad and median are very large (outliers), so I want to remove the rows that contain these values.
For example, in row 3 the value of mad is 30.408377, which is very large, so I want to drop this row. I know that I can use one line to remove these values from the column, but it doesn't remove the complete row:
df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
But I want to remove the complete row.
How can I do that?

A boolean predicate like the one you've given does remove entire rows. However, none of your mad values lies more than 3 standard deviations from the mean, so nothing is filtered out. If you tighten the threshold to one standard deviation, rows are removed from your example data.
Here's an example using your data:
import pandas as pd
import numpy as np
columns = ["wave", "mean", "median", "mad"]
data = [
[4050.32, -0.016182, -0.011940, 0.008885],
[4208.98, 0.023707, 0.007189, 0.032585],
[4508.28, 3.662293, 0.001414, 7.193139],
[4531.62, -15.459313, -0.001523, 30.408377],
[4551.65, 0.009028, 0.007581, 0.005247],
[4554.46, 0.001861, 0.010692, 0.027969],
[6828.60, -10.604568, -0.000590, 21.084799],
[6839.84, -0.003466, -0.001870, 0.010169],
[6842.04, -32.751551, -0.002514, 65.118329],
[6842.69, 18.293519, -0.002158, 36.385884],
[6843.66, 0.006386, -0.002468, 0.034995],
[6855.72, 0.020803, 0.000886, 0.040529],
]
df = pd.DataFrame(np.array(data), columns=columns)
print("ORIGINAL: ")
print(df)
print()
res = df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())]
print("REMOVED: ")
print(res)
This outputs:
ORIGINAL:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
REMOVED:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
Observe that rows indexed 8 and 9 are now gone.
Be sure you're reassigning the output of df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())] as shown above. The operation is not done in place.

Evaluating df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())] on its own will not change the dataframe.
Assign the result back to df instead:
df = df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
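The two answers above can be combined into a small helper. This is only a sketch: the function name drop_outlier_rows and its n_std parameter are made up for illustration, and the data is a four-row subset of the question's table.

```python
import pandas as pd

# Four rows taken from the question's table (wave and mad columns only).
df = pd.DataFrame({
    "wave": [4050.32, 4531.62, 6842.04, 6843.66],
    "mad": [0.008885, 30.408377, 65.118329, 0.034995],
})

def drop_outlier_rows(frame, column, n_std=1.0):
    """Return a new frame without rows whose `column` value lies more
    than `n_std` standard deviations from the column mean.
    (Hypothetical helper, not part of pandas.)"""
    deviation = (frame[column] - frame[column].mean()).abs()
    keep = deviation <= n_std * frame[column].std()
    # Boolean indexing selects whole rows and returns a new object;
    # the input frame is left untouched.
    return frame[keep]

filtered = drop_outlier_rows(df, "mad", n_std=1.0)
print(filtered)
print(len(df))  # the original still has every row
```

Because the filter is not applied in place, the caller decides whether to overwrite the original with df = drop_outlier_rows(df, "mad").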

Related

numpy array to pandas pivot table

I'm new to pandas and am trying to create a pivot table from a numpy array.
variable npArray is just that, a numpy array:
>>> npArray
array([(1, 3), (4, 3), (1, 3), ..., (1, 4), (1, 12), (1, 12)],
dtype=[('MATERIAL', '<i4'), ('DIVISION', '<i4')])
I'd like to count occurrences of each material by division, with divisions as rows and materials as columns.
What I have:
#numpy array to pandas data frame
pandaDf = pandas.DataFrame (npArray)
#pivot table - guessing here
pandas.pivot_table (pandaDf, index = "DIVISION",
columns = "MATERIAL",
aggfunc = numpy.sum) #<--- want count, not sum
Results:
Empty DataFrame
Columns: []
Index: []
Sample of pandaDf:
>>> print pandaDf
MATERIAL DIVISION
0 1 3
1 4 3
2 1 3
3 1 3
4 1 3
5 1 3
6 1 3
7 1 3
8 1 3
9 1 3
10 1 3
11 1 3
12 4 3
... ... ...
3845291 1 4
3845292 1 4
3845293 1 4
3845294 1 12
3845295 1 12
[3845296 rows x 2 columns]
Any help would be appreciated.

Something similar has already been asked: https://stackoverflow.com/a/12862196/9754169
Bottom line, just do aggfunc=lambda x: len(x)
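As a different route to the same count table, pd.crosstab counts co-occurrences directly and needs no values column. This is a sketch, not the approach from the linked answer; the six-row array below is a shortened stand-in for the question's npArray.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the structured array in the question.
np_array = np.array(
    [(1, 3), (4, 3), (1, 3), (1, 4), (1, 12), (1, 12)],
    dtype=[("MATERIAL", "<i4"), ("DIVISION", "<i4")],
)

# A structured array converts to a DataFrame with one column per field.
df = pd.DataFrame(np_array)

# Rows are divisions, columns are materials, cells are occurrence counts.
counts = pd.crosstab(df["DIVISION"], df["MATERIAL"])
print(counts)
```

Combinations that never occur show up as 0 rather than NaN.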

@GerardoFlores is correct. Another solution I found was adding a column for frequency.
#numpy array to pandas data frame
pandaDf = pandas.DataFrame (npArray)
print "adding frequency column"
pandaDf ["FREQ"] = 1
#pivot table
pivot = pandas.pivot_table (pandaDf, values = "FREQ",
index = "DIVISION", columns = "MATERIAL",
aggfunc = "count")

for loop in pandas to search dataframe and update list stuck

I want to count areas of interest in my dataframe column 'which_AOI' (ranging from 0 to 9). I would like to add the results as a new column to a dataframe, depending on a variable 'marker' (ranging from 0 to x) which tells me when one 'picture' is done and the next begins (one marker can span a variable number of rows). This is my code so far, but it seems to get stuck and keeps running without giving output. I tried reconstructing it from the beginning once, but as soon as I get to 'if df.marker == num' it doesn't stop. What am I missing?
(example dataframe below)
## AOI count of spec. type function (in progress):
import numpy as np
import pandas as pd
path_i = "/Users/Desktop/Pilot/results/gazedata_filename.csv"
df = pd.read_csv(path_i, sep =",")
#create a new dataframe for AOIs:
d = {'marker': []}
df_aoi = pd.DataFrame(data=d)
### Creating an Aoi list
item = df.which_AOI
aoi = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] #list for search
aoi_array = [0, 0 , 0, 0, 0, 0, 0, 0, 0, 0] #list for filling
num = 0
for i in range (0, len (df.marker)): #loop through the dataframe
    if df.marker == num: ## if marker = num its one picture
        for index, item in enumerate(aoi): #look for item (being a number in which_AOI) in aoi list
            if (item == aoi[index]):
                aoi_array[index] += 1
        print (aoi)
        print (aoi_array)
        se = pd.Series(aoi_array) # make list into a series to attach to dataframe
        df_aoi['new_col'] = se.values #add list to dataframe
        aoi_array.clear() #clears list before next picture
    else:
        num +=1
index pos_time pos_x pos_y pup_time pup_diameter marker which_AOI fixation Picname shock
1 16300 168.608779907227 -136.360855102539 16300 2.935715675354 0 7 18 5 save
2 16318 144.97673034668 -157.495513916016 16318 3.08838820457459 0 8 33 5 save
3 16351 152.92560577392598 -156.64172363281298 16351 3.0895299911499 0 7 17 5 save
4 16368 152.132453918457 -157.989685058594 16368 3.111008644104 0 7 18 5 save
5 16386 151.59835815429702 -157.55587768554702 16386 3.09514689445496 0 7 18 5 save
6 16404 150.88092803955098 -152.69479370117202 16404 3.10009074211121 1 7 37 5 save
7 16441 152.76554107666 -142.06188964843798 16441 3.0821495056152304 1 7 33 5 save

It's not 100% clear from your question, but it sounds like you want to count the number of rows for each which_AOI value within each marker.
You can accomplish this using groupby:
df_aoi = df.groupby(['marker','which_AOI']).size().unstack('which_AOI',fill_value=0)
In:
pos_time pos_x pos_y pup_time pup_diameter marker \
0 16300 168.608780 -136.360855 16300 2.935716 0
1 16318 144.976730 -157.495514 16318 3.088388 0
2 16351 152.925606 -156.641724 16351 3.089530 0
3 16368 152.132454 -157.989685 16368 3.111009 0
4 16386 151.598358 -157.555878 16386 3.095147 0
5 16404 150.880928 -152.694794 16404 3.100091 1
6 16441 152.765541 -142.061890 16441 3.082150 1
which_AOI fixation Picname shock
0 7 18 5 save
1 8 33 5 save
2 7 17 5 save
3 7 18 5 save
4 7 18 5 save
5 7 37 5 save
6 7 33 5 save
Out:
which_AOI 7 8
marker
0 4 1
1 2 0
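The groupby line above can be tried on its own with just the two relevant columns from the sample data. The extra reindex step is an assumption based on the question's statement that which_AOI ranges from 0 to 9; it adds a zero-filled column for every AOI that never occurs.

```python
import pandas as pd

# Only the two columns that matter, copied from the sample rows.
df = pd.DataFrame({
    "marker":    [0, 0, 0, 0, 0, 1, 1],
    "which_AOI": [7, 8, 7, 7, 7, 7, 7],
})

# Count rows per (marker, which_AOI) pair, then spread AOIs into columns.
df_aoi = (
    df.groupby(["marker", "which_AOI"])
      .size()
      .unstack("which_AOI", fill_value=0)
)

# Assumed requirement: one column for every AOI from 0 to 9.
df_aoi = df_aoi.reindex(columns=range(10), fill_value=0)
print(df_aoi)
```

Each row of df_aoi then plays the role of the aoi_array list from the question, one per marker, without an explicit loop.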

Concatenate pandas dataframe with result of apply(lambda) where lambda returns another dataframe

A dataframe stores some values in columns; passing those values to a function returns another dataframe. I'd like to concatenate the returned dataframe's columns to the original dataframe.
I tried something like
i = pd.concat([i, i[['cid', 'id']].apply(lambda x: xy(*x), axis=1)], axis=1)
but it failed with the error:
ValueError: cannot copy sequence with size 2 to array axis with dimension 1
So I did this instead:
def xy(x, y):
    return pd.DataFrame({'x': [x*2], 'y': [y*2]})
df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
print('df1:\n{}'.format(df1))
df2 = pd.DataFrame()
for _, row in df1.iterrows():
    nr = xy(row['cid'], row['id'])
    nr['cid'] = row['cid']
    nr['id'] = row['id']
    df2 = df2.append(nr, ignore_index=True)
print('df2:\n{}'.format(df2))
Output:
df1:
cid id
0 4 6
1 4 10
df2:
x y cid id
0 8 12 4 6
1 8 20 4 10
The code does not look nice and is probably slow.
Is there a proper, fast pandas/Pythonic way to do this?
Python 2.7

Option 0
Most directly with pd.DataFrame.assign. Not very generalizable.
df1.assign(x=df1.cid * 2, y=df1.id * 2)
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 1
Use pd.DataFrame.join to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.join(df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 2
Use pd.DataFrame.assign to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.assign(**df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 3
However, if your function really is just multiplying by 2:
df1.join(df1.mul(2).rename(columns=dict(cid='x', id='y')))
Or
df1.assign(**df1.mul(2).rename(columns=dict(cid='x', id='y')))
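If the function cannot be rewritten and really has to return a one-row dataframe per input row, as xy does in the question, one possible sketch is to collect the pieces in a list, concatenate them once, and join the result back. The variable names pieces and new_cols are illustrative only.

```python
import pandas as pd

def xy(x, y):
    # Same function as in the question: returns a one-row DataFrame.
    return pd.DataFrame({'x': [x * 2], 'y': [y * 2]})

df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})

# Call xy once per row and keep every one-row result.
pieces = [xy(c, i) for c, i in zip(df1['cid'], df1['id'])]

# A single concat avoids growing a DataFrame inside the loop.
new_cols = pd.concat(pieces, ignore_index=True)

# Align the new rows with the original index before joining.
new_cols.index = df1.index
result = df1.join(new_cols)
print(result)
```

This still calls the function once per row, so the vectorized options above should be faster whenever they apply.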

How to split the data in the data frame in python?

I used the below code:
import pandas as pd
pandas_bigram = pd.DataFrame(bigram_data)
print pandas_bigram
I got output as below
0
0 ashoka -**0
1 - wikipedia,**1
2 wikipedia, the**2
3 the free**2
4 free encyclopedia**2
5 encyclopedia ashoka**1
6 ashoka from**2
7 from wikipedia,**1
8 wikipedia, the**2
9 the free**2
10 free encyclopedia**2
My question is: how do I split this data frame so that I get the data in two columns? The data here is separated by "**".

import pandas as pd
df= [" ashoka -**0","- wikipedia,**1","wikipedia, the**2"]
df=pd.DataFrame(df)
print(df)
0
0 ashoka -**0
1 - wikipedia,**1
2 wikipedia, the**2
Use the split function. The method split() returns a list of the pieces of the string, using the given separator (it splits on whitespace if none is specified), optionally limiting the number of splits. The separator here is the two-character string "**"; pandas treats a multi-character pattern as a regular expression by default, so the asterisks are escaped:
df1 = pd.DataFrame(df[0].str.split(r'\*\*', n=1).tolist(),
columns = ['0','1'])
print(df1)
0 1
0 ashoka - 0
1 - wikipedia, 1
2 wikipedia, the 2
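A variant that splits on the full "**" separator and returns the two columns directly is sketched below. It uses expand=True instead of building a list; the column names bigram and count are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame([" ashoka -**0", "- wikipedia,**1", "wikipedia, the**2"])

# "**" is escaped because a multi-character pattern is treated as a regex.
# expand=True returns a DataFrame with one column per piece.
parts = df[0].str.split(r"\*\*", n=1, expand=True)
parts.columns = ["bigram", "count"]  # illustrative names

# Tidy up: drop stray whitespace and turn the counts into integers.
parts["bigram"] = parts["bigram"].str.strip()
parts["count"] = parts["count"].astype(int)
print(parts)
```

Converting the second column to int makes it usable for sorting or summing later.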

Efficiently walking through pandas dataframe index

import pandas as pd
from numpy.random import randn
oldn = pd.DataFrame(randn(10, 4), columns=['A', 'B', 'C', 'D'])
I want to make a new DataFrame with rows 0..9 and a single column "avg", whose value for row N = average(oldn[N]['A'], oldn[N]['B']..oldn[N]['D'])
I'm not very familiar with pandas, so all my ideas for doing this involve gross for-loops and the like. What is an efficient way to create and populate the new table?

Call mean on your df and pass axis=1 to calculate the mean row-wise. You can then pass the result as data to the DataFrame constructor:
In [128]:
new_df = pd.DataFrame(data = oldn.mean(axis=1), columns=['avg'])
new_df
Out[128]:
avg
0 0.541550
1 0.525518
2 -0.492634
3 0.163784
4 0.012363
5 0.514676
6 -0.468888
7 0.334473
8 0.669139
9 0.736748

If you want the average of specific columns, use the following. Otherwise you can use the answer provided by @EdChum.
oldn['Avg'] = oldn.apply(lambda v: ((v['A']+v['B']+v['C']+v['D']) / 4.), axis=1)
or
oldn['Avg'] = oldn.apply(lambda v: ((v[['A','B','C','D']]).sum() / 4.), axis=1)
print oldn
A B C D Avg
0 -0.201468 -0.832845 0.100299 0.044853 -0.222290
1 1.510688 -0.955329 0.239836 0.767431 0.390657
2 0.780910 0.335267 0.423232 -0.678401 0.215252
3 0.780518 2.876386 -0.797032 -0.523407 0.584116
4 0.438313 -1.952162 0.909568 -0.465147 -0.267357
5 0.145152 -0.836300 0.352706 -0.794815 -0.283314
6 -0.375432 -1.354249 0.920052 -1.002142 -0.452943
7 0.663149 -0.064227 0.321164 0.779981 0.425017
8 -1.279022 -2.206743 0.534943 0.794929 -0.538973
9 -0.339976 0.636516 -0.530445 -0.832413 -0.266579
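For an average over only some of the columns, selecting them first and calling mean(axis=1) should give the same numbers as the apply versions without a Python-level call per row. A minimal sketch, assuming only columns A and B are wanted and using a fixed seed so the run is repeatable:

```python
import numpy as np
import pandas as pd

# Fixed seed only so the example is reproducible.
rng = np.random.RandomState(0)
oldn = pd.DataFrame(rng.randn(10, 4), columns=['A', 'B', 'C', 'D'])

# Pick the columns of interest, then average across each row.
subset_avg = oldn[['A', 'B']].mean(axis=1)

# Turn the resulting Series into a one-column DataFrame named "avg".
new_df = subset_avg.to_frame(name='avg')
print(new_df)
```

Passing the full column list ['A', 'B', 'C', 'D'] reproduces the row-wise mean of the whole frame.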

We Keep Coding
