Below is a small sample of a very large dataframe:
import pandas as pd
In [32]: df3
Out[32]:
Location_ID Time
0 10000000568366 2012-05-31 14:08:00
1 10000000257225 2012-05-31 07:22:00
2 10000000730693 2012-05-31 02:19:00
3 10000000257225 2012-05-30 12:20:00
4 10000001072890 2012-05-30 11:19:00
5 10000000811587 2012-05-31 03:09:00
6 10000000094837 2012-06-02 08:39:00
7 10000000730693 2012-06-01 14:04:00
8 10000000955747 2012-05-31 07:24:00
9 10000000924241 2012-05-30 14:48:00
10 10000000893286 2012-05-18 13:12:00
11 10000000924241 2012-05-31 01:56:00
12 10000000211696 2012-05-30 02:09:00
13 10000000211696 2012-05-29 11:41:00
14 10000000084450 2012-05-31 18:34:00
15 10000000939505 2012-06-02 18:12:00
16 10000000893286 2012-05-31 22:54:00
17 10000000811598 2012-06-01 07:55:00
18 10000000683255 2012-05-29 03:44:00
I am trying to find the time difference in seconds between consecutive rows of "Time" for a particular Location_ID. I am using pandas.to_numeric, which converts the difference into nanoseconds, and then dividing by 1000000000 to get the result in seconds:
df4 = df3.assign(time_difference=df3['Time'].groupby('Location_ID').apply(lambda x : (pd.to_numeric(x.shift()-x).abs())/1000000000))
The error I get is:
KeyError: 'Location_ID'
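A minimal sketch of a likely fix (an assumption, since no answer is given above): df3['Time'] is a bare Series, so it carries no 'Location_ID' label to group on, hence the KeyError. Grouping the whole frame and letting the datetime arithmetic produce seconds avoids the manual nanosecond conversion, assuming 'Time' is already datetime64:
df4 = df3.assign(
    time_difference=df3.groupby('Location_ID')['Time']  # group the frame, not the bare Series
                       .diff()                # Timedelta to the previous row within each location
                       .abs()
                       .dt.total_seconds()    # seconds directly, no division by 1e9 needed
)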
I have a dataframe which looks like this:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
As is clearly evident in the table above, some of the values in the mad and median columns are very big (outliers), so I want to remove the rows that contain them.
For example, in row 3 the value of mad is 30.408377, which is very big, so I want to drop this row. I know that I can use the one-liner below
to remove these values from the columns, but it doesn't remove the complete row:
df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
But I want to remove the complete row.
How can I do that?
A predicate like the one you've given does remove entire rows. But none of your data is more than 3 standard deviations from the mean; if you tone the cutoff down to just one standard deviation, rows are removed from your example data.
Here's an example using your data:
import pandas as pd
import numpy as np
columns = ["wave", "mean", "median", "mad"]
data = [
[4050.32, -0.016182, -0.011940, 0.008885],
[4208.98, 0.023707, 0.007189, 0.032585],
[4508.28, 3.662293, 0.001414, 7.193139],
[4531.62, -15.459313, -0.001523, 30.408377],
[4551.65, 0.009028, 0.007581, 0.005247],
[4554.46, 0.001861, 0.010692, 0.027969],
[6828.60, -10.604568, -0.000590, 21.084799],
[6839.84, -0.003466, -0.001870, 0.010169],
[6842.04, -32.751551, -0.002514, 65.118329],
[6842.69, 18.293519, -0.002158, 36.385884],
[6843.66, 0.006386, -0.002468, 0.034995],
[6855.72, 0.020803, 0.000886, 0.040529],
]
df = pd.DataFrame(np.array(data), columns=columns)
print("ORIGINAL: ")
print(df)
print()
# keep only rows whose 'mad' lies within one standard deviation of the mean
res = df[np.abs(df['mad'] - df['mad'].mean()) <= (df['mad'].std())]
print("REMOVED: ")
print(res)
This outputs:
ORIGINAL:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
REMOVED:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
Observe that rows indexed 8 and 9 are now gone.
Be sure you're reassigning the output of df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())] as shown above; the operation is not done in place.
Doing df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())] on its own will not change the dataframe.
Assign it back to df:
df = df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
I want to count areas of interest in my dataframe column 'which_AOI' (ranging from 0 to 9). I would like a new column with the results added to the dataframe, depending on a variable 'marker' (ranging from 0 to x) which tells me when one 'picture' ends and the next begins (one marker can go on for a variable number of rows). This is my code so far, but it seems to be stuck and runs on without giving output. I tried reconstructing it from the beginning once, but as soon as I get to 'if df.marker == num' it doesn't stop. What am I missing?
(example dataframe below)
## AOI count of spec. type function (in progress):
import numpy as np
import pandas as pd
path_i = "/Users/Desktop/Pilot/results/gazedata_filename.csv"
df = pd.read_csv(path_i, sep =",")
#create a new dataframe for AOIs:
d = {'marker': []}
df_aoi = pd.DataFrame(data=d)
### Creating an Aoi list
item = df.which_AOI
aoi = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] #list for search
aoi_array = [0, 0 , 0, 0, 0, 0, 0, 0, 0, 0] #list for filling
num = 0
for i in range (0, len (df.marker)): #loop through the dataframe
if df.marker == num: ## if marker = num its one picture
for index, item in enumerate(aoi): #look for item (being a number in which_AOI) in aoi list
if (item == aoi[index]):
aoi_array[index] += 1
print (aoi)
print (aoi_array)
se = pd.Series(aoi_array) # make list into a series to attach to dataframe
df_aoi['new_col'] = se.values #add list to dataframe
aoi_array.clear() #clears list before next picture
else:
num +=1
index pos_time pos_x pos_y pup_time pup_diameter marker which_AOI fixation Picname shock
1 16300 168.608779907227 -136.360855102539 16300 2.935715675354 0 7 18 5 save
2 16318 144.97673034668 -157.495513916016 16318 3.08838820457459 0 8 33 5 save
3 16351 152.92560577392598 -156.64172363281298 16351 3.0895299911499 0 7 17 5 save
4 16368 152.132453918457 -157.989685058594 16368 3.111008644104 0 7 18 5 save
5 16386 151.59835815429702 -157.55587768554702 16386 3.09514689445496 0 7 18 5 save
6 16404 150.88092803955098 -152.69479370117202 16404 3.10009074211121 1 7 37 5 save
7 16441 152.76554107666 -142.06188964843798 16441 3.0821495056152304 1 7 33 5 save
Not 100% clear based on your question, but it sounds like you want to count the number of rows for each which_AOI value within each marker.
You can accomplish this using groupby:
df_aoi = df.groupby(['marker','which_AOI']).size().unstack('which_AOI',fill_value=0)
In:
pos_time pos_x pos_y pup_time pup_diameter marker \
0 16300 168.608780 -136.360855 16300 2.935716 0
1 16318 144.976730 -157.495514 16318 3.088388 0
2 16351 152.925606 -156.641724 16351 3.089530 0
3 16368 152.132454 -157.989685 16368 3.111009 0
4 16386 151.598358 -157.555878 16386 3.095147 0
5 16404 150.880928 -152.694794 16404 3.100091 1
6 16441 152.765541 -142.061890 16441 3.082150 1
which_AOI fixation Picname shock
0 7 18 5 save
1 8 33 5 save
2 7 17 5 save
3 7 18 5 save
4 7 18 5 save
5 7 37 5 save
6 7 33 5 save
Out:
which_AOI 7 8
marker
0 4 1
1 2 0
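As a hedged extra beyond the answer above: since which_AOI ranges over 0-9 but a given marker may not contain every value, you can reindex the columns so all ten AOIs appear, with zeros where a marker never hit that AOI:
df_aoi = (df.groupby(['marker', 'which_AOI']).size()
            .unstack('which_AOI', fill_value=0)
            .reindex(columns=range(10), fill_value=0))  # guarantee columns 0..9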
A dataframe stores some values in columns; passing those values to a function, I get another dataframe. I'd like to concatenate the returned dataframe's columns to the original dataframe.
I tried something like
i = pd.concat([i, i[['cid', 'id']].apply(lambda x: xy(*x), axis=1)], axis=1)
but it did not work, giving the error:
ValueError: cannot copy sequence with size 2 to array axis with dimension 1
So I did it like this:
def xy(x, y):
return pd.DataFrame({'x': [x*2], 'y': [y*2]})
df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
print('df1:\n{}'.format(df1))
df2 = pd.DataFrame()
for _, row in df1.iterrows():
nr = xy(row['cid'], row['id'])
nr['cid'] = row['cid']
nr['id'] = row['id']
df2 = df2.append(nr, ignore_index=True)
print('df2:\n{}'.format(df2))
Output:
df1:
cid id
0 4 6
1 4 10
df2:
x y cid id
0 8 12 4 6
1 8 20 4 10
The code does not look nice and is probably slow.
Is there a pandas/pythonic way to do this properly and fast?
python 2.7
Option 0
Most directly with pd.DataFrame.assign. Not very generalizable.
df1.assign(x=df1.cid * 2, y=df1.id * 2)
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 1
Use pd.DataFrame.join to add new columns
This shows how to adjoin new columns after using apply with a lambda.
df1.join(df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 2
Use pd.DataFrame.assign to add new columns
This shows how to adjoin new columns after using apply with a lambda.
df1.assign(**df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 3
However, if your function really is just multiplying by 2:
df1.join(df1.mul(2).rename(columns=dict(cid='x', id='y')))
Or
df1.assign(**df1.mul(2).rename(columns=dict(cid='x', id='y')))
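As a hedged sketch of a more general fix for the original ValueError (my reading, not confirmed by the answer above): the error arises because xy returns a one-row DataFrame per row; returning a pd.Series instead lets apply(axis=1) assemble the new columns directly:
import pandas as pd

def xy(x, y):
    # return a Series, not a one-row DataFrame, so apply can stack the results
    return pd.Series({'x': x * 2, 'y': y * 2})

df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
out = df1.join(df1.apply(lambda r: xy(r['cid'], r['id']), axis=1))
print(out)  # columns: cid, id, x, y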
I have a two-dimensional list of values:
[
[[12.2],[5325]],
[[13.4],[235326]],
[[15.9],[235326]],
[[17.7],[53521]],
[[21.3],[42342]],
[[22.6],[6546]],
[[25.9],[34634]],
[[27.2],[523523]],
[[33.4],[235325]],
[[36.2],[235352]]
]
I would like to get a list of averages defined by a given step so that for a step=10 it would look like this:
[
[[10],[average of all 10-19]],
[[20],[average of all 20-29]],
[[30],[average of all 30-39]]
]
How can I achieve that? Please note that the number of 10s, 20s, 30s and so on is not always the same.
import pandas as pd
# 'thelist' is the two-dimensional list from the question
df = pd.DataFrame((q[0][0], q[1][0]) for q in thelist)  # unwrap the one-element inner lists
df['group'] = (df[0] / 10).astype(int)  # 12.2 -> 1, 21.3 -> 2, 33.4 -> 3
Now df is:
0 1 group
0 12.2 5325 1
1 13.4 235326 1
2 15.9 235326 1
3 17.7 53521 1
4 21.3 42342 2
5 22.6 6546 2
6 25.9 34634 2
7 27.2 523523 2
8 33.4 235325 3
9 36.2 235352 3
Then:
df.groupby('group').mean()
Gives you the answers you seek:
0 1
group
1 14.80 132374
2 24.25 151761
3 34.80 235338
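If the nested [[step], [average]] shape from the question is needed, here is a small hedged follow-up (assuming the df built above) that converts the group means back, scaling the group labels by the step of 10:
means = df.groupby('group')[1].mean()  # average of the second value per bucket
result = [[[g * 10], [v]] for g, v in zip(means.index, means.values)]
# [[[10], [132374.5]], [[20], [151761.25]], [[30], [235338.5]]]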
I have a pandas data frame:
data = pd.read_csv(path)
I'm looking for a good way to remove outlier rows that have an extreme value in any of the features (I have 400 features in the data frame) before I run some prediction algorithms.
I tried a few ways, but they don't seem to solve the issue:
data[data.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
using StandardScaler
I think you can check your output by comparing both indexes with Index.difference, because I think your solution works very nicely:
import pandas as pd
import numpy as np
np.random.seed(1234)
df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
print (df)
A B C
0 0.471435 -1.190976 1.432707
1 -0.312652 -0.720589 0.887163
2 0.859588 -0.636524 0.015696
3 -2.242685 1.150036 0.991946
4 0.953324 -2.021255 -0.334077
5 0.002118 0.405453 0.289092
6 1.321158 -1.546906 -0.202646
7 -0.655969 0.193421 0.553439
8 1.318152 -0.469305 0.675554
9 -1.817027 -0.183109 1.058969
10 -0.397840 0.337438 1.047579
11 1.045938 0.863717 -0.122092
12 0.124713 -0.322795 0.841675
13 2.390961 0.076200 -0.566446
14 0.036142 -2.074978 0.247792
15 -0.897157 -0.136795 0.018289
16 0.755414 0.215269 0.841009
17 -1.445810 -1.401973 -0.100918
18 -0.548242 -0.144620 0.354020
19 -0.035513 0.565738 1.545659
20 -0.974236 -0.070345 0.307969
21 -0.208499 1.033801 -2.400454
22 2.030604 -1.142631 0.211883
23 0.704721 -0.785435 0.462060
24 0.704228 0.523508 -0.926254
25 2.007843 0.226963 -1.152659
26 0.631979 0.039513 0.464392
27 -3.563517 1.321106 0.152631
28 0.164530 -0.430096 0.767369
29 0.984920 0.270836 1.391986
df1 = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
print (df1)
A B C
0 0.471435 -1.190976 1.432707
1 -0.312652 -0.720589 0.887163
2 0.859588 -0.636524 0.015696
3 -2.242685 1.150036 0.991946
4 0.953324 -2.021255 -0.334077
5 0.002118 0.405453 0.289092
6 1.321158 -1.546906 -0.202646
7 -0.655969 0.193421 0.553439
8 1.318152 -0.469305 0.675554
9 -1.817027 -0.183109 1.058969
10 -0.397840 0.337438 1.047579
11 1.045938 0.863717 -0.122092
12 0.124713 -0.322795 0.841675
13 2.390961 0.076200 -0.566446
14 0.036142 -2.074978 0.247792
15 -0.897157 -0.136795 0.018289
16 0.755414 0.215269 0.841009
17 -1.445810 -1.401973 -0.100918
18 -0.548242 -0.144620 0.354020
19 -0.035513 0.565738 1.545659
20 -0.974236 -0.070345 0.307969
22 2.030604 -1.142631 0.211883
23 0.704721 -0.785435 0.462060
24 0.704228 0.523508 -0.926254
25 2.007843 0.226963 -1.152659
26 0.631979 0.039513 0.464392
28 0.164530 -0.430096 0.767369
29 0.984920 0.270836 1.391986
30 0.079842 -0.399965 -1.027851
31 -0.584718 0.816594 -0.081947
idx = df.index.difference(df1.index)
print (idx)
Int64Index([21, 27], dtype='int64')
print (df.loc[idx])
A B C
21 -0.208499 1.033801 -2.400454
27 -3.563517 1.321106 0.152631
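As a hedged extra on top of the check above: you can inspect the per-column z-scores of the dropped rows to see which feature pushed each one past the 3-sigma cut:
z = (df.loc[idx] - df.mean()) / df.std()  # column-wise z-scores of the removed rows
print(z.abs() >= 3)  # True marks the offending feature(s)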