Below is a small sample of a very large dataframe:
import pandas as pd
In [32]: df3
Out[32]:
Location_ID Time
0 10000000568366 2012-05-31 14:08:00
1 10000000257225 2012-05-31 07:22:00
2 10000000730693 2012-05-31 02:19:00
3 10000000257225 2012-05-30 12:20:00
4 10000001072890 2012-05-30 11:19:00
5 10000000811587 2012-05-31 03:09:00
6 10000000094837 2012-06-02 08:39:00
7 10000000730693 2012-06-01 14:04:00
8 10000000955747 2012-05-31 07:24:00
9 10000000924241 2012-05-30 14:48:00
10 10000000893286 2012-05-18 13:12:00
11 10000000924241 2012-05-31 01:56:00
12 10000000211696 2012-05-30 02:09:00
13 10000000211696 2012-05-29 11:41:00
14 10000000084450 2012-05-31 18:34:00
15 10000000939505 2012-06-02 18:12:00
16 10000000893286 2012-05-31 22:54:00
17 10000000811598 2012-06-01 07:55:00
18 10000000683255 2012-05-29 03:44:00
I am trying to find the time difference in seconds between consecutive rows of "Time" for a particular Location_ID. I am using pandas.to_numeric, which converts the difference into nanoseconds, and then dividing by 1000000000 to get the result in seconds:
df4 = df3.assign(time_difference=df3['Time'].groupby('Location_ID').apply(lambda x : (pd.to_numeric(x.shift()-x).abs())/1000000000))
The error I get is:
KeyError: 'Location_ID'
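A minimal sketch of a likely fix (an assumption, since no answer is given above): df3['Time'] is a bare Series, so it carries no 'Location_ID' label to group on, hence the KeyError. Grouping the whole frame and letting the datetime arithmetic produce seconds avoids the manual nanosecond conversion, assuming 'Time' is already datetime64:
df4 = df3.assign(
    time_difference=df3.groupby('Location_ID')['Time']  # group the frame, not the bare Series
                       .diff()                # Timedelta to the previous row within each location
                       .abs()
                       .dt.total_seconds()    # seconds directly, no division by 1e9 needed
)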
I have a dataframe which looks like this:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
As is clearly evident in the table above, some of the values in the mad and median columns are very big (outliers), so I want to remove the rows that contain them.
For example, in row 3 the value of mad is 30.408377, which is very big, so I want to drop this row. I know that I can use the one-liner below
to remove these values from the columns, but it doesn't remove the complete row:
df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
But I want to remove the complete row.
How can I do that?
A predicate like the one you've given does remove entire rows. But none of your data is more than 3 standard deviations from the mean; if you tone the cutoff down to just one standard deviation, rows are removed from your example data.
Here's an example using your data:
import pandas as pd
import numpy as np
columns = ["wave", "mean", "median", "mad"]
data = [
[4050.32, -0.016182, -0.011940, 0.008885],
[4208.98, 0.023707, 0.007189, 0.032585],
[4508.28, 3.662293, 0.001414, 7.193139],
[4531.62, -15.459313, -0.001523, 30.408377],
[4551.65, 0.009028, 0.007581, 0.005247],
[4554.46, 0.001861, 0.010692, 0.027969],
[6828.60, -10.604568, -0.000590, 21.084799],
[6839.84, -0.003466, -0.001870, 0.010169],
[6842.04, -32.751551, -0.002514, 65.118329],
[6842.69, 18.293519, -0.002158, 36.385884],
[6843.66, 0.006386, -0.002468, 0.034995],
[6855.72, 0.020803, 0.000886, 0.040529],
]
df = pd.DataFrame(np.array(data), columns=columns)
print("ORIGINAL: ")
print(df)
print()
# keep only rows whose 'mad' lies within one standard deviation of the mean
res = df[np.abs(df['mad'] - df['mad'].mean()) <= (df['mad'].std())]
print("REMOVED: ")
print(res)
This outputs:
ORIGINAL:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
8 6842.04 -32.751551 -0.002514 65.118329
9 6842.69 18.293519 -0.002158 36.385884
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
REMOVED:
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4508.28 3.662293 0.001414 7.193139
3 4531.62 -15.459313 -0.001523 30.408377
4 4551.65 0.009028 0.007581 0.005247
5 4554.46 0.001861 0.010692 0.027969
6 6828.60 -10.604568 -0.000590 21.084799
7 6839.84 -0.003466 -0.001870 0.010169
10 6843.66 0.006386 -0.002468 0.034995
11 6855.72 0.020803 0.000886 0.040529
Observe that rows indexed 8 and 9 are now gone.
Be sure you're reassigning the output of df[np.abs(df['mad']-df['mad'].mean()) <= (df['mad'].std())] as shown above; the operation is not done in place.
Doing df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())] on its own will not change the dataframe.
Assign it back to df:
df = df[np.abs(df.mad-df.mad.mean()) <= (3*df.mad.std())]
I want to count areas of interest in my dataframe column 'which_AOI' (ranging from 0 to 9). I would like a new column with the results added to the dataframe, depending on a variable 'marker' (ranging from 0 to x) which tells me when one 'picture' ends and the next begins (one marker can go on for a variable number of rows). This is my code so far, but it seems to be stuck and runs on without giving output. I tried reconstructing it from the beginning once, but as soon as I get to 'if df.marker == num' it doesn't stop. What am I missing?
(example dataframe below)
## AOI count of spec. type function (in progress):
import numpy as np
import pandas as pd
path_i = "/Users/Desktop/Pilot/results/gazedata_filename.csv"
df = pd.read_csv(path_i, sep =",")
#create a new dataframe for AOIs:
d = {'marker': []}
df_aoi = pd.DataFrame(data=d)
### Creating an Aoi list
item = df.which_AOI
aoi = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] #list for search
aoi_array = [0, 0 , 0, 0, 0, 0, 0, 0, 0, 0] #list for filling
num = 0
for i in range (0, len (df.marker)): #loop through the dataframe
if df.marker == num: ## if marker = num its one picture
for index, item in enumerate(aoi): #look for item (being a number in which_AOI) in aoi list
if (item == aoi[index]):
aoi_array[index] += 1
print (aoi)
print (aoi_array)
se = pd.Series(aoi_array) # make list into a series to attach to dataframe
df_aoi['new_col'] = se.values #add list to dataframe
aoi_array.clear() #clears list before next picture
else:
num +=1
index pos_time pos_x pos_y pup_time pup_diameter marker which_AOI fixation Picname shock
1 16300 168.608779907227 -136.360855102539 16300 2.935715675354 0 7 18 5 save
2 16318 144.97673034668 -157.495513916016 16318 3.08838820457459 0 8 33 5 save
3 16351 152.92560577392598 -156.64172363281298 16351 3.0895299911499 0 7 17 5 save
4 16368 152.132453918457 -157.989685058594 16368 3.111008644104 0 7 18 5 save
5 16386 151.59835815429702 -157.55587768554702 16386 3.09514689445496 0 7 18 5 save
6 16404 150.88092803955098 -152.69479370117202 16404 3.10009074211121 1 7 37 5 save
7 16441 152.76554107666 -142.06188964843798 16441 3.0821495056152304 1 7 33 5 save
Not 100% clear based on your question, but it sounds like you want to count the number of rows for each which_AOI value within each marker.
You can accomplish this using groupby:
df_aoi = df.groupby(['marker','which_AOI']).size().unstack('which_AOI',fill_value=0)
In:
pos_time pos_x pos_y pup_time pup_diameter marker \
0 16300 168.608780 -136.360855 16300 2.935716 0
1 16318 144.976730 -157.495514 16318 3.088388 0
2 16351 152.925606 -156.641724 16351 3.089530 0
3 16368 152.132454 -157.989685 16368 3.111009 0
4 16386 151.598358 -157.555878 16386 3.095147 0
5 16404 150.880928 -152.694794 16404 3.100091 1
6 16441 152.765541 -142.061890 16441 3.082150 1
which_AOI fixation Picname shock
0 7 18 5 save
1 8 33 5 save
2 7 17 5 save
3 7 18 5 save
4 7 18 5 save
5 7 37 5 save
6 7 33 5 save
Out:
which_AOI 7 8
marker
0 4 1
1 2 0
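As a hedged extra beyond the answer above: since which_AOI ranges over 0-9 but a given marker may not contain every value, you can reindex the columns so all ten AOIs appear, with zeros where a marker never hit that AOI:
df_aoi = (df.groupby(['marker', 'which_AOI']).size()
            .unstack('which_AOI', fill_value=0)
            .reindex(columns=range(10), fill_value=0))  # guarantee columns 0..9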
A dataframe stores some values in columns; passing those values to a function, I get another dataframe. I'd like to concatenate the returned dataframe's columns to the original dataframe.
I tried something like
i = pd.concat([i, i[['cid', 'id']].apply(lambda x: xy(*x), axis=1)], axis=1)
but it did not work, giving the error:
ValueError: cannot copy sequence with size 2 to array axis with dimension 1
So I did it like this:
def xy(x, y):
return pd.DataFrame({'x': [x*2], 'y': [y*2]})
df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
print('df1:\n{}'.format(df1))
df2 = pd.DataFrame()
for _, row in df1.iterrows():
nr = xy(row['cid'], row['id'])
nr['cid'] = row['cid']
nr['id'] = row['id']
df2 = df2.append(nr, ignore_index=True)
print('df2:\n{}'.format(df2))
Output:
df1:
cid id
0 4 6
1 4 10
df2:
x y cid id
0 8 12 4 6
1 8 20 4 10
The code does not look nice and is probably slow.
Is there a pandas/pythonic way to do this properly and fast?
python 2.7
Option 0
Most directly with pd.DataFrame.assign. Not very generalizable.
df1.assign(x=df1.cid * 2, y=df1.id * 2)
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 1
Use pd.DataFrame.join to add new columns
This shows how to adjoin new columns after using apply with a lambda.
df1.join(df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 2
Use pd.DataFrame.assign to add new columns
This shows how to adjoin new columns after using apply with a lambda.
df1.assign(**df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 3
However, if your function really is just multiplying by 2:
df1.join(df1.mul(2).rename(columns=dict(cid='x', id='y')))
Or
df1.assign(**df1.mul(2).rename(columns=dict(cid='x', id='y')))
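As a hedged sketch of a more general fix for the original ValueError (my reading, not confirmed by the answer above): the error arises because xy returns a one-row DataFrame per row; returning a pd.Series instead lets apply(axis=1) assemble the new columns directly:
import pandas as pd

def xy(x, y):
    # return a Series, not a one-row DataFrame, so apply can stack the results
    return pd.Series({'x': x * 2, 'y': y * 2})

df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
out = df1.join(df1.apply(lambda r: xy(r['cid'], r['id']), axis=1))
print(out)  # columns: cid, id, x, y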
I have a two-dimensional list of values:
[
[[12.2],[5325]],
[[13.4],[235326]],
[[15.9],[235326]],
[[17.7],[53521]],
[[21.3],[42342]],
[[22.6],[6546]],
[[25.9],[34634]],
[[27.2],[523523]],
[[33.4],[235325]],
[[36.2],[235352]]
]
I would like to get a list of averages defined by a given step so that for a step=10 it would look like this:
[
[[10],[average of all 10-19]],
[[20],[average of all 20-29]],
[[30],[average of all 30-39]]
]
How can I achieve that? Please note that the number of 10s, 20s, 30s and so on is not always the same.
import pandas as pd
# 'thelist' is the two-dimensional list from the question
df = pd.DataFrame((q[0][0], q[1][0]) for q in thelist)  # unwrap the one-element inner lists
df['group'] = (df[0] / 10).astype(int)  # 12.2 -> 1, 21.3 -> 2, 33.4 -> 3
Now df is:
0 1 group
0 12.2 5325 1
1 13.4 235326 1
2 15.9 235326 1
3 17.7 53521 1
4 21.3 42342 2
5 22.6 6546 2
6 25.9 34634 2
7 27.2 523523 2
8 33.4 235325 3
9 36.2 235352 3
Then:
df.groupby('group').mean()
Gives you the answers you seek:
0 1
group
1 14.80 132374
2 24.25 151761
3 34.80 235338
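If the nested [[step], [average]] shape from the question is needed, here is a small hedged follow-up (assuming the df built above) that converts the group means back, scaling the group labels by the step of 10:
means = df.groupby('group')[1].mean()  # average of the second value per bucket
result = [[[g * 10], [v]] for g, v in zip(means.index, means.values)]
# [[[10], [132374.5]], [[20], [151761.25]], [[30], [235338.5]]]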
I have a pandas data frame:
data = pd.read_csv(path)
I'm looking for a good way to remove outlier rows that have an extreme value in any of the features (I have 400 features in the data frame) before I run some prediction algorithms.
I tried a few ways, but they don't seem to solve the issue:
data[data.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
using StandardScaler
I think you can check your output by comparing both indexes with Index.difference, because I think your solution works very nicely:
import pandas as pd
import numpy as np
np.random.seed(1234)
df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
print (df)
A B C
0 0.471435 -1.190976 1.432707
1 -0.312652 -0.720589 0.887163
2 0.859588 -0.636524 0.015696
3 -2.242685 1.150036 0.991946
4 0.953324 -2.021255 -0.334077
5 0.002118 0.405453 0.289092
6 1.321158 -1.546906 -0.202646
7 -0.655969 0.193421 0.553439
8 1.318152 -0.469305 0.675554
9 -1.817027 -0.183109 1.058969
10 -0.397840 0.337438 1.047579
11 1.045938 0.863717 -0.122092
12 0.124713 -0.322795 0.841675
13 2.390961 0.076200 -0.566446
14 0.036142 -2.074978 0.247792
15 -0.897157 -0.136795 0.018289
16 0.755414 0.215269 0.841009
17 -1.445810 -1.401973 -0.100918
18 -0.548242 -0.144620 0.354020
19 -0.035513 0.565738 1.545659
20 -0.974236 -0.070345 0.307969
21 -0.208499 1.033801 -2.400454
22 2.030604 -1.142631 0.211883
23 0.704721 -0.785435 0.462060
24 0.704228 0.523508 -0.926254
25 2.007843 0.226963 -1.152659
26 0.631979 0.039513 0.464392
27 -3.563517 1.321106 0.152631
28 0.164530 -0.430096 0.767369
29 0.984920 0.270836 1.391986
df1 = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
print (df1)
A B C
0 0.471435 -1.190976 1.432707
1 -0.312652 -0.720589 0.887163
2 0.859588 -0.636524 0.015696
3 -2.242685 1.150036 0.991946
4 0.953324 -2.021255 -0.334077
5 0.002118 0.405453 0.289092
6 1.321158 -1.546906 -0.202646
7 -0.655969 0.193421 0.553439
8 1.318152 -0.469305 0.675554
9 -1.817027 -0.183109 1.058969
10 -0.397840 0.337438 1.047579
11 1.045938 0.863717 -0.122092
12 0.124713 -0.322795 0.841675
13 2.390961 0.076200 -0.566446
14 0.036142 -2.074978 0.247792
15 -0.897157 -0.136795 0.018289
16 0.755414 0.215269 0.841009
17 -1.445810 -1.401973 -0.100918
18 -0.548242 -0.144620 0.354020
19 -0.035513 0.565738 1.545659
20 -0.974236 -0.070345 0.307969
22 2.030604 -1.142631 0.211883
23 0.704721 -0.785435 0.462060
24 0.704228 0.523508 -0.926254
25 2.007843 0.226963 -1.152659
26 0.631979 0.039513 0.464392
28 0.164530 -0.430096 0.767369
29 0.984920 0.270836 1.391986
30 0.079842 -0.399965 -1.027851
31 -0.584718 0.816594 -0.081947
idx = df.index.difference(df1.index)
print (idx)
Int64Index([21, 27], dtype='int64')
print (df.loc[idx])
A B C
21 -0.208499 1.033801 -2.400454
27 -3.563517 1.321106 0.152631
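As a hedged extra on top of the check above: you can inspect the per-column z-scores of the dropped rows to see which feature pushed each one past the 3-sigma cut:
z = (df.loc[idx] - df.mean()) / df.std()  # column-wise z-scores of the removed rows
print(z.abs() >= 3)  # True marks the offending feature(s)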