Removing outliers automatically in pandas data frame - python-2.7

I have a pandas data frame:
data = pd.read_csv(path)
I'm looking for a good way to remove outlier rows that have an extreme value in any of the features (I have 400 features in the data frame) before I run some prediction algorithms.
Tried a few ways but they don't seem to solve the issue:
data[data.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
using Standard Scaler

I think your solution works well. You can check the output by comparing the two indexes with Index.difference:
import pandas as pd
import numpy as np
np.random.seed(1234)
df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
print (df)
A B C
0 0.471435 -1.190976 1.432707
1 -0.312652 -0.720589 0.887163
2 0.859588 -0.636524 0.015696
3 -2.242685 1.150036 0.991946
4 0.953324 -2.021255 -0.334077
5 0.002118 0.405453 0.289092
6 1.321158 -1.546906 -0.202646
7 -0.655969 0.193421 0.553439
8 1.318152 -0.469305 0.675554
9 -1.817027 -0.183109 1.058969
10 -0.397840 0.337438 1.047579
11 1.045938 0.863717 -0.122092
12 0.124713 -0.322795 0.841675
13 2.390961 0.076200 -0.566446
14 0.036142 -2.074978 0.247792
15 -0.897157 -0.136795 0.018289
16 0.755414 0.215269 0.841009
17 -1.445810 -1.401973 -0.100918
18 -0.548242 -0.144620 0.354020
19 -0.035513 0.565738 1.545659
20 -0.974236 -0.070345 0.307969
21 -0.208499 1.033801 -2.400454
22 2.030604 -1.142631 0.211883
23 0.704721 -0.785435 0.462060
24 0.704228 0.523508 -0.926254
25 2.007843 0.226963 -1.152659
26 0.631979 0.039513 0.464392
27 -3.563517 1.321106 0.152631
28 0.164530 -0.430096 0.767369
29 0.984920 0.270836 1.391986
df1 = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]
print (df1)
A B C
0 0.471435 -1.190976 1.432707
1 -0.312652 -0.720589 0.887163
2 0.859588 -0.636524 0.015696
3 -2.242685 1.150036 0.991946
4 0.953324 -2.021255 -0.334077
5 0.002118 0.405453 0.289092
6 1.321158 -1.546906 -0.202646
7 -0.655969 0.193421 0.553439
8 1.318152 -0.469305 0.675554
9 -1.817027 -0.183109 1.058969
10 -0.397840 0.337438 1.047579
11 1.045938 0.863717 -0.122092
12 0.124713 -0.322795 0.841675
13 2.390961 0.076200 -0.566446
14 0.036142 -2.074978 0.247792
15 -0.897157 -0.136795 0.018289
16 0.755414 0.215269 0.841009
17 -1.445810 -1.401973 -0.100918
18 -0.548242 -0.144620 0.354020
19 -0.035513 0.565738 1.545659
20 -0.974236 -0.070345 0.307969
22 2.030604 -1.142631 0.211883
23 0.704721 -0.785435 0.462060
24 0.704228 0.523508 -0.926254
25 2.007843 0.226963 -1.152659
26 0.631979 0.039513 0.464392
28 0.164530 -0.430096 0.767369
29 0.984920 0.270836 1.391986
30 0.079842 -0.399965 -1.027851
31 -0.584718 0.816594 -0.081947
idx = df.index.difference(df1.index)
print (idx)
Int64Index([21, 27], dtype='int64')
print (df.loc[idx])
A B C
21 -0.208499 1.033801 -2.400454
27 -3.563517 1.321106 0.152631
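For a wide frame (such as the 400 features in the question), the same check can be wrapped in a small helper. This is only a sketch: the function name drop_zscore_outliers, the threshold default and the demo data are made up for illustration, and it assumes only numeric columns should be tested.

```python
import numpy as np
import pandas as pd

def drop_zscore_outliers(frame, threshold=3.0):
    """Keep rows whose numeric values all lie within `threshold` standard deviations."""
    numeric = frame.select_dtypes(include=[np.number])
    z = (numeric - numeric.mean()) / numeric.std()
    keep = (z.abs() < threshold).all(axis=1)
    return frame[keep]

# Hypothetical demo data with one planted outlier in row 5
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.randn(200, 3), columns=list('ABC'))
demo.loc[5, 'B'] = 50.0

cleaned = drop_zscore_outliers(demo)
removed = demo.index.difference(cleaned.index)
print(removed)
```

One thing to watch for: a comparison with NaN is False, so a row with a missing value in any numeric column is dropped too, and a constant column (standard deviation 0) produces NaN z-scores for every row. That may be why the filter seems not to work on some real data sets.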

Related

pandas - group by: create aggregation function using multiple columns

I have the following data frame:
id my_year my_month waiting_time target
001 2018 1 95 1
002 2018 1 3 3
003 2018 1 4 0
004 2018 1 40 1
005 2018 2 97 1
006 2018 2 3 3
007 2018 3 4 0
008 2018 3 40 1
I want to groupby my_year and my_month, then in each group I want to compute the my_rate based on
(# of records with waiting_time <= 90 and target = 1)/ total_records in the group
i.e. I am expecting output like:
my_year my_month my_rate
2018 1 0.25
2018 2 0.0
2018 3 0.5
I wrote the following code to compute the desired value my_rate:
def my_rate(data):
    waiting_time_list = data['waiting_time']
    target_list = data['target']
    total = len(data)
    my_count = 0
    for i in range(len(data)):
        if waiting_time_list[i] <= 90 and target_list[i] == 1:
            my_count += 1
    rate = float(my_count)/float(total)
    return rate
df.groupby(['my_year','my_month']).apply(my_rate)
However, I got the following error:
KeyError 0
KeyErrorTraceback (most recent call last)
<ipython-input-29-5c4399cefd05> in <module>()
17
---> 18 df.groupby(['my_year','my_month']).apply(my_rate)
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/groupby.pyc in apply(self, func, *args, **kwargs)
714 # ignore SettingWithCopy here in case the user mutates
715 with option_context('mode.chained_assignment', None):
--> 716 return self._python_apply_general(f)
717
718 def _python_apply_general(self, f):
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/groupby.pyc in _python_apply_general(self, f)
718 def _python_apply_general(self, f):
719 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 720 self.axis)
721
722 return self._wrap_applied_output(
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/groupby.pyc in apply(self, f, data, axis)
1727 # group might be modified
1728 group_axes = _get_axes(group)
-> 1729 res = f(group)
1730 if not _is_indexed_like(res, group_axes):
1731 mutated = True
<ipython-input-29-5c4399cefd05> in conversion_rate(data)
8 #print total_waiting_time_list[i], target_list[i]
9 #print i, total_waiting_time_list[i], target_list[i]
---> 10 if total_waiting_time_list[i] <= 90:# and target_list[i] == 1:
11 convert_90_count += 1
12 #print 'convert ', convert_90_count
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
599 key = com._apply_if_callable(key, self)
600 try:
--> 601 result = self.index.get_value(self, key)
602
603 if not is_scalar(result):
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in get_value(self, series, key)
2426 try:
2427 return self._engine.get_value(s, k,
-> 2428 tz=getattr(series.dtype, 'tz', None))
2429 except KeyError as e1:
2430 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4363)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4046)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13913)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13857)()
KeyError: 0
Any idea what I did wrong here? And how do I fix it? Thanks!
I believe it is better to use the mean of a boolean mask per group:
def my_rate(x):
    return ((x['waiting_time'] <= 90) & (x['target'] == 1)).mean()
df = df.groupby(['my_year','my_month']).apply(my_rate).reset_index(name='my_rate')
print (df)
my_year my_month my_rate
0 2018 1 0.25
1 2018 2 0.00
2 2018 3 0.50
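If you prefer to avoid apply altogether, the mask can also be computed once for the whole frame and then averaged per group. A minimal sketch, rebuilding the sample data from the question; the names records and hit are just for illustration:

```python
import pandas as pd

# Sample data from the question
records = pd.DataFrame({
    'id': ['001', '002', '003', '004', '005', '006', '007', '008'],
    'my_year': [2018] * 8,
    'my_month': [1, 1, 1, 1, 2, 2, 3, 3],
    'waiting_time': [95, 3, 4, 40, 97, 3, 4, 40],
    'target': [1, 3, 0, 1, 1, 3, 0, 1],
})

# True where a record counts towards the rate
hit = (records['waiting_time'] <= 90) & (records['target'] == 1)

# The mean of a boolean column is the share of True values in each group
result = (records.assign(hit=hit)
                 .groupby(['my_year', 'my_month'])['hit']
                 .mean()
                 .reset_index(name='my_rate'))
print(result)
```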
Any idea what I did wrong here?
The problem is that waiting_time_list and target_list are not lists, but Series:
waiting_time_list = data['waiting_time']
target_list = data['target']
print (type(waiting_time_list))
<class 'pandas.core.series.Series'>
print (type(target_list))
<class 'pandas.core.series.Series'>
So indexing with i fails, because the lookup is by index label and the second group has the labels 4 and 5, not 0 and 1:
if waiting_time_list[i] <= 90 and target_list[i] == 1:
To avoid this, you can convert the Series to lists:
waiting_time_list = data['waiting_time'].tolist()
target_list = data['target'].tolist()
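To see the difference between label and position lookups in isolation, here is a small sketch. The Series s is a made-up stand-in for the 'waiting_time' column of the second group:

```python
import pandas as pd

# Stand-in for the second group's 'waiting_time': the labels are 4 and 5, not 0 and 1
s = pd.Series([97, 3], index=[4, 5])

raised = False
try:
    s[0]              # looked up as a label, and there is no label 0
except KeyError:
    raised = True

first_by_position = s.iloc[0]     # positional access works
first_from_list = s.tolist()[0]   # so does a plain Python list
print(raised, first_by_position, first_from_list)
```

Using .iloc keeps the data as a Series, which may be preferable to tolist() if you still need pandas methods afterwards.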

for loop in pandas to search dataframe and update list stuck

I want to count areas of interest in my dataframe column 'which_AOI' (ranging from 0 to 9). I would like to add a new column with the results to a dataframe, depending on a variable 'marker' (ranging from 0 to x) which tells me when one 'picture' is done and the next begins (one marker can span a variable number of rows). This is my code so far, but it seems to be stuck and runs on without giving output. I tried reconstructing it from the beginning once, but as soon as I get to 'if df.marker == num' it doesn't stop. What am I missing?
(example dataframe below)
## AOI count of spec. type function (in progress):
import numpy as np
import pandas as pd
path_i = "/Users/Desktop/Pilot/results/gazedata_filename.csv"
df = pd.read_csv(path_i, sep =",")
#create a new dataframe for AOIs:
d = {'marker': []}
df_aoi = pd.DataFrame(data=d)
### Creating an Aoi list
item = df.which_AOI
aoi = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] #list for search
aoi_array = [0, 0 , 0, 0, 0, 0, 0, 0, 0, 0] #list for filling
num = 0
for i in range (0, len (df.marker)): #loop through the dataframe
    if df.marker == num: ## if marker = num its one picture
        for index, item in enumerate(aoi): #look for item (being a number in which_AOI) in aoi list
            if (item == aoi[index]):
                aoi_array[index] += 1
        print (aoi)
        print (aoi_array)
        se = pd.Series(aoi_array) # make list into a series to attach to dataframe
        df_aoi['new_col'] = se.values #add list to dataframe
        aoi_array.clear() #clears list before next picture
    else:
        num +=1
index pos_time pos_x pos_y pup_time pup_diameter marker which_AOI fixation Picname shock
1 16300 168.608779907227 -136.360855102539 16300 2.935715675354 0 7 18 5 save
2 16318 144.97673034668 -157.495513916016 16318 3.08838820457459 0 8 33 5 save
3 16351 152.92560577392598 -156.64172363281298 16351 3.0895299911499 0 7 17 5 save
4 16368 152.132453918457 -157.989685058594 16368 3.111008644104 0 7 18 5 save
5 16386 151.59835815429702 -157.55587768554702 16386 3.09514689445496 0 7 18 5 save
6 16404 150.88092803955098 -152.69479370117202 16404 3.10009074211121 1 7 37 5 save
7 16441 152.76554107666 -142.06188964843798 16441 3.0821495056152304 1 7 33 5 save
Not 100% clear based on your question but it sounds like you want to count the number of rows for each which_AOI value in each marker.
You can accomplish this using groupby, size and unstack:
df_aoi = df.groupby(['marker','which_AOI']).size().unstack('which_AOI',fill_value=0)
In:
pos_time pos_x pos_y pup_time pup_diameter marker \
0 16300 168.608780 -136.360855 16300 2.935716 0
1 16318 144.976730 -157.495514 16318 3.088388 0
2 16351 152.925606 -156.641724 16351 3.089530 0
3 16368 152.132454 -157.989685 16368 3.111009 0
4 16386 151.598358 -157.555878 16386 3.095147 0
5 16404 150.880928 -152.694794 16404 3.100091 1
6 16441 152.765541 -142.061890 16441 3.082150 1
which_AOI fixation Picname shock
0 7 18 5 save
1 8 33 5 save
2 7 17 5 save
3 7 18 5 save
4 7 18 5 save
5 7 37 5 save
6 7 33 5 save
Out:
which_AOI 7 8
marker
0 4 1
1 2 0
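If you also want a column for every AOI from 0 to 9, even those that never occur, one option is to reindex the columns afterwards. A sketch using only the two relevant columns of the sample data; the name gaze is made up, and the 0 to 9 range is taken from the question:

```python
import pandas as pd

# The 'marker' and 'which_AOI' columns from the sample rows
gaze = pd.DataFrame({
    'marker':    [0, 0, 0, 0, 0, 1, 1],
    'which_AOI': [7, 8, 7, 7, 7, 7, 7],
})

counts = (gaze.groupby(['marker', 'which_AOI'])
              .size()
              .unstack('which_AOI', fill_value=0)
              .reindex(columns=range(10), fill_value=0))  # add the AOIs that never occur
print(counts)
```

Each row of counts is then one picture (marker) and each column one AOI, so no explicit loop over the rows is needed.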

python2 pandas: how to merge a part of another dataframe to a dataframe

I have a dataframe (df1) as follows:
datetime m d 1d 2d 3d
2014-01-01 1 1 2 2 3
2014-01-02 1 2 3 4 3
2014-01-03 1 3 1 2 3
...........
2014-12-01 12 1 2 2 3
2014-12-31 12 31 2 2 3
I also have another dataframe (df2) as follows:
datetime m d
2015-01-02 1 2
2015-01-03 1 3
...........
2015-12-01 12 1
2015-12-31 12 31
I want to merge the values of the 1d, 2d and 3d columns of df1 into df2.
There are two conditions:
(1) rows are merged only when m and d are the same in both df1 and df2.
(2) if a row's index in df2 satisfies index % 30 == 0, don't merge; the 1d, 2d and 3d values for those rows can be NaN.
That is, I want the new df2 to look like this:
datetime m d 1d 2d 3d
2015-01-02 1 2 Nan Nan Nan
2015-01-03 1 3 1 2 3
...........
2015-12-01 12 1 2 2 3
2015-12-31 12 31 2 2 3
Thanks in advance!
I think you need to add NaNs with loc and then merge with a left join:
np.random.seed(10)
N = 365
rng = pd.date_range('2015-01-01', periods=N)
df_tr_2014 = pd.DataFrame(np.random.randint(10, size=(N, 3)), index=rng).reset_index()
df_tr_2014.columns = ['datetime','7d','15d','20d']
df_tr_2014.insert(1,'month', df_tr_2014['datetime'].dt.month)
df_tr_2014.insert(2,'day_m', df_tr_2014['datetime'].dt.day)
#print (df_tr_2014.head())
N = 366
rng = pd.date_range('2016-01-01', periods=N)
df_te = pd.DataFrame(index=rng)
df_te['month'] = df_te.index.month
df_te['day_m'] = df_te.index.day
df_te = df_te.reset_index()
#print (df_te.tail())
df2 = df_te.copy()
df1 = df_tr_2014.copy()
df1 = df1.set_index('datetime')
df1.index += pd.offsets.DateOffset(years=1)
#correct 29 February
y = df1.index[0].year
df1 = df1.reindex(pd.date_range(pd.datetime(y,1,1), pd.datetime(y,12,31)))
idx = df1.index[(df1.index.month == 2) & (df1.index.day == 29)]
df1.loc[idx, :] = df1.loc[idx - pd.Timedelta(1, unit='d'), :].values
df1.loc[idx, 'day_m'] = idx.day
df1[['month','day_m']] = df1[['month','day_m']].astype(int)
df1[['7d','15d', '20d']] = df1[['7d','15d', '20d']].astype(float)
df1.loc[np.arange(len(df1.index)) % 30 == 0, ['7d','15d','20d']] = 0
df1 = df1.reset_index()
print (df1.iloc[57:62])
index month day_m 7d 15d 20d
57 2016-02-27 2 27 2.0 0.0 1.0
58 2016-02-28 2 28 2.0 3.0 5.0
59 2016-02-29 2 29 2.0 3.0 5.0
60 2016-03-01 3 1 0.0 0.0 0.0
61 2016-03-02 3 2 7.0 6.0 9.0
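Stripped down to the two conditions in the question, the left-join idea might look like the sketch below. The frames are made-up stand-ins that use the question's column names (m, d, 1d, 2d, 3d), the values are arbitrary, and leap years are ignored for simplicity:

```python
import numpy as np
import pandas as pd

value_cols = ['1d', '2d', '3d']

# Stand-in for df1: one row per day of 2014 with arbitrary float values
rng1 = pd.date_range('2014-01-01', periods=365)
df1 = pd.DataFrame({'datetime': rng1, 'm': rng1.month, 'd': rng1.day})
for k, col in enumerate(value_cols):
    df1[col] = np.arange(365, dtype=float) + k

# Stand-in for df2: the same days one year later, without the value columns
rng2 = pd.date_range('2015-01-01', periods=365)
df2 = pd.DataFrame({'datetime': rng2, 'm': rng2.month, 'd': rng2.day})

# Condition (1): match rows on month and day only
merged = df2.merge(df1[['m', 'd'] + value_cols], on=['m', 'd'], how='left')

# Condition (2): blank out every row whose index is a multiple of 30
merged.loc[merged.index % 30 == 0, value_cols] = np.nan
print(merged.head())
```

A left join keeps every row of df2, so days without a match in df1 end up as NaN as well.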
Why don't you just remove the rows in df1 that don't match (m, d) pairs in df2?
df_new = df2.drop(df2[(not ((df2.m == df1.m) & (df2.n == df1.n)).any()) or (df2.index % 30 == 0)].index)
Or something along those lines.
Link to a related answer.
I'm not enormously familiar with Pandas and have not tested the above example.
Here df_te is df2 and df_tr_2014 is df1.
The columns 7d, 15d and 20d correspond to 1d, 2d and 3d in the question. size_df_te is the length of df_te, and month and day_m are m and d in df2.
df_te['7d'] = 0
df_te['15d'] = 0
df_te['20d'] = 0
mj = 0
dj = 0
for i in range(size_df_te):
    if i%30 != 0:
        m = df_te.loc[i,'month']
        d = df_te.loc[i,'day_m']
        if (m== 2) & (d == 29):
            m = 2
            d = 28
        dk_7 = df_tr_2014.loc[(df_tr_2014['month']==m) & (df_tr_2014['day_m']==d)]['7d']
        dk_15 = df_tr_2014.loc[(df_tr_2014['month']==m) & (df_tr_2014['day_m']==d)]['15d']
        dk_20 = df_tr_2014.loc[(df_tr_2014['month']==m) & (df_tr_2014['day_m']==d)]['20d']
        df_te.loc[i,'7d'] = float(dk_7)
        df_te.loc[i,'15d'] = float(dk_15)
        df_te.loc[i,'20d'] = float(dk_20)
EDIT:
Sample data:
np.random.seed(10)
N = 365
rng = pd.date_range('2014-01-01', periods=N)
df_tr_2014 = pd.DataFrame(np.random.randint(10, size=(N, 3)), index=rng).reset_index()
df_tr_2014.columns = ['datetime','7d','15d','20d']
df_tr_2014.insert(1,'month', df_tr_2014['datetime'].dt.month)
df_tr_2014.insert(2,'day_m', df_tr_2014['datetime'].dt.day)
#print (df_tr_2014.head())
N = 365
rng = pd.date_range('2015-01-01', periods=N)
df_te = pd.DataFrame(index=rng)
df_te['month'] = df_te.index.month
df_te['day_m'] = df_te.index.day
df_te = df_te.reset_index()
#print (df_te.head())

Transform a list to a list of average values (by step)

I have a two dimensional list of values:
[
[[12.2],[5325]],
[[13.4],[235326]],
[[15.9],[235326]],
[[17.7],[53521]],
[[21.3],[42342]],
[[22.6],[6546]],
[[25.9],[34634]],
[[27.2],[523523]],
[[33.4],[235325]],
[[36.2],[235352]]
]
I would like to get a list of averages defined by a given step, so that for step=10 it would look like this:
[
[[10],[average of all 10-19]],
[[20],[average of all 20-29]],
[[30],[average of all 30-39]]
]
How can I achieve that? Please note that the number of 10s, 20s, 30s and so on is not always the same.
import pandas as pd
df = pd.DataFrame((q[0][0], q[1][0]) for q in thelist)
df['group'] = (df[0] / 10).astype(int)
Now df is:
0 1 group
0 12.2 5325 1
1 13.4 235326 1
2 15.9 235326 1
3 17.7 53521 1
4 21.3 42342 2
5 22.6 6546 2
6 25.9 34634 2
7 27.2 523523 2
8 33.4 235325 3
9 36.2 235352 3
Then:
df.groupby('group').mean()
Gives you the answers you seek:
0 1
group
1 14.80 132374
2 24.25 151761
3 34.80 235338
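If you want the group labels to be 10, 20, 30 as in the question, and the result back as a nested list, a variation along these lines should work. The column names x, y and bin are made up, and step is the step size from the question:

```python
import pandas as pd

thelist = [
    [[12.2], [5325]], [[13.4], [235326]], [[15.9], [235326]], [[17.7], [53521]],
    [[21.3], [42342]], [[22.6], [6546]], [[25.9], [34634]], [[27.2], [523523]],
    [[33.4], [235325]], [[36.2], [235352]],
]
step = 10

df = pd.DataFrame([(q[0][0], q[1][0]) for q in thelist], columns=['x', 'y'])

# Floor each x to the start of its bin: 12.2 -> 10, 21.3 -> 20, ...
df['bin'] = (df['x'] // step * step).astype(int)

means = df.groupby('bin')['y'].mean()

# Back to the nested-list layout used in the question
result = [[[b], [m]] for b, m in means.items()]
print(result)
```

Bins that contain no values simply do not appear in the result, so it does not matter that the number of 10s, 20s and 30s differs.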

complex time difference operation using pandas

Below is a small sample of a very large dataframe:
import pandas as pd
In [32]: df3
Out[32]:
Location_ID Time
0 10000000568366 2012-05-31 14:08:00
1 10000000257225 2012-05-31 07:22:00
2 10000000730693 2012-05-31 02:19:00
3 10000000257225 2012-05-30 12:20:00
4 10000001072890 2012-05-30 11:19:00
5 10000000811587 2012-05-31 03:09:00
6 10000000094837 2012-06-02 08:39:00
7 10000000730693 2012-06-01 14:04:00
8 10000000955747 2012-05-31 07:24:00
9 10000000924241 2012-05-30 14:48:00
10 10000000893286 2012-05-18 13:12:00
11 10000000924241 2012-05-31 01:56:00
12 10000000211696 2012-05-30 02:09:00
13 10000000211696 2012-05-29 11:41:00
14 10000000084450 2012-05-31 18:34:00
15 10000000939505 2012-06-02 18:12:00
16 10000000893286 2012-05-31 22:54:00
17 10000000811598 2012-06-01 07:55:00
18 10000000683255 2012-05-29 03:44:00
I am trying to find the time difference in seconds between consecutive rows of "Time" for a particular Location_ID. I am using pandas.to_numeric, which converts the difference into nanoseconds, and then dividing by 1000000000 to get the result in seconds:
df4 = df3.assign(time_difference=df3['Time'].groupby('Location_ID').apply(lambda x : (pd.to_numeric(x.shift()-x).abs())/1000000000))
The error I get is:
KeyError: 'Location_ID'
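A likely cause is that df3['Time'] is a Series that no longer contains the Location_ID column, so groupby('Location_ID') has nothing with that name to group on. One way around it is to group the whole frame first and select 'Time' afterwards. The sketch below uses four rows from the sample and assumes 'Time' is already a datetime column:

```python
import pandas as pd

# A few rows from the sample above
df3 = pd.DataFrame({
    'Location_ID': [10000000257225, 10000000730693, 10000000257225, 10000000730693],
    'Time': pd.to_datetime(['2012-05-31 07:22:00', '2012-05-31 02:19:00',
                            '2012-05-30 12:20:00', '2012-06-01 14:04:00']),
})

# Previous 'Time' within the same Location_ID, in the current row order
previous = df3.groupby('Location_ID')['Time'].shift()

# Absolute gap to the previous row, in seconds
df4 = df3.assign(time_difference=(df3['Time'] - previous).abs().dt.total_seconds())
print(df4)
```

The first row of each Location_ID has no previous row, so its time_difference is NaN. If the differences should follow chronological order rather than row order, sort by 'Time' before grouping.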