I'm trying to run some feature selection algorithms on the UCI Adult data set and I'm running into a problem with univariate feature selection. I'm one-hot encoding all the categorical data to convert it to numerical values, but that gives me a lot of F scores (one per dummy column rather than one per original feature).
How can I avoid this? What should I do to make this code better?
# Encode binary categorical columns as 0/1 integers
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

adult['Gender'] = adult['sex'].map({'Female': 0, 'Male': 1}).astype(int)
adult = adult.drop(['sex'], axis=1)
adult['Earnings'] = adult['income'].map({'<=50K': 0, '>50K': 1}).astype(int)
adult = adult.drop(['income'], axis=1)

# One-hot encode the race column
adult = pd.get_dummies(adult, columns=["race"])

target = adult["Earnings"]
data = adult.drop(["Earnings"], axis=1)

# Score each column against the target
selector = SelectKBest(f_classif, k=5)
selector.fit_transform(data, target)
for n, s in zip(data.columns, selector.scores_):
    print("F Score {} for feature {}".format(s, n))
EDIT:
Partial results of current code:
F Score 26.1375747945 for feature race_Amer-Indian-Eskimo
F Score 3.91592196913 for feature race_Asian-Pac-Islander
F Score 237.173133254 for feature race_Black
F Score 31.117798305 for feature race_Other
F Score 218.117092671 for feature race_White
Expected Results:
F Score "f_score" for feature "race"
One-hot encoding splits the feature above into many sub-features; I would like to generalize it back to just race (see Expected Results) and get a single score for it, if that is possible.
One way you can reduce the number of features, while still encoding your categories in a non-ordinal manner, is by using binary encoding. One-hot encoding grows linearly in n, where n is the number of categories in a categorical feature, whereas binary encoding grows as log_2(n). In other words, doubling the number of categories adds a single column under binary encoding, whereas it doubles the number of columns under one-hot encoding.
Binary encoding can be easily implemented in Python using the category_encoders package. The package is pip-installable and works seamlessly with sklearn and pandas. Here is an example:
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P'], 'cat2': ['C', 'S', 'T', 'B']})

enc_bin = ce.BinaryEncoder(cols=['cat1'])  # with cols=None, all string columns are encoded
df_trans = enc_bin.fit_transform(df)
print(df_trans)
Out[1]:
   cat1_0  cat1_1 cat2
0       1       1    C
1       0       1    S
2       1       0    T
3       0       0    B
Here is the code from a previous answer of mine using the same variables as above, but with one-hot encoding. Let's compare how the two outputs look.
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P'], 'cat2': ['C', 'S', 'T', 'B']})

enc_ohe = ce.OneHotEncoder(cols=['cat1'])  # with cols=None, all string columns are encoded
df_trans = enc_ohe.fit_transform(df)
print(df_trans)
Out[2]:
   cat1_0  cat1_1  cat1_2  cat1_3 cat2
0       0       0       1       0    C
1       0       0       0       1    S
2       1       0       0       0    T
3       0       1       0       0    B
Notice how binary encoding uses half as many columns to uniquely describe each category within the feature cat1.
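Applied to the adult data from the question, a minimal sketch might look like the following. This assumes the adult DataFrame has already been prepared as in the question's code (with the Earnings column created and all remaining columns numeric), and simply swaps BinaryEncoder in for pd.get_dummies on the race column:

import category_encoders as ce
from sklearn.feature_selection import SelectKBest, f_classif

# Binary-encode race instead of one-hot encoding it, so fewer derived
# columns (and therefore fewer F scores) are produced.
enc_bin = ce.BinaryEncoder(cols=['race'])
adult_enc = enc_bin.fit_transform(adult)

target = adult_enc['Earnings']
data = adult_enc.drop(['Earnings'], axis=1)   # assumes remaining columns are numeric

selector = SelectKBest(f_classif, k=5)
selector.fit_transform(data, target)
for name, score in zip(data.columns, selector.scores_):
    print("F Score {} for feature {}".format(score, name))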
I'm parsing logs generated from multiple sources and joined together to form a huge log file in the following format:
My_testNumber: 14, JobType = testx.
ABC 2234
**SR 111**
1483529571 1 1 Wed Jan 4 11:32:51 2017 0 4
datatype someRandomValue
SourceCode.Cpp 588
DBConnection failed
TB 132
**SR 284**
1483529572 0 1 Wed Jan 4 11:32:52 2017 5010400 4
datatype someRandomXX
SourceCode2.cpp 455
DBConnection Success
TB 102
**SR 299**
1483529572 0 1 **Wed Jan 4 11:32:54 2017** 5010400 4
datatype someRandomXX
SourceCode3.cpp 455
ConnectionManager Success
....
(there are dozens of SR Numbers here)
Now I'm looking for a smart way to parse the logs so that it calculates the time differences in seconds for each testNumber and SR number,
like this:
For My_testNumber: 14, it subtracts the SR 111 and SR 284 times (the difference would be 1 second here); for SR 284 and SR 299 it is 2 seconds, and so on.
You can parse your posted log file and save the corresponding data accordingly. Then, you can work with the data to get the time differences. The following should be a decent start:
from itertools import combinations
from itertools import permutations  # if order matters
from collections import OrderedDict
from datetime import datetime
import re

sr_numbers = []
dates = []

# Loop through the file and collect the SR numbers and times
# in two parallel lists.
pattern = re.compile(r"(.*)\*{2}(.*)\*{2}(.*)")
for line in open('/Path/to/log/file'):
    if '**' in line:
        # Get the data between the asterisks
        if 'SR' in line:
            sr_numbers.append(re.sub(pattern, r"\2", line.strip()))
        else:
            dates.append(datetime.strptime(re.sub(pattern, r"\2", line.strip()), '%a %b %d %H:%M:%S %Y'))
    else:
        continue

# Use a hashmap container (ordered dictionary) to make it easy to get the time differences.
# OrderedDict maintains the order in which the SR numbers appear in the file.
log_dict = OrderedDict((k, v) for k, v in zip(sr_numbers, dates))

# Use combinations to get the possible pairs (or permutations if order matters) of time differences
time_differences = {"{} - {}".format(*x): (log_dict[x[1]] - log_dict[x[0]]).seconds
                    for x in combinations(log_dict, 2)}

print(time_differences)
# {'SR 284 - SR 299': 2, 'SR 111 - SR 284': 1, 'SR 111 - SR 299': 3}
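If the direction of subtraction matters, a similar comprehension over permutations (imported above) gives both orderings. A minimal sketch reusing the same log_dict, with total_seconds() so that negative differences are not wrapped around by .seconds:

# Signed differences for every ordered pair of SR numbers
signed_differences = {
    "{} - {}".format(*x): (log_dict[x[1]] - log_dict[x[0]]).total_seconds()
    for x in permutations(log_dict, 2)
}
print(signed_differences)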
Edit:
Parsing the file without relying on the asterisks around the dates:
from itertools import combinations
from itertools import permutations  # if order matters
from collections import OrderedDict
from datetime import datetime
import re

sr_numbers = []
dates = []

# Loop through the file and collect the SR numbers and times
# in two parallel lists.
pattern = re.compile(r"(.*)\*{2}(.*)\*{2}(.*)")
for line in open('/Path/to/log/file'):
    if 'SR' in line:
        current_sr_number = re.sub(pattern, r"\2", line.strip())
        sr_numbers.append(current_sr_number)
    elif line.strip().count(":") > 1:
        try:
            dates.append(datetime.strptime(re.split(r"\s{3,}", line)[2].strip("*"), '%a %b %d %H:%M:%S %Y'))
        except IndexError:
            # print(re.split(r"\s{3,}", line))
            dates.append(datetime.strptime(re.split(r"\t+", line)[2].strip("*"), '%a %b %d %H:%M:%S %Y'))
    else:
        continue

# Use a hashmap container (ordered dictionary) to make it easy to get the time differences.
# OrderedDict maintains the order in which the SR numbers appear in the file.
log_dict = OrderedDict((k, v) for k, v in zip(sr_numbers, dates))

# Use combinations to get the possible pairs (or permutations if order matters) of time differences
time_differences = {"{} - {}".format(*x): (log_dict[x[1]] - log_dict[x[0]]).seconds
                    for x in combinations(log_dict, 2)}

print(time_differences)
# {'SR 284 - SR 299': 2, 'SR 111 - SR 284': 1, 'SR 111 - SR 299': 3}
I hope this proves useful.
My original data looks like this:
SUBBASIN HRU HRU_SLP OV_N
1 1 0.016155144 0.15
1 2 0.015563287 0.14
2 1 0.010589782 0.15
2 2 0.011574839 0.14
3 1 0.013865396 0.15
3 2 0.01744597 0.15
3 3 0.018983217 0.14
3 4 0.013890315 0.05
3 5 0.011792533 0.05
I need to modify the value of OV_N for each SUBBASIN number:
hru = pd.read_csv('hru.csv')
for i in hru.OV_N:
    hru.ix[hru.SUBBASIN.isin([76,65,64,72,81,84,60,46,37,1,2]), 'OV_N'] = i*(1+df21.value[12])
    hru.ix[hru.SUBBASIN.isin([80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]), 'OV_N'] = i*(1+df23.value[12])
    hru.ix[hru.SUBBASIN.isin([85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,49,29,22,24,25,9,10]), 'OV_N'] = i*(1+df56.value[12])
    hru.ix[hru.SUBBASIN.isin([92,88,95,94,93]), 'OV_N'] = i*(1+df58.value[12])
where df21.value[12] is a value from a txt file
The code results in an infinite value of OV_N for all subbasins, so I assume the loop runs over the file multiple times, but I can't find the mistake, and this code was working before with different numbers of subbasins.
It is generally better not to loop over and index into the rows of a pandas DataFrame. Transforming the DataFrame with column operations is the more idiomatic pandas approach. A pandas DataFrame can be thought of as a zipped combination of pandas Series: each column is its own pandas Series, all sharing the same index. Operations can be applied to one or more pandas Series to create a new Series that shares the same index, and operations can also combine a Series with a one-dimensional numpy array to create a new Series. It helps to understand pandas indexing, but this answer will just use sequential integer indexing.
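As a quick illustration of that idea (a toy example, not the hru data), a single column operation builds a new Series aligned on the same index with no explicit loop:

import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [0.10, 0.20, 0.30]})
# Element-wise operation on whole columns; the result shares toy's index.
toy['c'] = toy['a'] * (1 + toy['b'])
print(toy)
#    a    b    c
# 0  1  0.1  1.1
# 1  2  0.2  2.4
# 2  3  0.3  3.9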
To modify the value of OV_N for each SUBBASIN number:
Initialize the hru DataFrame by reading it in from the hru.csv as in the original question. Here we initialize it with the data given in the question.
import numpy as np
import pandas as pd
hru = pd.DataFrame({
'SUBBASIN':[1,1,2,2,3,3,3,3,3],
'HRU':[1,2,1,2,1,2,3,4,5],
'HRU_SLP':[0.016155144,0.015563287,0.010589782,0.011574839,0.013865396,0.01744597,0.018983217,0.013890315,0.011792533],
'OV_N':[0.15,0.14,0.15,0.14,0.15,0.15,0.14,0.05,0.05]})
Create one separate pandas Series that gathers all the values from the various DataFrames, i.e. df21, df23, df56, and df58, into one place. This will be used to look up values by index. Let's call it subbasin_multiplier_ds. Here we assume values of 21, 23, 56, and 58 were read from the txt file; replace these with the real values read in from the txt file.
subbasin_multiplier_ds = pd.Series([21]*96)
subbasin_multiplier_ds[[80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,
    34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]] = 23
subbasin_multiplier_ds[[85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,
    86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,
    49,29,22,24,25,9,10]] = 56
subbasin_multiplier_ds[[92,88,95,94,93]] = 58
Replace OV_N in hru DataFrame based on columns in the DataFrame and a lookup in subbasin_multiplier_ds by index.
hru['OV_N'] = hru['OV_N'] * (1 + subbasin_multiplier_ds[hru['SUBBASIN']].values)
Calling .values above converts the lookup result to a numpy array, so it is combined with the OV_N column position by position rather than by aligning index labels, and the expected results are achieved. If you want to see why that matters, try removing .values and compare the output.
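To see the difference, here is a small hypothetical illustration of what happens without .values: the lookup result carries the SUBBASIN labels as its index, so pandas aligns on index labels (producing NaNs and duplicated rows) instead of multiplying row by row.

looked_up = subbasin_multiplier_ds[hru['SUBBASIN']]
print(looked_up.index.tolist())
# [1, 1, 2, 2, 3, 3, 3, 3, 3]  <- labels are subbasin numbers, not 0..8

# Aligned by index label: mismatched labels give NaN, duplicated labels
# give extra rows -- not what we want here.
print(hru['OV_N'] * (1 + looked_up))

# Converted to a plain numpy array: combined positionally, as intended.
print(hru['OV_N'] * (1 + looked_up.values))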
I'm trying to do basic interpolation of position data at 60 Hz (~16 ms) intervals. When I try to use pandas 0.14 interpolation over the DataFrame, it tells me I only have NaNs in my data set (not true). When I try to run it over individual Series pulled from the DataFrame, it returns the same Series without the NaNs filled in. I've tried setting the indices to integers, using different methods, and fiddling with the axis and limit parameters of the interpolation function, with no luck. What am I doing wrong?
df.head(5):
         x        y        ms
0  20.5815  14.1821  333.3333
1      NaN      NaN       350
2  20.6112  14.2013  366.6667
3      NaN      NaN  383.3333
4  20.5349  14.2232       400
df = df.set_index(df.ms) # set indices to milliseconds
When I try running
df.interpolate(method='values')
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-462-cb0f1f01eb84> in <module>()
12
13
---> 14 df.interpolate(method='values')
15
16
/Users/jsb/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in interpolate(self, method, axis, limit, inplace, downcast, **kwargs)
2511
2512 if self._data.get_dtype_counts().get('object') == len(self.T):
-> 2513 raise TypeError("Cannot interpolate with all NaNs.")
2514
2515 # create/use the index
TypeError: Cannot interpolate with all NaNs.
I've also tried running over individual series, which only return what I put in:
temp = df.x
temp.interpolate(method='values')
333.333333    20.5815
350.000000        NaN
366.666667    20.6112
383.333333        NaN
400.000000    20.5349
Name: x, dtype: object
EDIT :
Props to Jeff for inspiring the solution.
Adding:
df[['x','y','ms']] = df[['x','y','ms']].astype(float)
before
df.interpolate(method='values')
interpolation did the trick.
Based on your edit (with props to Jeff for inspiring the solution):
Adding:
df = df.astype(float)
before
df.interpolate(method='values')
the interpolation did the trick for me as well. Unless you're sub-selecting a set of columns, you don't need to specify them.
I'm not able to reproduce the error (see below for a copy/paste-able example). Can you make sure the data you show is actually representative of your data?
In [137]: from StringIO import StringIO
In [138]: df = pd.read_csv(StringIO(""" x y ms
...: 0 20.5815 14.1821 333.3333
...: 1 NaN NaN 350
...: 2 20.6112 14.2013 366.6667
...: 3 NaN NaN 383.3333
...: 4 20.5349 14.2232 400"""), delim_whitespace=True)
In [140]: df = df.set_index(df.ms)
In [142]: df.interpolate(method='values')
Out[142]:
x y ms
ms
333.3333 20.58150 14.18210 333.3333
350.0000 20.59635 14.19170 350.0000
366.6667 20.61120 14.20130 366.6667
383.3333 20.57305 14.21225 383.3333
400.0000 20.53490 14.22320 400.0000