scikit-learn provides various methods to remove descriptors; a basic method for this purpose is shown in the tutorial below:
http://scikit-learn.org/stable/modules/feature_selection.html
but the tutorial does not provide any way to keep a list of the features that were removed or kept.
The code below has been taken from the tutorial.
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
array([[0, 1],
[1, 0],
[0, 0],
[1, 1],
[1, 0],
[1, 1]])
The example code above involves only a few descriptors (the transformed output has shape (6, 2)), but in my case I have a huge data frame with a shape of (51 rows, 9000 columns). After finding a suitable model, I want to keep track of the useful and useless features, because I can save computational time when computing the features of the test data set by calculating only the useful ones.
For example, when you perform machine learning modeling with WEKA 6.0, it provides remarkable flexibility over feature selection, and after removing the useless features you get a list of the discarded features along with the useful ones.
Thanks.
What you can do, if I'm not wrong, is:
In the case of VarianceThreshold, you can call the method fit instead of fit_transform. This will fit the data, and the resulting variances will be stored in vt.variances_ (assuming vt is your object).
With a threshold, you can extract the features of the transformation as fit_transform would do:
X[:, vt.variances_ > threshold]
Or get the indexes as:
idx = np.where(vt.variances_ > threshold)[0]
Or as a mask:
mask = vt.variances_ > threshold
PS: default threshold is 0
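Putting those pieces together, a minimal sketch using the tutorial's toy data (vt, threshold, mask and idx are the names used above):
import numpy as np
from sklearn.feature_selection import VarianceThreshold
X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
threshold = .8 * (1 - .8)
vt = VarianceThreshold(threshold=threshold)
vt.fit(X)                           # fit only, no transform
mask = vt.variances_ > threshold    # boolean mask of the kept columns
idx = np.where(mask)[0]             # integer indices of the kept columns
X_reduced = X[:, mask]              # same result as fit_transform(X)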
EDIT:
A more straightforward way to do this is to use the get_support method of the VarianceThreshold class. From the documentation:
get_support([indices]) Get a mask, or integer index, of the features selected
You should call this method after fit or fit_transform.
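For example, a small sketch of how get_support recovers the kept and removed column names from a pandas dataframe (the column names f1, f2, f3 are made up; a (51, 9000) frame works the same way):
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
df = pd.DataFrame([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]],
                  columns=['f1', 'f2', 'f3'])
vt = VarianceThreshold(threshold=(.8 * (1 - .8)))
vt.fit(df)
mask = vt.get_support()              # boolean mask of the selected features
kept_columns = df.columns[mask]      # Index(['f2', 'f3'])
removed_columns = df.columns[~mask]  # Index(['f1'])
# On the test set, compute only the kept features:
# X_test_reduced = df_test[kept_columns]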
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
# Just make a convenience function; this one wraps the VarianceThreshold
# transformer but you can pass it a pandas dataframe and get one in return
def get_low_variance_columns(dframe=None, columns=None,
skip_columns=None, thresh=0.0,
autoremove=False):
"""
Wrapper for sklearn VarianceThreshold for use on pandas dataframes.
"""
print("Finding low-variance features.")
try:
# get list of all the original df columns
all_columns = dframe.columns
# remove `skip_columns`
remaining_columns = all_columns.drop(skip_columns)
# get length of new index
max_index = len(remaining_columns) - 1
# get indices for `skip_columns`
skipped_idx = [all_columns.get_loc(column)
for column
in skip_columns]
# adjust insert location by the number of columns removed
# (for non-zero insertion locations) to keep relative
# locations intact
for idx, item in enumerate(skipped_idx):
if item > max_index:
diff = item - max_index
skipped_idx[idx] -= diff
if item == max_index:
diff = item - len(skip_columns)
skipped_idx[idx] -= diff
if idx == 0:
skipped_idx[idx] = item
# get values of `skip_columns`
skipped_values = dframe.iloc[:, skipped_idx].values
# get dataframe values
X = dframe.loc[:, remaining_columns].values
# instantiate VarianceThreshold object
vt = VarianceThreshold(threshold=thresh)
# fit vt to data
vt.fit(X)
# get the indices of the features that are being kept
feature_indices = vt.get_support(indices=True)
# remove low-variance columns from index
feature_names = [remaining_columns[idx]
for idx, _
in enumerate(remaining_columns)
if idx
in feature_indices]
# get the columns to be removed
removed_features = list(np.setdiff1d(remaining_columns,
feature_names))
print("Found {0} low-variance columns."
.format(len(removed_features)))
# remove the columns
if autoremove:
print("Removing low-variance features.")
# remove the low-variance columns
X_removed = vt.transform(X)
print("Reassembling the dataframe (with low-variance "
"features removed).")
# re-assemble the dataframe
dframe = pd.DataFrame(data=X_removed,
columns=feature_names)
# add back the `skip_columns`
for idx, index in enumerate(skipped_idx):
dframe.insert(loc=index,
column=skip_columns[idx],
value=skipped_values[:, idx])
print("Succesfully removed low-variance columns.")
# do not remove columns
else:
print("No changes have been made to the dataframe.")
except Exception as e:
print(e)
print("Could not remove low-variance features. Something "
"went wrong.")
pass
return dframe, removed_features
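A quick usage sketch of this wrapper (the dataframe contents and the 'id' skip column are made up; the imports above are assumed):
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'constant': [7, 7, 7, 7],
                   'varies': [0.1, 0.9, 0.4, 0.7]})
df_reduced, removed = get_low_variance_columns(dframe=df,
                                               skip_columns=['id'],
                                               thresh=0.0,
                                               autoremove=True)
print(removed)  # ['constant']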
This worked for me. If you want to see exactly which columns remain after thresholding, you can use this method:
from sklearn.feature_selection import VarianceThreshold
threshold_n=0.95
sel = VarianceThreshold(threshold=(threshold_n* (1 - threshold_n) ))
sel_var=sel.fit_transform(data)
data[data.columns[sel.get_support(indices=True)]]
When testing features, I wrote this simple function that tells me which variables remain in the data frame after the VarianceThreshold is applied.
from sklearn.feature_selection import VarianceThreshold
from itertools import compress
def fs_variance(df, threshold:float=0.1):
"""
Return a list of selected variables based on the threshold.
"""
# The list of columns in the data frame
features = list(df.columns)
# Initialize and fit the method
vt = VarianceThreshold(threshold = threshold)
_ = vt.fit(df)
    # Get the column names that pass the threshold
feat_select = list(compress(features, vt.get_support()))
return feat_select
which returns a list of column names which are selected. For example: ['col_2','col_14', 'col_17'].
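For example, assuming df is your dataframe, the kept and dropped columns would then be:
selected = fs_variance(df, threshold=0.1)
dropped = [c for c in df.columns if c not in selected]
df_reduced = df[selected]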
I've been trying to get the nlargest rows per group by following the method from this question. The solution to that question is correct up to a point.
In this example, I groupby column A and want to return the rows of C and D based on the top two values in B.
For some reason the index of grp_df is multilevel and includes both A and the original index of ddf.
I was hoping to simply reset_index() and drop the unwanted index and just keep A, but I get the following error:
ValueError: The columns in the computed data do not match the columns in the provided metadata
Here is a simple example reproducing the error:
import numpy as np
import dask.dataframe as dd
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(df, npartitions=3)
grp_df = ddf.groupby('A')[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']), meta={
"B": 'f8', "C": 'f8'})
# Print is successful and results are correct
print(grp_df.head())
grp_df = grp_df.reset_index()
# Print is unsuccessful and shows error below
print(grp_df.head())
I found an approach for the solution here.
The following code now allows reset_index() to work and gets rid of the original ddf index. I'm still not sure why the original ddf index came through the groupby in the first place, though.
meta = pd.DataFrame(columns=['B', 'C'], dtype=int, index=pd.MultiIndex([[], []], [[], []], names=['A', None]))
grp_df = ddf.groupby('A')[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']), meta=meta)
grp_df = grp_df.reset_index().drop('level_1', axis=1)
I am attempting to use collect_list to collect arrays (and maintain order) from two different data frames.
Test_Data and Train_Data have the same format.
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy('Group').orderBy('date')
# Train_Data has 4 data points
# Test_Data has 7 data points
# desired target array: [1, 1, 2, 3]
# desired MarchMadInd array: [0, 0, 0, 1, 0, 0, 1]
sorted_list_diff_array_lens = train_data.withColumn('target',
F.collect_list('target').over(w)
)\
test_data.withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\
.groupBy('Group')\
.agg(F.max('target').alias('target'),
F.max('MarchMadInd').alias('MarchMadInd')
)
I realize the syntax is incorrect with "test_data.withColumn", but I want to select the array for MarchMadInd from test_data and the array for target from train_data. The desired output would look like the following:
{"target":[1, 1, 2, 3], "MarchMadInd":[0, 0, 0, 1, 0, 0, 1]}
Context: this is for a DeepAR time series model (using AWS) that requires dynamic features to include the prediction period, but the target should be historical data.
The solution involves using a join as recommended by pault.
Create a dataframe with dynamic features of length equal to Training + Prediction period
Create a dataframe with target values of length equal to just the Training period.
Use a LEFT JOIN (with the dynamic feature data on LEFT) to bring these dataframes together
Now, using collect_list will create the desired result.
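A rough sketch of that join-based approach (the column names Group, date, target and MarchMadInd come from the question; the frames themselves are assumed):
from pyspark.sql import functions as F
from pyspark.sql import Window
# LEFT JOIN keeps every row of the longer test_data frame; target is null
# for the prediction period, and collect_list skips nulls.
joined = test_data.select('Group', 'date', 'MarchMadInd') \
    .join(train_data.select('Group', 'date', 'target'),
          on=['Group', 'date'], how='left')
# Order by date and use an unbounded frame so every row sees the full list.
w = Window.partitionBy('Group').orderBy('date') \
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
result = joined \
    .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w)) \
    .withColumn('target', F.collect_list('target').over(w)) \
    .groupBy('Group') \
    .agg(F.first('MarchMadInd').alias('MarchMadInd'),
         F.first('target').alias('target'))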
I am having some problems getting xlsxwriter to create box borders around a number of cells when creating an Excel sheet. After some searching I found a thread here with an example of how to do this in Python 2.
The link to the thread is:
python XlsxWriter set border around multiple cells
The answer I am trying to use is the one given by aubaub.
I am using Python 3 and am trying to get this to work, but I am having some problems with it.
The first thing I did was change xrange to range in
def box(workbook, sheet_name, row_start, col_start, row_stop, col_stop),
and then I changed dict.iteritems() to dict.items() in
def add_to_format(existing_format, dict_of_properties, workbook):
since these have changed between Python 2 and 3.
But the next part I am struggling with, and I have little idea what to do; it is the
return(workbook.add_format(dict(new_dict.items() + dict_of_properties.items())))
part. I tried to change this by combining the two dictionaries in another way, adding this before the return:
dest = dict(list(new_dict.items()) + list(dict_of_properties.items()))
return(workbook.add_format(dest))
But this is not working. I have not used dictionaries much before, so I am rather blank on how to get this working and on whether there have been other changes to xlsxwriter or other factors that prevent this from working. Does anyone have some good ideas for how to solve this?
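For reference, the Python 2 expression dict(a.items() + b.items()) can be written in Python 3 either with the list() workaround above or with dict unpacking; a tiny standalone sketch (not tied to xlsxwriter):
new_dict = {'bold': True}
dict_of_properties = {'top': 1}
# Both produce {'bold': True, 'top': 1}; later keys win on conflicts.
merged_a = dict(list(new_dict.items()) + list(dict_of_properties.items()))
merged_b = {**new_dict, **dict_of_properties}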
Here I have added a working example of the code and problem.
import pandas as pd
import xlsxwriter
import numpy as np
from xlsxwriter.utility import xl_range
#Adding the functions from aubaub copied from question on Stackoverflow
# https://stackoverflow.com/questions/21599809/python-xlsxwriter-set-border-around-multiple-cells/37907013#37907013
#And added the changes I thought would make it work.
def add_to_format(existing_format, dict_of_properties, workbook):
"""Give a format you want to extend and a dict of the properties you want to
extend it with, and you get them returned in a single format"""
new_dict={}
for key, value in existing_format.__dict__.items():
if (value != 0) and (value != {}) and (value != None):
new_dict[key]=value
del new_dict['escapes']
dest = dict(list(new_dict.items()) + list(dict_of_properties.items()))
return(workbook.add_format(dest))
def box(workbook, sheet_name, row_start, col_start, row_stop, col_stop):
"""Makes an RxC box. Use integers, not the 'A1' format"""
rows = row_stop - row_start + 1
cols = col_stop - col_start + 1
for x in range((rows) * (cols)): # Total number of cells in the rectangle
box_form = workbook.add_format() # The format resets each loop
row = row_start + (x // cols)
column = col_start + (x % cols)
if x < (cols): # If it's on the top row
box_form = add_to_format(box_form, {'top':1}, workbook)
if x >= ((rows * cols) - cols): # If it's on the bottom row
box_form = add_to_format(box_form, {'bottom':1}, workbook)
if x % cols == 0: # If it's on the left column
box_form = add_to_format(box_form, {'left':1}, workbook)
if x % cols == (cols - 1): # If it's on the right column
box_form = add_to_format(box_form, {'right':1}, workbook)
sheet_name.write(row, column, "", box_form)
#Adds dataframe with some data
frame1 = pd.DataFrame(np.random.randint(0,100,size=(10, 4)), columns=list('ABCD'))
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
#Add frame to Excel sheet
frame1.to_excel(writer, sheet_name='Sheet1', startcol= 1, startrow= 2)
# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']
#Add some formatting to the table
format00 = workbook.add_format()
format00.set_bold()
format00.set_font_size(14)
format00.set_bg_color('#F2F2F2')
format00.set_align('center')
worksheet.conditional_format(xl_range(2, 1, 2, 5),
{'type': 'no_blanks',
'format': format00})
box(workbook, 'Sheet1', 3, 1, 12, 5)
writer.save()
I stumbled on this when trying to see if anyone else had posted a better way to deal with formats. Don't use my old way; whether you could make it work with Python 3 or not, it's pretty crappy. Instead, grab what I just put here: https://github.com/Yoyoyoyoyoyoyo/XlsxFormatter.
If you use sheet.cell_writer() instead of sheet.write(), then it will keep a memory of the formats you ask for on a cell-by-cell basis, so writing something new in a cell (or adding a border around it) won't delete the cell's old format, but adds to it instead.
A simple example of your code:
from format_classes import Book
book = Book(where_to_save)
sheet = book.add_book_sheet('Sheet1')
sheet.box(3, 1, 12, 5)
# add data to the box with sheet.cell_writer(...)
book.close()
Look at the code & the README to see how to do other things, like format the box's borders or backgrounds, write data, apply a format to an entire worksheet, etc.
What am I doing wrong when using scikit-learn with nltk to check the accuracy of the naive Bayes classifier?
...readFile definition not needed
#divide the data into training and testing sets
data = readFile('Data_test/')
training_set = list_nltk[:2000000]
testing_set = list_nltk[2000000:]
#applied Bag of words as a way to select and extract feature
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_set.split('\n'))
#apply tfd
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
#Train the data
clf = MultinomialNB().fit(X_train_tf, training_set.split('\n'))
#now test the accuracy of the naive bayes classifier
test_data_features = count_vect.transform(testing_set)
X_new_tfidf = tf_transformer.transform(test_data_features)
predicted = clf.predict(X_new_tfidf)
print "%.3f" % nltk.classify.accuracy(clf, predicted)
The problem is that when I print the nltk.classify.accuracy, it takes forever. I suspect this is because I have done something wrong, but since I get no error, I can't figure out what it is.
Use the accuracy_score from sklearn.metrics instead.
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
I think you're mixing up some things about supervised learning. See this answer and try to understand the top of this page.
Your data should be in this form (before doing the vectorization):
X = [["The cat is sleeping"], ..., ["The man is dead"]]
Y = [1, ..., 0]
You have an error at least in this line:
clf = MultinomialNB().fit(X_train_tf, training_set.split('\n'))
You need to pass the vectorized data and your training labels there, but you are passing the vectorized data and the original text.
It should look like this:
clf = MultinomialNB().fit(X_train_tf, y_train)
But you don't even have the y_train label data anywhere in your code, as far as I can tell.
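A minimal, self-contained sketch of the shape the training and evaluation calls should have (the texts and labels here are made up, not your data):
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
train_texts = ["the cat is sleeping", "the man is dead"]   # one document per entry
y_train = [1, 0]                                           # one label per document
test_texts = ["the dog is sleeping"]
y_test = [1]
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_texts)
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
# Fit on the vectorized texts and their labels, not on the raw text
clf = MultinomialNB().fit(X_train_tf, y_train)
X_test_tf = tf_transformer.transform(count_vect.transform(test_texts))
predicted = clf.predict(X_test_tf)
print(accuracy_score(y_test, predicted))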
I have been trying to solve this for days, and although I have found a similar problem here, How can i vectorize list using sklearn DictVectorizer, the solution is overly simplified.
I would like to fit some features into a logistic regression model to predict 'chinese' or 'non-chinese'. I have a raw_name from which I extract two features: 1) just the last name, and 2) a list of substrings of the last name; for example, 'Chan' gives ['ch', 'ha', 'an']. But it seems DictVectorizer doesn't take a list type as part of the dictionary. From the link above, I tried to create a function list_to_dict, which successfully returns some dict elements:
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
but I have no idea how to incorporate that into the my_dict = ... before applying the DictVectorizer.
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
lr = LogisticRegression()
dv = DictVectorizer()
# Get csv file into data frame
data = pd.read_csv("V2-1_2000Records_Processed_SEP2015.csv", header=0, encoding="utf-8")
df = DataFrame(data)
# Pandas data frame shuffling
df_shuffled = df.iloc[np.random.permutation(len(df))]
df_shuffled.reset_index(drop=True)
# Assign X and y variables
X = df.raw_name.values
y = df.chineseScan.values
# Feature extraction functions
def feature_full_last_name(nameString):
try:
last_name = nameString.rsplit(None, 1)[-1]
if len(last_name) > 1: # not accept name with only 1 character
return last_name
else: return None
except: return None
def feature_twoLetters(nameString):
placeHolder = []
try:
for i in range(0, len(nameString)):
x = nameString[i:i+2]
if len(x) == 2:
placeHolder.append(x)
return placeHolder
except: return []
def list_to_dict(substring_list):
try:
substring_dict = {}
for i in substring_list:
substring_dict['substring='+str(i)] = True
return substring_dict
except: return None
list_example = ['co', 'or', 'rn', 'ns']
print list_to_dict(list_example)
# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'two-letter-substrings': feature_twoLetters(feature_full_last_name(i)),
'last-name': feature_full_last_name(i), 'dummy': 1} for i in X]
print my_dict[3]
Output:
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
{'dummy': 1, 'two-letter-substrings': [u'co', u'or', u'rn', u'ns'], 'last-name': u'corns'}
Sample data:
Raw_name chineseScan
Jack Anderson non-chinese
Po Lee chinese
If I have understood correctly, you want a way to encode list values in order to have a feature dictionary that DictVectorizer can use. (One year too late, but) something like this can be used, depending on the case:
my_dict_list = []
for i in X:
# create a new feature dictionary
feat_dict = {}
# add the features that are straight forward
feat_dict['last-name'] = feature_full_last_name(i)
feat_dict['dummy'] = 1
# for the features that have a list of values iterate over the values and
# create a custom feature for each value
for two_letters in feature_twoLetters(feature_full_last_name(i)):
# make sure the naming is unique enough so that no other feature
# unrelated to this will have the same name/ key
feat_dict['two-letter-substrings-' + two_letters] = True
# save it to the feature dictionary list that will be used in Dict vectorizer
my_dict_list.append(feat_dict)
print my_dict_list
from sklearn.feature_extraction import DictVectorizer
dict_vect = DictVectorizer(sparse=False)
transformed_x = dict_vect.fit_transform(my_dict_list)
print transformed_x
Output:
[{'dummy': 1, u'two-letter-substrings-er': True, 'last-name': u'Anderson', u'two-letter-substrings-on': True, u'two-letter-substrings-de': True, u'two-letter-substrings-An': True, u'two-letter-substrings-rs': True, u'two-letter-substrings-nd': True, u'two-letter-substrings-so': True}, {'dummy': 1, u'two-letter-substrings-ee': True, u'two-letter-substrings-Le': True, 'last-name': u'Lee'}]
[[ 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1.]
[ 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.]]
Another thing you could do (but I don't recommend it), if you don't want to create as many features as there are values in your lists, is something like this:
# sorting the values would be a good idea
feat_dict[frozenset(feature_twoLetters(feature_full_last_name(i)))] = True
# or
feat_dict[" ".join(feature_twoLetters(feature_full_last_name(i)))] = True
but the first one means that you can't have any duplicate values, and probably neither makes a good feature, especially if you need fine-tuned and detailed ones. Also, they reduce the chance of two rows having the same combination of two-letter substrings, so the classification probably won't do well.
Output:
[{'dummy': 1, 'last-name': u'Anderson', frozenset([u'on', u'rs', u'de', u'nd', u'An', u'so', u'er']): True}, {'dummy': 1, 'last-name': u'Lee', frozenset([u'ee', u'Le']): True}]
[{'dummy': 1, 'last-name': u'Anderson', u'An nd de er rs so on': True}, {'dummy': 1, u'Le ee': True, 'last-name': u'Lee'}]
[[ 1. 0. 1. 1. 0.]
[ 0. 1. 1. 0. 1.]]