Decision tree classifier, multi-label output - python-2.7

Decision Tree supports multi-label classification, right? My y labels are of the form [['brufen','amoxil'], ['brufen'], ['xanex']]. The sklearn documentation says y can be a list of lists of labels, so why does it give me an "unknown label type" error?
The error goes away if every inner list has the same length, but how else should I handle this problem apart from one-hot encoding it?

You need to convert the labels to label-indicator format first. Then you can use them with decision trees.
For converting, you can use MultiLabelBinarizer.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_converted = mlb.fit_transform([['brufen','amoxil'], ['brufen'], ['xanex']])
# Output: array([[1, 1, 0],
#                [0, 1, 0],
#                [0, 0, 1]])
mlb.classes_
# Output: array(['amoxil', 'brufen', 'xanex'], dtype=object)
Now use this y_converted instead of original y in decision tree.
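For example, a minimal sketch (X here is a placeholder feature matrix; replace it with your own features):
from sklearn.tree import DecisionTreeClassifier

X = [[0, 1], [1, 0], [1, 1]]  # placeholder features, one row per sample
clf = DecisionTreeClassifier()
clf.fit(X, y_converted)       # y_converted is the binarized label matrix

# predictions come back in indicator format; invert them to get label lists
pred = clf.predict(X)
mlb.inverse_transform(pred)
# e.g. [('amoxil', 'brufen'), ('brufen',), ('xanex',)]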

Based on the information here: https://scikit-learn.org/stable/modules/multiclass.html#multioutputclassifier
You can use sklearn.multioutput.MultiOutputClassifier with a decision tree to get multi-label behavior. If I understand correctly, it works by internally creating a separate tree for each label.
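A minimal sketch of that approach (assuming X and y_converted are the feature matrix and binarized labels from the answer above):
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

clf = MultiOutputClassifier(DecisionTreeClassifier())
clf.fit(X, y_converted)  # internally fits one tree per label column
clf.predict(X)           # returns an indicator matrix, one column per label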

Related

PySpark Using collect_list to collect Arrays of Varying Length

I am attempting to use collect_list to collect arrays (and maintain order) from two different data frames.
Test_Data and Train_Data have the same format.
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy('Group').orderBy('date')
# Train_Data has 4 data points
# Test_Data has 7 data points
# desired target array: [1, 1, 2, 3]
# desired MarchMadInd array: [0, 0, 0, 1, 0, 0, 1]
sorted_list_diff_array_lens = train_data.withColumn('target',
F.collect_list('target').over(w)
)\
test_data.withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\
.groupBy('Group')\
.agg(F.max('target').alias('target'),
F.max('MarchMadInd').alias('MarchMadInd')
)
I realize the syntax with "test_data.withColumn" is incorrect, but I want to select the MarchMadInd array from test_data and the target array from train_data. The desired output would look like the following:
{"target":[1, 1, 2, 3], "MarchMadInd":[0, 0, 0, 1, 0, 0, 1]}
Context: this is for a DeepAR time series model (using AWS) that requires dynamic features to include the prediction period, but the target should be historical data.
The solution involves using a join, as recommended by pault:
1. Create a dataframe with dynamic features of length equal to the training + prediction period.
2. Create a dataframe with target values of length equal to just the training period.
3. Use a LEFT JOIN (with the dynamic feature data on the left) to bring these dataframes together.
Now, using collect_list will create the desired result; a sketch of this is below.
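A rough sketch of that approach (column names Group, date, target and MarchMadInd are taken from the question; train_data and test_data are the question's dataframes and are assumed to share the Group and date columns):
from pyspark.sql import functions as F
from pyspark.sql import Window

# test_data covers the training + prediction dates,
# train_data covers only the training dates
joined = test_data.select('Group', 'date', 'MarchMadInd') \
    .join(train_data.select('Group', 'date', 'target'),
          on=['Group', 'date'], how='left')

w = Window.partitionBy('Group').orderBy('date')

# collect_list skips the nulls introduced by the left join, so the
# target array stays shorter than the dynamic-feature array
result = (joined
          .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))
          .withColumn('target', F.collect_list('target').over(w))
          .groupBy('Group')
          .agg(F.max('MarchMadInd').alias('MarchMadInd'),
               F.max('target').alias('target')))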

Viz LDA model with Bokeh and T-sne

I have tried to follow this tutorial (https://shuaiw.github.io/2016/12/22/topic-modeling-and-tsne-visualzation.html) on visualizing LDA with t-SNE and Bokeh, but I have run into a bit of a problem.
When I try to run the following code:
plot_lda.scatter(x=tsne_lda[:, 0], y=tsne_lda[:, 1],
                 color=colormap[_lda_keys][:num_example],
                 source=bp.ColumnDataSource({
                     "content": text[:num_example],
                     "topic_key": _lda_keys[:num_example]
                 }))
NB: In the tutorial the content is called news, in mine it is called text
I get this error:
Supplying a user-defined data source AND iterable values to glyph methods is
not possibe. Either:
Pass all data directly as literals:
p.circe(x=a_list, y=an_array, ...)
Or, put all data in a ColumnDataSource and pass column names:
source = ColumnDataSource(data=dict(x=a_list, y=an_array))
p.circe(x='x', y='x', source=source, ...)
To me this does not make much sense, and I have not succeeded in finding an answer to it either here, on GitHub, or elsewhere. Hope that someone can help. Best, Niels
I've also been battling with that piece of code, and I've found two problems with it.
First, when you pass a source to the scatter function, like the error states, you must include all data in the dictionary, i.e., x and y axes, colors, labels, and any other information that you want to include in the tooltip.
Second, the x and y axes have a different shape than the information passed to the tooltip, so you also have to slice both arrays in the axes with the num_example variable.
The following code got me running:
import pandas as pd
import bokeh.plotting as bp

# create the dictionary with all the information
plot_dict = {
    'x': tsne_lda[:num_example, 0],
    'y': tsne_lda[:num_example, 1],
    'colors': colormap[_lda_keys][:num_example],
    'content': text[:num_example],
    'topic_key': _lda_keys[:num_example]
}

# create the dataframe from the dictionary
plot_df = pd.DataFrame.from_dict(plot_dict)

# declare the source
source = bp.ColumnDataSource(data=plot_df)

title = 'LDA viz'

# initialize bokeh plot
plot_lda = bp.figure(plot_width=1400, plot_height=1100,
                     title=title,
                     tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
                     x_axis_type=None, y_axis_type=None, min_border=1)

# build the scatter plot from the columns of the dataframe
plot_lda.scatter('x', 'y', color='colors', source=source)
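If you also want the tooltip to show the extra columns, one way (a sketch, not part of the original tutorial code) is to point the HoverTool at the columns packed into the same ColumnDataSource:
from bokeh.models import HoverTool

# the "hover" entry in the tools string above adds a HoverTool;
# configure it to read the extra columns from the shared source
hover = plot_lda.select_one(HoverTool)
hover.tooltips = [("content", "@content"), ("topic", "@topic_key")]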

Tensor flow shuffle a tensor for batch gradient

To whom it may concern,
I am pretty new to TensorFlow. I am trying to solve the famous MNIST problem with a CNN, but I have encountered difficulty when I have to reshuffle the x_train data (which has shape [40000, 28, 28, 1]).
My code is as below:
x_train_final = tf.reshape(x_train_final, [-1, image_width, image_width, 1])
x_train_final = tf.cast(x_train_final, dtype=tf.float32)
perm = np.arange(num_training_example).astype(np.int32)
np.random.shuffle(perm)
x_train_final = x_train_final[perm]
The following error happened:
ValueError: Shape must be rank 1 but is rank 2 for 'strided_slice_1371' (op: 'StridedSlice') with input shapes: [40000,28,28,1], [1,40000], [1,40000], [1].
Can anyone advise how I can work around this? Thanks.
I would suggest making use of scikit-learn's shuffle function.
from sklearn.utils import shuffle
x_train_final = shuffle(x_train_final)
Also, you can pass in multiple arrays, and the shuffle function will reorganize (shuffle) the data in all of those arrays while maintaining the same shuffling order across them. So you can even pass in your label dataset as well.
Ex:
X_train, y_train = shuffle(X_train, y_train)
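Alternatively, if you want to stay inside TensorFlow (a sketch based on the code in the question), you can apply the permutation with tf.gather instead of NumPy-style fancy indexing, which is what triggers the strided-slice error on a tensor:
import numpy as np
import tensorflow as tf

# num_training_example and x_train_final come from the question's code
perm = np.arange(num_training_example).astype(np.int32)
np.random.shuffle(perm)

# gather the rows of the tensor along the first axis in permuted order
x_train_final = tf.gather(x_train_final, perm)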

How do you Unit Test Python DataFrames

How do I unit test Python DataFrames?
I have functions whose inputs and outputs are DataFrames; almost every function I have works this way. If I want to unit test these functions, what is the best method of doing it? It seems like a lot of effort to create a new DataFrame (with values populated) for every function.
Are there any materials you can refer me to? Should you write unit tests for these functions?
While Pandas' test functions are primarily used for internal testing, NumPy includes a very useful set of testing functions that are documented here: NumPy Test Support.
These functions compare NumPy arrays, but you can get the array that underlies a Pandas DataFrame using the values property. You can define a simple DataFrame and compare what your function returns to what you expect.
One technique you can use is to define one set of test data for a number of functions. That way, you can use Pytest Fixtures to define that DataFrame once, and use it in multiple tests.
In terms of resources, I found this article on Testing with NumPy and Pandas to be very useful. I also did a short presentation about data analysis testing at PyCon Canada 2016: Automate Your Data Analysis Testing.
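A small sketch of that combination (add_totals here is a hypothetical function under test; the fixture and assertion pattern are the point):
import numpy as np
import pandas as pd
import pytest

@pytest.fixture
def input_df():
    # one small DataFrame shared by several tests
    return pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def test_add_totals(input_df):
    result = add_totals(input_df)  # hypothetical function under test
    # compare the underlying NumPy array of the result column
    np.testing.assert_array_equal(result["total"].values, np.array([11, 22, 33]))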
You can use pandas' testing functions. They give you more flexibility to compare your result with the expected result in different ways.
For example:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'a': [6, 7, 8, 9, 10]})
expected_res = pd.Series([7, 9, 11, 13, 15])
pd.testing.assert_series_equal((df1['a'] + df2['a']), expected_res, check_names=False)
For more details, refer to this link.
If you are using pytest, PandasSnapshot will be useful.
# use with pytest
import pandas as pd
from snapshottest_ext.dataframe import PandasSnapshot

def test_format(snapshot):
    df = pd.DataFrame([['a', 'b'], ['c', 'd']],
                      columns=['col 1', 'col 2'])
    snapshot.assert_match(PandasSnapshot(df))
One big con is that the snapshot is no longer human-readable (storing the content as CSV is more readable, but that is problematic in other ways).
PS: I am the author of pytest snapshot extension.
I don't think it's hard to create small DataFrames for unit testing?
import pandas as pd
from nose.tools import assert_dict_equal

input_df = pd.DataFrame.from_dict({
    'field_1': [some, values],
    'field_2': [other, values]
})
expected = {
    'result': [...]
}
assert_dict_equal(expected, my_func(input_df).to_dict(), "oops, there's a bug...")
You could use snapshottest and do something like this:
def test_something_works(snapshot):  # snapshot is a pytest fixture from snapshottest
    data_frame = calc_something_and_return_pandas_dataframe()
    snapshot.assert_match(data_frame.to_csv(index=False), 'some_module_level_unique_name_for_the_snapshot')
This will create a snapshots folder with a file in it that contains the CSV output, which you can update with --snapshot-update when your code changes.
It works by comparing the data_frame variable to what is saved to disk.
Might be worth mentioning that your snapshots should be checked in to source control.
I would suggest writing the values as CSV in docstrings (or separate files if they're large) and parsing them using pd.read_csv(). You can parse the expected output from CSV too, and compare, or else use df.to_csv() to write a CSV out and diff it.
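A minimal sketch of that idea (the CSV text and transform_df are hypothetical placeholders):
import io
import pandas as pd

INPUT_CSV = """a,b
1,10
2,20
"""

EXPECTED_CSV = """a,b,total
1,10,11
2,20,22
"""

def test_transform_df():
    input_df = pd.read_csv(io.StringIO(INPUT_CSV))
    expected = pd.read_csv(io.StringIO(EXPECTED_CSV))
    result = transform_df(input_df)  # hypothetical function under test
    pd.testing.assert_frame_equal(result, expected)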
Pandas has built in testing functions, but I don't find the output easy to parse, so I created an open source project called beavis with functions that output error messages that are easier for humans to read.
Here's an example of one of the built in testing methods:
df = pd.DataFrame({"col1": [1042, 2, 9, 6], "col2": [5, 2, 7, 6]})
pd.testing.assert_series_equal(df["col1"], df["col2"])
Here's the error message:
> ???
E AssertionError: Series are different
E
E Series values are different (50.0 %)
E [index]: [0, 1, 2, 3]
E [left]: [1042, 2, 9, 6]
E [right]: [5, 2, 7, 6]
Not very easy to see which rows are mismatched because the output isn't aligned.
Here's how you can write the same test with beavis.
import beavis
beavis.assert_pd_column_equality(df, "col1", "col2")
This'll give you a readable, aligned error message.
The built-in assert_frame_equal doesn't give a readable error message either. Here's how you can compare DataFrame equality with beavis.
df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': [5, 2], 'col2': [3, 4]})
beavis.assert_pd_equality(df1, df2)
The frame-fixtures Python package (of which I am an author) is designed to make it easy to "create a new dataframe (with values populated)" for unit or performance tests.
For example, if you want to test against a DataFrame of floats and strings with a numerical index, you can use a compact string declaration to generate a DataFrame.
>>> ff.Fixture.to_frame('i(I,int)|v(float,str)|s(4,2)').to_pandas()
             0     1
34715  1930.40  zaji
-3648 -1760.34  zJnC
91301  1857.34  zDdR
30205  1699.34  zuVU
>>> ff.Fixture.to_frame('i(I,int)|v(float,str)|s(8,3)').to_pandas()
              0     1        2
34715   1930.40  zaji   694.30
-3648  -1760.34  zJnC   -72.96
91301   1857.34  zDdR  1826.02
30205   1699.34  zuVU   604.10
54020    268.96  zKka  1080.40
129017  3511.58  zJXD  2580.34
35021   1175.36  zPAQ   700.42
166924  2925.68  zyps  3338.48

Removing features with low variance using scikit-learn

scikit-learn provides various methods to remove descriptors; a basic method for this purpose is given in the tutorial below:
http://scikit-learn.org/stable/modules/feature_selection.html
However, the tutorial does not provide any way to keep track of which features were removed and which were kept.
The code below has been taken from the tutorial.
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
The example code above depicts only two descriptors (shape (6, 2)), but in my case I have a huge data frame with a shape of (51 rows, 9000 columns). After finding a suitable model I want to keep track of the useful and useless features, because I can save computational time when computing the features of the test data set by calculating only the useful ones.
For example, when you perform machine learning modeling with WEKA 6.0, it provides remarkable flexibility over feature selection, and after removing the useless features you can get a list of the discarded features along with the useful ones.
Thanks
Then, what you can do, if I'm not wrong, is:
In the case of VarianceThreshold, you can call the method fit instead of fit_transform. This will fit the data, and the resulting variances will be stored in vt.variances_ (assuming vt is your object).
With a threshold, you can extract the features of the transformation as fit_transform would do:
X[:, vt.variances_ > threshold]
Or get the indexes as:
idx = np.where(vt.variances_ > threshold)[0]
Or as a mask
mask = vt.variances_ > threshold
PS: default threshold is 0
EDIT:
A more straightforward way to do this is to use the get_support method of the VarianceThreshold class. From the documentation:
get_support([indices]) Get a mask, or integer index, of the features selected
You should call this method after fit or fit_transform.
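For instance (a minimal sketch, assuming X is a pandas DataFrame so the selected indices can be mapped back to column names):
from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(threshold=0.1)
vt.fit(X)

mask = vt.get_support()             # boolean mask of the kept features
kept_columns = X.columns[mask]      # names of the columns that survive
removed_columns = X.columns[~mask]  # names of the discarded columns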
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Just make a convenience function; this one wraps the VarianceThreshold
# transformer but you can pass it a pandas dataframe and get one in return
def get_low_variance_columns(dframe=None, columns=None,
                             skip_columns=None, thresh=0.0,
                             autoremove=False):
    """
    Wrapper for sklearn VarianceThreshold for use on pandas dataframes.
    """
    print("Finding low-variance features.")
    removed_features = []  # so the return below works even if something fails
    try:
        # get list of all the original df columns
        all_columns = dframe.columns

        # remove `skip_columns`
        remaining_columns = all_columns.drop(skip_columns)

        # get length of new index
        max_index = len(remaining_columns) - 1

        # get indices for `skip_columns`
        skipped_idx = [all_columns.get_loc(column)
                       for column in skip_columns]

        # adjust insert location by the number of columns removed
        # (for non-zero insertion locations) to keep relative
        # locations intact
        for idx, item in enumerate(skipped_idx):
            if item > max_index:
                diff = item - max_index
                skipped_idx[idx] -= diff
            if item == max_index:
                diff = item - len(skip_columns)
                skipped_idx[idx] -= diff
            if idx == 0:
                skipped_idx[idx] = item

        # get values of `skip_columns`
        skipped_values = dframe.iloc[:, skipped_idx].values

        # get dataframe values
        X = dframe.loc[:, remaining_columns].values

        # instantiate VarianceThreshold object
        vt = VarianceThreshold(threshold=thresh)

        # fit vt to data
        vt.fit(X)

        # get the indices of the features that are being kept
        feature_indices = vt.get_support(indices=True)

        # remove low-variance columns from index
        feature_names = [remaining_columns[idx]
                         for idx, _ in enumerate(remaining_columns)
                         if idx in feature_indices]

        # get the columns to be removed
        removed_features = list(np.setdiff1d(remaining_columns,
                                             feature_names))
        print("Found {0} low-variance columns."
              .format(len(removed_features)))

        # remove the columns
        if autoremove:
            print("Removing low-variance features.")
            # remove the low-variance columns
            X_removed = vt.transform(X)

            print("Reassembling the dataframe (with low-variance "
                  "features removed).")
            # re-assemble the dataframe
            dframe = pd.DataFrame(data=X_removed,
                                  columns=feature_names)

            # add back the `skip_columns`
            for idx, index in enumerate(skipped_idx):
                dframe.insert(loc=index,
                              column=skip_columns[idx],
                              value=skipped_values[:, idx])
            print("Successfully removed low-variance columns.")

        # do not remove columns
        else:
            print("No changes have been made to the dataframe.")

    except Exception as e:
        print(e)
        print("Could not remove low-variance features. Something "
              "went wrong.")

    return dframe, removed_features
This worked for me. If you want to see exactly which columns remain after thresholding, you can use this method:
from sklearn.feature_selection import VarianceThreshold

threshold_n = 0.95
sel = VarianceThreshold(threshold=(threshold_n * (1 - threshold_n)))
sel_var = sel.fit_transform(data)
data[data.columns[sel.get_support(indices=True)]]
When testing features, I wrote this simple function that tells me which variables remain in the data frame after the VarianceThreshold is applied.
from sklearn.feature_selection import VarianceThreshold
from itertools import compress

def fs_variance(df, threshold: float = 0.1):
    """
    Return a list of selected variables based on the threshold.
    """
    # The list of columns in the data frame
    features = list(df.columns)

    # Initialize and fit the method
    vt = VarianceThreshold(threshold=threshold)
    _ = vt.fit(df)

    # Get the column names which pass the threshold
    feat_select = list(compress(features, vt.get_support()))

    return feat_select
which returns a list of column names which are selected. For example: ['col_2','col_14', 'col_17'].