How to print features after modeling in random forest? - python-2.7

I am modeling a dataset using a random forest classifier. I want to print the features that are selected by the random forest.
I have used feature_importances_ as follows:
modelRF.feature_importances_
But it is showing the error as:
NameError: name 'feature_importances_' is not defined
Also, on using the "fit" method, it gives the error:
AttributeError: 'RandomForest' object has no attribute 'fit'
Following are the parameters used in random forest classifier:
(data, x_cols, y_col, num_trees, method, impurity, max_depth=10, min_instance_per_node=20, min_information_gain=0.01, max_bin=32, feature_subset_strategy=u'auto', seed=123, async_execution=False)
I want to print the features that are selected using random forest.
Is there a need to define some additional thing to make the above methods work for random forest?(I am modelling RF in distributed platform using adatao/arimo package).

There is a module named variable_importance in the arimo package which will give you the features selected by the random forest classifier.
It returns a pandas DataFrame with the variable name and its importance score.
The variables with an importance score > 0.0 are the features selected by the random forest classifier.
This applies to the arimo package in Python on a distributed platform. For other packages,
model.feature_importances_
can be used instead.
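For the scikit-learn case, a minimal sketch of the usual workflow (using the built-in iris dataset as stand-in data): feature_importances_ only exists on a fitted estimator, which is why calling it before fit raises an error.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)  # feature_importances_ exists only after fit

# one importance score per input feature; scores sum to 1.0
for name, score in zip(data.feature_names, model.feature_importances_):
    print(name, score)
```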

Related

Approach to get the weight values from the pre-trained weights from Darknet?

I'm currently trying to implement YOLOv3 object detection model in C(only detection, not training).
I have tested my convolution method with arbitrary values and it seems to be working as I expected.
Before stacking up multiple method calls to do forward propagation, I thought it would be safe to test with the actual pretrained weight file data.
When I looked at Darknet's pre-trained weight file, it was a huge chunk of binary data. I tried converting it to hex and decimal, but it still isn't obvious which part of the values to use.
So, my question is, what should I do to extract the decimal numbers of the weights or the filter values so that I can use them in the same order of the forward propagation happening in YOLOv3?
*I'm currently trying to build my c version of YOLOv3 using the structure image shown in https://www.itread01.com/content/1541167345.html
*My c code will be run on an FPGA board called MicroZed, along with other HDL code.
*I tried plugging some printf calls into various places in the Darknet code to see what kinds of data move around when YOLOv3 runs; however, when I ran it in a Linux terminal, it didn't show anything new and kept outputting the same results.
Any help or advice will be really appreciated. Thank you!
I am not too sure if there is a direct way to read Darknet weights, but you can convert them into .h5 format and obtain the weight values from that.
You can convert the Darknet YOLOv3 weights into .h5 format (used by Keras) using the appropriate command from this repository.
You can choose the command based on your YOLO version from the list shown in the README of the linked repo. For the standard YOLOv3, the conversion command is
python tools/model_converter/convert.py cfg/yolov3.cfg weights/yolov3.weights weights/yolov3.h5
Once you have the .h5 weights, you can use the snippet below to obtain the values from them. (credit/source)
import h5py

path = "<path to weights>.h5"
weights = {}
keys = []
with h5py.File(path, 'r') as f:       # open file
    f.visit(keys.append)              # append all keys to list
    for key in keys:
        if ':' in key:                # dataset names contain ':'
            param_name = f[key].name
            weights[param_name] = f[key][()]  # .value is deprecated in h5py
            print(param_name, weights[param_name])

specify priors in multi-label Naive Bayes in python scikit-learn

I am working on a multi-label classification. I used GaussianNB function on python scikit-learn. The target is an array with (N, L) shape, where L is the number of classes and N is the number of observations.
I used three ways to deal with multi-label case:
binary relevance
chain model
label powerset
I have a prior distribution for L classes, which is an array of (L,) shape. I tried to incorporate this prior distribution into GaussianNB through priors parameter like this
classifier = BinaryRelevance(GaussianNB(priors = prior_dist))
However, it returns the following error:
ValueError: number of priors must match number of classes
What is the correct way to specify priors into GaussianNB in a multi-label case?
I haven't added support for this yet in scikit-multilearn, but it seems fairly easy to add - could you put it as a feature request in scikit-multilearn? I think I have an idea how to add this, but we can track the issue further in github.
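In the meantime, the error can be understood and worked around by hand: under binary relevance, each GaussianNB sees a two-class (0/1) problem, so it expects a length-2 prior, not a length-L one. A minimal sketch with toy data and an assumed per-label positive prior prior_dist:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# toy multi-label data (hypothetical): 6 samples, 2 features, L=2 labels
X = np.array([[0.1, 1.2], [0.9, 0.4], [0.3, 0.8],
              [1.1, 0.2], [0.2, 1.0], [0.8, 0.5]])
Y = np.array([[1, 0], [0, 1], [1, 1],
              [0, 0], [1, 0], [0, 1]])
prior_dist = np.array([0.4, 0.6])  # assumed P(label_l = 1) for each label

models = []
for l in range(Y.shape[1]):
    # each binary-relevance sub-problem has exactly two classes (0 and 1),
    # so GaussianNB needs a length-2 prior that sums to 1
    clf = GaussianNB(priors=[1 - prior_dist[l], prior_dist[l]])
    clf.fit(X, Y[:, l])
    models.append(clf)

preds = np.column_stack([m.predict(X) for m in models])
print(preds.shape)  # one binary prediction per label: (6, 2)
```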

Negative test score for random forest

Hi, I am using a random forest to predict logerror. The log error contains both +ve and -ve values. After running the model with different settings, I am able to get a training score of around 0.8, but the test score is always negative. Why is that so?
Should I be using abs(logerror) for prediction, or is my choice of random forest wrong?
The choice of random forest might be wrong, but you'd better check it in the context of the data; if you had shared the data here, it would be easier to point at the exact problem. I also suggest you try KNN if your total number of observations is around 1000-2000.
Also, if you are using any kind of encoding to convert categorical data to numeric, please use one-hot encoding only, as other encodings may impose an artificial ordering on the attribute values.
You should also check the correlation of the attributes with the target variable, as low correlation with the target in the test data may result in a negative score.
Apart from all of the above, the distribution of the data plays a vital role in random forest regression. So check the distribution and apply methods such as Box-Cox to transform the data toward a normal distribution.
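For context, the negative test score here is most likely a negative R^2: scikit-learn regressors' score method returns the coefficient of determination, which goes negative whenever the model fits the test data worse than a constant predictor of the mean. A minimal illustration with made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
# A model that always predicts 10 is far worse than always predicting
# the mean (2.5), so its R^2 score comes out negative.
y_pred = np.full_like(y_true, 10.0)
print(r2_score(y_true, y_pred))  # -45.0
```

So a negative test score does not indicate a bug by itself; it indicates the model generalizes worse than a trivial baseline.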

How to show reasoning/basis for a classification/prediction in Weka

I am using the Weka Java API for training a model and making predictions. I am able to build a classifier based on 3 algorithms: Decision Trees, Naive Bayes and Random Forest. I am then able to classify a test instance and get a probability distribution over the target classes.
My question is: how do I show the reasoning/basis (a consumable, easily understandable result) for the prediction, i.e. why a given instance was classified as 'A', 'B' or 'C'? The end user would also like to know the logic behind the classification.

Regression Tree Forest in Weka

I'm using Weka and would like to perform regression with random forests. Specifically, I have a dataset:
Feature1,Feature2,...,FeatureN,Class
1.0,X,...,1.4,Good
1.2,Y,...,1.5,Good
1.2,F,...,1.6,Bad
1.1,R,...,1.5,Great
0.9,J,...,1.1,Horrible
0.5,K,...,1.5,Terrific
.
.
.
Rather than learning to predict the most likely class, I want to learn the probability distribution over the classes for a given feature vector. My intuition is that using just the RandomForest model in Weka would not be appropriate, since it would be attempting to minimize its absolute error (maximum likelihood) rather than its squared error (conditional probability distribution). Is that intuition right? Is there a better model to be using if I want to perform regression rather than classification?
Edit: I'm actually thinking now that in fact it may not be a problem. Presumably, classifiers are learning the conditional probability P(Class | Feature1,...,FeatureN) and the resulting classification is just finding the c in Class that maximizes that probability distribution. Therefore, a RandomForest classifier should be able to give me the conditional probability distribution. I just had to think about it some more. If that's wrong, please correct me.
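That intuition can be checked directly. In scikit-learn terms (hypothetical toy data; Weka's RandomForest exposes the same idea through distributionForInstance), an ordinary random forest classifier already yields a per-class probability distribution rather than only a hard label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy data shaped like the example above (hypothetical values)
X = np.array([[1.0, 1.4], [1.2, 1.5], [1.2, 1.6], [1.1, 1.5],
              [0.9, 1.1], [0.5, 1.5]])
y = np.array(["Good", "Good", "Bad", "Great", "Horrible", "Terrific"])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# fraction of trees voting for each class, an estimate of P(Class | features)
print(clf.classes_)
print(clf.predict_proba([[1.15, 1.5]]))  # one row, sums to 1
```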
If you want to predict the probabilities for each class explicitly, you need different input data. That is, you would need to replace the value to predict. Instead of one data set with the class label, you would need n data sets (for n different labels) with aggregated data for each unique feature vector. Your data would look something like
Feature1,...,Good
1.0,...,0.5
0.3,...,1.0
and
Feature1,...,Bad
1.0,...,0.8
0.3,...,0.1
and so on. You would need to learn one model for each class and run them separately on any data to be classified. That is, for each label you learn a model to predict a number that is the probability of being in that class, given a feature vector.
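A sketch of that aggregation step (hypothetical toy data, using pandas rather than Weka's own tooling): group the rows per unique feature vector and turn each class label into its empirical frequency, giving one numeric target column per class.

```python
import pandas as pd

# toy version of the original classification dataset
df = pd.DataFrame({
    "Feature1": [1.0, 1.0, 0.3, 0.3, 0.3],
    "Class":    ["Good", "Bad", "Good", "Good", "Bad"],
})

# empirical P(Class | feature vector): one column per class label
probs = pd.crosstab(df["Feature1"], df["Class"], normalize="index")
print(probs)
```

Each resulting column (e.g. probs["Good"]) then becomes the numeric target of its own regression model.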
If you don't need the probabilities to be predicted explicitly, have a look at the Bayesian classifiers in Weka, which make use of probabilities in the models that they learn.