similarity measure scikit-learn document classification - python-2.7

I am doing some work on document classification with scikit-learn. For this purpose, I represent my documents in a tf-idf matrix and feed a Random Forest classifier with that information, which works perfectly well. I was just wondering which similarity measure the classifier uses (cosine, Euclidean, etc.) and how I can change it. I haven't found any parameters or information in the documentation.
Thanks in advance!

As with most supervised learning algorithms, Random Forest classifiers do not use a similarity measure; they work directly on the features supplied to them. So the decision trees are built based on the terms in your tf-idf vectors.
If you want to use similarity, then you will have to compute a similarity matrix for your documents and use that as your features.
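For example, a minimal scikit-learn sketch (the toy corpus and names like `docs`/`labels` are placeholders for your own data): build the tf-idf matrix, turn it into a document-by-document cosine similarity matrix, and train the forest on that:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; replace with your own documents and labels.
docs = ["the cat sat", "dogs bark loudly", "cats and dogs", "a dog barked"]
labels = [0, 1, 0, 1]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)            # documents x terms (tf-idf)

# Each document's feature vector becomes its cosine similarity
# to every training document.
S = cosine_similarity(X)               # documents x documents
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(S, labels)

# A new document is featurized by its similarity to the *training*
# documents, so the feature space stays fixed.
X_new = vec.transform(["a cat and a dog"])
S_new = cosine_similarity(X_new, X)    # 1 x len(docs)
print(clf.predict(S_new))
```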

Related

Principal component analysis on proportional data

Is it valid to run a PCA on data that is comprised of proportions? For example, I have data on the proportion of various food items in the diet of different species. Can I run a PCA on this type of data or should I transform the data or do something else beforehand?
I had a similar question. You should search for "compositional data analysis". There are transformations you can apply to proportions in order to analyze them with multivariate techniques such as PCA. You can also find "robust" PCA algorithms to run your analysis in R. Let us know if you find an appropriate solution to your specific problem.
I don't think so.
PCA will give you "impossible" answers. You might get principal components with values that proportions can't have, like negative values or values greater than 1. How would you interpret such a component?
In technical terms, the support of your data is a subset of the support assumed by PCA. Say you have $k$ classes. Then:
the support for PCA vectors is $\mathbb{R}^k$
the support for your proportion vectors is the $(k-1)$-dimensional probability simplex. By simplex I mean the set of vectors $p$ of length $k$ such that:
$0 \le p_i \le 1$ where $i = 1, ..., k$
$\sum_{i=1}^k{p_i} = 1$
One way around this would be a one-to-one mapping between the simplex and a Euclidean space. If such a mapping exists, you could map your proportions into that space, do PCA there, and then map the principal components back to the simplex.
But the simplex is not a self-contained linear space: if you add two elements of the simplex, you don't get an element of the simplex.
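That said, the compositional-data-analysis transforms mentioned in the first answer give you essentially this kind of mapping. A minimal sketch, assuming synthetic Dirichlet data, using the centered log-ratio (clr) transform before PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy proportion data: each row sums to 1 (stand-in for real diet data).
rng = np.random.default_rng(0)
P = rng.dirichlet(alpha=[2.0, 3.0, 5.0], size=100)   # 100 samples, k = 3

# Centered log-ratio (clr): log each part, subtract the row mean of logs.
# This maps the (open) simplex into a hyperplane of R^k where ordinary
# linear operations make sense. Zeros must be handled first, e.g. with
# a small pseudo-count as done here.
eps = 1e-9
logP = np.log(P + eps)
clr = logP - logP.mean(axis=1, keepdims=True)

pca = PCA(n_components=2).fit(clr)
scores = pca.transform(clr)
print(pca.explained_variance_ratio_)
```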
A better approach, I think, is clustering, e.g., with Gaussian mixtures or spectral clustering; this is related to PCA. A nice property of clustering is that you can express any element of your data as a convex combination of the clusters. If you analyze your proportion data and find clusters, they (unlike principal components) will lie within the simplex, and any mixture of them will, too.
I also recommend looking into nonnegative matrix factorization (NMF). This is like PCA but, as the name suggests, it constrains both the components and the coefficients to be nonnegative. It is very useful for inferring structure in strictly positive data, like proportions. Note, however, that NMF does not give you a basis for the simplex.
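A minimal scikit-learn sketch on the same kind of synthetic proportion data (the component count and parameters are illustrative, not a recommendation):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
P = rng.dirichlet(alpha=[2.0, 3.0, 5.0], size=100)   # rows sum to 1

# Factor P ~ W @ H with W, H >= 0: rows of H act as nonnegative "parts"
# and rows of W say how much of each part a sample contains.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(P)
H = nmf.components_
print(W.shape, H.shape)   # (100, 2), (2, 3)
```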

Using AutoML to evaluate the hyperparameters of the algorithm Word2Vec

Is it possible with AutoML (from H2O) to use only the Word2Vec algorithm and try out different values for its parameters to find out which parameter settings give me the most accurate vectors for my data set? So I don't want AutoML to apply the algorithms DeepLearning, GBM, etc. to my dataset, only the Word2Vec algorithm. How do I do that?
So far I have only managed to build a Word2Vec model with H2O.
I would like to test different settings of the hyperparameters of Word2Vec with AutoML to evaluate which settings are optimal...
The Word2Vec algorithm is a data transformation algorithm (converting rows of text to a numeric matrix), not a supervised machine learning algorithm (which is what AutoML and all the algorithms inside of it are).
The typical way Word2Vec is used is to apply it to your text data so that the data can be used to train a supervised ML algorithm. From there you can run any supervised algorithm (GLM, Random Forest, GBM, etc.) on the transformed dataset -- or, my recommendation, just pass the transformed data to AutoML so it can find the best algorithm for you.
You will have to try out different settings for Word2Vec manually and see how well they do, given some particular supervised learning algorithm that you want to apply to your problem (see the sketch below). Hopefully that clears up the confusion.
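A hedged sketch of such a manual search with the h2o Python API; the file name, column names, parameter grid, and use of AUC are all assumptions for illustration, not part of the original question:

```python
import h2o
from h2o.automl import H2OAutoML
from h2o.estimators.word2vec import H2OWord2vecEstimator

h2o.init()

# Placeholder data: a frame with a text column "text" and a target "label".
df = h2o.import_file("documents.csv")
words = df["text"].tokenize("\\W+")     # one word per row

results = []
for vec_size in (50, 100, 200):         # assumed Word2Vec settings to try
    w2v = H2OWord2vecEstimator(vec_size=vec_size, epochs=10)
    w2v.train(training_frame=words)

    # Average each document's word vectors into a single row, then
    # attach the label so a supervised learner can be trained.
    vecs = w2v.transform(words, aggregate_method="AVERAGE")
    train = vecs.cbind(df["label"])

    aml = H2OAutoML(max_models=5, seed=1)
    aml.train(y="label", training_frame=train)

    # Leaderboard AUC assumes a binomial target; use a metric that
    # fits your own task.
    results.append((vec_size, aml.leaderboard[0, "auc"]))

print(sorted(results, key=lambda r: r[1], reverse=True))
```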

How can I send custom input to meta classifier of Stacking using weka api

From a research paper "In addition to stacking across all classifier outputs, we also evaluate stacking using only the aggregate output of each resampled (bagged) base classifier. For example, the outputs of all 10 SVM classifiers are averaged and used as a single level 0 input to the meta learner."
I am wondering how can I implement this. Actually I need to implement this for my thesis.
If you need only the average of the 10 classifiers, you can add a Vote classifier as one of the base classifiers for stacking. The Vote classifier can combine as many SVM classifiers as you need.
If you want to use the predictions of the individual SVM classifiers as inputs for the stacking classifier as well, you can add SVM classifiers alongside the Vote classifier (as base classifiers for stacking). However, this would probably not be very effective.
Otherwise you can modify the code yourself, as Weka is open source.
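For illustration, here is a sketch using the python-weka-wrapper3 bindings rather than the Java API directly; the option strings mimic Weka's command-line conventions (-B = base classifier, -M = meta learner) and should be verified against your Weka version. Note that Bagging already averages the class distributions of its 10 SMO (SVM) members, which matches the "aggregate output of each resampled (bagged) base classifier" setup quoted from the paper:

```python
import weka.core.jvm as jvm
from weka.classifiers import Classifier
from weka.core.converters import load_any_file

jvm.start(packages=True)

data = load_any_file("your_data.arff")   # placeholder path
data.class_is_last()

# Stacking whose single level-0 input is the averaged output of
# 10 bagged SMO (SVM) classifiers; Logistic is an assumed meta learner.
stack = Classifier(
    classname="weka.classifiers.meta.Stacking",
    options=[
        "-M", "weka.classifiers.functions.Logistic",
        "-B", "weka.classifiers.meta.Bagging -I 10 "
              "-W weka.classifiers.functions.SMO",
    ])
stack.build_classifier(data)
print(stack)

jvm.stop()
```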

classifying a weighted feature vector

I want to give weights to the features of a data set before using them in a classification algorithm like KNN or J48, but I don't know how to evaluate a weighted feature vector.
Does any of the classification algorithms accept real-valued weights as input, instead of just '0' and '1'?
In particular, are any of Weka's built-in classifiers capable of working with weights (rather than the 0/1 feature selection that filters give you)?
In most situations, you can just scale the data set according to your weights. This is trivial to prove for Minkowski distances such as the Euclidean distance: scaling feature $i$ by $\sqrt{w_i}$ makes the ordinary Euclidean distance equal to the weighted one.
Not all of Weka's classification algorithms support weights, but some do.
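To illustrate the scaling trick (shown here with scikit-learn for brevity; the same preprocessing can be applied to a dataset before loading it into Weka, and the weights are made up for the example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical per-feature weights you want the classifier to respect.
w = np.array([4.0, 1.0, 1.0, 0.25])

# For Euclidean distance, multiplying feature i by sqrt(w_i) makes the
# ordinary distance equal to the weighted one:
#   sum_i w_i (x_i - y_i)^2 == sum_i (sqrt(w_i) x_i - sqrt(w_i) y_i)^2
Xw = X * np.sqrt(w)

knn = KNeighborsClassifier(n_neighbors=5).fit(Xw, y)
print(knn.score(Xw, y))
```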
You need to set the weight information after loading your dataset; see the example code in the Weka wiki. I remember that Weka's J48 decision tree supports weights in the developer version, but I cannot find a reference. There is a patch, though.
This search for "feature weights" in the Weka wiki may help.
I suggest trying to add weights to your data set and training on it.

Conditional Random Fields

Is there a training and optimization algorithm for 2-D (two dimensional) conditional random fields (CRF) suited for classification of imagery?
Has anyone used the CRF package in R (http://crf.r-forge.r-project.org/html/CRF-package.html) for image classification? I would like to see some working example code.
Thanks.
Look up Markov Random Fields. Here's a paper you might be interested in: Patrick Pérez, "Markov Random Fields and Images" (1998).
I do not think it will work alone. Since image classification has to cope with scaling and affine transformations, the key to accurate image classification is the preprocessing, not the classification algorithm.
Classification of imagery usually involves bag-of-words features, feature pooling, and so on, whereas conditional random fields are designed for labeling sequential data, so it might not be appropriate to use a CRF in this scenario.