Using AutoML to evaluate the hyperparameters of the Word2Vec algorithm - word2vec

Is it possible with AutoML (from H2O) to use only the Word2Vec algorithm and try out different parameter values to find out which settings give me the most accurate vectors for my data set? I don't want AutoML to apply the DeepLearning, GBM, etc. algorithms to my dataset, only the Word2Vec algorithm. How do I do that?
So far I have only managed to build a word2vec model with H2O.
I would like to test different settings of the Word2Vec hyperparameters with AutoML to evaluate which settings are optimal.

The Word2Vec algorithm is a data transformation algorithm (converting rows of text to a matrix), not a supervised machine learning algorithm (which is what AutoML and all the algorithms inside of it do).
The typical way to use Word2Vec is to apply it to your text data so that the transformed data can be used to train a supervised ML algorithm. From there you can run any supervised algorithm (GLM, Random Forest, GBM, etc.) on the transformed dataset, or, as I would recommend, just pass the transformed data to AutoML so it can find the best algorithm for you.
You will have to try out different settings for Word2Vec manually and see how well they do, given some particular supervised learning algorithm that you want to apply to your problem. Hopefully that clears up the confusion.
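As a rough sketch of what trying out settings manually can look like, the loop below scores every combination of a small parameter grid. The `train_embedding` and `score_downstream` callables are hypothetical placeholders: in practice the first would train a Word2Vec model and transform your text, and the second would train your chosen supervised model on the result and return a validation metric. The parameter names in the grid are illustrative only.

```python
from itertools import product


def sweep_word2vec_settings(param_grid, train_embedding, score_downstream):
    """Try every combination in param_grid; return (score, params) best-first.

    train_embedding(params) -> embedding   (hypothetical callable)
    score_downstream(embedding) -> float   (hypothetical callable, higher is better)
    """
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        embedding = train_embedding(params)
        results.append((score_downstream(embedding), params))
    # Sort on the score only, so ties never try to compare the dicts.
    results.sort(key=lambda item: item[0], reverse=True)
    return results


# Toy stand-ins so the sketch runs on its own; replace with real training code.
def fake_train(params):
    return params


def fake_score(embedding):
    # Pretend the downstream model does best at vec_size=100, window_size=5.
    return -abs(embedding["vec_size"] - 100) - abs(embedding["window_size"] - 5)


grid = {"vec_size": [50, 100, 200], "window_size": [3, 5]}
ranked = sweep_word2vec_settings(grid, fake_train, fake_score)
best_score, best_params = ranked[0]
print(best_params)
```

The important design point is that the score comes from the downstream supervised task, since Word2Vec on its own has no accuracy to optimize.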

Related

Vertex AI forecasting AutoML giving different answers for same input data

I trained one Vertex AI forecasting AutoML model with the target column as String and the other numeric input features as String, then I trained another AutoML model with the target column as float and the other input features as Integer.
The predictions from the two models are different. The data is the same; only the datatypes/schema changed.
Google documentation says:
When you train a model with a feature with a numeric transformation,
Vertex AI applies the following data transformations to the feature,
and uses any that provide signal for training:
The value converted to float32.
So the data should be the same in both cases even after transformation.
Why would results be different? Is it possible?
I followed the steps in Build an AutoML Forecasting Model with Vertex AI to build a forecasting model and came to the conclusion that Vertex AI condenses many of the steps of prediction model generation so that it can be operated easily by users.
I think the most reasonable explanation for the difference you observe between string and numeric values lies in how the data is processed to generate the prediction models. I don't think you will find this in the Vertex AI documentation, as it would mean disclosing how the Vertex AI code works and how it handles its feature engineering and training steps, which is proprietary.
Regardless, let's speculate a bit. I think the difference might arise when a datatype is converted and passed to the algorithm for processing. Take a linear regression as an example: even the slightest variation in data conversion can affect the outcome of your prediction model, which could also be what is happening here.
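To make the linear regression remark concrete, here is a small sketch in plain Python. It says nothing about what Vertex AI does internally; it only shows that rounding inputs to float32 changes the stored values slightly, and that a least-squares fit on the rounded data is computed from those slightly different numbers. The sample data and the helper names are made up for illustration.

```python
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def fit_slope(xs, ys):
    """Ordinary least-squares slope for a one-feature linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Made-up sample data, for illustration only.
xs = [0.1, 0.2, 0.3, 0.4, 0.5]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]

slope_64 = fit_slope(xs, ys)
slope_32 = fit_slope([to_float32(x) for x in xs], [to_float32(y) for y in ys])

# 0.1 has no exact binary representation, so its float32 and float64
# roundings are different numbers.
print(to_float32(0.1) == 0.1)
print(slope_64, slope_32)
```

Any gap between the two slopes here is tiny. A visible difference in predictions would more plausibly come from a different processing path being chosen for the column, but that remains speculation.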

similarity measure scikit-learn document classification

I am doing some work in document classification with scikit-learn. For this purpose, I represent my documents in a tf-idf matrix and feed a Random Forest classifier with this information, which works perfectly well. I was just wondering which similarity measure is used by the classifier (cosine, Euclidean, etc.) and how I can change it. I haven't found any parameters or information in the documentation.
Thanks in advance!
As with most supervised learning algorithms, Random Forest classifiers do not use a similarity measure; they work directly on the features supplied to them. So the decision trees are built based on the terms in your tf-idf vectors.
If you want to use similarity then you will have to compute a similarity matrix for your documents and use this as your features.
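A minimal sketch of that idea, assuming cosine similarity and tiny hand-written vectors standing in for tf-idf rows: each document is described by its similarity to a set of reference documents, and that matrix replaces the raw term features as classifier input. The function names and the numbers are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def similarity_features(docs, reference_docs):
    """One row per document, one column per reference document."""
    return [[cosine_similarity(d, r) for r in reference_docs] for d in docs]


# Toy stand-ins for tf-idf rows (three documents, three terms).
tfidf = [
    [0.5, 0.0, 0.8],
    [0.4, 0.1, 0.9],
    [0.0, 1.0, 0.0],
]

# Using the training documents themselves as the reference set.
sim_matrix = similarity_features(tfidf, tfidf)
for row in sim_matrix:
    print([round(value, 3) for value in row])
```

At prediction time a new document would be compared against the same reference documents, so that its feature row has the same columns as the training rows.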

How to train an SVM for classifying images of the English alphabet?

My objective is to detect text in an image and recognize it.
I have managed to detect characters using the stroke width transform.
What should I do to recognize them?
As far as I know, I should train the SVM with my dataset of letters in different fonts (images) by detecting feature points and extracting feature vectors from every image. (I have used SIFT feature vectors and built the dictionary using k-means clustering.)
Once I have detected a character, I will extract the SIFT feature vector for it, and I thought of feeding this into the SVM prediction function.
I don't know how to do recognition with an SVM. I am confused! Please help me and correct me wherever my understanding is wrong.
I followed this tutorial for the recognition part. Can this tutorial be applied to recognizing characters?
http://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
SVM is a supervised classifier. To use it, you will need to have training data that is of the type of objects you are trying to recognize.
Step 1 - Prepare training data
The training data consists of pairs of feature vectors and their corresponding class labels. In your case, it appears that you have extracted a SIFT-based "bag-of-words" (BOW) feature vector for the characters you detected. So, for your training data, you will need to find many examples of the different characters, extract this feature vector for each of them, and associate each with a label (sometimes called a class label, and typically an integer), which you will perhaps map to a textual description (e.g., the number 0 could be mapped to the character 'a', and so on).
Step 2 - Training the classifier
The SVM classifier takes in as input an array/Mat of feature vectors (one per row) and their associated labels. Tune the parameters of the SVM (i.e., the regularization parameter C, and if applicable, any other parameters for kernels) on a separate validation set.
Step 3 - Predict for unseen data
At test time, given a sample that was not seen by the SVM during training, you compute a feature vector (your SIFT-based BOW vector) for the sample. Pass this feature vector to the SVM's predict function, and it will return you an integer. Remember earlier when preparing your training data, you have associated an integer with each label? This is the label predicted by the SVM for this sample. You can then map this label to a character. For e.g., if you have associated 0 with 'a', 1 with 'b' etc., you can use a vector/hashmap to map the integer to its textual counterpart.
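The integer-to-character bookkeeping from Steps 1 and 3 can be sketched as follows. The classifier itself is left out; the `predicted` list stands in for whatever the SVM's predict function returns, and the helper names are made up for illustration.

```python
import string

# 0 -> 'a', 1 -> 'b', ..., 25 -> 'z'
label_to_char = dict(enumerate(string.ascii_lowercase))
char_to_label = {char: label for label, char in label_to_char.items()}


def encode_labels(chars):
    """Turn ground-truth characters into integer class labels for training."""
    return [char_to_label[c] for c in chars]


def decode_predictions(predictions):
    """Turn predicted class labels back into text.

    int() is applied because some libraries hand labels back as floats.
    """
    return "".join(label_to_char[int(p)] for p in predictions)


train_labels = encode_labels("cab")

# Stand-in for the output of the SVM's predict function on five characters.
predicted = [7.0, 4.0, 11.0, 11.0, 14.0]
print(decode_predictions(predicted))
```

Keeping both directions of the mapping in one place helps ensure the labels used at training time and the decoding used at test time stay consistent.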
Additional Notes
You can check out OpenCV's SVM tutorial for details.
NOTE: Often, for beginners, the hardest part (after getting the data) is tuning the classifier. My advice is to first try a simple classifier that has few parameters to tune. A decent choice is the linear SVM, which only requires you to adjust one parameter, C. Once you manage to get somewhat decent results (which gives some assurance that the rest of your code is working), you can move on to more "sophisticated" classifiers.
Lastly, the training data and the feature vectors you extract are very important. The training data must be "similar" to the test data you are trying to predict. For example, if you are predicting characters found in road signs, which come with different fonts, lighting conditions, and pose differences, then using training data consisting of characters taken from, say, a newspaper or book archive may not give you good results. This is an issue of domain adaptation in machine learning.

classifying a weighted feature vector

I want to give weights to the features of a data set before using the features in a classification algorithm like KNN or J48, but I don't know how to evaluate a weighted feature vector.
Does any of the classification algorithms accept weights as input instead of just '0' and '1'?
In particular, are any of Weka's ready-made classification functions capable of working with weights (not 0 and 1 as filters)?
In most situations, you can just scale the data set according to your weights. This is trivial to prove for Minkowski distances such as Euclidean distance.
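For Euclidean distance the scaling argument can be checked numerically. The sketch below assumes the weighted distance is defined as sqrt(sum_i w_i * (x_i - y_i)^2); under that definition, multiplying each feature by sqrt(w_i) and then using the ordinary distance gives the same value. The vectors and weights are arbitrary illustrative numbers.

```python
import math


def euclidean(x, y):
    """Ordinary (unweighted) Euclidean distance."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def weighted_euclidean(x, y, w):
    """Euclidean distance with a non-negative weight per feature."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))


def scale_by_weights(x, w):
    """Pre-scale features so the plain distance matches the weighted one."""
    return [math.sqrt(wi) * xi for xi, wi in zip(x, w)]


x = [1.0, 2.0, 3.0]
y = [2.0, 0.0, 3.0]
w = [4.0, 1.0, 0.25]

d_weighted = weighted_euclidean(x, y, w)
d_scaled = euclidean(scale_by_weights(x, w), scale_by_weights(y, w))
print(d_weighted, d_scaled)
```

So a distance-based classifier such as KNN can be given the pre-scaled data and needs no weight support of its own. Note the square root: scaling by w_i directly would correspond to weights of w_i squared in the formula above.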
Not all of Weka's classification algorithms support weights, but some do.
You need to set the weight information after loading your dataset; see the example code in the Weka wiki. I remember that Weka's J48 decision tree supports weights in the developer version, but I cannot find a reference. There is a patch, though.
This search for feature weights in the Weka wiki may help.
I suggest adding weights to the data set and training on your data.

Conditional Random Fields

Is there a training and optimization algorithm for 2-D (two dimensional) conditional random fields (CRF) suited for classification of imagery?
Has anyone used CRF package in R (http://crf.r-forge.r-project.org/html/CRF-package.html) for image classification? I would like to have a view of a working example code.
Thanks.
Look up Markov Random Fields. Here is a paper you might be interested in: Patrick Pérez, Markov Random Fields and Images (1998).
I do not think it will work alone. Since image classification has to cope with scaling and affine transformations, the key to accurate image classification is preprocessing, not the classification algorithm.
Classification of imagery usually involves bag of words, feature pooling, and similar techniques, whereas conditional random fields are for labeling sequential data, so it might not be appropriate to use a CRF in this scenario.