How to manually set the class width of histograms in Weka?

I have an instance with values between 0 and 20 (these are school grades). I want to make a histogram with 21 classes, each representing a single grade. The default histograms in the Weka Explorer use Scott's rule to set the class width and the number of classes, but I cannot find a way to change this.

If I understand correctly, you have numeric data for the school grade. You can use the unsupervised attribute filter NumericToNominal to convert the attribute and get your 21 distinct values.
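To illustrate what the conversion achieves, here is a minimal sketch in plain Python (not Weka's API). It assumes the grades are integers from 0 to 20; the function name and sample data are made up for illustration. Once every grade is treated as its own nominal value, the histogram has exactly one class per grade, including grades that never occur.

```python
from collections import Counter


def grade_histogram(grades, lowest=0, highest=20):
    """Count how often each integer grade occurs, one class per grade."""
    counts = Counter(grades)
    # Include grades that never occur so there is always one class per grade.
    return {g: counts.get(g, 0) for g in range(lowest, highest + 1)}


# Hypothetical sample data, for illustration only.
sample = [0, 7, 12, 12, 15, 20, 20, 20]
hist = grade_histogram(sample)

# Print a simple text histogram with one row per grade.
for grade, count in hist.items():
    print(f"{grade:2d} | {'#' * count}")
```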

Related

Feature Selection and PCA in Machine Learning

I have a dataset with around 15 numeric columns and two categorical columns which are a "State" column and an "Income" column with six buckets representing each different income range. Do I need to encode the "Income" column if it contains integers 1-6 representing each income range? In addition, what type of encoder should I use for the "state" column and does anyone have any good resources on this?
In addition, does one typically perform feature selection (wrapper and filter methods such as Pearson's and Recursive Feature Elimination) before PCA? What is the typical correlation threshold when using a method like Pearson's? And what is the ideal number of dimensions or explained variance ratio to use when running PCA? I'm unsure whether you use one of them or both. Thank you.

How does CfsSubsetEval (Correlation-based Feature Selection) work in Weka?

I have a categorical dataset. I am using WEKA for feature selection, with CfsSubsetEval as the attribute evaluator and the GreedyStepwise search method. I learned from this link that CFS uses Pearson correlation to find strong correlations in the dataset. I also found out how to calculate the Pearson correlation coefficient using this link. According to that link, the data values need to be numerical for the evaluation. How, then, can WEKA run the evaluation on my categorical dataset?
The strange result is that, out of 70 attributes, CFS selects only 10. Is that because the dataset is categorical? Additionally, my dataset is highly imbalanced, with an imbalance ratio of 1:9 (yes:no).
A quick question:
If you go through the link, you will find the statement that the correlation coefficient measures the strength and direction of the linear relationship between two numerical variables X and Y. I understand the strength of the correlation coefficient, which varies between +1 and -1, but what about the direction? How can I get that? The variable is not a vector, so it should not have a direction.
The method correlate in the CfsSubsetEval class is used to compute the correlation between two attributes. It calls other methods, depending on the attribute types, which I've linked here:
two numeric attributes: num_num
numeric/nominal attributes: num_nom2
two nominal attributes: nom_nom
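On the "direction" part of the question: for two numeric variables, direction refers to the sign of the coefficient, not to a vector. A positive value means the variables tend to rise together; a negative value means one tends to fall as the other rises. The sketch below uses the textbook Pearson formula in plain Python. It is not Weka's num_num implementation, and the sample data is made up.

```python
import math


def pearson(xs, ys):
    """Textbook Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one series rises with x, the other falls.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

r_up = pearson(hours, rising)     # positive sign: same direction
r_down = pearson(hours, falling)  # negative sign: opposite direction
print(r_up, r_down)
```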

Weighted Average Calculations across various combinations using Cube.js

We have a question about designing a schema and handling an analytics requirement for our product, and would appreciate your advice. We are just getting started with Cube.js. Here is our requirement (for simplicity, I will use an example): we have data with multiple columns (attributes), plus one "value" column and one "weight" column. We need to calculate weighted averages across all combinations of the attribute columns, using the value and weight columns.
e.g. Group by Column 1 and weighted average (value/Weight column)
or Group by Column 1, 2 and weighted average etc. etc...
There can be many kinds of combinations, and we have at least 8 to 12 columns like that.
How should we best model this?
It will probably be convenient for you to create one cube with several predefined segments; alternatively, you can create a separate cube for each attribute.
It depends on your data.
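Whichever layout is chosen, the calculation itself can be sketched as follows. This is plain Python, not a Cube.js schema, and it assumes the weighted average is defined as sum(value * weight) / sum(weight); the column names and rows are made up. The point is that the two sums are additive, so they can be aggregated for any combination of grouping columns and divided afterwards, and the same idea should carry over to however the cube's measures are defined.

```python
from collections import defaultdict


def weighted_average(rows, group_by, value="value", weight="weight"):
    """Weighted average of `value` per group, for any list of grouping columns."""
    # For each group keep [sum of value * weight, sum of weight].
    sums = defaultdict(lambda: [0.0, 0.0])
    for row in rows:
        key = tuple(row[col] for col in group_by)
        sums[key][0] += row[value] * row[weight]
        sums[key][1] += row[weight]
    # Divide only after aggregating; skip groups whose total weight is zero.
    return {key: vw / w for key, (vw, w) in sums.items() if w != 0}


# Hypothetical rows, for illustration only.
rows = [
    {"col1": "a", "col2": "x", "value": 10.0, "weight": 1.0},
    {"col1": "a", "col2": "y", "value": 20.0, "weight": 3.0},
    {"col1": "b", "col2": "x", "value": 5.0, "weight": 2.0},
]

by_col1 = weighted_average(rows, ["col1"])
by_both = weighted_average(rows, ["col1", "col2"])
print(by_col1)
print(by_both)
```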

OpenCV machine learning library for agglomerative hierarchical clustering

I want to cluster some (x,y) coordinates based on distance using agglomerative hierarchical clustering, because the number of clusters is not known in advance. Is there any library that supports this task?
I am working in C++ with the OpenCV libraries.
http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_ml/py_kmeans/py_kmeans_opencv/py_kmeans_opencv.html#kmeans-opencv
This is a link for K-Means clustering in OpenCV for Python.
It shouldn't be too hard to convert this to C++ code once you understand the logic.
In the Gesture Recognition Toolkit (GRT) there is a simple module for hierarchical clustering. This is the "bottom up" approach you need, where each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy.
You can train the method with any of the following data types:
UnlabelledData: The only thing you really need to know about the UnlabelledData class is that you must set the number of input dimensions of your dataset before you try and add samples to the training dataset.
ClassificationData:
You must set the number of input dimensions of your dataset before you try and add samples to the training dataset,
You cannot use the class label 0 when you add a new sample to your dataset, because the class label 0 is reserved for the special null gesture class.
MatrixDouble: MatrixDouble is the default datatype for storing M by N dimensional data, where M is the number of rows and N is the number of columns.
Furthermore, you can save your models to a file or load them from one, and get the clusters with getClusters().
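The bottom-up procedure described above can be sketched in a few lines. This is neither GRT's nor OpenCV's API, and it is written in Python for brevity although the question asks about C++. It assumes single linkage (cluster distance is the distance between the closest pair of points) and a hypothetical max_distance threshold that stops the merging, which is one way to avoid fixing the number of clusters in advance.

```python
import math


def agglomerative_clusters(points, max_distance):
    """Single-linkage bottom-up clustering with a distance cutoff."""
    # Every point starts in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None  # (distance, index_i, index_j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Stop once even the closest pair is farther apart than the cutoff.
        if d > max_distance:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical coordinates: two well-separated groups.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.4), (10.0, 10.0), (10.3, 9.8)]
clusters = agglomerative_clusters(points, max_distance=1.0)
print(clusters)
```

This naive version recomputes all pairwise distances on every merge, so it is only suitable for small point sets.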

How to tune a training schema for a different data set in Caffe?

I am currently following the Caffe ImageNet example, but applying it to my own training dataset. My dataset has about 2000 classes, with about 10 to 50 images per class. I am classifying vehicle images, and the images were cropped to the front of the vehicle, so the images within each class have the same size and (almost) the same view angle.
I tried the ImageNet schema, but it did not work well: after about 3000 iterations the accuracy dropped to 0. Is there a practical guide on how to tune the schema?
You can delete the last layer of the ImageNet model, add your own last layer under a different name (to fit the number of classes), give it a higher learning rate, and lower the overall learning rate. There is an official example here: http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html
However, if the accuracy was 0, you should check the model parameters first; perhaps there was an overflow.
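One way to run that check is to look for NaN or infinite values in the weights. The sketch below is plain Python and assumes the parameters have already been exported into a dictionary of flat lists keyed by layer name; how to extract them from Caffe is not shown, and the layer names and values are made up.

```python
import math


def non_finite_parameters(params):
    """Return {layer_name: count} for layers containing NaN or infinite values."""
    bad = {}
    for name, values in params.items():
        count = sum(1 for v in values if not math.isfinite(v))
        if count:
            bad[name] = count
    return bad


# Hypothetical exported weights, for illustration only.
params = {
    "conv1": [0.01, -0.2, 0.3],
    "fc8_vehicles": [float("nan"), 1.5, float("inf")],
}

report = non_finite_parameters(params)
print(report)
```

An empty result does not prove the training is healthy, but a non-empty one suggests the weights diverged, for example because the learning rate was too high.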