Building the folds in WEKA

In k-fold cross-validation, the original sample is randomly partitioned into k subsamples.
How can I manually choose which instances go into each subsample?
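In case it helps, if you are working with the Weka Java API rather than the GUI, one way to choose the subsamples by hand is to build the train/test Instances for each fold yourself instead of calling trainCV()/testCV(). A minimal sketch, where the file name and the index list are hypothetical placeholders:

```java
import java.util.Arrays;
import java.util.List;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ManualFolds {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name -- replace with your own ARFF/CSV.
        Instances data = DataSource.read("my_data.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Hand-picked instance indices for one fold's test set (hypothetical).
        List<Integer> testIdx = Arrays.asList(0, 5, 9, 12);

        // Empty copies that keep the header (attributes) of the original data.
        Instances train = new Instances(data, 0);
        Instances test  = new Instances(data, 0);

        for (int i = 0; i < data.numInstances(); i++) {
            if (testIdx.contains(i)) {
                test.add(data.instance(i));   // this instance goes into the test fold
            } else {
                train.add(data.instance(i));  // everything else is training data
            }
        }
        // Repeat for each fold, then train and evaluate a classifier on each pair.
        System.out.println("train: " + train.numInstances() + ", test: " + test.numInstances());
    }
}
```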

Related

When to filter data during dimensionality reduction of image data?

I am extracting numerical data from biological images (phenotypic profiling of fluorescently labelled cells) to eventually be able to identify data clusters of cells that are phenotypically similar.
I record images on a microscope and extract data from an imaging plate that contains untreated cells (negative control), cells with a "strong" phenotype (positive control), as well as several treatments. The data is organized in 2D, with rows as cells and columns as the information extracted from these cells.
My workflow is roughly as follows:
Plate-wise normalization of data
Elimination of features (= columns) that carry redundant information, contain too many NAs, show little variance, or are not replicated between different experiments
PCA
tSNE for visualization
Cluster analysis
If I'm interested in only a subset of the data (the controls and, say, treatments 1 and 2 out of 10), when should I filter the data?
Currently, I filter before normalization, but I'm afraid that will affect the behaviour and results of PCA/tSNE. Should I do the entire analysis with all the data and filter only before the tSNE visualization?
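This does not settle which order is right, but a small plain-Java sketch (all numbers and names hypothetical) makes the concern concrete: if the plate-wise z-score is computed only on the filtered subset, the mean and standard deviation differ from those computed on the full plate, so the same cells feed different values into PCA/tSNE.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class FilterOrderSketch {
    // Plate-wise z-score: (x - mean) / sd, computed over whatever rows are passed in.
    static double[] zscore(double[] values) {
        double mean = Arrays.stream(values).average().orElse(0.0);
        double sd = Math.sqrt(Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean)).average().orElse(0.0));
        return Arrays.stream(values).map(v -> (v - mean) / sd).toArray();
    }

    public static void main(String[] args) {
        // One feature for 6 cells on a plate; the rows to keep are hypothetical.
        double[] plate = {1.0, 2.0, 3.0, 4.0, 50.0, 60.0};
        int[] keep = {0, 1, 2, 3};

        // Option A: normalize on the full plate, then filter.
        double[] normalizedAll = zscore(plate);
        double[] optionA = IntStream.of(keep).mapToDouble(i -> normalizedAll[i]).toArray();

        // Option B: filter first, then normalize only the kept rows.
        double[] subset = IntStream.of(keep).mapToDouble(i -> plate[i]).toArray();
        double[] optionB = zscore(subset);

        // The same cells end up with different normalized values -> different PCA/tSNE input.
        System.out.println("normalize then filter: " + Arrays.toString(optionA));
        System.out.println("filter then normalize: " + Arrays.toString(optionB));
    }
}
```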

Stratified sampling in WEKA

How can I split a data set into a training and a test set of 75% and 25% of the original data, respectively, using stratified sampling so that the proportional class sizes are preserved in both new sets? I am trying to do this with WEKA.
The "RemovePercentage" filter does not do the split in a stratified manner, and the "StratifiedRemoveFolds" filter does not work with percentages.
I would appreciate any help or suggestion.
So, as a workaround, I split the data set into two using StratifiedRemoveFolds; in this case the number of folds was 2, yielding a 50%-50% split. Then I split one of those folds into two using the same method, yielding two 25% subsets of the original data set. Finally, I merged one of the 25% subsets with the leftover 50%, yielding the 75%-25% stratified split that was my target.
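As a possible alternative to the two-step workaround: as far as I know, the supervised Resample filter can produce a stratified sample of a given percentage directly when sampling without replacement, and the inverted selection (-V) with the same seed gives the complementary set. A hedged sketch using the Weka Java API, with the file name as a hypothetical placeholder:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class StratifiedSplit {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("my_data.arff");   // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // 75% training set: sample without replacement, keep class proportions (bias 0).
        Resample trainFilter = new Resample();
        trainFilter.setOptions(weka.core.Utils.splitOptions("-B 0.0 -S 1 -Z 75.0 -no-replacement"));
        trainFilter.setInputFormat(data);
        Instances train = Filter.useFilter(data, trainFilter);

        // 25% test set: same options and the SAME seed, plus -V to invert the selection.
        Resample testFilter = new Resample();
        testFilter.setOptions(weka.core.Utils.splitOptions("-B 0.0 -S 1 -Z 75.0 -no-replacement -V"));
        testFilter.setInputFormat(data);
        Instances test = Filter.useFilter(data, testFilter);

        System.out.println("train: " + train.numInstances() + ", test: " + test.numInstances());
    }
}
```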

How does Bag of Features work?

I'm not sure that this is the right forum for this question; I'm sorry if it isn't.
I'm quite new to the Bag of Features (BoF) model and I'm trying to implement it in order to represent an image as a vector (for a CBIR project).
From what I've understood, given a training set S of n images, and supposing that we want to represent an image through a vector of size k, these are the steps for implementing BoF:
1. For each image i, compute its set of keypoints and, from that, its set of descriptors D_i.
2. Put the descriptors from all the images together, so now we have the combined set D.
3. Run k-means (where k is defined above) on D, so now we have k clusters, and each descriptor vector belongs to exactly one cluster.
4. Define v_i as the resulting BoF vector (of size k) for image i. Each dimension is initialized to 0.
5. For each image i and for each descriptor d belonging to D_i, find out which of the k clusters d belongs to. Supposing that d belongs to the j-th cluster, then v_i[j]++.
What is not clear to me is how to implement step 5: how do we work out which cluster a descriptor belongs to, in particular when the image whose BoF vector we are computing is a query image (and so was not part of the initial dataset)? Should we find the nearest cluster (1-NN) in order to determine which cluster the query descriptor belongs to?
WHY I NEED THIS - THE APPLICATION:
I'm implementing the BoF model in order to build a CBIR system: given a query image q, find the image i most similar to q in a dataset of images. To do this, we need to solve the approximate nearest neighbor problem, for example using LSH. The problem is that in LSH each input image must be represented as a vector, so we need BoF in order to do that! I hope it's now clearer why I need it :)
Please also let me know if I made any mistakes in the procedure described above.
What your algorithm is doing is generating the equivalent of words for an image. The set of "words" is not meant to be a final result, but just something that is simple to use with other machine learning techniques.
In this setup, you generate a set of k clusters from the initial features (the keypoint descriptors from step 1).
Then you describe each image by the number of keypoints that fall into each cluster (just as a text is composed of words from a dictionary of length k).
Step 3 says that you take all the keypoint descriptors from the training set images and run the k-means algorithm to figure out a reasonable separation of the points. This basically establishes what the words are.
So for a new image you compute the keypoints as you did for the training set, and then, using the clusters you have already computed in training, you work out the feature vector for the new image. That is, you convert your image into words from the dictionary you've built.
This is all a way to generate a reasonable feature vector from images (a partial result, if you want). It is not a complete machine learning algorithm. To complete it you need to know what you want to do. If you just want to find the most similar image(s), then yes, a nearest neighbor search over these vectors will do that. If you want to label images, then you need to train a classifier (such as naive Bayes) on the feature vectors and use it to predict the label for the query.
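A small plain-Java sketch of the two steps discussed here, under the assumption that the k-means centroids are already available as a k x d array (all names are hypothetical): each descriptor is assigned to its nearest centroid to build the k-bin histogram, and the most similar database image is the one whose histogram is closest to the query's (a linear scan standing in for LSH).

```java
public class BofSketch {
    // Index of the centroid closest to a descriptor (1-NN over the k cluster centres).
    static int nearestCentroid(double[] descriptor, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < centroids.length; j++) {
            double d = squaredDistance(descriptor, centroids[j]);
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best;
    }

    // Build the k-dimensional BoF histogram of one image from its descriptors.
    static double[] bofVector(double[][] descriptors, double[][] centroids) {
        double[] hist = new double[centroids.length];
        for (double[] d : descriptors) {
            hist[nearestCentroid(d, centroids)]++;
        }
        return hist;
    }

    // Linear-scan retrieval: index of the database histogram closest to the query histogram.
    static int mostSimilarImage(double[] queryHist, double[][] databaseHists) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < databaseHists.length; i++) {
            double d = squaredDistance(queryHist, databaseHists[i]);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

In practice the histograms are usually normalized (for example by the total number of descriptors in the image) before comparison, so that images with different numbers of keypoints remain comparable.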

OpenCV machine learning library for agglomerative hierarchical clustering

I want to cluster some (x, y) coordinates based on distance, using agglomerative hierarchical clustering, because the number of clusters is not known beforehand. Is there any library that supports this task?
I am working in C++ using the OpenCV libraries.
http://opencv-python-tutroals.readthedocs.org/en/latest/py_tutorials/py_ml/py_kmeans/py_kmeans_opencv/py_kmeans_opencv.html#kmeans-opencv
This is a link to k-means clustering in OpenCV for Python.
It shouldn't be too hard to convert this to C++ code once you understand the logic.
In the Gesture Recognition Toolkit (GRT) there is a simple module for hierarchical clustering. It is a "bottom up" approach, as you need: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
You can train the method with:
UnlabelledData: The only thing you really need to know about the UnlabelledData class is that you must set the number of input dimensions of your dataset before you try to add samples to the training dataset.
ClassificationData:
You must set the number of input dimensions of your dataset before you try to add samples to the training dataset.
You cannot use the class label 0 when you add a new sample to your dataset, because the class label 0 is reserved for the special null gesture class.
MatrixDouble: MatrixDouble is the default datatype for storing M by N dimensional data, where M is the number of rows and N is the number of columns.
Furthermore, you can save your models to a file or load them from one, and retrieve the clusters via getClusters().
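This is not GRT code, but a plain-Java sketch of the bottom-up merging the answer describes, assuming (x, y) points, single linkage, and a hypothetical distance threshold as the stopping rule (the threshold stands in for not knowing the number of clusters in advance); a library such as GRT does the same thing far more efficiently.

```java
import java.util.ArrayList;
import java.util.List;

public class AgglomerativeSketch {
    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}};
        double threshold = 2.0;                      // hypothetical merge threshold

        // Start with every point in its own cluster (each cluster is a list of point indices).
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            List<Integer> c = new ArrayList<>();
            c.add(i);
            clusters.add(c);
        }

        // Repeatedly merge the two closest clusters (single linkage) until
        // the closest pair is farther apart than the threshold.
        while (clusters.size() > 1) {
            int bestA = -1, bestB = -1;
            double bestDist = Double.MAX_VALUE;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = linkage(clusters.get(a), clusters.get(b), points);
                    if (d < bestDist) { bestDist = d; bestA = a; bestB = b; }
                }
            }
            if (bestDist > threshold) break;         // stop: remaining clusters are well separated
            clusters.get(bestA).addAll(clusters.get(bestB));
            clusters.remove(bestB);
        }
        System.out.println("clusters (point indices): " + clusters);
    }

    // Single linkage: distance between the two closest members of the two clusters.
    static double linkage(List<Integer> a, List<Integer> b, double[][] pts) {
        double min = Double.MAX_VALUE;
        for (int i : a) {
            for (int j : b) {
                double dx = pts[i][0] - pts[j][0];
                double dy = pts[i][1] - pts[j][1];
                min = Math.min(min, Math.sqrt(dx * dx + dy * dy));
            }
        }
        return min;
    }
}
```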

Classifying a weighted feature vector

I want to give weights to the features of a data set before using them in a classification algorithm such as KNN or J48, but I don't know how to evaluate a weighted feature vector.
Do any classification algorithms accept weights as input, instead of just 0 and 1?
In particular, are any of Weka's ready-made classifiers capable of working with weights (rather than the 0/1 selection you get from filters)?
In most situations, you can just scale the data set according to your weights. This is trivial to prove for Minkowski distances such as the Euclidean distance.
Not all of Weka's classification algorithms support weights, but some do.
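A small plain-Java sketch of the scaling trick (feature values and weights are hypothetical): for the Euclidean distance, multiplying feature j by sqrt(w_j) yields exactly the weighted distance, so any distance-based learner such as KNN sees the weights without needing explicit support for them.

```java
public class FeatureWeightScaling {
    // Weighted squared Euclidean distance: sum_j w_j * (x_j - y_j)^2.
    static double weightedSqDist(double[] x, double[] y, double[] w) {
        double sum = 0;
        for (int j = 0; j < x.length; j++) {
            double diff = x[j] - y[j];
            sum += w[j] * diff * diff;
        }
        return sum;
    }

    // Scale each feature by sqrt(w_j); plain Euclidean distance on the result is then weighted.
    static double[] scale(double[] x, double[] w) {
        double[] out = new double[x.length];
        for (int j = 0; j < x.length; j++) {
            out[j] = Math.sqrt(w[j]) * x[j];
        }
        return out;
    }

    static double sqDist(double[] x, double[] y) {
        double sum = 0;
        for (int j = 0; j < x.length; j++) {
            double diff = x[j] - y[j];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {2.0, 0.0, 3.0};
        double[] w = {0.5, 2.0, 1.0};   // hypothetical feature weights
        System.out.println(weightedSqDist(x, y, w));          // 8.5
        System.out.println(sqDist(scale(x, w), scale(y, w))); // 8.5 -- identical
    }
}
```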
You need to set the weight information after loading your dataset; see the example code in the Weka wiki. I remember that Weka's J48 decision tree supports weights in the developer version, but I cannot find a reference. There is a patch, though.
This search for feature weights in the Weka wiki may help.
I suggest trying to add weights to your data set and training on your data.
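A hedged sketch of the instance-weight route mentioned above, using the Weka Java API (the file name and weight values are hypothetical): each Instance carries a weight via setWeight(), and classifiers that honour instance weights, J48 among them, use it during training. Note this weights whole instances, not individual features; feature weights would instead be applied by scaling the attribute values as described in the other answer.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WeightedTraining {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("my_data.arff");   // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // Example: give every instance of class index 1 twice the weight of the others.
        for (int i = 0; i < data.numInstances(); i++) {
            if (data.instance(i).classValue() == 1.0) {
                data.instance(i).setWeight(2.0);
            }
        }

        // J48 implements WeightedInstancesHandler, so it respects instance weights.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}
```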