Can we do analytics on masked data using the WEKA tool? - weka

I just want to ask whether it is possible to do analytics on masked data using the WEKA tool. In this case, not all data fields will be masked, only a few of them.
Thank you in advance!

If by masking you mean controlling whether an attribute is used by an algorithm for learning, then the answer is yes (in a sense).
However, rather than flagging whether attributes should be included, in Weka you'd simply remove them.
For example, for classification you would use the FilteredClassifier meta-classifier in conjunction with the actual base classifier that you want to use and the Remove filter.
If you need additional filters to be applied (e.g., Normalize), then simply define a filter pipeline using MultiFilter as the filter in your FilteredClassifier setup. With that approach you can work with your original data, and the base classifier only sees the data your filter pipeline outputs.

Related

How to manually create a decision tree in Weka

I would like to create my own decision tree model in Weka. In other words, I would like to manually specify all the splits and all the split values in the decision tree, without training any of the decision tree algorithms (e.g. REPTree, J48, etc.) on data. Is this possible in the Weka GUI or through Weka's Java API? If so, how?
Not that I'm aware of. You will need to create your own classifier, with the relevant options that define the splits that you want to use (or point to a file that contains that information for you to parse).
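To illustrate what such a hand-specified classifier boils down to, here is a minimal, language-neutral sketch in Python. It is not Weka code: in Weka the same logic would have to live inside your own classifier class. The attribute names, thresholds, and class labels below are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class predicted when this leaf is reached


@dataclass
class Split:
    attribute: str   # name of the numeric attribute to test
    threshold: float  # hand-chosen split value (hypothetical)
    left: "Node"     # branch taken when value <= threshold
    right: "Node"    # branch taken when value > threshold


Node = Union[Leaf, Split]


def classify(node, instance):
    """Walk the manually defined tree until a leaf is reached."""
    while isinstance(node, Split):
        if instance[node.attribute] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# Every split is written by hand; nothing is learned from data.
TREE = Split(
    "petal_length", 2.5,
    Leaf("setosa"),
    Split("petal_width", 1.7, Leaf("versicolor"), Leaf("virginica")),
)

if __name__ == "__main__":
    print(classify(TREE, {"petal_length": 1.4, "petal_width": 0.2}))
```

The tree definition could just as well be parsed from a file, which matches the "point to a file" option mentioned above.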

How to output the final tableau of simplex method in docplex?

Is there a way to output the final tableau in Python with the docplex library? If not, is there a workaround?
I want to use the dual simplex method to solve a linear programming problem with newly added constraints. So, I would need to access the final tableau to decide which variable should exit the basis, without having to re-solve the problem from scratch.
This sort of low level interaction cannot be done at the docplex level. In order to do this you can use Model.get_cplex() to get a reference to the underlying engine object. With this you can then get additional information. You can find the reference documentation for this class here. You probably want to look at the solution, solution.basis, solution.advanced properties. This should give you all the information you need.
Note that the engine works with an index-oriented model in which every variable or constraint is just a number. You can map an index back to the corresponding docplex variable object by using Model.get_var_by_index().
I also wonder whether you may want to drop docplex and instead use the CPLEX Python API directly. You can find documentation of this here.

Why is a classification model in Weka predicting all instances as one class?

I have built a classification model using Weka. I have two classes, namely {spam, non-spam}. After applying the StringToWordVector filter, I get 10000 attributes for 19000 records. Then I use the LibLINEAR library to build a model, which gives me the following F-scores:
Spam-94%
non-spam-98%
When I use the same model to predict new instances, it predicts all of them as spam.
Also, when I use the training set itself as the test set, it predicts all of them as spam too. I am mentally exhausted from trying to find the problem. Any help will be appreciated.
I also get it wrong every so often. Then I watch this video to remind myself how it's done: https://www.youtube.com/watch?v=Tggs3Bd3ojQ where Prof Witten, one of the Weka developers/architects, shows how to use the FilteredClassifier (which in turn is configured to load the StringToWordVector filter) on the training dataset and the test set correctly.
This is shown for Weka 3.6; Weka 3.7 might be slightly different.
What does ZeroR give you? If it's close to 100%, you know that any classification algorithm should not be too far off either.
Why do you optimize for F-Measure? Just asking. I have never used this and don't know much about it. (I would optimize for the "Precision" metric, assuming you have much more Spam than Nonspam.)
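For reference, the ZeroR baseline mentioned above simply predicts the most frequent class for every instance, so its accuracy equals the share of the majority class. A minimal sketch of that calculation in plain Python (not Weka code; the label counts are made up for illustration):

```python
from collections import Counter


def zero_r_accuracy(labels):
    """Return (majority_label, accuracy) for always predicting the majority class."""
    if not labels:
        raise ValueError("labels must not be empty")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical class distribution, for illustration only.
labels = ["spam"] * 3 + ["non-spam"] * 7

if __name__ == "__main__":
    majority, accuracy = zero_r_accuracy(labels)
    print(f"ZeroR always predicts {majority!r}, accuracy {accuracy:.0%}")
```

If a trained model predicts a single class for everything, comparing its accuracy against this number shows whether it is doing anything beyond the baseline.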

Load existing Model in Weka Knowledge Flow

I am trying to plot multiple ROC curves in the same diagram in Weka. I have learnt that I can do this in Weka Knowledge Flow using "Model Performance Chart". However, I can't figure out how to do this for existing models.
I have tried using ArffLoader and TestSetMaker to generate the testing data, and connected this to a suitable Classifier icon (e.g. AdaBoostM1 when this is the kind of model I am trying to load). In the configuration of the Classifier icon I choose "load model" and the status bar says "Loaded model.". However, when I run this it says "ERROR: no trained/loaded classifier to use for prediction".
Can anyone tell me what I am doing wrong here? Thanks in advance!
There is a post that was published here that indicates some ambiguity in the meaning of the error. It goes on to state that the order of attributes and the number and order of values are also rather important.
It also states that 'for performance results to be computed, your Knowledge Flow process will need a "ClassifierPerformanceEvaluator" component after the classifier and before a TextViewer component.'
If you are new to the KnowledgeFlow environment, there is a great tutorial here from Rushdi Shams that details the general process.
Below is a sample workflow that has generated desirable results using AdaBoost (preloaded model):
Hope this Helps!

Best approach for doing full-text search with list-of-integers documents

I'm working on a C++/Qt image retrieval system based on similarity that works as follows (I'll try to avoid irrelevant or off-topic details):
I take a collection of images and build an index from them using OpenCV functions. After that, for each image, I get a list of integer values representing important "classes" that each image belongs to. The more integers two images have in common, the more similar they are believed to be.
So, when I want to query the system, I just have to compute the list of integers representing the query image, perform a full-text search (or similar) and retrieve the X most similar images.
My question is, what's the best approach to perform such a search?
I've heard about Lucene, Lemur and other indexing methods, but I don't know if this kind of full-text search is the best way, given that the domain is restricted (only integers instead of words).
I'd like to know about the alternatives in terms of efficiency, accuracy or C++ friendliness.
Thanks!
It sounds to me like you have a vectorspace model, so Lucene or a similar product may work well for you. In general, an inverted-index model will be good if:
1. You don't know the number of classes in advance
2. There are a lot of classes relative to the number of images
If your problem doesn't fit these criteria, a normal relational DB might work better, as Thomas suggested. If it meets #1 but not #2, you could investigate one of the "column oriented" non-relational databases. I'm not familiar enough with these to tell you how well they would work, but my intuition is that you'll need to replicate a lot of the functionality in an IR toolkit yourself.
Lucene is written in Java and I don't know of any C++ ports. Solr exposes Lucene as a web service, so it's easy enough to access it that way from whatever language you choose.
I don't know much about Lemur, but it looks like it has a similar vectorspace model, and it's written in C++, so that might be easier for you to use.
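The inverted-index idea itself can be sketched without any toolkit. The following is a minimal Python illustration, not Lucene or Lemur code: each class integer maps to the set of images containing it, and a query is scored by the number of integers it shares with each image, as described in the question. The image names and class lists are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map each class integer to the set of image ids that contain it."""
    index = defaultdict(set)
    for image_id, classes in docs.items():
        for c in set(classes):
            index[c].add(image_id)
    return index


def query(index, classes, top_k=3):
    """Return up to top_k (image_id, shared_count) pairs, best match first."""
    scores = Counter()
    for c in set(classes):
        # Only images sharing at least one class are ever touched.
        for image_id in index.get(c, ()):
            scores[image_id] += 1
    # Sort by shared count (descending), then by id for a deterministic order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical collection: image id -> list of class integers.
IMAGES = {
    "a.png": [3, 17, 42, 99],
    "b.png": [3, 42, 100],
    "c.png": [7, 8, 9],
}
INDEX = build_index(IMAGES)

if __name__ == "__main__":
    print(query(INDEX, [3, 42, 99, 7]))
```

This scores by raw overlap only; an IR toolkit would typically add term weighting (such as TF-IDF) and on-disk index structures on top of the same idea.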
You can take a look at Lucene for image retrieval (LIRE) here: http://www.semanticmetadata.net/2006/05/19/lire-lucene-image-retrieval-04-released/
If I'm not mistaken, you are trying to implement typical bag-of-words image retrieval, am I correct? If so, you are probably trying to build an inverted file index. Lucene on its own is not suitable, as you have probably already realized, because it indexes text instead of numbers. Using its classes for querying the index would also be a problem, as it is not designed to "parse" an image (i.e. detect keypoints, extract descriptors, then vector-quantize them) into the query vector.
LIRE, on the other hand, has been modified to index feature vectors. However, it does not appear to work out of the box for the bag-of-words model. Also, I think I've read on the author's website that it currently uses brute-force matching rather than an inverted file index to retrieve the images, but I would expect it to be easier to extend than Lucene itself for your purposes.
Hope this helps.