ELKI tool - outlier detection results for ABOD - data-mining

I am trying to run ELKI for outlier detection using the ABOD method. I see the various visualizations as a result, but not the outlier scores or rankings. What should I do to get, say, the top 10 outliers using ELKI?

The ELKI ResultWriter will write the outliers to a file in decreasing outlierness (if the method is implemented with the appropriate metadata to allow correct sorting).
As for ABOD, please note that the implementation you are using (ELKI 0.6.0~beta2 and before) is actually FastABOD, unless you set abod.k to the data set size. The 0.6.0 release will have separate classes for ABOD, FastABOD and LB-ABOD. But since ABOD scales as O(n^3), it will only be usable for small data sets!
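To illustrate where the O(n^3) cost comes from, here is a minimal NumPy sketch of a simplified angle-based outlier score. It is not ELKI's implementation: it uses the plain (unweighted) variance of the distance-scaled angle term, and the function name `abod_scores` and the toy data are made up for this example. Lower scores mean more outlying points, so the "top 10" are the ten smallest values.

```python
from itertools import combinations

import numpy as np


def abod_scores(points):
    """Simplified angle-based outlier score (assumption: unweighted variance).

    For each point A, take the variance over all pairs (B, C) of
    <AB, AC> / (|AB|^2 * |AC|^2). Lower score = more outlying.
    Assumes there are no duplicate points (they would divide by zero).
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    scores = np.empty(n)
    for a in range(n):                          # n reference points ...
        others = [i for i in range(n) if i != a]
        values = []
        for b, c in combinations(others, 2):    # ... times O(n^2) pairs
            ab = points[b] - points[a]
            ac = points[c] - points[a]
            values.append(ab.dot(ac) / (ab.dot(ab) * ac.dot(ac)))
        scores[a] = np.var(values)
    return scores


# Toy data: a 5x5 grid plus one far-away point at index 25.
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
data = np.array(grid + [(20.0, 20.0)])

scores = abod_scores(data)
top10 = np.argsort(scores)[:10]   # ascending: smallest variance first
print(top10)
```

The nested loop over all pairs for every point is exactly what makes full ABOD impractical beyond small data sets; FastABOD restricts the pairs to the k nearest neighbours.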

You can see the result if you run the algorithm from the command line like this:
java -cp <<path>> -algorithm outlier.ABOD -dbc.in data.txt -out myresults/ABOD
where <<path>> is the location of elki.jar (elki.jar/elki.jar) followed by the main class de.lmu.ifi.dbs.elki.application.KDDCLIApplication.

Related

Weka: can I train a model to minimize or maximize an input value?

Is it possible in Weka to train a model minimizing a cost factor?
I have a data set containing a cost factor in each sample. It defines what using this sample would cost. Now, I would like to select as many of the samples as possible while minimizing this cost factor.
E.g. with a multilayer perceptron, I want to train the neurons in such a way that it chooses as many samples as possible while minimizing the sum of the cost factor.
I've checked all the model options and also searched the package manager for something like that, but I was unable to find anything. Could someone tell me whether this can be done using Weka?
What you are describing sounds more like an optimization problem rather than a classification or regression problem (for which you would use a Weka classifier).
Weka does have some limited support for optimization through its abstract weka.core.Optimization class (e.g., used internally by weka.classifiers.functions.Logistic). But that requires implementing some methods.
To cast your net wider, you might want to take a look at the following article that describes various optimization techniques:
https://machinelearningmastery.com/tour-of-optimization-algorithms/

How to use uncertainties to weight residuals in a Savitzky-Golay filter.

Is there a way to incorporate the uncertainties on my data set into the result of the Savitzky-Golay fit? Since I am not passing this information into the function, I assume that it is simply calculating the 'best fit' via an unweighted least-squares process. I am currently working with data that has non-uniform uncertainty, and so the fit of the data could be improved by including the errors that I have for my main dataset.
The Wikipedia page for the Savitzky-Golay filter suggests how I might go about altering the process of calculating the coefficients of the fit, and I am staring at the code for scipy.signal.savgol_filter, but I cannot get my head around what I need to adjust so that this will do what I want it to.
Are there any ready-made weighted SG filters floating about? I find it hard to believe that no-one else has ever needed this tool in Python, but maybe I have missed something.
Check out this Python module: https://github.com/surhudm/savitzky_golay_with_errors
This python script improves upon the traditional Savitzky-Golay filter
by accounting for errors or covariance in the data. The inputs and
arguments are all modelled after scipy.signal.savgol_filter
The MATLAB function sgolayfilt also supports weights; check its documentation.
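For reference, the underlying idea (a local polynomial fit where each point is weighted by its uncertainty) can be sketched with NumPy alone. This is a rough illustration, not the module linked above and not a drop-in replacement for scipy.signal.savgol_filter: the function name `weighted_savgol` is made up, it refits every window instead of precomputing coefficients, and it simply truncates the window at the edges. It relies on the `w` argument of `np.polyfit`, for which 1/sigma is the appropriate weight for Gaussian uncertainties.

```python
import numpy as np


def weighted_savgol(y, sigma, window_length, polyorder):
    """Smooth y with a sliding polynomial fit weighted by 1/sigma.

    Hypothetical helper: windows are truncated at the edges, so
    window_length // 2 + 1 must be larger than polyorder.
    """
    y = np.asarray(y, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    half = window_length // 2
    smoothed = np.empty_like(y)
    for i in range(len(y)):
        lo = max(0, i - half)
        hi = min(len(y), i + half + 1)
        x = np.arange(lo, hi) - i               # window centred on x = 0
        coeffs = np.polyfit(x, y[lo:hi], polyorder, w=1.0 / sigma[lo:hi])
        smoothed[i] = coeffs[-1]                # value of the fit at x = 0
    return smoothed


# A straight line with one bad sample that carries a huge uncertainty.
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
y[10] += 100.0
sigma = np.ones_like(y)
sigma[10] = 1e6

smooth = weighted_savgol(y, sigma, window_length=7, polyorder=2)
print(smooth[10])
```

Because the bad sample has almost no weight, the smoothed curve follows the underlying line through it, which an unweighted filter would not do.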

SegNet results of train set (test via test_segmentation.py)

I ran SegNet on my own dataset (following the SegNet tutorial). I see great results via test_segmentation.py.
My problem is that I want to see the real net results and not test_segmentation's own colorisation (by class).
For example, if I have trained a net with 2 classes, then after training I want to see not just 2 colors (one per class) but the real net output ([0.22, 0.19, 0.3, ...]), lighter and darker as the net sees it.
I hope that I explained myself well. Thanks for helping.
You could use a python script to achieve what you want. Take a look at this script.
The command out = out['argmax'] extracts the raw output, so you can get a segmentation map with 'lighter and darker' values as you wanted.
When you say the 'real' net color segmentation I will assume that you mean the probability maps. Effectively the last layer will have one map for every class; and if you check the function predict in inference.py, they take the argmax; that is the channel (which represents the class) with the highest probability. If you want to get these maps, you just have to get the data without computing the argmax; something like:
predicted = net.blobs['prob'].data
I solved it. The solution is to fix the range of cmin and cmax from 0 to 1 in the scipy saving method, for example: scipy.misc.toimage(output, cmin=0.0, cmax=1.0).save('/path/.../image.png')
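The difference between the argmax labels and the raw probability maps, and the effect of a fixed 0 to 1 scaling range, can be shown with NumPy alone. This is only a sketch: the `prob` array is a hand-made stand-in for net.blobs['prob'].data (assumed shape: batch, classes, height, width), and `to_uint8` is a hypothetical helper that mimics a fixed cmin/cmax scaling rather than any SciPy function.

```python
import numpy as np

# Hypothetical stand-in for net.blobs['prob'].data with 2 classes, 2x2 pixels.
prob = np.array([[[[1.0, 0.6], [0.2, 0.0]],    # class 0
                  [[0.0, 0.4], [0.8, 1.0]]]])  # class 1

labels = prob[0].argmax(axis=0)   # hard decision: one class index per pixel
foreground = prob[0, 1]           # soft map: probability of class 1


def to_uint8(image, cmin=0.0, cmax=1.0):
    """Scale with a fixed range so that equal probabilities always map to
    equal grey levels, independent of the min/max of this particular image."""
    scaled = (np.clip(image, cmin, cmax) - cmin) / (cmax - cmin)
    return np.round(scaled * 255).astype(np.uint8)


gray = to_uint8(foreground)
print(labels)
print(gray)
```

Without a fixed range, a per-image min/max stretch would make a map that is uniformly around 0.5 look just as contrasty as a confident one.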

Caffe GoogleNet classification.cpp gives random outputs

I used the Caffe GoogleNet model to train on my own data (10k images, 2 classes). I stopped it at the 400000th iteration with an accuracy of ~80%.
If I run the below command:
./build/examples/cpp_classification/classification.bin
models/bvlc_googlenet/deploy.prototxt
models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
data/ilsvrc12/imagenet_mean.binaryproto
data/ilsvrc12/synset_words.txt
1.png
it gives me a different -- apparently random -- result each time (i.e. if I run it n times, then I get n different results). Why? Does my training fail? Does it still use the old data from the reference model?
I don't think it is a problem with the training. Even if the training data wasn't good, it should give the same (possibly wrong) output every time. If you are getting random results, it indicates that the weights are not being loaded properly.
When you load a .caffemodel against a .prototxt, Caffe will load the weights of all the layers in the prototxt whose names match the ones in the caffemodel. For the other layers, it will do a random initialisation (gaussian, xavier, etc., according to the specification in the prototxt).
So the best thing for you to do now is to check if the model was trained using the same prototxt you are using now.
I see that you are using GoogleNet prototxt and reference_caffenet caffemodel. Is this intentional?
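The name-matching rule described above can be pictured with a toy model. This is not Caffe code: `split_layers` and all layer names below are invented for illustration, and it only mimics the "copy weights where the names match" behaviour.

```python
def split_layers(prototxt_layers, caffemodel_layers):
    """Toy model of name matching: return (loaded, not_loaded) layer names."""
    available = set(caffemodel_layers)
    loaded = [name for name in prototxt_layers if name in available]
    not_loaded = [name for name in prototxt_layers if name not in available]
    return loaded, not_loaded


# Illustrative names only: a deploy prototxt from one architecture paired
# with weights saved from a different one.
deploy_layers = ["stem/conv", "block_a/conv", "block_b/conv", "classifier"]
weights_layers = ["conv1", "conv2", "fc6", "fc7", "fc8"]

loaded, not_loaded = split_layers(deploy_layers, weights_layers)
print("loaded from caffemodel:", loaded)
print("left to the initialiser:", not_loaded)
```

When no names overlap, as here, none of the trained weights end up in the network, which is consistent with the symptoms in the question.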
When you want to deploy the fine-tuned model, you should check two main things:
Inputs:
Channel order: does the input image use BGR channel order instead of RGB (e.g. as loaded by OpenCV)?
Mean file: is it the same mean file that was used for training?
Prototxt:
When fine-tuning the model, you will have changed some layer names in the original prototxt; check that the same layer names are used at deployment.
There are also some Fine-tune tricks and CS231n_transfer_learning, which are very useful for fine-tuning.
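The two input checks from the list above (channel order and mean subtraction) amount to a small amount of array manipulation. The sketch below uses NumPy only; `preprocess` is a hypothetical helper and the per-channel mean values are placeholders, not the values from any real mean file.

```python
import numpy as np


def preprocess(rgb_image, mean_bgr):
    """Convert an H x W x 3 RGB image into a mean-subtracted C x H x W BGR array."""
    bgr = rgb_image[:, :, ::-1].astype(np.float32)   # RGB -> BGR
    bgr -= np.asarray(mean_bgr, dtype=np.float32)    # same mean as in training
    return bgr.transpose(2, 0, 1)                    # H x W x C -> C x H x W


# Tiny 2x2 image where every pixel is (R=10, G=20, B=30).
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[:, :] = (10, 20, 30)

blob = preprocess(image, mean_bgr=(1.0, 2.0, 3.0))   # placeholder mean values
print(blob.shape, blob[:, 0, 0])
```

If training used one channel order or mean and deployment uses another, the network still runs but sees systematically shifted inputs, so accuracy drops without any error message.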

Weka: Classifier and ReplaceMissingValues

I am relatively new to the data mining area and have been experimenting with Weka.
I have a dataset which consists of almost 8000 records related to customers and items they have purchased. 58% of this data set has missing values for the "Gender" attribute.
I want to find the missing gender values based on the other data I do have.
I first thought I could do this using a classifier algorithm in Weka using a training set to build a model. Based on examples I saw online, I tried this with pretty much all the algorithms available in Weka using a training set that consisted of 60-80% of the data which did not have missing values. This gave me a lower accuracy rate than I wanted (80-86% depending on the algorithm used).
Did I go about this correctly? Is there a way to improve this accuracy? I experimented with using different attributes, different pre-processing of the data etc.
I also tried using the ReplaceMissingValues filter on the complete dataset to see how that would handle the missing values. However, it just changed all the missing values to "Female", which obviously cannot be the case. So I'm also wondering if I need to use this filter in my situation or not.
It sounds like you went about it in the correct way. As far as I know, the ReplaceMissingValues filter replaces the missing values with the most frequent of the non-missing values (the mode, for a nominal attribute such as gender), so it is not what you want in this case.
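A few lines of Python show why mode replacement turns every missing gender into the same value. This is a sketch of the idea only, not Weka's code; `replace_missing_with_mode` is a made-up helper and `None` stands in for a missing value.

```python
from collections import Counter


def replace_missing_with_mode(values, missing=None):
    """Replace every missing entry with the most frequent observed value."""
    observed = [v for v in values if v is not missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]


genders = ["Female", None, "Male", "Female", None, None]
filled = replace_missing_with_mode(genders)
print(filled)
```

Since the replacement ignores all other attributes, it cannot recover the actual gender of any individual customer.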
A better way to get an idea of the true accuracy of your gender predictor would be to use cross-validation instead of the training/test split (Weka has a separate option for that). 80-86% may seem low, but keep in mind that random guessing will only get you about 50%, so it's still a lot better than that. To try to get better performance, pick a classifier that performs well and then play with its parameters until you get better performance. This is likely to be quite labour-intensive (although you could of course use automated methods for tuning, see e.g. Auto-WEKA), but it is the only way to improve the performance.
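The cross-validation idea can be sketched in plain Python. The example below is an illustration under simple assumptions, not Weka's procedure: the folds are taken round-robin without shuffling or stratification, the "classifier" is just a majority-class guess, and the 60/40 label split is invented. It also shows a useful sanity check: on unbalanced labels, always guessing the majority class can score above 50%, so that is the number a real classifier has to beat.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k round-robin folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def majority_baseline_accuracy(labels, k=10):
    """Cross-validated accuracy of always predicting the majority class."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k):
        majority = Counter(labels[j] for j in train).most_common(1)[0][0]
        correct += sum(labels[j] == majority for j in test)
    return correct / len(labels)


labels = ["F"] * 60 + ["M"] * 40     # invented class balance
accuracy = majority_baseline_accuracy(labels, k=10)
print(accuracy)
```

Every sample is used for testing exactly once, which is why the cross-validated estimate is more stable than a single training/test split.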
You can also combine the algorithm you choose with a separate feature selection step (Weka has a special meta-classifier for this). This may improve performance, but again you'll have to experiment to find the particular configuration that works for you.