Why Classification model in weka predicting all instances as one class? - weka

I have built a classification model using weka.I have two classes namely {spam,non-spam} After applying stringtowordvector filter, I get 10000 attributes for 19000 records. Then I am using liblinear library to build model which gives me F-score as follows:
Spam-94%
non-spam-98%
When I use same model to predict new instances, it predict all of them as spam.
Also, when I try to use test set same as training set, It predict all of them as spam too. I am mentally exhausted to find the problem.Any help will be appreciated.

I get it also wrong every so often. Then I watch this video to remind myself how it's done: https://www.youtube.com/watch?v=Tggs3Bd3ojQ where Prof Witten, one of the Weka Developers/Architects shows how to use the FilteredClassifier (which in turn is configured to load the StringToWordVector Filter) on the training-dataset and the test-set correctly.
This is shown for weka 3.6, weka 3.7. might be slightly different.
What does ZeroR give you? If it's close to 100%, you know that any classification algorithm should be not too far off either.
Why do you optimize for F-Measure? Just asking. I have never used this and don't know much about it. (I would optimize for the "Precision" metric assuming you have much more Spam than Nonspam).

Related

Text Detection with YOLO on Challenging Images

I have images that look as follows:
My goal is to detect and recognize the number 31197394. I have already fine-tuned a deep neural network on text recognition. It can successfully identify the correct number, if it is provided it in the following format:
The only task that remains is the detection of the corresponding bounding box. For this purpose, I have tried darknet. Unfortunately, it's not recognizing anything. Does anyone have an idea of a network that performs better on these kind of images? I know, that amazon recognition is able to solve this task. But I need a solution that works offline. So my hopes are still high that there exist pre-trained networks that work. Thank's a lot for your help!
Don't say darknet doesn't work. It depends on how you labeled your dataset. It is true that the numbers you want to recognize are too small so if you don't make any changes to the image during the pre-processing phase, it would be complicated for a neural network to recognize them well. So what you can do that will surely work is:
1---> Before labeling, increase the size of all images by 2 times its current size (like 1000*1000)
2---> used this size (1000 * 1000) for the darket trainer instead of the default size proposed by darknet which is 416 * 416. You would then have to change the configuration file
3---> use the latest darknet version (yolo v4)
4---> On the configuration file, always keep a number of subdivisions at 1.
I also specify that this method is too greedy in memory, it is therefore necessary to provide a machine with RAM > 16 GB. The advantage is that it works...
Thanks for your answers guys! You were right, I had to finetune yolo to make it work. So I created a dataset and fine-tuned yolov5. I am surprised how good the results are. Despite only having about 300 images in total, I get an accuracy of 97% to predict the correct number. This is mainly due to strong augmentations. And indeed the memory requirements are large, but I could train on a 32 GM RAM machine. I can really encourage anyone who faces similar problems to give yolo a shot!!
Maybe use an R-CNN to identify the region where the number is and then pass that region to your fine-tuned neural network for the digit classification

Classification with machine learning and a small database

I want to create a valve detection and classification like this video : https://www.youtube.com/watch?v=VY92fqmSdfA
to detect the positions Open and close and intermediate of the valve.
I have done some research and I have found some methods to resolve this problem, but i have some conditions to respect to resolve this problem :
Condition 1 : Use machine learning in the application, I can't use simple methods like Template matching,...
Condition 2 : Use a small database (Minimum 10 images by classe, maximum 40 images by classe)
Condition 3 : detect the position of the valve if the camera position changes, so I can't use only colors to detect the valve handle.
I want to use HOG (Histogram oriented gradient) + SVM/ANN but HOG needs a lot of images to train SVM/ANN.
I dont know if I can resolve this problem respecting this conditions?
As we know, the most important thing that ML approaches need to work properly is data. So, I'd say your 1st and 2nd conditions are conflicting with each other. In addition, your 3rd condition is adding more complexity in the problem. You can solve it including more data from different angles and illumination conditions. But again, it's conflicting with condition 2.
Even so, if you'd like to follow the ML path, I'd recommend you to use a pre-trained model, a strong data augmentation and, maybe, an ensemble of models to help increase the detection. As the problem is not that hard, it should work.

Is tf.py_func allowed at online prediction time?

Is tf.py_func allowed at online prediction time?
If yes any examples of how to use it?
Does the answer change if I need to install additional pip packages?
My use-case: I work with text, I need to do word stemming (using porter stemmer), I know how to do it using python, tensorflow doesn't have Ops for that. I would like to use the same text processing at training and prediction time - thus I would like to encode it all into a tensorflow graph.
https://www.tensorflow.org/api_docs/python/tf/py_func comes with known limitations and I would like to know if it will work during training and online prediction before I invest more time into it.
Thanks
Unfortunately, no. Py_func can not be restored from a saved model. However, since your use case involves pre-processing, just invoke the py_func explicitly in all three (train, eval, serving) input functions. This won't work if the py_func is in the middle of your graph, but for stemming, it should work just fine.

Load existing Model in Weka Knowledge Flow

I am trying to plot multiple ROC curves in the same diagram in Weka. I have learnt that I can do this in Weka Knowledge Flow using "Model Performance Chart". However, I can't figure out how to do this for existing models.
I have tried using ArffLoader and TestSetMaker to generate the testing data, and connected this to a suitable Classifier icon (eg AdaBoostM1 when this is the kind of model I am trying to load). In the configurations of the Classifier icon I choose "load model" and in the Status bar it says "Loaded model.". However, when I run this it says "ERROR: no trained/loaded classifier to use for prediction".
Can anyone tell me what I am doing wrong here? Thanks in advance!
There is a post that was published here that indicates some ambiguity in the meaning of the error. It also continues to state that the order of attributes and the number and order of values is also rather important.
It also states that 'for performance results to be computed, your Knowledge Flow process will need a "ClassifierPerformanceEvaluator" component after the classifier and before a TextViewer component.'
If you are new with the KnowledgeFlow environment, there is a great tutorial here from Rushdi Shams that details the general process.
Below is a sample workflow that has generated desirable results using AdaBoost (preloaded model):
Hope this Helps!

Weka - Measuring testing time

I'm using Weka 3.6.8 to carry out some machine learning and I'm want to find the 'time taken to test model on training/testing data'. When I test a predictive model on evaluation data, this parameter seems to be missing. Has this feature been removed from Weka or is it just a setting I'm missing? All I seem to be able to find is the time taken to build the actual predictive model. (I've also checked the Weka Manual but can't find anything)
Thanks in advance
That feature was added to 3.7.7, you need to upgrade. You should be able to get this data by running the test on the command line with the -T parameter.