How to predict related videos to a video in Weka

I'm trying to predict related videos to a given video.
Is that possible in Weka?
The database has the following structure:
video_id, category, keywords, related_videos_ids
A single keywords field might have many values (for example: stackoverflow, predict, videos), and the same goes for related_videos_ids (a video can have more than one related video).
The related videos depend on category and keywords.
What's the way to do that in Weka? Any ideas?

There is a way to do it in Weka. You should look into clustering:
https://www.youtube.com/watch?v=zjYUYJ2b4r8
I would also suggest trying to get more features (more columns) for better results.
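As a rough sketch of how that could look from the Java API (the file name videos.arff and the cluster count are assumptions, and the string-valued keywords column would need a StringToWordVector pass first, since SimpleKMeans cannot cluster raw string attributes):

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RelatedVideos {
        public static void main(String[] args) throws Exception {
            // Hypothetical ARFF export of the video_id/category/keywords table.
            // String attributes must be converted (e.g. via StringToWordVector)
            // before SimpleKMeans will accept them.
            Instances data = DataSource.read("videos.arff");

            // Group videos by their attributes; videos that land in the
            // same cluster can be reported as "related".
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(10); // tune for your data
            km.buildClusterer(data);

            for (int i = 0; i < data.numInstances(); i++) {
                int cluster = km.clusterInstance(data.instance(i));
                System.out.println("video " + i + " -> cluster " + cluster);
            }
        }
    }

Videos sharing a cluster with the query video would then be your candidate related videos, ranked by distance to the query instance if you need an ordering.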

Related

Google Inceptionism: obtain images by class

In the famous Google Inceptionism article,
http://googleresearch.blogspot.jp/2015/06/inceptionism-going-deeper-into-neural.html
they show images obtained for each class, such as banana or ant. I want to do the same for other datasets.
The article does describe how the images were obtained, but I feel that the explanation is insufficient.
There's related code:
https://github.com/google/deepdream/blob/master/dream.ipynb
but what it does is produce a random dreamy image, rather than specifying a class and learning what that class looks like to the network, as shown in the article above.
Could anyone give a more concrete overview, or code/tutorial, on how to generate images for a specific class? (preferably assuming the Caffe framework)
I think this code is a good starting point to reproduce the images Google team published. The procedure looks clear:
Start with a pure noise image and a class (say "cat")
Perform a forward pass and backpropagate the error wrt the imposed class label
Update the initial image with the gradient computed at the data layer
There are some tricks involved that can be found in the original paper.
It seems that the main difference is that Google folks tried to get a more "realistic" image:
By itself, that doesn’t work very well, but it does if we impose a prior constraint that the image should have similar statistics to natural images, such as neighboring pixels needing to be correlated.
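In other words, the three steps above are plain gradient ascent on the input image. Caffe would do the forward/backward passes for you and expose the gradient at the data layer; the toy Java sketch below is not Caffe API at all, just a fixed linear scorer standing in for the trained network so the update rule is visible:

    import java.util.Arrays;

    // Toy illustration of class visualization by gradient ascent on the input.
    // A fixed linear scorer stands in for the trained network; in a real run
    // the gradient would come from the framework's backward pass.
    public class ClassVisualizationSketch {
        public static void main(String[] args) {
            double[][] W = { // one weight row per class (made-up numbers)
                {0.2, -0.1, 0.3, 0.0},
                {0.5, 0.4, -0.2, 0.1},
                {-0.3, 0.2, 0.1, 0.6}
            };
            int target = 1;               // the class we want to visualize
            double[] img = new double[4]; // start from a blank "image"
            double lr = 0.1;

            for (int step = 0; step < 100; step++) {
                // Forward pass: score = W[target] . img. For a linear scorer
                // the gradient w.r.t. the input is simply the row W[target].
                for (int j = 0; j < img.length; j++) {
                    img[j] += lr * W[target][j]; // ascend the target-class score
                    img[j] *= 0.99;              // crude stand-in for the
                                                 // natural-image prior quoted above
                }
            }
            System.out.println(Arrays.toString(img));
        }
    }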

Why is my classification model in Weka predicting all instances as one class?

I have built a classification model using Weka. I have two classes, namely {spam, non-spam}. After applying the StringToWordVector filter, I get 10,000 attributes for 19,000 records. Then I use the LIBLINEAR library to build a model, which gives me the following F-scores:
spam: 94%
non-spam: 98%
When I use the same model to predict new instances, it predicts all of them as spam.
Also, when I use a test set identical to the training set, it predicts all of them as spam too. I am exhausted from trying to find the problem. Any help will be appreciated.
I also get this wrong every so often. Then I watch this video to remind myself how it's done: https://www.youtube.com/watch?v=Tggs3Bd3ojQ There, Prof. Witten, one of the Weka developers/architects, shows how to use the FilteredClassifier (which in turn is configured to load the StringToWordVector filter) correctly on both the training set and the test set.
This is shown for Weka 3.6; Weka 3.7 might be slightly different.
What does ZeroR give you? If it's close to 100%, you know that any classification algorithm shouldn't be far off either.
Why do you optimize for F-measure? Just asking; I have never used it and don't know much about it. (I would optimize for the precision metric, assuming you have much more spam than non-spam.)
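The usual cause is running StringToWordVector separately on the training set and the test set, which produces two incompatible word dictionaries. A minimal sketch of the FilteredClassifier setup Prof. Witten demonstrates (file names are hypothetical, and NaiveBayes stands in for LibLINEAR, which needs a separate package):

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class SpamModel {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("train.arff"); // hypothetical paths
            Instances test  = DataSource.read("test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // The FilteredClassifier builds the word dictionary on the training
            // data only and applies that same dictionary to test instances;
            // filtering each set separately is the classic mistake that makes
            // every test instance collapse into one class.
            FilteredClassifier fc = new FilteredClassifier();
            fc.setFilter(new StringToWordVector());
            fc.setClassifier(new NaiveBayes());
            fc.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(fc, test);
            System.out.println(eval.toSummaryString());
        }
    }

Swapping in weka.classifiers.rules.ZeroR as the classifier gives you the majority-class baseline mentioned above.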

Weka: text classification on an ARFF file

This is a basic question. I am trying to classify text files into 20 different classes.
I have a project structure with two folders, train and test.
In the train folder I have 20 different folders; each folder again has many files related to that particular class (e.g. weather, atheism, etc.).
I have now created a train.arff file for the entire train folder. When the data is visualized, I can see only two attributes.
I have provided a link below:
Screen in weka
My doubt is: how can I view the various files under these folders, remove the stopwords and punctuation, and apply stemming? How do I go about preprocessing? If links to good resources are available, please suggest and provide them.
I found the videos below quite helpful when I first got my hands on text classification using Weka. You might want to take a look.
Weka Tutorial 31: Document Classification 1 (Application)
Weka Tutorial 32: Document classification 2 (Application)
WEKA Text Classification for First Time & Beginner Users
You might want to use the StringToWordVector filter to see the effect of each word as an attribute; this is described in detail in the first and last videos. Within the filter settings you can provide a stopwords list and choose on each run whether to use it. The same goes for stemming: you can change it as well. The filter's documentation and these videos will get you to understand it easily.
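If you drive this from code rather than the GUI, a hedged sketch of the relevant StringToWordVector settings looks like the following (this assumes a recent Weka 3.8; in 3.6 the stopwords option was a plain setStopwords(File) on the filter itself, and stopwords.txt is a hypothetical one-word-per-line file):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.core.stemmers.LovinsStemmer;
    import weka.core.stopwords.WordsFromFile;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class Preprocess {
        public static void main(String[] args) throws Exception {
            Instances raw = DataSource.read("train.arff");
            raw.setClassIndex(raw.numAttributes() - 1);

            StringToWordVector s2wv = new StringToWordVector();
            s2wv.setLowerCaseTokens(true); // fold case before counting words

            // Stopword removal, driven by a one-word-per-line file.
            WordsFromFile stop = new WordsFromFile();
            stop.setStopwords(new java.io.File("stopwords.txt"));
            s2wv.setStopwordsHandler(stop);

            // Stemming; SnowballStemmer is an alternative if its jar is present.
            s2wv.setStemmer(new LovinsStemmer());

            s2wv.setInputFormat(raw);
            Instances vectorized = Filter.useFilter(raw, s2wv);
            System.out.println(vectorized.numAttributes() + " word attributes");
        }
    }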

Read SVM data and retrain with more data?

I am implementing facial expression recognition and am using an SVM to classify a given expression.
When I train, I use this code:
svm.train(myFeatureVector,myLabels,Mat(),Mat(), myParameters);
svm.save("myClassifier.yml");
which I later use when predicting:
response = svm.predict(incomingFeatureVector);
But when I want to train more than once (exit the program and start again), it seems to overwrite my previous SVM file. Is there any way I could read the previous SVM file, add more data to it, and then resave it? I looked through the OpenCV documentation and found nothing. However, on another page I found a method called CvSVM::read; I don't know what it does or how to implement it.
Hope anyone can help me :(
What you are trying to do is incremental learning, but unfortunately the Support Vector Machine is a batch algorithm, hence if you want to add more data you have to retrain on the whole set again.
There are online-learning alternatives, like Pegasos SVM, but I am not aware of any that are implemented in OpenCV.
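To make "online" concrete: the Pegasos update touches one example at a time, so new data can be folded in without revisiting the old set. Below is a self-contained Java sketch of the core update (not OpenCV, and omitting Pegasos's optional projection step), just to show the algorithmic idea:

    import java.util.Random;

    // Minimal Pegasos-style linear SVM: one (example, label) pair per step,
    // labels in {-1, +1}. New data can simply continue the loop.
    public class PegasosSketch {
        static double dot(double[] a, double[] b) {
            double s = 0;
            for (int j = 0; j < a.length; j++) s += a[j] * b[j];
            return s;
        }

        public static double[] train(double[][] X, int[] y, double lambda, int iters) {
            double[] w = new double[X[0].length];
            Random rnd = new Random(42);
            for (int t = 1; t <= iters; t++) {
                int i = rnd.nextInt(X.length);       // random training example
                double eta = 1.0 / (lambda * t);     // Pegasos step size
                double margin = y[i] * dot(w, X[i]); // computed before the update
                for (int j = 0; j < w.length; j++) {
                    w[j] *= 1 - eta * lambda;        // regularization shrink
                    if (margin < 1)                  // hinge loss is active
                        w[j] += eta * y[i] * X[i][j];
                }
            }
            return w;
        }
    }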

Best approach for doing full-text search with list-of-integers documents

I'm working on a C++/Qt image retrieval system based on similarity that works as follows (I'll try to avoid irrelevant or off-topic details):
I take a collection of images and build an index from them using OpenCV functions. After that, for each image, I get a list of integer values representing important "classes" that each image belongs to. The more integers two images have in common, the more similar they are believed to be.
So, when I want to query the system, I just have to compute the list of integers representing the query image, perform a full-text search (or similar) and retrieve the X most similar images.
My question is: what's the best approach to perform such a search?
I've heard about Lucene, Lemur and other indexing methods, but I don't know if this kind of full-text search is the best way, given that the domain is reduced (only integers instead of words).
I'd like to know about the alternatives in terms of efficiency, accuracy or C++ friendliness.
Thanks!
It sounds to me like you have a vector-space model, so Lucene or a similar product may work well for you. In general, an inverted-index model will be good if:
You don't know the number of classes in advance
There are a lot of classes relative to the number of images
If your problem doesn't fit these criteria, a normal relational DB might work better, as Thomas suggested. If it meets #1 but not #2, you could investigate one of the "column oriented" non-relational databases. I'm not familiar enough with these to tell you how well they would work, but my intuition is that you'll need to replicate a lot of the functionality in an IR toolkit yourself.
Lucene is written in Java and I don't know of any C++ ports. Solr exposes Lucene as a web service, so it's easy enough to access it that way from whatever language you choose.
I don't know much about Lemur, but it looks like it has a similar vector-space model, and it's written in C++, so it might be easier for you to use.
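Since the "terms" here are just integers, the core of an inverted index is also small enough to hand-roll. The hedged Java sketch below (Java because that's Lucene's home turf, although your stack is C++/Qt) shows the postings-list idea and the count-shared-classes ranking you describe:

    import java.util.*;

    // Minimal inverted index mapping each integer "class" to the images that
    // contain it, then ranking candidates by how many classes they share
    // with the query (the similarity notion described in the question).
    public class IntInvertedIndex {
        private final Map<Integer, List<Integer>> postings = new HashMap<>();

        public void addImage(int imageId, int[] classes) {
            for (int c : classes)
                postings.computeIfAbsent(c, k -> new ArrayList<>()).add(imageId);
        }

        public List<Map.Entry<Integer, Integer>> query(int[] classes, int topX) {
            Map<Integer, Integer> overlap = new HashMap<>();
            for (int c : classes)
                for (int img : postings.getOrDefault(c, Collections.emptyList()))
                    overlap.merge(img, 1, Integer::sum); // count shared classes
            List<Map.Entry<Integer, Integer>> ranked = new ArrayList<>(overlap.entrySet());
            ranked.sort((a, b) -> b.getValue() - a.getValue());
            return ranked.subList(0, Math.min(topX, ranked.size()));
        }

        public static void main(String[] args) {
            IntInvertedIndex idx = new IntInvertedIndex();
            idx.addImage(1, new int[]{3, 7, 42});
            idx.addImage(2, new int[]{7, 42, 99});
            System.out.println(idx.query(new int[]{7, 42}, 5)); // both share 2 classes
        }
    }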
You can take a look at Lucene for image retrieval (LIRE) here: http://www.semanticmetadata.net/2006/05/19/lire-lucene-image-retrieval-04-released/
If I'm not mistaken, you are trying to implement typical bag-of-words image retrieval, correct? If so, you are probably trying to build an inverted file index. Lucene on its own is not suitable, as you have probably already realized, since it indexes text rather than numbers. Using its classes for querying the index would also be a problem, as it is not designed to "parse" an image into the query vector (i.e. detect keypoints, extract descriptors, then vector-quantize them).
LIRE, on the other hand, has been modified to index feature vectors. However, it does not appear to work out of the box for the bag-of-words model. Also, I think I've read on the author's website that it currently uses brute-force matching rather than an inverted file index to retrieve images, but I would expect it to be easier to extend than Lucene itself for your purposes.
Hope this helps.