Strange "Train and test set are not compatible" error in Weka

I have read many solutions for this error, but my problem seems different from the others. I have a "train" dataset (ARFF) and a "test" dataset (ARFF), and both have a string attribute "id". Everything works if I remove "id" from both files at the same time (if I don't remove it from "test", I get an error). What confuses me is that my friend can do it by removing "id" only from "train", so his output contains the "id".
(Since he didn't remove "id" from "test", the number of attributes is not the same, which goes against what I read: that the number of attributes must be exactly the same.)
I really need an output that contains the "id".
Maybe I did something wrong with the "remove"? I read somewhere that the test features may be superior to those of the train set. I also found a paragraph about how to remove the attribute: "Instead of using a nominal ID attribute, declare it as STRING
attribute. With this you don't have to declare each possible value
like with NOMINAL attributes and it therefore doesn't matter what
strings are used in the test set that you're trying to use the trained
model on. In order to be able to work with this STRING ID attribute
you have to use the FilteredClassifier in conjunction with the Remove
filter (package weka.filters.unsupervised.attribute) and your original
base classifier. This setup will remove the ID attribute for the
learning process (i.e., the base classifier), but you'll still be able
to use it outside for tracking instances. "
http://weka.8497.n7.nabble.com/use-saved-model-td22857.html
Anyone have an idea?
Any help will be appreciated.
[Image: my two ARFF files. Left: train; right: test.]
[Image: left: my friend's output, with IDs such as test_subject1005; right: my output.]

I finally found my solution. Just click "Supplied test set" directly and, in the prompt that appears, click "Yes". That's all! (Apparently I had not noticed this prompt before, so I had never tried it.)
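The FilteredClassifier setup quoted in the question hides the ID from the learner while keeping it available for the output. Below is a minimal plain-Java sketch of that idea only. It is not Weka code, and the Row class, the predictWithIds method and the toy model are hypothetical names made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class IdTrackingSketch {
    /** One row: a string ID plus the numeric features the model is allowed to see. */
    static final class Row {
        final String id;
        final double[] features;

        Row(String id, double... features) {
            this.id = id;
            this.features = features;
        }
    }

    /**
     * Passes only the features to the model (the ID is "removed" for learning
     * and prediction) and pairs each prediction with the row's ID afterwards.
     */
    static List<String> predictWithIds(List<Row> rows, Function<double[], String> model) {
        List<String> out = new ArrayList<>();
        for (Row row : rows) {
            out.add(row.id + "," + model.apply(row.features));
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy stand-in for a trained classifier (assumption, not a real model).
        Function<double[], String> toyModel = f -> f[0] > 0.5 ? "yes" : "no";
        List<Row> test = new ArrayList<>();
        test.add(new Row("test_subject1005", 0.9));
        test.add(new Row("test_subject1006", 0.1));
        predictWithIds(test, toyModel).forEach(System.out::println);
    }
}
```

In Weka itself the same separation would be done by the FilteredClassifier and Remove filter named in the quote; the sketch only shows why the ID can still appear in the output even though the classifier never sees it.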

Related

Weka GUI: add attribute is-missing-value

I have a couple of attributes with missing values.
This is a survey, so the fact that the person refused to answer is, by itself, useful information!
I would like to create a new attribute called is-missing-value = 1 if a given value in an attribute is a missing value and 0 otherwise.
Things I have tried:
I have tried using AddExpression, but this seems to only perform arithmetic operations such as 2*attribute.
I know that MathExpression allows using if-else expressions, such as ifelse(A < 3.0, 1, 0)... Do you know if/how I can test whether a value is NaN?
MakeIndicator (or NominalToBinary) should be able to do what I want, but I think I need (i) to convert my missing values to a nominal value, so that (ii) I can then convert this new nominal value to binary. The problem is that ReplaceMissingValues only works with the mode or mean; I need to be able to define a new value. One solution could be to edit the data directly, but I'd rather avoid this.
Please notice that I need to do this using the Weka GUI, not the Java interface.
I think I have a solution for you:
Copy the attribute (if you want the original one to remain): apply the Copy filter (this and the following filters are all under the unsupervised/attribute folder) with the index of the attribute.
Convert your attribute to nominal using the NumericToNominal filter (set the attribute index).
Fill the missing values with a new value using ReplaceMissingWithUserConstant. Here you need to specify the nominalStringReplacementValue parameter (e.g. "missing") in addition to the index of your attribute.
Apply the NominalToBinary filter to your attribute. This will create several new attributes (one per unique value in the dataset, plus one for the missing value). You can remove the attributes you don't need and keep only the "missing" attribute.
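The net effect of the steps above is a 0/1 indicator column. As a sanity check outside the GUI, here is a minimal plain-Java sketch of what that column should contain. It is not Weka code; the class and method names are hypothetical, and a missing value is assumed to be represented as null or NaN:

```java
import java.util.Arrays;

class MissingIndicatorSketch {
    /** Returns 1 where the value is missing (null or NaN) and 0 otherwise. */
    static int[] isMissingValue(Double[] values) {
        int[] flags = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            boolean missing = values[i] == null || values[i].isNaN();
            flags[i] = missing ? 1 : 0;
        }
        return flags;
    }

    public static void main(String[] args) {
        // Hypothetical survey answers; null and NaN both stand for "refused to answer".
        Double[] answers = {1.0, null, Double.NaN, 3.5};
        System.out.println(Arrays.toString(isMissingValue(answers)));
    }
}
```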
Hope it helped.

Sort: Get the author's documents before other documents

I have the following question:
Is there any way to use a specific constant value as a tie-breaker when sorting?
Here's my example:
Assume the index structure:
{
title: "Title",
author: "Name of author"
}
Say we have the following search query : http://example.cloudsearch.amazonaws.com/2013-01-01/search?q=test&return=_all_fields%2C_score&sort=_score%20desc
The problem is that if there are 10 documents with the same title "test", they will all have the same score. I want to sort these documents so that the ones created by the current author come first. I have tried using an expression, but I can't seem to get it to work. This is what I tried:
http://example.cloudsearch.amazonaws.com/2013-01-01/search?q=test&return=_all_fields%2C_score&sort=_score%20desc,isauthor%20desc&expr.isauthor=(author%3D%3D%id)
However, I doubt that CloudSearch will accept that. Is there any way to solve this via the search string, or do I need to index something like a numeric author identifier?
If anyone else is interested, this is what I ended up doing.
I created a new index field, author_identifier, which I used to index a unique integer identifier for each user (getting that was another challenge, because the user IDs were GUIDs, so I had to associate them with numbers).
Then I used a sort expression: http://example.cloudsearch.amazonaws.com/2013-01-01/search?q=test&return=_all_fields%2C_score&sort=_score%20desc,isauthor%20desc&expr.isauthor=(author_identifier%3D%3D%mappedid)
This was not an ideal solution, but it's better than nothing I guess.
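To make the intended ordering concrete, here is a minimal plain-Java sketch of "score descending, then the current author's documents first". It only mirrors the sort logic on the client side and does not call CloudSearch; the Doc class and the sortAuthorFirst method are hypothetical names:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class AuthorFirstSortSketch {
    /** A search hit with the numeric author identifier described above. */
    static final class Doc {
        final String title;
        final long authorIdentifier;
        final double score;

        Doc(String title, long authorIdentifier, double score) {
            this.title = title;
            this.authorIdentifier = authorIdentifier;
            this.score = score;
        }
    }

    /** Sorts by score (descending); ties are broken by "is current author" first. */
    static List<Doc> sortAuthorFirst(List<Doc> hits, long currentAuthorId) {
        Comparator<Doc> byScoreDesc = Comparator.comparingDouble((Doc d) -> d.score).reversed();
        // false sorts before true, so documents by the current author come first.
        Comparator<Doc> authorFirst = Comparator.comparing((Doc d) -> d.authorIdentifier != currentAuthorId);
        List<Doc> sorted = new ArrayList<>(hits);
        sorted.sort(byScoreDesc.thenComparing(authorFirst));
        return sorted;
    }

    public static void main(String[] args) {
        List<Doc> hits = new ArrayList<>();
        hits.add(new Doc("A", 7, 1.0));
        hits.add(new Doc("B", 42, 1.0));
        hits.add(new Doc("C", 7, 2.0));
        for (Doc d : sortAuthorFirst(hits, 42)) {
            System.out.println(d.title + " score=" + d.score + " author=" + d.authorIdentifier);
        }
    }
}
```

The equality test on a numeric identifier plays the same role as the isauthor expression in the query string: it yields a value that can be used as a secondary sort key.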

What does `first: true` do while defining migrations in Rails?

I was going through Codeschool's courses for learning Rails. There, they have placed a PDF file that contains a summary of all the options while writing migrations for defining an individual column like default: <value>, limit: <number>, unique: true. There is an option first: true, that I'm unable to understand.
Apparently, it seems that it is gonna change the position of the column to the first column in the table, but it doesn't seem to do anything like that. What exactly does it do?
When defining the columns you can determine their order by using first: true and after: column_name. I couldn't find it documented anywhere, but you can see it in the sources.

Remove Missing Values in Weka

I'm using a dataset in Weka for classification that includes missing values. As far as I understand, Weka replaces them automatically with the mode or mean of the training data (using the filter unsupervised/attribute/ReplaceMissingValues) when using a classifier like NaiveBayes.
I would like to try removing them, to see how this affects the quality of the classifier. Is there a filter to do that?
See this answer below for a better, modern approach.
My approach is not perfect: if you have more than 5 or 6 attributes, it becomes quite cumbersome to apply. But if only a few attributes have missing values, I suggest using MultiFilter for this purpose.
If you have missing values in 2 attributes then you'll use RemoveWithValues 2 times in a MultiFilter.
Load your data in Weka Explorer
Select MultiFilter from the Filter area
Click on MultiFilter and Add RemoveWithValues
Then configure each RemoveWithValues filter with the attribute index and select True in matchMissingValues
Save the filter settings and click Apply in Explorer.
Use the removeIf() method on weka.core.Instances with a method reference to the hasMissingValue method of weka.core.Instance, which returns true if a given Instance has any missing values.
import weka.core.Instance;
import weka.core.Instances;
Instances dataset = source.getDataSet(); // for some source
dataset.removeIf(Instance::hasMissingValue); // drops every instance that has at least one missing value

Convert String attributes to numeric values in WEKA

I am new to Weka. My data contains a column of student names. I want to convert these names to numeric values, over the whole column.
E.g.: suppose there are 10 names: abcd, cdef, xyz, etc. I want to preprocess the data so that each name corresponds to a distinct numeric value, e.g. abcd changes to 1, cdef changes to 2, etc.
Also, two or more rows can have the same name. In this case, the same name should get the same value.
Please help me...
Weka supports 4 non-relational attribute types: nominal, numeric, string and date. You can find out more about them in the Weka Manual (it can be found in the same folder where you downloaded Weka), chapter "The ARFF Header Section".
You should find out what is the type of the "student's name" attribute (probably string, but could be nominal), and decide what should be the type of the attribute with converted values (numeric, nominal, or string).
There can be 2 scenarios:
(1) If types of the existing and desired attributes are the same (string-string or nominal-nominal, i.e. you only want to change values, not attribute type), you could do so
(a) manually - open the data file in Weka Explorer, and click Edit... button, or
(b) write a small program using Weka's Attribute class functions value and setValue.
(2) Types are different - Weka attribute types cannot be converted, so you will have to create and insert a new attribute with the converted values, and delete the old attribute. An example of how to create a new attribute can be found at
http://weka.wikispaces.com/Programmatic+Use#Step.
As far as I understand, strictly converting names into a "numeric" type doesn't seem like the best approach, within the context of WEKA - WEKA will treat numeric attributes differently than it does "string" or "nominal" attributes (for example, for running certain "attribute selection" algorithms, you can not use "numeric" types - they need to be "discretized" or converted into nominal form).
So, for your case, I think you can convert your "string" names into the "nominal" type using the StringToNominal class (this class acts as a WEKA "filter" that converts a given "string" attribute into an attribute of type "nominal"). This will also take care of repeated names: the list of "nominal" values generated by the filter will contain each name only once, no matter how many times it appears.
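The mapping the question asks for (each distinct name gets one code, and repeated names share it) can be sketched in plain Java as follows. This is not Weka's StringToNominal implementation; the class and method names are hypothetical, and codes are assigned in order of first appearance starting at 1 to match the question's example (Weka's own nominal indices start at 0):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class NameIndexSketch {
    /** Replaces each name by a numeric code; equal names always get the same code. */
    static int[] encode(String[] names) {
        Map<String, Integer> codes = new LinkedHashMap<>();
        int[] encoded = new int[names.length];
        for (int i = 0; i < names.length; i++) {
            Integer code = codes.get(names[i]);
            if (code == null) {
                // First time this name is seen: assign the next free code.
                code = codes.size() + 1;
                codes.put(names[i], code);
            }
            encoded[i] = code;
        }
        return encoded;
    }

    public static void main(String[] args) {
        String[] names = {"abcd", "cdef", "abcd", "xyz"};
        System.out.println(Arrays.toString(encode(names)));
    }
}
```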
"Nominal" attributes also have the advantage that implicitly, they do have a numeric representation (the index of the value within the set of values; similar to how the "enums" in Java have a numeric index). So, you can utilize that as the "numeric" information corresponding to the names (though as I said earlier, it's probably best to just use it as "nominal" attribute; really depends on your particular use case).
I had the same problem as the one mentioned in the question, and I could "address" it in the following way.
I first applied the StringToNominal filter as mentioned before (don't forget to change the attribute range from "last" to "first-last"). Once that was done, I saved the dataset in LibSVM format, which changes the nominal values to numeric ones.
Then, if you close Weka and open it again, you will have the same dataset with the same number of features but they will be numeric. Now some changes should be done, first of all, normalizing all the numeric values in the dataset, using the Normalize filter. After that, apply the NumericToNominal filter to the last attribute.
Then, you will have a similar dataset with numeric values.
Hope this helps.