Convert a text file into ARFF file - weka

I am trying to convert a text file into an ARFF (Attribute Relation File Format) file. Below are the first few lines of the file.
@RELATION Graph
@ATTRIBUTE real {1,-1}
@ATTRIBUTE authorOne string
@ATTRIBUTE authorTwo string
@ATTRIBUTE year real
@DATA
1,authorName1,authorName2,1999
....
I get the following error when loading this file into Weka:
train.arff is not recognized as an 'Arff data files' file.
Reason: number expected, read token[authorName1]
Could you please let me know what's wrong with this?

Related

Weka StringToWordVector attributes omitted

I'm working with Weka. My problem is that some of the attributes seem to be omitted after applying StringToWordVector.
This is the ARFF file before using any filter:
@relation QueryResult
@attribute class {Qualität,Bord,Kite,Harness}
@attribute text {evo,foil,end,fin,edg}
@data
Qualität,evo
Bord,foil
Kite,end
Harness,fin
Qualität,edg
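Here is my Java code:
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToString;
import weka.filters.unsupervised.attribute.StringToWordVector;
// loadInstancesForWeka: helper (not shown) that returns the query result as Instances
Instances train = new Instances(loadInstancesForWeka("root", "", sqlCommand));
// with two attributes this selects index 0, i.e. "class"
train.setClassIndex(train.numAttributes() - 2);
System.out.println(train);
// convert the nominal "text" attribute to a string attribute
NominalToString filter1 = new NominalToString();
filter1.setInputFormat(train);
train = Filter.useFilter(train, filter1);
System.out.println("\nData after NominalToString\n" + train);
// turn the string attribute into one numeric attribute per word
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(train);
train = Filter.useFilter(train, filter);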
After applying StringToWordVector, the data looks like this:
@relation 'QueryResult-weka.filters.unsupervised.attribute.NominalToString-Clast-weka.filters.unsupervised.attribute.StringToWordVector-R2-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-stopwords-handlerweka.core.stopwords.Null-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'
@attribute class {Qualität,Bord,Kite,Harness}
@attribute edg numeric
@attribute evo numeric
@attribute foil numeric
@attribute end numeric
@attribute fin numeric
@data
{2 1}
{0 Bord,3 1}
{0 Kite,4 1}
{0 Harness,5 1}
{1 1}
So why are the attributes "foil,end,fin" omitted? Thank you for your help.
There aren't any attributes omitted from your output. The output is in sparse ARFF format:
Sparse ARFF files are very similar to ARFF files, but data with value
0 are not explicitly represented. ...
Each instance is surrounded by
curly braces, and the format for each entry is:
[index] [space] [value] where index is the attribute index (starting from 0).
So for the third instance in your example,
{0 Kite,4 1}
means that attribute 0 for this instance is Kite, attribute 4 (i.e. 'end') is 1, and the other attributes are 0.
It makes sense for StringToWordVector to produce sparse output because it creates a lot of new attributes, most of which will be 0 for each instance. If you need the non-sparse version you can use weka.filters.unsupervised.instance.SparseToNonSparse.
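To make the index/value reading concrete, here is a minimal Python sketch that expands the sparse lines from the question into full rows. The function name `sparse_to_dense` and the `defaults` list are made up for illustration and are not part of Weka. The sketch assumes that an omitted numeric attribute is 0 and that an omitted nominal attribute takes its first declared label, which would explain why the class is not printed for the two Qualität instances. It does not handle quoted values that contain commas.

```python
def sparse_to_dense(line, defaults):
    """Expand one sparse ARFF instance such as '{0 Kite,4 1}' into a full row.

    `defaults` holds the value each attribute takes when it is omitted
    (assumption: 0 for numeric attributes, first label for nominal ones).
    """
    row = list(defaults)
    body = line.strip().lstrip("{").rstrip("}").strip()
    if body:
        for entry in body.split(","):
            # each entry is "<attribute index> <value>"
            index, value = entry.strip().split(" ", 1)
            row[int(index)] = value.strip()
    return row


# Attribute order in the filtered output: class, edg, evo, foil, end, fin.
defaults = ["Qualität", "0", "0", "0", "0", "0"]
for line in ["{2 1}", "{0 Bord,3 1}", "{0 Kite,4 1}", "{0 Harness,5 1}", "{1 1}"]:
    print(sparse_to_dense(line, defaults))
```

Each printed row has six values, one per attribute, so every word attribute (including foil, end and fin) is accounted for.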

[C++]: Writing numerical data to an ODS file, but ODS does not treat them as numbers

When I export my calculations via ofstream in C++ to an ODS (Apache OpenOffice) file, the numbers are correctly shown there, however I cannot make any calculations in that specific ODS file.
For example, when I try to add, say 0.9191 on A1, and 0.5757 on A2, the =SUM(A1:A2) returns zero.
I tried to solve this through cell formatting, but nothing has worked so far. Any suggestions? Thank you.
Edit: The portion of code that does the exporting job.
string datafolder = "c:/Users/cousinvinnie/Desktop/Code Vault/ArmaTut3/" + Jvalue;
string graph_path = datafolder + "/Graphavgs.ods";
ofstream graphavgs;
graphavgs.open(graph_path);
for (int ctr = 0; ctr < cycledata; ctr++) {
    cyclepoints = (howmanyDC + 1) * (ctr + 1);
    graphavgs << (ctr + 1) << " ";
    calcguy = sum((wholedata.row(cyclepoints))) / nextgenpop;
    secondbiggiesavg(ctr) = -log(calcguy);
    graphavgs << secondbiggiesavg(ctr) << " ";
    calcguy = sum((thirdbiggest.row(cyclepoints))) / nextgenpop;
    thirdbiggiesavg(ctr) = -log(calcguy);
    graphavgs << thirdbiggiesavg(ctr) << " ";
    calcguy = sum((matrixavgs.row(cyclepoints))) / nextgenpop;
    avgmatrixdata(ctr) = -log(calcguy);
    graphavgs << avgmatrixdata(ctr) << " " << endl;
}
graphavgs.close();
This code creates the Graphavgs.ods file. In that file I have
1 0.111753 0.182331 0.358724
2 0.147015 0.259202 0.48334
3 0.195855 0.362397 0.648719
4 0.25348 0.476696 0.839261
5 0.314722 0.618828 1.0633
6 0.420704 0.857286 1.37501
7 0.536699 1.1179 1.69503
8 0.76933 1.56382 2.13464
9 0.90525 1.89921 2.42443
10 1.15678 2.41533 2.82584
Now these numbers are not treated as numbers. When I try to work a function on them, like =SUM(A1:A2) the return is zero.
When I do =LN(A1), the return is #VALUE!
SOLVED: Find & Replace all dots with commas.
You are confusing the CSV file format, the ODS file format, and the way both are represented in OpenOffice or LibreOffice.
What you build is a CSV file, that is, a plain text file that only contains a textual representation of the values. By default, your C++ program writes floating-point values with a dot as the decimal separator.
An ODS file is in fact a ZIP file containing metadata (name of creator, date of creation, date of last print, etc.), the actual data, and formatting information. That is why an ODS file can be opened directly by LibreOffice or OpenOffice.
When you open a CSV file in LibreOffice or OpenOffice, you actually import it. That means the program makes some assumptions about the field separator, the decimal separator and, if appropriate, the date format in order to translate the textual values into numeric (or date) ones. Those assumptions are based on your system locale. The formatting is normally the default one. Depending on the version you use, a dialog box with import options may be displayed always, or only if you explicitly import the file (menu File/Import). That dialog box allows you to specify the field separator and the decimal separator that the CSV file contains.
Once you have correctly loaded a CSV file, it is recommended to save it in ODS format so that you do not run into the import problem again.
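As an illustration of the separator issue, here is a minimal Python sketch that writes rows as delimited text with a configurable field separator and decimal separator. The function name `write_delimited` and the choice of `;` and `,` are assumptions for a locale that expects a decimal comma; they are not taken from the question. Floats are formatted with six significant digits to mimic the default output of a C++ stream.

```python
import io


def write_delimited(rows, out, field_sep=";", decimal_sep=","):
    """Write rows as text, using the separators the importing locale expects."""
    for row in rows:
        cells = []
        for value in row:
            # six significant digits, similar to the default ofstream precision
            text = format(value, ".6g") if isinstance(value, float) else str(value)
            cells.append(text.replace(".", decimal_sep))
        out.write(field_sep.join(cells) + "\n")


# First two rows of the sample output from the question.
rows = [(1, 0.111753, 0.182331, 0.358724), (2, 0.147015, 0.259202, 0.48334)]
buffer = io.StringIO()
write_delimited(rows, buffer)
print(buffer.getvalue())
```

Writing to a file with a .csv extension instead of .ods also makes it clearer that the spreadsheet program has to import the data.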

Prediction in weka using explorer

Once I have trained and generated a model, the examples I have seen all use a test set that contains both the actual and the predicted values. Is there a way to either leave the actual column empty or omit it entirely when I am making predictions?
To take an example, the following is my training set:
@relation supermarket
@attribute 'department1' { t}
@attribute 'department2' { t}
@attribute 'department3' { t}
@attribute value
and I am using a test set like:
@relation supermarket
@attribute 'department1' { t}
@attribute 'department2' { t}
@attribute 'department3' { t}
@attribute value
and the output looks like:
@relation supermarket
@attribute 'department1' { t}
@attribute 'department2' { t}
@attribute 'department3' { t}
@attribute value
@attribute predicted-value
@attribute predicted-margin
My question is: can I either remove value from the test set or leave it empty?
Case 1: Both your training and test set have class labels
Training:
@relation simple-training
@attribute feature1 numeric
@attribute feature2 numeric
@attribute class {a,b}
@data
1, 2, b
2, 4, a
.......
Testing:
@relation simple-testing
@attribute feature1 numeric
@attribute feature2 numeric
@attribute class {a,b}
@data
7, 12, a
8, 14, a
.......
In this case, whether you are using k-fold CV or a train/test setup, Weka will not look at the class labels in the test set while predicting. It builds its model from the training data, blindly applies it to the test set, and then compares its predictions with the actual class labels in your test set.
This is useful if you want to see the performance evaluation of your classifier.
Case 2: You have class labels for training data but you don't have class labels for testing data.
Training:
@relation simple-training
@attribute feature1 numeric
@attribute feature2 numeric
@attribute class {a,b}
@data
1, 2, b
2, 4, a
.......
Testing:
@relation simple-testing
@attribute feature1 numeric
@attribute feature2 numeric
@attribute class {a,b}
@data
7, 12, ?
8, 14, ?
.......
This is perfectly normal, since it is exactly what we want to do: apply the trained model to unseen, unlabeled data in order to label it. In that case simply put ? marks in place of the class labels in your test set. After running Weka on this setup you will get output with these ? marks replaced by the predicted values (you don't need to create any additional column; doing so will give you an error).
So, in a nutshell: your training and test data must be compatible. If you don't know a value in the test data and want to predict it, put a ? in that column.
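If the test file already exists with real or placeholder labels, the class column can be blanked out with a small script. The following Python sketch is only an illustration: the function name `mask_class_column` is invented, it assumes the class is the last comma-separated field, and it does not handle quoted values that contain commas.

```python
def mask_class_column(arff_text, class_index=-1):
    """Replace the class value of every data row with '?' (ARFF missing value)."""
    out = []
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        if not in_data:
            # header lines are copied unchanged
            out.append(line)
            if stripped.lower() == "@data":
                in_data = True
        elif not stripped or stripped.startswith("%"):
            # blank lines and comments are copied unchanged
            out.append(line)
        else:
            cells = [cell.strip() for cell in stripped.split(",")]
            cells[class_index] = "?"
            out.append(",".join(cells))
    return "\n".join(out) + "\n"


test_arff = (
    "@relation simple-testing\n"
    "@attribute feature1 numeric\n"
    "@attribute feature2 numeric\n"
    "@attribute class {a,b}\n"
    "@data\n"
    "7, 12, a\n"
    "8, 14, a\n"
)
print(mask_class_column(test_arff))
```

The header stays identical to the training header, which keeps the two files compatible.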

converting to weka arff format

I want to convert the file at this link: http://archive.ics.uci.edu/ml/datasets/Credit+Approval to a Weka .arff file and open it there.
I know that we need to define the file like:
@relation
@attribute
@data
I found the data, but I didn't find the attributes! Also, the relation is the file name, right?
And one last thing: how do I make the file extension .arff?
Please help.
Thank You SO MUCH!!
The crx.names file in the data folder says: All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.
But it does give you the values that are used:
Attribute Information:
A1: b, a.
A2: continuous.
A3: continuous.
A4: u, y, l, t.
A5: g, p, gg.
A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
A7: v, h, bb, j, n, z, dd, ff, o.
A8: continuous.
A9: t, f.
A10: t, f.
A11: continuous.
A12: t, f.
A13: g, p, s.
A14: continuous.
A15: continuous.
A16: +,- (class attribute)
You can assign whatever meaning you need to this information.
To turn this into an ARFF file, write something like this:
%Test Data set
@relation 'Credit Approval Data Set'
@attribute attribute_name {a,b}
@attribute ...
@data
b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+
Add the remaining attributes by reading credit.lisp; you need 16 attributes in total.
Save the file with the .arff extension, for example file.arff. You can create it in any text editor you prefer.
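Writing 16 attribute lines by hand is error-prone, so the header can also be generated. Below is a minimal Python sketch, assuming the attribute names A1 to A16 and the value lists from crx.names shown above; the function name `build_arff` and the relation name `credit-approval` are illustrative choices, not something the dataset prescribes. It uses ARFF's `@relation`, `@attribute` and `@data` keywords and copies the data lines unchanged.

```python
# Nominal attributes and their values, transcribed from crx.names;
# every attribute not listed here is treated as continuous (numeric).
NOMINAL = {
    "A1": "b,a", "A4": "u,y,l,t", "A5": "g,p,gg",
    "A6": "c,d,cc,i,j,k,m,r,q,w,x,e,aa,ff",
    "A7": "v,h,bb,j,n,z,dd,ff,o",
    "A9": "t,f", "A10": "t,f", "A12": "t,f", "A13": "g,p,s",
    "A16": "+,-",
}


def build_arff(relation, data_lines):
    """Return ARFF text: generated header followed by the given data rows."""
    lines = [f"@relation '{relation}'"]
    for i in range(1, 17):
        name = f"A{i}"
        kind = "{" + NOMINAL[name] + "}" if name in NOMINAL else "numeric"
        lines.append(f"@attribute {name} {kind}")
    lines.append("@data")
    lines.extend(line.strip() for line in data_lines if line.strip())
    return "\n".join(lines) + "\n"


# One row from the dataset, as quoted in the answer above.
sample = ["b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+"]
arff_text = build_arff("credit-approval", sample)
print(arff_text)
```

To convert the whole dataset, the lines read from crx.data would be passed in place of `sample` and the result written to a file ending in .arff.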
If you want to follow a GUI-based approach instead:
1) Open crx.data in any editor.
2) Add a line of column headings as the first line, like:
A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,Class
3) Save the file as crx.csv
4) Open Weka -> Explorer
5) In the Preprocess tab, click on Open file
6) Change the file type to CSV
7) Locate the file 'crx.csv'
8) Click on Save
9) Specify the file name crx.arff.
That's it.

WEKA & HMMWeka, train and test models

I'm trying to use WEKA for gesture recognition. I'm new to this procedure, so any help would be appreciated.
More specifically, here is what I have done, in steps:
install WEKA
install the HMMWeka library
my data contains rotations from sensors, and I tried to create .arff files in both "simple" format and "multi-instance" format
I have 4 gestures that have been recorded, with 3 repetitions of each. So my initial idea with the "simple" format is to train the model with gesture1 from the first repetition and to test it (recognize it) with gesture1 from the other two repetitions.
With the "multi-instance" format, each .arff file contains all four gestures from one repetition.
So my questions are:
I'm not sure if my file in "multi-instance" format is correct. Here is an example of its structure:
@relation rotation
@attribute bag_ID {1, 2, 3, 4, 5, 6, 7, 8, 9}
@attribute bag relational
@attribute rotation { rot1 , rot2 , rot3 }
@attribute x_left_hand numeric
@attribute y_left_hand numeric
@attribute z_left_hand numeric
@attribute x_right_hand numeric
@attribute y_right_hand numeric
@attribute z_right_hand numeric
@end bag
@attribute gesture { g1, g2, g3, g4}
@data
1,"rot1, 1.394962, 19.704826, 0.536432, 1.594745, 7.511097, 0.269678", g1
2,"rot1, 1.337786, 19.681709, 0.468583, 1.63736, 7.536188, 0.35687", g1
3,"rot1, 1.280635, 19.658672, 0.400756, 1.679905, 7.561322, 0.443999", g1
4,"rot1, 2.217022, 15.327432, -1.997938, 0.256819, 10.011353, 2.300805", g2
5,"rot1, 2.304201, 15.276058, -2.076832, 0.161013, 9.993914, 2.351273", g2
6,"rot1, 2.271477, 22.43351, 3.477951, 1.245202, 5.531068, -1.06918", g3
7,"rot1, 2.218041, 22.370411, 3.506101, 1.299245, 5.590856, -1.078336", g3
8,"rot1, 1.557125, 16.531981, 4.000765, 3.098644, 5.841918, -3.751997", g4
9,"rot1, 1.557125, 16.531981, 4.000765, 3.116652, 5.932492, -3.760822", g4
Although WEKA reads both formats, when I choose HMM for training it selects the nominal class gesture (which is also the default), while I would like to use either the relational attribute or all the other attributes as a group. The percentage of correctly classified instances in training is also very low: 22%.
The result of testing should be which gesture it is, according to all the attributes that I give to WEKA as input.
Do you know if that is possible? Can I use all numeric attributes for the training? Is something wrong with my format?
I searched a lot on Google, finding things like http://weka.8497.n7.nabble.com/Relational-attributes-vs-regular-attributes-td29946.html, and tried many combinations, but I still have the problem!
I also tried to use two classifiers, Gaussian processes and HMM, but an error pops up (weka.classifiers.meta.Stacking: cannot handle binary class).
Any help would be really appreciated!!
Thank you in advance!!
Best regards,
You are working with time sequences, but your data contains no sequence. Imagine you have a vector
x[], where each element of x is the value of x at one point in time. You have only posted x[0], many times over. In your case x is a structure that has
struct x {
double x_left_hand;
double y_left_hand;
double z_left_hand;
double x_right_hand;
double y_right_hand;
double z_right_hand;
};
This is correct, but it shows no evolution of the gesture over time.
Here is a small example that I'm working on; it may help you:
@relation AUs
@attribute sequences relational
@attribute AU0 NUMERIC
@attribute AU1 NUMERIC
@attribute AU2 NUMERIC
@attribute AU3 NUMERIC
@attribute AU4 NUMERIC
@attribute AU5 NUMERIC
@end sequences
@attribute class {01, 01b, 01c, 01d, 02, 02a, 02c, 04, 05, 07, 10, 11, 13, 14, 15, 17, 18, 19, 21}
@data
"0.5840133,-0.13073722,-0.8034836,0.16867049,-0.30464363,-0.15208732\n....\n0.47603312,-0.10599292,-0.4781358,0.30258814,-0.27299756,0.07913079\n0.5878206,-0.12593555,-0.42014712,0.30809718,-0.33109784,0.013338812\n0.6120923,-0.12400548,-0.3479098,0.26818287,-0.39161837,0.07279621\n0.6180023,-0.11955684,-0.35120794,0.28354084,-0.351862,-0.017126387\n0.6166399,-0.13956356,-0.3506564,0.25470608,-0.34935358,0.025823373\n0.59575284,-0.13704558,-0.42580596,0.24725975,-0.33137816,-0.04043307\n0.5571964,-0.13607484,-0.3777615,0.21615964,-0.35109887,-0.068926826\n0.52844477,-0.10942138,-0.38436648,0.2355144,-0.3238311,-0.06743353\n0.64967036,-0.13547328,-0.28889894,0.21237339,-0.3741229,0.02283336\n0.641207,-0.13648787,-0.35315526,0.27048427,-0.39234316,0.026359601\n0.6241689,-0.14557217,-0.39503983,0.261346,-0.3732989,0.0811597\n0.46664864,-0.092378475,-0.42906052,0.29789245,-0.3076035,0.015037567\n0.528294,-0.19327107,-0.59035814,0.26079395,-0.3222413,-0.022527361\n0.56722254,-0.16849008,-0.4722441,0.2480416,-0.3971509,0.023736712",01
In this example we have time, not just an initial position.
I hope this was helpful.
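A data line in this layout can be generated from a list of per-frame measurements. The Python sketch below is only an illustration: the function name `sequence_instance` is invented, and it assumes that the three g1 rows from the question are successive time steps of a single recording, which the question does not state. Following the example above, the frames are joined with a literal backslash-n inside one quoted relational value.

```python
def sequence_instance(frames, label):
    """Build one data line: all frames of a sequence in a quoted value, then the class."""
    rows = [",".join(str(value) for value in frame) for frame in frames]
    # "\\n" is a backslash followed by 'n', as in the example data line above
    return '"' + "\\n".join(rows) + '",' + label


# Three g1 rows from the question, treated here as consecutive frames (assumption).
g1_frames = [
    (1.394962, 19.704826, 0.536432, 1.594745, 7.511097, 0.269678),
    (1.337786, 19.681709, 0.468583, 1.63736, 7.536188, 0.35687),
    (1.280635, 19.658672, 0.400756, 1.679905, 7.561322, 0.443999),
]
print(sequence_instance(g1_frames, "g1"))
```

Each recorded repetition of a gesture would then become one data line, rather than one line per frame.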