converting to weka arff format - weka

I want to convert the file at this link: http://archive.ics.uci.edu/ml/datasets/Credit+Approval into a Weka .arff file and open it in Weka.
I know that the file needs to be laid out like this:
@relation
@attribute
@data
I found the data, but I could not find the attributes. Also, the relation is just the file name, right?
One last thing: how do I give the file the .arff extension?
Please help.
Thank you so much!

The crx.names file in the data folder says: "All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data."
But it does list the values that each attribute can take:
Attribute Information:
A1: b, a.
A2: continuous.
A3: continuous.
A4: u, y, l, t.
A5: g, p, gg.
A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
A7: v, h, bb, j, n, z, dd, ff, o.
A8: continuous.
A9: t, f.
A10: t, f.
A11: continuous.
A12: t, f.
A13: g, p, s.
A14: continuous.
A15: continuous.
A16: +,- (class attribute)
You can assign whatever meaning you need to these symbols.
To turn this into an ARFF file, write something like this:
% Test data set
@relation 'Credit Approval Data Set'
@attribute attribute_name {a,b}
@attribute ...
@data
b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+
Add the remaining attributes by reading credit.lisp; you need 16 attributes in total.
Save the file with the .arff extension, for example file.arff. You can create it in any text editor you like.
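The header described above could also be generated with a short script instead of typed by hand. This is only a sketch: the attribute list is copied from the crx.names listing quoted in this answer, while the names ATTRIBUTES, arff_header and convert, and the file paths passed to convert, are made up for illustration.

```python
# Attribute definitions copied from the crx.names listing above.
# None marks a continuous (numeric) attribute; a list holds nominal values.
ATTRIBUTES = [
    ("A1", ["b", "a"]),
    ("A2", None),
    ("A3", None),
    ("A4", ["u", "y", "l", "t"]),
    ("A5", ["g", "p", "gg"]),
    ("A6", ["c", "d", "cc", "i", "j", "k", "m", "r", "q", "w", "x", "e", "aa", "ff"]),
    ("A7", ["v", "h", "bb", "j", "n", "z", "dd", "ff", "o"]),
    ("A8", None),
    ("A9", ["t", "f"]),
    ("A10", ["t", "f"]),
    ("A11", None),
    ("A12", ["t", "f"]),
    ("A13", ["g", "p", "s"]),
    ("A14", None),
    ("A15", None),
    ("A16", ["+", "-"]),  # class attribute
]


def arff_header(relation, attributes):
    """Build the @relation / @attribute / @data header as one string."""
    lines = ["@relation '%s'" % relation, ""]
    for name, values in attributes:
        if values is None:
            lines.append("@attribute %s numeric" % name)
        else:
            lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines += ["", "@data"]
    return "\n".join(lines) + "\n"


def convert(data_path, arff_path, relation="Credit Approval Data Set"):
    """Write the header, then copy the comma-separated data rows unchanged."""
    with open(data_path) as src, open(arff_path, "w") as dst:
        dst.write(arff_header(relation, ATTRIBUTES))
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(line.strip() + "\n")
```

Calling convert("crx.data", "crx.arff") would then produce a file with 16 @attribute lines followed by the original rows.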

If you want to follow a GUI-based approach:
1) Open crx.data in any editor.
2) Add a row of column headings as the first line, like:
A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,Class
3) Save the file as crx.csv.
4) Open Weka -> Explorer.
5) In the Preprocess tab, click Open file.
6) Change the file type to CSV.
7) Locate and open the file 'crx.csv'.
8) Click Save.
9) Specify the file name crx.arff.
That's it.
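Steps 1 to 3 can also be scripted if you would rather not edit the file by hand. A minimal sketch, assuming crx.data is plain comma-separated text; the names HEADER and add_header are invented for this example.

```python
# Column headings from step 2: A1..A15 plus the class column.
HEADER = ",".join(["A%d" % i for i in range(1, 16)] + ["Class"])


def add_header(data_path, csv_path):
    """Copy data_path to csv_path with the heading row placed first."""
    with open(data_path) as src, open(csv_path, "w") as dst:
        dst.write(HEADER + "\n")
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(line.rstrip("\r\n") + "\n")
```

After add_header("crx.data", "crx.csv"), continue from step 4 in Weka.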

Weka StringToWordVector attributes omitted

I'm working with Weka. My problem is that some of the attributes are omitted after using StringToWordVector. Here is my code.
This is the ARFF file before applying any filter:
@relation QueryResult
@attribute class {Qualität,Bord,Kite,Harness}
@attribute text {evo,foil,end,fin,edg}
@data
Qualität,evo
Bord,foil
Kite,end
Harness,fin
Qualität,edg
Here is my Java code:
Instances train = new Instances(loadInstancesForWeka("root","",sqlCommand));
train.setClassIndex(train.numAttributes() - 2);
System.out.println(train);
NominalToString filter1 = new NominalToString();
filter1.setInputFormat(train);
train = Filter.useFilter(train, filter1);
System.out.println("\nSelect after NominalToString \n"+train);
// apply the StringToWordVector filter
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(train);
train = Filter.useFilter(train, filter);
After applying StringToWordVector, the data looks like this:
@relation 'QueryResult-weka.filters.unsupervised.attribute.NominalToString-Clast-weka.filters.unsupervised.attribute.StringToWordVector-R2-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-stopwords-handlerweka.core.stopwords.Null-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'
@attribute class {Qualität,Bord,Kite,Harness}
@attribute edg numeric
@attribute evo numeric
@attribute foil numeric
@attribute end numeric
@attribute fin numeric
@data
{2 1}
{0 Bord,3 1}
{0 Kite,4 1}
{0 Harness,5 1}
{1 1}
So why are the attributes "foil,end,fin" omitted? Thank you for your help.
There aren't any attributes omitted from your output. The output is in sparse ARFF format:
Sparse ARFF files are very similar to ARFF files, but data with value
0 are not explicitly represented. ...
Each instance is surrounded by
curly braces, and the format for each entry is:
[index] [space] [value] where index is the attribute index (starting from 0).
So for the third instance in your example,
{0 Kite,4 1}
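means that attribute 0 for this instance is Kite, attribute 4 (i.e. 'end') is 1, and the other attributes are 0. Nominal values are stored by index as well, so the first declared value of a nominal attribute (here Qualität) counts as 0 and is left out; that is why the first and last instances show no class value.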
It makes sense for StringToWordVector to produce sparse output because it creates a lot of new attributes, most of which will be 0 for each instance. If you need the non-sparse version you can use weka.filters.unsupervised.instance.SparseToNonSparse.
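To make the sparse notation concrete, here is a rough sketch that expands one sparse data line into a full row. It is not Weka code: sparse_to_dense and ATTRS are names invented for this example, and the parser assumes simple values without quotes or embedded commas.

```python
def sparse_to_dense(line, attributes):
    """Expand one sparse ARFF data line such as '{0 Kite,4 1}'.

    attributes is a list of (name, nominal_values_or_None) in header order.
    Anything not listed in the line defaults to 0.0 for numeric attributes
    and to the first declared value for nominal ones.
    """
    row = [values[0] if values else 0.0 for _, values in attributes]
    body = line.strip().strip("{}")
    for entry in body.split(","):
        entry = entry.strip()
        if not entry:
            continue
        index, value = entry.split(" ", 1)
        i = int(index)
        # Numeric attributes become floats; nominal ones stay as text.
        row[i] = value if attributes[i][1] else float(value)
    return row


# Header from the filtered output shown in the question.
ATTRS = [
    ("class", ["Qualität", "Bord", "Kite", "Harness"]),
    ("edg", None),
    ("evo", None),
    ("foil", None),
    ("end", None),
    ("fin", None),
]

print(sparse_to_dense("{0 Kite,4 1}", ATTRS))
print(sparse_to_dense("{2 1}", ATTRS))
```

The first call gives Kite with 'end' set to 1; the second gives Qualität with 'evo' set to 1, even though the class value never appears in the sparse line.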

[C++]: Writing numerical data to an ODS file, but ODS does not treat it as numbers

When I export my calculations via ofstream in C++ to an ODS (Apache OpenOffice) file, the numbers are displayed correctly, but I cannot do any calculations with them in that ODS file.
For example, when I put 0.9191 in A1 and 0.5757 in A2, =SUM(A1:A2) returns zero.
I tried to solve this by changing the cell formatting, but nothing has worked so far. Any suggestions? Thank you.
Edit: The portion of code that does the exporting job.
string datafolder;
datafolder = "c:/Users/cousinvinnie/Desktop/Code Vault/ArmaTut3/" + Jvalue;
string graph_path = datafolder + "/Graphavgs.ods";
ofstream graphavgs;
graphavgs.open(graph_path);
for (int ctr = 0; ctr < cycledata; ctr++) {
    cyclepoints = (howmanyDC + 1) * (ctr + 1);
    graphavgs << (ctr + 1) << " ";
    calcguy = sum((wholedata.row(cyclepoints))) / nextgenpop;
    secondbiggiesavg(ctr) = -log(calcguy);
    graphavgs << secondbiggiesavg(ctr) << " ";
    calcguy = sum((thirdbiggest.row(cyclepoints))) / nextgenpop;
    thirdbiggiesavg(ctr) = -log(calcguy);
    graphavgs << thirdbiggiesavg(ctr) << " ";
    calcguy = sum((matrixavgs.row(cyclepoints))) / nextgenpop;
    avgmatrixdata(ctr) = -log(calcguy);
    graphavgs << avgmatrixdata(ctr) << " " << endl;
}
graphavgs.close();
This code creates the Graphavgs.ods file. In that file I have
1 0.111753 0.182331 0.358724
2 0.147015 0.259202 0.48334
3 0.195855 0.362397 0.648719
4 0.25348 0.476696 0.839261
5 0.314722 0.618828 1.0633
6 0.420704 0.857286 1.37501
7 0.536699 1.1179 1.69503
8 0.76933 1.56382 2.13464
9 0.90525 1.89921 2.42443
10 1.15678 2.41533 2.82584
Now these numbers are not treated as numbers. When I try to work a function on them, like =SUM(A1:A2) the return is zero.
When I do =LN(A1), the return is #VALUE!
SOLVED: Find & Replace all dots with commas.
You are confusing the CSV file format, the ODS file format, and the way OpenOffice or LibreOffice represents each of them.
What you build is a CSV file, that is, a plain text file that contains only a textual representation of the values. By default, your C++ program writes floating-point values with a dot as the decimal separator.
An ODS file is in fact a ZIP file containing metadata (name of creator, date of creation, date of last print, etc.), the actual data, and formatting information. That is why an ODS file can be opened directly by LibreOffice or OpenOffice.
When you open a CSV file in LibreOffice or OpenOffice, you actually import it. That means the program makes some assumptions about the field separator, the decimal separator and, where appropriate, the date format in order to translate the textual values into numeric (or date) ones. Those assumptions are based on your system locale. The formatting is normally the default one. Depending on the version you use, a dialog box with import options may be displayed always, or only if you explicitly import the file (menu File/Import). That dialog box lets you specify the field and decimal separators that the CSV file contains.
Once you have correctly loaded a CSV file, it is recommended to save it in ODS format so that you do not hit the import problem again.
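The effect of the decimal separator can be illustrated with a toy model. This is not how OpenOffice or LibreOffice is implemented; parse_cell and format_row are hypothetical helpers that only mimic the idea that a cell becomes a number when its text matches the separator the importer expects.

```python
def parse_cell(text, decimal_sep):
    """Toy locale-aware import: return a float if text is a number under
    the given decimal separator, otherwise keep it as text."""
    if decimal_sep != "." and "." in text:
        return text  # a dot is not a decimal mark in this locale
    try:
        return float(text.replace(decimal_sep, "."))
    except ValueError:
        return text


def format_row(values, decimal_sep=".", field_sep="\t"):
    """Write one row of numbers using the chosen decimal separator."""
    return field_sep.join(str(v).replace(".", decimal_sep) for v in values)


# With a comma locale, dot-formatted text stays text, so sums see no numbers.
print(parse_cell("0.9191", ","))
# The same value written with a comma is read as a number.
print(parse_cell("0,9191", ","))
print(format_row([1, 0.111753, 0.182331, 0.358724], decimal_sep=","))
```

This matches the workaround in the question: replacing dots with commas makes the text agree with what the importer expects, and setting the separators in the import dialog achieves the same thing without touching the file.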

Convert a text file into ARFF file

I am trying to convert a text file into an ARFF (Attribute-Relation File Format) file. Below are the first few lines of the file.
@RELATION Graph
@ATTRIBUTE real {1,-1}
@ATTRIBUTE authorOne string
@ATTRIBUTE authorTwo string
@ATTRIBUTE year real
@DATA
1,authorName1,authorName2,1999
....
I get the error below when loading this file into Weka:
train.arff is not recognized as an 'Arff data files' file.
Reason: number expected, read token[authorName1]
Could you please let me know what is wrong with this?

Python: copy line (tabs)

My source file is a txt file where I aim to select specific lines based on a few values that are spaced by tabs. My objective is to write these lines to a destination txt file. Every line has the same values of say (a or b), written over about 10 columns (1 value per column).
I have looked at solutions on SO and elsewhere online. I have defined search queries, yet they give me error messages. I am just starting out with Python. Thank you for your help.
My code:
searchquery1 = \ta\ta\ta  # 3 a's spaced by tab
with open(oldest) as f1:  # source input file
    with open('newtest.txt', 'a') as f2:  # output file
        lines = f1.readlines()
        for i, line in enumerate(lines):
            if line.endswith(searchquery1):
                f2.writelines(line + "\n")
A short example:
source file:
A1 a a b b
A2 a a a a
A3 b b a a
...
with searchquery1 = 'a a a' (values are spaced by a tab)
destination file:
A2 a a a a (copy line 2 from source)
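One possible way to get the behaviour in this example is sketched below. It assumes the columns really are separated by single tab characters and that the match should be at the end of the line; copy_matching_lines and its parameters are names chosen for this sketch. Two details matter: the search string has to be a quoted string literal, and the trailing newline has to be removed before calling endswith.

```python
def copy_matching_lines(src_path, dst_path, suffix="\ta\ta\ta"):
    """Append every line of src_path that ends with suffix to dst_path.

    Returns the number of lines copied.
    """
    copied = 0
    with open(src_path) as src, open(dst_path, "a") as dst:
        for line in src:
            content = line.rstrip("\r\n")  # endswith must not see the newline
            if content.endswith(suffix):
                dst.write(content + "\n")
                copied += 1
    return copied
```

For the sample source file, copy_matching_lines("source.txt", "newtest.txt") would copy only the A2 line (the file names here are placeholders).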

WEKA & HMMWeka, train and test models

I'm trying to use WEKA for gesture recognition. I'm new to this, so any help would be appreciated.
More specifically, here is what I have done, step by step:
installed WEKA
installed the HMMWeka library
my data contains rotations from sensors, and I tried to create the .arff in both the "simple" format and the "multi-instance" format
I have 4 recorded gestures with 3 repetitions of each. So my initial idea with the "simple" format is to train the model with gesture1 from the first repetition and to test it (recognize it) with gesture1 from the other two repetitions.
With the "multi-instance" format, each .arff file holds all four gestures from one repetition.
So my questions are:
I'm not sure whether my file in "multi-instance" format is correct. Here is an example of its structure:
@relation rotation
@attribute bag_ID {1, 2, 3, 4, 5, 6, 7, 8, 9}
@attribute bag relational
@attribute rotation { rot1 , rot2 , rot3 }
@attribute x_left_hand numeric
@attribute y_left_hand numeric
@attribute z_left_hand numeric
@attribute x_right_hand numeric
@attribute y_right_hand numeric
@attribute z_right_hand numeric
@end bag
@attribute gesture { g1, g2, g3, g4}
@data
1,"rot1, 1.394962, 19.704826, 0.536432, 1.594745, 7.511097, 0.269678", g1
2,"rot1, 1.337786, 19.681709, 0.468583, 1.63736, 7.536188, 0.35687", g1
3,"rot1, 1.280635, 19.658672, 0.400756, 1.679905, 7.561322, 0.443999", g1
4,"rot1, 2.217022, 15.327432, -1.997938, 0.256819, 10.011353, 2.300805", g2
5,"rot1, 2.304201, 15.276058, -2.076832, 0.161013, 9.993914, 2.351273", g2
6,"rot1, 2.271477, 22.43351, 3.477951, 1.245202, 5.531068, -1.06918", g3
7,"rot1, 2.218041, 22.370411, 3.506101, 1.299245, 5.590856, -1.078336", g3
8,"rot1, 1.557125, 16.531981, 4.000765, 3.098644, 5.841918, -3.751997", g4
9,"rot1, 1.557125, 16.531981, 4.000765, 3.116652, 5.932492, -3.760822", g4
Although WEKA reads both formats, when I choose HMM for training it selects the nominal class gesture (which is also the default), while I would like to use either the relational attribute or all the other attributes as a group. The rate of correct classification in training is also very low: 22%.
The test output should be which gesture it is, based on all the attributes that I give WEKA as input.
Do you know if that is possible? Can I use all the numeric attributes for training? Is something wrong with my format?
I searched Google extensively, found things like http://weka.8497.n7.nabble.com/Relational-attributes-vs-regular-attributes-td29946.html and tried many combinations, but I still have the problem.
I also tried to use two classifiers, Gaussian processes and HMM, but an error pops up (weka.classifiers.meta.Stacking: cannot handle binary class).
Any help would be really appreciated!!
Thank you in advance!!
Best regards,
Hmm, you are working with time sequences, but your data contains no sequence. Imagine you have a vector x[] in which each element is the value of x at one point in time: you have only posted x[0], many times over. In your case x is a structure that has
struct x {
double x_left_hand;
double y_left_hand;
double z_left_hand;
double x_right_hand;
double y_right_hand;
double z_right_hand;
};
That part is correct, but the gesture does not evolve over time. I am not sure I am explaining this well; sorry for my English.
Here is a small example that I am working on; it may help you:
@relation AUs
@attribute sequences relational
@attribute AU0 NUMERIC
@attribute AU1 NUMERIC
@attribute AU2 NUMERIC
@attribute AU3 NUMERIC
@attribute AU4 NUMERIC
@attribute AU5 NUMERIC
@end sequences
@attribute class {01, 01b, 01c, 01d, 02, 02a, 02c, 04, 05, 07, 10, 11, 13, 14, 15, 17, 18, 19, 21}
@data
"0.5840133,-0.13073722,-0.8034836,0.16867049,-0.30464363,-0.15208732\n....\n0.47603312,-0.10599292,-0.4781358,0.30258814,-0.27299756,0.07913079\n0.5878206,-0.12593555,-0.42014712,0.30809718,-0.33109784,0.013338812\n0.6120923,-0.12400548,-0.3479098,0.26818287,-0.39161837,0.07279621\n0.6180023,-0.11955684,-0.35120794,0.28354084,-0.351862,-0.017126387\n0.6166399,-0.13956356,-0.3506564,0.25470608,-0.34935358,0.025823373\n0.59575284,-0.13704558,-0.42580596,0.24725975,-0.33137816,-0.04043307\n0.5571964,-0.13607484,-0.3777615,0.21615964,-0.35109887,-0.068926826\n0.52844477,-0.10942138,-0.38436648,0.2355144,-0.3238311,-0.06743353\n0.64967036,-0.13547328,-0.28889894,0.21237339,-0.3741229,0.02283336\n0.641207,-0.13648787,-0.35315526,0.27048427,-0.39234316,0.026359601\n0.6241689,-0.14557217,-0.39503983,0.261346,-0.3732989,0.0811597\n0.46664864,-0.092378475,-0.42906052,0.29789245,-0.3076035,0.015037567\n0.528294,-0.19327107,-0.59035814,0.26079395,-0.3222413,-0.022527361\n0.56722254,-0.16849008,-0.4722441,0.2480416,-0.3971509,0.023736712",01
In this example each instance holds a sequence over time, not just an initial position.
I hope this helps.
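A data row like the one in the example could be assembled from a list of frames along these lines. This is only a sketch of the layout shown above: relational_row is a made-up helper, each inner list stands for one time step, and the frames are joined with a literal backslash-n inside one quoted value.

```python
def relational_row(frames, label):
    """Build one @data row for a relational attribute followed by a class.

    frames: list of time steps, each a list of numbers (one per sub-attribute).
    label:  the class value for the whole sequence.
    """
    # One comma-separated line per time step.
    steps = [",".join(str(v) for v in frame) for frame in frames]
    # Steps are separated by the two characters backslash and n, as in the
    # example row, and the whole sequence is wrapped in double quotes.
    return '"%s",%s' % ("\\n".join(steps), label)


# Two time steps with two values each (shortened for readability).
print(relational_row([[0.5, -0.25], [0.75, 0.125]], "01"))
```

A recording of one gesture would then become a single row whose frames are all the sensor readings of that gesture in time order.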