Weka StringToWordVector attributes omitted


I'm working with Weka. My problem is that some of the attributes seem to be omitted after applying StringToWordVector.
This is the ARFF file before using any filter:
@relation QueryResult
@attribute class {Qualität,Bord,Kite,Harness}
@attribute text {evo,foil,end,fin,edg}
@data
Qualität,evo
Bord,foil
Kite,end
Harness,fin
Qualität,edg
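Here is my Java code:
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToString;
import weka.filters.unsupervised.attribute.StringToWordVector;

// loadInstancesForWeka(...) is a helper defined elsewhere in my project
Instances train = new Instances(loadInstancesForWeka("root","",sqlCommand));
// with two attributes this selects index 0, i.e. the class attribute
train.setClassIndex(train.numAttributes() - 2);
System.out.println(train);
// convert the nominal text attribute (the last one, by default) to a string attribute
NominalToString filter1 = new NominalToString();
filter1.setInputFormat(train);
train = Filter.useFilter(train, filter1);
System.out.println("\nAfter NominalToString\n"+train);
// turn the string attribute into one numeric attribute per word
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(train);
train = Filter.useFilter(train, filter);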
After applying StringToWordVector, the data looks like this:
@relation 'QueryResult-weka.filters.unsupervised.attribute.NominalToString-Clast-weka.filters.unsupervised.attribute.StringToWordVector-R2-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-stopwords-handlerweka.core.stopwords.Null-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'
@attribute class {Qualität,Bord,Kite,Harness}
@attribute edg numeric
@attribute evo numeric
@attribute foil numeric
@attribute end numeric
@attribute fin numeric
@data
{2 1}
{0 Bord,3 1}
{0 Kite,4 1}
{0 Harness,5 1}
{1 1}
So why are the attributes "foil,end,fin" omitted? Thank you for your help.

There aren't any attributes omitted from your output. The output is in sparse ARFF format:
Sparse ARFF files are very similar to ARFF files, but data with value
0 are not explicitly represented. ...
Each instance is surrounded by
curly braces, and the format for each entry is:
[index] [space] [value] where index is the attribute index (starting from 0).
So for the third instance in your example,
{0 Kite,4 1}
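means that attribute 0 for this instance is Kite, attribute 4 (i.e. 'end') is 1, and the other attributes are 0. The first and last instances show no class value for the same reason: Qualität is the first label of the nominal class attribute, so it is stored internally as 0 and left out of the sparse row.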
It makes sense for StringToWordVector to produce sparse output because it creates a lot of new attributes, most of which will be 0 for each instance. If you need the non-sparse version you can use weka.filters.unsupervised.instance.SparseToNonSparse.
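To make the index/value pairs concrete, here is a minimal plain-Java sketch that expands the sparse rows from the question into dense rows. It is not Weka's own parser: the class name SparseRowDemo and the method expand are made up for illustration, the per-attribute defaults are passed in by hand ("0" for numeric attributes, the first label for a nominal one), and quoted values containing commas are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: expands one sparse ARFF data row into dense form. */
class SparseRowDemo {

    /**
     * @param sparseRow a row such as "{0 Kite,4 1}"
     * @param defaults  the value each attribute takes when it is not listed
     *                  ("0" for numeric, the first label for nominal attributes)
     */
    static List<String> expand(String sparseRow, List<String> defaults) {
        List<String> dense = new ArrayList<>(defaults);
        String body = sparseRow.trim();
        body = body.substring(1, body.length() - 1).trim(); // strip { and }
        if (body.isEmpty()) {
            return dense; // every attribute has its default value
        }
        for (String entry : body.split(",")) {
            String[] parts = entry.trim().split("\\s+", 2); // "index value"
            dense.set(Integer.parseInt(parts[0]), parts[1]);
        }
        return dense;
    }

    public static void main(String[] args) {
        // Attribute order from the question: class, edg, evo, foil, end, fin.
        // "Qualit\u00e4t" is the first class label, so it is the default for index 0.
        List<String> defaults = Arrays.asList("Qualit\u00e4t", "0", "0", "0", "0", "0");
        String[] rows = {"{2 1}", "{0 Bord,3 1}", "{0 Kite,4 1}", "{0 Harness,5 1}", "{1 1}"};
        for (String row : rows) {
            System.out.println(String.join(",", expand(row, defaults)));
        }
    }
}
```

Running main prints one dense line per instance, which shows that foil, end and fin each carry a 1 in exactly one row.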

Related

Convert a text file into ARFF file

I am trying to convert a text file into an ARFF (Attribute Relation File Format) file. Below are the first few lines of the file.
@RELATION Graph
@ATTRIBUTE real {1,-1}
@ATTRIBUTE authorOne string
@ATTRIBUTE authorTwo string
@ATTRIBUTE year real
@DATA
1,authorName1,authorName2,1999
....
I get the error below when loading this file into Weka.
train.arff is not recognized as an 'Arff data files' file.
Reason: number expected, read token[authorName1]
Could you please let me know what's wrong with this?

Prediction in weka using explorer

Once I have trained and generated a model, the examples I have seen all use a test set that contains both the actual and the predicted values. Is there a way to leave the actual column empty, or to omit it entirely, when making predictions?
As an example, this is my training set:
@relation supermarket
@attribute 'department1' { t}
@attribute 'department2' { t}
@attribute 'department3' { t}
@attribute value
and I am using a test set like:
@relation supermarket
@attribute 'department1' { t}
@attribute 'department2' { t}
@attribute 'department3' { t}
@attribute value
and the output looks like:
@relation supermarket
@attribute 'department1' { t}
@attribute 'department2' { t}
@attribute 'department3' { t}
@attribute value
@attribute predicted-value
@attribute predicted-margin
My question is: can I either remove value from the test set or leave it empty?

Case 1: Both your training and test sets have class labels
Training:
@relation simple-training
@attribute feature1 numeric
@attribute feature2 numeric
@attribute class {a,b}
@data
1, 2, b
2, 4, a
.......
Testing:
@relation simple-testing
@attribute feature1 numeric
@attribute feature2 numeric
@attribute class {a,b}
@data
7, 12, a
8, 14, a
.......
In this case, whether you use k-fold cross-validation or a train/test setup, Weka does not look at the class labels in the test set while predicting. It builds its model from the training data, applies it blindly to the test set, and then compares its predictions with the actual class labels in your test set.
This is useful if you want to evaluate the performance of your classifier.
Case 2: You have class labels for the training data but not for the test data.
Training:
@relation simple-training
@attribute feature1 numeric
@attribute feature2 numeric
@attribute class {a,b}
@data
1, 2, b
2, 4, a
.......
Testing:
@relation simple-testing
@attribute feature1 numeric
@attribute feature2 numeric
@attribute class {a,b}
@data
7, 12, ?
8, 14, ?
.......
This is perfectly normal, since it is exactly what we need to do: apply the trained model to unseen, unlabeled data in order to label it. In that case, simply put ? marks in place of the class labels in your test data. After running Weka on this setup, you will get output with these ? marks replaced by the predicted values (you don't need to create any additional column; doing so will cause an error).
So, in a nutshell: your training and test data need to be compatible. If you don't know a value in the test data and want to predict it, put a ? in that column.
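If you already have a labeled file and want to turn it into a Case 2 test set, the ? substitution can be scripted. The following is only a sketch in plain Java, not part of Weka: the class name MaskClassLabels is made up, it assumes the class is the last column of each data row, and it does not handle quoted values that contain commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: replaces the last field of each ARFF data row with "?". */
class MaskClassLabels {

    static List<String> mask(List<String> dataRows) {
        List<String> masked = new ArrayList<>();
        for (String row : dataRows) {
            int lastComma = row.lastIndexOf(',');
            // Keep everything up to and including the last comma, then add the missing marker.
            masked.add(lastComma < 0 ? "?" : row.substring(0, lastComma + 1) + " ?");
        }
        return masked;
    }

    public static void main(String[] args) {
        // Data rows from the Case 1 test set above.
        for (String row : mask(Arrays.asList("7, 12, a", "8, 14, a"))) {
            System.out.println(row);
        }
    }
}
```

The header lines (everything up to and including the data marker) stay unchanged, so the masked file remains compatible with the training file.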

converting to weka arff format

I want to convert the file at this link: http://archive.ics.uci.edu/ml/datasets/Credit+Approval into Weka's .arff format and open it there.
I know that we need to define the file like:
@relation
@attribute
@data
I found the data, but I didn't find the attributes. Also, the relation is the file name, right?
And one last thing: how do I give the file the .arff extension?
Please help.
Thank you so much!

The crx.names file in the data folder says: All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.
But it does list the values that are used:
Attribute Information:
A1: b, a.
A2: continuous.
A3: continuous.
A4: u, y, l, t.
A5: g, p, gg.
A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
A7: v, h, bb, j, n, z, dd, ff, o.
A8: continuous.
A9: t, f.
A10: t, f.
A11: continuous.
A12: t, f.
A13: g, p, s.
A14: continuous.
A15: continuous.
A16: +,- (class attribute)
You can assign whatever meaning you need to this information.
To turn this into an ARFF file, write something like this:
% Test data set
@relation 'Credit Approval Data Set'
@attribute attribute_name {a,b}
@attribute ...
@data
b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+
Add the remaining attributes by reading credit.lisp; you need 16 attributes in total.
Save the file under a name ending in .arff. You can create this file in the text editor of your choice.
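Writing the 16 attribute lines by hand is error-prone, so here is a small sketch that derives them from the "Attribute Information" lines quoted above. It is plain Java with made-up names (NamesToArffHeader, toAttributeLine), and it assumes every line has the shape "name: values." with "continuous" meaning numeric.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: turns lines like "A4: u, y, l, t." into ARFF attribute declarations. */
class NamesToArffHeader {

    static String toAttributeLine(String namesLine) {
        int colon = namesLine.indexOf(':');
        String name = namesLine.substring(0, colon).trim();
        // Drop a trailing remark such as "(class attribute)" and the final period.
        String spec = namesLine.substring(colon + 1).replaceAll("\\(.*\\)", "").trim();
        if (spec.endsWith(".")) {
            spec = spec.substring(0, spec.length() - 1).trim();
        }
        if (spec.equals("continuous")) {
            return "@attribute " + name + " numeric";
        }
        // Nominal attribute: list the labels without spaces.
        return "@attribute " + name + " {" + spec.replaceAll("\\s*,\\s*", ",") + "}";
    }

    public static void main(String[] args) {
        List<String> info = Arrays.asList(
                "A1: b, a.",
                "A2: continuous.",
                "A16: +,- (class attribute)");
        System.out.println("@relation 'Credit Approval Data Set'");
        for (String line : info) {
            System.out.println(toAttributeLine(line));
        }
        System.out.println("@data");
    }
}
```

Feeding in all 16 lines and appending the contents of crx.data after the data marker should give a loadable file; the ? entries in crx.data are already ARFF's notation for missing values.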

If you want to follow a GUI-based approach:
1) Open crx.data in any editor.
2) Add a line of column headings as the first line, like:
A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,Class
3) Save the file as crx.csv
4) Open Weka -> Explorer
5) In the Preprocess tab -> click on Open file
6) Change the file type to CSV
7) Locate the file 'crx.csv'
8) Click on Save
9) Specify the file name crx.arff.
That's it.

Named Entity Recognition using WEKA

I am new to WEKA and I have a few questions about it.
I followed this tutorial (Named Entity Recognition using WEKA), but I am still really confused.
Is it possible to filter the string by phrase rather than by word/token?
For example, in my .ARFF file:
@attribute text string
@attribute tag {CC, CD, DT, EX, FW, IN, JJ, JJR, JJS, LS, MD, NN, NNS, NNP, NNPS, PDT, POS, PRP, PRP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD , VBG, VBN , VBP, VBZ, WDT, WP, WP$, WRB, ,, ., :}
@attribute capital {Y, N}
@attribute chunked {B-NP, I-NP, B-VP, I-VP, B-PP, I-PP, B-ADJP, B-ADVP , B-SBAR, B-PRT, O-Punctuation}
@attribute @@class@@ {B-PER, I-PER, B-ORG, I-ORG, B-NUM, I-NUM, O, B-LOC, I-LOC}
@data
'Wanna',NNP,Y,B-NP,O
'be',VB,N,B-VP,O
'like',IN,N,B-PP,O
'New',NNP,Y,B-NP,B-LOC
'York',NNP,Y,I-NP,I-LOC
'?',.,N,O-Punctuation,O
When I filter the string, it is tokenized into words, but what I want is to tokenize/filter the string by phrase. For example, I want to extract the phrase "New York", not "New" and "York" separately, according to the chunked attribute.
"B-NP" marks the start of a phrase and "I-NP" marks its continuation (the middle or end of the phrase).
Also, how can I report the classification results under a merged class name, for example
B-PER and I-PER under the class name PERSON?
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0        0.021    0          0       0          0.768     B-PER
               1        0.084    0.333      1       0.5        0.963     I-PER
               0.167    0.054    0.167      0.167   0.167      0.313     B-ORG
               0        0        0          0       0          0.964     I-ORG
               0        0        0          0       0          0.281     B-NUM
               0        0        0          0       0          0.148     I-NUM
               0.972    0.074    0.972      0.972   0.972      0.949     O
               0.875    0        1          0.875   0.933      0.977     B-LOC
               0        0        0          0       0          0.907     I-LOC
Weighted Avg.  0.828    0.061    0.811      0.828   0.813      0.894

In my opinion, WEKA is not (currently) the best machine learning software for NER. As far as I know, WEKA classifies sets of independent examples, so NER could be set up in one of two ways:
By splitting sentences into tokens: in that case the sequence (i.e. contiguity) is lost. "New" and "York" are two separate examples, and the fact that those words are contiguous is not taken into account in any way.
By keeping chunks/sentences as examples: sequences can then be kept as a whole and filtered (with StringToWordVector, for instance), but a single class has to be associated with each chunk/sentence (for instance, O+O+O+B-LOC+I-LOC+O is the class of the whole sentence in your example).
In both cases, contiguity is not taken into account, which is a real problem. Also, as far as I know, the same is true for R (?). This is why "sequence labelling" tasks (NER, morpho-syntax, syntax and dependencies) are usually done with software that determines a token's category using the current word but also the previous word, the next word, etc., and that can output single tokens as well as multi-token expressions or more complicated structures.
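To illustrate what "using the current word but also the previous and next word" means in practice, here is a minimal plain-Java sketch that builds one feature row per token from a three-word window. It is not taken from any of the tools mentioned in this answer: the class name ContextFeatures and the padding markers <s> and </s> are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: one "previous,current,next" feature row per token. */
class ContextFeatures {

    static List<String> withContext(List<String> tokens) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            // Pad the sentence boundaries with made-up marker tokens.
            String prev = i > 0 ? tokens.get(i - 1) : "<s>";
            String next = i < tokens.size() - 1 ? tokens.get(i + 1) : "</s>";
            rows.add(prev + "," + tokens.get(i) + "," + next);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : withContext(Arrays.asList("Wanna", "be", "like", "New", "York"))) {
            System.out.println(row);
        }
    }
}
```

With rows like these, the example for "York" also contains "New", so some neighbourhood information survives even in a classifier that treats every row independently. A real sequence labeller goes further and also models the dependency between neighbouring output tags.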
For NER, CRFs are currently the usual choice; see:
CRF++
CRFSuite
Wapiti
Mallet
...
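Whichever tool produces the per-token tags, the phrase-grouping part of the question (getting "New York" out of B-LOC followed by I-LOC) can be done as a post-processing step. The sketch below is plain Java under stated assumptions: the class name BioSpans is made up, the tags follow the B-/I-/O scheme from the question, and an I- tag without a preceding B- of the same type is leniently treated as the start of a new phrase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: merges B-/I- tagged tokens into phrases such as "LOC: New York". */
class BioSpans {

    static List<String> group(List<String> tokens, List<String> tags) {
        List<String> spans = new ArrayList<>();
        StringBuilder current = null; // words of the phrase being built
        String currentType = null;    // its type, e.g. "LOC"
        for (int i = 0; i < tokens.size(); i++) {
            String tag = tags.get(i);
            boolean continues = tag.startsWith("I-") && current != null
                    && tag.substring(2).equals(currentType);
            if (continues) {
                current.append(' ').append(tokens.get(i));
                continue;
            }
            if (current != null) { // the previous phrase ends here
                spans.add(currentType + ": " + current);
                current = null;
                currentType = null;
            }
            if (tag.startsWith("B-") || tag.startsWith("I-")) { // a new phrase starts
                currentType = tag.substring(2);
                current = new StringBuilder(tokens.get(i));
            }
        }
        if (current != null) {
            spans.add(currentType + ": " + current);
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Wanna", "be", "like", "New", "York", "?");
        List<String> tags = Arrays.asList("O", "O", "O", "B-LOC", "I-LOC", "O");
        System.out.println(group(tokens, tags));
    }
}
```

Because B-PER and I-PER both end up under the type PER, evaluating on the merged phrases also answers the question about reporting one class per entity type; mapping PER to a longer name such as PERSON would be a simple lookup on top of this.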

WEKA & HMMWeka, train and test models

I'm trying to use WEKA for gesture recognition. I'm new to this, so any help would be appreciated.
More specifically, these are the steps I have taken:
install WEKA
install the HMMWeka library
my data contains rotations from sensors, and I tried to create .arff files in both "simple" format and "multi-instance" format
I have 4 gestures, each recorded in 3 repetitions. So my initial idea with the "simple" format is to train the model with gesture1 from the first repetition and to test it (recognize it) with gesture1 from the other two repetitions.
With the "multi-instance" format, each .arff file contains all four gestures from one repetition.
So my questions are:
I'm not sure if my file in "multi-instance" format is correct. Here is an example of its structure:
@relation rotation
@attribute bag_ID {1, 2, 3, 4, 5, 6, 7, 8, 9}
@attribute bag relational
@attribute rotation { rot1 , rot2 , rot3 }
@attribute x_left_hand numeric
@attribute y_left_hand numeric
@attribute z_left_hand numeric
@attribute x_right_hand numeric
@attribute y_right_hand numeric
@attribute z_right_hand numeric
@end bag
@attribute gesture { g1, g2, g3, g4}
@data
1,"rot1, 1.394962, 19.704826, 0.536432, 1.594745, 7.511097, 0.269678", g1
2,"rot1, 1.337786, 19.681709, 0.468583, 1.63736, 7.536188, 0.35687", g1
3,"rot1, 1.280635, 19.658672, 0.400756, 1.679905, 7.561322, 0.443999", g1
4,"rot1, 2.217022, 15.327432, -1.997938, 0.256819, 10.011353, 2.300805", g2
5,"rot1, 2.304201, 15.276058, -2.076832, 0.161013, 9.993914, 2.351273", g2
6,"rot1, 2.271477, 22.43351, 3.477951, 1.245202, 5.531068, -1.06918", g3
7,"rot1, 2.218041, 22.370411, 3.506101, 1.299245, 5.590856, -1.078336", g3
8,"rot1, 1.557125, 16.531981, 4.000765, 3.098644, 5.841918, -3.751997", g4
9,"rot1, 1.557125, 16.531981, 4.000765, 3.116652, 5.932492, -3.760822", g4
Although WEKA reads both formats, when I choose HMM for training it selects the nominal class gesture (which is also the default), while I would like to use either the relational attribute or all the other attributes as a group. The classification accuracy on the training data is also very low: 22%.
The test result should be which gesture it is, based on all the attributes that I give WEKA as input.
Do you know if that is possible? Can I use all numeric attributes for training? Is something wrong with my format?
I searched Google a lot, found things like http://weka.8497.n7.nabble.com/Relational-attributes-vs-regular-attributes-td29946.html and tried many combinations, but I still have the problem!
I also tried to use two classifiers, Gaussian processes and HMM, but an error pops up (weka.classifiers.meta.Stacking: cannot handle binary class).
Any help would be really appreciated!
Thank you in advance!
Best regards,

Hmm, you are working with time sequences, but your data contains no sequence. Imagine you have a vector x[] in which each element is the value of x at one point in time. You have only posted x[0], many times over. In your case x is a structure like this:
struct x {
double x_left_hand;
double y_left_hand;
double z_left_hand;
double x_right_hand;
double y_right_hand;
double z_right_hand;
};
This is correct, but the gesture does not evolve over time. I don't know if I'm explaining this well; sorry for my English.
Here is a small example that I'm working on; it may help you:
@relation AUs
@attribute sequences relational
@attribute AU0 NUMERIC
@attribute AU1 NUMERIC
@attribute AU2 NUMERIC
@attribute AU3 NUMERIC
@attribute AU4 NUMERIC
@attribute AU5 NUMERIC
@end sequences
@attribute class {01, 01b, 01c, 01d, 02, 02a, 02c, 04, 05, 07, 10, 11, 13, 14, 15, 17, 18, 19, 21}
@data
"0.5840133,-0.13073722,-0.8034836,0.16867049,-0.30464363,-0.15208732\n....\n0.47603312,-0.10599292,-0.4781358,0.30258814,-0.27299756,0.07913079\n0.5878206,-0.12593555,-0.42014712,0.30809718,-0.33109784,0.013338812\n0.6120923,-0.12400548,-0.3479098,0.26818287,-0.39161837,0.07279621\n0.6180023,-0.11955684,-0.35120794,0.28354084,-0.351862,-0.017126387\n0.6166399,-0.13956356,-0.3506564,0.25470608,-0.34935358,0.025823373\n0.59575284,-0.13704558,-0.42580596,0.24725975,-0.33137816,-0.04043307\n0.5571964,-0.13607484,-0.3777615,0.21615964,-0.35109887,-0.068926826\n0.52844477,-0.10942138,-0.38436648,0.2355144,-0.3238311,-0.06743353\n0.64967036,-0.13547328,-0.28889894,0.21237339,-0.3741229,0.02283336\n0.641207,-0.13648787,-0.35315526,0.27048427,-0.39234316,0.026359601\n0.6241689,-0.14557217,-0.39503983,0.261346,-0.3732989,0.0811597\n0.46664864,-0.092378475,-0.42906052,0.29789245,-0.3076035,0.015037567\n0.528294,-0.19327107,-0.59035814,0.26079395,-0.3222413,-0.022527361\n0.56722254,-0.16849008,-0.4722441,0.2480416,-0.3971509,0.023736712",01
In this example we have time, not just an initial position.
I hope this was helpful.
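If it helps, a data row in this layout can be generated from recorded frames with a few lines of code. This is only a sketch in plain Java with made-up names (SequenceRow, toArffRow): each double[] is one time step, the frames are joined with a literal backslash-n inside the quoted string exactly as in the example row above, and the class label is appended at the end.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: builds one data row for a relational (sequence) attribute. */
class SequenceRow {

    static String toArffRow(List<double[]> frames, String label) {
        StringBuilder sb = new StringBuilder("\"");
        for (int t = 0; t < frames.size(); t++) {
            if (t > 0) {
                sb.append("\\n"); // two characters: backslash and n, separating time steps
            }
            double[] frame = frames.get(t);
            for (int j = 0; j < frame.length; j++) {
                if (j > 0) {
                    sb.append(',');
                }
                sb.append(frame[j]);
            }
        }
        return sb.append("\",").append(label).toString();
    }

    public static void main(String[] args) {
        // Two time steps of one gesture, using values from the question's data.
        List<double[]> frames = Arrays.asList(
                new double[] {1.394962, 19.704826, 0.536432, 1.594745, 7.511097, 0.269678},
                new double[] {1.337786, 19.681709, 0.468583, 1.63736, 7.536188, 0.35687});
        System.out.println(toArffRow(frames, "g1"));
    }
}
```

For the gesture data this would mean one row per recorded gesture, with all of its time steps inside the quotes, instead of one row per time step.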
