I am generating an .arff file using a Java program. The file has about 600 attributes.
I am unable to open the file in Weka Explorer.
It says: "nominal value not declared in header, read Token[0], line 626."
Here is the first attribute line: @attribute vantuono numeric
Here are the first few chars of line 626: 0,0,0,0,1,0,0,0,0,1,0,1...
Why is WEKA unable to parse '0' as a numeric value?
Interestingly, this happens only in this file. I have other files with numeric attributes that accept '0' as a value.
Are you sure that your declaration is correct? The WEKA FAQ says:
nominal value not declared in header, read Token[X], line Y
If you get this error message, then you seem to have declared a nominal attribute in the ARFF header section, but Weka came across a value ("X") in the data (in line Y) for this particular attribute that wasn't listed as a possible value.
All nominal values that appear in the data must be declared in the header.
There is also a bug regarding sparse ARFF files.
Increase the in-memory buffer to accommodate all the rows using the -B <numRows> option. Note that CSVLoader writes the ARFF to standard output, so redirect it to a file:
java weka.core.converters.CSVLoader filename.csv -B 33000 > filename.arff
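Since you are generating the file from a Java program anyway, another way to avoid header/data mismatches altogether is to build the dataset with Weka's own API and let it write the ARFF file. A minimal sketch (assuming Weka 3.7+; the attribute names, row contents, and output path are placeholders):

import java.io.File;
import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.converters.ArffSaver;

public class ArffWriteDemo {
    public static void main(String[] args) throws Exception {
        // Declare the numeric attributes; the @attribute header lines
        // are generated from these, so header and data cannot diverge.
        ArrayList<Attribute> attrs = new ArrayList<Attribute>();
        for (int i = 0; i < 600; i++) {
            attrs.add(new Attribute("attr" + i)); // no value list => numeric
        }
        Instances data = new Instances("my_relation", attrs, 0);

        // Add one all-zero row (instance weight 1.0).
        data.add(new DenseInstance(1.0, new double[data.numAttributes()]));

        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("out.arff")); // placeholder path
        saver.writeBatch();
    }
}

Because the header is derived from the same Attribute objects the rows are built against, a value like '0' can never disagree with its declaration.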
If you get this error, it's more likely that in your dataset (after the @data line) you kept the header row (the column names) that you had already declared. Remove that header line and you should be good to go.
I got the same error. Then I saw that my program was adding an extra apostrophe. When I removed the apostrophe, it worked.
I had this problem and it cost me time, so that it doesn't cost you any: put the class attribute last, and make sure the attributes in the data appear in the same order as in the header.
I have many .csv files stored in GCS and I want to load the data from the .csv files into BigQuery using the command below:
bq load 'datasate.table' gs://path.csv json_schema
I have tried this, but it gives errors; the same error occurs for many files.
[error screenshot]
How can I remove unwanted values from the .csv files before importing them into the table?
Please suggest the easiest way to load the files.
The answer depends on what you want to do with these junk rows. If you look at the documentation, you have several options (an example command combining them follows this list):
Number of errors allowed (the --max_bad_records flag). By default it's set to 0, which is why the load job fails on the first bad line. If you know the total number of bad rows, set this value accordingly and all those errors will be ignored by the load job.
Ignore unknown values (--ignore_unknown_values). If your errors occur because some lines contain more columns than the schema defines, this option keeps the offending lines but loads only the known columns; the extra ones are ignored.
Allow jagged rows (--allow_jagged_rows). If your errors are caused by lines that are too short (and that is what your error message shows) and you still want to keep the leading columns (because the trailing ones are optional and/or not relevant), you can enable this option.
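For example, combining these in the bq CLI (a sketch only; the dataset, table, path, and schema file are the placeholders from your own command, and 100 is an arbitrary error budget):

bq load \
  --source_format=CSV \
  --max_bad_records=100 \
  --ignore_unknown_values \
  --allow_jagged_rows \
  'datasate.table' gs://path.csv json_schema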
For more advanced and specific filters, you have to perform pre- or post-processing. If that's the case, let me know and I will add that part to my answer.
I am trying to format my dataset as a Weka ARFF file. This is a sample of my .arff file:
@relation my_relation
@attribute 'attrib_1' numeric
@attribute 'attrib_2' numeric
@attribute 'attrib_3' numeric
...
@attribute 'class' {1,2,3,4,5}
@data
6,6,55,0,0,0,18.9,0,1,2,'?',14,15,20,'?','?','?','?',28,29,1
54,25,19,4.85,0,1,10,13,'?','?','?','?','?','?',15,16,19,20,21,0,3
...
My features are numeric and real-valued, but there are missing values for each feature in different cases (instances). How should I indicate that my features contain missing values?
(I used '?' for missing values, but this error occurs while trying to open mydata.arff:
number expected, read token[?], line 746
)
Edit: I changed the '?' to ? and tried to load the file. This time the following error occurs:
nominal value not declared in header, read Token[86], line 746
This is too long to fit into a comment. I think I can see a likely problem with your data: it contains some bad characters. You are probably reading this in a web browser. If so, view the HTML source for this page and then scroll down to your data. In Internet Explorer, I was able to save this web page as a text file and then just look at the text in an editor to see the bad characters. In many places throughout the data, I see &zwnj; and &#8203; entities. These are zero-width characters (the zero-width non-joiner, U+200C, and the zero-width space, U+200B); that is, they are present in the data but do not show up on the screen, not even as blank space. Because your data contains these spurious characters, WEKA cannot read it. Please check whether your original data contains these hidden characters.
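If hidden characters do turn out to be the cause, they are easy to strip before loading. A minimal sketch in Java (the file names are placeholders; assumes the file is UTF-8):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StripHiddenChars {
    public static void main(String[] args) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get("mydata.arff")),
                                 StandardCharsets.UTF_8);
        // Drop zero-width space (U+200B), non-joiner (U+200C),
        // joiner (U+200D) and BOM (U+FEFF).
        String clean = text.replaceAll("[\u200B\u200C\u200D\uFEFF]", "");
        Files.write(Paths.get("mydata.clean.arff"),
                    clean.getBytes(StandardCharsets.UTF_8));
    }
}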
This is my issue, but it doesn't say HOW to define the template file correctly.
My training file looks like this:
上 B-NR
海 L-NR
浦 B-NR
东 L-NR
开 B-NN
发 L-NN
与 U-CC
法 B-NN
制 L-NN
建 B-NN
...
CRF++ is extremely easy to use. The instructions on the website explain it clearly.
http://crfpp.googlecode.com/svn/trunk/doc/index.html
Suppose we extract features for the line
东 L-NR
Unigram
U02:%x[0,0] #means column 0 of the current line
U03:%x[1,0] #means column 0 of the next line
So for U03 the underlying feature is "column0=开" (the character in column 0 of the next line).
It is similar for bigrams.
It seems that this issue arises from not clearly understanding how CRF++ processes the training file. Your features must not include the values in the last column: these are the labels! If you were to include them in your features, your model would be trivially perfect. When you define your template file, because you only have two columns, it can only include rules of the form %x[n,0]. It is hardcoded into CRF++ (though not clearly documented, as far as I can tell) that -4 <= n <= 4.
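For a two-column file like the one in the question, a minimal template might look like this (a sketch only; the offset window is a typical choice, not the only valid one). Note that no rule references column 1, the label column:

# Unigram features over column 0 (the character)
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
U04:%x[0,0]/%x[1,0]

# Bigram of adjacent output labels
B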
I'm using Weka and trying to test my file, but I always get a popup window showing "Train and Test set are not compatible". I'm using CSV files. All the attributes are the same in both files. Out of 30 attributes, I divided them into two parts: the first 20 attributes as the training set and the remaining 10 as the test set. Please help.
Your attributes and their order must be the same in both files. See the following Weka wiki post and Stack Overflow question 1 and question 2. Even a small difference may cause this error.
According to you, their order may be the same, but according to Weka it is not. Convert them to ARFF format and try again; you will see that their ARFF headers are not the same. See the example below.
CSV file1
Feature A
true
false
CSV file2
Feature A
false
true
The ARFF header representations of these CSV files are NOT the same: since the first occurrence of each value differs between the files, the order of the values in the ARFF header differs too.
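Concretely, the generated headers would differ along these lines (a sketch of what CSVLoader might produce, with nominal values ordered by first occurrence):

% from CSV file1
@attribute 'Feature A' {true,false}

% from CSV file2
@attribute 'Feature A' {false,true}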
While I was running CRF++ on my training data (train.txt), I got the following error:
C:\Users\2012\Desktop\CRF_Software_Package\CRF++-0.58>crf_learn template train.data model
CRF++: Yet Another CRF Tool Kit
Copyright (C) 2005-2013 Taku Kudo, All rights reserved.
reading training data: tagger.cpp(393) [feature_index_->buildFeatures(this)]
0.00 s
My training data contains Unicode characters and was saved using Notepad (encoding: Unicode big endian).
I am not sure if the problem is with the template or with the format of the training data. How can I check the format of the training data?
I think this is because of your template file.
Please check whether you have included the last column, which is the gold standard, as a training feature. The column index starts from 0.
E.g., if you have 6 columns in your BIO file, the template should not contain something like %x[0,5].
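For instance, a safe template for that 6-column file would reference only columns 0 through 4 (a sketch):

U00:%x[0,0]
U01:%x[0,1]
U02:%x[-1,4]
# %x[0,5] would leak the gold label and must never appear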
The problem is with the template file.
Check your features for incorrect "grammar", i.e.
U10:%x[-1,0]/% [0,0]
You will notice that after the second % there is a missing 'x'. The corrected line should look like the one below:
U10:%x[-1,0]/%x[0,0]
I had the same issue: the files were in UTF-8, and the template file and training file were definitely in the correct format. The reason was that CRF++ expects at most 1024 columns in the input files. It would be great if it printed an appropriate error message in such a case.
The problem is not with the Unicode encoding, but the template file.
Have a look at this similar Q: The failure in using CRF+0.58 train NE Model