Failure in reading training data: tagger.cpp (393) CRF++ - C++

While running CRF++ on my training data (train.txt) I got the following error:
C:\Users\2012\Desktop\CRF_Software_Package\CRF++-0.58>crf_learn template train.data model
CRF++: Yet Another CRF Tool Kit
Copyright (C) 2005-2013 Taku Kudo, All rights reserved.
reading training data: tagger.cpp(393) [feature_index_->buildFeatures(this)]
0.00 s
My training data contains Unicode characters, and the data was saved using Notepad (encoding = Unicode big-endian).
I am not sure if the problem is with the template or with the format of the training data. How can I check the format of the training data?
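One way to answer the last question is to script the check. A minimal Python sketch, assuming CRF++ expects UTF-8 input with a constant number of whitespace-separated columns per token line (the UTF-8 assumption is mine, not from CRF++'s documentation):

```python
def check_training_data(raw):
    """raw: file contents as bytes, e.g. open("train.txt", "rb").read()."""
    issues = []
    # UTF-16 files saved by Notepad start with a byte-order mark;
    # CRF++ reads a plain byte stream, so re-saving as UTF-8 is assumed safer.
    if raw[:2] in (b"\xfe\xff", b"\xff\xfe"):
        issues.append("file is UTF-16 (BOM found); re-save as UTF-8")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return issues + ["file is not valid UTF-8"]
    n_cols = None
    for i, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue  # blank line = sentence boundary
        cols = line.split()
        if n_cols is None:
            n_cols = len(cols)  # first token line fixes the column count
        elif len(cols) != n_cols:
            issues.append(f"line {i}: {len(cols)} columns, expected {n_cols}")
    return issues
```

An empty result means the file at least has a consistent shape; it does not validate the template.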

I think this is because of your template file.
Please check whether you have included the last column, which is the gold standard, as a training feature. The column index starts from 0.
E.g., if you have 6 columns in your BIO file, the template should not have something like %x[0,5].

The problem is with the template file.
Check your features for incorrect grammar, e.g.
U10:%x[-1,0]/% [0,0]
Notice that after the second % there is a missing 'x'. The corrected line should look like the one below:
U10:%x[-1,0]/%x[0,0]
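A small script can catch this kind of template typo before training. A hedged sketch: it assumes every % in a rule body must begin a %x[row,col] macro, which matches the examples above but is not the full template grammar:

```python
import re

# Well-formed CRF++ feature macro: %x[row,col], row may be negative.
MACRO = re.compile(r"%x\[-?\d+,\d+\]")

def bad_template_lines(template_text):
    """Return 1-based line numbers whose feature macros look malformed."""
    bad = []
    for i, line in enumerate(template_text.splitlines(), 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        body = line.split(":", 1)[1] if ":" in line else ""
        # every '%' in the rule body should start a well-formed macro
        if body.count("%") != len(MACRO.findall(body)):
            bad.append(i)
    return bad
```

For example, `bad_template_lines("U10:%x[-1,0]/% [0,0]")` flags line 1, while the corrected rule passes.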

I had the same issue: the files were in UTF-8, and the template file and training file were definitely in the correct format. The reason was that CRF++ expects at most 1024 columns in the input files. It would be great if it output an appropriate error message in such a case.
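A pre-flight check for this limit could look like the sketch below; the 1024 figure is taken from this answer, not from any CRF++ documentation:

```python
MAX_COLS = 1024  # limit reported in the answer above; treat as an assumption

def overwide_lines(text, max_cols=MAX_COLS):
    """Return 1-based line numbers with more than max_cols whitespace-separated columns."""
    return [i for i, line in enumerate(text.splitlines(), 1)
            if len(line.split()) > max_cols]
```

Running it over the training file before crf_learn would surface the problem with a usable line number.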

The problem is not with the Unicode encoding, but the template file.
Have a look at this similar question: The failure in using CRF+0.58 train NE Model


Concatenate monthly MODIS data

I downloaded daily MODIS Level 3 data for a few months from https://disc.gsfc.nasa.gov/datasets. The filenames are of the form MCD06COSP_M3_MODIS.A2006001.061.2020181145945, but the files do not contain any time dimension. Hence, when I use ncecat to concatenate the files, the date information is missing from the resulting file. I want to know how to add the time information to the combined dataset.
Your commands look correct. Good job crafting them. Not sure why it's not working. Possibly the input files are HDF4 format (do they have a .hdf suffix?) and your NCO is not HDF4-enabled. Try to download the files in netCDF3 or netCDF4 format and your commands above should work. If that's not what's wrong, then examine the output files in each step of your procedure and identify which step produces the unintended results and then narrow your question. Good luck.
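One way to attach the missing time information is to recover the date from each filename before concatenating (writing it into the file would additionally need a tool such as ncap2 or the netCDF4 Python library). A minimal sketch, assuming the A<year><day-of-year> token seen in the example filename:

```python
from datetime import datetime, date

def modis_date(filename):
    """Parse the acquisition date from a MODIS-style filename.

    The "A2006001" token is assumed to mean year 2006, day-of-year 001,
    following the example filename in the question.
    """
    token = next(p for p in filename.split(".")
                 if p.startswith("A") and p[1:].isdigit())
    return datetime.strptime(token[1:], "%Y%j").date()
```

For the file above, `modis_date("MCD06COSP_M3_MODIS.A2006001.061.2020181145945")` gives 2006-01-01, which could then populate a per-file time coordinate before ncecat.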

GCP > Video Intelligence: Prepare CSV error: Has critical error in root level csv, Expected 2 columns, but found 1 columns only

I'm trying to follow the documentation at the GCP link below to prepare my video training data. The doc says that if you want to use GCP to label videos, you can use the UNASSIGNED feature.
I have my videos uploaded to a bucket.
I have a traffic_video_labels.csv with the rows below:
gs://video_intel/1.mp4
gs://video_intel/2.mp4
Now, in the Video Intelligence Import section, I want to use a CSV called check.csv with the row below, which references the video locations. Using the UNASSIGNED value should let me use the labelling feature within GCP.
UNASSIGNED,gs://video_intel/traffic_video_labels.csv
However, when I try to import check.csv, I get the error:
Has critical error in root level csv gs://video_intel/check.csv line 1: Expected 2 columns, but found 1 columns only.
Can anyone please help with this? Thanks!
https://cloud.google.com/video-intelligence/automl/object-tracking/docs/prepare
For the error message "Expected 2 columns, but found 1 columns only.", try to fix the format of your CSV file: open the file in a text editor of your choice (such as Cloud Shell, Sublime, Atom, etc.) to inspect the file format.
When opening a CSV file in Google Sheets or a similar product, you won't be able to format the file properly (e.g. empty values from trailing commas) due to limitations of the user interface, but in text editors you should not run into those issues.
If this does not work, please share your CSV file so I can run a test with it myself.
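To automate that inspection, a short script can count the columns per row. A minimal sketch, assuming only the two-column requirement stated in the error message:

```python
import csv
import io

def bad_rows(csv_text):
    """Return 1-based row numbers that do not have exactly two columns."""
    return [i for i, row in enumerate(csv.reader(io.StringIO(csv_text)), 1)
            if len(row) != 2]
```

A row like `UNASSIGNED,gs://video_intel/traffic_video_labels.csv` passes, while one that uses a space instead of a comma (hence one column) is flagged.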

Blankspace and colon not found in firstline

I have a jupyter notebook in SageMaker in which I want to run the XGBoost algorithm. The data has to match 3 criteria:
-No header row
-Outcome variable in the first column, features in the rest of the columns
-All columns need to be numeric
The error I get is the following:
Error for Training job xgboost-2019-03-13-16-21-25-000:
Failed Reason: ClientError: Blankspace and colon not found in firstline
'0.0,0.0,99.0,314.07,1.0,0.0,0.0,0.0,0.48027846,0.0...' of file 'train.csv'
The error itself shows that there are no headers, the outcome is the first column (it takes only 1.0 and 0.0 values), and all features are numerical. The data is stored in its own bucket.
I have seen a related question on GitHub, but there is no solution there. Also, the example notebook that Amazon provides does not change the default separator or anything else when saving a dataframe to CSV for later use.
The error message indicates that XGBoost expected the input data set in libsvm format instead of CSV. SageMaker XGBoost by default assumes the input data set is in libsvm format. To use an input data set in CSV, explicitly specify the content type as text/csv.
For more information: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost
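To see why the message mentions blankspace and a colon, note what a libsvm line looks like: `label index:value index:value ...`. A hedged sketch of such a format sniff (an illustration of the libsvm line shape, not SageMaker's actual parser):

```python
def looks_like_libsvm(first_line):
    """Heuristic: after the label and a space, every field is index:value."""
    head, _, rest = first_line.partition(" ")
    return bool(rest) and all(":" in field for field in rest.split())
```

A libsvm line such as `"1 0:3.5 4:1.0"` passes, while the CSV line from the error (`"0.0,0.0,99.0,..."`) has neither the blankspace-separated fields nor the colons, which is exactly what the error complains about.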

define CRF++ template file

This is my issue, but it doesn't say HOW to define the template file correctly.
My training file looks like this:
上 B-NR
海 L-NR
浦 B-NR
东 L-NR
开 B-NN
发 L-NN
与 U-CC
法 B-NN
制 L-NN
建 B-NN
...
CRF++ is extremely easy to use. The instructions on the website explain it clearly.
http://crfpp.googlecode.com/svn/trunk/doc/index.html
Suppose we extract features for the line
东 L-NR
Unigram
U02:%x[0,0] # means column 0 of the current line
U03:%x[1,0] # means column 0 of the next line
So for U02 the underlying feature is "column0=东", and for U03 it is "column0=开" (the next line's character).
It is similar for bigrams.
It seems that this issue arises from not clearly understanding how CRF++ processes the training file. Your features must not include the values in the last column; these are the labels! If you were to include them in your features, your model would be trivially perfect. When you define your template file, because you only have two columns, it can only include rules of the form %x[n,0]. It is hardcoded into CRF++ (though not clearly documented, as far as I can tell) that -4 <= n <= 4.
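The two constraints described here can be checked mechanically. A hedged sketch; the -4 <= n <= 4 bound is this answer's claim about CRF++ internals, not documented behaviour:

```python
import re

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

def template_errors(template_text, n_data_cols):
    """Flag macros whose row offset is outside [-4, 4] or whose column
    index hits the label column (assumed to be the last column)."""
    errors = []
    for i, line in enumerate(template_text.splitlines(), 1):
        for r, c in MACRO.findall(line):
            row, col = int(r), int(c)
            if not -4 <= row <= 4:
                errors.append(f"line {i}: row offset {row} outside [-4, 4]")
            if col >= n_data_cols - 1:
                errors.append(f"line {i}: column {col} is the label column or beyond")
    return errors
```

With the two-column file above, `template_errors("U02:%x[0,0]", 2)` is clean, while `%x[5,0]` or `%x[0,1]` would each be flagged.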

Test and Train sets are not compatible

I am using Weka and trying to test my file, but I always get a popup window showing "Train and Test set are not compatible". I am using CSV files, and all the attributes are the same in both files. Out of 30 attributes, I divided them into two parts: the first 20 attributes as the training set and the remaining 10 as the test set. Please help me.
Your attributes and their order must be the same in both files. See the following Weka Wiki post and Stack Overflow question 1 and question 2. Even a small difference may cause this error.
Their order may seem the same to you, but according to Weka it is not. Convert both files to ARFF format and try again; you will see that their ARFF headers are not the same. See the example below.
CSV file1
Feature A
true
false
CSV file2
Feature A
false
true
The representations of these CSV files as ARFF headers are not the same. Since the first occurrence of each value differs between the files, the value order in the ARFF header differs too.
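The first-occurrence behaviour described above can be illustrated with a tiny sketch (an illustration of the ordering rule, not Weka's actual code):

```python
def arff_nominal_order(values):
    """Nominal value order as inferred from a CSV column:
    values appear in the order they are first seen."""
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return seen
```

File 1 sees "true" first and file 2 sees "false" first, so the inferred headers, and hence the two ARFF files, are not the same.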