This is my issue, but the documentation doesn't say HOW to define the template file correctly.
My training file looks like this:
上 B-NR
海 L-NR
浦 B-NR
东 L-NR
开 B-NN
发 L-NN
与 U-CC
法 B-NN
制 L-NN
建 B-NN
...
CRF++ is extremely easy to use. The instructions on the website explain it clearly.
http://crfpp.googlecode.com/svn/trunk/doc/index.html
Suppose we extract features for the line
东 L-NR
Unigram
U02:%x[0,0] #means column 0 of the current line
U03:%x[1,0] #means column 0 of the next line
So for U03 the underlying feature is "column0=开" (开 is the character on the line after 东).
The same applies to bigrams.
It seems that this issue arises from not clearly understanding how CRF++ processes the training file. Your features may not include the values in the last column: these are the labels! If you were to include them in your features, your model would be trivially perfect. When you define your template file, because you only have two columns, it can only include rules of the form %x[n,0]. It is hardcoded into CRF++ (though not clearly documented, as far as I can tell) that -4 <= n <= 4.
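For a two-column file like the one above, a minimal template might look like this (just a sketch; the exact feature windows are up to you, as long as every row offset n stays within -4 <= n <= 4 and the column index is always 0):

# Unigram features: the current character and a small window around it
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]

# Bigram over output labels
B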
I ran into some trouble with the nomenclature in the Elsevier template. I followed the instructions proposed by delrocco,
and now when I write the \nomenclature command I get nothing; the box is empty (please see the attached file). I do not know whether I should use a tabular environment inside the table* environment or the \nomenclature command. When I use a tabular environment, the columns come out with different sizes (the fourth column goes outside the box), and the proposed instructions require some changes to the template, so I do not know whether that is the best solution. When I tried the other solution with the \mbox command, I got an error saying it is not suitable for two-column layout. I'm confused.
I really appreciate any help.
%\documentclass[review]{elsarticle}
\documentclass[3p,twocolumn]{elsarticle}
\usepackage{framed} % Framing content%###
\usepackage{multicol} % Multiple columns environment
\usepackage{nomencl} % Nomenclature package
\makenomenclature
\setlength{\nomitemsep}{-\parskip} % Baseline skip between items%###
\renewcommand*\nompreamble{\begin{multicols}{2}}
\renewcommand*\nompostamble{\end{multicols}}
\modulolinenumbers[5]
\journal{Journal of \LaTeX\ Templates}
\bibliographystyle{elsarticle-num}
\DeclareUnicodeCharacter{2212}{-}
\begin{document}
\begin{frontmatter}
\begin{abstract}
\end{abstract}
\begin{keyword}
%\texttt{elsarticle.cls}\sep \LaTeX\sep Elsevier \sep template
%\MSC[2010] 00-01\sep 99-00
\texttt{MMMMr}
%\MSC[2010] 00-01\sep 99-00
\end{keyword}
\end{frontmatter}
\begin{table*}[!t] % ### for nomenclature
\begin{framed}
\nomenclature{$abbreviation$}{explanation for the abbreviation}
\nomenclature{\(c\)}{Speed of light in a vacuum}
\nomenclature{\(h\)}{Planck constant}
\printnomenclature
\end{framed}
\end{table*}
\end{document}
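One common cause of an empty nomenclature list (I cannot confirm it from your attachment, so take this as a guess) is that the nomencl package needs an extra makeindex run between LaTeX compilations; without it, \printnomenclature has nothing to print. Assuming your file is called main.tex, the usual sequence is:

pdflatex main.tex
makeindex main.nlo -s nomencl.ist -o main.nls
pdflatex main.tex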
The requirement is: the source file structure will change daily/dynamically. How can we achieve this in Informatica Cloud?
For example,
Let's say the source is a flat file that arrives in different formats: with a header, without a header, or with different metadata (today the file has 4 columns, tomorrow it has 7 different columns, the day after tomorrow it arrives without a header, and on another day the file carries a record count).
I need to consume all of these dynamically changing files in one Informatica Cloud mapping. Could you please help me with this?
This is a tricky situation. I know it's not a perfect solution, but here is my idea:
Create a source file structure with the maximum number of columns, all of type text, say 50. Read the file and apply a filter to clean up header data etc. Then use a router to treat the files according to their structure; maybe the filename can give you a hint about what each one contains. Once you identify the type of file, treat/convert the columns according to their data types and load them into the correct target.
The mapping would look like: Source -> SQ -> EXP -> FIL -> RTR -> TGT1, TGT2
There has to be a pattern to identify the dynamic file structure (the sketch below illustrates the idea).
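Outside Informatica, the same idea looks roughly like this in Python (a sketch only; the filename prefixes and column counts are made up for illustration):

import csv

# Hypothetical routing rules: a filename prefix hints at the file's structure.
ROUTES = {
    "sales_": {"has_header": True, "columns": 4},
    "stock_": {"has_header": False, "columns": 7},
}

def route(filename):
    # Pick a parsing rule from the filename, like the router transformation would.
    for prefix, rule in ROUTES.items():
        if filename.startswith(prefix):
            return rule
    return None  # no route: send to an error target

def load(filename):
    rule = route(filename)
    if rule is None:
        print("no route for", filename)
        return
    with open(filename, newline="") as f:
        rows = csv.reader(f)
        if rule["has_header"]:
            next(rows)  # filter out the header row
        for row in rows:
            # Pad or trim to the expected width before type conversion.
            row = (row + [""] * rule["columns"])[:rule["columns"]]
            print(row)  # stand-in for loading into the correct target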
HTH...
To summarise my understanding of the problem:
You have a random number of file formats
You don't know the file formats in advance
The files don't contain the necessary information to determine their format.
If this is correct, then I don't believe this is a solvable problem in Informatica or in any other tool, coding language, etc. You don't have enough information available to define a solution.
The only solution is to change your source files. Possibilities include:
a standard format (or one of a small number of standard formats, with information in the file that allows you to programmatically determine the format being used)
a self-documenting file type such as JSON
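For example, a self-describing file could carry its own schema alongside the data (a sketch with made-up field names):

{
  "format": "customer_feed_v2",
  "columns": ["id", "name", "amount"],
  "record_count": 2,
  "records": [
    [1, "Alice", 10.5],
    [2, "Bob", 7.25]
  ]
}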
I am using hyperparameter tuning with the HParams Dashboard in TensorFlow 2.0-beta0, as suggested here: https://www.tensorflow.org/tensorboard/r2/hyperparameter_tuning_with_hparams
I am confused by step 1 and could not find a better explanation anywhere. My questions relate to the following lines:
HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([16, 32]))
HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.2))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))
My question:
I want to try more dropout values instead of just two (0.1 and 0.2). If I write more values there, it throws an error: 'maximum 2 arguments can be given'. I tried to look for documentation but could not find anything on where these hp.Discrete and hp.RealInterval functions come from.
Any help would be appreciated. Thank you!
Good question. The notebook tutorial is lacking in many respects. At any rate, here is how you do it at a certain resolution res:
for dropout_rate in tf.linspace(
        HP_DROPOUT.domain.min_value,
        HP_DROPOUT.domain.max_value,
        res):
Looking at the implementation, to me it really doesn't seem to be grid search but Monte Carlo/random search (note: this is not 100% correct; please see my edit below).
So on every iteration a random float from that real interval is chosen.
If you want grid-search behavior, just use "Discrete". That way you can even mix and match grid search with random search, which is pretty cool!
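A sketch of what that could look like (the dropout values are arbitrary; hp.Discrete takes a single list, so you can enumerate as many values as you like, which also answers the original question):

from tensorboard.plugins.hparams import api as hp

# Every listed value is tried, grid-search style.
HP_DROPOUT = hp.HParam('dropout', hp.Discrete([0.1, 0.15, 0.2, 0.25, 0.3]))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))

for dropout_rate in HP_DROPOUT.domain.values:
    for optimizer in HP_OPTIMIZER.domain.values:
        print({'dropout': dropout_rate, 'optimizer': optimizer})  # one trial per combination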
Edit, 27th of July '22 (based on the comment of @dpoiesz):
Just to make it a little clearer: since values are sampled from the interval, concrete values are returned. Those are then added to the grid dimension, and grid search is performed over them.
RealInterval is a (min, max) pair from which the hparam will pick a number.
Here is a link to the implementation for better understanding.
The thing is that, as it is currently implemented, there does not seem to be any difference between the two unless you call the sample_uniform method.
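For illustration, calling it directly could look like this (a sketch; sample_uniform draws one random float from the interval, which is what gives RealInterval its random-search flavor):

from tensorboard.plugins.hparams import api as hp

HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.2))

# Draw a random dropout rate from [0.1, 0.2].
rate = HP_DROPOUT.domain.sample_uniform()
print(rate)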
Note that tf.linspace breaks the sample code mentioned above when saving the current value.
See https://github.com/tensorflow/tensorboard/issues/2348
In particular, see OscarVanL's comment about his quick-and-dirty workaround.
While running CRF++ on my training data (train.txt), I got the following error:
C:\Users\2012\Desktop\CRF_Software_Package\CRF++-0.58>crf_learn template train.data model
CRF++: Yet Another CRF Tool Kit
Copyright (C) 2005-2013 Taku Kudo, All rights reserved.
reading training data: tagger.cpp(393) [feature_index_->buildFeatures(this)]
0.00 s
My training data contains Unicode characters and was saved with Notepad (encoding: Unicode big-endian).
I am not sure whether the problem is with the template or with the format of the training data. How can I check the format of the training data?
I think this is because of your template file.
Please check whether you have included the last column, which holds the gold-standard labels, among your training features. The column index starts from 0.
E.g., if you have 6 columns in your BIO file,
the template should not contain anything like %x[0,5].
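In other words, for a 6-column file only columns 0 through 4 may appear in the template (a sketch; the exact features are up to you):

# Unigram features over the observable columns only (0..4)
U00:%x[0,0]
U01:%x[0,1]
U02:%x[0,2]
U03:%x[0,3]
U04:%x[0,4]
# %x[0,5] would leak the gold-standard label and must not appear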
The problem is with the template file:
check your features for incorrect "grammar",
i.e.
U10:%x[-1,0]/% [0,0]
You will notice that after the second % there is a missing 'x'.
The corrected line should look like the one below:
U10:%x[-1,0]/%x[0,0]
I had the same issue; my files were in UTF-8, and the template file and training file were definitely in the correct format. The reason was that CRF++ expects at most 1024 columns in the input files. It would be great if it printed an appropriate error message in such a case.
The problem is not with the Unicode encoding, but the template file.
Have a look at this similar Q: The failure in using CRF+0.58 train NE Model
I am generating an .arff file using a Java program. The file has about 600 attributes.
I am unable to open the file in Weka Explorer.
It says: "nominal value not declared in header, read Token[0], line 626."
Here is the first attribute line: @attribute vantuono numeric
Here are the first few chars of line 626: 0,0,0,0,1,0,0,0,0,1,0,1...
Why is WEKA unable to parse '0' as a numeric value?
Interestingly, this happens only in this file. I have other files with numeric attributes that accept '0' as a value.
Are you sure that your declaration is correct? The WEKA FAQ says:
nominal value not declared in header, read Token[X], line Y
If you get this error message, then you seem to have declared a nominal attribute in the ARFF header section, but Weka came across a value ("X") in the data (in line Y) for this particular attribute that wasn't listed as a possible value.
All nominal values that appear in the data must be declared in the header.
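As a made-up sketch of the two cases (ARFF comments start with %):

% Fails: the attribute is nominal and '0' is not among its declared values
@attribute vantuono {1, 2}
% Works: a numeric attribute accepts any number, including 0
@attribute vantuono numeric

So it is worth double-checking that the declaration really says numeric in the file Weka actually loads.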
There is also a bug regarding sparse ARFF files
Increase the buffer size to accommodate all the rows using the -B #noOfRecords option:
java weka.core.converters.CSVLoader filename.csv -B 33000 > filename.arff
If you get this error, it's more likely that in your dataset (after the @data line) you kept the header row (column names) that you had already declared. Remove that header line and you should be good to go.
I got the same error. Then I saw that my program was writing an extra apostrophe. When I removed the apostrophe, it worked.
I had such a problem and it cost me; so that it doesn't cost you: put the class attribute last, and make sure the data values appear in the same order as the attributes are declared in the header.