How to detect eof in ml-lex - sml

When writing a lexer in ml-lex, we need to define the eof function:
val eof = fn () => EOF;
Is this a necessary part of the specification?
Also, if I want my lexer to stop when it detects EOF, what should I add to this function?
Thanks.

The User’s Guide to ML-Lex and ML-Yacc by Roger Price is great for learning ml-lex and ml-yacc.
The eof function is mandatory in the user declarations part of your lex definition, together with the lexresult type. As the guide puts it:
The function eof is called by the lexer when the end of the input
stream is reached.
Your eof function can either raise an exception, if that is appropriate for your application, or return the EOF token. Either way, its result type has to be lexresult. There is an example in chapter 7.1.2 of the user guide which prints a string if EOF is reached in the middle of a block comment.
I use a somewhat "simpler" eof function:
structure T = Tokens
structure C = SourceData.Comments
fun eof data =
  if C.depth data = 0 then
    T.EOF (~1, ~1)
  else
    fail (C.start data) "Unclosed comment"
Here the C structure is a "special" comment-handling structure that counts the number of opening and closing comment delimiters. If the current depth is 0, eof returns the EOF token, where (~1, ~1) is used to indicate the left and right positions. As I don't use this position information for EOF, I just set it to (~1, ~1).
Normally you would then set %eop (end of parse) to use the EOF token in the yacc file, to indicate that whatever start symbol is used, it may be followed by the EOF token. Also remember to add EOF to %noshift. See section 9.4.5 for %eop and %noshift.
Obviously you have to define EOF in the %term declaration of your yacc file as well.
Hope this helps; otherwise, take a look at an MLB parser or an SML parser written in ml-lex and ml-yacc. The MLB parser is the simplest and thus might be easier to understand.

Related

How to do something else in Bison after Flex returns 0?

The Bison yyparse() function stops reading its input (file or stream) when 0 is returned.
I was wondering if there is a way to execute some more commands after that happens.
I mean, is it possible to treat 0 (or some token produced by that return) in the Bison file?
Something like:
Flex
<<EOF>> { return 0; }
Bison
%token start
start : start '0' {
// Desired something else
}
Suppose program is the top-level symbol in a grammar. That is, it is the non-terminal which must be matched by the parser's input.
Of course, it's possible that program will also be matched several times before the input terminates. For example, the grammar might look something like:
%start program
%%
program: %empty
| program declaration
In that grammar, there is no way to inject an action which is only executed when the input is fully parsed. I gather that is what you want to do.
But it's very simple to create a non-terminal whose action will only be executed once, at the end of the parse. All we need to do is insert a new "unit production" at the top of the grammar:
%start start
%%
start : program { /* Completion action */ }
program: %empty
| program declaration
Since start does not appear on the right-hand side of any production in the grammar, it can only be reduced at the end of the parse, when the parser reduces the %start symbol. So even though the production does not explicitly include the end token, we know that the end token is the lookahead token when the reduction action is executed.
Unit productions -- productions whose right-hand side contains just one symbol -- are frequently used to trigger actions at strategic points of the parse, and the above is just one example of the technique.

While loop construct in combination with getline function that continues until EOF

I am in a bind right now, and the most frustrating thing about this is that I know what the problem is but I cannot fix it :(...
My goal is to ultimately use getline to read lines of strings from redirected input (from a text file) and keep going until EOF is reached.
Example text file (contents):
Hello World!
Good Bye.
My source code (only the section that does not work):
while (!(getline(std::cin, s_array)).eof()){ // it won't read second line
//do some awesome stuff to the first line read!
}
As far as I know, getline reads everything up to the newline and stops, so how do we get it to keep reading? It always stops at Hello World!.
Use while (getline(std::cin, s_array)) { } instead.
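std::getline() returns istream&, and the stream's conversion to a boolean (istream::operator void*() before C++11, explicit operator bool() since) evaluates as false whenever failbit or badbit is set. Reaching the end of input while still extracting characters sets only eofbit, so the last line is still processed.
You should definitely read Joseph Mansfield's blog post titled "Don't condition input on eof()", which describes this pitfall in detail and provides a well-justified guideline.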

lex (flex) generated program not parsing whole input

I have a relatively simple lex/flex file and have been running it with flex's debug flag to make sure it's tokenizing properly. Unfortunately, I'm always running into one of two problems: either the program that flex generates silently gives up after a couple of tokens, or the rule I'm using to recognize characters and strings isn't called and the default rule is called instead.
Can someone point me in the right direction? I've attached my flex file and sample input / output.
Edit: I've found that the generated lexer stops after a specific rule: "cdr". This is more detailed, but also much more confusing. I've posted a shortened, modified lex file.
/* lex file*/
%option noyywrap
%option nodefault
%{
enum tokens{
CDR,
CHARACTER,
SET
};
%}
%%
"cdr" { return CDR; }
"set" { return SET; }
[ \t\r\n] /*Nothing*/
[a-zA-Z0-9\\!##$%^&*()\-_+=~`:;"'?<>,\.] { return CHARACTER; }
%%
Sample input:
set c cdra + cdr b + () ;
Complete output from running the input through the generated parser:
--(end of buffer or a NUL)
--accepting rule at line 16 ("set")
--accepting rule at line 18 (" ")
--accepting rule at line 19 ("c")
--accepting rule at line 18 (" ")
--accepting rule at line 15 ("cdr")
Any thoughts? The generated program is giving up after half of the input! (for reference, I'm doing input by redirecting the contents of a file to the generated program).
When generating a standalone lexer (that is, one whose tokens are not defined in bison/yacc), you typically write an enum at the top of the file defining your tokens. However, the main loop of a lex program, including the main loop generated by default, looks something like this:
while( token = yylex() ){
...
This is fine until your lexer matches the rule whose token appears first in the enum, in this specific case CDR. Since enums start at zero by default, yylex() returns 0 for that token, which is also the value that signals end of input, so the while loop ends. Renumbering your enum will solve the issue.
enum tokens{
CDR = 1,
CHARACTER,
SET
};
Short version: when defining tokens by hand for a lexer, start with 1 not 0.
This rule
[-+]?([0-9*\.?[0-9]+|[0-9]+\.)([Ee][-+]?[0-9]+)?
|
seems to be missing a closing bracket just after the first 0-9; I added a | below where I think it should be. I couldn't begin to guess how flex would respond to that.
The rule I usually use for the first character of a symbol name is [a-zA-Z$_]. This is like your unquoted strings,
except that I usually allow digits inside symbols as long as the symbol doesn't start with a digit:
[a-zA-Z$_]([a-zA-Z$_]|[0-9])*
A character is just a short symbol. I don't think it needs to have its own rule, but if it does, then you need to ensure that the string rule requires at least 2 characters:
[a-zA-Z$_]([a-zA-Z$_]|[0-9])+

ocamlyacc parse error: what token?

I'm using ocamlyacc and ocamllex. I have an error production in my grammar that signals a custom exception. So far, I can get it to report the error position:
| error { raise (Parse_failure (string_of_position (symbol_start_pos ()))) }
But, I also want to know which token was read. There must be a way---anyone know?
Thanks.
The best way to debug your ocamlyacc parser is to set the OCAMLRUNPARAM environment variable to include the character p. This makes the parser print all the states that it goes through, and each shift/reduce it performs.
If you are using bash, you can do this with the following command:
$ export OCAMLRUNPARAM='p'
Tokens are generated by the lexer, hence you can use the current lexer token when an error occurs:
let parse_buf_exn lexbuf =
  try
    T.input T.rule lexbuf
  with exn ->
    begin
      let curr = lexbuf.Lexing.lex_curr_p in
      let line = curr.Lexing.pos_lnum in
      let cnum = curr.Lexing.pos_cnum - curr.Lexing.pos_bol in
      let tok = Lexing.lexeme lexbuf in
      let tail = Sql_lexer.ruleTail "" lexbuf in
      raise (Error (exn,(line,cnum,tok,tail)))
    end
Lexing.lexeme lexbuf is what you need. The other parts are not necessary but useful.
ruleTail concatenates all remaining tokens into a string so the user can easily locate the error position. lexbuf.Lexing.lex_curr_p should be updated in the lexer to contain correct positions. (source)
I think that, similar to yacc, the tokens are stored in variables corresponding to the symbols in your grammar rule. Here since there is one symbol (error), you may be able to simply output $1 using printf, etc.
Edit: responding to comment.
Why do you use an error terminal? I'm reading an ocamlyacc tutorial that says a special error-handling routine is called when a parse error happens. Like so:
3.1.5. The Error Reporting Routine
When the parser function detects a
syntax error, it calls a function
named parse_error with the string
"syntax error" as argument. The
default parse_error function does
nothing and returns, thus initiating
error recovery (see Error Recovery).
The user can define a customized
parse_error function in the header
section of the grammar file such as:
let parse_error s = (* Called by the parser function on error *)
print_endline s;
flush stdout
Well, looks like you only get "syntax error" with that function though. Stay tuned for more info.

Whitespace at end of file causing EOF check to fail in C++

I am reading in data from a file that has three columns. For example the data will look something like:
3 START RED
4 END RED
To read in the data I am using the following check:
while (iFile.peek() != EOF) {
// read in column 1
// read in column 2
// read in column 3
}
My problem is that the loop usually runs one extra iteration. I am pretty sure this is because a lot of text editors seem to put a blank line after the last line of actual content.
I did a little bit of Googling and searched on SO and found some similar situations, such as "Reading from text file until EOF repeats last line"; however, I couldn't quite adapt the solution given there to solve my problem. Any suggestions?
EOF is not a prediction but an error state. Hence, you can't use it like you're using it now, to predict whether you can read Column 1, 2 and 3. For that reason, a common pattern in C++ is:
while (input >> obj1 >> obj2) {
use(obj1, obj2);
}
All operator>>(istream& is, T&) return the inputstream, and when used in boolean context the stream is "true" as long as the last extraction succeeded. It's then safe to use the extracted objects.
Presuming iFile is an istream:
You should break out of the loop on any error, not only on EOF (which can be checked for with iFile.eof(), BTW), because this is an endless loop when any format failure puts the stream into a bad state other than EOF. It is usually necessary to break out of a reading loop in the middle of the loop, after everything was read (either successfully or not), and before the data is processed.
To make sure there isn't anything interesting left, you could, after the loop, reset the stream state and then try to read whitespace only until you reach EOF:
while( !iFile.eof() )
{
    iFile >> std::ws;
    string line;
    std::getline(iFile,line);
    if(!line.empty()) error(...);
}
If any of the reads fail (where you read the column data), just break out of the while loop. Presumably you are then at the end of the file, reading the last "not correct" line.
You may also want to handle whitespace and other invalid input at that point. Some basic validation of columns 1, 2 and 3 would be desirable as well.
Don't worry about the number of times that you loop: just validate your data and handle invalid input.
Basically, check that you have three columns to read and, if you don't, decide whether that is because the file is over or because of some other issue.