OCaml lex: doesn't work at all, whatsoever - ocaml

I am at the end of my rope here. I cannot get anything to work in ocamllex, and it is driving me nuts. This is my .mll file:
{
open Parser
}
rule next = parse
| (['a'-'z'] ['a'-'z']*) as id { Identifier id }
| '=' { EqualsSign }
| ';' { Semicolon }
| '\n' | ' ' { next lexbuf }
| eof { EOF }
Here are the contents of the file I pass in as input:
a=b;
Yet, when I compile and run the thing, I get an error on the very first character, saying it's not valid. I honestly have no idea what's going on, and Google has not helped me at all. How can this even be possible? As you can see, I'm really stumped here.
EDIT:
I was working for so long that I gave up on the parser. Now this is the relevant code in my main file:
let parse_file filename =
let l = Lexing.from_channel (open_in filename) in
try
Lexer.next l; ()
with
| Failure msg ->
printf "line: %d, col: %d\n" l.lex_curr_p.pos_lnum l.lex_curr_p.pos_cnum
Prints out "line: 1, col: 1".

Without the corresponding ocamlyacc parser, nobody will be able to find the issue with your code since your lexer works perfectly fine!
I have taken the liberty of writing the following tiny parser (parser.mly) that constructs a list of identifier pairs, e.g. input "a=b;" should give the singleton list [("a", "b")].
%{%}
%token <string> Identifier
%token EqualsSign
%token Semicolon
%token EOF
%start start
%type <(string * string) list> start
%%
start:
| EOF {[]}
| Identifier EqualsSign Identifier Semicolon start {($1, $3) :: $5}
;
%%
To test whether the parser does what I promised, we create another file (main.ml) that parses the string "a=b;" and prints the result.
let print_list = List.iter (fun (a, b) -> Printf.printf "%s = %s;\n" a b)
let () = print_list (Parser.start Lexer.next (Lexing.from_string "a=b;"))
The code should compile (e.g. ocamlbuild main.byte) without any complaints and the program should output "a=b;" as promised.
In response to the latest edit:
In general, I don't believe that catching standard library exceptions that are meant to indicate failure or misuse (like Invalid_argument or Failure) is a good idea. The reason is that they are used ubiquitously throughout the library such that you usually cannot tell which function raised the exception and why it did so.
Furthermore, you are throwing away the only useful information: the error message! The error message should tell you what the source of the problem is (my best guess is an IO-related issue). Thus, you should either print the error message or let the exception propagate to the toplevel. Personally, I prefer the latter option.
However, you probably still want to deal with syntactically ill-formed inputs in a graceful manner. For this, you can define a new exception in the lexer and add a default case that catches invalid tokens.
{
exception Unexpected_token
}
...
| _ {raise Unexpected_token}
Now, you can catch the newly defined exception in your main file and, unlike before, the exception is specific to syntactically invalid inputs. Consequently, you know both the source and the cause of the exception giving you the chance to do something far more meaningful than before.
A fairly random OCaml development hint: If you compile the program with debug information enabled, setting the environment variable OCAMLRUNPARAM to "b" (e.g. export OCAMLRUNPARAM=b) enables stack traces for uncaught exceptions!

btw. ocamllex also can do the + operator for 'one or more' in regular expressions, so this
['a'-'z']+
is equivalent to your
['a'-'z']['a'-'z']*

I was just struggling with the same thing (which is how I found this question), only to finally realize that I had mistakenly specified the path to input file as Sys.argv.(0) instead of Sys.argv.(1)! LOLs
I really hope it helps! :)

It looks like you have a space in the regular expression for identifiers. This could keep the lexer from recognizing a=b, although it should still recognize a = b ;

Related

Unusual order-of-execution problem with OCaml

I'm new to OCaml. I've written this very trivial program:
open Stdio
let getline =
match In_channel.input_line In_channel.stdin with
| Some x -> x
| None -> "No input"
;;
let () =
printf "What's your name? ";
printf "Hi, %s!" getline;;
and several variations thereof.
You get the idea. I ask for the user's name and print "Hi, [name]!".
It works but the output is in the wrong order. I build the project with dune, run the executable, don't get any prompt for the name, I type it in and then I get all at once "What's your name? Hi, [name]!".
Now, I know the basics of buffering in OCaml. In fact, I remember having the same problem when dabbling in OCaml a while back. I distinctly remember reading this:
The strange order of code execution in OCaml
and I'm almost certain that adding "%!" worked. I've even tried
Out_channel.flush stdout
between the two printf's, and it doesn't work. I've tried adding a "\n" in the first printf. And I've tried this in a couple of terminals.
I'm sure it's something really trivial, possibly just basic syntax.
You defined getline as a variable, not a function.
A function must always accept one parameter, although that parameter can be ():
open Stdio
let getline () =
match In_channel.input_line In_channel.stdin with
| Some x -> x
| None -> "No input"
;;
let () =
printf "What's your name? ";
printf "Hi, %s!" (getline ());;

How to do something else in Bison after Flex returns 0?

The Bison yyparse() function stop to read its input (file or stream) when 0 is returned.
I was wondering if there is a way to execute some more commands after it occurs.
I mean, is possible to tread 0 (or some token thrown by its return) in bison file?
Something like:
Flex
<<EOF>> { return 0; }
Bison
%token start
start : start '0' {
// Desired something else
}
Suppose program is the top-level symbol in a grammar. That is, it is the non-terminal which must be matched by the parser's input.
Of course, it's possible that program will also be matched several times before the input terminates. For example, the grammar might look something like:
%start program
%%
program: %empty
| program declaration
In that grammar, there is no way to inject an action which is only executed when the input is fully parsed. I gather that is what you want to do.
But it's very simple to create a non-terminal whose action will only be executed once, at the end of the parse. All we need to do is insert a new "unit production" at the top of the grammar:
%start start
%%
start : program { /* Completion action */ }
program: %empty
| program declaration
Since start does not appear on the right-hand side of any production in the grammar, it can only be reduced at the end of the parse, when the parser reduced the %start symbol. So even though the production does not explicitly include the end token, we know that that the end token is the lookahead token when the reduction action is executed.
Unit productions -- productions whose right-hand side contains just one symbol -- are frequently used to trigger actions at strategic points of the parse, and the above is just one example of the technique.

OCaml string length limitation when reading from stdin\file

As part of a Compiler Principles course I'm taking in my university, we're writing a compiler that's implemented in OCaml, which compiles Scheme code into CISC-like assembly (which is just C macros).
the basic operation of the compiler is such:
Read a *.scm file and convert it to an OCaml string.
Parse the string and perform various analyses.
Run a code generator on the AST output from the semantic analyzer, that outputs text into a *.c file.
Compile that file with GCC and run it in the terminal.
Well, all is good and well, except for this: I'm trying to read an input file, that's around 4000 lines long, and is basically one huge expressions that's a mix of Scheme if & and.
I'm executing the compiler via utop. When I try to read the input file, I immediately get a stack overflow error message. It is my initial guess that the file is just to large for OCaml to handle, but I wasn't able to find any documentation that would support this theory.
Any suggestions?
The maximum string length is given by Sys.max_string_length. For a 32-bit system, it's quite short: 16777211. For a 64-bit system, it's 144115188075855863.
Unless you're using a 32-bit system, and your 4000-line file is over 16MB, I don't think you're hitting the string length limit.
A stack overflow is not what you'd expect to see when a string is too long.
It's more likely that you have infinite recursion, or possibly just a very deeply nested computation.
Well, it turns out that the limitation was the amount of maximum ram the OCaml is configured to use.
I ran the following command in the terminal in order to increase the quota:
export OCAMLRUNPARAM="l=5555555555"
This worked like a charm - I managed to read and compile the input file almost instantaneously.
For reference purposes, this is the code that reads the file:
let file_to_string input_file =
let in_channel = open_in input_file in
let rec run () =
try
let ch = input_char in_channel in ch :: (run ())
with End_of_file ->
( close_in in_channel;
[] )
in list_to_string (run ());;
where list_to_string is:
let list_to_string s =
let rec loop s n =
match s with
| [] -> String.make n '?'
| car :: cdr ->
let result = loop cdr (n + 1) in
String.set result n car;
result
in
loop s 0;;
funny thing is - I wrote file_to_string in tail recursion. This prevented the stack overflow, but for some reason went into an infinite loop. Oh, well...

lex (flex) generated program not parsing whole input

I have a relatively simple lex/flex file and have been running it with flex's debug flag to make sure it's tokenizing properly. Unfortunately, I'm always running into one of two problems - either the program the flex generates stops just gives up silently after a couple of tokens, or the rule I'm using to recognize characters and strings isn't called and the default rule is called instead.
Can someone point me in the right direction? I've attached my flex file and sample input / output.
Edit: I've found that the generated lexer stops after a specific rule: "cdr". This is more detailed, but also much more confusing. I've posted a shorted modified lex file.
/* lex file*/
%option noyywrap
%option nodefault
%{
enum tokens{
CDR,
CHARACTER,
SET
};
%}
%%
"cdr" { return CDR; }
"set" { return SET; }
[ \t\r\n] /*Nothing*/
[a-zA-Z0-9\\!##$%^&*()\-_+=~`:;"'?<>,\.] { return CHARACTER; }
%%
Sample input:
set c cdra + cdr b + () ;
Complete output from running the input through the generated parser:
--(end of buffer or a NUL)
--accepting rule at line 16 ("set")
--accepting rule at line 18 (" ")
--accepting rule at line 19 ("c")
--accepting rule at line 18 (" ")
--accepting rule at line 15 ("cdr")
Any thoughts? The generated program is giving up after half of the input! (for reference, I'm doing input by redirecting the contents of a file to the generated program).
When generating a lexer that's standalone (that is, not one with tokens that are defined in bison/yacc, you typically write an enum at the top of the file defining your tokens. However, the main loop of a lex program, including the main loop generated by default, looks something like this:
while( token = yylex() ){
...
This is fine, until your lexer matches the rule that appears first in the enum - in this specific case CDR. Since enums by default start at zero, this causes the while loop to end. Renumbering your enum - will solve the issue.
enum tokens{
CDR = 1,
CHARACTER,
SET
};
Short version: when defining tokens by hand for a lexer, start with 1 not 0.
This rule
[-+]?([0-9*\.?[0-9]+|[0-9]+\.)([Ee][-+]?[0-9]+)?
|
seems to be missing a closing bracket just after the first 0-9, I added a | below where I think it should be. I couldn't begin to guess how flex would respond to that.
The rule I usually use for symbol names is [a-zA-Z$_], this is like your unquoted strings
except that I usually allow numbers inside symbols as long as the symbol doesn't start with a number.
[a-zA-Z$_]([a-zA-Z$_]|[0-9])*
A characters is just a short symbol. I don't think it needs to have its own rule, but if it does, then you need to insure that the string rule requires at least 2 characters.
[a-zA-Z$_]([a-zA-Z$_]|[0-9])+

ocamlyacc parse error: what token?

I'm using ocamlyacc and ocamllex. I have an error production in my grammar that signals a custom exception. So far, I can get it to report the error position:
| error { raise (Parse_failure (string_of_position (symbol_start_pos ()))) }
But, I also want to know which token was read. There must be a way---anyone know?
Thanks.
The best way to debug your ocamlyacc parser is to set the OCAMLRUNPARAM param to include the character p - this will make the parser print all the states that it goes through, and each shift / reduce it performs.
If you are using bash, you can do this with the following command:
$ export OCAMLRUNPARAM='p'
Tokens are generated by lexer, hence you can use the current lexer token when error occurs :
let parse_buf_exn lexbuf =
try
T.input T.rule lexbuf
with exn ->
begin
let curr = lexbuf.Lexing.lex_curr_p in
let line = curr.Lexing.pos_lnum in
let cnum = curr.Lexing.pos_cnum - curr.Lexing.pos_bol in
let tok = Lexing.lexeme lexbuf in
let tail = Sql_lexer.ruleTail "" lexbuf in
raise (Error (exn,(line,cnum,tok,tail)))
end
Lexing.lexeme lexbuf is what you need. Other parts are not necessary but useful.
ruleTail will concat all remaining tokens into string for the user to easily locate error position. lexbuf.Lexing.lex_curr_p should be updated in the lexer to contain correct positions. (source)
I think that, similar to yacc, the tokens are stored in variables corresponding to the symbols in your grammar rule. Here since there is one symbol (error), you may be able to simply output $1 using printf, etc.
Edit: responding to comment.
Why do you use an error terminal? I'm reading an ocamlyacc tutorial that says a special error-handling routine is called when a parse error happens. Like so:
3.1.5. The Error Reporting Routine
When ther parser function detects a
syntax error, it calls a function
named parse_error with the string
"syntax error" as argument. The
default parse_error function does
nothing and returns, thus initiating
error recovery (see Error Recovery).
The user can define a customized
parse_error function in the header
section of the grammar file such as:
let parse_error s = (* Called by the parser function on error *)
print_endline s;
flush stdout
Well, looks like you only get "syntax error" with that function though. Stay tuned for more info.