External definitions for ocamllex regular expressions

I have implemented the usual combination of lexer/parser/pretty-printer for reading-in/printing a type in my code. I find there is redundancy among the lexer and the pretty-printer when it comes to plain-string regular expressions, usually employed for symbols, punctuation or separators.
For example I now have
rule token = parse
| "|-" { TURNSTILE }
in my lexer.mll file, and a function like:
let pp fmt (l,r) =
  Format.fprintf fmt "@[%a |-@ %a@]" Form.pp l Form.pp r
for pretty-printing. If I decide to change the string for TURNSTILE, I have to edit two places in the code, which I find less than ideal.
Apparently, the OCaml lexer supports a certain ability to define regular expressions and then refer to them within the mll file. So lexer.mll could be written as
let symb_turnstile = "|-"
rule token = parse
| symb_turnstile { TURNSTILE }
But this will not let me access symb_turnstile externally, say from my pretty-printing functions. In fact, after running ocamllex, there are no occurrences of symb_turnstile in lexer.ml. I cannot even refer to these identifiers in the OCaml epilogue of lexer.mll.
Is there any way of achieving this?

In the end, I went for the following style which I stole from the sources of ocamllex itself (so I am guessing it's standard practice). A map from strings to tokens (here an association list) is defined in the preamble of lexer.mll
let symbols =
  [
    ...
    (Symb.turnstile, TURNSTILE);
    ...
  ]
where Symb is a module defining turnstile as a string. Then, the lexing part of lexer.mll is purposely overly general:
rule token = parse
...
| punctuation
    {
      try
        List.assoc (Lexing.lexeme lexbuf) symbols
      with Not_found -> lex_error lexbuf
    }
...
where punctuation is a regular expression matching a sequence of symbols.
The pretty-printer can now be written like this.
let pp fmt (l,r) =
  Format.fprintf fmt "@[%a %s@ %a@]" Form.pp l Symb.turnstile Form.pp r
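For anyone who wants to try the idea outside a .mll file, here is a minimal, self-contained sketch of it. The Symb module contents, the ARROW token, the Lex_error exception and the helper names token_of_lexeme and show are all made up for illustration, and the sequent is simplified to a pair of strings so that Form.pp is not needed.

```ocaml
(* Hypothetical stand-ins for the real Symb module and the parser's tokens. *)
module Symb = struct
  let turnstile = "|-"
  let arrow = "->"
end

type token = TURNSTILE | ARROW

(* Single source of truth: the lexer table and the printers both use Symb. *)
let symbols = [ (Symb.turnstile, TURNSTILE); (Symb.arrow, ARROW) ]

(* Assumed error type; the real lexer calls its own lex_error instead. *)
exception Lex_error of string

(* What the action of the punctuation rule does with the matched lexeme. *)
let token_of_lexeme (lexeme : string) : token =
  try List.assoc lexeme symbols with Not_found -> raise (Lex_error lexeme)

(* Printer for a simplified sequent made of two strings. *)
let pp fmt (l, r) = Format.fprintf fmt "@[%s %s@ %s@]" l Symb.turnstile r

(* Convenience wrapper that renders a sequent to a string. *)
let show lr = Format.asprintf "%a" pp lr
```

With this layout, changing the string in Symb.turnstile changes both what the lexer accepts and what the printer emits.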

Although the two tokens both look like strings notationally, they're really very different. I don't think there's a convenient type under which they could be shared for use by ocamllex and Printf.printf. This is possibly the reason that ocamllex doesn't support such external definitions. You could probably get the effect you want with a macro facility (textual inclusion).

Related

"Eval" a string in OCaml

I'm trying to "eval" a string representing an OCaml expression in OCaml. I'm looking to do something equivalent to Python's eval.
So far I've not been able to find much. The Parsing module looks like it could be helpful, but I was not able to find a way to just eval a string.
Here is how to do it, but I didn't tell you. (Also, the Parsing module is about parsing, not executing code.)
#require "compiler-libs" (* Assuming you're using utop, if compiling then this is the package you need *)
let eval code =
  let as_buf = Lexing.from_string code in
  let parsed = !Toploop.parse_toplevel_phrase as_buf in
  ignore (Toploop.execute_phrase true Format.std_formatter parsed)
example:
eval "let () = print_endline \"hello\";;"
Notice the trailing ;; in the code sample.
To use ocamlbuild, you will need to use both compiler-libs and compiler-libs.toplevel.
OCaml is a compiled (not interpreted) language. So there's no simple way to do this. Certainly there are no language features that support it (as there are in almost every interpreted language). About the best you could do would be to link your program against the OCaml toplevel (which is an OCaml interpreter).

What is ';;' for in OCaml?

This is my simple OCaml code to print out a merged list.
let rec merge cmp x y = match x, y with
  | [], l -> l
  | l, [] -> l
  | hx::tx, hy::ty ->
    if cmp hx hy
    then hx :: merge cmp tx (hy :: ty)
    else hy :: merge cmp (hx :: tx) ty
let rec print_list = function
  | [] -> ()
  | e::l -> print_int e ; print_string " " ; print_list l
;;
print_list (merge ( <= ) [1;2;3] [4;5;6]) ;;
I copied the print_list function from Print a List in OCaml, but I had to add ';;' after the function's implementation in order to avoid this error message.
File "merge.ml", line 11, characters 47-57:
Error: This function has type int list -> unit
It is applied to too many arguments; maybe you forgot a `;'.
My question is: why is ';;' needed after print_list but not after merge?
The ;; is, in essence, a way of marking the end of an expression. It's not necessary in source code (though you can use it if you like). It is useful in the toplevel (the OCaml REPL) to cause the evaluation of what you've typed so far. Without such a symbol, the toplevel has no way to know whether you're going to type more of the expression later.
In your case, it marks the end of the expression representing the body of the function print_list. Without this marker, the following symbols look like they're part of the same expression, which leads to a type error.
For top-level expressions in OCaml files, I prefer to write something like the following:
let () = print_list (merge ( <= ) [1;2;3] [4;5;6])
If you code this way you don't need to use the ;; token in your source files.
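As a sketch of what that looks like for the file in the question, here is the same program with the final call wrapped in let () = and no ;; anywhere. Only the layout differs from the original code.

```ocaml
let rec merge cmp x y =
  match x, y with
  | [], l -> l
  | l, [] -> l
  | hx :: tx, hy :: ty ->
    if cmp hx hy then hx :: merge cmp tx (hy :: ty)
    else hy :: merge cmp (hx :: tx) ty

let rec print_list = function
  | [] -> ()
  | e :: l -> print_int e; print_string " "; print_list l

(* The keyword let starts a new top-level definition, so the parser knows
   that the body of print_list ended on the previous line. *)
let () = print_list (merge ( <= ) [1; 2; 3] [4; 5; 6])
```

Without the let () = (and without ;;), the last line would be read as further arguments to the trailing print_list l, which is exactly the "applied to too many arguments" error from the question.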
This is an expansion of Jeffrey's answer.
As you know, to parse a language a program has to break the input into manageable lexical elements, and expect these so-called lexemes (or tokens) to follow certain syntactic rules that allow them to be regrouped into larger units of meaning.
In many languages, the largest syntactic element is the statement, which subdivides into instructions and definitions. In these languages, the grammar requires a special lexeme to mark the end of some of these units, usually either instructions or statements. Other languages instead use a lexeme to separate instructions (rather than end each of them), but it's basically the same idea.
In OCaml, the grammar follows patterns that let the parsing algorithm do without such an instruction terminator (or separator) in various circumstances. The keyword let, for instance, is always necessary to introduce a new definition, so it can be used to detect the end of the preceding statement at the outermost level of the program (the top level).
However, you can easily see the problem this causes: the interactive version of OCaml would always need the beginning of the next statement to figure out where the previous one ends, and a user would never be able to provide statements one by one! The ;; special token was thus defined(*) to indicate the end of a top-level statement.
(*): I actually seem to recall that the token was originally in OCaml ancestors, but then made optional when OCaml was devised; however, don't quote me on that.

accessing part of matched string ocamllex

I am trying to arrange for ocamllex and ocamlyacc code to scan and parse a simple language. I have defined the abstract syntax for the same but am finding difficulty scanning for complex rules. Here's my code
{
type exp = B of bool | Const of float | Iszero of exp | Diff of exp*exp |
If of exp * exp * exp
}
rule scanparse = parse
|"true"| "false" as boolean {B boolean}
|['0'-'9']+ "." ['0'-'9']* as num {Const num}
|"iszero" space+ ['a'-'z']+ {??}
|'-' space+ '(' space* ['a'-'z']+ space* ',' space* ['a'-'z']+ space* ')' {??}
But I am not able to access certain portions of the matched string. Since the expression declaration is recursive, nested functions aren't helping either(?). Please help.
To elaborate slightly on my comment above, it looks to me like you're trying to use ocamllex to do what ocamlyacc is for. I think you need to define very simple tokens in ocamllex (like booleans, numbers, and variable names), then use ocamlyacc to define how they go together to make things like Iszero, Diff, and If. ocamllex isn't powerful enough to parse the structures defined by your abstract syntax.
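To make the division of labour concrete, here is a rough sketch in plain OCaml, with a hand-written tokenizer standing in for ocamllex and a hand-written recursive-descent parser standing in for ocamlyacc. The concrete syntax (iszero e, -(e, e), if e then e else e) and all function and token names are my assumptions; the question does not pin them down, and I left out variable names because the exp type has no constructor for them.

```ocaml
type exp =
  | B of bool
  | Const of float
  | Iszero of exp
  | Diff of exp * exp
  | If of exp * exp * exp

(* Flat tokens: this is all the lexing stage needs to know about. *)
type token =
  | TRUE | FALSE | NUM of float
  | ISZERO | IF | THEN | ELSE
  | MINUS | LPAREN | RPAREN | COMMA

let is_digit c = c >= '0' && c <= '9'
let is_alpha c = c >= 'a' && c <= 'z'

(* Lexing stage: characters to tokens, with no knowledge of nesting. *)
let tokenize (s : string) : token list =
  let n = String.length s in
  (* First index at or after i where predicate p fails. *)
  let rec span p i = if i < n && p s.[i] then span p (i + 1) else i in
  let rec go i acc =
    if i >= n then List.rev acc
    else
      match s.[i] with
      | ' ' | '\t' | '\n' -> go (i + 1) acc
      | '(' -> go (i + 1) (LPAREN :: acc)
      | ')' -> go (i + 1) (RPAREN :: acc)
      | ',' -> go (i + 1) (COMMA :: acc)
      | '-' -> go (i + 1) (MINUS :: acc)
      | c when is_digit c ->
        let j = span (fun c -> is_digit c || c = '.') i in
        go j (NUM (float_of_string (String.sub s i (j - i))) :: acc)
      | c when is_alpha c ->
        let j = span is_alpha i in
        let tok =
          match String.sub s i (j - i) with
          | "true" -> TRUE
          | "false" -> FALSE
          | "iszero" -> ISZERO
          | "if" -> IF
          | "then" -> THEN
          | "else" -> ELSE
          | w -> failwith ("unknown word: " ^ w)
        in
        go j (tok :: acc)
      | c -> failwith (Printf.sprintf "unexpected character %c" c)
  in
  go 0 []

(* Consume one expected token or fail. *)
let expect tok = function
  | t :: rest when t = tok -> rest
  | _ -> failwith "syntax error"

(* Parsing stage: tokens to a tree; the recursion handles the nesting. *)
let rec parse_exp (toks : token list) : exp * token list =
  match toks with
  | TRUE :: rest -> (B true, rest)
  | FALSE :: rest -> (B false, rest)
  | NUM f :: rest -> (Const f, rest)
  | ISZERO :: rest ->
    let e, rest = parse_exp rest in
    (Iszero e, rest)
  | MINUS :: LPAREN :: rest ->
    let l, rest = parse_exp rest in
    let r, rest = parse_exp (expect COMMA rest) in
    (Diff (l, r), expect RPAREN rest)
  | IF :: rest ->
    let c, rest = parse_exp rest in
    let t, rest = parse_exp (expect THEN rest) in
    let e, rest = parse_exp (expect ELSE rest) in
    (If (c, t, e), rest)
  | _ -> failwith "syntax error"

(* Parse a whole string, rejecting leftover tokens. *)
let parse (s : string) : exp =
  match parse_exp (tokenize s) with
  | e, [] -> e
  | _ -> failwith "trailing input"
```

In a real project the tokenize part would become a handful of short ocamllex rules and the parse_exp part a few ocamlyacc productions, but the split of responsibilities is the same.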
Update
Here is an ocamlyacc tutorial that I found linked from OCaml.org, which is a pretty good endorsement: OCamlYacc tutorial. I looked through it and it looks good. (When I started using ocamlyacc, I already knew yacc so I was able to get going pretty quickly.)

Regex to match exact word with Kiama Parser in Scala

I am looking for the correct regex form to give to my Kiama Packrat Parser in order that when it encounters keywords like int it recognises this is a type, and not a valid var name.
At present I have :
lazy val type_int_ = ".*\\bint\\b.*".r ^^ (s => TypeInt)
lazy val var_ =
idn ^^ TermVar
lazy val idn =
"[a-zA-Z][a-zA-Z0-9]*".r
But this does not work, so I would appreciate pointers on this.
Many thanks
I've successfully used the following approach:
val keyword = regex ("int[^a-zA-Z]".r)
val identifier = not (keyword) ~> "[a-zA-Z]+".r
In other words, recognise the keyword only if it's not followed by a character that can extend it to be an identifier. A downside is that the extension regexp is repeated in both the keyword definition and the identifier one, but that can be factored out if you want.
You've got to be a bit careful how you use the keyword parser, since it captures the character after the keyword as well. It's safe in the context of a not, since no input is consumed.
Note that whitespace usually does not need to be handled explicitly since the literal and regex parser combinators take care of it before they start parsing for what you really want.
This approach is easy to generalise to multiple keywords, by writing a method to build the keyword parser from a list of the keyword strings and the extension regular expression.
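The underlying check does not depend on the parser library. As an illustration only, here is the same idea written as a plain OCaml function rather than with the Scala combinators: a keyword from a list is recognised at a position only if the next character cannot extend it into an identifier. The names is_ident_char and keyword_at are made up, and the identifier characters are taken from the idn regex in the question.

```ocaml
(* Characters that may continue an identifier, as in [a-zA-Z0-9]. *)
let is_ident_char c =
  (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')

(* Returns the keyword that starts at pos (assumed >= 0), provided the
   character after it cannot extend it into an identifier. *)
let keyword_at (keywords : string list) (s : string) (pos : int) : string option =
  let n = String.length s in
  let matches kw =
    let k = String.length kw in
    pos + k <= n
    && String.sub s pos k = kw
    && (pos + k = n || not (is_ident_char s.[pos + k]))
  in
  List.find_opt matches keywords
```

One difference from the regex version: this sketch also accepts a keyword at the very end of the input, whereas "int[^a-zA-Z]" requires a following character.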
BTW, Kiama doesn't really provide parsing combinators. We rely on the ones in the Scala library. We do provide some extensions of the standard ones for special situations, but the basic behaviour is just straight from the library. Thus, it's not clear to me that your question actually relates to Kiama at all. As mentioned in the comments above, including a self-contained example of the problem would help us be clearer about exactly which library you are using.

Regular expression for a grammar

I'm reading finite automata & grammars from the compiler construction of Aho and I'm stuck with this grammar for so long. I don't have a clear perception of how I can describe it:
Consider the following Grammar:
S -> (L) | a
L -> L, S | S
Note that the parentheses and comma are actually terminals in this
language and appear in the sentences accepted by this grammar. Try to
describe the language generated by this grammar. Is this grammar
ambiguous?
My concern here is: Can language generated by this grammar be described as regular expressions? I'm confused about how to do it. Any help?
To show that the grammar is ambiguous, you need to be able to construct two different parse trees while parsing the same string. Your string will be comprised of "(", ")", ",", and "a", since those are the only terminal symbols in the grammar.
Try arranging those 4 terminal symbols in a few ways and see if you can show different successful parses, in the spirit of the example ambiguous grammar on Wikipedia.
Immediate left recursion tends to cause problems for some parsers. See if "a,a,a" does anything interesting on "L → L , S | S"...
My concern here is: can the language generated by this grammar be described as a regular expression? I'm confused about how to do it.
A regular expression cannot fully describe the language generated by this grammar. Writing the productions out one per line will make this more apparent:
1. S → ( L )
2. S → a
3. L → L , S
4. L → S
Pay attention to #1 and #4. L can produce S, and S can produce ( L ). This means S can produce ( S ), which can produce ( ( S ) ), ( ( ( S ) ) ), etc. ad infinitum. The key thing is that those parentheses are matched; there is the same number of "(" symbols as ")" symbols.
A regex can't do that.
Regular expressions map to finite automata, and finite automata cannot count. The language {0^n 1^n : n ≥ 0} is not regular. The language {(^n )^n : n ≥ 0}, being just a substitution of "(" for "0" and ")" for "1", isn't either. See the first examples section under Regular Languages - Wikipedia. (Notation note: s^1 is s, s^2 is ss, ..., s^n is s repeated n times.)
This means you can't use a regex to describe that part of the language. That puts it in the domain of CFGs, Turing Machines, and pushdown automata.
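To see where the extra power comes from, here is a small recursive recognizer for the language, as a sketch only. It uses the equivalent right-recursive form L → S , L | S instead of the left-recursive production (that rewrite and the name accepts are my choices, and whitespace is not handled). The nesting depth lives on the call stack, which is exactly the unbounded memory a finite automaton lacks.

```ocaml
(* S -> ( L ) | a        L -> S , L | S *)
let accepts (s : string) : bool =
  let n = String.length s in
  (* Each parser takes a start position and returns the position just past
     the phrase it recognised, or None if there is no such phrase. *)
  let rec parse_s i =
    if i >= n then None
    else
      match s.[i] with
      | 'a' -> Some (i + 1)
      | '(' -> (
        match parse_l (i + 1) with
        | Some j when j < n && s.[j] = ')' -> Some (j + 1)
        | _ -> None)
      | _ -> None
  and parse_l i =
    match parse_s i with
    | Some j when j < n && s.[j] = ',' -> parse_l (j + 1)
    | r -> r
  in
  parse_s 0 = Some n
```

For example, accepts "(a,(a,a))" holds, while accepts "((a)" does not, because the outer parenthesis is never closed.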
Regular expressions (and a library to interpret them) are a poor tool for recognizing sentences of a context-free grammar. Instead, you would want to use a parser generator like yacc, bison, or ANTLR.
I think the point of the exercise in Aho's book is to "describe the language" in words, in order to understand whether it is ambiguous. One way to approach it: can you devise a grammatical sentence that can be parsed in two different ways, given the productions of the grammar? If so, the grammar is ambiguous.