How to scan with ocamllex until end of file - ocaml

I am trying to implement a parser that reads input matching regular expressions. It asks the user to enter strings, integers, or floats. If the input is valid and the user presses Ctrl-D, it should print the value; otherwise it should show an error. The problem is that the following code does not stop when I press Ctrl-D. How do I implement an eof token and print the input?
test.mll :
{ type result = Int of int | Float of float | String of string }
let digit = ['0'-'9']
let digits = digit +
let lower_case = ['a'-'z']
let upper_case = ['A'-'Z']
let letter = upper_case | lower_case
let letters = letter +
rule main = parse
(digits)'.'digits as f { Float (float_of_string f) }
| digits as n { Int (int_of_string n) }
| letters as s { String s}
| _ { main lexbuf }
{ let newlexbuf = (Lexing.from_channel stdin) in
let result = main newlexbuf in
print_endline result }

I'd say the main problem is that each call to main produces one token, and there's only one call to main in your code. So it will process just one token.
You need to have some kind of iteration that calls main repeatedly.
There is a special pattern eof in OCamllex that matches the end of the input file. You can use this to return a special value that stops the iteration.
As a side comment, you can't call print_endline with a result as its parameter. Its parameter must be a string. You will need to write your own function for printing the results.
Update
To get an iteration, change your code to something like this:
{
let newlexbuf = Lexing.from_channel stdin in
let rec loop () =
match main newlexbuf with
| Int i -> iprint i; loop ()
| Float f -> fprint f; loop ()
| String s -> sprint s; loop ()
| Endfile -> ()
in
loop ()
}
Then add a rule something like this to your patterns:
| eof { Endfile }
Then add Endfile as an element of your type.
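The loop above calls iprint, fprint and sprint, which are left for you to write. Here is a minimal sketch of one way to do it, in plain OCaml so that it runs without ocamllex. Note the assumptions: make_lexer is a hypothetical stand-in for the generated main applied to a lexbuf, and string_of_result and collect are illustrative names that do not appear in the original code.

```ocaml
(* The token type from test.mll, extended with Endfile as described above. *)
type result = Int of int | Float of float | String of string | Endfile

(* One printer for all cases; print_endline needs a string, not a result. *)
let string_of_result = function
  | Int i -> Printf.sprintf "Int %d" i
  | Float f -> Printf.sprintf "Float %g" f
  | String s -> Printf.sprintf "String %s" s
  | Endfile -> "Endfile"

(* Hypothetical stand-in for [main lexbuf]: each call returns the next
   token from a list, and Endfile once the list is exhausted. *)
let make_lexer tokens =
  let remaining = ref tokens in
  fun () ->
    match !remaining with
    | [] -> Endfile
    | t :: rest -> remaining := rest; t

(* The iteration: keep asking for tokens until Endfile shows up. *)
let rec collect next acc =
  match next () with
  | Endfile -> List.rev acc
  | tok -> collect next (string_of_result tok :: acc)

let () =
  let next = make_lexer [Int 42; Float 3.5; String "abc"] in
  List.iter print_endline (collect next [])
```

In the real lexer, replacing next () with main newlexbuf should give the same loop shape; the eof rule is what produces Endfile.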
I assume this is homework, so make sure you see how the iteration works. Aside from the details of ocamllex, that's something you want to master (apologies for the unsolicited advice).

Related

Pretty-printing with a comment string prefixing a box

I am trying to generate a text file for use in another program. This program only has line-style comments. I want to pretty-print a comment so that, whenever the line is broken, the new line is prefixed by //.
Here is what I have so far:
type elaborate_type = A | B
let elaborate_to_string = function
| A -> "OK, this is type A, but long"
| B -> "B"
let pp_elaborate chan v = Format.pp_print_string chan (elaborate_to_string v)
let () =
  Format.printf "@[<hv2>{@,@[<hov>// Here is a long comment I want to break@ // \
    here, but also indent. It should also be the case that anything@ // \
    I put here (such as some complex printable term \"%a\") should@ // \
    only break if it has //, too).@]@,\
    @[...@]\
    @]@,}@."
    pp_elaborate A
which gives the output
{
// Here is a long comment I want to break
// here, but also indent. It should also be the case that anything
// I put here (such as some complex printable term "OK, this is type A, but long") should
// only break if it has //, too).
...
}
Is there a way to do this without manually adding the @ // break at the end of each line where I want to break?
One option for solving this issue is to update the newline function of the formatter so that it prints // right after the newline:
let add_double_slash_after_linebreak_and_before_indents fmt =
let fns = Format.pp_get_formatter_out_functions fmt () in
let out_newline () =
fns.out_newline ();
fns.out_string "//" 0 2
in
Format.pp_set_formatter_out_functions fmt { fns with out_newline}
let () =
let () =
add_double_slash_after_linebreak_and_before_indents Format.std_formatter
in
Format.printf "@[<v 2>This tests the formatting@,One line@,two line @]"
This tests the formatting
//  One line
//  two line
However, the double slashes // will then appear at the start of the line, before the indentation. If you prefer them to appear after the indentation, you can update the indentation function of the formatter instead:
let add_double_slash_after_linebreak_and_indents fmt =
let fns = Format.pp_get_formatter_out_functions fmt () in
let out_indent n =
fns.out_indent n;
fns.out_string "//" 0 2
in
Format.pp_set_formatter_out_functions fmt { fns with out_indent}
let () =
let () =
add_double_slash_after_linebreak_and_indents Format.std_formatter
in
Format.printf "@[<v 2>This tests the formatting@,One line@,two line @]"
This tests the formatting
  //One line
  //two line
Concerning your follow-up question, any \n in a string will mess up the formatting if it is printed with %s. You can avoid this issue by using pp_print_text, which replaces each space and \n in the string by calls to pp_print_space and pp_force_newline.
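As a rough illustration of how pp_print_text combines with the out_newline approach, here is a small sketch. It writes into a Buffer instead of stdout so the result can be inspected; render_comment and its "// " prefix (with a trailing space) are names and choices made up for this example.

```ocaml
(* Render [text] as a line comment: every line break the formatter emits,
   including those forced by '\n' in the text, is followed by "// ". *)
let render_comment margin text =
  let buf = Buffer.create 64 in
  let fmt = Format.formatter_of_buffer buf in
  Format.pp_set_margin fmt margin;
  let fns = Format.pp_get_formatter_out_functions fmt () in
  let out_newline () =
    fns.out_newline ();
    fns.out_string "// " 0 3
  in
  Format.pp_set_formatter_out_functions fmt { fns with out_newline };
  (* pp_print_text turns spaces into break hints and '\n' into forced
     newlines, so the embedded newline goes through out_newline too. *)
  Format.fprintf fmt "@[<hov>// %a@]@?" Format.pp_print_text text;
  Buffer.contents buf

let () =
  print_endline (render_comment 80 "first line\nsecond line")
```

One caveat with this sketch: the formatter does not know about the extra "// " characters, so lines may end up slightly wider than the margin.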

"decimal literal empty" when combining several strings for a regex in Rust

I'm looking to parse a string to create a vector of floats:
fn main() {
let vector_string: &str = "{12.34, 13.}";
let vec = parse_axis_values(vector_string);
// --- expected output vec: Vec<f32> = vec![12.34, 13.]
}
use regex::Regex;
pub fn parse_axis_values(str_values: &str) -> Vec<f32> {
let pattern_float = String::from(r"\s*(\d*.*\d*)\s*");
let pattern_opening = String::from(r"\s*{{");
let pattern_closing = String::from(r"}}\s*");
let pattern =
pattern_opening + "(" + &pattern_float + ",)*" + &pattern_float + &pattern_closing;
let re = Regex::new(&pattern).unwrap();
let mut vec_axis1: Vec<f32> = Vec::new();
// --- snip : for loop for adding the elements to the vector ---
vec_axis1
}
This code compiles but an error arises at runtime when unwrapping the Regex::new():
regex parse error:
\s*{{(\s*(\d*.*\d*)\s*,)*\s*(\d*.*\d*)\s*}}\s*
^
error: decimal literal empty
According to other posts, this error can arise when escaping the curly bracket { is not properly done, but I think I escaped the bracket properly.
What is wrong with this regex?
There are several problems in your code:
Escaping a { in a regex is done with \{. Doubling it as {{ only works in format strings, not in a regex.
Your unescaped . matches any character rather than a literal decimal point. You must escape it as \. to get what you want.
You're capturing more than just the number, which makes the parsing more complex.
Your regex building is unnecessarily verbose; you can comment the regex without splitting it into pieces.
Here's a proposed improved version:
use regex::Regex;
pub fn parse_axis_values(str_values: &str) -> Vec<f32> {
let re = Regex::new(r"(?x)
\s*\{\s* # opening
(\d*\.\d*) # captured float
\s*,\s* # separator
\d*\.\d* # ignored float
\s*\}\s* # closing
").unwrap();
let mut vec_axis1: Vec<f32> = Vec::new();
if let Some(c) = re.captures(str_values) {
if let Some(g) = c.get(1) {
vec_axis1.push(g.as_str().parse().unwrap());
}
}
vec_axis1
}
fn main() {
let vector_string: &str = "{12.34, 13.}";
let vec = parse_axis_values(vector_string);
println!("v: {:?}", vec);
}
If you call this function several times, you might want to avoid recompiling the regex at each call too.
I want to be able to match 0.123, .123, 123 or 123.; using \d+ would rule out some of these possibilities
It looks like you want to fetch all the floats in the string. This could be simply done like this:
use regex::Regex;
pub fn parse_axis_values(str_values: &str) -> Vec<f32> {
let re = Regex::new(r"\d*\.\d*").unwrap();
let mut vec_axis1: Vec<f32> = Vec::new();
for c in re.captures_iter(str_values) {
vec_axis1.push(c[0].parse().unwrap());
}
vec_axis1
}
If you want both:
to check the complete string is correctly wrapped between { and }
to capture all numbers
Then you could either:
combine two regexes (the first one used to extract the internal part)
use a Serde-based parser (I wouldn't at this point but it would be interesting if the problem's complexity grows)

How to append end of file to a string

I just hit that "problem" : is there a smart way to insert the end of file (ASCII 0) character in a string?
By "smart", I mean something better than
let s = "foo" ^ (String.make 1 (Char.chr 0))
let s = "foo\000"
that is, something which would reflect that we are adding an EOF, not a "mystery char whose ASCII value is 0".
EDIT:
Mmh... indeed I was messing with eof being a char. But anyway, in C you can have
#include <stdio.h>
int main(void)
{
char a = getchar();
if (a = EOF)
printf("eof");
else
printf("not eof");
return 0;
}
Here you can test whether a char is an EOF (and (int) EOF is -1, not 0 as I was thinking). Similarly, you can set a char to be EOF, etc.
My question is: Is it possible to have something similar in ocaml ?
As @melpomene says, there is no EOF character, and '\000' really is just a character. So there's no real answer to your question as near as I can tell.
You can define your own name for a string consisting of just the NUL character (as we used to call it):
let eof = "\000"
Then your function looks like this:
let add_eof s = s ^ eof
Your C has two errors. First, you assign EOF to a instead of comparing a with EOF. Second, getchar() returns an int. It returns an int expressly so that it can return EOF, a value not representable by a char. Your code (with the first error corrected), which assigns getchar()'s value to a char before testing it, will fail to process a file with a char of value 255 in it:
$ gcc -Wall getchar.c -o getchar
$ echo -e "\xFF" > fake-eof
$ echo " " > space
$ ./getchar < fake-eof
eof
$ ./getchar < space
not eof
Having getchar return an int, a larger type that can hold every char value plus an out-of-band EOF marker, is a trick that OCaml's richer type system makes wholly unnecessary. OCaml could have any of the following:
(* using hypothetical c_getchar, a wrapper for the getchar() in C that returns an int *)
let getchar_opt () =
match c_getchar () with
| -1 -> None
| c -> Some (char_of_int c)
let getchar_exn () =
match c_getchar () with
| -1 -> raise End_of_file
| c -> char_of_int c
type 'a ior = EOF | Value of 'a
let getchar_ior () =
match c_getchar () with
| -1 -> EOF
| c -> Value (char_of_int c)
Of course Pervasives.input_char in OCaml raises an exception on EOF rather than doing one of these other things. If you want a non-exceptional interface, you could wrap input_char with your own version that catches the exception, or you could - depending on your program - use Unix.read instead, which returns the number of bytes it was able to read, which is 0 on EOF.
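For the first suggestion, a wrapper that catches the exception might look like the sketch below. input_char_opt and read_all are names chosen for this example; input_char itself lives in Stdlib in current OCaml (Pervasives is the older name for the same module). The demo writes a byte of value 255 to a temporary file to show that, unlike the C version above, it is not mistaken for end of file.

```ocaml
(* Non-raising variant of input_char: None signals end of file. *)
let input_char_opt ic =
  try Some (input_char ic) with End_of_file -> None

(* Read a whole channel one char at a time using the option interface. *)
let read_all ic =
  let buf = Buffer.create 16 in
  let rec loop () =
    match input_char_opt ic with
    | Some c -> Buffer.add_char buf c; loop ()
    | None -> Buffer.contents buf
  in
  loop ()

let () =
  (* Round-trip a string containing byte 255 through a temporary file. *)
  let path = Filename.temp_file "eof_demo" ".bin" in
  let oc = open_out_bin path in
  output_string oc "a\255b";
  close_out oc;
  let ic = open_in_bin path in
  let contents = read_all ic in
  close_in ic;
  Sys.remove path;
  Printf.printf "read %d bytes\n" (String.length contents)
```

Because end of file is a separate constructor (None) rather than a special char value, every one of the 256 possible chars remains usable as data.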

ocamllex regex syntax error

I have some basic ocamllex code, which was written by my professor, and seems to be fine:
{ type token = EOF | Word of string }
rule token = parse
| eof { EOF }
| [’a’-’z’ ’A’-’Z’]+ as word { Word(word) }
| _ { token lexbuf }
{
(*module StringMap = BatMap.Make(String) in *)
let lexbuf = Lexing.from_channel stdin in
let wordlist =
let rec next l = match token lexbuf with
EOF -> l
| Word(s) -> next (s :: l)
in next []
in
List.iter print_endline wordlist
}
However, running ocamllex wordcount.mll produces
File "wordcount.mll", line 4, character 3: syntax error.
This indicates that there is an error at the first [ in the regex in the fourth line here. What is going on?
You seem to have curly quotes (also called "smart quotes" -- ugh) in your text. You need regular old single quotes.
curly quote: ’
old fashioned single quote: '
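If such characters are hard to spot by eye, a quick check for non-ASCII bytes can help. The following is only a sketch: first_non_ascii is a made-up helper, and the curly quote is written as its UTF-8 byte sequence (E2 80 99) so that the example itself stays plain ASCII.

```ocaml
(* Return the index of the first byte outside the ASCII range, if any. *)
let first_non_ascii line =
  let n = String.length line in
  let rec go i =
    if i >= n then None
    else if Char.code line.[i] > 127 then Some i
    else go (i + 1)
  in
  go 0

let () =
  (* The same rule twice: once with plain quotes, once with curly quotes. *)
  let good = "| ['a'-'z' 'A'-'Z']+ as word { Word(word) }" in
  let bad = "| [\xE2\x80\x99a\xE2\x80\x99-\xE2\x80\x99z\xE2\x80\x99]+ as word { Word(word) }" in
  List.iter
    (fun line ->
       match first_non_ascii line with
       | Some i -> Printf.printf "non-ASCII byte at index %d\n" i
       | None -> print_endline "ASCII only")
    [good; bad]
```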

token meaning dependent on context

I have a weird string syntax where the meaning of a
delimiter depends on context. In the following sample
input:
( (foo) (bar) )
the result is a list of two strings ["foo"; "bar"].
The outer pair of parentheses enters list mode.
Then, the next pair of parentheses delimits the string.
Inside strings, balanced pairs of parentheses are to be
treated as part of the string.
Right now the lexer decides what to return depending
on a global variable inside.
{
open Sample_parser
exception Error of string
let inside = ref false (* <= to be eliminated *)
}
The delimiters are parentheses. If the lexer hits an
opening parenthesis, then
if inside is false, it emits an
Enter token and inside is set to true.
If inside is true, it switches to a string lexer
which treats any properly nested pair of parentheses
as part of the string. If the nesting level returns to
zero, the string buffer is passed to the parser.
If a closing parenthesis is encountered outside a string,
a Leave token is emitted and inside is unset.
My question is: How do I rewrite the lexer without
the global variable inside?
Fwiw I use menhir but afaict the same would be true for
ocamlyacc.
(Sorry if this sounds confused, I’m really a newbie to
the yacc/lex approach.
I can express all the above without thinking as a PEG but I
haven’t got used to mentally keeping lexer and parser
separated.
Feel free to point out other issues with the code!)
Simple example: *sample_lexer.mll*
{
open Sample_parser
exception Error of string
let inside = ref false (* <= to be eliminated *)
}
let lpar = "("
let rpar = ")"
let ws = [' ' '\t' '\n' '\r']
rule tokenize = parse
| ws { tokenize lexbuf }
| lpar { if not !inside then begin
inside := true;
Enter
end else begin
let buf = Buffer.create 20 in
String (string_scanner
(Lexing.lexeme_start lexbuf)
0
buf
lexbuf)
end }
| rpar { inside := false; Leave }
and string_scanner init depth buf = parse
| rpar { if depth = 0 then begin
Buffer.contents buf;
end else begin
Buffer.add_char buf ')';
string_scanner init (depth - 1) buf lexbuf end }
| lpar { Buffer.add_char buf '(';
string_scanner init (depth + 1) buf lexbuf }
| eof { raise (Error (Printf.sprintf
"Unexpected end of file inside string, pos %d--%d]!\n"
init
(Lexing.lexeme_start lexbuf))) }
| _ as chr { Buffer.add_char buf chr;
string_scanner init depth buf lexbuf }
sample_parser.mly:
%token <string> String
%token Enter
%token Leave
%start <string list> process
%%
process:
| Enter lst = string_list Leave { lst }
string_list:
| elm = element lst = string_list { elm :: lst }
| elm = element { [elm] }
element:
| str = String { str }
main.ml:
open Batteries
let sample_input = "( (foo (bar) baz) (xyzzy) )"
(* EibssssssssssssseibssssseiL
* where E := enter inner
* L := leave inner
* i := ignore (whitespace)
* b := begin string
* e := end string
* s := part of string
*
* desired result: [ "foo (bar) baz"; "xyzzy" ] (type string list)
*)
let main () =
let buf = Lexing.from_string sample_input in
try
List.print
String.print stdout
(Sample_parser.process Sample_lexer.tokenize buf);
print_string "\n";
with
| Sample_lexer.Error msg -> Printf.eprintf "%s%!" msg
| Sample_parser.Error -> Printf.eprintf
"Invalid syntax at pos %d.\n%!"
(Lexing.lexeme_start buf)
let _ = main ()
You can pass the state as an argument to tokenize. It still has to be mutable, but not global.
rule tokenize inside = parse
| ws { tokenize inside lexbuf }
| lpar { if not !inside then begin
inside := true;
Enter
end else begin
let buf = Buffer.create 20 in
String (string_scanner
(Lexing.lexeme_start lexbuf)
0
buf
lexbuf)
end }
| rpar { inside := false; Leave }
And you call the parser as follows:
Sample_parser.process (Sample_lexer.tokenize (ref false)) buf
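To see the same idea outside of ocamllex, here is a hand-written sketch in plain OCaml where the state is threaded through as an ordinary function argument. This is not the generated lexer: tokenize here returns the whole token list at once, which is why inside can be an immutable bool, whereas the ocamllex version is called once per token by the parser and therefore still needs the ref. The token type is redeclared locally and scan_string is a name invented for this example.

```ocaml
type token = Enter | Leave | String of string

exception Error of string

(* Scan a string whose opening paren has already been consumed. Returns
   the contents and the index just past the matching closing paren. *)
let scan_string s start =
  let buf = Buffer.create 20 in
  let rec go i depth =
    if i >= String.length s then
      raise (Error "unexpected end of input inside string")
    else
      match s.[i] with
      | ')' when depth = 0 -> (Buffer.contents buf, i + 1)
      | ')' -> Buffer.add_char buf ')'; go (i + 1) (depth - 1)
      | '(' -> Buffer.add_char buf '('; go (i + 1) (depth + 1)
      | c -> Buffer.add_char buf c; go (i + 1) depth
  in
  go start 0

(* [inside] is a plain argument: no global, no mutation. *)
let tokenize s =
  let rec go i inside acc =
    if i >= String.length s then List.rev acc
    else
      match s.[i] with
      | ' ' | '\t' | '\n' | '\r' -> go (i + 1) inside acc
      | '(' when not inside -> go (i + 1) true (Enter :: acc)
      | '(' ->
        let str, next = scan_string s (i + 1) in
        go next inside (String str :: acc)
      | ')' -> go (i + 1) false (Leave :: acc)
      | c -> raise (Error (Printf.sprintf "unexpected character %c at %d" c i))
  in
  go 0 false []

let () =
  let tokens = tokenize "( (foo (bar) baz) (xyzzy) )" in
  Printf.printf "%d tokens\n" (List.length tokens)
```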