Is trailing whitespace forbidden in S-expressions? - ocaml

When I try sexplib, it tells me that
Sexp.of_string " a";; is correct, but
Sexp.of_string "a ";; is wrong.
Is trailing whitespace forbidden in sexps?
Why?

According to the informal grammar specification, whitespace should be ignored on both sides of an atom:
{2 Syntax Specification of S-expressions}
{9 Lexical conventions of S-expression}
Whitespace, which consists of space, newline, carriage return,
horizontal tab and form feed, is ignored unless within an
OCaml-string, where it is treated according to OCaml-conventions. The
semicolon introduces comments. Comments are ignored, and range up to
the next newline character. The left parenthesis opens a new list,
the right parenthesis closes it again. Lists can be empty. The
double quote denotes the beginning and end of a string following the
lexical conventions of OCaml (see OCaml-manual for details). All
characters other than double quotes, left- and right parentheses, and
whitespace are considered part of a contiguous string.
Indeed, you can read an atom with trailing whitespace from a file without any errors.
The error is raised by the function Pre_sexp.of_string_bigstring when a parser returns successfully but something is left in the buffer. So the main question is why something was left in the buffer. It seems that there exist several parsers, and that files and strings are parsed with different ones.
I've examined the parse_atom rule defined at pre_sexp.ml:699 (all locations are for this commit) and discovered that when the trailing whitespace is hit, bump_found_atom is called. Then, if something is on the stack, the position indicator is incremented and parsing continues. Otherwise, parsing finishes, but the position is not incremented. This can be fixed with a simple patch:
diff --git a/lib/pre_sexp.ml b/lib/pre_sexp.ml
index 86603f3..9690c0f 100644
--- a/lib/pre_sexp.ml
+++ b/lib/pre_sexp.ml
@@ -502,7 +502,7 @@ let mk_cont_parser cont_parse = (); fun _state str ~max_pos ~pos ->
let pbuf_str = Buffer.contents pbuf in \
let atom = MK_ATOM in \
match GET_PSTACK with \
- | [] -> Done (atom, mk_parse_pos state pos) \
+ | [] -> Done (atom, mk_parse_pos state (pos + 1)) \
| rev_sexp_lst :: sexp_stack -> \
Buffer.clear pbuf; \
let pstack = (atom :: rev_sexp_lst) :: sexp_stack in \
After this patch, the following code produces the expected 'a', 'a', 'a' output:
let s1 = Sexp.of_string " a" in
let s2 = Sexp.of_string "a " in
let s3 = Sexp.of_string " a " in
printf "'%s', '%s', '%s'\n"
(Sexp.to_string s1)
(Sexp.to_string s2)
(Sexp.to_string s3);
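To see why leftover input triggers the error, here is a minimal sketch in plain OCaml. It is not sexplib's code: is_ws, skip_ws, scan_atom, of_string_strict and of_string_lenient are hypothetical names, and only bare atoms are handled (no lists, quoted strings or comments). The strict variant fails as soon as the scanner stops in front of unconsumed whitespace, which is the same symptom as above; the lenient variant consumes the trailing whitespace first.

```ocaml
(* Whitespace as listed in the lexical conventions: space, newline,
   carriage return, horizontal tab and form feed. *)
let is_ws = function
  | ' ' | '\n' | '\r' | '\t' | '\012' -> true
  | _ -> false

(* Skip whitespace starting at [pos]; return the first non-whitespace index. *)
let rec skip_ws s pos =
  if pos < String.length s && is_ws s.[pos] then skip_ws s (pos + 1) else pos

(* Scan one bare atom; return it with the position where scanning stopped. *)
let scan_atom s pos =
  let start = skip_ws s pos in
  let rec go i =
    if i < String.length s && not (is_ws s.[i]) then go (i + 1) else i
  in
  let stop = go start in
  (String.sub s start (stop - start), stop)

(* Strict: anything left after the atom, even whitespace, is an error. *)
let of_string_strict s =
  let atom, stop = scan_atom s 0 in
  if stop = String.length s then Ok atom else Error "trailing input"

(* Lenient: trailing whitespace is consumed before the leftover check. *)
let of_string_lenient s =
  let atom, stop = scan_atom s 0 in
  if skip_ws s stop = String.length s then Ok atom else Error "trailing input"
```

With these definitions, of_string_strict accepts " a" but rejects "a ", while of_string_lenient accepts " a", "a " and " a ".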

Related

Changing the State of Lexing.lexbuf

I am writing a lexer for Brainfuck with ocamllex, and to implement its loop, I need to change the state of lexbuf so that it can return to a previous position in the stream.
Background info on Brainfuck (skippable)
In Brainfuck, a loop is written as a pair of square brackets with
the following rules:
[ -> proceed and evaluate the next token
] -> if the current cell's value is not 0, return to the matching [
Thus, the following code evaluates to 15:
+++ [ > +++++ < - ] > .
It reads:
In the first cell, assign 3 (increment 3 times)
Enter loop, move to the next cell
Assign 5 (increment 5 times)
Move back to the first cell, and subtract 1 from its value
Hit the closing square bracket; the current cell (the first) now equals 2, so jump back to [ and proceed into the loop again
Keep going until the first cell equals 0, then exit the loop
Move to the second cell and output the value with .
The value in the second cell would have been incremented to 15
(incremented by 5 for 3 times).
Problem:
Basically, I wrote two functions in the header section of the brainfuck.mll file to push and pop the position of the last [, namely push_curr_p and pop_last_p, which push and pop the lexbuf's current position to and from an int list ref named loopstack:
{ (* Header *)
let tape = Array.make 100 0
let tape_pos = ref 0
let loopstack = ref []
let push_curr_p (lexbuf: Lexing.lexbuf) =
let p = lexbuf.Lexing.lex_curr_p in
let curr_pos = p.Lexing.pos_cnum in
(* Saving / pushing the position of `[` to loopstack *)
( loopstack := curr_pos :: !loopstack
; lexbuf
)
let pop_last_p (lexbuf: Lexing.lexbuf) =
match !loopstack with
| [] -> lexbuf
| hd :: tl ->
(* This is where I attempt to bring lexbuf back *)
( lexbuf.Lexing.lex_curr_p <- { lexbuf.Lexing.lex_curr_p with Lexing.pos_cnum = hd }
; loopstack := tl
; lexbuf
)
}
(* Rules *)
rule brainfuck = parse
| '[' { brainfuck (push_curr_p lexbuf) }
| ']' { (* current cell's value must be 0 to exit the loop *)
if tape.(!tape_pos) = 0
then brainfuck lexbuf
(* this needs to bring lexbuf back to the previous `[`
* and proceed with the parsing
*)
else brainfuck (pop_last_p lexbuf)
}
(* ... other rules ... *)
The other rules work just fine, but the lexer seems to ignore [ and ]. The problem is obviously in loopstack and in how I get and set the lex_curr_p state. I would appreciate any leads.
lex_curr_p is meant to keep track of the current position, so that you can use it in error messages and the like. Setting it to a new value won't make the lexer actually seek back to an earlier position in the file. In fact I'm 99% sure that you can't make the lexer loop like that no matter what you do.
So you can't use ocamllex to implement the whole interpreter like you're trying to do. What you can do (and what ocamllex is designed to do) is to translate the input stream of characters into a stream of tokens.
In other languages that means translating a character stream like var xyz = /* comment */ 123 into a token stream like VAR, ID("xyz"), EQ, INT(123). So lexing helps in three ways: it finds where one token ends and the next begins, it categorizes tokens into different types (identifiers vs. keywords etc.), and it discards tokens you don't need (white space and comments). This can simplify further processing a lot.
Lexing Brainfuck is a lot less helpful as all Brainfuck tokens only consist of a single character anyway. So finding out where each token ends and the next begins is a no-op and finding out the type of the token just means comparing the character against '[', '+' etc. So the only useful thing a Brainfuck lexer does is to discard whitespace and comments.
So what your lexer would do is turn the input [,[+-. lala comment ]>] into something like LOOP_START, IN, LOOP_START, INC, DEC, OUT, LOOP_END, MOVE_RIGHT, LOOP_END, where LOOP_START etc. belong to a discriminated union that you (or your parser generator if you use one) defined and made the lexer output.
If you want to use a parser generator, you'd define the token types in the parser's grammar and make the lexer produce values of those types. Then the parser can just parse the token stream.
If you want to do the parsing by hand, you'd call the lexer's token function yourself in a loop to get all the tokens. To implement loops, you'd have to store the already-consumed tokens somewhere so that you can jump back. That would end up being more work than just reading the input into a string, but for a learning exercise I suppose that doesn't matter.
That said, I would recommend going all the way and using a parser generator to create an AST. That way you don't have to create a buffer of tokens for looping and having an AST actually saves you some work (you no longer need a stack to keep track of which [ belongs to which ]).
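As a rough illustration of the AST approach, here is a hand-written sketch in plain OCaml that does not use ocamllex or a parser generator at all. The names instr, parse and run are made up for this example, the 100-cell tape mirrors the question's header, and the byte wrap-around and the "0 on end of input" behaviour of , are assumptions rather than anything the question specifies.

```ocaml
(* AST: a loop owns its body, so no bracket-matching stack is needed. *)
type instr =
  | Inc | Dec | Left | Right | Out | In
  | Loop of instr list

(* Parse source text into an AST. Any character that is not one of the
   eight commands is treated as a comment and skipped. *)
let parse (src : string) : instr list =
  let n = String.length src in
  (* [block in_loop i acc] reads instructions from index [i] and returns
     them with the index just past the closing ']' (or [n] at top level). *)
  let rec block in_loop i acc =
    if i >= n then
      (if in_loop then failwith "unmatched [" else (List.rev acc, i))
    else
      match src.[i] with
      | '+' -> block in_loop (i + 1) (Inc :: acc)
      | '-' -> block in_loop (i + 1) (Dec :: acc)
      | '<' -> block in_loop (i + 1) (Left :: acc)
      | '>' -> block in_loop (i + 1) (Right :: acc)
      | '.' -> block in_loop (i + 1) (Out :: acc)
      | ',' -> block in_loop (i + 1) (In :: acc)
      | '[' ->
        let body, next = block true (i + 1) [] in
        block in_loop next (Loop body :: acc)
      | ']' ->
        if in_loop then (List.rev acc, i + 1) else failwith "unmatched ]"
      | _ -> block in_loop (i + 1) acc
  in
  fst (block false 0 [])

(* Run a program on a 100-cell tape of bytes. [input] feeds ',' and the
   text written by '.' is returned as a string. *)
let run ?(input = "") (prog : instr list) : string =
  let tape = Array.make 100 0 in
  let pos = ref 0 and next_in = ref 0 in
  let out = Buffer.create 16 in
  let rec exec = function
    | Inc -> tape.(!pos) <- (tape.(!pos) + 1) land 255
    | Dec -> tape.(!pos) <- (tape.(!pos) - 1) land 255
    | Left -> decr pos
    | Right -> incr pos
    | Out -> Buffer.add_char out (Char.chr tape.(!pos))
    | In ->
      if !next_in < String.length input then begin
        tape.(!pos) <- Char.code input.[!next_in];
        incr next_in
      end else tape.(!pos) <- 0
    | Loop body ->
      (* Repeat the body while the current cell is non-zero. *)
      while tape.(!pos) <> 0 do List.iter exec body done
  in
  List.iter exec prog;
  Buffer.contents out
```

For the example from the question, run (parse "+++ [ > +++++ < - ] > .") returns a one-character string whose character code is 15. Looping back is just re-running the loop body, so nothing has to be re-lexed.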

Split string with specified delimiter in lua

I'm trying to create a split() function in Lua with a delimiter of my choice, where the default is a space.
The default works fine. The problem starts when I give a delimiter to the function. For some reason it doesn't return the last substring.
The function:
function split(str,sep)
if sep == nil then
words = {}
for word in str:gmatch("%w+") do table.insert(words, word) end
return words
end
return {str:match((str:gsub("[^"..sep.."]*"..sep, "([^"..sep.."]*)"..sep)))} -- BUG!! doesnt return last value
end
I try to run this:
local str = "a,b,c,d,e,f,g"
local sep = ","
t = split(str,sep)
for i,j in ipairs(t) do
print(i,j)
end
and I get:
1 a
2 b
3 c
4 d
5 e
6 f
Can't figure out where the bug is...
When splitting strings, the easiest way to avoid corner cases is to append the delimiter to the string, when you know the string cannot end with the delimiter:
str = "a,b,c,d,e,f,g"
str = str .. ','
for w in str:gmatch("(.-),") do print(w) end
Alternatively, you can use a pattern with an optional delimiter:
str = "a,b,c,d,e,f,g"
for w in str:gmatch("([^,]+),?") do print(w) end
Actually, we don't need the optional delimiter since we're capturing non-delimiters:
str = "a,b,c,d,e,f,g"
for w in str:gmatch("([^,]+)") do print(w) end
Here's my go-to split() function:
-- split("a,b,c", ",") => {"a", "b", "c"}
function split(s, sep)
local fields = {}
local sep = sep or " "
local pattern = string.format("([^%s]+)", sep)
string.gsub(s, pattern, function(c) fields[#fields + 1] = c end)
return fields
end
The pattern "[^"..sep.."]*"..sep is what causes the problem. You are matching a run of characters that are not the separator, followed by the separator. However, the last substring you want to match (g) is not followed by the separator character.
The quickest way to fix this is to also consider \0 a separator ("[^"..sep.."\0]*"..sep), as it represents the beginning and/or the end of the string. This way, g, which is not followed by a separator but by the end of the string would still be considered a match.
I'd say your approach is overly complicated in general. First of all, you can just match individual substrings that do not contain the separator; secondly, you can do this in a for loop using the gmatch function:
local result = {}
for field in your_string:gmatch(("[^%s]+"):format(your_separator)) do
table.insert(result, field)
end
return result
EDIT: The above code, simplified a bit:
local pattern = "[^%" .. your_separator .. "]+"
for field in string.gmatch(your_string, pattern) do
-- ...and so on (The rest should be easy enough to understand)
EDIT2: Keep in mind that you should also escape your separators. A separator like % could cause problems if you don't escape it as %%:
function escape(str)
-- the outer parentheses drop gsub's second return value (the match count)
return (str:gsub("([%^%$%(%)%%%.%[%]%*%+%-%?])", "%%%1"))
end

Is there a whitelist or blacklist of characters for custom sml infixes?

infix 3 .. errors out. Which characters are allowed or not allowed for defining custom infixes? Where might I find a list online?
thanks
You may infix any non-qualified identifier.
The following is from the SML '90 Definition:
The following are the reserved words used in the Core. They may not (except =) be used as identifiers.
abstype and andalso as case do datatype else
end exception fn fun handle if in infix
infixr let local nonfix of op open orelse
raise rec then type val with withtype while
( ) [ ] { } , : ; ... _ | = => -> #
....
An identifier is either alphanumeric: any sequence of letters,
digits or primes (') and underbars (_) starting with a letter or
prime, or symbolic: any non-empty sequence of the following
symbols:
! % & $ # + - / : < = > ? @ \ ~ ` ^ | *
In either case, however, reserved words are excluded. This means that
for example # and | are not identifiers, but ## and |=| are
identifiers. The only exception to this rule is that the symbol =,
which is a reserved word, is also allowed as an identifier to stand
for the equality predicate.
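The quoted rule also explains why infix 3 .. fails: the dot is not one of the symbolic characters, so .. is not an identifier at all. Below is a small sketch of the rule as a checker, written in OCaml rather than SML. The names symbol_chars, reserved and is_symbolic_identifier are made up for this example, and the character set is my reading of the Definition's list, so check it against your copy.

```ocaml
(* Characters allowed in symbolic identifiers (assumed from the Definition). *)
let symbol_chars = "!%&$#+-/:<=>?@\\~`^|*"

(* Reserved words that consist only of those characters. *)
let reserved = [ ":"; "|"; "="; "=>"; "->"; "#" ]

(* True if [s] is a symbolic identifier according to the quoted rule:
   a non-empty run of symbol characters that is not a reserved word,
   with "=" as the single exception. *)
let is_symbolic_identifier (s : string) : bool =
  let all_symbols =
    let ok = ref true in
    String.iter
      (fun c -> if not (String.contains symbol_chars c) then ok := false)
      s;
    !ok
  in
  s <> "" && all_symbols && (s = "=" || not (List.mem s reserved))
```

Here is_symbolic_identifier ".." is false, while "##" and "|=|" are accepted and "#" and "|" are rejected, matching the examples in the quoted text.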

OCamllex matching beginning of line?

I am messing around writing a toy programming language in OCaml with ocamllex, and was trying to make the language sensitive to indentation changes, python-style, but am having a problem matching the beginning of a line with ocamllex's regex rules. I am used to using ^ to match the beginning of a line, but in OCaml that is the string concat operator. Google searches haven't been turning up much for me unfortunately :( Anyone know how this would work?
I'm not sure if there is explicit support for zero-length matching symbols (like ^ in Perl-style regular expressions, which matches a position rather than a substring). However, you should be able to let your lexer turn newlines into an explicit token, something like this:
parser.mly
%token EOL
%token <int> EOLWS
/* other stuff here */
%%
main:
EOL stmt { MyStmtDataType(0, $2) }
| EOLWS stmt { MyStmtDataType($1 - 1, $2) }
;
lexer.mll
{
open Parser
exception Eof
}
rule token = parse
[' ' '\t'] { token lexbuf } (* skip other blanks *)
| ['\n'][' ']+ as lxm { EOLWS(String.length(lxm)) }
| ['\n'] { EOL }
(* ... *)
This is untested, but the general idea is:
Treat newlines as statement 'starters'
Measure the whitespace that immediately follows the newline and pass its length as an int
Caveat: you will need to preprocess your input so that it starts with a single \n if it doesn't already begin with one.
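The "measure the whitespace after each newline" idea can also be tried out without ocamllex. This is only a sketch in plain OCaml: indent_of and indented_lines are hypothetical helpers, only spaces count as indentation (as in the EOLWS rule), and blank lines are dropped, which is an extra assumption.

```ocaml
(* Count the leading spaces of a line. *)
let indent_of (line : string) : int =
  let n = String.length line in
  let rec go i = if i < n && line.[i] = ' ' then go (i + 1) else i in
  go 0

(* Pair every non-blank line with its indentation width, which is the
   information the EOL and EOLWS tokens carry to the parser. *)
let indented_lines (src : string) : (int * string) list =
  String.split_on_char '\n' src
  |> List.filter (fun l -> String.trim l <> "")
  |> List.map (fun l ->
         let k = indent_of l in
         (k, String.sub l k (String.length l - k)))
```

For example, indented_lines "a\n  b\n    c\nd" gives [(0, "a"); (2, "b"); (4, "c"); (0, "d")]. Comparing each width with the previous one is then enough to detect where a block opens or closes.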

What is the difference between these three fscanf calls in OCaml?

I wrote a short bit of code to simply skip num_lines lines in an input file (printing the lines out for debugging purposes). Here are two things I tried that didn't work:
for i = 0 to num_lines do
print_endline (fscanf infile "%s" (fun p -> p));
done;;
for i = 0 to num_lines do
print_endline (fscanf infile "%S\n" (fun p -> p));
done;;
But this one did work:
for i = 0 to num_lines do
print_endline (fscanf infile "%s\n" (fun p -> p));
done;;
I've been trying to comprehend the documentation on fscanf, but it doesn't seem to be sinking in. Could someone explain to me exactly why the last snippet worked, but the first two didn't?
"%s" -- Matches everything up to the next whitespace ("\n" here) but never consumes the "\n", so every later call stops at once and returns an empty string.
"%S\n" -- Matches something that looks like an OCaml string literal (in double quotes), then a "\n".
"%s\n" -- Matches everything up to the next whitespace ("\n" here), then a "\n". This will behave differently if there is no trailing "\n" in the file, if there is a space before the "\n", etc.
"%s " -- Matches anything up to whitespace, and then all trailing whitespace including "\n" (or possibly no whitespace at all). This works because " " in the format string means "any amount of whitespace, possibly none".