How to append end of file to a string - ocaml

I just hit this "problem": is there a smart way to insert the end-of-file (ASCII 0) character into a string?
By "smart", I mean something better than
let s = "foo" ^ (String.make 1 (Char.chr 0))
let s = "foo\000"
that is, something that would reflect that we are adding an EOF, not a "mystery char whose ASCII value is 0".
EDIT:
Mmh... indeed I was wrong to treat EOF as a char. But anyway, in C you can have:
#include <stdio.h>
int main(void)
{
    char a = getchar();
    if (a = EOF)
        printf("eof");
    else
        printf("not eof");
    return 0;
}
Here you can test whether a char is EOF (and (int) EOF is -1, not 0 as I was thinking). And similarly, you can set a char to be EOF, etc.
My question is: is it possible to have something similar in OCaml?

As @melpomene says, there is no EOF character, and '\000' really is just a character. So there's no real answer to your question, as far as I can tell.
You can define your own name for a string consisting of just the NUL character (as we used to call it):
let eof = "\000"
Then your function looks like this:
let add_eof s = s ^ eof
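For instance, in the toplevel (a usage sketch; the toplevel simply prints the NUL byte as \000):
# add_eof "foo";;
- : string = "foo\000"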

Your C has two errors. First, you assign EOF to a instead of comparing a with EOF. Second, getchar() returns an int. It returns an int expressly so that it can return EOF, a value not representable by a char. Your code (with the first error corrected), which assigns getchar()'s value to a char before testing it, will fail to process a file containing a byte of value 255:
$ gcc -Wall getchar.c -o getchar
$ echo -e "\xFF" > fake-eof
$ echo " " > space
$ ./getchar < fake-eof
eof
$ ./getchar < space
not eof
The trick of having getchar return int (a larger type, so that the return value can encode either a char or other kinds of information such as EOF) is wholly unnecessary in OCaml due to its more advanced type system. OCaml could have:
(* using hypothetical c_getchar, a wrapper for the getchar() in C that returns an int *)
let getchar_opt () =
  match c_getchar () with
  | -1 -> None
  | c -> Some (char_of_int c)

let getchar_exn () =
  match c_getchar () with
  | -1 -> raise End_of_file
  | c -> char_of_int c

type 'a ior = EOF | Value of 'a

let getchar_ior () =
  match c_getchar () with
  | -1 -> EOF
  | c -> Value (char_of_int c)
Of course Pervasives.input_char in OCaml raises an exception on EOF rather than doing one of these other things. If you want a non-exceptional interface, you could wrap input_char with your own version that catches the exception, or you could - depending on your program - use Unix.read instead, which returns the number of bytes it was able to read, which is 0 on EOF.
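For example, a wrapper around input_char that turns the exception into an option might look like the sketch below (input_char_opt and read_chunk are names made up here, and the Unix.read variant assumes a recent OCaml where Unix.read takes a bytes buffer):
(* non-exceptional wrapper: EOF becomes None instead of an exception *)
let input_char_opt ic =
  try Some (input_char ic) with End_of_file -> None

(* Unix.read signals EOF by reporting 0 bytes read *)
let read_chunk fd =
  let buf = Bytes.create 1024 in
  match Unix.read fd buf 0 (Bytes.length buf) with
  | 0 -> None                            (* end of file *)
  | n -> Some (Bytes.sub_string buf 0 n)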

How to scan with ocamllex until end of file

I am trying to implement a parser that reads input using regular expressions. It asks the user to enter a valid string, integer, or float. If the input is valid and the user presses Ctrl-D, it should print the number; otherwise it shows an error. But the following code does not stop when I press Ctrl-D. How do I implement an eof token and print the input?
test.mll:
{ type result = Int of int | Float of float | String of string }

let digit = ['0'-'9']
let digits = digit +
let lower_case = ['a'-'z']
let upper_case = ['A'-'Z']
let letter = upper_case | lower_case
let letters = letter +

rule main = parse
    (digits)'.'digits as f { Float (float_of_string f) }
  | digits as n { Int (int_of_string n) }
  | letters as s { String s }
  | _ { main lexbuf }

{ let newlexbuf = (Lexing.from_channel stdin) in
  let result = main newlexbuf in
  print_endline result }
I'd say the main problem is that each call to main produces one token, and there's only one call to main in your code. So it will process just one token.
You need to have some kind of iteration that calls main repeatedly.
There is a special pattern eof in OCamllex that matches the end of the input file. You can use this to return a special value that stops the iteration.
As a side comment, you can't call print_endline with a result as its parameter. Its parameter must be a string. You will need to write your own function for printing the results.
Update
To get an iteration, change your code to something like this:
{
  let newlexbuf = Lexing.from_channel stdin in
  let rec loop () =
    match main newlexbuf with
    | Int i -> iprint i; loop ()
    | Float f -> fprint f; loop ()
    | String s -> sprint s; loop ()
    | Endfile -> ()
  in
  loop ()
}
Then add a rule something like this to your patterns:
| eof { Endfile }
Then add Endfile as an element of your type.
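For concreteness, here is one possible sketch of the header of test.mll, with simple definitions for the iprint, fprint and sprint helpers assumed in the loop above (those names come from the loop sketch; write them however you like):
{ type result = Int of int | Float of float | String of string | Endfile

  (* simple printers matching the loop above *)
  let iprint i = print_endline (string_of_int i)
  let fprint f = print_endline (string_of_float f)
  let sprint s = print_endline s
}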
I assume this is homework, so make sure you see how the iteration is working. Aside from the details of ocamllex, that's something you want to master (apologies for unsolicited advice).

I am trying to read a string from stdin and flush it out to stdout but I can't find a Standard ML way

NOTE: I'm a total newbie in Standard ML. I merely have basic F# knowledge.
This is some good ol' C code:
#include <stdio.h>
int main()
{
    char str[100]; // size whatever you want
    scanf("%s", str);
    printf("%s\n", str);
    return 0;
}
Now, I want to make a Standard ML equivalent of this code, so I tried this:
val str = valOf (TextIO.inputLine TextIO.stdIn)
val _ = print str
but my SML/NJ says this:
uncaught exception Option
raised at: smlnj/init/pre-perv.sml:21.28-21.34
I googled it, and I also searched this site, but I cannot find any solution that doesn't cause an error.
Does anyone know how to do this?
EDIT: I tried this code:
fun main =
let val str = valOf (TextIO.inputLine TextIO.stdIn)
in
case str
of NONE => print "NONE\n"
| _ => print str
end
but it also produces an error:
stdIn:1.6-1.10 Error: can't find function arguments in clause
stdIn:4.9-6.33 Error: case object and rules don't agree [tycon mismatch]
rule domain: 'Z option
object: string
in expression:
(case str
of NONE => print "NONE\n"
| _ => print str)
This answer was pretty much given in the next-most recent question tagged sml: How to read string from user keyboard in SML language? -- you can just replace the user keyboard with stdin, since stdin is how you interact with the keyboard using a terminal.
So you have two problems with this code:
fun main =
let val str = valOf (TextIO.inputLine TextIO.stdIn)
in
case str
of NONE => print "NONE\n"
| _ => print str
end
One problem is that if you write fun main then it has to take arguments, e.g. fun main () = .... The () part does not represent "nothing" but rather exactly one thing, being the unit value.
The other problem is eagerness. The Option.valOf function will crash when there is no value, and it will do this before you reach the case-of, making the case-of rather pointless. So what you can do instead is:
fun main () =
case TextIO.inputLine TextIO.stdIn of
SOME s => print s
| NONE => print "NONE\n"
Using the standard library this can be shortened to:
fun main () =
print (Option.getOpt (TextIO.inputLine TextIO.stdIn, "NONE\n"))
I encourage you to read How to read string from user keyboard in SML language?

Changing the State of Lexing.lexbuf

I am writing a lexer for Brainfuck with ocamllex, and to implement its loops, I need to change the state of lexbuf so it can return to a previous position in the stream.
Background info on Brainfuck (skippable)
in Brainfuck, a loop is accomplished by a pair of square brackets with
the following rule:
[ -> proceed and evaluate the next token
] -> if the current cell's value is not 0, return to the matching [
Thus, the following code evaluates to 15:
+++ [ > +++++ < - ] > .
it reads:
In the first cell, assign 3 (increment 3 times)
Enter loop, move to the next cell
Assign 5 (increment 5 times)
Move back to the first cell, and subtract 1 from its value
Hit the closing square bracket; now the current (first) cell is equal to 2, so it jumps back to [ and proceeds into the loop again
Keep going until the first cell is equal to 0, then exit the loop
Move to the second cell and output the value with .
The value in the second cell would have been incremented to 15
(incremented by 5 for 3 times).
Problem:
Basically, I wrote two functions in the header section of the brainfuck.mll file to take care of pushing and popping the position of the last [, namely push_curr_p and pop_last_p, which push and pop the lexbuf's current position to an int list ref named loopstack:
{ (* Header *)
  let tape = Array.make 100 0
  let tape_pos = ref 0
  let loopstack = ref []

  let push_curr_p (lexbuf : Lexing.lexbuf) =
    let p = lexbuf.Lexing.lex_curr_p in
    let curr_pos = p.Lexing.pos_cnum in
    (* Saving / pushing the position of `[` to loopstack *)
    ( loopstack := curr_pos :: !loopstack
    ; lexbuf
    )

  let pop_last_p (lexbuf : Lexing.lexbuf) =
    match !loopstack with
    | [] -> lexbuf
    | hd :: tl ->
      (* This is where I attempt to bring lexbuf back *)
      ( lexbuf.Lexing.lex_curr_p <- { lexbuf.Lexing.lex_curr_p with Lexing.pos_cnum = hd }
      ; loopstack := tl
      ; lexbuf
      )
}
(* Rules *)
rule brainfuck = parse
  | '[' { brainfuck (push_curr_p lexbuf) }
  | ']' { (* current cell's value must be 0 to exit the loop *)
          if tape.(!tape_pos) = 0
          then brainfuck lexbuf
          (* this needs to bring lexbuf back to the previous `[`
           * and proceed with the parsing *)
          else brainfuck (pop_last_p lexbuf)
        }
  (* ... other rules ... *)
The other rules work just fine, but it seems to ignore [ and ]. The problem is obviously with loopstack and how I get and set the lex_curr_p state. I would appreciate any leads.
lex_curr_p is meant to keep track of the current position, so that you can use it in error messages and the like. Setting it to a new value won't make the lexer actually seek back to an earlier position in the file. In fact I'm 99% sure that you can't make the lexer loop like that no matter what you do.
So you can't use ocamllex to implement the whole interpreter like you're trying to do. What you can do (and what ocamllex is designed to do) is to translate the input stream of characters into a stream of tokens.
In other languages that means translating a character stream like var xyz = /* comment */ 123 into a token stream like VAR, ID("xyz"), EQ, INT(123). So lexing helps in three ways: it finds where one token ends and the next begins, it categorizes tokens into different types (identifiers vs. keywords etc.) and discards tokens you don't need (white space and comments). This can simplify further processing a lot.
Lexing Brainfuck is a lot less helpful as all Brainfuck tokens only consist of a single character anyway. So finding out where each token ends and the next begins is a no-op and finding out the type of the token just means comparing the character against '[', '+' etc. So the only useful thing a Brainfuck lexer does is to discard whitespace and comments.
So what your lexer would do is turn the input [,[+-. lala comment ]>] into something like LOOP_START, IN, LOOP_START, INC, DEC, OUT, LOOP_END, MOVE_RIGHT, LOOP_END, where LOOP_START etc. belong to a discriminated union that you (or your parser generator if you use one) defined and made the lexer output.
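For instance, the token type (which would live in the .mll header) and a matching rule could look roughly like the sketch below; the constructor names and the bf_token rule name are only illustrative:
(* in the header: the discriminated union the lexer produces *)
type token =
  | LOOP_START | LOOP_END      (* [ and ] *)
  | INC | DEC                  (* + and - *)
  | MOVE_LEFT | MOVE_RIGHT     (* < and > *)
  | IN | OUT                   (* , and . *)
  | EOF

(* in the rules section: everything that is not a command is a comment *)
rule bf_token = parse
    '[' { LOOP_START }
  | ']' { LOOP_END }
  | '+' { INC }
  | '-' { DEC }
  | '<' { MOVE_LEFT }
  | '>' { MOVE_RIGHT }
  | ',' { IN }
  | '.' { OUT }
  | eof { EOF }
  | _   { bf_token lexbuf }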
If you want to use a parser generator, you'd define the token types in the parser's grammar and make the lexer produce values of those types. Then the parser can just parse the token stream.
If you want to do the parsing by hand, you'd call the lexer's token function by hand in a loop to get all the tokens. In order to implement loops, you'd have to store the already-consumed tokens somewhere to be able to loop back. In the end it'd end up being more work than just reading the input into a string, but for a learning exercise I suppose that doesn't matter.
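A minimal sketch of that buffering step, assuming the token type and bf_token entry point from the sketch above:
(* read every token up front so the interpreter can jump around by index *)
let read_all_tokens lexbuf =
  let rec go acc =
    match bf_token lexbuf with
    | EOF -> Array.of_list (List.rev acc)
    | t -> go (t :: acc)
  in
  go []
The interpreter would then walk this array with an index, remembering the index of each [ it enters so that ] can jump back to it while the current cell is non-zero.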
That said, I would recommend going all the way and using a parser generator to create an AST. That way you don't have to create a buffer of tokens for looping and having an AST actually saves you some work (you no longer need a stack to keep track of which [ belongs to which ]).
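As a rough sketch of that approach (the constructor names and the eval function below are invented for illustration, not taken from any particular parser), the AST and a small evaluator could look like this:
(* a loop node directly contains its body, so no bracket-matching stack is needed *)
type instr =
  | Inc | Dec | MoveLeft | MoveRight | In | Out
  | Loop of instr list

(* eval returns the final cell index so pointer movement inside a loop body carries over *)
let rec eval tape cell = function
  | [] -> cell
  | Inc :: rest -> tape.(cell) <- tape.(cell) + 1; eval tape cell rest
  | Dec :: rest -> tape.(cell) <- tape.(cell) - 1; eval tape cell rest
  | MoveRight :: rest -> eval tape (cell + 1) rest
  | MoveLeft :: rest -> eval tape (cell - 1) rest
  | Out :: rest -> print_char (Char.chr (tape.(cell) land 0xff)); eval tape cell rest
  | In :: rest ->
      tape.(cell) <- (try Char.code (input_char stdin) with End_of_file -> 0);
      eval tape cell rest
  | (Loop body :: _) as prog when tape.(cell) <> 0 ->
      let cell' = eval tape cell body in
      eval tape cell' prog             (* run the body, then re-test the same loop *)
  | Loop _ :: rest -> eval tape cell rest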

Idiomatic way to include a null character / byte in a string in OCaml

I was messing around with the OCaml Unix module to see if it would reject strings containing certain bytes that might have surprising effects in the context of the given system call and throw an exception. E.g. a null byte in the prog argument to Unix.create_process, or a newline in one of the strings in the env : string array argument.
I tried a few ways to include a null byte in my string, such as "/bin/ls\0" (which is an illegal escape sequence in a string literal) and "/bin/ls" ^ string_of_char '\0' (which is an illegal sequence in a character literal). Finally, I converted zero to a char, made a string of length 1 containing that null character, and concatenated it with my string.
module U = Unix;;
let string_of_char ch : string = String.make 1 ch
let sketchy_string = "/bin/ls" ^ string_of_char (char_of_int 0)
let _ = U.create_process sketchy_string [|"ls"|] U.stdin U.stdout U.stderr
What's the right way to add a null byte to an ocaml string?
You can use the generic "hexadecimal code" escape sequence to write the null byte (or any other byte you want):
let null_byte = '\x00';;
let sketchy_string = "/bin/ls\x00";;
For further reference, see the section of the OCaml manual covering escape sequences: http://caml.inria.fr/pub/docs/manual-ocaml/lex.html#escape-sequence
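A quick toplevel check of the escape (a sketch; the exact printed form may differ between OCaml versions):
# String.length "/bin/ls\x00";;
- : int = 8
# "/bin/ls\x00".[7];;
- : char = '\000'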

isnumeric() with PostgreSQL

I need to determine whether a given string can be interpreted as a number (integer or floating point) in an SQL statement. As in the following:
SELECT AVG(CASE WHEN x ~ '^[0-9]*.?[0-9]*$' THEN x::float ELSE NULL END) FROM test
I found that Postgres' pattern matching could be used for this. And so I adapted the statement given in this place to incorporate floating point numbers. This is my code:
WITH test(x) AS (
VALUES (''), ('.'), ('.0'), ('0.'), ('0'), ('1'), ('123'),
('123.456'), ('abc'), ('1..2'), ('1.2.3.4'))
SELECT x
, x ~ '^[0-9]*.?[0-9]*$' AS isnumeric
FROM test;
The output:
x | isnumeric
---------+-----------
| t
. | t
.0 | t
0. | t
0 | t
1 | t
123 | t
123.456 | t
abc | f
1..2 | f
1.2.3.4 | f
(11 rows)
As you can see, the first two items (the empty string '' and the sole period '.') are misclassified as being a numeric type (which they are not). I can't get any closer to this at the moment. Any help appreciated!
Update: Based on this answer (and its comments), I adapted the pattern to:
WITH test(x) AS (
VALUES (''), ('.'), ('.0'), ('0.'), ('0'), ('1'), ('123'),
('123.456'), ('abc'), ('1..2'), ('1.2.3.4'), ('1x234'), ('1.234e-5'))
SELECT x
, x ~ '^([0-9]+[.]?[0-9]*|[.][0-9]+)$' AS isnumeric
FROM test;
Which gives:
x | isnumeric
----------+-----------
| f
. | f
.0 | t
0. | t
0 | t
1 | t
123 | t
123.456 | t
abc | f
1..2 | f
1.2.3.4 | f
1x234 | f
1.234e-5 | f
(13 rows)
There are still some issues with the scientific notation and with negative numbers, as I see now.
As you may have noticed, a regex-based method is almost impossible to get exactly right. For example, your test says that 1.234e-5 is not a valid number, when it really is. Also, you missed negative numbers. And what if something looks like a number, but storing it would cause an overflow?
Instead, I would recommend creating a function that actually tries to cast to NUMERIC (or FLOAT, if your task requires it) and returns TRUE or FALSE depending on whether the cast was successful.
This code will fully simulate the ISNUMERIC() function:
CREATE OR REPLACE FUNCTION isnumeric(text) RETURNS BOOLEAN AS $$
DECLARE x NUMERIC;
BEGIN
    x = $1::NUMERIC;
    RETURN TRUE;
EXCEPTION WHEN others THEN
    RETURN FALSE;
END;
$$
STRICT
LANGUAGE plpgsql IMMUTABLE;
Calling this function on your data gives the following results:
WITH test(x) AS ( VALUES (''), ('.'), ('.0'), ('0.'), ('0'), ('1'), ('123'),
('123.456'), ('abc'), ('1..2'), ('1.2.3.4'), ('1x234'), ('1.234e-5'))
SELECT x, isnumeric(x) FROM test;
x | isnumeric
----------+-----------
| f
. | f
.0 | t
0. | t
0 | t
1 | t
123 | t
123.456 | t
abc | f
1..2 | f
1.2.3.4 | f
1x234 | f
1.234e-5 | t
(13 rows)
Not only is it more correct and easier to read, it will also work faster if the data actually is a number.
Your problem is the two "0 or more" [0-9] elements on each side of the decimal point. You need to use a logical OR | in the number identification line:
~'^([0-9]+\.?[0-9]*|\.[0-9]+)$'
This will exclude a decimal point alone as a valid number.
I suppose one could have that opinion (that it's not a misuse of exception handling), but generally I think an exception handling mechanism should be reserved for exceptional situations. Testing whether a string contains a number is part of normal processing, and isn't "exceptional".
But you're right about not handling exponents. Here's a second stab at the regular expression (below). The reason I had to pursue a solution that uses a regular expression was that the solution offered as the "correct" solution here will fail when the directive is given to exit when an error is encountered:
SET exit_on_error = true;
We use this often when groups of SQL scripts are run, and when we want to stop immediately if there is any issue/error. When this session directive is given, calling the "correct" version of isnumeric will cause the script to exit immediately, even though there's no "real" exception encountered.
create or replace function isnumeric(text) returns boolean
immutable
language plpgsql
as $$
begin
    if $1 is null or rtrim($1) = '' then
        return false;
    else
        return (select $1 ~ '^ *[-+]?[0-9]*([.][0-9]+)?[0-9]*(([eE][-+]?)[0-9]+)? *$');
    end if;
end;
$$;
Since PostgreSQL 9.4 you can just ask for the type of a json field:
jsonb_typeof(field)
From the PostgreSQL documentation:
json_typeof(json)
jsonb_typeof(jsonb)
Returns the type of the outermost JSON value as a text string. Possible types are object, array, string, number, boolean, and null.
Example
When aggregating numbers and wanting to ignore strings:
SELECT m.title, SUM(m.body::numeric)
FROM messages as m
WHERE jsonb_typeof(m.body) = 'number'
GROUP BY m.title;
Without WHERE the ::numeric part would crash.
The obvious problem with the accepted solution is that it is an abuse of exception handling. If there's another problem encountered, you'll never know it because you've tossed away the exceptions. Very bad form. A regular expression would be the better way to do this. The regex below seems to behave well.
create function isnumeric(text) returns boolean
immutable
language plpgsql
as $$
begin
    if $1 is not null then
        return (select $1 ~ '^(([-+]?[0-9]+(\.[0-9]+)?)|([-+]?\.[0-9]+))$');
    else
        return false;
    end if;
end;
$$;