Multiple parsing files with menhir - ocaml

I'm trying to have two different parsers with different entry points.
I have the following project hierarchy:
.
├── bin
│   ├── dune
│   ├── lexer.mll
│   ├── main.ml
│   ├── parser.messages
│   ├── parser.mly
│   └── test_parser.mly
├── dune-project
├── program.opam
└── test
lexer.mll
{
open Parser
exception Error of string
}
rule token = parse
| [' ' '\t'] { token lexbuf }
| '\n' { EOL }
| ['0'-'9']+ as i { INT (int_of_string i) }
| '+' { PLUS }
| eof { exit 0 }
| _ { raise Exit }
parser.mly
%token <int> INT
%token PLUS
%token EOL
%left PLUS
%start <int> main
%%
main:
| e = expr EOL
{ e }
%public expr:
| i = INT { i }
| e1 = expr PLUS e2 = expr { e1 + e2 }
and dune
(ocamllex
(modules lexer))
(menhir
(modules parser)
(flags --table)
)
(executable
(name main)
(public_name program)
(libraries menhirLib)
)
(rule
(targets parser_messages.ml)
(deps parser.messages parser.mly)
(action (with-stdout-to %{targets} (run menhir --compile-errors %{deps}))))
This works perfectly (the parser.messages file was generated with menhir --list-errors bin/parser.mly > bin/parser.messages and is not really important here).
However, I'd like a second parser that reuses some rules defined in parser.mly. According to http://gallium.inria.fr/~scherer/tmp/menhir-manual/main.html#sec23 this should be possible, but I guess I'm doing it wrong:
test_parser.mly
%start <int> test
%%
test:
| el = separated_nonempty_list(COMMA, expr) EOL
{ List.fold_left (+) 0 e }
And in my dune file I added:
(menhir
(modules parser test_parser)
(merge_into test_parser)
(flags --table))
(rule
(targets test_parser_messages.ml)
(deps test_parser.messages test_parser.mly)
(action (with-stdout-to %{targets} (run menhir --compile-errors %{deps}))))
But when I dune build:
❯ dune build
File "test_parser.mly", line 6, characters 21-25:
Error: expr is undefined.
File "bin/test_parser.mly", line 7, characters 27-28:
Error: Unbound value e
expr is preceded by %public in parser.mly so I thought I could use it from test_parser.mly. Can this feature only be used with a common entry point and different producing rules?
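(Independently of the grammar-merging question, the second reported error is explainable from the snippet itself: the semantic action refers to e while the list is bound as el, and COMMA is not declared as a token anywhere. A corrected sketch of test_parser.mly, assuming a %token COMMA declaration is added to the shared declarations in parser.mly:

```
%start <int> test
%%
test:
| el = separated_nonempty_list(COMMA, expr) EOL
  { List.fold_left (+) 0 el }
```

This does not by itself resolve the "expr is undefined" error, which concerns how the two grammars are merged.)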

Related

regex replace string PySpark

I have a column with string values like '{"phones":["phone1", "phone2"]}' and I would like to remove characters so I end up with a string like phone1, phone2. I am using a regex like
df.withColumn('Phones',
F.regexp_replace(F.split(F.col('input_phones'), ':').getItem(1), r'\}', ''))
which returns a string like '["phone1", "phone2"]'.
Is there a way to test different regex and how to exclude other special characters?
Your string ('{"phones":["phone1", "phone2"]}') looks like JSON, and we can parse it in PySpark using from_json.
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

data_sdf = spark.sparkContext.parallelize([('{"phones":["phone1", "phone2"]}',)]).toDF(['json_str'])
# the JSON's schema
json_target_schema = StructType([
StructField('phones', ArrayType(StringType()), True)
])
data_sdf. \
withColumn('json_parsed', func.from_json(func.col('json_str'), json_target_schema)). \
select('json_str', 'json_parsed.*'). \
withColumn('phones_str', func.concat_ws(',', 'phones')). \
show(truncate=False)
# +-------------------------------+----------------+-------------+
# |json_str |phones |phones_str |
# +-------------------------------+----------------+-------------+
# |{"phones":["phone1", "phone2"]}|[phone1, phone2]|phone1,phone2|
# +-------------------------------+----------------+-------------+
Let's check the DataFrame's schema to see the columns' data types:
data_sdf. \
withColumn('json_parsed', func.from_json(func.col('json_str'), json_target_schema)). \
select('json_str', 'json_parsed', 'json_parsed.*'). \
withColumn('phones_str', func.concat_ws(',', 'phones')). \
printSchema()
# root
# |-- json_str: string (nullable = true)
# |-- json_parsed: struct (nullable = true)
# | |-- phones: array (nullable = true)
# | | |-- element: string (containsNull = true)
# |-- phones: array (nullable = true)
# | |-- element: string (containsNull = true)
# |-- phones_str: string (nullable = false)
# +-------------------------------+------------------+----------------+-------------+
# |json_str |json_parsed |phones |phones_str |
# +-------------------------------+------------------+----------------+-------------+
# |{"phones":["phone1", "phone2"]}|{[phone1, phone2]}|[phone1, phone2]|phone1,phone2|
# +-------------------------------+------------------+----------------+-------------+

Ocaml reading line by line

So I have this OCaml code:
let read_file file_name = "/home/test.txt" in
let in_channel = open_in file_name in
try
while true do
let line = input_line in_channel in
print_endline line
done
with End_of_file ->
close_in in_channel
let my_fun()=
let f = "test1.ml" in
read_file f
;;
my_fun ()
But it is printing only the first line of the file. Can you help me out?
Your code shows an error along the lines of "unbound value file_name" on line 2. I just turned read_file into a function, and called it with the path to the file:
let read_file file_name =
let in_channel = open_in file_name in
try
while true do
let line = input_line in_channel in
print_endline line
done
with End_of_file ->
close_in in_channel
let my_fun() =
let f = "home/test.txt" in
read_file f;;
my_fun()
Calling $ ocaml <filename.ml> on the file with this code prints each line from the file in the specified path to stdout.
Edit: Adding some more details on file structure and how to execute:
I have a directory named sample-dir with the following contents:
sample-dir
├── home
│   └── test.txt
└── sample.ml
The code is in sample.ml, the test.txt is in the sub-directory called home. The contents of test.txt are as follows:
one
two
three
I cd into sample-dir and execute the following command:
$ ocaml sample.ml
I get the following output:
one
two
three
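As a side note, a variant that returns the lines as a list (instead of printing inside the loop) is often more reusable; a minimal sketch, where the helper name read_lines is my own:

```ocaml
(* Read every line of a file into a list, closing the channel at EOF. *)
let read_lines file_name =
  let ic = open_in file_name in
  let rec loop acc =
    match input_line ic with
    | line -> loop (line :: acc)
    | exception End_of_file -> close_in ic; List.rev acc
  in
  loop []

(* Demo: write a small file, then read it back and print each line. *)
let () =
  let tmp = Filename.temp_file "demo" ".txt" in
  let oc = open_out tmp in
  output_string oc "one\ntwo\nthree\n";
  close_out oc;
  List.iter print_endline (read_lines tmp);
  Sys.remove tmp
```

The `match ... with exception` form (OCaml 4.02+) keeps the End_of_file handling inside the loop, so the channel is closed exactly once.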

How to parse a string to a regex type in OCaml

We define a regex type like this:
type regex_t =
| Empty_String
| Char of char
| Union of regex_t * regex_t
| Concat of regex_t * regex_t
| Star of regex_t
We want to write a function string_to_regex: string -> regex_t.
The only char for Empty_String is 'E'
The only chars for Char are 'a'..'z'
'|' is for Union
'*' is for Star
Concat is assumed for continuous parsing.
'(' / ')' have the highest precedence, then Star, then Concat, then Union.
For example,
(a|E)*(a|b) will be
Concat(Star(Union(Char 'a',Empty_String)),Union(Char 'a',Char 'b'))
How to implement string_to_regex?
ocamllex and menhir are wonderful tools for writing lexers and parsers.
ast.mli
type regex_t =
| Empty
| Char of char
| Concat of regex_t * regex_t
| Choice of regex_t * regex_t
| Star of regex_t
lexer.mll
{ open Parser }
rule token = parse
| ['a'-'z'] as c { CHAR c }
| 'E' { EMPTY }
| '*' { STAR }
| '|' { CHOICE }
| '(' { LPAR }
| ')' { RPAR }
| eof { EOF }
parser.mly
%{ open Ast %}
%token <char> CHAR
%token EMPTY STAR CHOICE LPAR RPAR CONCAT
%token EOF
%nonassoc LPAR EMPTY CHAR
%left CHOICE
%left STAR
%left CONCAT
%start main
%type <Ast.regex_t> main
%%
main: r = regex EOF { r }
regex:
| EMPTY { Empty }
| c = CHAR { Char c }
| LPAR r = regex RPAR { r }
| a = regex CHOICE b = regex { Choice(a, b) }
| r = regex STAR { Star r }
| a = regex b = regex { Concat(a, b) } %prec CONCAT
main.ml
open Ast
let rec format_regex = function
| Empty -> "Empty"
| Char c -> "Char " ^ String.make 1 c
| Concat(a, b) -> "Concat("^format_regex a^", "^format_regex b^")"
| Choice(a, b) -> "Choice("^format_regex a^", "^format_regex b^")"
| Star(a) -> "Star("^format_regex a^")"
let () =
let s = read_line () in
let r = Parser.main Lexer.token (Lexing.from_string s) in
print_endline (format_regex r)
and to compile
ocamllex lexer.mll
menhir parser.mly
ocamlc -c ast.mli
ocamlc -c parser.mli
ocamlc -c parser.ml
ocamlc -c lexer.ml
ocamlc -c main.ml
ocamlc -o regex parser.cmo lexer.cmo main.cmo
and then
$ ./regex
(a|E)*(a|b)
Concat(Star(Choice(Char a, Empty)), Choice(Char a, Char b))
The answer by @Thomas is very complete, but the precedence of the operators is not correct: parsing a|aa results in Concat(Choice(Char a, Char a), Char a), i.e. it is read as (a|a)a. Regular expression operator precedence requires that a|aa = a|(aa), which should instead produce Choice(Char a, Concat(Char a, Char a)). The problem is that the CONCAT token is a hack, and even though %left CHOICE appears before %left CONCAT in parser.mly, that does not guarantee the intended precedence is respected. One possible solution is to stratify the grammar into one rule per precedence level, effectively making it unambiguous. If you want to use this approach you can modify parser.mly with:
%{ open Ast %}
%token <char> CHAR
%token EMPTY
%token STAR
%token CHOICE
%token LPAR
%token RPAR
%token EOF
%start main
%type <Ast.regex_t> main
%%
main: r = regex EOF { r }
regex:
| r = disjunction { r }
disjunction:
| a = disjunction CHOICE b = concat { Choice(a, b) }
| r = concat {r}
concat:
| a = concat b = repetition { Concat(a, b) }
| r = repetition {r}
repetition:
| r = repetition STAR { Star r }
| r = atom { r }
atom:
| LPAR r = regex RPAR { r }
| c = CHAR { Char c }
| EMPTY { Empty }
This leads to no ambiguity (= no need to specify operators associativity and precedence) and will produce the correct result.

OCaml parser for :: case

I'm having trouble with associativity. For some reason my = operator has higher precedence than my :: operator
So for instance, if I have
"1::[] = []"
as a string, I would get
1 = []::[]
as my expression instead of
[1] = []
If my string is "1::2::[] = []"
I thought it would parse it into exp1 EQ exp2, and from then on it would parse exp1 and exp2. But it is parsing it as exp1 COLONCOLON exp2 instead
.
.
.
%nonassoc LET FUN IF
%left OR
%left AND
%left EQ NE LT LE
%right SEMI COLONCOLON
%left PLUS MINUS
%left MUL DIV
%left APP
.
.
.
exp4:
| exp4 EQ exp9 { Bin ($1,Eq,$3) }
| exp4 NE exp9 { Bin ($1,Ne,$3) }
| exp4 LT exp9 { Bin ($1,Lt,$3) }
| exp4 LE exp9 { Bin ($1,Le,$3) }
| exp9 { $1 }
exp9:
| exp COLONCOLON exp9 { Bin ($1,Cons,$3) }
| inner { $1 }
.
.
.
It looks like you might have multiple expression rules (exp, exp1, exp2, ... exp9), in which case the precedence of operations is determined by the interrelation of those rules (which rule expands to which other rule), and the %left/%right declarations are largely irrelevant.
The yacc precedence rules are only used to resolve shift/reduce conflicts, and if your grammar doesn't have shift/reduce conflicts (having resolved the ambiguity by using multiple rules), the precedence levels will have no effect.
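Concretely, with stratified rules the precedence of :: relative to = is fixed by which rule the left operand refers to. Reusing the rule names from the question, a hedged sketch of the usual fix: the left operand of COLONCOLON should be the next-lower level (here inner), not a higher-level expression rule, so that an EQ expression can never appear unparenthesized on the left of ::.

```
exp9:
| inner COLONCOLON exp9 { Bin ($1,Cons,$3) }   (* right-associative cons *)
| inner { $1 }
```

With this shape, "1::2::[] = []" is forced to parse the whole cons chain as the left operand of EQ.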
Rules aren't applied like functions, so you can't factor your grammar into a set of reusable helper rules, at least with ocamlyacc. You can try menhir, which allows such refactoring by inlining rules (the %inline directive).
To use menhir you need to install it and pass the -use-menhir option to ocamlbuild, if that is what you're using.
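For reference, a minimal %inline sketch (the rule and constructor names here are hypothetical): menhir expands the inlined rule at each use site, so precedence declarations on the tokens still apply as if the operator appeared directly in the surrounding rule.

```
%inline binop:
| PLUS  { Add }
| MINUS { Sub }

expr:
| e1 = expr; op = binop; e2 = expr { Bin (e1, op, e2) }
```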

Use of StringTemplate in Antlr

I have this problem:
Given these rules
defField: type VAR ( ',' VAR)* SEP ;
VAR : ('a'..'z'|'A'..'Z')+ ;
type: 'Number'|'String' ;
SEP : '\n'|';' ;
what I have to do is associate a template with the rule "defField" that returns the string representing the XML schema for the field, that is:
Number a,b,c; -> "<xs:element name="a" type="xs:Number"\>", and likewise for b and c.
My problem is with the Kleene star: how do I write the template to do what I described above, given the '*'?
Thank you!
Collect all VAR tokens in a java.util.List by using the += operator:
defField
: t=type v+=VAR (',' v+=VAR)* SEP
;
Now v (a List) contains all the VAR tokens.
Then pass t and v as a parameter to a method in your StringTemplateGroup:
defField
: t=type v+=VAR (',' v+=VAR)* SEP -> defFieldSchema(type={$t.text}, vars={$v})
;
where defFieldSchema(...) must be declared in your StringTemplateGroup, which might look like (file: T.stg):
group T;
defFieldSchema(type, vars) ::= <<
<vars:{ v | \<xs:element name="<v.text>" type="xs:<type>"\>
}>
>>
The syntax for iterating over a collection is as follows:
<COLLECTION:{ EACH_ITEM_IN_COLLECTION | TEXT_TO_EMIT }>
And since vars is a List containing CommonToken objects, I grabbed its .text attribute instead of relying on its toString() method.
Demo
Take the following grammar (file T.g):
grammar T;
options {
output=template;
}
defField
: t=type v+=VAR (',' v+=VAR)* SEP -> defFieldSchema(type={$t.text}, vars={$v})
;
type
: NUMBER
| STRING
;
NUMBER
: 'Number'
;
STRING
: 'String'
;
VAR
: ('a'..'z'|'A'..'Z')+
;
SEP
: '\n'
| ';'
;
SPACE
: ' ' {skip();}
;
which can be tested with the following class (file: Main.java):
import org.antlr.runtime.*;
import org.antlr.stringtemplate.*;
import java.io.*;
public class Main {
public static void main(String[] args) throws Exception {
StringTemplateGroup group = new StringTemplateGroup(new FileReader("T.stg"));
ANTLRStringStream in = new ANTLRStringStream("Number a,b,c;");
TLexer lexer = new TLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TParser parser = new TParser(tokens);
parser.setTemplateLib(group);
TParser.defField_return returnValue = parser.defField();
StringTemplate st = (StringTemplate)returnValue.getTemplate();
System.out.println(st.toString());
}
}
As you will see when you run this class, it parses the input "Number a,b,c;" and produces the following output:
<xs:element name="a" type="xs:Number">
<xs:element name="b" type="xs:Number">
<xs:element name="c" type="xs:Number">
EDIT
To run the demo, make sure you have all of the following files in the same directory:
T.g (the combined grammar file)
T.stg (the StringTemplateGroup file)
antlr-3.3.jar (the latest stable ANTLR build as of this writing)
Main.java (the test class)
then execute to following commands from your OS's shell/prompt (from the same directory all the files are in):
java -cp antlr-3.3.jar org.antlr.Tool T.g # generate the lexer & parser
javac -cp antlr-3.3.jar *.java # compile all .java source files
java -cp .:antlr-3.3.jar Main # run the main class (*nix)
# or
java -cp .;antlr-3.3.jar Main # run the main class (Windows)
Probably not necessary to mention, but the # and the text after it should not be part of the commands: they are only comments indicating what the commands do.