Writing an interpreter for ANTLR grammar - clojure

I've made a grammar for a subset of APL.
grammar APL;
program: (statement NEWLINE)*;
statement: thing;
assignment: variable LARR thing;
thing: simpleThing
| complexThing;
escapedThing: simpleThing
| '(' complexThing ')';
simpleThing: variable # ThingVariable
| number # ThingNumber
;
complexThing: unary # ThingUOp
| binary # ThingBOp
| assignment # ThingAssignment
;
variable: CAPITAL;
number: DIGITS;
unary: iota # UOpIota
| negate # UOpNegate
;
iota: SMALL_IOTA number;
negate: TILDA thing;
binary: drop # BOpDrop
| select # BOpSelect
| outerProduct # BOpOuterProduct
| setInclusion # BOpSetInclusion
;
drop: left=number SPIKE right=thing;
select: left=escapedThing SLASH right=thing;
outerProduct: left=escapedThing OUTER_PRODUCT_OP right=thing;
setInclusion: left=escapedThing '∊' right=thing;
NEWLINE: [\r\n]+;
CAPITAL: [A-Z];
CAPITALS: (CAPITAL)+;
DIGITS: [0-9]+;
TILDA: '~';
SLASH: '/';
// greek
SMALL_IOTA: 'ι' | '#i';
// arrows
LARR: '←' | '#<-';
SPIKE: '↓' | '#Iv';
OUTER_PRODUCT_OP: '∘.×' | '#o.#x';
Now I'd like to create an interpreter for it. I'm trying to use clj-antlr with Clojure. How do I do that?

As Jared314 pointed out, take a look at instaparse:
This is how you create a grammar:
(def as-and-bs
(insta/parser
"S = AB*
AB = A B
A = 'a'+
B = 'b'+"))
This is how you call it:
(as-and-bs "aaaaabbbaaaabb")
And here is the result with default formatting:
[:S
[:AB [:A "a" "a" "a" "a" "a"] [:B "b" "b" "b"]]
[:AB [:A "a" "a" "a" "a"] [:B "b" "b"]]]
While ANTLR is definitely doing a great job, in the Clojure world you can remove all the surrounding glue by using instaparse.
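Once you have a parse tree in this nested form, the interpreter itself is usually a recursive function that dispatches on the node tag. Below is a minimal sketch of that idea, written in Python rather than Clojure, with the keywords replaced by plain strings. The semantics (a run of letters evaluates to its length) and the `evaluate` function are made up purely for illustration; they are not part of instaparse or clj-antlr.

```python
# Parse tree in the shape printed above, with Clojure keywords
# written as plain strings (an assumption made for this sketch).
tree = ["S",
        ["AB", ["A", "a", "a", "a", "a", "a"], ["B", "b", "b", "b"]],
        ["AB", ["A", "a", "a", "a", "a"], ["B", "b", "b"]]]


def evaluate(node):
    """Evaluate one node by dispatching on its tag (hypothetical semantics)."""
    if isinstance(node, str):
        return node  # leaf token, returned unchanged
    tag, *children = node
    values = [evaluate(child) for child in children]
    if tag in ("A", "B"):
        return len(values)    # a run of letters evaluates to its length
    if tag == "AB":
        return tuple(values)  # (number of a's, number of b's)
    if tag == "S":
        return values         # one pair per AB group
    raise ValueError(f"unknown node tag: {tag}")


result = evaluate(tree)
print(result)  # [(5, 3), (4, 2)]
```

An interpreter for the APL grammar would follow the same shape, with one branch per rule (iota, negate, drop, and so on) in place of the toy branches here.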


Extract letters and numbers from string

I have the following strings:
KZ1,345,769.1
PKS948,123.9
XG829,823.5
324JKL,282.7
456MJB87,006.01
How can I separate the letters and numbers?
This is the outcome I expect:
KZ 1345769.1
PKS 948123.9
XG 829823.5
JKL 324282.7
MJB 45687006
I have tried using the split command for this purpose but without success.
@Pearly Spencer's answer is surely preferable, but the following kind of naive looping should occur to any programmer. Look at each character in turn and decide whether it is a letter, a digit or decimal point, or (implicitly) something else, and build up the answers that way. Note that although we loop explicitly over the length of the string, the loop over observations is tacit.
clear
input str42 whatever
"KZ1,345,769.1"
"PKS948,123.9"
"XG829,823.5"
"324JKL,282.7"
"456MJB87,006.01"
end
compress
local length = substr("`: type whatever'", 4, .)
gen letters = ""
gen numbers = ""
quietly forval j = 1/`length' {
local arg substr(whatever,`j', 1)
replace letters = letters + `arg' if inrange(`arg', "A", "Z")
replace numbers = numbers + `arg' if `arg' == "." | inrange(`arg', "0", "9")
}
list
     +-----------------------------------------+
     | whatever          letters   numbers     |
     |-----------------------------------------|
  1. | KZ1,345,769.1     KZ        1345769.1   |
  2. | PKS948,123.9      PKS       948123.9    |
  3. | XG829,823.5       XG        829823.5    |
  4. | 324JKL,282.7      JKL       324282.7    |
  5. | 456MJB87,006.01   MJB       45687006.01 |
     +-----------------------------------------+
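For readers who do not use Stata, the same character-by-character idea can be sketched in Python. The function name `split_letters_numbers` is made up for this illustration; the classification rules mirror the Stata loop above (capital letters go to one result, digits and the decimal point to the other, everything else is dropped).

```python
def split_letters_numbers(text):
    """Split a string into its capital letters and its digits/decimal points."""
    letters = []
    numbers = []
    for ch in text:
        if "A" <= ch <= "Z":
            letters.append(ch)
        elif ch == "." or "0" <= ch <= "9":
            numbers.append(ch)
        # anything else (here, the commas) is silently dropped
    return "".join(letters), "".join(numbers)


samples = [
    "KZ1,345,769.1",
    "PKS948,123.9",
    "XG829,823.5",
    "324JKL,282.7",
    "456MJB87,006.01",
]
for sample in samples:
    letters, numbers = split_letters_numbers(sample)
    print(sample, letters, numbers)
```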

Multiple Patterns in 1 case

In SML, is it possible for you to have multiple patterns in one case statement?
For example, I have 4 arithmetic operators expressed as strings, "+", "-", "*", "/", and I want to print "PLUS MINUS" if it is "+" or "-" and "MULT DIV" if it is "*" or "/".
TL;DR: Is there some way I can simplify the following to use fewer cases?
case str of
"+" => print("PLUS MINUS")
| "-" => print("PLUS MINUS")
| "*" => print("MULT DIV")
| "/" => print("MULT DIV")
Given that you've tagged your question with the smlnj tag, then yes, SML/NJ supports this kind of pattern. They are called or-patterns and look like this:
case str of
("+" | "-") => print "PLUS MINUS"
| ("*" | "/") => print "MULT DIV"
Notice the parentheses.
The master branch of MLton supports it too, as part of their Successor ML effort, but you'll have to compile MLton yourself.
val str = "+"
val _ =
case str of
"+" | "-" => print "PLUS MINUS"
| "*" | "/" => print "MULT DIV"
Note that MLton does not require parentheses. Now compile it using this command (unlike SML/NJ, you have to enable this feature explicitly in MLton):
mlton -default-ann 'allowOrPats true' or-patterns.sml
In Standard ML, no. In other dialects of ML, such as OCaml, yes. You may in some cases consider splitting pattern matching up into separate cases/functions, or skip pattern matching in favor of a shorter catch-all expression, e.g.
if str = "+" orelse str = "-" then "PLUS MINUS" else
if str = "*" orelse str = "/" then "MULT DIV" else ...
Expanding upon Ionuț's example, you can even use datatypes with other types in them, but their types (and identifier assignments) must match:
datatype mytype = COST of int | QUANTITY of int | PERSON of string | PET of string;
case item of
(COST n|QUANTITY n) => print (Int.toString n)
|(PERSON name|PET name) => print name
If the types or names don't match, it will get rejected:
case item of
(COST n|PERSON n) => (* fails because COST is int and PERSON is string *)
(COST n|QUANTITY q) => (* fails because the identifiers are different *)
And these patterns work in function definitions as well:
fun myfun (COST n|QUANTITY n) = print (Int.toString n)
|myfun (PERSON name|PET name) = print name
;

Changing how Antlr4 prints a tree

I am using Antlr4 in IntelliJ to make a small compiler for arithmetic expressions.
I want to print the tree and use this code snippet to do so.
JFrame frame = new JFrame("Tree");
JPanel panel = new JPanel();
TreeViewer viewr = new TreeViewer(Arrays.asList(
parser.getRuleNames()),tree);
viewr.setScale(2);//scale a little
panel.add(viewr);
frame.add(panel);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.setSize(400,400);
frame.setVisible(true);
This makes a tree which looks like this, for the input 3*5\n
Is there a way to adjust this so it reads from top to bottom
Statement
Expression /n
INT * INT
3 5
instead?
My grammar is defined as:
grammar Expression;
statement: expression ENDSTATEMENT # printExpr
| ID '=' expression ENDSTATEMENT # assign
| ENDSTATEMENT # blank
;
expression: expression MULTDIV expression # MulDiv
| expression ADDSUB expression # AddSub
| INT # int
| FLOAT # float
| ID # id
| '(' expression ')' # parens
;
ID : [a-zA-Z]+ ; // match identifiers
INT : [0-9]+ ; // match integers
MULTDIV : ('*' | '/'); //match multiply or divide
ADDSUB : ('+' | '-'); //match add or subtract
FLOAT: INT '.' INT; //match a floating point number
ENDSTATEMENT:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WHITESPACE : [ \t]+ -> skip ; // ignore whitespace
No, not without changing the source of the tree viewer yourself.

How to disable non-standard features in SML/NJ

SML/NJ provides a series of non-standard features, such as higher-order modules, vector literal syntax, etc.
Is there a way to disable these non-standard features in SML/NJ, through some command-line param maybe, or, ideally, using a CM directive?
Just by looking at the grammar used by the parser, I'm going to say that there is not a way to do this. From "admin/base/compiler/Parse/parse/ml.grm":
apat' : OP ident (VarPat [varSymbol ident])
| ID DOT qid (VarPat (strSymbol ID :: qid varSymbol))
| int (IntPat int)
| WORD (WordPat WORD)
| STRING (StringPat STRING)
| CHAR (CharPat CHAR)
| WILD (WildPat)
| LBRACKET RBRACKET (ListPat nil)
| LBRACKET pat_list RBRACKET (ListPat pat_list)
| VECTORSTART RBRACKET (VectorPat nil)
| VECTORSTART pat_list RBRACKET (VectorPat pat_list)
| LBRACE RBRACE (unitPat)
| LBRACE plabels RBRACE (let val (d,f) = plabels
in RecordPat{def=d,flexibility=f}
end)
The VectorPat stuff is fully mixed in with the rest of the patterns. A recursive grep for VectorPat also will show that there aren't any options to turn this off anywhere else.

Checking if last word of string is (case-insensitively) contained in another string

I'm using the regex SPARQL function and I pass two variables to it in this way:
FILTER regex(?x, ?y, "i")
I would like, for example, to compare these two strings: Via de' cerretani and via dei Cerretani, by extracting the significant word of the first string, which is usually the last word (cerretani in this case), and checking whether it's contained in the second string. As you can see, I pass these two strings as variables. How can I do this?
At first I thought that this was a duplicate of your earlier question, Comparing two strings with SPARQL, but that's asking about a function that returns an edit distance. The task here is much more specific: Check whether the last word of a string is contained (case insensitively) in another string. As long as we take your specification that
the significant word of the string … is usually the last one
strictly and always use only the last word of the string (since there's no way to determine, in general, what the “significant word of the string” is), we can do this. You won't end up using the regex function, though. Instead we'll use replace, contains, and lcase (or ucase).
The trick is that we can get the last word of a string ?x by using replace to remove all the words but the last one (and the space before it), and can then use contains to check whether this last word is contained in the other string. Using case normalization functions (in the following code, I used lcase, but ucase should work, too) we can do the containment check case insensitively.
select ?x ?y ?lastWordOfX ?isMatch ?isIMatch where {
# Values gives us some test data. It just means that ?x and ?y
# will be bound to the specified values. In your final query,
# these would be coming from somewhere else.
values (?x ?y) {
("Via de' cerretani" "via dei Cerretani")
("Doctor Who" "Who's on first?")
("CaT" "The cAt in the hat")
("John Doe" "Don't, John!")
}
# For "the significant word of the string which is
# usually the last one", note that the "all but the last word"
# is matched by the pattern ".* ". We can replace "all but the
# last word" with "" to leave just the last word. (Note that if the
# pattern doesn't match, then the original string is returned.
# This is good for us, because if there's just a single word,
# then it's also the last word.)
bind( replace( ?x, ".* ", "" ) as ?lastWordOfX )
# When you check whether the second string contains the first,
# you can either leave the cases as they are and have a case
# sensitive check, or you can convert them both to the same
# case and have a case insensitive match.
bind( contains( ?y, ?lastWordOfX ) as ?isMatch )
bind( contains( lcase(?y), lcase(?lastWordOfX) ) as ?isIMatch )
}
---------------------------------------------------------------------------------
| x | y | lastWordOfX | isMatch | isIMatch |
=================================================================================
| "Via de' cerretani" | "via dei Cerretani" | "cerretani" | false | true |
| "Doctor Who" | "Who's on first?" | "Who" | true | true |
| "CaT" | "The cAt in the hat" | "CaT" | false | true |
| "John Doe" | "Don't, John!" | "Doe" | false | false |
---------------------------------------------------------------------------------
That might look like a lot of code, but that's because there are comments, the last word is bound to another variable, and I've included both case sensitive and case insensitive matches. When you're actually using this, it will be much shorter. For instance, to select only those ?x and ?y that match in this way:
select ?x ?y {
values (?x ?y) {
("Via de' cerretani" "via dei Cerretani")
("Doctor Who" "Who's on first?")
("CaT" "The cAt in the hat")
("John Doe" "Don't, John!")
}
filter( contains( lcase(?y), lcase(replace( ?x, ".* ", "" ))))
}
----------------------------------------------
| x | y |
==============================================
| "Via de' cerretani" | "via dei Cerretani" |
| "Doctor Who" | "Who's on first?" |
| "CaT" | "The cAt in the hat" |
----------------------------------------------
It's true that
contains( lcase(?y), lcase(replace( ?x, ".* ", "" )))
is a bit longer than something like
regex( ?x, ?y, "some-special-flag" )
but I think it's fairly short. If you're willing to use the last word of ?x as a regular expression (which probably isn't a good idea, because you don't know that it doesn't contain special regular expression characters) you could even use:
regex( ?y, replace( ?x, ".* ", "" ), "i" )
but I suspect that it's probably faster to use contains, since regex has many more things to check.
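To experiment with the same logic outside a SPARQL endpoint, here is a small Python sketch of it. The helper names `last_word` and `last_word_in` are made up for this illustration; `re.sub` plays the role of replace, `str.lower` the role of lcase, and the `in` operator the role of contains.

```python
import re


def last_word(text):
    """Strip everything up to and including the last space, if there is one."""
    return re.sub(r".* ", "", text)


def last_word_in(x, y):
    """Case-insensitive check that the last word of x occurs in y."""
    return last_word(x).lower() in y.lower()


pairs = [
    ("Via de' cerretani", "via dei Cerretani"),
    ("Doctor Who", "Who's on first?"),
    ("CaT", "The cAt in the hat"),
    ("John Doe", "Don't, John!"),
]
for x, y in pairs:
    print(x, "|", y, "|", last_word(x), "|", last_word_in(x, y))
```

Only the last pair fails the check, which agrees with the isIMatch column in the query results above.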