Any way to speed up instaparse? - clojure

I'm trying to use instaparse on a DIMACS file less than 700k in size, with the following grammar:
<file>=<comment*> <problem?> clause+
clause=literal* <'0'>
calling like so
(def parser
(insta/parser ( "dimacs.bnf") :auto-whitespace :standard))
(time (parser (slurp filename)))
and it's taking about a hundred seconds. That's three orders of magnitude slower than I was hoping for. Is there some way to speed it up, some way to tweak the grammar or some option I'm missing?

The grammar is wrong. It can't be satisfied.
Every file ends with a clause.
Every clause ends with a '0'.
The literal in the clause, being a greedy reg-exp, will eat the final '0'.
Conclusion: No clause will ever be found.
For example ...
=> (parser "60")
Parse error at line 1, column 3:
Expected one of:
We can parse a literal
=> (parser "60" :start :literal)
... but not a clause
=> (parser "60" :start :clause)
Parse error at line 1, column 3:
Expected one of:
"0" (followed by end-of-string)
Why is it so slow?
If there is a comment:
it can swallow the whole file;
or be broken at any 'c' character into successive comments;
or terminate at any point after the initial 'c'.
This implies that every tail has to be presented to the rest of the grammar, which includes a reg-exp for literal that Instaparse can't see inside. Hence all have to be tried, and all will ultimately fail. No wonder it's slow.
I suspect that this file is actually divided into lines, and that your problems arise from trying to conflate newlines with other forms of white-space.
May I gently point out that playing with a few tiny examples - which is all I've done - might have saved you a deal of trouble.
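To illustrate the line-oriented reading suggested above, here is a minimal sketch in Python rather than instaparse. It assumes the usual DIMACS CNF layout (comment lines starting with 'c', one problem line starting with 'p', clauses terminated by a standalone 0); the function name parse_dimacs is made up for this example.

```python
def parse_dimacs(text):
    """Parse DIMACS CNF text into a list of clauses (each a list of ints)."""
    clauses = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comment lines ('c ...') and the problem line ('p ...').
        if not line or line[0] in "cp":
            continue
        for token in line.split():
            literal = int(token)
            if literal == 0:
                # A standalone 0 token terminates the clause.
                clauses.append(current)
                current = []
            else:
                current.append(literal)
    return clauses
```

Because the text is split on whitespace first, the token 60 is never confused with a literal 6 followed by the terminating 0, which is the ambiguity the greedy reg-exp runs into.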

I think that your extensive use of * is causing the problem. Your grammar is too ambiguous/ambitious (I guess). I would check two things:
;;run it as
(insta/parses grammar input)
;; with a small input
That will show you how much ambiguity there is in your grammar definition: check "ambiguous grammar".
Read Engelberg's performance notes; they will help you understand your own problem and probably find out what fits best for you.


ANTLR4: Matching an identifier but NOT a keyword

I'm using ANTLR4 to lex and parse a string. The string is this:
alpha at 3
The grammar is as such:
access: IDENTIFIER 'at' INT;
INT: '-'? ([1-9][0-9]* | [0-9]);
However, with this grammar ANTLR gives me line 1:6 mismatched input 'at' expecting 'at'. I've found that it is because IDENTIFIER is a superset of 'at', as seen in this answer. So, I tried changing the grammar to this:
access: identifier AT INT;
identifier: NAME | ~AT;
NAME: [A-Za-z]+;
INT: '-'? ([1-9][0-9]* | [0-9]);
AT: 'at';
However I get an identical error.
How can I match alpha at 3 where alpha is [A-Za-z]+ while at is also in [A-Za-z]+?
I found in my work with ANTLR4 that it was easier to divide my grammar into a separate lexer and parser. This has its own learning curve, but the result is that I think about "tokens" being fed to the parser, and I can use grun -tokens to check that my tokens are being recognized by the lexer before they get to the parser. I'm still an ANTLR4 newbie, so maybe two weeks ahead of you on the learning curve after playing with ANTLR4 off and on for a few years.
So in my grammar file I would have
AT: 'at';
INT: '-'? [0-9]+;
Beware after you do:
antlr4 myLexer.g4
antlr4 myParser.g4
javac *.java
The grun command to run your parser is not:
grun myParser -tokens access infile
but:
grun my -tokens access infile
Adding "Parser" to the name always kills me when I split my grammar into separate lexer/parser g4 files. I typically use ANTLR4, get mediocre at it, then don't use it for 8-12 months and run into the same issues, at which point I come here to Stack Overflow to get myself back on track.
This will show up in the grun -tokens as an "AT" token specifically. But as mentioned in the comments the AT needs to come first.
In any case where two lexer rules can match the same text (here 'at' matches AT: 'at' and is also a legal IDENTIFIER: [a-zA-Z]+), put the more specific rule first.
Also, I tend to avoid the greedy * matches and use the non-greedy *? matches, even though I don't quite have my head around the specific mechanics of how ANTLR4 distinguishes between '*' and '*?'. Future study for this student.
The other trick you can use is lexer modes. I think the maintenance overhead and complexity of lexer modes is a bit high, but they can provide a work-around hack to solve a problem until you can get your head around a "proper" parsing solution. That's how I use them today: as a crutch to get my problem solved, and I have "//TODO - I need to fix this" comments in my grammar.
So if your parsing gets more complex, you could try lexer modes, but I think they are a risky crutch... and you can get far down a time sink rabbit hole with them. (I think I'm half way down one now).
But I find ANTLR4 is a wonderful parsing tool, although I sometimes think I might have been better off just hardcoding C/Perl parsers than learning ANTLR4. The end result, I'm finding, is a grammar that can be more powerful than falling back to my old C/Perl brute-force token readers. And it's orders of magnitude more productive than trying Lex/Yacc was in the old days; I never got far enough down that path to consider them useful tools. ANTLR4 has been way more useful.
The first grammar you mentioned works fine, this is the result:
The second:
access: identifier AT INT;
identifier: NAME | ~AT;
NAME: [A-Za-z]+;
INT: '-'? ([1-9][0-9]* | [0-9]);
AT: 'at';
produces indeed the error. This is because NAME and AT both match the text "at". And because NAME is defined before AT, a NAME token will be created.
Always be careful with such overlapping tokens: always place keywords above NAME or identifier tokens:
AT: 'at';
INT: '-'? ([1-9][0-9]* | [0-9]);
Note that ANTLR will only look at which rule is defined first when rules match the same number of characters. So for input like "atat", an IDENTIFIER will be created (not 2 AT tokens!).
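The rule described above (longest match wins, and rule order only breaks ties) can be sketched outside ANTLR. The following Python toy lexer is an illustration of that behaviour, not ANTLR's implementation; the RULES table and the tokenize function are names invented for this example.

```python
import re

# Rules in priority order: the keyword comes before the general NAME rule.
RULES = [
    ("AT", re.compile(r"at")),
    ("NAME", re.compile(r"[A-Za-z]+")),
    ("INT", re.compile(r"-?(?:[1-9][0-9]*|[0-9])")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]

def tokenize(text):
    """Return (rule name, matched text) pairs, skipping whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

With this ordering, tokenize("alpha at 3") gives a NAME, an AT and an INT, while tokenize("atat") gives a single NAME, because the longer match beats the keyword. Moving NAME above AT in RULES reproduces the original problem: "at" is then always lexed as NAME.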

How can you require an undetermined character to be repeated consecutively a certain number of times in Ruby Treetop?

I want to create a rule that requires a non-number, non-letter character to be consecutively repeated three times. The rule would look something like this:
# Note, this code does not do what I want!
grammar ThreeCharacters
rule threeConsecutiveCharacters
(![a-zA-Z0-9] .) 3..3
Is there any way to require the first character that it detects to be repeated three times?
There was previously a similar question about detecting the number of indentations: PEG for Python style indentation
The solution there was to first initialize the indentation stack:
&{|s| @indents = [-1] }
Then save the indentation for the current line:
level = s[0].indentation.text_value.length
@indents << level
Whenever a new line begins it peeks at the indentation like this:
# Peek at the following indentation:
save = index; i = _nt_indentation; index = save
# We're closing if the indentation is less or the same as our enclosing block's:
closing = i.text_value.length <= @indents.last
If the indentation is larger it adds the new indentation level to the stack.
I could create something similar for my problem, but this seems like a very tedious way to solve it.
Are there any other ways to create my rule?
Yes, you can do it this way in Treetop. This kind of thing is not generally possible with a PEG because of the way packrat parsing works; it's greedy, but you need to limit its greed using semantic information from earlier in the parse. It's only the addition in Treetop of semantic predicates (&{...}) that makes it possible. So yes, it's tedious. You might consider using Rattler instead, as it has a significant number of features in addition to those available in Treetop. I can't advise (I maintain Treetop but am not a user of Rattler), but I am very impressed by its feature set and I think it will handle this case better.
If you proceed with Treetop, bear in mind that every semantic predicate should return a boolean value indicating success or failure. This is not explicit in the initialisation of @indents above.
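For comparison only, and outside Treetop: the requirement "whatever non-alphanumeric character comes first must appear three times in a row" is exactly the kind of dependence on earlier input that a regex backreference expresses. The sketch below uses Python's re module; the names TRIPLE and find_triple are hypothetical.

```python
import re

# Group 1 captures one non-alphanumeric character; \1{2} demands that the
# same character follows two more times.
TRIPLE = re.compile(r"([^a-zA-Z0-9])\1{2}")

def find_triple(text):
    """Return the first run of three identical non-alphanumeric characters, or None."""
    m = TRIPLE.search(text)
    return m.group(0) if m else None
```

A pure PEG has no equivalent of \1, which is why the Treetop version needs a semantic predicate to compare the later characters with the first one.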

Creating a simple parser in (V)C++ (2010) similar to PEG

For a school project, I need to parse a text/source file containing a simplified "fake" programming language to build an AST. I've looked at boost::spirit, but since this is a group project and most of the group seem reluctant to learn extra libraries, and the lecturer/TA recommended learning to create a simple one in C++, I thought of going that route. Are there some examples out there, or ideas on how to start? I have made a few attempts but they are not really successful yet ...
parsing line by line
Test each line with a bunch of regexes (one for procedure/function declaration, one for assignment, one for while, etc.)
But I will need to assume there are no multiple statements in one line: eg. a=b;x=1;
When I reach a container statement, procedures, whiles etc, I will increase the indent. So all nested statements will go under this
When I reach a } I will decrement indent
Any better ideas or suggestions? Example code I need to parse (very simplified here ...)
procedure Hello {
a = 1;
while a {
b = a + 1 + z;
Another idea was to read whole file into a string, and go top down. Match all procedures, then capture everything in { ... } then start matching statements (end with ;) or containers while { ... }. This is similar to how PEG does things? But I will need to read entire file
Multipass makes things easier. On a first pass, split things into tokens, like "=", or "abababa", or a quote-delimited string, or a block of whitespace. Don't be destructive (keep the original data), but break things down to simple chunks, and maybe have a little struct or enum that describes what the token is (ie, whitespace, a string literal, an identifier type thing, etc).
So your sample code gets turned into:
identifier(procedure) whitespace( ) identifier(Hello) whitespace( ) operation({) whitespace(\n\t) identifier(a) whitespace( ) operation(=) whitespace( ) number(1) operation(;) whitespace(\n\t) etc.
In those tokens, you might also want to store line number and offset on the line (this will help with error message generation later).
A quick test would be to turn this back into the original text. Another quick test might be to dump out pretty-printed version in html or something (where you color whitespace to have a pink background, identifiers as light blue, operations as light green, numbers as light orange), and see if your tokenizer is making sense.
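A minimal sketch of such a first pass, written in Python for brevity (the same structure carries over to C++). The token kinds, the Token type and the character sets are assumptions chosen to fit the sample code; the round-trip check at the end is the "quick test" mentioned above.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    kind: str    # "whitespace", "identifier", "number" or "operation"
    text: str    # the original characters, kept so nothing is lost
    line: int    # 1-based line where the token starts
    column: int  # 1-based column where the token starts

TOKEN_RE = re.compile(r"""
    (?P<whitespace>\s+)
  | (?P<identifier>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<number>[0-9]+)
  | (?P<operation>[{}();=+\-*/])
""", re.VERBOSE)

def tokenize(source):
    tokens = []
    pos, line, col = 0, 1, 1
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"line {line}, column {col}: unexpected {source[pos]!r}")
        text = m.group(0)
        tokens.append(Token(m.lastgroup, text, line, col))
        # Advance the line/column bookkeeping past this token.
        newlines = text.count("\n")
        if newlines:
            line += newlines
            col = len(text) - text.rfind("\n")
        else:
            col += len(text)
        pos = m.end()
    return tokens

def round_trip(source):
    """Quick test: joining the token texts must give back the original source."""
    return "".join(t.text for t in tokenize(source)) == source
```

Whitespace tokens are kept here so the round trip works; a later pass can filter them out if the language turns out to be whitespace insensitive.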
Now, your language may be whitespace insensitive. So discard the whitespace if that is the case! (C++ isn't, because you need newlines to learn when // comments end)
(Note: a professional language parser will be as close to one-pass as possible, because it is faster. But you are a student, and your goal should be to get it to work.)
So now you have a stream of such tokens. There are a bunch of approaches at this point. You could pull out some serious parsing chops and build a CFG to parse them. (Do you know what a CFG is? LR(1)? LL(1)?)
An easier method might be to do it a bit more ad-hoc. Look for operator({) and find the matching operator(}) by counting up and down. Look for language keywords (like procedure), which then expects a name (the next token), then a block (a {). An ad-hoc parser for a really simple language may work fine.
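The "counting up and down" step can be sketched in a few lines. This Python version assumes the tokens are plain strings with whitespace already discarded; matching_brace is a hypothetical helper name.

```python
def matching_brace(tokens, open_index):
    """Return the index of the '}' that closes the '{' at open_index."""
    if tokens[open_index] != "{":
        raise ValueError("open_index must point at a '{' token")
    depth = 0
    for i in range(open_index, len(tokens)):
        if tokens[i] == "{":
            depth += 1          # entering a nested block
        elif tokens[i] == "}":
            depth -= 1          # leaving a block
            if depth == 0:
                return i
    raise ValueError("unbalanced braces")
```

Everything between open_index + 1 and the returned index is the body of the block, which can then be parsed recursively as a list of statements.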
I've done exactly this for a ridiculously simple language, where the parser consisted of a really simple PDA. It might work for you guys. Or it might not.
Since you mentioned PEG, I'd like to throw in my open source project:
Here is a visual tool that can export C++ version:
Documentation for rule grammar:
If i was writing my own language I would probably look at the terminals/non-terminals found in System.Linq.Expressions as these would be a great start for your grammar rules.

Indentation control while developing a small python like language

I'm developing a small Python-like language using flex and byacc (for lexing and parsing) and C++, but I have a few questions regarding scope control.
Just like Python it uses white space (or tabs) for indentation. On top of that I want to implement indexed breaking: for instance, if you type "break 2" inside a while loop that's inside another while loop, it would break not only from the inner loop but from the outer loop as well (hence the number 2 after break), and so on.
while 1
while 1
break 2
'hello world'!! #will never reach this. "!!" outputs with a newline
'hello world again'!! #also will never reach this. again "!!" used for cout
#after break 2 it would jump right here
But since I don't have an "anti" tab character to check when a scope ends (in C, for example, I would just use the '}' char), I was wondering if this method would be the best:
I would define a global variable, like "int tabIndex", in my yacc file that I would access in my lex file using extern. Then every time I find a tab character in my lex file I would increment that variable by 1. When parsing in my yacc file, if I find a "break" keyword I would decrement tabIndex by the amount typed after it, and if I reach EOF and tabIndex != 0 I would output a compilation error.
Now the problem is: what's the best way to see if the indentation got reduced? Should I read \b (backspace) chars from lex and then reduce the tabIndex variable (when the user doesn't use break)?
another method to achieve this?
Also, just another small question: I want every executable to have its starting point in a function called start(). Should I hardcode this into my yacc file?
Sorry for the long question; any help is greatly appreciated. Also, if someone can provide a yacc file for Python, that would be nice as a guideline (I tried looking on Google and had no luck).
thanks in advance.
I am currently implementing a programming language rather similar to this (including the multilevel break oddly enough). My solution was to have the tokenizer emit indent and dedent tokens based on indentation. Eg:
while 1: # colons help :)
break 1
["while", "1", ":",
"print", "(", "'foo'", ")",
"break", "1",
It makes the tokenizer's handling of '\n' somewhat complicated though. Also, I wrote the tokenizer and parser from scratch, so I'm not sure whether this is feasible in lex and yacc.
Semi-working pseudocode example:
level = 0
levels = []
for c = getc():
    if c=='\n':
        n = 0
        while (c=getc())==' ':
            n += 1
        if n > level:
            push(levels, level)
            level = n
            emit(indent)
        while n < level:
            level = pop(levels)
            emit(dedent)
            if level < n:
                error tokenize
        # fall through
    emit(c) #lazy example
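A runnable take on the same idea, as a Python sketch. It works line by line instead of character by character, counts only spaces as indentation, and uses made-up token shapes ("INDENT", "DEDENT", "LINE"); in a flex/byacc setup the equivalent logic would live in the lexer, which is an assumption rather than something tested here.

```python
def indent_tokens(source):
    """Turn leading spaces into explicit INDENT / DEDENT tokens."""
    levels = [0]   # stack of indentation widths of the enclosing blocks
    out = []
    for raw in source.splitlines():
        if not raw.strip():
            continue   # blank lines do not open or close blocks
        n = len(raw) - len(raw.lstrip(" "))
        if n > levels[-1]:
            levels.append(n)
            out.append(("INDENT",))
        while n < levels[-1]:
            levels.pop()
            out.append(("DEDENT",))
        if n != levels[-1]:
            # Dedented to a width that matches no enclosing block.
            raise IndentationError(f"inconsistent dedent: {raw!r}")
        out.append(("LINE", raw.strip()))
    # Close any blocks that are still open at end of input.
    while len(levels) > 1:
        levels.pop()
        out.append(("DEDENT",))
    return out
```

With INDENT and DEDENT in the token stream, the grammar can treat them exactly like '{' and '}', and "break 2" becomes a matter for the parser or code generator rather than the tokenizer.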
Very interesting exercise. Can't you use the end keyword to check when the scope ends?
On a different note, I have never seen a language that allows you to break out of several nested loops at once. There may be a good reason for that...

Regex, writing a toy compiler, parsing, comment remover

I'm currently working my way through this book:
I'm currently on a section where the exercise is to create a compiler for a very simple Java-like language.
The book always states what is required but not the how (which is a good thing). I should also mention that it talks about yacc and lex and specifically says to avoid them for the projects in the book for the sake of learning on your own.
I'm on chapter 10 and starting to write the tokenizer.
1) Can anyone give me some general advice - are regex the best approach for tokenizing a source file?
2) I want to remove comments from source files before parsing - this isn't hard but most compilers tell you the line an error occurs on, if I just remove comments this will mess up the line count, are there any simple strategies for preserving the line count while still removing junk?
Thanks in advance!
The tokenizer itself is usually written using a large DFA table that describes all possible valid tokens (stuff like: a token can start with a letter followed by other letters/numbers followed by a non-letter, or with a number followed by other numbers and either a non-number/point or a point followed by at least one number and then a non-number, etc.). The way I built mine was to identify all the regular expressions my tokenizer will accept, transform them into DFAs and combine them.
Now to "remove comments": when you're parsing a token you can have a comment token (the regex to parse a comment is too long to describe in words), and when you finish parsing this comment you just parse a new token, thus ignoring it. Alternatively you can pass it to the compiler and let it deal with it (or ignore it, as it will). Either approach will preserve meta-data like line numbers and characters-into-the-line.
edit for DFA theory:
Every regular expression can be converted into a DFA, and lexer generators do this for performance reasons, since it removes any backtracking when matching. This link gives you an idea of how this is done. You first convert the regular expression into an NFA (a nondeterministic automaton, which may need backtracking to simulate directly), then remove the nondeterminism by expanding your finite automaton's set of states.
Another way you can build your DFA is by hand using some common sense. Take for example a finite automaton that can parse either an identifier or a number. This of course isn't enough, since you most likely want to add comments too, but it'll give you an idea of the underlying structures.
                A-Z       space
(Start)----+---->(I1)------------>((Identifier))
           |     | ^
           |     +-+
           |    A-Z0-9
           |                  space
           +---->(N1)---+--->((Number)) <----------+
            0-9  | ^    |                          |
                 | |    | .       0-9       space  |
                 +-+    +--->(N2)----->(N3)--------+
                 0-9                   | ^
                                       +-+
                                       0-9
Some notes on the notation used, the DFA starts at the (Start) node and moves through the arrows as input is read from your file. At any one point it can match only ONE path. Any paths missing are defaulted to an "error" node. ((Number)) and ((Identifier)) are your ending, success nodes. Once in those nodes, you return your token.
So from the start, if your token starts with a letter, it HAS to continue with a bunch of letters or numbers and end with a "space" (spaces, new lines, tabs, etc). There is no backtracking, if this fails the tokenizing process fails and you can report an error. You should read a theory book on error recovery to continue parsing, its a really huge topic.
If however your token starts with a number, it has to be followed by either a bunch of numbers or one decimal point. If there's no decimal point, a "space" has to follow the numbers, otherwise a number has to follow followed by a bunch of numbers followed by a space. I didn't include the scientific notation but it's not hard to add.
Now for parsing speed, this gets transformed into a DFA table, with all nodes on both the vertical and horizontal lines. Something like this:
            I1             Identifier  N1       N2       N3       Number
start       letter         nothing     number   nothing  nothing  nothing
I1          letter+number  space       nothing  nothing  nothing  nothing
Identifier  nothing        SUCCESS     nothing  nothing  nothing  nothing
N1          nothing        nothing     number   dot      nothing  space
N2          nothing        nothing     nothing  nothing  number   nothing
N3          nothing        nothing     nothing  nothing  number   space
Number      nothing        nothing     nothing  nothing  nothing  SUCCESS
The way you'd run this is you store your starting state and move through the table as you read your input character by character. For example an input of "1.2" would parse as start->N1->N2->N3->Number->SUCCESS. If at any point you hit a "nothing" node, you have an error.
edit 2: the table should actually be node->character->node, not node->node->character, but it worked fine in this case regardless. It's been a while since I last wrote a compiler by hand.
1- Yes, regexes are good for implementing the tokenizer. If using a generated tokenizer like lex, then you describe each token as a regex. See Mark's answer.
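Here is a small sketch of the node->character->node form of that table, in Python. The state names follow the diagram; the char_class helper, the dictionary layout and the choice to accept lowercase letters as well as A-Z are assumptions made for this example.

```python
def char_class(ch):
    """Map a character to the input class used as a table column."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch == ".":
        return "dot"
    if ch.isspace():
        return "space"
    return "other"

# state -> character class -> next state; a missing entry means "error".
TRANSITIONS = {
    "start": {"letter": "I1", "digit": "N1"},
    "I1": {"letter": "I1", "digit": "I1", "space": "Identifier"},
    "N1": {"digit": "N1", "dot": "N2", "space": "Number"},
    "N2": {"digit": "N3"},
    "N3": {"digit": "N3", "space": "Number"},
}
ACCEPTING = {"Identifier", "Number"}

def next_token(text, pos=0):
    """Return (kind, lexeme, position after the terminating space)."""
    state = "start"
    start = pos
    while pos < len(text):
        state = TRANSITIONS.get(state, {}).get(char_class(text[pos]))
        if state is None:
            raise ValueError(f"unexpected {text[pos]!r} at offset {pos}")
        if state in ACCEPTING:
            return state, text[start:pos], pos + 1
        pos += 1
    raise ValueError("input ended before the token was terminated by a space")
```

For the input "1.2 " the states visited are start, N1, N2, N3 and Number, matching the walk described above. There is no backtracking: each character causes exactly one table lookup.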
2- The lexer is what normally tracks line/column information, as tokens are consumed by the tokenizer, you track the line/column information with the token, or have it as current state. Therefore when a problem is found the tokenizer knows where you are. Therefore when processing comments, as new lines are processed the tokenizer just increments the line_count.
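As a rough illustration of that bookkeeping, here is a Python sketch in which comments are matched as tokens and dropped, while the newlines inside them still advance the line counter. The token kinds and the C-style comment syntax are assumptions; the book's language may differ.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<space>\s+)
  | (?P<word>[A-Za-z_]\w*)
  | (?P<number>\d+)
  | (?P<symbol>.)
""", re.VERBOSE | re.DOTALL)

def tokenize(source):
    """Return (kind, text, line) triples with comments and whitespace dropped."""
    line = 1
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group(0)
        if kind not in ("comment", "space"):
            tokens.append((kind, text, line))
        # Count newlines in everything consumed, including skipped comments.
        line += text.count("\n")
    return tokens
```

Since the line counter is updated from the consumed text rather than from the emitted tokens, removing comments never shifts the line numbers reported in later error messages.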
In Lex you can also have parsing states. Multi-line comments are often implemented using these states, thus allowing simpler regexes. Once you find the match for the start of a comment, e.g. '/*', you change into a comment state, which you can set up to be exclusive from the normal state. Therefore, as you consume text looking for the end comment marker '*/', you do not match normal tokens.
This state-based process is also useful for processing string literals that allow nested end markers, e.g. "test\"more text".
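The state idea can be sketched by hand without Lex. The Python function below is an illustration under assumed syntax (C-style /* */ comments, double-quoted strings with backslash escapes); it collects string literal contents and shows that each state only reacts to its own end marker.

```python
def string_literals(source):
    """Collect string literal contents using NORMAL / COMMENT / STRING states."""
    state = "NORMAL"
    strings = []
    current = []
    i = 0
    while i < len(source):
        ch = source[i]
        two = source[i:i + 2]
        if state == "NORMAL":
            if two == "/*":
                state = "COMMENT"
                i += 2
                continue
            if ch == '"':
                state = "STRING"
                current = []
        elif state == "COMMENT":
            # Only the end marker matters here; quotes are ignored.
            if two == "*/":
                state = "NORMAL"
                i += 2
                continue
        else:  # STRING
            if ch == "\\" and i + 1 < len(source):
                current.append(source[i + 1])   # escaped char, e.g. \"
                i += 2
                continue
            if ch == '"':
                strings.append("".join(current))
                state = "NORMAL"
            else:
                current.append(ch)
        i += 1
    if state != "NORMAL":
        raise ValueError(f"input ended inside {state}")
    return strings
```

A quote inside a comment does not start a string, a /* inside a string does not start a comment, and an escaped quote does not end the string, which is what exclusive states buy you.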