How do you get instaparse to skip whitespace between tokens?
(I could of course define whitespace as a token in its own right and insert it between all the elements on the right-hand side of every rule, but I'm dealing with a grammar that has over three hundred rules, and I'm hoping for a way to say it once rather than three hundred times.)
You can pass an optional parameter to turn on auto-whitespace:
(doc insta/parser)
-------------------------
instaparse.core/parser
([grammar-specification & {:as options}])
:auto-whitespace (:standard or :comma)
or
:auto-whitespace custom-whitespace-parser
I am trying to parse multiple tags in one string literal, such as `name=testName, key=testKey, columns=(c1, c2, c3)`, and I might add more tags with different syntax to this string in the near future, so it seemed natural to use regexes to implement it.
As for the syntax:
valid:
`name=testName,key=testKey`
`name=testName, key=testKey`
`name=testName key=testKey`
`name=testName key=testKey columns=(c1 c2 c3)`
`name=testName key=testKey columns=(c1, c2, c3)`
`name=testName, key=testKey, columns=(c1 c2 c3)`
invalid:
`name=testName,, key=testKey` (multiple commas in between)
`name=testName, key=testKey,` (end with a comma)
`name=testName, key=testKey, columns=(c1,c2 c3)` (you can only use commas or whitespace consistently inside columns, and the same rule applies to the whole list of tags; see below)
`name=testName, key=testKey columns=(c1,c2,c3)`
I came up with the whole pattern like this:
((name=\w+|key=\w+)+,\s*)*(name=\w+|key=\w+)+
I am wondering whether it is possible to define each subpattern as a regex and then combine them into a larger pattern, such as:
patternName := regexp.MustCompile(`name=\w+`)
patternKey := regexp.MustCompile(`key=\w+`)
pattern = ((patternName|patternKey)+,\s*)*(patternName|patternKey)+
Considering that I will add more tags, the whole pattern will definitely get larger and uglier. Is there an elegant way to combine subpatterns like this?
Yes, what you want is possible. The regexp.Regexp type has a String() method, which produces the string representation, so you can use it to combine regular expressions:
patternName := regexp.MustCompile(`name=\w+`)
patternKey := regexp.MustCompile(`key=\w+`)
pattern := regexp.MustCompile(`((` + patternName.String() + `|` + patternKey.String() + `)+,\s*)*(` + patternName.String() + `|` + patternKey.String() + `)+`)
Can be shortened (though less efficient) with fmt.Sprintf:
pattern := regexp.MustCompile(fmt.Sprintf(`((%s|%s)+,\s*)*(%s|%s)+`, patternName, patternKey, patternName, patternKey))
But just because it's possible doesn't mean you should do it...
Your particular examples would be much more easily handled using standard text parsing methods such as strings.Split or strings.FieldsFunc, etc. Given your provided sample inputs, I would do it this way:
1) Split on whitespace/commas.
2) Split each result on the equals sign.
3) Validate that the key names are expected (name and/or key).
This code will be far easier to read, and will execute probably hundreds or thousands of times faster, compared to a regular expression. This approach also lends itself easily to stream processing, which can be a big benefit if you're processing hundreds or more records, and don't want to consume a lot of memory. (Regexp can be made to do this as well, but it's still less readable).
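Here's a minimal sketch of that approach. The parseTags helper and the set of known tag names are illustrative assumptions, and the consistency rules (no doubled commas, no trailing comma, no mixed separators) are left unchecked for brevity:

package main

import (
	"fmt"
	"strings"
)

// parseTags implements the three steps above for the simple key=value
// pairs; the parenthesized columns value would need extra handling.
func parseTags(input string) (map[string]string, error) {
	known := map[string]bool{"name": true, "key": true}
	tags := make(map[string]string)

	// 1) split on whitespace/commas
	fields := strings.FieldsFunc(input, func(r rune) bool {
		return r == ',' || r == ' ' || r == '\t'
	})
	for _, f := range fields {
		// 2) split each result on the equals sign
		kv := strings.SplitN(f, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed field %q", f)
		}
		// 3) validate that the key name is expected
		if !known[kv[0]] {
			return nil, fmt.Errorf("unknown tag %q", kv[0])
		}
		tags[kv[0]] = kv[1]
	}
	return tags, nil
}

func main() {
	fmt.Println(parseTags("name=testName, key=testKey"))
	// map[key:testKey name:testName] <nil>
}

Note that strings.FieldsFunc collapses runs of separators, so rejecting doubled commas would take one extra check over the raw string.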
I'm trying to use instaparse on a DIMACS file less than 700K in size, with the following grammar:
<file>=<comment*> <problem?> clause+
comment=#'c.*'
problem=#'p\s+cnf\s+\d+\s+\d+\s*'
clause=literal* <'0'>
<literal>=#'[1-9]\d*'|#'-\d+'
calling it like so:
(def parser
  (insta/parser (clojure.java.io/resource "dimacs.bnf") :auto-whitespace :standard))
...
(time (parser (slurp filename)))
and it's taking about a hundred seconds. That's three orders of magnitude slower than I was hoping for. Is there some way to speed it up, some way to tweak the grammar or some option I'm missing?
The grammar is wrong. It can't be satisfied.
Every file ends with a clause.
Every clause ends with a '0'.
The literal in the clause, being a greedy regexp, will eat the final '0'.
Conclusion: No clause will ever be found.
For example ...
=> (parser "60")
Parse error at line 1, column 3:
60
^
Expected one of:
"0"
#"\s+"
#"-\d+"
#"[1-9]\d*"
We can parse a literal
=> (parser "60" :start :literal)
("60")
... but not a clause
=> (parser "60" :start :clause)
Parse error at line 1, column 3:
60
^
Expected one of:
"0" (followed by end-of-string)
#"\s+"
#"-\d+"
#"[1-9]\d*"
Why is it so slow?
If there is a comment:
it can swallow the whole file;
or be broken at any 'c' character into successive comments;
or terminate at any point after the initial 'c'.
This implies that every tail has to be presented to the rest of the grammar, which includes a reg-exp for literal that Instaparse can't see inside. Hence all have to be tried, and all will ultimately fail. No wonder it's slow.
I suspect that this file is actually divided into lines. And that your problems arise from trying to conflate newlines with other forms of white-space.
May I gently point out that playing with a few tiny examples - which is all I've done - might have saved you a great deal of trouble.
I think that your extensive use of * is causing the problem. Your grammar is too ambiguous/ambitious (I guess). I would check two things:
;; run it as
(insta/parses grammar input)
;; with a small input
That will show you how much ambiguity is in your grammar definition: check "ambiguous grammar".
Read Engelberg's performance notes; they will help you understand your own problem and probably find out what fits best for you.
I have the sequence below that I need to write a regular expression for. Any hints or tips on how to get started would be appreciated!
update: my assignment is to write a regular expression for the given 'alignment', not 'sequence', as I previously misread. Also, I added spaces to show how the sequence looks in the assignment, just without the spaces in between.
QIQAAKIWAAKPYVDESRISIWGWSYGGF
QIAAAKHWAQKDYIDEDRLAIWGWSYGGY
QIQAAKAWGKKPYVDKTRMAIWGWSYGG
QIEATRQFSKMGFVDDKRIAIWGWSYGGY
QIEAARQFLKMGFVDSKRVAIWGWSYGGY
QVFAAKELLKNRWADKDHIGIWGWSYGGF
QVFAAKEVLKNRWADKDHIGIWGXSYGGF
QVFAAKELLKNRWADKDHIGIWGWSYGGF
QVFAAKELLKNRWADKDHIGIWGWSYGGF
VGSASVSMMPRLPRLPQLLDQPGSSSGGY
FIAAAEYLKAEGYTRTDRLAIRGGSNGGL
FQCAAEYLIKEGYTSPKRLTINGGSNGGL
FQCAAEYLIKEGYTTSKRLTINGGSNGGL
FIAAGEYLQKNGYTSKDYMALSGRSNGGL
YLDACDALLKLGYGSPSLCYAMGGSAGGM
FIAAAKHLIDQNYTSPTKMAARGGSAGGL
QITAVRKFIEMGFIDEKRIAIWGWSYGGY
QLTAVRKFIEMGFIDEERIAIWGWSYGGY
These are the steps I would take:
1) align the sequences
2) read each column of the alignment and produce a list of the different possible amino acids in each position
3) each position can now be represented by a list which is easily converted to a regular expression
For the first three positions it would be:
(Q|V|F|Y)(I|V|G|Q|L)(T|A|D|L|S|F|E|Q)
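Steps 2 and 3 are mechanical enough to script. Here's a small sketch under the assumption that the sequences are already aligned to equal length (alignmentToRegex is an illustrative name, and for single letters a character class like [QV] is equivalent to the alternation (Q|V) shown above):

package main

import (
	"fmt"
	"sort"
	"strings"
)

// alignmentToRegex emits one character (or one character class) per
// column of the alignment.
func alignmentToRegex(seqs []string) string {
	var b strings.Builder
	for i := 0; i < len(seqs[0]); i++ {
		// collect the set of residues seen in this column
		seen := map[byte]bool{}
		for _, s := range seqs {
			seen[s[i]] = true
		}
		var class []byte
		for c := range seen {
			class = append(class, c)
		}
		sort.Slice(class, func(x, y int) bool { return class[x] < class[y] }) // deterministic output
		if len(class) == 1 {
			b.WriteByte(class[0])
		} else {
			b.WriteString("[" + string(class) + "]")
		}
	}
	return b.String()
}

func main() {
	fmt.Println(alignmentToRegex([]string{"QIQ", "QIA", "VGS"}))
	// [QV][GI][AQS]
}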
Oh, and for crying out loud, if you want to be a biostats grad student, learn some biology!
I'm trying to write an Emacs major mode for working with biological sequence data (i.e. DNA and peptides), and I want to implement syntax highlighting where different letters are colored differently. Since the mode needs to be able to differentiate DNA sequences and amino acid sequences and color them differently, I am putting each sequence in the files on a single line with a single-character prefix (+ or #) that should indicate how the following line should be highlighted.
So, for example, if the file contained a line that read:
+AAGATCCCAGATT
The "A"s should all be in one color that is different from the rest of the line.
I have tried the following as a test:
(setq dna-keyword
      '(("^\+\\([GCT\-]*\\(A\\)\\)*" (2 font-lock-function-name-face))))

(define-derived-mode bioseq-mode fundamental-mode
  (setq font-lock-defaults '(dna-keyword))
  (setq mode-name "bioseq mode"))
But that only matches the last A instead of all of them.
My first thought was to try to match the whole line with one regexp and then use another regexp to match just the A's within that line, but I have no idea if that's possible in the context of font-lock-mode or how it would be accomplished. Any ideas on how to do something like that, or how to accomplish this in a different way?
Indeed, Emacs provides just what you need to do this with the "anchored match" feature of font-lock mode. The syntax is a bit hairy, but it allows you to specify additional "matchers" (basically a regexp, subexpression identifier and face name) which (by default) will be applied following the position where the main "matcher" regexp finished up to the end of the line. There are more complicated ways of customizing exactly what range of text they apply to, but that's the general idea.
Here's a simple example which also shows how you could define your own faces for the purpose:
(defface bioseq-mode-a
  '((((min-colors 8)) :foreground "red"))
  "Face for As in bioseq-mode")

(defface bioseq-mode-g
  '((((min-colors 8)) :foreground "blue"))
  "Face for Gs in bioseq-mode")
(setq dna-keyword
      '(("^\\+" ("A" nil nil (0 'bioseq-mode-a)))
        ("^\\+" ("G" nil nil (0 'bioseq-mode-g)))))
You can also specify two or more anchored matchers for one main matcher (the main matcher here being the regexp "^\\+"). To make this work, each anchored matcher after the first needs to explicitly return to the beginning of the line before beginning its search; otherwise it would only begin highlighting after the last occurrence of the previous anchored matcher. This is accomplished by putting (beginning-of-line) in the PRE-MATCH-FORM slot (element 2 of the list; see below).
(setq dna-keyword
      '(("^\\+"
         ("A" nil nil (0 'bioseq-mode-a))
         ("G" (beginning-of-line) nil (0 'bioseq-mode-g)))))
I think it's mostly a matter of taste which you prefer; the second way might be slightly clearer code if you have many different anchored matchers for a single line, but I doubt there's a significant performance difference.
Here's the relevant bit of the documentation for font-lock-defaults:
HIGHLIGHT should be either MATCH-HIGHLIGHT or MATCH-ANCHORED.
[....]
MATCH-ANCHORED should be of the form:
(MATCHER PRE-MATCH-FORM POST-MATCH-FORM MATCH-HIGHLIGHT ...)
where MATCHER is a regexp to search for or the function name to call to make
the search, as for MATCH-HIGHLIGHT above, but with one exception; see below.
PRE-MATCH-FORM and POST-MATCH-FORM are evaluated before the first, and after
the last, instance MATCH-ANCHORED's MATCHER is used. Therefore they can be
used to initialize before, and cleanup after, MATCHER is used. Typically,
PRE-MATCH-FORM is used to move to some position relative to the original
MATCHER, before starting with MATCH-ANCHORED's MATCHER. POST-MATCH-FORM might
be used to move back, before resuming with MATCH-ANCHORED's parent's MATCHER.
The above-mentioned exception is as follows. The limit of the MATCHER search
defaults to the end of the line after PRE-MATCH-FORM is evaluated.
However, if PRE-MATCH-FORM returns a position greater than the position after
PRE-MATCH-FORM is evaluated, that position is used as the limit of the search.
It is generally a bad idea to return a position greater than the end of the
line, i.e., cause the MATCHER search to span lines.
I always find that I have to read the font-lock documentation about three times before it starts to make sense to me ;-)
I'm currently working my way through this book:
http://www1.idc.ac.il/tecs/
I'm on a section where the exercise is to create a compiler for a very simple Java-like language.
The book always states what is required, but not the how (which is a good thing). I should also mention that it talks about yacc and lex, and specifically says to avoid them for the projects in the book, for the sake of learning on your own.
I'm on chapter 10 and starting to write the tokenizer.
1) Can anyone give me some general advice - are regexes the best approach for tokenizing a source file?
2) I want to remove comments from source files before parsing. This isn't hard, but most compilers tell you the line an error occurs on, and if I just remove comments it will mess up the line count. Are there any simple strategies for preserving the line count while still removing junk?
Thanks in advance!
The tokenizer itself is usually written using a large DFA table that describes all possible valid tokens (for instance: a token can start with a letter, followed by other letters/numbers, followed by a non-letter; or with a number followed by other numbers and either a non-number/point, or a point followed by at least one number and then a non-number; and so on). The way I built mine was to identify all the regular expressions my tokenizer will accept, transform them into DFAs, and combine them.
Now to "remove comments": when you're parsing a token you can have a comment token (the regex to parse a comment is too long to describe in words), and when you finish parsing this comment you just parse a new token, thus ignoring it. Alternatively you can pass it to the compiler and let it deal with it (or ignore it, as it will). Either approach will preserve meta-data like line numbers and characters-into-the-line.
edit for DFA theory:
Every regular expression can be converted (and in practice is converted) into a DFA for performance reasons, since this removes any backtracking while matching. This link gives you an idea of how it's done: you first convert the regular expression into an NFA (a nondeterministic automaton, which allows backtracking), then eliminate the nondeterminism by expanding your finite automaton (the subset construction).
Another way you can build your DFA is by hand, using some common sense. Take for example a finite automaton that can parse either an identifier or a number. This of course isn't enough, since you most likely want to add comments too, but it'll give you an idea of the underlying structures.
A-Z space
->(Start)----->(I1)------->((Identifier))
| | ^
| +-+
| A-Z0-9
|
| space
+---->(N1)---+--->((Number)) <----------+
0-9 | ^ | |
| | | . 0-9 space |
+-+ +--->(N2)----->(N3)--------+
0-9 | ^
+-+
0-9
Some notes on the notation used, the DFA starts at the (Start) node and moves through the arrows as input is read from your file. At any one point it can match only ONE path. Any paths missing are defaulted to an "error" node. ((Number)) and ((Identifier)) are your ending, success nodes. Once in those nodes, you return your token.
So from the start, if your token starts with a letter, it HAS to continue with a bunch of letters or numbers and end with a "space" (spaces, new lines, tabs, etc.). There is no backtracking; if this fails, the tokenizing process fails and you can report an error. (You should read a theory book on error recovery to continue parsing; it's a really huge topic.)
If however your token starts with a number, it has to be followed either by a bunch of numbers and then a "space", or by a decimal point, which must be followed by at least one number, then more numbers, and then a "space". I didn't include scientific notation, but it's not hard to add.
Now for parsing speed, this gets transformed into a DFA table, with all nodes on both the vertical and horizontal lines. Something like this:
            I1             Identifier  N1       N2       N3       Number
start       letter         nothing     number   nothing  nothing  nothing
I1          letter+number  space       nothing  nothing  nothing  nothing
Identifier  nothing        SUCCESS     nothing  nothing  nothing  nothing
N1          nothing        nothing     number   dot      nothing  space
N2          nothing        nothing     nothing  nothing  number   nothing
N3          nothing        nothing     nothing  nothing  number   space
Number      nothing        nothing     nothing  nothing  nothing  SUCCESS
The way you'd run this is you store your starting state and move through the table as you read your input character by character. For example an input of "1.2" would parse as start->N1->N2->N3->Number->SUCCESS. If at any point you hit a "nothing" node, you have an error.
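Here's a sketch of that table-driven loop in Go, using the corrected node->character->node layout mentioned in the edit below (the state names, character classes and token kinds are illustrative):

package main

import "fmt"

// states
const (
	Err   = iota // dead state (zero value: missing table entries default here)
	Start
	I1 // inside an identifier
	N1 // integer digits
	N2 // just consumed the decimal point
	N3 // fractional digits
)

// character classes
const (
	Letter = iota
	Digit
	Dot
	Space
	nClasses
)

func classOf(c rune) int {
	switch {
	case c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z':
		return Letter
	case c >= '0' && c <= '9':
		return Digit
	case c == '.':
		return Dot
	default:
		return Space // anything else ends a token, as in the diagram
	}
}

// delta[state][class] = next state
var delta = [...][nClasses]int{
	Start: {Letter: I1, Digit: N1},
	I1:    {Letter: I1, Digit: I1},
	N1:    {Digit: N1, Dot: N2},
	N2:    {Digit: N3},
	N3:    {Digit: N3},
}

// accepting states and the token kind they yield
var accept = map[int]string{I1: "Identifier", N1: "Number", N3: "Number"}

func tokenize(input string) []string {
	var kinds []string
	state := Start
	for _, c := range input + " " { // a trailing space flushes the last token
		if classOf(c) == Space {
			if state != Start {
				if kind, ok := accept[state]; ok {
					kinds = append(kinds, kind)
				} else {
					kinds = append(kinds, "error")
				}
			}
			state = Start
			continue
		}
		state = delta[state][classOf(c)]
	}
	return kinds
}

func main() {
	fmt.Println(tokenize("abc 1.2 7 1."))
	// [Identifier Number Number error]
}

Making the dead state the zero value keeps the table sparse: any transition not listed is an error.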
edit 2: the table should actually be node->character->node, not node->node->character, but it worked fine in this case regardless. It's been a while since I last wrote a compiler by hand.
1) Yes, regexes are a good way to implement the tokenizer. If you use a generated tokenizer like lex, then you describe each token as a regex; see Mark's answer.
2) The lexer is what normally tracks line/column information. As tokens are consumed by the tokenizer, you record the line/column information with the token, or keep it as current state, so when a problem is found the tokenizer knows where you are. When processing comments, the tokenizer simply increments the line count as new lines are consumed.
In lex you can also have parsing states. Multi-line comments are often implemented using these states, which allows simpler regexes. Once you find the match for the start of a comment, e.g. '/*', you change into the comment state, which you can set up to be exclusive from the normal state, so while you consume text looking for the end-comment marker '*/' you do not match normal tokens.
This state-based process is also useful for processing string literals that allow escaped end markers, e.g. "test\"more text".
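Here's a sketch of that two-state idea in Go, emulating lex's exclusive comment state with an explicit mode variable (the names are illustrative). It strips /* ... */ comments but re-emits the newlines they contained, so line numbers computed from the stripped text still match the original, which is one simple answer to question 2 above:

package main

import (
	"fmt"
	"strings"
)

const (
	stateNormal = iota
	stateComment
)

// stripComments removes /* ... */ comments but writes their newlines
// back out, preserving the line count.
func stripComments(src string) string {
	state := stateNormal
	var out strings.Builder
	for i := 0; i < len(src); i++ {
		c := src[i]
		switch state {
		case stateNormal:
			if c == '/' && i+1 < len(src) && src[i+1] == '*' {
				state = stateComment
				i++ // skip the '*'
			} else {
				out.WriteByte(c)
			}
		case stateComment:
			if c == '\n' {
				out.WriteByte(c) // keep the line structure
			} else if c == '*' && i+1 < len(src) && src[i+1] == '/' {
				state = stateNormal
				i++ // skip the '/'
			}
		}
	}
	return out.String()
}

func main() {
	fmt.Printf("%q\n", stripComments("a /* multi\nline */ b"))
	// "a \n b" -- 'b' is still on line 2
}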