How to make flex try the second longest matching regular expression?

This question might sound a little confusing. I'm using Flex to pass tokens to Bison.
The behavior I want is that Flex matches the longest regular expression and passes that token (it DOES work like this), but if that token doesn't work with the grammar, it then matches the second longest regular expression and passes that token.
I'm struggling to think of a way to create this behavior. How could I make this happen?
To clarify, for example, say I have two rules:
"//" return TOKEN_1;
"///" return TOKEN_2;
Given the string "///", I'd like it to first pass TOKEN_2 (it does).
If TOKEN_2 doesn't fit with the grammar as specified in Bison, it then passes TOKEN_1 (which is also valid).
How can I create this behavior?

In flex, a rule's action can give up on its match and fall back to the second-best matching rule by using the REJECT macro:
REJECT directs the scanner to proceed on to the
"second best" rule which matched the input (or a
prefix of the input). The rule is chosen as
described above in "How the Input is Matched", and
yytext and yyleng set up appropriately. It may
either be one which matched as much text as the
originally chosen rule but came later in the flex
input file, or one which matched less text.
(source: The Flex Manual Page).
So to answer your question about getting the second-longest expression, you may be able to do this using REJECT (though be careful: it may instead fall back to a later rule that matched the same amount of text).
Note that flex will run slower whenever REJECT is used anywhere in the scanner, because it must maintain the extra state needed to "fall back" to worse matches at any point. I'd suggest using it only if there's no other way to solve your problem.
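For illustration, here is a minimal lexer sketch using REJECT with the rules from the question. The feedback flag parser_wants_shorter is hypothetical; flex provides no such link back from the parser, so you would have to set it yourself (e.g. via a lexical tie-in):
%{
#include "y.tab.h"               /* TOKEN_1, TOKEN_2 declared by bison */
int parser_wants_shorter = 0;    /* hypothetical flag set from the parser side */
%}
%%
"///"   { if (parser_wants_shorter) REJECT;   /* fall back to the "//" rule */
          return TOKEN_2; }
"//"    { return TOKEN_1; }
.|\n    { return *yytext; }
%%
When REJECT fires, the scanner backs up and retries with the "//" rule, so TOKEN_1 is returned and the remaining "/" stays in the input to be scanned next.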

Sorry, but you can't do that. I'm actually unsure how much flex talks to bison. I do know there is a mode for REPL-style parsing, and another mode that parses the whole input at once.
You'll have to inline the rule. For example, instead of // and ///, write a rule that accepts /// and then a grammar rule that treats /// as // followed by /. But that gets messy, and I only did that in one specific case in my own code.

I would just have the lexer scan the two tokens // and / and then have the parser deal with the cases where they are supposed to be regarded as one token or as separate tokens. I.e. a grammar rule that begins with /// can actually be refactored into one which starts with // and /. In other words, do not have a TOKEN_2 at all. In some cases this sort of thing can be tricky, though, because the LALR(1) parser has only one token of lookahead. It has to make a shift or reduce decision based on seeing the // only, without regard for the / which follows.
I had an idea for solving this with a hacky approach involving a lexical tie-in, but it proved unworkable.
The main flaw with the idea is that there isn't any easy way to do error recovery in yacc which is hidden from the user. If a syntax error is triggered, that is visible. The yyerror function could contain a hack to try to hide this, but it lacks the context information.
In other words, you can't really use Yacc error actions to trigger a backtracking search for another parse.

This is tough for bison/yacc to deal with, as it doesn't do backtracking. Even if you use a backtracking parser generator like btyacc, it doesn't really help unless it also backtracks through the lexer (which would likely require a parser generator with an integrated lexer).
My suggestion would be to have the lexer recognize a slash immediately followed by a slash specially and return a different token:
\//\/ return SLASH;
\/ return '/'; /* not needed if you have the catch-all rule: */
. return *yytext;
Now you need to 'assemble' multi-slash 'tokens' as non-terminals in the grammar.
single_slash: SLASH | '/' ;
double_slash: SLASH SLASH | SLASH '/' ;
triple_slash: SLASH SLASH SLASH | SLASH SLASH '/' ;
However, you'll now likely find you have conflicts in the grammar due to the 1-token lookahead not being enough. You may be able to resolve those by using btyacc or bison's %glr-parser option.
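A minimal sketch of how the grammar above might be declared with bison's %glr-parser option (the start rule here is illustrative):
%glr-parser            /* split the parse instead of failing on the conflicts */
%token SLASH
%%
start: single_slash | double_slash | triple_slash ;
single_slash: SLASH | '/' ;
double_slash: SLASH SLASH | SLASH '/' ;
triple_slash: SLASH SLASH SLASH | SLASH SLASH '/' ;
%%
Bison will still report the conflicts, but at run time the GLR parser explores the alternatives in parallel and discards the ones that fail.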

Related

How to exclude parts of input from being parsed?

OK, so I've set up a complete Bison grammar (+ its Lex counterpart) and this is what I need:
Is there any way I can set up a grammar rule so that a specific portion of input is excluded from being parsed, but instead retrieved as-is?
E.g.
external_code : EXT_CODE_START '{' '}';
For instance, how could I get the part between the curly brackets as a string, without allowing the parser to consume it (since it'll be "external" code, it won't abide by my current language rules... so, it's ok - text is fine).
How would you go about that?
Should I tackle the issue by adding a token to the Lexer? (same as I do with string literals, for example?)
Any ideas are welcome! (I hope you've understood what I need...)
P.S. Well, I also thought of treating the whole situation pretty much as I do with C-style multiline comments (= capture when the comment begins, in the Lexer, and then - from within a custom function, keep going until the end-of-comment is found). That'd definitely be some sort of solution. But isn't there anything... easier?
You can call the lexer's input/yyinput function to read characters from the input stream and do something with them (and they won't be tokenized so the parser will never see them).
You can use lexer states, putting the lexer in a different state where it will skip over the excluded text, rather than returning it as tokens.
The problem with either of the above from a parser action is dealing with the parser's one token lookahead, which occurs in some (but not all) cases. For example, the following will probably work:
external_code: EXT_CODE_START '{' { skip_external_code(); } '}'
as the action will be in a default reduction state with no lookahead. In this case, skip_external_code could either just set the lexer state (second option above), or it could call input until it gets to the matching } and then call unput once (first option above).
Note that the skip_external_code function needs to be defined in the 3rd section of the lexer file so it has access to the static functions and macros in the lexer (which both of these techniques depend on).
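A sketch of the first approach, placed in the third section of the lexer file (the brace counting is illustrative; adapt it to however the external code actually terminates):
void skip_external_code(void)
{
    int depth = 1;                     /* we are just past the opening '{' */
    int c;
    while (depth > 0 && (c = input()) != EOF) {
        if (c == '{') depth++;
        else if (c == '}') depth--;
        /* append c to a buffer here if you need the text itself */
    }
    if (c != EOF)
        unput('}');                    /* put the closing '}' back for the parser */
}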

How can I safely validate an untrusted regex in Perl?

This answer explains that to validate an arbitrary regular expression, one simply uses eval:
while (<>) {
    eval "qr/$_/;";
    print $@ ? "Not a valid regex: $@\n" : "That regex looks valid\n";
}
However, this strikes me as very unsafe, for what I hope are obvious reasons. Someone could input, say:
foo/; system('rm -rf /'); qr/
or whatever devious scheme they can devise.
The natural way to prevent such things is to escape special characters, but if I escape too many characters, I severely limit the usefulness of the regex in the first place. A strong argument can be made, I believe, that at least []{}()/-,.*?^$! and white space characters ought to be permitted (and probably others), un-escaped, in a user regex interface, for the regexes to have minimal usefulness.
Is it possible to secure myself from regex injection, without limiting the usefulness of the regex language?
The solution is simply to change
eval("qr/$_/")
to
eval("qr/\$_/")
This can be written more clearly as follows:
eval('qr/$_/')
But that's still not optimal. The following would be far better as it doesn't involve generating and compiling Perl code at run-time:
eval { qr/$_/ }
Note that neither solution protects you from denial-of-service attacks. It's quite easy to write a pattern that will take longer than the life of the universe to complete. To handle that situation, you could execute the regex match in a child process for which a CPU ulimit has been set.
There is some discussion about this over at The Monastery.
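As a rough sketch of that child-process idea (assuming the CPAN module BSD::Resource is available; the function name and limits are illustrative):
use strict;
use warnings;
use BSD::Resource qw(setrlimit RLIMIT_CPU);
use POSIX qw(WIFEXITED);

sub match_with_cpu_limit {
    my ($pattern, $string, $cpu_seconds) = @_;
    my $re = eval { qr/$pattern/ };          # compile without evaluating Perl code
    return undef unless defined $re;         # invalid pattern
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                         # child: cap CPU time, then match
        setrlimit(RLIMIT_CPU, $cpu_seconds, $cpu_seconds)
            or die "setrlimit failed";
        exit($string =~ $re ? 0 : 1);
    }
    waitpid($pid, 0);
    return undef unless WIFEXITED($?);       # killed (e.g. SIGXCPU): treat as runaway
    return (($? >> 8) == 0) ? 1 : 0;
}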
TLDR: use re::engine::RE2 (-strict => 1);
Make sure to add (-strict => 1) to your use statement or re::engine::RE2 will fall back to Perl's re.
The following is a quote from junyer, owner of the project on GitHub.
RE2 was designed and implemented with an explicit goal of being able to handle regular expressions from untrusted users without risk. One of its primary guarantees is that the match time is linear in the length of the input string. It was also written with production concerns in mind: the parser, the compiler and the execution engines limit their memory usage by working within a configurable budget – failing gracefully when exhausted – and they avoid stack overflow by eschewing recursion.
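A minimal validation sketch assuming re::engine::RE2 is installed (the helper name is illustrative; the pragma is lexically scoped, so only patterns compiled inside the block go through RE2):
use strict;
use warnings;

sub is_valid_untrusted_pattern {
    my ($pattern) = @_;
    my $ok = eval {
        use re::engine::RE2 (-strict => 1);   # reject anything RE2 cannot run in linear time
        qr/$pattern/;
        1;
    };
    return $ok ? 1 : 0;
}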

Is it a bad idea to use regex to tokenize a string for a lexer?

I'm not sure how I'm going to tokenize the source for the lexer. For now, I can only think of using regexes to split the string into an array according to the given rules (identifiers, symbols such as +, -, etc.).
For instance,
begin x:=1;y:=2;
then I want to tokenize the word, the variables (x and y in this case) and each symbol (:, =, ;).
Using regexes is a common way of implementing a lexer. If you don't want to use them, you'll sort of end up implementing some regex parts yourself anyway.
Although performance-wise it can be more efficient if you do it yourself, it isn't a must.
Using regular expressions is THE traditional way to generate your tokens.
lex and yacc (or flex and bison) are a traditional compiler-creation pair, where lex does nothing except tokenize symbols and pass them to YACC.
http://en.wikipedia.org/wiki/Lex_%28software%29
YACC is a stack-based state machine (pushdown automaton) that processes the symbols.
I think regex processing is the way to go for parsing symbols of any level of complexity. As Oak mentions, you'll end up writing your own (probably inferior) regex parser. The only exception would be if it is dead simple, and even your posted example starts to exceed "dead simple".
In lex syntax:
:= return ASSIGN_TOKEN_OR_WHATEVER;
begin return BEGIN_TOKEN;
[0-9]+ return NUMBER;
[a-zA-Z][a-zA-Z0-9]* return WORD;
Character sequences are optionally passed along with the token.
Individual characters that are tokens in their own right (e.g. ";") get passed along unmodified. It's not the only way, but I have found it to work very well.
Have a look:
http://www.faqs.org/docs/Linux-HOWTO/Lex-YACC-HOWTO.html
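Putting the pieces together, here is a slightly fuller lex sketch of the rules above. The token names mirror the ones used above; the yylval members ival/sval are illustrative and would have to be declared in the yacc %union:
%{
#include <stdlib.h>
#include <string.h>
#include "y.tab.h"    /* token definitions generated by yacc/bison */
%}
%%
":="                    { return ASSIGN_TOKEN_OR_WHATEVER; }
"begin"                 { return BEGIN_TOKEN; }
[0-9]+                  { yylval.ival = atoi(yytext); return NUMBER; }
[a-zA-Z][a-zA-Z0-9]*    { yylval.sval = strdup(yytext); return WORD; }
[ \t\n]+                { /* skip whitespace */ }
.                       { return *yytext; /* ';' etc. pass through unchanged */ }
%%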

Is it feasible to write a regex that can validate simple math?

I’m using a commercial application that has an option to use RegEx to validate field formatting. Normally this works quite well. However, today I’m faced with validating the following strings: quoted alphanumeric codes with simple arithmetic operators (+-/*). Apparently the issue is that users sometimes add extra spaces (e.g. “ FLR01” instead of “FLR01”) or make other typos such as mismatched parentheses that cause issues with downstream processing.
The first examples all had 5 codes being added:
"FLR01"+"FLR02"+"FLR03"+"FMD01"+"FMR05"
So I started going down the road of matching five alphanumeric characters enclosed in quotes:
"[0-9a-zA-Z]{5}"[+-*/]
However, the formulas quickly got harder and I don’t know how to get around the following complications:
I need to test for one of the four simple math operators (+-*/) between each code, but not after the last one.
There can be any number of codes being added together, not just five as in the example above.
Enclosed parentheses are okay: ("X"+"Y")/"2"
Mismatched parentheses are not okay.
No formula (e.g. a blank) is okay.
Valid:
"FLR01"+"FLR02"+"FLR03"+"FMD01"+"FMR05"
"0XT"+"1SEAL"+"1XT"+"23LSL"+"23NBL"
("LS400"+"LT400")*"LC430"/("EL414"+"EL414R"+"LC407"+"LC407R"+"LC410"+"LC410R"+"LC420"+"LC420R")
Invalid:
" FLR01" +"FLR02"
"FLR01"J"FLR02"
("FLR01"+"FLR02"
Is this not something you can easily do with RegExp? Based on Jeff’s answer to 230517, I suspect I’m failing at least the ‘matched pairing’ issue. Even a partial solution to the problem (e.g. flagging extra spaces, invalid operators) would likely be better than nothing, even if I can't solve the parenthesis issue. Suggestions welcomed!
Thanks,
Stephen
As you are aware you can't check for matching parentheses with regular expressions. You need something more powerful since regexes have no way of remembering state and counting the nested parentheses.
This is a simple enough syntax that you could hand code a simple parser which counts the parentheses, incrementing and decrementing a counter as it goes. You'd simply have to make sure the counter never goes negative.
As for the rest, how about this?
("[0-9a-zA-Z]+"([+\-*/]"[0-9a-zA-Z]+")*)?
You could also use this regular expression to check the parentheses. It wouldn't verify that they're nested properly but it would verify that the open and close parentheses show up in the right places. Add in the counter described above and you'd have a proper validator.
(\(*"[0-9a-zA-Z]+"\)*([+\-*/]\(*"[0-9a-zA-Z]+"\)*)*)?
You can easily use regex's to match your tokens (numbers, operators, etc), but you cannot match balanced parenthesis. This isn't too big of a problem though, as you just need to create a state machine that operates on the tokens you match. If you're not familiar with these, think of it as a flow chart within your program where you keep track of where you are, and where you can go. You can also have a look at the Wikipedia page.
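If it helps, here is one way such a state machine could look in C. It is only a sketch of the idea, not a drop-in validator: it alternates between expecting an operand (a quoted code or '(') and expecting an operator or ')', and it counts parenthesis nesting as described above.
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

bool validate_formula(const char *s)
{
    int depth = 0;
    bool expect_operand = true;                          /* quoted code or '(' next */
    if (*s == '\0') return true;                         /* a blank formula is okay */
    while (*s) {
        if (isspace((unsigned char)*s)) return false;    /* flag stray spaces */
        if (expect_operand) {
            if (*s == '(') { depth++; s++; continue; }
            if (*s != '"') return false;
            s++;                                         /* opening quote */
            if (!isalnum((unsigned char)*s)) return false;
            while (isalnum((unsigned char)*s)) s++;      /* the code itself */
            if (*s != '"') return false;
            s++;                                         /* closing quote */
            expect_operand = false;
        } else if (*s == ')') {
            if (--depth < 0) return false;               /* unmatched ')' */
            s++;
        } else if (strchr("+-*/", *s)) {
            s++;
            expect_operand = true;
        } else {
            return false;                                /* anything else is invalid */
        }
    }
    return !expect_operand && depth == 0;                /* no dangling operator or '(' */
}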

Is stringing together multiple regular expressions with "or" safe?

We have a configuration file that lists a series of regular expressions used to exclude files for a tool we are building (it scans .class files). The developer has combined all of the individual regular expressions into a single one using the OR "|" operator, like this:
rx1|rx2|rx3|rx4
My gut reaction is that there will be an expression that will screw this up and give us the wrong answer. He claims no; they are ORed together. I cannot come up with a case to break this but still feel uneasy about the implementation.
Is this safe to do?
Not only is it safe, it's likely to yield better performance than separate regex matching.
Take the individual regex patterns and test them. If they work as expected then OR them together and each one will still get matched. Thus, you've increased the coverage using one regex rather than multiple regex patterns that have to be matched individually.
As long as they are valid regexes, it should be safe. Unclosed parentheses, brackets, braces, etc would be a problem. You could try to parse each piece before adding it to the main regex to verify they are complete.
Also, some engines have escapes that can toggle regex flags within the expression (like case sensitivity). I don't have enough experience to say if this carries over into the second part of the OR or not. Being a state machine, I'd think it wouldn't.
It's as safe as anything else in regular expressions!
As far as regexes go, Google Code Search provides regexes for searches, so... it's possible to have safe regexes.
I don't see any possible problem either.
I guess by saying 'safe' you mean that it will match as you intended (I've never heard of a regex security hole). Safe or not, we can't tell from this. You need to give us more detail, like what the full regex is. Do you wrap it in a group and allow multiple matches? Do you wrap it with start and end anchors?
If you want to match a few class file names, make sure you use start and end anchors so the match runs from start to end, like this: "^(file1|file2)\.class$". Without start and end anchors, you may end up matching 'my_file1.class' too.
The answer is that yes, this is safe, and the reason it is safe is that '|' has the lowest precedence in regular expressions.
That is:
regexpa|regexpb|regexpc
is equivalent to
(regexpa)|(regexpb)|(regexpc)
with the obvious exception that the second would end up with positional (group) matches whereas the first would not; however, the two would match exactly the same input. Or, to put it another way, in Java parlance:
String.matches("regexpa|regexpb|regexpc");
is equivalent to
String.matches("regexpa") | String.matches("regexpb") | String.matches("regexpc");
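For instance, a small self-contained check (hypothetical patterns and inputs, purely for illustration) that the combined alternation accepts exactly the union of what the individual patterns accept:
import java.util.List;

public class OrEquivalence {
    public static void main(String[] args) {
        String a = "foo\\d+", b = "bar", c = "baz\\d";
        String combined = a + "|" + b + "|" + c;
        for (String s : List.of("foo123", "bar", "baz9", "qux")) {
            boolean together = s.matches(combined);
            boolean separate = s.matches(a) || s.matches(b) || s.matches(c);
            System.out.println(s + ": " + (together == separate));   // prints true for all
        }
    }
}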