Regex to match exact word with Kiama Parser in Scala

I am looking for the correct regex form to give to my Kiama Packrat Parser so that when it encounters a keyword like int it recognises it as a type, and not as a valid variable name.
At present I have:

lazy val type_int_ = ".*\\bint\\b.*".r ^^ (s => TypeInt)

lazy val var_ =
    idn ^^ TermVar

lazy val idn =
    "[a-zA-Z][a-zA-Z0-9]*".r
But this does not work, so I would appreciate pointers on this.
Many thanks

I've successfully used the following approach:
val keyword = regex ("int[^a-zA-Z]".r)
val identifier = not (keyword) ~> "[a-zA-Z]+".r
In other words, recognise the keyword only if it's not followed by a character that can extend it to be an identifier. A downside is that the extension regexp is repeated in both the keyword definition and the identifier one, but that can be factored out if you want.
You've got to be a bit careful how you use the keyword parser, since it captures the character after the keyword as well. It's safe in the context of a not, since no input is consumed.
Note that whitespace usually does not need to be handled explicitly since the literal and regex parser combinators take care of it before they start parsing for what you really want.
This approach is easy to generalise to multiple keywords, by writing a method that builds the keyword parser from a list of the keyword strings and the extension regular expression.
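For example, here is a sketch of such a helper (the names are illustrative; it assumes the standard Scala parser combinators that Kiama builds on, i.e. regex, not and ~> from RegexParsers):

// The character class that can extend an identifier is written once and shared.
val identChars = "a-zA-Z0-9"

// Recognise any of the given keywords, but only when the keyword is followed by
// a character that cannot extend it into an identifier.
def keywords (kws : List[String]) : Parser[String] =
    regex (("(" + kws.mkString ("|") + ")[^" + identChars + "]").r)

lazy val keyword = keywords (List ("int", "if", "while"))
lazy val identifier = not (keyword) ~> ("[a-zA-Z][" + identChars + "]*").r

Like the two-line version above, this only spots a keyword when at least one more character follows it, so a keyword at the very end of the input is not caught; adding an end-of-input alternative inside the keyword regex handles that case if you need it.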
BTW, Kiama doesn't really provide parsing combinators. We rely on the ones in the Scala library. We do provide some extensions of the standard ones for special situations, but the basic behaviour is just straight from the library. Thus, it's not clear to me that your question actually relates to Kiama at all. As mentioned in the comments above, including a self-contained example of the problem would help us be clearer about exactly which library you are using.

Related

How do I check if a string is a valid prefix of a regex?

So I'm using Rust to build a parser for my programming language and I need it to, well... parse. I'm implementing a Stream struct that is responsible for some basic operations on the Vec<char> that it possesses. One of those functions is as follows:
fn consume_while_regex_matches(&self, regex: Regex, mut acc: String) -> (Self, String)
It's a recursive function that takes a regex and an accumulator which it's supposed to be filling up until the next char in the stream makes it no longer match the Regex. I'd use this function to match simple tokens like integers, strings, etc.
I tried implementing it in the following kind of way (this is not the exact version, but it explains the issue):
acc.push(self.peek());
if !regex.is_match(&acc) {
    acc.pop();
    return (self.clone(), acc);
}
return self.consume_while_regex_matches(regex, acc);
The problem with this code is that it tests whether acc matches the whole regex. Imagine we want to consume the string -42 with a regex like ^-[0-9]+$. The algorithm would read the very first char -, the match would fail, and the accumulator would come back empty.
Is there a way to check that a string (e.g. acc) is a prefix of a potential regex match?
Like - is not a match on its own, but -42 is a match and - is a valid prefix.
And it'd be great if it's like a library way and it doesn't require me to produce my own regex engine.
Update: I'm not using the described function for parsing. I use it for lexing. I am aware that regex is not enough to parse a complex language.
What I'm asking is whether I can use some regex lib to match tokens gradually, as opposed to having to provide the whole string to match against. I'm looking for a way to check whether the underlying DFA is or isn't in the error state by the end of matching a string, without having to write my own regex parser and DFA implementation. If this were possible, I'd pass the - to the integer regex, check that after matching it didn't end up in the ERROR state, and if so, it's a valid prefix.
Taking the question at face value: the crate regex-automata (maintained by one of the regex authors) provides access to the lower-level details of building and running regexes. In particular, you can access and drive the internal DFA (deterministic finite automaton) yourself.
use regex_automata::{Regex, DFA}; // 0.1.9

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
enum PotentialMatch {
    Match,
    CouldMatch,
    DoesntMatch,
}

fn potentially_matches(pattern: &Regex, input: &str) -> PotentialMatch {
    let input = input.as_bytes();
    let dfa = pattern.forward();
    let mut state = dfa.start_state();
    for byte in input {
        state = dfa.next_state(state, *byte);
        if dfa.is_dead_state(state) {
            return PotentialMatch::DoesntMatch;
        }
    }
    if dfa.is_match_state(state) {
        PotentialMatch::Match
    } else {
        PotentialMatch::CouldMatch
    }
}

fn main() {
    let pattern = Regex::new("-[0-9]+").unwrap();
    assert_eq!(potentially_matches(&pattern, ""), PotentialMatch::CouldMatch);
    assert_eq!(potentially_matches(&pattern, "-"), PotentialMatch::CouldMatch);
    assert_eq!(potentially_matches(&pattern, "-1"), PotentialMatch::Match);
    assert_eq!(potentially_matches(&pattern, "-12"), PotentialMatch::Match);
    assert_eq!(potentially_matches(&pattern, "-12a"), PotentialMatch::DoesntMatch);
}
You could probably integrate this state tracking into your implementation to be more performant than calling potentially_matches() over and over.
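For instance, here is a sketch of what that integration could look like (the function and variable names are made up; it uses the same regex-automata 0.1.9 API as the code above and inherits the same matching semantics):

use regex_automata::{Regex, DFA}; // 0.1.9

// Consume characters from `input` while the accumulated text is still a viable
// prefix of a match, advancing the DFA one byte at a time instead of re-checking
// the whole accumulator on every character.
fn consume_prefix(pattern: &Regex, input: &str) -> String {
    let dfa = pattern.forward();
    let mut state = dfa.start_state();
    let mut acc = String::new();
    for ch in input.chars() {
        let mut next = state;
        let mut dead = false;
        let mut buf = [0u8; 4];
        for byte in ch.encode_utf8(&mut buf).as_bytes() {
            next = dfa.next_state(next, *byte);
            if dfa.is_dead_state(next) {
                dead = true;
                break;
            }
        }
        if dead {
            break; // this character can no longer be part of a match, so stop consuming
        }
        state = next;
        acc.push(ch);
    }
    acc
}

With the -[0-9]+ pattern from the example above, consume_prefix(&pattern, "-42abc") would stop consuming at "-42", mirroring the "-12a" case in the assertions.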
Regular expressions are basically a succinct notation for state machines that accept or reject strings of characters. Plain regexes correspond to finite-state machines; more complex languages, such as context-free ones, need a pushdown automaton instead, depending on the complexity of the language you need to accept or reject.
Your approach is to grow the string until it matches the complete regex, but regexes are typically compiled into state machines for efficiency. Such a state machine breaks the complex regex down into individual states and transitions between them token by token (or character by character), so a single regex can correspond to many states. For example, a simple FSM for [+-][1-9][0-9]* has a start state s0, a state s1 reached after the sign, and an accept state s2 reached after the first non-zero digit, with s2 looping on each further digit.
You could use pure regex for simpler parsing problems, but if you're building a programming language, you're eventually going to want a parsing library that can handle arbitrarily nested structure (which requires a system stronger than regexes). In Rust, one of the most popular crates for this is nom. nom is a parser-combinator-style library, which means it relies on putting smaller parsers together in different ways with "operators" called "combinators". Regular expressions are often a weaker form of parser than these due to the limited number and limited properties of their operators/combinators.
I will also note that the full job of validating a proper program written in a programming language won't be satisfied by either, which is why we bother going beyond the lexing/parsing process. All traditional and proper programming languages are somewhere between context-sensitive grammars and Turing-complete languages, which means parsing alone won't satisfy your needs because there is additional context and semantic information that needs to be checked by type checking, reference checking, etc.

Rules & Actions for Parser Generator, and

I am trying to wrap my head around an assignment question, therefore I would very highly appreciate any help in the right direction (and not necessarily a complete answer). I am being asked to write the grammar specification for this parser. The specification for the grammar that I must implement can be found here:
http://anoopsarkar.github.io/compilers-class/decafspec.html
Although the documentation is there, I do not understand a few things, such as how to write (in my .y file) notation like
{ identifier },+
I understand that this means a comma-separated list of one or more occurrences of an identifier. However, when I write it as such, the compiler reports the symbols '+' and ',' as unrecognized, apparently mistaking them for whitespace. I tried '{' identifier "},+", but I haven't the slightest clue whether that is correct or not.
I have written the lexical analyzer portion (as it was from the previous segment of the assignment) which returns tokens (T_ID, T_PLUS, etc.) accordingly, however there is this new notion that I must assign 'yylval' to be the value of the token itself. To my understanding, this is only necessary if I am in need of the actual value of the token, therefore I would need the value of an identifier token T_ID, but not necessarily the value of T_PLUS, being '+'. This is done by creating a %union in the parser generator file, which I have done, and have provided the tokens that I currently believe would require the literal token value with the proper yylval assignment.
Here is my lexical analysis code (I could not get it to format properly, I apologize): https://pastebin.com/XMZwvWCK
Here is my parser file decafast.y: https://pastebin.com/2jvaBFQh
And here is the final piece of code supplied to me, the C++ code to build an abstract syntax tree at the end:
https://pastebin.com/ELy53VrW
To finalize my question, I do not know if I am creating my grammar rules correctly. I have tried my best to follow the specification in the above website, but I can't help but feel that what I am writing is completely wrong. My compiler is spitting out nothing but "warning: rule useless in grammar" for almost every (if not every) rule.
If anyone could help me out and point me in the right direction on how to make any progress, I would highly, highly appreciate it.
The decaf specification is written in (an) Extended Backus-Naur Form (EBNF), which includes a number of convenience operators for repetition, optionality and grouping. These are not part of the bison/yacc syntax, which is pretty well limited to BNF. (Bison/yacc do allow the alternation operator |, but since there is no way to group subpatterns, alternation can only be used at the top level, to combine two productions for the same non-terminal.)
The short section at the beginning of the specification which describes EBNF includes a grammar for the particular variety of EBNF that is being used. (Since this grammar is itself recursively written in the same EBNF, there is a need to apply a bit of inductive reasoning.) When it says, for example,
CommaList = "{" Expression "}+," .
it is not saying that "}+," is the peculiar spelling of a comma-repetition operator. What it is saying is that when you see something in the Decaf grammar surrounded by { and }+,, that should be interpreted as describing a comma-separated list.
For example, the Decaf grammar includes:
FieldDecl = var { identifier }+, Type ";" .
That means that a FieldDecl can be (amongst other possibilities) the token var followed by a comma-separated list of identifier tokens and then followed by a Type and finally a semicolon.
As I said, bison/yacc don't implement the EBNF operators, so you have to find an equivalent yourself. Since BNF doesn't allow any form of grouping -- and a list is a grouped subexpression -- we need to rewrite the subexpression of a production as a new non-terminal. Also, I suppose we need to use the tokens defined in the spec (although bison allows a more readable syntax).
So to yacc-ify this EBNF production, we first introduce the new non-terminal and replace the token names:
FieldDecl: T_VAR IdentifierList Type T_SEMICOLON
Which leaves the definition of IdentifierList. Repetition in BNF is always produced with recursion, following a very simple model which uses two productions:
the base, which is the simplest possible repetition (usually either nothing or a single list item), and
the recursion, which describes a longer possibility by extending a shorter one.
In this case, the list must have at least one item, and we extend by adding a comma and another item:
IdentifierList
    : T_ID                           /* base case */
    | IdentifierList T_COMMA T_ID    /* recursive extension */
    ;
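The same two-production pattern covers the other repetition forms as well. For example, a zero-or-more list (with no separator) just uses an empty base case; the non-terminal names below are illustrative rather than taken from the Decaf spec:

FieldDeclList
    :                             /* base case: empty */
    | FieldDeclList FieldDecl     /* recursive extension */
    ;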
The point of this exercise is to develop your skills in thinking grammatically: that is, factoring out the syntax and semantics of the language. So you should try to understand the grammars presented, both for Decaf and for the author's version of EBNF, and avoid blindly copying code (including grammars). Good luck!

How to exclude parts of input from being parsed?

OK, so I've set up a complete Bison grammar (+ its Lex counterpart) and this is what I need:
Is there any way I can set up a grammar rule so that a specific portion of input is excluded from being parsed, but instead retrieved as-is?
E.g.
external_code : EXT_CODE_START '{' '}';
For instance, how could I get the part between the curly brackets as a string, without allowing the parser to consume it (since it'll be "external" code, it won't abide by my current language rules... so, it's ok - text is fine).
How would you go about that?
Should I tackle the issue by adding a token to the Lexer? (same as I do with string literals, for example?)
Any ideas are welcome! (I hope you've understood what I need...)
P.S. Well, I also thought of treating the whole situation pretty much as I do with C-style multiline comments (= capture when the comment begins, in the Lexer, and then - from within a custom function, keep going until the end-of-comment is found). That'd definitely be some sort of solution. But isn't there anything... easier?
You can call the lexer's input/yyinput function to read characters from the input stream and do something with them (and they won't be tokenized so the parser will never see them).
You can use lexer states, putting the lexer in a different state where it will skip over the excluded text, rather than returning it as tokens.
The problem with either of the above from a parser action is dealing with the parser's one token lookahead, which occurs in some (but not all) cases. For example, the following will probably work:
external_code: EXT_CODE_START '{' { skip_external_code(); } '}'
as the action will be in a default reduction state with no lookahead. In this case, skip_external_code could either just set the lexer state (the second option above), or it could call input until it gets to the matching } and then call unput once (the first option above).
Note that the skip_external_code function needs to be defined in the third section of the lexer file so it has access to static functions and macros in the lexer (which both of these techniques depend on).
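For concreteness, here is a hedged sketch of the input()/unput() variant (the fixed-size buffer is a simplification, and braces inside strings or comments of the external code are not accounted for):

/* Goes in the third (user code) section of the .l file, so that input() and
 * unput() are in scope. */
static char ext_buf[65536];           /* collected external code */

void skip_external_code(void)
{
    size_t n = 0;
    int depth = 1;                    /* the parser has already seen the opening '{' */
    int c;

    for (;;) {
        c = input();
        if (c == EOF || c == 0)       /* lex and flex signal end of input differently */
            break;
        if (c == '{') {
            depth++;
        } else if (c == '}' && --depth == 0) {
            unput('}');               /* give the closing brace back for the parser */
            break;
        }
        if (n + 1 < sizeof ext_buf)
            ext_buf[n++] = (char) c;
    }
    ext_buf[n] = '\0';                /* ext_buf now holds the excluded text verbatim */
}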

POSIX Character Class with Negation

I'm curious as to whether there is any way to have a regex which looks for input that is printable (defined by the POSIX character class [:print:]) but also does not contain a specific letter, such as the letter a.
Such an expression would enable me to look for all characters which are printable, and then perform additional exclusions. My initial thought was to use nested character classes to achieve this, but I do not believe that will work.
This is for a small parser which I am working on in lex -- thanks for any feedback.
flex (if you can use that) offers the {-} operator which provides exactly what you're looking for:
[[:print:]]{-}[a]
It also has a {+} operator. They only work with character classes, though.
In PCRE and other engines with lookaround, you could use that (e.g. [[:print:]](?<!a)), but unless it has changed recently, lex doesn't support lookaround.
While there are probably ways to make this distinction in the lexical analyzer, it may be cleaner to do it in the parsing logic instead.

Do regex implementations actually need a split() function?

Is there any application for a regex split() operation that could not be performed by a single match() (or search(), findall() etc.) operation?
For example, instead of doing
subject.split('[|]')
you could get the same result with a call to
subject.findall('[^|]*')
And in nearly all regex engines (except .NET and JGSoft), split() can't do some things like "split on | unless they are escaped \|" because you'd need to have unlimited repetition inside lookbehind.
So instead of having to do something quite unreadable like this (nested lookbehinds!)
splitArray = Regex.Split(subjectString, @"(?<=(?<!\\)(?:\\\\)*)\|");
you can simply do (even in JavaScript which doesn't support any kind of lookbehind)
result = subject.match(/(?:\\.|[^|])*/g);
This has led me to wonder: is there anything at all that I can do in a split() that's impossible to achieve with a single match()/findall() instead? I'm willing to bet there isn't, but I'm probably overlooking something.
(I'm defining "regex" in the modern, non-regular sense, i. e., using everything that modern regexes have at their disposal like backreferences and lookaround.)
The purpose of regular expressions is to describe the syntax of a language. These regular expressions can then be used to find strings that match the syntax of these languages. That’s it.
What you actually do with the matches depends on your needs. If you're looking for all matches, repeat the find process and collect the matches. If you want to split the string, repeat the find process and split the input string at the positions where the matches were found.
So basically, regular expression libraries can only do one thing: perform a search for a match. Anything else is just an extension.
A good example of this is JavaScript, where RegExp.prototype.exec actually performs the match search. Any other method that accepts a regular expression (e.g. RegExp.prototype.test, String.prototype.match, String.prototype.search) just uses the basic functionality of RegExp.prototype.exec somehow:
// pseudo-implementations
RegExp.prototype.test = function(str) {
    return RegExp(this).exec(str) !== null;   // exec() returns null when there is no match
};
String.prototype.match = function(pattern) {
    return RegExp(pattern).exec(this);
};
String.prototype.search = function(pattern) {
    var match = RegExp(pattern).exec(this);
    return match ? match.index : -1;          // search() reports -1 when nothing matches
};
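In the same spirit, split() itself can be sketched on top of repeated exec() calls. Like the snippets above, this is only a rough pseudo-implementation: it ignores capture groups, the limit argument, and some empty-match edge cases that real engines handle.

// pseudo-implementation of split() in terms of exec()
String.prototype.split = function(pattern) {
    var re = new RegExp(pattern, "g");     // a global copy so exec() advances through the string
    var parts = [];
    var lastIndex = 0;
    var match;
    while ((match = re.exec(this)) !== null) {
        parts.push(this.slice(lastIndex, match.index));
        lastIndex = match.index + match[0].length;
        if (match[0].length === 0) {       // avoid looping forever on empty matches
            re.lastIndex++;
        }
    }
    parts.push(this.slice(lastIndex));     // the tail after the last separator
    return parts;
};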