I have tried something like this in my Bison file...
ReturnS: RETURN expression { printf(";"); }
...but the semicolon gets printed AFTER the next token, past this rule, instead of right after the expression. This rule was made as we're required to convert the input file to a c-like form and the original language doesn't require a semicolon after the expression in the return statement, but C does, so I thought I'd add it manually to the output with printf. That doesn't seem to work, as the semicolon gets added but for some reason, it gets added after the next token is parsed (outside the ReturnS rule) instead of right when the expression rule returns to ReturnS.
This rule also causes the same result:
loop_for: FOR var_name COLONEQUALS expression TO {printf("%s<=", $<chartype>2);} expression STEP {printf("%s+=", $<chartype>2);} expression {printf(")\n");} Code ENDFOR
Besides the first two printf's not working right (I'll post another question regarding that), the last printf is actually called AFTER the first token/literal of the "Code" rule has been parsed, resulting in something like this:
for (i=0; i<=5; i+=1
a)
=a+1;
instead of
for (i=0; i<=5; i+=1)
a=a+1;
Any ideas what I'm doing wrong?
Probably because the grammar has to look-ahead one token to decide to reduce by the rule you show.
The action is executed when the rule is reduced, and it is very typical that the grammar has to read one more token before it knows that it can/should reduce the previous rule.
For example, if an expression can consist of an indefinite sequence of added terms, it has to read beyond the last term to know there isn't another '+' to continue the expression.
After seeing the Yacc/Bison grammar and Lex/Flex analyzer, some of the problems became obvious, and others took a little more sorting out.
Having the lexical analyzer do much of the printing meant that the grammar was not properly in control of what appeared when. The analyzer was doing too much.
The analyzer was also not doing enough work - making the grammar process strings and numbers one character at a time is possible, but unnecessarily hard work.
Handling comments is tricky if they need to be preserved. In a regular C compiler, the lexical analyzer throws the comments away; in this case, the comments had to be preserved. The rule handling this was moved from the grammar (where it was causing shift/reduce and reduce/reduce conflicts because of empty strings matching comments) to the lexical analyzer. This may not always be optimal, but it seemed to work OK in this context.
The lexical analyzer needed to ensure that it returned a suitable value for yylval when a value was needed.
The grammar needed to propagate suitable values in $$ to ensure that rules had the necessary information. Keywords for the most part did not need a value; things like variable names and numbers do.
The grammar had to do the printing in the appropriate places.
The prototype solution returned had a major memory leak because it used strdup() liberally and didn't use free() at all. Making sure that the leaks are fixed - possibly by using a char array rather than a char pointer for YYSTYPE - is left to the OP.
Comments aren't a good place to provide code samples, so I'm going to provide an example of code that works, after Jonathan (replied above) did some work on my code. All due credit goes to him, this isn't mine.
Instead of having FLEX print any recognized parts and letting BISON do the formatting afterwards, Jonathan suggested that FLEX print nothing and only return tokens to BISON, which should then handle all the printing itself.
So, instead of something like this...
FLEX
"FOR" {printf("for ("); return FOR;}
"TO" {printf("; "); return TO;}
"STEP" {printf("; "); return STEP;}
"ENDFOR" {printf("\n"); printf("}\n"); return ENDFOR;}
[a-zA-Z]+ {printf("%s",yytext); yylval.strV = yytext; return CHARACTERS;}
":=" {printf("="); lisnew=0; return COLONEQUALS;}
BISON
loop_for: FOR var_name {strcpy(myvar, $<strV>2);} COLONEQUALS expression TO {printf("%s<=", myvar);} expression STEP {printf("%s+=", myvar);} expression {printf(")\n");} Code ENDFOR
...he suggested this:
FLEX
[a-zA-Z][a-zA-Z0-9]* { yylval = strdup(yytext); return VARNAME;}
[1-9][0-9]*|0 { yylval = strdup(yytext); return NUMBER; }
BISON
loop_for: FOR var_name COLONEQUALS NUMBER TO NUMBER STEP NUMBER
{ printf("for (%s = %s; %s <= %s; %s += %s)\n", $2, $4, $2, $6, $2, $8); }
var_name: VARNAME
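To illustrate the memory-leak point raised above, here is a minimal sketch of what the action body could do once it has the four strings. `emit_for_header` and `dup_text` are hypothetical names of mine, not part of Jonathan's solution; `dup_text` is just a portable stand-in for `strdup()`. The sketch assumes each argument was heap-allocated by the lexer and is not used again after the header has been formatted.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Portable stand-in for strdup(): returns a heap copy of s, or NULL. */
char *dup_text(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Hypothetical helper: formats the C for-loop header into out, then
 * frees the four strings.  Assumes the caller owns them (e.g. they came
 * from the lexer via strdup) and will not touch them afterwards.
 * Returns the snprintf() result, or -1 if any argument is NULL. */
int emit_for_header(char *out, size_t out_size,
                    char *var, char *from, char *to, char *step)
{
    int written = -1;
    if (var && from && to && step)
        written = snprintf(out, out_size,
                           "for (%s = %s; %s <= %s; %s += %s)\n",
                           var, from, var, to, var, step);
    /* free(NULL) is a no-op, so this is safe on the error path too. */
    free(var);
    free(from);
    free(to);
    free(step);
    return written;
}
```

In the grammar, the action would call something like this with $2, $4, $6 and $8 and then print the buffer, so every string the lexer duplicated is released exactly once.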
I am trying to wrap my head around an assignment question, therefore I would very highly appreciate any help in the right direction (and not necessarily a complete answer). I am being asked to write the grammar specification for this parser. The specification for the grammar that I must implement can be found here:
http://anoopsarkar.github.io/compilers-class/decafspec.html
Although the documentation is there, I do not understand a few things, such as how to write (in my .y file) things such as
{ identifier },+
I understand that this would mean a comma-separated list of 1 (or more) occurrences of an identifier, however when I write it as such, the compiler displays an error of unrecognized symbols '+' and ',', being mistaken as whitespace. I tried '{' identifier "},+", but I haven't the slightest clue whether that is correct or not.
I have written the lexical analyzer portion (as it was from the previous segment of the assignment) which returns tokens (T_ID, T_PLUS, etc.) accordingly, however there is this new notion that I must assign 'yylval' to be the value of the token itself. To my understanding, this is only necessary if I am in need of the actual value of the token, therefore I would need the value of an identifier token T_ID, but not necessarily the value of T_PLUS, being '+'. This is done by creating a %union in the parser generator file, which I have done, and have provided the tokens that I currently believe would require the literal token value with the proper yylval assignment.
Here is my lexical analysis code (I could not get it to format properly, I apologize): https://pastebin.com/XMZwvWCK
Here is my parser file decafast.y: https://pastebin.com/2jvaBFQh
And here is the final piece of code supplied to me, the C++ code to build an abstract syntax tree at the end:
https://pastebin.com/ELy53VrW?fbclid=IwAR2cFT_-pGKlVZ2liC-zAe3Fw0BWDlGjrrayqEGV4JuJq1_7nKoe9-TLTlA
To finalize my question, I do not know if I am creating my grammar rules correctly. I have tried my best to follow the specification in the above website, but I can't help but feel that what I am writing is completely wrong. My compiler is spitting out nothing but "warning: rule useless in grammar" for almost every (if not every) rule.
If anyone could help me out and point me in the right direction on how to make any progress, I would highly, highly appreciate it.
The decaf specification is written in (an) Extended Backus Naur Form (EBNF), which includes a number of convenience operators for repetition, optionality and grouping. These are not part of the bison/yacc syntax, which is pretty well limited to BNF. (Bison/yacc do allow the alternation operator |, but since there is no way to group subpatterns, alternation can only be used at the top-level, to combine two productions for the same non-terminal.)
The short section at the beginning of the specification which describes EBNF includes a grammar for the particular variety of EBNF that is being used. (Since this grammar is itself recursively written in the same EBNF, there is a need to apply a bit of inductive reasoning.) When it says, for example,
CommaList = "{" Expression "}+," .
it is not saying that "}+," is the peculiar spelling of a comma-repetition operator. What it is saying is that when you see something in the Decaf grammar surrounded by { and }+,, that should be interpreted as describing a comma-separated list.
For example, the Decaf grammar includes:
FieldDecl = var { identifier }+, Type ";" .
That means that a FieldDecl can be (amongst other possibilities) the token var followed by a comma-separated list of identifier tokens and then followed by a Type and finally a semicolon.
As I said, bison/yacc don't implement the EBNF operators, so you have to find an equivalent yourself. Since BNF doesn't allow any form of grouping -- and a list is a grouped subexpression -- we need to rewrite the subexpression of a production as a new non-terminal. Also, I suppose we need to use the tokens defined in the spec (although bison allows a more readable syntax).
So to yacc-ify this EBNF production, we first introduce the new non-terminal and replace the token names:
FieldDecl: T_VAR IdentifierList Type T_SEMICOLON
Which leaves the definition of IdentifierList. Repetition in BNF is always produced with recursion, following a very simple model which uses two productions:
the base, which is the simplest possible repetition (usually either nothing or a single list item), and
the recursion, which describes a longer possibility by extending a shorter one.
In this case, the list must have at least one item, and we extend by adding a comma and another item:
IdentifierList
: T_ID /* base case */
| IdentifierList T_COMMA T_ID /* Recursive extension */
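If it helps to see the same base-plus-extension shape outside of a grammar, here is a rough hand-written equivalent in C. This is only an illustration of what the two productions accept, not something bison generates; the token codes and the function name `parse_identifier_list` are made up for the example.

```c
#include <stddef.h>

/* Hypothetical token codes, standing in for the ones the lexer returns. */
enum token { T_ID, T_COMMA, T_OTHER };

/* Consumes an IdentifierList starting at toks[*pos].
 * Returns the number of identifiers in the list and advances *pos past
 * it; returns 0 (leaving *pos alone) if no list starts there. */
size_t parse_identifier_list(const enum token *toks, size_t n, size_t *pos)
{
    size_t i = *pos;
    size_t count = 0;

    /* Base case: a list is at least one T_ID. */
    if (i >= n || toks[i] != T_ID)
        return 0;
    ++i;
    ++count;

    /* Extension: a shorter list followed by T_COMMA T_ID. */
    while (i + 1 < n && toks[i] == T_COMMA && toks[i + 1] == T_ID) {
        i += 2;
        ++count;
    }

    *pos = i;
    return count;
}
```

The loop plays the role of the recursive production: each iteration turns a valid shorter list into a valid longer one.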
The point of this exercise is to develop your skills in thinking grammatically: that is, factoring out the syntax and semantics of the language. So you should try to understand the grammars presented, both for Decaf and for the author's version of EBNF, and avoid blindly copying code (including grammars). Good luck!
I have this problem: I must verify the correctness of many mathematical expressions, and in particular check for consecutive operators (+ - * /).
For example:
6+(69-9)+3
is ok while
6++8-(52--*3)
no.
I am not using the library <regex> since it is only compatible with C++11.
Is there an alternative method to solve this problem? Thanks.
You can use a regular expression to verify everything about a mathematical expression except the check that parentheses are balanced. That is, the regular expression will only ensure that open and close parentheses appear at the point in the expression they should appear, but not their correct relationship with other parentheses.
So you could check both that the expression matches a regex and that the parentheses are balanced. Checking for balanced parentheses is really simple if there is only one type of parenthesis:
bool check_balanced(const char* expr, char open, char close) {
    int parens = 0;
    for (const char* p = expr; *p; ++p) {
        if (*p == open) ++parens;
        else if (*p == close && parens-- == 0) return false;
    }
    return parens == 0;
}
To get the regular expression, note that mathematical expressions without function calls can be summarized as:
BEFORE* VALUE AFTER* (BETWEEN BEFORE* VALUE AFTER*)*
where:
BEFORE is sub-regex which matches an open parenthesis or a prefix unary operator (if you have prefix unary operators; the question is not clear).
AFTER is a sub-regex which matches a close parenthesis or, in the case that you have them, a postfix unary operator.
BETWEEN is a sub-regex which matches a binary operator.
VALUE is a sub-regex which matches a value.
For example, for ordinary four-operator arithmetic on integers you would have:
BEFORE: [-+(]
AFTER: [)]
BETWEEN: [-+*/]
VALUE: [[:digit:]]+
and putting all that together you might end up with the regex:
^[-+(]*[[:digit:]]+[)]*([-+*/][-+(]*[[:digit:]]+[)]*)*$
If you have a Posix C library, you will have the <regex.h> header, which gives you regcomp and regexec. There's sample code at the bottom of the referenced page in the Posix standard, so I won't bother repeating it here. Make sure you supply REG_EXTENDED in the last argument to regcomp; REG_EXTENDED|REG_NOSUB, as in the example code, is probably even better since you don't need captures and not asking for them will speed things up.
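For completeness, here is a minimal sketch of how the two checks might be combined, assuming a POSIX system where <regex.h> is available. The function names `is_valid_expression` and `parens_balanced` are mine; `parens_balanced` repeats the balancing logic from above (fixed to round parentheses) so that the snippet compiles on its own.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Same counting idea as check_balanced above, for '(' and ')' only. */
static bool parens_balanced(const char *expr)
{
    int depth = 0;
    for (const char *p = expr; *p; ++p) {
        if (*p == '(')
            ++depth;
        else if (*p == ')' && depth-- == 0)
            return false;
    }
    return depth == 0;
}

/* Returns true if expr matches the four-operator integer pattern and
 * its parentheses are balanced.  Whitespace is not accepted, because
 * the pattern does not allow for it. */
bool is_valid_expression(const char *expr)
{
    static const char pattern[] =
        "^[-+(]*[[:digit:]]+[)]*([-+*/][-+(]*[[:digit:]]+[)]*)*$";
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;                      /* pattern failed to compile */

    bool matched = regexec(&re, expr, 0, NULL, 0) == 0;
    regfree(&re);
    return matched && parens_balanced(expr);
}
```

Note that because BEFORE includes the unary + and -, a string like 6++8 is accepted by this pattern (it reads as 6 + (+8)); drop them from BEFORE if you don't want prefix operators. Compiling the regex once and reusing it would be better if you have many expressions to check.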
You can loop over each char in your expression.
If you encounter a + you can check whether it is followed by another +, /, *...
Additionally you can group operators together to prevent code duplication.
// Assumes expression is a NUL-terminated string.
for (int i = 0; expression[i] != '\0'; i++) {
    switch (expression[i]) {
    case '+':
    case '*': // Do your syntax checks here
        break;
    }
}
Well, in the general case, you can't solve this with a regex. The "language" of arithmetic expressions can't be described by a regular grammar; it needs a context-free grammar. So if what you want is to check the correctness of an arbitrary mathematical expression, then you'll have to write a parser.
However, if you only need to make sure that your string doesn't have consecutive +-*/ operators then regex is enough. You can write something like this [-+*/]{2,}. It will match substrings with 2 or more consecutive symbols from +-*/ set.
Or something like this ([-+*/]\s*){2,} if you also want to handle situations with spaces like 5+ - * 123
Well, you will have to define some rules if possible. It's not possible to completely parse mathematical language with a regex, but given some lenience it may work.
The problem is that often the way we write math can be interpreted as an error, but it's really not. For instance:
5--3 can be 5-(-3)
So in this case, you have two choices:
Ensure that the input is parenthesized well enough that no two operators meet
If you find something like --, treat it as a special case and investigate it further
If the formulas are in fact in your favor (have well-defined parentheses), then you can just check for repeats. For instance:
--
+-
+*
-+
etc.
If you have a match, it means you have a poorly formatted equation and you can throw it out (or whatever you want to do).
You can check for this, using the following regex. You can add more constraints to the [..][..]. I'm giving you the basics here:
[+\-\*\\/][+\-\*\\/]
which will work for the following examples (and more):
6++8-(52--*3)
6+\8-(52--*3)
6+/8-(52--*3)
An alternative, probably a better one, is to just write a parser. It will process the equation step by step to check its validity. A parser will, if well written, be 100% accurate, whereas a regex approach leaves you with a lot of constraints.
There is no real way to do this with a regex because mathematical expressions inherently aren't regular. Heck, even balancing parens isn't regular. Typically this will be done with a parser.
A basic approach to writing a recursive-descent parser (IMO the most basic parser to write) is:
Write a grammar for a mathematical expression. (These can be found online)
Tokenize the input into lexemes. (This will be done with a regex, typically).
Match the expressions based on the next lexeme you see.
Recurse based on your grammar
A quick Google search can provide many example recursive-descent parsers written in C++.
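To give an idea of the shape, here is a minimal recursive-descent checker in C. It is a sketch under a few assumptions of mine: the grammar has only non-negative integers, the four binary operators and parentheses (no unary operators, so 6++8 is rejected), and the tokenizing step is folded into the parser by reading characters directly. It only validates; it does not evaluate. All function names are made up.

```c
#include <ctype.h>
#include <stdbool.h>

static bool parse_expr(const char **p);

static void skip_spaces(const char **p)
{
    while (isspace((unsigned char)**p))
        ++*p;
}

/* factor := NUMBER | '(' expr ')' */
static bool parse_factor(const char **p)
{
    skip_spaces(p);
    if (isdigit((unsigned char)**p)) {
        while (isdigit((unsigned char)**p))
            ++*p;
        return true;
    }
    if (**p == '(') {
        ++*p;
        if (!parse_expr(p))
            return false;
        skip_spaces(p);
        if (**p != ')')
            return false;
        ++*p;
        return true;
    }
    return false;
}

/* term := factor (('*' | '/') factor)* */
static bool parse_term(const char **p)
{
    if (!parse_factor(p))
        return false;
    for (;;) {
        skip_spaces(p);
        if (**p != '*' && **p != '/')
            return true;
        ++*p;
        if (!parse_factor(p))
            return false;
    }
}

/* expr := term (('+' | '-') term)* */
static bool parse_expr(const char **p)
{
    if (!parse_term(p))
        return false;
    for (;;) {
        skip_spaces(p);
        if (**p != '+' && **p != '-')
            return true;
        ++*p;
        if (!parse_term(p))
            return false;
    }
}

/* Returns true if the whole string is one well-formed expression. */
bool is_well_formed(const char *text)
{
    const char *p = text;
    if (!parse_expr(&p))
        return false;
    skip_spaces(&p);
    return *p == '\0';
}
```

Each function corresponds to one grammar rule, and the nesting of parentheses is handled by the recursion from parse_factor back into parse_expr, which is exactly the part a regex cannot do.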
So I am trying to parse an expression using Tcl_ParseExpr:
// Any syntax errors?
Tcl_Interp *myInterpBuild;
Tcl_Parse parseInfo;
std::string expression = "(test1==1) ? 0.0) : (test2.hello+1.0)";
if (Tcl_ParseExpr(myInterpBuild,expression.c_str(),-1,&parseInfo)
== TCL_ERROR)
{
std::string failMsg = Tcl_GetStringResult(myInterpBuild);
std::cout << failMsg;
}
Now, usually this works and no error is given. However, if the expression contains a . (dot symbol) then it only parses the part of the expression up to the dot.
For example, if I set expression to '(test1==1) ? 0.0) : (test2.hello+1.0)' then only 'test2' is parsed and 'hello' is thrown away.
The output of the above is:
invalid bareword "test2"
It appears only to evaluate the expression up to and not including the dot character.
Does anyone know why this is happening and what I have to do to fix this?
The . character is not an operator in Tcl's expression language as things stand right now. It can be used within a floating point literal, of course, but it simply isn't a legal part of the grammar rules as an operator. Thus, Tcl's parser stops when it encounters it and throws an error: it would do exactly the same if you fed it into the Tcl expr command. What's more, Tcl's expression language doesn't currently support barewords except as function names (and there's a few keywords that look like barewords too, such as true and false).
Changing that would require rewriting the expression parser and (probably) assigning a meaning to that operator in terms of Tcl's internal bytecode. Not exactly a trivial thing (there's quite a few places in the code to change) but possible if you have a good idea for what to do. If you do, please talk to the Tcl Core Team and we'll see what we can do; you might find us quite receptive to a good suggestion!
Or you can use your own parser, of course. Absolutely nothing stopping that.
OK, so I've set up a complete Bison grammar (+ its Lex counterpart) and this is what I need :
Is there any way I can set up a grammar rule so that a specific portion of input is excluded from being parsed, but instead retrieved as-is?
E.g.
external_code : EXT_CODE_START '{' '}';
For instance, how could I get the part between the curly brackets as a string, without allowing the parser to consume it (since it'll be "external" code, it won't abide by my current language rules... so, it's ok - text is fine).
How would you go about that?
Should I tackle the issue by adding a token to the Lexer? (same as I do with string literals, for example?)
Any ideas are welcome! (I hope you've understood what I need...)
P.S. Well, I also thought of treating the whole situation pretty much as I do with C-style multiline comments (= capture when the comment begins, in the Lexer, and then - from within a custom function, keep going until the end-of-comment is found). That'd definitely be some sort of solution. But isn't there anything... easier?
You can call the lexer's input/yyinput function to read characters from the input stream and do something with them (and they won't be tokenized so the parser will never see them).
You can use lexer states, putting the lexer in a different state where it will skip over the excluded text, rather than returning it as tokens.
The problem with either of the above from a parser action is dealing with the parser's one token lookahead, which occurs in some (but not all) cases. For example, the following will probably work:
external_code: EXT_CODE_START '{' { skip_external_code(); } '}'
as the action will be in a default reduction state with no lookahead. In this case, skip_external_code could either just set the lexer state (second option above), or it could call input until it gets to the matching } and then call unput once (first option above).
Note that the skip_external_code function needs to be defined in the 3rd section of the lexer file so it has access to static functions and macros in the lexer (which both of these techniques depend on).
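The scanning logic for the first option might look roughly like the following. To keep it self-contained and testable, this sketch walks over a string instead of calling the lexer; inside a flex file you would pull characters with input() instead and, on reaching the matching }, push it back with unput('}'). The name `find_closing_brace` is hypothetical, and counting nested braces is my assumption about what the external code may contain. Braces inside string literals or comments of the external code are not handled.

```c
#include <stddef.h>

/* p points just past the opening '{'.  Returns a pointer to the
 * matching '}', or NULL if the input ends first.  Nested braces are
 * counted so that "{ a { b } c }" is skipped as one unit. */
const char *find_closing_brace(const char *p)
{
    int depth = 1;

    for (; *p; ++p) {
        if (*p == '{')
            ++depth;
        else if (*p == '}' && --depth == 0)
            return p;
    }
    return NULL;
}
```

The text between the starting pointer and the returned pointer is the external code, which you can copy into a string and attach to the rule's semantic value.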
This question might sound a little confusing. I'm using Flex to pass tokens to Bison.
The behavior I want is that Flex matches the longest regular expression and passes that token (it DOES work like this), but if that token doesn't work with the grammar, it then matches the second longest regular expression and passes that token.
I'm struggling to think of a way to create this behavior. How could I make this happen?
To clarify, for example, say I have two rules:
"//" return TOKEN_1;
"///" return TOKEN_2;
Given the string "///", I'd like it to first pass TOKEN_2 (it does).
If TOKEN_2 doesn't fit with the grammar as specified in Bison, it then passes TOKEN_1 (which is also valid).
How can I create this behavior?
In flex, you can have a rule that tries to do something but fails and tries the second-best rule by using the REJECT macro:
REJECT directs the scanner to proceed on to the
"second best" rule which matched the input (or a
prefix of the input). The rule is chosen as
described above in "How the Input is Matched", and
yytext and yyleng set up appropriately. It may
either be one which matched as much text as the
originally chosen rule but came later in the flex
input file, or one which matched less text.
(source: The Flex Manual Page).
So to answer your question about getting the second-longest expression, you might be able to do this using REJECT (though you have to be careful, because it could just pick something of the same length with equal priority).
Note that flex will run slower with REJECT being used because it needs to maintain extra logic to "fall back" to worse matches at any point. I'd suggest only using this if there's no other way to fix your problem.
Sorry, but you can't do that. I'm actually unsure how much flex talks to bison. I do know there is a mode for REPL parsing, and I do know there is another mode that parses it all.
You'll have to inline the rule. For example, instead of // and /, you write a rule which accepts /// and then another that assumes /// means // /. But that gets messy, and I only did that in a specific case in my code.
I would just have the lexer scan two tokens // and / and then have the parser deal with cases when they are supposed to be regarded as one token, or separate. I.e. a grammar rule that begins with /// can actually be refactored into one which starts with // and /. In other words, do not have a TOKEN_2 at all. In some cases this sort of thing can be tricky though, because the LALR(1) parser has only one token of lookahead. It has to make a shift or reduce decision based on seeing the // only, without regard for the / which follows.
I had an idea for solving this with a hacky approach involving a lexical tie in, but it proved unworkable.
The main flaw with the idea is that there isn't any easy way to do error recovery in yacc which is hidden from the user. If a syntax error is triggered, that is visible. The yyerror function could contain a hack to try to hide this, but it lacks the context information.
In other words, you can't really use Yacc error actions to trigger a backtracking search for another parse.
This is tough for bison/yacc to deal with, as it doesn't do backtracking. Even if you use a backtracking parser generator like btyacc, it doesn't really help unless it also backtracks through the lexer (which would likely require a parser generator with integrated lexer.)
My suggestion would be to have the lexer recognize a slash immediately followed by a slash specially and return a different token:
\//\/ return SLASH;
\/ return '/'; /* not needed if you have the catch-all rule: */
. return *yytext;
Now you need to 'assemble' multi-slash 'tokens' as non-terminals in the grammar.
single_slash: SLASH | '/' ;
double_slash: SLASH SLASH | SLASH '/' ;
triple_slash: SLASH SLASH SLASH | SLASH SLASH '/' ;
However, you'll now likely find you have conflicts in the grammar due to the 1-token lookahead not being enough. You may be able to resolve those by using btyacc or bison's %glr-parser option.