Why is this simple grammar rule in bison not working? - c++

I am learning flex & bison and I am stuck here: I cannot figure out why such a simple grammar rule does not work as I expected. Below is the lexer code:
%{
#include <stdio.h>
#include "zparser.tab.h"
%}
%%
[\t\n ]+ //ignore white space
FROM|from { return FROM; }
select|SELECT { return SELECT; }
update|UPDATE { return UPDATE; }
insert|INSERT { return INSERT; }
delete|DELETE { return DELETE; }
[a-zA-Z].* { return IDENTIFIER; }
\* { return STAR; }
%%
And below is the parser code:
%{
#include<stdio.h>
#include<iostream>
#include<vector>
#include<string>
using namespace std;
extern int yyerror(const char* str);
extern int yylex();
%}
%token SELECT UPDATE INSERT DELETE STAR IDENTIFIER FROM
%%
ZQL : SELECT STAR FROM IDENTIFIER { cout<<"Done"<<endl; return 0;}
;
%%
Can anyone tell me why it reports an error when I enter "select * from something"?

[a-zA-Z].* will match an alphabetic character followed by any number of arbitrary characters except newline. In other words, it will match from an alphabetic character to the end of the line.
Since flex always accepts the longest match, the line select * from ... will appear to have only one token, IDENTIFIER, and that is a syntax error.

[a-zA-Z].* { return IDENTIFIER; }
The problem is here. It allows any junk to follow an initial alphabetic character and be returned as IDENTIFIER, including, in this case, the entire rest of the line after the initial 's' of select.
It should be:
[a-zA-Z]+ { return IDENTIFIER; }
or possibly
[a-zA-Z][a-zA-Z0-9]* { return IDENTIFIER; }
or whatever else you want to allow to follow an initial alpha character in your identifiers.
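For reference, here is a minimal sketch of the lexer with just that one rule changed (using the [a-zA-Z][a-zA-Z0-9]* variant; adjust it to whatever your identifiers should allow):
%{
#include <stdio.h>
#include "zparser.tab.h"
%}
%%
[ \t\n]+                { /* ignore whitespace */ }
FROM|from               { return FROM; }
select|SELECT           { return SELECT; }
update|UPDATE           { return UPDATE; }
insert|INSERT           { return INSERT; }
delete|DELETE           { return DELETE; }
[a-zA-Z][a-zA-Z0-9]*    { return IDENTIFIER; }
\*                      { return STAR; }
%%
The keyword rules must stay above the identifier rule: for matches of equal length, flex picks the rule listed first. With this change, select * from something is scanned as SELECT STAR FROM IDENTIFIER, which is exactly what the ZQL rule expects.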

Location of the error token always starts at 0

I'm writing a parser with error handling. I would like to output to the user the exact location of the parts of the input that couldn't be parsed.
However, the location of the error token always starts at 0, even if it is preceded by parts that were parsed successfully.
Here's a heavily simplified example of what I did.
(The problematic part is probably in the parser.yy.)
Location.hh:
#pragma once
#include <string>
// The full version tracks position in bytes, line number and offset in the current line.
// Here however, I've shortened it to line number only.
struct Location
{
int beginning, ending;
operator std::string() const { return std::to_string(beginning) + '-' + std::to_string(ending); }
};
LexerClass.hh:
#pragma once
#include <istream>
#include <string>
#if ! defined(yyFlexLexerOnce)
#include <FlexLexer.h>
#endif
#include "Location.hh"
class LexerClass : public yyFlexLexer
{
int currentPosition = 0;
protected:
std::string *yylval = nullptr;
Location *yylloc = nullptr;
public:
LexerClass(std::istream &in) : yyFlexLexer(&in) {}
[[nodiscard]] int yylex(std::string *const lval, Location *const lloc);
void onNewLine() { yylloc->beginning = yylloc->ending = ++currentPosition; }
};
lexer.ll:
%{
#include "./parser.hh"
#include "./LexerClass.hh"
#undef YY_DECL
#define YY_DECL int LexerClass::yylex(std::string *const lval, Location *const lloc)
%}
%option c++ noyywrap
%option yyclass="LexerClass"
%%
%{
yylval = lval;
yylloc = lloc;
%}
[[:blank:]] ;
\n { onNewLine(); }
[0-9] { return yy::Parser::token::DIGIT; }
. { return yytext[0]; }
parser.yy:
%language "c++"
%code requires {
#include "LexerClass.hh"
#include "Location.hh"
}
%define api.parser.class {Parser}
%define api.value.type {std::string}
%define api.location.type {Location}
%parse-param {LexerClass &lexer}
%defines
%code {
template<typename RHS>
void calcLocation(Location &current, const RHS &rhs, const int n);
#define YYLLOC_DEFAULT(Cur, Rhs, N) calcLocation(Cur, Rhs, N)
#define yylex lexer.yylex
}
%token DIGIT
%%
numbers:
%empty
| numbers number ';' { std::cout << std::string(@number) << "\tnumber" << std::endl; }
| error ';' { yyerrok; std::cerr << std::string(@error) << "\terror context" << std::endl; }
;
number:
DIGIT {}
| number DIGIT {}
;
%%
#include <iostream>
template<typename RHS>
inline void calcLocation(Location &current, const RHS &rhs, const int n)
{
current = (n <= 1)
? YYRHSLOC(rhs, n)
: Location{YYRHSLOC(rhs, 1).beginning, YYRHSLOC(rhs, n).ending};
}
void yy::Parser::error(const Location &location, const std::string &message)
{
std::cout << std::string(location) << "\terror: " << message << std::endl;
}
int main()
{
LexerClass lexer(std::cin);
yy::Parser parser(lexer);
return parser();
}
For the input:
123
456
789;
123;
089
xxx
123;
765
432;
expected output:
0-2 number
3-3 number
5-5 error: syntax error
4-6 error context
7-8 number
actual output:
0-2 number
3-3 number
5-5 error: syntax error
0-6 error context
7-8 number
Here's your numbers rule, for reference (without actions, since they're not really relevant):
numbers:
%empty
| numbers number ';'
| error ';'
numbers is also your start symbol. It should be reasonably clear that there is nothing before a numbers non-terminal in any derivation. There is a top-level numbers non-terminal, which encompasses the entire input, and it starts with a numbers non-terminal which contains everything except the last number ;, and so on. All of these numbers start at the beginning.
Similarly, the error pseudotoken is at the start of some numbers derivation. So it, too, must start at the beginning of the input.
In other words, your statement that "the location of the error token always starts at 0, even if it is preceded by parts that were parsed successfully" is untestable. The location of the error token always starts at 0 because there cannot be anything before it, and the output you're receiving is "expected". Or, at least, predictable; I understand that you didn't expect it, and it's an easy confusion to fall into. I didn't really see it until I ran the parser with tracing enabled, which is highly recommended; note that to do so, it's helpful to add an overload of operator<<(std::ostream&, Location const&).
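For reference, a minimal sketch of such an overload, assuming the Location struct shown above:
#include <ostream>
#include "Location.hh"
// Lets trace output (and your own debugging prints) show a location
// as "beginning-ending", matching the operator std::string above.
inline std::ostream &operator<<(std::ostream &out, const Location &loc)
{
    return out << loc.beginning << '-' << loc.ending;
}
Tracing itself can then be switched on with %define parse.trace in parser.yy and a call to parser.set_debug_level(1) before parser() in main (assuming the C++ skeleton that %language "c++" selects).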
I'm building upon rici's answer, so read that one first.
Let's consider the rule:
numbers:
%empty
| numbers number ';'
| error ';' { yyerrok; }
;
This means the nonterminal numbers can be one of these three things:
It may be empty.
It may be a number preceded by any valid numbers.
It may be an error.
Do you see the problem yet?
The whole numbers has to be an error, from the beginning; there is no rule saying that anything else is allowed before it.
Of course Bison obediently complies with your wishes and makes the error start at the very beginning of the nonterminal numbers.
It can do that because error is a jack of all trades: there is no rule about what can be included inside it. To fulfill your rule, Bison needs to extend the error over all the previous numbers.
When you understand the problem, fixing it is rather easy. You just need to tell Bison that numbers are allowed before the error:
numbers:
%empty
| numbers number ';'
| numbers error ';' { yyerrok; }
;
This is IMO the best solution. There is another approach, though.
You can move the error token to the number:
numbers:
%empty
| numbers number ';' { yyerrok; }
;
number:
DIGIT
| number DIGIT
| error
;
Notice that yyerrok needs to stay in numbers, because the parser would enter an infinite loop if you placed it in a rule that ends with the error token.
A disadvantage of this approach is that if you attach an action to this error, it will be triggered multiple times (roughly once per illegal token).
Maybe in some situations this is preferable, but generally I suggest the first way of solving the issue.

Why is yyerror() being called even when string is valid?

This is a yacc program to recognize all strings of n a's followed by a single b, i.e. the language a^n b (the value of n is read as input).
%{
#include "y.tab.h"
%}
%%
a {return A;}
b {return B;}
\n {return 0;}
. {return yytext[0];}
%%
The yacc part:
%{
#include <stdio.h>
int aCount=0,n;
%}
%token A
%token B
%%
s : X B { if (aCount<n || aCount>n)
{
YYFAIL();
}
}
X : X T | T
T : A { aCount++;}
;
%%
int main()
{ printf("Enter the value of n \n");
scanf("%d",&n);
printf("Enter the string\n");
yyparse();
printf("Valid string\n");
}
int YYFAIL()
{
printf("Invalid count of 'a'\n");
exit(0);
}
int yyerror()
{
printf("Invalid string\n");
exit(0);
}
Output:
Invalid string
It displays "Invalid string" even for a valid string like aab with n = 2. For every string I enter, yyerror() is called.
Please help me resolve this!
TIA
scanf("%d",&n);
reads a number from standard input.
It does not read a number and the following newline. It just reads a number. Whatever follows the number will be returned from the next operation which reads from stdin.
So when you attempt to parse, the character read by the lexer is the newline character which you typed after the number. That newline character causes the lexer to return 0 to the parser, which the parser interprets as the end of input. But the grammar doesn't allow empty inputs, so the parser reports a syntax error.
On my system, the parser reports a syntax error before it gives me the opportunity to type any input. The fact that it allows you to type an input line is a bit puzzling to me, but it might have something to do with whatever IDE you are using to run your program.
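One simple way out, sketched here under the assumption that you want to keep reading n with scanf, is to drain the rest of that input line in main before calling yyparse:
int main()
{
    printf("Enter the value of n\n");
    scanf("%d", &n);

    /* scanf leaves the newline (and anything else typed on that line)
       in stdin; drain it so the lexer does not see '\n' first and
       treat it as end of input. */
    int c;
    while ((c = getchar()) != '\n' && c != EOF)
        ;

    printf("Enter the string\n");
    yyparse();
    printf("Valid string\n");
    return 0;
}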

Lex is not returning what I want

%{
#include<stdio.h>
int n_chars = 0;
int n_lines = 0;
%}
%%
"if"|"else"|"while"|"do"|"switch"|"case" {
printf("Keyword");
}
[a-zA-Z][a-z|0-9]* {printf("Identifier");}
[0-9]* {printf("Number");}
"!"|"#"|"*"|"&"|"^"|"%"|"$"|"#" {printf("Special Character");}
\n { ++n_lines, ++n_chars; }
. ++n_chars;
%%
int yywrap() {
return 1;
}
main(int argc[], char *argv[]) {
yyin = fopen("index.txt", "r");
printf("Number of characters is: %d", n_chars);
yylex();
return 0;
}
My code above returns: Number of characters is: 0
The content of my file index.txt is:
if hello #
while 1
do test
Why does it print 0? What I expect is the total number of characters, and it should also tell me whether each token is a keyword, an identifier or a special character.
I must be doing something wrong, since I am very new to this.
I am using EditPlus. So any help would be appreciated!
There are at least two problems with your code.
You print n_chars before calling yylex.
The last rule for . will not be matched for anything that is matched by one of the rules above, so you will not get the number of all chars with this approach.
With yylex called first, I get the number of "other" characters, such as spaces and newlines.
To count all characters, you can do one of the following:
Add the statement n_chars += strlen(yytext); to the first four rules, to count the characters matched by those rules.
Add the statement REJECT; to the first four rules to continue searching and therefore also match the . rule with the action ++n_chars;.
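As a rough sketch of the first approach, with the printf moved after yylex and the patterns lightly tidied (the duplicate "#" is dropped, [0-9]* becomes [0-9]+ so it cannot match the empty string, and the identifier class loses its stray '|'):
%{
#include <stdio.h>
#include <string.h>
int n_chars = 0;
int n_lines = 0;
%}
%%
"if"|"else"|"while"|"do"|"switch"|"case"  { n_chars += strlen(yytext); printf("Keyword"); }
[a-zA-Z][a-zA-Z0-9]*                      { n_chars += strlen(yytext); printf("Identifier"); }
[0-9]+                                    { n_chars += strlen(yytext); printf("Number"); }
"!"|"#"|"*"|"&"|"^"|"%"|"$"               { n_chars += strlen(yytext); printf("Special Character"); }
\n                                        { ++n_lines; ++n_chars; }
.                                         ++n_chars;
%%
int yywrap() { return 1; }
int main(void) {
    yyin = fopen("index.txt", "r");
    yylex();                              /* scan the whole file first */
    printf("Number of characters is: %d\n", n_chars);
    return 0;
}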

Segmentation fault on simple Bison script

OK, I'm doing a few experiments with Lex/Bison (Yacc), and given that my C skills are rather rusty (I once created compilers and stuff with all these tools, and now I'm lost in the first few lines... :-S), I need your help.
This is what my Parser looks like :
%{
#include <stdio.h>
#include <string.h>
void yyerror(const char *str)
{
fprintf(stderr,"error: %s\n",str);
}
int yywrap()
{
return 1;
}
main()
{
yyparse();
}
%}
%union
{
char* str;
}
%token <str> WHAT IS FIND GET SHOW WITH POSS OF NUMBER WORD
%type <str> statement
%start statements
%%
statement
: GET { printf("get\n"); }
| SHOW { printf("%s\n",$1); }
| OF { printf("of\n"); }
;
statements
: statement
| statements statement
;
The Issue :
So, basically, whenever the parser comes across a "get", it prints "get". And so on.
However, when it tries to print "show" (using the $1 specifier), it gives a segmentation fault.
What am I doing wrong?
Lex returns a number representing the token; you need to access yytext to get the text of what was matched.
something like
statement : GET { printf("get\n"); }
| SHOW { printf("%s\n",yytext); }
| OF { printf("of\n"); }
;
To propagate the text of terminals, I associate a nonterminal with each terminal, pass back the char*, and start building the parse tree from there, for example. Note I've left out the type declarations and the implementation of create_sww_ASTNode(char*,char*,char*). Importantly, not all nonterminals return the same type: number is an integer, word returns a char*, and sww returns an AST node (or whatever generic abstract syntax tree structure you come up with). Usually, beyond the nonterminals representing terminals, it's all AST stuff.
sww : show word word
{
$$ = create_sww_ASTNode($1,$2,$3);
}
;
word : WORD
{
$$ = malloc(strlen(yytext) + 1);
strcpy($$,yytext);
}
;
show : SHOW
{
$$ = malloc(strlen(yytext) + 1);
strcpy($$,yytext);
}
;
number : NUMBER
{
$$ = atoi(yytext);
}
;
You don't show your lexer code, but the problem is probably that you never set yylval to anything, so when you access $1 in the parser, it contains garbage and you get a crash. Your lexer actions need to set yylval.str to something so it will be valid:
"show" { yylval.str = "SHOW"; return SHOW }
[a-z]+ { yylval.str = strdup(yytext); return WORD; }
OK, so here's the answer. (Can somebody tell me why it is that I always come up with the solution once I've already published a question here on SO? lol!)
The problem was not with the parser itself, but actually with the Lexer.
The thing is: when we write { printf("%s\n",$1); }, we actually tell it to print yylval (which is by default an int, not a string).
So, the trick is to convert the appropriate tokens into strings.
Here's my (updated) Lexer file :
%{
#include <stdio.h>
#include <string.h> /* for strdup() used in toStr() */
#include "parser.tab.h"
void toStr();
%}
DIGIT [0-9]
LETTER [a-zA-Z]
LETTER_OR_SPACE [a-zA-Z ]
%%
find { toStr(); return FIND; }
get { toStr(); return GET; }
show { toStr(); return SHOW; }
{DIGIT}+(\.{DIGIT}+)? { toStr(); return NUMBER; }
{LETTER}+ { toStr(); return WORD; }
\n /* ignore end of line */;
[ \t]+ /* ignore whitespace */;
%%
void toStr()
{
yylval.str=strdup(yytext);
}

Lex parsing without spaces

I am coding a custom shell using Lex, Yacc, and C++. It is being run in a Unix environment. It currently works fine as long as there are spaces between the tokens. For example:
ls | grep test > out
will pass:
WORD PIPE WORD WORD GREAT WORD
to Yacc, and then actions are taken from there. However, I need it to work when there are no spaces as well. For example:
ls|grep test>out
should work the same as the previous command. However, it currently only passes:
WORD WORD
Is there a way to parse the input before Lex tokenizes it?
Edit:
Here is my Lex file:
%{
#include <string.h>
#include "y.tab.h"
%}
%%
\n {
return NEWLINE;
}
[ \t] {
/* Discard spaces and tabs */
}
">" { return GREAT; }
">&" { return GREATAMPERSAND; }
">>" { return GREATGREAT; }
">>&" { return GREATGREATAMPERSAND; }
"<" { return LESS; }
"|" { return PIPE; }
"&" { return AMPERSAND; }
[^ \t\n][^ \t\n]* {
/* Assume that file names have only alpha chars */
yylval.string_val = strdup(yytext);
return WORD;
}
. {
/* Invalid character in input */
return NOTOKEN;
}
%%
You need to change your definition of a WORD. Right now, when it encounters a non-whitespace character, it considers everything up to the next whitespace as part of that WORD.
You want to change that so it doesn't include any of the punctuation you're using for other purposes:
[^ \t\n\>\<\|\&]+ {
/* Assume that file names have only alpha chars */
yylval.string_val = strdup(yytext);
return WORD;
}
I figured it out. WORD was including the pipes and other special characters.
I changed it to
[^\|\>\<\& \t\n][^\|\>\<\& \t\n]* {
yylval.string_val = strdup(yytext);
return WORD;
}
and now it works.