Segmentation fault on simple Bison script - c++

OK, I'm doing a few experiments with Lex/Bison (Yacc), and given that my C skills are rather rusty (I once built compilers with these tools, and now I'm lost in the first few lines... :-S), I need your help.
This is what my parser looks like:
%{
#include <stdio.h>
#include <string.h>

void yyerror(const char *str)
{
    fprintf(stderr, "error: %s\n", str);
}

int yywrap()
{
    return 1;
}

int main()
{
    yyparse();
}
%}
%union
{
char* str;
}
%token <str> WHAT IS FIND GET SHOW WITH POSS OF NUMBER WORD
%type <str> statement
%start statements
%%
statement
    : GET  { printf("get\n"); }
    | SHOW { printf("%s\n", $1); }
    | OF   { printf("of\n"); }
    ;
statements
    : statement
    | statements statement
    ;
The issue:
So, basically, whenever the parser comes across a "get", it prints "get", and so on.
However, when it tries to print "show" (using the $1 value), it crashes with a segmentation fault.
What am I doing wrong?

Lex returns a number representing the token; you need to access yytext to get the text of what was actually matched.
Something like:
statement : GET { printf("get\n"); }
| SHOW { printf("%s\n",yytext); }
| OF { printf("of\n"); }
;
To propagate the text of terminals, I usually associate a nonterminal with each terminal, pass the char* back through it, and build the parse tree from there, as in the example below. Note that I've left out the type declarations and the implementation of create_sww_ASTNode(char*,char*,char*); a sketch of what they might look like follows the grammar fragment. Importantly, not all nonterminals will return the same type: number returns an integer, word returns a char*, and sww returns an AST node (or whatever generic abstract syntax tree structure you come up with). Usually, beyond the nonterminals that wrap terminals, it's all AST stuff.
sww : show word word
    {
        $$ = create_sww_ASTNode($1, $2, $3);
    }
    ;

word : WORD
    {
        $$ = malloc(strlen(yytext) + 1);
        strcpy($$, yytext);
    }
    ;

show : SHOW
    {
        $$ = malloc(strlen(yytext) + 1);
        strcpy($$, yytext);
    }
    ;

number : NUMBER
    {
        $$ = atoi(yytext);
    }
    ;
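For completeness, here is a sketch of the pieces left out above. The union member names, the astNode struct, and the body of create_sww_ASTNode are all hypothetical; adapt them to whatever AST representation you actually build.

/* Hypothetical declarations to accompany the fragment above. */
%union
{
    char*           str;
    int             num;
    struct astNode* node;
}
%type <str>  word show
%type <num>  number
%type <node> sww

/* In the %{ ... %} prologue (needs <stdlib.h>), a possible node type and constructor: */
struct astNode
{
    const char *kind;   /* e.g. "sww" */
    char *parts[3];     /* the three strings passed in */
};

struct astNode *create_sww_ASTNode(char *a, char *b, char *c)
{
    struct astNode *n = malloc(sizeof *n);
    n->kind = "sww";
    n->parts[0] = a;
    n->parts[1] = b;
    n->parts[2] = c;
    return n;
}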

You don't show your lexer code, but the problem is probably that you never set yylval to anything, so when you access $1 in the parser, it contains garbage and you get a crash. Your lexer actions need to set yylval.str to something so it will be valid:
"show" { yylval.str = "SHOW"; return SHOW }
[a-z]+ { yylval.str = strdup(yytext); return WORD; }

OK, so here's the answer. (Can somebody tell me why I always come up with the solution right after I've posted the question here on SO? lol!)
The problem was not with the parser itself, but with the lexer.
The thing is: when we write { printf("%s\n",$1); }, we are actually telling the parser to print yylval (which by default is an int, not a string).
So, the trick is to convert the appropriate tokens into strings.
Here's my (updated) Lexer file :
%{
#include <stdio.h>
#include <string.h>
#include "parser.tab.h"
void toStr();
%}

DIGIT           [0-9]
LETTER          [a-zA-Z]
LETTER_OR_SPACE [a-zA-Z ]

%%
find                    { toStr(); return FIND; }
get                     { toStr(); return GET; }
show                    { toStr(); return SHOW; }
{DIGIT}+(\.{DIGIT}+)?   { toStr(); return NUMBER; }
{LETTER}+               { toStr(); return WORD; }
\n                      /* ignore end of line */;
[ \t]+                  /* ignore whitespace */;
%%

void toStr()
{
    yylval.str = strdup(yytext);
}
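With yylval.str set by the lexer, the parser action from the question works as written. One follow-up worth mentioning: strdup allocates a copy for every token, so the parser may want to free those copies once it is done with them. A minimal sketch (the free calls are an addition of mine, not part of the original grammar, and need <stdlib.h> in the parser prologue):

statement
    : GET  { printf("get\n"); free($1); }    /* GET is also strdup'ed by toStr() */
    | SHOW { printf("%s\n", $1); free($1); } /* release the copy made by strdup  */
    | OF   { printf("of\n"); }
    ;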

Location of the error token always starts at 0

I'm writing a parser with error handling. I would like to output to the user the exact location of the parts of the input that couldn't be parsed.
However, the location of the error token always starts at 0, even when the parts before it were parsed successfully.
Here's a heavily simplified example of what I did.
(The problematic part is probably in the parser.yy.)
Location.hh:
#pragma once
#include <string>

// The full version tracks the position in bytes, the line number and the offset in the current line.
// Here, however, I've shortened it to the line number only.
struct Location
{
    int beginning, ending;

    operator std::string() const { return std::to_string(beginning) + '-' + std::to_string(ending); }
};
LexerClass.hh:
#pragma once
#include <istream>
#include <string>
#if ! defined(yyFlexLexerOnce)
#include <FlexLexer.h>
#endif
#include "Location.hh"

class LexerClass : public yyFlexLexer
{
    int currentPosition = 0;

protected:
    std::string *yylval = nullptr;
    Location *yylloc = nullptr;

public:
    LexerClass(std::istream &in) : yyFlexLexer(&in) {}
    [[nodiscard]] int yylex(std::string *const lval, Location *const lloc);
    void onNewLine() { yylloc->beginning = yylloc->ending = ++currentPosition; }
};
lexer.ll:
%{
#include "./parser.hh"
#include "./LexerClass.hh"
#undef YY_DECL
#define YY_DECL int LexerClass::yylex(std::string *const lval, Location *const lloc)
%}
%option c++ noyywrap
%option yyclass="LexerClass"
%%
%{
yylval = lval;
yylloc = lloc;
%}
[[:blank:]] ;
\n { onNewLine(); }
[0-9] { return yy::Parser::token::DIGIT; }
. { return yytext[0]; }
parser.yy:
%language "c++"
%code requires {
#include "LexerClass.hh"
#include "Location.hh"
}
%define api.parser.class {Parser}
%define api.value.type {std::string}
%define api.location.type {Location}
%parse-param {LexerClass &lexer}
%defines
%code {
template<typename RHS>
void calcLocation(Location &current, const RHS &rhs, const int n);
#define YYLLOC_DEFAULT(Cur, Rhs, N) calcLocation(Cur, Rhs, N)
#define yylex lexer.yylex
}
%token DIGIT
%%
numbers:
    %empty
  | numbers number ';'  { std::cout << std::string(@number) << "\tnumber" << std::endl; }
  | error ';'           { yyerrok; std::cerr << std::string(@error) << "\terror context" << std::endl; }
  ;
number:
    DIGIT {}
  | number DIGIT {}
  ;
%%
#include <iostream>
template<typename RHS>
inline void calcLocation(Location &current, const RHS &rhs, const int n)
{
current = (n <= 1)
? YYRHSLOC(rhs, n)
: Location{YYRHSLOC(rhs, 1).beginning, YYRHSLOC(rhs, n).ending};
}
void yy::Parser::error(const Location &location, const std::string &message)
{
std::cout << std::string(location) << "\terror: " << message << std::endl;
}
int main()
{
LexerClass lexer(std::cin);
yy::Parser parser(lexer);
return parser();
}
For the input:
123
456
789;
123;
089
xxx
123;
765
432;
expected output:
0-2 number
3-3 number
5-5 error: syntax error
4-6 error context
7-8 number
actual output:
0-2 number
3-3 number
5-5 error: syntax error
0-6 error context
7-8 number
Here's your numbers rule, for reference (without actions, since they're not really relevant):
numbers:
%empty
| numbers number ';'
| error ';'
numbers is also your start symbol. It should be reasonably clear that there is nothing before a numbers non-terminal in any derivation. There is a top-level numbers non-terminal, which encompasses the entire input, and it starts with a numbers non-terminal which contains everything except the last number ;, and so on. All of these numbers start at the beginning.
Similarly, the error pseudotoken is at the start of some numbers derivation. So it, too, must start at the beginning of the input.
In other words, your statement that "the location of the error token always starts at 0, even if parts before it were parsed successfully" is untestable: with this grammar there can never be anything before the error token, so its location always starts at 0, and the output you're receiving is "expected". Or, at least, predictable; I understand that you didn't expect it, and it's an easy confusion to fall into. I didn't really see it myself until I ran the parser with tracing enabled, which I highly recommend; note that to do so, it's helpful to add an overload of operator<<(std::ostream &, const Location &).
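For reference, a minimal sketch of such an overload, together with the usual way of turning tracing on for the C++ skeleton (%define parse.trace plus set_debug_level); the overload assumes the Location struct from the question:

#include <ostream>
#include "Location.hh"

// Lets Bison's trace output (and your own logging) print locations directly.
inline std::ostream &operator<<(std::ostream &out, const Location &loc)
{
    return out << loc.beginning << '-' << loc.ending;
}

// In parser.yy:                %define parse.trace
// In main(), before parsing:   parser.set_debug_level(1);   // trace goes to std::cerr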
I'm building upon rici's answer, so read that one first.
Let's consider the rule:
numbers:
%empty
| numbers number ';'
| error ';' { yyerrok; }
;
This means the nonterminal numbers can be one of these three things:
It may be empty.
It may be a number preceded by any valid numbers.
It may be an error.
Do you see the problem yet?
The whole numbers has to be an error, from the beginning; there is no rule saying that anything else is allowed before it.
Of course Bison obediently complies with your wishes and makes the error start at the very beginning of the nonterminal numbers.
It can do that because error is a jack of all trades and there is no rule about what can be included inside it. To fulfil your rule, Bison needs to extend the error over all the previous numbers.
When you understand the problem, fixing it is rather easy. You just need to tell Bison that numbers are allowed before the error:
numbers:
%empty
| numbers number ';'
| numbers error ';' { yyerrok; }
;
This is IMO the best solution. There is another approach, though.
You can move the error token to the number:
numbers:
%empty
| numbers number ';' { yyerrok; }
;
number:
DIGIT
| number DIGIT
| error
;
Notice that yyerrok needs to stay in numbers, because the parser would enter an infinite loop if you placed it in a rule that ends with the error token.
A disadvantage of this approach is that if you attach an action to this error, it will be triggered multiple times (roughly once per illegal token).
Maybe in some situations this is preferable, but generally I suggest the first way of solving the issue.
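For completeness, here is a sketch of the recommended rule with the question's original actions put back in (using the same @-style named location references as in the question):

numbers:
    %empty
  | numbers number ';'  { std::cout << std::string(@number) << "\tnumber" << std::endl; }
  | numbers error ';'   { yyerrok; std::cerr << std::string(@error) << "\terror context" << std::endl; }
  ;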

Why is yyerror() being called even when string is valid?

This is a Yacc program to recognize all strings consisting of n a's followed by a b, i.e. the language aⁿb (the value of n is read as input).
%{
#include "y.tab.h"
%}
%%
a {return A;}
b {return B;}
\n {return 0;}
. {return yytext[0];}
%%
The Yacc part:
%{
#include <stdio.h>
int aCount = 0, n;
%}
%token A
%token B
%%
s : X B { if (aCount < n || aCount > n)
          {
              YYFAIL();
          }
        }
  ;
X : X T | T
  ;
T : A { aCount++; }
  ;
%%
int main()
{
    printf("Enter the value of n\n");
    scanf("%d", &n);
    printf("Enter the string\n");
    yyparse();
    printf("Valid string\n");
}
int YYFAIL()
{
    printf("Invalid count of 'a'\n");
    exit(0);
}
int yyerror()
{
    printf("Invalid string\n");
    exit(0);
}
Output:
Invalid string
It displays "Invalid string" even for a valid string like aab with n = 2.
For every string I enter, yyerror() is called.
Please help me resolve this!
TIA
scanf("%d",&n);
reads a number from standard input.
It does not read a number and the following newline. It just reads a number. Whatever follows the number will be returned from the next operation which reads from stdin.
So when you attempt to parse, the character read by the lexer is the newline character which you typed after the number. That newline character causes the lexer to return 0 to the parser, which the parser interprets as the end of input. But the grammar doesn't allow empty inputs, so the parser reports a syntax error.
On my system, the parser reports a syntax error before it gives me the opportunity to type any input. The fact that it allows you to type an input line is a bit puzzling to me, but it might have something to do with whatever IDE you are using to run your program.
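One possible fix, sketched here rather than taken from the answer above, is to discard the rest of that first line after reading n, before yyparse() hands stdin over to the lexer:

printf("Enter the value of n\n");
scanf("%d", &n);

/* Consume the newline (and anything else) scanf left on the line,
   so the lexer does not immediately see '\n' and return 0 (end of input). */
int c;
while ((c = getchar()) != '\n' && c != EOF)
    ;

printf("Enter the string\n");
yyparse();

Alternatively, the lexer's \n rule could be changed to skip newlines instead of returning 0, so that only end-of-file ends the input.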

why simple grammar rule in bison not working?

I am learning Flex & Bison and I am stuck: I cannot figure out why such a simple grammar rule does not work as I expect. Below is the lexer code:
%{
#include <stdio.h>
#include "zparser.tab.h"
%}
%%
[\t\n ]+ //ignore white space
FROM|from { return FROM; }
select|SELECT { return SELECT; }
update|UPDATE { return UPDATE; }
insert|INSERT { return INSERT; }
delete|DELETE { return DELETE; }
[a-zA-Z].* { return IDENTIFIER; }
\* { return STAR; }
%%
And below is the parser code:
%{
#include <stdio.h>
#include <iostream>
#include <vector>
#include <string>
using namespace std;
extern int yyerror(const char* str);
extern int yylex();
%}

%token SELECT UPDATE INSERT DELETE STAR IDENTIFIER FROM

%%
ZQL : SELECT STAR FROM IDENTIFIER { cout << "Done" << endl; return 0; }
    ;
%%
Can anyone tell me why it reports an error when I try to parse "select * from something"?
[a-zA-Z].* will match an alphabetic character followed by any number of arbitrary characters except newline. In other words, it will match from an alphabetic character to the end of the line.
Since flex always accepts the longest match, the line select * from ... will appear to have only one token, IDENTIFIER, and that is a syntax error.
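If you want to see this for yourself, a quick check is to run the lexer on its own and print what it returns for that line. The standalone driver below is only a sketch (it is not part of either answer); compile it as C together with the flex output, leaving the parser out:

/* Hypothetical standalone driver: prints every token the lexer produces. */
#include <stdio.h>
#include "zparser.tab.h"   /* token numbers generated by bison */

extern int yylex(void);
extern char *yytext;

int yywrap(void) { return 1; }   /* avoids having to link against -lfl */

int main(void)
{
    int tok;
    while ((tok = yylex()) != 0)
        printf("token %d: \"%s\"\n", tok, yytext);
    return 0;
}

For "select * from something" you will see a single IDENTIFIER token whose text is the whole line, which is exactly the problem described above.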
[a-zA-Z].* { return IDENTIFIER; }
The problem is here. It allows any junk to follow an initial alphabetic character and be returned as IDENTIFIER, including, in this case, the entire rest of the line after the initial s of select.
It should be:
[a-zA-Z]+ { return IDENTIFIER; }
or possibly
[a-zA-Z][a-zA-Z0-9]* { return IDENTIFIER; }
or whatever else you want to allow to follow an initial alpha character in your identifiers.

My last regular expression won't work but I cannot figure out why

I have two vectors: one holds my regular expressions and the other holds the strings that will be checked against them. Most of them work fine, except for this one (shown below): the string is correct and matches the regular expression, but the code outputs "incorrect" instead of "correct".
Input string:
.C/IATA
The code is below:
std::string errorMessages [6][6] = {
    {
        "Correct Corporate Name\n",
    },
    {
        "Incorrect Format for Corporate Name\n",
    }
};

std::vector<std::string> el;
split(el, message, boost::is_any_of("\n"));
std::string a = ("");

for (int i = 0; i < el.size(); i++)
{
    if (el[i].substr(0, 3) == ".C/")
    {
        DCS_LOG_DEBUG("--------------- Validating .C/ ---------------");
        output.push_back("\n--------------- Validating .C/ ---------------\n");
        str = el[i].substr(3);
        split(st, str, boost::is_any_of("/"));
        for (int split_id = 0; split_id < splitMask.size(); split_id++)
        {
            boost::regex const string_matcher_id(splitMask[split_id]);
            if (boost::regex_match(st[split_id], string_matcher_id))
            {
                a = errorMessages[0][split_id];
                DCS_LOG_DEBUG("" << a)
            }
            else
            {
                a = errorMessages[1][split_id];
                DCS_LOG_DEBUG("" << a)
            }
            output.push_back(a);
        }
    }
    else
    {
        DCS_LOG_DEBUG("Do Nothing");
    }
}
At the point of the regex check, the values are:
st[split_id] = "IATA"
splitMask[split_id] = "[a-zA-Z]{1,15}"  <---
But it still outputs "Incorrect Format for Corporate Name".
I cannot see why it prints "incorrect" when it should be "correct". Can someone help me here, please?
Your regex and the surrounding logic look OK.
You need to extend your logging to print the relevant parts of splitMask and st right before the call to boost::regex_match, to double-check that the values are what you believe they are. Print them surrounded by some punctuation, and also print the string lengths to be sure.
As you probably know, boost::regex_match only reports a match if the whole string matches; therefore, if there is a non-printable character somewhere, or maybe a trailing space, that would perfectly explain the result you are seeing.
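A minimal sketch of that kind of logging, using plain std::cerr instead of the DCS_LOG_DEBUG macro from the question (whose definition isn't shown); debugMatch is a hypothetical helper, not existing code:

#include <iostream>
#include <string>
#include <boost/regex.hpp>

// Print the candidate string and the pattern wrapped in delimiters, plus their
// lengths, so hidden characters (trailing spaces, '\r', etc.) become visible.
void debugMatch(const std::string &value, const std::string &pattern)
{
    std::cerr << "value=[" << value << "] len=" << value.size()
              << " pattern=[" << pattern << "] len=" << pattern.size() << '\n';
    boost::regex re(pattern);
    std::cerr << "regex_match: " << std::boolalpha
              << boost::regex_match(value, re) << '\n';
}

Called as debugMatch(st[split_id], splitMask[split_id]) just before the existing check, a stray carriage return or trailing space would show up immediately as an unexpected length (for example 5 instead of 4 for "IATA").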

Lex parsing without spaces

I am coding a custom shell using Lex, Yacc, and C++. It is being run in a Unix environment. It currently works fine as long as there are spaces between the tokens. For example:
ls | grep test > out
will pass:
WORD PIPE WORD WORD GREAT WORD
to Yacc, and actions are then taken from there. However, I need it to work when there are no spaces as well. For example:
ls|grep test>out
should work the same as the previous command. However, it currently only passes:
WORD WORD
Is there a way to parse the input before Lex tokenizes it?
Edit:
Here is my Lex file:
%{
#include <string.h>
#include "y.tab.h"
%}
%%
\n {
return NEWLINE;
}
[ \t] {
/* Discard spaces and tabs */
}
">" { return GREAT; }
">&" { return GREATAMPERSAND; }
">>" { return GREATGREAT; }
">>&" { return GREATGREATAMPERSAND; }
"<" { return LESS; }
"|" { return PIPE; }
"&" { return AMPERSAND; }
[^ \t\n][^ \t\n]* {
/* Assume that file names have only alpha chars */
yylval.string_val = strdup(yytext);
return WORD;
}
. {
/* Invalid character in input */
return NOTOKEN;
}
%%
You need to change your definition of a WORD. Right now, when the lexer encounters a non-whitespace character, it treats everything up to the next whitespace as part of that WORD.
You want to change that so it doesn't include any of the punctuation you're using for other purposes:
[^ \t\n\>\<\|\&]+ {
/* Assume that file names have only alpha chars */
yylval.string_val = strdup(yytext);
return WORD;
}
I figured it out. WORD was including the pipes and other special characters.
I changed it to
[^\|\>\<\& \t\n][^\|\>\<\& \t\n]* {
yylval.string_val = strdup(yytext);
return WORD;
}
and now it works.