Lex parsing without spaces - C++

I am coding a custom shell using Lex, Yacc, and C++. It is being run in a Unix environment. It currently works fine as long as there are spaces between the tokens. For example:
ls | grep test > out
will pass:
WORD PIPE WORD WORD GREAT WORD
to Yacc, and then actions are taken from there. However, I need it to work when there are no spaces as well. For example:
ls|grep test>out
should work the same as the previous command. However, it currently only passes:
WORD WORD
Is there a way to parse the input before Lex tokenizes it?
Edit:
Here is my Lex file:
%{
#include <string.h>
#include "y.tab.h"
%}
%%
\n {
return NEWLINE;
}
[ \t] {
/* Discard spaces and tabs */
}
">" { return GREAT; }
">&" { return GREATAMPERSAND; }
">>" { return GREATGREAT; }
">>&" { return GREATGREATAMPERSAND; }
"<" { return LESS; }
"|" { return PIPE; }
"&" { return AMPERSAND; }
[^ \t\n][^ \t\n]* {
/* Assume that file names have only alpha chars */
yylval.string_val = strdup(yytext);
return WORD;
}
. {
/* Invalid character in input */
return NOTOKEN;
}
%%

You need to change your definition of a WORD. Right now, when it encounters a non-whitespace character, it considers everything up to the next whitespace as part of that WORD.
You want to change that so it doesn't include any of the punctuation you're using for other purposes:
[^ \t\n\>\<\|\&]+ {
/* Assume that file names have only alpha chars */
yylval.string_val = strdup(yytext);
return WORD;
}

I figured it out. WORD was including the pipes and other special characters.
I changed it to
[^\|\>\<\& \t\n][^\|\>\<\& \t\n]* {
yylval.string_val = strdup(yytext);
return WORD;
}
and now it works.
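If you want to double-check the token stream without going through Yacc, the generated lexer can be driven by a small standalone program. The sketch below is only illustrative: the file names tokdump.c, shell.l and shell.y, the build line, and printing unhandled tokens as raw numeric codes are all my assumptions, and it relies on the %union member string_val that the lexer above implies. With the corrected WORD rule, typing ls|grep test>out should print WORD(ls) PIPE WORD(grep) WORD(test) GREAT WORD(out).
/* tokdump.c -- quick token dumper.
   Build roughly as:  yacc -d shell.y && lex shell.l && cc tokdump.c lex.yy.c -o tokdump  */
#include <stdio.h>
#include "y.tab.h"

YYSTYPE yylval;                      /* normally defined in y.tab.c */
int yywrap(void) { return 1; }       /* so we don't need -ll / -lfl */
extern int yylex(void);

int main(void)
{
    int tok;
    while ((tok = yylex()) != 0) {
        if (tok == WORD)         printf("WORD(%s) ", yylval.string_val);
        else if (tok == PIPE)    printf("PIPE ");
        else if (tok == GREAT)   printf("GREAT ");
        else if (tok == NEWLINE) printf("NEWLINE\n");
        else                     printf("token(%d) ", tok);
    }
    return 0;
}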

Related

Flex error negative range in character class

I am writing a parser using Flex and Bison and have defined various tokens as:
[0-9]+ { yylval.str=strdup(yytext); return digit; }
[0-9]+\.[0-9]* { yylval.str=strdup(yytext); return floating; }
[a-zA-Z_][a-zA-Z0-9_]* { yylval.str=strdup(yytext); return key; }
[a-zA-Z/][a-zA-Z_-/.]* { yylval.str=strdup(yytext); return string; }
[a-zA-Z0-9._-]+ { yylval.str=strdup(yytext); return hostname; }
["][a-zA-Z0-9!##$%^&*()_-+=.,/?]* { yylval.str=strdup(yytext); return qstring1; }
[a-zA-Z0-9!##$%^&*()_-+=.,/?]*["] { yylval.str=strdup(yytext); return qstring2; }
[#].+ { yylval.str=strdup(yytext); return comment;}
[ \n\t] {} /* Ignore white space. */
. {printf("ERR:L:%d\n", q); return ERROR;}
And it shows an error "Negative Range in Character Class" in the regexps for string, qstring1 and qstring2.
Can someone please help me with where I went wrong?
The spec is that:
Non quoted strings may contain ASCII alphanumeric characters, underscores, hyphens, forward slash and period and must start with letter or slash.
Quoted strings may contain any alphanumeric character between the quotes.
I have split quoted strings into two different tokens in order to satisfy some further requirements.
Thanks.
For (string, qstring1, qstring2) you need to either place the hyphen (-) as the first or last character of your character class [], or simply escape it as \- if it appears anywhere else.
(string)
[a-zA-Z/][a-zA-Z_./-]*
(qstring1)
["][a-zA-Z0-9!##$%^&*()_+=.,/?-]*
(qstring2)
[a-zA-Z0-9!##$%^&*()_+=.,/?-]*["]
- needs to be escaped with a backslash.
For qstring1, try the following:
["][a-zA-Z0-9!##$%^&*()_\-+=.,/?]*
The order of the characters inside a character class matters, because an unescaped - between two characters is read as a range.
For example, this line of code:
[+-/*><=] {printf("Operator %c\n",yytext[0]); return yytext[0];}
won't give any error, because +-/ is a valid ASCII range from + up to / (one that also happens to include the characters , - and .), whereas:
[+-*/><=] {printf("Operator %c\n",yytext[0]); return yytext[0];}
will, because * comes before + in ASCII, so +-* asks for a negative range.
Hope it helps.
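To see the fix in isolation, here is a minimal, self-contained scanner that compiles without the negative-range error because the hyphen is placed first in the class and therefore cannot form a range. It is only a sketch; the extra rules and the build line (flex ops.l && cc lex.yy.c -o ops) are my own assumptions:
%option noyywrap
%%
[-+*/><=]   { printf("Operator %c\n", yytext[0]); }
[ \t\n]+    { /* skip whitespace */ }
.           { printf("Other: %s\n", yytext); }
%%
int main(void)
{
    yylex();
    return 0;
}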

Segmentation fault on simple Bison script

OK, I'm doing a few experiments with Lex/Bison (Yacc), and given that my C skills are rather rusty (I once created compilers and such with all these tools, and now I'm lost in the first few lines... :-S), I need your help.
This is what my Parser looks like :
%{
#include <stdio.h>
#include <string.h>
void yyerror(const char *str)
{
fprintf(stderr,"error: %s\n",str);
}
int yywrap()
{
return 1;
}
main()
{
yyparse();
}
%}
%union
{
char* str;
}
%token <str> WHAT IS FIND GET SHOW WITH POSS OF NUMBER WORD
%type <str> statement
%start statements
%%
statement
: GET { printf("get\n"); }
| SHOW { printf("%s\n",$1); }
| OF { printf("of\n"); }
;
statements
: statement
| statements statement
;
The Issue:
So, basically, whenever the parser comes across a "get", it prints "get". And so on.
However, when trying to print "show" (using the $1 specifier), it gives a segmentation fault.
What am I doing wrong?
Lex returns a number representing the token; you need to access yytext to get the text of what was matched.
Something like:
statement : GET { printf("get\n"); }
| SHOW { printf("%s\n",yytext); }
| OF { printf("of\n"); }
;
To propagate the text of terminals, I associate a nonterminal with each terminal, pass back the char*, and start building the parse tree from there, as in the example below. Note I've left out the type declarations and the implementation of create_sww_ASTNode(char*,char*,char*). Importantly, not all nonterminals return the same type: number returns an integer, word returns a char*, and sww returns an astNode (or whatever generic abstract syntax tree structure you come up with). Usually, beyond the nonterminals that wrap terminals, it's all AST stuff.
sww : show word word
{
$$ = create_sww_ASTNode($1,$2,$3);
}
;
word : WORD
{
$$ = malloc(strlen(yytext) + 1);
strcpy($$,yytext);
}
;
show : SHOW
{
$$ = malloc(strlen(yytext) + 1);
strcpy($$,yytext);
}
;
number : NUMBER
{
$$ = atoi(yytext);
}
;
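For illustration only, the pieces that were left out might look roughly like the sketch below; the struct layout and function body are guesses rather than the actual code, and the parser's %union would also need an extra member for the node pointer.
#include <stdlib.h>

/* Hypothetical AST node for the sww rule; the real declaration was omitted above. */
typedef struct ASTNode {
    char *show;
    char *word1;
    char *word2;
} ASTNode;

ASTNode *create_sww_ASTNode(char *show, char *word1, char *word2)
{
    ASTNode *node = (ASTNode *)malloc(sizeof *node);
    node->show  = show;
    node->word1 = word1;
    node->word2 = word2;
    return node;
}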
You don't show your lexer code, but the problem is probably that you never set yylval to anything, so when you access $1 in the parser, it contains garbage and you get a crash. Your lexer actions need to set yylval.str to something so it will be valid:
"show" { yylval.str = "SHOW"; return SHOW }
[a-z]+ { yylval.str = strdup(yytext); return WORD; }
OK, so here's the answer. (Can somebody tell me why I always come up with the solution once I've already posted a question here on SO? lol!)
The problem was not with the parser itself, but actually with the Lexer.
The thing is: when we tell it to { printf("%s\n",$1); }, we actually tell it to print yylval (which is by default an int, not a string).
So, the trick is to convert the appropriate tokens into strings.
Here's my (updated) Lexer file :
%{
#include <stdio.h>
#include <string.h> /* for strdup(), used by toStr() */
#include "parser.tab.h"
void toStr();
%}
DIGIT [0-9]
LETTER [a-zA-Z]
LETTER_OR_SPACE [a-zA-Z ]
%%
find { toStr(); return FIND; }
get { toStr(); return GET; }
show { toStr(); return SHOW; }
{DIGIT}+(\.{DIGIT}+)? { toStr(); return NUMBER; }
{LETTER}+ { toStr(); return WORD; }
\n /* ignore end of line */;
[ \t]+ /* ignore whitespace */;
%%
void toStr()
{
yylval.str=strdup(yytext);
}
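For reference, one way to build this pair of files is sketched below; the file names lexer.l and parser.y are assumptions, chosen so that bison -d produces the parser.tab.h included above:
bison -d parser.y     # generates parser.tab.c and parser.tab.h
flex lexer.l          # generates lex.yy.c
cc parser.tab.c lex.yy.c -o demo
No -lfl is needed here because the parser file already defines yywrap() and main().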

Extracting individual sentences from a text file ... I haven't got it right YET

As part of a larger program, I'm extracting individual sentences from a text file and placing them as strings into a vector of strings. I first decided to use the procedure I've commented out. But then, after a test, I realized that it's doing 2 things wrong:
(1) It's not separating sentences when they are separated by a new line.
(2) It's not separating sentences when they end in a quotation mark. (Ex. the sentences in the string Obama said, "Yes, we can." Then the audience gave a thunderous applause. would not be separated.)
I need to fix those problems. However, I'm afraid this is going to end up as spaghetti code, if it isn't already. Am I going about this wrong? I don't want to keep going back and fixing things. Maybe there's some easier way?
// Extract sentences from Plain Text file
std::vector<std::string> get_file_sntncs(std::fstream& file) {
// The sentences will be stored in a vector of strings, strvec:
std::vector<std::string> strvec;
// Print out error if the file could not be found:
if(file.fail()) {
std::cout << "Could not find the file. :( " << std::endl;
// Otherwise, proceed to add the sentences to strvec.
} else {
char curchar;
std::string cursentence;
/* While we haven't reached the end of the file, add the current character to the
string representing the current sentence. If that current character is a period,
then we know we've reached the end of a sentence if the next character is a space or
if there is no next character; we then must add the current sentence to strvec. */
while (file >> std::noskipws >> curchar) {
cursentence.push_back(curchar);
if (curchar == '.') {
if (file >> std::noskipws >> curchar) {
if (curchar == ' ') {
strvec.push_back(cursentence);
cursentence.clear();
} else {
cursentence.push_back(curchar);
}
} else {
strvec.push_back(cursentence);
cursentence.clear();
}
}
}
}
return strvec;
}
Given your request to detect sentence boundaries by punctuation, whitespace, and certain combinations of them, using a regular expression seems to be a good solution. You can use a regular expression to describe the possible sequences of characters that indicate sentence boundaries, e.g.
[.!?]\s+
which means: "one of dot, exclamation mark, or question mark, followed by one or more whitespace characters".
One particularly convenient way of using regular expressions in C++ is the regex implementation included in the Boost library. Here is an example of how it would work in your case:
#include <string>
#include <vector>
#include <iostream>
#include <iterator>
#include <boost/regex.hpp>
int main()
{
/* Input. */
std::string input = "Here is a short sentence. Here is another one. And we say \"this is the final one.\", which is another example.";
/* Define sentence boundaries. */
boost::regex re("(?: [\\.\\!\\?]\\s+" // case 1: punctuation followed by whitespace
"| \\.\\\",?\\s+" // case 2: start of quotation
"| \\s+\\\")", // case 3: end of quotation
boost::regex::perl | boost::regex::mod_x);
/* Iterate through sentences. */
boost::sregex_token_iterator it(begin(input),end(input),re,-1);
boost::sregex_token_iterator endit;
/* Copy them onto a vector. */
std::vector<std::string> vec;
std::copy(it,endit,std::back_inserter(vec));
/* Output the vector, so we can check. */
std::copy(begin(vec),end(vec),
std::ostream_iterator<std::string>(std::cout,"\n"));
return 0;
}
Notice I used the boost::regex::perl and boost::regex::mod_x options to construct the regex matcher. This allowed me to use extra whitespace inside the regex to make it more readable.
Also note that certain characters, such as . (dot), ! (exclamation mark) and others, need to be escaped (i.e. you need to put \\ in front of them), because they would otherwise be metacharacters with special meanings.
When compiling/linking the code above, you need to link it with the boost-regex library. Using GCC the command looks something like:
g++ -W -Wall -std=c++11 -o test test.cpp -lboost_regex
(assuming your program is stored in a file called test.cpp).
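If Boost is not an option, the same splitting approach also works with the standard <regex> header from C++11 onward. Below is a minimal sketch; note that std::regex has no equivalent of mod_x, so the pattern is written compactly:
#include <string>
#include <vector>
#include <iostream>
#include <regex>

int main()
{
    std::string input = "Here is a short sentence. Here is another one. "
                        "And we say \"this is the final one.\", which is another example.";

    /* Same three boundary cases as the Boost version, written without free-spacing. */
    std::regex re(R"([.!?]\s+|\.",?\s+|\s+")");

    std::sregex_token_iterator it(input.begin(), input.end(), re, -1), endit;
    std::vector<std::string> vec(it, endit);

    for (const auto& s : vec)
        std::cout << s << "\n";
    return 0;
}
This builds with the same g++ command as above, just without the -lboost_regex flag.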

Why is this simple grammar rule in Bison not working?

I am learning Flex & Bison, and I am stuck: I cannot figure out why such a simple grammar rule does not work as I expected. Below is the lexer code:
%{
#include <stdio.h>
#include "zparser.tab.h"
%}
%%
[\t\n ]+ //ignore white space
FROM|from { return FROM; }
select|SELECT { return SELECT; }
update|UPDATE { return UPDATE; }
insert|INSERT { return INSERT; }
delete|DELETE { return DELETE; }
[a-zA-Z].* { return IDENTIFIER; }
\* { return STAR; }
%%
And below is the parser code:
%{
#include<stdio.h>
#include<iostream>
#include<vector>
#include<string>
using namespace std;
extern int yyerror(const char* str);
extern int yylex();
%}
%token SELECT UPDATE INSERT DELETE STAR IDENTIFIER FROM
%%
ZQL : SELECT STAR FROM IDENTIFIER { cout<<"Done"<<endl; return 0;}
;
%%
Can anyone tell me why it shows an error if I try to put "select * from something"?
[a-zA-Z].* will match an alphabetic character followed by any number of arbitrary characters except newline. In other words, it will match from an alphabetic character to the end of the line.
Since flex always accepts the longest match, the line select * from ... will appear to have only one token, IDENTIFIER, and that is a syntax error.
[a-zA-Z].* { return IDENTIFIER; }
The problem is here. It allows any junk to follow an initial alpha character and be returned as IDENTIFIER, including, in this case, the entire rest of the line after the initial 's'.
It should be:
[a-zA-Z]+ { return IDENTIFIER; }
or possibly
[a-zA-Z][a-zA-Z0-9]* { return IDENTIFIER; }
or whatever else you want to allow to follow an initial alpha character in your identifiers.
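As a side note, the parser snippet in the question only declares yyerror and yylex, so you still need a small driver somewhere to try the corrected rule. A possible sketch, assuming everything is compiled as C++ with g++ and that %option noyywrap is added to the lexer so no flex support library is required:
#include <cstdio>

extern int yyparse();

int yyerror(const char* str)
{
    std::fprintf(stderr, "parse error: %s\n", str);
    return 1;
}

int main()
{
    return yyparse();   /* type: select * from something */
}
A build along the lines of bison -d zparser.y && flex zlexer.l && g++ zparser.tab.c lex.yy.c driver.cpp -o zql should then let you test the input (the file names here are assumptions).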

Why can't the regular expressions "//" and "/*" match single-line and block comments?

I want to count the "empty line", "single comment", and "block comment" lines in a C++ program.
I wrote the tool using Flex, but it can't match C++ block comments.
1 flex code:
%{
int block_flag = 0;
int empty_num = 0;
int single_line_num = 0;
int block_line_num = 0;
int line = 0;
%}
%%
^[\t ]*\n {
empty_num++;
printf("empty line\n");
}
"//" {
single_line_num++;
printf("single line comment\n");
}
"/*" {
block_flag = 1;
block_line_num++;
printf("block comment begin.block line:%d\n", block_line_num);
}
"*/" {
block_flag = 0;
printf("block comment end.block line:%d\n", block_line_num);
}
^(.*)\n {
if(block_flag)
block_line_num++;
else
line++;
}
%%
int main(int argc , char *argv[])
{
yyin = fopen(argv[1], "r");
yylex();
printf("lines :%d\n" ,line);
fclose(yyin);
return 0;
}
2 hello.c
bbg@ubuntu:~$ cat hello.c
#include <stdlib.h>
//
//
/*
*/
/* */
3 output
bbg@ubuntu:~$ ./a.out hello.c
empty line
empty line
lines :6
Why the "//" and "/*" can't match the single comment and block comment ?
Flex:
doesn't search. It matches patterns sequentially, each one starting where the previous one ended.
always picks the pattern with the longest match. (If two or more patterns match exactly the same amount, it picks the first one.)
So, you have
"//" { /* Do something */ }
and
^.*\n { /* Do something else */ }
Suppose it has just matched the second one, so we're at the beginning of a line, and suppose the line starts //. Now, both these patterns match, but the second one matches the whole line, whereas the first one only matches two characters. So the second one wins. That wasn't what you wanted.
Hint 1: You probably want // comments to match to the end of the line
Hint 2: There is a regular expression which will match /* comments, although it's a bit tedious: "/*"[^*]*"*"+([^*/][^*]*"*"+)*"/" Unfortunately, if you use that, it won't count line ends for you, but you should be able to adapt it to do what you want (see the sketch after these hints).
Hint 3: You might want to think about comments which start in the middle of a line, possibly having been indented. Your rule ^.*\n will swallow an entire line without even looking to see if there is a comment somewhere inside it.
Hint 4: String literals hide comments.
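Putting the first three hints together, a sketch along the following lines might do the counting. The catch-all rules and the way lines are tallied are my own assumptions, and Hint 4 (string literals hiding comment markers) is still not handled:
%{
#include <stdio.h>
int empty_num = 0;
int single_line_num = 0;
int block_line_num = 0;
int line = 0;
%}
%option noyywrap
%%
^[\t ]*\n                            { empty_num++; }
"//".*                               { single_line_num++; }
"/*"[^*]*"*"+([^*/][^*]*"*"+)*"/"    {
        /* Count how many source lines this block comment spans. */
        int i, span = 1;
        for (i = 0; yytext[i] != '\0'; i++)
            if (yytext[i] == '\n')
                span++;
        block_line_num += span;
    }
\n                                   { line++; }
.                                    { /* anything else: ignore */ }
%%
int main(int argc, char *argv[])
{
    yyin = fopen(argv[1], "r");
    yylex();
    printf("empty:%d single:%d block lines:%d other lines:%d\n",
           empty_num, single_line_num, block_line_num, line);
    fclose(yyin);
    return 0;
}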