lex (flex) generated program not parsing whole input - regex

I have a relatively simple lex/flex file and have been running it with flex's debug flag to make sure it's tokenizing properly. Unfortunately, I keep running into one of two problems: either the program that flex generates silently gives up after a couple of tokens, or the rule I'm using to recognize characters and strings isn't called and the default rule is called instead.
Can someone point me in the right direction? I've attached my flex file and sample input / output.
Edit: I've found that the generated lexer stops after a specific rule: "cdr". This is more detailed, but also much more confusing. I've posted a shortened, modified lex file.
/* lex file*/
%option noyywrap
%option nodefault
%{
enum tokens{
CDR,
CHARACTER,
SET
};
%}
%%
"cdr" { return CDR; }
"set" { return SET; }
[ \t\r\n] /*Nothing*/
[a-zA-Z0-9\\!@#$%^&*()\-_+=~`:;"'?<>,\.] { return CHARACTER; }
%%
Sample input:
set c cdra + cdr b + () ;
Complete output from running the input through the generated parser:
--(end of buffer or a NUL)
--accepting rule at line 16 ("set")
--accepting rule at line 18 (" ")
--accepting rule at line 19 ("c")
--accepting rule at line 18 (" ")
--accepting rule at line 15 ("cdr")
Any thoughts? The generated program is giving up after half of the input! (for reference, I'm doing input by redirecting the contents of a file to the generated program).

When generating a standalone lexer (that is, not one whose tokens are defined in bison/yacc), you typically write an enum at the top of the file defining your tokens. However, the main loop of a lex program, including the main loop generated by default, looks something like this:
while( token = yylex() ){
...
This is fine until your lexer matches the rule that returns the first enumerator, in this specific case CDR. Since enums start at zero by default, yylex() then returns 0, which the loop treats as end of input, so the while loop ends. Renumbering your enum will solve the issue.
enum tokens{
CDR = 1,
CHARACTER,
SET
};
Short version: when defining tokens by hand for a lexer, start with 1 not 0.
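The effect can be reproduced without flex. The sketch below is only an illustration: fake_yylex and count_tokens_seen are hypothetical stand-ins for the generated scanner and its driver loop, and the Z_/O_ prefixes exist only so both numberings can live in one file.

```c
#include <stddef.h>

/* Numbering as in the question: the first enumerator is implicitly 0. */
enum tokens_zero_based { Z_CDR, Z_CHARACTER, Z_SET };

/* Fixed numbering: start at 1 so that 0 stays reserved for end of input. */
enum tokens_one_based { O_CDR = 1, O_CHARACTER, O_SET };

/* Hypothetical stand-in for yylex(): replays a fixed token array and
   returns 0 once the array is exhausted, as a flex scanner does at EOF. */
struct fake_lexer {
    const int *tokens;
    size_t count;
    size_t pos;
};

static int fake_yylex(struct fake_lexer *lx) {
    return lx->pos < lx->count ? lx->tokens[lx->pos++] : 0;
}

/* Same shape as the driver loop: keep going until the scanner returns 0.
   Returns how many tokens the loop body would have seen. */
static size_t count_tokens_seen(const int *tokens, size_t count) {
    struct fake_lexer lx = { tokens, count, 0 };
    size_t seen = 0;
    while (fake_yylex(&lx) != 0)
        seen++;
    return seen;
}
```

Feeding it the token sequence for "set c cdr b" stops after two tokens with the zero-based enum, because CDR is indistinguishable from end of input, and sees all four with the one-based enum.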

This rule
[-+]?([0-9*\.?[0-9]+|[0-9]+\.)([Ee][-+]?[0-9]+)?
          |
seems to be missing a closing bracket just after the first 0-9; I added a | below where I think it should be. I couldn't begin to guess how flex would respond to that.
The rule I usually use for symbol names starts with [a-zA-Z$_]. This is like your unquoted strings,
except that I usually allow numbers inside symbols as long as the symbol doesn't start with a number:
[a-zA-Z$_]([a-zA-Z$_]|[0-9])*
A character is just a short symbol. I don't think it needs to have its own rule, but if it does, then you need to ensure that the string rule requires at least 2 characters:
[a-zA-Z$_]([a-zA-Z$_]|[0-9])+
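To check by hand which inputs those two patterns accept, a small matcher in plain C can help. This is a sketch, not what flex generates; the function names are made up for illustration, and it assumes the default "C" locale so that isalpha() matches exactly a-z and A-Z.

```c
#include <ctype.h>
#include <stddef.h>

/* First character of a symbol: [a-zA-Z$_] (assuming the "C" locale). */
static int is_symbol_start(unsigned char c) {
    return isalpha(c) || c == '$' || c == '_';
}

/* Any later character: [a-zA-Z$_] or [0-9]. */
static int is_symbol_part(unsigned char c) {
    return is_symbol_start(c) || isdigit(c);
}

/* Returns the length of s if the whole string matches
   [a-zA-Z$_]([a-zA-Z$_]|[0-9])*, otherwise 0. */
static size_t symbol_length(const char *s) {
    size_t n;
    if (!is_symbol_start((unsigned char)s[0]))
        return 0;
    for (n = 1; s[n] != '\0'; n++)
        if (!is_symbol_part((unsigned char)s[n]))
            return 0;
    return n;
}

/* The "+" variant: at least two characters, so that a one-character
   match is left to a separate single-character rule. */
static int is_multichar_symbol(const char *s) {
    return symbol_length(s) >= 2;
}
```

With these, "foo1" and "$x_2" count as symbols, "1foo" does not, and "c" is a symbol but not a multi-character one.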

Related

How to do something else in Bison after Flex returns 0?

The Bison yyparse() function stops reading its input (file or stream) when 0 is returned.
I was wondering if there is a way to execute some more commands after that occurs.
I mean, is it possible to handle 0 (or some token produced by its return) in the bison file?
Something like:
Flex
<<EOF>> { return 0; }
Bison
%token start
start : start '0' {
// Desired something else
}
Suppose program is the top-level symbol in a grammar. That is, it is the non-terminal which must be matched by the parser's input.
Of course, it's possible that program will also be matched several times before the input terminates. For example, the grammar might look something like:
%start program
%%
program: %empty
| program declaration
In that grammar, there is no way to inject an action which is only executed when the input is fully parsed. I gather that is what you want to do.
But it's very simple to create a non-terminal whose action will only be executed once, at the end of the parse. All we need to do is insert a new "unit production" at the top of the grammar:
%start start
%%
start : program { /* Completion action */ }
program: %empty
| program declaration
Since start does not appear on the right-hand side of any production in the grammar, it can only be reduced at the end of the parse, when the parser reduces the %start symbol. So even though the production does not explicitly include the end token, we know that the end token is the lookahead token when the reduction action is executed.
Unit productions -- productions whose right-hand side contains just one symbol -- are frequently used to trigger actions at strategic points of the parse, and the above is just one example of the technique.

How to do proper error handling in BNFC? (C++, Flex, Bison)

I'm making a compiler in BNFC and it's got to a stage where it already compiles some stuff and the code works on my device. But before shipping it, I want my compiler to return proper error messages when the user tries to compile an invalid program.
I found out how bison can write errors to the stderr stream, and I'm able to catch those. Now suppose the user's code has no syntax error; it just references an undefined variable. I'm able to catch this in my visitor, but I don't know what the line number was. How can I find the line number?
In bison you can access the starting and ending position of the current expression using the variable @$, which contains a struct with the members first_column, first_line, last_column and last_line. Similarly @1 etc. contain the same information for the sub-expressions $1 etc. respectively.
In order to have access to the same information later, you need to write it into your ast. So add a field to your AST node types to store the location and then set that field when creating the node in your bison file.
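A minimal sketch of that idea in plain C follows. The names src_loc, var_node, make_var_node and format_undefined_variable are hypothetical, not BNFC's or Bison's API; src_loc simply mirrors the four members listed above, and the grammar action is assumed to copy them from the location of the relevant symbol.

```c
#include <stdio.h>

/* Hypothetical copy of the four location members listed above. */
struct src_loc {
    int first_line, first_column;
    int last_line, last_column;
};

/* Hypothetical AST node for a variable reference; loc is the added field. */
struct var_node {
    const char *name;
    struct src_loc loc;
};

/* Would be called from a grammar action, with loc filled in from the
   location of the matched symbol. */
static struct var_node make_var_node(const char *name, struct src_loc loc) {
    struct var_node node = { name, loc };
    return node;
}

/* A later pass (for example the visitor) can then report where the
   offending reference started. Returns what snprintf returns. */
static int format_undefined_variable(char *buf, size_t size,
                                     const struct var_node *node) {
    return snprintf(buf, size, "%d:%d: undefined variable '%s'",
                    node->loc.first_line, node->loc.first_column,
                    node->name);
}
```

The point of the design is that the location travels with the node, so the semantic check no longer depends on parser state that is gone by the time the visitor runs.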
(The previous answer is richer.) But in some simple parsers, if we declare
%option yylineno
in flex and print it in yyerror,
void yyerror(const char *s) {
fprintf(stderr,"ERROR (line %d):before '%s'\n-%s",yylineno, yytext,s);
}
it sometimes helps.

Bison : Line number included in the error messages

OK, so I suppose my question is quite self-explanatory.
I'm currently building a parser in Bison, and I want to make error reporting somewhat better.
Currently, I've set %define parse.error verbose (which actually gives messages like syntax error, unexpected ***********************, expecting ********************).
All I want is to add some more information in the error messages, e.g. line number (in input/file/etc)
My current yyerror (well nothing... unusual... lol) :
void yyerror(const char *str)
{
fprintf(stderr,"\x1B[35mInterpreter : \x1B[37m%s\n",str);
}
P.S.
I've gone through the latest Bison documentation, but I seem quite lost...
I've also had a look into the %locations directive, which most likely is very close to what I need - however, I still found no complete working example and I'm not sure how this is to be used.
So, here I am with a step-by-step solution:
We add the %locations directive in our grammar file (between %} and the first %%)
We make sure that our lexer file contains an include for our parser (e.g. #include "mygrammar.tab.h"), at the top
We add the %option yylineno option in our lexer file (between %} and the first %%)
And now, in our yyerror function (which will supposedly be in our lexer file), we may freely use this... yylineno (= current line in file being processed) :
void yyerror(const char *str)
{
fprintf(stderr,"Error | Line: %d\n%s\n",yylineno,str);
}
Yep. Simple as that! :-)
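The steps above only report a line number. If columns are wanted as well, the scanner has to keep a location value up to date for every token it matches. The sketch below shows that bookkeeping in plain C under stated assumptions: src_span and span_advance are hypothetical names, the struct only mimics the four members of Bison's default location type, and in a real flex scanner such a function would typically be called on yylloc and yytext for each match (for example from the YY_USER_ACTION macro).

```c
/* Hypothetical stand-in for the location value that %locations provides. */
struct src_span {
    int first_line, first_column;
    int last_line, last_column;
};

/* Move the span past one matched token: the new span starts where the
   previous one ended, and each newline resets the column to 1.
   last_column ends up one past the token's final character. */
static void span_advance(struct src_span *span, const char *text) {
    span->first_line = span->last_line;
    span->first_column = span->last_column;
    for (; *text != '\0'; text++) {
        if (*text == '\n') {
            span->last_line++;
            span->last_column = 1;
        } else {
            span->last_column++;
        }
    }
}
```

Starting from line 1, column 1 and advancing over "123", " " and "456" leaves the span at columns 5 to 8 (end exclusive) on line 1.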
What worked for me was adding extern int yylineno in the .ypp file:
/* parser.ypp */
%{
extern int yylineno;
%}
/* scanner.lex */
...
%option yylineno
Bison ships with a number of examples to demonstrate its features; see /usr/local/share/doc/bison/examples on your machine (where the prefix /usr/local depends on your configuration).
These examples in particular might be of interest to you:
lexcalc uses precedence directives and location tracking. It uses Flex to generate the scanner.
bistromathic demonstrates best practices when using Bison.
Its hand-written scanner tracks locations.
Its interface is pure.
It uses %params to pass user information to the parser and scanner.
Its scanner uses the error token to signal lexical errors and enter error recovery.
Its interface is "incremental", well suited for interaction: it uses the push-parser API to feed the parser with the incoming tokens.
It features an interactive command line with completion based on the parser state, based on yyexpected_tokens.
It uses Bison's standard catalog for internationalization of generated messages.
It uses a custom syntax error with location, lookahead correction and token internationalization.
Error messages quote the source with squiggles that underline the error:
> 123 456
1.5-7: syntax error: expected end of file or + or - or * or / or ^ before number
    1 | 123 456
      |     ^~~
It supports debug traces with semantic values.
It uses named references instead of the traditional $1, $2, etc.

C++ system() not working when there are spaces in two different parameters

I'm trying to run a .exe that requires some parameters by using system().
If there's a space in the .exe's path AND in the path of a file passed in parameters, I get the following error:
The filename, directory name, or volume label syntax is incorrect.
Here is the code that generates that error:
#include <stdlib.h>
#include <conio.h>
int main (){
system("\"C:\\Users\\Adam\\Desktop\\pdftotext\" -layout \"C:\\Users\\Adam\\Desktop\\week 4.pdf\"");
_getch();
}
If the "pdftotext"'s path doesn't use quotation marks (I need them because sometimes the directory will have spaces), everything works fine. Also, if I put what's in "system()" in a string and output it and I copy it in an actual command window, it works.
I thought that maybe I could chain some commands using something like this:
cd C:\Users\Adam\Desktop;
pdftotext -layout "week 4.pdf"
So I would already be in the correct directory, but I don't know how to use multiple commands in the same system() function.
Can anyone tell me why my command doesn't work or if the second way I thought about would work?
Edit: Looks like I needed an extra set of quotation marks because system() passes its arguments to cmd /k, so it needs to be in quotations. I found it here:
C++: How to make a my program open a .exe with optional args
so I'll vote to close as duplicate since the questions are pretty close even though we weren't getting the same error message, thanks!
system() runs the command as cmd /C command. Here is the relevant passage from the cmd documentation:
If /C or /K is specified, then the remainder of the command line after
the switch is processed as a command line, where the following logic is
used to process quote (") characters:
1. If all of the following conditions are met, then quote characters
on the command line are preserved:
- no /S switch
- exactly two quote characters
- no special characters between the two quote characters,
where special is one of: &<>()@^|
- there are one or more whitespace characters between the
two quote characters
- the string between the two quote characters is the name
of an executable file.
2. Otherwise, old behavior is to see if the first character is
a quote character and if so, strip the leading character and
remove the last quote character on the command line, preserving
any text after the last quote character.
It seems that you are hitting case 2, and cmd thinks that the whole string C:\Users\Adam\Desktop\pdftotext" -layout "C:\Users\Adam\Desktop\week 4.pdf (i.e. without the first and the last quote) is the name of executable.
So the solution would be to wrap the whole command in extra quotes:
//system("\"D:\\test\" nospaces \"text with spaces\"");//gives same error as you're getting
system("\"\"D:\\test\" nospaces \"text with spaces\"\""); //ok, works
And this is very weird. I think it's also a good idea to add /S to make sure the string is always parsed according to case 2:
system("cmd /S /C \"\"D:\\test\" nospaces \"text with spaces\"\""); //also works
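To see why the extra pair of quotes helps, the stripping step of case 2 can be modeled in a few lines of C. This is a simplified model of the documented rule, not cmd's actual implementation, and strip_outer_quotes is a hypothetical helper name.

```c
#include <string.h>

/* Simplified model of case 2: if the command line starts with a quote,
   drop that quote and the last quote on the line, keeping any text after
   the last quote. Returns 0 on success, -1 if out is too small. */
static int strip_outer_quotes(const char *cmdline, char *out, size_t out_size) {
    size_t len = strlen(cmdline);
    const char *last;
    size_t head;

    if (len + 1 > out_size)
        return -1;
    if (cmdline[0] != '"') {            /* nothing to strip */
        memcpy(out, cmdline, len + 1);
        return 0;
    }
    cmdline++;                          /* drop the leading quote */
    last = strrchr(cmdline, '"');       /* last quote on the line, if any */
    if (last == NULL) {
        strcpy(out, cmdline);
        return 0;
    }
    head = (size_t)(last - cmdline);
    memcpy(out, cmdline, head);         /* text before the last quote */
    strcpy(out + head, last + 1);       /* text after it, plus terminator */
    return 0;
}
```

Applied to a line with only the two quoted paths, the result begins with an unquoted path and contains stray quotes in the middle, which matches the error in the question. Applied to the double-wrapped version, the result is the properly quoted command.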
I got here looking for an answer, and this is the code that I came up with (and I was this explicit for the benefit of next person maintaining my code):
std::stringstream ss;
std::string pathOfCommand;
std::string pathOfInputFile;
// some code to set values for paths
ss << "\""; // command opening quote
ss << "\"" << pathOfCommand << "\" "; // Quoted binary (could have spaces)
ss << "\"" << pathOfInputFile << "\""; // Quoted input (could have spaces)
ss << "\""; // command closing quote
system( ss.str().c_str() ); // Execute the command
and it solved all of my problems.
Good learning here on the internals of the system() call. The same issue is reproducible (of course) with C++ strings, TCHARs, etc.
One approach that has always helped me is calling SetCurrentDirectory(). I first set the current directory and then execute the command. This has worked for me so far. Any comments welcome.
-Sreejith. D. Menon

How to detect eof in ml-lex

While writing code in ml-lex
we need to write the eof function
val eof = fn () => EOF;
Is this a necessary part to write?
Also, if I want my lexer to stop when it detects an eof, what should I add to the given function?
Thanks.
The User’s Guide to ML-Lex and ML-Yacc by Roger Price is great for learning ml-lex and ml-yacc.
The eof function is mandatory in the user declarations part of your lex definition, together with the lexresult type:
The function eof is called by the lexer when the end of the input
stream is reached.
Your eof function can either throw an exception, if that is appropriate for your application, or return the EOF token. Either way, it has to return something of type lexresult. There is an example in chapter 7.1.2 of the user guide which prints a string if EOF was in the middle of a block comment.
I use a somewhat "simpler" eof function
structure T = Tokens
structure C = SourceData.Comments
fun eof data =
if C.depth data = 0 then
T.EOF (~1, ~1)
else
fail (C.start data) "Unclosed comment"
where the C structure is a "special" comment handling structure that counts the number of opening and closing comments. If the current depth is 0 then it returns the EOF token, where (~1, ~1) are used to indicate the left and right position. As I don't use this position information for EOF, I just set it to (~1, ~1).
Normally you would then set the %eop (end of parse) to use the EOF token in the yacc file, to indicate that whatever start symbol is used, it may be followed by the EOF token. Also remember to add EOF to %noshift. See section 9.4.5 for %eop and %noshift.
Obviously you have to define EOF in the %term declaration of your yacc file as well.
Hope this helps; otherwise, take a look at an MLB parser or an SML parser written in ml-lex and ml-yacc. The MLB parser is the simplest and thus might be easier to understand.