How to use flex with my own parser?

I want to leave the lexical analysis to lex but develop the parser on my own.
I made a token.h header which has the enums for token types and a simple class hierarchy,
For the lex rule:
[0-9]+ {yylval = new NumToken(std::stoi(yytext));return NUM;}
How do I get the NumToken pointer from the parser code?
Suppose I just want to print out the tokens..
while(true)
{
auto t = yylex();
//std::cout <<yylval.data<<std::endl; // What goes here ?
}
I can do this with yacc/bison, but can not find any documentation or example about how to do this manually.

In a traditional bison/flex parser, yylval is a global variable defined in the parser generated by bison, and declared in the header file generated by bison (which should be #include'd into the generated scanner). So a simple solution would be just to replicate that: declare yylval (as a global) in token.h and define it somewhere in your parser.
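As a rough sketch of that global-variable approach: the declarations below are what token.h might contain, and the hand-written yylex is only a stand-in for the flex-generated scanner so that the example compiles on its own. The names Token, END, scan_pos and read_numbers are assumptions made for illustration; only NumToken, NUM and yylval come from the question.

```cpp
#include <cctype>
#include <memory>
#include <vector>

// What token.h might declare (illustrative names, not from the question).
enum TokenType { END = 0, NUM = 258 };

struct Token {
    virtual ~Token() = default;
};

struct NumToken : Token {
    int data;
    explicit NumToken(int value) : data(value) {}
};

typedef Token* YYSTYPE;
extern YYSTYPE yylval;  // declaration, visible to both scanner and parser

// The single definition, which would live in the parser's source file.
YYSTYPE yylval = nullptr;

// Stand-in for the flex-generated yylex(): it only recognises
// space-separated decimal numbers and returns END on anything else.
static const char* scan_pos = "";

int yylex() {
    while (*scan_pos == ' ') ++scan_pos;
    if (!std::isdigit(static_cast<unsigned char>(*scan_pos))) return END;
    int value = 0;
    while (std::isdigit(static_cast<unsigned char>(*scan_pos)))
        value = value * 10 + (*scan_pos++ - '0');
    yylval = new NumToken(value);  // same effect as the flex action in the question
    return NUM;
}

// Hand-written "parser": pulls tokens until END and collects the numbers.
std::vector<int> read_numbers(const char* text) {
    scan_pos = text;
    std::vector<int> numbers;
    for (int type = yylex(); type != END; type = yylex()) {
        std::unique_ptr<Token> owner(yylval);  // the caller owns the new token
        if (type == NUM)
            numbers.push_back(static_cast<NumToken*>(owner.get())->data);
    }
    return numbers;
}
```

In the loop from the question, the commented-out line would then be something like std::cout << static_cast<NumToken*>(yylval)->data, guarded by a check that the token type is NUM.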
But modern programming style has shifted away from the use of globals (for good reason), and indeed even flex will generate scanners which do not depend on global state, if requested. To request such a scanner, specify
%option reentrant
in your scanner definition. By default, this changes the prototype of yylex to:
int yylex(yyscan_t yyscanner);
where yyscan_t is an opaque pointer. (This is C, so that means it's a void*.) You can read about the details in the Flex manual; the most important takeaway is that you can ask flex to also generate a header file (with %option header-file), so that other translation units can refer to the various functions for creating, destroying and manipulating a yyscan_t, and that you need to minimally create one so that yylex has somewhere to store its state. (Ideally, you would also destroy it.) [Note 1].
The expected way to use a reentrant scanner from bison is to enable %option bison-bridge (and %option bison-locations if your lexer generates source location information for each token). This will add an additional parameter to the yylex prototype:
int yylex(YYSTYPE *yylval_param, yyscan_t scanner);
With %option bison-locations, two parameters are added:
int yylex(YYSTYPE *yylval_param,
YYLTYPE *yylloc_param,
yyscan_t scanner);
The semantic type YYSTYPE and the location type YYLTYPE are not declared by the flex-generated code. They must appear in the token.h header you #include into your scanner.
The intention of the bison-bridge parameters is to provide a mechanism to return the semantic value yylval to the caller (i.e. the parser). Since yylval is effectively the same as the parameter yylval_param [Note 2], it will be a pointer to the actual semantic value, so you need to write (for example) yylval->data = ... in your flex actions.
So that's one way to do it.
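Here is a minimal sketch of what the calling side might look like with a reentrant, bison-bridge style prototype. Everything marked as a stand-in below (the ScannerState struct, the bodies of yylex_init, yylex_destroy and yylex, and the hypothetical set_input helper) is written by hand so the example is self-contained; in a real project those functions come from the flex-generated code, and input would typically be supplied with something like yy_scan_string.

```cpp
#include <cctype>
#include <vector>

enum TokenType { END = 0, NUM = 258 };

// Would be declared in token.h; flex does not declare it for you.
struct YYSTYPE {
    int data;
};

// ---- Stand-ins for what the flex-generated header would provide ----
typedef void* yyscan_t;

struct ScannerState {  // hypothetical: flex's real state is opaque
    const char* pos;
};

int yylex_init(yyscan_t* scanner) {
    *scanner = new ScannerState{""};
    return 0;
}

int yylex_destroy(yyscan_t scanner) {
    delete static_cast<ScannerState*>(scanner);
    return 0;
}

// Hypothetical helper for this sketch only.
void set_input(yyscan_t scanner, const char* text) {
    static_cast<ScannerState*>(scanner)->pos = text;
}

// Same prototype that %option reentrant bison-bridge produces.
int yylex(YYSTYPE* yylval_param, yyscan_t scanner) {
    ScannerState* state = static_cast<ScannerState*>(scanner);
    while (*state->pos == ' ') ++state->pos;
    if (!std::isdigit(static_cast<unsigned char>(*state->pos))) return END;
    int value = 0;
    while (std::isdigit(static_cast<unsigned char>(*state->pos)))
        value = value * 10 + (*state->pos++ - '0');
    yylval_param->data = value;  // what `yylval->data = ...` does in an action
    return NUM;
}
// ---- End of stand-ins ----

// The hand-written parser side: create a scanner, pull tokens, destroy it.
std::vector<int> read_numbers(const char* text) {
    yyscan_t scanner;
    if (yylex_init(&scanner) != 0) return {};
    set_input(scanner, text);

    std::vector<int> numbers;
    YYSTYPE value;  // the semantic value lives in the caller, not in a global
    while (yylex(&value, scanner) == NUM) numbers.push_back(value.data);

    yylex_destroy(scanner);
    return numbers;
}
```

The point of the sketch is only the shape of the call: the caller owns the YYSTYPE object and passes its address on every call, so no global yylval is needed.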
A possibly simpler alternative to bison-bridge is just to provide your own yylex prototype, which you can do with the macro YY_DECL. For example, you could do something like this (if YYSTYPE were something simple):
#define YY_DECL std::pair<int, YYSTYPE> yylex(yyscan_t yyscanner)
Then a rule could just return the pair:
[0-9]+ {return std::make_pair(NUM, new NumToken(std::stoi(yytext)));}
Obviously, there are many variants on this theme.
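To show how a caller might consume such a pair, here is a small sketch. It assumes YYSTYPE is just an int, and the yylex body is a hand-written stand-in with the prototype the YY_DECL line above would produce. Using a pointer to a character cursor as the "scanner state" is purely an assumption of this example; END and sum_numbers are also illustrative names.

```cpp
#include <cctype>
#include <utility>

enum TokenType { END = 0, NUM = 258 };
typedef int YYSTYPE;     // "something simple", as assumed above
typedef void* yyscan_t;  // flex's opaque scanner handle

// Stand-in scanner: yyscanner points at a `const char*` cursor.
std::pair<int, YYSTYPE> yylex(yyscan_t yyscanner) {
    const char*& pos = *static_cast<const char**>(yyscanner);
    while (*pos == ' ') ++pos;
    if (!std::isdigit(static_cast<unsigned char>(*pos)))
        return std::make_pair(static_cast<int>(END), 0);
    int value = 0;
    while (std::isdigit(static_cast<unsigned char>(*pos)))
        value = value * 10 + (*pos++ - '0');
    return std::make_pair(static_cast<int>(NUM), value);
}

// Caller: token type and semantic value arrive together in one object.
int sum_numbers(const char* text) {
    const char* cursor = text;
    int total = 0;
    for (;;) {
        std::pair<int, YYSTYPE> token = yylex(&cursor);
        if (token.first == END) break;
        total += token.second;
    }
    return total;
}
```

One thing to watch for with a real flex scanner: the default end-of-file handling goes through the yyterminate() macro, which returns a plain integer, so with a non-integer return type you would probably need to redefine yyterminate() to return a pair as well.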
Notes
1. Unfortunately, the generated header includes quite a lot of unnecessary baggage, including a bunch of macro definitions for the standard "globals" which won't work because in a reentrant scanner these variables can only be used in a flex action.
2. The scanner generated with bison-bridge defines yylval as a macro which refers to a field in the opaque state structure, and stores yylval_param into this field. yyget_lval and yyset_lval functions are provided in order to get or set this field from outside of yylex. I don't know why; it seems somewhere between unnecessary and dangerous, since the state will contain the pointer to the value, as supplied in the call to yylex, which may well be a dangling pointer once the call returns.

Related

Change yylex in C++ Flex

I want to change yylex to alpha_yylex, that also takes in a vector as an argument.
.
.
#define YY_DECL int yyFlexLexer::alpha_yylex(std::vector<alpha_token_t> tokens)
%}
.
.
. in main()
std::vector<alpha_token_t> tokens;
while(lexer->alpha_yylex(tokens) != 0) ;
I think I know why this fails: obviously there is no alpha_yylex in FlexLexer.h, but I don't know how to achieve what I want...
How can I make my own alpha_yylex() or modify the existing one?

It's true that you cannot edit the definition of yyFlexLexer, since FlexLexer.h is effectively a system-wide header file. But you can certainly subclass it, which will provide most of what you need.
Subclassing yyFlexLexer
Flex allows you to use %option yyclass (or the --yyclass command-line option) to specify the name of a subclass, which will be used instead of yyFlexLexer to define yylex. Subclassing yyFlexLexer allows you to include your own header which defines your subclass' members and maybe even additional functions, as well as its constructors; in short, if your intention was simply to fill in a std::vector<alpha_token_t> with the successive tokens, you could easily do that by defining AlphaLexer as a subclass of yyFlexLexer, with an instance member called tokens (or, perhaps, with accessor functions).
You can also add additional member functions to your new class, which might provide what you need those additional arguments for.
The thing which is not quite so straight-forward, although it could easily be accomplished using the YY_DECL macro in the C interface, is to change the name and prototype of the scanning function generated by flex. It can be done (see below) but it is not clear that it is actually supported. In any case, it is possibly less important in the case of C++.
Aside from a small wrinkle created by the curious organization of Flex's C++ classes [Note 1], subclassing the lexer class is simple. You need to derive your class from yyFlexLexer [Note 2], which is declared in FlexLexer.h, and you need to tell Flex what the name of your class is, either by using %option yyclass in your Flex file, or by specifying the name on the command line with --yyclass.
yyFlexLexer includes the various methods for manipulating input buffers, as well as all the mutable state for the lexical scanner used by the standard skeleton. (Much of this is actually derived from the base class FlexLexer.) It also includes a virtual yylex method with prototype
virtual int yylex();
When you subclass yyFlexLexer, yyFlexLexer::yylex() is defined to signal an error by calling yyFlexLexer::LexerError(const char*) and the generated scanner is defined as the override in the class defined as yyclass. (If you don't subclass, the generated scanner is yyFlexLexer::yylex().)
The one wrinkle is the way you need to declare your subclass. Normally, you would do that in a header file like this:
File: myscanner.h (Don't use this version)
#pragma once
// DON'T DO THIS; IT WON'T WORK (flex 2.6)
#include <FlexLexer.h>
class MyScanner : public yyFlexLexer {
// whatever
};
You would then #include "myscanner.h" in any file which needed to use the scanner, including the generated scanner itself.
Unfortunately, that won't work because it will result in FlexLexer.h being included twice in the generated scanner; FlexLexer.h does not have an include guard in the normal sense of the word because it is designed to be included multiple times in order to support the prefix option. So you need to define two header files:
File: myscanner-internal.h
#pragma once
// This file depends on FlexLexer.h having already been included
// in the translation unit. Don't use it other than in the scanner
// definition.
class MyScanner : public yyFlexLexer {
// whatever
};
File: myscanner.h
#pragma once
#include <FlexLexer.h>
#include "myscanner-internal.h"
Then you use #include "myscanner.h" in every file which needs to know about the scanner except the scanner definition itself. In your myscanner.ll file, you will #include "myscanner-internal.h", which works because Flex has already included FlexLexer.h before it inserts the prologue C++ code from your scanner definition.
Changing the yylex prototype
You can't really change the prototype (or name) of yylex, because it is declared in FlexLexer.h and, as mentioned above, defined to signal an error. You can, however, redefine YY_DECL to create a new scanner interface. To do so, you must first #undef the existing YY_DECL definition, at least in your scanner definition, because a scanner with %option yyclass="MyScanner" contains #define YY_DECL int MyScanner::yylex(). That would make your myscanner-internal.h file look like this:
#pragma once
// This file depends on FlexLexer.h having already been included
// in the translation unit. Don't use it other than in the scanner
// definition.
#undef YY_DECL
#define YY_DECL int MyScanner::alpha_yylex(std::vector<alpha_token_t>& tokens)
#include <vector>
#include "alpha_token.h"
class MyScanner : public yyFlexLexer {
public:
int alpha_yylex(std::vector<alpha_token_t>& tokens);
// whatever else you need
};
The fact that the MyScanner object still has a (not very functional) yylex method might not be a problem. There are some undocumented interfaces in FlexLexer which call yylex(), but those don't matter if you don't use them. (They're not all that useful, anyway.) But you should at least be aware that the interface exists.
In any case, I don't see the point of renaming yylex (but perhaps you have a different aesthetic sense). It's already effectively namespaced by being a member of a specific class (MyScanner, above), so yylex doesn't really create any confusion.
In the particular case of the std::vector<alpha_token_t>& argument, it seems to me that a cleaner solution would be to put the reference as a member variable in the MyScanner class and set it with the constructor or with an accessor method. Unless you actually use different vectors at different points in the lexical analysis -- not evident in the example code in your question -- there's no point burdening every call site with the need to pass the address of the vector into the yylex call. Since lexer actions are compiled inside yylex, which is a member function of MyScanner, instance variables -- even private instance variables -- are usable in the lexer actions. Of course, that's not the only use case for extra yylex arguments, but it's a pretty common one.
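A minimal sketch of that member-variable design follows. To keep it self-contained, StandInLexer is a hand-written substitute for yyFlexLexer (which really comes from FlexLexer.h), the yylex body is written by hand rather than generated by flex, and the layout of alpha_token_t is invented because the question does not show it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token layout; the question does not show alpha_token_t.
struct alpha_token_t {
    int type;
    std::string text;
};

// Stand-in for yyFlexLexer, reduced to the one virtual function that matters.
class StandInLexer {
public:
    virtual ~StandInLexer() {}
    virtual int yylex() = 0;
};

class MyScanner : public StandInLexer {
public:
    // The vector is bound once, at construction, instead of at every call.
    MyScanner(const std::string& input, std::vector<alpha_token_t>& tokens)
        : input_(input), pos_(0), tokens_(tokens) {}

    // With flex, this body would be generated from the rules; the actions
    // are compiled inside it, so they can use tokens_ directly.
    int yylex() override {
        while (pos_ < input_.size() && input_[pos_] == ' ') ++pos_;
        if (pos_ == input_.size()) return 0;  // end of input
        std::size_t start = pos_;
        while (pos_ < input_.size() && input_[pos_] != ' ') ++pos_;
        alpha_token_t token = {1, input_.substr(start, pos_ - start)};
        tokens_.push_back(token);
        return token.type;
    }

private:
    std::string input_;
    std::size_t pos_;
    std::vector<alpha_token_t>& tokens_;  // private, yet usable in "actions"
};

// Call sites stay as simple as the loop in the question.
std::vector<alpha_token_t> scan_all(const std::string& input) {
    std::vector<alpha_token_t> tokens;
    MyScanner lexer(input, tokens);
    while (lexer.yylex() != 0) {
    }
    return tokens;
}
```

Because the reference is a member, the scanning function keeps the standard int yylex() signature and no YY_DECL redefinition is needed.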
Notes
1. "The C++ interface is a mess," according to a comment in the generated code.
2. Using %option prefix, you can change yy to something else if you want to. This is a feature which is supposedly intended to allow you to include multiple lexical scanners in the same project. However, if you're planning on subclassing, the base classes for all these lexical scanners will be identical (other than their names). Thus, there is little or no point having different base classes. Renaming the scanner class using %option prefix is less flexible and no more efficient than subclassing, and it creates an additional header complication. (See this older answer for details.) So I'd recommend sticking with subclassing.

Can I define a macro in a header file?

I have a macro definition in MyClass.h, stated as such:
#define _BufferSize_ 64
I placed the include directive for MyClass.h inside of main.cpp:
#include "MyClass.h"
Does this mean I can use _BufferSize_ in both main.cpp and MyClass.h? Also, is this good practice?

Yes, it would work. (Disregarding the problem with underscores that others have pointed out.)
Directive #include "MyClass.h" just copies the whole content of file MyClass.h and pastes it in the place of the #include. From the point of view of the compiler there is only one source file composed of the file specified by the user and all included files.
Having said that, it would be much better if you use in-language construction instead of preprocessor directive.
For example replace:
#define _BufferSize_ 64
with
constexpr size_t BufferSize = 64;
The only thing it does differently than the #define is that it specifies the type of the value (size_t in this case). Beside that, the second code will behave the same way and it avoids disadvantages of preprocessor.
In general, try to avoid using preprocessor directives. This is an old mechanism that was used when C++ couldn't do those things in-language yet.
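A small sketch of how the typed constant might be used from the header. The members of MyClass are invented for the example, since the question does not show the class.

```cpp
#include <array>
#include <cstddef>

// Would live in MyClass.h. A namespace-scope constexpr variable has internal
// linkage, so including the header from several .cpp files is fine.
constexpr std::size_t BufferSize = 64;

class MyClass {
public:
    std::size_t capacity() const { return buffer_.size(); }

private:
    std::array<char, BufferSize> buffer_{};  // usable as a compile-time size
};

// Unlike a macro, the constant has a type and obeys scoping rules, and it
// still works in constant expressions:
static_assert(BufferSize == 64, "BufferSize is a compile-time constant");
```

Any file that includes the header can use BufferSize in the same way, exactly as it could with the macro.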

Yes, that is the purpose of header files: creating declarations and constants in one file that you can "include" into translation units whenever you like.
However, your macro name is illegal, and a nice constexpr size_t BufferSize = 64 would be more idiomatic nowadays; even before recent versions of C++, a typed constant would be preferable to a macro in many cases.

First, regarding the identifier _BufferSize_, the standard states that:
3. ...some identifiers are reserved for use by C++ implementations and shall not be used otherwise; no diagnostic is required.
(3.1) Each identifier that contains a double underscore __ or begins with an underscore followed by an uppercase letter is reserved to the implementation for any use.
So having such an identifier in your code would lead to undefined behavior.
And as already suggested in the comments, using macro variables is not good practice in C++. You can use a const int instead.

Replying 3 years later because the answers are wrong and this is the first Google search result for certain keywords.
https://google.github.io/styleguide/cppguide.html#Preprocessor_Macros
Avoid defining macros, especially in headers; prefer inline functions, enums, and const variables. Name macros with a project-specific prefix. Do not use macros to define pieces of a C++ API.
Highlight by me, not in original text.

Flex C++ - #ifdef inside flex block

I want to define a preprocessor constant so that some patterns are matched only when it is defined. Is this possible, or is there another way to deal with this problem?
For example, a simplified version of removing one-line comments in C:
%{
#define COMMENT
%}
%%
#ifdef COMMENT
[\/][\/].*$ ;
#endif
[1-9][0-9]* printf("It's a number, and it works with and without defining COMMENT");
%%

There is no great solution to this (very reasonable) request, but there are some possibilities.
(F)lex start conditions
Flex start conditions make it reasonably simple to define a few optional patterns, but they don't compose well. This solution will work best if you have only a single controlling variable, since you will have to define a separate start condition for every possible combination of controlling variables.
For example:
%s NO_COMMENTS
%%
<NO_COMMENTS>"//".* ; /* Ignore comments in NO_COMMENTS mode. */
The %s declaration means that all unmarked rules also apply to the NO_COMMENTS state; you will commonly see %x ("exclusive") in examples, but that would force you to explicitly mark almost every rule.
Once you have modified your grammar in this way, you can select the appropriate set of rules at run-time by setting the lexer's state with BEGIN(INITIAL) or BEGIN(NO_COMMENTS). (The BEGIN macro is only defined in the flex generated file, so you will want to export a function which performs one of these two actions.)
Using cpp as a utility.
There is no preprocessor feature in flex. It's possible that you could use a C preprocessor to preprocess your flex file before passing it to flex, but you will have to be very careful with your input file:
The C preprocessor expects its input to be a sequence of valid C preprocessor tokens. Many common flex patterns will not match this assumption, because of the very different quoting rules. (For a simple example, a common pattern to recognise C comments includes the character class [^/*] which will be interpreted by the C preprocessor as containing the start of a C comment.)
The flex input file is likely to have a number of lines which are valid #include directives. There is no way to avoid these directives from being expanded (other than removing them from the file). Once expanded and incorporated into the source, the header files no longer have include guards, so you will have to tell flex not to insert any #include files from its own templates. I believe that is possible, but it will be a bit fragile.
The C preprocessor may expand what looks to it like a macro invocation.
The C preprocessor might not preserve linear whitespace, altering the meaning of the flex scanner definition.
m4 and other preprocessors
It would be safer to use m4 as a preprocessor, but of course that means learning m4. (You shouldn't need to install it because flex already depends on it. So if you have flex you also have m4.) And you will still need to be very careful with quoting sequences. M4 lets you customize these sequences, so it is more manageable than cpp. But don't copy the common idiom of defining [[ as a quote delimiter; it is very common inside regular expressions.
Also, m4 does not insert #line directives and any non-trivial use will change the number of input lines, making error messages harder to interpret. (To say nothing of the challenge of debugging.) You can probably avoid this issue in this very simple case but the issue will reappear.
You could also write your own simple preprocessor, but you will still need to address the above issues.

Bison Grammar: yylval is embedded in yyparse

No wonder I can't link to it from my flex file.
I have checked this and taken out the declaration "YYSTYPE yylval;" from the beginning of yyparse and it works as intended. Surely this is not the correct way to use bison and flex? Can somebody show me another way?
Thank you.

It is normal that yylval is declared and defined in the y.tab.c file output by bison. It's also declared (as extern) in the y.tab.h file, so if you include that in your lexer, you can access yylval as a global var. This is the normal way in which flex/bison works and there should be no need to edit the files to take out things -- it should 'just work'.
This use of a global var causes problems if you want to have more than one parser in a program, or want to use multiple parsers in different threads (or otherwise simultaneously). Bison provides a way to avoid this with %define api.pure, which gets rid of yylval as a global -- instead the parser will call yylex with the address of a YYSTYPE (a pointer) and the lexer should put the token value there instead of in yylval. If you're using flex, you'll want #define YY_DECL int yylex(YYSTYPE *val) in the top of your flex file to change the declaration it uses for yylex.

Instead of using
#define YY_DECL int yylex(YYSTYPE *val)
you can also use
%option bison-bridge
But if you want to write a flex+bison parser in C++, then this method does not work.
For C++ parsers, check this example out.

I have checked this and taken out the
declaration "YYSTYPE yylval;"
I wonder if there is something wrong with your "taken out", but you could try
bison -d your-yacc-file.y
then bison will generate a header file for you with all those declarations.

Passing the caller FILE LINE to a function without using macro

I'm used to this:
class Db {
void _Commit(const char *file, int line) {
Log("Commit called from %s:%d", file, line);
}
};
#define Commit() _Commit(__FILE__, __LINE__)
but the big problem is that I redefine the word Commit globally, and in a 400k lines application framework it's a problem. And I don't want to use a specific word like DbCommit: I dislike redundancies like db->DbCommit(), and passing the values manually everywhere, as in db->Commit(__FILE__, __LINE__), is worse.
So, any advice?

So, you're looking to do logging (or something) with file & line info, and you would rather not use macros, right?
At the end of the day, it simply can't be done in standard C++ before C++20 (which adds std::source_location). No matter what mechanism you choose -- be that inline functions, templates, default parameters, or something else -- if you don't use a macro, you'll simply end up with the filename & line number of the logging function, rather than the call point.
Use macros. This is one place where they are really not replaceable.
EDIT:
Even the C++ FAQ says that macros are sometimes the lesser of two evils.
EDIT2:
As Nathon says in the comments below, in cases where you do use macros, it's best to be explicit about it. Give your macros macro-y names, like COMMIT() rather than Commit(). This will make it clear to maintainers & debuggers that there's a macro call going on, and it should help in most cases to avoid collisions. Both good things.

Wait till C++20; then you can use std::source_location:
https://en.cppreference.com/w/cpp/utility/source_location
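A rough sketch of how that could look for the Db class from the question. Returning a CallSite struct instead of logging is an assumption made so the result can be inspected. The #else branch is a fallback for compilers that are not in C++20 mode and relies on the __builtin_FILE / __builtin_LINE extensions, which are compiler-specific (GCC and Clang provide them) rather than standard.

```cpp
#include <string>

// Use std::source_location only when compiling as C++20 with the header present.
#if __cplusplus >= 202002L && defined(__has_include)
#  if __has_include(<source_location>)
#    include <source_location>
#    define DB_HAS_SOURCE_LOCATION 1
#  endif
#endif

// Illustrative result type; the original code calls Log(...) instead.
struct CallSite {
    std::string file;
    int line;
};

class Db {
public:
#ifdef DB_HAS_SOURCE_LOCATION
    // C++20: the default argument is evaluated at the call site, so `where`
    // describes the caller, not this function.
    CallSite Commit(std::source_location where = std::source_location::current()) {
        return CallSite{where.file_name(), static_cast<int>(where.line())};
    }
#else
    // Pre-C++20 fallback using compiler builtins (non-standard extension).
    CallSite Commit(const char* file = __builtin_FILE(),
                    int line = __builtin_LINE()) {
        return CallSite{file, line};
    }
#endif
};
```

Either way the call site is just db.Commit(), with no macro and no global redefinition of the word Commit.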

You can use a combination of a default parameter and a preprocessor trick to pass the caller's file to a function. It works as follows:
Function declaration:
static const char *db_caller_file = CALLER_FILE;
class Db {
void _Commit(const char *file = db_caller_file) {
Log("Commit called from %s", file);
}
};
Declare db_caller_file variable in the class header file.
Each translation unit will have its own const char *db_caller_file. It is static, so the copies will not interfere between translation units. (No multiple definitions.)
Now the CALLER_FILE thing, it is a macro and will be generated from gcc's command line parameters. Actually if using automated Make system, where there is generic rule for source files, it is a lot easier: You can add a rule to define macro with the file's name as a value. For example:
CFLAGS= -MMD -MF $(DEPS_DIR)/$<.d -Wall -D'CALLER_FILE="$<"'
-D defines a macro, before compiling this file.
$< is Make's substitution for the name of the prerequisite for the rule, which in this case is the name of the source file. So, each translation unit will have its own db_caller_file variable whose value is a string containing the file's name.
The same idea cannot be applied for the caller line, because each call in the same translation unit should have different line numbers.
