Problem:
I am using VC++2010, so apart from a few supported features like decltype, pre-C++11 code is required.
Given a C++ identifier, is it possible to use some meta-programming technique to check whether that identifier is a type name or a variable name? In other words, given the code below:
void f() {
s=(uint32)-1;
}
is it possible to somehow identify whether uint32 is:
the name of a variable, which means the RHS of the assignment is a subtraction; or
a type name, in which case the RHS operand is the literal -1 cast to uint32
by wrapping it in something like mytemplate<uint32> or similar?
Rationale: I am using my own in-house mini parser to analyze/instrument C++ source code. My mini parser lacks many features, like building a table of identifier types, so it always interprets the above code as either a subtraction or a typecast. My parser is able to modify the source code, so I can insert/modify anything surrounding the uint32, e.g.
void f() {
s=(mytemplate<uint32>(...))-1;
s=myfunc(uint32)-1;
}
But my inserted code will cause a syntax error depending on the meaning of the identifier uint32 (type name vs. variable name). I am looking for some generic code that I can insert to cater for both cases.
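For reference, here is a minimal sketch of the two readings; the typedef, the variable and the namespace are assumptions introduced purely for illustration:
// Reading 1: uint32 names a type, so (uint32)-1 is a cast of the literal -1.
typedef unsigned int uint32;       // assumed definition
unsigned int s1;
void f1() {
    s1 = (uint32)-1;               // s1 becomes the maximum value of uint32
}

// Reading 2: uint32 names a variable, so (uint32)-1 is a subtraction.
namespace alt {
    int uint32 = 5;                // assumed definition, hides the typedef above
    int s2;
    void f2() {
        s2 = (uint32)-1;           // s2 becomes 4
    }
}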
Related
I am trying to write some grammar in bison that parses C code. I am new to bison and I am trying to learn from examples that I find online. I am in the process of writing the AST. This is the grammar that I define (most basic use case):
declarator
: IDENTIFIER { $$ = $1;}
| declarator '(' ')' { $$ = new functionDecl($1); }
;
Now when I compile this code, an error message is thrown that 'declarator' does not have a type.
I understand that I can define the type using the %type declaration. But I want "declarator" to be associated with a variant type:
%code {
class Stmt
{
public:
std::string id;
Stmt(std::string aId)
: id(aId)
{}
};
typedef std::variant<std::string, Stmt> decl_type;
}
%define api.value.type variant
%token <std::string> IDENTIFIER
%type <decl_type> declarator
I am unable to compile this code as well. It throws an error that decl_type is unknown. What am I missing?
When you write
%define api.value.type variant
you are telling bison to implement semantic values with its own implementation of a variant type.
This is highlighted in the bison manual in a note prominently marked Warning:
Warning: We do not use Boost.Variant, for two reasons. First, it appeared unacceptable to require Boost on the user’s machine (i.e., the machine on which the generated parser will be compiled, not the machine on which bison was run). Second, for each possible semantic value, Boost.Variant not only stores the value, but also a tag specifying its type. But the parser already “knows” the type of the semantic value, so that would be duplicating the information.
We do not use C++17’s std::variant either: we want to support all the C++ standards, and of course std::variant also stores a tag to record the current type.
(It's worth reading the entire section on Bison variants if you want to use them.)
You could define bison's semantic type to be a std::variant, of course:
%define api.value.type {std::variant<std::string, Stmt>}
But that might not be what you want, because bison doesn't know anything about std::variant and will not do anything to help you with the syntax of accessing the value of the variant. Normally, as the page on Bison variants points out, Bison can deduce the type of the semantic value of a grammar symbol (using your %type declarations, of course), but if you don't use a union or a Bison variant type, then all Bison knows is that the value is a std::variant. If you happen to know that it holds a particular type (for example, because you know what the types of your terminal symbols are) and you want to examine the value as that type, you'll have to use std::get, something like std::get<NodeList>($2).
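For contrast, here is a rough sketch of what the question's grammar could look like using Bison's own variant support (a fragment only: the lexer, driver and error handling are omitted, and the action bodies are placeholders):
%require "3.2"
%language "c++"
%define api.value.type variant

%code requires {
  #include <string>
  class Stmt
  {
  public:
    std::string id;
    Stmt() {}                    // Bison's variant needs default-constructible values
    Stmt(std::string aId) : id(aId) {}
  };
}

%token <std::string> IDENTIFIER
%type <Stmt> declarator

%%
declarator
  : IDENTIFIER           { $$ = Stmt($1); }
  | declarator '(' ')'   { $$ = $1; /* wrap $1 as a function declarator here */ }
  ;
%%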
There has been some discussion on the Bison mailing list about improving Bison's handling of variant types. Sadly, I haven't been following it in detail, so you might want to take a look yourself.
After reading this question I am left wondering what happens (regarding the AST) when major C++ compilers parse code like this:
struct foo
{
void method() { a<b>c; }
// a b c may be declared here
};
Do they handle it like a GLR parser would or in a different way? What other ways are there to parse this and similar cases?
For example, I think it's possible to postpone parsing the body of the method until the whole struct has been parsed, but is this really possible and practical?
Although it is certainly possible to use GLR techniques to parse C++ (see a number of answers by Ira Baxter), I believe that the approach commonly used in commonly-used compilers such as gcc and clang is precisely that of deferring the parse of function bodies until the class definition is complete. (Since C++ source code passes through a preprocessor before being parsed, the parser works on streams of tokens and that is what must be saved in order to reparse the function body. I don't believe that it is feasible to reparse the source code.)
It's easy to know when a function definition is complete, since braces ({}) must balance even if it is not known how angle brackets nest.
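A hedged sketch of that brace-counting idea; the Token and TokenKind types are invented for illustration and do not come from any real compiler:
#include <cstddef>
#include <vector>

enum class TokenKind { LBrace, RBrace, Other, Eof };
struct Token { TokenKind kind; /* text, source location, ... */ };

// Assumes toks[pos] is the '{' that opens a function body: save the body's
// tokens (to be re-parsed once the class definition is complete) and advance
// pos past the matching '}'.
std::vector<Token> save_function_body(const std::vector<Token>& toks, std::size_t& pos)
{
    std::vector<Token> body;
    int depth = 0;
    do {
        const Token& t = toks[pos++];
        body.push_back(t);
        if (t.kind == TokenKind::LBrace) ++depth;
        else if (t.kind == TokenKind::RBrace) --depth;
    } while (depth > 0 && body.back().kind != TokenKind::Eof);
    return body;
}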
C++ is not the only language in which it is useful to defer parsing until declarations have been handled. For example, a language which allows users to define new operators with different precedences would require all expressions to be (re-)parsed once the names and precedences of operators are known. A more pathological example is COBOL, in which the precedence of OR in a = b OR c depends on whether c is an integer (a is equal to one of b or c) or a boolean (a is equal to b or c is true). Whether designing languages in this manner is a good idea is another question.
The answer will obviously depend on the compiler, but the article How Clang handles the type / variable name ambiguity of C/C++ by Eli Bendersky explains how Clang does it. I will simply note some key points from the article:
Clang has no need for a lexer hack: the information goes in a single direction from lexer to parser
Clang knows when an identifier is a type by using a symbol table
C++ requires member declarations to be visible throughout the class body, even in member function bodies that appear before those declarations
Clang gets around this by doing full parsing/semantic analysis of the declarations, but leaving the definitions for later; in other words, a method body is lexed immediately but only parsed once all the declared types are available
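To make that concrete, here is a sketch of two class definitions in which the same statement a<b>c; must be parsed differently depending on members that are declared after the method body (all names are chosen purely for illustration):
template <int N> struct T { };

struct comparisons {
    void method() { a<b>c; }       // parsed as (a < b) > c: two comparisons
    int a, b, c;                   // a, b and c turn out to be data members
};

struct declaration {
    void method() { a<b>c; }       // parsed as a declaration of c with type a<b>
    template <int> using a = T<0>; // a turns out to be a member alias template
    static const int b = 1;        // b turns out to be a constant
};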
I was searching for a while on the net and unfortunately I didn't find an answer or a solution to my problem. Let's say I have 2 functions named like this:
1) function1a(some_args)
2) function2a(some_args)
What I want to do is write a macro that can recognize those functions when fed the correct parameter. The catch is that this parameter should also be a parameter of a C/C++ function. Here is what I did so far.
#define FUNCTION_RECOGNIZER(TOKEN) function##TOKEN()
void function1a()
{
}
void function2a()
{
}
void anotherParentFunction(const char* type)
{
FUNCTION_RECOGNIZER(type);
}
Clearly, the macro is producing "functiontype" and ignoring the argument of anotherParentFunction. I'm asking whether there is a trick or anything else that makes this kind of pasting work.
Thank you in advance :)
If you insist on using a macro: skip the anotherParentFunction() function and use the macro directly with the bare suffix token (not a quoted string; pasting an identifier onto a string literal does not form a valid token), i.e.
FUNCTION_RECOGNIZER(1a);
which expands to function1a(); and should work.
A more C++-like solution would be to use, e.g., an enum: implement anotherParentFunction() with the enum as parameter and a switch that calls the corresponding function. Of course, you then need to update the enum and the switch statement every time you add a new function, but you would be more flexible in choosing the names of the functions.
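A minimal sketch of that enum-plus-switch idea (the enum and its values are invented names):
enum FunctionId { Function1a, Function2a };

void function1a() { /* ... */ }
void function2a() { /* ... */ }

void anotherParentFunction(FunctionId id)
{
    switch (id) {
    case Function1a: function1a(); break;
    case Function2a: function2a(); break;
    }
}

// usage: anotherParentFunction(Function1a);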
There are many more solutions to achieve something similar; the question really is: what is your use case? What do you want to achieve?
In 16.1.5 the standard says:
The implementation can process and skip sections of source files conditionally, include other source files, and replace macros. These capabilities are called preprocessing, because conceptually they occur before translation of the resulting translation unit.
[emphasis mine]
Originally, preprocessing was done by a separate program; it is essentially an independent language.
Today, the preprocessor is usually part of the compiler, but, for example, you can't see macros etc. in the Clang AST.
The significance of this is that the pre-processor knows nothing about types or functions or arguments.
Your function definition
void anotherParentFunction(const char* type)
means nothing to the pre-processor and is completely ignored by it.
FUNCTION_RECOGNIZER(type);
this is recognized as a defined macro, but type is not a recognized preprocessor symbol, so it is treated as a literal token; the preprocessor does not consult the C++ parser or interact with its AST.
It consults the macro definition:
#define FUNCTION_RECOGNIZER(TOKEN) function##TOKEN()
The argument, the literal token type, is bound to the parameter TOKEN. The word function is taken as a literal and copied to the result; ## tells the preprocessor to paste the value of TOKEN onto it, producing functiontype in the result. Because type isn't itself a macro, nothing is expanded further, and the trailing () is appended as a literal to the result.
Thus, the pre-processor substitutes
FUNCTION_RECOGNIZER(type);
with
functiontype();
So the bad news is: no, there is no way to do what you were trying to do. But this may be an XY problem, and perhaps there's a solution to what you were actually trying to achieve.
For instance, it is possible to overload functions based on argument type, or to specialize template functions based on parameters, or you can create a lookup table based on parameter values.
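As a sketch of the last suggestion, a run-time lookup table keyed by the string parameter might look like this (the table contents are assumptions that mirror the functions in the question):
#include <functional>
#include <map>
#include <string>

void function1a() { /* ... */ }
void function2a() { /* ... */ }

void anotherParentFunction(const std::string& type)
{
    // Map from the name suffix to the function to call.
    static const std::map<std::string, std::function<void()>> table = {
        { "1a", function1a },
        { "2a", function2a },
    };
    auto it = table.find(type);
    if (it != table.end())
        it->second();   // call the matching function
}

// usage: anotherParentFunction("1a");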
Suppose there's a complex expression EXPRESSION, and it's quite hard even for the IDE to find some of the methods called in it etc., so it's very hard to figure out the type it evaluates to. Currently to make the compiler (gcc) print the human-readable type I'm using a construct like
struct {} s=EXPRESSION;
which won't compile for (practically) any expression, since nothing converts to that empty anonymous struct. In this case gcc says something like
Conversion from Type_I_am_Interested_In to non-scalar type main()::<anonymous struct> requested
, which allows me to see the Type_I_am_Interested_In.
My question is now, is there a nicer way to get human-readable Type_I_am_Interested_In using some gcc/clang extensions or whatever instead of relying on error message format?
You can use decltype to get the type of the expression and then use partially specialized templates and typeid (demangle via cxxabi.h) to create a readable form as you like.
While you can skip the template decomposition step, you will receive slightly less information without it.
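A sketch of that approach, assuming GCC or Clang (abi::__cxa_demangle is a libstdc++/libc++ extension, not standard C++):
#include <cxxabi.h>
#include <cstdlib>
#include <iostream>
#include <string>
#include <typeinfo>

template <typename T>
std::string type_name()
{
    int status = 0;
    char* demangled = abi::__cxa_demangle(typeid(T).name(), 0, 0, &status);
    std::string result = (status == 0 && demangled) ? demangled : typeid(T).name();
    std::free(demangled);
    return result;
}

int main()
{
    double x = 1.0;
    // Note: typeid drops references and top-level cv-qualifiers, which is why
    // the template-decomposition step mentioned above recovers more information.
    std::cout << type_name<decltype(x + 1)>() << "\n";   // prints "double"
}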
I was reading about parsers and parser generators and found this statement in wikipedia's LR parsing -page:
Many programming languages can be parsed using some variation of an LR parser. One notable exception is C++.
Why is it so? What particular property of C++ causes it to be impossible to parse with LR parsers?
Using google, I only found that C can be perfectly parsed with LR(1) but C++ requires LR(∞).
LR parsers can't handle ambiguous grammar rules, by design. (That made the theory easier back in the 1970s, when the ideas were being worked out.)
C and C++ both allow the following statement:
x * y ;
It has two different parses:
It can be the declaration of y, as pointer to type x
It can be a multiply of x and y, throwing away the answer.
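A minimal sketch of the two readings (the surrounding declarations are assumptions added only to make each snippet compile):
namespace as_declaration {
    struct x { };
    void f() { x * y; }   // declares y as a pointer to x
}

namespace as_multiplication {
    int x = 6, y = 7;
    void f() { x * y; }   // multiplies x by y and throws away the result
}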
Now, you might think the latter is stupid and should be ignored. Most would agree with you; however, there are cases where it might have a side effect (e.g., if multiply is overloaded). But that isn't the point. The point is that there are two different parses, and therefore a program can mean different things depending on how it should have been parsed. The compiler must accept the appropriate one under the appropriate circumstances, and in the absence of any other information (e.g., knowledge of the type of x) must collect both in order to decide later what to do. Thus a grammar must allow this. And that makes the grammar ambiguous.
Thus pure LR parsing can't handle this. Nor can many other widely available parser generators, such as Antlr, JavaCC, YACC, or traditional Bison, or even PEG-style parsers, used in a "pure" way.
There are lots of more complicated cases (parsing template syntax requires arbitrary lookahead, whereas LALR(k) can look ahead at most k tokens), but it only takes one counterexample to shoot down pure LR (or the others) parsing.
Most real C/C++ parsers handle this example by using some kind of deterministic parser with an extra hack: they intertwine parsing with symbol table collection... so that by the time "x" is encountered, the parser knows if x is a type or not, and can thus choose between the two potential parses. But a parser that does this isn't context-free, and LR parsers (the pure ones, etc.) are (at best) context-free.
One can cheat, and add per-rule reduction-time semantic checks to LR parsers to do this disambiguation. (This code often isn't simple.) Most of the other parser types have some means to add semantic checks at various points in the parsing that can be used to do this. And if you cheat enough, you can make LR parsers work for C and C++. The GCC guys did for a while, but gave it up for hand-coded parsing, I think because they wanted better error diagnostics.
There's another approach, though, which is nice and clean and parses C and C++ just fine without any symbol table hackery: GLR parsers. These are full context-free parsers (having effectively infinite lookahead). GLR parsers simply accept both parses, producing a "tree" (actually a directed acyclic graph that is mostly tree-like) that represents the ambiguous parse. A post-parsing pass can resolve the ambiguities.
We use this technique in the C and C++ front ends for our DMS Software Reengineering Toolkit (as of June 2017 these handle full C++17 in MS and GNU dialects). They have been used to process millions of lines of large C and C++ systems, with complete, precise parses producing ASTs with complete details of the source code. (See the AST for C++'s most vexing parse.)
There is an interesting thread on Lambda the Ultimate that discusses the LALR grammar for C++.
It includes a link to a PhD thesis that includes a discussion of C++ parsing, which states that:
"C++ grammar is ambiguous, context-dependent and potentially requires infinite lookahead to resolve some ambiguities".
It goes on to give a number of examples (see page 147 of the pdf).
The example is:
int(x), y, *const z;
meaning
int x;
int y;
int *const z;
Compare to:
int(x), y, new int;
meaning
(int(x)), (y), (new int);
(a comma-separated expression).
The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.
The problem is never defined like this, although it would be interesting:
What is the smallest set of modifications to the C++ grammar that would be necessary so that the new grammar could be perfectly parsed by a "non-context-free" yacc parser? (Making use of only one 'hack': the typename/identifier disambiguation, with the parser informing the lexer of every typedef/class/struct.)
I see a few:
Type Type; is forbidden. An identifier declared as a typename cannot become a non-typename identifier (note that struct Type Type is not ambiguous and could still be allowed).
There are 3 kinds of name tokens:
types : builtin-type or because of a typedef/class/struct
template-functions
identifiers : functions/methods and variables/objects
Considering template-functions as different tokens solves the func< ambiguity. If func is a template-function name, then < must be the beginning of a template parameter list, otherwise func is a function pointer and < is the comparison operator.
Type a(2); is an object instantiation.
Type a(); and Type a(int) are function prototypes.
int (k); is completely forbidden, should be written int k;
typedef int func_type(); and
typedef int (func_type)(); are forbidden.
A function typedef must be a function pointer typedef: typedef int (*func_ptr_type)();
template recursion is limited to 1024, otherwise an increased maximum could be passed as an option to the compiler.
int a,b,c[9],*d,(*f)(), (*g)()[9], h(char); could be forbidden too, replaced by int a,b,c[9],*d;
int (*f)();
int (*g)()[9];
int h(char);
one line per function prototype or function pointer declaration.
A highly preferred alternative would be to change the awful function pointer syntax,
int (MyClass::*MethodPtr)(char*);
being resyntaxed as:
int (MyClass::*)(char*) MethodPtr;
this being consistent with the cast syntax (int (MyClass::*)(char*))
typedef int type, *type_ptr; could be forbidden too: one line per typedef. Thus it would become
typedef int type;
typedef int *type_ptr;
sizeof(int), sizeof(char), sizeof(long long) and co. could be declared in each source file.
Thus, each source file making use of the type int should begin with
#type int : signed_integer(4)
and unsigned_integer(4) would be forbidden outside of that #type directive
this would be a big step toward resolving the stupid sizeof(int) ambiguity present in so many C++ headers
The compiler implementing the resyntaxed C++ would, if it encountered a C++ source file using ambiguous syntax, move source.cpp to an ambiguous_syntax folder and automatically create an unambiguous translated source.cpp before compiling it.
Please add your ambiguous C++ syntaxes if you know some!
As you can see in my answer here, C++ contains syntax that cannot be deterministically parsed by an LL or LR parser due to the type resolution stage (typically post-parsing) changing the order of operations, and therefore the fundamental shape of the AST (typically expected to be provided by a first-stage parse).
I think you are pretty close to the answer.
LR(1) means that parsing from left to right needs only one token of lookahead for context, whereas LR(∞) means infinite lookahead. That is, the parser would have to know everything that was coming in order to figure out where it is now.
The "typedef" problem in C++ can be parsed with an LALR(1) parser that builds a symbol table while parsing (not a pure LALR parser). The "template" problem probably cannot be solved with this method. The advantage of this kind of LALR(1) parser is that the grammar (shown below) is an LALR(1) grammar (no ambiguity).
/* C Typedef Solution. */
/* Terminal Declarations. */
<identifier> => lookup(); /* Symbol table lookup. */
/* Rules. */
Goal -> [Declaration]... <eof> +> goal_
Declaration -> Type... VarList ';' +> decl_
-> typedef Type... TypeVarList ';' +> typedecl_
VarList -> Var /','...
TypeVarList -> TypeVar /','...
Var -> [Ptr]... Identifier
TypeVar -> [Ptr]... TypeIdentifier
Identifier -> <identifier> +> identifier_(1)
TypeIdentifier -> <identifier> =+> typedefidentifier_(1,{typedef})
// The above line will assign {typedef} to the <identifier>,
// because {typedef} is the second argument of the action typedefidentifier_().
// This handles the context-sensitive feature of the C++ language.
Ptr -> '*' +> ptr_
Type -> char +> type_(1)
-> int +> type_(1)
-> short +> type_(1)
-> unsigned +> type_(1)
-> {typedef} +> type_(1)
/* End Of Grammar. */
The following input can be parsed without a problem:
typedef int x;
x * y;
typedef unsigned int uint, *uintptr;
uint a, b, c;
uintptr p, q, r;
The LRSTAR parser generator reads the above grammar notation and generates a parser that handles the "typedef" problem without ambiguity in the parse tree or AST. (Disclosure: I am the guy who created LRSTAR.)