How to pass token kind with its associated information from lexer to preprocessor, then to parser - c++

I am trying to implement a simple C/C++ parser that handles only part of the C++ language, so I need to create a Lexer, a Preprocessor and a Parser class.
I'm considering which data type to use for passing information between those three layers. Normally a Token class is needed here; right now my Token class looks like this:
struct Token
{
TokenKind id;
std::string lexeme;
int fileIndex;
int line;
int column;
};
I think the most important part is the TokenKind (it could be IDENTIFIER, CLASS_KEYWORD, or a punctuator such as LPAREN). Sometimes the lexeme is also important, because it usually carries the type name or variable name.
I looked at some implementations to see how the Token is passed to the parser.
1, Clang has functions in its Preprocessor class such as this one (Preprocessor.cpp:739):
void Preprocessor::Lex(Token &Result)
Here a reference is passed as the function argument, and the function fills the object with the result. Another example is in a Clang tutorial (Clang-tutorial/CItutorial3.cpp at master · loarabia/Clang-tutorial), where the instance tok is reused in a loop:
Token tok;
do {
ci.getPreprocessor().Lex(tok);
if( ci.getDiagnostics().hasErrorOccurred())
break;
ci.getPreprocessor().DumpToken(tok);
std::cerr << std::endl;
} while ( tok.isNot(clang::tok::eof));
2, For some lexer generators, the function yylex() just returns an int, which is effectively the TokenKind, and other information such as the actual lexeme string is stored in a global variable like yylval.
3, In a tiny language front end for GCC (A tiny GCC front end – Part 3 | Think In Geek), the Lexer returns a std::shared_ptr<Token>, that is:
static TokenPtr
make_identifier (location_t locus, const std::string& str)
{
return TokenPtr(new Token (IDENTIFIER, locus, str));
}
The Lexer returns a TokenPtr, a smart pointer to the Token object, so the whole Token is handed to the Parser.
4, GCC's cpp library exposes the cpp_get_token() function, used like below:
const cpp_token *token = cpp_get_token (pfile);
Here token->type plays the same role as the TokenKind field.
So, my question is: what are the advantages and disadvantages of these implementations? Some of the methods mentioned above do not even have a preprocessing layer, but I do need three layers (the lexer, the preprocessor and the parser).
Note that my parser won't be as big as Clang's or GCC's. It will only parse a very limited part of the C++ language, and I would like to write all of it by hand.
EDIT: A similar question is "What should be the datatype of the tokens a lexer returns to its parser?". I posted some comments there several days ago, but that question does not involve the three layers.
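For comparison, the out-parameter style from item 1 can be sketched for a three-layer design as follows. This is only a minimal sketch, not how Clang or GCC are implemented: the Lexer and Preprocessor classes, the TokenKind values, the collectIdentifiers() stand-in for a parser, and the rule that the preprocessor simply drops '#' tokens are all assumptions made for illustration. Digits are also accepted at the start of an identifier to keep it short.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class TokenKind { Identifier, LParen, RParen, Hash, Unknown, Eof };

struct Token {
    TokenKind kind = TokenKind::Eof;
    std::string lexeme;
    int line = 1;
    int column = 1;
};

// Layer 1: turns characters into tokens by filling a caller-owned Token.
class Lexer {
public:
    explicit Lexer(std::string text) : text_(std::move(text)) {}

    void lex(Token& out) {
        skipWhitespace();
        out.line = line_;
        out.column = column_;
        out.lexeme.clear();
        if (pos_ >= text_.size()) { out.kind = TokenKind::Eof; return; }
        if (isIdentChar(text_[pos_])) {
            while (pos_ < text_.size() && isIdentChar(text_[pos_])) {
                out.lexeme += text_[pos_++];
                ++column_;
            }
            out.kind = TokenKind::Identifier;
            return;
        }
        const char c = text_[pos_++];
        ++column_;
        out.lexeme.assign(1, c);
        out.kind = c == '(' ? TokenKind::LParen
                 : c == ')' ? TokenKind::RParen
                 : c == '#' ? TokenKind::Hash
                            : TokenKind::Unknown;
    }

private:
    static bool isIdentChar(char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    }
    void skipWhitespace() {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_]))) {
            if (text_[pos_] == '\n') { ++line_; column_ = 1; }
            else { ++column_; }
            ++pos_;
        }
    }
    std::string text_;
    std::size_t pos_ = 0;
    int line_ = 1;
    int column_ = 1;
};

// Layer 2: same interface as the lexer; here it only filters out '#' tokens.
class Preprocessor {
public:
    explicit Preprocessor(Lexer& lexer) : lexer_(lexer) {}
    void lex(Token& out) {
        do { lexer_.lex(out); } while (out.kind == TokenKind::Hash);
    }
private:
    Lexer& lexer_;
};

// Layer 3 stand-in: a "parser" that just collects identifier spellings.
std::vector<std::string> collectIdentifiers(const std::string& source) {
    Lexer lexer(source);
    Preprocessor pp(lexer);
    std::vector<std::string> names;
    Token tok;  // one Token object reused for every call
    for (pp.lex(tok); tok.kind != TokenKind::Eof; pp.lex(tok)) {
        if (tok.kind == TokenKind::Identifier) names.push_back(tok.lexeme);
    }
    return names;
}
```

Because the preprocessor exposes the same lex(Token&) call as the lexer, the parser does not need to know which layer it is talking to. The likely trade-off against the shared_ptr style of item 3 is that reusing one Token avoids an allocation per token, but the parser has to copy tokens itself whenever it needs lookahead.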

Related

How to do string formatting in BetterC mode?

I'd like to use something like the "Concepts" package from Atila Neves.
I implemented the check of an object against a type signature myself in a simple naive way. I can check struct objects against interfaces which I define within compile-time-evaluated delegate blocks to make them work with BetterC. I only used compile-time function evaluation with enums which receive return values of executed delegate code blocks.
Now I have run into problems with std.format.format, which uses TypeInfo for %s formatters and therefore gives errors when compiling in BetterC. For code generation I'd like to use token strings because they have syntax highlighting. But proper usage of them requires string interpolation or string formatting. core.stdc.stdio.snprintf is no alternative because CTFE can only interpret D source code.
This is not technically a problem. I can just turn token strings into WYSIWYG strings.
But I wonder why I can't use it. The official documentation says, compile-time features are unrestricted for BetterC (I assume this includes TypeInfo). Either it is plain wrong or I am doing it wrong.
template implementsType(alias symbol, type)
if (isAbstractClass!type)
{
enum implementsType = mixin(implementsTypeExpr);
enum implementsTypeExpr =
{
import std.format : format;
auto result = "";
static foreach(memberName; __traits(allMembers, type))
{
result ~= format(
q{__traits(compiles, __traits(getMember, symbol, "%1$s")) && }~
q{covariantSignature!(__traits(getMember, symbol, "%1$s"), __traits(getMember, type, "%1$s")) && }
, memberName);
}
return (result.length >= 3)? result[0 .. $-3] : result;
}();
}
TypeInfo is not available in BetterC.
There's a bc-string dub package that provides a limited string formatter that will work in BetterC.

using %union for structs

I searched a lot but can't seem to find a clear example of how to use %union in my parser file.
I would like to save, for example, the following token in a class called IDClass:
[a-zA-Z][a-zA-Z0-9]* { yylval=new IDClass(yytext); return ID; }
This is the class in my .hpp file, which I included:
class Node {
public:
Node(){}
};
class IDClass : public Node {
public:
string name;
IDClass(string name):
Node(),name(name)
{}
};
Then, in my .ypp file, I would like to use it for certain checks:
Define: Type ID { if(doesIDexists($2->name)){errorDef(yylineno, ID_ptr->name);exit(1);}}
But obviously, $2->name doesn't work. What is the correct use of %union with structs? How can I grab the value of name properly?
Thank you in advance.
There is no "correct use of %union in structs", since %union is used to declare a union. (You could, of course, declare a union with just one member, but that's almost pointless.)
The correct way of declaring a semantic type which is not a union is:
%define api.value.type { Node* }
But that's not going to get you what you want, since what you want is neither a union nor a fixed type, but rather an implicit dynamic cast, or something similar. And that's not on Bison's menu of options. (It's easy to see why not. The dynamic cast wouldn't be an lvalue so Bison would have to know when to apply it and when not to.)
So you could use the above %define api.value.type and then write out the grammar action:
Define: Type ID { if (doesIDexist(dynamic_cast<IDClass*>($2)->name)) {
errorDef(@2.first_line,
dynamic_cast<IDClass*>($2)->name);
exit(1);
}
}
Note that dynamic_cast only compiles for a downcast if Node is polymorphic, that is, if it has at least one virtual function such as a virtual destructor. With the Node class as posted you would need to add one, or use static_cast instead.
If you generate a C++ parser instead of a C parser which happens to work in C++, then you have some other options, which might or might not be a better fix for your application. The options for a C parser are explained in the Language Semantics chapter of the Bison manual, and the C++ options have their own chapter, although if you are going to use the C++ interface you need to read the entire C++ section.
Note: I changed yylineno to @2.first_line because yylineno is usually inaccurate; yylineno is often the line number associated with the lookahead token, which is often not on the same line as the error. But you can't just make that change; you also have to make sure that your lexer fills in yylloc correctly. See the Tracking Locations chapter in the Bison manual.
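The cast itself can be tried outside of Bison. The following is a minimal sketch in plain C++, assuming a Node with a virtual destructor; NumClass and the helper nameOf() are hypothetical names added only to show what happens when the node is not an IDClass.

```cpp
#include <cassert>
#include <string>

class Node {
public:
    virtual ~Node() {}  // makes Node polymorphic, which dynamic_cast requires
};

class IDClass : public Node {
public:
    std::string name;
    explicit IDClass(const std::string& n) : name(n) {}
};

// Hypothetical second node type, used to show a failing cast.
class NumClass : public Node {
public:
    int value;
    explicit NumClass(int v) : value(v) {}
};

// Returns the identifier's name, or an empty string if node is not an IDClass.
std::string nameOf(Node* node) {
    IDClass* id = dynamic_cast<IDClass*>(node);  // null when the cast fails
    return id ? id->name : std::string();
}
```

Checking the result of the cast for null, as nameOf() does, is safer in a grammar action than dereferencing it directly, since a semantic value of the wrong node type would otherwise crash the parser.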

Met c++ code " #define ELEMENT(TYPE, FIELD)"

#define ELEMENT(TYPE, FIELD)\
bool get##FIELD(TYPE *field) const throw()\
{ \
return x_->get##FIELD(y_, field);\
} \
I have never seen code like this before.
First, why do we put code in a #define? Is it a macro, so that I can use ELEMENT() later in other places?
Second, what is ##? What I can find online is "The ## operator takes two separate tokens and pastes them together to form a single token. The resulting token could be a variable name, class name or any other identifier."
Could someone tell me how this kind of code works?
Yes, ELEMENT() is a preprocessor macro, which is just a fancy way to replace one piece of text with another piece of text before the compiler is invoked. At the site where a macro is invoked, it is replaced with the text content of the macro. If the macro has parameters, each parameter is replaced with the text that the caller passed in to the macro.
In this case, the TYPE parameter is being used as-is within the macro text, whereas the FIELD parameter is being concatenated with get via ## concatenation to produce a new token identifier get<FIELD>.
ELEMENT() can be used like this, for example:
class MyClass
{
ELEMENT(int, IntValue) // TYPE=int, FIELD=IntValue
ELEMENT(string, StrData) // TYPE=string, FIELD=StrData
// and so on ...
};
Which will be expanded by the preprocessor to this code, which is what the compiler actually sees:
class MyClass
{
bool getIntValue(int *field) const throw()
{
return x_->getIntValue(y_, field);
}
bool getStrData(string *field) const throw()
{
return x_->getStrData(y_, field);
}
// and so on ...
};
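To see the expansion in action, here is a small self-contained sketch. The Backend struct and the x_ and y_ members are assumptions, since the original code does not show what they are, and noexcept is used in place of throw(), which newer C++ standards no longer accept.

```cpp
#include <cassert>
#include <string>

// Hypothetical backend that the generated getters forward to.
struct Backend {
    bool getIntValue(int key, int* field) const {
        *field = key * 2;
        return true;
    }
    bool getStrData(int key, std::string* field) const {
        *field = "row" + std::to_string(key);
        return true;
    }
};

// FIELD is pasted onto "get" with ##; TYPE is substituted unchanged.
#define ELEMENT(TYPE, FIELD)                    \
    bool get##FIELD(TYPE *field) const noexcept \
    {                                           \
        return x_->get##FIELD(y_, field);       \
    }

class MyClass {
public:
    MyClass(const Backend* x, int y) : x_(x), y_(y) {}
    ELEMENT(int, IntValue)         // generates getIntValue(int*)
    ELEMENT(std::string, StrData)  // generates getStrData(std::string*)
private:
    const Backend* x_;
    int y_;
};
```

Running only the preprocessor over such a file (for example with the -E option of GCC or Clang) shows the generated member functions.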
I'm sorry to tell you, someone tried to be clever.
#define is used to textually replace one piece of text with another. A macro can also take parameters, two in this case. Normally each argument is substituted as a token of its own; thanks to ##, however, an argument can be concatenated with adjacent text to form a new token.
Let's take an example: ELEMENT(int, Cost);
This will result in the following code being injected:
bool getCost(int *field) const throw()
...
So as you can see, int is kept as a token of its own, while Cost is glued onto get to form getCost.
I hope you found this in legacy code, because heavy use of the preprocessor is considered bad practice in C++. The language hasn't been able to get rid of the preprocessor entirely, but it provides alternatives for the most common uses.
For example, #include and header guards have a replacement in C++20 modules.

Is there a better alternative to preprocessor redirection for runtime tracking of an external API?

I have sort of a tricky problem I'm attempting to solve. First of all, an overview:
I have an external API not under my control, which is used by a massive amount of legacy code.
There are several classes of bugs in the legacy code that could potentially be detected at run-time, if only the external API was written to track its own usage, but it is not.
I need to find a solution that would allow me to redirect calls to the external API into a tracking framework that would track api usage and log errors.
Ideally, I would like the log to reflect the file and line number of the API call that triggered the error, if possible.
Here is an example of a class of errors that I would like to track. The API we use has two functions. I'll call them GetAmount, and SetAmount. They look something like this:
// Get an indexed amount
long GetAmount(short Idx);
// Set an indexed amount
void SetAmount(short Idx, long amount);
These are regular C functions. One bug I am trying to detect at runtime is when GetAmount is called with an Idx that hasn't already been set with SetAmount.
Now, all of the API calls are contained within a namespace (call it api_ns), however they weren't always in the past. So, of course the legacy code just threw a "using namespace api_ns;" in their stdafx.h file and called it good.
My first attempt was to use the preprocessor to redirect API calls to my own tracking framework. It looked something like this:
// in FormTrackingFramework.h
class FormTrackingFramework
{
private:
static FormTrackingFramework* current;
public:
static FormTrackingFramework* GetCurrent();
long GetAmount(short Idx, const std::string& file, size_t line)
{
// track usage, log errors as needed
return api_ns::GetAmount(Idx);
}
};
#define GetAmount(Idx) (FormTrackingFramework::GetCurrent()->GetAmount(Idx, __FILE__, __LINE__))
Then, in stdafx.h:
// in stdafx.h
#include "theAPI.h"
#include "FormTrackingFramework.h"
#include "LegacyPCHIncludes.h"
Now, this works fine for GetAmount and SetAmount, but there's a problem. The API also has a SetString(short Idx, const char* str). At some point, our legacy code added an overload: SetString(short Idx, const std::string& str) for convenience. The problem is, the preprocessor doesn't know or care whether you are calling SetString or defining a SetString overload. It just sees "SetString" and replaces it with the macro definition. Which of course doesn't compile when defining a new SetString overload.
I could potentially reorder the #includes in stdafx.h to include FormTrackingFramework.h after LegacyPCHIncludes.h, however that would mean that none of the code in the LegacyPCHIncludes.h include tree would be tracked.
So I guess I have two questions at this point:
1: how do I solve the API overload problem?
2: Is there some other method of doing what I want to do that works better?
Note: I am using Visual Studio 2008 w/SP1.
Well, for the cases where you need overloads, you could use a class instance that overloads operator() for each parameter list.
#define GetAmount GetAmountFunctor(FormTrackingFramework::GetCurrent(), __FILE__, __LINE__)
then, make a GetAmountFunctor:
class GetAmountFunctor
{
public:
GetAmountFunctor(....) // capture relevant debug info for logging
{}
long operator() (short idx, std::string str)
{
// logging here
// (a two-argument GetAmount is hypothetical; it only illustrates an overload)
return api_ns::GetAmount(idx, str);
}
long operator() (short idx)
{
// logging here
return api_ns::GetAmount(idx);
}
};
This is very much pseudocode but I think you get the idea. Wherever the particular function name is mentioned in your legacy code, it is replaced by a functor object, and the call is actually made on the functor. Keep in mind that you only need to do this for functions where overloads are a problem. To reduce the amount of glue code, you can create a single struct for the parameters __FILE__, __LINE__, and pass it into the constructor as one argument.
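Applied to the SetString case from the question, the idea might look like the sketch below. It assumes C++11, which is newer than Visual Studio 2008 supports, and everything except the SetString name is made up for illustration: the api_ns stand-in, callLog(), CallSite, SetStringFunctor and demoCalls() are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace api_ns {
    // Stand-in for the real external API; remembers the last string it was given.
    inline std::string& lastString() { static std::string s; return s; }
    inline void SetString(short /*Idx*/, const char* str) { lastString() = str; }
}

// Stand-in for the tracking framework: one "file:line" entry per API call.
inline std::vector<std::string>& callLog() {
    static std::vector<std::string> log;
    return log;
}

struct CallSite { const char* file; int line; };

class SetStringFunctor {
public:
    explicit SetStringFunctor(CallSite site) : site_(site) {}

    // Mirrors the original API signature.
    void operator()(short idx, const char* str) const {
        callLog().push_back(std::string(site_.file) + ":" + std::to_string(site_.line));
        api_ns::SetString(idx, str);
    }
    // Mirrors the legacy convenience overload; forwards to the one above.
    void operator()(short idx, const std::string& str) const {
        (*this)(idx, str.c_str());
    }

private:
    CallSite site_;
};

// Object-like macro: every later mention of SetString becomes a temporary functor.
#define SetString SetStringFunctor(CallSite{__FILE__, __LINE__})

// Example call sites; returns the number of logged calls.
inline int demoCalls() {
    SetString(1, "abc");
    SetString(2, std::string("xyz"));
    return static_cast<int>(callLog().size());
}
```

Note that the macro has to be defined after the functor, because once it is in effect even a qualified api_ns::SetString is rewritten. The legacy std::string overload also has to live inside the functor, as it does here: a free-function definition of SetString appearing after the macro would still be mangled by the preprocessor.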
"The problem is, the preprocessor doesn't know or care whether you are calling SetString or defining a SetString overload."
Clearly, the reason the preprocessor is being used is that it is oblivious to the namespace.
A good approach is to bite the bullet and retarget the entire large application to use a different namespace api_wrapped_ns instead of api_ns.
Inside api_wrapped_ns, inline functions can be provided which wrap counterparts with like signatures in api_ns.
There can even be a compile time switch like this:
namespace api_wrapped_ns {
#ifdef CONFIG_API_NS_WRAPPER
inline long GetAmount(short Idx, const std::string& file, size_t line)
{
// of course, do more than just wrapping here
return api_ns::GetAmount(Idx);
}
// other inlines
#else
// Wrapping turned off: just bring api_ns into api_wrapped_ns
using namespace api_ns;
#endif
}
Also, the wrapping can be brought in piecemeal:
namespace api_wrapped_ns {
// This function is wrapped;
inline long GetAmount(short Idx, const std::string& file, size_t line)
{
// of course, do more than just wrapping here
return api_ns::GetAmount(Idx);
}
// The api_ns::FooBar symbol is unwrapped (for now)
using api_ns::FooBar;
}
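As a rough sketch of what "more than just wrapping" could mean for the GetAmount/SetAmount bug from the question, the wrappers below record which indices have been set and log a read of an index that was never set. The api_ns bodies are stand-ins for the real external API, and assigned() and errors() are hypothetical helpers. These wrappers keep the original signatures, so they do not receive file and line information.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

namespace api_ns {
    // Stand-in for the external API, backed by a simple map.
    inline std::map<short, long>& store() { static std::map<short, long> m; return m; }
    inline long GetAmount(short Idx) { return store()[Idx]; }
    inline void SetAmount(short Idx, long amount) { store()[Idx] = amount; }
}

namespace api_wrapped_ns {
    // Indices that SetAmount has been called with so far.
    inline std::set<short>& assigned() { static std::set<short> s; return s; }
    // Error log for suspicious API usage.
    inline std::vector<std::string>& errors() { static std::vector<std::string> e; return e; }

    inline void SetAmount(short Idx, long amount) {
        assigned().insert(Idx);
        api_ns::SetAmount(Idx, amount);
    }

    inline long GetAmount(short Idx) {
        if (assigned().count(Idx) == 0) {
            errors().push_back("GetAmount(" + std::to_string(Idx) +
                               ") called before SetAmount");
        }
        return api_ns::GetAmount(Idx);
    }
}
```

Legacy code would then switch from "using namespace api_ns;" to "using namespace api_wrapped_ns;" in stdafx.h, and its own SetString overloads would be left untouched because no macro is involved.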

C++ std::string and NULL const char*

I am working in C++ with two large pieces of code, one done in "C style" and one in "C++ style".
The C-type code has functions that return const char* and the C++ code has in numerous places things like
const char* somecstylefunction();
...
std::string imacppstring = somecstylefunction();
where it is constructing the string from a const char* returned by the C style code.
This worked until the C style code changed and started returning NULL pointers sometimes. This of course causes seg faults.
There is a lot of code around, so I would like the most parsimonious fix for this problem. The expected behavior is that imacppstring would be the empty string in this case. Is there a nice, slick solution to this?
Update
The const char* returned by these functions are always pointers to static strings. They were used mostly to pass informative messages (destined for logging most likely) about any unexpected behavior in the function. It was decided that having these return NULL on "nothing to report" was nice, because then you could use the return value as a conditional, i.e.
if (somecstylefunction()) do_something;
whereas before the functions returned the static string "";
Whether or not this was a good idea, I'm not going to touch this code, and it's not up to me anyway.
What I wanted to avoid was tracking down every string initialization to add a wrapper function.
Probably the best thing to do is to restore the C library functions to their behavior before the breaking change, but maybe you don't have control over that library.
The second thing to consider is to change all the instances where you're depending on the C lib functions returning an empty string to use a wrapper function that'll 'fix up' the NULL pointers:
const char* nullToEmpty( char const* s)
{
return (s ? s : "");
}
So now
std::string imacppstring = somecstylefunction();
might look like:
std::string imacppstring( nullToEmpty( somecstylefunction()));
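A compilable version of this, with two made-up stand-ins for the C-style functions (reportsNothing() and reportsProblem() are hypothetical), could look like this:

```cpp
#include <cassert>
#include <string>

// Maps a null pointer to an empty C string so std::string never sees NULL.
inline const char* nullToEmpty(const char* s) {
    return s ? s : "";
}

// Hypothetical stand-ins for the C-style functions from the question.
inline const char* reportsNothing() { return nullptr; }      // "nothing to report"
inline const char* reportsProblem() { return "disk full"; }  // a static message
```

Since the returned pointers refer to static strings, handing back the literal "" does not introduce any lifetime problem.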
If that's unacceptable (it might be a lot of busy work, but it should be a one-time mechanical change), you could implement a 'parallel' library that has the same names as the C lib you're currently using, with those functions simply calling the original C lib functions and fixing the NULL pointers as appropriate. You'd need to play some tricky games with headers, the linker, and/or C++ namespaces to get this to work, and this has a huge potential for causing confusion down the road, so I'd think hard before going down that road.
But something like the following might get you started:
// .h file for a C++ wrapper for the C Lib
namespace clib_fixer {
const char* somecstylefunction();
}
// .cpp file for a C++ wrapper for the C Lib
namespace clib_fixer {
const char* somecstylefunction() {
const char* p = ::somecstylefunction();
return (p ? p : "");
}
}
Now you just have to add that header to the .cpp files that are currently calling the C lib functions (and probably remove the header for the C lib) and add a
using namespace clib_fixer;
to the .cpp file using those functions.
That might not be too bad. Maybe.
Well, without changing every place where a C++ std::string is initialized directly from a C function call (to add the null-pointer check), the only solution would be to prohibit your C functions from returning null pointers.
With GCC, you can use the compiler extension "Conditionals with Omitted Operands" to create a wrapper macro for your C function:
#define somecstylefunction() (somecstylefunction() ? : "")
but in the general case I would advise against that.
I suppose you could just add a wrapper function which tests for NULL, and returns an empty std::string. But more importantly, why are your C functions now returning NULL? What does a NULL pointer indicate? If it indicates a serious error, you might want your wrapper function to throw an exception.
Or to be safe, you could just check for NULL, handle the NULL case, and only then construct an std::string.
const char* s = somecstylefunction();
if (!s) explode();
std::string str(s);
For a portable solution:
(a) Define your own string type. The biggest part is a search and replace over the entire project, which can be simple if it's always std::string, or a big one-time pain. (I'd make the sole requirement that it's Liskov-substitutable for a std::string, but also constructs an empty string from a null char *.)
The easiest implementation is inheriting publicly from std::string. Even though that's frowned upon (for understandable reasons), it would be OK in this case, and would also help with 3rd party libraries expecting a std::string, as well as debug tools. Alternatively, aggregate and forward - yuck.
(b) #define std::string to be your own string type. Risky, not recommended. I wouldn't do it unless I knew the codebases involved very well and it saved tons of work (and I'd add some disclaimers to protect the remains of my reputation ;))
(c) I've worked around a few such cases by re-#define'ing the offensive type to some utility class only for the purpose of the include (so the #define is much more limited in scope). However, I have no idea how to do that for a char *.
(d) Write an import wrapper. If the C library headers have a rather regular layout, and/or you know someone who has some experience parsing C++ code, you might be able to generate a "wrapper header".
(e) ask the library owner to make the "Null string" value configurable at least at compile time. (An acceptable request since switching to 0 can break compatibility as well in other scenarios) You might even offer to submit the change yourself if that's less work for you!
You could wrap all your calls to C-style functions in something like this...
std::string makeCppString(const char* cStr)
{
return cStr ? std::string(cStr) : std::string("");
}
Then wherever you have:
std::string imacppstring = somecstylefunction();
replace it with:
std::string imacppstring = makeCppString( somecstylefunction() );
Of course, this assumes that constructing an empty string is acceptable behavior when your function returns NULL.
I don't generally advocate subclassing standard containers, but in this case it might work.
class mystring : public std::string
{
public:
// ... appropriate constructors are an exercise left to the reader
mystring & operator=(const char * right)
{
if (right == NULL)
{
clear();
}
else
{
std::string::operator=(right); // I think this works, didn't check it...
}
return *this;
}
};
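One possible way to fill in the constructors is sketched below. Which constructors count as "appropriate" is an assumption here; only the three that the question's usage needs are provided.

```cpp
#include <cassert>
#include <string>

class mystring : public std::string
{
public:
    mystring() {}
    // Converting constructor: a null pointer becomes the empty string.
    mystring(const char* s) : std::string(s ? s : "") {}
    mystring(const std::string& s) : std::string(s) {}

    mystring& operator=(const char* right)
    {
        if (right == nullptr)
            clear();
        else
            std::string::operator=(right);
        return *this;
    }
};
```

With the converting constructor in place, "mystring imacppstring = somecstylefunction();" compiles and yields an empty string for NULL. Because std::string has no virtual destructor, objects of this class should not be deleted through a std::string pointer.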
Something like this should fix your problem.
const char *cString;
std::string imacppstring;
cString = somecstylefunction();
if (cString == NULL) {
imacppstring = "";
} else {
imacppstring = cString;
}
If you want, you could stick the error checking logic in its own function. You'd have to put this code block in fewer places, then.