Accessing tokenization of a C++ source file

My understanding is that one step of the compilation of a program (irrespective of the language, I guess) is scanning the source file into some kind of tokens (this tokenization would be performed by what's referred to as the scanner in this answer). For instance, I understand that at some point in the compilation process, a line containing x += fun(nullptr); is separated into something like
x
+=
fun
(
nullptr
)
;
Is this true? If so, is there a way to have access to this tokenization of a C++ source code?
I'm asking this question mostly out of curiosity, and I do not intend to write a lexer myself.
And the reason I'm curious to know whether one can leverage the compiler is that, to give an example, before meeting [[noreturn]] & Co. I would never have considered [[ a valid token, if I were to write a lexer myself.
Do we necessarily need a true, actual use case? I think we don't: I'm simply curious whether there's an existing tool to do something like this.
However, if we really need a use case,
let's say my target is to write a C++ function which reads in a C++ source file and returns a std::vector of the lexemes it's made up of. Clearly, a requirement is that concatenating the elements of the output should reproduce the whole text, including line breaks and every other byte of it.

With the restriction mentioned in the comments (a tokenization that keeps __DATE__ as a single token), it seems rather manageable: what you need are the preprocessing tokens. The Boost.Wave preprocessor necessarily creates such a token list, because it has to work on those tokens.
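For instance, here is a minimal sketch along the lines of the lexed_tokens sample in the Boost.Wave documentation (the function name is mine, and you'd link against the Boost.Wave library; treat this as illustrative rather than definitive):

// Minimal sketch: lex a C++ file into preprocessing tokens with the
// Boost.Wave C++ lexer, without running the preprocessor, so __DATE__
// stays a single token. Whitespace, newlines and comments are tokens
// too, so concatenating all values reconstructs the original text.
#include <boost/wave/cpplexer/cpp_lex_token.hpp>
#include <boost/wave/cpplexer/cpp_lex_iterator.hpp>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

std::vector<std::string> lexemes_of(const std::string& path)
{
    std::ifstream in(path);
    std::string src(std::istreambuf_iterator<char>(in), {});

    using token_type = boost::wave::cpplexer::lex_token<>;
    using lexer_type = boost::wave::cpplexer::lex_iterator<token_type>;

    lexer_type it(src.begin(), src.end(),
                  token_type::position_type(path.c_str()),
                  boost::wave::language_support(boost::wave::support_cpp));
    lexer_type end;

    std::vector<std::string> lexemes;
    for (; it != end; ++it)
        lexemes.push_back((*it).get_value().c_str());
    return lexemes;
}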
Basile correctly points out that it's hard to assign a meaning to those tokens.

C++ is a very complex programming language.
Be sure to read the C++11 draft standard n3337 before even attempting to parse C++ code.
Look inside the source code of existing open source C++ compilers, such as GCC (at least GCC 10 in October 2020) or Clang (at least Clang 10 in October 2020).
If you have to write your C++ parser from scratch, be sure to have the budget for at least a full person-year of work.
Look also into existing C++ static source code analyzers, such as Frama-C++ or the Clang static analyzer. Consider adapting one of them to your needs, but document your needs in writing before starting to code. Be aware of Rice's theorem.
If you want to parse a small subset of C++ (you'll need to document and specify that subset), consider using parser generators like ANTLR or GNU bison.
Most compilers build some internal representations, in particular some abstract syntax tree. Read the Dragon book for more.
I would suggest instead writing your own GCC plugin.
Indeed, it would be tied to some major version of GCC, but you'll save months of work.
Is this true? If so, is there a way to have access to this tokenization of a C++ source code?
Yes, by patching some existing open source C++ compiler, or extending it with your plugin (there are licensing conditions related to both approaches).
let's say my target is to write a C++ function which reads in a C++ source file and returns a std::vector of the lexemes it's made up of.
The above specification is ambiguous.
Do you want the lexeme before or after the C++ preprocessing phase? In other words, what would be the lexeme for e.g. __DATE__ or __TIME__? Read e.g. the documentation of GNU cpp. If you happen to use GCC on Linux (see gcc(1)) and have some C++ translation unit foo.cc, try running g++ -C -E -Wall foo.cc > foo.ii and look (using less(1)) into the generated preprocessed form foo.ii. And what about template expansion, preprocessor conditionals, or preprocessor stringizing?
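To make the ambiguity concrete, consider a hypothetical two-line file (the expansion shown is merely the kind of thing a preprocessor emits on some build date):

// foo.cc before preprocessing: __DATE__ is a single preprocessing token.
const char *built = __DATE__ " " __TIME__;

// After g++ -E foo.cc it is ordinary string-literal tokens, e.g.:
//   const char *built = "Oct 12 2020" " " "09:15:47";
// A vector of lexemes taken before and after preprocessing thus differs.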
I would suggest writing your GCC plugin to work on GENERIC representations. You could also start PhD work related to your goals.
Notice that generating C++ code is a lot easier than parsing it.
Look inside Qt for an example of software generating C++ code. You could consider using GNU m4, GNU gawk, GNU autoconf, GPP, or your own C++ source generator (perhaps with the help of GNU bison or of ANTLR) to generate some of your C++ code.
PS. On my home page you'll find a hyperlink to some draft report related to your question, and another hyperlink to an open source program generating C++ code. It sadly seems that I am forbidden here to give these hyperlinks, but you could find them in two mouse clicks. You might also look into two European H2020 projects funding that draft report: CHARIOT & DECODER.

Related

Ignore missing headers with clang AST parser

I'm on Windows, using MSVC to compile my project, but I need clang for its neat AST parser, which allows me to write a little code generator.
Problem is, clang cannot parse MSVC headers (a very well-known and understandable problem).
I tried two options:
If I include the MSVC header folder, parsing the built-in headers included in my code ends up hitting a fatal error at some point, preventing me from parsing the parts I want correctly.
What I did before was simply not to provide any built-in headers and to forward declare the types I needed. It worked fine, and somehow it doesn't anymore with the latest Clang. I don't really know if the parser policy on missing headers changed, but it causes complete failure every time something like <string> is included, and not much gets parsed.
I am using the Python bindings (libclang), but I would consider switching to the C/C++ API if there were a solution there.
Is there any way I can alter this behavior and make clang continue parsing even when some headers are not found?
Use SetSuppressIncludeNotFoundError. Took me an hour to find! You can imagine how glad I was to find it!
https://clang.llvm.org/doxygen/classclang_1_1Preprocessor.html#ac7bafe67fc32e41460855b39d20ff6af
One way to ignore the errors due to missing headers is to call SetSuppressIncludeNotFoundError(true) in your definition of the ASTFrontendAction. An example is given below.
// Assumed includes and class name: the original snippet omitted them.
#include <clang/AST/ASTConsumer.h>
#include <clang/Frontend/CompilerInstance.h>
#include <clang/Frontend/FrontendAction.h>

class CustomFrontendAction : public clang::ASTFrontendAction
{
public:
    virtual std::unique_ptr<clang::ASTConsumer> CreateASTConsumer(
        clang::CompilerInstance &Compiler, llvm::StringRef InFile)
    {
        // Keep going when an #include cannot be resolved.
        Compiler.getPreprocessor().SetSuppressIncludeNotFoundError(true);
        // CustomASTConsumer is your own clang::ASTConsumer subclass.
        return std::unique_ptr<clang::ASTConsumer>(
            new CustomASTConsumer(&Compiler.getASTContext()));
    }
};
For a complete example using ASTFrontendAction, see https://clang.llvm.org/docs/RAVFrontendAction.html
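As a quick sanity check, here is a hypothetical driver (assuming a recent libTooling where runToolOnCode accepts a unique_ptr, and the CustomFrontendAction name from the sketch above):

// Hypothetical driver: parse an in-memory snippet whose #include cannot
// be resolved. With suppression enabled above, parsing proceeds and
// "int x = 42;" still reaches the AST consumer.
#include <clang/Tooling/Tooling.h>
#include <memory>

int main()
{
    const char *code = "#include <no_such_header.h>\nint x = 42;\n";
    clang::tooling::runToolOnCode(
        std::make_unique<CustomFrontendAction>(), code);
}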
So you want to process C++ code that uses MS headers, and you want access to ASTs so that you can generate code. And Clang won't handle MS headers.
So Clang can't be the answer unless it gets a radical upgrade.
You asked for "any solution that can make this work".
Our DMS Software Reengineering Toolkit with its C++14 Front End can do this.
DMS provides general parsing, AST construction/inspection/transformation/generation, and inverse parsing (conversion of ASTs back into compilable code), parameterized by language definitions.
The C++ front end provides a full C++14 parser, preprocessor handling, AST construction, and full name and type resolution. It has been tested with GCC and MS VS 2013 header files; we're testing with 2015 header files now.
(It handles MS VS 2013 syntax, too.)
It handles the tough parsing cases completely, including C++'s famous "most vexing parse". You can see parse trees at the question "get human readable AST from c++ code".
DMS does not provide Python bindings, nor a direct C++ interface. Rather, it is a standalone tool designed to support the construction of custom tools (e.g., your "little code generator"). It has its own very extensive set of internal APIs, coded in the metaprogramming language PARLANSE, which is LISP-like. Other aspects of DMS are managed by using DSLs for lexers, grammars, and transformations. See below.
A word of caution: any tool that can process C++ is guaranteed to be complex. DMS is correspondingly complex, and it takes a while to learn to use it, so you're not going to get instant answers. The good news here is that some things are easier to do. Your code generation problem is likely "read a skeleton file, and then replace key entries in it with problem-specific code". If that's the case, a DMS tool with the following code (simplified for presentation here) will likely do the trick:
...
(= myAST (Registry:ParseFile (. filename) (. `CppVisualStudio2013') ...)
(Registry:ApplyTransforms myAST (. `MyTransforms.rsl'))
(Registry:PrettyPrint myAST (concat filename `.modified'))
...
with a transforms file MyTransforms.rsl containing source-to-source surface-syntax (e.g., C++ syntax) transformation rules of the conceptual form
rule rulename if_you_see THIS then replace_by ("-->") THAT
An actual C++ rule might look like this (I'm making it up because I don't know your actual code generation goals):
rule replace_abstraction(s: STRING_LITERAL):
" abstraction_place_holder(\s) "
-> " my_DSL_library(\s,17); "
The ApplyTransforms call above will apply all the rules in this file until none apply any further.
Writing surface-syntax transforms, where you can do it, is way easier than making calls on a procedure library (which, like Clang, DMS offers) that hacks at the tree.
You can write more complex metaprograms using PARLANSE to apply some rules in one place, other rules someplace else, and you can mix source-to-source transforms with procedural transforms that hack directly at the tree if you want.
If you want more details on what transforms look like, ask and I'll provide a link.

Getting AST for C++?

I'm looking to get an AST for C++ that I can then parse with an external program. What programs are out there that are good for generating an AST for C++? I don't care what language it is implemented in or the output format (so long as it is readily parseable).
My overall goal is to transform a C++ unit test bed to its corresponding C# wrapper test bed.
You can use clang and especially libclang to parse C++ code. It's a very high-quality, hand-written library for lexing, parsing and compiling C++ code, but it can also generate an AST.
Clang also supports C, Objective-C and Objective-C++. Clang itself is written in C++.
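For a taste of what that looks like from C++, here is a minimal, illustrative libclang dump of every node's kind and spelling (the file name test.cc is a placeholder, and error handling is omitted):

// Recursively print every AST node's kind and spelling via libclang.
#include <clang-c/Index.h>
#include <iostream>

static CXChildVisitResult print_node(CXCursor c, CXCursor /*parent*/,
                                     CXClientData /*data*/)
{
    CXString kind = clang_getCursorKindSpelling(clang_getCursorKind(c));
    CXString name = clang_getCursorSpelling(c);
    std::cout << clang_getCString(kind) << ": "
              << clang_getCString(name) << "\n";
    clang_disposeString(kind);
    clang_disposeString(name);
    return CXChildVisit_Recurse;
}

int main()
{
    CXIndex idx = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        idx, "test.cc", nullptr, 0, nullptr, 0, CXTranslationUnit_None);
    if (tu) {
        clang_visitChildren(clang_getTranslationUnitCursor(tu),
                            print_node, nullptr);
        clang_disposeTranslationUnit(tu);
    }
    clang_disposeIndex(idx);
}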
Actually, GCC will emit the AST at any stage in the pipeline that interests you, including the GENERIC and GIMPLE forms. Check out the (plethora of) command-line switches beginning with -fdump- — e.g. -fdump-tree-original-raw
This is one of the easier (…) ways to work, as you can use it on arbitrary code; just pass the appropriate CFLAGS or CXXFLAGS into most Makefiles:
make CXXFLAGS=-fdump-tree-original-raw all
… and you get “the works.”
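For example, with a toy translation unit like the one below (file name assumed), GCC drops a dump file next to the source:

// foo.cc: a toy input. Building it with the flag above produces a dump
// file (named something like foo.cc.*.original, version dependent).
int square(int x) { return x * x; }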
Updated: Saw this neat little graphing system based on GCC ASTs while checking my flag name :-) Google FTW.
http://digitocero.com/en/blog/exporting-and-visualizing-gccs-abstract-syntax-tree-ast
Our C++ Front End, built on top of our DMS Software Reengineering Toolkit, can parse a variety of C++ dialects (including C++11 and ObjectiveC) and export that AST as an XML document with a command-line switch. See example ASTs produced by this front end.
As a practical matter, you will need more than the AST; you can't really do much with C++ (or any other modern language) without an understanding of the meaning and scope of each identifier. For C++, meaning/scope are particularly ugly. The DMS C++ front end handles all of that; it can build full symbol tables associating identifiers to explicit C++ types. That information isn't dumpable in XML with a command-line switch, but it is "technically easy" to code logic in DMS to walk the symbol table and spit out XML. (There is an option to dump this information, just not in XML format.)
I caution you against the idea of manipulating (or even just analyzing) the XML. First, XSLT isn't a particularly good way to understand the meaning of the ASTs, let alone transform the AST, because the ASTs represent context-sensitive language structures (that's why you want [nee MUST HAVE] the symbol table). You can read the XML into a DOM-like tree if you like and write your own procedural code to manipulate it. But source-to-source transformations are an easier way; you can write your transformations using C++ notation rather than buckets of code goo climbing over a tree data structure.
You'll have another problem: how to generate valid C++ code from the transformed XML. If you don't mind spitting out raw text, you can solve this problem in purely ad hoc ways, at the price of having no guarantee other than sweat that the generated code is syntactically valid. If you want to generate a C++ representation of your final result as an AST, and regenerate valid text from that, you'll need a prettyprinter, which is not technically hard but is still a lot of work to build, especially for a language as big as C++.
Finally, the reason that tools like DMS exist is to provide the vast amount of infrastructure it takes to process/manipulate complex structures such as C++ ASTs (parse, analyze, transform, prettyprint). You can try to replicate all this machinery yourself, but this is usually a poor time/cost/productivity tradeoff. The claim is that it is best to stay within the tool ecosystem rather than escape it and build bad versions of it yourself. If you haven't done this before, you'll find this out painfully.
FWIW, DMS has been used to carry out massive analysis and transformations on C++ source code. See Publications on DMS and check the papers by Akers on "Re-engineering C++ Component Models".
Clang is based on the same kind of philosophy; there's an ecosystem of tools.
YMMV, but I'd be surprised.

Can Boost.Spirit be theoretically/practically used to parse C++(0x) (or any other language)?

Is it theoretically up to the task?
Can it be done practically and would the resulting parser be used with sufficient performance and output (say, LLVM IR or GCC's gimple) to be integrated in a competing compiler?
I'm sorry. I talked to its author, and he said he won't make it parse C++ fully, but admits that it parses certain constructs ambiguously.
So this is not an answer anymore!!
I recommend you have a look at Scalpel. From its homepage:
Scalpel stands for source code analysis, libre and portable library. This is a C++ library which aims to perform full syntax and semantic analysis of any given C++ program.
And
What makes me think Scalpel could be accepted into Boost
Scalpel itself uses several Boost libraries: Spirit, Wave, shared_ptr (now in C++0x's STL), Optional, Test, etc. Actually, it exclusively uses Boost libraries and the C++ standard library, which is required by Boost.
Besides, Boost already provides a Spirit-based C++ source code preprocessing library: Wave. Including a C++ source code analysis library seems to be a natural evolution.
No. C++ is too hard to parse for most automatic tools, and in practice usually is parsed by hand-written parsers.
[Edit 1-Mar-2015: Added 'most' and 'usually'.]
Among the hard problems are:
A * B; which could be either the declaration of a variable B with type A* or just the multiplication of two variables A and B (a sketch illustrating both readings follows this list).
A < B > C > D: where does the template A<> end? The usual 'max-munch' rules for parsing expressions will not work here.
vector<shared_ptr<int>>, where the >> ends two templates, which is hard to do with only one token (and a space in between is allowed). But in 1>>15 no space is allowed.
And I bet that this list is far from complete.
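To make the first and third items concrete, here is a small sketch under assumed names; the same token sequences change meaning with context:

#include <memory>
#include <vector>

// The token sequence "A * B ;" cannot be parsed without knowing what A
// names, which is why parsing C++ needs a symbol table.
namespace as_declaration {
    struct A {};
    void f() { A * B; (void)B; }   // declares B as a pointer to A
}
namespace as_expression {
    int A = 6, B = 7;
    void f() { A * B; }            // multiplies A by B, result discarded
}

// The ">>" item from the list above:
std::vector<std::shared_ptr<int>> v;   // in C++11, >> closes both templates
int shifted = 1 >> 15;                 // here the same token is a right shift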
Addition: The grammar is available, but is ambiguous and thus not valid as input to tools like Spirit.
Update 1-Mar-2015: As Ira Baxter, a well-known expert in this field, points out in the comments, there are some parser generators that can generate a parser producing the full parse forest. As far as I know, selecting the right parse still requires a semantic phase. I'm not aware of any non-commercial parser generators that can do so for C++'s grammar. For more information, see this answer.
For "any other language", I once tried creating a shell-script parser with Spirit. It turned out to be theoretically possible (I believe it would work), but it was not compilable on a machine with 1 GB of memory, so eventually I gave up.

Parsing c++ function headers from a file using GNU toolchain

I need to parse function headers from a .i file used by SWIG, which contains all sorts of garbage besides the function headers. (The final output would be a list of function declarations.)
The best option for me would be using the GNU toolchain (GCC, Binutils, etc.) to do so, but I might be missing an easy way of doing it with SWIG. If I am, please tell me!
Thanks :]
edit: I also don't know how to do that with the GCC toolchain; if you have an idea, that would be great.
I would try getting an XML dump of the abstract syntax tree either from clang or from gccxml. From there you can easily extract the function declarations you are interested in.
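If you go the clang route, a rough libclang sketch for listing just the function declarations might look like this (the file name and the "-x c++" flag are placeholders for your setup; with gccxml you would instead filter the XML output):

// Illustrative only: list function declarations from a file using
// libclang, printing return type and display name (e.g. "f(int)").
#include <clang-c/Index.h>
#include <iostream>

static CXChildVisitResult visit(CXCursor c, CXCursor, CXClientData)
{
    if (clang_getCursorKind(c) == CXCursor_FunctionDecl) {
        CXString type = clang_getTypeSpelling(clang_getCursorResultType(c));
        CXString name = clang_getCursorDisplayName(c);
        std::cout << clang_getCString(type) << " "
                  << clang_getCString(name) << "\n";
        clang_disposeString(type);
        clang_disposeString(name);
    }
    return CXChildVisit_Recurse;
}

int main()
{
    const char *args[] = { "-x", "c++" };   // force C++ mode for a .i file
    CXIndex idx = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        idx, "interface.i", args, 2, nullptr, 0, CXTranslationUnit_None);
    if (tu) {
        clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, nullptr);
        clang_disposeTranslationUnit(tu);
    }
    clang_disposeIndex(idx);
}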
Our DMS Software Reengineering Toolkit provides general purpose program parsing, analysis, and transformation capability. It has front ends for a wide variety of languages, including C++.
It has been used to analyze and transform very complex C++ programs and their header files.
You aren't clear as to what you will do after you "parse the function headers"; normally people want to extract some information or produce another artifact. DMS with its C++ front end can do the parsing; you can configure DMS to do the custom stuff.
As a practical matter, this isn't usually an afternoon's exercise; DMS is a complex beast, because it has to deal with complex beasts such as C++. And I'd expect you to face the same kind of complexity for any tool that can handle C++. The GCC toolchain can clearly handle C++, so you might be able to do it with that (at that same level of complexity) but GCC is designed to be a compiler, and IMHO you will find it a fight to get it do what you want.
Your "output function declarations" goal isn't clear. You want just the function names? You want a function signature? You want all the type declarations on which the function depends? You want all the type declarations on which the function depends, if they are not already present in an existing include file you intend to use?
The best way to extract function decls from the garbage which is C header files is to substitute out what constitutes the most smelly garbage: macros. You can do that with:
cpp - The C Preprocessor

Is there a better (more modern) tool than lex/flex for generating a tokenizer for C++?

I recently added source-file parsing to an existing tool that generated output files from complex command-line arguments.
The command line arguments got to be so complex that we started allowing them to be supplied as a file that was parsed as if it was a very large command line, but the syntax was still awkward. So I added the ability to parse a source file using a more reasonable syntax.
I used flex 2.5.4 for Windows to generate the tokenizer for this custom source-file format, and it worked. But I hated the code. Global variables, weird naming conventions, and the C++ code it generated was awful. The existing code-generation backend was glued to the output of flex - I don't use yacc or bison.
I'm about to dive back into that code, and I'd like to use a better/more modern tool. Does anyone know of something that:
Runs in Windows command prompt (Visual studio integration is ok, but I use make files to build)
Generates a proper encapsulated C++ tokenizer. (No global variables)
Uses regular expressions for describing the tokenizing rules (compatibility with lex syntax is a plus)
Does not force me to use the C runtime (or fake it) for file reading (parses from memory)
Warns me when my rules force the tokenizer to backtrack (or fixes it automatically)
Gives me full control over variable and method names (so I can conform to my existing naming convention)
Allows me to link multiple parsers into a single .exe without name collisions
Can generate a Unicode (16-bit UCS-2) parser if I want it to
Is NOT an integrated tokenizer + parser-generator (I want a lex replacement, not a lex+yacc replacement)
I could probably live with a tool that just generated the tokenizing tables if that was the only thing available.
Ragel (http://www.complang.org/ragel/) fits most of your requirements.
It runs on Windows
It doesn't declare the variables, so you can put them inside a class or inside a function as you like.
It has nice tools for analyzing regular expressions to see when they would backtrack. (I don't know about this very much, since I never use syntax in Ragel that would create a backtracking parser.)
Variable names can't be changed.
Table names are prefixed with the machine name, and they're declared "const static", so you can put more than one in the same file and have more than one with the same name in a single program (as long as they're in different files).
You can declare the variables as any integer type, including UChar (or whatever UTF-16 type you prefer). It doesn't automatically handle surrogate pairs, though. It doesn't have special character classes for Unicode either (I think).
It only does regular expressions... has no bison/yacc features.
The code it generates interferes very little with a program. The code is also incredibly fast, and the Ragel syntax is more flexible and readable than anything I've ever seen. It's a rock solid piece of software. It can generate a table-driven parser or a goto-driven parser.
Flex also has a C++ output option.
The result is a set of classes that do that parsing.
Just add the following to the head of your lex file:
%option c++
%option yyclass="Lexer"
Then in your source it is:
#include <fstream>

std::fstream file("config");
Lexer lexer(&file);   // Lexer is the class named by %option yyclass
while (int token = lexer.yylex())
{
    // handle each token here
}
Boost.Spirit.Qi (parser-tokenizer) or Boost.Spirit.Lex (tokenizer only). I absolutely love Qi, and Lex is not bad either, but I just tend to take Qi for my parsing needs...
The only real drawback with Qi tends to be an increase in compile time, and it also runs slightly slower than hand-written parsing code. It is generally much faster than parsing with regexes, though.
http://www.boost.org/doc/libs/1_41_0/libs/spirit/doc/html/index.html
There are two tools that come to mind, although you would need to find out for yourself which would be suitable: ANTLR and GoldParser. There are language bindings available for both tools, so they can be plugged into a C++ runtime environment.
Boost.Spirit and the YARD parser come to mind. Note that the approach of having lexer generators has been somewhat superseded by C++ internal DSLs (domain-specific languages) for specifying tokens: the grammar is simply part of your code, without using an external utility, just by following a series of rules.
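To illustrate the internal-DSL style, here is the classic number-list example from the Spirit.Qi tutorial; it's a parser rather than a tokenizer, but it shows the grammar written directly as C++ code:

// Classic Spirit.Qi tutorial example: parse a comma-separated list of
// doubles, with the grammar written inline as a C++ expression.
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    namespace qi = boost::spirit::qi;
    std::string s = "1.5, 2.5, 3.5";
    std::vector<double> v;
    auto first = s.begin();
    bool ok = qi::phrase_parse(first, s.end(),
                               qi::double_ % ',',           // grammar
                               boost::spirit::ascii::space, // skipper
                               v);
    if (ok && first == s.end())
        std::cout << "parsed " << v.size() << " numbers\n";
}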