Tokenization and AST - c++

Have a rather abstract question for you all. I'm looking at getting involved in a static code analysis project. It uses C and C++ as the language to develop in so if any code could be in either of those languages in your response that would be great.
My question:
I need to understand some of the basic concepts/constructs used to process code for static analysis. I have heard people use things like AST and tokenization etc. I was just wondering if anything can clarify how these things are applied in creating a static analysis tool? Id like more of an explanation of tokenization as I dont really understand that so well. I understand it is a sort of way to process strings but I'm not confident in that answer. Furthermore, I know that the project I'm looking at passes the code through the preprocessor before it is analyzed. Can anyone explain this? Surely if it is static code analysis it needn't be preprocessed?
Hope someone can clear this up for me.
Cheers.

Tokenization is the act of breaking source text into language elements such as operators, variable names, numbers, etc. Parsing reads sequences of tokens and build Abstract Syntax Trees, which is a particular program representation. Tokenization and parsing are necessary for static analysis, but hardly interesting, in the same way that ante-to-the-pot is necessary to playing poker but not the interesting part of the game in any way.
If you are building a static analyzer (you imply you expect to work on one implemented in C or C++), you will need fundamental knowledge of compiling (parsing not so much unless you are building a parser for the language to be statically analyzed), but certainly about program representations (ASTs, triples, control and data flow graphs, ...), type and property inference, and limits on analysis accuracy (the cause of conservative analyses.
The program representations are fundamental because these are the data structures that most static analysers really process; its simply too hard to wring useful facts directly from program text. These concepts can be used to implement static analysis capabilities in any programming language to implement analysis type tools; there's nothing special in implementing them in C or C++.
Run, don't walk, to your nearest compiler class for the first part of this. If you don't have it, you won't be able to do anything effective in tool building. The second part you will more likely find in a graduate computer science class.
If you get past that basic knowledge issue, you will either decide to implement an analysis tool from scratch, or build on existing analysis tool infrastructure. Few people decide to build one from scratch; it takes a huge amount of work (years or decades) to build robust parsers, flow analyzers, etc. needed as foundations for any particular static analysis. Mostly people try to use some existing infrastructure.
There's a huge list of candidates at: http://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis
If you insist on processing C or C++ and building your own custom sophisticated analysis, you really, really need a tool that can handle real C and C++ code. There are IMHO a limited number of good candidates:
GCC (and various graft ons such as Starynkevitch's MELT, about which I know little)
Clang (pretty spectacular toolset)
DMS (and its C and C++ front ends) [my company's tool]
Open64 compiler infrastructure
Rose compiler infrastructure (based on EDG's industry-wide front end)
Each one of these are big systems and require a big investment to understand and begin to use. Don't underrate the learning curve.
There are lots of other tools out there that sort of process C and C++, but "sort of" is pretty useless for static analysis purposes.
If you intend to simply use a static analysis tool, you can avoid learning most of the parsing and program representation questions; instead you'll need to learn as much as you can about the specific analysis tool you intend to use. You'll still be a lot better off with the compiler background above because you will understand what the tool does, why it does it, and why it produces the kinds of answers that it does (usually it produces answers that are unsatisfying in a lot of ways due to the conservative limits on analysis accuracy).
Lastly, you should be clear that you understand the difference between static analysis and dynamic analysis [using data collected at runtime to determine program properties]. Most end users couldn't care less how you get information about their code, and each analysis approach has its strength and weaknesses.

If you are interested in static analysis, you should not bother about syntax analysis (i.e. lexing, parsing, building the abstract syntax tree), because that won't learn you much.
So I suggest you to use some existing parsing infrastructure. This is particularly true if you want to analyze C++ source code, because just coding a C++ parser is a difficult thing to do.
For example, you could leverage on the GCC compiler, or on the LLVM/Clang compiler. Be aware of their open source license: GCC is under GPLv3, and Clang/LLVM has a BSD-like license. GCC is able to handle not only C & C++, but also Fortran, Go, Ada, Objective-C.
Because of the GPLv3 license of GCC, your development above it should be also free software under GPLv3. However, that also means that the GCC community is quite big. If you know Ocaml and are interested of static analysis of C only, you could consider Frama-C (LGPL licensed)
Actually, I am working on GCC, by providing the MELT extension framework (MELT is also GPLv3 licensed). MELT is a high-level domain specific language to extend GCC and should be the ideal choice to work on static analysis using the powerful framework and internal representations (Gimple, Tree, ...) of GCC.
(Of course, LLVM folks would be delighted to have their work used too; and nobody knows well both LLVM and GCC, because learning such big software is already a challenge, so people have to choose on which they invest their time).
So my suggestion is to join an existing static analysis or compiler effort, don't reinvent the wheel by starting from scratch (because that means years of effort, just to have a good parser).
By so doing, you will take advantage of the existing infrastructure (so you won't bother about AST-s and parsing) and you will improve some existing project.
I would be very delighted if you are interested in MELT. FWIW, there is a MELT tutorial at Paris, on january 24th 2012, HiPEAC conference.

If you are looking into static analysis for C & C++, your first stop should be libclang, along with its associated papers.
In terms of why the preprocessor is run, static analysis is meant to catch errors and possible unintended code functionality (very basically), this it must analyse what the compiler will be compiling, not what the user sees.
I'm pretty sure #Ira Baxter will popup with a far more complete answer :)

Related

Building a simple DSL in Haskell with C++ interop

I'm designing a simple interpreted language for testing real-time embedded systems. The control flow is severely restricted to provide strong static guarantees on what the scripts will do + how long they will run. For example, you can only branch on constant conditionals or loop over fixed ranges.
There is a large existing codebase in C++ with relevant models and IO libs, so this language must be able to call into C++. The systems under test have hard timing requirements, so we can't tolerate much jitter in the test framework. Our past solution was a custom DSL embedded in the C++ runtime, but we ended up re-inventing too many wheels (parser, linter, interactive interpreter, etc..) to achieve the static guarantees we need.
Haskell's facilities for crafting embedded DSLs with these guarantees are extremely appealing to me, but I'm stuck in determining how to embed it into the soft real-time C++ runtime. Any ideas? Pointers to any libraries / existing projects would be greatly appreciated!
Sounds like the path of least resistance would be an EDSL that generates C++. This way, you don't have to worry about the potential mismatch between soft real-time and the GHC RTS.
You might look at how other EDSLs that generate PLs are implemented:
HJScript uses a free monad approach to embedding JS.
JMacro uses more of an external DSL approach but embedded via TH. Wouldn't be my choice.
Instead of generating strings of C++ code, it's nice to have a data structure. Unfortunately, there doesn't seem to be a package available for C++. However, you could take a look at language-c — perhaps extend that or build your own. You might even consider generating C and using the C to C++ interop provided by those languages.
I'd probably dissuade you from looking at the design of Cryptol or Cogent as these are fully-fledged programming languages (that you have indicated you are inclined to steer away from).

Current state of the art for c++ parsers?

I understand that it's a very hard thing to do, what with #ifdef, #define, and templates, but what is the state of the art of c++ parsers (be it open source, or proprietary?).
I mean, for a university project I'm thinking of creating a tool for analysing c++ code bases, but it seems very hard to find a good parser for it.
Should I give up and settle for java parsers? Similarly, what's the state of the art for java parsers? What about c#?
Also, would ripping the parser part of g++ apart from it ever work for the purposes of code analysis, or is it too much effort trying to do so?
You're in luck! Clang just started being able to parse most c++ programs within the last few months: http://clang.llvm.org/ It's one of the few open source parsers actually able to parse most of C++. (Mostly just GCC and CLANG, I hear Oink(?) Can get pretty good sometimes) And it's built to be used as a library by IDEs and the like, even has architecture built to support code rewriting.
There are some proprietary parsers that get the job done, But none of them are really usable without source access.
Regarding ripping apart gcc, That's not very practical for code analysis depending on what you are trying to do, you could use the new plugin architecture to get some usable information out of it, however at a very early level in parsing, it does something called term folding, where the parser itself will optimize out things like "x = x" (A simplistic example) And other aspects of the compiler expects this to happen, so it's not trivial to remove. Thus making gcc nearly useless for anything resembling source rewriting.
For C++ you can use GCC with -fdump-translation-unit & friends option to get AST from it.
See: http://www.manpagez.com/man/1/g++/
If you can compile something by g++ then you can get tree from it.
The industry stardard C++ parser, widely used in compilers, in EDG's C++ front end. I have no experience with this; but I understand it handles a huge variety of C++ dialects. I understand you can get it free for research purposes.
The open source standard is the GCC compiler. I hear is it difficult to understand and modify.
There's CLANG as mentioned in other answers. I have no experience here. My understanding is that it is fairly sophisticated especially in terms of supporting analysis.
Our proprietary DMS Software Reengineering Toolkit has full C++ parser with full name and type resolution, preprocessor expansion (or retention, which the other tools will not do). The C++ front end handles several dialects of C++: ANSI, GCC, MS Visual Studio. As you might guess, I have a lot of experience with this one.
DMS/CppFrontEnd has been used to carry out program analyses as well as massive program source-to-source transformations on C++ code, enabled by DMS's pattern parser, which will parse any fragment of C++ code. I believe the other C++ front ends don't provide source-to-source transformations. With those you can likely hack at the ASTs procedurally, but this is pretty inconvenient because you have to know the precise AST structure and for C++ this is pretty complicated.
DMS also has full C, Java and COBOL front ends with name and type resolution as well as control and data flow analysis. It has parsers (but not name and type analysis) for many other langauges, including C#.
AFAIK, the other "C++ parsers" can't do this, sort of by definition. One can apply source-to-source transformations on any of these, or any mixture of these.
clang is worth looking into. it's fast and they provide apis to hook into their backend.
Xcode 4 uses clang for tasks such as parsing, error reporting/detection in some cases, auto-completion, and fix-its.

Need C++ parser

I need a good, stable and, maybe, easy to use C++ parser library with C/C++ interface (C is preferred).
I hear that cint is good c++ interpreter. Can I use it (or some part of it) for this purpose?
Any suggestions?
See: http://clang.llvm.org/
It has both a C++ and a C interface (libclang).
C++ parsing is famously hard. AFAIK there are only three parsers that are acceptable by todays standards: EDG (widely used as a frontend in popular C++ compilers), GCC's and Microsoft's. And apparently, Microsoft has started using EDG's parser in VS2010, for Intellisense.
When you're looking at the free options, you're pretty much stuck at GCC. It can produce XML, though, so the easy part is there. (Easy by C++ parsing standards, that is)
Clang is the most up-to-date and mature option, with a decent C++ API (but no plain C). Elsa is a bit out of date and unmaintained, but still a usable choice. Both could be used as libraries as well as standalone XML frontends.
If you want to parse C or C++ code, there are some options:
http://bellard.org/tcc/
http://students.ceid.upatras.gr/~sxanth/ncc/
If you want to create a parser using C/C++, you can try:
http://boost-spirit.com/home/
http://dinosaur.compilertools.net/ Lex and Yacc
http://www.codeguru.com/csharp/.net/net_general/patterns/article.php/c12805 Flex and Bison
Our C++ Front End is able to parse a variety of C++ dialects (ANSI, GCC, MSVS), automatically builds ASTs whose nodes are marked with precise source positions and are decorated with any nearby comment text, and builds a full symbol table. (EDIT Jan 2013: the C++ front end has been able to handle C++11 for quite awhile now).
The C++ front end is built on top of our DMS Software Reengineering Toolkit, generalized compiler technology for program analysis and transformation, designed to support custom tool building. The C++ front end includes a preprocessor, in which the preprocessor directives can be expanded or not collectively or individually as appropriate for the task. It also includes full symbol construction with all the nasty Koenig lookup stuff.
DMS accepts explicit language definitions (that's how it understands C++; there are also fron ends for C, C#, Java, COBOL, and variety of other languages). DMS provides general parsing, symbol table building, flow analysis machinery, procedural APIs for tree navigation/inspection/modification, source-to-source transformation, and AST-to-source text regeneration including the original comments, number radices, etc. All of these capabilities are available for use by the C++ Front End.
DMS is also designed to handle the scale required for serious tasks. Often you need not just one compilation unit (which is what GCC will give you at best) but access to an entire set. DMS has been used to analyze/transform thousands of C++ compilation units, and literally tens of thousands of C compilation units (on a 25 million line application).
"Easy to use library" is an oxymoron when it comes to program manipulation tools. The langauges themselves are complex (C++ being one of the most difficult and getting worse with C++0X) and that induces complexity in the nature of the questions you can ask and what the answers look like (e.g. "are there any template instantions that can modify local variable X in method Y in class C in any namespace N?"). The questions themselves are hard.
What you want is a library with the necessary complexity to let you carry off your task. DMS has been under continuous development for the last 15 years, to provide that necessary complexity. If you want to do serious program processing, I claim you will need that information.
As proof, DMS has been used to carry out massive automated reengineering of C++-based mission avionics software for Boeing. I don't believe there are any other tools that can do this. (Clang looks to be trying, but only for C++. YMMV).
I don't know for cint, but I heard people use gcc-xml for this.
I have been looking for a good stand-alone library too, but haven't found any.
If you're feeling brave the links in the answer to "is there a yacc-able C++ grammar?" might be helpful. Gcc-xml and clang have already been suggested and Swig also has an XML output which depending on what you're trying to achieve might be relevant.
I did not try it, but I think that best choice will be getting modules for parsing from some popular open source compiler like gcc for C++;
Maybe you'll find something interesting here http://www.nobugs.org/developer/parsingcpp/

Good tools for creating a C/C++ parser/analyzer [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 8 years ago.
Improve this question
What are some good tools for getting a quick start for parsing and analyzing C/C++ code?
In particular, I'm looking for open source tools that handle the C/C++ preprocessor and language. Preferably, these tools would use lex/yacc (or flex/bison) for the grammar, and not be too complicated. They should handle the latest ANSI C/C++ definitions.
Here's what I've found so far, but haven't looked at them in detail (thoughts?):
CScope - Old-school C analyzer. Doesn't seem to do a full parse, though. Described as a glorified 'grep' for finding C functions.
GCC - Everybody's favorite open source compiler. Very complicated, but seems to do it all. There's a related project for creating GCC extensions called GEM, but hasn't been updated since GCC 4.1 (2006).
PUMA - The PUre MAnipulator. (from the page: "The intention of this project is to
provide a library of classes for the analysis and manipulation of C/C++ sources. For this
purpose PUMA provides classes for scanning, parsing and of course manipulating C/C++
sources."). This looks promising, but hasn't been updated since 2001. Apparently PUMA has been incorporated into AspectC++, but even this project hasn't been updated since 2006.
Various C/C++ raw grammars. You can get c-c++-grammars-1.2.tar.gz, but this has been unmaintained since 1997. A little Google searching pulls up other basic lex/yacc grammars that could serve as a starting place.
Any others?
I'm hoping to use this as a starting point for translating C/C++ source into a new toy language.
Thanks!
-Matt
(Added 2/9): Just a clarification: I want to extract semantic information from the preprocessor in addition to the C/C++ code itself. I don't want "#define foo 42" to disappear into the integer "42", but remain attached to the name "foo". This, unfortunately, excludes several solutions that run the preprocessor first and only deliver the C/C++ parse tree)
Parsing C++ is extremely hard because the grammar is undecidable. To quote Yossi Kreinin:
Outstandingly complicated grammar
"Outstandingly" should be interpreted literally, because all popular languages have context-free (or "nearly" context-free) grammars, while C++ has undecidable grammar. If you like compilers and parsers, you probably know what this means. If you're not into this kind of thing, there's a simple example showing the problem with parsing C++: is AA BB(CC); an object definition or a function declaration? It turns out that the answer depends heavily on the code before the statement - the "context". This shows (on an intuitive level) that the C++ grammar is quite context-sensitive.
You can look at clang that uses llvm for parsing.
Support C++ fully now link
The ANTLR parser generator has a grammar for C/C++ as well as the preprocessor. I've never used it so I can't say how complete its parsing of C++ is going to be. ANTLR itself has been a useful tool for me on a couple of occasions for parsing much simpler languages.
Depending on your problem GCCXML might be your answer.
Basically it parses the source using GCC and then gives you easily digestible XML of parse tree.
With GCCXML you are done once and for all.
pycparser is a complete parser for C (C99) written in Python. It has a fully configurable AST backend, so it's being used as a basis for any kind of language processing you might need.
Doesn't support C++, though. Granted, it's much harder than C.
Update (2012): at this time the answer, without any doubt, would be Clang - it's modular, supports the full C++ (with many C++-11 features) and has a relatively friendly code base. It also has a C API for bindings to high-level languages (i.e. for Python).
Have a look at how doxygen works, full source code is available and it's flex-based.
A misleading candidate is GOLD which is a free Windows-based parser toolkit explicitly for creating translators. Their list of supported languages refers to the languages in which one can implement parsers, not the list of supported parse grammars.
They only have grammars for C and C#, no C++.
Parsing C++ is a very complex challenge.
There's the Boost/Spirit framework, and a couple of years ago they did play with the idea of implementing a C++ parser, but it's far from complete.
Fully and properly parsing ISO C++ is far from trivial, and there were in fact many related efforts. But it is an inherently complex job that isn't easily accomplished, without rewriting a full compiler frontend understanding all of C++ and the preprocessor. A pre-processor implementation called "wave" is available from the Spirit folks.
That said, you might want to have a look at pork/oink (elsa-based), which is a C++ parser toolkit specifically meant to be used for source code transformation purposes, it is being used by the Mozilla project to do large-scale static source code analysis and automated code rewriting, the most interesting part is that it not only supports most of C++, but also the preprocessor itself!
On the other hand there's indeed one single proprietary solution available: the EDG frontend, which can be used for pretty much all C++ related efforts.
Personally, I would check out the elsa-based pork/oink suite which is used at Mozilla, apart from that, the FSF has now approved work on gcc plugins using the runtime library license, thus I'd assume that things are going to change rapidly, once people can easily leverage the gcc-based C++ parser for such purposes using binary plugins.
So, in a nutshell: if you the bucks: EDG, if you need something free/open source now: else/oink are fairly promising, if you have some time, you might want to use gcc for your project.
Another option just for C code is cscout.
The grammar for C++ is sort of notoriously hairy. There's a good thread at Lambda about it, but the gist is that C++ grammar can require arbitrarily much lookahead.
For the kind of thing I imagine you might be doing, I'd think about hacking either Gnu CC, or Splint. Gnu CC in particular does separate out the language generation part pretty thoroughly, so you might be best off building a new g++ backend.
Actually, PUMA and AspectC++ are still both actively maintained and updated. I was looking into using AspectC++ and was wondering about the lack of updates myself. I e-mailed the author who said that both AspectC++ and PUMA are still being developed. You can get to source code through SVN https://svn.aspectc.org/repos/ or you can get regular binary builds at http://akut.aspectc.org. As with a lot of excellent c++ projects these days, the author doesn't have time to keep up with web page maintenance. Makes sense if you've got a full time job and a life.
how about something easier to comprehend like tiny-C or Small C
Elsa beats everything else I know hands down for C++ parsing, even though it is not 100% compliant. I'm a fan. There's a module that prints out C++, so that may be a good starting point for your toy project.
See our C++ Front End
for a full-featured C++ parser: builds ASTs, symbol tables, does name
and type resolution. You can even parse and retain the preprocessor
directives. The C++ front end is built on top of our DMS Software Reengineering
Toolkit, which allows you to use that information to carry out arbitrary
source code changes using source-to-source transformations.
DMS is the ideal engine for implementing such a translator.
Having said that, I don't see much point in your imagined task; I don't
see much value in trying to replace C++, and you'll find building
a complete translator an enormous amount of work, especially if your
target is a "toy" language. And there is likely little point in
parsing C++ using a robust parser, if its only purpose is to produce
an isomorphic version of C++ that is easier to parse (wait, we postulated
a robust C++ already!).
EDIT May 2012: DMS's C++ front end now handles GCC3/GCC4/C++11,Microsoft VisualC 2005/2010. Robustly.
EDIT Feb 2015: Now handles C++14 in GCC and MS dialects.
EDIT August 2015: Now parses and captures both the code and the preprocessor directives in a unified tree.
EDIT May 2020: Has been doing C++17 for the past few years. C++20 in process.
A while back I attempted to write a tool that will automatically generate unit tests for c files.
For preprosessing I put the files thru GCC. The output is ugly but you can easily trace where in the original code from the preprocessed file. But for your needs you might need somthing else.
I used Metre as the base for a C parser. It is open source and uses lex and yacc. This made it easy to get up and running in a short time without fully understanding lex & yacc.
I also wrote a C app since the lex & yacc solution could not help me trace functionality across functions and parse the structure of the entire function in one pass. It became unmaintainable in a short time and was abandoned.
What about using a tool like GNU's CFlow, that can analyse the code and produce charts of call-graphs, here's what the opengroup(man page) has to say about cflow. The GNU version of cflow comes with source, and open source also ...
Hope this helps,
Best regards,
Tom.

C++ code analysis tools

I'm currently in the process of learning C++, and because I'm still learning, I keep making mistakes.
With a language as permissive as C++, it often takes a long time to figure out exactly what's wrong -- because the compiler lets me get away with a lot. I realize that this flexibility is one of C++'s major strengths, but it makes it difficult to learn the basic language.
Is there some tool I can use to analyze my code and make suggestions based on best practices or just sensible coding? Preferably as an Eclipse plugin or linux application.
Enable maximum compiler warnings (that's the -Wall option if you're using the Gnu compiler).
'Lint' is the archetypical static analysis tool.
valgrind is a good run-time analyzer.
I think you'd better have some lectures about good practices and why they are good. That should help you more than a code analysis tool (in the beginning at least).
I suggest you read the series of Effective C++ and **Effective STL books, at least. See alsot The Definitive C++ Book Guide and List
For g++, as well as turning on -Wall, turn on -pedantic too, and prepare to be suprised at the number of issues it finds!
Tool support for C++ is quite bad compared to Java, C#, etc. because it does not have a context-free grammar. In fact, there are parts of the C++ grammar that are undecidable. Basically, that means that understanding C++ code at the syntactic level requires implementing pretty much a compiler front end with semantic analysis. C++ cannot be parsed into an AST independently of semantic analysis, and most code analysis tools in IDEs, etc. work at the AST level. This is part of the tradeoff you make in exchange for the flexibility and backwards compatibility of C++.
Turning on all compiler warnings (at least initially) and then understanding what they mean, how to fix the problems highlighted and which of the warnings represent genuine constructs that compiler writers might consider ambiguous is a good first step.
If you need something more heavy duty, you could try PC-Lint if you're on Windows, which is still one of the best lint tools for C++. Just keep in mind that you'll need to configure these tools to reflect your coding style, otherwise you'll get swamped with warnings and won't be able to see the wood for the trees. Yes, it costs money and it's probably a bit overkill if you're not doing C++ on a "getting paid for it" level, but I find it invaluable.
There is as list of static code analysis tools at wikipedia.
But warnings are generally good but one problem with enabling all warnings with pedantic and Wall is the number of warnings you might get from including headers that you have no control over, this can create a lot of noise. I prefer to compile my own software with all warnings enabled though. As I program in linux I usually do like this:
Put the external headers I need to include in a separate file, and in the beginning of that file before the includes put:
#pragma GCC system_header
And then include this file from your code. This enables you to see all warnings from your own code without it being drowned in warnings from external code. The downside is that it's a gcc specific solution, I am not aware of any platform independent solution to this.
lint - there are lots of versions but if you google for lint you should find one that works. The other thing to do is turn on your compiler warnings - if you are using gcc/g++ the option is -Wall.
You might find CppChecker helpful as a plugin for Eclipse that supports gcc/PC lint.
I think that really what you need to learn here is how to debug outside of an IDE. This is a valuable skill in my opinion, since you will no longer require such a heavy toolset to develop software, and it will apply to the vast majority of languages you already know and will ever learn.
However, its a difficult one to get used to. You will have to write code just for debugging purposes, e.g. write checks after each line not yet debugged, to ensure that the result is as expected, or print the values to the console or in message boxes so that you can check them yourself. Its tedious but will enable you to pick up on your mistakes more easily, inside or outside of an IDE.
Download and try some of the free debugging tools like GDB too, they can help you to probe memory, etc, without having to write your own code.