Dynamic slicing in C/C++

After reading Andreas Zeller's book on debugging, I became interested in dynamic slicing.
So far I have only found relevant tools for Java analysis. Do you know of such tools for C/C++?

A little information in addition to Rob's answer:
The Wisconsin Program-Slicing Tool has evolved into a tool called CodeSurfer. Good news: it's commercially available and supported, and it works great for what it does. Bad news (perhaps): it does not actually produce a reduced program that computes the same value that you selected, but it's very convenient for navigating source code that you have not written.
Frama-C handles only C (no C++ for the foreseeable future). It is nice, not great, for navigating source code, but it can produce an equivalent smaller program for the criterion that you specify, if the original program is of the kind that it can analyze automatically (no recursion, no dynamic allocation). Frama-C is Open Source and has a mailing list in which your questions will be welcome if you are interested in the techniques it uses.
The reason CodeSurfer does not risk producing an equivalent program, and Frama-C can only do it for code with embedded-like restrictions, is, in short, that doing so requires knowing the values of pointers, which can be arbitrarily difficult to compute with precision.
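To make the notion of a slicing criterion concrete, here is a small made-up illustration (mine, not output from CodeSurfer or Frama-C):

    // Slicing criterion: the value of `sum` at the return statement.
    // A correct slicer keeps the lines marked [in slice]; the `product`
    // computation can be deleted without changing `sum`.
    int main() {
        int sum = 0;                      // [in slice]
        int product = 1;
        for (int i = 1; i <= 10; ++i) {   // [in slice]
            sum += i;                     // [in slice]
            product *= i;
        }
        return sum;                       // [in slice] <- the criterion
    }

Once pointers enter the picture, deciding which statements can affect the criterion requires knowing what each pointer may point to, which is exactly the difficulty described above.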

There's a tool listed on the Wikipedia page you cite. It's for C, so I guess it could work for whatever "C/C++" is.
Wisconsin Program-Slicing Tool
Also for C, and also mentioned on the Wikipedia page:
Frama-C

Giri implements dynamic backward slicing in the LLVM compiler and is, as far as I know, the latest effort to build a usable, effective, thread-aware dynamic slicer in a modern compiler.

Tokenization and AST

I have a rather abstract question for you all. I'm looking at getting involved in a static code analysis project. It uses C and C++ as its development languages, so if any code in your responses could be in either of those languages, that would be great.
My question:
I need to understand some of the basic concepts/constructs used to process code for static analysis. I have heard people use terms like AST and tokenization, etc. I was just wondering if anyone can clarify how these things are applied in creating a static analysis tool? I'd like more of an explanation of tokenization, as I don't really understand it so well. I understand it is a sort of way to process strings, but I'm not confident in that answer. Furthermore, I know that the project I'm looking at passes the code through the preprocessor before it is analyzed. Can anyone explain this? Surely if it is static code analysis, it needn't be preprocessed?
Hope someone can clear this up for me.
Cheers.
Tokenization is the act of breaking source text into language elements such as operators, variable names, numbers, etc. Parsing reads sequences of tokens and builds Abstract Syntax Trees (ASTs), which are a particular program representation. Tokenization and parsing are necessary for static analysis, but hardly interesting, in the same way that putting the ante in the pot is necessary for playing poker but is not the interesting part of the game in any way.
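A toy illustration of tokenization (a sketch written for this answer; real lexers also handle keywords, comments, string literals, multi-character operators, and much more):

    #include <cctype>
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // A toy lexer: breaks "x = y + 42;" into language elements.
    std::vector<std::string> tokenize(const std::string& src) {
        std::vector<std::string> tokens;
        std::size_t i = 0;
        while (i < src.size()) {
            if (std::isspace((unsigned char)src[i])) { ++i; continue; }
            std::size_t start = i;
            if (std::isalpha((unsigned char)src[i])) {        // identifier
                while (i < src.size() && std::isalnum((unsigned char)src[i])) ++i;
            } else if (std::isdigit((unsigned char)src[i])) { // number literal
                while (i < src.size() && std::isdigit((unsigned char)src[i])) ++i;
            } else {                                          // single-char operator
                ++i;
            }
            tokens.push_back(src.substr(start, i - start));
        }
        return tokens;
    }

    int main() {
        std::vector<std::string> toks = tokenize("x = y + 42;");
        for (std::size_t i = 0; i < toks.size(); ++i)
            std::cout << '<' << toks[i] << "> ";   // prints: <x> <=> <y> <+> <42> <;>
        std::cout << '\n';
        return 0;
    }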
If you are building a static analyzer (you imply you expect to work on one implemented in C or C++), you will need fundamental knowledge of compiling: not so much parsing (unless you are building a parser for the language to be statically analyzed), but certainly program representations (ASTs, triples, control and data flow graphs, ...), type and property inference, and the limits on analysis accuracy (the cause of conservative analyses).
The program representations are fundamental because these are the data structures that most static analyzers really process; it's simply too hard to wring useful facts directly from program text. These concepts can be used to implement static analysis capabilities in any programming language; there's nothing special about implementing them in C or C++.
Run, don't walk, to your nearest compiler class for the first part of this. If you don't have it, you won't be able to do anything effective in tool building. The second part you will more likely find in a graduate computer science class.
If you get past that basic knowledge issue, you will either decide to implement an analysis tool from scratch, or build on existing analysis tool infrastructure. Few people decide to build one from scratch; it takes a huge amount of work (years or decades) to build robust parsers, flow analyzers, etc. needed as foundations for any particular static analysis. Mostly people try to use some existing infrastructure.
There's a huge list of candidates at: http://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis
If you insist on processing C or C++ and building your own custom sophisticated analysis, you really, really need a tool that can handle real C and C++ code. There are IMHO a limited number of good candidates:
GCC (and various graft-ons such as Starynkevitch's MELT, about which I know little)
Clang (pretty spectacular toolset)
DMS (and its C and C++ front ends) [my company's tool]
Open64 compiler infrastructure
Rose compiler infrastructure (based on EDG's widely used industrial front end)
Each of these is a big system and requires a big investment to understand and begin to use. Don't underrate the learning curve.
There are lots of other tools out there that sort of process C and C++, but "sort of" is pretty useless for static analysis purposes.
If you intend to simply use a static analysis tool, you can avoid learning most of the parsing and program representation questions; instead you'll need to learn as much as you can about the specific analysis tool you intend to use. You'll still be a lot better off with the compiler background above because you will understand what the tool does, why it does it, and why it produces the kinds of answers that it does (usually it produces answers that are unsatisfying in a lot of ways due to the conservative limits on analysis accuracy).
Lastly, you should be clear that you understand the difference between static analysis and dynamic analysis [using data collected at runtime to determine program properties]. Most end users couldn't care less how you get information about their code, and each analysis approach has its strengths and weaknesses.
If you are interested in static analysis, you should not bother much with syntax analysis (i.e. lexing, parsing, building the abstract syntax tree), because that won't teach you much.
So I suggest you use some existing parsing infrastructure. This is particularly true if you want to analyze C++ source code, because just writing a C++ parser is a difficult thing to do.
For example, you could build on the GCC compiler, or on the LLVM/Clang compiler. Be aware of their open source licenses: GCC is under GPLv3, and Clang/LLVM has a BSD-like license. GCC is able to handle not only C & C++, but also Fortran, Go, Ada, and Objective-C.
Because of GCC's GPLv3 license, your development on top of it should also be free software under GPLv3. On the other hand, the GCC community is quite big. If you know OCaml and are interested in static analysis of C only, you could consider Frama-C (LGPL licensed).
Actually, I am working on GCC, providing the MELT extension framework (MELT is also GPLv3 licensed). MELT is a high-level domain-specific language for extending GCC and should be the ideal choice for working on static analysis using GCC's powerful framework and internal representations (Gimple, Tree, ...).
(Of course, the LLVM folks would be delighted to have their work used too; and nobody knows both LLVM and GCC well, because learning one such big piece of software is already a challenge, so people have to choose which one to invest their time in.)
So my suggestion is to join an existing static analysis or compiler effort; don't reinvent the wheel by starting from scratch (because that means years of effort, just to have a good parser).
By doing so, you will take advantage of the existing infrastructure (so you won't have to bother with ASTs and parsing) and you will improve some existing project.
I would be very delighted if you were interested in MELT. FWIW, there is a MELT tutorial in Paris on January 24th, 2012, at the HiPEAC conference.
If you are looking into static analysis for C & C++, your first stop should be libclang, along with its associated papers.
As for why the preprocessor is run: static analysis is meant to catch errors and possibly unintended code behavior (very basically), thus it must analyze what the compiler will be compiling, not what the user sees.
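A minimal illustration (my own, hedged) of why the preprocessed form is what matters:

    // What the user sees:
    #define SQUARE(x) x * x

    int main() {
        int n = SQUARE(1 + 2);  // reads like "3 squared", i.e. 9
        // What the compiler sees after preprocessing:
        //     int n = 1 + 2 * 1 + 2;   // == 5
        // A static analyzer that looked only at the unpreprocessed text
        // would be reasoning about a program that never gets compiled.
        return n;  // returns 5, not 9
    }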
I'm pretty sure @Ira Baxter will pop up with a far more complete answer :)

Boost advocacy - help needed

Possible duplicates
Is there a reason to not use Boost?
What are the advantages of using the C++ BOOST libraries?
OK, the high-level question is "Please provide me with what you consider to be the most effective arguments for why all of Boost, or some specific parts of it, should be compiled on our company's system and endorsed in our software engineering standards".
Details of what I need:
I would gladly accept both positive arguments (why install it) and proposed rebuttals of likely counter-arguments I might hear (see question context below).
Arguments should be aimed at both technical Software Engineering team members and/or very technical senior managers - in other words, for the latter, the details of the argument may/should be technical, but the thrust of the argument should be "how would this make/save the company X money versus losing the company Y money as the cost of adding it to our toolset".
Context of the question:
I'm a developer in a company with several hundred developers, many dozens of whom do C++.
I had the (mis)fortune of being reassigned from my beloved Perl development spot to a team where I am also doing C++ development. So far I have found numerous things that I could easily have done in Perl that are very hard or cumbersome to do in C++ (a foreach loop, as an example), and any time I hit one of these, the answer is about 50% likely to end up being "You can't do this in standard C++, but you can do it with Boost".
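To illustrate, here is the kind of thing I mean - a sketch assuming Boost.Foreach, which is what colleagues keep pointing me at when I ask for a Perl-style foreach:

    #include <boost/foreach.hpp>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<int> v;
        v.push_back(1); v.push_back(2); v.push_back(3);

        // With Boost.Foreach - close to Perl's "foreach my $x (@v)":
        BOOST_FOREACH(int x, v)
            std::cout << x << '\n';

        // The standard (pre-C++0x) C++ equivalent I have to write today:
        for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it)
            std::cout << *it << '\n';
        return 0;
    }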
Our toolkit includes some legacy RogueWave libraries, and a VERY limited number of Boost libraries (e.g. no regex, no foreach), of very old vintage.
Any development must use libraries compiled and vetted by Software Engineering team. That is a hard and fast rule.
The SE team is somewhat resistant to adding new libraries, for a variety of reasons (e.g. the effort to do this; functionality conflicts with RogueWave, for example for regex; the risk of installing and using any new software; the cost of educating developers; etc.). They will add libraries if presented with a sufficient business need or a majorly convincing cost/benefit argument, but they have a pretty tough threshold.
So, I'm looking for examples of which parts of Boost are so wonderful (with exact cost/benefit estimates) that installing them would be an Obviously Worth It Effort for Software Engineering.
Thanks in advance for any ideas/suggestions/examples.
Please don't mark this question as subjective as I am looking for measurable answers, not merely wonderful feelings :)
Everywhere I worked in the last decade, when they had their own smart pointer class, I found bugs in it - usually within a few weeks. And no, I never went looking through it hoping to find errors.
I got into the habit of posting the following quote from the TR1 smart pointer proposal:
The Boost developers found a shared-ownership smart pointer exceedingly difficult to implement correctly. Others have made the same observation. For example, Scott Meyers [Meyers01] says:
"The STL itself contains no reference-counting smart pointer, and writing a good one - one that works correctly all the time - is tricky enough that you don't want to do it unless you have to. I published the code for a reference-counting smart pointer in More Effective C++ in 1996, and despite basing it on established smart pointer implementations and submitting it to extensive pre- publication reviewing by experienced developers, a small parade of valid bug reports has trickled in for years. The number of subtle ways in which reference-counting smart pointers can fail is remarkable."
This, plus a detailed analysis of the bug(s) I found, usually got me the job of incorporating the Boost libs into the code base. :)
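(For context, a minimal sketch of what boost::shared_ptr gives you - illustrative usage only, not the proposal's code:)

    #include <boost/shared_ptr.hpp>
    #include <iostream>

    struct Widget {
        ~Widget() { std::cout << "destroyed exactly once\n"; }
    };

    int main() {
        boost::shared_ptr<Widget> a(new Widget);
        {
            boost::shared_ptr<Widget> b = a;  // shared ownership, count == 2
        }                                     // b leaves scope, count back to 1
        return 0;
    }                                         // a leaves scope, Widget deleted

Copying, self-assignment, exception safety, and thread safety of the reference count are the classic places where hand-rolled versions go wrong.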
It seems to me you're doing this the wrong way around. Since proposals to add new libraries are going to be met with a lot of resistance, don't even bother trying to argue for boost as a whole. Choose your battles instead.
Find the specific Boost libraries which you know (with your knowledge of the application it's to be used in) will be useful and save time and money. And then propose adding those.
I could easily list the Boost libs I've found useful, and why I think they're great, but I don't know if they'll be of any use in your application.
Push for individual boost libraries to be included, and then perhaps, over time, so many of them will be included that it'll be simpler for everyone to just include all of Boost.
It's an open standard not controlled by a specific company (no licensing costs)
It's cross-platform
It's expertly designed and written, very fast and efficient, and extensively tested
There are open source implementations which your team can compile themselves
Parts of Boost will soon become part of the standard C++ library (many components were adopted into TR1 and the upcoming C++0x standard)
Here is a slightly oldish 2005 article from Dr. Dobb's discussing the upcoming C++0x standard.
http://www.ddj.com/cpp/184401958
I had to maintain a component using the old vintage Tools.h++ from RogueWave, on a Solaris system.
On Solaris, if we want to use Boost, we need to use either gcc, or SunStudio with the STLport implementation of the standard library (instead of the RogueWave one). And since Tools.h++ requires the old RogueWave pre-standard implementation -- on Solaris -- I had to give up on Boost.
In the end I rewrote a simplified version of a few boost-like features I needed.
If you are in that same situation (*), you will not be able to move from the RogueWave library to Boost that easily. There is a non-negligible cost to this operation; for instance, the pointer containers from the two libraries have quite different interfaces.
(*) Where we can't slowly change the old code bit by bit to progressively use Boost. In that situation, the migration has to be radical, simultaneously replacing every occurrence of Tools.h++ with something more trendy, or even better.
NB: Most people are able to introduce Boost progressively into old projects, and so may miss a very important, and yes technical, point. Hence my negative answer.
Boost is a great tool and an invaluable part of our product development (we'd be lost without smart_ptr)... but because it is changing so fast, the stability of releases can be affected.
For example, we were happily introducing new versions of Boost as soon as they came out, without thinking twice. That is, until we were stung by a bug in the 1.35 threading library that produced occasional (i.e. difficult to debug) but critical errors. Fortunately we identified the issue before anything was released to the public and could move back to 1.34.
Ever since then we've taken a specific version, extensively tested it, and not updated without a compelling reason to do so.
Here are two suggestions for advocating boost:
Who's Using Boost? (http://www.boost.org/users/uses.html)
Lots of major projects use Boost (e.g., Adobe Photoshop, CERN)
Boost Project Cost (http://www.boost.org/development/index.html)
How much would it cost to hire a team to write Boost from scratch? There's a nifty (somewhat gimmicky) calculator there that helps put it in perspective.

How do C/C++ compilers work?

After over a decade of C/C++ coding, I've noticed the following pattern - very good programmers tend to have detailed knowledge of the innards of the compiler.
I'm a reasonably good programmer, and I have an ad-hoc collection of compiler "superstitions", so I'd like to reboot my knowledge and start from the basics.
Can anyone recommend links to online resources or favorite books? I'm particularly interested in C/C++ compiling, optimization, GCC and LLVM.
Start with the Dragon Book... (stress code optimization and code generation more).
Go on to write a toy compiler for an educational programming language like Decaf or COOL; you may use parser generators (lex and yacc) for your front end, to make life easier and let you focus on the more important stuff...
Then read the GCC internals book while browsing the GCC source code.
GCC Internals Manual
CPP Internals Manual
LLVM Documentation
Compiler texts are good, but they are a bit heavy for teaching yourself. Jack Crenshaw has a "book" that was a series of articles you can download and read, called "Let's Build a Compiler." It follows a "learn by doing" methodology that is great if you didn't get anything out of taking formal classes on the subject, or if it's been WAY too many years since you took them (that's my case). It holds your hand and leads you through writing a compiler instead of smacking you around with lambda calculus and deep theoretical issues that only academia cares about. It was a good way to stir up those brain cells that had only a fuzzy memory of writing something on the VAX (yeah, that's right, a VAX!) many, many moons ago at school. It's written very conversationally and is easy to just sit down and read, unlike most textbooks, which require several pots of coffee just to get past the first chapter.
Once you have a basis for understanding, more traditional texts such as the Dragon Book are great references for expanding your understanding. (And personally, I like the dead-tree versions; I printed out Jack's. It's much easier to read in a comfortable position than on a laptop, and e-book readers are too expensive for something that doesn't actually feel like you're reading a real book yet.)
What some might call a "downside" is that it's written in Pascal, but I thought that just made me think about it more than if someone had given me a working C program to start with. Apart from that, it was written with the 68000 in mind, which is only used in embedded systems at this point in time. Again, for me this wasn't a problem: I knew 68000 asm, and 68000 asm is easier to read than some other asm.
If you want dead-tree edition, try The Art of Compiler Design: Theory and Practice.
As noted by Pete Eddy, Jack Crenshaw's tutorial is excellent for newbies. But if you want to see how a real, production C compiler works (one designed by brilliant engineers rather than created by throwing code at the wall until something stuck), get yourself a copy of Fraser and Hanson's A Retargetable C Compiler: Design and Implementation, which contains the source code of the very clean lcc compiler. Explanations of the design and implementation are mixed in with the code. It is not a first book for a beginner, but it will repay careful study, and you can get a used copy for $35.
For a longer blurb about lcc, see Compile C Faster on Linux.
The lcc web page also has links to a number of good textbooks. I don't know of an intro text that I really like, however.
P.S. Sorry you got ripped off at Uni.
http://se-radio.net/podcast/2007-07/episode-61-internals-gcc
see Fabrice Bellard's otcc source code
http://bellard.org/otcc/
Depending on what exactly you want to know, you should have a look at the pipes-and-filters pattern, because as far as I know this (or something similar) has been used in a lot of compilers in recent years.
If my compiler knowledge is not too outdated, it works like this:
Parse the source code into a symbolic representation
Clean up the symbolic representation, do some normalization
Optimize the symbolic tree based on certain rules
Write out executable code based on the symbolic tree
Of course, dependencies etc. have to be resolved too. A skeletal sketch of such a pipeline is below.
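All names and the trivial stub bodies in this sketch are invented for illustration; a real compiler has many more stages and far richer data structures.

    #include <string>
    #include <vector>

    struct Token { std::string text; };
    struct Tree  { std::vector<Token> nodes; };  // stand-in for a symbolic tree

    std::vector<Token> lex(const std::string& src) {   // source -> tokens
        Token t; t.text = src;                          // stub: one giant token
        return std::vector<Token>(1, t);
    }
    Tree parse(const std::vector<Token>& toks) {        // tokens -> tree
        Tree t; t.nodes = toks; return t;               // stub
    }
    Tree normalize(Tree t) { return t; }                // clean-up pass (stub)
    Tree optimize(Tree t)  { return t; }                // rule-based rewriting (stub)
    std::string emit(const Tree& t) {                   // tree -> output code
        return t.nodes.empty() ? "" : t.nodes[0].text;  // stub
    }

    // Each filter's output is piped into the next filter:
    std::string compile(const std::string& source) {
        return emit(optimize(normalize(parse(lex(source)))));
    }

    int main() { return compile("int x;").empty() ? 1 : 0; }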
And of course, having a look at the gcc or javac source code may help in getting a more detailed understanding.
It may also be valuable to pick up and read the source code of a compiler. I doubt that GCC is the best first choice, since it is burdened with full compatibility with more than 20 years of evolution of the language. But I'm also sure that a reading of its source, guided by one of the internal reference manuals, would be educational.
I'd seriously consider looking at the source to a scripting language that is internally compiled to a bytecode for a virtual machine. Several languages fit that description, but I would start with Lua. The language is small, and the VM is novel. The source code is also small and the bits I've looked at have been very clear although lightly commented.
Have a look at Kaleidoscope.
You can write your own compiler in just a few days with LLVM.

C++ code analysis tools

I'm currently in the process of learning C++, and because I'm still learning, I keep making mistakes.
With a language as permissive as C++, it often takes a long time to figure out exactly what's wrong -- because the compiler lets me get away with a lot. I realize that this flexibility is one of C++'s major strengths, but it makes it difficult to learn the basic language.
Is there some tool I can use to analyze my code and make suggestions based on best practices or just sensible coding? Preferably as an Eclipse plugin or linux application.
Enable maximum compiler warnings (that's the -Wall option if you're using the GNU compiler).
'Lint' is the archetypical static analysis tool.
valgrind is a good run-time analyzer.
I think you'd be better off with some lectures about good practices and why they are good. That should help you more than a code analysis tool (in the beginning, at least).
I suggest you read the Effective C++ and Effective STL series of books, at least. See also The Definitive C++ Book Guide and List.
For g++, as well as turning on -Wall, turn on -pedantic too, and prepare to be surprised at the number of issues it finds!
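For example (a file made up for illustration; both issues compile silently without the flags, and the exact messages vary by GCC version):

    // warn.cpp - compiles silently with plain "g++ warn.cpp", but
    // "g++ -Wall -pedantic warn.cpp" reports both problems below.
    #include <vector>

    int main() {
        int unused = 42;                       // -Wall: unused variable
        std::vector<int> v;
        for (int i = 0; i < v.size(); ++i) {}  // -Wall: signed/unsigned comparison
        return 0;
    }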
Tool support for C++ is quite bad compared to Java, C#, etc. because it does not have a context-free grammar. In fact, there are parts of the C++ grammar that are undecidable. Basically, that means that understanding C++ code at the syntactic level requires implementing pretty much a compiler front end with semantic analysis. C++ cannot be parsed into an AST independently of semantic analysis, and most code analysis tools in IDEs, etc. work at the AST level. This is part of the tradeoff you make in exchange for the flexibility and backwards compatibility of C++.
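A classic small illustration of that last point (my own example, not from any particular tool):

    // The token sequence "a * b ;" is either a declaration or an
    // expression depending on what `a` names - something only semantic
    // analysis can decide, so a C++ parser cannot work from syntax alone.
    int main() {
        typedef int a;   // `a` is a type name here...
        a* b = 0;        // ...so this line declares b as "pointer to a".
        // If `a` and `b` were instead int variables in scope, the same
        // tokens `a * b;` would be a multiplication expression.
        return b == 0 ? 0 : 1;
    }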
Turning on all compiler warnings (at least initially) and then understanding what they mean, how to fix the problems highlighted and which of the warnings represent genuine constructs that compiler writers might consider ambiguous is a good first step.
If you need something more heavy duty, you could try PC-Lint if you're on Windows, which is still one of the best lint tools for C++. Just keep in mind that you'll need to configure these tools to reflect your coding style, otherwise you'll get swamped with warnings and won't be able to see the wood for the trees. Yes, it costs money and it's probably a bit overkill if you're not doing C++ on a "getting paid for it" level, but I find it invaluable.
There is a list of static code analysis tools at Wikipedia.
Warnings are generally good, but one problem with enabling them all with -pedantic and -Wall is the number of warnings you might get from included headers that you have no control over; this can create a lot of noise. I prefer to compile my own software with all warnings enabled, though. As I program on Linux, I usually do it like this:
Put the external headers I need to include in a separate file, and in the beginning of that file before the includes put:
#pragma GCC system_header
And then include this file from your code. This enables you to see all warnings from your own code without them being drowned out by warnings from external code. The downside is that it's a gcc-specific solution; I am not aware of any platform-independent alternative. A sketch of the wrapper follows.
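A sketch of the wrapper (the file and header names are made up):

    // external_headers.h - hypothetical wrapper for headers we can't fix.
    // Everything after the pragma is treated as a system header, so gcc
    // suppresses warnings coming from it, even under -Wall -pedantic.
    #pragma GCC system_header
    #include <legacy_vendor/api.h>   // the noisy third-party header

    // In your own sources, include the wrapper instead of the vendor
    // header directly:
    //     #include "external_headers.h"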
lint - there are lots of versions, but if you google for lint you should find one that works. The other thing to do is turn on your compiler warnings - if you are using gcc/g++, the option is -Wall.
You might find CppChecker helpful - a plugin for Eclipse that supports gcc/PC-Lint.
I think that really what you need to learn here is how to debug outside of an IDE. This is a valuable skill in my opinion, since you will no longer require such a heavy toolset to develop software, and it will apply to the vast majority of languages you already know and will ever learn.
However, it's a difficult one to get used to. You will have to write code just for debugging purposes, e.g. write checks after each line not yet debugged to ensure that the result is as expected, or print the values to the console or in message boxes so that you can check them yourself. It's tedious, but it will enable you to pick up on your mistakes more easily, inside or outside of an IDE.
Download and try some of the free debugging tools like GDB too; they can help you probe memory, etc., without having to write your own code.

Understanding memory management in mingw/g++

I was given an assignment to basically explain this. I have taken a quick look at the compiler documentation, and it seems to be a good place to start, although it is quite extensive and I don't have much time. I'd like to know if I need to understand the C99 standard beforehand, or if there's another good source I can check. I'm going to be using it with Windows, if that makes any difference. I also understand simple concepts such as heaps, stacks, linking and whatnot.
AFAIK, g++ is simply a C/C++ compiler, nothing more. Memory is managed according to the standard C/C++ libraries.
Any decent C/C++ tutorial should give the basics of this information - but memory management in C/C++ is a huge topic. Surely for an entry-level class your instructor would give some guidance and probably a more specific, less open-ended question.
g++ is just a compiler. It follows the rules of the language it compiles (in g++'s case C++, but you also mention C99).
And for your fairly specific questions, you may need to:
Consult the language standard (for C++ this is ISO/IEC 14882). Unfortunately it is not free, but you can find drafts online for free that are basically as good as the real thing. The latest official version is C++2003 (ISO/IEC 14882-2003), but it contains only very minor changes from the original '98. C++0x is getting close to completion too, and again, there are drafts available online for it. Be warned though: it is heavy reading, and I wouldn't recommend trying to find anything there unless you're already very familiar with the language.
Analyze the assembler code the compiler generates. The standard leaves a lot up to the implementation, so the only way to find out how g++ specifically pushes things onto the stack, in which order and so on, is to analyze the code it generates. (Also note that this likely changes between different versions of g++.)
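For example (a made-up file; -S asks g++ to stop after generating assembly):

    // stack.cpp - a tiny function whose generated code is easy to read.
    // Compile with:   g++ -S -O0 stack.cpp
    // then open the resulting stack.s to see exactly how this g++ version
    // passes parameters and places locals in the stack frame (it varies
    // by version and target, as noted above).
    int sum3(int a, int b, int c) {
        int t = a + b;   // watch where `t` lands in the frame
        return t + c;
    }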
C++ is a notoriously underspecified language. There are huge chunks that just aren't covered by the standard, and where the compiler is free to do what it likes. This makes it a bit of a challenge to find out exactly what a given compiler is doing under the hood.
For this reason, you should also make sure you know exactly what you're expected to do: dig up information on what the language says about memory management, or on how g++ specifically deals with it?
I'm not clear on this question. If by "understanding memory management in mingw/g++" you mean "understand how the MinGW g++ compiler handles memory internally while it is compiling files, such as when it allocates and frees abstract syntax tree nodes", then your answer is that as a multi-pass compiler GCC may not know the optimal lifetime for any particular piece of data, but it does know that large groups of objects won't be needed from one pass to the next, so it uses memory pools when possible and garbage collection elsewhere.
On the other hand, if you're asking "how/when/in what order ... objects, functions, variables, etc. are put onto the stack ... what is allocated, and when, and how this impacts performance", then you're in for a long night of skimming through code.