Dead code detection in legacy C/C++ project

How would you go about dead code detection in C/C++ code? I have a pretty large code base to work with and at least 10-15% of it is dead code. Is there any Unix-based tool to identify these areas? Some pieces of code still use a lot of preprocessor directives; can an automated process handle that?

You could use a code coverage analysis tool for this and look for unused spots in your code.
A popular tool for the gcc toolchain is gcov, together with the graphical frontend lcov (http://ltp.sourceforge.net/coverage/lcov.php).
If you use gcc, you can compile with gcov support, which is enabled by the '--coverage' flag. Next, run your application or your test suite with this gcov-enabled build.
Basically gcc will emit some extra files during compilation, and the application will also emit some coverage data while running. You have to collect all of these (.gcno and .gcda files). I'm not going into full detail here, but you probably need to set two environment variables to collect the coverage data in a sane way: GCOV_PREFIX and GCOV_PREFIX_STRIP...
After the run, you can put all the coverage data together and run it through the lcov toolsuite. Merging of all the coverage files from different test runs is also possible, albeit a bit involved.
Anyhow, you end up with a nice set of webpages showing some coverage information, pointing out the pieces of code that have no coverage and hence were not used.
Of course, you need to double-check that those portions of code are not used in any situation, and a lot depends on how well your tests exercise the codebase. But at least this will give you an idea about possible dead-code candidates...
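As a minimal sketch of this workflow (file and target names are made up for illustration):

// main.cpp -- toy example: used() runs, candidate() never does,
// so the coverage report will show candidate() with zero hits.
#include <iostream>

void used()      { std::cout << "used\n"; }
void candidate() { std::cout << "never reached\n"; }

int main() {
    used();
    return 0;
}

// Build, run, and generate the report (commands shown as comments):
//   g++ --coverage -O0 -c main.cpp   # compilation emits main.gcno
//   g++ --coverage main.o -o demo
//   ./demo                           # the run emits main.gcda
//   lcov --capture --directory . --output-file coverage.info
//   genhtml coverage.info --output-directory coverage-html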

Compile it under gcc with -Wunreachable-code.
My impression is that the more recent the version, the better the results, though I may be wrong about how actively this warning has been worked on. Note that this does flow analysis, but it won't tell you about code that is already dead by the time it leaves the preprocessor, because that is never parsed by the compiler. It also won't detect, e.g., exported functions that are never called, or special-case handling code that happens to be impossible because nothing ever calls the function with that parameter. You need code coverage for that (and run the functional tests, not the unit tests: unit tests are supposed to have 100% code coverage, and hence execute code paths that are 'dead' as far as the application is concerned). Still, with these limitations in mind, it's an easy way to get started finding the most completely bollixed routines in the code base.
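For instance (ENABLE_LEGACY_PATH is a made-up macro name), code guarded by a macro that no build configuration defines is removed by the preprocessor before the compiler ever parses it, so no compiler warning can flag it:

#ifdef ENABLE_LEGACY_PATH          // never defined in any build configuration
void legacy_path() { /* ... */ }   // the compiler never even sees this
#endif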
This CERT advisory lists some other tools for static dead code detection.

For C code only, and assuming that the source code of the whole project is available, launch an analysis with the open-source tool Frama-C. Any statement of the program that is displayed in red in the GUI is dead code.

If you have "dead code" problems, you may also be interested in removing "spare code": code that is executed but does not contribute to the end result. This requires you to provide an accurate model of the I/O functions (you wouldn't want to remove a computation that appears to be "spare" but is actually used as an argument to printf). Frama-C has an option for pointing out spare code.
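As a small illustration (the invocation is an assumption; recent Frama-C releases use -eva for the value analysis, older ones used -val):

// dead.c -- a branch the value analysis can prove unreachable
int f(int x) {
    if (x > 0 && x < 0)   // always false
        return 1;         // Frama-C's GUI would display this statement in red
    return 0;
}

// Analyze with: frama-c-gui -eva dead.c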

Your approach depends on the availability of (automated) tests. If you have a test suite that you trust to cover a sufficient amount of the functionality, you can use coverage analysis, as previous answers already suggested.
If you are not so fortunate, you might want to look into source code analysis tools like SciTools' Understand, which can help you analyse your code using a lot of built-in analysis reports. My experience with that tool dates from two years ago, so I can't give you much detail, but what I do remember is that they had impressive support, with very fast turnaround times on bug fixes and answers to questions.
I found a page on static source code analysis that lists many other tools as well.
If that doesn't help you sufficiently either, and you're specifically interested in finding out the preprocessor-related dead code, I would recommend you post some more details about the code. For example, if it is mostly related to various combinations of #ifdef settings you could write scripts to determine the (combinations of) settings and find out which combinations are never actually built, etc.

Both Mozilla and OpenOffice have home-grown solutions.

g++ 4.01 -Wunreachable-code warns about code that is unreachable within a function, but does not warn about unused functions.
int foo() {
    return 21; // point a
}

int bar() {
    int a = 7;
    return a;
    a += 9; // point b
    return a;
}

int main(int, char **) {
    return bar();
}
g++ 4.01 will issue a warning about point b, but say nothing about foo() (point a) even though it is unreachable in this file. This behavior is correct although disappointing, because a compiler cannot know that function foo() is not declared extern in some other compilation unit and invoked from there; only a linker can be sure.
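As a follow-up to that last point (a linker-level technique, not part of the answer above): with the GNU toolchain you can place each function in its own section and let the linker discard and report the unreferenced ones. A minimal sketch:

// demo.cpp -- foo() is never referenced anywhere in the program
int foo() { return 21; }
int main() { return 0; }

// Build commands (shown as comments):
//   g++ -ffunction-sections -fdata-sections -c demo.cpp
//   g++ -Wl,--gc-sections -Wl,--print-gc-sections demo.o -o demo
// --print-gc-sections lists the discarded sections (one per function,
// thanks to -ffunction-sections), including the one holding foo().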

Dead code analysis like this requires a global analysis of your entire project. You can't get this information by analyzing translation units individually (well, you can detect dead entities if they are entirely within a single translation unit, but I don't think that's what you are really looking for).
We've used our DMS Software Reengineering Toolkit to implement exactly this for Java code, by parsing all the compilation units involved at once, building symbol tables for everything, and chasing down all the references. A top-level definition with no references and no claim of being an external API item is dead. This tool also automatically strips out the dead code, and at the end you can choose what you want: the report of dead entities, or the code stripped of those entities.
DMS also parses C++ in a variety of dialects (EDIT Feb 2014: including MS and GCC versions of C++14 [EDIT Nov 2017: now C++17]) and builds all the necessary symbol tables. Tracking down the dead references would be straightforward from that point. DMS could also be used to strip them out. See http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html

The Bullseye coverage tool would help. It is not free, though.

Related

Stripping a C++ library of code not required by an application

I have an application dependent on many libraries. I am building everything from source on an Ubuntu machine. I want to remove any function/class that is not required by the application. Is there any tool to help with that?
P.S. I want to remove source code from the library not just symbols from object files.
The standard strip utility was created exactly for this.
I have now researched this a bit in the context of my own project and decided this was worth a full answer rather than just a comment. This answer is based on Apple's toolchain on macOS (which uses clang, rather than gcc), but I think things work in much the same way for both.
The key to this is enabling 'link time optimization' when building your libraries and executable(s). The mechanics of this are actually very simple - just pass -flto to gcc and ld on the command line. This has two effects:
Code (functions / methods) in object files or archives that is never called is omitted from the final executable (see the sketch after this list).
The linker performs the sort of optimisations that the compiler can perform (such as function inlining), but with knowledge that extends across compilation unit boundaries.
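As a minimal sketch of that first effect (file names are illustrative):

// dead.cpp -- never referenced from anywhere else
int never_called() { return 42; }

// main.cpp
int main() { return 0; }

// Build with LTO at both compile and link time (commands shown as comments):
//   g++ -c -flto -O2 main.cpp dead.cpp
//   g++ -flto -O2 main.o dead.o -o app
// With -flto the linker sees the whole program, notices that never_called()
// has no callers, and can drop it from the final executable.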
It won't help you if you are linking against a shared library, but it might help if that shared library links with other (static) libraries which contain code that the shared library never calls.
On the upside, this reduced the size of my final executable by about 5%, which I'm pleased about. YMMV.
On the downside, my object files roughly doubled in size and sometimes link times increased dramatically (by something like a factor of 100). Then, if I re-linked, it was much faster. This behaviour might be a peculiarity of Apple's toolchain however. Perhaps it is stashing away some build intermediates somewhere on the first link. In any case, if you only enable this option for release builds it should not be a major issue.
There are more details of the full set of gcc command line options that control optimisation here: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html. Search that page for flto to narrow down your search.
And for a glimpse behind the scenes, see: https://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html
Edit:
A bit more information about link times. Apple's linker creates some huge files in a directory called LTOCache when you link. I've not seen these before today, so they look to be the build intermediates that speed up linking the second time around. As for my initial link being so slow, this may in part be due to the fact that, in my case, these are created on an SMB server. But then again, the CPU was maxed out, so maybe not.
OK, now that I understand the OP's requirements better, I have another answer that I think might better suit his needs. I think the way to tackle this is with a code coverage tool. After all, the problem is identifying what you can safely get rid of; actually stripping it out is easy.
My IDE (Visual Studio) has one of these built in but I think the OP is using gcc so the first port of call appears to be gcov. There are a number of commercial options, but they are expensive. There's also a potentially useful post here.
The other thing you need, of course, is a program that exercises all the parts of the library that you want to keep to give you a coverage report to work from, but it sounds like the OP already has that. A good IDE will also help as it makes navigating around the code so much easier. In Visual Studio, I find Jump to Definition and quick and easy 'bookmarking' to be key features.

Dynamic dead code elimination tools for complex C++ projects

We have a project with a lot of code, part of it legacy.
As part of the workflow, every once in a while, all the functionality of the product is checked.
I wonder if there is a way to use this fact to dynamically check which parts of the code were never used? (The difficult part is the C++ code; the .NET and Java parts are more under control and have less legacy.)
Also: are there dynamic dead code elimination tools that can work with lots of code and complex projects (i.e. ~1M lines)?
All the similar questions I found talked about static analysis, which we already do.
Thank you!
You might want to look at the code coverage tools that are used in testing. The idea of these tools is that they instrument the code, and after running the set of tests you know which lines of code were executed at least once and which lines were never executed. After that you can improve the tests.
The same approach can be used to identify dead code, provided you have a diverse enough execution environment.
I don't know what platform you are on, but we have used gcov with success if you're compiling with the GNU toolchain:
http://gcc.gnu.org/onlinedocs/gcc/Gcov.html

Tool to parse C++ source and move in-header inline methods to the .cpp source file?

The source code of our application is hundreds of thousands of lines across thousands of files, and in places very old - the app was first written in 1995 or 1996. Over the past few years my team has greatly improved the quality of the source, but one issue remains that particularly bugs me: a lot of classes have a lot of methods fully defined in their header file.
I have no problem with methods declared inline in a header in some cases - a struct's constructor, a simple method where inlining measurably makes it faster (we have some math functions like this), etc. But the liberal use of inlined methods for no apparent reason is:
Messy
Makes it hard to find the implementation of a method (especially searching through a tree of classes for a virtual function, only to find one class had its version declared in the header...)
Probably increases the compiled code size
Probably causes issues for our linker, which is notoriously flaky for large codebases. To be fair, it has got much better in the past few years, but it's not perfect.
That last reason may now be causing problems for us and it's a good reason to go through the codebase and move most definitions to the source file.
Our codebase is huge. Is there an automated tool that can do (most of) this for us?
Notes:
We use Embarcadero RAD Studio 2010. In other words, the dialect of C++ includes VCL and other extensions, etc.
A few headers are standalone, but most are paired with a corresponding .cpp file, as you normally would. Apart from the extension the filename is the same, i.e., if there are methods defined in X.h, they can be moved to X.cpp. This also means the tool doesn't have to handle parsing the whole project - it could probably just parse individual pairs of .cpp/.h files, ignore the includes, etc, so long as it could reliably recognise a method with a body defined in a class declaration and move it.
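For illustration, this is the transformation we want the tool to perform (X is a hypothetical class):

// Before: X.h defines the method body inside the class declaration
class X {
public:
    int size() const { return count * 2; }   // defined in the header
private:
    int count;
};

// After: X.h keeps only the declaration...
class X {
public:
    int size() const;
private:
    int count;
};

// ...and X.cpp receives the definition:
#include "X.h"
int X::size() const { return count * 2; }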
You might try Lazy C++. I have not used it, but I believe it is a command line tool to do just what you want.
If the code is working, then I would vote against any major automated rewrite; lots of work could be involved in fixing it up.
Small iterative improvements over time are a better technique, as you will be able to test each change in isolation (and add unit tests). Anyway, your major complaint about not being able to find the code is not a real problem and is already solved: there are tools that will index your code base so your editor can jump to the correct function definition without you having to search for it. Take a look at ctags or the equivalent for your editor.
Messy
Subjective
Makes it hard to find the implementation of a method (especially searching through a tree of classes for a virtual function, only to find one class had its version declared in the header...)
There are already tools available for finding the function. ctags will make a file that allows you to jump directly to the function from any decent editor (vim/emacs). I am sure your editor, if not one of these, has an equivalent tool.
Probably increases the compiled code size
Unlikely. The compiler will choose to inline or not based on internal metrics, not whether it is marked inline in the source.
Probably causes issues for our linker, which is notoriously flaky for large codebases. To be fair, it has got much better in the past few years, but it's not perfect.
Unlikely. If your linker is flaky, then it is flaky; it is not going to make much difference where the functions are defined, as this has no bearing on whether they are inlined anyway.
XE2 includes a new static analyzer. It might be worthwhile to give the new version of C++Builder's trial a spin.
You have a number of problems to solve:
How to regroup the source and header files ideally
How to automate the code modifications to carry this out
In both cases, you need a robust C++ parser with full name resolution to determine the dependencies accurately.
Then you need machinery that can reliably modify the C++ source code.
Our DMS Software Reengineering Toolkit with its C++ Front End could be used for this. DMS has been used for large-scale C++ code restructuring; see http://www.semdesigns.com/Company/Publications/ and track down the first paper, "Case Study: Re-engineering C++ Component Models Via Automatic Program Transformation". (There's an older version of this paper you can download from there, but the published one is better.) AFAIK, DMS is the only tool to have ever been applied to transforming C++ at large scale.
This SO discussion on reorganizing code addresses the problem of grouping directly.

Edit and Continue on GDB

I know that E&C is a controversial subject and some say that it encourages a wrong approach to debugging, but still - I think we can agree that there are numerous cases when it is clearly useful - experimenting with different values of some constants, redesigning GUI parameters on-the-fly to find a good look... You name it.
My question is: Are we ever going to have E&C on GDB? I understand that it is a platform-specific feature and needs some serious cooperation with the compiler, the debugger and the OS (MSVC has this one easy as the compiler and debugger always come in one package), but... It still should be doable. I've even heard something about Apple having it implemented in their version of GCC [citation needed]. And I'd say it is indeed feasible.
Knowing all the hype about MSVC's E&C (my experience says it's the first thing MSVC users mention when asked "why not switch to Eclipse and gcc/gdb"), I'm seriously surprised that after quite some years GCC/GDB still doesn't have such a feature. Are there any good reasons for that? Is someone working on it as we speak?
It is a surprisingly non-trivial amount of work, encompassing many design decisions and feature tradeoffs. Consider: you are debugging. The debuggee is suspended. Its image in memory contains the object code of the source, and the binary layout of objects, the heap, the stacks. The debugger is inspecting its memory image. It has loaded debug information about the symbols, types, address mappings, pc (ip) to source correspondences. It displays the call stack, data values.
Now you want to allow a particular set of possible edits to the code and/or data, without stopping the debuggee and restarting. The simplest might be to change one line of code to another. Perhaps you recompile that file, or just that function, or just that line. Now you have to patch the debuggee image to execute that new line of code the next time you step over it or otherwise run through it. How does that work under the hood? What happens if the new code is larger than the line of code it replaced? How does it interact with compiler optimizations? Perhaps you can only do this on a target specially compiled for EnC debugging. Perhaps you will constrain the possible sites where it is legal to EnC. Consider: what happens if you edit a line of code in a function suspended down in the call stack? When the code returns there, does it run the original version of the function or the version with your line changed? If the original version, where does that source come from?
Can you add or remove locals? What does that do to the call stack of suspended frames? Of the current function?
Can you change function signatures? Add fields to / remove fields from objects? What about existing instances? What about pending destructors or finalizers? Etc.
There are many, many functionality details to attend to, to make any kind of usable EnC work. Then there are many cross-tools integration issues necessary to provide the infrastructure to power EnC. In particular, it helps to have some kind of repository of debug information that can make available the before- and after-edit debug information and object code to the debugger. For C++, the incrementally updatable debug information in PDBs helps. Incremental linking may help too.
Looking from the MS ecosystem over into the GCC ecosystem, it is easy to imagine that the complexity and integration issues across GDB/GCC/binutils, the myriad of targets, some needed EnC-specific target abstractions, and the "nice to have but inessential" nature of EnC are why it has not appeared yet in GDB/GCC.
Happy hacking!
(p.s. It is instructive and inspiring to look at what the Smalltalk-80 interactive programming environment could do. In St80 there was no concept of "restart" -- the image and its object memory were always live, if you edited any aspect of a class you still had to keep running. In such environments object versioning was not a hypothetical.)
I'm not familiar with MSVC's E&C, but GDB has some of the things you've mentioned:
http://sourceware.org/gdb/current/onlinedocs/gdb/Altering.html#Altering
17. Altering Execution
Once you think you have found an error in your program, you might want to find out for certain whether correcting the apparent error would lead to correct results in the rest of the run. You can find the answer by experiment, using the gdb features for altering execution of the program.
For example, you can store new values into variables or memory locations, give your program a signal, restart it at a different address, or even return prematurely from a function.
Assignment: Assignment to variables
Jumping: Continuing at a different address
Signaling: Giving your program a signal
Returning: Returning from a function
Calling: Calling your program's functions
Patching: Patching your program
Compiling and Injecting Code: Compiling and injecting code in GDB
This is a pretty good reference to the old Apple implementation of "fix and continue". It also references other working implementations.
http://sources.redhat.com/ml/gdb/2003-06/msg00500.html
Here is a snippet:
Fix and continue is a feature implemented by many other debuggers, which we added to our gdb for this release. Sun Workshop, SGI ProDev WorkShop, Microsoft's Visual Studio, HP's wdb, and Sun's Hotspot Java VM all provide this feature in one way or another. I based our implementation on the HP wdb Fix and Continue feature, which they added a few years back. Although my final implementation follows the general outlines of the approach they took, there is almost no shared code between them. Some of this is because of the architectural differences (both the processor and the ABI), but even more of it is due to implementation design differences.
Note that this capability may have been removed in a later version of their toolchain.
UPDATE: Dec-21-2012
There is a GDB Roadmap PDF presentation that includes a slide describing "Fix and Continue" among other bullet points. The presentation is dated July-9-2012 so maybe there is hope to have this added at some point. The presentation was part of the GNU Tools Cauldron 2012.
Also, I get it that adding E&C to GDB or anywhere in Linux land is a tough chore with all the different components.
But I don't see E&C as controversial. I remember using it in VB5 and VB6, and it was probably there before that. Also, it's been in Office VBA since way back. And it's been in Visual Studio since VS2005; VS2003 was the only one that didn't have it, and I remember devs howling about it. They intended to add it back anyway, and they did with VS2005, and it's been there since. It works with C#, VB, and also C and C++. It's been in MS core tools for 20+ years, almost continuously (counting VB when it was standalone, and subtracting VS2003). But you could still say they had it in Office VBA during the VS2003 period ;)
And JetBrains recently added it to their C# tool Rider. They bragged about it (rightly so, imo) on their Rider blog.

Tool for analyzing C++ sources (MSVC)

I need a tool which analyzes C++ sources and reports what code isn't used. The size of the sources is ~500 MB.
PC-Lint is good. If it needs to be free/open source, your choices dwindle. Cppcheck is free and will check for unused private functions. I don't think it looks for things like uninstantiated classes the way PC-Lint does.
Once again, I'll throw AQTime into the discussion. It has static code analysis for most, if not all, of the supported languages. I didn't really go into that part, though; I mainly used the dynamic profilers (memory, performance and so on).
You could use a code coverage tool (dynamic analysis) to get an idea of what code isn't being executed, and then analyze by hand whether that code is really useless.
If you want static analysis, you need a tool that can read the entire 500 MB of source code (est. 20 million lines? Wow!) and compute a conservative estimate of what is used. This requires doing a points-to analysis over the entire system.

Here's why: if you leave out any module Z and decide that FOO is unused, you might find out later that Z happened to be the one that used FOO, or, more subtly, that Z copied a pointer value that happened to have &FOO in it to a third module M, which in turn called the "unused" function through the pointer.

What this means is that no static analysis tool that reads just single modules (compilation units) can answer this question safely. And at your scale, you can't afford to make dumb mistakes.
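A tiny illustration of that pointer hazard (file and function names are hypothetical):

// foo.cpp -- FOO looks unused: no direct call appears anywhere
void FOO() { /* ... */ }

// z.cpp -- module Z only takes FOO's address, never calls it directly
extern void FOO();
void (*get_handler())() { return &FOO; }

// m.cpp -- module M calls through the pointer; FOO executes here
void (*get_handler())();
void run() { get_handler()(); }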
My company, Semantic Designs, has done points-to analysis for 35-million-line systems of C code using our DMS Software Reengineering Toolkit. DMS can read very large systems of source code. It required a custom tool, not so much because the source code was in an odd (archaic) dialect of C++ (systems in extremely modern dialects can't be this big; there hasn't been enough time to code them!), but rather because in very large systems there are other peculiar factors at play. For the C system we did, there was a custom dynamic linker, and that affected the points-to analysis, which in turn had to be customized.

Because systems of the scale you are discussing always have surprises like this (BIBSEH: "Because In Big Systems, Everything Happens"), you will likely need a custom tool to answer the question. DMS is designed to be customized.

See http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html and http://www.semanticdesigns.com/Products/FrontEnds/CppFrontEnd.html
A code coverage tool is what you need, but you will have to run your program through all its functionality and see what is reported as unused. Since the code could include DLL-exported functions, you will have to make sure nothing uses them externally. Some code coverage tools: Purify and CTC++; BoundsChecker may have code coverage functionality too, if I remember right, as do a bunch of other tools.
Be very careful about removing any function that may have been exported without knowing what external program may be linking/using it.