How can I tell if two source files produce functionally identical code? - c++

I'm using uncrustify to format a directory full of C and C++ code. I need to ensure that uncrustify won't change the resulting code; I can't do a diff on the object files or binaries because the object files contain a timestamp and so will never be identical. I can't check the source files one by one because I'd be here for years.
The project uses make for the build process so I was wondering if there is some way to output something there that could be checked.
I've searched SO and Google to no avail, so my apologies if this is a duplicate.
EDIT: I'm using gcc/g++ and compiling for 32 bit.

One possibility would be to compile them with Clang and get the output as LLVM IR. If memory serves, the relevant command-line arguments are -S -emit-llvm.
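A quick sketch of that comparison (directory and file names here are hypothetical): the emitted .ll text embeds the input file name near the top, so compile both versions under the same relative name (e.g. from inside each copy of the tree) and then diff the results.
(cd original  && clang++ -S -emit-llvm foo.cpp -o ../foo_before.ll)
(cd formatted && clang++ -S -emit-llvm foo.cpp -o ../foo_after.ll)
diff foo_before.ll foo_after.ll    # no output means the IR is identical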
To do the same with gcc/g++, you can use one of its flags to generate a file containing its intermediate representation at some stage of compilation. Early stages will still show differences from changes in white space and such, but a quick test indicates that by the SSA stage, such non-operational changes have disappeared from the IR.
g++ -c -fdump-tree-ssa foo.cpp
In addition to the normal object file, this will produce a file named something like foo.cpp.018t.ssa (the exact number in the name varies with the GCC version) that represents the semantic actions in your source file.
As noted above, I haven't tested this extensively though--it's possible that at this stage, some non-operational changes will still produce different output files (though I kind of doubt it). If necessary, you can use -fdump-tree-all to get output from all stages of compilation.[1] As a simple rule of thumb, I'd expect later stages to be more immune to changes in formatting and such, so if the ssa stage doesn't work, my next choice would probably be the optimized stage, which is one of the last stages (note: the files produced are numbered in order of the stage that produced each file, so when you dump all stages, it's obvious which are produced by early stages and which by later stages).
[1] Note that this produces quite a few files, many of them quite large. The first time you do this, you probably want to do it on a single source file in a directory by itself to keep from drowning in files, so to speak. Also, don't be surprised when compilation this way takes quite a bit longer than normal.
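One hedged way to sweep a whole directory with this approach (scratch.orig and scratch.new are hypothetical copies of the tree before and after running uncrustify; the same GCC version must be used for both so the dump file names match):
(cd scratch.orig && for f in *.cpp; do g++ -c -fdump-tree-ssa "$f"; done)
(cd scratch.new  && for f in *.cpp; do g++ -c -fdump-tree-ssa "$f"; done)
# each compile drops a foo.cpp.NNNt.ssa dump next to its object file; compare them pairwise
for d in scratch.orig/*.ssa; do
    diff -q "$d" "scratch.new/$(basename "$d")"
done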

Related

Is there any way to use multiple precompiled headers simultaneously with Clang?

I am playing around with a clang++ command line in order to learn how precompiled headers work. I've got something like this:
clang++ [several lines of options] sourcefile.cpp -o sourcefile.o -include durr.h -include hurr.h
where the two headers included via the command line have been precompiled into corresponding .h.pch files.
If I "-include" just one of the two headers, compilation succeeds and is faster than it is when I include neither, in the normal fashion for precompiled headers. But when I include both (as above), I get this error:
clang: warning: precompiled header 'hurr.h.pch' was ignored because '-include hurr.h' is not first '-include'
Is there any way (not necessarily using -include) to use multiple .h.pch precompiled header files to speed compilation of one .cpp file? I understand that such a feature would be seriously complicated by the tendency of the preprocessor to cause headers to affect one another (even if only via include guards). I don't really expect what I want to be supported, now that I've thought about it a little. But I'm trying to confirm here. The above error message is suggestive but not comprehensive, and the Clang user manual didn't seem to tell me the answer....
It turns out there is: Clang supports "chained" PCH files, a feature that allows one PCH to represent an extension of another. If the dependent PCH is included during some later compilation, then both it and the PCH it depends on will be used to speed compilation. I think.
Something like this might produce an ordinary PCH:
clang++ -x c++-header header1.h -o header1.h.pch
And, if I'm understanding correctly (questionable), then something like this would produce a chained PCH that would extend header1.h.pch:
clang++ -x c++-header header2.h -o header2.h.pch -include-pch header1.h.pch
And then that chain of two PCH files can be used to speed compilation like so:
clang++ source.cpp -o source.o -include-pch header2.h.pch
(The parent PCH does not need to be mentioned in the command; header2.h.pch already knows where to find it I think.)
I haven't found any way to explicitly demand this sort of "chaining" via the command line. Simply including a PCH when compiling another PCH seems to produce a chained PCH... probably. My evidence that it does is mainly that in cases where header2.h actually includes header1.h, this technique seems to produce a small header2.h.pch and a large header1.h.pch, even though one would expect that if chaining weren't happening then header2.h.pch would generally be larger than header1.h.pch, since it contains at least as much information. This seems to match what I understand the purpose of PCH chaining to be: It saves resources by storing a dependent PCH's duplicate information as a cheap reference to the contents of another PCH.
My casual exploration suggests that a dependent PCH may itself have another one depending on it, extending the chain to three or more steps, though I'm not certain of this. When I tried to extend a chain involving real code by something like fifteen or twenty links, the PCH file sizes eventually blew up in what appeared to be an erroneous way, approximately doubling with each step. This eventually produced 300 MB PCH files for headers that, if compiled into PCHs without any attempt at chaining, would produce files smaller than 6 MB in size. This happened in Clang 3.6.something, and in Clang 3.7.0. I imagine it's a bug but who knows. I gave up on my exploration before reaching the point where I cared to try to pin it down and report it. And maybe the PCH chains aren't intended to ever grow very long anyway. This feature doesn't seem to be something people normally use....
Regardless, there seemed to be no way to do what I really wanted: mixing any two arbitrary PCHs, as long as they didn't directly conflict with one another, regardless of how they had been created. Chaining only allows two PCHs to be used together if one depends on the other. I had been interested in speeding compilation by making a PCH for each header in a project and then mixing groups of PCHs together during compilation as appropriate. But accomplishing this with chained PCHs seems to require making a tree of PCH files in which more than one PCH may correspond to a single header. I actually attempted to generate such a thing automatically, and seemed to succeed... but the above-mentioned "error" (if error it was) bogged me down, and to the extent that I did succeed, the time savings were not impressive enough to warrant continuing.
There is also some kind of "module" system in Clang that may be relevant here. But I have the impression that trying to exploit this to achieve the effect I want, if it could even be successful at all, would probably require me to change my source code to something special, and maybe to something non-standard. I didn't look into it much though, so maybe not. But anyway I guess standard C++ will probably get modules eventually and then all of this mess will (I hope) become a thing of the past.
GCC did not seem to support anything related to what I wanted, incidentally.

Multiple source file executable slower than single source file executable

I had a single source file which had all the class definitions and functions.
For better organization, I moved the class declarations (.h) and implementations (.cpp) into separate files.
But when I compiled them, it resulted in a slower executable than the one I get from the single-source build. It's around 20-30 seconds slower for the same input. I didn't change any code.
Why is this happening? And how can I make it faster again?
Update: The single source executable completes in 40 seconds whereas the multiple source executable takes 60. And I'm referring to runtime and not compilation.
I think your program runs faster when compiled as a single file because in that case the compiler has more of the information needed to optimize the code. For example, it can automatically inline some functions, which is not possible with separate compilation.
To make it faster again, you can try to enable link-time optimizer (or whole program optimizer) with this option: -flto.
If the -flto option is not available (it is available only starting with gcc 4.6), or if you don't want to use it for some reason, you have at least two other options:
If you split your project only for better organization, you can create a single source file (like all.cxx) and #include all the other source files (all other *.cxx files) into it. Then you need to build only this all.cxx, and all compiler optimizations are available again. Or, if you split it also to make compilation incremental, you may prepare two build configurations: an incremental build and a unity build. The first builds all the separate sources; the second builds only all.cxx. See more information on this here.
You can find the functions that cost you performance after splitting the project, and move them either to the compilation unit where they are used or to a header file. To do this, start with profiling (see "What can I use to profile C++ code in Linux?"). Then investigate the parts of the program that significantly impact performance. Here you have two options: either use the profiler again to compare results of the incremental and unity builds (but this time you need a sampling profiler, like oprofile, since an instrumenting profiler, like gprof, is most likely too heavy for this task), or apply the 'experimental' strategy described by gbulmer.
This probably has to do with link time optimization. When all your code is in a single source file, the compiler has more knowledge about what your code does so it can perform more optimizations. One such optimization is inlining: the compiler can only inline a function if it knows its implementation at compile time!
These optimizations can also be done at link time (rather than compile time) by passing the -flto flag to gcc, both for the compile and for the link stage (see here).
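A minimal sketch of that, assuming a hypothetical project split into a.cpp and b.cpp (the flag has to appear on both the compile and the link commands):
g++ -O2 -flto -c a.cpp
g++ -O2 -flto -c b.cpp
g++ -O2 -flto a.o b.o -o app    # cross-file inlining happens here, at link time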
This is a slower approach to getting back to the faster runtime, but if you want a better understanding of what is causing the large change, you could run a few 'experiments'.
One experiment would be to find which function might be responsible for the large change.
To do that, you could 'profile' the runtime of each function.
For example, use GNU gprof, part of GNU binutils:
http://www.gnu.org/software/binutils/
Docs at: http://sourceware.org/binutils/docs-2.22/gprof/index.html
This will measure the time consumed by each function in your program, and where it was called from. Doing these measurements will likely have a 'Heisenberg effect': taking measurements will affect the performance of the program. So you might want to try an experiment to find which class is making the most difference.
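As a rough sketch of the usual gprof workflow (file names here are placeholders): build with -pg on both the compile and link commands, run the program once to produce gmon.out, then ask gprof for the report.
g++ -pg -O2 -c main.cpp classa.cpp classb.cpp
g++ -pg main.o classa.o classb.o -o app
./app                       # writes gmon.out in the current directory
gprof ./app gmon.out > profile.txt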
Try to get a picture of how the runtime varies between having the class source code in the main source, and the same program but with the class compiled and linked in separately.
To get a class implementation into the final program, you can either compile and link it, or just #include it into the 'main' program then compile main.
To make it easier to try permutations, you could switch a #include on or off using #if:
#if defined(CLASSA) // same as #ifdef CLASSA
#include "classa.cpp"
#endif
#if defined(CLASSB)
#include "classb.cpp"
#endif
Then you can control which files are #included using command line flags to the compiler, e.g.
g++ -DCLASSA -DCLASSB ... main.c classc.cpp classd.cpp classf.cpp
It might only take you a few minutes to generate the permutations of the -D flags and link commands. Then you'd have a way to generate all permutations of compiling 'in one unit' vs. separately linked, then run (and time) each one.
I assume your header files are wrapped in the usual
#ifndef _CLASSA_H_
#define _CLASSA_H_
//...
#endif
With those guards in place, the headers can safely be pulled in more than once while you experiment with #including whole .cpp files into a single translation unit.
These types of experiment might yield some insight into the behaviour of the program and compiler which might stimulate some other ideas for improvements.

Reverse engineering your own code c++

I have a compiled program and I want to know whether a certain line exists in it. Is there a way, using my source code, to determine that?
Tony commented on my message so I'll add some info:
I'm using the g++ compiler.
I'm compiling the code on Linux(Scientific)/Unix machine
I only use standard library (nothing downloaded from the web)
The desired line is either a multiplication by a number (in a subfunction inside a while loop) or printing a line in a specific case (if statement).
I need this because I'm running several MD simulations and sometimes I find myself in a situation where I'm not sure of the conditions.
objdump is a utility that can be used as a disassembler to view an executable in assembly form.
Use this command to disassemble a binary,
objdump -Dslx file
It is important to note, though, that disassemblers make use of the symbolic debugging information present in object files (ELF), so that information should be present in your object files. Also, constants and comments in the source code will not be part of the disassembled output.
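For instance, assuming the executable was built with debugging information (-g), something like the following interleaves source lines with the disassembly so you can search for the statement you care about (file and variable names here are invented):
objdump -d -S ./simulation > simulation.lst
grep -n "scale_factor" simulation.lst    # look for the source line of interest in the listing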
Summary
Use source code control and keep track of which source code revision each executable is built from... the program should write that into its output so you can always cross-reference the two, check out the same sources, rebuild the executable that gave you those results, etc.
Discussion
The desired line is either a multiplication by a number (in a subfunction inside a while loop) or printing a line in a specific case (if statement).
I need this because I'm running several MD simulations and sometimes I find myself in a situation where I'm not sure of the conditions.
For the very simplest case where you want all the MD simulations to be running the latest source, you can compare timestamps on the source files with the executable to see if you forgot to recompile, and compare the process start time (e.g. as listed by ps) with the executable's creation time.
Where you're deliberately deploying multiple versions of the program and only have the latest source, then it gets pretty tricky. A multiplication will typically only generate a single machine code instruction... unless you have some contextual insight you're unlikely to know which multiplication is significant (or if it's missing). The compiler may generate its own multiplications for e.g. array indexing, and may sometimes optimise multiplications into bit shifts (or nothing, as Ira comments), so it's not as simple as saying 'well, it's my only multiplication in function "X"'. If you're printing a specific line that may be easier to distinguish... if there's a unique string literal you can search for it in the executable (e.g. puts("Hello") -> strings program | grep Hello, though that may get other matches too, and the compiler's allowed to reuse string literal sequences so "Well Hello" might cater to your need via a pointer to 'H' too). If there's a new extern symbol involved you might see it in nm output etc..
All that said (woah)... you should do something altogether different really. Best is to use a source control system (e.g. svn, cvs...), and get it configured so you can do something to find out which revision of the codebase was used to create the executable - it should be a FAQ for any revision control system.
Failing that, you could, for example, print out what multipliers or conditions the program was using when it starts running, capturing that in your logs. While hackish, macros allow you to "stringify" their parameters, so you can log and execute something without typing all the code twice. Lots of other options too.
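A minimal sketch of that stringify trick (everything below is an invented example, not the asker's code): the macro prints the expression's source text alongside its value, so the running binary reports the constants it was actually built with.
#include <iostream>

// Print an expression's source text and its current value.
#define LOG_VALUE(expr) \
    std::cout << #expr << " = " << (expr) << std::endl

int main() {
    const double scale = 2.5;      // hypothetical simulation constant
    const bool verbose = false;    // hypothetical condition

    LOG_VALUE(scale);              // prints: scale = 2.5
    LOG_VALUE(verbose);            // prints: verbose = 0
    // ... rest of the simulation ...
}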
Hope some of that helps....

Why Compile to an Object File First?

In the last year I've started programming in Fortran while working at a research university. Most of my prior experience is in web languages like PHP or old ASP, so I'm a newbie to compile statements.
I have two different codes I'm modifying.
One has an explicit statement creating .o files from modules (e.g. gfortran -c filea.f90) before creating the executable.
The other creates the executable file directly (sometimes creating .mod files, but no .o files, e.g. gfortran -o executable filea.f90 fileb.f90 mainfile.f90).
Is there a reason (other than, maybe, Makefiles) that one method is preferred over the other?
Compiling to object files first is called separate compilation. There are many advantages and a few drawbacks.
Advantages:
easy to transform object files (.o) to libraries and link to them later
many people can work on different source files at the same time
faster compiling (you don't compile the same files again and again when the source hasn't changed)
object files can be made from different language sources and linked together at some later time. To do that, the object files just have to use the same format and compatible calling conventions.
separate compilation enables distribution of system wide libraries (either OS libraries, language standard libraries or third party libraries) either static or shared.
Drawbacks:
There are some optimizations (like optimizing functions away) that the compiler cannot perform, and the linker does not care about; however, many compilers now include the option to perform "link time optimization", which largely negates this drawback. But this is still an issue for system libraries and third party libraries, especially for shared libraries (impossible to optimize away parts of a component that may change at each run, however other techniques like JIT compilation may mitigate this).
in some languages, the programmer has to provide some kind of header for the use of others that will link with this object. For example in C you have to provide .h files to go with your object files. But it is good practice anyway.
in languages with text-based includes like C or C++, if you change a function prototype, you have to change it in two places: once in the header file, once in the implementation file (a small sketch of this split follows below).
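For example, a tiny sketch of the usual C++ split (names invented for illustration): the prototype lives in the header that other translation units include, the body lives in the .cpp file, and a signature change therefore has to be made in both places.
// square.h
#ifndef SQUARE_H
#define SQUARE_H
int square(int n);          // declaration: all that callers need to see
#endif

// square.cpp
#include "square.h"
int square(int n) {         // definition: compiled once, into square.o
    return n * n;
}

// main.cpp
#include "square.h"
#include <iostream>
int main() {
    std::cout << square(7) << std::endl;
}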
When you have a project with a few hundred source files, you don't want to recompile all of them every time one changes. By compiling each source file into a separate object file and only recompiling those source files that are affected by a change, you spend the minimum amount of time from source code change to new executable.
make is the common tool used to track such dependencies and recreate your binary when something changes. Typically you set up what each source file depends on (these dependencies can typically be generated by your compiler - in a format suitable for make), and let make handle the details of creating an up to date binary.
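For instance, with GCC or Clang the -MMD flag makes the compiler write such a dependency fragment as a side effect of normal compilation (file names here are placeholders):
g++ -MMD -c foo.cpp    # produces foo.o and foo.d, a make-format list of the headers foo.o depends on
The generated .d files can then be pulled into the makefile (via make's include directive) so that touching a header rebuilds exactly the objects that use it.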
The .o file is the Object File. It's an intermediate representation of the final program.
Specifically, typically, the .o file has compiled code, but what it does not have is final addresses for all of the different routines or data.
One of the things that a program needs before it can be run is something similar to a memory image.
For example.
If you have your main program and it calls a routine. (This is faux Fortran; I haven't touched it in decades, so work with me here.)
PROGRAM MAIN
INTEGER X,Y
X = 10
Y = SQUARE(X)
WRITE(*,*) Y
END
Then you have the SQUARE function.
FUNCTION SQUARE(N)
SQUARE = N * N
END
These are individually compiled units. You can see that when MAIN is compiled it does not KNOW where "SQUARE" is, or what address it is at. It needs to know that so that when it calls the microprocessor's JUMP SUBROUTINE (JSR) instruction, the instruction has someplace to go.
The .o file has the JSR instruction already, but it doesn't have the actual value. That comes later in the linking or loading phase (depending on your application).
So, MAIN's .o file has all of the code for MAIN, and a list of references that it wants resolved (notably SQUARE). SQUARE is basically standalone; it doesn't have any references, but at the same time, it has no address for where it will exist in memory yet.
The linker will take all of the .o files and combine them into a single executable. In the old days, compiled code would literally be a memory image. The program would start at some address and simply be loaded into RAM wholesale, and then executed. So, in that scenario, you can see the linker taking the two .o files and concatenating them together (which fixes SQUARE's actual address), then going back to find the SQUARE reference in MAIN and filling in that address.
Modern linkers don't go quite that far, and defer much of that final processing to when the program is actually loaded. But the concept is similar.
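You can actually watch this happen with nm (a hedged sketch: the file name is invented, and the exact symbol spelling, here lowercase with a trailing underscore, depends on the Fortran compiler's name-mangling conventions):
gfortran -c main.f90          # compile MAIN by itself
nm main.o                     # shows something like:  U square_   (undefined: address to be filled in by the linker)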
By compiling to .o files, you end up with reusable units of logic that are then combined later by the linking and loading processes before execution.
The other nice aspect is that the .o files can come from different languages. As long as the calling mechanisms are compatible (i.e. how arguments are passed to and from functions and procedures), then once compiled into a .o, the source language becomes less relevant. You can link and combine C code with FORTRAN code, say.
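A hedged sketch of that mixed-language case (file names invented; the argument-passing conventions still have to agree on both sides):
gcc -c helper.c               # C object file
gfortran -c main.f90          # Fortran object file
gfortran main.o helper.o -o program    # one link step combines the two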
In PHP et al., the process is different because all of the code is loaded into a single image at runtime. You can consider Fortran's .o files similar to how you would use PHP's include mechanism to combine files into a large, cohesive whole.
Another reason, apart from compile time, is that the compilation process is a multi-step process.
The object files are just one intermediate output from that process. They will eventually be used by the linker to produce the executable file.
We compile to object files to be able to link them together to form larger executables. That is not the only way to do it.
There are also compilers that don't do it that way, but instead compile to memory and execute the result immediately. Earlier, when students had to use mainframes, this was standard. Turbo Pascal also did it this way.

Why compile + link when build a C++ code, instead of generating executable directly

I was asked this question when mentoring an entry-level programmer. The compile + link process felt so official and usual to me that I had never thought about why.
One thing I could think of is that it improves development productivity, but are there any other, more compiler-related reasons?
Efficiency.
When you compile a program you create an object file for each source file; if you change a source file you only need to recompile that module and then relink (relinking is cheap).
If the compiler did everything in one pass it would have to recompile everything for every change.
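A tiny sketch of that pattern with hypothetical files a.cpp and b.cpp:
g++ -c a.cpp                  # a.o
g++ -c b.cpp                  # b.o
g++ a.o b.o -o app            # link
# ... later, only b.cpp is edited ...
g++ -c b.cpp                  # recompile just the changed module
g++ a.o b.o -o app            # cheap relink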
It also fits with the unix philosophy of small programs that do one thing, so you have a pre-processor, a compiler, a linker, a library creator. These steps might now be different modes of the same tool.
However, there are reasons why you might want the compiler to link in one step: there are some optimizations you can do if you allow the compiler to change object files at link time. Most modern compilers allow this, but it requires them to put extra info into the object files at compile time.
It would be better if the compiler could store the entire project in a single database, rather than the mess of sources, resources, browse info files, object files etc - but developers are very conservative!
Part of this is historical. Back in the dark ages, computers had little memory. It was easy to create a program with enough source code that all of it couldn't be processed at one time. So the processing had to be done in stages: preprocessing source code, compiling source to assembly (one file at a time), assembling to object code, and linking all the object files into the final executable. Each of these steps had one or more standalone tools to do its task. Over the years the tools were improved incrementally, but no major redesign of the process has ever become mainstream.
It's important that the build time, even for a very large project, be under 24 hours. And being able to build overnight is better. Separate compilation, which is to say dividing a program into "compilation units" and compiling them independently, is the way to reduce build time:
If a compilation unit hasn't been changed, and if nothing it depends on has changed, you can reuse the result of an old compilation.
You can often compile multiple units in parallel, or even distribute compilation over a network of workstations. The lowly make will compile in parallel, and other tools like distcc exist to distribute the work of compilation (a one-line example follows below).
Linking provides few benefits in and of itself but is necessary to use the results of separate compilation.
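For example (a hedged one-liner; nproc is the usual Linux way to count cores), with per-object rules in place a parallel build is just:
make -j"$(nproc)"             # compile independent objects concurrently, then link once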
What an excellent time to teach your protégé about the Single Responsibility Principle!
Compiling a file translates the source code into object code the machine can read; linking resolves the references between those pieces and lays them out so the program can actually run. Even when a single command appears to go straight from source to executable, both steps still happen under the hood.