What is the difference between compile code and executable code? - build

I always use the terms compile and build interchangeably.
What exactly do these terms stand for?

Compiling is the act of turning source code into object code.
Linking is the act of combining object code with libraries into a raw executable.
Building is the sequence composed of compiling and linking, with possibly other tasks such as installer creation.
Many compilers handle the linking step automatically after compiling source code.

From wikipedia:
In the field of computer software, the term software build refers either to the process of converting source code files into standalone software artifact(s) that can be run on a computer, or the result of doing so. One of the most important steps of a software build is the compilation process where source code files are converted into executable code.
While for simple programs the process consists of a single file being compiled, for complex software the source code may consist of many files and may be combined in different ways to produce many different versions.

A build could be seen as a script, which comprises of many steps - the primary one of which would be to compile the code.
Others could be
running tests
reporting (e.g. coverage)
static analysis
pre and post-build steps
running custom tools over certain files
creating installs
labelling them and deploying/copying them to a repository

They often are used to mean the same thing. However, "build" may also mean the full process of compiling and linking a whole application (in the case of e.g. C and C++), or even more, including, among others
packaging
automatic (unit and/or integration) testing
installer generation
installation/deployment
documentation/site generation
report generation (e.g. test results, coverage).
There are systems like Maven, which generalize this with the concept of lifecycle, which consists of several stages, producing different artifacts, possibly using results and artifacts from previous stages.

From my experience I would say that "compiling" refers to the conversion of one or several human-readable source files to byte code (object files in C) while "building" denominates the whole process of compiling, linking and whatever else needs to be done of an entire package or project.

Most people would probably use the terms interchangeably.
You could see one nuance : compiling is only the step where you pass some source file through the compiler (gcc, javac, whatever).
Building could be heard as the more general process of checking out the source, creating a target folder for the compiled artifacts, checking dependencies, choosing what has to be compiled, running automated tests, creating a tar / zip / ditributions, pushing to an ftp, etc...

Related

Capture all compiler invocations and command line parameters during build

I want to run tools for static C/C++ (and possibly Python, Java etc.) code analysis for a large software project built with help of make. As it is known, make (or any other build tool) invokes compiler and similar tools for specified source code files. It is also possible to control compilation by defining environmental variables to be later passed to the compiler via its arguments.
The key to accurate static analysis is to provide defines and include paths exactly as they were passed to the compiler (basically all its -D and -I arguments). This way, the tool will be able to follow same code paths the compiler have followed.
The problem is, the high complexity of the project means there is no way to statically determine such environment, as different files are built with different sets of defines/include paths and other compilation flags.
The idea is that it should be somehow possible to capture individual invocations of the compiler with all arguments passed to it for each input file. Having such information and after its straightforward filtering (e.g. there is no need to know -O optimization levels or -W warning settings) it should be possible to invoke the static analyzer for each input file with the identical set of defines/includes used just for that input file.
The question is: are there existing tools/workflows that implement the idea I've described? I am mostly interested in a solution for POSIX systems, but ideas for Windows are also welcome.
A few ideas I've come to on my own.
The most trivial solution would be to collect make output and process it afterwards. However, certain projects have makefile rules that give very concise output instead of verbose one, so it might require some tinkering with Makefiles, which is not always desirable. Parallel builds may also have their console output mixed up and impossible to parse. Adaptation to other build systems (Cmake) will not be trivial either, so it is far from being the most convenient way.
Running make under ptrace and recording all invocations of exec* system calls that correspond to starting new applications, including compiler invocations. Then one will need to parse ptrace's output. This approach is build system and language agnostic (will catch all invocations of any compiler for any language) and should work for parallel builds. However it seems to be more technically complex. Performance degradation to the build process because of ptrace sitting on make's back is unclear either. It will also be harder to port it to Windows, as program-tracing API is somewhat different there.
The proprietary static analyzer for C++ on Windows (and recently Linux AFAIK) PVS-Studio seems to implement the second approach, however details on how they do it are welcome. If there are other IDEs/tools that already have something similar to what I need, please share information on them.
There are the following ways to gather information about the parameters of compilation in Linux:
Override environment CC/CXX variables. It is used in the utility scan-build from Clang Analyzer. This method works reliably only with simple projects for Make.
procfs - all the information on the processes is stored in /proc/PID/... . Reading from a disk is a slow process, you might not be able to receive information about all processes of a build.
strace utility (ptrace library). The output of this utility contains a lot of useful information, but it requires a complicated parsing, because information is written randomly. If you do not use many threads to build the project, it is a fairly reliable way to gather information about the processes. It’s used in PVS-Studio.
JSON Compilation Database in CMake. You can get all the compilation parameters using the definition -DCMAKE_EXPORT_COMPILE_COMMANDS=On. It is a reliable method if a project does not depend on non-standard environment variables. Also the project for CMake can be written with errors and issue incorrect Json, although this doesn’t affect the project build. It’s supported in PVS-Studio.
Bear utility (function substitution using LD_PRELOAD). You can get JSON Database Compilation for any project. But without environment variables it’ll be impossible to run the analyzer for some projects. Also, you cannot use it with projects, which already use LD_PRELOAD for a build. It’s supported in PVS-Studio.
Collecting information about compiling in Windows for PVS-Studio:
Visual Studio API to get the compilation parameters of standard projects;
MSBuild API to get the compilation parameters of standard projects;
Win API to get the information on any compilation processes as, for example, Windows Task Manager does it.
VERBOSE=true is a default make option to display all commands with all parameters. It also works with CMake, for instance.
You might want to look at Coverity. They are attaching their tool to the compiler to get everything that the compiler receives. You could overwrite the environment variables CC or CXX to first collect everything and then call the compiler as usual.

Visual Studio : compile list of modules on each platform and configuration

I am working on a huge C++ project, targeting many platforms with several configurations for each platform.
Because of the long compilation time, build the entire project on every platform to test if a change compile successfully, isn't an option.
What I usually do, is compile the single cpp modules I modified on different combination of platform/configuration.
I'd like to automate this process, either using a script, a VS extension, whatever, I am open to evaluate different options.
What I need exactly is taking a list of cpp files and compile each file, for each platform and each configuration (basically iterating through all combination of the configuration manager).
Is this possible? any good suggestion on how to approach the problem?
EDIT:
I am aware that this is way far to be a perfect solution, and will spot only a subset of errors.
I will still have to face linking errors, compiler errors on other cpp units depended on a modified header, and so on..
I also, don't have any chance to modify the current build system, or project generation.
I am mostly interested in a local solution, to reduce the amount of possible issues and facing the huge building time process.
EDIT2
We have a build system. This has to be considered a pre-build system optimization, for my personal workflow.
Reasons:
Triggering a build system job requires time. It will be the final step, but instead of spending hours waiting, and maybe discover later that a given compiler on a given platform for a specific configuration raise an error, it would be much more efficient to anticipate those findings as much as possible.
Current manual workflow:
Open each cpp file I modified
Compile each cpp file as a single unit (not building the project. On VS Build-> Compile)
Change Platform and/or configuration and re-do point 2 again.
This is the manual workflow I'd like to optimize.
I would suggest that you "simply" write a script to do this (using Python for instance, which is very powerful for this kind of this)
You could:
Parse the .sln file to extract the list of configurations, platforms ( GlobalSection(SolutionConfigurationPlatforms) entry) and projects (Project entry)
If needed, you can parse every project to find the list of source files (that's easier than parsing the .sln, as vcxproj files are in xml). Look for ClCompile xml nodes to extract the list of .cpp files.
Then you can identify which projects needs some files to be recompiled (getting list of modified files as script input parameter or based on timestamp checking)
Finally, to rebuild, you have two options:
Call "msbuild " to recompile the whole project (vcxproj) (for instance msbuild project.vcxproj /p:Configuration=Debug;TargetFrameworkVersion=v3.5)
You could also recompile a single file (cl simple.cpp). To do so, you need to know what are the cl build options to be sure you compile the file exactly the same way as Visual Studio would. If you earlier did a full build of the solution (it could be a rquirement for your script to work), then you should be able to find that from Visual Studio logs (within the target folder). In my solutions, I can find for every project (vcxproj file) a build log per configuration (in %OUTPUT_DIR%\lib\%libname%\%libname%.dir\%configuration%\%libname%.tlog\CL.command.1.tlog), this file reports the exact cl arguments that were used to compile every file of the project. Then you can manually invoke cl command and this should end up recompiling the file the same way Visual Studio would do it.
Additionnaly, you could add a project in your Visual Studio solution that would fire this script as a custom command.
Such a script should be able to identify which projects has to be rebuilt and rebuild them.
This is a very common requirement, it is never solved this way. What you are proposing is not completely impossible, but it is certainly very painful to implement. You are overlooking what should happen when you modify a .h file, that can force a bunch of .cpp files to be recompiled. And you are not considering linker errors. While you'll have a shot at discovering .cpp files, discovering #include file dependencies is very gritty. You can't get them from the project or make file. Compiling with /showIncludes and parsing the build trace files is what it takes. Nothing off-the-shelf afaik.
Don't do this, you'll regret it. Use the solution that everybody uses: you need a build server. Preferably with a continuous integration feature so the server kicks-off the build for all target platforms as soon as you check-in a code change. Many to choose from, this Q+A talks about it.

C Compiler automatic Version increment

I'm wondering if there's a macro or a simple way to let the compiler increment either major, minor or revision of my code each time when I compile?
By the way I'm using the ARM compiler and uVision from Keil.
To set the version is not a compiler topic. This should be done in connection with a source code control / version control system like cvs/svn/git or others. Your build id should be connected to the content of your source code database to get reproducible builds from checkouts from your version control system. Or, if your code is not already committed to your database a dirty-tag should be provided and compiled in to give the user of the software a chance to see that this is not a controlled version.
Simply counting a value in a variable can be done by a Makefile or in pre- and post-build instructions which depends on the used IDE. Sorry, for keil I have no experience...
Define Post-Build Event to run small external program. The program has to modify specific .h file. In the header file define macros like VER_MAJOR, VER_MINOR, VER_BUILD. Date/time string can also be updated. I use this method and can control the version numbers as I wish.
IMO, you do not need to do this, especially increase a number every time you compile your code.
Set the major and minor revision manually in a header file; you should not have to do this often.
Build number should only be related to the source control revision number (i.e. you should be able to build and rebuild any revision under source control).
Imagine if you are a team of 5 developers, and everyone build and rebuild on their side what is the actual build number?
do they all update a header file ?
who is responsible of owning that header file ?
Some compilers do support features like "post build", which runs a program of your choice after compiling, but that would be tricky if your program is built from multiple source files. Not all compilers do though.
For that reason, I wouldn't actually do this sort of thing via the compiler. I'd do it in the build script (e.g. makefile) or by configuring the build settings in your IDE.
Assuming you're using make or similar, add a target something like setversion (you pick the name) that runs a program which modifies a header file which specifies the components of your version number. So typing make setversion will update your version numbers.
Optionally, that target can also - after updating the version numbers - do a make clean (i.e. delete all object files and executables) and make all (to recompile and link everything).
I also suggest avoiding changing version numbers after each recompile. Imagine you're busily testing and debugging code, and go through several rebuild cycles. Do you really want the version number updated every time you recompile even one source file? It can be done that way if you choose, but will make every rebuild take longer (and, in projects with multiple source files) you will need to take care if you want to preserve capability for incremental builds.

Safe Unity Builds

I work on a large C++ project which makes use of Unity builds. For those unfamiliar with the practice, Unity builds #include multiple related C++ implementation files into one large translation unit which is then compiled as one. This saves recompiling headers, reduces link times, improves executable size/performance by bring more functions into internal linkage, etc. Generally good stuff.
However, I just caught a latent bug in one of our builds. An implementation file used library without including the associated header file, yet it compiled and ran. After scratching my head for bit, I realized that it was being included by an implementation file included before this one in our unity build. No harm done here, but it could have been a perplexing surprise if someone had tried to reuse that file later independently.
Is there any way to catch these silent dependencies and still keep the benefits of Unity builds besides periodically building the non-Unity version?
I've used UB approaches before for frozen projects in our source repository that we never planned to maintain again. Unfortunately, I'm pretty sure the answer to your question is no. You have to periodically build all the cpp files separately if you want to test for those kinds of errors.
Probably the closest thing you can get to an automagic solution is a buildbot which automatically gathers all the cpp files in a project (with the exception of your UB files) and builds the regular way periodically from the source repository, pointing out any build errors in the meantime. This way your local development is still fast (using UBs), but you can still catch any errors you miss from using a unity build from these periodic buildbot builds which build all cpps separately.
I suggest not to use Unity Build for your local development environment. Unity Build won't help you improve compile time when you edit and compile anyway. Use Unity Build only for your non-incremental continuous build system, which you are not expect to use the production from.
As long as every one commit changes after they have compiled locally, the situation you described should not appear.
And Unity Build might form unexpected overload calls between locally defined functions happen to use with duplicated names. That's dangerous and you are not aware of that, i.e. no compile error or warning might generated in such case. Unless you have a way to prevent that, please do not reply on the production generated by Unity Build.

VS 2008 C++ build output?

Why when I watch the build output from a VC++ project in VS do I see:
1>Compiling...
1>a.cpp
1>b.cpp
1>c.cpp
1>d.cpp
1>e.cpp
[etc...]
1>Generating code...
1>x.cpp
1>y.cpp
[etc...]
The output looks as though several compilation units are being handled before any code is generated. Is this really going on? I'm trying to improve build times, and by using pre-compiled headers, I've gotten great speedups for each ".cpp" file, but there is a relatively long pause during the "Generating Code..." message. I do not have "Whole Program Optimization" nor "Link Time Code Generation" turned on. If this is the case, then why? Why doesn't VC++ compile each ".cpp" individually (which would include the code generation phase)? If this isn't just an illusion of the output, is there cross-compilation-unit optimization potentially going on here? There don't appear to be any compiler options to control that behavior (I know about WPO and LTCG, as mentioned above).
EDIT:
The build log just shows the ".obj" files in the output directory, one per line. There is no indication of "Compiling..." vs. "Generating code..." steps.
EDIT:
I have confirmed that this behavior has nothing to do with the "maximum number of parallel project builds" setting in Tools -> Options -> Projects and Solutions -> Build and Run. Nor is it related to the MSBuild project build output verbosity setting. Indeed if I cancel the build before the "Generating code..." step, none of the ".obj" files will exist for the most recent set of "compiled" files. This implies that the compiler truly is handling multiple translation units together. Why is this?
Compiler architecture
The compiler is not generating code from the source directly, it first compiles it into an intermediate form (see compiler front-end) and then generates the code from the intermediate form, including any optimizations (see compiler back-end).
Visual Studio compiler process spawning
In a Visual Studio build compiler process (cl.exe) is executed to compile multiple source files sharing the same command line options in one command. The compiler first performs "compilation" sequentially for each file (this is most likely front-end), but "Generating code" (probably back-end) is done together for all files once compilation is done with them.
You can confirm this by watching cl.exe with Process Explorer.
Why code generation for multiple files at once
My guess is Code generation being done for multiple files at once is done to make the build process faster, as it includes some things which can be done only once for multiple sources, like instantiating templates - it has no use to instantiate them multiple times, as all instances but one would be discarded anyway.
Whole program optimization
In theory it would be possible to perform some cross-compilation-unit optimization as well at this point, but it is not done - no such optimizations are ever done unless enabled with /LTCG, and with LTCG the whole Code generation is done for the whole program at once (hence the Whole Program Optimization name).
Note: it seems as if WPO is done by linker, as it produces exe from obj files, but this a kind of illusion - the obj files are not real object files, they contain the intermediate representation, and the "linker" is not a real linker, as it is not only linking the existing code, it is generating and optimizing the code as well.
It is neither parallelization nor code optimization.
The long "Generating Code..." phase for multiple source files goes back to VC6. It occurs independent of optimizations settings or available CPUs, even in debug builds with optimizations disabled.
I haven't analyzed in detail, but my observations are: They occur when switching between units with different compile options, or when certain amounts of code has passed the "file-by-file" part. It's also the stage where most compiler crashes occured in VC6 .
Speculation: I've always assumed that it's the "hard part" that is improved by processing multiple items at once, maybe just the code and data loaded in cache. Another possibility is that the single step phase eats memory like crazy and "Generating code" releases that.
To improve build performance:
Buy the best machine you can afford
It is the fastest, cheapest improvement you can make. (unless you already have one).
Move to Windows 7 x64, buy loads of RAM, and an i7 860 or similar. (Moving from a Core2 dual core gave me a factor of 6..8, building on all CPUs.)
(Don't go cheap on the disks, too.)
Split into separate projects for parallel builds
This is where 8 CPUS (even if 4 physical + HT) with loads of RAM come to play. You can enable per-project parallelization with /MP option, but this is incompatible with many other features.
At one time compilation meant parse the source and generate code. Now though, compilation means parse the source and build up a symbolic database representing the code. The database can then be transformed to resolved references between symbols. Later on, the database is used as the source to generate code.
You haven't got optimizations switched on. That will stop the build process from optimizing the generated code (or at least hint that optimizations shouldn't be done... I wouldn't like to guarantee no optimizations are performed). However, the build process is still optimized. So, multiple .cpp files are being batched together to do this.
I'm not sure how the decision is made as to how many .cpp files get batched together. Maybe the compiler starts processing files until it decides the memory size of the database is large enough such that if it grows any more the system will have to start doing excessive paging of data in and out to disk and the performance gains of batching any more .cpp files would be negated.
Anyway, I don't work for the VC compiler team, so can't answer conclusively, but I always assumed it was doing it for this reason.
There's a new write-up on the Visual C++ Blog that details some undocumented switches that can be used to time/profile various stages of the build process (I'm not sure how much, if any, of the write-up applies to versions of MSVC prior to VS2010). Interesting stuff which should provide at least a little insight into what's going on behind the scenes:
http://blogs.msdn.com/vcblog/archive/2010/04/01/vc-tip-get-detailed-build-throughput-diagnostics-using-msbuild-compiler-and-linker.aspx
If nothing else, it lets you know what processes, dlls, and at least some of the phases of translation/processing correspond to which messages you see in normal build output.
It parallelizes the build (or at least the compile) if you have a multicore CPU
edit: I am pretty sure it parallelizes in the same was as "make -j", it compiles multiple cpp files at the same time (since cpp are generally independent) - but obviously links them once.
On my core-2 machine it is showing 2 devenv jobs while compiling a single project.