Linking phase in distcc

Is there any particular reason that the linking phase when building a project with distcc is done locally rather than sent off to other computers to be done like compiling is? Reading the distcc whitepages didn't give a clear answer but I'm guessing that the time spent linking object files is not very significant compared to compilation. Any thoughts?

The way distcc works is by preprocessing the input files locally until a single-file translation unit is created. That file is then sent over the network and compiled. At that stage the remote distcc server only needs a compiler; it does not even need the project's header files. The output of the compilation is then sent back to the client and stored locally as an object file. Note that this means that not only linking but also preprocessing is performed locally. That division of work is common to other build tools, like ccache (preprocessing is always performed, then it tries to match the input against previously cached results and, if it succeeds, returns the binary without recompiling).
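The split described above can be sketched as two command lines per source file: one that has to run locally and one that could run on any machine with a compiler. This is only an illustration of the idea, not distcc's actual implementation. The helper name `split_compile` and the file name are made up; `-E`, `-c` and `-o` are standard GCC options.

```python
from pathlib import Path


def split_compile(source, compiler="g++"):
    """Return (local_cmd, remote_cmd) for one C++ source file.

    Hypothetical helper: it only builds the argument lists, it runs nothing.
    """
    src = Path(source)
    preprocessed = src.with_suffix(".ii")  # GCC's suffix for preprocessed C++
    obj = src.with_suffix(".o")
    # Local step: expand #include and macros; needs the project's headers.
    local_cmd = [compiler, "-E", str(src), "-o", str(preprocessed)]
    # Remote step: compile the self-contained file; needs only a compiler.
    remote_cmd = [compiler, "-c", str(preprocessed), "-o", str(obj)]
    return local_cmd, remote_cmd


local_cmd, remote_cmd = split_compile("widget.cpp")
print(" ".join(local_cmd))
print(" ".join(remote_cmd))
```

Only the preprocessed file travels out and only the object file travels back, which is why the remote host does not need the headers.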
If you were to implement a distributed linker, you would have to either ensure that all hosts in the network have the exact same configuration, or else send all required inputs for the operation in one batch. That would imply that distributed compilation would produce a set of object files, and all those object files would have to be pushed over the network for a remote system to link and return the linked file. Note that this might require system libraries that are referenced and present in the linker path but not named on the linker command line, so a 'pre-link' step would have to determine which libraries actually need to be sent. Even if possible, this would require the local system to guess or calculate all real dependencies and send them, with a great impact on network traffic. It might actually slow down the process, as the cost of sending might be greater than the cost of linking, and that is assuming the cost of working out the dependencies is not itself almost as expensive as linking.
The project I am currently working on has a single statically linked executable of over 100M. The static libraries range in size, but if a distributed system decided to link the final executable remotely, it would probably require three to five times as much network traffic as the size of the final executable (templates, inlines... all of these appear in every translation unit that includes them, so there would be multiple copies flying around the network).
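As a back-of-the-envelope check of the traffic argument, here is a small sketch that adds up what a remote link would have to move: every object file and library out, and the executable back. All sizes are made-up numbers chosen for illustration, not measurements from the project mentioned above, and the function name is hypothetical.

```python
def remote_link_traffic(object_sizes, library_sizes, executable_size):
    """Bytes moved if the link ran on another host (all inputs out, result back)."""
    upload = sum(object_sizes) + sum(library_sizes)
    download = executable_size
    return upload + download


MB = 1024 * 1024
# Made-up sizes for illustration only.
objects = [40 * MB, 60 * MB, 80 * MB]
libraries = [120 * MB, 100 * MB]
executable = 100 * MB

total = remote_link_traffic(objects, libraries, executable)
ratio = total / executable
print(f"{total / MB:.0f} MB moved, {ratio:.1f}x the executable size")
```

The inputs are larger than the output because the linker discards duplicate template and inline instantiations, so the upload alone can be several times the size of the final binary.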

Linking, almost by definition, requires that you have all of the object files in one place. Since distcc drops the object files on the computer that invoked distcc in the first place, the local machine is the best place to perform the link as the objects are already present.
In addition, remote linking would become particularly hairy once you start throwing libraries into the mix. If the remote machine linked your program against its local version of a library, you've opened up potential problems when the linked binary is returned to the local machine.

The reason that compilation can be sent to other machines is that each source file is compiled independently of the others. In simple terms, for each .c input file there is a .o output file from the compilation step. This can be done on as many different machines as you like.
On the other hand, linking collects all the .o files and creates one output binary. There isn't really much to distribute there.
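The shape of the problem can be sketched like this: the compile jobs fan out because they are independent, and the link is a single job that needs every result. This is a toy model that only computes file names and a command line; `compile_one` is a stand-in for a real compiler call, and threads stand in for remote hosts.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def compile_one(source):
    """Stand-in for a compile job: maps one source file to one object file."""
    return str(Path(source).with_suffix(".o"))


def build_plan(sources, output="app", workers=4):
    # Each compile job is independent, so the jobs can run in any order
    # and on any worker (threads here; remote hosts in distcc's case).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        objects = list(pool.map(compile_one, sources))
    # The link is one job that needs every object file at once.
    link_cmd = ["gcc", *objects, "-o", output]
    return objects, link_cmd


objects, link_cmd = build_plan(["main.c", "parser.c", "net.c"])
print(objects)
print(" ".join(link_cmd))
```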

Related

Benefits of splitting project into executable and libraries

I sometimes observe that big projects are split into dynamic libraries and an executable.
The libraries are ad-hoc - they contain functionality that is only required by this executable. They also reside in the same repository and are built by the same build pipeline as the executable. From my point of view this approach creates additional trouble, since we need to deploy not only the executable but also the libraries. So the question is why it is done this way? Why not just statically link everything and produce single executable?

So the question is why it is done this way? Why not just statically link everything and produce single executable?
There are a few possible reasons:
If the project is sufficiently large, it may not be possible to link the code into a single executable on an x86_64 or i686 platform (the default small memory model limits a single binary to 2GiB of .text and .data).
Even if the binary links fine as a single static executable, it may be much faster to rebuild a shared library. If the ABI didn't change (e.g. a small fix to an internal implementation detail), then relinking the full executable is unnecessary if a shared library is used. This can greatly speed up the edit/build/test cycle.
This may also be solved by using a faster linker (e.g. Gold was significantly faster than BFD ld, and lld is faster still). But the project may have been split before Gold and lld became available, or it may use a platform to which faster linkers have not been ported.
Even when neither of the two reasons above applies, it may still be desirable to maintain API separation between a given library and its clients if the library is maintained by a different sub-team. The less of the implementation is exposed, the fewer chances there are to misuse the API or introduce unwanted dependencies on the current implementation, and shared libraries allow maintainers to hide much of the internals via symbol visibility.
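The rebuild-speed argument can be made concrete with a small model of which link targets have to be redone after one object file changes. The project layout and all names below are hypothetical, and the sketch assumes the change does not alter any library's ABI.

```python
def targets_to_relink(changed_object, layout):
    """Return the link targets that contain the changed object file.

    `layout` maps each link target to the object files linked directly into it.
    Assumes the change does not alter any library's ABI.
    """
    return sorted(t for t, objs in layout.items() if changed_object in objs)


# Hypothetical project: the same objects, linked two different ways.
static_layout = {"app": {"main.o", "codec.o", "net.o"}}
shared_layout = {
    "app": {"main.o"},  # links against the two libraries below
    "libcodec.so": {"codec.o"},
    "libnet.so": {"net.o"},
}

print(targets_to_relink("codec.o", static_layout))
print(targets_to_relink("codec.o", shared_layout))
```

In the static layout the whole executable is relinked for any change; in the shared layout only the one library that contains the changed object is.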

Do .lib files have to be linked every time a project is compiled in Visual Studio 2015?

Right now in a project I'm working on, compile times are taking very long.
We think it's due to the fact that it is linking all the library files every time it has to recompile the project.
Can we speed this up somehow? Do .libs have to be linked every single time, even when making very small changes?

Yes, object libraries have to be re-linked every time the program compiles.
However, you can make this less painful by making those other projects into DLL projects, which delays the linking until runtime, rather than compile time. That can make the program take a little longer to start up (depending on certain circumstances) and it'll make managing the project output a little more cumbersome, but it'll speed up project compilation by a significant factor.
If you're working with third-party libraries, see if they have DLL versions of the object code (many do), or recompile them as a DLL (if you have the source code), and use those instead. Depending on the library, you may need to make adjustments to your project configuration.

make SCons compile everything in one gcc line?

I have a rather complex SCons script that compiles a big C++ project.
This gcc manual page says:
The compiler performs optimization based on the knowledge it has of the program. Compiling multiple files at once to a single output file mode allows the compiler to use information gained from all of the files when compiling each of them.
So it's better to give all my files to a single g++ invocation and let it drive the compilation however it pleases.
But SCons does not do this. It calls g++ separately for every single C++ file in the project and then links them using ld.
Is there a way to make SCons do this?

The main reason to have a build system with the ability to express dependencies is to support some kind of conditional/incremental build. Otherwise you might as well just use a script with the one command you need.
That being said, the effect of having gcc/g++ optimize as the manual describes is substantial, in particular if you have C++ templates that you use often. It is good for run-time performance but bad for recompile performance.
I suggest you try and make your own builder doing what you need. Here is another question with an inspirational answer: SCons custom builder - build with multiple files and output one file
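As a rough sketch of the do-it-yourself route, the simplest form is a single SCons `Command` whose action hands every source file to one compiler call. The helper below only builds the action string; its name, the flags and the paths in the comment are assumptions for illustration. `$TARGET` and `$SOURCES` are SCons substitution variables, and the SConstruct wiring shown in the comment is untested.

```python
def single_invocation_action(compiler="g++", flags=("-O2",)):
    """Build an SCons action string that compiles and links in one compiler call.

    Hypothetical helper; $TARGET and $SOURCES are expanded by SCons.
    """
    return " ".join([compiler, *flags, "-o", "$TARGET", "$SOURCES"])


action = single_invocation_action()
print(action)

# In an SConstruct this could be wired up roughly as follows (untested sketch):
#   env = Environment()
#   env.Command("app", Glob("src/*.cpp"), action)
```

The trade-off is the one described above: with a single action, a change to any source file reruns the whole compile.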

Currently the answer is no.
Logic similar to this was developed for MSVC only.
You can see this in the man page (http://scons.org/doc/production/HTML/scons-man.html) as follows:
MSVC_BATCH When set to any true value, specifies that SCons should
batch compilation of object files when calling the Microsoft Visual
C/C++ compiler. All compilations of source files from the same source
directory that generate target files in a same output directory and
were configured in SCons using the same construction environment will
be built in a single call to the compiler. Only source files that have
changed since their object files were built will be passed to each
compiler invocation (via the $CHANGED_SOURCES construction variable).
Any compilations where the object (target) file base name (minus the
.obj) does not match the source file base name will be compiled
separately.
As always patches are welcome to add this in a more general fashion.
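To make the batching rules in that man page text easier to follow, here is a simplified model of the grouping. It is not SCons's implementation: it ignores the "same construction environment" condition, and the function name and file paths are made up.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_batches(changed):
    """Group (source, target) pairs the way the quoted man page text describes.

    Not SCons's implementation: a simplified model that ignores the
    "same construction environment" condition.
    """
    batches = defaultdict(list)
    separate = []
    for source, target in changed:
        src, tgt = PurePosixPath(source), PurePosixPath(target)
        if src.stem != tgt.stem:
            # Base names differ, so this file gets its own compiler call.
            separate.append(source)
        else:
            # Same source directory and same output directory share a call.
            batches[(str(src.parent), str(tgt.parent))].append(source)
    return dict(batches), separate


changed = [
    ("src/a.cpp", "build/a.obj"),
    ("src/b.cpp", "build/b.obj"),
    ("src/c.cpp", "build/c_debug.obj"),
    ("lib/d.cpp", "build/d.obj"),
]
batches, separate = plan_batches(changed)
print(batches)
print(separate)
```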

In general this should be left up to the program developer. Trying to compile everything together in an amalgamation may introduce unintended behaviour to the program, if it even compiles in the first place. Your best bet if you want this kind of optimisation without editing the source yourself is to use a compiler with interprocedural optimisation, like icc -ipo.
An example where an amalgamation of two .c files would not compile: both files define a static symbol with the same name but different functionality.
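The static-symbol clash can be checked for before attempting an amalgamation. The sketch below is a crude heuristic, not a C parser: it uses a regular expression to find file-scope `static` function definitions and reports names that appear in more than one file. The file names and contents are invented for the example.

```python
import re
from collections import defaultdict

# Crude pattern for file-scope `static` function definitions in C.
STATIC_FUNC = re.compile(r"^static[ \t]+[\w \t*]+?\b(\w+)[ \t]*\(", re.MULTILINE)


def clashing_statics(files):
    """Return names defined as `static` functions in more than one file.

    `files` maps a file name to its source text. A rough heuristic only.
    """
    owners = defaultdict(set)
    for name, text in files.items():
        for func in STATIC_FUNC.findall(text):
            owners[func].add(name)
    return {func: sorted(names) for func, names in owners.items() if len(names) > 1}


files = {
    "a.c": "static int scale(int x) { return x * 2; }\nint a(int x) { return scale(x); }\n",
    "b.c": "static int scale(int x) { return x * 10; }\nint b(int x) { return scale(x); }\n",
}
print(clashing_statics(files))
```

Compiled separately, each file gets its own private `scale`. Concatenated into one translation unit, the second definition is a redefinition error.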

Optimization: .cpp or .obj/.o or .lib/.a

I have this chunk of code that could be placed in a separate library but I'm unsure how that will affect the compiler's ability to optimize.
Option 1: include the code directly in the projects and compile it together with everything else.
Option 2: build the .obj/.o files and simply use them when building the projects.
Option 3: create a static library (.lib or .a) and link with that when building the projects.
Now, my question is: which of these will give the best performance? If you could discuss/explain the consequences of each of the options with regard to compiler optimization that would be super awesome!
Thanks in advance :-)

There should be no difference in performance:
An .a file is simply an archive of .o files. They are treated the same by the linker (except that .a files need to be unpacked first).
Directly compiling all sources together will still result in all compilation units being compiled separately, and subsequently linked together. It’s just that the compiler hides this and calls the linker behind your back. Nevertheless, the work is the same as when first compiling the compilation units separately and then linking them together in an explicit step.

There's no difference in the optimization a compiler can do. In every case, the object can be built with as much or as little optimization as you want.
The only difference you might see is when you build a shared library: then you have a call overhead that you do not have when linking the objects or a static library directly into the executable.

If by Option 1 you mean #include the code via header files, then the compiler may be able to optimise slightly better than linking multiple objects together, as in Options 2 and 3. This is because the compiler can see the entire source code, rather than just the object code, and may be able to inline functions.
There is no difference between Options 2 and 3, as an archive file - *.a - is just a collection of object files - *.o.
All this being said, The Architecture of Open Source Applications: LLVM implies that you can build LLVM IR code objects, which when linked can be optimised properly, including inlining of functions. So, if you are using clang++, this may be an option.
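For reference, the link-time optimisation route looks roughly like this on the command line. The sketch only builds the argument lists and runs nothing. The helper name and file names are made up, while `-flto` is the link-time optimisation flag accepted by both clang++ and g++.

```python
from pathlib import Path


def lto_commands(sources, output="app", compiler="clang++"):
    """Return per-file compile commands plus one link command, all with -flto.

    Hypothetical helper: it only builds argument lists and runs nothing.
    """
    flags = ["-O2", "-flto"]
    compiles = []
    objects = []
    for source in sources:
        obj = str(Path(source).with_suffix(".o"))
        objects.append(obj)
        compiles.append([compiler, *flags, "-c", source, "-o", obj])
    # -flto must also be passed at link time so the optimizer runs there.
    link = [compiler, *flags, *objects, "-o", output]
    return compiles, link


compiles, link = lto_commands(["main.cpp", "util.cpp"])
for cmd in compiles:
    print(" ".join(cmd))
print(" ".join(link))
```

With this setup the files are still compiled one at a time, so incremental builds keep working, and cross-file inlining happens during the final link.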

Lib Files and Defines

I'm using a couple of external libraries and I'd rather not have to include all their source and header files in my main source directory or in my project file. One option would be to compile the libraries as lib files and link them like that. However I'm not sure the defines get evaluated before or after the lib file gets created (which one is it?). If it's before then obviously I can't just pack them because they might not work properly on different compilers or systems.
So if I can't pack the libraries as lib files, is there any way for me to link in the c or cpp source files? Probably not, since they would have to be compiled first, but maybe I'm wrong.
Edit: Here's a follow-up question, based on answers. Do you think it'd be too much of a hassle to have a makefile that creates the lib files? I'd still rather not add the sources to my project or in my source directory.

A library is a binary file, so all the defines have obviously already been applied.
To put things in order: defines are evaluated in the first stage of the compilation process, a step called preprocessing. At this stage, one file is created for each cpp file, containing everything it #includes (recursively), with all macros evaluated.
In any case, a 3rd-party library should not depend on your compilation flags, with one exception: release/debug builds of the library. Only in this case do you need two versions of the 3rd-party library.
As to whether to compile 3rd-party libs once or every time you compile your code: it depends. If you are doing it only for yourself, then do whatever looks easiest to you, but if we're talking about a development team and a project that has to be maintained for a long time, then more things have to be considered.
So suppose we're talking about a solid solution for a team, where the library would otherwise end up being compiled many times.
In this case I personally strive to compile a 3rd-party library once and use it many times. This reduces compilation time for every build for every developer, which means faster development.
Nice, but where do you keep these libs? I like physical separation: the 3rd-party library and my code are not in the same tree. This can avoid some unintentional errors. A good build system should be reproducible, and most of the time that is mandatory. This means that if you check out your code a year later, you can compile it and get exactly the same binaries.
In the past I used an external read-only tree on my machine. This tree was managed only by me.
To make my sources reproducible, each new version of a 3rd-party library was put in a directory named after its version, and my source tree was updated to point to that directory. If you build on several machines, then the read-only tree should be visible on all of them.
An additional solution is to check whether your SCM tool (I suppose you use one) lets you combine several sub-trees from the repository in one checkout, with one sub-tree for each 3rd-party library. This way the 3rd-party libraries are available on all the machines you build on. I currently use this method with Subversion, where it's called svn:externals. In CVS, AFAIK, it's called CVS modules. An additional advantage is that the libraries are managed by the source control system, so you can track all changes made to the 3rd-party libs.
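The versioned-directory layout can be sketched as a small lookup: the source tree records which exact version of each library it builds against, and the build derives its paths from that. Everything here is hypothetical: the root path, the pinned versions, and the assumption that each library directory has an `include` subdirectory.

```python
from pathlib import PurePosixPath

# Hypothetical pins: library name -> exact version the source tree builds against.
PINNED = {"zlib": "1.2.11", "libpng": "1.6.37"}


def third_party_dir(root, name, pins=PINNED):
    """Return the versioned directory for a pinned third-party library."""
    return PurePosixPath(root) / name / pins[name]


def include_flags(root, names, pins=PINNED):
    """Build -I flags pointing into the read-only third-party tree."""
    return [f"-I{third_party_dir(root, n, pins) / 'include'}" for n in names]


print(include_flags("/opt/3rdparty", ["zlib", "libpng"]))
```

Because the pins live in the source tree, checking out an old revision also selects the old library versions, which is what keeps old builds reproducible.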

Defines get evaluated even before compiling. They are dealt with by the preprocessor, which prepares the code for the compiler. So yes, they are evaluated before the libraries are created.
You can't link against source code. You can only link with object files, static libraries, or dynamic libraries (shared object files/DLLs).
Using dynamic linking can be a good option, especially if the externals are large and/or you'll be using them in many executables.
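To illustrate why the defines are already "baked in" by the time a library exists, here is a toy model of one preprocessor feature. It resolves `#ifdef` / `#else` / `#endif` blocks (no nesting) for a given set of macro names and is in no way a real C preprocessor. The macro name and the source snippet are invented.

```python
def toy_preprocess(text, defines):
    """Resolve #ifdef / #else / #endif blocks (no nesting) for a set of macro names.

    A toy model of one preprocessor feature, not a real C preprocessor.
    """
    out = []
    keep = True  # whether lines in the current region survive
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#ifdef"):
            keep = stripped.split()[1] in defines
        elif stripped == "#else":
            keep = not keep
        elif stripped == "#endif":
            keep = True
        elif keep:
            out.append(line)
    return "\n".join(out)


SOURCE = """\
#ifdef USE_FAST_PATH
int step(void) { return 4; }
#else
int step(void) { return 1; }
#endif
"""
with_define = toy_preprocess(SOURCE, {"USE_FAST_PATH"})
without_define = toy_preprocess(SOURCE, set())
print(with_define)
print(without_define)
```

Only the surviving branch reaches the compiler, so a library built with one set of defines contains no trace of the other branch.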
