Is thinlto's concurrency useful in a parallel build system?

Is thinlto's concurrency useful in a parallel build system? - build

I have a build system that is using the long standing LTO support in clang via the -flto flag.
The ThinLTO support added to LLVM (https://clang.llvm.org/docs/ThinLTO.html) looks interesting, but I'm a little puzzled about the decision to launch std::thread::hardware_concurrency parallel processing threads in the context of a build system that already runs concurrent jobs.
If you have a build system that is already launching a thread per core and running a mix of compile and link jobs, does it still make sense for the linker to assume that it should use all cores, or even more than one?
Or does it make sense instead to reduce ThinLTOs background concurrency to 1 with the flags documented at https://clang.llvm.org/docs/ThinLTO.html#controlling-backend-parallelism? Are there any advantages to ThinLTO over regular plain old LTO when the parallelism has been removed?

ThinLTO actually can greatly improve build times for large projects, among its other benefits. The cache is not designed only for incremental builds - it's part and parcel for how the multi-threaded link stage works and is meant to speed symbol lookups. How helpful ThinLTO is for you with respect to shortening build times depends on your project and build system.
I found a very good video that goes over some details of the design for ThinLTO, its usecases, and some ways it has been implemented successfully:
https://www.youtube.com/watch?v=p9nH2vZ2mNo&list=WL&index=51&t=2812s
The corresponding Google Research paper is also a very interesting (if heavy) read:
https://research.google/pubs/pub47584/
For a lighter and more casual take, this blog post was also helpful:
http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html

Related

Faster build times in C++ [duplicate]

I once worked on a C++ project that took about an hour and a half for a full rebuild. Small edit, build, test cycles took about 5 to 10 minutes. It was an unproductive nightmare.
What is the worst build times you ever had to handle?
What strategies have you used to improve build times on large projects?
Update:
How much do you think the language used is to blame for the problem? I think C++ is prone to massive dependencies on large projects, which often means even simple changes to the source code can result in a massive rebuild. Which language do you think copes with large project dependency issues best?

Forward declaration
pimpl idiom
Precompiled headers
Parallel compilation (e.g. MPCL add-in for Visual Studio).
Distributed compilation (e.g. Incredibuild for Visual Studio).
Incremental build
Split build in several "projects" so not compile all the code if not needed.
[Later Edit]
8. Buy faster machines.

My strategy is pretty simple - I don't do large projects. The whole thrust of modern computing is away from the giant and monolithic and towards the small and componentised. So when I work on projects, I break things up into libraries and other components that can be built and tested independantly, and which have minimal dependancies on each other. A "full build" in this kind of environment never actually takes place, so there is no problem.

One trick that sometimes helps is to include everything into one .cpp file. Since includes are processed once per file, this can save you a lot of time. (The downside to this is that it makes it impossible for the compiler to parallelize compilation)
You should be able to specify that multiple .cpp files should be compiled in parallel (-j with make on linux, /MP on MSVC - MSVC also has an option to compile multiple projects in parallel. These are separate options, and there's no reason why you shouldn't use both)
In the same vein, distributed builds (Incredibuild, for example), may help take the load off a single system.
SSD disks are supposed to be a big win, although I haven't tested this myself (but a C++ build touches a huge number of files, which can quickly become a bottleneck).
Precompiled headers can help too, when used with care. (They can also hurt you, if they have to be recompiled too often).
And finally, trying to minimize dependencies in the code itself is important. Use the pImpl idiom, use forward declarations, keep the code as modular as possible. In some cases, use of templates may help you decouple classes and minimize dependencies. (In other cases, templates can slow down compilation significantly, of course)
But yes, you're right, this is very much a language thing. I don't know of another language which suffers from the problem to this extent. Most languages have a module system that allows them to eliminate header files, which area huge factor. C has header files, but is such a simple language that compile times are still manageable. C++ gets the worst of both worlds. A big complex language, and a terrible primitive build mechanism that requires a huge amount of code to be parsed again and again.

Multi core compilation. Very fast with 8 cores compiling on the I7.
Incremental linking
External constants
Removed inline methods on C++ classes.
The last two gave us a reduced linking time from around 12 minutes to 1-2 minutes. Note that this is only needed if things have a huge visibility, i.e. seen "everywhere" and if there are many different constants and classes.
Cheers

IncrediBuild

Unity Builds
Incredibuild
Pointer to implementation
forward declarations
compiling "finished" sections of the proejct into dll's

ccache & distcc (for C/C++ projects) -
ccache caches compiled output, using the pre-processed file as the 'key' for finding the output. This is great because pre-processing is pretty quick, and quite often changes that force recompile don't actually change the source for many files. Also, it really speeds up a full re-compile. Also nice is the instance where you can have a shared cache among team members. This means that only the first guy to grab the latest code actually compiles anything.
distcc does distributed compilation across a network of machines. This is only good if you HAVE a network of machines to use for compilation. It goes well with ccache, and only moves the pre-processed source around, so the only thing you have to worry about on the compiler engine systems is that they have the right compiler (no need for headers or your entire source tree to be visible).

The best suggestion is to build makefiles that actually understand dependencies and do not automatically rebuild the world for a small change. But, if a full rebuild takes 90 minutes, and a small rebuild takes 5-10 minutes, odds are good that your build system already does that.
Can the build be done in parallel? Either with multiple cores, or with multiple servers?
Checkin pre-compiled bits for pieces that really are static and do not need to be rebuilt every time. 3rd party tools/libraries that are used, but not altered are a good candidate for this treatment.
Limit the build to a single 'stream' if applicable. The 'full product' might include things like a debug version, or both 32 and 64 bit versions, or may include help files or man pages that are derived/built every time. Removing components that are not necessary for development can dramatically reduce the build time.
Does the build also package the product? Is that really required for development and testing? Does the build incorporate some basic sanity tests that can be skipped?
Finally, you can re-factor the code base to be more modular and to have fewer dependencies. Large Scale C++ Software Design is an excellent reference for learning to decouple large software products into something that is easier to maintain and faster to build.
EDIT: Building on a local filesystem as opposed to a NFS mounted filesystem can also dramatically speed up build times.

Fiddle with the compiler optimisation flags,
use option -j4 for gmake for parallel compilation (multicore or single core)
if you are using clearmake , use winking
we can take out the debug flags..in extreme cases.
Use some powerful servers.

This book Large-Scale C++ Software Design has very good advice I've used in past projects.

Minimize your public API
Minimize inline functions in your API. (Unfortunately this also increases linker requirements).
Maximize forward declarations.
Reduce coupling between code. For instance pass in two integers to a function, for coordinates, instead of your custom Point class that has it's own header file.
Use Incredibuild. But it has some issues sometimes.
Do NOT put code that get exported from two different modules in the SAME header file.
Use the PImple idiom. Mentioned before, but bears repeating.
Use Pre-compiled headers.
Avoid C++/CLI (i.e. managed c++). Linker times are impacted too.
Avoid using a global header file that includes 'everything else' in your API.
Don't put a dependency on a lib file if your code doesn't really need it.
Know the difference between including files with quotes and angle brackets.

Powerful compilation machines and parallel compilers. We also make sure the full build is needed as little as possible. We don't alter the code to make it compile faster.
Efficiency and correctness is more important than compilation speed.

In Visual Studio, you can set number of project to compile at a time. Its default value is 2, increasing that would reduce some time.
This will help if you don't want to mess with the code.

This is the list of things we did for a development under Linux :
As Warrior noted, use parallel builds (make -jN)
We use distributed builds (currently icecream which is very easy to setup), with this we can have tens or processors at a given time. This also has the advantage of giving the builds to the most powerful and less loaded machines.
We use ccache so that when you do a make clean, you don't have to really recompile your sources that didn't change, it's copied from a cache.
Note also that debug builds are usually faster to compile since the compiler doesn't have to make optimisations.

We tried creating proxy classes once.
These are really a simplified version of a class that only includes the public interface, reducing the number of internal dependencies that need to be exposed in the header file. However, they came with a heavy price of spreading each class over several files that all needed to be updated when changes to the class interface were made.

In general large C++ projects that I've worked on that had slow build times were pretty messy, with lots of interdependencies scattered through the code (the same include files used in most cpps, fat interfaces instead of slim ones). In those cases, the slow build time was just a symptom of the larger problem, and a minor symptom at that. Refactoring to make clearer interfaces and break code out into libraries improved the architecture, as well as the build time. When you make a library, it forces you to think about what is an interface and what isn't, which will actually (in my experience) end up improving the code base. If there's no technical reason to have to divide the code, some programmers through the course of maintenance will just throw anything into any header file.

Cătălin Pitiș covered a lot of good things. Other ones we do:
Have a tool that generates reduced Visual Studio .sln files for people working in a specific sub-area of a very large overall project
Cache DLLs and pdbs from when they are built on CI for distribution on developer machines
For CI, make sure that the link machine in particular has lots of memory and high-end drives
Store some expensive-to-regenerate files in source control, even though they could be created as part of the build
Replace Visual Studio's checking of what needs to be relinked by our own script tailored to our circumstances

It's a pet peeve of mine, so even though you already accepted an excellent answer, I'll chime in:
In C++, it's less the language as such, but the language-mandated build model that was great back in the seventies, and the header-heavy libraries.
The only thing that is wrong about Cătălin Pitiș' reply: "buy faster machines" should go first. It is the easyest way with the least impact.
My worst was about 80 minutes on an aging build machine running VC6 on W2K Professional. The same project (with tons of new code) now takes under 6 minutes on a machine with 4 hyperthreaded cores, 8G RAM Win 7 x64 and decent disks. (A similar machine, about 10..20% less processor power, with 4G RAM and Vista x86 takes twice as long)
Strangely, incremental builds are most of the time slower than full rebuuilds now.

Full build is about 2 hours. I try to avoid making modification to the base classes and since my work is mainly on the implementation of these base classes I only need to build small components (couple of minutes).

Create some unit test projects to test individual libraries, so that if you need to edit low level classes that would cause a huge rebuild, you can use TDD to know your new code works before you rebuild the entire app. The John Lakos book as mentioned by Themis has some very practical advice for restructuring your libraries to make this possible.

What is the difference between -fprofile-use and -fauto-profile?

What is the difference between -fprofile-use and -fauto-profile?
Here's what the docs say:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options
-fprofile-use
-fprofile-use=path
Enable profile feedback-directed optimizations, and the following optimizations which are generally profitable only with profile feedback available: [...]
If path is specified, GCC looks at the path to find the profile feedback data files. See -fprofile-dir.
and underneath that
-fauto-profile
-fauto-profile=path
Enable sampling-based feedback-directed optimizations, and the following optimizations which are generally profitable only with profile feedback available: [...]
path is the name of a file containing AutoFDO profile information. If omitted, it defaults to fbdata.afdo in the current directory.
(The list of optimizations in the [...] for -fauto-profile is longer.)

I stumbled into this thread by a path I can't even remember and am learning this stuff as I go along. But I don't like seeing an unanswered question if I could learn something from it! So I got reading.
Feedback-Directed Optimisation
As GCC say, both of these are modes of applying Feedback-Directed Optimisation. By running the program and profiling what it does, how it does it, how long it spends in which functions, etc. - we may facilitate extra, directed optimisations from the resulting data. Results from the profiler are 'fed forward' to the optimiser. Next, presumably, you can take your profile-optimised binary and profile that, then compile another FDO'd version, and so on... hence the feedback part of the name.
The real answer, the difference between these two switches, isn't very clearly documented, but it's available if we just need to look a little further.
-fprofile-use
Firstly, your quote for -fprofile-use only really states that it requires -fprofile-generate, an option that isn't very well documented: the reference from -use just tells you to read the page you're already on, where in all cases, -generate is only mentioned but never defined. Useful! But! We can refer to the answers to this question: How to use profile guided optimizations in g++?
As that answer states, and the piece of GCC's documentation in question here gently indicates... -fprofile-generate causes instrumentation to be added to the output binary. As that page summarises, an instrumented executable has stuff added to facilitate extra checks or insights during its runtime.
(The other form of instrumentation I know - and the one I've used - is the compiler add-on library UBSan, which I use via GCC's -fsanitize=undefined option. This catches bits of Undefined Behaviour at runtime. GCC with this on has revealed UB I might've otherwise taken ages to find - and made me wonder how my programs ran at all! Clang can use this library too, and maybe other compilers.)
-fauto-profile
In contrast, -fauto-profile is different. The key distinction is hinted, if not clearly, in the synopsis you quoted for it:
path is the name of a file containing AutoFDO profile information.
This mode handles profiling and subsequent optimisations using AutoFDO. To Google we go: AutoFDO The first few lines don't explain this as succinctly as possible, and I think the best summary is buried rather far down the page:
The major difference between AutoFDO [-fauto-profile] and FDO [-fprofile-use] is that AutoFDO profiles on optimized binary instead of instrumented binary. This makes it very different in handling cloned functions.
How does it do this? -fauto-profile requires you to provide profiling files written out by the Linux kernel's profiler, Perf, converted to the AutoFDO format. Perf, rather than adding instrumentation, uses hardware features of the CPU and kernel-level features of the OS to profile various statistics about a program while it's running:
perf is powerful: it can instrument CPU performance counters, tracepoints, kprobes, and uprobes (dynamic tracing). It is capable of lightweight profiling. [...] Performance counters are CPU hardware registers that count hardware events such as instructions executed, cache-misses suffered, or branches mispredicted. They form a basis for profiling applications to trace dynamic control flow and identify hotspots.
So, that lets it profile an optimised program, rather than an instrumented one. We might reasonably presume this is more representative of how your program would react in the real world - and so can facilitate gathering more useful profiling data and applying more effective optimisations as a result.
An example of how to do the legwork of tying all this together and getting -fauto-profile to do something with your program is summarised here: Feedback directed optimization with GCC and Perf
(Maybe now that I learned all this, I'll try these options out some day!)

underscore_d gives an in-depth insight into the differences.
Here is my take on it.
Performing internal profiling by compiling initially with -fprofile-generate, which integrates the profiler into the binary for the performance data collection run. Execute the binary, for 10 minutes or whatever time you think covers enough activity for the profiler to record. Recompile again instead with -fprofile-use along with -fprofile-correction if it is a multi-threaded application. Internal profiler runs causes a significant performance hit (25% in my case) which does not reflect the real world non-profiler included binary behavior, so could result in less accurate profiling, but if all activity when running the profiler enabled binary scales with the performance penalty, I guess it should not matter.
Alternatively you can use the perf tool (more error prone and effort) which is specific to your kernel (may also need kernel built to support profiling, tracing etc) to create the profiling data. This could be considered, external profiling and has negligible impact on the application performance while being profiled. You run this on the binary that you compile normally. I cannot find any studies comparing the two.
perf record -e br_inst_retired:near_taken -b -o perf.data *your_program.unstripped -program -parameters*
then without stripping the binary, convert the profiling data into something GCC understands...
create_gcov --binary=your_program.unstripped --profile=perf.data --gcov=profile.afdo
Then recompile the application using -fauto-profile. Perf and AutoFDO/create_gcov version specific issues are known to exist. I referred to https://blog.wnohang.net/index.php/2015/04/29/feedback-directed-optimization-with-gcc-and-perf/ for detailed information on this alternative profiling method.
-fprofile-use and -fauto-profile both enable many optimization options by default, in my case the unwanted -funroll-loops which I knew had negative impact on performance in my application. If your the pedantic type, you can test option combinations by including the disabling counterpart in the compile flags, in my case -fno-unroll-loops.
Using internal profiling with my program after stripping the binary, it reduced the size by 25% (compared to original non-profiler stripped binary) however I only observed sub-percentile performance gains and the previous work output fluctuations that are reported by the program log (it's a crypto currency miner) were more erratic, instead of a gradual rising and lowering between peaks and troughs in hash rates like originally.
Overall, a stab in the dark.

How to add compilation for profiling to static library?

My project currently has a library that is static linked (compiled with gcc and linked with ar), but I am currently trying to profile my whole entire project with gprof, in which I would also like to profile this statically linked library. Is there any way of going about doing this?
Gprof requires that you provide -pg to GCC for compilation and -pg to the linker. However, ar complains when -pg is added to the list of flags for it.

I haven't used gprof in a long time, but is -pg even a valid argument to ar? Does profiling work if you compile all of the objects with -pg, then create your archive without -pg?
If you can't get gprof to work, gperftools contains a CPU profiler which I think should work very well in this case. You don't need to compile your application with any special flags, and you don't need to try to change how your static library is linked.
Before starting, there are two tradeoffs involved with using gperftools that you should be aware of:
gperftools is a sampling profiler. As such, your results won't be 100%
accurate, but they should be really good. The big upside to using a
sampling profiler is that it won't really slow your application down.
In multithreaded applications, in my experience, gperftools will only
profile the main thread. The only way I've been able to successfully
profile worker threads is by adding profiling code to my application.
With that said, profiling the main thread shouldn't require any code
changes.
There are lots of different ways to use gperftools. My preferred way is to load the gperftools library with $LD_PRELOAD, specify a logging destination with $CPUPROFILE, and maybe bump up the sample frequency with $CPUPROFILE_FREQUENCY before starting my application up. Something like this:
export LD_PRELOAD=/usr/lib/libprofiler.so
export CPUPROFILE=/tmp/prof.out
export CPUPROFILE_FREQUENCY=10000
./my_application
This will write a bunch of profiling information to /tmp/prof.out. You can run a post-processing script to convert this file into something human readable. There are lots of supported output formats -- my preferred one is callgrind:
google-pprof --callgrind /path/to/my_application /tmp/prof.out > callgrind.dat
kcachegrind callgrind.dat &
This should provide a nice view of where your program is spending its time.
If you're interested, I spent some time over the weekend learning how to use gperftools to profile I/O bound applications, and I documented a lot of my findings here. There's a lot of overlap with what you're trying to do, so maybe it will be helpful.

Passing command line arguments when profiling with llvm-prof

How can I pass command line arguments to my program when profiling with llvm-prof?
And where can I find a more comprehensive documentation for llvm-prof? "llvm-prof -help" output is too brief. Its manual is even shorter.

I would recommend staying away from llvm-prof at this point. The reason is that it was actually removed from trunk LLVM a month ago (in revision 191835). Here is the commit message that should clarify the motivation:
Remove the very substantial, largely unmaintained legacy PGO
infrastructure.
This was essentially work toward PGO based on a design that had several
flaws, partially dating from a time when LLVM had a different
architecture, and with an effort to modernize it abandoned without being
completed. Since then, it has bitrotted for several years further. The
result is nearly unusable, and isn't helping any of the modern PGO
efforts. Instead, it is getting in the way, adding confusion about PGO
in LLVM and distracting everyone with maintenance on essentially dead
code. Removing it paves the way for modern efforts around PGO.
Among other effects, this removes the last of the runtime libraries from
LLVM. Those are being developed in the separate 'compiler-rt' project
now, with somewhat different licensing specifically more approriate for
runtimes.

The way you write the question implies that you are trying to execute your program using llvm-prof. However I am not sure if that is the way to do it. The way to profile is first to instrument your code with counters using:
opt -disable-opt -insert-edge-profiling -o program.profile.bc program.bc
Then execute the instrumented program using lli as follows:
lli -O0 -fake-argv0 'program.bc < YOUR_ARGS' -load llvm/Debug+Asserts/lib/libprofile_rt.so program.profile.bc
Note the way to pass the arguments to the program using -fake-argv0 'program.bc < YOUR_ARGS' above. This step will generate llvmprof.out file which can then be read with llvm-prof to generate the execution profiles as follows:
llvm-prof program.profile.bc

Compile and optimize for different target architectures

Summary: I want to take advantage of compiler optimizations and processor instruction sets, but still have a portable application (running on different processors). Normally I could indeed compile 5 times and let the user choose the right one to run.
My question is: how can I can automate this, so that the processor is detected at runtime and the right executable is executed without the user having to chose it?
I have an application with a lot of low level math calculations. These calculations will typically run for a long time.
I would like to take advantage of as much optimization as possible, preferably also of (not always supported) instruction sets. On the other hand I would like my application to be portable and easy to use (so I would not like to compile 5 different versions and let the user choose).
Is there a possibility to compile 5 different versions of my code and run dynamically the most optimized version that's possible at execution time? With 5 different versions I mean with different instruction sets and different optimizations for processors.
I don't care about the size of the application.
At this moment I'm using gcc on Linux (my code is in C++), but I'm also interested in this for the Intel compiler and for the MinGW compiler for compilation to Windows.
The executable doesn't have to be able to run on different OS'es, but ideally there would be something possible with automatically selecting 32 bit and 64 bit as well.
Edit: Please give clear pointers how to do it, preferably with small code examples or links to explanations. From my point of view I need a super generic solution, which is applicable on any random C++ project I have later.
Edit I assigned the bounty to ShuggyCoUk, he had a great number of pointers to look out for. I would have liked to split it between multiple answers but that is not possible. I'm not having this implemented yet, so the question is still 'open'! Please, still add and/or improve answers, even though there is no bounty to be given anymore.
Thanks everybody!

Yes it's possible. Compile all your differently optimised versions as different dynamic libraries with a common entry point, and provide an executable stub that that loads and runs
the correct library at run-time, via the entry point, depending on config file or other information.

Can you use script?
You could detect the CPU using script, and dynamically load the executable that is most optimized for architecture. It can choose 32/64 bit versions too.
If you are using a Linux you can query the cpu with
cat /proc/cpuinfo
You could probably do this with a bash/perl/python script or windows scripting host on windows. You probably don't want to force the user to install a script engine. One that works on the OS out of the box IMHO would be best.
In fact, on windows you probably would want to write a small C# app so you can more easily query the architecture. The C# app could just spawn whatever executable is fastest.
Alternatively you could put your different versions of code in a dll's or shared object's, then dynamically load them based on the detected architecture. As long as they have the same call signature it should work.

If you wish this to cleanly work on Windows and take full advantage in 64bit capable platforms of the additional 1. Addressing space and 2. registers (likely of more use to you) you must have at a minimum a separate process for the 64bit ones.
You can achieve this by having a separate executable with the relevant PE64 header. Simply using CreateProcess will launch this as the relevant bitness (unless the executable launched is in some redirected location there is no need to worry about WoW64 folder redirection
Given this limitation on windows it is likely that simply 'chaining along' to the relevant executable will be the simplest option for all different options, as well as making testing an individual one simpler.
It also means you 'main' executable is free to be totally separate depending on the target operating system (as detecting the cpu/OS capabilities is, by it's nature, very OS specific) and then do most of the rest of your code as shared objects/dlls.
Also you can 'share' the same files for two different architectures if you currently do not feel that there is any point using the differing capabilities.
I would suggest that the main executable is capable of being forced into making a specific choice so you can see what happens with 'lesser' versions on a more capable machine (or what errors come up if you try something different).
Other possibilities given this model are:
Statically linking to different versions of the standard runtimes (for ones with/without thread safety) and using them appropriately if you are running without any SMP/SMT capabilities.
Detect if multiple cores are present and whether they are real or hyper threading (also whether the OS knows how the schedule effectively in those cases)
checking the performance of things like the system timer/high performance timers and using code optimized to this behaviour, say if you do anything where you look for a certain amount of time to expire and thus can know your best possible granularity.
If you wish to optimize you choice of code based on cache sizing/other load on the box. If you are using unrolled loops then more aggressive unrolling options may depend on having a certain amount level 1/2 cache.
Compiling conditionally to use doubles/floats depending on the architecture. Less important on intel hardware but if you are targetting certain ARM cpu's some have actual floating point hardware support and others require emulation. The optimal code would change heavily, even to the extent you just use conditional compilation rather than using the optimizing compiler(1).
Making use of co-processor hardware like CUDA capable graphics cards.
detect virtualization and alter behaviour (perhaps trying to avoid file system writes)
As to doing this check you have a few options, the most useful one on Intel being the the cpuid instruction.
Windows
Use someone else's implementation but you'll have to pay
Use a free open source one
Linux
Use the built in one
You could also look at open source software doing the same thing
Pixman does a fair amount of this and is a permissive licence.
Alternatively re-implement/update an existing one using available documentation on the features you need.
Quite a lot of separate documents to work out how to detect things:
Intel:
SSE 4.1/4.2
SSE3
MMX
A large part of what you would be paying for in the CPU-Z library is someone doing all this (and the nasty little issues involved) for you.
be careful with this - it is hard to beat decent optimizing compilers on this

Have a look at liboil: http://liboil.freedesktop.org/wiki/ . It can dynamically select implementations of multimedia-related computations at run-time. You may find you can liboil itself and not just its techniques.

Since you mention you are using GCC, I'll assume your code is in C (or C++).
Neil Butterworth already suggested making separate dynamic libraries, but that requires some non-trivial cross-platform considerations (manually loading dynamic libraries is different on Linux, Windows, OSX, etc., and getting it right will likely take some time).
A cheap solution is to simply write all of your variants using unique names, and use a function pointer to select the proper one at runtime.
I suspect the extra dereference caused by the function pointer will be amortized by the actual work you are doing (but you'll want to confirm that).
Also, getting different compiler optimizations will likely require different .c/.cpp files, as well as some twiddling of your build tool. But it's probably less overall work than separate libraries (which needed this already in one form or another).

Since you didn't specify whether you have limits on the number of files, I propose another solution: compile 5 executables, and then create a sixth executable that launches the appropriate binary. Here is some pseudocode, for Linux
int main(int argc, char* argv[])
{
char* target_path[MAXPATH];
char* new_argv[];
char* specific_version = determine_name_of_specific_version();
strcpy(target_path, "/usr/lib/myapp/versions");
strcat(target_path, specific_version);
/* append NULL to argv */
new_argv = malloc(sizeof(char*)*(argc+1));
memcpy(new_argv, argv, argc*sizeof(char*));
new_argv[argc] = 0;
/* optionally set new_argv[0] to target_path */
execv(target_path, new_argv);
}
On the plus side, this approach allows to provide the user transparently with both 32-bit and 64-bit binaries, unlike any library methods that have been proposed. On the minus side, there is no execv in Win32 (but a good emulation in cygwin); on Windows, you have to create a new process, rather than re-execing the current one.

Lets break the problem down to its two constituent parts. 1) Creating platform dependent optimized code and 2) building on multiple platforms.
The first problem is pretty straightforward. Encapsulate the platform dependent code in a set of functions. Create a different implementation of each function for each platform. Put each implementation in its own file or set of files. It's easiest for the build system if you put each platform's code in a separate directory.
For part two I suggest you look at Gnu Atuotools (Automake, AutoConf, and Libtool). If you've ever downloaded and built a GNU program from source code you know you have to run ./configure before running make. The purpose of the configure script is to 1) verify that your system has all of the required libraries and utilities need to build and run the program and 2) customize the Makefiles for the target platform. Autotools is the set of utilities for generating the configure script.
Using autoconf, you can create little macros to check that the machine supports all of the CPU instructions your platform dependent code needs. In most cases, the macros already exists, you just have to copy them into your autoconf script. Then, automake and autoconf can set up the Makefiles to pull in the appropriate implementation.
All this is a bit much for creating an example here. It takes a little time to learn. But the documentation is all out there. There is even a free book available online. And the process is applicable to your future projects. For multi-platform support, this is really the most robust and easiest way to go, I think. A lot of the suggestions posted in other answers are things that Autotools deals with (CPU detection, static & shared library support) without you have to think about it too much. The only wrinkle you might have to deal with is finding out if Autotools are available for MinGW. I know they are part of Cygwin if you can go that route instead.

You mentioned the Intel compiler. That is funny, because it can do something like this by default. However, there is a catch. The Intel compiler didn't insert checks for the approopriate SSE functionality. Instead, they checked if you had a particular Intel chip. There would still be a slow default case. As a result, AMD CPUs would not get suitable SSE-optimized versions. There are hacks floating around that will replace the Intel check with a proper SSE check.
The 32/64 bits difference will require two executables. Both the ELF and PE format store this information in the exectuables header. It's not too hard to start the 32 bits version by default, check if you are on a 64 bit system, and then restart the 64 bit version. But it may be easier to create an appropriate symlink at installation time.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js