Mixing native executables and llvm bitcode libraries

Mixing native executables and llvm bitcode libraries - fortran

Situation
I have a program (written in fortran) which consists of:
A set of core routines, used every time the program is run.
A large collection of alternate routines, only one of which is used for each run, selected by the user at the start.
The user may reasonably select different alternatives for subsequent runs.
Most of the building time is spent compiling the alternatives, which is frustrating when I know only one will be used each time. Most of the run time is spent in the alternative routine, which is short but called many times.
Idea
Compile all the core routines to a native executable and all the alternatives to an llvm bitcode library. At runtime, the selected alternative only is automatically compiled and linked. This would hopefully save a lot of building time and slow down the running only marginally.
Questions
Is this even possible and if so, how?
Is it a good idea? Are there better ways of achieving a similar result?

Related

Understanding static & dynamic library linking [duplicate]

Are there any compelling performance reasons to choose static linking over dynamic linking or vice versa in certain situations? I've heard or read the following, but I don't know enough on the subject to vouch for its veracity.
1) The difference in runtime performance between static linking and dynamic linking is usually negligible.
2) (1) is not true if using a profiling compiler that uses profile data to optimize program hotpaths because with static linking, the compiler can optimize both your code and the library code. With dynamic linking only your code can be optimized. If most of the time is spent running library code, this can make a big difference. Otherwise, (1) still applies.

Dynamic linking can reduce total resource consumption (if more than one process shares the same library (including the version in "the same", of course)). I believe this is the argument that drives its presence in most environments. Here "resources" include disk space, RAM, and cache space. Of course, if your dynamic linker is insufficiently flexible there is a risk of DLL hell.
Dynamic linking means that bug fixes and upgrades to libraries propagate to improve your product without requiring you to ship anything.
Plugins always call for dynamic linking.
Static linking, means that you can know the code will run in very limited environments (early in the boot process, or in rescue mode).
Static linking can make binaries easier to distribute to diverse user environments (at the cost of sending a larger and more resource-hungry program).
Static linking may allow slightly faster startup times, but this depends to some degree on both the size and complexity of your program and on the details of the OS's loading strategy.
Some edits to include the very relevant suggestions in the comments and in other answers. I'd like to note that the way you break on this depends a lot on what environment you plan to run in. Minimal embedded systems may not have enough resources to support dynamic linking. Slightly larger small systems may well support dynamic linking because their memory is small enough to make the RAM savings from dynamic linking very attractive. Full-blown consumer PCs have, as Mark notes, enormous resources, and you can probably let the convenience issues drive your thinking on this matter.
To address the performance and efficiency issues: it depends.
Classically, dynamic libraries require some kind of glue layer which often means double dispatch or an extra layer of indirection in function addressing and can cost a little speed (but is the function calling time actually a big part of your running time???).
However, if you are running multiple processes which all call the same library a lot, you can end up saving cache lines (and thus winning on running performance) when using dynamic linking relative to using static linking. (Unless modern OS's are smart enough to notice identical segments in statically linked binaries. Seems hard, does anyone know?)
Another issue: loading time. You pay loading costs at some point. When you pay this cost depends on how the OS works as well as what linking you use. Maybe you'd rather put off paying it until you know you need it.
Note that static-vs-dynamic linking is traditionally not an optimization issue, because they both involve separate compilation down to object files. However, this is not required: a compiler can in principle, "compile" "static libraries" to a digested AST form initially, and "link" them by adding those ASTs to the ones generated for the main code, thus empowering global optimization. None of the systems I use do this, so I can't comment on how well it works.
The way to answer performance questions is always by testing (and use a test environment as much like the deployment environment as possible).

1) is based on the fact that calling a DLL function is always using an extra indirect jump. Today, this is usually negligible. Inside the DLL there is some more overhead on i386 CPU's, because they can't generate position independent code. On amd64, jumps can be relative to the program counter, so this is a huge improvement.
2) This is correct. With optimizations guided by profiling you can usually win about 10-15 percent performance. Now that CPU speed has reached its limits it might be worth doing it.
I would add: (3) the linker can arrange functions in a more cache efficient grouping, so that expensive cache level misses are minimised. It also might especially effect the startup time of applications (based on results i have seen with the Sun C++ compiler)
And don't forget that with DLLs no dead code elimination can be performed. Depending on the language, the DLL code might not be optimal either. Virtual functions are always virtual because the compiler doesn't know whether a client is overwriting it.
For these reasons, in case there is no real need for DLLs, then just use static compilation.
EDIT (to answer the comment, by user underscore)
Here is a good resource about the position independent code problem http://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries/
As explained x86 does not have them AFAIK for anything else then 15 bit jump ranges and not for unconditional jumps and calls. That's why functions (from generators) having more then 32K have always been a problem and needed embedded trampolines.
But on popular x86 OS like Linux you do not need to care if the .so/DLL file is not generated with the gcc switch -fpic (which enforces the use of the indirect jump tables). Because if you don't, the code is just fixed like a normal linker would relocate it. But while doing this it makes the code segment non shareable and it would need a full mapping of the code from disk into memory and touching it all before it can be used (emptying most of the caches, hitting TLBs) etc. There was a time when this was considered slow.
So you would not have any benefit anymore.
I do not recall what OS (Solaris or FreeBSD) gave me problems with my Unix build system because I just wasn't doing this and wondered why it crashed until I applied -fPIC to gcc.

Dynamic linking is the only practical way to meet some license requirements such as the LGPL.

I agree with the points dnmckee mentions, plus:
Statically linked applications might be easier to deploy, since there are fewer or no additional file dependencies (.dll / .so) that might cause problems when they're missing or installed in the wrong place.

One reason to do a statically linked build is to verify that you have full closure for the executable, i.e. that all symbol references are resolved correctly.
As a part of a large system that was being built and tested using continuous integration, the nightly regression tests were run using a statically linked version of the executables. Occasionally, we would see that a symbol would not resolve and the static link would fail even though the dynamically linked executable would link successfully.
This was usually occurring when symbols that were deep seated within the shared libs had a misspelt name and so would not statically link. The dynamic linker does not completely resolve all symbols, irrespective of using depth-first or breadth-first evaluation, so you can finish up with a dynamically linked executable that does not have full closure.

1/ I've been on projects where dynamic linking vs static linking was benchmarked and the difference wasn't determined small enough to switch to dynamic linking (I wasn't part of the test, I just know the conclusion)
2/ Dynamic linking is often associated with PIC (Position Independent Code, code which doesn't need to be modified depending on the address at which it is loaded). Depending on the architecture PIC may bring another slowdown but is needed in order to get benefit of sharing a dynamically linked library between two executable (and even two process of the same executable if the OS use randomization of load address as a security measure). I'm not sure that all OS allow to separate the two concepts, but Solaris and Linux do and ISTR that HP-UX does as well.
3/ I've been on other projects which used dynamic linking for the "easy patch" feature. But this "easy patch" makes the distribution of small fix a little easier and of complicated one a versioning nightmare. We often ended up by having to push everything plus having to track problems at customer site because the wrong version was token.
My conclusion is that I'd used static linking excepted:
for things like plugins which depend on dynamic linking
when sharing is important (big libraries used by multiple processes at the same time like C/C++ runtime, GUI libraries, ... which often are managed independently and for which the ABI is strictly defined)
If one want to use the "easy patch", I'd argue that the libraries have to be managed like the big libraries above: they must be nearly independent with a defined ABI that must not to be changed by fixes.

Static linking is a process in compile time when a linked content is copied into the primary binary and becomes a single binary.
Cons:
compile time is longer
output binary is bigger
Dynamic linking is a process in runtime when a linked content is loaded. This technic allows to:
upgrade linked binary without recompiling a primary one that increase an ABI stability[About]
has a single shared copy
Cons:
start time is slower(linked content should be copied)
linker errors are thrown in runtime
[iOS Static vs Dynamic framework]

It is pretty simple, really. When you make a change in your source code, do you want to wait 10 minutes for it to build or 20 seconds? Twenty seconds is all I can put up with. Beyond that, I either get out the sword or start thinking about how I can use separate compilation and linking to bring it back into the comfort zone.

Best example for dynamic linking is, when the library is dependent on the used hardware. In ancient times the C math library was decided to be dynamic, so that each platform can use all processor capabilities to optimize it.
An even better example might be OpenGL. OpenGl is an API that is implemented differently by AMD and NVidia. And you are not able to use an NVidia implementation on an AMD card, because the hardware is different. You cannot link OpenGL statically into your program, because of that. Dynamic linking is used here to let the API be optimized for all platforms.

Dynamic linking requires extra time for the OS to find the dynamic library and load it. With static linking, everything is together and it is a one-shot load into memory.
Also, see DLL Hell. This is the scenario where the DLL that the OS loads is not the one that came with your application, or the version that your application expects.

On Unix-like systems, dynamic linking can make life difficult for 'root' to use an application with the shared libraries installed in out-of-the-way locations. This is because the dynamic linker generally won't pay attention to LD_LIBRARY_PATH or its equivalent for processes with root privileges. Sometimes, then, static linking saves the day.
Alternatively, the installation process has to locate the libraries, but that can make it difficult for multiple versions of the software to coexist on the machine.

Another issue not yet discussed is fixing bugs in the library.
With static linking, you not only have to rebuild the library, but will have to relink and redestribute the executable. If the library is just used in one executable, this may not be an issue. But the more executables that need to be relinked and redistributed, the bigger the pain is.
With dynamic linking, you just rebuild and redistribute the dynamic library and you are done.

Static linking includes the files that the program needs in a single executable file.
Dynamic linking is what you would consider the usual, it makes an executable that still requires DLLs and such to be in the same directory (or the DLLs could be in the system folder).
(DLL = dynamic link library)
Dynamically linked executables are compiled faster and aren't as resource-heavy.

static linking gives you only a single exe, inorder to make a change you need to recompile your whole program. Whereas in dynamic linking you need to make change only to the dll and when you run your exe, the changes would be picked up at runtime.Its easier to provide updates and bug fixes by dynamic linking (eg: windows).

There are a vast and increasing number of systems where an extreme level of static linking can have an enormous positive impact on applications and system performance.
I refer to what are often called "embedded systems", many of which are now increasingly using general-purpose operating systems, and these systems are used for everything imaginable.
An extremely common example are devices using GNU/Linux systems using Busybox. I've taken this to the extreme with NetBSD by building a bootable i386 (32-bit) system image that includes both a kernel and its root filesystem, the latter which contains a single static-linked (by crunchgen) binary with hard-links to all programs that itself contains all (well at last count 274) of the standard full-feature system programs (most except the toolchain), and it is less than 20 megabytes in size (and probably runs very comfortably in a system with only 64MB of memory (even with the root filesystem uncompressed and entirely in RAM), though I've been unable to find one so small to test it on).
It has been mentioned in earlier posts that the start-up time of a static-linked binaries is faster (and it can be a lot faster), but that is only part of the picture, especially when all object code is linked into the same file, and even more especially when the operating system supports demand paging of code direct from the executable file. In this ideal scenario the startup time of programs is literally negligible since almost all pages of code will already be in memory and be in use by the shell (and and init any other background processes that might be running), even if the requested program has not ever been run since boot since perhaps only one page of memory need be loaded to fulfill the runtime requirements of the program.
However that's still not the whole story. I also usually build and use the NetBSD operating system installs for my full development systems by static-linking all binaries. Even though this takes a tremendous amount more disk space (~6.6GB total for x86_64 with everything, including toolchain and X11 static-linked) (especially if one keeps full debug symbol tables available for all programs another ~2.5GB), the result still runs faster overall, and for some tasks even uses less memory than a typical dynamic-linked system that purports to share library code pages. Disk is cheap (even fast disk), and memory to cache frequently used disk files is also relatively cheap, but CPU cycles really are not, and paying the ld.so startup cost for every process that starts every time it starts will take hours and hours of CPU cycles away from tasks which require starting many processes, especially when the same programs are used over and over, such as compilers on a development system. Static-linked toolchain programs can reduce whole-OS multi-architecture build times for my systems by hours. I have yet to build the toolchain into my single crunchgen'ed binary, but I suspect when I do there will be more hours of build time saved because of the win for the CPU cache.

Another consideration is the number of object files (translation units) that you actually consume in a library vs the total number available. If a library is built from many object files, but you only use symbols from a few of them, this might be an argument for favoring static linking, since you only link the objects that you use when you static link (typically) and don't normally carry the unused symbols. If you go with a shared lib, that lib contains all translation units and could be much larger than what you want or need.

C++ increase build speed in large project by using libraries

I'm currently trying to optimise build speed for a big project with following in mind:
build speed is priority 1
resulting binaries size does not matter
Infos:
Environment: Visual Studio 2012 (required, because of the software I'm developing for) + Windows machine
BuildTime: 12mins (clean build), 1min for small changes and every now and than small changes result in 5-6min because of slow linking (this is what I want to address)
Custom files in project: approx. 2500 (SDK I need to use excluded, a big SDK for a CAD system)
Lines of code in custom files: approx. 500000
I'm using an up-to-date CAD capable computer (32GB RAM, >3GHz QuadCore, SSD)
Ideas:
use precompiled headers => done, but does not have the effect I want; helps speed up compile time most of the time, but every now and than does not
split up project into libraries => not sure if this helps
Questions
I could not find anything about using libraries and build speed, but I assume if I precompile libraries, the linker will be faster.
Is this assumption true?
If I make a static library with the core functions, will this have an effect on build time? Or will the linker need as long as it does currently?
If I make a dynamic library, will this have an effect on build time? Or will the linker again check the dll completely and will need the same time?

I assume if I precompile libraries, the linker will be faster. Is this assumption true?
No, not likely. If at all (because the linker has to open fewer files), then the difference will be marginal.
If I make a static library with the core functions, will this have an effect on build time? Or will the linker need as long as it does currently?
It may make a huge difference on compile time, since although on a truly clean rebuild you still have to compile everything as before, on a normal "mostly clean" rebuild rebuilding the support libraries is superfluous since nothing ever changes inside them, so all you really need to rebuild is the user code, and as a result you compile a lot fewer files.
Note that every sane build system normally builds a dependency graph and tries to compile as few files as possible anyway (and, to the extent possible, with some level of parallelism), unless you explicitly tell it to do a clean build (which is rarely necessary to be done). Doctor, it hurts when I do this -- well, don't do it.
The difference for the linker will, again, be marginal. The linker still needs to look up the exact same amount of symbols, and still needs to copy the same amount of code into the executable.
You may want to play with link order. Funny as it sounds, sometimes the order in which libraries and object files are linked makes a 5x difference on how long it takes the linker to do its job.
That being said, 12 minutes for a clean build indeed isn't a lot. Your non-clean buils will likely be in the two-digit second range, of which linking probably takes 90%. That's normally not a showstopper. Come back when a build takes 4 hours :-)
If I make a dynamic library, will this have an effect on build time? Or will the linker again check the dll completely and will need the same time?
The linker will still have to do some work for every function you call, which might be slightly faster, but will still be more or less the same.
Note that you add runtime (startup) overhead by moving code into a DLL. It is more work for the loader to load a program with parts of the code in a DLL as it needs to load another image, parse its header, resolve symbols, set up some pointers, run per-thread init functions, etc. That's usually not an issue (the difference is not really that much noticeable), just letting you know it's not free.

12 minutes is a short full build time and 500KLOC is not that big. Many free software projects (GCC, Qt, ...) have longer ones (hours) and millions of C++ lines.
You might want to use a serious and parallel build automation tool, such as ninja. Perhaps you could do some distributed build (like what distcc permits) if you can compile on remote machines.
You could configure your IDE to run an external command (such as ninja) for builds. This don't change autocompletion abilities. You could adopt another source code editor (e.g. GNU emacs).
C++ is not (yet) modular (it does not have genuine modules, like e.g. Ocaml or Go), and that makes its compilation slow (e.g. because standard container headers are big, e.g. <vector> brings about 10KLOC of included code, probably used and included in most of your C++ code). So you should avoid having many small files (e.g. merging two files of 250 lines each into one of 500 lines could decrease build time) and it looks like you have too much small C++ files. I would recommend source files of more than a thousand lines each. Having only one class implementation (or one function) per source file slows down the total build time.
You surely want to use more indirection in your code. Use more systematically PIMPL idioms and virtual method tables, closures, std::function-s. Remember the rule of five.

How to add compilation for profiling to static library?

My project currently has a library that is static linked (compiled with gcc and linked with ar), but I am currently trying to profile my whole entire project with gprof, in which I would also like to profile this statically linked library. Is there any way of going about doing this?
Gprof requires that you provide -pg to GCC for compilation and -pg to the linker. However, ar complains when -pg is added to the list of flags for it.

I haven't used gprof in a long time, but is -pg even a valid argument to ar? Does profiling work if you compile all of the objects with -pg, then create your archive without -pg?
If you can't get gprof to work, gperftools contains a CPU profiler which I think should work very well in this case. You don't need to compile your application with any special flags, and you don't need to try to change how your static library is linked.
Before starting, there are two tradeoffs involved with using gperftools that you should be aware of:
gperftools is a sampling profiler. As such, your results won't be 100%
accurate, but they should be really good. The big upside to using a
sampling profiler is that it won't really slow your application down.
In multithreaded applications, in my experience, gperftools will only
profile the main thread. The only way I've been able to successfully
profile worker threads is by adding profiling code to my application.
With that said, profiling the main thread shouldn't require any code
changes.
There are lots of different ways to use gperftools. My preferred way is to load the gperftools library with $LD_PRELOAD, specify a logging destination with $CPUPROFILE, and maybe bump up the sample frequency with $CPUPROFILE_FREQUENCY before starting my application up. Something like this:
export LD_PRELOAD=/usr/lib/libprofiler.so
export CPUPROFILE=/tmp/prof.out
export CPUPROFILE_FREQUENCY=10000
./my_application
This will write a bunch of profiling information to /tmp/prof.out. You can run a post-processing script to convert this file into something human readable. There are lots of supported output formats -- my preferred one is callgrind:
google-pprof --callgrind /path/to/my_application /tmp/prof.out > callgrind.dat
kcachegrind callgrind.dat &
This should provide a nice view of where your program is spending its time.
If you're interested, I spent some time over the weekend learning how to use gperftools to profile I/O bound applications, and I documented a lot of my findings here. There's a lot of overlap with what you're trying to do, so maybe it will be helpful.

Why compile + link when build a C++ code, instead of generating executable directly

I was asked of this question when mentor an entry-level programmer, I was thinking of this compile + link process so official and usual that I never think about why.
One thing I could think of is to improve the development productivity, but should there be any other more compiler-related reasons?

Efficiency.
When you compile a program you create an object file for each source file, if you change a source file you only need to recompile that module and then relink (relinking is cheap).
If the compiler did everything in one pass it would have to recompile everything for every change.
It also fits with the unix philosophy of small programs that do one thing, so you have a pre-processor, a compiler, a linker, a library creator. These steps might now be different modes of the same tool.
However there are reasons why you want the compiler to link in one step, there are some optimizations you can do if you allow the compiler to change object files at link time - most modern compilers allow this but it requires them to put extra info into the object files at compile time.
It would be better if the compiler could store the entire project in a single database, rather than the mess of sources, resources, browse info files, object files etc - but developers are very conservative!

Part of this is historical. Back in the dark ages, computers had little memory. It was easy to to create a program with enough source code that all of it couldn't be processed at one time. So the processing had to be done in stages: preprocessing source code, compile source to assembly (one by one), assembly to object code, all object files linked into the final executable. Each of these steps had one or more stand alone tools to do its task. Over the years the tools were improved incrementally, but no major redesign of the process has ever become mainstream.

It's important that the build time, even for a very large project, be under 24 hours. And being able to build overnight is better. Separate compilation, which is to say dividing a program into "compilation units" and compiling them independently, is the way to reduce build time:
If a compilation unit hasn't been changed, and if nothing it depends on has changed, you can reuse the result of an old compilation.
You can often compile multiple units in parallel, or even distributed over a network of workstations. The lowly Make will compile in parallel, and other tools like ccdist exist to distribute the work of compilation.
Linking provides few benefits in and of itself but is necessary to use the results of separate compilation.

What an excellent time to teach your protégé about the Single Responsibility Principle!

Compiling the file changes the code into binary that the computer can read. Linking the file tells the computer how to complete a command. So its impossible to generate it all at once, without the two steps.

Compile and optimize for different target architectures

Summary: I want to take advantage of compiler optimizations and processor instruction sets, but still have a portable application (running on different processors). Normally I could indeed compile 5 times and let the user choose the right one to run.
My question is: how can I can automate this, so that the processor is detected at runtime and the right executable is executed without the user having to chose it?
I have an application with a lot of low level math calculations. These calculations will typically run for a long time.
I would like to take advantage of as much optimization as possible, preferably also of (not always supported) instruction sets. On the other hand I would like my application to be portable and easy to use (so I would not like to compile 5 different versions and let the user choose).
Is there a possibility to compile 5 different versions of my code and run dynamically the most optimized version that's possible at execution time? With 5 different versions I mean with different instruction sets and different optimizations for processors.
I don't care about the size of the application.
At this moment I'm using gcc on Linux (my code is in C++), but I'm also interested in this for the Intel compiler and for the MinGW compiler for compilation to Windows.
The executable doesn't have to be able to run on different OS'es, but ideally there would be something possible with automatically selecting 32 bit and 64 bit as well.
Edit: Please give clear pointers how to do it, preferably with small code examples or links to explanations. From my point of view I need a super generic solution, which is applicable on any random C++ project I have later.
Edit I assigned the bounty to ShuggyCoUk, he had a great number of pointers to look out for. I would have liked to split it between multiple answers but that is not possible. I'm not having this implemented yet, so the question is still 'open'! Please, still add and/or improve answers, even though there is no bounty to be given anymore.
Thanks everybody!

Yes it's possible. Compile all your differently optimised versions as different dynamic libraries with a common entry point, and provide an executable stub that that loads and runs
the correct library at run-time, via the entry point, depending on config file or other information.

Can you use script?
You could detect the CPU using script, and dynamically load the executable that is most optimized for architecture. It can choose 32/64 bit versions too.
If you are using a Linux you can query the cpu with
cat /proc/cpuinfo
You could probably do this with a bash/perl/python script or windows scripting host on windows. You probably don't want to force the user to install a script engine. One that works on the OS out of the box IMHO would be best.
In fact, on windows you probably would want to write a small C# app so you can more easily query the architecture. The C# app could just spawn whatever executable is fastest.
Alternatively you could put your different versions of code in a dll's or shared object's, then dynamically load them based on the detected architecture. As long as they have the same call signature it should work.

If you wish this to cleanly work on Windows and take full advantage in 64bit capable platforms of the additional 1. Addressing space and 2. registers (likely of more use to you) you must have at a minimum a separate process for the 64bit ones.
You can achieve this by having a separate executable with the relevant PE64 header. Simply using CreateProcess will launch this as the relevant bitness (unless the executable launched is in some redirected location there is no need to worry about WoW64 folder redirection
Given this limitation on windows it is likely that simply 'chaining along' to the relevant executable will be the simplest option for all different options, as well as making testing an individual one simpler.
It also means you 'main' executable is free to be totally separate depending on the target operating system (as detecting the cpu/OS capabilities is, by it's nature, very OS specific) and then do most of the rest of your code as shared objects/dlls.
Also you can 'share' the same files for two different architectures if you currently do not feel that there is any point using the differing capabilities.
I would suggest that the main executable is capable of being forced into making a specific choice so you can see what happens with 'lesser' versions on a more capable machine (or what errors come up if you try something different).
Other possibilities given this model are:
Statically linking to different versions of the standard runtimes (for ones with/without thread safety) and using them appropriately if you are running without any SMP/SMT capabilities.
Detect if multiple cores are present and whether they are real or hyper threading (also whether the OS knows how the schedule effectively in those cases)
checking the performance of things like the system timer/high performance timers and using code optimized to this behaviour, say if you do anything where you look for a certain amount of time to expire and thus can know your best possible granularity.
If you wish to optimize you choice of code based on cache sizing/other load on the box. If you are using unrolled loops then more aggressive unrolling options may depend on having a certain amount level 1/2 cache.
Compiling conditionally to use doubles/floats depending on the architecture. Less important on intel hardware but if you are targetting certain ARM cpu's some have actual floating point hardware support and others require emulation. The optimal code would change heavily, even to the extent you just use conditional compilation rather than using the optimizing compiler(1).
Making use of co-processor hardware like CUDA capable graphics cards.
detect virtualization and alter behaviour (perhaps trying to avoid file system writes)
As to doing this check you have a few options, the most useful one on Intel being the the cpuid instruction.
Windows
Use someone else's implementation but you'll have to pay
Use a free open source one
Linux
Use the built in one
You could also look at open source software doing the same thing
Pixman does a fair amount of this and is a permissive licence.
Alternatively re-implement/update an existing one using available documentation on the features you need.
Quite a lot of separate documents to work out how to detect things:
Intel:
SSE 4.1/4.2
SSE3
MMX
A large part of what you would be paying for in the CPU-Z library is someone doing all this (and the nasty little issues involved) for you.
be careful with this - it is hard to beat decent optimizing compilers on this

Have a look at liboil: http://liboil.freedesktop.org/wiki/ . It can dynamically select implementations of multimedia-related computations at run-time. You may find you can liboil itself and not just its techniques.

Since you mention you are using GCC, I'll assume your code is in C (or C++).
Neil Butterworth already suggested making separate dynamic libraries, but that requires some non-trivial cross-platform considerations (manually loading dynamic libraries is different on Linux, Windows, OSX, etc., and getting it right will likely take some time).
A cheap solution is to simply write all of your variants using unique names, and use a function pointer to select the proper one at runtime.
I suspect the extra dereference caused by the function pointer will be amortized by the actual work you are doing (but you'll want to confirm that).
Also, getting different compiler optimizations will likely require different .c/.cpp files, as well as some twiddling of your build tool. But it's probably less overall work than separate libraries (which needed this already in one form or another).

Since you didn't specify whether you have limits on the number of files, I propose another solution: compile 5 executables, and then create a sixth executable that launches the appropriate binary. Here is some pseudocode, for Linux
int main(int argc, char* argv[])
{
char* target_path[MAXPATH];
char* new_argv[];
char* specific_version = determine_name_of_specific_version();
strcpy(target_path, "/usr/lib/myapp/versions");
strcat(target_path, specific_version);
/* append NULL to argv */
new_argv = malloc(sizeof(char*)*(argc+1));
memcpy(new_argv, argv, argc*sizeof(char*));
new_argv[argc] = 0;
/* optionally set new_argv[0] to target_path */
execv(target_path, new_argv);
}
On the plus side, this approach allows to provide the user transparently with both 32-bit and 64-bit binaries, unlike any library methods that have been proposed. On the minus side, there is no execv in Win32 (but a good emulation in cygwin); on Windows, you have to create a new process, rather than re-execing the current one.

Lets break the problem down to its two constituent parts. 1) Creating platform dependent optimized code and 2) building on multiple platforms.
The first problem is pretty straightforward. Encapsulate the platform dependent code in a set of functions. Create a different implementation of each function for each platform. Put each implementation in its own file or set of files. It's easiest for the build system if you put each platform's code in a separate directory.
For part two I suggest you look at Gnu Atuotools (Automake, AutoConf, and Libtool). If you've ever downloaded and built a GNU program from source code you know you have to run ./configure before running make. The purpose of the configure script is to 1) verify that your system has all of the required libraries and utilities need to build and run the program and 2) customize the Makefiles for the target platform. Autotools is the set of utilities for generating the configure script.
Using autoconf, you can create little macros to check that the machine supports all of the CPU instructions your platform dependent code needs. In most cases, the macros already exists, you just have to copy them into your autoconf script. Then, automake and autoconf can set up the Makefiles to pull in the appropriate implementation.
All this is a bit much for creating an example here. It takes a little time to learn. But the documentation is all out there. There is even a free book available online. And the process is applicable to your future projects. For multi-platform support, this is really the most robust and easiest way to go, I think. A lot of the suggestions posted in other answers are things that Autotools deals with (CPU detection, static & shared library support) without you have to think about it too much. The only wrinkle you might have to deal with is finding out if Autotools are available for MinGW. I know they are part of Cygwin if you can go that route instead.

You mentioned the Intel compiler. That is funny, because it can do something like this by default. However, there is a catch. The Intel compiler didn't insert checks for the approopriate SSE functionality. Instead, they checked if you had a particular Intel chip. There would still be a slow default case. As a result, AMD CPUs would not get suitable SSE-optimized versions. There are hacks floating around that will replace the Intel check with a proper SSE check.
The 32/64 bits difference will require two executables. Both the ELF and PE format store this information in the exectuables header. It's not too hard to start the 32 bits version by default, check if you are on a 64 bit system, and then restart the 64 bit version. But it may be easier to create an appropriate symlink at installation time.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js