Summary: I want to take advantage of compiler optimizations and processor instruction sets, but still have a portable application (running on different processors). Normally I could indeed compile 5 times and let the user choose the right one to run.
My question is: how can I automate this, so that the processor is detected at runtime and the right executable is executed without the user having to choose it?
I have an application with a lot of low level math calculations. These calculations will typically run for a long time.
I would like to take advantage of as much optimization as possible, preferably also of (not always supported) instruction sets. On the other hand I would like my application to be portable and easy to use (so I would not like to compile 5 different versions and let the user choose).
Is there a possibility to compile 5 different versions of my code and run dynamically the most optimized version that's possible at execution time? With 5 different versions I mean with different instruction sets and different optimizations for processors.
I don't care about the size of the application.
At this moment I'm using gcc on Linux (my code is in C++), but I'm also interested in this for the Intel compiler and for the MinGW compiler for compilation to Windows.
The executable doesn't have to be able to run on different OS'es, but ideally there would be something possible with automatically selecting 32 bit and 64 bit as well.
Edit: Please give clear pointers how to do it, preferably with small code examples or links to explanations. From my point of view I need a super generic solution, which is applicable on any random C++ project I have later.
Edit: I assigned the bounty to ShuggyCoUk; he had a great number of pointers to look out for. I would have liked to split it between multiple answers but that is not possible. I haven't implemented this yet, so the question is still 'open'! Please, still add and/or improve answers, even though there is no bounty to be given anymore.
Thanks everybody!
Yes, it's possible. Compile all your differently optimised versions as different dynamic libraries with a common entry point, and provide an executable stub that loads and runs the correct library at run-time, via the entry point, depending on a config file or other information.
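A minimal sketch of such a stub on Linux, assuming each variant library exports a common entry point (the library name and the "run_app" symbol are made up for illustration; link the stub with -ldl):

#include <dlfcn.h>
#include <cstdio>

// Placeholder: a real stub would inspect cpuid, /proc/cpuinfo or a config file.
static const char* pick_best_library() {
    return "./libapp_generic.so";
}

int main(int argc, char* argv[]) {
    void* handle = dlopen(pick_best_library(), RTLD_NOW);
    if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    // All variants export the same entry point with the same signature.
    using entry_fn = int (*)(int, char**);
    auto run = reinterpret_cast<entry_fn>(dlsym(handle, "run_app"));
    if (!run) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    int rc = run(argc, argv);
    dlclose(handle);
    return rc;
}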
Can you use script?
You could detect the CPU using a script, and dynamically load the executable that is most optimized for the architecture. It can choose 32/64-bit versions too.
If you are using Linux you can query the CPU with
cat /proc/cpuinfo
You could probably do this with a bash/perl/python script, or with Windows Scripting Host on Windows. You probably don't want to force the user to install a script engine; one that works on the OS out of the box would IMHO be best.
In fact, on Windows you would probably want to write a small C# app so you can more easily query the architecture. The C# app could just spawn whatever executable is fastest.
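For the Linux side, a rough C++ sketch of such a launcher that reads /proc/cpuinfo and re-executes the most appropriate build (the executable names and flag checks are illustrative only):

#include <fstream>
#include <string>
#include <unistd.h>

int main(int argc, char* argv[]) {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line, flags;
    while (std::getline(cpuinfo, line))
        if (line.compare(0, 5, "flags") == 0) { flags = line; break; }

    const char* exe = "./myapp_generic";                  // fallback build
    if (flags.find(" avx") != std::string::npos)          exe = "./myapp_avx";
    else if (flags.find(" sse4_1") != std::string::npos)  exe = "./myapp_sse41";

    execv(exe, argv);   // replaces this process with the chosen build
    return 1;           // only reached if execv failed
}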
Alternatively you could put your different versions of code in DLLs or shared objects, then dynamically load them based on the detected architecture. As long as they have the same call signature it should work.
If you wish this to work cleanly on Windows and take full advantage, on 64-bit capable platforms, of the additional (1) address space and (2) registers (likely of more use to you), you must at a minimum have a separate process for the 64-bit version.
You can achieve this by having a separate executable with the relevant PE64 header. Simply using CreateProcess will launch it with the relevant bitness (unless the executable launched is in some redirected location, there is no need to worry about WoW64 folder redirection).
Given this limitation on Windows, it is likely that simply 'chaining along' to the relevant executable will be the simplest option for all the different configurations, as well as making testing an individual one simpler.
It also means your 'main' executable is free to be totally separate depending on the target operating system (as detecting the CPU/OS capabilities is, by its nature, very OS specific) and then do most of the rest of your code as shared objects/DLLs.
Also you can 'share' the same files for two different architectures if you currently do not feel that there is any point using the differing capabilities.
I would suggest that the main executable is capable of being forced into making a specific choice so you can see what happens with 'lesser' versions on a more capable machine (or what errors come up if you try something different).
Other possibilities given this model are:
Statically linking to different versions of the standard runtimes (for ones with/without thread safety) and using them appropriately if you are running without any SMP/SMT capabilities.
Detect if multiple cores are present and whether they are real or hyper-threaded (and also whether the OS knows how to schedule effectively in those cases); a minimal core-count sketch follows this list.
Checking the performance of things like the system timer/high performance timers and using code optimized to this behaviour, say if you do anything where you look for a certain amount of time to expire and thus can know your best possible granularity.
If you wish to optimize your choice of code based on cache sizing/other load on the box. If you are using unrolled loops then more aggressive unrolling options may depend on having a certain amount of level 1/2 cache.
Compiling conditionally to use doubles/floats depending on the architecture. Less important on Intel hardware, but if you are targeting certain ARM CPUs some have actual floating-point hardware support and others require emulation. The optimal code would change heavily, even to the extent that you just use conditional compilation rather than relying on the optimizing compiler.
Making use of co-processor hardware like CUDA capable graphics cards.
Detect virtualization and alter behaviour (perhaps trying to avoid file system writes).
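As a starting point for the core-count item above, standard C++ can already report the number of hardware threads; anything finer-grained (physical vs. hyper-threaded cores) needs platform-specific queries, so treat this as a minimal sketch:

#include <cstdio>
#include <thread>

int main() {
    unsigned n = std::thread::hardware_concurrency();   // may return 0 if unknown
    std::printf("hardware threads reported: %u\n", n ? n : 1);
    // Telling real cores from SMT siblings requires extra, platform-specific
    // work (e.g. cpuid topology leaves, /proc/cpuinfo, or Win32 APIs).
    return 0;
}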
As to doing this check, you have a few options, the most useful one on Intel being the cpuid instruction.
Windows
Use someone else's implementation but you'll have to pay
Use a free open source one
Linux
Use the built in one
You could also look at open source software doing the same thing
Pixman does a fair amount of this and is under a permissive licence.
Alternatively re-implement/update an existing one using available documentation on the features you need.
Quite a lot of separate documents to work out how to detect things:
Intel:
SSE 4.1/4.2
SSE3
MMX
A large part of what you would be paying for in the CPU-Z library is someone doing all this (and the nasty little issues involved) for you.
Be careful with this - it is hard to beat decent optimizing compilers on this.
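Since the question mentions GCC: newer GCC versions wrap the cpuid check in builtins, which saves writing the assembly yourself. A minimal sketch (the build names printed here are placeholders):

#include <cstdio>

int main() {
    __builtin_cpu_init();                        // initialise GCC's CPU feature data
    if (__builtin_cpu_supports("avx2"))
        std::puts("use the AVX2 build");
    else if (__builtin_cpu_supports("sse4.1"))
        std::puts("use the SSE4.1 build");
    else
        std::puts("use the generic build");
    return 0;
}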
Have a look at liboil: http://liboil.freedesktop.org/wiki/ . It can dynamically select implementations of multimedia-related computations at run-time. You may find you can use liboil itself and not just its techniques.
Since you mention you are using GCC, I'll assume your code is in C (or C++).
Neil Butterworth already suggested making separate dynamic libraries, but that requires some non-trivial cross-platform considerations (manually loading dynamic libraries is different on Linux, Windows, OSX, etc., and getting it right will likely take some time).
A cheap solution is to simply write all of your variants using unique names, and use a function pointer to select the proper one at runtime.
I suspect the extra dereference caused by the function pointer will be amortized by the actual work you are doing (but you'll want to confirm that).
Also, getting different compiler optimizations will likely require different .c/.cpp files, as well as some twiddling of your build tool. But it's probably less overall work than separate libraries (which needed this already in one form or another).
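A minimal sketch of the function-pointer idea (the variant names are illustrative; in a real project each variant would sit in its own .cpp file compiled with different flags, e.g. -msse4.1, and the trivial bodies here exist only so the sketch compiles):

#include <cstddef>
#include <cstdio>

// In practice these live in separate translation units built with different flags.
void compute_generic(float* d, std::size_t n) { for (std::size_t i = 0; i < n; ++i) d[i] *= 2.0f; }
void compute_sse41(float* d, std::size_t n)   { for (std::size_t i = 0; i < n; ++i) d[i] *= 2.0f; }

using compute_fn = void (*)(float*, std::size_t);

compute_fn select_compute() {
    __builtin_cpu_init();                              // GCC-specific feature check
    return __builtin_cpu_supports("sse4.1") ? compute_sse41 : compute_generic;
}

int main() {
    compute_fn compute = select_compute();             // resolve once at startup
    float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    compute(data, 4);                                  // hot loops call through the pointer
    std::printf("%f\n", data[0]);
    return 0;
}

GCC's ifunc attribute can automate the same resolve-once pattern at dynamic-link time, if you'd rather not manage the pointer yourself.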
Since you didn't specify whether you have limits on the number of files, I propose another solution: compile 5 executables, and then create a sixth executable that launches the appropriate binary. Here is some pseudocode, for Linux:
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
    char target_path[PATH_MAX];
    char** new_argv;
    /* placeholder: returns e.g. "myapp_sse41" based on CPU detection */
    char* specific_version = determine_name_of_specific_version();

    strcpy(target_path, "/usr/lib/myapp/versions/");
    strcat(target_path, specific_version);

    /* copy argv and append a terminating NULL */
    new_argv = malloc(sizeof(char*) * (argc + 1));
    memcpy(new_argv, argv, argc * sizeof(char*));
    new_argv[argc] = 0;
    /* optionally set new_argv[0] to target_path */

    execv(target_path, new_argv);
}
On the plus side, this approach allows you to transparently provide the user with both 32-bit and 64-bit binaries, unlike any of the library methods that have been proposed. On the minus side, there is no execv in Win32 (though there is a good emulation in Cygwin); on Windows, you have to create a new process rather than re-execing the current one.
Let's break the problem down into its two constituent parts: 1) creating platform-dependent optimized code, and 2) building on multiple platforms.
The first problem is pretty straightforward. Encapsulate the platform dependent code in a set of functions. Create a different implementation of each function for each platform. Put each implementation in its own file or set of files. It's easiest for the build system if you put each platform's code in a separate directory.
For part two I suggest you look at GNU Autotools (Automake, Autoconf, and Libtool). If you've ever downloaded and built a GNU program from source code you know you have to run ./configure before running make. The purpose of the configure script is to 1) verify that your system has all of the required libraries and utilities needed to build and run the program and 2) customize the Makefiles for the target platform. Autotools is the set of utilities for generating the configure script.
Using autoconf, you can create little macros to check that the machine supports all of the CPU instructions your platform-dependent code needs. In most cases, the macros already exist; you just have to copy them into your autoconf script. Then, automake and autoconf can set up the Makefiles to pull in the appropriate implementation.
All this is a bit much for creating an example here. It takes a little time to learn. But the documentation is all out there. There is even a free book available online. And the process is applicable to your future projects. For multi-platform support, this is really the most robust and easiest way to go, I think. A lot of the suggestions posted in other answers are things that Autotools deals with (CPU detection, static & shared library support) without you having to think about it too much. The only wrinkle you might have to deal with is finding out if Autotools are available for MinGW. I know they are part of Cygwin if you can go that route instead.
You mentioned the Intel compiler. That is funny, because it can do something like this by default. However, there is a catch. The Intel compiler didn't insert checks for the appropriate SSE functionality. Instead, it checked whether you had a particular Intel chip. There would still be a slow default case. As a result, AMD CPUs would not get suitable SSE-optimized versions. There are hacks floating around that will replace the Intel check with a proper SSE check.
The 32/64-bit difference will require two executables. Both the ELF and PE formats store this information in the executable's header. It's not too hard to start the 32-bit version by default, check if you are on a 64-bit system, and then restart the 64-bit version. But it may be easier to create an appropriate symlink at installation time.
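A rough sketch of the restart approach on Linux (the installed path of the 64-bit sibling is made up):

#include <string.h>
#include <sys/utsname.h>
#include <unistd.h>

int main(int argc, char* argv[]) {
    struct utsname u;
    if (uname(&u) == 0 && strcmp(u.machine, "x86_64") == 0) {
        /* 64-bit kernel detected: hand over to the 64-bit build */
        execv("/usr/lib/myapp/myapp64", argv);
        /* execv only returns on failure; fall through to the 32-bit code */
    }
    /* ... continue running as the 32-bit version ... */
    return 0;
}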
OK, so: I have successfully integrated the first working Halide generator into the cmake build system for my little image-processing project.
The generator implements an image-resizing and -resampling algorithm, based on the example code from the Halide codebase – Halide/apps/resize/resize.cpp – I adapted the sample in order to leverage generator parameters, and tied the generator's compilation and invocation to my cmake script using the functions defined in HalideGenerator.cmake, just as the Halide project does in its own build script.
All this works great, so far – but my domain expertise is lacking in the realm of code-generation nuances. For example, I tweaked the scheduling method to get the best observed empirical speed on my laptop – but despite taking many long tinkering sessions and code-reading sojourns into the depths of Halide’s many generator-related tools and scripts, I have only the most superficial understanding of the code-generation process.
Specifically, I don’t know how to approach this. Is it best to use defaults or try to turn on specific options for my target platform – and if the latter, do I have to have conditional code somewhere, or can the binary include fallbacks?
Here’s what I am talking about: in the source for Halide tutorial lesson #15, there’s a complex script that invokes a generator with various options. Here’s a snippet from code comments in this script:
# If you're compiling and linking multiple Halide pipelines, then the
# multiple copies of the runtime should combine into a single copy
# (via weak linkage). If you're compiling and linking for multiple
# different targets (e.g. avx and non-avx), then the runtimes might be
# different, and you can't control which copy of the runtime the
# linker selects.
# You can control this behavior explicitly by compiling your pipelines
# with the no_runtime target flag. Let's generate and link several
# different versions of the first pipeline for different x86 variants: [snip]
… from this it is hard to separate what must be done from what should be done, or what may be done discretionally. Comparatively, one doesn't have to deal with these issues when setting up C++ or Objective-C projects (even far more Byzantine ones), as the compiler and linker make most of these decisions for you, and at most need a flag or two.
My question is: how can I integrate the Halide generator’s output library binaries into my existing project – such that the generator output is as fast as possible (e.g. uses GPU, SSE2/3, AVX2 etc) without further constraining portability (e.g. it won’t mysteriously segfault on a slightly different machine)?
Specifically, what should my process be – as in, should I only target the lowest-common-denominator at first, and then leverage more exotic processor features incrementally?
EDIT: As I mentioned in comments below, this is what my GenGen binary outputs to stdout when invoked with no options:
As of recently, AOT compilation supports generating customizations for multiple CPU features with runtime detection. Just use GenGen with a comma-separated list of targets and static_library as the output, e.g.
GenGen -e static_library,h target=x86-64-linux-sse41-avx,x86-64-linux-sse41,x86-64-linux
This will produce a .a file that contains:
3 versions of your code, compiled with specializations for AVX+SSE4.1, SSE4.1, and plain-old-x86-64
a thin wrapper that uses the halide_can_use_target_features() runtime call to choose the right one for the actual target machine
See Func::compile_to_multitarget_static_library() and multitarget_generator.cpp/multitarget_aottest.cpp for more information.
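As a rough illustration of how that call is used outside the generator machinery (API details vary between Halide versions, and the pipeline here is a meaningless placeholder):

#include "Halide.h"
#include <vector>
using namespace Halide;

int main() {
    Func f("f");
    Var x("x");
    f(x) = x * 2;   // placeholder pipeline

    std::vector<Target> targets = {
        Target("x86-64-linux-sse41-avx"),
        Target("x86-64-linux-sse41"),
        Target("x86-64-linux"),
    };
    // Produces one .a containing all three specializations plus the thin
    // runtime-dispatch wrapper described above.
    f.compile_to_multitarget_static_library("f_multi", {}, targets);
    return 0;
}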
For the case of pre-generating your binaries (AOT), it sounds like you want runtime dispatch. Your program will examine the CPU/GPU environment at startup and decide what features (AVX, OpenCL, etc.) should be used. This is not Halide specific.
Select a set of advanced features to target (high powered desktop GPU) as a best case and a set of minimal features that will work on every machine (SSE2 only).
Build a DLL/dylib/so for each of these feature sets that contains every performance hungry function. These can be scheduled differently or even built with completely different Func definitions. You can have both sets in the same source code file and test the Target object at generation time to choose between them.
At program startup, see if your best case features are present and, if so, load that library and use it. If any features are missing, default to the most compatible version.
You are free to choose how many feature sets and libraries you want to support.
The alternative is to compile your Halide code at program startup (JIT). I recommend using the Target object returned by get_jit_target_from_environment(), which uses the contents of the environment variable HL_JIT_TARGET or "host" if that variable is not set. The "host" target string is the same as get_host_target() and means Halide will examine the CPU/GPU environment and set whatever features it finds. You can then dynamically test the Target object and use GPU or CPU scheduling.
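A hedged sketch of that JIT path (the scheduling choices and sizes are arbitrary, and exact scheduling calls differ between Halide versions):

#include "Halide.h"
using namespace Halide;

int main() {
    Target t = get_jit_target_from_environment();   // HL_JIT_TARGET or "host"

    Func f("f");
    Var x("x"), y("y");
    f(x, y) = x + y;                                 // placeholder computation

    if (t.has_feature(Target::CUDA) || t.has_feature(Target::OpenCL)) {
        Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
        f.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);    // GPU schedule
    } else {
        f.vectorize(x, 8).parallel(y);               // CPU schedule
    }

    Buffer<int32_t> out = f.realize({256, 256}, t);
    return 0;
}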
I want to give a C++ programme to someone for testing but I don't want them to see the source just yet. My main issues are that I don't know what platform that person is using and I don't want to create a shared library unless I have no other option. Ideally, I would like to send headers and object files for the person to compile and link him/herself but as far as I know that would only work if the person has the same set up that I have.
I am currently using Windows, but I'm comfortable working on any Unix-like system as well, and I am not using an IDE, in case you need that information.
Well, a Windows development environment lets you bind to native, always-backward-compatible WinAPI functions. Distributing correctly set up binary .dll files, along with consistent headers, is enough.
For Linux distributions, the scenario is different, since you either need to distribute a package compiled from source (which is disclosed), or distribute binaries for every Linux distribution you actually want to support.
If you want to avoid source code disclosure where compiling on specific target systems would require it, use a licensing mechanism that prevents the program from being run.
Assuming the choice of machine is "reasonable" - in other words, it's something running Linux, Windows, Android or MacOS and a reasonable target processor such as MIPS, Sparc, x86 or ARM, then one POSSIBLE solution is to use clang -S -emit-llvm yourfile.cpp to produce an intermediate form of the LLVM "virtual machine code". This can then, using llc, be compiled to machine code for any target that LLVM supports.
It's not completely impossible to figure out roughly what the source code looked like, but unless someone wants to put a LOT of effort into it, they won't be able to see what the code does. And even giving someone a binary allows them, if they are that way inclined, to reverse engineer the code.
The other alternative, as I see it, is that you demonstrate the code on your machine [or a machine under your control].
There are also tools that can "obfuscate" source-code (rename variables, structure/class members and functions to a, b, c; remove any comments; and "unformat" the code - all of which makes it much harder to understand what the code does). Sorry, you'll have to google to find a good one, as I have never used such a thing myself. And again, of course, it's not IMPOSSIBLE to recover the code into something that can be used and modified and rebuilt. There is really no way to avoid giving the customer something they can compile unless you know what OS/processor it is for.
I have compiled my code with specific flags (-Os, -O2, -march=native and their combinations) in order to produce a faster execution time.
But my problem is that I don't always run on the same machine (because in my lab there are several different machines). Sometimes I run on macOS, sometimes on Linux (in both cases with different OS versions).
I wonder if there is a way to determine which binary should be run depending on the environment where it will run (I mean cache size, CPU cores, and other properties of the specific machine). In other words, how can I choose (when the program loads) the fastest binary (previously compiled with different target binary sizes and instruction-set extensions) according to the specific machine used?
Thanks in advance.
What you're talking about is called a fat binary (not FAT, the acronym). From Wikipedia:
A fat binary (or multiarchitecture binary) is a computer executable program which has been expanded (or "fattened") with code native to multiple instruction sets which can consequently be run on multiple processor types. This results in a file larger than a normal one-architecture binary file, thus the name.
At a quick glance, there doesn't seem to be much support for it (see this question from the Programmers Stack Exchange for more information). Apple implemented this briefly when transitioning from PowerPC to Intel, but it doesn't seem to have been explored much since then.
Technically, fat binaries refer to a single binary that could run on multiple architectures...but I imagine the premise would hold for a single binary that runs on multiple OSes. And it comes back to the point Bizkit made in his/her/zir answer - generally, you compile your source code for the environment that you're in ahead of time.
You could prebuild a bunch of executables and choose one according to an environment variable or things like uname. A better approach to the problem is to choose a toolchain that is able to perform JIT, install-time optimization and/or runtime optimization, like LLVM.
Is there a reason you can't just recompile your source code on each machine? Compilers are already written and optimized exactly for this kind of stuff. Simply recompile your source code on that machine architecture and you'll have a binary that runs just fine on that machine.
If you want your code tuned for the cache size of the machine you run on, check out the way Automatically Tuned Linear Algebra Software (ATLAS) does it. When you compile it, it runs some tests to find what size to use for cache-blocking its loops, and puts that in a header file.
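To make the idea concrete (this is not ATLAS code; BLOCK stands in for the constant that an ATLAS-style configure run would benchmark and write into a generated header):

#include <algorithm>
#include <cstddef>

constexpr std::size_t BLOCK = 64;   // would come from the generated header

// Cache-blocked matrix multiply: C += A * B, all N x N, row-major.
void matmul_blocked(const double* A, const double* B, double* C, std::size_t N) {
    for (std::size_t ii = 0; ii < N; ii += BLOCK)
        for (std::size_t kk = 0; kk < N; kk += BLOCK)
            for (std::size_t jj = 0; jj < N; jj += BLOCK)
                for (std::size_t i = ii; i < std::min(ii + BLOCK, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BLOCK, N); ++k)
                        for (std::size_t j = jj; j < std::min(jj + BLOCK, N); ++j)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}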
I'm making a very simple program with c++ for linux usage, and I'd like to know if it is possible to make just one big binary containing all the dependencies that would work on any linux system.
If my understanding is correct, any compiler turns source code into machine instructions, but since there are often common parts of code that can be reused with different programs, most programs depend on another libraries.
However, if I have the source code for all my dependencies, should I be able to compile a binary in a way that would not require anything from the system? And will I be able to run something compiled on a 64-bit system on a 32-bit system?
In short: Maybe.
The longer answer is:
It depends. You can't, for example, run a 64-bit binary on a 32-bit system; that's not even nearly possible. Yes, it's the same processor family, but the 64-bit system has twice as many registers, and they are twice as wide. What's the 32-bit processor going to "give back" for the value of those bits and registers that don't exist in its hardware? It just plain won't work. Some of the instructions also completely change meaning, so the system really needs to be "right" for the compiled code, or it won't work - fortunately, Linux will check this and plain refuse if it's not right.
You can BUILD a 32-bit binary on a 64-bit system (assuming you have all the right libraries, etc, installed for both 64- and 32-bit, etc).
Similarly, if you try to run ARM code on an x86 processor, or MIPS code on an ARM processor, it simply has no chance of working, because the actual instructions are completely different (or they would be in breach of some patent/copyright or similar, because processor instruction sets contain portions that are "protected intellectual property" in nearly all cases - so designers have to make sure they do NOT do "the same as someone else's design"). Like for 32-bit and 64-bit, you simply won't get a chance to run the wrong binary here, it just won't work.
Sometimes, there are subtle differences, for example ARM code can be compiled with "hard" or "soft" floating point. If the code is compiled for hard float, and there isn't the right support in the OS, then it won't run the binary. Worse yet, if you compile on x86 for SSE instructions, and try to run on a non-SSE processor, the code will simply crash [unless you specifically build code to "check for SSE, and display error if not present"].
Even if you have a binary that passes the above criteria, the Linux system tends to change a tiny bit between releases, and different distributions have subtle "fixes" that change things. Most of the time, these are completely benign (they fix some obscure corner-case that someone found during testing, but the general, non-corner-case behaviour is "normal"). However, if you go from Linux version 2.2 to Linux version 3.15, there will be some substantial differences between the two versions, and the binary from the old one may very well be incompatible with the newer (and almost certainly the other way around) - it's hard to know exactly which versions are and aren't compatible. Within releases that are close, it should work OK as long as you are not specifically relying on some feature that is present in only one (after all, new things ARE added to the Linux kernel from time to time). Here the answer is "maybe".
Note that the above also covers your implementation of the C and C++ runtime: if you have a "new" C or C++ runtime library that uses Linux kernel feature X, and you try to run it on an older kernel from before feature X was implemented (or working correctly in the way the C or C++ runtime tries to use it), it will fail.
Static linking is indeed a good way to REDUCE the dependency of different releases. And a good way to make your binary huge, which may be preventing people from downloading it.
Making the code open source is a much better way to solve this problem: then you just distribute your source code and a list of "minimum requirements", and let other people deal with it needing to be recompiled.
In practice, it depends on "sufficiently simple". If you're using C++11, you'll quickly find that the C++11 libraries have dependencies on modern libc releases. In turn, those only ship with modern Linux distributions. I'm not aware of any "Long Term Support" Linux distribution which today (June 2014) ships with libc support for GCC 4.8
The short answer is no, at least without serious hack.
Different Linux distributions may have different glue code between user space and the kernel. For instance, a hello-world binary, seemingly without dependencies, built on Ubuntu could not be executed under CentOS.
EDIT: Thanks for the comment. I re-verified this, and the cause is that I was using a 32-bit VM. Sorry for causing confusion. However, as noted above, the rule of thumb is that even the same Linux distribution may sometimes break compatibility in order to deploy a bugfix, so the conclusion stands.
test.c:
int main()
{
return 0;
}
I haven't used any flags (I am a newb to gcc), just the command:
gcc test.c
I have used the latest TDM build of GCC on win32.
The resulting executable is almost 23KB, way too big for an empty program.
How can I reduce the size of the executable?
Don't follow its suggestions, but for amusement's sake, read this 'story' about making the smallest possible ELF binary.
How can I reduce its size?
Don't do it. You're just wasting your time.
Use the -s flag to strip symbols (gcc -s).
By default some standard libraries (e.g. the C runtime) are linked with your executable. Check out the -nostdlib, -nostartfiles and -nodefaultlibs options for details. Link options are described here.
For a real program, a second option is to try optimization options, e.g. -Os (optimize for size).
Give up. On x86 Linux, gcc 4.3.2 produces a 5K binary. But wait! That's with dynamic linking! The statically linked binary is over half a meg: 516K. Relax and learn to live with the bloat.
And they said Modula-3 would never go anywhere because of a 200K hello world binary!
In case you wonder what's going on, the Gnu C library is structured such as to include certain features whether your program depends on them or not. These features include such trivia as malloc and free, dlopen, some string processing, and a whole bucketload of stuff that appears to have to do with locales and internationalization, although I can't find any relevant man pages.
Creating small executables for programs that require minimum services is not a design goal for glibc. To be fair, it has also been not a design goal for every run-time system I've ever worked with (about half a dozen).
Actually, if your code does nothing, is it even fair that the compiler still creates an executable? ;-)
Well, on Windows any executable will still have a size, although it can be reasonably small. With the old MS-DOS system, a complete do-nothing application could be just a couple of bytes. (I think four bytes, using the 21h interrupt to close the program.) Then again, those applications were loaded straight into memory.
When the EXE format became more popular, things changed a bit. Now executables had additional information about the process itself, like the relocation of code and data segments plus some checksums and version information.
The introduction of Windows added another header to the format, to tell MS-DOS that it couldn't execute the executable since it needed to run under Windows. And Windows would recognize it without problems.
Of course, the executable format was also extended with resource information, like bitmaps, icons and dialog forms and much, much more.
A do-nothing executable would nowadays be between 4 and 8 kilobytes in size, depending on your compiler and every method you've used to reduce its size. It would be at a size where UPX would actually result in bigger executables! Additional bytes in your executable might be added because you added certain libraries to your code. Especially libraries with initialized data or resources will add a considerable number of bytes. Adding debug information also increases the size of the executable.
But while this all makes a nice exercise in reducing size, you could wonder if it's practical to keep worrying about the bloatedness of applications. Modern hard disks divide files up into segments and, for really large disks, the difference would be very small. However, the amount of trouble it takes to keep the size as small as possible will slow down development, unless you're an expert developer who is used to these optimizations. These kinds of optimizations don't tend to improve performance and, considering the average disk space of most systems, I don't see why it would be practical. (Still, I do optimize my own code in similar ways, but then again, I am experienced with these optimizations.)
Interested in the EXE header? It starts with the letters MZ, for "Mark Zbikowski". The first part is the old-style MS-DOS header for executables and is used as a stub telling MS-DOS that the program is not an MS-DOS executable. (In the binary, you can find the text 'This program cannot be run in DOS mode.', which is basically all it does: display that message.) Next is the PE header, which Windows will recognise and use instead of the MS-DOS header. It starts with the letters PE, for Portable Executable. After this second header comes the executable itself, divided into several blocks of code and data. The header contains special relocation tables which tell the OS where to load each block. And if you can keep this to a limit, the final executable can be smaller than 4 KB, but 90% of it would then be header information with no functionality.
I like the way the DJGPP FAQ addressed this many many years ago:
In general, judging code sizes by looking at the size of "Hello" programs is meaningless, because such programs consist mostly of the startup code. ... Most of the power of all these features goes wasted in "Hello" programs. There is no point in running all that code just to print a 15-byte string and exit.
What is the purpose of this exercise?
Even with as low a level language as C, there's still a lot of setup that has to happen before main can be called. Some of that setup is handled by the loader (which needs certain information), some is handled by the code that calls main. And then there's probably a little bit of library code that any normal program would have to have. At the least, there's probably references to the standard libraries, if they are in dlls.
Examining the binary size of the empty program is a worthless exercise in and of itself. It tells you nothing. If you want to learn something about code size, try writing non-empty (and preferably non-trivial) programs. Compare programs that use standard libraries with programs that do everything themselves.
If you really want to know what's going on in that binary (and why it's so big), then find out the executable format, get a binary dump tool, and take the thing apart.
What does 'size a.out' tell you about the size of the code, data, and bss segments? The majority of the code is likely to be the start up code (classically crt0.o on Unix machines) which is invoked by the o/s and does set up work (like sorting out command line arguments into argc, argv) before invoking main().
Run strip on the binary to get rid of the symbols. With gcc version 3.4.4 (cygming special) I drop from 10k to 4K.
You can try linking a custom runtime (the part that calls main) to set up your runtime environment. All programs use the same runtime-setup code that comes with gcc, but for your executable you may not need initialized data or zeroed memory. That means you could get rid of unused library functions like memset/memcpy and reduce the CRT0 size. When looking for info on this, look at GCC in embedded environments. Embedded developers are generally the only people that use custom runtime environments.
The rest is overhead for the OS that loads the executable. You are not going to save much there unless you tune that by hand.
Using GCC, compile your program using -Os rather than one of the other optimization flags (-O2 or -O3). This tells it to optimize for size rather than speed. Incidentally, it can sometimes make programs run faster than the speed optimizations would have, if some critical segment happens to fit more nicely. On the other hand, -O3 can actually induce code-size increases.
There might also be some linker flags telling it to leave out unused code from the final binary.
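For example, with GCC and the GNU linker, a common combination (how much it saves depends on the program) is to put each function and data item in its own section and let the linker discard the unreferenced ones:

gcc -Os -s -ffunction-sections -fdata-sections -Wl,--gc-sections test.c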