Where can I find resources/tutorials on how to implement threads in an x86 bootloader? Let's say I want to load resources in the background while displaying a progress bar.
That is a very unusual question...so allow me to provide my opinion on it...
A bootloader is really just a small piece of assembly code: 446 bytes of code, to be exact, followed by 64 bytes of partition information and a final two bytes for the magic marker (0x55, 0xAA) that indicates the end of the boot sector. That is 512 bytes in total.
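To make that layout concrete, here is a minimal sketch that describes the classic boot sector as a packed struct and checks the sizes at compile time. The names `PartitionEntry`, `BootSector` and `has_boot_signature` are made up for illustration, and the field breakdown of the partition entry is the conventional MBR one, which I am assuming rather than taking from anything above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Conventional MBR layout: 446 + 4 * 16 + 2 = 512 bytes.
#pragma pack(push, 1)
struct PartitionEntry {
    std::uint8_t  status;        // 0x80 = bootable, 0x00 = inactive
    std::uint8_t  chs_first[3];  // CHS address of the first sector
    std::uint8_t  type;          // partition type ID
    std::uint8_t  chs_last[3];   // CHS address of the last sector
    std::uint32_t lba_first;     // LBA of the first sector
    std::uint32_t sector_count;  // number of sectors in the partition
};

struct BootSector {
    std::uint8_t   code[446];      // bootstrap code area
    PartitionEntry partitions[4];  // 64 bytes of partition information
    std::uint8_t   signature[2];   // magic marker: 0x55, 0xAA
};
#pragma pack(pop)

static_assert(sizeof(PartitionEntry) == 16, "partition entry must be 16 bytes");
static_assert(sizeof(BootSector) == 512, "boot sector must be exactly 512 bytes");
static_assert(offsetof(BootSector, signature) == 510, "marker sits at offset 510");

// Returns true if a 512-byte sector image ends with the boot marker.
bool has_boot_signature(const std::uint8_t* sector) {
    assert(sector != nullptr);
    return sector[510] == 0x55 && sector[511] == 0xAA;
}
```

The `static_assert` lines are the useful part: if the arithmetic did not add up to 512, the code would not compile.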
Bootloaders such as GRUB get around this limitation by using a two-phase design: the first phase is the 512 bytes mentioned above, which then loads the second phase, where further options and so on are handled.
Generally, the bootloader code is 16-bit assembly because the original BIOS code is 16-bit, and every processor from the 386 up to today's models boots up in 16-bit real mode.
With a two-phase bootloader, the first 512 bytes are 16-bit code. The second phase then switches the processor into 32-bit mode, setting up the registers and gate selectors in preparation, and jumps to the entry point of the actual program that does the booting. This has to account for reading from a specific location on disk, or reading a configuration file that says where the boot code is stored.
Implementing threads in 32-bit mode will be tricky, as you will have to write some kind of scheduler in assembly (since you mentioned implementing threads in an x86 bootloader).
You may get around this by writing the second phase of the bootloader in C (the tricky bit is that no standard library can be used, as the runtime environment has not been set up yet!).
You may be better off using GRUB, or you could check out the open-source BIOS bootloaders. Nowadays BIOSes are flashable, so you may be able to get an EFI (Extensible Firmware Interface) firmware, which runs in 32-bit or 64-bit mode rather than 16-bit real mode; this will depend on your processor. There is also another website which might provide further info.
The progress bar you see on boot is, unfortunately, written in C/C++ and runs at the kernel level, after the boot-up procedure is complete. By then the system is already in 32-bit mode with the environment set up, the task scheduler running, threads available, the virtual memory manager loaded, and so on. It is a process in which a thread has been created that runs in the background and uses a progress bar to illustrate hardware detection and further environment setup, as a way to tell the user "wait, the system is loading".
This book might help you somewhat: it describes various aspects of the Linux kernel, including initialization. You might also want to look at GRUB; it's pretty standard across UNIX flavours.
The book I mentioned should be your resource of choice. The kernel doesn't consider the hardware thread-capable until quite late in the initialization cycle (by which I mean the point where the data structures for threading are set up), and this is well documented.
That said, I can't think of any real benefit of allowing threading constructs in a boot loader: firstly, it's simpler to set up your basic hardware using single-threaded procedural code, and secondly, you expect the code to be bullet-proof, so threading as a defence mechanism isn't needed.
So I'd expect you're looking at emulating a progress bar :D
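For what it's worth, emulating the progress bar could look roughly like the sketch below: one single-threaded loop that loads a chunk, redraws the bar, and repeats. This is hosted C++ so it can be tried on a normal machine; `render_bar`, `load_chunk` and `load_with_progress` are hypothetical names, and `load_chunk` is only a placeholder for whatever disk read a real boot loader would do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds a text bar such as "[####----]" for done out of total steps.
std::string render_bar(std::size_t done, std::size_t total, std::size_t width) {
    assert(total > 0 && done <= total);
    std::size_t filled = done * width / total;
    return "[" + std::string(filled, '#') + std::string(width - filled, '-') + "]";
}

// Placeholder for "read the next block from disk" (hypothetical).
// In a real boot loader this would be a BIOS disk read or a driver call.
void load_chunk(std::size_t /*index*/) {}

// Loads every chunk in turn and recomputes the bar after each one.
// A real loader would write the bar to the screen at that point.
// Returns the last bar that was produced.
std::string load_with_progress(std::size_t chunks, std::size_t width) {
    assert(chunks > 0);
    std::string bar = render_bar(0, chunks, width);
    for (std::size_t i = 0; i < chunks; ++i) {
        load_chunk(i);
        bar = render_bar(i + 1, chunks, width);
    }
    return bar;
}
```

As long as the chunks are small enough, the bar moves often enough that the user cannot tell there is no background thread.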
I read an interesting paper, entitled "A High-Resolution Side-Channel Attack on Last-Level Cache", and wanted to find out the index hash function for my own machine—i.e., Intel Core i7-7500U (Kaby Lake architecture)—following the leads from this work.
To reverse-engineer the hash function, the paper mentions the first step as:
for (n=16; ; n++)
{
    // ignore any miss on first run
    for (fill=0; !fill; fill++)
    {
        // set pmc to count LLC miss
        reset_pmc();
        for (a=0; a<n; a++)
            // set_count*line_size=2^19
            load(a*2^19);
    }
    // get the LLC miss count
    if (read_pmc()>0)
    {
        min = n;
        break;
    }
}
How can I code the reset_pmc() and read_pmc() in C++? From all that I read online so far, I think it requires inline assembly code, but I have no clue what instructions to use to get the LLC miss count. I would be obliged if someone can specify the code for these two steps.
I am running Ubuntu 16.04.1 (64-bit) on VMware workstation.
P.S.: I found mention of these LONGEST_LAT_CACHE.REFERENCES and LONGEST_LAT_CACHE.MISSES in Chapter-18 Volume 3B of the Intel Architectures Software Developer's Manual, but I do not know how to use them.
You can use perf as Cody suggested to measure the events from outside the code, but I suspect from your code sample that you need fine-grained, programmatic access to the performance counters.
To do that, you need to enable user-mode reading of the counters, and also have a way to program them. Since those are restricted operations, you need at least some help from the OS kernel to do that. Rolling your own solution is going to be pretty difficult, but luckily there are several existing solutions for Ubuntu 16.04:
Andi Kleen's jevents library, which among other things lets you read PMU events from user space. I haven't personally used this part of pmu-tools, but the stuff I have used has been high quality. It seems to use the existing perf_events syscalls for counter programming, so it doesn't need a kernel module.
The libpfc library is a from-scratch implementation of a kernel module and userland code that allows userland reading of the performance counters. I've used this and it works well. You install the kernel module which allows you to program the PMU, and then use the API exposed by libpfc to read the counters from userspace (the calls boil down to rdpmc instructions). It is the most accurate and precise way to read the counters, and it includes "overhead subtraction" functionality which can give you the true PMU counts for the measured region by subtracting out the events caused by the PMU read code itself. You need to pin to a single core for the counts to make sense, and you will get bogus results if your process is interrupted.
Intel's open-sourced Processor Counter Monitor library. I haven't tried this on Linux, but I used its predecessor library, the very similarly named [1] Performance Counter Monitor, on Windows, and it worked. On Windows it needs a kernel driver, but on Linux it seems you can either use a driver or have it go through perf_events.
Use the likwid library's Marker API functionality. Likwid has been around for a while and seems well supported. I have used likwid in the past, but only to measure whole processes in a manner similar to perf stat and not with the marker API. To use the marker API you still need to run your process as a child of the likwid measurement process, but you can read programmatically the counter values within your process, which is what you need (as I understand it). I'm not sure how likwid is setting up and reading the counters when the marker API is used.
So you've got a lot of options! I think all of them could work, but I can personally vouch for libpfc since I've used it myself for the same purpose on Ubuntu 16.04. The project is actively developed and probably the most accurate (least overhead) of the above. So I'd probably start with that one.
All of the solutions above should be able to work for Kaby Lake, since the functionality of each successive "Performance Monitoring Architecture" seems to generally be a superset of the prior one, and the API is generally preserved. In the case of libpfc, however, the author has restricted it to only support Haswell's architecture (PMA v3), but you just need to change one line of code locally to fix that.
[1] Indeed, they are both commonly called by their acronym, PCM, and I suspect that the new project is simply the officially open sourced continuation of the old PCM project (which was also available in source form, but without a mechanism for community contribution).
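If you would rather see what the perf_events route looks like without any library, here is a minimal sketch using the raw `perf_event_open` syscall on Linux. It is not how any of the libraries above are implemented. `reset_pmc` and `read_pmc` are named after the pseudocode in the question, while `open_llc_miss_counter` and `count_misses` are made-up helpers. I am assuming the generic `PERF_COUNT_HW_CACHE_MISSES` event, which usually (not always) maps to last-level-cache misses, and the open call may simply fail inside a VM or with a restrictive `perf_event_paranoid` setting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Opens a cache-miss counter for the calling thread.
// Returns a file descriptor, or -1 if the kernel or VM refuses.
int open_llc_miss_counter() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;        // start stopped; reset_pmc() enables it
    attr.exclude_kernel = 1;  // count user-space events only
    attr.exclude_hv = 1;
    // pid = 0 (this thread), cpu = -1 (any CPU), no group, no flags
    return static_cast<int>(syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0));
}

// Zeroes the counter and starts counting.
bool reset_pmc(int fd) {
    return ioctl(fd, PERF_EVENT_IOC_RESET, 0) == 0 &&
           ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == 0;
}

// Stops counting and returns the value, or -1 on error.
long long read_pmc(int fd) {
    if (ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) != 0) return -1;
    std::uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != static_cast<ssize_t>(sizeof(count)))
        return -1;
    return static_cast<long long>(count);
}

// Touches n addresses that are stride bytes apart and returns the
// number of misses counted meanwhile, or -1 if the counter is unusable.
long long count_misses(int fd, const volatile char* base,
                       std::size_t n, std::size_t stride) {
    assert(base != nullptr);
    if (!reset_pmc(fd)) return -1;
    for (std::size_t a = 0; a < n; ++a) {
        char sink = base[a * stride];  // volatile read, so it is not optimized away
        (void)sink;
    }
    return read_pmc(fd);
}
```

Going through a file descriptor costs a syscall per read, so this is much coarser than `rdpmc`-based reading; treat the counts as approximate, and pin the thread to one core for the same reasons as mentioned for libpfc.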
I would use PAPI, see http://icl.cs.utk.edu/PAPI/
This is a cross-platform solution that has a lot of support, especially from the HPC community.
I'm currently porting a VS2005 C++ application from CE5 to CE6 and I'm experiencing severe performance problems. This goes so far that a single HTTP request retrieving dynamic content takes 40ms on CE5 and 350ms on CE6. These values used to be worse due to a bunch of inefficiencies that I already cleaned up, improving performance on both systems, but at the moment I'm stuck at that latency. For the record, both tests are made on the same machine and the webserver is not the one supplied with CE but a custom one implemented in C++. Note also that the problem is not the network IO, CE6 even outperforms CE5 on the same machine when serving static files, but it's the dynamic content handling.
While trying to figure out why the program performs so badly, I stumbled across something that puzzled me: under CE5, the Interlocked* API for x86 uses neither the compiler intrinsics nor real function calls but inline assembly code. This code has a comment saying that the intrinsics include lock prefixes that are only required for multi-processor systems and that slow down code running on just a single core, as under CE5. On CE6, these functions are implemented using the compiler intrinsics, including the lock prefix. Since these functions are used by e.g. Boost and STLport, both of which are used inside the webserver, I was wondering if those could be the culprit.
Another thing I noticed was that some string parsing functions take extremely long. Worse, it seems that calling the same function a second time after the first time takes less time, so it seems as if some kind of caching was going on. Since this is a short (<1kB) string received via TCP that is parsed in memory, I can't imagine which cache could be responsible for that. The only cache could be the instruction cache, but the program is not larger than the CE5 version and if the code was running from uncached memory it would not show these caching effects.
TLDR - Questions:
1. Is CE6 capable of handling multiple processors at all?
2. Is there an easy way to tell the compiler that it should omit the lock prefix? My current approach to achieve that is to simply copy the inline assembly from the CE5 SDK, but that's beyond ugly.
3. I'd also appreciate any other suggestions what to look at or what to try. Many thanks in advance!
Summary: There is no problem that depends on the executable, let alone on the Interlocked API. Running the same executable proved that. However, running on a different machine with a different platform setup made a difference. We're now back to Platform Builder, trying to figure out the differences between the two platforms.
1. No. WEC7 is required for SMP support. Most likely in CE6 the OEM has disabled the other cores.
2. None that I am aware of.
3. Either use the performance profiling tools or instrument your code with timing calls to narrow down where things are taking too long.
I have finally found the reason for the performance behaviour: it's simply paging. CE6 has a pool manager (see http://blogs.msdn.com/b/ce_base/archive/2008/01/19/paging-and-the-windows-ce-paging-pool.aspx) which handles paging out unused mapped DLLs and EXEs. When the amount of mapped binaries exceeds a certain size, it starts (with low priority) to page out memory. The limit at which it starts paging out is just 3MiB by default, which is rather low for current applications. Also, the cache is not an LRU cache but simply discards the pages in the order they were loaded.
It turns out that our system exceeded this limit, which caused the paging to begin. Due to the algorithm used, it will regularly throw out pages that are still in use, which then have to be paged in again. The code that serves static files is small, so it wasn't affected as much by this limit. The code that serves dynamic pages is much larger though, so it wreaks havoc on the overall system with IO. This also explains why the problem couldn't be attributed to a specific piece of code: it wasn't the code itself but loading it.
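To illustrate why evicting in load order hurts, here is a small simulation that counts page faults for the same access trace under load-order (FIFO) eviction and under LRU. This is only a toy model, not the CE6 pool manager; `count_faults` and the trace are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Counts page faults for an access trace with `capacity` resident pages.
// lru == false: pages are evicted in the order they were loaded (FIFO).
// lru == true:  a hit moves the page to the "most recently used" end.
std::size_t count_faults(const std::vector<int>& trace,
                         std::size_t capacity, bool lru) {
    assert(capacity > 0);
    std::deque<int> resident;  // front = next eviction victim
    std::size_t faults = 0;
    for (int page : trace) {
        auto it = std::find(resident.begin(), resident.end(), page);
        if (it != resident.end()) {
            if (lru) {  // refresh recency on a hit
                resident.erase(it);
                resident.push_back(page);
            }
            continue;
        }
        ++faults;  // page is not resident and has to be loaded
        if (resident.size() == capacity) resident.pop_front();
        resident.push_back(page);
    }
    return faults;
}
```

With the trace 0,1,0,2,0,3,0,4,0,5 and room for three pages, the FIFO variant takes 7 faults and LRU takes 6: page 0 is used constantly, yet FIFO still throws it out once because it was loaded first.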
I have detected this via IOCTL_HAL_GET_POOL_PARAMETERS, which gave me the relevant configuration parameters, current state, how often the pageout-thread ran and for how long (although the latter is only the time it took to swap out pages). I should be able to find the resulting page faults in the kernel tracker, too, now that I know what I'm looking for. I could also observe that the activity LED on the CF card adapter now lights up when first loading a file, but not on subsequent requests, where it is taken from cache. This used to always cause the LED to flash on dynamic pages.
The simple solution is to increase the limit for the pool manager, so it doesn't start throwing out things. This can be done easily in config.bib by patching kernel.dll with the appropriate values. Alternatively, reducing the executable size would help, but that's not so easy.
I am aware of the basics of shared memory and inter process communication, but since my application is fairly specific I'm asking this question for general feedback.
I am working on 64 bit machines (MacOS and Win 64), using a 32bit visual coding toolkit. It is not practical to port the toolkit to 64bit at this time so I have memory limitations.
I am working on an application which must be able to scrub (go back and forth based on user input) high quality video at fast speeds. The obvious solutions are:
1 - Keep it all in memory.
2 - Stream from disk.
Putting it all in memory at the moment requires lowering the video quality to an unacceptable point, and streaming from disk causes the scrub to hang while loading.
My current train of thought is to run a master and multiple slave programs. Each slave will load up a segment of the video into ram, and when the master program needs to load a different section of the video it will request this data from the slave and have it transferred over.
My question is, what is an appropriate way to do this?
I suspect shared memory will not allow me to get past the 32bit memory limitations my application currently has. I could do something as simple as pipes, but I was wondering if there is something else that is more suitable.
Ideally this solution would be Mac/Win portable, but since the final solution must reside on a windows box I will opt for windows solutions. Also the easier the better, as I'm not looking to spend weeks in dev time on this.
Thanks in advance.
I'm going to guess you are (or at least can be) using a 64-bit machine with a 64-bit OS, even though it's impractical to port all your code to 64 bits. I'm also assuming that your machine has enough memory available to hold the data you care about -- the real problem is getting access to enough of that memory from 32-bit code.
If that's the case, then I'd look at Windows' Address Windowing Extensions (AWE) functions, such as AllocateUserPhysicalPages and MapUserPhysicalPages. These work quite a bit like file mapping except that when you map data into your address space, it's already in physical memory instead of having to be read from the disk (i.e., the mapping is much faster).
I would embed or install, depending on your requirements for distribution, one or more instances of Memcached and have one (or more if necessary) thread feed blocks from disk into the memcache.
Once you have moved your data into memcached, you are pretty much immune to the 32-bit limitations, especially if memcached itself runs as a 64-bit process.
Basically you would in your program read from a socket instead of a file, and memcached would be a fancy file cache.
Is there a way I could make a C or C++ program that would run without an operating system and that would draw something like a red pixel to the top left corner? I have always wondered how these types of applications are made. Since Windows is written in C I imagine there is a way to do this.
Thanks
If you're writing for a bare processor, with no library support at all, you'll have to get all the hardware manuals, figure out how to access your video memory, and perform whatever operations that hardware requires to get a pixel drawn onto the display (or a sound on the beeper, or a block of memory read from the disk, or whatever).
When you're using an operating system, you'll rely on device drivers to know all this for you. Programs are still written, every day, for platforms without operating systems, but rarely for a bare processor. Many small MPUs come with a support library, usually a set of routines that lets you manipulate whatever peripheral devices they support.
It can certainly be done. You typically write the code in C, and you pretty much have to do everything on your own, with no standard library. To set your pixel, you'd usually load a pointer to the physical address of the screen, and write the correct value to that pointer. Alternatively, on a PC you could consider using the VESA BIOS. In all honesty, it's fairly similar to the way most code for MS-DOS was written (most used MS-DOS to read and write data on disk, but little else).
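As a rough sketch of the "pointer to the screen" approach: the function below writes one byte into a linear 8-bit framebuffer. The geometry is an assumption (VGA mode 13h, 320x200, one palette index per pixel), as is the use of palette index 4 for red in the default VGA palette; `put_pixel` and the constants are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed geometry of VGA mode 13h: 320x200, one byte per pixel.
constexpr std::size_t kWidth = 320;
constexpr std::size_t kHeight = 200;
constexpr std::uint8_t kRed = 4;  // assumed: red in the default VGA palette

// Writes one pixel into a linear 8-bit framebuffer.
// volatile keeps the compiler from dropping or reordering the store,
// which matters when fb is real video memory.
void put_pixel(volatile std::uint8_t* fb, std::size_t x, std::size_t y,
               std::uint8_t color) {
    assert(fb != nullptr && x < kWidth && y < kHeight);
    fb[y * kWidth + x] = color;
}
```

On bare metal in that mode, `fb` would typically be the fixed address 0xA0000 cast to a pointer, the video mode would have to be set first, and the `assert` would have to go since there is no runtime to support it. On a hosted OS you can only try this against an ordinary array standing in for the screen, so the red pixel in the top left corner is `put_pixel(fb, 0, 0, kRed)`.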
The core bootloader and the part of the Kernel that bootstraps the OS are written in assembly. See http://en.wikipedia.org/wiki/Booting for a brief writeup of how an operating system boots. There's no way I'm aware of to write a bootloader or Kernel purely in a higher level language such as C or C++ without using assembly.
You need to write a bootstrapper and a loader combination followed by a payload which involves setting the VGA mode manually by interrupt, grabbing a handle to the basic video buffer and then writing a value to the 0th byte.
Start here: http://en.wikipedia.org/wiki/Bootstrapping_(computing)
Without an OS it's difficult to have a loader, which means no dynamic libc. You'd have to link statically, as well as have a decent amount of bootstrap code written in assembly (although it could be provided as object files which you could then link with). Also, since you'd be at the mercy of whatever the system has, you'd be stuck with the VESA video modes (unless you want to write your own graphics driver and subsystem, which you don't).
There is, but not generally from within the OS. Initially, such a program is an asm stub that's executed from the MBR on the drive. See MBR. For x86 processors, this is generally 16-bit code; it jumps into the operating system code from there, which then upgrades to 32-bit/64-bit mode depending on the operating system and chipset.
So I take my C++ program in Visual Studio, compile, and it'll spit out a nice little EXE file. But EXEs will only run on Windows, and I hear a lot about how C/C++ compiles into assembly language, which runs directly on a processor. The EXE runs with the help of Windows, or I could have a program that makes an executable that runs on a Mac. But aren't I compiling C++ code into assembly language, which is processor specific?
My Insights:
I'm guessing I'm probably not. I know there's an Intel C++ compiler, so would it make processor-specific assembly code? EXEs run on Windows, so they take advantage of tons of things already set up, from graphics packages to the massive .NET framework. A processor-specific executable would be literally starting from scratch, with just the instruction set of the processor.
Would this executable be a file-type? We could be running windows and open it, but then would control switch to processor only? I assume this executable would be something like an operating system, in that it would have to be run before anything else was booted up, and have only the processor instruction set to "use".
Let's think about what "run" means...
Something has to load the binary codes into memory. That's an OS feature. The .EXE or binary executable file or bundle or whatever, is formatted in a very OS-specific way so that the OS can load it into memory.
Something has to turn control over to those binary codes. There's the OS, again.
The I/O routines (in C++, but this is true in most places) are just a library that encapsulate OS API's. Drat that OS, it's everywhere.
Reminiscing.
In the olden days (yes, I'm this old) I worked on machines that didn't have OS's. We also didn't have C.
We wrote machine codes using tools like "assemblers" and "linkers" to create big binary images that we could load into the machine. We had to load these binary images through a painful bootstrap process.
We'd use front panel keys to load enough code into memory to read a handy device like a punched paper-tape reader. This would load a small piece of fairly standard boot linking loader software. (We used mylar tape so it wouldn't wear out.)
Then, when we had this linking loader in memory, we could feed the tape we'd prepared earlier with the assembler.
We wrote our own device drivers. Or we used library routines that were in source form, punched on paper tapes.
A "patch" was actually patched pieces of paper tape. Plus, since there were also little bugs, we'd have to adjust the memory image based on hand-written instructions -- patches that hadn't been put into the tape.
Later, we had simple OS's that had simple API's, simple device drivers, and a few utilities like a "file system", an "editor" and a "compiler". It was for a language called Jovial, but we also used Fortran sometimes.
We had to solder serial interface boards so we could plug in a device. We had to write device drivers.
Bottom Line.
You can easily write C++ programs that don't require an OS.
Learn about the hardware BIOS (or BIOS-like) facilities that are part of your processor's chipset. Most modern hardware has a simple OS wired into ROM that does power-on self-test (POST), loads a few simple drivers, and locates boot blocks.
Learn how to write your own boot block. That is the first proper "software" thing that's loaded after POST. This isn't all that hard. You can use various partitioning tools to force your boot block program onto a disk and you'll have complete control over the hardware. No OS.
Learn how GRUB, LILO or BootCamp launch an OS. It's not complicated. Once they're booted, they can load your program and you're off and running. This is slightly simpler because you create the kind of partition that a boot loader wants to load. Base yours on the Linux kernel and you'll be happier. Don't try to figure out how Windows boots -- it's too complicated.
Read up on ELF. http://en.wikipedia.org/wiki/Executable_and_Linkable_Format
Learn how device drivers are written. If you don't use an OS, you'll need to write device drivers.
The problem is that the OS really does a lot to start your programs. The EXE file itself has header information on it that Windows recognizes, identifying itself as an EXE file. Your app does everything, from filesystem access to memory allocations, through the OS.
But yes, you CAN run apps compiled for Windows/intel on other platforms without emulation. If you want to run your EXE on a Mac or UNIX, you will need to install a bit more software to do the work that Windows would do to run your program -- take a look at the "Wine" project.
What you're talking about is what's known in the embedded world as a "bare-metal" application. They're very common for things like an ARM Cortex-M3 that goes in (say) a debit-card validator box or an interactive toy, and doesn't have enough memory or capability to run a full operating system. So, instead of getting an "ARM/Linux" compiler that would compile an application to run on Linux on an ARM processor, you get an "ARM bare-metal" compiler that compiles things to run on an ARM processor without an operating system. (I'm using ARM rather than x86 as an example, because x86 bare-metal applications are really quite rare these days.)
As stated in your question and the other answers, your application will need to do some things that would otherwise be taken care of by the operating system.
First, it needs to initialize the memory system, the interrupt vectors, and various other bits of board goo. Typically this is something that a bare-metal compiler will do for you, though if you have a weird board, you may need to tell it how to do that. This gets things from the point where the board turns on to the point where your main() function starts.
Then, you need to interact with things outside the CPU and RAM. An operating system includes all sorts of functions for doing this -- disk I/O, screen output, keyboard and mouse input, networking, etc., so forth, and so on. Without an operating system, you have to get that from somewhere else. You may get some of that from libraries from your hardware manufacturer; for instance, a board I was recently playing with has a 40x200-pixel LED screen, and it came with a library with the code to turn that on and set individual pixel values on it. And there are several companies selling libraries to implement a TCP/IP stack and things like that, for doing networking or whatnot.
Consider, for example, that this makes it difficult to do even a basic printf. When you have an operating system, printf just sends a message to the operating system that says "put this string on the console", and the operating system finds the current cursor position on the console, and does all the stuff to figure out what pixels to change on the screen, and what CPU instructions to use to change those pixels, in order to do that.
Oh, and did we mention that you first have to figure out how to get the program into the CPU? A typical computer has a bit of programmable ROM that it will load instructions from when it starts up. On an x86, this is the BIOS, and it usually already contains a handy program that gets the CPU started, sets up the display, looks for disks, and loads a program off the disk that it finds. On an embedded system, that's typically where your program goes -- which means you need some way to put your program there. Often, that means you have a device called a "debugger" that's physically attached to your embedded board that loads the program -- and can also do things that allow you to pause the processor and determine what its state is, so that you can step through your program just as if you were running it in a software debugger on your computer. But I digress.
Anyway, to answer your second question, this executable that you'd create is something that gets stored in that ROM on your embedded board -- or perhaps you'd just store a bit of it in ROM (which is, after all, pretty small) and store the rest on a flash drive, and the bit in ROM would include the instructions to get the rest of it off the flash drive. It would probably be stored as a file on your main computer (that is, the Linux or Windows computer where you're creating it), but that's just for storage, it wouldn't run there.
You'll notice that when you've got a lot of these libraries together, they're doing a fair bit of what an operating system does, and there's sort of this space between the pile of libraries and a real operating system. In that space goes what's called an RTOS -- "real-time operating system". The smaller ones of these are really just collections of libraries that work together to do all the operating-systemy things, and sometimes also include stuff so you can run multiple threads at once (and then you can have different threads act like different programs) -- though all of this is all compiled into the same compiled "program", and the RTOS is really nothing more than a library you've included. Larger ones start storing parts of the code in separate places, and I think some of them can even load pieces of code off of disks -- just like Windows and Linux do when running a program. It's sort of a continuum, rather than an either/or.
The FreeRTOS system is an open-source RTOS that's towards the smaller end of the RTOS space; they might be a good place to look at some of this if you're more interested. They do have some examples of x86 applications, which would give you an idea of what sort of x86 systems would run a bare-metal or RTOS-based program and how you'd compile something to run on one; link here: http://www.freertos.org/a00090.html#186.
The computer is not the CPU. To do anything useful, the CPU has to be connected to memory and IO controllers and other devices. An OS takes care of abstracting all of that from running programs. So, if you want to write a program that runs without an OS, your program will have to replicate at least some features of an OS: Taking over from the BIOS during the boot process, initializing devices, communicating with the disk controller to load code and data, communicating with the display controller to show information to the user, communicating with the keyboard controller and the mouse controller to read user input etc etc etc.
Unless you are building an embedded system with specialized hardware, there is no point in doing this. Besides, running your program would mean the user would have to give up running other programs. While this may be acceptable for an ATM today or WordStar in 1984, these days people frown on not being able to check email while listening to music.
Sure, they exist. They are called cross compilers. For example, that's how I can program for the iPhone platform using Xcode.
A related type of compiler is one that compiles for a virtual platform. That's how Java works.
Any given compiler/toolset produces code for a particular processor/OS combination. So your Visual Studio compile example produces code for x86/Windows. That .EXE will only run on x86/Windows and not on (for example) ARM/Windows (as used by some cellphones).
To produce code for a processor/OS combination other than what you're running the compiler on requires what is generally referred to as a cross-compiler. If you have a full professional Visual Studio subscription, you can get the ARM cross compiler, which will allow you to produce ARM/Windows .EXE files which won't run on your desktop machine, but WILL run on an ARM/Windows based cellphone or palmtop.
Yes, you can make an executable that runs on the 'bare metal' of a processor. Obviously that's how operating system kernels work. The main thing you need to do is create an executable that uses no libraries whatsoever. However, the "no libraries" restriction includes the C standard library! So that means no malloc, no printf, etc. You have to basically be your own OS and manage memory and I/O yourself. This will inevitably require a fair bit of work directly in assembly at some stage.
You also lose several other luxuries, such as main(), which can't be the starting point of your program since main() is something that is invoked by the OS and the C runtime environment.
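To give an idea of what "manage memory yourself" can mean in its simplest form, here is a sketch of a bump allocator: it hands out pieces of one fixed block and never frees them. `BumpAllocator` is a made-up name and this is only one possible scheme, not what any particular kernel does. It is written as hosted C++ so it can be tried on a normal machine; on bare metal the `assert` would need replacing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hands out consecutive pieces of one fixed memory block. Nothing is
// ever freed individually; that is the price of the simplicity.
class BumpAllocator {
public:
    BumpAllocator(void* memory, std::size_t size)
        : base_(static_cast<std::uint8_t*>(memory)), size_(size), used_(0) {}

    // align must be a power of two. Returns nullptr when out of space.
    void* allocate(std::size_t bytes,
                   std::size_t align = alignof(std::max_align_t)) {
        assert(align != 0 && (align & (align - 1)) == 0);
        std::uintptr_t start = reinterpret_cast<std::uintptr_t>(base_) + used_;
        std::uintptr_t aligned =
            (start + align - 1) & ~(static_cast<std::uintptr_t>(align) - 1);
        std::size_t padding = static_cast<std::size_t>(aligned - start);
        std::size_t remaining = size_ - used_;
        if (padding > remaining || bytes > remaining - padding) return nullptr;
        used_ += padding + bytes;
        return reinterpret_cast<void*>(aligned);
    }

    std::size_t used() const { return used_; }

private:
    std::uint8_t* base_;  // start of the managed block
    std::size_t size_;    // total bytes in the block
    std::size_t used_;    // bytes handed out so far, including padding
};
```

On bare metal the block would be a region of RAM you know to be unused (from the linker script or the memory map), rather than an array.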
Absolutely! That is what embedded programming is. As many have probably said already, the operating system does quite a bit for you. And even in the embedded world without an operating system, a number of the development tools will provide the startup code to get the processor running enough to jump to your program. Some/many provide full-blown C/C++ libraries so that you can call functions like memcpy() and sometimes even malloc() and printf().
You are welcome to provide every line of code and every instruction yourself and not use a development tool package, while still using a compiler such as gcc. Some binary formats, ELF for example, are shared with programs that run on operating systems: you can execute ELF files on Linux, but your embedded program can also be built as an ELF binary. The processor cannot execute the ELF file as it is; whatever programs the boot PROM (or RAM, in some cases) extracts the binary program from the ELF file, not unlike an operating system extracting the program to run from an ELF file. EXE is not one of those file formats. Your favorite Windows application compiler is probably not an embedded compiler either, although you can sometimes use one for the high-level language parts and then use an alternative assembler and linker. That is usually more work than it is worth. For example: you write a function in C (one that does NOT make any library or system calls) and compile it to an object file. You write your own utility, or find one, to extract the compiled binary from that object and convert it to another object format or to assembler (by disassembling it). You add your startup code and other assembly to it, then assemble and link everything together as an embedded program. I did this once with Microsoft's Embedded Visual C just to see how it measured up to other compilers; it wasn't horrible, but it certainly was not worth the effort of hacking to get at the output.
Every processor, from the one in your computer to the one in your cell phone or microwave, has to have some boot-up code. That code does not run on an operating system. It is built with the same compilers as operating system applications, or similar ones. For some devices that code puts the processor, the memory, and the on- and off-chip peripherals in a state where the operating system can be started. From there the operating system takes over. On your computer this would be the BIOS followed by the bootloader, then eventually the operating system: DOS, Windows, Linux, etc.
The main problem is the file format. PE is very different from ELF (used in Unix-like systems). A valid PE program cannot be a valid ELF. So, you either load the binary dynamically with different starters or you have to give up.
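You can see the difference in the very first bytes of a file. The sketch below tells the two apart by their magic numbers: ELF files start with 0x7F 'E' 'L' 'F', while Windows executables start with the 'M' 'Z' of the DOS header that precedes the PE header. `ExeFormat` and `detect_format` are made-up names, and checking only the magic is of course far from validating the file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class ExeFormat { Unknown, Elf, MzPe };

// Guesses the executable format from the leading magic bytes only.
ExeFormat detect_format(const std::uint8_t* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size >= 4 && data[0] == 0x7F && data[1] == 'E' &&
        data[2] == 'L' && data[3] == 'F') {
        return ExeFormat::Elf;
    }
    if (size >= 2 && data[0] == 'M' && data[1] == 'Z') {
        return ExeFormat::MzPe;  // DOS/PE family; PE header follows later
    }
    return ExeFormat::Unknown;
}
```

Since both formats claim the start of the file for their own magic, the same bytes cannot satisfy both loaders.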
Other than that, with knowledge of OS services, the value of registers at startup, etc., your code can probably detect easily and reliably which OS you are running under and act accordingly (some malware does just that). Another challenge is then reusing code instead of having two or more different programs in the same binary. Basically you would have to write an emulator, at least for the services that you need.
Also, don't forget about the Windows libraries. Look into Qt and GTK+.