Binary files and OS - c++

I'm currently learning C++ and there are some (basic) things which I don't really know about and where I didn't find anything useful over different search engines.
Well, as all operating systems have different "binary formats" for their executables (Windows/Linux/Mac) - what are the differences? I mean, all of them are binary, but is there anything (besides all the OS APIs) that really differs?
(Windows) This is a dumb question - but are all the applications there really just binary (and I mean just 0's and 1's)? In which format are they stored? (You don't see 0's and 1's in text editors, mainly non-displayable characters.)
Best regards,
lamas

Executable file formats for Windows (PE), Linux (ELF), OS X (Mach-O), etc. tend to be designed to solve common problems, so they all share common features. However, each platform specifies a different standard, so the files are not compatible across platforms, even if the platforms use the same type of CPU.
Executable file formats are not only used for executable files, but also for libraries, which also contain code but are never run directly by the user - only loaded into memory to satisfy the needs of directly executable binaries.
Common Features of an executable file format:
One or more blocks of executable code
One or more blocks of read-only data such as text and numbers
One or more blocks of read/write data
Instructions on where to place these blocks in memory when the application is run
Instructions on what libraries (which are also in an 'executable file format') need to be loaded as well, and how they connect (link) up to this executable file.
One or more tables mapping code and data locations to strings or ids that describe them, useful for linking and debugging.
It's interesting to compare such formats to more basic formats, such as the venerable DOS .com file, which simply describes 64K of assorted 'stuff' to be loaded at the next available location, and has few of the features listed above.
Binary in this sense is used to compare them to 'source' files, which are written in text format. Binary format simply says that they are encoded in a non-text way, and doesn't really relate to the 0-and-1 sense of binary.
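To make that concrete, here is a minimal, hedged sketch that reads the first few bytes of a file and checks the two formats mentioned above (PE and ELF) by their well-known leading signatures ('MZ' and 0x7F 'E' 'L' 'F' respectively); everything else about it - file name handling, the messages - is purely illustrative:
// Peek at a file's leading "magic" bytes to guess whether it is an ELF or PE image.
#include <fstream>
#include <iostream>
int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: peek <file>\n"; return 1; }
    std::ifstream in(argv[1], std::ios::binary);
    unsigned char buf[4] = {0, 0, 0, 0};
    in.read(reinterpret_cast<char*>(buf), 4);
    if (buf[0] == 0x7F && buf[1] == 'E' && buf[2] == 'L' && buf[3] == 'F')
        std::cout << "looks like an ELF image\n";     // Linux/BSD executables and .so libraries
    else if (buf[0] == 'M' && buf[1] == 'Z')
        std::cout << "looks like a PE/DOS image\n";   // Windows .exe/.dll start with the old 'MZ' DOS stub
    else
        std::cout << "no executable signature here\n";
}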

Executables for Windows/Linux differ in:
The format of the file headers, i.e. the part of the file that indexes where and what's what in the rest of the file;
the instructions required for system calls (interrupts, register contents, etc)
the actual format in which binary code is linked together; there are several different ones for Linux, and I think also for Windows.
Applications are data and machine language opcodes, crammed into a file. Most bytes in an executable file don't contain text and can therefore contain values between 0 and 255 inclusive, i.e. all possible values. People would say that's binary. There are 8 bits in a byte, so each of those bytes could be said to contain 8 binary digits, some of which will be 0 and some 1.

When you get down to it, every single file in a computer is "binary" in the sense that it is stored as a sequence of 1s and 0s on disk (even text files). When you open a file in a text editor, it groups those bits into bytes and decodes the bytes into characters based on various encoding rules. If the file is actually a text file, this gives you readable text. However, if it is not, the text editor still faithfully tries to decode the stream of bits, but will most likely end up with lots of non-displayable characters, because the bits are not the encoded forms of characters but of CPU instructions.
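If you want to see those raw bytes for yourself, a few lines of C++ are enough; a minimal sketch (file name taken from the command line, output format purely illustrative) that prints every byte as a value between 0 and 255:
// Minimal hex-dump sketch: prints each byte of a file as a number 0-255,
// which is all an executable (or any other file) really is on disk.
#include <cstdio>
int main(int argc, char* argv[]) {
    if (argc < 2) { std::fprintf(stderr, "usage: dump <file>\n"); return 1; }
    std::FILE* f = std::fopen(argv[1], "rb");   // "rb": read raw bytes, no text translation
    if (!f) return 1;
    int c, count = 0;
    while ((c = std::fgetc(f)) != EOF) {
        std::printf("%02X ", c);                // two hex digits per byte
        if (++count % 16 == 0) std::printf("\n");
    }
    std::fclose(f);
}
Run it on an executable and on a .txt file and you will see the same kind of thing: just bytes; only the patterns differ.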
As for the other part of your question, about "binary formats": there are multiple formats for how to lay out the various parts of an executable, such as ELF or the Windows DLL/EXE format. These all specify exactly where in the file various parts of the executable are (i.e. where the metadata is, where the symbol table is, where the entry point is, where the static data and resources are, etc.)

The most common file format for Windows is PE; for Linux it is ELF. They both contain mostly the same things (data segment, code segment, etc.) and differ simply because they were designed separately.
It should be noted that even if both Windows and Linux used the same file-format, they would still not be able to run each others' binaries, because the system APIs and available DLLs/SOs are completely different.

Related

How can I find the underlying file type in C++?

In *nix systems there is a command called 'file', which can tell you the underlying type of a file. Say, if you rename a binary executable to foo.txt, or rename an mp3 file to .txt, the system will still tell you the real type of the file. But in Windows there seems to be no such functionality; if you rename an executable to .txt, you cannot execute it. Can anyone explain to me how this is done in *nix systems, and how I can find the real type of a file using C++, especially on Windows, where I cannot use std::system("file blah")?
The file utility uses the libmagic library. It recognises the file type by parsing "special" fields (signatures) in the file.
Of course, you can program recognition of some formats yourself, but sometimes this requires plenty of work, e.g. when you try to differentiate between different variants of MP4.
The developers of that library have done a huge amount of work, so it's advisable to build on their results if you want good results when determining what format you are dealing with. (This is a big area, really; if you need to know what format you are working with, it's better to rely on their database than on your own code.)
File utility - http://www.darwinsys.com/file/
You can download the source code and see just how many different recognition rules they use.
Download archive file-4.26 -> magic -> Magdir
Personally I had luck with compiling file 4.26 on Windows ftp://ftp.astron.com/pub/file/
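For completeness, here is a hedged sketch of what calling libmagic from C++ looks like (link with -lmagic; it assumes the library and its default magic database are installed, which on Windows means using a port such as the one mentioned above):
// Ask libmagic to describe a file, the same way the file utility does.
#include <magic.h>
#include <iostream>
int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: whatis <file>\n"; return 1; }
    magic_t cookie = magic_open(MAGIC_NONE);   // MAGIC_MIME_TYPE would give "image/png"-style output instead
    if (!cookie) return 1;
    if (magic_load(cookie, nullptr) != 0) {    // nullptr = load the default magic database
        std::cerr << magic_error(cookie) << "\n";
        magic_close(cookie);
        return 1;
    }
    const char* description = magic_file(cookie, argv[1]);
    std::cout << (description ? description : magic_error(cookie)) << "\n";
    magic_close(cookie);
}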
Caution: it's merely a convention that files of certain formats carry predefined signatures. It holds almost always and helps identify file formats properly.
If that's not a point of concern, you can trust the signature. But keep in mind that anyone with enough knowledge and motivation can open a file in a hex editor and, by playing with the bits, make it look like another format.
Even in Unix/Linux, the system doesn't actually definitively know a file's type. The "file" program makes an educated guess by comparing the file's contents against a database of patterns that characterize a variety of common file types, but it's no more than a guess — it doesn't know about all possible file formats, and it can be wrong about the ones that it does know.
It's entirely possible to write a program like "file" for Windows; it doesn't depend on any special capabilities in the OS. Cygwin provides a Windows port of the "file" program, for example.
The issue of renaming a program to have a .txt extension is unrelated to the "file" program. That comes from the fact that Windows decides whether a file is executable based on its name (specifically, its extension), whereas Unix/Linux decides whether a file is executable based on its permissions — not its contents. If you chmod a-x a program on a Linux system, the system will consider it non-executable, just like if you remove the .exe extension from a program on Windows.
The command reference suggests that the type information is saved to an external place for further usage. It also mentions magic numbers, which refers to file signatures.
Being 100% sure of a file's type is theoretically impossible, since there are no precise rules about what a certain type must contain. Even if there were such rules, it would be possible to alter the file in a way that makes it look like another one. While both signatures and extensions can give you a good idea of what the type actually is, you still need to face the possibility of dealing with a wrong type.
The UNIX file command uses heuristics. There is a database of magic numbers, usually in /usr/share/file/magic and /etc/magic/, that allows you to add new file "types" to be recognized by the file command. It simply probes the file, looking for magic numbers (signatures) in its contents.
UNIX traditionally doesn't have the same type of file extension and type associations that Windows does, although Linux is accumulating that in recent times.
I would think that on Windows you'd want to at least check the file extension association. But even within a given extension (such as .txt), an individual program may perform its own heuristics. For example, Notepad has to make an educated guess at the character encoding when it opens a file. Raymond Chen wrote a good read about it on his blog: The Old New Thing - The Notepad file encoding problem, redux.
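To tie these answers together, here is a minimal, hedged sketch of the signature idea in plain C++: it compares a file's first bytes against a handful of well-known magic numbers, including the Unicode byte-order marks that text editors sniff for. The list of formats is purely illustrative - real tools such as file/libmagic know thousands more, and absence of a signature proves nothing:
// Compare a file's leading bytes against a small table of well-known signatures.
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
struct Signature { std::string magic; const char* name; };
int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: sniff <file>\n"; return 1; }
    const std::vector<Signature> known = {
        { std::string("\x7F""ELF", 4),          "ELF executable/library" },
        { "MZ",                                  "PE/DOS executable" },
        { std::string("\x89PNG\r\n\x1A\n", 8),  "PNG image" },
        { "%PDF",                                "PDF document" },
        { std::string("PK\x03\x04", 4),         "ZIP archive (also docx/jar/apk)" },
        { std::string("\xEF\xBB\xBF", 3),       "UTF-8 text with BOM" },
        { std::string("\xFF\xFE", 2),           "UTF-16 LE text (BOM)" },
        { std::string("\xFE\xFF", 2),           "UTF-16 BE text (BOM)" },
    };
    std::ifstream in(argv[1], std::ios::binary);
    char buf[16] = {};
    in.read(buf, sizeof buf);
    const std::size_t got = static_cast<std::size_t>(in.gcount());
    for (const auto& s : known) {
        if (got >= s.magic.size() && std::memcmp(buf, s.magic.data(), s.magic.size()) == 0) {
            std::cout << s.name << "\n";
            return 0;
        }
    }
    std::cout << "unknown (no known signature)\n";
}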

What executable format normally results in smaller file sizes?

What executable format normally results in smaller file sizes? For example, I've heard the a.out format is smaller than ELF's. This question is cross platform, so MS-DOS too.
Extensions and formats are unrelated.
a.out is normally in ELF format on Linux systems, while it will be some other format on the BSDs; "a.out" is just the default output file name, unrelated to the format.
Aside from the format, the original language will also make the binary file vary in size, as well as the compilation flags used.
My smallest PE file produced with a hex editor is about 380 bytes and is able to store any resources I like (well, with resources it's bigger).
You are right: ELF carries more information than the old (deprecated under Linux?) "a.out" format. But the "overhead" (actually, quite useful metadata) is small for any Desktop application.
For MS-DOS, COM files are smaller (and more restricted) than EXEs.

Binary file turns out smaller on Windows than on Linux

I'm working on a program that writes binary data to a file. On Windows, the resulting file is slightly smaller than on Linux for some reason. The size in bytes and the MD5 hash are both different. How can this happen with the same code?
I already added the ifstream::binary flag and I made sure I set noskipws...
ofstream output("output", ifstream::binary);
output << std::noskipws;
I ran Application Verifier on my program and it didn't generate any errors or warnings regarding possible memory corruption.
Are there any other reasons why file output might be different?
The difference is probably due to different development environments. Different compilers, hardware and operating systems can all change the format of the underlying data. For example, the different compilers may pack your data structures with varying amounts of efficiency. Also, your basic types (ints, longs, floats, etc) may be different sizes due to different processors.
In short, programs that require cross-platform compatibility of binary data develop very precise rules for packing structures and values into binary format (often down to the individual field), and equally precise rules for reading the data back.
Without seeing exactly what you're writing to the file and how you are calling it, it's hard to give a truly useful answer, but if you're using the stream insertion operator (aka formatted output operator) to do your file writing, then any non-string data will be converted into strings according to your locale settings. If that is what you are actually doing, using ofstream::binary seems somewhat pointless, because you are just writing text anyway.
I would recommend creating the smallest possible example for which there is a difference and examining the output in a hex editor to see what is going on.
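If the goal is byte-for-byte identical output on both platforms, the usual fix is to open the stream with std::ios::binary and write raw bytes with write() rather than the formatted operator<<. A hedged sketch (file name and data made up):
// Raw binary output that should be byte-identical across platforms:
// std::ios::binary suppresses newline translation on Windows, and write()
// copies bytes verbatim instead of formatting them as locale-dependent text.
#include <cstdint>
#include <fstream>
int main() {
    std::ofstream out("output.bin", std::ios::binary);
    const std::uint32_t value = 123456;                  // fixed-width type: same size everywhere
    out.write(reinterpret_cast<const char*>(&value), sizeof value);
    const unsigned char payload[] = { 0x01, 0x02, 0xFF };
    out.write(reinterpret_cast<const char*>(payload), sizeof payload);
    // Writing whole structs this way still depends on the compiler's padding
    // and the CPU's byte order, as the answer above points out.
}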

Embedding compressed files into a c++ program

I want to create a cross-platform installer in C++. It can use any compression type, e.g. zip or gzip, embedded inside the program itself like a typical installer. I don't want to have to maintain many differences between platforms (Linux and Windows). How do I embed files in a C++ program and extract them, cross-platform?
C++ is a poor choice for a cross-platform installer, because there's no such thing as cross-platform machine code.
C++ code can be extremely portable, but it needs to be compiled for each platform, and then you get a distinct output executable for each platform.
If you want to build installers for many platforms from a single source file, you can use C++. But if you want to build ONE installer that works on many platforms, you'll need to use an interpreted or JIT-compiled language with runtime support available on all your targets. Of those, the only one likely to already be installed on a majority of computers of each platform is Java.
OK, assuming that you're building many single-platform installers (native machine code), this is what is needed:
You need to get the compressed code into the program. You want to do this in a way that doesn't affect the load time badly, nor cause compilation to take a few months. So using an initialized global array is a bad idea.
One way is to link your data in as an additional section. There are tools to help with that, e.g. Binary to COFF converter; I've seen an ELF version as well, maybe this. But this might still cause the runtime library to try to load the entire file into memory before execution begins.
Another way is to use platform-specific resource APIs. This is efficient, but platform specific.
The most straightforward solution is to simply append the compressed archive to your executable, then append eight more bytes with the file offset where the compressed archive begins. Unpacking is then as simple as opening the executable in read-only mode, fseek(-8, SEEK_END), reading the correct offset, then seeking to the beginning of the compressed data and passing that stream to your decompressor.
Of course, now I find a website listing pretty much the same methods.
And here's a program which implements the last option, with additional ability to store multiple files. I wouldn't recommend doing that, let the compression library take care of storing the file metadata.
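For what it's worth, here is a hedged sketch of the unpacking side of that "append the archive plus an 8-byte offset" trick. It assumes the build step appended the compressed archive followed by its start offset as a 64-bit integer in the same byte order, and that the program can find its own path (argv[0], /proc/self/exe, GetModuleFileName, ... - deliberately left out here):
// Read back an archive that was appended to the running executable.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>
// 'self_path' is assumed to be the path of the running executable.
std::vector<unsigned char> read_embedded_archive(const std::string& self_path) {
    std::FILE* f = std::fopen(self_path.c_str(), "rb");
    if (!f) return {};
    std::fseek(f, -8, SEEK_END);                 // last 8 bytes: offset where the archive starts
    std::uint64_t offset = 0;
    std::fread(&offset, sizeof offset, 1, f);    // assumes the packer wrote the same byte order
    std::fseek(f, 0, SEEK_END);
    const long end = std::ftell(f);              // archive ends 8 bytes before EOF
    const std::uint64_t total = static_cast<std::uint64_t>(end);
    if (end < 8 || offset > total - 8) { std::fclose(f); return {}; }
    std::vector<unsigned char> data(total - 8 - offset);
    std::fseek(f, static_cast<long>(offset), SEEK_SET);
    std::fread(data.data(), 1, data.size(), f);
    std::fclose(f);
    return data;                                 // hand this buffer to your decompressor
}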
The only way I know of to portably embed data (strings or raw binary data) in a C++ program is to convert it into a data table, then compile that. For raw data, this would look something like:
unsigned char data[] =
{
// raw data here.
};
It should be fairly trivial to write a small program which reads your binary data and writes it out as a C++ table like the above. Compile it, link it into your program, and there you are.
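And a hedged sketch of that small generator program ("bin2c"; file and symbol names are made up - adapt to taste):
// Read a binary file and emit it as a C++ array plus a size constant.
#include <cstdio>
int main(int argc, char* argv[]) {
    if (argc < 3) { std::fprintf(stderr, "usage: bin2c <input> <output.cpp>\n"); return 1; }
    std::FILE* in  = std::fopen(argv[1], "rb");
    std::FILE* out = std::fopen(argv[2], "w");
    if (!in || !out) return 1;
    std::fprintf(out, "extern const unsigned char embedded_data[];\n");
    std::fprintf(out, "const unsigned char embedded_data[] = {\n");
    int c;
    unsigned long count = 0;
    while ((c = std::fgetc(in)) != EOF) {
        std::fprintf(out, "0x%02X,", c);
        if (++count % 12 == 0) std::fprintf(out, "\n");   // wrap lines for readability
    }
    std::fprintf(out, "\n};\nconst unsigned long embedded_data_size = %lu;\n", count);
    std::fclose(in);
    std::fclose(out);
}
Compile the generated file and link it in; the rest of your program then declares extern const unsigned char embedded_data[]; and extern const unsigned long embedded_data_size; to use it.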
Use zlib.
Have your packing program generate a list of exes in the program, i.e.:
unsigned char x86_windows_version[] = { 0xff,...,0xff};
unsigned char arm_linux_version[] = { 0xff,...,0xff};
unsigned char* binary_files[MAX_BINARIES] = {x86_windows_version,arm_linux_version};
somewhere in your executable:
inflate(x86_windows_version);   // pseudocode - the real zlib inflate() works on a z_stream
And that's about it. Look at the zlib docs for the parameters of inflate() and deflate(), and that's about it.
It's a pattern used a lot on embedded platforms (that are not Linux), mostly for string tables and other binary images. It should work for your needs.
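For the simple case, zlib's one-shot helper uncompress() is easier to call than the full inflate() streaming API. A hedged sketch (link with -lz; the buffer names mirror the placeholder arrays above, and you must record the uncompressed size yourself at packing time, since uncompress() has to be told how big the output is):
// Decompress an embedded zlib buffer in one call.
#include <vector>
#include <zlib.h>
std::vector<unsigned char> unpack(const unsigned char* compressed,
                                  uLong compressed_size,
                                  uLong original_size) {        // stored by your packing step
    std::vector<unsigned char> out(original_size);
    uLongf out_len = original_size;
    if (uncompress(out.data(), &out_len, compressed, compressed_size) != Z_OK)
        return {};                              // Z_DATA_ERROR, Z_BUF_ERROR, ...
    out.resize(out_len);
    return out;
}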

What do C and Assembler actually compile to? [closed]

So I found out that C(++) programs actually don't compile to plain "binary" (I may have gotten some things wrong here, in that case I'm sorry :D) but to a range of things (symbol table, os-related stuff,...) but...
Does assembler "compile" to pure binary? That means no extra stuff besides resources like predefined strings, etc.
If C compiles to something else than plain binary, how can that small assembler bootloader just copy the instructions from the HDD to memory and execute them? I mean if the OS kernel, which is probably written in C, compiles to something different than plain binary - how does the bootloader handle it?
edit: I know that assembler doesn't "compile" because it only has your machine's instruction set - I didn't find a good word for what assembler "assembles" to. If you have one, leave it here as comment and I'll change it.
C typically compiles to assembler, just because that makes life easy for the poor compiler writer.
Assembly code always assembles (not "compiles") to relocatable object code. You can think of this as binary machine code and binary data, but with lots of decoration and metadata. The key parts are:
Code and data appear in named "sections".
Relocatable object files may include definitions of labels, which refer to locations within the sections.
Relocatable object files may include "holes" that are to be filled with the values of labels defined elsewhere. The official name for such a hole is a relocation entry.
For example, if you compile and assemble (but don't link) this program
#include <stdio.h>
int main() { printf("Hello, world\n"); }
you are likely to wind up with a relocatable object file with
A text section containing the machine code for main
A label definition for main which points to the beginning of the text section
A rodata (read-only data) section containing the bytes of the string literal "Hello, world\n"
A relocation entry that depends on printf and that points to a "hole" in a call instruction in the middle of a text section.
If you are on a Unix system a relocatable object file is generally called a .o file, as in hello.o, and you can explore the label definitions and uses with a simple tool called nm, and you can get more detailed information from a somewhat more complicated tool called objdump.
I teach a class that covers these topics, and I have students write an assembler and linker, which takes a couple of weeks, but when they've done that most of them have a pretty good handle on relocatable object code. It's not such an easy thing.
Let's take a C program.
When you run gcc, clang, or 'cl' on the c program, it will go through these stages:
Preprocessor (#include, #ifdef, trigraph analysis, encoding translations, comment management, macros...) including lexing into preprocessor tokens and eventually resulting in flat text for input to the compiler proper.
Lexical analysis (producing tokens and lexical errors).
Syntactical analysis (producing a parse tree and syntactical errors).
Semantic analysis (producing a symbol table, scoping information and scoping/typing errors) Also data-flow, transforming the program logic into an "intermediate representation" that the optimizer can work with. (Often an SSA). clang/LLVM uses LLVM-IR, gcc uses GIMPLE then RTL.
Optimization of the program logic, including constant propagation, inlining, hoisting invariants out of loops, auto-vectorization, and many many other things. (Most of the code for a widely-used modern compiler is optimization passes.) Transforming through intermediate representations is just part of how some compilers work, making it impossible / meaningless to "disable all optimizations"
Outputting assembly source (or another intermediate format like .NET IL bytecode)
Assembling of the assembly into some binary object format.
Linking of the assembly into whatever static libraries are needed, as well as relocating it if needed.
Output of the final executable in ELF, PE/COFF, Mach-O, or whatever other format
In practice, some of these steps may be done at the same time, but this is the logical order. Most compilers have options to stop after any given step (e.g. preprocess or asm), including dumping internal representation between optimization passes for open-source compilers like GCC. (-ftree-dump-...)
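As a concrete (and hedged - exact flags and output names vary by toolchain) illustration, gcc and clang expose most of those stopping points directly; a trivial translation unit with the usual invocations as comments:
// hello.cpp - something small to poke at the stages above.
// Typical gcc/clang invocations:
//   g++ -E hello.cpp          # stop after the preprocessor (flat text on stdout)
//   g++ -S hello.cpp          # stop after code generation (hello.s, assembly source)
//   g++ -c hello.cpp          # assemble into a relocatable object file (hello.o)
//   g++ hello.o -o hello      # link into the final ELF/PE/Mach-O executable
#include <cstdio>
int main() {
    std::puts("Hello, stages");
}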
Note that there's a 'container' in ELF or COFF format around the actual executable binary, unless it's a DOS .com executable.
You will find that a book on compilers (I recommend the Dragon book, the standard introductory book in the field) will have all the information you need and more.
As Marco commented, linking and loading is a large area and the Dragon book more or less stops at the output of the executable binary. To actually go from there to running on an operating system is a decently complex process, which Levine in Linkers and Loaders covers.
I've wiki'd this answer to let people tweak any errors/add information.
There are different phases in translating C++ into a binary executable. The language specification does not explicitly state the translation phases. However, I will describe the common translation phases.
Source C++ To Assembly or Intermediate Language
Some compilers actually translate the C++ code into an assembly language or an intermediate language. This is not a required phase, but helpful in debugging and optimizations.
Assembly To Object Code
The next common step is to translate the assembly language into object code. The object code contains machine code with relative addresses and open references to external subroutines (methods or functions). In general, the translator puts as much information into an object file as it can; everything else is left unresolved.
Linking Object Code(s)
The linking phase combines one or more object codes, resolves references and eliminates duplicate subroutines. The final output is an executable file. This file contains information for the operating system and relative addresses.
Executing Binary Files
The Operating System loads the executable file, usually from a hard drive, and places it into memory. The OS may convert relative addresses into physical locations. The OS may also prepare resources (such as DLLs and GUI widgets) that are required by the executable (which may be stated in the Executable file).
Compiling Directly To Binary
Some compilers, such as the ones used for embedded systems, have the capability to compile C++ directly into executable binary code. This code has physical addresses instead of relative addresses and does not require an OS to load it.
Advantages
One of the advantages of these phases is that C++ programs can be broken into pieces, compiled individually, and linked at a later time. They can even be linked with pieces from other developers (a.k.a. libraries). This allows developers to compile only the pieces under development and link in pieces that are already validated (a small sketch of separate compilation follows below). In general, the translation from C++ to object code is the time-consuming part of the process. Also, nobody wants to wait for all the phases to complete when there is an error in the source code.
Keep an open mind and always expect the Third Alternative (Option).
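A minimal sketch of the "compiled individually and linked later" point from the Advantages section (file names and the build commands in the comments are illustrative):
// --- greet.cpp --- compiled on its own, e.g. "g++ -c greet.cpp" -> greet.o
#include <iostream>
void greet(const char* name) {
    std::cout << "Hello, " << name << "\n";
}
// --- main.cpp --- a separate translation unit; it only sees a declaration,
// and the linker resolves the call when combining the pieces,
// e.g. "g++ -c main.cpp" then "g++ main.o greet.o -o app"
void greet(const char* name);   // declaration only - the definition lives in greet.o
int main() {
    greet("world");
}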
To answer your questions, please note that this is subjective as there are different processors, different platforms, different assemblers and C compilers, in this case, I will talk about the Intel x86 platform.
Assemblers do not usually assemble to pure / flat binary (raw machine code), instead usually to a file defined with segments such as data, text and bss to name but a few; this is called an object file. The Linker steps in and adjusts the segments to make it executable, that is, ready to run. Incidentally, the default output when you assemble using GNU as foo.s is a.out, that is a shorthand for Assembler Output. (But the same filename is the gcc default for linker output, with the assembler output being only a temporary.)
Boot loaders use a special directive; back in the days of DOS, it was common to find a directive such as .org 100h, which marks the assembled code as being of the old .COM variety, before .EXE took over in popularity. Also, you did not need an assembler to produce a .COM file: using the old debug.exe that came with MS-DOS did the trick for small, simple programs. .COM files did not need a linker and were in a straight ready-to-run binary format. Here's a simple session using DEBUG.
1:*a 0100
2:* mov AH,07
3:* int 21
4:* cmp AL,00
5:* jnz 010c
6:* mov AH,07
7:* int 21
8:* mov AH,4C
9:* int 21
10:*
11:*r CX
12:*10
13:*n respond.com
14:*w
15:*q
This produces a ready-to-run .COM program called 'respond.com' that waits for a keystroke and does not echo it to the screen. Notice, at the beginning, the use of 'a 0100', which shows that the instruction pointer starts at offset 100h - a defining feature of a .COM file. This old script was mainly used in batch files to wait for a response without echoing it. The original script can be found here.
Again, in the case of boot loaders, the code is converted to a flat binary format; a program called EXE2BIN used to come with DOS for this. Its job was converting the raw object code into a format that could be copied onto a bootable disk for booting. Remember, no linker is run against the assembled code, as the linker is for the runtime environment and sets up the code to make it runnable and executable.
When booting, the BIOS expects code to be at segment:offset 0x7C00, if my memory serves me correctly. The code (after being EXE2BIN'd) starts executing; the bootloader then relocates itself lower down in memory, continues loading by issuing int 0x13 to read from the disk, switches on the A20 gate, enables DMA, and switches into protected mode (the BIOS runs in 16-bit mode); the data read from the disk is loaded into memory, and finally the bootloader issues a far jump into that loaded code (likely to be written in C). That is, in essence, how the system boots.
OK, the previous paragraph sounds abstract and simplified, and I may have missed something out, but that is how it works in a nutshell.
To answer the assembly part of the question: assembly doesn't compile to binary as I understand it. Assembly === binary; it translates directly. Each assembly operation has a binary string that directly matches it, each operation has a binary opcode, and each register has a binary identifier.
That is, unless Assembler != Assembly and I'm misunderstanding your question.
They compile to a file in a specific format (COFF for Windows, etc), composed of headers and segments, some of which have "plain binary" op codes. Assemblers and compilers (such as C) create the same sort of output. Some formats, such as the old *.COM files, had no headers, but still had certain assumptions (such as where in memory it would get loaded or how big it could be).
On Windows machines, the OS's bootstrapper is in a disk sector loaded by the BIOS, and both of these are "plain". Once the OS has loaded its loader, it can read files that have headers and segments.
Does that help?
There are two things that you may mix here. Generally there are two topics:
Executable File Formats (see a list here), for example COFF, XCOFF, ELF
Intermediate Languages, like CIL or GIMPLE or bytecode
The latter may compile to the former in the process of assembly. Some intermediate formats are not assembled, but executed by a virtual machine. In the case of C++ it may be compiled into CIL and assembled into a .NET assembly, hence there may be some confusion.
But in general, C and C++ are usually compiled into binary, or in other words, into an executable file format.
You have a lot of answers to read through, but I think I can keep this succinct.
"Binary code" refers to the bits that feed through the microprocessor's circuits. The microprocessor loads each instruction from memory in sequence, doing whatever they say. Different processor families have different formats for instructions: x86, ARM, PowerPC, etc. You point the processor at the instruction you want by giving it the address of the instruction in memory, and then it chugs merrily along through the rest of the program.
When you want to load a program into the processor, you first have to make the binary code accessible in memory so it has an address in the first place. The C compiler outputs a file in the filesystem, which has to be loaded into a new virtual address space. Therefore, in addition to binary code, that file has to include the information that it has binary code, and what its address space should look like.
A bootloader has different requirements, so its file format might be different. But the idea is the same: binary code is always a payload in a larger file format, which includes at a minimum a sanity check to ensure that it's written in the correct instruction set.
C compilers and assemblers are typically configured to produce relocatable object files (which may be bundled into static libraries). For embedded applications, you're more likely to find a compiler which produces something like a raw memory image with instructions beginning at address zero. Otherwise, you can write a linker which converts the output of the C compiler into whatever else you want.
As I understand it, a chipset (CPU, etc.) will have a set of registers for storing data, and understand a set of instructions for manipulating these registers. The instructions will be things like 'store this value to this register', 'move this value', or 'compare these two values'. These instructions are often expressed in short human-grokable alphabetic codes (assembly language, or assembler) which are mapped to the numbers that the chipset understands - those numbers are presented to the chip in binary (machine code.)
Those codes are the lowest level that the software gets down to. Going deeper than that gets into the architecture of the actual chip, which is something I haven't gotten involved in.
The executable files (PE format on Windows) cannot be used to boot the computer because the PE loader is not in memory.
The way bootstrapping works is that the master boot record on the disk contains a blob of a few hundred bytes of code. The BIOS of the computer (in ROM on the motherboard) loads this blob into memory and sets the CPU instruction pointer to the beginning of this boot code.
The boot code then loads a "second stage" loader, on Windows called NTLDR (no extension) from the root directory. This is raw machine code that, like the MBR loader, is loaded into memory cold and executed.
NTLDR has the full capability to load PE files including DLLs and drivers.
C(++) (unmanaged) really does compile to plain binary. The OS-related stuff consists of BIOS and OS function calls; they're different for each OS, but still binary.
1. Assembler assembles to pure binary, but, strange as it may sound, hand-written assembly is often less optimized than compiled C(++).
2. The OS kernel, as well as the bootloader, is also written in C, so no problems there.
Java, Managed C++, and other .NET languages compile to an intermediate bytecode (MSIL in .NET), which makes them cross-OS and cross-platform but requires a local interpreter or JIT compiler to run.