1) Is it possible, using IRBuilder, to generate system calls independent of operation system? I have read this: http://llvm.lyngvig.org/Articles/Mapping-High-Level-Constructs-to-LLVM-IR#59
It seems like that when I generate LLVM IR and want to generate system call for e.g output to terminal, then I must tailor the LLVM IR to Linux/Windows/Mac. Or does LLVM have some interface for system calls?
2) Has this tool http://llvm.org/docs/CommandGuide/llc.html ability to do the stuff I want in 1) ?
Absolutely not. LLVM is a compiler backend; it does not concern itself with system calls. System calls are usually employed inside the platform's C library, which implements them with a mixture of low-level C and target-specific assembly. System calls are both OS and target (CPU) dependent.
As for more materials on studying this stuff - you have my sympathy. It's not a well documented area, because 99.9% of programmers never need to operate at this level. I suggest you start picking up some basic assembly programming and go from there.
Related
with reference to : http://www.cplusplus.com/articles/2v07M4Gy/
During the compilation phase,
This phase translates the program into a low level assembly level code. The compiler takes the preprocessed file ( without any directives) and generates an object file containing assembly level code. Now, the object file created is in the binary form. In the object file created, each line describes one low level machine level instruction.
Now, if I am correct then different CPU architectures works on different assembly languages/syntax.
My question is how does the compiler comes to know to which assembly language syntax the source code has to be changed? In other words, how does the C++ compiler know which CPU architecture is there in the machine it is working on ?
Is there any mapping used by assembler w.r.t the CPU architecture for generating assembly code for different CPU architectures?
N.S : I am beginner !!
Each compiler needs to be "ported" to the given system. For each system supported, a "compiler port" needs to be programmed by someone who knows the system in-depth.
WARNING : This is extremely simplified
In short, there are three main parts to a compiler :
"Front-end" : This part reads the language (in this case c++) and converts it to a sort of pseudo-code specific to the compiler. (An Abstract Syntactic Tree, or AST)
"Optimizer/Middle-end" : This part takes the AST and make a non-architecture-dependant optimized one.
"Back-end" : This part takes the AST, and converts it to binary executable code, specific to the architecture you want to compile your language on.
When you download a c++ compiler for your platform, you, in fact, download the c++ frontend with the linux-amd64 backend, for example.
This coding architecture is extremely helpful, because it allows to port the compiler for another architecture without rewriting the whole parsing/optimizing thing. It also allows someone to create another optimizer, or even another frontend supporting a whole different language, and, as long as it outputs a correct AST, it will be compatible with every single backend ever written for this compiler.
Simply put, the knowledge of the target system is coded into the compiler.
So you might have a C compiler that generates SPARC binaries, and a C compiler that generates VAX binaries. They both accept the same input language (as defined in the C standard), but produce different programs from it.
Often we just refer to "the compiler", meaning the one that will generate binaries for our current environment.
In modern times, the distinction has become less obvious with compiler collections such as GCC. Now the "different compilers" are often the same compiler program, just set up with different configurations (these are the "target description files").
Just to complete the answers given here:
The target architecture is indeed coded into the specific compiler instance you're using. This is important also for a process called "cross-compiling" - The process of compiling on a certain system an executable that would operate on another system/architecture.
Consider working on an embedded system-on-chip that uses a completely different instruction set than your own - You're working on an x86/64 Linux system, but need to compile a mobile app running on an ARM micro-processor, or some other type of assembly architecture.
It would be unreasonable to compile your code on the target system, which might be so limited in CPU and memory that it can't feasibly run a compiler - and so you can use a GCC (or any other compiler) port for that target architecture on your favorite system.
It's also quite critical to remember that the entire tool-chain is often compatible to the target system, for instance when shared libraries such as libc are getting in play - as the target OS could be a different release of Linux and would have different versions of common functions - In which case it's common to use tool-chains that contain all the necessary libraries and use something like chroot or mock to compile in the "target environment" from within your system.
I need to do heavy instrumentation using a LLVM pass. I want to avoid the IR Builder because it is somehow complicated and the code looks really messy. Isn't there a more convenient way to create LLVM IR? I think of a way where I can use for example C/C++ to create the instrumented code.
I do not use the IR Builder for my instrumentation. Most of the instrumentation is accomplished with two steps:
The LLVM pass identifies the instructions of interest and inserts function calls to the requisite instrumentation routine.
The instrumentation routines are written in C files and compiled and linked into the final program. Using link-time optimization (LTO), this approach achieves good performance by removing the function calls and directly inserting the machine code for the library instrumentation routines.
Therefore, most of the instrumentation is C code that clang compiles down to necessary IR.
Other pieces of instrumentation have to be crafted dynamically, so the IR is constructed by invoking the appropriate XYZInst::Create calls, where XYZ is the specific instruction.
The first approach is closer to what you desire; however, it required writing a separate script to act as the compiler driver and managing Makefiles, libraries, et cetera.
Normally If you want to modify LLVM IR, you need to write a pass. However, writing a pass by yourself is an overkill sometimes if a higher level tool could facilitate you.
For example, someone might wish to log every load and store in the program. For that purpose, he would need to inject code that does the logging. Now if there is a higher level tool, it can provide callbacks to us to write what we want. So in this case, for example, it could provide us OnLoad and OnStore functions which we can fill to tell the tool what to do on each load and store. Does such kind of a tool exist?
So basically I want something similar to what is provided by Dynamic Binary Instrumentation tools but that works with LLVM, for compile time code injection.
I think you should consider using PIN instead of LLVM for such things: http://www.pintool.org/
PIN enables you insert instrumentation/analyze code at several granularity levels: instruction, basic block, function, traces and even load/unload of shared libraries. Is may be a way more practical since you won't need to compile the application - so you can analyze programs wich aren't open source for example.
There are version of PIN for windows and linux.
PS: Another tool that seems useful: http://eces.colorado.edu/~blomsted/llvmpin/llvmpin.html
As you might know, PIN is a dynamic binary instrumentation tool. By using Pin for example, I can instrument every load and store in my application. I was wondering If there is a similar tool which injects code at compile time (Using a higher level of information, not requiring us to write the LLVM pass), rather than at runtime like Pin. I am especially interested for such kind of tool for LLVM.
You could write LLVM passes of your own and apply them on your code to "instrument" it during compile time. These work on LLVM IR and produce LLVM IR, so for some tasks this will be a very natural thing to do and for other tasks it might be cumbersome or difficult (because of the differences between LLVM and IR and the source language). It depends.
I am not trying to do any such thing, but I was wondering out of curiosity whether one could implement an "entire OS" (not necessarily something big like Linux or Microsoft Windows, but more like a small DOS-like operating system) in C and/or C++ using no or little assembly.
By implementing an OS , I mean making an OS from scratch starting the boot-loader and the kernel to the graphics drivers (and optionally GUI) in C or C++. I have seen a few low-level things done in C++ by accessing low-level features through the compiler. Can this be done for an entire OS?
I am not asking whether it is a good idea, I am just asking whether it is even remotely possible?
Obligatory link to the OSDev wiki, which describes most of the steps needed to create an OS as described on x86/x64.
To answer your question, it is gonna be extremely difficult/unpleasant to create the boot loader and start protected mode without resorting to at least some assembly, though it can be kept to a minimum (especially if you're not really counting stuff like using __asm__ ( "lidt %0\n" : : "m" (*idt) ); as 'assembly').
A big hurdle (again on x86) is that the processor starts in 16-bit real mode, so you need some 16-bit code. According to this discussion you can have GCC generate 16-bit code, but you would still need some way to setup memory, load code from some storage media and so on, all of which requires interfacing with the hardware in ways that standard C just has no concept of (interrupts, IO ports etc.).
For architectures which communicate with hardware solely through memory mapped IO you could probably get away with writing everything except the C start-up code (that sets up the stack, initializes variables and so on) in pure C, though specific requirements of interrupt routines / exception or syscall gates etc. may be difficult to impossible to implement (as you have to access special CPU registers).
I assume that you have an OS for x86 in mind. In that case you need at least a few pages of assembler to set up protected mode and stuff like that, and besides that a lot of knowledge of all the stuff like paging, call gates, rings, exceptions, etc. If you are going to use a form of system calls you'll also need some lines of assembly code to switch between kernel and userspace mode.
Besides those things the rest of an OS can easily be programmed in C. For C++ you'll need a runtime environment to support things like virtual members and exceptions, but as far as I know that can all be programmed in C.
Just take a look at Linux kernel source, the most important assembler code (for x86) can be found in arch/x86/boot, but you'll notice that even in that directory most files are written in C. Furthermore you'll find a few assembly lines in the arch/x86/kernel directory for handling system calls and stuff like that.
Outside the arch directory there is hardly any assembler used (because assembler is machine specific, that machine specific code belongs in the arch directory). Even graphic drivers don't use assembler (e.g. nvidia driver in drivers/gpu/drm/nouveau).
A boot loader? You might want to skip that bit. For instance, Linux is quite often started by non-Linux boot loaders such as UBoot. After all, once the system is running, the OS will be present but not the boot loader, that's just there to get the OS proper into memory.
And once you've selected a decent existing bootloader, the remainder is pretty much all straightforward. Well, you have to deal with memory and files yourself; you can't rely on fopen obviously. But even a C++ compiler has little problem generating code that can run without OS support.