Please note: this question is not about LLVM IR, but about LLVM's MIR, an internal intermediate representation that sits at a lower level than LLVM IR.
This documentation on LLVM's machine code description classes says (emphasis mine):
At the high-level, LLVM code is translated to a machine specific representation formed out of MachineFunction, MachineBasicBlock, and MachineInstr instances...
However, the same paragraph goes on and says:
This representation is completely target agnostic, representing instructions in their most abstract form...
My question is: how should I understand this paragraph?
I have a hard time reconciling the claim that this intermediate representation is machine specific and target agnostic at the same time. I thought "machine" and "target", in LLVM's context, mean the same thing - the instruction set architecture (e.g. x86_64, MIPS) used by the compiled executable.
Examples are welcome.
There are different ways to be platform specific. For instance, each target could have a differently named opcode for add, perhaps with different overflow semantics, or every target could use the same add, with its operands and flags specified by the same arguments and the same default values on all platforms.
And there are many target-specific details such as the size or alignment of pointers that affect your code even if they don't affect any single instruction.
Machine IR represents instructions in their most abstract form, but it doesn't try to hide that, on a given target, pointers have 32 bits.
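As a rough illustration (a minimal, hypothetical MachineFunctionPass, not something taken from the LLVM docs): the classes and the iteration below are the same for every backend, but the opcode stored in each MachineInstr comes from the target's own instruction table, e.g. X86::ADD32rr vs. Mips::ADDu.

```cpp
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
// Hypothetical pass, used only to illustrate the point above.
struct DumpOpcodes : MachineFunctionPass {
  static char ID;
  DumpOpcodes() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
    for (MachineBasicBlock &MBB : MF)   // generic, target-agnostic containers...
      for (MachineInstr &MI : MBB)      // ...whose contents are target specific
        errs() << TII->getName(MI.getOpcode()) << "\n"; // e.g. "ADD32rr" on x86-64
    return false;                       // nothing was modified
  }
};
} // namespace

char DumpOpcodes::ID = 0;
```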
Related
Object code can be disassembled into an assembly language. Is there a way to turn object code or an executable into LLVM IR?
I mean, yes, you can convert machine language to LLVM IR. The IR is Turing-complete, meaning it can compute whatever some other Turing-complete system can compute. At worst, you could have an LLVM IR representation of an x86 emulator, and just execute the machine code given as a string.
But your question specifically asked about converting "back" to IR, in the sense of the IR result being similar to the original IR. And the answer is, no, not really. The machine language code will be the result of various optimization passes, and there's no way to determine what the code looked like before that optimization. (arrowd mentioned McSema in a comment, which does its best, but in general the results will be very different from the original code.)
With reference to: http://www.cplusplus.com/articles/2v07M4Gy/
During the compilation phase,
This phase translates the program into a low level assembly level code. The compiler takes the preprocessed file (without any directives) and generates an object file containing assembly level code. Now, the object file created is in the binary form. In the object file created, each line describes one low level machine level instruction.
Now, if I am correct, different CPU architectures work with different assembly languages/syntaxes.
My question is: how does the compiler come to know which assembly language syntax the source code has to be translated into? In other words, how does the C++ compiler know which CPU architecture is in the machine it is working on?
Is there any mapping used by the assembler, with respect to the CPU architecture, for generating assembly code for different CPU architectures?
N.B.: I am a beginner!
Each compiler needs to be "ported" to the given system. For each system supported, a "compiler port" needs to be programmed by someone who knows the system in-depth.
WARNING: This is extremely simplified.
In short, there are three main parts to a compiler :
"Front-end" : This part reads the language (in this case c++) and converts it to a sort of pseudo-code specific to the compiler. (An Abstract Syntactic Tree, or AST)
"Optimizer/Middle-end" : This part takes the AST and make a non-architecture-dependant optimized one.
"Back-end" : This part takes the AST, and converts it to binary executable code, specific to the architecture you want to compile your language on.
When you download a c++ compiler for your platform, you, in fact, download the c++ frontend with the linux-amd64 backend, for example.
This coding architecture is extremely helpful, because it allows to port the compiler for another architecture without rewriting the whole parsing/optimizing thing. It also allows someone to create another optimizer, or even another frontend supporting a whole different language, and, as long as it outputs a correct AST, it will be compatible with every single backend ever written for this compiler.
Simply put, the knowledge of the target system is coded into the compiler.
So you might have a C compiler that generates SPARC binaries, and a C compiler that generates VAX binaries. They both accept the same input language (as defined in the C standard), but produce different programs from it.
Often we just refer to "the compiler", meaning the one that will generate binaries for our current environment.
In modern times, the distinction has become less obvious with compiler collections such as GCC. Now the "different compilers" are often the same compiler program, just set up with different configurations (these are the "target description files").
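To make this concrete, here is a small sketch (not part of GCC's own machinery) of how LLVM-based compilers carry that configuration around: the target is described by a "triple" string, parsed by the llvm::Triple class. Note that the header path has moved between LLVM releases.

```cpp
#include "llvm/Support/raw_ostream.h"
#include "llvm/TargetParser/Triple.h" // "llvm/ADT/Triple.h" in older LLVM releases

int main() {
  // arch - vendor - operating system - environment
  llvm::Triple T("x86_64-pc-linux-gnu");
  llvm::errs() << "arch: " << T.getArchName()                  // "x86_64"
               << ", os: " << T.getOSName()                    // "linux"
               << ", env: " << T.getEnvironmentName() << "\n"; // "gnu"
  return 0;
}
```

The backend that gets selected, and details such as pointer size and calling convention, follow from that triple plus the target description files mentioned above.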
Just to complete the answers given here:
The target architecture is indeed coded into the specific compiler instance you're using. This is also important for a process called "cross-compiling": compiling, on one system, an executable that will run on another system/architecture.
Consider working on an embedded system-on-chip that uses a completely different instruction set than your own: you're working on an x86-64 Linux system, but need to compile a mobile app that runs on an ARM microprocessor, or some other architecture.
It would be unreasonable to compile your code on the target system, which might be so limited in CPU and memory that it can't feasibly run a compiler - and so you can use a GCC (or any other compiler) port for that target architecture on your favorite system.
It's also quite critical to remember that the entire tool-chain often has to be compatible with the target system, for instance when shared libraries such as libc come into play, as the target OS could be a different release of Linux with different versions of common functions. In that case it's common to use tool-chains that contain all the necessary libraries, and to use something like chroot or mock to compile in the "target environment" from within your system.
I have been looking through the Clang / LLVM source-code and I came across the CodeModel property of CodeGenOptions.
Based on this method, the valid values appear to be: "small", "kernel", "medium" and "large".
What does this property control?
How do I go about choosing the correct value for my application?
Code model is a term from the AMD64 ABI (see section 3.5.1 of https://www.intel.com/content/dam/develop/external/us/en/documents/mpx-linux64-abi.pdf for more information).
In short, the majority of the offsets inside x86-64 instructions are PC-relative, but the immediate field inside an instruction is only 32 bits long. Therefore, if the data is located "far" from the code (more than 32 bits apart), one cannot use the immediate field inside the instruction to efficiently encode the offset and has to calculate the address explicitly. The code model places various restrictions on the relative location of code and data.
If you're compiling everything statically, then 'small' is safe (and the default). If you're JITing, then anything is possible, especially if ASLR is enabled, and you'd need to use the medium or large code model.
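For context, here's a rough sketch of where the code model plugs in if you drive LLVM directly (makeTM is just an illustrative helper, and the exact createTargetMachine signature differs between LLVM releases); from the clang driver the same choice surfaces as the -mcmodel= flag.

```cpp
#include "llvm/MC/TargetRegistry.h" // "llvm/Support/TargetRegistry.h" before LLVM 14
#include "llvm/Support/CodeGen.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Target/TargetOptions.h"
#include <string>

// makeTM is an illustrative helper, not an LLVM API.
llvm::TargetMachine *makeTM(const std::string &TripleStr) {
  llvm::InitializeAllTargetInfos();
  llvm::InitializeAllTargets();
  llvm::InitializeAllTargetMCs();

  std::string Err;
  const llvm::Target *T = llvm::TargetRegistry::lookupTarget(TripleStr, Err);
  if (!T)
    return nullptr;

  llvm::TargetOptions Opts;
  // CodeModel::Small / Kernel / Medium / Large mirror the CodeGenOptions
  // string values "small", "kernel", "medium", "large".
  return T->createTargetMachine(TripleStr, /*CPU=*/"generic", /*Features=*/"",
                                Opts, llvm::Reloc::PIC_,
                                llvm::CodeModel::Small);
}
```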
Does the Fortran kind parameter for the same precision change depending on the processor, even with the same compiler? I have already read the post here.
The thing I struggle with is: if we are using the same compiler, say gfortran, why would there be a different set of kind parameters for the same precision? I mean, the compiler's specification is the same, so shouldn't the compiler always give us the same precision for a particular kind parameter, no matter what operating system or processor I am using?
EDIT: I read somewhere that, for integers, different CPUs support different integral data types, which means some processors might not directly support a certain integer precision. I also read that programming languages like Fortran opt for optimization, so the language is implemented in a way that avoids strange precisions that are not directly supported by the hardware. Does this have to do with my concern?
You are asking "do they change". The answer is "they may".
The meaning of a certain kind value for a certain type depends on the Fortran processor (the language concept, which is not the same thing as a microprocessor).
The concept of a Fortran processor covers the entire system that is responsible for processing and executing Fortran source - the hardware, operating system, compiler, libraries, perhaps even the human operator - all of it. Change any part of that system, and you can have a different Fortran processor.
Consequently there is no requirement that the interpretation of a particular kind value for a particular type be the same for the same compiler given variations in compiler options or hardware in use.
If you want your code to be portable, then don't make the code depend on particular kind values.
I'm trying to write an LLVM optimization pass. And I need a way to determine if one LLVM instruction affects the other (or depends on the other).
These dependencies can have different nature:
the first instruction creates a value, the other uses it as an operand
the first instruction writes to a memory location, the other reads from there
other possibilities?
In short, the first instruction must always be executed before the other in order to preserve code correctness. A three-way answer (depends, may depend, doesn't depend) would also do.
I understand that I can use use-def chains to find dependencies of type 1, and AliasAnalysis can help me with dependencies of type 2. But I'm afraid there can be other dependency types...
Does LLVM provide any general mechanism for that?
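For reference, here is how I imagine the two checks I already know about would look (usesValueOf and mayMemoryDepend are just names I made up, not LLVM APIs; this is a minimal sketch, not a complete dependence test):

```cpp
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Type 1: does Later use the value produced by Earlier? (def-use chain)
static bool usesValueOf(const Instruction &Earlier, const Instruction &Later) {
  for (const Use &U : Later.operands())
    if (U.get() == &Earlier)
      return true;
  return false;
}

// Type 2: may the two instructions touch the same memory, with at least one
// write involved?  Returns a conservative "may depend" answer, matching the
// three-way (depends / may depend / doesn't depend) formulation above.
static bool mayMemoryDepend(Instruction &Earlier, Instruction &Later,
                            AAResults &AA) {
  if (!Earlier.mayWriteToMemory() && !Later.mayWriteToMemory())
    return false; // two reads never conflict
  ModRefInfo MRI =
      AA.getModRefInfo(&Later, MemoryLocation::getOrNone(&Earlier));
  return isModOrRefSet(MRI);
}
```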