I'm trying to write an LLVM optimization pass. And I need a way to determine if one LLVM instruction affects the other (or depends on the other).
These dependencies can have different nature:
first instruction creates a value, the other uses it as operand
first instructions writes to memory location, the other reads from there
other possibilities?
In short, first instruction must always be executed before the other in order to preserve code correctness. Three-way answer (depends, may depend, doesn't depend) will also do.
I understand that I can use use-def chains to find dependencies of type 1, and AliasAnalysis can help me with dependencies of type 2. But I'm afraid there can be other dependency types...
Does LLVM provide any general mechanism for that?
Related
I am working on a C library which compiles/links to a .a file that users can statically link into their code. The library's performance is very important, so I am writing performance-critical routines in x86-64 assembly to optimize performance.
For some routines, I can get significantly better performance if I use BMI2 instructions than if I stick to the "standard" x86-64 instruction set. Trouble is, BMI2 was introduced fairly recently and some of my users use processors that do not support those instructions.
So, I've written optimized the routines twice, once using BMI2 instructions and once without using them. In my current setup, I would distribute two versions of the .a file: a "fast" one that requires support for BMI2 instructions, and a "slow" one that does not require support for BMI2 instructions.
I am asking if there's a way to simplify this by distributing a single .a file that will dynamically choose the correct implementation based on whether the CPU on which the final application runs supports BMI2 instructions.
Unlike similar questions on StackOverflow, there are two peculiarities here:
The technique to choose the function needs to have particularly low overhead in the critical path. The routines in question, after assembly-optimization, run in ~10 ns, so even a single if statement could be significant.
The function that needs to be chosen "dynamically" is chosen once at the beginning, and then remains fixed for the duration of the program. I'm hoping that this will offer a faster solution than the one suggested in this question: Choosing method implementation at runtime
The fastest solution I've come up with so far is to do the following:
Check whether the CPU supports BMI2 instructions using the cpuid instruction.
Set a global variable true or false depending on the result.
Branch on the value of this global variable on every function invocation.
I'm not satisfied with this approach because it has two drawbacks:
I'm not sure how I can automatically run cpuid and set a global variable at the beginning of the program, given that I'm distributing a .a file and don't have control over the main function in the final binary. I'm happy to use C++ here if it offers a better solution, as long as the final library can still be linked with and called from a C program.
This incurs overhead on every function call, when ideally the only overhead would be on program startup.
Are there any solutions that are more efficient than the one I've detailed above?
x264 uses an init function (which users of the library are required to call before calling anything else, or something like that) to set up a struct of function pointers based on CPUID results. Including taking into account that pshufb is slow on some early CPUs that support it.
If your functions depend on pdep / pext, you probably want to detect AMD vs. Intel, because AMD's pdep/pext is very slow and probably not worth using on Ryzen, even though it is available. (See https://agner.org/optimize/ for instruction tables.)
Function pointers are fairly low overhead, about the same as calling a function in a shared library or DLL. call [rel funcptr] instead of call func. (In the compiler-generated asm that calls your functions).
CPU dependent code: how to avoid function pointers? shows a very simple example of it in C, and is asking for ways to avoid it. With dynamic linking, you can do CPU detection at dynamic link time so the dynamic-linking indirection becomes your CPU-dispatch indirection as well (like glibc does for selecting an optimized memcpy implementation.)
But with static linking for a .a, just make function pointers that are statically initialized to the baseline versions, and your CPU init function (which hopefully runs before any of the function pointers are dereferenced) rewrites them to point at the best version for the current CPU.
If you are using gcc, you can get the compiler to implement all the boiler plate code automatically. gcc manual page on function multiversioning
Please note: this question is not about LLVM IR, but LLVM's MIR, an internal intermediate representation lower than the former one.
This documentation on LLVM Machine code description classes, says (highlighting mine):
At the high-level, LLVM code is translated to a machine specific representation formed out of MachineFunction , MachineBasicBlock , and MachineInstr instances...
However, the same paragraph goes on and says:
This representation is completely target agnostic, representing instructions in their most abstract form...
My question is, how to understand this paragraph?
I have a hard time reconciling the claim that this intermediate representation is machine specific and target agnostic at the same time. I thought "machine" and "target", in LLVM's context, mean the same thing - the instruction set architecture (e.g. x86_64, MIPS) used by the compiled executable.
Examples are welcome.
There are different ways to be platform specific. For instance, you could have a differently-named opcode for add, or perhaps with different overflow semantics, or you could use the same add for all, with the operands/flags specified by the same arguments for all target platforms, with the same default values.
And there are many target-specific details such as the size or alignment of pointers that affect your code even if they don't affect any single instruction.
Machine IR represents instructions in their most abstract form. It doesn't try hide that on this target, pointers have 32 bits.
I want to force LLVM to generate CMPx-, TEST- and alike instructions on x86-64 to be up to 8 bit width only, forcing e.g. 32bit-int comparisons into four separate cmp+branch pairs. This obviously requires some bit-masking and increased instruction count.
Can I achieve this by simply "disabling" certain instructions for x86-64 so LLVM auto-generates the required glue code? Do I have to write a pass and work on the IR myself?
No, there is no way of disabling certain instructions like this from a vanilla build of LLVM. Anything you do to achieve this will require modifying LLVM.
You have several options for modifying LLVM:
You can add an x86-specific pass to the LLVM backend (does not work on the IR) which directly expands the cmp and test instructions into chains of instructions on sub-registers. You would have to do this after instruction selection to preclude some target-independent pass from undoing the transformation. This is called an "MI" pass in LLVM parlance. As an example you can look in X86FixupSetCC.cpp (mirror here). This has a huge advantage in that you can put it behind a flag and otherwise control whether it occurs once you add the core functionality.
You can modify LLVM's instruction tables for LLVM in the X86 .td files to only define these instructions for 8-bit registers, and then add the def Pat<...>; patterns to the .td files that allow programs with wider comparisons to still have their instructions selected (much as Colin suggested above). This has the disadvantage of not only have you modified your LLVM but you can't easily turn those modifications on and off behind some flag.
You can't do anything to LLVM's IR that will really help here because the code generator will just optimize things back into instruction patterns you're trying to avoid.
Hope this helps!
What you're probably looking to do is redefine the lowering pattern in the x86 .td files. There's code that looks like "def Pat<...>;" that defines a translation from one graph of instructions to another. There should be a pattern for going from IR comparison instructions to the x86 32bit compare instructions. You'll want to edit this pattern and instead output your sequence of comparisons.
I'm looking into generating a call-graph for the linux kernel that would include function pointers (see my previous question Static call graph generation for the Linux kernel for more information). I've been told LLVM should be suitable for this purpose, however I was unable to find the relevant information on llvm.org
Any help, including pointers to relevant documentation, would be appreciated.
First, you have to compile your kernel into LLVM IR (instead of native object files). Then, using llvm-ld, combine all the IR object files into a single large module. It could be quite a tricky thing to do, you'll have to modify the makefiles heavily, but I believe it is doable.
Now you can do your analysis. A simple call graph can be generated using the opt tool with -dot-callgraph pass. It is unlikely to handle function pointers, so you may want to modify it.
Tracking all the possible data flow paths that would carry your function pointers is quite a challenge, and in general case it is impossible to do (if there are any pointer to integer casts, if pointers are stored in complicated data structures, etc.). For a majority of specific cases you can try to implement a global abstract interpretation to approximate all the possible data flow paths for your pointers. It would not be accurate, of course, but then you'll get at least a conservative approximation.
Function parameters are placed on the stack, but compilers can optimize this task by the use of optional registers. It would make sense that this optimization will kick in if there are only 1-2 parameters, and not when there are 256 (not that one would want to have the max number of parameters).
How can one find out the parameter limit (number of parameters) for a certain compiler (such as gcc) where one can be sure that this optimization will be used?
Function parameters are placed on the stack, but compilers can optimize this task by the use of optional registers.
As FrankH says in his comments and as I'm going to say in my answer, the application binary interface for the system in question determines how arguments are passed to functions - this is called the calling convention for that platform.
To complicate matters, x86 32-bit actually has several. This is historical and comes from the fact that when Win32 bit arrived, everyone went crazy doing different things.
So, yes, you can "optimise" by writing function calls in such a way, but no, you shouldn't. You should follow the standards for your platform. Because the honest truth is, the speed of stack access probably isn't slowing your code down to that great an extent that you need to be binary-incompatible from everyone else on your system.
Why the need for ABIs/standard calling conventions? Well, in terms of using the processor registers, stack etc, applications must agree on what means what and where it shoudl go. If one function decided all its arguments were in registers and another that some were on the stack, how would they be interoperable? Moreover, you might come across the term scratch registers to mean those registers you don't have to restore. What happens if you call a function expecting it to leave some registers alone?
Anyway, as for what you asked for, here's some ABI documentation:
The difference between x86 and x64 on windows.
x86_64 ABI used for Unix-like platforms.
Wikipedia's x86 calling conventions.
A document on compiler calling conventions.
The last one is my favourite. To quote it:
In the days of the old DOS operating system, it was often possible to combine development
tools from different vendors with few compatibility problems. With 32-bit Windows, the
situation has gone completely out of hand. Different compilers use different data
representations, different function calling conventions, and different object file formats.
While static link libraries have traditionally been considered compiler-specific, the
widespread use of dynamic link libraries (DLL's) has made the distribution of function
libraries in binary form more common.
So whatever you're trying to do with optimising via modifying the function calling method, don't. Find another way to optimise. Profile your code. Study the compiler optimisations you've got for your compiler (-OX) if you think it helps and dump the assembly to check, if the speed is really that crucial
For publically visible functions, this is documented in the ABI standard. For functions that are not referencible from the outside, all bets are off anyway.
You would have to read the fine manual for the compiler. If you were lucky, you would find it there in a description of function calling conventions. Otherwise, for an OSS compiler such as gcc you would probably have to read its source-code.