Determine instruction property for Backends in LLVM - llvm

Using the LLVM C++ API, I'm trying to determine some metadata about instructions for various Backends.
For example, I'd like to know what the valid registers are for pop in x86, or the valid operand types (immediate vs registers) for mov in ARM.
I looked through some usages of LLVM, like in keystone-engine, but it's not quite what I'm looking for. I see how to disassemble or even assemble instructions, but I'm looking for a more granular API around Backend Instructions.

Related

How do applications determine if instruction set is available and use it in case it is?

I'm curious how this works in games and other software.
More precisely, I'm asking for a solution in C++.
Something like:
if AMX available -> Use AMX version of the math library
else if AVX-512 available -> Use AVX-512 version of the math library
else if AVX-256 available -> Use AVX-256 version of the math library
etc.
The basic idea I have is to compile the library into different DLLs and swap them at runtime, but that doesn't seem like the best solution to me.
For the detection part
See Are the xgetbv and CPUID checks sufficient to guarantee AVX2 support? which shows how to detect CPU and OS support for new extensions: cpuid and xgetbv, respectively.
ISA extensions that add new/wider registers that need to be saved/restored on context switch also need to be supported and enabled by the OS, not just the CPU. New instructions like AVX-512 will still fault on a CPU that supports them if the OS hasn't set a control-register bit. (Effectively promising that it knows about them and will save/restore them.) Intel designed things so the failure mode is faulting, not silent corruption of registers on CPU migration, or context switch between two programs using the extension.
Extensions that added new or wider registers are AVX, AVX-512F, and AMX. OSes need to know about them. (AMX is very new, and adds a large amount of state: 8 tile registers T0-T7 of 1KiB each. Apparently OSes need to know about AMX for power-management to work properly.)
OSes don't need to know about AVX2/FMA3 (still YMM0-15), or any of the various AVX-512 extensions which still use k0-k7 and ZMM0-31.
There's no OS-independent way to detect OS support of SSE, but fortunately it's old enough that these days you don't have to. It and SSE2 are baseline for x86-64. Everything up to SSE4.2 uses the same register state (XMM0-15) so OS support for SSE1 is sufficient for user-space to use SSE4.2. SSE1 was new in 1999, with Pentium 3.
Different compilers have different ways of doing CPUID and xgetbv detection. See does gcc's __builtin_cpu_supports check for OS support? - unfortunately no, only CPUID, at least when that was asked. I'd consider that a GCC bug, but IDK if it ever got reported or fixed.
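As a rough illustration of the cpuid + xgetbv check described above, here is a minimal sketch for GCC/clang on x86. The names SimdSupport, read_xcr0 and detect_simd are made up for this example, and on other compilers or architectures it simply reports no support. It treats an extension as usable only if both the CPU feature bit and the matching XCR0 state bits (set by the OS) are present.

```cpp
#include <cstdint>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>  // __get_cpuid, __get_cpuid_count (GCC/clang header)
#define SKETCH_HAVE_X86_CPUID 1
#else
#define SKETCH_HAVE_X86_CPUID 0
#endif

// Hypothetical result type for this sketch.
struct SimdSupport {
    bool avx = false;
    bool avx2 = false;
    bool avx512f = false;
};

// Reads XCR0. Only call this after CPUID reports OSXSAVE, otherwise xgetbv faults.
static std::uint64_t read_xcr0() {
#if SKETCH_HAVE_X86_CPUID
    std::uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
#else
    return 0;
#endif
}

static SimdSupport detect_simd() {
    SimdSupport s;
#if SKETCH_HAVE_X86_CPUID
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return s;
    const bool osxsave = (ecx >> 27) & 1;  // OS has enabled XSAVE/XGETBV
    const bool cpu_avx = (ecx >> 28) & 1;  // CPU implements AVX
    if (!osxsave) return s;

    const std::uint64_t xcr0 = read_xcr0();
    const bool os_ymm = (xcr0 & 0x6) == 0x6;    // XMM + YMM state enabled
    const bool os_zmm = (xcr0 & 0xE6) == 0xE6;  // plus opmask and ZMM state

    s.avx = cpu_avx && os_ymm;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        s.avx2 = s.avx && ((ebx >> 5) & 1);       // leaf 7, EBX bit 5
        s.avx512f = os_zmm && ((ebx >> 16) & 1);  // leaf 7, EBX bit 16
    }
#endif
    return s;
}
```

Calling detect_simd() once at startup and caching the result is usually enough, since the answer does not change while the program runs.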
For the optional-use part
Typically this is done by setting function pointers to selected versions of some important functions. Inlining through function pointers isn't generally possible, so make sure you choose the boundaries appropriately, like an AVX-512 version of a function that includes a whole loop, not just a single vector operation.
GCC's function multi-versioning can automate that for you, transparently compiling multiple versions and hooking some function-pointer setup.
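A minimal sketch of the function-pointer approach, assuming GCC or clang on x86. The names sum_scalar, sum_avx2, cpu_has_avx2 and sum are invented for this example. It uses __builtin_cpu_supports purely as a stand-in for detection; as noted above, that builtin may not check OS support, so a real program might substitute its own cpuid + xgetbv check.

```cpp
#include <cstddef>

// Portable baseline version: the whole loop lives inside the dispatched function.
static float sum_scalar(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
// Same source, but the compiler is allowed to use AVX2 inside this one function.
__attribute__((target("avx2")))
static float sum_avx2(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Placeholder detection (assumption: CPUID-based, may ignore OS support).
static bool cpu_has_avx2() {
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
}
#else
// Fallback so the sketch still builds on other compilers and architectures.
static float sum_avx2(const float* p, std::size_t n) { return sum_scalar(p, n); }
static bool cpu_has_avx2() { return false; }
#endif

using SumFn = float (*)(const float*, std::size_t);

// Chooses an implementation; called once from sum().
static SumFn select_sum() { return cpu_has_avx2() ? sum_avx2 : sum_scalar; }

// Public entry point: one indirect call per array, not per element.
float sum(const float* p, std::size_t n) {
    static const SumFn impl = select_sum();
    return impl(p, n);
}
```

Callers just use sum(data, count). The indirect call happens once per array, which is the kind of boundary suggested above.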
There have been some previous Q&As about this with different compilers, search for "CPU dispatch avx" or something like that, along with other search terms.
See The Effect of Architecture When Using SSE / AVX Intrinisics to understand the difference between the two models for intrinsics. With GCC/clang you have to enable -march=skylake or whatever, or manually -mavx2, before you can use an intrinsic. With MSVC and classic ICC you can use any intrinsic anywhere, even to emit instructions the compiler wouldn't be allowed to use when auto-vectorizing. (Those compilers can't or don't optimize intrinsics much at all, perhaps because that could lead to them getting hoisted out of if(cpu) statements.)
Windows provides IsProcessorFeaturePresent but AVX support is not on the list.
For more detailed detection you need to ask the CPU directly. On x86 this means the CPUID instruction. Visual C++ provides the __cpuidex intrinsic for this. In your case, function/leaf 1 and check bit 28 in ECX. Wikipedia has a decent article but you really should download the Intel instruction set manual to use as a reference.

How can I achieve native-level optimizations when cross-compiling with Clang?

When cross-compiling using clang and the -target option, targeting the same architecture and hardware as the native system, I've noticed that clang seems to generate less optimized code than the natively built counterpart in cases where the <sys> in the triple is none.
Consider this simple code example:
int square(int num) {
    return num * num;
}
When optimized at -O3 with -target x86_64-linux-elf, the native x86_64 target code generation yields:
square(int):
    mov eax, edi
    imul eax, edi
    ret
The code generated with -target x86_64-none-elf yields:
square(int):
    push rbp
    mov rbp, rsp
    mov eax, edi
    imul eax, edi
    pop rbp
    ret
Live Example
Despite having the same hardware and optimization flags, clearly something is missing an optimization. The problem goes away if none is replaced with linux in the target triple, despite no system-specific features being used.
At first glance it may look like it simply isn't optimizing at all, but different code segments show that it is performing some optimizations, just not all. For example, loop-unrolling is still occurring.
Although the above examples are simply with x86_64, in practice, this issue is generating code-bloat for an armv7-based constrained embedded system, and I've noticed several missed optimizations in certain circumstances such as:
Not removing unnecessary setup/cleanup instructions (same as in x86_64)
Not coalescing certain sequential inlined increments into a single add instruction (at -Os, when inlining vector-like push_back calls. This optimizes when built natively from an arm-based system running armv7.)
Not coalescing adjacent small integer values into a single mov (such as merging a 32-bit int with a bool in an optional implementation. This optimizes when built natively from an arm-based system running armv7.)
etc
I would like to know what I can do, if anything, to achieve the same optimizations when cross-compiling as compiling natively? Are there any relevant flags that can help tweak tuning that are somehow implied with the <sys> in the triple?
If possible, I'd love some insight as to why the cross-compilation target appears to fail to optimize certain things simply because the system is different, despite having the same architecture and abi. My understanding is that LLVM's optimizer operates on the IR, which should generate effectively the same optimizations so long as nothing is reliant on the target system itself.
TL;DR: for x86 targets, frame pointers are enabled by default when the OS is unknown. You can disable them manually with -fomit-frame-pointer. For ARM platforms, you almost certainly need to provide more information so that the backend can deduce the target ABI. Use -emit-llvm to check which part of Clang/LLVM generates the inefficient code.
The Application Binary Interface (ABI) can change from one target to another. There is no standard ABI in C. The one chosen depends on several parameters, including the architecture, its version, the vendor, the OS, the environment, etc.
The -target parameter helps the compiler select an ABI. The thing is, x86_64-none-elf is not complete enough for the backend to generate fast code. In fact, I think this is not actually a valid target, since Clang emits a warning in this case and the same warning appears with random invalid targets. Surprisingly, the compiler still succeeds in generating a generic binary from the provided information. Targets like x86_64-windows and x86_64-linux work, as does x86_64-unknown-windows-cygnus (for Cygwin on Windows). You can get the list of platforms, OSes, environments, etc. supported by Clang in the code.
One particular aspect of the ABI is the calling convention, which differs between operating systems. For example, x86-64 Linux platforms use the calling convention from the System V AMD64 ABI, while x86-64 Windows platforms use the Microsoft x64 calling convention by default (vectorcall is an opt-in convention based on it). The situation is more complex for old x86 architectures. For more information about this please read this and this.
In your case, without information about the OS, the compiler might select its own generic ABI, resulting in the old push/pop instructions being used. That being said, the compiler assumes that edi contains the argument passed to the function, meaning that the chosen ABI is the System V AMD64 one (or a derived version). The environment can play an important role in stack optimizations, such as whether the stack is accessed from outside the function (e.g. by callee functions).
My initial guess was that the assembly backend disabled some optimizations due to the lack of information about the specified target, but this is not the case for x86-64 ELF targets, as the same backend is selected. Note that the target is parsed by the architecture-specific backend (see here for example).
In fact, the issue comes from Clang emitting IR with the frame-pointer attribute set to all. You can inspect the IR with the -emit-llvm flag. You can solve this problem using -fomit-frame-pointer.
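To make the check concrete, here is the function from the question as a standalone file, with hedged command lines in comments. The file names are assumptions, and the exact IR text may vary between Clang versions. The thing to look for in the emitted IR is the function attribute "frame-pointer"="all".

```cpp
// square.cpp (hypothetical file name): the function from the question.
//
// Inspect the IR Clang emits for the bare-metal triple:
//   clang++ -O3 -target x86_64-none-elf -S -emit-llvm square.cpp -o square.ll
//   (look for "frame-pointer"="all" in the attributes of square)
//
// Same triple, but with the frame pointer explicitly omitted:
//   clang++ -O3 -target x86_64-none-elf -fomit-frame-pointer -S square.cpp -o square.s
//   (the push rbp / mov rbp, rsp / pop rbp sequence should disappear)
int square(int num) {
    return num * num;
}
```

Comparing square.s with and without -fomit-frame-pointer shows whether the extra prologue and epilogue come from this setting alone.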
On ARM, the problem can be different and come from the assembly backend. You should certainly specify the target OS or at least more information like the sub-architecture type (mainly for ARM), the vendor and the environment.
Overall, it is reasonable to expect a more generic target to produce less efficient code due to the lack of information. If you want more information about this, please file an issue on the Clang or LLVM bug tracker to track down the reason why this happens and/or let developers fix it.
Related posts/links:
clang: how to list supported target architectures?
https://clang.llvm.org/docs/CrossCompilation.html

Is it possible to tell clang which registers to use for certain parts of the code without using assembly

I'm working on a project that needs to work on both Linux and Windows.
However, there are portions of the code that don't work on Linux due to differing registers under clang and msvc.
Is there a way to either make the register use consistent or request that clang use a specific register during an operation? I would like to find a solution that doesn't involve rewriting portions in assembly. Here is what I'm talking about as differing output code.
https://godbolt.org/z/DO9pQN
Any help is appreciated.
EDIT per comments:
This is for an emulator so certain registers are used for certain tasks.
One of the main things is that we use RSI for a certain variable and then clang uses RSI in function calls. MSVC compiled does not suffer from the same problem.
EDIT 2 per comments:
This is for the xbox 360 emulator, Xenia.
We are currently trying to finish the Linux side of things. However, we are running into problems with clang using the same registers for function calls as we use for storing something called a context.
Our idea was to just ask clang to not use that particular register, but I couldn't find a way to do that without just writing it in Assembly. Another problem with that solution is that gcc might also have the same issue on a different register. Specifically, we are looking at the ppc-tests. The above link is the output from clang compared with msvc.
Here is the relevant code:
https://github.com/xenia-project/xenia/blob/e79e18bb271212b13bcb65a610d957b6058f34db/src/xenia/cpu/backend/x64/x64_backend.cc
https://github.com/xenia-project/xenia/blob/master/src/xenia/cpu/ppc/testing/ppc_testing_main.cc
rsi cannot be used for your own purposes on Linux because it is used by the function calling convention (psABI-x86_64).
But if you can use another register such as r10, code compiled with GCC and the option -ffixed-r10 will not use r10 (demo).

Force reduced width of comparison instructions

I want to force LLVM to generate CMPx, TEST and similar instructions on x86-64 with at most 8-bit operand width, splitting e.g. a 32-bit int comparison into four separate cmp+branch pairs. This obviously requires some bit-masking and an increased instruction count.
Can I achieve this by simply "disabling" certain instructions for x86-64 so LLVM auto-generates the required glue code? Do I have to write a pass and work on the IR myself?
No, there is no way of disabling certain instructions like this from a vanilla build of LLVM. Anything you do to achieve this will require modifying LLVM.
You have several options for modifying LLVM:
You can add an x86-specific pass to the LLVM backend (does not work on the IR) which directly expands the cmp and test instructions into chains of instructions on sub-registers. You would have to do this after instruction selection to preclude some target-independent pass from undoing the transformation. This is called an "MI" pass in LLVM parlance. As an example you can look in X86FixupSetCC.cpp (mirror here). This has a huge advantage in that you can put it behind a flag and otherwise control whether it occurs once you add the core functionality.
You can modify LLVM's instruction tables in the X86 .td files to only define these instructions for 8-bit registers, and then add the def Pat<...>; patterns to the .td files that allow programs with wider comparisons to still have their instructions selected (much as Colin suggested above). This has the disadvantage that not only have you modified your LLVM, but you also can't easily turn those modifications on and off behind some flag.
You can't do anything to LLVM's IR that will really help here because the code generator will just optimize things back into instruction patterns you're trying to avoid.
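For reference, this is what the desired expansion means at the source level: a 32-bit comparison split into four 8-bit compare+branch steps, most significant byte first. It is only a sketch of the semantics, with made-up function names. As noted above, writing it this way in C++ or IR gives no guarantee about the emitted instructions, because the optimizer is free to recombine the bytes into one wide compare.

```cpp
#include <cstdint>

// Equality of two 32-bit values using only 8-bit comparisons.
static bool eq_bytewise(std::uint32_t a, std::uint32_t b) {
    for (int shift = 24; shift >= 0; shift -= 8) {
        const std::uint8_t x = static_cast<std::uint8_t>(a >> shift);
        const std::uint8_t y = static_cast<std::uint8_t>(b >> shift);
        if (x != y) return false;  // one 8-bit cmp + branch per byte
    }
    return true;
}

// Unsigned less-than using only 8-bit comparisons.
// The first differing byte, scanning from the top, decides the result.
static bool lt_bytewise(std::uint32_t a, std::uint32_t b) {
    for (int shift = 24; shift >= 0; shift -= 8) {
        const std::uint8_t x = static_cast<std::uint8_t>(a >> shift);
        const std::uint8_t y = static_cast<std::uint8_t>(b >> shift);
        if (x != y) return x < y;
    }
    return false;  // all four bytes equal
}
```

An MI pass as described in the first option would produce this shape directly on sub-registers, after the point where the optimizer could undo it.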
Hope this helps!
What you're probably looking to do is redefine the lowering pattern in the x86 .td files. There's code that looks like "def Pat<...>;" that defines a translation from one graph of instructions to another. There should be a pattern for going from IR comparison instructions to the x86 32bit compare instructions. You'll want to edit this pattern and instead output your sequence of comparisons.

Register allocation rules in code generated by major C/C++ compilers

I remember some rules from a long time ago (pre-32-bit Intel processors), when it was quite common (at least for me) to have to analyze the assembly output generated by C/C++ compilers (in my case, Borland/Turbo at that time) to find performance bottlenecks, and to safely mix assembly routines with C/C++ code. Things like using the SI register for the this pointer, AX being used for return values, which registers should be preserved when an assembly routine returns, etc.
Now I was wondering if there's some reference for the more popular C/C++ compilers (Visual C++, GCC, Intel...) and processors (Intel, ARM, ...), and if not, where to find the pieces to create one. Ideas?
You are asking about "application binary interface" (ABI) and calling conventions. These are typically set by operating systems and libraries, and enforced by compilers and linkers. Google for "ABI" or "calling convention." Some starting points from Wikipedia and Debian for ARM.
Agner Fog's "Calling Conventions" document summarizes, amongst other things, the Windows and Linux 64 and 32-bit ABIs: http://www.agner.org/optimize/calling_conventions.pdf. See Table 4 on p.10 for a summary of register usage.
One warning from personal experience: don't embed assumptions about the ABI in inline assembly. If you write a function in inline assembly that assumes return and/or parameter transfer in particular registers (e.g. eax, rdi, rsi), it will break if/when the function is inlined by the compiler.
Open Watcom C/C++ compiler supports two calling conventions, register-based (default) and stack-based (very close to what other compilers use). User's Guide for this compiler describes them both and is available for free online, together with the compiler itself. You may find these topics in the User's Guide especially helpful:
10.4.1 Passing Arguments Using Register-Based Calling Conventions
10.4.6 Using Stack-Based Calling Conventions
10.5 Calling Conventions for 80x87-based Applications
Well, today, if optimisation is turned on, there aren't any. But GCC allows you to declare that your assembly instruction should use a particular variable regardless of whether it's in a register or not, or even to force GCC to put that variable into a register usable with your instruction. You can also declare which registers your inline assembly block reserves for itself (so the compiler should generate appropriate save/restore code around your inline piece, if needed).
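A minimal sketch of what that looks like with GCC-style extended asm on x86 (AT&T syntax, the default for GCC and clang). The function name add_u32 is made up, and other compilers or architectures take the plain C++ fallback. The operands are described with constraints instead of being assumed to live in particular registers, so the code stays correct if the function is inlined.

```cpp
#include <cstdint>

// Adds b to a. The compiler picks the registers; the asm only states requirements.
static inline std::uint32_t add_u32(std::uint32_t a, std::uint32_t b) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __asm__("addl %1, %0"
            : "+r"(a)   // %0: read-write operand, any general-purpose register
            : "r"(b)    // %1: input operand, any general-purpose register
            : "cc");    // clobber list: the flags register is modified
    return a;
#else
    return a + b;       // fallback where this asm syntax is unavailable
#endif
}
```

If the asm needed a fixed scratch register, naming it in the clobber list (for example "ecx") would tell the compiler not to keep live values there across the block.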
I believe, but am by no means sure, that GCC uses the Itanium ABI for most of its functionality; the incompatibilities between it and the ABI it uses are documented.