Is it possible to tell clang which registers to use for certain parts of the code without using assembly - c++

I'm working on a project that needs to work on both Linux and Windows.
However, there are portions of the code that don't work on Linux because clang and MSVC choose different registers.
Is there a way to either make the register use consistent or request that clang use a specific register during an operation? I would like to find a solution that doesn't involve rewriting portions in assembly. Here is an example of the differing output code:
https://godbolt.org/z/DO9pQN
Any help is appreciated.
EDIT per comments:
This is for an emulator so certain registers are used for certain tasks.
One of the main things is that we use RSI for a certain variable, and clang then uses RSI in function calls. Code compiled with MSVC does not suffer from the same problem.
EDIT 2 per comments:
This is for the xbox 360 emulator, Xenia.
We are currently trying to finish the Linux side of things. However, we are running into problems with clang using the same registers for function calls that we use for storing something called a context.
Our idea was to just ask clang not to use that particular register, but I couldn't find a way to do that without writing it in assembly. Another problem with that solution is that gcc might have the same issue on a different register. Specifically, we are looking at the ppc-tests. The above link shows the output from clang compared with MSVC.
Here is the relevant code:
https://github.com/xenia-project/xenia/blob/e79e18bb271212b13bcb65a610d957b6058f34db/src/xenia/cpu/backend/x64/x64_backend.cc
https://github.com/xenia-project/xenia/blob/master/src/xenia/cpu/ppc/testing/ppc_testing_main.cc

RSI cannot be reserved for your own purposes on Linux because it is used to pass arguments in the x86-64 System V calling convention (psABI-x86_64).
But if you can use another register, such as r10, code compiled with GCC and the option -ffixed-r10 will not use r10 (demo); see the sketch below.
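Here is a minimal sketch of that idea, assuming r10 is an acceptable home for the context pointer (-ffixed-<reg> is documented for GCC; whether clang honours -ffixed-r10 on x86-64 is something to verify on your toolchain):
// build: g++ -O2 -ffixed-r10 -c demo.cc
// With the flag, GCC never allocates r10, so JIT-generated code is free to keep
// the guest context there. GCC also offers global register variables to pin a
// value to the reserved register, e.g. (GNU extension):
//   register void* guest_context asm("r10");
long keep_busy(long a, long b, long c, long d, long e, long f) {
    return a * b + c * d + e * f;   // plenty of live values, yet r10 is never used
}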

Related

How can I achieve native-level optimizations when cross-compiling with Clang?

When cross-compiling using clang and the -target option, targeting the same architecture and hardware as the native system, I've noticed that clang seems to generate worse-optimized code than the natively built counterpart for cases where the <sys> in the triple is none.
Consider this simple code example:
int square(int num) {
    return num * num;
}
When optimized at -O3 with -target x86_64-linux-elf, the native x86_64 target code generation yields:
square(int):
    mov  eax, edi
    imul eax, edi
    ret
The code generated with -target x86_64-none-elf yields:
square(int):
    push rbp
    mov  rbp, rsp
    mov  eax, edi
    imul eax, edi
    pop  rbp
    ret
Live Example
Despite having the same hardware and optimization flags, clearly something is missing an optimization. The problem goes away if none is replaced with linux in the target triple, despite no system-specific features being used.
At first glance it may look like it simply isn't optimizing at all, but different code segments show that it is performing some optimizations, just not all. For example, loop-unrolling is still occurring.
Although the above examples are simply with x86_64, in practice, this issue is generating code-bloat for an armv7-based constrained embedded system, and I've noticed several missed optimizations in certain circumstances such as:
Not removing unnecessary setup/cleanup instructions (same as in x86_64)
Not coalescing certain sequential inlined increments into a single add instruction (at -Os, when inlining vector-like push_back calls. This optimizes when built natively from an arm-based system running armv7.)
Not coalescing adjacent small integer values into a single mov (such as merging a 32-bit int with a bool in an optional implementation. This optimizes when built natively from an arm-based system running armv7.)
etc
I would like to know what I can do, if anything, to achieve the same optimizations when cross-compiling as compiling natively? Are there any relevant flags that can help tweak tuning that are somehow implied with the <sys> in the triple?
If possible, I'd love some insight as to why the cross-compilation target appears to fail to optimize certain things simply because the system is different, despite having the same architecture and abi. My understanding is that LLVM's optimizer operates on the IR, which should generate effectively the same optimizations so long as nothing is reliant on the target system itself.
TL;DR: for x86 targets, frame pointers are enabled by default when the OS is unknown. You can manually disable them using -fomit-frame-pointer. For ARM platforms, you certainly need to provide more information so that the backend can deduce the target ABI. Use -emit-llvm to check which part of Clang/LLVM generates the inefficient code.
The Application Binary Interface (ABI) can change from one target to another. There is no standard ABI in C. The chosen one depends on several parameters including the architecture, its version, the vendor, the OS, the environment, etc.
The -target parameter helps the compiler select an ABI. The thing is that x86_64-none-elf is not specific enough for the backend to actually generate fast code. In fact, I think this is not even a valid target, since Clang prints a warning in this case and the same warning appears with random invalid targets. Surprisingly, the compiler still succeeds in generating a generic binary from the provided information. Targets like x86_64-windows and x86_64-linux work, as does x86_64-unknown-windows-cygnus (for Cygwin on Windows). You can get the list of platforms, OSes, environments, etc. supported by Clang in the code.
One particular aspect of the ABI is the calling convention, and calling conventions differ between operating systems. For example, x86-64 Linux platforms use the calling convention of the System V AMD64 ABI, while x86-64 Windows platforms use the Microsoft x64 calling convention (with __vectorcall available as an extension). The situation is more complex for old x86 architectures. For more information about this, please read this and this.
In your case, without information about the OS, the compiler might select its own generic ABI, resulting in the old push/pop instructions being used. That being said, the compiler assumes that edi contains the argument passed to the function, meaning that the chosen ABI is the System V AMD64 one (or a derived version). The environment can play an important role in stack optimizations, such as whether the stack may be accessed from outside the function (e.g. by callee functions).
My initial guess was that the assembly backend disabled some optimizations due to the lack of information regarding the specified target, but this is not the case for x86-64 ELFs, as the same backend is selected. Note that the target is parsed by an architecture-specific backend (see here for example).
In fact, the issue comes from Clang emitting IR with the frame-pointer attribute set to "all". You can check the IR with the flag -emit-llvm. You can solve this problem using -fomit-frame-pointer.
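As a hedged sketch of how to check and fix this (the flags are standard Clang options; the exact attribute spelling in the IR may differ between Clang versions):
// square.cpp -- same example as above
int square(int num) {
    return num * num;
}
// Inspect the IR Clang hands to LLVM:
//   clang++ -O3 -target x86_64-none-elf -S -emit-llvm square.cpp
//   -> look for an attribute group containing "frame-pointer"="all"
// Disable the frame pointer explicitly:
//   clang++ -O3 -target x86_64-none-elf -fomit-frame-pointer -S square.cpp
//   -> the push rbp / mov rbp, rsp / pop rbp sequence disappears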
On ARM, the problem can be different and come from the assembly backend. You should certainly specify the target OS or at least more information like the sub-architecture type (mainly for ARM), the vendor and the environment.
Overall, note that it is reasonable to expect a more generic target to produce less efficient code due to the lack of information. If you want more information about this, please file an issue on the Clang or LLVM bug tracker so as to track down the reason why this happens and/or let the developers fix it.
Related posts/links:
clang: how to list supported target architectures?
https://clang.llvm.org/docs/CrossCompilation.html

Getting address of caller in c++

At the moment I'm working on an anticheat. I added a way to detect any hooking of the DirectX functions, since that is what most cheats do.
The problem is that a lot of programs that hook DirectX, such as OBS, Fraps and many others, get their hooks detected too.
So to be able to hook DirectX, you will most probably have to call VirtualProtect. If I could determine which address this is being called from, then I could loop through all the DLLs in memory, find which module it has been called from, and then send the information to the server, perhaps even taking an MD5 hash and sending it to the server for validation.
I could also hook the DirectX functions that the cheats hook and check where those get called from (since most of them use MS Detours).
I looked it up, and apparently you can check the call stack, but every example I found did not seem to help me.
This -getting the caller's address- is not possible in standard C++. And many C++ compilers might optimize some calls (e.g. by inlining them, even when you don't specify inline, or because there is no longer any frame pointer, e.g. the compiler option -fomit-frame-pointer for 32-bit x86 with GCC, or by optimizing a tail call...) to the point that the question might not make any sense.
With some implementations, some C or C++ standard libraries, and some (but not all) compiler options (in particular, don't ask the compiler to optimize too much*), you might get it: e.g. (on Linux) use backtrace from GNU glibc, or I. Taylor's libbacktrace (from inside the GCC implementation), or GCC's return address builtins; see the sketch below.
I don't know how difficult it would be to port these to Windows (perhaps Cygwin did it). The GCC builtins might somehow work, if you don't optimize too much.
Read also about continuations. See also this answer to a related question.
Note *: on Linux, it is better to compile all the code (including external libraries!) with at most g++ -Wall -g -O1: you don't want too much optimization, and you want the debug information (in particular for libbacktrace).
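Here is a minimal Linux/glibc sketch of the backtrace approach (the frame depth and output format are arbitrary choices):
#include <execinfo.h>   // backtrace, backtrace_symbols (glibc)
#include <cstdio>
#include <cstdlib>

void report_caller() {
    void* frames[4];
    int n = backtrace(frames, 4);                  // grab a few return addresses
    char** symbols = backtrace_symbols(frames, n); // best-effort symbolic names
    if (symbols != nullptr) {
        for (int i = 1; i < n; ++i)                // frames[0] is report_caller itself
            std::printf("frame %d: %s\n", i, symbols[i]);
        std::free(symbols);
    }
}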
Raymond Chen's blog 'The Old New Thing' covers using return addresses to make security decisions and why it's a pretty pointless thing:
https://devblogs.microsoft.com/oldnewthing/20060203-00/?p=32403
https://devblogs.microsoft.com/oldnewthing/20040101-00/?p=41223
Basically it's pretty easy to fake (by injecting code or using a manually constructed fake stack to trick you). It's Windows-centric, but the basic concepts are generally applicable.

Is it possible to convert C/C++ source code to assembly?

Is it possible to somehow convert simple C or C++ code (by simple I mean: taking some int as input, printing some simple shapes dependent on that int as output) to assembly language? If there isn't, I'll just do it manually, but since I'm going to be doing it for processors like the Intel 8080, it just seemed a bit tedious. Can you somehow automate the process?
Also, if there is a way, how good (as in: elegant) would the output assembly file source code be when compared to just translating it manually?
Most compilers will let you produce assembly output. For a couple of obvious examples, Clang and gcc/g++ use the -S flag, and MS VC++ uses the -Fa flag to do so.
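For instance, assuming a source file named square.cpp (the output names are the compilers' defaults):
clang++ -S -O2 square.cpp   # writes square.s
g++ -S -O2 square.cpp       # writes square.s
cl /c /FA square.cpp        # writes square.asm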
A few compilers don't support this directly (e.g., if memory serves Watcom didn't). The ones I've seen like this had you produce an object file, and then included a disassembler that would produce an assembly language file from the object file. I don't remember for sure, but it wouldn't surprise me if this is what you'd need to do with the Digital Mars compiler.
To somebody who's accustomed to writing assembly language, the output from most compilers typically tends to look at least somewhat inelegant, especially on a CPU like an x86 that has quite a few registers that are now really general purpose, but have historically had more specific meanings. For example, if some piece of code needs both a pointer and a counter, a person would probably put the pointer in ESI or EDI, and the counter in ECX. The compiler might easily reverse those. That'll work fine, but an experienced assembly language programmer will undoubtedly find it more readable using ESI for the pointer and ECX for the counter.
Take a look at gcc -S:
gcc -S hello.c # outputs hello.s file
Other compilers that maintain at least partial gcc compatibility may also accept this flag. LLVM's clang, for example, does.
Well, yes, there is such a program. It's called a "compiler".
To answer your edit: The elegance of the output depends on the optimization of your compiler. Usually compilers do not generate code we humans would call "elegant".
Most folks here are right, but seem to have missed the note about the 8080 (no wonder, it's not in the title :). However, Google comes to the rescue as always - searching for a compiler for the 8080 produces some nice results like these:
http://www.bdsoft.com/resources/bdsc.html
http://tack.sourceforge.net/
Most of these are pretty old and might be poorly maintained. You might also try the 8085, which is fairly similar.
(by simple I mean: taking some int as input, printing some simple shapes dependent on that int as output) to assembly language?
Looking at the output of an x86 compiler is not going to be very helpful, since inputting and outputting are done by a C or C++ library. With an 8080 there is no such library so you have to develop your own I/O routines for some particular hardware. That's lots and lots of additional work.
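As a hypothetical illustration of what such a routine might look like (the UART addresses and status bit below are made up; a real 8080 board or CP/M system will differ):
/* hypothetical memory-mapped UART registers -- adjust for the real hardware */
volatile unsigned char* const UART_STATUS = (volatile unsigned char*)0xF000;
volatile unsigned char* const UART_DATA   = (volatile unsigned char*)0xF001;

void my_putchar(char c) {
    while ((*UART_STATUS & 0x01) == 0) { }   /* busy-wait until the transmitter is ready */
    *UART_DATA = (unsigned char)c;           /* send the character */
}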

How to do inline assembly in C++ (Visual Studio 2010)

I'm writing a performance-critical, number-crunching C++ project where 70% of the time is used by the 200 line core module.
I'd like to optimize the core using inline assembly, but I'm completely new to this. I do, however, know some x86 assembly syntaxes, including the ones used by GCC and NASM.
All I know:
I have to put the assembler instructions in _asm{} where I want them to be.
Problem:
I have no clue where to start. What is in which register at the moment my inline assembly comes into play?
You can access variables by their name and copy them to registers.
Here's an example from MSDN:
int power2( int num, int power )
{
    __asm
    {
        mov eax, num    ; Get first argument
        mov ecx, power  ; Get second argument
        shl eax, cl     ; EAX = EAX * ( 2 to the power of CL )
    }
    // Return with result in EAX
}
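A small usage sketch (values are arbitrary): whatever the __asm block leaves in EAX becomes the return value.
int sixteen = power2( 2, 3 );   // shifts 2 left by 3 bits, i.e. 16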
Using C or C++ in ASM blocks might also be interesting for you.
The Microsoft compiler is very poor at optimisation when inline assembly gets involved. It has to back up registers, because if you use eax it won't move its own use of eax to another free register; it will continue using eax. The GCC inline assembler is far more advanced on this front.
To get round this, Microsoft started offering intrinsics. These are a far better way to do your optimisation, as they allow the compiler to work with you. As Chris mentioned, inline assembly doesn't work under x64 with the MS compiler either, so on that platform you REALLY are better off just using the intrinsics.
They are easy to use and give good performance. I will admit I am often able to squeeze a few more cycles out of the code by using an external assembler, but intrinsics are bloody good for the productivity improvement they provide.
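A minimal sketch of the intrinsic style being recommended here, using SSE intrinsics from <immintrin.h> (available in MSVC, GCC and clang on x86/x64; the routine itself is purely illustrative):
#include <immintrin.h>

void add4(const float* a, const float* b, float* out) {
    __m128 va = _mm_loadu_ps(a);                // load four unaligned floats
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));     // out[i] = a[i] + b[i] for i = 0..3
}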
Nothing is in the registers when the _asm block is executed. You need to move stuff into the registers. If there is a variable 'a', then you would need to:
__asm {
    mov eax, [a]   ; load the value of 'a' into EAX
}
It is worth pointing out that VS2010 comes with Microsoft's assembler (MASM). Right-click on a project, go to the build rules, and turn on the assembler build rules; the IDE will then process .asm files.
This is a somewhat better solution, as VS2010 supports 32-bit AND 64-bit projects and the __asm keyword does NOT work in 64-bit builds. You MUST use an external assembler for 64-bit code :/
I prefer writing entire functions in assembly rather than using inline assembly. This allows you to swap out the high level language function with the assembly one during the build process. Also, you don't have to worry about compiler optimizations getting in the way.
Before you write a single line of assembly, print out the assembly language listing for your function. This gives you a foundation to build upon or modify. Another helpful tool is the interweaving of assembly with source code. This will tell you how the compiler is coding specific statements.
If you need to insert inline assembly for a large function, make a new function for just the code that you need to inline. Again, replace it with the C++ or assembly version at build time.
These are my suggestions, Your Mileage May Vary (YMMV).
Go for the low hanging fruit first...
As others have said, the Microsoft compiler is pretty poor at optimisation. You may be able to save yourself a lot of effort just by investing in a decent compiler, such as Intel's ICC, and re-compiling the code "as is". You can get a 30-day free evaluation license from Intel and try it out.
Also, if you have the option to build a 64-bit executable, then running in 64-bit mode can yield a 30% performance improvement, due to the 2x increase in the number of available registers.
I really like assembly, so I'm not going to be a nay-sayer here. It appears that you've profiled your code and found the 'hotspot', which is the correct way to start. I also assume that the 200 lines in question don't use a lot of high-level constructs like vector.
I do have to give one bit of warning: if the number-crunching involves floating-point math, you are in for a world of pain, specifically a whole set of specialized instructions, and a college term's worth of algorithmic study.
All that said: if I were you, I'd step through the code in question in the VS debugger, using the Disassembly view. If you feel comfortable reading the code as you go along, that's a good sign. After that, do a Release compile (Debug turns off optimization) and generate an ASM listing for that module. Then if you think you see room for improvement...you have a place to start. Other people's answers have linked to the MSDN documentation, which is really pretty skimpy but still a reasonable start.

Register allocation rules in code generated by major C/C++ compilers

I remember some rules from a while ago (pre-32-bit Intel processors), when it was quite frequent (at least for me) to have to analyze the assembly output generated by C/C++ compilers (in my case, Borland/Turbo at that time) to find performance bottlenecks and to safely mix assembly routines with C/C++ code: things like using the SI register for the this pointer, AX being used for return values, which registers should be preserved when an assembly routine returns, etc.
Now I was wondering if there's some reference for the more popular C/C++ compilers (Visual C++, GCC, Intel...) and processors (Intel, ARM, ...), and if not, where to find the pieces to create one. Ideas?
You are asking about "application binary interface" (ABI) and calling conventions. These are typically set by operating systems and libraries, and enforced by compilers and linkers. Google for "ABI" or "calling convention." Some starting points from Wikipedia and Debian for ARM.
Agner Fog's "Calling Conventions" document summarizes, amongst other things, the Windows and Linux 64 and 32-bit ABIs: http://www.agner.org/optimize/calling_conventions.pdf. See Table 4 on p.10 for a summary of register usage.
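A quick way to see those conventions in action is to compile a trivial function and compare the listing against Agner Fog's Table 4 (the commands below assume the System V and Microsoft x64 ABIs respectively):
// g++ -O2 -S sum3.cpp      -> on Linux x86-64, a, b, c arrive in edi, esi, edx
// cl /O2 /c /FA sum3.cpp   -> on Windows x64, a, b, c arrive in ecx, edx, r8d
int sum3(int a, int b, int c) {
    return a + b + c;
}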
One warning from personal experience: don't embed assumptions about the ABI in inline assembly. If you write a function in inline assembly that assumes return and/or parameter transfer in particular registers (e.g. eax, rdi, rsi), it will break if/when the function is inlined by the compiler.
Open Watcom C/C++ compiler supports two calling conventions, register-based (default) and stack-based (very close to what other compilers use). User's Guide for this compiler describes them both and is available for free online, together with the compiler itself. You may find these topics in the User's Guide especially helpful:
10.4.1 Passing Arguments Using Register-Based Calling Conventions
10.4.6 Using Stack-Based Calling Conventions
10.5 Calling Conventions for 80x87-based Applications
Well, today, if optimisation is turned on, there aren't any. But GCC allows you to declare that your assembly instruction should use a particular variable, regardless of whether it's in a register or not, or even to force GCC to put that variable into a register usable by your instruction. You can also declare which registers your inline assembly block reserves for itself (so the compiler should generate appropriate save/restore code around your inline piece, if needed).
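A minimal GCC extended-asm sketch of those two mechanisms: operand constraints let the compiler choose (or pin) the registers for your variables, and the clobber list declares registers the block reserves for itself (the instructions themselves are just illustrative):
long add_and_clobber(long a, long b) {
    long result;
    asm("lea (%1,%2), %0\n\t"       // result = a + b, registers chosen by the compiler
        "xor %%r10d, %%r10d"        // the block also trashes r10 ...
        : "=r"(result)              // output: any general-purpose register
        : "r"(a), "r"(b)            // inputs: any general-purpose registers
        : "r10");                   // ... so it is declared as clobbered
    return result;
}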
I believe, but am by no means sure, that GCC uses the Itanium ABI for most of its functionality; the incompatibilities between it and the ABI it uses are documented.