Compiler optimization of function parameters - c++

Function parameters are placed on the stack, but compilers can optimize this task by the use of optional registers. It would make sense that this optimization will kick in if there are only 1-2 parameters, and not when there are 256 (not that one would want to have the max number of parameters).
How can one find out the parameter limit (number of parameters) for a certain compiler (such as gcc) where one can be sure that this optimization will be used?

Function parameters are placed on the stack, but compilers can optimize this task by the use of optional registers.
As FrankH says in his comments and as I'm going to say in my answer, the application binary interface for the system in question determines how arguments are passed to functions - this is called the calling convention for that platform.
To complicate matters, 32-bit x86 actually has several. This is historical and comes from the fact that when 32-bit Windows arrived, everyone went crazy doing different things.
So, yes, you can "optimise" by writing function calls in such a way, but no, you shouldn't. You should follow the standards for your platform. Because the honest truth is, the speed of stack access probably isn't slowing your code down to that great an extent that you need to be binary-incompatible from everyone else on your system.
Why the need for ABIs/standard calling conventions? Well, in terms of using the processor registers, stack etc, applications must agree on what means what and where it should go. If one function decided all its arguments were in registers and another that some were on the stack, how would they be interoperable? Moreover, you might come across the term scratch registers to mean those registers you don't have to restore. What happens if you call a function expecting it to leave some registers alone?
Anyway, as for what you asked for, here's some ABI documentation:
The difference between x86 and x64 on windows.
x86_64 ABI used for Unix-like platforms.
Wikipedia's x86 calling conventions.
A document on compiler calling conventions.
The last one is my favourite. To quote it:
In the days of the old DOS operating system, it was often possible to combine development
tools from different vendors with few compatibility problems. With 32-bit Windows, the
situation has gone completely out of hand. Different compilers use different data
representations, different function calling conventions, and different object file formats.
While static link libraries have traditionally been considered compiler-specific, the
widespread use of dynamic link libraries (DLL's) has made the distribution of function
libraries in binary form more common.
So whatever you're trying to do with optimising via modifying the function calling method, don't. Find another way to optimise. Profile your code. Study the optimisation levels your compiler offers (-O1, -O2 and so on) if you think it helps, and dump the assembly to check if the speed is really that crucial.
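To make the "dump the assembly" advice concrete, here is a minimal sketch. The function name sum8 and the file name args.cpp are made up for illustration. With gcc, something like `g++ -O2 -S args.cpp` writes the generated assembly to args.s, where you can see which arguments the ABI puts in registers and which on the stack.

```cpp
// Hypothetical example function; the name sum8 is ours, not from any library.
//
// Under the System V x86-64 ABI (Linux, macOS and similar), the first six
// integer arguments are passed in rdi, rsi, rdx, rcx, r8 and r9, so g and h
// are passed on the stack. Under the Windows x64 convention only the first
// four integer arguments (rcx, rdx, r8, r9) are passed in registers.
int sum8(int a, int b, int c, int d, int e, int f, int g, int h) {
    return a + b + c + d + e + f + g + h;
}
```

So the "limit" the question asks about comes from the platform ABI rather than from a compiler heuristic. If the compiler inlines the call, no argument passing happens at all.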

For publicly visible functions, this is documented in the ABI standard. For functions that are not referenceable from the outside, all bets are off anyway.

You would have to read the fine manual for the compiler. If you were lucky, you would find it there in a description of function calling conventions. Otherwise, for an OSS compiler such as gcc you would probably have to read its source-code.

Related

When to use `__fastcall` calling convention

We have a lot of VCL-based applications written in C++. All the VCL methods (under the __published class modifier) require the __fastcall calling convention. However, for whatever reason, developers have been adding __fastcall to other non-VCL functions which are private, protected, or public.
Based on this article, this makes no sense to me as it unnecessarily complicates the code and might even be a performance hit (probably negligible though). Nonetheless, after suggesting we remove it in some places, I was told "we've always done it that way, so be consistent" and "it's just a question of style". I think it actually confuses people if it isn't necessary, so it's bad practice.
My question is, when is it appropriate to use the __fastcall calling convention?
A good optimizing compiler that supports whole-program optimization (aka link-time code generation) doesn't care about the calling convention for internal functions*. It will use whatever calling convention is the fastest/best in that situation, including inventing a custom calling convention or inlining the function altogether.
The only time a calling convention matters is for functions that form part of a public API. And in that case, __fastcall is probably a poor choice. Use a more standard calling convention like __cdecl or __stdcall, widely supported by Windows toolchains. __fastcall is an especially poor choice for interoperability, since it was never standardized and therefore is implemented differently by different vendors. This becomes a nightmare the minute you try to use your DLL with an application compiled with a different toolchain, much less in a different language.
Except, of course, when you're working with the VCL APIs that are documented as requiring the __fastcall convention. For example, the documentation says that member functions for VCL classes use the __fastcall convention, so you need to use the same calling convention in all of your overrides.
Or when you need caller clean-up, e.g., to support variadic arguments. Then you need __cdecl.
If you do want to use a particular calling convention for internal functions (i.e., those that are not part of a public API), you should really prefer to specify that globally with a compiler switch. This will then specify the calling convention to be used for all functions whose prototypes do not specifically override it. This has several advantages. For one, it avoids cluttering your code with a bunch of calling-convention boilerplate. Second, it allows you to easily make changes later (for example, if profiling reveals that your original choice of calling convention is a bottleneck that the optimizer is unable to resolve).
Anecdotally, __stdcall is superior to __cdecl because of a reduction of binary size, made possible by the fact that the callee adjusts the stack instead of the caller (and there are fewer callees than callers), but as the article you linked mentions, __fastcall may not always be faster than __stdcall. The article doesn't go into any technical details, but the issue is basically the extremely limited number of registers available on 32-bit x86. Passing values in registers instead of on the stack is generally a performance win, but can become a pessimization in certain cases when the function is large and runs out of registers, forcing it to spill the arguments back to the stack, doing double work (which incurs a speed penalty) and further inflating the code (which incurs a cache penalty and, indirectly, a speed penalty). It is also a pessimization in cases where the values are already on the stack, but need to be moved into registers in order to make a function call, hindering the optimization potential in both places.
Do note that this all becomes irrelevant when you start targeting 64-bit x86 architectures. The calling convention is finally standardized there for all Windows applications, regardless of vendor. The x64 calling convention is somewhat akin to __fastcall, but works much better there because of the larger number of available registers. The optimizer is not required to go through as many contortions to free up registers for passing parameters as it is on x86-32.
* Note that when I say "internal" functions here, I refer not to a particular access modifier, but rather to functions that are within a single compiland and/or those that are never called into by external code.
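As a rough sketch of the advice above (annotate only the public API, leave internal functions to the optimizer), something like the following could work. MYLIB_CALL, mylib_blend and clamp_to_byte are hypothetical names invented for this example. The macro expands to nothing outside 32-bit MSVC builds, where the keyword would be unnecessary or unavailable.

```cpp
// MYLIB_CALL is a hypothetical macro name used only for this illustration.
// It applies __stdcall on 32-bit MSVC builds and expands to nothing elsewhere.
#if defined(_MSC_VER) && defined(_M_IX86)
#  define MYLIB_CALL __stdcall
#else
#  define MYLIB_CALL
#endif

namespace {
// Internal helper: no calling-convention annotation. The compiler is free
// to inline it or pass its argument however it likes.
int clamp_to_byte(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}
}  // namespace

// Public entry point: the convention is spelled out because code built
// with a different toolchain may call it.
extern "C" int MYLIB_CALL mylib_blend(int a, int b) {
    return clamp_to_byte((a + b) / 2);
}
```

If you would rather change the default for every unannotated function, MSVC has the /Gd (__cdecl), /Gr (__fastcall) and /Gz (__stdcall) switches, which keeps the source free of annotations.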

Can I control what gets copied into CPU cache in C++?

I read about cache optimization in C++ and the mechanisms, modern CPUs use to predict what data is needed next, to copy that into cache. But is there a direct way in C++ for the programmers, who know what actually is needed next, to determine what data gets copied into CPU cache?
This varies with the processor and compiler you're using.
Assuming you're using an Intel x86/x64 or compatible (e.g., AMD) processor, the processor provides a number of prefetch instructions, and most compilers include intrinsics to invoke them. With VC++ you use _mm_prefetch, _m_prefetch or _m_prefetchw. With gcc you use __builtin_prefetch.
Likewise, VC++ on an ARM provides a __prefetch intrinsic for the same purpose (no, I really don't know why they couldn't have used the same name as on x86; the signature and effect appear identical).
Most other reasonably modern, higher-end processors probably provide similar instructions, and I'd guess most compilers provide intrinsics to make them available, but just as with these, the names of the intrinsics will vary. For that matter, even though the functions are intrinsic to the compiler, most require that you include some header to use them -- and the name of the header will also vary.
The prefetch intrinsics Jerry provided would do the trick. Keep in mind that there are several flavors, controlled by an argument to the function, that determine which levels of the cache (if any) are used to keep the line. An NTA (non-temporal) prefetch, for example, would not pollute the caches, but rather provide the line only for immediate use (it is meant for cases where you're going to use the data soon and only once).
Also keep in mind that these instructions are basically hints to the CPU (which also does quite well by itself trying to guess which lines to prefetch). As such, they are not guaranteed to work, they might fail in many cases (if the memory subsystem is loaded, or the address got swapped out of memory).
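Here is a minimal sketch of how the intrinsics mentioned above might be wrapped and used. The names prefetch_hint, sum_indexed and kAhead are invented for this example, and the look-ahead distance of 8 is an arbitrary assumption that would need tuning by measurement. It prefetches for an indexed (gather-style) access pattern, which hardware prefetchers tend to predict poorly.

```cpp
#include <cstddef>
#include <vector>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA
#endif

// Hypothetical wrapper: issues a non-temporal read prefetch where the
// compiler offers one, and does nothing otherwise.
inline void prefetch_hint(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p, 0, 0);  // rw = 0 (read), locality = 0 (no temporal locality)
#elif defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_NTA);
#else
    (void)p;  // no known intrinsic: the hint is simply dropped
#endif
}

// Sums data[idx[i]] for every i. Every index in idx is assumed to be
// a valid position in data.
long long sum_indexed(const std::vector<int>& data,
                      const std::vector<std::size_t>& idx) {
    const std::size_t kAhead = 8;  // assumed look-ahead distance, not tuned
    long long total = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i + kAhead < idx.size()) {
            // Hint at the element we expect to need a few iterations from now.
            prefetch_hint(&data[idx[i + kAhead]]);
        }
        total += data[idx[i]];
    }
    return total;
}
```

Since these are only hints, the result is the same with or without them. Whether they help at all is something only a benchmark on the target machine can tell you.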

Plainly and simply, why do we use _stdcall?

I've come across calling conventions whilst studying states for game making with C++.
In a previous question someone stated that MSDN doesn't explain _stdcall very well - I agree.
What are the primary purposes for calling conventions like _stdcall? Does it matter what order the arguments are placed on the stack? How does it reduce the size of the code in X86 (as someone else stated)?
The reason for having some calling convention is pretty simple: so that the caller and the callee agree on how things will work. Without it, the caller doesn't know where to put arguments when it's calling a particular function.
As for why Microsoft decided on the specific details of _stdcall, that's largely historical. On MS-DOS, all calls were register based, so all OS calls required assembly language, or strange extensions to most higher-level languages.
When they first did Windows, they used the cdecl calling convention, mostly because that's what the compiler did by default. At least according to rumor, shortly before they got ready to release Windows 1.0, they switched to the Pascal calling convention because it was sufficiently more efficient that (among other things) it allowed Windows to fit on one fewer floppy disc. Regardless of the precise details, the Pascal calling convention did make code a little smaller, because the called function cleaned up the arguments from the stack instead of needing to clean them up everywhere the function was called. For any function that was called from at least 2 different places, that's a win (and it's a tie anywhere else).
Then they started work on OS/2, and invented yet another calling convention (syscall).
Then, of course, came Win32. There wasn't really a lot wrong with syscall from a technical viewpoint, but (I'd guess) everything associated with OS/2 was considered tainted, so syscall had to go. The result was something just different enough to justify a new name. In fairness, that's a little bit of an exaggeration. They did make one truly useful addition: they encoded the number of bytes of arguments into each function name, so if (for example) you supplied an incorrect prototype for a function, the code wouldn't link rather than ending up with a mismatch between caller and callee that could lead to much more serious problems.
For the most part, it really comes back to the original point though: the exact details of the calling convention don't matter all that much, as long as you don't make a complete mess of it. Most of what matters is that the caller and callee agree on the same thing, so if a compiler knows what parameters a function accepts, it knows how to generate code to get those parameters to the function correctly (and, likewise, they both agree on how stack cleanup is handled, etc.)
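To illustrate the cleanup and name-decoration points, here is a small sketch. DEMO_CDECL, DEMO_STDCALL and the function names are made up for this example. The keywords only apply on 32-bit MSVC builds, so the macros expand to nothing elsewhere and the code still compiles. The comments describe what a 32-bit x86 build is expected to emit.

```cpp
// Hypothetical macros for illustration. They select the convention on
// 32-bit MSVC builds and expand to nothing on other targets.
#if defined(_MSC_VER) && defined(_M_IX86)
#  define DEMO_CDECL   __cdecl
#  define DEMO_STDCALL __stdcall
#else
#  define DEMO_CDECL
#  define DEMO_STDCALL
#endif

// cdecl: every call site pops the arguments itself after the call
// (typically "add esp, 8" for two ints). Decorated C name: _add_cdecl
extern "C" int DEMO_CDECL add_cdecl(int a, int b) { return a + b; }

// stdcall: the callee pops its own arguments once, with "ret 8".
// Decorated C name: _add_stdcall@8 (8 = bytes of arguments), so a
// prototype with the wrong argument size fails at link time.
extern "C" int DEMO_STDCALL add_stdcall(int a, int b) { return a + b; }

// Two call sites. With cdecl the cleanup code would be repeated at each
// caller. With stdcall it exists once, inside the callee.
int call_both(int x) {
    return add_cdecl(x, 1) + add_stdcall(x, 2);
}
```

The size saving is a few bytes per call site, which adds up when a function is called from many places.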

MS Visual C++: When should you care about using calling conventions?

In C/C++ (specifically, I'm using MSVS), in what situation would one ever need to worry about specifying a calling convention for a function definition? Are they ever important? Isn't the compiler capable of choosing the optimal convention when necessary (i.e. fastcall, etc.)?
Maybe my understanding is lacking, but I just do not see when there would be a case where the programmer would need to care about things like the order that the arguments are placed on the stack and so forth. I also do not see why the compiler's optimization would not be able to choose whatever scheme would work best for that particular function. Any knowledge anyone could provide me with would be great. Thanks!
In general terms, the calling convention is important when you're integrating code that's being compiled by different compilers. For example, if you're publishing a DLL that will be used by your customers, you will want to make sure that the functions you export all have a consistent, expected calling convention.
You are correct that within a single program, the compiler can generally choose which calling convention to use for each function (and the rules are usually pretty simple).
You do not need to care for 64-bit applications since there is only one calling convention.
You do need to care for 32-bit applications in the following cases:
You interact with 3rd party libraries and the headers for these libraries did not declare the correct calling convention.
You are creating a library or DLL for someone else to use. You need to decide on a calling convention so that other code would use the correct calling convention when calling your code.

Register allocation rules in code generated by major C/C++ compilers

I remember some rules from some time ago (pre-32-bit Intel processors), when it was quite common (at least for me) to have to analyze the assembly output generated by C/C++ compilers (in my case, Borland/Turbo at that time) to find performance bottlenecks, and to safely mix assembly routines with C/C++ code. Things like using the SI register for the this pointer, AX being used for return values, which registers should be preserved when an assembly routine returns, etc.
Now I was wondering if there's some reference for the more popular C/C++ compilers (Visual C++, GCC, Intel...) and processors (Intel, ARM, ...), and if not, where to find the pieces to create one. Ideas?
You are asking about "application binary interface" (ABI) and calling conventions. These are typically set by operating systems and libraries, and enforced by compilers and linkers. Google for "ABI" or "calling convention." Some starting points from Wikipedia and Debian for ARM.
Agner Fog's "Calling Conventions" document summarizes, amongst other things, the Windows and Linux 64 and 32-bit ABIs: http://www.agner.org/optimize/calling_conventions.pdf. See Table 4 on p.10 for a summary of register usage.
One warning from personal experience: don't embed assumptions about the ABI in inline assembly. If you write a function in inline assembly that assumes return and/or parameter transfer in particular registers (e.g. eax, rdi, rsi), it will break if/when the function is inlined by the compiler.
Open Watcom C/C++ compiler supports two calling conventions, register-based (default) and stack-based (very close to what other compilers use). User's Guide for this compiler describes them both and is available for free online, together with the compiler itself. You may find these topics in the User's Guide especially helpful:
10.4.1 Passing Arguments Using Register-Based Calling Conventions
10.4.6 Using Stack-Based Calling Conventions
10.5 Calling Conventions for 80x87-based Applications
Well, today if optimisation is turned on, there aren't any. But GCC allows you to declare that your assembly instruction should use a particular variable regardless of whether it's in a register or not, or even to force GCC to put that variable into a register usable with your instruction. You can also declare which registers your inline assembly block reserves for itself (so the compiler should generate appropriate save/restore code around your inline piece, if needed).
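A minimal sketch of what that looks like with GCC-style extended asm, assuming an x86 or x86-64 target (add_with_asm is a made-up name, and other targets fall back to plain C++). Rather than assuming the arguments live in particular registers, the constraints ask the compiler to pick registers and substitute them into the instruction, so the code keeps working if the function gets inlined.

```cpp
// Hypothetical example: adds b to a using one x86 instruction when
// GCC-style extended asm is available, plain C++ otherwise.
int add_with_asm(int a, int b) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __asm__("addl %1, %0"  // AT&T syntax: %0 = %0 + %1
            : "+r"(a)      // operand 0: read and written, any general register
            : "r"(b)       // operand 1: read only, any general register
            : "cc");       // the instruction modifies the flags
    return a;
#else
    return a + b;  // no extended asm on this compiler or target
#endif
}
```

Registers that the asm uses behind the compiler's back would go in the clobber list (the part after the last colon), so the compiler knows to save and restore them.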
I believe, but am by no means sure, that GCC uses the Itanium ABI for most of its functions; the incompatibilities between it and the ABI it uses are documented.