When to use `__fastcall` calling convention - c++

We have a lot of VCL-based applications written in C++. All the VCL methods (those under the __published access specifier) require the __fastcall calling convention. However, for whatever reason, developers have been adding __fastcall to other non-VCL functions that are private, protected, or public.
Based on this article, this makes no sense to me, as it unnecessarily complicates the code and might even be a performance hit (probably negligible, though). Nonetheless, after suggesting we remove it in some places, I was told that we've always done it that way, so be consistent, and that it's just a question of style. I think it actually confuses people if it isn't necessary, so it's bad practice.
My question is, when is it appropriate to use the __fastcall calling convention?

A good optimizing compiler that supports whole-program optimization (aka link-time code generation) doesn't care about the calling convention for internal functions*. It will use whatever calling convention is the fastest/best in that situation, including inventing a custom calling convention or inlining the function altogether.
The only time a calling convention matters is for functions that form part of a public API. And in that case, __fastcall is probably a poor choice. Use a more standard calling convention like __cdecl or __stdcall, widely supported by Windows toolchains. __fastcall is an especially poor choice for interoperability, since it was never standardized and therefore is implemented differently by different vendors. This becomes a nightmare the minute you try to use your DLL with an application compiled with a different toolchain, much less in a different language.
Except, of course, when you're working with the VCL APIs that are documented as requiring the __fastcall convention. For example, the documentation says that member functions for VCL classes use the __fastcall convention, so you need to use the same calling convention in all of your overrides.
Or when you need caller clean-up, e.g., to support variadic arguments. Then you need __cdecl.
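For illustration, here is a minimal sketch of both cases. The VCL half only compiles under C++Builder, where TForm and TObject come from the VCL headers; TMainForm, ButtonClick, and log_message are hypothetical names.

#include <cstdarg>
#include <cstdio>

class TMainForm : public TForm          // TForm/TObject come from the VCL headers (C++Builder only)
{
__published:
    void __fastcall ButtonClick(TObject *Sender);   // must match the VCL's __fastcall
};

int __cdecl log_message(const char* format, ...)    // variadic => caller clean-up => __cdecl
{
    std::va_list args;
    va_start(args, format);
    int written = std::vprintf(format, args);
    va_end(args);
    return written;
}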
If you do want to use a particular calling convention for internal functions (i.e., those that are not part of a public API), you should really prefer to specify that globally with a compiler switch. This will then specify the calling convention to be used for all functions whose prototypes do not specifically override it. This has several advantages. For one, it avoids cluttering your code with a bunch of calling-convention boilerplate. Second, it allows you to easily make changes later (for example, if profiling reveals that your original choice of calling convention is a bottleneck that the optimizer is unable to resolve).
Anecdotally, __stdcall is superior to __cdecl because of a reduction in binary size, made possible by the fact that the callee adjusts the stack instead of the caller (and there are fewer callees than call sites), but as the article you linked mentions, __fastcall may not always be faster than __stdcall. The article doesn't go into any technical details, but the issue is basically the extremely limited number of registers available on 32-bit x86. Passing values in registers instead of on the stack is generally a performance win, but it can become a pessimization when the function is large and runs out of registers, forcing it to spill the arguments back to the stack, doing double work (which incurs a speed penalty) and further inflating the code (which incurs a cache penalty and, indirectly, a speed penalty). It is also a pessimization in cases where the values are already on the stack but need to be moved into registers in order to make a function call, hindering the optimization potential in both places.
Do note that this all becomes irrelevant when you start targeting 64-bit x86 architectures. The calling convention is finally standardized there for all Windows applications, regardless of vendor. The x64 calling convention is somewhat akin to __fastcall, but works much better there because of the larger number of available registers. The optimizer is not required to go through as many contortions to free up registers for passing parameters as it is on x86-32.
* Note that when I say "internal" functions here, I refer not to a particular access modifier, but rather to functions that are within a single compiland and/or those that are never called into by external code.

Related

Designing Interfaces in c++

I am developing an interface that can be loaded dynamically. It should also be compiler independent, so I wanted to export interfaces.
I am now facing the following problems:
Problem 1: The interface functions take some custom data types (basically classes or structures) as in/out parameters. I want to initialise members of these classes with default values using constructors. If I do this, it is not possible to load my library dynamically and it becomes compiler dependent. How do I solve this?
Problem 2: Some interfaces return lists (or maps) of elements to the client. I am using standard containers for this purpose, but this is once again compiler dependent (and sometimes even compiler-version dependent).
Thanks.
Code compiled differently can only work together if it adopts the same Application Binary Interface (ABI) for the set of types used for parameters and return values. ABIs are significant at a much deeper level - name mangling, virtual dispatch tables, etc. - but my point is that if your compilers support an ABI that allows calling functions with simple types, you can at least think about hacking together some support for more complex types, like compiler-specific implementations of Standard containers and user-defined types.
You'll have to research what ABI support your compilers provide, and infer what you can about what they'll continue to provide.
If you want to support other types beyond what the relevant ABI standardises, options include:
use simpler types to expose internals of more complex types
pass [const] char* and size_t extracted via my_std_string.data() or &my_std_string[0] and my_std_string.size(); similarly for std::vector (see the sketch after this list)
serialise the data and deserialise it using the data structures of the receiver (can be slow)
provide a set of function pointers to simple accessor/mutator functions implemented by the object that created the data type
e.g. the way the classic C qsort function accepts a pointer to an element comparison function
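To make the second of those options concrete, here is a minimal sketch of handing a std::string and a std::vector across the boundary as raw pointer + length; set_name and set_values are hypothetical exported functions.

#include <cstddef>
#include <string>
#include <vector>

extern "C" void set_name(const char* name, std::size_t length);
extern "C" void set_values(const double* values, std::size_t count);

// Client-side helper that unpacks the containers before crossing the boundary.
void send_to_api(const std::string& name, const std::vector<double>& values)
{
    set_name(name.data(), name.size());
    set_values(values.empty() ? nullptr : values.data(), values.size());
}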
As I usually have a multithreading focus, I'm mostly going to bark about your second problem.
You already realized that passing elements of a container over an API seems to be compiler dependent. It's actually worse: it's header-file and C++-library dependent, so at least for Linux you're already stuck with two different sets: libstdc++ (originating from gcc) and libc++ (originating from clang).
Because part of the containers is header files and part is library code, getting things ABI-independent is close to impossible.
My bigger worry is that you actually thought of passing container elements around. This is a huge threadsafety issue: the STL containers are not threadsafe - by design.
By passing references over the interface, you are passing "pointers to encapsulated knowledge" around - the users of your API could make assumptions of your internal structures and start modifying the data pointed to. That is usually already really bad in a singlethreaded environment, but gets worse in a multithreaded environment.
Secondly, pointers you provided could get stale, not good either.
Make sure to return copies of your inner knowledge to prevent user modification of your structures.
Passing things const is not enough: const can be cast away and you still expose your innards.
So my suggestion: hide the data types, only pass simple types and/or structs that you fully control (i.e. are not dependent on STL or boost).
Designing an API with the widest ABI compatibility is an extremely complex subject, even more so when C++ is involved instead of C.
Yet some issues are more theoretical than practical and aren't really as bad as they sound. For example, in theory, calling conventions and structure padding/alignment sound like they could be major headaches. In practice they aren't so much, and you can even resolve such issues in hindsight by specifying additional build instructions to third parties or decorating your SDK functions with macros indicating the appropriate calling convention. By "not so bad" here, I mean that they can trip you up, but they won't have you going back to the drawing board and redesigning your entire SDK in response.
The "practical" issues I want to focus on are issues that can have you revisiting the drawing board and redoing the entire SDK. My list is also not exhaustive, but are some of the ones I think you should really keep in mind first.
You can also treat your SDK as consisting of two parts: a dynamically-linked part that actually exports functionality whose implementation is hidden from clients, and a statically (internally) linked convenience library part that adds C++ wrappers on top. If you treat your SDK as having these two distinct parts, you're allowed a lot more liberty in the statically-linked library to use a lot more C++ mechanisms.
So, let's get started with those practical headache inducers:
1. The binary layout of a vtable is not necessarily consistent across compilers.
This is, in my opinion, one of the biggest gotchas. We're usually looking at 2 main ways to access functionality from one module to another at runtime: function pointers (including those provided by dylib symbol lookup) and interfaces containing virtual functions. The latter can be so much more convenient in C++ (both for implementor and client using the interface), yet unfortunately using virtual functions in an API that aims to be binary compatible with the widest range of compilers is like playing minesweeper through a land of gotchas.
I would recommend avoiding virtual functions outright for this purpose unless your team consists of minesweeper experts who know all of these gotchas. It's useful to try to fall in love with C again for those public interface parts and start building a fondness for these kinds of interfaces consisting of function pointers:
struct Interface
{
    void* opaque_private_data;

    /* the '...' stands in for whatever parameter list each operation needs */
    void (*func1)(struct Interface* self, ...);
    void (*func2)(struct Interface* self, ...);
    void (*func3)(struct Interface* self, ...);
};
These present far fewer gotchas and are nowhere near as fragile against changes (ex: you're perfectly allowed to do things like add more function pointers to the bottom of the structure without affecting ABI).
2. Stub libs for dylib symbol lookup are linker-specific (as are all static libs in general).
This might not seem like a big deal until combined with #1. When you toss out virtual functions for the purpose of exporting interfaces, then the next big temptation is to often export whole classes or select methods through a dylib.
Unfortunately doing this with manual symbol lookup can become very unwieldy very quickly, so the temptation is to often do this automatically by simply linking to the appropriate stub.
Yet this too can become unwieldy when your goal is to support as many compilers/linkers as possible. In such a case, you may have to possess many compilers and build and distribute different stubs for each possibility.
So this can kind of push you into a corner where it's no longer very practical to export class definitions. At this point you might simply export free-standing functions with C linkage (to avoid C++ name mangling, which is another potential source of headaches).
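As a minimal sketch of that approach (the macro and function names are hypothetical), exporting free-standing functions with C linkage might look like this:

#ifdef _WIN32
#  define SDK_EXPORT extern "C" __declspec(dllexport)
#else
#  define SDK_EXPORT extern "C"
#endif

// C linkage keeps the symbol names unmangled, so any compiler can look them up.
SDK_EXPORT int  sdk_initialize(void);
SDK_EXPORT void sdk_shutdown(void);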
One of the things that should be obvious already is that we're getting nudged more and more towards favoring a C or C-like API if our goal is universal binary compatibility without opening up too many cans of worms.
3. Different modules have 'different heaps'.
If you allocate memory in one module and try to deallocate it in another, then you're trying to free memory from a mismatching heap and will invoke undefined behavior.
Even in plain old C, it's easy to forget this rule and malloc in one exported function only to return a pointer to it with the expectation that the client accessing the memory from a different module will free it when done. This once again invokes undefined behavior, and we have to export a second function to indirectly free the memory from the same module that allocated it.
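Here is a minimal sketch of that pairing, with hypothetical names; the module that allocates is also the one that frees:

#include <cstdlib>
#include <cstring>

extern "C" char* sdk_copy_message(const char* text)
{
    // allocated from this module's heap...
    char* buffer = static_cast<char*>(std::malloc(std::strlen(text) + 1));
    if (buffer)
        std::strcpy(buffer, text);
    return buffer;
}

extern "C" void sdk_free_message(char* buffer)
{
    std::free(buffer);   // ...so it must be handed back here to be freed
}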
This can become a much bigger gotcha in C++ where we often have class templates that have internal linkage that implicitly do memory management. For example, even if we roll our own std::vector-like sequence like List<T>, we can run into a scenario where a client creates a list, passes it to our API by reference where we use functions that can allocate/deallocate memory (like push_back or insert) and butt heads with this mismatching heap/free store issue. So even this hand-rolled container should ensure that it allocates and deallocates memory from the same central location if it's going to be passed around across modules, and placement new will become your friend when implementing such containers.
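A minimal sketch of what that might look like, assuming hypothetical exported allocation functions sdk_alloc/sdk_free so that every allocation and deallocation happens against the same heap (copy/move support and error handling omitted for brevity):

#include <cstddef>
#include <new>

extern "C" void* sdk_alloc(std::size_t bytes);
extern "C" void  sdk_free(void* p);

template <typename T>
class List
{
public:
    List() = default;
    List(const List&) = delete;             // copying omitted in this sketch
    List& operator=(const List&) = delete;

    ~List()
    {
        for (std::size_t i = 0; i < count_; ++i)
            data_[i].~T();                   // destroy elements in place
        if (data_)
            sdk_free(data_);                 // memory goes back to the SDK heap
    }

    void push_back(const T& value)
    {
        if (count_ == capacity_)
            grow();
        new (data_ + count_) T(value);       // placement new into SDK-owned storage
        ++count_;
    }

private:
    void grow()
    {
        std::size_t new_capacity = capacity_ ? capacity_ * 2 : 8;
        T* new_data = static_cast<T*>(sdk_alloc(new_capacity * sizeof(T)));
        for (std::size_t i = 0; i < count_; ++i)
        {
            new (new_data + i) T(data_[i]);  // copy elements into the new storage
            data_[i].~T();
        }
        if (data_)
            sdk_free(data_);
        data_ = new_data;
        capacity_ = new_capacity;
    }

    T*          data_     = nullptr;
    std::size_t count_    = 0;
    std::size_t capacity_ = 0;
};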
4. Passing/returning C++ standard objects is not ABI-compatible.
This includes C++ standard containers, as you have already guessed. There's no really practical way to ensure that two compilers will use compatible representations of something like std::vector when they each include <vector>. So passing/returning such standard objects whose representation is outside of your control is generally out of the question if you're targeting wide binary compatibility.
These don't even necessarily have compatible representations within two projects built by the same compiler, as their representations can vary in incompatible ways based on build settings.
This might make you think that you should now roll all kinds of containers by hand, but I would suggest a KISS approach here. If you're returning a variable number of elements as a result from a function, then we don't need a wide range of container types. We only need one dynamic array kind of container, and it doesn't even have to be a growable sequence, just something with proper copy, move, and destruction semantics.
It might seem nicer and could save some cycles if you just returned a set or a map from a function that computes one, but I'd suggest forgetting about returning these more sophisticated structures and converting to/from this basic dynamic-array kind of representation. It's rarely the bottleneck you might think it would be to transfer to/from contiguous representations, and if you do run into a hotspot as a result of this, found in a legit profiling session of a real-world use case, then you can always add more to your SDK in a very discrete and selective fashion.
You can also always wrap those more sophisticated containers like map into a C-like function pointer interface that treats the handle to the map as opaque, hidden away from clients. For heftier data structures like a binary search tree, paying the cost of one level of indirection is generally very negligible (for simpler structures like a random-access contiguous sequence, it generally isn't quite as negligible, especially if your read operations like operator[] involve indirect calls).
Another thing worth noting is that everything I've discussed so far relates to the exported, dynamically-linked side of your SDK. The static convenience library that is internally linked is free to receive and return standard objects to make things convenient for the third party using your library, provided that you're not actually passing/returning them in your exported interfaces. You can even avoid rolling your own containers outright and just take a C-style mindset to your exported interfaces, e.g., returning raw T* pointers that need to be freed while your convenience library does that automatically and transfers the contents to std::vector<T>.
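As a minimal sketch of that split (all names here are hypothetical), the exported side hands out a raw pointer plus count, and the internally-linked wrapper converts it to a std::vector and frees it through the SDK:

#include <cstddef>
#include <vector>

// Exported, ABI-stable side:
extern "C" double* sdk_get_samples(std::size_t* out_count);
extern "C" void    sdk_free_samples(double* samples);

// Statically-linked convenience wrapper, free to use standard containers:
inline std::vector<double> GetSamples()
{
    std::size_t count = 0;
    double* raw = sdk_get_samples(&count);
    std::vector<double> result;
    if (raw)
    {
        result.assign(raw, raw + count);
        sdk_free_samples(raw);   // freed by the module that allocated it (see #3)
    }
    return result;
}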
5. Throwing exceptions across module boundaries is undefined.
We should generally not be throwing exceptions from one module to be caught in another when we cannot ensure compatible build settings in the two modules, let alone the same compiler. So throwing exceptions from your API to indicate input errors is generally out of the question in this case.
Instead we should catch all possible exceptions at the entry points to our module to avoid leaking them into the outside world, and translate all such exceptions into error codes.
The statically-linked convenience library can still call one of your exported functions, check the error code, and in the case of failure, throw an exception. This is perfectly fine here since that convenience library is internally linked to the module of the third party using this library, so it's effectively throwing the exception from the third party module to be caught by the same third party module.
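A minimal sketch of both halves, with hypothetical names and error codes:

#include <stdexcept>

enum SdkStatus { SDK_OK = 0, SDK_INVALID_ARGUMENT = 1, SDK_INTERNAL_ERROR = 2 };

// Exported side: no exception ever escapes the module boundary.
extern "C" int sdk_open_file(const char* path)
{
    try
    {
        if (!path)
            throw std::invalid_argument("path is null");
        // ... real work elided ...
        return SDK_OK;
    }
    catch (const std::invalid_argument&) { return SDK_INVALID_ARGUMENT; }
    catch (...)                          { return SDK_INTERNAL_ERROR; }
}

// Convenience wrapper, internally linked into the client's own module,
// so the throw and the catch both happen on the same side of the boundary.
inline void OpenFile(const char* path)
{
    if (sdk_open_file(path) != SDK_OK)
        throw std::runtime_error("sdk_open_file failed");
}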
Conclusion
While this is, by no means, an exhaustive list, these are some caveats that can, when unheeded, cause some of the biggest issues at the broadest level of your API design. These kinds of design-level issues can be exponentially more expensive to fix in hindsight than implementation-type issues, so they should generally have the highest priority.
If you're new to these subjects, you can't go too far wrong favoring a C or very C-like API. You can still use a lot of C++ implementing it and can also build a C++ convenience library back on top (your clients don't even have to use anything but the C++ interfaces provided by that internally-linked convenience library).
With C, you're typically looking at more work at the baseline level, but potentially far fewer of those disastrous design-level gotchas. With C++, you're looking at less work at the baseline level, but far more potentially disastrous surprise scenarios. If you favor the latter route, you generally want to ensure that your team has deeper expertise with ABI issues, backed by a coding standards document that dedicates large sections to these potential ABI gotchas.
For your specific questions:
Problem 1: The interface functions take some custom data types (basically classes or structures) as in/out parameters. I want to initialise members of these classes with default values using constructors. If I do this, it is not possible to load my library dynamically and it becomes compiler dependent. How do I solve this?
This is where that statically-linked convenience library can come in handy. You can statically link all that convenient code like a class with constructors and still pass in its data in a more raw, primitive kind of form to the exported interfaces. Another option is to selectively inline or statically link the constructor so that its code is not exported as with the rest of the class, but you probably don't want to be exporting classes as indicated above if your goal is max binary compatibility and don't want too many gotchas.
Problem 2: Some interfaces return lists (or maps) of elements to the client. I am using standard containers for this purpose, but this is once again compiler dependent (and sometimes even compiler-version dependent).
Here we have to forgo those standard container goodies at least at the exported API level. You can still utilize them at the convenience library level which has internal linkage.

Plainly and simply, why do we use _stdcall?

I've come across calling conventions whilst studying states for game making with C++.
In a previous question someone stated that MSDN doesn't explain _stdcall very well - I agree.
What are the primary purposes for calling conventions like _stdcall? Does it matter what order the arguments are placed on the stack? How does it reduce the size of the code in X86 (as someone else stated)?
The reason for having some calling convention is pretty simple: so that the caller and the callee agree on how things will work. Without it, the caller doesn't know where to put arguments when it's calling a particular function.
As for why Microsoft decided on the specific details of _stdcall, that's largely historical. On MS-DOS, all calls were register based, so all OS calls required assembly language, or strange extensions to most higher-level languages.
When they first did Windows, they used the cdecl calling convention, mostly because that's what the compiler did by default. At least according to rumor, shortly before they got ready to release Windows 1.0, they switched to the Pascal calling convention because it was sufficiently more efficient that (among other things) it allowed Windows to fit on one fewer floppy disc. Regardless of the precise details, the Pascal calling convention did make code a little smaller, because the called function cleaned the arguments off the stack instead of that clean-up being repeated everywhere the function was called. For any function called from at least two different places, that's a win (and it's a tie anywhere else).
Then they started work on OS/2, and invented yet another calling convention (syscall).
Then, of course, came Win32. There wasn't really a lot wrong with syscall from a technical viewpoint, but (I'd guess) everything associated with OS/2 was considered tainted, so syscall had to go. The result was something just different enough to justify a new name. In fairness, that's a bit of an exaggeration: they did make one truly useful addition, encoding the number of bytes of arguments into each decorated function name, so that if (for example) you supplied an incorrect prototype for a function, the code would fail to link rather than ending up with a mismatch between caller and callee that could lead to much more serious problems.
For the most part, it really comes back to the original point though: the exact details of the calling convention don't matter all that much, as long as you don't make a complete mess of it. Most of what matters is that the caller and callee agree on the same thing, so if a compiler knows what parameters a function accepts, it knows how to generate code to get those parameters to the function correctly (and, likewise, they both agree on how stack cleanup is handled, etc.)

How much bad can be done using register variables in C++

I just came to know that we can use registers explicitly in C++ programs. I wonder what would happen if I declared and used all available registers in a single C++ program and ran it for a considerable amount of time. How badly would my system behave, and what (if any) measures would the OS take to recover from the situation?
The compiler will simply ignore the register keyword, so you are not going to run out of registers. It may well ignore it anyway - compilers are typically much better at register allocation than humans.
The register keyword indicates to the compiler that the variable does not need to be addressable in main memory. Thus the compiler can be sure that there are no pointers to the value and optimize accordingly.
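A trivial sketch (valid through C++14; the keyword was deprecated in C++11 and removed in C++17): a modern compiler will allocate registers the same way whether or not the hints are present.

int sum_array(const int* values, int count)
{
    register int sum = 0;                       // hint only; the compiler may ignore it
    for (register int i = 0; i < count; ++i)
        sum += values[i];
    return sum;
}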
A common misconception: the register keyword
register Keyword (see the non-Microsoft specific part)
"register" keyword in C?
Excessive use of the register keyword is unlikely to have serious negative impact on modern systems. Every thread maintains its own register values during execution, and its register usage will not have any direct impact on other threads. The compiler will either reject or ignore register usage that cannot result in a viable program. Poor register use will at most merely reduce performance and the OS will take no special action.
Only a specific number of registers is available to your C++ program.
Also, the keyword is just a suggestion to the compiler; compilers can mostly do this optimization themselves, so there is not much use for the register keyword, and the compiler may or may not follow the suggestion.
So in C, the only thing the register keyword does with modern compilers is prevent you from using & to take the address of the variable; in C++ it doesn't even do that, and the keyword was deprecated in C++11 and removed in C++17.
To Quote Herb Sutter on this:
Never write register. It's exactly as meaningful as whitespace
The register keyword is only a suggestion to the compiler and can be ignored. Let the compiler do the optimization for you.
The register keyword is only a polite suggestion to the compiler that you think this variable will be heavily used and could it pretty-please just keep it in a register. The compiler is free to ignore this suggestion and, in fact, will usually do so in a modern environment.
register is basically a vestigial remnant of the old, grossly-inefficient C compilers that were available way back when. (The same compilers that led to things like the execrable Duff's Device and other monstrosities, in fact.) Modern compilers are far more capable than you are of keeping track of which variables should be placed into which registers at which points in execution. They will, thus, politely ignore you without saying a word.
Als posted a link to Herb Sutter's article on keywords. I agree with Sutter that one should never use register. I disagree with him on whether register is meaningless.
It is worse than meaningless.
I have seen code where a variable qualified with register is later used with "&". Code with dozens and dozens of variables qualified with register. And the ultimate doozy, "register volatile foo;"
Never use "register".
All the CPU registers are at the disposal of your program anyway, so there is nothing exceptional about using all of them. The OS won't even notice.

Compiler optimization of functions parameters

Function parameters are placed on the stack, but compilers can optimize this task by the use of optional registers. It would make sense that this optimization will kick in if there are only 1-2 parameters, and not when there are 256 (not that one would want to have the max number of parameters).
How can one find out the parameter limit (number of parameters) for a certain compiler (such as gcc) where one can be sure that this optimization will be used?
Function parameters are placed on the stack, but compilers can optimize this task by the use of optional registers.
As FrankH says in his comments and as I'm going to say in my answer, the application binary interface for the system in question determines how arguments are passed to functions - this is called the calling convention for that platform.
To complicate matters, 32-bit x86 actually has several. This is historical and comes from the fact that when 32-bit Windows arrived, everyone went crazy doing different things.
So, yes, you can "optimise" by writing function calls in such a way, but no, you shouldn't. You should follow the standards for your platform, because the honest truth is, the speed of stack access probably isn't slowing your code down to so great an extent that you need to be binary-incompatible with everyone else on your system.
Why the need for ABIs/standard calling conventions? Well, in terms of using the processor registers, the stack, etc., applications must agree on what means what and where it should go. If one function decided all its arguments were in registers and another that some were on the stack, how would they be interoperable? Moreover, you might come across the term scratch registers, meaning those registers you don't have to restore. What happens if you call a function expecting it to leave some registers alone?
Anyway, as for what you asked for, here's some ABI documentation:
The difference between x86 and x64 on windows.
x86_64 ABI used for Unix-like platforms.
Wikipedia's x86 calling conventions.
A document on compiler calling conventions.
The last one is my favourite. To quote it:
In the days of the old DOS operating system, it was often possible to combine development tools from different vendors with few compatibility problems. With 32-bit Windows, the situation has gone completely out of hand. Different compilers use different data representations, different function calling conventions, and different object file formats. While static link libraries have traditionally been considered compiler-specific, the widespread use of dynamic link libraries (DLL's) has made the distribution of function libraries in binary form more common.
So whatever you're trying to do with optimising via modifying the function calling method, don't. Find another way to optimise. Profile your code. Study the optimisation options your compiler gives you (-OX) if you think it helps, and if the speed is really that crucial, dump the assembly to check.
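If you do want to see it for yourself, a minimal sketch (add_numbers and args.cpp are just illustrative names) is to compile a tiny file to assembly, e.g. with g++ -O2 -S args.cpp -o args.s, and read where the arguments end up:

// args.cpp
int add_numbers(int a, int b)
{
    return a + b;   // check args.s to see whether a and b arrive in registers or on the stack
}

int main()
{
    return add_numbers(2, 3);
}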
For publicly visible functions, this is documented in the ABI standard. For functions that are not referenceable from the outside, all bets are off anyway.
You would have to read the fine manual for the compiler. If you were lucky, you would find it there in a description of function calling conventions. Otherwise, for an OSS compiler such as gcc you would probably have to read its source-code.

MS Visual C++: When should you care about using calling conventions?

In C/C++ (specifically, I'm using MSVS), in what situation would one ever need to worry about specifying a calling convention for a function definition? Are they ever important? Isn't the compiler capable of choosing the optimal convention when necessary (i.e., fastcall, etc.)?
Maybe my understanding is lacking, but I just do not see when there would be a case where the programmer would need to care about things like the order that the arguments are placed on the stack and so forth. I also do not see why the compiler's optimization would not be able to choose whatever scheme would work best for that particular function. Any knowledge anyone could provide me with would be great. Thanks!
In general terms, the calling convention is important when you're integrating code that's being compiled by different compilers. For example, if you're publishing a DLL that will be used by your customers, you will want to make sure that the functions you export all have a consistent, expected calling convention.
You are correct that within a single program, the compiler can generally choose which calling convention to use for each function (and the rules are usually pretty simple).
You do not need to care for 64-bit applications, since there is only one calling convention.
You do need to care for 32-bit applications in the following cases:
You interact with 3rd party libraries and the headers for these libraries did not declare the correct calling convention.
You are creating a library or DLL for someone else to use. You need to decide on a calling convention so that other code would use the correct calling convention when calling your code.
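A minimal sketch of that second case (MYLIB_EXPORTS, MYLIB_API, and my_lib_add are hypothetical names): state the calling convention explicitly on the exported function, so every client calls it the way the DLL was built.

#ifdef MYLIB_EXPORTS
#  define MYLIB_API __declspec(dllexport)
#else
#  define MYLIB_API __declspec(dllimport)
#endif

// extern "C" avoids C++ name mangling; __stdcall is spelled out so callers agree.
extern "C" MYLIB_API int __stdcall my_lib_add(int a, int b);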