I am working on a C library which compiles/links to a .a file that users can statically link into their code. The library's performance is very important, so I am writing its performance-critical routines in x86-64 assembly.
For some routines, I can get significantly better performance if I use BMI2 instructions than if I stick to the "standard" x86-64 instruction set. Trouble is, BMI2 was introduced fairly recently and some of my users use processors that do not support those instructions.
So I've written each routine twice, once using BMI2 instructions and once without them. In my current setup, I would distribute two versions of the .a file: a "fast" one that requires support for BMI2 instructions, and a "slow" one that does not require support for BMI2 instructions.
I am asking if there's a way to simplify this by distributing a single .a file that will dynamically choose the correct implementation based on whether the CPU on which the final application runs supports BMI2 instructions.
Unlike similar questions on StackOverflow, this one has two peculiarities:
The technique to choose the function needs to have particularly low overhead in the critical path. The routines in question, after assembly-optimization, run in ~10 ns, so even a single if statement could be significant.
The function that needs to be chosen "dynamically" is chosen once at the beginning, and then remains fixed for the duration of the program. I'm hoping that this will offer a faster solution than the one suggested in this question: Choosing method implementation at runtime
The fastest solution I've come up with so far is to do the following:
Check whether the CPU supports BMI2 instructions using the cpuid instruction.
Set a global variable true or false depending on the result.
Branch on the value of this global variable on every function invocation.
I'm not satisfied with this approach because it has two drawbacks:
I'm not sure how I can automatically run cpuid and set a global variable at the beginning of the program, given that I'm distributing a .a file and don't have control over the main function in the final binary. I'm happy to use C++ here if it offers a better solution, as long as the final library can still be linked with and called from a C program.
This incurs overhead on every function call, when ideally the only overhead would be on program startup.
Are there any solutions that are more efficient than the one I've detailed above?
x264 uses an init function (which users of the library are required to call before calling anything else, or something like that) to set up a struct of function pointers based on CPUID results. Including taking into account that pshufb is slow on some early CPUs that support it.
If your functions depend on pdep / pext, you probably want to detect AMD vs. Intel, because AMD's pdep/pext is very slow and probably not worth using on Ryzen, even though it is available. (See https://agner.org/optimize/ for instruction tables.)
Function pointers are fairly low overhead, about the same as calling a function in a shared library or DLL: call [rel funcptr] instead of call func, in the compiler-generated asm that calls your functions.
CPU dependent code: how to avoid function pointers? shows a very simple example of it in C, and is asking for ways to avoid it. With dynamic linking, you can do CPU detection at dynamic link time so the dynamic-linking indirection becomes your CPU-dispatch indirection as well (like glibc does for selecting an optimized memcpy implementation.)
But with static linking for a .a, just make function pointers that are statically initialized to the baseline versions, and your CPU init function (which hopefully runs before any of the function pointers are dereferenced) rewrites them to point at the best version for the current CPU.
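A minimal sketch of that approach, assuming GCC or Clang on x86-64 (the routine names are placeholders for your own functions, not part of any real API): the pointer starts out at the baseline version, and a constructor-attribute function rewrites it before main() runs, even though the library has no control over main(). It is written as C++ here, but it compiles the same way as plain C if you drop the extern "C" wrapper.

#include <stdint.h>

extern "C" {

uint64_t my_routine_baseline(uint64_t x);   /* plain x86-64 implementation */
uint64_t my_routine_bmi2(uint64_t x);       /* BMI2 (pdep/pext) implementation */

/* statically initialized, so the pointer is always valid even before init runs */
uint64_t (*my_routine)(uint64_t) = my_routine_baseline;

__attribute__((constructor))
static void my_lib_cpu_init(void)
{
    __builtin_cpu_init();                   /* constructor order across translation
                                               units is unspecified, so initialize the
                                               CPU feature data explicitly */
    if (__builtin_cpu_supports("bmi2"))
        my_routine = my_routine_bmi2;
}

}  /* extern "C" */

Callers, including C callers, then invoke (*my_routine)(x): one indirect call, no branch, and no requirement that the application call an init function explicitly.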
If you are using gcc, you can get the compiler to implement all the boilerplate automatically: see the gcc manual page on function multiversioning.
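A hedged sketch of what that looks like (the C++ form from the GCC manual, built with a reasonably recent g++; extract_bits is a made-up routine, and "bmi2" is assumed to be an accepted target name since __builtin_cpu_supports knows it). GCC emits an IFUNC resolver, so the CPU check happens once at load time and every later call is a plain indirect call:

#include <immintrin.h>
#include <stdint.h>

__attribute__((target("default")))
uint64_t extract_bits(uint64_t x, uint64_t mask)
{
    /* portable fallback: gather the bits of x selected by mask */
    uint64_t result = 0, bit = 1;
    for (uint64_t m = mask; m != 0; m &= m - 1, bit <<= 1) {
        uint64_t lowest = m & ~(m - 1);   /* lowest set bit of m */
        if (x & lowest)
            result |= bit;
    }
    return result;
}

__attribute__((target("bmi2")))
uint64_t extract_bits(uint64_t x, uint64_t mask)
{
    return _pext_u64(x, mask);   /* single-instruction version */
}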
I am profiling some C++ code with perf, and I see that __scalbnf and __wrap_scalbnf are taking up a good chunk of the run time. I looked up what these functions are, and my best guess is I am calling them via a call to std::exp. However I'd like to be able to confirm this. Is there a place where I can see the C++ code implementing std::exp to confirm this? Or what is the best way for me (a C++ amateur) to start digging into this and understanding what is happening?
Thank you.
Set a breakpoint on __scalbn. Run your program. Look at a backtrace (in GDB, bt). The call tree will show that exp() is a parent function for __scalbn.
If a function has multiple callers, the first hit might not be from the "hot" function you're profiling.
To actually figure out which higher-up function (including its children) is responsible for using a lot of time, see linux perf: how to interpret and find hotspots. Top-down profiling can find expensive functions that do all their work in calls to other functions, even when those other functions also have "innocent" callers. (e.g. memcpy is heavily used and often unavoidable, but what you'd want to find are callers that use it too much and could be optimized better. Or not called at all.)
And BTW, yes glibc's math lib exp() implementation does internally use __scalbn. I'm not sure how bad the implementation is, but I don't see an asm version for x86-64, only this pure C version. https://code.woboq.org/userspace/glibc/sysdeps/ieee754/dbl-64/wordsize-64/s_scalbn.c.html. (For __scalbnl(long double) there's https://code.woboq.org/userspace/glibc/sysdeps/x86_64/fpu/s_scalbnl.S.html, using the x87 fscale instruction for 80-bit floats. But there are only i386 asm files for the other sizes. And IA-64 (Itanium), but not x86-64).
glibc does have some vectorized EXP code, though, like the SSE4 SVML version https://code.woboq.org/userspace/glibc/sysdeps/x86_64/fpu/multiarch/svml_d_exp2_core_sse4.S.html#_ZGVbN2v_exp_sse4.
If you want higher-performance exp() without perfect accuracy, see Fastest Implementation of Exponential Function Using AVX (that's for float, not double. I forget if there's an SO answer with a double version).
Also related: Efficient implementation of log2(__m256d) in AVX2.
To confirm that std::exp is the reason for __scalbnf and __wrap_scalbnf, you can replace the std::exp calls by either:
an identity function that returns the input value
or by an alternative exp implementation (for example fm_exp, found here)
Then, if you still see __scalbnf and __wrap_scalbnf in the profiler output, it means it's not coming from std::exp.
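A minimal sketch of the first option, assuming you can route the calls through one wrapper (my_exp and MY_USE_REAL_EXP are made-up names for illustration):

#include <cmath>

inline float my_exp(float x)
{
#ifdef MY_USE_REAL_EXP
    return std::exp(x);   // normal build
#else
    return x;             // identity stand-in: if __scalbnf/__wrap_scalbnf vanish
                          // from the profile in this build, std::exp was the caller
#endif
}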
I read about cache optimization in C++ and the mechanisms modern CPUs use to predict what data is needed next and copy it into the cache. But is there a direct way in C++ for programmers, who know what actually is needed next, to determine what data gets copied into the CPU cache?
This varies with the processor and compiler you're using.
Assuming you're using an Intel x86/x64 or compatible (e.g., AMD) processor, the processor provides a number of prefetch instructions, and most compilers include intrinsics to invoke them. With VC++ you use _m_prefetch or _m_prefetchw. With gcc you use __builtin_prefetch.
Likewise, VC++ on an ARM provides a __prefetch intrinsic for the same purpose (no, I really don't know why they couldn't have used the same name as on x86; the signature and effect appear identical).
Most other reasonably modern, higher-end processors probably provide similar instructions, and I'd guess most compilers provide intrinsics to make them available, but just as with these, the names of the intrinsics will vary. For that matter, even though the functions are intrinsic to the compiler, most require that you include some header to use them -- and the name of the header will also vary.
The prefetch intrinsics Jerry provided would do the trick. Keep in mind that there are several flavors, controlled by an argument to that function, which determines which levels of the cache (if any) are used to keep the line. A prefetch with the NTA hint, for example, would not pollute the caches, but rather provides the line for immediate use only (it is meant for data you're going to use soon and only once).
Also keep in mind that these instructions are basically hints to the CPU (which already does quite well on its own at guessing which lines to prefetch). As such, they are not guaranteed to work and may do nothing in many cases (for example if the memory subsystem is saturated, or the address was swapped out of memory).
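For concreteness, a hedged sketch using the SSE prefetch intrinsic from <xmmintrin.h>, which MSVC, GCC and clang all provide; the prefetch distance is a tuning knob, and the hint is purely advisory:

#include <xmmintrin.h>
#include <stddef.h>

float sum_with_prefetch(const float* data, size_t n)
{
    const size_t dist = 64;   /* prefetch ~4 cache lines ahead of the current element */
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (i + dist < n)
            _mm_prefetch((const char*)(data + i + dist), _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}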
I stepped into the assembly of the transcendental math functions of the C library with MSVC in fp:strict mode. They all seem to follow the same pattern, here's what happens for sin.
First there is a dispatch routine from a file called "disp_pentium4.inc". It checks if the variable ___use_sse2_mathfcns has been set; if so, calls __sin_pentium4, otherwise calls __sin_default.
__sin_pentium4 (in "sin_pentium4.asm") starts by transferring the argument from the x87 fpu to the xmm0 register, performs the calculation using SSE2 instructions, and loads the result back in the fpu.
__sin_default (in "sin.asm") keeps the variable on the x87 stack and simply calls fsin.
So in both cases, the operand is pushed on the x87 stack and returned on it as well, making it transparent to the caller, but if ___use_sse2_mathfcns is defined, the operation is actually performed in SSE2 rather than x87.
This behavior is very interesting to me because the x87 transcendental functions are notorious for having slightly different behaviors depending on the implementation, whereas a given piece of SSE2 code should always give reproducible results.
Is there a way to determine for certain, either at compile or run-time, that the SSE2 code path will be used? I am not proficient writing assembly, so if this involves writing any assembly, a code example would be appreciated.
I found the answer through careful investigation of math.h. This is controlled by a function called _set_SSE2_enable. This is a public symbol documented here:
Enables or disables the use of Streaming SIMD Extensions 2 (SSE2)
instructions in CRT math routines. (This function is not available on
x64 architectures because SSE2 is enabled by default.)
This causes the aforementioned ___use_sse2_mathfcns flag to be set to the provided value, effectively enabling or disabling use of the _pentium4 SSE2 routines.
The documentation mentions this affects only certain transcendental functions, but looking at the disassembly, it seems to affect every one of them.
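For reference, a minimal usage sketch (32-bit MSVC CRT only; per the documentation the function is declared in <math.h> and does not exist in the x64 CRT):

#include <math.h>
#include <stdio.h>

int main(void)
{
    int sse2_on = _set_SSE2_enable(1);   /* returns nonzero if the SSE2 paths are now in use */
    printf("SSE2 math routines %s\n", sse2_on ? "enabled" : "not available");
    return 0;
}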
Edit: stepping into every function reveals that they're all available in SSE2 except for the following:
fmod
sinh
cosh
tanh
sqrt
Sqrt is the biggest offender, but it's trivial to implement in SSE2 using intrinsics. For the others, there's no simple solution except perhaps using a third-party library, but I can probably do without.
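For example, a minimal sketch of such a sqrt using the standard SSE2 intrinsics (this issues sqrtsd and never touches the x87 unit):

#include <emmintrin.h>

double sse2_sqrt(double x)
{
    __m128d v = _mm_set_sd(x);   // put x in the low lane
    v = _mm_sqrt_sd(v, v);       // sqrtsd on the low lane
    return _mm_cvtsd_f64(v);     // extract the result
}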
Why not use your own library instead of the C runtime? This would provide an even stronger guarantee of consistency across computers (presumably the C runtime is provided as a DLL and might change slightly over time).
I would recommend CRlibm. If you are already targeting SSE2, and as long as you do not intend to change the FPU's rounding mode, you are in the ideal conditions to use it, and you won't find a more accurate implementation.
The short answer is that you can't tell IN YOUR CODE for certain what the library will do, unless you are also involving library-implementation specific details. These would make the code completely unportable - even two different builds of the same compiler may change the internals of the library.
Of course, if portability isn't an issue, then using extern <type> ___use_sse2_mathfcns; and checking whether it's true would clearly work.
I expect that if the processor has SSE2 and you are using a modern enough library, it would use SSE2 wherever possible. But to say that for certain is a different matter.
If this is critical for your code, then implement your own transcendental functions and use those - that's the only way to guarantee the same result. Or, use some suitable inline assembler (or transcendental) code to calculate selected sin, cos, etc values, and compare those with the sin() and cos() functions provided by the library.
There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it's taking a considerable time to process these.
I believe that using SSE to implement these loops should improve their performance significantly, especially where many operations are carried out on the same set of data, so once the data is read into the cache initially, there shouldn't be any cache misses to stall it. However I'm not sure about going about this.
Is there a compiler- and OS-independent way of writing the code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross-compiler solutions.
I still need to support some CPUs that have either no or limited SSE support (e.g. Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run-time linker" that links in either the basic or the SSE-optimised code based on the CPU running it when the process is started?
What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows there are a few of them.
For your second point there are several solutions as long as you can separate out the differences into different functions:
plain old C function pointers
dynamic linking (which generally relies on C function pointers)
if you're using C++, having different classes that represent the support for different architectures and using virtual functions can help immensely with this.
Note that because you'd be relying on indirect function calls, the functions that abstract the different operations generally need to represent somewhat higher level functionality or you may lose whatever gains you get from the optimized instruction in the call overhead (in other words don't abstract the individual SSE operations - abstract the work you're doing).
Here's an example using function pointers:
typedef int (*scale_func_ptr)( int scalar, int* pData, int count);

int non_sse_scale( int scalar, int* pData, int count)
{
    // do whatever work needs done, without SSE so it'll work on older CPUs
    return 0;
}

int sse_scale( int scalar, int* pData, int count)
{
    // equivalent code, but uses SSE
    return 0;
}

// at initialization
scale_func_ptr scale_func = non_sse_scale;
if (useSSE) {
    scale_func = sse_scale;
}

// now, when you want to do the work:
scale_func( 12, theData_ptr, 512); // this calls the routine tailored to SSE
                                   // if the CPU supports it, otherwise the
                                   // non-SSE version of the function
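In case it helps, here is a hedged sketch of how the useSSE flag above might be set, querying CPUID leaf 1 and testing EDX bit 26 (the SSE2 feature bit); it covers MSVC and GCC/clang:

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <cpuid.h>
#endif

int detect_sse2(void)
{
#ifdef _MSC_VER
    int regs[4];
    __cpuid(regs, 1);                        // EAX=1: feature information
    return (regs[3] & (1 << 26)) != 0;       // regs[3] is EDX
#else
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx & (1u << 26)) != 0;
#endif
}

// at initialization: int useSSE = detect_sse2();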
Good reading on the subject: Stop the instruction set war
Short overview: sorry, it is not possible to solve your problem in a way that is both simple and maximally compatible (Intel vs. AMD).
The SSE intrinsics work with Visual C++, GCC and the Intel compiler. There is no problem using them these days.
Note that you should always keep a version of your code that does not use SSE and constantly check it against your SSE implementation.
This helps not only for debugging, it is also useful if you want to support CPUs or architectures that don't support your required SSE versions.
In answer to your comment:
So effectively, as long as I don't try to actually execute code containing unsupported instructions I'm fine, and I could get away with an "if(sse2Supported){...}else{...}" type switch?
Depends. It's fine for SSE instructions to exist in the binary as long as they're not executed. The CPU has no problem with that.
However, if you enable SSE support in the compiler, it will most likely swap a number of "normal" instructions for their SSE equivalents (scalar floating-point ops, for example), so even chunks of your regular non-SSE code will blow up on a CPU that doesn't support it.
So what you'll most likely have to do is compile one or two files separately, with SSE enabled, and let them contain all your SSE routines. Then link those with the rest of the app, which is compiled without SSE support.
Rather than hand-coding an alternative SSE implementation to your scalar code, I strongly suggest you have a look at OpenCL. It is a vendor-neutral portable, cross-platform system for computationally intensive applications (and is highly buzzword-compliant!). You can write your algorithm in a subset of C99 designed for vectorised operations, which is much easier than hand-coding SSE. And best of all, OpenCL will generate the best implementation at runtime, to execute either on the GPU or on the CPU. So basically you get the SSE code written for you.
There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it's taking a considerable time to process these.
Your application sounds like just the kind of problem that OpenCL is designed to address. Writing alternative functions in SSE would certainly improve the execution speed, but it is a great deal of work to write and debug.
Is there a compiler- and OS-independent way of writing the code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross-compiler solutions.
Yes. The SSE intrinsics have been essentially standardised by Intel, so the same functions work the same between Windows, Linux and Mac (specifically with Visual C++ and GNU g++).
I still need to support some CPUs that have either no or limited SSE support (e.g. Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run-time linker" that links in either the basic or the SSE-optimised code based on the CPU running it when the process is started?
You could do that (eg. using dlopen()) but it is a very complex solution. Much simpler would be (in C) to define a function interface and call the appropriate version of the optimised function via function pointer, or in C++ to use different implementation classes, depending on the CPU detected.
With OpenCL it is not necessary to do this, as the code is generated at runtime for the given architecture.
What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows there are a few of them.
Within the SSE instruction set, there are many flavours. It can be quite difficult to code the same algorithm in different subsets of SSE when certain instructions are not present. I suggest (at least to begin with) that you choose a minimum supported level, such as SSE2, and fall back to the scalar implementation on older machines.
This is also an ideal situation for unit/regression testing, which is very important to ensure your different implementations produce the same results. Have a test suite of input data and known-good output data, and run the same data through both versions of the processing function. You may need a precision tolerance for passing (i.e. the difference between the result and the correct answer is below some epsilon, for example 1e-6). This will greatly aid in debugging, and if you build high-resolution timing into your testing framework, you can compare the performance improvements at the same time.
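A minimal sketch of such a check, assuming both versions write their results into plain float arrays (the 1e-6 threshold is only an example; pick one that suits your data):

#include <cmath>
#include <cstdio>
#include <vector>

bool results_match(const std::vector<float>& scalar_out,
                   const std::vector<float>& sse_out,
                   float eps = 1e-6f)
{
    if (scalar_out.size() != sse_out.size())
        return false;
    for (std::size_t i = 0; i < scalar_out.size(); ++i) {
        if (std::fabs(scalar_out[i] - sse_out[i]) > eps) {
            std::printf("mismatch at %zu: %g vs %g\n", i,
                        scalar_out[i], sse_out[i]);
            return false;
        }
    }
    return true;
}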
Is anyone using JIT tricks to improve the runtime performance of statically compiled languages such as C++? It seems like hotspot analysis and branch prediction based on observations made during runtime could improve the performance of any code, but maybe there's some fundamental strategic reason why making such observations and implementing changes during runtime are only possible in virtual machines. I distinctly recall overhearing C++ compiler writers mutter "you can do that for programs written in C++ too" while listening to dynamic language enthusiasts talk about collecting statistics and rearranging code, but my web searches for evidence to support this memory have come up dry.
Profile guided optimization is different than runtime optimization. The optimization is still done offline, based on profiling information, but once the binary is shipped there is no ongoing optimization, so if the usage patterns of the profile-guided optimization phase don't accurately reflect real-world usage then the results will be imperfect, and the program also won't adapt to different usage patterns.
You may be interested in looking for information on HP's Dynamo, although that system focused on native-binary-to-native-binary translation. Since C++ is almost exclusively compiled to native code, I suppose that's exactly what you are looking for.
You may also want to take a look at LLVM, which is a compiler framework and intermediate representation that supports JIT compilation and runtime optimization, although I'm not sure if there are actually any LLVM-based runtimes that can compile C++ and execute + runtime optimize it yet.
I did that kind of optimization quite a lot in the last years. It was for a graphics rendering API that I implemented. Since the API defined several thousand different drawing modes, a general-purpose function was way too slow.
I ended up writing my own little JIT compiler for a domain-specific language (very close to asm, but with some high-level control structures and local variables thrown in).
The performance improvement I got was between a factor of 10 and 60 (depending on the complexity of the compiled code), so the extra work paid off big time.
On the PC I would not start writing my own JIT compiler but would use either LIBJIT or LLVM for the JIT compilation. That wasn't possible in my case because I was working on a non-mainstream embedded processor that is not supported by LIBJIT/LLVM, so I had to invent my own.
The answer is more likely that no one has done more than PGO for C++, because the benefits are likely unnoticeable.
Let me elaborate: JIT engines/runtimes have both blessings and drawbacks from their developer's view: they have more information at runtime but much less time to analyze it.
Some optimizations are really expensive, and you are unlikely to see them done at runtime without a huge impact on start time: loop unrolling, auto-vectorization (which in most cases is also based on loop unrolling), instruction selection (e.g. using SSE4.1 on CPUs that support SSE4.1) combined with instruction scheduling and reordering (to make better use of super-scalar CPUs). These kinds of optimizations combine well with C-like code (which is accessible from C++).
The only full-blown compiler architecture that does such advanced compilation (as far as I know) is Java's HotSpot, along with architectures built on similar principles using tiered compilation (Azul's Java systems, and the JaegerMonkey JS engine popular at the time).
But one of the biggest optimizations done at runtime is the following:
Polymorphic inline caching: if the first run of a loop sees certain types, then the next time the loop's code is specialized for the types seen in the previous run. The JIT inserts a guard, makes the inlined types the default branch, and, starting from this specialized form, an SSA-based engine applies constant folding/propagation, inlining and dead-code elimination, and, depending on how "advanced" the JIT is, a more or less improved CPU register assignment.
As you may notice, the JIT (HotSpot) mostly improves branchy code, and with runtime information it can do better than C++ code; but a static compiler, having on its side the time to do analysis and instruction reordering, will likely get slightly better performance for simple loops. Also, in typical C++ code, the areas that need to be fast tend not to be OOP, so the information used by the JIT's optimizations will not bring such an amazing improvement.
Another advantage of a JIT is that it works across assemblies, so it has more information if it wants to do inlining.
Let me elaborate: let's say you have a base class A and just one implementation of it, namely B, which lives in another package/assembly/gem/etc. and is loaded dynamically.
Because the JIT sees that B is the only implementation of A, it can replace the calls to A with B's code everywhere in its internal representation, and the method calls will not do a dispatch (a vtable lookup) but will be direct calls. Those direct calls may also be inlined. For example, if B has a method getLength() which returns 2, all calls of getLength() may be reduced to the constant 2 throughout. In contrast, C++ code cannot skip the virtual call to B coming from another DLL.
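A hedged illustration of that point in C++ (A, B and getLength are the made-up names from above): here a conforming compiler must keep the virtual dispatch, because another DLL could still supply a different implementation of A, whereas a JIT that has observed B to be the only loaded implementation may call, inline, and even constant-fold it.

struct A {
    virtual ~A() = default;
    virtual int getLength() const = 0;
};

struct B : A {                      // the only implementation, loaded dynamically
    int getLength() const override { return 2; }
};

int total(const A& a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += a.getLength();       // stays a vtable call in C++;
                                    // a JIT could reduce this loop to 2 * n
    return sum;
}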
Some implementations of C++ do not support optimizing across multiple .cpp files (though today there is the -flto flag in recent versions of GCC that makes this possible). But if you are a C++ developer concerned about speed, you will likely put all the performance-sensitive classes in the same static library or even in the same file, so the compiler can inline them nicely; the extra information a JIT has by design is then provided by the developer, so there is no performance loss.
Visual Studio has an option for doing runtime profiling that can then be used for optimization of code.
"Profile Guided Optimization"
Microsoft Visual Studio calls this "profile guided optimization"; you can learn more about it at MSDN. Basically, you run the program a bunch of times with a profiler attached to record its hotspots and other performance characteristics, and then you can feed the profiler's output into the compiler to get appropriate optimizations.
I believe LLVM attempts to do some of this. It attempts to optimize across the whole lifetime of the program (compile-time, link-time, and run-time).
Reasonable question - but with a doubtful premise.
As in Nils' answer, sometimes "optimization" means "low-level optimization", which is a nice subject in its own right.
However, it is based on the concept of a "hot-spot", which has nowhere near the relevance it is commonly given.
Definition: a hot-spot is a small region of code where a process's program counter spends a large percentage of its time.
If there is a hot-spot, such as a tight inner loop occupying a lot of time, it is worth trying to optimize at the low level, if it is in code that you control (i.e. not in a third-party library).
Now suppose that inner loop contains a call to a function, any function. Now the program counter is not likely to be found there, because it is more likely to be in the function. So while the code may be wasteful, it is no longer a hot-spot.
There are many common ways to make software slow, of which hot-spots are one. However, in my experience, that is the only one of which most programmers are aware, and the only one to which low-level optimization applies.
See this.