Why do I see __scalbnf in my profiler?

I am profiling some C++ code with perf, and I see that __scalbnf and __wrap_scalbnf are taking up a good chunk of the run time. I looked up what these functions are, and my best guess is I am calling them via a call to std::exp. However I'd like to be able to confirm this. Is there a place where I can see the C++ code implementing std::exp to confirm this? Or what is the best way for me (a C++ amateur) to start digging into this and understanding what is happening?
Thank you.

Set a breakpoint on __scalbn. Run your program. Look at a backtrace (in GDB, bt). The call tree will show that exp() is a parent function for __scalbn.
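For example, a sketch of the GDB session (./myprogram is a placeholder for your binary):
$ gdb ./myprogram
(gdb) break __scalbnf
(gdb) run
(gdb) bt
When the breakpoint hits, the backtrace shows the chain of callers above __scalbnf.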
If a function has multiple callers, the first hit might not be from the "hot" function you're profiling.
To actually figure out which higher-up function (including its children) is responsible for using a lot of time, see linux perf: how to interpret and find hotspots. Top-down profiling can find expensive functions that do all their work in calls to other functions, even when those other functions also have "innocent" callers. (e.g. memcpy is heavily used and often unavoidable, but what you'd want to find are callers that use it too much and could be optimized to copy less, or to avoid the call entirely.)
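For reference, the usual top-down workflow with perf looks something like this (you may need --call-graph dwarf instead of -g if your binary is built without frame pointers):
$ perf record -g ./myprogram
$ perf report --children
The --children mode sorts by inclusive time (self plus callees), which is what you want when looking for expensive high-level callers.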
And BTW, yes glibc's math lib exp() implementation does internally use __scalbn. I'm not sure how bad the implementation is, but I don't see an asm version for x86-64, only this pure C version. https://code.woboq.org/userspace/glibc/sysdeps/ieee754/dbl-64/wordsize-64/s_scalbn.c.html. (For __scalbnl(long double) there's https://code.woboq.org/userspace/glibc/sysdeps/x86_64/fpu/s_scalbnl.S.html, using the x87 fscale instruction for 80-bit floats. But there are only i386 asm files for the other sizes. And IA-64 (Itanium), but not x86-64).
glibc does have some vectorized EXP code, though, like the SSE4 SVML version https://code.woboq.org/userspace/glibc/sysdeps/x86_64/fpu/multiarch/svml_d_exp2_core_sse4.S.html#_ZGVbN2v_exp_sse4.
If you want higher-performance exp() without perfect accuracy, see Fastest Implementation of Exponential Function Using AVX (that's for float, not double. I forget if there's an SO answer with a double version).
Also related: Efficient implementation of log2(__m256d) in AVX2.

To confirm that std::exp is the reason for __scalbnf and __wrap_scalbnf, you can replace the std::exp calls by either:
an identity function that returns the input value
or by an alternative exp implementation (for example fm_exp, found here)
Then, if you still see __scalbnf and __wrap_scalbnf in the profiler output, they're not coming from std::exp.
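As a minimal sketch of the first option (fake_exp is a made-up name; the point is to keep the call shape while removing the work exp does):
// Stand-in for std::exp that does no real work. Results will be numerically
// wrong, of course; this is only for attributing profiler samples.
static inline float fake_exp(float x) { return x; }
// In the hot code, temporarily swap:
//     float y = std::exp(x);   // original
//     float y = fake_exp(x);   // while profiling
Be aware that making the call trivial may let the compiler optimize the surrounding code differently, so treat the resulting profile as a rough attribution rather than a precise measurement.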

Choose assembly implementation to use based on supported instructions

I am working on a C library which compiles/links to a .a file that users can statically link into their code. The library's performance is very important, so I am writing the performance-critical routines in hand-optimized x86-64 assembly.
For some routines, I can get significantly better performance if I use BMI2 instructions than if I stick to the "standard" x86-64 instruction set. Trouble is, BMI2 was introduced fairly recently and some of my users use processors that do not support those instructions.
So, I've written the routines twice: once using BMI2 instructions and once without them. In my current setup, I would distribute two versions of the .a file: a "fast" one that requires support for BMI2 instructions, and a "slow" one that does not.
I am asking if there's a way to simplify this by distributing a single .a file that will dynamically choose the correct implementation based on whether the CPU on which the final application runs supports BMI2 instructions.
Unlike similar questions on StackOverflow, there are two peculiarities here:
The technique to choose the function needs to have particularly low overhead in the critical path. The routines in question, after assembly-optimization, run in ~10 ns, so even a single if statement could be significant.
The function that needs to be chosen "dynamically" is chosen once at the beginning, and then remains fixed for the duration of the program. I'm hoping that this will offer a faster solution than the one suggested in this question: Choosing method implementation at runtime
The fastest solution I've come up with so far is to do the following:
Check whether the CPU supports BMI2 instructions using the cpuid instruction.
Set a global variable true or false depending on the result.
Branch on the value of this global variable on every function invocation.
I'm not satisfied with this approach because it has two drawbacks:
I'm not sure how I can automatically run cpuid and set a global variable at the beginning of the program, given that I'm distributing a .a file and don't have control over the main function in the final binary. I'm happy to use C++ here if it offers a better solution, as long as the final library can still be linked with and called from a C program.
This incurs overhead on every function call, when ideally the only overhead would be on program startup.
Are there any solutions that are more efficient than the one I've detailed above?
x264 uses an init function (which users of the library are required to call before calling anything else, or something like that) to set up a struct of function pointers based on CPUID results, including taking into account that pshufb is slow on some early CPUs that support it.
If your functions depend on pdep / pext, you probably want to detect AMD vs. Intel, because AMD's pdep/pext is very slow and probably not worth using on Ryzen, even though it is available. (See https://agner.org/optimize/ for instruction tables.)
Function pointers are fairly low overhead, about the same as calling a function in a shared library or DLL. call [rel funcptr] instead of call func. (In the compiler-generated asm that calls your functions).
CPU dependent code: how to avoid function pointers? shows a very simple example of it in C, and is asking for ways to avoid it. With dynamic linking, you can do CPU detection at dynamic link time so the dynamic-linking indirection becomes your CPU-dispatch indirection as well (like glibc does for selecting an optimized memcpy implementation.)
But with static linking for a .a, just make function pointers that are statically initialized to the baseline versions, and your CPU init function (which hopefully runs before any of the function pointers are dereferenced) rewrites them to point at the best version for the current CPU.
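A minimal sketch of that pattern (function names are hypothetical; assumes GCC/Clang, where __attribute__((constructor)) runs before main and __builtin_cpu_supports has been available since GCC 4.8):
#include <stdint.h>

uint64_t scatter_baseline(uint64_t v, uint64_t mask); /* portable version   */
uint64_t scatter_bmi2(uint64_t v, uint64_t mask);     /* pdep-based version */

/* Statically initialized to the baseline, so a call is safe even if the
 * init somehow hasn't run yet. */
uint64_t (*scatter)(uint64_t, uint64_t) = scatter_baseline;

__attribute__((constructor))
static void init_dispatch(void)
{
    __builtin_cpu_init();  /* required before cpu_supports in a constructor */
    if (__builtin_cpu_supports("bmi2"))
        scatter = scatter_bmi2;
}
Because this object file defines the pointer that callers reference, the linker will pull it (and its constructor) out of the .a whenever the library is used.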
If you are using gcc, you can get the compiler to implement all the boilerplate code automatically; see the gcc manual page on function multiversioning.
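For illustration, a sketch of what multiversioning looks like (hypothetical function; GCC supports this for C++ from 4.8, and with a sufficiently recent GCC the pdep intrinsic is allowed inside the second definition because of its target attribute):
#include <cstdint>
#include <immintrin.h>

// Baseline version, selected on CPUs without BMI2.
__attribute__((target("default")))
uint64_t scatter_bits(uint64_t v, uint64_t mask)
{
    uint64_t out = 0;
    for (uint64_t src = 1; mask; src <<= 1) {  // portable pdep
        if (v & src)
            out |= mask & -mask;   // deposit into lowest remaining mask bit
        mask &= mask - 1;          // clear that mask bit
    }
    return out;
}

// BMI2 version; GCC emits an ifunc resolver, so the choice is made once,
// at load time, with no per-call branch.
__attribute__((target("bmi2")))
uint64_t scatter_bits(uint64_t v, uint64_t mask)
{
    return _pdep_u64(v, mask);
}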

GCC 4.6.2 inlining behavior

-- snipped from chat.so --
I am stuck with gcc 4.6.2 on a certain project, and after profiling with Intel VTune
I noticed that very insignificant functions were not being inlined (or at least showed up under hotspots, which I assumed meant a failed inline).
An example function is a reinterpret cast, 2 numeric additions, and a ternary statement.
I believe these are being inlined on Windows but, based on the profiling, think they are not being inlined on Linux under gcc 4.6.2.
I am attempting to get an ICC build working on Linux (it works on Windows), but that'll take a little time.
Until then, does anyone know if GCC 4.6.2 is that different from VS2010 in terms of relatively simple compiler optimizations? I've turned on -O3 in GCC.
What led me to this is that this is a rewrite of a significant section of code; on Windows the performance is approximately equal or a little slower, while on Linux it is at least 2x as slow.
The most informative answer would help me understand the steps required to verify inlining across platforms and how best to approach this situation as I understand these things are extremely situation-specific.
EDIT: Also, assuming that business-specific reasons force me to stick with GCC 4.6.2, what can I do about this without rewriting the code to make it less maintainable?
Thanks!
First the super-obvious for completeness: Are you absolutely sure that all the files doing the probably non-inlined calls were compiled with -O3?
The gcc and VS compiler and tool chains are sufficiently different that it wouldn't surprise me at all if their optimizers behaved rather differently.
Next, let me observe that the ternary operator can be very deceiving. Ternary operators are almost certainly going to create a branch, and potentially constructor calls, conversions, etc. Don't assume that just because it's a terse operator in C++, the compiler will be able to generate a tiny amount of code for it; this could potentially inhibit the compiler from optimizing it. In fact, you could try reworking the ternary code into a normal if statement and see if that helps your performance at all.
Then, once you've moved on to further diagnostics, an easy thing to try is strings <binary> | grep function to see if the function name shows up in the binary at all. If it doesn't, then it's definitely being inlined (although even if it shows up, it could be strictly debug information and not actual code). There are other tools, such as nm, readelf, elfdump, and dump, that can inspect binaries for symbols as well. You would need to see which tools are available on your platform and then try to use them to find the function(s) in question.
Another idea is to load the compiled binary into gdb and ask it to disassemble the code at the file and line where the function call is made. Then you can read the disassembly to see what the compiler did; most of the code should actually be fairly obvious. You will likely see something like a call instruction if an actual function call was made.
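Concretely, the checks might look like this (./myapp, myFunction, and myfile.cpp:123 are placeholders):
$ strings ./myapp | grep myFunction    # crude: does the name appear at all?
$ nm -C ./myapp | grep myFunction      # demangled symbol-table lookup
$ gdb ./myapp
(gdb) break myfile.cpp:123             # the line that makes the call
(gdb) run
(gdb) disassemble
If the disassembly around that line contains a call instruction to the function, it was not inlined there.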

How to control whether C math uses SSE2?

I stepped into the assembly of the transcendental math functions of the C library with MSVC in fp:strict mode. They all seem to follow the same pattern, here's what happens for sin.
First there is a dispatch routine from a file called "disp_pentium4.inc". It checks if the variable ___use_sse2_mathfcns has been set; if so, calls __sin_pentium4, otherwise calls __sin_default.
__sin_pentium4 (in "sin_pentium4.asm") starts by transferring the argument from the x87 fpu to the xmm0 register, performs the calculation using SSE2 instructions, and loads the result back in the fpu.
__sin_default (in "sin.asm") keeps the variable on the x87 stack and simply calls fsin.
So in both cases, the operand is pushed on the x87 stack and returned on it as well, making it transparent to the caller, but if ___use_sse2_mathfcns is defined, the operation is actually performed in SSE2 rather than x87.
This behavior is very interesting to me because the x87 transcendental functions are notorious for having slightly different behaviors depending on the implementation, whereas a given piece of SSE2 code should always give reproducible results.
Is there a way to determine for certain, either at compile or run-time, that the SSE2 code path will be used? I am not proficient writing assembly, so if this involves writing any assembly, a code example would be appreciated.
I found the answer through careful investigation of math.h. This is controlled by a function called _set_SSE2_enable, a public symbol documented here:
Enables or disables the use of Streaming SIMD Extensions 2 (SSE2) instructions in CRT math routines. (This function is not available on x64 architectures because SSE2 is enabled by default.)
This causes the aforementioned ___use_sse2_mathfcns flag to be set to the provided value, effectively enabling or disabling use of the _pentium4 SSE2 routines.
The documentation says this affects only certain transcendental functions, but looking at the disassembly, it seems to affect every one of them.
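A minimal check might look like this (MSVC-specific and 32-bit only, since the function does not exist on x64; the return value reports the setting actually applied):
#include <math.h>
#include <stdio.h>

int main(void)
{
    // Ask the CRT to use the SSE2 paths; it returns 0 if it could not.
    if (_set_SSE2_enable(1))
        printf("SSE2 math routines in use\n");
    else
        printf("x87 fallback in use\n");
    return 0;
}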
Edit: stepping into every function reveals that they're all available in SSE2 except for the following:
fmod
sinh
cosh
tanh
sqrt
Sqrt is the biggest offender, but it's trivial to implement in SSE2 using intrinsics. For the others, there's no simple solution except perhaps using a third-party library, but I can probably do without.
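For instance, a double-precision sqrt via SSE2 intrinsics might look like this (a minimal sketch; handling of negative inputs, errno, etc. is omitted):
#include <emmintrin.h>

// sqrt computed entirely in SSE2, bypassing the CRT's x87/SSE2 dispatch.
double sse2_sqrt(double x)
{
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(_mm_sqrt_sd(v, v));
}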
Why not use your own library instead of the C runtime? This would provide an even stronger guarantee of consistency across computers (presumably the C runtime is provided as a DLL and might change slightly over time).
I would recommend CRlibm. If you are already targeting SSE2, and as long as you do not intend to change the FPU's rounding mode, you are in the ideal conditions to use it, and you won't find a more accurate implementation.
The short answer is that you can't tell IN YOUR CODE for certain what the library will do, unless you are also involving library-implementation specific details. These would make the code completely unportable - even two different builds of the same compiler may change the internals of the library.
Of course, if portability isn't an issue, then declaring extern <type> ___use_sse2_mathfcns; and checking whether it's true would clearly work.
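Something like this, with the caveat that the name and type of the flag are internal CRT details (the int type here is an assumption) and may change between CRT builds:
// Fragile: relies on an undocumented internal of the MSVC CRT.
extern int ___use_sse2_mathfcns;

int using_sse2_math(void)
{
    return ___use_sse2_mathfcns != 0;
}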
I expect that if the processor has SSE2 and you are using a modern enough library, it would use SSE2 wherever possible. But to say that for certain is a different matter.
If this is critical for your code, then implement your own transcendental functions and use those; that's the only way to guarantee the same result. Or, use some suitable inline assembler or your own transcendental code to calculate selected sin, cos, etc. values, and compare those with the sin() and cos() functions provided by the library.

Tool for finding which functions can ultimately cause a call to a (list of) low-level functions

I have a very large C++ program where certain low-level functions should only be called from certain contexts or while taking specific precautions. I am looking for a tool that shows me which of these low-level functions are called by much higher-level functions. I would prefer this to be visible in the IDE with some drop-down or labeling, possibly in annotated source output, but any method easier than manually searching the call graph will help.
This is a problem of static analysis and I'm not helped by a profiler.
I am mostly working on Mac; Linux is OK, and if something is only available on Windows, I can live with that.
Update
Just having the call graph does not make it that much quicker to answer the question "does foo() potentially cause a call to x(), y(), or z()?" (Or am I missing something about the call-graph tools? Perhaps I need to write a program that traverses the graph to get a solution?)
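If you do end up with a raw call graph, the traversal alluded to above is a plain reachability search; a sketch (the graph representation is hypothetical, filled in from whatever extraction tool you use):
#include <map>
#include <set>
#include <string>
#include <vector>

using CallGraph = std::map<std::string, std::vector<std::string>>;

// Does 'from' transitively reach any function in 'targets'?
bool canReach(const CallGraph& g, const std::string& from,
              const std::set<std::string>& targets)
{
    std::set<std::string> seen;
    std::vector<std::string> stack{from};
    while (!stack.empty()) {
        std::string f = stack.back();
        stack.pop_back();
        if (targets.count(f)) return true;
        if (!seen.insert(f).second) continue;  // already visited
        auto it = g.find(f);
        if (it != g.end())
            stack.insert(stack.end(), it->second.begin(), it->second.end());
    }
    return false;
}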
There is the Clang Static Analyzer, which is built on LLVM and so should also be present on OS X; actually, I'm of the opinion that it is integrated into Xcode. In any case, there is a GUI for it.
Furthermore, there are several LLVM passes with which you can generate call graphs, but I'm not sure whether this is what you want.
Scientific Toolworks' "Understand" tool is supposed to be able to produce call graphs for C and C++.
Doxygen also supposedly produces call graphs.
I don't have any experience with either of these, but I do have some harsh opinions. You need to keep in mind that I'm a vendor of another tool, so take this opinion with a big grain of salt.
I have experience building reasonably accurate call graphs for massive C systems (25 million lines) with 250,000 functions.
One issue I encounter in building a realistic call graph is indirect function calls and, for C++, overloaded method calls. In big systems, there are a lot of both. To determine what gets called when FOO gets invoked, your tool has to have a deep semantic understanding of how the compiler/language resolves an overloaded call and, for indirect function calls, a reasonably precise determination of what a function pointer might actually point to in a big system. If you don't get these reasonably right, your call graph will contain a lot of false positives (e.g., bogus claims that A calls B), and at scale false positives are a disaster.
For C++, you must have what amounts to a full compiler front end. Neither Understand nor Doxygen has this, so I don't see how they can actually understand C++'s overloading/Koenig lookup rules. Neither Understand nor Doxygen makes any attempt that I know of to reason about indirect function calls.
Our DMS Software Reengineering Toolkit does build call graphs for C reasonably well, even with indirect function pointers, using a language-precise C front end.
We have a language-precise C++ front end, and it does overload resolution correctly (to the extent the C++ committee agrees on it, we understand what they said, and the individual compilers agree [they don't always]), and we have something like Doxygen that shows this information. We don't presently have function-pointer analysis for C++, but we are working on it (we have full control-flow graphs within methods, and that's a big step).
I understand Clang has some option for computing call graphs, and I'd expect it to be accurate on overloads, since Clang is essentially a C++ compiler implemented as a set of components. I don't know what, if anything, Clang does to analyze function pointers.

Taking advantage of SSE and other CPU extensions

There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it takes considerable time to process these.
I believe that using SSE to implement these loops should improve their performance significantly, especially where many operations are carried out on the same set of data, so once the data is read into the cache initially, there shouldn't be any cache misses to stall it. However, I'm not sure how to go about this.
Is there a compiler- and OS-independent way of writing the code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross-compiler solutions.
I still need to support some CPUs that have either no or limited SSE support (e.g. Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run-time linker" that links in either the basic or the SSE-optimised code based on the CPU running it when the process is started?
What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows there are a few of them.
For your second point there are several solutions as long as you can separate out the differences into different functions:
plain old C function pointers
dynamic linking (which generally relies on C function pointers)
if you're using C++, having different classes that represent the support for different architectures and using virtual functions can help immensely with this.
Note that because you'd be relying on indirect function calls, the functions that abstract the different operations generally need to represent somewhat higher level functionality or you may lose whatever gains you get from the optimized instruction in the call overhead (in other words don't abstract the individual SSE operations - abstract the work you're doing).
Here's an example using function pointers:
typedef int (*scale_func_ptr)(int scalar, int* pData, int count);

int non_sse_scale(int scalar, int* pData, int count)
{
    // do whatever work needs done, without SSE, so it'll work on older CPUs
    return 0;
}

int sse_scale(int scalar, int* pData, int count)
{
    // equivalent code, but using SSE
    return 0;
}

// at initialization (useSSE comes from your CPU detection, e.g. cpuid):
scale_func_ptr scale_func = non_sse_scale;
if (useSSE) {
    scale_func = sse_scale;
}

// now, when you want to do the work; this calls the routine tailored to SSE
// if the CPU supports it, otherwise the non-SSE version:
scale_func(12, theData_ptr, 512);
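The useSSE flag itself would come from CPU detection at startup; one way on GCC/Clang is sketched below (MSVC users would call __cpuid from <intrin.h> and test the SSE feature bit instead):
int useSSE = 0;

void detect_cpu(void)
{
    useSSE = __builtin_cpu_supports("sse") != 0;
}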
Good reading on the subject: Stop the instruction set war
Short overview: sorry, it is not possible to solve your problem in a simple and maximally compatible (Intel vs. AMD) way.
The SSE intrinsics work with Visual C++, GCC and the Intel compiler. There is no problem using them these days.
Note that you should always keep a version of your code that does not use SSE, and constantly check it against your SSE implementation.
This helps not only with debugging; it is also useful if you want to support CPUs or architectures that don't support your required SSE versions.
In answer to your comment:
So effectively, as long as I don't try to actually execute code containing unsupported instructions I'm fine, and I could get away with an "if (sse2Supported) {...} else {...}" type switch?
Depends. It's fine for SSE instructions to exist in the binary as long as they're not executed. The CPU has no problem with that.
However, if you enable SSE support in the compiler, it will most likely swap a number of "normal" instructions for their SSE equivalents (scalar floating-point ops, for example), so even chunks of your regular non-SSE code will blow up on a CPU that doesn't support it.
So what you'll most likely have to do is compile one or two files separately, with SSE enabled, and let them contain all your SSE routines. Then link that with the rest of the app, which is compiled without SSE support.
Rather than hand-coding an alternative SSE implementation of your scalar code, I strongly suggest you have a look at OpenCL. It is a vendor-neutral, portable, cross-platform system for computationally intensive applications (and is highly buzzword-compliant!). You can write your algorithm in a subset of C99 designed for vectorised operations, which is much easier than hand-coding SSE. And best of all, OpenCL will generate the best implementation at runtime, to execute either on the GPU or on the CPU. So basically you get the SSE code written for you.
There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it takes considerable time to process these.
Your application sounds like just the kind of problem that OpenCL is designed to address. Writing alternative functions in SSE would certainly improve the execution speed, but it is a great deal of work to write and debug.
Is there a compiler- and OS-independent way of writing the code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross-compiler solutions.
Yes. The SSE intrinsics have been essentially standardised by Intel, so the same functions work the same between Windows, Linux and Mac (specifically with Visual C++ and GNU g++).
I still need to support some CPUs that have either no or limited SSE support (e.g. Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run-time linker" that links in either the basic or the SSE-optimised code based on the CPU running it when the process is started?
You could do that (e.g. using dlopen()), but it is a very complex solution. Much simpler would be (in C) to define a function interface and call the appropriate version of the optimised function via function pointer, or (in C++) to use different implementation classes, depending on the CPU detected.
With OpenCL it is not necessary to do this, as the code is generated at runtime for the given architecture.
What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows there are a few of them.
Within the SSE instruction set, there are many flavours. It can be quite difficult to code the same algorithm in different subsets of SSE when certain instructions are not present. I suggest (at least to begin with) that you choose a minimum supported level, such as SSE2, and fall back to the scalar implementation on older machines.
This is also an ideal situation for unit/regression testing, which is very important to ensure your different implementations produce the same results. Have a test suite of input data and known-good output data, and run the same data through both versions of the processing function. You may need a precision test for passing (i.e. the difference between the result and the correct answer is below some epsilon, for example 1e-6). This will greatly aid in debugging, and if you build high-resolution timing into your testing framework, you can compare the performance improvements at the same time.
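A sketch of such a harness (scalar_op and sse_op are placeholders for the two implementations under test, and the tolerance is just an example):
#include <cmath>
#include <cstdio>

float scalar_op(float x);   // reference implementation
float sse_op(float x);      // optimised implementation

// Run both versions over a test set and flag any result that differs by
// more than an absolute epsilon.
bool run_regression(const float* inputs, int n, float eps = 1e-6f)
{
    for (int i = 0; i < n; ++i) {
        float a = scalar_op(inputs[i]);
        float b = sse_op(inputs[i]);
        if (std::fabs(a - b) > eps) {
            std::printf("mismatch at %d: %g vs %g\n", i, a, b);
            return false;
        }
    }
    return true;
}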