Does .so size affect virtual function performance? - C++

Recently I ran into a performance problem. In the VTune results, virtual function calls are always the number-one cost. When I reduced the .so size from 48 MB to 37 MB, performance seemed better, improving by 3.9%.
I want to know: does the .so size really affect virtual function performance, and if so, why? Thanks!

It is not purely size (though of course that affects paging after the program is loaded), but the number of adjustments (relocations) the loader must make when loading a program. You can measure this by setting the environment variable
LD_DEBUG=statistics
Virtual functions in particular require a lot of adjustments during loading. For discussion of this, see:
Measure time taken dynamically linking at program startup?
Faster C++ program startups by improving runtime linking efficiency
Position Independent Executable (PIE) Performance

I used the method from the article (blogs.oracle.com/ali/entry/the_cost_of_elf_symbol) suggested by @ErikAlapää: using RTLD_LAZY instead of RTLD_NOW when dlopen-ing the .so. However, it doesn't seem to help. When I compile it with fewer objects, it gets better. It seems paging and cache behavior really do affect the process's performance.

Related

interpreting _dl_runtime_resolve_xsave'2 in callgrind output

Looking at the output of callgrind for my program run, I see that 125% !!! of the cycles are spent in _dl_runtime_resolve_xsave'2 (apparently part of the dynamic linker) while 100% is spent in main. But it also says that almost all the time spent inside _dl_runtime_resolve_xsave'2 is actually spent in inner methods (self=0%) but callgrind does not show any callees for this method.
Moreover, it looks like _dl_runtime_resolve_xsave'2 is called from several places in the program I am profiling.
I can understand that some time could be spent outside of main, because the program I am profiling uses the prototype pattern and many object prototypes are built when their dynamic libraries are loaded, but this cannot amount to anywhere close to 25% of the time of that particular run (if I do that run with no input data, it takes orders of magnitude less time than the run I am profiling now).
Also the program is not using dlopen to open shared objects after the program start. Everything should be loaded at the start.
Here is a screenshot of the kcachegrind window:
How can I interpret those calls to _dl_runtime_resolve_xsave'2? Do I need to be concerned by the time spent in this method?
Thank you for your help.
_dl_runtime_resolve_xsave is used in the glibc dynamic loader during lazy binding. It looks up the function symbol during the first call to a function and then performs a tail call to the implementation. Unless you use something like LD_BIND_NOT=1 in the environment when launching the program, this is a one-time operation that happens only during the first call to each function. Lazy binding has some cost, but unless you have many functions that are called exactly once, it will not contribute much to the execution cost. It is more likely a reporting artifact, perhaps related to the tail call or to the rather exotic XSAVE instruction used in _dl_runtime_resolve_xsave.
You can disable lazy binding by launching the program with the LD_BIND_NOW=1 environment variable; the dynamic loader trampoline will not be used because all functions are resolved at startup. Alternatively, you can link with -Wl,-z,now to make this change permanent (at least for the code you link; system libraries may still use lazy binding for their own function symbols).

Visual Studio Profile Guided Optimization

I have a native C++ application which performs heavy calculations and consumes a lot of memory. My goal is to optimize it, mainly reduce its run time.
After several cycles of profiling-optimizing, I tried the Profile Guided Optimization which I never tried before.
I followed the steps described on MSDN Profile-Guided Optimizations, changing the compilation (/GL) and linking (/LTCG) flags. After adding /GENPROFILE, I ran the application to create the .pgc and .pgd files, then changed the linker options to /USEPROFILE and watched the additional linker messages reporting that the profiling data was used:
3> 0 of 0 ( 0.0%) original invalid call sites were matched.
3> 0 new call sites were added.
3> 116 of 27096 ( 0.43%) profiled functions will be compiled for speed, and the rest of the functions will be compiled for size
3> 63583 of 345025 inline instances were from dead/cold paths
3> 27096 of 27096 functions (100.0%) were optimized using profile data
3> 608324578581 of 608324578581 instructions (100.0%) were optimized using profile data
3> Finished generating code
Everything looked promising, until I measured the program's performance.
The results were absolutely counterintuitive for me:
Performance went down instead of up! It was 4% to 5% slower than without Profile Guided Optimization (comparing runs with and without the /USEPROFILE option).
Even when running the exact same scenario that was used with /GENPROFILE to create the profiling data files, it ran 4% slower.
What is going on?
Looking at the sparse doc here, the optimizer doesn't seem to include any memory optimizations.
If your program takes 2 GiB of memory, I'd speculate that the execution speed is limited by memory access and not by the CPU itself. (You also stated something about maps being used; those are memory-limited too.)
Memory access is difficult for a profile-guided optimizer to improve, because it can't change your malloc calls to (for example) put frequently used data into the same pages or make sure it lands on the same cache line of the CPU.
In addition, the optimizer may introduce extra memory accesses while trying to optimize the bare CPU performance of your program.
The doc mentions "virtual call speculation"; I would speculate that this (and maybe other features like inlining) could introduce additional memory traffic, degrading overall performance because memory bandwidth is already the limiting factor.
Don't look at it as a black box. If the program can be sped up, it's because it is doing things it doesn't need to do.
Those things will hide from the profile-guided or any other optimizer, and they will certainly hide from your guesser.
They won't hide from this. Many people use it.
I'm trying to resist the temptation to guess, but I'm failing.
Here's what I see in big C++ apps, no matter how well-written they are.
When people could use a simple data structure like an array, instead they use an abstract container class, with iterators and whatnot. Where does the time go?
That's where it goes.
Another thing they do is write "powerful functions and methods". The writer of the function is so proud of it, that it does so much, that he/she expects it will be called reverently and sparingly.
The user of the function (which could be the same person) thinks "Look how useful this function is! See how much I can get done in a single line of code? The more I use it the more productive I will be."
See how this can easily do needless work?
There's another thing that happens in software - layers of abstraction.
If the pattern above is repeated over several layers of abstraction, the slowdown factors multiply.
The good news is, if you can find those, and if you can fix them, you can get enormous speedup. The bad news is you could be branded "not a team player".

calling exe performance cost (compared to DLL)

We were discussing the possibility of using an exe instead of a DLL inside C or C++ code. The idea would be, in some cases, to use an exe and pass arguments to it. (I guess it's equivalent to somehow loading its main function as if it were a DLL.)
The question we were wondering about is: does it imply a performance cost (especially in a loop with more than one iteration)?
I tried to look in existing threads, but nobody answered this specific question. I saw that calling a function from a DLL has an overhead for the first call, but subsequent calls take only 1 or 2 extra instructions.
For the exe case, each call will need to create a separate process so it can run (a second process if I need to open a shell to launch it, but from my research I can do it without calling a shell). This process creation should cost some performance, I'd guess. Moreover, I think the exe will be loaded into RAM each time, destroyed at the end of the process, then reloaded for the next call, and so on; a problem that is not present (?) with a DLL.
PS: we were discussing this question more on a theoretical level than for implementing it, it's a question for the sake of learning.
The costs of running an exe are tremendous compared to calling a function from a DLL. If you can do it with a DLL, then you should if performance matters.
Of course, there may be other factors to consider: for example, when there is a bug in the called code that crashes the process, in the exe case it is merely that exe that goes down and the caller survives, but if the bug is in a DLL, the caller crashes too.
Clearly, a DLL is going to get loaded, and if you call to it many times in a short time, it will have a benefit. If the time between calls is long enough, the DLL content may get evicted from RAM and have to be loaded from disk again (yes, that's hard to specify, and partly depends on the memory usage on the system).
However, executable files do get cached in memory, so the cost of "loading the executable" isn't that big. Yes, you have to create a new process and destroy it at the end, with all the related memory management code. For a small executable, this will be relatively light work, for a large, complex executable, it may be quite a long time.
Bear in mind that executing the same program many times isn't unusual - compiling a large project or running some sort of script on a large number of files, just to give a couple of simple examples. So the performance of this will be tuned by OS developers.
Obviously, the "retain stuff in RAM" applies to both DLL and EXE - it's basic file-caching done by the OS.

I should avoid static compilation because of cache miss?

The title sums up pretty much the entire story, I was reading this and the key point is that
A bigger executable means more cache misses
and since a static executable is by definition bigger than a dynamically linked one, I'm curious what the practical considerations are in this case.
The article in the link discusses the side effect of inlining small functions in the OS kernel. This indeed has a noticeable effect on performance, because the same function is called from many different places throughout a sequence of system calls. For example, if you call open and then call read, seek, and write, open will store a file handle somewhere in the kernel, and in the calls to read, seek, and write, that handle has to be "found". If that's an inlined function, we now have three copies of it in the cache, and no benefit at all from read having called the same function as seek and write do. If it's a non-inline function, it will indeed be ready in the cache when seek and write call it.
For a given process, whether the code is linked statically or dynamically will have very small impact once the application is fully loaded. If there are MANY copies of the application running, then other processes may benefit from re-using the same memory for the shared libraries. But the size needed for that one process remains the same whether it is shared with 0, 1, 3, or 100 other processes. The benefit of sharing binary files across many executables comes from things like the C library that sits behind almost every executable in the system: when you have 1000 processes running that ALL use the same basic runtime, there is one copy of the code rather than 1000. But that is unlikely to have much effect on the cache efficiency of any particular application; perhaps common functions like strcpy are used often enough that there is a small chance that when the OS task-switches, they are still in the cache when the next application calls strcpy.
So, in summary: probably doesn't make any difference at all.
The overall memory footprint of the static version is the same as that of the dynamic version; remember that the dynamically-linked objects still need to be loaded into memory!
Of course, one could also argue that if there are multiple processes running, and they all dynamically link against the same object, then only one copy is required in memory, and so the aggregate footprint is lower than if they had all statically linked.
[Disclaimer: all of the above is educated guesswork; I've never measured the effect of linking on cache behaviour.]

Is a DLL slower than a static link?

I made a GUI library for games. My test demo runs at 60 fps. When I run the demo with the static version of my library, it takes 2-3% CPU in Task Manager. When I use the DLL version, it uses around 13-15%. Is that normal? If so, how can I optimize it? I already build with /O2 to get the most function inlining.
Do not start your performance timer until the DLL has had opportunity to execute its functionality one time. This gives it time to load into memory. Then start the timer and check performance. It should then basically match that of the static lib.
Also keep in mind that the load location of the DLL can greatly affect how quickly it loads. The default preferred base address for DLLs in Visual Studio is 0x10000000. If some other DLL is already loaded at that location, the load process must perform an expensive rebasing step, which will throw off your timing even more.
If you have such a conflict, just choose a different base address in Visual Studio.
You will have the overhead of loading the DLL (should be just once at the beginning). It isn't statically linked in with direct calls, so I would expect a small amount of overhead but not much.
However, some DLLs will have much higher overheads. I'm thinking of COM objects although there may be other examples. COM adds a lot of overhead on function calls between objects.
If you call DLL functions, they cannot be inlined into the caller. You should think a little about your DLL boundaries.
It may be better for your application to have a small bootstrap exe which just executes a main loop inside your DLL. That way you can avoid much of the cross-boundary function call overhead.
It's a little unclear what's being statically/dynamically linked. Is the DLL of your lib statically linked with its dependencies? Is it possible that the DLL is calling other DLLs (that will be slow)? Maybe try running a profiler such as callgrind (from valgrind) on your executable to determine where all the CPU usage is coming from.