How to profile lock contention under g++/std::mutex? - c++

Question
Are there any open-source tools, or does anyone have any techniques/code, for profiling the degree of std::mutex contention in running code?
I would like to measure the percentage of lock contention (either by time or by number of acquisitions) for each std::mutex instance. If there is a drop-in tool that doesn't require recoding, that would be even better.
I am looking for a technique that will work with std::thread and g++: at the exit of the application, I would like to dump a profile of mutex contention statistics into a log file, so that I can monitor the quality of the threading code under actual running conditions.
Note
I have seen this thread. Unfortunately, the answers either require a pile of cash or run on Windows.

I recommend something like AMD CodeXL or Intel VTune. CodeXL is free; Intel VTune has a free academic license if that's applicable to you, or you can try a 30-day trial. Both of them work on Linux.
At the most basic level, these tools can identify hotspots by, e.g., measuring how much time you are spending inside the methods of std::mutex. There are other, more advanced analysis techniques/tools included in each that may help you even further. You don't need to change your code at all, although you may need to check that you compiled with debug symbols and/or haven't stripped the binaries. You will also probably want to stay away from extreme optimization levels like -O3, and stick to -O1, -O2 or -Og.
PS: As with all optimization inquiries, I must remind you to always measure where your performance problems actually are before you start optimizing. No matter how worried you are about lock contention, validate your concerns with a profiler before you spend huge effort trying to alleviate whatever lock contention you may or may not have.
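That said, if these tools don't break the numbers down per std::mutex instance, and a little recoding is acceptable, you can count contention yourself. Below is a minimal sketch, not a drop-in tool: CountingMutex is a made-up wrapper whose lock() first calls try_lock() and counts the acquisition as contended when that fails.
#include <atomic>
#include <cstdio>
#include <mutex>

// Sketch only: "CountingMutex" is a made-up name, not a standard or library type.
// It counts how many acquisitions had to wait because another thread held the lock.
class CountingMutex {
public:
    void lock() {
        if (!m_.try_lock()) {                                   // contended: someone else holds it
            contended_.fetch_add(1, std::memory_order_relaxed);
            m_.lock();                                          // now block as usual
        }
        total_.fetch_add(1, std::memory_order_relaxed);
    }
    bool try_lock() {
        bool ok = m_.try_lock();
        if (ok) total_.fetch_add(1, std::memory_order_relaxed);
        return ok;
    }
    void unlock() { m_.unlock(); }

    // Dump statistics, e.g. from an atexit() handler or at shutdown.
    void report(const char* name) const {
        unsigned long t = total_.load(), c = contended_.load();
        std::fprintf(stderr, "%s: %lu acquisitions, %lu contended (%.1f%%)\n",
                     name, t, c, t ? 100.0 * c / t : 0.0);
    }

private:
    std::mutex m_;
    std::atomic<unsigned long> total_{0};
    std::atomic<unsigned long> contended_{0};
};
Because the wrapper satisfies the Lockable requirements it drops straight into std::lock_guard and std::unique_lock, and report() can be called at shutdown to dump the statistics to a log, as you describe. The try_lock-then-lock sequence perturbs timing slightly, so treat the percentages as indicative rather than exact.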

Related

Threading analysis in Vtune hangs at __kmp_acquire_ticket_lock

I am currently benchmarking a project written in C++ to determine the hot spots and the threading efficiency, using Intel VTune. When running the program normally it runs for ~15 minutes. Using the hotspot analysis in VTune I can see that the function __kmp_fork_barrier is taking up roughly 40% of the total CPU time.
Therefore, I also wanted to see the threading efficiency, but when starting the threading analysis in VTune, it does not start the project at all; instead it hangs at __kmp_acquire_ticket_lock when running in hardware event-based sampling mode. When running in user-mode sampling mode instead, the project immediately fails with a segfault (which does not occur when running it without VTune and checking it with valgrind). When using HPC Performance Characterization instead, VTune crashes.
Are those issues with VTune, or with my program? And how can I find the issues with the latter?
__kmp_xxx calls are functions of the Intel/Clang OpenMP runtime. __kmp_fork_barrier is called when an OpenMP barrier is reached. If you spend 40% of your time in this function, it means you have a load balancing issue between the OpenMP threads in your program. You need to fix this work imbalance to get better performance. You can use the (experimental) OMPT support of the runtimes to track what threads are doing and when they do it. VTune should have at least minimal support for profiling OpenMP programs. Encountering a VTune crash is likely a bug, and it should be reported on the Intel forum so that VTune developers can fix it. On your side, you can check that your program always passes all OpenMP barriers in a deterministic way. For more information, you can look at the Intel VTune OpenMP tutorial.
Note that the VTune results also suggest that your OpenMP runtime is configured so that threads actively poll the state of other threads, which is good for reducing latency but not always for performance or energy savings. You can control this behaviour of the runtime using the environment variable OMP_WAIT_POLICY.
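To confirm the imbalance independently of VTune, you can time each thread's share of the work just before the implicit barrier. Here is a minimal, self-contained sketch (do_chunk is an artificial, deliberately imbalanced stand-in for the real work), compiled with g++ -fopenmp:
#include <cmath>
#include <cstdio>
#include <omp.h>

// Stand-in workload, deliberately imbalanced so the effect is visible.
static double do_chunk(int tid) {
    double s = 0.0;
    for (long i = 0; i < 1000000L * (tid + 1); ++i) s += std::sin(double(i));
    return s;
}

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double t0 = omp_get_wtime();
        volatile double r = do_chunk(tid);   // volatile keeps the work from being optimized away
        (void)r;
        double busy = omp_get_wtime() - t0;
        // Threads that finish early spend the difference waiting in the runtime's
        // barrier; a large spread in "busy" across threads is the load imbalance.
        std::printf("thread %d busy for %.3f s\n", tid, busy);
    }
    return 0;
}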

gcc/C++: If CPU load is low, then code optimization is of little use, true?

My colleague likes to use gcc with '-g -O0' for building production binaries because debugging is easy if a core dump happens. He says there is no need to use compiler optimization or tweak the code because he finds the process in production does not have a high CPU load, e.g. around 30%.
I asked him the reason behind that and he told me: if the CPU load is not high, the bottleneck must not be our code's performance, but rather some I/O (disk/network). So using gcc -O2 is of no use for improving latency and throughput. It also indicates we don't have much to improve in the code, because the CPU is not a bottleneck. Is that correct?
About CPU usage ~ optimisation
I would expect most optimisation problems in a program to correlate with higher-than-usual CPU load, because a sub-optimal program, by definition, does more work than it theoretically needs to. But "usual" here is a complicated word. I don't think you can pick a hard value of system-wide CPU load percentage at which optimisation becomes useful.
If my program reallocates a char buffer in a loop, when it doesn't need to, my program might run ten times slower than it needs to, and my CPU usage may be ten times higher than it needs to be, and optimising the function may yield ten-fold increases in application performance … but the CPU usage may still only be 0.5% of the whole system capacity.
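As a contrived sketch of that pattern (the function and its buffer are invented purely for illustration):
#include <cstring>
#include <vector>

// Contrived sketch of the pattern described above: the buffer is allocated on
// every iteration even though one allocation outside the loop would do.
void process(const std::vector<const char*>& items) {
    for (const char* item : items) {
        char* buffer = new char[4096];        // needless allocation every pass
        std::strncpy(buffer, item, 4095);
        buffer[4095] = '\0';
        // ... work on buffer ...
        delete[] buffer;
    }
}
// Hoisting the allocation out of the loop can make this function several times
// faster, yet its share of system-wide CPU may still be well under 1%.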
Even if we were to choose a CPU load threshold at which to begin profiling and optimising, on a general-purpose server I'd say that 30% is far too high. But it depends on the system, because if you're programming for an embedded device that only runs your program, and has been chosen and purchased because it has just enough power to run your program, then 30% could be relatively low in the grand scheme of things.
Further still, not all optimisation problems will indeed have anything to do with higher-than-usual CPU load. Perhaps you're just waiting in a sleep longer than you actually need to, causing message latency to increase but substantially reducing CPU usage.
tl;dr: Your colleague's view is simplistic, and probably doesn't match reality in any useful way.
About build optimisation levels
Relating to the real crux of your question, though, it's fairly unusual to deploy a release build with all compiler optimisations turned off. Compilers are designed to emit pretty naive code at -O0, and to do the sort of optimisations that are pretty much "standard" in 2016 at -O1 and -O2. You're generally expected to turn these on for production use, otherwise you're wasting a huge portion of a modern compiler's capability.
Many folks also tend not to use -g in a release build, so that the deployed binary is smaller and easier for your customers to handle. You can drop a 45MB executable to 1MB by doing this, which is no pocket change.
Does this make debugging more difficult? Yes, it can. Generally, if a bug is located, you want to receive reproduction steps that you can then repeat in a debug-friendly version of your application and analyse the stack trace that comes out of that.
But if the bug cannot be reproduced on demand, or it can only be reproduced in a release build, then you may have a problem. It may therefore seem reasonable to keep basic optimisations on (-O1) but also keep debug symbols in (-g); the optimisations themselves shouldn't vastly hinder your ability to analyse the core dump provided by your customer, and the debug symbols will allow you to correlate the information to source code.
That being said, you can have your cake and eat it too:
Build your application with -O2 -g
Copy the resulting binary
Perform strip on one of those copies, to remove the debug symbols; the binaries will otherwise be identical
Store them both forever
Deploy the stripped version
When you have a core dump to analyse, debug it against your original, non-stripped version
You should also have sufficient logging in your application to be able to track down most bugs without needing any of this.
Under certain circumstances he could be correct, and under others mostly incorrect (while under some he's totally correct).
Suppose the task runs for 1 s: the CPU is busy for 0.3 s and waits for something else for 0.7 s. If you optimized the code and, say, got a 100% improvement (twice as fast), the CPU would finish what took 0.3 s in 0.15 s, so the task would complete in 0.85 s instead of 1 s (given that the wait for something else takes the same time).
However, if you've got a multicore situation, the CPU load is sometimes defined as the fraction of total processing power being used. So if one core runs at 100% and two others are idle, the CPU load would be 33%, and in such a scenario a 30% CPU load may be because the program is only able to make use of one core. In that case optimizing the code could improve performance drastically.
Note that sometimes what is thought to be an optimization is actually a pessimization, which is why it's important to measure. I've seen a few "optimizations" that reduce performance. Also, sometimes optimizations alter behaviour (in particular when you "improve" the source code), so you should make sure they don't break anything by having proper tests. After doing performance measurements you should decide whether it's worth trading debuggability for speed.
A possible improvement might be to compile with gcc -Og -g using a recent GCC. The -Og optimization level is debugger-friendly.
Also, you can compile with gcc -O1 -g; you get many (simple) optimizations, so performance is usually 90% of -O2 (with of course some exceptions, where even -O3 matters). And the core dump is usually debuggable.
And it really depends upon the kind of software and the required reliability and ease of debugging. Numerical code (HPC) is quite different from small database post-processing.
Lastly, using -g3 instead of -g might help (e.g. gcc -Wall -O1 -g3).
BTW, synchronization issues and deadlocks might be more likely to appear in optimized code than in non-optimized code.
It's really simple: CPU time is not free. We like to think that it is, but it's patently false. There are all sorts of magnification effects that make every cycle count in some scenarios.
Suppose that you develop an app that runs on a million mobile devices. Every second your code wastes is worth 1-2 years of continuous device use on a 4-core device. Even with 0% CPU utilization, wall-time latency costs you backlight time, and that's not to be ignored either: the backlight uses about 30% of a device's power.
Suppose that you develop an app that runs in a data center. Every 10% of the core that you're using is what someone else won't be using. At the end of the day, you've only got so many cores on a server, and that server has power, cooling, maintenance and amortization costs. Every 1% of CPU usage has costs that are simple to determine, and they aren't zero!
On the other hand: developer time isn't free, and every second of a developer's attention requires commensurate energy and resource inputs just to keep her or him alive, fed, well and happy. Yet, in this case all the developer needs to do is flip a compiler switch. I personally don't buy the "easier debugging" myths. Modern debugging information is expressive enough to capture register use, value liveness, code replication and such. Optimizations don't really get in the way as they did 15 years ago.
If your business has a single, underutilized server, then what the developer is doing might be OK, practically speaking. But all I see here really is an unwillingness to learn how to use the debugging tools or proper tools to begin with.

How to profile OpenMP bottlenecks

I have a loop that has been parallelized by OpenMP, but due to the nature of the task, there are 4 critical clauses.
What would be the best way to profile the speed up and find out which of the critical clauses (or maybe non-critical(!) ) take up the most time inside the loop?
I use Ubuntu 10.04 with g++ 4.4.3
Scalasca is a nice tool for profiling OpenMP (and MPI) codes and analyzing the results. TAU is also very nice but much harder to use. The Intel tools, like VTune, are also good but very expensive.
Arm MAP has OpenMP and pthreads profiling - and works without needing to instrument or modify your source code. You can see synchronization issues and where threads are spending time to the source line level. The OpenMP profiling blog entry is worth reading.
MAP is widely used for high performance computing, as it also profiles multiprocess applications such as MPI programs.
OpenMP includes the functions omp_get_wtime() and omp_get_wtick() for measuring timing performance (docs here); I would recommend using these.
Otherwise try a profiler. I prefer the Google CPU profiler, which can be found here.
There is also the manual way described in this answer.
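In a similar manual spirit, here is a minimal sketch of timing the critical sections yourself with omp_get_wtime() (the section names and shared variables are invented, and only two of the four sections are shown for brevity):
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1000000;
    long shared_a = 0, shared_b = 0;              // stand-ins for the real shared state
    double crit_a_time = 0.0, crit_b_time = 0.0;  // only two of the four sections shown

    #pragma omp parallel for reduction(+:crit_a_time, crit_b_time)
    for (int i = 0; i < n; ++i) {
        double t0 = omp_get_wtime();
        #pragma omp critical(section_a)
        { shared_a += i; }                        // stand-in for the real critical work
        crit_a_time += omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        #pragma omp critical(section_b)
        { shared_b -= i; }
        crit_b_time += omp_get_wtime() - t0;
    }

    // Summed over all threads: includes both the time spent waiting to enter
    // each critical section and the time spent inside it.
    std::printf("critical A: %.3f s, critical B: %.3f s (a=%ld, b=%ld)\n",
                crit_a_time, crit_b_time, shared_a, shared_b);
    return 0;
}
Each thread's measured time includes both the wait to enter the section and the work inside it, which is exactly the cost you want to compare across sections; calling omp_get_wtime() twice per iteration adds overhead of its own, so treat the numbers as relative.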
There is also the ompP tool, which I have used a number of times over the last ten years. I have found it really useful for identifying and quantifying load imbalance and parallel/serial regions. The web page seems to be down now, but I also found it on the Web Archive earlier this year.

How to profile multi-threaded C++ application on Linux?

I used to do all my Linux profiling with gprof.
However, with my multi-threaded application, its output appears to be inconsistent.
Now, I dug this up:
http://sam.zoy.org/writings/programming/gprof.html
However, it's from a long time ago, and in my gprof output it appears that gprof is listing functions used by non-main threads.
So, my questions are:
In 2010, can I easily use gprof to profile multi-threaded Linux C++ applications? (Ubuntu 9.10)
What other tools should I look into for profiling?
Edit: added another answer on poor man's profiler, which IMHO is better for multithreaded apps.
Have a look at oprofile. The profiling overhead of this tool is negligible and it supports multithreaded applications, as long as you don't want to profile mutex contention (which is a very important part of profiling multithreaded applications).
Have a look at poor man's profiler. Surprisingly, few other tools do both CPU profiling and mutex contention profiling for multithreaded applications; PMP does both, and it doesn't even require installing anything (as long as you have gdb).
Try the modern Linux profiling tool perf (perf_events): https://perf.wiki.kernel.org/index.php/Tutorial and http://www.brendangregg.com/perf.html:
perf record ./application
# generates profile file perf.data
perf report
Have a look at Valgrind.
As Paul R said, have a look at Zoom. You can also use lsstack, which is a low-tech approach but surprisingly effective, compared to gprof.
Added: Since you clarified that you are running OpenGL at 33ms, my prior recommendation stands. In addition, what I personally have done in situations like that is both effective and non-intuitive. Just get it running with a typical or problematic workload, and just stop it, manually, in its tracks, and see what it's doing and why. Do this several times.
Now, if it only occasionally misbehaves, you would like to stop it only while it's misbehaving. That's not easy, but I've used an alarm-clock interrupt set for just the right delay. For example, if one frame out of 100 takes more than 33ms, at the start of a frame, set the timer for 35ms, and at the end of a frame, turn it off. That way, it will interrupt only when the code is taking too long, and it will show you why. Of course, one sample might miss the guilty code, but 20 samples won't miss it.
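A minimal sketch of that alarm-clock trick on Linux, assuming the program runs under gdb so that the raised signal stops it mid-frame (render_frame() and the 35 ms budget are just the example figures from above):
#include <csignal>
#include <sys/time.h>

// Fires only if the frame is still running when the timer expires. Under gdb,
// raising SIGTRAP stops execution right where the slow code is.
static void on_deadline(int) { std::raise(SIGTRAP); }

static void arm_frame_watchdog(long usec) {
    struct itimerval tv = {};
    tv.it_value.tv_usec = usec;           // one-shot: it_interval stays zero
    setitimer(ITIMER_REAL, &tv, nullptr);
}

static void disarm_frame_watchdog() {
    struct itimerval off = {};
    setitimer(ITIMER_REAL, &off, nullptr);
}

// Per frame (render_frame() is a hypothetical stand-in for your frame work):
//   std::signal(SIGALRM, on_deadline);   // once, at startup
//   arm_frame_watchdog(35000);           // 35 ms budget for this frame
//   render_frame();
//   disarm_frame_watchdog();             // finished in time: no interrupt
ITIMER_REAL delivers SIGALRM to the process as a whole, and in a multithreaded program it may land on any thread; gdb still stops the entire process, so you can inspect every thread (thread apply all bt) at the moment the budget was blown.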
I tried valgrind and gprof. It is a crying shame that none of them work well with multi-threaded applications. Later, I found Intel VTune Amplifier. The good thing is, it handles multi-threading well, works with most of the major languages, works on Windows and Linux, and has many great profiling features. Moreover, the application itself is free. However, it only works with Intel processors.
You can randomly run pstack to find out the stack at a given point, e.g. 10 or 20 times.
The most typical stack is where the application spends most of the time (according to experience, we can assume a Pareto distribution).
You can combine that knowledge with strace or truss (Solaris) to trace system calls, and pmap for the memory print.
If the application runs on a dedicated system, you also have sar to measure CPU, memory, I/O, etc., to profile the overall system.
Since you didn't mention non-commercial, may I suggest Intel's VTune. It's not free but the level of detail is very impressive (and the overhead is negligible).
Putting a slightly different twist on matters, you can actually get a pretty good idea of what's going on in a multithreaded application using ftrace and kernelshark. Collect the right trace and press the right buttons, and you can see the scheduling of individual threads.
Depending on your distro's kernel you may have to build a kernel with the right configuration (but I think that a lot of them have it built in these days).
Microprofile is another possible answer to this. It requires hand-instrumentation of the code, but it seems like it handles multi-threaded code pretty well. And it also has special hooks for profiling graphics pipelines, including what's going on inside the card itself.

Profiling C++ multi-threaded applications

Have you used any profiling tool like Intel Vtune analyzer?
What are your recommendations for a C++ multi threaded application on Linux and windows? I am primarily interested in cache misses, memory usage, memory leaks and CPU usage.
I use valgrind (only on UNIX), but mainly for finding memory errors and leaks.
The following are good tools for multithreaded applications. You can try an evaluation copy.
Runtime sanity check tool
- Thread Checker: Intel Thread Checker / VTune, here
Memory consistency-check tools (memory usage, memory leaks)
- Memory Validator, here
Performance analysis (CPU usage)
- AQTime, here
EDIT: Intel Thread Checker can be used to diagnose data races, deadlocks, stalled threads, abandoned locks, etc. Please have lots of patience when analyzing the results, as it is easy to get confused.
A few tips:
Disable the features that are not required. (When identifying deadlocks, data race detection can be disabled, and vice versa.)
Use the instrumentation level based on your need. Levels like "All Function" and "Full Image" are used for data races, whereas "API Imports" can be used for deadlock detection.
Use the context-sensitive menu "Diagnostic Help" often.
On Linux, try oprofile.
It supports various performance counters.
On Windows, AMD's CodeAnalyst (free, unlike VTune) is worth a look.
It only supports event profiling on AMD hardware though
(on Intel CPUs it's just a handy timer-based profiler).
A colleague recently tried Intel Parallel Studio (beta) and rated it favourably
(it found some interesting parallelism-related issues in some code).
VTune gives you a lot of detail on what the processor is doing, and sometimes I find it hard to see the wood for the trees. VTune will not report on memory leaks. You'll need PurifyPlus for that, or if you can run on a Linux box, valgrind is good for memory leaks at a great price.
VTune shows two views: the tabular one is useful; the other, I think, is just for salesmen to impress people with, and is not that useful.
For a quick and cheap option I'd go with valgrind. Valgrind also has a cachegrind part to it, but I've not used it; I suspect it's very good as well.
You can try out AMD CodeXL's CPU profiler. It is free and available for both Windows and Linux.
AMD CodeXL's CPU profiler replaces the no longer supported CodeAnalyst tool (which was mentioned in an answer above given by timday).
For more information and download links, visit: AMD CodeXL web page.
I'll put in another answer for valgrind, especially the callgrind portion with the UI. It can handle multiple threads by profiling each thread for cache misses, etc. They also have a multi-thread error checker called helgrind, but I've never used it and don't know how good it is.
The Rational PurifyPlus suite includes both a well-proven leak detector and a pretty good profiler. I'm not sure if it goes down to the level of cache misses, though - you might need VTune for that.
PurifyPlus is available both on various Unices and Windows so it should cover your requirements, but unfortunately in contrast to Valgrind, it isn't free.
For simple profiling, gprof is pretty good.