Performance and profiling of OpenMP C++ code in VS2017 - c++

I have a performance-critical piece of C++ code running in Visual Studio 2017 that I've been profiling to look for potential bottlenecks. At a high level, the profiler shows about 80% CPU usage across my eight cores while executing this code. With all the kernel symbols loaded, the profiler shows that the busiest function is NtYieldExecution at 52% usage.
My guess is that this 52% is not correct, possibly 52% of one thread, but even then I'd be keen to know what's going on under the hood. I also have my own thread pool code, which led to 100% CPU usage on other code, so I'm wondering whether to move this code to an alternative multi-threading model. OpenMP is very convenient, but is it inefficient in Visual Studio 2017? More importantly, is it possible to isolate and remove any such inefficiencies?

The problem, as it turned out, was that part of the multi-threaded code was inadvertently writing to a variable outside the scope of the OpenMP section, which in turn led to the automatic insertion of a lock, as seen in PartialBarrierN::Block. I resolved this by switching to a more local variable, which resulted in a significant speed-up and 100% CPU usage.
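The original code is not shown, so the following is only a minimal sketch of the general pattern, not the author's actual fix. The function names are hypothetical, and the explicit `critical` section is an assumption standing in for whatever synchronization the shared write caused. The second function keeps the accumulator private to each thread through a `reduction` clause, which is one common way to make the variable "more local".

```cpp
#include <cassert>
#include <vector>

// Slow pattern: every iteration updates a variable declared outside the
// parallel loop, so the update has to be serialized across threads.
double sum_shared(const std::vector<double>& data) {
    double total = 0.0;  // shared by all threads
    const int n = static_cast<int>(data.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp critical
        total += data[i];  // threads queue up here one at a time
    }
    return total;
}

// Faster pattern: each thread accumulates into its own private copy and
// the copies are combined once when the loop finishes.
double sum_reduction(const std::vector<double>& data) {
    double total = 0.0;
    const int n = static_cast<int>(data.size());
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Both functions return the same sum; only the amount of time threads spend waiting on each other differs. Without OpenMP enabled at compile time the pragmas are ignored and the code runs serially.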

Related

Threading analysis in VTune hangs at __kmp_acquire_ticket_lock

I am currently benchmarking a project written in C++ to determine the hot spots and the threading efficiency, using Intel VTune. When running the program normally it runs for ~15 minutes. Using the hotspot analysis in VTune I can see that the function __kmp_fork_barrier is taking up roughly 40% of the total CPU time.
Therefore, I also wanted to see the threading efficiency, but when starting the threading-module in VTune, it does not start the project at all, but instead hangs at __kmp_acquire_ticket_lock when running in Hardware event-based sampling-mode. When running in user-mode sampling-mode instead, the project immediately fails with a segfault (which does not occur when running it without VTune and checking it with valgrind). When using HPC performance characterization instead, VTune crashes.
Are those issues with VTune, or with my program? And how can I find the issues with the latter?
__kmp_xxx calls are functions of the Intel/Clang OpenMP runtime. __kmp_fork_barrier is called when an OpenMP barrier is reached. If you spend 40% of your time in this function, you have a load-balancing issue among the OpenMP threads in your program, and you need to fix this work imbalance to get better performance. You can use the (experimental) OMPT support of the runtimes to track what threads are doing and when. VTune should have minimal support for profiling OpenMP programs. A VTune crash is likely a bug and should be reported on the Intel forum so that the VTune developers can fix it. On your side, you can check that your program always passes all OpenMP barriers in a deterministic way. For more information, you can look at the Intel VTune OpenMP tutorial.
Note that the VTune results also suggest that your OpenMP runtime is configured so that threads actively poll the state of other threads, which is good for reducing latency but not always for performance or energy savings. You can control the behaviour of the runtime using the environment variable OMP_WAIT_POLICY.
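As an illustration of the work-imbalance point, here is a minimal sketch in which the cost of an iteration grows with its index. The workload `uneven_work`, the function names, and the chunk size of 16 are all made up for the example; the asker's real loop is unknown. With the default static schedule, the threads that receive the low indices would finish early and wait at the barrier; `schedule(dynamic, 16)` lets idle threads pick up the next chunk instead.

```cpp
#include <cassert>

// Hypothetical task whose cost grows with i, so equal-sized static
// chunks would give some threads far more work than others.
long long uneven_work(int i) {
    long long acc = 0;
    for (int k = 0; k < i; ++k) {
        acc += k % 7;
    }
    return acc;
}

// Parallel version: chunks of 16 iterations are handed out on demand,
// so less time is spent waiting at the barrier that ends the loop.
long long run_dynamic(int n) {
    long long total = 0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += uneven_work(i);
    }
    return total;
}

// Serial reference used to check the parallel result.
long long run_serial(int n) {
    long long total = 0;
    for (int i = 0; i < n; ++i) {
        total += uneven_work(i);
    }
    return total;
}
```

Whether dynamic scheduling actually helps depends on the real workload, so it is worth re-profiling after the change. Separately, setting OMP_WAIT_POLICY to PASSIVE in the environment before launching the program asks the runtime to let waiting threads sleep instead of spinning.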

Can badly vectorized code impact scalability?

I have parallelized an existing code base for computer vision applications using OpenMP. I think I designed it well because:
The workload is well-balanced
There is no synchronization/locking mechanism
I parallelized the outermost loops
All the cores are used for most of the time (there are no idle cores)
There is enough work for each thread
Now, the application doesn't scale when using many cores, e.g. it doesn't scale well after 15 cores.
The code uses external libraries (i.e. OpenCV and IPP) whose code is already optimized and vectorized, and I manually vectorized some portions of my own code as best I could. However, according to Intel Advisor, the code isn't well vectorized, and there is not much left to do: I already vectorized the code where I could, and I can't improve the external libraries.
So my question is: is it possible that vectorization is the reason why the code doesn't scale well at some point? If so, why?
In line with comments from Adam Nevraumont, VTune Amplifier can do a lot to pinpoint memory bandwidth issues: https://software.intel.com/en-us/vtune-amplifier-help-memory-access-analysis.
It may be useful to start at a higher level of analysis than that though, like looking at hot spots. If it turns out that most of your time is spent in OpenCV or similar like you're concerned about, finding that out early might save some time vs. digging into memory bottlenecks directly.

Finding performance issue that may be due to thread locking (possibly)

I've spent a little time running valgrind/callgrind to profile a server that does a lot of TCP/IP communications using many threads. After some time improving the performance, I realised that in this particular test scenario, the process is not CPU bound so the performance "improvements" I'd looked at were of no use.
In theory, the CPU should be very busy. I know the TCP/IP device it connects to isn't the limitation as the server runs on two machines. One is a PC the other is an embedded device with an Arm processor. Even the embedded device only gets to about 2% CPU usage but it does far fewer transactions - about a tenth. Both systems only get up to about 2% even though we're trying to get data as fast as possible.
My guess is that some mutex is locked and is holding up a thread. This is a pure guess! There are a few threads in the system with common data. Perhaps there are other possibilities but how do I tell?
Is there any way to use a tool like valgrind/callgrind that might show the time spent in system calls? I can also run it on Windows with Visual Studio 2012 if that's better.
We might have to try walking through the code or something but not sure that we have time.
Any tips appreciated.
Thanks.
Callgrind is a great profiler but it does have some drawbacks. In particular, it assumes that the same instruction always executes in the same amount of time, and it assumes that instruction counts are the most important metric.
This is fine for getting (mostly) reproducible profiling results and for analyzing in detail what instructions are executed, but there are some types of performance problems which Callgrind doesn't detect:
time spent waiting for locks
time spent sleeping (e.g. simple sleep()/usleep() calls will effectively slow down your application but won't show up in Callgrind)
time spent waiting for disk I/O or network I/O
time spent waiting for data that was swapped out
influences from CPU cache hits/misses (you can try to use Cachegrind for this particular topic)
influences from CPU pipeline stalls, branch prediction failures and all the other features of modern CPUs that can cause the same instruction to be executed faster or slower depending on the context
These problems can be detected quite well using a statistical (or sample-based) profiler. Examples would be Sysprof and OProfile, or any kind of "poor-man's sampling profiler" as described e.g. at https://stackoverflow.com/a/378024. The VS2012 built-in profiler mentioned by WhozCraig appears to be a sampling profiler as well.
While statistical profilers are really useful because they provide "real-world" results instead of simple instruction counts, they have the possible drawback that you don't get reproducible results easily (the results will vary a little with every run), and that you need to gather a sufficient number of samples to get detailed results.
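A quick way to check whether a process is waiting rather than computing is to compare wall-clock time with CPU time around a suspect region. The sketch below is an illustration only: `Timing` and `time_sleep` are hypothetical names, and the sleep stands in for any blocking call (lock, I/O, network). It also assumes a platform where `std::clock()` reports processor time, as on Linux; the Microsoft C runtime's `clock()` reports elapsed wall time instead, so the comparison is not meaningful there.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double wall_ms;  // elapsed real time
    double cpu_ms;   // processor time reported by std::clock()
};

// Measures a region that only waits. The sleep is a stand-in for any
// blocking operation such as a contended lock or a network read.
Timing time_sleep(int sleep_ms) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    std::this_thread::sleep_for(std::chrono::milliseconds(sleep_ms));

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_ms = std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    t.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}
```

A large gap between `wall_ms` and `cpu_ms` suggests the region is dominated by waiting, which is exactly the kind of cost an instruction-counting tool will not show.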

How to profile lock contentions under g++/std::mutex?

Question
Are there any open-source tools or does anyone have any techniques/code for profiling the degree of std::mutex contentions in running code?
I would like to count the percentage of lock contention at the granularity (either by time or number) of each std::mutex instance. If there is a drop-in tool that doesn't require recoding, that would be even better.
I am looking for a technique that will work with std::thread and g++ : at the exit of the application, I would like to dump out a profile of mutex contention statistics into a log file, so that I can monitor the quality of threading code under actual running contexts.
Note
I have seen this thread. Unfortunately, the answers either require a pile of cash or run on Windows.
I recommend something like AMD CodeXL or Intel VTune. CodeXL is free; Intel VTune has a free academic license if that's applicable to you, or you can try a 30-day trial. Both of them work on Linux.
At the most basic level, these tools can identify hotspots, e.g. by measuring how much time you are spending inside methods of std::mutex. There are other more advanced analysis techniques/tools included in each tool that may help you even further. You don't need to change your code at all, although you may need to check that you compiled with debug symbols and/or haven't stripped the binaries. You will also probably want to stay away from extreme optimization levels like -O3, and stick to -O1, -O2 or -Og.
PS: As with all optimization inquiries, I must remind you to always measure where your performance problems actually are before you start optimizing. No matter how worried you are about lock contention, validate your concerns with a profiler before you spend huge efforts trying to alleviate whatever lock contention you may or may not be having.
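If a profiler is not an option, the per-mutex counts asked about in the question can be approximated by hand. The following is a minimal sketch of that alternative, not something the tools above provide: `CountingMutex` is a hypothetical wrapper that tries the lock first and counts a contention whenever the attempt fails. It does require replacing `std::mutex` in the code being measured, and because `std::mutex::try_lock` is allowed to fail spuriously, the counts should be read as estimates.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical drop-in wrapper that counts how often lock() had to wait.
class CountingMutex {
public:
    void lock() {
        acquisitions_.fetch_add(1, std::memory_order_relaxed);
        if (!m_.try_lock()) {
            // Another thread held the mutex (or try_lock failed spuriously).
            contended_.fetch_add(1, std::memory_order_relaxed);
            m_.lock();
        }
    }

    bool try_lock() {
        if (m_.try_lock()) {
            acquisitions_.fetch_add(1, std::memory_order_relaxed);
            return true;
        }
        return false;
    }

    void unlock() { m_.unlock(); }

    std::uint64_t acquisitions() const { return acquisitions_.load(); }
    std::uint64_t contended() const { return contended_.load(); }

private:
    std::mutex m_;
    std::atomic<std::uint64_t> acquisitions_{0};
    std::atomic<std::uint64_t> contended_{0};
};
```

Because it provides `lock`, `try_lock` and `unlock`, the wrapper works with `std::lock_guard<CountingMutex>` and `std::unique_lock<CountingMutex>`. The ratio `contended() / acquisitions()` can be written to a log at exit to get a rough contention percentage per mutex instance.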

Visual Studio 2008 Profiler - Instrumented produces strange results

I run the Visual Studio 2008 profiler on a "RelDebug" build of my app. Optimizations are on, but inlining is only moderate, stack frames are present, and symbols are emitted. In other words, RelDebug is a somewhat optimized build that can be debugged (although the usual Release caveats about inspecting variables apply).
I run both the Sampling and the Instrumented profilers on separate runs.
Result? The Sampling profiler produces a result that looks reasonable. However, when I look at the Instrumented profiler results, I see functions that should not even be near the top of the list coming out on top.
For example, a function like "SetFont" that consists of only 1 line assigning the height to a class member. Or "SetClipRect" that merely assigns a rectangle.
Of course I am looking at "Exclusive" stats (i.e. minus children).
Has this happened to anyone else? It always seems to happen once my application has grown to a certain size. It makes the instrumented profiler useless at that point.
I figured out the problem. Both the Visual Studio 2008 and the Visual Studio 2010 profilers are mediocre (to put it politely). I bought Intel C++ Studio, which comes with VTune Amplifier (a profiler). Using the Intel profiler on the exact same code, I was able to get profiler results that actually made sense.
You say "of course you are looking at Exclusive". Look at inclusive stats. In all but the simplest programs or algorithms, nearly all the time is spent in subroutines and functions, so if you've got a performance problem, it most likely consists of calls you didn't know were time-hogs.
The method I rely on is this. Assuming you are trying to find out what you could fix to make the code faster, it will find it, while not wasting your time with high-precision statistics about things that are not problems.
There's no bug. Sampling cannot tell you how much time you spent per call. The profiler just counts how many times the sampling timer fired while execution was in that specific function. Since SetFont is not called frequently, you don't get many hits in that function, and you get the impression that it is not time-consuming.
On the other hand, when you run instrumentation, the profiler counts every call and measures the execution time of every function. That is why you get accurate information about each function's CPU consumption.
When examining instrumentation results, you must always look at the number of calls as well. Since SetFont is more or less an API, it doesn't matter whether you look at exclusive or inclusive time. The only things that matter are its overall time and how frequently it's called.
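To make the "count every call and measure every function" idea concrete, here is a hand-rolled sketch of what instrumenting a single function might look like. It is not how the Visual Studio profiler is implemented. `ProbeStats`, `ScopedProbe`, `Widget` and the body of `SetFont` are hypothetical stand-ins based on the question's description of a one-line setter.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Accumulated data for one instrumented function (not thread-safe).
struct ProbeStats {
    std::uint64_t calls = 0;
    std::chrono::nanoseconds total{0};
};

// RAII probe: records one call and its elapsed time when it goes out of scope.
class ScopedProbe {
public:
    explicit ScopedProbe(ProbeStats& stats)
        : stats_(stats), start_(std::chrono::steady_clock::now()) {}

    ~ScopedProbe() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        stats_.total += std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed);
        ++stats_.calls;
    }

    ScopedProbe(const ScopedProbe&) = delete;
    ScopedProbe& operator=(const ScopedProbe&) = delete;

private:
    ProbeStats& stats_;
    std::chrono::steady_clock::time_point start_;
};

ProbeStats g_set_font_stats;  // hypothetical per-function record

// Hypothetical one-line setter like the SetFont described in the question.
struct Widget {
    int font_height = 0;
    void SetFont(int height) {
        ScopedProbe probe(g_set_font_stats);
        font_height = height;
    }
};
```

Dividing `total` by `calls` gives the average time per call, which is why the call count needs to be read alongside the total. Note that the probe itself costs two clock reads per call, so for a one-line function the measured time includes that overhead.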