Threading analysis in VTune hangs at __kmp_acquire_ticket_lock - c++

I am currently benchmarking a project written in C++ to determine the hot spots and the threading efficiency, using Intel VTune. When running the program normally it runs for ~15 minutes. Using the hotspot analysis in VTune I can see that the function __kmp_fork_barrier is taking up roughly 40% of the total CPU time.
Therefore, I also wanted to see the threading efficiency, but when starting the threading-module in VTune, it does not start the project at all, but instead hangs at __kmp_acquire_ticket_lock when running in Hardware event-based sampling-mode. When running in user-mode sampling-mode instead, the project immediately fails with a segfault (which does not occur when running it without VTune and checking it with valgrind). When using HPC performance characterization instead, VTune crashes.
Are those issues with VTune, or with my program? And how can I find the issues with the latter?

__kmp_xxx calls are functions of the Intel/Clang OpenMP runtime. __kmp_fork_barrier is called when an OpenMP barrier is reached. If 40% of your CPU time is spent in this function, your program has a load-balancing issue between its OpenMP threads: some threads finish their share of the work early and then wait for the others. You need to fix this work imbalance to get better performance. You can use the (experimental) OMPT support of some runtimes to track what the threads are doing and when. VTune should have at least minimal support for profiling OpenMP programs. A VTune crash is likely a bug and should be reported on the Intel forum so that the VTune developers can fix it. On your side, you can check that your program always passes all OpenMP barriers in a deterministic way. For more information, you can look at the Intel VTune OpenMP tutorial.
Note that the VTune results also suggest that your OpenMP runtime is configured so that waiting threads actively poll the state of other threads, which is good for reducing latency but not always for performance or energy savings. You can control this behaviour of the runtime using the environment variable OMP_WAIT_POLICY (ACTIVE or PASSIVE).
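To make the load-balancing point concrete, here is a minimal sketch, not taken from the asker's project. It assumes a triangular loop in which iteration i costs i work units, and the helper names work_per_thread and imbalance are made up for illustration. With schedule(static), the thread that receives the last block does the most work while the others wait at the barrier.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical workload: iteration i costs i "work units" (a triangular loop).
// Returns how many work units each thread ended up executing.
std::vector<long long> work_per_thread(int n) {
    assert(n >= 0);
    int max_threads = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    std::vector<long long> work(static_cast<std::size_t>(max_threads), 0);

    // schedule(static) hands out contiguous blocks of iterations, so the
    // thread that gets the last block receives the most expensive ones.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        work[static_cast<std::size_t>(tid)] += i;  // each thread writes only its own slot
    }
    return work;
}

// Imbalance factor: work of the busiest thread divided by the average.
// 1.0 means perfectly balanced; larger values mean more waiting at barriers.
double imbalance(const std::vector<long long>& work) {
    if (work.empty()) return 1.0;
    const long long total = std::accumulate(work.begin(), work.end(), 0LL);
    if (total == 0) return 1.0;
    const long long busiest = *std::max_element(work.begin(), work.end());
    return static_cast<double>(busiest) * work.size() / static_cast<double>(total);
}
```

Calling imbalance(work_per_thread(n)) for a loop like this and then again after switching the clause to schedule(dynamic) or schedule(guided) is one way to check whether a different schedule evens out the work. Built without OpenMP support, the pragma is ignored and everything runs on one thread.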


Measure or profile use of AVX2 (and other advanced instruction sets) instructions used by a program

We are chasing some weird hardware failures on AMD Threadrippers. I came across some evidence that AVX2/AVX-512 instructions can lead to weird behaviour.
Is there a generic way of measuring or profiling the use of AVX2/AVX-512 instructions of a running program or machine? For now it would be enough for me to get a ball-park of how many of these instructions are being used in a given time frame. I do not necessarily need to pin it down to the actual program using them. The more detailed the profiling / attribution of AVX2/AVX-512 instruction use by program or time is the better.
I would prefer tools that run in Linux.

Performance and profiling of OpenMP C++ code in VS2017

I have a performance-critical piece of C++ code running in Visual Studio 2017 that I've been profiling to look for potential bottlenecks. At a high level, the profiler shows about 80% CPU usage across my eight cores executing this code. Having loaded in all the kernel symbols, the profiler shows that the busiest function is NtYieldExecution at 52% usage.
My guess is that this 52% is not correct, possibly 52% of one thread, but even then I'd be keen to know what's going on under the hood. I also have my own thread pool code, which leads to 100% CPU usage on other code, so I'm wondering whether to move this code to an alternative multi-threading model. OpenMP is very convenient, but is it inefficient in Visual Studio 2017? More importantly, is it possible to isolate and remove any such inefficiencies?
The problem, as it turned out, was that part of the multi-threaded code was inadvertently writing to a variable outside the scope of the OpenMP section, which in turn led to the automatic insertion of a lock, as seen in PartialBarrierN::Block. I resolved this by changing it to a more local variable, which resulted in a significant speed-up and 100% CPU usage.
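The original code is not shown, so the following is only a sketch of the same kind of pattern under my own assumptions. The function names are hypothetical, and the synchronisation is written explicitly with an atomic update rather than being inserted by the toolchain. The first function updates a variable that lives outside the parallel loop on every iteration; the second lets each thread accumulate into a private copy that is combined once at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Contended version: `total` is declared outside the parallel loop and every
// iteration has to synchronise on it, so threads spend their time waiting.
double sum_shared(const std::vector<double>& v) {
    double total = 0.0;
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        #pragma omp atomic
        total += v[static_cast<std::size_t>(i)];
    }
    return total;
}

// Local version: the reduction clause gives each thread its own private
// `total`; the partial sums are combined once when the loop finishes.
double sum_reduced(const std::vector<double>& v) {
    double total = 0.0;
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());
    #pragma omp parallel for reduction(+:total)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        total += v[static_cast<std::size_t>(i)];
    }
    return total;
}
```

Both functions return the same sum up to floating-point rounding, since the order of the additions can differ between them. The difference shows up in a profiler as time spent waiting versus time spent computing.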

Finding performance issue that may be due to thread locking (possibly)

I've spent a little time running valgrind/callgrind to profile a server that does a lot of TCP/IP communications using many threads. After some time improving the performance, I realised that in this particular test scenario, the process is not CPU bound so the performance "improvements" I'd looked at were of no use.
In theory, the CPU should be very busy. I know the TCP/IP device it connects to isn't the limitation as the server runs on two machines. One is a PC the other is an embedded device with an Arm processor. Even the embedded device only gets to about 2% CPU usage but it does far fewer transactions - about a tenth. Both systems only get up to about 2% even though we're trying to get data as fast as possible.
My guess is that some mutex is locked and is holding up a thread. This is a pure guess! There are a few threads in the system with common data. Perhaps there are other possibilities but how do I tell?
Is there any way to use a tool like valgrind/callgrind that might show the time spent in system calls? I can also run it on Windows with Visual Studio 2012 if that's better.
We might have to try walking through the code or something but not sure that we have time.
Any tips appreciated.
Callgrind is a great profiler but it does have some drawbacks. In particular, it assumes that the same instruction always executes in the same amount of time, and it assumes that instruction counts are the most important metric.
This is fine for getting (mostly) reproducible profiling results and for analyzing in detail what instructions are executed, but there are some types of performance problems which Callgrind doesn't detect:
time spent waiting for locks
time spent sleeping (eg. simple sleep()/usleep() calls will effectively slow down your application but won't show up in Callgrind)
time spent waiting for disk I/O or network I/O
time spent waiting for data that was swapped out
influences from CPU cache hits/misses (you can try to use Cachegrind for this particular topic)
influences from CPU pipeline stalls, branch prediction failures and all the other features of modern CPUs that can cause the same instruction to be executed faster or slower depending on the context
These problems can be detected quite well using a statistical (or sample-based) profiler. Examples would be Sysprof and OProfile, or any kind of "poor-man's sampling profiler". The VS2012 built-in profiler mentioned by WhozCraig appears to be a sampling profiler as well.
While statistical profilers are really useful because they provide "real-world" results instead of simple instruction counts, they have the possible drawback that you don't get reproducible results easily (the results will vary a little bit with every run), and that you need to gather a sufficient number of samples to get detailed results.
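Before reaching for a profiler, a quick check is to compare wall-clock time with CPU time for a suspect piece of work: if the wall time is much larger, the code is waiting (lock, sleep, I/O) rather than computing. The sketch below is an illustration only; Timing, measure and blocked_work are hypothetical names. It relies on std::clock() reporting processor time for the whole process, which holds on POSIX-like systems, whereas the Microsoft C runtime reports elapsed time instead.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // processor time used by the whole process
};

// Runs `work` once and reports both wall-clock and CPU time for it.
template <typename F>
Timing measure(F&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}

// Stand-in for a thread blocked on a mutex, a socket or sleep():
// it takes wall time but almost no CPU time.
inline void blocked_work() {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
```

For blocked_work the wall time is about 0.1 s while the CPU time stays close to zero, which is the signature of the waiting-related problems listed above. Because std::clock() covers all threads of the process, the comparison is easiest to interpret when the other threads are idle.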

How to profile a C++ function at assembly level?

I have a function that is the bottleneck of my program. It requires no access to memory and requires only calculation. It is the inner loop and is called many times, so any small gains in this function are big wins for my program.
I come from a background in optimizing SPU code on the PS3, where you take an SPU program and run it through a pipeline analyzer that puts each assembly statement in its own column so you can minimize the number of cycles the function takes. Then you overlay loops so you can minimize pipeline dependencies even more. With that program and a list of the cycles each assembly instruction takes, I could optimize much better than the compiler ever could.
On a different platform it had events I could register (cache misses, cycles, etc.) and I could run the function and track CPU events. That was pretty nice as well.
Now I'm doing a hobby project on Windows using Visual Studio C++ 2010 w/ a Core i7 Intel processor. I don't have the money to justify paying the large cost of VTune.
My question:
How do I profile a function at the assembly level for an Intel processor on Windows?
I want to compile, view disassembly, get performance metrics, adjust my code and repeat.
There are some great free tools available, mainly AMD's CodeAnalyst (in my experience on my i7 vs. my Phenom II, it's a bit handicapped on the Intel processor because it doesn't have access to the hardware-specific counters, though that might have been a bad configuration).
However, a lesser-known tool is the Intel Architecture Code Analyzer (which is free, like CodeAnalyst). It is similar to the SPU tool you described, as it details latency, throughput and port pressure (basically the request dispatches to the ALUs, MMU and the like) line by line for your program's assembly. Stan Melax gave a nice talk on it and x86 optimization at this year's GDC, under the title "hotspots, flops and uops: to-the-metal cpu optimization".
Intel also has a few more tools in the same vein as IACA, available under the performance tuning section of their experimental/what-if code site, such as PTU, which is (or was) an experimental evolution of VTune; from what I can see, it's free.
It's also a good idea to read the Intel optimization manual before diving into this.
EDIT: as Ben pointed out, the timings might not be correct for older processors, but that can be easily made up for using Agner Fog's Optimization manuals, which also contain many other gems.
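As a coarse complement to a static analyzer, the compile/measure/adjust loop from the question can be approximated with a small timing harness. This is only a sketch under my own assumptions: kernel is a made-up compute-only stand-in for the bottleneck function, and best_ns_per_call is a hypothetical helper. Each call feeds the next, so the figure reflects latency along a dependency chain rather than throughput.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical compute-only kernel standing in for the bottleneck function.
inline std::uint32_t kernel(std::uint32_t x) {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

// Times `repeats` batches of `calls_per_batch` dependent calls and returns
// the best (smallest) observed time per call in nanoseconds.
template <typename F>
double best_ns_per_call(F f, int repeats, int calls_per_batch) {
    assert(repeats > 0 && calls_per_batch > 0);
    double best = std::numeric_limits<double>::infinity();
    volatile std::uint32_t sink = 0;  // stops the compiler from deleting the loop
    for (int r = 0; r < repeats; ++r) {
        std::uint32_t x = 2463534242u + static_cast<std::uint32_t>(r);
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < calls_per_batch; ++i) {
            x = f(x);  // each call depends on the previous result
        }
        const auto stop = std::chrono::steady_clock::now();
        sink = x;
        const double ns =
            std::chrono::duration<double, std::nano>(stop - start).count();
        best = std::min(best, ns / calls_per_batch);
    }
    (void)sink;
    return best;
}
```

Taking the minimum over several batches filters out interruptions from the OS. Passing a lambda instead of a function pointer gives the compiler a better chance of inlining the kernel, which is worth confirming in the disassembly.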
You might want to try some of the utilities included in valgrind like callgrind or cachegrind.
Callgrind can do profiling and dump assembly.
And kcachegrind is a nice GUI, and will show the dumps including assembly and number of hits per instruction etc.
From your description it sounds like your problem may be embarrassingly parallel; have you considered using PPL's parallel_for?

Does/can Valgrind use multiple processors?

Is there a way to get valgrind to use multiple processors?
I'm doing some bottleneck profiling with valgrind's callgrind and noticed significantly different resource usage behavior in my application vs when run outside of valgrind/callgrind.
When run outside valgrind, it maxes out several processors, but when run inside valgrind it only uses one. This makes me worry that my bottlenecks will be in different places, and thus invalidate my profiling.
According to the Valgrind Docs, they do not support multiple processors:
The main thing to point out with respect to threaded programs is that your program will use the native threading library, but Valgrind serialises execution so that only one (kernel) thread is running at a time. This approach avoids the horrible implementation problems of implementing a truly multithreaded version of Valgrind, but it does mean that threaded apps run only on one CPU, even if you have a multiprocessor or multicore machine.
Valgrind doesn't schedule the threads itself. It merely ensures that only one thread runs at once, using a simple locking scheme. The actual thread scheduling remains under control of the OS kernel. What this does mean, though, is that your program will see very different scheduling when run on Valgrind than it does when running normally. This is both because Valgrind is serialising the threads, and because the code runs so much slower than normal.
This difference in scheduling may cause your program to behave differently, if you have some kind of concurrency, critical race, locking, or similar, bugs. In that case you might consider using the tools Helgrind and/or DRD to track them down.
Newer versions of Valgrind added a --fair-sched option. It may help.