Profiling OpenMP-parallelized C++ code

What is the easiest way to profile a C++ program parallelized with OpenMP, on a machine on which one has no sudo rights?

I would recommend using the Intel VTune Amplifier XE profiler.
The Basic Hotspots analysis doesn't require root privileges, and you can even install the tool without being in sudoers.
For OpenMP analysis it's best to compile with the Intel OpenMP implementation and set the environment variable KMP_FORKJOIN_FRAMES to 1 before running the profiling session. This enables the tool to visualize the time from fork point to join point for each parallel region, which gives a good idea of where you had sufficient parallelism and where you did not. By using a grid grouping like Frame Domain / Frame Type / Function you can also correlate the parallel regions with what was happening on the CPUs, which helps find functions that didn't scale.
For example, imagine a simple code like the one below that runs some balanced work, then some serial work, and then some imbalanced work, calling a delay() function for all of these and making sure delay() doesn't get inlined. This imitates a real workload where all kinds of unfamiliar functions may be invoked from parallel regions, making it harder to tell whether the parallelism was good or bad by looking at just the hot-functions profile:
// Note: delay() and calibrate() are the author's helper routines
// (presumably a calibrated spin-wait); they are not shown in the post.
#include <stdio.h>
#include <omp.h>

void __attribute__ ((noinline)) balanced_work() {
    printf("Starting ideal parallel\n");
    #pragma omp parallel
    delay(3000000);
}

void __attribute__ ((noinline)) serial_work() {
    printf("Starting serial work\n");
    delay(3000000);
}

void __attribute__ ((noinline)) imbalanced_work() {
    printf("Starting parallel with imbalance\n");
    #pragma omp parallel
    {
        int mythread = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        delay(1000000);
        printf("First barrier %d\n", mythread);
        #pragma omp barrier
        delay(mythread * 25000 + 200000);
        printf("Second barrier %d\n", mythread);
        #pragma omp barrier
        delay((nthreads - 1 - mythread) * 25000 + 200000);
        printf("Join barrier %d\n", mythread);
    }
}

int main(int argc, char **argv)
{
    setvbuf(stdout, NULL, _IONBF, 0);  // unbuffered output
    calibrate();
    balanced_work();
    serial_work();
    imbalanced_work();
    printf("Bye bye\n");
}
For this code a typical function profile will show most of the time spent in the delay() function. Viewing the data with frame grouping and CPU usage information in VTune, on the other hand, gives an idea of what is serial, what is imbalanced, and what is balanced. In the VTune view one can see that:
There were 13.671 seconds of elapsed time during which we were executing the imbalanced region. The imbalance is visible from the CPU Usage breakdown.
There were 3.652 seconds of elapsed time that were pretty well balanced. There is some red time there; that's likely some system effect and would be worth investigating in a real-world case.
And then there are about 4 seconds of serial time. Figuring out that it's 4 seconds is currently a bit tricky: you take the elapsed time from the summary (21.276 seconds in my case) and subtract 13.671 and 3.652 from it, yielding roughly 4. But easy enough.
Hope this helps.

Related

OpenMP task directive slower multithreaded than singlethreaded

I've encountered a problem where the task directive seems to slow down the execution time of the code the more threads I have. Now I have removed all of the unnecessary stuff from my code that isn't related to the problem since the problem still occurs even for this slimmed down piece of code that doesn't really do anything. But the general idea I have for this code is that I have the master thread generate tasks for all the other worker threads to execute.
#ifndef _REENTRANT
#define _REENTRANT
#endif
#include <vector>
#include <iostream>
#include <random>
#include <sched.h>
#include <semaphore.h>
#include <time.h>
#include <bits/stdc++.h>
#include <sys/times.h>
#include <stdio.h>
#include <stdbool.h>
#include <omp.h>
#include <chrono>

#define MAXWORKERS 16

using namespace std;

int nbrThreads = MAXWORKERS; // Number of threads

void busyWait() {
    for (int i = 0; i < 999; i++) {}
}

void generatePlacements() {
    #pragma omp parallel
    {
        #pragma omp master
        {
            int j = 0;
            while (j < 8*7*6*5*4*3*2) {
                #pragma omp task
                {
                    busyWait();
                }
                j++;
            }
        }
    }
}

int main(int argc, char const *argv[])
{
    for (int i = 1; i <= MAXWORKERS; i++) {
        int nbrThreads = i;
        omp_set_num_threads(nbrThreads);
        auto begin = omp_get_wtime();
        generatePlacements();
        double elapsed;
        auto end = omp_get_wtime();
        auto diff = end - begin;
        cout << "Time taken for " << nbrThreads << " threads to execute was " << diff << endl;
    }
    return 0;
}
And I get the following output from running the program:
Time taken for 1 threads to execute was 0.0707005
Time taken for 2 threads to execute was 0.0375168
Time taken for 3 threads to execute was 0.0257982
Time taken for 4 threads to execute was 0.0234329
Time taken for 5 threads to execute was 0.0208451
Time taken for 6 threads to execute was 0.0288127
Time taken for 7 threads to execute was 0.0380352
Time taken for 8 threads to execute was 0.0403016
Time taken for 9 threads to execute was 0.0470985
Time taken for 10 threads to execute was 0.0539719
Time taken for 11 threads to execute was 0.0582986
Time taken for 12 threads to execute was 0.051923
Time taken for 13 threads to execute was 0.571846
Time taken for 14 threads to execute was 0.569011
Time taken for 15 threads to execute was 0.562491
Time taken for 16 threads to execute was 0.562118
Most notable is that from 6 threads onward the time gets slower, and going from 12 to 13 threads has the biggest performance hit, becoming a whopping 10 times slower. Now I know that this issue revolves around the OpenMP task directive, since if I remove the busyWait() function the performance stays the same as seen above. But if I also remove the #pragma omp task header along with the busyWait() call I don't get any slowdown whatsoever, so the slowdown can't depend on thread creation. I have no clue what the problem here is.
First of all, the for (int i=0; i < 999; i++){} loop can be optimized away by the compiler when optimization flags like -O2 or -O3 are enabled. In fact, mainstream compilers like Clang and GCC remove it at -O2. Profiling a non-optimized build is a waste of time and should never be done unless you have a very good reason to do so.
Assuming you enabled optimizations, the created tasks will be empty, which means you are really measuring the time to create many tasks. The thing is, creating tasks is slow, and creating many tasks that do nothing causes contention that makes the creation even slower. Task granularity should be carefully tuned so as not to put too much pressure on the OpenMP runtime. Assuming you did not enable optimizations, even a loop of 999 iterations is not enough to keep the runtime from being under pressure (it should last less than 1 µs on mainstream machines). Tasks should last at least a few microseconds for the overhead not to be the main bottleneck. On mainstream servers with a lot of cores, it should be at least dozens of microseconds. For the overhead to be negligible, tasks should last even longer. Task scheduling is powerful but expensive.
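As a rough illustration of granularity tuning (a minimal sketch of my own, not the asker's code): instead of creating one task per iteration, group iterations into chunks so each task carries enough work to amortize the scheduling overhead. The chunk size and the do_work() helper are assumptions to be tuned for the real workload.

#include <omp.h>
#include <cmath>

// assumed stand-in for some non-trivial per-item work
double do_work(int i) { return std::sqrt((double)i) * std::log1p((double)i); }

void generateChunkedTasks(int n, int chunk)
{
    #pragma omp parallel
    #pragma omp master
    for (int start = 0; start < n; start += chunk) {
        int end = (start + chunk < n) ? start + chunk : n;
        #pragma omp task firstprivate(start, end)
        {
            // each task now processes 'chunk' iterations instead of one
            for (int i = start; i < end; i++)
                do_work(i);
        }
    }
}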
Due to the use of shared data structures protected with atomics and locks in OpenMP runtimes, the contention tends to grow with the number of cores. On NUMA systems, it can be significantly higher when using multiple NUMA nodes due to NUMA effects. AMD processors with 16 cores typically have multiple NUMA nodes. Using SMT (multiple hardware threads per physical core) does not significantly speed up this operation and adds more pressure on the OpenMP scheduler and the OS scheduler, so it is generally not a good idea to use more threads than cores in this case (it can be worth it when the task's computational work benefits from SMT, for example for latency-bound tasks, and when the overhead is small).
For more information about the overhead of mainstream OpenMP runtimes please consider reading On the Impact of OpenMP Task Granularity.

OMP accelerates the C++ DLL but slows down Unity

I wrote a C++ native DLL with heavy math computation and then loaded it into the Unity engine to run.
The problem is this:
When I use OMP in the C++ code, OMP does improve the C++ side's performance, which I measured by logging the time. But OMP slows down Unity. Unity runs faster if I remove the OMP.
So, how could OMP boost the DLL and slow down Unity at the same time?
Here is what the omp does:
DLLEXPORT void UpdateTreeQuick(DbvtWrapper* wrapper, Vector3* prePositions, Vector3* positions, Triangle* triangles,
                               int triangleCount, float margin)
{
    bool needPropagate = false;
    double d1 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < triangleCount; i++)
    {
        Vector3 sixPos[6];
        sixPos[0] = prePositions[triangles[i].A];
        sixPos[1] = prePositions[triangles[i].B];
        sixPos[2] = prePositions[triangles[i].C];
        sixPos[3] = positions[triangles[i].A];
        sixPos[4] = positions[triangles[i].B];
        sixPos[5] = positions[triangles[i].C];
        DbvtVolume vol = DbvtVolume::FromPoints(sixPos, 6);
        if (wrapper->m_dbvt->refit(wrapper->m_leaves[i], vol, margin))
            needPropagate = true;   // note: written by multiple threads without synchronization
    }
    double d2 = omp_get_wtime();
    if (triangleCount == 10222)
        Debug::Log(d2 - d1);
}
Here is how I call this native code in Unity:
private void Update()
{
    NativeAPI.UpdateTreeQuick(nativeDvbtWrapper, (Vector4*)nativePrePositionsWorld.GetUnsafePtr<Vector4>(),
        (Vector4*)nativePositionsWorld.GetUnsafePtr<Vector4>(), (Triangle*)nativeTriangles.GetUnsafePtr<Triangle>(),
        m_mesh.triangles.Length / 3, m_aabbMargin);
}
With OMP (2 threads): the C++ code runs with a time cost of about 7e-05 seconds, and Unity runs at 125-130 FPS.
Without OMP: the C++ code costs 0.0002008 seconds, BUT Unity runs at 138 FPS!
So, again: how could OMP boost the DLL while slowing down Unity at the same time?
More details would be great here, but:
When in doubt, this can depend on many aspects. Besides the ones mentioned by AlexGeorg:
What are these OMP routines doing exactly? Which kinds of OMP patterns are used? Which OpenMP version is used? Which kind of data is relevant for OpenMP in your runtime context? How "local" are the data sets you operate on?
Common OMP usage doesn't guarantee that the main thread is nicely offloaded. Even if you strictly separate master work from OMP worker work, that doesn't ensure fluid behavior of the main core a priori. It further depends on aspects like thread/CPU affinity.
Typical performance killers in OpenMP usage are cache(!) and sometimes pipeline bottlenecks. These can arise especially if there is a lot of interference with parts of Unity.
Maybe this has nothing to do with the FPS problem, but it might be questionable that you pass raw pointers (the vectors) into your OMP loop. This can lead to hidden bottlenecks, or to harder problems if not analyzed well enough, since it somewhat hides the shared state of the actual values.
What is this refit method doing, and is it a static/const method? I'm not that familiar with Unity. Is there a chance of blocking GPU calls (CUDA)?
What you could try further is to measure:
the general OpenMP thread pool creation time in the Unity working context (you could use a much simpler task for that). How often is your routine called?
You could also look for main-thread issues by removing the master thread (id 0) from the work.
If nothing helps, try comparing with another parallelization approach, e.g. plain std::thread or Intel Threading Building Blocks.
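A minimal sketch of one mitigation along these lines (my own illustration, not from the original answer): cap the OpenMP team size so at least one hardware thread stays free for the engine's main/render thread. The one-thread margin is an assumption to tune, not a fixed rule.

#include <omp.h>
#include <thread>
#include <algorithm>

void ConfigureOmpForEmbedding()
{
    int hw   = (int)std::thread::hardware_concurrency(); // logical CPUs (0 if unknown)
    int team = std::max(1, hw - 1);                      // leave one for the engine's main thread
    omp_set_dynamic(0);                                  // keep the team size fixed
    omp_set_num_threads(team);                           // applies to subsequent parallel regions
}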

OpenMP Parallel Sections Within For Loop (C++) - Overhead

I have been working on a quantum simulation. Each time step a potential function is calculated, one step of the solver is iterated, and then a series of measurements are conducted. These three processes are easily parallelizable, and I've already made sure they don't interfere with each other. Additionally there is some stuff that is fairly simple, but should not be done in parallel. An outline of the setup is shown below.
omp_set_num_threads(3);

#pragma omp parallel
{
    while (notDone) {
        #pragma omp sections
        {
            #pragma omp section
            {
                createPotential();
            }
            #pragma omp section
            {
                iterateWaveFunction();
            }
            #pragma omp section
            {
                takeMeasurements();
            }
        }
        #pragma omp single
        {
            doSimpleThings();
        }
    }
}
The code works just fine! I see a speed increase, mostly associated with the measurements running alongside the TDSE solver (about 30% speed increase). However, the program goes from using about 10% CPU (about one thread) to 35% (about three threads). This would make sense if the potential function, TDSE iterator, and measurements took equally as long, but they do not. Based on the speed increase, I would expect something on the order of 15% CPU usage.
I have a feeling this has to do with the overhead of running these three threads within the while loop. Replacing
#pragma omp sections
with
#pragma omp parallel sections
(and omitting the two lines just before the loop) changes nothing. Is there a more efficient way to run this setup? I'm not sure whether the threads are constantly being recreated, or whether each thread holds up an entire core while it waits for the others to finish. If I increase the number of threads from 3 to any other number, the program uses as many resources as it wants (which could be all of the CPU) and gets no performance gain.
I've tried many options, including using tasks instead of sections (with the same results), switching compilers, etc. As suggested by Qubit, I also tried to use std::async. This was the solution! The CPU usage dropped from about 50% to 30% (this is on a different computer from the original post, so the numbers are different -- it's a 1.5x performance gain for 1.6x CPU usage basically). This is much closer to what I expected for this computer.
For reference, here is the new code outline:
#include <future>

void SimulationManager::runParallel() {
    // pointers to the three member functions (note: no parentheses here)
    auto rV = &SimulationManager::createPotential;
    auto rS = &SimulationManager::iterateWaveFunction;
    auto rM = &SimulationManager::takeMeasurements;
    std::future<int> f1, f2, f3;
    while (notDone) {
        f1 = std::async(rV, this);
        f2 = std::async(rS, this);
        f3 = std::async(rM, this);
        f1.get(); f2.get(); f3.get();
        doSimpleThings();
    }
}
The three original functions are called using std::async, and then I use the future variables f1, f2, and f3 to collect everything back to a single thread and avoid access issues.
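As a side note (my own sketch, reusing the same assumed member names): the pattern can also be written with lambdas and an explicit std::launch::async policy, which guarantees each call runs on its own thread rather than possibly being deferred until get() is called.

#include <future>

void SimulationManager::runParallel() {
    while (notDone) {
        auto f1 = std::async(std::launch::async, [this] { createPotential(); });
        auto f2 = std::async(std::launch::async, [this] { iterateWaveFunction(); });
        auto f3 = std::async(std::launch::async, [this] { takeMeasurements(); });
        f1.get(); f2.get(); f3.get();   // join all three before the serial part
        doSimpleThings();
    }
}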

OpenMP performance with omp_get_max_threads greater than the number of cores

I am a novice in parallel programming. I am running my own Gibbs sampler written in C++. The overview of the program looks something like this:
for (int iter = 0; iter <= itermax; iter++) {        // loop1
    #pragma omp parallel for schedule(dynamic)
    for (int jobs = 0; jobs <= 1000; jobs++) {        // loop2
        small_job();
        #pragma omp critical(dataupdate)
        {
            data_updates();
        }
    }
    jobs_that_cannot_be_parallelized();
}
I am running on a machine with 64 cores. Since the small_job calls are small and of variable length, I was setting the maximum number of threads (omp_get_max_threads) to 128. The number of cores used seems to be correct (see the "load last hour" figure); each of the peaks belongs to loop2.
However, when I look at the actual CPU usage (see figure), it seems a lot of CPU is used by the system and only 20% is used by user code. Is that because I am spawning lots of threads in loop2? What are the best practices for deciding on omp_get_max_threads? I know I have not given enough information, but I would really appreciate any other recommendations to make the program faster.

Simulating CPU Load In C++

I am currently writing an application in Windows using C++ and I would like to simulate CPU load.
I have the following code:
void task1(void *param) {
    unsigned elapsed = 0;
    unsigned t0;
    while (1) {
        if ((t0 = clock()) >= 50 + elapsed) { // if time elapsed is 50 ms
            elapsed = t0;
            Sleep(50);
        }
    }
}

int main() {
    int ThreadNr;
    for (int i = 0; i < 4; i++) {            // for each core (i.e. 4 cores)
        _beginthread(task1, 0, &ThreadNr);   // create a new thread running task1
    }
    while (1) {}
}
I wrote this code using the same methodology as in the answers given in this thread: Simulate steady CPU load and spikes
My questions are:
Have I translated the C# code from the other post correctly over to C++?
Will this code generate an average CPU load of 50% on a quad-core processor?
How can I, within reasonable accuracy, find out the load percentage of the CPU? (is task manager my only option?)
EDIT: The reason I ask this question is that I want to eventually be able to generate CPU loads of 10, 20, 30, ..., 90% within a reasonable tolerance. This code seems to work well for generating loads above 70%, but it seems very inaccurate at any load below 70% (as measured by the Task Manager CPU load readings).
Would anyone have any ideas as to how I could generate said loads but still be able to use my program on different computers (i.e. with different CPUs)?
At first sight, this looks like not-pretty-but-correct C++ or C (an easy way to be sure is to compile it). The includes are missing (<windows.h>, <process.h>, and <time.h>), but once those are added it compiles fine.
Note that clock and Sleep are not terribly accurate, and Sleep is not terribly reliable either. On the average, the thread function should kind of work as intended, though (give or take a few percent of variation).
However, regarding question 2) you should replace the last while(1){} with something that blocks rather than spins (e.g. WaitForSingleObject, or Sleep if you will). Otherwise the entire program will not produce a 50% load on a quad-core. You will have 100% load on one core from the main thread, plus the 4x 50% from your four workers. This will obviously sum up to more than 50% per core (and will cause threads to bounce from one core to another, resulting in nasty side effects).
Using Task Manager or a similar utility to verify whether you get the load you want is a good option (and since it's the easiest solution, it's also the best one).
Also do note that simulating load in such a way will probably kind of work, but is not 100% reliable.
There might be effects (memory, execution units) that are hard to predict. Assume for example that you're using 100% of the CPU's integer execution units with this loop (a reasonable assumption) but none of its floating point or SSE units. Modern CPUs may share resources between real or logical cores, and you might not be able to predict exactly what effects you get. Or, another thread may be memory bound or have significant page faults, so taking away CPU time won't affect it nearly as much as you think (it might in fact give it enough time to make prefetching work better). Or, it might block on AGP transfers. Or, something else you can't tell.
EDIT:
Improved version, shorter code that fixes a few issues and also works as intended:
Uses clock_t for the value returned by clock (which is technically "more correct" than using an integer that wasn't specifically typedef'd for it). Incidentally, that is probably the very reason why the original code does not work as intended: since clock_t is a signed integer under Win32, the condition in the if() always evaluates to true, so the workers sleep almost all the time, consuming no CPU.
Less code, less complicated math when spinning. Computes a wakeup time 50 ticks in the future and spins until that time is reached.
Uses getchar to block the program at the end. This does not burn CPU time, and it allows you to end the program by pressing Enter. Threads are not properly ended as one would normally do, but in this simple case it's probably OK to just let the OS terminate them as the process exits.
Like the original code, this assumes that clock and Sleep use the same ticks. That is admittedly a bold assumption, but it holds true under Win32 which you used in the original code (both "ticks" are milliseconds). C++ doesn't have anything like Sleep (without boost::thread, or C++11 std::thread), so if non-Windows portability is intended, you'd have to rethink anyway.
Like the original code, it relies on functions (clock and Sleep) which are imprecise and unreliable. Sleep(50) equals Sleep(63) on my system without using timeBeginPeriod. Nevertheless, the program works "almost perfectly", resulting in a 50% +/- 0.5% load on my machine.
Like the original code, this does not take thread priorities into account. A process that has a higher than normal priority class will be entirely unimpressed by this throttling code, because that is how the Windows scheduler works.
#include <windows.h>
#include <process.h>
#include <time.h>
#include <stdio.h>

void task1(void *)
{
    while (1)
    {
        clock_t wakeup = clock() + 50; // spin until 50 ticks (ms on Win32) have passed...
        while (clock() < wakeup) {}
        Sleep(50);                     // ...then sleep for 50 ms
    }
}

int main(int, char**)
{
    int ThreadNr;
    for (int i = 0; i < 4; i++) _beginthread(task1, 0, &ThreadNr);
    (void) getchar();                  // block without burning CPU; Enter exits
    return 0;
}
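To address the EDIT about generating 10-90% loads (a minimal sketch of my own, not from the answer above): the same busy/sleep pattern can be parameterized with a duty cycle per worker thread. The g_percent variable and the 100 ms period are assumptions to tune; accuracy is still limited by clock and Sleep.

#include <windows.h>
#include <process.h>
#include <time.h>

static int g_percent = 30;                 // hypothetical target load per worker, 0..100

void loadTask(void *)                      // start one per core with _beginthread, as above
{
    const clock_t period = 100;            // 100 ms cycle
    const clock_t busy   = period * g_percent / 100;
    while (1) {
        clock_t wakeup = clock() + busy;
        while (clock() < wakeup) {}        // spin for the busy part of the cycle
        Sleep(period - busy);              // sleep for the remainder
    }
}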
Here is a code sample which loaded my CPU to 100% on Windows.
#include "windows.h"
DWORD WINAPI thread_function(void* data)
{
float number = 1.5;
while(true)
{
number*=number;
}
return 0;
}
void main()
{
while (true)
{
CreateThread(NULL, 0, &thread_function, NULL, 0, NULL);
}
}
When you build and run the app, press Ctrl-C to kill it.
You can use the Windows perf counter API to get the CPU load. Either for the entire system or for your process.
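For illustration, a minimal sketch of reading the system-wide CPU load through the PDH performance counter API (my own example; error handling is omitted, and the counter path assumes the English counter name resolved via PdhAddEnglishCounter):

#include <windows.h>
#include <pdh.h>
#include <stdio.h>
#pragma comment(lib, "pdh.lib")

int main()
{
    PDH_HQUERY query;
    PDH_HCOUNTER counter;
    PdhOpenQuery(NULL, 0, &query);
    PdhAddEnglishCounterA(query, "\\Processor(_Total)\\% Processor Time", 0, &counter);
    PdhCollectQueryData(query);                 // first sample (baseline)
    for (int i = 0; i < 10; i++) {
        Sleep(1000);
        PdhCollectQueryData(query);             // next sample; the delta gives the load
        PDH_FMT_COUNTERVALUE value;
        PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value);
        printf("CPU load: %.1f%%\n", value.doubleValue);
    }
    PdhCloseQuery(query);
    return 0;
}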