Increased system CPU time in non-threaded regions of OpenMP code - c++

I am experiencing a very peculiar issue. I am parallelizing parts of a very large code base (about 600MB of source code) using OpenMP, but I'm not getting the speedups I expected. I profiled many parts of the code using omp_get_wtime() to get reliable estimates, and I discovered the following: the OpenMP parts of my code do have better wall times when I use more threads, but when I turn the threads on, the program shows increased system CPU time in parts of the code where there are no omp pragmas at all (e.g., time spent in IPOPT). I grepped for "pragma" to make sure. To clarify: I know for certain, from my profiling and the code structure, that the threaded parts and the parts where I see the slowdowns run at completely different times.
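For illustration, this is the kind of omp_get_wtime() measurement I mean; run_solver_stub() here is just a hypothetical stand-in for the third-party, non-OpenMP work (e.g., an IPOPT solve):
#include <omp.h>
#include <cstdio>
#include <cmath>

// Hypothetical stand-in for a non-OpenMP, third-party section; there are no
// omp pragmas anywhere in this call tree.
double run_solver_stub() {
    double x = 0.0;
    for (int i = 1; i < 10000000; ++i) x += std::sqrt(static_cast<double>(i));
    return x;
}

int main() {
    double t0 = omp_get_wtime();
    double r = run_solver_stub();   // serial, non-threaded work
    double t1 = omp_get_wtime();
    std::printf("result=%g, non-threaded section: %.3f s wall time\n", r, t1 - t0);
    return 0;
}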
My code consists of two main parts. The first is the base code (about 25MB of code), which I compile using -fopenmp, and the second is third-party code (the remaining 575MB) which is not compiled with the OpenMP flag.
Can anyone think of a reason why something like this could be happening? I can't fathom it being resource contention since the slowdowns are in the non-threaded parts of the code.
I am running my code on an Intel i7-4600U (2 physical cores, 4 hardware threads), compiled with clang++ 3.8 on Ubuntu 14.04, with -O3 optimizations.
Any ideas would be great!

Related

Compiling C++ code with a processor-specific flag makes the code slower (Intel compiler)

I am doing some profiling, and performance is important to me (even 5%). The processor is an Intel Xeon Platinum 8280 ("Cascade Lake") on Frontera. I compile my code with the -Ofast flag, in Release mode. When I add -march=cascadelake, the timing gets worse (5-6%) in my test case. The same is true if I use -xCORE-AVX512 instead of -march. I am using icpc 19.1.1.217. Can anyone please explain why? Also, what compilation flags do you suggest for better performance?
Edit 1: I am solving a linear system, which involves different operations such as dot products and matrix-vector products. It would therefore be hard for me to provide reproducible code, but I can say that there are multiple loops in my code to which the compiler can apply auto-vectorization. I have used Intel optimization reports on the critical loops in my code, and the reports mentioned potential speedups of at least 1.75x for them (for some of the loops it was over 5x).
I have also used aligned_alloc(64, size) to allocate memory with 64-byte alignment, since this processor supports AVX-512. I also round the size up to a multiple of 64 (a sketch of this pattern follows below).
I have added OpenMP support to my code and have parallelized some loops, but for the experiments I am reporting here, I am using only one OpenMP thread.
I have tried -mavx2, and I got the same result as with -xCORE-AVX512.
I have used -O3 instead of -Ofast. I did not get any speedup.
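Here is a minimal sketch of the allocation pattern I described above (the dot-product loop is just an illustrative auto-vectorizable consumer, not my actual code):
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t n = 1000;                            // number of doubles
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 63) & ~static_cast<size_t>(63);  // round up to a multiple of 64
    // 64-byte-aligned allocation (std::aligned_alloc is C++17 / C11)
    double* v = static_cast<double*>(std::aligned_alloc(64, bytes));
    if (!v) return 1;

    for (size_t i = 0; i < n; ++i) v[i] = 1.0;
    double dot = 0.0;
    for (size_t i = 0; i < n; ++i) dot += v[i] * v[i];  // auto-vectorizable loop
    std::printf("%f\n", dot);

    std::free(v);
    return 0;
}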

Threading analysis in Vtune hangs at __kmp_acquire_ticket_lock

I am currently benchmarking a project written in C++ to determine the hot spots and the threading efficiency, using Intel VTune. A normal run of the program takes ~15 minutes. Using the hotspot analysis in VTune, I can see that the function __kmp_fork_barrier takes up roughly 40% of the total CPU time.
Therefore, I also wanted to look at the threading efficiency, but when I start the threading module in VTune, it does not start the project at all; instead it hangs at __kmp_acquire_ticket_lock when running in hardware event-based sampling mode. When running in user-mode sampling mode instead, the project immediately fails with a segfault (which does not occur when running it without VTune and checking it with valgrind). When using HPC performance characterization instead, VTune crashes.
Are those issues with VTune, or with my program? And how can I find the issues with the latter?
__kmp_xxx calls are functions of the Intel/Clang OpenMP runtime. __kmp_fork_barrier is called when an OpenMP barrier is reached. If you spend 40% of your time in this function, it means that you have a load-balancing issue between the OpenMP threads in your program. You need to fix this work imbalance to get better performance. You can use the (experimental) OMPT support of the runtimes to track what the threads are doing and when. VTune has at least minimal support for profiling OpenMP programs. A VTune crash is likely a bug, and it should be reported on the Intel forum so that the VTune developers can fix it. On your side, you can check that your program always passes all OpenMP barriers in a deterministic way. For more information, see the Intel VTune OpenMP tutorial.
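To get a first idea of the imbalance without any tooling, you can time the per-thread work directly. A minimal sketch (the squared-sum loop is just placeholder work, not your code):
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 24;
    std::vector<double> data(n, 1.0);
    double total = 0.0;

    #pragma omp parallel reduction(+ : total)
    {
        double t0 = omp_get_wtime();
        double local = 0.0;
        #pragma omp for schedule(static) nowait  // nowait: time the work, not the barrier
        for (int i = 0; i < n; ++i)
            local += data[i] * data[i];
        total += local;
        std::printf("thread %d: %.3f ms of work\n",
                    omp_get_thread_num(), (omp_get_wtime() - t0) * 1e3);
    }   // implicit barrier: threads that finished early wait here, which shows up as barrier time
    std::printf("total = %f\n", total);
    return 0;
}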
Note that these VTune results also suggest that your OpenMP runtime is configured so that threads actively poll the state of other threads, which is good for reducing latency but not always for performance or energy savings. You can control this behaviour of the runtime with the OMP_WAIT_POLICY environment variable.

Using multiple OMP parallel sections in parallel -> Performance Issue?

I am trying to understand a huge performance problem with one of our C++ applications using OpenMP (on Windows). The structure of the application is as follows:
I have an algorithm which basically consists of a couple of for-loops which are parallelized using OpenMP:
void algorithm()
{
    #pragma omp parallel for num_threads(12)
    for (int i = 0; ...)
    {
        // do some heavy computation (pure memory and CPU work, no I/O, no waiting)
    }
    // ... some more for-loops of this kind
}
The application executes this algorithm n times in parallel from n different threads:
std::thread t1(algorithm);
std::thread t2(algorithm);
//...
std::thread tn(algorithm);
t1.join();
t2.join();
//...
tn.join();
// end of application
Now, the problem is as follows:
when I run the application with n=1 (only one call to algorithm()) on my system with 32 physical CPU cores (no hyperthreading), it takes about 5s and loads the CPU to about 30%, as expected (given that I have told OpenMP to use only 12 threads).
when I run with n=2, the CPU load goes up to about 60%, but the application takes almost 10 seconds. In other words, running multiple algorithm instances in parallel buys me almost nothing.
This alone, of course, can have many reasons (including cache misses, RAM bandwidth limitations, etc.), but there is one thing that strikes me:
if I run my application twice in two parallel processes, each with n=1, both processes complete after about 5 seconds, meaning that I am perfectly able to run two of my algorithms in parallel, as long as they live in different processes.
This seems to exclude many possible reasons for this performance bottleneck. And indeed, I have been unable to understand the cause of this, even after profiling the code. One of my suspicions is that there might be some excessive synchronization in OpenMP between different parallel sections.
Has anyone ever seen an effect like this before? Or can anyone give me advice on how to approach this? I have really come to a point where I have tried everything I can imagine, but without any success so far. I would thus appreciate any help I can get!
Thanks a lot,
Da
PS.:
I have been using both MS Visual Studio 2015 and Intel's 2017 compiler; both show basically the same effect.
I have a very simple reproducer showing this problem which I can provide if needed. It is really not much more than the above, just adding some real work to be done inside the for-loops.
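Such a reproducer could look roughly like this (a sketch only; the loop body is arbitrary stand-in work, not my actual computation):
#include <omp.h>
#include <cmath>
#include <thread>
#include <vector>

void algorithm() {
    std::vector<double> v(1 << 22, 1.0);
    #pragma omp parallel for num_threads(12)
    for (int i = 0; i < static_cast<int>(v.size()); ++i)
        v[i] = std::sqrt(v[i]) * std::sin(v[i]);  // heavy CPU work, no I/O, no waiting
}

int main() {
    const int n = 2;  // number of concurrent algorithm() instances
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i)
        threads.emplace_back(algorithm);
    for (auto& t : threads)
        t.join();
    return 0;
}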

Can badly vectorized code impact scalability?

I have parallelized existing code for computer vision applications using OpenMP. I think that I designed it well because:
The workload is well-balanced
There is no synchronization/locking mechanism
I parallelized the outer most loops
All the cores are used for most of the time (there are no idle cores)
There is enough work for each thread
Now, the application doesn't scale when using many cores; e.g., it doesn't scale well beyond 15 cores.
The code uses external libraries (i.e. OpenCV and IPP) whose code is already optimized and vectorized, and I manually vectorized some portions of my own code as best I could. However, according to Intel Advisor, the code isn't well vectorized, but there is not much left to do: I have already vectorized the code where I could, and I can't improve the external libraries.
So my question is: is it possible that vectorization is the reason why the code doesn't scale well at some point? If so, why?
In line with the comments from Adam Nevraumont, VTune Amplifier can do a lot to pinpoint memory bandwidth issues: https://software.intel.com/en-us/vtune-amplifier-help-memory-access-analysis.
It may be useful to start at a higher level of analysis than that, though, such as looking at hot spots. If it turns out that most of your time is spent in OpenCV or similar, as you're concerned about, finding that out early might save some time versus digging into memory bottlenecks directly.
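As a side note on why vectorization and scalability interact: vectorized loops consume memory bandwidth faster, so their parallel speedup can flatten once the cores saturate DRAM bandwidth. An illustrative sketch (not your code) of such a bandwidth-bound loop:
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const long n = 1 << 24;  // ~128MB per array: far too big for the caches
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] = b[i] + 3.0 * c[i];  // ~24 bytes of memory traffic per 2 flops
    double t1 = omp_get_wtime();

    std::printf("%d threads: %.3f ms\n", omp_get_max_threads(), (t1 - t0) * 1e3);
    return 0;
}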

How to optimize large data manipulation in parallel

I'm developing a C/C++ application to manipulate large quantities of data in a generic way (aggregation/selection/transformation).
I'm using an AMD Phenom II X4 965 Black Edition, so I have a decent amount of cache at the different levels.
I've developed both ST and MT versions of the functions that perform all the single operations and, not surprisingly, in the best case the MT versions are 2x faster than the ST ones, even when using 4 cores.
Given that I'm a fan of using 100% of the available resources, I was pissed that it's just 2x; I'd want 4x.
For this reason I've already spent quite a considerable amount of time with -pg and valgrind, using the cache simulator and the call graph. The program is working as expected, the cores are sharing the input work (i.e., the operations to apply to the data), and cache misses are reported (as expected, sic) when the different threads load the data to be processed (millions of entities, or rows, if that gives you an idea of what I'm trying to do :-) ).
Eventually I've tried different compilers, g++ and clang++, both with -O3, and performance is identical.
My conclusion is that, due to the large amount of data (GBs) to process, and given that the data eventually has to be loaded into the CPU, this is genuine memory wait time.
Can I further improve my software? Have I hit a limit?
I'm using C/C++ on Linux x86-64, Ubuntu 11.10.
I'm all ears! :-)
What kind of application is it? Could you show us some code?
As I commented, you might have reached some hardware limit like RAM bandwidth. If you have, no software trick can improve on it.
You might investigate using MPI, OpenMP, or OpenCL (on GPUs) but without an idea of your application we cannot help.
If you are compiling with GCC and want to help the processor's cache prefetching, consider using __builtin_prefetch, with care and parsimony (using it too much or badly would decrease performance).
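A minimal sketch of how __builtin_prefetch is typically used; the prefetch distance of 16 elements is an arbitrary illustrative value that has to be tuned and measured on the target machine:
#include <cstddef>
#include <cstdio>
#include <vector>

double sum_with_prefetch(const double* data, std::size_t n) {
    const std::size_t dist = 16;  // prefetch distance, in elements (tuning knob)
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + dist < n)  // prefetch for read, with low temporal locality
            __builtin_prefetch(&data[i + dist], /*rw=*/0, /*locality=*/1);
        sum += data[i];
    }
    return sum;
}

int main() {
    std::vector<double> v(1 << 20, 0.5);
    std::printf("%f\n", sum_with_prefetch(v.data(), v.size()));
    return 0;
}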