OpenMP consumes all the CPU power before even compiling/running the code? (parallel for) - C++

I was looking for a way to parallelize "for loops" without implementing pthread routines and the like myself. I stumbled upon OpenMP and the #pragma omp parallel for default(none) directive. Since my for loop has several variables which are "shared" (like some integer values, and also some arrays where I want to store what I calculate in the loop, at the respective index position), I have added shared(variable1, variable2, ...) and so on. However, by doing so, I noticed that the warnings in CLion which highlight the shared variables won't go away. Furthermore, I noticed that as soon as I put the shared clause in my code, all of my 6 CPU cores get busy, most of the time at 100 percent usage.
This seems very odd to me, since I haven't even compiled the code yet. The cores start working as soon as I add the shared() clause with some variables to the code.
I have never worked with OpenMP, so I don't know if I may be using it wrong. It would be great if someone could help me out with that, or give a hint as to why this happens.
Edit:
For clarification: by warnings, I mean that the IDE underlines in red all the variables which seem to be shared. The CPU consumption is by the IDE itself when I add the shared() clause to the code. However, I have no clue why adding this clause would consume this much CPU.
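For context, the directive in question looks roughly like the sketch below (the variable and function names are placeholders, not taken from the original project); with default(none), every variable used in the loop has to be listed explicitly in a data-sharing clause:

#include <vector>

void fill_results(std::vector<double>& results, int scale)
{
    int n = static_cast<int>(results.size());

    // default(none) forces an explicit data-sharing attribute for every
    // variable referenced inside the loop; the loop index i is implicitly private.
    #pragma omp parallel for default(none) shared(results, n, scale)
    for (int i = 0; i < n; ++i)
    {
        results[i] = static_cast<double>(i) * scale;  // store per-index result
    }
}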

Related

Using multiple OMP parallel sections in parallel -> Performance Issue?

I am trying to understand a huge performance problem with one of our C++ applications using OpenMP (on Windows). The structure of the application is as follows:
I have an algorithm which basically consists of a couple of for-loops which are parallelized using OpenMP:
void algorithm()
{
    #pragma omp parallel for num_threads(12)
    for (int i = 0; ...)
    {
        // do some heavy computation (pure memory and CPU work, no I/O, no waiting)
    }
    // ... some more for-loops of this kind
}
The application executes this algorithm n times in parallel from n different threads:
std::thread t1(algorithm);
std::thread t2(algorithm);
//...
std::thread tn(algorithm);
t1.join();
t2.join();
//...
tn.join();
// end of application
Now, the problem is as follows:
when I run the application with n=1 (only one call to algorithm()) on my system with 32 physical CPU cores (no hyperthreading), it takes about 5s and loads the CPU to about 30% as expected (given that I have told OpenMP to only use 12 threads).
when I run with n=2, the CPU load goes up to about 60%, but the application takes almost 10 seconds. This means that it is almost impossible to run multiple algorithm instances in parallel.
This alone, of course, can have many reasons (including cache misses, RAM bandwidth limitations, etc.), but there is one thing that strikes me:
if I run my application twice in two parallel processes, each with n=1, both processes complete after about 5 seconds, meaning that I was well able to run two of my algorithms in parallel, as long as they live in different processes.
This seems to exclude many possible reasons for this performance bottleneck. And indeed, I have been unable to understand the cause of this, even after profiling the code. One of my suspicions is that there might be some excessive synchronization in OpenMP between different parallel sections.
Has anyone ever seen an effect like this before? Or can anyone give me advice on how to approach this? I have really come to a point where I have tried everything I can imagine, but without any success so far. I therefore appreciate any help I can get!
Thanks a lot,
Da
PS.:
I have been using both MS Visual Studio 2015 and Intel's 2017 compiler; both show basically the same effect.
I have a very simple reproducer showing this problem which I can provide if needed. It is really not much more than the above, just adding some real work to be done inside the for-loops.
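A reproducer in that spirit might look like the sketch below; the workload inside the loop is only a placeholder, not the application's actual computation:

#include <cmath>
#include <thread>
#include <vector>

void algorithm()
{
    std::vector<double> data(50000000);
    const int size = static_cast<int>(data.size());

    #pragma omp parallel for num_threads(12)
    for (int i = 0; i < size; ++i)
    {
        data[i] = std::sqrt(static_cast<double>(i)) * 1.0001;  // CPU + memory work
    }
}

int main()
{
    const int n = 2;  // number of concurrent algorithm() instances
    std::vector<std::thread> workers;
    for (int t = 0; t < n; ++t)
        workers.emplace_back(algorithm);
    for (auto& w : workers)
        w.join();
}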

To what extent DLLs speed up calculations in a code such as loops etc

I have code with a loop that counts to 10000000000, and within that loop, I do some calculations with conditional operators (if etc.). It takes about 5 minutes to reach that number. So, my question is: can I reduce the time it takes by creating a DLL and calling that DLL's functions to do the calculation and return the values to the main program? Will it make a difference in the time it takes to do the calculations? Further, will it improve the overall efficiency of the program?
By a “dll” I assume you mean going from managed .net code to that of un-managed “native” compiled code. Yes this approach can help.
It much depends. Remember, the speed of the loop code is likely only 25 seconds on a typical i3 (that is the cost and overhead of looping to 10 billion while doing not much else).
And I assume you went to the project, then Compile. On that screen, select advanced compile options. There you want to check remove integer overflow checks. Make sure your loop vars are integers for speed.
At that point the “base” loop that does nothing will drop from about 20 seconds down to about 6 seconds.
So that is the base loop speed – now it comes down to what we are doing inside of that loop.
At this point, .NET DOES HAVE a JIT (a just-in-time native compiler). This means your source code goes to "CLR" code, and then in turn that code gets compiled down to native x86 assembly code. So this "does" get the source code down to REAL machine code level. However, a JIT is certainly NOT as efficient, nor can it spend "time" optimizing the code, since the JIT has to work on the "fly" without you noticing it. So C++ (or VB6, which runs as fast as C++ when native compiled) can certainly run faster, but the question then is by how much?
The optimizing compiler might roughly double the speed again for the actual LOOPING code etc.
However, in BOTH cases (using .NET managed code, or code compiled down to native Intel code), they BOTH LIKELY are calling the SAME routines to do the math!
In other words, if 80% of the time is spent in "library" code that does the math etc., then calling such code from C++ or calling such code from .NET will make VERY LITTLE difference, since the BULK of the work is spent in the same system code!
The above concept is really “supervisor” mode vs. your application mode.
In other words, the amount of time spent in your code vs. that spent in system "library" code means that the bulk of the heavy lifting is occurring in supervisor code. That means jumping from .NET to native C++/VB6 DLLs will NOT yield much in the way of performance.
So I would first ensure that loops and array index refs in your code are integer types. The above tip of taking off bounds checking will likely get you "close" to that of using a .dll. Worse, the time to "shuffle" the data to and from that external .dll sub will often cost you MORE than the time saved on the processing side.
And if your routines are doing database or file i/o, then all bets are off, as that is VERY different problem.
So I would first test/try your application with the [x] remove integer overflow checks option enabled (i.e., with the overflow checks turned off). And make sure during testing that you use Ctrl-F5 in place of F5 to run your code without DEBUGGING. The above overflow check option will NOT show increased speed when running in debug mode.
So it's hard to tell – it really depends on how much math (especially floating-point calls) you are doing (supervisor code) vs. just moving values around in arrays. If more of the code is moving things around, then I suggest the integer optimizations above, and going to a .dll likely will not help much.
Couldn't you utilize "Parallel.ForEach" and split this huge loop into some equal pieces?
Or try to work with some BackgroundWorkers or even threads (more than one!) to achieve optimal CPU performance and try to reduce the time spent.
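In C++/OpenMP terms (the language used elsewhere on this page), splitting such a counting loop across cores could look roughly like the sketch below; the loop body is a stand-in for the original conditional calculations:

#include <cstdint>
#include <cstdio>

int main()
{
    const std::int64_t limit = 10000000000LL;  // the 10-billion iteration count
    std::int64_t hits = 0;

    // Each thread processes a chunk of the range; reduction(+:hits)
    // merges the per-thread partial counts when the loop finishes.
    #pragma omp parallel for reduction(+:hits)
    for (std::int64_t i = 0; i < limit; ++i)
    {
        if (i % 7 == 0)  // placeholder condition, not the original calculation
            ++hits;
    }

    std::printf("hits = %lld\n", static_cast<long long>(hits));
    return 0;
}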

Do I need to disable OpenMP on a 1 core machine explicitly?

I parallelized some C++ code with OpenMP.
But what if my program will work on a 1 core machine?
Do I need to disable threading usage at runtime:
Check cores
If cores > 1, use OpenMP
Else ignore OpenMP directives
If yes, does OpenMP have a special directive for it?
No, you don't need to disable OpenMP or threading for running on one core; and for situations where you might want to, you're probably better off explicitly re-compiling without OpenMP, although for complex parallelizations there are other measures, mentioned in the comments, that you can take as well.
When running on a single core or even hardware thread, even if you change nothing - not even the number of threads your code launches - correct, deadlock-free threading code should still run correctly, as the operating system schedules the various threads on the core.
Now, context switching between threads is costly overhead. Typical OpenMP code, which is compute-bound and relies on work sharing constructs to assign work between threads, treats the number of threads as a parameter and launches as many threads as you have cores or hardware threads available. For such code, where you are just using constructs like
#pragma omp parallel for
for (int i = 0; i < N; i++)
    data[i] = expensive_function(i);
then running on one core will likely only use one thread, or you can explicitly set the number of threads to be one using the OMP_NUM_THREADS environment variable. If OpenMP is to use only one thread and the computation is time-consuming enough, the overhead from the threading library in the above loop is negligible. In this case, there's really no disabling of OpenMP necessary; you're just running on one thread. You can also set the number of threads within the program using omp_set_num_threads(), but best practice is normally to do this at runtime.
However, there's a downside. Compiling with OpenMP disables certain optimizations. For instance, because the work decomposition is done at runtime, even loops with compiled-in trip counts may not be unrolled or vectorized as effectively, because it isn't known how many trips through the loop each thread will take. In that case, if you know that your code will be run on a single core, it may be worth doing the compilation without OpenMP enabled as well, and using that binary for single-core runs. You can also use this approach to test whether the difference in optimizations matters, by running the OpenMP-enabled version with OMP_NUM_THREADS=1 and comparing the timing to that of the serial binary.
Of course, if your threading is more complex than using simple work sharing constructs, then it starts being harder to make generalizations. If you have assumptions built into your code about how many threads are present - maybe you have an explicit producer/consumer model for processing work, or a data decomposition hardcoded in, either of which are doable in OpenMP - then it's harder to see how things work. You may also have parallel regions which are much less work than a big computational loop; in those cases, where overhead even with one thread might be significant, it might be best to use if clauses to provide explicit serial paths, e.g.:
int nThreadMax = omp_get_max_threads();
#pragma omp parallel if (nThreadMax > 1)
if (omp_in_parallel()) {
    // Parallel code path
} else {
    // Serial code path
}
But now doing compilation without OpenMP becomes more complicated.
Summarizing:
For big heavy computation work, which is what OpenMP is typically used for, it probably doesn't matter; use OMP_NUM_THREADS=1
You can test if it does matter, with overhead and disabled optimizations, by compiling without OpenMP and comparing the serial runtime to the one-thread OpenMP runtime
For more complicated threading cases, it's hard to say much in general; it depends.
I believe there is a function called:
omp_get_num_procs()
that will let you know how many processors are available for OpenMP to work on. Then there are many ways to disable OpenMP. From your code you can run:
omp_set_num_threads(1)
Just remember that even on a single core you can get some boost with OpenMP. It just depends on the specifics of your case.
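Putting those two calls together, the runtime check described in the question could be sketched as follows (omp_get_num_procs() and omp_set_num_threads() are the standard OpenMP API; the surrounding function is illustrative):

#ifdef _OPENMP
#include <omp.h>
#endif
#include <vector>

void process(std::vector<double>& data)
{
    const int n = static_cast<int>(data.size());

    // Use one thread per available processor, falling back to a single
    // thread on a one-core machine. If the code is compiled without
    // OpenMP, the pragma below is simply ignored and the loop runs serially.
#ifdef _OPENMP
    int procs = omp_get_num_procs();
    omp_set_num_threads(procs > 1 ? procs : 1);
#endif

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
        data[i] *= 2.0;  // placeholder work
    }
}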

Performance difference using dynamic number of threads

I'm using OpenMP to parallelize some heavy loops, and it works as expected.
Testing showed that this directive gave the most performance:
#pragma omp parallel for num_threads(7)
However, that may differ from machine to machine. Also, I wanted to be able to switch threading on/off using a runtime switch.
Therefore, I figured I could use something like this:
if (shouldThread)
    omp_set_num_threads(optimalNumberOfThreadsForThisMachine);
else
    omp_set_num_threads(1);
Where on my computer, the optimal number of threads is 7 in this example. Then, use this directive instead:
#pragma omp parallel for
It works well - except that the code compiled with the second directive is about 50% slower. Is this to be expected? I figure the runtime has to do dynamic dispatching and work scheduling, while the compile-time directive can add some sort of optimization, I guess.
The code is compiled with MSVC 2013, on a Core i7-3740.
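For what it's worth, the num_threads clause accepts a runtime integer expression, so both goals can be combined in one directive; a sketch along these lines (variable and function names assumed, not from the original code) keeps the clause on the pragma while still allowing the runtime switch:

void heavyLoop(double* data, int n, bool shouldThread, int optimalThreads)
{
    // num_threads() takes any integer expression evaluated at runtime,
    // so the thread count can still be decided per machine / per run.
    int threads = shouldThread ? optimalThreads : 1;

    #pragma omp parallel for num_threads(threads)
    for (int i = 0; i < n; ++i)
    {
        data[i] = data[i] * data[i];  // placeholder for the heavy loop body
    }
}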

Does MSVC automatically optimize computation on dual core architecture?

Does MSVC automatically optimize computation on dual core architecture?
void Func()
{
Computation1();
Computation2();
}
Given two computations with no relation to each other in a function, does the Visual Studio
compiler automatically optimize the computation and allocate them to different cores?
Don't quote me on it but I doubt it. The OpenMP pragmas are the closest thing to what you're trying to do here, but even then you have to tell the compiler to use OpenMP and delineate the tasks.
Barring linking to libraries which are inherently multi-threaded, if you want to use both cores you have to set up threads and divide the work you want done intelligently.
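To illustrate what "delineating the tasks" looks like with OpenMP, the two calls would be marked as independent sections, roughly as in the sketch below (this assumes Computation1 and Computation2 really are independent):

void Computation1();
void Computation2();

void Func()
{
    // Each section may be executed by a different thread of the team;
    // this is only valid because the two calls share no state.
    #pragma omp parallel sections
    {
        #pragma omp section
        Computation1();

        #pragma omp section
        Computation2();
    }
}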
No. It is up to you to create threads (or fibers) and specify what code runs on each one. The function as defined will run sequentially. It may switch to another core (thanks Drew) during execution, but it will still be sequential. In order for two functions to run concurrently on two different cores, they must first be running in two separate threads.
As greyfade points out, the compiler is unable to detect whether it is possible. In fact, I suspect that this is in the class of NP-Complete problems. If I am wrong, I am sure one of the compiler gurus will let me know.
There's no reliable way for the compiler to detect that the two functions are completely independent and that they have no state. Therefore, there's no way for the compiler to know that it's safe to break them out into separate threads of execution. In fact, threads aren't even part of the C++ standard (until C++1x), and even when they will be, they won't be an intrinsic feature - you must use the feature explicitly to benefit from it.
If you want your two functions to run in independent threads, then create independent threads for them to execute in. Check out boost::thread (which is also available in the std::tr1 namespace if your compiler has it). It's easy to use and works perfectly for your use case.
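A sketch of that explicit approach with standard threads (std::thread shown here; boost::thread is nearly identical):

#include <thread>

void Computation1();
void Computation2();

void Func()
{
    // Each computation runs in its own thread, so the OS is free to
    // schedule them on different cores. This is safe only because the
    // two functions share no state.
    std::thread t1(Computation1);
    std::thread t2(Computation2);
    t1.join();
    t2.join();
}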
No. Madness would ensue if compilers did such a thing behind your back; what if Computation2 depended on side effects of Computation1?
If you're using VC10, look into the Concurrency Runtime (ConcRT or "concert") and its partner the Parallel Patterns Library (PPL).
Similar solutions include OpenMP (kind of old and busted IMO, but widely supported) and Intel's Threading Building Blocks (TBB).
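With PPL specifically, running the two calls in parallel is essentially a one-liner using parallel_invoke (a sketch; again assuming the two functions are independent):

#include <ppl.h>

void Computation1();
void Computation2();

void Func()
{
    // parallel_invoke runs the supplied callables, potentially in parallel,
    // and returns only after both have completed.
    Concurrency::parallel_invoke(Computation1, Computation2);
}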
The compiler can't tell if it's a good idea.
First, of course, the compiler must be able to prove that it would be a safe optimization: that the functions can safely be executed in parallel. In general, that's an NP-complete problem, but in many simple cases the compiler can figure that out (it already does a lot of dependency analysis).
Some bigger problems are:
it might turn out to be slower. Creating threads is a fairly expensive operation. The cost of that may just outweigh the gain from parallelizing the code.
it has to work well regardless of the number of CPU cores. The compiler doesn't know how many cores will be available when you run the program. So it'd have to insert some kind of optional forking code. If a core is available, follow this code path and branch out into a separate thread, otherwise follow this other code path. And again, more code and more conditionals also has an effect on performance. Will the result still be worth it? Perhaps, but how is the compiler supposed to know that?
it might not be what the programmer expects. What if I already create precisely two CPU-heavy threads on a dual-core system? I expect them both to be running 99% of the time. Suddenly the compiler decides to create more threads under the hood, and suddenly I have three CPU-heavy threads, meaning that mine get less execution time than I'd expected.
How many times should it do this? If you run the code in a loop, should it spawn a new thread in every iteration? Sooner or later the added memory usage starts to hurt.
Overall, it's just not worth it. There are too many cases where it might backfire. Added to the fact that the compiler could only safely apply the optimization in fairly simple cases in the first place, it's just not worth the bother.