Efficiency of using omp parallel for inside omp parallel omp task - c++

I am trying to improve the efficiency of a complicated project using OpenMP. I will use a toy example to show the situation I am facing.
First, I have a recursion function and I use omp task to improve it, like this:
void CollectMsg(Node *n) {
    #pragma omp task
    {
        CollectMsg(n->child1);
        ProcessMsg1(n->array, n->child1->array);
    }
    #pragma omp task
    {
        CollectMsg(n->child2);
        ProcessMsg1(n->array, n->child2->array);
    }
    #pragma omp taskwait
    ProcessMsg2(n->array);
}
int main() {
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        {
            CollectMsg(root);
        }
    }
}
Then, however, the function ProcessMsg1 contains some computation and I want to use omp parallel for on it. The following code shows a simple example. Done this way, the project still runs correctly, but its efficiency is only slightly improved, and I am wondering about the working mechanism of this process. Does omp parallel for actually work inside an omp task? I ask because there is only a slight difference in efficiency between using and not using omp parallel for in ProcessMsg1. If the answer is yes: I created 2 threads for omp task, and in the processing of each task I created 2 threads for omp parallel for, so should I think of this as 2*2=4-threaded parallelism? If the answer is no, how can I further improve the efficiency of this case?
void ProcessMsg1(int *a1, int *a2) {
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}

Nesting a parallel for inside a task will work, but it requires you to enable nested parallelism and will lead to all sorts of issues that are difficult to resolve, especially around load balancing and the actual number of threads being created.
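For completeness, a hedged sketch of what enabling nested execution would involve (illustrative only; the taskloop approach below avoids this entirely):
    // Illustrative only: allow a second level of parallelism so that a
    // parallel for inside a task can actually use extra threads (needs <omp.h>).
    omp_set_max_active_levels(2);   // OpenMP 3.0+; the preferred call today
    // omp_set_nested(1);           // older API, deprecated in OpenMP 5.0
    // or set the environment variable OMP_MAX_ACTIVE_LEVELS=2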
I would recommend using the taskloop construct instead:
void ProcessMsg1(int *a1, int *a2) {
    #pragma omp taskloop grainsize(100)
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}
This will create additional tasks to execute the loop. If ProcessMsg1 is executed from different tasks in the system, the tasks created by taskloop will automatically mix with all the other tasks that your code is creating.
Depending on how much work your code does per loop iteration, you may want to adjust the size of the tasks using the grainsize clause (I just put 100 to have an example). See the OpenMP Spec for further details.
If your code does not have to wait until all of the loop in ProcessMsg1 has been executed, you can add the nowait clause, which basically means that once all the loop tasks have been created, they can still execute on other threads while the ProcessMsg1 function is already returning. The default behavior of taskloop, though, is to wait until the loop tasks (and any child tasks they create) have completed.
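As a minimal sketch, the nowait variant described above would look like this (size is assumed to be the same variable the question's version relies on):
void ProcessMsg1(int *a1, int *a2) {
    // Loop tasks are created and may run on other threads; this function
    // does not wait for them to finish (OpenMP 4.5+ for nowait on taskloop).
    #pragma omp taskloop grainsize(100) nowait
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}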

Q :" ... how can I further improve the efficiency of this case? "
A : follow the "economy-of-costs"some visiblesome less, yet nevertheless also deciding the resulting ( im )performance & ( in )efficiency
How much did we already have to pay, before any PAR-section even started?
Let's start with the costs of the proposed addition of multiple streams of code execution:
Using the number of ASM instructions as a simplified measure of how much work has to be done (where all of the CPU clocks + RAM-allocation costs + RAM-I/O + O/S-management time spent counts), we start to see the relative cost of all these (unavoidable) add-on costs, compared to the actual useful work (i.e. how many ASM instructions are finally spent on what we want to compute, in contrast to the amount of overhead already burnt just to make that happen).
This ratio is cardinal when fighting for both performance and efficiency (of resource-usage patterns).
For cases where the add-on overhead costs dominate, these cases are a straight anti-pattern.
For cases where the add-on overhead costs make up less than 0.01 % of the useful work, we may still end up with unsatisfactorily low speed-ups (see the simulator and related details).
For cases where the scope of useful work dwarfs all add-on overhead costs, we still hit the Amdahl's Law ceiling - aptly called the "Law of diminishing returns", since adding more resources eventually ceases to improve the performance, even if we add infinitely many CPU cores and the like.
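For reference, one overhead-aware form of Amdahl's Law makes the above concrete (the symbols are illustrative, not taken from the question: p is the parallelisable fraction, N the number of workers, o(N) the add-on overhead expressed as a fraction of the total useful work):
    S(N) = 1 / ( (1 - p) + p / N + o(N) )
Even with p = 1, any o(N) that grows with N first flattens and then reverses the speed-up, which is the effect the Overhead-slider experiments below are meant to expose.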
Tips for experimenting with the Costs-of-instantiation(s):
We start by assuming the code is 100% parallelisable (having no pure-[SERIAL] part, which is never the case in reality, but let's use it).
- first, move the NCPUcores-slider in the simulator to anything above 64 cores
- next, move the Overhead-slider in the simulator to anything above plain zero (it expresses the relative add-on cost of spawning one of the NCPUcores processes, as a percentage of the [PARALLEL]-section's number of instructions). Mathematically "dense" work has many such "useful" instructions and, supposing no other performance killers jump out of the box, may justify spending a reasonable amount of "add-on" costs to spawn some number of concurrent or parallel operations (the actual number depends only on the actual economy of costs, not on how many CPU cores are present, and even less on our "wishes" or scholastic, or worse, copy/paste "advice"). On the contrary, mathematically "shallow" work almost always yields "speed-ups" << 1 (i.e. immense slow-downs), as there is almost no chance to justify the known add-on costs (paid on thread/process instantiation, and on data SER/xfer/DES when moving params in and results back, worse still if paid between processes).
- next, move the Overhead-slider in the simulator to the rightmost edge == 1. This shows the case where the actual thread/process-spawning overhead (a time lost) is still no more than <= 1 % of all the computing-related instructions that will be performed for the "useful" part of the work inside such a spawned process instance. So even such a 1:100 proportion (doing 100x more "useful" work than the CPU time lost on arranging that many copies and making the O/S-scheduler orchestrate their concurrent execution inside the available system virtual memory) already shows all the warnings visible in the graph of the progression of speed-up degradation - just play a bit with the Overhead-slider in the simulator, before touching the others...
- only then touch and move the p-slider in the simulator to anything less than 100% (i.e. having some [SERIAL] part of the problem execution, which was nice to ignore in theory so far, yet is never avoidable in practice - even the program launch is purely [SERIAL] by design)
So, besides straight errors and besides performance anti-patterns, there is a lot of technical reasoning for ILP, SIMD vectorisation and cache-line-respecting tricks, which together start to squeeze out the maximum possible performance the task can ever get.
- refactoring of the real problem shall never go against the collected knowledge about performance, as repeating the things that do not work will not bring any advantage, will it?
- respect your physical platform constraints; ignoring them will degrade your performance
- benchmark, profile, refactor
- benchmark, profile, refactor
- benchmark, profile, refactor
There is no other magic wand available here.
Details matter, always. The CPU / NUMA architecture details matter, and a lot. Check all the possibilities of the actual native architecture, because without these details the runtime performance will not reach the capabilities technically available.

Related

Multithreaded function performance worse than single threaded

I wrote an update() function which ran on a single thread, then I wrote the below function updateMP() which does the same thing except I divide the work in my two for loops here amongst some threads:
void GameOfLife::updateMP()
{
    std::vector<Cell> toDie;
    std::vector<Cell> toLive;
    #pragma omp parallel
    {
        // private, per-thread variables
        std::vector<Cell> myToDie;
        std::vector<Cell> myToLive;
        #pragma omp for
        for (int i = 0; i < aliveCells.size(); i++) {
            auto it = aliveCells.begin();
            std::advance(it, i);
            int liveCount = aliveCellNeighbors[*it];
            if (liveCount < 2 || liveCount > 3) {
                myToDie.push_back(*it);
            }
        }
        #pragma omp for
        for (int i = 0; i < aliveCellNeighbors.size(); i++) {
            auto it = aliveCellNeighbors.begin();
            std::advance(it, i);
            if (aliveCells.find(it->first) != aliveCells.end()) // is this cell alive?
                continue; // if so skip because we already updated aliveCells
            if (aliveCellNeighbors[it->first] == 3) {
                myToLive.push_back(it->first);
            }
        }
        #pragma omp critical
        {
            toDie.insert(toDie.end(), myToDie.begin(), myToDie.end());
            toLive.insert(toLive.end(), myToLive.begin(), myToLive.end());
        }
    }
    for (const Cell& deadCell : toDie) {
        setDead(deadCell);
    }
    for (const Cell& liveCell : toLive) {
        setAlive(liveCell);
    }
}
I noticed that it performs worse than the single-threaded update() and it seems like it's getting slower over time.
I think I might be doing something wrong by having two uses of omp for? I am new to OpenMP so I am still figuring out how to use it.
Why am I getting worse performance with my multithreaded implementation?
EDIT: Full source here: https://github.com/k-vekos/GameOfLife/tree/hashing?files=1
Why am I getting worse performance with my multithreaded implementation?
Classic question :)
You loop through only the alive cells. That's actually pretty interesting. A naive implementation of Conway's Game of Life would look at every cell. Your version optimizes for there being fewer alive than dead cells, which I think is common later in the game. I can't tell from your excerpt, but I assume it trades off by possibly doing redundant work when the ratio of alive to dead cells is higher.
A caveat of omp parallel is that there's no guarantee that the threads won't be created/destroyed on entry/exit of the parallel section. It's implementation dependent. I can't seem to find any information on MSVC's implementation. If anyone knows, please weigh in.
So that means that your threads could be created/destroyed every update loop, which is heavy overhead. For this to be worth it, the amount of work should be orders of magnitude more expensive than the overhead.
You can profile/measure the code to determine overhead and work time. It should also help you see where the real bottlenecks are.
Visual Studio has a profiler with a nice GUI. You need to compile your code with release optimizations but also include the debug symbols (those are excluded in the default release configuration). I haven't looked into how to set this up manually because I usually use CMake and it sets up a RelWithDebInfo configuration automatically.
Use high_resolution_clock to time sections that are hard to measure with the profiler.
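A minimal timing sketch for that suggestion ("game" is a hypothetical stand-in for whatever object owns updateMP() in the real code; the rest is standard <chrono>):
    #include <chrono>
    #include <iostream>

    // Time one call to the section under test.
    auto t0 = std::chrono::high_resolution_clock::now();
    game.updateMP();
    auto t1 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms per update\n";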
If you can use C++17, it has a standard parallel for_each (the ExecutionPolicy overload). Most of the standard algorithm functions do. https://en.cppreference.com/w/cpp/algorithm/for_each. They're new enough that I barely know anything about them (they might also have the same issues as OpenMP).
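A hedged sketch of what that looks like (requires a compiler and standard library with execution-policy support, e.g. MSVC or GCC with TBB; the data here is just a stand-in for your cells):
    #include <algorithm>
    #include <execution>
    #include <vector>

    std::vector<int> cells(1000000, 1);
    // Each element is processed independently, possibly on multiple threads.
    std::for_each(std::execution::par, cells.begin(), cells.end(),
                  [](int& c) { c *= 2; });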
seems like it's getting slower over time.
Could it be that you're not cleaning up one of your vectors?
First off, if you want any kind of performance, you must do as little work as possible in a critical section. I'd start by changing the following:
std::vector<Cell> toDie;
std::vector<Cell> toLive;
to
std::vector<std::vector<Cell>> toDie;
std::vector<std::vector<Cell>> toLive;
Then, in your critical section, you can just do:
toDie.push_back(std::move(myToDie));
toLive.push_back(std::move(myToLive));
Arguably, a vector of vectors isn't pretty, but this will prevent deep copies inside the critical section, which are unnecessary time consumption.
[Update]
IMHO there's no point in using multithreading if you are using non-contiguous data structures, at least not in that way. The fact is you'll spend most of your time waiting on cache misses, because that's what associative containers do, and little time doing the actual work.
I don't know how this game works. It feels like if I had to do something with numerous updates and rendering, I would have the updates done as quickly as possible on the 'main' thread, and I would have another (detached) thread for the renderer. You could then just give the renderer the results after every "update", and perform another update while it's rendering.
Also, I'm definitely not an expert in hashing, but hash<int>()(k.x * 3 + k.y * 5) looks like a high-collision hash. You can certainly try something else like what's proposed here
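For example, a hedged sketch of a lower-collision hash (assuming Cell exposes integer x and y members, as the original expression suggests; the mixing constant is the common golden-ratio recipe, not anything taken from the project):
    #include <cstddef>
    #include <functional>

    // Hash each coordinate separately and mix, instead of collapsing
    // x and y into a single int first.
    struct CellHash {
        std::size_t operator()(const Cell& k) const {
            std::size_t hx = std::hash<int>()(k.x);
            std::size_t hy = std::hash<int>()(k.y);
            return hx ^ (hy + 0x9e3779b97f4a7c15ULL + (hx << 6) + (hx >> 2));
        }
    };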

Performance issues of multiple independent for loop with openMp

I am planning to use OpenMP threads for an intense computation. However, I couldn't get the performance I expected on my first trial. I thought there might be several issues causing this, but I am not sure yet. Generally, I think the performance bottleneck is caused by the fork-and-join model. Can you help me in some way?
First, in a routine cycle, running on a consumer thread, there are 2 independent for loops and some additional functions. The functions are located at the end of the routine cycle and between the for loops, as can be seen below:
void routineFunction(short* xs, float* xf, float* yf, float* h)
{
    // Casting
    #pragma omp parallel for
    for (int n = 0; n < 1024*1024; n++)
    {
        xf[n] = (float)xs[n];
    }
    memset(yf, 0, 1024*1024*sizeof(float));
    // Filtering
    #pragma omp parallel for
    for (int n = 0; n < 1024*1024-1024; n++)
    {
        for (int nn = 0; nn < 1024; nn++)
        {
            yf[n] += xf[n+nn]*h[nn];
        }
    }
    status = DftiComputeBackward(hand, yf, yf); // Compute backward transform
}
Note: This code cannot be compiled as-is, because I simplified it for readability by leaving out details.
The OpenMP thread number is set to 8 dynamically. I observed the threads in use in the Windows task manager. While the number of threads increased significantly, I didn't observe any performance improvement. I have some guesses, but I still want to discuss them with you for further implementations.
My questions are these.
Does the fork-and-join model correspond to thread creation and destruction? Is it the same cost for the software?
Once routineFunction is called by the consumer, do OpenMP threads fork and join every time?
During the run of routineFunction, do OpenMP threads fork and join at each for loop? Or does the compiler help the second loop reuse the existing threads? If the for loops cause a fork and join 2 times, how should I rearrange the code? Is combining the two loops into a single loop sensible for saving performance, or is using a parallel region (#pragma omp parallel) plus #pragma omp for (rather than #pragma omp parallel for) the better choice for sharing the work? I am concerned that it forces me into static scheduling by using the thread id and number of threads. According to the document at page 34, static scheduling can cause load imbalance. Actually, I am familiar with static scheduling because of CUDA programming, but I still want to avoid it if there is any performance issue. I also read an answer on Stack Overflow, written by Alexey Kukanov, whose last paragraph points out that smart OpenMP algorithms do not join the master thread after a parallel region is completed. How can I utilize the busy-wait and sleep attributes of OpenMP to avoid joining the master thread after the first loop is completed?
Is there another reason for the performance issue in the code?
This is mostly memory-bound code. Its performance and scalability are limited by the amount of data the memory channel can transfer per unit time. xf and yf take 8 MiB in total, which fits in the L3 cache of most server-grade CPUs but not of most desktop or laptop CPUs. If two or three threads are already able to saturate the memory bandwidth, adding more threads is not going to bring additional performance. Also, casting short to float is a relatively expensive operation - 4 to 5 cycles on modern CPUs.
Does the fork-and-join model correspond to thread creation and destruction? Is it the same cost for the software?
Once routineFunction is called by the consumer, do OpenMP threads fork and join every time?
No, basically all OpenMP runtimes, including that of MSVC++, implement parallel regions using thread pools, as this is the easiest way to satisfy the requirement of the OpenMP specification that threadprivate variables retain their value between different parallel regions. Only the very first parallel region suffers the full cost of starting new threads. Subsequent regions reuse those threads, and an additional price is paid only if more threads are needed than in any of the previously executed parallel regions. There is still some overhead, but it is far lower than that of starting new threads each time.
During the run of routineFunction, do OpenMP threads fork and join at each for loop? Or does the compiler help the second loop reuse the existing threads?
Yes, in your case two separate parallel regions are created. You can manually merge them into one:
#pragma omp parallel
{
    #pragma omp for
    for (int n = 0; n < 1024*1024; n++)
    {
        xf[n] = (float)xs[n];
    }
    #pragma omp single
    {
        memset(yf, 0, 1024*1024*sizeof(float));
        //
        // Other code that was between the two parallel regions
        //
    }
    // Filtering
    #pragma omp for
    for (int n = 0; n < 1024*1024-1024; n++)
    {
        for (int nn = 0; nn < 1024; nn++)
        {
            yf[n] += xf[n+nn]*h[nn];
        }
    }
}
Is there another reason for performance issue in the code?
It is memory-bound, or at least the two loops shown here are.
Alright, it's been a while since I did OpenMP stuff so hopefully I didn't mess any of this up... but here goes.
Forking and joining is the same thing as creating and destroying threads. How the cost compares to other threads (such as a C++11 thread) will be implementation dependent. I believe in general OpenMP threads might be slightly lighter-weight than C++11 threads, but I'm not 100% sure about that. You'd have to do some testing.
Currently, each time routineFunction is called, you will fork for the first for loop, join, do a memset, fork for the second loop, join, and then call DftiComputeBackward.
You would be better off creating a single parallel region, as you stated. Not sure why the scheduling is an extra concern. It should be as easy as moving your memset to the top of the function, starting a parallel region using your noted command, and making sure each for loop is marked with #pragma omp for as you mentioned. You may need to put an explicit #pragma omp barrier in between the two for loops to make sure all threads finish the first for loop before starting the second... OpenMP has some implicit barriers, but I forgot whether #pragma omp for has one or not (it does, unless nowait is specified).
Make sure that the OpenMP compile flag is turned on for your compiler. If it isn't, the pragmas will be ignored, it will compile, and nothing will be different.
Your operations are prime candidates for SIMD acceleration. You might want to see if your compiler supports auto-vectorization and whether it is doing it. If not, I'd look into SIMD a bit, perhaps using intrinsics (see the sketch at the end of this answer).
How much time does DftiComputeBackwards take relative to this code?
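Regarding the SIMD point above, a minimal sketch of nudging the compiler with OpenMP's own SIMD construct, applied to the casting loop from the question (needs OpenMP 4.0 or later, which MSVC's classic OpenMP 2.0 mode does not provide):
    // Combine thread-level and SIMD-level parallelism on the casting loop.
    #pragma omp parallel for simd
    for (int n = 0; n < 1024*1024; n++)
    {
        xf[n] = (float)xs[n];
    }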

Performance problems using OpenMP in nested loops

I'm using the following code, which contains an OpenMP parallel for loop nested in another for loop. Somehow the performance of this code is 4 times slower than the sequential version (omitting #pragma omp parallel for).
Is it possible that OpenMP has to create threads every time the method is called? In my test it is called 10000 times directly after each other.
I heard that sometimes OpenMP will keep the threads spinning. I also tried setting OMP_WAIT_POLICY=active and GOMP_SPINCOUNT=INFINITE. When I remove the OpenMP pragmas, the code is about 10 times faster. Note that the method containing this code will be called 10000 times.
for (round k = 1; k < processor.max; ++k) {
    initialise_round(k);
    for (std::vector<int> bucket : color_buckets) {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < bucket.size(); ++i) {
            if (processor.mark.is_marked_item(bucket[i])) {
                processor.process(k, bucket[i]);
            }
        }
        processor.finish_round(k);
    }
}
You say that your sequential code is much faster, which makes me think that your processor.process function has too few instructions and too short a duration. This leads to the case where passing the data to each thread does not pay off (the data-exchange overhead is simply larger than the actual computation done on that thread).
Other than that, I think that parallelizing the middle loop won't affect the algorithm but would increase the amount of work per thread.
I think you are creating a team of threads on each iteration of the loop... (although I'm not sure what for alone does - I thought it should be parallel for). In this case, it would probably be better to separate the parallel from the for, so the work of forking and creating the threads is done just once rather than being repeated in the inner loops. So you could try to put a parallel pragma before your outermost loop so the overhead of forking and thread creation is paid just once; a sketch follows below.
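A hedged sketch of that suggestion, using the question's own identifiers: one parallel region around the outer loop, with only worksharing constructs inside it, so the thread team is created once.
    #pragma omp parallel
    for (round k = 1; k < processor.max; ++k) {
        #pragma omp single
        initialise_round(k);                      // runs once; implicit barrier after
        for (const std::vector<int>& bucket : color_buckets) {   // const& also avoids copying each bucket
            #pragma omp for schedule(dynamic)
            for (int i = 0; i < (int)bucket.size(); ++i) {
                if (processor.mark.is_marked_item(bucket[i]))
                    processor.process(k, bucket[i]);
            }                                     // implicit barrier here
            #pragma omp single
            processor.finish_round(k);            // runs once; implicit barrier after
        }
    }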
The actual problem was not related to OpenMP directly.
As the system has two CPUs, half of the threads were spawned on one CPU and the other half on the other. Therefore there was no shared L3 cache. Combined with the fact that the algorithm doesn't scale well, this led to a performance decrease, especially when using 2-4 threads.
The solution was to use thread pinning, for example via the Linux tool taskset.
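If you prefer to keep the pinning inside the program, OpenMP 4.0's affinity controls are a hedged alternative to an external tool (a sketch, not the exact fix used here):
    // Keep the team's threads close together (e.g. on one socket) so that
    // they share an L3 cache; "spread" would do the opposite.
    #pragma omp parallel proc_bind(close) num_threads(4)
    {
        // ... the parallel work from the question ...
    }
    // Alternatively, via the environment: OMP_PROC_BIND=close OMP_PLACES=cores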

Why are all iterations in a loop parallelized using OpenMP schedule(dynamic) given to one thread? (MSVS 2010)

Direct Question: I've got a simple loop with, what can be, a computationally intensive function. Let's assume that each iteration takes the same amount of time (so load balancing should be easy).
#pragma omp parallel
{
    #pragma omp for schedule(dynamic)
    for (int i = 0; i < 30; i++)
    {
        MyExpensiveFunction();
    }
} // parallel block
Why are all of the iterations assigned to a single thread? I can add a:
std::cout << "tID = " << omp_get_thread_num() << "\n\n";
and it prints a bunch of zeros with only the last iteration assigned to thread 1.
My System: I must support cross compiling. So I'm using gcc 4.4.3 & 4.5.0 and they both work as expected, but for MS 2010, I see the above behavior where 29 iterations are assigned to thread 0 and one iteration is assigned to thread 1.
Really Odd: It took me a bit to realize that this might simply be a scheduling problem. I google'd and found this website, which if you skip to the bottom has an example with what must be auto-generated output. All iterations using dynamic and guided scheduling are assigned to thread zero??!?
Any guidance would be greatly appreciated!!
Most likely, this is because the OMP implementation in Visual Studio decided that you did nowhere near enough work to merit putting it on more than one thread. If you simply increase the quantity of iterations, then you may well find that the other threads have more utilization. Dynamic scheduling means that the implementation only forks new threads if it needs them, so if it doesn't need them, it doesn't make them or assign them work.
If each iteration takes the same amount of time, then you actually don't need dynamic scheduling, which causes more scheduling overhead than the static scheduling policies. (static, 1) and (static) should be okay.
Could you let me know the length of each iteration? Regarding the example you cited (MSDN's example for schedules), it is because the amount of work in each iteration is so small that the first thread just grabbed almost all of it. If you really increase the work of each iteration (to at least on the order of a millisecond), then you will see the difference.
I did a lot of experiments related to OpenMP scheduling policies. MSVC's implementation of dynamic scheduling works well. I'm pretty sure the work in each of your iterations was too small.
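As a minimal sketch of the static-scheduling suggestion above (the loop body is the question's own placeholder):
    // With equal-cost iterations, a static schedule splits the 30 iterations
    // evenly up front and avoids the bookkeeping of dynamic chunk handout.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 30; i++)
    {
        MyExpensiveFunction();
    }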

Concurrency and optimization using OpenMP

I'm learning OpenMP. To do so, I'm trying to make an existing piece of code parallel. But I seem to get a worse time when using OpenMP than when I don't.
My inner loop:
#pragma omp parallel for
for (unsigned long j = 0; j < c_numberOfElements; ++j)
{
    //int th_id = omp_get_thread_num();
    //printf("thread %d, j = %d\n", th_id, (int)j);
    Point3D current;
    #pragma omp critical
    {
        current = _points[j];
    }
    Point3D next = getNext(current);
    if (!hasConstraint(next))
    {
        continue;
    }
    #pragma omp critical
    {
        _points[j] = next;
    }
}
_points is a pointMap_t, defined as:
typedef boost::unordered_map<unsigned long, Point3D> pointMap_t;
Without OpenMP my running time is 44.904s. With OpenMP enabled, on a computer with two cores, it is 64.224s. What I am doing wrong?
Why have you wrapped your reads and writes to _points[j] in critical sections? I'm not much of a C++ programmer, but it doesn't look to me as if you need those sections at all. As you've written it (unnamed critical sections), each thread is going to wait while another goes through each of the sections. This could easily make the program slower.
It seems possible that the lookup and write to _points in critical sections is dragging down the performance when you use OpenMP. Single-threaded, this will not result in any contention.
Sharing seed data like this seems counterproductive in a parallel programming context. Can you restructure to avoid these contention points?
You need to show the rest of the code. From a comment to another answer, it seems you are using a map. That is really a bad idea, especially if you are mapping 0..n numbers to values: why don't you use an array?
If you really need to use containers, consider using the ones from the Intel's Thread Building Blocks library.
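A hedged sketch of the array suggestion above, assuming the keys really are the dense range 0..c_numberOfElements-1, that Point3D is default-constructible, and that getNext/hasConstraint are the question's existing functions ("points" stands in for the original _points member): with one contiguous element per iteration there is nothing left to protect with critical sections.
    #include <vector>

    std::vector<Point3D> points(c_numberOfElements);   // instead of the unordered_map

    #pragma omp parallel for
    for (long j = 0; j < (long)c_numberOfElements; ++j)
    {
        // Each iteration reads and writes only its own element,
        // so no critical sections are needed.
        Point3D next = getNext(points[j]);
        if (hasConstraint(next))
            points[j] = next;
    }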
I agree that it would be best to see some working code.
The ultimate issue here is that there are criticals within a parallel region, and criticals are (a) enormously expensive in and of themselves, and (b) by definition, kill parallelism. The assignment to current certainly doesn't need to be inside a critical, as it is private; I wouldn't have thought the _points[j] assignment would need to be, either, but I don't know what the map does internally, so there you go.
But you have a loop in which you incur a huge amount of overhead, which grows linearly in the number of threads (the two critical regions), in order to do a tiny amount of actual work (walking along a linked list, it looks like). That's never going to be a good trade...