I have three nested loops, but only the innermost is parallelizable. The stop conditions of the outer and middle loops depend on the calculations done by the innermost loop, so I cannot change the loop order.
I have used an OpenMP pragma directive just before the innermost loop, but the performance with two threads is worse than with one. I guess it is because the threads are being created on every iteration of the outer loops.
Is there any way to create the threads outside the outer loops but use them only in the innermost loop?
Thanks in advance
OpenMP should be using a thread-pool, so you won't be recreating threads every time you execute your loop. Strictly speaking, however, that might depend on the OpenMP implementation you are using (I know the GNU compiler uses a pool). I suggest you look for other common problems, such as false sharing.
Unfortunately, current multicore computer systems are no good for such fine-grained inner-loop parallelism. It's not because of a thread creation/forking issue. As Itjax pointed out, virtually all OpenMP implementations exploit thread pools, i.e., they pre-create a number of threads, and threads are parked. So, there is actually no overhead of creating threads.
However, parallelizing inner loops like this incurs the following two overheads:
Dispatching jobs/tasks to threads: even if we don't need to physically create threads, we must still assign jobs (= create logical tasks) to threads, which mostly requires synchronization.
Joining threads: after the work-shared loop, all threads in the team must be joined (unless the nowait OpenMP clause is used). This is typically implemented as a barrier operation, which is also an expensive synchronization.
Hence, you should minimize the actual number of dispatch/join operations. You can decrease this overhead by increasing the amount of work the inner loop does per invocation, for example through code changes such as loop unrolling.
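To address the original question directly: in OpenMP you can enter the parallel region once, outside the outer loops, and work-share only the inner loop. Below is a minimal, hedged sketch with hypothetical loop bounds and placeholder work; note that the per-inner-loop dispatch and barrier described above still remain, even though the team is only created once.

    #include <omp.h>

    void compute(int n_outer, int n_mid, int n_inner, double *data)
    {
        #pragma omp parallel                 /* thread team set up once, out here */
        {
            for (int i = 0; i < n_outer; ++i) {      /* executed by every thread */
                for (int j = 0; j < n_mid; ++j) {
                    #pragma omp for                  /* only this loop is split   */
                    for (int k = 0; k < n_inner; ++k)
                        data[k] += (double)(i + j);  /* placeholder work          */
                    /* implicit barrier here keeps the team in step */

                    #pragma omp single
                    {
                        /* update the outer/middle stop conditions serially here */
                    }
                }
            }
        }
    }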
Related
I have n threads that access two shared matrices in the following way:
    if (matrix2[i][j] <= d) {
        matrix1[i][j] = v;
        matrix2[i][j] = d;
    }
I tried with a single mutex around the critical section, but the performance is very poor.
What is the best way to synchronize this code? A matrix of mutexes? Alternatives?
A matrix of mutexes?
It's rare for very fine-grained locking to perform better than a smaller number of locks, unless keeping them locked for a long time is unavoidable. That seems unlikely here. It also opens the door to deadlocks (for example, what if one thread runs with i=1, j=2 at the same time as another thread with i=2, j=1?).
I tried with a single mutex around the critical section, but the performance is very poor.
What synchronization you need depends on your access pattern. Are multiple threads all performing the operation shown?
If so, do you really need to do that in parallel? It doesn't seem expensive enough to be worthwhile. Can you partition i,j regions between threads so they don't collide? Can you do some other long-running work in your threads, and batch the matrix updates for single-threaded application?
If not, you need to show what other access is causing a data race with the code shown.
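Regarding the partitioning suggestion above: here is a hedged sketch (with made-up types and parameter names) of giving each thread its own disjoint band of rows, so no two threads ever write the same element and no mutex is needed at all:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // Each worker owns rows [row_begin, row_end) exclusively, so the
    // condition-and-update below needs no locking.
    void update_rows(std::vector<std::vector<float>>& matrix1,
                     std::vector<std::vector<float>>& matrix2,
                     std::size_t row_begin, std::size_t row_end,
                     float v, float d)
    {
        for (std::size_t i = row_begin; i < row_end; ++i)
            for (std::size_t j = 0; j < matrix2[i].size(); ++j)
                if (matrix2[i][j] <= d) {
                    matrix1[i][j] = v;
                    matrix2[i][j] = d;
                }
    }

    void parallel_update(std::vector<std::vector<float>>& matrix1,
                         std::vector<std::vector<float>>& matrix2,
                         float v, float d, unsigned n_threads)
    {
        std::vector<std::thread> workers;
        const std::size_t rows = matrix2.size();
        const std::size_t band = (rows + n_threads - 1) / n_threads;
        for (unsigned t = 0; t < n_threads; ++t) {
            std::size_t begin = t * band;
            std::size_t end   = std::min(rows, begin + band);
            if (begin < end)
                workers.emplace_back(update_rows, std::ref(matrix1), std::ref(matrix2),
                                     begin, end, v, d);
        }
        for (auto& w : workers)
            w.join();   // wait for all bands to finish
    }

This only works if, as in the snippet in the question, v and d are the same for every thread touching a given region; if different threads apply different values to overlapping regions, partitioning alone does not remove the race.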
Assuming you have designed a sequential program with nested for loops and would like to transform it to parallel with OpenMP, and work on it in sections to debug as you go... would it be better to work on the outermost loop first and work your way in, or start at the innermost loop(s)? I am aware of the collapse clause, but not all nested loops are collapsible.
Absolutely and definitely not in the innermost loop. This is because starting threads is generally expensive.
On the other hand, if the innermost loop takes much more resources to execute than starting a thread, then it doesn't make much difference. But otherwise, the outermost loop is always the best choice.
Of course, this is a very broad answer, in keeping with your very broad question. There are always different answers for every different special case.
On the other hand, if you have such a complicated problem, I recommend that you use the low-level std::thread and control your threads manually. That needs more work, but it gives you more control and the best results. Then you can use thread pools and have the most efficient solution.
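As a hedged illustration of the "start from the outermost loop" advice (assuming the iterations are independent, which your real loop nest may not guarantee), and of the collapse clause you mention, the basic shape is:

    #include <omp.h>

    void process(int n, int m, double *a)
    {
        /* One parallel region for the whole nest; collapse(2) merges the two
           perfectly nested loops into a single iteration space, which helps
           when n alone is smaller than the number of threads. */
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < m; ++j)
                a[i * m + j] = i + j;   /* placeholder work */
    }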
I am running a few threads using pthreads on a real-time Linux (RedHawk) in C++. All the threads run on a fixed-frequency loop, and one of the threads polls the CPU clock and alerts the other two threads that the next loop has started (by the end of the loop we can safely assume that the other threads have finished their tasks and are waiting for the next loop). My goal is to reduce latency where possible, and I have the ability to let threads take 100% of the CPU they are on (and guarantee they are the only thing running on that CPU due to the RedHawk enhancements).
My idea to do this was to have the timing thread poll the CPU tick count until it reaches > X, then increment a 64- or 32-bit counter without asking for a mutex. The other two threads will poll this counter and wait for it to increase, also without asking for a mutex. As I see it, no mutex is needed, since the first thread can increment the counter atomically, as it is the only thing writing to it. The other two threads can read from it without fear, because a 32- or 64-bit number can be written to memory without it ever being in a partial state (I think).
I realize that all my threads will be polling something and therefore running at 100% all the time, and I could reduce that by using pthreads signaling, but I believe the latency there is more than I want. I also know a mutex takes a couple of tens of nanoseconds, so I could probably use one without noticing the latency, but I don't see why it is needed when I have one thread incrementing a counter and the other two polling it.
You need to tell the compiler that your counter is a synchronization variable. You do that by declaring your counter std::atomic, and then using the built-in operations (fetch_add() or operator++() for the increment, and load() for the reading threads). See http://en.cppreference.com/w/cpp/atomic/atomic.
If you don't declare your counter atomic, then you will have a data race: your program has no defined semantics, and the compiler is permitted to (and probably will) move code around with respect to the counter test (which will probably lead to results you don't expect).
You need to use C++11 to get std::atomic. In most versions of g++ you do that with the --std=c++0x flag. The most recent versions of g++ require the --std=c++11 flag instead.
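A minimal sketch of what that looks like (the names are illustrative, not from the question):

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint64_t> loop_counter{0};   // shared between the threads

    // Timing thread: called once per fixed-frequency period.
    void signal_next_loop()
    {
        loop_counter.fetch_add(1, std::memory_order_release);
    }

    // Worker threads: spin until the counter moves past the last value seen.
    std::uint64_t wait_for_next_loop(std::uint64_t last_seen)
    {
        std::uint64_t now;
        while ((now = loop_counter.load(std::memory_order_acquire)) == last_seen)
            ;   // busy-wait, as described in the question (100% CPU)
        return now;
    }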
Since there are shared variables, with one thread modifying (incrementing) and others accessing, it would be best to wrap the accesses between pthread_mutex_lock and pthread_mutex_unlock to ensure mutual exclusion.
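For completeness, a sketch of what that suggestion amounts to (the answer above argues std::atomic is the better fit for this particular pattern, and the names here are made up):

    #include <pthread.h>
    #include <cstdint>

    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
    static std::uint64_t counter = 0;

    void increment_counter()              // writer (timing) thread
    {
        pthread_mutex_lock(&counter_lock);
        ++counter;
        pthread_mutex_unlock(&counter_lock);
    }

    std::uint64_t read_counter()          // polling threads
    {
        pthread_mutex_lock(&counter_lock);
        std::uint64_t value = counter;
        pthread_mutex_unlock(&counter_lock);
        return value;
    }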
The threads in a warp run physically in parallel, so if one of them (call it thread X) starts an atomic operation, what will the others do? Wait? Does that mean all threads will be waiting while thread X is pushed to the atomic queue, gets access (the mutex), does some work on the memory protected by that mutex, and releases the mutex afterwards?
Is there any way to give the other threads some work, like reading some memory, so that the atomic operation can hide its latency? I mean, 15 idle threads is... not good, I guess. Atomics are really slow, aren't they? How can I accelerate them? Is there any pattern for working with them?
Does an atomic operation on shared memory lock a bank or the whole memory?
For example (without mutexes), suppose there is __shared__ float smem[256];
Thread1 runs atomicAdd(smem, 1);
Thread2 runs atomicAdd(smem + 1, 1);
These threads work on different banks, but in the same shared memory. Do they run in parallel or will they be queued? Does it make a difference in this example whether Thread1 and Thread2 are from separate warps or the same warp?
I count something like 10 questions. It makes it quite difficult to answer. It's suggested you ask one question per question.
Generally speaking, all threads in a warp are executing the same instruction stream. So there are two cases we can consider:
Without conditionals (e.g. if...then...else): in this case, all threads are executing the same instruction, which happens to be an atomic instruction. Then all 32 threads will execute an atomic, although not necessarily on the same location. All of these atomics will get processed by the SM, and to some extent will serialize (they will completely serialize if they are updating the same location).
With conditionals: for example, suppose we had if (!threadIdx.x) atomicAdd(data, 1); Then thread 0 would execute the atomic, and the others wouldn't. It might seem like we could get the others to do something else, but the lockstep warp execution doesn't allow this. Warp execution is serialized such that all threads taking the if (true) path will execute together, and all threads taking the if (false) path will execute together, but the true and false paths will be serialized. So again, we can't really have different threads in a warp executing different instructions simultaneously.
The net of it is, within a warp, we can't have one thread do an atomic while others do something else simultaneously.
A number of your other questions seem to expect that memory transactions are completed at the end of the instruction cycle in which they originated. This is not the case. With global and with shared memory, we must take special steps in the code to ensure that previous write transactions are visible to other threads (which could be argued as the evidence that the transaction completed). One typical way to do this is to use barrier instructions, such as __syncthreads() or __threadfence(). But without those barrier instructions, threads are not "waiting" for writes to complete. A read (or rather, an operation dependent on a read) can stall a thread. A write generally cannot stall a thread.
Now let's look at your questions:
so if one of them starts an atomic operation, what will the others do? Wait?
No, they don't wait. The atomic operation gets dispatched to a functional unit on the SM that handles atomics, and all threads continue, together, in lockstep. Since an atomic generally implies a read, yes, the read can stall the warp. But the threads do not wait until the atomic operation is completed (i.e, the write). However, a subsequent read of this location could stall the warp, again, waiting for the atomic (write) to complete. In the case of a global atomic, which is guaranteed to update global memory, it will invalidate the L1 in the originating SM (if enabled) and the L2, if they contain that location as an entry.
Is there any way to give the other threads some work, like reading some memory, so that the atomic operation can hide its latency?
Not really, for the reasons I stated at the beginning.
Atomics are really slow, aren't they? How can I accelerate them? Is there any pattern for working with them?
Yes, atomics can make a program run much more slowly if they dominate the activity (such as naive reductions or naive histogramming.) Generally speaking, the way to accelerate atomic operations is to not use them, or use them sparingly, in a way that doesn't dominate program activity. For example, a naive reduction would use an atomic to add every element to the global sum. A smart parallel reduction will use no atomics at all for the work done in the threadblock. At the end of the threadblock reduction, a single atomic might be used to update the threadblock partial sum into the global sum. This means that I can do a fast parallel reduction of an arbitrarily large number of elements with perhaps on the order of 32 atomic adds, or less. This sparing use of atomics will basically not be noticeable in the overall program execution, except that it enables the parallel reduction to be done in a single kernel call rather than 2.
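A hedged sketch of that pattern (a standard shared-memory tree reduction per block, then one atomicAdd per block; the kernel and parameter names are made up):

    __global__ void block_sum(const float *in, float *global_sum, int n)
    {
        __shared__ float sdata[256];                 // assumes blockDim.x == 256
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (i < n) ? in[i] : 0.0f;         // one element per thread
        __syncthreads();

        // Tree reduction within the block: no atomics at all here.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // A single atomic per block folds the partial sum into the global total.
        if (tid == 0)
            atomicAdd(global_sum, sdata[0]);
    }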
Shared memory: do they run in parallel or will they be queued?
They will be queued. The reason for this is that there are a limited number of functional units that can process atomic operations on shared memory, not enough to service all the requests from a warp in a single cycle.
I've avoided trying to answer questions that relate to the throughput of atomic operations, because this data is not well specified in the documentation AFAIK. It may be that if you issue enough simultaneous or nearly-simultaneous atomic operations, that some warps will stall on the atomic instruction, due to the queues that feed the atomic functional units being full. I don't know this to be true and I can't answer questions about it.
Suppose I have code like this:
    for (i = 0; i < i_max; i++)
        for (j = 0; j < j_max; j++)
            // do something
and I want to do this using different threads (assuming the // do something tasks are independent of each other; think about Monte Carlo simulations, for instance). My question is this: is it necessarily better to create a thread for each value of i than to create a thread for each value of j? Something like this:
    for (i = 0; i < i_max; i++)
        create_thread(j_max);
Additionally: what would be a suitable number of threads? Should I just create i_max threads, or perhaps use a semaphore with k < i_max threads running concurrently at any given time?
Thank you,
The best way to apportion the workload is workload-dependent.
Broadly - for parallelizable workload, use OpenMP; for heterogeneous workload, use a thread pool. Avoid managing your own threads if you can.
Monte Carlo simulation should be a good candidate for truly parallel code rather than thread pool.
By the way - in case you are on Visual C++, there is in Visual C++ v10 an interesting new Concurrency Runtime for precisely this type of problem. This is somewhat analogous to the Task Parallel Library that was added to .Net Framework 4 to ease the implementation of multicore/multi-CPU code.
Avoid creating threads unless you can keep them busy!
If your scenario is compute-bound, then you should minimize the number of threads you spawn to the number of cores you expect your code to run on. If you create more threads than you have cores, then the OS has to waste time and resources scheduling the threads to execute on the available cores.
If your scenario is IO-bound, then you should consider using async IO operations that are queued and which you check the response codes from after the async result is returned. Again, in this case, spawning a thread per IO operation is hugely wasteful as you'll cause the OS to have to waste time scheduling threads that are stalled.
Everyone here is basically right, but here's a quick-and-dirty way to split up the work and keep all of the processors busy. This works best when 1) creating threads is expensive compared to the work done in an iteration, and 2) most iterations take about the same amount of time to complete.
First, create 1 thread per processor/core. These are your worker threads. They sit idle until they're told to do something.
Now, split up your work such that data that is needed at the same time is close together. What I mean by that is that if you were processing a ten-element array on a two-processor machine, you'd split it up so that one group is elements 1,2,3,4,5 and the other is 6,7,8,9,10. You may be tempted to split it up 1,3,5,7,9 and 2,4,6,8,10, but then you're going to cause more false sharing (http://en.wikipedia.org/wiki/False_sharing) in your cache.
So now that you have a thread per processor and a group of data for each thread, you just tell each thread to work on an independent group of that data.
So in your case I'd do something like this.
    for (int t = 0; t < n_processors; ++t)
    {
        thread[t] = create_thread();
        datamin[t] = t * (i_max / n_processors);
        datamax[t] = (t + 1) * (i_max / n_processors);
    }

    for (int t = 0; t < n_processors; ++t)
        do_work(thread[t], datamin[t], datamax[t], j_max);

    // wait for all threads to be done
    // continue with rest of the program.
Of course I left out things like dealing with your data not being an integer multiple of the number of processors, but those are easily fixed.
Also, if you're not averse to third-party libraries, Intel's TBB (Threading Building Blocks) does a great job of abstracting this from you and letting you get to the real work you want to do.
Everything around creating and calling threads is relatively expensive so you want to do that as little as possible.
If you parallelize your inner loop instead of the outer loop, then threads are spawned (or a team is dispatched) on every iteration of the outer loop, so you pay the fork/join cost i_max times instead of once.
That said, the best parallelization depends on your actual problem. Depending on that, it can actually make sense to parallelize the inner loop instead.
It depends on the tasks and on what platform you're going to simulate on. For example, on CUDA's architecture you can split the tasks up so that each i,j element is done individually.
You still have the time to load data onto the card to consider.
Using for loops and something like OpenMP/MPI/your own threading mechanism, you can basically choose. In one scenario, the outer loop is broken out into parallel threads and j is looped over sequentially on each thread. In the other, the outer loop is processed sequentially, and the inner loop is broken out into threads at each iteration.
Parallelisation (breaking out threads) is costly. Remember that you have the cost of setting up n threads, then synchronising n threads. This represents a cost c over and above the runtime of the routines, which in and of itself can make the total time greater for parallel processing than for single-threaded mode. It depends on the problem in question; often, there's a critical size beyond which parallel is faster.
I would suggest that breaking out into the parallel region at the first for loop will be faster. If you do so at the inner loop, you must fork/join every time the outer loop runs, adding a large overhead to the speed of the code. Ideally, you want to create threads only once.