Best way to divide a loop into threads? - concurrency

I have a loop that repeats 8 times and I want to run each iteration in a different thread so it runs more quickly. I looked it up online but I can't decide on a way to do this. There are no shared resources inside the loop. Any ideas?
Sorry for the bad English.

The best way is to analyze how your program is going to be used and determine the best cost vs. performance trade-off you can make. Threads, even in languages like Go, have non-trivial overhead; in languages like Java, the overhead can be significant.
You need a grasp of what the cost of dispatching an operation onto a thread is versus the time to perform the operation itself, and of which execution models you can apply. For example, if you try:
for (i = 0; i < NTHREAD; i++) {
    t[i] = create_thread(PerformAction, ...);   /* creation cost is paid serially here */
}
for (i = 0; i < NTHREAD; i++) {
    join_thread(t[i]);
}
You might think you have done wonderfully; however, the (NTHREAD-1)'th thread doesn't start until you have paid the overhead of creating all the others. In contrast, if you create threads in a tree-like structure, and your OS handles thread creation well, you can get significantly lower latency (see the sketch below).
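Purely for illustration, here is a minimal sketch of what tree-like creation could look like; the PerformAction work function and the 8-iteration count come from the question, everything else is my own naming, not a definitive implementation:

#include <cstdio>
#include <thread>

void PerformAction(int i) { std::printf("iteration %d\n", i); }   // stand-in per-iteration work

// Each call forks the upper half of its range into a new thread, recurses into
// the lower half itself, then joins. The creation overhead on the critical path
// is O(log N) instead of O(N).
void run_tree(int lo, int hi) {
    if (hi - lo <= 1) { PerformAction(lo); return; }
    int mid = lo + (hi - lo) / 2;
    std::thread upper(run_tree, mid, hi);   // fork the upper half
    run_tree(lo, mid);                      // handle the lower half here
    upper.join();
}

int main() { run_tree(0, 8); }              // 8 iterations, as in the question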
So, best practice: measure, write for the generic case and configure for the specific one.

Related

Efficiency of using omp parallel for inside omp parallel omp task

I am trying to improve the efficiency of a complicated project using OpenMP. I will use a toy example to show the situation I am facing.
First, I have a recursive function and I use omp task to speed it up, like this:
void CollectMsg(Node *n) {
    #pragma omp task
    {
        CollectMsg(n->child1);
        ProcessMsg1(n->array, n->child1->array);
    }
    #pragma omp task
    {
        CollectMsg(n->child2);
        ProcessMsg1(n->array, n->child2->array);
    }
    #pragma omp taskwait
    ProcessMsg2(n->array);
}

int main() {
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        {
            CollectMsg(root);
        }
    }
}
Then, however, the function ProcessMsg1 contains some computation and I want to use omp parallel for on it. The following code shows a simple example. This way the project runs correctly, but its efficiency is only slightly improved, and I am wondering about the working mechanism of this process. Does it work if I use omp parallel for inside omp task? I ask because there is only a slight difference in efficiency between using and not using omp parallel for in ProcessMsg1. If the answer is yes: I first created 2 threads for omp task, and in the processing of each task I created 2 threads for omp parallel for, so should I think of this as 2*2 = 4-way parallelism? If the answer is no, how can I further improve the efficiency of this case?
void ProcessMsg1(int *a1, int *a2) {
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}
Nesting a parallel for inside a task will work, but it requires that you enable nested execution and it will lead to all sorts of issues that are difficult to resolve, especially around load balancing and the actual number of threads being created.
I would recommend using the taskloop construct instead:
void ProcessMsg1(int *a1, int *a2) {
    #pragma omp taskloop grainsize(100)
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}
This will create additional tasks to execute the loop. If ProcessMsg1 is executed from different tasks in the system, the tasks created by taskloop will automatically mix with all the other tasks that your code is creating.
Depending on how much work your code does per loop iteration, you may want to adjust the size of the tasks using the grainsize clause (I just put 100 as an example). See the OpenMP spec for further details.
If your code does not have to wait until the whole loop in ProcessMsg1 has executed, you can add the nogroup clause, which basically means that once all the loop tasks have been created, they can still execute on other threads while the ProcessMsg1 function is already returning. The default behavior of taskloop, though, is to wait until the loop tasks (and any child tasks they create) have completed.
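As a rough sketch, reusing the size variable and grainsize value from above (whether dropping the implicit taskgroup is safe depends entirely on what the caller does with a1 afterwards):

void ProcessMsg1(int *a1, int *a2) {
    // nogroup drops the implicit taskgroup, so this function may return
    // before all loop tasks have finished executing.
    #pragma omp taskloop grainsize(100) nogroup
    for (int i = 0; i < size; ++i) {
        a1[i] *= a2[i];
    }
}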
Q :" ... how can I further improve the efficiency of this case? "
A : follow the "economy-of-costs"some visiblesome less, yet nevertheless also deciding the resulting ( im )performance & ( in )efficiency
How much did we already have to pay before any PAR-section even started?
Let's start with the costs of adding the proposed multiple streams of code execution:
Using the number of ASM instructions as a simplified measure of how much work has to be done (where CPU clocks, RAM-allocation costs, RAM I/O and O/S-system-management time all count), we start to see the relative cost of all these (unavoidable) add-on costs compared to the actual useful work (i.e. how many ASM instructions are finally spent on what we want to compute, contrasted with the overhead already burnt to make this happen).
This fraction is cardinal if fighting for both performance and efficiency (of resource-usage patterns).
For cases where the add-on overhead costs dominate, going parallel is an outright anti-pattern.
For cases where the add-on overhead costs make up less than 0.01 % of the useful work, we may still end up with unsatisfyingly low speed-ups (see the simulator and related details).
For cases where the scope of useful work dwarfs all add-on overhead costs, we still hit the Amdahl's Law ceiling - the aptly self-explanatory "law of diminishing returns", since adding more resources ceases to improve performance, even if we add infinitely many CPU cores and the like.
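To make the "diminishing returns" point concrete, here is a tiny sketch of an overhead-aware speed-up estimate; the formula and the example numbers are illustrative assumptions, not measurements of any real system:

#include <stdio.h>

/* S(N) = 1 / ( s + (1 - s)/N + o*N ): s is the serial fraction, (1 - s) is
   split across N workers, and o is the per-worker setup overhead relative
   to the total useful work.                                               */
double speedup(double s, double o, int n) {
    return 1.0 / (s + (1.0 - s) / n + o * n);
}

int main(void) {
    for (int n = 1; n <= 64; n *= 2)
        printf("N = %2d  estimated speed-up = %.2f\n",
               n, speedup(0.05, 0.01, n));   /* 5 % serial, 1 % overhead per worker */
    return 0;
}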
Tips for experimenting with the costs of instantiation:
We start by assuming the code is 100 % parallelisable (having no pure-[SERIAL] part, which is never true in reality, but let's use it).
- First, move the NCPUcores slider in the simulator to anything above 64 cores.
- Next, move the Overhead slider to anything above plain zero (expressing the relative add-on cost of spawning one of the NCPUcores processes, as a percentage of the number of instructions in the [PARALLEL]-section part). Mathematically "dense" work has many such "useful" instructions and, assuming no other performance killers jump out of the box, may afford a reasonable amount of "add-on" cost to spawn some number of concurrent or parallel operations (the actual number depends only on the actual economy of costs, not on how many CPU cores are present, still less on our "wishes" or on scholastic or, worse, copy/paste "advice"). In contrast, mathematically "shallow" work almost always shows "speed-ups" << 1 (i.e. immense slow-downs), as there is almost no chance to justify the known add-on costs (paid for thread/process instantiation and for data SER/xfer/DES when moving parameters in and results back, worse still if between processes).
- Next, move the Overhead slider to the rightmost edge == 1. This shows the case where the actual thread/process-spawning overhead (time lost) is still no more than 1 % of all the computing-related instructions that will be performed for the "useful" part of the work inside the spawned process instance. Even such a 1:100 proportion (doing 100x more "useful" work than the CPU time lost on arranging that many copies and making the O/S scheduler orchestrate their concurrent execution inside the available system virtual memory) already shows all the warnings visible in the graph of speed-up degradation - just play a bit with the Overhead slider before touching the others...
- Only then touch and move the p slider to anything less than 100 % (so far we assumed no [SERIAL] part of the problem execution, which was nice in theory but is never achievable in practice; even the program launch is pure-[SERIAL] by design).
So, besides straight errors and performance anti-patterns, there is a lot of technical reasoning behind ILP, SIMD vectorisation and cache-line-respecting tricks, which are what first start to squeeze out the maximum performance the task can ever get:
- refactoring of the real problem shall never go against the collected knowledge about performance, as repeating the things that do not work will not bring any advantage, will it?
- respect your physical platform constraints; ignoring them will degrade your performance
- benchmark, profile, refactor
- benchmark, profile, refactor
- benchmark, profile, refactor
No other magic wand is available here.
Details always matter. The CPU / NUMA architecture details matter, a lot. Check the actual native-architecture possibilities, because without all these details the runtime performance will not reach the capabilities that are technically available.

Multithreaded function performance worse than single threaded

I wrote an update() function which ran on a single thread, then I wrote the function updateMP() below, which does the same thing except that I divide the work of my two for loops among some threads:
void GameOfLife::updateMP()
{
    std::vector<Cell> toDie;
    std::vector<Cell> toLive;

    #pragma omp parallel
    {
        // private, per-thread variables
        std::vector<Cell> myToDie;
        std::vector<Cell> myToLive;

        #pragma omp for
        for (int i = 0; i < aliveCells.size(); i++) {
            auto it = aliveCells.begin();
            std::advance(it, i);

            int liveCount = aliveCellNeighbors[*it];
            if (liveCount < 2 || liveCount > 3) {
                myToDie.push_back(*it);
            }
        }

        #pragma omp for
        for (int i = 0; i < aliveCellNeighbors.size(); i++) {
            auto it = aliveCellNeighbors.begin();
            std::advance(it, i);

            if (aliveCells.find(it->first) != aliveCells.end()) // is this cell alive?
                continue; // if so skip because we already updated aliveCells

            if (aliveCellNeighbors[it->first] == 3) {
                myToLive.push_back(it->first);
            }
        }

        #pragma omp critical
        {
            toDie.insert(toDie.end(), myToDie.begin(), myToDie.end());
            toLive.insert(toLive.end(), myToLive.begin(), myToLive.end());
        }
    }

    for (const Cell& deadCell : toDie) {
        setDead(deadCell);
    }
    for (const Cell& liveCell : toLive) {
        setAlive(liveCell);
    }
}
I noticed that it performs worse than the single-threaded update(), and it seems to be getting slower over time.
Am I perhaps doing something wrong by having two omp for loops in one parallel region? I am new to OpenMP, so I am still figuring out how to use it.
Why am I getting worse performance with my multithreaded implementation?
EDIT: Full source here: https://github.com/k-vekos/GameOfLife/tree/hashing?files=1
Why am I getting worse performance with my multithreaded implementation?
Classic question :)
You loop through only the alive cells. That's actually pretty interesting. A naive implementation of Conway's Game of Life would look at every cell. Your version optimizes for the case where there are fewer alive cells than dead ones, which I think is common later in the game. I can't tell from your excerpt, but I assume it trades off by possibly doing redundant work when the ratio of alive to dead cells is higher.
A caveat of omp parallel is that there's no guarantee the threads won't be created and destroyed on entry to and exit from the parallel section. It's implementation dependent. I can't seem to find any information on MSVC's implementation. If anyone knows, please weigh in.
So that means that your threads could be created/destroyed every update loop, which is heavy overhead. For this to be worth it, the amount of work should be orders of magnitude more expensive than the overhead.
You can profile/measure the code to determine overhead and work time. It should also help you see where the real bottlenecks are.
Visual Studio has a profiler with a nice GUI. You need to compile your code with release optimizations but also include the debug symbols (those are excluded in the default release configuration). I haven't looked into how to set this up manually because I usually use CMake and it sets up a RelWithDebInfo configuration automatically.
Use high_resolution_clock to time sections that are hard to measure with the profiler.
If you can use C++17, it has a standard parallel for_each (the ExecutionPolicy overloads). Most of the standard algorithm functions do. https://en.cppreference.com/w/cpp/algorithm/for_each. They're so new that I barely know anything about them (and they might have the same issues as OpenMP).
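A minimal sketch of what that looks like, purely for illustration (it requires the <execution> header, and on GCC/Clang the parallel backend typically needs TBB installed; the element work here is a placeholder, not your Game of Life logic):

#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<int> cells(1000000, 1);
    // per-element work with no shared state, distributed by the implementation's pool
    std::for_each(std::execution::par, cells.begin(), cells.end(),
                  [](int& c) { c *= 2; });
}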
seems like it's getting slower over time.
Could it be that you're not cleaning up one of your vectors?
First off, if you want any kind of performance, you must do as little work as possible in a critical section. I'd start by changing the following:
std::vector<Cell> toDie;
std::vector<Cell> toLive;
to
std::vector<std::vector<Cell>> toDie;
std::vector<std::vector<Cell>> toLive;
Then, in your critical section, you can just do:
toDie.push_back(std::move(myToDie));
toLive.push_back(std::move(myToLive));
Arguably, a vector of vectors isn't pretty, but this will prevent deep copying inside the critical section, which is unnecessary time consumption.
[Update]
IMHO there's no point in using multithreading if you are using non-contiguous data structures, at least not in that way. The fact is you'll spend most of your time waiting on cache misses, because that's what associative containers give you, and very little time doing the actual work.
I don't know how this game works. It feels like if I had to do something with numerous updates and rendering, I would have the updates done as quickly as possible on the 'main' thread, and I would have another (detached) thread for the renderer. You could then just hand the renderer the results after every update, and perform another update while it's rendering.
Also, I'm definitely not an expert in hashing, but hash<int>()(k.x * 3 + k.y * 5) looks like a high-collision hash. You can certainly try something else, like what's proposed here.
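For example, a lower-collision combine in the style of boost::hash_combine could look roughly like this; the Cell fields x and y are assumed from the linked project (the stand-in Cell struct and the constants are illustrative, not the project's actual code):

#include <cstddef>
#include <functional>

struct Cell { int x, y; };   // minimal stand-in for the project's Cell type

struct CellHash {
    std::size_t operator()(const Cell& k) const {
        std::size_t h = std::hash<int>()(k.x);
        // boost::hash_combine-style mixing of the second field
        h ^= std::hash<int>()(k.y) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};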

Progress visualization in parallelized sections

I built a data-analysis program where I tried to keep the functions short and without side effects where possible. Then I used OpenMP to parallelize for loops where possible. This works well, except that progress is not visible.
Then I introduced statements like if (i % 10 == 0) print_status(i); to give a status message in the loop. Since all operations take about the same time, I did not change the OpenMP scheduling policy, which just divides the loop range into a number of chunks and assigns each chunk to a thread. The output will then be something like
0
100
200
10
110
210
…
which is not really helpful to the user. I could change the scheduling policy so that the elements are processed roughly in order, but that would come with a synchronization cost that the actual work does not need. It might even make things harder for the prefetcher, since the loop indexes for each thread are no longer contiguous.
This issue gets even worse with an expensive function that takes minutes to complete. I run three of them in parallel. So I have
#pragma omp parallel for
for (int i = 0; i < 3; ++i) {
    result[i] = expensive(i);
}
Currently the function expensive just prints something like "Working on i = 0 and I am at 56 %". This output then interleaves (line-wise; I used omp critical with std::cout) with the output of the other running instances. Judging the overall progress of the system is not that easy and a bit cumbersome.
I think I could solve this by passing a functor to the expensive function, which it can then call with the current state. The functor would know about the number of expensive instances and could compute the global progress. A progress bar would need multiple active parts, depending on the scheduling policy, just like a BitTorrent download. But just having the ratio of done to total work would already be a great feature.
This will complicate the interfaces of all the long-running functions, and I am not sure whether it is really worth the complication it brings with it.
It is probably impossible to get status information out of a function without introducing side effects. I just do not want to make the code hard to maintain just for unified progress information.
Is there any sensible way to do this without adding a lot more complication to the system?
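For what it's worth, a minimal sketch of the functor idea described above, using a shared atomic counter to turn per-instance progress into a global ratio (the names expensive and result, the 3 x 100 step counts, and the return value are placeholders, not taken from the real code):

#include <atomic>
#include <cstdio>
#include <functional>

std::atomic<int> global_done{0};
constexpr int global_total = 3 * 100;        // 3 instances x 100 steps each (assumed)

int expensive(int i, const std::function<void()>& report) {
    for (int step = 0; step < 100; ++step) {
        // ... the actual long-running work for instance i ...
        report();                            // the only side effect added
    }
    return i;                                // stand-in result
}

int main() {
    int result[3];
    #pragma omp parallel for
    for (int i = 0; i < 3; ++i) {
        result[i] = expensive(i, [] {
            int done = ++global_done;
            if (done % 10 == 0)              // throttle the output
                std::printf("overall progress: %d %%\n", 100 * done / global_total);
        });
    }
}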

How do I properly design a worker thread? (avoiding, for example, Sleep(1))

I am still a beginner at multi-threading, so bear with me please:
I am currently writing an application that does some FVM calculation on a grid. It's a time-explicit model, so at every timestep I need to calculate new values for the whole grid. My idea was to distribute this calculation to 4 worker threads, which then deal with the cells of the grid (the first thread calculating 0, 4, 8..., the second thread 1, 5, 9..., and so forth).
I create those 4 threads at program start.
They look something like this:
void __fastcall TCalculationThread::Execute()
{
    bool alive = true;
    THREAD_SIGNAL ts;
    while (alive)
    {
        Sleep(1);
        if (TryEnterCriticalSection(&TMS))
        {
            // copy the shared signal, then release the lock
            ts = thread_signal;
            LeaveCriticalSection(&TMS);
            alive = !ts.kill;
            if (ts.go && !ts.done.at(this->index))
            {
                double delta_t = ts.dt;
                // this thread handles cells index, index+steps, index+2*steps, ...
                for (unsigned int i = this->index; i < cells.size(); i += this->steps)
                {
                    calculate_one_cell();
                }
                EnterCriticalSection(&TMS);
                thread_signal.done.at(this->index) = true;
                LeaveCriticalSection(&TMS);
            }
        }
    }
}
They use a global struct to communicate with the main thread (the main thread sets ts.go to true when the workers need to start).
Now I am sure this is not the way to do it! Not only does it feel wrong, it also doesn't perform very well...
I read, for example here, that a semaphore or an event would work better. The answer to this guy's question talks about a lockless queue.
I am not very familiar with these concepts and would like some pointers on how to continue.
Could you outline any of the ways to do this better?
Thank you for your time. (And sorry for the formatting.)
I am using Borland C++ Builder and its thread object (TThread).
The definitely more effective algorithm would be to calculate cells 0, 1, 2, 3 on one thread, 4, 5, 6, 7 on another, and so on. Interleaving memory accesses like that is very bad; even if the variables are completely independent, you'll get false-sharing problems. This is the equivalent of the CPU locking every write.
Calling Sleep(1) in a calculation thread can't be a good solution to any problem. You want your threads to be doing useful work rather than blocking for no good reason.
I think your basic problem can be expressed as a serial algorithm of this basic form:

for (int i = 0; i < N; i++)
    cells[i]->Calculate();

You are in the happy position that the calls to Calculate() are independent of each other: what you have here is a parallel for. This means that you can implement this without a mutex.
There are a variety of ways to achieve this. OpenMP would be one; a thread-pool class another. If you are going to roll your own thread-based solution, then use InterlockedIncrement() on a shared variable to iterate through the array.
You may hit some false-sharing problems, as DeadMG suggests, but quite possibly not. If you do have false sharing, then yet another approach is to stride across larger sub-arrays. Essentially the increment (i.e. the stride) would then be greater than one, using InterlockedExchangeAdd() rather than InterlockedIncrement().
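Here is a self-contained sketch of that InterlockedIncrement() idea with 4 workers; the cell array and the per-cell work are stand-ins, not the question's FVM code:

#include <windows.h>
#include <stdio.h>

#define CELL_COUNT 1000
static volatile LONG next_cell = -1;           /* shared cursor into the cell array */
static double cells[CELL_COUNT];

static void calculate_one_cell(int i) { cells[i] = i * 0.5; }   /* stand-in work */

static DWORD WINAPI Worker(LPVOID unused)
{
    (void)unused;
    for (;;) {
        /* atomically claim the next unprocessed index */
        LONG i = InterlockedIncrement(&next_cell);
        if (i >= CELL_COUNT)
            break;                             /* nothing left to do */
        calculate_one_cell((int)i);
    }
    return 0;
}

int main(void)
{
    HANDLE workers[4];
    for (int t = 0; t < 4; ++t)
        workers[t] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForMultipleObjects(4, workers, TRUE, INFINITE);
    for (int t = 0; t < 4; ++t)
        CloseHandle(workers[t]);
    printf("done: cells[%d] = %f\n", CELL_COUNT - 1, cells[CELL_COUNT - 1]);
    return 0;
}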
The bottom line is that the way to make the code faster is to remove both the critical section (and hence the contention on it) and the Sleep(1).

Boost.Thread no speedup?

I have a small program that implements a Monte Carlo simulation of blackjack using various card-counting strategies. My main function basically does this:
int bankroll = 50000;
int hands = 100;
int tests = 10000;
Simulation::strategy = hi_lo;
for (int i = 0; i < simulations; ++i)
    runSimulation(bankroll, hands, tests, strategy);
The entire program, run in a single thread on my machine, takes about 10 seconds.
I wanted to take advantage of the 3 cores my processor has, so I decided to rewrite the program to simply execute the various strategies in separate threads, like this:
int bankroll = 50000;
int hands = 100;
int tests = 10000;
Simulation::strategy = hi_lo;
boost::thread threads[simulations];
for (int i = 0; i < simulations; ++i)
    threads[i] = boost::thread(boost::bind(runSimulation, bankroll, hands, tests, strategy));
for (int i = 0; i < simulations; ++i)
    threads[i].join();
However, when I ran this program, even though I got the same results it took around 24 seconds to complete. Did I miss something here?
If the value of simulations is high, then you end up creating a lot of threads, and the overhead of doing so can end up destroying any possible performance gains.
EDIT: One approach to this might be to just start three threads and let them each run 1/3 of the desired simulations. Alternatively, using a thread pool of some kind could also help.
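A minimal sketch of that "three threads, one third each" idea, reusing runSimulation, simulations and the other variables from the question (the runChunk helper and its naming are my own, not a definitive implementation):

void runChunk(int begin, int end, int bankroll, int hands, int tests)
{
    for (int i = begin; i < end; ++i)
        runSimulation(bankroll, hands, tests, Simulation::strategy);
}

// in main(), instead of one thread per simulation:
const int nThreads = 3;                      // one per core
boost::thread workers[nThreads];
const int chunk = simulations / nThreads;
for (int t = 0; t < nThreads; ++t) {
    int begin = t * chunk;
    int end = (t == nThreads - 1) ? simulations : begin + chunk;  // last thread takes the remainder
    workers[t] = boost::thread(runChunk, begin, end, bankroll, hands, tests);
}
for (int t = 0; t < nThreads; ++t)
    workers[t].join();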
This is a good candidate for a work queue with a thread pool. I have used Intel Threading Building Blocks (TBB) for such requirements. Handcrafted thread pools work for quick hacks too. On Windows, the OS provides you with a nice thread-pool-backed work queue via QueueUserWorkItem().
Read these articles from Herb Sutter. You are probably a victim of "false sharing".
http://drdobbs.com/go-parallel/article/showArticle.jhtml?articleID=214100002
http://drdobbs.com/go-parallel/article/showArticle.jhtml?articleID=217500206
I agree with dlev. If your function runSimulation is not changing anything that will be required for the next call to runSimulation to work properly, then you can do something like this:
1. Divide simulations by 3.
2. You now have three ranges: 0 to simulations/3, simulations/3 + 1 to 2*simulations/3, and 2*simulations/3 + 1 to simulations.
3. These three ranges can be processed by three different threads simultaneously.
NOTE: your workload might not be suitable for this kind of split at all if you have to lock shared data.
I'm late to this party, but wanted to note two things for others who come across this post:
1) Definitely see the second Herb Sutter link that David points out (http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206). It solved the problem that brought me to this question, outlining a struct wrapper for data objects that ensures separate parallel threads aren't competing for resources located on the same memory cache line (the cache-coherence hardware prevents multiple cores from writing to the same cache line simultaneously without serializing them).
2) Regarding the original question, dlev points out a large part of the problem, but since it's a simulation I bet there's a deeper issue slowing things down. While none of your program's high-level variables are shared, you probably have one critical system-level variable that is: the "last random number" that is stored under the hood and used to create the next random number. You might even be initializing dedicated generator objects for each simulation, but if they make calls to a function like rand() then they, and by extension their threads, are making repeated calls to the same shared system resource and subsequently blocking one another.
Solutions to issue #2 would depend on the structure of the simulation program itself. For instance, if calls to a random generator are fragmented, then I'd probably batch them into one up-front call that retrieves and stores what the simulation will need. And this has me wondering now about more sophisticated approaches that would deal with the underlying shared random-generation resource...
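One common way around that shared rand() state, sketched here with C++11 <random> and a thread_local generator (the function name and the seeding choice are illustrative, not from the original program):

#include <random>

// Each thread gets its own generator, so simulations never contend on the
// single hidden state behind rand().
int threadLocalRandom(int lo, int hi)
{
    thread_local std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(lo, hi);
    return dist(gen);
}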