C++11: std::thread: simple example to parallelize a loop?

C++11 includes very cool new features, but I can't find many examples of how to parallelize a for-loop.
So my very naive question is: how do you parallelize a simple for loop (like one using "omp parallel for") with std::thread? (I'm looking for an example.)
Thank you very much.

std::thread is not necessarily meant to parallelize loops. It is meant to be the low-level abstraction on top of which constructs like a parallel_for algorithm can be built. If you want to parallelize your loops, you should either write a parallel_for algorithm yourself or use existing libraries which offer task-based parallelism.
The following example shows how you could parallelize a simple loop, but on the other hand it also shows the disadvantages, like the missing load balancing and the complexity for a simple loop.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

typedef std::vector<int> container;
typedef container::iterator iter;

container v(100, 1);

auto worker = [] (iter begin, iter end) {
    for(auto it = begin; it != end; ++it) {
        *it *= 2;
    }
};

// serial
worker(std::begin(v), std::end(v));
std::cout << std::accumulate(std::begin(v), std::end(v), 0) << std::endl; // 200

// parallel: 8 threads, each doubling one eighth of the range
std::vector<std::thread> threads(8);
const int grainsize = v.size() / 8;

auto work_iter = std::begin(v);
for(auto it = std::begin(threads); it != std::end(threads) - 1; ++it) {
    *it = std::thread(worker, work_iter, work_iter + grainsize);
    work_iter += grainsize;
}
threads.back() = std::thread(worker, work_iter, std::end(v)); // last chunk takes the remainder

for(auto&& i : threads) {
    i.join();
}
std::cout << std::accumulate(std::begin(v), std::end(v), 0) << std::endl; // 400
Using a library which offers a parallel_for template, it can be simplified to
parallel_for(std::begin(v), std::end(v), worker);
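Rolling such a parallel_for yourself is not much more code than the example above; here is a minimal sketch built on std::thread (an illustration only: even static partitioning, no load balancing, and it assumes a random-access range and a worker with the same signature as above):

#include <algorithm>
#include <iterator>
#include <thread>
#include <vector>

// Minimal parallel_for sketch: split [begin, end) into num_threads chunks,
// run worker(chunk_begin, chunk_end) on each, and join before returning.
template <typename Iterator, typename Worker>
void parallel_for(Iterator begin, Iterator end, Worker worker,
                  unsigned num_threads = std::thread::hardware_concurrency())
{
    if (num_threads == 0) num_threads = 2; // hardware_concurrency may return 0
    const auto grainsize = std::distance(begin, end) / num_threads;

    std::vector<std::thread> threads;
    Iterator chunk_begin = begin;
    for (unsigned i = 0; i + 1 < num_threads; ++i) {
        Iterator chunk_end = std::next(chunk_begin, grainsize);
        threads.emplace_back(worker, chunk_begin, chunk_end);
        chunk_begin = chunk_end;
    }
    worker(chunk_begin, end); // run the last (possibly larger) chunk on this thread
    for (auto& t : threads) t.join();
}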

Well, obviously it depends on what your loop does, how you choose to parallelize, and how you manage the threads' lifetimes.
I'm reading the book by the author of the C++11 standard threading library (who is also one of the boost.thread maintainers and wrote Just::Thread), and I can see that "it depends".
Now to give you an idea of the basics using the new standard threading, I would recommend reading the book, as it gives plenty of examples.
Also, take a look at http://www.justsoftwaresolutions.co.uk/threading/ and https://stackoverflow.com/questions/415994/boost-thread-tutorials

Can't provide a C++11-specific answer since we're still mostly using pthreads. But, as a language-agnostic answer, you parallelise something by setting it up to run in a separate function (the thread function).
In other words, you have a function like:
def processArraySegment (threadData):
    arrayAddr = threadData->arrayAddr
    startIdx = threadData->startIdx
    endIdx = threadData->endIdx

    for i = startIdx to endIdx:
        doSomethingWith (arrayAddr[i])

    exitThread()
and, in your main code, you can process the array in two chunks:
int xyzzy[100]

threadData->arrayAddr = xyzzy
threadData->startIdx = 0
threadData->endIdx = 49
threadData->done = false
tid1 = startThread (processArraySegment, threadData)

// caveat coder: see below.

threadData->arrayAddr = xyzzy
threadData->startIdx = 50
threadData->endIdx = 99
threadData->done = false
tid2 = startThread (processArraySegment, threadData)

waitForThreadExit (tid1)
waitForThreadExit (tid2)
(keeping in mind the caveat that you should ensure thread 1 has loaded the data into its local storage before the main thread starts modifying it for thread 2, possibly with a mutex or by using an array of structures, one per thread).
In other words, it's rarely a simple matter of just modifying a for loop so that it runs in parallel, though that would be nice, something like:
for {threads=10} ({i} = 0; {i} < ARR_SZ; {i}++)
    array[{i}] = array[{i}] + 1;
Instead, it requires a bit of rearranging your code to take advantage of threads.
And, of course, you have to ensure that it makes sense for the data to be processed in parallel. If you're setting each array element to the previous one plus 1, no amount of parallel processing will help, simply because you have to wait for the previous element to be modified first.
This particular example above simply uses an argument passed to the thread function to specify which part of the array it should process. The thread function itself contains the loop to do the work.
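Translated into C++11 for completeness (a sketch, not part of the original answer; passing the per-thread data by value sidesteps the caveat above, since each thread gets its own copy):

#include <array>
#include <thread>

struct ThreadData {
    int* arrayAddr;
    int  startIdx;
    int  endIdx;
};

// Each thread processes its own inclusive segment [startIdx, endIdx].
void processArraySegment(ThreadData data)
{
    for (int i = data.startIdx; i <= data.endIdx; ++i)
        data.arrayAddr[i] += 1; // stand-in for doSomethingWith()
}

int main()
{
    std::array<int, 100> xyzzy{};
    std::thread t1(processArraySegment, ThreadData{xyzzy.data(), 0, 49});
    std::thread t2(processArraySegment, ThreadData{xyzzy.data(), 50, 99});
    t1.join();
    t2.join();
}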

Using this class you can do it as:
Range based loop (read and write)
pforeach(auto &val, container) {
    val = sin(val);
};
Index based for-loop
auto new_container = container;
pfor(size_t i, 0, container.size()) {
    new_container[i] = sin(container[i]);
};

Define a macro using std::thread and a lambda expression:
#include <thread>
#include <vector>

#ifndef PARALLEL_FOR
#define PARALLEL_FOR(INT_LOOP_BEGIN_INCLUSIVE, INT_LOOP_END_EXCLUSIVE, I, O)   \
{                                                                              \
    int LOOP_LIMIT = INT_LOOP_END_EXCLUSIVE - INT_LOOP_BEGIN_INCLUSIVE;        \
    /* one thread per iteration; a vector avoids a variable-length array */    \
    std::vector<std::thread> threads(LOOP_LIMIT);                              \
    auto fParallelLoop = [&](int I){ O; };                                     \
    for(int i = 0; i < LOOP_LIMIT; i++)                                        \
    {                                                                          \
        threads[i] = std::thread(fParallelLoop, i + INT_LOOP_BEGIN_INCLUSIVE); \
    }                                                                          \
    for(int i = 0; i < LOOP_LIMIT; i++)                                        \
    {                                                                          \
        threads[i].join();                                                     \
    }                                                                          \
}
#endif
usage:
int aaa = 0; // std::atomic<int> aaa;
PARALLEL_FOR(0, 90, i,
{
    aaa += i;
});
It's ugly, but it works (I mean the multi-threading part, not the non-atomic incrementing).
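As the commented-out declaration hints, switching aaa to std::atomic<int> fixes that race (a sketch, assuming the corrected macro above is in scope):

#include <atomic>
#include <thread>
#include <vector>

int main()
{
    std::atomic<int> aaa{0}; // atomic += makes the accumulation race-free
    PARALLEL_FOR(0, 90, i,
    {
        aaa += i;
    });
    // aaa now reliably ends up as 0 + 1 + ... + 89 == 4005
}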

AFAIK the simplest way to parallelize a loop, if you are sure that no concurrent accesses are possible, is to use OpenMP.
It is supported by all major compilers except LLVM (as of August 2013).
Example :
for(int i = 0; i < n; ++i)
{
    tab[i] *= 2;
    tab2[i] /= 2;
    tab3[i] += tab[i] - tab2[i];
}
This can be parallelized very easily, like this:
#pragma omp parallel for
for(int i = 0; i < n; ++i)
{
    tab[i] *= 2;
    tab2[i] /= 2;
    tab3[i] += tab[i] - tab2[i];
}
However, be aware that this is only efficient with a large number of values.
If you use g++, another very C++11-ish way of doing it is to use a lambda and for_each with the GNU parallel extensions (which can use OpenMP behind the scenes):
__gnu_parallel::for_each(std::begin(tab), std::end(tab), [&](int& value)
{
    stuff_of_your_loop(value);
});
However, for_each is mainly meant for containers such as arrays and vectors.
But you can "cheat" if you only want to iterate over a range, by creating a Range class with begin and end methods which will mostly just increment an int.
Note that for simple loops that do mathematical stuff, the algorithms in #include <numeric> and #include <algorithm> can all be parallelized with G++.
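For example, compiling with GNU parallel mode enabled replaces many standard algorithm calls with parallel versions without changing the source (a sketch; note the -fopenmp and -D_GLIBCXX_PARALLEL flags):

// g++ -O2 -fopenmp -D_GLIBCXX_PARALLEL sum.cpp
#include <numeric>
#include <vector>

double sum(const std::vector<double>& v)
{
    // with _GLIBCXX_PARALLEL defined, this call dispatches to the
    // parallel-mode accumulate for sufficiently large inputs
    return std::accumulate(v.begin(), v.end(), 0.0);
}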

Related

Performance of C++ program is 2x slower with the use of TBB

I'm trying to optimize the performance of a C++ program by using the TBB library.
My program only contains a couple of small for loops, so I know it can be a challenge to optimize time complexity in this case, but I have to use TBB.
As such, I tried to use a partitioner, which made the program two times faster with TBB than without the partitioner, but it's still slower than the original program without any parallelism.
In my code, I print when a loop iteration starts and ends, with its id, to see if there is parallelism. The output shows that the loop is in fact executed sequentially, for example: start 1 end 1, start 2 end 2, etc. (it's a list of size 200). The output of the ids isn't random like you would expect from a parallelized program.
Here is an example of how I used the library:
tbb::global_control c(tbb::global_control::max_allowed_parallelism, 1000);
size_t grainsize = 1000; // note: the list only holds 200 elements
size_t changes = 0;
tbb::parallel_for(
    tbb::blocked_range<std::size_t>(0, list.size(), grainsize),
    [&](const tbb::blocked_range<std::size_t>& r) {
        for (size_t id = r.begin(); id < r.end(); ++id) {
            std::cout << "start:" << id << std::endl;
            double disto = std::numeric_limits<double>::max();
            size_t cluster_id = 0;
            const Point& point = points.at(id);
            for (size_t i = 0; i < short_list.size(); i++) {
                const Point& origin = short_list[i];
                double disto2 = point.dist(origin);
                if (disto2 < disto) {
                    disto = disto2;
                    cluster_id = i;
                }
            }
            if (m[id] != cluster_id) {
                m[id] = cluster_id;
                changes++;
            }
            disto_list[id] = disto;
            std::cout << "end:" << id << std::endl;
        }
    }
);
Is there a way to improve the performance of a C++ program composed of multiple small for loops using the TBB library? And why are the loops not parallelized?
If you are using task_scheduler_init in your program, then TBB uses the same threads throughout the program, until the task_scheduler_init objects are destroyed.
As you are passing max_allowed_parallelism as a parameter to global_control: if it is set to 1, it will make your application run sequentially.
You can refer to the link below:
https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/scheduling_controls/global_control_cls.html
It would be helpful if you could provide a complete reproducer, to figure out where exactly the issue takes place.
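One more thing worth checking (an observation on the posted snippet, not part of the original answer): with list.size() == 200 and a grainsize of 1000, the blocked_range never splits, so the whole loop becomes a single chunk and runs on one thread regardless of max_allowed_parallelism. A sketch letting TBB pick the chunking:

tbb::parallel_for(
    tbb::blocked_range<std::size_t>(0, list.size()), // default grainsize of 1
    [&](const tbb::blocked_range<std::size_t>& r) {
        for (size_t id = r.begin(); id < r.end(); ++id) {
            // ... loop body from the question ...
        }
    }); // the default auto_partitioner now decides the chunk sizes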

How to let different threads fill an array together?

Suppose I have some tasks (Monte Carlo simulations) that I want to run in parallel. I want to complete a given number of tasks, but the tasks take different amounts of time, so it is not easy to divide the work evenly over the threads. Also, I need the results of all simulations in a single vector (or array) in the end.
So I came up with the approach below:
int Max{1000000};
//SimResult is some struct with well-defined default value.
std::vector<SimResult> vec(/*length*/ Max); //Initialize with default values of SimResult
int LastAdded{0};

void fill(int RandSeed)
{
    Simulator sim{RandSeed};
    while(LastAdded < Max)
    {
        // Do some work to bring foo to the desired state
        // The duration of this work is subject to randomness
        vec[LastAdded++] = sim.GetResult(); //Produces SimResult.
    }
}

int main()
{
    //launch a bunch of std::async that start
    auto fut1 = std::async(fill, 1);
    auto fut2 = std::async(fill, 2);
    //maybe some more tasks.

    fut1.get();
    fut2.get();

    //do something with the results in vec.
}
The above code will give race conditions, I guess. I am looking for a performant approach to avoid that. Requirements: avoid race conditions (fill the entire array, no skips); the final result must be immediately in the array; performant.
Reading about various approaches, it seems atomic is a good candidate, but I am not sure which settings will be most performant in my case. And I'm not even sure whether atomic will cut it; maybe a mutex guarding LastAdded is needed?
One thing I would say is that you need to be very careful with the standard library random number functions. If your Simulator class creates an instance of a generator, you should not run Monte Carlo simulations in parallel using the same object, because you'll likely get repeated patterns of random numbers between the runs, which will give you inaccurate results.
The best practice in this area would be to create N Simulator objects with the same properties and give each one a different random seed. Then you could pool these objects out over multiple threads using OpenMP, which is a common parallel programming model for scientific software development.
std::vector<SimResult> generateResults(size_t N_runs, double seed)
{
    std::vector<SimResult> results(N_runs);
    #pragma omp parallel for
    for(int i = 0; i < static_cast<int>(N_runs); i++)
    {
        auto sim = Simulator(seed + i);
        results[i] = sim.GetResult();
    }
    return results;
}
Edit: With OpenMP, you can choose different scheduling models, which allow you to, for example, dynamically split work between threads. You can do this with:
#pragma omp parallel for schedule(dynamic, 16)
which would give each thread chunks of 16 items to work on at a time.
Since you already know how many elements you are going to work with and never change the size of the vector, the easiest solution is to let each thread work on its own part of the vector; see the Simple Version below.
Update
To accommodate vastly varying calculation times, you should keep your current code but avoid race conditions via a std::lock_guard. You will need a std::mutex that is the same for all threads, for example a global variable, or you can pass a reference to the mutex to each thread.
void fill(int RandSeed, std::mutex &nextItemMutex)
{
    Simulator sim{RandSeed};
    size_t workingIndex;
    while(true)
    {
        {
            // enter critical area
            std::lock_guard<std::mutex> nextItemLock(nextItemMutex);

            // Acquire next item
            if(LastAdded < Max)
            {
                workingIndex = LastAdded;
                LastAdded++;
            }
            else
            {
                break;
            }
            // lock is released when nextItemLock goes out of scope
        }

        // Do some work to bring foo to the desired state
        // The duration of this work is subject to randomness
        vec[workingIndex] = sim.GetResult(); //Produces SimResult.
    }
}
The problem with this is that synchronisation is quite expensive. But it's probably not that expensive compared to the simulation you run, so it shouldn't be too bad.
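As the question suspected, an atomic counter also works here, since the critical section only hands out the next index; a sketch with std::atomic instead of the mutex (assuming the same vec, Max and Simulator globals from the question):

#include <atomic>

std::atomic<int> LastAdded{0};

void fill(int RandSeed)
{
    Simulator sim{RandSeed};
    while (true)
    {
        // fetch_add hands out each index exactly once, without a mutex
        int workingIndex = LastAdded.fetch_add(1);
        if (workingIndex >= Max)
            break;
        vec[workingIndex] = sim.GetResult();
    }
}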
Version 2:
To reduce the amount of synchronisation that is required, you could acquire blocks to work on, instead of single items:
void fill(int RandSeed, std::mutex &nextItemMutex, size_t blockSize)
{
    Simulator sim{RandSeed};
    size_t workingIndex;
    while(true)
    {
        {
            std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
            if(LastAdded < Max)
            {
                workingIndex = LastAdded;
                LastAdded += blockSize;
            }
            else
            {
                break;
            }
        }

        for(size_t i = workingIndex; i < workingIndex + blockSize && i < Max; i++)
            vec[i] = sim.GetResult(); //Produces SimResult.
    }
}
Simple Version
void fill(int RandSeed, size_t partitionStart, size_t partitionEnd)
{
    Simulator sim{RandSeed};
    for(size_t i = partitionStart; i < partitionEnd; i++)
    {
        // Do some work to bring foo to the desired state
        // The duration of this work is subject to randomness
        vec[i] = sim.GetResult(); //Produces SimResult.
    }
}

int main()
{
    //launch a bunch of std::async that start
    auto fut1 = std::async(fill, 1, 0, Max / 2);
    auto fut2 = std::async(fill, 2, Max / 2, Max);
    // ...
}

How to auto-vectorize range-based for loops?

A similar question was posted on SO for g++ that was rather vague, so I thought I'd post a specific example for VC++12 / VS2013 to which we can hopefully get an answer.
cross-link:
g++ , range based for and vectorization
MSDN gives the following as an example of a loop that can be vectorized:
for (int i=0; i<1000; ++i)
{
    A[i] = A[i] + 1;
}
(http://msdn.microsoft.com/en-us/library/vstudio/jj658585.aspx)
Here is my version of a range-based analogue to the above, a c-style monstrosity, and a similar loop using std::for_each. I compiled with the /Qvec-report:2 flag and added the compiler messages as comments:
#include <vector>
#include <algorithm>

int main()
{
    std::vector<int> vec(1000, 1);

    // simple range-based for loop
    {
        for (int& elem : vec)
        {
            elem = elem + 1;
        }
    } // info C5002: loop not vectorized due to reason '1304'

    // c-style iteration
    {
        int * begin = vec.data();
        int * end = begin + vec.size();

        for (int* it = begin; it != end; ++it)
        {
            *it = *it + 1;
        }
    } // info C5001: loop vectorized

    // for_each iteration
    {
        std::for_each(vec.begin(), vec.end(), [](int& elem)
        {
            elem = elem + 1;
        });
    } // (no compiler message provided)

    return 0;
}
Only the c-style loop gets vectorized. Reason 1304 is as follows, per the MSDN docs:
1304: Loop includes assignments that are of different sizes.
It gives the following as an example of code that would trigger a 1304 message:
void code_1304(int *A, short *B)
{
    // Code 1304 is emitted when the compiler detects
    // different-sized statements in the loop body.
    // In this case, there is a 32-bit statement and a
    // 16-bit statement.
    // In cases like this, consider splitting the loop into loops to
    // maximize vector register utilization.
    for (int i=0; i<1000; ++i)
    {
        A[i] = A[i] + 1;
        B[i] = B[i] + 1;
    }
}
I'm no expert but I can't see the relationship. Is this just buggy reporting? I've noticed that none of my range-based loops are getting vectorized in my actual program. What gives?
(In case this is buggy behavior I'm running VS2013 Professional Version 12.0.21005.1 REL)
EDIT: Bug report posted: https://connect.microsoft.com/VisualStudio/feedback/details/807826/range-based-for-loops-are-not-vectorized
Posted bug report here:
https://connect.microsoft.com/VisualStudio/feedback/details/807826/range-based-for-loops-are-not-vectorized
Response:
Hi, thanks for the report.
Vectorizing range-based-for-loop-y code is something we are actively making better. We'll address vectorizing this, plus enable auto-vectorization for other C++ language & library features, in future releases of the compiler.
The emission of reason code 1304 (on x64) and reason code 1301 (on x86) are artifacts of compiler internals. The details of that, for this particular code, are not important.
Thanks for the report! I am closing this MSConnect item. Feel free to respond if you need anything else.
Eric Brumer, Microsoft Visual C++ Team

OpenMP parallelize multiple sequential loops

I want to parallelize the following function with OpenMP:
void calculateAll() {
    int k;
    int nodeId1, minCost1, lowerLimit1, upperLimit8;

    for (k = mostUpperLevel; k > 0; k--) {
        int myStart = borderNodesArrayStartGlobal[k - 1];
        int size = myStart + borderNodesArraySizeGlobal[k - 1];
        /* this loop may be parallel */
        for (nodeId1 = myStart; nodeId1 < size; nodeId1++) {
            if (getNodeScanned(nodeId1)) {
                setNodeScannedFalse(nodeId1);
            } else {
                minCost1 = myMax;
                lowerLimit1 = getNode3LevelsDownAll(nodeId1);
                upperLimit8 = getUpperLimit3LevelsDownAll(nodeId1);
                changeNodeValue(nodeId1, lowerLimit1, upperLimit8, minCost1, minCost1);
            }
        }
    }

    int myStart = restNodesArrayStartGlobal;
    int size = myStart + restNodesArraySizeGlobal;
    /* this loop may also be parallel */
    for (nodeId1 = myStart; nodeId1 < size; nodeId1++) {
        if (getNodeScanned(nodeId1)) {
            setNodeScannedFalse(nodeId1);
        } else {
            minCost1 = myMax;
            lowerLimit1 = getNode3LevelsDownAll(nodeId1);
            upperLimit8 = getUpperLimit3LevelsDownAll(nodeId1);
            changeNodeValue(nodeId1, lowerLimit1, upperLimit8, minCost1, minCost1);
        }
    }
}
Although I can use "#pragma omp parallel for" on the two inner loops, the code is too slow due to the constant overhead of creating new threads. Is there a way to separate "#pragma omp parallel", so that at the beginning of the function I create the necessary threads once, and then use "#pragma omp for" on each loop to get the best possible results? I am using gcc 4.6.
Thanks in advance.
The creation of the threads is normally not the bottleneck in OpenMP programs; it is the distribution of the tasks to the threads. The threads are actually created at the first #pragma omp for (you can verify that with a profiler like VTune). At each loop, the work is assigned to the threads, and this assignment is often the problem, as it is a costly operation.
However, you should try to play around with the schedulers, as this might have a big impact on performance. For example, compare schedule(dynamic, chunksize) vs schedule(static, chunksize), and also try different chunk sizes.
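To directly answer the question about separating the directives: yes, you can open one #pragma omp parallel region at the top of the function and reuse its threads with plain #pragma omp for inside. A sketch along those lines, reusing the globals from the question:

void calculateAll() {
    #pragma omp parallel
    {
        /* every thread executes this outer loop; each omp for inside
           shares out one level, and its implicit barrier keeps the
           levels processed in order */
        for (int k = mostUpperLevel; k > 0; k--) {
            int myStart = borderNodesArrayStartGlobal[k - 1];
            int size = myStart + borderNodesArraySizeGlobal[k - 1];
            #pragma omp for
            for (int nodeId1 = myStart; nodeId1 < size; nodeId1++) {
                /* ... same body as in the question ... */
            }
        }

        int myStart = restNodesArrayStartGlobal;
        int size = myStart + restNodesArraySizeGlobal;
        #pragma omp for
        for (int nodeId1 = myStart; nodeId1 < size; nodeId1++) {
            /* ... same body as in the question ... */
        }
    }
}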

C OpenMP parallel quickSort

Once again I'm stuck using OpenMP in C++. This time I'm trying to implement a parallel quicksort.
Code:
#include <iostream>
#include <vector>
#include <stack>
#include <utility>
#include <omp.h>
#include <stdio.h>

#define SWITCH_LIMIT 1000

using namespace std;

template <typename T>
void insertionSort(std::vector<T> &v, int q, int r)
{
    int key, i;
    for(int j = q + 1; j <= r; ++j)
    {
        key = v[j];
        i = j - 1;
        while( i >= q && v[i] > key )
        {
            v[i+1] = v[i];
            --i;
        }
        v[i+1] = key;
    }
}

stack<pair<int,int> > s;

template <typename T>
void qs(vector<T> &v, int q, int r)
{
    T pivot;
    int i = q - 1, j = r;

    //switch to insertion sort for small data
    if(r - q < SWITCH_LIMIT)
    {
        insertionSort(v, q, r);
        return;
    }

    pivot = v[r];
    while(true)
    {
        while(v[++i] < pivot);
        while(v[--j] > pivot);
        if(i >= j) break;
        std::swap(v[i], v[j]);
    }
    std::swap(v[i], v[r]);

    #pragma omp critical
    {
        s.push(make_pair(q, i - 1));
        s.push(make_pair(i + 1, r));
    }
}

int main()
{
    int n, x;
    int numThreads = 4, numBusyThreads = 0;
    bool *idle = new bool[numThreads];
    for(int i = 0; i < numThreads; ++i)
        idle[i] = true;
    pair<int, int> p;
    vector<int> v;

    cin >> n;
    for(int i = 0; i < n; ++i)
    {
        cin >> x;
        v.push_back(x);
    }
    cout << v.size() << endl;
    s.push(make_pair(0, v.size()));

    #pragma omp parallel shared(s, v, idle, numThreads, numBusyThreads, p)
    {
        bool done = false;
        while(!done)
        {
            int id = omp_get_thread_num();
            #pragma omp critical
            {
                if(s.empty() == false && numBusyThreads < numThreads)
                {
                    ++numBusyThreads;
                    //the current thread is not idle anymore
                    //it will get the interval [q, r] from stack
                    //and run qs on it
                    idle[id] = false;
                    p = s.top();
                    s.pop();
                }
                if(numBusyThreads == 0)
                {
                    done = true;
                }
            }
            if(idle[id] == false)
            {
                qs(v, p.first, p.second);
                idle[id] = true;
                #pragma omp critical
                --numBusyThreads;
            }
        }
    }
    return 0;
}
Algorithm:
To use OpenMP with a recursive function, I used a stack to keep track of the next intervals on which the qs function should run. I manually add the first interval [0, size] and then let the threads get to work as new intervals are added to the stack.
The problem:
The program ends too early, not sorting the array, after creating the first set of intervals ([q, i - 1] and [i + 1, r] if you look at the code). My guess is that the threads which get the work consider the local variables of the quicksort function (qs in the code) shared by default, so they mess them up and add no interval to the stack.
How I compile:
g++ -o qs qs.cc -Wall -fopenmp
How I run:
./qs < in_100000 > out_100000
where in_100000 is a file containing 100000 on the first line, followed by 100k integers on the next line, separated by spaces.
I am using gcc 4.5.2 on Linux.
Thank you for your help,
Dan
I didn't actually run your code, but I see an immediate mistake with p, which should be private, not shared. The parallel invocations of qs(v, p.first, p.second) will race on p, resulting in unpredictable behavior. The local variables in qs should be okay because all threads have their own stacks. However, the overall approach is good. You're on the right track.
Here are my general comments on implementing parallel quicksort. Quicksort itself is embarrassingly parallel, meaning no synchronization is needed: the recursive calls of qs on a partitioned array are embarrassingly parallel.
However, the parallelism is exposed in a recursive form. If you simply use nested parallelism in OpenMP, you will end up with thousands of threads within a second, and no speedup will be gained. So you mostly need to turn the recursive algorithm into an iterative one, and then implement a sort of work queue. This is your approach, and it's not easy.
For your approach, there is a good benchmark: OmpSCR. You can download it at http://sourceforge.net/projects/ompscr/
The benchmark includes several versions of OpenMP-based quicksort, most of them similar to yours. However, to increase parallelism, one must minimize contention on the global queue (in your code, it's s). So there are possible optimizations, such as having local queues. Although the algorithm itself is purely parallel, the implementation may require synchronization artifacts. And, most of all, it's very hard to gain speedups.
However, you can still directly use recursive parallelism in OpenMP, in two ways: (1) throttling the total number of threads, and (2) using OpenMP 3.0's task.
Here is pseudo code for the first approach (based loosely on OmpSCR's benchmark):
void qsort_omp_recursive(int* begin, int* end)
{
    if (begin != end) {
        // Partition ...

        // Throttling
        if (...) {
            qsort_omp_recursive(begin, middle);
            qsort_omp_recursive(++middle, ++end);
        } else {
            #pragma omp parallel sections nowait
            {
                #pragma omp section
                qsort_omp_recursive(begin, middle);
                #pragma omp section
                qsort_omp_recursive(++middle, ++end);
            }
        }
    }
}
In order to run this code, you need to call omp_set_nested(1) and omp_set_num_threads(2). The code is really simple: we spawn two threads on the division of the work, but insert a simple throttling logic to prevent excessive threads. Note that my experimentation showed decent speedups for this approach.
Finally, you may use OpenMP 3.0's task, where a task is a logically concurrent unit of work. In all the OpenMP approaches above, each parallel construct spawns two physical threads; you could say there is a hard 1-to-1 mapping between a task and a worker thread. A task, however, separates the logical task from the worker.
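For completeness, a sketch of what the task-based variant might look like (an illustration only, not from the original answer; the partition step mirrors the Cilk Plus example below, and data/n stand for your array and its length):

#include <algorithm>

void qsort_omp_task(int* begin, int* end)
{
    if (end - begin > 1) {
        int* last = end - 1;
        int* middle = std::partition(begin, last,
                                     [last](int x) { return x < *last; });
        std::swap(*last, *middle);
        #pragma omp task                 // the left half becomes a logical task
        qsort_omp_task(begin, middle);
        qsort_omp_task(middle + 1, end); // the right half stays in this task
        #pragma omp taskwait             // wait for the spawned child task
    }
}

// Typical launch: one thread creates the root task, the team steals work.
// #pragma omp parallel
// #pragma omp single nowait
// qsort_omp_task(data, data + n);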
Because OpenMP 3.0 is not popular yet, I will use Cilk Plus, which is great for expressing this kind of nested and recursive parallelism. In Cilk Plus, the parallelization is extremely easy:
void qsort(int* begin, int* end)
{
    if (begin != end) {
        --end;
        int* middle = std::partition(begin, end,
                                     std::bind2nd(std::less<int>(), *end));
        std::swap(*end, *middle);
        cilk_spawn qsort(begin, middle);
        qsort(++middle, ++end);
        // cilk_sync; Only necessary at the final stage.
    }
}
I copied this code from Cilk Plus' example code. You will see that a single keyword, cilk_spawn, is everything needed to parallelize quicksort. I'm skipping the explanations of Cilk Plus and the spawn keyword. However, it's easy to understand: the two recursive calls are declared as logically concurrent tasks. Whenever the recursion takes place, logical tasks are created, and the Cilk Plus runtime (which implements an efficient work-stealing scheduler) handles all kinds of dirty work: it optimally queues the parallel tasks and maps them to the worker threads.
Note that OpenMP 3.0's task is essentially similar to Cilk Plus's approach. My experimentation showed that pretty nice speedups were feasible: I got a 3~4x speedup on an 8-core machine, and the speedup scaled. Cilk Plus' absolute speedups were greater than those of OpenMP 3.0.
The approach of Cilk Plus (and OpenMP 3.0) and your approach are essentially the same: the separation of parallel tasks from workload assignment. However, it's very difficult to implement efficiently; for example, you must reduce contention and use lock-free data structures.