To optimize the execution of some libraries I am making, I have to parallelize some calculations.
Unfortunately, I can not use openmp for that, so I am trying to do some similar alternative using boost::thread.
Anyone knows of some implementation like this?
I have special problems with the sharing of variables between threads (to define variables as 'shared' and 'pribate' of openmp). Any sugestions?
As far as I know you'll have to do that explicitly with anything other than OpenMP.
As an example if we have a parallelized loop in OpenMP
int i;
size_t length = 10000;
int someArray[] = new int[length];
#pragma omp parallel private(i)
{
#pragma omp for schedule(dynamic, 8)
for (i = 0; i < length; ++i) {
someArray[i] = i*i;
}
}
You'll have to factor out the logic into a "generic" loop that can work on a sub-range of your problem, and then explicitly schedule the threads. Each thread will then work on a chunk of the whole problem. In that way you explicitly declare the "private" variables- the ones that go into the subProblem function.
void subProblem(int* someArray, size_t startIndex, size_t subLength) {
size_t end = startIndex+subLength;
for (size_t i = startIndex; i < end; ++i) {
someArray[i] = i*i;
}
}
void algorithm() {
size_t i;
size_t length = 10000;
int someArray[] = new int[length];
int numThreads = 4; // how to subdivide
int thread = 0;
// a vector of all threads working on the problem
std::vector<boost::thread> threadVector;
for(thread = 0; thread < numThreads; ++thread) {
// size of subproblem
size_t subLength = length / numThreads;
size_t startIndex = subLength*thread;
// use move semantics to create a thread in the vector
// requires c++11. If you can't use c++11,
// perhaps look at boost::move?
threadVector.emplace(boost::bind(subProblem, someArray, startIndex, subLength));
}
// threads are now working on subproblems
// now go through the thread vector and join with the threads.
// left as an exercise :P
}
The above is one of many scheduling algorithms- it just cuts the problem into as many chunks as you have threads.
The OpenMP way is more complicated- it cuts the problem into many small sized chunks (of 8 in my example), and then uses work-stealing scheduling to give these chunks to threads in a thread pool. The difficulty of implementing the OpenMP way, is that you need "persistent" threads that wait for work ( a thread pool ). Hope this makes sense.
An even simpler way would be to do async on every iteration (scheduling a piece of work for each iteration). This can work, if the each iteration is very expensive and takes a long time. However, if it's small pieces of work with MANY iterations, most of the overhead will go into the scheduling and thread creation, rendering the parallelization useless.
In conclusion, depending on your problem, there are be many ways to schedule the work, it's up to you to find out what works best for your problem.
TL;DR:
Try Intel Threading Building Blocks (or Microsoft PPL) which schedule for you, provided you give the "sub-range" function:
http://cache-www.intel.com/cd/00/00/30/11/301132_301132.pdf#page=14
Related
I'm new to openMP and multi-threading.
I have been given a task to run a method as static, dynamic, and guided without using OpenMPfor loop which means I cant use scheduled clauses.!
I could create parallel threads with parallel and could assign loop iterations to threads equally
but how to make it static and dynamic(1000 block) and guided?
void static_scheduling_function(const int start_count,
const int upper_bound,
int *results)
{
int i, tid, numt;
#pragma omp parallel private(i,tid)
{
int from, to;
tid = omp_get_thread_num();
numt = omp_get_num_threads();
from = (upper_bound / numt) * tid;
to = (upper_bound / numt) * (tid + 1) - 1;
if (tid == numt - 1)
to = upper_bound - 1;
for (i = from; i < to; i++)
{
//compute one iteration (i)
int start = i;
int end = i + 1;
compute_iterations(start, end, results);
}
}
}
======================================
For dynamic i have tried something like this
void chunk_scheduling_function(const int start_count, const int upper_bound, int* results) {
int numt, shared_lower_iteration_counter=start_count;
for (int shared_lower_iteration_counter=start_count; shared_lower_iteration_counter<upper_bound;){
#pragma omp parallel shared(shared_lower_iteration_counter)
{
int tid = omp_get_thread_num();
int from,to;
int chunk = 1000;
#pragma omp critical
{
from= shared_lower_iteration_counter; // 10, 1010
to = ( shared_lower_iteration_counter + chunk ); // 1010,
shared_lower_iteration_counter = shared_lower_iteration_counter + chunk; // 1100 // critical is important while incrementing shared variable which decides next iteration
}
for(int i = from ; (i < to && i < upper_bound ); i++) { // 10 to 1009 , i< upperbound prevents other threads from executing call
int start = i;
int end = i + 1;
compute_iterations(start, end, results);
}
}
}
}
This looks like a university assignment (and a very good one IMO), I will not provide the complete solution, instead I will provide what you should be looking for.
The static scheduler looks okey; Notwithstanding, it can be improved by taking into account the chunk size as well.
For the dynamic and guided schedulers, they can be implemented by using a variable (let us name it shared_iteration_counter) that will be marking the current loop iteration that should pick up next by the threads. Therefore, when a thread needs to request a new task to work with (i.e., a new loop iteration) it queries that variable for that. In pseudo code would look like the following:
int thread_current_iteration = shared_iteration_counter++;
while(thread_current_iteration < MAX_SIZE)
{
// do work
thread_current_iteration = shared_iteration_counter++;
}
The pseudo code is assuming chunk size of 1 (i.e., shared_iteration_counter++) you will have to adapt to your use-case. Now, because that variable will be shared among threads, and every thread will be updating it, you need to ensure mutual exclusion during the updates of that variable. Fortunately, OpenMP offers means to achieve that, for instance, using #pragma omp critical, explicitly locks, and atomic operations. The latter is the better option for your use-case:
#pragma omp atomic
shared_iteration_counter = shared_iteration_counter + 1;
For the guided scheduler:
Similar to dynamic scheduling, but the chunk size starts off large and
decreases to better handle load imbalance between iterations. The
optional chunk parameter specifies them minimum size chunk to use. By
default the chunk size is approximately loop_count/number_of_threads.
In this case, not only you have to guarantee mutual exclusion of the variable that will be used to count the current loop iteration to be pick up by threads, but also guarantee mutual exclusion of the chunk size variable, since it also changes.
Without given it way too much bear in mind that you may need to considered how to deal with edge-cases such as your current thread_current_iteration= 1000 and your chunks_size=1000 with a MAX_SIZE=1500. Hence, thread_current_iteration + chunks_size > MAX_SIZE, but there is still 500 iterations to be computed.
Before I start, let me say that I've only used threads once when we were taught about them in university. Therefore, I have almost zero experience using them and I don't know if what I'm trying to do is a good idea.
I'm doing a project of my own and I'm trying to make a for loop run fast because I need the calculations in the loop for a real-time application. After "optimizing" the calculations in the loop, I've gotten closer to the desired speed. However, it still needs improvement.
Then, I remembered threading. I thought I could make the loop run even faster if I split it in 4 parts, one for each core of my machine. So this is what I tried to do:
void doYourThing(int size,int threadNumber,int numOfThreads) {
int start = (threadNumber - 1) * size / numOfThreads;
int end = threadNumber * size / numOfThreads;
for (int i = start; i < end; i++) {
//Calculations...
}
}
int main(void) {
int size = 100000;
int numOfThreads = 4;
int start = 0;
int end = size / numOfThreads;
std::thread coreB(doYourThing, size, 2, numOfThreads);
std::thread coreC(doYourThing, size, 3, numOfThreads);
std::thread coreD(doYourThing, size, 4, numOfThreads);
for (int i = start; i < end; i++) {
//Calculations...
}
coreB.join();
coreC.join();
coreD.join();
}
With this, computation time changed from 60ms to 40ms.
Questions:
1)Do my threads really run on a different core? If that's true, I would expect a greater increase in speed. More specifically, I assumed it would take close to 1/4 of the initial time.
2)If they don't, should I use even more threads to split the work? Will it make my loop faster or slower?
(1).
The question #François Andrieux asked is good. Because in the original code there is a well-structured for-loop, and if you used -O3 optimization, the compiler might be able to vectorize the computation. This vectorization will give you speedup.
Also, it depends on what is the critical path in your computation. According to Amdahl's law, the possible speedups are limited by the un-parallelisable path. You might check if the computation are reaching some variable where you have locks, then the time could also spend to spin on the lock.
(2). to find out the total number of cores and threads on your computer you may have lscpu command, which will show you the cores and threads information on your computer/server
(3). It is not necessarily true that more threads will have a better performance
There is a header-only library in Github which may be just what you need. Presumably your doYourThing processes an input vector (of size 100000 in your code) and stores the results into another vector. In this case, all you need to do is to say is
auto vectorOut = Lazy::runForAll(vectorIn, myFancyFunction);
The library will decide how many threads to use based on how many cores you have.
On the other hand, if the compiler is able to vectorize your algorithm and it still looks like it is a good idea to split the work into 4 chunks like in your example code, you could do it for example like this:
#include "Lazy.h"
void doYourThing(const MyVector& vecIn, int from, int to, MyVector& vecOut)
{
for (int i = from; i < to; ++i) {
// Calculate vecOut[i]
}
}
int main(void) {
int size = 100000;
MyVector vecIn(size), vecOut(size)
// Load vecIn vector with input data...
Lazy::runForAll({{std::pair{0, size/4}, {size/4, size/2}, {size/2, 3*size/4}, {3*size/4, size}},
[&](auto indexPair) {
doYourThing(vecIn, indexPair.first, indexPair.second, vecOut);
});
// Now the results are in vecOut
}
README.md gives further examples on parallel execution which you might find useful.
Suppose I have some tasks (Monte Carlo simulations) that I want to run in parallel. I want to complete a given number of tasks, but tasks take different amount of time so not easy to divide the work evenly over the threads. Also: I need the results of all simulations in a single vector (or array) in the end.
So I come up with below approach:
int Max{1000000};
//SimResult is some struct with well-defined default value.
std::vector<SimResult> vec(/*length*/Max);//Initialize with default values of SimResult
int LastAdded{0};
void fill(int RandSeed)
{
Simulator sim{RandSeed};
while(LastAdded < Max)
{
// Do some work to bring foo to the desired state
//The duration of this work is subject to randomness
vec[LastAdded++]
= sim.GetResult();//Produces SimResult.
}
}
main()
{
//launch a bunch of std::async that start
auto fut1 = std::async(fill,1);
auto fut2 = std::async(fill,2);
//maybe some more tasks.
fut1.get();
fut2.get();
//do something with the results in vec.
}
The above code will give race conditions I guess. I am looking for a performant approach to avoid that. Requirements: avoid race conditions (fill the entire array, no skips) ; final result is immediately in array ; performant.
Reading on various approaches, it seems atomic is a good candidate, but I am not sure what settings will be most performant in my case? And not even sure whether atomic will cut it; maybe a mutex guarding LastAdded is needed?
One thing I would say is that you need to be very careful with the standard library random number functions. If your 'Simulator' class creates an instance of a generator, you should not run Monte Carlo simulations in parallel using the same object, because you'll get likely get repeated patterns of random numbers between the runs, which will give you inaccurate results.
The best practice in this area would be to create N Simulator objects with the same properties, and give each one a different random seed. Then you could pool these objects out over multiple threads using OpenMP, which is a common parallel programming model for scientific software development.
std::vector<SimResult> generateResults(size_t N_runs, double seed)
{
std::vector<SimResult> results(N_runs);
#pragma omp parallel for
for(auto i = 0; i < N_runs; i++)
{
auto sim = Simulator(seed + i);
results[i] = sim.GetResult();
}
}
Edit: With OpenMP, you can choose different scheduling models, which allow you to for e.g. dynamically split work between threads. You can do this with:
#pragma omp parallel for schedule(dynamic, 16)
which would give each thread chunks of 16 items to work on at a time.
Since you already know how many elements your are going to work with and never change the size of the vector, the easiest solution is to let each thread work on it's own part of the vector. For example
Update
to accomodate for vastly varying calculation times, you should keep your current code, but avoid race conditions via a std::lock_guard. You will need a std::mutex that is the same for all threads, for example a global variable, or pass a reference of the mutex to each thread.
void fill(int RandSeed, std::mutex &nextItemMutex)
{
Simulator sim{RandSeed};
size_t workingIndex;
while(true)
{
{
// enter critical area
std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
// Acquire next item
if(LastAdded < Max)
{
workingIndex = LastAdded;
LastAdded++;
}
else
{
break;
}
// lock is released when nextItemLock goes out of scope
}
// Do some work to bring foo to the desired state
// The duration of this work is subject to randomness
vec[workingIndex] = sim.GetResult();//Produces SimResult.
}
}
Problem with this is, that snychronisation is quite expensive. But it's probably not that expensive in comparison to the simulation you run, so it shouldn't be too bad.
Version 2:
To reduce the amount of synchronisation that is required, you could acquire blocks to work on, instead of single items:
void fill(int RandSeed, std::mutex &nextItemMutex, size_t blockSize)
{
Simulator sim{RandSeed};
size_t workingIndex;
while(true)
{
{
std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
if(LastAdded < Max)
{
workingIndex = LastAdded;
LastAdded += blockSize;
}
else
{
break;
}
}
for(size_t i = workingIndex; i < workingIndex + blockSize && i < MAX; i++)
vec[i] = sim.GetResult();//Produces SimResult.
}
}
Simple Version
void fill(int RandSeed, size_t partitionStart, size_t partitionEnd)
{
Simulator sim{RandSeed};
for(size_t i = partitionStart; i < partitionEnd; i++)
{
// Do some work to bring foo to the desired state
// The duration of this work is subject to randomness
vec[i] = sim.GetResult();//Produces SimResult.
}
}
main()
{
//launch a bunch of std::async that start
auto fut1 = std::async(fill,1, 0, Max / 2);
auto fut2 = std::async(fill,2, Max / 2, Max);
// ...
}
I have a function that populates entries in a large matrix. As the computations are independent, I was thinking about exploiting std::thread so that chunks of the matrix can be processed by separate threads.
Instead of dividing the matrix in to n chunks where n is the limit on the maximum number of threads allowed to run simultaneously, I would like to make finer chunks, so that I could spawn a new thread when an existing thread is finished. (As the compute time will be widely different for different entries, and equally dividing the matrix will not be very efficient here. Hence the latter idea.)
What are the concepts in std::thread I should look into for doing this? (I came across async and condition_variables although I don't clearly see how they can be exploited for such kinds of spawning). Some example pseudo code would greatly help!
Why tax the OS scheduler with thread creation & destruction? (Assume these operations are expensive.) Instead, make your threads work more instead.
EDIT: If you do no want to split the work in equal chunks, then the best solution really is a thread pool. FYI, there is a thread_pool library in the works for C++14.
What is below assumed that you could split the work in equal chunks, so is not exactly applicable to your question. END OF EDIT.
struct matrix
{
int nrows, ncols;
// assuming row-based processing; adjust for column-based processing
void fill_rows(int first, int last);
};
int num_threads = std::thread::hardware_concurrency();
std::vector< std::thread > threads(num_threads);
matrix m; // must be initialized...
// here - every thread will process as many rows as needed
int nrows_per_thread = m.nrows / num_threads;
for(int i = 0; i != num_threads; ++i)
{
// thread i will process these rows:
int first = i * nrows_per_thread;
int last = first + nrows_per_thread;
// last thread gets remaining rows
last += (i == num_threads - 1) ? m.nrows % nrows_per_thread : 0;
threads[i] = std::move(std::thread([&m,first,last]{
m.fill_rows(first,last); }))
}
for(int i = 0; i != num_threads; ++i)
{
threads[i].join();
}
If this is an operation you do very frequently, then use a worker pool as #Igor Tandetnik suggests in the comments. For one-offs, it's not worth the trouble.
I am trying to multithread a piece of code using the boost library. The problem is that each thread has to access and modify a couple of global variables. I am using mutex to lock the shared resources, but the program ends up taking more time then when it was not multithreaded. Any advice on how to optimize the shared access?
Thanks a lot!
In the example below, the *choose_ecount* variable has to be locked, and I cannot take it out of the loop and lock it for only an update at the end of the loop because it is needed with the newest values by the inside function.
for(int sidx = startStep; sidx <= endStep && sidx < d.sents[lang].size(); sidx ++){
sentence s = d.sents[lang][sidx];
int senlen = s.words.size();
int end_symb = s.words[senlen-1].pos;
inside(s, lbeta);
outside(s,lbeta, lalpha);
long double sen_prob = lbeta[senlen-1][F][NO][0][senlen-1];
if (lambda[0] == 0){
mtx_.lock();
d.sents[lang][sidx].prob = sen_prob;
mtx_.unlock();
}
for(int size = 1; size <= senlen; size++)
for(int i = 0; i <= senlen - size ; i++)
{
int j = i + size - 1;
for(int k = i; k < j; k++)
{
int hidx = i; int head = s.words[hidx].pos;
for(int r = k+1; r <=j; r++)
{
int aidx = r; int arg = s.words[aidx].pos;
mtx_.lock();
for(int kids = ONE; kids <= MAX; kids++)
{
long double num = lalpha[hidx][R][kids][i][j] * get_choose_prob(s, hidx, aidx) *
lbeta[hidx][R][kids - 1][i][k] * lbeta[aidx][F][NO][k+1][j];
long double gen_right_prob = (num / sen_prob);
choose_ecount[lang][head][arg] += gen_right_prob; //LOCK
order_ecount[lang][head][arg][RIGHT] += gen_right_prob; //LOCK
}
mtx_.unlock();
}
}
From the code you have posted I can see only writes to choose_ecount and order_ecount. So why not use local per thread buffers to compute the sum and then add them up after the outermost loop and only sync this operation?
Edit:
If you need to access the intermediate values of choose_ecount how do you assure the correct intermediate value is present? One thread might have finished 2 iterations of its loop in the meantime producing different results in another thread.
It kind of sounds like you need to use a barrier for your computation instead.
It's unlikely you're going to get acceptable performance using a mutex in an inner loop. Concurrent programming is difficult, not just for the programmer but also for the computer. A large portion of the performance of modern CPUs comes from being able to treat blocks of code as sequences independent of external data. Algorithms that are efficient for single-threaded execution are often unsuitable for multi-threaded execution.
You might want to have a look at boost::atomic, which can provide lock-free synchronization, but the memory barriers required for atomic operations are still not free, so you may still run into problems, and you will probably have to re-think your algorithm.
I guess that you divide your complete problem into chunks ranging from startStep to endStep to get processed by each thread.
Since you have that locked mutex there, you're effectively serializing all threads:
You divide your problem into some chunks which are processed in serial, yet unspecified order.
That is the only thing you get is the overhead for doing multithreading.
Since you're operating on doubles, using atomic operations is not a choice for you: they're typically implemented for integral types only.
The only possible solution is to follow Kratz' suggestion to have a copy of choose_ecount and order_ecount for each thread and reduce them to a single one after your threads have finished.