D taskpool wait until all tasks are done - concurrency

This is in relation to my previous question: D concurrent writing to buffer
Say you have a piece of code that consists of 2 consecutive code blocks A and B, where B depends on A. This is very common in programming. Both A and B consist of a loop, where each iteration can be run in parallel:
double[] array = [ ... ]; // has N elements
// A
for (int i = 0; i < N; i++)
{
job1(array[i]); // new task
}
// wait for all job1's to be done
// B
for (int i = 0; i < N; i++)
{
job2(array[i]); // new task
}
B can only be executed when A is finished. How do I wait till all tasks of A are finished before executing B?

I assume you're using std.parallelism? I wrote std.parallelism, so I'll let you in on a design decision. There was actually a join function in some of the betas of std.parallelism. It waited until all tasks were finished and then shut down the task pool. I removed it because I realized it was useless.
The reason is that if you're manually creating a set of O(N) task objects to iterate over some range, you're misusing the library. You should be using a parallel foreach loop instead, which automatically joins before it releases control back to the calling thread. Your example would become:
foreach(ref elem; parallel(array)) {
job1(elem);
}
foreach(ref elem; parallel(array)) {
job2(elem);
}
In this case job1 and job2 should not start a new task because the parallel foreach loop is already using enough tasks to fully utilize all CPU cores.
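For readers coming from C++ (the language of the related questions below), the "join before returning" behaviour can be sketched with std::thread. This is only an analogy under stated assumptions, not how std.parallelism is implemented: parallel_for_each is a hypothetical helper, and job1 and job2 are placeholder bodies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: applies fn to every element, one contiguous chunk per
// worker thread, and returns only after every worker has been joined.
template <typename T, typename Fn>
void parallel_for_each(std::vector<T>& data, Fn fn) {
    std::size_t workers = std::max(1u, std::thread::hardware_concurrency());
    workers = std::min(workers, std::max<std::size_t>(data.size(), 1));
    const std::size_t chunk = (data.size() + workers - 1) / workers;

    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t first = std::min(w * chunk, data.size());
        const std::size_t last = std::min(first + chunk, data.size());
        if (first == last) break;  // nothing left to hand out
        pool.emplace_back([&data, fn, first, last] {
            for (std::size_t i = first; i < last; ++i) fn(data[i]);
        });
    }
    for (auto& t : pool) t.join();  // the implicit barrier
}

// Placeholder jobs (assumptions, not taken from the question).
void job1(double& x) { x += 1.0; }
void job2(double& x) { x *= 2.0; }

// Block B starts only after every job1 call of block A has finished.
void run_a_then_b(std::vector<double>& array) {
    parallel_for_each(array, job1);
    parallel_for_each(array, job2);
}
```

Because the helper joins its workers before returning, no explicit wait is needed between the two calls.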

Related

OpenMP parallel only one thread seems to run at a time

I'm trying to parallelize the below for loop with OpenMP, however only one thread seems to be running at a time. I can tell this based on the below observations:
Normally when I have prints inside the loop, the output will be jumbled and lines will be mixed together, however here, all my outputs are printed cleanly, suggesting that only one thread is executing at a time.
There is some heavy dynamic programming computation going on inside the loop, however I only see CPU usage on one core in htop.
If I print the current thread number omp_get_thread_num() I only see one active thread at a time, e.g. I see some iterations all from thread 4, then some iterations all from thread 3, and so on.
This only happens after a while. For the first few iterations, things seem to run in parallel.
I'm not sure if there is anything wrong with the code that prevents OpenMP from running two threads in parallel. Below is the for loop and the templates for the functions called inside it. The functions only work with what's passed into them and don't modify any other data structures.
I suspect this may have something to do with the fact that I'm passing const references around. Could that be the case?
// variables
string ref ; // read-only access
vector<vector<Cluster>> _clusters(24) ;
vector<Cluster> position_clusters = some_function() ;
#pragma omp parallel for num_threads(24) schedule(dynamic, 10)
for (int i = 0; i < position_clusters.size(); i++) {
auto& pc = position_clusters[i] ;
if (pc.size() < 2) {
continue ;
}
vector<Cluster> type_clusters = type_cluster(pc);
for (Cluster &tc : type_clusters) {
if (tc.size() < 2) {
continue ;
}
auto clusters = cluster_breakpoints(tc, 0.7) ; // dynamic programming
for (const Cluster &c : clusters) {
auto result = dynamic_programming(c, ref) ; // dynamic programming
_clusters[omp_get_thread_num()].push_back(result);
}
}
}
// Templates:
vector<Cluster> type_cluster(const Cluster &c) ;
vector<Cluster> cluster_breakpoints(Cluster& cluster, float ratio) ;
vector<Cluster> dynamic_programming(const Cluster& cluster, const string& ref) ;

c++ concurrency issue with scaling thread runtime

I have a program that performs the same function on a large array. I break the array into equal chunks and pass them to threads. Currently the threads perform the function and return what they are supposed to, but the more threads I add, the longer each thread takes to run, which totally negates the purpose of concurrency. I have tried std::thread and std::async, both with the same result.
In the images below, the amount of data processed by each child thread and by the main thread is the same size (main has 6 more points), but what main runs in ~12 seconds takes the child threads ~12 x the number of threads, as if they were running sequentially. Yet they all start at the same time, and if I print output from each thread I can see they are running concurrently. Does this have something to do with how they are being joined? I have tried everything I can think of; any help or advice is much appreciated!
In the sample code, main doesn't run the function until after the child threads finish. If I put the join after main runs, it still doesn't run until the child threads finish. Below you can see the runtimes when run with 3 and 5 threads. These times are on a downscaled dataset for testing.
void foo(char* arg1, long arg2, std::promise<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>> & ftrV) {
std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>> Grid;
// does stuff....
// fills in "Grid"
ftrV.set_value(Grid);
}
int main(){
int thnmb = 3; // # of threads
std::vector<long> buffers; // fill in buffers
std::vector<char*> pointers; //fill in pointers
std::vector<std::promise<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>>> PV(thnmb); // vector of promise grids
std::vector<std::future<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>>> FV(thnmb); // vector of futures grids
std::vector<std::thread> th(thnmb); // vector of threads
std::vector<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>> vt1(thnmb); // vector to store thread grids
for (int i = 0; i < thnmb; i++) {
th[i] = std::thread(&foo, pointers[i], buffers[i], std::ref(PV[i]));
}
for (int i = 0; i < thnmb; i++) {
FV[i] = PV[i].get_future();
}
for (int i = 0; i < thnmb; i++) {
vt1[i] = FV[i].get();
}
for (int i = 0; i < thnmb; i++) {
th[i].join();
}
// main performs same function as foo here
// combine data
// do other stuff..
return(0);
}
It's hard to give a definitive answer without knowing what foo does, but you're probably running into memory access issues. Each access to your 5-dimensional array requires 5 memory lookups, and it only takes 2 or 3 memory-bound threads to saturate the memory bandwidth a typical system can deliver.
main should perform its foo work after creating the threads but before getting the value of the promises.
And foo should probably end with ftrV.set_value(std::move(Grid)) so that a copy of that array won't have to be made.
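A minimal sketch of both suggestions, assuming a simplified two-dimensional Grid in place of the five-level vector. build_grid and run_all are hypothetical names, and the grid contents are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <thread>
#include <utility>
#include <vector>

using Grid = std::vector<std::vector<long>>;  // simplified stand-in

// Hypothetical workload shared by the child threads and by main.
Grid build_grid(long seed, std::size_t rows, std::size_t cols) {
    Grid g(rows, std::vector<long>(cols));
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            g[r][c] = seed + static_cast<long>(r * cols + c);
    return g;
}

void foo(long seed, std::promise<Grid>& out) {
    Grid grid = build_grid(seed, 4, 4);
    out.set_value(std::move(grid));  // move the result instead of copying it
}

std::vector<Grid> run_all(int thnmb) {
    std::vector<std::promise<Grid>> promises(thnmb);
    std::vector<std::future<Grid>> futures;
    std::vector<std::thread> threads;
    for (auto& p : promises) futures.push_back(p.get_future());
    for (int i = 0; i < thnmb; ++i)
        threads.emplace_back(foo, static_cast<long>(i), std::ref(promises[i]));

    // main does its own share while the children are still running...
    Grid mine = build_grid(thnmb, 4, 4);

    // ...and only then blocks on the children's results.
    std::vector<Grid> grids;
    for (auto& f : futures) grids.push_back(f.get());
    for (auto& t : threads) t.join();
    grids.push_back(std::move(mine));
    return grids;
}
```

This only removes the avoidable copy and the idle main thread. It does not address memory-bandwidth saturation, which depends on what foo actually does.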

Boost Thread_Group in a loop is very slow

I wanted to use threading to check multiple images in a vector at the same time. Here is the code:
boost::thread_group tGroup;
for (int line = 0;line < sourceImageData.size(); line++) {
for (int pixel = 0;pixel < sourceImageData[line].size();pixel++) {
for (int im = 0;im < m_images.size();im++) {
tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
}
tGroup.join_all();
}
}
This creates the thread group and loops through the lines of pixel data, then each pixel, and then multiple images. It's a weird project, but anyway, I bind each thread to a method in the same instance of the class this code is in, so "this" is used. This runs through a population of about 20 images, binding each thread as it goes, and when the inner loop is done, join_all waits for the threads to finish. Then it goes to the next pixel and starts over again.
I've tested running 50 threads at the same time with this simple program:
void run(int index) {
for (int i = 0;i < 100;i++) {
std::cout << "Index : " <<index<<" "<<i << std::endl;
}
}
int main() {
boost::thread_group tGroup;
for (int i = 0;i < 50;i++){
tGroup.create_thread(boost::bind(run, i));
}
tGroup.join_all();
int done;
std::cin >> done;
return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated it shouldn't be as slow as it is. It takes like 4 seconds for one loop of sourceImageData (line) to complete. I'm new to boost threading so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.
The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but per scan-line for example)
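As a rough sketch of that advice, written with std::thread rather than Boost: one task per scan-line, handed out to a fixed set of workers through a shared counter. The Image layout and process_line are assumptions standing in for the real per-pixel work.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using Image = std::vector<std::vector<int>>;  // [line][pixel], hypothetical layout

// Hypothetical per-line work: here it just sums the pixels of one scan-line.
long process_line(const std::vector<int>& line) {
    long sum = 0;
    for (int p : line) sum += p;
    return sum;
}

std::vector<long> process_image(const Image& image) {
    std::vector<long> results(image.size());
    std::atomic<std::size_t> next_line{0};
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());

    // Each worker keeps claiming the next unprocessed line until none remain.
    auto worker = [&] {
        for (;;) {
            const std::size_t line = next_line.fetch_add(1);
            if (line >= image.size()) return;
            results[line] = process_line(image[line]);
        }
    };

    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) threads.emplace_back(worker);
    for (auto& t : threads) t.join();  // a single join for the whole image
    return results;
}
```

Threads are created once per image rather than once per pixel, and each worker writes only to its own slots of results, so no locking is needed.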
I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs, because no new threads are started until ALL threads being synchronized (in this case, every active thread) have finished running.
If the iterations of the innermost loop (the one with im) do not depend on each other, I would suggest joining the threads only after the entire outermost loop is done.

Spawn a set of threads iteratively in C++11?

I have a function that populates entries in a large matrix. As the computations are independent, I was thinking about exploiting std::thread so that chunks of the matrix can be processed by separate threads.
Instead of dividing the matrix into n chunks, where n is the limit on the maximum number of threads allowed to run simultaneously, I would like to make finer chunks, so that I could spawn a new thread when an existing thread is finished. (The compute time will be widely different for different entries, so equally dividing the matrix will not be very efficient here. Hence the latter idea.)
What are the concepts in std::thread I should look into for doing this? (I came across async and condition_variables although I don't clearly see how they can be exploited for such kinds of spawning). Some example pseudo code would greatly help!
Why tax the OS scheduler with thread creation and destruction? (Assume these operations are expensive.) Make your threads do more work instead.
EDIT: If you do not want to split the work into equal chunks, then the best solution really is a thread pool. FYI, there is a thread_pool library in the works for C++14.
The code below assumes that you can split the work into equal chunks, so it is not exactly applicable to your question. END OF EDIT.
struct matrix
{
int nrows, ncols;
// assuming row-based processing; adjust for column-based processing
void fill_rows(int first, int last);
};
int num_threads = std::thread::hardware_concurrency();
std::vector< std::thread > threads(num_threads);
matrix m; // must be initialized...
// here - every thread will process as many rows as needed
int nrows_per_thread = m.nrows / num_threads;
for(int i = 0; i != num_threads; ++i)
{
// thread i will process these rows:
int first = i * nrows_per_thread;
int last = first + nrows_per_thread;
// last thread gets the rows left over by the integer division above
last += (i == num_threads - 1) ? m.nrows % num_threads : 0;
// the temporary std::thread is moved into place; no std::move needed
threads[i] = std::thread([&m, first, last] {
m.fill_rows(first, last); });
}
for(int i = 0; i != num_threads; ++i)
{
threads[i].join();
}
If this is an operation you do very frequently, then use a worker pool as @Igor Tandetnik suggests in the comments. For one-offs, it's not worth the trouble.
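For the unequal-chunk case, a worker pool could look roughly like the sketch below. It is an illustration built from std::thread, std::mutex, and std::condition_variable, not the library mentioned in the edit; ThreadPool and fill_matrix are hypothetical names, and the row-filling formula is a placeholder.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        if (n == 0) n = 1;  // always keep at least one worker
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    // Drains the remaining tasks, then joins every worker.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// One fine-grained task per row; idle workers pick up the next row.
std::vector<std::vector<int>> fill_matrix(std::size_t nrows, std::size_t ncols,
                                          std::size_t threads) {
    std::vector<std::vector<int>> m(nrows, std::vector<int>(ncols));
    {
        ThreadPool pool(threads);
        for (std::size_t r = 0; r < nrows; ++r)
            pool.submit([&m, r, ncols] {
                for (std::size_t c = 0; c < ncols; ++c)
                    m[r][c] = static_cast<int>(r * ncols + c);
            });
    }  // pool destructor waits for all rows here
    return m;
}
```

A worker that finishes a cheap row immediately takes the next one, so uneven compute times balance out without creating a new thread per chunk.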

boost::thread_group - is it ok to call create_thread after join_all?

I have the following situation:
I create a boost::thread_group instance, then create threads for parallel-processing on some data, then join_all on the threads.
Initially I created the threads for every X elements of data, like so:
// begin = someVector.begin();
// end = someVector.end();
// batchDispatcher = boost::function<void(It, It)>(...);
boost::thread_group processors;
// create dispatching thread every ASYNCH_PROCESSING_THRESHOLD notifications
while(end - begin > ASYNCH_PROCESSING_THRESHOLD)
{
NotifItr split = begin + ASYNCH_PROCESSING_THRESHOLD;
processors.create_thread(boost::bind(batchDispatcher, begin, split));
begin = split;
}
// create dispatching thread for the remainder
if(begin < end)
{
processors.create_thread(boost::bind(batchDispatcher, begin, end));
}
// wait for parallel processing to finish
processors.join_all();
but I have a problem with this: when I have lots of data, this code generates lots of threads (> 40), which keeps the processor busy with context switching.
My question is this: is it possible to call create_thread on the thread_group after the call to join_all?
That is, can I change my code to this?
boost::thread_group processors;
size_t processorThreads = 0; // NEW CODE
// create dispatching thread every ASYNCH_PROCESSING_THRESHOLD notifications
while(end - begin > ASYNCH_PROCESSING_THRESHOLD)
{
NotifItr split = begin + ASYNCH_PROCESSING_THRESHOLD;
processors.create_thread(boost::bind(batchDispatcher, begin, split));
begin = split;
if(++processorThreads >= MAX_ASYNCH_PROCESSORS) // NEW CODE
{ // NEW CODE
processors.join_all(); // NEW CODE
processorThreads = 0; // NEW CODE
} // NEW CODE
}
// ...
Whoever has experience with this, thanks for any insight.
I believe this is not possible. The solution you want might actually be to implement a producer-consumer or a master-worker scheme (the main 'master' thread divides the work into several fixed-size tasks, creates a pool of 'worker' threads, and sends one task at a time to each worker until all tasks are done).
These solutions require some synchronization through semaphores, but they balance the load well, since you can create one thread per available core and avoid wasting time on context switches.
Another, less elegant option is to join one thread at a time. You can keep a vector of 4 active threads, join one, and create another. The problem with this approach is that you may waste processing time if your tasks are heterogeneous.
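The join-one-at-a-time option could be sketched as follows, using std::thread instead of boost::thread_group. process_batch and process_with_window are hypothetical names, and doubling the elements stands in for the real batch dispatcher.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical batch job: doubles every element in [first, last).
void process_batch(std::vector<int>& data, std::size_t first, std::size_t last) {
    for (std::size_t i = first; i < last; ++i) data[i] *= 2;
}

// Keeps at most max_threads batch threads alive at any time.
void process_with_window(std::vector<int>& data, std::size_t batch_size,
                         std::size_t max_threads) {
    batch_size = std::max<std::size_t>(batch_size, 1);
    max_threads = std::max<std::size_t>(max_threads, 1);

    std::deque<std::thread> active;
    for (std::size_t first = 0; first < data.size(); first += batch_size) {
        const std::size_t last = std::min(first + batch_size, data.size());
        if (active.size() >= max_threads) {
            // Waits for the OLDEST thread, even if a newer one already
            // finished: this is where time is lost on heterogeneous tasks.
            active.front().join();
            active.pop_front();
        }
        active.emplace_back(process_batch, std::ref(data), first, last);
    }
    for (auto& t : active) t.join();
}
```

The thread count stays bounded, but a thread is still created per batch, so a pool of long-lived workers remains the better fit when batches are many or uneven.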