c++ concurrency issue with scaling thread runtime

c++ concurrency issue with scaling thread runtime - c++

I have a program that performs the same function on a large array. I break the array into equal chunks and pass them to threads. Currently the threads perform the function and return what they are supposed to, BUT the more threads I add the longer each thread takes to run. Which totally negates the purpose of concurrency. I have tried with std::thread and std::async both with the same result. In the images below the amount of data processed by all child threads and the main thread are the same size (main has 6 more points), but what main runs in ~ 12 seconds the child threads take ~12 x the number of threads as if they were running asynchronously. But they all start at the same time, and if I output from each thread they are running concurrently. Does this have something to do with how they are being joined? I have tried everything I can think of, any help/advice is much appreciated! In the sample code main doesn't run the function until after child threads finish, if I put the join after the main runs it still doesn't run until the child threads finish. Below you can see the runtimes when run with 3 and 5 threads. These times are on a downscaled dataset for testing.
void foo(char* arg1, long arg2, std::promise<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>> & ftrV) {
std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>> Grid;
// does stuff....
// fills in "Grid"
ftrV.set_value(Grid);
}
int main(){
int thnmb = 3; // # of threads
std::vector<long> buffers; // fill in buffers
std::vector<char*> pointers; //fill in pointers
std::vector<std::promise<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>>> PV(thnmb); // vector of promise grids
std::vector<std::future<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>>> FV(thnmb); // vector of futures grids
std::vector<std::thread> th(thnmb); // vector of threads
std::vector<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>> vt1(thnmb); // vector to store thread grids
for (int i = 0; i < thnmb; i++) {
th[i] = std::thread(&foo, pointers[i], buffers[i], std::ref(PV[i]));
}
for (int i = 0; i < thnmb; i++) {
FV[i] = PV[i].get_future();
}
for (int i = 0; i < thnmb; i++) {
vt1[i] = FV[i].get();
}
for (int i = 0; i < thnmb; i++) {
th[i].join();
}
// main performs same function as foo here
// combine data
// do other stuff..
return(0);
}

It's hard to give a definitive answer without knowing what foo does, but you're probably running into memory access issues. Each access to your 5 dimension array will require 5 memory lookups, and it only takes 2 or 3 threads with memory access to saturate what a typical system can deliver.
main should perform it's foo work after creating the threads but before getting the value of the promises.
And foo should probably end with ftrV.set_value(std::move(Grid)) so that a copy of that array won't have to be made.

Related

C++ Wait in main thread for future without while(true)

Question
I want to know if it is possible to wait in the main-Thread without any while(1)-loop.
I launch a few threads via std::async() and do calculation of numbers on each thread. After i start the threads i want to receive the results back. I do that with a std::future<>.get().
My problem
When i receive the result i call std::future.get(), which blocks the main thread until the calculation on the thread is done. This leads to some slower execution time, if one thread needs considerably more time then the following, where i could do some calculation with the finished results instead and then when the slowest thread is done i maybe have some some further calculation.
Is there a way to idle the main thread until ANY of the threads has finished running? I have thought of a callback function which wakes the main thread up, but i still don't know how to idle the main function without making it unresponsive for i.e. a second and not running a while(true) loop instead.
Current code
#include <iostream>
#include <future>
uint64_t calc_factorial(int start, int number);
int main()
{
uint64_t n = 1;
//The user entered number
uint64_t number = 0;
// get the user input
printf("Enter number (uint64_t): ");
scanf("%lu", &number);
std::future<uint64_t> results[4];
for (int i = 0; i < 4; i++)
{
// push to different cores
results[i] = std::async(std::launch::async, calc_factorial, i + 2, number);
}
for (int i = 0; i < 4; i++)
{
//retrieve result...I don't want to wait here if one threads needs more time than usual
n *= results[i].get();
}
// print n or the time needed
return 0;
}
uint64_t calc_factorial(int start, int number)
{
uint64_t n = 1;
for (int i = start; i <= number; i+=4) n *= i;
return n;
}
I prepared a code snippet which runs fine, I am using the GMP Lib for the big results, but the code runs with uint64_t instead if you enter small numbers.
Note
If you have compiled the GMP library for whatever reason on your PC already you could replace every uint64_t with mpz_class

I'd approach this somewhat differently.
Unless I have a fairly specific reason to do otherwise, I tend to approach most multithreaded code the same general way: use a (thread-safe) queue to transmit results. So create an instance of a thread-safe queue, and pass a reference to it to each of the threads that's doing to generate the data. The have whatever thread is going to collect the results grab them from the queue.
This makes it automatic (and trivial) that you create each result as it's produced, rather than getting stuck waiting for one after another has produced results.

Boost Thread_Group in a loop is very slow

I wanted to use threading to run check multiple images in a vector at the same time. Here is the code
boost::thread_group tGroup;
for (int line = 0;line < sourceImageData.size(); line++) {
for (int pixel = 0;pixel < sourceImageData[line].size();pixel++) {
for (int im = 0;im < m_images.size();im++) {
tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
}
tGroup.join_all();
}
}
This creates the thread group and loops thru lines of pixel data and each pixel and then multiple images. Its a weird project but anyway I bind the thread to a method in the same instance of the class this code is in so "this" is used. This runs through a population of about 20 images, binding each thread as it goes and then when it is done looping the join_all function takes effect when the threads are done. Then it goes to the next pixel and starts over again.
I'v tested running 50 threads at the same time with this simple program
void run(int index) {
for (int i = 0;i < 100;i++) {
std::cout << "Index : " <<index<<" "<<i << std::endl;
}
}
int main() {
boost::thread_group tGroup;
for (int i = 0;i < 50;i++){
tGroup.create_thread(boost::bind(run, i));
}
tGroup.join_all();
int done;
std::cin >> done;
return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated it shouldn't be as slow as it is. It takes like 4 seconds for one loop of sourceImageData (line) to complete. I'm new to boost threading so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.

The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but per scan-line for example)

I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs because you are basically pausing execution of any new threads until ALL threads that need to be synchronized, which in this case is all the threads that are active, are done running.
If the iterations of the innermost loop(the one with im) are not dependent on each other, I would suggest you join the threads after the entire outermost loop is done.

QtConcurrent slowdown with long-lived object pointers

I'm in the process of adding multithreading to several CPU-intensive processes on a list of long-lived object pointers. Roughly 60 million of these objects were created and added to a primary list on the main processing thread.
All of the work occurs in two lambda functors, one to process the data (myMap) and one to collect the results (myReduce). The main list gets divided into four sub-lists of roughly 15 million each and sent to QtConcurrent::mappedReduced to do work. Here's some example code:
//main thread
const int count = 60000000;
QList<MyObject*> list;
for(int i = 0; i < count; ++i) {
MyObject* obj = new MyObject;
obj.readFromFile(path);
list << obj;
}
QList<QList<MyObject*> > sublists;
for(int i = 0; i < count; i += count/4) {
sublists << list.mid(i, count/4);
}
QThreadPool::globalInstance()->setMaxThreadCount(1); //slowdown when set to 4??
Result results_total;
std::function<Result (const QList<MyObject*>&)>
myMap = [](const QList<MyObject*>& m) -> Result {
//do lots of work on individual MyObjects, querying and modifying them
};
auto myReduce = [&results_total](bool& /*noreturn*/, const Result& result) {
results_total.count += result.count;
results_total.othernumber += result.othernumber;
};
QFutureWatcher<void> fw;
fw.setFuture(QtConcurrent::mappedReduced<bool>(
sublists, myMap, myReduce,
QtConcurrent::OrderedReduce | QtConcurrent::SequentialReduce));
fw.waitForFinished();
Here's the kicker: When I setMaxThreadCount to 4 instead of 1, the procedure slows down by 10% instead of speeding up 200-400%. I used the exact same methodology (split a list into fourths and run it through QtConcurrent) on another procedure and ran it on the exact same dataset for a roughly 4x speed boost as expected by using 4 threads instead of 1.
Googling around suggests that there must be a shared resource in the myRun functor somewhere, but I can't find anything at all that's shared between the processing threads other than the original list of MyObjects that exist on the main thread.
So here's the question: Does the fact that MyObject was created in a different thread than the processing thread matter if I can guarantee that there are no synchronization issues? This link suggests it doesn't matter, but that heap memory block seems to be the only thing both threads share.
I'm running Qt 4.8.6 on Windows 7 Pro x64 with an i7 processor.

TBB task_arena & task_group usage for scaling parallel_for work

I am trying to use the Threaded Building Blocks task_arena. There is a simple array full of '0'. Arena's threads put '1' in the array on the odd places. Main thread put '2' in the array on the even places.
/* Odd-even arenas tbb test */
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/task_arena.h>
#include <tbb/task_group.h>
#include <iostream>
using namespace std;
const int SIZE = 100;
int main()
{
tbb::task_arena limited(1); // no more than 1 thread in this arena
tbb::task_group tg;
int myArray[SIZE] = {0};
//! Main thread create another thread, then immediately returns
limited.enqueue([&]{
//! Created thread continues here
tg.run([&]{
tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
[&](const tbb::blocked_range<int> &r)
{
for(int i = 0; i != SIZE; i++)
if(i % 2 == 0)
myArray[i] = 1;
}
);
});
});
//! Main thread do this work
tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
[&](const tbb::blocked_range<int> &r)
{
for(int i = 0; i != SIZE; i++)
if(i % 2 != 0)
myArray[i] = 2;
}
);
//! Main thread waiting for 'tg' group
//** it does not create any threads here (doesn't it?) */
limited.execute([&]{
tg.wait();
});
for(int i = 0; i < SIZE; i++) {
cout << myArray[i] << " ";
}
cout << endl;
return 0;
}
The output is:
0 2 0 2 ... 0 2
So the limited.enque{tg.run{...}} block doesn't work.
What's the problem? Any ideas? Thank you.

You have created limited arena for one thread only, and by default this slot is reserved for the master thread. Though, enqueuing into such a serializing arena will temporarily boost its concurrency level to 2 (in order to satisfy 'fire-and-forget' promise of the enqueue), enqueue() does not guarantee synchronous execution of the submitted task. So, tg.wait() can start before tg.run() executes and thus the program will not wait when the worker thread is created, joins the limited arena, and fills the array with '1' (BTW, the whole array is filled in each of 100 parallel_for iterations).
So, in order to wait for the tg.run() to complete, use limited.execute instead. But it will prevent automatic enhancing of the limited concurrency level and the task will be deferred till tg.wait() executed by master thread.
If you want to see asynchronous execution, set arena's concurrency to 2 manually: tbb::task_arena limited(2);
or disable slot reservation for master thread: tbb::task_arena limited(1,0) (but note, it implies additional overheads for dynamic balancing of the number of threads in arena).
P.S. TBB has no points where threads are guaranteed to come (unlike OpenMP). Only enqueue methods guarantee creation of at least one worker thread, but it says nothing about when it will come. See local observer feature to get notification when threads are actually joining arenas.

Spawn a set of threads iteratively in C++11?

I have a function that populates entries in a large matrix. As the computations are independent, I was thinking about exploiting std::thread so that chunks of the matrix can be processed by separate threads.
Instead of dividing the matrix in to n chunks where n is the limit on the maximum number of threads allowed to run simultaneously, I would like to make finer chunks, so that I could spawn a new thread when an existing thread is finished. (As the compute time will be widely different for different entries, and equally dividing the matrix will not be very efficient here. Hence the latter idea.)
What are the concepts in std::thread I should look into for doing this? (I came across async and condition_variables although I don't clearly see how they can be exploited for such kinds of spawning). Some example pseudo code would greatly help!

Why tax the OS scheduler with thread creation & destruction? (Assume these operations are expensive.) Instead, make your threads work more instead.
EDIT: If you do no want to split the work in equal chunks, then the best solution really is a thread pool. FYI, there is a thread_pool library in the works for C++14.
What is below assumed that you could split the work in equal chunks, so is not exactly applicable to your question. END OF EDIT.
struct matrix
{
int nrows, ncols;
// assuming row-based processing; adjust for column-based processing
void fill_rows(int first, int last);
};
int num_threads = std::thread::hardware_concurrency();
std::vector< std::thread > threads(num_threads);
matrix m; // must be initialized...
// here - every thread will process as many rows as needed
int nrows_per_thread = m.nrows / num_threads;
for(int i = 0; i != num_threads; ++i)
{
// thread i will process these rows:
int first = i * nrows_per_thread;
int last = first + nrows_per_thread;
// last thread gets remaining rows
last += (i == num_threads - 1) ? m.nrows % nrows_per_thread : 0;
threads[i] = std::move(std::thread([&m,first,last]{
m.fill_rows(first,last); }))
}
for(int i = 0; i != num_threads; ++i)
{
threads[i].join();
}
If this is an operation you do very frequently, then use a worker pool as #Igor Tandetnik suggests in the comments. For one-offs, it's not worth the trouble.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

c++ concurrency issue with scaling thread runtime - c++

Related

C++ Wait in main thread for future without while(true)

Boost Thread_Group in a loop is very slow

QtConcurrent slowdown with long-lived object pointers

TBB task_arena & task_group usage for scaling parallel_for work

Spawn a set of threads iteratively in C++11?

Categories

Resources