QtConcurrent slowdown with long-lived object pointers

QtConcurrent slowdown with long-lived object pointers - c++

I'm in the process of adding multithreading to several CPU-intensive processes on a list of long-lived object pointers. Roughly 60 million of these objects were created and added to a primary list on the main processing thread.
All of the work occurs in two lambda functors, one to process the data (myMap) and one to collect the results (myReduce). The main list gets divided into four sub-lists of roughly 15 million each and sent to QtConcurrent::mappedReduced to do work. Here's some example code:
//main thread
const int count = 60000000;
QList<MyObject*> list;
for(int i = 0; i < count; ++i) {
MyObject* obj = new MyObject;
obj.readFromFile(path);
list << obj;
}
QList<QList<MyObject*> > sublists;
for(int i = 0; i < count; i += count/4) {
sublists << list.mid(i, count/4);
}
QThreadPool::globalInstance()->setMaxThreadCount(1); //slowdown when set to 4??
Result results_total;
std::function<Result (const QList<MyObject*>&)>
myMap = [](const QList<MyObject*>& m) -> Result {
//do lots of work on individual MyObjects, querying and modifying them
};
auto myReduce = [&results_total](bool& /*noreturn*/, const Result& result) {
results_total.count += result.count;
results_total.othernumber += result.othernumber;
};
QFutureWatcher<void> fw;
fw.setFuture(QtConcurrent::mappedReduced<bool>(
sublists, myMap, myReduce,
QtConcurrent::OrderedReduce | QtConcurrent::SequentialReduce));
fw.waitForFinished();
Here's the kicker: When I setMaxThreadCount to 4 instead of 1, the procedure slows down by 10% instead of speeding up 200-400%. I used the exact same methodology (split a list into fourths and run it through QtConcurrent) on another procedure and ran it on the exact same dataset for a roughly 4x speed boost as expected by using 4 threads instead of 1.
Googling around suggests that there must be a shared resource in the myRun functor somewhere, but I can't find anything at all that's shared between the processing threads other than the original list of MyObjects that exist on the main thread.
So here's the question: Does the fact that MyObject was created in a different thread than the processing thread matter if I can guarantee that there are no synchronization issues? This link suggests it doesn't matter, but that heap memory block seems to be the only thing both threads share.
I'm running Qt 4.8.6 on Windows 7 Pro x64 with an i7 processor.

Related

c++ concurrency issue with scaling thread runtime

I have a program that performs the same function on a large array. I break the array into equal chunks and pass them to threads. Currently the threads perform the function and return what they are supposed to, BUT the more threads I add the longer each thread takes to run. Which totally negates the purpose of concurrency. I have tried with std::thread and std::async both with the same result. In the images below the amount of data processed by all child threads and the main thread are the same size (main has 6 more points), but what main runs in ~ 12 seconds the child threads take ~12 x the number of threads as if they were running asynchronously. But they all start at the same time, and if I output from each thread they are running concurrently. Does this have something to do with how they are being joined? I have tried everything I can think of, any help/advice is much appreciated! In the sample code main doesn't run the function until after child threads finish, if I put the join after the main runs it still doesn't run until the child threads finish. Below you can see the runtimes when run with 3 and 5 threads. These times are on a downscaled dataset for testing.
void foo(char* arg1, long arg2, std::promise<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>> & ftrV) {
std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>> Grid;
// does stuff....
// fills in "Grid"
ftrV.set_value(Grid);
}
int main(){
int thnmb = 3; // # of threads
std::vector<long> buffers; // fill in buffers
std::vector<char*> pointers; //fill in pointers
std::vector<std::promise<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>>> PV(thnmb); // vector of promise grids
std::vector<std::future<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>>> FV(thnmb); // vector of futures grids
std::vector<std::thread> th(thnmb); // vector of threads
std::vector<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>> vt1(thnmb); // vector to store thread grids
for (int i = 0; i < thnmb; i++) {
th[i] = std::thread(&foo, pointers[i], buffers[i], std::ref(PV[i]));
}
for (int i = 0; i < thnmb; i++) {
FV[i] = PV[i].get_future();
}
for (int i = 0; i < thnmb; i++) {
vt1[i] = FV[i].get();
}
for (int i = 0; i < thnmb; i++) {
th[i].join();
}
// main performs same function as foo here
// combine data
// do other stuff..
return(0);
}

It's hard to give a definitive answer without knowing what foo does, but you're probably running into memory access issues. Each access to your 5 dimension array will require 5 memory lookups, and it only takes 2 or 3 threads with memory access to saturate what a typical system can deliver.
main should perform it's foo work after creating the threads but before getting the value of the promises.
And foo should probably end with ftrV.set_value(std::move(Grid)) so that a copy of that array won't have to be made.

Boost Thread_Group in a loop is very slow

I wanted to use threading to run check multiple images in a vector at the same time. Here is the code
boost::thread_group tGroup;
for (int line = 0;line < sourceImageData.size(); line++) {
for (int pixel = 0;pixel < sourceImageData[line].size();pixel++) {
for (int im = 0;im < m_images.size();im++) {
tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
}
tGroup.join_all();
}
}
This creates the thread group and loops thru lines of pixel data and each pixel and then multiple images. Its a weird project but anyway I bind the thread to a method in the same instance of the class this code is in so "this" is used. This runs through a population of about 20 images, binding each thread as it goes and then when it is done looping the join_all function takes effect when the threads are done. Then it goes to the next pixel and starts over again.
I'v tested running 50 threads at the same time with this simple program
void run(int index) {
for (int i = 0;i < 100;i++) {
std::cout << "Index : " <<index<<" "<<i << std::endl;
}
}
int main() {
boost::thread_group tGroup;
for (int i = 0;i < 50;i++){
tGroup.create_thread(boost::bind(run, i));
}
tGroup.join_all();
int done;
std::cin >> done;
return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated it shouldn't be as slow as it is. It takes like 4 seconds for one loop of sourceImageData (line) to complete. I'm new to boost threading so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.

The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but per scan-line for example)

I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs because you are basically pausing execution of any new threads until ALL threads that need to be synchronized, which in this case is all the threads that are active, are done running.
If the iterations of the innermost loop(the one with im) are not dependent on each other, I would suggest you join the threads after the entire outermost loop is done.

Spawn a set of threads iteratively in C++11?

I have a function that populates entries in a large matrix. As the computations are independent, I was thinking about exploiting std::thread so that chunks of the matrix can be processed by separate threads.
Instead of dividing the matrix in to n chunks where n is the limit on the maximum number of threads allowed to run simultaneously, I would like to make finer chunks, so that I could spawn a new thread when an existing thread is finished. (As the compute time will be widely different for different entries, and equally dividing the matrix will not be very efficient here. Hence the latter idea.)
What are the concepts in std::thread I should look into for doing this? (I came across async and condition_variables although I don't clearly see how they can be exploited for such kinds of spawning). Some example pseudo code would greatly help!

Why tax the OS scheduler with thread creation & destruction? (Assume these operations are expensive.) Instead, make your threads work more instead.
EDIT: If you do no want to split the work in equal chunks, then the best solution really is a thread pool. FYI, there is a thread_pool library in the works for C++14.
What is below assumed that you could split the work in equal chunks, so is not exactly applicable to your question. END OF EDIT.
struct matrix
{
int nrows, ncols;
// assuming row-based processing; adjust for column-based processing
void fill_rows(int first, int last);
};
int num_threads = std::thread::hardware_concurrency();
std::vector< std::thread > threads(num_threads);
matrix m; // must be initialized...
// here - every thread will process as many rows as needed
int nrows_per_thread = m.nrows / num_threads;
for(int i = 0; i != num_threads; ++i)
{
// thread i will process these rows:
int first = i * nrows_per_thread;
int last = first + nrows_per_thread;
// last thread gets remaining rows
last += (i == num_threads - 1) ? m.nrows % nrows_per_thread : 0;
threads[i] = std::move(std::thread([&m,first,last]{
m.fill_rows(first,last); }))
}
for(int i = 0; i != num_threads; ++i)
{
threads[i].join();
}
If this is an operation you do very frequently, then use a worker pool as #Igor Tandetnik suggests in the comments. For one-offs, it's not worth the trouble.

Mergesort pThread implementation taking same time as single-threaded

(I have tried to simplify this as much as i could to find out where I'm doing something wrong.)
The ideea of the code is that I have a global array *v (I hope using this array isn't slowing things down, the threads should never acces the same value because they all work on different ranges) and I try to create 2 threads each one sorting the first half, respectively the second half by calling the function merge_sort() with the respective parameters.
On the threaded run, i see the process going to 80-100% cpu usage (on dual core cpu) while on the no threads run it only stays at 50% yet the run times are very close.
This is the (relevant) code:
//These are the 2 sorting functions, each thread will call merge_sort(..). Is this a problem? both threads calling same (normal) function?
void merge (int *v, int start, int middle, int end) {
//dynamically creates 2 new arrays for the v[start..middle] and v[middle+1..end]
//copies the original values into the 2 halves
//then sorts them back into the v array
}
void merge_sort (int *v, int start, int end) {
//recursively calls merge_sort(start, (start+end)/2) and merge_sort((start+end)/2+1, end) to sort them
//calls merge(start, middle, end)
}
//here i'm expecting each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code to find the bug easier)
void* mergesort_t2(void * arg) {
t_data* th_info = (t_data*)arg;
merge_sort(v, th_info->a, th_info->b);
return (void*)0;
}
//in main I simply create 2 threads calling the above function
int main (int argc, char* argv[])
{
//some stuff
//getting the clock to calculate run time
clock_t t_inceput, t_sfarsit;
t_inceput = clock();
//ignore crt_depth for this example (in the full code i'm recursively creating new threads and i need this to know when to stop)
//the a and b are the range of values the created thread will have to sort
pthread_t thread[2];
t_data next_info[2];
next_info[0].crt_depth = 1;
next_info[0].a = 0;
next_info[0].b = n/2;
next_info[1].crt_depth = 1;
next_info[1].a = n/2+1;
next_info[1].b = n-1;
for (int i=0; i<2; i++) {
if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
cerr<<"error\n;";
return err;
}
}
for (int i=0; i<2; i++) {
if (pthread_join(thread[i], &status) != 0) {
cerr<<"error\n;";
return err;
}
}
//now i merge the 2 sorted halves
merge(v, 0, n/2, n-1);
//calculate end time
t_sfarsit = clock();
cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
delete [] v;
}
Output (on 1 million values):
Sort time (s): 1.294
Output with direct calling of merge_sort, no threads:
Sort time (s): 1.388
Output (on 10 million values):
Sort time (s): 12.75
Output with direct calling of merge_sort, no threads:
Sort time (s): 13.838
Solution:
I'd like to thank WhozCraig and Adam too as they've hinted to this from the beginning.
I've used the inplace_merge(..) function instead of my own and the program run times are as they should now.
Here's my initial merge function (not really sure if the initial, i've probably modified it a few times since, also array indices might be wrong right now, i went back and forth between [a,b] and [a,b), this was just the last commented-out version):
void merge (int *v, int a, int m, int c) { //sorts v[a,m] - v[m+1,c] in v[a,c]
//create the 2 new arrays
int *st = new int[m-a+1];
int *dr = new int[c-m+1];
//copy the values
for (int i1 = 0; i1 <= m-a; i1++)
st[i1] = v[a+i1];
for (int i2 = 0; i2 <= c-(m+1); i2++)
dr[i2] = v[m+1+i2];
//merge them back together in sorted order
int is=0, id=0;
for (int i=0; i<=c-a; i++) {
if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
v[a+i] = st[is];
is++;
}
else {
v[a+i] = dr[id];
id++;
}
}
delete st, dr;
}
all this was replaced with:
inplace_merge(v+a, v+m, v+c);
Edit, some times on my 3ghz dual core cpu:
1 million values:
1 thread : 7.236 s
2 threads: 4.622 s
4 threads: 4.692 s
10 million values:
1 thread : 82.034 s
2 threads: 46.189 s
4 threads: 47.36 s

There's one thing that struck me: "dynamically creates 2 new arrays[...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
Another thing is the often-forgotten starting half-sentence for any big-O complexity measurements: "There is an n0 so that for all n>n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their results were that these limits are surprisingly high.

Note: since OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I left it for sake of those who might find the information useful.
clock() is a wrong interface for measuring time on Linux: it measures CPU time used by the program (see http://linux.die.net/man/3/clock), which in case of multiple threads is the sum of CPU time for all threads. You need to measure elapsed, or wallclock, time. See more details in this SO question: C: using clock() to measure time in multi-threaded programs, which also tells what API can be used instead of clock().
In the MPI-based implementation that you try to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included - so the CPU time is close to wallclock time. Nevertheless, it's still wrong to use CPU time (and so clock()) for performance measurement, even in serial programs; for one reason, if a program waits for e.g. a network event or a message from another MPI process, it still spends time - but not CPU time.
Update: In Microsoft's implementation of C run-time library, clock() returns wall-clock time, so is OK to use for your purpose. It's unclear though if you use Microsoft's toolchain or something else, like Cygwin or MinGW.

Getting item sequence numbers from a QtConcurrent Threaded Calculation

The QtConcurrent namespace is really great for simplifying the management of multi-threaded calculations. Overall this works great and I have been able to use QtConcurrent run(), map(), and other variants in the way they are described in the API.
Overall Goal:
I would like to query, cancel(), or pause() a numerically intensive calculation from QML. So far this is working the way I would like, except that I cannot access the sequence numbers in the calculation. Here is a link that describes a similar QML setup.
Below is an image from small test app that I created to encapsulate what I am trying to do:
In the example above the calculation has nearly completed and all the cores have been enqueued with work properly, as can be seen from a system query:
But what I really would like to do is use the sequence numbers from a given list of the items IN THE multi-threaded calculation itself. E.g., one approach might be to simply setup the sequence numbers directly in a QList or QVector (other C++ STL containers can work as well), like this:
void TaskDialog::mapTask()
{
// Number of times the map function will be called:
int N = 5;
// Prepare the vector that we operate on with mapFunction:
QList<int> vectorOfInts;
for (int i = 0; i < N; i++) {
vectorOfInts << i;
}
// Start the calc:
QFuture<void> future = QtConcurrent::map(vectorOfInts, mapFunction);
_futureWatcher.setFuture(future);
//_futureWatcher.waitForFinished();
}
The calculation is non-blocking with the line: _futureWatcher.waitForFinished(); commented out, as shown in the code above. Note that when setup as a non-blocking calculation, the GUI thread is responsive, and the progress bar updates as desired.
But when the values in the QList container are queried during the calculation, what appears seem to be the uninitialized garbage values that one would expect when the array is not properly initialized.
Below is the example function I am calling:
void mapFunction(int& n)
{
// Check the n values:
qDebug() << "n = " << n;
/* Below is an arbitrary task but note that we left out n,
* although normally we would want to use it): */
const long work = 10000 * 10000 * 10;
long s = 0;
for (long j = 0; j < work; j++)
s++;
}
And the output of qDebug() is:
n = 30458288
n = 204778
n = 270195923
n = 0
n = 270385260
The n-values are useless but the sum values, s, are correct (although not shown) when the calculation is mapped in this fashion (non-blocking).
Now, if I uncomment the _futureWatcher.waitForFinished(); line then I get the expected values (the order is irrelevant):
n = 0
n = 2
n = 4
n = 3
n = 1
But in this case, with _futureWatcher.waitForFinished(); enabled, my GUI thread is blocked and the progress bar does not update.
What then would be the advantage of using QtConcurrent::map() with blocking enabled, if the goal to not block the main GUI thread?
Secondly, how can get the correct values of n in the non-blocking case, allowing the GUI to remain responsive and have the progress bar keep updating?
My only option may be to use QThread directly but I wanted to take advantage of all the nice tools setup for us in QtConcurrent.
Thoughts? Suggestions? Other options? Thanks.
EDIT: Thanks to user2025983 for the insight which helped me to solve this. The bottom line is that I first needed to dynamically allocate the QList:
QList<int>* vectorOfInts = new QList<int>;
for (int i = 0; i < N; i++)
vectorOfInts->push_back(i);
Next, the vectorOfInts is passed by reference to the map function by de-referencing the pointer, like this:
QFuture<void> future = QtConcurrent::map(*vectorOfInts, mapFunction);
Note also that the prototype of the mapFunction remains the same:
void mapFunction(int& n)
And then it all works properly: the GUI remained responsive, progress bar updated, the values of n are all correct, etc., WITHOUT the need to add blocking through the function:
_futureWatcher.waitForFinished();
Hope these extra details can help someone else.

The problem here is that your QList goes out of the scope when mapTask() finishes.
Since the mapFunction(int &n) takes the parameter by reference, it gets references to integer values which are now part of an array which is out of scope! So then the computer is free to do whatever it likes with that memory, which is why you see garbage values. If you are just using integer parameters, I would recommend passing the parameters by value and then everything should work.
Alternatively, if you must pass by reference you can have the futureWatcher delete the array when its finished.
QList<int>* vectorOfInts = new QList<int>;
// push back into structure
connect(_futureWatcher, SIGNAL(finished()), vectorOfInts, SLOT(deleteLater()));
// launch stuff
QtConcurrent::map...
// profit

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

QtConcurrent slowdown with long-lived object pointers - c++

Related

c++ concurrency issue with scaling thread runtime

Boost Thread_Group in a loop is very slow

Spawn a set of threads iteratively in C++11?

Mergesort pThread implementation taking same time as single-threaded

Getting item sequence numbers from a QtConcurrent Threaded Calculation

Categories

Resources