The QtConcurrent namespace greatly simplifies the management of multi-threaded calculations. Overall it works well, and I have been able to use QtConcurrent::run(), QtConcurrent::map(), and the other variants the way they are described in the API.
Overall Goal:
I would like to query, cancel(), or pause() a numerically intensive calculation from QML. So far this is working the way I would like, except that I cannot access the sequence numbers in the calculation. Here is a link that describes a similar QML setup.
Below is an image from a small test app that I created to encapsulate what I am trying to do:
In the example above the calculation has nearly completed and all the cores have been enqueued with work properly, as can be seen from a system query:
But what I would really like to do is use the sequence numbers from a given list of items IN THE multi-threaded calculation itself. E.g., one approach might be to simply set up the sequence numbers directly in a QList or QVector (other containers such as std::vector can work as well), like this:
void TaskDialog::mapTask()
{
// Number of times the map function will be called:
int N = 5;
// Prepare the vector that we operate on with mapFunction:
QList<int> vectorOfInts;
for (int i = 0; i < N; i++) {
vectorOfInts << i;
}
// Start the calc:
QFuture<void> future = QtConcurrent::map(vectorOfInts, mapFunction);
_futureWatcher.setFuture(future);
//_futureWatcher.waitForFinished();
}
The calculation is non-blocking with the line: _futureWatcher.waitForFinished(); commented out, as shown in the code above. Note that when setup as a non-blocking calculation, the GUI thread is responsive, and the progress bar updates as desired.
But when the values in the QList container are queried during the calculation, what appear are the uninitialized garbage values one would expect from an array that was never properly initialized.
Below is the example function I am calling:
void mapFunction(int& n)
{
// Check the n values:
qDebug() << "n = " << n;
/* Below is an arbitrary task; note that we left out n,
 * although normally we would want to use it: */
const long work = 10000 * 10000 * 10;
long s = 0;
for (long j = 0; j < work; j++)
s++;
}
And the output of qDebug() is:
n = 30458288
n = 204778
n = 270195923
n = 0
n = 270385260
The n-values are useless but the sum values, s, are correct (although not shown) when the calculation is mapped in this fashion (non-blocking).
Now, if I uncomment the _futureWatcher.waitForFinished(); line then I get the expected values (the order is irrelevant):
n = 0
n = 2
n = 4
n = 3
n = 1
But in this case, with _futureWatcher.waitForFinished(); enabled, my GUI thread is blocked and the progress bar does not update.
What then would be the advantage of using QtConcurrent::map() with blocking enabled, if the goal is to not block the main GUI thread?
Secondly, how can I get the correct values of n in the non-blocking case, allowing the GUI to remain responsive and the progress bar to keep updating?
My only option may be to use QThread directly, but I wanted to take advantage of all the nice tools QtConcurrent sets up for us.
Thoughts? Suggestions? Other options? Thanks.
EDIT: Thanks to user2025983 for the insight which helped me to solve this. The bottom line is that I first needed to dynamically allocate the QList:
QList<int>* vectorOfInts = new QList<int>;
for (int i = 0; i < N; i++)
vectorOfInts->push_back(i);
Next, vectorOfInts is passed by reference to the map function by dereferencing the pointer, like this:
QFuture<void> future = QtConcurrent::map(*vectorOfInts, mapFunction);
Note also that the prototype of the mapFunction remains the same:
void mapFunction(int& n)
And then it all works properly: the GUI remains responsive, the progress bar updates, the values of n are all correct, etc., WITHOUT the need to add blocking via:
_futureWatcher.waitForFinished();
Hope these extra details can help someone else.
The problem here is that your QList goes out of scope when mapTask() finishes.
Since mapFunction(int& n) takes its parameter by reference, it receives references to integer values that are now part of an array which has gone out of scope! The computer is then free to do whatever it likes with that memory, which is why you see garbage values. If you are just using integer parameters, I would recommend passing them by value, and then everything should work.
Alternatively, if you must pass by reference, you can have the future watcher delete the array when it's finished.
QList<int>* vectorOfInts = new QList<int>;
// push back into structure
connect(&_futureWatcher, SIGNAL(finished()), this, SLOT(cleanupList()));
// Note: QList is not a QObject, so it has no deleteLater() slot; delete the
// heap-allocated list in a slot of your own (cleanupList() is a hypothetical name).
// launch stuff
QtConcurrent::map...
// profit
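The same lifetime rule can be illustrated without Qt at all; a minimal std::thread sketch of the fix (names are illustrative, not from the original code), where the container is kept alive until every worker that holds references into it has finished:

```cpp
#include <functional>
#include <thread>
#include <vector>

// Worker that mirrors mapFunction(int& n): it reads and writes the
// element through a reference.
void work(int& n) { n *= 2; }

// The fix, in std::thread terms: the container must outlive the workers.
// Here we join every thread before `data` goes out of scope, which is the
// moral equivalent of keeping the QList alive until QFutureWatcher
// reports finished().
std::vector<int> spawn_good() {
    std::vector<int> data{0, 1, 2, 3, 4};
    std::vector<std::thread> threads;
    for (int& n : data)
        threads.emplace_back(work, std::ref(n));  // references stay valid
    for (std::thread& t : threads)
        t.join();                                 // container still alive here
    return data;                                  // each element doubled
}
```

Making the container a class member (or heap-allocating it, as in the edit above) achieves the same thing: the referenced storage survives until the computation is done.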
Related
I'm in the process of adding multithreading to several CPU-intensive processes on a list of long-lived object pointers. Roughly 60 million of these objects were created and added to a primary list on the main processing thread.
All of the work occurs in two lambda functors, one to process the data (myMap) and one to collect the results (myReduce). The main list gets divided into four sub-lists of roughly 15 million each and sent to QtConcurrent::mappedReduced to do work. Here's some example code:
//main thread
const int count = 60000000;
QList<MyObject*> list;
for(int i = 0; i < count; ++i) {
MyObject* obj = new MyObject;
obj->readFromFile(path);
list << obj;
}
QList<QList<MyObject*> > sublists;
for(int i = 0; i < count; i += count/4) {
sublists << list.mid(i, count/4);
}
QThreadPool::globalInstance()->setMaxThreadCount(1); //slowdown when set to 4??
Result results_total;
std::function<Result (const QList<MyObject*>&)>
myMap = [](const QList<MyObject*>& m) -> Result {
//do lots of work on individual MyObjects, querying and modifying them
};
auto myReduce = [&results_total](bool& /*noreturn*/, const Result& result) {
results_total.count += result.count;
results_total.othernumber += result.othernumber;
};
QFutureWatcher<void> fw;
fw.setFuture(QtConcurrent::mappedReduced<bool>(
sublists, myMap, myReduce,
QtConcurrent::OrderedReduce | QtConcurrent::SequentialReduce));
fw.waitForFinished();
Here's the kicker: When I setMaxThreadCount to 4 instead of 1, the procedure slows down by 10% instead of speeding up 200-400%. I used the exact same methodology (split a list into fourths and run it through QtConcurrent) on another procedure and ran it on the exact same dataset for a roughly 4x speed boost as expected by using 4 threads instead of 1.
Googling around suggests that there must be a shared resource in the myMap functor somewhere, but I can't find anything at all that's shared between the processing threads other than the original list of MyObjects, which lives on the main thread.
So here's the question: Does the fact that MyObject was created in a different thread than the processing thread matter if I can guarantee that there are no synchronization issues? This link suggests it doesn't matter, but that heap memory block seems to be the only thing both threads share.
I'm running Qt 4.8.6 on Windows 7 Pro x64 with an i7 processor.
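As an aside, the sublist split in the code above only covers every element when count divides evenly by 4. A remainder-safe partition can be sketched in plain C++ (std::vector standing in for QList; the function name is illustrative):

```cpp
#include <vector>

// Split `list` into `parts` contiguous chunks whose sizes differ by at
// most one, so no elements are dropped when list.size() is not an exact
// multiple of `parts`.
std::vector<std::vector<int>> partition(const std::vector<int>& list, int parts) {
    std::vector<std::vector<int>> chunks;
    const int n = static_cast<int>(list.size());
    int begin = 0;
    for (int i = 0; i < parts; ++i) {
        // The first (n % parts) chunks each take one extra element.
        int len = n / parts + (i < n % parts ? 1 : 0);
        chunks.emplace_back(list.begin() + begin, list.begin() + begin + len);
        begin += len;
    }
    return chunks;
}
```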
I have a function that populates entries in a large matrix. As the computations are independent, I was thinking about exploiting std::thread so that chunks of the matrix can be processed by separate threads.
Instead of dividing the matrix in to n chunks where n is the limit on the maximum number of threads allowed to run simultaneously, I would like to make finer chunks, so that I could spawn a new thread when an existing thread is finished. (As the compute time will be widely different for different entries, and equally dividing the matrix will not be very efficient here. Hence the latter idea.)
What are the concepts in std::thread I should look into for doing this? (I came across async and condition_variables although I don't clearly see how they can be exploited for such kinds of spawning). Some example pseudo code would greatly help!
Why tax the OS scheduler with thread creation & destruction? (Assume these operations are expensive.) Instead, make each thread do more work.
EDIT: If you do not want to split the work into equal chunks, then the best solution really is a thread pool. FYI, there is a thread_pool library in the works for C++14.
What follows assumes that you can split the work into equal chunks, so it is not exactly applicable to your question. END OF EDIT.
struct matrix
{
int nrows, ncols;
// assuming row-based processing; adjust for column-based processing
void fill_rows(int first, int last);
};
int num_threads = std::thread::hardware_concurrency();
std::vector< std::thread > threads(num_threads);
matrix m; // must be initialized...
// here - every thread will process as many rows as needed
int nrows_per_thread = m.nrows / num_threads;
for(int i = 0; i != num_threads; ++i)
{
// thread i will process these rows:
int first = i * nrows_per_thread;
int last = first + nrows_per_thread;
// last thread gets remaining rows
last += (i == num_threads - 1) ? m.nrows % num_threads : 0;
threads[i] = std::thread([&m,first,last]{
m.fill_rows(first,last); });
}
for(int i = 0; i != num_threads; ++i)
{
threads[i].join();
}
If this is an operation you do very frequently, then use a worker pool as #Igor Tandetnik suggests in the comments. For one-offs, it's not worth the trouble.
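For the finer-chunks case the questioner describes, a pool does not even need condition variables: a fixed set of threads can simply pull the next chunk from a shared atomic counter, so fast and slow rows balance out automatically. A sketch under the assumption that the matrix is flattened into a vector (the fill expression is a stand-in for the real computation):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Fill a matrix (flattened row-major into `m`) in fine-grained row
// chunks: each worker thread repeatedly claims the next unprocessed row
// from an atomic counter until no rows remain.
void fill_parallel(std::vector<int>& m, int nrows, int ncols) {
    std::atomic<int> next_row{0};
    auto worker = [&] {
        for (;;) {
            int row = next_row.fetch_add(1);  // claim a row atomically
            if (row >= nrows) return;         // no work left
            for (int col = 0; col < ncols; ++col)
                m[row * ncols + col] = row + col;  // stand-in computation
        }
    };
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;  // hardware_concurrency() may return 0
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i)
        threads.emplace_back(worker);
    for (std::thread& t : threads)
        t.join();
}
```

Because a thread claims a new row the instant it finishes its previous one, this gives the "spawn a new task when an existing one finishes" behavior without ever creating or destroying threads mid-computation.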
(I have tried to simplify this as much as I could to find out where I'm doing something wrong.)
The idea of the code is that I have a global array *v (I hope using this array isn't slowing things down; the threads should never access the same value because they all work on different ranges), and I try to create 2 threads, each one sorting the first half and the second half respectively, by calling the function merge_sort() with the corresponding parameters.
On the threaded run, I see the process going to 80-100% CPU usage (on a dual-core CPU), while on the non-threaded run it only stays at 50%, yet the run times are very close.
This is the (relevant) code:
//These are the 2 sorting functions; each thread will call merge_sort(..). Is it a problem that both threads call the same (ordinary) function?
void merge (int *v, int start, int middle, int end) {
//dynamically creates 2 new arrays for the v[start..middle] and v[middle+1..end]
//copies the original values into the 2 halves
//then sorts them back into the v array
}
void merge_sort (int *v, int start, int end) {
//recursively calls merge_sort(start, (start+end)/2) and merge_sort((start+end)/2+1, end) to sort them
//calls merge(start, middle, end)
}
//here i'm expecting each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code to find the bug easier)
void* mergesort_t2(void * arg) {
t_data* th_info = (t_data*)arg;
merge_sort(v, th_info->a, th_info->b);
return (void*)0;
}
//in main I simply create 2 threads calling the above function
int main (int argc, char* argv[])
{
//some stuff
//getting the clock to calculate run time
clock_t t_inceput, t_sfarsit;
t_inceput = clock();
//ignore crt_depth for this example (in the full code i'm recursively creating new threads and i need this to know when to stop)
//the a and b are the range of values the created thread will have to sort
pthread_t thread[2];
t_data next_info[2];
next_info[0].crt_depth = 1;
next_info[0].a = 0;
next_info[0].b = n/2;
next_info[1].crt_depth = 1;
next_info[1].a = n/2+1;
next_info[1].b = n-1;
for (int i=0; i<2; i++) {
if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
cerr<<"error\n";
return err;
}
}
for (int i=0; i<2; i++) {
if (pthread_join(thread[i], &status) != 0) {
cerr<<"error\n";
return err;
}
}
//now i merge the 2 sorted halves
merge(v, 0, n/2, n-1);
//calculate end time
t_sfarsit = clock();
cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
delete [] v;
}
Output (on 1 million values):
Sort time (s): 1.294
Output with direct calling of merge_sort, no threads:
Sort time (s): 1.388
Output (on 10 million values):
Sort time (s): 12.75
Output with direct calling of merge_sort, no threads:
Sort time (s): 13.838
Solution:
I'd like to thank WhozCraig and Adam too, as they hinted at this from the beginning.
I've used the inplace_merge(..) function instead of my own, and the program run times are now what they should be.
Here's my initial merge function (I'm not really sure it's the initial version; I've probably modified it a few times since, and the array indices might be wrong right now, since I went back and forth between [a,b] and [a,b); this was just the last commented-out version):
void merge (int *v, int a, int m, int c) { //sorts v[a,m] - v[m+1,c] in v[a,c]
//create the 2 new arrays
int *st = new int[m-a+1];
int *dr = new int[c-m+1];
//copy the values
for (int i1 = 0; i1 <= m-a; i1++)
st[i1] = v[a+i1];
for (int i2 = 0; i2 <= c-(m+1); i2++)
dr[i2] = v[m+1+i2];
//merge them back together in sorted order
int is=0, id=0;
for (int i=0; i<=c-a; i++) {
if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
v[a+i] = st[is];
is++;
}
else {
v[a+i] = dr[id];
id++;
}
}
delete[] st;
delete[] dr;
}
all this was replaced with:
inplace_merge(v+a, v+m+1, v+c+1);
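One detail worth spelling out for anyone making the same replacement: std::inplace_merge takes a middle iterator pointing at the first element of the second half and a past-the-end iterator, so inclusive index ranges [a, m] and [m+1, c] map to the iterators shown in this small sketch (the wrapper name is illustrative):

```cpp
#include <algorithm>
#include <vector>

// Merge the two sorted inclusive ranges v[a..m] and v[m+1..c] in place,
// matching the index convention of the merge() function above.
void merge_inclusive(std::vector<int>& v, int a, int m, int c) {
    // middle = first element of the second half (m+1),
    // end    = one past the last element (c+1).
    std::inplace_merge(v.begin() + a, v.begin() + m + 1, v.begin() + c + 1);
}
```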
Edit, some times on my 3ghz dual core cpu:
1 million values:
1 thread : 7.236 s
2 threads: 4.622 s
4 threads: 4.692 s
10 million values:
1 thread : 82.034 s
2 threads: 46.189 s
4 threads: 47.36 s
There's one thing that struck me: "dynamically creates 2 new arrays[...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
Another thing is the often-forgotten starting half-sentence for any big-O complexity measurements: "There is an n0 so that for all n>n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their results were that these limits are surprisingly high.
Note: since OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I left it for sake of those who might find the information useful.
clock() is a wrong interface for measuring time on Linux: it measures CPU time used by the program (see http://linux.die.net/man/3/clock), which in case of multiple threads is the sum of CPU time for all threads. You need to measure elapsed, or wallclock, time. See more details in this SO question: C: using clock() to measure time in multi-threaded programs, which also tells what API can be used instead of clock().
In the MPI-based implementation that you try to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included - so the CPU time is close to wallclock time. Nevertheless, it's still wrong to use CPU time (and so clock()) for performance measurement, even in serial programs; for one reason, if a program waits for e.g. a network event or a message from another MPI process, it still spends time - but not CPU time.
Update: In Microsoft's implementation of C run-time library, clock() returns wall-clock time, so is OK to use for your purpose. It's unclear though if you use Microsoft's toolchain or something else, like Cygwin or MinGW.
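Regardless of toolchain, std::chrono gives a portable wall-clock measurement, which sidesteps the clock() ambiguity entirely; a minimal sketch (the sleep is a stand-in for the sort being timed):

```cpp
#include <chrono>
#include <thread>

// Measure elapsed wall-clock time around a piece of work. Unlike clock()
// on POSIX systems, steady_clock measures real elapsed time and is not
// inflated by the number of threads burning CPU concurrently.
double elapsed_seconds_of_sleep(int ms) {
    auto start = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));  // the "work"
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```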
I've been learning C++ from the internet for the past 2 years and finally the need has arisen for me to delve into MPI. I've been scouring stackoverflow and the rest of the internet (including http://people.sc.fsu.edu/~jburkardt/cpp_src/mpi/mpi.html and https://computing.llnl.gov/tutorials/mpi/#LLNL). I think I've got some of the logic down, but I'm having a hard time wrapping my head around the following:
#include (stuff)
using namespace std;
vector<double> function(vector<double> &foo, const vector<double> &bar, int dim, int rows);
int main(int argc, char** argv)
{
vector<double> result;//represents a regular 1D vector
int id_proc, tot_proc, root_proc = 0;
int dim;//set to number of "columns" in A and B below
int rows;//set to number of "rows" of A and B below
vector<double> A(dim*rows), B(dim*rows);//represent matrices as 1D vectors
MPI::Init(argc,argv);
id_proc = MPI::COMM_WORLD.Get_rank();
tot_proc = MPI::COMM_WORLD.Get_size();
/*
initialize A and B here on root_proc with RNG and Bcast to everyone else
*/
//allow all processors to call function() so they can each work on a portion of A
result = function(A,B,dim,rows);
//all processors do stuff with A
//root_proc does stuff with result (doesn't matter if other processors have updated result)
MPI::Finalize();
return 0;
}
vector<double> function(vector<double> &foo, const vector<double> &bar, int dim, int rows)
{
/*
purpose of function() is two-fold:
1. update foo because all processors need the updated "matrix"
2. get the average of the "rows" of foo and return that to main (only root processor needs this)
*/
vector<double> output(dim,0);
//add matrices the way I would normally do it in serial
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < dim; j++)
{
foo[i*dim + j] += bar[i*dim + j];//perform "matrix" addition (+= ON PURPOSE)
}
}
//obtain average of rows in foo in serial
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < dim; j++)
{
output[j] += foo[i*dim + j];//sum rows of A
}
}
for (int j = 0; j < dim; j++)
{
output[j] /= rows;//divide to obtain average
}
return output;
}
The code above is to illustrate the concept only. My main concern is to parallelize the matrix addition but what boggles my mind is this:
1) If each processor only works on a portion of that loop (naturally I'd have to modify the loop parameters per processor) what command do I use to merge all portions of A back into a single, updated A that all processors have in their memory. My guess is that I have to do some kind of Alltoall where each processor sends its portion of A to all other processors, but how do I guarantee that (for example) row 3 worked on by processor 3 overwrites row 3 of the other processors, and not row 1 by accident.
2) If I use an Alltoall inside function(), do all processors have to be allowed to step into function(), or can I isolate function() using...
if (id_proc == root_proc)
{
result = function(A,B,dim,rows);
}
… and then inside function() handle all the parallelization. As silly as it sounds, I'm trying to do a lot of the work on one processor (with broadcasts), and just parallelize the big time-consuming for loops. Just trying to keep the code conceptually simple so I can get my results and move on.
3) For the averaging part, I'm sure I can just use a reducing command if I wanted to parallelize it, correct?
Also, as an aside: is there a way to call Bcast() such that it is blocking? I'd like to use it to synchronize all my processors (boost libraries are not an option). If not then I'll just go with Barrier(). Thank you for your answer to this question, and to the community of stackoverflow for learning me how to program over the past two years! :)
1) The function you are looking is MPI_Allgather. MPI_Allgather will let you send a row from each processor and receive the result on all processors.
2) Yes you can use some of the processors in your function. Since MPI functions work with communicators you have to create a separate communicator for this purpose. I don't know how this is implemented in the C++ bindings but C bindings use the MPI_Comm_create function.
3) Yes see MPI_Allreduce.
aside: Bcast blocks a process until send/receive operation assigned to that process is finished. If you want to wait for all processors to finish their work (I don't have any idea why you would want to do this) you should use Barrier().
extra note: I wouldn't recommend using the C++ bindings as they are deprecated and you won't find specific examples on how to use them. Boost MPI is the library to use if you want C++ bindings; however, it does not cover all of the MPI functions.
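On the "how do I guarantee row 3 lands in row 3" worry from the question: MPI_Allgather(v) places each rank's block by rank order, so the only bookkeeping on your side is the per-rank counts and displacements. That arithmetic can be sketched in plain C++ without any MPI calls (the function name is illustrative):

```cpp
#include <vector>

// Compute, for each rank, how many rows it owns and at which row its
// block starts, splitting `rows` as evenly as possible. Multiplying
// these by `dim` gives the element counts/displacements you would hand
// to MPI_Allgatherv; because blocks are placed in rank order, rank 3's
// rows can never overwrite rank 1's.
void row_layout(int rows, int tot_proc,
                std::vector<int>& counts, std::vector<int>& displs) {
    counts.assign(tot_proc, 0);
    displs.assign(tot_proc, 0);
    int offset = 0;
    for (int r = 0; r < tot_proc; ++r) {
        // The first (rows % tot_proc) ranks each take one extra row.
        counts[r] = rows / tot_proc + (r < rows % tot_proc ? 1 : 0);
        displs[r] = offset;
        offset += counts[r];
    }
}
```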
I have read the question posted earlier that seemed to involve the same error I am getting when using WaitForMultipleObjects, but I believe mine is different. I am using several threads to compute different parts of a Mandelbrot set. The program compiles and produces the correct result about 3 out of 5 times, but sometimes I get an error that says "Access violation when writing to ..." (a memory location that is different every time). Like I said, sometimes it works, sometimes it doesn't. I put breakpoints before and after the WaitForMultipleObjects call and have concluded that it must be the culprit. I just don't know why. Here is the code...
int max = size();
if (max == 0) //Return false if there are no threads
return false;
for(int i=0;i<max;++i) //Resume all threads
ResumeThread(threads[i]);
HANDLE *first = &threads[0]; //Create a pointer to the first thread
WaitForMultipleObjects(max,first,TRUE,INFINITE);//Wait for all threads to finish
Update: I tried using a for loop and WaitForSingleObject and the problem still persisted.
Update 2: Here is the thread function. It looks kind of ugly with all of the pointers.
unsigned MandelbrotSet::tfcn(void* obj)
{
funcArg *args = (funcArg*) obj;
int count = 0;
vector<int> dummy;
while(args->set->counts.size() <= args->row)
{
args->set->counts.push_back(dummy);
}
for(int y = 0; y < args->set->nx; ++y)
{
complex<double> c(args->set->zCorner.real() + (y * args->set->dx), args->set->zCorner.imag() + (args->row * args->set->dy));
count = args->set->iterate(c);
args->set->counts[args->row].push_back(count);
}
return 0;
}
Resolved: Alright everyone, I found the issue. You were right: it was in the thread function itself. The problem was that all of the threads were trying to add rows to my 2D vector of counts (counts.push_back(dummy)). The race condition was taking effect, and each thread assumed it should add more rows even when that wasn't necessary. Thanks for the help.
I solved the problem. I edited the question to state what was wrong, but I will repeat it here. I was encountering the race condition when I tried to push rows onto the 2D counts vector in my thread function. This is controlled by the while loop, and when each thread executes, it believes it needs to push more rows onto counts. I moved this loop to the constructor and simply push all of the necessary vectors onto counts upon creation. Thanks for helping me look in a different direction!
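The fix generalizes beyond Win32 threads: size the shared structure before any worker starts, so each thread only ever touches its own row. A std::thread sketch of the pattern (names are illustrative, not from the original code):

```cpp
#include <thread>
#include <vector>

// Each worker fills exactly one pre-allocated row of `counts`. Because
// the outer vector is fully sized before any thread starts, no thread
// ever resizes it, so there is no race on the outer push_back.
void fill_rows(std::vector<std::vector<int>>& counts, int ncols) {
    std::vector<std::thread> threads;
    for (std::size_t row = 0; row < counts.size(); ++row)
        threads.emplace_back([&counts, row, ncols] {
            for (int x = 0; x < ncols; ++x)  // stand-in for iterate(c)
                counts[row].push_back(static_cast<int>(row) * ncols + x);
        });
    for (std::thread& t : threads)
        t.join();
}
```

The caller resizes the outer vector first, e.g. `std::vector<std::vector<int>> counts(nrows);` (the equivalent of moving the push_back loop into the constructor), and only then launches the workers.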