I have read the question that was posted earlier that seemed to be having the same error that I am getting when using wait for multiple objects but I believe that mine is different. I am using several threads to compute different parts of a mandelbrot set. The program compiles and produces the correct result about 3 out of 5 times but sometimes I get an error that says "Access violation when writing to ... (some memory location that is different every time)". Like I said, sometimes it works, sometimes it doesn't. I put break points before and after the waitformultipleobjects and have concluded that that must be the culprit. I just dont know why. Here is the code...
int max = size();
if (max == 0) //Return false if there are no threads
return false;
for(int i=0;i<max;++i) //Resume all threads
ResumeThread(threads[i]);
HANDLE *first = &threads[0]; //Create a pointer to the first thread
WaitForMultipleObjects(max,first,TRUE,INFINITE);//Wait for all threads to finish
Update: I tried using a for loop and WaitForSingleObject and the problem still persisted.
Update 2: Here is the thread function. It looks kind of ugly with all of the pointers.
unsigned MandelbrotSet::tfcn(void* obj)
{
funcArg *args = (funcArg*) obj;
int count = 0;
vector<int> dummy;
while(args->set->counts.size() <= args->row)
{
args->set->counts.push_back(dummy);
}
for(int y = 0; y < args->set->nx; ++y)
{
complex<double> c(args->set->zCorner.real() + (y * args->set->dx), args->set->zCorner.imag() + (args->row * args->set->dy));
count = args->set->iterate(c);
args->set->counts[args->row].push_back(count);
}
return 0;
}
Resolved: Alright everyone, I found the issue. You were right. It was in the thread itself. The problem was that all of the threads were trying to add rows to my 2D vector of counts (counts.push_back(dummy)). I guess the race condition was taking effect and each thread assumed it should add more rows even when it wasn't necessary. Thanks for the help.
I solved the problem. I edited the question and stated what was wrong but I will do it again here. I was encountering the race condition when I tried to push a vector of complex numbers to the 2D vector in my thread function. This is controlled by the while loop and when each thread is executed, each thread believes that it needs to push more vector to the 2D vector called counts. I moved this loop to the constructor and simply push all of the necessary vectors onto counts upon creation. Thanks for helping me look in a different direction!
Related
I wanted to use threading to run check multiple images in a vector at the same time. Here is the code
boost::thread_group tGroup;
for (int line = 0;line < sourceImageData.size(); line++) {
for (int pixel = 0;pixel < sourceImageData[line].size();pixel++) {
for (int im = 0;im < m_images.size();im++) {
tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
}
tGroup.join_all();
}
}
This creates the thread group and loops thru lines of pixel data and each pixel and then multiple images. Its a weird project but anyway I bind the thread to a method in the same instance of the class this code is in so "this" is used. This runs through a population of about 20 images, binding each thread as it goes and then when it is done looping the join_all function takes effect when the threads are done. Then it goes to the next pixel and starts over again.
I'v tested running 50 threads at the same time with this simple program
void run(int index) {
for (int i = 0;i < 100;i++) {
std::cout << "Index : " <<index<<" "<<i << std::endl;
}
}
int main() {
boost::thread_group tGroup;
for (int i = 0;i < 50;i++){
tGroup.create_thread(boost::bind(run, i));
}
tGroup.join_all();
int done;
std::cin >> done;
return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated it shouldn't be as slow as it is. It takes like 4 seconds for one loop of sourceImageData (line) to complete. I'm new to boost threading so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.
The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but per scan-line for example)
I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs because you are basically pausing execution of any new threads until ALL threads that need to be synchronized, which in this case is all the threads that are active, are done running.
If the iterations of the innermost loop(the one with im) are not dependent on each other, I would suggest you join the threads after the entire outermost loop is done.
I am rewriting an algorithm in C++ AMP and just ran into an issue with atomic writes, more specifically atomic_fetch_add, which apparently is only for integers?
I need to add a double_4 (or if I have to, a float_4) in an atomic fashion. How do I accomplish that with C++ AMP's atomics?
Is the best/only solution really to have a lock variable which my code can use to control the writes? I actually need to do atomic writes for a long list of output doubles, so I would essentially need a lock for every output.
I have already considered tiling this for better performance, but right now I am just in the first iteration.
EDIT:
Thanks for the quick answers already given.
I have a quick update to my question though.
I made the following lock attempt, but it seems that when one thread in a warp gets past the lock, all the other threads in the same warp just tags along. I was expecting the first warp thread to get the lock, but I must be missing something (note that it has been quite a few years since my cuda days, so I have just gotten dumb)
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
.....
for (int j = 0; j < attracted.extent.size(); j++)
{
...
int lock = 0; //the expected lock value
while (!atomic_compare_exchange(&locks[j], &lock, 1));
//when one warp thread gets the lock, ALL threads continue on
...
acceleration[j] += ...; //locked write
locks[j] = 0; //leaving the lock again
}
});
It is as such not a big problem, since I should write into a shared variable at first and only write it to global memory after all threads in a tile have completed, but I just don't understand this behavior.
All the atomic add ops are only for integer types. You can do what you want without locks using 128-bit CAS (compare-and-swap) operations though for float_4 (I'm assuming this is 4 floats), but there's no 256-bit CAS ops what you would need for double_4. What you have to do is to have a loop which atomically reads float_4 from memory, perform the float add in the regular way, and then use CAS to test & swap the value if it's the original (and loop if not, i.e. some other thread changed the value between read & write). Note that the 128-bit CAS is only available on 64-bit architectures and that your data needs to be properly aligned.
if the critical code is short, you can create your own lock using atomic operations:
int lock = 1;
while(__sync_lock_test_and_set(&lock, 0) == 0) // trying to acquire lock
{
//yield the thread or go to sleep
}
//critical section, do the work
// release lock
lock = 1;
the advantage is you save the overhead of the OS locks.
The question has as such been answered by others and the answer is that you need to handle double atomics yourself. There is no function for it in the library.
I would also like to elaborate on my own edit, in case others come here with the same failing lock.
In the following example, my error was in not realizing that when the exchange failed, it actually changed the expected value! Thus the first thread would expect lock being zero and write a 1 in it. The next thread would expect 0 and would fail to write a one - but then the exchange wrote a one in the variable holding the expected value. This means that the next time the thread tried to do an exchange it expects a 1 in the lock! This it gets and then it thinks it gets the lock.
I was absolutely not aware that the &lock would receive a 1 on failed exchange match!
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
.....
for (int j = 0; j < attracted.extent.size(); j++)
{
...
int lock = 0; //the expected lock value
**//note that, if locks[j]!=lock then lock=1
//meaning that ACE will be true the next time if locks[j]==1
//meaning the while will terminate even though someone else has the lock**
while (!atomic_compare_exchange(&locks[j], &lock, 1));
//when one warp thread gets the lock, ALL threads continue on
...
acceleration[j] += ...; //locked write
locks[j] = 0; //leaving the lock again
}
});
It seems that a fix is to do
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
.....
for (int j = 0; j < attracted.extent.size(); j++)
{
...
int lock = 0; //the expected lock value
while (!atomic_compare_exchange(&locks[j], &lock, 1))
{
lock=0; //reset the expected value
};
//when one warp thread gets the lock, ALL threads continue on
...
acceleration[j] += ...; //locked write
locks[j] = 0; //leaving the lock again
}
});
The QtConcurrent namespace is really great for simplifying the management of multi-threaded calculations. Overall this works great and I have been able to use QtConcurrent run(), map(), and other variants in the way they are described in the API.
Overall Goal:
I would like to query, cancel(), or pause() a numerically intensive calculation from QML. So far this is working the way I would like, except that I cannot access the sequence numbers in the calculation. Here is a link that describes a similar QML setup.
Below is an image from small test app that I created to encapsulate what I am trying to do:
In the example above the calculation has nearly completed and all the cores have been enqueued with work properly, as can be seen from a system query:
But what I really would like to do is use the sequence numbers from a given list of the items IN THE multi-threaded calculation itself. E.g., one approach might be to simply setup the sequence numbers directly in a QList or QVector (other C++ STL containers can work as well), like this:
void TaskDialog::mapTask()
{
// Number of times the map function will be called:
int N = 5;
// Prepare the vector that we operate on with mapFunction:
QList<int> vectorOfInts;
for (int i = 0; i < N; i++) {
vectorOfInts << i;
}
// Start the calc:
QFuture<void> future = QtConcurrent::map(vectorOfInts, mapFunction);
_futureWatcher.setFuture(future);
//_futureWatcher.waitForFinished();
}
The calculation is non-blocking with the line: _futureWatcher.waitForFinished(); commented out, as shown in the code above. Note that when setup as a non-blocking calculation, the GUI thread is responsive, and the progress bar updates as desired.
But when the values in the QList container are queried during the calculation, what appears seem to be the uninitialized garbage values that one would expect when the array is not properly initialized.
Below is the example function I am calling:
void mapFunction(int& n)
{
// Check the n values:
qDebug() << "n = " << n;
/* Below is an arbitrary task but note that we left out n,
* although normally we would want to use it): */
const long work = 10000 * 10000 * 10;
long s = 0;
for (long j = 0; j < work; j++)
s++;
}
And the output of qDebug() is:
n = 30458288
n = 204778
n = 270195923
n = 0
n = 270385260
The n-values are useless but the sum values, s, are correct (although not shown) when the calculation is mapped in this fashion (non-blocking).
Now, if I uncomment the _futureWatcher.waitForFinished(); line then I get the expected values (the order is irrelevant):
n = 0
n = 2
n = 4
n = 3
n = 1
But in this case, with _futureWatcher.waitForFinished(); enabled, my GUI thread is blocked and the progress bar does not update.
What then would be the advantage of using QtConcurrent::map() with blocking enabled, if the goal to not block the main GUI thread?
Secondly, how can get the correct values of n in the non-blocking case, allowing the GUI to remain responsive and have the progress bar keep updating?
My only option may be to use QThread directly but I wanted to take advantage of all the nice tools setup for us in QtConcurrent.
Thoughts? Suggestions? Other options? Thanks.
EDIT: Thanks to user2025983 for the insight which helped me to solve this. The bottom line is that I first needed to dynamically allocate the QList:
QList<int>* vectorOfInts = new QList<int>;
for (int i = 0; i < N; i++)
vectorOfInts->push_back(i);
Next, the vectorOfInts is passed by reference to the map function by de-referencing the pointer, like this:
QFuture<void> future = QtConcurrent::map(*vectorOfInts, mapFunction);
Note also that the prototype of the mapFunction remains the same:
void mapFunction(int& n)
And then it all works properly: the GUI remained responsive, progress bar updated, the values of n are all correct, etc., WITHOUT the need to add blocking through the function:
_futureWatcher.waitForFinished();
Hope these extra details can help someone else.
The problem here is that your QList goes out of the scope when mapTask() finishes.
Since the mapFunction(int &n) takes the parameter by reference, it gets references to integer values which are now part of an array which is out of scope! So then the computer is free to do whatever it likes with that memory, which is why you see garbage values. If you are just using integer parameters, I would recommend passing the parameters by value and then everything should work.
Alternatively, if you must pass by reference you can have the futureWatcher delete the array when its finished.
QList<int>* vectorOfInts = new QList<int>;
// push back into structure
connect(_futureWatcher, SIGNAL(finished()), vectorOfInts, SLOT(deleteLater()));
// launch stuff
QtConcurrent::map...
// profit
I am writing a program that does calculations in multiple threads and return the result using c++ future, here's a simplified version of my code
int main()
{
int length = 64;
vector<std::future<float>> threads(length);
vector<float> results(length);
int blockLength = 8;
int blockCount = length/blockLength;
for(int j=0;j<blockCount;j++)
{
for(int i=0;i<blockLength;i++)
{
threads[i + j * blockLength] = std::async(func1,i*j);
}
for(int i=0;i<blockLength;i++)
{
results[i + j * blockLength] = threads[i].get();
}
}
the definition of func1 is simplified as follows:
float func1(int input)
{
//calculations...
return result;
}
I would like that the program above does 64 times of calculations, in 8 threads at a time, so that the processor and memory usage would be better at the same time.
The program is conceived that it will post blockLength number of threads at a time, and wait till the calculation results are obtained, and proceed to the next loop.
the program will post blockLength number of threads for blockCount times, for example, 8 threads for 8 times.
but the program is not working, there is always a EXC_BAD_ACCESS exception when the first loop of blockLength threads finishes, besides, the calculation time of each thread is not guaranteed, any thread can run for a long time or finish quickly.
Here is a screenshot:
as is shown above, the CPU usage drops as some of the threads finish, but an exception is thrown as soon as the second loop starts.
Would you please point out what is wrong with my usage of future?
How can we correct it?
Thank you very much!
I am implementing a particle interaction simulator in pthreads,and I keep getting segmentation faults in my pthreads code. The fault occurs in the following loop, which each thread does in the end of each timestep in my thread_routine:
for (int i = first; i < last; i++)
{
get_id(particles[i], box_id);
pthread_mutex_lock(&locks[box_id.x + box_no * box_id.y]);
//cout << box_id.x << "," << box_id.y << "," << thread_id << "l" << endl;
box[box_id.x][box_id.y].push_back(&particles[i]);
//cout << box_id.x << box_id.y << endl;
pthread_mutex_unlock(&locks[box_id.x + box_no * box_id.y]);
}
The strange thing is that if I uncomment one (it doesn't matter which one) or both of the couts, the program runs as expected, with no errors occurring (but this obviously kills performance, and isn't an elegant solution), giving correct output.
box is a globally declared
vector < vector < vector < particle_t*> > > box
which represents a decomposition of my (square) domain into boxes.
When the loop starts, box[i][j].size() has been set to zero for all i, j, and the loop is supposed to put particles back into the box-structure (the get_id function gives correct results, I've checked)
The array pthread_mutex_t locks is declared as a global
pthread_mutex_t * locks,
and the size is set by thread 0 and the locks initialized by thread 0 before the other threads are created:
locks = (pthread_mutex_t *) malloc( box_no*box_no * sizeof( pthread_mutex_t ) );
for (int i = 0; i < box_no*box_no; i++)
{
pthread_mutex_init(&locks[i],NULL);
}
Do you have any idea of what could cause this? The code also runs if the number of processors is set to 1, and it seems like the more processors I run on, the earlier the seg fault occurs (it has run through the entire simulation once on two processors, but this seems to be an exception)
Thanks
This is only an educated guess, but based on the problem going away if you use one lock for all the boxes: push_back has to allocate memory, which it does via the std::allocator template. I don't think allocator is guaranteed to be thread-safe and I don't think it's guaranteed to be partitioned, one for each vector, either. (The underlying operator new is thread-safe, but allocator usually does block-slicing tricks to amortize operator new's cost.)
Is it practical for you to use reserve to preallocate space for all your vectors ahead of time, using some conservative estimate of how many particles are going to wind up in each box? That's the first thing I'd try.
The other thing I'd try is using one lock for all the boxes, which we know works, but moving the lock/unlock operations outside the for loop so that each thread gets to stash all its items at once. That might actually be faster than what you're trying to do -- less lock thrashing.
Are the box and box[i] vectors initialized properly? You only say the innermost set of vectors are set. Otherwise it looks like box_id's x or y component is wrong and running off the end of one of your arrays.
What part of the look is it crashing on?