I'm using pthreads to try to parallelize Dijkstra's pathfinding algorithm, but I'm running into a deadlock scenario I can't figure out. The gist of it is that every thread has its own priority queue it gets work from (a std::multiset) and a mutex guarding that queue, which is locked whenever the queue needs to be modified.
Every node has an owner thread, determined by the node ID modulo the thread count (so with 4 threads, node 10 belongs to thread 2). If a thread, while scanning a node's neighbors, updates a neighbor's weight (label) to something lower than it was before, it locks the owner thread's queue and removes/reinserts the node (this forces the multiset to update the node's position in the queue). However, this implementation seems to deadlock, and I can't tell why, because as far as I can tell each thread holds only one lock at a time.
Each thread's initial queue contains all of its nodes, but every node's weight except the source's is initialized to ULONG_MAX. If a thread is out of work (it keeps getting nodes with ULONG_MAX weight from the queue), it just keeps locking and unlocking its queue until another thread gives it work.
void *Dijkstra_local_owner_worker(void *param){
    struct thread_args *myargs = ((struct thread_args *)param);
    int tid = myargs->tid;
    std::multiset<Node *,cmp_label> *Q = (myargs->Q);
    struct thread_args *allargs = ((struct thread_args *)param)-tid;
    AdjGraph *G = (AdjGraph *)allargs[thread_count].Q;
    struct Node *n, *p;
    int owner;
    std::set<Edge>::iterator it;
    Edge e;

    pthread_mutex_lock(&myargs->mutex);
    while(!Q->empty()){
        n = *Q->begin(); Q->erase(Q->begin());
        pthread_mutex_unlock(&myargs->mutex);

        if(n->label == ULONG_MAX){
            pthread_mutex_lock(&myargs->mutex);
            Q->insert(n);
            continue;
        }
        for( it = n->edges->begin(); it != n->edges->end(); it++){
            e = *it;
            p = G->getNode(e.dst);
            owner = (int)(p->index % thread_count);
            if(p->label > n->label + e.weight){
                pthread_mutex_lock(&(allargs[owner].mutex));
                allargs[owner].Q->erase(p);
                p->label = n->label + e.weight;
                p->prev = n;
                allargs[owner].Q->insert(p); //update p's position in the PQ
                pthread_mutex_unlock(&(allargs[owner].mutex));
            }
        }
        pthread_mutex_lock(&myargs->mutex);
    }
    pthread_mutex_unlock(&myargs->mutex);
    return NULL;
}
Here's the function that spawns the threads.
bool Dijkstra_local_owner(AdjGraph *G, struct Node *src){
    G->setAllNodeLabels(ULONG_MAX);
    struct thread_args args[thread_count+1];
    src->label = 0;
    struct Node *n;

    for(int i=0; i<thread_count; i++){
        args[i].Q = new std::multiset<Node *,cmp_label>;
        args[i].tid = i;
        pthread_mutex_init(&args[i].mutex,NULL);
    }
    for(unsigned long i = 0; i < G->n; i++){
        n = G->getNode(i); //give all threads their workload in advance
        args[(n->index)%thread_count].Q->insert(n);
    }
    args[thread_count].Q = (std::multiset<Node *,cmp_label> *)G;
    //hacky repackaging of a pointer to prevent use of globals
    //please note this works and is not the issue. I know it's horrible.

    pthread_t threads[thread_count];
    for(int i=0; i< thread_count; i++){
        pthread_create(&threads[i],NULL,Dijkstra_local_owner_worker,&args[i]);
    }
    for(int i=0; i< thread_count; i++){
        pthread_join(threads[i],NULL);
    }
    for(int i=0; i< thread_count; i++){
        delete args[i].Q;
    }
    return true; //the function is declared bool, so return a value to avoid UB
}
The structure definition for each thread's arguments:
struct thread_args{
    std::multiset<Node *,cmp_label> *Q;
    pthread_mutex_t mutex;
    int tid;
};
My question is: where does this code deadlock? I'm getting tunnel vision and can't see where I'm going wrong. I've verified all the other logic, so things like pointer dereferences are correct.
If a thread is out of work (it's getting nodes with ULONG_MAX weight
from the queue) it just keeps locking and unlocking until another
thread gives it work.
This is a potential problem - once a thread gets into this state, it will essentially hold the mutex locked for the entire duration of its timeslice. pthreads mutexes are lightweight, which means they aren't guaranteed to be fair - it's quite possible (likely, even) that the busy-waiting thread will be able to re-acquire the lock before a woken waiting thread is able to acquire it.
You should use pthread_cond_wait() here, and have the condition variable signalled when another thread updates the queue. The start of your loop would then look something like:
pthread_mutex_lock(&myargs->mutex);
while (!Q->empty())
{
    n = *Q->begin();

    if (n->label == ULONG_MAX)
    {
        pthread_cond_wait(&myargs->cond, &myargs->mutex);
        continue; /* Re-check the while condition after pthread_cond_wait() returns */
    }

    Q->erase(Q->begin());
    pthread_mutex_unlock(&myargs->mutex);
    /* ... */
and the point where you update another node's queue would look like:
    /* ... */
    allargs[owner].Q->insert(p); //update p's position in the PQ
    pthread_cond_signal(&allargs[owner].cond);
    pthread_mutex_unlock(&allargs[owner].mutex);
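Note that this assumes thread_args gains a condition variable next to its mutex; that member does not exist in the original struct. A minimal sketch of the assumed change:
struct thread_args{
    std::multiset<Node *,cmp_label> *Q;
    pthread_mutex_t mutex;
    pthread_cond_t cond; //assumed addition: signalled when this thread's queue gains usable work
    int tid;
};

//and in Dijkstra_local_owner's setup loop, next to pthread_mutex_init:
pthread_cond_init(&args[i].cond, NULL); //assumed addition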
Your code looks something like this:
lock()
while (cond)
{
    unlock()
    if (cond1)
    {
        lock()
    }
    for (...)
    {
        ...
    }
    lock()
}
unlock()
It's easy to see how this approach can cause problems, depending on the path the code takes: the lock is released and re-acquired at several points, so the queue can change between operations.
I would use the lock only for critical operations:
lock()
Q->erase(..)
unlock()
OR
lock()
Q->insert(..)
unlock()
Try to simplify things and see if that helps. Applied to the relax step, the advice would look something like the sketch below.
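A sketch of that idea (my interpretation, not the original poster's code): take the owner's lock only around the erase/update/insert, and re-check the label inside the lock, since another thread may have lowered it between the unlocked check and acquiring the lock:
unsigned long newlabel = n->label + e.weight;

pthread_mutex_lock(&allargs[owner].mutex);
if (p->label > newlabel) {        // re-check under the lock
    allargs[owner].Q->erase(p);
    p->label = newlabel;
    p->prev = n;
    allargs[owner].Q->insert(p);  // reinsert so the multiset re-sorts p
}
pthread_mutex_unlock(&allargs[owner].mutex);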
Related
I am trying to synchronize a function I am parallelizing with pthreads.
The issue is that I hit a deadlock: a thread exits the function while other threads are still waiting at the barrier for the thread that exited. I am unsure whether the pthread_barrier structure takes care of this. Here is an example:
static pthread_barrier_t barrier;

static void* foo(void* arg) { // pthread start routines must return void*
    for (int i = beg; i < end; i++) {
        if (i > 0) {
            pthread_barrier_wait(&barrier);
        }
    }
    return NULL;
}

int main() {
    // create pthread barrier
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);

    // create thread handles
    //...

    // create threads
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&thread_handles[i], NULL, &foo, (void*) i);
    }

    // join the threads
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(thread_handles[i], NULL); // pthread_join takes the pthread_t by value
    }
}
Here is a solution I tried for foo, but it didn't work (note NUM_THREADS_COPY is a copy of the NUM_THREADS constant, and is decremented whenever a thread reaches the end of the function):
static void* foo(void* arg) {
    for (int i = beg; i < end; i++) {
        if (i > 0) {
            pthread_barrier_wait(&barrier);
        }
    }
    pthread_barrier_init(&barrier, NULL, --NUM_THREADS_COPY);
    return NULL;
}
Is there a solution to updating the number of threads to wait in a barrier for when a thread exits a function?
You need to decide how many threads it will take to pass the barrier before any threads arrive at it. Undefined behavior results from re-initializing the barrier while there are threads waiting at it. Among the plausible manifestations are that some of the waiting threads are prematurely released or that some of the waiting threads never get released, but those are by no means the only unwanted things that could happen. In any case ...
Is there a solution to updating the number of threads to wait in a
barrier for when a thread exits a function?
... no, pthreads barriers do not support that.
Since a barrier seems not to be flexible enough for your needs, you probably want to fall back to the general-purpose thread synchronization object: a condition variable (used together with a mutex and some kind of shared variable).
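To illustrate, here is a minimal sketch of such a construct: a barrier built from a mutex, a condition variable, and a generation counter, extended with a leave() operation that permanently shrinks the participant count. All the names here (flex_barrier and friends) are invented for this example; this is not a pthreads API:
#include <pthread.h>

struct flex_barrier {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    unsigned count;      // threads expected at the barrier each round
    unsigned remaining;  // threads that have not yet arrived this round
    unsigned generation; // bumped each time the barrier opens
};

void flex_barrier_init(struct flex_barrier *b, unsigned count) {
    pthread_mutex_init(&b->mutex, NULL);
    pthread_cond_init(&b->cond, NULL);
    b->count = b->remaining = count;
    b->generation = 0;
}

void flex_barrier_wait(struct flex_barrier *b) {
    pthread_mutex_lock(&b->mutex);
    unsigned gen = b->generation;
    if (--b->remaining == 0) {           // last thread to arrive opens the barrier
        b->generation++;
        b->remaining = b->count;         // reset for the next round
        pthread_cond_broadcast(&b->cond);
    } else {
        while (gen == b->generation)     // also guards against spurious wakeups
            pthread_cond_wait(&b->cond, &b->mutex);
    }
    pthread_mutex_unlock(&b->mutex);
}

void flex_barrier_leave(struct flex_barrier *b) {
    pthread_mutex_lock(&b->mutex);
    b->count--;                          // future rounds expect one thread fewer
    if (--b->remaining == 0) {           // leaving may complete the current round
        b->generation++;
        b->remaining = b->count;
        pthread_cond_broadcast(&b->cond);
    }
    pthread_mutex_unlock(&b->mutex);
}
A thread that is done with the function calls flex_barrier_leave() once instead of its final wait, so the remaining threads are not left waiting for it.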
So I am trying to implement a double buffer for a typical producer and consumer problem.
1. get_items() produces 10 items at a time.
2. The producer pushes those 10 items onto the write queue. Assume for now that there is only one producer.
3. Consumers consume one item at a time from the read queue. There are many consumers.
So I am sharing my code below. The idea is simple: consume from readq until it is empty, then swap the queue pointers, so that readq points to the freshly filled writeq and writeq points to the emptied queue, which starts filling again. This way the producer and consumers can work independently without halting each other; it trades space for time.
However, my code does not work with multiple consumers. I start 10 consumer threads, and it always gets stuck at .join().
So my code is definitely buggy, but examining it carefully I cannot find where the bug is. It seems to get stuck after lk1.unlock(), so it is not stuck in a while loop or anywhere obvious.
#include <chrono>
#include <condition_variable>
#include <future>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
using namespace std;
using namespace std::chrono_literals;

mutex m1;
mutex m2; // using 2 mutexes, so when the producer is locked, consumers can still run
condition_variable put;
condition_variable fetch;

queue<int> q1;
queue<int> q2;
queue<int>* readq = &q1;
queue<int>* writeq = &q2;
bool flag{ true };

vector<int> get_items() {
    vector<int> res;
    for (int i = 0; i < 10; i++) {
        res.push_back(i);
    }
    return res;
}

void producer_mul() {
    unique_lock<mutex> lk2(m2);
    put.wait(lk2, [&]() { return flag == false; }); //producer waits for consumer signal
    vector<int> items = get_items();
    for (auto it : items) {
        writeq->push(it);
    }
    flag = true; //signal queue is filled
    fetch.notify_one();
    lk2.unlock();
}

int get_one_item_mul() {
    unique_lock<mutex> lk1(m1);
    int res;
    if (!(*readq).empty()) {
        res = (*readq).front(); (*readq).pop();
        if ((*writeq).empty() && flag == true) { //if writeq is empty
            flag = false;
            put.notify_one();
        }
    }
    else {
        readq = writeq; // swap queue pointer
        while ((*readq).empty()) { // not yet written
            if (flag) {
                flag = false;
                put.notify_one(); //start filling process
            }
            //if (readq->empty()) { //updated due to race: readq now points to writeq, so if the producer finished, readq is not empty and flag == true
            fetch.wait(lk1, [&]() { return flag == true; });
            //}
        }
        if (flag) {
            writeq = writeq == &q1 ? &q2 : &q1; //swap writeq to the alternative queue and fill it again
            flag = false;
            //put.notify_one(); //fill that queue again if needed; in my case 10 items are produced and consumed, so the 2nd round is never needed, plus the code does not work even in this simple case, so commented out for now
        }
        res = readq->front(); readq->pop();
    }
    lk1.unlock();
    this_thread::sleep_for(10ms);
    return res;
}

int main()
{
    std::vector<std::thread> threads;
    std::packaged_task<void(void)> job1(producer_mul);
    vector<std::future<int>> res;

    for (int i = 0; i < 10; i++) {
        std::packaged_task<int(void)> job2(get_one_item_mul);
        res.push_back(job2.get_future());
        threads.push_back(std::thread(std::move(job2)));
    }
    threads.push_back(std::thread(std::move(job1)));

    for (auto& t : threads) {
        t.join();
    }
    for (auto& a : res) {
        cout << a.get() << endl;
    }
    return 0;
}
I added some comments, but the idea and the code are pretty simple and self-explanatory.
I am trying to figure out where the problem in my code is. Does it work with multiple consumers? Furthermore, would it work if there were multiple producers? I do not see a problem, since the locking is not fine-grained: both producer and consumer are locked from beginning to end.
Looking forward to discussion; any help is appreciated.
Update
Updated the race condition based on one of the answers.
The program is still not working.
Your program contains data races, and therefore exhibits undefined behavior. I see at least two:
producer_mul accesses and modifies flag while holding m2 mutex but not m1. get_one_item_mul accesses and modifies flag while holding m1 mutex but not m2. So flag is not in fact protected against concurrent access.
Similarly, producer_mul accesses writeq pointer while holding m2 mutex but not m1. get_one_item_mul modifies writeq while holding m1 mutex but not m2.
There's also a data race on the queues themselves. Initially, both queues are empty and producer_mul is blocked waiting on flag. Then the following sequence occurs (P is the producer thread, C a consumer thread):
C: readq = writeq;                  // both pointers now refer to the same queue
C: flag = false; put.notify_one();  // this wakes up the producer
P: writeq->push(it);
C: if (readq->empty())
The last two lines happen concurrently, with no protection against concurrent access. One thread modifies an std::queue instance while the other accesses that same instance. This is a data race.
There's a data race at the heart of the design. Let's imagine there's just one producer P and two consumers C1 and C2. Initially, P waits on put until flag == false. C1 grabs m1; C2 is blocked on m1.
C1 sets readq = writeq, then unblocks P, then calls fetch.wait(lk1, [&]() {return flag == true; });. This unlocks m1, allowing C2 to proceed. So now P is busy writing to writeq while C2 is busy reading from readq, which is one and the same queue.
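One way out, as a sketch: guard flag, both queue pointers, and the queues themselves with a single mutex, so every access to shared state happens under the same lock. The snippet below shows only the producer side, reusing the question's names; the consumer would need the matching change:
std::mutex m; // single mutex replacing m1 and m2; guards flag, readq, writeq and both queues

void producer_mul() {
    std::unique_lock<std::mutex> lk(m);
    put.wait(lk, [] { return flag == false; }); // flag is now only read under m
    for (int it : get_items()) {
        writeq->push(it);                       // writeq is now only touched under m
    }
    flag = true;
    fetch.notify_one();
}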
How can I run a function on a separate thread if a thread is available, assuming that I always want k threads running at any given time?
Here's the pseudo-code:
for i = 1 to N
    if numberOfRunningThreads < k
        // run foo() on another thread
    else
        // run foo() on this thread
In summary, once a thread finishes it notifies the other threads that a thread is available for use. I hope the description was clear.
My personal approach: just create the k threads and let them call foo repeatedly. You need a counter, protected against race conditions, that is decremented each time before foo is called by any thread. As soon as the desired number of calls has been performed, the threads exit one after the other (incomplete/pseudo code):
unsigned int global_counter = 0;  // set to n in runThreads below
std::mutex global_counter_mutex;  // protects global_counter

void fooRunner()
{
    for(;;)
    {
        {
            std::lock_guard<std::mutex> g(global_counter_mutex);
            if(global_counter == 0)
                break;
            --global_counter;
        }
        foo();
    }
}
void runThreads(unsigned int n, unsigned int k)
{
    global_counter = n;
    std::vector<std::thread> threads(std::min(n, k - 1));
    // k - 1: the current thread can be reused, too...
    // (provided it has no other tasks to perform)
    for(auto& t : threads)
    {
        t = std::thread(&fooRunner);
    }
    fooRunner();
    for(auto& t : threads)
    {
        t.join();
    }
}
If you have data to pass to the foo function, you could use e.g. a FIFO or LIFO queue instead of a counter, whatever is most appropriate for the given use case (see the sketch below). Threads then exit as soon as the buffer runs empty; you'd have to prevent the buffer from running empty prematurely, though, e.g. by prefilling it with all the data to be processed before starting the threads.
A variant might combine both: exit if the global counter reaches 0, otherwise wait for the queue to receive new data (e.g. via a condition variable), with the main thread continuously filling the queue while the worker threads are already running...
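A rough sketch of the prefilled-queue variant, under two assumptions made for illustration: the work items are plain ints, and foo is changed to take its item as a parameter:
#include <mutex>
#include <queue>

std::mutex queue_mutex;
std::queue<int> work_queue; // prefilled with all items before the threads start

void fooRunner()
{
    for(;;)
    {
        int item;
        {
            std::lock_guard<std::mutex> g(queue_mutex);
            if(work_queue.empty())
                break;                // queue was prefilled, so empty means done
            item = work_queue.front();
            work_queue.pop();
        }
        foo(item);                    // assumption: foo now accepts the work item
    }
}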
You can use std::thread (from <thread>) and locks to do what you want, but it seems to me that your code could simply be made parallel using OpenMP, like this:
#include <omp.h>

#pragma omp parallel for num_threads(k)
for (unsigned i = 0; i < N; ++i)
{
    auto t_id = omp_get_thread_num();
    if (t_id < k)
        foo();
    else
        other_foo();
}
std::mutex mutex;
std::condition_variable cv;
uint8_t size = 2;
uint8_t count = size;
uint8_t direction = -1;

const auto sync = [&size, &count, &mutex, &cv, &direction]() //.
{
    {
        std::unique_lock<std::mutex> lock(mutex);
        auto current_direction = direction;
        if (--count == 0)
        {
            count = size;
            direction *= -1;
            cv.notify_all();
        }
        else
        {
            cv.wait(lock,
                    [&direction, &current_direction]() //.
                    { return direction != current_direction; });
        }
    }
};
As provided in the first unaccepted answer to "reusable barrier":
a 'generation' must be stored inside the barrier object to prevent the next generation from manipulating the wake-up 'condition' of the current generation for a given set of threads. What I do not like about that first unaccepted answer is its ever-growing generation counter: I believe we only need to differentiate between two generations at most, i.e. a thread that has satisfied the wait condition may start another barrier synchronization call, as the second unaccepted solution suggests. That second solution, however, was somewhat complex, and I believe the snippet above would already be enough (currently implemented locally inside main, but it could be abstracted into a struct). Am I correct in my belief that a barrier can be in use by at most two generations simultaneously?
I'm currently using Boost 1.55.0, and I can't understand why this code doesn't work.
The following code is a simplification that has the same problem as my program: small runs finish, but on bigger ones the threads keep waiting forever.
#include <cmath>
#include <cstdio>
#include <boost/thread.hpp>

boost::mutex m1;
boost::mutex critical_sim;
int total = 50000;

class krig{
public:
    float dokrig(int in, float *sim, bool *aux, boost::condition_variable *hEvent){
        float simnew = 0;
        boost::mutex::scoped_lock lk(m1);
        if (in > 0)
        {
            while(!aux[in-1]){
                hEvent[in-1].wait(lk);
            }
            simnew = 1 + sim[in-1];
        }
        return simnew;
    }
};

void Simulnode( int itrd, float *sim, bool *aux, boost::condition_variable *hEvent){
    int j;
    float simnew;
    krig kriga;

    for(j=itrd; j<total; j=j+2){
        if (fmod(1000.*j,total) == 0.0){
            printf(" .progress. %f%%\n", 100.*(float)j/(float)total);
        }
        simnew = kriga.dokrig(j, sim, aux, hEvent);
        critical_sim.lock();
        sim[j] = simnew;
        critical_sim.unlock();
        aux[j] = true;
        hEvent[j].notify_one();
    }
}

int main(int argc, char* argv[])
{
    int i;
    float *sim = new float[total];
    bool *aux = new bool[total];
    for(i=0; i<total; ++i)
        aux[i] = false;
    //boost::mutex m1;
    boost::condition_variable *hEvent = new boost::condition_variable[total];

    boost::thread_group tgroup;
    for(i=0; i<2; ++i) {
        tgroup.add_thread(new boost::thread(Simulnode, i, sim, aux, hEvent));
    }
    tgroup.join_all();
    return 0;
}
Curiously, I noticed that if I place the code that is inside dokrig() inline in Simulnode(), then it seems to work. Can it be some problem with the scope of the lock?
Can anybody tell me where I am going wrong? Thanks in advance.
The problem happens in this part:
aux[j]=true;
hEvent[j].notify_one();
The first line represents a change of the condition that is being monitored by the hEvent condition variable. The second line proclaims this change to the consumer part, that is waiting for that condition to become true.
The problem is that these two steps happen without synchronization with the consumer, which can lead to the following race:
The consumer checks the condition, which is currently false. This happens in a critical section protected by the mutex m1.
A thread switch occurs. The producer changes the condition to true and notifies any waiting consumers.
Threads switch back. The consumer resumes and calls wait. However, it has already missed the notify that occurred in the previous step, so it will wait forever.
It is important to understand that the purpose of the mutex that is passed to the wait call of the condition variable is not to protect the condition variable itself, but the condition that it monitors (which in this case is the change to aux).
To avoid the data race, writing to aux and the subsequent notify have to be protected by the same mutex:
{
    boost::lock_guard<boost::mutex> lk(m1);
    aux[j] = true;
    hEvent[j].notify_one();
}
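For clarity, here is roughly how that fix slots into the question's Simulnode loop (same names as the question's code; the progress printout is omitted, and only the end of the loop body changes):
void Simulnode(int itrd, float *sim, bool *aux, boost::condition_variable *hEvent){
    krig kriga;
    for(int j = itrd; j < total; j = j + 2){
        float simnew = kriga.dokrig(j, sim, aux, hEvent);

        critical_sim.lock();
        sim[j] = simnew;
        critical_sim.unlock();

        {
            boost::lock_guard<boost::mutex> lk(m1); // same mutex the waiting consumer holds
            aux[j] = true;                          // change the monitored condition...
            hEvent[j].notify_one();                 // ...and notify while holding that mutex
        }
    }
}
Strictly speaking, the notify_one() call could also be issued after the lock is released; what matters for correctness is that the write to aux[j] happens under the same mutex the waiter holds when it checks the condition.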