C++ condition variables vs new threads for vectorization

I have a block of code that goes through a loop. A section of the code operates on a vector of data and I would like to vectorize this operation. The idea is to split the processing of the array across multiple threads, each working on a subsection of the array. I have to decide between two possibilities. The first one is to create the threads each time this section is encountered and rejoin them with the main thread at the end:
for(....)
{
//serial stuff
//create threads
for(i = 0; i < num_threads; ++i)
{
threads_vect.push_back(std::thread(f, sub_array[i]));
}
//join them
for(auto& t : threads_vect)
{
t.join();
}
//serial stuff
}
This is similar to what is done with OpenMP, but since the problem is simple I'd like to use std::thread instead of OpenMP (unless there are good reasons against this).
The second option is to create the threads beforehand to avoid the overhead of creation and destruction, and use condition variables for synchronization (I omitted a lot of stuff for the synchronization. It is just the general idea):
std::condition_variable cv_threads;
std::condition_variable cv_main;
//create threads, they will be put to sleep on cv_threads
for(....)
{
//serial stuff
//wake up threads
cv_threads.notify_all();
//sleep until the last thread finishes, that will notify.
main_thread_lock.lock();
cv_main.wait(main_lock);
//serial stuff
}
To allow for parallelism the threads will have to unlock the thread_lock as soon as they wake up at the beginning of the computation, then acquire it again to go to sleep, and synchronize among themselves to notify the main thread.
My question is which of these solutions is usually preferred in a context like this, and whether the avoided overhead of thread creation and destruction is usually worth the added complexity (or worth it at all, given that the added synchronization also takes time).
Obviously this also depends on how long the computation is for each thread, but this could vary a lot since the data vector could also be very short (down to about two elements per thread, which would lead to a computation time of about 15 milliseconds).

The biggest disadvantage of creating new threads is that thread creation and shutdown is usually quite expensive. Just think of all the things an operating system has to do to get a thread off the ground, compared to what it takes to notify a condition variable.
Note that synchronization is always required, even with thread creation. The C++11 std::thread, for instance, introduces a synchronizes-with relationship with the creating thread upon construction. So you can safely assume that thread creation will always be significantly more expensive than condition variable signalling, regardless of your implementation.
A framework like OpenMP will typically attempt to amortize these costs in some way. For instance, an OpenMP implementation is not required to shut down the worker threads after every loop and many implementations will not do this.
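A minimal sketch of the second option from the question, assuming C++11. The names WorkerPool, run_all and increment_all are hypothetical and only serve the illustration. The threads are created once, sleep on a condition variable between parallel sections, and a generation counter tells them when a new section has started.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: threads are created once and reused for every section.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t n) {
        assert(n > 0);
        for (std::size_t id = 0; id < n; ++id)
            threads_.emplace_back([this, id] { worker_loop(id); });
    }

    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_workers_.notify_all();
        for (auto& t : threads_) t.join();
    }

    // Runs job(worker_id) once on every worker and blocks until all are done.
    void run_all(const std::function<void(std::size_t)>& job) {
        std::unique_lock<std::mutex> lock(m_);
        job_ = &job;
        pending_ = threads_.size();
        ++generation_;                 // the state change the workers wait for
        cv_workers_.notify_all();
        cv_main_.wait(lock, [this] { return pending_ == 0; });
        job_ = nullptr;
    }

private:
    void worker_loop(std::size_t id) {
        unsigned long seen = 0;
        for (;;) {
            const std::function<void(std::size_t)>* job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_workers_.wait(lock, [&] { return stop_ || generation_ != seen; });
                if (stop_) return;
                seen = generation_;
                job = job_;
            }
            (*job)(id);                // the computation runs without the lock
            {
                std::lock_guard<std::mutex> lock(m_);
                if (--pending_ == 0) cv_main_.notify_one();  // last one wakes main
            }
        }
    }

    std::mutex m_;
    std::condition_variable cv_workers_;
    std::condition_variable cv_main_;
    std::vector<std::thread> threads_;
    const std::function<void(std::size_t)>* job_ = nullptr;
    std::size_t pending_ = 0;
    unsigned long generation_ = 0;
    bool stop_ = false;
};

// Demo: adds 1 to every element `rounds` times using 4 persistent workers.
std::vector<int> increment_all(std::vector<int> data, int rounds) {
    const std::size_t n = 4;
    WorkerPool pool(n);
    const std::size_t chunk = (data.size() + n - 1) / n;
    for (int r = 0; r < rounds; ++r) {
        // serial stuff would go here
        pool.run_all([&](std::size_t id) {
            const std::size_t begin = std::min(id * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i) ++data[i];
        });
    }
    return data;
}
```

Whether this beats creating threads in every iteration depends on the workload, so measuring both variants on the real data is the only reliable way to decide.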

Related

Terminate a thread from outside in C++11

I am running multiple threads in my C++11 code, and the thread body is defined using a lambda function as follows.
// make connection to each device in a separate child thread
std::vector<std::thread> workers;
for(int ii = 0; ii < numDev; ii++)
{
workers.push_back(std::thread([=]() { // pass by value
// thread body
}));
}
// detach from all threads
std::for_each(workers.begin(), workers.end(), [](std::thread &t) {
t.detach();
});
// killing one of the threads here?
I detached all the child threads but keep a reference to each in the workers vector. How can I kill one of the threads later on in my code?
A post here suggests using std::terminate(), but I guess it is of no use in my case.

First, don't use raw std::threads. They are rarely a good idea. It is like manually calling new and delete, or messing with raw buffers and length counters in io code -- bugs waiting to happen.
Second, instead of killing the thread, provide the thread task with a function or atomic variable that says when the worker should kill itself.
The worker periodically checks its "should I die" state, and if so, it cleans itself up and dies.
Then simply signal the worker to die, and wait for it to do so.
This does require work in your worker thread, and if it does some task that cannot be interrupted that lasts a long time it doesn't work. Don't do tasks that cannot be interrupted and last a long time.
If you must do such a task, do it in a different process, and marshal the results back and forth. But modern OSs tend to have async APIs you can use instead of synchronous APIs for IO tasks, which lend themselves to being aborted if you are careful.
Terminating a thread while it is in an arbitrary state places your program into an unknown and undefined state of execution. It could be holding a mutex and never let it go in a standard library call, for example. But really, it can do anything at all.
Generally detaching threads is also a bad idea, because unless you magically know they are finished (difficult because you detached them), what happens after main ends is implementation defined.
Keep track of your threads, like you keep track of your memory allocations, but more so. Use messages to tell threads to kill themselves. Join threads to clean up their resources, possibly using condition variables in a wrapper to make sure you don't join prior to the thread basically being done. Consider using std::async instead of raw threads, and wrap std::async itself up in a further abstraction.
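A minimal sketch of the "should I die" flag, assuming C++11 and a worker whose job can be split into short units. StoppableWorker and stop_demo are hypothetical names, and the 1 ms sleep merely stands in for real work.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical wrapper: owns one worker thread and its stop flag.
class StoppableWorker {
public:
    StoppableWorker() : stop_(false), iterations_(0), thread_([this] { run(); }) {}
    ~StoppableWorker() { stop(); }

    // Asks the worker to finish, then waits for it. Safe to call twice.
    void stop() {
        stop_.store(true);
        if (thread_.joinable()) thread_.join();
    }

    long iterations() const { return iterations_.load(); }

private:
    void run() {
        while (!stop_.load()) {
            // One short, interruptible unit of work goes here.
            ++iterations_;
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        // Clean-up runs on the worker's own thread before it exits.
    }

    std::atomic<bool> stop_;
    std::atomic<long> iterations_;
    std::thread thread_;   // declared last so the flags exist before it starts
};

// Demo: lets the worker run, stops it, and reports whether it stayed stopped.
bool stop_demo() {
    StoppableWorker w;
    while (w.iterations() == 0) std::this_thread::yield();  // wait until it runs
    w.stop();
    const long after_stop = w.iterations();
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    return w.iterations() == after_stop;   // no progress once joined
}
```

The thread is joined rather than detached, so the owner always knows when the worker has really finished.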

Is there a race condition in the `latch` sample in N3600?

Proposed for inclusion in C++14 (aka C++1y) are some new thread synchronization primitives: latches and barriers. The proposal is
N3600: C++ Latches and Barriers
N3666: C++ Latches and Barriers, revised
It sounds like a good idea and the samples make it look very programmer-friendly. Unfortunately, I think the sample code invokes undefined behavior. The proposal says of latch::~latch():
Destroys the latch. If the latch is destroyed while other threads are in wait(), or are invoking count_down(), the behaviour is undefined.
Note that it says "in wait()" and not "blocked in wait()", as the description of count_down() uses.
Then the following sample is provided:
An example of the second use case is shown below. We need to load data and then process it using a number of threads. Loading the data is I/O bound, whereas starting threads and creating data structures is CPU bound. By running these in parallel, throughput can be increased.
void DoWork()
{
latch start_latch(1);
vector<thread*> workers;
for (int i = 0; i < NTHREADS; ++i) {
workers.push_back(new thread([&] {
// Initialize data structures. This is CPU bound.
...
start_latch.wait();
// perform work
...
}));
}
// Load input data. This is I/O bound.
...
// Threads can now start processing
start_latch.count_down();
}
Isn't there a race condition between the threads waking and returning from wait(), and destruction of the latch when it leaves scope? Besides that, all the thread objects are leaked. If the scheduler doesn't run all worker threads before count_down returns and the start_latch object leaves scope, then I think undefined behavior will result. Presumably the fix is to iterate the vector and join() and delete all the worker threads after count_down but before returning.
Is there a problem with the sample code?
Do you agree that a proposal should show a complete correct example, even if the task is extremely simple, in order for reviewers to see what the use experience will be like?
Note: It appears possible that one or more of the worker threads haven't yet begun to wait, and will therefore call wait() on a destroyed latch.
Update: There's now a new version of the proposal, but the representative example is unchanged.

Thanks for pointing this out. Yes, I think that the sample code (which, in its defense, was intended to be concise) is broken. It should probably wait for the threads to finish.
Any implementation that allows threads to be blocked in wait() is almost certainly going to involve some kind of condition variable, and destroying the latch while a thread has not yet exited wait() is potentially undefined.
I don't know if there's time to update the paper, but I can make sure that the next version is fixed.
Alasdair
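A sketch of how the sample could be repaired along the lines suggested in the question: hold the threads by value and join them before the latch leaves scope. The Latch class here is a hypothetical stand-in written with a mutex and a condition variable, not the class from the proposal, and the nthreads parameter and return value are added only so the sketch can be exercised.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for the proposed latch (assumption: only wait/count_down are needed).
class Latch {
public:
    explicit Latch(int count) : count_(count) {}

    void count_down() {
        std::lock_guard<std::mutex> lock(m_);
        if (--count_ == 0) cv_.notify_all();
    }

    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ <= 0; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};

// Returns how many workers ran to completion.
int DoWork(int nthreads) {
    Latch start_latch(1);
    std::atomic<int> finished(0);
    std::vector<std::thread> workers;   // held by value, so nothing is leaked
    for (int i = 0; i < nthreads; ++i) {
        workers.emplace_back([&] {
            // Initialize data structures. This is CPU bound.
            start_latch.wait();
            // Perform work.
            ++finished;
        });
    }
    // Load input data. This is I/O bound.
    // Threads can now start processing.
    start_latch.count_down();
    // Join before start_latch goes out of scope, so no thread can still be
    // inside wait(), or about to call it, when the latch is destroyed.
    for (auto& t : workers) t.join();
    return finished.load();
}
```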

Multiple threads and mutexes

I am very new to Linux programming, so bear with me. I have 2 thread types that perform different operations, so I want each one to have its own mutex. Here is the code I am using. Is it good? If not, why?
static pthread_mutex_t cs_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cs_mutex2 = PTHREAD_MUTEX_INITIALIZER;
void * Thread1(void * lp)
{
int * sock = (int*)lp;
char buffer[2048]; // must be at least as large as the length passed to recv
int bytecount = recv(*sock, buffer, 2048, 0);
while (0 == 0)
{
if ((bytecount ==0) || (bytecount == -1))
{
pthread_mutex_lock(&cs_mutex);
//Some uninteresting operations which play with set 1 of global variables;
pthread_mutex_unlock(&cs_mutex);
}
}
}
void * Thread2(void * lp)
{
while (0 == 0)
{
pthread_mutex_lock(&cs_mutex2);
//Some uninteresting operations which play with some global variables;
pthread_mutex_unlock(&cs_mutex2);
}
}

Normally, a mutex is not thread related.
It ensures that a critical area is only accessed by a single thread.
So if you have some shared areas, such as the same array being processed by multiple threads, then you must ensure exclusive access to this area.
That means, you do not need a mutex for each thread. You need a mutex for the critical area.

If you only have one driver, there is no advantage to having two cars. Your Thread2 code can only make useful progress while holding cs_mutex2. So there's no point to having more than one thread running that code. Only one thread can hold the mutex at a time, and the other thread can do no useful work.
So all you'll accomplish is that occasionally the thread that doesn't hold the mutex will try to run and have to wait for the other. And occasionally the thread that does hold the mutex will try to release and re-acquire it and get pre-empted by the other.
This is a completely pointless use of threads.

I see three problems here. There's a question about your infinite loop, another about your intention in having multiple threads, and there's a future maintainability "gotcha" lurking.
First
int bytecount = recv(*sock, buffer, 2048, 0);
while (0 == 0)
Is that right? You read some stuff from a socket, and start an infinite loop without ever closing the socket? I can only assume that you do some more reading in the loop, in which case you are waiting for an external event while holding the mutex. In general that's a bad pattern that limits your concurrency. A possible pattern is to have one thread reading the data and then passing the read data to other threads which do the processing.
Next, you have two different sets of resources each protected by their own mutex. You then intend to have a set of Threads for each resource. But each thread has the pattern
take mutex
lots of processing
release mutex
tiny window (a few machine instructions)
take mutex again
lots of processing
release mutex
next tiny window
There's virtually no opportunity for two threads to work in parallel. I question whether you need multiple threads for each resource.
Last, there's a potential maintenance issue. I'm just pointing this out for future reference; I don't think you need to do anything right now. You have two functions, intended for use by two threads, but in the end they are just functions that can be called by anyone. If later maintenance results in those functions (or refactored subsets of them) calling each other, then you could get one thread doing
take mutex 1
take mutex 2
and the other
take mutex 2
take mutex 1
Bingo: deadlock.
Not an easy problem to avoid, but at the very least one can aid the maintainer by careful naming choices and refactoring.
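One way to sidestep the ordering problem in C++11 is to acquire both mutexes through std::lock, which uses a deadlock-avoidance algorithm. The following is a sketch under that assumption; mutex1, mutex2, set1, set2 and the function names are hypothetical stand-ins for the two protected sets of globals.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

std::mutex mutex1;   // protects set1
std::mutex mutex2;   // protects set2
int set1 = 0;
int set2 = 0;

// Needs both sets. std::lock acquires both mutexes without deadlocking,
// so the order of its arguments does not matter.
void move_one_from_1_to_2() {
    std::unique_lock<std::mutex> a(mutex1, std::defer_lock);
    std::unique_lock<std::mutex> b(mutex2, std::defer_lock);
    std::lock(a, b);
    --set1;
    ++set2;
}

// Same resources, named in the opposite order.
void move_one_from_2_to_1() {
    std::unique_lock<std::mutex> b(mutex2, std::defer_lock);
    std::unique_lock<std::mutex> a(mutex1, std::defer_lock);
    std::lock(b, a);
    --set2;
    ++set1;
}

// Runs both functions concurrently, `rounds` times each.
void run_both(int rounds) {
    std::thread t1([rounds] { for (int i = 0; i < rounds; ++i) move_one_from_1_to_2(); });
    std::thread t2([rounds] { for (int i = 0; i < rounds; ++i) move_one_from_2_to_1(); });
    t1.join();
    t2.join();
}
```

With two plain lock() calls in opposite orders, the same test could hang. With pthreads the usual equivalent is a documented, fixed locking order.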

I think your code is correct, however please note 2 things:
It is not exception safe. If an exception is thrown from "Some uninteresting operations", your mutex will never be unlocked -> deadlock.
You could also consider using std::mutex or boost::mutex instead of raw mutexes. For mutex locking it's better to use boost::mutex::scoped_lock (or the std:: analog, such as std::lock_guard, with a modern compiler).
void test()
{
// not synch code here
{
boost::mutex::scoped_lock lock(mutex_);
// synchronized code here
}
}
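For comparison, a sketch of the same idea with the C++11 standard library, where std::lock_guard plays the role of scoped_lock. The names counter_mutex, counter and increment_or_throw are hypothetical, and the exception is simulated.

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>

std::mutex counter_mutex;
int counter = 0;

// The guard unlocks in its destructor, on the normal path and when an
// exception propagates out of the function.
void increment_or_throw(bool fail) {
    std::lock_guard<std::mutex> lock(counter_mutex);
    if (fail) throw std::runtime_error("simulated failure");
    ++counter;
}

// Demo: a failed call must leave the mutex free for the next call.
bool lock_released_after_throw() {
    try {
        increment_or_throw(true);
    } catch (const std::runtime_error&) {
        // ignored for the purpose of the demo
    }
    increment_or_throw(false);   // needs the mutex to be free again
    return counter == 1;
}
```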

If you have 2 different sets of data and 2 different threads working on these sets -- why do you need mutexes at all? Usually, mutexes are used when you deal with some shared piece of data and you don't want two threads to deal with it simultaneously, so you lock it with mutex, do some stuff, unlock.

Cheapest way to wake up multiple waiting threads without blocking

I use boost::thread to manage threads. In my program I have a pool of threads (workers) that are sometimes activated to do some job simultaneously.
Currently I use boost::condition_variable, and all threads wait inside a boost::condition_variable::wait() call on their own condition_variable objects.
Can I AVOID using mutexes in the classic scheme when I work with condition variables? I want to wake up threads but don't need to pass any data to them, so I don't need a mutex to be locked/unlocked during the wake-up process. Why should I spend CPU on this (though yes, I should remember about spurious wakeups)?
The boost::condition_variable::wait() call tries to REACQUIRE the lock object when the CV receives the notification, but I don't need this exact facility.
What is the cheapest way to wake several threads from another thread?

If you don't reacquire the locking object, how can the threads know that they are done waiting? What will tell them that? Returning from the block tells them nothing because the blocking object is stateless. It doesn't have an "unlocked" or "not blocking" state for it to return in.
You have to pass some data to them, otherwise how will they know that before they had to wait and now they don't? A condition variable is completely stateless, so any state that you need must be maintained and passed by you.
One common pattern is to use a mutex, condition variable, and a state integer. To block, do this:
1. Acquire the mutex.
2. Copy the value of the state integer.
3. Block on the condition variable, releasing the mutex.
4. If the state integer is the same as it was when you copied it, go to step 3.
5. Release the mutex.
To unblock all threads, do this:
1. Acquire the mutex.
2. Increment the state integer.
3. Broadcast the condition variable.
4. Release the mutex.
Notice how step 4 of the locking algorithm tests whether the thread is done waiting? Notice how this code tracks whether or not there has been an unblock since the thread decided to block? You have to do that because condition variables don't do it themselves. (And that's why you need to reacquire the locking object.)
If you try to remove the state integer, your code will behave unpredictably. Sometimes you will block too long due to missed wakeups and sometimes you won't block long enough due to spurious wakeups. Only a state integer (or similar predicate) protected by the mutex tells the threads when to wait and when to stop waiting.
Also, I haven't seen how your code uses this, but it almost always folds into logic you're already using. Why did the threads block anyway? Is it because there's no work for them to do? And when they wakeup, are they going to figure out what to do? Well, finding out that there's no work for them to do and finding out what work they do need to do will require some lock since it's shared state, right? So there almost always is already a lock you're holding when you decide to block and need to reacquire when you're done waiting.
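The block/unblock steps above can be sketched with std::condition_variable, assuming C++11 in place of Boost. Gate, release_all and release_waiters are hypothetical names. The predicate passed to wait() performs the "same as when you copied it" check after every wakeup.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical "gate": wait() blocks until the next release_all().
class Gate {
public:
    void wait() {
        std::unique_lock<std::mutex> lock(m_);   // acquire the mutex
        const unsigned long seen = state_;       // copy the state integer
        // Block, releasing the mutex. The predicate is re-checked after every
        // wakeup, so spurious wakeups are ignored.
        cv_.wait(lock, [&] { return state_ != seen; });
    }                                            // the mutex is released here

    void release_all() {
        std::lock_guard<std::mutex> lock(m_);    // acquire the mutex
        ++state_;                                // increment the state integer
        cv_.notify_all();                        // broadcast
    }                                            // the mutex is released here

private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned long state_ = 0;
};

// Demo: starts n waiters, releases them, and returns how many woke up.
int release_waiters(int n) {
    Gate gate;
    std::atomic<int> woken(0);
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i)
        threads.emplace_back([&] { gate.wait(); ++woken; });
    // A waiter that arrives after a release waits for the next one, so keep
    // releasing until every waiter has gone through.
    while (woken.load() < n) {
        gate.release_all();
        std::this_thread::yield();
    }
    for (auto& t : threads) t.join();
    return woken.load();
}
```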

For controlling threads doing parallel jobs, there is a nice primitive called a barrier.
A barrier is initialized with some positive integer value N representing how many threads it holds. A barrier has only a single operation: wait. When N threads call wait, the barrier releases all of them. Additionally, one of the threads is given a special return value indicating that it is the "serial thread"; that thread will be the one to do some special job, like integrating the results of the computation from the other threads.
The limitation is that a given barrier has to know the exact number of threads. It's really suitable for parallel processing type situations.
POSIX added barriers in 2003. A web search indicates that Boost has them, too.
http://www.boost.org/doc/libs/1_33_1/doc/html/barrier.html
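A minimal sketch of such a barrier built from a mutex and a condition variable, assuming C++11. This is a hand-written illustration, not the Boost or POSIX implementation. Barrier and parallel_sum are hypothetical names, and the last thread to arrive is the one that gets the "serial thread" return value.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

class Barrier {
public:
    explicit Barrier(std::size_t count)
        : threshold_(count), waiting_(0), generation_(0) {
        assert(count > 0);
    }

    // Blocks until `count` threads have called wait(). Returns true in
    // exactly one of the released threads (the "serial thread").
    bool wait() {
        std::unique_lock<std::mutex> lock(m_);
        const unsigned long gen = generation_;
        if (++waiting_ == threshold_) {
            waiting_ = 0;          // reset so the barrier can be reused
            ++generation_;
            cv_.notify_all();
            return true;
        }
        cv_.wait(lock, [&] { return generation_ != gen; });
        return false;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    const std::size_t threshold_;
    std::size_t waiting_;
    unsigned long generation_;
};

// Demo: n threads each produce a partial result, and the serial thread
// integrates them once everybody has arrived.
int parallel_sum(int n) {
    Barrier barrier(n);
    std::vector<int> partial(n, 0);
    int total = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i)
        threads.emplace_back([&, i] {
            partial[i] = i + 1;
            if (barrier.wait())
                for (int v : partial) total += v;
        });
    for (auto& t : threads) t.join();
    return total;
}
```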

Generally speaking, you can't.
Assuming the algorithm looks something like this:
ConditionVariable cv;
void WorkerThread()
{
for (;;)
{
cv.wait();
DoWork();
}
}
void MainThread()
{
for (;;)
{
ScheduleWork();
cv.notify_all();
}
}
NOTE: I intentionally omitted any reference to mutexes in this pseudo-code. For the purposes of this example, we'll suppose ConditionVariable does not require a mutex.
The first time through MainThread(), work is queued and then it notifies WorkerThread() that it should execute its work. At this point two things can happen:
1. WorkerThread() completes DoWork() before MainThread() can complete ScheduleWork().
2. MainThread() completes ScheduleWork() before WorkerThread() can complete DoWork().
In case #1, WorkerThread() comes back around to sleep on the CV, and is awoken by the next cv.notify() and all is well.
In case #2, MainThread() comes back around and notifies... nobody and continues on. Meanwhile WorkerThread() eventually comes back around in its loop and waits on the CV but it is now one or more iterations behind MainThread().
This is known as a "lost wakeup". It is similar to the notorious "spurious wakeup" in that the two threads now have different ideas about how many notify()s have taken place. If you are expecting the two threads to maintain synchrony (and usually you are), you need some sort of shared synchronization primitive to control it. This is where the mutex comes in. It helps avoid lost wakeups which, arguably, are a more serious problem than the spurious variety. Either way, the effects can be serious.
UPDATE: For further rationale behind this design, see this comment by one of the original POSIX authors: https://groups.google.com/d/msg/comp.programming.threads/cpJxTPu3acc/Hw3sbptsY4sJ
Spurious wakeups are two things:
Write your program carefully, and make sure it works even if you
missed something.
Support efficient SMP implementations
There may be rare cases where an "absolutely, paranoiacally correct"
implementation of condition wakeup, given simultaneous wait and
signal/broadcast on different processors, would require additional
synchronization that would slow down ALL condition variable operations
while providing no benefit in 99.99999% of all calls. Is it worth the
overhead? No way!
But, really, that's an excuse because we wanted to force people to
write safe code. (Yes, that's the truth.)

boost::condition_variable::notify_*() does NOT require that the caller hold the lock on the mutex. This is a nice improvement over the Java model in that it decouples the notification of threads from the holding of the lock.
Strictly speaking, this means the following pointless code SHOULD DO what you are asking:
// Waiting thread: wait() needs a unique_lock, not a lock_guard
unique_lock lock(mutex);
// Do something
cv.wait(lock);
// Do something else
// Notifying thread
unique_lock otherLock(mutex);
//do something
otherLock.unlock();
cv.notify_one();
I do not believe you need to call otherLock.lock() first.

pthread_join - multiple threads waiting

Using POSIX threads & C++, I have an "Insert operation" which can only be done safely one at a time.
Suppose I have multiple threads waiting to insert using pthread_join, each then spawning a new thread when the current one finishes. Will they all receive the "thread complete" signal at once and spawn multiple inserts, or is it safe to assume that the thread that receives the "thread complete" signal first will spawn a new thread, blocking the others from creating new threads?
/* --- GLOBAL --- */
pthread_t insertThread;
/* --- DIFFERENT THREADS --- */
// Wait for Current insert to finish
pthread_join(insertThread, NULL);
// Done, start a new one
pthread_create(&insertThread, NULL, Insert, Data);
Thank you for the replies
The program is basically a huge hash table which takes requests from clients through Sockets.
Each new client connection spawns a new thread from which it can then perform multiple operations, specifically lookups or inserts. Lookups can be conducted in parallel, but inserts need to be "re-combined" into a single thread. You could say that lookup operations could be done without spawning a new thread for the client; however, they can take a while, causing the server to lock up and drop new requests. The design tries to minimize system calls and thread creation as much as possible.
But now that I know it's not safe the way I first thought, I should be able to cobble something together.
Thanks

From opengroup.org on pthread_join:
The results of multiple simultaneous calls to pthread_join() specifying the same target thread are undefined.
So, you really should not have several threads joining your previous insertThread.
First, as you use C++, I recommend boost.thread. It resembles the POSIX model of threads and also works on Windows. And it helps you with C++, e.g. by making function objects easier to use.
Second, why do you want to start a new thread for inserting an element when you always have to wait for the previous one to finish before you start the next one? This does not seem to be a classical use of multiple threads.
Although... One classical solution to this would be to have one worker-thread getting jobs from an event-queue, and other threads posting the operation onto the event-queue.
If you really just want to keep it more or less the way you have it now, you'd have to do this:
Create a condition variable, like insert_finished.
All the threads which want to do an insert, wait on the condition variable.
As soon as one thread is done with its insertion, it fires the condition variable.
As the condition variable requires a mutex, you can just notify all waiting threads; they will all want to start inserting, but as only one thread can acquire the mutex at a time, all threads will do the insert sequentially.
But you should take care that your synchronization is not implemented in too ad hoc a way. As this is called insert, I suspect you want to manipulate a data structure, so you probably want to implement a thread-safe data structure first, instead of sharing the synchronization between data-structure accesses and all clients. I also suspect that there will be more operations than just insert, which will need proper synchronization...

According to the Single Unix Specification: "The results of multiple simultaneous calls to pthread_join() specifying the same target thread are undefined."
The "normal way" of achieving a single thread to get the task would be to set up a condition variable (don't forget the related mutex): idle threads wait in pthread_cond_wait() (or pthread_cond_timedwait()), and when the thread doing the work has finished, it wakes up one of the idle ones with pthread_cond_signal().

Yes, as most people recommended, the best way seems to be to have a worker thread reading from a queue. Some code snippets below:
pthread_t insertThread = NULL;
pthread_mutex_t insertConditionNewMutex = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t insertConditionDoneMutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t insertConditionNew = PTHREAD_COND_INITIALIZER;
pthread_cond_t insertConditionDone = PTHREAD_COND_INITIALIZER;
//Thread for new incoming connection
void * newBatchInsert()
{
for(each Word)
{
//Push It into the queue
pthread_mutex_lock(&lexicon[newPendingWord->length - 1]->insertQueueMutex);
lexicon[newPendingWord->length - 1]->insertQueue.push(newPendingWord);
pthread_mutex_unlock(&lexicon[newPendingWord->length - 1]->insertQueueMutex);
}
//Send signal to worker Thread
pthread_mutex_lock(&insertConditionNewMutex);
pthread_cond_signal(&insertConditionNew);
pthread_mutex_unlock(&insertConditionNewMutex);
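//Wait until it's finished (pthread_cond_wait must be called with the mutex locked)
pthread_mutex_lock(&insertConditionDoneMutex);
pthread_cond_wait(&insertConditionDone, &insertConditionDoneMutex);
pthread_mutex_unlock(&insertConditionDoneMutex);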
}
//Worker thread
void * insertWorker(void *)
{
while(1)
{
pthread_mutex_lock(&insertConditionNewMutex);
pthread_cond_wait(&insertConditionNew, &insertConditionNewMutex);
pthread_mutex_unlock(&insertConditionNewMutex);
for (int ii = 0; ii < maxWordLength; ++ii)
{
while (!lexicon[ii]->insertQueue.empty())
{
queueNode * newPendingWord = lexicon[ii]->insertQueue.front();
lexicon[ii]->insert(newPendingWord->word);
pthread_mutex_lock(&lexicon[ii]->insertQueueMutex);
lexicon[ii]->insertQueue.pop();
pthread_mutex_unlock(&lexicon[ii]->insertQueueMutex);
}
}
//Send signal that it's done
pthread_mutex_lock(&insertConditionDoneMutex);
pthread_cond_broadcast(&insertConditionDone);
pthread_mutex_unlock(&insertConditionDoneMutex);
}
}
int main(int argc, char * const argv[])
{
pthread_create(&insertThread, NULL, &insertWorker, NULL);
lexiconServer = new server(serverPort, (void *) newBatchInsert);
return 0;
}

The others have already pointed out that this has undefined behaviour. I'd just add that the simplest way to accomplish your task (allowing only one thread at a time to execute part of the code) is to use a simple mutex: you need the threads executing that code to be MUTually EXclusive, and that's where the mutex gets its name :-)
If you need the code to be run in a specific thread (like Java AWT), then you need condition variables. However, you should think twice about whether this solution actually pays off. Imagine how many context switches you need if you call your "Insert operation" 10000 times per second.

As you have just mentioned that you're using a hash table with several lookups running in parallel with insertions, I'd recommend checking whether you can use a concurrent hash table.
As the exact lookup results are non-deterministic when you're inserting elements simultaneously, such a concurrent hash map may be exactly what you need. I have not used concurrent hash tables in C++ myself, but as they are available in Java, you will surely find a library that does this in C++.

The only library I found which supports inserts without locking out new lookups is Sunrise DD (and I'm not sure whether it supports concurrent inserts).
However, the switch from Google's Sparse Hash map more than doubles the memory usage. Lookups should happen fairly infrequently, so rather than trying to write my own library which combines the advantages of both, I would rather just lock the table, suspending lookups while changes are made safely.
Thanks again

It seems to me that you want to serialise inserts to the hashtable.
For this you want a lock - not spawning new threads.

From your description that looks very inefficient, as you are re-creating the insert thread every time you want to insert something. The cost of creating the thread is not 0.
A more common solution to this problem is to spawn an insert thread that waits on a queue (i.e. sits in a loop, sleeping while the queue is empty). Other threads then add work items to the queue. The insert thread picks items off the queue in the order they were added (or by priority if you want) and performs the appropriate action.
All you have to do is make sure that addition to the queue is protected so that only one thread at a time can modify the actual queue, and that the insert thread does not busy-wait but rather sleeps when nothing is in the queue (see condition variable).
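A rough sketch of that design, assuming C++11 threads rather than raw pthreads. InsertQueue, post, close and insert_from_clients are hypothetical names, and a std::vector stands in for the real hash table.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class InsertQueue {
public:
    InsertQueue() : done_(false), worker_([this] { run(); }) {}
    ~InsertQueue() { close(); }

    // Called by client threads: only the queue itself is locked.
    void post(int value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            pending_.push(value);
        }
        cv_.notify_one();
    }

    // Lets the worker drain the queue, then joins it. Safe to call twice.
    void close() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_one();
        if (worker_.joinable()) worker_.join();
    }

    // Only valid after close(): the inserted items.
    const std::vector<int>& table() const { return table_; }

private:
    void run() {
        for (;;) {
            int value;
            {
                std::unique_lock<std::mutex> lock(m_);
                // Sleeps (no busy wait) while there is nothing to do.
                cv_.wait(lock, [this] { return done_ || !pending_.empty(); });
                if (pending_.empty()) return;   // closed and fully drained
                value = pending_.front();
                pending_.pop();
            }
            table_.push_back(value);   // the "insert", done by this thread only
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> pending_;
    std::vector<int> table_;
    bool done_;
    std::thread worker_;   // declared last so the other members exist first
};

// Demo: several client threads post items, a single thread inserts them all.
std::size_t insert_from_clients(int clients, int per_client) {
    InsertQueue q;
    std::vector<std::thread> threads;
    for (int c = 0; c < clients; ++c)
        threads.emplace_back([&q, per_client, c] {
            for (int i = 0; i < per_client; ++i) q.post(c * per_client + i);
        });
    for (auto& t : threads) t.join();
    q.close();
    return q.table().size();
}
```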

Ideally, you don't want multiple thread pools in a single process, even if they perform different operations. The reusability of a thread is an important architectural consideration, which leads to pthread_join being called in the main thread if you use C.
Of course, for a C++ thread pool (aka ThreadFactory), the idea is to keep the thread primitives abstract so that it can handle any of the function/operation types passed to it.
A typical example would be a web server with connection pools and thread pools that service connections and then process them further, but all are derived from a common thread pool.
SUMMARY: AVOID PTHREAD_JOIN in any place other than the main thread.
