Suspend background thread while working in main thread? - c++

I have a GUI reading / writing some data with many entries, where writing a single entry is fast but writing all entries takes long.
Writing of all entries should begin concurrently in a background thread right after startup (some properties can only be shown once all entries are written).
The user should be able to request a single read / write on the main thread without having to wait noticeably long, i.e. the request should cause the background thread to wait after finishing its current single write.
Once the single read / write on the main thread completes, the background thread should continue where it left off before being paused.
I have a solution which is running and working as far as I can see, but this is my first concurrent C++ code and maybe "it works" isn't the best metric for correctness.
For the sake of simplified code:
I start with some raw data vector and "write" consists of processing the elements in-place.
I can ask an element in data whether it is already processed (is_processed(...)).
Here is the simplified code:
// includes ..
using namespace std; // only to make the question less verbose
class Gui {
vector<int> data;
mutex data_mtx;
condition_variable data_cv;
atomic_bool background_blocked = false;
// ...
};
Gui::Gui() {
// some init work .. like obtaining the raw data
thread background_worker([this]{fill_data();});
background_worker.detach();
}
void Gui::fill_data() { // should only do processing work while main thread does not
unique_lock data_lock(data_mtx);
background_blocked = false;
for(auto& entry : data) {
data_cv.wait(data_lock, [this]{return !background_blocked;});
if(!is_processed(entry)) process(entry);
}
}
int Gui::get_single_entry(int i) { // called by main thread - should respond immediately / pause background work
background_blocked = true;
unique_lock data_lock(data_mtx);
auto& entry = data[i];
if(!is_processed(entry)) process(entry);
const auto result = entry;
background_blocked = false;
data_lock.unlock();
data_cv.notify_one();
return result;
}
// ...
(A non-useful but illustrative example could be raw data containing only even numbers, process(..) adding 1 to the number, and is_processed(..) returning true if the number is odd. The property that can only be known after processing everything could be the number of primes in the processed data, e.g. process(..) could also increment a prime counter.)
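To see the design actually run, here is a self-contained sketch of the question's mechanism (atomic flag, one mutex, one condition variable) applied to that illustrative even-number example. finish() and prime_count() are hypothetical additions, and the worker is joined in the destructor instead of detached so it can never outlive `this`. This only demonstrates the mechanism; it is not a proof that the design is correct.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class Gui {
public:
    explicit Gui(std::vector<int> raw) : data(std::move(raw)) {
        // Start the worker only once `data` is fully initialised.
        background_worker = std::thread([this] { fill_data(); });
    }
    // Joined rather than detached, so the worker never outlives `this`.
    ~Gui() { finish(); }

    // Hypothetical helper: block until the background pass has completed.
    void finish() {
        if (background_worker.joinable()) background_worker.join();
    }

    // Main thread: asks the worker to pause after its current entry.
    int get_single_entry(std::size_t i) {
        background_blocked = true;
        std::unique_lock data_lock(data_mtx);  // acquired once the worker waits (or is done)
        int& entry = data[i];
        if (!is_processed(entry)) process(entry);
        const int result = entry;
        background_blocked = false;            // changed while holding the lock: no lost wakeup
        data_lock.unlock();
        data_cv.notify_one();                  // let the worker resume
        return result;
    }

    // Hypothetical accessor for the "only known at the end" property.
    int prime_count() {
        std::lock_guard data_lock(data_mtx);
        return primes;
    }

private:
    static bool is_processed(int v) { return v % 2 != 0; }  // odd means processed
    static bool is_prime(int v) {
        if (v < 2) return false;
        for (int d = 2; d * d <= v; ++d)
            if (v % d == 0) return false;
        return true;
    }
    // Only ever called with data_mtx held, so each entry is counted once.
    void process(int& v) {
        ++v;
        if (is_prime(v)) ++primes;
    }

    void fill_data() {
        std::unique_lock data_lock(data_mtx);
        for (int& entry : data) {
            // Releases data_mtx while the main thread has requested a pause.
            data_cv.wait(data_lock, [this] { return !background_blocked; });
            if (!is_processed(entry)) process(entry);
        }
    }

    std::vector<int> data;
    std::mutex data_mtx;
    std::condition_variable data_cv;
    std::atomic_bool background_blocked{false};
    int primes = 0;                  // guarded by data_mtx
    std::thread background_worker;   // started in the constructor body
};
```

With input {2, 4, 6, 8, 10}, every entry ends up odd and four of the results (3, 5, 7, 11) are prime.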
I think I am mostly unsure about safe reading. I can't find it right now but the gcc (which I use) doc says something like "if no thread is writing to a variable, reading of the variable from any thread is safe" - I did not see it say anything about the case where only 1 thread is writing, but other threads are reading at the same time. In the latter case, I assume not only could there be race-conditions, but a write may also be half-complete and thus a read could result in garbage?
To my understanding I need atomic for this reason, which is why I have atomic_bool background_blocked. Before asking this question, I actually just had a non-atomic bool background_blocked with the same code otherwise. It still ran and worked, but to my understanding I was lucky (or not unlucky) and this was wrong. Am I understanding this right?
I cannot set background_blocked = true inside the lock on the main thread, since the background thread holds the lock while it is running. I think that, instead of atomic, I could also use a second mutex just for the bool background_blocked? Is atomic_bool the better choice here?
Regarding the order of unlock / notify - If I read the docs right, I have to unlock before notify_one here, otherwise notify could make the background thread try to acquire the still-locked mutex, fail, and then wait for the next notify which may never come - and only then would the main thread unlock the mutex .. correct?
It is hard to be sure whether the code is correct or I am just not "unlucky" enough to get wrong results. But I think my design is correct and does what I want. Is it? I did not find a more standard / idiomatic design to solve my problem. Am I overcomplicating anything / is there a better design?

Related

Is it possible a lock wouldn't release in a while loop

I have two threads using a common semaphore to conduct some processing. What I noticed is Thread 1 appears to hog the semaphore, and thread 2 is never able to acquire it. My running theory is maybe through compiler optimization/thread priority, somehow it just keeps giving it to thread 1.
Thread 1:
while(condition) {
mySemaphore->acquire();
//do some stuff
mySemaphore->release();
}
Thread 2:
mySemaphore->acquire();
//block of code i never reach...
mySemaphore->release();
As soon as I add a delay before Thread 1's next iteration, it allows thread 2 in, which I think confirms my theory.
Basically for this to work I might need some sort of ordering aware lock. Does my reasoning make sense?
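One common form of such an "ordering aware" lock is a ticket (FIFO) lock. The C++ standard makes no fairness guarantee for std::mutex, so a thread that unlocks and immediately re-locks can win repeatedly, which is consistent with the behaviour described. Below is a minimal sketch built on std::mutex and std::condition_variable; TicketLock and ticket_lock_demo() are hypothetical names, not a standard facility.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical FIFO ("ticket") lock: threads get in strictly in the order in
// which they called lock(), so a thread that unlocks and immediately re-locks
// has to queue behind anyone who is already waiting.
class TicketLock {
public:
    void lock() {
        std::unique_lock guard(m_);
        const std::uint64_t my_ticket = next_ticket_++;
        cv_.wait(guard, [&] { return now_serving_ == my_ticket; });
    }
    void unlock() {
        {
            std::lock_guard guard(m_);
            ++now_serving_;
        }
        cv_.notify_all();  // wake all waiters; only the next ticket holder proceeds
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::uint64_t next_ticket_ = 0;
    std::uint64_t now_serving_ = 0;
};

// Usage sketch mirroring the question: thread 1 re-locks in a tight loop,
// thread 2 still gets its turn, and thread 1 stops once it has.
inline int ticket_lock_demo() {
    TicketLock lk;
    int counter = 0;                  // protected by lk
    std::atomic<bool> t2_done{false};
    std::thread t1([&] {
        while (!t2_done) { lk.lock(); ++counter; lk.unlock(); }
    });
    std::thread t2([&] { lk.lock(); ++counter; lk.unlock(); t2_done = true; });
    t2.join();
    t1.join();
    return counter;
}
```

Because TicketLock has lock() and unlock(), it can also be used with std::lock_guard<TicketLock>. The trade-off is throughput: strict FIFO order means more context switches than an unfair mutex.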

Where can we use std::barrier over std::latch?

I recently heard about these new C++ standard features:
std::latch
std::barrier
I cannot figure out in which situations each of them is applicable and more useful than the other.
If someone could give an example of how to use each of them wisely, it would be really helpful.
Very short answer
They're really aimed at quite different goals:
Barriers are useful when you have a bunch of threads and you want to synchronise across all of them at once, for example to do something that operates on all of their data at once.
Latches are useful if you have a bunch of work items and you want to know when they've all been handled, and aren't necessarily interested in which thread(s) handled them.
Much longer answer
Barriers and latches are often used when you have a pool of worker threads that do some processing and a queue of work items that is shared between them. It's not the only situation where they're used, but it is a very common one and does help illustrate the differences. Here's some example code that would set up some threads like this:
const size_t worker_count = 7; // or whatever
std::vector<std::thread> workers;
std::vector<Proc> procs(worker_count);
Queue<std::function<void(Proc&)>> queue;
for (size_t i = 0; i < worker_count; ++i) {
workers.push_back(std::thread(
[p = &procs[i], &queue]() {
while (auto fn = queue.pop_back()) {
fn(*p);
}
}
));
}
There are two types that I have assumed exist in that example:
Proc: a type specific to your application that contains data and logic necessary to process work items. A reference to one is passed to each callback function that's run in the thread pool.
Queue: a thread-safe blocking queue. There is nothing like this in the C++ standard library (somewhat surprisingly) but there are a lot of open-source libraries containing them e.g. Folly MPMCQueue or moodycamel::ConcurrentQueue, or you can build a less fancy one yourself with std::mutex, std::condition_variable and std::deque (there are many examples of how to do this if you Google for them).
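A minimal sketch of such a "less fancy" Queue, built from std::mutex, std::condition_variable and std::deque as suggested. The names push_back()/pop_back() are kept only to match the worker loop above (this version is FIFO), and close() is a hypothetical shutdown hook. pop_back() returns a default-constructed T once the queue is closed and drained; for std::function that is an empty function, which tests false and so ends the `while (auto fn = queue.pop_back())` loop.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>  // std::function, the element type used in the examples
#include <mutex>
#include <utility>

template <typename T>
class Queue {
public:
    void push_back(T item) {
        {
            std::lock_guard lock(m_);
            items_.push_back(std::move(item));
        }
        cv_.notify_one();
    }

    // Blocks until an item is available. After close(), once the queue has
    // drained, returns a default-constructed T (an empty std::function).
    T pop_back() {
        std::unique_lock lock(m_);
        cv_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return T{};
        // Despite the name (kept to match the example), this takes the oldest item.
        T item = std::move(items_.front());
        items_.pop_front();
        return item;
    }

    // Hypothetical shutdown hook: wakes every blocked pop_back().
    void close() {
        {
            std::lock_guard lock(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<T> items_;
    bool closed_ = false;
};
```

One caveat of this design is that pushing an empty std::function would also stop a worker, so producers should only push non-empty callables.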
Latch
A latch is often used to wait until some work items you push onto the queue have all finished, typically so you can inspect the result.
std::vector<WorkItem> work = get_work();
std::latch latch(work.size());
for (WorkItem& work_item : work) {
queue.push_back([&work_item, &latch](Proc& proc) {
proc.do_work(work_item);
latch.count_down();
});
}
latch.wait();
// Inspect the completed work
How this works:
The threads will - eventually - pop the work items off of the queue, possibly with multiple threads in the pool handling different work items at the same time.
As each work item is finished, latch.count_down() is called, effectively decrementing an internal counter that started at work.size().
When all work items have finished, that counter reaches zero, at which point latch.wait() returns and the producer thread knows that the work items have all been processed.
Notes:
The latch count is the number of work items that will be processed, not the number of worker threads.
The count_down() method could be called zero times, one time, or multiple times on each thread, and that number could be different for different threads. For example, even if you push 7 messages onto a queue served by 7 threads, it might be that all 7 items are processed by the same thread (rather than one by each thread), and that's fine.
Other unrelated work items could be interleaved with these ones (e.g. because they were pushed onto the queue by other producer threads) and again that's fine.
In principle, it's possible that latch.wait() won't be called until after all of the worker threads have already finished processing all of the work items. (This is the sort of odd condition you need to look out for when writing threaded code.) But that's OK, it's not a race condition: latch.wait() will just immediately return in that case.
An alternative to using a latch is that there's another queue, in addition to the one shown here, that contains the result of the work items. The thread pool callback pushes results on to that queue while the producer thread pops results off of it. Basically, it goes in the opposite direction to the queue in this code. That's a perfectly valid strategy too, in fact if anything it's more common, but there are other situations where the latch is more useful.
Barrier
A barrier is often used to make all threads wait simultaneously so that the data associated with all of the threads can be operated on simultaneously.
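// std::barrier requires a completion function that is nothrow-invocable,
// so use a noexcept lambda (std::function's call operator is not noexcept)
auto completionFn = [&procs]() noexcept {
// Do something with the whole vector of Proc objects
};
auto barrier = std::make_shared<std::barrier<decltype(completionFn)>>(worker_count, completionFn);
auto workerFn = [barrier](Proc&) {
barrier->arrive_and_wait();
};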
for (size_t i = 0; i < worker_count; ++i) {
queue.push_back(workerFn);
}
How this works:
All of the worker threads will pop one of these workerFn items off of the queue and call barrier->arrive_and_wait().
Once all of them are waiting, one of them will call completionFn() while the others continue to wait.
Once that function completes they will all return from arrive_and_wait() and be free to pop other, unrelated, work items from the queue.
Notes:
Here the barrier count is the number of worker threads.
It is guaranteed that each thread will pop precisely one workerFn off of the queue and handle it. Once a thread has popped one off of the queue, it will wait in barrier->arrive_and_wait() until all the other copies of workerFn have been popped off by other threads, so there is no chance of it popping another one off.
I used a shared pointer to the barrier so that it will be destroyed automatically once all the work items are done. This wasn't an issue with the latch because there we could just make it a local variable in the producer thread function, because it waits until the worker threads have used the latch (it calls latch.wait()). Here the producer thread doesn't wait for the barrier so we need to manage the memory in a different way.
If you did want the original producer thread to wait until the barrier has been finished, that's fine, it can call arrive_and_wait() too, but you will obviously need to pass worker_count + 1 to the barrier's constructor. (And then you wouldn't need to use a shared pointer for the barrier.)
If other work items are being pushed onto the queue at the same time, that's fine too, although it will potentially waste time as some threads will just be sitting there waiting for the barrier to be acquired while other threads are distracted by other work before they acquire the barrier.
!!! DANGER !!!
The last bullet point about other work being pushed onto the queue being "fine" is only the case if that other work doesn't also use a barrier! If you have two different producer threads putting work items with a barrier on to the same queue and those items are interleaved, then some threads will wait on one barrier and others on the other one, and neither will ever reach the required wait count - DEADLOCK. One way to avoid this is to only ever use barriers like this from a single thread, or even to only ever use one barrier in your whole program (this sounds extreme but is actually quite a common strategy, as barriers are often used for one-time initialisation on startup). Another option, if the thread queue you're using supports it, is to atomically push all work items for the barrier onto the queue at once so they're never interleaved with any other work items. (This won't work with the moodycamel queue, which supports pushing multiple items at once but doesn't guarantee that they won't be interleaved with items pushed on by other threads.)
Barrier without completion function
At the point when you asked this question, the proposed experimental API didn't support completion functions. Even the current API at least allows not using them, so I thought I should show an example of how barriers can be used like that too.
auto barrier = std::make_shared<std::barrier<>>(worker_count);
auto workerMainFn = [&procs, barrier](Proc&) {
barrier->arrive_and_wait();
// Do something with the whole vector of Proc objects
barrier->arrive_and_wait();
};
auto workerOtherFn = [barrier](Proc&) {
barrier->arrive_and_wait(); // Wait for work to start
barrier->arrive_and_wait(); // Wait for work to finish
};
queue.push_back(std::move(workerMainFn));
for (size_t i = 0; i < worker_count - 1; ++i) {
queue.push_back(workerOtherFn);
}
How this works:
The key idea is to wait for the barrier twice in each thread, and do the work in between. The first waits have the same purpose as the previous example: they ensure any earlier work items in the queue are finished before starting this work. The second waits ensure that any later items in the queue don't start until this work has finished.
Notes:
The notes are mostly the same as the previous barrier example, but here are some differences:
One difference is that, because the barrier is not tied to the specific completion function, it's more likely that you can share it between multiple uses, like we did in the latch example, avoiding the use of a shared pointer.
This example makes it look like using a barrier without a completion function is much more fiddly, but that's just because this situation isn't well suited to them. Sometimes, all you need is to reach the barrier. For example, whereas we initialised a queue before the threads started, maybe you have a queue for each thread but initialised in the threads' run functions. In that case, maybe the barrier just signifies that the queues have been initialised and are ready for other threads to pass messages to each other. In that case, you can use a barrier with no completion function without needing to wait on it twice like this.
You could actually use a latch for this, calling count_down() and then wait() in place of the barrier's arrive_and_wait(). But using a barrier makes more sense, because it communicates your intention better to future readers of the code.
In any case, the "DANGER" warning from before still applies.

Making a gather/barrier function with System V Semaphores

I'm trying to implement a gather function that waits for N processes to continue.
struct sembuf operations[2];
operations[0].sem_num = 0;
operations[0].sem_op = -1; // wait() or p()
operations[0].sem_flg = 0;
operations[1].sem_num = 0;
operations[1].sem_op = 0; // wait until it becomes 0
operations[1].sem_flg = 0;
semop ( this->id,operations,2 );
Initially, the value of the semaphore is N.
The problem is that it freezes even when all processes have executed the semop function. I think it is related to the fact that the operations are executed atomically (but I don't know exactly what it means). But I don't understand why it doesn't work.
Does the code subtract 1 from the semaphore and then block the process if it's not the last or is the code supposed to act in a different way?
It's hard to see what the code does without the whole function and algorithm.
By the looks of it, you apply 2 actions in a single atomic operation: subtract 1 from the semaphore and wait for 0.
There could be several issues if all processes freeze: the semaphore is not shared between all processes, you got the number of processes wrong when initialising the semaphore, or one process leaves the barrier, later increases the semaphore and returns to the barrier.
I suggest debugging to see that all processes are actually in the barrier, and maybe even printing each time you do any action on the semaphore (preferably on the same console).
As for what an atomic action is: it is a single operation or a sequence of operations that is guaranteed not to be interrupted while being executed. This means no other process/thread can interfere with the action.
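For semop() specifically, atomic also means all-or-nothing: the operations in one call are applied as a complete unit or not at all. With the value at N > 1, "subtract 1" and "value is 0" cannot both hold at once, so each caller blocks without decrementing, the value stays N, and everyone waits forever, which matches the freeze described. A minimal sketch, assuming Linux/glibc, that splits the two operations into separate semop() calls; make_gather_sem() and gather_demo() are hypothetical helpers, and threads stand in for processes to keep it self-contained.

```cpp
#include <sys/ipc.h>
#include <sys/sem.h>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// glibc requires the caller to define this union for semctl(SETVAL).
union semun {
    int val;
    struct semid_ds* buf;
    unsigned short* array;
};

// Hypothetical helper: create a private semaphore set initialised to n.
inline int make_gather_sem(int n) {
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id == -1) return -1;
    union semun arg;
    arg.val = n;
    if (semctl(id, 0, SETVAL, arg) == -1) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }
    return id;
}

// Each of the n participants calls this once.
inline int gather(int sem_id) {
    struct sembuf op{};
    op.sem_num = 0;
    op.sem_flg = 0;
    op.sem_op = -1;  // first call: announce arrival (does not block while value > 0)
    if (semop(sem_id, &op, 1) == -1) return -1;
    op.sem_op = 0;   // second call: block until everyone has arrived
    return semop(sem_id, &op, 1);
}

// Usage sketch: n threads all pass the gather, then the set is removed.
inline int gather_demo(int n) {
    int id = make_gather_sem(n);
    if (id == -1) return -1;
    std::atomic<int> passed{0};
    std::vector<std::thread> participants;
    for (int i = 0; i < n; ++i)
        participants.emplace_back([&] { if (gather(id) == 0) ++passed; });
    for (auto& p : participants) p.join();
    semctl(id, 0, IPC_RMID);
    return passed.load();
}
```

Note that this gather is single-use: to reuse it, the value has to be reset to N only after every participant has left the wait-for-zero step.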

Change the blocking behavior on sem_wait in pthread

I understand that when sem_wait(foo) is called, the caller enters block state if the value of foo is 0.
Instead of entering the blocked state, I want the caller to sleep for a random period of time. Here is the code I've come up with.
/* predefined a semaphore foo with initial value of 10 */
void* Queue(void *arg)
{
int bar;
int done=0;
while(done==0)
{
sem_getvalue(&foo,&bar);
if(bar>0){
sem_wait(&foo);
/* do sth */
sem_post(&foo);
done=1;
}else{ sleep(rand() % 60); }
}
pthread_exit(NULL);
}
How can I improve or is there any better solution to do this?
The code you have is racy: what if the semaphore goes to zero between the moment when you check it and the moment you do the sem_wait? You'll be in the situation you want to avoid (i.e. thread blocked on the semaphore).
You could use sem_trywait instead, which will not block if the semaphore is at zero when you call it.
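A sketch of the question's loop rewritten around sem_trywait(), which tests and decrements in one atomic step, so there is no window between the check and the wait. run_when_available() is a hypothetical name, and the random back-off is in milliseconds rather than the question's seconds, purely to keep the example quick.

```cpp
#include <semaphore.h>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <random>
#include <thread>

// Returns the number of attempts it took, or -1 on an unexpected error.
inline int run_when_available(sem_t* foo, int max_sleep_ms) {
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, max_sleep_ms);
    int attempts = 0;
    for (;;) {
        ++attempts;
        if (sem_trywait(foo) == 0) {
            /* do sth */
            sem_post(foo);
            return attempts;
        }
        // EAGAIN: the value was 0, so back off instead of blocking.
        if (errno != EAGAIN && errno != EINTR) return -1;
        std::this_thread::sleep_for(std::chrono::milliseconds(dist(rng)));
    }
}
```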
There's a reason such a call doesn't exist: there's no real point. If you're using multiple threads, and you need to do something else, use another thread. If you want to see if you can do something, use sem_trywait().
Also, the way you're using the semaphore in your example seems more suited to a mutex if you're using the code to limit the number of threads in the section to just one. And there's no real gain to limiting the number of threads in the section to any number greater than one because at that point the section has to be multithread-safe anyway.
Semaphores are more useful in a producer-consumer pattern.

concurrent message processing ordered chronologically

I want to optimize a message decoder written in C++ in terms of performance. The decoder is designed completely sequentially. The concept for the actual parallelization is kind of simple:
As soon as new data arrives on a certain socket, tell a thread-pool to run another thread that will decode the received message.
At the end of each thread, a method will be invoked (namely a Qt signal will be emitted) and an object created during processing will be passed.
My problem is: length and complexity of the processed messages vary, such that the order in which threads finish might differ from the order that the messages have been received. In other words, I need to serialize in place without the use of a threadsafe container.
How can I make sure that threads, as soon as they finish, call the method mentioned above in the correct chronological order without queueing them in a threadsafe container?
My first idea was to create as many mutexes as there are threads in the thread-pool and then use each mutex to send a "finished"-signal from an older thread to a newer thread.
Any comments appreciated!
If you really don't want to use a data structure like a priority_queue or a sequence of pre-reserved buffers and block your threads instead, you can do the following:
Pair your message with an index that indicates its original position and pass it on to the thread pool.
Use a common (e.g. global, atomic) counter variable that indicates the last processed message.
Let each thread wait until this variable indicates that the previous message has been processed.
Pass on the produced object and increase the counter
The code would look something like this:
struct MsgIndexed {
size_t idx;
Msg msg;
};
//Single thread that receives all messages sequentially
void threadReceive() {
for (size_t i = 1; true ; i++)
{
Msg m = readMsg();
dispatchMsg(MsgIndexed{i,m});
}
}
std::atomic<size_t> cnt=0;
//multiple worker threads that work in parallel
void threadWork() {
while (1) {
MsgIndexed msg = waitforMsg();
Obj obj = processMsg(msg.msg);
//Just for demonstration purposes.
//You probably don't want to use a spinlock here, but e.g. a condition variable instead
while (cnt != (msg.idx - 1u)) { std::this_thread::yield(); }
forwardObj(obj);
cnt++;
}
}
Just be aware that this is quite an inefficient solution, as your worker threads still have to wait around after they are done with their actual work.
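The code comment above suggests a condition variable instead of the spin-wait. A minimal sketch of that variant, keeping the same convention (indices start at 1, the counter holds the last forwarded index); InOrderForwarder and the `forward` callback, which stands in for forwardObj() / the Qt signal, are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class InOrderForwarder {
public:
    // A worker that finished message `idx` blocks until 1..idx-1 were forwarded.
    template <typename Obj, typename Fn>
    void forward_in_order(std::size_t idx, Obj&& obj, Fn&& forward) {
        std::unique_lock lock(m_);
        cv_.wait(lock, [&] { return last_forwarded_ == idx - 1; });
        std::forward<Fn>(forward)(std::forward<Obj>(obj));  // called under the lock, so in order
        ++last_forwarded_;
        lock.unlock();
        cv_.notify_all();  // several workers may be waiting for different indices
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t last_forwarded_ = 0;
};

// Usage sketch: workers are started in reverse order, output is still 1..n.
inline std::vector<int> forward_reverse_demo(int n) {
    InOrderForwarder fwd;
    std::vector<int> out;  // only touched inside the callback, i.e. under the lock
    std::vector<std::thread> workers;
    for (int i = n; i >= 1; --i)
        workers.emplace_back([&, i] {
            fwd.forward_in_order(static_cast<std::size_t>(i), i,
                                 [&](int v) { out.push_back(v); });
        });
    for (auto& w : workers) w.join();
    return out;
}
```

This removes the busy-waiting, but the underlying cost the answer mentions remains: a finished worker still sits idle until its predecessors have been forwarded.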