Multi-threading work around query - c++

I have an app which has a main thread that talks to a remote server over tcp and obtains some data. Before establishing a tcp connection, the main thread spawns a new thread that among other things prints the data obtained by the main thread.
Here's some pseudo-code on how the parallel thread does that -
while (1)
{
if(dataObtained)
{
printDataFromMainThread(remoteData);
exit;
}
}
In main thread:
recvOverTCPSocket();
saveData(remoteData);
dataObtained = true;
I was initially concerned that this is not thread-safe and when we run this app, there are seemingly no issues. I still have my concerns though.
It is clear that the parallel thread will read data only when the dataObtained flag is true...which happens only after all the data is saved. this ensures there is no chance of the parallel thread reading data while it is still being updated by the main thread.
However, I still am not sure if this is the right approach and whether I am missing any scenario where this is likely to fail.
Any inputs/comments. Does the above code snippet help us avoid using a semaphore while updating the data?

From your code, it looks like your parallel thread never blocks; that means that while your program is running, your parallel thread will probably be using up all of the available CPU cycles on one of the computer's cores. That's generally something you want to avoid, as it's very inefficient (e.g. it will tend to slow down other programs that are running on the computer, and on battery-powered devices it will unnecessarily deplete the battery).
You can avoid that e.g. by having the parallel thread block on a condition variable which the main thread would signal when the data is ready... but then the question becomes, why are you using multiple threads at all? It doesn't appear that your program is getting any benefit from having two threads, since only one of them is doing useful work at any given time anyway. Why not just have the same thread do both the TCP data download and then the printing of the downloaded data? That would give you the sequential behavior you are looking for, and be much simpler, more efficient, and less error-prone.

I would make boolean flag an atomic, but otherwise it should work
std::atomic<bool> dataObtained(false);
...
// in working thread
while (!dataObtained) { std::this_thread::yield(); }
// receive and print data

Related

Mutex is defying the very idea of threads: parallel processing [duplicate]

When I have a block of code like this:
mutex mtx;
void hello(){
mtx.lock();
for(int i = 0; i < 10; i++){
cout << "hello";
}
mtx.unlock();
}
void hi(){
mtx.lock();
for(int i = 0; i < 10; i++){
cout << "hi";
}
mtx.unlock();
}
int main(){
thread x(hello);
thread y(hi);
x.join();
y.join();
}
What is the difference between just calling `hello()` and `hi()`? (Like so)
...
int main(){
hello();
hi();
}
Are threads more efficient? The purpose of thread is to run at the same time, right?
Can someone explain why we use mutexes within thread functions? Thank you!
The purpose of thread is to run at the same time, right?
Yes, threads are used to perform multiple tasks in parallel, especially on different CPUs.
Can someone explain why we use mutexes within thread functions?
To serialize multiple threads with each other, such as when they are accessing a shared resource that is not safe to access concurrently and needs to be protected.
Are threads more efficient?
No. But see final note (below).
On a single core, threads are much, much less efficient (than function/method calls).
As one example, on my Ubuntu 15.10(64), using g++ v5.2.1,
a) a context switch (from one thread to the other) enforced by use of std::mutex takes about 12,000 nanoseconds
b) but invoking 2 simple methods, for instance std::mutex lock() & unlock(), this takes < 50 nanoseconds. 3 orders of magnitude! So context switch vx function call is no contest.
The purpose of thread is to run at the same time, right?
Yes ... but this can not happen on a single core processor.
And on a multi-core system, context switch time can still dominate.
For example, my Ubuntu system is dual core. The measurement of context switch time I reported above uses a chain of 10 threads, where each thread simply waits for its input semaphore to be unlock()'d. When a thread's input semaphore is unlocked, the thread gets to run ... but the brief thread activity is simply 1) increment a count and check a flag, and 2) unlock() the next thread, and 3) lock() its own input mutex, i.e. wait again for the previous task signal. In that test, the thread we known as main starts the thread-sequencing with unlock() of one of the threads, and stops it with a flag that all threads can see.
During this measurement activity (about 3 seconds), Linux system monitor shows both cores are involved, and reports both cores at abut 60% utilization. I expected both cores at 100% .. don't know why they are not.
Can someone explain why we use mutexes within thread functions? Thank
you!
I suppose the most conventional use of std::mutex's is to serialize access to a memory structure (perhaps a shared-access storage or structure). If your application has data accessible by multiple threads, each write access must be serialized to prevent race conditions from corrupting the data. Sometimes, both read and write access needs to be serialized. (See dining philosophers problem.)
In your code, as an example (although I do not know what system you are using), it is possible that std::cout (a shared structure) will 'interleave' text. That is, a thread context switch might happen in the middle of printing a "hello", or even a 'hi'. This behaviour is usually undesired, but might be acceptable.
A number of years ago, I worked with vxWorks and my team learned to use mutex's on access to std::cout to eliminate that interleaving. Such behavior can be distracting, and generally, customers do not like it. (ultimately, for that app, we did away with the use of the std trio-io (cout, cerr, cin))
Devices, of various kinds, also might not function properly if you allow more than 1 thread to attempt operations on them 'simultaneously'. For example, I have written software for a device that required 50 us or more to complete its reaction to my software's 'poke', before any additional action to the device should be applied. The device simply ignored my codes actions without the wait.
You should also know that there are techniques that do not involve semaphores, but instead use a thread and an IPC to provide serialized (i.e. protected) resource access.
From wikipedia, "In concurrent programming, a monitor is a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become true."
When the os provides a suitable IPC, I prefer to use a Hoare monitor. In my interpretation, the monitor is simply a thread that accepts commands over the IPC, and is the only thread to access the shared structure or device. When only 1 thread accesses a structure, NO mutex is needed. All other threads must send a message (via IPC) to request (or perhaps command) another structure change. The monitor thread handles one request at a time, sequentially out of the IPC.
Definition: collision
In the context of "thread context switch' and 'mutex semaphores', a 'collision' occurs when a thread must block-and-wait for access to a resource, because that resource is already 'in use' (i.e. 'occupied'). This is a forced context switch. See also the term "critical section".
When the shared resource is NOT currently in use, no collision. The lock() and unlock() cost almost nothing (by comparison to context switch).
When there is a collision, the context switch slows things down by a 'bunch'. But this 'bunch' might still be acceptable ... consider when 'bunch' is small compared to the duration of the activity inside the critical section.
Final note ... With this new idea of 'collision':
a) Multiple threads can be far less efficient in the face of many collisions.
For unexpected example, the function 'new' accesses a thread-shared resource we can call "dynamic memory". In one experience, each thread generated 1000's of new's at start up. One thread could complete that effort in 0.5 seconds. Four threads, started quickly back-to-back, took 40 seconds to complete the 4 start ups. Context switches!
b) Multiple threads can be more efficient, when you have multiple cores and no / or few collisions. Essentially, if the threads seldom interact, they can run (mostly) simultaneously.
Thread efficiency can be any where between a or b, when multiple cores and collisions.
For instance, my ram based "log" mechanisms seems to work well - one mutex access per log entry. Generally, I intentionally used minimal logging. And when debugging a 'discovered' challenge, I added additional logging (maybe later removed) to determine what was going wrong. Generally, the debugger is better than a general logging technique. But sometimes, adding several log entries worked well.
Threads have at least two advantages over purely serial code.
Convenience in separating logically independent sequences of instructions. This is true even on a single core machine. This gives you logical concurrency without necessarily parallelism.
Having multiple threads allows either the operating system or a user-level threading library to multiplex multiple logical threads over a smaller number of CPU cores, without the application developer having to worry about other threads and processes.
Taking advantage of multiple cores / processors. Threads allow you to scale your execution to the number of CPU cores you have, enabling parallelism.
Your example is a little contrived because the entire thread's execution is locked. Normally, threads perform many actions independently and only take a mutex when accessing a shared resource.
More specifically, under your scenario you would not gain any performance. However, if your entire thread was not under a mutex, then you could potentially gain efficiency. I say potentially because there are overheads to running multiple threads which may offset any efficiency gain you obtain.
Threads theoretically run simultaneously, it means that threads could write to the same memory block at the same time. For example, if you have a global var int i;, and two threads try to write different values at same time, which one value remains in i?
Mutex forces synchronous access to memory, inside a mutex block (mutex.lock & mutex.unlock) you warrant synchronous memory access and avoid memory corruption.
When you call mtx.lock(), JUST ONE THREAD KEEPS RUNNING, and any other thread calling the same mtx.lock() stops, waiting for mtx.unlock call.

How to use the work_stealing scheduler in boost.fibers

I am trying to build a generic task system where I can post tasks that get executed on whatever thread is free. With previous attempt I often ran out of threads because they would block at some point. So I am trying boost fibers; when one fiber blocks the thread is free to work on some other fiber, sounds perfect.
The work-stealing algorithm seems to be ideal for my purpose, but I have a very hard time to use it. In the example code fibers get created and only then the threads and schedulers get created, so all the fibers actually get executed on all the threads. But I want to start fibers later and by then all the other threads are suspended indefinitely because they didn't have any work. I have not found any way to wake them up again, all my fibers get only executed on the main thread. "notify" seems to be the method to call, but I don't see any way to actually get to an instance of an algorithm.
I tried keeping pointers to all instances of the algorithm so I could call notify(), but that doesn't really help; most of the time the algorithms in the worker threads cannot steal anything from the main one because the next one is the dispatcher_context.
I could disable "suspend", but threads are busy-waiting then, not an option.
I also tried the shared_work-algorithm. Same problem, once a thread cannot find a fiber it will never wake up again. I tried the same hack manually calling notify(), same result, very unreliable.
I tried using the channels, but AFAICT, if a fiber is waiting for it, the current context just "hops" over and runs the waiting fiber, suspending the current one.
In short: I find it very hard to reliably run a fiber on another thread. When profiling most threads are just waiting on a condition_variable, even though I did create tons of fibers.
As a small testing case I am trying:
std::vector<boost::fibers::future<int>> v;
for (auto i = 0; i < 16; ++i)
v.emplace_back(boost::fibers::async([i] {
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
return i;
}));
int s = 0;
for (auto &f : v)
s += f.get();
I am intentionally using this_thread::sleep_for to simulate the CPU being busy.
With 16 threads I would expect this code to run in 1s, but mostly it ends up being 16s. I was able to get this specific example to actually run in 1s just hacking around stuff; but no way felt "right" and no way did work for other scenarios, it always had to be hand-crafted to one specific scenario.
I think this example should just work as expected with a work_stealing algorithm; what am I missing? Is it just a misuse of fibers? How could I implement this reliably?
Thanks,
Dix
boost.fiber contains an example using the work_stealing algorithm (examples/work_stealing.cpp).
You have to install the algorithm on each worker-thread that should handle/steal fibers.
boost::fibers::use_scheduling_algorithm< boost::fibers::algo::work_stealing >( 4); // 4 worker-threads
Before you process tasks/fibers, you have to wait till all worker-threads have been registered at the algotithm. The example uses a barrier for this purpose.
You need an idication that all work/task has been procesed, for isntance using a condition-variable.
Take a look at Running with worker threads (boost documentation).

Mutex vs. standard function call

When I have a block of code like this:
mutex mtx;
void hello(){
mtx.lock();
for(int i = 0; i < 10; i++){
cout << "hello";
}
mtx.unlock();
}
void hi(){
mtx.lock();
for(int i = 0; i < 10; i++){
cout << "hi";
}
mtx.unlock();
}
int main(){
thread x(hello);
thread y(hi);
x.join();
y.join();
}
What is the difference between just calling `hello()` and `hi()`? (Like so)
...
int main(){
hello();
hi();
}
Are threads more efficient? The purpose of thread is to run at the same time, right?
Can someone explain why we use mutexes within thread functions? Thank you!
The purpose of thread is to run at the same time, right?
Yes, threads are used to perform multiple tasks in parallel, especially on different CPUs.
Can someone explain why we use mutexes within thread functions?
To serialize multiple threads with each other, such as when they are accessing a shared resource that is not safe to access concurrently and needs to be protected.
Are threads more efficient?
No. But see final note (below).
On a single core, threads are much, much less efficient (than function/method calls).
As one example, on my Ubuntu 15.10(64), using g++ v5.2.1,
a) a context switch (from one thread to the other) enforced by use of std::mutex takes about 12,000 nanoseconds
b) but invoking 2 simple methods, for instance std::mutex lock() & unlock(), this takes < 50 nanoseconds. 3 orders of magnitude! So context switch vx function call is no contest.
The purpose of thread is to run at the same time, right?
Yes ... but this can not happen on a single core processor.
And on a multi-core system, context switch time can still dominate.
For example, my Ubuntu system is dual core. The measurement of context switch time I reported above uses a chain of 10 threads, where each thread simply waits for its input semaphore to be unlock()'d. When a thread's input semaphore is unlocked, the thread gets to run ... but the brief thread activity is simply 1) increment a count and check a flag, and 2) unlock() the next thread, and 3) lock() its own input mutex, i.e. wait again for the previous task signal. In that test, the thread we known as main starts the thread-sequencing with unlock() of one of the threads, and stops it with a flag that all threads can see.
During this measurement activity (about 3 seconds), Linux system monitor shows both cores are involved, and reports both cores at abut 60% utilization. I expected both cores at 100% .. don't know why they are not.
Can someone explain why we use mutexes within thread functions? Thank
you!
I suppose the most conventional use of std::mutex's is to serialize access to a memory structure (perhaps a shared-access storage or structure). If your application has data accessible by multiple threads, each write access must be serialized to prevent race conditions from corrupting the data. Sometimes, both read and write access needs to be serialized. (See dining philosophers problem.)
In your code, as an example (although I do not know what system you are using), it is possible that std::cout (a shared structure) will 'interleave' text. That is, a thread context switch might happen in the middle of printing a "hello", or even a 'hi'. This behaviour is usually undesired, but might be acceptable.
A number of years ago, I worked with vxWorks and my team learned to use mutex's on access to std::cout to eliminate that interleaving. Such behavior can be distracting, and generally, customers do not like it. (ultimately, for that app, we did away with the use of the std trio-io (cout, cerr, cin))
Devices, of various kinds, also might not function properly if you allow more than 1 thread to attempt operations on them 'simultaneously'. For example, I have written software for a device that required 50 us or more to complete its reaction to my software's 'poke', before any additional action to the device should be applied. The device simply ignored my codes actions without the wait.
You should also know that there are techniques that do not involve semaphores, but instead use a thread and an IPC to provide serialized (i.e. protected) resource access.
From wikipedia, "In concurrent programming, a monitor is a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become true."
When the os provides a suitable IPC, I prefer to use a Hoare monitor. In my interpretation, the monitor is simply a thread that accepts commands over the IPC, and is the only thread to access the shared structure or device. When only 1 thread accesses a structure, NO mutex is needed. All other threads must send a message (via IPC) to request (or perhaps command) another structure change. The monitor thread handles one request at a time, sequentially out of the IPC.
Definition: collision
In the context of "thread context switch' and 'mutex semaphores', a 'collision' occurs when a thread must block-and-wait for access to a resource, because that resource is already 'in use' (i.e. 'occupied'). This is a forced context switch. See also the term "critical section".
When the shared resource is NOT currently in use, no collision. The lock() and unlock() cost almost nothing (by comparison to context switch).
When there is a collision, the context switch slows things down by a 'bunch'. But this 'bunch' might still be acceptable ... consider when 'bunch' is small compared to the duration of the activity inside the critical section.
Final note ... With this new idea of 'collision':
a) Multiple threads can be far less efficient in the face of many collisions.
For unexpected example, the function 'new' accesses a thread-shared resource we can call "dynamic memory". In one experience, each thread generated 1000's of new's at start up. One thread could complete that effort in 0.5 seconds. Four threads, started quickly back-to-back, took 40 seconds to complete the 4 start ups. Context switches!
b) Multiple threads can be more efficient, when you have multiple cores and no / or few collisions. Essentially, if the threads seldom interact, they can run (mostly) simultaneously.
Thread efficiency can be any where between a or b, when multiple cores and collisions.
For instance, my ram based "log" mechanisms seems to work well - one mutex access per log entry. Generally, I intentionally used minimal logging. And when debugging a 'discovered' challenge, I added additional logging (maybe later removed) to determine what was going wrong. Generally, the debugger is better than a general logging technique. But sometimes, adding several log entries worked well.
Threads have at least two advantages over purely serial code.
Convenience in separating logically independent sequences of instructions. This is true even on a single core machine. This gives you logical concurrency without necessarily parallelism.
Having multiple threads allows either the operating system or a user-level threading library to multiplex multiple logical threads over a smaller number of CPU cores, without the application developer having to worry about other threads and processes.
Taking advantage of multiple cores / processors. Threads allow you to scale your execution to the number of CPU cores you have, enabling parallelism.
Your example is a little contrived because the entire thread's execution is locked. Normally, threads perform many actions independently and only take a mutex when accessing a shared resource.
More specifically, under your scenario you would not gain any performance. However, if your entire thread was not under a mutex, then you could potentially gain efficiency. I say potentially because there are overheads to running multiple threads which may offset any efficiency gain you obtain.
Threads theoretically run simultaneously, it means that threads could write to the same memory block at the same time. For example, if you have a global var int i;, and two threads try to write different values at same time, which one value remains in i?
Mutex forces synchronous access to memory, inside a mutex block (mutex.lock & mutex.unlock) you warrant synchronous memory access and avoid memory corruption.
When you call mtx.lock(), JUST ONE THREAD KEEPS RUNNING, and any other thread calling the same mtx.lock() stops, waiting for mtx.unlock call.

Thread or timer to read sensor data out?

My Linux C++ application is periodically reading sensor data. Readout is done by simple file I/O operation (OS is writing to file, application is reading from this file).
Some information about my platform:
I have single core processor with hyper-threading
sensor data update frequency is 1 second
application GUI runs in main thread and shouldn't be blocked
I considered two approaches for sensor data read out:
timer running in main application thread
separate thread with infinite loop which does sensor data readout and then sleeps
Which approach makes more sens, are there any other alternatives ? What are the costs of both solution (e.g. blocking of main thread in first or context switching in second approach) ?
I don't know anything about your application or the hardware, but here are a few things to consider:
If you use a thread, you will have to create a communication channel of some sort to tell the main thread that data has been updated. Usually this would be a pipe(), as signals are inherently unreliable and condition locks don't work with I/O multiplexing (i.e. select()/poll()).
Can you get the entire set of data without blocking? If so, then just reading it in the main thread is probably easier. However, if your read can block you'll probably need some more "keep track of my read state to incorporate it into my central select()", whereas a thread can just block until more data is available.
Thus, neither solution is automatically "easier" to do.
I wouldn't worry about "context switching" for a read that only occurs once per second; that's irrelevant.
What else does the main thread have to do? Is it ok if it blocks? If so, then you dont need to do the timer, etc in a separate thread.
If the main thread cant block waiting for the periodic timer, then a separate thread must be created. The communication of data between the threads can be via an object that is accessible to both threads and protected via a mutex (look up pthread_mutex_t), which is quite simple to do.
As for which solution would be better and what are the costs, it depends on what else the main thread is doing. But for something this simple, either way should be about the same, and the context switching shouldnt affect anything. What should affect performance the most is how performance intensive the reads are.
I believe that cost of the context switch once a second is not an issue even for single-core CPU without hyper-threading especially taking to the account that the application is running in user space, thus is not really time-critical. The polling of your sensor in the main thread complicates the logic of the application. So, I would recommend you to start a thread for that purpose.
A sleep-loop will skew the timing because each iteration is going to take longer than 1sec. Timers don't have that problem, and they are made for this scenario. So choose a timer.
Performance-wise there is no difference because you are only triggering once a second.
If the Linux driver is reading a sensor data and writing it to a device file every second, you shouldn't duplicate the timer logic in your application. It may happen that after 1 second sleep your application will still read the same data as 1 second ago. A better approach would be to have a thread that would call a blocking read on a device file. When new sensor data is available, blocking read returns, the thread can process the data and call read again.

How can I improve my real-time behavior in multi-threaded app using pthreads and condition variables?

I have a multi-threaded application that is using pthreads. I have a mutex() lock and condition variables(). There are two threads, one thread is producing data for the second thread, a worker, which is trying to process the produced data in a real time fashion such that one chuck is processed as close to the elapsing of a fixed time period as possible.
This works pretty well, however, occasionally when the producer thread releases the condition upon which the worker is waiting, a delay of up to almost a whole second is seen before the worker thread gets control and executes again.
I know this because right before the producer releases the condition upon which the worker is waiting, it does a chuck of processing for the worker if it is time to process another chuck, then immediately upon receiving the condition in the worker thread, it also does a chuck of processing if it is time to process another chuck.
In this later case, I am seeing that I am late processing the chuck many times. I'd like to eliminate this lost efficiency and do what I can to keep the chucks ticking away as close to possible to the desired frequency.
Is there anything I can do to reduce the delay between the release condition from the producer and the detection that that condition is released such that the worker resumes processing? For example, would it help for the producer to call something to force itself to be context switched out?
Bottom line is the worker has to wait each time it asks the producer to create work for itself so that the producer can muck with the worker's data structures before telling the worker it is ready to run in parallel again. This period of exclusive access by the producer is meant to be short, but during this period, I am also checking for real-time work to be done by the producer on behalf of the worker while the producer has exclusive access. Somehow my hand off back to running in parallel again results in significant delay occasionally that I would like to avoid. Please suggest how this might be best accomplished.
I could suggest the following pattern. Generally the same technique could be used, e.g. when prebuffering frames in some real-time renderers or something like that.
First, it's obvious that approach that you describe in your message would only be effective if both of your threads are loaded equally (or almost equally) all the time. If not, multi-threading would actually benefit in your situation.
Now, let's think about a thread pattern that would be optimal for your problem. Assume we have a yielding and a processing thread. First of them prepares chunks of data to process, the second makes processing and stores the processing result somewhere (not actually important).
The effective way to make these threads work together is the proper yielding mechanism. Your yielding thread should simply add data to some shared buffer and shouldn't actually care about what would happen with that data. And, well, your buffer could be implemented as a simple FIFO queue. This means that your yielding thread should prepare data to process and make a PUSH call to your queue:
X = PREPARE_DATA()
BUFFER.LOCK()
BUFFER.PUSH(X)
BUFFER.UNLOCK()
Now, the processing thread. It's behaviour should be described this way (you should probably add some artificial delay like SLEEP(X) between calls to EMPTY)
IF !EMPTY(BUFFER) PROCESS(BUFFER.TOP)
The important moment here is what should your processing thread do with processed data. The obvious approach means making a POP call after the data is processed, but you will probably want to come with some better idea. Anyway, in my variant this would look like
// After data is processed
BUFFER.LOCK()
BUFFER.POP()
BUFFER.UNLOCK()
Note that locking operations in yielding and processing threads shouldn't actually impact your performance because they are only called once per chunk of data.
Now, the interesting part. As I wrote at the beginning, this approach would only be effective if threads act somewhat the same in terms of CPU / Resource usage. There is a way to make these threading solution effective even if this condition is not constantly true and matters on some other runtime conditions.
This way means creating another thread that is called controller thread. This thread would merely compare the time that each thread uses to process one chunk of data and balance the thread priorities accordingly. Actually, we don't have to "compare the time", the controller thread could simply work the way like:
IF BUFFER.SIZE() > T
DECREASE_PRIORITY(YIELDING_THREAD)
INCREASE_PRIORITY(PROCESSING_THREAD)
Of course, you could implement some better heuristics here but the approach with controller thread should be clear.