I read this in a paper...
Consequently, our tool only checkpoints a thread when it is executing
at a known safe point: kernel entry, kernel exit, or certain
interruptible sleeps in the kernel that we have determined to be safe.
The thread that initiates a multithreaded fork creates a barrier on
which it waits until all other threads reach a safe point. Once all
threads reach the barrier, the original thread creates the checkpoint,
then lets the other threads continue execution.
Now my question is: can anyone guess what kind of barrier the authors are talking about? How does a thread create a barrier and insert it dynamically into other threads as well? Any working example will be highly appreciated.
EDITED
Please don't say "use pthread_barrier_wait", because that is not the question. Here the authors apparently have a thread that inserts barriers into other threads dynamically. I want to know how.
The paper you're asking about appears to be "Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism". The paper mentions:
We modified the Linux kernel to implement our techniques.
and
We therefore created a new Linux primitive, called a multithreaded fork, that creates a child process with the same number of threads as its parent.
So when the paper says that
Respec only checkpoints a thread when it is executing at a known safe point: kernel entry, kernel exit, or certain interruptible sleeps in the kernel that we have determined to be safe. The thread that initiates a multithreaded fork creates a barrier on which it waits until all other threads reach a safe point. Once all threads reach the barrier, the original thread creates the checkpoint, then lets the other threads continue execution.
I'd assume that among the modifications they made to the Linux kernel was logic so that threads in the process being logged 'enter' the barrier when they reach one of those "safe points" (and, I'd also assume, only if a 'multithreaded fork' has been issued to create the barrier). Since this is occurring in the kernel, it would be easy enough to implement a barrier - there's not really anything dynamic going on. The modified kernel has the barriers implemented at those strategic safe points.
I haven't really read the paper (just skimmed a few bits). It's not entirely clear to me what might happen if one or more threads are performing work that doesn't require entering the kernel for a long period of time - the system appears to depend on the threads getting to those explicit safe points. So threads shouldn't dawdle in a CPU-intensive loop for too long (which is probably not an issue for the vast majority of programs):
Note that the actual execution time of an epoch may be longer than the epoch interval due to our barrier implementation; a checkpoint cannot be taken until all threads reach the barrier.
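Just to illustrate the shape of the idea (this is emphatically not the authors' kernel code - all the names below are made up, and the real thing lives inside the kernel's syscall/scheduler paths), a user-space sketch of "a barrier that threads only enter at safe points" could look roughly like this:

#include <atomic>
#include <condition_variable>
#include <mutex>

// Hypothetical names; a user-space sketch of the idea only, NOT the Respec kernel code.
struct SafePointBarrier {
    std::mutex m;
    std::condition_variable cv;
    std::atomic<bool> checkpoint_requested{false};
    int waiting = 0;
    int total_threads = 0;

    // Called by the thread that initiates the multithreaded fork.
    void request_checkpoint(int nthreads) {
        std::unique_lock<std::mutex> lk(m);
        total_threads = nthreads;
        waiting = 1;                      // the initiator counts itself
        checkpoint_requested = true;
        cv.wait(lk, [&]{ return waiting == total_threads; });
        // ... all threads are now parked at safe points: take the checkpoint here ...
        checkpoint_requested = false;     // release everyone
        waiting = 0;
        cv.notify_all();
    }

    // Called by every other thread each time it passes a "safe point"
    // (in the paper: kernel entry, kernel exit, certain interruptible sleeps).
    void safe_point() {
        if (!checkpoint_requested.load()) return;   // fast path: no checkpoint pending
        std::unique_lock<std::mutex> lk(m);
        ++waiting;
        cv.notify_all();                            // maybe we were the last one to arrive
        cv.wait(lk, [&]{ return !checkpoint_requested; });
    }
};

The "dynamic insertion" is then an illusion: every thread already passes through the safe-point check as part of its normal execution, and the initiating thread merely flips the flag that makes that check block.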
Well considering that your question is tagged with linux and pthreads, I can only imagine that it's referring to pthread barriers:
pthread_barrier_init
pthread_barrier_wait
pthread_barrier_destroy
Here's an example:
#include <pthread.h>
#include <stdio.h>

pthread_barrier_t bar;
pthread_t th;

void* function(void* arg)
{
    (void)arg;                            /* unused */
    printf("Second thread before the barrier\n");
    pthread_barrier_wait(&bar);           /* blocks until both threads have arrived */
    printf("Second thread after the barrier\n");
    return NULL;
}

int main()
{
    printf("Main thread is beginning\n");
    pthread_barrier_init(&bar, NULL, 2);  /* barrier for 2 threads */
    pthread_create(&th, NULL, function, NULL);
    pthread_barrier_wait(&bar);           /* blocks until both threads have arrived */
    printf("Main thread has passed the barrier\n");
    pthread_join(th, NULL);               /* note: pthread_join takes the pthread_t by value */
    pthread_barrier_destroy(&bar);
    return 0;
}
A barrier is a fairly standard synchronization primitive.
In basic terms, upon entering a barrier each thread is blocked until all relevant threads have reached the barrier, and then all are released.
I know you're asking about C/C++, but take a look at Java's CyclicBarrier as the concept is explained pretty well there.
Since you're asking about pthreads, take a look at pthread_barrier_init et al.
edit
But in this case, a thread seemingly dynamically inserts barriers in
the other threads. How?
It is hard to answer this without some kind of context (e.g. the paper that you're reading).
The excerpt that you quote gives an impression that this is a description of some low-level tool, that either inserts hooks that get executed on certain events (probably in the context of the threads in question), or indeed operates in kernel mode. Either way, it's little wonder it can do what it says it can.
It doesn't seem to me that anyone is talking about a user thread dynamically inserting barriers into another user thread.
Hope I'm not too far off in my guessing of the context.
Simple: use the pthread_barrier_wait pthread API call.
See the man page for the details: http://linux.die.net/man/3/pthread_barrier_wait
An OS thread barrier is nothing more than some state in memory. If you can share that state among threads (by properly initializing the threads) then the threads can use that barrier.
Essentially the main thread does:
CreateAllThreads(&barrier);
StartAllThreads();
EnterBarrier(&barrier);
All other threads do:
RuntimeInitialize();
EnterBarrier(&barrier);
The above is only a very rough pseudocode for illustrative purposes only.
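To make that pseudocode concrete: a barrier really is just shared state. Here is a minimal sketch of one built from nothing but a mutex, a condition variable and a couple of counters (roughly what pthread_barrier_wait gives you for free) - illustrative, not production code:

#include <condition_variable>
#include <mutex>

class Barrier {
public:
    explicit Barrier(int count) : threshold_(count), remaining_(count), generation_(0) {}

    // Blocks until `count` threads have called wait(), then releases them all.
    void wait() {
        std::unique_lock<std::mutex> lk(m_);
        int gen = generation_;
        if (--remaining_ == 0) {
            // last thread to arrive: start a new generation and wake everyone
            ++generation_;
            remaining_ = threshold_;
            cv_.notify_all();
        } else {
            cv_.wait(lk, [&]{ return gen != generation_; });
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int threshold_;
    int remaining_;
    int generation_;
};

Sharing a single Barrier object among the threads (for example by passing its address when the threads are created) is exactly the "share that state among threads" step above.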
Related
When I have a block of code like this:
#include <iostream>
#include <mutex>
#include <thread>
using namespace std;

mutex mtx;

void hello(){
    mtx.lock();
    for(int i = 0; i < 10; i++){
        cout << "hello";
    }
    mtx.unlock();
}

void hi(){
    mtx.lock();
    for(int i = 0; i < 10; i++){
        cout << "hi";
    }
    mtx.unlock();
}

int main(){
    thread x(hello);
    thread y(hi);
    x.join();
    y.join();
}
What is the difference between just calling `hello()` and `hi()`? (Like so)
...
int main(){
hello();
hi();
}
Are threads more efficient? The purpose of thread is to run at the same time, right?
Can someone explain why we use mutexes within thread functions? Thank you!
The purpose of thread is to run at the same time, right?
Yes, threads are used to perform multiple tasks in parallel, especially on different CPUs.
Can someone explain why we use mutexes within thread functions?
To serialize multiple threads with each other, such as when they are accessing a shared resource that is not safe to access concurrently and needs to be protected.
Are threads more efficient?
No. But see final note (below).
On a single core, threads are much, much less efficient (than function/method calls).
As one example, on my Ubuntu 15.10(64), using g++ v5.2.1,
a) a context switch (from one thread to the other) enforced by use of std::mutex takes about 12,000 nanoseconds
b) but invoking 2 simple methods, for instance std::mutex lock() & unlock(), takes < 50 nanoseconds. About 3 orders of magnitude! So context switch vs. function call is no contest.
The purpose of thread is to run at the same time, right?
Yes ... but this cannot happen on a single-core processor.
And on a multi-core system, context switch time can still dominate.
For example, my Ubuntu system is dual core. The measurement of context switch time I reported above uses a chain of 10 threads, where each thread simply waits for its input semaphore to be unlock()'d. When a thread's input semaphore is unlocked, the thread gets to run ... but the brief thread activity is simply 1) increment a count and check a flag, 2) unlock() the next thread, and 3) lock() its own input mutex, i.e. wait again for the previous task's signal. In that test, the thread we know as main starts the thread sequencing with an unlock() of one of the threads, and stops it with a flag that all threads can see.
During this measurement activity (about 3 seconds), the Linux system monitor shows both cores are involved, and reports both cores at about 60% utilization. I expected both cores at 100% ... I don't know why they are not.
Can someone explain why we use mutexes within thread functions? Thank you!
I suppose the most conventional use of a std::mutex is to serialize access to a memory structure (perhaps shared-access storage or a shared structure). If your application has data accessible by multiple threads, each write access must be serialized to prevent race conditions from corrupting the data. Sometimes both read and write access need to be serialized. (See the dining philosophers problem.)
In your code, as an example (although I do not know what system you are using), it is possible that std::cout (a shared structure) will 'interleave' text. That is, a thread context switch might happen in the middle of printing a "hello", or even a 'hi'. This behaviour is usually undesired, but might be acceptable.
A number of years ago, I worked with vxWorks and my team learned to use mutexes on access to std::cout to eliminate that interleaving. Such behavior can be distracting, and generally, customers do not like it. (Ultimately, for that app, we did away with the use of the std trio-io (cout, cerr, cin).)
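For illustration, a minimal sketch of that "mutex around std::cout" idea using standard C++ (the mutex name is of course made up):

#include <iostream>
#include <mutex>
#include <thread>

std::mutex cout_mtx;   // protects std::cout

void say(const char* word) {
    for (int i = 0; i < 10; i++) {
        std::lock_guard<std::mutex> guard(cout_mtx);  // released automatically at end of scope
        std::cout << word << '\n';                    // whole line printed without interleaving
    }
}

int main() {
    std::thread a(say, "hello");
    std::thread b(say, "hi");
    a.join();
    b.join();
}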
Devices of various kinds also might not function properly if you allow more than one thread to attempt operations on them 'simultaneously'. For example, I have written software for a device that required 50 us or more to complete its reaction to my software's 'poke' before any additional action should be applied to the device. The device simply ignored my code's actions without the wait.
You should also know that there are techniques that do not involve semaphores, but instead use a thread and an IPC to provide serialized (i.e. protected) resource access.
From wikipedia, "In concurrent programming, a monitor is a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become true."
When the OS provides a suitable IPC, I prefer to use a Hoare monitor. In my interpretation, the monitor is simply a thread that accepts commands over the IPC and is the only thread to access the shared structure or device. When only one thread accesses a structure, NO mutex is needed. All other threads must send a message (via IPC) to request (or perhaps command) a structure change. The monitor thread handles one request at a time, sequentially, off the IPC.
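Here's a rough sketch of that monitor idea in portable C++, where the "IPC" is just a small thread-safe command queue (that's an assumption for illustration - a real system might use pipes or OS message queues instead):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// The "IPC": a tiny thread-safe queue of commands for the monitor thread.
class CommandQueue {
public:
    void send(std::function<void()> cmd) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(cmd)); }
        cv_.notify_one();
    }
    std::function<void()> receive() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&]{ return !q_.empty(); });
        auto cmd = std::move(q_.front());
        q_.pop();
        return cmd;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
};

int sharedStructure = 0;   // only ever touched by the monitor thread, so no mutex around it

int main() {
    CommandQueue ipc;
    bool done = false;

    // The monitor: the only thread that accesses sharedStructure.
    std::thread monitor([&]{
        while (!done) ipc.receive()();   // handle one request at a time, sequentially
    });

    // Other threads request changes by sending commands instead of touching the data.
    std::thread worker([&]{ ipc.send([]{ sharedStructure += 1; }); });

    worker.join();
    ipc.send([&]{ done = true; });       // ask the monitor to shut down
    monitor.join();
}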
Definition: collision
In the context of 'thread context switch' and 'mutex semaphores', a 'collision' occurs when a thread must block and wait for access to a resource because that resource is already 'in use' (i.e. 'occupied'). This is a forced context switch. See also the term "critical section".
When the shared resource is NOT currently in use, no collision. The lock() and unlock() cost almost nothing (by comparison to context switch).
When there is a collision, the context switch slows things down by a 'bunch'. But this 'bunch' might still be acceptable ... consider when 'bunch' is small compared to the duration of the activity inside the critical section.
Final note ... With this new idea of 'collision':
a) Multiple threads can be far less efficient in the face of many collisions.
As an unexpected example, operator 'new' accesses a thread-shared resource we can call "dynamic memory". In one experience, each thread generated thousands of new's at startup. One thread could complete that effort in 0.5 seconds. Four threads, started quickly back-to-back, took 40 seconds to complete the 4 startups. Context switches!
b) Multiple threads can be more efficient when you have multiple cores and no or few collisions. Essentially, if the threads seldom interact, they can run (mostly) simultaneously.
Thread efficiency can be anywhere between a and b when you have multiple cores and some collisions.
For instance, my RAM-based "log" mechanism seems to work well - one mutex access per log entry. Generally, I intentionally used minimal logging. And when debugging a 'discovered' challenge, I added additional logging (maybe later removed) to determine what was going wrong. Generally, the debugger is better than a general logging technique. But sometimes, adding several log entries worked well.
Threads have at least two advantages over purely serial code.
Convenience in separating logically independent sequences of instructions. This is true even on a single core machine. This gives you logical concurrency without necessarily parallelism.
Having multiple threads allows either the operating system or a user-level threading library to multiplex multiple logical threads over a smaller number of CPU cores, without the application developer having to worry about other threads and processes.
Taking advantage of multiple cores / processors. Threads allow you to scale your execution to the number of CPU cores you have, enabling parallelism.
Your example is a little contrived because the entire thread's execution is locked. Normally, threads perform many actions independently and only take a mutex when accessing a shared resource.
More specifically, under your scenario you would not gain any performance. However, if your entire thread was not under a mutex, then you could potentially gain efficiency. I say potentially because there are overheads to running multiple threads which may offset any efficiency gain you obtain.
Threads theoretically run simultaneously, which means they could write to the same memory block at the same time. For example, if you have a global variable int i; and two threads try to write different values at the same time, which value remains in i?
A mutex forces synchronized access to memory: inside a mutex block (between mtx.lock() and mtx.unlock()) you guarantee serialized memory access and avoid memory corruption.
When you call mtx.lock(), only one thread proceeds; any other thread calling lock() on the same mutex stops, waiting for the mtx.unlock() call.
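For illustration, a tiny sketch of exactly that situation - a shared counter updated by two threads, where the std::lock_guard inside the loop is what prevents lost updates (remove it and the final count is usually wrong):

#include <iostream>
#include <mutex>
#include <thread>

int counter = 0;           // shared state
std::mutex counter_mtx;    // protects counter

void add_many() {
    for (int n = 0; n < 100000; n++) {
        std::lock_guard<std::mutex> lk(counter_mtx);  // without this, updates can be lost
        ++counter;
    }
}

int main() {
    std::thread a(add_many), b(add_many);
    a.join();
    b.join();
    std::cout << counter << '\n';   // always 200000 with the mutex; often less without it
}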
I'm unfortunately out of a job and have been interviewing around lately. I faced this same question twice now, and was lost both times I was asked this question.
"How do you code a mutex"?
Conceptually I understand a mutex locks a certain part of code so multiple threads cannot enter the critical section at the same time, eliminating data races. The first time I was asked to conceptually describe how I would code it; the second time I was asked to code it. I've been googling and haven't found any answers... can anyone help?
Thanks.
There are lots of ways to implement a mutex lock, but it typically starts with the basic premise that the CPU architecture offers some concept of atomic add and atomic subtract. That is, an addition operation can be done on an integer variable in memory (and return the result) without being corrupted by another thread attempting to access the same memory location. Or at the very least, "atomic increment" and "atomic decrement".
On modern Intel chips, for example, there's an instruction called XADD. When combined with the LOCK prefix it executes atomically and invalidates cached values across other cores. gcc implements a wrapper for this instruction called __sync_add_and_fetch. Win32 implements a similar function called InterlockedIncrement. Both are just calling LOCK XADD under the hood. Other CPU architectures should offer something similar.
So the most basic mutex lock could be implemented something like this. This is often called a "spin" lock. And this cheap version offers no ability to recursively enter the lock.
// A correct, but poorly performing, mutex implementation (a "spin" lock).
// LOCK_XADD stands in for an atomic add-and-fetch; with gcc it could be
// mapped to the __sync_add_and_fetch intrinsic mentioned above, e.g.:
//   #define LOCK_XADD(ptr, val) __sync_add_and_fetch((ptr), (val))
void EnterLock(int* lock)
{
    while (true)
    {
        int result = LOCK_XADD(lock, 1); // atomically increment the value in lock and return the result
        if (result == 1)
        {
            // the value in lock was successfully incremented from 0 to 1 by
            // this thread - it means this thread "acquired" the lock
            return;
        }
        LOCK_XADD(lock, -1); // we didn't get the lock - atomically decrement it back to what it was
        sleep(0);            // give the thread quantum back before trying again
    }
}

void LeaveLock(int* lock)
{
    LOCK_XADD(lock, -1); // release the lock. Assumes we successfully acquired it with EnterLock above
}
The above suffers from the poor performance of "spinning" and doesn't guarantee any fairness. A higher-priority thread could continue to win the EnterLock battle over a lower-priority thread. And the programmer could make a mistake and call LeaveLock from a thread that did not previously call EnterLock. You could expand the above to operate on a data structure that not only includes the lock integer, but also has record keeping for the owner thread id and a recursion count.
The second concept for implementing a mutex is that the operating system can offer a wait-and-notify service, so that a waiting thread doesn't have to spin until the owner thread releases the lock. The thread or process waiting on the lock can register itself with the OS to be put to sleep until the owner thread has released it. In OS terms, this is called a semaphore. Additionally, an OS-level semaphore can also be used to implement locks across different processes, and for the cases where the CPU doesn't offer an atomic add. And it can be used to guarantee fairness between multiple threads trying to acquire the lock.
Most implementations will try spinning for multiple attempts before falling back to making a system call.
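As a rough sketch of that "spin a bit, then give up the CPU" pattern using portable C++11 atomics (the spin count is an arbitrary choice, and a real mutex would block on a futex or OS wait object rather than just yielding):

#include <atomic>
#include <thread>

class HybridLock {
public:
    void lock() {
        // 1) optimistic spinning: cheap if the lock is free or released quickly
        for (int spins = 0; spins < 4000; ++spins) {
            if (!flag_.exchange(true, std::memory_order_acquire))
                return;                       // acquired without involving the OS
        }
        // 2) fall back to yielding to the scheduler instead of burning CPU
        //    (a real mutex would block on a futex / OS wait object here)
        while (flag_.exchange(true, std::memory_order_acquire))
            std::this_thread::yield();
    }
    void unlock() {
        flag_.store(false, std::memory_order_release);
    }
private:
    std::atomic<bool> flag_{false};
};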
I wouldn't say that this is a stupid question - at any level of abstraction appropriate for the position. At a high level you just say that you use the standard library, or any threading library. If you apply for a position as a compiler developer, you need to understand how it actually works and what is needed for the implementation.
To implement a mutex, you need a locking mechanism; that is, you need a resource that can be marked as taken across all threads. This is not trivial. Remember that two cores share memory, but they have their own caches, so this piece of information must be guaranteed to be up to date. You do need hardware support to ensure atomicity.
If you look at clang's implementation, they offload (at least in one case) the implementation to pthreads; see the typedefs in its threading support header:
#if defined(_LIBCPP_HAS_THREAD_API_PTHREAD)
# include <pthread.h>
# include <sched.h>
#elif defined(_LIBCPP_HAS_THREAD_API_WIN32)
#include <Windows.h>
#include <process.h>
#include <fibersapi.h>
#endif
And if you dig through the pthreads repo, you can find asm implementations of the interlocking operations. They rely on the lock asm prefix, which makes the operations atomic, i.e. no other thread can execute them at the same time. This eliminates race conditions and guarantees coherency.
Based on this, you can build a lock, which you can use for a mutex implementation.
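For example, a minimal test-and-set spinlock on top of the gcc/clang __sync builtins (which compile down to lock-prefixed instructions such as lock xchg on x86) might look like this - a sketch of the building block only, with no fairness, blocking or recursion:

typedef volatile int spinlock_t;

void spin_lock(spinlock_t* l) {
    // __sync_lock_test_and_set atomically writes 1 and returns the old value.
    while (__sync_lock_test_and_set(l, 1)) {
        // lock was already taken; spin until the holder releases it
        while (*l) { /* busy wait */ }
    }
}

void spin_unlock(spinlock_t* l) {
    __sync_lock_release(l);   // atomically writes 0 with release semantics
}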
Suppose I have a multi-threaded program in C++11, in which each thread controls the behavior of something displayed to the user.
I want to ensure that for every time period T during which one of the threads of the given program has run, each thread gets a chance to execute for at least time t, so that the display looks as if all threads are executing simultaneously. The idea is to have a mechanism for round-robin scheduling with time sharing based on some information stored in the thread, forcing a thread to wait after its time slice is over, instead of relying on the operating system scheduler.
Preferably, I would also like to ensure that each thread is scheduled in real time.
In case there is no way other than relying on the operating system, is there any solution for Linux?
Is it possible to do this? How?
No, that's not possible in a cross-platform way with C++11 threads. How often and how long a thread runs isn't up to the application; it's up to the operating system you're using.
However, there are still functions with which you can tell the OS that a particular thread/process is really important, and in that way influence the scheduling for your purposes.
You can acquire the platform dependent thread handle to use OS functions.
native_handle_type std::thread::native_handle //(since C++11)
Returns the implementation defined underlying thread handle.
To stress it again: this requires an implementation that is different for each platform!
Microsoft Windows
According to the Microsoft documentation:
SetThreadPriority function
Sets the priority value for the specified thread. This value, together
with the priority class of the thread's process determines the
thread's base priority level.
Linux/Unix
For Linux things are more difficult because there are different systems for how threads can be scheduled. Under Microsoft Windows threads are scheduled using a priority system, but on Linux that doesn't seem to be the default scheduling.
For more information, please take a look at this Stack Overflow question (it should be the same for std::thread because of this).
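As a sketch of how the native handle gets combined with the platform API - here the Linux/pthreads side, requesting SCHED_RR (round-robin real-time) for a thread. This needs appropriate privileges, and the priority value is just an example; on Windows you would pass the handle to SetThreadPriority instead:

#include <pthread.h>
#include <sched.h>
#include <thread>
#include <iostream>

int main()
{
    std::thread worker([]{ /* ... the thread's work ... */ });

    // Ask for round-robin real-time scheduling for this one thread.
    // Requires CAP_SYS_NICE / root; otherwise pthread_setschedparam returns an error.
    sched_param sch{};
    sch.sched_priority = 20;   // example value within sched_get_priority_min/max(SCHED_RR)
    if (int rc = pthread_setschedparam(worker.native_handle(), SCHED_RR, &sch))
        std::cerr << "pthread_setschedparam failed: " << rc << "\n";

    worker.join();
}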
I want to ensure that for every time period T during which one of the threads of the given program have run, each thread gets a chance to execute for at least time t, so that the display looks as if all threads are executing simultaneously.
You are using threads to make it seem as though different tasks are executing simultaneously. That is not recommended for the reasons stated in Arthur's answer, to which I really can't add anything.
If, instead of having long-lived threads each doing its own task, you can split the work into tasks that can be executed without mutual exclusion, you can have a single queue of tasks and a thread pool dequeuing and executing them.
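A minimal sketch of that task-queue-plus-pool pattern (names are illustrative; a production pool needs more careful shutdown and exception handling):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TaskPool {
public:
    explicit TaskPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this]{ run(); });
    }
    ~TaskPool() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&]{ return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();   // execute outside the lock
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool stop_ = false;
};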
If you cannot split the work up that way, you might want to look into wait-free data structures and algorithms. In a wait-free algorithm/data structure, every thread is guaranteed to complete its work in a finite (and even specified) number of steps. I can recommend the book The Art of Multiprocessor Programming, where this topic is discussed at length. The gist of it is: every lock-free algorithm/data structure can be modified to be wait-free by adding communication between threads, over which a thread that's about to do work makes sure that no other thread is starved/stalled. Basically, prefer fairness over total throughput of all threads. In my experience this is usually not a good compromise.
I am wondering whether it is possible to set the processor affinity of a thread obtained from a thread pool. More specifically the thread is obtained through the use of TimerQueue API which I use to implement periodic tasks.
As a side note: I found TimerQueues the easiest way to implement periodic tasks, but since these are usually long-living tasks, might it be more appropriate to use dedicated threads for this purpose? Furthermore, it is anticipated that synchronization primitives such as semaphores and mutexes will need to be used to synchronize the various periodic tasks. Are the pooled threads suitable for these?
Thanks!
EDIT1: As Leo has pointed out, the above question is actually two only loosely related ones. The first one is related to the processor affinity of pooled threads. The second question is related to whether pooled threads obtained from the TimerQueue API behave just like manually created threads when it comes to synchronization objects. I will move this second question to a separate topic.
If you do this, make sure you return things to how they were every time you release a thread back to the pool, since you don't own those threads and other code which uses them may have other requirements/assumptions.
Are you sure you actually need to do this, though? It's very, very rare to need to set processor affinity. (I don't think I've ever needed to do it in anything I've written.)
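If you really do need it, the save/restore pattern looks roughly like this (Win32; SetThreadAffinityMask returns the previous mask, which is what lets you put things back - the callback body here is a made-up placeholder):

#include <windows.h>

// Made-up placeholder for whatever your TimerQueue callback actually does.
static void timer_callback_body() { /* periodic work */ }

VOID CALLBACK TimerCallback(PVOID /*param*/, BOOLEAN /*timerOrWaitFired*/)
{
    // Pin the pooled thread to CPU 0 for the duration of this callback only.
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);

    timer_callback_body();

    if (previous != 0)   // 0 means the earlier call failed, so there is nothing to restore
        SetThreadAffinityMask(GetCurrentThread(), previous);   // restore before returning to the pool
}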
Thread affinity can mean two quite different things. (Thanks to bk1e's comment to my original answer for pointing this out. I hadn't realised myself.)
What I would call processor affinity: where a thread needs to be run consistently on the same processor. This is what SetThreadAffinityMask deals with, and it's very rare for code to care about it. (Usually it's due to very low-level issues like CPU caching in high-performance code. Usually the OS will do its best to keep threads on the same CPU, and it's usually counterproductive to force it to do otherwise.)
What I would call thread affinity: Where objects use thread-local storage (or some other state tied to the thread they're accessed from) and will go wrong if a sequence of actions is not done on the same thread.
From your question it sounds like you may be confusing #1 with #2. The thread itself will not change while your callback is running. While a thread is running it may jump between CPUs but that is normal and not something you have to worry about (except in very special cases).
Mutexes, semaphores, etc. do not care if a thread jumps between CPUs.
If your callback is executed by the thread pool multiple times, there is (depending on how the pool is used) usually no guarantee that the same thread will be used each time. i.e. Your callback may jump between threads, but not while it is in the middle of running; it may only change threads each time it runs again.
Some synchronization objects will care if your callback code runs on one thread and then, still thinking it holds locks on those objects, runs again on a different thread. (The first thread will still hold the locks, not the second one, although it depends on which kind of synchronization object you use. Some don't care.) That isn't #1, though; that's #2, and not something you'd use SetThreadAffinityMask to deal with.
As an example, Mutexes (CreateMutex) are owned by a thread. If you acquire a mutex on Thread A then any other thread which tries to acquire the mutex will block until you release the mutex on Thread A. (It is also an error for a thread to release a mutex it does not own.) So if your callback acquired a mutex, then exited, then ran again on another thread and released the mutex from there, it would be wrong.
On the other hand, an Event (CreateEvent) does not care which threads create, signal or destroy it. You can signal an event on one thread and then reset it on another and that's fine (normal, in fact).
It'd also be rare to hold a synchronization object between two separate runs of your callback (that would invite deadlocks, although there are certainly situations where you could legitimately want/do such a thing). However, if you created (for example) an apartment-threaded COM object then that would be something you would want to only access from one specific thread.
You shouldn't. You're only supposed to use that thread for the job at hand, on the processor it's running on at that point. Apart from the obvious inefficiency, the threadpool might destroy every thread as soon as you're done, and create a new one for your next job. The affinity masks wouldn't disappear that soon in practice, but it's even harder to debug if they disappear at random.
I have a program with a main thread and a diagnostics thread. The main thread is basically a while(1) loop that performs various tasks. One of these tasks is to provide a diagnostics engine with information about the system and then check back later (i.e. in the next loop) to see if there are any problems that should be dealt with. An iteration of the main loop should take no longer than 0.1 seconds. If all is well, then the diagnostic engine takes almost no time to come back with an answer. However, if there is a problem, the diagnostic engine can take seconds to isolate the problem. For this reason each time the diagnostic engine receives new information it spins up a new diagnostics thread.
The problem we're having is that the diagnostics thread is stealing time away from the main thread. Effectively, even though we have two threads, the main thread is not able to run as often as I would like because the diagnostic thread is still spinning.
Using Boost threads, is it possible to limit the amount of time that a thread can run before moving on to another thread? Also of importance here is that the diagnostic algorithm we are using is blackbox, so we can't put any threading code inside of it. Thanks!
If you run multiple threads they will indeed consume CPU time. If you only have a single processor, and one thread is doing processor intensive work then that thread will slow down the work done on other threads. If you use OS-specific facilities to change the thread priority then you can make the diagnostic thread have a lower priority than the main thread. Also, you mention that the diagnostic thread is "spinning". Do you mean it literally has the equivalent of a spin-wait like this:
while(!check_done()) ; // loop until done
If so, I would strongly suggest that you try and avoid such a busy-wait, as it will consume CPU time without achieving anything.
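If it is, the usual replacement for that kind of busy-wait is a condition variable - a rough sketch with Boost (names made up):

#include <boost/thread.hpp>

boost::mutex done_mutex;
boost::condition_variable done_cond;
bool done = false;

// Diagnostic thread calls this when it finishes.
void signal_done()
{
    boost::lock_guard<boost::mutex> lk(done_mutex);
    done = true;
    done_cond.notify_one();
}

// Main thread calls this instead of `while(!check_done());`
void wait_until_done()
{
    boost::unique_lock<boost::mutex> lk(done_mutex);
    while (!done)
        done_cond.wait(lk);   // sleeps instead of burning CPU
}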
However, though multiple threads can cause each other to slow-down, if you are seeing an actual delay of several seconds this would suggest there is another problem, and that the main thread is actually waiting for the diagnostic thread to complete. Check that the call to join() for the diagnostic thread is outside the main loop.
Another possibility is that the diagnostic thread is locking a mutex needed by the main thread loop. Check which mutexes are locked and where.
To really help, I'd need to see some code.
It looks like your threads are interlocked, so your main thread waits until the background thread finishes its work. Check any multithreading synchronization that could cause this.
To check that it's nothing related to OS scheduling, run your program on a dual-core system, so both threads can really execute in parallel.
From the way you've worded your question, it appears that you're not quite sure how threads work. I assume by "the amount of time that a thread can run before moving on to another thread" you mean the CPU time spent per thread before the scheduler switches to another. Such switches happen many times per second.
Boost.Thread does not have support for thread priorities, although your OS-specific thread API will. However, your problem seems to indicate the necessity for a fundamental redesign -- or at least heavy profiling to find bottlenecks.
You can't do this generally at the OS level, so I doubt boost has anything specific for limiting execution time. You can kinda fake it with small-block operations and waits, but it's not clean.
I would suggest looking into processor affinity, either at a thread or process level (this will be OS-specific). If you can isolate your diagnostic processing to a limited subset of [logical] processors on a multi-core machine, it will give you a very coarse mechanism to control the maximum execution share relative to the main process. That's the best solution I have found when trying to do a similar kind of thing.
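On Linux, for example, confining the diagnostic thread to a single core could look roughly like this (pthread_setaffinity_np is non-portable, hence the _np suffix; the thread and function names here are made up):

// cpu_set_t / pthread_setaffinity_np are glibc extensions; compile with -D_GNU_SOURCE if needed.
#include <pthread.h>
#include <boost/thread.hpp>

// Made-up stand-in for the blackbox diagnostic engine.
static void run_diagnostics() { /* long-running analysis */ }

int main()
{
    boost::thread diag(run_diagnostics);

    // Confine the diagnostic thread to CPU 1, leaving the other core(s) for the main loop.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    pthread_setaffinity_np(diag.native_handle(), sizeof(set), &set);

    // ... main while(1) loop would run here ...

    diag.join();
}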
Hope that helps.