I heard that there is something known as "atomic action" which is faster then using a mutex with critical section.
Does somebody know what is it, and how do I use it?
An atomic operation is an operation where the CPU reads and writes memory during the same bus access, this prevents other CPUs or system devices from modifying the memory simultaneously. E.g. a "test and set" operation, which could do "read memory at location X, if it is 0 set it to 1, return an indicator telling whether the value was set" without any chance of simultaneous access.
Wikipedia's http://en.wikipedia.org/wiki/Linearizability describes atomic operations.
If you're on windows, have a look at e.g. InterlockedTestExchange or InterlockedIncrement which are wrappers for atomic operations.
EDIT: Sample usage
A spinlock could be implemented using a test-and-set atomic operation:
while (test_and_set(&x) == 1) ;
This will keep looping until the current thread is the one that sets x to 1. If all other threads treat x in the same way its effect is the same as a mutex.
Atomic action only refers to the fact that an action will be done atomically uninterrupted by a co-running thread/processes.
What you are probably looking for are atomic built-ins in compilers. For example GCC provides this set: http://gcc.gnu.org/onlinedocs/gcc-4.5.1/gcc/Atomic-Builtins.html
These are usually implemented very efficiently using CPU support.
It is a tradeoff. As other posters have stated, an atomic operation is "try to grab this flag and return true on success". It is fast, but there are downsides.
A proper mutex blocks the threads that need to get into the critical section. With only atomic operations the waiting threads have to loop until they get the flag - this wastes CPU cycles. Another downside is that mutexes guarantee fair access - usually by just queueing the waiting processes in a FIFO queue. With spinlocks, there's a risk of resource starvation.
So, the bare atomic operations are faster but only when there's not too many threads trying to grab the critical section.
Since a GCC-specific answer was given, here's the VC++-specific link -- all of the intrinsics listed that begin with _Interlocked are relevant: http://msdn.microsoft.com/en-us/library/hd9bdb82.aspx. Also note that there are more intrinsics available for x64 than for x86: http://msdn.microsoft.com/en-us/library/azcs88h2.aspx.
Critical section implementation (shurely in windows) uses atomic variable to detect is a critical section (cs) captured by another thread or not, and enters kernel-side synchronization primitive only if real collision occurs. So if you need to protect a small piece of code and collision probability is small enough, critical section is a good solution.
However, if the protected code does nothing except of incrementing/decrementing or testing and modifying a single variable then it's the right case to use atomic operation.
There is also possible to protect by atomics more complicated code (google "lock free structures" and "transactional memory" for more info)
It is interesting but very complicated stuff and may not be recommended if some simple solution (like critical section) also works.
Related
I have a class that has a state (a simple enum) and that is accessed from two threads. For changing state I use a mutex (boost::mutex). Is it safe to check the state (e.g. compare state_ == ESTABLISHED) or do I have to use the mutex in this case too? In other words do I need the mutex when I just want to read a variable which could be concurrently written by another thread?
It depends.
The C++ language says nothing about threads or atomicity.
But on most modern CPU's, reading an integer is an atomic operation, which means that you will always read a consistent value, even without a mutex.
However, without a mutex, or some other form of synchronization, the compiler and CPU are free to reorder reads and writes, so anything more complex, anything involving accessing multiple variables, is still unsafe in the general case.
Assuming the writer thread updates some data, and then sets an integer flag to inform other threads that data is available, this could be reordered so the flag is set before updating the data. Unless you use a mutex or another form of memory barrier.
So if you want correct behavior, you don't need a mutex as such, and it's no problem if another thread writes to the variable while you're reading it. It'll be atomic unless you're working on a very unusual CPU. But you do need a memory barrier of some kind to prevent reordering in the compiler or CPU.
You have two threads, they exchange information, yes you need a mutex and you probably also need a conditional wait.
In your example (compare state_ == ESTABLISHED) indicates that thread #2 is waiting for thread #1 to initiate a connection/state. Without a mutex or conditionals/events, thread #2 has to poll the status continously.
Threads is used to increase performance (or improve responsiveness), polling usually results in decreased performance, either by consuming a lot of CPU or by introducing latencey due to the poll interval.
Yes. If thread a reads a variable while thread b is writing to it, you can read an undefined value. The read and write operation are not atomic, especially on a multi-processor system.
Generally speaking you don't, if your variable is declared with "volatile". And ONLY if it is a single variable - otherwise you should be really careful about possible races.
actually, there is no reason to lock access to the object for reading. you only want to lock it while writing to it. this is exactly what a reader-writer lock is. it doesn't lock the object as long as there are no write operations. it improves performance and prevents deadlocks. see the following links for more elaborate explanations :
wikipedia
codeproject
The access to the enum ( read or write) should be guarded.
Another thing:
If the thread contention is less and the threads belong to same process then Critical section would be better than mutex.
I was wondering what is the better choice: it's assumed there is a trivially copyable object, let's say a queue data structure, that is used by several threads to pop/push data. The object provides only methods put/push, that can't be accessed by more than one thread the same time. Obviously if put is called, push can't be called neither.
Would you suggest to wrap the model into atomic type (if possible), or rather use mutexes?
Regards!
Atomic is hardware thing, whereas mutex is OS thing. Mutex will end up by suspending the task, even though in some cases mutex will behave as a spinlock for a short period of time aka "optimistic spin", see https://lore.kernel.org/all/56C2673F.6070202#hpe.com/T/
So, if you have small operations like incrementing a variable, aka "atomic", without waiting for other things which might take longer, then atomic is for you.
If you want to (indefinitely) wait for some things to happen in other threads, polling for results via atomics, aka spinlock, might be a waste of CPU cycles therefore less cooperative, so it's better to use a mutex/condition variable which would suspend the task at a price of context switch latency.
Atomic is preferable for those kinds of cases. The atomic is a kind of operation supported by the CPU specifically whereas the other kinds of thread control tend to be implemented by the OS or other measures and incur more overhead.
EDIT: A quick search shows up this which has more info and is basically the same kind of question: Which is more efficient, basic mutex lock or atomic integer?
EDIT 2: And a more detailed article here http://www.informit.com/articles/article.aspx?p=1832575
I have seen people/articles/SO posts who say they have designed their own "lock-free" container for multithreaded usage. Assuming they haven't used a performance-hitting modulus trick (i.e. each thread can only insert based upon some modulo) how can data structures be multi-threaded but also lock-free???
This question is intended towards C and C++.
The key in lock-free programming is to use hardware-intrinsic atomic operations.
As a matter of fact, even locks themselves must use those atomic operations!
But the difference between locked and lock-free programming is that a lock-free program can never be stalled entirely by any single thread. By contrast, if in a locking program one thread acquires a lock and then gets suspended indefinitely, the entire program is blocked and cannot make progress. By contrast, a lock-free program can make progress even if individual threads are suspended indefinitely.
Here's a simple example: A concurrent counter increment. We present two versions which are both "thread-safe", i.e. which can be called multiple times concurrently. First the locked version:
int counter = 0;
std::mutex counter_mutex;
void increment_with_lock()
{
std::lock_guard<std::mutex> _(counter_mutex);
++counter;
}
Now the lock-free version:
std::atomic<int> counter(0);
void increment_lockfree()
{
++counter;
}
Now imagine hundreds of thread all call the increment_* function concurrently. In the locked version, no thread can make progress until the lock-holding thread unlocks the mutex. By contrast, in the lock-free version, all threads can make progress. If a thread is held up, it just won't do its share of the work, but everyone else gets to get on with their work.
It is worth noting that in general lock-free programming trades throughput and mean latency throughput for predictable latency. That is, a lock-free program will usually get less done than a corresponding locking program if there is not too much contention (since atomic operations are slow and affect a lot of the rest of the system), but it guarantees to never produce unpredictably large latencies.
For locks, the idea is that you acquire a lock and then do your work knowing that nobody else can interfere, then release the lock.
For "lock-free", the idea is that you do your work somewhere else and then attempt to atomically commit this work to "visible state", and retry if you fail.
The problems with "lock-free" are that:
it's hard to design a lock-free algorithm for something that isn't trivial. This is because there's only so many ways to do the "atomically commit" part (often relying on an atomic "compare and swap" that replaces a pointer with a different pointer).
if there's contention, it performs worse than locks because you're repeatedly doing work that gets discarded/retried
it's virtually impossible to design a lock-free algorithm that is both correct and "fair". This means that (under contention) some tasks can be lucky (and repeatedly commit their work and make progress) and some can be very unlucky (and repeatedly fail and retry).
The combination of these things mean that it's only good for relatively simple things under low contention.
Researchers have designed things like lock-free linked lists (and FIFO/FILO queues) and some lock-free trees. I don't think there's anything more complex than those. For how these things work, because it's hard it's complicated. The most sane approach would be to determine what type of data structure you're interested in, then search the web for relevant research into lock-free algorithms for that data structure.
Also note that there is something called "block free", which is like lock-free except that you know you can always commit the work and never need to retry. It's even harder to design a block-free algorithm, but contention doesn't matter so the other 2 problems with lock-free disappear. Note: the "concurrent counter" example in Kerrek SB's answer is not lock free at all, but is actually block free.
The idea of "lock free" is not really not having any lock, the idea is to minimize the number of locks and/or critical sections, by using some techniques that allow us not to use locks for most operations.
It can be achieved using optimistic design or transactional memory, where you do not lock the data for all operations, but only on some certain points (when doing the transaction in transactional memory, or when you need to roll-back in optimistic design).
Other alternatives are based on atomic implementations of some commands, such as CAS (Compare And Swap), that even allows us to solve the consensus problem given an implementation of it. By doing swap on references (and no thread is working on the common data), the CAS mechanism allows us to easily implement a lock-free optimistic design (swapping to the new data if and only if no one have changed it already, and this is done atomically).
However, to implement the underlying mechanism to one of these - some locking will most likely be used, but the amount of time the data will be locked is (supposed) to be kept to minimum, if these techniques are used correctly.
The new C and C++ standards (C11 and C++11) introduced threads, and thread shared atomic data types and operations. An atomic operation gives guarantees for operations that run into a race between two threads. Once a thread returns from such an operation, it can be sure that the operation has gone through in its entirety.
Typical processor support for such atomic operations exists on modern processors for compare and swap (CAS) or atomic increments.
Additionally to being atomic, data type can have the "lock-free" property. This should perhaps have been coined "stateless", since this property implies that an operation on such a type will never leave the object in an intermediate state, even when it is interrupted by an interrupt handler or a read of another thread falls in the middle of an update.
Several atomic types may (or may not) be lock-free, there are macros to test for that property. There is always one type that is guaranteed to be lock free, namely atomic_flag.
I was reading this post on performance differences in C# between critical sections and mutexes for a given test case. I'm womdering if there is any further documentation out there that gives performance overheads for the various locking classes for a C++ application, specifically MFC running on a Windows 32 or 64 bit platform?
The reason that I'm asking is that the profiler results I get across broad automated tests show a lot of time spent in mutex code. What I'm trying to figure out is how much of this is reasonable delay while waiting for a resource to become available, and how much is due to the implementation and specifics of the locking structure. I'm only dealing with a single process, which includes multiple threads, and am considering changing to critical sections. Long term automated testing shows that I don't need the time-outs offered by the mutex class.
Hence the question, is anyone aware of any reference documentation relating to the performance overheads of different MFC locking mechanisms on different Windows platforms?
As far as I can understand, a Win32 Mutex is a full blown kernel object. This means that any call to a Mutex will involve a system call. This will often invalidate the cache and therefore can be quite expensive.
Critical Sections are Userside objects that make no use of the kernel in cases where there is no contention. This is probably done using the x86 LOCK assembler instruction or similar to guarantee atomicity. Since no system call is made, it will be faster but because it not a kernel object, there is no way to access a critical section from another process.
The crucial difference between Critical Sections and Mutexes in Windows is that you can create a named mutex and use it from multiple processes, whereas there is no way to access a critical section of one process from another.
A consequence of a mutex being available in multiple processes is that access to it must be controlled by the kernel.
Read the following support article from Microsoft: http://support.microsoft.com/kb/105678.
Critical sections and mutexes provide synchronization that is very similar, except that critical sections can be used only by the threads of a single process. There are two areas to consider when choosing which method to use within a single process:
Speed. The Synchronization overview says the following about critical sections:
... critical section objects provide a slightly faster, more efficient
mechanism for mutual-exclusion synchronization. Critical sections use
a processor-specific test and set instruction to determine mutual
exclusion.
Deadlock. The Synchronization overview says the following about mutexes:
If a thread terminates without releasing its ownership of a mutex
object, the mutex is considered to be abandoned. A waiting thread can
acquire ownership of an abandoned mutex, but the wait function's
return value indicates that the mutex is abandoned.
WaitForSingleObject() will return WAIT_ABANDONED for a mutex that has
been abandoned. However, the resource that the mutex is protecting is
left in an unknown state.
There is no way to tell whether a critical section has been abandoned.
What are the factors to keep in mind while choosing between Critical Sections, Mutex and Spin Locks? All of them provide for synchronization but are there any specific guidelines on when to use what?
EDIT: I did mean the windows platform as it has a notion of Critical Sections as a synchronization construct.
In Windows parlance, a critical section is a hybrid between a spin lock and a non-busy wait. It spins for a short time, then--if it hasn't yet grabbed the resource--it sets up an event and waits on it. If contention for the resource is low, the spin lock behavior is usually enough.
Critical Sections are a good choice for a multithreaded program that doesn't need to worry about sharing resources with other processes.
A mutex is a good general-purpose lock. A named mutex can be used to control access among multiple processes. But it's usually a little more expensive to take a mutex than a critical section.
General points to consider:
The performance cost of using the mechanism.
The complexity introduced by using the mechanism.
In any given situation 1 or 2 may be more important.
E.g.
If you using multi-threading to write a high performance algorithm by making use of many cores and need to guard some data for safe access then 1 is probably very important.
If you have an application where a background thread is used to poll for some information on a timer and on the rare occasion it notices an update you need to guard some data for access then 2 is probably more important than 1.
1 will be down to the underlying implementation and probably scales with the scope of the protection e.g. a lock that is internal to a process is normally faster than a lock across all processes on a machine.
2 is easy to misjudge. First attempts to use locks to write thread safe code will normally miss some cases that lead to a deadlock. A simple deadlock would occur for example if thread A was waiting on a lock held by thread B but thread B was waiting on a lock held by thread A. Surprisingly easy to implement by accident.
On any given platform the naming and qualities of locking mechanisms may vary.
On windows critical sections are fast and process specific, mutexes are slower but cross process. Semaphores offer more complicated use cases. Some problems e.g. allocation from a pool may be solved very efficently using atomic functions rather than locks e.g. on windows InterlockedIncrement which is very fast indeed.
A Mutex in Windows is actually an interprocess concurrency mechanism, making it incredibly slow when used for intraprocess threading. A Critical Section is the Windows analogue to the mutex you normally think of.
Spin Locks are best used when the resource being contested is usually not held for a significant number of cycles, meaning the thread that has the lock is probably going to give it up soon.
EDIT : My answer is only relevant provided you mean 'On Windows', so hopefully that's what you meant.