Cost of mutex,critical section etc on Windows

Cost of mutex,critical section etc on Windows - c++

I read somewhere that the overhead of a mutex is not that much, because the context switching only happens in case of contention.
Also known Futexes in Linux.
Does the same thing hold good in Windows? Is Critical Section a more apt map to mutexes in Linux.
From what i gathered, Critical Sections provide better optimal performance compared to Mutex, is this true for every case?
Is there a corner case where mutexes are faster than critical section in Windows.
Assume only a single process-threads are accessing the mutexes(Just to eliminate the other benefit of Critical Sections)
Added Info: OS windows Server,
Language C++

Considering the specific purpose of Critical Sections and Mutexes I don't think you can ask a question regarding the cost as you don't have much alternative when you need multiple threads touching the same data. Obviously, if you just need to increment/decrement a number, you can use the Interlocked*() functions on a volatile number and you're good to go. But for anything more complex, you need to use a synchronization object.
Start your reading here on the Synchronization Objects available on Windows^. All functions are listed there, nicely grouped and properly explained. Some are Windows 8 only.
As regarding your question, Critical Sections are less expensive than Mutexes as they are designed to operate in the same process. Read this^ and this^ or just the following quote.
A critical section object provides synchronization similar to that provided by a mutex object, except that a critical section can be used only by the threads of a single process. Event, mutex, and semaphore objects can also be used in a single-process application, but critical section objects provide a slightly faster, more efficient mechanism for mutual-exclusion synchronization (a processor-specific test and set instruction). Like a mutex object, a critical section object can be owned by only one thread at a time, which makes it useful for protecting a shared resource from simultaneous access. Unlike a mutex object, there is no way to tell whether a critical section has been abandoned.
I use Critical Sections for same process synchronization and Mutexes for cross-process synchronization. Only when I REALLY need to know if a synchronization object was abandoned, I use Mutexes in the same process.
So, if you need a sync object, the question is not what are the costs but which is cheaper :) There's really no alternative but memory corruption.
PS: There might be alternatives like the one mentioned in the selected answer here^ but I always go for core platform-specific functionality vs. cross-platformness. It's always faster! So if you use Windows, use the tools of Windows :)
UPDATE
Based on your needs, you might be able to reduce the need of sync objects by trying to do as much self-contained work in a thread as possible and only combine the data at the end or every now and then.
Stupid Example: Take a list of URLs. You need to scrape them and analyze them.
Throw in a bunch of threads and start picking URLs, one by one, from the input list. For each one your process you centralize the results as you do it. It's real time and cool
Or you can throw in the threads each of them having a slice of the input URLs. This removes the need to sync the selection process. You store the analysis result in the thread and at the end, you combine the result just once. Or just once every 10 URLs let's say. Not for each of them. This will reduce the sync operations dramatically.
So costs can be lowered by choosing the right tool and thinking how to lower the lock and unlocks. But costs cannot be removed :)
PS: I only think in URLs :)
UPDATE 2:
Had the need in a project to do some measuring. And the results were quite surprising:
A std::mutex is most expensive. (price of cross-platformness)
A Windows native Mutex is 2x faster than std.
A Critical Section is 2x faster than the native Mutex.
A SlimReadWriteLock is +-10% of the Critical Section.
My homemade InterlockedMutex (spinlock) is 1.25x - 1.75x faster than the Critical Section.

Using std::mutex on windows 8 I usually get 3-4x improvement (on non contending case) speedup by using my own custom made spin lock:
mutex based
auto time = TimeIt([&]() {
for (int i = 0; i < tries; i++) {
bool val = mutex.try_lock();
if (val) {
data.value = 1;
}
}
});
home made lock free
time = TimeIt([&]() {
for (int i = 0; i < tries; i++) {
if (!guard.exchange(true)) {
// I own you
data.value = 1;
guard.store(true);
}
}
});
Tests are made on x86.
I haven't figured out what std::mutex uses underline on windows because it generates a lot of code.

Related

Mutex is defying the very idea of threads: parallel processing [duplicate]

When I have a block of code like this:
mutex mtx;
void hello(){
mtx.lock();
for(int i = 0; i < 10; i++){
cout << "hello";
}
mtx.unlock();
}
void hi(){
mtx.lock();
for(int i = 0; i < 10; i++){
cout << "hi";
}
mtx.unlock();
}
int main(){
thread x(hello);
thread y(hi);
x.join();
y.join();
}
What is the difference between just calling `hello()` and `hi()`? (Like so)
...
int main(){
hello();
hi();
}
Are threads more efficient? The purpose of thread is to run at the same time, right?
Can someone explain why we use mutexes within thread functions? Thank you!

The purpose of thread is to run at the same time, right?
Yes, threads are used to perform multiple tasks in parallel, especially on different CPUs.
Can someone explain why we use mutexes within thread functions?
To serialize multiple threads with each other, such as when they are accessing a shared resource that is not safe to access concurrently and needs to be protected.

Are threads more efficient?
No. But see final note (below).
On a single core, threads are much, much less efficient (than function/method calls).
As one example, on my Ubuntu 15.10(64), using g++ v5.2.1,
a) a context switch (from one thread to the other) enforced by use of std::mutex takes about 12,000 nanoseconds
b) but invoking 2 simple methods, for instance std::mutex lock() & unlock(), this takes < 50 nanoseconds. 3 orders of magnitude! So context switch vx function call is no contest.
The purpose of thread is to run at the same time, right?
Yes ... but this can not happen on a single core processor.
And on a multi-core system, context switch time can still dominate.
For example, my Ubuntu system is dual core. The measurement of context switch time I reported above uses a chain of 10 threads, where each thread simply waits for its input semaphore to be unlock()'d. When a thread's input semaphore is unlocked, the thread gets to run ... but the brief thread activity is simply 1) increment a count and check a flag, and 2) unlock() the next thread, and 3) lock() its own input mutex, i.e. wait again for the previous task signal. In that test, the thread we known as main starts the thread-sequencing with unlock() of one of the threads, and stops it with a flag that all threads can see.
During this measurement activity (about 3 seconds), Linux system monitor shows both cores are involved, and reports both cores at abut 60% utilization. I expected both cores at 100% .. don't know why they are not.
Can someone explain why we use mutexes within thread functions? Thank
you!
I suppose the most conventional use of std::mutex's is to serialize access to a memory structure (perhaps a shared-access storage or structure). If your application has data accessible by multiple threads, each write access must be serialized to prevent race conditions from corrupting the data. Sometimes, both read and write access needs to be serialized. (See dining philosophers problem.)
In your code, as an example (although I do not know what system you are using), it is possible that std::cout (a shared structure) will 'interleave' text. That is, a thread context switch might happen in the middle of printing a "hello", or even a 'hi'. This behaviour is usually undesired, but might be acceptable.
A number of years ago, I worked with vxWorks and my team learned to use mutex's on access to std::cout to eliminate that interleaving. Such behavior can be distracting, and generally, customers do not like it. (ultimately, for that app, we did away with the use of the std trio-io (cout, cerr, cin))
Devices, of various kinds, also might not function properly if you allow more than 1 thread to attempt operations on them 'simultaneously'. For example, I have written software for a device that required 50 us or more to complete its reaction to my software's 'poke', before any additional action to the device should be applied. The device simply ignored my codes actions without the wait.
You should also know that there are techniques that do not involve semaphores, but instead use a thread and an IPC to provide serialized (i.e. protected) resource access.
From wikipedia, "In concurrent programming, a monitor is a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become true."
When the os provides a suitable IPC, I prefer to use a Hoare monitor. In my interpretation, the monitor is simply a thread that accepts commands over the IPC, and is the only thread to access the shared structure or device. When only 1 thread accesses a structure, NO mutex is needed. All other threads must send a message (via IPC) to request (or perhaps command) another structure change. The monitor thread handles one request at a time, sequentially out of the IPC.
Definition: collision
In the context of "thread context switch' and 'mutex semaphores', a 'collision' occurs when a thread must block-and-wait for access to a resource, because that resource is already 'in use' (i.e. 'occupied'). This is a forced context switch. See also the term "critical section".
When the shared resource is NOT currently in use, no collision. The lock() and unlock() cost almost nothing (by comparison to context switch).
When there is a collision, the context switch slows things down by a 'bunch'. But this 'bunch' might still be acceptable ... consider when 'bunch' is small compared to the duration of the activity inside the critical section.
Final note ... With this new idea of 'collision':
a) Multiple threads can be far less efficient in the face of many collisions.
For unexpected example, the function 'new' accesses a thread-shared resource we can call "dynamic memory". In one experience, each thread generated 1000's of new's at start up. One thread could complete that effort in 0.5 seconds. Four threads, started quickly back-to-back, took 40 seconds to complete the 4 start ups. Context switches!
b) Multiple threads can be more efficient, when you have multiple cores and no / or few collisions. Essentially, if the threads seldom interact, they can run (mostly) simultaneously.
Thread efficiency can be any where between a or b, when multiple cores and collisions.
For instance, my ram based "log" mechanisms seems to work well - one mutex access per log entry. Generally, I intentionally used minimal logging. And when debugging a 'discovered' challenge, I added additional logging (maybe later removed) to determine what was going wrong. Generally, the debugger is better than a general logging technique. But sometimes, adding several log entries worked well.

Threads have at least two advantages over purely serial code.
Convenience in separating logically independent sequences of instructions. This is true even on a single core machine. This gives you logical concurrency without necessarily parallelism.
Having multiple threads allows either the operating system or a user-level threading library to multiplex multiple logical threads over a smaller number of CPU cores, without the application developer having to worry about other threads and processes.
Taking advantage of multiple cores / processors. Threads allow you to scale your execution to the number of CPU cores you have, enabling parallelism.
Your example is a little contrived because the entire thread's execution is locked. Normally, threads perform many actions independently and only take a mutex when accessing a shared resource.
More specifically, under your scenario you would not gain any performance. However, if your entire thread was not under a mutex, then you could potentially gain efficiency. I say potentially because there are overheads to running multiple threads which may offset any efficiency gain you obtain.

Threads theoretically run simultaneously, it means that threads could write to the same memory block at the same time. For example, if you have a global var int i;, and two threads try to write different values at same time, which one value remains in i?
Mutex forces synchronous access to memory, inside a mutex block (mutex.lock & mutex.unlock) you warrant synchronous memory access and avoid memory corruption.
When you call mtx.lock(), JUST ONE THREAD KEEPS RUNNING, and any other thread calling the same mtx.lock() stops, waiting for mtx.unlock call.

Mutex vs. standard function call

When I have a block of code like this:
mutex mtx;
void hello(){
mtx.lock();
for(int i = 0; i < 10; i++){
cout << "hello";
}
mtx.unlock();
}
void hi(){
mtx.lock();
for(int i = 0; i < 10; i++){
cout << "hi";
}
mtx.unlock();
}
int main(){
thread x(hello);
thread y(hi);
x.join();
y.join();
}
What is the difference between just calling `hello()` and `hi()`? (Like so)
...
int main(){
hello();
hi();
}
Are threads more efficient? The purpose of thread is to run at the same time, right?
Can someone explain why we use mutexes within thread functions? Thank you!

The purpose of thread is to run at the same time, right?
Yes, threads are used to perform multiple tasks in parallel, especially on different CPUs.
Can someone explain why we use mutexes within thread functions?
To serialize multiple threads with each other, such as when they are accessing a shared resource that is not safe to access concurrently and needs to be protected.

Threads have at least two advantages over purely serial code.
Convenience in separating logically independent sequences of instructions. This is true even on a single core machine. This gives you logical concurrency without necessarily parallelism.
Having multiple threads allows either the operating system or a user-level threading library to multiplex multiple logical threads over a smaller number of CPU cores, without the application developer having to worry about other threads and processes.
Taking advantage of multiple cores / processors. Threads allow you to scale your execution to the number of CPU cores you have, enabling parallelism.
Your example is a little contrived because the entire thread's execution is locked. Normally, threads perform many actions independently and only take a mutex when accessing a shared resource.
More specifically, under your scenario you would not gain any performance. However, if your entire thread was not under a mutex, then you could potentially gain efficiency. I say potentially because there are overheads to running multiple threads which may offset any efficiency gain you obtain.

Threads theoretically run simultaneously, it means that threads could write to the same memory block at the same time. For example, if you have a global var int i;, and two threads try to write different values at same time, which one value remains in i?
Mutex forces synchronous access to memory, inside a mutex block (mutex.lock & mutex.unlock) you warrant synchronous memory access and avoid memory corruption.
When you call mtx.lock(), JUST ONE THREAD KEEPS RUNNING, and any other thread calling the same mtx.lock() stops, waiting for mtx.unlock call.

Performance difference between mutex and critical section in C++

I was reading this post on performance differences in C# between critical sections and mutexes for a given test case. I'm womdering if there is any further documentation out there that gives performance overheads for the various locking classes for a C++ application, specifically MFC running on a Windows 32 or 64 bit platform?
The reason that I'm asking is that the profiler results I get across broad automated tests show a lot of time spent in mutex code. What I'm trying to figure out is how much of this is reasonable delay while waiting for a resource to become available, and how much is due to the implementation and specifics of the locking structure. I'm only dealing with a single process, which includes multiple threads, and am considering changing to critical sections. Long term automated testing shows that I don't need the time-outs offered by the mutex class.
Hence the question, is anyone aware of any reference documentation relating to the performance overheads of different MFC locking mechanisms on different Windows platforms?

As far as I can understand, a Win32 Mutex is a full blown kernel object. This means that any call to a Mutex will involve a system call. This will often invalidate the cache and therefore can be quite expensive.
Critical Sections are Userside objects that make no use of the kernel in cases where there is no contention. This is probably done using the x86 LOCK assembler instruction or similar to guarantee atomicity. Since no system call is made, it will be faster but because it not a kernel object, there is no way to access a critical section from another process.

The crucial difference between Critical Sections and Mutexes in Windows is that you can create a named mutex and use it from multiple processes, whereas there is no way to access a critical section of one process from another.
A consequence of a mutex being available in multiple processes is that access to it must be controlled by the kernel.

Read the following support article from Microsoft: http://support.microsoft.com/kb/105678.
Critical sections and mutexes provide synchronization that is very similar, except that critical sections can be used only by the threads of a single process. There are two areas to consider when choosing which method to use within a single process:
Speed. The Synchronization overview says the following about critical sections:
... critical section objects provide a slightly faster, more efficient
mechanism for mutual-exclusion synchronization. Critical sections use
a processor-specific test and set instruction to determine mutual
exclusion.
Deadlock. The Synchronization overview says the following about mutexes:
If a thread terminates without releasing its ownership of a mutex
object, the mutex is considered to be abandoned. A waiting thread can
acquire ownership of an abandoned mutex, but the wait function's
return value indicates that the mutex is abandoned.
WaitForSingleObject() will return WAIT_ABANDONED for a mutex that has
been abandoned. However, the resource that the mutex is protecting is
left in an unknown state.
There is no way to tell whether a critical section has been abandoned.

Choosing between Critical Sections, Mutex and Spin Locks

What are the factors to keep in mind while choosing between Critical Sections, Mutex and Spin Locks? All of them provide for synchronization but are there any specific guidelines on when to use what?
EDIT: I did mean the windows platform as it has a notion of Critical Sections as a synchronization construct.

In Windows parlance, a critical section is a hybrid between a spin lock and a non-busy wait. It spins for a short time, then--if it hasn't yet grabbed the resource--it sets up an event and waits on it. If contention for the resource is low, the spin lock behavior is usually enough.
Critical Sections are a good choice for a multithreaded program that doesn't need to worry about sharing resources with other processes.
A mutex is a good general-purpose lock. A named mutex can be used to control access among multiple processes. But it's usually a little more expensive to take a mutex than a critical section.

General points to consider:
The performance cost of using the mechanism.
The complexity introduced by using the mechanism.
In any given situation 1 or 2 may be more important.
E.g.
If you using multi-threading to write a high performance algorithm by making use of many cores and need to guard some data for safe access then 1 is probably very important.
If you have an application where a background thread is used to poll for some information on a timer and on the rare occasion it notices an update you need to guard some data for access then 2 is probably more important than 1.
1 will be down to the underlying implementation and probably scales with the scope of the protection e.g. a lock that is internal to a process is normally faster than a lock across all processes on a machine.
2 is easy to misjudge. First attempts to use locks to write thread safe code will normally miss some cases that lead to a deadlock. A simple deadlock would occur for example if thread A was waiting on a lock held by thread B but thread B was waiting on a lock held by thread A. Surprisingly easy to implement by accident.
On any given platform the naming and qualities of locking mechanisms may vary.
On windows critical sections are fast and process specific, mutexes are slower but cross process. Semaphores offer more complicated use cases. Some problems e.g. allocation from a pool may be solved very efficently using atomic functions rather than locks e.g. on windows InterlockedIncrement which is very fast indeed.

A Mutex in Windows is actually an interprocess concurrency mechanism, making it incredibly slow when used for intraprocess threading. A Critical Section is the Windows analogue to the mutex you normally think of.
Spin Locks are best used when the resource being contested is usually not held for a significant number of cycles, meaning the thread that has the lock is probably going to give it up soon.
EDIT : My answer is only relevant provided you mean 'On Windows', so hopefully that's what you meant.

Overhead of pthread mutexes?

I'm trying to make a C++ API (for Linux and Solaris) thread-safe, so that its functions can be called from different threads without breaking internal data structures. In my current approach I'm using pthread mutexes to protect all accesses to member variables. This means that a simple getter function now locks and unlocks a mutex, and I'm worried about the overhead of this, especially as the API will mostly be used in single-threaded apps where any mutex locking seems like pure overhead.
So, I'd like to ask:
do you have any experience with performance of single-threaded apps that use locking versus those that don't?
how expensive are these lock/unlock calls, compared to eg. a simple "return this->isActive" access for a bool member variable?
do you know better ways to protect such variable accesses?

All modern thread implementations can handle an uncontended mutex lock entirely in user space (with just a couple of machine instructions) - only when there is contention, the library has to call into the kernel.
Another point to consider is that if an application doesn't explicitly link to the pthread library (because it's a single-threaded application), it will only get dummy pthread functions (which don't do any locking at all) - only if the application is multi-threaded (and links to the pthread library), the full pthread functions will be used.
And finally, as others have already pointed out, there is no point in protecting a getter method for something like isActive with a mutex - once the caller gets a chance to look at the return value, the value might already have been changed (as the mutex is only locked inside the getter method).

"A mutex requires an OS context switch. That is fairly expensive. "
This is not true on Linux, where mutexes are implemented using something called futex'es. Acquiring an uncontested (i.e., not already locked) mutex is, as cmeerw points out, a matter of a few simple instructions, and is typically in the area of 25 nanoseconds w/current hardware.
For more info:
Futex
Numbers everybody should know

This is a bit off-topic but you seem to be new to threading - for one thing, only lock where threads can overlap. Then, try to minimize those places. Also, instead of trying to lock every method, think of what the thread is doing (overall) with an object and make that a single call, and lock that. Try to get your locks as high up as possible (this again increases efficiency and may /help/ to avoid deadlocking). But locks don't 'compose', you have to mentally at least cross-organize your code by where the threads are and overlap.

I did a similar library and didn't have any trouble with lock performance. (I can't tell you exactly how they're implemented, so I can't say conclusively that it's not a big deal.)
I'd go for getting it right first (i.e. use locks) then worry about performance. I don't know of a better way; that's what mutexes were built for.
An alternative for single thread clients would be to use the preprocessor to build a non-locked vs locked version of your library. E.g.:
#ifdef BUILD_SINGLE_THREAD
inline void lock () {}
inline void unlock () {}
#else
inline void lock () { doSomethingReal(); }
inline void unlock () { doSomethingElseReal(); }
#endif
Of course, that adds an additional build to maintain, as you'd distribute both single and multithread versions.

I can tell you from Windows, that a mutex is a kernel object and as such incurs a (relatively) significant locking overhead. To get a better performing lock, when all you need is one that works in threads, is to use a critical section. This would not work across processes, just the threads in a single process.
However.. linux is quite a different beast to multi-process locking. I know that a mutex is implemented using the atomic CPU instructions and only apply to a process - so they would have the same performance as a win32 critical section - ie be very fast.
Of course, the fastest locking is not to have any at all, or to use them as little as possible (but if your lib is to be used in a heavily threaded environment, you will want to lock for as short a time as possible: lock, do something, unlock, do something else, then lock again is better than holding the lock across the whole task - the cost of locking isn't in the time taken to lock, but the time a thread sits around twiddling its thumbs waiting for another thread to release a lock it wants!)

A mutex requires an OS context switch. That is fairly expensive. The CPU can still do it hundreds of thousands of times per second without too much trouble, but it is a lot more expensive than not having the mutex there. Putting it on every variable access is probably overkill.
It also probably is not what you want. This kind of brute-force locking tends to lead to deadlocks.
do you know better ways to protect such variable accesses?
Design your application so that as little data as possible is shared. Some sections of code should be synchronized, probably with a mutex, but only those that are actually necessary. And typically not individual variable accesses, but tasks containing groups of variable accesses that must be performed atomically. (perhaps you need to set your is_active flag along with some other modifications. Does it make sense to set that flag and make no further changes to the object?)

I was curious about the expense of using a pthred_mutex_lock/unlock.
I had a scenario where I needed to either copy anywhere from 1500-65K bytes without using
a mutex or to use a mutex and do a single write of a pointer to the data needed.
I wrote a short loop to test each
gettimeofday(&starttime, NULL)
COPY DATA
gettimeofday(&endtime, NULL)
timersub(&endtime, &starttime, &timediff)
print out timediff data
or
ettimeofday(&starttime, NULL)
pthread_mutex_lock(&mutex);
gettimeofday(&endtime, NULL)
pthread_mutex_unlock(&mutex);
timersub(&endtime, &starttime, &timediff)
print out timediff data
If I was copying less than 4000 or so bytes, then the straight copy operation took less time. If however I was copying more than 4000 bytes, then it was less costly to do the mutex lock/unlock.
The timing on the mutex lock/unlock ran between 3 and 5 usec long including the time for
the gettimeofday for the currentTime which took about 2 usec

For member variable access, you should use read/write locks, which have slightly less overhead and allow multiple concurrent reads without blocking.
In many cases you can use atomic builtins, if your compiler provides them (if you are using gcc or icc __sync_fetch*() and the like), but they are notouriously hard to handle correctly.
If you can guarantee the access being atomic (for example on x86 an dword read or write is always atomic, if it is aligned, but not a read-modify-write), you can often avoid locks at all and use volatile instead, but this is non portable and requires knowledge of the hardware.

Well a suboptimal but simple approach is to place macros around your mutex locks and unlocks. Then have a compiler / makefile option to enable / disable threading.
Ex.
#ifdef THREAD_ENABLED
#define pthread_mutex_lock(x) ... //actual mutex call
#endif
#ifndef THREAD_ENABLED
#define pthread_mutex_lock(x) ... //do nothing
#endif
Then when compiling do a gcc -DTHREAD_ENABLED to enable threading.
Again I would NOT use this method in any large project. But only if you want something fairly simple.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js