Is C++ std::atomic compatible with pthreads?


I have 2 pthread threads where one is writing a bool value and another is reading it.
I don't care about portability. It's x86 architecture. The only which concerns me is that the writing thread sets the bool to true and starts doing its own work (which happens once a day at midnight), closing a file, while the other thread has read the bool as false and proceeds with its work (writing to a file) at the same time. It's very difficult to reproduce this scenario, so I'd better get the best possible theoretical solution.
Can I use std::atomic in case of pthreads?

Can I use std::atomic in case of pthreads?
Yes, that's what std::atomic is for.
It works with std::thread, POSIX threads, and any other kind of threads. Behind the scenes it uses "magical" compiler annotations to prevent certain thread-incompatible optimizations, and processor-specific locking instructions to guarantee that thread-safe code is generated [1].
It makes (almost) no sense to use std::atomic without threads (you could use std::atomic instead of volatile for signal handlers, but there is no advantage in doing so).
The only which concerns me ...
The rest of your question makes no sense to me.
[1] When used correctly, which is often a non-trivial thing to do, and which is why you should generally try not to use std::atomic unless you are an expert.
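To make the answer above concrete, here is a minimal sketch of a std::atomic<bool> shared between two POSIX threads. The names closing, writer_thread, reader_thread and run_flag_demo are hypothetical and not taken from the question; the writer is joined before the reader starts only to keep the result deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>

// Hypothetical flag: set by the "midnight" thread before it closes the file.
std::atomic<bool> closing{false};

// Writer thread entry point: publishes the flag.
void* writer_thread(void*) {
    closing.store(true);  // default ordering is sequentially consistent
    return nullptr;
}

// Reader thread entry point: reports what it observed through *arg.
void* reader_thread(void* arg) {
    *static_cast<bool*>(arg) = closing.load();
    return nullptr;
}

// Runs the writer to completion, then the reader, and returns what the
// reader saw.
bool run_flag_demo() {
    pthread_t writer, reader;
    bool seen = false;
    pthread_create(&writer, nullptr, writer_thread, nullptr);
    pthread_join(writer, nullptr);
    pthread_create(&reader, nullptr, reader_thread, &seen);
    pthread_join(reader, nullptr);
    return seen;
}
```

Note that the atomic only makes the read and the write of the flag itself well defined. If the reader must never write to the file while it is being closed, the check and the file write probably need to sit under one mutex that the closing thread also takes.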

Related

How do compilers like GCC implement acquire/release semantics for std::mutex?

My understanding is that std::mutex lock and unlock have acquire/release semantics, which prevent instructions between them from being moved outside.
So acquire/release should disable both compiler and CPU reorder instructions.
My question is that I take a look at GCC5.1 code base and don't see anything special in std::mutex::lock/unlock to prevent compiler reordering codes.
I found a potential answer in does-pthread-mutex-lock-have-happens-before-semantics, which points to a mailing-list post saying that an external function call acts as a compiler memory fence.
Is that always true? And where does the standard say so?

Threads are a fairly complicated, low-level feature. Historically, there was no standard C thread functionality, and instead it was done differently on different OS's. Today there is mainly the POSIX threads standard, which has been implemented in Linux and BSD, and now by extension OS X, and there are Windows threads, starting with Win32 and on. Potentially, there could be other systems besides these.
GCC doesn't directly contain a POSIX threads implementation, instead it may be a client of libpthread on a linux system. When you build GCC from source, you have to configure and build separately a number of ancillary libraries, supporting things like big numbers and threads. That is the point at which you select how threading will be done. If you do it the standard way on linux, you will have an implementation of std::thread in terms of pthreads.
On windows, starting with MSVC C++11 compliance, the MSVC devs implemented std::thread in terms of the native windows threads interface.
It's the OS's job to ensure that the concurrency locks provided by its API actually work -- std::thread is meant to be a cross-platform interface to such a primitive.
The situation may be more complicated for more exotic platforms / cross-compiling etc. For instance, in MinGW project (gcc for windows) -- historically, you have the option to build MinGW gcc using either a port of pthreads to windows, or using a native win32 based threading model. If you don't configure this when you build, you may end up with a C++11 compiler which doesn't support std::thread or std::mutex. See this question for more details. MinGW error: ‘thread’ is not a member of ‘std’
Now, to answer your question more directly. When a mutex is engaged, at the lowest level, this involves some call into libpthreads or some win32 API.
pthread_mutex_lock(&mutex);
do_some_stuff();
pthread_mutex_unlock(&mutex);
(pthread_mutex_lock and pthread_mutex_unlock correspond to the implementations of lock and unlock of std::mutex on your platform; in idiomatic C++11 code these are in turn called in the ctor and dtor of std::unique_lock, for instance, if you are using that.)
Generally, the optimizer cannot reorder these unless it is sure that pthread_mutex_lock() has no side-effects that can change the observable behavior of do_some_stuff().
To my knowledge, the mechanism the compiler has for doing this is ultimately the same as what it uses for estimating the potential side-effects of calls to any other external library.
If there is some resource
int resource;
which is in contention among various threads, it means that there is some function body
void compete_for_resource();
and a function pointer to this is at some earlier point passed to pthread_create... in your program in order to initiate another thread. (This would presumably be in the implementation of the ctor of std::thread.) At this point, the compiler can see that any call into libpthread can potentially call compete_for_resource and touch any memory that that function touches. (From the compiler's point of view libpthread is a black box -- it is some .dll / .so and the compiler can't make assumptions about what exactly it does.)
In particular, the call to pthread_mutex_lock() potentially has side-effects on resource, so it cannot be reordered against do_some_stuff().
If you never actually spawn any other threads, then to my knowledge, do_some_stuff(); could be reordered outside of the mutex lock. In that case libpthread doesn't have any access to resource: it's just a private variable in your source that isn't shared with the external library even indirectly, and the compiler can see that.
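The scenario described above can be sketched with the standard C++11 wrappers. The names resource and compete_for_resource come from the text; resource_mutex and run_two_competitors are assumptions added for illustration, and std::lock_guard stands in for the explicit lock/unlock calls.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

int resource = 0;           // shared between threads, as in the text above
std::mutex resource_mutex;  // assumed name: guards every access to resource

// Each call increments the shared resource `iterations` times, taking the
// mutex around every single increment.
void compete_for_resource(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> lock(resource_mutex);
        ++resource;  // must not be moved outside the locked region
    }
}

// Runs two competing threads and returns the final value of resource.
int run_two_competitors(int iterations) {
    resource = 0;
    std::thread a(compete_for_resource, iterations);
    std::thread b(compete_for_resource, iterations);
    a.join();
    b.join();
    return resource;
}
```

If the increments could be moved outside the lock, the final count could come out lower than twice the iteration count. With the lock honored, `run_two_competitors(10000)` returns 20000.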

All of these questions stem from the rules for compiler reordering. One of the fundamental rules for reordering is that the compiler must prove that the reorder does not change the result of the program. In the case of std::mutex, the exact meaning of that phrase is specified in a block of about 10 pages of legalese, but the general intuitive sense of "doesn't change the result of the program" holds. If you had a guarantee about which operation came first, according to the specification, no compiler is allowed to reorder in a way which violates that guarantee.
This is why people often claim that a "function call acts as a memory barrier." If the compiler cannot deep-inspect the function, it cannot prove that the function didn't have a hidden barrier or atomic operation inside of it, thus it must treat that function as though it was a barrier.
There is, of course, the case where the compiler can inspect the function, such as the case of inline functions or link time optimizations. In these cases, one cannot rely on a function call to act as a barrier, because the compiler may indeed have enough information to prove the rewrite behaves the same as the original.
In the case of mutexes, even such advanced optimization cannot take place. The only way to reorder around the mutex lock/unlock function calls is to have deep-inspected the functions and proven there are no barriers or atomic operations to deal with. If it can't inspect every sub-call and sub-sub-call of that lock/unlock function, it can't prove it is safe to reorder. If it indeed can do this inspection, it would see that every mutex implementation contains something which cannot be reordered around (indeed, this is part of the definition of a valid mutex implementation). Thus, even in that extreme case, the compiler is still forbidden from optimizing.
EDIT: For completeness, I would like to point out that these rules were introduced in C++11. C++98 and C++03 reordering rules only prohibited changes that affected the result of the current thread. Such a guarantee is not strong enough to develop multithreading primitives like mutexes.
To deal with this, multithreading APIs like pthreads developed their own rules. From the Pthreads specification, section 4.11:
Applications shall ensure that access to any memory location by more
than one thread of control (threads or processes) is restricted such
that no thread of control can read or modify a memory location while
another thread of control may be modifying it. Such access is
restricted using functions that synchronize thread execution and also
synchronize memory with respect to other threads. The following
functions synchronize memory with respect to other threads
It then lists a few dozen functions which synchronize memory, including pthread_mutex_lock and pthread_mutex_unlock.
A compiler which wishes to support the pthreads library must implement something to support this cross-thread memory synchronization, even though the C++ specification didn't say anything about it. Fortunately, any compiler where you want to do multithreading was developed with the recognition that such guarantees are fundamental to all multithreading, so every compiler that supports multithreading has it!
In the case of gcc, it did so without any special notes on the pthreads function calls because gcc would effectively create a barrier around every external function call (because it couldn't prove that no synchronization existed inside that function call). If gcc were ever to change that, they would also have to change their pthreads headers to include any extra verbiage needed to mark the pthreads functions as synchronizing memory.
All of that, of course, is compiler specific. There were no standards answers to this question until C++11 came along with its new memory model.

NOTE: I am no expert in this area and my knowledge about it is in a spaghetti like condition. So take the answer with a grain of salt.
NOTE-2: This might not be the answer that OP is expecting. But here are my 2 cents anyways if it helps:
My question is that I take a look at GCC5.1 code base and don't see
anything special in std::mutex::lock/unlock to prevent compiler
reordering codes.
g++ uses the pthread library. std::mutex is just a thin wrapper around pthread_mutex. So, you will have to actually go and have a look at pthread's mutex implementation.
If you go bit deeper into the pthread implementation (which you can find here), you will see that it uses atomic instructions along with futex calls.
A few minor things to remember here:
1. The atomic instructions do use barriers.
2. Any function call is equivalent to a full barrier. I do not remember where I read this.
3. Mutex calls may put the thread to sleep and cause a context switch.
Now, as far as reordering goes, one of the things that needs to be guaranteed is that no instruction after lock and before unlock is reordered to before lock or after unlock. This, I believe, is not a full barrier, but rather just an acquire and a release barrier respectively. But this is again platform dependent: x86 has a strong memory model by default (not quite sequential consistency, since a store can be reordered after a later load), whereas ARM provides a weaker ordering guarantee.
I strongly recommend this blog series:
http://preshing.com/archives/
It explains lots of lower level stuff in easy to understand language. Guess, I have to read it once again :)
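The acquire/release pairing mentioned above can be illustrated with plain std::atomic, independent of any mutex implementation. This is only a sketch: payload, ready, producer, consumer and run_publish_demo are hypothetical names, and it is not how pthread mutexes are implemented internally.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // publication flag

void producer() {
    payload = 42;  // ordinary write
    // Release: the write to payload above cannot be moved after this store.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire: once this load sees true, the read of payload below cannot be
    // moved before it, so it observes the producer's write.
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::yield();
    return payload;
}

// Starts both threads and returns the value the consumer read.
int run_publish_demo() {
    payload = 0;
    ready.store(false);
    int seen = 0;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

A mutex gives the same kind of guarantee at a coarser grain: lock behaves like the acquire side and unlock like the release side.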
UPDATE: Unable to comment on @Cort Ammon's answer due to length.
@Kane I am not sure about this, but people in general write barriers for processor level which takes care of compiler level barriers as well. The same is not true for compiler builtin barriers.
Now, since the pthread_*lock* functions definitions are not present in the translation unit where you are making use of it (this is doubtful), calling lock - unlock should provide you with full memory barrier. The pthread implementation for the platform makes use of atomic instructions to block any other thread from accessing the memory locations after the lock or before unlock. Now since only one thread is executing the critical portion of the code it is ensured that any reordering within that will not change the expected behaviour as mentioned in above comment.
Atomics is pretty tough to understand and to get right, so, what I have written above is from my understanding. Would be very glad to know if my understanding is wrong here.

So acquire/release should disable both compiler and CPU reorder instructions.
By definition anything that prevents CPU reordering by speculative execution prevents compiler reordering. That's the definition of language semantics, even without MT (multi-threading) in the language, so you will be safe from reordering on old compilers that don't support MT.
But these compilers aren't safe for MT for a bunch of reasons, from the lack of thread protection around runtime initialization of static variables to the implicitly modified global variables like errno, etc.
Also, in C/C++, any call to a function that is purely external (that is: not inline, and not available for inlining at any point), without an annotation explaining what it does (like the "pure function" attribute of some popular compilers), must be assumed to do anything that legal C/C++ code can do. No non-trivial reordering would be possible (any reordering that is visible is non-trivial).
Any correct implementation of locks on systems with multiple units of execution that don't simulate a global order on assembly instructions will require memory barriers and will prevent reordering.
An implementation of locks on a linearly executing CPU, with only one unit of execution (or where all threads are bound on the same unit of execution), might use only volatile variables for synchronisation and that is unsafe as volatile reads resp. writes do not provide any guarantee of acquire resp. release of any other data (contrast Java). Some kind of compiler barrier would be needed, like a strongly external function call, or some asm (""/*nothing*/) (which is compiler specific and even compiler version specific).

Does Boost have support for Windows EnterCriticalSection API?

I know Boost has support for mutexes and lock_guard, which can be used to implement critical sections.
But Windows has a special API for critical sections (see EnterCriticalSection and LeaveCriticalSection) which is a LOT faster than a mutex (for rarely contended, short sections of code).
Hence my question: is it possible in Boost to take advantage of this API, and fall back to a spinlock/mutex/futex-based implementation on other platforms?

The simple answer is no.
Here's some relevant background from an old mailing list thread:
BTW. I am agree that mutex is more universal solution from a
performance point of view. But to be fair - CS are faster in simple
design. I believe that possibility to support them should be at
least
taken in account.
This was the article that someone pointed me to. The conclusion was
that CS are only faster if:
There are less than 8 threads total in the process.
You weren't running in the background.
You weren't on an dual processor machine.
To me this means that simple testing yields good CS performance
results, but any real world program is better off with a full blown
mutex.
I'm not adverse to supporting a CS implementation. However, I
originally chose not to for the following reasons:
1. You get either construction and destruction hits from using a PIMPL idiom or you must include Windows.h in the Boost.Threads headers, which I simply don't want to do. (This can be worked around by emulating a CS ala OPTEX from the MSDN.)
2. According to this research paper most programs won't benefit from a CS design.
3. It's trivial to code a (non-portable) critical_section class that follows the Mutex model if you truly can make use of this.
For now I think I've made the right choice, though down the road we
may change the implementation to use a critical section or OPTEX.
Bill Kempf

Speaking as someone who helps out maintaining Boost.Thread, and as someone who failed to get an event object into Boost.Thread, I don't think critical sections have ever been added nor would be added to Boost for these reasons:
A Win32 critical section is trivially easy to build using a boost::atomic and a boost::condition_variable, so much so it isn't really worth having an official one. Here is probably the most complex one you could imagine, but extremely configurable including being constexpr ready (don't ask!): https://github.com/ned14/boost.outcome/blob/master/include/boost/outcome/v1/spinlock.hpp#L331
You can build your own simply by matching (Basic)Lockable concept and using atomic compare_exchange (non-x86/x64) or atomic exchange (x86/x64) and then grab it using a lock_guard around the critical section.
Some may object that a win32 critical section is not this. I am afraid it is: it simply spins on an atomic for a spin count, and then lazily tries to allocate a win32 event object which it then waits upon. Nothing special.
As much as you might think critical sections (really user mode mutexes) are better/faster/whatever, they probably are not as great as you might think. boost::mutex is a big vast heavyweight thing on Windows internally using a win32 semaphore as the kernel wait object because of the need to emulate thread cancellation and to behave well in a general purpose use context. It's easy to write a concurrency structure which is faster than another for some single use case, but it is very very hard to write a concurrency structure which is all of:
Faster than a standard implementation in the uncontended case.
Faster than a standard implementation in the lightly contended case.
Faster than a standard implementation in the heavily contended case.
Even if you manage all three of the above, that still isn't enough: you also need some guarantees on worst case progression ordering, so whether certain patterns of locks, waits and unlocks produce predictable outcomes. This is why threading facilities can appear to look slow in narrow use case scenarios, so Boost.Thread much as the STL can appear to be much slower than hand rolled locking code in say an uncontended use case.
Boost.Thread already does substantial work in user mode to avoid going to kernel sleep on Windows. On POSIX any of the major pthreads implementations also does substantial work to avoid kernel sleeps and hence Boost.Thread doesn't replicate that work. In other words, critical sections don't gain you anything in terms of scaling to load behaviours, though inevitably Boost.Thread v4 especially on Windows does a ton load of work a naive implementation does not (the planned rewrite of Boost.Thread is vastly more efficient on Windows as it can assume Windows Vista or above).
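The suggestion above to match the (Basic)Lockable concept with an atomic exchange and grab it with a lock_guard can be sketched as follows. This uses std::atomic and std::lock_guard rather than the Boost equivalents, which is an assumption made for brevity; spinlock, counter_lock, counter, add_many and run_spinlock_demo are hypothetical names. Unlike a real Win32 critical section it never falls back to a kernel wait object, it only spins and yields.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical minimal spinlock; lock() and unlock() are all BasicLockable
// requires.
class spinlock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // exchange returns the previous value: true means another thread
        // still holds the lock, so keep trying.
        while (locked_.exchange(true, std::memory_order_acquire))
            std::this_thread::yield();
    }
    void unlock() {
        locked_.store(false, std::memory_order_release);
    }
};

spinlock counter_lock;
long counter = 0;

// Increments the shared counter n times, one short critical section each.
void add_many(int n) {
    for (int i = 0; i < n; ++i) {
        std::lock_guard<spinlock> guard(counter_lock);
        ++counter;
    }
}

// Runs two incrementing threads and returns the final counter value.
long run_spinlock_demo(int n) {
    counter = 0;
    std::thread a(add_many, n), b(add_many, n);
    a.join();
    b.join();
    return counter;
}
```

As the answer points out, something like this is easy to make fast for one narrow case; behaving well under heavy contention is the hard part.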

So, it looks like the default Boost mutex doesn't support it, but asio::detail::mutex does.
So I ended up using that:
#include <boost/asio/detail/mutex.hpp>
#include <boost/thread.hpp>
using boost::asio::detail::mutex;
using boost::lock_guard;
int myFunc()
{
    static mutex mtx;
    lock_guard<mutex> lock(mtx);
    . . .
}

C++ volatile required when spinning on boost::shared_ptr operator bool()? [duplicate]

Possible Duplicate:
When to use volatile with multi threading?
I have two threads referencing the same boost::shared_ptr:
boost::shared_ptr<Widget> shared;
One thread is spinning, waiting for the other thread to reset the boost::shared_ptr:
while(shared)
    boost::thread::yield();
And at some point the other thread will call:
shared.reset();
My question is whether or not I need to declare the shared pointer as volatile to prevent the compiler from optimizing the call to shared.operator bool() out of the loop and never detecting the change? I know that if I were simply looping on a variable, waiting for it to reach 0 I would need volatile, but I'm not sure if boost::shared_ptr is implemented in such a way that it is not necessary here.
EDIT: I'm fully aware that condition variables can be used to solve this problem in a different way. But in this case, the busy loop is very uncommon and contending for the lock on the condition variable is an overhead we would rather not incur.

Rant 1:
That code probably won't do what you think it will. When you write code like that, you're introducing a data race into your code. This is almost certainly a bug that will result in your program non-deterministically failing.
Data structures (including shared_ptr) are generally not meant to be accessed concurrently. Do not modify the same structure at the same time in more than one thread. That could corrupt the structure. Do not modify it in one thread and read it in another thread. The reader could see inconsistent data. Probably multiple threads can read it at the same time.
If you think you really want to do some of the above, find out if the data structure allows some of these behaviors in a section probably titled "Thread Safety." If it does allow them, take a second look at whether your performance really needs this, and then use it. (The documentation on shared_ptr does NOT allow what you're doing.)
Rant 2:
Now, for a higher-level concern, you probably shouldn't be doing thread synchronization by waiting for a pointer to be set to NULL. Really, look at condition variables, barriers, or futures as a way of getting one thread to wait until another is finished with something. It's a nicer interface, and whoever looks at your code next (that includes you in 6 months) will thank you.
I know you're concerned about the performance cost of real synchronization. Don't worry about this. It'll be fine. If you're worried about lock contention, use barriers or futures, which don't have a big shared lock to contend for.
Caveat: there is a time for writing code that avoids locks at all cost. But unless you're looking at profiler data that says your synch ops are too slow for your target workload, this isn't the time.
Rant 3:
I hope that shared in your example is global. Otherwise, you have multiple threads with local references to the same shared_ptr that points to the real object you're interested in. It kind of defeats the purpose of having a reference-counted pointer. Just please tell me it's global.

What you actually should do is to use condition variables. Busy waits are evil.
Edit: also, depending on your task, futures may be an even cleaner way to achieve what you want.
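A rough sketch of the condition-variable approach recommended above, in place of spinning on the pointer. It uses std::shared_ptr and the C++11 standard library instead of Boost, which is an assumption; Widget is a stand-in type, and shared_guard, shared_changed, wait_until_reset, reset_shared and run_reset_demo are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>

struct Widget {};  // stand-in for the question's Widget type

std::mutex shared_guard;                 // protects `shared`
std::condition_variable shared_changed;  // signalled after `shared` changes
std::shared_ptr<Widget> shared = std::make_shared<Widget>();

// Waiter: sleeps (instead of spinning) until the pointer has been reset.
void wait_until_reset() {
    std::unique_lock<std::mutex> lock(shared_guard);
    shared_changed.wait(lock, [] { return !shared; });
}

// Resetter: clears the pointer under the lock, then wakes any waiter.
void reset_shared() {
    {
        std::lock_guard<std::mutex> lock(shared_guard);
        shared.reset();
    }
    shared_changed.notify_all();
}

// Runs one waiter and one resetter; returns true if the pointer ended up empty.
bool run_reset_demo() {
    shared = std::make_shared<Widget>();
    std::thread waiter(wait_until_reset);
    std::thread resetter(reset_shared);
    waiter.join();
    resetter.join();
    return shared == nullptr;
}
```

Because every read and write of the pointer happens under the same mutex, there is no data race and no need for volatile.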

pthread-win32 extension sem_post_multiple

I am currently building a thin C++ wrapper around pthreads for in-house use. Windows as well as QNX are targeted, and fortunately the pthreads-win32 port seems to work very well, while QNX is conformant to POSIX for our practical purposes.
Now, while implementing semaphores, I hit the function
sem_post_multiple(sem_t*, int)
which is apparently only available on pthreads-win32 but is missing from QNX. As the name suggests, the function is supposed to increment the semaphore by the count given as the second argument. As far as I can tell, the function is part of neither POSIX 1b nor POSIX 1c.
Although there is currently no requirement for said function, I am still wondering why pthreads-win32 provides it and whether it might be of use. I could try to mimic it for QNX using something similar to the following:
void sem_post_multiple_qnx(sem_t* sem, int count)
{
    for (; count > 0; --count)
    {
        sem_post(sem);
    }
}
What I am asking for are suggestions/advice on how to proceed. If the consensus is to implement the function for QNX, I would also appreciate comments on whether the suggested code snippet is a viable solution.
Thanks in advance.
PS: I deliberately left out my fancy C++ class for clarity. For all folks suggesting boost to the rescue: it is not an option in my current project due to management reasons.

In any case, semaphores are an optional extension in POSIX. E.g., OS X doesn't seem to implement them fully. So if you are concerned with portability, you'd have to provide wrappers for the functionality you need anyhow.
Your approach of emulating an atomic increment by iterated sem_post certainly has downsides:
1. It might perform badly, whereas sem_t is usually used in performance-critical contexts.
2. The operation would not be atomic, so confusing things might happen before you finish the loop.
I would stick to what is strictly necessary and strictly POSIX conforming. Beware that sem_timedwait is yet another optional part of the semaphore option.

Your proposed implementation of sem_post_multiple doesn't play nicely with sem_getvalue, since sem_post_multiple is an atomic increase and therefore it's not possible for a "simultaneous" call to sem_getvalue to return any of the intermediate values.
Personally I'd want to leave them both out: trying to add fundamental synchronization operations to a system which lacks them is a mug's game, and your wrapper might soon cease to be "thin". So don't get into it unless you have code that uses sem_post_multiple, that you absolutely have to port.

sem_post_multiple() is a non-standard helper function introduced by the win32-pthreads maintainers. Your implementation isn't the same as theirs because the multiple increments aren't atomic. Whether or not this is a problem depends on the intended use. (Personally, I wouldn't try to implement this function unless/until the need arises.)

This is an interesting question. +1.
I agree with the current prevailing consensus here that it is probably not a good idea to implement that function. While your proposed implementation would probably work just fine in most situations, there are definitely conditions in which the results could be dramatically different due to the non-atomicity. The following is one (extremely) contrived situation:
Start thread A which calls sem_post_multiple( s, 10 )
Thread B waiting on s is released. Thread B kills thread A.
In the above unfriendly scenario, the atomic version would have incremented the semaphore by 10. With non-atomic version, it may only be incremented once. This example is certainly not likely in the real world. For example, the killing of a thread is almost always a bad idea not to mention the fact that it could leave the semaphore object in an invalid state. The Win32 implementation could leave a mutex lock on the semaphore - see this for why.
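For comparison with the looped sem_post discussed in this thread, here is a sketch of what an atomic multi-post looks like when the semaphore is built from a mutex and a condition variable. This is not the pthreads-win32 implementation and not a drop-in replacement for sem_t; the class counting_semaphore_sketch, its member names and post_then_wait_once are all hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical counting semaphore built on a mutex and a condition variable.
class counting_semaphore_sketch {
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
public:
    explicit counting_semaphore_sketch(int initial = 0) : count_(initial) {}

    // Adds n to the count in one locked step, so no other thread can observe
    // a partially applied increase.
    void post_multiple(int n) {
        if (n <= 0) return;
        {
            std::lock_guard<std::mutex> lock(m_);
            count_ += n;
        }
        cv_.notify_all();
    }

    // Blocks until the count is positive, then takes one unit.
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }

    // Returns the current count (the analogue of sem_getvalue).
    int value() {
        std::lock_guard<std::mutex> lock(m_);
        return count_;
    }
};

// Posts n units, consumes one, and returns what is left.
int post_then_wait_once(int n) {
    counting_semaphore_sketch sem;
    sem.post_multiple(n);
    sem.wait();
    return sem.value();
}
```

With this design, a concurrent value() call sees either the old count or the old count plus n, never an intermediate value, which is the property the looped version lacks.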

Overhead of pthread mutexes?

I'm trying to make a C++ API (for Linux and Solaris) thread-safe, so that its functions can be called from different threads without breaking internal data structures. In my current approach I'm using pthread mutexes to protect all accesses to member variables. This means that a simple getter function now locks and unlocks a mutex, and I'm worried about the overhead of this, especially as the API will mostly be used in single-threaded apps where any mutex locking seems like pure overhead.
So, I'd like to ask:
do you have any experience with performance of single-threaded apps that use locking versus those that don't?
how expensive are these lock/unlock calls, compared to eg. a simple "return this->isActive" access for a bool member variable?
do you know better ways to protect such variable accesses?

All modern thread implementations can handle an uncontended mutex lock entirely in user space (with just a couple of machine instructions) - only when there is contention, the library has to call into the kernel.
Another point to consider is that if an application doesn't explicitly link to the pthread library (because it's a single-threaded application), it will only get dummy pthread functions (which don't do any locking at all) - only if the application is multi-threaded (and links to the pthread library), the full pthread functions will be used.
And finally, as others have already pointed out, there is no point in protecting a getter method for something like isActive with a mutex - once the caller gets a chance to look at the return value, the value might already have been changed (as the mutex is only locked inside the getter method).

"A mutex requires an OS context switch. That is fairly expensive. "
This is not true on Linux, where mutexes are implemented using something called futexes. Acquiring an uncontested (i.e., not already locked) mutex is, as cmeerw points out, a matter of a few simple instructions, and is typically in the area of 25 nanoseconds with current hardware.
For more info:
Futex
Numbers everybody should know

This is a bit off-topic but you seem to be new to threading - for one thing, only lock where threads can overlap. Then, try to minimize those places. Also, instead of trying to lock every method, think of what the thread is doing (overall) with an object and make that a single call, and lock that. Try to get your locks as high up as possible (this again increases efficiency and may /help/ to avoid deadlocking). But locks don't 'compose', you have to mentally at least cross-organize your code by where the threads are and overlap.

I did a similar library and didn't have any trouble with lock performance. (I can't tell you exactly how they're implemented, so I can't say conclusively that it's not a big deal.)
I'd go for getting it right first (i.e. use locks) then worry about performance. I don't know of a better way; that's what mutexes were built for.
An alternative for single thread clients would be to use the preprocessor to build a non-locked vs locked version of your library. E.g.:
#ifdef BUILD_SINGLE_THREAD
inline void lock () {}
inline void unlock () {}
#else
inline void lock () { doSomethingReal(); }
inline void unlock () { doSomethingElseReal(); }
#endif
Of course, that adds an additional build to maintain, as you'd distribute both single and multithread versions.

I can tell you from Windows, that a mutex is a kernel object and as such incurs a (relatively) significant locking overhead. To get a better performing lock, when all you need is one that works in threads, is to use a critical section. This would not work across processes, just the threads in a single process.
However, Linux is quite a different beast when it comes to multi-process locking. I know that a mutex there is implemented using atomic CPU instructions and only applies to a single process, so it would have the same performance as a win32 critical section, i.e. be very fast.
Of course, the fastest locking is not to have any at all, or to use them as little as possible (but if your lib is to be used in a heavily threaded environment, you will want to lock for as short a time as possible: lock, do something, unlock, do something else, then lock again is better than holding the lock across the whole task - the cost of locking isn't in the time taken to lock, but the time a thread sits around twiddling its thumbs waiting for another thread to release a lock it wants!)

A mutex requires an OS context switch. That is fairly expensive. The CPU can still do it hundreds of thousands of times per second without too much trouble, but it is a lot more expensive than not having the mutex there. Putting it on every variable access is probably overkill.
It also probably is not what you want. This kind of brute-force locking tends to lead to deadlocks.
do you know better ways to protect such variable accesses?
Design your application so that as little data as possible is shared. Some sections of code should be synchronized, probably with a mutex, but only those that are actually necessary. And typically not individual variable accesses, but tasks containing groups of variable accesses that must be performed atomically. (perhaps you need to set your is_active flag along with some other modifications. Does it make sense to set that flag and make no further changes to the object?)
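The advice above, locking a whole task rather than each variable access, might look like the following sketch. The class Connection, its members and activate_once are invented for illustration and do not come from the question's API.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical API object; member names are illustrative only.
class Connection {
public:
    // A consistent view of both fields, copied out under one lock.
    struct Snapshot {
        bool active;
        int generation;
    };

    // One lock covers the whole logical change, not each variable separately.
    void activate() {
        std::lock_guard<std::mutex> lock(m_);
        is_active_ = true;
        ++generation_;
    }

    // Readers get both values from the same moment in time.
    Snapshot snapshot() const {
        std::lock_guard<std::mutex> lock(m_);
        return Snapshot{is_active_, generation_};
    }

private:
    mutable std::mutex m_;
    bool is_active_ = false;
    int generation_ = 0;
};

// Activates a fresh connection once and returns its state.
Connection::Snapshot activate_once() {
    Connection c;
    c.activate();
    return c.snapshot();
}
```

A caller that needs both values takes the lock once and can never see the flag set without the matching generation bump.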

I was curious about the expense of using pthread_mutex_lock/unlock.
I had a scenario where I needed to either copy anywhere from 1500 to 65K bytes without using a mutex, or use a mutex and do a single write of a pointer to the data needed.
I wrote a short loop to test each:
gettimeofday(&starttime, NULL);
/* COPY DATA */
gettimeofday(&endtime, NULL);
timersub(&endtime, &starttime, &timediff);
/* print out timediff data */
or
gettimeofday(&starttime, NULL);
pthread_mutex_lock(&mutex);
gettimeofday(&endtime, NULL);
pthread_mutex_unlock(&mutex);
timersub(&endtime, &starttime, &timediff);
/* print out timediff data */
If I was copying less than 4000 or so bytes, then the straight copy operation took less time. If however I was copying more than 4000 bytes, then it was less costly to do the mutex lock/unlock.
The timing on the mutex lock/unlock ran between 3 and 5 usec long including the time for
the gettimeofday for the currentTime which took about 2 usec
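Since the gettimeofday call itself accounts for a large share of the figures above, a variant that times many lock/unlock pairs in one go may give a cleaner estimate. This is a sketch, not the poster's test program; average_lock_unlock_ns is a hypothetical helper and the result will vary by machine and load.

```cpp
#include <cassert>
#include <chrono>
#include <pthread.h>

// Returns the average cost in nanoseconds of one uncontended
// pthread_mutex_lock/pthread_mutex_unlock pair, measured over `iterations`
// pairs so that the clock overhead is paid only once.
double average_lock_unlock_ns(int iterations) {
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        pthread_mutex_lock(&mutex);
        pthread_mutex_unlock(&mutex);
    }
    auto end = std::chrono::steady_clock::now();
    pthread_mutex_destroy(&mutex);
    std::chrono::duration<double, std::nano> elapsed = end - start;
    return elapsed.count() / iterations;
}
```

Calling it with something like `average_lock_unlock_ns(1000000)` measures only the uncontended case; contended locking behaves very differently.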

For member variable access, you should use read/write locks, which have slightly less overhead and allow multiple concurrent reads without blocking.
In many cases you can use atomic builtins, if your compiler provides them (if you are using gcc or icc, __sync_fetch*() and the like), but they are notoriously hard to handle correctly.
If you can guarantee the access being atomic (for example on x86 a dword read or write is always atomic if it is aligned, but a read-modify-write is not), you can often avoid locks altogether and use volatile instead, but this is non-portable and requires knowledge of the hardware.
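The read/write lock suggestion at the start of this answer could be sketched with the POSIX pthread_rwlock_t API as below. The class Settings, its timeout_ms member and set_then_get are hypothetical, and error checking of the pthread calls is omitted for brevity.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical class: one member guarded by a POSIX read/write lock.
class Settings {
    mutable pthread_rwlock_t lock_;
    int timeout_ms_;
public:
    Settings() : timeout_ms_(0) { pthread_rwlock_init(&lock_, nullptr); }
    ~Settings() { pthread_rwlock_destroy(&lock_); }
    Settings(const Settings&) = delete;
    Settings& operator=(const Settings&) = delete;

    // Getter: takes the lock in shared mode, so readers do not block
    // each other.
    int timeout_ms() const {
        pthread_rwlock_rdlock(&lock_);
        int value = timeout_ms_;
        pthread_rwlock_unlock(&lock_);
        return value;
    }

    // Setter: takes the lock exclusively, blocking readers and other writers.
    void set_timeout_ms(int value) {
        pthread_rwlock_wrlock(&lock_);
        timeout_ms_ = value;
        pthread_rwlock_unlock(&lock_);
    }
};

// Writes a value and reads it back through the locked accessors.
int set_then_get(int value) {
    Settings s;
    s.set_timeout_ms(value);
    return s.timeout_ms();
}
```

Whether this beats a plain mutex depends on the read/write mix, so it is worth measuring before committing to it.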

Well a suboptimal but simple approach is to place macros around your mutex locks and unlocks. Then have a compiler / makefile option to enable / disable threading.
Ex.
#ifdef THREAD_ENABLED
#define pthread_mutex_lock(x) ... //actual mutex call
#endif
#ifndef THREAD_ENABLED
#define pthread_mutex_lock(x) ... //do nothing
#endif
Then when compiling do a gcc -DTHREAD_ENABLED to enable threading.
Again I would NOT use this method in any large project. But only if you want something fairly simple.
