difference between lock, memory barrier, semaphore - c++

This article: http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf (page 12)
seems to make a difference between a lock and a memory barrier
I would like to know what the difference is between a lock, memory barrier, and a semaphore?
(While other questions might mention the difference between a lock and a synchronisation object, I found none about the difference between a lock and a memory barrier)

A memory barrier (also known as a fence) is a hardware operation, which
ensures the ordering of different reads and writes to the globally
visible store. On a typical modern processor, memory accesses are
pipelined, and may occur out of order. A memory barrier ensures that
this doesn't happen. A full memory barrier will ensure that all loads
and stores which precede it occur before any load or store which follows
it. (Many processors have support partial barriers; e.g. on a Sparc, a
membar #StoreStore ensures that all stores which occur before it will
be visible to all other processes before any store which occurs after
it.)
That's all a memory barrier does. It doesn't block the thread, or
anything.
Mutexes and semaphores are higher level primatives, implemented in the
operating system. A thread which requests a mutex lock will block, and
have its execution suspended by the OS, until that mutex is free. The
kernel code in the OS will contain memory barrier instructions in order
to implement a mutex, but it does much more; a memory barrier
instruction will suspend the hardware execution (all threads) until the
necessary conditions have been met—a microsecond or so at the
most, and the entire processor stops for this time. When you try to
lock a mutex, and another thread already has it, the OS will suspend
your thread (and only your thread—the processor continues to
execute other threads) until whoever holds the mutex frees it, which
could be seconds, minutes or even days. (Of course, if it's more than a
few hundred milliseconds, it's probably a bug.)
Finally, there's not really much difference between semaphores and
mutexes; a mutex can be considered a semaphore with a count of one.

A memory barrier is a method to order memory access. Compilers and CPU's can change this order to optimize, but in multithreaded environments, this can be an issue. The main difference with the others is that threads are not stopped by this.
A lock or mutex makes sure that code can only be accessed by 1 thread. Within this section, you can view the environment as singlethreaded, so memory barriers should not be needed.
a semaphore is basically a counter that can be increased (v()) or decreased (p()). If the counter is 0, then p() halts the thread until the counter is no longer 0. This is a way to synchronize threads, but I would prefer using mutexes or condition variables (controversial, but that's my opinion). When the initial counter is 1, then the semaphore is called a binary semaphore and it is similar to a lock.
A big difference between locks and semaphores is that the thread owns the lock, so no other thread should try to unlock, while this is not the case for semaphores.

Simples explanation for now.
Lock
Is an atomic test for whether this piece of code can proceed
lock (myObject)
{
// Stuff to do when I acquire the lock
}
This is normally a single CPU instruction that tests and sets a variable as a single atomic operation. More here, http://en.wikipedia.org/wiki/Test-and-set#Hardware_implementation_of_test-and-set_2
Memory Barrier
Is a hint to the processor that it can not execute these instructions out of order. Without it, the instructions could be executed out of order, like in double checked locking, the two null checks could be executed before the lock has.
Thread.MemoryBarrier();
public static Singleton Instance()
{
if (_singletonInstance == null)
{
lock(myObject)
{
if (_singletonInstance == null)
{
_singletonInstance = new Singleton();
}
}
}
}
This are also a set of CPU instructions that implement memory barriers to explicitly tell a CPU it can not execute things out of order.
Semaphores
Are similar to locks except they normally are for more than one thread. I.e. if you can handle 10 concurrent disk reads for example, you'd use a semaphore. Depending on the processor this is either its own instruction, or a test and set instruction with load more work around interrupts (for e.g. on ARM).

Related

What makes the boost::shared_mutex so slow

I used google benchmark to run following 3 tests and the result surprised me since the RW lock is ~4x slower than simple mutex in release mode. (and ~10x slower than simple mutex in debug mode)
void raw_access() {
(void) (gp->a + gp->b);
}
void mutex_access() {
std::lock_guard<std::mutex> guard(g_mutex);
(void) (gp->a + gp->b);
}
void rw_mutex_access() {
boost::shared_lock<boost::shared_mutex> l(g_rw_mutex);
(void) (gp->a + gp->b);
}
the result is:
2019-06-26 08:30:45
Running ./perf
Run on (4 X 2500 MHz CPU s)
CPU Caches:
L1 Data 32K (x2)
L1 Instruction 32K (x2)
L2 Unified 262K (x2)
L3 Unified 4194K (x1)
Load Average: 5.35, 3.22, 2.57
-----------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------
BM_RawAccess 1.01 ns 1.01 ns 681922241
BM_MutexAccess 18.2 ns 18.2 ns 38479510
BM_RWMutexAccess 92.8 ns 92.8 ns 7561437
I didn't get the enough information via google, so hope some help here.
Thanks
I don't know the particulars of how the standard library/boost/etc. implementations differ, although it seems like the standard library version is faster (congrats, whoever wrote it).
So instead I'll try to explain the speed differences between various mutex types on a theoretical level, which will explain why a shared mutex (should) be slower.
Atomic Spin Lock
More-so as an academic exercise, consider the simplest thread-safety "mutex-like" implementation: a simple atomic spin lock.
In essence, this is nothing more than an std::atomic<bool> or an std::atomic_flag. It is initialized to false. To "lock" the mutex, you simply do an atomic compare-and-exchange operation in a loop until you get a false value (i.e. the previous value was false prior to atomically setting it to true).
std::atomic_flag flag = ATOMIC_FLAG_INIT;
// lock it by looping until we observe a false value
while (flag.test_and_set()) ;
// do stuff under "mutex" lock
// unlock by setting it back to false state
flag.clear();
However, due to the nature of this construct, it's what we call an "unfair" mutex because the order of threads that acquire the lock is not necessarily the order in which they began their attempts to lock it. That is, under high contention, it's possible a thread tries to lock and simply never succeed because other threads are luckier. It's very timing-sensitive. Imagine musical chairs.
Because of this, although it functions like a mutex, it's not what we consider a "mutex".
Mutex
A mutex can be thought of as building on top of an atomic spin lock (although it's not typically implemented as such, since they typically are implemented with support of the operating system and/or hardware).
In essence, a mutex is a step above atomic spin locks because it has a queue of waiting threads. This allows it to be "fair" because the order of lock acquisition is (more or less) the same as the order of locking attempts.
If you've noticed, if you run sizeof(std::mutex) it might be a bit larger than you might expect. On my platform it's 40 bytes. That extra space is used to hold state information, notably including some way of accessing a lock queue for each individual mutex.
When you try to lock a mutex, it performs some low-level thread-safety operation to have thread-safe access to the mutex's status information (e.g. atomic spin lock), checks the state of the mutex, adds your thread to the lock queue, and (typically) puts your thread to sleep while you wait so you don't burn precious CPU time. The low-level thread-safety operation (e.g. the atomic spin lock) is atomically released at the same time the thread goes to sleep (this is typically where OS or hardware support is necessary to be efficient).
Unlocking is performed by performing a low-level thread-safe operation (e.g. atomic spin lock), popping the next waiting thread from the queue, and waking it up. The thread that has been awoken now "owns" the lock. Rinse wash and repeat.
Shared Mutex
A shared mutex takes this concept a step further. It can be owned by a single thread for read/write permissions (like a normal mutex), or by multiple threads for read-only permissions (semantically, anyway - it's up to the programmer to ensure it's safe).
Thus, in addition to the unique ownership queue (like a normal mutex) it also has a shared ownership state. The shared ownership state could be simply a count of the number of threads that currently have shared ownership. If you inspect sizeof(std::shared_mutex) you'll find it's typically even larger than std::mutex. On my system, for instance, it's 56 bytes.
So when you go to lock a shared mutex, it has to do everything a normal mutex does, but additionally has to verify some other stuff. For instance, if you're trying to lock uniquely it must verify that there are no shared owners. And when you're trying to lock sharingly it must verify that there are no unique owners.
Because we typically want mutexes to be "fair", once a unique locker is in the queue, future sharing lock attempts must queue instead of acquiring the lock, even though it might currently be under sharing (i.e. non-unique) lock by several threads. This is to ensure shared owners don't "bully" a thread that wants unique ownership.
But this also goes the other way: the queuing logic must ensure a shared locker is never placed into an empty queue during shared ownership (because it should immediately succeed and be another shared owner).
Additionally, if there's a unique locker, followed by a shared locker, followed by a unique locker, it must (roughly) guarantee that acquisition order. So each entry in the lock queue also needs a flag denoting its purpose (i.e. shared vs. unique).
And then we think of the wake-up logic. When you unlock a shared mutex, the logic differs depending on the current ownership type of the mutex. If the unlocking thread has unique ownership or is the last shared owner it might have to wake up some thread(s) from the queue. It will either wake up all threads at the front of the queue who are requesting shared ownership, or a single thread at the front of the queue requesting unique ownership.
As you can imagine, all of this extra logic for who's locking for what reasons and how it changes depending not only on the current owners but also on the contents of the queue makes this potentially quite a bit slower. The hope is that you read significantly more frequent than you write, and thus you can have many sharing owners running concurrently, which mitigates the performance hit of coordinating all of that.

Mutex granularity

I have a question regarding threads. It is known that basically when we call for mutex(lock) that means that thread keeps on executing the part of code uninterrupted by other threads until it meets mutex(unlock). (At least that's what they say in the book) So my question is if it is actually possible to have several scoped WriteLocks which do not interfere with each other. For example something like this:
If I have a buffer with N elements without any new elements coming, however with high frequency updates (like change value of Kth element) is it possible to set a different lock on each element so that the only time threads would stall and wait is if actually 2 or more threads are trying to update the same element?
To answer your question about N mutexes: yes, that is indeed possible. What resources are protected by a mutex depends entirely on you as the user of that mutex.
This leads to the first (statement) part of your question. A mutex by itself does not guarantee that a thread will work uninterrupted. All it guarantees is MUTual EXclusion - if thread B attempts to lock a mutex which thread A has locked, thread B will block (execute no code) until thread A unlocks the mutex.
This means mutexes can be used to guarantee that a thread executes a block of code uninterrupted; but this works only if all threads follow the same mutex-locking protocol around that block of code. Which means it is your responsibility to assign semantics (or meaning) to each individual mutex, and correctly adhere to those semantics in your code.
If you decide for the semantics to be "I have an array a of N data elements and an array m of N mutexes, and accessing a[i] can only be done when m[i] is locked," then that's how it will work.
The need to consistently stick to the same protocol is why you should generally encapsulate the mutex and the code/data protected by it in a class in some way or another, so that outside code doesn't need to know the details of the protocol. It just knows "call this member function, and the synchronisation will happen automagically." This "automagic" will be the class correcrtly implementing the protocol.
A crucial consideration when deciding between a mutex per array and a mutex per element is whether there are operations - like tracking the number of "in-use" array elements, the "active" element, or moving a pointer-to-array to a larger buffer - that can only be done safely by one thread while all the others are blocked.
A lesser but sometimes important consideration is the amount of extra memory more mutexes use.
If you genuinely need to do this kind of update as quickly as possible in a highly contested multi-threaded program, you may also want to learn about lock-free atomic types and their compare-and-swap / exchange operations, but I'd recommend against considering that unless profiling the existing locking is significant in your overall program performance.
A mutex does not stop other threads from running completely, it only stops other threads from locking the same mutex. I.e. while one thread is keeping the mutex locked, the operating system continues to do context switches letting other threads run also, but if any other thread is trying to lock the same mutex its execution will be halted until the mutex is unlocked.
So yes, you can indeed have several different mutexes and lock/unlock them independently. Just beware of deadlocks, i.e. if one thread can lock more than one mutex at a time you can run into a situation where thread 1 has locked mutex A and is trying to lock mutex B but blocks because thread 2 already has mutex B locked and it is trying to lock mutex A..
Its not completely clear that your use case is:
the threads gets a buffer assigned on that they have to work
the threads have some results and request a special buffer to update.
On the first variant you need some assignment logic that assigns a buffer to a thread.
This logic has to be exectued in an atomic way. so the best is to use a mutex to protect the assignment logic.
On the other variant it may be the best to have a vector of mutexes, one for each buffer element.
In Both cases the buffer does not need a protection because it (or better each field of it) is only accessed from one thread at a time.
You also may inform yourself about 'semaphores'. These contain a counter that allows to manage ressources that have a limited amount but more than one. Mutexes are a special case of semaphores with n=1.
You can have mutex per entry, C++11 mutex can be easily converted into an adaptive-spinlock, so you can achieve good CPU/Latency tradeoff.
Or, if you need very low latency yet have enough CPUs you can use an atomic "busy" flag per entry and spin in a tight compare-exchange loop on contention.
From experience, though, the best performance and scalability are achieved when concurrent writes are serialized via a command queue (or a queue of smaller immutable buffers to be concatenated at destination) and a single thread processing the queue.

Creating Mutexes with Binary Semaphore

I'm working with a simple system that does NOT have mutexes, but rather a limited array of hardware binary semaphores. Typically, all multithreading is been done with heavy Semaphore techniques that makes code both poor in performance and difficult to write correctly without deadlocks.
A naive implementation is to use one semaphore globally in order to ensure atomic access to a critical section. However, this means that unrelated objects (even of different types) will block if any critical section is being accessed.
My current solution to this problem is to use a single global semaphore to ensure atomic access to a guard byte that then ensures atomic access to the particular critical section. I currently have this so far:
while (true) {
while (mutexLock == Mutex::Locked) {
} //wait on mutex
Semaphore semaLock(SemaphoreIndex::Mutex); //RAII semaphore object
if (mutexLock == Mutex::Unlocked) {
mutexLock = Mutex::Locked;
break;
}
} //Semaphore is released by destructor here
// ... atomically safe code
mutexLock = Mutex::Unlocked;
I have a few questions: Is this the best way to approach this problem? Is this code thread-safe? Is this the same as a "double checked lock"? If so, does it suffer from the same problems and therefore need memory barriers?
EDIT: A few notes on the system this is being implemented on...
It is a RISC 16-bit processor with 32kB RAM. While it has heavy multithreading capabilities, its memory model is very primitive. Loads and stores are atomic, there is no caching, no branch prediction or branch target prediction, one core with many threads. Memory barriers are mostly for the compiler to know it should reload memory into general purpose registers, not for any hardware reason (no cache)
Even if the processor has heavy multithreading capabilities (not sure what you mean by that btw.), but is a single core processor, it still means that only one thread can execute at any one time.
A task switching system must be employed (which would normally run in a privileged mode) Using this system you must be able to define critical sections to atomically execute a (software implemented) mutex lock/unlock.
When you say "one core, many threads" does that mean that you have some kind of kernel running on your processor? The task switching system will be implemented by that kernel.
It might pay to look through the documentation of your kernel or ask your vendor for any tips.
Good luck.
Right now we still don't have that much information about your system (for example, what kind of registers are available for each instruction in parrallel? do you use bank architecture?, how many simultaneous instructions can you actually execute?) but hopefully what I suggest will help you
If I understand your situation you have a piece of hardware that does not have true cores, but simply MIMD ability via a vectorized operation (based on your reply). with a It is a "RISC 16-bit processor with 32kB RAM" where:
Loads and stores are atomic, there is no caching, no branch prediction or branch target prediction, one core with many threads
The key here is that loads and stores are atomic. Note you won't be able to do larger than 16bit load and stores atomically, since they will be compiled to two separate atomic instructions (and thus not being atomic itself).
Here is the functionality of a mutex:
attempt to lock
unlock
To lock, you might run into issues if each resource attempts to lock. For example say in your hardware N = 4 (number of processes to be run in parrallel). If instruction 1 (I1) and I2 try to lock, they will both be successful in locking. Since your loads and stores are atomic, both processes see "unlocked" at the same time, and both acquire the lock.
This means you can't do the following:
if mutex_1 unlocked:
lock mutex_1
which might look like this in an arbitrary assembly language:
load arithmetic mutex_addr
or arithmetic immediate(1) // unlocked = 0000 or 1 = 0001, locked = 0001 or 1 = 0001
store mutex_addr arithmetic // not putting in conditional label to provide better synchronization.
jumpifzero MUTEXLABEL arithmetic
To get around this you will need to have each "thread" either know if its currently getting a lock some one else is or avoid simultaneous lock access entirely.
I only see one kind of way this can be done on your system (via flag/mutex id checking). Have a mutex id associated with each thread for each mutex it is currently checking, and check for all other threads to see if you can actually acquire a lock. Your binary semaphores don't really help here because you need to associate an individual mutex with a semaphore if you were going to use it (and still have to load the mutex from ram).
A simple implementation of the check every thread unlock and lock, basically each mutex has an ID and a state, in order to avoid race conditions per instruction, the current mutex being handled is identified well before it is actually acquired. Having the "identify which lock you want to use" and "actually try to get the lock" come in two steps stops accidental acquisition on simultaneous access. With this method you can have 2^16-1 (because 0 is used to say no lock was found) mutexes and your "threads" can exist on any instruction pipe.
// init to zero
volatile uint16_t CURRENT_LOCK_ATTEMPT[NUM_THREADS]{0};
// make thread id associated with priority
bool tryAcqureLock(uint16_t mutex_id, bool& mutex_lock_state){
if(mutex_lock_state == false){
// do not actually attempt to take the lock until checked everything.
// No race condition can happen now, you won't have actually set the lock
// if two attempt to acquire the same lock at the same time, you'll both
// be able to see some one else is as well.
CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = mutex_id;
//checking all lower threads, need some sort of priority
//predetermined to figure out locking.
for( int i = 0; i < MY_THREAD_ID; i++ ){
if((CURRENT_LOCK_ATTEMPT[i] == mutex_id){
//clearing bit.
CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
return false;
}
}
// make sure to lock mutex before clearing which mutex you are currently handling
mutex_lock_state = true;
CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
return true;
}
return false;
}
// its your fault if you didn't make sure you owned the lock in the first place
// if you did own it, theres no race condition, because of atomic store load.
// if you happen to set the state while another instruction is attempting to
// acquire the lock they simply wont get the lock and no race condition occurs
bool unlock(bool& mutex_lock_state){
mutex_lock_state = false;
}
If you want more equal access of resources you could change indexing instead of being based on i = 0 to i < MY_THREAD_ID, you randomly pick a "starting point" to circle around back to MY_THREAD_ID using modulo arithmetic. IE:
bool tryAcqureLock(uint16_t mutex_id, bool& mutex_lock_state, uint16_t per_mutex_random_seed){
if(mutex_lock_state == false){
CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = mutex_id;
//need a per thread linear congruence generator for speed and consistency
std::minstd_rand0 random(per_mutex_random_seed)
for(int i = random() % TOTAL_NUM_THREADS; i != MY_THREAD_ID i = (i + 1) % TOTAL_NUM_THREADS)
{
//same as before
}
// if we actually acquired the lock
GLOBAL_SEED = global_random() // use another generator to set the next seed to be used
mutex_lock_state = true;
CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
return true;
}
return false;
}
In general your lack of test and set ability really throws a wrench into everything, meaning you are forced to use other algorithms for mutexes. For more information on other algorithms that you can use for non test and set architectures check out this SO post, and these wikipedia algorithms which only rely on atomic loads and stores:
Dekker's Algorithm
Eisenberg Mcguire Algorithm
Peterson's Algorithm
Lamport's Algorithm
Szymanski's Algorithm
All of these algorithms basically decompose into checking a set of flags to see if you can access the resource safely by going through every one elses flags.

Is a ticket lock bounded wait-free? (under certain assumptions)

I am talking about a ticket lock that might look as follows (in pseudo-C syntax):
unsigned int ticket_counter = 0, lock_counter = 0;
void lock() {
unsigned int my_ticket = fetch_and_increment(ticket_counter);
while (my_ticket != lock_counter) {}
}
void unlock() {
atomic_increment(lock_counter);
}
Let's assume such a ticket lock synchronizes access to a critical section S that is wait-free, i.e. executing the critical section takes exactly c cycles/instructions. Assuming, that there are at most p threads in the system, is the synchronization of S using the ticket lock bounded wait-free?
In my opinion, it is, since the ticket lock is fair and thus the upper bound for waiting is O(p * c).
Do I make a mistake? I am little bit confused. I always thought locking implies not to be (bounded) wait-free because of the following statement:
"It is impossible to construct a wait-free implementation of a queue, stack, priority queue, set, or list from a set of atomic register." (Corollary 5.4.1 in Art of Multiprocessor Programming, Herlihy and Shavit)
However, if the ticket lock (and maybe any other fair locking mechanism) is bounded wait-free under the mentioned assumptions, it (might) enables the construction of bounded wait-free implementations of queue, stack, etc. (That's the question I am actually faced with.)
Recall the definition of bounded wait-free in "Art of Multiprocessor Programming", p.59 by Herlihy and Shavit:
"A method is wait-free if it guarantees that every call finishes its execution in a finite number of steps. It is bounded wait-free if there is a bound on the number of steps a method call can take."
Well, I believe you are correct, with some caveats.
Namely, the bounded wait-free property holds only if the critical section S is non-preemptive, which I suppose you can guarantee only for kernel space code (by disabling interrupts in the critical section). Otherwise the OS might decide to switch to another thread while one thread is in the critical section, and then the wait time is unbounded, no?
Also, for kernel code, I suppose p is not the number of software threads but rather the number of hardware threads (or cores, for CPU's that don't support several threads per CPU core). Because at most p software threads will be runnable at a time, and as S is non-preemptive you have no sleeping threads waiting on the lock.

How do I safely read a variable from one thread and modify it from another?

I have a class instances which is being used in multiple threads. I am updating multiple member variables from one thread and reading the same member variables from one thread. What is the correct way to maintain the thread safety?
eg:
phthread_mutex_lock(&mutex1)
obj1.memberV1 = 1;
//unlock here?
Should I unlock the mutex over here? ( if another thread access the obj1 member variables 1 and 2 now, the accessed data might not be correct because memberV2 has not yet be updated. However, if I does not release the lock, the other thread might block because there is time consuming operation below.
//perform some time consuming operation which must be done before the assignment to memberV2 and after the assignment to memberV1
obj1.memberV2 = update field 2 from some calculation
pthread_mutex_unlock(&mutex1) //should I only unlock here?
Thanks
Your locking is correct. You should not release the lock early just to allow another thread to proceed (because that would allow the other thread to see the object in an inconsistent state.)
Perhaps it would be better to do something like:
//perform time consuming calculation
pthread_mutex_lock(&mutex1)
obj1.memberV1 = 1;
obj1.memberV2 = result;
pthread_mutex_unlock(&mutex1)
This of course assumes that the values used in the calculation won't be modified on any other thread.
Its hard to tell what you are doing that is causing problems. The mutex pattern is pretty simple. You Lock the mutex, access the shared data, unlock the mutex. This protects data, becuase the mutex will only let one thread get the lock at a time. Any thread that fails to get the lock has to wait till the mutex is unlocked. Unlocking wakes the waiters up. They will then fight to attain the lock. Losers go back to sleep. The time it takes to wake up might be multiple ms or more from the time the lock is released. Make sure you always unlock the mutex eventually.
Make sure you don't to keep locks locked for a long period of time. Most of the time, a long period of time is like a micro second. I prefer to keep it down around "a few lines of code." Thats why people have suggested that you do the long running calculation outside the lock. The reason for not keeping locks a long time is you increase the number of times other threads will hit the lock and have to spin or sleep, which decreases performance. You also increase the probability that your thread might be pre-empted while owning the lock, which means the lock is enabled while that thread sleeps. Thats even worse performance.
Threads that fail a lock dont have to sleep. Spinning means a thread encountering a locked mutex doesn't sleep, but loops repeatedly testing the lock for a predefine period before giving up and sleeping. This is a good idea if you have multiple cores or cores capable of multiple simultaneous threads. Multiple active threads means two threads can be executing the code at the same time. If the lock is around a small amount of code, then the thread that got the lock is going to be done real soon. the other thread need only wait a couple nano secs before it will get the lock. Remember, sleeping your thread is a context switch and some code to attach your thread to the waiters on the mutex, all have costs. Plus, once your thread sleeps, you have to wait for a period of time before the scheduler wakes it up. that could be multiple ms. Lookup spinlocks.
If you only have one core, then if a thread encounters a lock it means another sleeping thread owns the lock and no matter how long you spin it aint gonna unlock. So you would use a lock that sleeps a waiter immediately in hopes that the thread owning the lock will wake up and finish.
You should assume that a thread can be preempted at any machine code instruction. Also you should assume that each line of c code is probably many machine code instructions. The classic example is i++. This is one statement in c, but a read, an increment, and a store in machine code land.
If you really care about performance, try to use atomic operations first. Look to mutexes as a last resort. Most concurrency problems are easily solved with atomic operations (google gcc atomic operations to start learning) and very few problems really need mutexes. Mutexes are way way way slower.
Protect your shared data wherever it is written and wherever it is read. else...prepare for failure. You don't have to protect shared data during periods of time when only a single thread is active.
Its often useful to be able to run your app with 1 thread as well as N threads. This way you can debug race conditions easier.
Minimize the shared data that you protect with locks. Try to organize data into structures such that a single thread can gain exclusive access to the entire structure (perhaps by setting a single locked flag or version number or both) and not have to worry about anything after that. Then most of the code isnt cluttered with locks and race conditions.
Functions that ultimately write to shared variables should use temp variables until the last moment and then copy the results. Not only will the compiler generate better code, but accesses to shared variables especially changing them cause cache line updates between L2 and main ram and all sorts of other performance issues. Again if you don't care about performance disregard this. However i recommend you google the document "everything a programmer should know about memory" if you want to know more.
If you are reading a single variable from the shared data you probably don't need to lock as long as the variable is an integer type and not a member of a bitfield (bitfield members are read/written with multiple instructions). Read up on atomic operations. When you need to deal with multiple values, then you need a lock to make sure you didn't read version A of one value, get preempted, and then read version B of the next value. Same holds true for writing.
You will find that copies of data, even copies of entire structures come in handy. You can be working on building a new copy of the data and then swap it by changing a pointer in with one atomic operation. You can make a copy of the data and then do calculations on it without worrying if it changes.
So maybe what you want to do is:
lock the mutex
Make a copy of the input data to the long running calculation.
unlock the mutex
L1: Do the calculation
Lock the mutex
if the input data has changed and this matters
read the input data, unlock the mutex and go to L1
updata data
unlock mutex
Maybe, in the example above, you still store the result if the input changed, but go back and recalc. It depends if other threads can use a slightly out of date answer. Maybe other threads when they see that a thread is already doing the calculation simply change the input data and leave it to the busy thread to notice that and redo the calculation (there will be a race condition you need to handle if you do that, and easy one). That way the other threads can do other work rather than just sleep.
cheers.
Probably the best thing to do is:
temp = //perform some time consuming operation which must be done before the assignment to memberV2
pthread_mutex_lock(&mutex1)
obj1.memberV1 = 1;
obj1.memberV2 = temp; //result from previous calculation
pthread_mutex_unlock(&mutex1)
What I would do is separate the calculation from the update:
temp = some calculation
pthread_mutex_lock(&mutex1);
obj.memberV1 = 1;
obj.memberV2 = temp;
pthread_mutex_unlock(&mutex1);