I get the feeling this may be a very general and common situation for which a well-known no-lock solution exists.
In a nutshell, I'm hoping there's an approach like a readers/writer lock, but one that doesn't require the readers to acquire a lock and thus can have better average performance.
Instead there'd be some atomic operations (128-bit CAS) for a reader, and a mutex for a writer. I'd have two copies of the data structure: a read-only one for the normally-successful queries, and an identical copy to be updated under mutex protection. Once the data has been inserted into the writable copy, we make it the new readable copy. The old readable copy then receives the same insertion in turn: the writer spins on the number of readers left until it's zero, then modifies that copy as well, and finally releases the mutex.
Or something like that.
Anything along these lines exist?
If your data fits in a 64-bit value, most systems can cheaply read/write that atomically, so just use std::atomic<my_struct>.
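For instance (a minimal sketch; the struct name and fields are mine, not from any particular API), a trivially copyable 8-byte struct typically compiles to a single load or store instruction on x86-64:

#include <atomic>
#include <cstdint>

struct Pair { uint32_t a, b; };        // 8 bytes, trivially copyable

std::atomic<Pair> shared{Pair{0, 0}};

void writer() {
    shared.store(Pair{1, 2}, std::memory_order_release);  // one plain store
}

Pair reader() {
    // One plain load; check shared.is_lock_free() if in doubt whether the
    // implementation fell back to a hidden lock for this size.
    return shared.load(std::memory_order_acquire);
}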
For smallish and/or infrequently-written data, there are a couple ways to make readers truly read-only on the shared data, not having to do any atomic RMW operations on a shared counter or anything. This allows read-side scaling to many threads without readers contending with each other (unlike a 128-bit atomic read on x86 using lock cmpxchg16b (note 1), or taking a RWlock).
Ideally there's just an extra level of indirection via an atomic<T*> pointer (RCU), or just an extra load + compare-and-branch (SeqLock); the read side needs no atomic RMWs and no memory barriers stronger than acquire/release.
This can be appropriate for data that's read very frequently by many threads, e.g. a timestamp updated by a timer interrupt but read all over the place. Or a config setting that typically never changes.
If your data is larger and/or changes more frequently, one of the strategies suggested in other answers that requires a reader to still take a RWlock on something or atomically increment a counter will be more appropriate. This won't scale perfectly because each reader still needs to get exclusive ownership of the shared cache line containing the lock or counter so it can modify it, but there's no such thing as a free lunch.
Note 1: Update: x86 vendors finally decided to guarantee that 128-bit SSE/AVX loads/stores are atomic on CPUs with AVX, so if you're lucky std::atomic<16-byte-struct> has cheap loads when running on a CPU with AVX enabled (e.g. not Pentium/Celeron before Ice Lake). GCC has for a while been indirecting to a libatomic __atomic_load_16 function for 16-byte operations, so runtime dispatch can pick a lock cmpxchg16b strategy on CPUs that support it; now it has a much better option to choose from on some CPUs.
RCU
It sounds like you're half-way to inventing RCU (Read Copy Update) where you update a pointer to the new version.
But remember a lock-free reader might stall after loading the pointer, so you have a deallocation problem. This is the hard part of RCU. In a kernel it can be solved by having sync points where you know that there are no readers older than some time t, and thus can free old versions. There are some user-space implementations. https://en.wikipedia.org/wiki/Read-copy-update and https://lwn.net/Articles/262464/.
For RCU, the less frequent the changes, the larger a data structure you can justify copying. e.g. even a moderate-sized tree could be doable if it's only ever changed interactively by an admin, while readers are running on dozens of cores all checking something in parallel. e.g. kernel config settings are one thing where RCU is great in Linux.
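Here's a hedged sketch of the pointer-publication half (the names are mine). It dodges the reclamation problem by using the atomic shared_ptr free functions, which many standard libraries implement with a small internal lock, so this shows the structure of RCU rather than the zero-overhead read path the real thing achieves:

#include <map>
#include <memory>
#include <mutex>
#include <string>

using Config = std::map<std::string, std::string>;
std::shared_ptr<const Config> g_config = std::make_shared<Config>();
std::mutex g_writeMutex;

std::shared_ptr<const Config> readSnapshot() {
    return std::atomic_load(&g_config);            // grab the current version
}

void update(const std::string& k, const std::string& v) {
    std::lock_guard<std::mutex> lk(g_writeMutex);  // one writer at a time
    auto next = std::make_shared<Config>(*std::atomic_load(&g_config)); // copy
    (*next)[k] = v;                                                     // update
    std::atomic_store(&g_config, std::shared_ptr<const Config>(next));  // publish
}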
SeqLock
If your data is small (e.g. a 64-bit timestamp on a 32-bit machine), another good option is a SeqLock. Readers check a sequence counter before/after a non-atomic copy of the data into a private buffer. If the sequence counters match, we know there wasn't tearing. (Writers mutually exclude each other with a separate mutex.) See Implementing 64 bit atomic counter with 32 bit atomics and How to implement a seqlock lock using c++11 atomic library.
It's a bit of a hack in C++ to write something that can compile efficiently to a non-atomic copy that might have tearing, because inevitably that's data-race UB. (Unless you use std::atomic<long> with mo_relaxed for each chunk separately, but then you're preventing the compiler from using movdqu or something to copy 16 bytes at once.)
A SeqLock makes the reader copy the whole thing (or ideally just load it into registers) every read so it's only ever appropriate for a small struct or 128-bit integer or something. But for less than 64 bytes of data it can be quite good, better than having readers use lock cmpxchg16b for a 128-bit datum if you have many readers and infrequent writes.
It's not lock-free, though: a writer that sleeps while modifying the SeqLock could get readers stuck retrying indefinitely. For a small SeqLock the window is small, and obviously you want to have all the data ready before you do the first sequence-counter update, to minimize the chance of an interrupt pausing the writer mid-update.
The best case is when there's only 1 writer so it doesn't have to do any locking; it knows nothing else will be modifying the sequence counter.
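A minimal single-writer SeqLock sketch, with the caveat above (the plain reads and writes of the payload are formally a data race in C++; a strictly conforming version would copy via relaxed std::atomic words):

#include <atomic>

struct SeqLock {
    std::atomic<unsigned> seq{0};
    long long payload[2] = {0, 0};   // the small protected datum

    void write(long long a, long long b) {            // single writer assumed
        unsigned s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);  // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        payload[0] = a;
        payload[1] = b;
        seq.store(s + 2, std::memory_order_release);  // even: write complete
    }

    void read(long long& a, long long& b) {
        unsigned s0, s1;
        do {
            s0 = seq.load(std::memory_order_acquire);
            a = payload[0];
            b = payload[1];
            std::atomic_thread_fence(std::memory_order_acquire);
            s1 = seq.load(std::memory_order_relaxed);
        } while (s0 != s1 || (s0 & 1));  // retry on tear or mid-write read
    }
};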
What you're describing is very similar to double instance locking and left-right concurrency control.
In terms of progress guarantees, the difference between the two is that the former is lock-free for readers while the latter is wait-free. Both are blocking for writers.
It turns out the two-structure solution I was thinking of has similarities to http://concurrencyfreaks.blogspot.com/2013/12/left-right-concurrency-control.html
Here's the specific data structure and pseudocode I had in mind.
We have two copies of some arbitrary data structure called MyMap allocated, and two pointers out of a group of three point to these two. Initially, one is pointed to by achReadOnly[0].pmap and the other by pmapMutable.
A quick note on achReadOnly: it has a normal state and two temporary states. The normal state will be (WLOG for cell 0/1):
achReadOnly = { { pointer to one data structure, number of current readers },
{ nullptr, 0 } }
pmapMutable = pointer to the other data structure
When we've finished mutating "the other," we store it in the unused slot of the array, as it is the next-generation read-only and it's fine for readers to start accessing it.
achReadOnly = { { pointer to one data structure, number of old readers },
{ pointer to the other data structure, number of new readers } }
pmapMutable = pointer to the other data structure
The writer then clears the pointer to "the one," the previous-generation read-only, forcing readers to go to the next-generation one. We move that to pmapMutable.
achReadOnly = { { nullptr, number of old readers },
{ pointer to the other data structure, number of new readers } }
pmapMutable = pointer to the one data structure
The writer then spins for the number of old readers to hit one (itself), at which point it can receive the same update. That 1 is overwritten with 0 to clean up in preparation to move forward, though in fact it could be left dirty, as it won't be referred to before being overwritten.
struct CountedHandle {
MyMap* pmap;
int iReaders;
};
// Data Structure:
atomic<CountedHandle> achReadOnly[2];
MyMap* pmapMutable;
mutex_t muxMutable;
data Read( key ) {
int iWhich = 0;
CountedHandle chNow, chUpdate;
// Spin if necessary to update the reader counter on a pmap, and/or
// to find a pmap (as the pointer will be overwritten with nullptr once
// a writer has finished updating the mutable copy and made it the next-
// generation read-only in the other slot of achReadOnly[].
do {
    chNow = achReadOnly[ iWhich ];
    if ( !chNow.pmap ) {
        iWhich = 1 - iWhich;
        continue;
    }
    chUpdate = chNow;
    chUpdate.iReaders++;  // register ourselves as a reader in the copy
} while ( CAS( achReadOnly[ iWhich ], chNow, chUpdate ) fails );
// Now we've found a map, AND registered ourselves as a reader of it atomically.
// Importantly, it is impossible for any reader to hold this pointer without
// being represented in that count.
if ( data = chNow.pmap->Find( key ) ) {
    // Deregister ourselves as a reader.
    do {
        chNow = achReadOnly[ iWhich ];
        chUpdate = chNow;
        chUpdate.iReaders--;
    } while ( CAS( achReadOnly[ iWhich ], chNow, chUpdate ) fails );
    return data;
}
// OK, we have to add it to the structure.
lock muxMutable;
figure out data for this key
pmapMutable->Add( key, data );
// It's now the next-generation read-only. Put it where readers can find it.
achReadOnly[ 1 - iWhich ].pmap = pmapMutable;
// Prev-generation readonly is our Mutable now, though we can't change it
// until the readers are gone.
pmapMutable = achReadOnly[ iWhich ].pmap;
// Force readers to look for the next-generation readonly.
achReadOnly[ iWhich ].pmap = nullptr;
// Spin until all readers finish with previous-generation readonly.
// Remember we added ourselves as reader so wait for 1, not 0.
while ( achReadOnly[ iWhich ].iReaders > 1 )
    ;
// Remove our reader count.
achReadOnly[ iWhich ].iReaders = 0;
// No more readers for previous-generation readonly, so we can now write to it.
pmapMutable->Add( key, data );
unlock muxMutable;
return data;
}
Solution that has come to me:
Every thread has a thread_local copy of the data structure, and this can be queried at will without locks. Any time you find your data, great, you're done.
If you do NOT find your data, then you acquire a mutex for the master copy.
This will have potentially many new insertions in it from other threads (possibly including the data you need!). Check to see if it has your data and if not insert it.
Finally, copy all the recent updates--including the entry for the data you need--to your own thread_local copy. Release mutex and done.
Readers can read all day long, in parallel, even while updates are happening, without locks. A lock is only needed when writing (or sometimes when catching up). This general approach would work for a wide range of underlying data structures. QED
Having many thread_local indexes sounds memory-inefficient if you have lots of threads using this structure.
However, the data found by the index, if it's read-only, need only have one copy, referred to by many indices. (Luckily, that is my case.)
Also, many threads might not be randomly accessing the full range of entries; maybe some only need a few entries and will very quickly reach a final state where their local copy of the structure can find all the data needed, before it grows much. And yet many other threads may not refer to this at all. (Luckily, that is my case.)
Finally, to "copy all the recent updates" it'd help if all new data added to the structure were, say, pushed onto the end of a vector, so that if you have say 4000 entries in your local copy and the master copy has 4020, you can locate the 20 objects you need to add with a few machine cycles. (Luckily, that is my case.)
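Here's a hedged sketch of the whole scheme (all names, including the computeData helper, are invented placeholders): the master keeps an append-only log next to its index, so "copy all the recent updates" is just a tail scan from each thread's private high-water mark.

#include <mutex>
#include <unordered_map>
#include <vector>

struct Entry { int key; int data; };

int computeData(int key);                       // hypothetical: derive a value

std::mutex g_masterMutex;
std::vector<Entry> g_masterLog;                 // append-only list of additions
std::unordered_map<int, int> g_masterIndex;     // key -> data, master copy

int lookup(int key) {
    thread_local std::unordered_map<int, int> localIndex;
    thread_local size_t localSeen = 0;          // how much of the log we've absorbed

    auto hit = localIndex.find(key);
    if (hit != localIndex.end())
        return hit->second;                     // fast path: no lock at all

    std::lock_guard<std::mutex> lk(g_masterMutex);
    int data;
    auto mit = g_masterIndex.find(key);
    if (mit != g_masterIndex.end()) {
        data = mit->second;                     // another thread already added it
    } else {
        data = computeData(key);
        g_masterLog.push_back({key, data});
        g_masterIndex.emplace(key, data);
    }
    // Catch up: copy everything appended since our last visit (incl. this entry).
    for (; localSeen < g_masterLog.size(); ++localSeen)
        localIndex.emplace(g_masterLog[localSeen].key, g_masterLog[localSeen].data);
    return data;
}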
I'm writing parallel code that has a single writer and multiple readers. The writer will fill in an array from beginning to end, and the readers will access elements of the array in order. Pseudocode is something like the following:
std::vector<Stuff> vec(knownSize);
int producerIndex = 0;
std::atomic<int> consumerIndex = 0;
Producer thread:
for(a while){
vec[producerIndex] = someStuff();
++producerIndex;
}
Consumer thread:
while(!finished){
int myIndex = consumerIndex++;
while(myIndex >= producerIndex){ spin(); }
use(vec[myIndex]);
}
Do I need any sort of synchronization around the producerIndex? It seems like the worst thing that could happen is that I would read an old value while it's being updated so I might spin an extra time. Am I missing anything? Can I be sure that each assignment to myIndex will be unique?
As the comments have pointed out, this code has a data race. Instead of speculating about whether the code has a chance of doing what you want, just fix it: change the type of producerIndex and consumerIndex from int to std::atomic<int> and let the compiler implementor and standard library implementor worry about how to make that work right on your target platform.
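For example, a sketch of the fixed version, keeping the question's placeholders (Stuff, someStuff, use, knownSize): the producer publishes each slot with a release store, and each consumer claims a unique index with an atomic increment before spinning until its slot is published.

#include <atomic>
#include <vector>

std::vector<Stuff> vec(knownSize);
std::atomic<int> producerIndex{0};
std::atomic<int> consumerIndex{0};

void producer() {
    for (int i = 0; i < knownSize; ++i) {
        vec[i] = someStuff();
        producerIndex.store(i + 1, std::memory_order_release);  // publish slot i
    }
}

void consumer() {
    for (;;) {
        int myIndex = consumerIndex.fetch_add(1);   // unique claim per consumer
        if (myIndex >= knownSize)
            break;
        while (myIndex >= producerIndex.load(std::memory_order_acquire))
            ;                                       // spin until slot published
        use(vec[myIndex]);
    }
}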
It's likely that the array will be stored in the cache, so all the threads will have their own copy of it. Whenever your producer puts a new value in the array, the cache-coherency protocol marks that line dirty and invalidates the other cores' copies, so every other thread that uses the value will re-fetch the updated line.
That means you will get a lot of cache misses but no race conditions. :)
I have a ROS node running two threads and they both share the same class. This class has two sets of parameters "to read" and "to write" to be updated in a control loop.
There are two situations where questions arise.
My program is a node that pumps control data into a quadrotor (case 1) and reads the drone data to get feedback (case 2). Here I can control the execution frequency of thread A and I know the frequency at which thread B can communicate with its read/write external source.
Thread A reads data from the control source and updates the "to read" parameters. Thread B is constantly reading these "to read" parameters and writing them to the drone source. My point here is that I don't mind if I miss some of the values thread A has read, but could thread B happen to read something that isn't a "true" value because thread A is writing it at that moment, or something similar?
Thread B, after writing the "to read" parameters, reads the state of the drone, which updates the second set, "to write". Again, thread A needs to read these "to write" parameters and write them back to the control source; in the same way, I don't care if a value is missed because I'll get the next one.
So do I need a mutex here? Or will the reading threads just miss some values, with the ones they do read being correct and consistent?
BTW: I am using boost::thread to implement thread B, as thread A is the ROS node itself.
A data race is undefined behavior. Even if the hardware guarantees atomic access and even your threads never actually access the same data at the same time due to timings. There is no such thing as a benign data race in C++. You can get lucky that the undefined behavior does what you want, but you can never be sure and every new compilation could break everything (not just a missed write). I strongly suggest you use an std::atomic. It will most likely generate almost the same code except that it is guaranteed to always work.
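For a small, trivially copyable parameter block, that can be as simple as the sketch below (the struct and field names are invented for illustration). Note that std::atomic silently falls back to an internal lock if the type is too large for the hardware, which is still correct; is_lock_free() tells you which you got.

#include <atomic>

struct ControlParams {                 // illustrative fields
    float roll, pitch, yaw, thrust;
};

std::atomic<ControlParams> g_toRead{ControlParams{}};  // A stores, B loads

void threadA_update(ControlParams p) {
    g_toRead.store(p);                 // whole struct replaced atomically
}

ControlParams threadB_snapshot() {
    return g_toRead.load();            // never a half-written mix of old and new
}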
In general the answer is that you need a lock or some other type of synchronization mechanism. For example, if your data is a null-terminated string it's possible for you to read interleaved data. Say one thread was reading the buffer and the string in the buffer is "this is a test". The thread copies the first four bytes and then another thread comes in and overwrites the buffer with "cousin it is crazy". You'd end up copying "thisin it is crazy". That's just one example of things that could go wrong.
If you're always copying atomic types and everything is fixed length, then you could get away with it. But if you do, your data is potentially inconsistent. If two values are supposed to be related, it's possible for that relationship now to be broken because you read one value from the previous update and one value from the new update.
I have a class instance which is being used in multiple threads. I am updating multiple member variables from one thread and reading the same member variables from another thread. What is the correct way to maintain thread safety?
eg:
pthread_mutex_lock(&mutex1);
obj1.memberV1 = 1;
//unlock here?
Should I unlock the mutex over here? (If another thread accesses obj1's member variables 1 and 2 now, the accessed data might not be correct because memberV2 has not yet been updated. However, if I do not release the lock, the other thread might block because there is a time-consuming operation below.)
//perform some time consuming operation which must be done before the assignment to memberV2 and after the assignment to memberV1
obj1.memberV2 = update field 2 from some calculation
pthread_mutex_unlock(&mutex1); // should I only unlock here?
Thanks
Your locking is correct. You should not release the lock early just to allow another thread to proceed (because that would allow the other thread to see the object in an inconsistent state.)
Perhaps it would be better to do something like:
//perform time consuming calculation
pthread_mutex_lock(&mutex1);
obj1.memberV1 = 1;
obj1.memberV2 = result;
pthread_mutex_unlock(&mutex1);
This of course assumes that the values used in the calculation won't be modified on any other thread.
It's hard to tell what you are doing that is causing problems. The mutex pattern is pretty simple: you lock the mutex, access the shared data, unlock the mutex. This protects the data, because the mutex only lets one thread get the lock at a time. Any thread that fails to get the lock has to wait until the mutex is unlocked. Unlocking wakes the waiters up. They will then fight to attain the lock; losers go back to sleep. The time it takes to wake up might be multiple ms or more from the time the lock is released. Make sure you always unlock the mutex eventually.
Make sure you don't keep locks locked for a long period of time. Most of the time, a long period of time is like a microsecond. I prefer to keep it down around "a few lines of code." That's why people have suggested that you do the long-running calculation outside the lock. The reason for not holding locks a long time is that you increase the number of times other threads will hit the lock and have to spin or sleep, which decreases performance. You also increase the probability that your thread might be pre-empted while owning the lock, which means the lock is held while that thread sleeps. That's even worse for performance.
Threads that fail to get a lock don't have to sleep. Spinning means a thread encountering a locked mutex doesn't sleep, but loops repeatedly testing the lock for a predefined period before giving up and sleeping. This is a good idea if you have multiple cores, or cores capable of multiple simultaneous threads. Multiple active threads mean two threads can be executing the code at the same time. If the lock is around a small amount of code, then the thread that got the lock is going to be done real soon; the other thread need only wait a couple of nanoseconds before it gets the lock. Remember, sleeping your thread is a context switch, plus some code to attach your thread to the waiters on the mutex; all of that has costs. And once your thread sleeps, you have to wait for a period of time before the scheduler wakes it up. That could be multiple ms. Look up spinlocks.
If you only have one core, then if a thread encounters a lock it means another sleeping thread owns the lock, and no matter how long you spin it isn't going to unlock. So you would use a lock that sleeps a waiter immediately, in hopes that the thread owning the lock will wake up and finish.
You should assume that a thread can be preempted at any machine code instruction. Also you should assume that each line of c code is probably many machine code instructions. The classic example is i++. This is one statement in c, but a read, an increment, and a store in machine code land.
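To make that concrete (a minimal illustration, not from the question's code):

#include <atomic>

int counter = 0;                       // plain int
std::atomic<int> safeCounter{0};

void unsafeIncrement() {
    // Roughly: tmp = counter; tmp = tmp + 1; counter = tmp;
    // A thread can be preempted between any two of those steps, so
    // concurrent calls can lose increments.
    counter++;
}

void safeIncrement() {
    safeCounter.fetch_add(1);          // one indivisible read-modify-write
}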
If you really care about performance, try to use atomic operations first. Look to mutexes as a last resort. Most concurrency problems are easily solved with atomic operations (google gcc atomic operations to start learning) and very few problems really need mutexes. Mutexes are way way way slower.
Protect your shared data wherever it is written and wherever it is read. else...prepare for failure. You don't have to protect shared data during periods of time when only a single thread is active.
It's often useful to be able to run your app with 1 thread as well as N threads. This way you can debug race conditions more easily.
Minimize the shared data that you protect with locks. Try to organize data into structures such that a single thread can gain exclusive access to the entire structure (perhaps by setting a single locked flag or version number or both) and not have to worry about anything after that. Then most of the code isn't cluttered with locks and race conditions.
Functions that ultimately write to shared variables should use temp variables until the last moment and then copy the results. Not only will the compiler generate better code, but accesses to shared variables, especially changing them, cause cache-line traffic between L2 and main RAM and all sorts of other performance issues. Again, if you don't care about performance, disregard this. However, I recommend you google the paper "What Every Programmer Should Know About Memory" if you want to know more.
If you are reading a single variable from the shared data you probably don't need to lock as long as the variable is an integer type and not a member of a bitfield (bitfield members are read/written with multiple instructions). Read up on atomic operations. When you need to deal with multiple values, then you need a lock to make sure you didn't read version A of one value, get preempted, and then read version B of the next value. Same holds true for writing.
You will find that copies of data, even copies of entire structures, come in handy. You can build a new copy of the data and then swap it in by changing a pointer with one atomic operation. You can make a copy of the data and then do calculations on it without worrying whether it changes.
So maybe what you want to do is:
lock the mutex
Make a copy of the input data to the long running calculation.
unlock the mutex
L1: Do the calculation
Lock the mutex
if the input data has changed and this matters
read the input data, unlock the mutex and go to L1
update data
unlock mutex
Maybe, in the example above, you still store the result if the input changed, but go back and recalculate. It depends whether other threads can use a slightly out-of-date answer. Maybe other threads, when they see that a thread is already doing the calculation, simply change the input data and leave it to the busy thread to notice that and redo the calculation (there will be a race condition you need to handle if you do that, an easy one). That way the other threads can do other work rather than just sleep.
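Here's a hedged sketch of that recipe (Input, Output, the version field, and the calculate function are all invented placeholders):

#include <mutex>

struct Input  { int x; int version; };
struct Output { int y; };

std::mutex g_mutex;
Input  g_input{0, 0};
Output g_output{0};

Output calculate(const Input& in);   // hypothetical long-running function

void worker() {
    g_mutex.lock();
    Input snapshot = g_input;        // copy inputs under the lock
    g_mutex.unlock();

    for (;;) {
        Output result = calculate(snapshot);        // L1: long work, no lock held

        g_mutex.lock();
        if (g_input.version != snapshot.version) {  // input changed and it matters
            snapshot = g_input;                     // re-read and go back to L1
            g_mutex.unlock();
            continue;
        }
        g_output = result;                          // publish while consistent
        g_mutex.unlock();
        break;
    }
}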
cheers.
Probably the best thing to do is:
temp = result of some time consuming operation; // must be done before the assignment to memberV2
pthread_mutex_lock(&mutex1);
obj1.memberV1 = 1;
obj1.memberV2 = temp; // result from previous calculation
pthread_mutex_unlock(&mutex1);
What I would do is separate the calculation from the update:
temp = some calculation
pthread_mutex_lock(&mutex1);
obj.memberV1 = 1;
obj.memberV2 = temp;
pthread_mutex_unlock(&mutex1);
After posting my solution to my own problem regarding memory issues, nusi suggested that my solution lacks locking.
The following pseudo code vaguely represents my solution in a very simple way.
std::map<int, MyType1> myMap;
void firstFunctionRunFromThread1()
{
MyType1 mt1;
mt1.Test = "Test 1";
myMap[0] = mt1;
}
void onlyFunctionRunFromThread2()
{
MyType1 &mt1 = myMap[0];
std::cout << mt1.Test << std::endl; // Prints "Test 1"
mt1.Test = "Test 2";
}
void secondFunctionFromThread1()
{
MyType1 mt1 = myMap[0];
std::cout << mt1.Test << std::endl; // Prints "Test 2"
}
I'm not sure at all how to go about implementing locking, and I'm not even sure why I should do it (note the actual solution is much more complex). Could someone please explain how and why I should implement locking in this scenario?
One function (i.e. thread) modifies the map, two read it. Therefore a read could be interrupted by a write, or vice versa; in both cases the map can be corrupted. You need locks.
Actually, it's not even just locking that is the issue...
If you really want thread two to ALWAYS print "Test 1", then you need a condition variable.
The reason is that there is a race condition. Regardless of whether or not you create thread 1 before thread 2, it is possible that thread 2's code can execute before thread 1, and so the map will not be initialized properly. To ensure that no one reads from the map until it has been initialized you need to use a condition variable that thread 1 modifies.
You also should use a lock with the map, as others have mentioned, because you want threads to access the map as though they are the only ones using it, and the map needs to be in a consistent state.
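Here's a sketch of that combination applied to the map example (with MyType1 reduced to a plain string for brevity):

#include <condition_variable>
#include <iostream>
#include <map>
#include <mutex>
#include <string>

std::map<int, std::string> myMap;
std::mutex mapMutex;
std::condition_variable mapReady;
bool initialized = false;

void firstFunctionRunFromThread1() {
    {
        std::lock_guard<std::mutex> lk(mapMutex);
        myMap[0] = "Test 1";
        initialized = true;
    }
    mapReady.notify_all();           // wake thread 2 if it got here first
}

void onlyFunctionRunFromThread2() {
    std::unique_lock<std::mutex> lk(mapMutex);
    mapReady.wait(lk, [] { return initialized; });  // block until thread 1 wrote
    std::cout << myMap[0] << std::endl;             // always prints "Test 1"
    myMap[0] = "Test 2";
}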
Here is a conceptual example to help you think about it:
Suppose you have a linked list that 2 threads are accessing. In thread 1, you ask to remove the first element from the list (at the head of the list), In thread 2, you try to read the second element of the list.
Suppose that the delete method is implemented in the following way: make a temporary ptr to point at the second element in the list, make the head point at null, then make the head the temporary ptr...
What if the following sequence of events occur:
- T1 removes the head's next ptr to the second element
- T2 tries to read the second element, BUT there is no second element because the head's next ptr was modified
- T1 completes removing the head and sets the 2nd element as the head
The read by T2 failed because T1 didn't use a lock to make the delete from the linked list atomic!
That is a contrived example, and isn't necessarily how you would even implement the delete operation; however, it shows why locking is necessary: it is necessary so that operations performed on data are atomic. You do not want other threads using something that is in an inconsistent state.
Hope this helps.
In general, threads might be running on different CPUs/cores, with different memory caches. They might be running on the same core, with one interrupting ("pre-empting") the other. This has two consequences:
1) You have no way of knowing whether one thread will be interrupted by another in the middle of doing something. So in your example, there's no way to be sure that thread1 won't try to read the string value before thread2 has written it, or even that when thread1 reads it, it is in a "consistent state". If it is not in a consistent state, then using it might do anything.
2) When you write to memory in one thread, there is no telling if or when code running in another thread will see that change. The change might sit in the cache of the writer thread and not get flushed to main memory. It might get flushed to main memory but not make it into the cache of the reader thread. Part of the change might make it through, and part of it not.
In general, without locks (or other synchronization mechanisms such as semaphores) you have no way of saying whether something that happens in thread A will occur "before" or "after" something that happens in thread B. You also have no way of saying whether or when changes made in thread A will be "visible" in thread B.
Correct use of locking ensures that all changes are flushed through the caches, so that code sees memory in the state you think it should see. It also allows you to control whether particular bits of code can run simultaneously and/or interrupt each other.
In this case, looking at your code above, the minimum locking you need is to have a synchronisation primitive which is released/posted by the second thread (the writer) after it has written the string, and acquired/waited on by the first thread (the reader) before using that string. This would then guarantee that the first thread sees any changes made by the second thread.
That's assuming the second thread isn't started until after firstFunctionRunFromThread1 has been called. If that might not be the case, then you need the same deal with thread1 writing and thread2 reading.
The simplest way to actually do this is to have a mutex which "protects" your data. You decide what data you're protecting, and any code which reads or writes the data must be holding the mutex while it does so. So first you lock, then read and/or write the data, then unlock. This ensures consistent state, but on its own it does not ensure that thread2 will get a chance to do anything at all in between thread1's two different functions.
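A minimal sketch of that discipline (SharedData is a stand-in for whatever data you decide the mutex protects):

#include <mutex>

struct SharedData { /* ... the protected fields ... */ };

std::mutex g_dataMutex;
SharedData g_data;

void writeData(const SharedData& d) {
    std::lock_guard<std::mutex> lk(g_dataMutex);  // lock
    g_data = d;                                   // write
}                                                 // unlock via RAII

SharedData readData() {
    std::lock_guard<std::mutex> lk(g_dataMutex);  // lock
    return g_data;                                // read a consistent copy
}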
Any kind of message-passing mechanism will also include the necessary memory barriers, so if you send a message from the writer thread to the reader thread, meaning "I've finished writing, you can read now", then that will be true.
There can be more efficient ways of doing certain things, if those prove too slow.
The whole idea is to prevent the program from going into an indeterminate/unsafe state due to multiple threads accessing the same resource(s) and/or updating/modifying the resource so that the subsequent state becomes undefined. Read up on Mutexes and Locking (with examples).
The sets of instructions produced by compiling your code can be interleaved across threads in any order. This can yield unpredictable and undesired results. For example, if thread1 runs entirely before thread2 is selected to run, your output may look like:
Test 1
Test 1
Worse yet, one thread may get pre-empted in the middle of an assignment, if assignment is not an atomic operation. In this case let's think of atomic as the smallest unit of work which cannot be further split.
The way to create a logically atomic set of instructions -- even if they yield multiple machine-code instructions in reality -- is to use a lock or mutex. Mutex stands for "mutual exclusion" because that's exactly what it does. It ensures exclusive access to certain objects or critical sections of code.
One of the major challenges in dealing with multiprogramming is identifying critical sections. In this case, you have two critical sections: where you assign to myMap, and where you change myMap[ 0 ]. Since you don't want to read myMap before writing to it, that is also a critical section.
The simplest answer is: you have to lock whenever there is an access to shared resources which are not atomic. In your case myMap is a shared resource, so you have to lock all reading and writing operations on it.