I find that many read-write spinlock implementations on the internet are unnecessarily complex, so I have written a simple read-write lock in C++.
Could anybody tell me if I am missing anything?
int r = 0;
int w = 0;

void read_lock(void)
{
    atomic_inc(r);     // increment value atomically
    while( w != 0 );   // spin while a writer is active
}

void read_unlock(void)
{
    atomic_dec(r);     // decrement value atomically
}

void write_lock(void)
{
    while( (r != 0) && (w != 0) )
        atomic_inc(w); // increment value atomically
}

void write_unlock(void)
{
    atomic_dec(w);     // decrement value atomically
}
The usage would be as below.
read_lock()
// Critical Section
read_unlock();
write_lock()
// Critical Section
write_unlock();
Edit:
Thanks for the answers.
I now changed answer to that of atomic equivalent
If threads access r and w concurrently, they have a data race, and if a C++ program has a data race, the behaviour of the program is undefined.
int is not guaranteed by the C++ standard to be atomic. Even if we assume a system where accessing an int is atomic, operator++ would probably not be an atomic operation even on such a system. As such, simultaneous increments could "disappear".
Furthermore, after the loop in write_lock, another thread could also end its loop before w is incremented, thereby allowing multiple simultaneous writers - which I assume this lock is supposed to prevent.
Lastly, this appears to be an attempt at implementing a spinlock. Spinlocks have advantages and disadvantages. Their disadvantage is that they burn CPU cycles on their thread while blocking. That is a highly inefficient use of resources, bad for battery life, and bad for other processes that could have used those cycles. But a spinlock can be optimal if the wait time is short.
The simplest implementation uses a single integral value: -1 means it is currently being written to, 0 means it is neither being read nor written, and a positive value means it is being read by that many threads.
Use std::atomic_int and compare_exchange_weak (or strong, but weak should suffice):
std::atomic_int l{0};
void write_lock() {
    int v = 0;
    while( !l.compare_exchange_weak( v, -1 ) )
        v = 0; // a failed CAS stores the current value in v, so reset expected to 0
}

void write_unlock() {
    l = 0; // no need to compare_exchange
}
void read_lock() {
    int v = l.load();
    while( v < 0 || !l.compare_exchange_weak(v, v+1) )
        v = l.load();
}

void read_unlock() {
    --l; // no need to do anything else
}
I think that should work. For convenience, add RAII objects: an automatic object that locks on construction and unlocks on destruction, one for each lock type.
That could be done like this:
class AtomicWriteSpinScopedLock
{
private:
    std::atomic_int& l_;
public:
    // handle copy/assign/move issues
    explicit AtomicWriteSpinScopedLock( std::atomic_int& l ) :
        l_(l)
    {
        int v = 0;
        while( !l_.compare_exchange_weak( v, -1 ) )
            v = 0; // a failed CAS stores the current value in v, so reset expected to 0
    }
    ~AtomicWriteSpinScopedLock()
    {
        l_ = 0;
    }
};
class AtomicReadSpinScopedLock
{
private:
    std::atomic_int& l_;
public:
    // handle copy/assign/move issues
    explicit AtomicReadSpinScopedLock( std::atomic_int& l ) :
        l_(l)
    {
        int v = l_.load();
        while( v < 0 || !l_.compare_exchange_weak(v, v+1) )
            v = l_.load();
    }
    ~AtomicReadSpinScopedLock()
    {
        --l_;
    }
};
On locking to write the value must be 0 and you must swap it to -1, so just keep trying to do that.
On locking to read the value must be non-negative, and you then attempt to increase it, so there may be retries against other readers: not in acquiring the lock, but in setting its count.
If the exchange fails, compare_exchange_weak stores the value actually held into its first parameter; the second parameter is the value you are trying to store. It returns true if it swapped and false if it did not.
How efficient? It's a spin-lock. It will use CPU cycles whilst waiting, so it had better be available very soon: the update or the reading of the data should be swift.
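For illustration, here is a minimal usage sketch of the two RAII guards, assuming the shared std::atomic_int l from above and a hypothetical shared_data that it protects:

std::atomic_int l{0};
int shared_data = 0; // hypothetical data guarded by l

void writer() {
    AtomicWriteSpinScopedLock guard(l); // spins until exclusive access is acquired
    shared_data += 1;                   // safe: no readers or writers here
}                                       // write lock released in the destructor

void reader() {
    AtomicReadSpinScopedLock guard(l);  // spins only while a writer holds l
    int copy = shared_data;             // safe: writers are excluded, readers shared
    (void)copy;
}                                       // reader count decremented in the destructor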
Suppose I have some global:
std::atomic_int next_free_block;
and a number of threads each with access to a
std::atomic_int child_offset;
that may be shared between threads. I would like to allocate free blocks to child offsets in a contiguous manner, that is, I want to perform the following operation atomically:
if (child_offset != 0) child_offset = next_free_block++;
Obviously the above implementation does not work as multiple threads may enter the body of the if statement and then try to assign different blocks to child_offset.
I have also considered the following:
int expected = child_offset;
int updated;
do {
    if (expected == 0) break;
    updated = next_free_block++;
} while (!child_offset.compare_exchange_weak(expected, updated));
But this also doesn't work because if the CAS fails, the side effect of incrementing next_free_block remains even if nothing is assigned to child_offset. This leaves gaps in the allocation of free blocks.
I am aware that I could do this with a mutex (or some kind of spin lock) around each child_offset and potentially DCLP, but I would like to know if this is possible to implement efficiently with atomic operations.
The use case for this is as follows: I have a large tree that I'm building in parallel. The tree is an array of the following:
struct tree_page {
    atomic<uint32_t> allocated;
    uint32_t child_offset[8];
    uint32_t nodes[1015];
};
The tree is built level by level: first the nodes at depth 0 are created, then at depth 1, etc. A separate thread is dispatched for each non-leaf node at the previous step. If no more space is left in a page, a new page is allocated from the global next_free_page which points to the first unused page in the array of struct tree_page and is assigned to an element of child_ptr. A bit field is then set in the node word that indicates which element of the child_ptr array should be used to find the node's children.
The code I am trying to write looks like this:
int expected = allocated.load(relaxed), updated;
do {
    updated = expected + num_children;
    if (updated > NODES_PER_PAGE) {
        expected = -1;
        break;
    }
} while (!allocated.compare_exchange_weak(expected, updated));
if (expected != -1) {
    // successfully allocated in the same page
} else {
    for (int i = 0; i < 8; ++i) {
        // this is the operation I would like to be atomic
        if (child_offset[i] == 0)
            child_offset[i] = next_free_block++;
        int offset = try_allocating_at_page(pages[child_offset[i]]);
        if (offset != -1) {
            // successfully allocated at child_offset i
            // ...
            break;
        }
    }
}
As far as I understood from your description, your child_offset array is filled with 0 initially and then filled with concrete values concurrently by different threads.
In this case you can atomically "tag" the value first and, if you succeed, assign the valid value. Something like this:
constexpr int INVALID_VALUE = -1;

for (int i = 0; i < 8; ++i) {
    int expected = 0;
    // this is the operation I would like to be atomic;
    // strong, not weak: there is no retry loop here, so spurious failures must be ruled out
    if (child_offset[i].compare_exchange_strong(expected, INVALID_VALUE)) {
        child_offset[i] = next_free_block++;
    }
    // Not sure if this is needed in your environment, but just in case
    if (child_offset[i] == INVALID_VALUE) continue;
    ...
}
This doesn't guarantee that all values in the child_offset array will be in ascending order. But if you need that, why not fill it without multithreading involved?
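For illustration, here is a self-contained sketch of that tag-then-assign idea, assuming the slots are std::atomic<int>; the helper name get_or_assign and the spin-wait for the published value are mine, not part of the original code:

#include <atomic>

constexpr int INVALID_VALUE = -1;

std::atomic<int> next_free_block{1};  // hypothetical: 0 is reserved for "unassigned"
std::atomic<int> child_offset[8]{};   // all slots start at 0

// Returns the block for slot i, claiming a fresh block exactly once.
int get_or_assign(int i) {
    int expected = 0;
    // Only one thread can move the slot from 0 to the tag.
    if (child_offset[i].compare_exchange_strong(expected, INVALID_VALUE)) {
        child_offset[i] = next_free_block++;  // only the tagging winner assigns
    }
    int v;
    while ((v = child_offset[i].load()) == INVALID_VALUE)
        ;  // spin until the winner has published the real block number
    return v;
}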
I encounter deadlocks while executing the code snippet below as a thread.
void thread_lifecycle(
    Queue<std::tuple<int64_t, int64_t, uint8_t>, QUEUE_SIZE>& query,
    Queue<std::string, QUEUE_SIZE>& output_queue,
    std::vector<Object>& pgs,
    bool* pgs_executed,              // Initialized to an array of false values
    std::mutex& pgs_executed_mutex,
    std::atomic<uint32_t>& atomic_pgs_finished
){
    bool iter_bool = false;
    std::tuple<int64_t, int64_t, uint8_t> next_query;
    std::string output = "";
    int64_t lower, upper;

    while(true) {
        // Get next query
        next_query = query.pop_front();
        // Stop condition reached: terminate thread
        if (std::get<2>(next_query) == uint8_t(-1)) break;
        // Set query params
        lower = std::get<0>(next_query);
        upper = std::get<1>(next_query);
        // Scan bool array
        for (uint32_t i = 0; i < pgs.size(); i++){
            // First lock for reading
            pgs_executed_mutex.lock();
            if (pgs_executed[i] == iter_bool) {
                pgs_executed[i] = !pgs_executed[i];
                // Unlock and execute the query
                pgs_executed_mutex.unlock();
                output = pgs.at(i).get_result(lower, upper);
                // If the query yielded a result, add it to the output
                if (output.length() != 0) {
                    output_queue.push_back(output);
                }
                // Inform the main thread in case of the last result
                if (++atomic_pgs_finished >= pgs.size()) {
                    output_queue.push_back("LAST_RESULT_IDENTIFIER");
                    atomic_pgs_finished.exchange(0);
                }
            } else {
                pgs_executed_mutex.unlock();
                continue;
            }
        }
        // Finally flip for the next query
        iter_bool = !iter_bool;
    }
}
Explained:
I have a vector of objects containing information which can be queried (similar to a table in a database). Each thread can access the objects, and each iterates the vector ONCE per query to process the objects which have not been queried yet and return results, if any.
For the next query it goes through the vector again, and so on. I use the bool* array to mark the entries which are currently being queried, so that the threads can synchronize and determine which object should be processed next.
Once all have been executed, the last thread, possibly holding the last results, also returns an identifier to inform the main thread that all objects have been queried.
My Question:
Regarding the bool* array as well as atomic_pgs_finished: can there be a scenario in which a deadlock occurs? As far as I can tell, I cannot see a deadlock in this snippet. However, executing this and letting it run for a while results in a deadlock.
I was seriously considering that a bit (byte?) had randomly flipped, causing this deadlock (this is on ECC RAM), so that one or more objects were actually not executed. Is this even possible?
Maybe another implementation could help?
Edit, Implementation of the Queue:
template<class T, size_t MaxQueueSize>
class Queue
{
    std::condition_variable consumer_, producer_;
    std::mutex mutex_;
    using unique_lock = std::unique_lock<std::mutex>;
    std::queue<T> queue_;
public:
    template<class U>
    void push_back(U&& item) {
        unique_lock lock(mutex_);
        while(MaxQueueSize == queue_.size())
            producer_.wait(lock);
        queue_.push(std::forward<U>(item));
        consumer_.notify_one();
    }
    T pop_front() {
        unique_lock lock(mutex_);
        while(queue_.empty())
            consumer_.wait(lock);
        auto full = MaxQueueSize == queue_.size();
        auto item = queue_.front();
        queue_.pop();
        if(full)
            producer_.notify_all();
        return item;
    }
};
Thanks to #Ulrich Eckhardt, #PaulMcKenzie and all the other commenters for the brainstorming! I have probably found the cause of the deadlock. I tried to reduce this example even further and thought about removing atomic_pgs_finished, a variable indicating whether all pgs have been queried. Interestingly, ++atomic_pgs_finished >= pgs.size() returns true not just once but multiple times, so that multiple threads end up inside this specific if-clause.
I simply fixed it by using another mutex around this if-clause. Maybe someone can explain why ++atomic_pgs_finished >= pgs.size() is not atomic as a whole and comes out true for multiple threads.
Below I have updated the code (mostly the same as in the question) with comments, so that it might be more understandable.
void thread_lifecycle(
    Queue<std::tuple<int64_t, int64_t, uint8_t>, QUEUE_SIZE>& query, // The input queue containing queries, in my case triples
    Queue<std::string, QUEUE_SIZE>& output_queue,                    // The output queue of results
    std::vector<Object>& pgs,                                        // Objects which should be queried
    bool* pgs_executed,                                              // Initialized to an array of false values
    std::mutex& pgs_executed_mutex,                                  // A mutex protecting pgs_executed
    std::atomic<uint32_t>& atomic_pgs_finished                       // Atomic counter of how many have been executed (to send an end signal)
){
    // Initialize variables
    std::tuple<int64_t, int64_t, uint8_t> next_query;
    std::string output = "";
    int64_t lower, upper;
    // Set the first iteration to false for the very first query.
    // This flips on the second iteration to reuse pgs_executed with true values, and so on...
    bool iter_bool = false;

    // Execute as long as valid queries are received
    while(true) {
        // Get next query
        next_query = query.pop_front();
        // Stop condition reached: terminate thread
        if (std::get<2>(next_query) == uint8_t(-1)) break;
        // "Parse" the query to query the objects in pgs
        lower = std::get<0>(next_query);
        upper = std::get<1>(next_query);
        // Now iterate through pgs and pgs_executed (once)
        for (uint32_t i = 0; i < pgs.size(); i++){
            // Lock to read and write pgs_executed
            pgs_executed_mutex.lock();
            if (pgs_executed[i] == iter_bool) {
                pgs_executed[i] = !pgs_executed[i];
                // Unlock, since we now execute the query on the object (which was not queried before)
                pgs_executed_mutex.unlock();
                // Query execution
                output = pgs.at(i).get_result(lower, upper);
                // If the query yielded a result, add it to the output for the main thread to read
                if (output.length() != 0) {
                    output_queue.push_back(output);
                }
                // HERE THE ROOT CAUSE OF THE DEADLOCK HAPPENS
                // Here I would like to inform the main thread that we executed the query on
                // every object in pgs, so that it should no longer wait for other results
                if (++atomic_pgs_finished >= pgs.size()) {
                    // Multiple threads can be inside this if-clause at once!
                    // This is not intended and causes a deadlock: "LAST_RESULT_IDENTIFIER"
                    // is pushed multiple times, and the main thread assumed each one meant
                    // a finished query. The main thread then simply added the next query
                    // while the previous one was not finished, causing threads to race each
                    // other on two queries simultaneously, no longer agreeing on iter_bool!
                    output_queue.push_back("LAST_RESULT_IDENTIFIER");
                    atomic_pgs_finished.exchange(0);
                }
                // END: HERE THE ROOT CAUSE OF THE DEADLOCK HAPPENS
            } else {
                // This case happens when the next element in the list was already executed (by another thread):
                // simply unlock pgs_executed and continue with the next element in pgs
                pgs_executed_mutex.unlock();
                continue; // This is unnecessary and could be removed
            }
        }
        // Finally flip for the next query, in order to reuse bool* (which now holds trues if a second query is incoming)
        iter_bool = !iter_bool;
    }
}
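For reference, here is a sketch of the fix described above, with the completion check folded into one critical section; the mutex and the helper function are mine, not part of the original code:

std::mutex pgs_finished_mutex; // hypothetical: protects the completion check

// Replaces the marked if-clause: increment, comparison and reset now happen
// under one lock, so only one thread can observe the completion condition.
void report_finished(std::atomic<uint32_t>& atomic_pgs_finished,
                     size_t total_pgs,
                     Queue<std::string, QUEUE_SIZE>& output_queue) {
    std::lock_guard<std::mutex> guard(pgs_finished_mutex);
    if (++atomic_pgs_finished >= total_pgs) {
        output_queue.push_back("LAST_RESULT_IDENTIFIER");
        atomic_pgs_finished = 0;
    }
}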
Is there any way to implement the following?
Let's have an atomic:
std::atomic<int> val;
val = 0;
Now I want to update val only if val is not zero.
if (val != 0) {
    // <- caveat: val may become 0 right here, set by another thread
    val.fetch_sub(1);
}
So maybe:
int not_expected = 0;
val.hypothetical_not_compare_exchange_strong(not_expected, val - 1);
Actually the above will not work either, because val may be updated by another thread between computing val - 1 and the hypothetical call.
Maybe this:
int old_val = val;
if (old_val == 0) {
    // val is zero, don't update val. some other logic.
} else {
    int new_val = old_val - 1;
    bool could_update = val.compare_exchange_strong(old_val, new_val);
    if (!could_update) {
        // repeat the above steps again.
    }
}
Edit:
val is a counter variable, not related to the destruction of an object. It's supposed to be unsigned (since a count can never be negative).
From thread A: if type 2 is sent out, type 1 cannot be sent out unless the type 2 counter is 0.
while(true) {
    if counter_1 < max_type_1_limit && counter_2 == 0 && some logic:
        send_request_type1();
        counter_1++;
    if some logic && counter_2 == 0:
        send_request_type2();
        counter_2++;
}
Threads B & C handle the response:

if counter_1 > 0:
    counter_1--
    // (provided that after this counter_1 doesn't go negative)
else:
    counter_2--
The general way to implement atomic operations that are not natively available is a CAS loop; in your case it would look like this:
/// Atomically decrements val if it's not zero; returns true if it
/// decremented, false otherwise.
bool decrement_if_nonzero(std::atomic_int &val) {
    int old_value = val.load();
    do {
        if(old_value == 0) return false;
    } while(!val.compare_exchange_weak(old_value, old_value-1));
    return true;
}
So, threads B & C would be:

if(!decrement_if_nonzero(counter_1)) {
    counter_2--;
}
and thread A could use plain atomic loads/increments - thread A is the only one who increments the counters, so its check about counter_1 being under a certain threshold will always hold, regardless of what thread B and C do.
The only "strange" thing I see is the counter_2 fixup logic - in thread B & C it's decremented without checking for zero, while in thread A it's incremented only if it's zero - it looks like a bug. Did you mean to clamp it to zero in thread B/C as well?
That being said, atomics are great and all, but are trickier to get right, so if I were implementing this kind of logic I'd start out with a mutex, and then move to atomics if profiling pointed out that the mutex was a bottleneck.
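For comparison, a mutex-based starting point for the same logic might look like this - a sketch under the assumption that the counters only need correctness, not lock-free performance; the names are mine:

#include <mutex>

std::mutex counters_mutex;      // hypothetical: guards both counters
int counter_1 = 0, counter_2 = 0;

void handle_response() {        // threads B & C
    std::lock_guard<std::mutex> lock(counters_mutex);
    // The check and the decrement are trivially atomic as a unit here.
    if (counter_1 > 0)
        --counter_1;
    else if (counter_2 > 0)     // clamped at zero, per the remark above
        --counter_2;
}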
I am using the std::deque::at function to access elements without popping them from the queue, since I am using the same queue across different iterations. My solution is based on coarse-grained multithreading. Now I want to make it a fine-grained multithreaded solution. For that, I am using tbb::concurrent_queue. But I need the equivalent of the std::deque at operation in tbb::concurrent_queue.
EDIT
This is how I am implementing it with std::deque (coarse-grained multithreading).
Keep in mind that dq is a static queue (i.e. used many times across different iterations):
vertext_found = true;
std::deque<T> dq;
while ( i < dq.size() )
{
    EnterCriticalSection(&h);
    if( i < dq.size() )
    {
        v = dq.at(i); // accessing an element of the queue without popping it
        i++;
        vertext_found = true;
    }
    LeaveCriticalSection(&h);
    if (vertext_found && (i < dq.size()) && v != NULL)
    {
        // ...operation on 'v'...
        vertext_found = false;
    }
}
How can I achieve the same result with tbb::concurrent_queue?
If your algorithm has separate passes that fill the queue or consume the queue, consider using tbb::concurrent_vector. It has a push_back method that could be used for the fill pass, and an at() method for the consumption passes. If threads contend to pop elements in a consumption pass, consider using a tbb::atomic counter to generate indices for at().
If there is no such clean separation of filling and consuming, using at() would probably create more problems than it solves, even if it existed, because it would be racing against a consumer.
If a consumption pass just needs to loop over the concurrent_vector in parallel, consider using tbb::parallel_for for the loop. tbb::concurrent_vector has a range() method that supports this idiom.
void consume( tbb::concurrent_vector<T>& vec ) {
    tbb::parallel_for( vec.range(), [&]( const tbb::concurrent_vector<T>::range_type& r ) {
        for( auto i=r.begin(); i!=r.end(); ++i ) {
            T value = *i;
            ...process value...;
        }
    });
}
If a consumption pass cannot use tbb::parallel_for, consider using a TBB atomic counter to generate the indices. Initialize the counter to zero and use ++ to increment it. Here is an example:
tbb::atomic<size_t> head;
tbb::concurrent_vector<T> vec;

bool pop_one( T& result ) {  // Try to grab the next item from vec
    size_t i = head++;       // Fetch-and-increment must be a single atomic operation
    if( i<vec.size() ) {
        result = vec[i];
        return true;
    } else {
        return false;        // Failed
    }
}
In general, this solution will be less scalable than using tbb::parallel_for, because the counter "head" introduces a point of contention in the memory system.
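A consumption pass built on pop_one could then look like this sketch, where the loop body stands in for your per-element work:

void consume_all() {
    T value;
    while( pop_one(value) ) {
        // ...process value...
    }
}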
According to the Doxygen docs on the TBB site (TBB Doxy docs), there is no at operation on the queue. You can push and try_pop elements with a tbb::strict_ppl::concurrent_queue.
If you're using a tbb::deprecated::concurrent_queue (older versions of TBB), the push_if_not_full and pop_if_present operations are available.
In both queues, "multiple threads may each push and pop concurrently", as stated in the brief section.
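If you can restructure the consumers to actually remove elements, the usual pattern with tbb::concurrent_queue is a try_pop loop. A minimal sketch, with the loop body standing in for your operation on 'v':

tbb::concurrent_queue<T> q;   // same element type T as the deque sketch above

void consume() {
    T v;
    // try_pop returns false when the queue is currently empty
    while( q.try_pop(v) ) {
        // ...operation on 'v', as in the original loop...
    }
}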
I want to realize something along these lines:
inline void DecrementPendingWorkItems()
{
    if(this->pendingWorkItems != 0) // make sure we don't underflow and get a very high number
    {
        ::InterlockedDecrement(&this->pendingWorkItems);
    }
}
How can I do this so that both operations are atomic as a block, without using locks?
You can just check the result of InterlockedDecrement() and, if it happens to be negative (or <= 0 if that's more desirable), undo the decrement by calling InterlockedIncrement(). In otherwise proper code that should be just fine.
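A sketch of that approach, assuming the LONG member pendingWorkItems from the question:

inline void DecrementPendingWorkItems()
{
    // Decrement first; if the count went below zero we raced with an
    // already-empty counter, so undo the decrement.
    if (::InterlockedDecrement(&this->pendingWorkItems) < 0) {
        ::InterlockedIncrement(&this->pendingWorkItems);
    }
}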
The simplest solution is just to use a mutex around the entire section (and for all other accesses to this->pendingWorkItems). If for some reason this isn't acceptable, then you'll probably need compare and exchange:
void decrementPendingWorkItems()
{
    int count = std::atomic_load( &pendingWorkItems );
    while ( count != 0
            && ! std::atomic_compare_exchange_weak(
                     &pendingWorkItems, &count, count - 1 ) ) {
    }
}
(This supposes that pendingWorkItems has type std::atomic_int.)
There is such a thing as a "SpinLock". It is a very lightweight synchronisation primitive.
This is the idea:
//
// This lock should be used only when the operation on the protected resource
// is very short, like a few comparisons or assignments.
//
class SpinLock
{
public:
    __forceinline SpinLock() { body = 0; }
    __forceinline void Lock()
    {
        int spin = 15;
        for(;;) {
            if(!InterlockedExchange(&body, 1)) break;
            if(--spin == 0) { Sleep(10); spin = 29; }
        }
    }
    __forceinline void Unlock() { InterlockedExchange(&body, 0); }
protected:
    long body;
};
The actual numbers in the sample are not important. This lock is extremely efficient.
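Typical usage, with a hypothetical shared counter as the protected resource:

SpinLock lock;
long shared_counter = 0; // hypothetical resource guarded by lock

void increment()
{
    lock.Lock();
    ++shared_counter;    // keep the protected section very short
    lock.Unlock();
}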
You can use InterlockedCompareExchange in a loop:

inline void DecrementPendingWorkItems() {
    LONG old_items = this->pendingWorkItems;
    LONG items;
    while ((items = old_items) > 0) {
        old_items = ::InterlockedCompareExchange(&this->pendingWorkItems,
                                                 items-1, items);
        if (old_items == items) break;
    }
}
What the InterlockedCompareExchange function does is: if pendingWorkItems matches items, set the value to items-1 and return items; otherwise return the current value of pendingWorkItems.
This is done atomically, and is also called a compare-and-swap.
Use an atomic CAS.
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683560(v=vs.85).aspx
You can make it lock free, but not wait free.
As Kirill suggests, this is similar to a spin lock in your case.
I think this does what you need, but I'd recommend thinking through all the possibilities before going ahead and using it as I have not tested it at all:
inline bool
InterlockedSetIfEqual(volatile LONG* dest, LONG exchange, LONG comperand)
{
    return comperand == ::InterlockedCompareExchange(dest, exchange, comperand);
}

inline bool InterlockedDecrementNotZero(volatile LONG* ptr)
{
    LONG comperand;
    LONG exchange;
    do {
        comperand = *ptr;
        exchange = comperand-1;
        if (comperand <= 0) {
            return false;
        }
    } while (!InterlockedSetIfEqual(ptr,exchange,comperand));
    return true;
}
There remains the question as to why your pending work items should ever go below zero. You should really ensure that the number of increments matches the number of decrements and all will be fine. I'd perhaps add an assert or exception if this constraint is violated.
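For instance, a debug-time check along these lines would surface a mismatched decrement immediately (a sketch, not tested, using assert from <cassert>):

#include <cassert>

inline void DecrementPendingWorkItems()
{
    LONG result = ::InterlockedDecrement(&this->pendingWorkItems);
    assert(result >= 0 && "more decrements than increments");
}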