Lockless Deque in Win32 C++ - c++

I'm pretty new to lockless data structures, so as an exercise I wrote what I hope functions as a bounded lockless deque (no resizing yet; I just want to get the base cases working). I'd like some confirmation from people who know what they're doing as to whether I've got the right idea, and how I might improve this.
class LocklessDeque
{
public:
LocklessDeque() : m_empty(false),
m_bottom(0),
m_top(0) {}
~LocklessDeque()
{
// Delete remaining tasks
for( unsigned i = m_top; i < m_bottom; ++i )
delete m_tasks[i];
}
void PushBottom(ITask* task)
{
m_tasks[m_bottom] = task;
InterlockedIncrement(&m_bottom);
}
ITask* PopBottom()
{
if( m_bottom - m_top > 0 )
{
m_empty = false;
InterlockedDecrement(&m_bottom);
return m_tasks[m_bottom];
}
m_empty = true;
return NULL;
}
ITask* PopTop()
{
if( m_bottom - m_top > 0 )
{
m_empty = false;
InterlockedIncrement(&m_top);
return m_tasks[m_top];
}
m_empty = true;
return NULL;
}
bool IsEmpty() const
{
return m_empty;
}
private:
ITask* m_tasks[16];
bool m_empty;
volatile unsigned m_bottom;
volatile unsigned m_top;
};

Looking at this I would think this would be a problem:
void PushBottom(ITask* task)
{
m_tasks[m_bottom] = task;
InterlockedIncrement(&m_bottom);
}
If this is used in an actual multithreaded environment I would think you'd collide when setting m_tasks[m_bottom]. Think of what would happen if you have two threads trying to do this at the same time - you couldn't be sure of which one actually set m_tasks[m_bottom].
Check out this article which is a reasonable discussion of a lock-free queue.

Your use of the m_bottom and m_top members to index the array is not okay. You can use the return value of InterlockedXxxx() to get a safe index. You'll need to lose IsEmpty(), it can never be accurate in a multi-threading scenario. Same problem with the empty check in PopXxx. I don't see how you could make that work without a mutex.

The key to doing almost impossible stuff like this is to use InterlockedCompareExchange. (This is the name Win32 uses but any multithreaded-capable platform will have an InterlockedCompareExchange equivalent).
The idea is that you make a copy of the structure, which must be small enough to read atomically (64 bits, or 128 bits on x86 if you can live with some non-portability).
You make another copy for your proposed update, do your logic on that copy, then update the "real" structure using InterlockedCompareExchange. InterlockedCompareExchange atomically checks that the value is still the one you started with before your state update and, only if it is, atomically replaces it with the new state. Generally this is wrapped in an infinite loop that keeps retrying until an attempt completes without another thread having changed the value in the middle. Here is roughly the pattern:
union State
{
struct
{
short a;
short b;
};
uint32_t volatile rawState;
} state;
void DoSomething()
{
// Keep looping until nobody else changed it behind our back
for (;;)
{
// State is the union type; state (lowercase) is the shared instance
State origState;
State newState;
// It's important that you only read the state once per try
origState.rawState = state.rawState;
// This must copy origState, NOT read the state again
newState.rawState = origState.rawState;
// Now you can do something impossible to do atomically...
// This example takes a lot of cycles, there is huge
// opportunity for another thread to come in and change
// it during this update
if (newState.b == 3 || newState.b % 6 != 0)
{
newState.a++;
}
// Now we atomically update the state,
// this ONLY changes state.rawState if it's still == origState.rawState
// In either case, InterlockedCompareExchange returns what value it has now
if (InterlockedCompareExchange(&state.rawState, newState.rawState,
origState.rawState) == origState.rawState)
return;
}
}
(Please forgive if the above code doesn't actually compile - I wrote it off the top of my head)
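For readers not on Win32, the same read, copy, modify, compare-exchange loop can be sketched with std::atomic. This is only an illustration under assumptions: Fields, pack, unpack and updateState are hypothetical names, and the two 16-bit fields are packed by hand into one 32-bit word instead of using a union.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical state: two 16-bit fields kept in one 32-bit word so that
// the pair can be read and replaced atomically.
struct Fields {
    std::uint16_t a;
    std::uint16_t b;
};

inline std::uint32_t pack(Fields f) {
    return (static_cast<std::uint32_t>(f.a) << 16) | f.b;
}

inline Fields unpack(std::uint32_t raw) {
    return Fields{static_cast<std::uint16_t>(raw >> 16),
                  static_cast<std::uint16_t>(raw & 0xFFFFu)};
}

// Same rule as DoSomething(): increment a when b == 3 or b is not a
// multiple of 6. Returns the state that was finally installed.
inline Fields updateState(std::atomic<std::uint32_t>& state) {
    // Read the shared state once; every retry works from `expected`.
    std::uint32_t expected = state.load(std::memory_order_relaxed);
    for (;;) {
        Fields next = unpack(expected);
        if (next.b == 3 || next.b % 6 != 0) {
            ++next.a;
        }
        // On failure compare_exchange_weak stores the current value into
        // `expected`, so the next iteration recomputes from fresh data.
        if (state.compare_exchange_weak(expected, pack(next),
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
            return next;
        }
    }
}
```

Calling updateState on a std::atomic<std::uint32_t> from several threads applies the rule once per call, however the calls interleave; a failed attempt costs only a retry.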
Great. Now you can make lockless algorithms easy. WRONG! The trouble is that you are severely limited on the amount of data that you can update atomically.
Some lockless algorithms use a technique where they "help" concurrent threads. For example, say you have a linked list that you want to be able to update from multiple threads, other threads can "help" by performing updates to the "first" and "last" pointers if they are racing through and see that they are at the node pointed to by "last" but the "next" pointer in the node is not null. In this example, upon noticing that the "last" pointer is wrong, they update the last pointer, only if it still points to the current node, using an interlocked compare exchange.
Don't fall into a trap where you "spin" or loop (like a spinlock). While there is value in spinning briefly because you expect the "other" thread to finish something - they may not. The "other" thread may have been context switched and may not be running anymore. You are just eating CPU time, burning electricity (killing a laptop battery perhaps) by spinning until a condition is true. The moment you begin to spin, you might as well chuck your lockless code and write it with locks. Locks are better than unbounded spinning.
Just to go from hard to ridiculous, consider the mess you can get yourself into with other architectures. Things are generally pretty forgiving on x86/x64, but on "weakly ordered" architectures you get into territory where things happen that make no sense: memory updates won't happen in program order, so all your mental reasoning about what the other thread is doing goes out the window. (Even x86/x64 has a memory type called "write combining", often used when updating video memory but usable for any hardware memory buffer, where you need fences.) Those architectures require you to use "memory fence" operations to guarantee that reads, writes, or both issued before the fence are globally visible (to other cores) before those issued after it. A write fence guarantees that any writes before the fence will be globally visible before any writes after the fence. A read fence guarantees that no reads after the fence will be speculatively executed before the fence. A read/write fence (aka full fence or memory fence) makes both guarantees. Fences are very expensive. (Some use the term "barrier" instead of "fence".)
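As a rough illustration of the write fence and read fence described above, here is a minimal sketch using the standard C++ fences rather than any Win32 primitive. The names payload, ready, producer and consumer are hypothetical, and the consumer polls only to keep the example short.

```cpp
#include <atomic>
#include <thread>

// Hypothetical shared data for the illustration.
int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // flag that publishes the payload

void producer() {
    payload = 42;
    // Release fence: the payload write above cannot become visible after
    // the flag store below.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int consumer() {
    // Polling is for brevity only; real code should block or back off.
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // Acquire fence: the payload read below cannot be performed before
    // the flag load above observed true.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}
```

Run producer() on one thread and consumer() on another: once the consumer sees the flag, the two fences together guarantee that it also sees the payload written before the flag was set.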
My suggestion is to implement it first with locks/condition variables. If you have any trouble getting that working perfectly, it's hopeless to attempt a lockless implementation. And always measure, measure, measure. You'll probably find the performance of the implementation using locks is perfectly fine, without the uncertainty of some flaky lockless implementation with a nasty hang bug that will only show up when you're doing a demo to an important customer. Perhaps you can fix the problem by redefining the original problem into something more easily solved, for example by restructuring the work so bigger items (or batches of items) are going into the collection, which reduces the pressure on the whole thing.
Writing lockless concurrent algorithms is very difficult (as you've seen written 1000x elsewhere I'm sure). It is often not worth the effort either.
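A lock-based starting point along the lines suggested above might look like the following. It is a sketch under assumptions, not the asker's design: LockedDeque is a hypothetical name, T stands in for ITask*, std::deque replaces the fixed array, and there is no condition variable because the pops return immediately when the deque is empty.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical lock-based counterpart of LocklessDeque.
template <typename T>
class LockedDeque {
public:
    explicit LockedDeque(std::size_t capacity) : m_capacity(capacity) {}

    // Returns false instead of overflowing when the deque is full.
    bool PushBottom(T item) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_items.size() >= m_capacity) return false;
        m_items.push_back(std::move(item));
        return true;
    }

    // Removes the most recently pushed item; false if the deque is empty.
    bool PopBottom(T& out) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_items.empty()) return false;
        out = std::move(m_items.back());
        m_items.pop_back();
        return true;
    }

    // Removes the oldest item; false if the deque is empty.
    bool PopTop(T& out) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_items.empty()) return false;
        out = std::move(m_items.front());
        m_items.pop_front();
        return true;
    }

private:
    std::mutex m_mutex;        // guards m_items and nothing else
    std::deque<T> m_items;
    const std::size_t m_capacity;
};
```

The emptiness check and the removal happen under the same lock, so there is no window in which the answer to "is it empty?" can go stale before it is used. Measuring this version first gives a baseline that any lockless variant has to beat.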

Addressing the problem pointed out by Aaron, I'd do something like:
void PushBottom(ITask *task) {
// InterlockedIncrement returns the incremented value, so the slot claimed here is one less
int pos = InterlockedIncrement(&m_bottom) - 1;
m_tasks[pos] = task;
}
Likewise, to pop:
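ITask* PopTop() {
// InterlockedIncrement returns the incremented value; the slot being popped is the previous one
int pos = InterlockedIncrement(&m_top) - 1;
if (pos == m_bottom) // This is still subject to a race condition.
return NULL;
return m_tasks[pos];
}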
I'd eliminate both m_empty and IsEmpty() from the design completely. The result returned by IsEmpty is subject to a race condition, meaning by the time you look at that result, it may well be stale (i.e. what it tells you about the queue may be wrong by the time you look at what it returned). Likewise, m_empty provides nothing but a record of information that's already available without it, a recipe for producing stale data. Using m_empty doesn't guarantee it can't work right, but it does increase the chances of bugs considerably, IMO.
I'm guessing it's due to the interim nature of the code, but right now you also have some serious problems with the array bounds. You aren't doing anything to force your array indexes to wrap around, so as soon as you try to push the 17th task onto the queue, you're going to have a major problem.
Edit: I should point out that the race condition noted in the comment is quite serious -- and it's not the only one either. Although somewhat better than the original, this should not be mistaken for code that will work correctly.
I'd say that writing correct lock-free code is considerably more difficult than writing correct code that uses locks. I don't know of anybody who has done so without a solid understanding of code that does use locking. Based on the original code, I'd say it would be much better to start by writing and understanding code for a queue that does use locks, and only when you've used that to gain a much better understanding of the issues involved really make an attempt at lock-free code.

Related

Is this request frequency limiter thread safe?

In order to prevent excessive server pressure, I implemented a request frequency limiter using a sliding window algorithm, which can determine whether the current request is allowed to pass according to the parameters. In order to achieve the thread safety of the algorithm, I used the atomic type to control the number of sliding steps of the window, and used unique_lock to achieve the correct sum of the total number of requests in the current window.
But I'm not sure whether my implementation is thread-safe, and if it is safe, whether it will affect service performance. Is there a better way to achieve it?
class SlideWindowLimiter
{
public:
bool TryAcquire();
void SlideWindow(int64_t window_number);
private:
int32_t limit_; // maximum number of window requests
int32_t split_num_; // subwindow number
int32_t window_size_; // the big window
int32_t sub_window_size_; // size of subwindow = window_size_ / split_number
int16_t index_{0}; //the index of window vector
std::mutex mtx_;
std::vector<int32_t> sub_windows_; // window vector
std::atomic<int64_t> start_time_{0}; //start time of limiter
};
bool SlideWindowLimiter::TryAcquire() {
int64_t cur_time = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
auto time_stamp = start_time_.load();
int64_t window_num = std::max(cur_time - window_size_ - start_time_, int64_t(0)) / sub_window_size_;
std::unique_lock<std::mutex> guard(mtx_, std::defer_lock);
if (window_num > 0 && start_time_.compare_exchange_strong(time_stamp, start_time_.load() + window_num*sub_window_size_)) {
guard.lock();
SlideWindow(window_num);
guard.unlock();
}
monitor_->TotalRequestQps();
{
guard.lock();
int32_t total_req = 0;
std::cout<<" "<<std::endl;
for(auto &p : sub_windows_) {
std::cout<<p<<" "<<std::this_thread::get_id()<<std::endl;
total_req += p;
}
if(total_req >= limit_) {
monitor_->RejectedRequestQps();
return false;
} else {
monitor_->PassedRequestQps();
sub_windows_[index_] += 1;
return true;
}
guard.unlock();
}
}
void SlideWindowLimiter::SlideWindow(int64_t window_num) {
int64_t slide_num = std::min(window_num, int64_t(split_num_));
for(int i = 0; i < slide_num; i++){
index_ += 1;
index_ = index_ % split_num_;
sub_windows_[index_] = 0;
}
}
First of all, thread-safe is a relative property. Two sequences of operations are thread-safe relative to each other. A single bit of code cannot be thread-safe by itself.
I'll instead answer "am I handling threading in such a way that reasonable thread-safety guarantees could be made with other reasonable code".
The answer is "No".
I found one concrete problem: your compare_exchange_strong isn't in a loop, and you access start_time_ at multiple spots without the proper care. If start_time_ changes between the 3 spots where you read and write it, compare_exchange_strong returns false and you fail to call SlideWindow, then... proceed as if you had.
I can't think of why that would be a reasonable response to contention, so that is a "No, this code isn't written to behave reasonably under multiple threads using it".
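To make the "in a loop" point concrete, here is a minimal sketch of how the start-time update could be retried. claimSlides is a hypothetical helper, not part of the asker's class; it reads the atomic once per attempt, recomputes from the refreshed value when the exchange fails, and covers only the atomic part. The sliding of sub_windows_ would still have to happen under the mutex.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: advances start_time by whole sub-windows and returns
// how many sub-windows this caller claimed (0 if nothing is left to claim).
inline std::int64_t claimSlides(std::atomic<std::int64_t>& start_time,
                                std::int64_t now,
                                std::int64_t window_size,
                                std::int64_t sub_window_size) {
    // Single read per attempt; everything below uses `observed`.
    std::int64_t observed = start_time.load(std::memory_order_acquire);
    for (;;) {
        const std::int64_t elapsed = now - window_size - observed;
        const std::int64_t slides =
            elapsed > 0 ? elapsed / sub_window_size : 0;
        if (slides == 0) {
            return 0;  // another thread already advanced far enough
        }
        const std::int64_t desired = observed + slides * sub_window_size;
        // On failure `observed` is refreshed with the current value and
        // the slide count is recomputed from it.
        if (start_time.compare_exchange_weak(observed, desired,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire)) {
            return slides;
        }
    }
}
```

With this shape a thread either claims a well-defined number of slides or learns that none remain; it never silently skips work it believed it had done.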
There is a lot of bad smell in your code. You are mixing concurrency code with a whole pile of state, which means it isn't clear what mutexes are guarding what data.
You have a pointer in your code that is never defined. Maybe it is supposed to be a global variable?
You are writing to cout using multiple << on one line. That is a bad plan in a multithreaded environment; even if your cout is concurrency-hardened, you get scrambled output. Build a buffer string and do one <<.
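A minimal sketch of the "build a buffer string and do one <<" advice, assuming hypothetical helpers formatCounts and logCounts:

```cpp
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: joins the counters into one space-separated string.
inline std::string formatCounts(const std::vector<int>& counts) {
    std::ostringstream line;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        if (i != 0) line << ' ';
        line << counts[i];
    }
    return line.str();
}

// Builds the whole line privately, then hands it to cout in one insertion,
// so fragments from different threads are not interleaved within a line.
inline void logCounts(const std::vector<int>& counts) {
    std::ostringstream line;
    line << std::this_thread::get_id() << ": " << formatCounts(counts) << '\n';
    std::cout << line.str();
}
```

All the formatting happens on a stream that only the calling thread can see; the shared std::cout is touched exactly once per line.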
You are passing data between functions via the back door; index_, for example. One function sets a member variable, another reads it. Is there any possibility it gets edited by another thread? Hard to audit, but it seems reasonably likely: you set it under one .lock(), then .unlock(), then read it as if it was in a sensible state under a later .lock(). What's more, you use it to access a vector; if the vector or index changed in unplanned ways, that could crash or lead to memory corruption.
...
I would be shocked if this code didn't have a pile of race conditions, crashes and the like in production. I see no sign of any attempt to prove that this code is concurrency safe, or simplify it to the point where it is easy to sketch such a proof.
In actual real practice, any code that you haven't proven is concurrency safe is going to be unsafe to use concurrently. So complex concurrency code is almost guaranteed to be unsafe to use concurrently.
...
Start with a really, really simple model. If you have a mutex and some data, make that mutex and the data into a single struct, so you know exactly what that mutex is guarding.
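One way to sketch "make that mutex and the data into a single struct" is a small wrapper whose value can only be reached while the lock is held. Guarded and withLock are hypothetical names chosen for this illustration.

```cpp
#include <mutex>
#include <utility>

// Hypothetical wrapper: the mutex and the value it protects live together,
// and the value is reachable only through withLock().
template <typename T>
class Guarded {
public:
    explicit Guarded(T initial) : m_value(std::move(initial)) {}

    // Runs f(value) with the mutex held and returns whatever f returns.
    template <typename F>
    auto withLock(F&& f) -> decltype(f(std::declval<T&>())) {
        std::lock_guard<std::mutex> lock(m_mutex);
        return f(m_value);
    }

private:
    std::mutex m_mutex;  // guards m_value and nothing else
    T m_value;
};
```

For the limiter, the vector of sub-window counters and its current index could live together inside one Guarded value, so that every read and write of either is visibly inside a withLock call.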
If you are messing with an atomic, don't use it in the middle of other code mixed up with other variables. Put it in its own class. Give that class a name, representing some concrete semantics, ideally ones that you have found elsewhere. Describe what it is supposed to do, and what the methods guarantee before and after. Then use that.
Elsewhere, avoid any kind of global state. This includes class member variables used to pass state around. Pass your data explicitly from one function to another. Avoid pointers to anything mutable.
If your data is all value types in automatic storage and pointers to immutable (never changing in the lifetime of your threads) data, that data can't be directly involved in a race condition.
The remaining data is bundled up and firewalled in as small a spot as possible, and you can look at how you interact with it and determine if you are messing up.
...
Multithreaded programming is hard, especially in an environment with mutable data. If you aren't working to make it possible to prove your code is correct, you are going to produce code that isn't correct, and you won't know it.
Well, based off my experience, I know it; all code that isn't obviously trying to act in such a way that it is easy to show it is correct is simply incorrect. If the code is old and has piles of patches over a decade+, the incorrectness is probably unlikely and harder to find, but it is probably still incorrect. If it is new code, it is probably easier to find the incorrectness.

Could this publish / check-for-update class for a single writer + reader use memory_order_relaxed or acquire/release for efficiency?

Introduction
I have a small class which makes use of std::atomic for lock-free operation. Since this class is called very frequently, it's affecting performance and I'm having trouble.
Class description
The class is similar to a LIFO, but once the pop() function is called, it only returns the last written element of its ring buffer (and only if there are new elements since the last pop()).
A single thread is calling push(), and another single thread is calling pop().
Source I've read
Since this is using too much of my computer's time, I decided to study the std::atomic class and its memory_order a bit further. I've read a lot of the memory_order posts available on StackOverflow, as well as other sources and books, but I'm not able to get a clear idea of the different modes. In particular, I'm struggling with the acquire and release modes: I fail to see why they are different from memory_order_seq_cst.
What I think each memory order does, in my own words, from my own research
memory_order_relaxed: In the same thread, the atomic operations are instant, but other threads may fail to see the latest values instantly; they will need some time until they are updated. The code can be reordered freely by the compiler or OS.
memory_order_acquire / release: Acquire is used by atomic::load. It prevents the lines of code that come before it from being reordered (the compiler/OS may reorder after this line all it wants), and reads the latest value that was stored on this atomic using memory_order_release or memory_order_seq_cst in this thread or another thread. memory_order_release also prevents the code after it from being reordered. So, in an acquire/release pair, all the code between the two can be shuffled by the OS. I'm not sure if that's within the same thread, or between different threads.
memory_order_seq_cst: Easiest to use because it's like the natural writing we are used to with variables, instantly refreshing the values seen by other threads' load functions.
The LockFreeEx class
template<typename T>
class LockFreeEx
{
public:
void push(const T& element)
{
const int wPos = m_position.load(std::memory_order_seq_cst);
const int nextPos = getNextPos(wPos);
m_buffer[nextPos] = element;
m_position.store(nextPos, std::memory_order_seq_cst);
}
const bool pop(T& returnedElement)
{
const int wPos = m_position.exchange(-1, std::memory_order_seq_cst);
if (wPos != -1)
{
returnedElement = m_buffer[wPos];
return true;
}
else
{
return false;
}
}
private:
static constexpr int maxElements = 8;
static constexpr int getNextPos(int pos) noexcept {return (++pos == maxElements)? 0 : pos;}
std::array<T, maxElements> m_buffer;
std::atomic<int> m_position {-1};
};
How I expect it could be improved
So, my first idea was to use memory_order_relaxed in all atomic operations. The pop() thread is in a loop checking for available updates every 10-15 ms, so it's acceptable for the first few pop() calls to fail and only notice later that there is a new update. It's only a few milliseconds.
Another option would be release/acquire, but I'm not sure about them: release in all store() calls and acquire in all load() calls.
Unfortunately, all the memory orders I described seem to work, and I'm not sure when they would fail, if they are supposed to fail.
Final
Please, could you tell me if you see some problem using relaxed memory order here? Or should I use release/acquire (maybe a further explanation on these could help me)? why?
I think that relaxed is the best for this class, in all its store() or load(). But I'm not sure!
Thanks for reading.
EDIT: EXTRA EXPLANATION:
Since I see everyone is asking about the 'char', I've changed it to int, problem solved! But that isn't the problem I want to solve.
The class, as I stated before, is something like a LIFO but where only the last element pushed matters, if there is any.
I have a big struct T (copyable and assignable) that I must share between two threads in a lock-free way. The only way I know to do it is using a circular buffer that holds the last known value of T, and an atomic which knows the index of the last value written. When there isn't any, the index is -1.
Notice that my push thread must know when there is a "new T" available; that's why pop() returns a bool.
Thanks again to everyone trying to assist me with memory orders! :)
AFTER READING SOLUTIONS:
template<typename T>
class LockFreeEx
{
public:
LockFreeEx() {}
LockFreeEx(const T& initValue): m_data(initValue) {}
// WRITE THREAD - CAN BE SLOW, WILL BE CALLED EACH 500-800ms
void publish(const T& element)
{
// I used acquire instead of relaxed to make sure wPos is always the latest m_writePos value, so that nextPos is calculated correctly
const int wPos = m_writePos.load(std::memory_order_acquire);
const int nextPos = (wPos + 1) % bufferMaxSize;
m_buffer[nextPos] = element;
m_writePos.store(nextPos, std::memory_order_release);
}
// READ THREAD - NEEDS TO BE VERY FAST - CALLED ONCE AT THE BEGINNING OF THE LOOP every 2ms
inline void update()
{
// should I change to relaxed? It doesn't matter I don't get the new value or the old one, since I will call this function again very soon, and again, and again...
const int writeIndex = m_writePos.load(std::memory_order_acquire);
// Updating only in case there is something new... T may be a heavy struct
if (m_readPos != writeIndex)
{
m_readPos = writeIndex;
m_data = m_buffer[m_readPos];
}
}
// NEED TO BE LIGHTNING FAST, CALLED MULTIPLE TIMES IN THE READ THREAD
inline const T& get() const noexcept {return m_data;}
private:
// Buffer
static constexpr int bufferMaxSize = 4;
std::array<T, bufferMaxSize> m_buffer;
std::atomic<int> m_writePos {0};
int m_readPos = 0;
// Data
T m_data;
};
Memory order is not about when you see some particular change to an atomic object but rather about what this change can guarantee about the surrounding code. Relaxed atomics guarantee nothing except the change to the atomic object itself: the change will be atomic. But you can't use relaxed atomics in any synchronization context.
And you have some code which requires synchronization. You want to pop something that was pushed, and not try to pop what has not been pushed yet. So if you use a relaxed operation then there is no guarantee that your pop will see this push code:
m_buffer[nextPos] = element;
m_position.store(nextPos, std::memory_order_relaxed);
as it is written. It just as well can see it this way:
m_position.store(nextPos, std::memory_order_relaxed);
m_buffer[nextPos] = element;
So you might try to get an element from the buffer which is not there yet. Hence, you have to use some synchronization and at least use acquire/release memory order.
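A minimal sketch of that release/acquire pairing, using a hypothetical Mailbox with int elements in place of T. It shows only the ordering guarantee; it does nothing about a writer overwriting a slot that a reader is still copying.

```cpp
#include <array>
#include <atomic>

// Hypothetical illustration: a plain buffer plus an atomic index that
// publishes which slot holds the latest value.
struct Mailbox {
    std::array<int, 8> buffer{};
    std::atomic<int> position{-1};

    void publish(int slot, int value) {
        buffer[slot] = value;  // plain write first
        // Release store: the buffer write above is visible to any thread
        // whose acquire load observes this value of position.
        position.store(slot, std::memory_order_release);
    }

    bool tryRead(int& out) const {
        // Acquire load: pairs with the release store in publish().
        const int slot = position.load(std::memory_order_acquire);
        if (slot < 0) return false;  // nothing published yet
        out = buffer[slot];
        return true;
    }
};
```

With relaxed in both places the index and the element could become visible in either order; with the release/acquire pair, seeing the index implies seeing the element that was written before it.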
And to your actual code. I think the order can be as follows:
const int wPos = m_position.load(std::memory_order_relaxed);
...
m_position.store(nextPos, std::memory_order_release);
...
const int wPos = m_position.exchange(-1, std::memory_order_acquire);
Your writer only needs release, not seq-cst, but relaxed is too weak. You can't publish a value for m_position until after the non-atomic assignment to the corresponding m_buffer[] entry. You need release ordering to make sure the m_position store is visible to other threads only after all earlier memory operations. (Including the non-atomic assignment). https://preshing.com/20120913/acquire-and-release-semantics/
This has to "synchronize-with" an acquire or seq_cst load in the reader. Or at least mo_consume in the reader.
In theory you also need wpos = m_position to be at least acquire (or consume in the reader), not relaxed, because C++11's memory model is weak enough for things like value-prediction which can let the compiler speculatively use a value for wPos before the load actually takes a value from coherent cache.
(In practice on real CPUs, a crazy compiler could do this with test/branch to introduce a control dependency, allowing branch prediction + speculative execution to break the data dependency for a likely value of wPos.)
But normal compilers don't do that. On CPUs other than DEC Alpha, the data dependency in the source code of wPos = m_position and then using m_buffer[wPos] will create a data dependency in the asm, like mo_consume is supposed to take advantage of. Real ISAs other than Alpha guarantee dependency-ordering for dependent loads. (And even on Alpha, using a relaxed atomic exchange might be enough to close the tiny window that exists on the few real Alpha CPUs that allow this reordering.)
When compiling for x86, there's no downside at all to using mo_acquire; it doesn't cost any extra barriers. There can be on other ISAs, like 32-bit ARM where acquire costs a barrier, so "cheating" with a relaxed load could be a win that's still safe in practice. Current compilers always strengthen mo_consume to mo_acquire so we unfortunately can't take advantage of it.
You already have a real-world race condition even using seq_cst.
initial state: m_position = 0
reader "claims" slot 0 by exchanging in a m_position = -1 and reads part of m_buffer[0];
reader sleeps for some reason (e.g. timer interrupt deschedules it), or simply races with a writer.
writer reads wPos = m_position as -1, and calculates nextPos = 0.
It overwrites the partially-read m_buffer[0]
reader wakes up and finishes reading, getting a torn T &element. Data race UB in the C++ abstract machine, and tearing in practice.
Adding a 2nd check of m_position after the read (like a SeqLock) can't detect this in every case because the writer doesn't update m_position until after writing the buffer element.
Even though your real use-case has long gaps between reads and writes, this defect can bite you with just one read and write happening at almost the same time.
I for sure know that the read side cannot wait for nothing and cannot be stopped (it's audio) and it's poped each 5-10ms, and the write side is the user input, which is more slower, a faster one could do a push once each 500ms.
A millisecond is ages on a modern CPU. Inter-thread latency is often something like 60 ns, so fractions of a microsecond, e.g. from a quad-core Intel x86. As long as you don't sleep on a mutex, it's not a problem to spin-retry once or twice before giving up.
Code review:
The class is similar to a LIFO, but once the pop() function is called, it only returns the last written element of its ring buffer (and only if there are new elements since the last pop()).
This isn't a real queue or stack: push and pop aren't great names. "publish" and "read" or "get" might be better and make it more obvious what this is for.
I'd include comments in the code to describe the fact that this is safe for a single writer, multiple readers. (The non-atomic increment of m_position in push makes it clearly unsafe for multiple writers.)
Even so, it's kinda weird even with 1 writer + 1 reader running at the same time. If a read starts while a write is in progress, it will get the "old" value instead of spin-waiting for a fraction of a microsecond to get the new value. Then next time it reads there will already be a new value waiting; the one it just missed seeing last time. So e.g. m_position can update in this order: 2, -1, 3.
That might or might not be desirable, depending on whether "stale" data has any value, and on acceptability of the reader blocking if the writer sleeps mid-write. Or even without the writer sleeping, of spin-waiting.
The standard pattern for rarely written smallish data with multiple read-only readers is a SeqLock. e.g. for publishing a 128-bit current timestamp on a CPU that can't atomically read or write a 128-bit value. See Implementing 64 bit atomic counter with 32 bit atomics
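For reference, the SeqLock pattern can be sketched as follows for a 64-bit value stored as two 32-bit halves. SeqLock64 is a hypothetical class written for this illustration; it assumes a single writer, and the payload halves are themselves relaxed atomics so that a read overlapping a write is not a data race in the C++ abstract machine.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical SeqLock: single writer, any number of readers.
class SeqLock64 {
public:
    void write(std::uint64_t value) {
        const std::uint32_t seq = m_seq.load(std::memory_order_relaxed);
        // Odd sequence number marks "write in progress".
        m_seq.store(seq + 1, std::memory_order_relaxed);
        // Keep the payload stores from becoming visible before the marker.
        std::atomic_thread_fence(std::memory_order_release);
        m_lo.store(static_cast<std::uint32_t>(value),
                   std::memory_order_relaxed);
        m_hi.store(static_cast<std::uint32_t>(value >> 32),
                   std::memory_order_relaxed);
        // Even sequence number marks "write complete".
        m_seq.store(seq + 2, std::memory_order_release);
    }

    // Returns false if a write overlapped the read; the caller may retry.
    bool tryRead(std::uint64_t& out) const {
        const std::uint32_t before = m_seq.load(std::memory_order_acquire);
        if (before & 1u) return false;  // writer is mid-update
        const std::uint32_t lo = m_lo.load(std::memory_order_relaxed);
        const std::uint32_t hi = m_hi.load(std::memory_order_relaxed);
        // Keep the payload loads ahead of the sequence re-check.
        std::atomic_thread_fence(std::memory_order_acquire);
        if (m_seq.load(std::memory_order_relaxed) != before) return false;
        out = (static_cast<std::uint64_t>(hi) << 32) | lo;
        return true;
    }

private:
    std::atomic<std::uint32_t> m_seq{0};
    std::atomic<std::uint32_t> m_lo{0};
    std::atomic<std::uint32_t> m_hi{0};
};
```

Readers never write shared state, which is what makes the pattern attractive for rarely written data; the price is that a reader has to retry, or give up, when it collides with a write.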
Possible design changes
To make this safe, we could let the writer run free, always wrapping around its circular buffer, and have the reader keep track of the last element it looked at.
If there's only one reader, this should be a simple non-atomic variable. If it's an instance variable, at least put it on the other side of m_buffer[] from the write-position.
// Possible failure mode: writer wraps around between reads, leaving same m_position
// single-reader
const bool read(T &elem)
{
// FIXME: big hack to get this in a separate cache line from the instance vars
// maybe instead use alignas(64) int m_lastread as a class member, and/or on the other side of m_buffer from m_position.
static int lastread = -1;
int wPos = m_position.load(std::memory_order_acquire); // or cheat with relaxed to get asm that's like "consume"
if (lastread == wPos)
return false;
elem = m_buffer[wPos];
lastread = wPos;
return true;
}
You want lastread in a separate cache line from the stuff the writer writes. Otherwise the reader's updates of lastread will be slower because of false sharing with the writer's writes, and vice versa.
This lets the reader(s) be truly read-only wrt. the cache lines written by the writer. It will still take MESI traffic to request read access to lines that are in Modified state after the writer writes them, though. But the writer can still read m_position with no cache miss, so it can get its stores into the store buffer right away. It only has to wait for an RFO to get exclusive ownership of the cache line(s) before it can commit the element and the updated m_position from its store buffer to L1d cache.
TODO: let m_position increment without manual wrapping, so we have a write sequence number that takes a very long time to wrap around, avoiding false-negative early out from lastread == wPos.
Use wPos & (maxElements-1) as the index. And static_assert((maxElements & (maxElements-1)) == 0, "maxElements must be a power of 2"); (the extra parentheses matter because == binds tighter than &).
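The sequence-number idea from the TODO above can be sketched in a few lines. kMaxElements and slotFor are hypothetical names, and an unsigned counter is assumed so that overflow is well defined.

```cpp
#include <cstdint>

// Hypothetical capacity; must be a power of 2 for the mask to work.
constexpr std::uint32_t kMaxElements = 8;
static_assert((kMaxElements & (kMaxElements - 1)) == 0,
              "kMaxElements must be a power of 2");

// Maps an ever-increasing write sequence number to a buffer slot.
constexpr std::uint32_t slotFor(std::uint32_t sequence) {
    return sequence & (kMaxElements - 1);
}
```

Because 2^32 is a multiple of any power-of-2 capacity, the slot sequence continues without a jump when the 32-bit counter eventually wraps, and two different sequence numbers compare unequal even when they map to the same slot.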
Then the only danger is undetected tearing in a tiny time-window if the writer has wrapped all the way around and is writing the element being read. For frequent reads and infrequent writes, and a buffer that's not too small, this should never happen. Checking the m_position again after a read (like a SeqLock, similar to below) narrows the race window to only writes that are still in progress.
If there are multiple readers, another good option might be a claimed flag in each m_buffer entry. So you'd define
template<typename T>
class WaitFreePublish
{
private:
struct {
alignas(32) T elem; // at most 2 elements per cache line
std::atomic<int8_t> claimed; // writers sets this to 0, readers try to CAS it to 1
// could be bool if we don't end up needing 3 states for anything.
// set to "1" in the constructor? or invert and call it "unclaimed"
} m_buffer[maxElements];
std::atomic<int> m_position {-1};
};
If T has padding at the end, it's a shame we can't take advantage of that for the claimed flag :/
This avoids the possible failure mode of comparing positions: if the writer wraps around between reads, the worst we get is tearing. And we could detect such tearing by having the writer clear the claimed flag first, before writing the rest of the element.
With no other threads writing m_position, we can definitely use a relaxed load without worry. We could even cache the write-position somewhere else, but the reader hopefully isn't invalidating the cache-line containing m_position very often. And apparently in your use-case, writer performance/latency probably isn't a big deal.
So the writer + reader could look like this, with SeqLock-style tearing detection using the known update-order for claimed flag, element, and m_position.
/// claimed flag per array element supports concurrent readers
// thread-safety: single-writer only
// update claimed flag first, then element, then m_position.
void publish(const T& elem)
{
const int wPos = m_position.load(std::memory_order_relaxed);
const int nextPos = getNextPos(wPos);
m_buffer[nextPos].claimed.store(0, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release); // make sure that `0` is visible *before* the non-atomic element modification
m_buffer[nextPos].elem = elem;
m_position.store(nextPos, std::memory_order_release);
}
// thread-safety: multiple readers are ok. First one to claim an entry gets it
// check claimed flag before/after to detect overwrite, like a SeqLock
const bool read(T &elem)
{
int rPos = m_position.load(std::memory_order_acquire);
int8_t claimed = m_buffer[rPos].claimed.load(std::memory_order_relaxed);
if (claimed != 0)
return false; // read-only early-out
claimed = 0;
if (!m_buffer[rPos].claimed.compare_exchange_strong(
claimed, 1, std::memory_order_acquire, std::memory_order_relaxed))
return false; // strong CAS failed: another thread claimed it
elem = m_buffer[rPos].elem;
// final check that the writer didn't step on this buffer during read, like a SeqLock
std::atomic_thread_fence(std::memory_order_acquire); // LoadLoad barrier
// We expect it to still be claimed=1 like we set with CAS
// Otherwise we raced with a writer and elem may be torn.
// optionally retry once or twice in this case because we know there's a new value waiting to be read.
return m_buffer[rPos].claimed.load(std::memory_order_relaxed) == 1;
// Note that elem can be updated even if we return false, if there was tearing. Use a temporary if that's not ok.
}
Using claimed = m_buffer[rPos].claimed.exchange(1) and checking for claimed==0 would be another option, vs. CAS-strong. Maybe slightly more efficient on x86. On LL/SC machines I guess CAS might be able to bail out without doing a write at all if it finds a mismatch with expected, in which case the read-only check is pointless.
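As a rough illustration of the two claiming strategies mentioned here, the sketch below puts an exchange-based claim next to the CAS-based one. The Slot struct and both function names are hypothetical stand-ins for one m_buffer element; they are not part of the code above.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for one m_buffer element.
struct Slot {
    std::atomic<std::int8_t> claimed{0};
    int elem = 0;
};

// exchange() always stores 1; we own the slot only if the old value was 0.
bool try_claim_exchange(Slot& s) {
    return s.claimed.exchange(1, std::memory_order_acquire) == 0;
}

// CAS stores 1 only if the flag was still 0, otherwise it reports failure.
bool try_claim_cas(Slot& s) {
    std::int8_t expected = 0;
    return s.claimed.compare_exchange_strong(
        expected, 1, std::memory_order_acquire, std::memory_order_relaxed);
}
```

Either way exactly one caller sees success per slot until a writer resets the flag to 0. The exchange version rewrites the flag even when it loses, which is harmless here because losers only ever store the value 1 that is already there.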
I used .claimed.compare_exchange_strong(claimed, 1) with success ordering = acquire to make sure that read of claimed happens-before reading .elem.
The "failure" memory ordering can be relaxed: If we see it already claimed by another thread, we give up and don't look at any shared data.
The memory-ordering of the store part of compare_exchange_strong can be relaxed, so we just need mo_acquire, not acq_rel. Readers don't do any other stores to the shared data, and I don't think the ordering of the store matters wrt. the loads. CAS is an atomic RMW. Only one thread's CAS can succeed on a given buffer element because they're all trying to set it from 0 to 1. That's how atomic RMWs work, regardless of being relaxed or seq_cst or anything in between.
It doesn't need to be seq_cst: we don't need to flush the store buffer or whatever to make sure the store is visible before this thread reads .elem. Just being an atomic RMW is enough to stop multiple threads from actually thinking they succeed. Release would just make sure it can't move earlier, ahead of the relaxed read-only check. That wouldn't be a correctness problem. Hopefully no x86 compilers would do that at compile time. (At runtime on x86, RMW atomic operations are always seq_cst.)
I think being an RMW makes it impossible for it to "step on" a write from a writer (after wrapping around). But this might be real-CPU implementation detail, not ISO C++. In the global modification order for any given .claimed, I think the RMW stays together, and the "acquire" ordering does keep it ahead of the read of the .elem. A release store that wasn't part of a RMW would be a potential problem though: a writer could wrap around and put claimed=0 in a new entry, then the reader's store could eventually commit and set it to 1, when actually no reader has ever read that element.
If we're very sure the reader doesn't need to detect writer wrap-around of the circular buffer, leave out the std::atomic_thread_fence in the writer and reader. (The claimed and the non-atomic element store will still be ordered by the release-store to m_position). The reader can be simplified to leave out the 2nd check and always return true if it gets past the CAS.
Notice that m_buffer[nextPos].claimed.store(0, std::memory_order_release); would not be sufficient to stop later non-atomic stores from appearing before it: release-stores are a one-way barrier, unlike release fences. A release-fence is like a 2-way StoreStore barrier. (Free on x86, cheap on other ISAs.)
This SeqLock-style tearing detection doesn't technically avoid UB in the C++ abstract machine, unfortunately. There's no good / safe way to express this pattern in ISO C++, and it's known to be safe in asm on real hardware. Nothing actually uses the torn value (assuming read()'s caller ignores its elem value if it returns false).
Making elem a std::atomic<T> would defeat the entire purpose: that would use a spinlock to get atomicity, so you might as well use the spinlock directly.
Using volatile T elem would break buffer[i].elem = elem because unlike C, C++ doesn't allow copying a volatile struct to/from a regular struct. (volatile struct = struct not possible, why?). This is highly annoying for a SeqLock type of pattern where you'd like the compiler to emit efficient code to copy the whole object representation, optionally using SIMD vectors. You won't get that if you write a constructor or assignment operator that takes a volatile T& argument and copies individual members. So clearly volatile is the wrong tool, and that only leaves compiler memory barriers to make sure the non-atomic object is fully read or fully written before the barrier. std::atomic_thread_fence is I think actually safe for that, like asm("" ::: "memory") in GNU C. It works in practice on current compilers.

multiple locks on different elements of an array

If I have 8 threads and an array of 1,000,000,000 elements, I can have 1,000,000,000 mutexes, where the index represents the element within the array that is being locked and written to. However, this seems fairly wasteful to me and requires a lot of memory.
Is there a way that I can use only 8 mutexes and have the same functionality?
Thinking out aloud here... and not really sure how efficient this would be, but:
You could create a method of locking certain indexes:
std::vector<int> mutexed_slots;
std::mutex mtx;
bool lock_element(int index)
{
std::lock_guard<std::mutex> lock(mtx);
// Check if item is in the mutexed list
if ( std::find(mutexed_slots.begin(), mutexed_slots.end(), index) == mutexed_slots.end() )
{
// If its not then add it - now that array value is safe from other threads
mutexed_slots.emplace_back(index);
return true;
}
return false;
}
void unlock_element(int index)
{
std::lock_guard<std::mutex> lock(mtx);
// No need to check because you will only unlock the element that you accessed (unless you are a very naughty boy indeed)
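// Erase by value, not by position: the vector stores locked indexes (std::remove needs <algorithm>)
mutexed_slots.erase(std::remove(mutexed_slots.begin(), mutexed_slots.end(), index), mutexed_slots.end());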
}
Note: This is the start of an idea, so don't knock it too hard just yet! It's also untested pseudocode. It's not really intended as a final answer, but as a starting point. Please add comments to improve it or to suggest whether it is or isn't plausible.
Further points:
There may be a more efficient STL container to use
You could probably wrap all of this up in a class along with your data
You would need to loop through lock_element() until it returns true - again not pretty at the moment. This mechanism could be improved.
Each thread needs to remember which index they currently are working on so that they only unlock that particular one - again this could be more integrated within a class to ensure that behaviour.
But as a concept - workable? I would think if you need really fast access (which maybe you do) this might not be that efficient, thoughts?
Update
This could be made much more efficient if each thread/worker "registers" its own entry in mutexed_slots. Then there would no push_back/remove's from the vector (except at the start/end). So each thread just sets the index that it has locked - if it has nothing locked then it just gets set to -1 (or such). I am thinking there are many more such efficiency improvements to be made. Again a complete class to do all this for you would be the way to implement it.
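A minimal sketch of that per-worker registration idea might look like the following. It is not the tester implementation mentioned below; SlotRegistry, kWorkers and kNone are hypothetical names, and it assumes a fixed number of workers that each hold at most one index at a time.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <mutex>

constexpr std::size_t kWorkers = 8;  // assumed fixed worker count
constexpr long kNone = -1;           // "this worker holds nothing"

class SlotRegistry {
public:
    SlotRegistry() {
        for (auto& s : slots_) s.store(kNone, std::memory_order_relaxed);
    }

    // Returns true if `worker` now owns `index`, false if another worker holds it.
    bool try_lock(std::size_t worker, long index) {
        std::lock_guard<std::mutex> guard(mtx_);  // serializes competing claims
        for (const auto& s : slots_)
            if (s.load(std::memory_order_acquire) == index) return false;
        slots_[worker].store(index, std::memory_order_relaxed);
        return true;
    }

    // No mutex needed: each worker only ever clears its own slot.
    void unlock(std::size_t worker) {
        slots_[worker].store(kNone, std::memory_order_release);
    }

private:
    std::array<std::atomic<long>, kWorkers> slots_;
    std::mutex mtx_;
};
```

The vector never grows or shrinks at run time, and the release store in unlock() pairs with the acquire load in try_lock(), so the next owner of an index sees the previous owner's writes to that element.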
Testing / Results
I implemented a tester for this, just because I quite enjoy that sort of thing. My implementation is here
I think it's a public GitHub repo, so you are welcome to take a look. I also posted the results in the top-level readme (scroll down a little to see them). I implemented some improvements such that:
There are no insert/removal to the protection array at run-time
There is no need for a lock_guard to do the "unlock" because I am relying on a std::atomic index.
Below is a printout of my summary:
Summary:
When the workload is 1ms (the time taken to perform each action) then the amount of work done was:
9808 for protected
8117 for normal
Note these values varied, sometimes the normal was higher, there appeared no clear winner.
When the workload is 0ms (basically increment a few counters) then the amount of work done was:
9791264 for protected
29307829 for normal
So here you can see that using the mutexed protection cuts the amount of work done to about a third (1/3) of the unprotected figure, i.e. it is roughly 3 times slower. This ratio is consistent between tests.
I also ran the same tests for 1 worker, and the same ratios roughly held true. However when I make the array smaller (~1000 elements) the amount of work done is still roughly the same for when the work load is 1ms. But when the workload is very light I got results like:
5621311
39157931
Which is about 7 times slower.
Conclusion
The larger the array, the fewer collisions occur and the better the performance.
The longer the workload is (per item), the less noticeable the cost of the protecting mechanism becomes.
It appears that the locking generally only adds an overhead that is 2-3 times slower than incrementing a few counters. This is probably skewed by actual collisions because (from the results) the longest lock time recorded was a huge 40ms, but this was when the work time was very short, so many collisions occurred (~8 successful locks per collision).
It depends on the access pattern, do you have a way to partition the work effectively? Basically, you can partition the array into 8 chunks (or as many as you can afford) and cover each part with a mutex, but if the access pattern is random you're still going to have a lot of collisions.
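As a sketch of that partitioning, assuming a non-empty array split into 8 contiguous chunks, the mapping from element index to mutex could look like this. ChunkLocks and its members are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <mutex>

// One mutex per contiguous chunk of the array. Assumes array_size > 0.
class ChunkLocks {
public:
    explicit ChunkLocks(std::size_t array_size)
        : chunk_size_((array_size + kChunks - 1) / kChunks) {}

    // Which chunk (0..kChunks-1) the element at index i falls into.
    std::size_t chunk_of(std::size_t i) const { return i / chunk_size_; }

    // The mutex guarding the chunk that contains index i.
    std::mutex& for_index(std::size_t i) { return locks_[chunk_of(i)]; }

private:
    static constexpr std::size_t kChunks = 8;
    std::size_t chunk_size_;
    std::array<std::mutex, kChunks> locks_;
};
```

A thread would then hold std::lock_guard<std::mutex> guard(locks.for_index(i)) while writing element i. With random access, two threads land in the same chunk fairly often, which is the collision cost described above.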
Do you have TSX support on your system? This would be a classic use case for it: just have one global lock, and have the threads ignore it unless there's an actual collision.
You can write a class that will create locks on the fly when a particular index requires it, std::optional would be helpful for this (C++17 code ahead):
class IndexLocker {
public:
explicit IndexLocker(size_t size) : index_locks_(size) {}
std::mutex& get_lock(size_t i) {
if (std::lock_guard guard(instance_lock_); index_locks_[i] == std::nullopt) {
index_locks_[i].emplace();
}
return *index_locks_[i];
}
private:
std::vector<std::optional<std::mutex>> index_locks_;
std::mutex instance_lock_;
};
You could also use std::unique_ptr, so that each unused slot costs only one pointer instead of a whole std::optional<std::mutex>, while keeping identical semantics:
class IndexLocker {
public:
explicit IndexLocker(size_t size) : index_locks_(size) {}
std::mutex& get_lock(size_t i) {
if (std::lock_guard guard(instance_lock_); index_locks_[i] == nullptr) {
index_locks_[i] = std::make_unique<std::mutex>();
}
return *index_locks_[i];
}
private:
std::vector<std::unique_ptr<std::mutex>> index_locks_;
std::mutex instance_lock_;
};
Using this class doesn't necessarily mean you need to create a lock for each of the 1,000,000,000 elements. You can use modulo operations to treat the locker as a "hash table" of mutexes:
constexpr size_t kLockLimit = 8;
IndexLocker index_locker(kLockLimit);
auto thread_code = [&](size_t i) {
std::lock_guard guard(index_locker.get_lock(i % kLockLimit));
// Do work with lock.
};
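Worth mentioning that the "hash table" approach makes it very easy to deadlock: locking element 0 and then element 16, for example, maps both to the same mutex, and a thread that locks a std::mutex it already holds has undefined behavior (in practice usually a deadlock). If each thread does work on exactly one element at a time, however, this shouldn't be an issue.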
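If a thread really does need two elements at once, one way around that problem, sketched here under the assumption of a fixed table of 8 mutexes, is to detect the shared-mutex case and otherwise acquire both mutexes with std::lock, which uses a deadlock-avoidance algorithm. The names g_stripes, kStripes and with_both_locked are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <mutex>
#include <utility>

constexpr std::size_t kStripes = 8;
std::array<std::mutex, kStripes> g_stripes;

// Runs fn() while holding the mutexes that guard elements i and j.
template <class Fn>
void with_both_locked(std::size_t i, std::size_t j, Fn&& fn) {
    std::mutex& a = g_stripes[i % kStripes];
    std::mutex& b = g_stripes[j % kStripes];
    if (&a == &b) {
        // Both indexes hash to the same mutex: lock it exactly once.
        std::lock_guard<std::mutex> guard(a);
        std::forward<Fn>(fn)();
    } else {
        // Acquire both without risking a lock-order deadlock.
        std::lock(a, b);
        std::lock_guard<std::mutex> guard_a(a, std::adopt_lock);
        std::lock_guard<std::mutex> guard_b(b, std::adopt_lock);
        std::forward<Fn>(fn)();
    }
}
```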
There are other trade-offs with fine-grain locking. Atomic operations are expensive, so a parallel algorithm that locks every element can take longer than the sequential version.
How to lock efficiently depends. Are the array elements dependent on other elements in the array? Are you mostly reading? mostly writing?
I don't want to split the array into 8 parts because that will cause a
high likelihood of waiting (access is random). The elements of the
array are a class that I will write that will be multiple Golomb coded
values.
I don't think having 8 mutexes is the way to go here. If a given lock protects an array section, you can't switch it to protect a different section in the midst of parallel execution without introducing a race condition (rendering the mutex pointless).
Are the array items small? If you can get them down to 8 bytes, you can declare your class with alignas(8) and instantiate std::atomic<YourClass> objects. (Size depends on architecture. Verify is_lock_free() returns true.) That could open up the possibility of lock-free algorithms. It almost seems like a variant of hazard pointers would be useful here. That's complex, so it's probably better to look into other approaches to parallelism if time is limited.
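To make the 8-byte suggestion concrete, here is a small sketch with a hypothetical two-field element type (Packed is not the asker's Golomb-coded class). Whether std::atomic<Packed> is actually lock-free depends on the platform, so check is_lock_free() at run time as noted above.

```cpp
#include <atomic>
#include <cstdint>
#include <type_traits>

// Hypothetical element that fits in 8 bytes with no padding.
struct alignas(8) Packed {
    std::uint32_t count;
    std::uint32_t last;
};
static_assert(sizeof(Packed) == 8, "element must fit in 8 bytes");
static_assert(std::is_trivially_copyable<Packed>::value,
              "std::atomic requires a trivially copyable type");

// Read-modify-write of the whole element with a CAS loop, no mutex.
void record(std::atomic<Packed>& slot, std::uint32_t value) {
    Packed old = slot.load(std::memory_order_relaxed);
    Packed next;
    do {
        next.count = old.count + 1;
        next.last = value;
        // On failure, `old` is refreshed with the current contents and we retry.
    } while (!slot.compare_exchange_weak(old, next,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed));
}
```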

Thread-safe Settings

I'm writing some settings classes that can be accessed from everywhere in my multithreaded application. I will read these settings very often (so read access should be fast), but they are not written very often.
For primitive datatypes it looks like boost::atomic offers what I need, so I came up with something like this:
class UInt16Setting
{
private:
boost::atomic<uint16_t> _Value;
public:
uint16_t getValue() const { return _Value.load(boost::memory_order_relaxed); }
void setValue(uint16_t value) { _Value.store(value, boost::memory_order_relaxed); }
};
Question 1: I'm not sure about the memory ordering. I think in my application I don't really care about memory ordering (do I?). I just want to make sure that getValue() always returns a non-corrupted value (either the old or the new one). So are my memory ordering settings correct?
Question 2: Is this approach using boost::atomic recommended for this kind of synchronization? Or are there other constructs that offer better read performance?
I will also need some more complex setting types in my application, like std::string or for example a list of boost::asio::ip::tcp::endpoints. I consider all these setting values as immutable. So once I set the value using setValue(), the value itself (the std::string or the list of endpoints itself) does not change anymore. So again I just want to make sure that I get either the old value or the new value, but not some corrupted state.
Question 3: Does this approach work with boost::atomic<std::string>? If not, what are alternatives?
Question 4: How about more complex setting types like the list of endpoints? Would you recommend something like boost::atomic<boost::shared_ptr<std::vector<boost::asio::ip::tcp::endpoint>>>? If not, what would be better?
Q1: Correct, as long as you don't try to read any shared non-atomic variables after reading the atomic. Memory barriers only synchronize access to non-atomic variables that may happen between atomic operations.
Q2: I don't know (but see below).
Q3: Should work (if it compiles). However, atomic<string> possibly isn't lock-free.
Q4: Should work but, again, the implementation possibly isn't lock-free (implementing a lock-free shared_ptr is a challenging and patent-mined field).
So a readers-writer lock (as Damon suggests in the comments) is probably simpler, and may even be more efficient, if your config includes data larger than one machine word (the size that CPU-native atomics usually work on).
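To illustrate the Q1 caveat: if a reader does go on to read non-atomic shared data after seeing the atomic change, relaxed ordering is no longer enough and the usual fix is a release store paired with an acquire load. The sketch below uses std::atomic rather than boost::atomic, and the names g_payload, g_ready, publish and try_consume are hypothetical.

```cpp
#include <atomic>
#include <string>

std::string g_payload;             // non-atomic shared data
std::atomic<bool> g_ready{false};  // flag that publishes it

// Writer: fill in the data first, then publish with a release store.
void publish(const std::string& value) {
    g_payload = value;
    g_ready.store(true, std::memory_order_release);
}

// Reader: the acquire load guarantees that, once the flag is seen as true,
// the write to g_payload is visible as well.
bool try_consume(std::string& out) {
    if (!g_ready.load(std::memory_order_acquire)) return false;
    out = g_payload;
    return true;
}
```

This sketch assumes the data is published once; with repeated updates the writer must not modify g_payload while readers may still be copying it.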
[EDIT] However, atomic<shared_ptr<TheWholeStructContainigAll> > may make some sense even if it isn't lock-free: this approach minimizes the collision probability for readers that need more than one coherent value, though the writer has to make a new copy of the whole "parameter sheet" every time it changes something.
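A rough sketch of that whole-struct approach, assuming a single writer and using the std::atomic_load / std::atomic_store overloads for std::shared_ptr from <memory> (deprecated since C++20 in favour of std::atomic<std::shared_ptr<T>>), could look like this. Config and the function names are hypothetical.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Hypothetical "parameter sheet" holding every setting.
struct Config {
    std::uint16_t port;
    std::string host;
};

std::shared_ptr<const Config> g_config =
    std::make_shared<Config>(Config{8080, "localhost"});

// Readers take a snapshot; it stays valid and unchanged for as long as they hold it.
std::shared_ptr<const Config> current_config() {
    return std::atomic_load(&g_config);
}

// Single writer: copy the whole sheet, modify the copy, publish the copy.
void set_port(std::uint16_t port) {
    auto next = std::make_shared<Config>(*current_config());
    next->port = port;
    std::atomic_store(&g_config, std::shared_ptr<const Config>(std::move(next)));
}
```

A reader that needs both port and host reads them from one snapshot, so it never sees a mix of old and new values. With more than one writer, the copy and publish steps would need a mutex or a compare-exchange loop to avoid lost updates.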
For question 1, the answer is "depends, but probably not". If you really only care that a single value isn't garbled, then yes, this is fine, and you don't care about memory order either.
Usually, though, this is a false premise.
For questions 2, 3, and 4 yes, this will work, but it will likely use locking for complex objects such as string (internally, for every access, without you knowing). Only rather small objects which are roughly the size of one or two pointers can normally be accessed/changed atomically in a lockfree manner. This depends on your platform, too.
It's a big difference whether one successfully updates one or two values atomically. Say you have the values left and right which delimit the left and right boundaries of where a task will do some processing in an array. Assume they are 50 and 100, respectively, and you change them to 101 and 150, each atomically. So the other thread picks up the change from 50 to 101 and starts doing its calculation, sees that 101 > 100, finishes, and writes the result to a file. After that, you change the output file's name, again, atomically.
Everything was atomic (and thus, more expensive than normal), but none of it was useful. The result is still wrong, and was written to the wrong file, too.
This may not be a problem in your particular case, but usually it is (and, your requirements may change in the future). Usually you really want the complete set of changes being atomic.
That said, if you have either many or complex (or, both many and complex) updates like this to do, you might want to use one big (reader-writer) lock for the whole config in the first place anyway, since that is more efficient than acquiring and releasing 20 or 30 locks or doing 50 or 100 atomic operations. Do however note that in any case, locking will severely impact performance.
As pointed out in the comments above, I would preferably make a deep copy of the configuration in the one thread that modifies the configuration, and schedule updates of the reference (shared pointer) used by consumers as normal tasks. That copy-modify-publish approach is a bit similar to how MVCC databases work, too (these, too, have the problem that locking kills their performance).
Modifying a copy ensures that only readers are accessing any shared state, so no synchronization is necessary either for readers or for the single writer. Reading and writing are fast.
Swapping the configuration set happens only at well-defined points in times when the set is guaranteed to be in a complete, consistent state and threads are guaranteed not to do something else, so no ugly surprises of any kind can happen.
A typical task-driven application would look somewhat like this (in C++-like pseudocode):
// consumer/worker thread(s)
for(;;)
{
task = queue.pop();
switch(task.code)
{
case EXIT:
return;
case SET_CONFIG:
my_conf = task.data;
break;
default:
task.func(task.data, &my_conf); // can read without sync
}
}
// thread that interacts with user (also producer)
for(;;)
{
input = get_input();
if(input.action == QUIT)
{
queue.push(task(EXIT, 0, 0));
for(auto& thread : threads)
thread.join();
return 0;
}
else if(input.action == CHANGE_SETTINGS)
{
new_config = new config(config); // copy, readonly operation, no sync
// assume we have operator[] overloaded
new_config[...] = ...; // I own this exclusively, no sync
task t(SET_CONFIG, 0, shared_ptr<...>(new_config)); // publish the modified copy
queue.push(t);
}
else if(input.action == ADD_TASK)
{
task t(RUN, input.func, input.data);
queue.push(t);
}
...
}
For anything more substantial than a pointer, use a mutex. The TBB (open-source) library supports reader-writer mutexes, which allow multiple simultaneous readers; see the documentation.
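As a sketch of the reader-writer idea, here is a string setting guarded by std::shared_mutex from C++17. This swaps in the standard library for the TBB mutex mentioned above, and StringSetting is a hypothetical class modeled on the UInt16Setting in the question.

```cpp
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

class StringSetting {
public:
    // Many readers may hold the shared lock at the same time.
    std::string getValue() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return value_;  // returns a copy, so callers never see a half-written string
    }

    // A writer takes the lock exclusively.
    void setValue(std::string value) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        value_ = std::move(value);
    }

private:
    mutable std::shared_mutex mutex_;
    std::string value_;
};
```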

should i always lock the global data in multi-thread programming, why or why not?

I'm new to multi-threaded programming (actually, I'm not a complete beginner, but I always use global data shared between the reading and writing threads, which I think makes my code ugly and slow, and I'm eager to improve my skills).
I'm now developing a forwarder server in C++. To simplify the question, suppose there are only two threads, a receiving thread and a sending thread, and, the usual stupid design, a global std::list for saving data :(
The receiving thread reads raw data from the server and writes it into the global std::list.
The sending thread reads the global std::list and sends the data to several clients.
I use pthread_mutex_lock to synchronize access to the global std::list.
The problem is that the performance of the forwarder server is poor: the global list is locked while the receiving thread is writing, but at that time my sending thread wants to read, so it has to wait, and I think this waiting is wasted time.
What should I do? I know that globals are bad, but without a global, how can I sync these two threads?
I'll keep searching on SO and Google.
Any suggestions, guides, technologies or books will be appreciated. Thanks!
EDIT
For any suggestion, I'd like to know why or why not; please give me the reason. Thanks a lot.
Notes:
Please provide more complete examples: http://sscce.org/
Answers:
Yes, you should synchronize access to shared data.
NOTE: this makes assumptions about the std::list implementation, which may or may not apply to your case. But since these assumptions are valid for some implementations, you cannot assume your implementation is thread safe without an explicit guarantee.
Consider the snippet:
std::list<data> g_list;
void thread1()
{
while( /*input ok*/ )
{
/*read input*/
g_list.push_back( /*something*/ );
}
}
void thread2()
{
while( /*something*/ )
{
/*pop from list*/
data x = g_list.front();
g_list.pop_front();
}
}
say for example list has 1 element in it
std::list::push_back() must do:
allocate space (many CPU instructions)
copy data into new space (many CPU instructions)
update previous element (if it exists) to point to new element
set std::list::_size
std::list::pop_front() must do:
free space
update next element to not have previous element
set std::list::_size
Now say thread 1 calls push_back(). After checking that there is an element (a check on size), it goes on to update that element. But before it gets the chance to do so, thread 2 could be running pop_front() and be busy freeing the memory of that first element, which could cause a segmentation fault or even memory corruption in thread 1. Similarly, with the updates to size, push_back's update could win over pop_front's, and then you have size 2 when you only have 1 element.
Do not use pthread_* in C++ unless you really know what you're doing. Use std::thread (C++11) or boost::thread, or wrap pthread_* in a class yourself, because if you don't consider exceptions you will end up with deadlocks
You cannot get past some form of synchronization in this specific example - but you could optimize synchronization
Don't copy the data itself into and out of the std::list - copy a pointer to the data into and out of the list
Only lock while you're actually accessing the std::list - but don't make this mistake:
{
// lock
size_t i = g_list.size();
// unlock
if ( i )
{
// lock
// work with g_list ...
// unlock
}
}
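The problem with the snippet above is that the list can change between the first unlock and the second lock, so the size that was checked may be stale. A minimal sketch of the alternative, checking and acting under one lock, could look like this (int elements and the names g_list_mutex, push_back and try_pop_front are assumptions for illustration):

```cpp
#include <list>
#include <mutex>

std::list<int> g_list;
std::mutex g_list_mutex;

void push_back(int value) {
    std::lock_guard<std::mutex> lock(g_list_mutex);
    g_list.push_back(value);
}

// The emptiness check and the pop happen under the same lock,
// so no other thread can empty the list in between.
bool try_pop_front(int& out) {
    std::lock_guard<std::mutex> lock(g_list_mutex);
    if (g_list.empty()) return false;
    out = g_list.front();
    g_list.pop_front();
    return true;
}
```

The lock is still held only for the list operations themselves; the caller does the actual sending after try_pop_front() returns.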
A more appropriate pattern here would be a message queue - you can implement one with a mutex, a list and a condition variable. Here are some implementations you can look at:
http://pocoproject.org/docs/Poco.Notification.html
http://gnodebian.blogspot.com.es/2013/07/a-thread-safe-asynchronous-queue-in-c11.html
http://docs.wxwidgets.org/trunk/classwx_message_queue_3_01_t_01_4.html
google for more
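For reference, a minimal sketch of such a message queue built from a mutex, a list and a condition variable might look like this. MessageQueue is a hypothetical name, not the API of any of the libraries linked above, and shutdown handling is left out.

```cpp
#include <condition_variable>
#include <list>
#include <mutex>
#include <utility>

template <class T>
class MessageQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(value));
        }
        ready_.notify_one();  // wake one waiting consumer, outside the lock
    }

    // Blocks until an item is available, then returns it.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate form re-checks after every wake-up, so spurious wake-ups are harmless.
        ready_.wait(lock, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop_front();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::list<T> items_;
};
```

With this, the sending thread sleeps inside pop() instead of polling the list, and the receiving thread holds the lock only for the push itself.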
There is also the option of atomic containers, look at:
http://calumgrant.net/atomic/ - not sure if this is backed by actual atomic storage (as opposed to just using synchronization behind an interface)
google for more
You could also go for an asynchronous approach with boost::asio - though your case should be quite fast if done right.