C++11 Lock-free sequence number generator safe?

The goal is to implement a sequence number generator in modern C++, in a concurrent environment.
Requirement #1 The class must be a singleton (common to all threads).
Requirement #2 The type used for the numbers is a 64-bit integer.
Requirement #3 The caller can request more than one number.
Requirement #4 The class will cache a sequence of numbers before being able to serve the calls. Because it caches a sequence, it must also store the upper bound, i.e. the maximum number it is allowed to return.
Requirement #5 Last but not least, at startup (constructor) and when there are no available numbers to give ( n_requested > n_available ), the singleton class must query the database to get a new sequence. This load from the DB updates both seq_n_ and max_seq_n_.
A brief draft for its interface is the following:
class singleton_sequence_manager {
public:
    static singleton_sequence_manager& instance() {
        static singleton_sequence_manager s;
        return s;
    }

    std::vector<int64_t> get_sequence(int64_t n_requested);

private:
    singleton_sequence_manager();   //Constructor
    void get_new_db_sequence();     //Gets a new sequence from DB

    int64_t seq_n_;
    int64_t max_seq_n_;
};
Example just to clarify the use case.
Suppose that at startup, DB sets seq_n_ to 1000 and max_seq_n_ to 1050:
get_sequence(20); //Gets [1000, 1019]
get_sequence(20); //Gets [1020, 1039]
get_sequence(5);  //Gets [1040, 1044]
get_sequence(10); //In order to serve this call, a new sequence must be loaded from DB
Obviously, an implementation using locks and std::mutex is quite simple.
What I am interested in is implementing a lock-free version using std::atomic and atomic operations.
My first attempt is the following one:
int64_t seq_n_;
int64_t max_seq_n_;
were changed to:
std::atomic<int64_t> seq_n_;
std::atomic<int64_t> max_seq_n_;
Getting a new sequence from the DB just sets the new values in the atomic variables:
void singleton_sequence_manager::get_new_db_sequence() {
    //Sync call is made to DB
    //Let's just ignore unhappy paths for simplicity
    seq_n_.store( start_of_seq_got_from_db );
    max_seq_n_.store( end_of_seq_got_from_db );
    //At this point, the class can start returning numbers in [seq_n_ : max_seq_n_]
}
And now the get_sequence function, using the atomic compare-and-swap technique:
std::vector<int64_t> singleton_sequence_manager::get_sequence(int64_t n_requested) {
    bool succeeded{false};
    int64_t current_seq{};
    int64_t next_seq{};
    do {
        current_seq = seq_n_.load();
        do {
            next_seq = current_seq + n_requested;
        }
        while( !seq_n_.compare_exchange_weak( current_seq, next_seq ) );
        //After the CAS, the caller gets the sequence [current_seq:next_seq-1]
        //Check if the sequence is within the cached bound.
        if( max_seq_n_.load() >= next_seq - 1 )
            succeeded = true;
        else //Needs to load a new sequence from DB, and re-calculate again
            get_new_db_sequence();
    }
    while( !succeeded );
    //Building the response
    std::vector<int64_t> res{};
    res.reserve(n_requested);
    for(int64_t n = current_seq ; n < next_seq ; n++)
        res.push_back(n);
    return res;
}
Thoughts:
I am really concerned about the lock-free version. Is the implementation safe? If we ignore the DB load part, obviously yes. The problem arises (in my head at least) when the class has to load a new sequence from the DB. Is the update from the DB safe? Two atomic stores?
My second attempt was to combine both seq_n_ and max_seq_n_ into a struct called sequence and use a single std::atomic<sequence>, but the compilation failed because the size of the struct sequence is greater than 64 bits.
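(For reference, a 16-byte trivially copyable struct can be used with std::atomic on some compilers and flags, though it may require linking libatomic and is_lock_free() may still return false at runtime. A sketch of what that second attempt might look like, with a hypothetical try_reserve helper that is not part of the question's interface:)
#include <atomic>
#include <cstdint>

struct sequence {
    int64_t next;   // next number to hand out
    int64_t max;    // inclusive upper bound of the cached range
};

std::atomic<sequence> cached_seq_;  // 16 bytes; lock-freedom is platform-dependent

// Hypothetical reservation step: the bound check and the reservation happen in
// one compare-exchange, so the pair can never be observed half-updated.
bool try_reserve(int64_t n_requested, int64_t& first) {
    sequence cur = cached_seq_.load();
    sequence desired;
    do {
        if (cur.next + n_requested - 1 > cur.max)
            return false;   // cached range exhausted: caller falls back to the DB reload path
        desired.next = cur.next + n_requested;
        desired.max  = cur.max;
    } while (!cached_seq_.compare_exchange_weak(cur, desired));
    first = cur.next;       // caller hands out [first, first + n_requested - 1]
    return true;
}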
Is it possible to somehow protect the DB part by using an atomic flag that marks whether the sequence is ready: the flag is set to false while waiting for the DB load to finish and update both atomic variables, and get_sequence waits for the flag to be set back to true (a spin lock?).

Your lock-free version has a fundamental flaw: it treats two independent atomic variables as one entity. Since the writes to seq_n_ and max_seq_n_ are separate statements, they can be separated during execution, resulting in the use of one of them with a value that is incorrect when paired with the other.
For example, one thread can get past the inner CAS while loop (with an n_requested that is too large for the current cached sequence), then be suspended before checking whether the result is within the cached bound. A second thread can come through and update max_seq_n_ to a larger value. The first thread then resumes and passes the max_seq_n_ check because the value was updated by the second thread. It is now using an invalid sequence.
A similar thing can happen in get_new_db_sequence between the two store calls.
Since you're writing to two distinct locations (even if adjacent in memory), and they cannot be updated atomically (due to the combined size of 128 bits not being a supported atomic size with your compiler), the writes must be protected by a mutex.
A spin lock should only be used for very short waits, since it consumes CPU cycles. A typical pattern is to spin briefly and, if the resource is still unavailable, fall back to something more expensive (like a mutex) that waits without burning CPU time.
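For concreteness, a minimal sketch of the mutex-protected variant (it assumes an added std::mutex member, here called mutex_, and keeps the interface from the question; with the lock in place, seq_n_ and max_seq_n_ can stay plain int64_t):
std::vector<int64_t> singleton_sequence_manager::get_sequence(int64_t n_requested) {
    std::lock_guard<std::mutex> lock(mutex_);   // one lock guards seq_n_, max_seq_n_ and the reload
    if (seq_n_ + n_requested - 1 > max_seq_n_)
        get_new_db_sequence();                  // runs under the same lock, so the pair stays consistent
    std::vector<int64_t> res;
    res.reserve(n_requested);
    for (int64_t i = 0; i < n_requested; ++i)
        res.push_back(seq_n_++);
    return res;
}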


Changing code from Sequential Consistency to a less stringent ordering in a barrier implementation

I came across this code for a simple implementation of a barrier (for code that can't use std::experimental::barrier or C++20's std::barrier) in the book C++ Concurrency in Action.
[Edit]
A barrier is a synchronization mechanism where a group of threads (the number of threads is passed to the constructor of the barrier) can arrive and wait (by calling the wait method) or arrive and drop off (by calling done_waiting). If all the threads in the group arrive at the barrier, then the barrier is reset and the threads can proceed with the next set of actions. If some threads in the group drop off, then the number of threads in the group is reduced accordingly for the next round of synchronization with the barrier.
[End of Edit]
Here is the code provided for a simple implementation of a barrier.
struct barrier
{
    std::atomic<unsigned> count;
    std::atomic<unsigned> spaces;
    std::atomic<unsigned> generation;

    barrier(unsigned count_) : count(count_), spaces(count_), generation(0)
    {}

    void wait() {
        unsigned const gen = generation.load();
        if (!--spaces) {
            spaces = count.load();
            ++generation;
        } else {
            while (generation.load() == gen) {
                std::this_thread::yield();
            }
        }
    }

    void done_waiting() {
        --count;
        if (!--spaces) {
            spaces = count.load();
            ++generation;
        }
    }
};
The author, Anthony Williams, mentions that he chose sequential consistency ordering to make it easier to reason about the code and said that relaxed ordering could be used to make the code more efficient. This is how I changed the code to employ relaxed ordering. Please help me understand if my code is right.
struct barrier
{
    std::atomic<unsigned> count;
    std::atomic<unsigned> spaces;
    std::atomic<unsigned> generation;

    barrier(unsigned count_) : count(count_), spaces(count_), generation(0)
    {}

    void wait() {
        unsigned const gen = generation.load(std::memory_order_acquire);
        if (1 == spaces.fetch_sub(1, std::memory_order_relaxed)) {
            spaces = count.load(std::memory_order_relaxed);
            generation.fetch_add(1, std::memory_order_release);
        } else {
            while (generation.load(std::memory_order_relaxed) == gen) {
                std::this_thread::yield();
            }
        }
    }

    void done_waiting() {
        count.fetch_sub(1, std::memory_order_relaxed);
        if (1 == spaces.fetch_sub(1, std::memory_order_relaxed)) {
            spaces = count.load(std::memory_order_relaxed);
            generation.fetch_add(1, std::memory_order_release);
        }
    }
};
The reasoning is this. The increment of generation is a release operation that synchronizes with the loading of generation in the wait call. This ensures that the load from count into spaces is visible to all the threads that call wait and read the new value of generation that was stored with release semantics.
All the operations on spaces thereafter are RMW operations which take part in a release sequence and hence can be relaxed operations. Is this reasoning correct or is this code wrong? Please help me understand. Thanks in advance.
[Edit] I tried using my barrier code like this.
void fun(barrier* b) {
    std::cout << "In Thread " << std::this_thread::get_id() << std::endl;
    b->wait();
    std::cout << std::this_thread::get_id() << " First wait done" << std::endl;
    b->wait();
    std::cout << std::this_thread::get_id() << " Second wait done" << std::endl;
    b->done_waiting();
}

int main() {
    barrier b{2};
    std::thread t(fun, &b);
    fun(&b);
    std::cout << std::this_thread::get_id() << " " << b.count.load() << std::endl;
    t.join();
}
I also tried testing it with many more threads, and in superficial runs it seems to be doing the right thing. But I would still like to understand if my reasoning is correct or if I am missing something very obvious. [End of Edit]
Modified version of the code
I believe the following version of the code has the weakest orderings
that still make it correct.
struct barrier
{
    std::atomic<unsigned> count;
    std::atomic<unsigned> spaces;
    std::atomic<unsigned> generation;

    barrier(unsigned count_) : count(count_), spaces(count_), generation(0)
    {}

    void wait() {
        unsigned const gen = generation.load(std::memory_order_relaxed);
        if (1 == spaces.fetch_sub(1, std::memory_order_acq_rel)) {
            spaces.store(count.load(std::memory_order_relaxed),
                         std::memory_order_relaxed);
            generation.fetch_add(1, std::memory_order_release);
        } else {
            while (generation.load(std::memory_order_acquire) == gen) {
                std::this_thread::yield();
            }
        }
    }

    void done_waiting() {
        count.fetch_sub(1, std::memory_order_relaxed);
        if (1 == spaces.fetch_sub(1, std::memory_order_acq_rel)) {
            spaces.store(count.load(std::memory_order_relaxed),
                         std::memory_order_relaxed);
            generation.fetch_add(1, std::memory_order_release);
        }
    }
};
Discussion of new version
Throughout this discussion, when we speak of the "last" thread to reach the barrier in
a particular generation, we mean the unique thread that sees
spaces.fetch_sub(1) return the value 1.
In general, if a thread is making a store that promises some operation has
completed, it needs to be a release store. When another thread loads
that value to get proof that the operation is complete, and it is safe
to use the results, then that load needs to be acquire.
Now in the code at hand, the way a thread indicates that it is done with its "regular
work" (whatever was sequenced before the call to wait() or
done_waiting()) is by decrementing spaces, i.e. storing a value
that is 1 less than the previous value in the modification order. For
a non-last thread, that's the only store that it does. So
that store, namely spaces.fetch_sub(), has to be release.
For the last thread, the way that it knows it is last is by loading
the value 1 in spaces.fetch_sub(). This serves as proof that all
other threads have finished their regular work, and that the last
thread can safely drop the barrier. So spaces.fetch_sub() needs to
be acquire as well. Hence spaces.fetch_sub(1, std::memory_order_acq_rel).
The way that the non-last threads determine that the barrier is down and
they can safely proceed is by loading a value from generation in
the yield loop, and observing that it is different from gen. So
that load needs to be acquire; and the store it observes, namely
generation.fetch_add(), needs to be release.
I claim that's all the barriers we need. The work done by the last
thread to update spaces from count is effectively in its own
little critical section, begun with the acquire load of 1 from
spaces and ended with the release increment of generation. At
this point, all other threads have already loaded and stored spaces,
every thread that called wait has already loaded the old value of
generation, and every thread that called done_waiting() has
already stored its new decremented value for count. So the last
thread can safely manipulate them with relaxed ordering, knowing that
no other thread will do so.
And now the initial load gen = generation.load() in wait() does not
need to be acquire. It can't be pushed down past the store to
spaces, because the latter is release. So it is sure to be safely
loaded before the value 1 is stored to spaces, and only after that
could a potential last thread update generation.
A proof
Now let's try to give a formal proof that this new version is correct. Look at two threads, A and B,
which do
barrier b(N); // N >= 2

void thrA() {
    work_A();
    b.wait(); // or b.done_waiting(), doesn't matter which
}

void thrB() {
    b.wait();
    work_B();
}
There are N-2 other threads, each of which calls either b.wait()
or b.done_waiting(). For simplicity, let's suppose we begin at
generation 0.
We want to prove that work_A() happens before work_B(), so that
they can be operating on the same data (conflicting) without a data
race. Consider two cases depending on whether or not B is the last
thread to reach the barrier.
Suppose B is the last thread. That means it did an acquire load of the
value 1 from spaces. Thread A must necessarily have release-stored some
value >= 1, which was then successively decremented by atomic RMW
operations (in other threads) until it reached 1. That is a release
sequence headed by A's store, and B's load takes its value from the
last side effect in that sequence, so A's store synchronizes with B's
load. By sequencing, work_A() happens before A's store, and B's
load happens before work_B(), hence work_A() happens before
work_B() as desired.
Now suppose B is not the last thread. Then it returns from wait()
only when it loads from generation a value different from gen.
Let L denote the actual last thread, which could potentially be A.
I claim, as a first step, that gen in B must be 0. For the load of
generation into gen happens before B's release store in spaces.fetch_sub(), and as noted above, this heads a release sequence which
eventually stores the value 1 (in the second-to-last thread). The
load of spaces in L takes its value from that side effect, so B's
store to spaces synchronizes with L's load. B's load of
generation happens before its store to spaces, and L's load of
spaces happens before its store (fetch_add()) to generation. So
B's load of generation happens before L's store. By read-write
coherence [intro.races p17], B's load of generation must not take its value from L's store, but instead from
some earlier value in the modification order. This must
necessarily be 0 as there are no other modifications to generation.
(If we were working at generation G instead of 0, this would prove only that gen <= G. But as I explain below, all these loads happen after the previous increment of generation, which is where the value G was stored. So that proves the opposite inequality gen >= G.)
So B returns from wait() only when it loads 1 from generation.
Thus, that final acquire load has taken its value from the release
store to generation done by L, showing that L's store happens before
B's return. Now L's load of 1 from spaces happens before its store
to generation (by sequencing), and A's store to spaces happens
before L's load of 1 as proved earlier. (If L = A then the same
conclusion still holds, by sequencing.) We now have the following
operations totally ordered by happens-before:
A: work_A();
A: store side of spaces.fetch_sub()
L: load of 1 in spaces.fetch_sub()
L: store side of generation.fetch_add()
B: acquire load of 1 from generation
B: work_B()
and have the desired conclusion by transitivity. (If L=A then delete lines 2 and 3 above.)
We could similarly prove that all the decrements of count in the
various calls to done_waiting() happen before L's load of count to
store it into spaces, thus those can safely be relaxed. The
re-initialization of spaces in L, and the increment of generation,
both happen before any thread returns from wait(), so even if those
are relaxed, any barrier operations sequenced afterwards will see the
barrier properly reset.
I think that covers all the desired semantics.
Fences
We could actually weaken things a little more by using fences. For
instance, the acq in spaces.fetch_sub() was solely for the benefit
of thread L which loads the value 1; other threads didn't need it. So
we could do instead
if (1 == spaces.fetch_sub(1, std::memory_order_release)) {
    std::atomic_thread_fence(std::memory_order_acquire);
    // ...
}
and then only thread L needs to pay the cost of the acquire. Not that
it really matters much, since all other threads are going to sleep
anyway and so we're unlikely to care if they are slow.
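Putting it together, a sketch of wait() with that fence-based weakening (my illustration of the point above, as a member of the same struct):
void wait() {
    unsigned const gen = generation.load(std::memory_order_relaxed);
    // The release half of the decrement publishes this thread's prior work; only
    // the thread that reads 1 needs acquire, and it gets that from the fence.
    if (1 == spaces.fetch_sub(1, std::memory_order_release)) {
        std::atomic_thread_fence(std::memory_order_acquire);
        spaces.store(count.load(std::memory_order_relaxed), std::memory_order_relaxed);
        generation.fetch_add(1, std::memory_order_release);
    } else {
        while (generation.load(std::memory_order_acquire) == gen) {
            std::this_thread::yield();
        }
    }
}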
Data races in OP's original version
(This section was written earlier, before I made the modified version above.)
I believe there are at least two bugs.
Let's analyze wait() by itself. Consider code that does:
int x = 0;
barrier b(2);

void thrA() {
    x = 1;
    b.wait();
}

void thrB() {
    b.wait();
    std::cout << x << std::endl;
}
We wish to prove that x = 1 in thrA happens before the evaluation of x in thrB, so that the code would have no data race and be forced to print the value 1.
But I don't think we can. Let's suppose thrB reaches the barrier first, which is to say that it observes spaces.fetch_sub returning 2. So the loads and stores performed in each thread are sequenced as follows:
thrA:
x = 1;
gen=generation.load(std::memory_order_acquire); // returns 0
spaces.fetch_sub(1, std::memory_order_relaxed); // returns 1, stores 0
spaces=count.load(std::memory_order_relaxed); // stores 2
generation.fetch_add(1, std::memory_order_release); // returns 0, stores 1
thrB:
gen=generation.load(std::memory_order_acquire); // returns 0
spaces.fetch_sub(1, std::memory_order_relaxed); // returns 2, stores 1
generation.load(std::memory_order_relaxed); // returns 0
... many iterations
generation.load(std::memory_order_relaxed); // returns 1
x; // returns ??
To have any hope, we have to get some operation A in thrA to synchronize with some operation B in thrB. This is only possible if B is an acquire operation that takes its value from a side effect in the release sequence headed by A. But there is only one acquire operation in thrB, namely the initial generation.load(std::memory_order_acquire). And it does not take its value (namely 0) from any of the operations in thrB, but from the initialization of generation that happened before either thread was started. This side effect is not part of any useful release sequence, certainly not of any release sequence headed by an operation which happens after x=1. So our attempt at a proof breaks down.
More informally, if we inspect the sequencing of thrB, we see that the evaluation of x could be reordered before any or all of the relaxed operations. The fact that it's only evaluated conditionally on generation.load(std::memory_order_relaxed) returning 1 doesn't help; we could have x loaded speculatively much earlier, and the value only used after generation.load(std::memory_order_relaxed) finally returns 1. So all we know is that x is evaluated sometime after generation.load(std::memory_order_acquire) returns 0, which gives us precisely no useful information at all about what thrA might or might not have done by then.
This particular issue could be fixed by upgrading the load of generation in the spin loop to acquire, or by placing an acquire fence after the loop exits but before wait() returns.
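The second option might look like this (my sketch): keep the loads in the loop relaxed and acquire once after it exits:
while (generation.load(std::memory_order_relaxed) == gen) {
    std::this_thread::yield();
}
// Synchronizes with the release increment of generation done by the last thread,
// so everything sequenced before the barrier dropped is now visible here.
std::atomic_thread_fence(std::memory_order_acquire);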
As for done_waiting, it seems problematic too. If we have
void thrA() {
    x = 1;
    b.done_waiting();
}

void thrB() {
    b.wait();
    std::cout << x;
}
then presumably we again want 1 to be printed, without data races. But suppose thrA reaches the barrier first. Then all that it does is
x = 1;
count.fetch_sub(1, std::memory_order_relaxed); // irrelevant
spaces.fetch_sub(1, std::memory_order_relaxed); // returns 2, stores 1
with no release stores at all, so it cannot synchronize with thrB.
Informally, there is no barrier to prevent the store x=1 from being delayed indefinitely, so there cannot be any guarantee that thrB will observe it.
It is late here, so for the moment I leave it as an exercise how to fix this.
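(For reference, the modified version at the top of this answer already contains one such fix: the decrement of spaces in done_waiting() carries release semantics, so a departing thread publishes x = 1, and the last thread's read of the value 1 is the matching acquire:)
if (1 == spaces.fetch_sub(1, std::memory_order_acq_rel)) { /* ... */ }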
By the way, ThreadSanitizer detects data races for both cases: https://godbolt.org/z/1MdbMbYob. I probably should have tried that first, but initially it wasn't so clear to me what to actually test.
I am not sure at this point if these are the only bugs, or if there are more.

Best way to cache meta-data from a kv storage

Currently, we have a system where metadata is stored in a kv storage cluster. We store it simply by serializing the application metadata with protobuf and then sending it to the kv cluster. As the system gets bigger and bigger, fetching metadata itself becomes expensive. Therefore, we developed an in-memory meta-cache component, simply an LRU cache, where the cached items are the protobuf objects. Recently, we have faced some challenges:
Concurrent reads and writes seem to be a big issue: when we add new data to our system, we need to update the cache as well, so we have to lock part of the cache to ensure repeatable reads, and this brings very high lock contention on the cache.
The cache gets bigger and bigger over time.
I'm thinking that maybe our cache design is not good enough, and I am considering using a 3rd-party lib like Facebook's CacheLib (our system is written in C++). Can anyone who has experience with the matter give me some advice? Should we use a 3rd-party lib, or should we improve our own? If we improve our own, what can we do?
Thanks so much :).
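One common way to attack the lock-contention point above, whether by improving the in-house cache or by choosing a library that already does it, is lock striping: shard the cache by key hash so each shard has its own LRU list and its own mutex. A minimal sketch (all names here are illustrative, not from any particular library):
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch of a lock-striped LRU cache: the key hash picks a shard, and each
// shard has its own mutex, list and index, so contention is split N ways.
template <typename K, typename V>
class ShardedLruCache {
    using Item = std::pair<K, V>;
    struct Shard {
        std::mutex mtx;
        std::list<Item> lru;                                        // front = most recently used
        std::unordered_map<K, typename std::list<Item>::iterator> index;
        std::size_t capacity = 0;
    };
    std::vector<Shard> shards_;

    Shard& shard_for(const K& key) {
        return shards_[std::hash<K>{}(key) % shards_.size()];
    }

public:
    ShardedLruCache(std::size_t num_shards, std::size_t capacity_per_shard)
        : shards_(num_shards) {
        for (auto& s : shards_) s.capacity = capacity_per_shard;
    }

    bool get(const K& key, V& out) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lock(s.mtx);
        auto it = s.index.find(key);
        if (it == s.index.end()) return false;
        s.lru.splice(s.lru.begin(), s.lru, it->second);             // mark as most recently used
        out = it->second->second;
        return true;
    }

    void put(const K& key, V value) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lock(s.mtx);
        auto it = s.index.find(key);
        if (it != s.index.end()) {
            it->second->second = std::move(value);
            s.lru.splice(s.lru.begin(), s.lru, it->second);
        } else {
            s.lru.emplace_front(key, std::move(value));
            s.index.emplace(key, s.lru.begin());
            if (s.lru.size() > s.capacity) {                        // evict least recently used
                s.index.erase(s.lru.back().first);
                s.lru.pop_back();
            }
        }
    }
};
Lookups and inserts then touch only one shard's lock, so two threads working on different shards never contend; if reads dominate, the per-shard std::mutex could later be swapped for a shared mutex.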
If you are caching in memory on only a single server node, if you can hash all of the keys (for metadata indexing) into unique positive integers, and
if ~75M lookups per second on an old CPU is OK for you, there is a multi-threaded multi-level read-write cache implementation:
https://github.com/tugrul512bit/LruClockCache/blob/main/MultiLevelCache.h
It works like this:
int main()
{
    std::vector<int> data(1024*1024); // simulating a backing-store
    MultiLevelCache<int,int> cache(
        64*1024 /* direct-mapped L1 cache elements */,
        256,1024 /* n-way set-associative (LRU approximation) L2 cache elements */,
        [&](int key){ return data[key]; } /* cache-miss function to get data from backing-store into cache */,
        [&](int key, int value){ data[key]=value; } /* cache-miss function to set data on backing-store during eviction */
    );
    cache.set(5,10);  // this is a single-thread example: sets value 10 at key position 5
    cache.flush();    // writes all latest bits of data to backing-store
    std::cout<<data[5]<<std::endl;
    auto val = cache.getThreadSafe(5);  // this is thread-safe from any number of threads
    cache.setThreadSafe(10,val);        // thread-safe, any number of threads
    return 0;
}
It consists of two caches: L1 is fast and sharded but weak on cache-hits; L2 is slower and sharded but better on cache-hits, although not as good as a perfect LRU.
If there is a read-only part in the application (like just distributing values to threads from a read-only backing-store), there is another way to improve performance, up to 2.5 billion lookups per second:
// shared last level cache (LRU approximation)
auto LLC = std::make_shared<LruClockCache<int,int>>(LLCsize,
    [&](int key) {
        return backingStore[key];
    },
    [&](int key, int value) {
        backingStore[key] = value;
    });

#pragma omp parallel num_threads(8)
{
    // each thread builds its own lockless L1&L2 caches, binds to LLC (LLC has thread-safe bindings to L2)
    // building overhead is linearly dependent on cache sizes
    CacheThreader<LruClockCache,int,int> cache(LLC,L1size,L2size);
    for(int i=0;i<many_times;i++)
        cache.get(i); // 300M-400M per second per thread
}
Performance depends on how predictable the access pattern is. For user-input-dependent access order, performance is lower due to inefficient pipelining. For compile-time-known indexing, it is better. For truly random access, it is lower due to cache misses and lack of vectorization. For sequential access, it is better.
If you need to use your main thread asynchronously during get/set calls and still require (weak) cache coherence between gets and sets, there is another implementation whose performance depends on the number of keys requested at a time from a thread:
// backing-store
std::vector<int> data(1000000);
// L1 direct mapped 128 tags
// L2 n-way set-associative 128 sets + 1024 tags per set
AsyncCache<int,int> cache(128,128,1024,[&](int key){ return data[key]; },[&](int key, int value){ data[key]=value; });
// a variable to use as output for the get() call
int val;
// thread-safe, returns immediately
// each set/get call should be given a slot id per thread
// or it returns a slot id instead
int slot = cache.setAsync(5,100);
// returns immediately
cache.getAsync(5,&val,slot);
// garbage garbage
std::cout<<data[5]<<" "<<val<<std::endl;
// waits for operations made in a slot
cache.barrier(slot);
// garbage 100
std::cout<<data[5]<<" "<<val<<std::endl;
// writes latest bits to backing-store
cache.flush();
// 100 100
std::cout<<data[5]<<" "<<val<<std::endl;
This is not fully coherent, and threads are responsible for the right ordering of gets/sets between each other.
If your metadata indexing is two/three-dimensional, there are 2D/3D direct-mapped multi-thread caches:
https://github.com/tugrul512bit/LruClockCache/blob/main/integer_key_specialization/DirectMapped2DMultiThreadCache.h
https://github.com/tugrul512bit/LruClockCache/blob/main/integer_key_specialization/DirectMapped3DMultiThreadCache.h
int backingStore[10][10];
DirectMapped2DMultiThreadCache<int,int> cache(4,4,
    [&](int x, int y){ return backingStore[x][y]; },
    [&](int x, int y, int value){ backingStore[x][y]=value; });

for(int i=0;i<10;i++)
    for(int j=0;j<10;j++)
        cache.set(i,j,0); // row-major

cache.flush();

for(int i=0;i<10;i++)
    for(int j=0;j<10;j++)
        std::cout<<backingStore[i][j]<<std::endl;
It has a better cache-hit ratio than a normal direct-mapped cache if you do tiled processing.

C++ concurrent writes to array (not std::vector) of bools

I'm using C++11 and I'm aware that concurrent writes to std::vector<bool> someArray are not thread-safe due to the specialization of std::vector for bools.
I'm trying to find out if writes to bool someArray[2048] have the same problem:
Suppose all entries in someArray are initially set to false.
Suppose I have a bunch of threads that write at different indices in someArray. In fact, these threads only set different array entries from false to true.
Suppose I have a reader thread that at some point acquires a lock, triggering a memory fence operation.
Q: Will the reader see all the writes to someArray that occurred before the lock was acquired?
Thanks!
You should use std::array<bool, 2048> someArray, not bool someArray[2048];. If you're in C++11-land, you'll want to modernize your code as much as you are able.
std::array<bool, N> is not specialized in the same way that std::vector<bool> is, so there's no concerns there in terms of raw safety.
As for your actual question:
Will the reader see all the writes to someArray that occurred before the lock was acquired?
Only if the writers to the array also interact with the lock, either by releasing it at the time that they finish writing, or else by updating a value associated with the lock that the reader then synchronizes with. If the writers never interact with the lock, then the data that will be retrieved by the reader is undefined.
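For illustration, a minimal sketch of that interaction (the mutex and helper names are mine): a writer that releases the same mutex before the reader acquires it has its writes published to that reader:
#include <array>
#include <cstddef>
#include <mutex>

std::array<bool, 2048> someArray{};
std::mutex m;   // the same lock the reader acquires

void writer(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        someArray[i] = true;             // distinct elements per writer: no race between writers
    std::lock_guard<std::mutex> lk(m);   // releasing m publishes the writes above
    // (a real program would also record here that this writer has finished)
}

void reader() {
    std::lock_guard<std::mutex> lk(m);
    // Sees every write made by writers that released m before this acquire;
    // writers that have not yet locked and unlocked m are not guaranteed to be visible.
    // ... read someArray ...
}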
One thing you'll also want to bear in mind: while it's not unsafe to have multiple threads write to the same array, provided that they are all writing to unique memory addresses, writing could be slowed pretty dramatically by interactions with the cache. For example:
void func_a() {
    std::array<bool, 2048> someArray{};
    std::vector<std::thread> writers;
    for (int i = 0; i < 8; i++) {
        writers.emplace_back([i, &someArray]{
            // each thread fills one contiguous 256-element block
            for (size_t index = i * 256; index < (i+1) * 256; index++)
                someArray[index] = true;
        });
    }
    for (auto& writer : writers)
        writer.join();   // wait for the writers before someArray goes out of scope
}

void func_b() {
    std::array<bool, 2048> someArray{};
    std::vector<std::thread> writers;
    for (int i = 0; i < 8; i++) {
        writers.emplace_back([i, &someArray]{
            // threads interleave writes a few bytes apart, sharing cache lines
            for (size_t index = i; index < 2048; index += 8)
                someArray[index] = true;
        });
    }
    for (auto& writer : writers)
        writer.join();   // wait for the writers before someArray goes out of scope
}
The details are going to vary depending on the underlying hardware, but in nearly all situations, func_a is going to be orders of magnitude faster than func_b, at least for a sufficiently large array size (2048 was chosen as an example, but it may not be representative of the actual underlying performance differences). Both functions should have the same result, but one will be considerably faster than the other.
First of all, the general std::vector is not as thread-safe as you might think. The guarantees are already stated here.
Addressing your question: the reader may not see all writes after acquiring the lock. This is because the writers may never have performed a release operation, which is required to establish a happens-before relationship between the writes and the subsequent reads. In (very) simple terms: every acquire operation (such as a mutex lock) needs a release operation to synchronize with. Every memory operation done before a release on a certain variable will be visible to any thread that subsequently acquires the same variable. See also Release-Acquire ordering.
One important thing to note is that, on x86 or x64 architectures, individual loads and stores of a naturally aligned variable no larger than a word (such as a bool) are atomic at the hardware level. So if you declare your array as volatile (necessary as each thread may have a cached value of the array), you should not have any issues in modifying the array (via multiple threads).

Copying std::vector between threads without locking

I have a vector that is modified in one thread, and I need to use its contents in another. Locking between these threads is unacceptable due to performance requirements. Since iterating over the vector while it is changing will cause a crash, I thought to copy the vector and then iterate over the copy. My question is, can this approach also crash?
struct Data
{
    int A;
    double B;
    bool C;
};

std::vector<Data> DataVec;

void ModifyThreadFunc()
{
    // Here the vector is changed, which includes adding and erasing elements
    ...
}

void ReadThreadFunc()
{
    auto temp = DataVec; // Will this crash?
    for (auto& data : temp)
    {
        // Do stuff with the data
        ...
    }

    // This definitely can crash
    /*for (auto& data : DataVec)
    {
        // Do stuff with the data
        ...
    }*/
}
The basic exception-safety guarantee for vector::operator= is:
"if an exception is thrown, the container is in a valid state."
What types of exceptions are possible here?
EDIT:
I solved this using double buffering, and posted my answer below.
As has been pointed out by the other answers, what you ask for is not doable. If you have concurrent access, you need synchronization, end of story.
That being said, it is not unusual to have requirements like yours where synchronization is not an option. In that case, what you can still do is get rid of the concurrent access. For example, you mentioned that the data is accessed once per frame in a game-loop like execution. Is it strictly required that you get the data from the current frame or could it also be the data from the last frame?
In that case, you could work with two vectors, one that is being written to by the producer thread and one that is being read by all the consumer threads. At the end of the frame, you simply swap the two vectors. Now you no longer need *(1) fine-grained synchronization for the data access, since there is no concurrent data access any more.
This is just one example how to do this. If you need to get rid of locking, start thinking about how to organize data access so that you avoid getting into the situation where you need synchronization in the first place.
*(1): Strictly speaking, you still need a synchronization point that ensures that when you perform the swapping, all the writer and reader threads have finished working. But this is far easier to do (usually you have such a synchronization point at the end of each frame anyway) and has a far lesser impact on performance than synchronizing on every access to the vector.
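A minimal sketch of that two-vector scheme, reusing the Data type from the question (the function split and the end-of-frame hook are invented for illustration):
std::vector<Data> writeBuffer;   // only the producer thread touches this during a frame
std::vector<Data> readBuffer;    // only the reader threads touch this during a frame

void ModifyThreadFunc() {
    // ... add/erase elements in writeBuffer ...
}

void ReadThreadFunc() {
    for (const Data& data : readBuffer) {
        // ... use last frame's data ...
    }
}

// Called at the existing end-of-frame synchronization point, i.e. when no thread
// is inside the two functions above (see footnote *(1)):
void EndOfFrame() {
    writeBuffer.swap(readBuffer);   // O(1): just exchanges the vectors' internal pointers
    // If the producer evolves the data incrementally instead of rebuilding it each
    // frame, copy readBuffer back into writeBuffer here before the next frame starts.
}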
My question is, can this way also crash?
Yes, you still have a data race. If thread A modifies the vector while thread B is creating a copy, all iterators to the vector are invalidated.
What types of exceptions are possible here?
std::vector::operator=(const vector&) will throw on memory allocation failure, or if the contained elements throw on copy. The same thing applies to copy construction, which is what the line in your code marked "Will this crash?" is actually doing.
The fundamental problem here is that std::vector is not thread-safe. You have to either protect it with a lock/mutex, or replace it with a thread-safe container (such as the lock-free containers in Boost.Lockfree or libcds).
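For instance, if the access pattern can be recast as handing elements from the writer to the reader, a sketch with boost::lockfree::queue (the Data type from the question qualifies because it is trivially copyable; the capacity and values are arbitrary):
#include <boost/lockfree/queue.hpp>

boost::lockfree::queue<Data> queue(1024);   // fixed-capacity, multi-producer/multi-consumer

void ModifyThreadFunc() {
    Data d{1, 2.0, true};
    while (!queue.push(d)) {
        // queue full: retry, drop, or back off
    }
}

void ReadThreadFunc() {
    Data d;
    while (queue.pop(d)) {
        // ... use d ...
    }
}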
I have a vector that is modified in one thread, and I need to use its contents in another. Locking between these threads is unacceptable due to performance requirements.
This is an impossible requirement to meet.
Anyway, any sharing of data between 2 threads will require some kind of locking, be it explicit or provided by the implementation (possibly the hardware). You must examine your actual requirements again: it may be unacceptable to suspend one thread until the other one ends, but you could lock around short sequences of instructions. And/or possibly use a different architecture. For example, erasing an item in a vector is a costly operation (linear time, because you have to move all the data above the removed item), while marking it as invalid is much quicker (constant time, because it is one single write). If you really have to erase in the middle of a vector, maybe a list would be more appropriate.
But if you can put a lock around the copy of the vector in ReadThreadFunc and around any vector modification in ModifyThreadFunc, it could be enough. To give priority to the modifying thread, you could just try to lock in the other thread and immediately give up if you cannot.
Maybe you should rethink your design!
Each thread should have its own vector (list, queue, whatever fits your needs) to work on. So thread A can do some work and pass the result to thread B. You simply have to lock when writing the data from thread A into thread B's queue.
Without some kind of locking it's not possible.
So I solved this using double buffering, which guarantees no crashing, and the reading thread will always have usable data, even if it might not be correct:
struct Data
{
    int A;
    double B;
    bool C;
};

const int MAXSIZE = 100;
Data Buffer[MAXSIZE];
std::vector<Data> DataVec;

void ModifyThreadFunc()
{
    // Here the vector is changed, which includes adding and erasing elements
    ...

    // Copy from the vector to the buffer
    size_t numElements = DataVec.size();
    memcpy(Buffer, DataVec.data(), sizeof(Data) * numElements);
    memset(&Buffer[numElements], 0, sizeof(Data) * (MAXSIZE - numElements));
}

void ReadThreadFunc()
{
    Data* p = Buffer;
    for (int i = 0; i < MAXSIZE; ++i)
    {
        // Use the data
        ...
        ++p;
    }
}

Ring Allocator For Lockfree Update of Member Variable?

I have a class that stores the latest value of some incoming realtime data (around 150 million events/second).
Suppose it looks like this:
class DataState
{
    Event latest_event;
public:
    //pushes event atomically
    void push_event(const Event* __restrict__ e);
    //pulls event atomically
    Event pull_event();
};
I need to be able to push events atomically and pull them with strict ordering guarantees. Now, I know I can use a spinlock, but given the massive event rate (over 100 million/second) and high degree of concurrency I'd prefer to use lockfree operations.
The problem is that Event is 64 bytes in size. There is no 64-byte compare-exchange instruction on any current x86 CPU (as of August '16). So if I use std::atomic<Event> I'd have to link to libatomic, which uses mutexes under the hood (too slow).
So my solution was to instead atomically swap pointers to the value. The problem is that dynamic memory allocation becomes a bottleneck at these event rates. So... I define something I call a "ring allocator":
/// @brief Lockfree Static short-lived allocator used for a ringbuffer
/// Elements are guaranteed to persist only for "size" calls to get_next()
template<typename T> class RingAllocator {
    T *arena;
    std::atomic_size_t arena_idx;
    const std::size_t arena_size;
public:
    /// @brief Creates a new RingAllocator
    /// @param size The number of elements in the underlying arena. Make this large enough to avoid overwriting fresh data
    RingAllocator<T>(std::size_t size) : arena_size(size)
    {
        //allocate pool
        arena = new T[size];
        //zero out pool
        std::memset(arena, 0, sizeof(T) * size);
        arena_idx = 0;
    }

    ~RingAllocator()
    {
        delete[] arena;
    }

    /// @brief Return next element's pointer. Thread-safe
    /// @return pointer to next available element
    T *get_next()
    {
        return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
    }
};
Then I could have my DataState class look like this:
class DataState
{
    std::atomic<Event*> latest_event;
    RingAllocator<Event> event_allocator;
public:
    //pushes event atomically
    void push_event(const Event* __restrict__ e)
    {
        //store event
        Event *new_ptr = event_allocator.get_next();
        *new_ptr = *e;
        //swap event pointers
        latest_event.store(new_ptr, std::memory_order_release);
    }

    //pulls event atomically
    Event pull_event()
    {
        return *(latest_event.load(std::memory_order_acquire));
    }
};
As long as I size my ring allocator to the max # of threads that may concurrently call the functions, there's no risk of overwriting data that pull_event could return. Plus everything's super localized so indirection won't cause bad cache performance. Any possible pitfalls with this approach?
The DataState class:
I thought it was going to be a stack or queue, but it isn't, so push / pull don't seem like good names for methods. (Or else the implementation is totally bogus).
It's just a latch that lets you read the last event that any thread stored.
There's nothing to stop two writes in a row from overwriting an element that's never been read. There's also nothing to stop you reading the same element twice.
If you just need somewhere to copy small blocks of data, a ring buffer does seem like a decent approach. But if you don't want to lose events, I don't think you can use it this way. Instead, just get a ring buffer entry, then copy to it and use it there. So the only atomic operation should be incrementing the ring buffer position index.
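As a concrete illustration of that shape (my sketch, with invented names, and for a single producer and single consumer only; multiple producers or consumers need per-slot sequence numbers on top of this), each event is copied into a claimed slot and only then published, so nothing is lost or overwritten before it has been consumed:
#include <atomic>
#include <cstddef>

template <typename T, std::size_t SizePow2>
class SpscRing {
    static_assert((SizePow2 & (SizePow2 - 1)) == 0, "size must be a power of two");
    T slots_[SizePow2];
    std::atomic<std::size_t> head_{0};   // written only by the producer
    std::atomic<std::size_t> tail_{0};   // written only by the consumer

public:
    bool push(const T& v) {              // producer thread
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == SizePow2)
            return false;                // full: the caller decides whether to retry or drop
        slots_[h & (SizePow2 - 1)] = v;  // copy into the claimed slot
        head_.store(h + 1, std::memory_order_release);   // publish it
        return true;
    }

    bool pop(T& out) {                   // consumer thread
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire))
            return false;                // empty
        out = slots_[t & (SizePow2 - 1)];
        tail_.store(t + 1, std::memory_order_release);   // hand the slot back to the producer
        return true;
    }
};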
The ring buffer
You can make get_next() much more efficient. This line does an atomic post-increment (fetch_add) and an atomic exchange:
return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
I'm not even sure it's safe, because the xchg can maybe step on the fetch_add from another thread. Anyway, even if it's safe, it's not ideal.
You don't need that. Make sure the arena_size is always a power of 2, then you don't need to modulo the shared counter. You can just let it go, and have every thread modulo it for their own use. It will eventually wrap, but it's a binary integer so it will wrap at a power of 2, which is a multiple of your arena size.
I'd suggest storing an AND-mask instead of a size, so there's no risk of the % compiling to anything other than an and instruction, even if it's not a compile-time constant. This makes sure we avoid a 64-bit integer div instruction.
template<typename T> class RingAllocator {
    T *arena;
    std::atomic_size_t arena_idx;
    const std::size_t size_mask; // maybe even make this a template parameter?
public:
    RingAllocator<T>(std::size_t size)
        : arena_idx(0), size_mask(size-1)
    {
        // verify that size is actually a power of two, so the mask is all-ones in the low bits, and all-zeros in the high bits.
        // so that i % size == i & size_mask for all i
        ...
    }

    ...

    T *get_next() {
        size_t idx = arena_idx.fetch_add(1, std::memory_order_relaxed); // still atomic, but we don't care which order different threads take blocks in
        idx &= size_mask; // modulo our local copy of the idx
        return &arena[idx];
    }
};
Allocating the arena would be more efficient if you used calloc instead of new + memset. The OS already zeros pages before giving them to user-space processes (to prevent information leakage), so writing them all is just wasted work.
arena = new T[size];
std::memset(arena, 0, sizeof(T) * size);
// vs.
arena = (T*)calloc(size, sizeof(T));
Writing the pages yourself does fault them in, so they're all wired to real physical pages, instead of just copy-on-write mappings for a system-wide shared physical zero page (like they are after new/malloc/calloc). On a NUMA system, the physical page chosen might depend on which thread actually touched the page, rather than which thread did the allocation. But since you're reusing the pool, the first core to write a page might not be the one that ends up using it most.
Maybe something to look for in microbenchmarks / perf counters.
As long as I size my ring allocator to the max # of threads that may concurrently call the functions, there's no risk of overwriting data that pull_event could return. .... Any possible pitfalls with this approach?
The pitfall is, IIUC, that your statement is wrong.
If I have just 2 threads, and 10 elements in the ring buffer, the first thread could call pull_event once, and be "mid-pulling", and then the second thread could call push 10 times, overwriting what thread 1 is pulling.
Again, assuming I understand your code correctly.
Also, as mentioned above,
return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
that arena_idx++ inside the exchange on the same variable just looks wrong. And in fact it is wrong. Two threads could increment it: thread A increments it to 8 and thread B increments it to 9; then thread B exchanges it to 9, then thread A exchanges it to 8. Whoops.
atomic(op1) # atomic(op2) != atomic(op1 # op2)
I worry about what else is wrong in code not shown. I don't mean that as an insult - lock-free is just not easy.
Have you looked at any of the C++ ports of the (Java) Disruptor that are available?
disruptor--
disruptor
Although they are not complete ports, they may offer all that you need. I am currently working on a more fully featured port; however, it's not quite ready.