I'm using C++11 and I'm aware that concurrent writes to std::vector<bool> someArray are not thread-safe due to the specialization of std::vector for bools.
I'm trying to find out if writes to bool someArray[2048] have the same problem:
Suppose all entries in someArray are initially set to false.
Suppose I have a bunch of threads that write at different indices in someArray. In fact, these threads only set different array entries from false to true.
Suppose I have a reader thread that at some point acquires a lock, triggering a memory fence operation.
Q: Will the reader see all the writes to someArray that occurred before the lock was acquired?
Thanks!
You should use std::array<bool, 2048> someArray, not bool someArray[2048];. If you're in C++11-land, you'll want to modernize your code as much as you are able.
std::array<bool, N> is not specialized in the same way that std::vector<bool> is, so there's no concerns there in terms of raw safety.
As for your actual question:
Will the reader see all the writes to someArray that occurred before the lock was acquired?
Only if the writers to the array also interact with the lock, either by releasing it when they finish writing, or by updating a value associated with the lock that the reader then synchronizes with. If the writers never interact with the lock, there is no happens-before relationship between their writes and the reader's reads, so what the reader observes is undefined. For example:
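Here is a minimal sketch of that idea; the function names and the writers_done counter are purely illustrative. Each writer locks and unlocks the same mutex once it has finished its slice, so the reader's later lock acquisition synchronizes with those unlocks:

#include <array>
#include <cstddef>
#include <mutex>

std::array<bool, 2048> someArray{};
std::mutex m;
int writers_done = 0;   // guarded by m

void writer(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        someArray[i] = true;
    std::lock_guard<std::mutex> lock(m);   // unlock at scope exit is a release
    ++writers_done;
}

void reader(int expected_writers) {
    std::lock_guard<std::mutex> lock(m);   // acquire
    // The writes of every writer that has already unlocked m are visible here.
    if (writers_done == expected_writers) {
        // safe to read the whole array
    }
}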
One thing you'll also want to bear in mind: while it's not unsafe to have multiple threads write to the same array, provided that they are all writing to unique memory addresses, writing could be slowed pretty dramatically by interactions with the cache. For example:
#include <array>
#include <thread>
#include <vector>

void func_a() {
    std::array<bool, 2048> someArray{};
    std::vector<std::thread> writers;
    for (int i = 0; i < 8; i++) {
        writers.emplace_back([i, &someArray] {
            // Each thread writes one contiguous block of 256 elements
            for (size_t index = i * 256; index < (i + 1) * 256; index++)
                someArray[index] = true;
        });
    }
    for (auto& writer : writers)
        writer.join(); // synchronize before someArray goes out of scope
}

void func_b() {
    std::array<bool, 2048> someArray{};
    std::vector<std::thread> writers;
    for (int i = 0; i < 8; i++) {
        writers.emplace_back([i, &someArray] {
            // Each thread's writes are interleaved with every other thread's
            for (size_t index = i; index < 2048; index += 8)
                someArray[index] = true;
        });
    }
    for (auto& writer : writers)
        writer.join(); // synchronize before someArray goes out of scope
}
The details will vary depending on the underlying hardware, but in nearly all situations func_a is going to be dramatically faster than func_b for a sufficiently large array. In func_b every thread's writes are interleaved with every other thread's, so the threads keep writing to bools that sit on the same cache lines (false sharing) and the cores have to pass those lines back and forth. (2048 was chosen only as an example and may not be large enough to show the full difference.) Both functions produce the same result, but one will be considerably faster than the other.
First of all, std::vector in general is not thread-safe in the way you might think. The guarantees are already stated here.
Addressing your question: the reader may not see all the writes after acquiring the lock. This is because the writers may never have performed a release operation, which is required to establish a happens-before relationship between the writes and the subsequent reads. In (very) simple terms: every acquire operation (such as a mutex lock) needs a release operation to synchronize with. Every memory operation done before a release on a certain variable will be visible to any thread that performs an acquire on the same variable. See also Release-Acquire ordering. For example:
One important thing to note: on x86/x64, aligned loads and stores of a register-sized variable are atomic at the hardware level. However, that is a property of the hardware, not a guarantee of the C++ memory model, and declaring the array volatile only stops the compiler from caching values in registers; it does not make concurrent modification of the array well-defined in standard C++. std::atomic (or a lock) remains the portable way to modify the array from multiple threads.
Related
An object has a memory pool assigned to it by malloc().
This object reads and writes the memory pool through a std::atomic<A>*.
A can be a trivially copyable class or any fundamental type.
The object that reads and writes the memory pool can be called from different threads.
If a thread gets a reference to an atomic object (for example, one at a particular address in the memory pool) and then does some logic with it, can another thread reference the same atomic object during that time? Does the second thread wait until the object is "released" by the first one, or does it continue and get another object that is not already referenced? The following code shows the algorithm for this use case:
std::atomic<Record>* it_record{&memory_address};  // memory_address is obtained by malloc()
for (size_t i{0U}; i < m_nbr_record; ++i) {
    Record tmp = *it_record;
    if (tmp == target_record) {
        *it_record = new_record;
        found = true;
        break;
    }
    ++it_record;
}
If a thread gets a reference to an atomic object (for example, one at a particular address in the memory pool) and then does some logic with it, can another thread reference the same atomic object during that time?
Yes. Accessing an object through an atomic variable only guarantees that single atomic operations are executed so that other threads cannot interfere. In your example the reading and writing are each an atomic operation.
Does this thread wait until this object is "released" by the other thread […]?
No, that is what a mutex does. Again, for atomics a singular operation is atomic, not a sequence of operations.
For your particular use case, it looks like a compare-and-swap (CAS) should work.
for (size_t i = 0; i < m_nbr_record; ++i) {
    Record expected = target_record;  // CAS overwrites its first argument on failure,
                                      // so use a fresh copy each iteration
    if (it_record[i].compare_exchange_strong(expected, new_record)) {
        found = true;
        break;
    }
}
Addendum
Being correct and being efficient are two very different things. This kind of operation is somewhat unusual and very inefficient. The description of your algorithm is too vague to give meaningful advice, but a memory pool sounds very much like something best served by a lock-free stack (a rough sketch follows below).
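For illustration, here is a rough sketch of a Treiber (lock-free) stack of free records; the Node type is hypothetical, and safe memory reclamation (the ABA problem, hazard pointers, etc.) is deliberately left out to keep the sketch short:

#include <atomic>

struct Node {
    Record value;
    Node*  next;
};

class FreeList {
    std::atomic<Node*> head{nullptr};
public:
    void push(Node* n) {
        n->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            // n->next was refreshed with the current head; retry
        }
    }
    Node* pop() {
        Node* n = head.load(std::memory_order_acquire);
        while (n && !head.compare_exchange_weak(n, n->next,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed)) {
            // n was refreshed with the current head; retry
        }
        return n;   // nullptr if the stack was empty
    }
};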
If you stick with your design, check whether it is possible to change it into a read + CAS loop to avoid the expensive CAS until the correct element is found.
for (size_t i = 0; i < m_nbr_record; ++i) {
    if (it_record[i].load(std::memory_order_relaxed) != target_record)
        continue;
    Record expected = target_record;
    if (it_record[i].compare_exchange_strong(expected, new_record)) {
        found = true;
        break;
    }
}
You should also check whether one of the weaker memory orders is possible for the CAS, but that depends on the specifics of your use case; a possible relaxation is sketched below.
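For example (a sketch reusing it_record and expected from the loop above; whether the relaxed failure order is acceptable depends on what other data, if any, is published alongside the record):

Record expected = target_record;
if (it_record[i].compare_exchange_strong(expected, new_record,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
    found = true;
    break;
}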
I have a code block that looks like this
std::vector<uint32_t> flags(n, 0);
#pragma omp parallel for
for (int i = 0; i < v; i++) {
    // If any thread finds it true, it's true.
    // Max value of j is n.
    for (auto& j : vec[i])
        flags[j] = true;
}
Work based upon the flags.
Is there any need for a mutex? I understand cache coherence is going to make sure all the write buffers are synchronized and conflicting buffers will not be written to memory. Secondly, the overhead of cache coherence can be avoided by simply changing
flags[j] = true;
to
if(!flags[j]) flags[j] = true;
Checking whether flags[j] is already set will reduce the write frequency and thus the need for cache-coherency updates. And even if flags[j] happens to be read as false by two threads, it only results in one extra write to flags[j], which is okay.
EDIT:
Yes multiple threads may and will try to write to the same index in flags[j]. Hence the question.
uint32_t is used intentionally instead of bool, since writing to booleans in parallel can malfunction when neighboring booleans share the same byte. Writing to distinct uint32_t elements in parallel from different threads will not malfunction in the same manner, even without a mutex.
FWIW, to comply with the standard I ended up keeping the code below, which more or less complies (not 100%, though). The non-standard code shown above did not fail in tests; I thought it would fail on multi-socket machines, but it turns out x86 also provides cache coherence across sockets.
#pragma omp parallel
{
    std::vector<uint32_t> flags_local(n, 0);
    #pragma omp for
    for (int i = 0; i < v; i++) {
        for (auto& j : vec[i])
            flags_local[j] = true;
    }
    // No omp directive here, as all threads
    // need to traverse their full arrays.
    for (int j = 0; j < n; j++) {
        if (flags_local[j] && !flags[j]) {
            #pragma omp critical
            { flags[j] = true; }
        }
    }
}
Thread safety in C++ is such that you need not worry about cache coherency and similar hardware-related issues. What matters is what is specified in the C++ standard, and I don't think it mentions cache coherency. That's for the implementers to worry about.
For writing to elements of a std::vector the rules are actually rather simple: Writing to distinct elements of a vector is thread-safe. Only if two threads write to the same index you need to synchronize the access (and for that it does not matter whether both threads write the same value or not).
As pointed out by Evg, I made a rough simplification. What counts is that all threads access different memory locations. Hence, with a std::vector<bool> things wouldn't be that simple, because typically several elements of a std::vector<bool> are stored in a single byte.
Yes multiple threads may and will try to write to the same index in flags[j].
Then you need to synchronize the access. The fact that all elements are false initially and that all writes store true is not relevant. What counts is that you have multiple threads accessing the same memory and at least one of them writes to it. When that is the case, you need to synchronize the access or you have a data race. One way to do that with OpenMP is sketched below.
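For example, a sketch that keeps the parallel stores well-defined without a full mutex, reusing flags, vec and v from the question:

#pragma omp parallel for
for (int i = 0; i < v; i++) {
    for (auto j : vec[i]) {
        // Each store to a shared flag is performed atomically.
        #pragma omp atomic write
        flags[j] = 1u;
    }
}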
Accessing a variable concurrently with a write is a data race, and so it is undefined behavior.
flags[j] = true;
should be protected.
Alternatively you might use atomic types (but see how-to-declare-a-vector-of-atomic-in-c++), or, even simpler, use std::atomic_ref (C++20):
std::vector<uint32_t> flags(n, 0);
#pragma omp parallel for
for (int i = 0; i < v; i++) {
    for (auto j : vec[i]) {
        std::atomic_ref<std::uint32_t> atom_flag(flags[j]);
        atom_flag = true;
    }
}
The goal is to implement a sequence number generator in modern C++. The context is in a concurrent environment.
Requirement #1: The class must be a singleton (common to all threads).
Requirement #2: The type used for the numbers is a 64-bit integer.
Requirement #3: The caller can request more than one number.
Requirement #4: The class caches a sequence of numbers before being able to serve the calls. Because it caches a sequence, it must also store the upper bound, i.e. the maximum number it is allowed to return.
Requirement #5: Last but not least, at startup (constructor) and when there are no available numbers left to give (n_requested > n_available), the singleton class must query the database to get a new sequence. This load from the DB updates both seq_n_ and max_seq_n_.
A brief draft for its interface is the following:
class singleton_sequence_manager {
public:
    static singleton_sequence_manager& instance() {
        static singleton_sequence_manager s;
        return s;
    }

    std::vector<int64_t> get_sequence(int64_t n_requested);

private:
    singleton_sequence_manager(); // Constructor
    void get_new_db_sequence();   // Gets a new sequence from DB

    int64_t seq_n_;
    int64_t max_seq_n_;
};
Example just to clarify the use case.
Suppose that at startup, DB sets seq_n_ to 1000 and max_seq_n_ to 1050:
get_sequence(20); // Gets [1000, 1019]
get_sequence(20); // Gets [1020, 1039]
get_sequence(5);  // Gets [1040, 1044]
get_sequence(10); // To serve this call, a new sequence must be loaded from DB
Obviously, an implementation using locks and std::mutex is quite simple, for example along these lines:
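Rough sketch of such a version (a hypothetical std::mutex mtx_ member is assumed; seq_n_ and max_seq_n_ are the plain int64_t members from the draft above):

std::vector<int64_t> singleton_sequence_manager::get_sequence(int64_t n_requested) {
    std::lock_guard<std::mutex> lock(mtx_);
    while (seq_n_ + n_requested - 1 > max_seq_n_)
        get_new_db_sequence();              // refill while holding the lock
    std::vector<int64_t> res;
    res.reserve(n_requested);
    for (int64_t i = 0; i < n_requested; ++i)
        res.push_back(seq_n_++);
    return res;
}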
What I am interested in is implementing a lock-free version using std::atomic and atomic operations.
My first attempt is the following one:
int64_t seq_n_;
int64_t max_seq_n_;
were changed to:
std::atomic<int64_t> seq_n_;
std::atomic<int64_t> max_seq_n_;
Get a new sequence from DB just sets the new values in the atomic variables:
void singleton_sequence_manager::get_new_db_sequence() {
    // Sync call is made to DB
    // Let's just ignore unhappy paths for simplicity
    seq_n_.store(start_of_seq_got_from_db);
    max_seq_n_.store(end_of_seq_got_from_db);
    // At this point, the class can start returning numbers in [seq_n_ : max_seq_n_]
}
And now the get_sequence function using atomic compare and swap technique:
std::vector<int64_t> singleton_sequence_manager::get_sequence(int64_t n_requested) {
    bool succeeded{false};
    int64_t current_seq{};
    int64_t next_seq{};
    do {
        current_seq = seq_n_.load();
        do {
            next_seq = current_seq + n_requested;
        } while (!seq_n_.compare_exchange_weak(current_seq, next_seq));
        // After the CAS, the caller gets the sequence [current_seq : next_seq - 1]

        // Check if the sequence is within the cached bound.
        if (max_seq_n_.load() >= next_seq - 1)
            succeeded = true;
        else // Needs to load a new sequence from DB, and re-calculate again
            get_new_db_sequence();
    } while (!succeeded);

    // Building the response
    std::vector<int64_t> res{};
    res.reserve(n_requested);
    for (int64_t n = current_seq; n < next_seq; n++)
        res.push_back(n);
    return res;
}
Thoughts:
I am really concerned about the lock-free version. Is the implementation safe? If we ignore the DB load part, obviously yes. The problem arises (in my head at least) when the class has to load a new sequence from the DB. Is the update from the DB safe? Two atomic stores?
My second attempt was to combine both seq_n_ and max_seq_n_ into a struct called sequence and use a single std::atomic variable, but the compiler failed because the size of the struct sequence is greater than 64 bits.
Is it possible to somehow protect the DB part by using an atomic flag to mark whether the sequence is ready: the flag is set to false while waiting for the DB load to finish and update both atomic variables, and get_sequence is updated to wait for the flag to be set to true (use of a spin lock?).
Your lock-free version has a fundamental flaw, because it treats two independent atomic variables as one entity. Since writes to seq_n_ and max_seq_n_ are separate statements, they can be separated during execution resulting in the use of one of them with a value that is incorrect when paired with the other.
For example, one thread can get past the CAS inner while loop (with an n_requested that is too large for the current cached sequence), then be suspended before checking whether it is cached. A second thread can come through and update max_seq_n_ to a larger value. The first thread then resumes and passes the max_seq_n_ check because the value was updated by the second thread. It is now using an invalid sequence.
A similar thing can happen in get_new_db_sequence between the two store calls.
Since you're writing to two distinct locations (even if adjacent in memory), and they cannot be updated atomically (due to the combined size of 128 bits not being a supported atomic size with your compiler), the writes must be protected by a mutex.
A spin lock should only be used for very short waits, since it consumes CPU cycles while spinning. A typical usage is to spin for a short time and, if the resource is still unavailable, fall back to something more expensive (like a mutex) so that the thread can wait without burning CPU time. A minimal spin lock might look like the sketch below.
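Illustrative sketch of a C++11 spin lock built from std::atomic_flag (this is a sketch, not a drop-in replacement for a proper mutex):

class spin_lock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // busy-wait; acceptable only for very short critical sections
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};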
I have a vector that is modified in one thread, and I need to use its contents in another. Locking between these threads is unacceptable due to performance requirements. Since iterating over the vector while it is changing will cause a crash, I thought to copy the vector and then iterate over the copy. My question is, can this way also crash?
struct Data
{
    int A;
    double B;
    bool C;
};

std::vector<Data> DataVec;

void ModifyThreadFunc()
{
    // Here the vector is changed, which includes adding and erasing elements
    ...
}

void ReadThreadFunc()
{
    auto temp = DataVec; // Will this crash?
    for (auto& data : temp)
    {
        // Do stuff with the data
        ...
    }

    // This definitely can crash
    /*for (auto& data : DataVec)
    {
        // Do stuff with the data
        ...
    }*/
}
The basic thread safety guarantee for vector::operator= is:
"if an exception is thrown, the container is in a valid state."
What types of exceptions are possible here?
EDIT:
I solved this using double buffering, and posted my answer below.
As has been pointed out by the other answers, what you ask for is not doable. If you have concurrent access, you need synchronization, end of story.
That being said, it is not unusual to have requirements like yours where synchronization is not an option. In that case, what you can still do is get rid of the concurrent access. For example, you mentioned that the data is accessed once per frame in a game-loop like execution. Is it strictly required that you get the data from the current frame or could it also be the data from the last frame?
In that case, you could work with two vectors, one that is being written to by the producer thread and one that is being read by all the consumer threads. At the end of the frame, you simply swap the two vectors. Now you no longer need *(1) fine-grained synchronization for the data access, since there is no concurrent data access any more.
This is just one example of how to do this (a short sketch follows after the footnote below). If you need to get rid of locking, start thinking about how to organize data access so that you avoid getting into the situation where you need synchronization in the first place.
*(1): Strictly speaking, you still need a synchronization point that ensures that when you perform the swapping, all the writer and reader threads have finished working. But this is far easier to do (usually you have such a synchronization point at the end of each frame anyway) and has a far lesser impact on performance than synchronizing on every access to the vector.
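To make the idea concrete, here is a rough sketch; the names are invented, and Data is the struct from the question:

std::vector<Data> frontVec;   // read by the consumer threads during the frame
std::vector<Data> backVec;    // written by the producer thread during the frame

void EndOfFrame()
{
    // Called at the per-frame synchronization point, i.e. when no thread is
    // currently touching either vector.
    std::swap(frontVec, backVec);
    // Optionally carry the current state over into the new back buffer:
    backVec = frontVec;
}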
My question is, can this way also crash?
Yes, you still have a data race. If thread A modifies the vector while thread B is creating a copy, all iterators to the vector are invalidated.
What types of exceptions are possible here?
std::vector::operator=(const vector&) will throw on memory allocation failure, or if the contained elements throw on copy. The same thing applies to copy construction, which is what the line in your code marked "Will this crash?" is actually doing.
The fundamental problem here is that std::vector is not thread-safe. You have to either protect it with a lock/mutex (sketched below), or replace it with a thread-safe container (such as the lock-free containers in Boost.Lockfree or libcds).
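A sketch of the lock/mutex route, using the names from the question (the mutex itself is an addition): both the modification and the copy take the same mutex, so the copy never observes the vector mid-update.

std::mutex vecMutex;   // added alongside DataVec

void ModifyThreadFunc()
{
    std::lock_guard<std::mutex> lock(vecMutex);
    // ... add / erase elements of DataVec ...
}

void ReadThreadFunc()
{
    std::vector<Data> temp;
    {
        std::lock_guard<std::mutex> lock(vecMutex);
        temp = DataVec;             // copy under the lock
    }
    for (auto& data : temp)
    {
        // Do stuff with the data, outside the lock
    }
}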
I have a vector that is modified in one thread, and I need to use its contents in another. Locking between these threads is unacceptable due to performance requirements.
This is an impossible requirement to meet.
Any sharing of data between two threads requires some kind of locking, be it explicit or provided by the implementation (ultimately the hardware). You must examine your actual requirements again: it may be unacceptable to suspend one thread until the other one finishes, but you could lock around short sequences of instructions. And/or possibly use a different architecture. For example, erasing an item in the middle of a vector is a costly operation (linear time, because you have to move all the data above the removed item), while marking it as invalid is much quicker (constant time, because it is a single write). If you really have to erase in the middle of a vector, maybe a list would be more appropriate.
But if you can put a locking exclusion around the copy of the vector in ReadThreadFunc and around any vector modification in ModifyThreadFunc, it could be enough. To give priority to the modifying thread, you could just try to lock in the reading thread and immediately give up if you cannot, as in the sketch below.
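A sketch of that idea, assuming some std::mutex vecMutex guards every modification of DataVec:

void ReadThreadFunc()
{
    if (!vecMutex.try_lock())
        return;                     // give up immediately; try again next time
    std::vector<Data> temp;
    {
        // Take ownership of the already-held lock so it is released even on exceptions
        std::lock_guard<std::mutex> lock(vecMutex, std::adopt_lock);
        temp = DataVec;             // copy under the lock
    }
    for (auto& data : temp)
    {
        // Do stuff with the data, without ever blocking the modifying thread
    }
}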
Maybe you should rethink your design!
Each thread should have its own vector (list, queue, whatever fits your needs) to work on. Thread A can then do some work and pass the result to thread B. You simply have to lock when writing the data from thread A into thread B's queue.
Without some kind of locking it's not possible.
So I solved this using double buffering, which guarantees no crashing, and the reading thread will always have usable data, even if it might not be correct:
struct Data
{
    int A;
    double B;
    bool C;
};

const int MAXSIZE = 100;
Data Buffer[MAXSIZE];
std::vector<Data> DataVec;

void ModifyThreadFunc()
{
    // Here the vector is changed, which includes adding and erasing elements
    ...

    // Copy from the vector to the buffer
    size_t numElements = DataVec.size();
    memcpy(Buffer, DataVec.data(), sizeof(Data) * numElements);
    memset(&Buffer[numElements], 0, sizeof(Data) * (MAXSIZE - numElements));
}

void ReadThreadFunc()
{
    Data* p = Buffer;
    for (int i = 0; i < MAXSIZE; ++i)
    {
        // Use the data
        ...
        ++p;
    }
}
I have a class that stores the latest value of some incoming realtime data (around 150 million events/second).
Suppose it looks like this:
class DataState
{
    Event latest_event;
public:
    // pushes event atomically
    void push_event(const Event* __restrict__ e);
    // pulls event atomically
    Event pull_event();
};
I need to be able to push events atomically and pull them with strict ordering guarantees. Now, I know I can use a spinlock, but given the massive event rate (over 100 million/second) and high degree of concurrency I'd prefer to use lockfree operations.
The problem is that Event is 64 bytes in size. There is no 64-byte compare-exchange instruction on any current x86 CPU (as of August '16); the widest available is the 16-byte CMPXCHG16B. So if I use std::atomic<Event> I'd have to link against libatomic, which uses mutexes under the hood (too slow).
So my solution was to instead atomically swap pointers to the value. Problem is dynamic memory allocation becomes a bottleneck with these event rates. So... I define something I call a "ring allocator":
/// @brief Lockfree static short-lived allocator used for a ringbuffer
/// Elements are guaranteed to persist only for "size" calls to get_next()
template<typename T> class RingAllocator {
    T *arena;
    std::atomic_size_t arena_idx;
    const std::size_t arena_size;
public:
    /// @brief Creates a new RingAllocator
    /// @param size The number of elements in the underlying arena. Make this large enough to avoid overwriting fresh data
    RingAllocator<T>(std::size_t size) : arena_size(size)
    {
        // allocate pool
        arena = new T[size];
        // zero out pool
        std::memset(arena, 0, sizeof(T) * size);
        arena_idx = 0;
    }

    ~RingAllocator()
    {
        delete[] arena;
    }

    /// @brief Return next element's pointer. Thread-safe
    /// @return pointer to next available element
    T *get_next()
    {
        return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
    }
};
Then I could have my DataState class look like this:
class DataState
{
    std::atomic<Event*> latest_event;
    RingAllocator<Event> event_allocator;
public:
    // pushes event atomically
    void push_event(const Event* __restrict__ e)
    {
        // store event
        Event *new_ptr = event_allocator.get_next();
        *new_ptr = *e;
        // swap event pointers
        latest_event.store(new_ptr, std::memory_order_release);
    }

    // pulls event atomically
    Event pull_event()
    {
        return *(latest_event.load(std::memory_order_acquire));
    }
};
As long as I size my ring allocator to the max # of threads that may concurrently call the functions, there's no risk of overwriting data that pull_event could return. Plus everything's super localized so indirection won't cause bad cache performance. Any possible pitfalls with this approach?
The DataState class:
I thought it was going to be a stack or queue, but it isn't, so push / pull don't seem like good names for methods. (Or else the implementation is totally bogus).
It's just a latch that lets you read the last event that any thread stored.
There's nothing to stop two writes in a row from overwriting an element that's never been read. There's also nothing to stop you reading the same element twice.
If you just need somewhere to copy small blocks of data, a ring buffer does seem like a decent approach. But if you don't want to lose events, I don't think you can use it this way. Instead, just get a ring buffer entry, then copy to it and use it there. So the only atomic operation should be incrementing the ring buffer position index.
The ring buffer
You can make get_next() much more efficient. This line does an atomic post-increment (fetch_add) and an atomic exchange:
return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
I'm not even sure it's safe, because the xchg can maybe step on the fetch_add from another thread. Anyway, even if it's safe, it's not ideal.
You don't need that. Make sure the arena_size is always a power of 2, then you don't need to modulo the shared counter. You can just let it go, and have every thread modulo it for their own use. It will eventually wrap, but it's a binary integer so it will wrap at a power of 2, which is a multiple of your arena size.
I'd suggest storing an AND-mask instead of a size, so there's no risk of the % compiling to anything other than an and instruction, even if it's not a compile-time constant. This makes sure we avoid a 64-bit integer div instruction.
template<typename T> class RingAllocator {
    T *arena;
    std::atomic_size_t arena_idx;
    const std::size_t size_mask;   // maybe even make this a template parameter?
public:
    RingAllocator<T>(std::size_t size)
        : arena_idx(0), size_mask(size-1)
    {
        // verify that size is actually a power of two, so the mask is all-ones in the low bits
        // and all-zeros in the high bits, so that i % size == i & size_mask for all i
        ...
    }

    ...

    T *get_next() {
        size_t idx = arena_idx.fetch_add(1, std::memory_order_relaxed); // still atomic, but we don't care which order different threads take blocks in
        idx &= size_mask;   // modulo our local copy of the idx
        return &arena[idx];
    }
};
Allocating the arena would be more efficient if you used calloc instead of new + memset. The OS already zeros pages before giving them to user-space processes (to prevent information leakage), so writing them all is just wasted work.
arena = new T[size];
std::memset(arena, 0, sizeof(T) * size);
// vs.
arena = (T*)calloc(size, sizeof(T));
Writing the pages yourself does fault them in, so they're all wired to real physical pages, instead of just copy-on-write mappings for a system-wide shared physical zero page (like they are after new/malloc/calloc). On a NUMA system, the physical page chosen might depend on which thread actually touched the page, rather than which thread did the allocation. But since you're reusing the pool, the first core to write a page might not be the one that ends up using it most.
Maybe something to look for in microbenchmarks / perf counters.
As long as I size my ring allocator to the max # of threads that may concurrently call the functions, there's no risk of overwriting data that pull_event could return. .... Any possible pitfalls with this approach?
The pitfall is, IIUC, that your statement is wrong.
If I have just 2 threads, and 10 elements in the ring buffer, the first thread could call pull_event once, and be "mid-pulling", and then the second thread could call push 10 times, overwriting what thread 1 is pulling.
Again, assuming I understand your code correctly.
Also, as mentioned above,
return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
that arena_idx++ inside the exchange on the same variable, just looks wrong. And in fact is wrong. Two threads could increment it - ThreadA increments to 8 and threadB increments to 9, and then threadB exchanges it to 9, then threadA exchanges it to 8. whoops.
atomic(op1) # atomic(op2) != atomic(op1 # op2)
I worry about what else is wrong in code not shown. I don't mean that as an insult - lock-free is just not easy.
Have you looked at any of the available C++ ports of the (Java) Disruptor?
disruptor--
disruptor
Although they are not complete ports, they may offer all that you need. I am currently working on a more fully featured port; however, it's not quite ready.