Multithreaded read-many, write-seldom array/vector iteration in C++ - c++

I have a need to almost-constantly iterate over a sequence of structs in a read-only fashion but for every 1M+ reads, one of the threads may append an item. I think using a mutex would be overkill here and I also read somewhere that r/w locks have their own drawbacks for the readers.
I was thinking about using reserve() on a std::vector, but this answer to "Iterate over STL container using indices safe way to avoid using locks?" seemed to invalidate that.
Any ideas on what way might be fastest? The most important thing is for the readers to be able to quickly and efficiently iterate with as little contention as possible. The writing operations aren't time-sensitive.
Update: Another one of my use cases is that the "list" could contain pointers rather than structs, i.e. std::vector<MyClass*>. Same requirements apply.
Update 2: Hypothetical example
globally accessible:
typedef std::vector<MyClass*> Vector;
Vector v;
v.reserve(50);
Reader threads 1-10: (these run pretty much all the time)
...
int total = 0;
for (Vector::const_iterator it = v.begin(); it != v.end(); ++it)
{
    MyClass* ptr = *it;
    total += ptr->getTotal();
}
// do something with total
...
Writer threads 11-15:
MyClass* ptr = new MyClass();
v.push_back(ptr);
That's basically what happens here. Threads 1-15 could all be running concurrently, although generally there are only 1-2 reading threads and 1-2 writer threads.

What I think could work here is your own implementation of a vector, something like this:
template <typename T> class Vector
{
    // constructor will be needed of course
public:
    std::shared_ptr<const std::vector<T> > getVector()
    { return mVector; }
    void push_back(const T&);
private:
    std::shared_ptr<std::vector<T> > mVector;
};
Then, whenever readers need to access a specific Vector, they should call getVector() and keep the returned shared_ptr until finished reading.
But writers should always use Vector's push_back to add a new value. This push_back should then check whether mVector->size() == mVector->capacity() and, if true, allocate a new vector and assign it to mVector. Something like:
template <typename T>
void Vector<T>::push_back(const T& t)
{
    if (mVector->size() == mVector->capacity())
    {
        // make certain here that new_size > old_size
        std::shared_ptr<std::vector<T> > vec(new std::vector<T>());
        vec->reserve(mVector->size() * SIZE_MULTIPLIER);
        std::copy(mVector->begin(), mVector->end(), std::back_inserter(*vec));
        mVector = vec; // readers still holding the old shared_ptr keep the old storage alive
    }
    // put 't' into 'mVector'. 'mVector' is guaranteed not to reallocate now.
    mVector->push_back(t);
}
The idea here is inspired by the RCU (read-copy-update) algorithm. If storage space is exhausted, the new storage should not invalidate the old storage as long as there is at least one reader accessing it; but the new storage should be allocated, and any reader coming after the allocation should be able to see it. The old storage should be deallocated as soon as no one is using it anymore (all readers are finished).
Since most HW architectures provide some way to have atomic increments and decrements, the shared_ptr reference count can be maintained without locks, so I think this Vector can run essentially lock-free, provided the shared_ptr itself is published atomically.
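One caveat worth spelling out: the atomic reference count alone does not make concurrent getVector()/push_back() on the same shared_ptr object safe; the new snapshot has to be published with the atomic load/store overloads for shared_ptr (C++11; C++20 offers std::atomic<std::shared_ptr<T>> instead). A minimal sketch of that publication step, simplified so that push_back copies the whole vector every time (the capacity trick above can be layered on top):
#include <atomic>
#include <memory>
#include <vector>

template <typename T>
class Vector {
public:
    Vector() : mVector(std::make_shared<std::vector<T> >()) {}

    // Readers take a snapshot; it stays valid for as long as they hold it.
    std::shared_ptr<const std::vector<T> > getVector()
    {
        return std::atomic_load(&mVector);
    }

    // Single writer (or writers serialized externally).
    void push_back(const T& t)
    {
        std::shared_ptr<std::vector<T> > cur = std::atomic_load(&mVector);
        std::shared_ptr<std::vector<T> > next =
            std::make_shared<std::vector<T> >(*cur);   // copy the old contents
        next->push_back(t);
        std::atomic_store(&mVector, next);             // publish the new storage
    }

private:
    std::shared_ptr<std::vector<T> > mVector;
};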
One disadvantage to this approach though, is that depending on how long readers hold that shared_ptr you might end up with several copies of your data.
PS: hope I haven't made too many embarrassing errors in the code :-)

... using reserve() on a std::vector ...
This can only be useful if you can guarantee the vector will never need to grow. You've stated that the number of items is not bounded above, so you can't give that guarantee.
Notwithstanding the linked question, you could conceivably use std::vector just to manage memory for you, but it would take an extra layer of logic on top to work around the problems identified in the accepted answer.
The actual answer is: the fastest thing to do is minimize the amount of synchronization. What the minimal amount of synchronization is depends on details of your code and usage that you haven't specified.
For example, I sketched a solution using a linked-list of fixed-size chunks. This means your common use case should be as efficient as an array traversal, but you're able to grow dynamically without re-allocating.
However, the implementation turns out to be sensitive to questions like:
whether you need to remove items (when they're read? only from the front, or from other places too?)
whether you want the reader to busy-wait if the container is empty, and whether that wait should use some kind of backoff
what degree of consistency is required
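Purely as an illustration of the chunked idea (not the actual sketch referred to above), here is the shape such a structure could take, assuming a single writer, no removals, and readers that only need to see elements that have already been published:
#include <array>
#include <atomic>
#include <cstddef>

// Append-only list of fixed-size chunks; traversal is mostly array-like.
// Assumes T is default-constructible and copy-assignable, and a single writer.
template <typename T, std::size_t ChunkSize = 64>
class ChunkedList {
    struct Chunk {
        std::array<T, ChunkSize> items;
        std::atomic<std::size_t> count{0};   // elements published in this chunk
        std::atomic<Chunk*> next{nullptr};
    };

public:
    ChunkedList() : head_(new Chunk), tail_(head_) {}

    ~ChunkedList() {
        for (Chunk* c = head_; c != nullptr;) {
            Chunk* next = c->next.load(std::memory_order_relaxed);
            delete c;
            c = next;
        }
    }

    // Single writer: fill the tail chunk, link a fresh one when it is full.
    void push_back(const T& value) {
        std::size_t n = tail_->count.load(std::memory_order_relaxed);
        if (n == ChunkSize) {
            Chunk* c = new Chunk;
            tail_->next.store(c, std::memory_order_release);
            tail_ = c;
            n = 0;
        }
        tail_->items[n] = value;
        tail_->count.store(n + 1, std::memory_order_release); // publish the element
    }

    // Readers: visit everything that has been published so far.
    template <typename F>
    void for_each(F f) const {
        for (const Chunk* c = head_; c != nullptr;
             c = c->next.load(std::memory_order_acquire)) {
            std::size_t n = c->count.load(std::memory_order_acquire);
            for (std::size_t i = 0; i < n; ++i)
                f(c->items[i]);
        }
    }

private:
    Chunk* head_;
    Chunk* tail_;   // only touched by the single writer
};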

Related

multiple locks on different elements of an array

If I have 8 threads and an array of 1,000,000,000 elements, I can have 1,000,000,000 mutexes where the index represents the element within the array that is being locked and written to. However this is fairly wasteful to me and requires a lot of memory.
Is there a way that I can only use 8 mutices and have the same functionality?
Thinking out loud here... and not really sure how efficient this would be, but:
You could create a method of locking certain indexes:
std::vector<int> mutexed_slots;
std::mutex mtx;

bool lock_element(int index)
{
    std::lock_guard<std::mutex> lock(mtx);
    // Check if the index is already in the mutexed list
    if (std::find(mutexed_slots.begin(), mutexed_slots.end(), index) == mutexed_slots.end())
    {
        // It's not there, so add it - now that array slot is safe from other threads
        mutexed_slots.emplace_back(index);
        return true;
    }
    return false;
}

void unlock_element(int index)
{
    std::lock_guard<std::mutex> lock(mtx);
    // No need to check because you will only unlock the element that you locked
    // (unless you are a very naughty boy indeed)
    mutexed_slots.erase(std::find(mutexed_slots.begin(), mutexed_slots.end(), index));
}
Note: This is the start of an idea, so don't knock it too hard just yet! It's also untested pseudo-code. It's not really intended as a final answer - but as a starting point. Please add comments to improve it or to suggest whether it is/isn't plausible.
Further points:
There may be a more efficient STL container to use
You could probably wrap all of this up in a class along with your data
You would need to loop through lock_element() until it returns true - again not pretty at the moment. This mechanism could be improved.
Each thread needs to remember which index they currently are working on so that they only unlock that particular one - again this could be more integrated within a class to ensure that behaviour.
But as a concept - workable? I would think if you need really fast access (which maybe you do) this might not be that efficient, thoughts?
Update
This could be made much more efficient if each thread/worker "registers" its own entry in mutexed_slots. Then there would be no push_back/removals from the vector (except at the start/end). So each thread just sets the index that it has locked - if it has nothing locked then it just gets set to -1 (or such). I am thinking there are many more such efficiency improvements to be made. Again, a complete class to do all this for you would be the way to implement it.
Testing / Results
I implemented a tester for this, just because I quite enjoy that sort of thing. My implementation is here
I think it's a public github repo - so you are welcome to take a look. But I posted the results on the top-level readme (so scroll a little to see them). I implemented some improvements such that:
There are no inserts/removals to the protection array at run-time
There is no need for a lock_guard to do the "unlock" because I am relying on a std::atomic index.
Below is a printout of my summary:
Summary:
When the workload is 1ms (the time taken to perform each action) then the amount of work done was:
9808 for protected
8117 for normal
Note that these values varied; sometimes the normal was higher, and there appeared to be no clear winner.
When the workload is 0ms (basically increment a few counters) then the amount of work done was:
9791264 for protected
29307829 for normal
So here you can see that using the mutexed protection slows the work down to about a third (1/3) of the unprotected rate. This ratio was consistent between tests.
I also ran the same tests for 1 worker, and the same ratios roughly held true. However, when I made the array smaller (~1000 elements), the amount of work done was still roughly the same when the workload was 1ms. But when the workload was very light I got results like:
5621311
39157931
Which is about 7 times slower.
Conclusion
The larger the array, the fewer collisions occur - and the better the performance.
The longer the workload is (per item), the less noticeable the difference from using the protection mechanism.
It appears that the locking generally only adds an overhead that is 2-3 times slower than incrementing a few counters. This is probably skewed by actual collisions, because (from the results) the longest lock time recorded was a huge 40ms - but that was when the work time was very fast, so many collisions occurred (~8 successful locks per collision).
It depends on the access pattern: do you have a way to partition the work effectively? Basically, you can partition the array into 8 chunks (or as many as you can afford) and cover each part with a mutex, but if the access pattern is random you're still going to have a lot of collisions.
Do you have TSX support on your system? This would be a classic use case: just have one global lock, and have the threads elide it unless there's an actual collision.
You can write a class that will create locks on the fly when a particular index requires it, std::optional would be helpful for this (C++17 code ahead):
class IndexLocker {
public:
    explicit IndexLocker(size_t size) : index_locks_(size) {}

    std::mutex& get_lock(size_t i) {
        if (std::lock_guard guard(instance_lock_); index_locks_[i] == std::nullopt) {
            index_locks_[i].emplace();
        }
        return *index_locks_[i];
    }

private:
    std::vector<std::optional<std::mutex>> index_locks_;
    std::mutex instance_lock_;
};
You could also use std::unique_ptr to minimize stack-space but maintain identical semantics:
class IndexLocker {
public:
    explicit IndexLocker(size_t size) : index_locks_(size) {}

    std::mutex& get_lock(size_t i) {
        if (std::lock_guard guard(instance_lock_); index_locks_[i] == nullptr) {
            index_locks_[i] = std::make_unique<std::mutex>();
        }
        return *index_locks_[i];
    }

private:
    std::vector<std::unique_ptr<std::mutex>> index_locks_;
    std::mutex instance_lock_;
};
Using this class doesn't necessarily mean you need to create all 1,000,000 elements. You can use modulo operations to treat the locker as a "hash table" of mutexes:
constexpr size_t kLockLimit = 8;
IndexLocker index_locker(kLockLimit);
auto thread_code = [&](size_t i) {
std::lock_guard guard(index_locker.get_lock(i % kLockLimit));
// Do work with lock.
};
Worth mentioning that the "hash table" approach makes it very easy to deadlock (get_lock(0) followed by get_lock(16), for example). If each thread does work on exactly one element at a time, however, this shouldn't be an issue.
There are other trade-offs with fine-grain locking. Atomic operations are expensive, so a parallel algorithm that locks every element can take longer than the sequential version.
How to lock efficiently depends. Are the array elements dependent on other elements in the array? Are you mostly reading? mostly writing?
I don't want to split the array into 8 parts because that will cause a high likelihood of waiting (access is random). The elements of the array are a class that I will write that will be multiple Golomb coded values.
I don't think having 8 mutexes is the way to go here. If a given lock protects an array section, you can't switch it to protect a different section in the midst of parallel execution without introducing a race condition (rendering the mutex pointless).
Are the array items small? If you can get them down to 8 bytes, you can declare your class with alignas(8) and instantiate std::atomic<YourClass> objects. (Size depends on architecture. Verify is_lock_free() returns true.) That could open up the possibility of lock-free algorithms. It almost seems like a variant of hazard pointers would be useful here. That's complex, so it's probably better to look into other approaches to parallelism if time is limited.
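A rough sketch of what that could look like, assuming the element really does fit in 8 bytes (the Cell layout below is purely hypothetical):
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical 8-byte element; must be trivially copyable for std::atomic.
struct alignas(8) Cell {
    uint32_t a;
    uint32_t b;
};

std::atomic<Cell> cells[1000];          // illustrative size

bool lock_free_supported() {
    return cells[0].is_lock_free();     // verify this on the target architecture
}

// Classic CAS retry loop: lock-free read-modify-write of a single element.
void increment_a(std::size_t i) {
    Cell expected = cells[i].load();
    Cell desired;
    do {
        desired = expected;
        desired.a += 1;                 // the modification we want to apply
    } while (!cells[i].compare_exchange_weak(expected, desired));
}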

should I synchronize the deque or not

I have a deque with pointers inside in a C++ application. I know there are two threads to access it.
Thread1 will add pointers from the back and Thread2 will process and remove pointers from the front.
Thread2 will wait until the deque reaches a certain amount, say 10 items, and then start to process them. It will only loop and process 10 items at a time. In the meantime, Thread1 may still keep adding new items to the deque.
I think it will be fine without synchronizing the deque because Thread1 and Thread2 are accessing different parts of the deque. It is a deque, not a vector, so there is no case where the existing memory of the container will be reallocated.
Am I right? If not, why (I want to know what I am missing)?
EDIT:
I know it will not hurt to ALWAYS synchronize it, but it may hurt performance or not be necessary. I just want it to run fast and correctly if possible.
The deque has to keep track of how many elements it has and where those elements are. Adding an element changes that stored data, as does removing an element. Changing that data from two threads without synchronization is a data race, and produces undefined behavior.
In short, you must synchronize those operations.
In general, the Standard Library containers cannot be assumed to be thread-safe unless all you do is reading from them.
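For the usage described in the question (Thread1 appends, Thread2 waits for a batch of 10 and then drains them), a minimal sketch of that synchronization might look like this; the class and function names are purely illustrative:
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

class PointerQueue {
public:
    // Thread1: append one pointer and wake the consumer.
    void push(void* p) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            items_.push_back(p);
        }
        cv_.notify_one();
    }

    // Thread2: block until at least `batch` items exist, then take them all at once.
    std::vector<void*> pop_batch(std::size_t batch) {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [&] { return items_.size() >= batch; });
        std::vector<void*> out(items_.begin(), items_.begin() + batch);
        items_.erase(items_.begin(), items_.begin() + batch);
        return out;
    }

private:
    std::deque<void*> items_;
    std::mutex mtx_;
    std::condition_variable cv_;
};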
If you take a look under the covers at a deque implementation, you will uncover something similar to this:
template <typename T>
class deque {
public:
private:
    static size_t const BufferCapacity = /**/;

    size_t _nb_available_buffer;
    size_t _first_occupied_buffer;
    size_t _last_occupied_buffer;
    size_t _size_first_buffer;
    size_t _size_last_buffer;

    T** _buffers; // heap allocated array of
                  // heap allocated arrays of fixed capacity
}; // class deque
Do you see the problem? _buffers, at the very least, may be accessed concurrently by both enqueue and dequeue operations (especially when the array has become too small and needs to be copied into a bigger array).
So, what is the alternative? What you are looking for is a concurrent queue. There are some implementations out there, and you should probably not worry too much about whether or not they are lock-free unless it proves to be a bottleneck. An example would be TBB's concurrent_queue.
I would advise against creating your own lock-free queue, even if you heard it's all the rage, because all the first implementations I have seen had (sometimes subtle) race conditions.
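As a quick illustration of the concurrent-queue route (the header path and names below follow classic TBB; newer oneTBB installs use oneapi/tbb/ paths but keep the tbb namespace):
#include <tbb/concurrent_queue.h>

tbb::concurrent_queue<void*> queue;

// Thread1: producer
void produce(void* p) {
    queue.push(p);                 // thread-safe append
}

// Thread2: consumer
void consume() {
    void* p = nullptr;
    while (queue.try_pop(p)) {     // thread-safe, non-blocking pop
        // process(p);
    }
}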

How is allocation/deallocation of objects in standard template library done

Most of the time I am confused about how allocation/deallocation of stl objects is done. For example: take this loop.
vector<vector<int>> example;
for(/*some conditions*/) {
    vector<int> row;
    for(/*some conditions*/) {
        row.push_back(k); //k is some int.
    }
    example.push_back(row);
}
In this case, what is happening with the object row? I can still see the values if I access them via example, which means that when I do example.push_back(row) a new copy is created. Am I correct? Is there a good way to prevent the same (if I am correct)?
Also, can anyone give references where I can read up on how allocation/deallocation is handled in the STL, or what the best practices are to avoid such memory copying issues (in case of large applications)?
Any help appreciated.
when I do example.push_back(row) a new copy is created. Am I correct.
Yes.
Is there a good way to prevent the same
Why would you want to prevent it? That behaviour is what makes vector simple and safe to use.
The standard library containers have value semantics, so they take a copy of the values you add to them and they manage the lifetime of those values, so you don't need to worry about it.
Also can anyone give references where I can read up how is allocation/deallocation handled in stl
Have you never heard of a search engine? Try http://www.sgi.com/tech/stl/Allocators.html for starters.
or what are best practices to avoid such memory copying issue(in case of large applications).
In general: forget about it. You usually don't need to worry about it, unless profiling has shown there's a performance problem.
std::vector does allow more fine-grained control over its memory usage, see the New members section and footnotes at http://www.sgi.com/tech/stl/Vector.html for more information.
For your example, you could add a new row to the example container then add the int values directly to it:
vector<vector<int>> example;
for(/*some conditions*/) {
    example.resize(example.size()+1);
    vector<int>& row = example.back();
    for(/*some conditions*/) {
        row.push_back(k); //k is some int.
    }
}
Even better would be to reserve enough capacity in the vector in advance:
vector<vector<int>> example;
example.reserve( /* maximum expected size of vector */ );
for(/*some conditions*/) {
    example.resize(example.size()+1);
    vector<int>& row = example.back();
    for(/*some conditions*/) {
        row.push_back(k); //k is some int.
    }
}
All an STL implementation has to do is obey the standard.
But std::swap is often used to switch the contents of a vector with another. This can be used to prevent value copies from being taken and is a good way of achieving efficiency, at least in the pre-C++11 world. (In your case, push back an empty vector and swap it with the one you've created.)
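For illustration, the swap trick mentioned above looks like this (with the C++11 move equivalent shown alongside):
std::vector<std::vector<int>> example;

std::vector<int> row;
// ... fill row ...

// Pre-C++11: append an empty vector, then swap the contents in (no element copies).
example.push_back(std::vector<int>());
example.back().swap(row);

// C++11 and later: simply move the row into the container instead.
// example.push_back(std::move(row));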

C++ STL vector iterator vs indexes access and thread safety

I am iterating over an STL vector and reading values from it. There is another thread which can make changes to this vector. Now, if the other thread inserts or removes an element from the vector, it invalidates the iterator. There is no use of locks involved. Does my choice of accessing the container through indexes (Approach 1) in place of iterators (Approach 2) make it thread safe? What about performance?
struct A { int i; int j; };
Approach 1:
size_t s = v.size(); // v contains pointers to objects of type A
for(size_t i = 0; i < s; ++i)
{
    A* ptr = v[i];
    ptr->i++;
}
Approach 2:
std::vector<A*>::iterator begin = v.begin();
std::vector<A*>::iterator end = v.end();
for(std::vector<A*>::iterator it = begin; it != end; ++it)
{
    A* ptr = *it;
    ptr->i++;
}
The thread-safety guarantees for standard library containers are very straightforward (these rules were added in C++11, but essentially all current library implementations conform to these requirements and impose the corresponding restrictions):
it is OK to have multiple concurrent readers
if there is one thread modifying a container there shall be no other thread accessing (reading or writing) it
the requirements are per container object
Effectively, this means that you need to use some mechanism external to the container to guarantee that a container accessed from multiple threads is handled correctly. For example, you can use a mutex or a reader-writer lock. Of course, most of the time containers are accessed only from one thread and things work just fine without any locking.
Without using explicit locks you will cause data races, and the behavior is undefined, independent of whether you use indices or iterators.
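A minimal sketch of that external synchronization, using a reader-writer lock (std::shared_mutex, C++17). Note that it only protects the container itself; in the question the "readers" also modify the pointed-to A objects, and those updates would need their own synchronization (e.g. std::atomic<int> members):
#include <cstddef>
#include <shared_mutex>
#include <vector>

struct A { int i; int j; };   // as in the question

std::vector<A*> v;
std::shared_mutex v_mutex;

int sum_all() {
    std::shared_lock<std::shared_mutex> lock(v_mutex);   // many readers at once
    int total = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i]->i;
    return total;
}

void append(A* p) {
    std::unique_lock<std::shared_mutex> lock(v_mutex);   // writer excludes everyone else
    v.push_back(p);
}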
OP "Does my choice of accessing the container through indexes(Approach 1) in place of iterators(Approach 2) make it thread safe?"
No, neither approach is thread safe once you start writing to your data structure.
Therefore you will need to serialize access to your data structure.
To save you a lot of time and frustration there are a lot of ready-rolled solutions, e.g.
Intel Threading Building Blocks (TBB) which comes with thread safe containers such as concurrent_vector.
http://threadingbuildingblocks.org/
A concurrent_vector is a container with the following features:
Random access by index. The index of the first element is zero.
Multiple threads can grow the container and append new elements concurrently.
Growing the container does not invalidate existing iterators or indices.*
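As a quick illustration (header and namespace follow classic TBB; oneTBB installs may use oneapi/tbb paths, and check the TBB documentation for details such as size() possibly including elements still under construction during concurrent growth):
#include <cstddef>
#include <tbb/concurrent_vector.h>

struct A { int i; int j; };   // as in the question

tbb::concurrent_vector<A*> v;

// Writer threads: concurrent growth is safe.
void append(A* p) {
    v.push_back(p);
}

// Reader threads: indices stay valid while the container grows, but updates to the
// pointed-to A objects still need their own synchronization.
int sum_all() {
    int total = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i]->i;
    return total;
}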
OP "What about performance?"
Not knowable in advance. Performance differs on different systems with different compilers, but the difference is not likely to be large enough to influence your choice.
No. STL containers are not thread safe.
You should provide exclusive access to each thread (the one that removes / the one that adds) while they're accessing the vector. Even when using indexes, you might be removing the i-th element, making the pointer you had retrieved invalid.
Could your algorithm work with a fixed size array?
The reason I ask is that the only way, logically, to have multiple threads modifying (most kinds of) container in a thread-safe, lock-free way is to make the container itself invariant. That means the CONTAINER doesn't ever change within the threads, just the elements within it. Think of the difference between messing with the insides of boxcars on a train, vs. actually adding & removing entire boxcars FROM that train as it's moving down the tracks. Even meddling with the elements is only safe if your operations on that data observe certain constraints.
Good news is that locks are not always the end of the world. If multiple execution contexts (threads, programs, etc.) can hit the same object simultaneously, they're often the only solution anyway.

Which STL container should I use for a FIFO?

Which STL container would fit my needs best? I basically have a 10-element-wide container in which I continually push_back new elements while pop_fronting the oldest element (about a million times).
I am currently using a std::deque for the task but was wondering if a std::list would be more efficient, since it wouldn't need to reallocate itself (or maybe I'm mistaking a std::deque for a std::vector?). Or is there an even more efficient container for my needs?
P.S. I don't need random access
Since there are a myriad of answers, you might be confused, but to summarize:
Use a std::queue. The reason for this is simple: it is a FIFO structure. You want FIFO, you use a std::queue.
It makes your intent clear to anybody else, and even yourself. A std::list or std::deque does not. A list can insert and remove anywhere, which is not what a FIFO structure is supposed to do, and a deque can add and remove from either end, which is also something a FIFO structure cannot do.
This is why you should use a queue.
Now, you asked about performance. Firstly, always remember this important rule of thumb: Good code first, performance last.
The reason for this is simple: people who strive for performance before cleanliness and elegance almost always finish last. Their code becomes a slop of mush, because they've abandoned all that is good in order to really get nothing out of it.
By writing good, readable code first, most of your performance problems will solve themselves. And if later you find your performance is lacking, it's now easy to add a profiler to your nice, clean code, and find out where the problem is.
That all said, std::queue is only an adapter. It provides the safe interface, but uses a different container on the inside. You can choose this underlying container, and this allows a good deal of flexibility.
So, which underlying container should you use? We know that std::list and std::deque both provide the necessary functions (push_back(), pop_front(), and front()), so how do we decide?
First, understand that allocating (and deallocating) memory is not a quick thing to do, generally, because it involves going out to the OS and asking it to do something. A list has to allocate memory every single time something is added, and deallocate it when it goes away.
A deque, on the other hand, allocates in chunks. It will allocate less often than a list. Think of it as a list, but each memory chunk can hold multiple nodes. (Of course, I'd suggest that you really learn how it works.)
So, with that alone a deque should perform better, because it doesn't deal with memory as often. Mixed with the fact you're handling data of constant size, it probably won't have to allocate after the first pass through the data, whereas a list will be constantly allocating and deallocating.
A second thing to understand is cache performance. Going out to RAM is slow, so when the CPU really needs to, it makes the best out of this time by taking a chunk of memory back with it, into cache. Because a deque allocates in memory chunks, it's likely that accessing an element in this container will cause the CPU to bring back the rest of the container as well. Now any further accesses to the deque will be speedy, because the data is in cache.
This is unlike a list, where the data is allocated one at a time. This means that data could be spread out all over the place in memory, and cache performance will be bad.
So, considering that, a deque should be a better choice. This is why it is the default container when using a queue. That all said, this is still only a (very) educated guess: you'll have to profile this code, using a deque in one test and list in the other to really know for certain.
But remember: get the code working with a clean interface, then worry about performance.
John raises the concern that wrapping a list or deque will cause a performance decrease. Once again, neither he nor I can say for certain without profiling it ourselves, but chances are that the compiler will inline the calls that the queue makes. That is, when you say queue.push(), it will really just say queue.container.push_back(), skipping the function call completely.
Once again, this is only an educated guess, but using a queue will not degrade performance, when compared to using the underlying container raw. Like I've said before, use the queue, because it's clean, easy to use, and safe, and if it really becomes a problem profile and test.
Check out std::queue. It wraps an underlying container type, and the default container is std::deque.
Where performance really matters, check out the Boost circular buffer library.
I continually push_back new elements while pop_fronting the oldest element (about a million times).
A million is really not a big number in computing. As others have suggested, use a std::queue as your first solution. In the unlikely event of it being too slow, identify the bottleneck using a profiler (do not guess!) and re-implement using a different container with the same interface.
Why not std::queue? All it has is push (onto the back) and pop (off the front).
A queue is probably a simpler interface than a deque but for such a small list, the difference in performance is probably negligible.
Same goes for list. It's just down to a choice of what API you want.
Use a std::queue, but be cognizant of the performance tradeoffs of the two standard Container classes.
By default, std::queue is an adapter on top of std::deque. Typically, that'll give good performance where you have a small number of queues containing a large number of entries, which is arguably the common case.
However, don't be blind to the implementation of std::deque. Specifically:
"...deques typically have large minimal memory cost; a deque holding just one element has to allocate its full internal array (e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size or 4096 bytes, whichever is larger, on 64-bit libc++)."
To net that out, presuming that a queue entry is something that you'd want to queue, i.e., reasonably small in size, then if you have 4 queues, each containing 30,000 entries, the std::deque implementation will be the option of choice. Conversely, if you have 30,000 queues, each containing 4 entries, then more than likely the std::list implementation will be optimal, as you'll never amortize the std::deque overhead in that scenario.
You'll read a lot of opinions about how cache is king, how Stroustrup hates linked lists, etc., and all of that is true, under certain conditions. Just don't accept it on blind faith, because in our second scenario there, it's quite unlikely that the default std::deque implementation will perform well. Evaluate your usage and measure.
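For reference, picking the underlying container is just the second template argument of std::queue:
#include <list>
#include <queue>

std::queue<int> q_default;               // adapts std::deque by default
std::queue<int, std::list<int>> q_list;  // explicitly adapts std::list instead

void demo() {
    q_list.push(42);   // push_back on the underlying list
    q_list.pop();      // pop_front on the underlying list
}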
This case is simple enough that you can just write your own. Here is something that works well for microcontroller situations where STL use takes too much space. It is a nice way to pass data and signals from an interrupt handler to your main loop.
// FIFO with circular buffer
#include <stdint.h>

#define fifo_size 4

class Fifo {
    uint8_t buff[fifo_size];
    int writePtr = 0;
    int readPtr = 0;
public:
    // Note: does not check for a full buffer; the producer must not get more
    // than fifo_size items ahead of the consumer.
    void put(uint8_t val) {
        buff[writePtr % fifo_size] = val;
        writePtr++;
    }
    uint8_t get() {
        uint8_t val = 0;
        if (readPtr < writePtr) {
            val = buff[readPtr % fifo_size];
            readPtr++;
            // rebase pointers to avoid overflow; subtract the same amount from
            // both so count() stays correct
            if (readPtr >= fifo_size) {
                writePtr -= fifo_size;
                readPtr -= fifo_size;
            }
        }
        return val;
    }
    int count() { return (writePtr - readPtr); }
};
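A hypothetical usage sketch (function names are made up): the interrupt handler produces with put(), and the main loop drains via count()/get(), which also sidesteps the ambiguity of get() returning 0 when the buffer is empty:
Fifo fifo;

// Interrupt handler: producer side.
void on_byte_received(uint8_t byte) {
    fifo.put(byte);
}

// Main loop: consumer side.
void poll_fifo() {
    while (fifo.count() > 0) {
        uint8_t byte = fifo.get();
        // ... handle byte ...
    }
}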