Copying std::vector between threads without locking - c++

I have a vector that is modified in one thread, and I need to use its contents in another. Locking between these threads is unacceptable due to performance requirements. Since iterating over the vector while it is changing will cause a crash, I thought I would copy the vector and then iterate over the copy. My question is: can this way also crash?
struct Data
{
    int A;
    double B;
    bool C;
};

std::vector<Data> DataVec;

void ModifyThreadFunc()
{
    // Here the vector is changed, which includes adding and erasing elements
    ...
}

void ReadThreadFunc()
{
    auto temp = DataVec; // Will this crash?
    for (auto& data : temp)
    {
        // Do stuff with the data
        ...
    }

    // This definitely can crash
    /*for (auto& data : DataVec)
    {
        // Do stuff with the data
        ...
    }*/
}
The basic thread safety guarantee for vector::operator= is:
"if an exception is thrown, the container is in a valid state."
What types of exceptions are possible here?
EDIT:
I solved this using double buffering, and posted my answer below.

As has been pointed out by the other answers, what you ask for is not doable. If you have concurrent access, you need synchronization, end of story.
That being said, it is not unusual to have requirements like yours where synchronization is not an option. In that case, what you can still do is get rid of the concurrent access. For example, you mentioned that the data is accessed once per frame in a game-loop like execution. Is it strictly required that you get the data from the current frame or could it also be the data from the last frame?
In that case, you could work with two vectors, one that is being written to by the producer thread and one that is being read by all the consumer threads. At the end of the frame, you simply swap the two vectors. Now you no longer need fine-grained synchronization *(1) for the data access, since there is no concurrent data access any more.
This is just one example of how to do this. If you need to get rid of locking, start thinking about how to organize data access so that you avoid getting into the situation where you need synchronization in the first place.
*(1): Strictly speaking, you still need a synchronization point that ensures that when you perform the swapping, all the writer and reader threads have finished working. But this is far easier to do (usually you have such a synchronization point at the end of each frame anyway) and has a far lesser impact on performance than synchronizing on every access to the vector.
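A minimal sketch of that idea (the buffer names and the SwapBuffers hook are my assumptions, not code from the question): the producer writes into a back buffer during the frame, the consumers read the front buffer, and the two are swapped at the end-of-frame synchronization point.

std::vector<Data> frontBuf; // read by the consumer threads during the frame
std::vector<Data> backBuf;  // written by the producer thread during the frame

// Called at the end-of-frame synchronization point, when no thread
// is touching either buffer any more.
void SwapBuffers()
{
    std::swap(frontBuf, backBuf); // O(1): only the internal pointers are exchanged
    backBuf = frontBuf;           // optional: seed the next frame with the current data
}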

My question is, can this way also crash?
Yes, you still have a data race: if thread A modifies the vector while thread B is copying it, the behavior is undefined, since any reallocation or erase invalidates the elements and iterators the copy is reading from.
What types of exceptions are possible here?
std::vector::operator=(const vector&) will throw if memory allocation fails, or if the contained elements throw on copy. The same applies to copy construction, which is what the line in your code marked "Will this crash?" actually does.
The fundamental problem here is that std::vector is not thread-safe. You have to either protect it with a lock/mutex, or replace it with a thread-safe container (such as the lock-free containers in Boost.Lockfree or libcds).
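For comparison, the conventional locked version would hold a mutex only for the duration of the copy, so the reader then iterates over a private snapshot. A sketch, where DataMutex is an assumed addition guarding DataVec:

std::mutex DataMutex; // assumed addition: guards all access to DataVec

void ReadThreadFunc()
{
    std::vector<Data> temp;
    {
        std::lock_guard<std::mutex> lock(DataMutex); // lock held only while copying
        temp = DataVec;
    }
    for (auto& data : temp)
    {
        // Do stuff with the data; the producer may modify DataVec meanwhile
    }
}

ModifyThreadFunc would need to take the same lock around each modification of DataVec.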

I have a vector that is modified in one thread, and I need to use its contents in another. Locking between these threads is unacceptable due to performance requirements.
This is an impossible requirement to meet.
In any case, sharing data between two threads requires some kind of locking, be it explicit or provided by the implementation (eventually by the hardware). Re-examine your actual requirements: it may be unacceptable to suspend one thread until the other one ends, but you could still lock around short sequences of instructions. And/or possibly use a different architecture. For example, erasing an item in a vector is a costly operation (linear time, because all the data above the removed item has to be moved), while marking it as invalid is much quicker (constant time, because it is a single write). If you really have to erase in the middle of a vector, maybe a list would be more appropriate.
But if you can put a locking exclusion around the copy of the vector in ReadThreadFunc and around any vector modification in ModifyThreadFunc, it could be enough. To give a priority to the modifying thread, you could just try to lock in the other thread and immediately give up if you cannot.
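The try-and-give-up idea from that last sentence might look like this (a sketch, reusing the DataMutex assumed above):

void ReadThreadFunc()
{
    if (!DataMutex.try_lock())
        return; // the modifying thread wins; skip this read rather than block it
    std::vector<Data> temp = DataVec;
    DataMutex.unlock();
    // iterate over temp as before
}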

Maybe you should rethink your design!
Each thread should have its own vector (list, queue, whatever fits your needs) to work on. Thread A can then do some work and pass the result to thread B. You simply have to lock when writing the data from thread A into thread B's queue.
Without some kind of locking it's not possible.
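A bare-bones sketch of that hand-off (the queue and mutex names are my assumptions): thread A collects results locally and takes the lock only to append them to thread B's input queue.

std::vector<Data> inboxB;   // thread B's input queue
std::mutex inboxMutexB;     // held only for the brief hand-off

void handOffFromA(const std::vector<Data>& localResults)
{
    std::lock_guard<std::mutex> lock(inboxMutexB);
    inboxB.insert(inboxB.end(), localResults.begin(), localResults.end());
}

void drainInB()
{
    std::vector<Data> work;
    {
        std::lock_guard<std::mutex> lock(inboxMutexB);
        work.swap(inboxB); // take everything, then release the lock immediately
    }
    // process work without holding any lock
}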

So I solved this using double buffering, which guarantees no crashing, and the reading thread will always have usable data, even if it might not be correct:
struct Data
{
    int A;
    double B;
    bool C;
};

const int MAXSIZE = 100;
Data Buffer[MAXSIZE];
std::vector<Data> DataVec;

void ModifyThreadFunc()
{
    // Here the vector is changed, which includes adding and erasing elements
    ...

    // Copy from the vector to the buffer (valid because Data is trivially copyable)
    size_t numElements = DataVec.size();
    memcpy(Buffer, DataVec.data(), sizeof(Data) * numElements);
    memset(&Buffer[numElements], 0, sizeof(Data) * (MAXSIZE - numElements));
}

void ReadThreadFunc()
{
    Data* p = Buffer;
    for (int i = 0; i < MAXSIZE; ++i)
    {
        // Use the data
        ...
        ++p;
    }
}


Custom allocators as alternatives to vector of smart pointers?

This question is about owning pointers, consuming pointers, smart pointers, vectors, and allocators.
I am a little bit lost in my thoughts about code architecture. Furthermore, if this question already has an answer somewhere, 1. sorry, but I haven't found a satisfying answer so far and 2. please point me to it.
My problem is the following:
I have several "things" stored in a vector and several "consumers" of those "things". So, my first try was like follows:
std::vector<thing> i_am_the_owner_of_things;

thing* get_thing_for_consumer() {
    // some thing-selection logic
    return &i_am_the_owner_of_things[5]; // 5 is just an example
}

...

// somewhere else in the code:
class consumer {
    consumer() {
        m_thing = get_thing_for_consumer();
    }

    thing* m_thing;
};
In my application, this would be safe because the "things" outlive the "consumers" in any case. However, more "things" can be added during runtime and that can become a problem because if the std::vector<thing> i_am_the_owner_of_things; gets reallocated, all the thing* m_thing pointers become invalid.
A fix to this scenario would be to store unique pointers to "things" instead of "things" directly, i.e. like follows:
std::vector<std::unique_ptr<thing>> i_am_the_owner_of_things;

thing* get_thing_for_consumer() {
    // some thing-selection logic
    return i_am_the_owner_of_things[5].get(); // 5 is just an example
}

...

// somewhere else in the code:
class consumer {
    consumer() {
        m_thing = get_thing_for_consumer();
    }

    thing* m_thing;
};
The downside here is that memory coherency between "things" is lost. Can this memory coherency be re-established by using custom allocators somehow? I am thinking of something like an allocator which would always allocate memory for, e.g., 10 elements at a time and whenever required, adds more 10-elements-sized chunks of memory.
Example:
initially:
v = ☐☐☐☐☐☐☐☐☐☐
more elements:
v = ☐☐☐☐☐☐☐☐☐☐ 🡒 ☐☐☐☐☐☐☐☐☐☐
and again:
v = ☐☐☐☐☐☐☐☐☐☐ 🡒 ☐☐☐☐☐☐☐☐☐☐ 🡒 ☐☐☐☐☐☐☐☐☐☐
Using such an allocator, I wouldn't even have to use std::unique_ptrs of "things" because at std::vector's reallocation time, the memory addresses of the already existing elements would not change.
As alternative, I can only think of referencing the "thing" in "consumer" via a std::shared_ptr<thing> m_thing, as opposed to the current thing* m_thing but that seems like the worst approach to me, because a "thing" shall not own a "consumer" and with shared pointers I would create shared ownership.
So, is the allocator-approach a good one? And if so, how can it be done? Do I have to implement the allocator by myself or is there an existing one?
If you are able to treat thing as a value type, do so. It simplifies things; you don't need a smart pointer to circumvent the pointer/reference invalidation issue. The latter can be tackled differently:
If new thing instances are inserted via push_front and push_back during the program, use std::deque instead of std::vector. Then no pointers or references to elements in this container are invalidated (iterators are invalidated, though - thanks to @odyss-jii for pointing that out; see the sketch below). If you fear that you heavily rely on the performance benefit of the completely contiguous memory layout of std::vector: create a benchmark and profile.
If new thing instances are inserted in the middle of the container during the program, consider using std::list. No pointers/iterators/references are invalidated when inserting or removing container elements. Iteration over a std::list is much slower than a std::vector, but make sure this is an actual issue in your scenario before worrying too much about that.
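To illustrate the std::deque point with a self-contained sketch (the thing type here is a stand-in):

#include <cassert>
#include <deque>

struct thing { int value; };

int main()
{
    std::deque<thing> things;
    things.push_back({1});
    thing* p = &things.front(); // hand out a pointer to an element

    for (int i = 2; i <= 1000; ++i)
        things.push_back({i});  // grows the deque; p is NOT invalidated

    assert(p->value == 1);      // still valid; a std::vector might have reallocated
}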
There is no single right answer to this question, since it depends a lot on the exact access patterns and desired performance characteristics.
Having said that, here is my recommendation:
Continue storing the data contiguously as you are, but do not store aliasing pointers to that data. Instead, consider a safer alternative (this is a proven method) where you fetch the pointer based on an ID right before using it -- as a side-note, in a multi-threaded application you can lock attempts to resize the underlying store whilst such a weak reference lives.
So your consumer will store an ID, and will fetch a pointer to the data from the "store" on demand. This also gives you control over all "fetches", so that you can track them, implement safety measures, etc.
void consumer::foo() {
    thing *t = m_thing_store.get(m_thing_id);
    if (t) {
        // do something with t
    }
}
Or more advanced alternative to help with synchronization in multi-threaded scenario:
void consumer::foo() {
    reference<thing> t = m_thing_store.get(m_thing_id);
    if (!t.empty()) {
        // do something with t
    }
}
Where reference would be some thread-safe RAII "weak pointer".
There are multiple ways of implementing this. One option is to use an open-addressing hash table with the ID as the key; this will give you roughly O(1) access time if you balance it properly.
Another alternative (best-case O(1), worst-case O(N)) is to use a "reference" structure, with a 32-bit ID and a 32-bit index (so same size as 64-bit pointer) -- the index serves as a sort-of cache. When you fetch, you first try the index, if the element in the index has the expected ID you are done. Otherwise, you get a "cache miss" and you do a linear scan of the store to find the element based on ID, and then you store the last-known index value in your reference.
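A compressed sketch of that second alternative (all names are mine, not from an existing library): the cached index is tried first, and a linear scan by ID is the slow-path fallback.

#include <cstdint>
#include <vector>

struct thing { std::uint32_t id; /* payload */ };

struct thing_ref {
    std::uint32_t id;    // stable identity
    std::uint32_t index; // last-known position, used as a cache
};

thing* fetch(std::vector<thing>& store, thing_ref& ref)
{
    // Fast path: the cached index still points at the right element.
    if (ref.index < store.size() && store[ref.index].id == ref.id)
        return &store[ref.index];
    // Slow path ("cache miss"): scan by ID and refresh the cached index.
    for (std::uint32_t i = 0; i < store.size(); ++i) {
        if (store[i].id == ref.id) {
            ref.index = i;
            return &store[i];
        }
    }
    return nullptr; // the thing was removed
}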
IMO the best approach would be to create a new container which behaves in a safe way.
Pros:
the change is made at a separate level of abstraction
changes to old code will be minimal (just replace std::vector with the new container)
it is the "clean code" way to do it
Cons:
it may look like there is a bit more work to do
Another answer proposes using std::list, which will do the job, but with a larger number of allocations and slower random access. So IMO it is better to compose your own container from a couple of std::vectors.
So it may start to look more or less like this (minimal example):
template<typename T>
class cluster_vector
{
public:
    static constexpr size_t cluster_size = 16;

    cluster_vector() {
        clusters.reserve(1024);
        add_cluster();
    }

    ...

    size_t size() const {
        if (clusters.empty()) return 0;
        return (clusters.size() - 1) * cluster_size + clusters.back().size();
    }

    T& operator[](size_t index) {
        throwIfIndexTooBig(index);
        return clusters[index / cluster_size][index % cluster_size];
    }

    void push_back(T&& x) {
        if_last_is_full_add_cluster();
        clusters.back().push_back(std::forward<T>(x));
    }

private:
    void throwIfIndexTooBig(size_t index) const {
        if (index >= size()) {
            throw std::out_of_range("cluster_vector out of range");
        }
    }

    void add_cluster() {
        clusters.push_back({});
        clusters.back().reserve(cluster_size);
    }

    void if_last_is_full_add_cluster() {
        if (clusters.back().size() == cluster_size) {
            add_cluster();
        }
    }

private:
    std::vector<std::vector<T>> clusters;
};
This way you will provide a container which does not reallocate items. It doesn't matter what T does.
[A shared pointer] seems like the worst approach to me, because a "thing" shall not own a "consumer" and with shared pointers I would create shared ownership.
So what? Maybe the code is a little less self-documenting, but it will solve all your problems.
(And by the way you are muddling things by using the word "consumer", which in a traditional producer/consumer paradigm would take ownership.)
Also, returning a raw pointer in your current code is already entirely ambiguous as to ownership. In general, I'd say it's good practice to avoid raw pointers if you can (e.g. so you don't need to call delete). I would return a reference if you go with unique_ptr:
std::vector<std::unique_ptr<thing>> i_am_the_owner_of_things;

thing& get_thing_for_consumer() {
    // some thing-selection logic
    return *i_am_the_owner_of_things[5]; // 5 is just an example
}

C++ concurrent writes to array (not std::vector) of bools

I'm using C++11 and I'm aware that concurrent writes to std::vector<bool> someArray are not thread-safe due to the specialization of std::vector for bools.
I'm trying to find out if writes to bool someArray[2048] have the same problem:
Suppose all entries in someArray are initially set to false.
Suppose I have a bunch of threads that write at different indices in someArray. In fact, these threads only set different array entries from false to true.
Suppose I have a reader thread that at some point acquires a lock, triggering a memory fence operation.
Q: Will the reader see all the writes to someArray that occurred before the lock was acquired?
Thanks!
You should use std::array<bool, 2048> someArray, not bool someArray[2048];. If you're in C++11-land, you'll want to modernize your code as much as you are able.
std::array<bool, N> is not specialized in the same way that std::vector<bool> is, so there are no concerns there in terms of raw safety.
As for your actual question:
Will the reader see all the writes to someArray that occurred before the lock was acquired?
Only if the writers to the array also interact with the lock, either by releasing it at the time that they finish writing, or else by updating a value associated with the lock that the reader then synchronizes with. If the writers never interact with the lock, then the data that will be retrieved by the reader is undefined.
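Concretely, the minimal arrangement that gives the reader that guarantee looks something like this (a sketch; the mutex stands in for whatever lock the question has in mind):

#include <array>
#include <mutex>

std::array<bool, 2048> someArray{};
std::mutex m;

void writerThread(std::size_t index)
{
    std::lock_guard<std::mutex> lock(m); // the unlock at scope exit is a release
    someArray[index] = true;
}

void readerThread()
{
    std::lock_guard<std::mutex> lock(m); // acquire: synchronizes with prior unlocks
    // reads of someArray here see every write made before those unlocks
}

Note that this serializes the writers, which is exactly the cost the question hopes to avoid; the release/acquire discussion further down covers the atomics-based alternative.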
One thing you'll also want to bear in mind: while it's not unsafe to have multiple threads write to the same array, provided that they are all writing to unique memory addresses, writing can be slowed pretty dramatically by cache interactions (false sharing, where writes from different threads land on the same cache line). For example:
#include <array>
#include <thread>
#include <vector>

void func_a() {
    std::array<bool, 2048> someArray{};
    std::vector<std::thread> writers;
    for (int i = 0; i < 8; i++) {
        // Each thread writes one contiguous 256-element block: no false sharing.
        writers.emplace_back([i, &someArray]{
            for (size_t index = i * 256; index < (i + 1) * 256; index++)
                someArray[index] = true;
        });
    }
    for (auto& writer : writers)
        writer.join(); // the synchronization point; detaching would let someArray die too early
}

void func_b() {
    std::array<bool, 2048> someArray{};
    std::vector<std::thread> writers;
    for (int i = 0; i < 8; i++) {
        // Threads interleave their writes: adjacent writes from different threads
        // share cache lines, causing heavy cache-line ping-pong.
        writers.emplace_back([i, &someArray]{
            for (size_t index = i; index < 2048; index += 8)
                someArray[index] = true;
        });
    }
    for (auto& writer : writers)
        writer.join();
}
The details are going to vary depending on the underlying hardware, but in nearly all situations, func_a is going to be orders of magnitude faster than func_b, at least for a sufficiently large array size (2048 was chosen as an example, but it may not be representative of the actual underlying performance differences). Both functions should have the same result, but one will be considerably faster than the other.
First of all, std::vector in general is not as thread-safe as you might think; its guarantees are already stated here.
Addressing your question: the reader may not see all writes after acquiring the lock. This is because the writers may never have performed a release operation, which is required to establish a happens-before relationship between the writes and the subsequent reads. In (very) simple terms: every acquire operation (such as a mutex lock) needs a release operation to synchronize with. Every memory operation done before a release on a certain variable will be visible to any thread that acquires the same variable. See also Release-Acquire ordering.
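A minimal illustration of that release/acquire pairing (a sketch, not code from the question):

#include <array>
#include <atomic>
#include <thread>

std::array<bool, 2048> someArray{};
std::atomic<bool> done{false};

void writer()
{
    someArray[42] = true;                        // plain write
    done.store(true, std::memory_order_release); // release: publishes everything above
}

void reader()
{
    while (!done.load(std::memory_order_acquire)) // acquire: synchronizes with the store
        std::this_thread::yield();
    // someArray[42] == true is now guaranteed to be visible here
}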
One important thing to note is that on x86/x64 architectures, aligned loads and stores of register-sized values are atomic at the hardware level. However, that alone is not enough: in the C++ memory model, unsynchronized concurrent access is still a data race, and declaring the array volatile does not fix that; volatile prevents some compiler caching but provides no inter-thread ordering or visibility guarantees. Use std::atomic (or a lock) instead.

Performance issues in joining threads

I wrote the following parallel code for examining all elements in a vector of vectors. I store only those elements from the vector<vector<int> > which satisfy a given condition. However, my problem is that some of the vectors within vector<vector<int> > are pretty large while others are pretty small, due to which my code takes a long time in thread.join(). Can someone please suggest how I can improve the performance of my code?
void check_if_condition(vector<int>& a, vector<int>& satisfyingElements)
{
    for (vector<int>::iterator i1 = a.begin(), l1 = a.end(); i1 != l1; ++i1)
        if (some_check_condition(*i1))
            satisfyingElements.push_back(*i1);
}

void doWork(std::vector<vector<int> >& myVec, std::vector<vector<int> >& results, size_t current, size_t end)
{
    end = std::min(end, myVec.size());
    for (; current < end; ++current) {
        vector<int> satisfyingElements;
        check_if_condition(myVec[current], satisfyingElements);
        if (!satisfyingElements.empty()) {
            results[current] = satisfyingElements;
        }
    }
}

int main()
{
    std::vector<std::vector<int> > myVec(1000000);
    std::vector<std::vector<int> > results(myVec.size());
    unsigned numparallelThreads = std::thread::hardware_concurrency();

    std::vector<std::thread> parallelThreads;
    auto blockSize = myVec.size() / numparallelThreads;
    for (size_t i = 0; i < numparallelThreads - 1; ++i) {
        parallelThreads.emplace_back(doWork, std::ref(myVec), std::ref(results), i * blockSize, (i + 1) * blockSize);
    }

    // also do work in this thread
    doWork(myVec, results, (numparallelThreads - 1) * blockSize, myVec.size());

    for (auto& thread : parallelThreads)
        thread.join();

    std::vector<int> storage;
    auto itRes = results.begin();
    auto itmyVec = myVec.begin();
    auto endRes = results.end();
    for (; itRes != endRes; ++itRes, ++itmyVec) {
        if (!(*itRes).empty())
            storage.insert(storage.end(), (*itRes).begin(), (*itRes).end());
    }
    std::cout << "Done" << std::endl;
}
It would be nice to see the scale of those "large" inner vectors, just to see how bad the problem is.
I think, however, that your problem is this:
for (auto& thread : parallelThreads)
    thread.join();
This goes through all the threads sequentially and waits until each finishes before looking at the next one. For a thread pool, you want to wait until every thread is done. This can be done by using a condition_variable for each thread to finish. Before they finish, they have to notify the condition_variable, on which you can wait.
Looking at your implementation, the bigger issue here is that your worker threads are not balanced in their workload.
To get a more balanced load on all of your threads, you need to flatten your data structure, so the different worker threads can process relatively similar-sized chunks of data. I am not sure where your data is coming from, but having a vector of vectors in an application that is dealing with large data sets doesn't sound like a great idea. Either process the existing vector of vectors into a single one, or read the data in like that if possible. If you need the row number for your processing, you can keep a vector of start-end ranges from which you can find your row number.
Once you have a single big vector, you can break it down into equal-sized chunks to feed to the worker threads. Second, you don't want to build vectors on the stack and push them into another vector because, chances are, you will run into memory-allocation issues while your threads are working. Allocating memory is a global state change and as such will require some level of locking (with proper address partitioning it could be avoided, though). As a rule of thumb, whenever you are looking for performance, you should remove dynamic allocation from performance-critical parts.
In this case, perhaps your threads would rather "mark" elements as satisfying the condition, rather than building vectors of the satisfying elements. And once that's done, you can iterate through only the good ones without pushing and copying anything. Such a solution would be less wasteful.
In fact, if I were you, I would first try to solve this issue on a single thread, applying the suggestions above. If you get rid of the vector-of-vectors structure and iterate through elements conditionally (this might be as simple as using one of the xxx_if algorithms the C++11 standard library provides, as sketched below), you could end up with decent enough performance. Only at that point is it worth looking at delegating chunks of this work to worker threads. As it stands, there is very little justification in your code to use worker threads just to filter elements. Do as little writing and moving as you can, and you gain a lot of performance. Parallelization only works well in certain circumstances.
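For reference, the single-threaded baseline being suggested could be as small as this (a sketch reusing some_check_condition from the question):

#include <algorithm>
#include <iterator>
#include <vector>

std::vector<int> filter(const std::vector<std::vector<int>>& myVec)
{
    std::vector<int> storage;
    for (const auto& row : myVec)
        std::copy_if(row.begin(), row.end(), std::back_inserter(storage),
                     some_check_condition); // predicate from the question
    return storage;
}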

Ring Allocator For Lockfree Update of Member Variable?

I have a class that stores the latest value of some incoming realtime data (around 150 million events/second).
Suppose it looks like this:
class DataState
{
    Event latest_event;

public:
    // pushes event atomically
    void push_event(const Event* __restrict__ e);
    // pulls event atomically
    Event pull_event();
};
I need to be able to push events atomically and pull them with strict ordering guarantees. Now, I know I can use a spinlock, but given the massive event rate (over 100 million/second) and high degree of concurrency I'd prefer to use lockfree operations.
The problem is that Event is 64 bytes in size. There is no CMPXCHG64B instruction on any current X86 CPU (as of August '16). So if I use std::atomic<Event> I'd have to link to libatomic which uses mutexes under the hood (too slow).
So my solution was to instead atomically swap pointers to the value. Problem is dynamic memory allocation becomes a bottleneck with these event rates. So... I define something I call a "ring allocator":
/// @brief Lockfree static short-lived allocator used for a ringbuffer
/// Elements are guaranteed to persist only for "size" calls to get_next()
template<typename T> class RingAllocator {
    T *arena;
    std::atomic_size_t arena_idx;
    const std::size_t arena_size;

public:
    /// @brief Creates a new RingAllocator
    /// @param size The number of elements in the underlying arena. Make this large enough to avoid overwriting fresh data
    RingAllocator(std::size_t size) : arena_size(size)
    {
        // allocate pool
        arena = new T[size];
        // zero out pool
        std::memset(arena, 0, sizeof(T) * size);
        arena_idx = 0;
    }

    ~RingAllocator()
    {
        delete[] arena;
    }

    /// @brief Return next element's pointer. Thread-safe
    /// @return pointer to next available element
    T *get_next()
    {
        return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
    }
};
Then I could have my DataState class look like this:
class DataState
{
    std::atomic<Event*> latest_event;
    RingAllocator<Event> event_allocator;

public:
    // pushes event atomically
    void push_event(const Event* __restrict__ e)
    {
        // store event
        Event *new_ptr = event_allocator.get_next();
        *new_ptr = *e;
        // swap event pointers
        latest_event.store(new_ptr, std::memory_order_release);
    }

    // pulls event atomically
    Event pull_event()
    {
        return *(latest_event.load(std::memory_order_acquire));
    }
};
As long as I size my ring allocator to the max # of threads that may concurrently call the functions, there's no risk of overwriting data that pull_event could return. Plus everything's super localized so indirection won't cause bad cache performance. Any possible pitfalls with this approach?
The DataState class:
I thought it was going to be a stack or queue, but it isn't, so push / pull don't seem like good names for methods. (Or else the implementation is totally bogus).
It's just a latch that lets you read the last event that any thread stored.
There's nothing to stop two writes in a row from overwriting an element that's never been read. There's also nothing to stop you reading the same element twice.
If you just need somewhere to copy small blocks of data, a ring buffer does seem like a decent approach. But if you don't want to lose events, I don't think you can use it this way. Instead, just get a ring buffer entry, then copy to it and use it there. So the only atomic operation should be incrementing the ring buffer position index.
The ring buffer
You can make get_next() much more efficient. This line does an atomic post-increment (fetch_add) and an atomic exchange:
return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
I'm not even sure it's safe, because the xchg can maybe step on the fetch_add from another thread. Anyway, even if it's safe, it's not ideal.
You don't need that. Make sure the arena_size is always a power of 2, then you don't need to modulo the shared counter. You can just let it go, and have every thread modulo it for their own use. It will eventually wrap, but it's a binary integer so it will wrap at a power of 2, which is a multiple of your arena size.
I'd suggest storing an AND-mask instead of a size, so there's no risk of the % compiling to anything other than an and instruction, even if it's not a compile-time constant. This makes sure we avoid a 64-bit integer div instruction.
template<typename T> class RingAllocator {
    T *arena;
    std::atomic_size_t arena_idx;
    const std::size_t size_mask; // maybe even make this a template parameter?

public:
    RingAllocator(std::size_t size)
        : arena_idx(0), size_mask(size - 1)
    {
        // verify that size is actually a power of two, so the mask is all-ones
        // in the low bits and all-zeros in the high bits,
        // so that i % size == i & size_mask for all i
        ...
    }

    ...

    T *get_next() {
        size_t idx = arena_idx.fetch_add(1, std::memory_order_relaxed); // still atomic, but we don't care which order different threads take blocks in
        idx &= size_mask; // modulo our local copy of the idx
        return &arena[idx];
    }
};
Allocating the arena would be more efficient if you used calloc instead of new + memset. The OS already zeros pages before giving them to user-space processes (to prevent information leakage), so writing them all is just wasted work.
arena = new T[size];
std::memset(arena, 0, sizeof(T) * size);
// vs.
arena = (T*)calloc(size, sizeof(T));
Writing the pages yourself does fault them in, so they're all wired to real physical pages, instead of just copy-on-write mappings for a system-wide shared physical zero page (like they are after new/malloc/calloc). On a NUMA system, the physical page chosen might depend on which thread actually touched the page, rather than which thread did the allocation. But since you're reusing the pool, the first core to write a page might not be the one that ends up using it most.
Maybe something to look for in microbenchmarks / perf counters.
As long as I size my ring allocator to the max # of threads that may concurrently call the functions, there's no risk of overwriting data that pull_event could return. .... Any possible pitfalls with this approach?
The pitfall is, IIUC, that your statement is wrong.
If I have just 2 threads, and 10 elements in the ring buffer, the first thread could call pull_event once, and be "mid-pulling", and then the second thread could call push 10 times, overwriting what thread 1 is pulling.
Again, assuming I understand your code correctly.
Also, as mentioned above,
return &arena[arena_idx.exchange(arena_idx++ % arena_size)];
that arena_idx++ inside the exchange on the same variable, just looks wrong. And in fact is wrong. Two threads could increment it - ThreadA increments to 8 and threadB increments to 9, and then threadB exchanges it to 9, then threadA exchanges it to 8. whoops.
atomic(op1) # atomic(op2) != atomic(op1 # op2)
I worry about what else is wrong in code not shown. I don't mean that as an insult - lock-free is just not easy.
Have you looked at any of the C++ ports of the (Java) Disruptor that are available?
disruptor--
disruptor
Although they are not complete ports, they may offer all that you need. I am currently working on a more fully featured port, but it's not quite ready.

How can I use something like std::vector<std::mutex>?

I have a large, but potentially varying, number of objects which are concurrently written to. I want to protect that access with mutexes. To that end, I thought I'd use a std::vector<std::mutex>, but this doesn't work, since std::mutex has no copy or move constructor, while std::vector::resize() requires that.
What is the recommended solution to this conundrum?
edit:
Do all C++ random-access containers require copy or move constructors for re-sizing? Would std::deque help?
edit again
First, thanks for all your thoughts. I'm not interested in solutions that avoid mutices and/or move them into the objects (I refrain from giving details/reasons). So given the problem that I want an adjustable number of mutices (where the adjustment is guaranteed to occur when no mutex is locked), there appear to be several solutions.
1 I could use a fixed number of mutices and use a hash-function to map from objects to mutices (as in Captain Oblivous's answer). This will result in collisions, but the number of collisions should be small if the number of mutices is much larger than the number of threads, but still smaller than the number of objects.
2 I could define a wrapper class (as in ComicSansMS's answer), e.g.
struct mutex_wrapper : std::mutex
{
    mutex_wrapper() = default;
    mutex_wrapper(mutex_wrapper const&) noexcept : std::mutex() {}
    bool operator==(mutex_wrapper const& other) const noexcept { return this == &other; }
};
and use a std::vector<mutex_wrapper>.
3 I could use std::unique_ptr<std::mutex> to manage individual mutexes (as in Matthias's answer). The problem with this approach is that each mutex is individually allocated and de-allocated on the heap. Therefore, I prefer
4 std::unique_ptr<std::mutex[]> mutices( new std::mutex[n_mutex] );
when a certain number n_mutex of mutices is allocated initially. Should this number later be found insufficient, I simply
if (need_mutex > n_mutex) {
    mutices.reset( new std::mutex[need_mutex] );
    n_mutex = need_mutex;
}
So which of these (1,2,4) should I use?
vector requires that the values are movable, in order to maintain a contiguous array of values as it grows. You could create a vector containing mutexes, but you couldn't do anything that might need to resize it.
Other containers don't have that requirement; either deque or [forward_]list should work, as long as you construct the mutexes in place either during construction, or by using emplace() or resize(). Functions such as insert() and push_back() will not work.
Alternatively, you could add an extra level of indirection and store unique_ptr; but your comment in another answer indicates that you believe the extra cost of dynamic allocation to be unacceptable.
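A sketch of the deque variant mentioned above (growing only with emplace_back, which constructs in place and leaves existing elements untouched):

#include <deque>
#include <mutex>

std::deque<std::mutex> mutexes;

void grow_to(std::size_t n)
{
    // Only call this while no mutex in the deque is locked.
    while (mutexes.size() < n)
        mutexes.emplace_back(); // constructs in place; existing mutexes stay valid
}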
If you want to create a vector of a certain length:
std::vector<std::mutex> mutexes;
...
size_t count = 4;
std::vector<std::mutex> list(count);
mutexes.swap(list);
You could use std::unique_ptr<std::mutex> instead of std::mutex. unique_ptrs are movable.
I suggest using a fixed mutex pool. Keep a fixed array of std::mutex and select which one to lock based on the address of the object like you might do with a hash table.
std::array<std::mutex, 32> mutexes;
std::mutex &m = mutexes[hashof(objectPtr) % mutexes.size()];
m.lock();
The hashof function could be something simple that shifts the pointer value over a few bits. This way you only have to initialize the mutexes once and you avoid the copy of resizing the vector.
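A possible hashof along those lines (an assumption, not part of the answer): shift away the low bits, which are identical for aligned objects, so that neighbouring objects land on different mutexes.

#include <cstddef>
#include <cstdint>

std::size_t hashof(const void* p)
{
    // The low bits of an aligned pointer are always zero; discard them.
    return reinterpret_cast<std::uintptr_t>(p) >> 4;
}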
If efficiency is such a problem, I assume that you have only very small data structures which are changed very often. It is then probably better to use Atomic Compare And Swap (and other atomic operations) instead of using mutexes, specifically std::atomic_compare_exchange_strong
I'll sometimes use a solution along the lines of your 2nd option when I want a std::vector of classes or structs that each have their own std::mutex. Of course, it is a bit tedious as I write my own copy/move/assignment operators.
struct MyStruct {
    MyStruct() : value1(0), value2(0) {}

    MyStruct(const MyStruct& other) {
        std::lock_guard<std::mutex> l(other.mutex);
        value1 = other.value1;
        value2 = other.value2;
    }

    MyStruct(MyStruct&& other) {
        std::lock_guard<std::mutex> l(other.mutex);
        value1 = std::exchange(other.value1, 0);
        value2 = std::exchange(other.value2, 0);
    }

    MyStruct& operator=(MyStruct&& other) {
        // lock both mutexes together to avoid deadlock between two concurrent movers
        std::lock(mutex, other.mutex);
        std::lock_guard<std::mutex> l1(mutex, std::adopt_lock), l2(other.mutex, std::adopt_lock);
        std::swap(value1, other.value1);
        std::swap(value2, other.value2);
        return *this;
    }

    MyStruct& operator=(const MyStruct& other) {
        // you get the idea
    }

    int value1;
    double value2;
    mutable std::mutex mutex;
};
You don't need to "move" the std::mutex. You just need to hold a lock on it while you "move" everything else.
How about declaring each mutex as a pointer?
std::vector<std::mutex*> my_mutexes(10);

// Initialize mutexes
for (int i = 0; i < 10; ++i) my_mutexes[i] = new std::mutex();

// Release mutexes
for (int i = 0; i < 10; ++i) delete my_mutexes[i];