I have a C++11 program that does some computations and uses a std::unordered_map to cache results of those computations. The program uses multiple threads and they use a shared unordered_map to store and share the results of the computations.
Based on my reading of unordered_map and STL container specs, as well as unordered_map thread safety, it seems that an unordered_map, shared by multiple threads, can handle one thread writing at a time, but many readers at a time.
Therefore, I'm using a std::mutex to wrap my insert() calls to the map, so that at most one thread is inserting at a time.
However, my find() calls do not have a mutex as, from my reading, it seems that many threads should be able to read at once. However, I'm occasionally getting data races (as detected by TSAN), manifesting themselves in a SEGV. The data race clearly points to the insert() and find() calls that I mentioned above.
When I wrap the find() calls in a mutex, the problem goes away. However, I don't want to serialize the concurrent reads, as I'm trying to make this program as fast as possible. (FYI: I'm running using gcc 5.4.)
Why is this happening? Is my understanding of the concurrency guarantees of std::unordered_map incorrect?
You still need a mutex for your readers to keep the writers out, but you need a shared one. C++14 has a std::shared_timed_mutex that you can use along with scoped locks std::unique_lock and std::shared_lock like this:
using mutex_type = std::shared_timed_mutex;
using read_only_lock = std::shared_lock<mutex_type>;
using updatable_lock = std::unique_lock<mutex_type>;
mutex_type mtx;
std::unordered_map<int, std::string> m;
// code to update map
{
updatable_lock lock(mtx);
m[1] = "one";
}
// code to read from map
{
read_only_lock lock(mtx);
std::cout << m.at(1) << '\n'; // at() never inserts; operator[] is non-const and could modify the map under a shared lock
}
There are several problems with that approach.
First, std::unordered_map has two overloads of find: one const and one non-const. I don't believe the non-const version of find mutates the map, but calling a non-const method from multiple threads is a red flag, and compilers are entitled to exploit undefined behavior for aggressive optimizations. (The standard does list find among the container functions treated as const for data-race purposes, so this is a precaution rather than a strict requirement.)
So the first thing to do is make sure that when multiple threads invoke std::unordered_map::find, they use the const version. You can achieve that by referring to the map through a const reference and invoking find on that.
Second, you're missing the part that while many threads may invoke the const find on your map, no other thread may invoke a non-const method on it at the same time! I can definitely imagine many threads calling find and some calling insert at the same time, causing a data race. Imagine, for example, that insert makes the map rehash and reallocate its bucket array while another thread is walking it to find the wanted pair.
A solution is to use C++14's std::shared_timed_mutex (or C++17's std::shared_mutex), which has exclusive and shared locking modes. When a thread calls find, it takes the lock in shared mode; when a thread calls insert, it takes it in exclusive mode.
If your compiler does not support these, you can use platform-specific synchronization objects, like pthread_rwlock_t on Linux and SRWLock on Windows.
Another possibility is to use a concurrent hash map, like tbb::concurrent_hash_map from Intel's Threading Building Blocks library or concurrency::concurrent_unordered_map in the MSVC Concurrency Runtime. These containers handle synchronization internally, so the operations they document as concurrency-safe can be called from several threads without an external lock.
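To make the shared/exclusive locking advice concrete, here is a minimal sketch of a result cache, assuming C++14. The class name ResultCache and its lookup/store members are hypothetical and not from the question. The lookup is a const member, so it can only reach the const overload of find.

```cpp
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical memoizing cache: many concurrent readers, one writer at a time.
class ResultCache {
public:
    // Returns true and fills `out` if `key` is cached.
    bool lookup(int key, std::string& out) const {
        std::shared_lock<std::shared_timed_mutex> lock(mtx_);  // shared (read) mode
        auto it = map_.find(key);  // const overload, because map_ is const here
        if (it == map_.end()) return false;
        out = it->second;          // copy while the lock is still held
        return true;
    }

    // Stores the value unless the key is already present.
    void store(int key, const std::string& value) {
        std::unique_lock<std::shared_timed_mutex> lock(mtx_);  // exclusive (write) mode
        map_.emplace(key, value);  // emplace leaves an existing entry untouched
    }

private:
    mutable std::shared_timed_mutex mtx_;
    std::unordered_map<int, std::string> map_;
};
```

The value is copied out under the lock rather than returned by reference or iterator, because a later insert may rehash the map once the lock has been released.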
I was hoping someone could advise on how multiple threads can write to a common container (e.g. a map) using Boost and C++, in a scenario where some threads might share the same key.
The map might be of type std::map, with different threads accessing the object to modify different data members. Would each thread wait upon hitting unique_lock for the current thread to finish before proceeding?
Would it be as simple as each thread entering a critical section, as in this example:
//somewhere within the code
boost::shared_mutex mutex;
void modifyMap(const std::string& key, const unsigned int dataX,
               const unsigned int dataY)
{
    // each thread blocks here until it gets exclusive access
    boost::unique_lock<boost::shared_mutex> lock(mutex);
    // exclusive access from here on, so no data race on the map
    auto it = m_map.find(key);
    if (it == m_map.end()) return; // key not present: nothing to modify
    it->second.setDataX(dataX);
    it->second.setDataY(dataY);
}
thanks in advance
You should create a thread-safe implementation of a data structure. It can be either lock-based (for example implemented by using mutexes) or lock-free (using atomic operations or memory orderings which are supported in C++11 and boost).
I can briefly describe the lock-based approach. For example, you may want to design a thread-safe linked list. If your threads perform only read operations everything is safe. On the other hand if you try to write to this data structure you might need a previous and a next node pointers in the list (if its double-linked you need to update their pointers to point to the inserted node) and while you modify them some other thread might read the incorrect pointer data so you need a lock on the two nodes between which you want to insert your new node. This creates serialization (other threads wait for mutex to be unlocked) and reduces the potential for concurrency.
The full example with a lookup table is available in the book "C++ Concurrency in Action: Practical Multithreading" by Anthony Williams, page 171, listing 6.11. The book is a good starting point for multithreaded programming with the latest C++ standard, as its author also designed both the boost::thread and C++11 thread libraries.
Update: to make your example work for read/write access (if you need more operations, you need to protect them too), you're better off using boost::shared_mutex, which allows multiple-reader, single-writer access: if one thread wants to write, it acquires an exclusive lock and all other threads have to wait. Here's some code:
template <typename mapType>
class threadSafeMap {
    boost::shared_mutex map_mutex;
    mapType m_map; // held by value, so no manual new/delete is needed
public:
    void modifyMap(const std::string& key, const unsigned int dataX,
                   const unsigned int dataY)
    {
        // Exclusive access: all other readers and writers wait.
        // (Standard equivalents: std::shared_timed_mutex in C++14, std::shared_mutex in C++17.)
        boost::lock_guard<boost::shared_mutex> lck(map_mutex);
        typename mapType::iterator it = m_map.find(key);
        if (it == m_map.end()) return; // unknown key: nothing to modify
        it->second.setDataX(dataX);
        it->second.setDataY(dataY);
    }
    int getValueByKey(const std::string& key)
    {
        // Shared access: other readers proceed; a writer waits until all shared locks are released.
        boost::shared_lock<boost::shared_mutex> lck(map_mutex);
        // at() throws std::out_of_range for a missing key;
        // getDataX() is an assumed accessor on the mapped type.
        return m_map.at(key).getDataX();
    }
};
Lock guard objects unlock the mutex when they are destroyed at the end of their scope. Replace the mapType template parameter with your own map type.
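The point about guards unlocking at the end of their lifetime also holds when the scope is left by an exception. The sketch below makes that observable with std::lock_guard, which accepts any type providing lock() and unlock(). TracingMutex is a hypothetical stand-in that only records its state. It is not a real mutex and provides no synchronization.

```cpp
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in that only records whether it is "locked".
// It is NOT a real mutex; it exists to make the unlock observable.
struct TracingMutex {
    bool locked = false;
    void lock() { locked = true; }
    void unlock() { locked = false; }
};

// Returns true if the guard released `m` even though the scope exited by exception.
bool releasedAfterThrow(TracingMutex& m) {
    try {
        std::lock_guard<TracingMutex> guard(m);  // calls m.lock()
        throw std::runtime_error("failure inside the critical section");
    } catch (const std::runtime_error&) {
        // the guard's destructor already ran during stack unwinding
    }
    return !m.locked;
}
```

With manual lock()/unlock() calls, the same exception would leave the mutex locked, which is one reason to prefer guards.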
I am quite new to multi-threading. I have a single-threaded data analysis app that has a good bit of potential for parallelization, and while the data sets are large, it does not come close to saturating the hard-disk read/write bandwidth. So I figure I should take advantage of the threading support that is now in the standard and try to speed the beast up.
After some research I decided that producer consumer was a good approach for the reading of data from the disk and processing it and I started writing an object pool that would become part of the circular buffer that will be where the producers put data and the consumers get the data. As I was writing the class it felt like I was being too fine grained in how I was handling locking and releasing data members. It feels like half the code is locking and unlocking and like there are an insane number of synchronization objects floating around.
So I come to you with a class declaration and a sample function and this question: Is this too fine-grained? Not fine grained enough? Poorly thought out?
struct PoolArray
{
public:
Obj* arr;
uint32 used;
uint32 refs;
std::mutex locker;
};
class SegmentedPool
{
public: /*Construction and destruction cut out*/
void alloc(uint32 cellsNeeded, PoolPtr& ptr);
void dealloc(PoolPtr& ptr);
void clearAll();
private:
void expand();
//stores all the segments of the pool
std::vector< PoolArray<Obj> > pools;
ReadWriteLock poolLock;
//stores pools that are empty
std::queue< int > freePools;
std::mutex freeLock;
int currentPool;
ReadWriteLock currentLock;
};
void SegmentedPool::dealloc(PoolPtr& ptr)
{
//find and access the segment
poolLock.lockForRead();
PoolArray* temp = &(pools[ptr.getSeg()]);
poolLock.unlockForRead();
//reduce the count of references in the segment
temp->locker.lock();
--(temp->refs);
//if the number of references is now zero then set the segment back to unused
//and push it onto the queue of empty segments so that it can be reused
if(temp->refs==0)
{
temp->used=0;
freeLock.lock();
freePools.push(ptr.getSeg());
freeLock.unlock();
}
temp->locker.unlock();
ptr.set(NULL,-1);
}
A few explanations:
First PoolPtr is a stupid little pointer like object that stores the pointer and the segment number in the pool that the pointer came from.
Second this is all "templatized" but i took those lines out to try to reduce the length of the code block
Third ReadWriteLock is something I put together using a mutex and a pair of condition variables.
Locks are costly under contention no matter how fine-grained they are, so avoid them wherever you can.
Both a queue and a vector can be implemented lock-free using the compare-and-swap primitive.
There are a number of papers on the topic:
Lock free queue:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.8674&rep=rep1&type=pdf
http://www.par.univie.ac.at/project/peppher/publications/Published/opodis10lfq.pdf
Lock free vector:
http://www.stroustrup.com/lock-free-vector.pdf
Stroustrup's paper also refers to a lock-free allocator, but don't jump at it right away; standard allocators are pretty good these days.
UPD
If you don't want to bother writing your own containers, use Intel's Threading Building Blocks library; it provides both a thread-safe vector and a thread-safe queue. They are NOT lock-free, but they are optimized to use the CPU cache efficiently.
UPD
Regarding PoolArray, you don't need a lock there either. If you can use C++11, use std::atomic for atomic increments and swaps; otherwise use compiler built-ins (the Interlocked* functions in MSVC and the __sync_* built-ins in gcc: http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Atomic-Builtins.html)
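As a rough illustration of the std::atomic suggestion, the sketch below shows a reference count that can be decremented without a mutex. SegmentRefs and its acquire/release members are hypothetical names, not part of the question's code.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the reference count kept in PoolArray.
struct SegmentRefs {
    std::atomic<std::uint32_t> refs{0};

    void acquire() { refs.fetch_add(1, std::memory_order_relaxed); }

    // Returns true for exactly one caller: the one that drops the count to zero.
    bool release() {
        // fetch_sub returns the value held before the decrement.
        return refs.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```

This only covers the counter. In the original dealloc, resetting used and pushing the segment onto the free queue happen together with the decrement, so those steps still need their own synchronization (or a concurrent queue).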
A good start - you lock things when needed and free them as soon as you're finished.
Your ReadWriteLock is pretty much a CCriticalSection object - depending on your needs it might improve performance to use that instead.
One thing I would say is to call temp->locker.lock() before you release the lock on the pool with poolLock.unlockForRead(); otherwise you're operating on the pool object when it's not under synchronisation control, and another thread could be using it at that point. A minor point, but with multi-threading it's the minor points that trip you up in the end.
A good approach to take to multi-threading is to wrap any controlled resources in objects or functions that do the locking and unlocking inside them, so anyone who wants access to the data doesn't have to worry about which lock to lock or unlock, and when to do it. for example:
...
if(temp->refs==0)
{
temp->used=0;
freeLock.lock();
freePools.push(ptr.getSeg());
freeLock.unlock();
}
...
would be...
...
if(temp->refs==0)
{
temp->used=0;
addFreePool(ptr.getSeg());
}
...
void SegmentedPool::addFreePool(unsigned int seg)
{
freeLock.lock();
freePools.push(seg);
freeLock.unlock();
}
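Taking the wrapping idea one step further, the queue and its lock can live in one small class, with std::lock_guard doing the unlocking. This is only a sketch: FreePoolList, add and tryTake are hypothetical names, and the original code uses plain lock()/unlock() calls instead.

```cpp
#include <mutex>
#include <queue>

// Hypothetical wrapper that owns both the queue of empty segments and its lock.
class FreePoolList {
public:
    void add(int seg) {
        std::lock_guard<std::mutex> guard(lock_);
        free_.push(seg);
    }

    // Pops the oldest free segment into `seg`; returns false if none is available.
    // The emptiness check and the pop happen under one lock acquisition.
    bool tryTake(int& seg) {
        std::lock_guard<std::mutex> guard(lock_);
        if (free_.empty()) return false;
        seg = free_.front();
        free_.pop();
        return true;
    }

private:
    std::mutex lock_;
    std::queue<int> free_;
};
```

Callers never see the mutex, so they cannot forget to unlock it or lock the wrong one.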
There are plenty of multi-threading benchmarking tools out there too. You can play around with controlling your resources in different ways, run it through one of the tools, and see where any bottlenecks are if you feel like performance is becoming an issue.
I want to implement an array-like data structure that allows multiple threads to modify/insert items simultaneously. How can I do this with good performance? I implemented a wrapper class around std::vector and used critical sections to synchronize threads. Please have a look at my code below. Each time a thread wants to work on the internal data, it may have to wait for other threads. Hence, I think its performance is NOT good. :( Any ideas?
class parallelArray{
private:
std::vector<int> data;
zLock dataLock; // my predefined class for synchronizing
public:
void insert(int val){
dataLock.lock();
data.push_back(val);
dataLock.unlock();
}
void modify(unsigned int index, int newVal){
dataLock.lock();
data[index]=newVal; // assuming that the index is valid
dataLock.unlock();
}
};
Take a look at shared_mutex in the Boost library. This allows you to have multiple readers, but only one writer
http://www.boost.org/doc/libs/1_47_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_types.shared_mutex
The best way is to use some fast reader-writer lock. You perform shared locking for read-only access and exclusive locking for writable access - this way read-only access is performed simultaneously.
In user-mode Win32 API there are Slim Reader/Writer (SRW) Locks available in Vista and later.
Before Vista you have to implement reader-writer lock functionality yourself, which is a pretty simple task. You can do it with one critical section, one event and one enum/int value. A good implementation would require more effort, though: I would use a hand-crafted linked list of local (stack-allocated) structures to implement a fair waiting queue.
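For a portable variant of the reader-writer idea, here is a sketch of the question's wrapper using std::shared_timed_mutex, assuming C++14 is available. The read member and the bounds checks are additions for illustration. Note that insert and modify both still need the exclusive lock, so this only pays off if reads dominate.

```cpp
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Sketch of the question's wrapper with a reader-writer lock.
class ParallelArray {
public:
    void insert(int val) {
        std::unique_lock<std::shared_timed_mutex> lock(mutex_);  // exclusive
        data_.push_back(val);  // may reallocate, so readers must be kept out
    }

    bool modify(std::size_t index, int newVal) {
        std::unique_lock<std::shared_timed_mutex> lock(mutex_);  // exclusive
        if (index >= data_.size()) return false;
        data_[index] = newVal;
        return true;
    }

    // Hypothetical read accessor: many threads may run this at the same time.
    bool read(std::size_t index, int& out) const {
        std::shared_lock<std::shared_timed_mutex> lock(mutex_);  // shared
        if (index >= data_.size()) return false;
        out = data_[index];
        return true;
    }

private:
    mutable std::shared_timed_mutex mutex_;
    std::vector<int> data_;
};
```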
I have std::list<Info> infoList in my application that is shared between two threads. These 2 threads are accessing this list as follows:
Thread 1: uses push_back(), pop_front() or clear() on the list (Depending on the situation)
Thread 2: uses an iterator to iterate through the items in the list and do some actions.
Thread 2 is iterating the list like the following:
for(std::list<Info>::iterator i = infoList.begin(); i != infoList.end(); ++i)
{
DoAction(i);
}
The code is compiled using GCC 4.4.2.
Sometimes ++i causes a segfault and crashes the application. The error is caused in std_list.h line 143 at the following line:
_M_node = _M_node->_M_next;
I guess this is a race condition. The list might have been changed or even cleared by thread 1 while thread 2 was iterating it.
I used a mutex to synchronize access to this list and all went OK during my initial test. But the system just freezes under stress testing, making this solution totally unacceptable. This is a real-time application, and I need to find a solution so both threads can run as fast as possible without hurting the application's total throughput.
My question is this:
Thread 1 and Thread 2 need to execute as fast as possible since this is a real-time application. What can I do to prevent this problem and still maintain the application's performance? Are there any lock-free algorithms available for such a problem?
It's OK if I miss some newly added Info objects in thread 2's iteration, but what can I do to prevent the iterator from becoming a dangling pointer?
Thanks
Your for() loop can potentially keep a lock for a relatively long time, depending on how many elements it iterates. You can get in real trouble if it "polls" the queue, constantly checking if a new element became available. That makes the thread own the mutex for an unreasonably long time, giving few opportunities to the producer thread to break in and add an element. And burning lots of unnecessary CPU cycles in the process.
You need a "bounded blocking queue". Don't write it yourself, the lock design is not trivial. Hard to find good examples, most of it is .NET code. This article looks promising.
In general it is not safe to use STL containers this way. You will have to implement a specific method to make your code thread-safe. The solution you choose depends on your needs. I would probably solve this by maintaining two lists, one in each thread, and communicating the changes through a lock-free queue (mentioned in the comments to this question). You could also control the lifetime of your Info objects by wrapping them in boost::shared_ptr, e.g.
typedef boost::shared_ptr<Info> InfoReference;
typedef std::list<InfoReference> InfoList;
enum CommandValue
{
    Insert,
    Delete
};
struct Command
{
    CommandValue operation;
    InfoReference reference;
};
typedef LockFreeQueue<Command> CommandQueue;
class Thread1
{
public:
    // Take the queue by reference: the reference member must not bind to a temporary copy.
    Thread1(CommandQueue& queue) : m_commands(queue) {}
    void run()
    {
        while (!finished)
        {
            //Process Items and use
            // deleteInfo() or addInfo()
        }
    }
    void deleteInfo(InfoReference reference)
    {
        Command command;
        command.operation = Delete;
        command.reference = reference;
        m_commands.produce(command);
    }
    void addInfo(InfoReference reference)
    {
        Command command;
        command.operation = Insert;
        command.reference = reference;
        m_commands.produce(command);
    }
private:
    CommandQueue& m_commands;
    InfoList m_infoList;
};
class Thread2
{
public:
// Take the queue by reference: the reference member must not bind to a temporary copy.
Thread2(CommandQueue& queue) : m_commands(queue) {}
void run()
{
while(!finished)
{
processQueue();
processList();
}
}
void processQueue()
{
Command command;
while (m_commands.consume(command))
{
switch(command.operation)
{
case Insert:
m_infoList.push_back(command.reference);
break;
case Delete:
m_infoList.remove(command.reference);
break;
}
}
}
void processList()
{
// Iterate over m_infoList
}
private:
CommandQueue& m_commands;
InfoList m_infoList;
};
int main()
{
CommandQueue commands;
Thread1 thread1(commands);
Thread2 thread2(commands);
thread1.start();
thread2.start();
waitforTermination();
}
This has not been compiled. You still need to make sure that access to your Info objects is thread safe.
I would like to know what is the purpose of this list, it would be easier to answer the question then.
As Hoare said, it is generally a bad idea to share data in order to communicate between two threads; rather, you should communicate in order to share data, i.e. use messaging.
If this list is modelling a queue, for example, you might simply use one of the various ways to communicate (such as sockets) between the two threads. Consumer / Producer is a standard and well-known problem.
If your items are expensive, then only pass the pointers around during communication, you'll avoid copying the items themselves.
In general, it's exquisitely difficult to share data, although it is unfortunately the only way of programming we hear of in school. Normally only low-level implementation of "channels" of communication should ever worry about synchronization and you should learn to use the channels to communicate instead of trying to emulate them.
To prevent your iterator from being invalidated you have to lock the whole for loop. Now I guess the first thread may have difficulties updating the list. I would try to give it a chance to do its job on each (or every Nth iteration).
In pseudo-code that would look like:
mutex_lock();
for(...){
doAction();
mutex_unlock();
thread_yield(); // give first thread a chance
mutex_lock();
if(iterator_invalidated_flag) // set by first thread
reset_iterator();
}
mutex_unlock();
You have to decide which thread is the more important. If it is the update thread, then it must signal the iterator thread to stop, wait and start again. If it is the iterator thread, it can simply lock the list until iteration is done.
The best way to do this is to use a container that is internally synchronized. TBB and Microsoft's concurrent_queue do this. Anthony Williams also has a good implementation on his blog here
Others have already suggested lock-free alternatives, so I'll answer as if you were stuck using locks...
When you modify a list, existing iterators can become invalidated because they no longer point to valid memory. std::list keeps iterators to other elements valid across insertions, but pop_front() and clear() destroy nodes, and an iterator to a destroyed node dangles. Even without invalidation, reading a node's links while another thread rewrites them is a data race. To prevent this, you could make the producer block on a mutex while your consumer traverses the list, but that would be needless waiting for the producer.
Life would be easier if you used a queue instead of a list, and had your consumer use a synchronized queue<Info>::pop() call instead of iterators that can be invalidated behind your back. If your consumer really needs to gobble chunks of Info at a time, then use a condition variable that'll make your consumer block until queue.size() >= minimum.
The Boost library has a nice portable implementation of condition variables (that even works with older versions of Windows), as well as the usual threading library stuff.
For a producer-consumer queue that uses (old-fashioned) locking, check out the BlockingQueue template class of the ZThreads library. I have not used ZThreads myself, being worried about lack of recent updates, and because it didn't seem to be widely used. However, I have used it as inspiration for rolling my own thread-safe producer-consumer queue (before I learned about lock-free queues and TBB).
A lock-free queue/stack library seems to be in the Boost review queue. Let's hope we see a new Boost.Lockfree in the near future! :)
If there's interest, I can write up an example of a blocking queue that uses std::queue and Boost thread locking.
EDIT:
The blog referenced by Rick's answer already has a blocking queue example that uses std::queue and Boost condvars. If your consumer needs to gobble chunks, you can extend the example as follows:
void wait_for_data(size_t how_many)
{
boost::mutex::scoped_lock lock(the_mutex);
while(the_queue.size() < how_many)
{
the_condition_variable.wait(lock);
}
}
You may also want to tweak it to allow time-outs and cancellations.
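For readers without the referenced blog post at hand, here is a self-contained sketch of such a blocking queue. It swaps in the C++11 standard library equivalents (std::mutex, std::condition_variable) for the Boost types, and the class name BlockingQueue is an assumption rather than the blog's code. Time-outs and cancellation are left out.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>

// Sketch of a blocking producer-consumer queue built on std::queue.
template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        cond_.notify_all();  // wake consumers blocked in pop() or wait_for_data()
    }

    // Blocks until an element is available, then removes and returns it.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return !queue_.empty(); });
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

    // Blocks until at least `how_many` elements are queued.
    void wait_for_data(std::size_t how_many) {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this, how_many] { return queue_.size() >= how_many; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::queue<T> queue_;
};
```

The predicate form of wait() re-checks the condition after every wake-up, which plays the same role as the explicit while loop in the snippet above.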
You mentioned that speed was a concern. If your Infos are heavyweight, you should consider passing them around by shared_ptr. You can also try making your Infos fixed size and use a memory pool (which can be much faster than the heap).
As you mentioned that you don't care if your iterating consumer misses some newly-added entries, you could use a copy-on-write list underneath. That allows the iterating consumer to operate on a consistent snapshot of the list as of when it first started, while, in other threads, updates to the list yield fresh but consistent copies, without perturbing any of the extant snapshots.
The trade here is that each update to the list requires locking for exclusive access long enough to copy the entire list. This technique is biased toward having many concurrent readers and less frequent updates.
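A minimal sketch of the copy-on-write idea, assuming C++11 and a std::shared_ptr to an immutable list as the snapshot. CowList and its members are hypothetical names, and only push_back and clear are shown as writers.

```cpp
#include <list>
#include <memory>
#include <mutex>
#include <utility>

// Sketch of a copy-on-write list: readers iterate over immutable snapshots.
template <typename T>
class CowList {
public:
    using Snapshot = std::shared_ptr<const std::list<T>>;

    CowList() : current_(std::make_shared<std::list<T>>()) {}

    // The returned snapshot never changes and stays alive while the caller holds it.
    Snapshot snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }

    void push_back(const T& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto copy = std::make_shared<std::list<T>>(*current_);  // copies the whole list
        copy->push_back(value);
        current_ = std::move(copy);  // publish the new version
    }

    void clear() {
        std::lock_guard<std::mutex> lock(mutex_);
        current_ = std::make_shared<std::list<T>>();  // old snapshots are unaffected
    }

private:
    mutable std::mutex mutex_;
    Snapshot current_;
};
```

The iterating thread calls snapshot() once and loops over the result with no lock held, so a concurrent clear() cannot leave it with a dangling iterator.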
Trying to add intrinsic locking to the container first requires you to think about which operations need to behave in atomic groups. For instance, checking if the list is empty before trying to pop off the first element requires an atomic pop-if-not-empty operation; otherwise, the answer to the list being empty can change in between when the caller receives the answer and attempts to act upon it.
It's not clear in your example above what guarantees the iteration must obey. Must every element in the list eventually be visited by the iterating thread? Does it make multiple passes? What does it mean for one thread to remove an element from the list while another thread is running DoAction() against it? I suspect that working through these questions will lead to significant design changes.
You're working in C++, and you mentioned needing a queue with a pop-if-not-empty operation. I wrote a two-lock queue many years ago using the ACE Library's concurrency primitives, as the Boost thread library was not yet ready for production use, and the chance for the C++ Standard Library including such facilities was a distant dream. Porting it to something more modern would be easy.
This queue of mine -- called concurrent::two_lock_queue -- allows access to the head of the queue only via RAII. This ensures that acquiring the lock to read the head will always be mated with a release of the lock. A consumer constructs a const_front (const access to head element), a front (non-const access to head element), or a renewable_front (non-const access to head and successor elements) object to represent the exclusive right to access the head element of the queue. Such "front" objects can't be copied.
Class two_lock_queue also offers a pop_front() function that waits until at least one element is available to be removed, but, in keeping with std::queue's and std::stack's style of not mixing container mutation and value copying, pop_front() returns void.
In a companion file, there's a type called concurrent::unconditional_pop, which allows one to ensure through RAII that the head element of the queue will be popped upon exit from the current scope.
The companion file error.hh defines the exceptions that arise from use of the function two_lock_queue::interrupt(), used to unblock threads waiting for access to the head of the queue.
Take a look at the code and let me know if you need more explanation as to how to use it.
If you're using C++0x you could internally synchronize list iteration this way:
Assuming the class has a templated list named objects_, and a boost::mutex named mutex_
The toAll method is a member method of the list wrapper
void toAll(std::function<void (T*)> lambda)
{
boost::mutex::scoped_lock lock(this->mutex_); // the lock must be named: an unnamed temporary would unlock immediately
for(auto it = this->objects_.begin(); it != this->objects_.end(); it++)
{
T* object = it->second;
if(object != nullptr)
{
lambda(object);
}
}
}
Calling:
synchronizedList1->toAll(
[&](T* object)->void // Or the class that your list holds
{
for(auto it = this->knownEntities->begin(); it != this->knownEntities->end(); it++)
{
// Do something
}
}
);
You must be using some threading library. If you are using Intel TBB, you can use concurrent_vector or concurrent_queue. See this.
If you want to continue using std::list in a multi-threaded environment, I would recommend wrapping it in a class with a mutex that provides locked access to it. Depending on the exact usage, it might make sense to switch to a event-driven queue model where messages are passed on a queue that multiple worker threads are consuming (hint: producer-consumer).
I would seriously take Matthieu's thought into consideration. Many problems that are being solved using multi-threaded programming are better solved using message-passing between threads or processes. If you need high throughput and do not absolutely require that the processing share the same memory space, consider using something like the Message-Passing Interface (MPI) instead of rolling your own multi-threaded solution. There are a bunch of C++ implementations available - OpenMPI, Boost.MPI, Microsoft MPI, etc. etc.
I don't think you can get away without any synchronisation at all in this case as certain operation will invalidate the iterators you are using. With a list, this is fairly limited (basically, if both threads are trying to manipulate iterators to the same element at the same time) but there is still a danger that you'll be removing an element at the same time you're trying to append one to it.
Are you by any chance holding the lock across DoAction(i)? You obviously only want to hold the lock for the absolute minimum of time that you can get away with in order to maximise the performance. From the code above I think you'll want to decompose the loop somewhat in order to speed up both sides of the operation.
Something along the lines of:
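while (processItems) {
    Info item;
    bool haveItem = false;
    lock(mutex);
    if (!infoList.empty()) {
        item = infoList.front();
        infoList.pop_front();
        haveItem = true;
    }
    unlock(mutex);
    if (haveItem) {
        DoAction(item); // runs outside the lock, and only if something was popped
    }
    delayALittle();
}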
And the insert function would still have to look like this:
lock(mutex);
infoList.push_back(item);
unlock(mutex);
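A variation on the same idea, not from the answer above: instead of popping one item per lock acquisition, the consumer can take the whole pending list in one constant-time swap and process it with no lock held. SharedList and drain are hypothetical names, and Item stands in for the Info type.

```cpp
#include <list>
#include <mutex>

// Hypothetical shared list; `Item` stands in for the Info type.
template <typename Item>
class SharedList {
public:
    void push_back(const Item& item) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push_back(item);
    }

    // Moves all pending items into a list owned by the caller.
    std::list<Item> drain() {
        std::list<Item> batch;
        std::lock_guard<std::mutex> guard(mutex_);
        batch.swap(items_);  // constant time: only the list internals are exchanged
        return batch;
    }

private:
    std::mutex mutex_;
    std::list<Item> items_;
};
```

The consumer then iterates over the drained batch freely, since no other thread can reach it.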
Unless the queue is likely to be massive, I'd be tempted to use something like a std::vector<Info> or even a std::vector<boost::shared_ptr<Info> > to minimize the copying of the Info objects (assuming that these are somewhat more expensive to copy than a boost::shared_ptr). Generally, operations on a vector tend to be a little faster than on a list, especially if the objects stored in the vector are small and cheap to copy.