I want to implement an array-like data structure that allows multiple threads to modify/insert items simultaneously. How can I do this with good performance? I implemented a wrapper class around std::vector and used critical sections to synchronize the threads. Please have a look at my code below. Each time a thread wants to work on the internal data, it may have to wait for other threads, so I think its performance is NOT good. :( Any ideas?
class parallelArray{
private:
std::vector<int> data;
zLock dataLock; // my predefined class for synchronizing
public:
void insert(int val){
dataLock.lock();
data.push_back(val);
dataLock.unlock();
}
void modify(unsigned int index, int newVal){
dataLock.lock();
data[index]=newVal; // assuming that the index is valid
dataLock.unlock();
}
};
Take a look at shared_mutex in the Boost library. It allows multiple readers at once, but only one writer at a time:
http://www.boost.org/doc/libs/1_47_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_types.shared_mutex
The best way is to use some fast reader-writer lock. You perform shared locking for read-only access and exclusive locking for writable access - this way read-only access is performed simultaneously.
In user-mode Win32 API there are Slim Reader/Writer (SRW) Locks available in Vista and later.
Before Vista you have to implement reader-writer lock functionality yourself, which is a pretty simple task. You can do it with one critical section, one event and one enum/int value. A good implementation would require more effort, though: I would use a hand-crafted linked list of local (stack-allocated) structures to implement a fair waiting queue.
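For what it's worth, here is a minimal sketch of the shared/exclusive locking idea applied to the array wrapper from the question. It assumes a C++17 compiler and uses std::shared_mutex rather than Boost or SRW locks; the class name SharedArray and the get()/size() methods are made up for illustration and are not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical wrapper: many concurrent readers, one writer at a time.
class SharedArray {
public:
    void insert(int val) {
        std::unique_lock<std::shared_mutex> lock(mutex_);  // exclusive lock
        data_.push_back(val);
    }
    void modify(std::size_t index, int newVal) {
        std::unique_lock<std::shared_mutex> lock(mutex_);  // exclusive lock
        data_.at(index) = newVal;  // throws std::out_of_range on a bad index
    }
    int get(std::size_t index) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);  // shared lock
        return data_.at(index);
    }
    std::size_t size() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);  // shared lock
        return data_.size();
    }
private:
    mutable std::shared_mutex mutex_;
    std::vector<int> data_;
};
```

Note that this only helps if reads dominate. insert() and modify() still take the lock exclusively, so a write-heavy workload behaves much like the original critical-section version.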
I'm sorry if this is a duplicate, but as much as I search I only find solutions that don't apply:
so I have a hash table, and I want multiple threads to be simultaneously reading and writing to the table. But how do I prevent data races when:
threads writing to the same hash as another
threads writing to a hash being read
edit:
if possible, because this hash needs to be extremely fast as it's accessed extremely frequently, is there a way to lock two racing threads only if they are accessing the same index of the hash table?
I have answered variations of this question before. Please read my previous answer regarding this topic.
Many people have tried to implement thread safe collection classes (lists, hash tables, maps, sets, queues, etc... ) and failed. Or worse, failed, didn't know it, but shipped it anyway.
A naive way to build a thread-safe hash table is to start with an existing hash table implementation and add a mutex to all public methods. You could imagine a hypothetical implementation like this:
// **THIS IS BAD**
template<typename K, typename V>
class ThreadSafeMap
{
private:
std::map<K,V> _map;
std::mutex _mutex;
public:
void insert(const K& k, const V& v)
{
std::lock_guard lck(_mutex);
_map[k] = v;
}
const V& at(const K& key)
{
std::lock_guard lck(_mutex);
return _map.at(key);
}
// other methods not shown - but are essentially a repeat of locking a mutex
// before accessing the underlying data structure
};
In the above example, std::lock_guard locks the mutex when the lck variable is instantiated, and lock_guard's destructor releases the mutex when the lck variable goes out of scope.
And to a certain extent, it is thread safe. But once you start to use the above data structure in complex ways, it breaks down.
Transactions on hash tables are often multi-step operations. For example, an entire application transaction on the table might be to lookup a record and upon successfully returning it, change some member on what the record points to.
So imagine we had used the above class across different threads like the following:
ThreadSafeMap<std::string, Item> g_map;
// thread 1
Item& item = g_map.at(key);
item.value++;
// thread 2
Item& item = g_map.at(key);
item.value--;
// thread 3
g_map.erase(key);
g_map[key] = newItem;
It's easy to think the above operations are thread safe because the hash table itself is thread safe. But they are not. Thread 1 and thread 2 are both trying to access the same item outside of the lock. Thread 3 is even trying to replace that record that might be referenced by the other two threads. There's a lot of undefined behavior here.
The solution? Stick with a single threaded hash table implementation and use the mutex at the application/transaction level. Better:
std::unordered_map<std::string, Item> g_map;
std::mutex g_mutex;
// thread 1
{
std::lock_guard lck(g_mutex);
Item& item = g_map.at(key);
item.value++;
}
// thread 2
{
std::lock_guard lck(g_mutex);
Item& item = g_map.at(key);
item.value--;
}
// thread 3
{
std::lock_guard lck(g_mutex);
g_map.erase(key);
g_map[key] = newItem;
}
Bottom line. Don't just stick mutexes and locks on your low-level data structures and proclaim it to be thread safe. Use mutexes and locks at the level the caller expects to do its set of operations on the hash table itself.
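If you still want a reusable helper, one option is a small wrapper that never hands out references and instead runs a caller-supplied function while the mutex is held, so the whole transaction sits under one lock. This is only a sketch: Synchronized, withLock and incrementItem are hypothetical names, and Item is a stand-in for the record type used above.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: owns a value and a mutex, and only exposes the value
// to a function that runs while the mutex is held.
template <typename T>
class Synchronized {
public:
    template <typename F>
    auto withLock(F&& f) -> decltype(f(std::declval<T&>())) {
        std::lock_guard<std::mutex> lock(mutex_);
        return f(value_);
    }
private:
    std::mutex mutex_;
    T value_;
};

struct Item { int value = 0; };  // stand-in record type

// One application-level transaction: look up (or create) a record and change
// it, all under a single lock acquisition.
inline int incrementItem(Synchronized<std::unordered_map<std::string, Item>>& table,
                         const std::string& key) {
    return table.withLock([&](std::unordered_map<std::string, Item>& m) {
        Item& item = m[key];
        ++item.value;
        return item.value;  // returned by value, so nothing escapes the lock
    });
}
```

The function passed to withLock must not store or return references into the map; if it does, you are back to the original problem.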
The most reliable and appropriate way to avoid data races is to serialize access to the hash table using a mutex; i.e. each thread needs to acquire the mutex before performing any operations (read or write) on the hash table, and release the mutex after it is done.
What you're probably looking for, though, is to implement a lock-free hash table, but ensuring correct multithreaded behavior without locks is extremely difficult to do correctly, and if you were at the technical level required to implement such a thing, you wouldn't need to ask about it on Stackoverflow. So I strongly suggest that you either stick with the serialized-access approach (which works fine for 99% of the software out there, and is possible to implement correctly without in-depth knowledge of the CPU, cache architecture, RAM, OS, scheduler, optimizer, C++ language spec, etc) or if you must use a lock-free data structure, that you find a premade one from a reputable source to use rather than trying to roll your own. In fact, even if you want to roll your own, you should start by looking through the source code of working examples, to get an idea of what they are doing and why they are doing it.
So you need basic thread synchronization or what? You must use a mutex, lock_guard, or some other mechanism for thread synchronization in the read and write functions. cppreference.com has the documentation for the standard library.
I have a requirement for a MessageQueue which will store objects, and 2 threads will act as producer and consumer. I am planning to use std::queue to store the objects. I am working in MFC and C++ on VC 6.0. Which synchronization primitives could be used for synchronization between the 2 threads, given that I can't use C++11 on VC 6.0?
Could you give me some direction? I am planning to use CriticalSection and Event. Is there a better way to handle this?
Is std::queue thread-safe?
I'm not well versed in the MFC synchronization tools, but what you want to do is definitely possible.
EDIT: Based on what people are saying in the comments, it looks like CCriticalSection is a better choice than CMutex in this case so I've updated my answer.
For signaling between threads, using semaphores would be a good choice. Wikipedia has a nice example of pseudocode using semaphores for the producer-consumer / bounded-buffer problem. Note that you will need two semaphores: one to count how many items are in the queue and one to count how many free slots the queue has left. Note that with more than two threads, you also need a mutex-type or critical-section synchronization mechanism in addition to the semaphores (see the wiki link, second code example). This may seem counter-intuitive, but keep in mind that the producer and the consumer are waiting on two different queue conditions before they act.
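To make the two-semaphore idea concrete, here is a rough sketch in standard C++11. That is not what VC 6.0 offers; there the same shape would be built from Win32 or MFC semaphores plus a CCriticalSection. The Semaphore and BoundedQueue classes are hypothetical illustrations, and the semaphore is built from a mutex and a condition variable only to keep the example self-contained.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Minimal counting semaphore built from a mutex and a condition variable.
class Semaphore {
public:
    explicit Semaphore(std::size_t initial) : count_(initial) {}
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t count_;
};

// Bounded buffer: items_ counts filled slots, slots_ counts free slots.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : items_(0), slots_(capacity) {}
    void push(int value) {
        slots_.acquire();  // wait for a free slot
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(value);
        }
        items_.release();  // one more item available
    }
    int pop() {
        items_.acquire();  // wait for an item
        int value;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            value = queue_.front();
            queue_.pop();
        }
        slots_.release();  // one more free slot
        return value;
    }
private:
    Semaphore items_;
    Semaphore slots_;
    std::mutex mutex_;  // protects queue_ itself
    std::queue<int> queue_;
};
```

The producer blocks in push() when the queue is full and the consumer blocks in pop() when it is empty, which is exactly the "two different queue conditions" mentioned above.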
Based on what I've read, a good option is to make your own wrapper class with a CCriticalSection member, then when you want to lock the resource (like you would within one of your wrapper class' get/set member function) you call CCriticalSection's Lock() method (shown here). When you're done with the shared resource, remember to call Unlock() on the CCriticalSection.
Adapted From MSDN:
#include <queue>
class SharedQueue
{
static std::queue<int> _qShared; //shared resource
static CCriticalSection _critSect;
public:
SharedQueue(void) {}
~SharedQueue(void) {}
void push(int); //locks, modifies, and unlocks shared resource
};
//Definition of static members and push
std::queue<int> SharedQueue::_qShared;
CCriticalSection SharedQueue::_critSect;
void SharedQueue::push(int item)
{
_critSect.Lock();
_qShared.push(item);
_critSect.Unlock();
}
As pointed out in the comments and in the MSDN docs, CCriticalSection is useful when access to your shared resource does not cross process boundaries. It is also more performant in this case than CMutex.
You need to wrap the std::queue, since it is not thread safe. Assume any container in the STL is not thread safe unless the documentation specifically mentions that it is.
I am quite new to multi-threading, I have a single threaded data analysis app that has a good bit of potential for parallelization and while the data sets are large it does not come close to saturating the hard-disk read/write so I figure I should take advantage of the threading support that is now in the standard and try to speed the beast up.
After some research I decided that producer consumer was a good approach for the reading of data from the disk and processing it and I started writing an object pool that would become part of the circular buffer that will be where the producers put data and the consumers get the data. As I was writing the class it felt like I was being too fine grained in how I was handling locking and releasing data members. It feels like half the code is locking and unlocking and like there are an insane number of synchronization objects floating around.
So I come to you with a class declaration and a sample function and this question: Is this too fine-grained? Not fine grained enough? Poorly thought out?
struct PoolArray
{
public:
Obj* arr;
uint32 used;
uint32 refs;
std::mutex locker;
};
class SegmentedPool
{
public: /*Construction and destruction cut out*/
void alloc(uint32 cellsNeeded, PoolPtr& ptr);
void dealloc(PoolPtr& ptr);
void clearAll();
private:
void expand();
//stores all the segments of the pool
std::vector< PoolArray<Obj> > pools;
ReadWriteLock poolLock;
//stores pools that are empty
std::queue< int > freePools;
std::mutex freeLock;
int currentPool;
ReadWriteLock currentLock;
};
void SegmentedPool::dealloc(PoolPtr& ptr)
{
//find and access the segment
poolLock.lockForRead();
PoolArray* temp = &(pools[ptr.getSeg()]);
poolLock.unlockForRead();
//reduce the count of references in the segment
temp->locker.lock();
--(temp->refs);
//if the number of references is now zero then set the segment back to unused
//and push it onto the queue of empty segments so that it can be reused
if(temp->refs==0)
{
temp->used=0;
freeLock.lock();
freePools.push(ptr.getSeg());
freeLock.unlock();
}
temp->locker.unlock();
ptr.set(NULL,-1);
}
A few explanations:
First PoolPtr is a stupid little pointer like object that stores the pointer and the segment number in the pool that the pointer came from.
Second this is all "templatized" but I took those lines out to try to reduce the length of the code block
Third ReadWriteLock is something I put together using a mutex and a pair of condition variables.
Locks are inefficient no matter how fine grained they are, so avoid at all cost.
Both queue and vector can be easily implemented lock-free using a compare-and-swap primitive.
There are a number of papers on the topic.
Lock free queue:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.8674&rep=rep1&type=pdf
http://www.par.univie.ac.at/project/peppher/publications/Published/opodis10lfq.pdf
Lock free vector:
http://www.stroustrup.com/lock-free-vector.pdf
Stroustrup's paper also refers to a lock-free allocator, but don't jump at it right away; standard allocators are pretty good these days.
UPD
If you don't want to bother writing your own containers, use Intel's Threading Building Blocks library; it provides both a thread-safe vector and a thread-safe queue. They are NOT lock-free, but they are optimized to use the CPU cache efficiently.
UPD
Regarding PoolArray, you don't need a lock there either. If you can use C++11, use std::atomic for atomic increments and swaps; otherwise use compiler built-ins (the Interlocked* functions in MSVC and the __sync_* built-ins in gcc: http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Atomic-Builtins.html).
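As a rough illustration of the std::atomic suggestion, the reference count could look something like this. SegmentRefs, addRef and release are hypothetical names standing in for the refs member of PoolArray; this is a sketch of the counter only, not a drop-in replacement for dealloc().

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the reference count in PoolArray.
struct SegmentRefs {
    std::atomic<std::uint32_t> refs{0};

    void addRef() { refs.fetch_add(1, std::memory_order_relaxed); }

    // Returns true for exactly one caller: the one that drops the count to zero.
    bool release() {
        return refs.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```

The caller that gets true from release() is the one that should mark the segment unused and hand it back. Pushing the segment onto freePools is a separate shared structure and still needs its own synchronization (or a concurrent queue).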
A good start - you lock things when needed and free them as soon as you're finished.
Your ReadWriteLock is pretty much a CCriticalSection object - depending on your needs it might improve performance to use that instead.
One thing I would say is that you should call temp->locker.lock(); before you release the lock on the pool with poolLock.unlockForRead();. Otherwise you're performing operations on the pool object when it's not under synchronisation control, and it could be in use by another thread at that point. A minor point, but with multi-threading it's the minor points that trip you up in the end.
A good approach to take to multi-threading is to wrap any controlled resources in objects or functions that do the locking and unlocking inside them, so anyone who wants access to the data doesn't have to worry about which lock to lock or unlock, and when to do it. for example:
...
if(temp->refs==0)
{
temp->used=0;
freeLock.lock();
freePools.push(ptr.getSeg());
freeLock.unlock();
}
...
would be...
...
if(temp->refs==0)
{
temp->used=0;
addFreePool(ptr.getSeg());
}
...
void SegmentedPool::addFreePool(unsigned int seg)
{
freeLock.lock();
freePools.push(seg);
freeLock.unlock();
}
There are plenty of multi-threading benchmarking tools out there too. You can play around with controlling your resources in different ways, run it through one of the tools, and see where any bottlenecks are if you feel like performance is becoming an issue.
I have std::list<Info> infoList in my application that is shared between two threads. These 2 threads are accessing this list as follows:
Thread 1: uses push_back(), pop_front() or clear() on the list (Depending on the situation)
Thread 2: uses an iterator to iterate through the items in the list and do some actions.
Thread 2 is iterating the list like the following:
for(std::list<Info>::iterator i = infoList.begin(); i != infoList.end(); ++i)
{
DoAction(i);
}
The code is compiled using GCC 4.4.2.
Sometimes ++i causes a segfault and crashes the application. The error is caused in std_list.h line 143 at the following line:
_M_node = _M_node->_M_next;
I guess this is a race condition. The list might have been changed or even cleared by thread 1 while thread 2 was iterating over it.
I used a mutex to synchronize access to this list and all went OK during my initial test. But the system just freezes under a stress test, making this solution totally unacceptable. This application is a real-time application and I need to find a solution so both threads can run as fast as possible without hurting the application's total throughput.
My question is this:
Thread 1 and Thread 2 need to execute as fast as possible since this is a real-time application. What can I do to prevent this problem and still maintain the application's performance? Are there any lock-free algorithms available for such a problem?
It's OK if I miss some newly added Info objects in thread 2's iteration, but what can I do to prevent the iterator from becoming a dangling pointer?
Thanks
Your for() loop can potentially keep a lock for a relatively long time, depending on how many elements it iterates. You can get in real trouble if it "polls" the queue, constantly checking if a new element became available. That makes the thread own the mutex for an unreasonably long time, giving few opportunities to the producer thread to break in and add an element. And burning lots of unnecessary CPU cycles in the process.
You need a "bounded blocking queue". Don't write it yourself, the lock design is not trivial. Hard to find good examples, most of it is .NET code. This article looks promising.
In general it is not safe to use STL containers this way. You will have to add your own synchronization to make your code thread safe. The solution you choose depends on your needs. I would probably solve this by maintaining two lists, one in each thread, and communicating the changes through a lock-free queue (mentioned in the comments to this question). You could also limit the lifetime of your Info objects by wrapping them in boost::shared_ptr, e.g.
typedef boost::shared_ptr<Info> InfoReference;
typedef std::list<InfoReference> InfoList;
enum CommandValue
{
Insert,
Delete
};
struct Command
{
CommandValue operation;
InfoReference reference;
};
typedef LockFreeQueue<Command> CommandQueue;
class Thread1
{
public:
// take the queue by reference: m_commands is a reference member
Thread1(CommandQueue& queue) : m_commands(queue) {}
void run()
{
while (!finished)
{
//Process Items and use
// deleteInfo() or addInfo()
};
}
void deleteInfo(InfoReference reference)
{
Command command;
command.operation = Delete;
command.reference = reference;
m_commands.produce(command);
}
void addInfo(InfoReference reference)
{
Command command;
command.operation = Insert;
command.reference = reference;
m_commands.produce(command);
}
private:
CommandQueue& m_commands;
InfoList m_infoList;
};
class Thread2
{
public:
// take the queue by reference: m_commands is a reference member
Thread2(CommandQueue& queue) : m_commands(queue) {}
void run()
{
while(!finished)
{
processQueue();
processList();
}
}
void processQueue()
{
Command command;
while (m_commands.consume(command))
{
switch(command.operation)
{
case Insert:
m_infoList.push_back(command.reference);
break;
case Delete:
m_infoList.remove(command.reference);
break;
}
}
}
void processList()
{
// Iterate over m_infoList
}
private:
CommandQueue& m_commands;
InfoList m_infoList;
};
int main()
{
CommandQueue commands;
Thread1 thread1(commands);
Thread2 thread2(commands);
thread1.start();
thread2.start();
waitforTermination();
}
This has not been compiled. You still need to make sure that access to your Info objects is thread safe.
I would like to know what is the purpose of this list, it would be easier to answer the question then.
As Hoare said, it is generally a bad idea to try to share data to communicate between two threads, rather you should communicate to share data: ie messaging.
If this list is modelling a queue, for example, you might simply use one of the various ways to communicate (such as sockets) between the two threads. Consumer / Producer is a standard and well-known problem.
If your items are expensive, then only pass the pointers around during communication, you'll avoid copying the items themselves.
In general, it's exquisitely difficult to share data, although it is unfortunately the only way of programming we hear of in school. Normally only low-level implementation of "channels" of communication should ever worry about synchronization and you should learn to use the channels to communicate instead of trying to emulate them.
To prevent your iterator from being invalidated you have to lock the whole for loop. Now I guess the first thread may have difficulties updating the list. I would try to give it a chance to do its job on each (or every Nth) iteration.
In pseudo-code that would look like:
mutex_lock();
for(...){
doAction();
mutex_unlock();
thread_yield(); // give first thread a chance
mutex_lock();
if(iterator_invalidated_flag) // set by first thread
reset_iterator();
}
mutex_unlock();
You have to decide which thread is the more important. If it is the update thread, then it must signal the iterator thread to stop, wait and start again. If it is the iterator thread, it can simply lock the list until iteration is done.
The best way to do this is to use a container that is internally synchronized. TBB and Microsoft's concurrent_queue do this. Anthony Williams also has a good implementation on his blog here
Others have already suggested lock-free alternatives, so I'll answer as if you were stuck using locks...
When you modify a list, existing iterators can become invalidated because they no longer point to valid memory (with std::list this happens to iterators that refer to an element that has been erased, for example by pop_front() or clear(); insertions leave existing iterators valid). To prevent invalidated iterators, you could make the producer block on a mutex while your consumer traverses the list, but that would be needless waiting for the producer.
Life would be easier if you used a queue instead of a list, and have your consumer use a synchronized queue<Info>::pop_front() call instead of iterators that can be invalidated behind your back. If your consumer really needs to gobble chunks of Info at a time, then use a condition variable that'll make your consumer block until queue.size() >= minimum.
The Boost library has a nice portable implementation of condition variables (that even works with older versions of Windows), as well as the usual threading library stuff.
For a producer-consumer queue that uses (old-fashioned) locking, check out the BlockingQueue template class of the ZThreads library. I have not used ZThreads myself, being worried about lack of recent updates, and because it didn't seem to be widely used. However, I have used it as inspiration for rolling my own thread-safe producer-consumer queue (before I learned about lock-free queues and TBB).
A lock-free queue/stack library seems to be in the Boost review queue. Let's hope we see a new Boost.Lockfree in the near future! :)
If there's interest, I can write up an example of a blocking queue that uses std::queue and Boost thread locking.
EDIT:
The blog referenced by Rick's answer already has a blocking queue example that uses std::queue and Boost condvars. If your consumer needs to gobble chunks, you can extend the example as follows:
void wait_for_data(size_t how_many)
{
boost::mutex::scoped_lock lock(the_mutex);
while(the_queue.size() < how_many)
{
the_condition_variable.wait(lock);
}
}
You may also want to tweak it to allow time-outs and cancellations.
You mentioned that speed was a concern. If your Infos are heavyweight, you should consider passing them around by shared_ptr. You can also try making your Infos fixed size and use a memory pool (which can be much faster than the heap).
As you mentioned that you don't care if your iterating consumer misses some newly-added entries, you could use a copy-on-write list underneath. That allows the iterating consumer to operate on a consistent snapshot of the list as of when it first started, while, in other threads, updates to the list yield fresh but consistent copies, without perturbing any of the extant snapshots.
The trade here is that each update to the list requires locking for exclusive access long enough to copy the entire list. This technique is biased toward having many concurrent readers and less frequent updates.
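To give an idea of what such a copy-on-write list might look like, here is a minimal sketch. It assumes C++11 and uses a std::vector inside a std::shared_ptr as the snapshot; CowList and its member names are invented for this example.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical copy-on-write list: readers iterate over immutable snapshots.
template <typename T>
class CowList {
public:
    typedef std::shared_ptr<const std::vector<T>> Snapshot;

    CowList() : current_(std::make_shared<std::vector<T>>()) {}

    // Readers grab the current snapshot; the lock is held only long enough
    // to copy the pointer.
    Snapshot snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }

    // Writers copy the whole list, change the copy, then publish it.
    void push_back(const T& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto next = std::make_shared<std::vector<T>>(*current_);
        next->push_back(value);
        current_ = std::move(next);
    }

    void clear() {
        std::lock_guard<std::mutex> lock(mutex_);
        current_ = std::make_shared<std::vector<T>>();
    }
private:
    mutable std::mutex mutex_;
    Snapshot current_;
};
```

The iterating thread calls snapshot() once and loops over the result. Later push_back() or clear() calls replace current_ but cannot touch the snapshot it holds, and the old data is freed when the last reader drops its pointer.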
Trying to add intrinsic locking to the container first requires you to think about which operations need to behave in atomic groups. For instance, checking if the list is empty before trying to pop off the first element requires an atomic pop-if-not-empty operation; otherwise, the answer to the list being empty can change in between when the caller receives the answer and attempts to act upon it.
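A pop-if-not-empty operation of that kind could be sketched like this, assuming C++11. LockedList and try_pop_front are hypothetical names; the point is only that the emptiness check and the removal happen under the same lock.

```cpp
#include <cassert>
#include <list>
#include <mutex>

// Hypothetical wrapper exposing only operations that are atomic as a group.
template <typename T>
class LockedList {
public:
    void push_back(const T& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(value);
    }
    // The emptiness check and the removal happen under one lock, so the
    // answer cannot go stale between the two steps.
    bool try_pop_front(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::list<T> items_;
};
```

Deliberately, there is no separate empty() or front() in the interface, since a caller combining them would reintroduce the race.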
It's not clear in your example above what guarantees the iteration must obey. Must every element in the list eventually be visited by the iterating thread? Does it make multiple passes? What does it mean for one thread to remove an element from the list while another thread is running DoAction() against it? I suspect that working through these questions will lead to significant design changes.
You're working in C++, and you mentioned needing a queue with a pop-if-not-empty operation. I wrote a two-lock queue many years ago using the ACE Library's concurrency primitives, as the Boost thread library was not yet ready for production use, and the chance for the C++ Standard Library including such facilities was a distant dream. Porting it to something more modern would be easy.
This queue of mine -- called concurrent::two_lock_queue -- allows access to the head of the queue only via RAII. This ensures that acquiring the lock to read the head will always be mated with a release of the lock. A consumer constructs a const_front (const access to head element), a front (non-const access to head element), or a renewable_front (non-const access to head and successor elements) object to represent the exclusive right to access the head element of the queue. Such "front" objects can't be copied.
Class two_lock_queue also offers a pop_front() function that waits until at least one element is available to be removed, but, in keeping with std::queue's and std::stack's style of not mixing container mutation and value copying, pop_front() returns void.
In a companion file, there's a type called concurrent::unconditional_pop, which allows one to ensure through RAII that the head element of the queue will be popped upon exit from the current scope.
The companion file error.hh defines the exceptions that arise from use of the function two_lock_queue::interrupt(), used to unblock threads waiting for access to the head of the queue.
Take a look at the code and let me know if you need more explanation as to how to use it.
If you're using C++0x you could internally synchronize list iteration this way:
Assuming the class has a templated list named objects_, and a boost::mutex named mutex_
The toAll method is a member method of the list wrapper
void toAll(std::function<void (T*)> lambda)
{
boost::mutex::scoped_lock lock(this->mutex_); // must be a named object; an unnamed temporary would unlock immediately
for(auto it = this->objects_.begin(); it != this->objects_.end(); it++)
{
T* object = it->second;
if(object != nullptr)
{
lambda(object);
}
}
}
Calling:
synchronizedList1->toAll(
[&](T* object)->void // Or the class that your list holds
{
for(auto it = this->knownEntities->begin(); it != this->knownEntities->end(); it++)
{
// Do something
}
}
);
You must be using some threading library. If you are using Intel TBB, you can use concurrent_vector or concurrent_queue. See this.
If you want to continue using std::list in a multi-threaded environment, I would recommend wrapping it in a class with a mutex that provides locked access to it. Depending on the exact usage, it might make sense to switch to an event-driven queue model where messages are passed on a queue that multiple worker threads are consuming (hint: producer-consumer).
I would seriously take Matthieu's thought into consideration. Many problems that are being solved using multi-threaded programming are better solved using message-passing between threads or processes. If you need high throughput and do not absolutely require that the processing share the same memory space, consider using something like the Message-Passing Interface (MPI) instead of rolling your own multi-threaded solution. There are a bunch of C++ implementations available - OpenMPI, Boost.MPI, Microsoft MPI, etc. etc.
I don't think you can get away without any synchronisation at all in this case, as certain operations will invalidate the iterators you are using. With a list, this is fairly limited (basically, if both threads are trying to manipulate iterators to the same element at the same time) but there is still a danger that you'll be removing an element at the same time you're trying to append one to it.
Are you by any chance holding the lock across DoAction(i)? You obviously only want to hold the lock for the absolute minimum of time that you can get away with in order to maximise the performance. From the code above I think you'll want to decompose the loop somewhat in order to speed up both sides of the operation.
Something along the lines of:
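while (processItems) {
Info item;
bool haveItem = false;
lock(mutex);
if (!infoList.empty()) {
item = infoList.front();
infoList.pop_front();
haveItem = true;
}
unlock(mutex);
if (haveItem)
DoAction(item); // runs outside the lock, and only if an item was popped
delayALittle();
}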
And the insert function would still have to look like this:
lock(mutex);
infoList.push_back(item);
unlock(mutex);
Unless the queue is likely to be massive, I'd be tempted to use something like a std::vector<Info> or even a std::vector<boost::shared_ptr<Info> > to minimize the copying of the Info objects (assuming that these are somewhat more expensive to copy than a boost::shared_ptr). Generally, operations on a vector tend to be a little faster than on a list, especially if the objects stored in the vector are small and cheap to copy.
I'm looking for something similar to the CopyOnWriteArraySet in Java, a set that supports add, remove and some type of iterators from multiple threads.
There isn't one that I know of; the closest is in Threading Building Blocks, which has concurrent_unordered_map.
The STL containers allow concurrent read access from multiple threads as long as you aren't doing concurrent modification. Often it isn't necessary to iterate while adding / removing.
The guidance about providing a simple wrapper class is sane, I would start with something like the code snippet below protecting the methods that you really need concurrent access to and then providing 'unsafe' access to the base std::set so folks can opt into the other methods that aren't safe. If necessary you can protect access as well to acquiring iterators and putting them back, but this is tricky (still less so than writing your own lock free set or your own fully synchronized set).
I work on the Parallel Patterns Library, so I'm using critical_section from the VS2010 beta. boost::mutex works great too, and the RAII pattern of using a lock_guard is almost necessary regardless of how you choose to do this:
template <class T>
class synchronized_set
{
//boost::mutex is good here too
critical_section cs;
public:
typedef set<T> std_set_type;
set<T> unsafe_set;
bool try_insert(...)
{
//boost has a lock_guard
lock_guard<critical_section> guard(cs);
}
};
Why not just use a shared mutex to protect concurrent access? Be sure to use RAII to lock and unlock the mutex:
{
Mutex::Lock lock(mutex);
// std::set manipulation goes here
}
where Mutex::Lock is a class that locks the mutex in the constructor and unlocks it in the destructor, and mutex is a mutex object that is shared by all threads. Mutex is just a wrapper class that hides whatever specific OS primitive you are using.
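In case it helps, such a wrapper might look roughly like this. It is only a sketch: std::mutex (C++11) stands in for "whatever specific OS primitive you are using", and insertShared is a made-up helper showing the usage.

```cpp
#include <cassert>
#include <mutex>
#include <set>

// Hypothetical wrapper; std::mutex stands in for the OS primitive.
class Mutex {
public:
    // RAII guard: locks in the constructor, unlocks in the destructor.
    class Lock {
    public:
        explicit Lock(Mutex& m) : mutex_(m) { mutex_.impl_.lock(); }
        ~Lock() { mutex_.impl_.unlock(); }
        Lock(const Lock&) = delete;
        Lock& operator=(const Lock&) = delete;
    private:
        Mutex& mutex_;
    };
private:
    std::mutex impl_;
};

// Example use: the set itself stays a plain std::set.
inline bool insertShared(std::set<int>& s, Mutex& m, int value) {
    Mutex::Lock lock(m);            // locked here
    return s.insert(value).second;  // unlocked when lock goes out of scope
}
```

Because the unlock happens in the destructor, the mutex is released on every exit path, including exceptions thrown by the set operation.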
I've always thought that concurrency and set behavior are orthogonal concepts, so it's better to have them in separate classes. In my experiences, classes that try to be thread safe themselves aren't very flexible or all that useful.
You don't want internal locking, as your invariants will often require multiple operations on the data structure, and internal locking only prevents the steps happening at the same time, whereas you need to keep the steps from different macro-operations from interleaving.
You can also take a look at ACE library which has all thread safe containers you might ever need.
All I can think of is to use OpenMP for parallelization, derive a set class from std's and put a shell around each critical set operation that declares that operation critical using #pragma omp critical.
Qt's QSet class uses implicit sharing (copy-on-write semantics) and has methods similar to std::set's. You can look at its implementation; Qt is LGPL.
Thread safety and copy on write semantics are not the same thing. That being said...
If you're really after copy-on-write semantics the Adobe Source Libraries has a copy_on_write template that adds these semantics to whatever you instantiate it with.