Why doesn't std::shared_ptr use reference linking? - c++

std::shared_ptr needs to allocate a control block on the heap which holds the reference count. There is another approach, which I learnt from http://ootips.org/yonat/4dev/smart-pointers.html, that keeps all the references in a doubly linked list; it needs no additional allocation and no counter, but the reference object itself is larger.
Is there a benchmark or any clear reason showing that one implementation is better than the other?

The standard does in theory allow a linked list to be used, but because copying a shared_ptr must be thread-safe it would be harder to implement that with a linked list. The list would need to be protected by a mutex (or be a lock-free list, which is much harder) so that every time a shared_ptr is copied or goes out of scope the list can be safely modified by multiple threads.
It's much simpler and in general more efficient to do reference counting using atomic types and use atomic operations for the ref count updates.
Edit: To answer the comment below, just using atomic pointers to implement the linked list wouldn't be enough. To add or remove a node from the list (which correspond to increasing and decreasing the use_count) you would need to update two pointers atomically, to adjust the links in the nodes before and after the one being added/removed. std::atomic<T*> allows you to update a single pointer atomically, but doesn't help if you need to update two such objects atomically. Since those two pointers live in separate nodes they're not adjacent so even a quadword CAS won't help.
Alternatives are a mutex protecting the whole list (obviously bad for contention) or a mutex per list node, where you lock the mutex of each node involved in any update; since an update touches up to three nodes at a time, that means locking three mutexes, and the per-node mutexes use more memory. If use_count() is five or less, then copying/destroying any one shared_ptr contends with copying/destroying any of the other instances that share ownership of the same pointer. You might get less contention with high use counts, where most updates are to non-neighbouring nodes distant from each other, but probably not in the general case; plenty of programs use shared_ptr with use counts in single digits. Even when use counts are high and there's no contention on any nodes, you still have to lock three mutexes and create/destroy a list node (possibly requiring a heap allocation) and update the pointers in its neighbouring nodes, so an atomic increment/decrement is much simpler and could still be faster despite the contention on the atomic integers.
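To make the contrast concrete, here is a minimal sketch of the counting approach; the names are illustrative, and a real implementation also tracks a weak count and a type-erased deleter (and needs assignment operators, omitted here):

#include <atomic>

template <typename T>
class counted_ptr {
    struct control_block {
        std::atomic<long> use_count{1};
        T* ptr;
        explicit control_block(T* p) : ptr(p) {}
    };
    control_block* cb_;

public:
    explicit counted_ptr(T* p) : cb_(p ? new control_block(p) : nullptr) {}

    // Copying touches a single atomic integer: one fetch_add, no list
    // splicing, no neighbouring nodes to lock.
    counted_ptr(const counted_ptr& other) : cb_(other.cb_) {
        if (cb_) cb_->use_count.fetch_add(1, std::memory_order_relaxed);
    }

    ~counted_ptr() {
        // acq_rel so the thread that performs the delete sees all writes
        // made through the other (now destroyed) copies.
        if (cb_ && cb_->use_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete cb_->ptr;
            delete cb_;
        }
    }

    T* get() const { return cb_ ? cb_->ptr : nullptr; }
};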
Last time I mentioned on the committee reflector that shared_ptr isn't required to use ref counts and could use a list I got the replies:
Does anyone actually do that, given that the Standard recognizes multithreading now?
and (in reference to the thread-safety requirements):
It's much harder to make that work (efficiently) for a reference-linked implementation. I'm not even sure that it's possible, though it may be.

Related

Lock-free stack with freelist: why don't the next pointers need to be atomic?

A lock-free stack can be implemented as a singly linked list. This seems simple until we have to think about what to do with nodes after they've been popped. One strategy is to simply move them to a per-stack LIFO freelist (from which nodes can be reused by subsequent push operations) until eventually all threads are done with the stack, at which point a single thread destroys all nodes in the stack and all nodes in the freelist. Boost.Lockfree uses this strategy. So does Chris Wellons's C11 implementation. I will refer to the latter, because it's easier to read and the details are essentially the same, since C11 atomics are very similar to C++11 atomics.
In Wellons's implementation, which can be found on GitHub, all lstack_node objects are non-atomic. In particular, this means that all accesses to the next member of an lstack_node object are non-atomic. What I am unable to understand is: why do such accesses never race with each other?
The next member is read at lstack.c:30. It is written at lstack.c:39. If these two lines can execute concurrently on the same lstack_node object, then the program contains a race. Is this possible? It seems possible to me:
Thread 1 calls lstack_pop, which calls pop. It atomically loads the head node's value into the local variable orig. Now, orig.node is a pointer to the node that was at the top of the stack just now. (Note that up until this point, only local variables have been modified, so it is impossible for anything that thread 1 has done so far to make a CAS fail in any other thread.) Meanwhile...
Thread 2 calls lstack_pop. pop succeeds and returns node, a pointer to the node that has just been excised from the stack; this is the same node that orig.node points to in thread 1. It then begins to call push in order to add node to the freelist. The freelist head node is atomically loaded, and node->next is set to point to the first node in the freelist.
Oops. This races with the read to orig.node->next in thread 1.
Could Wellons's implementation simply be wrong? I doubt it. If his implementation is wrong, then so is the Boost one, because the only way to fix (what appears to me to be) the race condition is to make next atomic. But I don't think the Boost implementation could be wrong in such a basic way without it having been noticed and fixed by now. So I must have made a mistake in my reasoning.
I just wrote a long text trying to explain why there cannot be a race, until I took a closer look at how Wellons implemented the freelist, and I came to the conclusion that you are correct!
The important point here is what you mentioned at the very beginning of your question:
This seems simple until we have to think about what to do with nodes after they've been popped. One strategy is to simply move them to a per-stack LIFO freelist until eventually all threads are done with the stack, at which point a single thread destroys all nodes in the stack and all nodes in the freelist.
But that is not how the freelist in Wellons's implementation works! Instead it tries to reuse nodes from the freelist, but then next also needs to be atomic, as you observed correctly. If the freelist had been implemented as you described, i.e., the popped nodes were added to some freelist unaltered (the freelist using a different pointer than next) and only released once the stack is no longer used by any thread, then next could be a plain variable, since it would not change once the node has been pushed successfully.
That does not necessarily mean that the Boost lock-free queue is also incorrect, but I don't know the code well enough to make a qualified statement about the Boost implementation.
FWIW, this is usually referred to as the memory reclamation problem. The freelist approach is a simple solution, though usually not practical for real-world scenarios. For a real-world scenario you probably want to use a memory reclamation scheme like hazard pointers or epoch-based reclamation. You can take a look at my xenium library, where I have implemented several different reclamation schemes as well as lock-free data structures that use them. More information about the memory reclamation problem and my implementations in xenium can also be found in my thesis, Effective memory reclamation for lock-free data structures in C++.
The key thing to note is that the next fields are read-only for every node that is currently in a linked list. next can only be modified once the node has been successfully removed from a list. Once that happens, the thread that removed it 'owns' it and no one else can sensibly look at it (they might read a value, but that value will be thrown away when their compare_and_set fails). So the owning thread can safely modify the next field as part of pushing the node onto another list.
In your hypothetical, you're missing the fact that the two pops (done by the two threads) cannot both succeed and return the same node. If two threads try to simultaneously pop, they might get the same node pointer, but one will fail in the compare_and_set atomic instruction and will loop back with a different node pointer.
This does require that read/write races are "safe" -- that is when you have a read/write race between two threads, the reader might get any value but won't otherwise fail (no trap or other undefined behavior), and won't otherwise interfere with the write, but that tends to be the case on most (all?) hardware. As long as the reader does not depend on the value read during a race, you can detect the race after the fact and disregard that value.
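To make the ownership argument concrete, here is a minimal Treiber-stack sketch (not Wellons's actual code) with a deliberately non-atomic next. Note that, exactly as described above, the discarded racy read is formally a data race in the C++ memory model, and this sketch also ignores the ABA problem that node reuse introduces:

#include <atomic>

struct node { node* next; int value; };  // next is deliberately NOT atomic

std::atomic<node*> top{nullptr};

void push(node* n) {
    node* old = top.load(std::memory_order_relaxed);
    do {
        n->next = old;  // safe: n is still private to this thread
    } while (!top.compare_exchange_weak(old, n,
                 std::memory_order_release, std::memory_order_relaxed));
}

node* pop() {
    node* old = top.load(std::memory_order_acquire);
    while (old) {
        // This read of old->next can race with a write by a thread that
        // already popped old; but in that case our CAS below fails and
        // the value read here is discarded, never acted upon.
        node* next = old->next;
        if (top.compare_exchange_weak(old, next, std::memory_order_acquire,
                                      std::memory_order_acquire))
            break;  // we own old now and may write its next field freely
    }
    return old;
}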

Multithreading on arrays / Do I need locking mechanisms here?

I am writing a multithreaded application. That application contains an array of length, let's say, 1000.
If I had two threads, and I made sure that thread 1 will only access elements 0-499 and thread 2 will only access elements 500-999, would I need a locking mechanism to protect the array, or would that be fine?
Note: only the content of the array will be changed during calculations! The array won't be moved, memcpy'd, or altered in any way other than changing elements inside it.
What you want to do is perfectly fine! Strategies like this (combined with a handful of low-level atomic primitives) are the basis of what's called lock-free programming.
That said, there are pitfalls when implementing such a solution: you have to strictly guarantee the properties you mentioned.
Make sure your in-memory data array never moves. You cannot rely on most std containers for this, as they can change significantly during modification: std::vector sometimes reallocates the whole buffer when inserting, invalidating every pointer into it, and node-based containers such as std::map rearrange their internal tree structure on modification.
Make sure that there is only one consumer and only one producer for any piece of data. Each consumer has to keep its iterator in a valid state so it never reads the same item twice or skips one; each producer must put data in a valid place, without the possibility of overwriting existing, not-yet-read data.
Violate any of these rules and you are back to needing mutexes.
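A minimal sketch of the disjoint-range scheme from the question (assuming a plain array whose storage never moves):

#include <array>
#include <thread>

std::array<double, 1000> data{};  // storage never moves

void work(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i != end; ++i)
        data[i] = static_cast<double>(i) * 0.5;  // touches only [begin, end)
}

int main() {
    std::thread t1(work, 0, 500);     // thread 1: elements 0-499
    std::thread t2(work, 500, 1000);  // thread 2: elements 500-999
    t1.join();
    t2.join();  // after both joins, all writes are visible to this thread
}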

Optimal strategy to make a C++ hash table, thread safe

(I am interested in design of implementation NOT a readymade construct that will do it all.)
Suppose we have a class HashTable (not hash-map implemented as a tree but hash-table)
and say there are eight threads.
Suppose read to write ratio is about 100:1 or even better 1000:1.
Case A) Only one thread is a writer, and all threads including the writer can read from the HashTable (they may simply iterate over the entire hash table)
Case B) All threads are identical and all could read/write.
Can someone suggest the best strategy to make the class thread safe, with the following considerations:
1. Top priority to least lock contention
2. Second priority to least number of locks
My understanding so far:
One BIG reader-writer lock (semaphore).
Specialize the semaphore so that there can be eight instances of the writer resource for case B, where each writer resource locks one row (or range, for that matter).
(So I guess 1 + 8 mutexes.)
Please let me know if I am thinking on the correct line, and how could we improve on this solution.
With such high read/write ratios, you should consider a lock free solution, e.g. nbds.
EDIT:
In general, lock free algorithms work as follows:
arrange your data structures such that for each function you intend to support there is a point at which you are able to, in one atomic operation, determine whether its results are valid (i.e. other threads have not mutated its inputs since they have been read) and commit to them; with no changes to state visible to other threads unless you commit. This will involve leveraging platform-specific functions such as Win32's atomic compare-and-swap or Cell's cache line reservation opcodes.
each supported function becomes a loop that repeatedly reads the inputs and attempts to perform the work, until the commit succeeds.
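The retry-loop shape described above, in its smallest possible form (a hypothetical read-modify-write on a single atomic int):

#include <atomic>

// Doubles x, unless it has already reached `limit`.
int double_if_below(std::atomic<int>& x, int limit) {
    int seen = x.load(std::memory_order_relaxed);  // read the inputs
    while (seen < limit) {
        // The commit point: succeeds only if x still holds `seen`.
        // On failure compare_exchange_weak refreshes `seen` and the
        // loop re-reads and re-decides.
        if (x.compare_exchange_weak(seen, seen * 2,
                std::memory_order_acq_rel, std::memory_order_relaxed))
            return seen * 2;
    }
    return seen;  // nothing to commit
}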
In cases of very low contention, this is a performance win over locking algorithms since functions mostly succeed the first time through without incurring the overhead of acquiring a lock. As contention increases, the gains become more dubious.
Typically the amount of data it is possible to atomically manipulate is small - 32 or 64 bits is common - so for functions involving many reads and writes, the resulting algorithms become complex and potentially very difficult to reason about. For this reason, it is preferable to look for and adopt a mature, well-tested and well-understood third party lock free solution for your problem in preference to rolling your own.
Hashtable implementation details will depend on various aspects of the hash and table design. Do we expect to be able to grow the table? If so, we need a way to copy bulk data from the old table into the new safely. Do we expect hash collisions? If so, we need some way of walking colliding data. How do we make sure another thread doesn't delete a key/value pair between a lookup returning it and the caller making use of it? Some form of reference counting, perhaps? - but who owns the reference? - or simply copying the value on lookup? - but what if values are large?
Lock-free stacks are well understood and relatively straightforward to implement (to remove an item from the stack, get the current top and attempt to replace it with its next pointer until you succeed, then return it; to add an item, get the current top and set it as the item's next pointer, until you succeed in writing a pointer to the item as the new top; on architectures with reserve/conditional-write semantics, this is enough; on architectures only supporting CAS you need to append a nonce or version number to the atomically manipulated data to avoid the ABA problem). They are one way of keeping track of free space for keys/data in an atomic, lock-free manner, allowing you to reduce a key/value pair (the data actually stored in a hashtable entry) to a pointer/offset or two, a small enough amount to be manipulated using your architecture's atomic instructions. There are others.
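Here is a sketch of that nonce/version-number trick for CAS-only architectures; names are illustrative, and whether a 16-byte std::atomic is actually lock-free is platform-dependent (e.g. it needs cmpxchg16b on x86-64; check is_lock_free()):

#include <atomic>
#include <cstdint>

struct node { node* next; /* payload */ };

struct tagged { node* ptr; std::uint64_t tag; };  // pointer + version

std::atomic<tagged> top{tagged{nullptr, 0}};

node* pop() {
    tagged old = top.load();
    while (old.ptr) {
        // Bump the version on every successful change, so a pointer that
        // was popped, recycled, and pushed back compares unequal (no ABA).
        tagged next{old.ptr->next, old.tag + 1};
        if (top.compare_exchange_weak(old, next))
            return old.ptr;
    }
    return nullptr;  // push is symmetric: CAS in {n, old.tag + 1}
}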
Reads then become a case of looking up the entry, checking the kvp against the requested key, doing whatever it takes to make sure the value will remain valid when we return it (taking a copy / increasing its reference count), checking the entry hasn't been modified since we began the read, returning the value if so, undoing any reference count changes and repeating the read if not.
Writes will depend on what we're doing about collisions; in the trivial case, they are simply a case of finding the correct empty slot and writing the new kvp.
The above is greatly simplified and insufficient to produce your own safe implementation, especially if you are not familiar with lock-free/wait-free techniques. Possible complications include the ABA problem, priority inversion, starvation of particular threads; I have not addressed hash collisions.
The nbds page links to an excellent presentation on a real world approach that allows growth / collisions. Others exist, a quick Google finds lots of papers.
Lock free and wait free algorithms are fascinating areas of research; I encourage the reader to Google around. That said, naive lock free implementations can easily look reasonable and behave correctly much of the time while in reality being subtly unsafe. While it is important to have a solid grasp on the principles, I strongly recommend using an existing, well-understood and proven implementation over rolling your own.
You may want to look at Java's ConcurrentHashMap implementation for one possible implementation.
The basic idea is NOT to lock for every read operation but only for writes. Since your question specifically mentions an extremely high read:write ratio, it makes sense to try to stuff as much of the overhead as possible into writes.
The ConcurrentHashMap divides the hashtable into so called "Segments" that are themselves concurrently readable hashtables and keep every single segment in a consistent state to allow traversing without locking.
When reading you basically have the usual hashmap get(), with the difference that you have to worry about reading stale values, so things like the value of the correct node, the first node of the segment table, and next pointers have to be volatile in Java terms (this was historically hard to do portably in C++, which had no standard memory model; since C++11, std::atomic provides the needed guarantees).
When putting a new element in there you get all the overhead, first of all having to lock the given segment. After locking it's basically a usual put() operation, but you have to guarantee atomic writes when updating the next pointer of a node (pointing to the newly created node whose next pointer has to be already correctly pointing to the old next node) or overwriting the value of a node.
When growing the segment, you have to rehash the existing nodes and put them into the new, larger table. The important part is to clone nodes for the new table so as not to influence the old table (by changing their next pointers too early) until the new table is complete and replaces the old one. (They use a clever trick there that means they only have to clone about 1/6 of the nodes; a nice trick, but I'm not really sure how they arrive at that number.)
Note that garbage collection makes this a whole lot easier, because you don't have to worry about the old nodes that weren't reused: as soon as all readers are finished with them they are automatically GCed. That's solvable without GC, though, but I'm not sure what the best approach would be.
I hope the basic idea is somewhat clear; obviously there are several points that aren't trivially ported to C++, but it should give you a good idea.
No need to lock the whole table; just have a lock per bucket. That immediately gives parallelism. Inserting a new node into the table requires a lock on the bucket whose head node is about to be modified. New nodes are always added at the head of a bucket's list, so that readers can iterate through the nodes without worrying about seeing new nodes.
Each node has an r/w lock of its own; readers iterating take the read lock, and node modification requires the write lock.
Removing a node found by iterating without the bucket lock requires an attempt to take the bucket lock; if that attempt fails, the remover must release its locks and retry, because the lock order is different and could otherwise deadlock.
That's a brief overview.
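A minimal sketch of the bucket-level locking, simplified to plain mutexes rather than the per-node r/w locks described above (all names are illustrative):

#include <list>
#include <mutex>
#include <optional>
#include <string>
#include <vector>

class LockedHashMap {
    struct Bucket {
        std::mutex m;
        std::list<std::pair<std::string, int>> items;
    };
    std::vector<Bucket> buckets_;

    Bucket& bucket_for(const std::string& key) {
        return buckets_[std::hash<std::string>{}(key) % buckets_.size()];
    }

public:
    explicit LockedHashMap(std::size_t n) : buckets_(n) {}

    void put(const std::string& key, int value) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> lk(b.m);  // writers lock only one bucket
        for (auto& kv : b.items)
            if (kv.first == key) { kv.second = value; return; }
        b.items.emplace_front(key, value);  // insert at the head, as above
    }

    std::optional<int> get(const std::string& key) {
        Bucket& b = bucket_for(key);
        std::lock_guard<std::mutex> lk(b.m);
        for (auto& kv : b.items)
            if (kv.first == key) return kv.second;
        return std::nullopt;
    }
};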
You can try atomic_hashtable for C:
https://github.com/Taymindis/atomic_hashtable for read, write, and delete without locking while multiple threads access the data. Simple and stable.
API documentation is given in the README.

Lock Free Queue -- Single Producer, Multiple Consumers

I am looking for a method to implement a lock-free queue data structure that supports a single producer and multiple consumers. I have looked at the classic method by Maged Michael and Michael Scott (1996), but their version uses linked lists. I would like an implementation that makes use of a bounded circular buffer. Something that uses atomic variables?
On a side note, I am not sure why these classic methods are designed around linked lists that require a lot of dynamic memory management. In a multi-threaded program, all memory management routines are serialized. Aren't we defeating the benefits of lock-free methods by using them in conjunction with dynamic data structures?
I am trying to code this in C/C++ using the pthread library on an Intel 64-bit architecture.
The use of a circular buffer makes a lock necessary, since blocking is needed to prevent the head from going past the tail. But otherwise the head and tail pointers can easily be updated atomically. Or in some cases the buffer can be so large that overwriting is not an issue. (in real life you will see this in automated trading systems, with circular buffers sized to hold X minutes of market data. If you are X minutes behind, you have wayyyy worse problems than overwriting your buffer).
When I implemented the MS queue in C++, I built a lock-free allocator using a stack, which is very easy to implement. If I have MSQueue<T>, then at compile time I know sizeof(MSQueue<T>::node). Then I make a stack of N buffers of the required size. The N can grow, i.e. if pop() returns null, it is easy to ask the heap for more blocks, and these are pushed onto the stack. Outside of the possibly blocking call for more memory, this is a lock-free operation.
Note that the T cannot have a non-trivial dtor. I worked on a version that did allow for non-trivial dtors, and it actually worked. But I found it was easier just to make the T a pointer to the T that I wanted, with the producer releasing ownership and the consumer acquiring it. This of course requires that the T itself is allocated using lock-free methods, but the same allocator I made with the stack works here as well.
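A sketch of that stack-based allocator pattern (illustrative, not the answerer's actual code; note that this naive pop suffers from the ABA problem discussed elsewhere on this page, which production versions avoid with a version-counted head):

#include <atomic>
#include <cstddef>
#include <new>

class BlockAllocator {
    struct FreeNode { FreeNode* next; };  // link lives inside the free block
    std::atomic<FreeNode*> head_{nullptr};
    std::size_t block_size_;

public:
    explicit BlockAllocator(std::size_t block_size)
        : block_size_(block_size < sizeof(FreeNode) ? sizeof(FreeNode)
                                                    : block_size) {}

    void* allocate() {
        FreeNode* old = head_.load(std::memory_order_acquire);
        while (old) {
            // WARNING: naive pop, subject to ABA if a block is freed and
            // reused between our load and this CAS; real versions pair
            // head_ with a version counter.
            if (head_.compare_exchange_weak(old, old->next,
                    std::memory_order_acquire, std::memory_order_acquire))
                return old;  // reuse a previously freed block
        }
        return ::operator new(block_size_);  // grow: the one blocking path
    }

    void deallocate(void* p) {
        FreeNode* n = static_cast<FreeNode*>(p);
        FreeNode* old = head_.load(std::memory_order_relaxed);
        do {
            n->next = old;
        } while (!head_.compare_exchange_weak(old, n,
                     std::memory_order_release, std::memory_order_relaxed));
    }
};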
In any case the point of lock-free programming is not that the data structures themselves are slower. The points are this:
lock free makes me independent of the scheduler. Lock-based programming depends on the scheduler to make sure that the holders of a lock are running so that they can release the lock; this dependence is what causes "priority inversion". (On Linux there are lock attributes, such as priority inheritance, to mitigate this.)
If I am independent of the scheduler, the OS has a far easier time managing timeslices, and I get far less context switching
it is easier to write correct multithreaded programs using lock-free methods, since I don't have to worry about deadlock, livelock, scheduling, synchronization, etc. This is especially true with shared-memory implementations, where a process could die while holding a lock in shared memory, and there is no way to release it.
lock free methods are far easier to scale. In fact, I have implemented lock free methods using messaging over a network. Distributed locks like this are a nightmare
That said, there are many cases where lock-based methods are preferable and/or required
when updating things that are expensive or impossible to copy. Most lock-free methods use some sort of versioning: make a copy of the object, update the copy, and check whether the shared version is still the same as when you copied it; if so, commit your copy as the current version, else copy it again, apply the update, and check again. Keep doing this until it works. This is fine when the objects are small, but if they are large, or contain file handles, etc., then it is not recommended.
Most types are impossible to access in a lock-free way, e.g. any STL container. These have invariants that require non-atomic access, for example assert(vector.size() == vector.end() - vector.begin()). So if you are updating/reading a vector that is shared, you have to lock it.
This is an old question, but no one has provided an accepted solution. So I offer this info for others who may be searching.
This website: http://www.1024cores.net
Provides some really useful lockfree/waitfree data structures with thorough explanations.
What you are seeking is a lock-free solution to the reader/writer problem.
See: http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem
For a traditional one-block circular buffer I think this simply cannot be done safely with atomic operations alone. You need to do so much in one read. Suppose you have a structure like this:
struct CircularBuffer {
    uint8_t* buf;
    unsigned int size;    // Maximum buffer capacity
    unsigned int length;  // Length of currently stored data (writes are assumed to keep this <= size)
    unsigned int offset;  // Start of currently stored data
};
On a read you need to do the following (this is how I implemented it, anyway; you can swap some steps, as I'll discuss afterwards):
Check if the read length does not surpass stored length
Check if the offset+read length do not surpass buffer boundaries
Read data out
Increase offset, decrease length
What should you certainly do synchronised (i.e. atomically) to make this work? Combine steps 1 and 4 into one atomic step, or to clarify: do the following synchronised:
check read_length; this can be something like read_length = min(read_length, length)
decrease length by read_length: length -= read_length
take a local copy of the offset: unsigned int local_offset = offset
increase offset by read_length: offset += read_length
Afterwards you can just do a memcpy (or whatever) starting from local_offset, checking whether the read runs past the circular buffer's end (and splitting it into two memcpys if it does). This is only 'quite' thread-safe: your write method could still write over the memory you are reading, so make sure your buffer is really large enough to minimize that possibility.
Now, while I can imagine you can combine steps 3 and 4 (I guess that's what they do in the linked-list case) or even 1 and 2 in atomic operations, I cannot see you doing this whole deal in one atomic operation :).
You can, however, try to drop the length check if your consumers are very smart and always know exactly what to read. You'd then also need a new woffset variable, because the old method of computing the write offset as (offset+length)%size wouldn't work anymore. Note that this is close to the linked-list case, where you actually always read one element (of fixed, known size) from the list. Also here, if you make it a circular linked list, you can read too much or write to a position you're reading at that moment!
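For reference, a sketch of the scheme just described, with steps 1 and 4 done under a small header lock and the copy done outside it (field names follow the struct above; the caveat about the writer catching up with the reader still applies):

#include <cstdint>
#include <cstring>
#include <mutex>

struct CircularBuffer {
    uint8_t* buf;
    unsigned int size;
    unsigned int length;
    unsigned int offset;
    std::mutex header_mutex;  // guards length/offset only, not the bytes

    unsigned int read(uint8_t* out, unsigned int read_length) {
        unsigned int local_offset;
        {
            std::lock_guard<std::mutex> lk(header_mutex);
            if (read_length > length) read_length = length;  // step 1
            length -= read_length;                           // step 4a
            local_offset = offset;                           // step 3
            offset = (offset + read_length) % size;          // step 4b
        }
        // The copy runs unlocked; split it in two if it wraps around.
        unsigned int first = read_length;
        if (local_offset + first > size) first = size - local_offset;
        std::memcpy(out, buf + local_offset, first);
        std::memcpy(out + first, buf, read_length - first);
        return read_length;
    }
};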
Finally: my advice is to just go with locks. I use a CircularBuffer class (completely safe for reading and writing) for a real-time 720p60 video streamer, and I have no speed issues at all from locking.
This is an old question, but no one has provided an answer that precisely addresses it. Given that it still comes up high in search results for (nearly) the same question, there should be an answer, given that one exists.
There may be more than one solution, but here is one that has an implementation:
https://github.com/tudinfse/FFQ
The conference paper referenced in the readme details the algorithm.

Design options for references into a thread safe cache when evicting older entries

I'm trying to design a simple cache that follows the following rules:
Entries have a unique key
When the number of entries in the cache exceeds a certain limit, the older items are evicted (to keep the cache from getting too large).
Each entry's data is immutable until the entry is removed from the cache.
A 'reader' can access an entry in the cache and the entry must be valid for the lifetime of the reader.
Each reader can be on its own thread, and all readers access the same cache instance.
Thread safety with this cache is important, since we don't want readers holding a reference to an entry, only to have it evicted by another thread somewhere else.
Hence, my current implementation just copies out the whole entry when reading from the cache. This is fine for smaller objects, but once objects get too large there's too much copying going on. It's also not so great with large numbers of readers that are accessing the same cached entry.
Since the data is immutable, it would be great if every reader to the same message could just hold a reference instead of a copy, but in some thread safe manner (so it wouldn't get evicted).
A previous implementation used reference counting to achieve this...but it's very tricky with threads, and I went with this simpler approach.
Are there any other patterns/ideas I could use to improve this design?
In a native system without a higher power (such as a VM) capable of performing garbage collection, you aren't going to do much better than reference counting, performance- or complexity-wise.
You are correct that reference counting can be tricky: not only do the increment and decrement have to be atomic, but you need to ensure that the object can't be deleted out from under you before you are able to increment the count. Thus, if you store the reference counter inside the object, you have to somehow avoid the race between reading the pointer to the object out of the cache and managing to increment the counter.
If your structure is a standard container, which is not already thread-safe, you will also have to protect the container from unsupported concurrent access. This protection can dovetail nicely with avoiding the reference counting race condition described above - if you use a read-writer lock to protect the structure, combined with atomic increments of the in-object reference counter while still holding the reader lock, you'll be protected from anyone deleting the object out from under you before you get the reference count, since such mutators must be "writers".
Here, objects can be evicted from the cache while still having a positive reference count; they will be destroyed when the last outstanding reference is dropped (by your smart pointer class). This is typically considered a feature, since it means that at least some object can always be removed from the cache, but it also has the downside that there is no strict upper bound on the number of objects "alive" in memory, since the reference counting allows objects to stay alive even after they've left the cache. Whether this is acceptable to you depends on your requirements and on details such as how long other threads may hold references to objects.
If you don't have access to atomic increment routines (std::atomic in C++11 and later, or platform intrinsics before that), you can use a mutex to do the atomic increment/decrement, although this may increase the cost significantly in both time and per-object space.
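A sketch of the scheme from the preceding paragraphs, using C++17's std::shared_mutex; all names are illustrative:

#include <atomic>
#include <shared_mutex>
#include <unordered_map>

struct Entry {
    std::atomic<int> refs{1};  // the cache itself holds one reference
    // ... immutable payload ...
    void acquire() { refs.fetch_add(1, std::memory_order_relaxed); }
    void release() {
        if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1) delete this;
    }
};

class Cache {
    std::shared_mutex m_;
    std::unordered_map<int, Entry*> map_;

public:
    Entry* find(int key) {  // caller must release() when done
        std::shared_lock<std::shared_mutex> lk(m_);
        auto it = map_.find(key);
        if (it == map_.end()) return nullptr;
        // Safe: eviction needs the writer lock, so the entry cannot be
        // deleted between the lookup and this increment.
        it->second->acquire();
        return it->second;
    }

    void evict(int key) {
        Entry* victim = nullptr;
        {
            std::unique_lock<std::shared_mutex> lk(m_);
            auto it = map_.find(key);
            if (it == map_.end()) return;
            victim = it->second;
            map_.erase(it);
        }
        victim->release();  // frees now, or when the last reader releases
    }
};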
If you want to get more exotic (and faster), you'll need to design a container which is itself threadsafe, and come up with a more complex reference counting mechanism. For example, you may be able to create a hash table where the primary bucket array is never re-allocated, so it can be accessed without locking. Furthermore, you can use non-portable double-wide CAS (compare and swap) operations on that array to both read a pointer and increment a reference count adjacent to it (128 bits of stuff on a 64-bit arch), allowing you to avoid the race mentioned above.
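A sketch of just the acquire side of that double-wide CAS idea (illustrative; release and eviction need the symmetric CAS loop, and lock-freedom of the 16-byte atomic is platform-dependent, so check is_lock_free()):

#include <atomic>
#include <cstdint>

struct Entry;  // some immutable cached object

struct Slot {
    Entry* ptr;
    std::uint64_t refs;  // adjacent count, CASed together with the pointer
};

Entry* acquire(std::atomic<Slot>& slot) {
    Slot old = slot.load();
    while (old.ptr) {
        // Pointer read and count increment commit in one 16-byte CAS, so
        // the entry cannot be deleted between the two operations.
        if (slot.compare_exchange_weak(old, Slot{old.ptr, old.refs + 1}))
            return old.ptr;
    }
    return nullptr;
}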
A completely different track would be to implement some kind of "delayed safe delete" strategy. Here avoid reference counting entirely. You remove references from your cache, but do not delete objects immediately, since other threads may still hold pointers to the object. Then later at some "safe" time you delete the object. Of course, the trick is discover when such a safe time exists. Basic strategies involve each thread signaling when they "enter" and "leave" a danger zone during which they may access the cache and hold references to contained objects. Once all threads which were in the danger zone when an object was removed from the cache have left the danger zone, you can free the object while being sure that no more references are held.
How practical this is depends on whether you have logical "enter" and "leave" points in your application (many request-oriented applications will), and whether the "enter" and "leave" costs can be amortized across many cache accesses. The upside is no reference counting! Of course, you still need a thread-safe container.
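A toy sketch of such a delayed-delete ("epoch") scheme; everything here is illustrative and glosses over details that real implementations (hazard pointers, epoch-based reclamation) get right:

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

constexpr int kMaxThreads = 16;
std::atomic<std::uint64_t> g_epoch{1};
std::atomic<std::uint64_t> g_active[kMaxThreads];  // zero-initialized; 0 = outside the zone

struct Retired { void* p; void (*del)(void*); std::uint64_t epoch; };
std::mutex g_limbo_mu;
std::vector<Retired> g_limbo;

void enter_zone(int tid) { g_active[tid].store(g_epoch.load()); }
void leave_zone(int tid) { g_active[tid].store(0); }  // drop all references first!

// Call only after the object has been removed from the cache, so that no
// reader entering later can find it.
void retire(void* p, void (*del)(void*)) {
    std::lock_guard<std::mutex> lk(g_limbo_mu);
    g_limbo.push_back({p, del, g_epoch.fetch_add(1)});
}

void try_reclaim() {
    std::uint64_t oldest = UINT64_MAX;  // oldest epoch of any active reader
    for (auto& a : g_active) {
        std::uint64_t v = a.load();
        if (v && v < oldest) oldest = v;
    }
    std::lock_guard<std::mutex> lk(g_limbo_mu);
    for (std::size_t i = 0; i < g_limbo.size();) {
        // Safe to free: retired before every currently active reader entered.
        if (g_limbo[i].epoch < oldest) {
            g_limbo[i].del(g_limbo[i].p);
            g_limbo[i] = g_limbo.back();
            g_limbo.pop_back();
        } else {
            ++i;
        }
    }
}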
You can find references to many academic papers on the topic and some practical performance considerations by examining the papers linked here.
I think you effectively want a reader/writer lock per entry. Readers read lock and unlock as they are using it. The eviction thread has to obtain a write lock (which forces all readers to complete before it can be acquired). There needs to be some way for a reader to then tell (before acquiring a read lock) whether the entry in question has been evicted concurrently.
On the downside, one lock per entry is expensive for a big cache (in terms of memory). You can address that by using one lock across a set of entries; this trades off memory vs. concurrency. You need to be careful about deadlock scenarios in that case.
Sounds like a monitor with a std::map as the buffer would be useful in this situation.
I'm thinking that if you want to share a reference, you'll need to keep a count. So long as you use the interlocked inc/dec functions, this should be simple enough even for multiple threads.
It seems to me that the reference counting solution is only tricky in that the updating and testing-for-eviction of the reference counters must happen inside a critical section protected by a mutex. As long as no more than one thread accesses the reference counters at a time, it should be thread safe.
Have a circular queue, and don't allow multiple threads to write to it or the cache will be useless. Each thread should have its own cache, maybe with read access to the other caches but not write access.