Atomic shared_ptr for lock-free singly linked list - c++

I'm wondering if it is possible to create a lock-free, thread-safe shared pointer for any of the "common" architectures, like x64 or ARMv7 / ARMv8.
In a talk about lock-free programming at cppcon2014, Herb Sutter presented a (partial) implementation of a lock-free singly linked list. The implementation looks quite simple, but it relies on an atomic shared_ptr implementation that doesn't exist in the standard library yet or on using the specialized std::atomic... functions. This is especially important as single push/pop calls potentially invoke multiple atomic loads/stores and compare_exchange operations.
The problem I see (and I think some of the questions in the talk went into the same direction) is that for this to be an actual lock-free data structure, those atomic operations would have to be lock-free themselves. I don't know of any standard library implementation for the std::atomic... functions that is lock-free and - at least with a short google / SO search - I also didn't find a suggestion of how to implement a lock-free specialization for std::atomic<std::shared_ptr>.
Now before I'm wasting my time on this I wanted to ask:
Do you know, if it is possible to write an lockfree, atomic shared pointer at all?
Are there already any implementations that I've overlooked and - ideally - are even compatible with what you would expect from a std::atomic<std::shared_ptr>? For the mentioned queue it would especially require a CAS-operation.
If there is no way to implement this on current architectures, do you see any other benefit in Herb's implementation compared to a "normal" linked list that is protected by a lock?
For reference, here is the code from Herb Sutter (might contain typos from me):
template<class T>
class slist {
struct Node { T t; std::shared_ptr<Node> next; };
std::atomic<std::shared_ptr<Node>> head;
public:
class reference{
std::shared_ptr<Node> p;
public:
reference(std::shared_ptr<Node> p_){}
T& operator*(){ return p->t; }
T* operator->(){ return &p->t; }
};
auto find(T t) const {
auto p = head.load();
while (p && p-> != t) {
p = p - next;
}
return reference(move(p));
}
void push_front(T t) {
auto p = std::make_shared<Node>();
p->t = t;
p->next = head;
while (!head.compare_exchange_weak(p->next, p)) {}
}
void pop_front() {
auto p = head.load();
while (p && !head.compare_exchange_weak(p, p - next)) { ; }
}
};
Note, that in this implementation, single instances of a shared_ptr can be accessed / modified by multiple different threads. It can be read/copied, reset and even deleted (as part of a node). So this not about whether multiple different shared_ptr objects (that manage the same object) can be used by multiple threads without a race condition - this that is already true for current implementations and required by the standard - but it is about concurrent access to a single pointer instance, which is - for standard shared pointers - no more threadsafe than the same operations on raw pointers would be.
To explain my motivation:
This is mainly an academic question. I've no intention of implementing my own lock free list in production code, but I find the topic interesting and at first glance, Herb's presentation seemed to be a good introduction. However, while thinking about this question and #sehe's comment on my answer, I remembered this talk, had another look at it and realized that it doesn't make much sense to call Herb's implementation lock-free, if it's primitive operations require locks (which they currently do). So I was wondering, whether this is just a limitation of the current implementations or a fundamental flaw in the design.

I'm adding this as an answer since it's too long to fit in a comment:
Something to consider. A lock-free shared_ptr is not needed to implement lock-free/wait-free data structures.
The reason Sutter uses shared_ptr in his presentation is because the most complicated part of writing lock-free data structures is not the synchronization, but the memory reclamation: we cannot delete nodes while they're potentially accessed by other threads, so we have to leak them and reclaim later. A lock-free shared_ptr implementation essentially provides "free" memory reclamation and makes examples of lock-free code palatable, especially in a context of a time-limited presentation.
Of course, having a lock-free atomic_shared_ptr as part of the Standard would be a huge help. But it's not the only way of doing memory reclamation for lock-free data structures, there's the naive implementation of maintaining a list of nodes to be deleted at quiescent points in execution (works in low-contention scenarios only), hazard pointers, roll-your-own atomic reference counting using split counts.
As for performance, #mksteve is correct, lock-free code in not guaranteed to outperform lock-based alternatives unless maybe it runs on a highly parallel system offering true concurrency. It's goal is to enable maximum concurrency and because of that what we typically get is threads doing less waiting at the the cost of performing more work.
PS If this is something that interests you, you should consider taking a look at C++ Concurrency in Action by Anthony Williams. It dedicates a whole chapter to writing lock-free/wait-free code, which offers a good starting place, walking through implementations of lock-free stack and queue.

Do you know, if it is possible to write an lockfree, atomic shared
pointer at all?
Are there already any implementations that I've
overlooked and - ideally - are even compatible with what you would
expect from a std::atomic?
I think the std::atomic_... offers a form of implementation, where the slist would perform special atomic_ queries on the shared_ptr. The problem with this being separated into two classes (std::atomic and std::shared_ptr) is that they each have constraints which need to be adhered to in order to function. The class separation, makes that knowledge of shared constaints impossible.
Within slist, which knows both items, it can help the situation, and thus probably the atomic_... functions will work.
If there is no way to implement this on current architectures, do you
see any other benefit in Herb's implementation compared to a "normal"
linked list that is protected by a lock?
From Wikipedia : Non blocking algorithm the purpose of the lock free nature, is to guarantee some progress is being made by at least one thread.
This does not give a guarantee of better performance than a locked implementation, but does give a guarantee that deadlocks will not occur.
Imagine T required a lock to perform a copy, this could also have been owned by some operations outside of the list. Then a deadlock would be possible, if it was owned, and a lock based implementation of slist was called.
I think CAS is implemented in the std::compare_exchange_weak, so would be implementation independent.
Current lock free algorithms for complex structures (e.g vector, map) tend to be significantly less efficient than locking algorithms, Dr Dobbs : lock-free data structures but the benefit offered (improved thread performance) would improve significantly the performance of computers, which tend to have large numbers of idle cpus.
Further research into the algorithms may identify new instructions which could be implemented in the CPUs of the future, to give us wait-free performance and improved utilization of computing resources.

It is possible to write a lock-free shared ptr as the only thing that needs changing is the count. The ptr itself is only copied so no special care needed here. When deleting, this must be the last instance, so no other copies exist in other threads so nobody would increment in the same time.
But having said that, std::atomic> wuold be a very specialized thing as it's not exactly a primitive type.
I've seen a few implementations of lock-free lists but none of them was of shared pointers. These containers usually have a special purpose and therefore there is an agreement around their usage (when/who creates/deletes) so using shared pointers is not required.
Also, shared pointers introduce an overhead that is contrary to our low latency goals that brought us to the lock-free domain in the first place.
So, back to your question- I think it is possible, but I don't see why do that.
If you really need something like that, a refCount member variable would serve better.
I see no specific benefit in Herb's specific implementation, maybe except the academic one, but lock-free lists have the obvious motivation of not having a lock. They often serve as queues or just to share a collection of nodes between threads that are allergic to locks.
Maybe we should ask Herb.. Herb? are you listening?
EDIT:
Following all the comments below, I've implemented a lock-free singly linked list. The list is fairly complex to prevent shared ptrs from being deleted while they are accessed. It is too big to post here but here are the main ideas:
- The main idea is to stash removed entries in a separate place - a garbage collector - to make them inaccessible to later actions.
- An atomic ref count is incremented on entry to every function (push_front, pop_front, and front) and auto-decremented on exit. On decrementing to zero a version counter is incremented. All in one atomic instruction.
- When a shared ptrs needs to be erases, in pop_front, it is pushed into a GC. There's a GC per version number. The GC is implemented using a simpler lock-free list that can only push_front or pop_all. I've created a circular buffer of 256 GCs, but some other scheme can be applied.
- A version's GC is flushed on version increment and then shared ptrs delete the holders.
So, if you call pop_front, without anything else running, the ref count is incremented to 1, the front shared ptr is pushed into GC[0], ref count back to zero and version to 1, GC[0] is flushed - it decrements the shared ptr we popped and possibly deletes the object it owns.
Now, wrt a lock-free shared_ptr. I believe this is doable. Here are the ideas I thought of:
- You can have a spin lock of sorts using the low bits of the pointer to the holder, so you can dereference it only after you've locked it. You can use different bit for inc/dec etc. This is much better than lock on the entire thing.
The problem here is that the shared ptr itself can be deleted so whatever contains it would have to provide some protection from outside, like the linked list.
- You can have some central registry of shared pointers. This does not suffer from the problem above, but would be challenging to scale up without latency spikes once in a while.
To summarize, I currently think this whole idea is moot. If you find some other approach that does not suffer from big problems - I'll be very curious to know about it :)
Thanks!

Related

What is C++20 std::atomic<shared_ptr<T>> and std::atomic<weak_ptr<T>>? [duplicate]

I read the following article by Antony Williams and as I understood in addition to the atomic shared count in std::shared_ptr in std::experimental::atomic_shared_ptr the actual pointer to the shared object is also atomic?
But when I read about reference counted version of lock_free_stack described in Antony's book about C++ Concurrency it seems for me that the same aplies also for std::shared_ptr, because functions like std::atomic_load, std::atomic_compare_exchnage_weak are applied to the instances of std::shared_ptr.
template <class T>
class lock_free_stack
{
public:
void push(const T& data)
{
const std::shared_ptr<node> new_node = std::make_shared<node>(data);
new_node->next = std::atomic_load(&head_);
while (!std::atomic_compare_exchange_weak(&head_, &new_node->next, new_node));
}
std::shared_ptr<T> pop()
{
std::shared_ptr<node> old_head = std::atomic_load(&head_);
while(old_head &&
!std::atomic_compare_exchange_weak(&head_, &old_head, old_head->next));
return old_head ? old_head->data : std::shared_ptr<T>();
}
private:
struct node
{
std::shared_ptr<T> data;
std::shared_ptr<node> next;
node(const T& data_) : data(std::make_shared<T>(data_)) {}
};
private:
std::shared_ptr<node> head_;
};
What is the exact difference between this two types of smart pointers, and if pointer in std::shared_ptr instance is not atomic, why it is possible the above lock free stack implementation?
The atomic "thing" in shared_ptr is not the shared pointer itself, but the control block it points to. meaning that as long as you don't mutate the shared_ptr across multiple threads, you are ok. do note that copying a shared_ptr only mutates the control block, and not the shared_ptr itself.
std::shared_ptr<int> ptr = std::make_shared<int>(4);
for (auto i =0;i<10;i++){
std::thread([ptr]{ auto copy = ptr; }).detach(); //ok, only mutates the control block
}
Mutating the shared pointer itself, such as assigning it different values from multiple threads, is a data race, for example:
std::shared_ptr<int> ptr = std::make_shared<int>(4);
std::thread threadA([&ptr]{
ptr = std::make_shared<int>(10);
});
std::thread threadB([&ptr]{
ptr = std::make_shared<int>(20);
});
Here, we are mutating the control block (which is ok) but also the shared pointer itself, by making it point to a different values from multiple threads. This is not ok.
A solution to that problem is to wrap the shared_ptr with a lock, but this solution is not so scalable under some contention, and in a sense, loses the automatic feeling of the standard shared pointer.
Another solution is to use the standard functions you quoted, such as std::atomic_compare_exchange_weak. This makes the work of synchronizing shared pointers a manual one, which we don't like.
This is where atomic shared pointer comes to play. You can mutate the shared pointer from multiple threads without fearing a data race and without using any locks. The standalone functions will be members ones, and their use will be much more natural for the user. This kind of pointer is extremely useful for lock-free data structures.
N4162(pdf), the proposal for atomic smart pointers, has a good explanation. Here's a quote of the relevant part:
Consistency. As far as I know, the [util.smartptr.shared.atomic]
functions are the only atomic operations in the standard that
are not available via an atomic type. And for all types
besides shared_ptr, we teach programmers to use atomic types
in C++, not atomic_* C-style functions. And that’s in part because of...
Correctness. Using the free functions makes code error-prone
and racy by default. It is far superior to write atomic once on
the variable declaration itself and know all accesses
will be atomic, instead of having to remember to use the atomic_*
operation on every use of the object, even apparently-plain reads.
The latter style is error-prone; for example, “doing it wrong” means
simply writing whitespace (e.g., head instead of atomic_load(&head) ),
so that in this style every use of the variable is “wrong by default.” If you forget to
write the atomic_* call in even one place, your code will still
successfully compile without any errors or warnings, it will “appear
to work” including likely pass most testing, but will still contain a
silent race with undefined behavior that usually surfaces as intermittent
hard-to-reproduce failures, often/usually in the field,
and I expect also in some cases exploitable vulnerabilities.
These classes of errors are eliminated by simply declaring the variable atomic,
because then it’s safe by default and to write the same set of
bugs requires explicit non-whitespace code (sometimes explicit
memory_order_* arguments, and usually reinterpret_casting).
Performance. atomic_shared_ptr<> as a distinct type
has an important efficiency advantage over the
functions in [util.smartptr.shared.atomic] — it can simply store an
additional atomic_flag (or similar) for the internal spinlock
as usual for atomic<bigstruct>. In contrast, the existing standalone functions
are required to be usable on any arbitrary shared_ptr
object, even though the vast majority of shared_ptrs will
never be used atomically. This makes the free functions inherently
less efficient; for example, the implementation could require
every shared_ptr to carry the overhead of an internal spinlock
variable (better concurrency, but significant overhead per
shared_ptr), or else the library must maintain a lookaside data
structure to store the extra information for shared_ptrs that are
actually used atomically, or (worst and apparently common in
practice) the library must use a global spinlock.
Calling std::atomic_load() or std::atomic_compare_exchange_weak() on a shared_ptr is functionally equivalent to calling atomic_shared_ptr::load() or atomic_shared_ptr::atomic_compare_exchange_weak(). There shouldn't be any performance difference between the two. Calling std::atomic_load() or std::atomic_compare_exchange_weak() on a atomic_shared_ptr would be syntactically redundant and might or might not incur a performance penalty.
atomic_shared_ptr is an API refinement. shared_ptr already supports atomic operations, but only when using the appropriate atomic non-member functions. This is error-prone, because the non-atomic operations remain available and are too easy for an unwary programmer to invoke by accident. atomic_shared_ptr is less error-prone because it doesn't expose any non-atomic operations.
shared_ptr and atomic_shared_ptr expose different APIs, but they don't necessarily need to be implemented differently; shared_ptr already supports all the operations exposed by atomic_shared_ptr. Having said that, the atomic operations of shared_ptr are not as efficient as they could be, because it must also support non-atomic operations. Therefore there are performance reasons why atomic_shared_ptr could be implemented differently. This is related to the single responsibility principle. "An entity with several disparate purposes... often offers crippled interfaces for any of its specific purposes because the partial overlap among various areas of functionality blurs the vision needed for crisply implementing each." (Sutter & Alexandrescu 2005, C++ Coding Standards)

Choosing an Appropriate FIFO Data Structure

I need a FIFO structure that supports indexing. Each element is an array of data that is saved off a device I'm reading from. The FIFO has a constant size, and at start-up each element is zeroed out.
Here's some pseudo code to help understand the issue:
Thread A (Device Reader):
1. Lock the structure.
2. Pop oldest element off of FIFO (don't need it).
3. Read next array of data (note this is a fixed size array) from the device.
4. Push new data array onto the FIFO.
5. Unlock.
Thread B (Data Request From Caller):
1. Lock the structure.
2. Determine request type.
3. if (request = one array) memcpy over the latest array saved (LIFO).
4. else memcpy over the whole FIFO to the user as a giant array (caller uses arrays).
5. Unlock.
Note that the FIFO shouldn't be changed in Thread B, the caller should just get a copy, so data structures where pop is destructive wouldn't necessarily work without an intermediate copy.
My code also has a boost dependency already and I am using a lockfree spsc_queue elsewhere. With that said, I don't see how this queue would work for me here given the need to work as a LIFO in some cases and also the need to memcpy over the entire FIFO at times.
I also considered a plain std::vector, but I'm worried about performance when I'm constantly pushing and popping.
One point not clear in the question is the compiler target, whether or not the solution is restricted to partial C++11 support (like VS2012), or full support (like VS2015). You mentioned boost dependency, which lends similar features to older compilers, so I'll rely on that and speak generally about options on the assumption that boost may provide what a pre-C++11 compiler may not, or you may elect C++11 features like the now standardized mutex, lock, threads and shared_ptr.
There's no doubt in my mind that the primary tool for the FIFO (which, as you stated, may occasionally need LIFO operation) is the std::deque. Even though the deque supports reasonably efficient dynamic expansion and shrinking of storage, contrary to your primary requirement of a static size, it's main feature is the ability to function as both FIFO and LIFO with good performance in ways vectors can't as easily manage. Internally most implementations provide what may be analogized as a collection of smaller vectors which are marshalled by the deque to function as if a single vector container (for subscripting) while allowing for double ended pushing and popping with efficient memory management. It can be tempting to use a vector, employing a circular buffer technique for fixed sizes, but any performance improvement is minimal, and deque is known to be reliable.
Your point regarding destructive pops isn't entirely clear to me. That could mean several things. std::deque offers back and front as a peek to what's at the ends of the deque, without destruction. In fact, they're required to look because deque's pop_front and pop_back only remove elements, they don't provide access to the element being popped. Taking an element and popping it is a two step process on std::deque. An alternate meaning, however, is that a read only requester needs to pop strictly as a means of navigation, not destruction, which is not really a pop, but a traversal. As long as the structure is under lock, that is easily managed with iterators or indexes. Or, it could also mean you need a independent copy of the queue.
Assuming some structure representing device data:
struct DevDat { .... };
I'm immediately faced with that curious question, should this not be a generic solution? It doesn't matter for the sake of discussion, but it seems the intent is an odd combination of application specific operation and a generalized thread-safe stack "machine", so I'll suggest a generic solution which is easily translated otherwise (that is, I suggest template classes, but you could easily choose non-templates if preferred). These psuedo code examples are sparse, just illustrating container layout ideas and proposed concepts.
class SafeStackBase
{ protected: std::mutex sync;
};
template <typename Element>
class SafeStack : public SafeStackBase
{ public:
typedef std::deque< Element > DeQue;
private:
DeQue que;
};
SafeStack could handle any kind of data in the stack, so that detail is left for Element declaration, which I illustrate with typedefs:
typedef std::vector< DevDat > DevArray;
typedef std::shared_ptr< DevArray > DevArrayPtr;
typedef SafeStack< DevArrayPtr > DeviceQue;
Note I'm proposing vector instead of array because I don't like the idea of having to choose a fixed size, but std::array is an option, obviously.
The SafeStackBase is intended for code and data that isn't aware of the users data type, which is why the mutex is stored there. It could easily part of the template class, but the practice of placing non-type aware data and code in a non-template base helps reduce code bloat when possible (functions which don't use Element, for example, need not be expanded in template instantiations). I suggest the DevArrayPtr so that the arrays can be "plucked out" of the queue without copying the arrays, then shared and distributed outside the structure under shared_ptr's shared ownership. This is a matter of illustration, and does not adequately deal with questions regarding content of those arrays. That could be managed by DevDat, which could marshal reading of the array data, while limiting writing of the array data to an authorized friend (a write accessor strategy), such that Thread B (a reader only) is not carelessly able to modify the content. In this way it's possible to provide these arrays without copying data..just return a copy of the DevArrayPtr for communal access to the entire array. This also supports returning a container of DevArrayPtr's supporting ThreadB point 4 (copy the whole FIFO to the user), as in:
typedef std::vector< DevArrayPtr > QueArrayVec;
typedef std::deque< DevArrayPtr > QueArrayDeque;
typedef std::array< DevArrayPtr, 12 > QueArrays;
The point is that you can return any container you like, which is merely an array of pointers to the internal std::array< DevDat >, letting DevDat control read/write authorization by requiring some authorization object for writing, and if this copy should be operable as a FIFO without potential interference with Thread A's write ownership, QueArrayDeque provides the full feature set as an independent FIFO/LIFO structure.
This brings up an observation about Thread A. There you state lock is step 1, while unlock is step 5, but I submit that only steps 2 and 4 are really required under lock. Step 3 can take time, and even if you assume that is a short time, it's not as short as a pop followed by a push. The point is that the lock is really about controlling the FIFO/LIFO queue structure, and not about reading data from the device. As such, that data can be fashioned into DevArray, which is THEN provided to SafeStack to be pop/pushed under lock.
Assume code inside SafeStack:
typedef std::lock_guard< std::mutex > Lock; // I use typedefs a lot
void StuffIt( const Element & e )
{ Lock l( sync );
que.pop_front();
que.push_back( e );
}
StuffIt does that simple, generic job of popping the front, pushing the back, under lock. Since it takes an const Element &, step 3 of Thread A is already done. Since Element, as I suggest, is a DevArrayPtr, this is used with:
DeviceQue dq;
auto p = std::make_shared<DevArray>();
dq.StuffIt( p );
How the DevArray is populated is up to it's constructor or some function, the point is that a shared_ptr is used to transport it.
This brings up a more generic point about SafeStack. Obviously there is some potential for standard access functions, which could mimic std::deque, but the primary job for SafeStack is to lock/unlock for access control, and do something while under lock. To that end, I submit a generic functor is sufficient to generalize the notion. The preferred mechanics, especially with respect to boost, is up to you, but something like (code inside SafeStack):
bool LockedFunc( std::function< bool(DevQue &)> f )
{
Lock l( sync );
f( que );
}
Or whatever mechanics you like for calling a functor taking a DevQue as a parameter. This means you could fashion callbacks with complete access to the deque (and it's interface) while under lock, or provide functors or lambdas which perform specific tasks under lock.
The design point is to make SafeStack small, focused on that minimal task of doing a few things under lock, taking most any kind of data in the queue. Then, using that last point, provide the array under shared_ptr to provide the service of Thread B steps 3 and 4.
To be clear about that, keep in mind that whatever is done to the shared_ptr to copy it is similar to what can be done to simple POD types, like ints, with respect to containers. That is, one could loop through the elements of the DevQue fashioning a copy of those elements into another container in the same code which would do that for a container of integers (remember, it's a member function of a template - that type is generic). The resulting work is only copying pointers, which is less effort than copying entire arrays of data.
Now, step 4 isn't QUITE clear to me. It appears to say that you need to return a DevArray which is the accumulated content of all entries in the queue. That's trivial to arrange, but it might work a little better with a vector (as that's dynamically expandable), but as long as the std::array has sufficient room, it's certainly possible.
However, the only real difference between such an array and the queue's native "array of arrays" is how it is traversed (and counted). Returning one Element (step 3) is quick, but since step 4 is indicated under lock, that's a bit more than most locked functions should really do if they don't have to.
I'd suggest SafeStack should be able to provide a copy of que (a DeQue typedef), which is quick. Then, outside of the lock, Thread B has a copy of the DeQue ( a std::deque< DevArrayPtr > ) to fashion into it's own "giant array".
Now, more about that array. To this point I've not adequately dealt with marshalling it. I've just suggested that DevDat does that, but this may not be adequate. Certainly the content of the std::array or std::vector conveying a collection of DevDats could be written. Perhaps that deserves it's own outer structure. I'll leave that to you, because the point I've made is that SafeStack is now focused on it's small task (lock/access/unlock) and can take anything which can be owned by a share_ptr (or POD's and copyable objects). In the same way SafeStack is an outer shell marshalling a std::deque with a mutex, some similar outer shell could marshal read only access to the std::vector or std::array of DevDats, with a kind of write accessor used by Thread A. That could be a simple as something that only allows construction of the std::array to create it's content, after which read only access could be all that's provided.
I would suggest you to use boost::circular_buffer which is a fixed size container that supports random access iteration, constant time insert and erase at the beginning and end. You can use it as a FIFO with push_back(), read back() for the latest data saved and iterate over the whole container via begin(), end() or using operator[].
But at start-up the elements are not zeroed out. It has in my opinion an even more convenient interface. The container is empty at first and insertion will increase size until it reaches max size.

Regarding mark-sweep ( lazy approach ) for garbage collection in C++?

I know reference counter technique but never heard of mark-sweep technique until today, when reading the book named "Concepts of programming language".
According to the book:
The original mark-sweep process of garbage collection operates as follow: The runtime system allocates storage cells as requested and disconnects pointers from cells as necessary, without regard of storage reclamation ( allowing garbage to accumulate), until it has allocated all available cells. At this point, a mark-sweep process is begun to gather all the garbage left floating-around in the heap. To facilitate the process, every heap cells has an extra indicator bit or field that is used by the collection algorithm.
From my limited understanding, smart-pointers in C++ libraries use reference counting technique. I wonder is there any library in C++ using this kind of implementation for smart-pointers? And since the book is purely theoretical, I could not visualize how the implementation is done. An example to demonstrate this idea would be greatly valuable. Please correct me if I'm wrong.
Thanks,
There is one difficulty to using garbage collection in C++, it's to identify what is pointer and what is not.
If you can tweak a compiler to provide this information for each and every object type, then you're done, but if you cannot, then you need to use conservative approach: that is scanning the memory searching for any pattern that may look like a pointer. There is also the difficulty of "bit stuffing" here, where people stuff bits into pointers (the higher bits are mostly unused in 64 bits) or XOR two different pointers to "save space".
Now, in C++0x the Standard Committee introduced a standard ABI to help implementing Garbage Collection. In n3225 you can find it at 20.9.11 Pointer safety [util.dynamic.safety]. This supposes that people will implement those functions for their types, of course:
void declare_reachable(void* p); // throw std::bad_alloc
template <typename T> T* undeclare_reachable(T* p) noexcept;
void declare_no_pointers(char* p, size_t n) noexcept;
void undeclare_no_pointers(char* p, size_t n) noexcept;
pointer_safety get_pointer_safety() noexcept;
When implemented, it will authorize you to plug any garbage collection scheme (defining those functions) into your application. It will of course require some work of course to actually provide those operations wherever they are needed. One solution could be to simply override new and delete but it does not account for pointer arithmetic...
Finally, there are many strategies for Garbage Collection: Reference Counting (with Cycle Detection algorithms) and Mark And Sweep are the main different systems, but they come in various flavors (Generational or not, Copying/Compacting or not, ...).
Although they may have upgraded it by now, Mozilla Firefox used to use a hybrid approach in which reference-counted smart pointers were used when possible, with a mark-and-sweep garbage collector running in parallel to clean up reference cycles. It's possible other projects have adopted this approach, though I'm not fully sure.
The main reason that I could see C++ programmers avoiding this type of garbage collection is that it means that object destructors would run asynchronously. This means that if any objects were created that held on to important resources, such as network connections or physical hardware, the cleanup wouldn't be guaranteed to occur in a timely fashion. Moreover, the destructors would have to be very careful to use appropriate synchronization if they were to access shared resources, while in a single-threaded, straight reference-counting solution this wouldn't be necessary.
The other complexity of this approach is that C++ allows for raw arithmetic operations on pointers, which greatly complicates the implementation of any garbage collector. It's possible to conservatively solve this problem (look at the Boehm GC, for example), though it's a significant barrier to building a system of this sort.

Returning an iterator in a multi threaded environment, a good idea?

Is it a good idea to return an iterator on a list in an object that is used and shared in a multi threaded environment?
class RequestList
{
public:
RequestList::RequestList();
RequestList::~RequestList();
std::list<boost::shared_ptr<Request> >::iterator GetIterator();
int ListSize();
void AddItem(boost::shared_ptr<Request> request);
void RemoveItem(boost::shared_ptr<Request> request);
std::list<boost::shared_ptr<Request> > GetRequestsList();
boost::shared_ptr<Request> GetRequest();
private:
std::list<boost::shared_ptr<Request> > requests;
std::list<boost::shared_ptr<Request> >::iterator iter; //Iterator
boost::mutex listmtx;
};
std::list<boost::shared_ptr<Request> >::iterator RequestList::GetIterator()
{
return this->iter;
}
USE:
RequestList* requests;
In some thread (may be used again in other threads)
std::list<boost::shared_ptr<Request> >::iterator iter = requests->GetIterator();
Or would it be smarter to just create an iterator for that list each time and use it locally within each thread?
No it is usually not a good idea to share an iterator across threads. There are a couple of ways to make sure you don't get in trouble.
First off, an iterator is a light-weight object which is fast to construct and takes up very little memory. So you should not be concerned about any performance-issues. Just create an instance whenever you need one.
That said you do have to make sure that your list is not altered when you are iterating. I see you have a boost::mutex on in your class. Locking that will be perfect for ensuring that you don't get any problems when iterating.
A different and equally valid way of handling these situations is to simply copy the internal list and iterate that. This is a good solution if you require that the list is continually updated and you don't want other threads waiting. Of course it takes up a bit more memory, but since you are storing smart pointers, it will hardly be anything at all.
Depends how the list is used, but from what you've shown it looks wrong. The iterator becomes invalid if the element it refers to is removed: the element in this case being the shared_ptr object in the list.
As soon as you release the mutex, I guess some other thread could come along and remove that element. You haven't shown code that does it, but if it can happen, then iterators shouldn't "escape" the mutex.
I assume this is a "self-synchronizing" container, since the mutex is private and there's nothing in the API to lock it. The fundamental difficulty with such things, is that it's not thread-safe to perform any kind of iteration on them from the outside. It's easy enough to provide a thread-safe queue, that supports:
adding an element,
removing an element by value,
removing the head and returning a copy of its value.
Beyond that, it's harder to provide useful basic operations, because almost anything that manipulates the list in any interesting way needs to be done entirely under the lock.
By the looks of things, you can copy the list with GetRequestsList, and iterate over the copy. Not sure whether it will do you any good, since the copy is instantly out of date.
Accessing the list via iterators in multiple threads where the main list itself is not locked is dangerous.
There's no guarantee what state the list will be as you do things in with the iterators in different threads (for example, one thread could happily iterate through and erase all the items, what will the other thread - who's also iterating, see?)
If you are going to work on the list in multiple threads, lock the whole list first, then do what you need to. Copying the list is an option, but not optimal (depending on the size of your list and how fast it's updated). If locking becomes a bottle neck, re-think your architecture (list per thread for example?)
Each thread that calls the GetIterator function will get its own copy of the stored iterator in the list.
As a std::list<>::iterator is a bi-directional iterator, any copies you make are completely independent of the source. If one of them changes, this will not be reflected in any of the other iterators.
As for using iterator in a multi-threaded environment, this is not that different from a single-threaded environment. The iterator remains valid as long as the element it refers to is part of the container. You just have to take care of proper synchronization when accessing/modifying the container or its elements.
If the list is modified by one of your threads you might get into trouble.
But of course, you can take care of that by setting locks and ro- and rw-locks during modification. But since mutexes are the scurge of any high performance program, maybe you can make a copy of the list (or references) and keep the original list mutex- and lock-free? That would be the best way.
If you have the mutexes in place you only have to battle with the issues of a modifying a list while holding iterators on it as you would normally do anyway -- i.e. adding elements should be ok, deletion should be done "careful" but doing it on a list is probably less likely to explode as on a vector:-)
I would reconsider this design and would use a task-based approach. This way you don't need any mutexes.
For example use Intel TBB, which initializes a task pool internally. So you can easily implement a one-writer/multiple-readers concept.
First add all requests to your request container (a simple std::vector might be better suited in terms of cache locality and performance) and then do a parallel_for() over you request-vector BUT DON'T remove a request in your parallel-for() functor!
After the processing you can actually clear your request vector without any need of locking a mutex. That's it!
I hope I could help a bit.

How to implement thread safe reference counting in C++

How do you implement an efficient and thread safe reference counting system on X86 CPUs in the C++ programming language?
I always run into the problem that the critical operations not atomic, and the available X86 Interlock operations are not sufficient for implementing the ref counting system.
The following article covers this topic, but requires special CPU instructions:
http://www.ddj.com/architect/184401888
Nowadays, you can use the Boost/TR1 shared_ptr<> smart pointer to keep your reference counted references.
Works great; no fuss, no muss. The shared_ptr<> class takes care of all the locking needed on the refcount.
In VC++, you can use _InterlockedCompareExchange.
do
read the count
perform mathematical operation
interlockedcompareexchange( destination, updated count, old count)
until the interlockedcompareexchange returns the success code.
On other platforms/compilers, use the appropriate intrinsic for the LOCK CMPXCHG instruction that MS's _InterlockedCompareExchange exposes.
Strictly speaking, you'll need to wait until C++0x to be able to write thread-safe code in pure C++.
For now, you can use Posix, or create your own platform independent wrappers around compare and swap and/or interlocked increment/decrement.
Win32 InterlockedIncrementAcquire and InterlockedDecrementRelease (if you want to be safe and care about platforms with possible reordering, hence you need to issue memory barriers at the same time) or InterlockedIncrement and InterlockedDecrement (if you are sure you will stay x86), are atomic and will do the job.
That said, Boost/TR1 shared_ptr<> will handle all of this for you, therefore unless you need to implement it on your own, you will probably do the best to stick to it.
Bear in mind that the locking is very expensive, and it happens every time you hand objects around between smart pointers - even when the object is currently owned by one thread (the smart pointer library doesn't know that).
Given this, there may be a rule of thumb applicable here (I'm happy to be corrected!)
If the follow things apply to you:
You have complex data structures that would be difficult to write destructors for (or where STL-style value semantics would be inappropriate, by design) so you need smart pointers to do it for you, and
You're using multiple threads that share these objects, and
You care about performance as well as correctness
... then actual garbage collection may be a better choice. Although GC has a bad reputation for performance, it's all relative. I believe it compares very favourably with locking smart pointers. It was an important part of why the CLR team chose true GC instead of something using reference counting. See this article, in particular this stark comparison of what reference assignment means if you have counting going on:
no ref-counting:
a = b;
ref counting:
if (a != null)
if (InterlockedDecrement(ref a.m_ref) == 0)
a.FinalRelease();
if (b != null)
InterlockedIncrement(ref b.m_ref);
a = b;
If the instruction itself is not atomic then you need to make the section of code that updates the appropriate variable a critical section.
i.e. You need to prevent other threads entering that section of code by using some locking scheme. Of course the locks need to be atomic, but you can find an atomic locking mechanism within the pthread_mutex class.
The question of efficient: The pthread library is as efficient as it can be and still guarantee that mutex lock is atomic for your OS.
Is it expensive: Probably. But for everything that requires a guarantee there is a cost.
That particular code posted in that ddj article is adding extra complexity to account for bugs in using smart pointers.
Specifically, if you can't guarantee that the smart pointer won't change in an assignment to another smart pointer, you are doing it wrong or are doing something very unreliable to begin with. If the smart pointer can change while being assigned to another smart pointer, that means that the code doing the assignment doesn't own the smart pointer, which is suspect to begin with.