How to implement thread-safe reference counting in C++

How do you implement an efficient, thread-safe reference counting system on x86 CPUs in C++?
I always run into the problem that the critical operations are not atomic, and the available x86 interlocked operations are not sufficient for implementing the ref-counting system.
The following article covers this topic, but requires special CPU instructions:
http://www.ddj.com/architect/184401888

Nowadays, you can use the Boost/TR1 shared_ptr<> smart pointer to keep your reference counted references.
Works great; no fuss, no muss. The shared_ptr<> class takes care of all the locking needed on the refcount.
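For example, a minimal sketch using std::shared_ptr (the TR1/Boost versions behave the same way): copies can be handed to other threads freely, because the count is updated atomically.

#include <memory>
#include <thread>

int main() {
    auto p = std::make_shared<int>(42);   // count == 1
    std::thread t([p] {                   // copy: count bumped atomically
        // p keeps the int alive in this thread
    });
    t.join();
}   // last copy destroyed; the int is freed exactly once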

In VC++, you can use _InterlockedCompareExchange.
volatile long count_value = 1;  // the shared reference count
long old_count, new_count;
do {
    old_count = count_value;    // read the count
    new_count = old_count + 1;  // perform the mathematical operation
} while (_InterlockedCompareExchange(&count_value, new_count, old_count) != old_count);
// the exchange succeeded once the returned value matches what we read
On other platforms/compilers, use the appropriate intrinsic for the LOCK CMPXCHG instruction that MS's _InterlockedCompareExchange exposes.

Strictly speaking, you'll need to wait until C++0x to be able to write thread-safe code in pure C++.
For now, you can use POSIX, or create your own platform-independent wrappers around compare-and-swap and/or interlocked increment/decrement.
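A sketch of such a wrapper (the name cas is an assumption; note that the two intrinsics take their arguments in different orders):

#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Returns the previous value; the swap happened iff it equals 'comparand'.
inline long cas(volatile long* dest, long exchange, long comparand) {
#if defined(_MSC_VER)
    return _InterlockedCompareExchange(dest, exchange, comparand);
#else  // GCC/Clang legacy __sync builtins
    return __sync_val_compare_and_swap(dest, comparand, exchange);
#endif
}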

Win32's InterlockedIncrementAcquire and InterlockedDecrementRelease (if you want to be safe on platforms with possible reordering, where you need memory barriers as well), or plain InterlockedIncrement and InterlockedDecrement (if you are sure you will stay on x86), are atomic and will do the job.
That said, Boost/TR1 shared_ptr<> will handle all of this for you, so unless you need to implement it on your own, you are probably best off sticking with it.
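If you do roll your own, a minimal intrusive sketch along these lines might look like this (RefCounted, AddRef, and Release are illustrative names, not an established API):

#include <windows.h>

struct RefCounted {
    volatile LONG refs;
    RefCounted() : refs(1) {}
    virtual ~RefCounted() {}
};

inline void AddRef(RefCounted* o) {
    InterlockedIncrement(&o->refs);
}

inline void Release(RefCounted* o) {
    if (InterlockedDecrement(&o->refs) == 0)  // returns the decremented value
        delete o;
}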

Bear in mind that the locking is very expensive, and it happens every time you hand objects around between smart pointers - even when the object is currently owned by one thread (the smart pointer library doesn't know that).
Given this, there may be a rule of thumb applicable here (I'm happy to be corrected!).
If the following things apply to you:
You have complex data structures that would be difficult to write destructors for (or where STL-style value semantics would be inappropriate, by design) so you need smart pointers to do it for you, and
You're using multiple threads that share these objects, and
You care about performance as well as correctness
... then actual garbage collection may be a better choice. Although GC has a bad reputation for performance, it's all relative. I believe it compares very favourably with locking smart pointers. It was an important part of why the CLR team chose true GC instead of something using reference counting. See this article, in particular this stark comparison of what reference assignment means if you have counting going on:
no ref-counting:
a = b;
ref counting:
if (a != null)
    if (InterlockedDecrement(ref a.m_ref) == 0)
        a.FinalRelease();
if (b != null)
    InterlockedIncrement(ref b.m_ref);
a = b;

If the instruction itself is not atomic, then you need to make the section of code that updates the appropriate variable a critical section.
That is, you need to prevent other threads from entering that section of code by using some locking scheme. The locks themselves need to be atomic, of course, but pthreads gives you an atomic locking mechanism in pthread_mutex_t.
As for efficiency: the pthread library is as efficient as it can be while still guaranteeing that the mutex lock is atomic on your OS.
Is it expensive? Probably. But everything that comes with a guarantee has a cost.
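A minimal sketch of that approach, assuming pthreads (the RefCount type and function names are illustrative):

#include <pthread.h>

struct RefCount {
    int count;
    pthread_mutex_t lock;
};

void ref_init(RefCount* rc) {
    rc->count = 1;
    pthread_mutex_init(&rc->lock, NULL);
}

void ref_inc(RefCount* rc) {
    pthread_mutex_lock(&rc->lock);    // the critical section
    ++rc->count;
    pthread_mutex_unlock(&rc->lock);
}

int ref_dec(RefCount* rc) {           // returns 1 when the last ref is gone
    pthread_mutex_lock(&rc->lock);
    int last = (--rc->count == 0);
    pthread_mutex_unlock(&rc->lock);
    return last;                      // caller destroys the object if 1
}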

The code posted in that DDJ article adds extra complexity to account for bugs in how the smart pointers are used.
Specifically, if you can't guarantee that a smart pointer won't change during an assignment to another smart pointer, you are doing it wrong, or doing something very unreliable. If the smart pointer can change while being assigned to another smart pointer, that means the code doing the assignment doesn't own it, which is suspect to begin with.

What is C++20 std::atomic<shared_ptr<T>> and std::atomic<weak_ptr<T>>?

I read the following article by Anthony Williams, and as I understood it, in addition to the atomic shared count in std::shared_ptr, in std::experimental::atomic_shared_ptr the actual pointer to the shared object is also atomic?
But when I read about the reference-counted version of lock_free_stack described in Anthony's book on C++ concurrency, it seems to me that the same applies to std::shared_ptr, because functions like std::atomic_load and std::atomic_compare_exchange_weak are applied to instances of std::shared_ptr.
#include <atomic>  // std::atomic_load, std::atomic_compare_exchange_weak
#include <memory>  // std::shared_ptr, std::make_shared

template <class T>
class lock_free_stack
{
public:
    void push(const T& data)
    {
        const std::shared_ptr<node> new_node = std::make_shared<node>(data);
        new_node->next = std::atomic_load(&head_);
        while (!std::atomic_compare_exchange_weak(&head_, &new_node->next, new_node));
    }

    std::shared_ptr<T> pop()
    {
        std::shared_ptr<node> old_head = std::atomic_load(&head_);
        while (old_head &&
               !std::atomic_compare_exchange_weak(&head_, &old_head, old_head->next));
        return old_head ? old_head->data : std::shared_ptr<T>();
    }

private:
    struct node
    {
        std::shared_ptr<T> data;
        std::shared_ptr<node> next;
        node(const T& data_) : data(std::make_shared<T>(data_)) {}
    };

    std::shared_ptr<node> head_;
};
What is the exact difference between these two types of smart pointers, and if the pointer in a std::shared_ptr instance is not atomic, why is the above lock-free stack implementation possible?
The atomic "thing" in shared_ptr is not the shared pointer itself, but the control block it points to. meaning that as long as you don't mutate the shared_ptr across multiple threads, you are ok. do note that copying a shared_ptr only mutates the control block, and not the shared_ptr itself.
std::shared_ptr<int> ptr = std::make_shared<int>(4);
for (auto i = 0; i < 10; i++) {
    std::thread([ptr] { auto copy = ptr; }).detach(); // OK: only mutates the control block
}
Mutating the shared pointer itself, such as assigning it different values from multiple threads, is a data race, for example:
std::shared_ptr<int> ptr = std::make_shared<int>(4);
std::thread threadA([&ptr] {
    ptr = std::make_shared<int>(10);
});
std::thread threadB([&ptr] {
    ptr = std::make_shared<int>(20);
});
Here, we are mutating the control block (which is OK), but also the shared pointer itself, by making it point to different values from multiple threads. This is not OK.
A solution to that problem is to wrap the shared_ptr with a lock, but this solution does not scale well under contention and, in a sense, loses the automatic feel of the standard shared pointer.
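A minimal sketch of that lock-based workaround (locked_shared_ptr is an illustrative name, not a standard class):

#include <memory>
#include <mutex>

template <class T>
class locked_shared_ptr {
    std::shared_ptr<T> p_;
    mutable std::mutex m_;
public:
    std::shared_ptr<T> load() const {
        std::lock_guard<std::mutex> g(m_);
        return p_;                     // the copy is made under the lock
    }
    void store(std::shared_ptr<T> p) {
        std::lock_guard<std::mutex> g(m_);
        p_ = std::move(p);
    }
};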
Another solution is to use the standalone functions you quoted, such as std::atomic_compare_exchange_weak. This makes synchronizing shared pointers manual work, which we don't like.
This is where the atomic shared pointer comes into play. You can mutate the shared pointer from multiple threads without fearing a data race and without using any locks. The standalone functions become member functions, and their use is much more natural for the user. This kind of pointer is extremely useful for lock-free data structures.
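For illustration, a small sketch using the C++20 spelling, std::atomic<std::shared_ptr<T>>, which standardizes atomic_shared_ptr (requires a C++20 standard library):

#include <atomic>
#include <memory>
#include <thread>

std::atomic<std::shared_ptr<int>> ptr{std::make_shared<int>(4)};

int main() {
    std::thread threadA([] { ptr.store(std::make_shared<int>(10)); }); // atomic: no race
    std::thread threadB([] { ptr.store(std::make_shared<int>(20)); });
    threadA.join();
    threadB.join();
    return *ptr.load();  // 10 or 20, but never torn or dangling
}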
N4162(pdf), the proposal for atomic smart pointers, has a good explanation. Here's a quote of the relevant part:
Consistency. As far as I know, the [util.smartptr.shared.atomic]
functions are the only atomic operations in the standard that
are not available via an atomic type. And for all types
besides shared_ptr, we teach programmers to use atomic types
in C++, not atomic_* C-style functions. And that’s in part because of...
Correctness. Using the free functions makes code error-prone
and racy by default. It is far superior to write atomic once on
the variable declaration itself and know all accesses
will be atomic, instead of having to remember to use the atomic_*
operation on every use of the object, even apparently-plain reads.
The latter style is error-prone; for example, “doing it wrong” means
simply writing whitespace (e.g., head instead of atomic_load(&head) ),
so that in this style every use of the variable is “wrong by default.” If you forget to
write the atomic_* call in even one place, your code will still
successfully compile without any errors or warnings, it will “appear
to work” including likely pass most testing, but will still contain a
silent race with undefined behavior that usually surfaces as intermittent
hard-to-reproduce failures, often/usually in the field,
and I expect also in some cases exploitable vulnerabilities.
These classes of errors are eliminated by simply declaring the variable atomic,
because then it’s safe by default and to write the same set of
bugs requires explicit non-whitespace code (sometimes explicit
memory_order_* arguments, and usually reinterpret_casting).
Performance. atomic_shared_ptr<> as a distinct type
has an important efficiency advantage over the
functions in [util.smartptr.shared.atomic] — it can simply store an
additional atomic_flag (or similar) for the internal spinlock
as usual for atomic<bigstruct>. In contrast, the existing standalone functions
are required to be usable on any arbitrary shared_ptr
object, even though the vast majority of shared_ptrs will
never be used atomically. This makes the free functions inherently
less efficient; for example, the implementation could require
every shared_ptr to carry the overhead of an internal spinlock
variable (better concurrency, but significant overhead per
shared_ptr), or else the library must maintain a lookaside data
structure to store the extra information for shared_ptrs that are
actually used atomically, or (worst and apparently common in
practice) the library must use a global spinlock.
Calling std::atomic_load() or std::atomic_compare_exchange_weak() on a shared_ptr is functionally equivalent to calling atomic_shared_ptr::load() or atomic_shared_ptr::compare_exchange_weak(). There shouldn't be any performance difference between the two. Calling std::atomic_load() or std::atomic_compare_exchange_weak() on an atomic_shared_ptr would be syntactically redundant and might or might not incur a performance penalty.
atomic_shared_ptr is an API refinement. shared_ptr already supports atomic operations, but only when using the appropriate atomic non-member functions. This is error-prone, because the non-atomic operations remain available and are too easy for an unwary programmer to invoke by accident. atomic_shared_ptr is less error-prone because it doesn't expose any non-atomic operations.
shared_ptr and atomic_shared_ptr expose different APIs, but they don't necessarily need to be implemented differently; shared_ptr already supports all the operations exposed by atomic_shared_ptr. Having said that, the atomic operations of shared_ptr are not as efficient as they could be, because it must also support non-atomic operations. Therefore there are performance reasons why atomic_shared_ptr could be implemented differently. This is related to the single responsibility principle. "An entity with several disparate purposes... often offers crippled interfaces for any of its specific purposes because the partial overlap among various areas of functionality blurs the vision needed for crisply implementing each." (Sutter & Alexandrescu 2005, C++ Coding Standards)
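To illustrate the error-proneness point, a sketch contrasting the two styles (std::experimental::atomic_shared_ptr comes from the Concurrency TS; the header and its availability vary by implementation):

#include <experimental/atomic>  // Concurrency TS; not all vendors ship this
#include <memory>

std::shared_ptr<int> plain;                      // atomic only via free functions
std::experimental::atomic_shared_ptr<int> safe;  // atomic by construction

void snapshots() {
    auto s1 = std::atomic_load(&plain);  // easy to forget on some other access
    auto s2 = safe.load();               // impossible to access non-atomically
}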

Use case for boost::make_shared

I am walking through some C++ code and have the following questions (I have almost no exposure to the Boost library):
bool xxxx::calcYYY()
{
    bool retStatus = false;
    boost::shared_ptr<DblMatrix> price = boost::make_shared<DblMatrix>(xxx, xxx);
    .....
    retStatus = true;
    return retStatus;
}
Why are local scoped pointers instantiated as shared?
There must be an additional overhead to maintain reference counting in high-performing code.
What is the Boost alternative to do this correctly here?
Why are local scoped pointers instantiated as shared?
Because they point to objects that can be shared. For example, they can be passed to functions that can extend the life of the object they refer to.
There must be an additional overhead to maintain reference counting in high-performing code.
Possibly. It's also possible the compiler can optimize it out. There are a lot of tricks, such as passing the pointers by const reference to avoid having to change the reference count, using std::move where possible, and so on that can make the additional overhead negligible for many use cases. But, presumably, it's being used because there is some benefit. We can't tell with just the code you've shown.
What is the Boost alternative to do this correctly here?
There is nothing incorrect shown.
Why are local scoped pointers instantiated as shared?
Because the author wanted to use a smart pointer that would guarantee the memory is freed even if the function exits via an early return, or an exception.
The best choice for such a usage is std::unique_ptr, but that wasn't available until C++11, and if the original code predates that it wouldn't have been an option.
The next best choice is boost::scoped_ptr. It actually handles this case perfectly, but it is in general a lot less useful than std::unique_ptr because it doesn't support move semantics. As such, having a general rule of "just use boost::shared_ptr and stop worrying" is perfectly sensible (particularly if you are going to pass the pointer into a general-purpose function which may want to do reallocation).
There was also std::auto_ptr, which also handles this case perfectly, but it's going to cause anyone reading the code to go "hang on, is this OK?". It is best avoided.
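As a sketch of that alternative (DblMatrix here is a stand-in for the question's type):

#include <boost/scoped_ptr.hpp>

struct DblMatrix { DblMatrix(int, int) {} };  // placeholder for the real type

bool calcYYY()
{
    // scoped_ptr frees the matrix on every exit path (early return,
    // exception), with no reference-count overhead and no way to
    // accidentally share ownership.
    boost::scoped_ptr<DblMatrix> price(new DblMatrix(3, 3));
    // ...
    return true;
}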
There must be an additional overhead to maintain reference counting in high-performing code.
If the pointer is allocated once, and then never used until it is released, then the cost of an atomic increment and an atomic decrement is going to be lost in the noise of the calls to new and delete.

Atomic shared_ptr for lock-free singly linked list

I'm wondering if it is possible to create a lock-free, thread-safe shared pointer for any of the "common" architectures, like x64 or ARMv7 / ARMv8.
In a talk about lock-free programming at CppCon 2014, Herb Sutter presented a (partial) implementation of a lock-free singly linked list. The implementation looks quite simple, but it relies on an atomic shared_ptr implementation that doesn't exist in the standard library yet, or on using the specialized std::atomic... functions. This is especially important as single push/pop calls potentially invoke multiple atomic loads/stores and compare_exchange operations.
The problem I see (and I think some of the questions in the talk went into the same direction) is that for this to be an actual lock-free data structure, those atomic operations would have to be lock-free themselves. I don't know of any standard library implementation for the std::atomic... functions that is lock-free and - at least with a short google / SO search - I also didn't find a suggestion of how to implement a lock-free specialization for std::atomic<std::shared_ptr>.
Now before I'm wasting my time on this I wanted to ask:
Do you know if it is possible to write a lock-free, atomic shared pointer at all?
Are there already any implementations that I've overlooked and - ideally - are even compatible with what you would expect from a std::atomic<std::shared_ptr>? For the mentioned list it would especially require a CAS operation.
If there is no way to implement this on current architectures, do you see any other benefit in Herb's implementation compared to a "normal" linked list that is protected by a lock?
For reference, here is the code from Herb Sutter (might contain typos from me):
#include <atomic>
#include <memory>
// Note: std::atomic<std::shared_ptr<Node>> requires C++20 (or a library
// providing atomic_shared_ptr), which is exactly what the question is about.

template<class T>
class slist {
    struct Node { T t; std::shared_ptr<Node> next; };
    std::atomic<std::shared_ptr<Node>> head;
public:
    class reference {
        std::shared_ptr<Node> p;
    public:
        reference(std::shared_ptr<Node> p_) : p{std::move(p_)} {}
        T& operator*() { return p->t; }
        T* operator->() { return &p->t; }
    };
    auto find(T t) const {
        auto p = head.load();
        while (p && p->t != t) {
            p = p->next;
        }
        return reference(std::move(p));
    }
    void push_front(T t) {
        auto p = std::make_shared<Node>();
        p->t = t;
        p->next = head;
        while (!head.compare_exchange_weak(p->next, p)) {}
    }
    void pop_front() {
        auto p = head.load();
        while (p && !head.compare_exchange_weak(p, p->next)) {}
    }
};
Note that in this implementation, single instances of a shared_ptr can be accessed/modified by multiple different threads. It can be read/copied, reset, and even deleted (as part of a node). So this is not about whether multiple different shared_ptr objects (that manage the same object) can be used by multiple threads without a race condition - that is already true for current implementations and required by the standard - but about concurrent access to a single pointer instance, which is - for standard shared pointers - no more thread-safe than the same operations on raw pointers would be.
To explain my motivation:
This is mainly an academic question. I have no intention of implementing my own lock-free list in production code, but I find the topic interesting and at first glance, Herb's presentation seemed to be a good introduction. However, while thinking about this question and #sehe's comment on my answer, I remembered this talk, had another look at it, and realized that it doesn't make much sense to call Herb's implementation lock-free if its primitive operations require locks (which they currently do). So I was wondering whether this is just a limitation of the current implementations or a fundamental flaw in the design.
I'm adding this as an answer since it's too long to fit in a comment:
Something to consider. A lock-free shared_ptr is not needed to implement lock-free/wait-free data structures.
The reason Sutter uses shared_ptr in his presentation is because the most complicated part of writing lock-free data structures is not the synchronization, but the memory reclamation: we cannot delete nodes while they're potentially accessed by other threads, so we have to leak them and reclaim later. A lock-free shared_ptr implementation essentially provides "free" memory reclamation and makes examples of lock-free code palatable, especially in a context of a time-limited presentation.
Of course, having a lock-free atomic_shared_ptr as part of the Standard would be a huge help. But it's not the only way of doing memory reclamation for lock-free data structures; there's the naive approach of maintaining a list of nodes to be deleted at quiescent points in execution (works in low-contention scenarios only), hazard pointers, and roll-your-own atomic reference counting using split counts.
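A rough sketch of that last idea, the split count, after Williams' C++ Concurrency in Action (the counted_ptr/node names are illustrative, and the reconciliation logic on release is omitted):

#include <atomic>

struct node;

// The external count travels with the pointer, so a single CAS can both
// read the pointer and record "one more thread is looking at it". The
// node itself carries an internal count; the two are reconciled when a
// thread lets go of its reference.
struct counted_ptr {
    int   external_count;
    node* ptr;
};

struct node {
    std::atomic<int> internal_count{0};
    counted_ptr      next{};
};

std::atomic<counted_ptr> head;

// Lock-freedom hinges on the platform offering a double-width CAS
// (e.g. cmpxchg16b on x86-64) so counted_ptr fits in one atomic unit.
int main() { return head.is_lock_free() ? 0 : 1; }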
As for performance, #mksteve is correct: lock-free code is not guaranteed to outperform lock-based alternatives unless it runs on a highly parallel system offering true concurrency. Its goal is to enable maximum concurrency, and because of that, what we typically get is threads doing less waiting at the cost of performing more work.
PS If this is something that interests you, you should consider taking a look at C++ Concurrency in Action by Anthony Williams. It dedicates a whole chapter to writing lock-free/wait-free code, which offers a good starting place, walking through implementations of lock-free stack and queue.
Do you know if it is possible to write a lock-free, atomic shared pointer at all?
Are there already any implementations that I've overlooked and - ideally - are even compatible with what you would expect from a std::atomic<std::shared_ptr>?
I think the std::atomic_... functions offer a form of implementation, where the slist would perform special atomic_ queries on the shared_ptr. The problem with this being separated into two classes (std::atomic and std::shared_ptr) is that they each have constraints which need to be adhered to in order to function. The class separation makes that knowledge of shared constraints impossible.
Within slist, which knows both items, the situation can be helped, and thus the atomic_... functions will probably work.
If there is no way to implement this on current architectures, do you see any other benefit in Herb's implementation compared to a "normal" linked list that is protected by a lock?
From Wikipedia's article on non-blocking algorithms: the purpose of the lock-free property is to guarantee that some progress is being made by at least one thread.
This does not guarantee better performance than a locked implementation, but it does guarantee that deadlocks will not occur.
Imagine that T required a lock to perform a copy; that lock could also be held by some operation outside the list. Then a deadlock would be possible if the lock was held while a lock-based implementation of slist was called.
I think CAS is what std::atomic::compare_exchange_weak is implemented with, so this would be implementation-independent.
Current lock-free algorithms for complex structures (e.g. vector, map) tend to be significantly less efficient than locking algorithms (see Dr. Dobb's: Lock-Free Data Structures), but the benefit offered (improved thread performance) would significantly improve the performance of computers, which tend to have large numbers of idle CPUs.
Further research into these algorithms may identify new instructions that could be implemented in the CPUs of the future, giving us wait-free performance and improved utilization of computing resources.
It is possible to write a lock-free shared_ptr, as the only thing that needs changing is the count. The pointer itself is only copied, so no special care is needed there. When deleting, this must be the last instance, so no other copies exist in other threads, and nobody would increment at the same time.
But having said that, std::atomic<std::shared_ptr> would be a very specialized thing, as it's not exactly a primitive type.
I've seen a few implementations of lock-free lists, but none of them held shared pointers. These containers usually have a special purpose, and therefore there is an agreement around their usage (when/who creates/deletes), so using shared pointers is not required.
Also, shared pointers introduce an overhead that is contrary to the low-latency goals that brought us to the lock-free domain in the first place.
So, back to your question: I think it is possible, but I don't see why you would do it.
If you really need something like that, a refCount member variable would serve better.
I see no specific benefit in Herb's specific implementation, except maybe the academic one, but lock-free lists have the obvious motivation of not having a lock. They often serve as queues, or just to share a collection of nodes between threads that are allergic to locks.
Maybe we should ask Herb... Herb? Are you listening?
EDIT:
Following all the comments below, I've implemented a lock-free singly linked list. The list is fairly complex, to prevent shared ptrs from being deleted while they are accessed. It is too big to post here, but here are the main ideas:
- The main idea is to stash removed entries in a separate place - a garbage collector - to make them inaccessible to later actions.
- An atomic ref count is incremented on entry to every function (push_front, pop_front, and front) and auto-decremented on exit. On decrementing to zero, a version counter is incremented. All in one atomic instruction.
- When a shared ptr needs to be erased, in pop_front, it is pushed into a GC. There's a GC per version number. The GC is implemented using a simpler lock-free list that can only push_front or pop_all. I've created a circular buffer of 256 GCs, but some other scheme can be applied.
- A version's GC is flushed on version increment, and then the shared ptrs delete their holders.
So, if you call pop_front with nothing else running, the ref count is incremented to 1, the front shared ptr is pushed into GC[0], the ref count goes back to zero and the version to 1, and GC[0] is flushed - it decrements the shared ptr we popped and possibly deletes the object it owns.
Now, regarding a lock-free shared_ptr itself: I believe this is doable. Here are the ideas I thought of:
- You can have a spin lock of sorts using the low bits of the pointer to the holder, so you can dereference it only after you've locked it. You can use a different bit for inc/dec, etc. This is much better than a lock on the entire thing (a sketch follows this list). The problem here is that the shared ptr itself can be deleted, so whatever contains it would have to provide some protection from outside, like the linked list does.
- You can have some central registry of shared pointers. This does not suffer from the problem above, but would be challenging to scale without occasional latency spikes.
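A minimal sketch of the low-bit spinlock idea from the first bullet (names are illustrative; it relies on allocations being at least 2-byte aligned so bit 0 of the pointer is free):

#include <atomic>
#include <cstdint>

std::atomic<std::uintptr_t> tagged{0};  // pointer with bit 0 as the lock flag

void* lock_ptr() {
    for (;;) {
        // Expect the unlocked value; try to set the lock bit.
        std::uintptr_t v = tagged.load() & ~std::uintptr_t(1);
        if (tagged.compare_exchange_weak(v, v | 1))
            return reinterpret_cast<void*>(v);  // safe to dereference now
    }
}

void unlock_ptr(void* p) {
    // Store the plain pointer, clearing the lock bit.
    tagged.store(reinterpret_cast<std::uintptr_t>(p));
}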
To summarize, I currently think this whole idea is moot. If you find some other approach that does not suffer from big problems, I'll be very curious to hear about it :)
Thanks!

How to return smart pointers (shared_ptr), by reference or by value?

Let's say I have a class with a method that returns a shared_ptr.
What are the possible benefits and drawbacks of returning it by reference or by value?
Two possible clues:
Early object destruction. If I return the shared_ptr by (const) reference, the reference counter is not incremented, so I incur the risk of having the object deleted when it goes out of scope in another context (e.g. another thread). Is this correct? What if the environment is single-threaded, can this situation happen as well?
Cost. Pass-by-value is certainly not free. Is it worth avoiding it whenever possible?
Thanks everybody.
Return smart pointers by value.
As you've said, if you return it by reference, you won't properly increment the reference count, which opens up the risk of deleting something at the improper time. That alone should be enough reason to not return by reference. Interfaces should be robust.
The cost concern is nowadays moot thanks to return value optimization (RVO), so you won't incur an increment-increment-decrement sequence or anything like that in modern compilers. So the best way to return a shared_ptr is to simply return it by value:
shared_ptr<T> Foo()
{
    return shared_ptr<T>(/* acquire something */);
}
This is a dead-obvious RVO opportunity for modern C++ compilers. I know for a fact that Visual C++ compilers implement RVO even when all optimizations are turned off. And with C++11's move semantics, this concern is even less relevant. (But the only way to be sure is to profile and experiment.)
If you're still not convinced, Dave Abrahams has an article that makes an argument for returning by value. I reproduce a snippet here; I highly recommend that you go read the entire article:
Be honest: how does the following code make you feel?
std::vector<std::string> get_names();
...
std::vector<std::string> const names = get_names();
Frankly, even though I should know better, it makes me nervous. In principle, when get_names() returns, we have to copy a vector of strings. Then, we need to copy it again when we initialize names, and we need to destroy the first copy. If there are N strings in the vector, each copy could require as many as N+1 memory allocations and a whole slew of cache-unfriendly data accesses as the string contents are copied.
Rather than confront that sort of anxiety, I've often fallen back on pass-by-reference to avoid needless copies:
get_names(std::vector<std::string>& out_param );
...
std::vector<std::string> names;
get_names( names );
Unfortunately, this approach is far from ideal.
The code grew by 150%
We’ve had to drop const-ness because we’re mutating names.
As functional programmers like to remind us, mutation makes code more complex to reason about by undermining referential transparency and equational reasoning.
We no longer have strict value semantics for names.
But is it really necessary to mess up our code in this way to gain efficiency? Fortunately, the answer turns out to be no (and especially not if you are using C++0x).
Regarding any smart pointer (not just shared_ptr), I don't think it's ever acceptable to return a reference to one, and I would be very hesitant to pass them around by reference or raw pointer. Why? Because you cannot be certain that it will not be shallow-copied via that reference later. Your first point defines the reason why this should be a concern. This can happen even in a single-threaded environment. You don't need concurrent access to data to put bad copy semantics into your programs. You don't really control what your users do with the pointer once you pass it off, so don't encourage misuse by giving your API users enough rope to hang themselves.
Secondly, look at your smart pointer's implementation, if possible. Construction and destruction should be darn close to negligible. If this overhead isn't acceptable, then don't use a smart pointer! But beyond this, you will also need to examine the concurrency architecture that you've got, because mutually exclusive access to the mechanism that tracks the uses of the pointer is going to slow you down more than mere construction of the shared_ptr object.
Edit, 3 years later: with the advent of more modern features in C++, I would tweak my answer to be more accepting of cases where you've simply written a lambda that never lives outside the calling function's scope and isn't copied somewhere else. Here, if you wanted to save the very minimal overhead of copying a shared pointer, it would be fair and safe. Why? Because you can guarantee that the reference will never be misused.
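A sketch of that case (the function names are illustrative): the lambda lives only for the duration of the algorithm call, so capturing the shared_ptr by reference is safe and skips the refcount traffic.

#include <algorithm>
#include <memory>
#include <vector>

int sum_into(const std::vector<int>& v, const std::shared_ptr<int>& total) {
    // The lambda cannot outlive this function, so capturing 'total' by
    // reference is safe and avoids an atomic increment/decrement pair.
    std::for_each(v.begin(), v.end(), [&total](int x) { *total += x; });
    return *total;
}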

Regarding mark-sweep (lazy approach) for garbage collection in C++?

I know the reference counting technique but had never heard of the mark-sweep technique until today, when reading the book "Concepts of Programming Languages".
According to the book:
The original mark-sweep process of garbage collection operates as follows: the runtime system allocates storage cells as requested and disconnects pointers from cells as necessary, without regard for storage reclamation (allowing garbage to accumulate), until it has allocated all available cells. At this point, a mark-sweep process begins to gather all the garbage left floating around in the heap. To facilitate the process, every heap cell has an extra indicator bit or field that is used by the collection algorithm.
From my limited understanding, smart pointers in C++ libraries use the reference counting technique. I wonder whether any C++ library implements its smart pointers with this mark-sweep approach instead. Since the book is purely theoretical, I could not visualize how the implementation is done. An example to demonstrate this idea would be greatly valuable. Please correct me if I'm wrong.
Thanks,
There is one difficulty to using garbage collection in C++: identifying what is a pointer and what is not.
If you can tweak a compiler to provide this information for each and every object type, then you're done; but if you cannot, then you need to use a conservative approach, that is, scanning memory for any pattern that may look like a pointer. There is also the difficulty of "bit stuffing", where people stuff bits into pointers (the higher bits are mostly unused on 64-bit systems) or XOR two different pointers together to "save space".
Now, in C++0x the Standard Committee introduced a standard API to help with implementing garbage collection. In n3225 you can find it at 20.9.11, Pointer safety [util.dynamic.safety]. This supposes that people will implement those functions for their types, of course:
void declare_reachable(void* p); // may throw std::bad_alloc
template <typename T> T* undeclare_reachable(T* p) noexcept;
void declare_no_pointers(char* p, size_t n) noexcept;
void undeclare_no_pointers(char* p, size_t n) noexcept;
pointer_safety get_pointer_safety() noexcept;
When implemented, it will allow you to plug any garbage collection scheme (defining those functions) into your application. It will of course require some work to actually provide those operations wherever they are needed. One solution could be to simply override new and delete, but that does not account for pointer arithmetic...
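For illustration, a sketch of how that API was meant to be used; note that most shipping implementations made these functions no-ops, and the facility was removed again in C++23:

#include <cstdint>
#include <memory>

void hide_and_recover() {
    int* p = new int(7);
    std::declare_reachable(p);  // promise the collector: *p stays live even if hidden
    std::uintptr_t disguised = reinterpret_cast<std::uintptr_t>(p) ^ 0xFFFF; // hide it
    p = nullptr;                // no visible pointer remains, but *p must survive
    // ... a collector must not reclaim *p here ...
    int* q = std::undeclare_reachable(reinterpret_cast<int*>(disguised ^ 0xFFFF));
    delete q;                   // back to normal ownership
}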
Finally, there are many strategies for garbage collection: reference counting (with cycle-detection algorithms) and mark-and-sweep are the main families, but they come in various flavors (generational or not, copying/compacting or not, ...).
Although they may have upgraded it by now, Mozilla Firefox used to use a hybrid approach in which reference-counted smart pointers were used when possible, with a mark-and-sweep garbage collector running in parallel to clean up reference cycles. It's possible other projects have adopted this approach, though I'm not fully sure.
The main reason that I could see C++ programmers avoiding this type of garbage collection is that it means that object destructors would run asynchronously. This means that if any objects were created that held on to important resources, such as network connections or physical hardware, the cleanup wouldn't be guaranteed to occur in a timely fashion. Moreover, the destructors would have to be very careful to use appropriate synchronization if they were to access shared resources, while in a single-threaded, straight reference-counting solution this wouldn't be necessary.
The other complexity of this approach is that C++ allows for raw arithmetic operations on pointers, which greatly complicates the implementation of any garbage collector. It's possible to conservatively solve this problem (look at the Boehm GC, for example), though it's a significant barrier to building a system of this sort.