Is fetch_sub really atomic? - c++

I have the following code (written in C++):
Code in StringRef class:
inline void retain() const {
m_refCount.fetch_add(1, std::memory_order_relaxed);
}
inline void release() const {
if(m_refCount.fetch_sub(1, std::memory_order_release) == 1){
std::atomic_thread_fence(std::memory_order_acquire);
deleteFromParent();
}
}
Code in InternedString:
public:
inline InternedString(){
m_ref = nullptr;
}
inline InternedString(const InternedString& other){
m_ref = other.m_ref;
if(m_ref)
m_ref->retain();
}
inline InternedString(InternedString&& other){
m_ref = other.m_ref;
other.m_ref = nullptr;
}
inline InternedString& operator=(const InternedString& other){
if(&other == this)
return *this;
if(other.m_ref)
other.m_ref->retain();
if(m_ref)
m_ref->release();
m_ref = other.m_ref;
return *this;
}
inline InternedString& operator=(InternedString&& other){
if(&other == this)
return *this;
if(m_ref)
m_ref->release();
m_ref = other.m_ref;
other.m_ref = nullptr;
return *this;
}
/*! #group Destructors */
inline ~InternedString(){
if(m_ref)
m_ref->release();
}
private:
inline InternedString(const StringRef* ref){
assert(ref);
m_ref = ref;
m_ref->retain();
}
When I execute this code in multiple threads, deleteFromParent() gets called more than once for the same object. I don't understand why. Even if I am over-releasing, I should still not get this behaviour, I guess.
Can somebody help me? What am I doing wrong?

fetch_sub is as atomic as can be, but that's not the problem.
Try modifying your code like so:
if(m_refCount.fetch_sub(1, std::memory_order_release) == 1){
Sleep(10); // widen the window between the decrement and the deletion
std::atomic_thread_fence(std::memory_order_acquire);
deleteFromParent();
}
and see what happens.
If your destructor gets preempted by a thread that makes use of your InternedString operators, they will happily and unknowingly get a reference to an object on the verge of deletion.
This means the rest of your code is free to reference deleted objects, leading to all sorts of UBs, including the possible re-incrementing of your perfectly atomic reference count leading to multiple perfectly atomic destructions.
Assuming anybody could copy references around without locking the destructor first is plain wrong, and only made worse if you bury it under the textbook perfect litany of operators needed to hide reference juggling from the end user.
If any task is free to delete your objects anytime, a bit of code like InternedString a = b; will simply have no way to know whether b is a valid object or not.
The reference count mechanism will work as intended only if all references have been set at a time the object was indeed valid.
What you can do is create as many InternedStrings as you want in code sections where no deletion can occur in parallel (be it during init or through plain mutex locking), but once destructors are on the loose, that's it for reference juggling.
The only way to make that work without using mutexes or other synchronization objects would be to add a mechanism for acquiring a reference that would let the user know that the object has been deleted. Here is an example of how that could be done.
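One possible shape for such a mechanism is sketched below. It is not the example the answer originally linked to. It assumes a hypothetical RefCounted class standing in for StringRef, and a try_retain() name that does not appear in the question. The idea is that a new reference can only be taken while the count is still non-zero, so the caller is told when the object is already on its way out.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for StringRef: only the counter logic is shown.
class RefCounted {
public:
    // Takes a new reference only while the count is still non-zero.
    // Returns false once the last reference is gone, i.e. the object is
    // already being destroyed and must not be used.
    bool try_retain() {
        int count = m_refCount.load(std::memory_order_relaxed);
        while (count != 0) {
            // On failure, compare_exchange_weak reloads `count` for the retry.
            if (m_refCount.compare_exchange_weak(count, count + 1,
                                                 std::memory_order_acquire,
                                                 std::memory_order_relaxed)) {
                return true;
            }
        }
        return false;
    }

    // Drops one reference; returns true if this was the last one.
    bool release() {
        int previous = m_refCount.fetch_sub(1, std::memory_order_acq_rel);
        assert(previous > 0 && "over-release");
        return previous == 1;
    }

private:
    std::atomic<int> m_refCount{1};  // the creator holds the first reference
};
```

This only helps if the memory holding the counter is still valid when try_retain() runs, for instance because the parent table keeps the entry alive until it has been removed under a lock. It does not make copying from an arbitrary, possibly dangling pointer safe.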
Now if you try to hide it all under a carpet of rule of five operators, the only remaining solution is to add some kind of valid attribute to your InternedString, that every bit of code would have to check before trying to access the underlying string.
This would amount to dumping the multitasking problems on the desk of your interface's end user, who would in the best case end up using mutexes to prevent other bits of code from deleting objects from under his feet, or maybe just tinker with the code until implicit synchronizations apparently took care of the problem, planting so many ticking time bombs in the application.
Atomic counters and/or structures are no replacement for multitasking synchronization. Except for some experts who can design ultra smart algorithms, atomic variables are just a huge pitfall coated in tons of syntactic sugar.

Related

Check resource before and after the lock

I've run into code that, simplified, looks like this:
inline someClass* otherClass::getSomeClass()
{
if (m_someClass)
return m_someClass.get();
std::unique_lock<std::shared_mutex> lock(m_lock);
if (m_someClass)
return m_someClass.get();
m_someClass= std::make_unique<someClass>(this);
return m_someClass.get();
}
So it seems to be a pattern for making the creation of the someClass object thread safe. I don't have much experience in multithreading, but this code doesn't look nice to me. Is there some other way to write this, or is this the way it should be done?
The biggest problem here is that you are violating the C++ memory model. In the C++ memory model, a write operation and a read operation to the same data must be synchronized.
The check of m_someClass at the top reads, without any synchronization, the same variable that is written while the mutex is held.
It is possible that the operator bool on m_someClass is atomic somehow.
Also, your code doesn't handle the object ever being destroyed.
If it is atomic, then you should possibly be using atomic operations to update it and not a lock. Such a pattern can result in "wasted" objects being created; often this is worth the cost of removing the lock.
Make m_someClass a std::atomic<std::shared_ptr<someClass>> (available since C++20).
Return std::shared_ptr<someClass> from getSomeClass.
auto existing = m_someClass.load();
if (existing)
return existing;
auto created = std::make_shared<someClass>(this);
if (
m_someClass.compare_exchange_strong(existing, created)
) {
return created;
} else {
return existing;
}
Two threads can both create a new someClass if they both try to get at the same time, but only one will persist, the other will be discarded, and the function will return the one that persists.
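If the lock is to be kept, the check-lock-check pattern from the question can also be made well-defined by routing the unlocked read through an atomic pointer. The following is a minimal sketch under assumptions: SomeClass and OtherClass are simplified stand-ins for the question's classes, and the m_fast and m_owner members are hypothetical names introduced here.

```cpp
#include <atomic>
#include <memory>
#include <mutex>

struct SomeClass { int value = 7; };  // hypothetical payload

class OtherClass {
public:
    SomeClass* getSomeClass() {
        // Fast path: the acquire load pairs with the release store below.
        SomeClass* p = m_fast.load(std::memory_order_acquire);
        if (p) return p;

        std::lock_guard<std::mutex> lock(m_lock);
        p = m_fast.load(std::memory_order_relaxed);  // re-check under the lock
        if (!p) {
            m_owner = std::make_unique<SomeClass>();
            p = m_owner.get();
            // Publish only after the object is fully constructed.
            m_fast.store(p, std::memory_order_release);
        }
        return p;
    }

private:
    std::mutex m_lock;
    std::unique_ptr<SomeClass> m_owner;       // owner, touched only under m_lock
    std::atomic<SomeClass*> m_fast{nullptr};  // lock-free view for readers
};
```

Unlike the compare-exchange version, this never constructs a second object, at the price of taking the mutex on the first calls. As in the question, nothing here handles the object being destroyed while callers still hold the raw pointer.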

Confusion about thread-safety

I am new to the world of concurrency but from what I have read I understand the program below to be undefined in its execution. If I understand correctly this is not threadsafe as I am concurrently reading/writing both the shared_ptr and the counter variable in non-atomic ways.
#include <string>
#include <memory>
#include <thread>
#include <chrono>
#include <iostream>
struct Inner {
Inner() {
t_ = std::thread([this]() {
counter_ = 0;
running_ = true;
while (running_) {
counter_++;
std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
});
}
~Inner() {
running_ = false;
if (t_.joinable()) {
t_.join();
}
}
std::uint64_t counter_;
std::thread t_;
bool running_;
};
struct Middle {
Middle() {
data_.reset(new Inner);
t_ = std::thread([this]() {
running_ = true;
while (running_) {
data_.reset(new Inner());
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}
});
}
~Middle() {
running_ = false;
if (t_.joinable()) {
t_.join();
}
}
std::uint64_t inner_data() {
return data_->counter_;
}
std::shared_ptr<Inner> data_;
std::thread t_;
bool running_;
};
struct Outer {
std::uint64_t data() {
return middle_.inner_data();
}
Middle middle_;
};
int main() {
Outer o;
while (true) {
std::cout << "Data: " << o.data() << std::endl;
}
return 0;
}
My confusion comes from this:
1. Is the access to data_->counter safe in Middle::inner_data?
2. If thread A has a member shared_ptr<T> sp and decides to update it while thread B does shared_ptr<T> sp = A::sp will the copy and destruction be threadsafe? Or do I risk the copy failing because the object is in the process of being destroyed.
3. Under what circumstances (can I check this with some tool?) is undefined likely to mean std::terminate? I suspect something like the above happens in some of my production code but I cannot be certain as I am confused about 1 and 2 but this small program has been running for days since I wrote it and nothing happens.
Code can be checked here at https://godbolt.org/g/saHz94
Is the access to data_->counter safe in Middle::inner_data?
No; it's a race condition. According to the standard, it's undefined behavior anytime you allow unsynchronized access to the same variable from more than one thread, and at least one thread might possibly modify the variable.
As a practical matter, here are a couple of unwanted behaviors you might see:
The thread reading the value of counter_ reads an "old" value of the counter (one that rarely or never updates), due to the compiler keeping the value in a register or different processor cores caching the variable independently of each other. Declaring the variable as std::atomic would avoid this problem, because the compiler would then be aware that you intend the variable to be accessed from several threads, and it would know to take precautions to prevent this problem.
Thread A might read the address that the data_ shared_ptr points to and be just about to dereference that address and read from the Inner struct it points to, when thread A gets kicked off the CPU by thread B. Thread B executes, and during thread B's execution the old Inner struct gets deleted and the data_ shared_ptr is set to point to a new Inner struct. Then thread A gets back onto the CPU, but since thread A already holds the old pointer value, it dereferences the old value rather than the new one and ends up reading from freed/invalid memory. Again, this is undefined behavior, so in principle anything could happen; in practice you're likely to see either no obvious misbehavior, or occasionally a wrong/garbage value, or possibly a crash, depending on timing.
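As a rough illustration of the first point, here is a sketch of the shared state from Inner with both variables declared atomic. SharedCounter and its member functions are hypothetical names, and the worker thread is left out so that only the accesses themselves are shown.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical reduction of Inner: only the shared state, no thread.
struct SharedCounter {
    std::atomic<std::uint64_t> counter{0};
    std::atomic<bool> running{true};

    // Called by the worker thread inside its loop.
    void tick() { counter.fetch_add(1, std::memory_order_relaxed); }

    // Called by any other thread; always a well-defined read.
    std::uint64_t read() const { return counter.load(std::memory_order_relaxed); }

    // Called by the owner to ask the worker loop to exit.
    void request_stop() { running.store(false, std::memory_order_relaxed); }
};
```

This removes the race on counter_ and running_ only. It does nothing about the second point, where the whole Inner object is replaced while a reader is still using it.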
If thread A has a member shared_ptr sp and decides to update it
while thread B does shared_ptr sp = A::sp will the copy and
destruction be threadsafe? Or do I risk the copy failing because the
object is in the process of being destroyed.
If each thread only retargets its own shared_ptr instance (i.e. changes it to point to a different object), that is thread safe even when several instances share ownership of the same object, because the reference count is updated atomically. Reading one shared_ptr instance while another thread writes that same instance, as in your A::sp example, is a data race unless you synchronize it. Likewise, modifying the state of the T objects themselves (i.e. the Inner object in your example) is not thread safe, since you could have one thread reading from the object while another thread is writing to it (deleting the object can be seen as a special case of writing to it, in that it definitely changes the object's state).
Under what circumstances (can I check this with some tool?) is
undefined likely to mean std::terminate?
When you hit undefined behavior, it's very much dependent on the details of your program, the compiler, the OS, and the hardware architecture what will happen. In principle, undefined behavior means anything (including the program running just as you intended!) can happen, but you can't rely on any particular behavior -- which is what makes undefined behavior so evil.
In particular, it's common for a multithreaded program with a race condition to run fine for hours/days/weeks and then one day the timing is just right and it crashes or computes an incorrect result. Race conditions can be really difficult to reproduce for that reason.
As for when terminate() might be called, terminate() would be called if the fault causes an error state that is detected by the runtime environment (i.e. it corrupts a data structure that the runtime environment does integrity checks on, such as, in some implementations, the heap's metadata). Whether or not that actually happens depends on how the heap was implemented (which varies from one OS and compiler to the next) and what sort of corruption the fault introduced.
Thread safety is a relationship between operations on different threads, not an absolute property of a piece of code.
You cannot read or write a variable while another thread writes a variable without synchronization between the other thread's write and your read or write. Doing so is undefined behavior.
Undefined can mean anything. Program crashes. Program reads impossible value. Program formats hard drive. Program emails your browser history to all of your contacts.
A common case for unsynchronized integer access is that the compiler optimizes multiple reads to a value into one and doesn't reload it, because it can prove there is no defined way that someone could have modified the value. Or, the CPU memory cache does the same thing, because you did not synchronize.
For the pointers, similar or worse problems can occur, including following dangling pointers, corrupting memory, crashes, etc.
There are now atomic operations you can perform on shared pointers, as well as std::atomic<std::shared_ptr<T>>.
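A minimal sketch of the shared pointer part, using the std::atomic_load and std::atomic_store free functions for std::shared_ptr from <memory> (available since C++11, deprecated in C++20 in favor of std::atomic<std::shared_ptr<T>>). Inner is reduced to a plain struct and the replace() member is a hypothetical name; the threads from the question are left out.

```cpp
#include <cstdint>
#include <memory>
#include <utility>

struct Inner { std::uint64_t counter = 0; };  // hypothetical, thread-free stand-in

class Middle {
public:
    Middle() : data_(std::make_shared<Inner>()) {}

    // Writer side: build the new object first, then swap it in atomically.
    void replace(std::uint64_t initial) {
        auto fresh = std::make_shared<Inner>();
        fresh->counter = initial;
        std::atomic_store(&data_, std::move(fresh));
    }

    // Reader side: take a private copy first, so the object stays alive
    // for as long as the local shared_ptr exists.
    std::uint64_t inner_data() const {
        std::shared_ptr<Inner> local = std::atomic_load(&data_);
        return local->counter;
    }

private:
    std::shared_ptr<Inner> data_;  // accessed only via atomic_load / atomic_store
};
```

The reader can no longer end up with a dangling pointer, because its local copy shares ownership. The counter inside Inner is still a plain integer here, so it would have to become atomic as well if a worker thread kept modifying it after publication.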

std::function in combination with thread c++11 fails debug assertion in vector

I want to build a helper class that can accept a std::function (created via std::bind) so that I can call it repeatedly from another thread:
short example:
void loopme() {
std::cout << "yay";
}
int main() {
LoopThread loop = { std::bind(&loopme) };
loop.start();
//wait 1 second
loop.stop();
//be happy about output
}
However, when calling stop(), my current implementation raises the following error: "Debug Assertion Failed" (see image: i.stack.imgur.com/aR9hP.png).
Does anyone know why the error is thrown?
I don't even use vectors in this example.
When I don't call loopme from within the thread but output directly to std::cout, no error is thrown.
Here the full implementation of my class:
class LoopThread {
public:
LoopThread(std::function<void(LoopThread*, uint32_t)> function) : function_{ function }, thread_{ nullptr }, is_running_{ false }, counter_{ 0 } {};
~LoopThread();
void start();
void stop();
bool isRunning() { return is_running_; };
private:
std::function<void(LoopThread*, uint32_t)> function_;
std::thread* thread_;
bool is_running_;
uint32_t counter_;
void executeLoop();
};
LoopThread::~LoopThread() {
if (isRunning()) {
stop();
}
}
void LoopThread::start() {
if (is_running_) {
throw std::runtime_error("Thread is already running");
}
if (thread_ != nullptr) {
throw std::runtime_error("Thread is not stopped yet");
}
is_running_ = true;
thread_ = new std::thread{ &LoopThread::executeLoop, this };
}
void LoopThread::stop() {
if (!is_running_) {
throw std::runtime_error("Thread is already stopped");
}
is_running_ = false;
thread_->detach();
}
void LoopThread::executeLoop() {
while (is_running_) {
function_(this, counter_);
++counter_;
}
if (!is_running_) {
std::cout << "end";
}
//delete thread_;
//thread_ = nullptr;
}
I used the following Googletest code for testing (however a simple main method containing the code should work):
void testfunction(pft::LoopThread*, uint32_t i) {
std::cout << i << ' ';
}
TEST(pfFiles, TestLoop)
{
pft::LoopThread loop{ std::bind(&testfunction, std::placeholders::_1, std::placeholders::_2) };
loop.start();
std::this_thread::sleep_for(std::chrono::milliseconds(500));
loop.stop();
std::this_thread::sleep_for(std::chrono::milliseconds(2500));
std::cout << "Why does this fail";
}
Your use of is_running_ is undefined behavior, because you write in one thread and read in another without a synchronization barrier.
Partly due to this, your stop() doesn't stop anything. Even without this UB (i.e., if you "fix" it by using an atomic), it only says "oy, stop at some point"; it does not even attempt to guarantee that the stop has happened.
Your code calls new needlessly. There is no reason to use a std::thread* here.
Your code violates the rule of 5. You wrote a destructor, then neglected copy/move operations. It is ridiculously fragile.
As stop() does nothing of consequence to stop the thread, your thread, which holds a pointer to this, outlives your LoopThread object. LoopThread goes out of scope, destroying what the pointer stored by your std::thread points to. The still-running executeLoop invokes a std::function that has been destroyed, then increments a counter in invalid memory (possibly on the stack, where another variable has been created).
Roughly, there is 1 fundamental error in using std threading in every 3-5 lines of your code (not counting interface declarations).
Beyond the technical errors, the design is wrong as well; using detach is almost always a horrible idea; unless you have a promise you make ready at thread exit and then wait on the completion of that promise somewhere, doing that and getting anything like a clean and dependable shutdown of your program is next to impossible.
As a guess, the vector error is because you are stomping all over stack memory and following nearly invalid pointers to find functions to execute. The test system either puts an array index in the spot you are trashing and then the debug vector catches that it is out of bounds, or a function pointer that half-makes sense for your std function execution to run, or somesuch.
Only communicate through synchronized data between threads. That means atomic data, or mutex guarded, unless you are getting ridiculously fancy. You don't understand threading enough to get fancy. You don't understand threading enough to copy someone who got fancy and properly use it. Don't get fancy.
Don't use new. Almost never, ever use new. Use make_shared or make_unique if you absolutely have to. But use those rarely.
Don't detach a thread. Period. Yes this means you might have to wait for it to finish a loop or somesuch. Deal with it, or write a thread manager that does the waiting at shutdown or somesuch.
Be extremely clear about what data is owned by what thread. Be extremely clear about when a thread is finished with data. Avoid using data shared between threads; communicate by passing values (or pointers to immutable shared data), and get information from std::futures back.
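The last point can be sketched as follows. This is only an illustration with a hypothetical sum_in_background function: the worker gets its own copy of the input by value and hands its result back through a std::future, so no variable is shared between the two threads.

```cpp
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// The worker receives its own copy of the input and returns its result
// through the future; nothing is shared between the two threads.
long long sum_in_background(std::vector<int> values) {
    std::future<long long> result = std::async(
        std::launch::async,
        [](std::vector<int> v) {
            return std::accumulate(v.begin(), v.end(), 0LL);
        },
        std::move(values));
    return result.get();  // blocks until the worker has finished
}
```

Waiting on the future also doubles as the "wait for the thread to finish" step, so there is nothing to detach and nothing left running when the function returns.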
There are a number of hurdles in learning how to program. If you have gotten this far, you have passed a few. But you probably know people who learned alongside you and fell over at one of the earlier hurdles.
Sequence, that things happen one after another.
Flow control.
Subprocedures and functions.
Looping.
Recursion.
Pointers/references and dynamic vs automatic allocation.
Dynamic lifetime management.
Objects and Dynamic dispatch.
Complexity
Coordinate spaces
Message
Threading and Concurrency
Non-uniform address spaces, Serialization and Networking
Functional programming, meta functions, currying, partial application, Monads
This list is not complete.
The point is, each of these hurdles can cause you to crash and fail as a programmer, and getting each of these hurdles right is hard.
Threading is hard. Do it the easy way. Dynamic lifetime management is hard. Do it the easy way. In both cases, extremely smart people have mastered the "manual" way to do it, and the result is programs that exhibit random unpredictable/undefined behavior and crash a lot. Muddling through manual resource allocation and deallocation and multithreaded code can be made to work, but the result is usually someone whose small programs work accidentally (they work insofar as you fixed the bugs you noticed). And when you master it, initial mastery comes in the form of holding an entire program's "state" in your head and understanding how it works; this fails to scale to large many-developer code bases, so you usually graduate to having large programs that work accidentally.
Both make_unique style and only-immutable-shared-data based threading are composable strategies. This means if small pieces are correct, and you put them together, the resulting program is correct (with regards to resource lifetime and concurrency). That permits local mastery of small-scale threading or resource management to apply to large-scale programs in the domain where these strategies work.
After following the guide from @Yakk I decided to restructure my program:
1. bool is_running_ changes to std::atomic<bool> is_running_.
2. stop() not only triggers the stop, but actively waits for the thread to finish via thread_->join().
3. All calls of new are replaced with std::make_unique<std::thread>( &LoopThread::executeLoop, this ).
4. I have no experience with copy or move constructors, so I decided to forbid them. This should prevent me from accidentally using them. If I need them at some point in the future, I will have to take a deeper look at them.
5. thread_->detach() was replaced by thread_->join() (see 2.)
This is the end of the list.
class LoopThread {
public:
LoopThread(std::function<void(LoopThread*, uint32_t)> function) : function_{ function }, is_running_{ false }, counter_{ 0 } {};
LoopThread(LoopThread &&) = delete;
LoopThread(const LoopThread &) = delete;
LoopThread& operator=(const LoopThread&) = delete;
LoopThread& operator=(LoopThread&&) = delete;
~LoopThread();
void start();
void stop();
bool isRunning() const { return is_running_; };
private:
std::function<void(LoopThread*, uint32_t)> function_;
std::unique_ptr<std::thread> thread_;
std::atomic<bool> is_running_;
uint32_t counter_;
void executeLoop();
};
LoopThread::~LoopThread() {
if (isRunning()) {
stop();
}
}
void LoopThread::start() {
if (is_running_) {
throw std::runtime_error("Thread is already running");
}
if (thread_ != nullptr) {
throw std::runtime_error("Thread is not stopped yet");
}
is_running_ = true;
thread_ = std::make_unique<std::thread>( &LoopThread::executeLoop, this );
}
void LoopThread::stop() {
if (!is_running_) {
throw std::runtime_error("Thread is already stopped");
}
is_running_ = false;
thread_->join();
thread_ = nullptr;
}
void LoopThread::executeLoop() {
while (is_running_) {
function_(this, counter_);
++counter_;
}
}
TEST(pfThread, TestLoop)
{
pft::LoopThread loop{ std::bind(&testFunction, std::placeholders::_1, std::placeholders::_2) };
loop.start();
std::this_thread::sleep_for(std::chrono::milliseconds(50));
loop.stop();
}

Proper compiler intrinsics for double-checked locking?

When implementing double-checked locking for initialization, what is the proper way to place the memory and/or compiler barriers?
Something like std::call_once isn't what I want; it's way too slow. It's typically just implemented on top of pthread_mutex_lock or EnterCriticalSection, depending on the OS.
In my programs, I often run into initialization cases where the initialization is safe to repeat, as long as exactly one thread gets to set the final pointer. If another thread beats it to setting the final pointer to the singleton object, it deletes what it created and makes use of the other thread's. I also often use this in cases where it doesn't matter which thread "wins" because they all come up with the same result.
Here's an unsafe, overly-contrived example, using Visual C++ intrinsics:
MyClass *GetGlobalMyClass()
{
static MyClass *const UNSET_POINTER = reinterpret_cast<MyClass *>(
static_cast<intptr_t>(-1));
static MyClass *volatile s_object = UNSET_POINTER;
if (s_object == UNSET_POINTER)
{
MyClass *newObject = MyClass::Create();
if (_InterlockedCompareExchangePointer(&s_object, newObject,
UNSET_POINTER) != UNSET_POINTER)
{
// Another thread beat us. If Create didn't return null, destroy.
if (newObject)
{
newObject->Destroy(); // calls "delete this;", presumably
}
}
}
return s_object;
}
On a weakly-ordered memory architecture, my understanding is that it's possible that the new value of s_object is visible to other threads before other variables written inside MyClass::Create or MyClass::MyClass are visible. Also, the compiler itself could arrange the code this way in the absence of a compiler barrier (in Visual C++, _WriteBarrier, but _InterlockedCompareExchange acts as a barrier).
Do I need something like a store fence intrinsic in there to ensure that MyClass's variables are visible to all threads before s_object becomes something besides -1?
Fortunately, the rules in C++ are very simple:
If there is a data race, the behaviour is undefined.
In your code the data race is caused by the following read, which conflicts with the write operation in _InterlockedCompareExchangePointer.
if (s_object == UNSET_POINTER)
A thread-safe solution without blocking might look as follows. Note that on x86 a load operation with sequential consistency has basically no overhead compared to a regular load operation. If you care about other architectures, you can also use acquire release instead of sequential consistency.
static std::atomic<MyClass*> s_object{nullptr};
MyClass* o = s_object.load(std::memory_order_seq_cst);
if (o == nullptr) {
o = new MyClass{...};
MyClass* expected = nullptr;
if (!s_object.compare_exchange_strong(expected, o, std::memory_order_seq_cst)) {
delete o;
o = expected;
}
}
return o;
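For architectures where sequential consistency is not free, the acquire/release variant mentioned above could look roughly like this. It is a sketch under assumptions: MyClass is a hypothetical one-member struct and the function name GetGlobalMyClassAcqRel is made up for this illustration.

```cpp
#include <atomic>

struct MyClass { int value = 1; };  // hypothetical payload

MyClass* GetGlobalMyClassAcqRel() {
    static std::atomic<MyClass*> s_object{nullptr};

    // Acquire: if we see a non-null pointer, we also see the object's contents.
    MyClass* o = s_object.load(std::memory_order_acquire);
    if (o == nullptr) {
        MyClass* fresh = new MyClass{};
        MyClass* expected = nullptr;
        // Release on success publishes the constructed object;
        // acquire on failure lets us safely use the winner's object.
        if (s_object.compare_exchange_strong(expected, fresh,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire)) {
            o = fresh;
        } else {
            delete fresh;  // another thread won the race
            o = expected;
        }
    }
    return o;
}
```

As in the question, several threads may construct an object at the same time, but only one pointer is ever published and the losers delete their own copy.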
In a conforming C++11 implementation, any function-local static variable is initialized in a thread-safe fashion by the first thread whose control flow passes through its declaration.
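When the object can simply be constructed in place, that guarantee gives the shortest version. A minimal sketch, assuming a hypothetical MyClass whose constructor counts how often it runs:

```cpp
struct MyClass {
    MyClass() { ++constructed; }
    static int constructed;  // counts constructor runs, for illustration only
};
int MyClass::constructed = 0;

MyClass& GetGlobalMyClass() {
    // Initialized exactly once, even if several threads arrive here together.
    static MyClass instance;
    return instance;
}
```

Unlike the compare-exchange approach, concurrent callers wait for the first initialization to finish instead of building and discarding their own copy, so it fits best when construction is cheap or must not be repeated.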

Read-write thread-safe smart pointer in C++, x86-64

I am developing a lock-free data structure and the following problem arises.
I have a writer thread that creates objects on the heap and wraps them in a smart pointer with a reference counter. I also have a lot of reader threads that work with these objects. The code can look like this:
SmartPtr ptr;
class Reader : public Thread {
virtual void Run() {
for (;;) {
SmartPtr local(ptr);
// do smth
}
}
};
class Writer : public Thread {
virtual void Run() {
for (;;) {
SmartPtr newPtr(new Object);
ptr = newPtr;
}
}
};
int main() {
Pool* pool = SystemThreadPool();
pool->Run(new Reader());
pool->Run(new Writer());
for (;;) {} // wait for crash :(
}
When I create a thread-local copy of ptr, it means at least:
Read an address.
Increment the reference counter.
I can't do these two operations atomically, and thus my readers sometimes work with a deleted object.
The question is: what kind of smart pointer should I use to make read-write access from several threads possible with correct memory management? A solution should exist, since Java programmers don't even care about such a problem; they simply rely on all objects being references that are deleted only when nobody uses them.
For PowerPC I found http://drdobbs.com/184401888, which looks nice but uses Load-Linked and Store-Conditional instructions, which we don't have on x86.
As far as I understand, boost pointers provide such functionality only by using locks. I need a lock-free solution.
boost::shared_ptr has atomic_store, which uses a "lock-free" spinlock that should be fast enough for 99% of possible cases.
boost::shared_ptr<Object> ptr;
class Reader : public Thread {
virtual void Run() {
for (;;) {
boost::shared_ptr<Object> local(boost::atomic_load(&ptr));
// do smth
}
}
};
class Writer : public Thread {
virtual void Run() {
for (;;) {
boost::shared_ptr<Object> newPtr(new Object);
boost::atomic_store(&ptr, newPtr);
}
}
};
int main() {
Pool* pool = SystemThreadPool();
pool->Run(new Reader());
pool->Run(new Writer());
for (;;) {}
}
EDIT:
In response to comment below, the implementation is in "boost/shared_ptr.hpp"...
template<class T> void atomic_store( shared_ptr<T> * p, shared_ptr<T> r )
{
boost::detail::spinlock_pool<2>::scoped_lock lock( p );
p->swap( r );
}
template<class T> shared_ptr<T> atomic_exchange( shared_ptr<T> * p, shared_ptr<T> r )
{
boost::detail::spinlock & sp = boost::detail::spinlock_pool<2>::spinlock_for( p );
sp.lock();
p->swap( r );
sp.unlock();
return r; // return std::move( r )
}
With some jiggery-pokery you should be able to accomplish this using InterlockedCompareExchange128. Store the reference count and pointer in a 2 element __int64 array. If reference count is in array[0] and pointer in array[1] the atomic update would look like this:
while(true)
{
__int64 comparand[2];
comparand[0] = refCount;
comparand[1] = pointer;
if(1 == InterlockedCompareExchange128(
array,
pointer,
refCount + 1,
comparand))
{
// Pointer is ready for use. Exit the while loop.
break;
}
}
If an InterlockedCompareExchange128 intrinsic function isn't available for your compiler then you may use the underlying CMPXCHG16B instruction instead, if you don't mind mucking around in assembly language.
The solution proposed by RobH doesn't work. It has the same problem as the original question: when accessing the reference count object, it might already have been deleted.
The only way I see of solving the problem without a global lock (as in boost::atomic_store) or conditional read/write instructions is to somehow delay the destruction of the object (or the shared reference count object if such thing is used). So zennehoy has a good idea but his method is too unsafe.
The way I might do it is by keeping copies of all the pointers in the writer thread so that the writer can control the destruction of the objects:
class Writer : public Thread {
virtual void Run() {
list<SmartPtr> ptrs; //list that holds all the old ptr values
for (;;) {
SmartPtr newPtr(new Object);
if(ptr)
ptrs.push_back(ptr); //push previous pointer into the list
ptr = newPtr;
//Periodically go through the list and destroy objects that are not
//referenced by other threads
for(auto it=ptrs.begin(); it!=ptrs.end(); )
if(it->refCount()==1)
it = ptrs.erase(it);
else
++it;
}
}
};
However there are still requirements for the smart pointer class. This doesn't work with shared_ptr as the reads and writes are not atomic. It almost works with boost::intrusive_ptr. The assignment on intrusive_ptr is implemented like this (pseudocode):
//create temporary from rhs
tmp.ptr = rhs.ptr;
if(tmp.ptr)
intrusive_ptr_add_ref(tmp.ptr);
//swap(tmp,lhs)
T* x = lhs.ptr;
lhs.ptr = tmp.ptr;
tmp.ptr = x;
//destroy temporary
if(tmp.ptr)
intrusive_ptr_release(tmp.ptr);
As far as I understand the only thing missing here is a compiler level memory fence before lhs.ptr = tmp.ptr;. With that added, both reading rhs and writing lhs would be thread-safe under strict conditions: 1) x86 or x64 architecture 2) atomic reference counting 3) rhs refcount must not go to zero during the assignment (guaranteed by the Writer code above) 4) only one thread writing to lhs (using CAS you could have several writers).
Anyway, you could create your own smart pointer class based on intrusive_ptr with necessary changes. Definitely easier than re-implementing shared_ptr. And besides, if you want performance, intrusive is the way to go.
The reason this works much more easily in Java is garbage collection. In C++, you have to manually ensure that a value is not just starting to be used by a different thread when you want to delete it.
A solution I've used in a similar situation is to simply delay the deletion of the value. I create a separate thread that iterates through a list of things to be deleted. When I want to delete something, I add it to this list with a timestamp. The deleting thread waits until some fixed time after this timestamp before actually deleting the value. You just have to make sure that the delay is large enough to guarantee that any temporary use of the value has completed.
100 milliseconds would have been enough in my case, I chose a few seconds to be safe.
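The bookkeeping for such a delayed deletion could be sketched like this. DeferredDeleter, retire() and collect() are hypothetical names, and the separate deleting thread is left out: it would simply call collect() in a loop. The current time is passed in as a parameter so that the behaviour can be checked without actually sleeping.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical helper: objects are parked here instead of being deleted at once.
template <class T>
class DeferredDeleter {
public:
    using Clock = std::chrono::steady_clock;

    explicit DeferredDeleter(Clock::duration delay) : delay_(delay) {}

    // Park an object together with the time at which it was retired.
    void retire(std::unique_ptr<T> object, Clock::time_point now = Clock::now()) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.emplace_back(now, std::move(object));
    }

    // Delete everything retired at least `delay_` ago; returns how many.
    std::size_t collect(Clock::time_point now = Clock::now()) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::size_t deleted = 0;
        while (!pending_.empty() && now - pending_.front().first >= delay_) {
            pending_.pop_front();  // the unique_ptr destructor deletes the object
            ++deleted;
        }
        return deleted;
    }

private:
    Clock::duration delay_;
    std::mutex mutex_;  // retire() and collect() run on different threads
    std::deque<std::pair<Clock::time_point, std::unique_ptr<T>>> pending_;
};
```

The safety of the scheme still rests entirely on the delay being longer than any temporary use of a retired object, exactly as described above; the code only automates the waiting.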