Confusion about an implementation error within the shared_ptr destructor - C++

I have just seen Herb Sutter's talk: C++ and Beyond 2012: Herb Sutter - atomic<> Weapons, 2 of 2
He shows a bug in an implementation of the std::shared_ptr destructor:
    if (control_block_ptr->refs.fetch_sub(1, memory_order_relaxed) == 0)
        delete control_block_ptr; // B
He says that, due to memory_order_relaxed, the delete can be reordered before the fetch_sub.
At 1:25:18 he says: "Release doesn't keep line B below, where it should be."
How is that possible? There is a happens-before / sequenced-before relationship, because both are in a single thread. I might be wrong, but there is also a carries-a-dependency-to relation between the fetch_sub and the delete.
If he is right, which clauses of the ISO standard support that?

Imagine code that releases a shared pointer:
    auto tmp = &(the_ptr->a);
    *tmp = 10;
    the_ptr.dec_ref();
If dec_ref() doesn't have release semantics, it's perfectly fine for the compiler (or CPU) to move things from before dec_ref() to after it, for example:
    auto tmp = &(the_ptr->a);
    the_ptr.dec_ref();
    *tmp = 10;
And this is not safe, since dec_ref() can also be called from another thread at the same time and delete the object.
So it must have release semantics for things before dec_ref() to stay there.
Let's now imagine that the object's destructor looks like this:
    ~object() {
        auto xxx = a;
        printf("%i\n", xxx);
    }
We will also modify the example a bit and use two threads:
    // thread 1
    auto tmp = &(the_ptr->a);
    *tmp = 10;
    the_ptr.dec_ref();

    // thread 2
    the_ptr.dec_ref();
Then, the "aggregated" code will look like:
    // thread 1
    auto tmp = &(the_ptr->a);
    *tmp = 10;
    { // the_ptr.dec_ref();
        if (0 == atomic_sub(...)) {
            { // ~object()
                auto xxx = a;
                printf("%i\n", xxx);
            }
        }
    }

    // thread 2
    { // the_ptr.dec_ref();
        if (0 == atomic_sub(...)) {
            { // ~object()
                auto xxx = a;
                printf("%i\n", xxx);
            }
        }
    }
However, if we only have release semantics for atomic_sub(), this code can be optimized this way:
    // thread 2
    auto xxx = the_ptr->a; // "auto xxx = a;" from the destructor moved here
    { // the_ptr.dec_ref();
        if (0 == atomic_sub(...)) {
            { // ~object()
                printf("%i\n", xxx);
            }
        }
    }
But that way the destructor will not always print the last value of a (this code is not race-free anymore). That's why we also need acquire semantics for atomic_sub (or, strictly speaking, an acquire barrier when the counter becomes 0 after the decrement).
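As a concrete illustration, here is a minimal sketch of a race-free decrement following exactly that recipe: release on the decrement, plus an acquire fence on the path that destroys. It mirrors the classic reference-counting idiom (not any particular vendor's code), assuming refs holds the number of owners, so a previous value of 1 means we were the last:

    #include <atomic>

    struct control_block {
        std::atomic<long> refs{1};
        // ... payload, deleter, etc.
    };

    void dec_ref(control_block* cb)
    {
        // release: publishes this thread's writes to the object
        if (cb->refs.fetch_sub(1, std::memory_order_release) == 1) {
            // acquire: makes every other owner's writes visible here
            std::atomic_thread_fence(std::memory_order_acquire);
            delete cb; // last owner: destruction is ordered after all accesses
        }
    }

Using memory_order_acq_rel on the fetch_sub itself would also be correct; the fence variant merely avoids paying for the acquire on the common, non-final decrements.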

It looks like he is talking about synchronization of actions on the shared object itself, which are not shown in his code blocks (and as a result this is confusing).
That's why he put acq_rel: all actions on the object should happen-before its destruction, all in order.
But I'm still not sure why he talks about swapping the delete with the fetch_sub.

This is a late reply.
Let's start out with this simple type:
#include <future>
#include <iostream>
#include <memory>

struct foo
{
    ~foo() { std::cout << value; }
    int value;
};
And we'll use this type in a shared_ptr, as follows:
void runs_in_separate_thread(std::shared_ptr<foo> my_ptr)
{
    my_ptr->value = 5;
    my_ptr.reset();
}

int main()
{
    std::shared_ptr<foo> my_ptr(new foo);
    std::async(std::launch::async, runs_in_separate_thread, my_ptr);
    my_ptr.reset();
}
Two threads will be running in parallel, both sharing ownership of a foo object.
With a correct shared_ptr implementation
(that is, one with memory_order_acq_rel), this program has defined behavior.
The only value that this program will print is 5.
With an incorrect implementation (using memory_order_relaxed) there
are no such guarantees. The behavior is undefined, because a data race on
foo::value is introduced. The trouble occurs only in cases where the destructor
gets called in the main thread. With a relaxed memory order, the write
to foo::value in the other thread may not propagate to the destructor in the main thread.
A value other than 5 could be printed.
So what's a data race? Well, check out the definition and pay attention to the last bullet point:
When an evaluation of an expression writes to a memory location and another evaluation reads or modifies the same memory location, the expressions are said to conflict. A program that has two conflicting evaluations has a data race unless either
- both conflicting evaluations are atomic operations (see std::atomic), or
- one of the conflicting evaluations happens-before another (see std::memory_order).
In our program, one thread will write to foo::value and one thread will
read from foo::value. These are supposed to be sequential; the write
to foo::value should always happen before the read. Intuitively, it
makes sense that they would be as the destructor is supposed to be the last
thing that happens to an object.
memory_order_relaxed does not offer such ordering guarantees, though, so memory_order_acq_rel is required.

In the talk Herb shows memory_order_release, not memory_order_relaxed, but relaxed would have even more problems.
Unless delete control_block_ptr accesses control_block_ptr->refs (which it probably doesn't), the atomic operation does not carry-a-dependency-to the delete. The delete operation might not touch any memory in the control block; it might just return that pointer to the free-store allocator.
But I'm not sure if Herb is talking about the compiler moving the delete before the atomic operation, or just referring to when the side effects become visible to other threads.

Related

What exactly is the synchronizes-with relationship?

I've been reading the post by Jeff Preshing about The Synchronizes-With Relation, and also the "Release-Acquire ordering" section of the std::memory_order page on cppreference, and I don't really understand it:
It seems that there is some kind of promise made by the standard, and I don't understand why it's necessary. Let's take the example from cppreference:
#include <thread>
#include <atomic>
#include <cassert>
#include <string>

std::atomic<std::string*> ptr;
int data;

void producer()
{
    std::string* p = new std::string("Hello");
    data = 42;
    ptr.store(p, std::memory_order_release);
}

void consumer()
{
    std::string* p2;
    while (!(p2 = ptr.load(std::memory_order_acquire)))
        ;
    assert(*p2 == "Hello"); // never fires
    assert(data == 42); // never fires
}

int main()
{
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join(); t2.join();
}
The reference explains that:
If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory. This promise only holds if B actually returns the value that A stored, or a value from later in the release sequence.
As far as I understand, when we write
    ptr.store(p, std::memory_order_release)
what we're actually doing is telling both the compiler and the CPU to make it so that, at run time, the writes to data and to the memory pointed to by std::string* p become visible to thread t2 no later than the new value of ptr does.
And likewise, when we write
    ptr.load(std::memory_order_acquire)
we are telling the compiler and the CPU to make it so that the load of ptr happens no later than the loads of *p2 and data.
So what further promise do we have here?
This ptr.store(p, std::memory_order_release) (L1) guarantees that anything done prior to this line in this particular thread (T1) will be visible to other threads, as long as those other threads read ptr in the correct fashion (in this case, using std::memory_order_acquire). This guarantee works only as a pair; alone, this line guarantees nothing.
Now you have ptr.load(std::memory_order_acquire) (L2) on the other thread (T2) which, working with its pair from the other thread, guarantees that, as long as it has read the value written in T1, you can see the other values written prior to that line (in your case, data). So, because L1 synchronizes with L2, data = 42; happens before assert(data == 42).
Also there is a guarantee that ptr is written and read atomically, because, well, it is atomic. No other guarantees or promises are in that code.
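To see what the extra promise buys, consider a hypothetical variation of the example above (reusing the same ptr and data globals) with both operations weakened to memory_order_relaxed. No synchronizes-with edge is created, so both asserts may fire, and the access to data becomes a data race:

    void producer_relaxed()
    {
        std::string* p = new std::string("Hello");
        data = 42;
        ptr.store(p, std::memory_order_relaxed); // nothing is published along with ptr
    }

    void consumer_relaxed()
    {
        std::string* p2;
        while (!(p2 = ptr.load(std::memory_order_relaxed)))
            ;
        assert(*p2 == "Hello"); // may fire: the string's contents need not be visible yet
        assert(data == 42);     // may fire: unsynchronized access to data is a data race
    }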

C++: __sync_synchronize() still needed with std::atomic?

I've been running into an infrequent but recurring race condition.
The program has two threads and uses std::atomic. I'll simplify the critical parts of the code to look like this:
std::atomic<uint64_t> b; // flag, initialized to 0
uint64_t data[100]; // shared data, initialized to 0
thread 1 (publishing):
    // set various shared variables here, for example
    data[5] = 10;
    uint64_t a = b.exchange(1); // signal to thread 2 that data is ready
thread 2 (receiving):
    if (b.load() != 0) { // signal that data is ready
        // read various shared variables here, for example:
        uint64_t x = data[5];
        // race condition sometimes (x sometimes not consistent)
    }
The odd thing is that when I add __sync_synchronize() to each thread, the race condition goes away. I've seen this happen on two different servers.
That is, when I change the code to look like the following, the problem goes away:
thread 1 (publishing):
    // set various shared variables here, for example
    data[5] = 10;
    __sync_synchronize();
    uint64_t a = b.exchange(1); // signal to thread 2 that data is ready
thread 2 (receiving):
    if (b.load() != 0) { // signal that data is ready
        __sync_synchronize();
        // read various shared variables here, for example:
        uint64_t x = data[5];
    }
Why is __sync_synchronize() necessary? It seems redundant as I thought both exchange and load ensured the correct sequential ordering of logic.
The architecture is x86_64, Linux, g++ 4.6.2.
While it is impossible to say from your simplified code what actually goes on in your application, the fact that __sync_synchronize helps, and the fact that this function is a full memory barrier, tells me that you are writing things in one thread that the other thread is reading, in a way that isn't atomic.
An example (for this sketch, assume b is a std::atomic<object*>; the point is the ordering, not the flag's type):
thread_1:
    object *p = new object;
    p->x = 1;
    b.exchange(p); /* give pointer p to other thread */
thread_2:
    object *p = b.load();
    if (p->x == 1) do_stuff();
    else error("Huh?");
This may very well trigger the error path in thread_2, because the write to p->x has not actually completed when thread 2 reads the new pointer value p.
Adding a memory barrier, in this case in the thread_1 code, should fix this. Note that for THIS case a memory barrier in thread_2 will not do anything; it may alter the timing and appear to fix the problem, but it won't be the right thing. You may still need memory barriers on both sides if you are reading and writing memory that is shared between two threads.
I understand that this may not be precisely what your code is doing, but the concept is the same: __sync_synchronize ensures that memory reads and memory writes have completed for ALL of the instructions before that call (which isn't a real function call; it inlines to a single instruction that waits for any pending memory operations to complete).
Noteworthy is that operations on std::atomic with memory_order_relaxed only affect the actual data stored in the atomic object, not reads and writes of other data; the stronger orderings (including the default memory_order_seq_cst used by plain exchange() and load()) do constrain how surrounding accesses may be reordered.
Sometimes you also need a "compiler barrier" to avoid the compiler moving stuff from one side of an operation to the other:
    std::atomic<bool> flag(false);
    value = 42;
    flag.store(true);
....
another thread:
    while (!flag.load())
        ;
    print(value);
If flag were a plain (non-atomic) bool, there would be a chance that the compiler generates the first form as:
    flag = true;
    value = 42;
Now, that wouldn't be good, would it? A store to a std::atomic (with its default ordering) is guaranteed to act as a "compiler barrier", but in other cases the compiler may well shuffle things around in a similar way.
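To make the intended pairing explicit in standard C++11 terms, here is a minimal sketch of the question's publish/receive pattern with release/acquire ordering (the default seq_cst exchange() and load() are at least this strong on a conforming implementation; the function names are illustrative):

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint64_t> b{0}; // flag
    std::uint64_t data[100] = {};    // shared data

    // thread 1 (publishing):
    void publish()
    {
        data[5] = 10;                          // ordinary write
        b.store(1, std::memory_order_release); // publishes all writes above...
    }

    // thread 2 (receiving):
    void receive()
    {
        if (b.load(std::memory_order_acquire) != 0) { // ...to any thread whose acquire load sees the 1
            std::uint64_t x = data[5];                // guaranteed to read 10
            (void)x;
        }
    }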

Confusion about thread-safety

I am new to the world of concurrency, but from what I have read I understand the program below to be undefined in its execution. If I understand correctly, this is not thread-safe, as I am concurrently reading and writing both the shared_ptr and the counter variable in non-atomic ways.
#include <string>
#include <memory>
#include <thread>
#include <chrono>
#include <cstdint>
#include <iostream>
struct Inner {
    Inner() {
        t_ = std::thread([this]() {
            counter_ = 0;
            running_ = true;
            while (running_) {
                counter_++;
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
        });
    }

    ~Inner() {
        running_ = false;
        if (t_.joinable()) {
            t_.join();
        }
    }

    std::uint64_t counter_;
    std::thread t_;
    bool running_;
};

struct Middle {
    Middle() {
        data_.reset(new Inner);
        t_ = std::thread([this]() {
            running_ = true;
            while (running_) {
                data_.reset(new Inner());
                std::this_thread::sleep_for(std::chrono::milliseconds(1000));
            }
        });
    }

    ~Middle() {
        running_ = false;
        if (t_.joinable()) {
            t_.join();
        }
    }

    std::uint64_t inner_data() {
        return data_->counter_;
    }

    std::shared_ptr<Inner> data_;
    std::thread t_;
    bool running_;
};

struct Outer {
    std::uint64_t data() {
        return middle_.inner_data();
    }

    Middle middle_;
};

int main() {
    Outer o;
    while (true) {
        std::cout << "Data: " << o.data() << std::endl;
    }
    return 0;
}
My confusion comes from this:
1. Is the access to data_->counter_ safe in Middle::inner_data?
2. If thread A has a member shared_ptr<T> sp and decides to update it while thread B does shared_ptr<T> sp = A::sp, will the copy and destruction be thread-safe? Or do I risk the copy failing because the object is in the process of being destroyed?
3. Under what circumstances (and can I check this with some tool?) is undefined behavior likely to mean std::terminate? I suspect something like the above happens in some of my production code, but I cannot be certain, as I am confused about 1 and 2; this small program has been running for days since I wrote it and nothing has happened.
The code can be checked at https://godbolt.org/g/saHz94
Is the access to data_->counter_ safe in Middle::inner_data?
No; it's a race condition. According to the standard, it is undefined behavior any time you allow unsynchronized access to the same variable from more than one thread and at least one of those threads might modify it.
As a practical matter, here are a couple of unwanted behaviors you might see:
- The thread reading counter_ reads an "old" value (one that rarely or never updates) because different processor cores cache the variable independently of each other. Using std::atomic would avoid this problem, because then the compiler knows you intend the variable to be accessed without other synchronization and takes precautions to prevent it.
- Thread A might read the address that the data_ shared pointer points to and be just about to dereference it and read from the Inner struct when it gets preempted by thread B. During thread B's execution, the old Inner struct gets deleted and data_ is set to point to a new Inner struct. When thread A gets back onto the CPU, it still has the old pointer value in memory, so it dereferences the old value rather than the new one and ends up reading from freed/invalid memory. Again, this is undefined behavior, so in principle anything could happen; in practice you're likely to see no obvious misbehavior, or occasionally a wrong/garbage value, or possibly a crash.
If thread A has a member shared_ptr sp and decides to update it while thread B does shared_ptr sp = A::sp, will the copy and destruction be thread-safe? Or do I risk the copy failing because the object is in the process of being destroyed?
Copying a shared_ptr is thread-safe when each thread operates on its own shared_ptr instance, because the reference counts in the shared control block are updated atomically. Retargeting and reading the same shared_ptr object from two threads at once, however (as happens here, where one thread calls data_.reset() while another reads data_), is a data race; you would need std::atomic<std::shared_ptr<T>> or the atomic free-function overloads for shared_ptr (see the sketch after the next answer). And if you are modifying the state of the T objects themselves (i.e. the Inner object in your example), that is not thread-safe either, since you could have one thread reading from the object while another thread is writing to it (deleting the object can be seen as a special case of writing to it, in that it definitely changes the object's state).
Under what circumstances (can I check this with some tool?) is undefined likely to mean std::terminate?
When you hit undefined behavior, it's very much dependent on the details of your program, the compiler, the OS, and the hardware architecture what will happen. In principle, undefined behavior means anything (including the program running just as you intended!) can happen, but you can't rely on any particular behavior -- which is what makes undefined behavior so evil.
In particular, it's common for a multithreaded program with a race condition to run fine for hours/days/weeks and then one day the timing is just right and it crashes or computes an incorrect result. Race conditions can be really difficult to reproduce for that reason.
As for when terminate() might be called: terminate() would be called if the fault causes an error state that is detected by the runtime environment (i.e. it corrupts a data structure that the runtime environment does integrity checks on, such as, in some implementations, the heap's metadata). Whether or not that actually happens depends on how the heap was implemented (which varies from one OS and compiler to the next) and what sort of corruption the fault introduced.
Thread safety is a relationship between operations in different threads, not an absolute property in general.
You cannot read or write a variable while another thread writes that variable, without synchronization between the other thread's write and your read or write. Doing so is undefined behavior.
Undefined can mean anything. Program crashes. Program reads impossible value. Program formats hard drive. Program emails your browser history to all of your contacts.
A common manifestation of unsynchronized integer access is that the compiler optimizes multiple reads of a value into one and doesn't reload it, because it can prove there is no defined way that someone could have modified the value. Or the CPU memory cache does the same thing, because you did not synchronize.
For the pointers, similar or worse problems can occur, including following dangling pointers, corrupting memory, crashes, etc.
There are now atomic operations you can perform on shared pointers (the atomic_load/atomic_store free-function overloads for std::shared_ptr), as well as std::atomic<std::shared_ptr<T>> in C++20.
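For question 2, here is a hedged sketch of race-free retargeting using C++20's std::atomic<std::shared_ptr<T>> (before C++20, the std::atomic_load/std::atomic_store overloads for shared_ptr serve the same purpose); Inner is pared down from the question, and the function names are illustrative:

    #include <atomic>
    #include <cstdint>
    #include <memory>

    struct Inner { std::uint64_t counter_ = 0; };

    // C++20: the shared_ptr itself can be retargeted and read atomically.
    std::atomic<std::shared_ptr<Inner>> data_;

    void writer()
    {
        data_.store(std::make_shared<Inner>()); // atomic retarget; old object freed once unused
    }

    std::uint64_t reader()
    {
        std::shared_ptr<Inner> local = data_.load(); // atomic snapshot keeps the object alive
        return local ? local->counter_ : 0;
    }

Note this only removes the race on the pointer itself; counter_ would still need to be a std::atomic<std::uint64_t> if one thread increments it while another reads it.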

Proper compiler intrinsics for double-checked locking?

When implementing double-checked locking for initialization, what is the proper way to place the memory and/or compiler barriers?
Something like std::call_once isn't what I want; it's way too slow. It's typically just implemented on top of pthread_mutex_lock or EnterCriticalSection, depending on the OS.
In my programs, I often run into initialization cases where the initialization is safe to repeat, as long as exactly one thread gets to set the final pointer. If another thread beats it to setting the final pointer to the singleton object, it deletes what it created and uses the other thread's. I also often use this in cases where it doesn't matter which thread "wins", because they all come up with the same result.
Here's an unsafe, overly-contrived example, using Visual C++ intrinsics:
MyClass *GetGlobalMyClass()
{
    static MyClass *const UNSET_POINTER = reinterpret_cast<MyClass *>(
        static_cast<intptr_t>(-1));
    static MyClass *volatile s_object = UNSET_POINTER;

    if (s_object == UNSET_POINTER)
    {
        MyClass *newObject = MyClass::Create();
        if (_InterlockedCompareExchangePointer(&s_object, newObject,
                                               UNSET_POINTER) != UNSET_POINTER)
        {
            // Another thread beat us. If Create didn't return null, destroy.
            if (newObject)
            {
                newObject->Destroy(); // calls "delete this;", presumably
            }
        }
    }

    return s_object;
}
On a weakly-ordered memory architecture, my understanding is that it's possible that the new value of s_object is visible to other threads before the other variables written inside MyClass::Create or MyClass::MyClass are visible. The compiler itself could also arrange the code this way in the absence of a compiler barrier (in Visual C++, _WriteBarrier; but _InterlockedCompareExchange acts as a barrier).
Do I need a store-fence intrinsic in there, or something similar, to ensure that MyClass's variables are visible to all threads before s_object becomes something other than -1?
Fortunately, the rules in C++ are very simple:
If there is a data race, the behaviour is undefined.
In your code the data race is caused by the following read, which conflicts with the write operation in _InterlockedCompareExchangePointer:
    if (s_object == UNSET_POINTER)
A thread-safe solution without blocking might look as follows. Note that on x86 a load operation with sequential consistency has basically no overhead compared to a regular load operation. If you care about other architectures, you can also use acquire/release instead of sequential consistency.
MyClass* GetGlobalMyClass()
{
    static std::atomic<MyClass*> s_object{nullptr};

    MyClass* o = s_object.load(std::memory_order_seq_cst);
    if (o == nullptr) {
        o = new MyClass{...};
        MyClass* expected = nullptr;
        if (!s_object.compare_exchange_strong(expected, o, std::memory_order_seq_cst)) {
            delete o;
            o = expected;
        }
    }
    return o;
}
Alternatively, in a proper C++11 implementation, any function-local static variable is constructed in a thread-safe fashion by the first thread whose control passes through its declaration.
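A minimal sketch of that alternative, reusing the question's MyClass::Create() factory (note that concurrent first callers may block while the initializer runs, which is exactly the cost the asker wanted to avoid; after that it is an ordinary read):

    MyClass *GetGlobalMyClass()
    {
        // C++11 guarantees the initializer runs exactly once; other threads
        // arriving during initialization wait for it to complete.
        static MyClass *const s_object = MyClass::Create();
        return s_object;
    }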

Wait-free or lock-free initialization

I've been looking at lazy initialization (of global variables, say, but it could be of anything). So far, what I've come up with is something like this:
enum state {
    uninitialized,
    initializing,
    initialized
};

std::atomic<state> s{uninitialized};
alignas(T) unsigned char memory[sizeof(T)];

T& initialize() {
    state val = uninitialized;
    s.compare_exchange_strong(val, initializing); // val now holds the previous state
    if (val == initialized)
        return *reinterpret_cast<T*>(memory);
    if (val == initializing) {
        while (s.load() != initialized)
            ; // spin until the winning thread finishes
        return *reinterpret_cast<T*>(memory);
    }
    new (memory) T();
    s.store(initialized);
    return *reinterpret_cast<T*>(memory);
}
In the case where it has already been initialized, this is wait-free. But I have a problem with the case where one thread is initializing: the number of steps required to finish initializing, or to wait for initialization to finish, isn't bounded. If the initializing thread is paused, the other threads all have to wait arbitrarily long until it resumes. So in the general case it's not lock-free or wait-free.
Is it possible to create lazy initialization which is wait-free or lock-free?
If you're willing to initialize more than one object, then you can make lock-free code by only storing a pointer:
    std::atomic<T *> p { nullptr };

    T & get()
    {
        T * q = p.load();
        if (!q)
        {
            T * r = new T;
            if (p.compare_exchange_strong(q, r))
            {
                return *r;
            }
            else
            {
                delete r;
                return *q;
            }
        }
        return *q;
    }
The cost of lock-free algorithms is that you generally have to "try and fail", so you have to pay the price of trying locally even if you have to discard the result.
As you rightly pointed out, if only a single thread performs the initialization, you always depend on that single thread, and you cannot be lock-free.
You will need corresponding clean-up code, too, if the destructor has side effects:
delete p.exchange(nullptr);
I wouldn't have thought either is possible in the general case. If you do this as part of global initialization, before main is called and before the threads are created, then you stand a chance; otherwise, how can it be wait-free or lock-free?
It could be lock-free if you allow the initialization to fail and be retried later.
It can't be wait-free if another thread holds the item.
If you care about the performance here, you could use the thread id to hash into a bucket of items, to try to reduce the frequency of contention; a sketch follows.
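A hedged sketch of that last idea, with illustrative names (Item, kBuckets): each thread hashes its id to one of several independently initialized slots, so threads landing in different buckets never contend. This helps only when any initialized Item will do, since threads hashing to different buckets get different objects:

    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <thread>

    struct Item { /* expensive-to-initialize state */ };

    constexpr std::size_t kBuckets = 8;        // illustrative bucket count
    std::atomic<Item*> g_slots[kBuckets] = {}; // all slots start as nullptr

    Item& get_for_this_thread()
    {
        std::size_t i = std::hash<std::thread::id>{}(std::this_thread::get_id()) % kBuckets;
        Item* q = g_slots[i].load();
        if (!q) {
            Item* r = new Item;
            if (g_slots[i].compare_exchange_strong(q, r))
                return *r; // we initialized this bucket
            delete r;      // another thread won the race for this bucket; use its object
        }
        return *q;
    }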