Is struct assignment atomic in C/C++?

I am writing a program in which one process reads and writes a shared-memory region, and another process only reads it. In the shared memory there is a struct like this:
struct A {
    int a;
    int b;
    double c;
};
What I want is to read the struct in one shot, because while I am reading it, the other process might be modifying its contents. This would be guaranteed if struct assignment were atomic, that is, not interruptible. Like this:
struct A r = shared_struct;
So, is struct assignment atomic in C/C++? I tried searching the web but could not find a helpful answer. Can anyone help?
Thank you.

No. Neither the C nor the C++ standard guarantees that assignment is atomic. You need something implementation-specific for that, either from the compiler or from the operating system.

C and C++ do support atomic types in their current standards: C++11 introduced support for atomic types, and likewise C11 introduced atomics.

Do you need to atomically snapshot all the struct members? Or do you just need shared read/write access to separate members separately? The latter is a lot easier, see below.
C11 stdatomic and C++11 std::atomic provide syntax for atomic objects of arbitrary size. If they're larger than 8 or 16 bytes, though, they won't be lock-free on typical systems (i.e. atomic load, store, exchange, or CAS will be implemented by taking a hidden lock and then copying the whole struct).
If you just want a couple members, it's probably better to use a lock yourself and then access the members, instead of getting the compiler to copy the whole struct. (Current compilers aren't good at optimizing weird uses of atomics like that).
Or add a level of indirection, so there's a pointer which can easily be updated atomically to point to another struct with a different set of values. This is the building block for RCU (Read-Copy-Update) See also https://lwn.net/Articles/262464/. There are good library implementations of RCU, so use one instead of rolling your own unless your use-case is a lot simpler than the general case. Figuring out when to free old copies of the struct is one of the hard parts, because you can't do that until the last reader thread is done with it. And the whole point of RCU is to make the read path as light-weight as possible...
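To make that indirection concrete, here is a minimal sketch (not a real RCU implementation; writer_update and reader are made-up names, and it deliberately leaks the old copy because safe reclamation is exactly the hard part just described):
#include <atomic>

struct A { int a; int b; double c; };

// Writers build a new struct off to the side, then publish it with one
// atomic pointer store. Readers take one atomic pointer load and get a
// consistent snapshot through the indirection.
std::atomic<A*> current{new A{0, 0, 0.0}};

void writer_update(int a, int b, double c) {
    A* fresh = new A{a, b, c};
    A* old = current.exchange(fresh, std::memory_order_acq_rel);
    // Safe reclamation of 'old' is the hard part: some reader may still be
    // using it. A real RCU library defers the delete; this sketch leaks it.
    (void)old;
}

double reader() {
    A* snap = current.load(std::memory_order_acquire); // one atomic pointer load
    return snap->c;                                    // consistent snapshot via indirection
}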
Your struct is 16 bytes on most systems; just barely small enough that x86-64 can load or store the entire thing somewhat more efficiently than taking a lock (but only with lock cmpxchg16b). Still, it's not totally silly to use C/C++ atomics for this.
Common to both C++11 and C11:
struct A {
    int a;
    int b;
    double c;
};
In C11 use the _Atomic type qualifier to make an atomic type. It's a qualifier like const or volatile, so you can use it on just about anything.
#include <stdatomic.h>

_Atomic struct A shared_struct;

// atomically take a snapshot of the shared state and do something
double read_shared(void) {
    struct A tmp = shared_struct; // defaults to memory_order_seq_cst
    // or tmp = atomic_load_explicit(&shared_struct, memory_order_relaxed);
    //int t = shared_struct.a;    // UNDEFINED BEHAVIOUR

    // then do whatever you want with the copy; it's a normal struct
    if (tmp.a > tmp.b)
        tmp.c = -tmp.c;
    return tmp.c;
}

// or take tmp by value or pointer as a function arg
// static inline
void update_shared(int a, int b, double c) {
    struct A tmp = {a, b, c};
    //shared_struct = tmp; // would default to memory_order_seq_cst
    // If you just need atomicity, not ordering, relaxed is much faster
    // for small lock-free objects (no memory barrier):
    atomic_store_explicit(&shared_struct, tmp, memory_order_relaxed);
}
Note that accessing a single member of an _Atomic struct is undefined behaviour: it won't respect any locking, and might not be atomic. So don't do int i = shared_struct.a; (C++11 won't compile that, but C11 will).
In C++11, it's nearly the same: use the std::atomic<T> template.
#include <atomic>

std::atomic<A> shared_struct;

// atomically take a snapshot of the shared state and do something
double read_shared(void) {
    A tmp = shared_struct; // defaults to memory_order_seq_cst
    // or A tmp = shared_struct.load(std::memory_order_relaxed);
    // or tmp = std::atomic_load_explicit(&shared_struct, std::memory_order_relaxed);
    //int t = shared_struct.a; // won't compile: no operator.() overload

    // then do whatever you want with the copy; it's a normal struct
    if (tmp.a > tmp.b)
        tmp.c = -tmp.c;
    return tmp.c;
}

void update_shared(int a, int b, double c) {
    A tmp{a, b, c};
    //shared_struct = tmp; // also atomic, defaults to memory_order_seq_cst
    // If you just need atomicity, not ordering, relaxed is much faster
    // for small lock-free objects (no memory barrier):
    shared_struct.store(tmp, std::memory_order_relaxed);
}
See it on the Godbolt compiler explorer
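If you want to know what you actually get on your target, you can ask the atomic object at run time; a minimal sketch (C++17 additionally offers std::atomic<A>::is_always_lock_free as a compile-time constant):
#include <atomic>
#include <cstdio>

struct A { int a; int b; double c; };

std::atomic<A> shared_struct;

int main() {
    // Whether this 16-byte struct is lock-free depends on the target and
    // compiler flags (e.g. gcc needs -mcx16 for cmpxchg16b on x86-64).
    std::printf("lock-free: %d\n", shared_struct.is_lock_free() ? 1 : 0);
}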
Or if you don't need to snapshot the entire struct, but instead just want each member to be separately atomic, then you can simply make each member an atomic type (like atomic_int, _Atomic double, or std::atomic<double>).
struct Amembers {
    atomic_int a, b;
#ifdef __cplusplus
    std::atomic<double> c;
#else
    _Atomic double c;
#endif
} shared_state;
// If these members are used separately, put them in separate cache lines
// instead of in the same struct to avoid false sharing cache-line ping pong.
(Note that C11 stdatomic is not guaranteed to be compatible with C++11 std::atomic, so don't expect to be able to access the same struct from both C and C++.)
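If the members really are used by different threads independently, one way to act on the false-sharing comment above is alignas. A minimal sketch, assuming 64-byte cache lines (C++17 offers std::hardware_destructive_interference_size instead of the hard-coded 64):
#include <atomic>

struct AmembersPadded {
    alignas(64) std::atomic<int> a;     // each member gets its own cache line,
    alignas(64) std::atomic<int> b;     // so writes to one don't invalidate
    alignas(64) std::atomic<double> c;  // the line holding the others
};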
In C++11, struct-assignment for a struct with atomic members won't compile, because std::atomic deletes its copy-constructor. (You're supposed to load std::atomic<T> shared into T tmp, like in the whole-struct example above.)
In C11, struct-assignment for a non-atomic struct with atomic members will compile, but it is not atomic. The C11 standard doesn't specifically point this out anywhere; the best I can find is N1570 6.5.16.1 Simple assignment:
In simple assignment (=), the value of the right operand is converted to the type of the
assignment expression and replaces the value stored in the object designated by the left
operand.
Since this doesn't say anything about special handling of atomic members, it must be assumed that it's like a memcpy of the object representations. (Except that it's allowed to not update the padding.)
In practice, it's easy to get gcc to generate asm for structs with an atomic member where it copies non-atomically. Especially with a large atomic member which is atomic but not lock-free.

No, it isn't.
That is actually a property of the CPU architecture in relation to the memory layout of the struct.
You could use the 'atomic pointer swap' solution, which can be made atomic and can be used in a lock-free scenario.
Be sure to mark the respective shared pointer variables as volatile if it is important that changes are seen by other threads 'immediately'. In real life (TM), this is not enough to guarantee correct treatment by the compiler; instead, program directly against atomic primitives/intrinsics when you want lock-free semantics (see the comments and linked articles for background).
Of course, conversely, you'll have to make sure you take a deep copy at the relevant times in order to do any processing on the reading side.
Now all of this quickly becomes highly complex in relation to memory management, and I suggest you scrutinize your design and seriously ask yourself whether all the (perceived?) performance benefits justify the risk. Why not opt for a simple reader/writer lock, or get your hands on a good thread-safe shared-pointer implementation?
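For reference, the reader/writer-lock version suggested above is short. A minimal sketch using C++17's std::shared_mutex (C++14 has std::shared_timed_mutex); note that for the asker's two-process setup, the lock itself would have to live in the shared memory, e.g. a PTHREAD_PROCESS_SHARED pthread rwlock, so this sketch covers the single-process case:
#include <mutex>
#include <shared_mutex>

struct A { int a; int b; double c; };

A shared_data;
std::shared_mutex m;

// Readers share the lock and take a deep copy while holding it.
A read_copy() {
    std::shared_lock<std::shared_mutex> lock(m);
    return shared_data;
}

// The writer takes exclusive ownership for the update.
void write_update(const A& fresh) {
    std::unique_lock<std::shared_mutex> lock(m);
    shared_data = fresh;
}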

Related

Boolean stop signal between threads

What is the simplest way to signal a background thread to stop executing?
I have used something like:
volatile bool Global_Stop = false;

void do_stuff() {
    while (!Global_Stop) {
        //...
    }
}
Is there anything wrong with this? I know for complex cases you might need "atomic" or mutexes, but for just boolean signaling this should work, right?
std::atomic is not for "complex cases"; it is for when you need to access something from multiple threads. There are some myths about volatile; I cannot recall them, because all I remember is that volatile does not help when you need to access something from multiple threads. You need std::atomic<bool>. Whether accessing a bool is atomic on your actual hardware does not really matter, because as far as C++ is concerned it is not.
Yes there's a problem: that's not guaranteed to work in C++. But it's very simple to fix, so long as you're on at least C++11: use std::atomic<bool> instead, like this:
#include <atomic>

std::atomic<bool> Global_Stop{false}; // brace-init: pre-C++17, '= false' may not compile

void do_stuff() {
    while (!Global_Stop) {
        //...
    }
}
One problem is that the compiler is allowed to reorder memory accesses, so long as it can prove that it won't change the effect of the program:
int foo() {
    int i = 1;
    int j = 2;
    ++i;
    ++j;
    return i + j;
}
Here the compiler is allowed to increment j before i because that clearly won't change the effect of the program. In fact, it can optimise the whole thing away into return 5;. So what counts as "won't change the effect of the program"? The answer is long and complex, and I don't pretend to understand all of it, but one part of it is that the compiler only has to worry about threads in certain contexts. If i and j were global variables instead of local variables, it could still reverse ++i and ++j, because it's allowed to assume there's only one thread accessing them unless you use certain thread primitives (such as a mutex).
Now when it comes to code like this:
while (!Global_Stop) {
    //...
}
If it can prove the code hidden in the comment doesn't touch Global_Stop, and there are no thread primitives such as a mutex, it can happily optimise it to:
if (!Global_Stop) {
    while (true) {
        //...
    }
}
If it can prove that Global_Stop is false at the start then it can even remove the if check!
Actually things are even worse than this, at least in theory. If a thread is in the process of writing to a variable when another thread accesses it, only part of that write might be observed, giving you a totally different value (e.g. you update i from 3 to 4 and the other thread reads 7). Admittedly that is unlikely with a bool, but the standard is even broader than this: this situation is undefined behaviour, so it could even crash your program or have some other weird, unexpected behaviour.
Yes, this will most likely work, but only "by accident". As @idclev463035818 already wrote correctly:
std::atomic is not for "complex cases". It is for when you need to access something from multiple threads.
So in this case you should use atomic<bool> instead of volatile. The fact that volatile has been part of that language long before the introduction of threads in C++11 should already be a strong indication that volatile was never designed or intended to be used for multi-threading. It is important to note that in C++ volatile is something fundamentally different from volatile in languages like Java or C# where volatile is in fact related to the memory model. In these languages a volatile variable is much like an atomic in C++.
In C++, volatile is used for what is often referred to as "unusual memory", where memory can be read or modified outside the current process (for example when using memory mapped I/O). volatile forces the compiler to execute all operations in the exact order as specified. This prevents some optimizations that would be perfectly legal for atomics, while also allowing some optimizations that are actually illegal for atomics. For example:
volatile int x;
int y;
volatile int z;
x = 1;
y = 2;
z = 3;
z = 4;
...
int a = x;
int b = x;
int c = y;
int d = z;
In this example, there are two assignments to z, and two read operations on x. If x and z were atomics instead of volatile, the compiler would be free to see the first store as irrelevant and simply remove it. Likewise, it could just reuse the value returned by the first load of x, effectively generating code like int b = a. But since x and z are volatile, these optimizations are not possible. Instead, the compiler has to ensure that all volatile operations are executed in the exact order as specified, i.e., the volatile operations cannot be reordered with respect to each other. However, this does not prevent the compiler from reordering non-volatile operations. For example, the operations on y could freely be moved up or down, something that would not be possible if x and z were atomics. So if you were to try implementing a lock based on a volatile variable, the compiler could simply (and legally) move some code outside your critical section.
Last but not least it should be noted that marking a variable as volatile does not prevent it from participating in a data race. In those rare cases where you have some "unusual memory" (and therefore really require volatile) that is also accessed by multiple threads, you have to use volatile atomics.
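Declaring such a variable combines both qualifiers; a minimal sketch (status_word and poll_status are made-up names):
#include <atomic>

// 'volatile' because the hardware may change the word behind the program's
// back, 'atomic' because several threads read it. std::atomic provides
// volatile-qualified member overloads, so .load() still works here.
volatile std::atomic<unsigned> status_word{0};

unsigned poll_status() {
    return status_word.load(std::memory_order_relaxed);
}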

Can a member variable (attribute) reside in a register in C++?

A local variable (say an int) can be stored in a processor register, at least as long as its address is not needed anywhere. Consider a function computing something, say, a complicated hash:
int foo(int const* buffer, int size)
{
    int a; // local variable
    // perform heavy computations involving frequent reads and writes to a
    return a;
}
Now assume that the buffer does not fit into memory. We write a class for computing the hash from chunks of data, calling foo multiple times:
struct A
{
    void foo(int const* buffer, int size)
    {
        // perform heavy computations involving frequent reads and writes to a
    }
    int a;
};

A object;
while (...more data...)
{
    object.foo(buffer, size);
}
// do something with object.a
The example may be a bit contrived. The important difference here is that a was a local variable in the free function and now is a member variable of the object, so the state is preserved across multiple calls.
Now the question: would it be legal for the compiler to load a at the beginning of the foo method into a register and store it back at the end? In effect this would mean that a second thread monitoring the object could never observe an intermediate value of a (synchronization and undefined behavior aside). Provided that speed is a major design goal of C++, this seems to be reasonable behavior. Is there anything in the standard that would keep a compiler from doing this? If no, do compilers actually do this? In other words, can we expect a (possibly small) performance penalty for using a member variable, aside from loading and storing it once at the beginning and the end of the function?
As far as I know, the C++ language itself does not even specify what a register is. However, I think the question is clear anyway. Wherever this matters, I would appreciate answers for a standard x86 or x64 architecture.
The compiler can do that if (and only if) it can prove that nothing else will access a during foo's execution.
That's a non-trivial problem in general; I don't think any compiler attempts to solve it.
Consider the (even more contrived) example
struct B
{
    B(int& y) : x(y) {}
    void bar() { x = 23; }
    int& x;
};

struct A
{
    int a;
    void foo(B& b)
    {
        a = 12;
        b.bar();
    }
};
Looks innocent enough, but then we say
A baz;
B b(baz.a);
baz.foo(b);
"Optimising" this would leave 12 in baz.a, not 23, and that is clearly wrong.
Short answer to "Can a member variable (attribute) reside in a register?": yes.
When iterating through a buffer and writing the temporary result to any sort of primitive, wherever it resides, keeping the temporary result in a register would be a good optimization. This is done frequently in compilers. However, it is implementation-dependent and even influenced by the flags passed, so to know the result you should check the generated assembly.
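If you want the register allocation guaranteed rather than left to the optimizer, the usual idiom is to do the caching yourself. A minimal sketch (the hash loop is a stand-in for the question's heavy computation):
struct A {
    int a;

    // Cache the member in a local, run the hot loop on the local (which can
    // live in a register), and write the member back once at the end. This
    // also makes "no intermediate values observable" explicit.
    void foo(const int* buffer, int size) {
        int tmp = a;                    // one load at function entry
        for (int i = 0; i < size; ++i)
            tmp = tmp * 31 + buffer[i]; // stand-in for the heavy computation
        a = tmp;                        // one store at function exit
    }
};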

volatile variable being optimized away?

I have a background thread which loops on a state variable done. When I want to stop the thread I set the variable done to true. But apparently this variable is never set. I understand that the compiler might optimize it away so I have marked done volatile. But that seems not to have any effect. Note, I am not worried about race conditions so I have not made it atomic or used any synchronization constructs. How do I get the thread to not skip testing the variable at every iteration? Or is the problem something else entirely? done is initially false.
struct SomeObject
{
    volatile bool done_;
    void DoRun();
};

static void RunLoop(void* arg)
{
    if (!arg)
        return;
    SomeObject* thiz = static_cast<SomeObject*>(arg);
    while (!(thiz->done_)) {
        thiz->DoRun();
    }
    return;
}
volatile doesn't have any multi-threaded meaning in C++. It is a holdover from C, used as a modifier for sig_atomic_t flags touched by signal handlers and for access to memory mapped devices. There is no language-mandated compulsion for a C++ function to re-access memory, which leads to the race condition (reader never bothering to check twice as an "optimization") that others note above.
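That legitimate holdover use looks like this; a minimal sketch (the names are made up):
#include <csignal>

// The classic, standard-sanctioned use of volatile: a flag set from a
// signal handler and polled by the *same* thread. This is safe for
// signals, but still not a cross-thread communication mechanism.
volatile std::sig_atomic_t got_sigint = 0;

extern "C" void on_sigint(int) { got_sigint = 1; }

// somewhere in main: std::signal(SIGINT, on_sigint); then poll got_sigint.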
Use std::atomic (from C++11 or newer).
It can be, and usually is, lock-free:
#include <atomic>
#include <iostream>

struct SomeObject {
    std::atomic_bool done_;
    void DoRun();
    bool IsDone() { return done_.load(); }
    void KillMe() { done_.store(true); }
};

static void RunLoop(void* arg) {
    SomeObject& t = *static_cast<SomeObject*>(arg); // cast the pointer, then dereference
    std::cout << t.done_.is_lock_free(); // some archaic platforms may report false
    while (!t.IsDone()) {
        t.DoRun();
    }
}
The load() and store() methods force the compiler to, at the least, check the memory location at every iteration. For x86[_64], the cache line for the SomeObject instance's done_ member will be cached and checked locally with no lock or even atomic/locked memory reads as-is. If you were doing something more complicated than a one-way flag set, you'd need to consider using something like explicit memory fences, etc.
Pre-C++11 has no multi-threaded memory model, so you will have to rely on a third-party library with special compiler privileges like pthreads or use compiler-specific functionality to get the equivalent.
This works as you expect in MSVC 2010: if I remove the volatile, it loops forever; if I leave the volatile in, it works. This is because Microsoft treats volatile the way you expect, which is different from the ISO standard.
This works too:
struct CDone {
    bool m_fDone;
};

int ThreadProc(volatile CDone* pDone) {
    // ... loop on pDone->m_fDone ...
    return 0;
}
Here is what MSDN says:
http://msdn.microsoft.com/en-us/library/12a04hfd.aspx
Objects that are declared as volatile are not used in certain optimizations because their values can change at any time. The system always reads the current value of a volatile object when it is requested, even if a previous instruction asked for a value from the same object.
Also, the value of the object is written immediately on assignment.
ISO Compliant:
If you are familiar with the C# volatile keyword, or familiar with the behavior of volatile in earlier versions of Visual C++, be aware that the C++11 ISO Standard volatile keyword is different and is supported in Visual Studio when the /volatile:iso compiler option is specified. (For ARM, it's specified by default.) The volatile keyword in C++11 ISO Standard code is to be used only for hardware access; do not use it for inter-thread communication. For inter-thread communication, use mechanisms such as std::atomic from the C++ Standard Template Library.

Must I call atomic load/store explicitly?

C++11 introduced the std::atomic<> template library. The standard specifies the store() and load() operations to atomically set / get a variable shared by more than one thread.
My question is are assignment and access operations also atomic?
Namely, is:
std::atomic<bool> stop(false);
...

void thread_1_run_until_stopped()
{
    if (!stop.load())
    {
        /* do stuff */
    }
}

void thread_2_set_stop()
{
    stop.store(true);
}
Equivalent to:
void thread_1_run_until_stopped()
{
    if (!stop)
    {
        /* do stuff */
    }
}

void thread_2_set_stop()
{
    stop = true;
}
Are assignment and access operations for non-reference types also atomic?
Yes, they are. atomic<T>::operator T and atomic<T>::operator= are equivalent to atomic<T>::load and atomic<T>::store respectively. All the operators are implemented in the atomic class such that they will use atomic operations as you would expect.
I'm not sure what you mean by "non-reference" types; it's not clear how reference types would be relevant here.
You can do both, but the advantage of load()/store() is that they allow you to specify a memory order. This sometimes matters for performance: you can specify std::memory_order_relaxed, whereas atomic<T>::operator T and atomic<T>::operator= use the safest and slowest std::memory_order_seq_cst. Sometimes it also matters for the correctness and readability of your code: although the default std::memory_order_seq_cst is the safest and thus most likely to be correct, it is not immediately clear to the reader what kind of operation (acquire/release/consume) you are doing, or whether you are doing such an operation at all (to answer: isn't relaxed order sufficient here?).
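Concretely, the two spellings do the same thing; only the explicit form lets you name a weaker ordering. A minimal sketch:
#include <atomic>

std::atomic<bool> stop{false};

// The operator forms are atomic and default to memory_order_seq_cst;
// only the explicit forms let you request a weaker ordering.
void set_default() { stop = true; }                                  // operator=
void set_relaxed() { stop.store(true, std::memory_order_relaxed); } // no barrier

bool get_default() { return stop; }                                 // operator bool
bool get_relaxed() { return stop.load(std::memory_order_relaxed); }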

What is the difference between using explicit fences and std::atomic?

Assuming that aligned pointer loads and stores are naturally atomic on the target platform, what is the difference between this:
// Case 1: Dumb pointer, manual fence
int* ptr;
// ...
std::atomic_thread_fence(std::memory_order_release);
ptr = new int(-4);
this:
// Case 2: atomic var, automatic fence
std::atomic<int*> ptr;
// ...
ptr.store(new int(-4), std::memory_order_release);
and this:
// Case 3: atomic var, manual fence
std::atomic<int*> ptr;
// ...
std::atomic_thread_fence(std::memory_order_release);
ptr.store(new int(-4), std::memory_order_relaxed);
I was under the impression that they were all equivalent; however, Relacy detects a data race in the first case (only):
struct test_relacy_behaviour : public rl::test_suite<test_relacy_behaviour, 2>
{
    rl::var<std::string*> ptr;
    rl::var<int> data;

    void before()
    {
        ptr($) = nullptr;
        rl::atomic_thread_fence(rl::memory_order_seq_cst);
    }

    void thread(unsigned int id)
    {
        if (id == 0) {
            std::string* p = new std::string("Hello");
            data($) = 42;
            rl::atomic_thread_fence(rl::memory_order_release);
            ptr($) = p;
        }
        else {
            std::string* p2 = ptr($); // <-- Test fails here after the first thread completely finishes executing (no contention)
            rl::atomic_thread_fence(rl::memory_order_acquire);
            RL_ASSERT(!p2 || *p2 == "Hello" && data($) == 42);
        }
    }

    void after()
    {
        delete ptr($);
    }
};
I contacted the author of Relacy to find out if this was expected behaviour; he says that there is indeed a data race in my test case.
However, I'm having trouble spotting it; can someone point out to me what the race is?
Most importantly, what are the differences between these three cases?
Update: It's occurred to me that Relacy may simply be complaining about the atomicity (or lack thereof, rather) of the variable being accessed across threads... after all, it doesn't know that I intend only to use this code on platforms where aligned integer/pointer access is naturally atomic.
Another update: Jeff Preshing has written an excellent blog post explaining the difference between explicit fences and the built-in ones ("fences" vs "operations"). Cases 2 and 3 are apparently not equivalent! (In certain subtle circumstances, anyway.)
Although various answers cover bits and pieces of what the potential problem is and/or provide useful information, no answer correctly describes the potential issues for all three cases.
In order to synchronize memory operations between threads, release and acquire barriers are used to specify ordering.
[Diagram omitted: memory operations A in thread 1 sit above a one-way release barrier and the atomic store; the atomic load and a one-way acquire barrier sit above memory operations B in thread 2.]
In the diagram, memory operations A in thread 1 cannot move down across the (one-way) release barrier, regardless of whether that is a release operation on an atomic store or a standalone release fence followed by a relaxed atomic store. Hence memory operations A are guaranteed to happen before the atomic store. The same goes for memory operations B in thread 2, which cannot move up across the acquire barrier; hence the atomic load happens before memory operations B.
The atomic ptr itself provides inter-thread ordering based on the guarantee that it has a single modification order. As soon as thread 2 sees a value for ptr, it is guaranteed that the store (and thus memory operations A) happened before the load. Because the load is guaranteed to happen before memory operations B, the rules for transitivity say that memory operations A happen before B, and synchronization is complete.
With that, let's look at your 3 cases.
Case 1 is broken because ptr, a non-atomic type, is modified in different threads. That is a classical example of a data race and it causes undefined behavior.
Case 2 is correct. Since the integer allocation with new is an argument to the store, it is sequenced before the release operation. This is equivalent to:
// Case 2: atomic var, automatic fence
std::atomic<int*> ptr;
// ...
int *tmp = new int(-4);
ptr.store(tmp, std::memory_order_release);
Case 3 is broken, albeit in a subtle way. The problem is that even though the ptr assignment is correctly sequenced after the standalone fence, the integer allocation (new) is also sequenced after the fence, causing a data race on the integer memory location. The code is equivalent to:
// Case 3: atomic var, manual fence
std::atomic<int*> ptr;
// ...
std::atomic_thread_fence(std::memory_order_release);
int *tmp = new int(-4);
ptr.store(tmp, std::memory_order_relaxed);
If you map that onto the diagram above, the new operator is supposed to be part of memory operations A. Being sequenced below the release fence, ordering guarantees no longer hold, and the integer allocation may actually be reordered with memory operations B in thread 2. Therefore, thread 2 may see the pointer but read garbage through it, or run into other undefined behavior.
I believe the code has a race. Case 1 and case 2 are not equivalent.
29.8 [atomics.fences]
-2- A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
In case 1 your release fence does not synchronize with your acquire fence because ptr is not an atomic object and the store and load on ptr are not atomic operations.
Case 2 and case 3 are equivalent (actually, not quite, see LWimsey's comments and answer), because ptr is an atomic object and the store is an atomic operation. (Paragraphs 3 and 4 of [atomic.fences] describe how a fence synchronizes with an atomic operation and vice versa.)
The semantics of fences are defined only with respect to atomic objects and atomic operations. Whether your target platform and your implementation offer stronger guarantees (such as treating any pointer type as an atomic object) is implementation-defined at best.
N.B. for both case 2 and case 3, the acquire operation on ptr could happen before the store, and so would read garbage from the uninitialized atomic<int*>. Simply using acquire and release operations (or fences) doesn't ensure that the store happens before the load; it only ensures that if the load reads the stored value, then the code is correctly synchronized.
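In other words, the reader must check what it actually loaded before trusting the pointee. A minimal sketch of the consumer side, assuming the atomic is initialized to nullptr:
#include <atomic>

std::atomic<int*> ptr{nullptr}; // initialized, so an early load sees null, not garbage

void consumer() {
    int* p = ptr.load(std::memory_order_acquire);
    if (p) {
        // Only on this branch did the acquire load observe the release
        // store, so only here is *p guaranteed to be fully initialized.
        int value = *p;
        (void)value;
    }
    // If p is null, the store simply hasn't been observed yet; retry later.
}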
Several pertinent references:
the C++11 draft standard (PDF, see clauses 1, 29 and 30);
Hans-J. Boehm's overview of concurrency in C++;
McKenney, Boehm and Crowl on concurrency in C++;
GCC's developmental notes on concurrency in C++;
the Linux kernel's notes on concurrency;
a related question with answers here on Stack Overflow;
another related question with answers;
Cppmem, a sandbox in which to experiment with concurrency;
Cppmem's help page;
Spin, a tool for analyzing the logical consistency of concurrent systems;
an overview of memory barriers from a hardware perspective (PDF).
Some of the above may interest you and other readers.
The memory backing an atomic variable can only ever be used for the contents of the atomic. However, a plain variable like ptr in case 1 is a different story. Once a compiler has the right to write to it, it can write anything to it, even a temporary value when it runs out of registers.
Remember, your example is pathologically clean. Given a slightly more complex example:
std::string* p = new std::string("Hello");
data($) = 42;
rl::atomic_thread_fence(rl::memory_order_release);
std::string* p2 = new std::string("Bye");
ptr($) = p;
it is totally legal for the compiler to choose to reuse your pointer variable:
std::string* p = new std::string("Hello");
data($) = 42;
rl::atomic_thread_fence(rl::memory_order_release);
ptr($) = new std::string("Bye");
std::string* p2 = ptr($);
ptr($) = p;
Why would it do so? I don't know; perhaps some exotic trick to keep a cache line or something. The point is that, since ptr is not atomic in case 1, there is a race between the write on the line ptr($) = p and the read on the line std::string* p2 = ptr($), yielding undefined behavior. In this simple test case the compiler may not choose to exercise this right, and the code may be safe, but in more complicated cases the compiler has the right to abuse ptr however it pleases, and Relacy catches this.
My favorite article on the topic: http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
The race in the first example is between the publication of the pointer and the stuff it points to. The reason is that you have the creation and initialization of the pointer after the fence (i.e. on the same side as the publication of the pointer):
int* ptr; //noop
std::atomic_thread_fence(std::memory_order_release); //fence between noop and interesting stuff
ptr = new int(-4); //object creation, initalization, and publication
If we assume that CPU accesses to properly aligned pointers are atomic, the code can be corrected by writing this:
int* ptr; //noop
int* newPtr = new int(-4); //object creation & initalization
std::atomic_thread_fence(std::memory_order_release); //fence between initialization and publication
ptr = newPtr; //publication
Note that even though this may work fine on many machines, there is absolutely no guarantee within the C++ standard on the atomicity of the last line. So better use atomic<> variables in the first place.
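For completeness, the portable fix is to make the published pointer itself atomic, which is exactly case 2 from the question; a minimal sketch:
#include <atomic>

std::atomic<int*> ptr{nullptr};

void publish() {
    int* newPtr = new int(-4);                    // creation & initialization
    ptr.store(newPtr, std::memory_order_release); // publication, atomic and ordered by the standard
}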