Acquire/Release versus Sequentially Consistent memory order

Acquire/Release versus Sequentially Consistent memory order - c++

For any std::atomic<T> where T is a primitive type:
If I use std::memory_order_acq_rel for fetch_xxx operations, and std::memory_order_acquire for load operation and std::memory_order_release for store operation blindly (I mean just like resetting the default memory ordering of those functions)
Will the results be same as if I used std::memory_order_seq_cst (which is being used as default) for any of the declared operations?
If the results were the same, is this usage anyhow different than using std::memory_order_seq_cst in terms of efficiency?

The C++11 memory ordering parameters for atomic operations specify constraints on the ordering. If you do a store with std::memory_order_release, and a load from another thread reads the value with std::memory_order_acquire then subsequent read operations from the second thread will see any values stored to any memory location by the first thread that were prior to the store-release, or a later store to any of those memory locations.
If both the store and subsequent load are std::memory_order_seq_cst then the relationship between these two threads is the same. You need more threads to see the difference.
e.g. std::atomic<int> variables x and y, both initially 0.
Thread 1:
x.store(1,std::memory_order_release);
Thread 2:
y.store(1,std::memory_order_release);
Thread 3:
int a=x.load(std::memory_order_acquire); // x before y
int b=y.load(std::memory_order_acquire);
Thread 4:
int c=y.load(std::memory_order_acquire); // y before x
int d=x.load(std::memory_order_acquire);
As written, there is no relationship between the stores to x and y, so it is quite possible to see a==1, b==0 in thread 3, and c==1 and d==0 in thread 4.
If all the memory orderings are changed to std::memory_order_seq_cst then this enforces an ordering between the stores to x and y. Consequently, if thread 3 sees a==1 and b==0 then that means the store to x must be before the store to y, so if thread 4 sees c==1, meaning the store to y has completed, then the store to x must also have completed, so we must have d==1.
In practice, then using std::memory_order_seq_cst everywhere will add additional overhead to either loads or stores or both, depending on your compiler and processor architecture. e.g. a common technique for x86 processors is to use XCHG instructions rather than MOV instructions for std::memory_order_seq_cst stores, in order to provide the necessary ordering guarantees, whereas for std::memory_order_release a plain MOV will suffice. On systems with more relaxed memory architectures the overhead may be greater, since plain loads and stores have fewer guarantees.
Memory ordering is hard. I devoted almost an entire chapter to it in my book.

Memory ordering can be quite tricky, and the effects of getting it wrong is often very subtle.
The key point with all memory ordering is that it guarantees what "HAS HAPPENED", not what is going to happen. For example, if you store something to a couple of variables (e.g. x = 7; y = 11;), then another processor may be able to see y as 11 before it sees the value 7 in x. By using memory ordering operation between setting x and setting y, the processor that you are using will guarantee that x = 7; has been written to memory before it continues to store something in y.
Most of the time, it's not REALLY important which order your writes happen, as long as the value is updated eventually. But if we, say, have a circular buffer with integers, and we do something like:
buffer[index] = 32;
index = (index + 1) % buffersize;
and some other thread is using index to determine that the new value has been written, then we NEED to have 32 written FIRST, then index updated AFTER. Otherwise, the other thread may get old data.
The same applies to making semaphores, mutexes and such things work - this is why the terms release and acquire are used for the memory barrier types.
Now, the cst is the most strict ordering rule - it enforces that both reads and writes of the data you've written goes out to memory before the processor can continue to do more operations. This will be slower than doing the specific acquire or release barriers. It forces the processor to make sure stores AND loads have been completed, as opposed to just stores or just loads.
How much difference does that make? It is highly dependent on what the system archiecture is. On some systems, the cache needs to flushed [partially] and interrupts sent from one core to another to say "Please do this cache-flushing work before you continue" - this can take several hundred cycles. On other processors, it's only some small percentage slower than doing a regular memory write. X86 is pretty good at doing this fast. Some types of embedded processors, (some models of - not sure?)ARM for example, require a bit more work in the processor to ensure everything works.

Related

Does std::atomic<> gurantee that store() operation is propagated immediately (almost) to other threads with load()?

I have std::atomic<T> atomic_value; (for type T being bool, int32_t, int64_t and any other). If 1st thread does
atomic_value.store(value, std::memory_order_relaxed);
and in 2nd thread at some points of code I do
auto value = atomic_value.load(std::memory_order_relaxed);
How fast is this updated atomic value propagated from 1st thread to 2nd, between CPU cores? (for all CPU models)
Is it propagated almost-immediately? For example up-to speed of cache coherence propagation in Intel, meaning that 0-2 cycles or so. Maybe few more cycles for some other CPU models/manufacturers.
Or this value may stuck un-updated for many many cycles sometimes?
Does atomic guarantee that value is propagated between CPU cores as fast as possible for given CPU?
Maybe if instead on 1st thread I do
atomic_value.store(value, std::memory_order_release);
and on 2nd thread
auto value = atomic_value.load(std::memory_order_acquire);
then will it help to propagate value faster? (notice change of both memory orders) And now with speed guarantee? Or it will be same gurantee of speed as for relaxed order?
As a side question - does replacing relaxed order with release+acquire also synchronizes all modifications in other (non-atomic) variables?
Meaning that in 1st thread everything that was written to memory before store-with-release, is this whole memory guaranteed in 2nd thread to be exactly in final state (same as in 1st thread) at point of load-with-acquire, of course in a case if loaded value was new one (updated).
So this means that for ANY type of std::atomic<> (or std::atomic_flag) point of store-with-release in one thread synchronizes all memory writes before it with point in another thread that does load-with-acquire of same atomic, in a case of course if in other thread value of atomic got updated? (Sure if value in 2nd thread is not yet new then we expect that memory writes have not yet finished)
PS. Why question arose... Because according to name "atomic" it is obvious to conclude (probably miss-conclude) that by default (without extra constraints, i.e. with just relaxed memory order) std::atomic<> just makes any arithmetic operation atomical, and nothing else, no other guarantees about synchronization or speed of propagation. Meaning that write to memory location will be whole (e.g. all 4 bytes at once for int32_t), or exchange with atomic location will do both read-write atomically (actually in a locked fashion), or incrementing a value will do atomically three operations read-add-write.

The C++ standard says only this [C++20 intro.progress p18]:
An implementation should ensure that the last value (in modification order) assigned by an atomic or
synchronization operation will become visible to all other threads in a finite period of time.
Technically this is only a "should", and "finite time" is not very specific. But the C++ standard is broad enough that you can't expect them to specify a particular number of cycles or nanoseconds or what have you.
In practice, you can expect that a call to any atomic store function, even with memory_order_relaxed, will cause an actual machine store instruction to be executed. The value will not just be left in a register. After that, it's out of the compiler's hands and up to the CPU.
(Technically, if you had two or more stores in succession to the same object, with a bounded amount of other work done in between, the compiler would be allowed to optimize out all but the last one, on the basis that you couldn't have been sure anyway that any given load would happen at the right instant to see one of the other values. In practice I don't believe that any compilers currently do so.)
A reasonable expectation for typical CPU architectures is that the store will become globally visible "without unnecessary delay". The store may go into the core's local store buffer. The core will process store buffer entries as quickly as it can; it does not just let them sit there to age like a fine wine. But they still could take a while. For instance, if the cache line is currently held exclusive by another core, you will have to wait until it is released before your store can be committed.
Using stronger memory ordering will not speed up the process; the machine is already making its best efforts to commit the store. Indeed, a stronger memory ordering may actually slow it down; if the store was made with release ordering, then it must wait for all earlier stores in the buffer to commit before it can itself be committed. On strongly-ordered architectures like x86, every store is automatically release, so the store buffer always remains in strict order; but on a weakly ordered machine, using relaxed ordering may allow your store to "jump the queue" and reach L1 cache sooner than would otherwise have been possible.

With memory_order_relaxed how is total order of modification of an atomic variable assured on typical architectures?

As I understand memory_order_relaxed is to avoid costly memory fences that may be needed with more constrained ordering on a particular architecture.
In that case how is total modification order for an atomic variable achieved on popular processors?
EDIT:
atomic<int> a;
void thread_proc()
{
int b = a.load(memory_order_relaxed);
int c = a.load(memory_order_relaxed);
printf(“first value %d, second value %d\n, b, c);
}
int main()
{
thread t1(thread_proc);
thread t2(thread_proc);
a.store(1, memory_order_relaxed);
a.store(2, memory_order_relaxed);
t1.join();
t2.join();
}
What will guarantee that the output won’t be:
first value 1, second value 2
first value 2, second value 1
?

Multi-processors often use the MESI protocol to ensure total store order on a location. Information is transferred at cache-line granularity. The protocol ensures that before a processor modifies the contents of a cache line, all other processors relinquish their copy of the line, and must reload a copy of the modified line. Hence in the example where a processor writes x and then y to the same location, if any processor sees the write of x, it must have reloaded from the modified line, and must relinquish the line again before the writer writes y.

There is usually a specific set of assembly instructions that corresponds to operations on std::atomics, for example an atomic addition on x86 is lock xadd.
By specifying memory order relaxed you can conceptually think of it as telling the compiler "you must use this technique to increment the value, but I impose no other restrictions outside of the standard as-if optimisations rules on top of that". So literally just replacing an add with an lock xadd is likely sufficient under a relaxed ordering constraint.
Also keep in mind 'memory_order_relaxed' specifies a minimum standard that the compiler has to respect. Some intrinsics on some platforms will have implicit hardware barriers, which doesn't violate the constraint for being too constrained.

All atomic operations act in accord with [intro.races]/14:
If an operation A that modifies an atomic object M happens before an operation B that modifies M, then A shall be earlier than B in the modification order of M.
The two stores from the main thread are required to happen in that order, since the two operations are ordered within the same thread. Therefore, they cannot happen outside of that order. If someone sees the value 2 in the atomic, then the first thread must have executed past the point where the value was set to 1, per [intro.races]/4:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.
This of course only applies to atomic operations on a specific atomic object; ordering with respect to other things doesn't exist when using relaxed ordering (which is the point).
How does this get achieved on real machines? In whatever way the compiler sees fit to do so. The compiler could decide that, since you're overwriting the value of the variable you just set, then it can remove the first store per the as-if rule. Nobody ever seeing the value 1 is a perfectly legitimate implementation according to the C++ memory model.
But otherwise, the compiler is required to emit whatever is needed to make it work. Note that out-of-order processors aren't typically allowed to complete dependent operations out of order, so that's typically not a problem.

There are two parts in an inter thread communication:
a core that can do loads and stores
the memory system which consists of coherent caches
The issue is the speculative execution in the CPU core.
A processor load and store unit always need to compare addresses in order to avoid reordering two writes to the same location (if it reorders writes at all) or to pre-fetch a stale value that has just been written to (when reads are done early, before previous writes).
Without that feature, any sequence of executable code would be at risk of having its memory accesses completely randomized, seeing values written by a following instruction, etc. All memory locations would be "renamed" in crazy ways with no way for a program to refer to the same (originally named) location twice in a row.
All programs would break.
On the other hand, memory locations in potentially running code can have two "names":
the location that can hold a modifiable value, in L1d
the location that can be decoded as executable code, in L1i
And these are not connected in any way until a special reload code instruction is performed, not only the L1i but also the instruction decoder can have in cache locations that are otherwise modifiable.
[Another complication is when two virtual addresses (used by speculative loads or stores) refer to the same physical addresses (aliasing): that's another conflict that needs to be dealt with.]
Summary: In most cases, a CPU will naturally to provide an order for accesses on each data memory location.
EDIT:
While a core needs to keep track of operations that invalidate speculative execution, mainly a write to a location later read by a speculative instruction. Reads don't conflict with each others and a CPU core might want to keep track of modification of cached memory after a speculative read (making reads happen visibly in advance) and if reads can be executed out of order it's conceivable that a later read might be complete before an earlier read; on why the system would begin a later read first, a possible cause would be if the address computation is easier and complete first.
So a system that can begin reads out of order and that would consider them completed as soon as a value is made available by the cache, and valid as long as no write by the same core ends up conflicting with either read, and does not monitor L1i cache invalidations caused by another CPU wanting to modify a close memory location (possible that one location), such sequence is possible:
decompose the soon to be executed instructions into sequence A which is long a list of sequenced operations ending with a result in r1 and B a shorter sequence ending with a result in r2
run both in parallel, with B producing a result earlier
speculatively try load (r2), noting that a write that address may invalidate the speculation (suppose the location is available in L1i)
then another CPU annoys us stealing the cache line holding location of (r2)
A completes making r1 value available and we can speculatively do load (r1) (which happens to be the same address as (r2)); which stalls until our cache gets back its cache line
the value of the last done load can be different from the first
Neither speculations of A nor B invalided any memory location, as the system doesn't consider either the loss of cache line or the return of a different value by the last load to be an invalidation of a speculation (which would be easy to implement as we have all the information locally).
Here the system sees any read as non conflicting with any local operation that isn't a local write and the loads are done in an order depending on the complexity of A and B and not whichever comes first in program order (the description above doesn't even say that the program order was changed, just that it was ignored by speculation: I have never described which of the loads was first in the program).
So for a relaxed atomic load, a special instruction would be needed on such system.
The cache system
Of course the cache system doesn't change orders of requests, as it works like a global random access system with temporary ownership by cores.

How do I make memory stores in one thread "promptly" visible in other threads?

Suppose I wanted to copy the contents of a device register into a variable that would be read by multiple threads. Is there a good general way of doing this? Here are examples of two possible methods of doing this:
#include <atomic>
volatile int * const Device_reg_ptr = reinterpret_cast<int *>(0x666);
// This variable is read by multiple threads.
std::atomic<int> device_reg_copy;
// ...
// Method 1
const_cast<volatile std::atomic<int> &>(device_reg_copy)
.store(*Device_reg_ptr, std::memory_order_relaxed);
// Method 2
device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
More generally, in the face of possible whole program optimization, how does one correctly control the latency of memory writes in one thread being visible in other threads?
EDIT: In your answer, please consider the following scenario:
The code is running on a CPU in an embedded system.
A single application is running on the CPU.
The application has far fewer threads than the CPU has processor cores.
Each core has a massive number of registers.
The application is small enough that whole program optimization is successfully used when building its executable.
How do we make sure that a store in one thread does not remain invisible to other threads indefinitely?

If you would like to update the value of device_reg_copy in atomic fashion, then device_reg_copy.store(*Device_reg_ptr, std::memory_order_relaxed); suffices.
There is no need to apply volatile to atomic variables, it is unnecessary.
std::memory_order_relaxed store is supposed to incur the least amount of synchronization overhead. On x86 it is just a plain mov instruction.
However, if you would like to update it in such a way, that the effects of any preceding stores become visible to other threads along with the new value of device_reg_copy, then use std::memory_order_release store, i.e. device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);. The readers need to load device_reg_copy as std::memory_order_acquire in this case. Again, on x86 std::memory_order_release store is a plain mov.
Whereas if you use the most expensive std::memory_order_seq_cst store, it does insert the memory barrier for you on x86.
This is why they say that x86 memory model is a bit too strong for C++11: plain mov instruction is std::memory_order_release on stores and std::memory_order_acquire on loads. There is no relaxed store or load on x86.
I cannot recommend enough CPU Cache Flushing Fallacy article.

The C++ standard is rather vague about making atomic stores visible to other threads..
29.3.12
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
That is as detailed as it gets, there is no definition of 'reasonable', and it does not have to be immediately.
Using a stand-alone fence to force a certain memory ordering is not necessary since you can specify those on atomic operations, but the question is,
what is your expectation with regards to using a memory fence..
Fences are designed to enforce ordering on memory operations (between threads), but they do not guarantee visibility in a timely manner.
You can store a value to an atomic variable with the strongest memory ordering (ie. seq_cst), but even when another thread executes load() at a later time than the store(),
you might still get an old value from the cache and yet (surprisingly) it does not violate the happens-before relationship.
Using a stronger fence might make a difference wrt. timing and visibility, but there are no guarantees.
If prompt visibility is important, I would consider using a Read-Modify-Write (RMW) operation to load the value.
These are atomic operations that read and modify atomically (ie. in a single call), and have the additional property that they are guaranteed to operate on the latest value.
But since they have to reach a little further than the local cache, these calls also tend to be more expensive to execute.
As pointed out by Maxim Egorushkin, whether or not you can use weaker memory orderings than the default (seq_cst) depends on whether other memory operations need to be synchronized (made visible) between threads.
That is not clear from your question, but it is generally considered safe to use the default (sequential consistency).
If you are on an unusually weak platform, if performance is problematic, and if you need data synchronization between threads, you could consider using acquire/release semantics:
// thread 1
device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);
// thread 2
device_reg_copy.fetch_add(0, std::memory_order_acquire);
If thread 2 sees the value written by thread 1, it is guaranteed that memory operations prior to the store in thread 1 are visible after the load in thread 2.
Acquire/Release operations form a pair and they synchronize based on a run-time relationship between the store and load. In other words, if thread 2 does not see the value stored by thread 1,
there are no ordering guarantees.
If the atomic variable has no dependencies on any other data, you can use std::memory_order_relaxed; store ordering is always guaranteed for a single atomic variable.
As mentioned by others, there is no need for volatile when it comes to inter-thread communication with std::atomic.

Compare and Swap : synchronizing via different data sizes

Using the GCC builtin C atomic primitives, we can perform an atomic CAS operation using __atomic_compare_exchange.
Unlike C++11's std::atomic type, the GCC C atomic primitives operate on regular non-atomic integral types, including 128-bit integers on platforms where cmpxchg16b is supported. (A future version of the C++ standard may support similar functionality with the std::atomic_view class template.)
This makes me question:
What happens if an atomic CAS operation on a larger data size observes a change which happened by an atomic operation on the same memory location, but using a smaller data size?
For example, suppose we have:
struct uint128_type {
uint64_t x;
uint64_t y;
} __attribute__ ((aligned (16)));
And suppose we have a shared variable of type uint128_type, like:
uint128_type Foo;
Now, suppose Thread A does:
Foo expected = { 0, 0 };
Foo desired = { 100, 100 };
int result = __atomic_compare_exchange(
&Foo,
&expected,
&desired,
0,
__ATOMIC_SEQ_CST
);
And Thread B does:
uint64_t expected = 0;
uint64_t desired = 500;
int result = __atomic_compare_exchange(
&Foo.x,
&expected,
&desired,
0,
__ATOMIC_SEQ_CST
);
What happens if Thread A's 16-byte CAS happens before Thread B's 8 byte CAS (or vice-versa)? Does the CAS fail as normal? Is this even defined behavior? Is this likely to "do the right thing" on typical architectures like x86_64 that support 16b CAS?
Edit: to be clear, since it seems to be causing confusion, I'm not asking if the above behavior is defined by the C++ standard. Obviously, all the __atomic_* functions are GCC extensions. (However future C++ standards may have to define this sort of thing, if std::atomic_view becomes standardized.) I am asking more generally about the semantics of atomic operations on typical modern hardware. As an example, if x86_64 code has 2 threads perform atomic operations on the same memory address, but one thread uses CMPXCHG8b and the other uses CMPXCHG16b, so that one does an atomic CAS on a single word while the other does an atomic CAS on a double word, how are the semantics of these operations defined? More specifically, would a CMPXCHG16b fail because it observes that the data has mutated from the expected value due to a previous CMPXCHG8b?
In other words, can two different CAS operations using two different data sizes (but the same starting memory address) safely be used to synchronize between threads?

One or the other happens first, and each operation proceeds according to its own semantics.
On x86 CPUs, both operations will require a lock on the same cache line held throughout the entire operation. So whichever one gets that lock first will not see any effects of the second operation and whichever one gets that lock second will see all the effects of the first operation. The semantics of both operations will be fully respected.
Other hardware may achieve this result some other way, but if it doesn't achieve this result, it's broken unless it specifies that it has a limitation.

The atomic data will, eventually, located somewhere in the memory and all accesses to it (or to respective caches, when the operations are atomic) will be serialized. Since the CAS operation is supposed to be atomic, it will performed as a whole or not at all.
That being said, one of the operations will succeed, the second will fail. The order is non-deterministic.
From x86 Instruction Set Reference:
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)
Clearly, both threads will attempt to locked-write after locked-read (when used with LOCK prefix, that is), which means only one of them will succeed in performing the CAS, the other will read already changed value.

Hardware is usually very conservative when checking for conflicts between potentially conflicting atomic operations. It may even happen that two CAS operations to two completely different, non-overlapping address ranges may be detected to conflict with each other.

The definition of atomic is unlikely to change, emphasis mine
In concurrent programming, an operation (or set of operations) is atomic,
linearizable, indivisible or uninterruptible if it appears to
the rest of the system to occur instantaneously. Atomicity is a
guarantee of isolation from concurrent processes.
Your question asks...
What happens if an atomic CAS operation on a larger data size observes
a change which happened by an atomic operation on the same memory
location, but using a smaller data size?
By definition no two overlapping regions of memory modified using atomic operations can mutate concurrently ie the two operations must happen linearly or their not atomic.

I've heard i++ isn't thread safe, is ++i thread-safe?

I've heard that i++ isn't a thread-safe statement since in assembly it reduces down to storing the original value as a temp somewhere, incrementing it, and then replacing it, which could be interrupted by a context switch.
However, I'm wondering about ++i. As far as I can tell, this would reduce to a single assembly instruction, such as 'add r1, r1, 1' and since it's only one instruction, it'd be uninterruptable by a context switch.
Can anyone clarify? I'm assuming that an x86 platform is being used.

You've heard wrong. It may well be that "i++" is thread-safe for a specific compiler and specific processor architecture but it's not mandated in the standards at all. In fact, since multi-threading isn't part of the ISO C or C++ standards (a), you can't consider anything to be thread-safe based on what you think it will compile down to.
It's quite feasible that ++i could compile to an arbitrary sequence such as:
load r0,[i] ; load memory into reg 0
incr r0 ; increment reg 0
stor [i],r0 ; store reg 0 back to memory
which would not be thread-safe on my (imaginary) CPU that has no memory-increment instructions. Or it may be smart and compile it into:
lock ; disable task switching (interrupts)
load r0,[i] ; load memory into reg 0
incr r0 ; increment reg 0
stor [i],r0 ; store reg 0 back to memory
unlock ; enable task switching (interrupts)
where lock disables and unlock enables interrupts. But, even then, this may not be thread-safe in an architecture that has more than one of these CPUs sharing memory (the lock may only disable interrupts for one CPU).
The language itself (or libraries for it, if it's not built into the language) will provide thread-safe constructs and you should use those rather than depend on your understanding (or possibly misunderstanding) of what machine code will be generated.
Things like Java synchronized and pthread_mutex_lock() (available to C/C++ under some operating systems) are what you need to look into (a).
(a) This question was asked before the C11 and C++11 standards were completed. Those iterations have now introduced threading support into the language specifications, including atomic data types (though they, and threads in general, are optional, at least in C).

You can't make a blanket statement about either ++i or i++. Why? Consider incrementing a 64-bit integer on a 32-bit system. Unless the underlying machine has a quad word "load, increment, store" instruction, incrementing that value is going to require multiple instructions, any of which can be interrupted by a thread context switch.
In addition, ++i isn't always "add one to the value." In a language like C, incrementing a pointer actually adds the size of the thing pointed to. That is, if i is a pointer to a 32-byte structure, ++i adds 32 bytes. Whereas almost all platforms have an "increment value at memory address" instruction that is atomic, not all have an atomic "add arbitrary value to value at memory address" instruction.

They are both thread-unsafe.
A CPU cannot do math directly with memory. It does that indirectly by loading the value from memory and doing the math with CPU registers.
i++
register int a1, a2;
a1 = *(&i) ; // One cpu instruction: LOAD from memory location identified by i;
a2 = a1;
a1 += 1;
*(&i) = a1;
return a2; // 4 cpu instructions
++i
register int a1;
a1 = *(&i) ;
a1 += 1;
*(&i) = a1;
return a1; // 3 cpu instructions
For both cases, there is a race condition that results in the unpredictable i value.
For example, let's assume there are two concurrent ++i threads with each using register a1, b1 respectively. And, with context switching executed like the following:
register int a1, b1;
a1 = *(&i);
a1 += 1;
b1 = *(&i);
b1 += 1;
*(&i) = a1;
*(&i) = b1;
In result, i doesn't become i+2, it becomes i+1, which is incorrect.
To remedy this, moden CPUs provide some kind of LOCK, UNLOCK cpu instructions during the interval a context switching is disabled.
On Win32, use InterlockedIncrement() to do i++ for thread-safety. It's much faster than relying on mutex.

If you are sharing even an int across threads in a multi-core environment, you need proper memory barriers in place. This can mean using interlocked instructions (see InterlockedIncrement in win32 for example), or using a language (or compiler) that makes certain thread-safe guarantees. With CPU level instruction-reordering and caches and other issues, unless you have those guarantees, don't assume anything shared across threads is safe.
Edit: One thing you can assume with most architectures is that if you are dealing with properly aligned single words, you won't end up with a single word containing a combination of two values that were mashed together. If two writes happen over top of each other, one will win, and the other will be discarded. If you are careful, you can take advantage of this, and see that either ++i or i++ are thread-safe in the single writer/multiple reader situation.

If you want an atomic increment in C++ you can use C++0x libraries (the std::atomic datatype) or something like TBB.
There was once a time that the GNU coding guidelines said updating datatypes that fit in one word was "usually safe" but that advice is wrong for SMP machines, wrong for some architectures, and wrong when using an optimizing compiler.
To clarify the "updating one-word datatype" comment:
It is possible for two CPUs on an SMP machine to write to the same memory location in the same cycle, and then try to propagate the change to the other CPUs and the cache. Even if only one word of data is being written so the writes only take one cycle to complete, they also happen simultaneously so you cannot guarantee which write succeeds. You won't get partially updated data, but one write will disappear because there is no other way to handle this case.
Compare-and-swap properly coordinates between multiple CPUs, but there is no reason to believe that every variable assignment of one-word datatypes will use compare-and-swap.
And while an optimizing compiler doesn't affect how a load/store is compiled, it can change when the load/store happens, causing serious trouble if you expect your reads and writes to happen in the same order they appear in the source code (the most famous being double-checked locking does not work in vanilla C++).
NOTE My original answer also said that Intel 64 bit architecture was broken in dealing with 64 bit data. That is not true, so I edited the answer, but my edit claimed PowerPC chips were broken. That is true when reading immediate values (i.e., constants) into registers (see the two sections named "Loading pointers" under listing 2 and listing 4) . But there is an instruction for loading data from memory in one cycle (lmw), so I've removed that part of my answer.

Even if it is reduced to a single assembly instruction, incrementing the value directly in memory, it is still not thread safe.
When incrementing a value in memory, the hardware does a "read-modify-write" operation: it reads the value from the memory, increments it, and writes it back to memory. The x86 hardware has no way of incrementing directly on the memory; the RAM (and the caches) is only able to read and store values, not modify them.
Now suppose you have two separate cores, either on separate sockets or sharing a single socket (with or without a shared cache). The first processor reads the value, and before it can write back the updated value, the second processor reads it. After both processors write the value back, it will have been incremented only once, not twice.
There is a way to avoid this problem; x86 processors (and most multi-core processors you will find) are able to detect this kind of conflict in hardware and sequence it, so that the whole read-modify-write sequence appears atomic. However, since this is very costly, it is only done when requested by the code, on x86 usually via the LOCK prefix. Other architectures can do this in other ways, with similar results; for instance, load-linked/store-conditional and atomic compare-and-swap (recent x86 processors also have this last one).
Note that using volatile does not help here; it only tells the compiler that the variable might have be modified externally and reads to that variable must not be cached in a register or optimized out. It does not make the compiler use atomic primitives.
The best way is to use atomic primitives (if your compiler or libraries have them), or do the increment directly in assembly (using the correct atomic instructions).

On x86/Windows in C/C++, you should not assume it is thread-safe. You should use InterlockedIncrement() and InterlockedDecrement() if you require atomic operations.

If your programming language says nothing about threads, yet runs on a multithreaded platform, how can any language construct be thread-safe?
As others pointed out: you need to protect any multithreaded access to variables by platform specific calls.
There are libraries out there that abstract away the platform specificity, and the upcoming C++ standard has adapted it's memory model to cope with threads (and thus can guarantee thread-safety).

Never assume that an increment will compile down to an atomic operation. Use InterlockedIncrement or whatever similar functions exist on your target platform.
Edit: I just looked up this specific question and increment on X86 is atomic on single processor systems, but not on multiprocessor systems. Using the lock prefix can make it atomic, but it's much more portable just to use InterlockedIncrement.

According to this assembly lesson on x86, you can atomically add a register to a memory location, so potentially your code may atomically execute '++i' ou 'i++'.
But as said in another post, the C ansi does not apply atomicity to '++' opération, so you cannot be sure of what your compiler will generate.

The 1998 C++ standard has nothing to say about threads, although the next standard (due this year or the next) does. Therefore, you can't say anything intelligent about thread-safety of operations without referring to the implementation. It's not just the processor being used, but the combination of the compiler, the OS, and the thread model.
In the absence of documentation to the contrary, I wouldn't assume that any action is thread-safe, particularly with multi-core processors (or multi-processor systems). Nor would I trust tests, as thread synchronization problems are likely to come up only by accident.
Nothing is thread-safe unless you have documentation that says it is for the particular system you're using.

Throw i into thread local storage; it isn't atomic, but it then doesn't matter.

AFAIK, According to the C++ standard, read/writes to an int are atomic.
However, all that this does is get rid of the undefined behavior that's associated with a data race.
But there still will be a data race if both threads try to increment i.
Imagine the following scenario:
Let i = 0 initially:
Thread A reads the value from memory and stores in its own cache.
Thread A increments the value by 1.
Thread B reads the value from memory and stores in its own cache.
Thread B increments the value by 1.
If this is all a single thread you would get i = 2 in memory.
But with both threads, each thread writes its changes and so Thread A writes i = 1 back to memory, and Thread B writes i = 1 to memory.
It's well defined, there's no partial destruction or construction or any sort of tearing of an object, but it's still a data race.
In order to atomically increment i you can use:
std::atomic<int>::fetch_add(1, std::memory_order_relaxed)
Relaxed ordering can be used because we don't care where this operation takes place all we care about is that the increment operation is atomic.

You say "it's only one instruction, it'd be uninterruptible by a context switch." - that's all well and good for a single CPU, but what about a dual core CPU? Then you can really have two threads accessing the same variable at the same time without any context switches.
Without knowing the language, the answer is to test the heck out of it.

I think that if the expression "i++" is the only in a statement, it's equivalent to "++i", the compiler is smart enough to not keep a temporal value, etc. So if you can use them interchangeably (otherwise you won't be asking which one to use), it doesn't matter whichever you use as they're almost the same (except for aesthetics).
Anyway, even if the increment operator is atomic, that doesn't guarantee that the rest of the computation will be consistent if you don't use the correct locks.
If you want to experiment by yourself, write a program where N threads increment concurrently a shared variable M times each... if the value is less than N*M, then some increment was overwritten. Try it with both preincrement and postincrement and tell us ;-)

For a counter, I recommend a using the compare and swap idiom which is both non locking and thread-safe.
Here it is in Java:
public class IntCompareAndSwap {
private int value = 0;
public synchronized int get(){return value;}
public synchronized int compareAndSwap(int p_expectedValue, int p_newValue){
int oldValue = value;
if (oldValue == p_expectedValue)
value = p_newValue;
return oldValue;
}
}
public class IntCASCounter {
public IntCASCounter(){
m_value = new IntCompareAndSwap();
}
private IntCompareAndSwap m_value;
public int getValue(){return m_value.get();}
public void increment(){
int temp;
do {
temp = m_value.get();
} while (temp != m_value.compareAndSwap(temp, temp + 1));
}
public void decrement(){
int temp;
do {
temp = m_value.get();
} while (temp > 0 && temp != m_value.compareAndSwap(temp, temp - 1));
}
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js