Does this envelope implementation correctly use C++11 atomics?

I have written a simple 'envelope' class to make sure I understand the C++11 atomic semantics correctly. I have a header and a payload, where the writer clears the header, fills in the payload, then fills the header with an increasing integer. The idea is that a reader then can read the header, memcpy out the payload, read the header again, and if the header is the same the reader can then assume they successfully copied the payload. It's OK that the reader may miss some updates, but it's not OK for them to get a torn update (where there is a mix of bytes from different updates). There is only ever a single reader and a single writer.
The writer uses release memory order and the reader uses acquire memory order.
Is there any risk of the memcpy being reordered with the atomic store/load calls? Or can the loads be reordered with each other? This never aborts for me but maybe I'm lucky.
#include <iostream>
#include <atomic>
#include <thread>
#include <cstring>
#include <cstdint>
#include <cstdlib>

struct envelope {
    alignas(64) uint64_t writer_sequence_number = 1;
    std::atomic<uint64_t> sequence_number;
    char payload[5000];

    void start_writing()
    {
        sequence_number.store(0, std::memory_order::memory_order_release);
    }

    void publish()
    {
        sequence_number.store(++writer_sequence_number, std::memory_order::memory_order_release);
    }

    bool try_copy(char* copy)
    {
        auto before = sequence_number.load(std::memory_order::memory_order_acquire);
        if(!before) {
            return false;
        }
        ::memcpy(copy, payload, 5000);
        auto after = sequence_number.load(std::memory_order::memory_order_acquire);
        return before == after;
    }
};

envelope g_envelope;

void reader_thread()
{
    char local_copy[5000];
    unsigned messages_received = 0;
    while(true) {
        if(g_envelope.try_copy(local_copy)) {
            for(int i = 0; i < 5000; ++i) {
                // if there is no tearing we should only see the same letter over and over
                if(local_copy[i] != local_copy[0]) {
                    abort();
                }
            }
            if(messages_received++ % 64 == 0) {
                std::cout << "successfully received=" << messages_received << std::endl;
            }
        }
    }
}

void writer_thread()
{
    const char alphabet[] = {"ABCDEFGHIJKLMNOPQRSTUVWXYZ"};
    unsigned i = 0;
    while(true) {
        char to_write = alphabet[i % (sizeof(alphabet)-1)];
        g_envelope.start_writing();
        ::memset(g_envelope.payload, to_write, 5000);
        g_envelope.publish();
        ++i;
    }
}

int main(int argc, char** argv)
{
    std::thread writer(&writer_thread);
    std::thread reader(&reader_thread);
    writer.join();
    reader.join();
    return 0;
}

This is called a seqlock; it has a data race simply because of the conflicting calls to memset and memcpy. There have been proposals to provide a memcpy-like facility to make this sort of code correct; the most recent is not likely to appear before C++26 (even if approved).

This is called a seqlock. It's a known pattern, and it works well for publish occasionally, read often. If you republish too often (especially for a buffer as large as 5000 bytes), you risk too many retries by the readers as they keep detecting possible tearing. It's commonly used to e.g. publish a 64-bit or 128-bit timestamp from a timer interrupt handler to all cores, where the fact that the writer doesn't have to acquire a lock is great, and so is the fact that readers are read-only and have negligible overhead in the fast-path.
Acq and Rel are one-way barriers.
You need atomic_thread_fence(mo_acquire) before the 2nd load of the sequence number in the reader to make sure it doesn't happen earlier, before the memcpy finishes. And same for atomic_thread_fence(mo_release) in the writer, after the first store before writing the data. Note that acquire / release fences are 2-way barriers, and do affect non-atomic variables1. (Despite misconceptions to the contrary, fences really are 2-way barriers, unlike acquire or release operations. Jeff Preshing explains and debunks the confusion)
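An illustrative sketch of the question's envelope with those fences added. This is not a drop-in, fully verified replacement; it just shows the fence placement the answer describes (release fence after the store of 0, acquire fence before the re-check load):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

struct envelope {
    alignas(64) uint64_t writer_sequence_number = 1;
    std::atomic<uint64_t> sequence_number{1};
    char payload[5000];

    void start_writing() {
        sequence_number.store(0, std::memory_order_release);
        // Release *fence*: in practice keeps the store of 0 ahead of the
        // payload writes that follow. A plain release store doesn't order
        // *later* stores, which is why the fence is needed here.
        std::atomic_thread_fence(std::memory_order_release);
    }
    void publish() {
        sequence_number.store(++writer_sequence_number, std::memory_order_release);
    }
    bool try_copy(char* copy) {
        uint64_t before = sequence_number.load(std::memory_order_acquire);
        if (before == 0)
            return false;
        std::memcpy(copy, payload, sizeof(payload));
        // Acquire *fence*: keeps the memcpy's loads ahead of the re-check
        // load. An acquire load doesn't order *earlier* loads before it.
        std::atomic_thread_fence(std::memory_order_acquire);
        uint64_t after = sequence_number.load(std::memory_order_relaxed);
        return before == after;
    }
};
```

The second load can be relaxed because the fence before it already provides the ordering; the acquire on the first load is still needed so the memcpy can't start before it.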
See also Implementing 64 bit atomic counter with 32 bit atomics for my attempt at a templated SeqLock class. I required the template class T to provide an assignment operator to copy itself around, but using memcpy might be better. I was using volatile for extra safety against the C++ UB we include. That works easily for uint64_t but is a huge pain in C++ for anything wider, unlike in C where you can get the compiler to efficiently emit code to load from a volatile struct into a non-volatile temporary.
You're going to have C++ data-race UB either way (because C++ makes best efficiency impossible without UB: the whole point of a SeqLock is to let tearing potentially happen on data[], but detect that and never actually look at the torn data). You could avoid UB by copying your data as an array of atomic<unsigned long> or something, but current compilers aren't smart enough to use SIMD for that so the access to the shared data would be inefficient. (And HW vendors fail to document Per-element atomicity of vector load/store and gather/scatter?, even though we all know that current CPUs do give that and future CPUs almost certainly will too.)
A memory barrier is probably sufficient, but it would be nice to do something to "launder" the value to make extra sure the compiler doesn't put another reload of the non-atomic data after the 2nd load. Like What is the purpose of glibc's atomic_forced_read function?. But as I said, I think atomic_thread_fence() is sufficient. At least in practice with compilers like GCC, which treat thread_fence like asm("":::"memory") that tells the compiler all values in memory might have changed.
Footnote 1: Maxim points out that atomic_thread_fence may be sort of a hack because ISO C++ specifies things only in terms of barriers and release-sequences synchronizing with loads that see the value stored.
But it's well known how fences and acq/rel loads/stores map to asm for any given target platform. It's implausible that a compiler will do enough whole-program inter-thread analysis to prove that it can break your code.
There might be an argument to be made in terms of the language used in the C++ standard about establishing happens-before relationships between the store of tmp+1 and at least some hypothetical reader. In practice that's enough to stop a compiler from breaking the writer: it can't know what code will be reading the data it's writing so it has to respect barriers. And probably the language in the standard is strong enough that a reader that sees an odd sequence number (and avoids reading data[]) can avoid data-race UB, so there would be a valid happens-before relationship between an atomic store that has to stay ahead of some non-atomic stores. So I'm not convinced that there's any room for a malicious all-seeing compiler to not respect atomic_thread_fence() there, let alone any real compiler.
In any case, you definitely do not want _mm_lfence() on x86. You want the compiler barrier against runtime reordering, but you definitely do not want the main effect of lfence: blocking out-of-order execution. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths and Are loads and stores the only instructions that gets reordered?
i.e. you just want GNU C asm("":::"memory"), aka atomic_signal_fence(mo_seq_cst). Also equivalent to atomic_thread_fence(mo_acq_rel) on x86, which only has to block compile-time reordering to control runtime ordering, because the only runtime reordering x86's strong memory model allows is StoreLoad (except for NT stores). x86's memory model is seq_cst + a store-buffer with store forwarding (which weakens seq_cst to acq/rel, and occasionally has other funky effects especially for loads that partially overlap a store).
For more about _mm_lfence() and so on vs. the asm instructions, see When should I use _mm_sfence _mm_lfence and _mm_mfence.
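The equivalent spellings of a "compiler barrier only" fence mentioned above can be shown side by side; on x86, each of these compiles to zero instructions and only blocks compile-time reordering:

```cpp
#include <atomic>

// Three ways to spell a barrier that (on x86) costs no instructions.
// atomic_thread_fence(acq_rel) can cost a real barrier instruction on
// weakly-ordered ISAs like ARM; the other two never emit instructions.
void compiler_barrier_examples() {
    std::atomic_signal_fence(std::memory_order_seq_cst); // portable compiler barrier
    std::atomic_thread_fence(std::memory_order_acq_rel); // no instruction on x86
#if defined(__GNUC__)
    asm volatile("" ::: "memory");                       // the GNU C idiom
#endif
}
```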
Other tweaks
Your sequence number is unnecessarily wide, and 64-bit atomics are less efficient on some 32-bit platforms, and very inefficient on a few. A 32-bit sequence number won't wrap in any reasonable thread-sleep time. (e.g. a 4GHz CPU will take about a whole second to do 2^32 stores at 1 store per clock, and that's with zero contention for writes to the cache line. And no cycles spent executing stores of the actual data. And practical use-cases don't have the writer in a tight loop publishing new values constantly: that could lead to something similar to livelock with readers constantly retrying and making no progress.)
unsigned long is never (AFAIK) too wide to handle efficiently, except on CPUs narrower than 32-bit. So atomic<long> or atomic<unsigned long> would use a 64-bit counter on CPUs where that's fine, but definitely avoid the risk of using a 64-bit atomic in 32-bit code. And long is required to be at least 32 bits wide.
Also, you don't need two copies of the write sequence number. Just have the writer do an atomic load into a tmp var, then separate atomic stores of tmp+1 and tmp+2.
(You're correct in wanting to avoid sequence_number++; it would be a bad idea to ask the compiler to do two atomic RMWs). The only advantage of a separate non-atomic var for the writer's private seq number is if this can inline into a write loop and keep it in a register so the writer never reloads the value.
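One way to read the tmp+1 / tmp+2 suggestion above is the classic odd/even SeqLock, sketched here (illustrative names; the fence placement follows the earlier part of this answer):

```cpp
#include <atomic>
#include <cassert>
#include <cstring>

// Writer loads the counter once into tmp, stores tmp+1 (odd = write in
// progress), writes the data, then stores tmp+2 (even = stable again).
// No separate non-atomic writer_sequence_number member is needed, and
// there are no atomic RMWs, only pure loads and pure stores.
struct seq_envelope {
    std::atomic<unsigned long> seq{0};   // even = stable, odd = writing
    char payload[64];

    void publish(const char* src) {      // single writer only
        unsigned long tmp = seq.load(std::memory_order_relaxed);
        seq.store(tmp + 1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release); // data stores stay after the odd store
        std::memcpy(payload, src, sizeof(payload));
        seq.store(tmp + 2, std::memory_order_release);
    }
};
```

A reader would check `seq & 1` (and equality before/after the copy) instead of comparing against 0 as in the question's version.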

Related

Can/should non-lock-free atomics be implemented with a SeqLock?

In both the MSVC STL and LLVM libc++ implementations, std::atomic for objects too large to be lock-free is implemented using a spin lock.
libc++ (Github):
_LIBCPP_INLINE_VISIBILITY void __lock() const volatile {
    while (1 == __cxx_atomic_exchange(&__a_lock, _LIBCPP_ATOMIC_FLAG_TYPE(true), memory_order_acquire))
        /*spin*/;
}
_LIBCPP_INLINE_VISIBILITY void __lock() const {
    while (1 == __cxx_atomic_exchange(&__a_lock, _LIBCPP_ATOMIC_FLAG_TYPE(true), memory_order_acquire))
        /*spin*/;
}
MSVC (Github) (recently discussed in this Q&A):
inline void _Atomic_lock_acquire(long& _Spinlock) noexcept {
#if defined(_M_IX86) || (defined(_M_X64) && !defined(_M_ARM64EC))
    // Algorithm from Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, May 2020
    // Example 2-4. Contended Locks with Increasing Back-off Example - Improved Version, page 2-22
    // The code in mentioned manual is covered by the 0BSD license.
    int _Current_backoff = 1;
    const int _Max_backoff = 64;
    while (_InterlockedExchange(&_Spinlock, 1) != 0) {
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            for (int _Count_down = _Current_backoff; _Count_down != 0; --_Count_down) {
                _mm_pause();
            }
            _Current_backoff = _Current_backoff < _Max_backoff ? _Current_backoff << 1 : _Max_backoff;
        }
    }
#elif
    /* ... */
#endif
}
While thinking of a better possible implementation, I wonder if it is feasible to replace this with SeqLock? Advantage would be cheap reads if reads don't contend with writes.
Another thing I'm questioning is whether a SeqLock can be improved to use an OS wait. It appears to me that if a reader observes an odd count, it can wait using the atomic wait underlying mechanism (Linux futex / Windows WaitOnAddress), thus avoiding the starvation problem of a spinlock.
To me it looks possible. Though the C++ memory model doesn't cover SeqLock currently, types in std::atomic must be trivially copyable, so memcpy reads/writes in a seqlock will work, and will deal with races if sufficient barriers are used to get a volatile-equivalent without defeating optimizations too badly. This will be part of a specific C++ implementation's header files, so it doesn't have to be portable.
Existing SO Q&As about implementing a SeqLock in C++ (perhaps using other std::atomic operations)
Implementing 64 bit atomic counter with 32 bit atomics
how to implement a seqlock lock using c++11 atomic library
Yes, you can use a SeqLock as a readers/writers lock if you provide mutual exclusion between writers. You'd still get read-side scalability, while writes and RMWs would stay about the same.
It's not a bad idea, although it has potential fairness problems for readers if you have very frequent writes. Maybe not a good idea for a mainstream standard library, at least not without some testing with some different workloads / use-cases on a range of hardware, since working great on some machines but faceplanting on others is not what you want for standard library stuff. (Code that wants great performance for its special case often unfortunately has to use an implementation that's tuned for it, not the standard one.)
Mutual exclusion is possible with a separate spinlock, or just using the low bit of the sequence number. In fact I've seen other descriptions of a SeqLock that assumed you'd be using it with multiple writers, and didn't even mention the single-writer case that allows pure-load and pure-store for the sequence number to avoid the cost of an atomic RMW.
How to use the sequence number as a spinlock
A writer or RMWer attempts to atomically CAS the sequence number to increment (if it wasn't already odd). If the sequence number is already odd, writers just spin until they see an even value.
This would mean writers have to start by reading the sequence number before trying to write, which can cause extra coherency traffic (MESI Share request, then RFO). On a machine that actually had a fetch_or in hardware, you could use that to atomically make the count odd and see if you won the race to take it from even to odd.
On x86-64, you can use lock bts to set the low bit and find out what the old low bit was, then load the whole sequence number if it was previously even (because you won the race, no other writer is going to be modifying it). So you can do a release-store of that plus 1 to "unlock" instead of needing a lock add.
Making other writers faster at reclaiming the lock may actually be a bad thing, though: you want to give a window for readers to complete. Maybe just use multiple pause instructions (or equivalent on non-x86) in write-side spin loops, more than in read-side spins. If contention is low, readers probably had time to see it before writers got to it, otherwise writers will frequently see it locked and go into the slower spin loop. Maybe with faster-increasing backoff for writers, too.
An LL/SC machine could (in asm at least) test-and-increment just as easily as CAS or TAS. I don't know how to write pure C++ that would compile to just that. fetch_or could compile efficiently for LL/SC, but still to a store even if it was already odd. (If you have to LL separately from SC, you might as well make the most of it and not store at all if it will be useless, and hope that the hardware is designed to make the best of things.)
(It's critical to not unconditionally increment; you must not unlock another writer's ownership of the lock. But an atomic-RMW that leaves the value unchanged is always ok for correctness, if not performance.)
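A sketch of the write-side locking described above, with the low bit of the sequence counter doubling as the writers' spinlock (illustrative, without the pause/backoff tuning discussed earlier):

```cpp
#include <atomic>
#include <cassert>

// The CAS only succeeds taking the count from even to odd, so a writer can
// never accidentally "unlock" another writer's ownership by incrementing
// unconditionally.
struct multiwriter_seq {
    std::atomic<unsigned> seq{0};

    unsigned write_lock() {
        for (;;) {
            unsigned s = seq.load(std::memory_order_relaxed);
            if (!(s & 1) &&
                seq.compare_exchange_weak(s, s + 1, std::memory_order_acquire,
                                          std::memory_order_relaxed))
                return s;   // we won the even -> odd race
            // else another writer holds the lock (or the CAS raced): spin.
            // Real code would insert pause instructions / backoff here.
        }
    }
    void write_unlock(unsigned s) {
        seq.store(s + 2, std::memory_order_release);  // even again, new generation
    }
};
```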
It may not be a good idea as the default, because heavy write activity can make it hard for a reader to get a successful read done. As Wikipedia points out:
The reader never blocks, but it may have to retry if a write is in progress; this speeds up the readers in the case where the data was not modified, since they do not have to acquire the lock as they would with a traditional read–write lock. Also, writers do not wait for readers, whereas with traditional read–write locks they do, leading to potential resource starvation in a situation where there are a number of readers (because the writer must wait for there to be no readers). Because of these two factors, seqlocks are more efficient than traditional read–write locks for the situation where there are many readers and few writers. The drawback is that if there is too much write activity or the reader is too slow, they might livelock (and the readers may starve).
The "too slow reader" problem is unlikely, just a small memcpy. Code shouldn't expect good results from std::atomic<T> for very large T; the general assumption is that you'd only bother with std::atomic for a T that can be lock-free on some implementations. (Usually not including transactional memory since mainstream implementations don't do that.)
But the "too much write" problem could still be real: SeqLock is best for read-mostly data. Readers may have a bad time with a heavy write mix, retrying even more than with a simple spinlock or a readers-writers lock.
It would be nice if there were a way to make this an option for an implementation, like an optional template parameter such as std::atomic<T, true>, or a #pragma, or a #define before including <atomic>. Or a command-line option.
An optional template param affects every use of the type, but might be slightly less clunky than a separate class name like gnu::atomic_seqlock<T>. An optional template param would still make std::atomic types be that class name, so e.g. matching specializations of other things for std::atomic. But might break other things, IDK.
Might be fun to hack something up to experiment with.

Could this publish / check-for-update class for a single writer + reader use memory_order_relaxed or acquire/release for efficiency?

Introduction
I have a small class which makes use of std::atomic for a lock-free operation. Since this class is called massively often, it's affecting performance and I'm having trouble.
Class description
The class is similar to a LIFO, but once the pop() function is called, it only returns the last-written element of its ring-buffer (and only if there are new elements since the last pop()).
A single thread is calling push(), and another single thread is calling pop().
Source I've read
Since this is using too much of my computer's time, I decided to study the std::atomic class and its memory_order a bit further. I've read a lot of memory_order posts available on StackOverflow and other sources and books, but I'm not able to get a clear idea about the different modes. Especially, I'm struggling with acquire and release modes: I fail to see why they are different from memory_order_seq_cst.
What I think each memory order do using my words, from my own research
memory_order_relaxed: In the same thread, the atomic operations are instant, but other threads may fail to see the latest values instantly; they will need some time until they are updated. The code can be re-ordered freely by the compiler or OS.
memory_order_acquire / release: Used by atomic::load. It prevents the lines of code that are before it from being reordered (the compiler/OS may reorder everything after this line all it wants), and reads the latest value that was stored on this atomic using memory_order_release or memory_order_seq_cst in this thread or another thread. memory_order_release also prevents code after it from being reordered. So, in an acquire/release, all the code between both can be shuffled by the OS. I'm not sure if that's between the same thread, or different threads.
memory_order_seq_cst: Easiest to use because it's like the natural writing we are used to with variables, instantly refreshing the values seen by other threads' load functions.
The LockFreeEx class
template<typename T>
class LockFreeEx
{
public:
    void push(const T& element)
    {
        const int wPos = m_position.load(std::memory_order_seq_cst);
        const int nextPos = getNextPos(wPos);
        m_buffer[nextPos] = element;
        m_position.store(nextPos, std::memory_order_seq_cst);
    }

    const bool pop(T& returnedElement)
    {
        const int wPos = m_position.exchange(-1, std::memory_order_seq_cst);
        if (wPos != -1)
        {
            returnedElement = m_buffer[wPos];
            return true;
        }
        else
        {
            return false;
        }
    }

private:
    static constexpr int maxElements = 8;
    static constexpr int getNextPos(int pos) noexcept { return (++pos == maxElements) ? 0 : pos; }

    std::array<T, maxElements> m_buffer;
    std::atomic<int> m_position {-1};
};
How I expect it could be improved
So, my first idea was using memory_order_relaxed in all atomic operations, since the pop() thread is in a loop looking for available updates every 10-15 ms; it's allowed for the first pop() calls to fail and only realize later that there is a new update. It's only a bunch of milliseconds.
Another option would be using release/acquire - but I'm not sure about them. Using release in all store() and acquire in all load() functions.
Unfortunately, all the memory_order I described seems to work, and I'm not sure when will they fail, if they are supposed to fail.
Final
Please, could you tell me if you see some problem using relaxed memory order here? Or should I use release/acquire (maybe a further explanation on these could help me)? why?
I think that relaxed is the best for this class, in all its store() or load(). But I'm not sure!
Thanks for reading.
EDIT: EXTRA EXPLANATION:
Since I see everyone is asking about the 'char', I've changed it to int; problem solved! But it isn't the one I want to solve.
The class, as I stated before, is something like a LIFO, but where only the last element pushed matters, if there is any.
I have a big struct T (copyable and assignable) that I must share between two threads in a lock-free way. So, the only way I know to do it is using a circular buffer that writes the last known value for T, and an atomic which knows the index of the last value written. When there isn't any, the index is -1.
Notice that my pop thread must know when there is a "new T" available; that's why pop() returns a bool.
Thanks again to everyone trying to assist me with memory orders! :)
AFTER READING SOLUTIONS:
template<typename T>
class LockFreeEx
{
public:
    LockFreeEx() {}
    LockFreeEx(const T& initValue): m_data(initValue) {}

    // WRITE THREAD - CAN BE SLOW, WILL BE CALLED EACH 500-800ms
    void publish(const T& element)
    {
        // I used acquire instead of relaxed to make sure wPos is always the latest m_writePos value, and nextPos calculates the right one
        const int wPos = m_writePos.load(std::memory_order_acquire);
        const int nextPos = (wPos + 1) % bufferMaxSize;
        m_buffer[nextPos] = element;
        m_writePos.store(nextPos, std::memory_order_release);
    }

    // READ THREAD - NEEDS TO BE VERY FAST - CALLED ONCE AT THE BEGINNING OF THE LOOP each 2ms
    inline void update()
    {
        // should I change to relaxed? It doesn't matter if I get the new value or the old one, since I will call this function again very soon, and again, and again...
        const int writeIndex = m_writePos.load(std::memory_order_acquire);

        // Updating only in case there is something new... T may be a heavy struct
        if (m_readPos != writeIndex)
        {
            m_readPos = writeIndex;
            m_data = m_buffer[m_readPos];
        }
    }

    // NEEDS TO BE LIGHTNING FAST, CALLED MULTIPLE TIMES IN THE READ THREAD
    inline const T& get() const noexcept { return m_data; }

private:
    // Buffer
    static constexpr int bufferMaxSize = 4;
    std::array<T, bufferMaxSize> m_buffer;
    std::atomic<int> m_writePos {0};
    int m_readPos = 0;

    // Data
    T m_data;
};
Memory order is not about when you see some particular change to an atomic object, but rather about what this change can guarantee about the surrounding code. Relaxed atomics guarantee nothing except the change to the atomic object itself: the change will be atomic, but relaxed operations provide no synchronization.
And you have some code which requires synchronization. You want to pop something that was pushed, not try to pop what has not been pushed yet. So if you use a relaxed operation, there is no guarantee that your pop will see this push code:
m_buffer[nextPos] = element;
m_position.store(nextPos, std::memory_order_relaxed);
as it is written. It could just as well be seen this way:
m_position.store(nextPos, std::memory_order_relaxed);
m_buffer[nextPos] = element;
So you might try to get an element from the buffer which is not there yet. Hence, you have to use some synchronization and at least use acquire/release memory order.
And to your actual code. I think the order can be as follows:
const int wPos = m_position.load(std::memory_order_relaxed);
...
m_position.store(nextPos, std::memory_order_release);
...
const int wPos = m_position.exchange(-1, std::memory_order_acquire);
Your writer only needs release, not seq-cst, but relaxed is too weak. You can't publish a value for m_position until after the non-atomic assignment to the corresponding m_buffer[] entry. You need release ordering to make sure the m_position store is visible to other threads only after all earlier memory operations. (Including the non-atomic assignment). https://preshing.com/20120913/acquire-and-release-semantics/
This has to "synchronize-with" an acquire or seq_cst load in the reader. Or at least mo_consume in the reader.
In theory you also need wpos = m_position to be at least acquire (or consume in the reader), not relaxed, because C++11's memory model is weak enough for things like value-prediction which can let the compiler speculatively use a value for wPos before the load actually takes a value from coherent cache.
(In practice on real CPUs, a crazy compiler could do this with test/branch to introduce a control dependency, allowing branch prediction + speculative execution to break the data dependency for a likely value of wPos.)
But normal compilers don't do that. On CPUs other than DEC Alpha, the data dependency in the source code of wPos = m_position and then using m_buffer[wPos] will create a data dependency in the asm, like mo_consume is supposed to take advantage of. Real ISAs other than Alpha guarantee dependency-ordering for dependent loads. (And even on Alpha, using a relaxed atomic exchange might be enough to close the tiny window that exists on the few real Alpha CPUs that allow this reordering.)
When compiling for x86, there's no downside at all to using mo_acquire; it doesn't cost any extra barriers. There can be on other ISAs, like 32-bit ARM where acquire costs a barrier, so "cheating" with a relaxed load could be a win that's still safe in practice. Current compilers always strengthen mo_consume to mo_acquire so we unfortunately can't take advantage of it.
You already have a real-world race condition even when using seq_cst.
initial state: m_position = 0
reader "claims" slot 0 by exchanging m_position to -1 and reads part of m_buffer[0];
reader sleeps for some reason (e.g. timer interrupt deschedules it), or simply races with a writer.
writer reads wPos = m_position as -1, and calculates nextPos = 0.
It overwrites the partially-read m_buffer[0]
reader wakes up and finishes reading, getting a torn T &element. Data race UB in the C++ abstract machine, and tearing in practice.
Adding a 2nd check of m_position after the read (like a SeqLock) can't detect this in every case because the writer doesn't update m_position until after writing the buffer element.
Even though your real use-case has long gaps between reads and writes, this defect can bite you with just one read and one write happening at almost the same time.
I know for sure that the read side cannot wait for anything and cannot be stopped (it's audio) and it's popped every 5-10 ms, while the write side is user input, which is slower; a fast user could do a push once every 500 ms.
A millisecond is ages on a modern CPU. Inter-thread latency is often something like 60 ns, so fractions of a microsecond, e.g. from a quad-core Intel x86. As long as you don't sleep on a mutex, it's not a problem to spin-retry once or twice before giving up.
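The "spin-retry once or twice" idea could be wrapped in a hypothetical helper like this (read_with_retry and try_read are made-up names for illustration):

```cpp
#include <cassert>

// Retry a transiently-failing read a couple of times before giving up:
// an in-progress write only lasts a fraction of a microsecond, so a few
// attempts almost always cover it.
template <typename TryRead>
bool read_with_retry(TryRead try_read, int max_attempts = 3) {
    for (int i = 0; i < max_attempts; ++i)
        if (try_read())
            return true;
    return false;  // writer was mid-update every time; caller keeps old data
}
```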
Code review:
The class is similar to a LIFO, but once the pop() function is called, it only returns the last-written element of its ring-buffer (and only if there are new elements since the last pop()).
This isn't a real queue or stack: push and pop aren't great names. "publish" and "read" or "get" might be better and make it more obvious what this is for.
I'd include comments in the code to describe the fact that this is safe for a single writer, multiple readers. (The non-atomic increment of m_position in push makes it clearly unsafe for multiple writers.)
Even so, it's kinda weird even with 1 writer + 1 reader running at the same time. If a read starts while a write is in progress, it will get the "old" value instead of spin-waiting for a fraction of a microsecond to get the new value. Then next time it reads there will already be a new value waiting; the one it just missed seeing last time. So e.g. m_position can update in this order: 2, -1, 3.
That might or might not be desirable, depending on whether "stale" data has any value, and on acceptability of the reader blocking if the writer sleeps mid-write. Or even without the writer sleeping, of spin-waiting.
The standard pattern for rarely written smallish data with multiple read-only readers is a SeqLock. e.g. for publishing a 128-bit current timestamp on a CPU that can't atomically read or write a 128-bit value. See Implementing 64 bit atomic counter with 32 bit atomics
Possible design changes
To make this safe, we could let the writer run free, always wrapping around its circular buffer, and have the reader keep track of the last element it looked at.
If there's only one reader, this should be a simple non-atomic variable. If it's an instance variable, at least put it on the other side of m_buffer[] from the write-position.
// Possible failure mode: writer wraps around between reads, leaving same m_position
// single-reader
const bool read(T &elem)
{
    // FIXME: big hack to get this in a separate cache line from the instance vars
    // maybe instead use alignas(64) int m_lastread as a class member, and/or on the other side of m_buffer from m_position.
    static int lastread = -1;
    int wPos = m_position.load(std::memory_order_acquire);  // or cheat with relaxed to get asm that's like "consume"
    if (lastread == wPos)
        return false;
    elem = m_buffer[wPos];
    lastread = wPos;
    return true;
}
You want lastread in a separate cache line from the stuff the writer writes. Otherwise the reader's updates of lastread will be slower because of false sharing with the writer's writes, and vice versa.
This lets the reader(s) be truly read-only wrt. the cache lines written by the writer. It will still take MESI traffic to request read access to lines that are in Modified state after the writer writes them, though. But the writer can still read m_position with no cache miss, so it can get its stores into the store buffer right away. It only has to wait for an RFO to get exclusive ownership of the cache line(s) before it can commit the element and the updated m_position from its store buffer to L1d cache.
TODO: let m_position increment without manual wrapping, so we have a write sequence number that takes a very long time to wrap around, avoiding false-negative early out from lastread == wPos.
Use wPos & (maxElements-1) as the index. And static_assert((maxElements & (maxElements-1)) == 0, "maxElements must be a power of 2");
Then the only danger is undetected tearing in a tiny time-window if the writer has wrapped all the way around and is writing the element being read. For frequent reads and infrequent writes, and a buffer that's not too small, this should never happen. Checking the m_position again after a read (like a SeqLock, similar to below) narrows the race window to only writes that are still in progress.
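The monotonic-counter TODO above might look like this sketch (names and sizes are illustrative, not from the question):

```cpp
#include <array>
#include <atomic>
#include <cassert>

// m_position increments monotonically as a write sequence number; the index
// is pos & (maxElements-1). The counter takes ~2^32 writes to wrap, making
// a false "nothing new" match from lastread == pos very unlikely.
template <typename T, unsigned maxElements = 8>
struct monotonic_ring {
    static_assert((maxElements & (maxElements - 1)) == 0,
                  "maxElements must be a power of 2");
    std::array<T, maxElements> m_buffer;
    std::atomic<unsigned> m_position{0};

    void publish(const T& elem) {             // single writer
        unsigned pos = m_position.load(std::memory_order_relaxed) + 1;
        m_buffer[pos & (maxElements - 1)] = elem;
        m_position.store(pos, std::memory_order_release);
    }
    bool read(T& elem, unsigned& lastread) {  // lastread is reader-private
        unsigned pos = m_position.load(std::memory_order_acquire);
        if (pos == lastread)
            return false;                     // nothing published since last read
        elem = m_buffer[pos & (maxElements - 1)];
        lastread = pos;
        return true;
    }
};
```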
If there are multiple readers, another good option might be a claimed flag in each m_buffer entry. So you'd define
template<typename T>
class WaitFreePublish
{
private:
    struct {
        alignas(32) T elem;           // at most 2 elements per cache line
        std::atomic<int8_t> claimed;  // writer sets this to 0, readers try to CAS it to 1
        // could be bool if we don't end up needing 3 states for anything.
        // set to "1" in the constructor? or invert and call it "unclaimed"
    } m_buffer[maxElements];
    std::atomic<int> m_position {-1};
};
If T has padding at the end, it's a shame we can't take advantage of that for the claimed flag :/
This avoids the possible failure mode of comparing positions: if the writer wraps around between reads, the worst we get is tearing. And we could detect such tearing by having the writer clear the claimed flag first, before writing the rest of the element.
With no other threads writing m_position, we can definitely use a relaxed load without worry. We could even cache the write-position somewhere else, but the reader hopefully isn't invalidating the cache-line containing m_position very often. And apparently in your use-case, writer performance/latency probably isn't a big deal.
So the writer + reader could look like this, with SeqLock-style tearing detection using the known update-order for claimed flag, element, and m_position.
/// claimed flag per array element supports concurrent readers
// thread-safety: single-writer only
// update claimed flag first, then element, then m_position.
void publish(const T& elem)
{
const int wPos = m_position.load(std::memory_order_relaxed);
const int nextPos = getNextPos(wPos);
m_buffer[nextPos].claimed.store(0, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release); // make sure that `0` is visible *before* the non-atomic element modification
m_buffer[nextPos].elem = elem;
m_position.store(nextPos, std::memory_order_release);
}
// thread-safety: multiple readers are ok. First one to claim an entry gets it
// check claimed flag before/after to detect overwrite, like a SeqLock
bool read(T &elem)
{
int rPos = m_position.load(std::memory_order_acquire);
int8_t claimed = m_buffer[rPos].claimed.load(std::memory_order_relaxed);
if (claimed != 0)
return false; // read-only early-out
claimed = 0;
if (!m_buffer[rPos].claimed.compare_exchange_strong(
claimed, 1, std::memory_order_acquire, std::memory_order_relaxed))
return false; // strong CAS failed: another thread claimed it
elem = m_buffer[rPos].elem;
// final check that the writer didn't step on this buffer during read, like a SeqLock
std::atomic_thread_fence(std::memory_order_acquire); // LoadLoad barrier
// We expect it to still be claimed=1 like we set with CAS
// Otherwise we raced with a writer and elem may be torn.
// optionally retry once or twice in this case because we know there's a new value waiting to be read.
return m_buffer[rPos].claimed.load(std::memory_order_relaxed) == 1;
// Note that elem can be updated even if we return false, if there was tearing. Use a temporary if that's not ok.
}
Using claimed = m_buffer[rPos].claimed.exchange(1) and checking for claimed==0 would be another option, vs. CAS-strong. Maybe slightly more efficient on x86. On LL/SC machines I guess CAS might be able to bail out without doing a write at all if it finds a mismatch with expected, in which case the read-only check is pointless.
I used .claimed.compare_exchange_strong(claimed, 1) with success ordering = acquire to make sure that read of claimed happens-before reading .elem.
The "failure" memory ordering can be relaxed: If we see it already claimed by another thread, we give up and don't look at any shared data.
The memory-ordering of the store part of compare_exchange_strong can be relaxed, so we just need mo_acquire, not acq_rel. Readers don't do any other stores to the shared data, and I don't think the ordering of the store matters wrt. to the loads. CAS is an atomic RMW. Only one thread's CAS can succeed on a given buffer element because they're all trying to set it from 0 to 1. That's how atomic RMWs work, regardless of being relaxed or seq_cst or anything in between.
It doesn't need to be seq_cst: we don't need to flush the store buffer or whatever to make sure the store is visible before this thread reads .elem. Just being an atomic RMW is enough to stop multiple threads from actually thinking they succeed. Release would just make sure it can't move earlier, ahead of the relaxed read-only check. That wouldn't be a correctness problem. Hopefully no x86 compilers would do that at compile time. (At runtime on x86, RMW atomic operations are always seq_cst.)
I think being an RMW makes it impossible for it to "step on" a write from a writer (after wrapping around). But this might be real-CPU implementation detail, not ISO C++. In the global modification order for any given .claimed, I think the RMW stays together, and the "acquire" ordering does keep it ahead of the read of the .elem. A release store that wasn't part of a RMW would be a potential problem though: a writer could wrap around and put claimed=0 in a new entry, then the reader's store could eventually commit and set it to 1, when actually no reader has ever read that element.
If we're very sure the reader doesn't need to detect writer wrap-around of the circular buffer, leave out the std::atomic_thread_fence in the writer and reader. (The claimed and the non-atomic element store will still be ordered by the release-store to m_position). The reader can be simplified to leave out the 2nd check and always return true if it gets past the CAS.
Notice that m_buffer[nextPos].claimed.store(0, std::memory_order_release); would not be sufficient to stop later non-atomic stores from appearing before it: release-stores are a one-way barrier, unlike release fences. A release-fence is like a 2-way StoreStore barrier. (Free on x86, cheap on other ISAs.)
This SeqLock-style tearing detection doesn't technically avoid UB in the C++ abstract machine, unfortunately. There's no good / safe way to express this pattern in ISO C++, and it's known to be safe in asm on real hardware. Nothing actually uses the torn value (assuming read()'s caller ignores its elem value if it returns false).
Making elem a std::atomic<T> would defeat the entire purpose: that would use a spinlock to get atomicity, so you might as well use a lock directly.
Using volatile T elem would break buffer[i].elem = elem because unlike C, C++ doesn't allow copying a volatile struct to/from a regular struct. (volatile struct = struct not possible, why?). This is highly annoying for a SeqLock type of pattern where you'd like the compiler to emit efficient code to copy the whole object representation, optionally using SIMD vectors. You won't get that if you write a constructor or assignment operator that takes a volatile T& argument and copies individual members. So clearly volatile is the wrong tool, and that only leaves compiler memory barriers to make sure the non-atomic object is fully read or fully written before the barrier. I think std::atomic_thread_fence is actually safe for that, like asm("" ::: "memory") in GNU C. It works in practice on current compilers.
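To make the SeqLock pattern above concrete, here's a minimal single-writer sketch (all names are made up for illustration; the non-atomic memcpy of the payload is exactly the technically-UB-but-fine-in-practice part discussed above):

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Minimal single-writer SeqLock sketch. An odd sequence number means a write
// is in progress; readers fail (and would retry) if the number is odd or changed.
struct SeqLocked {
    std::atomic<uint64_t> seq{0};
    uint64_t data[4] = {};                    // non-atomic payload

    void write(const uint64_t (&src)[4]) {    // single writer only
        uint64_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);          // odd: in progress
        std::atomic_thread_fence(std::memory_order_release);  // 2-way StoreStore
        std::memcpy(data, src, sizeof data);
        seq.store(s + 2, std::memory_order_release);          // even: complete
    }

    bool try_read(uint64_t (&dst)[4]) const {
        uint64_t s1 = seq.load(std::memory_order_acquire);
        if (s1 & 1) return false;                             // write in progress
        std::memcpy(dst, data, sizeof data);                  // may be torn...
        std::atomic_thread_fence(std::memory_order_acquire);  // LoadLoad barrier
        return seq.load(std::memory_order_relaxed) == s1;     // ...torn iff this fails
    }
};
```

A reader that needs a value loops on try_read; the writer never blocks or waits for readers.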

Is lockless hashing without std::atomics guaranteed to be thread-safe in C++11?

Consider the following attempt at a lockless hashtable for multithreaded search algorithms (inspired by this paper)
struct Data
{
uint64_t key;
uint64_t value;
};
struct HashEntry
{
uint64_t key_xor_value;
uint64_t value;
};
void insert_data(Data const& e, HashEntry* h, std::size_t tableOffset)
{
h[tableOffset].key_xor_value = e.key ^ e.value;
h[tableOffset].value = e.value;
}
bool data_is_present(Data const& e, HashEntry const* h, std::size_t tableOffset)
{
auto const tmp_key_xor_value = h[tableOffset].key_xor_value;
auto const tmp_value = h[tableOffset].value;
return e.key == (tmp_key_xor_value ^ tmp_value);
}
The idea is that a HashEntry struct stores the XOR-ed combination of the two 64-bit words of a Data struct. If two threads have interleaved reads/writes to the two 64-bit words of a HashEntry struct, the idea is that this can be detected by the reading thread by XOR-ing again and comparing against the original key. So one might have a loss of efficiency by corrupted hash entries, but still have guaranteed correctness in case the decoded retrieved key matches the original.
The paper mentions that it is based on the following assumption:
For the remainder of this discussion, assume that 64 bit memory
read/write operations are atomic, that is the entire 64 bit value is
read/written in one cycle.
My questions are: is the above code without explicit use of std::atomic<uint64_t> guaranteed to be thread-safe in C++11? Or can the individual 64-bit words be corrupted by simultaneous reads/writes? Even on 64-bit platforms? And how is this different from the old C++98 Standard?
Quotes from the Standard would be much appreciated.
UPDATE: based on this amazing paper by Hans Boehm on "benign" data races, a simple way to get bitten is for the compiler to cancel both XORs from insert_data() and data_is_present() so that the pair always returns true, e.g. if it finds a local code fragment like
insert_data(e, h, t);
if (data_is_present(e, h, t)) // optimized to true as if in single-threaded code
read_and_process(e, h, t); // data race if other thread has written
The C++11 specification defines pretty much any attempt by one thread to read or write a memory location that another thread is writing to as undefined behavior (absent the use of atomics or mutexes to prevent read/writes from one thread while another thread is writing).
Individual compilers may make it safe, but the C++11 specification doesn't provide coverage itself. Simultaneous reads are never a problem; it's writing in one thread while reading/writing in another.
And how is this different from the old C++98 Standard?
The C++98/03 standard doesn't provide any coverage with regard to threading. As far as the C++98/03 memory model is concerned, threading is not a thing that can possibly happen.
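For completeness, here's a sketch of how the same XOR trick could be expressed with defined behavior in C++11: make each 64-bit word a std::atomic<uint64_t> with relaxed ordering. Per-word atomicity removes the data-race UB while keeping the tearing check (this is an illustration, not the paper's code; AtomicHashEntry is a made-up name):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

struct Data { uint64_t key; uint64_t value; };

struct AtomicHashEntry {
    std::atomic<uint64_t> key_xor_value{0};
    std::atomic<uint64_t> value{0};
};

void insert_data(Data const& e, AtomicHashEntry* h, std::size_t tableOffset)
{
    // relaxed is enough here: the XOR check itself detects interleaved updates
    h[tableOffset].key_xor_value.store(e.key ^ e.value, std::memory_order_relaxed);
    h[tableOffset].value.store(e.value, std::memory_order_relaxed);
}

bool data_is_present(Data const& e, AtomicHashEntry const* h, std::size_t tableOffset)
{
    auto const tmp_key_xor_value = h[tableOffset].key_xor_value.load(std::memory_order_relaxed);
    auto const tmp_value = h[tableOffset].value.load(std::memory_order_relaxed);
    return e.key == (tmp_key_xor_value ^ tmp_value);
}
```

The atomic loads and stores also stop the compiler from cancelling the XORs across the two functions, which addresses the Boehm-style optimization hazard from the question's update.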
I don't think it depends so much on the compiler as on the CPU (its instruction set) you are using. I wouldn't expect the assumption to be very portable.
The code's totally broken. The compiler has substantial freedom to reorder instructions if its analysis suggests the overall effect is identical. In insert_data, for example, there's no guarantee that key_xor_value will be updated before value, or whether the updates are done in temporary registers before being written back to the cache, let alone when those cache updates (whatever their "order" in the machine code and the CPU's instruction pipeline) will be flushed from the updating core's or cores' (if context-switched mid-function) private caches and become visible to other threads. The compiler might even do the updates in steps using 32-bit registers, depending on the CPU, whether you're compiling 32-bit or 64-bit, compilation options, etc.
Atomic operations tend to require something like CAS (Compare and Swap) style instructions, or volatile and memory barrier instructions, that sync data across cores' caches and enforce some ordering.

C++ Thread Safe Integer

I have currently created a C++ class for a thread-safe integer which simply stores an integer privately and has public get and set functions which use a boost::mutex to ensure that only one change at a time can be applied to the integer.
Is this the most efficient way to do it, I have been informed that mutexes are quite resource intensive? The class is used a lot, very rapidly so it could well be a bottleneck...
Googling "C++ Thread Safe Integer" returns unclear views and opinions on the thread safety of integer operations on different architectures.
Some say that a 32-bit int on a 32-bit arch is safe, but 64 on 32 isn't due to 'alignment'. Others say it is compiler/OS specific (which I don't doubt).
I am using Ubuntu 9.10 on 32 bit machines, some have dual cores and so threads may be executed simultaneously on different cores in some cases and I am using GCC 4.4's g++ compiler.
Thanks in advance...
Please Note: The answer I have marked as 'correct' was most suitable for my problem - however there are some excellent points made in the other answers and they are all worth reading!
There is the C++0x atomic library, and there is also a Boost.Atomic library under development that uses lock-free techniques.
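A sketch of what the std::atomic version of such a thread-safe integer could look like (class name made up here), with no mutex at all on platforms where atomics are lock-free:

```cpp
#include <atomic>

// Mutex-free thread-safe integer using C++0x/C++11 std::atomic.
class ThreadSafeInt {
    std::atomic<int> value{0};
public:
    int  get() const  { return value.load(); }
    void set(int v)   { value.store(v); }
    int  increment()  { return value.fetch_add(1) + 1; }  // atomic RMW
};
```

The is_lock_free() member of std::atomic<int> tells you at runtime whether the implementation really avoids locks on your platform.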
It's not compiler and OS specific, it's architecture specific. The compiler and OS come into it because they're the tools you work through, but they're not the ones setting the real rules. This is why the C++ standard won't touch the issue.
I have never in my life heard of a 64-bit integer write, which can be split into two 32-bit writes, being interrupted halfway through. (Yes, that's an invitation to others to post counterexamples.) Specifically, I have never heard of a CPU's load/store unit allowing a misaligned write to be interrupted; an interrupting source has to wait for the whole misaligned access to complete.
To have an interruptible load/store unit, its state would have to be saved to the stack... and the load/store unit is what saves the rest of the CPU's state to the stack. This would be hugely complicated, and bug prone, if the load/store unit were interruptible... and all that you would gain is one cycle less latency in responding to interrupts, which, at best, is measured in tens of cycles. Totally not worth it.
Back in 1997, A coworker and I wrote a C++ Queue template which was used in a multiprocessing system. (Each processor had its own OS running, and its own local memory, so these queues were only needed for memory shared between processors.) We worked out a way to make the queue change state with a single integer write, and treated this write as an atomic operation. Also, we required that each end of the queue (i.e. the read or write index) be owned by one and only one processor. Thirteen years later, the code is still running fine, and we even have a version that handles multiple readers.
Still, if you want to treat a 64-bit integer write as atomic, align the field to a 64-bit boundary. Why worry?
EDIT: For the case you mention in your comment, I'd need more information to be sure, so let me give an example of something that could be implemented without specialized synchronization code.
Suppose you have N writers and one reader. You want the writers to be able to signal events to the reader. The events themselves have no data; you just want an event count, really.
Declare a structure for the shared memory, shared between all writers and the reader:
#include <stdint.h>
struct FlagTable
{ uint32_t flag[NWriters];
};
(Make this a class or template or whatever as you see fit.)
Each writer needs to be told its index and given a pointer to this table:
class Writer
{public:
Writer(FlagTable* flags_, size_t index_): flags(flags_), index(index_) {}
void SignalEvent(uint32_t eventCount = 1);
private:
FlagTable* flags;
size_t index;
};
When the writer wants to signal an event (or several), it updates its flag:
void Writer::SignalEvent(uint32_t eventCount)
{ // Effectively atomic: only one writer modifies this value, and
// the state changes when the incremented value is written out.
flags->flag[index] += eventCount;
}
The reader keeps a local copy of all the flag values it has seen:
class Reader
{public:
Reader(FlagTable* flags_): flags(flags_)
{ for(size_t i = 0; i < NWriters; ++i)
seenFlags[i] = flags->flag[i];
}
bool AnyEvents(void);
uint32_t CountEvents(int writerIndex);
private:
FlagTable* flags;
uint32_t seenFlags[NWriters];
};
To find out if any events have happened, it just looks for changed values:
bool Reader::AnyEvents(void)
{ for(size_t i = 0; i < NWriters; ++i)
if(seenFlags[i] != flags->flag[i])
return true;
return false;
}
If something happened, we can check each source and get the event count:
uint32_t Reader::CountEvents(int writerIndex)
{ // Only read a flag once per function call. If you read it twice,
// it may change between reads and then funny stuff happens.
uint32_t newFlag = flags->flag[writerIndex];
// Our local copy, though, we can mess with all we want since there
// is only one reader.
uint32_t oldFlag = seenFlags[writerIndex];
// Next line atomically changes Reader state, marking the events as counted.
seenFlags[writerIndex] = newFlag;
return newFlag - oldFlag;
}
Now the big gotcha in all this? It's nonblocking, which is to say that you can't make the Reader sleep until a Writer writes something. The Reader has to choose between sitting in a spin-loop waiting for AnyEvents() to return true, which minimizes latency, or it can sleep a bit each time through, which saves CPU but could let a lot of events build up. So it's better than nothing, but it's not the solution to everything.
Using actual synchronization primitives, one would only need to wrap this code with a mutex and condition variable to make it properly blocking: the Reader would sleep until there was something to do. Since you used atomic operations with the flags, you could actually keep the amount of time the mutex is locked to a minimum: the Writer would only need to lock the mutex long enough to signal the condition (not to set the flag), and the Reader only needs to wait for the condition before calling AnyEvents() (basically, it's like the sleep-loop case above, but with a wait-for-condition instead of a sleep call).
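A minimal sketch of that mutex + condition variable wrapping (writer_notify and reader_wait are made-up names; after waking, the Reader would go on to call AnyEvents()/CountEvents() as defined above):

```cpp
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool wake = false;

// Writer: set its flag lock-free as before, then signal under a short lock.
void writer_notify() {
    { std::lock_guard<std::mutex> lk(m); wake = true; }
    cv.notify_one();
}

// Reader: sleep until signalled, then go check the flags.
void reader_wait() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return wake; });   // predicate handles spurious wakeups
    wake = false;
}
```

The predicate form of wait() also handles the case where the Writer signals before the Reader starts waiting.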
C++ has no real atomic integer implementation, nor do most common libraries.
Consider that even if such an implementation existed, it would have to rely on some sort of mutex on architectures where atomic operations cannot be guaranteed.
As you're using GCC, and depending on what operations you want to perform on the integer, you might get away with GCC's atomic builtins.
These might be a bit faster than mutexes, but in some cases still a lot slower than "normal" operations.
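A hedged sketch of the builtins in question (__sync_add_and_fetch and __sync_bool_compare_and_swap are real GCC builtins since 4.1; the wrapper functions are made up for illustration, and newer code would prefer the __atomic builtins or std::atomic):

```cpp
// GCC's legacy __sync builtins wrapped for readability.
int atomic_bump(int* p, int by)
{
    return __sync_add_and_fetch(p, by);   // atomically *p += by, returns the new value
}

bool atomic_cas(int* p, int expected, int desired)
{
    // true if *p was `expected` and has now been set to `desired`
    return __sync_bool_compare_and_swap(p, expected, desired);
}
```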
For full, general purpose synchronization, as others have already mentioned, the traditional synchronization tools are pretty much required. However, for certain special cases it is possible to take advantage of hardware optimizations. Specifically, most modern CPUs support atomic increment & decrement on integers. The GLib library has pretty good cross-platform support for this. Essentially, the library wraps CPU & compiler specific assembly code for these operations and defaults to mutex protection where they're not available. It's certainly not very general-purpose but if you're only interested in maintaining a counter, this might be sufficient.
You can also have a look at the atomic ops section of Intel's Threading Building Blocks, or the atomic_ops project.

I've heard i++ isn't thread safe, is ++i thread-safe?

I've heard that i++ isn't a thread-safe statement since in assembly it reduces down to storing the original value as a temp somewhere, incrementing it, and then replacing it, which could be interrupted by a context switch.
However, I'm wondering about ++i. As far as I can tell, this would reduce to a single assembly instruction, such as 'add r1, r1, 1' and since it's only one instruction, it'd be uninterruptable by a context switch.
Can anyone clarify? I'm assuming that an x86 platform is being used.
You've heard wrong. It may well be that "i++" is thread-safe for a specific compiler and specific processor architecture but it's not mandated in the standards at all. In fact, since multi-threading isn't part of the ISO C or C++ standards (a), you can't consider anything to be thread-safe based on what you think it will compile down to.
It's quite feasible that ++i could compile to an arbitrary sequence such as:
load r0,[i] ; load memory into reg 0
incr r0 ; increment reg 0
stor [i],r0 ; store reg 0 back to memory
which would not be thread-safe on my (imaginary) CPU that has no memory-increment instructions. Or it may be smart and compile it into:
lock ; disable task switching (interrupts)
load r0,[i] ; load memory into reg 0
incr r0 ; increment reg 0
stor [i],r0 ; store reg 0 back to memory
unlock ; enable task switching (interrupts)
where lock disables and unlock enables interrupts. But, even then, this may not be thread-safe in an architecture that has more than one of these CPUs sharing memory (the lock may only disable interrupts for one CPU).
The language itself (or libraries for it, if it's not built into the language) will provide thread-safe constructs and you should use those rather than depend on your understanding (or possibly misunderstanding) of what machine code will be generated.
Things like Java synchronized and pthread_mutex_lock() (available to C/C++ under some operating systems) are what you need to look into (a).
(a) This question was asked before the C11 and C++11 standards were completed. Those iterations have now introduced threading support into the language specifications, including atomic data types (though they, and threads in general, are optional, at least in C).
You can't make a blanket statement about either ++i or i++. Why? Consider incrementing a 64-bit integer on a 32-bit system. Unless the underlying machine has a quad word "load, increment, store" instruction, incrementing that value is going to require multiple instructions, any of which can be interrupted by a thread context switch.
In addition, ++i isn't always "add one to the value." In a language like C, incrementing a pointer actually adds the size of the thing pointed to. That is, if i is a pointer to a 32-byte structure, ++i adds 32 bytes. Whereas almost all platforms have an "increment value at memory address" instruction that is atomic, not all have an atomic "add arbitrary value to value at memory address" instruction.
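A quick illustration of that pointer point: ++p compiles to an add of sizeof(*p), not 1 (the struct and function here are made up):

```cpp
#include <cstddef>

// ++ on a pointer advances by the size of the pointee, so "increment" can be
// an arbitrary-sized add that the hardware may have no atomic instruction for.
struct Record { char bytes[32]; };

std::ptrdiff_t byte_distance_after_increment()
{
    Record arr[2];
    Record* p = arr;
    ++p;   // adds sizeof(Record) == 32 bytes to the address
    return reinterpret_cast<char*>(p) - reinterpret_cast<char*>(arr);
}
```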
They are both thread-unsafe.
A CPU cannot do math directly with memory. It does that indirectly by loading the value from memory and doing the math with CPU registers.
i++
register int a1, a2;
a1 = *(&i) ; // One cpu instruction: LOAD from memory location identified by i;
a2 = a1;
a1 += 1;
*(&i) = a1;
return a2; // 4 cpu instructions
++i
register int a1;
a1 = *(&i) ;
a1 += 1;
*(&i) = a1;
return a1; // 3 cpu instructions
For both cases, there is a race condition that results in the unpredictable i value.
For example, let's assume there are two concurrent ++i threads with each using register a1, b1 respectively. And, with context switching executed like the following:
register int a1, b1;
a1 = *(&i);
a1 += 1;
b1 = *(&i);
b1 += 1;
*(&i) = a1;
*(&i) = b1;
In result, i doesn't become i+2, it becomes i+1, which is incorrect.
To remedy this, modern CPUs provide some kind of LOCK/UNLOCK CPU instructions that make the whole load-increment-store sequence atomic, as if context switching were disabled for that interval.
On Win32, use InterlockedIncrement() to do i++ for thread-safety. It's much faster than relying on mutex.
If you are sharing even an int across threads in a multi-core environment, you need proper memory barriers in place. This can mean using interlocked instructions (see InterlockedIncrement in win32 for example), or using a language (or compiler) that makes certain thread-safe guarantees. With CPU level instruction-reordering and caches and other issues, unless you have those guarantees, don't assume anything shared across threads is safe.
Edit: One thing you can assume with most architectures is that if you are dealing with properly aligned single words, you won't end up with a single word containing a combination of two values that were mashed together. If two writes happen over top of each other, one will win, and the other will be discarded. If you are careful, you can take advantage of this, and see that either ++i or i++ are thread-safe in the single writer/multiple reader situation.
If you want an atomic increment in C++ you can use C++0x libraries (the std::atomic datatype) or something like TBB.
There was once a time that the GNU coding guidelines said updating datatypes that fit in one word was "usually safe" but that advice is wrong for SMP machines, wrong for some architectures, and wrong when using an optimizing compiler.
To clarify the "updating one-word datatype" comment:
It is possible for two CPUs on an SMP machine to write to the same memory location in the same cycle, and then try to propagate the change to the other CPUs and the cache. Even if only one word of data is being written so the writes only take one cycle to complete, they also happen simultaneously so you cannot guarantee which write succeeds. You won't get partially updated data, but one write will disappear because there is no other way to handle this case.
Compare-and-swap properly coordinates between multiple CPUs, but there is no reason to believe that every variable assignment of one-word datatypes will use compare-and-swap.
And while an optimizing compiler doesn't affect how a load/store is compiled, it can change when the load/store happens, causing serious trouble if you expect your reads and writes to happen in the same order they appear in the source code (the most famous being double-checked locking does not work in vanilla C++).
NOTE My original answer also said that Intel 64 bit architecture was broken in dealing with 64 bit data. That is not true, so I edited the answer, but my edit claimed PowerPC chips were broken. That is true when reading immediate values (i.e., constants) into registers (see the two sections named "Loading pointers" under listing 2 and listing 4) . But there is an instruction for loading data from memory in one cycle (lmw), so I've removed that part of my answer.
Even if it is reduced to a single assembly instruction, incrementing the value directly in memory, it is still not thread safe.
When incrementing a value in memory, the hardware does a "read-modify-write" operation: it reads the value from the memory, increments it, and writes it back to memory. The x86 hardware has no way of incrementing directly on the memory; the RAM (and the caches) is only able to read and store values, not modify them.
Now suppose you have two separate cores, either on separate sockets or sharing a single socket (with or without a shared cache). The first processor reads the value, and before it can write back the updated value, the second processor reads it. After both processors write the value back, it will have been incremented only once, not twice.
There is a way to avoid this problem; x86 processors (and most multi-core processors you will find) are able to detect this kind of conflict in hardware and sequence it, so that the whole read-modify-write sequence appears atomic. However, since this is very costly, it is only done when requested by the code, on x86 usually via the LOCK prefix. Other architectures can do this in other ways, with similar results; for instance, load-linked/store-conditional and atomic compare-and-swap (recent x86 processors also have this last one).
Note that using volatile does not help here; it only tells the compiler that the variable might have been modified externally and that reads of the variable must not be cached in a register or optimized out. It does not make the compiler use atomic primitives.
The best way is to use atomic primitives (if your compiler or libraries have them), or do the increment directly in assembly (using the correct atomic instructions).
On x86/Windows in C/C++, you should not assume it is thread-safe. You should use InterlockedIncrement() and InterlockedDecrement() if you require atomic operations.
If your programming language says nothing about threads, yet runs on a multithreaded platform, how can any language construct be thread-safe?
As others pointed out: you need to protect any multithreaded access to variables by platform specific calls.
There are libraries out there that abstract away the platform specificity, and the upcoming C++ standard has adapted its memory model to cope with threads (and thus can guarantee thread-safety).
Never assume that an increment will compile down to an atomic operation. Use InterlockedIncrement or whatever similar functions exist on your target platform.
Edit: I just looked up this specific question and increment on X86 is atomic on single processor systems, but not on multiprocessor systems. Using the lock prefix can make it atomic, but it's much more portable just to use InterlockedIncrement.
According to this assembly lesson on x86, you can atomically add a register to a memory location, so potentially your code may atomically execute '++i' or 'i++'.
But as said in another post, ANSI C does not guarantee atomicity for the '++' operation, so you cannot be sure of what your compiler will generate.
The 1998 C++ standard has nothing to say about threads, although the next standard (due this year or the next) does. Therefore, you can't say anything intelligent about thread-safety of operations without referring to the implementation. It's not just the processor being used, but the combination of the compiler, the OS, and the thread model.
In the absence of documentation to the contrary, I wouldn't assume that any action is thread-safe, particularly with multi-core processors (or multi-processor systems). Nor would I trust tests, as thread synchronization problems are likely to come up only by accident.
Nothing is thread-safe unless you have documentation that says it is for the particular system you're using.
Throw i into thread local storage; it isn't atomic, but it then doesn't matter.
AFAIK, reads and writes of a std::atomic<int> are atomic according to the C++ standard.
However, all that this does is get rid of the undefined behavior that's associated with a data race.
But there will still be a race condition if both threads try to increment i.
Imagine the following scenario:
Let i = 0 initially:
Thread A reads the value from memory and stores in its own cache.
Thread A increments the value by 1.
Thread B reads the value from memory and stores in its own cache.
Thread B increments the value by 1.
If this were all done by a single thread, you would get i = 2 in memory.
But with both threads, each thread writes back its own result, so Thread A writes i = 1 to memory, and Thread B also writes i = 1 to memory.
It's well defined, there's no partial destruction or construction or any sort of tearing of an object, but it's still a race condition: the two increments overlap and one update is lost.
In order to atomically increment i you can declare it as std::atomic<int> and use:
i.fetch_add(1, std::memory_order_relaxed)
Relaxed ordering can be used because we don't care where this operation takes place all we care about is that the increment operation is atomic.
You say "it's only one instruction, it'd be uninterruptible by a context switch." - that's all well and good for a single CPU, but what about a dual core CPU? Then you can really have two threads accessing the same variable at the same time without any context switches.
Without knowing the language, the answer is to test the heck out of it.
I think that if the expression "i++" is the only thing in a statement, it's equivalent to "++i"; the compiler is smart enough not to keep a temporary value, etc. So if you can use them interchangeably (otherwise you wouldn't be asking which one to use), it doesn't matter which you use as they're almost the same (except for aesthetics).
Anyway, even if the increment operator is atomic, that doesn't guarantee that the rest of the computation will be consistent if you don't use the correct locks.
If you want to experiment by yourself, write a program where N threads increment concurrently a shared variable M times each... if the value is less than N*M, then some increment was overwritten. Try it with both preincrement and postincrement and tell us ;-)
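That experiment, sketched with std::atomic so that the final count is exact; replace the atomic with a plain int (itself a data race) to observe lost increments:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// N threads each increment a shared counter M times. With std::atomic the
// result is always N*M; with a plain int, some increments get overwritten.
int run_experiment(int nThreads, int mIncrements)
{
    std::atomic<int> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < nThreads; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < mIncrements; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : threads) th.join();
    return counter.load();
}
```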
For a counter, I recommend using the compare-and-swap idiom, which is both non-locking and thread-safe.
Here it is in Java:
public class IntCompareAndSwap {
private int value = 0;
public synchronized int get(){return value;}
public synchronized int compareAndSwap(int p_expectedValue, int p_newValue){
int oldValue = value;
if (oldValue == p_expectedValue)
value = p_newValue;
return oldValue;
}
}
public class IntCASCounter {
public IntCASCounter(){
m_value = new IntCompareAndSwap();
}
private IntCompareAndSwap m_value;
public int getValue(){return m_value.get();}
public void increment(){
int temp;
do {
temp = m_value.get();
} while (temp != m_value.compareAndSwap(temp, temp + 1));
}
public void decrement(){
int temp;
do {
temp = m_value.get();
} while (temp > 0 && temp != m_value.compareAndSwap(temp, temp - 1));
}
}
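For comparison, a C++ sketch of the same counter built on std::atomic's compare_exchange_weak instead of Java's synchronized methods (a rough translation, not a drop-in equivalent):

```cpp
#include <atomic>

class IntCASCounter {
    std::atomic<int> value{0};
public:
    int get() const { return value.load(); }

    void increment() {
        int expected = value.load();
        // on failure, compare_exchange_weak reloads `expected` with the current value
        while (!value.compare_exchange_weak(expected, expected + 1))
            ;
    }

    bool decrement() {   // refuses to go below zero, like the guarded Java loop
        int expected = value.load();
        do {
            if (expected == 0) return false;
        } while (!value.compare_exchange_weak(expected, expected - 1));
        return true;
    }
};
```

compare_exchange_weak may fail spuriously, which is fine inside a retry loop and can be cheaper than the strong version on LL/SC machines.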