Implementation of global shared counter

Implementation of global shared counter - concurrency

the following code comes from (Listing 5.3):
Is Parallel Programming Hard, And, If So, What Can You Do About It?
DEFINE PER_THREAD(long, counter);
void inc_count(void){
__get_thread_var(counter)++; // __get_thread_var returns a reference to thread local counter.
}
long read_count(void){
int t;
long sum = 0;
for_each_thread(t)
sum += per_thread(counter, t);
return sum;
}
Above code implements a global (shared by cores) counter. We could use atomic operations instead, obviously. However, we consider the case where updates are very frequent so we need a per-thread variable to minimize traffic on CPU system.
I cannot grasp why it is correct. After all, In C/C++/Java we have at most SC-DRF (Sequential Consistency if there is no data race).
Actually, we have a datarace. In a result we have no guarantee from a memory model. Especially, what about Out Of Air Thin values? I cannot see how it is guaranteed that it does not happen. So, what do you think? Is that implementation correct in the terms of my doubts, and why?

Not sure which version of the document you are referring to but figure 5.3 in the latest one uses the macros READ_ONCE and WRITE_ONCE which provide basic visibility guarantees across threads. When combined with proper alignment of counter, then I suppose it is essentially the equivalent of relaxed atomic semantics.

Related

Reordering of operations and lock-free data structures

Assume we have a Container maintaining a set of int values, plus a flag for each value indicating whether the value is valid. Invalid values are considered to be INT_MAX. Initially, all values are invalid. When a value is accessed for the first time, it is set to INT_MAX and its flag is set to valid.
struct Container {
int& operator[](int i) {
if (!isValid[i]) {
values[i] = INT_MAX; // (*)
isValid[i] = true; // (**)
}
return values[i];
}
std::vector<int> values;
std::vector<bool> isValid;
};
Now, another thread reads container values concurrently:
// This member is allowed to overestimate value i, but it must not underestimate it.
int Container::get(int i) {
return isValid[i] ? values[i] : INT_MAX;
}
This is perfectly valid code, but it is crucial that lines (*) and (**) are executed in the given order.
Does the standard guarantee in this case that the lines are executed in the given order? At least from a single-threaded perspective, the lines could be interchanged, couldn't they?
If not, what is the most efficient way to ensure their order? This is high-performance code, so I cannot go without -O3 and do not want to use volatile.

There is no synchronization here. If you access these values from one thread and change them from another thread you got undefined behavior. You'd either need a lock around all accesses in which case things are fine. Otherwise you'll need to make all your std::vector elements atomic<T> and you can control visibility of the values using the appropriate visibility parameters.
There seems to be a misunderstanding of what synchronization and in particular atomic operations do: their purpose is to make code fast! That may appear counter intuitive so here is the explanation: non-atomic operations should be as fast as possibe and there are deliberately no guarantees how they access memory exactly. As long as the compiler and execution system produce the correct results the compiler iand system are free to do whatever they need or want to do. To achieve good performance interaction between different threads are assumed to not exist.
In a concurrent system there are, however, interactions betwwen threads. This is where atomic operations enter the stage: they allow the specification of exactly the necessary synchronization needed. Thus, they allow to tell the compiler the minimal constraints it has to obey to make the thread unteraction correct. The compiler will use these indicators to generate the best possible code to achieve what is specified. That code may be identical to code not using any synchronization although in practice it is normally necessary to also prevent the CPU from reordering operations. As a result, correct use of the synchronization results in the most efficient code with only the absolutely necessary overhead.
The tricky part is to some extent finding which synchronizations are needed and to minimize these. Simply not having any will allow the compiler and the CPU to reorder operations freely and won't work.
Since the question mentioned volatile please note that volatile is entirely unrelated to concurrency! The primary purpose for volatile is to inform the system that an address points to memory whose access may have side effects. Primarily it is used to have memory mapped I/O or hardware control be accessible. Die to the potential of side effects it one of the two aspects of C++ defining the semantics of programs (the other one is I/O using standard library I/O facilities).

In C11/C++11, possible to mix atomic/non-atomic ops on the same memory?

Is it possible to perform atomic and non-atomic ops on the same memory location?
I ask not because I actually want to do this, but because I'm trying to understand the C11/C++11 memory model. They define a "data race" like so:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
-- C11 §5.1.2.4 p25, C++11 § 1.10 p21
Its the "at least one of which is not atomic" part that is troubling me. If it weren't possible to mix atomic and non-atomic ops, it would just say "on an object which is not atomic."
I can't see any straightforward way of performing non-atomic operations on atomic variables. std::atomic<T> in C++ doesn't define any operations with non-atomic semantics. In C, all direct reads/writes of an atomic variable appear to be translated into atomic operations.
I suppose memcpy() and other direct memory operations might be a way of performing a non-atomic read/write on an atomic variable? ie. memcpy(&atomicvar, othermem, sizeof(atomicvar))? But is this even defined behavior? In C++, std::atomic is not copyable, so would it be defined behavior to memcpy() it in C or C++?
Initialization of an atomic variable (whether through a constructor or atomic_init()) is defined to not be atomic. But this is a one-time operation: you're not allowed to initialize an atomic variable a second time. Placement new or an explicit destructor call could would also not be atomic. But in all of these cases, it doesn't seem like it would be defined behavior anyway to have a concurrent atomic operation that might be operating on an uninitialized value.
Performing atomic operations on non-atomic variables seems totally impossible: neither C nor C++ define any atomic functions that can operate on non-atomic variables.
So what is the story here? Is it really about memcpy(), or initialization/destruction, or something else?

I think you're overlooking another case, the reverse order. Consider an initialized int whose storage is reused to create an std::atomic_int. All atomic operations happen after its ctor finishes, and therefore on initialized memory. But any concurrent, non-atomic access to the now-overwritten int has to be barred as well.
(I'm assuming here that the storage lifetime is sufficient and plays no role)
I'm not entirely sure because I think that the second access to int would be invalid anyway as the type of the accessing expression int doesn't match the object's type at the time (std::atomic<int>). However, "the object's type at the time" assumes a single linear time progression which doesn't hold in a multi-threaded environment. C++11 in general has that solved by making such assumptions about "the global state" Undefined Behavior per se, and the rule from the question appears to fit in that framework.
So perhaps rephrasing: if a single memory location contains an atomic object as well as a non-atomic object, and if the destruction of the earliest created (older) object is not sequenced-before the creation of the other (newer) object, then access to the older object conflicts with access to the newer object unless the former is scheduled-before the latter.

disclaimer: I am not a parallelism guru.
Is it possible to mix atomic/non-atomic ops on the same memory, and if
so, how?
you can write it in the code and compile, but it will probably yield undefined behaviour.
when talking about atomics, it is important to understand what kind o problems do they solve.
As you might know, what we call in shortly "memory" is multi-layered set of entities which are capable to hold memory.
first we have the RAM, then the cache lines , then the registers.
on mono-core processors, we don't have any synchronization problem. on multi-core processors we have all of them. every core has it own set of registers and cache lines.
this casues few problems.
First one of them is memory reordering - the CPU may decide on runtime to scrumble some reading/writing instructions to make the code run faster. this may yield some strange results that are completly invisible on the high-level code that brought this set of instruction. the most classic example of this phenomanon is the "two threads - two integer" example:
int i=0;
int j=0;
thread a -> i=1, then print j
thread b -> j=1 then print i;
logically, the result "00" cannot be. either a ends first, the result may be "01", either b ends first, the result may be "10". if both of them ends in the same time, the result may be "11". yet, if you build small program which imitates this situtation and run it in a loop, very quicly you will see the result "00"
another problem is memory invisibility. like I mentioned before, the variable's value may be cached in one of the cache lines, or be stored in one of the registered. when the CPU updates a variables value - it may delay the writing of the new value back to the RAM. it may keep the value in the cache/regiter because it was told (by the compiler optimizations) that that value will be updated again soon, so in order to make the program faster - update the value again and only then write it back to the RAM. it may cause undefined behaviour if other CPU (and consequently a thread or a process) depends on the new value.
for example, look at this psuedo code:
bool b = true;
while (b) -> print 'a'
new thread -> sleep 4 seconds -> b=false;
the character 'a' may be printed infinitly, because b may be cached and never be updated.
there are many more problems when dealing with paralelism.
atomics solves these kind of issues by (in a nutshell) telling the compiler/CPU how to read and write data to/from the RAM correctly without doing un-wanted scrumbling (read about memory orders). a memory order may force the cpu to write it's values back to the RAM, or read the valuse from the RAM even if they are cached.
So, although you can mix non atomics actions with atomic ones, you only doing part of the job.
for example let's go back to the second example:
atomic bool b = true;
while (reload b) print 'a'
new thread - > b = (non atomicly) false.
so although one thread re-read the value of b from the RAM again and again but the other thread may not write false back to the RAM.
So although you can mix these kind of operations in the code, it will yield underfined behavior.

I'm interested in this topic since I have code in which sometimes I need to access a range of addresses serially, and at other times to access the same addresses in parallel with some way of managing contention.
So not exactly the situation posed by the original question which (I think) implies concurrent, or nearly so, atomic and non atomic operationsin parallel code, but close.
I have managed by some devious casting to persuade my C11 compiler to allow me to access an integer and much more usefully a pointer both atomically and non-atomically ("directly"), having established that both types are officially lock-free on my x86_64 system. That is that the sizes of the atomic and non atomic types are the same.
I definitely would not attempt to mix both types of access to an address in a parallel context, that would be doomed to fail. However I have been successful in using "direct" syntax operations in serial code and "atomic" syntax in parallel code, giving me the best of both worlds of the fastest possible access (and much simpler syntax) in serial, and safely managed contention when in parallel.
So you can do it so long as you don't try to mix both methods in parallel code and you stick to using lock-free types, which probably means up to the size of a pointer.

I'm interested in this topic since I have code in which sometimes I need to access a range of addresses serially, and at other times to access the same addresses in parallel with some way of managing contention.
So not exactly the situation posed by the original question which (I think) implies concurrent, or nearly so, atomic and non atomic operations in parallel code, but close.
I have managed by some devious casting to persuade my C11 compiler to allow me to access an integer and much more usefully a pointer both atomically and non-atomically ("directly"), having established that both types are officially lock-free on my x86_64 system. My, possibly simplistic, interpretation of that is that the sizes of the atomic and non atomic types are the same and that the hardware can update such types in a single operation.
I definitely would not attempt to mix both types of access to an address in a parallel context, i think that would be doomed to fail. However I have been successful in using "direct" syntax operations in serial code and "atomic" syntax in parallel code, giving me the best of both worlds of the fastest possible access (and much simpler syntax) in serial, and safely managed contention when in parallel.
So you can do it so long as you don't try to mix both methods in parallel code and you stick to using lock-free types, which probably means up to the size of a pointer.

C++ Thread Safe Integer

I have currently created a C++ class for a thread safe integer which simply stores an integer privately and has public get a set functions which use a boost::mutex to ensure that only one change at a time can be applied to the integer.
Is this the most efficient way to do it, I have been informed that mutexes are quite resource intensive? The class is used a lot, very rapidly so it could well be a bottleneck...
Googleing C++ Thread Safe Integer returns unclear views and oppinions on the thread safety of integer operations on different architectures.
Some say that a 32bit int on a 32bit arch is safe, but 64 on 32 isn't due to 'alignment' Others say it is compiler/OS specific (which I don't doubt).
I am using Ubuntu 9.10 on 32 bit machines, some have dual cores and so threads may be executed simultaneously on different cores in some cases and I am using GCC 4.4's g++ compiler.
Thanks in advance...
Please Note: The answer I have marked as 'correct' was most suitable for my problem - however there are some excellent points made in the other answers and they are all worth reading!

There is the C++0x atomic library, and there is also a Boost.Atomic library under development that use lock free techniques.

It's not compiler and OS specific, it's architecture specific. The compiler and OS come into it because they're the tools you work through, but they're not the ones setting the real rules. This is why the C++ standard won't touch the issue.
I have never in my life heard of an 64-bit integer write, which can be split into two 32-bit writes, being interrupted halfway through. (Yes, that's an invitation to others to post counterexamples.) Specifically, I have never heard of a CPU's load/store unit allowing a misaligned write to be interrupted; an interrupting source has to wait for the whole misaligned access to complete.
To have an interruptible load/store unit, its state would have to be saved to the stack... and the load/store unit is what saves the rest of the CPU's state to the stack. This would be hugely complicated, and bug prone, if the load/store unit were interruptible... and all that you would gain is one cycle less latency in responding to interrupts, which, at best, is measured in tens of cycles. Totally not worth it.
Back in 1997, A coworker and I wrote a C++ Queue template which was used in a multiprocessing system. (Each processor had its own OS running, and its own local memory, so these queues were only needed for memory shared between processors.) We worked out a way to make the queue change state with a single integer write, and treated this write as an atomic operation. Also, we required that each end of the queue (i.e. the read or write index) be owned by one and only one processor. Thirteen years later, the code is still running fine, and we even have a version that handles multiple readers.
Still, if you want to treat a 64-bit integer write as atomic, align the field to a 64-bit bound. Why worry?
EDIT: For the case you mention in your comment, I'd need more information to be sure, so let me give an example of something that could be implemented without specialized synchronization code.
Suppose you have N writers and one reader. You want the writers to be able to signal events to the reader. The events themselves have no data; you just want an event count, really.
Declare a structure for the shared memory, shared between all writers and the reader:
#include <stdint.h>
struct FlagTable
{ uint32_t flag[NWriters];
};
(Make this a class or template or whatever as you see fit.)
Each writer needs to be told its index and given a pointer to this table:
class Writer
{public:
Writer(FlagTable* flags_, size_t index_): flags(flags_), index(index_) {}
void SignalEvent(uint32_t eventCount = 1);
private:
FlagTable* flags;
size_t index;
}
When the writer wants to signal an event (or several), it updates its flag:
void Writer::SignalEvent(uint32_t eventCount)
{ // Effectively atomic: only one writer modifies this value, and
// the state changes when the incremented value is written out.
flags->flag[index] += eventCount;
}
The reader keeps a local copy of all the flag values it has seen:
class Reader
{public:
Reader(FlagTable* flags_): flags(flags_)
{ for(size_t i = 0; i < NWriters; ++i)
seenFlags[i] = flags->flag[i];
}
bool AnyEvents(void);
uint32_t CountEvents(int writerIndex);
private:
FlagTable* flags;
uint32_t seenFlags[NWriters];
}
To find out if any events have happened, it just looks for changed values:
bool Reader::AnyEvents(void)
{ for(size_t i = 0; i < NWriters; ++i)
if(seenFlags[i] != flags->flag[i])
return true;
return false;
}
If something happened, we can check each source and get the event count:
uint32_t Reader::CountEvents(int writerIndex)
{ // Only read a flag once per function call. If you read it twice,
// it may change between reads and then funny stuff happens.
uint32_t newFlag = flags->flag[i];
// Our local copy, though, we can mess with all we want since there
// is only one reader.
uint32_t oldFlag = seenFlags[i];
// Next line atomically changes Reader state, marking the events as counted.
seenFlags[i] = newFlag;
return newFlag - oldFlag;
}
Now the big gotcha in all this? It's nonblocking, which is to say that you can't make the Reader sleep until a Writer writes something. The Reader has to choose between sitting in a spin-loop waiting for AnyEvents() to return true, which minimizes latency, or it can sleep a bit each time through, which saves CPU but could let a lot of events build up. So it's better than nothing, but it's not the solution to everything.
Using actual synchronization primitives, one would only need to wrap this code with a mutex and condition variable to make it properly blocking: the Reader would sleep until there was something to do. Since you used atomic operations with the flags, you could actually keep the amount of time the mutex is locked to a minimum: the Writer would only need to lock the mutex long enough to send the condition, and not set the flag, and the reader only needs to wait for the condition before calling AnyEvents() (basically, it's like the sleep-loop case above, but with a wait-for-condition instead of a sleep call).

C++ has no real atomic integer implementation, neither do most common libraries.
Consider the fact that even if said implementation would exist, it would have to rely on some sort of mutex - due to the fact that you cannot guarantee atomic operations across all architectures.

As you're using GCC, and depending on what operations you want to perform on the integer, you might get away with GCC's atomic builtins.
These might be a bit faster than mutexes, but in some cases still a lot slower than "normal" operations.

For full, general purpose synchronization, as others have already mentioned, the traditional synchronization tools are pretty much required. However, for certain special cases it is possible to take advantage of hardware optimizations. Specifically, most modern CPUs support atomic increment & decrement on integers. The GLib library has pretty good cross-platform support for this. Essentially, the library wraps CPU & compiler specific assembly code for these operations and defaults to mutex protection where they're not available. It's certainly not very general-purpose but if you're only interested in maintaining a counter, this might be sufficient.

you can also have a look at the atomic ops section of intels Thread Building Blocks or the atomic_ops project

Synchronizing access to variable

I need to provide synchronization to some members of a structure.
If the structure is something like this
struct SharedStruct {
int Value1;
int Value2;
}
and I have a global variable
SharedStruct obj;
I want that the write from a processor
obj.Value1 = 5; // Processor B
to be immediately visible to the other processors, so that when I test the value
if(obj.Value1 == 5) { DoSmth(); } // Processor A
else DoSmthElse();
to get the new value, not some old value from the cache.
First I though that if I use volatile when writing/reading the values, it is enough. But I read that volatile can't solve this kind o issues.
The members are guaranteed to be properly aligned on 2/4/8 byte boundaries, and writes should be atomic in this case, but I'm not sure how the cache could interfere with this.
Using memory barriers (mfence, sfence, etc.) would be enough ? Or some interlocked operations are required ?
Or maybe something like
lock mov addr, REGISTER
?
The easiest would obviously be some locking mechanism, but speed is critical and can't afford locks :(
Edit
Maybe I should clarify a bit. The value is set only once (behaves like a flag). All the other threads need just to read it. That's why I think that it may be a way to force the read of this new value without using locks.
Thanks in advance!

There Ain't No Such Thing As A Free Lunch. If your data is being accessed from multiple threads, and it is necessary that updates are immediately visible by those other threads, then you have to protect the shared struct by a mutex, or a readers/writers lock, or some similar mechanism.
Performance is a valid concern when synchronizing code, but it is trumped by correctness. Generally speaking, aim for correctness first and then profile your code. Worrying about performance when you haven't yet nailed down correctness is premature optimization.

Use explicitly atomic instructions. I believe most compilers offer these as intrinsics. Compare and Exchange is another good one.
If you intend to write a lockless algorithm, you need to write it so that your changes only take effect when conditions are as expected.
For example, if you intend to insert a linked list object, use the compare/exchange stuff so that it only inserts if the pointer still points at the same location when you actually do the update.
Or if you are going to decrement a reference count and free the memory at count 0, you will want to pre-free it by making it unavailable somehow, check that the count is still 0 and then really free it. Or something like that.
Using a lock, operate, unlock design is generally a lot easier. The lock-free algorithms are really difficult to get right.

All the other answers here seem to hand wave about the complexities of updating shared variables using mutexes, etc. It is true that you want the update to be atomic.
And you could use various OS primitives to ensure that, and that would be good
programming style.
However, on most modern processors (certainly the x86), writes of small, aligned scalar values is atomic and immediately visible to other processors due to cache coherency.
So in this special case, you don't need all the synchronizing junk; the hardware does the
atomic operation for you. Certainly this is safe with 4 byte values (e.g., "int" in 32 bit C compilers).
So you could just initialize Value1 with an uninteresting value (say 0) before you start the parallel threads, and simply write other values there. If the question is exiting the loop on a fixed value (e.g., if value1 == 5) this will be perfectly safe.
If you insist on capturing the first value written, this won't work. But if you have a parallel set of threads, and any value written other than the uninteresting one will do, this is also fine.

I second peterb's answer to aim for correctness first. Yes, you can use memory barriers here, but they will not do what you want.
You said immediately. However, how immediate this update ever can be, you could (and will) end up with the if() clause being executed, then the flag being set, and than the DoSmthElse() being executed afterwards. This is called a race condition...
You want to synchronize something, it seems, but it is not this flag.

Making the field volatile should make the change "immediately" visible in other threads, but there is no guarantee that the instant at which thread A executes the update doesn't occur after thread B tests the value but before thread B executes the body of the if/else statement.
It sounds like what you really want to do is make that if/else statement atomic, and that will require either a lock, or an algorithm that is tolerant of this sort of situation.

I've heard i++ isn't thread safe, is ++i thread-safe?

I've heard that i++ isn't a thread-safe statement since in assembly it reduces down to storing the original value as a temp somewhere, incrementing it, and then replacing it, which could be interrupted by a context switch.
However, I'm wondering about ++i. As far as I can tell, this would reduce to a single assembly instruction, such as 'add r1, r1, 1' and since it's only one instruction, it'd be uninterruptable by a context switch.
Can anyone clarify? I'm assuming that an x86 platform is being used.

You've heard wrong. It may well be that "i++" is thread-safe for a specific compiler and specific processor architecture but it's not mandated in the standards at all. In fact, since multi-threading isn't part of the ISO C or C++ standards (a), you can't consider anything to be thread-safe based on what you think it will compile down to.
It's quite feasible that ++i could compile to an arbitrary sequence such as:
load r0,[i] ; load memory into reg 0
incr r0 ; increment reg 0
stor [i],r0 ; store reg 0 back to memory
which would not be thread-safe on my (imaginary) CPU that has no memory-increment instructions. Or it may be smart and compile it into:
lock ; disable task switching (interrupts)
load r0,[i] ; load memory into reg 0
incr r0 ; increment reg 0
stor [i],r0 ; store reg 0 back to memory
unlock ; enable task switching (interrupts)
where lock disables and unlock enables interrupts. But, even then, this may not be thread-safe in an architecture that has more than one of these CPUs sharing memory (the lock may only disable interrupts for one CPU).
The language itself (or libraries for it, if it's not built into the language) will provide thread-safe constructs and you should use those rather than depend on your understanding (or possibly misunderstanding) of what machine code will be generated.
Things like Java synchronized and pthread_mutex_lock() (available to C/C++ under some operating systems) are what you need to look into (a).
(a) This question was asked before the C11 and C++11 standards were completed. Those iterations have now introduced threading support into the language specifications, including atomic data types (though they, and threads in general, are optional, at least in C).

You can't make a blanket statement about either ++i or i++. Why? Consider incrementing a 64-bit integer on a 32-bit system. Unless the underlying machine has a quad word "load, increment, store" instruction, incrementing that value is going to require multiple instructions, any of which can be interrupted by a thread context switch.
In addition, ++i isn't always "add one to the value." In a language like C, incrementing a pointer actually adds the size of the thing pointed to. That is, if i is a pointer to a 32-byte structure, ++i adds 32 bytes. Whereas almost all platforms have an "increment value at memory address" instruction that is atomic, not all have an atomic "add arbitrary value to value at memory address" instruction.

They are both thread-unsafe.
A CPU cannot do math directly with memory. It does that indirectly by loading the value from memory and doing the math with CPU registers.
i++
register int a1, a2;
a1 = *(&i) ; // One cpu instruction: LOAD from memory location identified by i;
a2 = a1;
a1 += 1;
*(&i) = a1;
return a2; // 4 cpu instructions
++i
register int a1;
a1 = *(&i) ;
a1 += 1;
*(&i) = a1;
return a1; // 3 cpu instructions
For both cases, there is a race condition that results in the unpredictable i value.
For example, let's assume there are two concurrent ++i threads with each using register a1, b1 respectively. And, with context switching executed like the following:
register int a1, b1;
a1 = *(&i);
a1 += 1;
b1 = *(&i);
b1 += 1;
*(&i) = a1;
*(&i) = b1;
In result, i doesn't become i+2, it becomes i+1, which is incorrect.
To remedy this, moden CPUs provide some kind of LOCK, UNLOCK cpu instructions during the interval a context switching is disabled.
On Win32, use InterlockedIncrement() to do i++ for thread-safety. It's much faster than relying on mutex.

If you are sharing even an int across threads in a multi-core environment, you need proper memory barriers in place. This can mean using interlocked instructions (see InterlockedIncrement in win32 for example), or using a language (or compiler) that makes certain thread-safe guarantees. With CPU level instruction-reordering and caches and other issues, unless you have those guarantees, don't assume anything shared across threads is safe.
Edit: One thing you can assume with most architectures is that if you are dealing with properly aligned single words, you won't end up with a single word containing a combination of two values that were mashed together. If two writes happen over top of each other, one will win, and the other will be discarded. If you are careful, you can take advantage of this, and see that either ++i or i++ are thread-safe in the single writer/multiple reader situation.

If you want an atomic increment in C++ you can use C++0x libraries (the std::atomic datatype) or something like TBB.
There was once a time that the GNU coding guidelines said updating datatypes that fit in one word was "usually safe" but that advice is wrong for SMP machines, wrong for some architectures, and wrong when using an optimizing compiler.
To clarify the "updating one-word datatype" comment:
It is possible for two CPUs on an SMP machine to write to the same memory location in the same cycle, and then try to propagate the change to the other CPUs and the cache. Even if only one word of data is being written so the writes only take one cycle to complete, they also happen simultaneously so you cannot guarantee which write succeeds. You won't get partially updated data, but one write will disappear because there is no other way to handle this case.
Compare-and-swap properly coordinates between multiple CPUs, but there is no reason to believe that every variable assignment of one-word datatypes will use compare-and-swap.
And while an optimizing compiler doesn't affect how a load/store is compiled, it can change when the load/store happens, causing serious trouble if you expect your reads and writes to happen in the same order they appear in the source code (the most famous being double-checked locking does not work in vanilla C++).
NOTE My original answer also said that Intel 64 bit architecture was broken in dealing with 64 bit data. That is not true, so I edited the answer, but my edit claimed PowerPC chips were broken. That is true when reading immediate values (i.e., constants) into registers (see the two sections named "Loading pointers" under listing 2 and listing 4) . But there is an instruction for loading data from memory in one cycle (lmw), so I've removed that part of my answer.

Even if it is reduced to a single assembly instruction, incrementing the value directly in memory, it is still not thread safe.
When incrementing a value in memory, the hardware does a "read-modify-write" operation: it reads the value from the memory, increments it, and writes it back to memory. The x86 hardware has no way of incrementing directly on the memory; the RAM (and the caches) is only able to read and store values, not modify them.
Now suppose you have two separate cores, either on separate sockets or sharing a single socket (with or without a shared cache). The first processor reads the value, and before it can write back the updated value, the second processor reads it. After both processors write the value back, it will have been incremented only once, not twice.
There is a way to avoid this problem; x86 processors (and most multi-core processors you will find) are able to detect this kind of conflict in hardware and sequence it, so that the whole read-modify-write sequence appears atomic. However, since this is very costly, it is only done when requested by the code, on x86 usually via the LOCK prefix. Other architectures can do this in other ways, with similar results; for instance, load-linked/store-conditional and atomic compare-and-swap (recent x86 processors also have this last one).
Note that using volatile does not help here; it only tells the compiler that the variable might have be modified externally and reads to that variable must not be cached in a register or optimized out. It does not make the compiler use atomic primitives.
The best way is to use atomic primitives (if your compiler or libraries have them), or do the increment directly in assembly (using the correct atomic instructions).

On x86/Windows in C/C++, you should not assume it is thread-safe. You should use InterlockedIncrement() and InterlockedDecrement() if you require atomic operations.

If your programming language says nothing about threads, yet runs on a multithreaded platform, how can any language construct be thread-safe?
As others pointed out: you need to protect any multithreaded access to variables by platform specific calls.
There are libraries out there that abstract away the platform specificity, and the upcoming C++ standard has adapted it's memory model to cope with threads (and thus can guarantee thread-safety).

Never assume that an increment will compile down to an atomic operation. Use InterlockedIncrement or whatever similar functions exist on your target platform.
Edit: I just looked up this specific question and increment on X86 is atomic on single processor systems, but not on multiprocessor systems. Using the lock prefix can make it atomic, but it's much more portable just to use InterlockedIncrement.

According to this assembly lesson on x86, you can atomically add a register to a memory location, so potentially your code may atomically execute '++i' ou 'i++'.
But as said in another post, the C ansi does not apply atomicity to '++' opération, so you cannot be sure of what your compiler will generate.

The 1998 C++ standard has nothing to say about threads, although the next standard (due this year or the next) does. Therefore, you can't say anything intelligent about thread-safety of operations without referring to the implementation. It's not just the processor being used, but the combination of the compiler, the OS, and the thread model.
In the absence of documentation to the contrary, I wouldn't assume that any action is thread-safe, particularly with multi-core processors (or multi-processor systems). Nor would I trust tests, as thread synchronization problems are likely to come up only by accident.
Nothing is thread-safe unless you have documentation that says it is for the particular system you're using.

Throw i into thread local storage; it isn't atomic, but it then doesn't matter.

AFAIK, According to the C++ standard, read/writes to an int are atomic.
However, all that this does is get rid of the undefined behavior that's associated with a data race.
But there still will be a data race if both threads try to increment i.
Imagine the following scenario:
Let i = 0 initially:
Thread A reads the value from memory and stores in its own cache.
Thread A increments the value by 1.
Thread B reads the value from memory and stores in its own cache.
Thread B increments the value by 1.
If this is all a single thread you would get i = 2 in memory.
But with both threads, each thread writes its changes and so Thread A writes i = 1 back to memory, and Thread B writes i = 1 to memory.
It's well defined, there's no partial destruction or construction or any sort of tearing of an object, but it's still a data race.
In order to atomically increment i you can use:
std::atomic<int>::fetch_add(1, std::memory_order_relaxed)
Relaxed ordering can be used because we don't care where this operation takes place all we care about is that the increment operation is atomic.

You say "it's only one instruction, it'd be uninterruptible by a context switch." - that's all well and good for a single CPU, but what about a dual core CPU? Then you can really have two threads accessing the same variable at the same time without any context switches.
Without knowing the language, the answer is to test the heck out of it.

I think that if the expression "i++" is the only in a statement, it's equivalent to "++i", the compiler is smart enough to not keep a temporal value, etc. So if you can use them interchangeably (otherwise you won't be asking which one to use), it doesn't matter whichever you use as they're almost the same (except for aesthetics).
Anyway, even if the increment operator is atomic, that doesn't guarantee that the rest of the computation will be consistent if you don't use the correct locks.
If you want to experiment by yourself, write a program where N threads increment concurrently a shared variable M times each... if the value is less than N*M, then some increment was overwritten. Try it with both preincrement and postincrement and tell us ;-)

For a counter, I recommend a using the compare and swap idiom which is both non locking and thread-safe.
Here it is in Java:
public class IntCompareAndSwap {
private int value = 0;
public synchronized int get(){return value;}
public synchronized int compareAndSwap(int p_expectedValue, int p_newValue){
int oldValue = value;
if (oldValue == p_expectedValue)
value = p_newValue;
return oldValue;
}
}
public class IntCASCounter {
public IntCASCounter(){
m_value = new IntCompareAndSwap();
}
private IntCompareAndSwap m_value;
public int getValue(){return m_value.get();}
public void increment(){
int temp;
do {
temp = m_value.get();
} while (temp != m_value.compareAndSwap(temp, temp + 1));
}
public void decrement(){
int temp;
do {
temp = m_value.get();
} while (temp > 0 && temp != m_value.compareAndSwap(temp, temp - 1));
}
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js