omp parallel for: Thread fails to assign value - c++

I have a for loop parallelized with OpenMP. I want the parallel threads to fill an std::vector<bool>, initially all false, with true values. Each thread should write to its own entry of the vector. But sometimes one assignment fails. How can this happen?
int size = 10;
std::vector<bool> vec(10); // all entries contain false
#pragma omp parallel for
for (int i = 0; i < size; ++i)
vec[i] = true; // sometimes this assignment fails for a thread
The vector may end up looking like this:
true true true true false true true true true true

The problem is due to how std::vector<bool> is stored. It is designed for space efficiency, and stores elements in individual bits, not bytes. This means that assignments to vec[0], vec[1], ... vec[7] all write to the same byte in memory. Since you have multiple threads writing to the same address (it will actually be a read-modify-write sequence), there is a race condition. This can cause a write by one thread to be "undone" by a write of a later thread.
In addition, the potential performance benefit in a memory-intensive loop like this is not large due to the bandwidth limits of main memory. Combined with the cache invalidation required by each thread, the threaded performance will likely be inferior to single threaded operation.
With the fundamental operation being done here (setting memory to true), just code it up as a standard non-OMP single-threaded loop. The compiler will be able to optimize it to something reasonably performant.
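For instance, a minimal single-threaded sketch (reusing the size variable from the question; not the asker's code):
#include <algorithm>
#include <vector>

int main()
{
    int size = 10;
    std::vector<bool> vec(size);              // all entries false
    std::fill(vec.begin(), vec.end(), true);  // or simply: std::vector<bool> vec(size, true);
}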

As per the standard, all standard containers are required to avoid data races when the contents of the contained object in different elements in the same container are modified concurrently.
Except std::vector<bool>.
1201ProgramAlarm explains why.
std::vector<bool> is infamous for many issues - the most straight-forward solution is to use std::vector<char> instead. If you do that, your program is just fine, although you should not expect any performance boost from using OpenMP in that particular case.
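For example, a minimal sketch of the std::vector<char> version (compile with OpenMP enabled, e.g. -fopenmp):
#include <vector>

int main()
{
    int size = 10;
    std::vector<char> vec(size, 0);   // 0 plays the role of false; one byte per element
#pragma omp parallel for
    for (int i = 0; i < size; ++i)
        vec[i] = 1;                   // each thread now writes to its own byte
}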

My OpenMP is rusty, but what comes to mind is #pragma omp flush [(list)].
The flush directive in OpenMP may be used to identify a synchronization point, which is defined as a point in the execution of the program where the executing thread needs to have a consistent view of memory. A consistent view of memory has two requirements: all memory read/write operations before the flush directive must be performed before it, and all those after it must be performed after it. This is often called a memory fence. (Page 163, Parallel Programming in OpenMP by Chandra, Dagum, Kohr, Maydan, McDonald, Menon; 2001.)
There is a much more detailed description of this in the book, as well as mention of variables not being retained in a buffer/register across any synchronization point, which your posted code does not account for.
With very little code posted, and without knowing how or when you are checking vec[i], this would be my first guess... followed by checking the versions of the Linux operating system, C compiler, and OpenMP involved...
But I don't expect an OS upgrade, a compiler upgrade, or an OpenMP version upgrade to correct what you are experiencing; it's likely your code.
It's much like what happens in standard C when you do printf("hello") without a following fflush(stdout), as opposed to printf("hello\n"); the \n causes an inherent flush on most systems, which is taken for granted.

Related

how can I get good speedup for a parallel write to memory?

I'm new to OpenMP and trying to get some very basic loops in my code parallelized with OpenMP, with good speedup on multiple cores. Here's a function in my program:
bool Individual::_SetFitnessScaling_1(double source_value, EidosObject **p_values, size_t p_values_size)
{
    if ((source_value < 0.0) || (std::isnan(source_value)))
        return true;

#pragma omp parallel for schedule(static) default(none) shared(p_values_size) firstprivate(p_values, source_value) if(p_values_size >= EIDOS_OMPMIN_SET_FITNESS_S1)
    for (size_t value_index = 0; value_index < p_values_size; ++value_index)
        ((Individual *)(p_values[value_index]))->fitness_scaling_ = source_value;

    return false;
}
So the goal is to set the fitnessScaling ivar of every object pointed to by pointers in the buffer that p_values points to, to the same double value source_value. Those various objects might be more or less anywhere in memory, so each write probably hits a different cache line; that's an aspect of the code that would be difficult to change, but I'm hoping that by spreading it across multiple cores I can at least divide that pain by a good speedup factor. The cast to (Individual *) is safe, by the way; checks were already done external to this function that guarantee its safety.
You can see my first attempt at parallelizing this, using the default static schedule (so each thread gets its own contiguous block in p_values), making the loop limit shared, and making p_values and source_value be firstprivate so each thread gets its own private copy of those variables, initialized to the original value. The threshold for parallelization, EIDOS_OMPMIN_SET_FITNESS_S1, is set to 900. I test this with a script that passes in a million values, with between 1 and 8 cores (and a max thread count to match), so the loop should certainly run in parallel. I have followed these same practices in some other places in the code and have seen a good speedup. [EDIT: I should say that the speedup I observe for this, for 2/4/6/8 cores/threads, is always about 1.1x-1.2x the single-threaded performance, so there's a very small win but it is realized already with 2 cores and does not get any better with 8 cores.] The notable difference with this code is that this loop spends its time writing to memory; the other loops I have successfully parallelized spend their time doing things like reading values from a buffer and summing across them, so they might be limited by memory read speeds, but not by memory write speeds.
It occurred to me that with all of this writing through a pointer, my loop might be thrashing due to things like aliasing (making the compiler force a flush of the cache after each write), or some such. I attempted to solve that kind of issue as follows, using const and __restrict:
bool Individual::_SetFitnessScaling_1(double source_value, EidosObject **p_values, size_t p_values_size)
{
    if ((source_value < 0.0) || (std::isnan(source_value)))
        return true;

#pragma omp parallel default(none) shared(p_values_size) firstprivate(p_values, source_value) if(p_values_size >= EIDOS_OMPMIN_SET_FITNESS_S1)
    {
        EidosObject * const * __restrict local_values = p_values;

#pragma omp for schedule(static)
        for (size_t value_index = 0; value_index < p_values_size; ++value_index)
            ((Individual *)(local_values[value_index]))->fitness_scaling_ = source_value;
    }

    return false;
}
This made no difference to the performance, however. I still suspect that some kind of memory contention, cache thrash, or aliasing issue is preventing the code from parallelizing effectively, but I don't know how to solve it. Or maybe I'm barking up the wrong tree?
These tests are done with Xcode 13 (i.e., using Apple clang 13.0.0) on macOS, on an M1 Mac mini (2020).
[EDIT: In reply to comments below, a few points.
(1) There is nothing fancy going on inside the class here, no operator= or similar; the assignment of source_value into fitness_scaling_ is, in effect, simply the assignment of a double into a field in a struct.
(2) The use of firstprivate(p_values, source_value) is to ensure that repeated reading from those values across threads doesn't introduce some kind of between-thread contention that slows things down. It is recommended in Mattson, He, & Koniges' book "The OpenMP Common Core"; see section 6.3.2, figure 6.10 with the corrected Mandelbrot code using firstprivate, and the quote on p. 111: "An easy solution is to change the storage attribute for eps to firstprivate. This gives each thread its own copy of the variable but with a specified value. Notice that eps is read-only. It is not updated inside the parallel region. Therefore, another solution is to let it be shared (shared(eps)) or not specify eps in a data environment clause and let its default, shared behavior be used. While this would result in correct code, it would potentially increase overhead. If eps is shared, every thread will be reading the same address in memory... Some compilers will optimize for such read-only variables by putting them into registers, but we should not rely on that behavior." I have observed this change speeding up parallelized loops in other contexts, so I have adopted it as my standard practice in such cases; if I have misunderstood, please do let me know.
(3) No, keeping the fitness_scaling_ values in their own buffer is not a workable solution for several reasons. Most importantly, this method may be called with any arbitrary buffer of pointers to Individual; it is not necessarily setting the fitness_scaling_ of all Individual objects, just an effectively random subset of them, so this operation will never be reducible to a simple memset(). Also, I am going to need to similarly optimize the setting of many other properties on Individual and on other classes in my code, so a general solution is needed; I can't very well put all of the ivars of all of my classes into separately allocated buffers external to the objects themselves. And third, Individual objects are being dynamically allocated and deallocated independently of each other, so an external buffer of fitness_scaling_ values for the objects would have big implementation problems.]
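For reference, a minimal sketch of the firstprivate pattern described in the quoted passage (the function, loop, and variable names here are hypothetical illustrations, not the book's Mandelbrot code):
#include <vector>

// eps is read-only inside the parallel region; firstprivate gives every
// thread its own initialized copy rather than having all threads repeatedly
// read one shared address.
std::vector<double> scale_by_eps(const std::vector<double> &in)
{
    double eps = 1.0e-7;
    std::vector<double> out(in.size());
#pragma omp parallel for firstprivate(eps)
    for (long i = 0; i < (long)in.size(); ++i)
        out[i] = in[i] * eps;   // each thread reads its private copy of eps
    return out;
}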

memory model: how do load-acquire semantics actually work?

From a very nice paper and article about memory reordering.
Q1: I understand that cache coherence, the store buffer, and the invalidation queue are the root causes of memory reordering?
Store-release is quite understandable: all loads and stores have to complete before the flag is set to true.
About load-acquire: the typical use of an atomic load is waiting for a flag. Suppose we have 2 threads:
int x = 0;
std::atomic<bool> ready_flag = false;

// thread-1
if (ready_flag.load(std::memory_order_relaxed))
{
    // (1)
    // load x here
}
// (2)
// load x here

// thread-2
x = 100;
ready_flag.store(true, std::memory_order_release);
EDIT: in thread-1 it should be a while loop, but I copied the logic from the article above. So, assume the memory reordering occurs at just the wrong time.
Q2: Because (1) and (2) depend on the if condition, the CPU has to wait for ready_flag; does that mean write-release is enough? How can memory reordering happen in this context?
Q3: Obviously we have load-acquire, so I guess memory reordering is possible; then where should we place the fence, at (1) or (2)?
Accessing an atomic variable is not a mutex operation; it merely accesses the stored value atomically, with no chance for any CPU operation to interrupt the access such that no data races can occur with regard to accessing that value (it can also issue barriers with regard to other accesses, which is what the memory orders provide). But that's it; it doesn't wait for any particular value to appear in the atomic variable.
As such, your if statement will read whatever value happens to be there at the time. If you want to guard access to x until the other statement has written to it and signaled the atomic, you must:
Not allow any code to read from x until the atomic flag has returned the value true. Simply testing the value once won't do that; you must loop over repeated accesses until it is true. Any other attempt to read from x results in a data race and is therefore undefined behavior.
Whenever you access the flag, you must do so in a way that tells the system that values written by the thread setting that flag should be visible to subsequent operations that see the set value. That requires a proper memory order, one which must be at least memory_order_acquire.
To be technical, the read from the flag itself doesn't have to do the acquire. You could perform an acquire operation after having read the proper value from the flag. But you need to have an acquire-equivalent operation happen before reading x.
The writing statement must set the flag using a releasing memory order that must be at least as powerful as memory_order_release.
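Putting those requirements together, a minimal sketch (my own illustration of the rules above, not code from the question) could look like:
#include <atomic>

int x = 0;
std::atomic<bool> ready_flag{false};

void producer()   // thread-2
{
    x = 100;
    ready_flag.store(true, std::memory_order_release);   // publish x
}

void consumer()   // thread-1
{
    while (!ready_flag.load(std::memory_order_acquire))
        ;                                                 // spin until published
    int value = x;   // safe: the acquire load synchronizes with the release store
    (void)value;
}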
Because (1) and (2) depend on the if condition, the CPU has to wait for ready_flag
There are 2 showstopper flaws in that reasoning:
Branch prediction + speculative execution is a real thing in real CPUs. Control dependencies behave differently from data dependencies. Speculative execution breaks control dependencies.
In most (but not all) real CPUs, data dependencies do work like C++ memory_order_consume. A typical use-case is loading a pointer and then dereferencing it. That's still not safe in C++'s very weak memory model, but will happen to compile to asm that works for most ISAs other than DEC Alpha. Alpha can (in practice on some hardware) even manage to violate causality and load a stale value when dereferencing a just-loaded pointer, even if the stores were correctly ordered.
Compilers can break control and even data dependencies. C++ source logic doesn't always translate directly to asm. In this case a compiler could emit asm that works like this:
tmp = load(x);          // compile-time reordering before the relaxed load
if (load(ready_flag))
    actually use tmp;
It's data-race UB in C++ to read x while it might still be being written, but for most specific ISAs there's no problem with that. You just have to avoid actually using any load results that might be bogus.
This might not be a useful optimization for most ISAs but nothing rules it out. Hiding load latency on in-order pipelines by doing the load earlier might actually be useful sometimes, (if it wasn't being written by another thread, and the compiler might guess that wasn't happening because there's no acquire load).
By far your best bet is to use ready_flag.load(mo_acquire).
A separate problem is that you have commented out code that reads x after the if(), which will run even if the load didn't see the data ready. As #Nicol explained in an answer, this means data-race UB is possible because you might be reading x while the producer is writing it.
Perhaps you wanted to write a spin-wait loop like while(!ready_flag){ _mm_pause(); }? Generally be careful of wasting huge amounts of CPU time spinning; if it might be a long time, use a library-supported thing like maybe a condition variable that gives you efficient fallback to OS-supported sleep/wakeup (e.g. Linux futex) after spinning for a short time.
If you did want a manual barrier separate from the load, it would be
if (ready_flag.load(mo_relaxed)) {
    atomic_thread_fence(mo_acquire);
    int tmp = x;   // now this is safe
}
// atomic_thread_fence(mo_acquire);  // still wouldn't make it safe to read x
// because this code runs even after ready_flag == false
Using if(ready_flag.load(mo_acquire)) would lead to an unconditional fence before branching on the ready_flag load, when compiling for any ISA where acquire-load wasn't available with a single instruction. (On x86 all loads are acquire, on AArch64 ldar does an acquire load. ARM needs load + dsb ish)
The C++ standard doesn't specify the code generated by any particular construct; only correct combinations of thread communication tools produce a guaranteed result.
You don't get guarantees from the CPU in C++ because C++ is not a kind of (macro) assembly, not even a "high level assembly", at least not when not all objects have a volatile type.
Atomic objects are communication tools to exchange data between threads. The correct use, for correct visibility of memory operations, is either a store operation with (at least) release followed by a load with acquire, the same with an RMW in between, either the store (resp. the load) replaced by an RMW with (at least) a release (resp. acquire), or any variant with a relaxed operation and a separate fence.
In all cases:
the thread "publishing" the "done" flag must use a memory ordering at least release (that is: release, release+acquire or sequential consistency),
and the "subscribing" thread, the one acting on the flag must use at least acquire (that is: acquire, release+acquire or sequential consistency).
In practice with separately compiled code other modes might work, depending on the CPU.
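As a concrete illustration of the "relaxed operation and a separate fence" variant mentioned above (a sketch, not the only correct combination):
#include <atomic>

int data = 0;
std::atomic<bool> done{false};

void publisher()
{
    data = 42;
    std::atomic_thread_fence(std::memory_order_release);  // release fence ...
    done.store(true, std::memory_order_relaxed);          // ... before the relaxed store
}

void subscriber()
{
    while (!done.load(std::memory_order_relaxed))
        ;                                                  // relaxed load ...
    std::atomic_thread_fence(std::memory_order_acquire);  // ... then an acquire fence
    int value = data;                                      // now safe to read
    (void)value;
}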

In C11/C++11, possible to mix atomic/non-atomic ops on the same memory?

Is it possible to perform atomic and non-atomic ops on the same memory location?
I ask not because I actually want to do this, but because I'm trying to understand the C11/C++11 memory model. They define a "data race" like so:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
-- C11 §5.1.2.4 p25, C++11 § 1.10 p21
Its the "at least one of which is not atomic" part that is troubling me. If it weren't possible to mix atomic and non-atomic ops, it would just say "on an object which is not atomic."
I can't see any straightforward way of performing non-atomic operations on atomic variables. std::atomic<T> in C++ doesn't define any operations with non-atomic semantics. In C, all direct reads/writes of an atomic variable appear to be translated into atomic operations.
I suppose memcpy() and other direct memory operations might be a way of performing a non-atomic read/write on an atomic variable? ie. memcpy(&atomicvar, othermem, sizeof(atomicvar))? But is this even defined behavior? In C++, std::atomic is not copyable, so would it be defined behavior to memcpy() it in C or C++?
Initialization of an atomic variable (whether through a constructor or atomic_init()) is defined to not be atomic. But this is a one-time operation: you're not allowed to initialize an atomic variable a second time. Placement new or an explicit destructor call would also not be atomic. But in all of these cases, it doesn't seem like it would be defined behavior anyway to have a concurrent atomic operation that might be operating on an uninitialized value.
Performing atomic operations on non-atomic variables seems totally impossible: neither C nor C++ define any atomic functions that can operate on non-atomic variables.
So what is the story here? Is it really about memcpy(), or initialization/destruction, or something else?
I think you're overlooking another case, the reverse order. Consider an initialized int whose storage is reused to create an std::atomic_int. All atomic operations happen after its ctor finishes, and therefore on initialized memory. But any concurrent, non-atomic access to the now-overwritten int has to be barred as well.
(I'm assuming here that the storage lifetime is sufficient and plays no role)
I'm not entirely sure because I think that the second access to int would be invalid anyway as the type of the accessing expression int doesn't match the object's type at the time (std::atomic<int>). However, "the object's type at the time" assumes a single linear time progression which doesn't hold in a multi-threaded environment. C++11 in general has that solved by making such assumptions about "the global state" Undefined Behavior per se, and the rule from the question appears to fit in that framework.
So perhaps rephrasing: if a single memory location contains an atomic object as well as a non-atomic object, and if the destruction of the earliest created (older) object is not sequenced-before the creation of the other (newer) object, then access to the older object conflicts with access to the newer object unless the former is scheduled-before the latter.
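A minimal sketch of the storage-reuse scenario described here (my own single-threaded illustration; the conflicting access would come from another thread):
#include <atomic>
#include <new>

alignas(std::atomic<int>) unsigned char storage[sizeof(std::atomic<int>)];

int main()
{
    int *older = new (storage) int(42);               // older, non-atomic object
    // ... non-atomic reads/writes of *older happen here ...
    auto *newer = new (storage) std::atomic<int>(0);  // reusing the storage ends
                                                      // the int's lifetime
    newer->fetch_add(1, std::memory_order_relaxed);   // atomic ops on the new object
    // Any further access through 'older' (especially from another thread not
    // synchronized with the placement new) is the conflicting access the
    // answer describes.
    (void)older;
}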
disclaimer: I am not a parallelism guru.
Is it possible to mix atomic/non-atomic ops on the same memory, and if
so, how?
You can write it in the code and compile it, but it will probably yield undefined behaviour.
When talking about atomics, it is important to understand what kind of problems they solve.
As you might know, what we call, in short, "memory" is a multi-layered set of entities capable of holding data.
First we have the RAM, then the cache lines, then the registers.
On single-core processors we don't have any synchronization problem. On multi-core processors we have all of them: every core has its own set of registers and cache lines.
This causes a few problems.
The first of them is memory reordering: the CPU may decide at runtime to scramble some reading/writing instructions to make the code run faster. This may yield some strange results that are completely invisible in the high-level code that produced this set of instructions. The most classic example of this phenomenon is the "two threads, two integers" example:
int i=0;
int j=0;
thread a -> i=1, then print j
thread b -> j=1 then print i;
logically, the result "00" cannot be. either a ends first, the result may be "01", either b ends first, the result may be "10". if both of them ends in the same time, the result may be "11". yet, if you build small program which imitates this situtation and run it in a loop, very quicly you will see the result "00"
another problem is memory invisibility. like I mentioned before, the variable's value may be cached in one of the cache lines, or be stored in one of the registered. when the CPU updates a variables value - it may delay the writing of the new value back to the RAM. it may keep the value in the cache/regiter because it was told (by the compiler optimizations) that that value will be updated again soon, so in order to make the program faster - update the value again and only then write it back to the RAM. it may cause undefined behaviour if other CPU (and consequently a thread or a process) depends on the new value.
for example, look at this psuedo code:
bool b = true;
while (b) -> print 'a'
new thread -> sleep 4 seconds -> b=false;
The character 'a' may be printed infinitely, because b may be cached and never updated.
There are many more problems when dealing with parallelism.
Atomics solve these kinds of issues by (in a nutshell) telling the compiler/CPU how to read and write data to/from RAM correctly without doing unwanted scrambling (read about memory orders). A memory order may force the CPU to write its values back to RAM, or to read the values from RAM even if they are cached.
So, although you can mix non-atomic actions with atomic ones, you are only doing part of the job.
For example, let's go back to the second example:
atomic bool b = true;
while (reload b) print 'a'
new thread - > b = (non atomicly) false.
So although one thread re-reads the value of b from RAM again and again, the other thread may never write false back to RAM.
So although you can mix these kinds of operations in the code, it will yield undefined behaviour.
I'm interested in this topic since I have code in which sometimes I need to access a range of addresses serially, and at other times to access the same addresses in parallel with some way of managing contention.
So not exactly the situation posed by the original question which (I think) implies concurrent, or nearly so, atomic and non atomic operations in parallel code, but close.
I have managed by some devious casting to persuade my C11 compiler to allow me to access an integer and much more usefully a pointer both atomically and non-atomically ("directly"), having established that both types are officially lock-free on my x86_64 system. My, possibly simplistic, interpretation of that is that the sizes of the atomic and non atomic types are the same and that the hardware can update such types in a single operation.
I definitely would not attempt to mix both types of access to an address in a parallel context; I think that would be doomed to fail. However, I have been successful in using "direct" syntax operations in serial code and "atomic" syntax in parallel code, giving me the best of both worlds: the fastest possible access (and much simpler syntax) in serial, and safely managed contention when in parallel.
So you can do it so long as you don't try to mix both methods in parallel code and you stick to using lock-free types, which probably means up to the size of a pointer.

Parallel tasks get better performance with boost::thread than with ppl or OpenMP

I have a C++ program which could be parallelized. I'm using Visual Studio 2010, 32bit compilation.
In short the structure of the program is the following
#define num_iterations 64 // some number

struct result
{
    // some stuff
};

result best_result = initial_bad_result;

for (i = 0; i < many_times; i++)
{
    result *results[num_iterations];
    for (j = 0; j < num_iterations; j++)
    {
        some_computations(results + j);
    }
    // update best_result;
}
Since each some_computations() call is independent (some global variables are read, but no global variables are modified), I parallelized the inner for-loop.
My first attempt was with boost::thread,
thread_group group;
for(j=0; j<num_iterations; j++)
{
group.create_thread(boost::bind(&some_computation, this, result+j));
}
group.join_all();
The results were good, but I decided to try more.
I tried the OpenMP library
#pragma omp parallel for
for(j=0; j<num_iterations; j++)
{
some_computations(results+j);
}
The results were worse than the boost::thread's ones.
Then I tried the ppl library and used parallel_for():
Concurrency::parallel_for(0, num_iterations, [=](int j) {
    some_computations(results + j);
});
The results were the worst.
I found this behaviour quite surprising. Since OpenMP and PPL are designed for parallelization, I would have expected better results than with boost::thread. Am I wrong?
Why is boost::thread giving me better results?
OpenMP and PPL do no such thing as being pessimistic. They just do as they are told; however, there are some things you should take into consideration when you try to parallelize loops.
Without seeing how you implemented these things, it's hard to say what the real cause may be.
Also, if the operations in each iteration have some dependency on any other iteration in the same loop, then this will create contention, which will slow things down. You haven't shown what your some_operation function actually does, so it's hard to tell whether there are data dependencies.
A loop that can be truly parallelized has to be able to have each iteration run totally independent of all other iterations, with no shared memory being accessed in any of the iterations. So preferably, you'd write stuff to local variables and then copy at the end.
Not all loops can be parallelized, it is very dependent on the type of work being done.
For example, something that is good for parallelizing is work being done on each pixel of a screen buffer. Each pixel is totally independent from all other pixels, and therefore, a thread can take one iteration of a loop and do the work without needing to be held up waiting for shared memory or data dependencies within the loop between iterations.
Also, if you have a contiguous array, this array may be partly in a cache line, and if you are editing element 5 in thread A and then changing element 6 in thread B, you may get cache contention, which will also slow things down, as these would be residing in the same cache line; this is a phenomenon known as false sharing.
There are many aspects to think about when doing loop parallelization.
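To illustrate the false-sharing point, here is a minimal sketch (hypothetical per-thread counters, not the asker's code) of keeping each thread's data on its own cache line:
struct PaddedCounter
{
    alignas(64) long value;   // 64 bytes is a common cache-line size (an assumption)
};

PaddedCounter counters[8];    // one per thread; each element lands on its own cache line

void work(int thread_id, long iterations)
{
    for (long i = 0; i < iterations; ++i)
        ++counters[thread_id].value;   // no invalidation of neighbouring threads' lines
}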
In short, OpenMP is mainly based on shared memory, with the additional cost of task management and memory management. PPL is designed to handle generic patterns of common data structures and algorithms, which brings additional complexity cost. Both of them have extra CPU cost, but your plain boost threads do not (boost threads are just a simple API wrapper). That's why both of them are slower than your boost version. And, since the example computations are independent of each other, with no synchronization, OpenMP should be close to the boost version.
This holds in simple scenarios, but for complicated scenarios, with complicated data layouts and algorithms, it will be context dependent.

I've heard i++ isn't thread safe, is ++i thread-safe?

I've heard that i++ isn't a thread-safe statement since in assembly it reduces down to storing the original value as a temp somewhere, incrementing it, and then replacing it, which could be interrupted by a context switch.
However, I'm wondering about ++i. As far as I can tell, this would reduce to a single assembly instruction, such as 'add r1, r1, 1' and since it's only one instruction, it'd be uninterruptable by a context switch.
Can anyone clarify? I'm assuming that an x86 platform is being used.
You've heard wrong. It may well be that "i++" is thread-safe for a specific compiler and specific processor architecture but it's not mandated in the standards at all. In fact, since multi-threading isn't part of the ISO C or C++ standards (a), you can't consider anything to be thread-safe based on what you think it will compile down to.
It's quite feasible that ++i could compile to an arbitrary sequence such as:
load r0,[i] ; load memory into reg 0
incr r0 ; increment reg 0
stor [i],r0 ; store reg 0 back to memory
which would not be thread-safe on my (imaginary) CPU that has no memory-increment instructions. Or it may be smart and compile it into:
lock ; disable task switching (interrupts)
load r0,[i] ; load memory into reg 0
incr r0 ; increment reg 0
stor [i],r0 ; store reg 0 back to memory
unlock ; enable task switching (interrupts)
where lock disables and unlock enables interrupts. But, even then, this may not be thread-safe in an architecture that has more than one of these CPUs sharing memory (the lock may only disable interrupts for one CPU).
The language itself (or libraries for it, if it's not built into the language) will provide thread-safe constructs and you should use those rather than depend on your understanding (or possibly misunderstanding) of what machine code will be generated.
Things like Java synchronized and pthread_mutex_lock() (available to C/C++ under some operating systems) are what you need to look into (a).
(a) This question was asked before the C11 and C++11 standards were completed. Those iterations have now introduced threading support into the language specifications, including atomic data types (though they, and threads in general, are optional, at least in C).
You can't make a blanket statement about either ++i or i++. Why? Consider incrementing a 64-bit integer on a 32-bit system. Unless the underlying machine has a quad word "load, increment, store" instruction, incrementing that value is going to require multiple instructions, any of which can be interrupted by a thread context switch.
In addition, ++i isn't always "add one to the value." In a language like C, incrementing a pointer actually adds the size of the thing pointed to. That is, if i is a pointer to a 32-byte structure, ++i adds 32 bytes. Whereas almost all platforms have an "increment value at memory address" instruction that is atomic, not all have an atomic "add arbitrary value to value at memory address" instruction.
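For example, a tiny illustration of the pointer case:
struct S { char bytes[32]; };   // a 32-byte structure

void demo()
{
    S arr[2];
    S *i = arr;
    ++i;   // advances the address by sizeof(S) == 32 bytes, not by 1
}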
They are both thread-unsafe.
A CPU cannot do math directly with memory. It does that indirectly by loading the value from memory and doing the math with CPU registers.
i++
register int a1, a2;
a1 = *(&i) ; // One cpu instruction: LOAD from memory location identified by i;
a2 = a1;
a1 += 1;
*(&i) = a1;
return a2; // 4 cpu instructions
++i
register int a1;
a1 = *(&i) ;
a1 += 1;
*(&i) = a1;
return a1; // 3 cpu instructions
For both cases, there is a race condition that results in the unpredictable i value.
For example, let's assume there are two concurrent ++i threads with each using register a1, b1 respectively. And, with context switching executed like the following:
register int a1, b1;
a1 = *(&i);
a1 += 1;
b1 = *(&i);
b1 += 1;
*(&i) = a1;
*(&i) = b1;
As a result, i doesn't become i+2; it becomes i+1, which is incorrect.
To remedy this, modern CPUs provide some kind of LOCK/UNLOCK CPU instructions, during whose interval context switching is disabled.
On Win32, use InterlockedIncrement() to do i++ for thread-safety. It's much faster than relying on mutex.
If you are sharing even an int across threads in a multi-core environment, you need proper memory barriers in place. This can mean using interlocked instructions (see InterlockedIncrement in win32 for example), or using a language (or compiler) that makes certain thread-safe guarantees. With CPU level instruction-reordering and caches and other issues, unless you have those guarantees, don't assume anything shared across threads is safe.
Edit: One thing you can assume with most architectures is that if you are dealing with properly aligned single words, you won't end up with a single word containing a combination of two values that were mashed together. If two writes happen over top of each other, one will win, and the other will be discarded. If you are careful, you can take advantage of this, and see that either ++i or i++ are thread-safe in the single writer/multiple reader situation.
If you want an atomic increment in C++ you can use C++0x libraries (the std::atomic datatype) or something like TBB.
There was once a time that the GNU coding guidelines said updating datatypes that fit in one word was "usually safe" but that advice is wrong for SMP machines, wrong for some architectures, and wrong when using an optimizing compiler.
To clarify the "updating one-word datatype" comment:
It is possible for two CPUs on an SMP machine to write to the same memory location in the same cycle, and then try to propagate the change to the other CPUs and the cache. Even if only one word of data is being written so the writes only take one cycle to complete, they also happen simultaneously so you cannot guarantee which write succeeds. You won't get partially updated data, but one write will disappear because there is no other way to handle this case.
Compare-and-swap properly coordinates between multiple CPUs, but there is no reason to believe that every variable assignment of one-word datatypes will use compare-and-swap.
And while an optimizing compiler doesn't affect how a load/store is compiled, it can change when the load/store happens, causing serious trouble if you expect your reads and writes to happen in the same order they appear in the source code (the most famous being double-checked locking does not work in vanilla C++).
NOTE My original answer also said that Intel 64 bit architecture was broken in dealing with 64 bit data. That is not true, so I edited the answer, but my edit claimed PowerPC chips were broken. That is true when reading immediate values (i.e., constants) into registers (see the two sections named "Loading pointers" under listing 2 and listing 4) . But there is an instruction for loading data from memory in one cycle (lmw), so I've removed that part of my answer.
Even if it is reduced to a single assembly instruction, incrementing the value directly in memory, it is still not thread safe.
When incrementing a value in memory, the hardware does a "read-modify-write" operation: it reads the value from the memory, increments it, and writes it back to memory. The x86 hardware has no way of incrementing directly on the memory; the RAM (and the caches) is only able to read and store values, not modify them.
Now suppose you have two separate cores, either on separate sockets or sharing a single socket (with or without a shared cache). The first processor reads the value, and before it can write back the updated value, the second processor reads it. After both processors write the value back, it will have been incremented only once, not twice.
There is a way to avoid this problem; x86 processors (and most multi-core processors you will find) are able to detect this kind of conflict in hardware and sequence it, so that the whole read-modify-write sequence appears atomic. However, since this is very costly, it is only done when requested by the code, on x86 usually via the LOCK prefix. Other architectures can do this in other ways, with similar results; for instance, load-linked/store-conditional and atomic compare-and-swap (recent x86 processors also have this last one).
Note that using volatile does not help here; it only tells the compiler that the variable might be modified externally and that reads of that variable must not be cached in a register or optimized out. It does not make the compiler use atomic primitives.
The best way is to use atomic primitives (if your compiler or libraries have them), or do the increment directly in assembly (using the correct atomic instructions).
On x86/Windows in C/C++, you should not assume it is thread-safe. You should use InterlockedIncrement() and InterlockedDecrement() if you require atomic operations.
If your programming language says nothing about threads, yet runs on a multithreaded platform, how can any language construct be thread-safe?
As others pointed out: you need to protect any multithreaded access to variables by platform specific calls.
There are libraries out there that abstract away the platform specificity, and the upcoming C++ standard has adapted its memory model to cope with threads (and thus can guarantee thread-safety).
Never assume that an increment will compile down to an atomic operation. Use InterlockedIncrement or whatever similar functions exist on your target platform.
Edit: I just looked up this specific question and increment on X86 is atomic on single processor systems, but not on multiprocessor systems. Using the lock prefix can make it atomic, but it's much more portable just to use InterlockedIncrement.
According to this assembly lesson on x86, you can atomically add a register to a memory location, so potentially your code may atomically execute '++i' or 'i++'.
But as said in another post, ANSI C does not mandate atomicity for the '++' operation, so you cannot be sure of what your compiler will generate.
The 1998 C++ standard has nothing to say about threads, although the next standard (due this year or the next) does. Therefore, you can't say anything intelligent about thread-safety of operations without referring to the implementation. It's not just the processor being used, but the combination of the compiler, the OS, and the thread model.
In the absence of documentation to the contrary, I wouldn't assume that any action is thread-safe, particularly with multi-core processors (or multi-processor systems). Nor would I trust tests, as thread synchronization problems are likely to come up only by accident.
Nothing is thread-safe unless you have documentation that says it is for the particular system you're using.
Throw i into thread local storage; it isn't atomic, but it then doesn't matter.
AFAIK, according to the C++ standard, reads/writes to an int are atomic.
However, all that this does is get rid of the undefined behavior that's associated with a data race.
But there still will be a data race if both threads try to increment i.
Imagine the following scenario:
Let i = 0 initially:
Thread A reads the value from memory and stores in its own cache.
Thread A increments the value by 1.
Thread B reads the value from memory and stores in its own cache.
Thread B increments the value by 1.
If this is all a single thread you would get i = 2 in memory.
But with both threads, each thread writes its changes and so Thread A writes i = 1 back to memory, and Thread B writes i = 1 to memory.
It's well defined, there's no partial destruction or construction or any sort of tearing of an object, but it's still a data race.
In order to atomically increment i you can use:
std::atomic<int>::fetch_add(1, std::memory_order_relaxed)
Relaxed ordering can be used because we don't care where this operation takes place; all we care about is that the increment operation is atomic.
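For completeness, a minimal sketch of that call in context (my own example, not from the answer):
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> i{0};

int main()
{
    std::thread a([] { i.fetch_add(1, std::memory_order_relaxed); });
    std::thread b([] { i.fetch_add(1, std::memory_order_relaxed); });
    a.join();
    b.join();
    std::cout << i.load() << '\n';   // always 2; neither increment can be lost
}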
You say "it's only one instruction, it'd be uninterruptible by a context switch." - that's all well and good for a single CPU, but what about a dual core CPU? Then you can really have two threads accessing the same variable at the same time without any context switches.
Without knowing the language, the answer is to test the heck out of it.
I think that if the expression "i++" is the only thing in a statement, it's equivalent to "++i"; the compiler is smart enough not to keep a temporary value, etc. So if you can use them interchangeably (otherwise you wouldn't be asking which one to use), it doesn't matter which you use, as they're almost the same (except for aesthetics).
Anyway, even if the increment operator is atomic, that doesn't guarantee that the rest of the computation will be consistent if you don't use the correct locks.
If you want to experiment by yourself, write a program where N threads concurrently increment a shared variable M times each... if the value is less than N*M, then some increment was overwritten. Try it with both preincrement and postincrement and tell us ;-)
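A sketch of that experiment (N and M are arbitrary choices; the non-atomic increment is formally a data race, i.e. undefined behaviour, which is exactly what makes the lost updates visible):
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    const int N = 4;        // threads
    const int M = 1000000;  // increments per thread
    int shared = 0;         // deliberately NOT atomic

    std::vector<std::thread> threads;
    for (int t = 0; t < N; ++t)
        threads.emplace_back([&] {
            for (int j = 0; j < M; ++j)
                ++shared;   // swap for shared++ to compare pre/post increment
        });
    for (auto &th : threads)
        th.join();

    std::cout << shared << " (expected " << N * M << ")\n";
}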
For a counter, I recommend using the compare-and-swap idiom, which is both non-locking and thread-safe.
Here it is in Java:
public class IntCompareAndSwap {
    private int value = 0;

    public synchronized int get() { return value; }

    public synchronized int compareAndSwap(int p_expectedValue, int p_newValue) {
        int oldValue = value;
        if (oldValue == p_expectedValue)
            value = p_newValue;
        return oldValue;
    }
}

public class IntCASCounter {
    public IntCASCounter() {
        m_value = new IntCompareAndSwap();
    }

    private IntCompareAndSwap m_value;

    public int getValue() { return m_value.get(); }

    public void increment() {
        int temp;
        do {
            temp = m_value.get();
        } while (temp != m_value.compareAndSwap(temp, temp + 1));
    }

    public void decrement() {
        int temp;
        do {
            temp = m_value.get();
        } while (temp > 0 && temp != m_value.compareAndSwap(temp, temp - 1));
    }
}
}