Are increment operators in Go atomic on x86? - concurrency

Here's some background:
I need a counter variable shared between goroutines, used for something like a leaky bucket. I know there's a leaky-bucket example in the concurrency section of Effective Go, but the number I need to track may be very large, and it feels inefficient to use the number of elements in a channel to track it. So I'm considering using a variable shared between the different goroutines to track the number.
I understand that without explicit configuration, all goroutines are mapped onto one thread. But if I assign more than one thread to the program on a multi-core computer, are increment operators atomic? And is the answer the same for different data types (int32, float32, etc.) on different machines (x86_32, x86_64, arm)?
To be more concrete: what if I have counter += 1000 in one goroutine and counter -= 512 in another, and the two goroutines happen to be running on two threads? Do I need to worry about thread safety? Should I put a lock around the counter?

No, an increment should never be assumed to be atomic. Use the atomic addition functions or a mutex.
Let's assume:
import "sync/atomic"
var counter = new(int32)
One goroutine could do atomic.AddInt32(counter, 1000) while another did atomic.AddInt32(counter, -512) without a mutex.
If you would rather use a mutex:
import "sync"
var counter int32
var mutex sync.Mutex
func Add(x int32) {
    mutex.Lock()
    defer mutex.Unlock()
    counter += x
}

Related

C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads

Could you please help me understand which std::memory_order should be used in std::atomic_flag::test_and_set to do some work only once by a set of threads, and why? The work should be done by whichever thread gets to it first, and all other threads should just check as quickly as possible that someone is already doing the work and continue working on other tasks.
In my tests of the example below, any memory order works, but I think that is just a coincidence. I suspect that Release-Acquire ordering is what I need, but, in my case, only one memory_order can be used in both threads (it is not the case that one thread can use memory_order_release and the other memory_order_acquire, since I do not know which thread will arrive at the work first).
#include <atomic>
#include <iostream>
#include <thread>

std::atomic_flag done = ATOMIC_FLAG_INIT;

const std::memory_order order = std::memory_order_seq_cst;
//const std::memory_order order = std::memory_order_acquire;
//const std::memory_order order = std::memory_order_relaxed;

void do_some_work_that_needs_to_be_done_only_once(void)
{ std::cout << "Hello, my friend\n"; }

void run(void)
{
    if (not done.test_and_set(order))
        do_some_work_that_needs_to_be_done_only_once();
}

int main(void)
{
    std::thread a(run);
    std::thread b(run);
    a.join();
    b.join();
    // expected result:
    // * only one thread said hello
    // * all threads spent as little time as possible to check if any
    //   other thread said hello yet
    return 0;
}
Thank you very much for your help!
Following up on some things in the comments:
As has been discussed, there is a well-defined modification order M for done on any given run of the program. Every thread does one store to done, which means one entry in M. And by the nature of atomic read-modify-writes, the value returned by each thread's test_and_set is the value that immediately precedes its own store in the order M. That's promised in C++20 atomics.order p10, which is the critical clause for understanding atomic RMW in the C++ memory model.
Now there are a finite number of threads, each corresponding to one entry in M, which is a total order. Necessarily there is one such entry that precedes all the others. Call it m1. The test_and_set whose store is entry m1 in M must return the preceding value in M. That can only be the value 0 which initialized done. So the thread corresponding to m1 will see test_and_set return 0. Every other thread will see it return 1, because each of their modifications m2, ..., mN follows (in M) another modification, which must have been a test_and_set storing the value 1.
We may not be bothering to observe all of the total order M, but this program does determine which of its entries is first on this particular run. It's the unique one whose test_and_set returns 0. A thread that sees its test_and_set return 1 won't know whether it came 2nd or 8th or 96th in that order, but it does know that it wasn't first, and that's all that matters here.
Another way to think about it: suppose it were possible for two threads (tA, tB) both to load the value 0. Well, each one makes an entry in the modification order; call them mA and mB. M is a total order so one has to go before the other. And bearing in mind the all-important [atomics.order p10], you will quickly find there is no legal way for you to fill out the rest of M.
All of this is promised by the standard without any reference to memory ordering, so it works even with std::memory_order_relaxed. The only effect of relaxed memory ordering is that we can't say much about how our load/store will become visible with respect to operations on other variables. That's irrelevant to the program at hand; it doesn't even have any other variables.
In the actual implementation, this means that an atomic RMW really has to exclusively own the variable for the duration of the operation. We must ensure that no other thread does a store to that variable, nor the load half of a read-modify-write, during that period. In a MESI-like coherent cache, this is done by temporarily locking the cache line in the E state; if the system makes it possible for us to lose that lock (like an LL/SC architecture), abort and start again.
As to your comment about "a thread reading false from its own cache/buffer": the implementation mustn't allow that in an atomic RMW, not even with relaxed ordering. When you do an atomic RMW, you must read it while you hold the lock, and use that value in the RMW operation. You can't use some old value that happens to be in a buffer somewhere. Likewise, you have to complete the write while you still hold the lock; you can't stash it in a buffer and let it complete later.
relaxed is fine if you just need to determine the winner of the race to set the flag (see footnote 1 below), so one thread can start on the work and later threads can just continue on.
If the run_once work produces data that other threads need to be able to read, you'll need a release store after that, to let potential readers know that the work is finished, not just started. If it was instead just something like printing or writing to a file, and other threads don't care when that finishes, then yeah, you have no ordering requirements between threads beyond the modification order of done, which exists even with relaxed. An atomic RMW like test_and_set lets you determine which thread's modification was first.
BTW, you should do a read-only check before even trying to test-and-set, unless run() is only called very infrequently, like once per thread startup. For something like a static int foo = non_constant; local variable, compilers use a guard variable that's loaded (with an acquire load) to see if init is already complete. If it's not, they branch to code that uses an atomic RMW to modify the guard variable, with one thread winning and the rest effectively waiting on a mutex for that thread to finish the init. (A sketch of this check-first pattern appears below, after the note on C++20's test().)
You might want something like that if you have data that all threads should read. Or just use a static int foo = something_to_run_once(), or some type other than int, if you actually have some data to init.
Or perhaps use C++11 std::call_once to solve this problem for you.
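For example, a minimal sketch of the call_once approach (using the work function from the question's code; std::call_once and std::once_flag are the standard C++11 API for exactly this):

#include <mutex>

void do_some_work_that_needs_to_be_done_only_once(void); // from the question

std::once_flag once;

void run(void)
{
    // std::call_once runs the callable exactly once across all threads;
    // any thread that arrives while the winner is still working blocks
    // until the call has completed.
    std::call_once(once, do_some_work_that_needs_to_be_done_only_once);
}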
On normal systems, atomic_flag has no advantage over an atomic_bool. done.exchange(true) on a bool is equivalent to test_and_set of a flag. But atomic_bool is more flexible in terms of the operations it supports, like a plain read that isn't part of an RMW test-and-set.
C++20 does add a test() method for atomic_flag. ISO C++ guarantees that atomic_flag is lock-free, but in practice so is std::atomic<bool> on all real-world systems.
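Here's a rough sketch of that check-read-only-first pattern, assuming C++20 for atomic_flag::test. Note that relaxed is only enough when, as discussed above, the work doesn't publish data that other threads must read afterwards:

#include <atomic>

extern std::atomic_flag done; // as in the question
void do_some_work_that_needs_to_be_done_only_once(void);

void run(void)
{
    if (done.test(std::memory_order_relaxed)) // cheap read-only fast path, no RMW
        return;
    if (not done.test_and_set(std::memory_order_relaxed)) // RMW picks exactly one winner
        do_some_work_that_needs_to_be_done_only_once();
}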
Footnote 1: why relaxed guarantees a single winner
The memory_order parameter only governs ordering wrt. operations on other variables by the same thread.
Does calling test_and_set in one thread somehow force synchronization of the flag with values written by other threads?
It's not a pure write, it's an atomic read-modify-write, so the result of the one that went first is guaranteed to be visible to the one that happens to be second. That's the whole point of test-and-set as a primitive building block for mutual exclusion.
If two TAS operations could both load the original value (false) and then both store true, they wouldn't be atomic: they'd have overlapped with each other.
Two atomic RMWs on the same atomic object must happen in some order, the modification order of that object. (Because they're not read-only: an RMW includes a modification. But it also includes a read, so you can see what the value was immediately before the new value; that read is tied to the modification order, unlike a plain read.)
Every atomic object separately has a modification-order that all threads can agree on; this is guaranteed by ISO C++. (With orders less than seq_cst, ordering between objects can be different from source order, and not guaranteed that all threads even agree which store happened first, the IRIW problem.)
Being an atomic RMW guarantees that exactly one test_and_set will return false in thread A or B. Same for fetch_add with multiple threads incrementing a counter: the increments have to happen in some order (i.e. serialized with each other), and whatever that order is becomes the modification-order of that atomic object.
Atomic RMWs have to work this way to not lose counts. i.e. to actually be atomic.
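A small self-contained demo of that guarantee (the thread count of 8 is arbitrary): however many threads race on the flag, exactly one test_and_set returns false, even with relaxed ordering.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic_flag flag = ATOMIC_FLAG_INIT;
std::atomic<int> winners{0};

int main()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)
        threads.emplace_back([] {
            // All 8 RMWs are serialized into one modification order;
            // only the first in that order sees the initial value (false).
            if (not flag.test_and_set(std::memory_order_relaxed))
                winners.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : threads)
        th.join();
    std::printf("winners = %d\n", winners.load()); // always exactly 1
}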

std::atomic - behaviour of relaxed ordering

Can the following call to print result in outputting stale/unintended values?
std::mutex g;
std::atomic<int> seq;
int g_s = 0;
int i = 0, j = 0, k = 0; // ignore fact that these could easily be made atomic

// Thread 1
void do_work() // seldom called
{
    // avoid over
    std::lock_guard<std::mutex> lock{g};
    i++;
    j++;
    k++;
    seq.fetch_add(1, std::memory_order_relaxed);
}

// Thread 2
void consume_work() // spinning
{
    const auto s = g_s;
    // avoid overhead of constantly acquiring lock
    g_s = seq.load(std::memory_order_relaxed);
    if (s != g_s)
    {
        // no lock guard
        print(i, j, k);
    }
}
TL;DR: this is super broken; use a Seq Lock instead. Or RCU if your data structure is bigger.
Yes, you have data-race UB, and in practice stale values are likely; so are inconsistent values (from different increments). ISO C++ has nothing to say about what will happen, so it depends on how it happens to compile for some real machine, and interrupts / context switches in the reader that happen in the middle of reading some of these multiple vars. e.g. if the reader sleeps for any reason between reading i and j, you could miss many updates, or at least get a j that doesn't match your i.
Relaxed seq with writer+reader using lock_guard
I'm assuming the writer would look the same, so the atomic RMW increment is inside the critical section.
I'm picturing the reader checking seq like it is now, and only taking a lock after that, inside the block that runs print.
Even if you did use lock_guard to make sure the reader got a consistent snapshot of all three variables (something you couldn't get from making each of them separately atomic), I'm not sure relaxed would be sufficient in theory. It might be in practice on most real implementations for real machines (where compilers have to assume there might be a reader that synchronizes a certain way, even if there isn't in practice). I'd use at least release/acquire for seq, if I was going to take a lock in the reader.
Taking a mutex is an acquire operation, same as a std::memory_order_acquire load on the mutex object. A relaxed increment inside a critical section can't become visible to other threads until after the writer has taken the lock.
But in the reader, with if( xyz != seq.load(relaxed) ) { take_lock; ... }, the load is not guaranteed to "happen before" taking the lock. In practice on many ISAs it will, especially x86 where all atomic RMWs are full memory barriers. But in ISO C++, and maybe some real implementations, it's possible for the relaxed load to reorder into the reader's critical section. Of course, ISO C++ doesn't define things in terms of "reordering", only in terms of syncing with and values loads are allowed to see.
(This reordering may not be fully plausible; it would mean the read side would have to actually take the lock based on branch prediction / speculation on the load result. Maybe with lock elision like x86 did with transactional memory, except without x86's strong memory ordering?)
Anyway, it's pretty hairy to reason about, and release/acquire ops are quite cheap on most CPUs. If you expected them to be expensive, and for the check to often be false, you could check again with an acquire load, or put an acquire fence inside the if so it doesn't happen on the no-new-work path.
Use a Seq Lock
Your problem is better solved by using your sequence counter as part of a Seq Lock, so neither reader nor writer needs a mutex. (Summary: increment before writing, then touch the payload, then increment again. In the reader, read i, j, and k into local temporaries, then check the sequence number again to make sure it's the same, and an even number. With appropriate memory barriers.
See the wikipedia article and/or link below for actual details, but the real change from what you have now is that the sequence number has to increment by 2. If you can't handle that, use a separate counter for the actual lock, with seq as part of the payload.)
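Here's a minimal single-writer sketch of that idea, using the question's variable names. The fence placement follows the standard SeqLock pattern; treat it as illustrative rather than production code:

#include <atomic>

std::atomic<unsigned> seq{0};
int i, j, k; // payload: plain variables, guarded by the sequence number

void write_payload(int ni, int nj, int nk) // single writer only
{
    unsigned s = seq.load(std::memory_order_relaxed);
    seq.store(s + 1, std::memory_order_relaxed); // odd: write in progress
    std::atomic_thread_fence(std::memory_order_release); // keep payload stores after the counter store
    i = ni; j = nj; k = nk;
    seq.store(s + 2, std::memory_order_release); // even again: write complete
}

bool try_read_payload(int& oi, int& oj, int& ok)
{
    unsigned s0 = seq.load(std::memory_order_acquire);
    if (s0 & 1)
        return false; // writer is mid-update; caller should retry
    oi = i; oj = j; ok = k; // may tear, but a torn read is detected and discarded below
    std::atomic_thread_fence(std::memory_order_acquire); // keep payload loads before the re-check
    return seq.load(std::memory_order_relaxed) == s0; // unchanged => consistent snapshot
}

Note the writer increments by 2 overall (through an odd intermediate value), exactly as described above, and that this only works with a single writer because the counter update is a plain load + store, not an RMW.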
If you don't want to use a mutex in the reader, using one in the writer only helps in terms of implementation-detail side-effects, like making sure stores to memory actually happen, not keeping i in a register across calls if do_work inlines into some caller.
BTW, updating seq doesn't need to be an atomic RMW if there's only one writer. You can relaxed load and separately store an incremented temporary (with release semantics).
A Seq Lock is good for cheap reads and occasional writes that make the reader retry. Implementing 64 bit atomic counter with 32 bit atomics shows appropriate fencing.
It relies on non-atomic reads that may have a data race, but not using the result if your sequence counter detects tearing. C++ doesn't define the behaviour in that case, but it works in practice on real implementations. (C++ is mostly keeping its options open in case of hardware race detection, which normal CPUs don't do.)
If you have multiple writers, you'd still use a normal lock to give mutual exclusion between them, or use the sequence counter itself as a spinlock, with a writer acquiring it by making the count odd. Otherwise you just need the sequence counter.
Your global g_s is just to track the latest sequence number the reader has seen? Storing it next to the data defeats some of the purpose/benefit, since it means the reader is writing the same cache line the writer writes (assuming variables declared near each other end up together). Consider making it static inside the function, or separating it from the shared data with padding, like alignas(64) or 128. (That wouldn't guarantee that a compiler doesn't put it right before the other vars, though; a struct would let you control the layout of all of them. With enough alignment, you can make sure they're not in the same aligned pair of cache lines.)
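One possible layout, sketched with the question's names (the struct name is made up, and the 64-byte cache-line size is an assumption; the 128 mentioned above accounts for hardware that prefetches lines in pairs):

#include <atomic>

struct Shared {
    alignas(64) std::atomic<int> seq; // written by the writer
    int i, j, k;                      // payload; sharing seq's cache line is fine
    alignas(64) int g_s;              // reader-private scratch, kept off the writer's line
};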
Even ignoring the staleness, this causes a data race and UB.
Thread 2 can read i, j, k while thread 1 is modifying them; you don't synchronize access to those variables. If thread 2 doesn't respect the mutex g, there's no point in locking it in thread 1.
Yes, it can.
First of all, the lock guard does not have any effect on your code. A lock has to be used by at least two threads to have any effect.
Thread 2 can read at any moment. It can read an incremented i and a not-yet-incremented j and k. In theory, it can even read a weird partial value produced partway through updating the various bytes that compose i (for example, incrementing from 0xFF to 0x100 could yield a read of 0x1FF or 0x0), but not on x86, where these updates happen to be atomic.

memory range sharing in threads : ensure data is not stuck in cache

When sending the address of a memory location from one thread to another, how do I ensure that the data is not stuck in a CPU cache, and that the second thread actually reads the correct value? (I'm using a socketpair() to send the pointer from one thread to another.)
And a related question: how does a C++ compiler, along with the threading primitives, figure out which memory addresses need to be handled specially for synchronization?
struct Test { int fld; };

void thread_1()
{
    Test *ptr1 = new Test;
    ptr1->fld = 100;
    ::write(write_fd, &ptr1, sizeof(ptr1));
}

void thread_2()
{
    Test *ptr2;
    ::read(read_fd, &ptr2, sizeof(ptr2));
    // WHAT MAGIC IS REQUIRED TO ENSURE THIS ?
    assert(ptr2->fld == 100);
}
If you want to pass the value between threads in the same process, I would use std::atomic<int> as the type of the field, with the related setter and getter functions. Obviously, passing a pointer from one process to another doesn't work at all, unless it's from an area of memory that is guaranteed to have the same address in both processes (shared memory, for example), but then you shouldn't need sockets...
Compilers do not, in general, know how to deal with caches, except for atomic types (technically, atomics are usually dealt with using separate instructions, rather than cache-flushing and cache-invalidation, and the processor hardware handles the relevant "talk to other processors about the cache content").
The OS (subject to bugs, of course) does that sort of thing when passing between processes, or within processes. But for passing pointers, you can't rely on that: the newly received pointer value is correct, but the content the pointer points at isn't cache-managed.
On some processors, you can use a memory barrier to ensure the correct ordering of memory content between threads. This forces the processor to "perform all memory operations before this point". However, in the case of system calls like read and write, the OS should take care of that for you: it ensures the memory has been properly written before read starts reading the memory it wants to store in the socket buffer, and write will have a memory barrier after it has stored your data (in this case the value of the pointer, but memory barriers affect all reads and/or writes that precede that point).
If you were to implement your own primitives for passing data, and the processors do not have cache coherency (most modern processors do), you would also need to add a cache flush on the writing side and a cache invalidate on the reading side. This is architecture-dependent; there is no support for it in standard C or C++. (On some processors only OS functionality [kernel mode] can flush or invalidate cache content, while on others it can be done in user-mode code. The granularity of such operations also varies: it may be necessary to flush or invalidate the entire cache system, or individual lines of 32, 64 or 128 bytes can be flushed at a time.)
In C++, you don't need to care about implementation details like caches. The only thing you need to do is make sure there is a C++ happens-before relation.
As Mats Petersson's answer shows, std::atomic is one way to achieve that. All accesses to an atomic variable are ordered, although the order might not be statically determined (i.e. if you have two threads trying to write to the same atomic variable, you can't predict which write happens last).
Another mechanism to enforce synchronization is std::mutex. Threads can attempt to lock a mutex, but only one thread can have a mutex locked at a time. The other threads will block. The compiler will make certain that when one thread unlocks a mutex and the next thread locks the mutex, writes by the first thread can be read by the second thread. If this requires flushing the cache, the compiler will arrange that.
Yet another mechanism is std::atomic_thread_fence. This is useful if you have multiple objects shared between threads (all in the same direction). Instead of making them all atomic, you can make one of them atomic and "attach" a fence to that atomic variable. You then write to the atomic variable last, and read from it first. Obviously this is best encapsulated in a class.
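A rough sketch of that fence pattern (the names payload_a, payload_b, and ready are illustrative, not from the question): write the atomic flag last in the producer and read it first in the consumer.

#include <atomic>

int payload_a, payload_b;        // plain shared data
std::atomic<bool> ready{false};  // the one atomic the fences attach to

void producer()
{
    payload_a = 42;
    payload_b = 7;
    std::atomic_thread_fence(std::memory_order_release); // all stores above are ordered before...
    ready.store(true, std::memory_order_relaxed);        // ...this store, written last
}

void consumer()
{
    while (!ready.load(std::memory_order_relaxed)) {}    // read the flag first
    std::atomic_thread_fence(std::memory_order_acquire); // pairs with the release fence
    // The fences plus the flag establish a happens-before relation,
    // so these plain reads are now safe:
    int a = payload_a;
    int b = payload_b;
    (void)a; (void)b;
}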

Creating Mutexes with Binary Semaphore

I'm working with a simple system that does NOT have mutexes, but rather a limited array of hardware binary semaphores. Typically, all multithreading has been done with heavyweight semaphore techniques that make the code both poor in performance and difficult to write correctly without deadlocks.
A naive implementation is to use one semaphore globally in order to ensure atomic access to a critical section. However, this means that unrelated objects (even of different types) will block if any critical section is being accessed.
My current solution to this problem is to use a single global semaphore to ensure atomic access to a guard byte that then ensures atomic access to the particular critical section. I currently have this so far:
while (true) {
    while (mutexLock == Mutex::Locked) {
    } // wait on mutex
    Semaphore semaLock(SemaphoreIndex::Mutex); // RAII semaphore object
    if (mutexLock == Mutex::Unlocked) {
        mutexLock = Mutex::Locked;
        break;
    }
} // Semaphore is released by destructor here

// ... atomically safe code

mutexLock = Mutex::Unlocked;
I have a few questions: Is this the best way to approach this problem? Is this code thread-safe? Is this the same as a "double checked lock"? If so, does it suffer from the same problems and therefore need memory barriers?
EDIT: A few notes on the system this is being implemented on...
It is a RISC 16-bit processor with 32kB RAM. While it has heavy multithreading capabilities, its memory model is very primitive. Loads and stores are atomic, there is no caching, no branch prediction or branch target prediction, one core with many threads. Memory barriers are mostly for the compiler to know it should reload memory into general purpose registers, not for any hardware reason (no cache)
Even if the processor has heavy multithreading capabilities (I'm not sure what you mean by that, btw), if it is a single-core processor, only one thread can execute at any one time.
A task-switching system must be employed (which would normally run in a privileged mode). Using this system, you must be able to define critical sections that atomically execute a (software-implemented) mutex lock/unlock.
When you say "one core, many threads" does that mean that you have some kind of kernel running on your processor? The task switching system will be implemented by that kernel.
It might pay to look through the documentation of your kernel or ask your vendor for any tips.
Good luck.
Right now we still don't have that much information about your system (for example, what kind of registers are available for each instruction in parallel? Do you use a banked architecture? How many simultaneous instructions can you actually execute?), but hopefully what I suggest will help you.
If I understand your situation, you have a piece of hardware that does not have true cores, but simply MIMD ability via a vectorized operation (based on your reply). It is a "RISC 16-bit processor with 32kB RAM" where:
Loads and stores are atomic, there is no caching, no branch prediction or branch target prediction, one core with many threads
The key here is that loads and stores are atomic. Note that you won't be able to do loads and stores larger than 16 bits atomically, since they will be compiled to two separate atomic instructions (and thus won't be atomic as a whole).
Here is the functionality of a mutex:
attempt to lock
unlock
To lock, you might run into issues if every resource attempts to lock at once. For example, say in your hardware N = 4 (the number of processes run in parallel). If instruction 1 (I1) and I2 try to lock, they will both be successful in locking: since your loads and stores are atomic, both processes see "unlocked" at the same time, and both acquire the lock.
This means you can't do the following:
if mutex_1 unlocked:
    lock mutex_1
which might look like this in an arbitrary assembly language:
load arithmetic mutex_addr
or arithmetic immediate(1) // unlocked = 0000 or 1 = 0001, locked = 0001 or 1 = 0001
store mutex_addr arithmetic // not putting in conditional label to provide better synchronization.
jumpifzero MUTEXLABEL arithmetic
To get around this, you will need each "thread" to either know whether someone else is currently trying to acquire the lock, or to avoid simultaneous lock access entirely.
I only see one way this can be done on your system: via flag/mutex-ID checking. Have a mutex ID associated with each thread for the mutex it is currently checking, and check all other threads to see if you can actually acquire a lock. Your binary semaphores don't really help here, because you would need to associate an individual mutex with a semaphore if you were going to use one (and you'd still have to load the mutex from RAM).
A simple implementation of the check-every-thread lock and unlock: each mutex has an ID and a state. To avoid race conditions per instruction, the mutex currently being handled is advertised well before it is actually acquired. Having "identify which lock you want to use" and "actually try to get the lock" come as two separate steps prevents accidental acquisition on simultaneous access. With this method you can have 2^16 - 1 mutexes (because 0 is used to say no lock was found), and your "threads" can exist on any instruction pipe.
#include <cstdint>

// init to zero
volatile uint16_t CURRENT_LOCK_ATTEMPT[NUM_THREADS]{0};

// make thread id associated with priority
bool tryAcquireLock(uint16_t mutex_id, bool& mutex_lock_state){
    if(mutex_lock_state == false){
        // Do not actually attempt to take the lock until everything is checked.
        // No race condition can happen now: you won't have actually set the lock,
        // and if two threads attempt to acquire the same lock at the same time,
        // both will be able to see that someone else is trying as well.
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = mutex_id;
        // Check all lower threads; some predetermined priority is
        // needed to figure out locking.
        for( int i = 0; i < MY_THREAD_ID; i++ ){
            if(CURRENT_LOCK_ATTEMPT[i] == mutex_id){
                // clearing flag
                CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
                return false;
            }
        }
        // make sure to lock mutex before clearing which mutex you are currently handling
        mutex_lock_state = true;
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
        return true;
    }
    return false;
}

// It's your fault if you didn't make sure you owned the lock in the first place.
// If you did own it, there's no race condition, because of atomic store/load.
// If you happen to set the state while another instruction is attempting to
// acquire the lock, it simply won't get the lock, and no race condition occurs.
void unlock(bool& mutex_lock_state){
    mutex_lock_state = false;
}
If you want more equal access to resources, instead of basing the loop on i = 0 to i < MY_THREAD_ID, you could randomly pick a "starting point" and circle back around to MY_THREAD_ID using modulo arithmetic. I.e.:
#include <random>

bool tryAcquireLock(uint16_t mutex_id, bool& mutex_lock_state, uint16_t per_mutex_random_seed){
    if(mutex_lock_state == false){
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = mutex_id;
        // need a per-thread linear congruential generator for speed and consistency
        std::minstd_rand0 random(per_mutex_random_seed);
        for(int i = random() % TOTAL_NUM_THREADS; i != MY_THREAD_ID; i = (i + 1) % TOTAL_NUM_THREADS)
        {
            // same as before
        }
        // if we actually acquired the lock
        GLOBAL_SEED = global_random(); // use another generator to set the next seed to be used
        mutex_lock_state = true;
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
        return true;
    }
    return false;
}
In general, your lack of a test-and-set ability really throws a wrench into everything, meaning you are forced to use other algorithms for mutexes. For more information on other algorithms that you can use on non-test-and-set architectures, check out this SO post, and these Wikipedia algorithms, which rely only on atomic loads and stores:
Dekker's Algorithm
Eisenberg & McGuire Algorithm
Peterson's Algorithm
Lamport's Algorithm
Szymanski's Algorithm
All of these algorithms basically decompose into checking a set of flags, going through everyone else's flags to see if you can access the resource safely.
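As a concrete illustration, here is a minimal two-thread Peterson's algorithm sketch, using only loads and stores. It's written with std::atomic and the default seq_cst ordering to stand in for this hardware's atomic loads and stores; a real port to the 16-bit target would use plain loads/stores plus whatever compiler barriers that toolchain needs, and the names are made up for the sketch:

#include <atomic>

std::atomic<bool> wants[2] = {{false}, {false}};
std::atomic<int> turn{0};

void lock(int me) // me is 0 or 1
{
    int other = 1 - me;
    wants[me].store(true); // declare intent to enter
    turn.store(other);     // give the other thread priority
    // Spin only while the other thread wants in AND it is its turn.
    while (wants[other].load() && turn.load() == other) {
    }
}

void unlock(int me)
{
    wants[me].store(false);
}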

thread synchronization - delicate issue

let's say I have this loop:
static int a;

for (static int i = 0; i < 10; i++)
{
    a++;
    ///// point A
}
to this loop 2 threads enters...
I'm not sure about something: what will happen if thread 1 gets to POINT A and stays there, while thread 2 goes through the loop 10 times, and after the 10th pass has incremented i's value to 10 but has not yet checked whether i is less than 10?
Thread 1 then moves on from point A and is supposed to increment i and enter the loop again.
What value will thread 1 increment (which i will it see)? Will it be 10 or 0?
Is it possible that thread 1 will increment i to 1, and then thread 2 will go through the loop again 9 times (and then maybe 8, 7, etc.)?
Thanks
You have to realize that an increment operation is effectively:
read the value
add 1
write the value back
You have to ask yourself, what happens if two of these happen in two independent threads at the same time:
static int a = 0;
thread 1 reads a (0)
adds 1 (value is 1)
thread 2 reads a (0)
adds 1 (value is 1)
thread 1 writes (1)
thread 2 writes (1)
For two simultaneous increments, you can see that it is possible that one of them gets lost because both threads read the pre-incremented value.
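A small demo of that lost-update effect (the iteration count is arbitrary, and the plain++ counter is deliberately a data race, i.e. UB, to show what goes wrong; the atomic counter shows the fix):

#include <atomic>
#include <iostream>
#include <thread>

int plain = 0;                  // unsynchronized counter
std::atomic<int> atomic_ctr{0}; // atomic counter

void work()
{
    for (int n = 0; n < 100000; ++n)
    {
        plain++; // racy read/add/write: increments can be lost
        atomic_ctr.fetch_add(1, std::memory_order_relaxed); // atomic RMW: never loses one
    }
}

int main()
{
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << "plain:  " << plain << '\n';      // usually less than 200000
    std::cout << "atomic: " << atomic_ctr << '\n'; // always exactly 200000
}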
The example you gave is complicated by the static loop index, which I didn't notice at first.
Since this is C++ code, the standard behavior is that static variables are visible to all threads, so there is only one loop-counting variable for all threads. The sane thing to do would be to use a normal automatic variable, because each thread would then have its own, and no locking would be required.
That means that while you will sometimes lose increments, you may also gain them, because the loop itself may lose count and iterate extra times. All in all, a great example of what not to do.
If i is shared between multiple threads, all bets are off. It's possible for any thread to increment i at essentially any point during another thread's execution (including halfway through that thread's increment operation). There is no meaningful way to reason about the contents of i in the above code. Don't do that. Either give each thread its own copy of i, or make the increment and comparison with 10 a single atomic operation.
It's not really a delicate issue because you would never allow this in real code if the synchronization was going to be an issue.
I'm just going to use i++ in your loop:
for (static int i=0; i<10; i++)
{
}
Because it mimics a. (Note: static here is very strange.)
Consider what happens if thread A is suspended just as it reaches i++. Thread B gets i all the way to 9, goes into i++ and makes it 10. If it got to move on, the loop would exit. Ah, but now thread A is resumed! So it continues where it left off: increment i! So i becomes 11, and your loop is borked.
Any time threads share data, it needs to be protected. You could also make i++ and i < 10 happen atomically (never be interrupted), if your platform supports it.
You should use mutual exclusion to solve this problem.
And that is why, in a multi-threaded environment, we are supposed to use locks.
In your case, you should write:
bool test_increment(int& i)
{
    lock();
    ++i;
    bool result = i < 10;
    unlock();
    return result;
}

static int a;

for (static int i = -1; test_increment(i); )
{
    ++a;
    // Point A
}
Now the problem disappears. Note that lock() and unlock() are supposed to lock and unlock a mutex common to all threads trying to access i!
Yes, it's possible that either thread can do the majority of the work in that loop. But as Dynite explained, this would (and should) never show up in real code. If synchronization is an issue, you should use mutual exclusion (a Boost, pthread, or Windows Threads mutex) to prevent race conditions such as this.
Why would you use a static loop counter?
This smells like homework, and a bad one at that.
Each thread may effectively be working with its own copy of i (cached in a register, for example), so the behavior can be anything at all. That's part of why it's such a problem.
When you use a mutex or critical section the threads will generally sync up, but even that is not absolutely guaranteed if the variable is not volatile.
And someone will no doubt point out "volatile has no use in multithreading!" but people say lots of stupid things. You don't have to have volatile but it is helpful for some things.
If your "int" is not the atomic machine word size (think 64 bit address + data emulating a 32-bit VM) you will "word-tear". In that case your "int" is 32 bits, but the machine addresses 64 atomically. Now you have to read all 64, increment half, and write them all back.
This is a much larger issue; bone up on processor instruction sets, and grep gcc for how it implements "volatile" everywhere if you really want the gory details.
Add "volatile" and see how the machine code changes. If you aren't looking down at the chip registers, please just use boost libraries and be done with it.
If you need to increment a value from multiple threads at the same time, then look up "atomic operations". For Linux, look up "gcc atomic operations". There is hardware support on most platforms to atomically increment, add, compare-and-swap, and more. LOCKING WOULD BE OVERKILL for this: an atomic inc is orders of magnitude faster than lock/inc/unlock. If you have to change a lot of fields at the same time you may need a lock, although you can change up to 128 bits' worth of fields at a time with most atomic ops.
volatile is not the same as an atomic operation. volatile helps the compiler know when it's a bad idea to use a cached copy of a variable. Among its uses, volatile matters when you have multiple threads changing data that you would like to read the most up-to-date version of without locking. But volatile will still not fix your a++ problem: two threads can read the value of "a" at the same time, both increment the same value, and the last one to write "a" wins, so you lose an increment. volatile will also slow down optimized code by not letting the compiler hold values in registers and the like.