What are the semantics of store forwarding under the x86 memory model? - concurrency

It's not too difficult to figure out the semantics of present CPUs by experimentation; but what I want to know is, what semantic guarantees are provided by the architectural memory model? I've read volume 3, chapter 8 of the Intel SDM, and I believe it is ambiguous. Am I missing something?
The manual nominally says:
1. A read which succeeds a write in program order and is not from the same location may be ordered before the write.
2. A read which succeeds a write in program order and is from the same location happens after it.
It also says:
In a multiple-processor system, the following ordering principles apply:
Memory ordering obeys causality
I'll come back to this.
Just one sentence is later devoted to store forwarding:
The processor-ordering model described in this section is virtually identical to that used by the Pentium and Intel486 processors. The only enhancements in the Pentium 4, Intel Xeon, and P6 family processors are:
Store-buffer forwarding, when a read passes a write to the same memory location
There's also an example (§8.2.3.5); examples are not generally taken to be normative, but we'll take it at face value for now. From what I understand, the effect of the example is basically to amend #2 above to:
If a read is from the same location as a write which it succeeds in program order, the read must happen 'after' the write in that it either must observe the write, or another write made by another core to the same location which happens after the first write; however, in the first case, the read may happen before the write is made visible to other processors.
The example is of a situation where this makes a difference; I've elided it for brevity. Another example is given here, which was also the motivation for this question.
However, this notion of 'from the same location' makes no account of a read which partially overlaps a previous write. What are the semantics then? Here are two concrete questions:
First question
Suppose that, to set up our initial state, we have mov word [_x], 0. Then:
Core 1
mov byte [_x], 0x01
mov r1, word [_x]
Core 2
mov eax, 0x0000
mov r2, 0x0100
lock cmpxchg word [_x], r2
Is it true that exactly one of the following must be true?
The write happens before the CAS; the CAS fails; and the read sees 0x0001.
The write happens after the CAS; the CAS succeeds; and the read sees 0x0101.
And there is no third possibility?
The write happens after the CAS; the CAS succeeds; but the read still sees 0x0001 because it happened before the write became visible to other cores.
Second question
Much more straightforward, this time: suppose that core 2 isn't writing near _x. (It might be writing elsewhere, though; else the issue is meaningless.) We just have:
Core 1
mov byte [_x], 0x01
mov r1, word [_x]
Must this read wait until after the write becomes visible to other cores?
These two questions are variants (but distinct variants) of the same question: if a read partially overlaps a write in the store queue, can you forward the part that's in the store queue, and serve the rest as a regular read? Or do you have to stall the read until the write has completed? The manual only talks about a read 'to the same location as' a write; but in this case, the read is both to the same location as the write and to a different location.
You could apply the following reasoning to this case:
A word read is notionally a pair of byte-reads, executed atomically.
If we break the read up into two reads, each one can, independently, be ordered before the write is made visible to other cores; one does not alias it, and one matches it exactly.
Having moved both of the byte reads to the other side of the write, we can recombine them into a single atomic read.
But I believe this is unsound reasoning. You can't just take an atomic operation, break it up into its pieces, apply nonatomic reasoning to the constituents, and then recombine them into an atom.
Furthermore, if this reasoning were correct, then it would seem that the answer to question 1 is no. But doesn't that violate causality?
I think the following:
If the answer to question 1 is no, then causality is violated, and so there is a contradiction.
If the answer to question 1 is yes, but the answer to question 2 is no, then a load can be served partly from the store queue and partly from cache. However, this can only be done speculatively, and must back out if another core claims the cache line exclusively between the time when the store is forwarded and the time when it is completed.
If the answer to questions 1 and 2 are both yes, then a load can be served partly from the store queue and partly from cache; but this can only be done speculatively if the core issuing the store already holds the cache line in question exclusively, and bets that no one else asks for it before the store completes.
Practically, there is very little difference between the case when only the answer to question 1 is yes, and the case when the answers to questions 1 and 2 are both yes. And no processor I know of can forward a store to a partially-overlapping load in any event. So it would seem that the answers to both questions should be yes in all cases, but this is not made explicit anywhere.

Related

Are C++ atomics preemption safe?

From what I understand of atomics, they are special assembly instructions which guarantee that two processors in an SMP system cannot write to the same memory region at the same time. For example in PowerPC an atomic increment would look something like:
retry:
lwarx r4, 0, r3 // Read integer from RAM location r3 into r4, placing reservation.
addi r4, r4, 1 // Add 1 to r4.
stwcx. r4, 0, r3 // Attempt to store incremented value back to RAM.
bne- retry // If the store failed (unlikely), retry.
However this does not protect the four instructions from being preempted by an interrupt, and another task being scheduled in. To protect from preemption you need to do an interrupt-lock before entering the code.
From what I can see of C++ atomics, they seem to be enforcing locks if needed. So my first question is -
Does the C++ standard guarantee no preemption will happen during an atomic operation? If so, where in the standard can I find this?
I checked atomic<int>::is_always_lock_free on my Intel PC, and it came out true. With my assumption about the above assembly block, this confused me. After digging into Intel assembly (which I am unfamiliar with) I found lock xadd DWORD PTR [rdx], eax to be happening. So my question is -
Do some architectures provide atomic-related instructions which guarantee no preemption? Or is my understanding wrong?
Finally I was wondering about the compare_exchange_weak and compare_exchange_strong semantics -
Does the difference lie in the retry mechanism or is it something else?
EDIT: After reading the answers, I am curious about one more thing
The atomic member function operations fetch_add, operator++, etc. - are they strong or weak?
This is similar to this question: Anything in std::atomic is wait-free?
Here are some definitions of lock-freedom and wait-freedom (both taken from Wikipedia):
An algorithm is lock-free if, when the program threads are run for a sufficiently long time, at least one of the threads makes progress.
An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes.
Your code with the retry loop is lock-free: a thread only has to perform a retry if the store fails, but that implies that the value must have been updated in the meantime, so some other thread must have made progress.
With regard to lock-freedom it doesn't matter whether a thread can be preempted in the middle of an atomic operation or not.
Some operations can be translated to a single atomic operation, in which case this operation is wait-free and therefore cannot be preempted midway. However, which operations are actually wait-free depends on the compiler and the target architecture (as described in my answer in the referenced SO question).
Regarding the difference between compare_exchange_weak and compare_exchange_strong - the weak version can fail spuriously, i.e., it may fail even though the comparison is actually true. This can happen on architectures with LL/SC. Suppose we use compare_exchange_weak to update some variable with the expected value A. LL loads the value A from the variable, and before the SC is executed, the variable is changed to B and then back to A. So even though the variable contains the same value as before, the intermediate change to B causes the SC (and therefore the compare_exchange_weak) to fail. compare_exchange_strong cannot fail spuriously, but to achieve that it has to use a retry-loop on architectures with LL/SC.
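For illustration, the typical way to use the weak version is inside a loop, where a spurious failure simply costs one extra iteration (a minimal sketch; the counter and the doubling operation are made up):
#include <atomic>

// Atomically double a counter with a CAS loop. A spurious failure of
// compare_exchange_weak just causes one more iteration, so weak is fine
// (and often cheaper) here.
void atomic_double(std::atomic<int>& counter) {
    int expected = counter.load();
    // On failure, compare_exchange_weak stores the current value into
    // 'expected', so each retry works with fresh data.
    while (!counter.compare_exchange_weak(expected, expected * 2)) {
        // retry
    }
}
For a single-shot attempt outside a loop, compare_exchange_strong is usually the right choice, since a spurious failure would otherwise be indistinguishable from a genuine one.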
I am not entirely sure what you mean by fetch_add being "strong or weak". fetch_add cannot fail - it simply performs an atomic update of some variable by adding the provided value, and returns the old value of the variable. Whether this can be translated to a single instruction (like on Intel) or to a retry loop with LL/SC (Power) or CAS (Sparc) depends on the target architecture. Either way, the variable is guaranteed to be updated correctly.
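To make that concrete, here is a hedged sketch: fetch_add as normally used, plus the retry-loop shape it reduces to on architectures without a single atomic add instruction (the names hits and fetch_add_via_cas are made up for illustration):
#include <atomic>

std::atomic<int> hits{0};

void record_hit() {
    hits.fetch_add(1);  // one 'lock xadd'-style instruction on x86; a retry loop on LL/SC machines
}

// Roughly how fetch_add can be expressed as a CAS loop; either way it
// cannot "fail" from the caller's point of view - it returns the old value.
int fetch_add_via_cas(std::atomic<int>& a, int delta) {
    int old = a.load();
    while (!a.compare_exchange_weak(old, old + delta)) {
        // 'old' has been refreshed with the current value; try again
    }
    return old;
}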
Does the C++ standard guarantee no preemption will happen during an atomic operation? If so, where in the standard can I find this?
No, it doesn't. Since there's really no way code could tell whether or not this happened (it's indistinguishable from pre-emption either before or after the atomic operation depending on circumstances) there is no reason for it to.
Do some architectures provide atomic related instructions which guarantee no-preemption? Or is my understanding wrong?
There would be no point since the operation must appear to be atomic anyway, so pre-emption during would always be identical in observed behavior to pre-emption before or after. If you can write code that ever sees a case where pre-emption during the atomic operation causes observable effects different from either pre-emption before or pre-emption after, that platform is broken since the operation does not behave atomically.

Memory ordering from hardware perspective

I think I understand aspects of memory ordering guarantees to some extent after reading a few materials on the Net. However, it seems a little magical when looking at the rules only from a software and theoretical point of view. An example of why two processors could seem to reorder is explained here, and it helped me a lot to actually visualise the process. What I understood is that the prefetcher could load the read early for one processor and not do so for the other; then, to an outside observer, it would look like the 1st processor did an earlier read than the 2nd (and could potentially now have a stale value in the absence of synchronisation), and thus the instructions appear reordered.
After that I was actually looking for explanations from the CPU's point of view of how such effects can be produced. For instance, consider the acquire-release fence. A classic example usually quoted for this is something like:
thread-0: x.store(true,std::memory_order_release);
thread-1: y.store(true,std::memory_order_release);
thread-2:
while(!x.load(std::memory_order_acquire));
if(y.load(std::memory_order_acquire)) ++z;
thread-3:
while(!y.load(std::memory_order_acquire));
if(x.load(std::memory_order_acquire)) ++z;
Since there is no total order as in sequential consistency, thread-2 can see thread-0 doing its stuff first, followed by thread-1, while thread-3 can see thread-1 doing its stuff first, followed by thread-0. Thus z==0 is a possible outcome.
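For reference, a self-contained version of the snippets above might look like this (treating x and y as std::atomic<bool> and z as std::atomic<int> is my assumption, since the snippets don't declare them):
#include <atomic>
#include <thread>

std::atomic<bool> x{false}, y{false};
std::atomic<int>  z{0};

int main() {
    std::thread t0([] { x.store(true, std::memory_order_release); });
    std::thread t1([] { y.store(true, std::memory_order_release); });
    std::thread t2([] {
        while (!x.load(std::memory_order_acquire)) {}
        if (y.load(std::memory_order_acquire)) ++z;
    });
    std::thread t3([] {
        while (!y.load(std::memory_order_acquire)) {}
        if (x.load(std::memory_order_acquire)) ++z;
    });
    t0.join(); t1.join(); t2.join(); t3.join();
    // z == 0 is permitted by the C++ model with acquire/release;
    // with seq_cst on all the operations it would not be.
    return z.load();
}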
If there were an explanation (say, taking four CPUs, each running one of the threads above) of what would happen in hardware to make us see this reorder, it would be immensely helpful. It does not have to be a very complex, detailed real-world case (it can be, though, if that's the only way to understand it). Just an approximation like what the linked answer above does, with something about caches (or any other participating factor) thrown in, should do it for me (and probably many others?), I guess.
Another one is:
thread-0:
x.store(true,std::memory_order_relaxed);
y.store(true,std::memory_order_release);
thread-1:
while(!y.load(std::memory_order_acquire)); // <------ (1)
if(x.load(std::memory_order_relaxed)) ++z;
Following the rules again, I can understand that this will never give z==0 (assuming all initial values are 0) and why changing (1) to relaxed might give us z==0. But once more it sort of appears magical until I can think of how it can physically happen.
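For concreteness, a compilable version of this second example might look like the following (again assuming x and y are std::atomic<bool>; z here is a plain int since only one thread touches it):
#include <atomic>
#include <thread>

std::atomic<bool> x{false}, y{false};
int z = 0;  // plain int is fine: only thread-1 touches it

int main() {
    std::thread t0([] {
        x.store(true, std::memory_order_relaxed);
        y.store(true, std::memory_order_release);      // publishes the store to x
    });
    std::thread t1([] {
        while (!y.load(std::memory_order_acquire)) {}  // <------ (1)
        if (x.load(std::memory_order_relaxed)) ++z;
    });
    t0.join(); t1.join();
    // z is guaranteed to be 1 here; make (1) relaxed and z == 0 becomes possible.
    return z;
}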
Thus any help (or pointers) taking an adequate number of processors and their caches etc. through the explanation would be immensely appreciated.

InterlockedExchange and memory visibility

I have read the article Synchronization and Multiprocessor Issues and I have a question about InterlockedCompareExchange and InterlockedExchange. The question is actually about the last example in the article. They have two variables iValue and fValueHasBeenComputed and in CacheComputedValue() they modify each of them using InterlockedExchange:
InterlockedExchange ((LONG*)&iValue, (LONG)ComputeValue()); // don't understand
InterlockedExchange ((LONG*)&fValueHasBeenComputed, TRUE); // understand
I understand that I can use InterlockedExchange for modifying iValue, but is it enough just to do
iValue = ComputeValue();
So is it actually necessary to use InterlockedExchange to set iValue? Or will other threads see iValue correctly even with a plain iValue = ComputeValue();, given that there is an InterlockedExchange after it?
There is also the paper A Principle-Based Sequential Memory Model for Microsoft Native Code Platforms. Its example 3.1.1 has more or less the same code. One of the recommendations is "Make y interlocked". Notice - not both y and x.
Update
Just to clarify the question. The issue is that I see a contradiction. The example from "Synchronization and Multiprocessor Issues" uses two InterlockedExchange calls. On the contrary, in example 3.1.1 "Basic Reordering" (which I think is quite similar to the first example) Herb Sutter gives this recommendation:
"Make y interlocked: If y is interlocked, then there is no race on y
because it is atomically updatable,and there is no race on x because a
-> b -> d."
. In this draft Herb do not use two interlocked variable (If I am right he means use InterlockedExchange only for y ).
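As I read it, in std::atomic terms his recommendation would correspond roughly to the sketch below (the function names are the ones from the article; treating the interlocked flag as a release store / acquire load is my interpretation, since InterlockedExchange is actually a full fence):
#include <atomic>

int ComputeValue();                                 // provided elsewhere

int iValue = 0;                                     // plain data
std::atomic<bool> fValueHasBeenComputed{false};     // the "interlocked" flag

void CacheComputedValue() {
    iValue = ComputeValue();                                          // plain store
    fValueHasBeenComputed.store(true, std::memory_order_release);     // publish
}

bool FetchComputedValue(int* pResult) {
    if (!fValueHasBeenComputed.load(std::memory_order_acquire))
        return false;
    *pResult = iValue;   // ordered after the flag load, so it sees the computed value
    return true;
}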
They did that to prevent partial reads/writes if the address of iValue is not aligned to an address that guarantees atomic access. This problem would arise when two or more physical threads try to write the value concurrently, or one reads and one tries to write at the same time.
As a secondary point, it should be noted that stores are not always globally visible; they only become visible when serialized, either by a fence or by a bus lock.
You simply get an atomic operation with InterlockedExchange. Why do you need it?
Because InterlockedExchange does two things:
Replaces a value of variable
Returns an old value
If you do the same thing in two operations (thus first read the value, then replace it) you can get screwed if other instructions (on another thread) occur between these two.
And you also prevent data races on this value. Here you get a good explanation of why a read/write on a LONG is not atomic.
There are two plausible resolutions to the contradiction you've observed.
One is that the second document is simply wrong in that particular respect. It is, after all, a draft. I note that the example you refer to specifically states that the programmer cannot rely on the writes to be atomic, which means that both writes must indeed be interlocked.
The other is that the additional interlock might not actually be required in that particular example, because it is a very special case: only a single bit of the variable is being changed. However, the specification being developed doesn't appear to mention this as a premise, so I doubt that this is intentional.
I think this discussion has the answer to the question: Implicit Memory Barriers.
Question: does calling InterlockedExchange (implicit full fence) on T1
and T2 guarantee that T2 will "see" the write done by T1 before the
fence (the A, B and C variables), even though those variables are not
placed on the same cache line as Foo and Bar?
Answer: Yes -- the full fence generated by the InterlockedExchange will
guarantee that the writes to A, B, and C are not reordered past the
fence implicit in the InterlockedExchange call. This is the point of
memory barrier semantics. They do not need to be on the same cache
line.
Memory Barriers: a Hardware View for Software Hackers and Lockless Programming Considerations for Xbox 360 and Microsoft Windows are also interesting.

Thread safety and bit-field

I know that bit-fields are compiler dependent, but I haven't found documentation about thread safety of bit-fields with the latest g++ and Visual C++ 2010.
Are operations on a bit-field member atomic?
"Thread safe" is unfortunately a very overloaded term in programming.
If you mean atomic access to bit-fields, the answer is no (at least on all processors I'm aware of). You have atomic access to 32bit memory locations on 32bit machines, but that only means you'll read or write a whole 32 bit value. This does not mean another thread won't do the same thing. If you're looking to stop that you likely want synchronization.
If you mean synchronized access to bit-fields, the answer is also no, unless you wrap your access in a higher level synchronization primitive (which are often built on atomic operations).
In short, the compiler does not provide atomic or synchronized access to bit fields without extra work on your part.
Does that help?
Edit: Dr. Dan Grossman has two nice lectures on atomicity and synchronization I found on UOregon's CS department page.
When writing to a bit-field, there may be a time window in which any attempt by another thread to access (read or write) any (same or different) bit-field in the same structure will result in Undefined Behavior, meaning anything can happen. When reading a bit field, there may be a time window when any attempt by another thread to write any bit-field in the same structure will result in Undefined Behavior.
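To illustrate where that window comes from, here is a sketch in C++11 memory-model terms (the struct is made up; the point is that adjacent bit-fields share a single "memory location", so unsynchronized concurrent writes to them are a data race):
#include <cstdint>

struct Flags {
    std::uint32_t a : 1;   // a and b occupy the same memory location:
    std::uint32_t b : 1;   // writing b may compile to a read-modify-write of the
                           // whole word, clobbering a concurrent update of a
    std::uint32_t sep;     // an intervening non-bit-field member...
    std::uint32_t c : 1;   // ...makes c a separate memory location from a and b
};

Flags f{};

void writer_a() { f.a = 1; }  // races with writer_b: same memory location
void writer_b() { f.b = 1; }
void writer_c() { f.c = 1; }  // does not race with writer_a or writer_b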
If you cannot practically use separate variables for the bit-fields in question, you may be able to store multiple bit-fields in an integer and update them atomically by creating a union between the bit-field structure and a 32-bit integer, and then using a CompareExchange sequence:
1. Read the value of the bitfield as an Int32.
2. Convert it to a bitfield structure.
3. Update the structure.
4. Convert the structure back to an Int32.
5. Use CompareExchange to overwrite the variable with the new value only if it still holds the value read in (1); if the value has changed, start over with step (1).
For this approach to work well, steps 2-4 must be fast. The longer they take, the greater the likelihood that the CompareExchange in step 5 will fail, and thus the more times steps 2-4 will have to be re-executed before the CompareExchange succeeds.
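Here is a sketch of that sequence in C++ (the field layout is hypothetical; I use memcpy rather than a union for the conversions to stay within defined behaviour, but the five steps are the same):
#include <atomic>
#include <cstdint>
#include <cstring>

// Hypothetical layout packing three fields into 32 bits.
struct Fields { std::uint32_t mode : 3, count : 13, flags : 16; };
static_assert(sizeof(Fields) == sizeof(std::uint32_t), "must fit in 32 bits");

std::atomic<std::uint32_t> packed{0};

void bump_count() {
    std::uint32_t oldBits = packed.load();                   // (1) read as an Int32
    for (;;) {
        Fields f;
        std::memcpy(&f, &oldBits, sizeof f);                  // (2) convert to the structure
        f.count += 1;                                         // (3) update the structure
        std::uint32_t newBits;
        std::memcpy(&newBits, &f, sizeof newBits);            // (4) convert back to an Int32
        if (packed.compare_exchange_weak(oldBits, newBits))   // (5) CAS; on failure oldBits is
            break;                                            //     refreshed and we start over
    }
}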
If you want to update bit-fields in a thread-safe way you need to split your bit-fields into separate flags and use regular ints to store them. Accessing separate machine words is thread-safe (although you need to consider optimizations and cache coherency on multiprocessor systems).
See the Windows Interlocked Functions
Also see this related SO question
Just use atomic.AddInt32. For example:
atomic.AddInt32(&intval, 1 << 0) // set the first bit (assuming it is currently clear)
atomic.AddInt32(&intval, 1 << 1) // set the second bit (assuming it is currently clear)
atomic.AddInt32(&intval, -(1 << 1 + 1 << 0)) // clear the first and second bit (assuming both are currently set)
The code is in Go; I think C++ also has something like atomic.AddInt32.
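In C++ the closer analogue for bit manipulation would be std::atomic's fetch_or/fetch_and, which set or clear bits regardless of their current state (unlike the add-based version above, which assumes you already know the bits' state):
#include <atomic>
#include <cstdint>

std::atomic<std::uint32_t> intval{0};

void example() {
    intval.fetch_or(1u << 0);                     // set the first bit
    intval.fetch_or(1u << 1);                     // set the second bit
    intval.fetch_and(~((1u << 1) | (1u << 0)));   // clear the first and second bit
}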

Are C++ Reads and Writes of an int Atomic?

I have two threads, one updating an int and one reading it. This is a statistic value where the order of the reads and writes is irrelevant.
My question is, do I need to synchronize access to this multi-byte value anyway? Or, put another way, can part of the write be complete and get interrupted, and then the read happen?
For example, think of a value of 0x0000FFFF that gets incremented to the value 0x00010000.
Is there a time when the value looks like 0x0001FFFF that I should be worried about? Certainly the larger the type, the more likely something like this is to happen.
I've always synchronized these types of accesses, but was curious what the community thinks.
Boy, what a question. The answer to which is:
Yes, no, hmmm, well, it depends
It all comes down to the architecture of the system. On IA-32, a correctly aligned access will be an atomic operation. Unaligned writes might be atomic; it depends on the caching system in use. If the memory lies within a single L1 cache line then it is atomic, otherwise it's not. The width of the bus between the CPU and RAM can affect the atomic nature: a correctly aligned 16-bit write on an 8086 was atomic whereas the same write on an 8088 wasn't, because the 8088 only had an 8-bit bus whereas the 8086 had a 16-bit bus.
Also, if you're using C/C++ don't forget to mark the shared value as volatile, otherwise the optimiser will think the variable is never updated in one of your threads.
At first one might think that reads and writes of the native machine size are atomic but there are a number of issues to deal with including cache coherency between processors/cores. Use atomic operations like Interlocked* on Windows and the equivalent on Linux. C++0x will have an "atomic" template to wrap these in a nice and cross-platform interface. For now if you are using a platform abstraction layer it may provide these functions. ACE does, see the class template ACE_Atomic_Op.
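That C++0x "atomic" template has since become std::atomic; a minimal sketch for the statistic counter described in the question (the names stat_counter, on_event and read_stat are made up):
#include <atomic>

std::atomic<int> stat_counter{0};   // hypothetical name for the statistic in the question

void on_event() { stat_counter.fetch_add(1, std::memory_order_relaxed); }
int  read_stat() { return stat_counter.load(std::memory_order_relaxed); }
// relaxed ordering is enough here: the question only needs an untorn value,
// not ordering relative to other memory operations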
IF you're reading/writing a 4-byte value AND it is DWORD-aligned in memory AND you're running on the IA-32 architecture, THEN reads and writes are atomic.
Yes, you need to synchronize accesses. In C++0x it will be a data race, and undefined behaviour. With POSIX threads it's already undefined behaviour.
In practice, you might get bad values if the data type is larger than the native word size. Also, another thread might never see the value written due to optimizations moving the read and/or write.
You must synchronize, but on certain architectures there are efficient ways to do it.
Best is to use subroutines (perhaps masked behind macros) so that you can conditionally replace implementations with platform-specific ones.
The Linux kernel already has some of this code.
On Windows, InterlockedExchangeAdd is guaranteed to be atomic.
To echo what everyone said upstairs, the language pre-C++0x cannot guarantee anything about shared memory access from multiple threads. Any guarantees would be up to the compiler.
No, they aren't (or at least you can't assume they are). Having said that, there are some tricks to do this atomically, but they typically aren't portable (see Compare-and-swap).
I agree with many and especially Jason. On windows, one would likely use InterlockedAdd and its friends.
Aside from the cache issue mentioned above...
If you port the code to a processor with a smaller register size it will not be atomic anymore.
IMO, threading issues are too thorny to risk it.
Let's take this example:
int x;
x++;
x=x+5;
The first statement is assumed to be atomic because it translates to a single INC assembly instruction that takes a single CPU cycle. However, the second assignment requires several operations so it's clearly not an atomic operation.
Another e.g,
x=5;
Again, you have to disassemble the code to see what exactly happens here.
tc,
I think the moment you use a constant (like 6), the instruction wouldn't be completed in one machine cycle.
Try comparing the instructions generated for x += 6 with those for x++.
Some people think that ++c is atomic, but keep an eye on the assembly generated. For example with 'gcc -S':
movl cpt.1586(%rip), %eax
addl $1, %eax
movl %eax, cpt.1586(%rip)
To increment an int, the compiler first loads it into a register, increments it, and then stores it back to memory. This is not atomic.
Definitively NO!
That answer comes from our highest C++ authority, M. Boost:
Operations on "ordinary" variables are not guaranteed to be atomic.
The only portable way is to use the sig_atomic_t type defined in signal.h header for your compiler. In most C and C++ implementations, that is an int. Then declare your variable as "volatile sig_atomic_t."
Reads and writes are atomic, but you also need to worry about the compiler re-ordering your code. Compiler optimizations may violate the happens-before relationship of statements in your code. By using atomic you don't have to worry about that.
...
std::atomic<int> i;
soap_status = GOT_RESPONSE;
i = 1;
In the above example, the variable 'i' will only be set to 1 after we get a soap response.