Does data race exist because of the instruction pipelining of the CPU? - concurrency

As titled.
I have seen a couple of articles online about why data races exist, but none of them seems to mention the root cause of why a data race happens when more than one thread writes to the same memory location. I assume it happens because of the CPU's instruction pipelining: the execution of each instruction is separated into several steps (IF, ID, EX, MEM, WB), and the steps of different instructions are executed interleaved, which causes the data race. Is this understanding correct?
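For concreteness, this is the kind of situation being asked about: two threads writing the same non-atomic location with no synchronization. A minimal sketch (the names are made up for illustration):

#include <iostream>
#include <thread>

int counter = 0;  // shared, non-atomic

void writer() {
    for (int i = 0; i < 100000; ++i)
        ++counter;  // unsynchronized read-modify-write: a data race
}

int main() {
    std::thread t1(writer), t2(writer);
    t1.join();
    t2.join();
    // Usually prints less than 200000: increments from the two threads
    // interleave and overwrite each other's results.
    std::cout << counter << '\n';
}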

Related

How "lock add" is implemented on x86 processors

I recently benchmarked std::atomic::fetch_add vs std::atomic::compare_exchange_strong on a 32-core Skylake Intel processor. Unsurprisingly (from the myths I've heard about fetch_add), fetch_add is almost an order of magnitude more scalable than compare_exchange_strong. Looking at the disassembly of the program, std::atomic::fetch_add is implemented with lock add and std::atomic::compare_exchange_strong is implemented with lock cmpxchg (https://godbolt.org/z/qfo4an).
What makes lock add so much faster on an Intel multi-core processor? From my understanding, the slowness in both instructions comes from contention on the cache line, and to execute both instructions with sequential consistency, the executing CPU has to pull the line into its own core in exclusive or modified mode (from MESI). How then does the processor optimize fetch_add internally?
This is a simplified version of the benchmarking code. There was no load+CAS loop for the compare_exchange_strong benchmark, just a compare_exchange_strong on the atomic with an input variable that kept getting varied by thread and iteration. So it was just a comparison of instruction throughput under contention from multiple CPUs.
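The original benchmark code is not reproduced here; a rough sketch of the shape described (the thread count, iteration counts and the way expected is varied are assumptions for illustration) might look like this:

#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> shared{0};

// fetch_add variant: each iteration is one locked read-modify-write.
void bench_fetch_add(int iters) {
    for (int i = 0; i < iters; ++i)
        shared.fetch_add(1, std::memory_order_seq_cst);
}

// compare_exchange variant as described: a single CAS per iteration with a
// per-thread "expected" value and no load+retry loop, so most attempts fail.
void bench_cas(int tid, int iters) {
    for (int i = 0; i < iters; ++i) {
        int expected = tid + i;  // deliberately varied, usually stale
        shared.compare_exchange_strong(expected, expected + 1);
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 32; ++t)
        threads.emplace_back(bench_fetch_add, 1000000);  // or: bench_cas, t, 1000000
    for (auto& th : threads) th.join();
}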
lock add and lock cmpxchg both work essentially the same way, by holding onto that cache line in Modified state for the duration of the microcoded instruction. (Can num++ be atomic for 'int num'?). According to Agner Fog's instruction tables, lock cmpxchg and lock add are very similar numbers of uops from microcode. (Although lock add is slightly simpler). Agner's throughput numbers are for the uncontended case, where the var stays hot in L1d cache of one core. And cache misses can cause uop replays, but I don't see any reason to expect a significant difference.
You claim you aren't doing load+CAS or using a retry loop. But is it possible you're only counting successful CAS or something? On x86, every CAS (including failures) has almost identical cost to lock add. (With all your threads hammering on the same atomic variable, you'll get lots of CAS failures from using a stale value for expected. This is not the usual use-case for CAS retry loops).
Or does your CAS version actually do a pure load from the atomic variable to get an expected value? That might be leading to memory-order mis-speculation.
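For contrast, the usual CAS retry loop reloads expected from the failed compare_exchange, so each retry is a fresh load plus a CAS; a minimal sketch:

#include <atomic>

// Typical CAS increment loop: on failure, compare_exchange_strong writes the
// value it observed back into `expected`, so the loop retries with fresh data.
void cas_increment(std::atomic<int>& a) {
    int expected = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_strong(expected, expected + 1)) {
        // `expected` now holds the current value seen by the failed CAS; retry.
    }
}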
You don't have complete code in the question so I have to guess, and couldn't try it on my desktop. You don't even have any perf-counter results or anything like that; there are lots of perf events for off-core memory access, and events like mem_inst_retired.lock_loads that could record number of locked instructions executed.
With lock add, every time a core gets ownership of the cache line, it succeeds at doing an increment. Cores are only waiting for HW arbitration of access to the line, never for another core to get the line and then fail to increment because it had a stale value.
It's plausible that HW arbitration could treat lock add and lock cmpxchg differently, e.g. perhaps letting a core hang onto the line for long enough to do a couple lock add instructions.
Is that what you mean?
Or maybe you have some major failure in microbenchmark methodology, like maybe not doing a warm-up loop to get CPU frequency up from idle before starting your timing? Or maybe some threads happen to finish early and let the other threads run with less contention?
to execute both instructions with sequential consistency, the executing CPU has to pull the line into its own core in exclusive or modified mode (from MESI).
No, to execute either instruction with any consistent, defined semantics that guarantee that concurrent executions on multiple CPUs do not lose increments, you would need that. Even if you were willing to drop "sequential consistency" (on these instructions), or even drop the usual acquire and release guarantees of reads and writes.
Any locked instruction effectively enforces mutual exclusion on the part of memory sufficient to guarantee atomicity. (Like a regular mutex but at the memory level.) Because no other core can access that memory range for the duration of the operation, the atomicity is trivially guaranteed.
What makes lock add so much faster on an intel multi-core processor?
I would expect any tiny difference in timing to be critical in these cases, and doing the load plus compare (or compare-load plus compare-load ...) might change the timing enough to lose the chance, much like how code using mutexes can have widely different efficiency when there is heavy contention and a small change in access pattern changes the way the mutex is attributed.

does false sharing occur when data is read in openmp?

If I have a C++ program with OpenMP parallelization, where different threads constantly use some small shared array only for reading data from it, does false sharing occur in this case? In other words, is false sharing related only to memory write operations, or can it also happen with memory read operations?
Typically used cache coherence protocols, such as MESI (modified, exclusive, shared, invalid), have a specific state for cache lines called "shared". Cache lines are in this state if they are read by multiple processors. Each processor then has a copy of the cache line and can safely read from it without false sharing. On a write, all processors are informed to invalidate the cache line, which is the main cause of false sharing.
False sharing is a performance issue because it causes additional movement of a cache line which takes time. When two variables which are not really shared reside in the same line and separate threads update each of them, the line has to bounce around the machine which increases the latency of each access. In this case if the variables were in separate lines each thread would keep a locally modified copy of "its" line and no data movement would be required.
However, if you are not updating a line, then no data movement is necessary and there is no performance impact from the sharing, beyond the fact that you might have been able to keep data each thread does need in there, rather than data it doesn't. That is a small, second-order effect, though. So unless you know you are cache-capacity limited, ignore it!
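To make the distinction concrete, here is a small OpenMP sketch (the array names, sizes and thread count are made up for illustration): reads of a shared table stay cheap, adjacent per-thread counters in one cache line bounce between cores when written, and padding each counter onto its own line removes the bouncing.

#include <omp.h>

int shared_table[8] = {1, 2, 3, 4, 5, 6, 7, 8};  // read-only: no false sharing

long packed[8];                                  // adjacent counters share a 64-byte line
struct Padded { alignas(64) long value; };
Padded padded[8];                                // one cache line per counter

int main() {
    #pragma omp parallel num_threads(8)
    {
        int tid = omp_get_thread_num();
        for (int i = 0; i < 1000000; ++i) {
            long x = shared_table[tid];   // reads keep the line in Shared state everywhere
            packed[tid] += x;             // writes invalidate the whole line: false sharing
            padded[tid].value += x;       // each counter owns its line: no false sharing
        }
    }
}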

Can memory store be reordered really, in an OoOE processor?

We know that two instructions can be reordered by an OoOE processor. For example, there are two global variables shared among different threads.
int data;
bool ready;
A writer thread produces data and turns on a flag ready to allow readers to consume that data.
data = 6;
ready = true;
Now, on an OoOE processor, these two instructions can be reordered (instruction fetch, execution). But what about the final commit/write-back of the results? i.e., will the store be in-order?
From what I learned, this totally depends on a processor's memory model. E.g., x86/64 has a strong memory model, and reorder of stores is disallowed. On the contrary, ARM typically has a weak model where store reordering can happen (along with several other reorderings).
Also, my gut feeling tells me that I am right, because otherwise we wouldn't need a store barrier between those two instructions as used in typical multi-threaded programs.
But here is what Wikipedia says:
.. In the outline above, the OoOE processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.
OoOE processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal.
I'm confused. Is it saying that the results have to be written back in order? Really, in an OoOE processor, can the stores to data and ready be reordered?
The simple answer is YES on some processor types.
Before the CPU, your code faces an earlier problem, compiler reordering.
data = 6;
ready = true;
The compiler is free to rearrange these statements since, as far as it knows, they do not affect each other (it is not thread-aware).
Now down to the processor level:
1) An out-of-order processor can process these instructions in different order, including reversing the order of the stores.
2) Even if the CPU performs them in order, the memory controller may not perform them in order, because it may need to flush or bring in new cache lines or do an address translation before it can write them.
3) Even if this doesn't happen, another CPU in the system may not see them in the same order. To observe them, it may need to bring in the modified cache lines from the core that wrote them. It may not be able to bring one cache line in earlier than the other if it is held by another core or if there is contention for that line by multiple cores, and its own out-of-order execution may read one before the other.
4) Finally, speculative execution on other cores may read the value of data before ready was set by the writing core, and by the time it gets around to reading ready, it was already set but data was also modified.
These problems are all solved by memory barriers. Platforms with weakly-ordered memory must make use of memory barriers to ensure memory coherence for thread synchronization.
The consistency model (or memory model) for the architecture determines what memory operations can be reordered. The idea is always to achieve the best performance from the code, while preserving the semantics expected by the programmer. That is the point from wikipedia, the memory operations appear in order to the programmer, even though they may have been reordered. Reordering is generally safe when the code is single-threaded, as the processor can easily detect potential violations.
On x86, the common model is that writes are not reordered with other writes. Yet, the processor is using out-of-order execution (OoOE), so instructions are being reordered constantly. Generally, the processor has several additional hardware structures to support OoOE, like a reorder buffer and a load-store queue. The reorder buffer ensures that all instructions appear to execute in order, such that interrupts and exceptions break at a specific point in the program. The load-store queue functions similarly, in that it can restore the order of memory operations according to the memory model. The load-store queue also disambiguates addresses, so that the processor can identify when operations are made to the same or different addresses.
Back to OoOE: the processor has 10s to 100s of instructions in flight in every cycle. Loads and stores are computing their addresses, etc. The processor may prefetch the cache lines for the accesses (which may include cache coherence traffic), but it cannot actually access the line, either to read or write, until it is safe (according to the memory model) to do so.
Inserting store barriers, memory fences, etc. tells both the compiler and the processor about further restrictions on reordering the memory operations. The compiler is part of implementing the memory model, as some languages like Java have a specific memory model, while others like C follow the rule that "memory accesses should appear as if they were executed in order".
In conclusion, yes, data and ready can be reordered in an OoOE processor. But whether they actually are depends on the memory model. So if you need a specific order, provide the appropriate indication using barriers, etc., so that the compiler, processor, etc. will not choose a different order for higher performance.
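As an illustration (not the questioner's original code), here is the data/ready pattern expressed with C++11 atomics, which makes both the compiler and the processor preserve the required order:

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void writer() {
    data = 6;
    ready.store(true, std::memory_order_release);  // data is visible before ready
}

void reader() {
    while (!ready.load(std::memory_order_acquire)) { }  // wait for the flag
    assert(data == 6);  // guaranteed once the acquire load observes true
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}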
On a modern processor, the storing action itself is asynchronous (think of it as submitting a change to the L1 cache and continuing execution, with the cache system propagating it further asynchronously). So changes to two objects that lie on different cache lines may be realised out of order from another CPU's perspective.
Furthermore, even the instructions that store those data can be executed out of order. For example, when two objects are stored "at the same time", the cache line of one object may be retained/locked by another CPU or by bus mastering, so the other object may be committed earlier.
Therefore, to properly share data across threads, you need some kind of memory barrier, or to make use of the transactional memory features found in recent CPUs, such as Intel TSX.
I think you're misinterpreting "appear that the instructions were processed as normal." What that means is that if I have:
add r1 + 7 -> r2
move r3 -> r1
and the order of those is effectively reversed by out-of-order execution, the value that participates in the add operation will still be the value of r1 that was present prior to the move. Etc. The CPU will cache register values and/or delay register stores to assure that the "meaning" of a sequential instruction stream is not changed.
This says nothing about the order of stores as visible from another processor.

Intel TSX hardware transactional memory what do non-transactional threads see?

Suppose you have two threads, one creates a TSX transaction, and modifies some data structure. The other thread does no synchronization of any kind and reads the same data structure. Is the transaction atomic to it? I can't actually imagine that it can be true, since there is no way afaik to block or restart it if it tries reading a cache line modified by the transaction.
If the transaction is not atomic, then are the write ordering rules on x86 still respected? If it sees write #2, then it is guaranteed that it must be able to see the previous write #1. Does this still hold for writes that happen as part of a transaction?
I could not find answers to these questions anywhere, and I kind of doubt anyone on SO would know either, but at least when somebody finds out this is a Google friendly place to put an answer.
(My answer is based on Intel® 64 and IA-32 Architectures Optimization Reference Manual, Chapter 12)
The transaction is atomic with respect to the read, in that the read will cause the transaction to abort, and thus it appears that the transaction never took place. In the transactional region, cache lines (tracked in the L1) that are read are considered the read-set, and lines written to form the write-set. If another processor reads from the write-set (which is your example) or writes to either the read- or write-set, then there is a data conflict.
Data conflicts are detected through the cache coherence protocol. Data conflicts cause transactional aborts. In the initial implementation, the thread that detects the data conflict will transactionally abort.
Thus the thread attempting the transaction is tracking the line and will detect the conflict when the other thread makes its read request. It aborts and "the hardware will restart at the instruction address provided by the operation of the XBEGIN instruction". In this chapter, there are no distinctions as to what the second processor is doing. It does not matter whether it is attempting a transaction or performing a simple read.
To summarize, all threads (whether transactional or not) see either the full transaction or nothing. Only the thread in a TSX transaction can see the intermediate state of memory.
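As an illustration of what that looks like in code, here is a hedged sketch using the RTM intrinsics from <immintrin.h> (compile with -mrtm on a TSX-capable CPU; the variable names and the spinlock fallback are made up, and real code also needs to bound the number of retries):

#include <immintrin.h>   // _xbegin / _xend / _xabort / _XBEGIN_STARTED
#include <atomic>

std::atomic<bool> fallback_locked{false};   // simple fallback spinlock
int balance = 0;

void deposit(int amount) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Read the fallback lock inside the transaction: if another thread holds
        // it, abort so we do not race with its non-transactional writes.
        if (fallback_locked.load(std::memory_order_relaxed))
            _xabort(0xff);
        balance += amount;   // other threads see all of this update or none of it
        _xend();
    } else {
        // Aborted (data conflict, capacity, interrupt, ...): take the fallback lock.
        while (fallback_locked.exchange(true, std::memory_order_acquire))
            ;                // spin until we own the lock
        balance += amount;
        fallback_locked.store(false, std::memory_order_release);
    }
}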

does atomic operation (c++) freeze cpu

If we perform an atomic operation on a multi-core CPU, does the atomic instruction freeze operations on all other cores?
For example, if we do an increment on an atomic variable:
++atomicInteger;
Does this freeze all other operations on other cores?
I am focused on x86 processors.
I know that reading or writing a memory-aligned native type is atomic and does not impact any other core's execution.
x86 allows writing unaligned data that spans two cache lines (i.e. across two 64-byte chunks), but the result is not guaranteed to be atomic. This means you may read 8 bytes from addr 0x1003c, for example, requiring the CPU to fetch 2 lines (0x10000 and 0x10040), take the relevant 4-byte chunks and stitch them together. However, these two lines may be stored in different locations - one could be cached, the other could be in main memory. In extreme cases (page splits), one could in theory even be swapped out. As a result, you might get 2 data chunks from different times (a better term is observation points), where a store from some other process could have changed one in the middle.
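As a hedged illustration of how such a split access can arise (the buffer and the offset are made up so that an 8-byte read straddles a 64-byte line boundary):

#include <cstdint>
#include <cstring>

alignas(64) char buf[128];   // offset 60 puts an 8-byte value across the
                             // boundary between the first and second 64-byte line

int64_t read_split() {
    int64_t v;
    std::memcpy(&v, buf + 60, sizeof v);   // typically one unaligned 8-byte load
    return v;   // not guaranteed atomic: the two 4-byte halves can come from
                // different observation points in time
}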
On the other hand, once you add the lock prefix (or use an std::atomic definition, which should include that for you), x86 does guarantee that the result comes from a single observation point and is consistent with observations from all other threads. To achieve this, it's quite possible that the CPU will enforce a complete block of all cores (e.g. a bus lock) until both lines are secured in the requesting core. If you don't, you're risking a livelock where you constantly get one line, and lose it to another core by the time you got the second.
p.s. - user3286380 raised a good point, ++atomicInteger is not atomic, even if you declared it as such. std::atomic guarantees an atomic read and an atomic write (each on its own observation point), but it doesn't guarantee atomic read-modify-write unless you explicitly state that.
An atomic operation is an operation that cannot be done by multiple processors at the same time. If you want to do an addition atomically only one thread can be doing that operation.
If we perform an atomic operation on a multi-core CPU, does the atomic instruction freeze operations on all other cores?
No. Not necessarily: if you happen to have multiple threads trying to do the same atomic operation, then they will be halted, except the first one to reach that atomic statement.
I know that reading or writing a memory-aligned native type is atomic and does not impact any other core's execution.
Where did you read this? It does not sound correct to me. The result of this operation might depend on the architecture. But if you have multiple threads on x86 for example, and those threads try to write to the same location, the operation is not atomic by default. So the final value of the address that is being edited by threads can be anything.
Here is a similar discussion you might be interested in : pthreads: If I increment a global from two different threads, can there be sync issues?
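For reference, the usual way to make such a concurrent increment well-defined in C++ is an atomic read-modify-write; a minimal sketch (names made up):

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> counter{0};

void add_many() {
    for (int i = 0; i < 100000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);  // atomic read-modify-write
}

int main() {
    std::thread t1(add_many), t2(add_many);
    t1.join();
    t2.join();
    std::cout << counter.load() << '\n';   // always 200000
}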