If I have a C++ program with OpenMP parallelization, where different threads constantly use some small shared array only for reading data from it, does false sharing occur in this case? In other words, is false sharing related only to memory write operations, or can it also happen with memory read operations?
Typically used cache coherence protocols, such as MESI (modified, exclusive, shared, invalid), have a specific state for cache lines called "shared". Cache lines are in this state when they are read by multiple processors. Each processor then has its own copy of the cache line and can safely read from it without false sharing. On a write, all other processors are told to invalidate the cache line, which is the main cause of false sharing.
False sharing is a performance issue because it causes additional movement of a cache line, which takes time. When two variables that are not really shared reside in the same line and separate threads update each of them, the line has to bounce around the machine, which increases the latency of each access. In this case, if the variables were in separate lines, each thread would keep a locally modified copy of "its" line and no data movement would be required.
However, if you are not updating a line, then no data movement is necessary and there is no performance impact from the sharing, beyond the fact that you might have been able to put data each thread does need in there rather than data it doesn't. That is a small, second-order effect, though. So unless you know you are cache-capacity limited, ignore it!
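As a minimal sketch of the read-only case from the question (the table contents and sizes are just illustrative), every thread below reads the same small array; each core keeps its own copy of the line(s) in the "shared" state and no invalidations occur:

#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    // Small lookup table that is only ever read inside the parallel region.
    const std::vector<double> table = {1.0, 2.0, 3.0, 4.0};
    const int n = 1000000;
    double total = 0.0;

    // Every core caches the line(s) holding `table` in the "shared" state;
    // nobody writes to the table, so those lines are never invalidated.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += table[i % table.size()];
    }

    std::printf("total = %f\n", total);
}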
In OpenMP (I am using C++), is there a performance cost if you have a shared (or even global) variable that is being repeatedly read (not written) by multiple threads? I am aware that if they were writing to the variable, this would be incorrect. I am asking specifically about reading only - is there a potential performance cost if multiple threads are repeatedly reading the same variable?
If you're only reading, then you have no safety issues. Everything will work fine. By definition, you don't have Race Conditions. You don't need to do any locking, so no high-contention problems can happen. You can test thread safety at run-time using the Clang ThreadSanitizer.
On the other hand, there are some performance issues to be aware of. Try to avoid false sharing by making every thread access a chunk of data that is consecutive in memory at a time. This way, when the CPU loads a cache line, it won't need to go back to main memory repeatedly. Accessing memory is very expensive (at least hundreds of times slower) compared to accessing the CPU cache.
Good luck!
If the variable (more precisely, the memory location) is only read by all threads, you are basically fine both in terms of correctness and performance. Cache protocols have a "shared" state, so the value can be cached on multiple cores.
However, you should also avoid writing data that lies on the same cache line as the variable, as that would invalidate the cache line for the other cores. Also, on a NUMA system you have to consider that reading some memory regions may be more expensive for certain cores/threads.
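For instance, here is a hedged sketch (the struct and field names are made up) of keeping a frequently written value off the cache line that holds the read-only data:

#include <atomic>

// Hypothetical shared state: read-only parameters polled by many threads,
// plus a counter that one thread updates often.
struct SharedState {
    // Read-only after initialization; cached as "shared" on every core.
    double coefficients[4];

    // alignas(64) pushes the counter onto its own cache line (64 bytes is a
    // typical line size), so its updates do not invalidate the coefficients.
    alignas(64) std::atomic<long> progress{0};
};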
We know that two instructions can be reordered by an OoOE processor. For example, there are two global variables shared among different threads.
int data;
bool ready;
A writer thread produces data and turns on a flag, ready, to allow readers to consume that data.
data = 6;
ready = true;
Now, on an OoOE processor, these two instructions can be reordered (instruction fetch, execution). But what about the final commit/write-back of the results? i.e., will the store be in-order?
From what I learned, this totally depends on a processor's memory model. E.g., x86/64 has a strong memory model, and reordering of stores is disallowed. By contrast, ARM typically has a weak model where store reordering can happen (along with several other reorderings).
Also, my gut feeling tells me that I am right, because otherwise we wouldn't need a store barrier between those two instructions, as is used in typical multi-threaded programs.
But here is what Wikipedia says:
... In the outline above, the OoOE processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data. OoOE processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal.
I'm confused. Is it saying that the results have to be written back in order? Really, in an OoOE processor, can the stores to data and ready be reordered?
The simple answer is YES on some processor types.
Before your code even reaches the CPU, it faces an earlier problem: compiler reordering.
data = 6;
ready = true;
The compiler is free to rearrange these statements since, as far as it knows, they do not affect each other (it is not thread-aware).
Now down to the processor level:
1) An out-of-order processor can process these instructions in a different order, including reversing the order of the stores.
2) Even if the CPU performs them in order, the memory controller may not perform them in order, because it may need to flush or bring in new cache lines or do an address translation before it can write them.
3) Even if this doesn't happen, another CPU in the system may not see them in the same order. To observe them, it may need to bring in the modified cache lines from the core that wrote them. It may not be able to bring one cache line in earlier than another if it is held by another core or if there is contention for that line among multiple cores, and its own out-of-order execution will read one before the other.
4) Finally, speculative execution on other cores may read the value of data before ready was set by the writing core, and by the time it gets around to reading ready, it was already set but data was also modified.
These problems are all solved by memory barriers. Platforms with weakly-ordered memory must make use of memory barriers to ensure memory coherence for thread synchronization.
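In C++11 terms, a minimal sketch of the data/ready example using release/acquire atomics, which make the compiler and processor emit whatever barriers the target architecture needs:

#include <atomic>
#include <cstdio>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void writer() {
    data = 6;                                      // plain store
    ready.store(true, std::memory_order_release);  // nothing above may be moved below this
}

void reader() {
    while (!ready.load(std::memory_order_acquire)) {}  // nothing below may be moved above this
    std::printf("%d\n", data);                          // guaranteed to print 6
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}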
The consistency model (or memory model) for the architecture determines which memory operations can be reordered. The idea is always to achieve the best performance from the code while preserving the semantics expected by the programmer. That is the point of the Wikipedia passage: the memory operations appear in order to the programmer, even though they may have been reordered. Reordering is generally safe when the code is single-threaded, as the processor can easily detect potential violations.
On x86, the common model is that writes are not reordered with other writes. Yet the processor is using out-of-order execution (OoOE), so instructions are being reordered constantly. Generally, the processor has several additional hardware structures to support OoOE, like a reorder buffer and a load-store queue. The reorder buffer ensures that all instructions appear to execute in order, so that interrupts and exceptions break at a specific point in the program. The load-store queue functions similarly, in that it can restore the order of memory operations according to the memory model. The load-store queue also disambiguates addresses, so that the processor can identify when operations are made to the same or to different addresses.
Back to OoOE: the processor keeps tens to hundreds of instructions in flight at any time. Loads and stores are computing their addresses, etc. The processor may prefetch the cache lines for the accesses (which may include cache coherence traffic), but it cannot actually access the line, either to read or to write, until it is safe (according to the memory model) to do so.
Inserting store barriers, memory fences, etc. tells both the compiler and the processor about further restrictions on reordering the memory operations. The compiler is part of implementing the memory model, as some languages like Java have a specific memory model, while others like C follow the rule that memory accesses should appear as if they were executed in order.
In conclusion, yes, data and ready can be reordered on an OoOE processor. But whether they actually are depends on the memory model. So if you need a specific order, provide the appropriate indication using barriers, etc., so that the compiler, processor, etc. will not choose a different order for higher performance.
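The same ordering can also be requested with explicit fences around relaxed atomic operations; this is just a sketch of the idea, reusing the data/ready pair from the question:

#include <atomic>

std::atomic<int>  data{0};
std::atomic<bool> ready{false};

void writer() {
    data.store(6, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);  // the "store barrier"
    ready.store(true, std::memory_order_relaxed);
}

int reader() {
    while (!ready.load(std::memory_order_relaxed)) {}
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the release fence
    return data.load(std::memory_order_relaxed);           // observes 6
}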
On modern processors, the store itself is asynchronous (think of it as submitting a change to the L1 cache and continuing execution; the cache system then propagates it asynchronously). So changes to two objects lying on different cache lines may become visible out of order from another CPU's perspective.
Furthermore, even the instructions that store that data can be executed out of order. For example, when two objects are stored "at the same time", the cache line of one object may be held/locked by another CPU or by bus mastering, so the other object may be committed earlier.
Therefore, to properly share data across threads, you need some kind of memory barrier, or you can make use of the transactional memory features found in recent CPUs, such as Intel TSX.
I think you're misinterpreting "appear that the instructions were processed as normal." What that means is that if I have:
add r1 + 7 -> r2
move r3 -> r1
and the order of those is effectively reversed by out-of-order execution, the value that participates in the add operation will still be the value of r1 that was present prior to the move. Etc. The CPU will cache register values and/or delay register stores to assure that the "meaning" of a sequential instruction stream is not changed.
This says nothing about the order of stores as visible from another processor.
I have an array of size n and n threads, and the ith thread reads/writes only the ith cell of the array. I do not use any memory locks. Is this safe with C++ Boost threads? How is this related to the processor cache, since the cache stores chunks of memory, not single values? I guess the cores of a processor share the cache and there is no duplication of data chunks within the cache, so when many modifications of the same chunk (though at various positions) occur, there is no conflict between versions.
On any modern processor, writing to separate memory locations (even if adjacent) will pose no hazard. Otherwise, threading would be much, much harder.
Indeed, it is a relatively common idiom to have threads "fill out" the elements of an array: this is precisely what typical threaded implementations of linear algebra programs do, for example.
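A sketch of that idiom with standard threads (Boost threads behave the same way here); each thread writes only its own element, so no locking is needed:

#include <cstddef>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 8;
    std::vector<double> results(n);
    std::vector<std::thread> workers;

    // Thread i writes only results[i]; distinct elements are distinct
    // memory locations, so this is race-free without any locks.
    for (std::size_t i = 0; i < n; ++i) {
        workers.emplace_back([&results, i] { results[i] = i * 2.0; });
    }
    for (auto& t : workers) t.join();
}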
Writing to separate memory locations will work correctly, however 'false sharing' may cause performance problems depending on the patterns of data accesses and the specific architecture.
Oracle's OpenMP API docs have a good description of false sharing:
6.2.1 What Is False Sharing?
Most high performance processors, such as UltraSPARC processors, insert a cache buffer between slow memory and the high speed registers of the CPU. Accessing a memory location causes a slice of actual memory (a cache line) containing the memory location requested to be copied into the cache. Subsequent references to the same memory location or those around it can probably be satisfied out of the cache until the system determines it is necessary to maintain the coherency between cache and memory.
However, simultaneous updates of individual elements in the same cache line coming from different processors invalidates entire cache lines, even though these updates are logically independent of each other. Each update of an individual element of a cache line marks the line as invalid. Other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a more recent copy of the line from memory or elsewhere, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, and not for individual elements. As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.
This situation is called false sharing. If this occurs frequently, performance and scalability of an OpenMP application will suffer significantly.
False sharing degrades performance when all of the following conditions occur.
Shared data is modified by multiple processors.
Multiple processors update data within the same cache line.
This updating occurs very frequently (for example, in a tight loop).
Note that shared data that is read-only in a loop does not lead to false sharing.
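As a concrete illustration of those three conditions (the iteration count and array size here are arbitrary), the following OpenMP sketch has each thread updating only its own counter, but the counters are adjacent and share a cache line, so the line ping-pongs between cores:

#include <cstdio>
#include <omp.h>

int main() {
    enum { kMaxThreads = 16 };
    long counters[kMaxThreads] = {0};  // adjacent 8-byte counters: several share one cache line

    #pragma omp parallel num_threads(4)
    {
        const int tid = omp_get_thread_num();
        // Logically independent updates, but each write invalidates the
        // line in the other cores' caches: false sharing in a tight loop.
        for (long i = 0; i < 100000000L; ++i) {
            counters[tid]++;
        }
    }
    std::printf("%ld\n", counters[0]);
}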
Before C++11, the Standard didn't address threading at all. Now it does. This rule is found in section 1.7:
A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width. [ Note: Various features of the language, such as references and virtual functions, might involve additional memory locations that are not accessible to programs but are managed by the implementation. — end note ] Two or more threads of execution (1.10) can update and access separate memory locations without interfering with each other.
An array is not a scalar, but its elements are. So each element is a distinct memory location, and therefore distinct elements can be used by different threads simultaneously with no need for locking or synchronization (as long as at most one thread accesses any given element).
However, you will cause a great deal of extra work for the cache coherency protocol if data stored in the same cache line is written by different threads. Consider adding padding, or rearranging the data layout so that all the variables used by a thread are stored adjacently (an array of structures instead of a structure of arrays).
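One way to follow that advice, assuming a 64-byte cache line, is to pad each thread's data out to its own line:

struct alignas(64) PerThread {
    long counter = 0;
    // alignas(64) rounds the size of the struct up to a full cache line,
    // so adjacent PerThread objects never share a line.
};

PerThread per_thread[16];  // per_thread[i] is touched only by thread i

With this layout, thread i can update per_thread[i].counter in a tight loop without invalidating the lines used by the other threads.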
I just found this library, which provides a lock-free ring that works much faster than channels: https://github.com/textnode/gringo (and it really is faster, especially with GOMAXPROCS > 1).
But the interesting part is the struct for managing the queue state:
type Gringo struct {
    padding1           [8]uint64
    lastCommittedIndex uint64
    padding2           [8]uint64
    nextFreeIndex      uint64
    padding3           [8]uint64
    readerIndex        uint64
    padding4           [8]uint64
    contents           [queueSize]Payload
    padding5           [8]uint64
}
If I remove the "paddingX [8]uint64" fields it works about 20% slower. How can that be?
I'd also appreciate it if someone explained why this lock-free algorithm is much faster than channels, even buffered ones.
Padding eliminates false sharing by putting each frequently-accessed field on its own cache line. If two variables share a cache line, a read of an unmodified variable will be as expensive as a read of a modified variable if there's an intervening write to the other variable.
When a variable is read on multiple cores and not modified, the cache line is shared by the cores. This makes the reads very cheap. Before any core can write to any part of that cache line, it must invalidate the cache line on other cores. If any core later reads from that cache line, it will find the cache line invalidated and have to go back to sharing it. This makes painful extra cache coherency traffic when one variable is frequently modified and the other is frequently read.
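The same padding idea expressed in C++ (just a sketch mirroring the Go struct above, assuming a 64-byte cache line): the producer-side and consumer-side indices each get their own line, so frequent reads of one index are not invalidated by writes to the other.

#include <atomic>
#include <cstdint>

struct RingIndices {
    alignas(64) std::atomic<std::uint64_t> lastCommittedIndex{0};  // written by the producer
    alignas(64) std::atomic<std::uint64_t> nextFreeIndex{0};
    alignas(64) std::atomic<std::uint64_t> readerIndex{0};         // written by the consumer
};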
It works faster because it does not require locks. There is an implementation in Java (called Disruptor) which works really well, and it seems to be the inspiration for gringo. They explain the cost of locks and how you can increase throughput in their paper.
As for the padding, the paper also hints at some of the reasons. Basically: processor caches. This paper explains it well. You can gain tremendous performance by making the processor hit its Level 1 cache instead of going through memory or its outer caches as often as possible. But this requires taking extra precautions, as the processor will fully load its cache and reload it (from memory or the level 2-3 caches) every time that is required.
In the case of a concurrent data structure, as @David Schwartz said, false sharing will force the processor to reload its cache line much more often, because some data loaded in the rest of the line may be modified by another thread, forcing the whole line to be loaded again.
Suppose I have a C++11 application where two threads write to different but nearby memory locations, using simple pointers to primitive types. Can I be sure that both these writes will end up in memory eventually (probably after both have reached a boost::barrier), or is there a risk that both CPU cores hold their own cache line containing that data, and the second core flushing its modification to RAM will overwrite and undo the modification done by the first write?
I hope that cache coherence will take care of this for me in all situations and on all setups compliant with the C++11 memory model, but I'd like to be sure.
Yes, the cache coherency mechanisms will take care of this. Writes from different cores landing in the same cache line is called false sharing, though, and it should be avoided by separating the data better, to increase performance.