Concurrent writes to different locations in the same cache line - c++

Suppose I have a C++11 application where two threads write to different but nearby memory locations, using simple pointers to primitive types. Can I be sure that both writes will end up in memory eventually (say, after both threads have reached a boost::barrier), or is there a risk that each CPU core holds its own cache line containing that data, and that the second core flushing its modification to RAM will overwrite and undo the modification done by the first write?
I hope that cache coherence will take care of this for me in all situations and on all setups compliant with the C++11 memory model, but I'd like to be sure.

Yes, the cache coherency mechanisms will take care of this. What you describe is called false sharing: the writes are safe, but you should avoid the pattern by separating the data better, since it costs performance.
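As an aside, here is a minimal sketch of what "separating the data" can look like. It uses C++17's std::hardware_destructive_interference_size; since the question targets C++11, you would hard-code a line size there, and the 64-byte fallback below is an assumption about typical x86-64 hardware:

    #include <cstddef>
    #include <new>      // std::hardware_destructive_interference_size (C++17)
    #include <thread>

    #ifdef __cpp_lib_hardware_interference_size
    constexpr std::size_t cache_line = std::hardware_destructive_interference_size;
    #else
    constexpr std::size_t cache_line = 64;  // assumed line size for older toolchains
    #endif

    // alignas pads each counter out to a full cache line, so the two threads
    // below never touch the same line and cannot falsely share it.
    struct alignas(cache_line) PaddedCounter {
        long value = 0;
    };

    int main() {
        PaddedCounter counters[2];
        std::thread t1([&] { for (int i = 0; i < 1000000; ++i) ++counters[0].value; });
        std::thread t2([&] { for (int i = 0; i < 1000000; ++i) ++counters[1].value; });
        t1.join();
        t2.join();
    }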

Related

Performance cost to multiple OpenMP threads reading (not writing) a shared variable?

In OpenMP (I am using C++), is there a performance cost if you have a shared (or even global) variable that is being repeatedly read (not written) by multiple threads? I am aware that if they were writing to the variable, this would be incorrect. I am asking specifically about reading only - is there a potential performance cost if multiple threads are repeatedly reading the same variable?
If you're only reading, then you have no safety issues. Everything will work fine. By definition, you don't have race conditions, and you don't need any locking, so no high-contention problems can happen. You can check for threading bugs at run-time using Clang's ThreadSanitizer.
On the other hand, there are some performance issues to be aware of. Try to avoid false sharing by having each thread work on a chunk of data that is consecutive in memory. That way, when the CPU loads a cache line, the data a thread needs next is already cached, and it doesn't have to go back to memory again and again. Accessing memory is very expensive (at least hundreds of times slower) compared to accessing the CPU cache.
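To make the read-only case concrete, a small OpenMP sketch (compile with -fopenmp; the loop bound is arbitrary). Every thread reads factor with no locking, and the reduction keeps the writes in thread-private storage:

    #include <cstdio>

    int main() {
        const double factor = 2.5;  // shared, read-only after initialization
        double sum = 0.0;

        // All threads read `factor` concurrently. Since nobody writes it, the
        // cache line holding it stays in the coherence protocol's "shared"
        // state and each core serves reads from its own cache.
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < 1000000; ++i)
            sum += factor * i;

        std::printf("%f\n", sum);
    }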
Good luck!
If the variable (more precisely, the memory location) is only read by all threads, you are basically fine both in terms of correctness and performance. Cache coherence protocols have a "shared" state, so the value can be cached on multiple cores at once.
However, you should also avoid writing other data on the same cache line as the variable, since that would invalidate the line for the other cores. Also, on a NUMA system you have to consider that reading some memory regions may be more expensive for certain cores/threads.

does false sharing occur when data is read in openmp?

If I have a C++ program with OpenMP parallelization, where different threads constantly use some small shared array only for reading data from it, does false sharing occur in this case? In other words, is false sharing related only to memory write operations, or can it also happen with memory read operations?
Commonly used cache coherence protocols, such as MESI (modified, exclusive, shared, invalid), have a specific state for cache lines called "shared". Cache lines are in this state when they are read by multiple processors: each processor then has its own copy of the line and can safely read from it without false sharing. On a write, all other processors are told to invalidate the cache line, and that invalidation is the main cause of false sharing.
False sharing is a performance issue because it causes additional movement of a cache line which takes time. When two variables which are not really shared reside in the same line and separate threads update each of them, the line has to bounce around the machine which increases the latency of each access. In this case if the variables were in separate lines each thread would keep a locally modified copy of "its" line and no data movement would be required.
However, if you are not updating a line, then no data movement is necessary and there is no performance impact from the sharing, beyond the fact that you might have been able to fill the line with data each thread does need, rather than data it doesn't. That is a small, second-order effect, though. So unless you know you are cache-capacity limited, ignore it!
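The "bouncing" described above is easy to observe. Here is a hedged micro-benchmark sketch (compile with optimizations; the 64-byte line size and the iteration count are assumptions, and exact timings vary by machine). The unpadded variant typically runs several times slower because the shared line ping-pongs between the two cores:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct Unpadded {
        std::atomic<long> a{0};
        std::atomic<long> b{0};              // shares a cache line with `a`
    };
    struct Padded {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};  // pushed onto its own line
    };

    template <class T>
    double timed_run() {
        T data;
        auto work = [](std::atomic<long>* p) {
            for (long i = 0; i < 50000000; ++i)
                p->fetch_add(1, std::memory_order_relaxed);
        };
        auto start = std::chrono::steady_clock::now();
        std::thread t1(work, &data.a), t2(work, &data.b);
        t1.join();
        t2.join();
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - start).count();
    }

    int main() {
        std::printf("same line:      %.2f s\n", timed_run<Unpadded>());
        std::printf("separate lines: %.2f s\n", timed_run<Padded>());
    }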

Efficiency of concurrent std::vector writes

According to http://en.cppreference.com/w/cpp/container#Thread_safety, it is safe to write to different elements of one std::vector, from different threads.
But if the value_type is smaller than the word size of the CPU (or the hardware destructive interference size), as with std::vector<char>, does this mean that access to elements is less efficient than it could be without the requirement for thread safety?
For example, does the read/write access imply memory fence/atomic instructions?
Yes, it is safe; the standard requires it to be safe. However, it might be inefficient due to what is called 'false sharing'.
False sharing happens when two individual threads update adjacent memory that belongs to the same cache line. If those threads happen to be executed on two different cores, they end up invalidating the cache line on both CPUs and triggering expensive cache updates.
The code writer should make reasonable efforts to make false sharing less likely, for example by assigning close indexes to the same thread.
And to answer the question I have just seen in the original post: no, there will be no compiler-generated fences on such writes.
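For the std::vector<char> case, "close indexes to the same thread" simply means giving each thread one contiguous chunk. A sketch (the thread count and vector size are arbitrary):

    #include <cstddef>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<char> data(1 << 20);
        const std::size_t nthreads = 4;
        const std::size_t chunk = data.size() / nthreads;

        // Each thread fills one contiguous chunk, so at worst the single
        // cache line straddling each chunk boundary can be falsely shared,
        // instead of every line in the vector.
        std::vector<std::thread> workers;
        for (std::size_t t = 0; t < nthreads; ++t)
            workers.emplace_back([&data, t, chunk] {
                for (std::size_t i = t * chunk; i < (t + 1) * chunk; ++i)
                    data[i] = static_cast<char>(i);  // distinct elements: no locks needed
            });
        for (auto& w : workers) w.join();
    }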
A conforming implementation of C++ must be able to write to a value of char without "inventing writes". In other words, char must be at least as big as the machine requires for isolated writes.
(It may still be inefficient to write to adjacent memory locations from multiple threads due to interference in the hierarchical memory, but it wouldn't be incorrect.)

Is modification of various cells of an array by many threads safe in c++ (boost)

I have an array of size n and n threads; the ith thread can read/write only the ith cell of the array. I do not use any memory locks. Is this safe with C++ Boost threads? How does this relate to the cache in the processors, which stores chunks of memory rather than single values? I guess that the cores of a processor share the cache and there is no duplication of data chunks within it, so when many modifications of the same chunk occur (though at various positions), there is no conflict between versions.
On any modern processor, writing to separate memory locations (even if adjacent) will pose no hazard. Otherwise, threading would be much, much harder.
Indeed, it is a relatively common idiom to have threads "fill out" the elements of an array: this is precisely what typical threaded implementations of linear algebra programs do, for example.
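For instance, a sketch of the "fill out" idiom with one thread per row of a small matrix (the computation itself is a placeholder):

    #include <thread>
    #include <vector>

    int main() {
        const int n = 4;
        std::vector<std::vector<double>> m(n, std::vector<double>(n));

        // Each thread writes only its own row. Rows are disjoint memory
        // regions, so no synchronization is needed while filling them.
        std::vector<std::thread> workers;
        for (int r = 0; r < n; ++r)
            workers.emplace_back([&m, r, n] {
                for (int c = 0; c < n; ++c)
                    m[r][c] = r * 10.0 + c;  // placeholder for real work
            });
        for (auto& w : workers) w.join();
    }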
Writing to separate memory locations will work correctly; however, 'false sharing' may cause performance problems depending on the patterns of data access and the specific architecture.
Oracle's OpenMP API docs have a good description of false sharing:
6.2.1 What Is False Sharing?
Most high performance processors, such as UltraSPARC processors, insert a cache buffer between slow memory and the high speed registers of the CPU. Accessing a memory location causes a slice of actual memory (a cache line) containing the memory location requested to be copied into the cache. Subsequent references to the same memory location or those around it can probably be satisfied out of the cache until the system determines it is necessary to maintain the coherency between cache and memory.
However, simultaneous updates of individual elements in the same cache line coming from different processors invalidates entire cache lines, even though these updates are logically independent of each other. Each update of an individual element of a cache line marks the line as invalid. Other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a more recent copy of the line from memory or elsewhere, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, and not for individual elements. As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.
This situation is called false sharing. If this occurs frequently, performance and scalability of an OpenMP application will suffer significantly.
False sharing degrades performance when all of the following conditions occur:
- Shared data is modified by multiple processors.
- Multiple processors update data within the same cache line.
- This updating occurs very frequently (for example, in a tight loop).
Note that shared data that is read-only in a loop does not lead to false sharing.
Before C++11, the Standard didn't address threading at all. Now it does. This rule is found in section 1.7:
A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width. [ Note: Various features of the language, such as references and virtual functions, might involve additional memory locations that are not accessible to programs but are managed by the implementation. — end note ] Two or more threads of execution (1.10) can update and access separate memory locations without interfering with each other.
An array is not a scalar, but its elements are. So each element is a distinct memory location, and therefore distinct elements can be used by different threads simultaneously with no need for locking or synchronization (as long as at most one thread accesses any given element).
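A sketch of where the memory-location rule draws the line; the bit-field case is the one trap, since adjacent non-zero-width bit-fields count as a single memory location:

    #include <thread>

    struct S {
        char a = 0;   // one memory location
        char b = 0;   // a separate memory location, even though adjacent
        int f1 : 4;   // f1 and f2 are adjacent non-zero-width bit-fields,
        int f2 : 4;   //   so together they form a SINGLE memory location
    };

    int main() {
        S s{};
        // Fine: a and b are distinct memory locations, so these
        // unsynchronized concurrent writes do not interfere.
        std::thread t1([&s] { s.a = 1; });
        std::thread t2([&s] { s.b = 2; });
        t1.join();
        t2.join();

        // NOT fine from two threads: s.f1 and s.f2 share one memory
        // location, so concurrent writes to them would be a data race.
        s.f1 = 1;
        s.f2 = 2;
    }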
However, you will cause a great deal of extra work for the cache coherency protocol if data stored in the same cache line is written by different threads. Consider adding padding, or interchanging the data layout so that all variables used by a thread are stored adjacently (an array of structures instead of a structure of arrays).

memory access vs. memory copy

I am writing an application in C++ where many threads need to read, but never write, the same memory many times.
My question is: from a performance point of view, will it be better to copy the memory for each thread, or to give all threads the same pointer and have all of them access the same memory?
Thanks
There is no definitive answer from the little information you have given about your target system and so on, but on a normal PC, the fastest option will most likely be not to copy.
One reason copying could be slow is that it might result in cache misses if the data area is large. A normal PC caches read-only access to the same data area very efficiently across threads, even if those threads happen to run on different cores.
One of the benefits explicitly listed by Intel for their approach to caching is "Allows more data-sharing opportunities for threads running on separate cores that are sharing cache". I.e. they encourage a practice where you don't have to program the threads to explicitly cache data, the CPU will do it for you.
Since you specifically mention many threads, I assume you have at least a multi-socket system. Typically, memory banks are associated with processor sockets, so one processor is "nearest" to its own memory banks and needs to communicate with the other processors' memory controllers to access data on the other banks. (Processor here means the physical thing in the socket.)
When you allocate data, typically a first-touch policy determines on which memory banks your data will be placed: the pages end up local to the processor that first writes them, and that processor can then access them faster than the other processors can.
So, at least for multiple processors (not just multiple cores), there should be a performance improvement from allocating a copy for every processor. Be sure to allocate/copy the data from every processor/thread, and not from a master thread (to exploit the first-touch policy). You also need to make sure that threads do not migrate between processors, because then you are likely to lose the affinity to your local memory.
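A sketch of what that looks like with OpenMP under the usual first-touch policy (compile with -fopenmp; OMP_PROC_BIND=true is the standard OpenMP environment variable for keeping threads from migrating; the array size is arbitrary):

    #include <cstdio>

    int main() {
        const long n = 1L << 24;
        double* data = new double[n];  // uninitialized: no page touched yet

        // First touch: each page lands on the NUMA node of the thread that
        // first writes it, so we initialize in parallel with the same static
        // schedule the compute phase will use.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            data[i] = 0.0;

        // Compute phase: with the same schedule, each thread works mostly on
        // pages that are local to its own processor.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            data[i] += 1.0;

        std::printf("%f\n", data[0]);
        delete[] data;
    }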
I am not sure how copying the data for every thread on a single processor would affect performance, but I guess not copying could improve the ability to share the contents of the higher-level caches that are shared between cores.
In any case, benchmark and decide based on actual measurements.