Efficiency of concurrent std::vector writes - c++

According to http://en.cppreference.com/w/cpp/container#Thread_safety, it is safe to write to different elements of one std::vector, from different threads.
But if the value_type is smaller than the word size of the CPU (or the hardware destructive interference size), as with std::vector<char>, does this mean that access to elements is less efficient than it could be without the requirement for thread safety?
For example, does the read/write access imply memory fence/atomic instructions?

Yes, it is safe; the standard requires it to be safe. However, it might be inefficient due to what is called 'false sharing'.
False sharing happens when two individual threads update adjacent memory that belongs to the same cache line. If those threads happen to be executed on two different cores, they end up invalidating the cache line on both CPUs and trigger expensive cache updates.
The code writer should make reasonable efforts to make false sharing less likely, for example by trying to assign nearby indexes to the same thread.
And to answer the question I have just seen in the original post - no, there will be no compiler-generated fences on such writes.
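As an illustration only (the helper name and the 64-byte chunk size are my own assumptions, not from the answer above), one common way to keep nearby indexes on the same thread is to hand each thread one contiguous, cache-line-sized slice of the vector:

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch: each thread fills its own contiguous range, so at most
// the two ranges meeting at a boundary can share a cache line.
// Assumes num_threads > 0.
void fill_in_chunks(std::vector<char>& data, unsigned num_threads) {
    const std::size_t line = 64;   // assumed cache-line size; C++17 offers
                                   // std::hardware_destructive_interference_size
    const std::size_t per_thread = std::max<std::size_t>(
        line, ((data.size() / num_threads + line - 1) / line) * line);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * per_thread;
        const std::size_t end   = std::min(begin + per_thread, data.size());
        workers.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] = 'x';     // every index is written by exactly one thread
        });
    }
    for (auto& w : workers) w.join();
}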

A conforming implementation of C++ must be able to write to a value of char without "inventing writes". In other words, char must be at least as big as the machine requires for isolated writes.
(It may still be inefficient to write to adjacent memory locations from multiple threads due to interference in the hierarchical memory, but it wouldn't be incorrect.)

Related

Using Mutex for shared memory of 1 word

I have an application where multiple threads access and write to a shared memory of 1 word (16bit).
Can I expect that the processor reads and writes a word from/to memory in an atomic operation? So I don't need mutex protection of the shared memory/variable?
The target is an embedded device running VxWorks.
EDIT: There's only one CPU and it is an old one (>7 years) - I am not exactly sure about the architecture and model, but I am also more interested in the general way that "most" CPUs will work. If it is a 16-bit CPU, would it then, in most cases, be fair to expect that it will read/write a 16-bit variable in one operation? Or should I always, in any case, use mutex protection? And let's say that I don't care about portability, and we talk about C++98.
All processors will read and write aligned machine-words atomically, in the sense that you won't get half the bits of the old value and half the bits of the new value if read by another processor.
To achieve good speed, modern processors will NOT synchronize read-modify-write operations to a particular location unless you actually ask for it - since nearly all reads and writes go to "non-shared" locations.
So, if the value is, say, a counter of how many times we've encountered a particular condition, or some other "if we read/write an old value, it'll go wrong" situation, then you need to ensure that two processors don't simultaneously update the value. This can typically be done with atomic instructions (or some other form of atomic updates) - this will ensure that one, and only one, processor touches the value at any given time, and that all the other processors DO NOT hold a copy of the value that they think is accurate and up to date when another has just made an update. See the C++11 std::atomic set of functions.
Note the distinction between atomically reading or writing the machine word's value and atomically performing the whole update.
The problem is not the atomicity of the access (which you can usually assume unless you are using an 8-bit microcontroller), but the missing synchronization, which leads to undefined behavior.
If you want to write portable code, use atomics instead. If you want to achieve maximal performance for your specific platform, read the documentation of your OS and compiler very carefully and see what additional mechanisms or guarantees they provide for multithreaded programs (But I really doubt that you will find anything more efficient than std::atomic that gives you sufficient guarantees).
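A minimal sketch of the portable (C++11 or later) approach these answers point at; the variable names are placeholders of mine, not from the question:

#include <atomic>
#include <cstdint>

// A 16-bit word shared between threads. std::atomic provides both indivisible
// reads/writes and the inter-thread visibility guarantees; on most platforms a
// 16-bit atomic is lock-free (check with is_lock_free()).
std::atomic<std::uint16_t> shared_word{0};

void writer() {
    shared_word.store(0x00FF);     // atomic write, visible to other threads
}

std::uint16_t reader() {
    return shared_word.load();     // atomic read, never a torn value
}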
Can I expect that the processor reads and writes a word from/to memory in an atomic operation?
Yes.
So I don't need mutex protection of the shared memory/variable?
No. Consider:
++i;
Even if the read and write are atomic, two threads doing this at the same time can each read, each increment, and then each write, resulting in only one increment where two are needed.
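If an actual increment is what you need, a sketch of the fix (assuming C++11 is available; the loop counts are arbitrary):

#include <atomic>
#include <thread>

std::atomic<int> counter{0};

void worker() {
    for (int n = 0; n < 100000; ++n)
        counter.fetch_add(1);    // one indivisible read-modify-write per increment
}

int main() {
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    // counter is now exactly 200000; with a plain int and ++i it could be less.
}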
Can I expect that the processor reads and writes a word from/to memory in an atomic operation?
Yes, if the data's properly aligned and no bigger than a machine word, most CPU instructions will operate on it atomically in the sense you describe.
So I don't need mutex protection of the shared memory/variable?
You do need some synchronisation - whether a mutex or atomic operations such as std::atomic.
The reasons for this include:
if your variable is not volatile, the compiler might not even emit read and write instructions for the memory address nominally holding that variable at the places you might expect; instead it may reuse values read or set earlier that are kept in CPU registers or known at compile time (see the sketch after this list)
if you use a mutex or std::atomic type you do not need to use volatile as well
further, even if the data is written towards memory, it may not leave the CPU caches and be written to actual RAM where other cores and CPUs can see it unless you explicitly use a memory barrier (std::mutex and std::atomic types do that for you)
finally, delays between reading and writing values can cause unexpected results, so operations like ++x can fail as explained by David Schwartz.
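A small sketch of the first two points (the names are mine, not from the answer): spinning on a plain bool may never terminate because the compiler can hoist the read out of the loop, whereas std::atomic<bool> forces a fresh, visible load on every iteration:

#include <atomic>
#include <thread>

std::atomic<bool> ready{false};   // with a plain 'bool ready' the waiting loop
                                  // below could legally be optimised into
                                  // 'if (!ready) while (true);'

void waiter() {
    while (!ready.load())         // a real load is performed every iteration
        std::this_thread::yield();
    // everything written by setter() before the store is visible here
}

void setter() {
    ready.store(true);            // releases the waiter
}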

Is modification of various cells of an array by many threads safe in c++ (boost)

I have an array of size n and n threads; the ith thread can read/write only the ith cell of the array. I do not use any memory locks. Is this safe with C++ Boost threads? How does this relate to the processor cache, which stores chunks of memory rather than single values? I guess the processor cores share the cache and there is no duplication of data chunks within it, so when many modifications of the same chunk occur (though at various positions) there is no conflict between versions.
On any modern processor, writing to separate memory locations (even if adjacent) will pose no hazard. Otherwise, threading would be much, much harder.
Indeed, it is a relatively common idiom to have threads "fill out" the elements of an array: this is precisely what typical threaded implementations of linear algebra programs do, for example.
Writing to separate memory locations will work correctly; however, 'false sharing' may cause performance problems depending on the patterns of data accesses and the specific architecture.
Oracle's OpenMP API docs have a good description of false sharing:
6.2.1 What Is False Sharing?
Most high performance processors, such as UltraSPARC processors, insert a cache buffer between slow memory and the high speed registers of the CPU. Accessing a memory location causes a slice of actual memory (a cache line) containing the memory location requested to be copied into the cache. Subsequent references to the same memory location or those around it can probably be satisfied out of the cache until the system determines it is necessary to maintain the coherency between cache and memory.
However, simultaneous updates of individual elements in the same cache line coming from different processors invalidates entire cache lines, even though these updates are logically independent of each other. Each update of an individual element of a cache line marks the line as invalid. Other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a more recent copy of the line from memory or elsewhere, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, and not for individual elements. As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.
This situation is called false sharing. If this occurs frequently, performance and scalability of an OpenMP application will suffer significantly.
False sharing degrades performance when all of the following conditions occur.
Shared data is modified by multiple processors.
Multiple processors update data within the same cache line.
This updating occurs very frequently (for example, in a tight loop).
Note that shared data that is read-only in a loop does not lead to false sharing.
Before C++11, the Standard didn't address threading at all. Now it does. This rule is found in section 1.7:
A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width. [ Note: Various features of the language, such as references and virtual functions, might involve additional memory locations that are not accessible to programs but are managed by the implementation. — end note ] Two or more threads of execution (1.10) can update and access separate memory locations without interfering with each other.
An array is not a scalar, but its elements are. So each element is a distinct memory location, and therefore distinct elements may be used by different threads simultaneously with no need for locking or synchronization (as long as at most one thread accesses any given element).
However, you will cause a great deal of extra work for the cache coherency protocol if data stored in the same cache line are written by different threads. Consider adding padding, or changing the data layout so that all variables used by a thread are stored adjacently (an array of structures instead of a structure of arrays).
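A minimal sketch of the padding idea (C++17 for the interference-size constant; the struct name and the 64-byte fallback are my own assumptions):

#include <cstddef>
#include <new>        // std::hardware_destructive_interference_size (C++17)
#include <thread>
#include <vector>

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kLine = 64;   // assumed cache-line size
#endif

// Each per-thread slot occupies its own cache line, so a thread updating its
// slot never invalidates a line that another thread is writing to.
struct alignas(kLine) PerThreadSlot {
    long value = 0;
};

void accumulate(std::vector<PerThreadSlot>& slots) {
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < slots.size(); ++t)
        workers.emplace_back([&slots, t] {
            for (int n = 0; n < 1000000; ++n)
                ++slots[t].value;   // no false sharing with neighbouring slots
        });
    for (auto& w : workers) w.join();
}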

How does a mutex ensure a variable's value is consistent across cores?

If I have a single int which I want to write to from one thread and read from on another, I need to use std::atomic, to ensure that its value is consistent across cores, regardless of whether or not the instructions that read from and write to it are conceptually atomic. If I don't, it may be that the reading core has an old value in its cache, and will not see the new value. This makes sense to me.
If I have some complex data type that cannot be read/written to atomically, I need to guard access to it using some synchronisation primitive, such as std::mutex. This will prevent the object getting into (or being read from) an inconsistent state. This makes sense to me.
What doesn't make sense to me is how mutexes help with the caching problem that atomics solve. They seem to exist solely to prevent concurrent access to some resource, but not to propagate any values contained within that resource to other cores' caches. Is there some part of their semantics I've missed which deals with this?
The right answer to this is magic pixies - i.e. it just works. The implementation of std::atomic for each platform must do the right thing.
The right thing is a combination of 3 parts.
Firstly, the compiler needs to know that it can't move instructions across boundaries [in fact it can in some cases, but assume that it doesn't].
Secondly, the cache/memory subsystem needs to know - this is generally done using memory barriers, although x86/x64 generally have such strong memory guarantees that this isn't necessary in the vast majority of cases (which is a big shame, as it's nice for wrong code to actually go wrong).
Finally the CPU needs to know it cannot reorder instructions. Modern CPUs are massively aggressive at reordering operations and making sure in the single threaded case that this is unnoticeable. They may need more hints that this cannot happen in certain places.
For most CPUs part 2 and 3 come down to the same thing - a memory barrier implies both. Part 1 is totally inside the compiler, and is down to the compiler writers to get right.
See Herb Sutter's talk 'Atomic Weapons' for a lot more interesting info.
The consistency across cores is ensured by memory barriers (which also prevents instructions reordering). When you use std::atomic, not only do you access the data atomically, but the compiler (and library) also insert the relevant memory barriers.
Mutexes work the same way: the mutex implementations (eg. pthreads or WinAPI or what not) internally also insert memory barriers.
Most modern multicore processors (including x86 and x64) are cache coherent. If two cores hold the same memory location in cache and one of them updates the value, the change is automatically propagated to other cores' caches. It's inefficient (writing to the same cache line at the same time from two cores is really slow) but without cache coherence it would be very difficult to write multithreaded software.
And as syam said, memory barriers are also required. They prevent the compiler or processor from reordering memory accesses, and also force the write into memory (or at least into cache) when, for example, a variable is held in a register because of compiler optimizations.
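A hedged sketch of how that looks in code (C++11; the struct and names are placeholders of mine). The unlock in one thread and the later lock in another form a release/acquire pair, so everything written inside the first critical section is visible inside the second:

#include <mutex>
#include <string>

struct Config {
    int retries = 0;
    std::string endpoint;
};

std::mutex m;
Config shared_config;                      // not atomic; protected by the mutex

void writer() {
    std::lock_guard<std::mutex> lock(m);
    shared_config.retries = 5;             // both writes happen inside the lock
    shared_config.endpoint = "backup";
}                                          // unlock releases the writes

void reader() {
    std::lock_guard<std::mutex> lock(m);   // lock acquires the writer's updates
    // Sees either the old state or the fully updated one, never a torn mixture.
    int r = shared_config.retries;
    (void)r;
}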

Concurrent writes to different locations in the same cache line

Suppose I have a C++11 application where two threads write to different but nearby memory locations, using simple pointers to primitive types. Can I be sure that both these writes will end up in memory eventually (probably after both have reached a boost::barrier), or is there a risk that both CPU cores hold their own cache line containing that data, and the second core flushing its modification to RAM will overwrite and undo the modification done by the first write?
I hope that cache coherence will take care of this for me in all situations and on all setups compliant with the C++11 memory model, but I'd like to be sure.
Yes, the cache coherency mechanisms will take care of this: coherency is maintained at cache-line granularity, so one core's write-back cannot silently undo another core's write to a different part of the same line. What you are describing is false sharing, which is a performance problem rather than a correctness one, and should be avoided by better separating the data.

Is it possible to have a race condition for a write-only operation?

If I have several threads trying to write the same value to a single location in memory, is it possible to have a race condition? Can the data somehow get corrupted during the writes? There is no preceding read or test conditions, only the write...
EDIT: To clarify, I'm computing a dot product on a GPU. I'm using several threads to calculate the individual products (one thread per row/column element) and saving them to a temporary location in memory. I need to then sum those intermediate products and save the result.
I was thinking about having all threads individually perform this sum/store operation since branching on a GPU can hurt performance. (You would think it should take the same amount of time for the sum/store whether it's done by a single thread or all threads, but I've tested this and there is a small performance hit.) All threads will get the same sum, but I'm concerned about a race condition when they each try to write their answer to the same location in memory. In the limited testing I've done, everything seems fine, but I'm still nervous...
Under most threading standards on most platforms, this is simply prohibited or undefined. That is, you are not allowed to do it and if you do, anything can happen.
High-level language compilers like those for C and C++ are free to optimize code based on the assumption that you will not do anything you are not allowed to do. So a "write-only" operation may turn out to be no such thing. If you write i = 1; in C or C++, the compiler is free to generate the same code as if you wrote i = 0; i++;. Similarly confounding optimizations really do occur in the real world.
Instead, follow the rules for whatever threading model you are using to use appropriate synchronization primitives. If your platform provides them, use appropriate atomic operations.
There is no problem having multiple threads writing a single (presumably shared or global) memory location in CUDA, even "simultaneously" i.e. from the same line of code.
If you care about the order of the writes, then this is a problem, as CUDA makes no guarantee of order, for multiple threads executing the same write operation to the same memory location. If this is an issue, you should use atomics or some other method of refactoring your code to sort it out. (It doesn't sound like this is an issue for you.)
Presumably, as another responder has stated, you care about the result at some point. Therefore it's necessary to have a barrier of some sort, either explicit (e.g. __syncthreads(), for multiple threads within a block using shared memory, for example) or implicit (e.g. end of a kernel, for multiple threads writing to a location in global memory) before you read that location and expect a sensible result. Note these are not the only possible barrier methods that could give you sane results, just two examples. Warp-synchronous behavior or other clever coding techniques could be leveraged to ensure the sanity of a read following a collection of writes.
Although on the surface the answer would seem to be no, there are no race conditions, the answer is a bit more nuanced. Boris is right that on some 32-bit architectures, storing a 64-bit long or address may take two operations and therefore may be read in an invalid state. This is probably pretty difficult to reproduce since memory pages are what typically are updated and a long would never span a memory page.
However, the more important issue is that you need to realize that without memory synchronization there are no guarantees around when a thread would see the updated value. A thread could run for a long period of time reading an out-of-date value from memory. It wouldn't be an invalid value but it would not be the most recent one written. That may not specifically cause a "race-condition" but it might cause your program to perform in an unexpected manner.
Also, although you say it is "write-only", obviously someone is reading the value otherwise there would be no reason to perform the update. The details of what portion of the code is reading the value will better inform us as to whether the write-only without synchronization is truly safe.
If write-only operations are not atomic, obviously there will be a moment when another thread may observe the data in a corrupted state.
For example, consider writing to a 64-bit integer that is stored as a pair of 32-bit integers:
Thread A has just finished writing the high-order word, and thread B has just finished writing the low-order word and is about to set the high-order word;
thread C may then see an integer consisting of the low-order word written by thread B and the high-order word written by thread A.
P.S. this question is very generic; actual results will depend on the memory model of the environment (language) and the underlying processor architecture (hardware).