In OpenMP (I am using C++), is there a performance cost if you have a shared (or even global) variable that is being repeatedly read (not written) by multiple threads? I am aware that if they were writing to the variable, this would be incorrect. I am asking specifically about reading only - is there a potential performance cost if multiple threads are repeatedly reading the same variable?
If you're only reading, then you have no safety issues. Everything will work fine. By definition, you don't have Race Conditions. You don't need to do any locking, so no high-contention problems can happen. You can test thread safety at run-time using the Clang ThreadSanitizer.
On the other hand, there are some performance issues to be aware of. Try to avoid false sharing by having each thread access a chunk of data that is consecutive in memory. That way, when the CPU loads a cache line, it serves several subsequent accesses instead of going back to main memory every time. Accessing main memory is very expensive (at least hundreds of times slower) compared to accessing the CPU cache.
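As a rough sketch of that advice in OpenMP (the table, its size, and the reduction are only placeholders), each thread below reads a contiguous chunk of shared, read-only data with no locking at all:

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Shared, read-only table; size and contents are placeholders.
    const std::vector<double> table(1 << 20, 1.5);

    double total = 0.0;
    // schedule(static) hands each thread one contiguous chunk, so every
    // cache line a thread pulls in is reused by its neighbouring reads.
    #pragma omp parallel for schedule(static) reduction(+ : total)
    for (long long i = 0; i < (long long)table.size(); ++i)
        total += table[i];              // reads only: no race, no locking needed

    std::printf("sum = %f\n", total);
}
```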
Good luck!
If the variable (more precisely, the memory location) is only read by all threads, you are basically fine both in terms of correctness and performance. Cache coherence protocols have a "shared" state, so the value can be cached on multiple cores at the same time.
However, you should avoid writing data that sits on the same cache line as the variable, as that would invalidate the cached copies on the other cores. Also, on a NUMA system you have to consider that some memory regions may be more expensive to read for certain cores/threads.
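A minimal sketch of keeping writes off that cache line, assuming 64-byte lines (the struct and field names are invented):

```cpp
#include <atomic>

// Keep a frequently written counter off the cache line that holds the
// read-only value, so the writer does not keep invalidating the line
// that all the readers depend on.
struct Shared {
    int read_only_config = 0;               // read by many threads, never written
    alignas(64) std::atomic<long> hits{0};  // written often; forced onto its own line
};
```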
I believe that in single-processor systems, more than one store will happen one after the other, but what is the case for multiprocessor systems?
Adding to the question: if the machine is 32-bit and we try to write a long int (64-bit) value to memory, how will the load/store instructions behave?
The reason for the above two questions is: if someone tries to read the same memory (a location of 32/64 bits on a 32-bit system) from another thread, will this be safe, or do I need to consider using locks?
Added:
I want to use as few locks as possible, since ours is a time-critical execution. Hence I wanted to understand whether there is ever a possibility of two store/load instructions executing at the same instant of time on the same memory location, when things get executed in a multiprocessor environment.
You are wrong if you only look at load/store CPU instructions.
The compiler, your OS, and the CPU can:
change the execution order to optimize the code
hold values in separate caches
store data in CPU registers without accessing cache or other memory
optimize the access away completely
... and a lot more, I believe!
If you want to access the same variable from different threads, you must use a synchronization mechanism provided by your language or by a library that fits your OS. Nothing else gives you a guarantee that it will work.
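For example, a minimal sketch using the mechanisms the C++ standard library itself provides (the variable and the thread bodies are invented for illustration):

```cpp
#include <functional>
#include <mutex>
#include <thread>

long counter = 0;
std::mutex m;

void writer() {
    std::lock_guard<std::mutex> lock(m);   // the lock gives ordering and visibility
    ++counter;
}

void reader(long& out) {
    std::lock_guard<std::mutex> lock(m);   // a read under the same lock sees the update
    out = counter;
}

int main() {
    long seen = 0;
    std::thread t1(writer);
    std::thread t2(reader, std::ref(seen));
    t1.join();
    t2.join();
}
```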
The problem is not the actual access to any kind of memory. You must make sure that your code contains the memory barriers needed by the underlying libraries and OS support. If there are no barriers between the multithreaded accesses, you may never see, from a second thread, a change written by the first.
This is also a problem on a single-core CPU, because the compiler has no idea that you modify a variable from two threads if you don't use any kind of synchronization.
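A small sketch of that barrier aspect with C++11 atomics (the names are invented): the release store and the acquire load are what make the earlier plain write visible to the thread that observes the flag.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                  // plain data handed from one thread to another
std::atomic<bool> ready{false};   // the flag carries the ordering

void producer() {
    payload = 42;                                   // plain write happens first
    ready.store(true, std::memory_order_release);   // publish: the write above may not sink below
}

void consumer() {
    while (!ready.load(std::memory_order_acquire))  // spin until published (sketch only)
        ;
    // payload is guaranteed to read 42 here.
}

int main() {
    std::thread a(producer), b(consumer);
    a.join();
    b.join();
}
```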
To your add-on:
You simply have no control over any kind of memory access without writing your code in assembler. And if you write it in assembler, you have to deal with registers, L1/L2/Lx caching, memory mapping, inter-CPU communication and so on. Forget about the load/store instructions - they are only 1% of the job!
If you have time-critical jobs:
try to pin the thread to a fixed core (see the detailed description in the threading library you use, e.g. POSIX pthreads); a sketch of this follows the list below
it can be much faster to run a single process with a single thread and program it in a cooperative fashion - no locks, no memory barriers, no IPC - but you then have to handle all the thread-like problems yourself; it is fast, though
often it is much faster to split your problem into several processes, each with only one thread, and keep the IPC minimal; this needs a deep understanding of how you can scale your algorithms
often a very simple 8/16-bit CPU runs much faster in special environments than a fat 8-core CPU with a fat OS on it
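As an illustration of the first point, a Linux-specific sketch of pinning the calling thread to one core; the function name is made up, and the affinity call is a non-portable GNU extension (other systems have their own calls, e.g. SetThreadAffinityMask on Windows):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

void pin_to_core_zero() {
    cpu_set_t set;
    CPU_ZERO(&set);     // start with an empty CPU set
    CPU_SET(0, &set);   // allow only core 0
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```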
But you don't tell us the rest of your environment and requirements, so no answer can fully address your real problem. Just keep in mind: load/store was yesterday.
This can not be answered generically. You have to know which model of what design of processor it is. An AMD Opteron will be different from an Intel Pentium, which is different from an Intel Core2, and all of those are different from an ARMv7 design. [They are probably fairly similar, but there are details you may care about if you REALLY want to rely on these operations being performed in a specific way]. And of course, if you share memory between, say, a GPU (graphics processing unit) and a CPU, you have even more possible scenarios of "different design".
There are single-core processors that are "superscalar" (more than one execution unit) and support "out of order execution" (reordering of instructions); such a core has more than one execution unit, possibly including more than one load/store unit, and thus more than one instruction (including loads or stores) can be performed at the same time.
Obviously, once the processor determines that the memory operation needs to go "outside" (that is, the value is not available in the cache), it has to be serialized, but there is no guarantee that a load or store, as sequenced by you or the compiler, won't be reordered relative to other loads and stores. If the processor has instructions to support "data wider than the bus" (e.g. a 32-bit processor loading a 64-bit word), these are typically atomic on that processor. If the processor does not itself support 64-bit words, then loading a 64-bit value would take two 32-bit loads.
[When I write "load", the same applies for "store"]
In case of multiprocessor or multicore architectures, it becomes a system architecture question, which makes it even more complicated than "we can't answer this without understanding the processor design", since there are now more components involved: memory design (one lump of memory shared between processors, several lumps of memory that are not directly shared, etc).
In general, if you have multiple threads, you will need to use atomic operations - most processors have a way to say "I want this to happen without someone else interfering". In the old days it was a "lock" pin on the processor(s) that was wired to anything else that could access the memory bus; while that pin was active, all other devices had to wait for it to become inactive before accessing the memory bus. These days it's a fair bit more sophisticated, since there are caches involved. Most systems use an "exclusive cache content" method: the processor signals all its peers "I want this address to be exclusive in my cache", at which point all other processors "flush and invalidate" that particular address in their caches. The atomic operation is then performed in the cache, and the result becomes visible to other processors only when the atomic operation is completed. This is a pretty simplified view of how it works - modern processors are very complex, and there is a lot of work involved with such seemingly simple things as "make sure this value gets updated in a way that doesn't get interrupted by some other processor writing to the same thing".
If there isn't support in the processor for "atomic" operations, then there has to be proper locks (and any processor designed for use in a multicore/multicpu environment will have operations to support locks in some way), where the lock is taken before updating something, and then released after the update. This is clearly more complex than having builtin atomic operations, but it makes the design of the processor simpler. Also, for more complex updates (where more than one 32- or 64-bit value needs updating) this sort of locking is still required - for example, if we have a "queue" where you have a "where we're writing" and "elements in queue" that both need to be updated on write, you can't do that in a single operation [without being VERY clever about it, at least].
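A sketch of that queue case (the fixed-size ring buffer here is hypothetical): the lock is what makes the two related updates appear as a single operation to every other thread.

```cpp
#include <array>
#include <cstddef>
#include <mutex>

class Queue {
    std::array<int, 64> slots{};
    std::size_t write_pos = 0;   // "where we're writing"
    std::size_t count = 0;       // "elements in queue"
    std::mutex m;

public:
    bool push(int value) {
        std::lock_guard<std::mutex> lock(m);
        if (count == slots.size())
            return false;                            // queue full
        slots[write_pos] = value;
        write_pos = (write_pos + 1) % slots.size();  // both fields updated...
        ++count;                                     // ...under the same lock
        return true;
    }
};
```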
In heterogeneous systems, such as GPU + CPU combinations, you can't do atomics between different devices, because the cache of one device doesn't "understand the language" of the other device. So when the CPU says "I want this as exclusive", the GPU sees "Hurdi gurdi meatballs" and thinks "I have no idea what that is about, I'll just ignore it" [or something like that]. In this case, there has to be some other way to access shared data, and it's typically never atomic: you have to send commands (via means other than the inter-processor signalling system) to the GPU to say "flush your cache, and tell me when you're done with that", and when the CPU has written something the GPU needs, the CPU will flush its cache before telling the GPU that it can use the data. This can get pretty messy, and takes a fair amount of time.
I believe that in single-processor systems, more than one store will happen one after the other.
False. Most machines are set up like that, but for performance reasons many CPUs can be configured to have a much more relaxed store ordering. This is almost never a problem for an application on a single CPU (because the CPU will make it look like you expect) but it's really critical to understand when talking to hardware.
Here's a wikipedia article: http://en.wikipedia.org/wiki/Memory_ordering
This gets doubly complicated on CPUs with non-coherent local caches. Because then you can have strong ordering as seen from this CPU while other CPUs will see totally different results depending on the cache flush order.
Adding to the question: if the machine is 32-bit and we try to write a long int (64-bit) value to memory, how will the load/store instructions behave?
Some 32 bit CPUs have instructions to do atomic 64 bit writes, others don't. Those that don't will do two separate writes where a partial result can be seen by other CPUs or threads (if you get unlucky with context switching) or signal handlers or interrupt handlers.
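If you are on C++11 or later, one way around the torn write is to let the compiler pick the right mechanism; a minimal sketch (the variable is invented):

```cpp
#include <atomic>
#include <cstdint>

// On a 32-bit target a plain uint64_t store may become two 32-bit stores
// (a torn write). std::atomic makes the compiler use an atomic 64-bit
// instruction if the CPU offers one, or fall back to a lock otherwise
// (is_lock_free() tells you which you got).
std::atomic<std::uint64_t> shared_value{0};

void writer() { shared_value.store(0x1122334455667788ULL); }

std::uint64_t reader() { return shared_value.load(); }  // never observes a half-written value
```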
The reason for the above two questions is: if someone tries to read the same memory (a location of 32/64 bits on a 32-bit system) from another thread, will this be safe, or do I need to consider using locks?
Yes, no, maybe. If it's just one value and it doesn't tell the other thread that some other memory might be in a certain state, then yes, it can be safe in certain circumstances. You're not guaranteed that the other thread will see the changed value in memory for a long time, but eventually it should see it.
Generally, you can't reason about the behavior of access to shared memory in a threaded environment without strictly following the documentation of the thread model you're using. Most of those models say something like: without locks the behavior is undefined; with locks, everything that happened before taking the lock is guaranteed to happen before it, and everything that happens after the lock is guaranteed to happen after it. This is not only because of differences between CPUs, but also because the operating system can do something funny, and the locking code needs to be designed to convince the compiler not to do something funny either (which is surprisingly hard with modern compilers).
Suppose I have a C++11 application where two threads write to different but nearby memory locations, using simple pointers to primitive types. Can I be sure that both these writes will end up in memory eventually (probably after both have reached a boost::barrier), or is there a risk that both CPU cores hold their own cache line containing that data, and the second core flushing its modification to RAM will overwrite and undo the modification done by the first write?
I hope that cache coherence will take care of this for me in all situations and on all setups compliant with the C++11 memory model, but I'd like to be sure.
Yes, the cache coherency mechanisms will take care of this, so neither write can be lost. What you describe, however, is called false sharing, and it should still be avoided, by separating the data better, to increase performance.
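A sketch of "separating the data better", assuming 64-byte cache lines (C++17's std::hardware_destructive_interference_size is the portable spelling where it is implemented):

```cpp
#include <thread>

struct Results {
    alignas(64) int a = 0;   // written by thread 1
    alignas(64) int b = 0;   // written by thread 2, on its own cache line
};

int main() {
    Results r;
    std::thread t1([&] { r.a = 1; });
    std::thread t2([&] { r.b = 2; });
    t1.join();
    t2.join();
    // Coherence preserves both writes either way; the padding only removes
    // the cost of the two cores invalidating each other's cache line.
    return (r.a + r.b == 3) ? 0 : 1;
}
```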
Regarding performance: assume we have a block of data that will be frequently accessed by each thread, and that this data is read-only, meaning the threads won't do anything besides reading it.
Is it then beneficial to create one copy of this data for each thread, or not?
If the frequently accessed data is shared by all threads (instead of one copy per thread), wouldn't that increase the chance of the data being properly cached?
One copy of read-only data per thread will not help you with caching; quite the opposite, it can hurt when the threads execute on the same multicore (and possibly hyperthreaded) CPU and therefore share its cache, because in that case the per-thread copies of the data compete for limited cache space.
However, on a multi-CPU system - and virtually all of those are NUMA nowadays, typically having per-CPU memory banks whose access cost differs somewhat between "local" and "remote" memory - it can be beneficial to have per-CPU copies of the read-only data, each placed in the local memory bank of its CPU.
The memory mapping is controlled by the OS, so if you take this road it makes sense to study the NUMA-related behavior of your OS. For example, Linux uses a first-touch memory allocation policy, which means the physical mapping happens not at malloc but when the program accesses a memory page for the first time, and the OS tries to allocate the physical memory from the bank local to the touching CPU.
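A sketch of what that means in practice with OpenMP (sizes and values are placeholders, and it assumes the threads stay bound to the same cores across both loops, e.g. with OMP_PROC_BIND=true):

```cpp
#include <memory>

int main() {
    const long long n = 1 << 24;
    // Raw, uninitialized storage: pages are not yet touched. (A std::vector
    // would already first-touch every page from the allocating thread while
    // zero-initializing its elements.)
    std::unique_ptr<double[]> data(new double[n]);

    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        data[i] = 1.0;                    // first touch decides page placement

    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (long long i = 0; i < n; ++i)
        sum += data[i];                   // same static chunks => mostly local reads

    return sum > 0 ? 0 : 1;
}
```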
And the usual performance motto applies: measure, don't guess.
I read in the Visual C++ documentation that it is safe for multiple threads to read from the same object.
My question is: how does an x86-64 CPU with multiple cores handle this?
Say you have a 1 MB block of memory. Are different threads literally able to read the exact same data at the same time or do cores read one word at a time with only one core allowed to read a particular word at a time?
If there are really no writes in your 1MB block then yeah, each core can read from its own cache line without any problem as no writes are being committed and therefore no cache coherency problems arise.
In a multicore architecture, basically there is a cache for each core and a "Cache Coherence Protocol" which invalidates the cache on some cores which do not have the most up to date information. I think most processors implement the MOESI protocol for cache coherency.
Cache coherency is a complex topic that has been discussed at length (I especially like some articles by Joe Duffy, here and here). The discussion nonetheless revolves around the possible performance penalties of code that, while apparently lock-free, can slow down because the cache coherency protocol kicks in to maintain coherency across the processors' caches; but, as long as there are no writes, there is simply no coherency to maintain, and thus no loss of performance.
Just to clarify, as said in the comment, RAM can't be accessed simultaneously, since the x86 and x64 architectures implement a single bus shared between the cores, with SMP guaranteeing fairness in accessing main memory. Nonetheless this situation is hidden by each core's cache, which allows each core to have its own copy of the data. For 1 MB of data it would be possible to incur some contention while the cores fill their caches, but that would be negligible.
Some useful links:
Cache Coherence Protocols
Cache Coherence
Not only are different cores allowed to read from the same block of memory, they're allowed to write at the same time too. Whether it's "safe" or not is an entirely different story. You need to implement some sort of guard in your code (usually done with semaphores or derivatives of them) to protect against multiple cores fighting over the same block of memory in a way you don't specifically allow.
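One possible shape for such a guard, sketched with a C++17 reader/writer lock (the class and its members are invented for the example): any number of readers at once, writers exclusive.

```cpp
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

class GuardedBlock {
    std::string data;
    mutable std::shared_mutex m;

public:
    std::string read() const {
        std::shared_lock<std::shared_mutex> lock(m);   // many readers may hold this at once
        return data;
    }

    void write(std::string value) {
        std::unique_lock<std::shared_mutex> lock(m);   // a writer excludes everyone else
        data = std::move(value);
    }
};
```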
About the size of the memory a core reads at a time, that's usually a register's worth: 32 bits on a 32-bit CPU, 64 bits on a 64-bit CPU, and so on. Even streaming is done dword by dword (look at memcpy for example).
About how concurrent multiple cores really are: all cores share a single bus for reading and writing memory, so accessing any resource (RAM, external devices, the floating-point unit) is one request at a time, one core at a time. The actual processing inside each core is completely concurrent, however. DMA transfers also don't block the bus; concurrent transfers get queued and processed one at a time (I believe, not 100% sure on this).
edit: just to clarify, unlike the other reply here, I'm talking only about a no-cache scenario. Of course, if the memory gets cached, read-only access is completely concurrent.