Does an atomic operation (C++) freeze the CPU? - c++

If we perform an atomic operation on a multi-core CPU, does the atomic instruction freeze operations on all other cores?
For example, if we do an increment on an atomic variable:
++atomicInteger;
Does this freeze all other operations on other cores?
I am focused on x86 processors.
I know that reading or writing a memory-aligned native type is atomic and does not impact any other core's execution.

x86 allows writing unaligned data that spans two cache lines (i.e., two 64-byte chunks), but the result is not guaranteed to be atomic. This means you may read 8 bytes from address 0x1003c, for example, which requires the CPU to fetch two lines (0x10000 and 0x10040), take the relevant 4-byte chunk from each, and stitch them together. However, these two lines may live in different places - one could be cached, the other could be in main memory. In extreme cases (page splits), one could in theory even be swapped out. As a result, you might get two chunks of data from different times (a better term is observation points), where a store from some other core could have changed one of them in between.
On the other hand, once you add the lock prefix (or use std::atomic, which should add it for you), x86 does guarantee that the result comes from a single observation point and is consistent with the observations of all other threads. To achieve this, it's quite possible that the CPU will enforce a complete block of all cores (e.g., a bus lock) until both lines are secured by the requesting core. If it didn't, you would risk a livelock where you constantly acquire one line only to lose it to another core before you have obtained the second.
p.s. - user3286380 raised a point worth clarifying: on a plain (non-atomic) variable, ++i is a separate load, increment, and store, so it is not atomic. On a std::atomic, ++atomicInteger is defined as an atomic read-modify-write (equivalent to fetch_add(1)), but an expression such as atomicInteger = atomicInteger + 1 is an atomic read followed by an atomic write - two observation points, not one atomic RMW.
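To make the distinction concrete, here is a minimal sketch (my own illustration, not from the answer above) of a single atomic RMW versus two separate atomic operations:

#include <atomic>

std::atomic<int> counter{0};

void rmw_increment() {
    // One atomic read-modify-write, one observation point.
    // On x86 this typically compiles to a lock-prefixed instruction.
    counter.fetch_add(1);              // ++counter is equivalent (seq_cst fetch_add)
}

void load_then_store() {
    // Two separate atomic operations, two observation points: another
    // thread's store can land between them and an increment is lost.
    int tmp = counter.load();
    counter.store(tmp + 1);
}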

An atomic operation is an operation that cannot be interleaved with another processor's access to the same data. If you want to do an addition atomically, only one thread at a time can be performing that operation on the variable.
If we perform an atomic operation on a multi-core CPU, does the atomic
instruction freeze operations on all other cores?
No, not necessarily. If multiple threads happen to attempt the same atomic operation at the same time, all of them except the first one to reach that atomic statement will be held up briefly.
I know that reading or writing a memory-aligned native type is atomic
and does not impact any other core's execution.
Where did you read this? It does not sound quite right to me. The result may depend on the architecture, but if you have multiple threads on x86, for example, and those threads perform read-modify-write updates on the same location, the overall update is not atomic by default. So the final value of the address being modified by the threads can be anything.
Here is a similar discussion you might be interested in: pthreads: If I increment a global from two different threads, can there be sync issues?
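For reference, a small sketch (hypothetical names, not taken from the linked discussion) showing how unsynchronized increments lose updates while an atomic counter does not:

#include <atomic>
#include <iostream>
#include <thread>

int plain_counter = 0;                  // unsynchronized: increments can be lost
std::atomic<int> atomic_counter{0};     // fetch_add is a single atomic RMW

int main() {
    auto work = [] {
        for (int i = 0; i < 100000; ++i) {
            ++plain_counter;            // load + add + store: a data race
            atomic_counter.fetch_add(1, std::memory_order_relaxed);
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // atomic_counter always ends up at 200000; plain_counter is usually lower.
    std::cout << plain_counter << " " << atomic_counter << "\n";
}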

Related

How to directly read the value of the `std::atomic_int64_t` without atomic operation?

I have an std::atomic_int64_t that can be read by multiple threads but written by only one thread. In the one thread that writes the atomic, I want to read it directly without any atomic-related instruction since there won't be concurrent writing. How should I do that in C++?
It's hard to tell for sure without knowing your use case if what you're trying to do is reasonable, but there's about a 99.95% chance that it's a bad idea. The reason for this is not obvious, so let me have a go.
Complex runtime environments
Atomics, despite the name, are not just about atomic access to a variable; they are about the ordering of effects. To understand what this means, we have to understand a little bit about two things:
modern CPUs, caches, and memory
compiler optimizations
For point 1, consider that a modern CPU consists mostly of memory management. Very little silicon is actually devoted to calculating; most of it is concerned with keeping the calculation units fed with data. When you store a value to memory, chances are it will not show up in main memory immediately; instead it goes into the active CPU core's store buffer, which is flushed out at some point in the future, at which point the value may become visible to the other CPU core on which your other thread runs. In the name of performance, we have turned our CPUs into highly asynchronous beasts.
For point 2, consider that your compiler will take apart the code you give it, analyze it for data dependencies, and reorder your instructions in such a way that they have the same results but run faster. Consider also that the compiler can only do this for the code of one thread at a time. It cannot know that another thread depends on the first thread's work being done in a particular order (or indeed at all), and so it will run roughshod over the assumptions of that other, unknown thread. Another thread will change a variable, so it needs to be re-read from time to time? Well, the compiler doesn't know that and already has the value in a register, so it will generate an endless loop and call it a day. This sort of behaviour needs to be inhibited when you have multiple threads.
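As a side note, a tiny sketch of that "endless loop" hazard (hypothetical names, assuming a flag that another thread will eventually set):

#include <atomic>

bool plain_ready = false;               // the optimizer may keep this in a register
std::atomic<bool> ready{false};         // forces a real load on every iteration

void wait_plain() {
    while (!plain_ready) { }            // can legally be compiled into an infinite loop
}

void wait_atomic() {
    while (!ready.load(std::memory_order_acquire)) { }  // re-reads memory each time
}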
Atomics are about synchronization
The main point of atomics is synchronization. An atomic write ensures that the store buffer is flushed, that the value becomes visible in main memory, and it prevents the compiler from reordering instructions across it (possibly only one way, depending on the precise barrier used). Similarly, an atomic read ensures that the values from main memory become visible to the reading core and prevents the compiler from reordering across it.
So consider your situation: the writer thread does a proper atomic store. This ensures that the compiler does not reorder operations in the writer thread across that store, and the generated code ensures that the CPU core's store buffer is flushed. If the reader threads read the atomic variable with an atomic read and see the new value, they will also see everything else the writer thread was instructed to do up to that point.
So, if the reader does not use an atomic read to read the atomic variable, what could go wrong? Well, basically two things:
the CPU core might not see the need to update its cache
the compiler could reorder operations across the non-atomic read, or might not see a need to re-read from memory a value that the optimizer thinks it already knows and has in a register.
In effect, this means the reader thread might work under the assumption that the things the writer thread did before the write have already happened, yet end up "not seeing" that new data. Hilarity will (almost) inevitably ensue.
TL;DR
Atomics are about synchronisation. You have multiple threads that you need to synchronize. Use atomic reads in the reader thread, or you're not synchronizing.
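For concreteness, a minimal sketch of the writer/reader pairing described above (my own example; names like payload and published are hypothetical):

#include <atomic>
#include <cstdint>

int64_t payload = 0;                    // ordinary data the writer prepares first
std::atomic<int64_t> published{0};      // the atomic both sides synchronize on

void writer() {                         // the single writing thread
    payload = 42;                       // plain store
    published.store(1, std::memory_order_release);         // publishes everything above
}

void reader() {                         // any reading thread
    if (published.load(std::memory_order_acquire) == 1) {  // pairs with the release store
        int64_t v = payload;            // guaranteed to see 42
        (void)v;
    }
}

If the reader used a plain, non-atomic read of published instead, neither of those guarantees would hold.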

Sharing two variables between threads without a lock

I have a question about sharing variables between threads.
The two variables look like this:
struct Shared {
    void*    a;
    uint64_t b;
};
Only one thread modifies the two variables; other threads will frequently read them.
I want to change a and b together, so that other threads see the change as a unit (the new value of a together with the new value of b).
Because many threads will frequently read these two variables, I don't want to add a lock. Is there a way to combine the updates to a and b so that they behave like one atomic operation? For example, would a memory fence work? Thank you!
You're looking for a SeqLock.
It's ideal for this use-case, especially with infrequently-changed data. (e.g. like a time variable updated by a timer interrupt, read all over the place.)
Implementing 64 bit atomic counter with 32 bit atomics
Optimal way to pass a few variables between 2 threads pinning different CPUs
SeqLock advantages include perfect read-side scaling (readers don't need to get exclusive ownership of any cache lines, they're truly read-only not just lock-free), so any number of readers can read as often as they like with zero contention with each other. The downside is occasional retry, if a reader happens to try to read at just the wrong time. That's rare, and doesn't happen when the writer hasn't just written something.
So readers aren't quite wait-free, and in fact if the writer sleeps at just the wrong time, the readers are stuck retrying until it wakes up again! So overall the algorithm isn't even lock-free or obstruction-free. But the very common fast-path is just two extra reads from the same cache line as the data, and whatever is necessary for LoadLoad ordering in the reader. If there's been no write since the last read, the loads can all be L1d cache hits.
The only thing better is if you have efficient 16-byte atomic stores and loads, like Intel (but not yet AMD) CPUs with AVX, if your compiler / libatomic uses them for 16-byte loads of std::atomic<struct_16bytes> instead of x86-64 lock cmpxchg16b. (In practice most AMD CPUs are thought to have atomic 16-byte load/store as well, but only Intel has officially stated in its manuals that the AVX feature bit implies atomicity for aligned 128-bit load/store such as movaps, so compilers can safely start using it.)
Or AArch64, which guarantees 16-byte atomicity for plain stp / ldp as of ARMv8.4, I think.
But without those hardware features, and compiler+options to take advantage of them, 16-byte loads often get implemented as an atomic RMW, meaning each reader takes exclusive ownership of the cache line. That means reads contend with other reads, instead of the cache line staying in shared state, hot in the cache of every core that's reading it.
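If such 16-byte atomics are available, the whole pair can simply live in one std::atomic object; a rough sketch (struct and function names are my own, not from the answer):

#include <atomic>
#include <cstdint>

struct Pair { void* a; uint64_t b; };   // 16 bytes on a typical 64-bit target

std::atomic<Pair> both{Pair{}};         // whole-struct atomic load/store

void publish(void* na, uint64_t nb) {
    both.store(Pair{na, nb});           // one 16-byte store if the hardware supports it
}

Pair snapshot() {
    return both.load();                 // may fall back to lock cmpxchg16b (an RMW) otherwise
}

both.is_lock_free() reports at run time whether the target avoids an internal lock.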
For example, would a memory fence work?
No, memory fences can't create atomicity (glue multiple operations into one larger transaction); they only create ordering between operations.
Although you could say that the idea behind a SeqLock is to carefully order the writes and reads (with respect to the sequence variable) so that torn reads are detected and retried when they happen. So yes, barriers are important for that.
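A minimal SeqLock sketch for the struct in the question (my own illustration; note that copying the payload with plain loads and stores is formally a data race in ISO C++, so a fully conforming version would make the members relaxed atomics):

#include <atomic>
#include <cstdint>

struct Shared { void* a; uint64_t b; };

std::atomic<uint32_t> seq{0};   // even = stable, odd = write in progress
Shared shared_data{};           // only the single writer thread modifies this

void write(const Shared& d) {                         // single writer only
    uint32_t s = seq.load(std::memory_order_relaxed);
    seq.store(s + 1, std::memory_order_relaxed);      // odd: a write is in progress
    std::atomic_thread_fence(std::memory_order_release); // keep payload stores after the odd store
    shared_data = d;                                   // plain stores of the payload
    seq.store(s + 2, std::memory_order_release);       // even again: publish, ordered after the payload
}

Shared read() {                                        // any number of readers
    for (;;) {
        uint32_t s1 = seq.load(std::memory_order_acquire);
        Shared copy = shared_data;                     // may be torn; validated below
        std::atomic_thread_fence(std::memory_order_acquire);
        uint32_t s2 = seq.load(std::memory_order_relaxed);
        if ((s1 & 1) == 0 && s1 == s2)                 // not mid-write and unchanged
            return copy;
    }
}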

Why does atomic operation need exclusive cache access?

In my understanding, an atomic operation (a C++ atomic, for example) first locks the cache line and then performs the atomic operation. I have two questions: 1. if, say, atomic compare-and-swap is itself an atomic operation in hardware, why do we need to lock the cache line, and 2. when a cache line is locked, how does another CPU wait for it? Does it use spin-lock-style waiting?
thanks
First of all: It depends!
1.) Whether a system locks a cache line has nothing to do with C++. It is a question of how the cache is organized and, especially, how the assembly instructions interact with the cache. That is a question of CPU architecture!
2.) How a compiler implements an atomic operation is implementation-dependent. Which assembly instructions are generated for an atomic operation can vary from compiler to compiler and even between versions.
3.) As far as I know, a full lock of a cache line is only the fallback solution if no "cleverer" notification/synchronization of other cores accessing the same cache line can be performed. And typically more than a single cache is involved: think of a multi-level cache architecture, where some caches are visible to only a single core. So more memory-system operations than just locking a line are needed, and data may also have to move between cache levels when multiple cores are involved.
4.) From the C++ perspective, an atomic operation is not only a single operation. What really happens depends on the memory ordering option chosen for the atomic operation (a short sketch of those options follows this answer). As atomic operations are often used for inter-thread synchronization, a lot more may have to be done for a single atomic RMW operation! To get an idea of everything that has to happen, you should give https://www.cplusplusconcurrencyinaction.com/ a chance. It goes into the details of memory barriers and memory ordering.
5.) Locking a cache line (if this really happens) should not result in spin locks or anything similar on the other cores, as holding the cache line itself takes only a few clock cycles. Depending on the architecture, it simply stalls the other core for some cycles. The stalled core may even be able to do other things in parallel in a different pipe. But that is very hardware-specific.
As already mentioned in a comment, take a look at https://fgiesen.wordpress.com/2014/08/18/atomics-and-contention/; it gives some hints about what can happen with cache coherency and locking.
There is much more than locking going on under the hood. I believe your question only scratches the surface!
For practical usage: don't worry about it! Compiler vendors and CPU architects have done a very good job. As a programmer you should measure your code's performance. From my perspective there is no need to think about what happens when cache lines are locked; write good algorithms, think about good memory organization of your program's data, and minimize interrelationships between threads.
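As a rough illustration of point 4, here is the same atomic increment with different memory-ordering options (my own sketch; the exact instructions generated are compiler- and target-specific):

#include <atomic>

std::atomic<int> counter{0};

void variants() {
    counter.fetch_add(1, std::memory_order_relaxed); // atomic, but imposes no ordering on other accesses
    counter.fetch_add(1, std::memory_order_acq_rel); // also orders surrounding reads and writes
    counter.fetch_add(1);                            // default: memory_order_seq_cst
    // On x86 all three usually compile to the same lock-prefixed instruction;
    // on weakly ordered CPUs (e.g. ARM) the stronger orderings add barrier code.
}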

Using Mutex for shared memory of 1 word

I have an application where multiple threads access and write to a shared memory location of one word (16 bits).
Can I expect that the processor reads and writes a word from/to memory in an atomic operation? So I don't need mutex protection of the shared memory/variable?
The target is an embedded device running VxWorks.
EDIT: There's only one CPU and it is an old one (more than 7 years old) - I am not exactly sure about the architecture and model, but I am also more interested in the general way that "most" CPUs work. If it is a 16-bit CPU, would it then, in most cases, be fair to expect that it will read/write a 16-bit variable in one operation? Or should I always use mutex protection in any case? And let's say that I don't care about portability, and we're talking about C++98.
All processors will read and write aligned machine words atomically, in the sense that you won't get half the bits of the old value and half the bits of the new value if the word is read by another processor.
To achieve good speed, modern processors will NOT synchronize read-modify-write operations on a particular location unless you actually ask for it - since nearly all reads and writes go to "non-shared" locations.
So, if the value is, say, a counter of how many times we've encountered a particular condition, or some other "if we read/write an old value, it'll go wrong" situation, then you need to ensure that two processors don't simultaneously update the value. This can typically be done with atomic instructions (or some other form of atomic updates) - this will ensure that one, and only one, processor touches the value at any given time, and that all the other processors DO NOT hold a copy of the value that they think is accurate and up to date when another has just made an update. See the C++11 std::atomic set of functions.
Note the distinction between atomically reading or writing the machine word's value and atomically performing the whole update.
The problem is not the atomicity of the access (which you can usually assume unless you are using an 8-bit microcontroller), but the missing synchronization, which leads to undefined behavior.
If you want to write portable code, use atomics instead. If you want to achieve maximal performance for your specific platform, read the documentation of your OS and compiler very carefully and see what additional mechanisms or guarantees they provide for multithreaded programs (But I really doubt that you will find anything more efficient than std::atomic that gives you sufficient guarantees).
Can I expect that the processor reads and writes a word from/to memory in an atomic operation?
Yes.
So I don't need mutex protection of the shared memory/variable?
No. Consider:
++i;
Even if the read and write are atomic, two threads doing this at the same time can each read, each increment, and then each write, resulting in only one increment where two are needed.
Can I expect that the processor reads and writes a word from/to memory in an atomic operation?
Yes, if the data's properly aligned and no bigger than a machine word, most CPU instructions will operate on it atomically in the sense you describe.
So I don't need mutex protection of the shared memory/variable?
You do need some synchronisation - whether a mutex or atomic operations à la std::atomic (a short sketch of both follows this list).
The reasons for this include:
if your variable is not volatile, the compiler might not even emit read and write instructions for the memory address nominally holding that variable at the places you might expect, instead reusing values read or set earlier that are saved in CPU registers or known at compile time
if you use a mutex or std::atomic type you do not need to use volatile as well
further, even if the data is written towards memory, it may not leave the CPU caches and be written to actual RAM where other cores and CPUs can see it unless you explicitly use a memory barrier (std::mutex and std::atomic types do that for you)
finally, delays between reading and writing values can cause unexpected results, so operations like ++x can fail as explained by David Schwartz.
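As a rough sketch of those two options for the 16-bit word (illustrative names, assuming a C++11 or later toolchain, which the question's C++98 constraint may rule out):

#include <atomic>
#include <cstdint>
#include <mutex>

std::atomic<uint16_t> shared_word{0};    // usually lock-free for a 16-bit value

void bump_atomic() {
    shared_word.fetch_add(1);            // one atomic read-modify-write
}

uint16_t plain_word = 0;
std::mutex m;

void bump_locked() {
    std::lock_guard<std::mutex> lock(m); // mutual exclusion plus acquire/release ordering
    ++plain_word;
}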

Is it safe to read an integer variable that's being concurrently modified without locking?

Suppose that I have an integer variable in a class, and this variable may be concurrently modified by other threads. Writes are protected by a mutex. Do I need to protect reads too? I've heard that there are some hardware architectures on which, if one thread modifies a variable, and another thread reads it, then the read result will be garbage; in this case I do need to protect reads. I've never seen such architectures though.
This question assumes that a single transaction only consists of updating a single integer variable so I'm not worried about the states of any other variables that might also be involved in a transaction.
atomic read
As said before, it's platform-dependent. On x86, the value must be aligned on a 4-byte boundary. Generally, for most platforms, the read must execute in a single CPU instruction.
optimizer caching
The optimizer doesn't know you are reading a value modified by a different thread. Declaring the value volatile helps with that: the optimizer will issue a memory read/write for every access, instead of trying to keep the value cached in a register.
CPU cache
Still, you might read a stale value, since on modern architectures you have multiple cores with individual caches that are not necessarily kept in sync automatically. You need a read memory barrier, usually a platform-specific instruction.
On Wintel, thread synchronization functions will automatically add a full memory barrier, or you can use the InterlockedXxxx functions.
MSDN: Memory and Synchronization issues, MemoryBarrier Macro
[edit] please also see drhirsch's comments.
You ask a question about reading a variable and later you talk about updating a variable, which implies a read-modify-write operation.
Assuming you really mean the former, the read is safe if it is an atomic operation. For almost all architectures this is true for integers.
There are a few (and rare) exceptions:
The read is misaligned, for example accessing a 4-byte int at an odd address. Usually you need to force the compiler with special attributes to produce such a misaligned access.
The size of an int is bigger than the natural size of the instructions, for example using 16-bit ints on an 8-bit architecture.
Some architectures have an artificially limited bus width. I only know of very old and outdated ones, like the 386SX or the 68008.
I'd recommend not relying on any particular compiler or architecture in this case.
Whenever you have a mix of readers and writers (as opposed to only readers or only writers), you'd better synchronize them all. Imagine your code running someone's artificial heart: you don't really want it to read wrong values, and you surely don't want a power plant in your city to go "boom" because someone decided not to use that mutex. Save yourself some sleepless nights in the long run: sync 'em.
If you only have one thread reading, you're good to go with just that one mutex; however, if you're planning for multiple readers and multiple writers, you'd need a more sophisticated piece of code to synchronize them. A nice implementation of a read/write lock that is also "fair" is something I have yet to see.
Imagine that you're reading the variable in one thread, that thread gets interrupted while reading, and the variable is changed by a writing thread. Now what is the value of the read integer after the reading thread resumes?
Unless reading the variable is an atomic operation - in this case one that takes only a single (assembly) instruction - you cannot ensure that the above situation won't happen.
(The variable could be stored in memory, and retrieving its value could take more than one instruction.)
The consensus is that you should encapsulate/lock all writes individually, while reads can be executed concurrently with (only) other reads.
Suppose that I have an integer variable in a class, and this variable may be concurrently modified by other threads. Writes are protected by a mutex. Do I need to protect reads too? I've heard that there are some hardware architectures on which, if one thread modifies a variable, and another thread reads it, then the read result will be garbage; in this case I do need to protect reads. I've never seen such architectures though.
In the general case, that is potentially every architecture. Every architecture has cases where reading concurrently with a write will result in garbage.
However, almost every architecture also has exceptions to this rule.
It is common that word-sized variables are read and written atomically, so synchronization is not needed when reading or writing. The proper value will be written atomically as a single operation, and threads will read the current value as a single atomic operation as well, even if another thread is writing. So for integers, you're safe on most architectures. Some will extend this guarantee to a few other sizes as well, but that's obviously hardware-dependent.
For non-word-sized variables both reading and writing will typically be non-atomic, and will have to be synchronized by other means.
If you don't use the previous value of this variable when writing the new one, then:
You can read and write the integer variable without using a mutex. That is because an integer is a native type on a 32-bit architecture, and every read or write of the value is done in a single operation.
But if you are doing something such as an increment:
myvar++;
Then you need to use a mutex, because this construct expands to myvar = myvar + 1, and between reading myvar and writing the incremented value back, myvar can be modified by another thread. In that case you will get a bad value.
While it would probably be safe to read ints on 32-bit systems without synchronization, I would not risk it. Multiple concurrent reads are not a problem, but I do not like writes happening at the same time as reads.
I would recommend placing the reads in the critical section too, and then stress-testing your application on multiple cores to see whether this causes too much contention. Finding concurrency bugs is a nightmare I prefer to avoid. What happens if in the future someone decides to change the int to a long long or a double, so it can hold larger numbers?
If you have a nice thread library like boost.thread or zthread, then you should have reader/writer locks. These would be ideal for your situation, as they allow multiple reads while protecting writes.
This may happen on 8-bit systems which use 16-bit integers.
If you want to avoid locking, you can, under suitable circumstances, get away with reading multiple times until you get two equal consecutive values. For example, I've used this approach to read the 64-bit clock on a 32-bit embedded target where the clock tick was implemented as an interrupt routine. In that case reading three times suffices, because the clock can only tick once in the short time the reading routine runs.
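A sketch of that read-until-stable idea (my own example; the counter name is hypothetical, and the 64-bit reads are assumed to be split into two 32-bit halves on the target):

#include <cstdint>

volatile uint64_t clock_ticks = 0;   // updated only inside a timer interrupt;
                                     // volatile so the compiler re-reads it each time

uint64_t read_clock() {
    uint64_t first = clock_ticks;              // may be torn across two 32-bit reads
    for (;;) {
        uint64_t second = clock_ticks;
        if (first == second)                   // two equal consecutive reads: no tick in between
            return first;
        first = second;
    }
}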
In general, each machine instruction goes through several hardware stages when executing. As most current CPUs are multi-core or hyper-threaded, reading a variable may start it moving through the instruction pipeline, but that doesn't stop another CPU core or hyper-thread from concurrently executing a store instruction to the same address. The two concurrently executing instructions, read and store, might "cross paths", meaning that the read will receive the old value just before the new value is stored.
To sum up: you do need the mutex for both reads and writes.
Both reads and writes of variables accessed concurrently must be protected by a critical section (not a mutex), unless you want to waste your whole day debugging.
Critical sections are platform-specific, I believe. On Win32, a critical section is very efficient: when no contention occurs, entering a critical section is almost free and does not affect overall performance. When contention does occur, it is still more efficient than a mutex, because it performs a series of checks before suspending the thread.
Depends on your platform. Most modern platforms offer atomic operations for integers: Windows has InterlockedIncrement, InterlockedDecrement, InterlockedCompareExchange, etc. These operations are usually supported by the underlying hardware (read: the CPU) and they are usually cheaper than using a critical section or other synchronization mechanisms.
See MSDN: InterlockedCompareExchange
I believe Linux (and modern Unix variants) support similar operations in the pthreads package but I don't claim to be an expert there.
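For illustration, a minimal sketch of the Win32 interlocked calls mentioned above (variable and function names are my own):

#include <windows.h>

volatile LONG counter = 0;

void bump() {
    InterlockedIncrement(&counter);    // atomic increment of the shared LONG
}

LONG try_swap(LONG expected, LONG desired) {
    // Swaps in 'desired' only if the current value equals 'expected';
    // returns the value observed before the operation.
    return InterlockedCompareExchange(&counter, desired, expected);
}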
If a variable is marked with the volatile keyword, the compiler will actually emit a read or write for every access instead of optimizing them away, but that by itself does not make read-modify-write operations atomic, and volatile has many, many other implications in terms of what the compiler does and how it behaves; it shouldn't just be used for this purpose.
Read up on what volatile does before you blindly start using it: http://msdn.microsoft.com/en-us/library/12a04hfd(VS.80).aspx