I have the following code:
#include <atomic>
int main () {
std::atomic<uint32_t> value(0);
value.fetch_add(1, std::memory_order::relaxed);
static_assert(std::atomic<uint32_t>::is_always_lock_free);
return 0;
}
It compiles and so it means std::atomic<uint32_t>::is_always_lock_free is true.
Then, the assembly code looks like this with gcc 10 and -std=c++20 -O3 -mtune=skylake-avx512 -march=skylake-avx512:
0000000000401050 <main>:
401050: c7 44 24 fc 00 00 00 mov DWORD PTR [rsp-0x4],0x0
401057: 00
401058: f0 ff 44 24 fc lock inc DWORD PTR [rsp-0x4]
40105d: 31 c0 xor eax,eax
40105f: c3 ret
Many posts point out that read-modify-write operation (fetch_add() here) can't be an atomic operation without a lock.
My question is what std::atomic::is_always_lock_free being true really means.
This page states Equals true if this atomic type is always lock-free and false if it is never or sometimes lock-free.
What does it mean by "this atomic type is always lock-free" then?
"Lock" here is in the sense of "mutex", not specifically in reference to the x86 instruction prefix named lock.
A trivial and generic way to implement std::atomic<T> for arbitrary types T would be as a class containing a T member together with a std::mutex, which is locked and unlocked around every operation on the object (load, store, exchange, fetch_add, etc). Those operations can then be done in any old way, and need not use atomic machine instructions, because the lock protects them. This implementation would be not lock free.
A downside of such an implementation, besides being slow in general, is that if two threads try to operate on the object at the same time, one of them will have to wait for the lock, which may actually block and cause it to be scheduled out for a while. Or, if a thread gets scheduled out while holding the lock, every other thread that wants to operate on the object will have to wait for the first thread to get scheduled back in and complete its work first.
So it is desirable if the machine supports truly atomic operations on T: a single instruction or sequence that other threads cannot interfere with, and which will not block other threads if interrupted (or perhaps cannot be interrupted at all). If for some type T the library has been able to specialize std::atomic<T> with such an implementation, then that is what we mean by saying it is lock free. (It is just confusing on x86 because the atomic instructions used for such implementations are named lock. On other architectures they might be called something else, e.g. ARM64's ldxr/stxr exclusive load/store instructions.)
The C++ standard allows for types to be "sometimes lock free": maybe it is not known at compile time whether std::atomic<T> will be lock-free, because it depends on special machine features that will be detected at runtime. It's even possible that some objects of type std::atomic<T> are lock-free and others are not. That's why atomic_is_lock_free is a function and not a constant. It checks whether this particular object is lock-free on this particular day.
However, it might be the case for some implementations that certain types can be guaranteed, at compile time, to always be lock free. That's what is_always_lock_free is used to indicate, and note that it's a constexpr bool instead of a function.
The context of the question is normal cached memory such as the memory of C and C++ objects.
Many CPU perform memory loads out of order (with the benefit of not completely stalling the executing thread on a cache miss), and do not check later that another core did not ask the permission to change these (*), so a read that appears later in the program order was effectively retrieved earlier, and another thread might have made progress in that interval, which means that no agreement on the ordering of memory operations is possible between these two threads that agrees with program order.
(*) By asking the local cache to invalidate the cache line containing it, which is a prerequisite before altering a value in any architecture that can support normal thread programming.
Many CPU types that allow that behavior also guarantee that when a value depends on the result of previous load, all subsequent operations that have a dependency on that value occur really after the load, in particular the loads at an address derived from that value.
But a value derived from another can be known to be mathematically independent of the parameter; indeed, a common idiom to set a register to zero is to xor its value with itself, which syntactically derives the value from the previous one but semantically does not.
Do all CPU that track value dependencies to guarantee that memory operations based on values dependent on a previous load allow such non semantic dependencies to create an independent dependency for the purpose of ordering memory operations? Does that apply to bitwise operations or all arithmetic operations, including but not limited to use of absorbing elements?
Most obvious examples are:
x ^ x
x & 0
x | -1
x - x
x * 0
x / x (raises an exception if zero)
The purpose the question obviously is to understand how a compiler can support memory_order_consume in C and C++ on these architectures in an optimal way, following the intent of that feature (and not in a degenerate way by defining mo_consume as mo_acquire).
PROBLEM STATEMENT
The problem is the compilation of code such as
r1 = x.load(memory_order_consume);
r2 = y | (r1&0);
r3 = z + (r1-r1);
r4 = (&E)[r1*0];
Where the simple strategy of renaming r1 and making it "register volatile" (volatile-like, non optimizable, except in a register) in the intermediate code produces correct "consume" (value/address dependent load) behavior on each operations.
BONUS QUESTION
Do any CPU directly support dependent constants? That is a "set immediate value" instruction that introduces an explicit dependency of the constant on a register value:
load.i.dep r1,8,r4 ; sets r1 to 8 but dependency on value of r4 is respected
lea r0,r0,r1 ; r0 += r1, r0 depends on r4
load.w r2,r1 ; loads one word at r1 to r2 only after r4 is known
I wonder how signal assignment statements are executed concurrently in combinational logic using VHDL? For the following code for example the three statements are supposed to run concurrently. What I have a doubt in is that how the 'y' output signal is immediately changed when I run the simulation although if the statements ran concurrently 'y' will not see the effect of 'wire1' and 'wire2' (only if the statements are executed more than one time).
entity test1 is port (a, b, c, d : in bit; y : out bit);
end entity test1;
------------------------------------------------------
architecture basic of test1 is
signal wire1, wire2 : bit;
begin
wire1 <= a and b;
wire2 <= c and d;
y <= wire1 and wire2;
end architecture basic;
Since VHDL is used for simulating digital circuits, this must work similarly to the actual circuits, where (after a small delay usually ignored in simulations) circuits continously follow their inputs.
I assume you wonder how the implementation achieves this behaviour:
The simulator will keep track of which signal depends on which other symbol and reevaluates the expression whenever one of the inputs changes.
So when a changes, wire1 will be updated, and in turn trigger an update to y. This will continue as long as combinatorial updates are necessary. So in the simulation the updates are indeed well ordered, although no simulation time has passed. The "time" between such updates is often called a "delta cycle".
Essentially I have 2 "while True:" loops in my code. Both of the loops are right at the end. However when I run the code, only the first while True: loop gets run, and the second one gets ignored.
For example:
while True:
print "hi"
while True:
print "bye"
Here, it will continuously print hi, but wont print bye at all (the actual code has a tracer.execute() for one loop, and the other is listening to a port, and they both work on their own).
Is there any way to get both loops to work at the same time independently?
Yes.A way to get both loops to work at the same time independently:
Your initial surprise was related to the nature, how Finite-State-Automata actually work.
[0]: any-processing-will-always-<START>-here
[1]: Read a next instruction
[2]: Execute the instruction
[3]: GO TO [1]
The stream of abstract instructions is being executed in a pure-[SERIAL] manner, one after another. There is no other way in the CPU since uncle Turing.
Your desire to have more streams-of-instructions run at the same time independently is called [CONCURRENT] process-scheduling.
You have several tools for achieving a wanted modus-operandi:
Read about a weaker form, using just a thread-based concurrency ( which, due to a Python-specific GIL-locking, yet executes on the physical hardware as a [CONCURRENT]-processing, but GIL-interleaving ( which was knowingly implemented as a very cheap form of a collision-avoidance for each and every case, that this [CONCURRENCY] might introduce ) will finally interleave each of the ( now ) [CONCURRENT]-streams, so as to principally avoid colliding access to any Python object at the same time. If you are fine with this execute-just-one-instruction-stream-fragment-at-a-time ( and round-robin their actual order of GIL-stepped execution ), you can live in a safe and collision-free world.
Another tool, Python may use, is the joblib.Parallel()( joblib.delayed() ), where you will have to master a bit more things to make these ( now a set of fully spawned subprocesses, each ( yes, each ) having a full-copy of python-state + all variables ( read: a lot of time and memory needed to spawn 'em ) and no mutual coordination ).
So decide about which form is just-enough for the kind of your use-case, and better check the new Amdahl's Law re-formulation carefully ( implications on costs of going distributed or parallel )
I decide I want to benchmark a particular function, so I naïvely write code like this:
#include <ctime>
#include <iostream>
int SlowCalculation(int input) { ... }
int main() {
std::cout << "Benchmark running..." << std::endl;
std::clock_t start = std::clock();
int answer = SlowCalculation(42);
std::clock_t stop = std::clock();
double delta = (stop - start) * 1.0 / CLOCKS_PER_SEC;
std::cout << "Benchmark took " << delta << " seconds, and the answer was "
<< answer << '.' << std::endl;
return 0;
}
A colleague pointed out that I should declare the start and stop variables as volatile to avoid code reordering. He suggested that the optimizer could, for example, effectively reorder the code like this:
std::clock_t start = std::clock();
std::clock_t stop = std::clock();
int answer = SlowCalculation(42);
At first I was skeptical that such extreme reordering was allowed, but after some research and experimentation, I learned that it was.
But volatile didn't feel like the right solution; isn't volatile really just for memory mapped I/O?
Nevertheless, I added volatile and found that not only did the benchmark take significantly longer, it also was wildly inconsistent from run to run. Without volatile (and getting lucky to ensure the code wasn't reordered), the benchmark consistently took 600-700 ms. With volatile, it often took 1200 ms and sometimes more than 5000 ms. The disassembly listings for the two versions showed virtually no difference other than a different selection of registers. This makes me wonder if there is another way to avoid the code reordering that doesn't have such overwhelming side effects.
My question is:
What is the best way to prevent code reordering in benchmarking code like this?
My question is similar to this one (which was about using volatile to avoid elision rather than reordering), this one (which didn't answer how to prevent reordering), and this one (which debated whether the issue was code reordering or dead code elimination). While all three are on this exact topic, none actually answer my question.
Update: The answer appears to be that my colleague was mistaken and that reordering like this isn't consistent with the standard. I've upvoted everyone who said so and am awarding the bounty to the Maxim.
I've seen one case (based on the code in this question) where Visual Studio 2010 reordered the clock calls as I illustrated (only in 64-bit builds). I'm trying to make a minimal case to illustrate that so that I can file a bug on Microsoft Connect.
For those who said that volatile should be much slower because it forces reads and writes to memory, this isn't quite consistent with the code being emitted. In my answer on this question, I show the disassembly for the the code with and without volatile. Inside the loop, everything is kept in registers. The only significant differences appear to be register selection. I do not understand x86 assembly well enough to know why the performance of the non-volatile version is consistently fast while the volatile version is inconsistently (and sometimes dramatically) slower.
A colleague pointed out that I should declare the start and stop variables as volatile to avoid code reordering.
Sorry, but your colleague is wrong.
The compiler does not reorder calls to functions whose definitions are not available at compile time. Simply imagine the hilarity that would ensue if compiler reordered such calls as fork and exec or moved code around these.
In other words, any function with no definition is a compile time memory barrier, that is, the compiler does not move subsequent statements before the call or prior statements after the call.
In your code calls to std::clock end up calling a function whose definition is not available.
I can not recommend enough watching atomic Weapons: The C++ Memory Model and Modern Hardware because it discusses misconceptions about (compile time) memory barriers and volatile among many other useful things.
Nevertheless, I added volatile and found that not only did the benchmark take significantly longer, it also was wildly inconsistent from run to run. Without volatile (and getting lucky to ensure the code wasn't reordered), the benchmark consistently took 600-700 ms. With volatile, it often took 1200 ms and sometimes more than 5000 ms
Not sure if volatile is to blame here.
The reported run-time depends on how the benchmark is run. Make sure you disable CPU frequency scaling so that it does not turn on turbo mode or switches frequency in the middle of the run. Also, micro-benchmarks should be run as real-time priority processes to avoid scheduling noise. It could be that during another run some background file indexer starts competing with your benchmark for the CPU time. See this for more details.
A good practice is to measure times it takes to execute the function a number of times and report min/avg/median/max/stdev/total time numbers. High standard deviation may indicate that the above preparations are not performed. The first run often is the longest because the CPU cache may be cold and it may take many cache misses and page faults and also resolve dynamic symbols from shared libraries on the first call (lazy symbol resolution is the default run-time linking mode on Linux, for example), while subsequent calls are going to execute with much less overhead.
The usual way to prevent reordering is a compile barrier i.e asm volatile ("":::"memory"); (with gcc). This is an asm instruction which does nothing, but we tell the compiler it will clobber memory, so it is not permitted to reorder code across it. The cost of this is only the actual cost of removing the reorder, which obviously is not the case for changing the optimisation level etc as suggested elsewhere.
I believe _ReadWriteBarrier is equivalent for Microsoft stuff.
Per Maxim Yegorushkin's answer, reordering is unlikely to be the cause of your issues though.
Volatile ensures one thing, and one thing only: reads from a volatile variable will be read from memory every time -- the compiler won't assume that the value can be cached in a register. And likewise, writes will be written through to memory. The compiler won't keep it around in a register "for a while, before writing it out to memory".
In order to prevent compiler reordering you may use so called compiler fences.
MSVC includes 3 compiler fences:
_ReadWriteBarrier() - full fence
_ReadBarrier() - two-sided fence for loads
_WriteBarrier() - two-sided fence for stores
ICC includes __memory_barrier() full fence.
Full fences are usually the best choice because there is no need in finer-granularity on this level (compiler fences are basically costless in run-time).
Statment reordering (which most compiler do when optimization is enabled), thats the also main reason why certain program fail to operation operation when compiled with compiler optimzation.
Will suggest to read http://preshing.com/20120625/memory-ordering-at-compile-time to see potential issues we can face with compiler reoredering etc.
Related problem: how to stop the compiler from hoisting a tiny repeated calculation out of a loop
I couldn't find this anywhere - so adding my own answer 11 years after the question was asked ;).
Using volatile on variables is not what you want for that. That will cause the compiler to load and store those variable from and to RAM every single time (assuming there is a side effect of that that must be preserved: aka - good for I/O registers). When you are bench marking you are not interested in measuring how long it takes to get something from memory, or write it there. Often you just want your variable to be in CPU registers.
volatile is usable if you assign to it once outside a loop that doesn't get optimized away (like summing an array), as an alternative to printing the result. (Like the long-running function in the question). But not inside a tiny loop; that will introduce store/reload instructions and store-forwarding latency.
I think that the ONLY way to submit your compiler into not optimizing your benchmark code to hell is by using asm. This allows you to fool the compiler into thinking it doesn't know anything about your variables content, or usage, so it has to do everything every single time, as often as your loop asks it to.
For example, if I wanted to benchmark m & -m where m is some uint64_t, I could try:
uint64_t const m = 0x0000080e70100000UL;
for (int i = 0; i < loopsize; ++i)
{
uint64_t result = m & -m;
}
The compiler would obviously say: I'm not even going to calculate that;
since you're not using the result. Aka, it would actually do:
for (int i = 0; i < loopsize; ++i)
{
}
Then you can try:
uint64_t const m = 0x0000080e70100000UL;
static uint64_t volatile result;
for (int i = 0; i < loopsize; ++i)
{
result = m & -m;
}
and the compiler says, ok - so you want me to write to result every time
and do
uint64_t const m = 0x0000080e70100000UL;
uint64_t tmp = m & -m;
static uint64_t volatile result;
for (int i = 0; i < loopsize; ++i)
{
result = tmp;
}
Spending a lot of time writing to the memory address of result loopsize times, just as you asked.
Finally you could also make m volatile, but the result would look like this in assembly:
507b: ba e8 03 00 00 mov $0x3e8,%edx
# top of loop
5080: 48 8b 05 89 ef 20 00 mov 0x20ef89(%rip),%rax # 214010 <m_test>
5087: 48 8b 0d 82 ef 20 00 mov 0x20ef82(%rip),%rcx # 214010 <m_test>
508e: 48 f7 d8 neg %rax
5091: 48 21 c8 and %rcx,%rax
5094: 48 89 44 24 28 mov %rax,0x28(%rsp)
5099: 83 ea 01 sub $0x1,%edx
509c: 75 e2 jne 5080 <main+0x120>
Reading from memory twice and writing to it once, besides the requested calculation with registers.
The correct way to do this is therefore:
for (int i = 0; i < loopsize; ++i)
{
uint64_t result = m & -m;
asm volatile ("" : "+r" (m) : "r" (result));
}
which results in the assembly code (from gcc8.2 on the Godbolt compiler explorer):
# gcc8.2 -O3 -fverbose-asm
movabsq $8858102661120, %rax #, m
movl $1000, %ecx #, ivtmp_9 # induction variable tmp_9
.L2:
mov %rax, %rdx # m, tmp91
neg %rdx # tmp91
and %rax, %rdx # m, result
# asm statement here, m=%rax result=%rdx
subl $1, %ecx #, ivtmp_9
jne .L2
ret
Doing exactly the three requested assembly instructions inside the loop, plus a sub and jne for the loop overhead.
The trick here is that by using the asm volatile1 and tell the compiler
"r" input operand: it uses the value of result as input so the compiler has to materialize it in a register.
"+r" input/output operand: m stays in the same register but is (potentially) modified.
volatile: it has some mysterious side effect and/or is not a pure function of the inputs; the compiler must execute it as many times as the source does. This forces the compiler to leave your test snippet alone and inside the loop. See the gcc manual's Extended Asm#Volatile section.
footnote 1: The volatile is required here or the compiler will turn this into an empty loop. Non-volatile asm (with any output operands) is considered a pure function of its inputs that can be optimized away if the result is unused. Or CSEd to only run once if used multiple times with the same input.
Everything below is not mine-- and I do not necessarily agree with it. --Carlo Wood
If you had used asm volatile ("" : "=r" (m) : "r" (result)); (with an "=r" write-only output), the compiler might choose the same register for m and result, creating a loop-carried dependency chain that tests the latency, not throughput, of the calculation.
From that, you'd get this asm:
5077: ba e8 03 00 00 mov $0x3e8,%edx
507c: 0f 1f 40 00 nopl 0x0(%rax) # alignment padding
# top of loop
5080: 48 89 e8 mov %rbp,%rax # copy m
5083: 48 f7 d8 neg %rax # -m
5086: 48 21 c5 and %rax,%rbp # m &= -m instead of using the tmp as the destination.
5089: 83 ea 01 sub $0x1,%edx
508c: 75 f2 jne 5080 <main+0x120>
This will run at 1 iteration per 2 or 3 cycles (depending on whether your CPU has mov-elimination or not.) The version without a loop-carried dependency can run at 1 per clock cycle on Haswell and later, and Ryzen. Those CPUs have the ALU throughput to run at least 4 uops per clock cycle.
This asm corresponds to C++ that looks like this:
for (int i = 0; i < loopsize; ++i)
{
m = m & -m;
}
By misleading the compiler with a write-only output constraint, we've created asm that doesn't look like the source (which looked like it was computing a new result from a constant every iteration, not using result as an input to the next iteration..)
You might want to microbenchmark latency, so you can more easily detect the benefit of compiling with -mbmi or -march=haswell to let the compiler use blsi %rax, %rax and calculate m &= -m; in one instruction. But it's easier to keep track of what you're doing if the C++ source has the same dependency as the asm, instead of fooling the compiler into introducing a new dependency.
You could make two C files, SlowCalculation compiled with g++ -O3 (high level of optimization), and the benchmark one compiled with g++ -O1 (lower level, still optimized - that may be sufficient for that benchmarking part).
According to the man page, reordering of code happens during -O2 and -O3 optimizations levels.
Since optimization happens during compilation, not linkage, the benchmark side should not be affected by code reordering.
Assuming you are using g++ - but there should be something equivalent in another compiler.
The correct way to do this in C++ is to use a class, e.g. something like
class Timer
{
std::clock_t startTime;
std::clock_t* targetTime;
public:
Timer(std::clock_t* target) : targetTime(target) { startTime = std::clock(); }
~Timer() { *target = std::clock() - startTime; }
};
and use it like this:
std::clock_t slowTime;
{
Timer timer(&slowTime);
int answer = SlowCalculation(42);
}
Mind you, I don't actually believe your compiler will ever re-order like this.
There are a couple of ways that I can think of. The idea is to create compile time barriers so that compiler does not reorder a set of instructions.
One possible way to avoid reordering would be to enforce dependency among instructions that cannot be resolved by compiler (e.g. passing a pointer to the function and using that pointer in later instruction). I'm not sure how that would affect the performance of the actual code that you are interested in benchmarking.
Another possibility is to make the function SlowCalculation(42); an extern function (define this function in a separate .c/.cpp file and link the file to your main program) and declare start and stop as global variables. I do not know what are the optimizations offered by the link-time/inter-procedural optimizer of your compiler.
Also, if you compile at O1 or O0, most probably the compiler would not bother reordering instructions.
Your question is somewhat related to (Compile time barriers - compiler code reordering - gcc and pthreads)
Reordering described by your colleague just breaks 1.9/13
Sequenced before is an asymmetric, transitive, pair-wise relation between evaluations executed by a single
thread (1.10), which induces a partial order among those evaluations. Given any two evaluations A and B, if
A is sequenced before B, then the execution of A shall precede the execution of B. If A is not sequenced before
B and B is not sequenced before A, then A and B are unsequenced . [ Note: The execution of unsequenced
evaluations can overlap. —end note ] Evaluations A and B are indeterminately sequenced when either A
is sequenced before B or B is sequenced before A, but it is unspecified which. [ Note: Indeterminately
sequenced evaluations cannot overlap, but either could be executed first. —end note ]
So basically you should not think about reordering while you don't use threads.