llvm: why is adce more aggressive than dce

Both adce and dce are optimization passes in LLVM's opt, and they work in opposite directions: adce starts by assuming every instruction is dead, while dce starts by assuming every instruction is live. Why does that make adce more aggressive than dce? And a second question: is there an example (a .ll file) whose output differs depending on whether it is optimized by dce or by adce?
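For illustration, here is a sketch (mine, not taken from the original question; file name and exact commands are assumptions, and pass spellings vary between LLVM versions) of the kind of code where the two passes can differ. After mem2reg, the unused accumulator below becomes a phi/add cycle in which every instruction still has a use, so plain dce cannot remove it; adce can, because nothing in the cycle contributes to a live root (the return value or a side effect).

// dead_cycle.cpp -- illustrative sketch
//   clang++ -S -emit-llvm -O0 -Xclang -disable-O0-optnone dead_cycle.cpp -o dead_cycle.ll
//   opt -passes=mem2reg,dce  -S dead_cycle.ll   // the dead phi/add cycle survives
//   opt -passes=mem2reg,adce -S dead_cycle.ll   // the dead phi/add cycle is removed
int count(int n) {
    int i = 0;
    int unused = 0;        // accumulator that is never read afterwards
    while (i < n) {
        unused += i;       // after mem2reg: a phi + add that only feed each other
        i += 1;
    }
    return i;              // only i contributes to the live result
}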

Related

Why is C-style array performance at -O3 worse than with no optimization on Quick Bench?

Based on C-style Arrays vs std::vector using std::vector::at, std::vector::operator[], and iterators, I ran the following benchmarks.
no optimization
https://quick-bench.com/q/LjybujMGImpATTjbWePzcb6xyck
O3
https://quick-bench.com/q/u5hnSy90ZRgJ-CQ75b1c1a_3BuY
From these, vectors definitely perform better at -O3.
However, the C-style array is slower with -O3 than with -O0:
C-style (no opt) : about 2500
C-style (O3) : about 3000
I don't know what factors lead to this result. Maybe it's because the code is compiled as C++14?
(I'm not asking about std::vector relative to plain arrays; I'm just asking about plain arrays with/without optimization.)
Your -O0 code wasn't faster in an absolute sense, just as a ratio against an empty
for (auto _ : state) {} loop.
That also gets slower when optimization is disabled, because the state iterator functions don't inline. Check the asm for your own functions, and instead of an outer-loop counter in %rbx like:
# outer loop of your -O3 version
sub $0x1,%rbx
jne 407f57 <BM_map_c_array(benchmark::State&)+0x37>
RBX was originally loaded from 0x10(%rdi), from the benchmark::State& state function arg.
You instead get state counter updates in memory, like the following, plus a bunch of convoluted code that materializes a boolean in a register and then tests it again.
# part of the outer loop of your -O0 version
12.50% mov -0x8060(%rbp),%rax
25.00% sub $0x1,%rax
12.50% mov %rax,-0x8060(%rbp)
There are high counts on those instructions because the call map_c_array didn't inline, so most of the CPU time wasn't actually spent in this function itself. But of the time that was, about half was on these instructions. In an empty loop, or one that called an empty function (I'm not sure which Quick Bench is doing), that would still be the case.
Quick Bench does this to try to normalize things for whatever hardware its cloud VM ends up running on, with whatever competing load. Click the "About Quick Bench" in the dropdown at the top right.
And see the label on the graph: CPU time / Noop time. (When they say "Noop", they don't mean a nop machine instruction, they mean in a C++ sense.)
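For reference, the "Noop" baseline Quick Bench divides by is (as far as I understand their setup; treat this as an assumption) essentially an empty Google Benchmark function like this:

#include <benchmark/benchmark.h>

// Empty benchmark: the body does nothing, so all that remains is the
// per-iteration overhead of the state loop itself.
static void BM_noop(benchmark::State& state) {
  for (auto _ : state) {
  }
}
BENCHMARK(BM_noop);
BENCHMARK_MAIN();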
An empty loop with a loop counter runs about 6x slower when compiled with optimization disabled (bottlenecked on store-to-load forwarding latency of the loop counter), so in absolute time your -O0 code is "only" a bit less than 6x slower than the -O3 version, not exactly 6x, even though its reported ratio (about 2500 vs. 3000) is smaller.
With a counter in a register, modern x86 CPUs can run loops at 1 cycle per iteration, like looptop: dec %ebx / jnz looptop. dec has one cycle latency, vs. subtract or dec on a memory location being about 6 cycles since it includes the store/reload. (See https://agner.org/optimize/ and https://uops.info/.) Also related:
The performance of two scan functions (benchmarked without optimization; my answer explains that they bottleneck on store-forwarding latency.)
Why does this difference in asm matter for performance (in an un-optimized ptr++ vs. ++ptr loop)?
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
Adding a redundant assignment speeds up code when compiled without optimization (Intel Sandybridge-family store-forwarding has variable latency depending on how soon you try to reload).
With that bottleneck built into the baseline you're comparing against, it's normal that adding some array-access work inside the loop doesn't produce as large a ratio as the same array access measured against a fast, optimized empty loop.
Because you aren't benchmarking what you think you're benchmarking. I bothered to look at your code and found that you're trying to see how fast your CPU can advance the counter in a for loop, while also seeing how fast your data bus can transfer data. Is this really something you need to worry about, like ever?
In general, micro-benchmarks outside the context of a multi-thousand-line program are worthless and will never be taken seriously by anyone even remotely experienced in programming, so stop doing that.

Do C++ compilers align small functions to optimize cache-line fetches?

I may be misunderstanding how cache fetches work, but I'm curious if there are any compiler optimizations for aligning small functions that are not inlined.
If the cache-line size is 64 bytes on a given machine, would it make sense to align functions that are smaller than 64 bytes (and are reached through function pointers) within a single cache line, to prevent multiple cache-line fetches when retrieving the function?
Even if the function is 100 bytes in size, aligning it would let it fit in 2 cache lines, with the worst case being 3 if it is unaligned. Is this a viable optimization, and do compilers use anything like this in real applications, such as packing small, commonly used functions together?
No, mainstream compilers like gcc and clang don't leave extra unused space to start a small function at the start of a cache line, to avoid having its end cross a boundary. Nor do they choose an order within the text section that optimizes for this without reducing overall code density for I-cache and iTLB.
AFAIK, GCC doesn't even know instruction sizes; it compiles by emitting asm text that gets assembled separately. Before every function, it emits .p2align 4 (assuming the default is -falign-functions=16, like on x86-64); clang's equivalent is -mllvm -align-all-functions=4 (2^4 = 16). CPUs often fetch in chunks of that size, and you want the first aligned fetch to pull in multiple useful instructions.
Inside functions, GCC by default aligns branch targets (or at least the tops of loops) by padding to the next multiple of 16 if that would take 10 or fewer bytes, then unconditionally aligning to a multiple of 8; the condition is implemented by the assembler (which does know machine-code sizes / positions):
.p2align 4,,10 # GCC does this for loop tops *inside* functions
.p2align 3
It's an interesting idea, though; it might be worth looking at whether there are any real-world benefits to doing this.
Most frequently-called functions are already hot in some level of cache (because caches work, and being frequently called means they tend to stay hot), but this could possibly reduce the number of cache lines that need to stay hot.
Another thing to consider is that for many functions, not all the code bytes are hot. e.g. the fast path might be in the first 32 bytes, with later code bytes only being if() or else blocks for error conditions or other special cases. (Figuring out which path through the function is the common one is part of a compiler's job, although profile-guided optimization (PGO) can help. Hinting with C++20 [[likely]] / [[unlikely]] can achieve the same result if the hints are actually correct, letting the compiler lay out the code so the fast path(s) minimize taken branches and maximize cache locality. How do the likely/unlikely macros in the Linux kernel work and what is their benefit? has an example using GNU C __builtin_expect().)
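A minimal sketch of that kind of hinting (names and numbers are mine, purely illustrative):

// The cold error path can be laid out after the hot path (or in a cold
// section), so the first cache line of the function holds the common case.
int handle_error(int x) { return -x; }   // stand-in for a rarely-taken slow path

int process(int x) {
    if (x >= 0) [[likely]] {
        return x * 2;            // hot path: kept as the fall-through
    } else [[unlikely]] {
        return handle_error(x);  // cold path: can be moved out of line
    }
}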
Sometimes these later parts of a function will jump back to the "main" path of the function when they're done, other times they independently end with their own ret instruction. (This is called "tail duplication", optimizing away a jmp by duplicating the epilogue if any.)
So if you were to blindly assume that it matters to have the whole function in the same cache line, but actually only the first 32 bytes typically execute on most calls, you could end up displacing the start of a later somewhat larger function so it starts closer to the end of a cache line, maybe without gaining anything.
So this could maybe be good with profile-guided optimization to figure out which functions were actually hot, and group them adjacent to each other (for iTLB and L1i locality), and sort them so they pack nicely. Or which functions tend to be called together, one after the other.
And conversely, to group functions that often go unused for a long time together, so those cache lines can stay cold (and even iTLB entries if there are pages of them).
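If you want to experiment without PGO, GCC and clang also accept explicit hot/cold attributes (a sketch; the exact placement, e.g. a .text.unlikely subsection, is GCC behaviour and may differ elsewhere):

// Grouping hints: 'hot' functions tend to get packed together; 'cold' ones are
// typically moved into a separate part of .text (GCC uses .text.unlikely).
__attribute__((hot))  int fast_path(int x)  { return x + 1; }
__attribute__((cold)) int error_path(int x) { return -x; }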

What might cause the same SSE code to run a few times slower in the same function?

Edit 3: The images are links to the full-size versions. Sorry for the pictures-of-text, but the graphs would be hard to copy/paste into a text table.
I have the following VTune profile for a program compiled with icc --std=c++14 -qopenmp -axS -O3 -fPIC:
In that profile, two clusters of instructions are highlighted in the assembly view. The upper cluster takes significantly less time than the lower one, in spite of instructions being identical and in the same order. Both clusters are located inside the same function and are obviously both called n times. This happens every time I run the profiler, on both a Westmere Xeon and a Haswell laptop that I'm using right now (compiled with SSE because that's what I'm targeting and learning right now).
What am I missing?
Ignore the poor concurrency; it's most probably due to the laptop throttling, since it doesn't occur on the desktop Xeon machine.
I believe this is not an example of micro-optimisation, since those three added together amount to a decent % of the total time, and I'm really interested in the possible cause of this behavior.
Edit: OMP_NUM_THREADS=1 taskset -c 1 /opt/intel/vtune...
Same profile, albeit with a slightly lower CPI this time.
HW perf counters typically charge stalls to the instruction that had to wait for its inputs, not the instruction that was slow producing outputs.
The inputs for your first group come from your gather. That probably cache-misses a lot, and those costs aren't going to get charged to the SUBPS/MULPS/ADDPS instructions. Their inputs come directly from vector loads of voxel[], so a store-forwarding failure will cause some latency. But that's only ~10 cycles IIRC, small compared to cache misses during the gather. (Those cache misses show up as large bars for the instructions right before the first group that you've highlighted.)
The inputs for your second group come directly from loads that can miss in cache. In the first group, the direct consumers of the cache-miss loads were instructions for lines like the one that sets voxel[0], which has a really large bar.
But in the second group, the time for the cache misses in a_transfer[] is getting attributed to the group you've highlighted. Or if it's not cache misses, then maybe it's slow address calculation as the loads have to wait for RAX to be ready.
It looks like there's a lot you could optimize here.
Instead of a store/reload for a_pointf, just keep it hot across loop iterations in a __m128 variable. Storing/reloading in the C source only makes sense if you found the compiler was making a poor choice about which vector register to spill (if it ran out of registers).
Calculate vi with _mm_cvttps_epi32(vf), so the ROUNDPS isn't part of the dependency chain for the gather indices.
Do the voxel gather yourself by shuffling narrow loads into vectors, instead of writing code that copies to an array and then loads from it. (guaranteed store-forwarding failure, see Agner Fog's optimization guides and other links from the x86 tag wiki).
It might be worth it to partially vectorize the address math (calculation of base_0, using PMULDQ with a constant vector), so instead of a store/reload (~5 cycle latency) you just have a MOVQ or two (~1 or 2 cycle latency on Haswell, I forget.)
Use MOVD to load two adjacent short values, and merge another pair into the second element with PINSRD. You'll probably get good code from _mm_setr_epi32(*(const int*)base_0, *(const int*)(base_0 + dim_x), 0, 0), except that pointer aliasing is undefined behaviour. You might get worse code from _mm_setr_epi16(*base_0, *(base_0 + 1), *(base_0 + dim_x), *(base_0 + dim_x + 1), 0,0,0,0).
Then expand the low four 16-bit elements into 32-bit integers with PMOVSX, and convert them all to float in parallel with _mm_cvtepi32_ps (CVTDQ2PS).
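A sketch of that load/widen/convert sequence with intrinsics (base_0 and dim_x stand in for the question's variables; memcpy is used instead of the int* casts to sidestep the aliasing problem mentioned above):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Load a 2x2 block of int16_t samples (two adjacent values from each of two
// rows), sign-extend to 32-bit, and convert to float.
static inline __m128 load_2x2_as_float(const int16_t *base_0, std::ptrdiff_t dim_x)
{
    int row0, row1;
    std::memcpy(&row0, base_0,         sizeof(row0));   // two shorts from row 0
    std::memcpy(&row1, base_0 + dim_x, sizeof(row1));   // two shorts from row 1
    __m128i v = _mm_cvtsi32_si128(row0);                // MOVD
    v = _mm_insert_epi32(v, row1, 1);                   // PINSRD (SSE4.1)
    __m128i v32 = _mm_cvtepi16_epi32(v);                // PMOVSXWD: 4 x i16 -> 4 x i32
    return _mm_cvtepi32_ps(v32);                        // CVTDQ2PS: -> 4 floats
}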
Your scalar LERPs aren't being auto-vectorized, but you're doing two in parallel (and could maybe save an instruction since you want the result in a vector anyway).
Calling floorf() is silly, and a function call forces the compiler to spill all xmm registers to memory. Compile with -ffast-math or whatever to let it inline to a ROUNDSS, or do that manually. Especially since you go ahead and load the float that you calculate from that into a vector!
Use a vector compare instead of scalar prev_x / prev_y / prev_z, and use MOVMSKPS to get the result into an integer you can test. You only care about the lower 3 elements, so after a compare for not-equal with _mm_cmpneq_ps, test the result with compare_mask & 0b0111 (true if any of the low 3 bits of the 4-bit mask are set). See the double version of the instruction for more tables on how it all works: http://www.felixcloutier.com/x86/CMPPD.html
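And a sketch of that vector compare (prev and cur are assumed to be __m128 values holding x, y, z in the low three elements):

#include <immintrin.h>

// Returns true if any of x, y, z changed; the high element is ignored.
static inline bool position_changed(__m128 prev, __m128 cur)
{
    __m128 neq = _mm_cmpneq_ps(prev, cur);   // CMPNEQPS: per-element not-equal
    int mask   = _mm_movemask_ps(neq);       // MOVMSKPS: one sign bit per element
    return (mask & 0b0111) != 0;             // test only the low 3 elements
}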
When analyzing assembly code, please note that running time is attributed to the next instruction, so the per-instruction data you're looking at needs to be interpreted carefully. There is a corresponding note in the VTune Release Notes:
Running time is attributed to the next instruction (200108041)
To collect the data about time-consuming running regions of the
target, the Intel® VTune™ Amplifier interrupts executing target
threads and attributes the time to the context IP address.
Due to the collection mechanism, the captured IP address points to an
instruction AFTER the one that is actually consuming most of the time.
This leads to the running time being attributed to the next
instruction (or, rarely to one of the subsequent instructions) in the
Assembly view. In rare cases, this can also lead to wrong attribution
of running time in the source - the time may be erroneously attributed
to the source line AFTER the actual hot line.
In case the inline mode is ON and the program has small functions
inlined at the hotspots, this can cause the running time to be
attributed to a wrong function since the next instruction can belong
to a different function in tightly inlined code.

How to get "phi" instruction in llvm without optimization

When I use the command clang -emit-llvm -S test.c -o test.ll, there is no "phi" instruction in the IR file. How can I get one?
I know that I can use the "-mem2reg" or "-gvn" pass to get "phi" instructions, but they would do some optimization. I just want to get "phi" without any optimization.
I'm not sure what you mean by "do some optimization" but it seems to me that mem2reg is exactly what you need. Here is how it's described in the documentation:
This file promotes memory references to be register references. It
promotes alloca instructions which only have loads and stores as uses.
An alloca is transformed by using dominator frontiers to place phi
nodes, then traversing the function in depth-first order to rewrite
loads and stores as appropriate. This is just the standard SSA
construction algorithm to construct “pruned” SSA form.
Clang itself does not produce optimized LLVM IR. It produces fairly straightforward IR in which locals are kept in memory (using allocas). The optimizations are done by opt at the LLVM IR level, and one of the most important optimizations is indeed mem2reg, which makes sure that locals are represented as LLVM SSA values instead of in memory.
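As a concrete sketch (file name and exact commands are assumptions; with clang at -O0 you may also need -Xclang -disable-O0-optnone so opt doesn't skip the function):

// test.cpp
//   clang++ -S -emit-llvm -O0 -Xclang -disable-O0-optnone test.cpp -o test.ll
//   opt -passes=mem2reg -S test.ll -o test_ssa.ll
// test.ll keeps x in an alloca with loads/stores; test_ssa.ll merges the two
// assignments with something like: %x = phi i32 [ 1, %if.then ], [ 2, %if.else ]
int pick(int c) {
    int x;
    if (c)
        x = 1;
    else
        x = 2;
    return x;
}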

Branch on ?: operator?

For a typical modern compiler on modern hardware, will the ? : operator result in a branch that affects the instruction pipeline?
In other words which is faster, calling both cases to avoid a possible branch:
bool testVar = someValue(); // Used later.
purge(white);
purge(black);
or picking the one that actually needs to be purged and only purging that one, using the ?: operator:
bool testVar = someValue();
purge(testVar ? white : black);
I realize you have no idea how long purge() will take, but I'm just asking a general question here about whether I would ever want to call purge() twice to avoid a possible branch in the code.
I realize this is a very tiny optimization and may make no real difference, but would still like to know. I expect the ?: does not result in branching, but want to make sure my understanding is correct.
It depends on the platform. Specifically, it depends on the size of the CPU's jump prediction table and whether the CPU allows conditional operations (like on ARM).
CPUs with conditional operations will strongly favor the second case. CPUs with bigger jump prediction tables will favor the first case.
The real answer (like with any other performance questions): measure and compare. Sometimes the rest of the code throws a curve ball and it's usually impossible to predict effects of some changes.
The CMOV (Conditional MOVe) instruction has been part of the x86 instruction set since the Pentium Pro. It is rarely generated automatically by GCC because of commonly used compiler options and restrictions placed by the C language. A SETCC/CMOV sequence can be inserted via inline assembly in your C program. This should only be done in cases where the conditional variable is a randomly oscillating value in the inner loop (millions of executions) of a program. In non-oscillating cases, and in cases with simple patterns of oscillation, modern processors can predict branches with a very high degree of accuracy. In 2007, Linus Torvalds suggested here avoiding the use of CMOV in most situations.
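To make that concrete (my sketch, not from the question): when the ?: merely selects a value, compilers often emit a branchless CMOV for it at -O2 on x86-64, though that is never guaranteed.

// A value-selecting ternary; GCC and clang typically compile this to
// cmp + cmov (or a min idiom) rather than a conditional branch.
int select_smaller(int a, int b) {
    return (a < b) ? a : b;
}

In the question's second form, only the argument is selected; the call to purge() itself is unconditional, so the selection can still be branchless even though purge() runs exactly once.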
Intel describes the conditional move in the Intel(R) Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual:
The CMOVcc instructions check the state of one or more of the status
flags in the EFLAGS register (CF, OF, PF, SF, and ZF) and perform a
move operation if the flags are in a specified state (or condition). A
condition code (cc) is associated with each instruction to indicate
the condition being tested for. If the condition is not satisfied, a
move is not performed and execution continues with the instruction
following the CMOVcc instruction.
These instructions can move a 16- or 32-bit value from memory to a
general-purpose register or from one general-purpose register to
another. Conditional moves of 8-bit register operands are not
supported.
The conditions for each CMOVcc mnemonic is given in the description
column of the above table. The terms “less” and “greater” are used for
comparisons of signed integers and the terms “above” and “below” are
used for unsigned integers.
Because a particular state of the status flags can sometimes be
interpreted in two ways, two mnemonics are defined for some opcodes.
For example, the CMOVA (conditional move if above) instruction and the
CMOVNBE (conditional move if not below or equal) instruction are
alternate mnemonics for the opcode 0F 47H.
I can't imagine the first method would ever be faster.
With the first method you may avoid a branch, but you replace it with a function call, which would usually involve a branch plus a lot more (unless it was inlined). Even if inlined, unless the functionality inside the purge() function was absolutely trivial it would almost certainly be slower.
Calling a function is at least as expensive as doing a logic test + jump (and yes, the ? : ternary operator would require a jump).
In the first case purge is called twice; in the second case purge is called once.
It's hard to answer the question about branching because it's so dependent on the compiler and instruction set. For example, on ARM (which has conditional instruction execution) it might not branch. On x86 it almost certainly will.