Simple getter/accessor prevents vectorization - gcc bug?

Consider this minimal implementation of a fixed vector<int>:
#include <cstddef>

constexpr std::size_t capacity = 1000;

struct vec
{
    int values[capacity];
    std::size_t _size = 0;

    std::size_t size() const noexcept
    {
        return _size;
    }

    void push(int x)
    {
        values[size()] = x;
        ++_size;
    }
};
Given the following test case:
vec v;
for(std::size_t i{0}; i != capacity; ++i)
{
    v.push(i);
}
asm volatile("" : : "g"(&v) : "memory");
The compiler produces non-vectorized assembly: live example on godbolt.org
If I make any of the following changes...
values[size()] -> values[_size]
Add __attribute__((always_inline)) to size()
...then the compiler produces vectorized assembly: live example on godbolt.org
Is this a gcc bug? Or is there a reason why a simple accessor such as size() would prevent auto-vectorization unless always_inline is explicitly added?
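For reference, the always_inline variant of the accessor looks like this:
__attribute__((always_inline)) std::size_t size() const noexcept
{
    return _size;
}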

The loop in your example is vectorised for GCC < 7.1 and not vectorised for GCC >= 7.1, so there seems to be some change in behaviour here.
We can look at the compiler optimisation report by adding -fopt-info-vec-all to the command line:
For GCC 7.3:
<source>:24:29: note: === vect_pattern_recog ===
<source>:24:29: note: === vect_analyze_data_ref_accesses ===
<source>:24:29: note: not vectorized: complicated access pattern.
<source>:24:29: note: bad data access.
<source>:21:5: note: vectorized 0 loops in function.
For GCC 6.3:
<source>:24:29: note: === vect_pattern_recog ===
<source>:24:29: note: === vect_analyze_data_ref_accesses ===
<source>:24:29: note: === vect_mark_stmts_to_be_vectorized ===
[...]
<source>:24:29: note: LOOP VECTORIZED
<source>:21:5: note: vectorized 1 loops in function.
So GCC 7.x decides not to vectorise the loop, because of a complicated access pattern, which might be the (at that point) non-inlined size() function. Forcing inlining, or doing it manually fixes that. GCC 6.x seems to do that by itself. However, the assembly does look like size() was eventually inlined in both cases, but maybe only after the vectorisation step in GCC 7.x (this is me guessing).
I wondered why you put the asm volatile(...) line at the end - probably to prevent the compiler from throwing away the whole loop, because it has no observable effect in this test case. If we just return the last element of v instead, we can achieve the same effect without causing any possible side-effects on the memory model for v.
return v.values[capacity - 1];
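Putting it together, the modified test case would look something like this (the wrapping function and its name are mine):
int test()
{
    vec v;
    for(std::size_t i{0}; i != capacity; ++i)
    {
        v.push(i);
    }
    return v.values[capacity - 1]; // observable result keeps the loop alive
}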
The code now vectorises with GCC 7.x, as it already did with GCC 6.x:
<source>:24:29: note: === vect_pattern_recog ===
<source>:24:29: note: === vect_analyze_data_ref_accesses ===
<source>:24:29: note: === vect_mark_stmts_to_be_vectorized ===
[...]
<source>:24:29: note: LOOP VECTORIZED
<source>:21:5: note: vectorized 1 loops in function.
So what's the conclusion here?
- something changed with GCC 7.1
- best guess: a side-effect of the asm volatile messes with the inlining of size(), preventing vectorisation
Whether or not this is a bug - could be either in 6.x or 7.x depending on what behaviour is desired for the asm volatile() construct - would be a question for the GCC developers.
Also: try adding -mavx2 or -mavx512f -mavx512cd (or -march=native etc.) to the command line, depending on your hardware, to get vectorisation beyond 128-bit xmm, i.e. ymm and zmm, registers.

I was able to narrow the problem down.
With double or single precision and the optimization flags -std=c++11 -Ofast -march=native:
Clang version >= 5.0.0 produces AVX move instructions with zmm registers
GCC versions 4.9 through 6.3 produce AVX move instructions with zmm registers
GCC version >= 7.1.0 produces AVX move instructions with xmm registers
Try it out: https://godbolt.org/g/NXgF4g

Related

disable all obvious elimination when compiling with gcc (without changing my source code!)

I want to keep all dead code (or anything else that can obviously be optimized away) when compiling with gcc, but even with -O0, some dead code is still optimized away. How can I keep all the code without changing my source code? The sample code is as follows; when compiling with g++ -S -O0 main.cc, the if-statement is optimized away in the assembly code (there is no cmpl or jmp code).
int main() {
    constexpr int a = 123; // or const int a = 0; I do not want to remove the `const` or `constexpr` qualifier.
    if (a) // or just if (123)
        return 1;
    return 0;
}
A related question is here: Disable "if(0)" elimination in gcc. But the answers there need you to change your source code (remove const/constexpr qualifier) which I do not want to do.
Is there a way to achieve this using only compiler flags, without changing my source code?
It is not possible with GCC to keep the conditional in this case, since it is removed during a very early stage of the compilation.
First of all, here are the compilation steps of GCC:
1. Code parsing (syntax & semantics), producing an AST in GENERIC representation (HL-IR)
2. High-level GIMPLE generation (ML-IR)
3. Low-level GIMPLE generation (ML-IR)
4. Tree SSA optimization (ML-IR)
5. RTL generation (LL-IR)
6. Code optimization
7. Assembly generation
The conditional is already removed after the generation of the (theoretically unoptimized) high-level GIMPLE representation, thus before any optimization step. One can check this by using the GCC flag -fdump-tree-all and looking at the first generated GIMPLE code. Here is the result:
;; Function int main() (null)
;; enabled by -tree-original
{
  const int a = 123;
    <<cleanup_point const int a = 123;>>;
  return <retval> = 1;
  return <retval> = 0;
}
return <retval> = 0;
One can note that the resulting code is the same with both constexpr and const. Actually, constexpr is treated as a simple const variable in the HL GIMPLE code.
It is hard to know exactly when the conditional is removed in Step 1, as GENERIC is an implementation-dependent internal representation of GCC that is not very flexible/customizable. AFAIK, it is not yet possible to dump the AST/GENERIC representation directly. You can extract it yourself with some GCC plugins, but this is a quite tricky task.

Lambda vs. manually inlined code changes GCC's optimizer behavior

The following code:
#include <vector>

extern std::vector<int> rng;

int main()
{
    auto is_even=[](int x){return x%2==0;};
    int res=0;
    for(int x:rng){
        if(is_even(x))res+=x;
    }
    return res;
}
is optimized by GCC 11.1 (link to Godbolt) in a very different way than:
#include <vector>

extern std::vector<int> rng;

int main()
{
    int res=0;
    for(int x:rng){
        if(x%2==0)res+=x;
    }
    return res;
}
(Link to Godbolt.) Moreover, the second version (where the lambda has been replaced by manually inlining its body at the call site) is much faster than the first one.
Is this a GCC bug?
There is no such thing as a vectorized integer modulo operation in the x64 architecture. This means that the code as written is not inherently vectorizable, and needs to be transformed before that can be done.
You can see the vectorization working just fine in both cases in the much easier case where a SIMD-friendly evenness test is used: https://godbolt.org/z/hc5ffbePY
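One SIMD-friendly formulation tests the low bit directly instead of using % (the linked example may use something different; this is just an illustration):
auto is_even = [](int x) { return (x & 1) == 0; };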
So if anything, it could be argued that it's actually pretty impressive that GCC manages to vectorize the inlined version at all, and that clang vectorizes both of them.
That being said, since we know for a fact that GCC is capable of performing that transformation, it would appear that it is only performed before inlining happens, which is unfortunate and probably deserves being brought to the maintainers' attention.
It's a quirk of the code generation. There is no reason why the lambda version shouldn't be vectorized. In fact, clang vectorizes it as-is. If you specify return type as int, GCC vectorizes it too:
auto is_even = [](int x) -> int { return x % 2 == 0; };
If you use std::accumulate, it's also vectorized (a sketch follows below). You can report this to GCC so they can fix it.
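A sketch of the std::accumulate formulation (the exact shape here is my guess, not the poster's code):
#include <numeric>
#include <vector>

extern std::vector<int> rng;

int main()
{
    // Same even-sum reduction, expressed through the standard algorithm
    return std::accumulate(rng.begin(), rng.end(), 0,
                           [](int acc, int x) { return x % 2 == 0 ? acc + x : acc; });
}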

How to auto-vectorise a loop which 1) modifies an array, 2) indicates whether the array changed or not at the end?

I have this C++ function:
#include <stddef.h>

typedef unsigned long long Word;

bool fun(Word *lhs, const Word *rhs, size_t s)
{
    bool changed = false;
    #pragma omp simd
    for (size_t i = 0; i < s; ++i) {
        const Word old = lhs[i];
        lhs[i] |= rhs[i];
        changed = changed || old != lhs[i];
    }
    return changed;
}
In essence, it's a bitwise-or implementation for a bit vector (lhs |= rhs). I'm quite new to writing SIMD-conscious code, and I can't quite figure out how to get the compiler to vectorise this without introducing extra overhead (e.g., making changed an array then looping over it). Removing the changed = ... line allows everything to vectorise fine.
I have tried with omp simd and without. I don't think this is relevant but I want to keep it because lhs and rhs never overlap and I want to add the align clause eventually.
Currently, I'm working with GCC, but I'd like things to work well with both GCC and Clang eventually.
TL:DR: use Word unchanged = -1ULL; and update it with unchanged &= (old == lhs[i]) ? -1ULL : 0; so this maps naturally to a SIMD compare-for-equal and SIMD AND.
Or even better, changed |= old ^ lhs[i]; vectorizes nicely with GCC and clang, for Word changed = 0;. With clang, it gives optimal asm. With GCC, the first way is better because GCC pessimizes to changed |= (~old) & rhs[i]; // find RHS bits that weren't already set costing an extra movdqa register copy, or with AVX removing the ability to fold the unaligned load into a memory source for vpor (because it needs both operands twice, once for this and once for the main |).
Compare-for-unequal isn't directly available until AVX-512; doing that would have to invert the compare result before combining into a changed vector.
The overall operation can be vectorized manually with intrinsics (or asm) pretty much as written, without any major transformations, except of course optimizing to bitwise | OR instead of actual short-circuit evaluation. So this is basically a missed optimization. But in the natural asm implementation of this, your vector of changed elements would be the same width as the data, not just 4 bools. (For x86 that would take an extra vmovmskpd to feed a scalar or instead of just a SIMD vpor, and most ISAs don't have a movemask operation so maybe the generic vectorizer isn't even considering using it. Fun fact: clang auto-vectorizes your original code really badly, doing a horizontal OR down to a scalar bool every iteration.)
Using Word changed = 0; lets this vectorize fairly decently, with changed |= ..., with or without OpenMP pragmas (differently, haven't sorted out which is actually better for every combo). Compilers are dumb (complex pieces of machinery, not human understanding) and often don't figure out things like this for themselves - auto-vectorization is hard enough that they sometimes need some hand-holding.
So the trick is making changed the same width as the array elements.
If you use OpenMP, you need to tell the OpenMP vectorizer about reductions such as sum of an array with +, or in this case OR. In this case, #pragma omp simd reduction(|:changed). You should be using changed |= stuff instead of logical short-circuit eval anyway, if you want this to vectorize into branchless SIMD. reduction(|:changed) actually seems to override your actual code to some degree, so be careful it matches.
ICC even breaks your code (not updating changed in the SIMD part) if you just use #pragma omp simd https://godbolt.org/z/bG98Kz. (Perhaps that gives it license to ignore serial dependencies, or at least reductions, which you didn't tell it about? Either that or an ICC bug, I don't know OpenMP very well.)
With the original bool changed instead of Word, GCC doesn't auto-vectorize at all, and clang does a nasty job (horizontal reduction to a scalar bool inside the inner loop!)
Two versions that auto-vectorize:
On Godbolt with -O3 -march=nehalem -mtune=skylake -fopenmp (So using SSE4.1 / 4.2, but not AVX or BMI1/BMI2). I haven't looked in detail at which ends up with less clunky cleanup code.
#include <stddef.h>

typedef unsigned long long Word;

bool fun_v1(Word *lhs, const Word *rhs, size_t s)
{
    Word changed = 0;
    #pragma omp simd reduction(|:changed) // optional, some asm differences with/without
    for (size_t i = 0; i < s; ++i) {
        const Word old = lhs[i];
        changed |= (~old) & rhs[i]; // find RHS bits that weren't already set. Pure bitwise, no 64-bit-element SIMD == needed. Do this before storing so the compiler doesn't have to worry about lhs/rhs overlap.
        lhs[i] |= rhs[i];
        //changed |= (old != lhs[i]) ? -1ULL : 0; // requires inverting the cmpeq result, but can fold a memory operand with AVX unlike the bitwise version
        //changed = changed || (old != lhs[i]); // short-circuit eval is weird for SIMD, compiles inefficiently.
    }
    return changed;
}
(Update: changed |= old ^ lhs[i]; appears even better for getting a non-zero value on not-equal. It uses only commutative operations and doesn't need == / pcmpeqq. @chtz suggested this in the comments; I haven't rewritten the rest of the answer to cut out discussion of worse options. clang will auto-vectorize with it, and with AVX it allows a memory source operand for rhs because rhs is only needed once. https://godbolt.org/z/ex5519. So this appears to be the best of both worlds; sketched below.)
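A sketch of that XOR variant (the name fun_v3 and the pragma placement are mine, following the fun_v1 pattern):
bool fun_v3(Word *lhs, const Word *rhs, size_t s)
{
    Word changed = 0;
    #pragma omp simd reduction(|:changed) // optional, as above
    for (size_t i = 0; i < s; ++i) {
        const Word old = lhs[i];
        lhs[i] |= rhs[i];
        changed |= old ^ lhs[i]; // non-zero iff any bit actually changed
    }
    return changed != 0;
}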
changed |= (old != lhs[i]) ? -1ULL : 0; is also still only 10 instructions (9 uops) in the inner loop, same as changed |= (~old) & rhs[i];, for GCC 10.2 without AVX. But for clang, that defeats auto-vectorization! Clang will handle changed |= (old != lhs[i]); (or with an explicit ? 1 : 0) so that's odd. -1ULL avoids needing a set1_epi64x(1) vector constant so I used that.
Versions using == or != will need SSE4.1 pcmpeqq for vectorization of 64-bit compare for ==: compilers probably aren't going to be smart enough to realize that any integer element-size is fine for the overall thing. And emulating a narrower compare probably won't look profitable.
The ~old & rhs[i] way works with just SSE2. Ending the loop with SSE4.1 ptest instead of shuffles and POR and MOVQ would be more efficient, but compilers are pretty dumb about stuff like that, and about handling the ends of loops in general: just naive reduction, and scalar cleanup for odd elements instead of a possibly-overlapping final vector that ends at the end of the arrays. (|= is idempotent, so at worst that would cause a store-forwarding stall if you don't schedule your loads well. That's another thing you could do better with manual vectorization, but using intrinsics would force one SIMD vector width, while auto-vec lets the compiler use wider vectors when you compile for an AVX2 CPU like -march=haswell or -march=znver2.)
Until AVX-512, only compare for == is available (or >), not != directly. To reduce that the way we want, we'd need unchanged &= (old == updated);. This lets GCC save 1 instruction in the loop, bringing it down to 9 instructions, 8 uops. It can possibly run at 1 iteration per 2 cycles.
But clang for some reason doesn't auto-vectorize it at all. Apparently clang doesn't like the ? -1 : 0 ternary here or in the other version, maybe not realizing that's what SIMD compares produce.
bool fun_v2(Word *lhs, const Word *rhs, size_t s)
{
    Word unchanged = -1ULL;
    // clang fails to vectorize?!? GCC works as expected with/without pragma
    #pragma omp simd reduction(&:unchanged)
    for (size_t i = 0; i < s; ++i) {
        const Word old = lhs[i];
        lhs[i] |= rhs[i];
        unchanged &= (old == lhs[i]) ? -1ULL : 0;
    }
    return !unchanged;
}
With AVX available, vpor with a memory source operand would be efficient if compilers weren't using a stupid indexed addressing mode, forcing it to un-laminate on Intel Sandybridge family (but not on AMD).
Note that if you're thinking of using Word as a wide type to use this on arbitrary data of other types, beware strict-aliasing rules and Undefined Behaviour. Manual vectorization might be a good bet because _mm_loadu_si128((const __m128i*)int_ptr); is fully strict-aliasing safe: vector pointers (and load/store intrinsics) are like char* in that they can alias anything. For a portable version, either use memcpy or the GNU C typedef unsigned long unaligned_aliasing_chunk __attribute__((may_alias,aligned(1)));. "Word" has different meanings in asm for different ISAs, like being 16-bit on x86, so it's not the best name for a type you want to be as wide as the machine can efficiently use. unsigned long is often that, but is 32-bit on some 64-bit machines. unsigned long long is probably fine.
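For the memcpy route, a minimal sketch (the helper name is mine):
#include <cstring>

// Strict-aliasing-safe 64-bit load from arbitrary object memory:
// the memcpy compiles down to a single load at -O1 and above.
static inline unsigned long long load_word(const void *p)
{
    unsigned long long w;
    std::memcpy(&w, p, sizeof(w));
    return w;
}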

Disable constant folding for LLVM 10 C++ API

I'm using the LLVM C++ API to write a compiler front-end for a subset of the C language. I've noticed the generated IR always has the constant folding optimization applied. But I want to disable this and get a faithful, unoptimized IR. Is there any way to do this?
Following is the code I'm using to generate IR from my module.
llvm::verifyModule(kit.module, &llvm::outs());
kit.module.print(llvm::outs(), nullptr);
auto tirFile = "output.ir";
std::error_code ec;
llvm::raw_fd_ostream tirFileStream(tirFile, ec, llvm::sys::fs::F_None);
kit.module.print(tirFileStream, nullptr);
tirFileStream.flush();
Seems like the version of LLVM I'm using is LLVM 10.
sumit#HAL9001:~$ llvm-config --version
10.0.0
For example, when I run my compiler on the following C function
int arith() {
return (10 - 10/3) << 3 | (23+8*12) & 1024;
}
It gets compiled to
define i32 @arith() {
entry:
ret i32 56
}
The binary operations on constants are evaluated by the compiler itself, i.e. constant folding; they don't get translated to the corresponding IR instructions.
Quoting from this link:
The way that the front-end lowers code to IR causes this sort of
constant folding to happen even before any LLVM IR is generated.
Essentially, when you do the AST traversal, you’re going to
essentially see the following code get run:
IRBuilder<> Builder;
Value *LHS = Builder.getInt32(2);
Value *RHS = Builder.getInt32(4);
// LHS and RHS are ConstantInt values because they're constant expressions.
Value *Res = Builder.CreateMul(LHS, RHS);
// Because LHS and RHS are constant values, the IRBuilder folds this to a constant expression.
This constant folding cannot be turned off. (I’m also assuming there’s
no other constant folding going on at the Clang AST level).
In LLVM 11 you can use
IRBuilder<llvm::NoFolder> instead of IRBuilder<>
I'm pretty sure it works for LLVM 10 too (although I haven't verified that).
Don't forget to #include <llvm/IR/NoFolder.h> :)
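A minimal sketch of the NoFolder usage, assuming LLVM 10/11 as discussed (the module and function names here are mine):
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Module.h>
#include <llvm/IR/NoFolder.h>
#include <llvm/Support/raw_ostream.h>

int main()
{
    llvm::LLVMContext ctx;
    llvm::Module mod("demo", ctx);
    auto *fnType = llvm::FunctionType::get(llvm::Type::getInt32Ty(ctx), /*isVarArg=*/false);
    auto *fn = llvm::Function::Create(fnType, llvm::Function::ExternalLinkage, "arith", &mod);
    auto *entry = llvm::BasicBlock::Create(ctx, "entry", fn);

    // With NoFolder, CreateMul on two constants emits a real 'mul i32 2, 4'
    // instruction instead of folding it to 'i32 8'.
    llvm::IRBuilder<llvm::NoFolder> builder(entry);
    auto *res = builder.CreateMul(builder.getInt32(2), builder.getInt32(4), "res");
    builder.CreateRet(res);

    mod.print(llvm::outs(), nullptr);
    return 0;
}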

ThreadSanitizer says my Atomic Inc/Dec has data races, false positive?

I wrote my atomic_inc to increment an integer using asm; it is actually used for reference counting of shared objects. gcc 4.8.2 -fsanitize=thread reports data races, and I finally found they were likely caused by my atomic_inc. I don't believe my code has a data-race bug there; is this a false positive by tsan?
static inline int atomic_add(volatile int *count, int add) {
    __asm__ __volatile__(
        "lock xadd %0, (%1);"
        : "=a"(add)
        : "r"(count), "a"(add)
        : "memory"
    );
    return add;
}
void MyClass::Ref() {
    // std::unique_lock<std::mutex> lock(s_ref);
    atomic_add(&_refs, 1);
}

void MyClass::Unref() {
    // std::unique_lock<std::mutex> lock(s_ref);
    int n = atomic_add(&_refs, -1) - 1;
    // lock.unlock();
    assert(n >= 0);
    if (n <= 0) {
        delete this;
    }
}
Part of your problem is that gcc doesn't look inside the asm.
The other part of your problem is that volatile doesn't make a variable thread-safe.
Given __asm__ means you are committed to gcc, why not use the gcc intrinsics? (They are documented and well tested, and gcc will understand their semantics.)
As to whether the warning is a false positive, I don't know. The safe thing to do is to assume the problem is genuine. It is really hard to see problems in multi-threaded code (even when you know they are there). (Once we ripped out a very clever piece of code using a published mutex algorithm that was failing and replaced it with a simple spin-lock. That fixed the failures, but we never could find why it failed.)
As others have already said, the tool cannot see inside your asm. But you shouldn't do that anyway.
Just use std::atomic and be done with it - that's both thread safe and portable, and the compiler knows how to optimize it, unlike your current code. A sketch follows.
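A minimal sketch, assuming _refs becomes a std::atomic<int> (the memory orders follow the usual reference-counting pattern):
#include <atomic>
#include <cassert>

class MyClass {
public:
    void Ref() {
        // Increments need no ordering with other operations: relaxed suffices.
        _refs.fetch_add(1, std::memory_order_relaxed);
    }
    void Unref() {
        // acq_rel so everything written before the last Unref is visible
        // to the thread that performs the delete.
        int n = _refs.fetch_sub(1, std::memory_order_acq_rel) - 1;
        assert(n >= 0);
        if (n == 0) {
            delete this;
        }
    }
private:
    std::atomic<int> _refs{1};
};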