Auto-vectorizing: Convincing the compiler that alias check is not necessary - c++

I am doing some image processing, for which I benefit from vectorization.
I have a function that vectorizes fine, but for which I cannot convince the compiler that the input and output buffers have no overlap, so no alias checking should be necessary.
I should be able to do so using __restrict__, but if the buffers are not declared __restrict__ when they arrive as function arguments, there seems to be no way to tell the compiler that I am absolutely sure the two buffers will never overlap.
This is the function:
__attribute__((optimize("tree-vectorize","tree-vectorizer-verbose=6")))
void threshold(const cv::Mat& inputRoi, cv::Mat& outputRoi, const unsigned char valueTh) {
    const int height = inputRoi.rows;
    const int width = inputRoi.cols;
    for (int j = 0; j < height; j++) {
        const uint8_t* __restrict in = (const uint8_t* __restrict) inputRoi.ptr(j);
        uint8_t* __restrict out = (uint8_t* __restrict) outputRoi.ptr(j);
        for (int i = 0; i < width; i++) {
            out[i] = (in[i] < valueTh) ? 255 : 0;
        }
    }
}
The only way I can convince the compiler not to perform the alias check is to put the inner loop in a separate function in which the pointers are declared as __restrict__ arguments (sketched below). If I declare this inner function as inline, the alias check comes back.
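For illustration, that workaround looks roughly like this (a sketch only; thresholdRow is just a placeholder name):
#include <cstdint>
#include <opencv2/core.hpp>

// The inner loop in its own function, with __restrict__ pointer parameters.
static void thresholdRow(const uint8_t* __restrict in, uint8_t* __restrict out,
                         const int width, const unsigned char valueTh) {
    for (int i = 0; i < width; i++) {
        out[i] = (in[i] < valueTh) ? 255 : 0;
    }
}

void threshold(const cv::Mat& inputRoi, cv::Mat& outputRoi, const unsigned char valueTh) {
    for (int j = 0; j < inputRoi.rows; j++) {
        // Per the observation above: the alias check disappears only while
        // thresholdRow is not inlined back into this loop.
        thresholdRow(inputRoi.ptr(j), outputRoi.ptr(j), inputRoi.cols, valueTh);
    }
}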
You can see the effect also with this example, which I think is consistent: http://goo.gl/7HK5p7
(Note: I know there might be better ways of writing the same function, but in this case I am just trying to understand how to avoid alias check)
Edit:
Problem is solved! (See the answer below.)
Using gcc 4.9.2, the complete example works as expected. Note the use of the compiler flag -fopt-info-vec-optimized in place of the superseded -ftree-vectorizer-verbose=N.
So, for gcc, use #pragma GCC ivdep and enjoy! :)

If you are using the Intel compiler, you can try adding the line:
#pragma ivdep
The following paragraph is quoted from the Intel compiler user manual:
The ivdep pragma instructs the compiler to ignore assumed vector
dependencies. To ensure correct code, the compiler treats an assumed
dependence as a proven dependence, which prevents vectorization. This
pragma overrides that decision. Use this pragma only when you know
that the assumed loop dependencies are safe to ignore.
With gcc, one should instead add the line:
#pragma GCC ivdep
inside the function, right before the loop you want to vectorize (see the documentation). This is only supported starting from gcc 4.9 and, by the way, makes the use of __restrict__ redundant.
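Applied to the function from the question, a minimal sketch could look like this (assuming gcc >= 4.9):
#include <cstdint>
#include <opencv2/core.hpp>

void threshold(const cv::Mat& inputRoi, cv::Mat& outputRoi, const unsigned char valueTh) {
    const int height = inputRoi.rows;
    const int width = inputRoi.cols;
    for (int j = 0; j < height; j++) {
        const uint8_t* in = inputRoi.ptr(j);
        uint8_t* out = outputRoi.ptr(j);
        // Tell gcc to ignore assumed vector dependencies in the next loop.
        #pragma GCC ivdep
        for (int i = 0; i < width; i++) {
            out[i] = (in[i] < valueTh) ? 255 : 0;
        }
    }
}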

Another approach for this specific issue, which is standardised and portable across reasonably modern compilers, is to use the OpenMP simd directive, part of the standard since version 4.0. The code then becomes:
void threshold(const unsigned char* inputRoi, const unsigned char valueTh,
               unsigned char* outputRoi, const int width,
               const int stride, const int height) {
    #pragma omp simd
    for (int i = 0; i < width; i++) {
        outputRoi[i] = (inputRoi[i] < valueTh) ? 255 : 0;
    }
}
When compiled with OpenMP support enabled (either full support, or partial simd-only support such as -qopenmp-simd for the Intel compiler), the code is fully vectorised.
In addition, this gives you the opportunity to indicate the possible alignment of the arrays, which can come in handy in some circumstances. For example, had your input and output arrays been allocated with an alignment-aware memory allocator, such as posix_memalign() with an alignment requirement of 256 bits (32 bytes), then the code could become:
void threshold(const unsigned char* inputRoi, const unsigned char valueTh,
               unsigned char* outputRoi, const int width,
               const int stride, const int height) {
    #pragma omp simd aligned(inputRoi, outputRoi : 32)
    for (int i = 0; i < width; i++) {
        outputRoi[i] = (inputRoi[i] < valueTh) ? 255 : 0;
    }
}
This should then permit the compiler to generate an even faster binary. And this feature isn't readily available with the ivdep directives, which is all the more reason to use the OpenMP simd directive.
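For completeness, a minimal sketch of how such 32-byte-aligned buffers could be obtained (assuming a POSIX system; allocAligned is just an illustrative helper, not part of the code above):
#include <stdlib.h>   // posix_memalign (POSIX)

// Return a 32-byte-aligned buffer of 'size' bytes, or nullptr on failure.
// posix_memalign requires a power-of-two alignment that is also a multiple
// of sizeof(void*); 32 bytes matches the aligned(inputRoi, outputRoi : 32) clause.
unsigned char* allocAligned(size_t size) {
    void* p = nullptr;
    if (posix_memalign(&p, 32, size) != 0)
        return nullptr;
    return static_cast<unsigned char*>(p);
}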

The Intel compiler, at least as of version 14, does not generate aliasing checks for threshold2 in the code you linked, indicating that your approach should work. However, the gcc auto-vectorizer misses this optimization opportunity: it does generate vectorized code, but also tests for proper alignment, tests for aliasing, and emits non-vectorized fall-back/clean-up code.

Related

Auto-vectorization with SSE2 movemask for bytes to bitmap with gcc

With properly constructed C/C++ code, one can hint gcc into generating efficient SIMD assembly on its own, without the use of intrinsics, e.g. https://locklessinc.com/articles/vectorize/
I am trying to achieve a similar effect for the movemask operation (the PMOVMSKB / *_movemask_epi8 family), but so far without success.
The simplest code I could think of:
#include <cstdint>

alignas(128) int8_t arr[32];

uint32_t foo()
{
    uint32_t rv = 0;
    for (int it = 0; it < 32; ++it)
    {
        rv |= (arr[it] < 0) << it;
    }
    return rv;
}
leads to assembly that fails to utilize the movemask instruction: https://godbolt.org/z/3XimYc
Does anyone have an idea if there's a way to do that with gcc without explicitly using intrinsics?
I haven't looked into the MD files and the associated implementation in gcc yet (https://github.com/gcc-mirror/gcc/tree/master/gcc/config/i386).
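For reference, the explicit-intrinsics form I am trying to get the compiler to produce on its own would look roughly like this (a sketch assuming AVX2 is available; foo_intrinsics is just an illustrative name):
#include <cstdint>
#include <immintrin.h>

alignas(128) int8_t arr[32];

uint32_t foo_intrinsics()
{
    // Load the 32 bytes and extract their sign bits with VPMOVMSKB,
    // i.e. bit i of the result is set exactly when arr[i] < 0.
    __m256i v = _mm256_load_si256(reinterpret_cast<const __m256i*>(arr));
    return static_cast<uint32_t>(_mm256_movemask_epi8(v));
}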

Optimizing away memcpy

In an attempt to avoid breaking the strict aliasing rules, I introduced memcpy in a couple of places in my code, expecting it to be a no-op. The following example produces a call to memcpy (or equivalent) on gcc and clang. Specifically, fool<40> always produces a call, foo does on gcc but not clang, and fool<2> does on clang but not gcc. When / how can this be optimized away?
#include <cstdint>
#include <cstring>

uint64_t bar(const uint16_t *buf) {
    uint64_t num[2];
    memcpy(&num, buf, 16);
    return num[0] + num[1];
}

uint64_t foo(const uint16_t *buf) {
    uint64_t num[3];
    memcpy(&num, buf, sizeof(num));
    return num[0] + num[1];
}

template <int SZ>
uint64_t fool(const uint16_t *buf) {
    uint64_t num[SZ];
    memcpy(&num, buf, sizeof(num));
    uint64_t ret = 0;
    for (int i = 0; i < SZ; ++i)
        ret += num[i];
    return ret;
}

template uint64_t fool<2>(const uint16_t*);
template uint64_t fool<40>(const uint16_t*);
And a link to the compiled output (godbolt).
I can't really tell you exactly why the respective compilers fail to optimize the code in the way you'd hope in these specific cases. I guess each compiler is either unable to track the relationship that memcpy establishes between the target array and the source memory (as we can see, they do recognize it at least in some cases), or simply has some heuristic telling it not to make use of it.
Anyway, since the compilers don't behave as we would hope when we rely on them tracking the entire array, what we can try is to make things more obvious to them by doing the memcpy on an element-per-element basis, as sketched below. This seems to produce the desired result on both compilers. Note that I had to manually unroll the initialization in bar and foo, as clang will otherwise emit a copy again.
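For bar, the element-per-element version could look roughly like this (my sketch; foo and fool would be unrolled the same way):
#include <cstdint>
#include <cstring>

uint64_t bar(const uint16_t *buf) {
    uint64_t num[2];
    // One memcpy per element, manually unrolled, so the compiler can see
    // each 8-byte load directly instead of one opaque 16-byte copy.
    std::memcpy(&num[0], buf, sizeof(num[0]));
    std::memcpy(&num[1], buf + sizeof(num[0]) / sizeof(*buf), sizeof(num[1]));
    return num[0] + num[1];
}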
Apart from that, note that in C++ you should use std::memcpy, std::uint64_t, etc., since the standard headers are not guaranteed to also introduce these names into the global namespace (although I'm not aware of any implementation that doesn't).

Function with template bool argument: guaranteed to be optimized?

In the following example of a templated function, is the central if inside the for loop guaranteed to be optimized out, leaving only the instructions that are actually used?
If this is not guaranteed to be optimized (in GCC 4, MSVC 2013 and LLVM 8.0), what are the alternatives, using at most C++11?
Note that this function does nothing useful, and I know that this specific function can be optimized in several ways. All I want to focus on is how the bool template argument works in generating code.
#include <limits>

template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
    float ret = (IsMin ? std::numeric_limits<float>::max() : -std::numeric_limits<float>::max());
    for (int x = 0; x < arraySize; x++) {
        // Is this code optimized by the compiler to skip the unnecessary if?
        if (IsMin) {
            if (ret > vals[x]) ret = vals[x];
        } else {
            if (ret < vals[x]) ret = vals[x];
        }
    }
    return ret;
}
In theory no. The C++ standard permits compilers to be not just dumb, but downright hostile. It could inject code doing useless stuff for no reason, so long as the abstract machine behaviour remains the same.[1]
In practice, yes. Dead code elimination and constant branch detection are easy, and every single compiler I have ever checked eliminates that if branch.
Note that both branches are compiled before one is eliminated, so they both must be fully valid code. The output assembly behaves "as if" both branches exist, but the branch instruction (and unreachable code) is not an observable feature of the abstract machine behaviour.
Naturally if you do not optimize, the branch and dead code may be left in, so you can move the instruction pointer into the "dead code" with your debugger.
[1] As an example, nothing prevents a compiler from implementing a+b as a loop calling inc in assembly, or a*b as a loop adding a repeatedly. This is a hostile act by the compiler on almost all platforms, but not banned by the standard.
There is no guarantee that it will be optimized away, but there is a pretty good chance, since the condition is a compile-time constant.
That said, C++17 gives us if constexpr, which only compiles the branch that passes the check. If you want a guarantee, I would suggest using that feature instead; a short sketch follows.
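A minimal sketch of the C++17 if constexpr variant (shown only for comparison; it does not satisfy the C++11 constraint of the question):
#include <limits>

template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
    float ret = IsMin ? std::numeric_limits<float>::max()
                      : -std::numeric_limits<float>::max();
    for (int x = 0; x < arraySize; x++) {
        // With if constexpr, the discarded branch is not instantiated at all.
        if constexpr (IsMin) {
            if (ret > vals[x]) ret = vals[x];
        } else {
            if (ret < vals[x]) ret = vals[x];
        }
    }
    return ret;
}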
Before C++17, if you only want one part of the code to be compiled, you need to specialize the function and write only the code that pertains to that specialization.
Since you ask for a C++11 alternative, here is one:
#include <limits>
#include <type_traits>

float IterateOverArrayImpl(float* vals, int arraySize, std::false_type)
{
    float ret = -std::numeric_limits<float>::max();
    for (int x = 0; x < arraySize; x++) {
        if (ret < vals[x])
            ret = vals[x];
    }
    return ret;
}

float IterateOverArrayImpl(float* vals, int arraySize, std::true_type)
{
    float ret = std::numeric_limits<float>::max();
    for (int x = 0; x < arraySize; x++) {
        if (ret > vals[x])
            ret = vals[x];
    }
    return ret;
}

template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
    return IterateOverArrayImpl(vals, arraySize, std::integral_constant<bool, IsMin>());
}
You can see it live here.
The idea is to use function overloading to handle the test.

Auto-Vectorization in Visual Studio 2012 express on std::vector is not happening

I have a simple program with three std::vectors that I use in for loops. With the optimization flag enabled, I am testing whether these loops get vectorized. But Visual Studio reports that the loops are not vectorized, due to reason 1200. My sample code is below.
#include <iostream>
#include <vector>
#include <time.h>

int main(int argc, char *argv[])
{
    clock_t t = clock();
    int tempSize = 100;
    std::vector<double> tempVec(tempSize);
    std::vector<double> tempVec1(tempSize);
    std::vector<double> tempVec2(tempSize);
    for (int i = 0; i < tempSize; i++)
    {
        tempVec1[i] = 20;
        tempVec2[i] = 30;
    }
    for (int i = 0, imax = tempSize; i < imax; i++)
        tempVec[i] = tempVec1[i] + tempVec2[i];
    t = clock() - t; // stop the clock
    std::cout << "Time in secs = " << t / double(CLOCKS_PER_SEC) << std::endl;
    return 0;
}
And below is the compiler output with the "/Qvec-report:2" option enabled.
2> --- Analyzing function: main
2> d:\test\ssetestonvectors\main.cpp(12) : info C5002: loop not vectorized due to reason '1200'
2> d:\test\ssetestonvectors\main.cpp(18) : info C5002: loop not vectorized due to reason '1200'
When I read about error code 1200 on the MSDN page:
https://msdn.microsoft.com/en-us/library/jj658585.aspx
it specifies that code 1200 means "Loop contains loop carried data dependence".
I don't understand how this loop contains such a dependence. I have some code that I need to optimize so that it can use the auto-vectorization feature of Visual Studio and be optimized for SSE2. The code contains vector operations, but I am unable to do this because Visual Studio keeps reporting error codes like this one.
I think your problem is that:
for (int i = 0, imax = tempSize; i < imax; i++)
    tempVec[i] = tempVec1[i] + tempVec2[i];
is actually
for (int i = 0, imax = tempSize; i < imax; i++)
    tempVec.operator[](i) = tempVec1.operator[](i) + tempVec2.operator[](i);
... and the vectorizer is failing to look inside the function calls. The first fix for that is:
const double* t1 = &tempVec1.front();
const double* t2 = &tempVec2.front();
double* t = &tempVec.front();

for (int i = 0, imax = tempSize; i < imax; i++)
    t[i] = t1[i] + t2[i];
The problem with that is that the vectorizer can't see that t, t1, and t2 don't overlap. You have to promise the compiler that they don't:
const double* __restrict t1 = &tempVec1.front();
const double* __restrict t2 = &tempVec2.front();
double* __restrict t = &tempVec.front();

for (int i = 0, imax = tempSize; i < imax; i++)
    t[i] = t1[i] + t2[i];
Obviously (I hope) use of the __restrict keyword (which is not part of standard C++) means this code will not be portable to other C++ compilers.
Edit: The OP has clarified that replacing the calls to operator[] with calls to at() produces a different failure message (although that might just be because at() is more complex).
If the problem is not the function calls, my next hypothesis is that operator[] boils down to something like return this->__begin[i]; and the vectorizer doesn't know that different std::vectors have non-overlapping memory. If so, the final code block is still the solution.
Auto-vectorization is a rather new feature of MSVC, and you're using an older version, so it's far from perfect. Microsoft knows that, so they've decided to vectorize code only when it is absolutely safe.
The particular error message is a bit terse. In reality it should say "Loop might contain loop-carried data dependence". Since MSVC can't prove the absence of such a dependence, it doesn't vectorize.

How to conditionally set compiler optimization for template headers

I found a question somewhat interesting, and went on an attempt to answer it. The author wants to compile just one source file (which relies on template libraries) with AVX optimizations, and the rest of the project without them.
So, to see what would happen, I've created a test project like this:
main.cpp
#include <iostream>
#include <string>
#include "fn_normal.h"
#include "fn_avx.h"

int main(int argc, char* argv[])
{
    int number = 10; // this will come from input, but let's keep it simple for now
    int result;

    if (std::string(argv[argc - 1]) == "--noavx")
        result = FnNormal(number);
    else
    {
        std::cout << "AVX selected\n";
        result = FnAVX(number);
    }

    std::cout << "Double of " << number << " is " << result << std::endl;
    return 0;
}
Files fn_normal.h and fn_avx.h contain declarations for the functions FnNormal() and FnAVX() respectively, which are defined as follows:
fn_normal.cpp
#include "fn_normal.h"
#include "double.h"
int FnNormal(int num)
{
return RtDouble(num);
}
fn_avx.cpp
#include "fn_avx.h"
#include "double.h"
int FnAVX(int num)
{
return RtDouble(num);
}
And here's the template function definition:
double.h
template<typename T>
int RtDouble(T number)
{
    // Side effect: generates avx instructions
    const int N = 1000;
    float a[N], b[N];
    for (int n = 0; n < N; ++n)
    {
        a[n] = b[n] * b[n] * b[n];
    }
    return number * 2;
}
Ultimately, I set Enhanced Instruction Set to AVX for the file fn_avx.cpp under "Properties -> C/C++ -> Code Generation", leaving it at Not Set for the other sources, so they should default to SSE2.
I thought that by doing so, the compiler would instantiate the template once for each source that includes it (avoiding a One-Definition Rule violation by mangling the template function name or in some other way), and thus calling the program with the --noavx parameter would make it run fine on CPUs without AVX support.
But the resulting program actually has only one machine-code version of the function, with AVX instructions, and it fails on older CPUs.
Disabling all other optimizations doesn't solve this issue. I also tried No Enhanced Instructions - /arch:IA32 instead of Not Set.
As I'm just now beginning to understand templates and such, could someone point me to the exact details of this behavior and what I could actually do to achieve my goal?
My compiler is MSVC 2013.
Additional info: the .obj files for both fn_normal.cpp and fn_avx.cpp are almost the same size in bytes. I've looked into the generated assembly listings and they are almost the same, with the important difference that the AVX-enabled source replaces the default SSE movss/mulss with vmovss and vmulss, respectively. But stepping through the code in Visual Studio's disassembly view (Ctrl+Alt+D) confirms that FnNormal() indeed uses the AVX instructions.
The compiler will generate two objects (fn_avx.obj and fn_normal.obj), which are compiled with different instruction sets. As you said, outputting the disassembly for both verifies that this is being done correctly:
objdump -d fn_normal.obj:
...
movss -0x1f5c(%ebp,%eax,4),%xmm0
mulss -0x1f5c(%ebp,%ecx,4),%xmm0
mov -0x1f68(%ebp),%edx
mulss -0x1f5c(%ebp,%edx,4),%xmm0
mov -0x1f68(%ebp),%eax
movss %xmm0,-0xfb4(%ebp,%eax,4)
...
objdump -d fn_avx.obj:
...
vmovss -0x1f5c(%ebp,%eax,4),%xmm0
vmulss -0x1f5c(%ebp,%ecx,4),%xmm0,%xmm0
mov -0x1f68(%ebp),%edx
vmulss -0x1f5c(%ebp,%edx,4),%xmm0,%xmm0
mov -0x1f68(%ebp),%eax
vmovss %xmm0,-0xfb4(%ebp,%eax,4)
...
They look strikingly similar, because by default MSVC 2013 assumes SSE2 availability. If you change the instruction set to IA32, you'll get something with non-vector instructions. So this is not an issue with the compiler/compilation unit.
The issue here is that RtDouble is defined in a header file as a non-specialized template (perfectly legal). The compiler assumes its definition across multiple translation units will be the same, but by compiling with different options that assumption is violated. It's essentially no different from introducing a divergence with the preprocessor:
double.h:
template<typename T>
int RtDouble(T number)
{
#ifdef SUPER_BAD
    // Side effect: generates avx instructions
    const int N = 1000;
    float a[N], b[N];
    for (int n = 0; n < N; ++n)
    {
        a[n] = b[n] * b[n] * b[n];
    }
    return number * 2;
#else
    return 0;
#endif
}

fn_avx.cpp:
#include "fn_avx.h"
#define SUPER_BAD
#include "double.h"

int FnAVX(int num)
{
    return RtDouble(num);
}
FnNormal will then just return 0 (and you can verify this with the disassembly of the new fn_normal.obj). The linker happily chooses one definition, and does not warn you in either situation. The question then comes down to: should it? That would be extremely helpful in situations like this. However, it would also slow down linking, as it would need to compare all of the functions that can exist in multiple compilation units (e.g. inline functions as well).
When I faced a similar issue in my code, I chose a different function naming scheme for the optimized version vs. the non-optimized version. Using a template parameter to distinguish them would work just as well (as suggested in #celtschk's answer).
Basically, the compiler needs to minimize the space, not to mention that having the same template instantiated twice could cause problems if there were static members. So, from what I know, the compiler either processes the template for every source file and then picks one of the implementations, or it postpones the actual code generation to link time. Either way it is a problem for this AVX thing. I ended up solving it the old-fashioned way - with some global definitions not depending on any templates or anything. For very complex applications this could be a huge problem, though. The Intel compiler has a recently added pragma (I don't recall the exact name) that makes the function implemented right after it use only AVX instructions, which would solve the problem. How reliable it is, I don't know.
I've worked around this problem successfully by forcing any templated functions that will be used with different compiler options in different source files to be inline. Just using the inline keyword is usually not sufficient, since the compiler will sometimes ignore it for functions larger than some threshold, so you have to force the compiler to do it.
In MSVC++:
template<typename T>
__forceinline int RtDouble(T number) {...}
GCC:
template<typename T>
inline __attribute__((always_inline)) int RtDouble(T number) {...}
Keep in mind you may have to forceinline any other functions that RtDouble may call within the same module in order to keep the compiler flags consistent in those functions as well. Also keep in mind that MSVC++ simply ignores __forceinline when optimizations are disabled, such as in debug builds, and in those cases this trick won't work, so expect different behavior in non-optimized builds. It can make things problematic to debug in any case, but it does indeed work so long as the compiler allows inlining.
I think the simplest solution is to let the compiler know that those functions are indeed intended to be different, by using a template parameter that does nothing but distinguish them:
File double.h:
template<bool avx, typename T>
int RtDouble(T number)
{
    // Side effect: generates avx instructions
    const int N = 1000;
    float a[N], b[N];
    for (int n = 0; n < N; ++n)
    {
        a[n] = b[n] * b[n] * b[n];
    }
    return number * 2;
}

File fn_normal.cpp:
#include "fn_normal.h"
#include "double.h"

int FnNormal(int num)
{
    return RtDouble<false>(num);
}

File fn_avx.cpp:
#include "fn_avx.h"
#include "double.h"

int FnAVX(int num)
{
    return RtDouble<true>(num);
}