I want to compile code that uses the intrinsic function _mm256_undefined_si256() (it returns a __m256i, a vector of 8 packed doubleword integers). Here is a reduced snippet of the affected function from the header file:
// test.hpp
#include "immintrin.h"

namespace {
inline __m256i foo(__m256i a, __m256i b) {
    __m256i res = _mm256_undefined_si256();
    // some inline asm stuff
    // __asm__(...);
    return res;
}
}
Compiling via gcc -march=native -mavx2 -O3 -std=c++11 test.cpp -o app throws the following error: '_mm256_undefined_si256' was not declared in this scope.
I cannot explain why this intrinsic function is not declared, since there are other intrinsics used in the header file which work properly.
Your code works in GCC 4.9 and newer (https://godbolt.org/z/bajMsKvK9). GCC 4.9 was released in April 2014, close to a decade ago, and the most recent GCC 4.8 release, 4.8.5, was in June 2015. So it's about time to upgrade your compiler!
GCC 4.8 was missing that intrinsic, and didn't even know about -march=sandybridge (let alone tuning options for Haswell, which had AVX2), although it did know about the less meaningful -march=corei7-avx.
It does happen that GCC misses some of the more obscure intrinsics that Intel adds along with support for a new instruction set, so support for _mm256_add_epi32 won't always imply _mm256_undefined_si256().
e.g. it took until GCC 11 for them to add _mm_load_si32(void*), the unaligned aliasing-safe movd load (which I think Intel introduced around the same time as the AVX-512 stuff), so that's multiple years late. (And it took until GCC 12 / 11.3 for GCC to implement it correctly, Bug 99754; and it's still not aliasing-safe for _mm_load_ss(float*), Bug 84508.)
But fortunately, _mm256_undefined_si256 is supported by non-ancient versions of all the mainstream compilers.
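If you're stuck on a pre-4.9 GCC for some reason, one hedged workaround is to substitute a zeroed vector: the intrinsic's contents are unspecified anyway, so any value is valid, and zeroing costs at most one vpxor. A minimal sketch (the wrapper name my_undefined_si256 is made up for this example):

#include <immintrin.h>

// Fallback for compilers that lack _mm256_undefined_si256. Any value is a
// valid "undefined" value; _mm256_setzero_si256 is the cheapest portable choice.
static inline __m256i my_undefined_si256() {
#if defined(__GNUC__) && !defined(__clang__) && \
    (__GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 9))
    return _mm256_setzero_si256();
#else
    return _mm256_undefined_si256();
#endif
}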
Related
I have been working on optimizing performance, and of course doing regression tests, when I noticed that g++ seems to alter results depending on the chosen optimization. So far I thought that -O2 -march=[whatever] should yield exactly the same results for numerical computations regardless of which architecture is chosen. However, this seems not to be the case for g++. While using old architectures up to ivybridge yields the same results as clang does for any architecture, I get different results with gcc for haswell and newer. Is this a bug in gcc, or did I misunderstand something about optimizations? I am really startled because clang does not seem to show this behavior.
Note that I am well aware that the differences are within machine precision, but they still disturb my simple regression checks.
Here is some example code:
#include <iostream>
#include <armadillo>

int main() {
    arma::arma_rng::set_seed(3);
    arma::sp_cx_mat A = arma::sprandn<arma::sp_cx_mat>(20, 20, 0.1);
    arma::sp_cx_mat B = A + A.t();
    arma::cx_vec eig;
    arma::eigs_gen(eig, B, 1, "lm", 0.001);
    std::cout << "eigenvalue: " << eig << std::endl;
}
Compiled using:
g++ -march=[architecture] -std=c++14 -O2 -o test example.cpp -larmadillo
gcc version: 6.2.1
clang version: 3.8.0
Compiled for 64 bit, executed on an Intel Skylake processor.
It is because GCC uses fused multiply-add (FMA) instructions by default if they are available. Clang, on the contrary, doesn't use them by default, even if they are available.
The result of a*b+c can differ depending on whether FMA is used, because an FMA computes the product and sum with a single rounding, while a separate multiply and add round twice. That's why you get different results when you use -march=haswell (Haswell is the first Intel CPU which supports FMA).
You can decide whether you want to use this feature with -ffp-contract=XXX:
-ffp-contract=off: you won't get FMA instructions.
-ffp-contract=on: you get FMA instructions, but only where contraction is allowed by the language standard. In the current version of GCC, this means off (because it is not implemented yet).
-ffp-contract=fast (the GCC default): you'll get FMA instructions.
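Here is a minimal sketch of the single- vs. double-rounding difference, using std::fma to force the fused result regardless of the contraction setting:

#include <cstdio>
#include <cmath>
#include <cfloat>

int main() {
    // DBL_EPSILON is 2^-52, so a*b = 1 - 2^-104 exactly
    double a = 1.0 + DBL_EPSILON, b = 1.0 - DBL_EPSILON, c = -1.0;
    double fused    = std::fma(a, b, c); // one rounding: exactly -0x1p-104
    double separate = a * b + c;         // a*b rounds to 1.0 first, so this is 0.0
                                         // (unless the compiler contracts it)
    std::printf("fused = %a, separate = %a\n", fused, separate);
}

With -ffp-contract=fast the compiler is allowed to contract the a * b + c expression too, in which case both values match; with -ffp-contract=off they differ.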
I've just updated MinGW using mingw-get-setup, and I'm unable to build anything that includes the <cmath> header if I use anything larger than -O0 with -std=c++1y (I also tried c++11 and c++98). I'm getting errors like this one:
g++.exe -pedantic-errors -pedantic -Wextra -Wall -std=c++1y -O3 -c Z:\Projects\C++\L6\src\events.cpp -o obj\src\events.o
In file included from z:\lander\mingw\lib\gcc\mingw32\4.8.1\include\c++\cmath:44:0,
from Z:\Projects\C++\L6\src\utils.h:4,
from Z:\Projects\C++\L6\src\events.cpp:10:
z:\lander\mingw\include\math.h: In function 'float hypotf(float, float)':
z:\lander\mingw\include\math.h:635:30: error: '_hypot' was not declared in this scope
{ return (float)(_hypot (x, y)); }
Is something wrong on my side?
Or is the version at the MinGW repo bugged? And if so, is there any quick fix for this header?
To avoid any further speculation, and downright bad suggestions such as using #if 0, let me give an authoritative answer, from the perspective of a MinGW project contributor.
Yes, the MinGW.org implementation of include/math.h does have a bug in its inline implementation of hypotf (float, float). The bug is triggered when compiling C++ with the affected header included (as it is when cmath is included), together with any compiler option which causes __STRICT_ANSI__ to become defined (as is the case for those -std=c... options noted by the OP). The appropriate solution is not to occlude part of the math.h file, with #if 0 or otherwise, but to correct the broken inline implementation of hypotf (float, float); simply removing the spurious leading underscore from the inline reference to _hypot (float, float), where its return value is cast to the float return type, should suffice.
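In other words, a sketch of the corrected inline, reconstructed from the error message above (the exact surrounding declarations in MinGW's math.h may differ):

// math.h, around line 635
// before: { return (float)(_hypot (x, y)); }
// after, with the spurious leading underscore removed:
inline float hypotf (float x, float y)
{ return (float)(hypot (x, y)); }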
Alternatively, substituting an equivalent -std=gnu... for -std=c... in the compiler options should circumvent the bug, and may offer a suitable workaround.
FWIW, I'm not entirely happy with MinGW.org's current implementation of hypotl (long double, long double) either; correcting both issues is on my punch list for the next release of the MinGW runtime, but ATM, I have little time to devote to preparing this.
Update
This bug is no longer present in the current release of the MinGW.org runtime library (currently mingwrt-3.22.4, but fixed since release 3.22). If you are using anything older than this, (including any of the critically broken 4.x releases), you should upgrade.
As noted by Keith, this is a bug in the MinGW.org header.
As an alternative to editing the MinGW.org header, you can use MinGW-w64, which provides everything MinGW.org provides, and a whole lot more.
For a list of differences between the runtimes, see this wiki page.
MinGW uses gcc and the Microsoft runtime library. Microsoft's implementation supports C90, but its support for later versions of the C standard (C99 and C11) is very poor.
The hypot function (along with hypotf and hypotl) was added in C99.
If you're getting this error with a program that calls hypot, such as:
#include <cmath>
#include <iostream>

int main() {
    std::cout << std::hypot(3.0, 4.0) << '\n';
}
then it's just a limitation of the Microsoft runtime library, and therefore of MinGW. If it occurs with any program that has #include <cmath>, then it's a bug, perhaps a configuration error, in MinGW.
I just started to experiment with intrinsics. I managed to successfully compile a program using __m128 on a Mac using Clang 5.1. The CPU on this Mac is an Intel core i5 M540.
When I tried to compile the same code with __m256, I get the following message:
simple.cpp:4:2: error: unknown type name '__m256'
__m256 A;
The code looks like this:
#include <immintrin.h>

int main()
{
    __m256 A;
    return 0;
}
And here is the command used to compile it:
c++ -o simple simple.cpp -march=native -O3
Is it just that my CPU is too old to support the AVX instruction set? Are all the options I use (on the command line) correct? I checked in the immintrin.h include file, and it does include another file which seems to define the AVX intrinsics. Apologies if the question is naive or if the terminology is misused; as I said, I am new to this topic.
The Intel 540M CPU is based on the Westmere microarchitecture (sorry for the mistake in the comment), which predates Sandy Bridge, where AVX was introduced, so it doesn't support AVX. The term "core i5" covers a wide range of microarchitectures, from Nehalem to Haswell (current), so using a core i5 CPU doesn't mean that you'll have support for all instruction sets, like the latest ones.
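If the same source needs to build on both machines, the 256-bit code can be guarded at compile time with the __AVX__ macro that compilers define when AVX code generation is enabled; a minimal sketch (the SSE fallback path is just an illustration):

#include <immintrin.h>

int main()
{
#ifdef __AVX__
    // only compiled when the target supports AVX
    // (e.g. -march=native on Sandy Bridge or newer)
    __m256 A = _mm256_setzero_ps();
    (void)A;
#else
    // illustrative fallback to 128-bit SSE
    __m128 B = _mm_setzero_ps();
    (void)B;
#endif
    return 0;
}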
I work with two computers. One without AVX support and one with AVX. It would be convenient to have my code find the instruction set supported by my CPU at run-time and choose the appropriate code path.
I've followed the suggestions by Agner Fog to make a CPU dispatcher (http://www.agner.org/optimize/#vectorclass). However, on my machine without AVX, compiling and linking with Visual Studio with AVX enabled causes the code to crash when I run it.
I mean, for example, I have two source files: one with the SSE2 instruction set defined, containing some SSE2 instructions, and another with the AVX instruction set defined, containing some AVX instructions. In my main function, if I only reference the SSE2 functions, the code still crashes by virtue of having any source file compiled with AVX enabled that contains AVX instructions. Any clues as to how I can fix this?
Edit:
Okay, I think I isolated the problem. I'm using Agner Fog's vector class and I have defined three source files as:
//file sse2.cpp - compiled with /arch:SSE2
#include "vectorclass.h"
float func_sse2(const float* a) {
    Vec8f v1 = Vec8f().load(a);
    float sum = horizontal_add(v1);
    return sum;
}

//file avx.cpp - compiled with /arch:AVX
#include "vectorclass.h"
float func_avx(const float* a) {
    Vec8f v1 = Vec8f().load(a);
    float sum = horizontal_add(v1);
    return sum;
}

//file foo.cpp - compiled with /arch:SSE2
#include <stdio.h>
extern float func_sse2(const float* a);
extern float func_avx(const float* a);
int main() {
    float (*fp)(const float* a);
    float a[] = {1, 2, 3, 4, 5, 6, 7, 8};
    int iset = 6;
    if (iset >= 7) {
        fp = func_avx;
    }
    else {
        fp = func_sse2;
    }
    float sum = (*fp)(a);
    printf("sum %f\n", sum);
}
This crashes. If I instead use Vec4f in func_sse2, it does not crash. I don't understand this. I can use Vec8f with SSE2 by itself, as long as I don't have another source file with AVX. Agner Fog's manual says:
"There is no advantage in using the 256-bit floating point vector classes (Vec8f, Vec4d) unless the AVX instruction set is specified, but it can be convenient to use these classes anyway if the same source code is used with and without AVX. Each 256-bit vector will simply be split up into two 128-bit vectors when compiling without AVX."
However, when I have two source files with Vec8f, one compiled with SSE2 and one compiled with AVX, then I get a crash.
Edit2:
I can get it to work from the command line
>cl -c sse2.cpp
>cl -c /arch:AVX avx.cpp
>cl foo.cpp sse2.obj avx.obj
>foo.exe
Edit3:
This, however, crashes
>cl -c sse2.cpp
>cl -c /arch:AVX avx.cpp
>cl foo.cpp avx.obj sse2.obj
>foo.exe
Another clue: apparently, the order of linking matters. It crashes if avx.obj is before sse2.obj, but if sse2.obj is before avx.obj it does not crash. I'm not sure if it chooses the correct code path (I don't have access to my AVX system right now), but at least it does not crash.
I realise that this is an old question and that the person who asked it appears to be no longer around, but I hit the same problem yesterday. Here's what I worked out.
When compiled, both your sse2.cpp and avx.cpp files produce object files that contain not only your function but also any required template functions (e.g. Vec8f::load). These template functions are also compiled using the requested instruction set.
This means that your sse2.obj and avx.obj object files will both contain definitions of Vec8f::load, each compiled with the respective instruction set.
However, since the compiler treats Vec8f::load as externally visible, it puts it in a 'COMDAT' section of the object file with a 'selectany' (aka 'pick any') label. This tells the linker that if it sees multiple definitions of this symbol, for example in 2 different object files, then it is allowed to pick any one it likes. (It does this to reduce duplicate code in the final executable, which otherwise would be inflated in size by multiple definitions of template and inline functions.)
The problem you are having is directly related to this, in that the order of the object files passed to the linker affects which one it picks. Specifically, here it appears to be picking the first definition it sees.
If this is avx.obj, then the AVX-compiled version of Vec8f::load will always be used. This will crash on a machine that doesn't support that instruction set.
On the other hand, if sse2.obj is first, then the SSE2-compiled version will always be used. This won't crash, but it will only use SSE2 instructions, even when AVX is supported.
That this is the case can be seen by looking at the linker 'map' file output (produced using the /map option). Here are the relevant (edited) excerpts:
//
// link with sse2.obj before avx.obj
//
0001:00000080 _main foo.obj
0001:00000330 ?func_sse2@@YAMPBM@Z sse2.obj
0001:00000420 ??0Vec256fe@@QAE@XZ sse2.obj
0001:00000440 ??0Vec4f@@QAE@ABT__m128@@@Z sse2.obj
0001:00000470 ??0Vec8f@@QAE@XZ sse2.obj <-- sse2 version used
0001:00000490 ??BVec4f@@QBE?AT__m128@@XZ sse2.obj
0001:000004c0 ?get_high@Vec8f@@QBE?AVVec4f@@XZ sse2.obj
0001:000004f0 ?get_low@Vec8f@@QBE?AVVec4f@@XZ sse2.obj
0001:00000520 ?load@Vec8f@@QAEAAV1@PBM@Z sse2.obj <-- sse2 version used
0001:00000680 ?func_avx@@YAMPBM@Z avx.obj
0001:00000740 ??BVec8f@@QBE?AT__m256@@XZ avx.obj

//
// link with avx.obj before sse2.obj
//
0001:00000080 _main foo.obj
0001:00000270 ?func_avx@@YAMPBM@Z avx.obj
0001:00000330 ??0Vec8f@@QAE@XZ avx.obj <-- avx version used
0001:00000350 ??BVec8f@@QBE?AT__m256@@XZ avx.obj
0001:00000380 ?load@Vec8f@@QAEAAV1@PBM@Z avx.obj <-- avx version used
0001:00000580 ?func_sse2@@YAMPBM@Z sse2.obj
0001:00000670 ??0Vec256fe@@QAE@XZ sse2.obj
0001:00000690 ??0Vec4f@@QAE@ABT__m128@@@Z sse2.obj
0001:000006c0 ??BVec4f@@QBE?AT__m128@@XZ sse2.obj
0001:000006f0 ?get_high@Vec8f@@QBE?AVVec4f@@XZ sse2.obj
0001:00000720 ?get_low@Vec8f@@QBE?AVVec4f@@XZ sse2.obj
As for fixing it, that's another matter. In this case, the following blunt hack should work, by forcing the AVX version to have its own, differently named versions of the template functions. It will increase the resulting executable size, as it will contain multiple versions of the same function, even if the SSE2 and AVX versions are identical.
// avx.cpp
namespace AVXWrapper {
#include "vectorclass.h"
}
using namespace AVXWrapper;

float func_avx(const float* a)
{
    ...
}
There are some important limitations though -
(a) if the included file manages any form of global state it will no longer be truly global as you will have 2 'semi-global' versions, and
(b) you won't be able to pass vectorclass variables as parameters between other code and functions defined in avx.cpp.
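For example, a hypothetical illustration of limitation (b): the Vec8f inside the wrapper namespace is a distinct type from the global one, so only plain types can cross the file boundary:

// foo.cpp sees ::Vec8f; avx.cpp sees AVXWrapper::Vec8f - an unrelated
// type that happens to share a name, so the two don't convert:
//
//     float func_avx(Vec8f v);      // won't work across the boundary
//
float func_avx(const float* a);      // fine: plain types only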
The fact that the link order matters makes me think that there might be some kind of initialization code in the obj file. If the initialization code is communal, then only the first one is taken. I can't reproduce it, but you should be able to see it in an assembly listing (compile with /c /Fatestavx.asm).
Put the SSE and AVX functions in different CPP files, and be sure to compile the SSE version without /arch:AVX.
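Separately, the hard-coded iset = 6 in the question's foo.cpp can be replaced by an actual runtime check. A minimal sketch using MSVC's __cpuid and _xgetbv intrinsics (the helper name cpu_has_avx is made up for this example):

#include <intrin.h>

// True only if the CPU supports AVX *and* the OS saves YMM state.
static bool cpu_has_avx() {
    int info[4];
    __cpuid(info, 1);
    bool osxsave = (info[2] & (1 << 27)) != 0; // OS uses XSAVE/XRSTOR
    bool avx     = (info[2] & (1 << 28)) != 0; // CPU advertises AVX
    if (!osxsave || !avx)
        return false;
    unsigned long long xcr0 = _xgetbv(0);      // read XCR0
    return (xcr0 & 0x6) == 0x6;                // XMM and YMM state enabled
}

// in main: fp = cpu_has_avx() ? func_avx : func_sse2;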
The following code crashes for me using GCC to build for ARM:
#include <vector>

using namespace std;

void foo(vector<bool>& bools) {
    bools.push_back(true);
}

int main(int argc, char** argv) {
    vector<bool> bools;
    bool b = false;
    bools.push_back(b);
}
My compiler is: arm_v5t_le-gcc (GCC) 3.4.3 (MontaVista 3.4.3-25.0.30.0501131 2005-07-23). The crash doesn't occur when building for debug, but occurs with optimizations set to -O2.
Yes, the foo function is necessary to reproduce the issue. This was very confusing at first, but I've discovered that the crash only happens when the push_back call isn't inlined. If GCC notices that the push_back method is called more than once, it won't inline it in each location. For example, I can also reproduce the crash by calling push_back twice inside of main. If you make foo static, then gcc can tell it is never called and will optimize it out; push_back then gets inlined into main, and the crash no longer occurs.
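For reference, here is the two-call variant described above, which also reproduces the crash (push_back stops being inlined once it has two call sites):

#include <vector>

int main() {
    std::vector<bool> bools;
    bools.push_back(true);  // two calls to push_back ...
    bools.push_back(false); // ... so GCC 3.4.3 no longer inlines it
}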
I've tried this on x86 with gcc 4.3.3, and it appears the issue is fixed for that version.
So, my questions are:
Has anyone else run into this? Perhaps there are some compiler flags I can pass in to prevent it.
Is this a bug with gcc's code generation, or is it a bug in the stl implementation (bits/stl_bvector.h)? (I plan on testing this out myself when I get the time)
If it is a problem with the compiler, is upgrading to 4.3.3 what fixes it, or is it switching to x86 from arm?
Incidentally, most other vector<bool> methods seem to work. And yes, I know that using vector<bool> isn't the best option in the world.
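If you need a quick workaround while staying on the old toolchain, one option is to sidestep the bit-packed vector<bool> specialization entirely; a minimal sketch (using std::vector<char> is an assumption, any non-bool element type would do):

#include <vector>

int main() {
    // vector<char> is a plain vector, not the bit-packed vector<bool>
    // specialization, so it avoids the miscompiled push_back path
    std::vector<char> bools;
    bools.push_back(true);  // bool converts to char implicitly
    bools.push_back(false);
}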
Can you build your own toolchain with gcc 3.4.6 and Montavista's patches? 3.4.6 is the last release of the 3.x line.
I can append some instructions for how to build an ARM cross-compiler from GCC sources if you want. I have to do it all the time, since nobody does prebuilt toolchains for Mac OS X.
I'd be really surprised if this is broken for ARM in gcc 4.x. But the only way to test is if you or someone else can try this out on an ARM-targeting gcc 4.x.
Upgrading to GCC 4 is a safe bet. GCC 4 added a new SSA (Static Single Assignment) based middle end (Tree-SSA) alongside the older RTL (Register Transfer Language) representation, and that change allowed a significant rewrite of the optimizer.