I'm trying to create a minimal reproducer for this issue report. There seem to be some problems with AVX-512, which ships on the latest Apple machines with Skylake processors.
According to the GCC 6 release notes, the AVX-512 gear should be available. According to the Intel Intrinsics Guide, vmovdqu64 is available with AVX-512VL and AVX-512F:
$ cat test.cxx
#include <cstdint>
#include <immintrin.h>
int main(int argc, char* argv[])
{
uint64_t x[8];
__m512i y = _mm512_loadu_epi64(x);
return 0;
}
And then:
$ /opt/local/bin/g++-mp-6 -mavx512f -Wa,-q test.cxx -o test.exe
test.cxx: In function 'int main(int, char**)':
test.cxx:6:37: error: '_mm512_loadu_epi64' was not declared in this scope
__m512i y = _mm512_loadu_epi64(x);
^
$ /opt/local/bin/g++-mp-6 -mavx -mavx2 -mavx512f -Wa,-q test.cxx -o test.exe
test.cxx: In function 'int main(int, char**)':
test.cxx:6:37: error: '_mm512_loadu_epi64' was not declared in this scope
__m512i y = _mm512_loadu_epi64(x);
^
$ /opt/local/bin/g++-mp-6 -msse4.1 -msse4.2 -mavx -mavx2 -mavx512f -Wa,-q test.cxx -o test.exe
test.cxx: In function 'int main(int, char**)':
test.cxx:6:37: error: '_mm512_loadu_epi64' was not declared in this scope
__m512i y = _mm512_loadu_epi64(x);
^
I walked the options back to -msse2 without success. I seem to be missing something.
What is required to engage AVX-512 for modern GCC?
According to a /opt/local/bin/g++-mp-6 -v, these are the header search paths:
#include "..." search starts here:
#include <...> search starts here:
/opt/local/include/gcc6/c++/
/opt/local/include/gcc6/c++//x86_64-apple-darwin13
/opt/local/include/gcc6/c++//backward
/opt/local/lib/gcc6/gcc/x86_64-apple-darwin13/6.5.0/include
/opt/local/include
/opt/local/lib/gcc6/gcc/x86_64-apple-darwin13/6.5.0/include-fixed
/usr/include
/System/Library/Frameworks
/Library/Frameworks
And then:
$ grep -R '_mm512_' /opt/local/lib/gcc6/ | grep avx512f | head -n 8
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:_mm512_set_epi64 (long long __A, long long __B, long long __C,
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:_mm512_set_epi32 (int __A, int __B, int __C, int __D,
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:_mm512_set_pd (double __A, double __B, double __C, double __D,
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:_mm512_set_ps (float __A, float __B, float __C, float __D,
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:#define _mm512_setr_epi64(e0,e1,e2,e3,e4,e5,e6,e7) \
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h: _mm512_set_epi64(e7,e6,e5,e4,e3,e2,e1,e0)
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h:#define _mm512_setr_epi32(e0,e1,e2,e3,e4,e5,e6,e7, \
/opt/local/lib/gcc6//gcc/x86_64-apple-darwin13/6.5.0/include/avx512fintrin.h: _mm512_set_epi32(e15,e14,e13,e12,e11,e10,e9,e8,e7,e6,e5,e4,e3,e2,e1,e0)
...
With no masking, there's no reason for this intrinsic to exist or to ever use it instead of the equivalent _mm512_loadu_si512. It's just confusing, and could trick human readers into thinking it was a vmovq zero-extending load of a single epi64.
Intel's intrinsics finder does specify that it exists, but even current trunk gcc (on Godbolt) doesn't define it.
Almost all AVX512 instructions support merge-masking and zero-masking. Instructions that used to be purely bitwise / whole-register with no meaningful element boundaries now come in 32 and 64-bit element flavours, like vpxord and vpxorq. Or vmovdqa32 and vmovdqa64. But using either version with no masking is still just a normal vector load / store / register-copy, and it's not meaningful to specify anything about element-size for them in the C++ source with intrinsics, only the total vector width.
See also What is the difference between _mm512_load_epi32 and _mm512_load_si512?
SSE* and AVX1/2 options are irrelevant to whether GCC's headers define this intrinsic in terms of gcc built-ins; -mavx512f already implies all of the Intel SSE/AVX extensions before AVX512.
It is present in clang trunk (but not 7.0 so it was only very recently added).
unaligned _mm512_loadu_si512 - supported everywhere, use this
unaligned _mm512_loadu_epi64 - clang trunk, not gcc.
aligned _mm512_load_si512 - supported everywhere, use this
aligned _mm512_load_epi64 - also supported everywhere, surprisingly.
unaligned _mm512_maskz_loadu_epi64 - supported everywhere, use this for zero-masked loads
unaligned _mm512_mask_loadu_epi64 - supported everywhere, use this for merge-mask loads.
This code compiles on gcc as early as 4.9.0, and mainline (Linux) clang as early as 3.9, both with -mavx512f. Or, if they support it, with -march=skylake-avx512 or -march=knl. I haven't tested with Apple Clang.
#include <immintrin.h>
__m512i loadu_si512(void *x) { return _mm512_loadu_si512(x); }
__m512i load_epi64(void *x) { return _mm512_load_epi64(x); }
//__m512i loadu_epi64(void *x) { return _mm512_loadu_epi64(x); }
__m512i loadu_maskz(void *x) { return _mm512_maskz_loadu_epi64(0xf0, x); }
__m512i loadu_mask(void *x) { return _mm512_mask_loadu_epi64(_mm512_setzero_si512(), 0xf0, x); }
Godbolt link; you can uncomment the _mm512_loadu_epi64 and flip the compiler to clang trunk to see it work there.
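For the reproducer in the question, a minimal sketch of the fix is to switch from _mm512_loadu_epi64 to _mm512_loadu_si512. The helper names copy8_u64 and demo_roundtrip are hypothetical, and the __AVX512F__ guard with a memcpy fallback is my own addition so the snippet still builds without -mavx512f:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: copies eight 64-bit lanes from src to dst.
inline void copy8_u64(const std::uint64_t* src, std::uint64_t* dst) {
#if defined(__AVX512F__)
    // Unmasked unaligned 512-bit load; element size is irrelevant here.
    __m512i v = _mm512_loadu_si512(src);
    _mm512_storeu_si512(dst, v);
#else
    // Fallback (an assumption for illustration) when AVX-512F is not enabled.
    std::memcpy(dst, src, 8 * sizeof(std::uint64_t));
#endif
}

// Hypothetical self-check: every lane must survive the round trip.
inline void demo_roundtrip() {
    const std::uint64_t in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    std::uint64_t out[8] = {};
    copy8_u64(in, out);
    for (int i = 0; i < 8; ++i) {
        assert(out[i] == in[i]);
    }
}
```

Built with -mavx512f the intrinsic path is taken; running that build needs a CPU with AVX-512F.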
_mm512_loadu_epi64 is not available in 32-bit mode. You need to compile for 64-bit mode. In general, AVX512 works best in 64-bit mode.
I am writing a modulo function and want to optimize the number of instructions called. Currently it looks like this
#include <cstdlib>
constexpr long long mod = 1e9 + 7;
static __attribute__((always_inline)) long long modulo(long long x) noexcept
{
return lldiv(x, mod).rem;
}
and even though I am compiling with -static there is a call to lldiv that isn't being inlined. I want to get around this by adding inline assembly to my function to call the idiv instruction directly. How can I do this?
I am using g++ 9.4.0 and I'm targeting x86.
Since this is related to a competitive programming contest, the code will be compiled with g++ -std=c++17 -O3 -static.
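Not an inline-asm answer, but a hedged sketch of an alternative: the built-in % operator truncates toward zero just like lldiv(...).rem and involves no library call. For a divisor known at compile time, compilers typically emit a multiply-and-shift sequence instead of idiv. modulo_nonneg is a hypothetical extra helper for callers who want a result in [0, mod):

```cpp
// Same modulus as in the question, written as an integer literal.
constexpr long long mod = 1'000'000'007LL;

// Remainder with the sign of x, matching lldiv(x, mod).rem.
constexpr long long modulo(long long x) noexcept {
    return x % mod;
}

// Hypothetical variant: always returns a value in [0, mod).
constexpr long long modulo_nonneg(long long x) noexcept {
    const long long r = x % mod;
    return r < 0 ? r + mod : r;
}
```

Because both functions are constexpr and trivially small, -O3 should inline them without any attribute.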
I have the following code (intended to detect if the compiler supports C++14):
#include <memory>
#include <algorithm>
// Check the version language macro, but skip MSVC because
// MSVC reports 199711 even in MSVC 2017.
#if __cplusplus < 201402L && !defined(_MSC_VER) && !defined(__INTEL_COMPILER)
#error "insufficient support for C++14"
#endif
int main()
{
auto ptr = std::make_unique<int>(42);
constexpr int max = std::max(0, 1);
(void) ptr;
(void) max;
return 0;
}
When compiling it with g++ (version 11.2.1) using g++ -std=c++14 test.cpp -o test, it works fine. When compiling it with the Intel compiler (version 2021.3.0, gcc version 11.2.1 compatibility) using icpc -std=c++14 test.cpp -o test, it fails with
In file included from /usr/include/c++/11/cwchar(44),
from /usr/include/c++/11/bits/postypes.h(40),
from /usr/include/c++/11/iosfwd(40),
from /usr/include/c++/11/bits/shared_ptr.h(52),
from /usr/include/c++/11/memory(77),
from test.cpp(1):
/usr/include/wchar.h(155): error: attribute "__malloc__" does not take arguments
__attribute_malloc__ __attr_dealloc_free;
^
In file included from /usr/include/c++/11/cstdlib(75),
from /usr/include/c++/11/bits/stl_algo.h(59),
from /usr/include/c++/11/algorithm(62),
from test.cpp(2):
/usr/include/stdlib.h(565): error: attribute "__malloc__" does not take arguments
__attr_dealloc_free;
^
In file included from /usr/include/c++/11/cstdlib(75),
from /usr/include/c++/11/bits/stl_algo.h(59),
from /usr/include/c++/11/algorithm(62),
from test.cpp(2):
/usr/include/stdlib.h(569): error: attribute "__malloc__" does not take arguments
__THROW __attr_dealloc (reallocarray, 1);
^
In file included from /usr/include/c++/11/cstdlib(75),
from /usr/include/c++/11/bits/stl_algo.h(59),
from /usr/include/c++/11/algorithm(62),
from test.cpp(2):
/usr/include/stdlib.h(797): error: attribute "__malloc__" does not take arguments
__attr_dealloc_free __wur;
^
compilation aborted for test.cpp (code 2)
What exactly is going wrong here, and how can I fix it?
Short update: Looks as if CUDA is running into similar issues, and it might be related to glibc 2.34: https://forums.developer.nvidia.com/t/cuda-11-5-samples-throw-multiple-error-attribute-malloc-does-not-take-arguments/192750/15
I compiled and ran the shared code with icpc 2021.4, and it runs fine.
I used the command below to compile the code.
icpc -std=c++14
Below are the environment details.
Operating System: Ubuntu 18.04.3 LTS
Kernel: Linux 4.15.0-76-generic
For compatibility, kindly refer to the Intel® C++ Compiler Classic system requirements:
https://software.intel.com/content/www/us/en/develop/articles/oneapi-c-compiler-system-requirements.html
I've been trying to get some std::atan2 code to auto vectorize. I've been able to get it to the point where GCC doesn't complain about asin, but it doesn't seem to be able to handle atan2 (nor atan for that matter).
Here is a link to the Godbolt version. Here is the source code:
#include <cmath>
#include <cstddef>
#include <array>
constexpr std::size_t array_length = 16;
void calc_uv_coordinates(
const std::array<float,array_length>& pos_x,
const std::array<float,array_length>& pos_y,
const std::array<float,array_length>& pos_z,
std::array<float,array_length>& tex_u,
std::array<float,array_length>& tex_v) noexcept{
#pragma clang loop vectorize(enable)
for(std::size_t i = 0; i < array_length; ++i){
tex_u[i] = std::atan2(pos_z[i], pos_x[i]) / (2 * M_PI);
tex_v[i] = std::asin(pos_y[i]) / M_PI;
}
}
Here are the compiler arguments:
clang 9.0: -c -Ofast -ffast-math -Rpass-analysis=loop-vectorize
gcc 9.2: -c -Ofast -ffast-math -fopt-info-vec-missed -fopt-info-vec
Clang says:
In file included from <source>:2:
/opt/compiler-explorer/gcc-9.2.0/lib/gcc/x86_64-linux-gnu/9.2.0/../../../../include/c++/9.2.0/cmath:145:12: remark: loop not vectorized: call instruction cannot be vectorized [-Rpass-analysis]
{ return __builtin_atan2f(__y, __x); }
^
<source>:15:5: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
for(std::size_t i = 0; i < array_length; ++i){
^
1 warning generated.
ASM generation compiler returned: 0
clang-9: warning: -Wl,-rpath,/opt/compiler-explorer/clang-9.0.0/lib: 'linker' input unused [-Wunused-command-line-argument]
clang-9: warning: -Wl,-rpath,/opt/compiler-explorer/clang-9.0.0/lib32: 'linker' input unused [-Wunused-command-line-argument]
clang-9: warning: -Wl,-rpath,/opt/compiler-explorer/clang-9.0.0/lib64: 'linker' input unused [-Wunused-command-line-argument]
In file included from <source>:2:
/opt/compiler-explorer/gcc-9.2.0/lib/gcc/x86_64-linux-gnu/9.2.0/../../../../include/c++/9.2.0/cmath:145:12: remark: loop not vectorized: call instruction cannot be vectorized [-Rpass-analysis]
{ return __builtin_atan2f(__y, __x); }
^
<source>:15:5: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
for(std::size_t i = 0; i < array_length; ++i){
^
1 warning generated.
Execution build compiler returned: 0
Program returned: 1
Error: no suitable ./output.s executable found
GCC says:
<source>:15:30: missed: couldn't vectorize loop
/opt/compiler-explorer/gcc-9.2.0/include/c++/9.2.0/cmath:145:28: missed: not vectorized: relevant stmt not supported: _15 = __builtin_atan2f (_2, _1);
ASM generation compiler returned: 0
<source>:15:30: missed: couldn't vectorize loop
/opt/compiler-explorer/gcc-9.2.0/include/c++/9.2.0/cmath:145:28: missed: not vectorized: relevant stmt not supported: _15 = __builtin_atan2f (_2, _1);
Execution build compiler returned: 0
Program returned: 1
Error: no suitable ./output.s executable found
I'm not sure I understand what relevant stmt not supported: _15 = __builtin_atan2f (_2, _1); means. I believe it is saying that GCC has not put in the manual effort to auto-vectorize the built-in. It appears that GCC only auto-vectorizes trig functions that map to built-in assembly instructions, but I don't understand why that takes atan2f off the table for vectorization, or why that makes it any different from an arbitrary arithmetic function.
Other libraries have SIMD atan and atan2, so it isn't like this function is impossible to vectorize. In principle this should just be each part of atan2 being vectorized along all elements.
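To illustrate "each part of atan2 being vectorized along all elements", here is a rough sketch that swaps the libm call for a hand-written approximation built only from arithmetic and selects. It uses the well-known formula atan(t) ≈ (π/4)t + 0.273t(1 - t) on [0, 1], which is accurate to roughly 0.004 rad, so it is not a drop-in replacement for std::atan2. All names are hypothetical, and I have not checked the assembly any particular compiler generates for it:

```cpp
#include <array>
#include <cstddef>
#include <limits>

constexpr float kPi = 3.14159265358979f;

// atan(t) for t in [0, 1], max error about 0.004 rad.
constexpr float approx_atan_unit(float t) noexcept {
    return t * (kPi / 4.0f + 0.273f * (1.0f - t));
}

// Branch-free style atan2: reduce to [0, 1], then undo the reduction
// with selects. Signed zeros are not treated the way std::atan2 treats them.
constexpr float approx_atan2(float y, float x) noexcept {
    const float ax = x < 0.0f ? -x : x;
    const float ay = y < 0.0f ? -y : y;
    const float hi = ax > ay ? ax : ay;
    const float lo = ax > ay ? ay : ax;
    // Adding the smallest normal float keeps 0/0 from producing NaN.
    const float t = lo / (hi + std::numeric_limits<float>::min());
    float a = approx_atan_unit(t);
    a = ay > ax ? kPi / 2.0f - a : a;  // undo the swap of x and y
    a = x < 0.0f ? kPi - a : a;        // left half-plane
    a = y < 0.0f ? -a : a;             // lower half-plane
    return a;
}

// Element-wise loop with no calls left in the body.
template <std::size_t N>
void approx_atan2_array(const std::array<float, N>& ys,
                        const std::array<float, N>& xs,
                        std::array<float, N>& out) noexcept {
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = approx_atan2(ys[i], xs[i]);
    }
}
```

Whether 0.004 rad is good enough for texture coordinates is an assumption; a higher-order polynomial would trade a few more multiplies for accuracy.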
I have some code using large integer literals as follows:
if(nanoseconds < 1'000'000'000'000)
This gives the compiler warning integer constant is too large for 'long' type [-Wlong-long]. However, if I change it to:
if(nanoseconds < 1'000'000'000'000ll)
...I instead get the warning use of C++11 long long integer constant [-Wlong-long].
I would like to disable this warning just for this line, but without disabling -Wlong-long or using -Wno-long-long for the entire project. I have tried surrounding it with:
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wlong-long"
...
#pragma GCC diagnostic pop
but that does not seem to work here with this warning. Is there something else I can try?
I am building with -std=gnu++1z.
Edit: minimal example for the comments:
#include <iostream>
auto main()->int {
double nanoseconds = 10.0;
if(nanoseconds < 1'000'000'000'000ll) {
std::cout << "hello" << std::endl;
}
return EXIT_SUCCESS;
}
Building with g++ -std=gnu++1z -Wlong-long test.cpp gives test.cpp:6:20: warning: use of C++11 long long integer constant [-Wlong-long]
I should mention this is on a 32-bit platform, Windows with MinGW-w64 (gcc 5.1.0). The first warning does not seem to appear on my 64-bit Linux systems, but the second (with the ll suffix) appears on both.
It seems that the C++11 warning when using the ll suffix may be a gcc bug. (Thanks @praetorian)
A workaround (inspired by @nate-eldredge's comment) is to avoid using the literal and have it produced at compile time with constexpr:
int64_t constexpr const trillion = int64_t(1'000'000) * int64_t(1'000'000);
if(nanoseconds < trillion) ...
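A self-contained sketch of the same workaround follows. The names million and below_trillion are hypothetical, and I have not re-tested it against MinGW-w64 gcc 5.1.0; the point is only that no literal in the source exceeds the range of int, so no long long literal is needed:

```cpp
#include <cstdint>

// Both factors fit in int; the product is computed in 64 bits at compile time.
constexpr std::int64_t million = 1'000'000;
constexpr std::int64_t trillion = million * million;

// Hypothetical helper mirroring the comparison in the question.
constexpr bool below_trillion(double nanoseconds) noexcept {
    return nanoseconds < static_cast<double>(trillion);
}
```

The explicit static_cast only makes the integer-to-double conversion visible; the comparison in the original code performs the same conversion implicitly.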
I try to compile the simple code
#include <atomic>
int bar = 0;
void foo(std::atomic<int>&flag)
{ bar = flag; }
with clang++ 3.2 (downloaded as llvm 3.2 from llvm.org) on Mac OS X 10.8.3. This fails with the error
/> clang++ -std=c++11 -stdlib=libc++ -O3 -march=native -c test.cc
In file included from test.cc:1:
/usr/include/c++/v1/atomic:576:17: error: first argument to atomic operation must be a pointer to non-const _Atomic type ('const _Atomic(int) *' invalid)
{return __c11_atomic_load(&__a_, __m);}
^ ~~~~~
/usr/include/c++/v1/atomic:580:53: note: in instantiation of member function
'std::__1::__atomic_base::load' requested here
operator _Tp() const _NOEXCEPT {return load();}
^
test.cc:5:9: note: in instantiation of member function 'std::__1::__atomic_base::operator int' requested here
bar = done;
When I use /usr/bin/clang++ instead (which comes with the OS or Xcode) it compiles just fine. The libc++ is that at /usr/lib/c++/v1 in both cases.
What am I missing? Is there another libc++ that comes with llvm 3.2 but which I'm missing? (I cannot find anything in the clang3.2 tree).
Xcode now bundles libc++ within the Xcode.app directory. You can inspect this directory by control-clicking Xcode.app and choosing "Show Package Contents".