I am in the process of optimizing my code for matrix multiplication:

for (int i = 0; i < SIZE; i++) {
    for (int j = 0; j < SIZE; j++) {
        __m128 vRes = _mm_setzero_ps();
        for (int k = 0; k < SIZE; k += 4) {
            __m128 v1   = _mm_load_ps(&m1[i][k]);
            __m128 v2   = _mm_load_ps(&m2[j][k]);
            __m128 vMul = _mm_mul_ps(v1, v2);
            vRes = _mm_add_ps(vRes, vMul);
        }
        vRes = _mm_hadd_ps(vRes, vRes);
        vRes = _mm_hadd_ps(vRes, vRes);
        _mm_store_ss(&result[i][j], vRes);
    }
}
But g++ complains that "'_mm_hadd_ps' was not declared in this scope". Why is that? I am able to use other SSE functions like _mm_add_ps ...
Horizontal add instructions (such as _mm_hadd_ps) are part of SSE3. All the other ones that you are currently using are SSE.
It seems that you've only included the SSE or SSE2 headers.
So you'll need the SSE3 header:
#include <pmmintrin.h>
It will enable:
_mm_addsub_ps
_mm_addsub_pd
_mm_hadd_ps
_mm_hadd_pd
_mm_hsub_ps
_mm_hsub_pd
_mm_movehdup_ps
_mm_moveldup_ps
_mm_movedup_pd
_mm_loaddup_pd
_mm_lddqu_si128
Use #include <x86intrin.h>; it will include all intrinsics supported by the target processor. Including pmmintrin.h and the like directly is deprecated and not recommended in recent versions of GCC. Also make sure you target the SSE3 instruction set when compiling, either by adding the -msse3 option or (better) by using the -march= option.
In addition to including the correct header as Mysticial pointed out, you might also need to add the -msse3 flag to g++'s command-line arguments in order to enable SSE3 instructions. This will allow the code generator to emit SSE3 instructions, and it will define the __SSE3__ preprocessor macro, which then enables the declarations in <pmmintrin.h>.
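Putting both answers together, a minimal sketch of the fix (file and function names here are purely for illustration):

#include <x86intrin.h>   // umbrella header; pmmintrin.h alone would also declare _mm_hadd_ps

// tiny self-contained check that the horizontal add now compiles:
float hsum4(__m128 v) {
    v = _mm_hadd_ps(v, v);   // (a+b, c+d, a+b, c+d)
    v = _mm_hadd_ps(v, v);   // (a+b+c+d, ...)
    return _mm_cvtss_f32(v);
}

// build with SSE3 enabled, e.g.:
//   g++ -O2 -msse3 matmul.cpp
// or, targeting the build machine's full instruction set:
//   g++ -O2 -march=native matmul.cpp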
bool equal(uint8_t * b1, uint8_t * b2) {
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 64);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 64);
    for (int ii = 0; ii < 64; ++ii) {
        if (b1[ii] != b2[ii]) {
            return false;
        }
    }
    return true;
}
Looking at the assembly, clang and gcc don't seem to have any optimizations to add (with flags -O3 -mavx512f -msse4.2) apart from loop unrolling. I would think it's pretty easy to just put both memory regions in 512-bit registers and compare them. Even more surprisingly, both compilers also fail to optimize this (ideally only a single 64-bit compare is required, and no special large registers at all):
bool equal(uint8_t * b1, uint8_t * b2) {
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 8);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 8);
    for (int ii = 0; ii < 8; ++ii) {
        if (b1[ii] != b2[ii]) {
            return false;
        }
    }
    return true;
}
So are both compilers just dumb or is there a reason that this code isn't vectorized? And is there any way to force vectorization short of writing inline assembly?
"I assume" the following is most efficient
memcmp(b1, b2, any_size_you_need);
especially for huge arrays!
(For small arrays, there is not a lot to gain anyway!)
Otherwise, you would need to vectorize manually using Intel intrinsics (also mentioned by chtz). I started to look at that until I thought about memcmp.
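Applied to the 64-byte case from the question, that would simply be (a sketch; compilers typically expand small constant-size memcmp calls inline):

#include <cstring>
#include <cstdint>

bool equal(const uint8_t* b1, const uint8_t* b2) {
    return std::memcmp(b1, b2, 64) == 0;   // 64 bytes, as in the original loop
}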
The compiler must assume that once the function returns (or exits the loop early), it must not read any bytes beyond the current index -- for example, if one of the pointers happens to point somewhere near the boundary of invalid memory.
You can give the compiler a chance to optimize this by using (non-lazy) bitwise &/| operators to combine the results of the individual comparisons:
bool equal(uint8_t * b1, uint8_t * b2) {
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 64);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 64);
    bool ret = true;
    for (int ii = 0; ii < 64; ++ii) {
        ret &= (b1[ii] == b2[ii]);
    }
    return ret;
}
Godbolt demo: https://godbolt.org/z/3ePh7q5rM
gcc still fails to vectorize this, though. So you may need to write a manually vectorized version, if this function is performance-critical.
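For example, a manually vectorized sketch using plain SSE2 (assuming, as above, that both pointers really are aligned):

#include <cstdint>
#include <emmintrin.h>   // SSE2

bool equal_sse2(const uint8_t* b1, const uint8_t* b2) {
    __m128i acc = _mm_setzero_si128();
    for (int ii = 0; ii < 64; ii += 16) {
        __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(b1 + ii));
        __m128i b = _mm_load_si128(reinterpret_cast<const __m128i*>(b2 + ii));
        acc = _mm_or_si128(acc, _mm_xor_si128(a, b));   // collect any differing bits
    }
    // acc is all-zero exactly when every byte matched
    __m128i eq = _mm_cmpeq_epi8(acc, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}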
I have a very simple matrix multiplication program to try out OpenMP:
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main() {
    const int N = 1000;
    double matrix[N][N];
    double vector[N];
    double outvector[N] = {0};

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            matrix[i][j] = rand() % 50;
        }
    }
    for (int i = 0; i < N; i++) {
        vector[i] = rand() % 50;
    }

    double t = omp_get_wtime();
    omp_set_num_threads(12);
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            outvector[i] += matrix[i][j] * vector[j];
        }
    }
    t = omp_get_wtime() - t;

    printf("%g\n", t);
    return 0;
}
I compile it two ways:
Using my Windows Subsystem for Linux (WSL) directly, with g++ main.cpp -o main -fopenmp. This way, it runs significantly faster than if I comment out #pragma omp parallel for (as one would expect).
Using my CLion toolchain:
It's a default WSL toolchain
And the following CMakeLists file:
cmake_minimum_required(VERSION 3.9)
project(Fernuni)
set(CMAKE_CXX_STANDARD 14)
# There was no difference between using this or find_package
#set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -fopenmp")
find_package(OpenMP)
if (OPENMP_FOUND)
    set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
endif()
add_executable(Fernuni main.cpp)
This way, it runs about ten times slower if #pragma omp parallel for is there.
Why?
edit:
Here are my specific timings with various combinations of optimization settings and the pragma directive enabled/disabled:
            OMP,O0      OMP,O3      O0          O3        OMP only    nothing
CLion       0.0332578   0.0234029   0.0023873   6.4e-06   0.0058386   0.0094753
WSL/g++     0.007106    0.0012252   0.0038349   5.1e-06   0.0008419   0.0021912
You are mainly measuring the overheads of the operating system, the hardware and the OpenMP runtime, together with the effect of compiler optimizations.
Indeed, the computation itself should take less than 1 ms, while the time to create the OS threads, schedule them and initialize OpenMP can easily be of the same order (it takes between 0.3 and 0.8 ms on my machine to do only that). Not to mention that the caches are cold (many cache misses occur) and the processor frequency may not yet be at its maximum (because of frequency scaling).
Moreover, the compiler can optimize the loop away entirely, since it has no side effects and the result is never read. GCC actually does that with -O3 without OpenMP, but not with it (see here). This is likely because the GCC optimizer does not see that the output is never read (due to internal escaped references). This explains why you get better timings without OpenMP. You should read the result, at least to check that it is correct! Put shortly, the benchmark is flawed.
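For example, a minimal way to keep the work from being optimized away (any observable use of the output would do) is to print a checksum along with the time:

double checksum = 0.0;
for (int i = 0; i < N; i++) {
    checksum += outvector[i];
}
printf("%g %g\n", t, checksum);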
Note that using -march=native or -mavx plus -mfma can help a lot to speed up the loop in this case, since those instruction sets can theoretically divide the number of executed instructions by 4. In practice, the program will likely be memory-bound (due to the size of the matrix).
GCC provides a way to selectively compile a function/section of code with fast-math, using attributes. Is there a way to enable the same in Clang with pragmas/attributes?
I understand Clang provides some pragmas to specify floating-point behavior. However, none of these pragmas enables fast-math.
PS: A similar question was asked before but was not answered in the context of Clang.
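(For reference, the GCC-side mechanism referred to above is, as far as I know, the optimize function attribute; the function below is purely illustrative:)

#include <cstddef>

// GCC only: compile just this function as if -ffast-math had been passed
__attribute__((optimize("fast-math")))
float dotProduct(const float* a, const float* b, std::size_t size) {
    float res = 0.f;
    for (std::size_t i = 0; i != size; ++i)
        res += a[i] * b[i];
    return res;
}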
UPD: I missed that you already mentioned pragmas. Option 1 below is fast-math for a single operation, as far as I understand. I am not sure about flushing subnormals; I hope that is not affected.
I didn't find per-function options, but I did find two pragmas that can help.
Let's say we want a dot product.
Option 1:
float innerProductF32(const float* a, const float* b, std::size_t size) {
    float res = 0.f;
    for (std::size_t i = 0; i != size; ++i) {
        #pragma float_control(precise, off)
        res += a[i] * b[i];
    }
    return res;
}
Option 2:

float innerProductF32(const float* a, const float* b, std::size_t size) {
    float res = 0.f;
    _Pragma("clang loop vectorize(enable) interleave(enable)")
    for (std::size_t i = 0; i != size; ++i) {
        res += a[i] * b[i];
    }
    return res;
}
The second one is less powerful: it does not generate FMA instructions, but maybe that's not what you want anyway.
I've hit some extremely bizarre performance behavior in a piece of C/C++ code, as suggested in the title, which I have no idea how to explain.
Here's an as-close-as-I've-found-to-minimal working example [EDIT: see below for a shorter one]:
#include <stdio.h>
#include <stdlib.h>
#include <complex>

using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
    cdbl ff[pp], gg[pp];
    for (int ii = 0; ii < pp; ii++) {
        ff[ii] = gg[ii] = 1.0;
    }

    for (int it = 0; it < 1000; it++) {
        cdbl dual[pp];
        for (int ii = 0; ii < pp; ii++) {
            dual[ii] = 0.0;
        }
        for (int h1 = 0; h1 < pp; h1++) {
            for (int h2 = 0; h2 < pp; h2++) {
                cdbl avg_right = 0.0;
                for (int xx = 0; xx < pp; xx++) {
                    int c00 = xx, c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
                        c11 = (xx + h1 + h2) % pp;
                    avg_right += ff[c00] * conj(ff[c01]) * conj(ff[c10]) * gg[c11];
                }
                avg_right /= static_cast<cdbl>(pp);
                for (int xx = 0; xx < pp; xx++) {
                    int c01 = (xx + h1) % pp, c10 = (xx + h2) % pp,
                        c11 = (xx + h1 + h2) % pp;
                    dual[xx] += conj(ff[c01]) * conj(ff[c10]) * ff[c11] * conj(avg_right);
                }
            }
        }
        for (int ii = 0; ii < pp; ii++) {
            dual[ii] = conj(dual[ii]) / static_cast<double>(pp * pp);
        }
        for (int ii = 0; ii < pp; ii++) {
            gg[ii] = dual[ii];
        }
#ifdef I_WANT_THIS_TO_RUN_REALLY_FAST
        printf("%.15lf\n", gg[0].real());
#else  // I_WANT_THIS_TO_RUN_REALLY_SLOWLY
#endif
    }
    printf("%.15lf\n", gg[0].real());

    return 0;
}
Here are the results of running this on my system:
me@mine $ g++ -o test.elf test.cc -Wall -Wextra -O2
me@mine $ time ./test.elf > /dev/null

real    0m7.329s
user    0m7.328s
sys     0m0.000s

me@mine $ g++ -o test.elf test.cc -Wall -Wextra -O2 -DI_WANT_THIS_TO_RUN_REALLY_FAST
me@mine $ time ./test.elf > /dev/null

real    0m0.492s
user    0m0.490s
sys     0m0.001s

me@mine $ g++ --version
g++ (Gentoo 4.9.4 p1.0, pie-0.6.4) 4.9.4 [snip]
It's not terribly important what this code computes: it's just a tonne of complex arithmetic on arrays of length 29. It's been "simplified" from a much larger tonne of complex arithmetic that I care about.
So, the behavior seems to be, as claimed in the title: if I put this print statement back in, the code gets a lot faster.
I've played around a bit: e.g, printing a constant string doesn't give the speedup, but printing the clock time does. There's a pretty clear threshold: the code is either fast or slow.
I considered the possibility that some bizarre compiler optimization either does or doesn't kick in, maybe depending on whether the code does or doesn't have side effects. But, if so it's pretty subtle: when I looked at the disassembled binaries, they're seemingly identical except that one has an extra print statement in and they use different interchangeable registers. I may (must?) have missed something important.
I'm at a total loss to explain what on earth could be causing this. Worse, it does actually affect my life because I'm running related code a lot, and going round inserting extra print statements does not feel like a good solution.
Any plausible theories would be very welcome. Responses along the lines of "your computer's broken" are acceptable if you can explain how that might explain anything.
UPDATE: with apologies for the increasing length of the question, I've shrunk the example to
#include <stdio.h>
#include <stdlib.h>
#include <complex>

using namespace std;

const int pp = 29;
typedef complex<double> cdbl;

int main() {
    cdbl ff[pp];
    cdbl blah = 0.0;
    for (int ii = 0; ii < pp; ii++) {
        ff[ii] = 1.0;
    }

    for (int it = 0; it < 1000; it++) {
        cdbl xx = 0.0;
        for (int kk = 0; kk < 100; kk++) {
            for (int ii = 0; ii < pp; ii++) {
                for (int jj = 0; jj < pp; jj++) {
                    xx += conj(ff[ii]) * conj(ff[jj]) * ff[ii];
                }
            }
        }
        blah += xx;
        printf("%.15lf\n", blah.real());
    }
    printf("%.15lf\n", blah.real());

    return 0;
}
I could make it even smaller, but the machine code is already manageable. If I change the five bytes of the binary corresponding to the callq instruction for that first printf to 0x90 (NOP), execution goes from fast to slow.
The compiled code is very heavy with function calls to __muldc3(). I feel it must be to do with how the Broadwell architecture does or doesn't handle these jumps well: both versions run the same number of instructions so it's a difference in instructions / cycle (about 0.16 vs about 2.8).
Also, compiling with -static makes things fast again.
Further shameless update: I'm conscious I'm the only one who can play with this, so here are some more observations:
It seems like calling any library function — including some foolish ones I made up that do nothing — for the first time puts the execution into the slow state. A subsequent call to printf, fprintf or sprintf somehow clears the state and execution is fast again. So, importantly, the first time __muldc3() is called we go into the slow state, and the next {,f,s}printf resets everything.
Once a library function has been called once, and the state has been reset, that function becomes free and you can use it as much as you like without changing the state.
So, e.g.:
#include <stdio.h>
#include <stdlib.h>
#include <complex>

using namespace std;

int main() {
    complex<double> foo = 0.0;
    foo += foo * foo;            // 1
    char str[10];
    sprintf(str, "%c\n", 'c');
    //fflush(stdout);            // 2
    for (int it = 0; it < 100000000; it++) {
        foo += foo * foo;
    }
    return (foo.real() > 10.0);
}
is fast, but commenting out line 1 or uncommenting line 2 makes it slow again.
It must be relevant that the first time a library call is run the "trampoline" in the PLT is initialized to point to the shared library. So, maybe somehow this dynamic loading code leaves the processor frontend in a bad place until it's "rescued".
For the record, I finally figured this out.
It turns out this is to do with AVX–SSE transition penalties. To quote this exposition from Intel:
When using Intel® AVX instructions, it is important to know that mixing 256-bit Intel® AVX instructions with legacy (non VEX-encoded) Intel® SSE instructions may result in penalties that could impact performance. 256-bit Intel® AVX instructions operate on the 256-bit YMM registers which are 256-bit extensions of the existing 128-bit XMM registers. 128-bit Intel® AVX instructions operate on the lower 128 bits of the YMM registers and zero the upper 128 bits. However, legacy Intel® SSE instructions operate on the XMM registers and have no knowledge of the upper 128 bits of the YMM registers. Because of this, the hardware saves the contents of the upper 128 bits of the YMM registers when transitioning from 256-bit Intel® AVX to legacy Intel® SSE, and then restores these values when transitioning back from Intel® SSE to Intel® AVX (256-bit or 128-bit). The save and restore operations both cause a penalty that amounts to several tens of clock cycles for each operation.
The compiled version of my main loops above includes legacy SSE instructions (movapd and friends, I think), whereas the implementation of __muldc3 in libgcc_s uses a lot of fancy AVX instructions (vmovapd, vmulsd etc.).
This is the ultimate cause of the slowdown.
Indeed, Intel performance diagnostics show that this AVX/SSE switching happens almost exactly once each way per call of __muldc3 (in the last version of the code posted above):
$ perf stat -e cpu/event=0xc1,umask=0x08/ -e cpu/event=0xc1,umask=0x10/ ./slow.elf
Performance counter stats for './slow.elf':
100,000,064 cpu/event=0xc1,umask=0x08/
100,000,118 cpu/event=0xc1,umask=0x10/
(event codes taken from table 19.5 of another Intel manual).
That leaves the question of why the slowdown turns on when you call a library function for the first time, and turns off again when you call printf, sprintf or whatever. The clue is in the first document again:
When it is not possible to remove the transitions, it is often possible to avoid the penalty by explicitly zeroing the upper 128-bits of the YMM registers, in which case the hardware does not save these values.
I think the full story is therefore as follows. When you call a library function for the first time, the trampoline code in ld-linux-x86-64.so that sets up the PLT leaves the upper bits of the YMM registers in a non-zero state. When you call sprintf, among other things it zeros out the upper bits of the YMM registers (whether by chance or design, I'm not sure).
Replacing the sprintf call with asm("vzeroupper") — which instructs the processor explicitly to zero these high bits — has the same effect.
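For instance, in the last small example above, the sprintf call can be replaced directly (a sketch; the inline asm just issues VZEROUPPER once before the hot loop):

#include <complex>
using namespace std;

int main() {
    complex<double> foo = 0.0;
    foo += foo * foo;            // the first __muldc3 call leaves the upper YMM bits dirty
    asm volatile("vzeroupper");  // zero them explicitly instead of calling sprintf
    for (int it = 0; it < 100000000; it++) {
        foo += foo * foo;
    }
    return (foo.real() > 10.0);
}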
The effect can be eliminated by adding -mavx or -march=native to the compile flags, which is how the rest of the system was built. Why this doesn't happen by default is just a mystery of my system I guess.
I'm not quite sure what we learn here, but there it is.
Before I elaborate on the specifics, here is the function in question.
Let _e and _w be arrays of equal size, and let _stepSize be of float type.
void GradientDescent::backUpWeights(FLOAT tdError) {
    AI::FLOAT multiplier = _stepSize * tdError;
    for (UINT i = 0; i < n; i++) {
        _w[i] += _e[i] * multiplier;
    }
    // Assumed that the tilecode ensures that _w.size() or _e.size() is even.
}
This function is well and fine, but if a CPU has SIMD support, specifically SSE4A for this example, then the function below allows me to shave seconds off (for the same input), even with the -O3 gcc flag already included for both and an extra -msse4a added for this one.
void GradientDescent::backUpWeights(FLOAT tdError) {
    AI::FLOAT multiplier = _stepSize * tdError;
    __m128d multSSE = _mm_set_pd(multiplier, multiplier);
    __m128d* eSSE = (__m128d*)_e;
    __m128d* wSSE = (__m128d*)_w;
    size_t n = getSize() >> 1;
    for (UINT i = 0; i < n; i++) {
        wSSE[i] = _mm_add_pd(wSSE[i], _mm_mul_pd(multSSE, eSSE[i]));
    }
    // Assumed that the tilecode ensures that _w.size() or _e.size() is even.
}
Problem:
My problem now is that I want something like this:
void GradientDescent::backUpWeights(FLOAT tdError) {
    AI::FLOAT multiplier = _stepSize * tdError;
#ifdef _mssa4a_defined_
    __m128d multSSE = _mm_set_pd(multiplier, multiplier);
    __m128d* eSSE = (__m128d*)_e;
    __m128d* wSSE = (__m128d*)_w;
    size_t n = getSize() >> 1;
    for (UINT i = 0; i < n; i++) {
        wSSE[i] = _mm_add_pd(wSSE[i], _mm_mul_pd(multSSE, eSSE[i]));
    }
#else  // No intrinsics
    for (UINT i = 0; i < n; i++) {
        _w[i] += _e[i] * multiplier;
    }
#endif
    // Assumed that the tilecode ensures that _w.size() or _e.size() is even.
}
Thus, if in gcc I pass -msse4a when compiling this code, it should compile the code in the #ifdef branch. And of course, my plan is to do this for every intrinsic instruction set, not just the SSE4A one above.
GCC, ICC (on Linux), and Clang have the following compile options with corresponding defines:

option      define
-mfma       __FMA__
-mavx2      __AVX2__
-mavx       __AVX__
-msse4.2    __SSE4_2__
-msse4.1    __SSE4_1__
-mssse3     __SSSE3__
-msse3      __SSE3__
-msse2      __SSE2__
-m64        __SSE2__
-msse       __SSE__
Options and defines in GCC and Clang but not in ICC:
-msse4a __SSE4A__
-mfma4 __FMA4__
-mxop __XOP__
AVX512 options which are defined in recent versions of GCC, Clang, and ICC
-mavx512f __AVX512F__ //foundation instructions
-mavx512pf __AVX512PF__ //pre-fetch instructions
-mavx512er __AVX512ER__ //exponential and reciprocal instructions
-mavx512cd __AVX512CD__ //conflict detection instructions
AVX512 options which will likely be in GCC, Clang, and ICC soon (if not already):
-mavx512bw __AVX512BW__ //byte and word instructions
-mavx512dq __AVX512DQ__ //doubleword and quadword Instructions
-mavx512vl __AVX512VL__ //vector length extensions
Note that many of these switches enable several more: e.g. -mfma enables and defines AVX2, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, SSE.
I'm not 100% sure what the compiler option for AVX512 is with ICC. It could be -xMIC-AVX512 instead of -mavx512f.
MSVC only appears to define __AVX__ and __AVX2__.
In your case, your code appears to only be using SSE2, so if you compile in 64-bit mode (which is the default in a 64-bit user space, or explicit with -m64), then __SSE2__ is defined. But since you used -msse4a, __SSE4A__ will be defined as well.
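So the guard in your function could simply test the macro that matches the intrinsics actually used; a sketch along the lines of your own version (note that the scalar branch needs its own element count):

void GradientDescent::backUpWeights(FLOAT tdError) {
    AI::FLOAT multiplier = _stepSize * tdError;
#ifdef __SSE2__   // the __m128d intrinsics above only require SSE2
    __m128d multSSE = _mm_set_pd(multiplier, multiplier);
    __m128d* eSSE = (__m128d*)_e;
    __m128d* wSSE = (__m128d*)_w;
    size_t n = getSize() >> 1;
    for (UINT i = 0; i < n; i++) {
        wSSE[i] = _mm_add_pd(wSSE[i], _mm_mul_pd(multSSE, eSSE[i]));
    }
#else
    size_t n = getSize();
    for (UINT i = 0; i < n; i++) {
        _w[i] += _e[i] * multiplier;
    }
#endif
}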
Note that enabling an instruction set at compile time is not the same as determining whether that instruction set is available at run time. If you want your code to work on multiple instruction sets, then I suggest a CPU dispatcher.
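A minimal sketch of such a run-time dispatcher (using the GCC/Clang builtin __builtin_cpu_supports; all names below are illustrative, not from your code):

#include <immintrin.h>
#include <cstddef>

static void axpy_scalar(double* w, const double* e, std::size_t n, double m) {
    for (std::size_t i = 0; i < n; ++i) w[i] += e[i] * m;
}

__attribute__((target("avx")))
static void axpy_avx(double* w, const double* e, std::size_t n, double m) {
    __m256d vm = _mm256_set1_pd(m);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vw = _mm256_loadu_pd(w + i);
        __m256d ve = _mm256_loadu_pd(e + i);
        _mm256_storeu_pd(w + i, _mm256_add_pd(vw, _mm256_mul_pd(ve, vm)));
    }
    for (; i < n; ++i) w[i] += e[i] * m;   // scalar tail
}

void axpy(double* w, const double* e, std::size_t n, double m) {
    if (__builtin_cpu_supports("avx"))
        axpy_avx(w, e, n, m);
    else
        axpy_scalar(w, e, n, m);
}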
I later learned that there is no way to do this directly. A simple and reasonably elegant workaround is the following. For an x86-64 platform with the SSE4A intrinsics, write the following make target rule (assuming that you store intrinsic sources in src/intrinsic/ and build artifacts (.o files) in build/):
CXX = g++ -O3
CXXFLAGS = -std=c++14 -Wunused
CPPFLAGS =
CPP_INTRINSIC_FLAG := -ffast-math

INTRINSIC_OBJECT := $(patsubst src/intrinsic/%.cpp,build/%.o,$(wildcard src/intrinsic/*.cpp))

x86-64-sse4: $(eval CPP_INTRINSIC_FLAG += -msse4a -DSSE4A) $(INTRINSIC_OBJECT)

# Intrinsic objects
build/%.o: src/intrinsic/%.cpp
	$(CXX) $(CXXFLAGS) -c $(CPPFLAGS) $(CPP_INTRINSIC_FLAG) $(INCLUDE_PATHS) $^ -o $@