Before I elaborate on the specifics, here is the function I start from. Let _e and _w be arrays of equal size, and let _stepSize be a floating-point value.
void GradientDescent::backUpWeights(FLOAT tdError) {
  AI::FLOAT multiplier = _stepSize * tdError;
  size_t n = getSize();
  for (UINT i = 0; i < n; i++) {
    _w[i] += _e[i] * multiplier;
  }
  // It is assumed that the tile coding ensures _w.size() and _e.size() are even.
}
This function works fine, but if the CPU supports SIMD intrinsics, specifically SSE4A in this example, then the function below lets me shave seconds off for the same input, even though both versions are already compiled with gcc's -O3 flag and this one additionally gets -msse4a.
void GradientDescent::backUpWeights(FLOAT tdError) {
  AI::FLOAT multiplier = _stepSize * tdError;
  __m128d multSSE = _mm_set_pd(multiplier, multiplier);
  __m128d* eSSE = (__m128d*)_e;
  __m128d* wSSE = (__m128d*)_w;
  size_t n = getSize() >> 1;
  for (UINT i = 0; i < n; i++) {
    wSSE[i] = _mm_add_pd(wSSE[i], _mm_mul_pd(multSSE, eSSE[i]));
  }
  // It is assumed that the tile coding ensures _w.size() and _e.size() are even.
}
Problem:
My problem now is that I want something like this:
void GradientDescent::backUpWeights(FLOAT tdError) {
  AI::FLOAT multiplier = _stepSize * tdError;
#ifdef _msse4a_defined_
  __m128d multSSE = _mm_set_pd(multiplier, multiplier);
  __m128d* eSSE = (__m128d*)_e;
  __m128d* wSSE = (__m128d*)_w;
  size_t n = getSize() >> 1;
  for (UINT i = 0; i < n; i++) {
    wSSE[i] = _mm_add_pd(wSSE[i], _mm_mul_pd(multSSE, eSSE[i]));
  }
#else  // No intrinsics
  size_t n = getSize();
  for (UINT i = 0; i < n; i++) {
    _w[i] += _e[i] * multiplier;
  }
#endif
  // It is assumed that the tile coding ensures _w.size() and _e.size() are even.
}
Thus, if I pass -msse4a to gcc when compiling this code, it should compile the code in the #ifdef branch. And of course, my plan is to do this for every intrinsic instruction set, not just SSE4A as above.
GCC, ICC (on Linux), and Clang have the following compile options with the corresponding defines:

option      define
-mfma       __FMA__
-mavx2      __AVX2__
-mavx       __AVX__
-msse4.2    __SSE4_2__
-msse4.1    __SSE4_1__
-mssse3     __SSSE3__
-msse3      __SSE3__
-msse2      __SSE2__
-m64        __SSE2__
-msse       __SSE__
Options and defines in GCC and Clang but not in ICC:
-msse4a __SSE4A__
-mfma4 __FMA4__
-mxop __XOP__
AVX512 options which are defined in recent versions of GCC, Clang, and ICC:
-mavx512f __AVX512F__ //foundation instructions
-mavx512pf __AVX512PF__ //pre-fetch instructions
-mavx512er __AVX512ER__ //exponential and reciprocal instructions
-mavx512cd __AVX512CD__ //conflict detection instructions
AVX512 options which will likely be in GCC, Clang, and ICC soon (if not already):
-mavx512bw __AVX512BW__ //byte and word instructions
-mavx512dq __AVX512DQ__ //doubleword and quadword instructions
-mavx512vl __AVX512VL__ //vector length extensions
Note that many of these switches imply several more: e.g. -mfma enables and defines AVX2, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE.
I'm not 100% sure what the AVX512 compiler option for ICC is. It could be -xMIC-AVX512 instead of -mavx512f.
MSVC only appears to define __AVX__ and __AVX2__.
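If you want to double-check exactly which macros a given flag (or an -march= setting) defines on your own toolchain, one quick way with gcc or clang is to dump the predefined macros for an empty translation unit and filter them, e.g.:

echo | g++ -msse4a -dM -E - | grep -E 'SSE|AVX|FMA'

This also shows all the extra defines that an option implies, as described above.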
In your case the code appears to be using only SSE2, so if you compile in 64-bit mode (the default in a 64-bit user space, or explicitly with -m64), then __SSE2__ is defined. But since you used -msse4a, __SSE4A__ will be defined as well.
Note that enabling an instruction set at compile time is not the same as determining whether it is available at run time. If you want your code to work across multiple instruction sets, I suggest a CPU dispatcher.
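For reference, here is a minimal sketch of how the function from the question could be guarded with the macros above. It assumes the same members as in the question (_w, _e, _stepSize, getSize()) and only illustrates the preprocessor side; it is not a tuned implementation:

#include <immintrin.h>  // SSE/AVX intrinsics for whatever -m flags are enabled

void GradientDescent::backUpWeights(FLOAT tdError) {
  AI::FLOAT multiplier = _stepSize * tdError;
#if defined(__SSE4A__) || defined(__SSE2__)  // defined by gcc/clang (and icc) for -msse4a, -msse2, -m64, ...
  __m128d multSSE = _mm_set_pd(multiplier, multiplier);
  __m128d* eSSE = (__m128d*)_e;
  __m128d* wSSE = (__m128d*)_w;
  size_t n = getSize() >> 1;
  for (UINT i = 0; i < n; i++) {
    wSSE[i] = _mm_add_pd(wSSE[i], _mm_mul_pd(multSSE, eSSE[i]));
  }
#else  // portable scalar fallback
  size_t n = getSize();
  for (UINT i = 0; i < n; i++) {
    _w[i] += _e[i] * multiplier;
  }
#endif
}

If you also want a single binary that adapts at run time, the same two bodies can be moved into separately compiled functions and selected through a dispatcher (for example gcc/clang's __builtin_cpu_supports), as suggested above.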
I later learned that there is no way to do this directly. A simple and fairly elegant workaround, though, is the following. For an x86-64 platform with the SSE4A intrinsics, add a make target rule like the one below (assuming the intrinsic sources live in src/intrinsic/ and the build artifacts (.o files) go in build/):
CXX=g++ -O3
CXXFLAGS=-std=c++14 -Wunused
CPPFLAGS=
CPP_INTRINSIC_FLAG:=-ffast-math
INTRINSIC_OBJECT := $(patsubst src/intrinsic/%.cpp,build/%.o,$(wildcard src/intrinsic/*.cpp))

x86-64-sse4: $(eval CPP_INTRINSIC_FLAG+=-msse4a -DSSE4A) $(INTRINSIC_OBJECT)

# Intrinsic objects
build/%.o: src/intrinsic/%.cpp
	$(CXX) $(CXXFLAGS) -c $(CPPFLAGS) $(CPP_INTRINSIC_FLAG) $(INCLUDE_PATHS) $^ -o $@
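To use this, the intrinsic sources in src/intrinsic/ test the macro that the rule passes in. SSE4A here is simply the name chosen in the makefile above (the compiler-provided define from -msse4a would be __SSE4A__), so a guarded source might look like this minimal sketch:

#if defined(SSE4A) || defined(__SSE4A__)
  // SSE4A / SSE2 implementation of backUpWeights()
#else
  // plain scalar fallback
#endif

The SSE-enabled objects are then built with make x86-64-sse4, while any other target compiles the same sources without -msse4a and falls back to the scalar branch.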
Related: Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
I have a strange issue with some SSE2 and AVX code I have been working on. I am building my application with GCC and doing runtime CPU feature detection. The object files are built with separate flags for each CPU feature, for example:
g++ -c -o ConvertSamples_SSE.o ConvertSamples_SSE.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse
g++ -c -o ConvertSamples_SSE2.o ConvertSamples_SSE2.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse2
g++ -c -o ConvertSamples_AVX.o ConvertSamples_AVX.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -mavx
When I first launch the program, the SSE2 routines perform as expected, with a nice speed boost over the non-SSE routines (around 100% faster). But after I run any AVX routine, the exact same SSE2 routines run much slower.
Could someone please explain what the cause of this may be?
Before the AVX routine runs, all the tests are around 80-130% faster than FPU math; after the AVX routine runs, the SSE routines are much slower.
If I skip the AVX test routines I never see this performance loss.
Here is my SSE2 routine
void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
  static float ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
  static __m128 mul = _mm_set_ps1(ratio);
  unsigned int i;
  for (i = 0; i < samples - 3; i += 4, in += 4, out += 4)
  {
    __m128i con = _mm_cvtps_epi32(_mm_mul_ps(_mm_load_ps(in), mul));
    out[0] = ((int16_t*)&con)[0];
    out[1] = ((int16_t*)&con)[2];
    out[2] = ((int16_t*)&con)[4];
    out[3] = ((int16_t*)&con)[6];
  }
  for (; i < samples; ++i, ++in, ++out)
    *out = (int16_t)lrint(*in * ratio);
}
And the AVX version of the same.
void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
  static float ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
  static __m256 mul = _mm256_set1_ps(ratio);
  unsigned int i;
  for (i = 0; i < samples - 7; i += 8, in += 8, out += 8)
  {
    __m256i con = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_load_ps(in), mul));
    out[0] = ((int16_t*)&con)[0];
    out[1] = ((int16_t*)&con)[2];
    out[2] = ((int16_t*)&con)[4];
    out[3] = ((int16_t*)&con)[6];
    out[4] = ((int16_t*)&con)[8];
    out[5] = ((int16_t*)&con)[10];
    out[6] = ((int16_t*)&con)[12];
    out[7] = ((int16_t*)&con)[14];
  }
  for (; i < samples; ++i, ++in, ++out)
    *out = (int16_t)lrint(*in * ratio);
}
I have also run this through valgrind, which detects no errors.
Mixing AVX code and legacy SSE code incurs a performance penalty. The most reasonable solution is to execute the VZEROALL instruction after an AVX segment of code, especially just before executing SSE code.
As per Intel's diagram, the penalty when transitioning into or out of state C (legacy SSE with the upper halves of the AVX registers saved) is on the order of 100 clock cycles, while the other transitions cost only 1 cycle (see the references below and the sketch that follows).
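A minimal sketch of what this looks like with intrinsics in the AVX routine from the question (the _mm256_zeroupper() / _mm256_zeroall() intrinsics from <immintrin.h> emit VZEROUPPER / VZEROALL; the function body is abbreviated here):

void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
    // ... the AVX loop and scalar tail from the question ...

    // Clear the upper halves of the ymm registers before returning, so that any
    // legacy SSE code executed afterwards does not pay the state-transition penalty.
    _mm256_zeroupper();   // or _mm256_zeroall(), as suggested above
}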
References:
Intel: Avoiding AVX-SSE Transition Penalties
Intel® AVX State Transitions: Migrating SSE Code to AVX
Here's the function I'm looking at:
template <uint8_t Size>
inline uint64_t parseUnsigned( const char (&buf)[Size] )
{
  uint64_t val = 0;
  for (uint8_t i = 0; i < Size; ++i)
    if (buf[i] != ' ')
      val = (val * 10) + (buf[i] - '0');
  return val;
}
I have a test harness which passes in all possible numbers with Size=5, left-padded with spaces. I'm using GCC 4.7.2. When I run the program under callgrind after compiling with -O3 I get:
I refs: 7,154,919
When I compile with -O2 I get:
I refs: 9,001,570
OK, so -O3 improves the performance (and I confirmed that some of the improvement comes from the above function, not just the test harness). But I don't want to completely switch from -O2 to -O3, I want to find out which specific option(s) to add. So I consult man g++ to get the list of options it says are added by -O3:
-fgcse-after-reload [enabled]
-finline-functions [enabled]
-fipa-cp-clone [enabled]
-fpredictive-commoning [enabled]
-ftree-loop-distribute-patterns [enabled]
-ftree-vectorize [enabled]
-funswitch-loops [enabled]
So I compile again with -O2 followed by all of the above options. But this gives me even worse performance than plain -O2:
I refs: 9,546,017
I discovered that adding -ftree-vectorize to -O2 is responsible for this performance degradation. But I can't figure out how to match the -O3 performance with any combination of options. How can I do this?
In case you want to try it yourself, here's the test harness (put the above parseUnsigned() definition under the #includes):
#include <cmath>
#include <stdint.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
template <uint8_t Size>
inline void increment( char (&buf)[Size] )
{
  for (uint8_t i = Size - 1; i < 255; --i)
  {
    if (buf[i] == ' ')
    {
      buf[i] = '1';
      break;
    }
    ++buf[i];
    if (buf[i] > '9')
      buf[i] -= 10;
    else
      break;
  }
}

int main()
{
  char str[5];
  memset(str, ' ', sizeof(str));
  unsigned max = std::pow(10, sizeof(str));
  for (unsigned ii = 0; ii < max; ++ii)
  {
    uint64_t result = parseUnsigned(str);
    if (result != ii)
    {
      printf("parseUnsigned(%*s) from %u: %lu\n", sizeof(str), str, ii, result);
      abort();
    }
    increment(str);
  }
}
A very similar question was already answered here: https://stackoverflow.com/a/6454659/483486
I've copied the relevant text underneath.
UPDATE: There are entries about this in the GCC Wiki:
"Is -O1 (-O2,-O3 or -Os) equivalent to individual -foptimization options?"
No. First, individual optimization options (-f*) do not enable optimization; an option -Os or -Ox with x > 0 is required. Second, the -Ox flags enable many optimizations that are not controlled by any individual -f* option. There are no plans to add individual options for controlling all these optimizations.
"What specific flags are enabled by -O1 (-O2, -O3 or -Os)?"
Varies by platform and GCC version. You can get GCC to tell you what flags it enables by doing this:
touch empty.c
gcc -O1 -S -fverbose-asm empty.c
cat empty.s
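Another way to see what -O2 and -O3 actually enable on your particular GCC build (the exact set varies by version and target) is to ask the driver for its optimizer flags and diff the two lists:

gcc -O2 -Q --help=optimizers > o2.txt
gcc -O3 -Q --help=optimizers > o3.txt
diff o2.txt o3.txt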
In general, I assume that the STL implementation of any algorithm is at least as efficient as anything I can come up with (with the additional benefit of being error free). However, I came to wonder whether the STL's focus on iterators might be harmful in some situations.
Let's assume I want to calculate the inner product of two fixed-size arrays. My naive implementation would look like this:
std::array<double, 100000> v1;
std::array<double, 100000> v2;
//fill with arbitrary numbers
double sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
  sum += v1[i] * v2[i];
}
As the number of iterations and the memory layout are known at compile time and all operations can be mapped directly to native processor instructions, the compiler should easily be able to generate the "optimal" machine code from this (loop unrolling, vectorization / FMA instructions ...).
The STL version
double sum = std::inner_product(cbegin(v1), cend(v1), cbegin(v2), 0.0);
on the other hand adds some additional indirections, and even if everything is inlined, the compiler still has to deduce that it is working on a contiguous memory region and where this region lies. While this is certainly possible in principle, I wonder whether the typical C++ compiler will actually do it.
So my question is: do you think there can be a performance benefit to implementing standard algorithms that operate on fixed-size arrays on my own, or will the STL version always outperform a manual implementation?
As suggested, I did some measurements: for the code below, compiled with VS2013 for x64 in release mode and executed on a Win8.1 machine with an i7-2640M, the algorithm version is consistently slower by about 20% (15.6-15.7 s vs 12.9-13.1 s). The relative difference also stays roughly constant over two orders of magnitude for N and REPS.
So I guess the answer is: using standard library algorithms CAN hurt performance.
It would still be interesting to know whether this is a general problem or specific to my platform, compiler and benchmark. You are welcome to post your own results or comment on the benchmark.
#include <iostream>
#include <numeric>
#include <array>
#include <chrono>
#include <cstdlib>

#define USE_STD_ALGORITHM

using namespace std;
using namespace std::chrono;

static const size_t N = 10000000;  // size of the arrays
static const size_t REPS = 1000;   // number of repetitions

array<double, N> a1;
array<double, N> a2;

int main(){
  srand(10);
  for (size_t i = 0; i < N; ++i) {
    a1[i] = static_cast<double>(rand())*0.01;
    a2[i] = static_cast<double>(rand())*0.01;
  }
  double res = 0.0;
  auto start = high_resolution_clock::now();
  for (size_t z = 0; z < REPS; z++) {
#ifdef USE_STD_ALGORITHM
    res = std::inner_product(a1.begin(), a1.end(), a2.begin(), res);
#else
    for (size_t t = 0; t < N; ++t) {
      res += a1[t] * a2[t];
    }
#endif
  }
  auto end = high_resolution_clock::now();
  std::cout << res << " ";  // <-- necessary, so that the loop isn't optimized away
  std::cout << duration_cast<milliseconds>(end - start).count() << " ms" << std::endl;
}
/*
* Update: Results (ubuntu 14.04 , haswell)
* STL: algorithm
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3551 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3567 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9378 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8505 ms
*
* loop:
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3543 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3551 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9613 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8642 ms
*/
EDIT:
I did a quick check with g++-4.9.2 and clang++-3.5 with -O3 and -std=c++11 on a Fedora 21 VirtualBox VM on the same machine, and apparently those compilers don't have the same problem (the time is almost the same for both versions). However, gcc's version is about twice as fast as clang's (7.5 s vs 14 s).
I am considering the following C++ program:
#include <iostream>
#include <limits>

int main(int argc, char **argv) {
  unsigned int sum = 0;
  for (unsigned int i = 1; i < std::numeric_limits<unsigned int>::max(); ++i) {
    double f = static_cast<double>(i);
    unsigned int t = static_cast<unsigned int>(f);
    sum += (t % 2);
  }
  std::cout << sum << std::endl;
  return 0;
}
I use the gcc/g++ compiler; g++ -v gives gcc version 4.7.2 20130108 [gcc-4_7-branch revision 195012] (SUSE Linux).
I am running openSUSE 12.3 (x86_64) and have an Intel(R) Core(TM) i7-3520M CPU.
Running
g++ -O3 test.C -o test_64_opt
g++ -O0 test.C -o test_64_no_opt
g++ -m32 -O3 test.C -o test_32_opt
g++ -m32 -O0 test.C -o test_32_no_opt
time ./test_64_opt
time ./test_64_no_opt
time ./test_32_opt
time ./test_32_no_opt
yields
2147483647
real 0m4.920s
user 0m4.904s
sys 0m0.001s
2147483647
real 0m16.918s
user 0m16.851s
sys 0m0.019s
2147483647
real 0m37.422s
user 0m37.308s
sys 0m0.000s
2147483647
real 0m57.973s
user 0m57.790s
sys 0m0.011s
Using float instead of double, the optimized 64-bit variant even finishes in 2.4 seconds, while the other running times stay roughly the same. However, with float I get different outputs depending on optimization, probably due to the higher processor-internal precision.
I know 64-bit may have faster math, but we have a factor of 7 (and nearly 15 with floats) here.
I would appreciate an explanation of these running time discrepancies.
The problem isn't 32-bit vs 64-bit; it's the lack of SSE and SSE2. When compiling for 64-bit, gcc assumes it can use SSE and SSE2 since all x86_64 processors have them.
Compile your 32-bit version with -msse -msse2 and the runtime difference nearly disappears.
My benchmark results for completeness:
-O3 -m32 -msse -msse2 4.678s
-O3 (64bit) 4.524s
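For completeness, assuming the same test.C as in the question (output file names here are just illustrative), the 32-bit SSE build would be produced along these lines; adding -mfpmath=sse as well makes scalar floating-point math go through SSE instead of the x87 FPU:

g++ -m32 -O3 -msse -msse2 test.C -o test_32_opt_sse
g++ -m32 -O3 -msse -msse2 -mfpmath=sse test.C -o test_32_opt_sse_fpmath
time ./test_32_opt_sse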
I am in the process of optimizing my code for matrix multiplication.
for (int i = 0; i < SIZE; i++) {
  for (int j = 0; j < SIZE; j++) {
    __m128 vRes = _mm_setzero_ps();  // reset the accumulator for each output element
    for (int k = 0; k < SIZE; k += 4) {
      __m128 v1 = _mm_load_ps(&m1[i][k]);
      __m128 v2 = _mm_load_ps(&m2[j][k]);
      __m128 vMul = _mm_mul_ps(v1, v2);
      vRes = _mm_add_ps(vRes, vMul);
    }
    vRes = _mm_hadd_ps(vRes, vRes);  // horizontal add -- the call g++ rejects
    vRes = _mm_hadd_ps(vRes, vRes);
    _mm_store_ss(&result[i][j], vRes);
  }
}
But g++ complains that '_mm_hadd_ps' was not declared in this scope. Why is that? I am able to use other SSE functions like _mm_add_ps without any problem.
Horizontal add instructions (such as _mm_hadd_ps) are part of SSE3. All the other ones that you are currently using are SSE.
It seems that you've only included the SSE or SSE2 headers.
So you'll need the SSE3 header:
#include <pmmintrin.h>
It will enable:
_mm_addsub_ps
_mm_addsub_pd
_mm_hadd_ps
_mm_hadd_pd
_mm_hsub_ps
_mm_hsub_pd
_mm_movehdup_ps
_mm_moveldup_ps
_mm_movedup_pd
_mm_loaddup_pd
_mm_lddqu_si128
Use #include <x86intrin.h>; it will include all intrinsics supported by the target processor. Including pmmintrin.h and the like directly is deprecated and not recommended in recent versions of GCC. Also make sure you target the SSE3 instruction set in your compilation, either by adding the -msse3 option or (better) by using the -march= option.
In addition to including the correct header as Mysticial pointed out, you might also need to add the -msse3 flag to g++'s command-line arguments in order to enable SSE3 instructions. This will allow the code generator to emit SSE3 instructions, and it will define the __SSE3__ preprocessor macro, which then enables the declarations in <pmmintrin.h>.
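Putting the two answers together: include the umbrella header at the top of the file,

#include <x86intrin.h>   // pulls in all the x86 intrinsic headers, including pmmintrin.h

and build with SSE3 (or newer) enabled, for example (matmul.cpp is a placeholder file name):

g++ -O3 -msse3 matmul.cpp -o matmul
g++ -O3 -march=native matmul.cpp -o matmul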