This question already has answers here:
What's the difference between -O3 and (-O2 + flags that man gcc says -O3 adds to -O2)?
(2 answers)
Closed 8 years ago.
Here's the function I'm looking at:
template <uint8_t Size>
inline uint64_t parseUnsigned( const char (&buf)[Size] )
{
uint64_t val = 0;
for (uint8_t i = 0; i < Size; ++i)
if (buf[i] != ' ')
val = (val * 10) + (buf[i] - '0');
return val;
}
I have a test harness which passes in all possible numbers with Size=5, left-padded with spaces. I'm using GCC 4.7.2. When I run the program under callgrind after compiling with -O3 I get:
I refs: 7,154,919
When I compile with -O2 I get:
I refs: 9,001,570
OK, so -O3 improves the performance (and I confirmed that some of the improvement comes from the above function, not just the test harness). But I don't want to completely switch from -O2 to -O3, I want to find out which specific option(s) to add. So I consult man g++ to get the list of options it says are added by -O3:
-fgcse-after-reload [enabled]
-finline-functions [enabled]
-fipa-cp-clone [enabled]
-fpredictive-commoning [enabled]
-ftree-loop-distribute-patterns [enabled]
-ftree-vectorize [enabled]
-funswitch-loops [enabled]
So I compile again with -O2 followed by all of the above options. But this gives me even worse performance than plain -O2:
I refs: 9,546,017
I discovered that adding -ftree-vectorize to -O2 is responsible for this performance degradation. But I can't figure out how to match the -O3 performance with any combination of options. How can I do this?
In case you want to try it yourself, here's the test harness (put the above parseUnsigned() definition under the #includes):
#include <cmath>
#include <stdint.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
template <uint8_t Size>
inline void increment( char (&buf)[Size] )
{
for (uint8_t i = Size - 1; i < 255; --i)
{
if (buf[i] == ' ')
{
buf[i] = '1';
break;
}
++buf[i];
if (buf[i] > '9')
buf[i] -= 10;
else
break;
}
}
int main()
{
char str[5];
memset(str, ' ', sizeof(str));
unsigned max = std::pow(10, sizeof(str));
for (unsigned ii = 0; ii < max; ++ii)
{
uint64_t result = parseUnsigned(str);
if (result != ii)
{
printf("parseUnsigned(%*s) from %u: %lu\n", sizeof(str), str, ii, result);
abort();
}
increment(str);
}
}
A very similar question was already answered here: https://stackoverflow.com/a/6454659/483486
I've copied the relevant text underneath.
UPDATE: There are questions about it in GCC WIKI:
"Is -O1 (-O2,-O3 or -Os) equivalent to individual -foptimization options?"
No. First, individual optimization options (-f*) do not enable optimization, an option -Os or -Ox with x > 0 is required. Second, the -Ox flags enable many optimizations that are not controlled by any individual -f* option. There are no plans to add individual options for controlling all these optimizations.
"What specific flags are enabled by -O1 (-O2, -O3 or -Os)?"
Varies by platform and GCC version. You can get GCC to tell you what flags it enables by doing this:
touch empty.c
gcc -O1 -S -fverbose-asm empty.c
cat empty.s
Related
In the following example, the elimination of unused code is performed for sin() but not for pow(). I was wondering why. Tried gcc and clang.
Here is some more details about this example, which is otherwise mostly code.
The code contains a loop over an integer from which a floating point number is computed.
The number is passed to a mathematical function: either pow() or sin()
depending on which macros are defined.
If macro USE is defined, the sum of all returned values is accumulated in another variable which is then copied to a volatile variable to prevent the optimizer from removing the code entirely.
// main.cpp
#include <chrono>
#include <cmath>
#include <cstdio>
int main() {
std::chrono::steady_clock clock;
auto start = clock.now();
double s = 0;
const size_t count = 1 << 27;
for (size_t i = 0; i < count; ++i) {
const double x = double(i) / count;
double a = 0;
#ifdef POW
a = std::pow(x, 0.5);
#endif
#ifdef SIN
a = std::sin(x);
#endif
#ifdef USE
s += a;
#endif
}
auto stop = clock.now();
printf(
"%.0f ms\n", std::chrono::duration<double>(stop - start).count() * 1e3);
volatile double a = s;
(void)a;
}
As seen from the output, the computation of sin() is completely eliminated if the results are unused. This is not the case for pow() since the execution time does not decrease.
I normally observe this if the call may return a NaN (log(-x) but not log(+x)).
# g++ 10.2.0
g++ -std=c++14 -O3 -DPOW main.cpp -o main && ./main
3064 ms
g++ -std=c++14 -O3 -DPOW -DUSE main.cpp -o main && ./main
3172 ms
g++ -std=c++14 -O3 -DSIN main.cpp -o main && ./main
0 ms
g++ -std=c++14 -O3 -DSIN -DUSE main.cpp -o main && ./main
1391 ms
# clang++ 11.0.1
clang++ -std=c++14 -O3 -DPOW main.cpp -o main && ./main
3288 ms
clang++ -std=c++14 -O3 -DPOW -DUSE main.cpp -o main && ./main
3351 ms
clang++ -std=c++14 -O3 -DSIN main.cpp -o main && ./main
177 ms
clang++ -std=c++14 -O3 -DSIN -DUSE main.cpp -o main && ./main
1524 ms
Q: Is it possible to improve IO of this code with LLVM Clang under OS X:
test_io.cpp:
#include <iostream>
#include <string>
constexpr int SIZE = 1000*1000;
int main(int argc, const char * argv[]) {
std::ios_base::sync_with_stdio(false);
std::cin.tie(nullptr);
std::string command(argv[1]);
if (command == "gen") {
for (int i = 0; i < SIZE; ++i) {
std::cout << 1000*1000*1000 << " ";
}
} else if (command == "read") {
int x;
for (int i = 0; i < SIZE; ++i) {
std::cin >> x;
}
}
}
Compile:
clang++ -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io
Benchmark:
> time ./test_io gen | ./test_io read
real 0m2.961s
user 0m3.675s
sys 0m0.012s
Apart from the sad fact that reading of 10MB file costs 3 seconds, it's much slower than g++ (installed via homebrew):
> gcc-6 -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io
> time ./test_io gen | ./test_io read
real 0m0.149s
user 0m0.167s
sys 0m0.040s
My clang version is Apple LLVM version 7.0.0 (clang-700.0.72). clangs installed from homebrew (3.7 and 3.8) also produce slow io. clang installed on Ubuntu (3.8) generates fast io. Apple LLVM version 8.0.0 generates slow io (2 people asked).
I also dtrussed it a bit (sudo dtruss -c "./test_io gen | ./test_io read") and found that clang version makes 2686 write_nocancel syscalls, while gcc version makes 2079 writev syscalls. Which probably points to the root of the problem.
The issue is in libc++ that does not implement sync_with_stdio.
Your command line clang++ -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io does not use libstdc++, it will use libc++. To force use libstdc++ you need -stdlib=libstdc++.
Minimal example if you have the input file ready:
int main(int argc, const char * argv[]) {
std::ios_base::sync_with_stdio(false);
int x;
for (int i = 0; i < SIZE; ++i) {
std::cin >> x;
}
}
Timings:
$ clang++ test_io.cpp -o test -O2 -std=c++11
$ time ./test read < input
real 0m2.802s
user 0m2.780s
sys 0m0.015s
$ clang++ test_io.cpp -o test -O2 -std=c++11 -stdlib=libstdc++
clang: warning: libstdc++ is deprecated; move to libc++
$ time ./test read < input
real 0m0.185s
user 0m0.169s
sys 0m0.012s
This question already has answers here:
Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
(2 answers)
Closed last year.
This post was edited and submitted for review last year and failed to reopen the post:
Original close reason(s) were not resolved
I have a strange issue with some SSE2 and AVX code I have been working on. I am building my application using GCC which runtime cpu feature detection. The object files are built with seperate flags for each CPU feature, for example:
g++ -c -o ConvertSamples_SSE.o ConvertSamples_SSE.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse
g++ -c -o ConvertSamples_SSE2.o ConvertSamples_SSE2.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -msse2
g++ -c -o ConvertSamples_AVX.o ConvertSamples_AVX.cpp -std=c++11 -fPIC -O0 -g -Wall -I./include -mavx
When I first launch the program, I find that the SSE2 routines are as per normal with a nice speed boost over the non SSE routines (around 100% faster). After I run any AVX routine, the exact same SSE2 routine now runs much slower.
Could someone please explain what the cause of this may be?
Before the AVX routine runs, all the tests are around 80-130% faster then FPU math, as can be seen here, after the AVX routine runs, the SSE routines are much slower.
If I skip the AVX test routines I never see this performance loss.
Here is my SSE2 routine
void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
static float ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
static __m128 mul = _mm_set_ps1(ratio);
unsigned int i;
for (i = 0; i < samples - 3; i += 4, in += 4, out += 4)
{
__m128i con = _mm_cvtps_epi32(_mm_mul_ps(_mm_load_ps(in), mul));
out[0] = ((int16_t*)&con)[0];
out[1] = ((int16_t*)&con)[2];
out[2] = ((int16_t*)&con)[4];
out[3] = ((int16_t*)&con)[6];
}
for (; i < samples; ++i, ++in, ++out)
*out = (int16_t)lrint(*in * ratio);
}
And the AVX version of the same.
void Float_S16(const float *in, int16_t *out, const unsigned int samples)
{
static float ratio = (float)Limits<int16_t>::range() / (float)Limits<float>::range();
static __m256 mul = _mm256_set1_ps(ratio);
unsigned int i;
for (i = 0; i < samples - 7; i += 8, in += 8, out += 8)
{
__m256i con = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_load_ps(in), mul));
out[0] = ((int16_t*)&con)[0];
out[1] = ((int16_t*)&con)[2];
out[2] = ((int16_t*)&con)[4];
out[3] = ((int16_t*)&con)[6];
out[4] = ((int16_t*)&con)[8];
out[5] = ((int16_t*)&con)[10];
out[6] = ((int16_t*)&con)[12];
out[7] = ((int16_t*)&con)[14];
}
for(; i < samples; ++i, ++in, ++out)
*out = (int16_t)lrint(*in * ratio);
}
I have also run this through valgrind which detects no errors.
Mixing AVX code and legacy SSE code incurs a performance penalty. The most reasonable solution is to execute the VZEROALL instruction after an AVX segment of code, especially just before executing SSE code.
As per Intel's diagram, the penalty when transitioning into or out of state C (legacy SSE with upper half of AVX registers saved) is in the order of 100 clock cycles. The other transitions are only 1 cycle:
References:
Intel: Avoiding AVX-SSE Transition Penalties
IntelĀ® AVX State Transitions: Migrating SSE Code to AVX
Before I elaborate the specifics, I have the following function,
Let _e, _w be an array of equal size. Let _stepSize be of float type.
void GradientDescent::backUpWeights(FLOAT tdError) {
AI::FLOAT multiplier = _stepSize * tdError;
for (UINT i = 0; i < n; i++){
_w[i] += _e[i]*multiplier
}
// Assumed that the tilecode ensure that _w.size() or _e.size() is even.
}
This function is well and fine, but if a cpu have intrinsic, specifically for this example, SSE4, then function below allows me to shave seconds off (for the same input) even with -O3 gcc flag already included for both and extra -msse4a added for this one.
void GradientDescent::backUpWeights(FLOAT tdError) {
AI::FLOAT multiplier = _stepSize * tdError;
__m128d multSSE = _mm_set_pd(multiplier, multiplier);
__m128d* eSSE = (__m128d*)_e;
__m128d* wSSE = (__m128d*)_w;
size_t n = getSize()>>1;
for (UINT i = 0; i < n; i++){
wSSE[i] = _mm_add_pd(wSSE[i],_mm_mul_pd(multSSE, eSSE[i]));
}
// Assumed that the tilecode ensure that _w.size() or _e.size() is even.
}
Problem:
My problem now is I want something like this,
void GradientDescent::backUpWeights(FLOAT tdError) {
AI::FLOAT multiplier = _stepSize * tdError;
#ifdef _mssa4a_defined_
__m128d multSSE = _mm_set_pd(multiplier, multiplier);
__m128d* eSSE = (__m128d*)_e;
__m128d* wSSE = (__m128d*)_w;
size_t n = getSize()>>1;
for (UINT i = 0; i < n; i++){
wSSE[i] = _mm_add_pd(wSSE[i],_mm_mul_pd(multSSE, eSSE[i]));
}
#else // No intrinsic
for (UINT i = 0; i < n; i++){
_w[i] += _e[i]*multiplier
}
#endif
// Assumed that the tilecode ensure that _w.size() or _e.size() is even.
}
Thus, if in gcc, I declared -msse4a to compile this code, then it will pick compile the code in if statement. And of course, my plan is to implement it for all intrinsic, not just for SSE4A above.
GCC, ICC (on Linux), and Clang have the following compile options with corresponding defines
options define
-mfma __FMA__
-mavx2 __AVX2__
-mavx __AVX__
-msse4.2 __SSE4_2__
-msse4.1 __SSE4_1__
-mssse3 __SSSE3__
-msse3 __SSE3__
-msse2 __SSE2__
-m64 __SSE2__
-msse __SSE__
Options and defines in GCC and Clang but not in ICC:
-msse4a __SSE4A__
-mfma4 __FMA4__
-mxop __XOP__
AVX512 options which are defined in recent versions of GCC, Clang, and ICC
-mavx512f __AVX512F__ //foundation instructions
-mavx512pf __AVX512PF__ //pre-fetch instructions
-mavx512er __AVX512ER__ //exponential and reciprocal instructions
-mavx512cd __AVX512CD__ //conflict detection instructions
AVX512 options which will likely be in GCC, Clang, and ICC soon (if not already):
-mavx512bw __AVX512BW__ //byte and word instructions
-mavx512dq __AVX512DQ__ //doubleword and quadword Instructions
-mavx512vl __AVX512VL__ //vector length extensions
Note that many of these switches enable several more: e.g -mfma enables and defines AVX2, AVX, SSE4.2 SSE4.1, SSSE3, SSE3, SSE2, SSE.
I'm not 100% what the compiler options with ICC for AVX512 is. It could be -xMIC-AVX512 instead of -mavx512f.
MSVC only appears to define __AVX__ and __AVX2__.
In your case your code appears only to be using SSE2 so if you compile in 64-bit mode (which is the default in a 64-bit user space or explicitly with -m64) then __SSE2__ is defined. But since you used -msse4a then __SSE4A__ will be defined as well.
Note that enabling an instruction is not the same as determine if an instruction set is available. If you want your code to work on multiple instruction sets then I suggest a CPU dispatcher.
I later learned that there is no way to do this. A simple and elegant way though is this. For x86-64 platform with sse4a intrinsic, do the following make target-rule (assuming that you store intrinsic sources in src/intrinsic/ and builds (.o files) in build/* ):
CXX=g++ -O3
CXXFLAGS=-std=c++14 -Wunused
CPPFLAGS=
CPP_INTRINSIC_FLAG:=-ffast-math
INTRINSIC_OBJECT := $(patsubst src/intrinsic/%.cpp,build/%.o,$(wildcard src/intrinsic/*.cpp))
x86-64-sse4: $(eval CPP_INTRINSIC_FLAG+=-msse4a -DSSE4A) $(INTRINSIC_OBJECT)
# Intrinsic objects
build/%.o: src/intrinsic/%.cpp
$(CXX) $(CXXFLAGS) -c $(CPPFLAGS) $(CPP_INTRINSIC_FLAG) $(INCLUDE_PATHS) $^ -o $#
In general, I assume that the STL implementation of any algorithm is at least as efficient as anything I can come up with (with the additional benefit of being error free). However, I came to wonder whether the STL's focus on iterators might be harmful in some situations.
Lets assume I want to calculate the inner product of two fixed size arrays. My naive implementation would look like this:
std::array<double, 100000> v1;
std::array<double, 100000> v2;
//fill with arbitrary numbers
double sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
sum += v1[i] * v2[i];
}
As the number of iterations and the memory layout are known during compile time and all operations can directly be mapped to native processor instructions, the compiler should easily be able to generate the "optimal" machine code from this (loop unrolling, vectorization / FMA instructions ...).
The STL version
double sum = std::inner_product(cbegin(v1), cend(v1), cbegin(v2), 0.0);
on the other hand adds some additional indirections and even if everything is inlined, the compiler still has to deduce that it is working on a continuous memory region and where this region lies. While this is certainly possible in principle, I wonder, whether the typical c++ compiler will actually do it.
So my question is: Do you think, there can be a performance benefit of implementing standard algorithms that operate on fixed size arrays on my own, or will the STL Version always outperform a manual implementation?
As suggested I did some measurements and
for the code below
compiled with VS2013 for x64 in release mode
executed on a Win8.1 Machine with an i7-2640M,
the algorithm version is consistently slower by about 20% (15.6-15.7s vs 12.9-13.1s). The relative difference, also stays roughly constant over two orders of magnitude for N and REPS.
So I guess the answer is: Using standard library algorithms CAN hurt performance.
It would still be interesting, if this is a general problem or if it is specific to my platform, compiler and benchmark. You are welcome to post your own resutls or comment on the benchmark.
#include <iostream>
#include <numeric>
#include <array>
#include <chrono>
#include <cstdlib>
#define USE_STD_ALGORITHM
using namespace std;
using namespace std::chrono;
static const size_t N = 10000000; //size of the arrays
static const size_t REPS = 1000; //number of repitions
array<double, N> a1;
array<double, N> a2;
int main(){
srand(10);
for (size_t i = 0; i < N; ++i) {
a1[i] = static_cast<double>(rand())*0.01;
a2[i] = static_cast<double>(rand())*0.01;
}
double res = 0.0;
auto start=high_resolution_clock::now();
for (size_t z = 0; z < REPS; z++) {
#ifdef USE_STD_ALGORITHM
res = std::inner_product(a1.begin(), a1.end(), a2.begin(), res);
#else
for (size_t t = 0; t < N; ++t) {
res+= a1[t] * a2[t];
}
#endif
}
auto end = high_resolution_clock::now();
std::cout << res << " "; // <-- necessary, so that loop isn't optimized away
std::cout << duration_cast<milliseconds>(end - start).count() <<" ms"<< std::endl;
}
/*
* Update: Results (ubuntu 14.04 , haswell)
* STL: algorithm
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3551 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3567 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9378 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8505 ms
*
* loop:
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3543 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3551 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9613 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8642 ms
*/
EDIT:
I did a quick check with g++-4.9.2 and clang++-3.5 with O3and std=c++11 on a fedora 21 Virtual Box VM on the same machine and apparently those compilers don't have the same problem (the time is almost the same for both versions). However, gcc's version is about twice as fast as clang's (7.5s vs 14s).