I have the following computation:
if (x==0) x=1.0;
Y = x * A
Any idea how I can avoid the "if" branching above by using some math clamping/step functions?
Y=((x==0)+(x!=0)*x)*A;
icc -O3 compiles this without a jmp (gcc/llvm do not) for x86-64 (checked with https://gcc.godbolt.org).
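For reference, a complete compilable form of that expression might look like this (the function name branchless_scale is mine, just for illustration):
double branchless_scale(double x, double A)
{
    // (x == 0) evaluates to 1 when x is zero and 0 otherwise, and (x != 0)
    // is its complement, so the first factor is 1.0 when x == 0 and x
    // otherwise -- exactly what the if-branch computed.
    return ((x == 0) + (x != 0) * x) * A;
}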
Related
I want to keep all dead code (or anything that can even obviously be optimized away) when compiling with gcc, but even with -O0, some dead code is still optimized away. How can I keep all the code without changing my source code? The sample code is as follows: when compiling with g++ -S -O0 main.cc, the if-statement is optimized away in the assembly code (there is no cmpl or jmp code).
int main() {
    constexpr int a = 123; // or const int a = 0; I do not want to remove the `const` or `constexpr` qualifier.
    if (a) // or just if (123)
        return 1;
    return 0;
}
A related question is here: Disable "if(0)" elimination in gcc. But the answers there require you to change your source code (remove the const/constexpr qualifier), which I do not want to do.
Is there a way to achieve this using only compiler flags, without changing my source code?
It is not possible to keep the conditional in this case with GCC, since it is removed during a very early stage of the compilation.
First of all, here are the compilation steps of GCC:
Code parsing (syntax & semantics) producing an AST in GENERIC representation (HL-IR)
High-level GIMPLE generation (ML-IR)
Low-level GIMPLE generation (ML-IR)
Tree SSA optimization (ML-IR)
RTL generation (LL-IR)
Code optimization
Assembly generation
The conditional is already removed after the generation of the (theoretically unoptimized) high-level GIMPLE representation, i.e. before any optimization step. One can check this by using the GCC flag -fdump-tree-all and looking at the first generated GIMPLE code. Here is the result:
;; Function int main() (null)
;; enabled by -tree-original
{
const int a = 123;
<<cleanup_point const int a = 123;>>;
return <retval> = 1;
return <retval> = 0;
}
return <retval> = 0;
One can note that the resulting code is the same with both constexpr and const. Actually, constexpr is treated as a simple const variable in the HL GIMPLE code.
It is hard to know exactly when the conditional is removed in Step 1, as GENERIC is an implementation-dependent internal representation of GCC and is not very flexible/customizable. AFAIK, there is no built-in way to dump the AST/GENERIC representation directly; you can extract it yourself with some GCC plugins, but this is quite a tricky task.
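For reference, the dump shown above can be reproduced on its own with -fdump-tree-original, e.g. g++ -S -O0 -fdump-tree-original main.cc, which writes it to a file named something like main.cc.005t.original (the exact numeric prefix varies between GCC versions).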
Intro
Kahan summation / compensated summation is a technique that addresses the non-associativity of floating-point arithmetic: rounding (truncation) errors mean that (a+b)+c is not exactly equal to a+(b+c), so an undesired relative error accumulates over longer series of sums, which is a common obstacle in scientific computing.
Task
I desire the optimal implementation of Kahan summation. I suspect that the best performance may be achieved with handcrafted assembly code.
Attempts
The code below calculates the sum of 1000 random numbers in range [0,1] with three approaches.
Standard summation: Naive implementation which accumulates a root mean square relative error that grows as O(sqrt(N))
Kahan summation [g++]: Compensated summation using the C/C++ function "csum". Explanation in comments. Note that some compilers may have default flags that invalidate this implementation (see output below).
Kahan summation [asm]: Compensated summation implemented as "csumasm" using the same algorithm as "csum". Cryptic explanation in comments.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
extern "C" void csumasm(double&, double, double&);
__asm__(
"csumasm:\n"             //Windows x64 convention: a is at (%rcx), b is in %xmm1, c is at (%r8)
"movsd (%rcx), %xmm0\n"  //xmm0 = a
"subsd (%r8), %xmm1\n"   //xmm1 = b - c            | y = b-c
"movapd %xmm0, %xmm2\n"  //xmm2 = a
"addsd %xmm1, %xmm2\n"   //xmm2 = a + y            | b = a+y
"movapd %xmm2, %xmm3\n"  //xmm3 = b
"subsd %xmm0, %xmm3\n"   //xmm3 = b - a            | b - a
"movapd %xmm3, %xmm0\n"  //xmm0 = b - a
"subsd %xmm1, %xmm0\n"   //xmm0 = (b - a) - y      | new compensation c
"movsd %xmm0, (%r8)\n"   //store the new c
"movsd %xmm2, (%rcx)\n"  //store b back into a
"ret\n"
);
void csum(double &a,double b,double &c) { //adds b to a, with c carrying the compensation term between calls
  double y = b-c;  //correct the incoming b by the previous compensation
  b = a+y;         //add the corrected b to a; this is the output of the current summation
  c = (b-a)-y;     //recover the new rounding error, to be carried as the compensation term
  a = b;
}
double fun(double fMin, double fMax){
  double f = (double)rand()/RAND_MAX;
  return fMin + f*(fMax - fMin); //returns a random value in [fMin, fMax]
}
int main(int argc, char** argv) {
  int N = 1000;
  srand(0); //use 0 seed for each method
  double sum1 = 0;
  for (int n = 0; n < N; ++n)
    sum1 += fun(0,1);
  srand(0);
  double sum2 = 0;
  double c = 0; //compensation term
  for (int n = 0; n < N; ++n)
    csum(sum2,fun(0,1),c);
  srand(0);
  double sum3 = 0;
  c = 0;
  for (int n = 0; n < N; ++n)
    csumasm(sum3,fun(0,1),c);
  printf("Standard summation:\n %.16e (error: %.16e)\n\n",sum1,sum1-sum3);
  printf("Kahan compensated summation [g++]:\n %.16e (error: %.16e)\n\n",sum2,sum2-sum3);
  printf("Kahan compensated summation [asm]:\n %.16e\n",sum3);
  return 0;
}
The output with -O3 is:
Standard summation:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
5.1991955320902127e+002 (error: 0.0000000000000000e+000)
Kahan compensated summation [asm]:
5.1991955320902127e+002
The output with -O3 -ffast-math is:
Standard summation:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [asm]:
5.1991955320902127e+002
It is clear that -ffast-math destroys the Kahan summation arithmetic, which is unfortunate because my program requires the use of -ffast-math.
Question
Is it possible to construct a better/faster asm x64 code for Kahan's compensated summation? Perhaps there is a clever way to skip some of the movapd instructions?
If no better asm code is possible, is there a C++ way to implement Kahan summation that can be used with -ffast-math without devolving to the naive summation? Perhaps a C++ implementation is generally more flexible for the compiler to optimize.
Ideas or suggestions are appreciated.
Further information
The contents of "fun" cannot be inlined, but the "csum" function could be.
The sum must be calculated as an iterative process (the corrected term must be applied on every single addition). This is because the intended summation function takes an input that depends on the previous sum.
The intended summation function is called indefinitely and several hundred million times per second, which motivates the pursuit of a high-performance low-level implementation.
Higher-precision arithmetic such as long double, float128 or arbitrary-precision libraries is not to be considered, for performance reasons.
Edit: Inlined csum (doesn't make much sense without the full code, but just for reference)
subsd xmm0, QWORD PTR [rsp+32]
movapd xmm1, xmm3
addsd xmm3, xmm0
movsd QWORD PTR [rsp+16], xmm3
subsd xmm3, xmm1
movapd xmm1, xmm3
subsd xmm1, xmm0
movsd QWORD PTR [rsp+32], xmm1
You can put functions that need to not use -ffast-math (like a csum loop) in a separate file that gets compiled without -ffast-math.
Possibly you could also use __attribute__((optimize("no-fast-math"))), but https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html says that optimization-level pragmas and attributes aren't "suitable in production code", unfortunately.
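For reference, the attribute route might look like this (a sketch only, given the caveat above about production use; it mirrors the csum from the question):
__attribute__((optimize("no-fast-math")))
void csum(double &a, double b, double &c)
{
    double y = b - c;
    double t = a + y;
    c = (t - a) - y;
    a = t;
}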
Update: apparently part of the question was based on a misunderstanding that -O3 wasn't safe, or something? It is safe; ISO C++ specifies FP math rules that are like GCC's -fno-fast-math. Compiling everything with just -O3 apparently makes the OP's code run quickly and safely. See the bottom of this answer for workarounds like OpenMP to get some of the benefit of fast-math for some parts of your code without actually enabling -ffast-math.
ICC defaults to fast-math, so you have to specifically enable FP=strict for it to be safe with -O3, but gcc/clang default to fully strict FP regardless of other optimization settings (except -Ofast = -O3 -ffast-math).
You should be able to vectorize Kahan summation by keeping a vector (or four) of totals and an equal number of vectors of compensations. You can do that with intrinsics (as long as you don't enable fast-math for that file).
e.g. use SSE2 __m128d for 2 packed additions per instruction. Or AVX __m256d. On modern x86, addpd / subpd have the same performance as addsd and subsd (1 uop, 3 to 5 cycle latency depending on microarchitecture: https://agner.org/optimize/).
So you're effectively doing 8 compensated summations in parallel, each sum getting every 8th input element.
Generating random numbers on the fly with your fun() is significantly slower than reading them from memory. If your normal use-case has data in memory, you should be benchmarking that. Otherwise I guess scalar is interesting.
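To illustrate the vectorization idea from above, here is a minimal sketch (my code, not the question's or answer's) using SSE2 intrinsics. It assumes the inputs already sit in an array, that the element count is a multiple of 2, and that the file containing it is compiled without -ffast-math so the compensation isn't optimized away:
#include <emmintrin.h>
#include <cstddef>

double kahan_sum_sse2(const double* data, std::size_t n)
{
    __m128d sum  = _mm_setzero_pd();   // two running totals
    __m128d comp = _mm_setzero_pd();   // two compensation terms
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d x = _mm_loadu_pd(data + i);
        __m128d y = _mm_sub_pd(x, comp);            // y = x - c
        __m128d t = _mm_add_pd(sum, y);             // t = sum + y
        comp = _mm_sub_pd(_mm_sub_pd(t, sum), y);   // c = (t - sum) - y
        sum  = t;
    }
    double lanes[2];
    _mm_storeu_pd(lanes, sum);
    return lanes[0] + lanes[1];   // final combine done naively for brevity
}
With four __m128d accumulators (or two __m256d), this becomes the "8 compensated summations in parallel" version described above.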
If you're going to use inline asm, it would be much better to actually use it inline so you can get multiple inputs and multiple outputs in XMM registers with Extended asm, not stored/reloaded through memory.
Defining a stand-alone function that actually takes args by reference looks pretty performance-defeating. (Especially when it doesn't even return either of them as a return value to avoid one of the store/reload chains). Even just making a function call introduces a lot of overhead by clobbering many registers. (Not as bad in Windows x64 as in x86-64 System V where all the XMM regs are call-clobbered, and more of the integer regs.)
Also your stand-alone function is specific to the Windows x64 calling convention so it's less portable than inline asm inside a function would be.
And BTW, clang managed to implement csum(double&, double, double&) with only two movapd instructions, instead of the 3 in your asm (which I assume you copied from GCC's asm output). https://godbolt.org/z/lw6tug. If you can assume AVX is available, you can avoid any of them.
And BTW, movaps is 1 byte smaller and should be used instead. No CPUs have had separate data domains / forwarding networks for double vs. float, just vec-FP vs. vec-int (vs. GP integer).
But really, by far your best bet is to get GCC to compile a file or function without -ffast-math. https://gcc.gnu.org/wiki/DontUseInlineAsm. That lets the compiler avoid the movaps instructions when AVX is available, besides letting it optimize better when unrolling.
If you're willing to accept the overhead of a function-call for every element, you might as well let the compiler generate that asm by putting csum in a separate file. (Hopefully link-time optimization respects -fno-fast-math for one file, perhaps by not inlining that function.)
But it would be much better to disable fast-math for the whole function containing the summation loop by putting it in a separate file. You may be stuck choosing where non-inline function-call boundaries need to be, based on compiling some code with fast-math and others without.
Ideally compile all of your code with -O3 -march=native, and profile-guided optimization. Also -flto link-time optimization to enable cross-file inlining.
It's not surprising that -ffast-math breaks Kahan summation: treating FP math as associative is one of the main reasons to use fast-math. If you need other parts of -ffast-math like -fno-math-errno and -fno-trapping-math so math functions can inline better, then enable those manually. Those are basically always safe and a good idea; nobody checks errno after calling sqrt so that requirement to set errno for some inputs is just a terrible misdesign of C that burdens implementations unnecessarily. GCC's -ftrapping-math is on by default even though it's broken (it doesn't always exactly reproduce the number of FP exceptions you'd get if you unmasked any) so it should really be off by default. Turning it off doesn't enable any optimizations that would break NaN propagation, it only tells GCC that the number of exceptions isn't a visible side-effect.
Or maybe try -ffast-math -fno-associative-math for your Kahan summation file, but that's the main one that's needed to auto-vectorize FP loops that involve reductions, and helps in other cases. But still, there are several other valuable optimizations that you'd still get.
Another way to get optimizations that normally require fast-math is #pragma omp simd, which enables auto-vectorization with OpenMP even in files where auto-vectorization isn't otherwise enabled. You can declare an accumulator variable for a reduction to let gcc reorder operations on it as if they were associative.
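A sketch of that OpenMP approach (my code, not the answer's): built with -fopenmp-simd (or -fopenmp) and without -ffast-math, the reduction clause is what permits gcc to reorder and vectorize the additions on sum. Note that this reorders a plain reduction; it's a way to recover fast-math-style vectorization for other loops while the Kahan file stays strictly compiled.
#include <cstddef>

double plain_sum(const double* data, std::size_t n)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];   // gcc may vectorize this reduction without -ffast-math
    return sum;
}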
Consider this minimal implementation of a fixed vector<int>:
#include <cstddef>

constexpr std::size_t capacity = 1000;

struct vec
{
    int values[capacity];
    std::size_t _size = 0;

    std::size_t size() const noexcept
    {
        return _size;
    }

    void push(int x)
    {
        values[size()] = x;
        ++_size;
    }
};
Given the following test case:
vec v;
for(std::size_t i{0}; i != capacity; ++i)
{
    v.push(i);
}
asm volatile("" : : "g"(&v) : "memory");
The compiler produces non-vectorized assembly: live example on godbolt.org
If I make any of the following changes...
values[size()] -> values[_size]
Add __attribute__((always_inline)) to size()
...then the compiler produces vectorized assembly: live example on godbolt.org
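For concreteness, the second change might be applied like this (my sketch of the attribute placement, not code from the original question; in-class member definitions are implicitly inline, so the attribute can be applied directly):
#include <cstddef>

constexpr std::size_t capacity = 1000;

struct vec
{
    int values[capacity];
    std::size_t _size = 0;

    // same accessor as above, with the attribute added
    __attribute__((always_inline)) std::size_t size() const noexcept
    {
        return _size;
    }

    void push(int x)
    {
        values[size()] = x;
        ++_size;
    }
};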
Is this a gcc bug? Or is there a reason why a simple accessor such as size() would prevent auto-vectorization unless always_inline is explicitly added?
The loop in your example is vectorised for GCC < 7.1, and not vectorised for GCC >= 7.1. So there seems to be some change in behaviour here.
We can look at the compiler optimisation report by adding -fopt-info-vec-all to the command line:
For GCC 7.3:
<source>:24:29: note: === vect_pattern_recog ===
<source>:24:29: note: === vect_analyze_data_ref_accesses ===
<source>:24:29: note: not vectorized: complicated access pattern.
<source>:24:29: note: bad data access.
<source>:21:5: note: vectorized 0 loops in function.
For GCC 6.3:
<source>:24:29: note: === vect_pattern_recog ===
<source>:24:29: note: === vect_analyze_data_ref_accesses ===
<source>:24:29: note: === vect_mark_stmts_to_be_vectorized ===
[...]
<source>:24:29: note: LOOP VECTORIZED
<source>:21:5: note: vectorized 1 loops in function.
So GCC 7.x decides not to vectorise the loop because of a complicated access pattern, which might be the (at that point) non-inlined size() function. Forcing inlining, or doing it manually, fixes that. GCC 6.x seems to do that by itself. However, the assembly does look like size() was eventually inlined in both cases, but maybe only after the vectorisation step in GCC 7.x (this is me guessing).
I wondered why you put the asm volatile(...) line at the end - probably to prevent the compiler from throwing away the whole loop, because it has no observable effect in this test case. If we just return the last element of v instead, we can achieve the same thing without causing any possible side-effects on the memory model for v.
return v.values[capacity - 1];
The code now vectorises with GCC 7.x, as it already did with GCC 6.x:
<source>:24:29: note: === vect_pattern_recog ===
<source>:24:29: note: === vect_analyze_data_ref_accesses ===
<source>:24:29: note: === vect_mark_stmts_to_be_vectorized ===
[...]
<source>:24:29: note: LOOP VECTORIZED
<source>:21:5: note: vectorized 1 loops in function.
So what's the conclusion here?
something changed with GCC 7.1
best guess: a side-effect of the asm volatile messes with inlining of size() preventing vectorisation
Whether or not this is a bug - could be either in 6.x or 7.x depending on what behaviour is desired for the asm volatile() construct - would be a question for the GCC developers.
Also: try adding -mavx2 or -mavx512f -mavx512cd (or -march=native etc.) to the command line, depending on your hardware, to get vectorisation beyond 128-bit xmm, i.e. ymm and zmm, registers.
I was able to narrow the problem down.
In double or single precision, with the optimization flags -std=c++11 -Ofast -march=native:
Clang with version >= 5.0.0 produces AVX move instructions with zmm registers
GCC with 4.9 <= version <= 6.3 produces AVX move instructions with zmm registers
GCC with version >= 7.1.0 produces AVX move instructions with xmm registers
Try it out: https://godbolt.org/g/NXgF4g
I have a cross-platform audio processing app. It is written using the Qt and PortAudio libraries. I also use the Chaotic-Daw sources for some audio processing functions (Vibrato effect and soft-knee dynamic range compression). The problem is that I cannot port my app from Windows to Mac OS X because I get compiler errors for the __asm parts (I use Mac OS X Yosemite and the Qt Creator 3.4.1 IDE):
/Users/admin/My projects/MySound/daw/basics/rosic_NumberManipulations.h:69: error: expected '(' after 'asm'
{
^
for such lines:
INLINE int floorInt(double x)
{
  const float round_towards_m_i = -0.5f;
  int i;
#ifndef LINUX
  __asm
  { // <========= the error points to this line
    fld x;
    fadd st, st (0);
    fadd round_towards_m_i;
    fistp i;
    sar i, 1;
  }
#else
  i = (int) floor(x);
#endif
  return (i);
}
How can I resolve this problem?
The code was clearly written for Microsoft's Visual C++ compiler, as that is the syntax it uses for inline assembly. It uses the Intel syntax and is rather simplistic, which makes it easy to write but hinders its optimization potential.
Clang and GCC both use a different format for inline assembly. In particular, they use the GNU AT&T syntax. It is more complicated to write, but much more expressive. The compiler error is basically Clang's way of telling you, "I can tell you're trying to write inline assembly, but you've formatted it all wrong!"
Therefore, to make this code compile, you will need to convert the MSVC-style inline assembly into GAS-format inline assembly. It might look like this:
int floorInt(double x)
{
const float round_towards_m_i = -0.5f;
int i;
__asm__("fadd %[x], %[x] \n\t"
"fadds %[adj] \n\t"
"fistpl %[i] \n\t"
"sarl $1, %[i]"
: [i] "=m" (i) // store result in memory (as required by FISTP)
: [x] "t" (x), // load input onto top of x87 stack (equivalent to FLD)
[adj] "m" (round_towards_m_i)
: "st");
return (i);
}
But, because of the additional expressivity of the GAS style, we can offload more of the work to the built-in optimizer, which may yield even more optimal object code:
int floorInt(double x)
{
const float round_towards_m_i = -0.5f;
int i;
x += x; // equivalent to the first FADD
x += round_towards_m_i; // equivalent to the second FADD
__asm__("fistpl %[i]"
: [i] "=m" (i)
: [x] "t" (x)
: "st");
return (i >> 1); // equivalent to the final SAR
}
Live demonstration
(Note that, technically, a signed right-shift like that done by the last line is implementation-defined in C and would normally be inadvisable. However, if you're using inline assembly, you have already made the decision to target a specific platform and can therefore rely on implementation-specific behavior. In this case, I know and it can easily be demonstrated that all C compilers will generate SAR instructions to do an arithmetic right-shift on signed integer values.)
That said, it appears that the authors of the code intended for the inline assembly to be used only when you are compiling for a platform other than LINUX (presumably, that would be Windows, on which they expected you to be using Microsoft's compiler). So you could get the code to compile simply by ensuring that you are defining LINUX, either on the command line or in your makefile.
I'm not sure why that decision was made; Clang and GCC are both going to generate the same inefficient code that MSVC does (assuming that you are targeting the older generation of x86 processors and unable to use SSE2 instructions). It is up to you: the code will run either way, but it will be slower without the use of inline assembly to force the use of this clever optimization.
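If it helps to see why the doubling trick computes floor(), here is a small stand-alone check (my own sketch, not part of the original sources). std::nearbyint in the default rounding mode emulates what FISTP does (round to nearest, ties to even):
#include <cmath>
#include <cstdio>

int main()
{
    // x =  2.5 : 2*x - 0.5 =  4.5 -> rounds to  4 (ties to even) -> >>1 =  2 = floor( 2.5)
    // x = -2.5 : 2*x - 0.5 = -5.5 -> rounds to -6 (ties to even) -> >>1 = -3 = floor(-2.5)
    // x =  2.0 : 2*x - 0.5 =  3.5 -> rounds to  4 (ties to even) -> >>1 =  2 = floor( 2.0)
    const double tests[] = { 2.5, -2.5, 2.0, 1.25, -0.25 };
    for (double x : tests) {
        int i = (int)std::nearbyint(2.0 * x - 0.5);   // what FLD/FADD/FADD/FISTP compute
        std::printf("x = %6.2f   trick = %d   floor = %d\n", x, i >> 1, (int)std::floor(x));
    }
    return 0;
}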
I was wondering what the typical compiler's assembly reduction would be when performing an integer modulus by 2 operation such as this:
const char* integer_string = "300"; // avoid compiler optimization
int i = atoi(integer_string);
int b = i % 2; // the line in question
I'd imagine the compiler could turn it into a bitwise operation to just check the last bit (the 1s place), but does it do this?
The question only makes sense in the context of a particular compiler, platform, optimization options etc.
My compiler (gcc 4.7.2 on x86_64) does do this when -O3 optimizations are turned on:
andl $1, %esi
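For context (my addition, not part of the original answer): a lone and of the low bit is what you get whenever the sign of the remainder cannot matter, e.g. for unsigned operands or when only evenness is tested; a fully general signed i % 2 must also yield -1 for negative odd values, so the compiler wraps a little sign-fixup code around the and in that case. A hedged sketch of the forms that typically reduce to a single bit test:
bool is_even(int i)          { return i % 2 == 0; } // sign irrelevant: just tests the low bit
unsigned low_bit(unsigned u) { return u % 2; }      // unsigned: exactly an AND with 1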