Understanding compilation result for std::isnan

I always assumed that there is practically no difference between testing for NaN via
x!=x
or
std::isnan(x)
However, gcc generates different assembly for the two versions (live on godbolt.org):
;x!=x:
ucomisd %xmm0, %xmm0
movl $1, %edx
setne %al
cmovp %edx, %eax
ret
;std::isnan(x)
ucomisd %xmm0, %xmm0
setp %al
ret
However, I'm struggling to understand both versions. My naive attempt at compiling std::isnan(x) would be:
ucomisd %xmm0, %xmm0
setne %al ;return true when not equal
ret
but I must be missing something.
Probably there is a missed optimization in the x!=x version (Edit: it is probably a regression in gcc-8.1).
My question is: why is the parity flag (setp, PF=1) and not the zero flag (setne, ZF=0) used in the second version?

The result for x!=x is due to a regression introduced in gcc-8; clang produces the same assembly for both versions.
My misunderstanding of how ucomisd works was pointed out by @tkausl. The result of this operation can be:
      unordered   <   >   ==
ZF        1       0   0   1
PF        1       0   0   0
CF        1       1   0   0
In the case of ucomisd %xmm0, %xmm0 only the outcomes "unordered" and "==" are possible.
A NaN operand gives "unordered", and for this outcome ZF is set just as in the "==" case, so ZF alone cannot tell the two apart. Thus we have to use the flag PF (or CF) to differentiate between the two possible outcomes.
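To make the equivalence of the two tests concrete at the source level, here is a minimal sketch. The function names are made up for illustration, and it assumes the code is built without -ffast-math (which permits the compiler to assume that NaN never occurs).

```cpp
#include <cmath>
#include <limits>

// Self-comparison: NaN is the only value that compares unequal to itself.
bool is_nan_by_comparison(double x) { return x != x; }

// Library classification function from <cmath>.
bool is_nan_by_library(double x) { return std::isnan(x); }

// Convenience helper (hypothetical name) that returns a quiet NaN for experiments.
double quiet_nan() { return std::numeric_limits<double>::quiet_NaN(); }
```

Both functions should agree for every input; only the instruction sequence the compiler picks for them may differ.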

Related

Why use abs() or fabs() instead of conditional negation?

In C/C++, why should one use abs() or fabs() to find the absolute value of a variable without using the following code?
int absoluteValue = value < 0 ? -value : value;
Does it have something to do with fewer instructions at lower level?

The "conditional abs" you propose is not equivalent to std::abs (or fabs) for floating point numbers, see e.g.
#include <iostream>
#include <cmath>
int main () {
double d = -0.0;
double a = d < 0 ? -d : d;
std::cout << d << ' ' << a << ' ' << std::abs(d);
}
output:
-0 -0 0
Given -0.0 and 0.0 represent the same real number '0', this difference may or may not matter, depending on how the result is used. However, the abs function as specified by IEEE754 mandates the signbit of the result to be 0, which would forbid the result -0.0. I personally think anything used to calculate some "absolute value" should match this behavior.
For integers, both variants are equivalent in both runtime and behavior.
But since std::abs (or the fitting C equivalents) is known to be correct and is easier to read, you should always prefer it.

The first thing that comes to mind is readability.
Compare these two lines of code:
int x = something, y = something, z = something;
// Compare
int absall = (x > 0 ? x : -x) + (y > 0 ? y : -y) + (z > 0 ? z : -z);
int absall = abs(x) + abs(y) + abs(z);

The compiler will most likely do the same thing for both at the bottom layer - at least a modern competent compiler.
However, at least for floating point, you'll end up writing a few dozen lines if you want to handle all the special cases of infinity, not-a-number (NaN), negative zero and so on.
It is also easier to read that abs takes the absolute value than to read "if it's less than zero, negate it".
If the compiler is "stupid", it may well end up generating worse code for a = (a < 0)?-a:a, because it forces an if (even if it's hidden), and that could well be worse than the built-in floating point abs instruction on that processor (aside from the complexity of special values).
Both Clang (6.0-pre-release) and gcc (4.9.2) generate WORSE code for the second case.
I wrote this little sample:
#include <cmath>
#include <cstdlib>
extern int intval;
extern float floatval;
void func1()
{
int a = std::abs(intval);
float f = std::abs(floatval);
intval = a;
floatval = f;
}
void func2()
{
int a = intval < 0?-intval:intval;
float f = floatval < 0?-floatval:floatval;
intval = a;
floatval = f;
}
clang generates this code for func1 and func2:
_Z5func1v: # #_Z5func1v
movl intval(%rip), %eax
movl %eax, %ecx
negl %ecx
cmovll %eax, %ecx
movss floatval(%rip), %xmm0 # xmm0 = mem[0],zero,zero,zero
andps .LCPI0_0(%rip), %xmm0
movl %ecx, intval(%rip)
movss %xmm0, floatval(%rip)
retq
_Z5func2v: # #_Z5func2v
movl intval(%rip), %eax
movl %eax, %ecx
negl %ecx
cmovll %eax, %ecx
movss floatval(%rip), %xmm0
movaps .LCPI1_0(%rip), %xmm1
xorps %xmm0, %xmm1
xorps %xmm2, %xmm2
movaps %xmm0, %xmm3
cmpltss %xmm2, %xmm3
movaps %xmm3, %xmm2
andnps %xmm0, %xmm2
andps %xmm1, %xmm3
orps %xmm2, %xmm3
movl %ecx, intval(%rip)
movss %xmm3, floatval(%rip)
retq
g++ func1:
_Z5func1v:
movss .LC0(%rip), %xmm1
movl intval(%rip), %eax
movss floatval(%rip), %xmm0
andps %xmm1, %xmm0
sarl $31, %eax
xorl %eax, intval(%rip)
subl %eax, intval(%rip)
movss %xmm0, floatval(%rip)
ret
g++ func2:
_Z5func2v:
movl intval(%rip), %eax
movl intval(%rip), %edx
pxor %xmm1, %xmm1
movss floatval(%rip), %xmm0
sarl $31, %eax
xorl %eax, %edx
subl %eax, %edx
ucomiss %xmm0, %xmm1
jbe .L3
movss .LC3(%rip), %xmm1
xorps %xmm1, %xmm0
.L3:
movl %edx, intval(%rip)
movss %xmm0, floatval(%rip)
ret
Note that both cases are notably more complex in the second form, and in the gcc case, it uses a branch. Clang uses more instructions, but no branch. I'm not sure which is faster on which processor models, but quite clearly more instructions is rarely better.

Why use abs() or fabs() instead of conditional negation?
Various reasons have already been stated, yet conditional code has an advantage where abs(INT_MIN) must be avoided.
There is a good reason to use the conditional code in lieu of abs() when the negative absolute value of an integer is sought:
// Negative absolute value
int nabs(int value) {
return -abs(value); // abs(INT_MIN) is undefined behavior.
}
int nabs(int value) {
return value < 0 ? value : -value; // well defined for all `int`
}
When a positive absolute value function is needed and value == INT_MIN is a real possibility, abs(), for all its clarity and speed, fails a corner case. One alternative:
unsigned absoluteValue = value < 0 ? (0u - value) : (0u + value);

There might be a more efficient low-level implementation than a conditional branch on a given architecture. For example, the CPU might have an abs instruction, or a way to extract the sign bit without the overhead of a branch. Supposing an arithmetic right shift can fill a register r with -1 if the number is negative, or 0 if it is non-negative, abs x could become (x+r)^r (and, as Mats Petersson's answer shows, g++ actually does essentially this on x86).
Other answers have gone over the situation for IEEE floating-point.
Trying to tell the compiler to perform a conditional branch instead of trusting the library is probably premature optimization.
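The shift-based trick mentioned above can be sketched as follows. It assumes that right-shifting a negative int is an arithmetic shift (implementation-defined before C++20, guaranteed since), and, like abs(), it is not valid for INT_MIN. The function name is made up for illustration.

```cpp
#include <climits>

// Branch-free absolute value: r is -1 for negative x and 0 otherwise.
// Assumes an arithmetic right shift; undefined for INT_MIN (x + r overflows).
int branchless_abs(int x) {
    const int r = x >> (sizeof(int) * CHAR_BIT - 1);
    return (x + r) ^ r;  // for negative x: ~(x - 1) == -x; otherwise x unchanged
}
```

This is only meant to show what the compiler may do on your behalf; in real code std::abs expresses the same thing more clearly.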

Consider that you could feed a complicated expression into abs(). If you code it with expr > 0 ? expr : -expr, you have to repeat the whole expression three times, and it will be evaluated two times.
In addition, the two results (before and after the colon) might turn out to be of different types (like signed int / unsigned int), which hinders its use in a return statement.
Of course, you could add a temporary variable, but that solves only part of the problem and is not better in any way either.

...and if you made it into a macro, you could get multiple evaluations that you may not want (side effects). Consider:
#define ABS(a) ((a)<0?-(a):(a))
and use:
f= 5.0;
f=ABS(f=fmul(f,b));
which would expand to
f=((f=fmul(f,b))<0?-(f=fmul(f,b)):(f=fmul(f,b)));
Function calls don't have these unintended side effects.
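The double evaluation can be observed directly by counting calls. The following is a small sketch with made-up names (ABS_MACRO, next_value, and the two counting helpers are illustrative only).

```cpp
#include <cstdlib>

#define ABS_MACRO(a) ((a) < 0 ? -(a) : (a))

static int call_count = 0;

// A function with a visible side effect: it counts how often it is called.
int next_value() {
    ++call_count;
    return -3;
}

// The macro expands its argument into the condition and into one branch.
int calls_made_by_macro() {
    call_count = 0;
    const int result = ABS_MACRO(next_value());
    (void)result;
    return call_count;
}

// A real function evaluates its argument exactly once.
int calls_made_by_function() {
    call_count = 0;
    const int result = std::abs(next_value());
    (void)result;
    return call_count;
}
```

With the macro, next_value() runs once for the comparison and once more for whichever branch is selected.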

Assuming that the compiler won't be able to determine that both abs() and conditional negation are attempting to achieve the same goal, conditional negation compiles to a compare instruction, a conditional jump instruction, and a move instruction, whereas abs() compiles either to an actual absolute value instruction, on instruction sets that support such a thing, or to a bitwise AND that keeps everything the same except for the sign bit. Each of the instructions above typically takes 1 cycle, so abs() is likely to be at least as fast as conditional negation, if not faster (the compiler might still recognize that you are attempting to calculate an absolute value when using conditional negation, and generate an absolute value instruction anyway). Even if there is no change in the compiled code, abs() is still more readable than conditional negation.

The intent behind abs() is "(unconditionally) set the sign of this number to positive". Even if that had to be implemented as a conditional based on the current state of the number, it's probably more useful to be able to think of it as a simple "do this", rather than a more complex "if… this… that".

What prevents the inlining of sqrt when compiled without -ffast-math [duplicate]

I'm trying to profile the time it takes to compute a sqrt using the following simple C code, where readTSC() is a function to read the CPU's cycle counter.
double sum = 0.0;
int i;
tm = readTSC();
for ( i = 0; i < n; i++ )
sum += sqrt((double) i);
tm = readTSC() - tm;
printf("%lld clocks in total\n",tm);
printf("%15.6e\n",sum);
However, as I printed out the assembly code using
gcc -S timing.c -o timing.s
on an Intel machine, the result (shown below) was surprising.
Why are there two sqrts in the assembly code, one using the sqrtsd instruction and the other using a function call? Is it related to loop unrolling and trying to execute two sqrts in one iteration?
And how should I understand the line
ucomisd %xmm0, %xmm0
Why does it compare %xmm0 to itself?
//----------------start of for loop----------------
call readTSC
movq %rax, -32(%rbp)
movl $0, -4(%rbp)
jmp .L4
.L6:
cvtsi2sd -4(%rbp), %xmm1
// 1. use sqrtsd instruction
sqrtsd %xmm1, %xmm0
ucomisd %xmm0, %xmm0
jp .L8
je .L5
.L8:
movapd %xmm1, %xmm0
// 2. use C function call
call sqrt
.L5:
movsd -16(%rbp), %xmm1
addsd %xmm1, %xmm0
movsd %xmm0, -16(%rbp)
addl $1, -4(%rbp)
.L4:
movl -4(%rbp), %eax
cmpl -36(%rbp), %eax
jl .L6
//----------------end of for loop----------------
call readTSC

It's using the library sqrt function for error handling. See glibc's documentation: 20.5.4 Error Reporting by Mathematical Functions: math functions set errno for compatibility with systems that don't have IEEE754 exception flags. Related: glibc's math_error(7) man page.
As an optimization, it first tries to perform the square root by the inlined sqrtsd instruction, then checks the result against itself using the ucomisd instruction which sets the flags as follows:
CASE (RESULT) OF
UNORDERED: ZF,PF,CF 111;
GREATER_THAN: ZF,PF,CF 000;
LESS_THAN: ZF,PF,CF 001;
EQUAL: ZF,PF,CF 100;
ESAC;
In particular, comparing a QNaN to itself will return UNORDERED, which is what you will get if you try to take the square root of a negative number. This is covered by the jp branch. The je check is just paranoia, checking for exact equality.
Also note that gcc has a -fno-math-errno option which will sacrifice this error handling for speed. This option is part of -ffast-math, but can be used on its own without enabling any result-changing optimizations.
sqrtsd on its own correctly produces NaN for negative and NaN inputs, and sets the IEEE754 Invalid flag. The check and branch is only to preserve the errno-setting semantics which most code doesn't rely on.
-fno-math-errno is the default on Darwin (OS X), where the math library never sets errno, so functions can be inlined without this check.
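To see the errno behaviour that the extra library call preserves, one can probe it with a small helper. This is a sketch under assumptions: the struct and function names are invented, and whether errno is actually set depends on the platform's math library and on flags such as -fno-math-errno, so only the NaN result is portable.

```cpp
#include <cerrno>
#include <cmath>

// Result of a square root together with whether errno reported a domain error.
struct SqrtReport {
    double value;
    bool errno_was_set;
};

SqrtReport checked_sqrt(double x) {
    errno = 0;
    volatile double input = x;  // discourage compile-time evaluation
    const double result = std::sqrt(input);
    return SqrtReport{result, errno == EDOM};
}
```

On a glibc system built with default flags, checked_sqrt(-1.0) would be expected to report a domain error; on Darwin, per the note above, it would not.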

simd vectorlength and unroll factor for fortran loop

I want to vectorize the Fortran loop below with SIMD directives:
!DIR$ SIMD
DO IELEM = 1 , NELEM
X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
ENDDO
I used the AVX2 instruction set. The program is compiled with
ifort main_vec.f -simd -g -pg -O2 -vec-report6 -o vec.out -xcore-avx2 -align array32byte
Then I'd like to add a VECTORLENGTH(n) clause after SIMD.
If there's no such clause, or n = 2 or 4, the vectorization report says nothing about the unroll factor;
if n = 8 or 16, it reports "vectorization support: unroll factor set to 2".
I've read Intel's article about "vectorization support: unroll factor set to xxxx", so I guess the loop is unrolled to something like:
DO IELEM = 1 , NELEM, 2
X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
X(IKLE(IELEM+1)) = X(IKLE(IELEM+1)) + W(IELEM+1)
ENDDO
Then two X values go into one vector register, two W values go into another, and the addition is performed.
But how does the value of VECTORLENGTH come into play? Or maybe I don't really understand what the vector length means.
And since I use the AVX2 instruction set, what is the maximum vector length that can be reached for X of type DOUBLE PRECISION?
Here's part of the assembly of the loop with SSE2 and VL=8; the compiler told me that the unroll factor is 2. However, it used 4 registers instead of 2.
.loc 1 114 is_stmt 1
movslq main_vec_$IKLE.0.1(,%rdx,4), %rsi #114.9
..LN202:
movslq 4+main_vec_$IKLE.0.1(,%rdx,4), %rdi #114.9
..LN203:
movslq 8+main_vec_$IKLE.0.1(,%rdx,4), %r8 #114.9
..LN204:
movslq 12+main_vec_$IKLE.0.1(,%rdx,4), %r9 #114.9
..LN205:
movsd -8+main_vec_$X.0.1(,%rsi,8), %xmm0 #114.26
..LN206:
movslq 16+main_vec_$IKLE.0.1(,%rdx,4), %r10 #114.9
..LN207:
movhpd -8+main_vec_$X.0.1(,%rdi,8), %xmm0 #114.26
..LN208:
movslq 20+main_vec_$IKLE.0.1(,%rdx,4), %r11 #114.9
..LN209:
movsd -8+main_vec_$X.0.1(,%r8,8), %xmm1 #114.26
..LN210:
movslq 24+main_vec_$IKLE.0.1(,%rdx,4), %r14 #114.9
..LN211:
addpd main_vec_$W.0.1(,%rdx,8), %xmm0 #114.9
..LN212:
movhpd -8+main_vec_$X.0.1(,%r9,8), %xmm1 #114.26
..LN213:
..LN214:
movslq 28+main_vec_$IKLE.0.1(,%rdx,4), %r15 #114.9
..LN215:
movsd -8+main_vec_$X.0.1(,%r10,8), %xmm2 #114.26
..LN216:
addpd 16+main_vec_$W.0.1(,%rdx,8), %xmm1 #114.9
..LN217:
movhpd -8+main_vec_$X.0.1(,%r11,8), %xmm2 #114.26
..LN218:
..LN219:
movsd -8+main_vec_$X.0.1(,%r14,8), %xmm3 #114.26
..LN220:
addpd 32+main_vec_$W.0.1(,%rdx,8), %xmm2 #114.9
..LN221:
movhpd -8+main_vec_$X.0.1(,%r15,8), %xmm3 #114.26
..LN222:
..LN223:
addpd 48+main_vec_$W.0.1(,%rdx,8), %xmm3 #114.9
..LN224:
movsd %xmm0, -8+main_vec_$X.0.1(,%rsi,8) #114.9
..LN225:
.loc 1 113 is_stmt 1
addq $8, %rdx #113.7
..LN226:
.loc 1 114 is_stmt 1
psrldq $8, %xmm0 #114.9
..LN227:
.loc 1 113 is_stmt 1
cmpq $26000, %rdx #113.7
..LN228:
.loc 1 114 is_stmt 1
movsd %xmm0, -8+main_vec_$X.0.1(,%rdi,8) #114.9
..LN229:
movsd %xmm1, -8+main_vec_$X.0.1(,%r8,8) #114.9
..LN230:
psrldq $8, %xmm1 #114.9
..LN231:
movsd %xmm1, -8+main_vec_$X.0.1(,%r9,8) #114.9
..LN232:
movsd %xmm2, -8+main_vec_$X.0.1(,%r10,8) #114.9
..LN233:
psrldq $8, %xmm2 #114.9
..LN234:
movsd %xmm2, -8+main_vec_$X.0.1(,%r11,8) #114.9
..LN235:
movsd %xmm3, -8+main_vec_$X.0.1(,%r14,8) #114.9
..LN236:
psrldq $8, %xmm3 #114.9
..LN237:
movsd %xmm3, -8+main_vec_$X.0.1(,%r15,8) #114.9
..LN238:

1) Vector Length N is the number of elements/iterations you can execute in parallel after "vectorizing" your loop (normally by putting N elements of array X into a single vector register and processing them all together with one vector instruction). As a simplification, think of Vector Length as the value given by this formula:
Vector Length (abbreviated VL) = Vector Register Width / Sizeof (data type)
For AVX2, Vector Register Width = 256 bits. Sizeof (double precision) = 8 bytes = 64 bits. Thus:
Vector Length (double FP, avx2) = 256 / 64 = 4
!DIR$ SIMD VECTORLENGTH(N) basically forces the compiler to use the specified vector length (and to put N elements of array X into a single vector register). That's it.
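The formula can be written down as a tiny compile-time helper. This is just the arithmetic from the text in C++ form; the function name is made up, and the element size is passed in bytes.

```cpp
#include <cstddef>

// Vector Length = Vector Register Width (bits) / element size (bits).
constexpr std::size_t vector_length(std::size_t register_bits,
                                    std::size_t element_bytes) {
    return register_bits / (element_bytes * 8);
}

// 256-bit AVX2 registers hold four 8-byte double precision values.
static_assert(vector_length(256, 8) == 4, "AVX2, double precision");
```

The same helper gives 2 for 128-bit SSE2 registers with double precision, and 8 for AVX2 with 4-byte single precision values.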
2) Unrolling and Vectorization relationship. As a simplification, think of unrolling and vectorization as normally unrelated (somewhat "orthogonal") optimization techniques.
If your loop is unrolled by a factor of M (M could be 2, 4, ...), it doesn't necessarily mean that vector registers were used at all, and it does not mean that your loop was parallelized in any sense. What it means instead is that M instances of the original loop iterations have been grouped together into a single iteration, and within a given new "unrolled" iteration the old iterations are executed sequentially, one by one (so your guessed example is absolutely correct).
The purpose of unrolling is normally to make the loop more "micro-architecture/memory-friendly". In more detail: by making loop iterations "fatter" you normally improve the balance between pressure on your CPU resources and pressure on your memory/cache resources, especially since after unrolling you can normally reuse some data in registers more effectively.
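As an illustration of plain unrolling without any vectorization, here is the scatter-add loop from the question written in C++ with zero-based indices, next to a hand-unrolled version with M = 2. The function names are hypothetical; the remainder loop is needed whenever the trip count is not a multiple of the unroll factor.

```cpp
#include <cstddef>
#include <vector>

// Reference loop: x[ikle[i]] += w[i] for every element.
void scatter_add(std::vector<double>& x,
                 const std::vector<std::size_t>& ikle,
                 const std::vector<double>& w) {
    for (std::size_t i = 0; i < w.size(); ++i) {
        x[ikle[i]] += w[i];
    }
}

// Same loop unrolled by 2: two old iterations per new iteration, still sequential.
void scatter_add_unrolled2(std::vector<double>& x,
                           const std::vector<std::size_t>& ikle,
                           const std::vector<double>& w) {
    const std::size_t n = w.size();
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        x[ikle[i]] += w[i];
        x[ikle[i + 1]] += w[i + 1];
    }
    for (; i < n; ++i) {  // remainder when n is odd
        x[ikle[i]] += w[i];
    }
}
```

Because the order of the updates is unchanged, both versions produce identical results even when ikle contains repeated indices.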
3) Unrolling + Vectorization. It's not uncommon for compilers to simultaneously vectorize (with VL=N) and unroll (by M) certain loops. As a result, the number of iterations in the optimized loop is smaller than the number of iterations in the original loop by a factor of approximately NxM; however, the number of elements processed in parallel (simultaneously at a given moment in time) will only be N.
Thus, in your example, if the loop is vectorized with VL=4 and unrolled by 2, then the pseudo-code for it might look like:
DO IELEM = 1 , NELEM, 8
[X(IKLE(IELEM)),X(IKLE(IELEM+2)), X(IKLE(IELEM+4)), X(IKLE(IELEM+6))] = ...
[X(IKLE(IELEM+1)),X(IKLE(IELEM+3)), X(IKLE(IELEM+5)), X(IKLE(IELEM+7))] = ...
ENDDO
where square brackets "correspond" to vector register content.
4) Vectorization against Unrolling:
For loops with a relatively small number of iterations (especially in C++), it may happen that unrolling is not desirable, since it partially blocks efficient vectorization (not enough iterations to execute in parallel) and (as you see from my artificial example) may somehow impact the way the data has to be loaded from memory. Different compilers have different heuristics for balancing trip counts, VL and unrolling against each other; that's probably why unrolling was disabled in your case when VL was smaller than 8.
Runtime and compile-time trade-offs between trip counts, unrolling and vector length, as well as appropriate automatic suggestions (especially when using a recent Intel C++ or Fortran Compiler), can be explored using "Intel (Vectorization) Advisor".
5) P.S. There is a third dimension (I don't really like to talk about it).
When the vectorlength requested by the user is bigger than the possible Vector Length on the given hardware (let's say specifying vectorlength(16) on an AVX2 platform for double FP), or when you mix different types, the compiler may (or may not) start using a notion of "virtual vector register" and start doing double-/quad-pumping. M-pumping is a kind of unrolling, but only for a single instruction (i.e. pumping leads to repeating the single instruction, while unrolling leads to repeating the whole loop body). You may try to read about m-pumping in recent OpenMP books. So in some cases you may end up with a superposition of a) vectorization, b) unrolling and c) double-pumping, but it's not a common case and I'd avoid enforcing vectorlength > 2*ISA_VectorLength.

When/why does (a < 0) potentially branch in an expression?

After reading many of the comments on this question, there are a couple people (here and here) that suggest that this code:
int val = 5;
int r = (0 < val) - (val < 0); // this line here
will cause branching. Unfortunately, none of them give any justification or say why it would cause branching (tristopia suggests it requires a cmove-like instruction or predication, but doesn't really say why).
Are these people right that "a comparison used in an expression will not generate a branch" is actually a myth rather than a fact (assuming you're not using some esoteric processor)? If so, can you give an example?
I would've thought there wouldn't be any branching (given that there's no logical "short circuiting"), and now I'm curious.

To simplify matters, consider just one part of the expression: val < 0. Essentially, this means “if val is negative, return 1, otherwise 0”; you could also write it like this:
val < 0 ? 1 : 0
How this is translated into processor instructions depends heavily on the compiler and the target processor. The easiest way to find out is to write a simple test function, like so:
int compute(int val) {
return val < 0 ? 1 : 0;
}
and review the assembler code that is generated by the compiler (e.g., with gcc -S -o - example.c). For my machine, it does it without branching. However, if I change it to return 5 instead of 1, there are branch instructions:
...
cmpl $0, -4(%rbp)
jns .L2
movl $5, %eax
jmp .L3
.L2:
movl $0, %eax
.L3:
...
So, “a comparison used in an expression will not generate a branch” is indeed a myth. (But “a comparison used in an expression will always generate a branch” isn’t true either.)
Addition in response to this extension/clarification:
I'm asking if there's any (sane) platform/compiler for which a branch is likely. MIPS/ARM/x86(_64)/etc. All I'm looking for is one case that demonstrates that this is a realistic possibility.
That depends on what you consider a “sane” platform. If the venerable 6502 CPU family is sane, I think there is no way to calculate val > 0 on it without branching. Most modern instruction sets, on the other hand, provide some type of set-on-X instruction.
(val < 0 can actually be computed without branching even on 6502, because it can be implemented as a bit shift.)
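The bit-shift idea for val < 0 can be sketched in C++ as well. The function name is made up; the sketch assumes a two's complement int, so that the top bit of the unsigned representation is the sign bit.

```cpp
#include <climits>

// Returns 1 if val is negative and 0 otherwise, using only a shift.
// Assumes two's complement, so the most significant bit is the sign bit.
int is_negative(int val) {
    const unsigned int bits = static_cast<unsigned int>(val);
    return static_cast<int>(bits >> (sizeof(int) * CHAR_BIT - 1));
}
```

A logical shift of the unsigned value moves the sign bit into bit 0, so no comparison and no branch is required.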

Empiricism for the win:
int sign(int val) {
return (0 < val) - (val < 0);
}
compiled with optimisations. gcc (4.7.2) produces
sign:
.LFB0:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
setg %al
shrl $31, %edi
subl %edi, %eax
ret
.cfi_endproc
no branch. clang (3.2):
sign: # #sign
.cfi_startproc
# BB#0:
movl %edi, %ecx
shrl $31, %ecx
testl %edi, %edi
setg %al
movzbl %al, %eax
subl %ecx, %eax
ret
neither. (on x86_64, Core i5)

This is actually architecture-dependent. If there exists an instruction to set the value to 0/1 depending on the sign of another value, there will be no branching. If there's no such instruction, branching would be necessary.

Is there any advantage to using pow(x,2) instead of x*x, with x double?

is there any advantage to using this code
double x;
double square = pow(x,2);
instead of this?
double x;
double square = x*x;
I prefer x*x and looking at my implementation (Microsoft) I find no advantages in pow because x*x is simpler than pow for the particular square case.
Is there any particular case where pow is superior?

FWIW, with gcc-4.2 on MacOS X 10.6 and -O3 compiler flags,
x = x * x;
and
y = pow(y, 2);
result in the same assembly code:
#include <cmath>
void test(double& x, double& y) {
x = x * x;
y = pow(y, 2);
}
Assembles to:
pushq %rbp
movq %rsp, %rbp
movsd (%rdi), %xmm0
mulsd %xmm0, %xmm0
movsd %xmm0, (%rdi)
movsd (%rsi), %xmm0
mulsd %xmm0, %xmm0
movsd %xmm0, (%rsi)
leave
ret
So as long as you're using a decent compiler, write whichever makes more sense to your application, but consider that pow(x, 2) can never be more optimal than the plain multiplication.

std::pow is more expressive if you mean x², x*x is more expressive if you mean x*x, especially if you are just coding down e.g. a scientific paper and readers should be able to understand your implementation vs. the paper. The difference is maybe subtle for x*x/x², but I think that using named functions in general increases code expressiveness and readability.
On modern compilers, e.g. g++ 4.x, std::pow(x,2) will be inlined, if it is not even a compiler builtin, and strength-reduced to x*x. If that does not happen by default and you don't care about IEEE floating point conformance, check your compiler's manual for a fast math switch (for g++: -ffast-math).
Sidenote: It has been mentioned that including math.h increases program size. My answer was:
In C++, you #include <cmath>, not math.h. Also, if your compiler is not ancient, it will increase your program's size only by what you are using (in the general case), and if your implementation of std::pow just inlines to the corresponding x87 instructions, and a modern g++ will strength-reduce x² to x*x, then there is no relevant size increase. Also, program size should never, ever dictate how expressive you make your code.
A further advantage of cmath over math.h is that with cmath, you get a std::pow overload for each floating point type, whereas with math.h you get pow, powf, etc. in the global namespace, so cmath increases adaptability of code, especially when writing templates.
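The point about overloads and templates can be illustrated with a short sketch. The template name squared is hypothetical; it relies only on the std::pow overloads for float, double and long double that <cmath> provides.

```cpp
#include <cmath>

// Picks the std::pow overload matching T, so the result type follows the argument type.
template <typename T>
auto squared(T x) -> decltype(std::pow(x, T(2))) {
    return std::pow(x, T(2));
}
```

Called with a float this resolves to the float overload and yields a float; with the C names one would have to choose between powf, pow and powl by hand.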
As a general rule: Prefer expressive and clear code over dubiously grounded performance and binary size reasoned code.
See also Knuth:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
and Jackson:
The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.

Not only is x*x clearer, it certainly will be at least as fast as pow(x,2).

This question touches on one of the key weaknesses of most implementations of C and C++ regarding scientific programming. After having switched from Fortran to C about twenty years ago, and later to C++, this remains one of those sore spots that occasionally makes me wonder whether that switch was a good thing to do.
The problem in a nutshell:
The easiest way to implement pow is Type pow(Type x, Type y) {return exp(y*log(x));}
Most C and C++ compilers take the easy way out.
Some might 'do the right thing', but only at high optimization levels.
Compared to x*x, the easy way out with pow(x,2) is extremely expensive computationally and loses precision.
Compare to languages aimed at scientific programming:
You don't write pow(x,y). These languages have a built-in exponentiation operator. That C and C++ have steadfastly refused to implement an exponentiation operator makes the blood of many scientific programmers boil. To some diehard Fortran programmers, this alone is reason to never switch to C.
Fortran (and other languages) are required to 'do the right thing' for all small integer powers, where small is any integer between -12 and 12. (The compiler is non-compliant if it can't 'do the right thing'.) Moreover, they are required to do so with optimization off.
Many Fortran compilers also know how to extract some rational roots without resorting to the easy way out.
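For comparison, here is a sketch of how an integer power can be computed with multiplications only, instead of exp(y*log(x)). It uses exponentiation by squaring; the name ipow is hypothetical, and this is not claimed to be what any particular Fortran compiler emits.

```cpp
// Integer power by repeated squaring: multiplications only, no exp/log.
double ipow(double base, int exponent) {
    long long n = exponent;   // widen so that negating the smallest int is safe
    const bool negative = n < 0;
    if (negative) {
        n = -n;
    }
    double result = 1.0;
    while (n > 0) {
        if (n & 1) {
            result *= base;   // this bit of the exponent contributes a factor
        }
        base *= base;         // square for the next bit
        n >>= 1;
    }
    return negative ? 1.0 / result : result;
}
```

For an exponent of 2 this boils down to a single meaningful multiplication, which is the behaviour the answer asks for.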
There is an issue with relying on high optimization levels to 'do the right thing'. I have worked for multiple organizations that have banned use of optimization in safety critical software. Memories can be very long (multiple decades long) after losing 10 million dollars here, 100 million there, all due to bugs in some optimizing compiler.
IMHO, one should never use pow(x,2) in C or C++. I'm not alone in this opinion. Programmers who do use pow(x,2) typically get reamed big time during code reviews.

In C++11 there is one case where there is an advantage to using x * x over std::pow(x,2) and that case is where you need to use it in a constexpr:
constexpr double mySqr( double x )
{
return x * x ;
}
As we can see std::pow is not marked constexpr and so it is unusable in a constexpr function.
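A short sketch of what the constexpr version buys you: the square can be used wherever a constant expression is required. The names my_sqr and kArea are made up for illustration.

```cpp
// A constexpr square, usable in constant expressions since C++11.
constexpr double my_sqr(double x) {
    return x * x;
}

// Checked by the compiler, not at run time.
static_assert(my_sqr(3.0) == 9.0, "my_sqr is usable in a constant expression");

// Initialized at compile time.
constexpr double kArea = my_sqr(1.5);
```

Replacing x * x with std::pow(x, 2) in my_sqr would only compile where the implementation happens to treat std::pow as constexpr.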
Otherwise from a performance perspective putting the following code into godbolt shows these functions:
#include <cmath>
double mySqr( double x )
{
return x * x ;
}
double mySqr2( double x )
{
return std::pow( x, 2.0 );
}
generate identical assembly:
mySqr(double):
mulsd %xmm0, %xmm0 # x, D.4289
ret
mySqr2(double):
mulsd %xmm0, %xmm0 # x, D.4292
ret
and we should expect similar results from any modern compiler.
It is worth noting that gcc currently treats pow as constexpr, but this is a non-conforming extension that should not be relied on and will probably change in later releases of gcc.

x * x will always compile to a simple multiplication. pow(x, 2) is likely, but by no means guaranteed, to be optimised to the same. If it's not optimised, it's likely using a slow general raise-to-power math routine. So if performance is your concern, you should always favour x * x.

IMHO:
Code readability
Code robustness - will be easier to change to pow(x, 6), maybe some floating point mechanism for a specific processor is implemented, etc.
Performance - if there is a smarter and faster way to calculate this (using assembler or some kind of special trick), pow will do it. you won't.. :)
Cheers

I would probably choose std::pow(x, 2) because it could make my code refactoring easier. And it would make no difference whatsoever once the code is optimized.
Now, the two approaches are not identical. This is my test code:
#include<cmath>
double square_explicit(double x) {
asm("### Square Explicit");
return x * x;
}
double square_library(double x) {
asm("### Square Library");
return std::pow(x, 2);
}
The asm("text"); call simply writes comments to the assembly output, which I produce using (GCC 4.8.1 on OS X 10.7.4):
g++ example.cpp -c -S -std=c++11 -O[0, 1, 2, or 3]
You don't need -std=c++11, I just always use it.
First: when debugging (with zero optimization), the assembly produced is different; this is the relevant portion:
# 4 "square.cpp" 1
### Square Explicit
# 0 "" 2
movq -8(%rbp), %rax
movd %rax, %xmm1
mulsd -8(%rbp), %xmm1
movd %xmm1, %rax
movd %rax, %xmm0
popq %rbp
LCFI2:
ret
LFE236:
.section __TEXT,__textcoal_nt,coalesced,pure_instructions
.globl __ZSt3powIdiEN9__gnu_cxx11__promote_2IT_T0_NS0_9__promoteIS2_XsrSt12__is_integerIS2_E7__valueEE6__typeENS4_IS3_XsrS5_IS3_E7__valueEE6__typeEE6__typeES2_S3_
.weak_definition __ZSt3powIdiEN9__gnu_cxx11__promote_2IT_T0_NS0_9__promoteIS2_XsrSt12__is_integerIS2_E7__valueEE6__typeENS4_IS3_XsrS5_IS3_E7__valueEE6__typeEE6__typeES2_S3_
__ZSt3powIdiEN9__gnu_cxx11__promote_2IT_T0_NS0_9__promoteIS2_XsrSt12__is_integerIS2_E7__valueEE6__typeENS4_IS3_XsrS5_IS3_E7__valueEE6__typeEE6__typeES2_S3_:
LFB238:
pushq %rbp
LCFI3:
movq %rsp, %rbp
LCFI4:
subq $16, %rsp
movsd %xmm0, -8(%rbp)
movl %edi, -12(%rbp)
cvtsi2sd -12(%rbp), %xmm2
movd %xmm2, %rax
movq -8(%rbp), %rdx
movd %rax, %xmm1
movd %rdx, %xmm0
call _pow
movd %xmm0, %rax
movd %rax, %xmm0
leave
LCFI5:
ret
LFE238:
.text
.globl __Z14square_libraryd
__Z14square_libraryd:
LFB237:
pushq %rbp
LCFI6:
movq %rsp, %rbp
LCFI7:
subq $16, %rsp
movsd %xmm0, -8(%rbp)
# 9 "square.cpp" 1
### Square Library
# 0 "" 2
movq -8(%rbp), %rax
movl $2, %edi
movd %rax, %xmm0
call __ZSt3powIdiEN9__gnu_cxx11__promote_2IT_T0_NS0_9__promoteIS2_XsrSt12__is_integerIS2_E7__valueEE6__typeENS4_IS3_XsrS5_IS3_E7__valueEE6__typeEE6__typeES2_S3_
movd %xmm0, %rax
movd %rax, %xmm0
leave
LCFI8:
ret
But when you produce the optimized code (even at the lowest optimization level for GCC, meaning -O1) the code is just identical:
# 4 "square.cpp" 1
### Square Explicit
# 0 "" 2
mulsd %xmm0, %xmm0
ret
LFE236:
.globl __Z14square_libraryd
__Z14square_libraryd:
LFB237:
# 9 "square.cpp" 1
### Square Library
# 0 "" 2
mulsd %xmm0, %xmm0
ret
So, it really makes no difference unless you care about the speed of unoptimized code.
Like I said: it seems to me that std::pow(x, 2) more clearly conveys your intentions, but that is a matter of preference, not performance.
And the optimization seems to hold even for more complex expressions. Take, for instance:
double explicit_harder(double x) {
asm("### Explicit, harder");
return x * x - std::sin(x) * std::sin(x) / (1 - std::tan(x) * std::tan(x));
}
double implicit_harder(double x) {
asm("### Library, harder");
return std::pow(x, 2) - std::pow(std::sin(x), 2) / (1 - std::pow(std::tan(x), 2));
}
Again, with -O1 (the lowest optimization), the assembly is identical yet again:
# 14 "square.cpp" 1
### Explicit, harder
# 0 "" 2
call _sin
movd %xmm0, %rbp
movd %rbx, %xmm0
call _tan
movd %rbx, %xmm3
mulsd %xmm3, %xmm3
movd %rbp, %xmm1
mulsd %xmm1, %xmm1
mulsd %xmm0, %xmm0
movsd LC0(%rip), %xmm2
subsd %xmm0, %xmm2
divsd %xmm2, %xmm1
subsd %xmm1, %xmm3
movapd %xmm3, %xmm0
addq $8, %rsp
LCFI3:
popq %rbx
LCFI4:
popq %rbp
LCFI5:
ret
LFE239:
.globl __Z15implicit_harderd
__Z15implicit_harderd:
LFB240:
pushq %rbp
LCFI6:
pushq %rbx
LCFI7:
subq $8, %rsp
LCFI8:
movd %xmm0, %rbx
# 19 "square.cpp" 1
### Library, harder
# 0 "" 2
call _sin
movd %xmm0, %rbp
movd %rbx, %xmm0
call _tan
movd %rbx, %xmm3
mulsd %xmm3, %xmm3
movd %rbp, %xmm1
mulsd %xmm1, %xmm1
mulsd %xmm0, %xmm0
movsd LC0(%rip), %xmm2
subsd %xmm0, %xmm2
divsd %xmm2, %xmm1
subsd %xmm1, %xmm3
movapd %xmm3, %xmm0
addq $8, %rsp
LCFI9:
popq %rbx
LCFI10:
popq %rbp
LCFI11:
ret
Finally: the x * x approach does not require including cmath, which would make your compilation ever so slightly faster, all else being equal.