Why does clang generate rsqrt, if stack-protector is turned on? - c++

Check out this simple code:
#include <cmath>
float foo(float in) {
    return sqrtf(in);
}
With -ffast-math, clang generates sqrtss, as expected. But if I also use -fstack-protector-all, it changes sqrtss to rsqrtss, as you can see at godbolt. Why?

The short and sweet:
rsqrtss is safer and, as a result, less accurate and slower.
sqrtss is faster and, as a result, less safe.
Why is rsqrtss safer?
It doesn't use the whole XMM register.
Why is rsqrtss slower?
Because it needs more registers to perform the same action as
sqrtss.
Why does rsqrtss use a reciprocal?
In a pinch, it seems that the reciprocal of a square root can be calculated faster and with less memory (a sketch of the idea follows this list).
Pico-spelenda: Lots of math.
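To make that last point concrete, here is a minimal sketch (my own illustration, not taken from the answer above or from clang's actual output) of why an approximate reciprocal square root is still useful for computing a plain square root: sqrt(x) equals x * (1/sqrt(x)), and a cheap estimate can be refined with one Newton-Raphson step.
#include <cmath>
float sqrt_via_rsqrt(float x) {
    float r = 1.0f / std::sqrt(x);      // stand-in for a hardware rsqrtss-style estimate
    r = r * (1.5f - 0.5f * x * r * r);  // one Newton-Raphson refinement of the estimate
    return x * r;                       // sqrt(x) == x * (1/sqrt(x))
}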
The long and bitter:
Research
What does -ffast-math do?
-ffast-math
Enable fast-math mode. This defines the __FAST_MATH__ preprocessor
macro, and lets the compiler make aggressive, potentially-lossy
assumptions about floating-point math. These include:
Floating-point math obeys regular algebraic rules for real numbers (e.g. + and * are associative, x/y == x * (1/y), and (a + b) * c == a * c + b * c),
operands to floating-point operations are not equal to NaN and Inf, and
+0 and -0 are interchangeable.
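A small illustration (my own example, not from the clang documentation) of the kind of rewrite those assumptions permit: with -ffast-math the compiler may turn the repeated division below into a multiplication by a precomputed reciprocal, which can change the last bits of each result.
#include <cstddef>
void scale_down(float* v, std::size_t n, float d) {
    for (std::size_t i = 0; i < n; ++i)
        v[i] = v[i] / d;   // fast-math may rewrite this as v[i] * (1.0f / d)
}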
What does -fstack-protector-all do?
This answer can be found here.
Basically, it "forces the usage of stack protectors for all functions".
What is a "stack protector"?
A nice article for you.
The blissfully short, quite terribly succinct sparknotes is:
A "stack protector" is used to prevent exploitation of stack overwrites.
the stack protector as implemented in gcc and clang adds an additional guard
variable to each function’s stack area.
Interesting Drawback To Note:
"Adding these checks will lead to a little runtime overhead: More stack
space is needed, but that is negligible except for really constrained
systems...Do you aim for maximum security at the cost of
performance? -fstack-protector-all is for you."
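As a concrete illustration (my own example, not from the article), this is the kind of function the guard protects: the compiler places a canary next to the buffer on entry and checks it before returning, and with -fstack-protector-all even functions without local arrays get the check.
#include <cstring>
void copy_name(const char* src) {
    char buf[32];                           // the stack buffer the guard sits next to
    std::strncpy(buf, src, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';             // ensure termination
    // ... use buf ...
}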
What is sqrtss?
According to #godbolt:
Computes the square root of the low single-precision floating-point value
in the second source operand and stores the single-precision floating-point
result in the destination operand. The second source operand can be an XMM
register or a 32-bit memory location. The first source and destination
operands are an XMM register.
What is a "source operand"?
A tutorial can be found here
In essence, an operand is a location of data in a computer. Imagine the simple instruction x+x=y. You need to know what 'x' is, which is the source operand, and where the result will be stored, 'y', which is the destination operand. Notice how the '+' symbol, which is commonly called an 'operation', can be set aside, because it doesn't matter in this example.
What is an "XMM register"?
An explanation can be found here.
It's just a specific type of register. It's primarily used in floating-point math
(which, surprisingly enough, is the math you are trying to do).
What is rsqrtss?
Again, according to #godbolt:
Computes an approximate reciprocal of the square root of the low
single-precision floating-point value in the source operand (second operand) and
stores the single-precision floating-point result in the destination operand.
The source operand can be an XMM register or a 32-bit memory location. The
destination operand is an XMM register. The three high-order doublewords of
the destination operand remain unchanged. See Figure 10-6 in the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration
of a scalar single-precision floating-point operation.
What is a "doubleword"?
A simple definition.
It is a unit of measurement of computer memory, just like 'bit' or 'byte'. However, unlike 'bit' or 'byte', it is not universal and depends on the architecture of the computer.
What does "Figure 10-6 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1" look like?
Here you go.
Disclaimer:
Most of this knowledge comes from outside sources. I literally installed clang just now to help answer your question. I'm not an expert.

Related

Is there a difference between a = -a and a *= -1 for floats in c++?

For types like signed int and float and double etc? Because, if I remember correctly (a very big if), it's a matter of flipping a bit and I thought whether there's an explicit operation that would do it to make code run faster?
Edit: okay, just for floating-point types (the int parts were me forgetting to use my brain, sorry).
-a and a * -1 are different operations.1 Nothing in the C++ standard overtly requires an implementation to produce different results for them, so a compiler may treat them as identical. Nothing in the C++ standard requires an implementation to treat them identically, so a compiler may treat them differently.
The C++ standard is largely lax about specifying floating-point operations. A C++ implementation could consider unary - to be a mathematical operation so that -a is equivalent to 0-a. On the other hand, a C++ implementation could consider unary - to be the IEEE-754 negate(x) operation, which per IEEE 754-2008 5.5.1:
… negate(x) copies a floating-point operand x to a destination in the same format, reversing the sign bit. negate(x) is not the same as subtraction(0, x)…
Differences include:
Negation is a bit-level operation; it flips the sign bit, does not signal2 an exception even if the operand is a signaling NaN, and may propagate non-canonical encodings (relevant for decimal formats).
Subtraction and multiplication are mathematical operations; they will signal2 exceptional conditions and do not propagate non-canonical results.
Thus, you might find that a compiler generates an XOR instruction for a = -a; that merely flips the sign bit but generates a multiply instruction for a *= -1;. It should generate an XOR instruction for a *= -1; only if the implementation does not support floating-point flags, traps, signaling NaNs, or anything else that makes negation and subtraction/multiplication distinguishable.
Godbolt shows that x86-64 Clang 11.0.0 with default options uses xor for -a and mulss for a * -1. But x86-64 GCC 10.2 uses xorps for both.
Footnotes
1 I have isolated the - and * operations here; the assignment portion is not of interest.
2 “Signal” is used here in the IEEE-754 sense, meaning an indication is given that an exceptional condition has occurred. This can result in a flag being raised, and it may cause a trap that affects program control. It is not the same as a C++ signal, although the trap may cause a C++ signal.
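A minimal sketch (my own example, mirroring the Godbolt observation above) that you can paste into a compiler explorer to compare the code generated for the two forms:
float negate(float a)          { return -a; }        // clang, default options: sign-bit xor
float times_minus_one(float a) { return a * -1.0f; } // clang, default options: mulss; gcc: xorps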

Truncate Floats and Doubles after user defined points in X87 and SSE FPUs

I have made a function g that is able to approximate a function to a certain degree; this function gives accurate results up to 5 decimals (1,23456xxxxxxxxxxxx where the x positions are just rounding errors / junk).
To avoid spreading error to other computations that will use the results of g, I would like to just set all the x positions to zero, or better yet, set to 0 everything after the 5th decimal.
I haven't found anything in the X87 and SSE literature that lets me play with IEEE 754 bits or their representation the way I would like to.
There is an old reference to the FISTP instruction for X87 that is apparently mirrored in the SSE world with FISTTP, with the benefit that FISTTP doesn't necessarily modify the control word and is therefore faster.
I have noticed that FISTTP was called "chopping mode", but in more modern literature it is just "rounding toward zero" or "truncate", and this confuses me, because "chopping" means removing something altogether, whereas "rounding toward zero" doesn't necessarily mean the same thing to me.
I don't need to round to zero; I only need to preserve up to 5 decimals as the last step in my function before storing the result in memory. How do I do this in X87 (scalar FPU) and SSE (vector FPU)?
As several people commented, rounding earlier doesn't help the final result be more accurate. If you want to read more about floating point comparisons and weirdness / gotchas, I highly recommend Bruce Dawson's series of articles on floating point. Here's a quote from the one with the index of the series:
We’ve finally reached the point in this series that I’ve been waiting
for. In this post I am going to share the most crucial piece of
floating-point math knowledge that I have. Here it is:
[Floating-point] math is hard.
You just won’t believe how vastly, hugely, mind-bogglingly hard it is.
I mean, you may think it’s difficult to calculate when trains from
Chicago and Los Angeles will collide, but that’s just peanuts to
floating-point math.
(Bonus points if you recognize that last paragraph as a paraphrase of a famous line about space.)
How you could actually implement your bad idea:
There aren't any machine instructions or C standard library functions to truncate or round to anything other than integer.
Note that there are machine instructions (and C functions) that round a double to nearest (representable) integer without converting it to intmax_t or anything, just double->double. So no round-trip through a fixed-width 2's complement integer.
So to use them, you could scale your float up by some factor, round to nearest integer, then scale back down (like chux's round()-based function below, but I'd recommend C99 double rint(double) instead of round(); round() has weird rounding semantics that don't match any of the available rounding modes on x86, so it compiles to worse code).
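A minimal sketch (my own example, not the OP's g()) of that scale / round-to-nearest / unscale idea using rint(); note the result is still only the nearest binary double to a multiple of 0.00001, and there is no overflow guard here (chux's answer below adds one):
#include <cmath>
double keep_5_decimals(double x) {
    const double scale = 1e5;              // 10^5: keep five decimal places
    return std::rint(x * scale) / scale;   // rint() uses the current rounding mode (default: to nearest)
}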
The x86 asm instructions you keep mentioning are nothing special, and don't do anything that you can't ask the compiler to do with pure C.
FISTP (Float Integer STore and Pop the x87 stack) is one way for a compiler or asm programmer to implement long lrint(double) or (int)nearbyint(double). Some compilers make better code for one or the other. It rounds using the current x87 rounding mode (default: round to nearest), which is the same semantics as those ISO C standard functions.
FISTTP (Float Integer STore with Truncation and Pop the x87 stack) is part of SSE3, even though it operates on the x87 stack. It lets compilers avoid setting the rounding mode to truncation (round-towards-zero) to implement the C truncation semantics of (long)x, and then restoring the old rounding mode.
This is what the "not modify the control word" stuff is about. Neither instruction does that, but to implement (int)x without FISTTP, the compiler has to use other instructions to modify and restore the rounding mode around a FIST instruction. Or just use SSE2 CVTTSD2SI to convert a double in an xmm register with truncation, instead of an FP value on the legacy x87 stack.
Since FISTTP is only available with SSE3, you'd only use it for long double, or in 32-bit code that had FP values in x87 registers anyway because of the crusty old 32-bit ABI which returns FP values on the x87 stack.
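For comparison, a minimal sketch (my own example) of the two C-level operations these instructions back; with SSE neither needs any rounding-mode save/restore:
#include <cmath>
long trunc_toward_zero(double x)  { return (long)x; }       // truncation; e.g. cvttsd2si on x86-64
long round_current_mode(double x) { return std::lrint(x); } // current rounding mode; e.g. cvtsd2si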
PS. if you didn't recognize Bruce's HHGTG reference, the original is:
Space is big. Really big. You just won’t believe how vastly hugely
mindbogglingly big it is. I mean you may think it’s a long way down
the road to the chemist’s, but that’s just peanuts to space.
how do I do this in X87 (scalar FPU) and SSE (vector FPU)?
The following does not use X87 nor SSE. I've included it as a community reference for general purpose code. If anything, it can be used to test a X87 solution.
Any "chopping" of the result of g() will at least marginally increase error, hopefully tolerable as OP said "To avoid spreading error to other computations ..."
It is unclear if OP wants "accurate results up to 5 decimals" to reflect absolute precision (+/- 0.000005) or relative precision (+/- 0.000005 * result). Will assume "absolute precision".
Since float and double are most often binary floating-point types, any "chop" will yield the FP number nearest to a multiple of 0.00001.
Text Method:
#include <float.h>   // DBL_MAX_10_EXP
#include <stdio.h>   // sprintf
#include <stdlib.h>  // atof

// - x xxx...xxx . xxxxx \0   (sign, integer digits, '.', 5 fraction digits, terminator)
char buf[1 + 1 + DBL_MAX_10_EXP + 1 + 5 + 1];
sprintf(buf, "%.5f", x);
x = atof(buf);
round() rint() method:
#include <float.h>  // DBL_MAX
#include <math.h>   // fabs, rint

#define SCALE 100000.0
if (fabs(x) < DBL_MAX / SCALE) {
    x = x * SCALE;
    x = rint(x) / SCALE;
}
Direct bit manipulation of x. Simply zero select bits in the significand.
TBD code.

Floating point comparison in Apple libraries [duplicate]

Apple CoreGraphics.framework, CGGeometry.h:
CG_INLINE bool __CGSizeEqualToSize(CGSize size1, CGSize size2)
{
    return size1.width == size2.width && size1.height == size2.height;
}
#define CGSizeEqualToSize __CGSizeEqualToSize
Why do they (Apple) compare floats with ==? I can't believe this is a mistake. So can you explain it to me?
(I've expected something like fabs(size1.width - size2.width) < 0.001).
Floating point comparisons are native width on all OSX and iOS architectures.
For float, that comes to:
i386, x86_64:
32 bit XMM register (or memory for second operand)
using instructions in the family of ucomiss
ARM:
32 bit register
using instructions in the same family as vcmp
Some of the floating point comparison issues have been removed by restricting storage to 32/64 bits for these types. Other platforms may often use the native 80-bit FPU (example). On OS X, SSE instructions are favored, and they use natural widths. So, that reduces many of the floating point comparison issues.
But there is still room for error, or times when you will favor approximation. One hidden detail about CGGeometry types' values is that they may be rounded to a nearby integer (you may want to do this yourself in some cases).
Given the range of CGFloat (float, or double on x86_64) and typical values, it's reasonable to assume the rounded values will generally be represented accurately enough that the results will be suitably comparable in the majority of cases. Therefore, it's "pretty safe", "pretty accurate", and "pretty fast" within those confines.
There are still times when you may prefer approximated comparisons in geometry calculations, but Apple's implementation is what I'd consider the closest to a reference implementation for the general-purpose solution in this context.
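If you do want the approximate comparison the question sketches, a minimal version (my own example, not Apple's API) looks like the following; a fixed absolute tolerance such as 0.001 is only reasonable when you know the magnitude of the values being compared:
#include <cmath>
bool nearly_equal(double a, double b, double tol = 0.001) {
    return std::fabs(a - b) < tol;   // absolute tolerance, as suggested in the question
}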

Computational complexity for casting int to unsigned vs complexity for comparing values

I just wanted to know which operation is faster in C/C++, as well as what the computational complexity for a type cast is.
Typecasting x to an unsigned integer like so:
(unsigned int) x
or
Performing a comparison between x and a constant:
x<0
edit: by "computational complexity" I mean which operation requires the fewest bit operations at the hardware level in order to successfully carry out the instruction.
edit #2: To give some context, what I'm specifically trying to do is see whether reducing
if( x < 0)
into
if((((unsigned int)x)>>(((sizeof(int))<<3)-1)))
would be more efficient or not if done over 100,000,000 times, with large quantities for x, above/below (+/-)50,000,000
(unsigned int) x is - for the near-universal 2's complement notation - a compile-time operation: you're telling the compiler to treat the content of x as an unsigned value, which doesn't require any runtime machine-code instructions in and of itself, but may change the machine-code instructions it emits to support the usage made of the unsigned value, or even enable dead-code elimination optimisations. For example, the following could be eliminated completely after the cast:
if ((unsigned int)my_unsigned_int >= 0)
The relevant C++ Standard quote (my boldfacing):
If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type). [ Note: In a two’s complement representation, this conversion is conceptual and there is no change in the bit pattern (if there is no truncation). —end note ]
There could be an actual bitwise change requiring an operation on some bizarre hardware using 1's complement or sign/magnitude representations. (Thanks Yuushi for highlighting this in comments).
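A tiny example (mine, not from the standard) of the quoted rule: converting -1 yields the largest unsigned value, and on 2's complement hardware the bit pattern is unchanged:
#include <climits>
static_assert((unsigned int)-1 == UINT_MAX, "value reduced modulo 2^N, N = width of unsigned int");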
That contrasts with x < 0, which - for a signed x about which the compiler has no special knowledge - does require a CPU/machine-code instruction to evaluate (if the result is used) and corresponding runtime. That comparison instruction tends to take 1 "cycle" on even older CPUs, but do keep in mind that modern CPU pipelines can execute many such instructions in parallel during a single cycle.
if( x < 0) vs if((((unsigned int)x)>>(((sizeof(int))<<3)-1))) - faster?
The first will always be at least as fast as the second. A comparison to zero is a bread-and-butter operation for the CPU, and the C++ compiler's certain to use an efficient opcode (machine code instruction) for it: you're wasting your time trying to improve on that.
The monster if((((unsigned int)x)>>(((sizeof(int))<<3)-1))) will be slower than the straightforward if(x < 0): both versions need to compare a value against zero, but your monster adds a shift before the comparison can take place.
To answer your actual edited question, it is unlikely to be faster. In the best case, if x is known at compile time, the compiler will be able to optimize out the branch completely in both cases.
If x is a run-time value, then the first will produce a single test instruction. The second will likely produce a shift-right immediate followed by a test instruction.
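A minimal sketch (my own example) of the two forms from the question, suitable for pasting into a compiler explorer; on x86-64 the plain comparison compiles to a single sign test, while the cast-and-shift version adds a shift (CHAR_BIT just makes the hard-coded bit width explicit):
#include <climits>
bool neg_compare(int x) { return x < 0; }   // single test of the sign
bool neg_shift(int x) {                     // shift the sign bit down to bit 0
    return ((unsigned int)x >> (sizeof(int) * CHAR_BIT - 1)) != 0;
}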

Integer vs floating division -> Who is responsible for providing the result?

I've been programming for a while in C++, but suddenly had a doubt and wanted to clarify with the Stackoverflow community.
When an integer is divided by another integer, we all know the result is an integer, and likewise, a float divided by a float is also a float.
But who is responsible for providing this result? Is it the compiler or DIV instruction?
That depends on whether or not your architecture has a DIV instruction. If your architecture has both integer and floating-point divide instructions, the compiler will emit the right instruction for the case specified by the code. The language standard specifies the rules for type promotion and whether integer or floating-point division should be used in each possible situation.
If you have only an integer divide instruction, or only a floating-point divide instruction, the compiler will inline some code or generate a call to a math support library to handle the division. Divide instructions are notoriously slow, so most compilers will try to optimize them out if at all possible (eg, replace with shift instructions, or precalculate the result for a division of compile-time constants).
Hardware divide instructions almost never include conversion between integer and floating point. If you get divide instructions at all (they are sometimes left out, because a divide circuit is large and complicated), they're practically certain to be "divide int by int, produce int" and "divide float by float, produce float". And it'll usually be that both inputs and the output are all the same size, too.
The compiler is responsible for building whatever operation was written in the source code, on top of these primitives. For instance, in C, if you divide a float by an int, the compiler will emit an int-to-float conversion and then a float divide.
(Wacky exceptions do exist. I don't know, but I wouldn't put it past the VAX to have had "divide float by int" type instructions. The Itanium didn't really have a divide instruction, but its "divide helper" was only for floating point, you had to fake integer divide on top of float divide!)
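A minimal sketch (my own example) of the compiler building the source-level operation on top of those primitives: the int operand is converted first, then the matching divide instruction is used:
int   int_div(int a, int b)     { return a / b; }  // integer divide (e.g. idiv on x86)
float mixed_div(float f, int i) { return f / i; }  // i converted to float, then a float divide (e.g. divss)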
The compiler will decide at compile time what form of division is required based on the types of the variables being used - at the end of the day a DIV (or FDIV) instruction of one form or another will get involved.
Your question doesn't really make sense. The DIV instruction doesn't do anything by itself. No matter how loud you shout at it, even if you try to bribe it, it doesn't take responsibility for anything.
When you program in a programming language [X], it is the sole responsibility of the [X] compiler to make a program that does what you described in the source code.
If a division is requested, the compiler decides how to make a division happen. That might happen by generating the opcode for the DIV instruction, if the CPU you're targeting has one. It might be by precomputing the division at compile-time and just inserting the result directly into the program (assuming both operands are known at compile-time), or it might be done by generating a sequence of instructions which together emulate a division.
But it is always up to the compiler. Your C++ program doesn't have any effect unless it is interpreted according to the C++ standard. If you interpret it as a plain text file, it doesn't do anything. If your compiler interprets it as a Java program, it is going to choke and reject it.
And the DIV instruction doesn't know anything about the C++ standard. A C++ compiler, on the other hand, is written with the sole purpose of understanding the C++ standard, and transforming code according to it.
The compiler is always responsible.
One of the most important rules in the C++ standard is the "as if" rule:
The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
Which in relation to your question means it doesn't matter what component does the division, as long as it gets done. It may be performed by a DIV machine code, it may be performed by more complicated code if there isn't an appropriate instruction for the processor in question.
It can also:
Replace the operation with a bit-shift operation if appropriate and likely to be faster (a short sketch follows this list).
Replace the operation with a literal if it is computable at compile time, or with a plain assignment if, e.g. when processing x / y, it can be shown at compile time that y will always be 1.
Replace the operation with an exception throw if it can be shown at compile time that it will always be an integer division by zero.
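A small sketch (my own examples) of divisions for which a compiler typically emits no divide instruction at all:
unsigned by_constant(unsigned x) { return x / 8; }   // strength-reduced to a shift: x >> 3
int      folded()                { return 100 / 4; } // folded at compile time to 25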
Practically
The C99 standard defines: "When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded." It adds in a footnote that "this is often called 'truncation toward zero.'"
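A small worked example (mine, not from the standard text) of that rule:
int a =  7 / 2;   //  3
int b = -7 / 2;   // -3: the fractional part is discarded (truncation toward zero), not floored to -4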
History
Historically, the language specification is responsible.
Pascal defines its operators so that using / for division always returns a real (even if you use it to divide 2 integers), and if you want to divide integers and get an integer result, you use the div operator instead. (Visual Basic has a similar distinction and uses the \ operator for integer division that returns an integer result.)
In C, it was decided that the same distinction should be made by casting one of the integer operands to a float if you wanted a floating point result. It's become convention to treat integer versus floating point types the way you describe in many C-derived languages. I suspect this convention may have originated in Fortran.