Integer vs floating division -> Who is responsible for providing the result?

I've been programming for a while in C++, but suddenly had a doubt and wanted to clarify it with the Stack Overflow community.
When an integer is divided by another integer, we all know the result is an integer, and likewise, a float divided by a float is also a float.
But who is responsible for providing this result? Is it the compiler or DIV instruction?

That depends on whether or not your architecture has a DIV instruction. If your architecture has both integer and floating-point divide instructions, the compiler will emit the right instruction for the case specified by the code. The language standard specifies the rules for type promotion and whether integer or floating-point division should be used in each possible situation.
If you have only an integer divide instruction, or only a floating-point divide instruction, the compiler will inline some code or generate a call to a math support library to handle the division. Divide instructions are notoriously slow, so most compilers will try to optimize them out if at all possible (e.g., replacing them with shift instructions, or precalculating the result when both operands are compile-time constants).
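Concretely, a minimal sketch (standard C++, nothing compiler-specific) of how the written types determine which kind of division the language requires; which instruction then gets emitted is up to the compiler:
#include <cstdio>

int main() {
    int   a = 7,    b = 2;
    float x = 7.0f, y = 2.0f;

    std::printf("%d\n", a / b);   // int / int     -> integer division, result 3
    std::printf("%f\n", x / y);   // float / float -> floating division, result 3.5
    std::printf("%f\n", a / y);   // int / float   -> 'a' is converted to float first,
                                  //                  then a floating divide is used
}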

Hardware divide instructions almost never include conversion between integer and floating point. If you get divide instructions at all (they are sometimes left out, because a divide circuit is large and complicated), they're practically certain to be "divide int by int, produce int" and "divide float by float, produce float". And it'll usually be that both inputs and the output are all the same size, too.
The compiler is responsible for building whatever operation was written in the source code, on top of these primitives. For instance, in C, if you divide a float by an int, the compiler will emit an int-to-float conversion and then a float divide.
(Wacky exceptions do exist. I don't know, but I wouldn't put it past the VAX to have had "divide float by int" type instructions. The Itanium didn't really have a divide instruction, but its "divide helper" was only for floating point; you had to fake integer divide on top of float divide!)
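As a hedged illustration of that point, here is roughly what a typical x86-64 compiler selects for each written form; the instruction names in the comments are the usual choices, but actual code generation varies with compiler, target, and flags:
int   int_div(int a, int b)      { return a / b; }  // integer divide (e.g. idiv)
float flt_div(float x, float y)  { return x / y; }  // float divide   (e.g. divss)
float mix_div(float x, int b)    { return x / b; }  // int-to-float conversion first
                                                    // (e.g. cvtsi2ss), then divss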

The compiler will decide at compile time what form of division is required based on the types of the variables being used - at the end of the day a DIV (or FDIV) instruction of one form or another will get involved.

Your question doesn't really make sense. The DIV instruction doesn't do anything by itself. No matter how loud you shout at it, even if you try to bribe it, it doesn't take responsibility for anything.
When you program in a programming language [X], it is the sole responsibility of the [X] compiler to make a program that does what you described in the source code.
If a division is requested, the compiler decides how to make a division happen. That might mean generating the opcode for the DIV instruction, if the CPU you're targeting has one. It might mean precomputing the division at compile time and inserting the result directly into the program (assuming both operands are known at compile time), or generating a sequence of instructions which together emulate a division.
But it is always up to the compiler. Your C++ program doesn't have any effect unless it is interpreted according to the C++ standard. If you interpret it as a plain text file, it doesn't do anything. If your compiler interprets it as a Java program, it is going to choke and reject it.
And the DIV instruction doesn't know anything about the C++ standard. A C++ compiler, on the other hand, is written with the sole purpose of understanding the C++ standard, and transforming code according to it.
The compiler is always responsible.

One of the most important rules in the C++ standard is the "as if" rule:
The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
Which, in relation to your question, means it doesn't matter what component does the division, as long as it gets done. It may be performed by a DIV machine instruction, or by more complicated code if there isn't an appropriate instruction for the processor in question.
It can also:
Replace the operation with a bit-shift operation if appropriate and likely to be faster.
Replace the operation with a literal if it is computable at compile time, or with a plain assignment if, e.g., when processing x / y it can be shown at compile time that y will always be 1.
Replace the operation with an exception throw if it can be shown at compile time that it will always be an integer division by zero.
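A short sketch of source forms that compilers commonly rewrite under the as-if rule (function names are just for illustration; whether the rewrites happen depends on the compiler and optimization level):
unsigned by_power_of_two(unsigned x) {
    return x / 8;     // typically compiled as a logical shift right by 3
}

int all_constant() {
    return 100 / 7;   // both operands known at compile time: folded to 14
}

int by_one(int x) {
    return x / 1;     // provably a no-op: usually reduced to just returning x
}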

Practically
The C99 standard specifies: "When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded." A footnote adds that "this is often called 'truncation toward zero.'"
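A small example of truncation toward zero; C99 (and C++ since C++11) also defines % so that (a/b)*b + a%b == a:
#include <cstdio>

int main() {
    std::printf("%d %d\n",  7 /  2,  7 %  2);   //  3  1
    std::printf("%d %d\n", -7 /  2, -7 %  2);   // -3 -1  (truncated toward zero, not -4)
    std::printf("%d %d\n",  7 / -2,  7 % -2);   // -3  1
}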
History
Historically, the language specification is responsible.
Pascal defines its operators so that using / for division always returns a real (even if you use it to divide 2 integers), and if you want to divide integers and get an integer result, you use the div operator instead. (Visual Basic has a similar distinction and uses the \ operator for integer division that returns an integer result.)
In C, it was decided that the same distinction should be made by casting one of the integer operands to a float if you wanted a floating point result. It's become convention to treat integer versus floating point types the way you describe in many C-derived languages. I suspect this convention may have originated in Fortran.

Related

How to express float constants precisely in source code

I have some C++11 code generated via a code generator that contains a large array of floats, and I want to make sure that the values in the compiled binary are precisely the same as the values produced by the generator (assuming that both rely on the same ISO/IEEE floating-point format).
So I figured the best way to do it is to store the values as hex representations and interpret them as float in the code.
Edit for Clarification: The code generator takes the float values and converts them to their corresponding hex representations. The target code is supposed to convert back to float.
It looks something like this:
const unsigned int data[3] = { 0x3d13f407U, 0x3ea27884U, 0xbe072dddU};
float const* ptr = reinterpret_cast<float const*>(&data[0]);
This works and gives me access to all the data elements as floats, but I recently stumbled upon the fact that this is actually undefined behavior and only works because my compiler resolves it the way I intended:
https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8
https://en.cppreference.com/w/cpp/language/reinterpret_cast.
The standard basically says that reinterpret_cast is not defined between POD pointers of different type.
So basically I have three options:
Use memcpy and hope that the compiler will be able to optimize this
Store the data not as hex-values but in a different way.
Use std::bit_cast from C++20.
I cannot use 3) because I'm stuck with C++11.
I don't have the resources to store the data array twice, so I would have to rely on the compiler to optimize this. Due to this, I don't particularly like 1) because it could stop working if I changed compilers or compiler settings.
So that leaves me with 2):
Is there a standardized way to express float values in source code so that they map to the exact float value when compiled? Does the ISO float standard define this in a way that guarantees that any compiler will follow the interpretation? I imagine if I deviate from the way the compiler expects, I could run the risk that the float "neighbor" of the number I actually want is used.
I would also take alternative ideas if there is an option 4 I forgot.
Use hexadecimal floating point literals. Assuming some endianness for the hexes you presented:
float floats[] = { 0x1.27e80ep-5, 0x1.44f108p-2, -0x1.0e5bbap-3 };
If you have the generated code produce the full representation of the floating-point value—all of the decimal digits needed to show its exact value—then a C++11 compiler is required to parse the number exactly.
C++11 draft N3092 2.14.4 1 says, of a floating literal:
… The exponent, if present, indicates the power of 10 by which the significant [likely typo, should be “significand”] part is to be scaled. If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner…
Thus, if the floating literal does not have all the digits needed to show the exact value, the implementation may round it either upward or downward, as the implementation defines. But if it does have all the digits, then the value represented by the floating literal is representable in the floating-point format, and so its value must be the result of the parsing.
I have read some very valuable information here and would like to throw in an option that does not strictly answer the question, but could be a solution.
It might be problematic, but if so, I would like to discuss it.
The simple solution would be: Leave it as it is.
A short rundown of why I am hesitant about the suggested options:
memcpy relies on the compiler to optimize away the actual copy and understand that I only want to read the values. Since I have large arrays of data, I want to avoid a surprise where a changed compiler setting suddenly introduces increased runtime and requires a fix on short notice.
bit_cast is only available from C++20. There are reference implementations but they basically use memcpy under the hood (see above).
hex float literals are only available from C++17
Directly writing the floats precisely... I don't know, it seems to be somewhat dangerous, because if I make a slight mistake I may end up with a data block that is slightly off and could have an impact on my classification results. A mistake like that would be a nightmare to spot.
So why do I think I can get away with an implementation that is, strictly speaking, undefined? The rationale is that the standard may not define it, but compiler manufacturers likely do; at least the ones I have worked with so far gave me exact results. The code has been running without major problems for a fairly long time, across dozens of code generator runs, and I would expect that a failed reinterpret_cast would break the conversion so severely that I would spot it in my classification results right away.
Still not robust enough though. So my idea was to write a unit test that contains a significant number of hex-floats, do the reinterpret_cast and compare to reference float values for exact correspondence to tell me if a setting or compiler failed in this regard.
I have one doubt though: Is the assumption somewhat reasonable that a failed reinterpret_cast would break things spectacularly, or are the bets totally off when it comes to undefined behavior?
I am a bit worried that the compiler implementation might define the undefined behavior in a way that picks a float that is close to the hex value instead of the precise one (although I would wonder why), and that this happens only sporadically, so that my unit test misses the problem.
So the endgame would be to unit test every single data entry against the corresponding reference float. Since the code is generated, I can generate the test as well. I think that should put all my worries to rest and make sure that I can get this to work across all possible compilers and compiler settings or be notified if anything breaks.
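To make that concrete, here is a minimal sketch (function and test names are made up) of a generated test that decodes the array both through the reinterpret_cast shortcut and through the well-defined memcpy route, and checks for bit-exact agreement, so a compiler or setting that breaks the shortcut is caught immediately:
#include <cstring>
#include <cassert>

const unsigned int data[3] = { 0x3d13f407U, 0x3ea27884U, 0xbe072dddU };

// Well-defined decoding: copy the bytes instead of reinterpreting the pointer.
// Compilers typically fold this into a plain load.
inline float bits_to_float(unsigned int bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Generated test: the reinterpret_cast shortcut used in production must yield
// exactly the same bit pattern as the memcpy route, entry by entry.
void test_reinterpret_matches_memcpy() {
    float const* shortcut = reinterpret_cast<float const*>(&data[0]);
    for (unsigned i = 0; i < 3; ++i) {
        unsigned int a, b;
        float via_memcpy = bits_to_float(data[i]);
        std::memcpy(&a, &shortcut[i], sizeof a);
        std::memcpy(&b, &via_memcpy, sizeof b);
        assert(a == b);   // exact, bit-for-bit correspondence
    }
}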

Do literals in C++ really evaluate?

It was always my understanding that l-values have to evaluate, but for kind of obvious and easily explained reasons. An identifier represents a region of storage, and the value is in that storage and must be retrieved. That makes sense. But a program needing to evaluate a literal (for example, the integer 21) doesn't quite make sense to me. The value is right there, how much more explicit can you get? Well, besides adding U for unsigned, or some other suffix. This is why I'm curious about literals needing to be evaluated, as I've only seen this mentioned in one place. Most books also switch up terminology, like "Primary Expression," "operand," or "subexpression" and the like, to the point where the lines begin to blur. In all this time I have yet to see a clear explanation for this particular thing. It seems like a waste of processing power.
An ordinary literal only needs to be evaluated during compilation, by the compiler.
A user-defined literal may also be evaluated at run time. For example, after including the <string> header and making its ...s literals available via the directive using namespace std::string_literals;, "Blah"s is a user-defined literal of type std::string. The "Blah" part is evaluated by the compiler, at compile time. The conversion to std::string, which involves dynamic allocation, necessarily happens at run time.
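A small sketch of the difference (the s suffix for std::string literals needs C++14):
#include <string>
using namespace std::string_literals;

int main() {
    int n = 21;         // ordinary literal: the value 21 is baked into the
                        // generated code at compile time
    auto s = "Blah"s;   // the char array "Blah" exists at compile time, but
                        // constructing the std::string (including any heap
                        // allocation) happens at run time
    (void)n; (void)s;
}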
But a program needing to evaluate a literal (for example, the integer 21) doesn't quite make sense to me. The value is right there, how much more explicit can you get?
Things are a little more complicated for floating-point types. Consider the number 0.1. In binary it cannot be represented exactly, and the closest floating-point representation must be selected for it. If you input that number during runtime, the conversion of 0.1 to the binary representation has to respect the current rounding mode (to nearest, upward, downward, or toward zero). Strict treatment of floating-point arithmetic suggests that conversion of the 0.1 floating-point literal to the binary representation should also be performed respecting the rounding mode (which only becomes known during runtime) and therefore cannot be done entirely by the compiler (actually the bigger part of it can be done by the compiler, but the final rounding has to be performed during runtime, taking the rounding mode into account).
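For example, 0.1 is stored as the nearest representable value; printing extra digits makes this visible (the output shown in the comment assumes IEEE 754 doubles):
#include <cstdio>

int main() {
    std::printf("%.20f\n", 0.1);   // prints 0.10000000000000000555...,
                                   // the double closest to 0.1, not 0.1 itself
}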

If two languages follow IEEE 754, will calculations in both languages result in the same answers?

I'm in the process of converting a program from Scilab code to C++. One loop in particular is producing a slightly different result than the original Scilab code (it's a long piece of code so I'm not going to include it in the question but I'll try my best to summarise the issue below).
The problem is, each step of the loop uses calculations from the previous step. Additionally, the difference between calculations only becomes apparent around the 100,000th iteration (out of approximately 300,000).
Note: I'm comparing the output of my C++ program with the outputs of Scilab 5.5.2 using the "format(25);" command. Meaning I'm comparing 25 significant digits. I'd also like to point out I understand how precision cannot be guaranteed after a certain number of bits but read the sections below before commenting. So far, all calculations have been identical up to 25 digits between the two languages.
In attempts to get to the bottom of this issue, so far I've tried:
Examining the data type being used:
I've managed to confirm that Scilab is using IEEE 754 doubles (according to the language documentation). Also, according to Wikipedia, C++ isn't required to use IEEE 754 for doubles, but from what I can tell, everywhere I use a double in C++ it has perfectly matched Scilab's results.
Examining the use of transcendental functions:
I've also read from What Every Computer Scientist Should Know About Floating-Point Arithmetic that IEEE does not require transcendental functions to be exactly rounded. With that in mind, I've compared the results of these functions (sin(), cos(), exp()) in both languages and again, the results appear to be the same (up to 25 digits).
The use of other functions and predefined values:
I repeated the above steps for the use of sqrt() and pow(). As well as the value of Pi (I'm using M_PI in C++ and %pi in Scilab). Again, the results were the same.
Lastly, I've rewritten the loop (very carefully) in order to ensure that the code is identical between the two languages.
Note: Interestingly, I noticed that for all the above calculations the two languages' results match each other to more digits than they match the true mathematical result (computed outside of floating-point arithmetic). For example:
Value of sin(x) using Wolfram Alpha = 0.123456789.....
Value of sin(x) using Scilab & C++ = 0.12345yyyyy.....
Even once the value computed using Scilab or C++ started to differ from the actual result (from Wolfram), each language's result still matched the other. This leads me to believe that most of the values are being calculated in the same way in both languages, even though IEEE 754 does not require them to be.
My original thinking was one of the first three points above are implemented differently between the two languages. But from what I can tell everything seems to produce identical results.
Is it possible that even though all the inputs to these loops are identical, the results can be different? Possibly because a very small error (past what I can see with 25 digits) is occurring that accumulates over time? If so, how can I go about fixing this issue?
No, sharing the same number format does not guarantee equivalent answers from functions in different languages.
Functions, such as sin(x), can be implemented in different ways, using the same language (as well as different languages). The sin(x) function is an excellent example. Many implementations will use a look-up table or look-up table with interpolation. This has speed advantages. However, some implementations may use a Taylor Series to evaluate the function. Some implementations may use polynomials to come up with a close approximation.
Having the same numeric format is one hurdle to solve between languages. Function implementation is another.
Remember, you need to consider the platform as well. A program that uses an 80-bit floating point processor will have different results than a program that uses a 64-bit floating point software implementation.
Some architectures provide the capability of using extended precision floating point registers (e.g. 80 bits internally, versus 64-bit values in RAM). So, it's possible to get slightly different results for the same calculation, depending on how the computations are structured, and the optimization level used to compile the code.
Yes, it's possible to get different results. It's possible even if you are using exactly the same source code in the same programming language for the same platform. Sometimes it's enough to have a different compiler switch; for example, -ffast-math would lead the compiler to optimize your code for speed rather than accuracy, and, if your computational problem is not well-conditioned to begin with, the result may be significantly different.
For example, suppose you have this code:
x_8th = x*x*x*x*x*x*x*x;
One way to compute this is to perform 7 multiplications. This would be the default behavior for most compilers. However, you may want to speed this up by specifying the compiler option -ffast-math, and the resulting code would have only 3 multiplications:
temp1 = x*x; temp2 = temp1*temp1; x_8th = temp2*temp2;
The result would be slightly different because finite precision arithmetic is not associative, but sufficiently close for most applications and much faster. However, if your computation is not well-conditioned that small error can quickly get amplified into a large one.
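A tiny demonstration of the non-associativity (the outputs in the comments assume IEEE 754 doubles and round-to-nearest):
#include <cstdio>

int main() {
    // Mathematically equal, but rounded differently depending on grouping.
    std::printf("%.17g\n", (0.1 + 0.2) + 0.3);   // typically 0.60000000000000009
    std::printf("%.17g\n", 0.1 + (0.2 + 0.3));   // typically 0.59999999999999998
}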
Note that it is possible that the Scilab and C++ are not using the exact same instruction sequence, or that one uses FPU and the other uses SSE, so there may not be a way to get them to be exactly the same.
As commented by IInspectable, if your compiler has _control87() or something similar, you can use it to change the precision and/or rounding settings. You could try combinations of this to see if it has any effect, but again, even if you manage to get the settings identical for Scilab and C++, differences in the actual instruction sequences may be the issue.
http://msdn.microsoft.com/en-us/library/e9b52ceh.aspx
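A hedged sketch of that route using MSVC's _controlfp_s (x86 only; the precision-control mask is documented as unsupported on x64 and does not affect SSE code):
#include <float.h>
#include <cstdio>

int main() {
    unsigned int cw = 0;
    _controlfp_s(&cw, 0, 0);                 // mask 0: just read the current settings
    std::printf("control word: %08x\n", cw);

    _controlfp_s(&cw, _PC_53, _MCW_PC);      // force 53-bit (double) x87 precision
    // ... run the computation you want to compare against Scilab here ...
    return 0;
}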
If SSE is used, I'm not sure what can be adjusted, as I don't think SSE has an 80-bit precision mode.
In the case of using the x87 FPU in 32-bit mode, and if your compiler doesn't have something like _control87, you could use assembly code. If inline assembly is not allowed, you would need to call an assembly function. This example is from an old test program:
static short fcw; /* 16-bit floating-point control word */
/* ... */
/* set precision control to extended precision */
__asm {
    fnstcw fcw
    or     fcw, 0300h
    fldcw  fcw
}

Compile-time vs runtime constants

I'm currently developing my own math lib to improve my C++ skills. I stumbled over Boost's constants header file, and I'm asking myself: what is the point of using compile-time constants over constants computed at run time?
const float root_two = 1.414213562373095048801688724209698078e+00;
const float root_two = std::sqrt( 2.0f );
Isn't there an error introduced when using the fixed compile-time constant, compared to computing the value at run time with a function?
Wouldn't that error then be avoided if you used runtime constants?
As HansPassant said, it may save you a micro-Watt. However, note that the compiler will sometimes optimize that away by evaluating the expression during compilation and substituting in the literal value. See this answer to my earlier question about this.
Isn't there an error introduced when using the fixed compile-time constant?
If you are using arbitrary-precision data types, perhaps. But it is more efficient to use plain data types like double, and these are limited to about 16 decimal digits of precision anyway.
Based on (2), your second initialization would not be more precise than your first one. In fact, if you precomputed the value of the square root with an arbitrary precision calculator, the literal may even be more precise.
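A quick way to convince yourself, assuming IEEE 754 floats and a correctly rounded sqrt (both are the norm on mainstream platforms):
#include <cmath>
#include <cstdio>

int main() {
    const float root_two_literal = 1.414213562373095048801688724209698078e+00f;
    const float root_two_runtime = std::sqrt(2.0f);
    // Both are the float closest to sqrt(2), so this normally prints "equal":
    std::printf("%s\n", root_two_literal == root_two_runtime ? "equal" : "different");
}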
A library such as Boost must work in tons of environments. The application that uses the library could have put the FPU in flush-to-zero mode, giving you 0.0 for denormalized (tiny) results.
Or the application could have been compiled with the -ffast-math flag, giving inaccurate results.
Furthermore, a runtime computation of (a + b + c) depends on how the compiler-generated code stores intermediate results. It might choose to pop (a + b) off the FPU as a 64-bit double, or it could leave it on the FPU stack as 80 bits. It depends on many things, including algebraic rewrites based on associativity.
All in all, if you mix different processors, operating systems, compilers and the different applications the library is used inside, you will get a different result for each permutation of the above.
In some (rare) situations this is not wanted; you may need an exact constant value.

How do division-by-zero exceptions work?

How is division calculated at the compiler/chip level?
And why does C++ always throw these exceptions at run-time instead of compile-time (in case the divisor is known to be zero at compile time)?
It depends. Some processors have a hardware divide instruction. Some processors have to do the calculation in software.
Some C++ compilers don't trap at runtime either. Often because there is no hardware support for trapping on divide by zero.
It totally depends on the compiler. If you want, you can write an extension for your compiler to check for this kind of problem.
For example, Visual C++:
Division by zero Compiler error
At the chip level, division is of course done with circuits. Here's an overview of binary division circuitry.
Because the C++ compiler just isn't checking for divisors that are guaranteed to equal 0. It could check for this.
Big lookup tables. Remember those multiplication tables from school? Same idea, but division instead of multiplication. Obviously not every single number is in there, but the number is broken up into chunks and then shoved through the table.
The division takes place at runtime, not at compile time. Yes, the compiler could see that the divisor is zero, but nobody is expected to deliberately write an invalid statement like that.
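One place where the compiler is required to catch it is a constant expression; a short sketch:
constexpr int ok  = 10 / 2;      // fine: evaluated and folded at compile time
// constexpr int bad = 10 / 0;   // ill-formed: division by zero in a constant
                                 // expression must be rejected by the compiler

int at_runtime(int divisor) {
    return 10 / divisor;         // divisor == 0 here is undefined behavior;
                                 // on x86 it usually shows up as a hardware trap
}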