Can integer division in C/C++ run into loss of precision issues? - c++

Suppose we have three integer (int, long, long long, unsigned int, etc) variables a, b, c. Normally, performing
c = a / b;
would truncate the fractional part. However, is it possible for c to end up with an incorrect value?
I am not asking about the case where a / b is out of range for c's type. Rather, I am asking how integer division is implemented in C. Does performing a / b first generate a floating-point intermediate result, which is then truncated?
If so, I wonder if loss of precision of the intermediate value can lead to an incorrect value of c. For example, suppose the precise value for a / b is 2, but somehow the intermediate result is 1.9999..., thus c will end up with an incorrect value of 1. Can such cases happen, or does integer division always result in a correct value if the expected value is in the range of c's type?

Does performing a / b first generate a float type intermediate result
As far as the language is concerned, there are no intermediate results.
does integer division always result in a correct value if the expected value is in the range of c's type?
Yes.

Section 6.5.5 of the C11 standard states:
When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded. If the quotient a/b is representable, the expression (a/b)*b + a%b shall equal a;
Which means there's no way, mathematically, that you'll get wrong results.
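The identity quoted from the standard can be checked directly. This is only a minimal sketch: the helper name division_identity_holds is made up for illustration, and it assumes b is nonzero and the quotient is representable (so not INT_MIN / -1).

```cpp
#include <cassert>

// Hypothetical helper: checks (a/b)*b + a%b == a for one pair of ints.
// Assumed preconditions: b != 0 and a/b is representable in int.
bool division_identity_holds(int a, int b) {
    assert(b != 0);
    int q = a / b;  // algebraic quotient with the fractional part discarded
    int r = a % b;  // remainder; has the sign of a in C99/C++11 and later
    return q * b + r == a;
}
```

For example, -7 / 2 yields -3 and -7 % 2 yields -1, and (-3) * 2 + (-1) is -7 again.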

Suppose we have three integer (int, long, long long, unsigned int, etc) variables a, b, c. Normally, performing
c = a / b;
would truncate the fractional part. However, is it possible for c to end up with an incorrect value? I am not asking about the case where a / b is out of range for c's type.
It should not be possible for, say, the last digit of the quotient to be wrong, provided all other rules are followed. C11 6.5.5p6:
When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded.
i.e. the result is not merely "close" to the algebraic value of a / b but exactly equal to it, with everything after the decimal point discarded.
That does not mean there are no gotchas: the quotient a / b may be mathematically in range for c's type yet out of range for the type in which the division itself is performed, and that can leave a wrong value in c.
Consider this example:
#include <stdio.h>
#include <inttypes.h>
int main(void) {
int32_t a = INT32_MIN;
int32_t b = -1;
int64_t c = a / b;
printf("%" PRId64, c);
}
The result of the division INT32_MIN / -1 is representable in c: it is INT32_MAX + 1, which is positive. However, on platforms where int is 32 bits the arithmetic happens in 32 bits, and this division produces an integer overflow, which makes the behaviour undefined. On my computer, compiling without optimizations produces a program that aborts. Compiling with optimizations enabled (-O3) makes the compiler resolve the calculation at compile time; it handles the overflow in a peculiar way and produces the result -2147483648, which is negative.
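One way to sidestep this particular overflow is to widen an operand before dividing, so that the division itself happens in 64 bits. A minimal sketch, with widened_div as a made-up helper name:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: divides two 32-bit values in 64-bit arithmetic, so
// INT32_MIN / -1 is computed as 2147483648 instead of overflowing.
std::int64_t widened_div(std::int32_t a, std::int32_t b) {
    assert(b != 0);  // division by zero is still undefined
    return static_cast<std::int64_t>(a) / b;  // b is converted to int64_t too
}
```

Casting only the result, as in (int64_t)(a / b), would not help, because the overflow would already have happened in the narrower division.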
Likewise, if you do this:
uint16_t a = 16;
int16_t b = -1;
int32_t result = a / b;
printf("%" PRId32 "\n", result);
the result on a machine with 32-bit int is -16, because both operands are promoted to int. If you change the type of a to uint32_t, the math happens in unsigned:
uint32_t a = 16;
int16_t b = -1;
int32_t result = a / b;
printf("%" PRId32 "\n", result);
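The result is of course 0, because b is converted to the unsigned value 4294967295. You would get 0 from the former calculation too on a machine with 16-bit int, where uint16_t is promoted to unsigned int.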
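The two cases can be put side by side as functions. This sketch assumes a 32-bit int (so uint32_t is unsigned int); both function names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// With a 32-bit int (assumed), uint16_t and int16_t are both promoted to int,
// so this is a plain signed division.
std::int32_t div_u16_by_i16(std::uint16_t a, std::int16_t b) {
    assert(b != 0);
    return static_cast<std::int32_t>(a / b);
}

// uint32_t is unsigned int here (assumed), so b is converted to unsigned
// before dividing: -1 becomes 4294967295.
std::int32_t div_u32_by_i16(std::uint32_t a, std::int16_t b) {
    assert(b != 0);
    return static_cast<std::int32_t>(a / b);
}
```

Under that assumption, div_u16_by_i16(16, -1) gives -16 while div_u32_by_i16(16, -1) gives 0.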

Related

When a 64bit int is cast to 64bit float in C/C++ and doesn't have an exact match, will it always land on a non-fractional number?

When int64_t is cast to double and doesn't have an exact match, to my knowledge I get a sort of best-effort-nearest-value equivalent in double. For example, 9223372036854775000 in int64_t appears to end up as 9223372036854774784.0 in double:
#include <stdio.h>
int main(int argc, const char **argv) {
printf("Corresponding double: %f\n", (double)9223372036854775000LL);
// Outputs: 9223372036854774784.000000
return 0;
}
It appears to me as if an int64_t cast to a double always ends up as a clean non-fractional number, even in this higher number range where double has really low precision. However, I just observed this from random attempts. Is this guaranteed to happen for any value of int64_t cast to a double?
And if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off? (Assuming it doesn't overflow during the conversion back.) Like here:
#include <inttypes.h>
#include <stdio.h>
int main(int argc, const char **argv) {
printf("Corresponding double: %f\n", (double)9223372036854775000LL);
// Outputs: 9223372036854774784.000000
printf("Corresponding int to corresponding double: %" PRId64 "\n",
(int64_t)((double)9223372036854775000LL));
// Outputs: 9223372036854774784
return 0;
}
Or can it be imprecise and get me the "wrong" int in some corner cases?
Intuitively and from my tests the answer to both points appears to be "yes", but if somebody with a good formal understanding of the floating point standards and the maths behind it could confirm this that would be really helpful to me. I would also be curious if any known more aggressive optimizations like gcc's -Ofast are known to break any of this.
In the general case, yes, both should be true. The floating-point base must be an integer (if not 2, then some other integer greater than 1). Given that, an integer converted to the nearest floating-point value can never acquire a nonzero fractional part: either the precision suffices, or the lowest-order integer digits, in the base of the floating type, are zeroed. For example, your system uses ISO/IEC/IEEE 60559 binary floating-point numbers. Inspected in base 2, the trailing digits of the value are indeed zeroed:
>>> bin(9223372036854775000)
'0b111111111111111111111111111111111111111111111111111110011011000'
>>> bin(9223372036854774784)
'0b111111111111111111111111111111111111111111111111111110000000000'
The conversion of a double without fractions to an integer type, given that the value of the double falls within the range of the integer type should be exact...
Though you still might encounter a quality-of-implementation issue, or an outright bug - for example MSVC currently has a compiler bug where a round-trip conversion of unsigned 32-bit value with MSB set (or just double value between 2³¹ and 2³²-1 converted to unsigned int) would "overflow" in the conversion and always result in exactly 2³¹.
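Both claims can be spot-checked in code. A minimal sketch, assuming double is IEEE 754 binary64 with round-to-nearest; the helper names are invented for this example:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: true if converting v to double yields a whole number.
bool converts_to_whole_number(std::int64_t v) {
    double d = static_cast<double>(v);
    return d == std::floor(d);
}

// Hypothetical helper: converts to double and back. Assumes the intermediate
// double is below 2^63; otherwise the conversion back is not defined.
std::int64_t round_trip(std::int64_t v) {
    return static_cast<std::int64_t>(static_cast<double>(v));
}
```

With the value from the question, round_trip(9223372036854775000LL) gives 9223372036854774784, the same integer that the %f output shows.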
The following assumes the value being converted is positive. The behavior of negative numbers is analogous.
C 2018 6.3.1.4 2 specifies conversions from integer to real and says:
… If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
This tells us that some integer value x being converted to floating-point can produce a non-integer only if one of the two representable values bounding x is not an integer and x is not representable.
5.2.4.2.2 specifies the model used for floating-point numbers. Each finite floating-point number is represented by a sequence of digits in a certain base b scaled by b^e for some exponent e. (b is an integer greater than 1.) Then, if one of the two values bounding x, say p, is not an integer, the scaling must be such that the lowest digit in that floating-point number represents a fraction. But if this is the case, then setting all of the digits in p that represent fractions to 0 must produce a new floating-point number that is an integer. If x < p, this integer must be x, and therefore x is representable in the floating-point format. On the other hand, if p < x, we can add enough to each digit that represents a fraction to make it 0 (and produce a carry to the next higher digit). This will also produce an integer representable in the floating-point type (see footnote 1), and it must be x.
Therefore, if conversion of an integer x to the floating-point type would produce a non-integer, x must be representable in the type. But then conversion to the floating-point type must produce x. So it is never possible to produce a non-integer.
Footnote
1 It is possible this will carry out of all the digits, as when applying it to a three-digit decimal number 9.99, which produces 10.00. In this case, the value produced is the next power of b, if it is in range of the floating-point format. If it is not, the C standard does not define the behavior. Also note the C standard sets minimum requirements on the range that floating-point formats must support which preclude any format from not being able to represent 1, which avoids a degenerate case in which a conversion could produce a number like .999 because it was the largest representable finite value.
When a 64bit int is cast to 64bit float ... and doesn't have an exact match, will it always land on a non-fractional number?
Is this guaranteed to happen for any value of int64_t cast to a double?
For common double: yes, it always lands on a non-fractional number.
When there is no exact match, the result is the closest representable floating-point value above or below, depending on the rounding mode. Given the characteristics of common double, these two bounding values are also whole numbers: any int64_t value too large to be represented exactly lies in a range where every representable double is a whole number.
... if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off?
No. Edge cases near INT64_MAX fail, as the converted value could become a FP value above INT64_MAX. Then conversion back to the integer type is undefined: "If the value of the integral part cannot be represented by the integer type, the behavior is undefined." C17dr § 6.3.1.4 1
#include <limits.h>
#include <stdio.h>
int main(void) {
long long imaxm1 = LLONG_MAX - 1;
double max = (double) imaxm1;
printf("%lld\n%f\n", imaxm1, max);
long long imax = (long long) max;
printf("%lld\n", imax);
}
9223372036854775806
9223372036854775808.000000
9223372036854775807 // Not guaranteed: the value is out of range, so this is just what one implementation printed.
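The edge case above can be avoided by checking the range before converting back. A minimal sketch, assuming double is IEEE 754 binary64 (where -2^63 and 2^63 are both exactly representable); fits_in_int64 is a made-up name:

```cpp
#include <cstdint>

// Hypothetical helper: true if d, truncated toward zero, is representable
// as int64_t. Returns false for NaN as well.
bool fits_in_int64(double d) {
    const double lower = -9223372036854775808.0;  // -2^63, still a valid result
    const double upper = 9223372036854775808.0;   //  2^63, first invalid value
    return d >= lower && d < upper;
}
```

A caller would convert with static_cast<std::int64_t>(d) only when the check passes. For the example above, (double)(LLONG_MAX - 1) is exactly 2^63, so the check fails.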
Deeper exceptions
(Question variation) When an N bit integer type is cast to a floating point and doesn't have an exact match, will it always land on a non-fractional number?
Integer type range exceeds the finite floating-point range
Conversion to infinity: With common float, and uint128_t, UINT128_MAX converts to infinity. This is readily possible with extra wide integer types.
#include <stdio.h>
int main(void) {
unsigned __int128 imaxm1 = 0xFFFFFFFFFFFFFFFF;
imaxm1 <<= 64;
imaxm1 |= 0xFFFFFFFFFFFFFFFF;
double fmax = (float) imaxm1;
double max = (double) imaxm1;
printf("%llde27\n%f\n%f\n", (long long) (imaxm1/1000000000/1000000000/1000000000),
fmax, max);
}
340282366920e27
inf
340282366920938463463374607431768211456.000000
Floating-point precision exceeds range
On some unicorn implementation with very wide FP precision and a small range, the largest finite value could, in theory though not in practice, be a non-whole number. Then, with an even wider integer type, the conversion could result in this non-whole value. I do not see this as a legitimate concern for OP.

c++ safeness of code with implicit conversion between signed and unsigned

According to the rules on implicit conversions between signed and unsigned integer types, when summing an unsigned int with an int, the signed int is first converted to an unsigned int.
Consider, e.g., the following minimal program
#include <iostream>
int main()
{
unsigned int n = 2;
int x = -1;
std::cout << n + x << std::endl;
return 0;
}
The output of the program is, nevertheless, 1 as expected: x is converted first to an unsigned int, and the sum with n leads to an integer overflow, giving the "right" answer.
In a code like the previous one, if I know for sure that n + x is positive, can I assume that the sum of unsigned int n and int x gives the expected value?
In a code like the previous one, if I know for sure that n + x is positive, can I assume that the sum of unsigned int n and int x gives the expected value?
Yes.
First, the signed value is converted to unsigned, using modulo arithmetic:
If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type).
Then two unsigned values will be added using modulo arithmetic:
Unsigned integers shall obey the laws of arithmetic modulo 2^n where n is the number of bits in the value representation of that particular size of integer.
This means that you'll get the expected answer.
Even if the result would be negative in the mathematical sense, the result in C++ would be a number congruent to that negative number modulo 2^n.
Note that I've supposed here that you add two same-sized integers.
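The two modulo steps described above can be seen in a one-line function. This is only a sketch; add_mixed is a made-up name.

```cpp
#include <climits>

// Hypothetical helper: x is converted to unsigned int (modulo 2^n) before the
// addition, and the unsigned addition itself is also performed modulo 2^n.
unsigned int add_mixed(unsigned int n, int x) {
    return n + x;
}
```

add_mixed(2u, -1) yields 1u as expected. When the mathematical sum is negative, as in add_mixed(0u, -1), the result is the congruent value UINT_MAX.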
I think you can be sure and it is not implementation defined, although this statement requires some interpretations of the standard when it comes to systems that do not use two's complement for representing negative values.
First, let's state the things that are clear: unsigned integrals do not overflow but take on a modulo 2^nrOfBits-value (cf this online C++ standard draft):
6.7.1 Fundamental types
(7) Unsigned integers shall obey the laws of arithmetic modulo 2^n
where n is the number of bits in the value representation of that
particular size of integer.
So it's just a matter of whether a negative value nv is converted into an unsigned bit pattern nv(conv) such that x + nv(conv) is always the same as x - |nv|. For a system using two's complement, things are clear, since two's complement is designed so that this arithmetic works directly.
For systems using other representations of negative values, we'll have to read the standard carefully:
7.8 Integral conversions
(2) If the destination type is unsigned, the resulting value is the
least unsigned integer congruent to the source integer (modulo 2^n
where n is the number of bits used to represent the unsigned type). [
Note: In a two’s complement representation, this conversion is
conceptual and there is no change in the bit pattern (if there is
no truncation). - end note]
Since the note explicitly says that in a two's complement representation there is no change in the bit pattern, we may assume that on systems other than two's complement a real conversion takes place such that x + nv(conv) == x - |nv|.
So due to 7.8 (2), I'd say that your assumption is valid.

Subtracting unsigned long longs with signed long long result?

Suppose I have these two types:
typedef unsigned long long uint64;
typedef signed long long sint64;
And I have these variables:
uint64 a = ...;
uint64 b = ...;
sint64 c;
I want to subtract b from a and assign the result to c. Clearly, if the absolute value of the difference is greater than 2^63 then it will wrap (or be undefined), which is ok. But for cases where the absolute difference is less than 2^63 I want the result to be correct.
Of the following three ways:
c = a - b; // sign conversion warning ignored
c = sint64(a - b);
c = sint64(a) - sint64(b);
Which of them are guaranteed to work by the standard? (and why/how?)
None of the three work. The first fails if the difference is negative (no matter the absolute value), the second is the same as the first, and the third fails if either operand is too large.
It's impossible to implement without a branch.
c = b < a ? static_cast<sint64>(a - b) : -static_cast<sint64>(b - a);
Fundamentally, unsigned types use modulo arithmetic without any kind of sign bit. They don't know they wrapped around, and the language spec doesn't identify wraparound with negative numbers. Also, assigning a value outside the range of a signed integral variable results in an implementation-defined, potentially nonsense result (integral overflow).
Consider a machine with no hardware to convert between native negative integers and two's complement. It can perform two's complement subtraction using bitwise negation and native two's complement addition, though. (Bizarre, maybe, but that is what C and C++ currently require.) The language leaves it up to the programmer, then, to convert the negative values. The only way to do that is to negate a positive value, which requires that the computed difference be positive. So…
The best solution is to avoid any attempt to represent a negative number as a large positive number in the first place.
EDIT: I forgot the cast before, which would have produced a large unsigned value, equivalently to the other solutions!
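Wrapped in a function, the branch-based approach looks roughly like this. It is a sketch under the question's stated assumption that the absolute difference is less than 2^63; signed_difference is a made-up name. Both branches are cast so that the conditional expression itself has type sint64.

```cpp
typedef unsigned long long uint64;
typedef signed long long sint64;

// Hypothetical helper: computes a - b as a signed value.
// Assumes |a - b| < 2^63, so the unsigned difference always fits in sint64
// before it is (possibly) negated.
sint64 signed_difference(uint64 a, uint64 b) {
    return b < a ? static_cast<sint64>(a - b)
                 : -static_cast<sint64>(b - a);
}
```

The subtraction is always done with the larger operand first, so the unsigned arithmetic never wraps and only a non-negative value is ever converted to the signed type.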
Potatoswatter's answer is probably the most pragmatic solution, but "impossible to implement without a branch" is like a red rag to a bull for me. If your hypothetical system implements undefined overflow/cast operations like that, my hypothetical system implements branches by killing puppies.
So I'm not completely familiar with what the standard(s) would say, but how about this:
sint64 c,d,r;
c = a >> 1;
d = b >> 1;
r = (c-d) * 2;
c = a & 1;
d = b & 1;
r += c - d;
I've written it in a fairly verbose fashion so the individual operations are clear, but have left some implicit casts. Is anything there undefined?
Steve Jessop rightly points out that this does fail in the case where the difference is exactly 2^63-1, as the multiply overflows before the 1 is subtracted.
So here's an even uglier version which should cover all underflow/overflow conditions:
sint64 c,d,r,ov;
c = a >> 1;
d = b >> 1;
ov = a >> 63;
r = (c-d-ov) * 2;
c = a & 1;
d = b & 1;
r += ov + ov + c - d;
if the absolute value of the difference is greater than 2^63 then it
will wrap (or be undefined) which is ok. But for cases where the
absolute difference is less than 2^63 I want the result to be correct.
Then all three of the notations you suggest work, assuming a conventional architecture. The notable difference
is that the third one sint64(a) - sint64(b) invokes undefined behavior
when the difference is not representable, whereas the first two are
guaranteed to wrap around (unsigned arithmetic overflow is guaranteed to wrap around and conversion from unsigned to signed is implementation-defined, whereas signed arithmetic overflow is undefined).

Curious arithmetic error- 255x256x256x256=18446744073692774400

I encountered a strange thing when I was programming under c++. It's about a simple multiplication.
Code:
unsigned __int64 a1 = 255*256*256*256;
unsigned __int64 a2= 255 << 24; // same as the above
cerr()<<"a1 is:"<<a1;
cerr()<<"a2 is:"<<a2;
interestingly the result is:
a1 is: 18446744073692774400
a2 is: 18446744073692774400
whereas it should be:(using calculator confirms)
4278190080
Can anybody tell me how could it be possible?
255*256*256*256
All operands are int, so you are overflowing int. The overflow of a signed integer is undefined behavior in C and C++.
EDIT:
note that the expression 255 << 24 in your second declaration also invokes undefined behavior if your int type is 32-bit. 255 x (2^24) is 4278190080 which cannot be represented in a 32-bit int (the maximum value is usually 2147483647 on a 32-bit int in two's complement representation).
C and C++ both say for E1 << E2 that if E1 is of a signed type and positive and that E1 x (2^E2) cannot be represented in the type of E1, the program invokes undefined behavior. Here ^ is the mathematical power operator.
Your literals are int. This means that all the operations are actually performed on int, and promptly overflow. This overflowed value, when converted to an unsigned 64bit int, is the value you observe.
It is perhaps worth explaining what happened to produce the number 18446744073692774400. Technically speaking, the expressions you wrote trigger "undefined behavior" and so the compiler could have produced anything as the result; however, assuming int is a 32-bit type, which it almost always is nowadays, you'll get the same "wrong" answer if you write
uint64_t x = (int) (255u*256u*256u*256u);
and that expression does not trigger undefined behavior. (The conversion from unsigned int to int involves implementation-defined behavior, but as nobody has produced a ones-complement or sign-and-magnitude CPU in many years, all implementations you are likely to encounter define it exactly the same way.) I have written the cast in C style because everything I'm saying here applies equally to C and C++.
First off, let's look at the multiplication. I'm writing the right hand side in hex because it's easier to see what's going on that way.
255u * 256u = 0x0000FF00u
255u * 256u * 256u = 0x00FF0000u
255u * 256u * 256u * 256u = 0xFF000000u (= 4278190080)
That last result, 0xFF000000u, has the highest bit of a 32-bit number set. Casting that value to a signed 32-bit type therefore causes it to become negative as-if 2^32 had been subtracted from it (that's the implementation-defined operation I mentioned above).
(int) (255u*256u*256u*256u) = 0xFF000000 = -16777216
I write the hexadecimal number there, sans u suffix, to emphasize that the bit pattern of the value does not change when you convert it to a signed type; it is only reinterpreted.
Now, when you assign -16777216 to a uint64_t variable, it is back-converted to unsigned as-if by adding 2^64. (Unlike the unsigned-to-signed conversion, this semantic is prescribed by the standard.) This does change the bit pattern, setting all of the high 32 bits of the number to 1 instead of 0 as you had expected:
(uint64_t) (int) (255u*256u*256u*256u) = 0xFFFFFFFFFF000000u
And if you write 0xFFFFFFFFFF000000 in decimal, you get 18446744073692774400.
As a closing piece of advice, whenever you get an "impossible" integer from C or C++, try printing it out in hexadecimal; it's much easier to see oddities of twos-complement fixed-width arithmetic that way.
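The walk-through above can be condensed into two small functions, one reproducing the surprising value and one computing the intended product. This is a sketch that assumes a 32-bit int; both function names are invented here.

```cpp
#include <cstdint>

// Reproduces the observed value without signed overflow. Assumes a 32-bit int;
// the unsigned-to-int conversion is implementation-defined before C++20.
std::uint64_t sign_extended_product() {
    unsigned int product = 255u * 256u * 256u * 256u;  // 0xFF000000
    int as_int = static_cast<int>(product);            // -16777216
    return static_cast<std::uint64_t>(as_int);         // 0xFFFFFFFFFF000000
}

// Doing the multiplication in a 64-bit unsigned type gives the intended value.
std::uint64_t intended_product() {
    return 255ULL * 256 * 256 * 256;  // 4278190080
}
```

Giving the first literal the ULL suffix is enough, because each multiplication then converts its other operand to unsigned long long.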
The answer is simple -- overflowed.
Here the overflow occurred in int, and when the result is assigned to an unsigned 64-bit integer it is converted to 18446744073692774400 instead of 4278190080.

Which variables should I typecast when doing math operations in C/C++?

For example, when I'm dividing two ints and want a float returned, I superstitiously write something like this:
int a = 2, b = 3;
float c = (float)a / (float)b;
If I do not cast a and b to floats, it'll do integer division and return an int.
Similarly, if I want to multiply a signed 8-bit number with an unsigned 8-bit number, I will cast them to signed 16-bit numbers before multiplying for fear of overflow:
u8 a = 255;
s8 b = -127;
s16 c = (s16)a * (s16)b;
How exactly does the compiler behave in these situations when not casting at all or when only casting one of the variables? Do I really need to explicitly cast all of the variables, or just the one on the left, or the one on the right?
Question 1: Float division
int a = 2, b = 3;
float c = static_cast<float>(a) / b; // need to convert 1 operand to a float
Question 2: How the compiler works
Five rules of thumb to remember:
Arithmetic operations are always performed on values of the same type.
The result type is the same as the operands (after promotion)
The smallest type arithmetic operations are performed on is int.
ANSI C (and thus C++) uses value-preserving integer promotion.
Each operation is done in isolation.
The ANSI C rules are as follows (most of them also apply to C++, though not all types are officially supported yet):
If either operand is a long double the other is converted to a long double.
If either operand is a double the other is converted to a double.
If either operand is a float the other is converted to a float.
If either operand is an unsigned long long the other is converted to unsigned long long.
If either operand is a long long the other is converted to long long.
If either operand is an unsigned long the other is converted to unsigned long.
If either operand is a long the other is converted to long.
If either operand is an unsigned int the other is converted to unsigned int.
Otherwise both operands are converted to int.
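For the 8-bit multiplication in the question, the last rule is the one that applies: both operands are promoted to int before the multiplication. A minimal sketch, with multiply_u8_s8 as a made-up name, that lets the compiler confirm the result type:

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper for the 8-bit case: both operands are promoted to int
// before the multiplication, so the multiply itself needs no casts.
int multiply_u8_s8(std::uint8_t a, std::int8_t b) {
    static_assert(std::is_same<decltype(a * b), int>::value,
                  "uint8_t * int8_t is computed as int");
    return a * b;
}
```

With a = 255 and b = -127 the product is -32385, which still fits in a signed 16-bit variable if it is stored in one afterwards.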
Overflow
Overflow is always a problem. Note that the type of the result is the same as the type of the (converted) operands, so any of the operations can overflow. So yes, you do need to worry about it, though the language does not provide any explicit way to catch it happening.
As a side note:
Unsigned division can not overflow but signed division can.
std::numeric_limits<int>::max() / -1 // No Overflow
std::numeric_limits<int>::min() / -1 // Will Overflow
In general, if operands are of different types, the compiler will promote all to the largest or most precise type:
If one number is...   And the other is...   The compiler will promote to...
-------------------   -------------------   -------------------------------
char                  int                   int
signed                unsigned              unsigned
char or int           float                 float
float                 double                double
Examples:
char + int ==> int
signed int + unsigned int ==> unsigned int
float + int ==> float
Beware, though, that promotion occurs only as required for each intermediate calculation, so:
4.0 + 5/3 = 4.0 + 1 = 5.0
This is because the integer division is performed first, then the result is converted to double for the addition.
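The per-operation behaviour is easy to confirm in code. A small sketch (both function names are invented for this example):

```cpp
// 5 / 3 is an int division (result 1); only then is 1 converted to double.
double integer_division_first() {
    return 4.0 + 5 / 3;
}

// Making one operand of the division a double changes the division itself.
double floating_division() {
    return 4.0 + 5.0 / 3;
}
```

The first function returns exactly 5.0, while the second returns roughly 5.67.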
You can just cast one of them. It doesn't matter which one though.
Whenever the types don't match, the "smaller" type is automatically promoted to the "larger" type, with floating point being "larger" than integer types.
Division of integers: cast any one of the operands, no need to cast them both. If both operands are integers the division operation is an integer division, otherwise it is a floating-point division.
As for the overflow question, an implicit conversion only helps if it happens before the operation, so give one operand the wider or unsigned type:
#include <iostream>
#include <limits>
using namespace std;
int main()
{
signed int a = numeric_limits<signed int>::max();
unsigned int b = a + 1u; // a is converted to unsigned before the addition, so no signed overflow
cout << a << ' ' << b << endl;
return 0;
}
In the case of the floating-point division, as long as one variable is of a floating-point datatype (float or double), then the other variable should be widened to a floating-point type, and floating-point division should occur; so there's no need to cast both to a float.
Having said that, I always cast both to a float, anyway.
I think as long as you are casting just one of the two variables the compiler will behave properly (At least on the compilers that I know).
So all of:
float c = (float)a / b;
float c = a / (float)b;
float c = (float)a / (float)b;
will have the same result.
Then there are older brain-damaged types like me who, having to use old-fashioned languages, just unthinkingly write stuff like
int a;
int b;
float z;
z = a*1.0*b;
Of course this isn't universal, good only for pretty much just this case.
Having worked on safety-critical systems, I tend to be paranoid and always cast both factors: float(a)/float(b) - just in case some subtle gotcha is planning to bite me later. No matter how good the compiler is said to be, no matter how well-defined the details are in the official language specs. Paranoia: a programmer's best friend!
Do you need to cast one or two sides? The answer isn't dictated by the compiler; the compiler has to know the exact, precise rules anyway. Instead, the answer should be dictated by the person who will read the code later. For that reason alone, cast both sides to the same type. Implicit truncation might be visible enough that the cast for it could be redundant.
e.g. this cast float->int is obvious.
int a = float(foo()) * float(c);