Loss of precision for int to float conversion - c++

In C++, the conversion of an integer value of type I to a floating point type F will be exact — that is, static_cast<I>(static_cast<F>(i)) == i — if the range of I is a subset of the range of integral values exactly representable in F.
Is it possible, and if so how, to calculate the loss of precision of static_cast<F>(i) (without using another floating point type with a wider range)?
As a start, I tried to code a function that would return whether a conversion is safe or not (safe meaning no loss of precision), but I must admit I am not so sure about its correctness.
template <class F, class I>
bool is_cast_safe(I value)
{
    return std::abs(value) < std::numeric_limits<F>::digits;
}
std::cout << is_cast_safe<float>(4) << std::endl; // true
std::cout << is_cast_safe<float>(0x1000001) << std::endl; // false
Thanks in advance.

is_cast_safe can be implemented with:
static const F One = 1;
F ULP = std::scalbn(One, std::ilogb(value) - std::numeric_limits<F>::digits + 1);
I U = std::max(ULP, One);
return value % U == 0;
This sets ULP to the value of the least digit position in the result of converting value to F. ilogb returns the position (as an exponent of the floating-point radix) for the highest digit position, and subtracting one less than the number of digits adjusts to the lowest digit position. Then scalbn gives us the value of that position, which is the ULP.
Then value can be represented exactly in F if and only if it is a multiple of the ULP. To test that, we convert the ULP to I (but substitute 1 if it is less than 1), and then test whether the remainder of value divided by the ULP (or 1) is zero.
Also, if one is concerned the conversion to F might overflow, code can be inserted to handle this as well.
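Putting the pieces together, a minimal sketch of the complete function could look like this. It assumes value is non-zero (ilogb(0) needs special handling) and that converting value to F does not overflow; wrapping it into the questioner's template is just for illustration:
#include <algorithm>
#include <cmath>
#include <limits>

template <class F, class I>
bool is_cast_safe(I value)
{
    if (value == 0)
        return true;                 // zero is always exactly representable
    static const F One = 1;
    F ULP = std::scalbn(One, std::ilogb(value) - std::numeric_limits<F>::digits + 1);
    I U = std::max(ULP, One);        // value of the lowest digit position, at least 1
    return value % U == 0;           // exact iff value is a multiple of that ULP
}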
Calculating the actual amount of the change is trickier. The conversion to floating-point could round up or down, and the rule for choosing is implementation-defined, although round-to-nearest-ties-to-even is common. So the actual change cannot be calculated from the floating-point properties we are given in numeric_limits. It must involve performing the conversion and doing some work in floating-point. This definitely can be done, but it is a nuisance. I think an approach that should work is:
Assume value is non-negative. (Negative values can be handled similarly but are omitted for now for simplicity.)
First, test for overflow in conversion to F. This in itself is tricky, as the behavior is undefined if the value is too large. Some similar considerations were addressed in this answer to a question about safely converting from floating-point to integer (in C).
If the value does not overflow, then convert it. Let the result be x. Divide x by the floating-point radix r, producing y. If y is not an integer (which can be tested using fmod or trunc) the conversion was exact.
Otherwise, convert y to I, producing z. This is safe because y is less than the original value, so it must fit in I.
Then the error introduced by the conversion (the converted result minus the original value) is (z - value/r)*r - value%r, where / and % are integer division and remainder.
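A minimal sketch of these steps, assuming value is non-negative and that converting it to F does not overflow (the overflow test described above is omitted); the name conversion_error and the convention of returning the converted result minus the original value are just for illustration:
#include <cmath>
#include <limits>

template <class F, class I>
I conversion_error(I value)
{
    const int r = std::numeric_limits<F>::radix;   // floating-point radix
    F x = static_cast<F>(value);                   // the possibly rounded conversion
    F y = x / r;
    if (std::trunc(y) != y)
        return 0;                                  // y is not an integer, so the conversion was exact
    I z = static_cast<I>(y);                       // safe: y is smaller than the original value
    return (z - value / r) * r - value % r;        // converted result minus original value
}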

I loss = abs(static_cast<I>(static_cast<F>(i))-i) should do the job. The only exception is when i's magnitude is so large that static_cast<F>(i) produces a value outside the range of I.
(I assume here that an abs overload taking and returning I is available.)
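As a sketch, with that caveat (the cast back to I is undefined if static_cast<F>(i) ends up outside I's range), and assuming a std::abs overload exists for I:
#include <cstdlib>

template <class F, class I>
I roundtrip_loss(I i)
{
    return std::abs(static_cast<I>(static_cast<F>(i)) - i);  // 0 when the conversion is exact
}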

Related

Can multiplying a pair of almost-one values ever yield a result of 1.0?

I have two floating point values, a and b. I can guarantee they are values in the domain (0, 1). Is there any circumstance where a * b could equal one? I intend to calculate 1/(1 - a * b), and wish to avoid a divide by zero.
My instinct is that it cannot, because the result should be equal to or smaller than a or b. But instincts are a poor replacement for understanding the correct behavior.
I do not get to specify the rounding mode, so if there's a rounding mode where I could get into trouble, I want to know about it.
Edit: I did not specify whether the compiler was IEEE compliant or not because I cannot guarantee that the compiler/CPU running my software will indeed be IEEE compliant.
I have two floating point values, a and b…
Since this says we have “values,” not “variables,” it admits a possibility that 1 - a*b may evaluate to 1. When writing about software, people sometimes use names as placeholders for more complicated expressions. For example, one might have an expression a that is sin(x)/x and an expression b that is 1-y*y and then ask about computing 1 - a*b when the code is actually 1 - (sin(x)/x)*(1-y*y). This would be a problem because C++ allows extra precision to be used when evaluating floating-point expressions.
The most common instances of this are that the compiler uses long double arithmetic while computing expressions containing double operands, or that it uses a fused multiply-add instruction while computing an expression of the form x + y*z.
Suppose expressions a and b have been computed with excess precision and are positive values less than 1 in that excess precision. E.g., for illustration, suppose double were implemented with four decimal digits but a and b were computed with long double with six decimal digits. a and b could both be .999999. Then a*b is .999998000001 before rounding, .999998 after rounding to six digits. Now suppose that at this point in the computation, the compiler converts from long double to double, perhaps because it decides to store this intermediate value on the stack temporarily while it computes some other things from nearby expressions. Converting it to four-digit double produces 1.000, because that is the four-decimal-digit number nearest .999998. When the compiler later loads this from the stack and continues evaluation, we have 1 - 1.000, and the result is zero.
On the other hand, if a and b are variables, I expect your expression is safe. When a value is assigned to a variable or is converted with a cast operation, the C++ standard requires it to be converted to the nominal type; the result must be a value in the nominal type, without any “extra precision.” Then, given 0 < a < 1 and 0 < b < 1, the mathematical value of a•b (that is, the value without floating-point rounding) is less than a and is less than b. Then rounding of a•b to the nominal type cannot produce a value greater than a or b with any IEEE-754 rounding method, so it cannot produce 1. (The only requirement here is that the rounding method never skip over values—it might be constrained to round in a particular direction, upward or downward or toward zero or whatever, but it never goes past a representable value in that direction to get to a value farther away from the unrounded result. Since we know a•b is bounded above by both a and b, rounding cannot produce any result greater than the lesser of a and b.)
Formally, the C++ standard does not impose any requirements on the accuracy of floating-point results. So a C++ implementation could use a bonkers rounding mode that produced 3.14 for .9*.9. Aside from implementations flushing subnormals to zero, I am not aware of any C++ implementations that do not obey the requirement above. Flushing subnormals to zero will not affect calculations in 1 - a*b when a and b are near 1. (In a perverse floating-point format, with an exponent range narrower than the significand and no subnormal values, .9999 could be representable while .0001 is not because the exponent required for it is out of range. Then 1-.9999*.9999, which would produce .0002 in normal four-digit arithmetic, would produce 0 due to underflow. No such formats are in normal hardware.)
So, if a and b are variables, 0 < a < 1 and 0 < b < 1, and your C++ implementation is reasonable (may use extra precision, may flush subnormals, does not use perverse floating-point formats or rounding), then 1 - a*b does not evaluate to zero.
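As a quick sanity check of the worst case, here is a small program, assuming IEEE-754 double and round-to-nearest, that uses the largest double below 1 for both operands:
#include <cassert>
#include <cmath>
#include <cstdio>

int main()
{
    double a = std::nextafter(1.0, 0.0);  // largest double below 1
    double b = std::nextafter(1.0, 0.0);
    double denom = 1.0 - a * b;           // a*b rounds to 1 - 2^-52, so denom is 2^-52
    assert(denom > 0.0);                  // no division by zero
    std::printf("%g\n", 1.0 / denom);     // about 4.5036e+15 (that is, 2^52)
    return 0;
}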
There is a mathematical proof that it will never be >= 1. I don't have it handy.... you may want to ask on the math stack overflow site if you are interested in studying the proof. But your instincts are correct. It will never be >= 1.
Now, we must be careful because floating point arithmetic is only an approximation of math and has limitations. I'm not an expert on these limitations, but the floating-point standard is very carefully designed and provides certain guarantees. I'm pretty sure one of them includes (or implies) that x * y where x < 1 and y < 1 is guaranteed to be < 1.
You can check that even when taking the highest float or double that is lower than 1 and multiplying it by itself, the result will still be lower than 1. Any multiplication of numbers lower than that must give a smaller result.
Here is the code I ran, with the results in comments:
float a = nextafterf(1, 0); // 0.999999940
double b = nextafter(1, 0); // 0.99999999999999989
float c = a * a; // 0.999999881
double d = b * b; // 0.99999999999999978

is assigning two doubles guaranteed to yield the same bitset patterns?

There are several posts here about floating point numbers and their nature. It is clear that comparing floats and doubles must always be done cautiously. Asking for equality has also been discussed and the recommendation is clearly to stay away from it.
But what if there is a direct assignment:
double a = 5.4;
double b = a;
assuming a is any non-NaN value - can a == b ever be false?
It seems that the answer is obviously no, yet I can't find any standard defining this behaviour in a C++ environment. IEEE-754 states that two floating point numbers with equal (non-NaN) bitset patterns are equal. Does it now mean that I can continue comparing my doubles this way without having to worry about maintainability? Do I have to be worried about other compilers / operating systems and their implementation regarding these lines? Or maybe a compiler that optimizes some bits away and ruins their equality?
I wrote a little program that generates and compares non-NaN random doubles forever - until it finds a case where a == b yields false. Can I compile/run this code anywhere and anytime in the future without having to expect a halt? (ignoring endianness and assuming sign, exponent and mantissa bit sizes / positions stay the same).
#include <iostream>
#include <random>
#include <cstdint>
#include <cstring>

struct double_content {
    std::uint64_t mantissa : 52;
    std::uint64_t exponent : 11;
    std::uint64_t sign : 1;
};
static_assert(sizeof(double) == sizeof(double_content), "must be equal");

void set_double(double& n, std::uint64_t sign, std::uint64_t exponent, std::uint64_t mantissa) {
    double_content convert;
    std::memcpy(&convert, &n, sizeof(double));
    convert.sign = sign;
    convert.exponent = exponent;
    convert.mantissa = mantissa;
    std::memcpy(&n, &convert, sizeof(double_content));
}

void print_double(double& n) {
    double_content convert;
    std::memcpy(&convert, &n, sizeof(double));
    std::cout << "sign: " << convert.sign << ", exponent: " << convert.exponent << ", mantissa: " << convert.mantissa << " --- " << n << '\n';
}

int main() {
    std::random_device rd;
    std::mt19937_64 engine(rd());
    std::uniform_int_distribution<std::uint64_t> mantissa_distribution(0ull, (1ull << 52) - 1);
    std::uniform_int_distribution<std::uint64_t> exponent_distribution(0ull, (1ull << 11) - 1);
    std::uniform_int_distribution<std::uint64_t> sign_distribution(0ull, 1ull);
    double a = 0.0;
    double b = 0.0;
    bool found = false;
    while (!found){
        auto sign = sign_distribution(engine);
        auto exponent = exponent_distribution(engine);
        auto mantissa = mantissa_distribution(engine);
        //re-assign exponent for NaN cases
        if (mantissa) {
            while (exponent == (1ull << 11) - 1) {
                exponent = exponent_distribution(engine);
            }
        }
        //force -0.0 to be 0.0
        if (mantissa == 0u && exponent == 0u) {
            sign = 0u;
        }
        set_double(a, sign, exponent, mantissa);
        b = a;
        //here could be more (unmodifying) code to delay the next comparison
        if (b != a) { //not equal!
            print_double(a);
            print_double(b);
            found = true;
        }
    }
}
using Visual Studio Community 2017 Version 15.9.5
The C++ standard clearly specifies in [basic.types]#3:
For any trivially copyable type T, if two pointers to T point to distinct T objects obj1 and obj2, where neither obj1 nor obj2 is a potentially-overlapping subobject, if the underlying bytes ([intro.memory]) making up obj1 are copied into obj2, obj2 shall subsequently hold the same value as obj1.
It gives this example:
T* t1p;
T* t2p;
// provided that t2p points to an initialized object ...
std::memcpy(t1p, t2p, sizeof(T));
// at this point, every subobject of trivially copyable type in *t1p contains
// the same value as the corresponding subobject in *t2p
The remaining question is what a value is. We find in [basic.fundamental]#12 (emphasis mine):
There are three floating-point types: float, double, and long double.
The type double provides at least as much precision as float, and the type long double provides at least as much precision as double.
The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The value representation of floating-point types is implementation-defined.
Since the C++ standard has no further requirements on how floating point values are represented, this is all you will find as guarantee from the standard, as assignment is only required to preserve values ([expr.ass]#2):
In simple assignment (=), the object referred to by the left operand is modified by replacing its value with the result of the right operand.
As you correctly observed, IEEE-754 requires that non-NaN, non-zero floats compare equal if and only if they have the same bit pattern. So if your compiler uses IEEE-754-compliant floats, you should find that assignment of non-NaN, non-zero floating point numbers preserves bit patterns.
And indeed, your code
double a = 5.4;
double b = a;
should never allow (a == b) to return false. But as soon as you replace 5.4 with a more complicated expression, most of this nicety vanishes. It's not the exact subject of the article, but https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/ mentions several possible ways in which innocent looking code can yield different results (which breaks "identical to the bit pattern" assertions). In particular, you might be comparing an 80 bit intermediate result with a 64 bit rounded result, possibly yielding inequality.
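For what it's worth, here is a minimal bit-pattern check of the simple-assignment case (assuming a 64-bit double; this only demonstrates the common behaviour, it does not prove the guarantee):
#include <cstring>
#include <iostream>

int main()
{
    double a = 5.4;
    double b = a;
    std::cout << (a == b) << '\n';                              // prints 1
    std::cout << (std::memcmp(&a, &b, sizeof a) == 0) << '\n';  // prints 1: same bit pattern here
    return 0;
}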
There are some complications here. First, note that the title asks a different question than the question. The title asks:
is assigning two doubles guaranteed to yield the same bitset patterns?
while the question asks:
can a == b ever be false?
The first of these asks whether different bits might occur from an assignment (which could be due to either the assignment not recording the same value as its right operand or due to the assignment using a different bit pattern that represents the same value), while the second asks whether, whatever bits are written by an assignment, the stored value must compare equal to the operand.
In full generality, the answer to the first question is no. With IEEE-754 binary floating-point formats, there is a one-to-one map between non-zero numeric values and their encodings in bit patterns, but that still leaves several cases where an assignment could produce a different bit pattern:
The right operand is the IEEE-754 −0 entity, but +0 is stored. This is not a proper IEEE-754 operation, but C++ is not required to conform to IEEE 754. Both −0 and +0 represent mathematical zero and would satisfy C++ requirements for assignment, so a C++ implementation could do this.
IEEE-754 decimal formats have one-to-many maps between numeric values and their encodings. By way of illustration, three hundred could be represented with bits whose direct meaning is 3•10^2 or bits whose direct meaning is 300•10^0. Again, since these represent the same mathematical value, it would be permissible under the C++ standard to store one in the left operand of an assignment when the right operand is the other.
IEEE-754 includes many non-numeric entities called NaNs (for Not a Number), and a C++ implementation might store a NaN different from the right operand. This could include either replacing any NaN with a “canonical” NaN for the implementation or, upon assignment of a signaling NaN, indicating the signal in some way and then converting the signaling NaN to a quiet NaN and storing that.
Non-IEEE-754 formats may have similar issues.
Regarding the latter question, can a == b be false after a = b, where both a and b have type double, the answer is no. The C++ standard does require that an assignment replace the value of the left operand with the value of the right operand. So, after a = b, a must have the value of b, and therefore they are equal.
Note that the C++ standard does not impose any restrictions on the accuracy of floating-point operations (although I see this only stated in non-normative notes). So, theoretically, one might interpret assignment or comparison of floating-point values to be floating-point operations and say that they do not need to be accurate, so the assignment could change the value or the comparison could return an inaccurate result. I do not believe this is a reasonable interpretation of the standard; the lack of restrictions on floating-point accuracy is intended to allow latitude in expression evaluation and library routines, not simple assignment or comparison.
One should note the above applies specifically to a double object that is assigned from a simple double operand. This should not lull readers into complacency. Several similar but different situations can result in failure of what might seem intuitive mathematically, such as:
After float x = 3.4;, the expression x == 3.4 will generally evaluate as false, since 3.4 is a double and has to be converted to a float for the assignment. That conversion reduces precision and alters the value, as the short demonstration after this list shows.
After double x = 3.4 + 1.2;, the expression x == 3.4 + 1.2 is permitted by the C++ standard to evaluate to false. This is because the standard permits floating-point expressions to be evaluated with more precision than the nominal type requires. Thus, 3.4 + 1.2 might be evaluated with the precision of long double. When the result is assigned to x, the standard requires that the excess precision be “discarded,” so the value is converted to a double. As with the float example above, this conversion may change the value. Then the comparison x == 3.4 + 1.2 may compare a double value in x to what is essentially a long double value produced by 3.4 + 1.2.
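A minimal demonstration of the first pitfall, assuming IEEE-754 float and double (3.4 is not exactly representable in either type):
#include <iostream>

int main()
{
    float x = 3.4;                    // the double literal 3.4 is rounded to float here
    std::cout << (x == 3.4) << '\n';  // prints 0: x, promoted back to double, is not the double nearest 3.4
    return 0;
}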

How is ++ defined on a large floating point [duplicate]

So I've been looking at IEEE754 floating point double. (My C++ compiler uses that type for a double).
Consider this snippet:
// 9007199254740992 is the 53rd power of 2.
// 590295810358705700000 is approximately the 69th power of 2.
for (double f = 9007199254740992; f <= 590295810358705700000; ++f){
/* what is f?*/
}
Presumably f increments in even steps up to the 54th power of 2, due to rounding up?
Then after that, nothing happens due to rounding down?
Is that correct? Is it even well-defined?
++f is essentially the same as f = f + 1, ignoring the fact that ++f is an expression that yields a value.
Now, for floating point values, the issue of representability comes into play. It may be that f + 1 is not representable. In which case, f + 1 will evaluate to the nearest representable value to the true value of f + 1. In case there are two equally near candidates for nearest representable value, round to even is used.
This is covered in the Operations section of What Every Computer Scientist Should Know About Floating-Point Arithmetic:
The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even).
So, in your example, for sufficiently large values of f, you will find that f == f + 1.
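A small check of this at the boundary, assuming IEEE-754 double (53 significand bits):
#include <iostream>

int main()
{
    double f = 9007199254740992.0;      // 2^53
    std::cout << (f + 1 == f) << '\n';  // prints 1: 2^53 + 1 rounds back to 2^53 (ties to even)
    std::cout << (f + 2 == f) << '\n';  // prints 0: 2^53 + 2 is representable
    return 0;
}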
Yes, this loop will never end, because of a rounding problem. I hope the reason is clear to you (since you are familiar with https://en.wikipedia.org/wiki/IEEE_floating_point), but let me describe it briefly for the impatient audience.
We can think of floating point as a special representation of a number, imposed by the compiler/FPU/standard. For a simple example, let's review:
20000
2e4
0.2e5
All three forms represent the same number. The last two are called the "scientific" form, but which is best? IEEE 754 answers: the last one, because we can save space by omitting the leading 0 and just writing .2e5. This decimal analogy is very close to the binary representation, where there is space for a mantissa (.2) and an exponent (5).
Now let's do the same for 20000.00000000001:
0.2000000000000001e5
As we can see, the mantissa grows, and there is some limit beyond which the fixed-size storage would overflow. Instead of raising an exception, we sacrifice precision, which (just as an example) gives us 0.2e5.
For bigger numbers (as in the question) we lose precision too.
9007199254740992 may be represented as roughly 0.9e16, and when 1 is added nothing changes.
So f = f + 1 creates an infinite loop.
f++ being the same as f = f + 1, as pointed out in the comments, and as I tested myself, f == f + 1 (!!) holds for a sufficiently large f, depending on the platform. An explanation is here (for small numbers, but the principle is the same): http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BinMath/addFloat.html
Here's how to add floating point numbers. First, convert the two representations to scientific notation. Thus, we explicitly represent the hidden 1. In order to add, we need the exponents of the two numbers to be the same. We do this by rewriting Y. This will result in Y being not normalized, but the value is equivalent to the normalized Y. Add x - y to Y's exponent. Shift the radix point of the mantissa (significand) Y left by x - y to compensate for the change in exponent. Add the two mantissas of X and the adjusted Y together. If the sum in the previous step does not have a single bit of value 1 left of the radix point, then adjust the radix point and exponent until it does. Convert back to the one byte floating point representation.
In the process of converting the number to the same exponent, due to precision, 1 is rounded to 0, and hence f == f + 1.
According to IEEE754, after the sum the number is rounded to match the double format, and due to the rounding operation, f==f+1.
I don't know whether there are problems for which looping over large floating point values in increments of 1 is a meaningful solution, but people may be stumbling on this question looking for a workaround for their never-ending loop. Therefore, even though the question only asks how the addition is defined by the standard, I'll propose a workaround.
Indeed, for large values of f, f + 1 == f is true, so using ++f as the loop increment means the loop never advances; and a loop that never terminates and has no observable side effects has undefined behaviour in C++.
Assuming it's OK for f to be incremented by the smallest number e greater than 1 for which the floating point type can represent f + e > f, the following workaround, where the loop will always terminate, could be OK:
// use template, or overloads for different floating point types
#include <algorithm>
#include <cmath>
#include <limits>

template<class T>
T add_s(T l, T r) {
    T result = l + r;
    T greater = std::max(l, r);
    if (result == greater)
        return std::nextafter(greater, std::numeric_limits<T>::max());
    return result;
}
// ...
for (double f = /*...*/; f < /*...*/; f = add_s(f, 1.0))
That said, adding tiny floats to huge floats will result in an uncontrollable accumulation of errors. If that's not OK for you, then you need arbitrary precision math, not floating point.
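For instance, at the boundary from the question, and assuming IEEE-754 double, the add_s template above behaves like this:
double f = 9007199254740992.0;  // 2^53: f + 1.0 would round back to f
double g = add_s(f, 1.0);       // instead steps to the next representable double, 2^53 + 2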

double and float comparison [duplicate]

According to this post, when comparing a float and a double, the float should be treated as double.
The following program does not seem to follow this statement. The behaviour looks quite unpredictable.
Here is my program:
#include <cstdio>
#include <iostream>
using namespace std;

int main(void)
{
    double a = 1.1; // 1.5
    float b = 1.1;  // 1.5
    printf("%X %X\n", a, b); // note: passing floating-point values for %X is not well-defined
    if (a == b)
        cout << "success " << endl;
    else
        cout << "fail" << endl;
}
When I run this program, I get "fail" displayed.
However, when I change a and b to 1.5, it displays "success".
I have also printed the hex notations of the values. They are different in both the cases. My compiler is Visual Studio 2005
Can you explain this output ? Thanks.
float f = 1.1;
double d = 1.1;
if (f == d)
In this comparison, the value of f is promoted to type double. The problem you're seeing isn't in the comparison, but in the initialization. 1.1 can't be represented exactly as a floating-point value, so the values stored in f and d are the nearest value that can be represented. But float and double are different sizes, so have a different number of significant bits. When the value in f is promoted to double, there's no way to get back the extra bits that were lost when the value was stored, so you end up with all zeros in the extra bits. Those zero bits don't match the bits in d, so the comparison is false. And the reason the comparison succeeds with 1.5 is that 1.5 can be represented exactly as a float and as a double; it has a bunch of zeros in its low bits, so when the promotion adds zeros the result is the same as the double representation.
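The following sketch illustrates that point, assuming IEEE-754 float and double (the digits in the comments are approximate):
#include <iomanip>
#include <iostream>

int main()
{
    float  f = 1.1;
    double d = 1.1;
    std::cout << std::setprecision(20)
              << f << '\n'                              // about 1.10000002384185791016 (nearest float)
              << d << '\n'                              // about 1.10000000000000008882 (nearest double)
              << (f == d) << '\n'                       // 0: f promoted to double keeps its float value
              << (f == static_cast<float>(d)) << '\n';  // 1: d rounded to float equals f
    return 0;
}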
I found a decent explanation of the problem you are experiencing as well as some solutions.
See How dangerous is it to compare floating point values?
Just a side note, remember that some values can not be represented EXACTLY in IEEE 754 floating point representation. Your same example using a value of say 1.5 would compare as you expect because there is a perfect representation of 1.5 without any loss of data. However, 1.1 in 32-bit and 64-bit are in fact different values because the IEEE 754 standard can not perfectly represent 1.1.
See http://www.binaryconvert.com
double a = 1.1 --> 0x3FF199999999999A
Approximate representation = 1.10000000000000008881784197001
float b = 1.1 --> 0x3f8ccccd
Approximate representation = 1.10000002384185791015625
As you can see, the two values are different.
Also, unless you are working in some limited memory type environment, it's somewhat pointless to use floats. Just use doubles and save yourself the headaches.
If you are not clear on why some values can not be accurately represented, consult a tutorial on how to convert a decimal to floating point.
Here's one: http://class.ece.iastate.edu/arun/CprE281_F05/ieee754/ie5.html
I would regard code which directly performs a comparison between a float and a double without a typecast to be broken; even if the language spec says that the float will be implicitly converted, there are two different ways that the comparison might sensibly be performed, and neither is sufficiently dominant to really justify a "silent" default behavior (i.e. one which compiles without generating a warning). If one wants to perform a conversion by having both operands evaluated as double, I would suggest adding an explicit type cast to make one's intentions clear. In most cases other than tests to see whether a particular double->float conversion will be reversible without loss of precision, however, I suspect that comparison between float values is probably more appropriate.
Fundamentally, when comparing floating-point values X and Y of any sort, one should regard comparisons as indicating that X or Y is larger, or that the numbers are "indistinguishable". A comparison which shows X is larger should be taken to indicate that the number that Y is supposed to represent is probably smaller than X or close to X. A comparison that says the numbers are indistinguishable means exactly that. If one views things in such fashion, comparisons performed by casting to float may not be as "informative" as those done with double, but are less likely to yield results that are just plain wrong. By comparison, consider:
double x, y;
float f = x;
If one compares f and y, it's possible that what one is interested in is how y compares with the value of x rounded to a float, but it's more likely that what one really wants to know is whether, knowing the rounded value of x, one can say anything about the relationship between x and y. If x is 0.1 and y is 0.2, f will have enough information to say whether x is larger than y; if y is 0.100000001, it will not. In the latter case, if both operands are cast to double, the comparison will erroneously imply that x was larger; if they are both cast to float, the comparison will report them as indistinguishable. Note that comparison results when casting both operands to double may be erroneous not only when values are within a part per million; they may be off by hundreds of orders of magnitude, such as if x=1e40 and y=1e300. Compare f and y as float and they'll compare indistinguishable; compare them as double and the smaller value will erroneously compare larger.
The reason why the rounding error occurs with 1.1 and not with 1.5 is due to the number of bits required to accurately represent a number like 0.1 in floating point format. In fact an accurate representation is not possible.
See How To Represent 0.1 In Floating Point Arithmetic And Decimal for an example, particularly the answer by paxdiablo.

C++ integer floor function

I want to implement greatest integer function. [The "greatest integer function" is a quite standard name for what is also known as the floor function.]
int x = 5/3;
My question is: with greater numbers, could there be a loss of precision, since 5/3 would produce a double?
EDIT: The greatest integer function gives the greatest integer less than or equal to X.
Example:
4.5 = 4
4 = 4
3.2 = 3
3 = 3
What I want to know is: is 5/3 going to produce a double? Because if so, I will have a loss of precision when converting to int.
Hope this makes sense.
You will lose the fractional portion of the quotient. So yes, with greater numbers you will have more relative precision, such as compared with 5000/3000.
However, 5 / 3 will return an integer, not a double. To force it to divide as double, typecast the dividend as static_cast<double>(5) / 3.
Integer division gives integer results, so 5 / 3 is 1 and 5 % 3 is 2 (the remainder operator). However, this doesn't necessarily hold with negative numbers. In the original C++ standard, -5 / 3 could be either -1 (rounding towards zero) or -2 (the floor), but -1 was recommended. In the latest C++0x draft (which is almost certainly very close to the final standard), it is -1, so finding the floor with negative numbers is more involved.
5/3 will always produce 1 (an integer), if you do 5.0/3 or 5/3.0 the result will be a double.
As far as I know, there is no predefined function for this purpose.
It might be necessary to use such a function if, for some reason, floating-point calculations are out of the question (e.g. because int64_t has more precision than double can represent without error).
We could define this function as follows:
#include <cstdlib>   // for ldiv, ldiv_t

inline long
floordiv (long num, long den)
{
    if (0 < (num ^ den))
        return num / den;
    else
    {
        ldiv_t res = ldiv(num, den);
        return (res.rem) ? res.quot - 1
                         : res.quot;
    }
}
The idea is to use the normal integer division, but adjust for negative results to match the behaviour of the double floor(double) function. The point is to always round towards the next lower integer, irrespective of the position of the zero point. This can be very important if the intention is to create evenly sized intervals.
Timing measurements show that this function here only creates a small overhead compared with the built-in / operator, but of course the floating point based floor function is significantly faster....
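For illustration, a quick check of the adjustment for negative operands, using the floordiv function above and assuming C++11's truncating built-in division:
#include <iostream>

int main()
{
    std::cout << -5 / 3 << '\n';           // -1: built-in division truncates toward zero
    std::cout << floordiv(-5, 3) << '\n';  // -2: adjusted down to the floor
    std::cout << floordiv(5, 3) << '\n';   //  1: positive results are unchanged
}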
Since in C and C++, as others have said, / is integer division, it will return an int. In particular, for non-negative operands it returns the floor of the exact quotient, because C and C++ truncate towards zero. So, basically, 5/3 is exactly what you want.
It may get a little weird with negatives: under the current rules -5/3 truncates to -1 rather than the floor -2, which may or may not be what you want...