I am writing a code that has to compile in single and double precision. The original version was only in double precision, but I am trying to enable single precision now by using templates.
My question is: is it necessary to cast the 1. and the 7.8 to the specified type with static_cast<TF>(1.) for instance, or will the compiler take care of it? I find the casting not remarkably pretty and prefer to stay away from it. (I have other functions that are much longer and that contain many more literal constants).
template<typename TF>
inline TF phih_stable(const TF zeta)
{
// Hogstrom, 1988
return 1. + 7.8*zeta;
}
Casting and implicit conversions are two things. For this example, you could treat the template function as if it were two overloaded functions, but with the same code within. At the interface level (parameters, return value) the compiler will generate implicit conversions.
Now, the question you will have to ask yourself is this: Do those implicit conversions do what I want? If they do, just leave it as it is. If they don't, you could try to add explicit conversions (maybe use the function-style casting like TF(1.)) or, you could specialize this function for double and float.
Another option, less general but maybe it works here, is that you switch the code around, i.e. that you write the code for single-precision float and then let the compiler apply its implicit conversions. Since the conversions usually only go to the bigger type, it should fit for both double and float without incurring any overhead for float.
When you do:
return 1. + 7.8*zeta;
The literals 1. and 7.8 are double, so when if zeta is a float, it will first be converted to double, then the whole computation will be done in double-precision, and the result will be cast back to float, this is equivalent to:
return (float)(1. + 7.8 * (double)zeta);
Otherwise said, this is equivalent to calling phih_stable(double) and storing the result in a float, so your template would be useless for float.
If you want the computation to be made in single precision, you need the casts1:
return TF(1.) + TF(7.8) * zeta;
What about using 1.f and 7.8f? The problem is that (double)7.8f != 7.8 due to floating point precision. The difference is around 1e-7, the actual stored value for 7.8f (assuming 32-bits float) is:
7.80000019073486328125
While the actual stored value for 7.8 (assuming 64-bits double) is:
7.79999999999999982236431605997
So you have to ask yourself if you accept this loss of precision.
You can compare the two following implementations:
template <class T>
constexpr T phih_stable_cast(T t) {
return T(1l) + T(7.8l) * t;
}
template <class T>
constexpr T phih_stable_float(T t) {
return 1.f + 7.8f * t;
}
And the following assertions:
static_assert(phih_stable_cast(3.4f) == 1. + 7.8f * 3.4f, "");
static_assert(phih_stable_cast(3.4) == 1. + 7.8 * 3.4, "");
static_assert(phih_stable_cast(3.4l) == 1. + 7.8l * 3.4l, "");
static_assert(phih_stable_float(3.4f) == 1.f + 7.8f * 3.4f, "");
static_assert(phih_stable_float(3.4) == 1. + 7.8 * 3.4, "");
static_assert(phih_stable_float(3.4l) == 1.l + 7.8l * 3.4l, "");
The two last assertions fail due to the loss of precision when doing the computation.
1 You should even downcast from long double to not lose precision when using your function with long double: return TF(1.l) + TF(7.8l) * zeta;.
Related
This C/C++ reduced testcase:
int x = ~~~;
const double s = 1.0 / x;
const double r = 1.0 - x * s;
assert(r >= 0); // fail
isn't numerically stable, and hits the assert. The reason is that the last calculation can be done with FMA, which brings r into the negatives.
Clang has enabled FMA by default (since version 14), so it is causing some interesting regressions. Here's a running version: https://godbolt.org/z/avvnnKo5E
Interestingly enough, if one splits the last calculation in two, then the FMA is not emitted and the result is always non-negative:
int x = ~~~;
const double s = 1.0 / x;
const double tmp = x * s;
const double r = 1.0 - tmp;
assert(r >= 0); // OK
Is this a guaranteed behavior of IEEE754 / FP_CONTRACT, or is this playing with fire and one should just find a more numerically stable approach? I can't find any indication that fp contractions are only meant to happen "locally" (within one expression), and a simple split like to above is sufficient to prevent them.
(Of course, in due time, one could also think of replacing the algorithm with a more numerically stable one. Or add a clamp into the [0.0, 1.0] range, but that feels hackish.)
The C++ standard allows floating-point expressions to be evaluated with extra range and precision because C++ 2020 draft N4849 7.1 [expr.pre] 6 says:
The values of the floating-point operands and the results of floating-point expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.
However, note 51 tells us:
The cast and assignment operators must still perform their specific conversions as described in 7.6.1.3, 7.6.3, 7.6.1.8
and 7.6.19.
The meaning of this is that an assignment or a cast must convert the value to the nominal type. So if extra range or precision has been used, that result must be converted to an actual double value when an assignment to a double is performed. (I expect, for this purpose, assignment includes initialization in a definition.)
So 1.0 - x * s may used a fused multiply-add, but const double tmp = x * s; const double r = 1.0 - tmp; must compute a double result for x * s, then subtract that double result from 1.0.
Note that this does not preclude const double tmp = x * s; from using extra precision to compute x * s and then rounding again to get a double result. In rare instances, this can produce a double-rounding error, where the result differs slightly from what one would get by rounding the real-number-arithmetic result of x•s directly to a double. That is unlikely to occur in practice; a C++ implementation would have no reason to compute x * s with extra precision and then round it to double.
Also note that C and C++ implementations do not necessarily conform to the C or C++ standard.
I'm writing a program to calculate the result of numbers:
int main()
{
float a, b;
cin >> a >> b;
float result = b + a * a * 0.4;
cout << result;
}
but I have a warning at a * a and it said Warning C26451 Arithmetic overflow: Using operator '*' on a 4 byte value and then casting the result to a 8 byte value. Cast the value to the wider type before calling operator '*' to avoid overflow (io.2). Sorry if this a newbie question, can anyone help me with this? Thank you!
In the C language as described in the first edition of K&R, all floating-point operations were performed by converting operands to a common type (specifically double), performing operations with that type, and then if necessary converting the result to whatever type was needed. On many platforms, that was the most convenient and space-efficient way of handling floating-point math. While the Standard still allows implementations to behave that way, it also allows implementations to perform floating-point operations on smaller types to be performed using those types directly.
As written, the subexpression a * a * 0.5; would be performed by multiplying a * a together using float type, then multiply by a value 0.5 which is of type double. This latter multiplication would require converting the float result of a * a to double. If e.g. a had been equal to 2E19f, then performing the multiply using type float would yield a value too large to be represented using that type. Had the code instead performed the multiplication using type double, then the result 4E38 would be representable in that type, and the result of multiplying that by 0.5 (i.e. 2E38) would be within the range that is representable by float.
Note that in this particular situation, the use of float for the intermediate computations would only affect the result if a was within narrow ranges of very large or very small. If instead of multiplying by 0.5 one had multiplied by other values, however, the choice of whether to use float or double for the first multiplication could affect the accuracy of rounding. Generally, using double for both multiplies would yield slightly more accurate results, but at the expense of increased execution time. Using float for both may yield better execution speed, but at the result of reduced precision. If the floating-point constant had been something that isn't precisely representable in float, converting to double and multiplying by a double constant may yield slightly more accurate results than using float for everything, but in most cases where one would want that precision, one would also want the increased position that would be achieved by using double for the first multiply as well.
Let's look at the error message.
Using operator '*' on a 4 byte value
It is describing this code:
a * a
Your float is 4 bytes. The result of the multiplication is 4 bytes. And the result of a multiplication may overflow.
and then casting the result to a 8 byte value.
It is describing this code:
(result) * 0.4;
Your result is 4 bytes. 0.4 is a double, which is 8 bytes. C++ will promote your float result to a double before performing this multiplication.
So...
The compiler is observing that you are doing float math that could overflow and then immediately converting the result to a double, making the potential overflow unnecessary.
Change the code to this to remove the float to double to float conversions.
float result = b + a * a * 0.4f;
I read the question as "how to change the code to remove the warning?".
If you take the advice in the warning's text literally:
float result = b + (double)a * a * 0.4;
But this is nonsense — if an overflow happens, your result will probably not fit into float result.
It looks like in your case overflow is not possible, and you feel perfectly fine doing all calculations with float. If so, just write 0.4f instead of 0.4 — then the constant will have float type (instead of double), and the result will also be float.
If you want to "fix" the problem with overflow
double result = b + (double)a * a * 0.4;
But then you must also change the following code, which uses the result. And you don't remove the possibility of overflow, you just make it much less likely.
Problem: In a homework problem (it is to be done on paper with a pen so no coding) I must determine the type and value of an addition performed in C++.
1 + 0.5
What I've answered is:
Type float (because I thought that integer + float = float)
Value 1.5 (As far as I know when two different datatypes are added,
the result of the addition is going to be converted to the datatype that does not loose any information.)
Solution says:
Type: double
Value: 1.5
My Question: Why is 0.5 a double and not a float? How can I distingish between a float and a double? I mean, 0.5 looks to me like a float and a double.
First of all, yes. integer + float = float. You are correct about that part.
The issue is not with that, but rather your assumption that 0.5 is a float. It is not. In C++, float literals are followed by an f meaning that 0.5f is a float. However, 0.5 is actually a double. That means that your equation is now:
integer + double = double
As you can see, this result is a double. That is why the correct answer to your question is that the resulting type is a double.
By the way, to clear the record, technically what's going on here isn't integer + double = double. What is happening is that the 1 is falling subject to implicit conversion. Essentially, the 1 is converted to a double, since the other side of the operation is a double as well. This way, the computer is adding the same types and not different ones. That means that the actual addition taking place here is more like:
double + double = double
In C++, floating point literals without a type suffix are double by default. If you want it to be float, you need to specify the f suffix, like 0.5f.
I need to convert normalized integer values to and from real floating-point values. For instance, for int16_t, a value of 1.0 is represented by 32767 and -1.0 is represented by -32768. Although it's a bit tedious to do this for each integer type, both signed and unsigned, it's still easy enough to write by hand.
However, I want to use standard methods whenever possible rather than going off and reinventing the wheel, so what I'm looking for is something like a standard C or C++ header, a Boost library, or some other small, portable, easily-incorporated source that already performs these conversions.
Here's a templated solution using std::numeric_limits:
#include <cstdint>
#include <limits>
template <typename T>
constexpr double normalize (T value) {
return value < 0
? -static_cast<double>(value) / std::numeric_limits<T>::min()
: static_cast<double>(value) / std::numeric_limits<T>::max()
;
}
int main () {
// Test cases evaluated at compile time.
static_assert(normalize(int16_t(32767)) == 1, "");
static_assert(normalize(int16_t(0)) == 0, "");
static_assert(normalize(int16_t(-32768)) == -1, "");
static_assert(normalize(int16_t(-16384)) == -0.5, "");
static_assert(normalize(uint16_t(65535)) == 1, "");
static_assert(normalize(uint16_t(0)) == 0, "");
}
This handles both signed and unsigned integers, and 0 does normalize to 0.
View Successful Compilation Result
I'd question whether your intent is correct here (or indeed that of most of the answers).
Since you're likely just dealing with something like an integer representation of a "real" value such as that produced by an ADC - I'd argue that in fact a floating point fraction of +32767/32768 (not +1.0) is represented by the integer +32767, as a value of +1.0 can't actually be expressed in this form due to the 2's complement arithmetic used.
Although it's a bit tedious to do this for each integer type, both
signed and unsigned, it's still easy enough to write by hand.
You certainly don't need to do this for each integer type! Use <limits> instead.
template<class T> double AsDouble(const T x) {
const double valMin = std::numeric_limits<T>::min();
const double valMax = std::numeric_limits<T>::max();
return 2 * (x - valMin) / (valMax - valMin) - 1; // note: 0 does not become 0.
}
If I have the following declaration:
float a = 3.0 ;
is that an error? I read in a book that 3.0 is a double value and that I have to specify it as float a = 3.0f. Is it so?
It is not an error to declare float a = 3.0 : if you do, the compiler will convert the double literal 3.0 to a float for you.
However, you should use the float literals notation in specific scenarios.
For performance reasons:
Specifically, consider:
float foo(float x) { return x * 0.42; }
Here the compiler will emit a conversion (that you will pay at runtime) for each returned value. To avoid it you should declare:
float foo(float x) { return x * 0.42f; } // OK, no conversion required
To avoid bugs when comparing results:
e.g. the following comparison fails :
float x = 4.2;
if (x == 4.2)
std::cout << "oops"; // Not executed!
We can fix it with the float literal notation :
if (x == 4.2f)
std::cout << "ok !"; // Executed!
(Note: of course, this is not how you should compare float or double numbers for equality in general)
To call the correct overloaded function (for the same reason):
Example:
void foo(float f) { std::cout << "\nfloat"; }
void foo(double d) { std::cout << "\ndouble"; }
int main()
{
foo(42.0); // calls double overload
foo(42.0f); // calls float overload
return 0;
}
As noted by Cyber, in a type deduction context, it is necessary to help the compiler deduce a float :
In case of auto :
auto d = 3; // int
auto e = 3.0; // double
auto f = 3.0f; // float
And similarly, in case of template type deduction :
void foo(float f) { std::cout << "\nfloat"; }
void foo(double d) { std::cout << "\ndouble"; }
template<typename T>
void bar(T t)
{
foo(t);
}
int main()
{
bar(42.0); // Deduce double
bar(42.0f); // Deduce float
return 0;
}
Live demo
The compiler will turn any of the following literals into floats, because you declared the variable as a float.
float a = 3; // converted to float
float b = 3.0; // converted to float
float c = 3.0f; // float
It would matter is if you used auto (or other type deducting methods), for example:
auto d = 3; // int
auto e = 3.0; // double
auto f = 3.0f; // float
Floating point literals without a suffix are of type double, this is covered in the draft C++ standard section 2.14.4 Floating literals:
[...]The type of a floating literal is double unless explicitly specified by a suffix.[...]
so is it an error to assign 3.0 a double literal to a float?:
float a = 3.0
No, it is not, it will be converted, which is covered in section 4.8 Floating point conversions:
A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined.
We can read more details on the implications of this in GotW #67: double or nothing which says:
This means that a double constant can be implicitly (i.e., silently)
converted to a float constant, even if doing so loses precision (i.e.,
data). This was allowed to remain for C compatibility and usability
reasons, but it's worth keeping in mind when you do floating-point
work.
A quality compiler will warn you if you try to do something that's
undefined behavior, namely put a double quantity into a float that's
less than the minimum, or greater than the maximum, value that a float
is able to represent. A really good compiler will provide an optional
warning if you try to do something that may be defined but could lose
information, namely put a double quantity into a float that is between
the minimum and maximum values representable by a float, but which
can't be represented exactly as a float.
So there are caveats for the general case that you should be aware of.
From a practical perspective, in this case the results will most likely be the same even though technically there is a conversion, we can see this by trying out the following code on godbolt:
#include <iostream>
float func1()
{
return 3.0; // a double literal
}
float func2()
{
return 3.0f ; // a float literal
}
int main()
{
std::cout << func1() << ":" << func2() << std::endl ;
return 0;
}
and we see that the results for func1 and func2 are identical, using both clang and gcc:
func1():
movss xmm0, DWORD PTR .LC0[rip]
ret
func2():
movss xmm0, DWORD PTR .LC0[rip]
ret
As Pascal points out in this comment you won't always be able to count on this. Using 0.1 and 0.1f respectively causes the assembly generated to differ since the conversion must now be done explicitly. The following code:
float func1(float x )
{
return x*0.1; // a double literal
}
float func2(float x)
{
return x*0.1f ; // a float literal
}
results in the following assembly:
func1(float):
cvtss2sd %xmm0, %xmm0 # x, D.31147
mulsd .LC0(%rip), %xmm0 #, D.31147
cvtsd2ss %xmm0, %xmm0 # D.31147, D.31148
ret
func2(float):
mulss .LC2(%rip), %xmm0 #, D.31155
ret
Regardless whether you can determine if the conversion will have a performance impact or not, using the correct type better documents your intention. Using an explicit conversions for example static_cast also helps to clarify the conversion was intended as opposed to accidental, which may signify a bug or potential bug.
Note
As supercat points out, multiplication by e.g. 0.1 and 0.1f is not equivalent. I am just going to quote the comment because it was excellent and a summary probably would not do it justice:
For example, if f was equal to 100000224 (which is exactly
representable as a float), multiplying it by one tenth should yield a
result which rounds down to 10000022, but multiplying by 0.1f will
instead yield a result which erroneously rounds up to 10000023. If the
intention is to divide by ten, multiplication by double constant 0.1
will likely be faster than division by 10f, and more precise than
multiplication by 0.1f.
My original point was to demonstrate a false example given in another question but this finely demonstrates subtle issues can exist in toy examples.
It's not an error in the sense that the compiler will reject it, but it is an error in the sense that it may not be what you want.
As your book correctly states, 3.0 is a value of type double. There is an implicit conversion from double to float, so float a = 3.0; is a valid definition of a variable.
However, at least conceptually, this performs a needless conversion. Depending on the compiler, the conversion may be performed at compile time, or it may be saved for run time. A valid reason for saving it for run time is that floating-point conversions are difficult and may have unexpected side effects if the value cannot be represented exactly, and it's not always easy to verify whether the value can be represented exactly.
3.0f avoids that problem: although technically, the compiler is still allowed to calculate the constant at run time (it always is), here, there is absolutely no reason why any compiler might possibly do that.
While not an error, per se, it is a little sloppy. You know you want a float, so initialize it with a float.Another programmer may come along and not be sure which part of the declaration is correct, the type or the initializer. Why not have them both be correct?
float Answer = 42.0f;
When you define a variable, it is initialized with the provided initializer. This may require converting the value of the initializer to the type of the variable that's being initialized. That's what's happening when you say float a = 3.0;: The value of the initializer is converted to float, and the result of the conversion becomes the initial value of a.
That's generally fine, but it doesn't hurt to write 3.0f to show that you're aware of what you're doing, and especially if you want to write auto a = 3.0f.
If you try out the following:
std::cout << sizeof(3.2f) <<":" << sizeof(3.2) << std::endl;
you will get output as:
4:8
that shows, size of 3.2f is taken as 4 bytes on 32-bit machine wheres 3.2 is interpreted as double value taking 8 bytes on 32-bit machine.
This should provide the answer that you are looking for.
The compiler deduces the best-fitting type from literals, or at leas what it thinks is best-fitting. That is rather lose efficiency over precision, i.e. use a double instead of float.
If in doubt, use brace-intializers to make it explicit:
auto d = double{3}; // make a double
auto f = float{3}; // make a float
auto i = int{3}; // make a int
The story gets more interesting if you initialize from another variable where type-conversion rules apply: While it is legal to constuct a double form a literal, it cant be contructed from an int without possible narrowing:
auto xxx = double{i} // warning ! narrowing conversion of 'i' from 'int' to 'double'