How to express float constants precisely in source code


I have some C++11 code, produced by a code generator, that contains a large array of floats, and I want to make sure that the compiled values are exactly the same as the values in the generator (assuming that both follow the same ISO floating-point standard).
So I figured the best way to do it is to store the values as hex representations and interpret them as float in the code.
Edit for Clarification: The code generator takes the float values and converts them to their corresponding hex representations. The target code is supposed to convert back to float.
It looks something like this:
const unsigned int data[3] = { 0x3d13f407U, 0x3ea27884U, 0xbe072dddU};
float const* ptr = reinterpret_cast<float const*>(&data[0]);
This works and gives me access to all the data elements as floats, but I recently stumbled upon the fact that this is actually undefined behavior and only works because my compiler resolves it the way I intended:
https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8
https://en.cppreference.com/w/cpp/language/reinterpret_cast.
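As I understand it, the standard says that reading an object through a pointer to an unrelated type obtained via reinterpret_cast (here, reading unsigned int storage as float) is undefined behavior, because it violates the aliasing rules.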
So basically I have three options:
1) Use memcpy and hope that the compiler will be able to optimize this.
2) Store the data not as hex values but in a different way.
3) Use std::bit_cast from C++20.
I cannot use 3) because I'm stuck with C++11.
I don't have the resources to store the data array twice, so I would have to rely on the compiler to optimize this. Due to this, I don't particularly like 1) because it could stop working if I changed compilers or compiler settings.
So that leaves me with 2):
Is there a standardized way to express float values in source code so that they map to the exact float value when compiled? Does the ISO float standard define this in a way that guarantees that any compiler will follow the interpretation? I imagine if I deviate from the way the compiler expects, I could run the risk that the float "neighbor" of the number I actually want is used.
I would also take alternative ideas if there is an option 4 I forgot.

How to express float constants precisely in source code
Use hexadecimal floating point literals. Assuming some endianness for the hex values you presented:
float floats[] = { 0x1.27e80ep-5, 0x1.44f108p-2, -0x1.0e5bbap-3 };

If you have the generated code produce the full representation of the floating-point value—all of the decimal digits needed to show its exact value—then a C++ 11 compiler is required to parse the number exactly.
C++ 11 draft N3092 2.14.4 1 says, of a floating literal:
… The exponent, if present, indicates the power of 10 by which the significant [likely typo, should be “significand”] part is to be scaled. If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner…
Thus, if the floating literal does not have all the digits needed to show the exact value, the implementation may round it either upward or downward, as the implementation defines. But if it does have all the digits, then the value represented by the floating literal is representable in the floating-point format, and so its value must be the result of the parsing.

I have read some very valuable information here and would like to throw in an option that does not strictly answer the question, but could be a solution.
It might be problematic, but if so, I would like to discuss it.
The simple solution would be: Leave it as it is.
A short rundown of why I am hesitant about the suggested options:
memcpy relies on the compiler to optimize away the actual copy and understand that I only want to read the values. Since I have large arrays of data, I want to avoid a surprise in which a changed compiler setting suddenly increases runtime and requires a fix on short notice.
bit_cast is only available from C++20. There are reference implementations but they basically use memcpy under the hood (see above).
hex float literals are only available from C++17
Directly writing the floats precisely... I don't know, it seems to be somewhat dangerous, because if I make a slight mistake I may end up with a data block that is slightly off and could have an impact on my classification results. A mistake like that would be a nightmare to spot.
So why do I think I can get away with an implementation that is strictly speaking undefined? The rationale is that the standard may not define it, but compiler manufacturers likely do; at least the ones I have worked with so far gave me exact results. The code has been running without major problems for a fairly long time, across dozens of code generator runs, and I would expect that a failed reinterpret_cast would break the conversion so severely that I would spot it in my classification results right away.
Still not robust enough though. So my idea was to write a unit test that contains a significant number of hex-floats, do the reinterpret_cast and compare to reference float values for exact correspondence to tell me if a setting or compiler failed in this regard.
I have one doubt though: Is the assumption somewhat reasonable that a failed reinterpret_cast would break things spectacularly, or are the bets totally off when it comes to undefined behavior?
I am a bit worried that the compiler implementation might define the undefined behavior in such a way that it picks a float that is close to the hex value instead of the precise one (although I would wonder why), and that this happens only sporadically so that my unit test misses the problem.
So the endgame would be to unit test every single data entry against the corresponding reference float. Since the code is generated, I can generate the test as well. I think that should put all my worries to rest and make sure that I can get this to work across all possible compilers and compiler settings or be notified if anything breaks.

Related

Do literals in C++ really evaluate?

It was always my understanding that l-values have to evaluate, but for kind of obvious and easily explained reasons. An identifier represents a region of storage, and the value is in that storage and must be retrieved. That makes sense. But a program needing to evaluate a literal (for example, the integer 21) doesn't quite make sense to me. The value is right there, how much more explicit can you get? Well, besides adding U for unsigned, or some other suffix. This is why I'm curious about literals needing to be evaluated, as I've only seen this mentioned in one place. Most books also switch up terminology, like "Primary Expression," "operand," or "subexpression" and the like, to the point where the lines begin to blur. In all this time I have yet to see a clear explanation for this particular thing. It seems like a waste of processing power.

An ordinary literal only needs to be evaluated during compilation, by the compiler.
A user defined literal may be evaluated also at run time. For example, after including the <string> header, and making its ...s literals available by the directive using namespace std::string_literals;, then "Blah"s is a user defined literal of type std::string. The "Blah" part is evaluated by the compiler, at compile time. The conversion to std::string, which involves dynamic allocation, necessarily happens at run time.

But a program needing to evaluate a literal (for example, the integer 21) doesn't quite make sense to me. The value is right there, how much more explicit can you get?
Things are a little more complicated for floating point types. Consider the number 0.1. In binary it cannot be represented exactly and the closest floating point representation must be selected for it. If you input that number during runtime, the conversion of 0.1 to the binary representation has to respect the rounding mode (to nearest, upward, downward, toward zero). Strict treatment of floating point arithmetic suggests that conversion of the 0.1 floating point literal to the binary representation should also be performed respecting the rounding mode (which only becomes known during runtime) and therefore cannot be done by the compiler (actually the bigger part of it can be done by the compiler but the final rounding has to be performed during runtime, taking into account the rounding mode).

If two languages follow IEEE 754, will calculations in both languages result in the same answers?

I'm in the process of converting a program from Scilab code to C++. One loop in particular is producing a slightly different result than the original Scilab code (it's a long piece of code so I'm not going to include it in the question but I'll try my best to summarise the issue below).
The problem is, each step of the loop uses calculations from the previous step. Additionally, the difference between calculations only becomes apparent around the 100,000th iteration (out of approximately 300,000).
Note: I'm comparing the output of my C++ program with the outputs of Scilab 5.5.2 using the "format(25);" command. Meaning I'm comparing 25 significant digits. I'd also like to point out I understand how precision cannot be guaranteed after a certain number of bits but read the sections below before commenting. So far, all calculations have been identical up to 25 digits between the two languages.
In attempts to get to the bottom of this issue, so far I've tried:
Examining the data type being used:
I've managed to confirm that Scilab is using IEEE 754 doubles (according to the language documentation). Also, according to Wikipedia, C++ isn't required to use IEEE 754 for doubles, but from what I can tell, everywhere I use a double in C++ it has perfectly matched Scilab's results.
Examining the use of transcendental functions:
I've also read from What Every Computer Scientist Should Know About Floating-Point Arithmetic that IEEE does not require transcendental functions to be exactly rounded. With that in mind, I've compared the results of these functions (sin(), cos(), exp()) in both languages and again, the results appear to be the same (up to 25 digits).
The use of other functions and predefined values:
I repeated the above steps for the use of sqrt() and pow(). As well as the value of Pi (I'm using M_PI in C++ and %pi in Scilab). Again, the results were the same.
Lastly, I've rewritten the loop (very carefully) in order to ensure that the code is identical between the two languages.
Note: Interestingly, I noticed that for all the above calculations the results between the two languages match farther than the actual result of the calculations (outside of floating point arithmetic). For example:
Value of sin(x) using Wolfram Alpha = 0.123456789.....
Value of sin(x) using Scilab & C++ = 0.12345yyyyy.....
That is, even once the value computed using Scilab or C++ started to differ from the actual result (from Wolfram), the two languages' results still matched each other. This leads me to believe that most of the values are being calculated in the same way in both languages, even though IEEE 754 doesn't require that.
My original thinking was one of the first three points above are implemented differently between the two languages. But from what I can tell everything seems to produce identical results.
Is it possible that even though all the inputs to these loops are identical, the results can be different? Possibly because a very small error (past what I can see with 25 digits) is occurring that accumulates over time? If so, how can I go about fixing this issue?

No, the format of the numbering system does not guarantee equivalent answers from functions in different languages.
Functions, such as sin(x), can be implemented in different ways, using the same language (as well as different languages). The sin(x) function is an excellent example. Many implementations will use a look-up table or look-up table with interpolation. This has speed advantages. However, some implementations may use a Taylor Series to evaluate the function. Some implementations may use polynomials to come up with a close approximation.
Having the same numeric format is one hurdle to solve between languages. Function implementation is another.
Remember, you need to consider the platform as well. A program that uses an 80-bit floating point processor will have different results than a program that uses a 64-bit floating point software implementation.

Some architectures provide the capability of using extended precision floating point registers (e.g. 80 bits internally, versus 64-bit values in RAM). So, it's possible to get slightly different results for the same calculation, depending on how the computations are structured, and the optimization level used to compile the code.

Yes, it's possible to have different results. It's possible even if you are using exactly the same source code in the same programming language for the same platform. Sometimes it's enough to have a different compiler switch; for example -ffast-math would lead the compiler to optimize your code for speed rather than accuracy, and, if your computational problem is not well-conditioned to begin with, the result may be significantly different.
For example, suppose you have this code:
x_8th = x*x*x*x*x*x*x*x;
One way to compute this is to perform 7 multiplications. This would be the default behavior for most compilers. However, you may want to speed this up by specifying compiler option -ffast-math and the resulting code would have only 3 multiplications:
temp1 = x*x; temp2 = temp1*temp1; x_8th = temp2*temp2;
The result would be slightly different because finite precision arithmetic is not associative, but sufficiently close for most applications and much faster. However, if your computation is not well-conditioned that small error can quickly get amplified into a large one.

Note that it is possible that the Scilab and C++ are not using the exact same instruction sequence, or that one uses FPU and the other uses SSE, so there may not be a way to get them to be exactly the same.
As commented by IInspectable, if your compiler has _control87() or something similar, you can use it to change the precision and/or rounding settings. You could try combinations of this to see if it has any effect, but again, even if you manage to get the settings identical for Scilab and C++, differences in the actual instruction sequences may be the issue.
http://msdn.microsoft.com/en-us/library/e9b52ceh.aspx
If SSE is used, I'm not sure what can be adjusted, as I don't think SSE has an 80-bit precision mode.
In the case of using FPU in 32 bit mode, and if your compiler doesn't have something like _control87, you could use assembly code. If inline assembly is not allowed, you would need to call an assembly function. This example is from an old test program:
static short fcw; /* 16 bit floating point control word */
/* ... */
/* set precision control to extended precision */
__asm{
fnstcw fcw
or fcw,0300h
fldcw fcw
}

Is there any way to create a floating point number without using ldexp?

I'm trying to create an IEEE 754 floating point number from the sign, exponent and mantissa, but I can't seem to get the ldexp() function working on my computer, so I was wondering if there was a way to create that number by directly manipulating the bits' values.

One standard-ish idiom for messing with value representations is to work with your bits as part of an int or char array, and then memcpy() that into your intended type.
Note that doing what you ask through writing one field of a union and reading another, or through type-punning (casting and dereferencing pointers from one type to another, other than char*) is technically undefined behavior under the C++ standard, and so should be avoided. Compilers are known to apply optimizations based on the assumption that programs don't execute these behaviors, which leads to unexpected behavior when they do.
For the exact instance of pointer casting considered here, the Clang/LLVM developers have published in a blog post that this is undefined behavior that they may optimize in unexpected ways.

Warning for inexact floating-point constants

Questions like "Why isn't 0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1 = 0.8?" got me thinking that...
... It would probably be nice to have the compiler warn about the floating-point constants that it rounds to the nearest representable in the binary floating-point type (e.g. 0.1 and 0.8 are rounded in radix-2 floating-point, otherwise they'd need an infinite amount of space to store the infinite number of digits).
I've looked up gcc warnings and so far found none for this purpose (-Wall, -Wextra, -Wfloat-equal, -Wconversion, -Wcoercion (unsupported or C only?), -Wtraditional (C only) don't appear to be doing what I want).
I haven't found such a warning in Microsoft Visual C++ compiler either.
Am I missing a hidden or rarely-used option?
Is there any compiler at all that has this kind of warning?
EDIT: This warning could be useful for educational purposes and serve as a reminder to those new to floating-point.

There is no technical reason the compiler could not issue such warnings. However, they would be useful only for students (who ought to be taught how floating-point arithmetic works before they start doing any serious work with it) and people who do very fine work with floating-point. Unfortunately, most floating-point work is rough; people throw numbers at the computer without much regard for how the computer works, and they accept whatever results they get.
The warning would have to be off by default to support the bulk of existing floating-point code. Were it available, I would turn it on for my code in the Mac OS X math library. Certainly there are points in the library where we depend on every bit of the floating-point value, such as places where we use extended-precision arithmetic, and values are represented across more than one floating-point object (e.g., we would have one object with the high bits of 1/π, another object with 1/π minus the first object, and a third object with 1/π minus the first two objects, giving us about 150 bits of 1/π). Some such values are represented in hexadecimal floating-point in the source text, to avoid any issues with compiler conversion of decimal numerals, and we could readily convert any remaining numerals to avoid the new compiler warning.
However, I doubt we could convince the compiler developers that enough people would use this warning or that it would catch enough bugs to make it worth their time. Consider the case of libm. Suppose we generally wrote exact numerals for all constants but, on one occasion, wrote some other numeral. Would this warning catch a bug? Well, what bug is there? Most likely, the numeral is converted to exactly the value we wanted anyway. When writing code with this warning turned on, we are likely thinking about how the floating-point calculations will be performed, and the value we have written is one that is suitable for our purpose. E.g., it may be a coefficient of some minimax polynomial we calculated, and the coefficient is as good as it is going to get, whether represented approximately in decimal or converted to some exactly-representable hexadecimal floating-point numeral.
So, this warning will rarely catch bugs. Perhaps it would catch an occasion where we mistyped a numeral, accidentally inserting an extra digit into a hexadecimal floating-point numeral, causing it to extend beyond the representable significand. But that is rare. In most cases, the numerals we use are either simple and short or are copied and pasted from software that has calculated them. On some occasions, we will hand-type special values, such as 0x1.fffffffffffffp0. A warning when an extra “f” slips into that numeral might catch a bug during compilation, but that error would almost certainly be caught quickly in testing, since it drastically alters the special value.
So, such a compiler warning has little utility: Very few people will use it, and it will catch very few bugs for the people who do use it.

The warning is in the source: when you write float, double, or long double, including any of their respective literals. Obviously, some literals are exact, but even this doesn't help much: the sum of two exact values may be inexact, e.g., if they have rather different scales. Having the compiler warn about inexact floating point constants would generate a false sense of security. Also, what are you meant to do about rounded constants? Writing the exact closest value explicitly would be error prone and obfuscate the intent. Writing them differently, e.g., writing 1.0 / 10.0 instead of 0.1, also obfuscates the intent and could yield different values.

There will be no such compiler switch and the reason is obvious.
We are writing down the binary components in decimal:
First fractional bit is 0.5
Second fractional bit is 0.25
Third fractional bit is 0.125
....
Do you see it? Because each value ends in the digit 5, every additional bit needs another decimal digit to be represented exactly. One bit needs one decimal digit, two bits need two decimal digits, and so on.
So for fractional floating-point values it would mean that for most decimal numbers you need 24(!) decimal digits for single precision floats and 53(!!) decimal digits for double precision.
Worse, the exact digits carry no extra information; they are pure artifacts caused by the base change.
No one is going to write down 3.141592653589793115997963468544185161590576171875 for pi to avoid a compiler warning.

I don't see how a compiler would know this, or how it could warn you about something like that. It is only a coincidence when a number can be represented exactly in a format that is inherently inexact.

changing float type to short but with same behaviour as float type variable

Is it possible to change the
float *pointer
type that is used in the VS c++ project
to some other type, so that it will still behave as a floating type but with less range?
I know that the floating point values never exceed some fixed value in that project, so I want to reduce the memory the program uses. It doesn't need 4 bytes for each element of the 'float *pointer'; 2 bytes will be enough, I think. If I change a float to a short and imitate the floating point behaviour, it will use half the memory. How can I do that?
EDIT:
It calculates the probabilities. So there are divisions like
A / B
Where A < B,
And also B (and A) can be from 1 to 10 000.

There is a standard 16-bit floating point format described in IEEE 754-2008 called "binary16". It is specified as a format to store floating point values with reduced precision. There is almost no compiler support for that yet (I think GCC supports it for certain ARM platforms), but it is quite easy to roll your own routines. This fellow:
http://blog.fpmurphy.com/2008/12/half-precision-floating-point-format_14.html
wrote a bit about it and also presents a routine to convert half-float <-> float.
Also, here seems to be a half-float C++ wrapper class:
half.h:
http://www.koders.com/cpp/fidABD00D95DE84C73BF0218AC621E400E07AA77B53.aspx
half.cpp
http://www.koders.com/cpp/fidF0DD0510FAAED03817A956D251787609BEB5989E.aspx
which supplies "HalfFloat" as a possible drop-in replacement type.

Maybe use fixed-point math? It all depends on value and precision you want to achieve.
http://www.eetimes.com/discussion/other/4024639/Fixed-point-math-in-C
For C there is a lot of code that makes fixed-point easy and I'm pretty sure there are also many C++ classes that make it even easier, but I don't know of any, I'm more into C.

The first, obvious, memory optimization would be to try and get rid of the pointer. If you can store just the float, that may, depending on the larger context, reduce your memory consumption from eight to four bytes already. (On a 64-Bit system, from twelve to four.)
Whether you can get by with a short depends on what your program does with the values. You may be able to use fixed-point arithmetic with an integral type such as a short, yes, but your question shows way too little context to judge that.

The code you posted and the text in the question do not deal with actual float, but with pointers to float. In all architectures I know of, the size of a pointer is the same regardless of the pointed type, so there would be no improvement in changing that to a short or char pointer.
Now, about the actual pointed elements, what is the range that you expect in your application? What is the precision you need? How many of those elements do you have? What are the memory constraints of your target platform? Unless the range and precision are small and the number of elements huge, just use floats. Also note that if you need floating point operations, storing any other type will require conversions before and after each operation, and you might be impacting performance.
Without greater knowledge of what you are doing: the range of short on many architectures is [-32k, 32k), where k stands for 1024. If your data range is [-32,32) and you can do with roughly 3 decimal digits, you could use fixed point arithmetic with shorts, but there are few such situations.
