Safe conversion from double to unsigned 64 bit integer - c++

On my platform this prints 9223372036854775808.
double x = 1e19;
std::cout << static_cast<unsigned __int64>(x) << '\n';
I tried Boost.NumericConversion, but got the same result.
Splitting x into two equal parts, converting each half, and adding the results gives the correct value. But I need a generic solution that I can use in template code.
Thank you in advance.
EDIT:
This problem shows up on Visual Studio 2008, but not MinGW. Casting 4.0e9 into unsigned long works fine.

Seems like it works well with gcc, but it is problematic in Visual Studio. See Microsoft's answer regarding this issue:
Our floating-point to integer
conversions are always done to a
signed integer. In this particular
case we use FIST instruction which
generates 800..00 as you described.
Therefore, there is no defined
behavior for converting to unsigned
64-bit integer values which are
larger than largest 64-bit signed
integer.
So you can only convert the numbers in the signed 64-bit integer range: −9,223,372,036,854,775,808 to +9,223,372,036,854,775,807 (-2^63~2^63-1).

The behavior of your compiler does not conform to C99, which requires that positive values always be converted correctly if possible. It only allows deviation from that for negative values.
The remaindering operation performed
when a value of integer type is
converted to unsigned type need not be
performed when a value of real
floating type is converted to unsigned
type. Thus, the range of portable real
floating values is (−1, Utype_MAX+1).
For your template code, you might just test whether your value is greater than static_cast< double >(UINT64_MAX/2) and, if so, do the repair work that you are already doing. If this only concerns testing for constants, the test should be optimized out where it is not relevant.
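The repair suggested above can be sketched roughly as follows. This is a minimal sketch, not the asker's or Microsoft's code: double_to_u64 is a hypothetical helper name, and it assumes the caller guarantees that the value lies in [0, 2^64). Values below 2^63 go through the signed conversion that works on the affected compiler; larger values have 2^63 subtracted first (which is exact for doubles in that range) and the top bit added back afterwards.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (name and signature are assumptions for illustration).
std::uint64_t double_to_u64(double x) {
    const double two63 = 9223372036854775808.0;  // 2^63, exactly representable as a double
    assert(x >= 0.0 && x < 2.0 * two63);         // caller must stay inside [0, 2^64)
    if (x < two63) {
        // Fits in a signed 64-bit integer, so the signed conversion is safe.
        return static_cast<std::uint64_t>(static_cast<std::int64_t>(x));
    }
    // x - 2^63 is exact here and fits in int64_t; restore the top bit afterwards.
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(x - two63))
           + (std::uint64_t{1} << 63);
}
```

With this helper, double_to_u64(1e19) yields 10000000000000000000 regardless of whether the compiler converts through a signed intermediate.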

Related

"Integer constant is so large that it is unsigned" compiler warning rationale

The following C/C++ code:
long long foo = -9223372036854775808LL; // -2^63
compiles (g++) with the warning
integer constant is so large that it is unsigned.
clang++ gives a similar warning.
Thanks to this bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52661. I now understand why GCC gives this warning. Unfortunately, the response to the bug report didn't explain the reason for this behaviour very well.
Questions:
Why is no warning given for the equivalent code for a 32/16/8-bit signed integer constant?
GCC and Clang both give this warning, so it is clearly intentional behaviour and not just 'to make it easier to parse,' as is suggested in response to the bug report. Why?
Is this behaviour mandated by the C/C++ standard? Some other standard?
This has to do with how the type of integer constants is defined.
First, as mentioned in the gcc bug report, -9223372036854775808LL is actually two tokens: the unary - operator and the integer constant 9223372036854775808LL. So the warning applies only to the latter.
Section 6.4.4.1p5 of the C standard states:
The type of an integer constant is the first of the corresponding list in which its value can be represented.
Based on this, a decimal integer constant with no suffix will have type int, long, or long long based on the value. These are all signed types. So any value small enough to fit in an 8 bit or 16 bit type still has type int, and a value too large for a 32 bit signed int will have type long or long long depending on the size of the type on that system. The same goes for a constant with the LL suffix, but only the long long type is tried.
The warning comes up because the value you're using doesn't fit in the above type list. Any lesser value will result in the value having a signed type meaning there's no conversion to unsigned.
As various more or less confused people in the bug report said, the integer constant 9223372036854775808LL is too large to fit inside a long long.
For decimal constants, the standard has a list in 6.4.4.1 (see the answer by @dbush) describing what types the compiler will try to give to an integer constant. In this case, the only valid option for type is (signed) long long and it won't fit there. Then §6 under that table kicks in:
If an integer constant cannot be represented by any type in its list, it may have an
extended integer type, if the extended integer type can represent its value. /--/
If the list contains both signed and unsigned types, the extended integer type
may be signed or unsigned.
Extended integer type is a fuzzy but formal term in the standard. In this case the compiler apparently tries to squeeze the constant into an unsigned long long "extended integer type" where it fits. This isn't really guaranteed behavior but implementation-defined.
Then the unary - operator is applied to the unsigned long long which produces the warning.
This is the reason why library headers such as limits.h like to define LLONG_MIN as
#define LLONG_MIN (-9223372036854775807LL - 1)
You could do something similar to avoid this warning. Or better yet, use LLONG_MIN.
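As a small sketch of that workaround, the expression form can be checked against LLONG_MIN at compile time. The function name llong_min_portable is made up for illustration, and the check assumes a 64-bit long long, which is what the question implies.

```cpp
#include <climits>
#include <limits>

// Hypothetical name; builds the minimum without ever writing the
// out-of-range literal 9223372036854775808LL.
constexpr long long llong_min_portable() {
    return -9223372036854775807LL - 1;
}

// Assumes long long is 64 bits wide, as in the question.
static_assert(llong_min_portable() == LLONG_MIN,
              "expression form matches LLONG_MIN");
static_assert(llong_min_portable() == std::numeric_limits<long long>::min(),
              "and matches numeric_limits as well");
```

Both operands of the subtraction fit in long long, so no unsigned type is ever involved and no warning is triggered.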
Why is no warning given for the equivalent code for a 32/16/8-bit signed integer constant?
A constant is not limited to 8, 16, or 32 bits. It has the first type in the list that fits, and decimal constants can go up to at least 63 bits.
9223372036854775808LL is outside OP's long long range as 9223372036854775808 takes 64-bits.
The - is applied after the constant is made.
On a int,long,long long as 32,32,64 bit implementation: -2147483648 is type long long, not int.
GCC and Clang both give this warning, so it is clearly intentional behavior and not just 'to make it easier to parse,' as is suggested in response to the bug report. Why?
No comment. The link was not informative. It is best to put the relevant data here.
Is this behavior mandated by the C/C++ standard? Some other standard?
Yes, by the C standard.
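The point that the - is applied only after the constant has been given a type can be checked at compile time. The sketch below is C++11 rather than C and assumes a 32-bit int; on such a platform the literal 2147483648 gets a wider type (long or long long, depending on the system), and negating it keeps that wider type.

```cpp
#include <type_traits>

// Assumption: int is 32 bits wide on this platform.
static_assert(sizeof(int) == 4, "this sketch assumes a 32-bit int");

// The largest int value is typed as int.
static_assert(std::is_same<decltype(2147483647), int>::value,
              "2147483647 fits in int");

// One more does not fit, so the literal gets the next type in the list.
static_assert(!std::is_same<decltype(2147483648), int>::value,
              "2147483648 is wider than int");

// Unary minus is applied afterwards and does not change the type.
static_assert(std::is_same<decltype(-2147483648), decltype(2147483648)>::value,
              "-2147483648 has the same wide type");

// The usual idiom stays within int.
static_assert(std::is_same<decltype(-2147483647 - 1), int>::value,
              "-2147483647 - 1 is an int");
```

No warning appears in the 32-bit case because a wider signed type is still available in the list; for 9223372036854775808LL there is none.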

Is it safe to take the difference of two size_t objects?

I'm investigating a standard for my team around using size_t vs int (or long, etc). The biggest drawback I've seen pointed out is that taking the difference of two size_t objects can cause problems (I'm unsure of specific problems -- maybe something wasn't 2s complemented and the signed/unsigned angers the compiler). I wrote a quick program in C++ using the V120 VS2013 compiler that allowed me to do the following:
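#include <cstddef>
#include <iostream>
int main()
{
    size_t a = 10;
    size_t b = 100;
    int result = a - b;  // unsigned subtraction wraps, then converts to int
    std::cout << result << '\n';
}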
The program resulted in -90, which although correct, makes me nervous about type mismatches, signed/unsigned problems, or just plain undefined behavior if the size_t happens to get used in complex math.
My question is if it's safe to do math with size_t objects, specifically, taking the difference? I'm considering using size_t as a standard for things like indexes. I've seen some interesting posts on the topic here, but they don't address the math issue (or I missed it).
What type for subtracting 2 size_t's?
typedef for a signed type that can contain a size_t?
This is not guaranteed to work portably, but is not UB either. The code must run without error, but the resulting int value is implementation defined. So as long as you are working on platforms that guarantee the desired behavior, this is fine (as long as the difference can be represented by an int of course), otherwise, just use signed types everywhere (see last paragraph).
Subtracting two std::size_ts will yield a new std::size_t† and its value will be determined by wrapping. In your example, assuming 64 bit size_t, a - b will equal 18446744073709551526. This does not fit into a (commonly 32 bit) int, so an implementation defined value is assigned to result.
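The wrapping described here can be verified without assuming a particular width for size_t. This is only an illustrative sketch; wrapped_difference is a hypothetical name. Mathematically, 10 - 100 = -90, and reducing that modulo SIZE_MAX + 1 gives SIZE_MAX - 89, which equals 18446744073709551526 when size_t is 64 bits.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: plain unsigned subtraction, which wraps modulo SIZE_MAX + 1.
constexpr std::size_t wrapped_difference(std::size_t a, std::size_t b) {
    return a - b;
}

// -90 reduced modulo SIZE_MAX + 1 is SIZE_MAX + 1 - 90 = SIZE_MAX - 89.
static_assert(wrapped_difference(10, 100) == SIZE_MAX - 89,
              "10 - 100 wraps to SIZE_MAX - 89");
```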
To be honest, I would recommend to not use unsigned integers for anything but bit magic. Several members of the standard committee agree with me: https://channel9.msdn.com/Events/GoingNative/2013/Interactive-Panel-Ask-Us-Anything 9:50, 42:40, 1:02:50
Rule of thumb (paraphrasing Chandler Carruth from the above video): If you could count it yourself, use int, otherwise use std::int64_t.
†Unless its conversion rank is less than int, e.g. if std::size_t is unsigned short. In that case, the result is an int and everything will work fine (unless int is not wider than short). However, I do not know of any platform that does this.
This would still be platform specific, see first paragraph.
The size_t type is unsigned. The subtraction of any two size_t values is defined behavior.
However, firstly, the numeric result depends on the implementation if a larger value is subtracted from a smaller one, because the wraparound depends on the width of size_t. The result is the mathematical value, reduced to the smallest non-negative residue modulo SIZE_MAX + 1. For instance if the largest value of size_t is 65535, and the result of subtracting two size_t values is -3, then the result will be 65536 - 3 = 65533. On a different compiler or machine with a different size_t, the numeric value will be different.
Secondly, a size_t value might be out of range of the type int. If that is the case, we get a second implementation-defined result arising from the forced conversion. In this situation, any behavior can apply; it just has to be documented by the implementation, and the conversion must not fail. For instance, the result could be clamped into the int range, producing INT_MAX. A common behavior seen on two's complement machines (virtually all) in the conversion of wider (or equal width) unsigned types to narrower signed types is simple bit truncation: enough bits are taken from the unsigned value to fill the signed value, including its sign bit.
Because of the way two's complement works, if the original arithmetically correct abstract result itself fits into int, then the conversion will produce that result.
For instance, suppose that the subtraction of a pair of 64 bit size_t values on a two's complement machine yields the abstract arithmetic value -3, which becomes the positive value 0xFFFFFFFFFFFFFFFD. When this is coerced into a 32 bit int, then the common behavior seen in many compilers for two's complement machines is that the lower 32 bits are taken as the image of the resulting int: 0xFFFFFFFD. And, of course, that is just the value -3 in 32 bits.
So the upshot is, that your code is de facto quite portable because virtually all mainstream machines are two's complement with conversion rules based on sign extension and bit truncation, including between signed and unsigned.
Except that sign extension doesn't occur when a value is widened while converting from unsigned to signed. Thus the one problem is the rare situation in which int is wider than size_t. If a 16 bit size_t result is 65533, due to 4 being subtracted from 1, this will not produce a -3 when converted to a 32 bit int; it will produce 65533!
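Both cases can be imitated with fixed-width types. In this sketch, std::uint16_t and std::uint64_t stand in for a 16 bit and a 64 bit size_t, and the function names are invented for illustration. The narrowing case relies on the common two's complement truncation described above, which is implementation-defined before C++20.

```cpp
#include <cstdint>

// Stand-in for a 16-bit size_t: the difference is wrapped to 16 bits,
// then widened to a 32-bit signed integer. No sign extension happens.
constexpr std::int32_t widen_wrapped_u16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(a - b);
}

// Stand-in for a 64-bit size_t: the wrapped difference is narrowed to
// 32 bits, which on typical two's complement compilers keeps the low bits.
constexpr std::int32_t narrow_wrapped_u64(std::uint64_t a, std::uint64_t b) {
    return static_cast<std::int32_t>(a - b);
}

static_assert(widen_wrapped_u16(1, 4) == 65533, "widening keeps 65533");
```

On mainstream compilers narrow_wrapped_u64(1, 4) evaluates to -3, whereas the widening variant never recovers the negative value.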
If you don't use size_t, you are screwed: size_t is the one type that exists to be used for memory sizes, and which is consequently guaranteed to always be big enough for that purpose. (uintptr_t is quite similar, but it's neither the first such type, nor is it used by the standard libraries, nor is it available without including stdint.h.) If you use an int, you can get undefined behavior when your allocations exceed 2GiB of address space (or 32kiB if you are on a platform where int is only 16 bits!), even though the machine has more memory and you are executing in 64 bit mode.
If you need a difference of size_t that may become negative, use the signed variant ssize_t (a POSIX type, not part of standard C or C++).
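Where ssize_t is not available, one possible alternative is to compute the magnitude in size_t and only then convert to a signed type. The sketch below uses std::ptrdiff_t, which is a substitution on my part rather than something the answers above propose, and signed_difference is a hypothetical name. It assumes the magnitude of the difference does not exceed PTRDIFF_MAX.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: signed difference of two sizes without relying on
// an out-of-range unsigned-to-signed conversion.
std::ptrdiff_t signed_difference(std::size_t a, std::size_t b) {
    // Always subtract the smaller value from the larger, so nothing wraps.
    const std::size_t magnitude = (a >= b) ? a - b : b - a;
    // Assumption: the magnitude is representable in ptrdiff_t.
    assert(magnitude <= static_cast<std::size_t>(PTRDIFF_MAX));
    const std::ptrdiff_t d = static_cast<std::ptrdiff_t>(magnitude);
    return (a >= b) ? d : -d;
}
```

For the values in the question, signed_difference(10, 100) returns -90 on any platform, independent of the widths of size_t and int.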

What does the C++ standard say about results of casting value of a type that lies outside the range of the target type?

Recently I had to perform some data type conversions from float to 16 bit integer. Essentially my code reduces to the following
float f_val = 99999.0;
short int si_val = static_cast<short int>(f_val);
// si_val is now -32768
This input value was a problem and in my code I had neglected to check the limits of the float value so I can see my fault, but it made me wonder about the exact rules of the language when one has to do this kind of ungainly cast. I was slightly surprised to find that value of the cast was -32768. Furthermore, this is the value I get whenever the value of the float exceeds the limits of a 16 bit integer. I have googled this but found a surprising lack of detailed info about it. The best I could find was the following from cplusplus.com
Converting to int from some smaller integer type, or to double from
float is known as promotion, and is guaranteed to produce the exact
same value in the destination type. Other conversions between
arithmetic types may not always be able to represent the same value
exactly:
If the conversion is from a floating-point type to an integer type, the value
is truncated (the decimal part is removed).
The conversions from/to bool consider false equivalent to zero (for numeric
types) and to null pointer (for pointer types); and true equivalent to all
other values.
Otherwise, when the destination type cannot represent the value, the conversion
is valid between numerical types, but the value is
implementation-specific (and may not be portable).
This suggestion that the results are implementation defined does not surprise me, but I have heard that cplusplus.com is not always reliable.
Finally, when performing the same cast from a 32 bit integer to 16 bit integer (again with a value outside of the 16 bit range) I saw results clearly indicating integer overflow. Although I was not surprised by this, it has added to my confusion due to the inconsistency with the cast from float type.
I have no access to the C++ standard, but a lot of C++ people here do so I was wondering what the standard says on this issue? Just for completeness, I am using g++ version 4.6.3.
You're right to question what you've read. The conversion has no defined behaviour, which contradicts what you quoted in your question.
4.9 Floating-integral conversions [conv.fpint]
1 A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates;
that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be
represented in the destination type. [ Note: If the destination type is bool, see 4.12. -- end note ]
One potentially useful permitted result that you might get is a crash.
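Since the conversion is undefined when the truncated value does not fit, the range has to be checked before the cast. The following is a minimal sketch under the assumption of a 16 bit short; fits_in_short and checked_to_short are hypothetical names, and returning a caller-supplied fallback is just one possible policy. The bounds are written as open intervals because truncation toward zero still maps values such as -32768.5 into range.

```cpp
#include <limits>

// Hypothetical helper: true if truncating f yields a value representable in short.
// NaN fails both comparisons and is therefore rejected.
constexpr bool fits_in_short(float f) {
    return f > static_cast<float>(std::numeric_limits<short>::min()) - 1.0f &&
           f < static_cast<float>(std::numeric_limits<short>::max()) + 1.0f;
}

// Hypothetical helper: converts only when the result is defined,
// otherwise returns the fallback chosen by the caller.
constexpr short checked_to_short(float f, short fallback) {
    return fits_in_short(f) ? static_cast<short>(f) : fallback;
}
```

With these helpers the value 99999.0f from the question is rejected instead of silently producing -32768.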

Differences in rounded result when calling pow()

OK, I know that there have been many questions about the pow function and casting its result to int, but I couldn't find an answer to this somewhat specific question.
OK, this is the C code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main()
{
int i = 5;
int j = 2;
double d1 = pow(i,j);
double d2 = pow(5,2);
int i1 = (int)d1;
int i2 = (int)d2;
int i3 = (int)pow(i,j);
int i4 = (int)pow(5,2);
printf("%d %d %d %d",i1,i2,i3,i4);
return 0;
}
And this is the output: "25 25 24 25". Notice that only in the third case, where the arguments to pow are not literals, do we get the wrong result, probably caused by rounding errors. The same thing happens without explicit casting. Could somebody explain what happens in these four cases?
I'm using CodeBlocks on Windows 7, and the MinGW gcc compiler that came with it.
The result of the pow operation is 25.0000 plus or minus some bit of rounding error. If the rounding error is positive or zero, 25 will result from the conversion to an integer. If the rounding error is negative, 24 will result. Both answers are correct.
What is most likely happening internally is that in one case a higher-precision, 80-bit FPU value is being used directly and in the other case, the result is being written from the FPU to memory (as a 64-bit double) and then read back in (converting it to a slightly different 80-bit value). This can make a microscopic difference in the final result, which is all it takes to change a 25.0000000001 to a 24.999999997.
Another possibility is that your compiler recognizes the constants passed to pow and does the calculation itself, substituting the result for the call to pow. Your compiler may use an internal arbitrary-precision math library or it may just use one that's different.
This is caused by a combination of two problems:
The implementation of pow you are using is not high quality. Floating-point arithmetic is necessarily approximate in many cases, but good implementations take care to ensure that simple cases such as pow(5, 2) return exact results. The pow you are using is returning a result that is less than 25 by an amount greater than 0 but less than or equal to 2^-49. For example, it might be returning 25 - 2^-50.
The C implementation you are using sometimes uses a 64-bit floating-point format and sometimes uses an 80-bit floating-point format. As long as the number is kept in the 80-bit format, it retains the complete value that pow returned. If you convert this value to an integer, it produces 24, because the value is less than 25 and conversion to integer truncates; it does not round. When the number is converted to the 64-bit format, it is rounded. Converting between floating-point formats rounds, so the result is rounded to the nearest representable value, 25. After that, conversion to integer produces 25.
The compiler may switch formats whenever it is “convenient” in some sense. For example, there are a limited number of registers with the 80-bit format. When they are full, the compiler may convert some values to the 64-bit format and store them in memory. The compiler may also rearrange expressions or perform parts of them at compile-time instead of run-time, and these can affect the arithmetic performed and the format used.
It is troublesome when a C implementation mixes floating-point formats, because users generally cannot predict or control when the conversions between formats occur. This leads to results that are not easily reproducible and interferes with deriving or controlling numerical properties of software. C implementations can be designed to use a single format throughout and avoid some of these problems, but your C implementation is apparently not so designed.
To add to the other answers here: just generally be very careful when working with floating point values.
I highly recommend reading this paper (even though it is a long read):
http://hal.archives-ouvertes.fr/docs/00/28/14/29/PDF/floating-point-article.pdf
Skip to section 3 for practical examples, but don't neglect the previous chapters!
I'm fairly sure this can be explained by "intermediate rounding" and the fact that pow is not simply looping around j times multiplying by i, but calculating using exp(log(i)*j) as a floating point calculation. Intermediate rounding may well convert 24.999999999996 into 25.000000000 - even arbitrary storing and reloading of the value may cause differences in this sort of behaviour, so depending on how the code is generated, it may make a difference to the exact result.
And of course, in some cases, the compiler may even "know" what pow actually achieves, and replace the calculation with a constant result.
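Two common ways to sidestep the truncation problem are sketched below, in C++ rather than the question's C. Neither comes from the answers above: ipow is a hypothetical integer-only power function (exponentiation by squaring, with overflow left to the caller), and rounded_pow simply rounds the result of std::pow to the nearest integer instead of truncating it.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: integer power by repeated squaring, so no
// floating-point rounding is involved. Overflow is the caller's concern.
long long ipow(long long base, int exp) {
    assert(exp >= 0);  // negative exponents have no integer result in general
    long long result = 1;
    while (exp > 0) {
        if (exp & 1) result *= base;  // multiply in the current bit's factor
        exp >>= 1;
        if (exp > 0) base *= base;    // square only if another bit remains
    }
    return result;
}

// Hypothetical helper: round to nearest instead of truncating toward zero.
long rounded_pow(double base, double exp) {
    return std::lround(std::pow(base, exp));
}
```

Rounding makes a result such as 24.999999999996 come out as 25, while the integer version never produces a fractional intermediate at all.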

C++ why does this not provide the system maximum size for integer?

So, if I understand correctly, an integer is a collection of bytes, it represents numbers in base-two format, if you will.
Therefore, if I have unsigned int test=0, it should really just consist of a field of bits, all of which are zero. However,
unsigned int test=0;
test=~test;
produces -1.
I would've thought that this would've filled all the bits with '1', making the integer as large as it can be on that system....
Thanks for any help!
How do you print the value?
Whether it's displayed as "-1" or as a large unsigned integer is just a matter of how the bits are interpreted when printing them out; the bits themselves don't know the difference.
You need to print it as an unsigned value.
Also, as pointed out by other answers, you're assuming a lot about how the system stores the numbers; there's no guarantee that there's a specific correlation between a number and the bits used to represent that number.
Anyway, the proper way to get this value is to #include <climits> and then just use UINT_MAX.
You're not understanding correctly. An integer represents an integer, and that's it. The specifics of the representation are not part of the standard (with a few exceptions), and you have no business assuming any correlation between bitwise operations and integer values.
(Ironically, what the standard does mandate via modular arithmetic rules is that -1 converted to an unsigned integer is in fact the largest possible value for that unsigned type.)
Update: To clarify, I'm speaking generally for all integral types. If you only use unsigned types (which I assumed you weren't because of your negative answer), you have a well-defined correspondence between bitwise operations and the represented value.
Alternatively you can use:
unsigned int test =0;
test--;
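The equivalence of these approaches for unsigned int can be checked at compile time. This is just an illustrative sketch; all_ones is a hypothetical name. Printing the value with a signed format (for example %d in printf) is what would make it show up as -1.

```cpp
#include <climits>
#include <limits>

// Hypothetical helper: flips every bit of an unsigned zero.
constexpr unsigned int all_ones() {
    return ~0u;
}

// All of these denote the largest unsigned int value.
static_assert(all_ones() == UINT_MAX, "bitwise NOT of 0u is UINT_MAX");
static_assert(static_cast<unsigned int>(-1) == UINT_MAX,
              "-1 converts to UINT_MAX by modular arithmetic");
static_assert(std::numeric_limits<unsigned int>::max() == UINT_MAX,
              "numeric_limits agrees");
static_assert(0u - 1u == UINT_MAX, "decrementing zero wraps around");
```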