Is it safe to cast a UINT64 to a float? I realize that UINT64 does not hold decimals, so my float will be whole numbers. However, my function to return my delta-time returns a UINT64, which isn't a very useful type for the function I'm currently working with. I'm assuming a simple static_cast<float>(uint64value) will not work?
Large values of UINT64 (an 8-byte type) may lose precision if you cast them to a float, which is only 4 bytes: the cast itself is well-defined, but the low-order bits are rounded away.
Define safe - you can easily lose a lot of digits of precision if the 64-bit value is large, but apart from that (which is presumably a known issue that you don't mind about), the conversion should be safe. If your compiler doesn't handle it correctly, get a better compiler.
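To illustrate, here's a minimal sketch of that precision loss (assuming an IEEE 754 float with its 24-bit significand; the example values are mine):

#include <cstdint>
#include <iostream>

int main() {
    std::uint64_t big = (1ULL << 24) + 1;               // 16777217, the first integer a float cannot represent
    float f = static_cast<float>(big);                  // the cast itself is perfectly legal and well-defined
    std::cout << big << " -> "
              << static_cast<std::uint64_t>(f) << "\n"; // prints: 16777217 -> 16777216
}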
You might try performing your arithmetic in a long double or double first:
typedef long double real_type;
real_type x = static_cast<real_type>(long1);
real_type y = static_cast<real_type>(long2);
real_type z = x / y;
float result = static_cast<float>(z);  // cast the value z, not the type name
Rule of thumb: int can be cast to and back from double
It is safe to cast to and back from float, but you will be limited to rather small numbers, about 16 million (2^24), and if you exceed the allowed magnitude you will silently lose low-order precision. With double, you can use much larger integers.
Assuming an IEEE 754 underlying floating point system, you will be able to accurately cast integers of 23 bits to and from float and 52 bits to and from double. Actually, you get one more bit because of the hidden bit, so you can fit an integer up to and including 0x1FFFFFFFFFFFFF, i.e. 9007199254740991, in a double.
So every single 32-bit integer has an exact representation in double; it can be cast to and back safely, and the ordinary arithmetic operations on them will produce exact results.
Indeed, this is what JavaScript does for every integer numeric operation. People who say "floating point is inaccurate" are drastically oversimplifying the matter.
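A minimal sketch of that 2^53 boundary, assuming IEEE 754 doubles (the example is mine):

#include <cstdint>
#include <iostream>

int main() {
    std::uint64_t exact  = (1ULL << 53) - 1;  // 9007199254740991, the largest "safe" integer
    std::uint64_t beyond = (1ULL << 53) + 1;  // the first integer that cannot survive the round trip
    std::cout << (static_cast<std::uint64_t>(static_cast<double>(exact))  == exact)  << "\n";  // prints 1
    std::cout << (static_cast<std::uint64_t>(static_cast<double>(beyond)) == beyond) << "\n";  // prints 0
}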
Safe? What do you mean by safe? As far as precision is concerned, an IEEE-754 float has a 23-bit mantissa (plus one hidden bit). By forcefully converting a 64-bit value into a rounded 24-bit value, you'll inflict a massive loss of precision across the wide range of least-significant bits. Is this loss acceptable in your application? Frankly, if your original value really makes use of the 64-bit range, forcing it into something as small as float doesn't sound like a good idea to me.
Why wouldn't static_cast work?
Max uint64 is 2^64 = 1.84467441 × 10^19.
The largest finite value of a 32-bit IEEE 754 float is about 3.40282 × 10^38 (the 9.999999 × 10^96 figure comes from the decimal32 format described in the link below, not the binary32 float virtually all platforms use), so the whole uint64 range fits comfortably; only precision is lost.
Should work... having problems?
http://en.wikipedia.org/wiki/Decimal32_floating-point_format
Related
I've recently come across some code which has a loop of the form
for (int i = 0; i < 1e7; i++){
}
I question the wisdom of doing this, since 1e7 is of floating-point type and will cause i to be converted when evaluating the stopping condition. Should this be cause for concern?
The elephant in the room here is that the range of an int could be as small as -32767 to +32767, and the behaviour on incrementing an int beyond that maximum (as this loop would do) is undefined.
But, as for your main point: indeed it should concern you, as it is a very bad habit. Things can go wrong because, yes, 1e7 is a floating-point literal of type double.
The fact that i will be converted to a floating-point type due to the promotion rules is somewhat moot: the real damage is done if there is unexpected truncation of the apparent integral literal. By way of a "proof by example", consider first the loop
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 18446744073709551615ULL; ){
    std::cout << i << "\n";
}
This outputs every consecutive value of i in the range, as you'd expect. Note that std::numeric_limits<std::uint64_t>::max() is 18446744073709551615ULL, which is 1 less than the 64th power of 2. (Here I'm using a slide-like "operator" ++< which is useful when working with unsigned types. Many folk consider --> and ++< obfuscating, but in scientific programming they are common, particularly -->.)
Now on my machine, a double is an IEEE 754 64-bit floating point type. (Such a scheme is particularly good at representing powers of 2 exactly: IEEE 754 doubles can represent powers of 2 up to the 1023rd exactly.) So 18,446,744,073,709,551,616 (the 64th power of 2) can be represented exactly as a double. The nearest representable number below it is 18,446,744,073,709,550,592 (which is 1024 less).
So now let's write the loop as
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 1.8446744073709551615e19; ){
    std::cout << i << "\n";
}
On my machine that will only output one value of i: 18,446,744,073,709,550,592 (the number we've already seen). This proves that 1.8446744073709551615e19 is a floating-point type. If the compiler were allowed to treat the literal as an integral type, the output of the two loops would be identical.
It will work, assuming that your int is at least 32 bits.
However, if you really want to use exponential notation, you would do better to define an integer constant outside the loop and use explicit casting, like this:
const int MAX_INDEX = static_cast<int>(1.0e7);
...
for (int i = 0; i < MAX_INDEX; i++) {
    ...
}
Considering this, I'd say it is much better to write
const int MAX_INDEX = 10000000;
or if you can use C++14
const int MAX_INDEX = 10'000'000;
1e7 is a literal of type double, and usually double is 64-bit IEEE 754 format with a 52-bit mantissa. Roughly every tenth power of 2 corresponds to a third power of 10, so double should be able to represent integers up to at least 10^(5*3) = 10^15 exactly. And if int is 32-bit, then int has roughly 10^(3*3) = 10^9 as its max value (more precisely, 2^31 - 1 = 2 147 483 647, i.e. about twice the rough estimate).
So, in practice it's safe on current desktop systems and larger.
But C++ allows int to be just 16 bits, and on e.g. an embedded system with such a small int, one would have Undefined Behavior.
If the intention is to loop for an exact integer number of iterations, for example when iterating over exactly all the elements of an array, then comparing against a floating-point value is maybe not such a good idea, solely for accuracy reasons. Still, since converting a floating-point value to an integer truncates toward zero, there's no real danger of out-of-bounds access; it will just cut the loop short.
Now the question is: when do these effects actually kick in? Will your program experience them? The floating-point representation usually used these days is IEEE 754. As long as no fractional bits are needed, a floating-point value is essentially an integer. C double-precision floats have 52 bits of mantissa, which gives you exact integer precision up to 2^52, on the order of about 1e15. Without the suffix f to request a single-precision literal, a floating-point literal is double precision, and the implicit conversion targets that as well. So as long as your loop end condition is less than 2^52 it will work reliably!
Now one question you have to think about on the x86 architecture is efficiency. The very first 80x87 FPUs came in a separate package, and later on a separate chip, and as a result getting values into the FPU registers is a bit awkward at the x86 assembly level. Depending on your intentions it might make a difference in runtime for a realtime application; but that's premature optimization.
TL;DR: Is it safe to do? Most certainly yes. Will it cause trouble? It could cause numerical problems. Could it invoke undefined behavior? Depends on how you use the loop end condition; but if i is used to index an array and the array length somehow ended up in a floating-point variable, the truncation toward zero means it's not going to cause a logical problem. Is it a smart thing to do? Depends on the application.
I'm trying to write a wrapper for ADO.
A DECIMAL is one of the types a COM VARIANT can hold, when the VARIANT type is VT_DECIMAL.
I'm trying to put it into a C++ native data type while keeping the variable's value.
It seems that the correct type is long double, but I get a "no suitable conversion" error.
For example:
_variant_t v;
...
if (v.vt == VT_DECIMAL)
{
    double d = (double)v;                     // this works, but I'm afraid there can be loss of data...
    long double ld1 = (long double)v;         // error: more than one conversion from variant to long double applies
    long double ld2 = (long double)v.decVal;  // error: no suitable conversion function from DECIMAL to long double exists
}
So my questions are:
is it totally safe to use double to store all possible decimal values?
if not, how can I convert the decimal to a long double?
How to convert a decimal to string? (using the << operator, sprintf is also good for me)
The internal representation of DECIMAL is not a double-precision floating point value; it is an integer with sign/scale fields instead. If you are going to initialize the DECIMAL parts yourself, you should initialize these fields: the 96-bit integer value, the scale, and the sign. Then you get a valid decimal VARIANT value.
DECIMAL on MSDN:
scale - The number of decimal places for the number. Valid values are from 0 to 28. So 12.345 is represented as 12345 with a scale of 3.
sign - Indicates the sign; 0 for positive numbers or DECIMAL_NEG for negative numbers. So -1 is represented as 1 with the DECIMAL_NEG bit set.
Hi32 - The high 32 bits of the number.
Lo64 - The low 64 bits of the number. This is an __int64.
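Putting those fields together, here's a minimal sketch that builds the 12.345 example above by hand (the function name is mine):

#include <windows.h>
#include <oleauto.h>

VARIANT make_12_345(void)
{
    VARIANT v;
    VariantInit(&v);
    v.vt = VT_DECIMAL;
    v.decVal.Lo64  = 12345;  // low 64 bits of the 96-bit integer value
    v.decVal.Hi32  = 0;      // high 32 bits of the 96-bit integer value
    v.decVal.scale = 3;      // three decimal places: 12345 scaled to 12.345
    v.decVal.sign  = 0;      // 0 for positive, DECIMAL_NEG for negative
    return v;
}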
Your questions:
is it totally safe to use double to store all possible decimal values?
You cannot initialize the DECIMAL as a double directly, but you can initialize a double variant (VT_R8) and use the variant conversion API to convert it to VT_DECIMAL. A small rounding may be applied to the value.
if not, how can I convert the decimal to a long double?
How to convert a decimal to string? (using the << operator, sprintf is also good for me)
VariantChangeType can convert decimal variant to variant of another type, including integer, double, string - you provide the type to convert to. Vice versa, you can also convert something different to decimal.
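For instance, a minimal sketch using VariantChangeType (the helper name is mine; it assumes v already holds a VT_DECIMAL):

#include <windows.h>
#include <oleauto.h>
#include <iostream>

void print_decimal(VARIANT* v)
{
    VARIANT d, s;
    VariantInit(&d);
    VariantInit(&s);
    if (SUCCEEDED(VariantChangeType(&d, v, 0, VT_R8)))    // decimal -> double; may round
        std::cout << d.dblVal << "\n";
    if (SUCCEEDED(VariantChangeType(&s, v, 0, VT_BSTR)))  // decimal -> string; exact digits
        std::wcout << s.bstrVal << "\n";
    VariantClear(&s);
    VariantClear(&d);
}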
"Safe" isn't exactly the correct word, the point of DECIMAL is to not introduce rounding errors due to base conversions. Calculations are done in base 10 instead of base 2. That makes them slow but accurate, the kind of accuracy that an accountant likes. He won't have to chase a billionth-of-a-penny mismatches.
Use _variant_t::ChangeType() to make conversions. Pass VT_R8 to convert to double precision. Pass VT_BSTR to convert to a string, the kind that the accountant likes. No point in chasing long double, that 10-byte FPU type is history.
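A short sketch of that suggestion (assuming v holds a VT_DECIMAL; ChangeType throws _com_error on failure):

#include <comdef.h>

void convert_decimal(const _variant_t& v)
{
    _variant_t asDouble(v);
    asDouble.ChangeType(VT_R8);            // double precision; a small rounding may occur
    double d = asDouble.dblVal;

    _variant_t asString(v);
    asString.ChangeType(VT_BSTR);          // exact decimal digits, the kind the accountant likes
    _bstr_t text(asString.bstrVal, true);  // copy the BSTR so both own their storage
}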
This snippet is taken from http://hackage.haskell.org/package/com-1.2.1/src/cbits/AutoPrimSrc.c
Hackage.org says:
Hackage is the Haskell community's central package archive of open source software.
but please check the author's permissions
void writeVarWord64( unsigned int hi, unsigned int lo, VARIANT* v )
{
    ULONGLONG r;
    r = (ULONGLONG)hi;
    r <<= 32;                /* move hi into the upper 32 bits (the snippet's >>= 32 would zero it) */
    r += (ULONGLONG)lo;
    if (!v) return;
    VariantInit(v);
    v->vt = VT_DECIMAL;
    v->decVal.Lo64 = r;
    v->decVal.Hi32 = 0;
    v->decVal.sign = 0;
    v->decVal.scale = 0;
}
If I understood Microsoft's documentation (https://msdn.microsoft.com/en-us/library/cc234586.aspx) correctly, VT_DECIMAL is an exact scaled 96-bit integer value. In that case you can't store it without loss of information in a float, a double or a 64-bit integer variable.
Your best bet would be to store it in a 128-bit integer like __int128, but I don't know the level of compiler support for it. I'm also not sure you'd be able to just cast one to the other without resorting to some bit manipulation.
Is it totally safe to use double to store all possible decimal values?
It actually depends on what you mean by safe. If you mean "is there any risk of introducing some degree of conversion imprecision?", then yes, there is. The internal representations are far too different to guarantee perfect conversion, and conversion noise is likely to be introduced.
How can I convert the decimal to a long double / a string?
It depends (again) on what you want to do with the object:
For floating-point computation, see #Gread.And.Powerful.Oz's link to the following answer: C++ converting Variant Decimal to Double Value
For display, see MSDN documentation on string conversion
For storage without any conversion imprecision, you should probably store the decimal as a scaled integer: the 96-bit mantissa (which needs more than a single long long, e.g. the raw Hi32/Lo64 halves or a 128-bit integer type) paired with a short holding the number of digits to the right of the decimal point, as sketched below. This representation is as close as possible to the decimal's internal representation, will not introduce any conversion imprecision, and won't waste CPU resources on integer-to-string formatting.
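Here's a minimal sketch of that storage scheme (the type and field names are mine):

#include <cstdint>
#include <windows.h>

struct RawDecimal {
    std::uint64_t lo64;     // low 64 bits of the 96-bit mantissa
    std::uint32_t hi32;     // high 32 bits of the 96-bit mantissa
    std::uint8_t  scale;    // digits to the right of the decimal point (0..28)
    bool          negative;
};

RawDecimal capture(const DECIMAL& d)
{
    RawDecimal r = { d.Lo64, d.Hi32, d.scale, (d.sign & DECIMAL_NEG) != 0 };
    return r;
}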
If I have an int, convert it to a double, then convert the double back to an int, am I guaranteed to get the same value back that I started with? In other words, given this function:
int passThroughDouble(int input)
{
    double d = input;
    return d;
}
Am I guaranteed that passThroughDouble(x) == x for all ints x?
No it isn't. The standard says nothing about the relative sizes of int and double.
If int is a 64-bit integer and double is the standard IEEE double-precision, then it will already fail for numbers bigger than 2^53.
That said, int is still 32-bit on the majority of environments today. So it will still hold in many cases.
If we restrict consideration to the "traditional" IEEE-754-style representation of floating-point types, then you can expect this conversion to be value-preserving if and only if the mantissa of type double has at least as many bits as there are non-sign bits in type int.
The mantissa of a classic IEEE-754 double is 53 bits wide (including the "implied" leading bit), which means that you can represent integers in the [-2^53, +2^53] range precisely. Everything out of this range will generally lose precision.
So, it all depends on how wide your int is compared to your double. The answer depends on the specific platform. With 32-bit int and IEEE-754 double the equality should hold.
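A quick platform check along those lines, using only standard macros (a sketch; the output labels are mine):

#include <cfloat>
#include <climits>
#include <iostream>

int main() {
    // Value-preserving iff double's mantissa covers all of int's non-sign bits.
    const int int_value_bits = sizeof(int) * CHAR_BIT - 1;
    std::cout << "int value bits:       " << int_value_bits << "\n";
    std::cout << "double mantissa bits: " << DBL_MANT_DIG << "\n";  // 53 for IEEE 754
    std::cout << "round trip exact:     " << (int_value_bits <= DBL_MANT_DIG) << "\n";
}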
I have two 32 bit integers i1,i2 which I need to convert to floats f1,f2 in such a way that their relative ordering is preserved (i.e. i1 < i2 => f1 < f2)
Will a reinterpret_cast do the trick? Is there some better way?
If the integer values are less than 2^24, just convert the values:
float f1 = i1, f2 = i2;
For larger values, you will lose precision and two distinct integers may convert to the same floating point value.
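For example, a minimal sketch of two distinct ints collapsing to one float (the values are mine):

#include <iostream>

int main() {
    int i1 = 16777219;                  // 2^24 + 3, not representable as a float
    int i2 = 16777220;                  // 2^24 + 4, exactly representable
    float f1 = static_cast<float>(i1);  // rounds to 16777220.0f
    float f2 = static_cast<float>(i2);  // exactly 16777220.0f
    std::cout << (i1 < i2) << " " << (f1 < f2) << "\n";  // prints: 1 0
}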
On the other hand, you could copy the bit pattern. If your floats are IEEE 754, then this requires that the sign bits agree and that neither integer represents some form of NaN. (If the sign bits do not agree, you must beware of -0.f == +0.f.) To copy the binary representation:
#include <algorithm>  // for std::copy

float f1;
std::copy(reinterpret_cast<const char*>(&i1),
          reinterpret_cast<const char*>(&i1) + 4,
          reinterpret_cast<char*>(&f1));
An integer inherently stores more distinct values in the same bit width than a float does on a 32-bit machine, because a float reserves encodings for NaNs and infinities. So in short, it cannot be done.
int range: -2,147,483,648 to 2,147,483,647
float precision: 7 digits
I think that it would be possible if the nature of the problem somehow limits the range of the integer values. Otherwise, use a double-precision value: it has 15-16 decimal digits of mantissa.
Keep in mind that in C++ the int type can have different range depending on your native pointer size. On a 16-bit machine, int range is -32k to +32k.
Also, keep in mind that there's no promise of correctness for two (binary) least-significant bits, even in a cast-to-float scenario.
http://steve.hollasch.net/cgindex/coding/ieeefloat.html
When you cast an int to a float, the value is rounded to the nearest representable float. That rounding is monotonic, so order is preserved in the weak sense (i1 < i2 implies f1 <= f2); but once the values exceed 2^24, two distinct ints can round to the same float, so strict ordering is not guaranteed.
A reinterpret_cast cannot be used for this purpose either: it only applies to pointers and references, i.e. it lets you view an object as a kind of "flat" memory representation, copying the bit pattern instead of converting the value.
I am trying to convert a char* to double and back to char* again. The following code works fine if the application is built as 32-bit, but doesn't work for a 64-bit application. The problem occurs when you try to convert back to char* from the integer. For example, if hello = 0x000000013fcf7888 then converted is 0x000000003fcf7888; only the last 32 bits are right.
#include <iostream>
#include <stdlib.h>
#include <tchar.h>
using namespace std;

int _tmain(int argc, _TCHAR* argv[]){
    char* hello = "hello";
    unsigned int hello_to_int = (unsigned int)hello;
    double hello_to_double = (double)hello_to_int;
    cout << hello << endl;
    cout << hello_to_int << "\n" << hello_to_double << endl;
    unsigned int converted_int = (unsigned int)hello_to_double;
    char* converted = reinterpret_cast<char*>(converted_int);
    cout << converted_int << "\n" << converted << endl;
    getchar();
    return 0;
}
On 64-bit Windows pointers are 64-bit while int is 32-bit. This is why you're losing data in the upper 32 bits when casting. Instead of int, use long long to hold the intermediate result.
char* hello = "hello";
unsigned long long hello_to_int = (unsigned long long)hello;
Make similar changes for the reverse conversion. But this is not guaranteed to make the conversions function correctly, because a double can easily represent the entire 32-bit integer range without loss of precision, while the same is not true for a 64-bit integer.
Also, this isn't going to work
unsigned int converted_int = (unsigned int)hello_to_double;
That conversion will simply truncate any digits after the decimal point in the floating-point representation, and the problem exists even if you change the data type to unsigned long long. You'd need to reinterpret the bits instead, e.g. reinterpret_cast<unsigned long long&>(hello_to_double), to make it work.
Even after all that you may still run into trouble depending on the value of the pointer. The conversion to double may cause the value to be a signalling NaN, for instance, in which case your code might throw an exception.
The simple answer is: unless you're trying this out for fun, don't do conversions like these.
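If the goal is only to round-trip the pointer, the usual route avoids floating point entirely; a minimal sketch (this uses std::uintptr_t, my suggestion rather than anything from the code above):

#include <cstdint>
#include <iostream>

int main() {
    const char* hello = "hello";
    std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(hello);  // wide enough for any data pointer
    const char* back = reinterpret_cast<const char*>(bits);         // exact round trip, 32- or 64-bit
    std::cout << back << "\n";                                      // prints: hello
}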
You can't cast a char* to int on 64-bit Windows because an int is 32 bits, while a char* is 64 bits because it's a pointer. Since a double is always 64 bits, you might be able to get away with casting between a double and char*.
A couple of issues with encoding any integer (specifically, a collection of bits) into a floating point value:
Conversions from 64-bit integers to doubles can be lossy. A double has 53 bits of actual precision, so integers above 2^53 will not necessarily be represented precisely.
If you decide to reinterpret the bits of a pointer as a double instead (via a union or reinterpret_cast), you will still have issues if you happen to encode a pointer as a set of bits that is not a valid double representation. Unless you can guarantee that the double value never gets written back by the FPU, the FPU can silently transform an invalid double into another invalid double (see NaN), i.e. a double value that represents the same value but has different bits. (See this for issues related to using floating point formats as bits.)
You can probably safely get away with encoding a 32-bit pointer in a double, as that will definitely fit within the 53-bit precision range.
only the last 32 bits are right.
That's because an int in your platform is only 32 bits long. Note that reinterpret_cast only guarantees that you can convert a pointer to an int of sufficient size (not your case), and back.
If it works on any system, anywhere, just call yourself lucky and move on. Converting a pointer to an integer is one thing (as long as the integer is large enough, you can get away with it), but a double is a floating-point number; what you are doing simply doesn't make sense, because a double is NOT necessarily capable of representing any arbitrary number. A double has range and precision limitations, and limits on how it represents things. It can represent numbers across a wide range of values, but it can't represent EVERY number in that range.
Remember that a double has two components: the mantissa and the exponent. Together, these allow you to represent either very big or very small numbers, but the mantissa has a limited number of bits. If you run out of bits in the mantissa, you're going to lose some bits of the number you are trying to represent.
Apparently you got away with it under certain circumstances, but you're asking it to do something it wasn't made for, and for which it is manifestly inappropriate.
Just don't do that - it's not supposed to work.
This is as expected.
Typically a char* is going to be 32 bits on a 32-bit system, 64 bits on a 64-bit system; double is typically 64 bits on both systems. (These sizes are typical, and probably correct for Windows; the language permits a lot more variations.)
Conversion from a pointer to a floating-point type is, as far as I know, undefined. That doesn't just mean that the result of the conversion is undefined; the behavior of a program that attempts to perform such a conversion is undefined. If you're lucky, the program will crash or fail to compile.
But you're converting from a pointer to an integer (which is permitted, but implementation-defined) and then from an integer to a double (which is permitted and meaningful for meaningful numeric values -- but converted pointer values are not numerically meaningful). You're losing information because not all of the 64 bits of a double are used to represent the magnitude of the number; typically 11 or so bits are used to represent the exponent.
What you're doing quite simply makes no sense.
What exactly are you trying to accomplish? Whatever it is, there's surely a better way to do it.