How to obtain hexadecimal value of -nan - c++

I am writing some code with Intel intrinsics and did this:
#include <iostream>
#include <xmmintrin.h>

int main() {
    float data[4];
    __m128 val1 = _mm_set_ps1(2);
    __m128 val2 = _mm_set_ps1(1);
    val1 = _mm_cmpgt_ps(val1, val2);
    _mm_store_ps(data, val1);
    std::cout << std::hex << data[0];
}
I am trying to get the hexadecimal value of "true" in SSE intrinsics (which prints as -nan), but I only keep getting -nan as "the hexadecimal value" whenever I try to print it.
I also tried using std::oct and std::dec and neither of those worked.
I also tried comparing 0xFFFFFFFF and data[0] in different combinations and got this:
float data[4];
__m128 val1 = _mm_set_ps1(2);
__m128 val2 = _mm_set_ps1(1);
val1 = _mm_cmpgt_ps(val1, val2);
_mm_store_ps(data, val1);
float f = 0xFFFFFFFF;
float g = 0xFFFFFFFF;
std::cout << std::dec << (data[0] == f) << "\n"; // Prints "0"
std::cout << std::dec << (data[0] == data[0]) << "\n"; // Prints "0"
std::cout << std::dec << (f == g); // Prints "1"
Is there any way for me to print the hexadecimal value of -nan and if not, can somebody please tell me the binary, hexadecimal, etc. value of -nan?

Per the IEEE 754 specification, a NaN is a floating-point value that has all of its exponent bits set to "1" and a non-zero significand (all exponent bits set with a zero significand is an infinity).
So a value with all of its bits set to "1" is also a NaN.
If you want to see the raw bytes, just print the raw bytes:
#include <cmath>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

template<typename T>
std::string get_hex_bytes(T x) {
    std::stringstream res;
    auto p = reinterpret_cast<const unsigned char*>(&x);
    for (std::size_t i = 0; i < sizeof(x); ++i) {
        if (i)
            res << ' ';
        res << std::setfill('0') << std::setw(2) << std::hex << (int)p[i];
    }
    return res.str();
}

int main() {
    float data = NAN;
    std::cout << get_hex_bytes(data) << std::endl;
}
On a little-endian machine this will print something like:
00 00 c0 ff
P.S. float f = 0xFFFFFFFF; will not set all of the bits to "1"; it simply converts the integer 0xFFFFFFFF (4294967295) to the nearest representable float, which is 4294967296.0f, so precision is lost and the bit patterns do not match at all.
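For illustration, a minimal sketch contrasting that value conversion with copying the bit pattern (the exact NaN spelling printed is implementation-defined):
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    // Value conversion: 4294967295 rounds to the nearest float, 4294967296.0f.
    float converted = 0xFFFFFFFF;

    // Bit copy: reinterpret the all-ones pattern as a float (a NaN).
    std::uint32_t bits = 0xFFFFFFFFu;
    float reinterpreted;
    std::memcpy(&reinterpreted, &bits, sizeof reinterpreted);

    std::cout << converted << "\n";      // 4.29497e+09
    std::cout << reinterpreted << "\n";  // typically -nan
}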

As the manual says, _mm_cmpgt_ps (which is really cmpps with a specific comparison predicate),
Performs a SIMD compare of the packed single-precision floating-point values in the source operand (second
operand) and the destination operand (first operand) and returns the results of the comparison to the destination
operand. The comparison predicate operand (third operand) specifies the type of comparison performed on each of
the pairs of packed values. The result of each comparison is a doubleword mask of all 1s (comparison true) or all
0s (comparison false). The sign of zero is ignored for comparisons, so that –0.0 is equal to +0.0.
(emphasis added)
"All 1s", or 0xFFFFFFFF in hexadecimal (since it's 32 bits per element), has the sign bit set (so there is a legitimate reason to print a - sign in front of whatever else this number might be) and since the exponent is all ones and the significand is not zero, it is also a NaN. The NaN-ness usually isn't very relevant, the main intended use for this result is as a mask in bitwise operations (eg _mm_and_ps, _mm_blendv_ps, etc), which do not care about the special semantics of NaN.

First of all, there's no such thing as a "negative nan". nan is, by definition, Not a Number. You can't negate it. -nan is the same sort of thing as nan.
There's no exactly standards-compliant way to get at the underlying bits of a floating-point value, but the closest thing is memcpy. Simply memcpy the float or double into an equivalently-sized unsigned integer type, then print that with std::hex active.
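For illustration, a minimal sketch of that approach (with the all-ones mask from the question, the printed value would be 0xffffffff):
#include <cstdint>
#include <cstring>
#include <iostream>
#include <limits>

int main() {
    float f = -std::numeric_limits<float>::quiet_NaN();  // prints as -nan on many platforms

    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined way to inspect the bits

    std::cout << f << " = 0x" << std::hex << bits << "\n";  // e.g. -nan = 0xffc00000
}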

Related

Largest uint64 which can be accurately represented in a float in C/C++ [duplicate]

I understand that floating point precision has only so many bits. It comes as no surprise that the following code thinks that (float)(UINT64_MAX) and (float)(UINT64_MAX - 1) are equal. I am trying to write a function which would detect this type of, for lack of a better term, "conversion overflow". I thought I could somehow use FLT_MAX but that's not correct. What's the right way to do this?
#include <iostream>
#include <cstdint>

int main()
{
    uint64_t x1(UINT64_MAX);
    uint64_t x2(UINT64_MAX - 1);
    float f1(static_cast<float>(x1));
    float f2(static_cast<float>(x2));
    std::cout << f1 << " == " << f2 << " = " << (f1 == f2) << std::endl;
    return 0;
}
Largest uint64 which can be accurately represented in a float
What's the right way to do this?
When FLT_RADIX == 2, we are looking for a uint64_t of the form below, where n is the maximum number of significand bits encodable in a float value. This is usually 24; see FLT_MANT_DIG from <float.h>.
111...(total of n binary digits)...111 000...(64-n bits, all zero)...000
With n == 24 that is 0xFFFFFF0000000000, in decimal 18446742974197923840, which can be computed as:
~((1ull << (64 - FLT_MANT_DIG)) - 1)
The following function gives you the highest integer exactly representable in a floating point type such that all smaller positive integers are also exactly representable.
#include <cmath>
#include <limits>

template<typename T>
T max_representable_integer()
{
    return std::scalbn(T(1.0), std::numeric_limits<T>::digits);
}
It does the computation in the floating-point type because, for some types, the result may not be representable in a uint64_t.
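Combining the two answers, a minimal sketch of the requested "conversion overflow" check (the helper name converts_exactly is hypothetical): a uint64_t converts to float without loss exactly when the conversion round-trips, with a guard so the conversion back stays in range.
#include <cstdint>
#include <cmath>
#include <iostream>

// Hypothetical helper: true if x converts to float without loss,
// i.e. the conversion round-trips back to the same uint64_t.
bool converts_exactly(uint64_t x)
{
    float f = static_cast<float>(x);
    // Guard: converting f back is only defined if it fits in uint64_t.
    // 2^64 as a float is std::ldexp(1.0f, 64).
    if (!(f < std::ldexp(1.0f, 64)))
        return false;
    return static_cast<uint64_t>(f) == x;
}

int main()
{
    std::cout << converts_exactly(1u << 24) << "\n";        // 1 (16777216 is exact)
    std::cout << converts_exactly((1u << 24) + 1) << "\n";  // 0 (16777217 is not)
    std::cout << converts_exactly(UINT64_MAX) << "\n";      // 0
}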

Bit representation of float using an int pointer

I have the following exercise:
Implement a function void float_to_bits(float x) which prints the bit representation of x. Hint: Casting a float to an int truncates the fractional part, but no information is lost casting a float pointer to an int pointer.
Now, I know that a float is represented by a sign-bit, some bits for its mantissa, some bits for the basis and some bits for the exponent. It depends on my system how many bits are used.
The problem we are facing here is that our number basically has two parts. Let's consider 8.7; the bit representation of this number would be (to my understanding) the following: 1000.0111
Now, floats are stored with a leading zero, so 8.8 would become 0.88*10^1.
So I somehow have to get all the information out of my memory. I don't really see how I should do that. What should that hint point me to? What's the difference between an integer pointer and a float pointer?
Currently I have this:
void float_to_bits() {
    float a = 4.2345678f;
    int* b;
    b = (int*)(&a);
    *b = a;
    std::cout << *(b) << "\n";
}
But I really don't get the bigger picture behind the hint here. How do I get the mantissa, the exponent, the sign and the basis? I also tried playing around with the bitwise operators >> and <<, but I just don't see how they should help me here, since they won't change the pointer's position. They're useful to get, e.g., the bit representation of an integer, but that's about it; I have no idea what use they'd be here.
The hint your teacher gave is misleading: casting pointers between different types is at best implementation-defined. However, memcpy()ing an object to a suitably sized array of unsigned char is well-defined. The content of the resulting array can then be decomposed into bits. Here is a quick hack to represent the bits using hexadecimal values:
#include <iostream>
#include <iomanip>
#include <cstring>

int main() {
    float f = 8.7f;
    unsigned char bytes[sizeof(float)];
    std::memcpy(bytes, &f, sizeof(float));
    std::cout << std::hex << std::setfill('0');
    for (int b: bytes) {
        std::cout << std::setw(2) << b;
    }
    std::cout << '\n';
}
Note that IEEE 754 binary floating-point formats do not store the full significand (the standard doesn't use mantissa as a term) except for denormalized values: 32-bit floats store
1 bit for the sign
8 bits for the exponent
23 bits for the normalized significand with the non-zero high bit being implied
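For illustration, a minimal sketch (using the memcpy approach rather than the pointer cast) that extracts those three fields with shifts and masks:
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    float f = 8.7f;
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);            // well-defined, unlike the pointer cast

    unsigned sign     = (u >> 31) & 0x1;      // 1 bit
    unsigned exponent = (u >> 23) & 0xFF;     // 8 bits, biased by 127
    unsigned fraction = u & 0x7FFFFF;         // 23 bits, leading 1 implied for normal values

    std::cout << "sign "     << sign << "\n"
              << "exponent " << exponent << " (unbiased " << (int)exponent - 127 << ")\n"
              << "fraction 0x" << std::hex << fraction << "\n";
}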
The hint directs you how to view the float's bits as an integer without going through a value conversion.
When you assign a floating-point value to an integer, the processor removes the fractional part: int i = (int) 4.502f; will result in i = 4;
but when you make an int pointer (int*) point to a float's location, no conversion is made, and none is made when you read the value through the int* either.
To show the representation, I like seeing hex numbers, which is why my first example was given in hex (each hexadecimal digit represents 4 binary digits).
But it is also possible to print as binary, and there are many ways (I like this one best!).
An annotated example follows:
Also available on Coliru.
#include <iostream>
#include <bitset>
using namespace std;

int main()
{
    float a = 4.2345678f;      // allocate space for a float, call it 'a', and put the value 4.2345678f in it
    unsigned int* b;           // allocate space for a pointer (address); hint to the compiler that it will point to an integer
    b = (unsigned int*)(&a);   // take the float 'a' and get its address with '&'
    // By default that address points at a float (float*), so you cast it to (unsigned int*).
    // Bottom line: set 'b' to the address of 'a', but treat that address as the address of an int.
    // The hint implied that this won't cause a value conversion:
    //   int someInt = a;   // would cause someInt = 4, same as your line below:
    //   *b = a;            // <<<< this was your error.
    // 1st: it isn't required, as 'b' already points to a's address, hence has its value.
    // 2nd: it sets the value pointed to by 'b' to 'a' (including conversion to int = 4);
    //      the value in 'a' actually changes too by this instruction.
    cout << a << " in binary " << bitset<32>(*b) << endl;
    cout << "Sign     " << bitset<1>(*b >> 31) << endl;  // 1 bit  (31)
    cout << "Exp      " << bitset<8>(*b >> 23) << endl;  // 8 bits (23-30)
    cout << "Mantissa " << bitset<23>(*b) << endl;       // 23 bits (0-22)
}

convert hexadecimal string to binary and separate into bits in C++

I need to convert a hexadecimal string to binary and then pass the bits into different variables.
For example, my input is:
std::string hex = "E136";
How do I convert the string into binary output 1110 0001 0011 0110?
After that I need to pass the bit 0 to variable A, bits 1-9 to variable B and bits 10-15 to variable C.
Thanks in advance
How do I convert the string [...]?
Start with a result value of zero, then for each character (starting at the first, which is the most significant one) determine its value (in the range [0:15]), multiply the result accumulated so far by 16 and add the current digit's value. For your given example, this results in
(((0 * 16 + v('E')) * 16 + v('1')) * 16 + v('3')) * 16 + v('6')
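For illustration, a minimal sketch of that accumulation loop, with v() written out as a small helper (assuming valid hex input; std::strtoul below does the same job):
#include <iostream>
#include <string>

// Maps a single hex digit to its value 0..15.
unsigned v(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return c - 'A' + 10;   // assumes valid input
}

int main()
{
    std::string hex = "E136";
    unsigned long value = 0;
    for (char c : hex)
        value = value * 16 + v(c);   // accumulate, most significant digit first
    std::cout << value << "\n";      // 57654
}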
There are standard library functions doing the stuff for you, such as std::strtoul:
char* end;
unsigned long value = strtoul(hex.c_str(), &end, 16);
// ^^ base!
The end pointer is useful to check whether you have read the entire string:
if(*end == 0)
{
    // end of string reached
}
else
{
    // some part of the string was left; you might consider this
    // an error (could occur if e.g. "f10s12" was passed, then
    // end would point to the 's')
}
If you don't care for end checking, you can just pass nullptr instead.
Don't convert back to a string afterwards; you can get the required values by masking (&) and bit-shifting (>>), e.g. getting bits 1-9:
uint32_t b = value >> 1 & 0x1ffU;
Working on integral types is much more efficient than working on strings. Only when you want to print out the final result should you convert back to a string (if using a std::ostream, operator<< already does the work for you...).
While playing with this sample, I realized that I gave a wrong recommendation:
std::setbase(2) does not work; the standard only supports bases 8, 10, and 16 there. Ouch! (SO: Why doesn't std::setbase(2) switch to binary output?)
For conversion of numbers to a string of binary digits, something else must be used. I made this small sample. Though the separation of bits is covered as well, my main focus was on output with different bases (and IMHO it is worth another answer):
#include <algorithm>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <string>

std::string bits(unsigned value, unsigned w)
{
    std::string text;
    for (unsigned i = 0; i < w || value; ++i) {
        text += '0' + (value & 1); // bit -> character '0' or '1'
        value >>= 1;               // shift right one bit
    }
    // text is right to left -> must be reversed
    std::reverse(text.begin(), text.end());
    // done
    return text;
}

void print(const char *name, unsigned value)
{
    std::cout
        << name << ": "
        // decimal output
        << std::setbase(10) << std::setw(5) << value
        << " = "
        // binary output
#if 0 // OLD, WRONG:
        // std::setbase(2) is not supported by the standard - Ouch!
        << "0b" << std::setw(16) << std::setfill('0') << std::setbase(2) << value
#else // NEW:
        << "0b" << bits(value, 16)
#endif // 0
        << " = "
        // hexadecimal output
        << "0x" << std::setw(4) << std::setfill('0') << std::setbase(16) << value
        << '\n';
}

int main()
{
    std::string hex = "E136";
    unsigned value = strtoul(hex.c_str(), nullptr, 16);
    print("hex", value);
    // bit 0 -> a
    unsigned a = value & 0x0001;
    // bits 1 ... 9 -> b
    unsigned b = (value & 0x03FE) >> 1;
    // bits 10 ... 15 -> c
    unsigned c = (value & 0xFC00) >> 10;
    // report
    print(" a ", a);
    print(" b ", b);
    print(" c ", c);
    // done
    return 0;
}
Output:
hex: 57654 = 0b1110000100110110 = 0xe136
a : 00000 = 0b0000000000000000 = 0x0000
b : 00155 = 0b0000000010011011 = 0x009b
c : 00056 = 0b0000000000111000 = 0x0038
Live Demo on coliru
Concerning the bit operations:
The binary bitwise AND operator (&) is used to set all unintended bits to 0. The second value can be understood as a mask. It would be more obvious if I had used binary numbers, but binary literals are only supported since C++14. Hex codes do nearly as well, as a hex digit always represents the same pattern of 4 bits (since 16 = 2^4). After some practice, you usually learn to "see" the bits in the hex code.
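For illustration, assuming a C++14 or later compiler, the same masks can be written as binary literals, which makes the selected bit ranges explicit:
#include <cstdlib>
#include <iostream>

int main()
{
    unsigned value = std::strtoul("E136", nullptr, 16);
    unsigned a =  value & 0b0000000000000001;         // bit 0
    unsigned b = (value & 0b0000001111111110) >> 1;   // bits 1 ... 9
    unsigned c = (value & 0b1111110000000000) >> 10;  // bits 10 ... 15
    std::cout << a << " " << b << " " << c << "\n";   // 0 155 56
}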
About the right shift (>>), I was not quite sure. The OP didn't require that the bits be moved anywhere, only that they be separated into distinct variables, so these right shifts might be unnecessary.
So, this question, which seemed to be trivial, led to a surprising enlightenment (for me).

Why (int)pow(2, 32) == -2147483648

On the Internet I found the following problem:
int a = (int)pow(2, 32);
cout << a;
What does it print on the screen?
At first I thought it would print 0,
but after I wrote the code and executed it, I got -2147483648. Why?
Also I noticed that even (int)(pow(2, 32) - pow(2, 31)) equals -2147483648.
Can anyone explain why (int)pow(2, 32) equals -2147483648?
Assuming int is 32 bits (or less) on your machine, this is undefined behavior.
From the standard, conv.fpint:
A prvalue of a floating-point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.
Most commonly int is 32 bits, and it can represent values in the interval [-2^31, 2^31-1], which is [-2147483648, 2147483647]. The result of std::pow(2, 32) is a double that represents the exact value 2^32. Since 2^32 exceeds the range that can be represented by int, the conversion attempt is undefined behavior, which means the result could be anything.
The same goes for your second example: pow(2, 32) - pow(2, 31) is simply the double representation of 2^31, which (just barely) exceeds the range that can be represented by a 32-bit int.
The correct way to do this would be to convert to a large enough integral type, e.g. int64_t:
std::cout << static_cast<int64_t>(std::pow(2, 32)) << "\n"; // prints 4294967296
The behavior you are seeing relates to the use of two's complement to represent
signed integers. For 3-bit numbers the range of values is [-4, 3]. For 32-bit numbers it ranges from -(2^31) to (2^31)-1 (i.e. -2147483648 to 2147483647).
This is because the result of the operation overflows the int data type, as it exceeds its maximum value, so don't cast to int; cast it to long (on a platform where long is 64 bits) instead:
#include <iostream>
#include <cmath>
#include <climits>
using namespace std;

int main() {
    cout << (int)pow(2, 32) << endl;
    // 2147483647
    cout << INT_MIN << endl;
    // -2147483648
    cout << INT_MAX << endl;
    // 2147483647
    cout << (long)pow(2, 32) << endl;
    // 4294967296
    cout << LONG_MIN << endl;
    // -9223372036854775808
    cout << LONG_MAX << endl;
    // 9223372036854775807
    return 0;
}
If you are not aware of how int overflow works, it is worth reading up on signed integer overflow.

conversion of double to string to double throws exception

The following code throws an std::out_of_range exception in Visual Studio 2013 where in my opinion it shouldn't:
#include <string>
#include <limits>

int main(int argc, char ** argv)
{
    double maxDbl = std::stod(std::to_string(std::numeric_limits<double>::max()));
    return 0;
}
I tested the code also with gcc 4.9.2 and there it does not throw an exception. The issue seems to be caused by an inaccurate string representation after the conversion to string. In Visual Studio std::to_string(std::numeric_limits<double>::max()) yields
179769313486231610000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000
which indeed seems too large. In gcc, however, it yields
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
which seems to be smaller than the passed value.
However, isn't std::numeric_limits<double>::max() supposed to return the
maximum finite representable floating-point number?
So why do the string representations get off? What am I missing here?
Direct answer
Gcc (and Clang and VS2015) correctly return the integer value of (2^1024 - 1) - (2^(1024-53) - 1), which is what is represented with 52 one bits of significand and an unbiased exponent of 1023 (2^1024 - 1 would be the integer value with 1024 one bits, and I just subtract all the bits below the 52 of the IEEE 754 format).
I can confirm that a large integer library give 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368L
The previous exactly representable double would be 2^971 less (971 = 1023 - 52), that is:
179769313486231550856124328384506240234343437157459335924404872448581845754556114388470639943126220321960804027157371570809852884964511743044087662767600909594331927728237078876188760579532563768698654064825262115771015791463983014857704008123419459386245141723703148097529108423358883457665451722744025579520L
The next value up, 2^971 greater, is no longer representable; it would be:
179769313486231590772930519078902473361797697894230657273430081157732675805500963132708477322407536021120113879871393357658789768814416622492847430639474124377767893424865485276302219601246094119453082952085005768838150682342462881473913110540827237163350510684586298239947245938479716304835356329624224137216L
But the value used by MSVC 2013 and earlier is close to 2^1024 + 2^971, that is:
179769313486231610731333614426100589925524828262616317947942685512308090830973387504827396012048193870699768806228404251083258210739369062217227314575410731769485876273179688476358949112102859294830297395714877595371718127781702814782017661749531126051903195165027873311156314696040132728420308633064323416064L
As it is greater than any value representable in IEEE 754 double precision, it cannot be decoded back to a double.
At most, one could say that any value between 2^1024 - 2^971 (std::numeric_limits<double>::max()) and 2^1024 could be rounded to std::numeric_limits<double>::max(), but values greater than 2^1024 are clearly an overflow.
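For reference, a minimal sketch (assuming IEEE 754 doubles) verifying those relations: max equals (2 - 2^-52) * 2^1023, i.e. 2^1024 - 2^971, and one ULP at that magnitude is 2^971.
#include <cmath>
#include <iostream>
#include <limits>

int main() {
    double max = std::numeric_limits<double>::max();

    // (2 - 2^-52) * 2^1023 == 2^1024 - 2^971; computing 2^1024 directly would overflow.
    double reconstructed = (2.0 - std::ldexp(1.0, -52)) * std::ldexp(1.0, 1023);
    std::cout << (max == reconstructed) << "\n";   // 1

    // One ULP at this magnitude is 2^971: the previous double is max - 2^971.
    std::cout << (std::nextafter(max, 0.0) == max - std::ldexp(1.0, 971)) << "\n";  // 1
}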
Discussion on accuracy
Only 16 decimal digits are accurate in a double, and all the other digits can be seen as garbage or random values, since they do not depend on the value itself but only on the way you choose to calculate them. Just try to subtract 1e+288 (already a big value) from maxDbl and look at what happens:
double maxLess = maxDbl - 1.e+288;
if (maxLess == maxDbl) {
    std::cout << "Unchanged" << std::endl;
}
else std::cout << "Changed" << std::endl;
You should see ... Unchanged.
It just looks like VS 2013 is a little incoherent in the way it rounds floating-point values: it rounded maxDbl up, to one ULP above the maximum actually representable value, and could not decode it later.
The problem is that the standard chose to use a %f format for std::to_string, which gives a false sense of accuracy. If you want to see an equivalent problem in gcc, just use:
#include <iostream>
#include <string>
#include <limits>
#include <iomanip>
#include <sstream>

int main() {
    double max = std::numeric_limits<double>::max();
    std::ostringstream ostr;
    ostr << std::setprecision(16) << max;
    std::string smax = ostr.str();
    std::cout << smax << std::endl;
    double m2 = std::stod(smax);
    std::cout << m2 << std::endl;
    return 0;
}
Rounded to 16 digits, maxDbl prints (correctly) as 1.797693134862316e+308, but that string can no longer be decoded back (std::stod throws std::out_of_range).
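As an aside, printing with std::numeric_limits<double>::max_digits10 (17 for double) instead of 16 digits is guaranteed to round-trip; a minimal sketch:
#include <iomanip>
#include <iostream>
#include <limits>
#include <sstream>

int main() {
    double max = std::numeric_limits<double>::max();

    std::ostringstream ostr;
    // max_digits10 guarantees that text -> double recovers the exact value.
    ostr << std::setprecision(std::numeric_limits<double>::max_digits10) << max;

    double back = std::stod(ostr.str());
    std::cout << ostr.str() << "\n";       // 1.7976931348623157e+308
    std::cout << (back == max) << "\n";    // 1
}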
And this one:
#include <iostream>
#include <string>
#include <limits>

int main() {
    double maxDbl = std::numeric_limits<double>::max();
    std::string smax = std::to_string(maxDbl);
    std::cout << smax << std::endl;
std::string smax2 = "179769313486231570800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000";
    double max2 = std::stod(smax2);
    if (max2 == maxDbl) {
        std::cout << smax2 << " is same double as " << smax << std::endl;
    }
    return 0;
}
Displays:
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
179769313486231570800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 is same double as 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
TL/DR: What I mean is that a big enough double value can of course be written as an exact integer (per IEEE 754). But that double also stands for all values between halfway to the previous representable value and halfway to the next one. So any integer in that range could be an acceptable representation of the double, and a value rounded to 16 decimal digits should be acceptable, but current standard libraries only accept the max floating-point value when it is truncated, not rounded up, at 16 decimal digits. But VS2013 gave a number above the top of that range, which was in any case an error.
Reference
IEEE floating point on Wikipedia