Float to 4 uint8_t and display - c++

I read about the float format on Wikipedia and tried to print its bits.
I used std::bitset, but it returned different bits from what I expected (I know because I used the same number as the example in the article). Then I used memcpy() to copy the float's memory into 4 one-byte parts and printed those. This method worked, but I have 4 questions.
1) Why does bitset, when given a float, print only the integer part?
2) Why does bitset work only with the array and not with the float?
3) Did memcpy() copy the bytes in the correct order?
The last question is because 0.15625f == 0b00111110001000000000000000000000.
So I think the correct order is:
bb[0] == 0b00111110;
bb[1] == 0b00100000;
bb[2] == 0b00000000;
bb[3] == 0b00000000;
But the order returned is the reverse.
4) Why does this happen?
My code:
#include <cstdint>
#include <cstring>
#include <iostream>
#include <bitset>
int main(int argc,char** argv){
float f = 0.15625f;
std::cout << std::bitset<32>(f) << std::endl;
//print: 00000000000000000000000000000000
//This print only integer part of the float. I tried with 5.2341 and others
uint8_t bb[4];
memcpy(bb, &f, 4);
std::cout << std::bitset<8>(bb[0]) << std::endl;
//print: 00000000
std::cout << std::bitset<8>(bb[1]) << std::endl;
//print: 00000000
std::cout << std::bitset<8>(bb[2]) << std::endl;
//print: 00100000
std::cout << std::bitset<8>(bb[3]) << std::endl;
//print: 00111110
return 0;
}

To construct std::bitset from a float, one of the std::bitset constructors is used. The one that is relevant here is
constexpr bitset(unsigned long long val) noexcept;
Before this constructor is called, the float is converted to unsigned long long, and its fractional part is truncated. std::bitset has no constructors that take floating-point values.
The byte order of floating-point numbers is affected by machine endianness. On a little-endian machine the least significant byte is stored first, which is why bb[0] holds the last 8 bits of the pattern. If your machine uses the same endianness for floating-point numbers and for integers, you can simply write
float f = 0.15625f;
std::uint32_t b;
std::memcpy(&b, &f, 4);
std::cout << std::bitset<32>(b) << std::endl;
// Output: 00111110001000000000000000000000
to get bytes in the correct order automatically.

Related

Output of strtoull() loses precision when converted to double and then back to uint64_t

Consider the following:
#include <iostream>
#include <cstdint>
#include <cstdlib>
int main() {
std::cout << std::hex
<< "0x" << std::strtoull("0xFFFFFFFFFFFFFFFF",0,16) << std::endl
<< "0x" << uint64_t(double(std::strtoull("0xFFFFFFFFFFFFFFFF",0,16))) << std::endl
<< "0x" << uint64_t(double(uint64_t(0xFFFFFFFFFFFFFFFF))) << std::endl;
return 0;
}
Which prints:
0xffffffffffffffff
0x0
0xffffffffffffffff
The first number is just the result of converting ULLONG_MAX, from a string to a uint64_t, which works as expected.
However, if I cast the result to double and then back to uint64_t, then it prints 0, the second number.
Normally, I would attribute this to the precision inaccuracy of floats, but what further puzzles me, is that if I cast the ULLONG_MAX from uint64_t to double and then back to uint64_t, the result is correct (third number).
Why the discrepancy between the second and the third result?
EDIT (by @Radoslaw Cybulski)
For another what-is-going-on-here try this code:
#include <iostream>
#include <cstdint>
#include <cstdlib>
using namespace std;
int main() {
uint64_t z1 = std::strtoull("0xFFFFFFFFFFFFFFFF",0,16);
uint64_t z2 = 0xFFFFFFFFFFFFFFFFull;
std::cout << z1 << " " << uint64_t(double(z1)) << "\n";
std::cout << z2 << " " << uint64_t(double(z2)) << "\n";
return 0;
}
which happily prints:
18446744073709551615 0
18446744073709551615 18446744073709551615
The number that is closest to 0xFFFFFFFFFFFFFFFF and is representable by double (assuming 64-bit IEEE) is 18446744073709551616, that is, 2^64. This is one more than 0xFFFFFFFFFFFFFFFF, so it is outside the representable range of uint64_t.
Regarding the conversion back to integer, the standard says (quoting the latest draft):
[conv.fpint]
A prvalue of a floating-point type can be converted to a prvalue of an integer type.
The conversion truncates; that is, the fractional part is discarded.
The behavior is undefined if the truncated value cannot be represented in the destination type.
Why the discrepancy between the second and the third result?
Because the behaviour of the program is undefined.
Although it is mostly pointless to analyse reasons for differences in UB because the scope of variation is limitless, my guess at the reason for the discrepancy in this case is that in one case the value is compile time constant, while in the other there is a call to a library function that is invoked at runtime.

Showing binary representation of floating point types in C++ [closed]

Consider the following code for integral types:
template <class T>
std::string as_binary_string( T value ) {
return std::bitset<sizeof( T ) * 8>( value ).to_string();
}
int main() {
unsigned char a(2);
char b(4);
unsigned short c(2);
short d(4);
unsigned int e(2);
int f(4);
unsigned long long g(2);
long long h(4);
std::cout << "a = " << +a << " " << as_binary_string( a ) << std::endl;
std::cout << "b = " << +b << " " << as_binary_string( b ) << std::endl;
std::cout << "c = " << c << " " << as_binary_string( c ) << std::endl;
std::cout << "d = " << d << " " << as_binary_string( d ) << std::endl;
std::cout << "e = " << e << " " << as_binary_string( e ) << std::endl;
std::cout << "f = " << f << " " << as_binary_string( f ) << std::endl;
std::cout << "g = " << g << " " << as_binary_string( g ) << std::endl;
std::cout << "h = " << h << " " << as_binary_string( h ) << std::endl;
std::cout << "\nPress any key and enter to quit.\n";
char q;
std::cin >> q;
return 0;
}
Pretty straightforward; it works well and is quite simple.
EDIT
How would one go about writing a function to extract the binary or bit pattern of arbitrary floating point types at compile time?
When it comes to floats, I have not found anything similar in any library I know of. I searched Google for days, then tried to write my own function without success. I no longer have the attempted code, so I cannot show the different implementations or their compiler errors. I wanted to generate the bit pattern for floats generically at compile time and integrate that into my existing class, which already does the same for any integral type. I have taken the different floating-point formats and architecture endianness into consideration. For my purposes, the standard IEEE floating-point types are all I need to be concerned with.
iBug suggested that I write my own function when I originally asked this question, while I was already trying to do so. I understand binary numbers, memory sizes, and the mathematics, but putting it all together with how floating-point types are stored in memory with their different parts {sign bit, base & exp} is where I had the most trouble.
Since then, thanks to the suggestions and examples in the answers, I was able to write a function that fits nicely into my existing class template, and it now works for my intended purposes.
What about writing one by yourself?
static_assert(sizeof(float) == sizeof(uint32_t));
static_assert(sizeof(double) == sizeof(uint64_t));
std::string as_binary_string( float value ) {
std::uint32_t t;
std::memcpy(&t, &value, sizeof(value));
return std::bitset<sizeof(float) * 8>(t).to_string();
}
std::string as_binary_string( double value ) {
std::uint64_t t;
std::memcpy(&t, &value, sizeof(value));
return std::bitset<sizeof(double) * 8>(t).to_string();
}
You may need to change the type of the helper variable t if the sizes of the floating-point types are different on your platform.
Alternatively, you can copy them bit by bit. This is slower but works for any trivially copyable type. As written, it assumes a little-endian byte order.
template <typename T>
std::string as_binary_string( T value )
{
const std::size_t nbytes = sizeof(T), nbits = nbytes * CHAR_BIT;
std::bitset<nbits> b;
std::uint8_t buf[nbytes];
std::memcpy(buf, &value, nbytes);
for(int i = 0; i < nbytes; ++i)
{
std::uint8_t cur = buf[i];
int offset = i * CHAR_BIT;
for(int bit = 0; bit < CHAR_BIT; ++bit)
{
b[offset] = cur & 1;
++offset; // Move to next bit in b
cur >>= 1; // Move to next bit in array
}
}
return b.to_string();
}
You said it doesn't need to be standard. So, here is what works in clang on my computer:
#include <iostream>
#include <algorithm>
using namespace std;
int main()
{
char *result;
result=new char[33];
fill(result,result+32,'0');
result[32]='\0'; // new char[33] does not zero the buffer, so terminate it explicitly
float input;
cin >>input;
asm(
"mov %0,%%eax\n"
"mov %1,%%rbx\n"
".intel_syntax\n"
"mov rcx,20h\n"
"loop_begin:\n"
"shr eax\n"
"jnc loop_end\n"
"inc byte ptr [rbx+rcx-1]\n"
"loop_end:\n"
"loop loop_begin\n"
".att_syntax\n"
:
: "m" (input), "m" (result)
);
cout <<result <<endl;
delete[] result;
return 0;
}
This code makes a bunch of assumptions about the computer architecture and I am not sure on how many computers it would work.
EDIT:
My computer is a 64-bit Mac-Air. This program basically works by allocating a 33-byte string and filling the first 32 bytes with '0'. The 33rd byte has to be '\0' for the string to print correctly; new char[33] does not guarantee that, so it should be set explicitly.
Then it uses inline assembly to store the float into a 32-bit register and then it repeatedly shifts it to the right by one bit.
If the last bit in the register was 1 before the shift, it gets stored into the carry flag.
The assembly code then checks the carry flag and, if it contains 1, it increases the corresponding byte in the string by 1.
Since it was previously initialized to '0', it will turn to '1'.
So, effectively, when the loop in the assembly is finished, the binary representation of a float is stored into a string.
This code only works for x64 (it uses 64-bit registers "rbx" and "rcx" to store the pointer and the counter for the loop), but I think it's easy to tweak it to work on other processors.
An IEEE 754 double-precision floating-point number is laid out as follows:
sign     exponent    mantissa
1 bit    11 bits     52 bits
Note that for normal numbers there is a hidden (implicit) leading 1 before the mantissa, and the exponent is stored with a bias of 1023 rather than in two's complement, so a stored value of 1023 means an exponent of 0.
By memcpy()ing the value to a 64-bit unsigned integer, you can then apply AND and OR masks to get the bit pattern. The byte arrangement in memory could be big-endian or little-endian.
You can easily work out which arrangement you have by passing in easy numbers such as 1 or 2.
Generally, people either use std::hexfloat or cast a pointer to the floating-point value to a pointer to an unsigned integer of the same size and print the dereferenced value in hex format. Both methods are useful for bit-level analysis of floating-point values, although the pointer cast technically violates the strict-aliasing rule, so memcpy() is the safer way to get at the bits.
You could roll your own by casting the address of the float/double to a char* and iterating over the bytes that way:
#include <memory>
#include <string>
#include <iostream>
#include <limits>
#include <iomanip>
template <typename T>
std::string getBits(T t) {
std::string returnString{""};
char *base{reinterpret_cast<char *>(std::addressof(t))};
char *tail{base + sizeof(t) - 1};
do {
for (int bits = std::numeric_limits<unsigned char>::digits - 1; bits >= 0; bits--) {
returnString += ( ((*tail) & (1 << bits)) ? '1' : '0');
}
} while (--tail >= base);
return returnString;
}
int main() {
float f{10.0};
double d{100.0};
double nd{-100.0};
std::cout << std::setprecision(1);
std::cout << getBits(f) << std::endl;
std::cout << getBits(d) << std::endl;
std::cout << getBits(nd) << std::endl;
}
Output on my machine (note the sign flip in the third output):
01000001001000000000000000000000
0100000001011001000000000000000000000000000000000000000000000000
1100000001011001000000000000000000000000000000000000000000000000

Converting a ulong to a long

I have a number stored as a ulong. I want the bits stored in memory to be interpreted in two's complement fashion, so that the first bit is the sign bit, and so on. How do I convert it to a long so that the number is interpreted correctly as two's complement?
I tried creating pointers of different data types that all pointed to the same buffer. I then stored the ulong into the buffer and dereferenced a long pointer. This, however, is giving me a bad result.
I did:
#include <iostream>
using namespace std;
int main() {
unsigned char converter_buffer[4];//
unsigned long *pulong;
long *plong;
pulong = (unsigned long*)&converter_buffer;
plong = (long*)&converter_buffer;
unsigned long ulong_num = 65535; // this has a 1 as the first bit
*pulong = ulong_num;
std:: cout << "the number as a long is" << *plong << std::endl;
return 0;
}
For some reason this is giving me the same positive number.
Would casting help?
Actually using pointers was a good start but you have to cast your unsigned long* to void* first, then you can cast the result to long* and dereference it:
#include <iostream>
#include <climits>
int main() {
unsigned long ulongValue = ULONG_MAX;
long longValue = *((long*)((void*)&ulongValue));
std::cout << "ulongValue: " << ulongValue << std::endl;
std::cout << "longValue: " << longValue << std::endl;
return 0;
}
The code above produces the following output:
ulongValue: 18446744073709551615
longValue: -1
With templates you can make it more readable in your code:
#include <iostream>
#include <climits>
template<typename T, typename U>
T unsafe_cast(const U& from) {
return *((T*)((void*)&from));
}
int main() {
unsigned long ulongValue = ULONG_MAX;
long longValue = unsafe_cast<long>(ulongValue);
std::cout << "ulongValue: " << ulongValue << std::endl;
std::cout << "longValue: " << longValue << std::endl;
return 0;
}
Keep in mind that this solution is absolutely unsafe because you can cast anything to void*. This practice was common in C, but I do not recommend using it in C++. Consider the following cases:
#include <iostream>
template<typename T, typename U>
T unsafe_cast(const U& from) {
return *((T*)((void*)&from));
}
int main() {
std::cout << std::hex << std::showbase;
float fValue = 3.14;
int iValue = unsafe_cast<int>(fValue); // OK, they have same size.
std::cout << "Hexadecimal representation of " << fValue
<< " is: " << iValue << std::endl;
std::cout << "Converting back to float results: "
<< unsafe_cast<float>(iValue) << std::endl;
double dValue = 3.1415926535;
int lossyValue = unsafe_cast<int>(dValue); // Bad, they have different size.
std::cout << "Lossy hexadecimal representation of " << dValue
<< " is: " << lossyValue << std::endl;
std::cout << "Converting back to double results: "
<< unsafe_cast<double>(lossyValue) << std::endl;
return 0;
}
For me, the code above produces the following:
Hexadecimal representation of 3.14 is: 0x4048f5c3
Converting back to float results: 3.14
Lossy hexadecimal representation of 3.14159 is: 0x54411744
Converting back to double results: 6.98387e-315
For the last line you can get anything, because the conversion reads garbage from memory.
Edit
As lorro commented below, using memcpy() is safer and can prevent the overflow. So, here is another version of the cast that is safer:
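#include <cstring>
template<typename T, typename U>
T safer_cast(const U& from) {
T to{}; // value-initialize so bytes beyond sizeof(U) are not left indeterminate
// copy only as many bytes as both types have
std::memcpy(&to, &from, (sizeof(T) > sizeof(U) ? sizeof(U) : sizeof(T)));
return to;
}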
You can do this:
uint32_t u;
int32_t& s = (int32_t&) u;
Then you can use s and u interchangeably with 2's complement, e.g.:
s = -1;
std::cout << u << '\n'; // 4294967295
In your question you ask about 65535 but that is a positive number. You could do:
uint16_t u;
int16_t& s = (int16_t&) u;
u = 65535;
std::cout << s << '\n'; // -1
Note that assigning 65535 (a positive number) to int16_t would be implementation-defined behaviour; it does not necessarily give -1.
The problem with your original code is that it is not permitted to alias a char buffer as long. (And that you might overflow your buffer). However, it is OK to alias an integer type as its corresponding signed/unsigned type.
In general, when you have two arithmetic types that are the same size and you want to reinterpret the bit representation of one using the type of the other, you do it with a union:
#include <stdint.h>
union reinterpret_u64_d_union {
uint64_t u64;
double d;
};
double
reinterpret_u64_as_double(uint64_t v)
{
union reinterpret_u64_d_union u;
u.u64 = v;
return u.d;
}
For the special case of turning an unsigned number into a signed type with the same size (or vice versa), however, you can just use a traditional cast:
int64_t
reinterpret_u64_as_i64(uint64_t v)
{
return (int64_t)v;
}
(The cast is not strictly required for [u]int64_t, but if you don't explicitly write a cast, and the types you're converting between are small, the "integer promotions" may get involved, which is usually undesirable.)
The way you were trying to do it violates the pointer-aliasing rules and provokes undefined behavior.
In C++, note that reinterpret_cast<> does not do what the union does; it is the same as static_cast<> when applied to arithmetic types.
In C++, also note that the use of a union above relies on a rule in the 1999 C standard (with corrigenda) that has not been officially incorporated into the C++ standard last I checked; however, all compilers I am familiar with will do what you expect.
And finally, in both C and C++, long and unsigned long are guaranteed to be able to represent at least −2,147,483,647 ... 2,147,483,647 and 0 ... 4,294,967,295, respectively. Your test program used 65535, which is guaranteed to be representable by both long and unsigned long, so the value would have been unchanged however you did it. Well, unless you used invalid pointer aliasing and the compiler decided to make demons fly out of your nose instead.

Storing the hex value FF in an unsigned 8 bit integer produces garbage instead of -1

Behold my code:
#include <iostream>
int main()
{
uint8_t no_value = 0xFF;
std::cout << "novalue: " << no_value << std::endl;
return 0;
}
Why does this output: novalue: ▒
On my terminal it looks like: [screenshot of the terminal output]
I was expecting -1.
After all, if we: [screenshot of the input]
we get: [screenshot of the result]
uint8_t is most likely typedef-ed to unsigned char. When you pass this to the << operator, the overload for unsigned char is selected, which causes your 0xFF value to be treated as a character code and printed as a character, hence the "garbage".
If you really want to see -1, you should try this:
#include <iostream>
#include <stdint.h>
int main()
{
uint8_t no_value = 0xFF;
std::cout << "novalue (cast): " << (int)(int8_t)no_value << std::endl;
return 0;
}
Note that I first cast to int8_t, which causes your previously unsigned value to be interpreted as a signed value instead. This is where 255 becomes -1. Then I cast to int, so that << understands it to mean "integer" instead of "character".
Your confusion comes from the fact that Windows calculator doesn't give you options for signed / unsigned; it always considers values signed. So when you used a uint8_t, you made it unsigned.
Try this
#include <iostream>
int main()
{
uint8_t no_value = 0x41;
std::cout << "novalue: " << no_value << std::endl;
return 0;
}
You will get this output:
novalue: A
uint8_t is probably the same thing as unsigned char.
std::cout with chars will output the character itself and not its numeric value.

Function Returning Negative Value

I have not run it through enough tests yet, but for some reason, with certain non-negative values, this function will sometimes pass back a negative value. I have done a lot of manual testing in a calculator with different values, but I have yet to see it display this same behavior.
I was wondering if someone would take a look and see if I am missing something.
float calcPop(int popRand1, int popRand2, int popRand3, float pERand, float pSRand)
{
return ((((((23000 * popRand1) * popRand2) * pERand) * pSRand) * popRand3) / 8);
}
The variables all contain randomly generated values:
popRand1: between 1 and 30
popRand2: between 10 and 30
popRand3: between 50 and 100
pSRand: between 1 and 1000
pERand: between 1.0f and 5500.0f which is then multiplied by 0.001f before being passed to the function above
Edit:
Alright, so after following the execution a bit more closely, it is not the fault of this function directly. It produces an extremely large positive float, which then flips negative when I use this code later on:
pPMax = (int)pPStore;
pPStore is a float that holds calcPop's return value.
So the question now is, how do I stop the formula from doing this? Testing even with very high values in Calculator has never displayed this behavior. Is there something in how the compiler processes the order of operations that is causing this or are my values simply just going too high?
In this case, it seems that when you convert back to an int after the function returns, the value can exceed the maximum value of an int. My suggestion is to use a type that can represent a greater range of values.
#include <iostream>
#include <limits>
#include <boost/multiprecision/cpp_int.hpp>
int main(int argc, char* argv[])
{
std::cout << "int min: " << std::numeric_limits<int>::min() << std::endl;
std::cout << "int max: " << std::numeric_limits<int>::max() << std::endl;
std::cout << "long min: " << std::numeric_limits<long>::min() << std::endl;
std::cout << "long max: " << std::numeric_limits<long>::max() << std::endl;
std::cout << "long long min: " << std::numeric_limits<long long>::min() << std::endl;
std::cout << "long long max: " << std::numeric_limits<long long>::max() << std::endl;
boost::multiprecision::cpp_int bigint = 113850000000;
int smallint = 113850000000;
std::cout << bigint << std::endl;
std::cout << smallint << std::endl;
std::cin.get();
return 0;
}
As you can see here, there are other types which have a bigger range. If these do not suffice I believe the latest boost version has just the thing for you.
Throw an exception:
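// needs <climits> and <stdexcept>
// float(INT_MAX) rounds up to 2^31, which is already out of range, hence >=
if (pPStore >= static_cast<float>(INT_MAX)) {
throw std::overflow_error("exceeds integer size");
} else {
pPMax = static_cast<int>(pPStore);
}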
or use float instead of int.
When you multiply the maximum values of each term together you get a value around 1.42312e+12 which is somewhat larger than a 32 bit integer can hold, so let's see what the standard has to say about floating point-to-integer conversions, in 4.9/1:
A prvalue of a floating point type can be converted to a prvalue of an
integer type. The conversion truncates; that is, the fractional part
is discarded. The behavior is undefined if the truncated value cannot
be represented in the destination type.
So we learn that for a large share of the result values your function can generate, the conversion back to a 32-bit integer is undefined, and getting a negative number is one possible outcome.
You have a few options here. You could use a 64-bit integer type (long long, or long on platforms where it is 64 bits wide) to hold the value instead of truncating down to int.
Alternately you could scale down the results of your function by a factor of around 1000 or so, to keep the maximal results within the range of values that a 32 bit integer could hold.