Let's assume we have a representation of -63 as a signed seven-bit integer within a uint16_t. How can we convert that number to float and back again when we don't know the representation type (such as two's complement)?
An application for such an encoding could be that several numbers are stored in one int16_t. The bit-count could be known for each number and the data is read/written from a third-party library (see for example the encoding format of tivxDmpacDofNode() here: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tiovx/docs/user_guide/group__group__vision__function__dmpac__dof.html --- but this is just an example). An algorithm should be developed that makes the compiler create the right encoding/decoding independent from the actual representation type. Of course it is assumed that the compiler uses the same representation type as the library does.
One way that seems to work well is to shift the bits so that their sign bit coincides with the sign bit of an int16_t and let the compiler do the rest. Of course, this makes an appropriate multiplication or division necessary.
Please see this example:
#include <iostream>
#include <cmath>
#include <cstdint> // uint16_t, int16_t
int main()
{
// -63 as signed seven-bits representation
uint16_t data = 0b1000001;
// Shift 9 bits to the left
int16_t correct_sign_data = static_cast<int16_t>(data << 9);
float f = static_cast<float>(correct_sign_data);
// Undo effect of shifting
f /= pow(2, 9);
std::cout << f << std::endl;
// Now back to signed bits
f *= pow(2, 9);
uint16_t bits = static_cast<uint16_t>(static_cast<int16_t>(f)) >> 9;
std::cout << "Equals: " << (data == bits) << std::endl;
return 0;
}
I have two questions:
This example actually uses a number with a known representation type (two's complement), converted by https://www.exploringbinary.com/twos-complement-converter/. Is the bit-shifting still independent of that, and would it also work for other representation types?
Is this the canonical and/or most elegant way to do it?
Clarification:
I know the bit width of the integers I would like to convert (please check the link to the TIOVX example above), but the integer representation type is not specified.
The intention is to write code that can be recompiled without changes on a system with another integer representation type and still correctly converts from int to float and/or back.
My claim is that the example source code above does exactly that (except that the example input data is hardcoded and it would have to be different if the integer representation type were not two's complement). Am I right? Could such a "portable" solution be written also with a different (more elegant/canonical) technique?
Your question is ambiguous as to whether you intend to truly store odd-bit integers, or odd-bit floats represented by custom-encoded odd-bit integers. I'm assuming by "not knowing" the bit-width of the integer, that you mean that the bit-width isn't known at compile time, but is discovered at runtime as your custom values are parsed from a file, for example.
Edit by author of original post:
The assumption in the original question that the presented code is independent of the actual integer representation type is wrong (as explained in the comments). The layout of integer types is not specified; for example, it is not guaranteed that the leftmost bit is the sign bit. Therefore the presented code also contains assumptions; they are just different from (and most probably worse than) the assumption "integer representation type is two's complement".
Here's a simple example of storing an odd-bit integer. I provide a simple struct that lets you decide how many bits are in your integer. However, for simplicity in this example, I used uint8_t, which obviously has a maximum of 8 bits. There are several different assumptions and simplifications made here, so if you want help on any specific nuance, please specify more in the comments and I will edit this answer.
One key detail is to properly mask off your n-bit integer after performing 2's complement conversions.
Also please note that I have basically ignored overflow concerns and bit-width switching concerns that may or may not be a problem depending on how you intend to use your custom-width integers and the maximum bit-width you intend to support.
#include <cstdint> // uint8_t
#include <iostream>
#include <string>
struct CustomInt {
int bitCount = 7;
uint8_t value;
uint8_t mask = 0;
CustomInt(int _bitCount, uint8_t _value) {
bitCount = _bitCount;
value = _value;
mask = 0;
for (int i = 0; i < bitCount; ++i) {
mask |= (1 << i);
}
}
bool isNegative() {
return (value >> (bitCount - 1)) & 1;
}
int toInt() {
bool negative = isNegative();
uint8_t tempVal = value;
if (negative) {
tempVal = ((~tempVal) + 1) & mask;
}
int ret = tempVal;
return negative ? -ret : ret;
}
float toFloat() {
return toInt(); //Implicit int-to-float conversion, exact for these small values
}
void setFromFloat(float f) {
int intVal = f; //Implied truncation!
bool negative = f < 0;
if (negative) {
intVal = -intVal;
}
value = intVal;
if (negative) {
value = ((~value) + 1) & mask;
}
}
};
int main() {
CustomInt test(7, 0b01001110); // -50. Would be 78 if this were a normal 8-bit integer
std::cout << test.toFloat() << std::endl;
}
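As a complement to the struct above, the mask-and-sign-extend step can be sketched as free functions. This is only a sketch under the assumption that the packed field is two's complement (which, as noted in the edit above, the code cannot detect on its own); decodeSigned, encodeSigned, decodeToFloat and encodeFromFloat are hypothetical names, not part of any library.

```cpp
#include <cassert>
#include <cstdint>

// Decode a `bits`-wide two's complement field stored in the low bits of `raw`.
// Assumption: the field is two's complement; other representations need other code.
int decodeSigned(std::uint16_t raw, int bits) {
    assert(bits >= 1 && bits <= 16);
    const std::uint32_t mask = (std::uint32_t{1} << bits) - 1u;
    const std::uint32_t signBit = std::uint32_t{1} << (bits - 1);
    const std::uint32_t field = raw & mask;
    // Flipping the sign bit and then subtracting its weight sign-extends the field.
    return static_cast<int>(field ^ signBit) - static_cast<int>(signBit);
}

// Encode `value` into a `bits`-wide two's complement field (upper bits cleared).
std::uint16_t encodeSigned(int value, int bits) {
    assert(bits >= 1 && bits <= 16);
    const std::uint32_t mask = (std::uint32_t{1} << bits) - 1u;
    // Signed-to-unsigned conversion is defined modulo 2^32,
    // which yields the two's complement bit pattern.
    return static_cast<std::uint16_t>(static_cast<std::uint32_t>(value) & mask);
}

// The float conversion is then an ordinary numeric cast.
float decodeToFloat(std::uint16_t raw, int bits) {
    return static_cast<float>(decodeSigned(raw, bits));
}

// Truncates toward zero, like the implicit conversion in setFromFloat above.
std::uint16_t encodeFromFloat(float f, int bits) {
    return encodeSigned(static_cast<int>(f), bits);
}
```

With the input from the question, decodeToFloat(0b1000001, 7) yields -63, and encodeFromFloat(-63.0f, 7) returns 0b1000001 again. Overflow of the target width is not checked here, just as in the struct.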
Using the GSL (GNU Scientific Library), I'm trying to understand why gsl_vector_view_array() returns a slightly modified value after assignment.
In the code below, I declare a vector view 'qview_test' which is linked to the array q_test (q_test[0]=0.0) and display its value, which is 0.0. Then I change the value to q_test[0]=1.12348 and expect the same value for qview_test, but it is altered to qview_test=1.1234800000000000341771055900608189.
How do you explain such a result? How can I replicate the result without GSL?
#include <iostream>
#include <gsl/gsl_blas.h>
using namespace std;
double q_test[1]={0.0};
gsl_vector_view qview_test;
int nb_variable = 1;
int main()
{
qview_test=gsl_vector_view_array(q_test,nb_variable);
cout.precision(35);
cout << "qview before: " << gsl_vector_get(&qview_test.vector,0)<< endl;
// Assign value
q_test[0]=1.12348;
cout << "qview after: " << gsl_vector_get(&qview_test.vector,0) << endl;
return 0;
}
Thanks for any help,
H.Nam
This looks like floating-point rounding to me.
A double has only finite precision (53 significant bits), so most decimal numbers, including 1.12348, cannot be stored exactly and are rounded to the nearest representable value.
I am not familiar with GSL, but so many digits are displayed because your code calls cout.precision(35), which prints far more digits than the roughly 15 to 17 significant decimal digits a double carries.
In other words, to store the number more precisely you would need more bits (a 128-bit float or something like that). This gives you more precision, but you most likely won't need it.
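To replicate the effect without GSL, it may help to print the double directly. The following is a minimal sketch; formatDouble is a hypothetical helper, and the exact digits printed beyond about the 17th depend on the standard library's formatting.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Format a double with the requested number of significant digits,
// the same way cout.precision(digits) does in the question.
std::string formatDouble(double x, int digits) {
    assert(digits > 0);
    std::ostringstream out;
    out << std::setprecision(digits) << x;
    return out.str();
}
```

formatDouble(1.12348, 6) gives 1.12348, while formatDouble(1.12348, 35) exposes the value that is actually stored, which the question reports as 1.1234800000000000341771055900608189. The view returned by gsl_vector_view_array() refers to the same memory as q_test, so gsl_vector_get() just hands back that stored double.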
I've been stumped on this one for days. I've written this program from chapter four of the book Write Great Code, Volume 1: Understanding the Machine.
The project is to do Floating Point operations in C++. I plan to implement the other operations in C++ on my own; the book uses HLA (High Level Assembly) in the project for other operations like multiplication and division.
I wanted to display the exponent and other field values after they've been extracted from the FP number, for debugging. Yet I have a problem: when I look at these values in memory they are not what I think they should be. Key words: what I think. I believe I understand the IEEE FP format; it's fairly simple and I understand all I've read so far in the book.
The big problem is why the Rexponent variable seems to be almost unpredictable; in this example with the given values it's 5. Why is that? My guess is that it should be two, because the decimal point is two digits to the right of the implied one.
I've commented the actual values that are produced in the program into the code so you don't have to run the program to get a sense of what's happening (at least in the important parts).
It is unfinished at this point. The entire project has not been created on my computer yet.
Here is the code (quoted from the file which I copied from the book and then modified):
#include <cstdio> //printf
#include <iostream>
typedef long unsigned real; //typedef our long unsigned ints in to the label "real" so we don't confuse it with other datatypes.
using namespace std; //Just so I don't have to type out std::cout any more!
#define asreal(x) (*((float *) &x)) //Reinterpret the storage of x as a float, so the compiler does not convert (and truncate) our FP values.
inline int extractExponent(real from) {
return ((from >> 23) & 0xFF) - 127; //Shift right 23 bits; & with eight ones (0xFF == 1111_1111 ) and make bias with the value by subtracting all ones from it.
}
void fpadd ( real left, real right, real *dest) {
//Left operand field containers
long unsigned int Lexponent = 0;
long unsigned Lmantissa = 0;
int Lsign = 0;
//RIGHT operand field containers
long unsigned int Rexponent = 0;
long unsigned Rmantissa = 0;
int Rsign = 0;
//Resulting operand field containers
long int Dexponent = 0;
long unsigned Dmantissa = 0;
int Dsign = 0;
std::cout << "Size of datatype: long unsigned int is: " << sizeof(long unsigned int); //For debugging
//Properly initialize the above variables:
//Left
Lexponent = extractExponent(left); //Zero. This value is NOT a flat zero when displayed because we subtract 127 from the exponent after extracting it! //Value is: 0xffffff81
Lmantissa = extractMantissa (left); //Zero. We don't do anything to this number except add a whole number one to it. //Value is: 0x00000000
Lsign = extractSign(left); //Simple.
//Right
Rexponent = extractExponent(right); //Value is: 0x00000005 <-- why???
Rmantissa = extractMantissa (right);
Rsign = extractSign(right);
}
int main (int argc, char *argv[]) {
real a, b, c;
asreal(a) = -0.0;
asreal(b) = 45.67;
fpadd(a,b, &c);
printf("Sum of A and B is: %f", asreal(c)); //Print the bits of c as a float, not as an integer
std::cin >> a;
return 0;
}
Help would be much appreciated; I'm several days in to this project and very frustrated!
in this example with the given values it's 5. Why is that?
The floating-point number 45.67, stored as a double, is internally represented as
2^5 * 1.0110110101011100001010001111010111000010100011110110
which actually represents the number
45.6700000000000017053025658242404460906982421875
This is as close as you can get to 45.67 in a double. A float keeps only the first 23 fraction bits (giving roughly 45.669998), but the exponent is 5 either way.
If all you are interested in is the exponent of a number, simply compute its base 2 logarithm and round down. Since 45.67 is between 32 (2^5) and 64 (2^6), the exponent is 5.
Computers use binary representation for all numbers. Hence, the exponent is for base two, not base ten. int(log2(45.67)) = 5.
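Both ways of obtaining the exponent can be sketched side by side. This is a minimal sketch assuming float is a 32-bit IEEE 754 value; exponentField and exponentByLog are hypothetical names, and the memcpy is a swapped-in, well-defined alternative to the pointer cast in the question.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Assumption: float is a 32-bit IEEE 754 binary32 value.
static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");

// Unbiased exponent field of a float. Zeros and subnormals report -127,
// infinities and NaNs report 128.
int exponentField(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // copy the raw bit pattern
    return static_cast<int>((bits >> 23) & 0xFFu) - 127;
}

// The same number obtained arithmetically: floor(log2(|f|)).
int exponentByLog(float f) {
    assert(f != 0.0f);  // log2(0) is not finite
    return static_cast<int>(std::floor(std::log2(std::fabs(f))));
}
```

For 45.67f both functions return 5, and for -0.0f the field is -127, which matches the 0xffffff81 shown in the question.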
I have the following function:
typedef unsigned long long int UINT64;
UINT64 getRandom(const UINT64 &begin = 0, const UINT64 &end = 100) {
return begin >= end ? 0 : begin + (UINT64) ((end - begin)*rand()/(double)RAND_MAX);
};
Whenever I call
getRandom(0, ULLONG_MAX);
or
getRandom(0, LLONG_MAX);
I always get the same value 562967133814800. How can I fix this problem?
What is rand()?
According to this the rand() function returns a value in the range [0,RAND_MAX].
What is RAND_MAX?
According to this, RAND_MAX is "an integral constant expression whose value is the maximum value returned by the rand function. This value is library-dependent, but is guaranteed to be at least 32767 on any standard library implementation."
Precision Is An Issue
You take rand()/(double)RAND_MAX, but you may have only 32768 discrete values (0 through RAND_MAX) to work with. Thus, although you scale up to big numbers, you don't really get more distinct numbers. That could be an issue.
Seeding May Be An Issue
Also, you don't talk about how you are calling the function. Do you run the program once with LLONG_MAX and another time with ULLONG_MAX? In that case, the behaviour you are seeing is because you are implicitly using the same random seed each time. Put another way, each time you run the program it will generate the exact same sequence of random numbers.
How can I seed?
You can use the srand() function like so:
#include <stdlib.h> /* srand, rand */
#include <time.h> /* time */
int main (){
srand (time(NULL));
//The rest of your program goes here
}
Now you will get a new sequence of random numbers each time you run your program.
Overflow Is An Issue
Consider this part ((end - begin)*rand()/(double)RAND_MAX).
What is (end - begin)? It is LLONG_MAX or ULLONG_MAX, which are, by definition, the largest possible values those data types can hold. Therefore, it would be bad to multiply them by anything. Yet you do! You multiply them by rand(), which is usually greater than one. This overflows the 64-bit range (for an unsigned type the result silently wraps around). But we can fix that...
Order of Operations Is An Issue
You then divide them by RAND_MAX. I think you've got your order of operations wrong here. You really meant to say:
((end - begin) * (rand()/(double)RAND_MAX) )
Note the new parentheses! (rand()/(double)RAND_MAX)
Now you are multiplying an integer by a fraction between 0 and 1, so the product can no longer exceed (end - begin). But that introduces a new problem...
Promotion Is An Issue
But there's an even deeper problem. You divide an int by a double. When you do that, the int is promoted to a double, and the later multiplication converts (end - begin) to a double as well. A double is a floating-point number, which basically means that it sacrifices precision in order to have a big range: it has only 53 bits of significand, so it cannot represent every 64-bit integer. That's probably what's biting you. As you get to bigger and bigger numbers, both your ullong and your llong end up getting converted to the same value. This could be especially true if you overflowed your data type first (see above).
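The loss of precision can be made visible with a small check. This is a sketch that assumes double is an IEEE 754 binary64 value with a 53-bit significand; sameAsDouble is a hypothetical helper.

```cpp
#include <cstdint>

// True when two 64-bit integers are converted to the same double.
// Assumption: double is IEEE 754 binary64 (53-bit significand),
// with the default round-to-nearest mode.
bool sameAsDouble(std::uint64_t a, std::uint64_t b) {
    const double da = static_cast<double>(a);
    const double db = static_cast<double>(b);
    return da == db;
}
```

Under that assumption, sameAsDouble(100, 101) is false, but sameAsDouble(9223372036854775807ULL, 9223372036854775808ULL) is true: LLONG_MAX (2^63 - 1) is rounded up to 2^63. Likewise, ULLONG_MAX and ULLONG_MAX - 1 both become 2^64.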
Uh oh
So, basically, everything about the PRNG you have presented is wrong.
Perhaps this is why John von Neumann said
Anyone who considers arithmetical methods of producing random digits
is, of course, in a state of sin.
And, sometimes, we pay for those sins.
How can I absolve myself?
C++11 provides some nice functionality. You can use it as follows:
#include <iostream>
#include <random>
#include <limits>
int main(){
std::random_device rd; //Get a random seed from the OS entropy device, or whatever
std::mt19937_64 eng(rd()); //Use the 64-bit Mersenne Twister 19937 generator
//and seed it with entropy.
//Define the distribution, by default it goes from 0 to MAX(unsigned long long)
//or what have you.
std::uniform_int_distribution<unsigned long long> distr;
//Generate random numbers
for(int n=0; n<40; n++)
std::cout << distr(eng) << ' ';
std::cout << std::endl;
}
(Note that appropriately seeding the generator is difficult. This question addresses that.)
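To get values from a chosen range, as the getRandom function in the question tries to do, the distribution can be given explicit bounds. This is a minimal sketch; the signature (an engine passed by reference, inclusive bounds) is my own assumption and not taken from the question.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Return a uniformly distributed value in the closed range [begin, end].
// The engine is passed in so that it is seeded once and reused across calls.
std::uint64_t getRandom(std::mt19937_64& eng, std::uint64_t begin, std::uint64_t end) {
    assert(begin <= end);
    std::uniform_int_distribution<std::uint64_t> distr(begin, end);
    return distr(eng);
}
```

A caller would create the engine once, for example with std::mt19937_64 eng(std::random_device{}());, and then call getRandom(eng, 0, ULLONG_MAX) as often as needed. Unlike the rand()-based formula, no floating-point scaling is involved, so the full 64-bit range is reachable.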
typedef unsigned long long int UINT64;
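UINT64 getRandom(UINT64 const& min = 0, UINT64 const& max = 0)
{
// Guard against an empty range: the modulo below would otherwise divide by zero.
if (max <= min) return min;
// Note: this assumes rand() delivers 32 random bits; the standard only guarantees 15.
return (((UINT64)(unsigned int)rand() << 32) + (UINT64)(unsigned int)rand()) % (max - min) + min;
}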
Shifting by 32 is only safe if the type has at least 64 bits. Since C++11, unsigned long long is guaranteed to be at least 64 bits wide, but on older compilers it may be missing. With Microsoft compilers you can use unsigned __int64 instead, but keep in mind that it is compiler dependent and therefore only available in certain compilers.
unsigned __int64 getRandom(unsigned __int64 const& min, unsigned __int64 const& max)
{
return (((unsigned __int64)(unsigned int)rand() << 32) + (unsigned __int64)(unsigned int)rand()) % (max - min) + min;
}
Use your own PRNG that meets your requirements rather than the one provided with rand that seems not to and was never guaranteed to.
Given that ULLONG_MAX and LLONG_MAX are both way bigger than the RAND_MAX value, you will certainly get "less precision than you want".
Other than that, there's a 50% chance that your value is below LLONG_MAX, as it is halfway through the range of 64-bit values.
I would suggest using the Mersenne Twister from C++11, which has a 64-bit variant
http://www.cplusplus.com/reference/random/mt19937_64/
That should give you a value that fits in a 64-bit number.
If you "always get the same value", then it's because you haven't seeded the random number generator, for example with srand(time(0)). You should normally seed only once, because seeding sets the "sequence". If the seed is very similar, e.g. you run the same program twice in short succession, you may still get the same sequence, because time only ticks once a second, and even then the seed doesn't change much. There are various other ways to seed a random number generator, but for most purposes time(0) is reasonably good.
You are overflowing the computation. In the expression
((end - begin)*rand()/(double)RAND_MAX)
you are telling the compiler to multiply (ULLONG_MAX - 0) * rand() and then divide by RAND_MAX. You should divide by RAND_MAX first, then multiply by rand().
// http://stackoverflow.com/questions/22883840/c-get-random-number-from-0-to-max-long-long-integer
#include <iostream>
#include <stdlib.h> /* srand, rand */
#include <limits.h>
using std::cout;
using std::endl;
typedef unsigned long long int UINT64;
UINT64 getRandom(const UINT64 &begin = 0, const UINT64 &end = 100) {
//return begin >= end ? 0 : begin + (UINT64) ((end - begin)*rand()/(double)RAND_MAX);
return begin >= end ? 0 : begin + (UINT64) rand()*((end - begin)/RAND_MAX);
};
int main( int argc, char *argv[] )
{
cout << getRandom(0, ULLONG_MAX) << endl;
cout << getRandom(0, ULLONG_MAX) << endl;
cout << getRandom(0, ULLONG_MAX) << endl;
return 0;
}
#include <cstdint> // uint64_t, uint32_t
#include <cstdlib> // rand
union bigRand {
uint64_t ll;
uint32_t ii[2];
};
uint64_t rand64() {
bigRand b;
b.ii[0] = rand();
b.ii[1] = rand();
return b.ll;
}
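I am not sure how portable it is: reading a union member other than the one last written is technically undefined behaviour in C++ (although most compilers support it), and rand() only fills both halves completely if RAND_MAX is at least 2^32 - 1. But you could easily modify it depending on how wide RAND_MAX is on the particular platform. As an upside, it is brutally simple. I mean the compiler will likely optimize it to be quite efficient, without any extra arithmetic. Just the cost of calling rand twice.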
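One way to adapt the idea to a narrow RAND_MAX is to assemble the value with shifts instead of a union. This is only a sketch: it uses just the 15 bits that every implementation guarantees, and the result is still only as good as the platform's rand().

```cpp
#include <cstdint>
#include <cstdlib>

// Build a 64-bit value from several rand() calls, 15 bits at a time.
// Assumption: only the low 15 bits of rand() are used, since RAND_MAX
// is guaranteed to be at least 32767 and nothing more.
std::uint64_t rand64() {
    std::uint64_t result = 0;
    for (int i = 0; i < 5; ++i) {  // 5 * 15 = 75 bits, enough to fill 64
        result = (result << 15) | (static_cast<std::uint64_t>(std::rand()) & 0x7FFFu);
    }
    return result;
}
```

This costs five calls to rand() instead of two, but it avoids the union and does not depend on how wide RAND_MAX is.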
The most reasonable solution would be to use C++11's <random>, mt19937_64 would do.
Alternatively you might try:
return ((double)rand() / ((double)RAND_MAX + 1.0)) * (end - begin + 1) + begin;
to produce numbers in a more reasonable way. However, note that just like your first attempt, this will still not produce uniformly distributed numbers (although it might be good enough).
The term (end - begin)*rand() seems to produce an overflow. You can alleviate that problem by using (end - begin) * (rand()/(double)RAND_MAX). Using the second way, I get the following results:
15498727792227194880
7275080918072332288
14445630964995612672
14728618955737210880
with the following calls:
std::cout << getRandom(0, ULLONG_MAX) << std::endl;
std::cout << getRandom(0, ULLONG_MAX) << std::endl;
std::cout << getRandom(0, ULLONG_MAX) << std::endl;
std::cout << getRandom(0, ULLONG_MAX) << std::endl;
#include <iostream>
using namespace std;
int main()
{
cout.precision(32);
float val = 268433072;
float add = 13.5;
cout << "result =" << (val + add) << endl;
}
I'm compiling the above program with standard g++ main.cc
and running it with ./a.out
The output I receive, however, is
result =268433088
Clearly, this is not the correct answer. Why is this happening?
EDIT: This does not occur when using double in place of float
You can reproduce your "float bug" with an even simpler piece of code
#include <iostream>
using namespace std;
int main()
{
cout.precision(32);
float val = 2684330722;
cout << "result =" << val << endl;
}
The output
result =2684330752
As you can see the output does not match the value val was initialized with.
As it has been stated many times, floating-point types have limited precision. Your code sample simply exceeded that precision, so the result got rounded. There's no "bug" here.
Aside from the reference to What Every Computer Scientist Should Know About Floating-Point Arithmetic (1991, PDF), the short answer is this: because float has limited storage (like the other primitives), the engineers had to make a choice about which numbers to store with which precision. For the floating-point formats they decided to store numbers of small magnitude precisely (to some decimal digits), but numbers of large magnitude very imprecisely. In fact, starting with ±16,777,217 the representable numbers become so sparse that not even all integers are represented, which is what you noticed.
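The gap between neighbouring floats can be measured directly. This is a small sketch assuming IEEE 754 single precision; spacingAbove and addAsFloat are hypothetical helper names.

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next representable float above it.
float spacingAbove(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// Add two floats and round the result to float, as `val + add` does in the question.
float addAsFloat(float a, float b) {
    const float sum = a + b;
    return sum;
}
```

Under that assumption, spacingAbove(268433072.0f) is 16, so the exact sum 268433085.5 has to land on a multiple of 16, and 268433088 is the nearest one. That is exactly what addAsFloat(268433072.0f, 13.5f) returns. At 16777216.0f the gap is already 2, which is why 16,777,217 cannot be stored.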