This question already has answers here:
Extract fractional part of double *efficiently* in C
(7 answers)
Closed 6 years ago.
Can someone help a rookie? If I have a number 4.561 that is derived from an equation, how can I display ONLY the .561 and ignore the 4?
Thanks in advance. I am new to programming and this is part of an assignment. Any help would be greatly appreciated. I'm coding in C++.
Not sure if this is what you need, but do check this out:
float f = 4.561;
f = f - (long)f; // subtract the integer part
cout << "Value of f is : " << f << endl;
I would feel much better using floor from math.h:
float f = 4.561;
if (f >= 0) f = f - floor(f);
else        f = f - ceil(f);
// here f = 0.561
for these reasons:
With a cast to an integral type (f - long(f)) you have less control: at least I do not know whether the standard clearly defines it as taking the integer part or as rounding, not to mention custom type implementations.
What if your floating-point value holds a bigger number than your integral type can hold? I know there are not many mantissa bits left for the fractional part of big numbers, but you did not specify which floating-point type you are using (32/64/80/128/256 bits or more), so it is hard to say. If the integer part is bigger than the integral type used to cut it off, then f - long(f) gets you into trouble.
PS.
The if statement can be avoided by masking the sign bit out before the operation and back in after it. For example, on a standard 32-bit float it looks like this:
float f = 4.561;          // input value
DWORD *dw = (DWORD*)(&f); // pointer to f, to access its bits as an integer type
DWORD m;                  // sign storage
m = (*dw) & 0x80000000;   // store sign bit in m
(*dw) &= 0x7FFFFFFF;      // f = fabs(f)
f -= floor(f);
(*dw) |= m;               // restore original sign from m
// here f = 0.561
If you do not have DWORD, use any unsigned 32-bit integer type instead.
I've seen a couple of old posts with a similar context, but the answers were code snippets, not explanations, and I can't even get the code offered in those answers to compile.
I apologize in advance if this is poorly formatted or explained; I am new to C++ and I avoid asking questions on forums at all costs, but this has become a real thorn in my side.
I am using C++ to access the memory register of an accelerometer.
This register contains 8 bits of data in the form of twos complement.
The accelerometer is set to a range of +/-2 g (see section 1.5 of the reference manual, page 10).
My goal is to take this twos complement data and save it to a text file as signed base-10 data.
Relevant code below:
while (1)
{
    int d = 0;
    d = wiringPiI2CReadReg8(fd, DATA_X_H);
    d = (int8_t)d;
    d = d * EIGHTBITCONV;
    std::cout << d << std::endl;
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
I've tested the program by hard-coding "d" as 01111111 (127), and the program returns "73". But hard-coding "d" as 10000001 (-127), the program returns the correct "-127".
So the program only functions properly when dealing with negative numbers.
Is this happening because casting "d" to an 8-bit integer truncates the leading zero for positive numbers? How would I fix this?
Link to datasheet: https://www.mouser.com/datasheet/2/348/KX132-1211-Technical-Reference-Manual-Rev-1.0-1652679.pdf
You do not need to convert from 2's complement to "signed integer", because 2's complement is a signed-integer representation.
You are only reading 8 bits, but int (the return type of wiringPiI2CReadReg8) has more, so you need a sign-extending conversion. Something like:
int result = (int)(signed char)wiringPiI2CReadReg8(fd, DATA_X_H);
The (int) conversion is implicit and can be omitted. And in your case you are converting to a float (the conversion is again implicit). So:
d = (signed char)wiringPiI2CReadReg8(fd, DATA_X_H);
Actually, your solution (negating twice) would work as well. More explicitly it could be written like this (since 1 is an int):
d = -((int)(~(int8_t)d) + 1);
But this is just unnecessary work. It could be simplified to be:
d = -(-(int8_t)d);
and now it obviously simplifies to:
d = (int8_t)d;
same as what I wrote before.
OK, so I think a lot of my confusion came from the fact that I was trying to hard-code values into my program without proper knowledge of how to do so.
If I were to hard-code values to test the logic, I should have specified that the values of "d" were binary.
So it looks like my original code, while extremely sloppy, was functioning properly.
This question already has answers here:
John Carmack's Unusual Fast Inverse Square Root (Quake III)
(6 answers)
Closed 6 years ago.
I am looking at an example of the inverse sqrt code used in Quake.
I see a variable of type float: float a = 16;
and then an int variable takes the value of this expression: int i = *(int*)&a;
The way I understand it (reading from right to left) is that it takes the address of the variable a, casts it to an integer pointer, and then takes the value at that pointer and assigns it to i.
When I output the variable i to the console it's a big integer value.
Can someone explain in more depth what is going on here, please?
I expected the variable i to be 16.
&a takes the address of a, a float; (int *) keeps the same address but tells the compiler now to pretend that an int is stored there; int i = * then loads the int from that address.
So you get an integer with the same bit pattern as the float had. Floats and ints have very different encodings, so you end up with a very different value.
The float is most likely encoded in IEEE 754 format, but that is not required; your architecture may do something else. Possibly the author is writing a platform-specific optimised version of frexp.
int i = (int)a; would simply convert the float to an int, giving you 16.
This question already has answers here:
Handling large numbers in C++?
(10 answers)
Closed 7 years ago.
I would like to write a program that can compute with integers having more than 2000 or 20000 digits (for Pi's decimals). I would like to do it in C++, without any libraries (no big integer, Boost, ...). Can anyone suggest a way of doing it? Here are my thoughts:
using const char* to hold the integer's digits;
representing the number like
( (1 * 10 + x) * 10 + x )...
The obvious answer works along these lines:
class integer {
bool negative;
std::vector<std::uint64_t> data;
};
Here the number is represented as a sign bit and an (unsigned) base 2**64 magnitude.
This means the absolute value of your number is:
data[0] + (data[1] << 64) + (data[2] << 128) + ....
Or, in other terms, you represent your number as a little-endian bit string with words as large as your target machine can reasonably work with. I chose 64-bit integers, as this minimizes the number of individual word operations (on an x64 machine).
To implement Addition, you use a concept you have learned in elementary school:
a b
+ x y
------------------
(a+x+carry) (b+y reduced to one digit length)
The reduction (modulo 2**64) happens automatically, and the carry can only ever be either zero or one. All that remains is to detect a carry, which is simple:
bool next_carry = false;
x += y;                                    // wraps modulo 2**64
if (x < y) next_carry = true;              // wraparound means a carry out
if (prev_carry && !++x) next_carry = true;
Subtraction can be implemented similarly using a borrow instead.
Note that getting anywhere close to the performance of e.g. libgmp is... unlikely.
A long integer is usually represented by a sequence of digits (see positional notation). For convenience, use the little-endian convention: A[0] is the lowest digit, A[n-1] is the highest one. In the general case your number is equal to sum(A[i] * base^i) for some value of base.
The simplest value for base is ten, but it is not efficient. If you will be printing your answer to the user often, you'd better use a power of ten as base. For instance, you can use base = 10^9 and store each digit in an int32 type. If you want maximal speed, better use a power-of-two base. For instance, base = 2^32 is the best possible base for a 32-bit compiler (however, you'll need assembly to make it work optimally).
There are two ways to represent negative integers. The first is to store the integer as a sign plus a digit sequence; in this case you'll have to handle all the different-sign cases yourself. The other option is to use complement form. It can be used with both power-of-two and power-of-ten bases.
Since the length of the sequence may vary, you'd better store the digit sequence in a std::vector. Do not forget to remove leading zeroes in this case. An alternative solution is to always store a fixed number of digits (a fixed-size array).
The operations are implemented in pretty straightforward way: just as you did them in school =)
P.S. Alternatively, thanks to the CRT, each integer (of bounded length) can be represented by its remainders modulo a set of different primes. Such a representation supports only a limited set of operations and requires a nontrivial conversion if you want to print the number.
This question already has answers here:
How disastrous is integer overflow in C++?
(3 answers)
Closed 8 years ago.
When I try to add two long numbers it gives me a negative result:
#include <iostream>
using namespace std;

int main()
{
    int a = 1825228665;
    int b = 1452556585;
    cout << a + b;
    return 0;
}
This gives me:
-1017182046
It's an overflow of the type. When you add two numbers so big that the result can't be stored in the chosen type, it overflows. In most cases the value wraps around, but for signed types this is not defined by the standard, so you cannot rely on any particular result.
This applies to int and the other numeric types: when the program can't store such a big number in them, it overflows.
Let's say that int could store numbers from -10 to 10; when you do this:
int a = 10;
int b = a+1;
you will get -10 in b, or some other value (it can be anything, because the result is undefined).
That's because the result overflows. The first bit in signed numeric data types is used for the sign; the specific representation is called two's complement (Wikipedia article here). Practically, a 1 in this bit maps to - while a 0 maps to +. The solution to this problem is to use a larger data type like long long: larger means more memory is used to store it, so the range of values increases.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Floating Point to Binary Value(C++)
Currently I'm working on a genetic algorithm for my thesis, and I'm trying to optimize a problem in which three doubles form the genome of a particular solution. For the breeding of these doubles I would like to use a binary representation, and for this I'll have to convert the doubles to their binary representation. I've searched for this but can't find a clear solution, unfortunately.
How to do this? Is there a library function to do this, as there is in Java? Any help is greatly appreciated.
What about:
double d = 1234;
unsigned char *b = (unsigned char *)&d;
Assuming a double consists of 8 bytes you could use b[0] ... b[7].
Another possibility is:
long long x = *(long long *)&d;
Since you tagged the question C++, I would use a reinterpret_cast.
For the genetic algorithm what you probably really want is treating the mantissa, exponent and sign of your doubles independently. See "how can I extract the mantissa of a double"
Why do you want to use a binary representation? Just because something is more popular does not mean it is the solution to your specific problem.
There is a well-known genome representation called real coding that you can use to solve your problem without being subject to several issues of the binary representation, such as Hamming cliffs and unequal mutation effects.
Please notice that I am not talking about cutting-edge, experimental stuff; this 1991 paper already describes the issue. If you speak Spanish or Portuguese, I could point you to my own book on GAs, but there are beautiful references in English, such as Melanie Mitchell's or Eiben's books, that describe this issue more deeply.
The important thing to have in mind is that you need to tailor the genetic algorithm to your problem, not modify your needs in order to be able to use a specific type of GA.
I wouldn't convert it into an array. If you are doing genetic stuff it should be performant, so if I were you I would use an integer type (as suggested by irrelephant) and then do the mutation and crossover with integer operations.
If you don't do that, you're always converting back and forth, and for crossover you'd have to iterate through the 64 elements.
Here is an example of crossover:
__int64 crossover(__int64 a, __int64 b, int x) {
    __int64 mask1 = (__int64)(~0ULL << (64 - x)); // left-most x bits (assumes 0 < x < 64)
    __int64 mask2 = ~mask1;                       // right-most 64-x bits
    return (a & mask1) + (b & mask2);
}
And for selection, you can just cast it back to a double.
You could do it like this:
// assuming a double is 64 bits
double d = 42.0; // just a random double
char* bits = (char*)&d; // access my double byte-by-byte
int array[64]; // result
for (int i = 0, k = 63; i < 8; ++i) // for each byte of my double
for (char j = 0; j < 8; ++j, --k) // for each bit of each byte of my double
array[k] = (bits[i] >> j) & 1; // is the Jth bit of the current byte 1?
Good luck
Either start with a binary representation of the genome and use one-point or two-point crossover operators, or, if you want to use a real encoding for your GA, use the simulated binary crossover (SBX) operator. Most modern GA implementations use a real-coded representation with corresponding crossover and mutation operators.
You could use an int (or variant thereof).
The trick is to encode a float of 12.34 as an int of 1234.
Therefore you just need to cast to a float and divide by 100 in the fitness function, and do all your mutation and crossover on the integer.
Gotchas:
Beware the loss of precision if you actually need the nth bit.
Beware the sign bit.
Beware the difference in range between floats & ints.