In binary notation, what is the meaning of the digits after the radix point "."? - c++

I have this example on how to convert from a base 10 number to IEEE 754 float representation
Number: 45.25 (base 10) = 101101.01 (base 2) Sign: 0
Normalized form N = 1.0110101 * 2^5
Exponent esp = 5 E = 5 + 127 = 132 (base 10) = 10000100 (base 2)
IEEE 754: 0 10000100 01101010000000000000000
This makes sense to me except one passage:
45.25 (base 10) = 101101.01 (base 2)
45 is 101101 in binary and that's okay, but how did they obtain .01 from 0.25?

Simple place value. In base 10, you have these places:
... 10^3 10^2 10^1 10^0 . 10^-1 10^-2 10^-3 ...
... thousands, hundreds, tens, ones . tenths, hundredths, thousandths ...
Similarly, in binary (base 2) you have:
... 2^3 2^2 2^1 2^0 . 2^-1 2^-2 2^-3 ...
... eights, fours, twos, ones . halves, quarters, eighths ...
So the second place after the . in binary is units of 2^-2, well known to you as units of 1/4 (or alternately, 0.25). In 101101.01 that place holds a 1 and the halves place holds a 0, so the fractional part is 0 * 1/2 + 1 * 1/4 = 0.25.
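The place-value rule above can be sketched in code. This is only a minimal illustration, not part of the original answer; the function name binaryToDouble is made up, and it assumes the input contains nothing but 0, 1 and at most one radix point.

```cpp
#include <string>

// Hypothetical helper: evaluates a binary literal such as "101101.01"
// by summing the place value of every digit.
double binaryToDouble(const std::string& s) {
    const std::size_t dot = s.find('.');
    const std::size_t intEnd = (dot == std::string::npos) ? s.size() : dot;

    double value = 0.0;
    // Integer part: each step doubles the running total and adds the digit.
    for (std::size_t i = 0; i < intEnd; ++i)
        value = value * 2.0 + (s[i] - '0');

    // Fractional part: the weights are 1/2, 1/4, 1/8, ...
    double weight = 0.5;
    for (std::size_t i = intEnd + 1; i < s.size(); ++i) {
        value += weight * (s[i] - '0');
        weight /= 2.0;
    }
    return value;
}
```

Calling binaryToDouble("101101.01") should give back 45.25, the number from the question.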

You can convert the part after the decimal point to another base by repeatedly multiplying by the new base (in this case the new base is 2), like this:
0.25 * 2 = 0.5
-> The first binary digit is 0 (take the integral part, i.e. the part before the decimal point).
Continue multiplying with the part after the decimal point:
0.5 * 2 = 1.0
-> The second binary digit is 1 (again, take the integral part).
This is also where we stop because the part after the decimal point is now zero, so there is nothing more to multiply.
Therefore the final binary representation of the fractional part is: 0.01 (base 2).
Edit:
Might also be worth noting that the binary representation is quite often infinite even when starting with a finite fractional part in base 10. Example: converting 0.2 (base 10) to binary:
0.2 * 2 = 0.4 -> 0
0.4 * 2 = 0.8 -> 0
0.8 * 2 = 1.6 -> 1
0.6 * 2 = 1.2 -> 1
0.2 * 2 = ...
So we end up with: 0.001100110011... (base 2).
Using this method you see quite easily if the binary representation ends up being infinite.
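The repeated-multiplication method can be sketched as below. This is an assumption-laden illustration rather than the answerer's code: the name fractionToBinary is made up, and the fraction is passed as an integer numerator and denominator (0.2 becomes 2/10) so that the doubling stays exact instead of inheriting floating-point error.

```cpp
#include <string>

// Hypothetical helper: binary digits of the fraction num/den (0 <= num < den),
// produced by repeated doubling. Integers keep the arithmetic exact.
std::string fractionToBinary(unsigned long long num, unsigned long long den,
                             int maxDigits) {
    std::string digits = "0.";
    for (int i = 0; i < maxDigits && num != 0; ++i) {
        num *= 2;              // multiply the fraction by the new base
        if (num >= den) {      // the integral part is 1
            digits += '1';
            num -= den;        // keep only the part after the point
        } else {               // the integral part is 0
            digits += '0';
        }
    }
    return digits;
}
```

With 25/100 the loop stops after two digits (0.01), while 2/10 never reaches zero and is cut off at maxDigits.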

"Decimals" (fractional bits) in other bases are surprisingly unintuitive considering they work in exactly the same way as integers.
base 10
power   10^2  10^1  10^0  10^-1  10^-2  10^-3
weight  100   10    1     0.1    0.01   0.001
value   0     4     5     .2     5      0
base 2
power   2^6  2^5  2^4  2^3  2^2  2^1  2^0  2^-1  2^-2  2^-3
weight  64   32   16   8    4    2    1    .5    .25   .125
value   0    1    0    1    1    0    1    .0    1     0
If we start with 45.25, that's greater than or equal to 32, so we add a binary 1, and subtract 32.
We're left with 13.25, which is smaller than 16, so we add a binary 0.
We're left with 13.25, which is greater than or equal to 8, so we add a binary 1, and subtract 8.
We're left with 05.25, which is greater than or equal to 4, so we add a binary 1, and subtract 4.
We're left with 01.25, which is smaller than 2, so we add a binary 0.
We're left with 01.25, which is greater than or equal to 1, so we add a binary 1, and subtract 1.
With integers, we'd have zero left, so we'd stop. But:
We're left with 00.25, which is smaller than 0.5, so we add a binary 0.
We're left with 00.25, which is greater than or equal to 0.25, so we add a binary 1, and subtract 0.25.
Now we have zero, so we stop (or not; you can keep going and calculating zeros forever if you want).
Note that not every "easy" number in decimal reaches that zero stopping point. 0.1 (decimal) converted into base 2 is infinitely repeating: 0.0001100110011001100110011... However, every number that terminates in binary also terminates in base 10.
You can also do this same process with fractional (2.5), irrational (pi), or even imaginary (2i) bases, except that the base cannot be between -1 and 1 inclusive.
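For base 2, the compare-and-subtract walk through the weight table might look like this. It is a rough sketch only: greedyBinary is a made-up name, and it assumes a non-negative input below 128 and stops after the 0.125 column, exactly like the table.

```cpp
#include <string>

// Hypothetical helper: greedy conversion that walks the weights
// 64, 32, 16, 8, 4, 2, 1, 0.5, 0.25, 0.125 from left to right.
std::string greedyBinary(double value) {
    std::string out;
    double weight = 64.0;
    for (int i = 0; i < 10; ++i) {
        if (weight == 0.5) out += '.';   // radix point sits between 1 and 0.5
        if (value >= weight) {           // the weight fits: emit 1, subtract it
            out += '1';
            value -= weight;
        } else {                         // the weight does not fit: emit 0
            out += '0';
        }
        weight /= 2.0;
    }
    return out;
}
```

For 45.25 this should reproduce the value row of the base 2 table, 0101101.010.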

2.000 (base 10) = 2^+1 = 10.000 (base 2)
1.000 (base 10) = 2^+0 = 01.000 (base 2)
0.500 (base 10) = 2^-1 = 00.100 (base 2)
0.250 (base 10) = 2^-2 = 00.010 (base 2)
0.125 (base 10) = 2^-3 = 00.001 (base 2)

The fractions base 2 are .1 = 1/2, .01 = 1/4. ...

Think of it this way
(dot) 2^-1 2^-2 2^-3 etc
so
. 0/2 + 1/4 + 0/8 + 0/16 etc
See http://floating-point-gui.de/formats/binary/

You can think of 0.25 as 1/4.
Dividing by 2 in (base 2) moves the decimal point one step left, the same way dividing by 10 in (base 10) moves the decimal point one step left. Generally dividing by M in (base M) moves the decimal point one step left.
so
base 10                     base 2
--------------------------------------
1                        => 1
1/2 = 0.5                => 0.1
0.5/2 = 1/4 = 0.25       => 0.01
0.25/2 = 1/8 = 0.125     => 0.001
.
.
.
etc.

Related

Why is the 4s-complement of 412 (in decimal) 321210?

I used an online converter to convert 412 from decimal to base 4 (which is 12130), and then applied the r's complement formula to get its 4s complement (which is 21210). However, in 6 digits, 21210 becomes 321210.
When I try to convert it to decimal by doing
3 x - 4^5 + 2 x 4^4 + 1 x 4^3 + 2 x 4^2 + 1 x 4^1,
I get a number in decimal that is way larger than 412.
You have your "conversion to decimal" wrong. It should be:
-1 x 4^5 + 2 x 4^4 + 1 x 4^3 + 2 x 4^2 + 1 x 4^1
Note the difference in the first term, which deals with the sign: in the sign position of a 4s-complement number, the digit 3 has a value of 3 - 4 = -1. The sum is then -1024 + 512 + 64 + 32 + 4 = -412.
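One way to check this numerically is to read the digits as an ordinary base-4 number and subtract 4^n when the leading digit falls in the upper half, which is equivalent to giving the leading 3 a value of -1. The sketch below assumes an even radix of at most 10; the name radixComplementValue is invented for the illustration.

```cpp
#include <string>

// Hypothetical helper: value of a digit string read as an r's-complement
// number. Assumes an even radix of at most 10, so that a leading digit of
// radix/2 or more marks a negative number.
long long radixComplementValue(const std::string& digits, int radix) {
    long long value = 0;
    long long modulus = 1;             // becomes radix^(number of digits)
    for (char c : digits) {
        value = value * radix + (c - '0');
        modulus *= radix;
    }
    // Values in the upper half of the range stand for value - radix^n.
    if (digits[0] - '0' >= radix / 2) value -= modulus;
    return value;
}
```

radixComplementValue("321210", 4) evaluates 3684 - 4096, which is -412, while "012130" stays at +412.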

Bitwise Operators NOT

I encountered a problem with bit arithmetic. It is bitwise NOT.
if A = 5; then ~A = ?
The binary of 5 is 101, the inverse is 010, and then converted to decimal is 0 * 2^2 + 1 * 2^1 + 0 * 2^0 = 2
But when I test in the IDE, the output is as follows:
System.out.println( ~5 );
Output:
-6
I don't know why. Thanks!!!
If you are using a standard int, then after assigning 5 to A:
int A = 5;
your A is not 101b but 00000000000000000000000000000101b, all 32 bits.
After the NOT operation, which inverts all bits, you get:
A = 11111111111111111111111111111010
And this int value is -6 in the two's-complement representation, which Java uses for int (as do most computers).
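The question is about Java, but the same behaviour can be reproduced in C++ on any two's-complement platform, where ~x equals -x - 1. The helper name bits32 below is made up for this sketch.

```cpp
#include <bitset>
#include <cstdint>
#include <string>

// Hypothetical helper: the 32 bits of a signed value as text,
// most significant bit first.
std::string bits32(std::int32_t v) {
    return std::bitset<32>(static_cast<std::uint32_t>(v)).to_string();
}
```

bits32(5) ends in 101 with 29 leading zeros, bits32(~5) ends in 010 with 29 leading ones, and ~5 compares equal to -6.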

c++, binary number calculations

I have a question that asks how values such as c are computed in terms of binary numbers. I'm researching it now, but figured I'd ask here in case anyone can explain how this works or point me somewhere.
int main()
{
    int a = 10, b = 12, c, d; // b and d are declared but not used here
    c = a << 2;               // c is 40
}
Well, I'm not answering with C++ code, as the question is not really related to the language.
The integer ten is written 10 in base 10 as it's equal to 1 * 10^1 + 0 * 10^0.
Binary is base 2, so let's try to write ten as a sum of powers of 2.
10 = 8 + 2
That is 2^3 + 2^1.
Let's switch to binary (using only two digits : 0 and 1).
2^3 is written 1000
2^1 is written 10
Their sum is 1010 in binary.
"<<" is the operation that shifts the binary digits left by a given amount (beware of overflow).
So 1010 << 2 is 101000
That is in decimal 2^5 + 2^3 = 32 + 8 = 40
You can also think of "<< N" as being a multiplication by 2^N of an integer.
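To watch the digits move, the shift can be printed in binary. This is just a small sketch; bits8 is a made-up helper that shows only the low 8 bits.

```cpp
#include <bitset>
#include <string>

// Hypothetical helper: the low 8 bits of a value as text.
std::string bits8(unsigned value) {
    return std::bitset<8>(value).to_string();
}
```

bits8(10) is 00001010 and bits8(10u << 2) is 00101000, which is 32 + 8 = 40, the same as 10 * 4.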

How to represent a negative number with a fraction in 2's complement?

So I want to represent the number -12.5. So 12.5 equals to:
001100.100
If I don't calculate the fraction then it's simple, -12 is:
110100
But what is -12.5? is it 110100.100? How can I calculate this negative fraction?
With decimal number systems, each number position (or column) represents (reading a number from right to left): units (which is 10^0), tens (i.e. 10^1),hundreds (i.e. 10^2), etc.
With unsigned binary numbers, the base is 2, thus each position becomes (again, reading from right to left): 1 (i.e. 2^0) ,2 (i.e. 2^1), 4 (i.e. 2^2), etc.
For example
2^2 (4), 2^1 (2), 2^0 (1).
In signed two's complement, the most significant bit (MSB) has a negative weight. It therefore also indicates the sign: '1' for a negative number and '0' for a non-negative one.
For a three-bit number the columns would hold these values:
-4, 2, 1
0 0 1 => 1
1 0 0 => -4
1 0 1 => -4 + 1 = -3
The value of the bits held by a fixed-point (fractional) system is unchanged. Column values follow the same pattern as before, base (2) to a power, but with power going negative:
2^2 (4), 2^1 (2), 2^0 (1) . 2^-1 (0.5), 2^-2 (0.25), 2^-3 (0.125)
-1 will always be 111.000
For -0.5, add 0.5 to it: 111.100
In your case 110100.10 is equal to -32+16+4+0.5 = -11.5. What you did was create -12 and then add 0.5 rather than subtract 0.5.
What you actually want is -32+16+2+1+0.5 = -12.5 = 110011.1
You can double the number again and again until it is a negative integer (or reaches a defined limit), and then set the radix point correspondingly.
-25 is 11100111, so -12.5 is 1110011.1
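The doubling trick amounts to scaling by 2^fracBits, taking the ordinary two's-complement integer, and putting the radix point back. A minimal sketch follows, with assumptions: toFixedPoint is a made-up name, totalBits is at most 32, and the scaled value is an integer that fits in totalBits bits.

```cpp
#include <cmath>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes value as a two's-complement fixed-point number
// with totalBits bits, fracBits of them after the radix point.
std::string toFixedPoint(double value, int totalBits, int fracBits) {
    // Scale so the fraction becomes part of an ordinary integer: -12.5 -> -25.
    const auto scaled = static_cast<std::int64_t>(std::ldexp(value, fracBits));
    // Conversion to unsigned wraps modulo 2^64, which yields the
    // two's-complement bit pattern.
    const auto raw = static_cast<std::uint64_t>(scaled);

    std::string out;
    for (int bit = totalBits - 1; bit >= 0; --bit) {
        out += ((raw >> bit) & 1u) ? '1' : '0';
        if (bit == fracBits && fracBits > 0) out += '.';  // restore the point
    }
    return out;
}
```

toFixedPoint(-12.5, 8, 1) should give 1110011.1, and with 7 bits in total it gives 110011.1.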
So you want to represent -12.5 in 2's complement representation.
12.5 ->> 01100.1
2's complement of (01100.1) ->> 10011.1
Verify the answer by checking the weighted-code property of 2's complement representation (the MSB weight is negative). We get -16+3+0.5 = -12.5.

C floating point precision [duplicate]

I have a problem about the accuracy of float in C/C++. When I execute the program below:
#include <stdio.h>
int main (void) {
float a = 101.1;
double b = 101.1;
printf ("a: %f\n", a);
printf ("b: %lf\n", b);
return 0;
}
Result:
a: 101.099998
b: 101.100000
I believe float has 32 bits, so it should be enough to store 101.1. Why isn't it?
You can only represent numbers exactly in IEEE754 (at least for the single and double precision binary formats) if they can be constructed from adding together inverted powers of two (i.e., 2^-n, like 1, 1/2, 1/4, 1/65536 and so on) subject to the number of bits available for precision.
There is no combination of inverted powers of two that will get you exactly to 101.1, within the scaling provided by floats (23 bits of precision) or doubles (52 bits of precision).
If you want a quick tutorial on how this inverted-power-of-two stuff works, see this answer.
Applying the knowledge from that answer to your 101.1 number (as a single precision float):
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm    1/n
0 10000101 10010100011001100110011
           |  | |   ||  ||  ||  |+- 8388608
           |  | |   ||  ||  ||  +-- 4194304
           |  | |   ||  ||  |+-----  524288
           |  | |   ||  ||  +------  262144
           |  | |   ||  |+---------   32768
           |  | |   ||  +----------   16384
           |  | |   |+-------------    2048
           |  | |   +--------------    1024
           |  | +------------------      64
           |  +--------------------      16
           +-----------------------       2
The mantissa part of that actually continues forever for 101.1:
mmmmmmmmm mmmm mmmm mmmm mm
100101000 1100 1100 1100 11|00 1100 (and so on).
Hence it's not a matter of precision: no finite number of bits will represent that number exactly in IEEE754 format.
Using the bits to calculate the actual number (closest approximation), the sign is positive. The exponent is 128+4+1 = 133 - 127 bias = 6, so the multiplier is 2^6 or 64.
The mantissa consists of 1 (the implicit base) plus (for all those bits with each being worth 1/(2^n) as n starts at 1 and increases to the right), {1/2, 1/16, 1/64, 1/1024, 1/2048, 1/16384, 1/32768, 1/262144, 1/524288, 1/4194304, 1/8388608}.
When you add all these up, you get 1.57968747615814208984375.
When you multiply that by the multiplier previously calculated, 64, you get 101.09999847412109375.
All numbers were calculated with bc using a scale of 100 decimal digits, resulting in a lot of trailing zeros, so the numbers should be very accurate. Doubly so, since I checked the result with:
#include <stdio.h>
int main (void) {
float f = 101.1f;
printf ("%.50f\n", f);
return 0;
}
which also gave me 101.09999847412109375000....
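The same calculation can be done programmatically by pulling the three fields out of the float and rebuilding the value from them. This is a sketch only: the names FloatFields, decompose and rebuild are invented here, and it assumes float is the 32-bit IEEE 754 binary32 format, which holds on mainstream platforms but is not guaranteed by the language.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Fields of an IEEE 754 single-precision value (hypothetical struct).
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 explicitly stored bits
};

// Assumes float is IEEE 754 binary32.
FloatFields decompose(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined way to read raw bits
    return {bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Rebuilds the value of a normal number (not zero, subnormal, infinity or NaN).
double rebuild(const FloatFields& p) {
    const double significand = 1.0 + p.mantissa / 8388608.0;  // 1 + m / 2^23
    const double magnitude =
        std::ldexp(significand, static_cast<int>(p.exponent) - 127);
    return p.sign ? -magnitude : magnitude;
}
```

For 101.1f, decompose should report sign 0 and exponent 133, and rebuild should return 101.09999847412109375, matching the hand calculation.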
You need to read more about how floating-point numbers work, especially the part on representable numbers.
You're not giving much of an explanation as to why you think that "32 bits should be enough for 101.1", so it's kind of hard to refute.
Binary floating-point numbers don't work well for all decimal numbers, since they basically store the number in, wait for it, base 2. As in binary.
This is a well-known fact, and it's the reason why e.g. money should never be handled in floating-point.
Your number 101.1 in base 10 is 1100101.0(0011) in base 2. The 0011 part is repeating. Thus, no matter how many digits you'll have, the number cannot be represented exactly in the computer.
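Looking at the IEEE 754 standard for floating point, you can find out why the double version seemed to show it exactly: a double carries far more mantissa bits, so its error is too small to show up in the 6 decimal places printed by default.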
PS: Derivation of 101.1 in base 10 is 1100101.0(0011) in base 2:
101 = 64 + 32 + 4 + 1
101 -> 1100101
.1 * 2 = .2 -> 0
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4 -> 0
.4 * 2 = .8 -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2....
PPS: It's the same as if you wanted to store the exact result of 1/3 in base 10.
What you see here is the combination of two factors:
The IEEE754 floating-point representation cannot accurately represent a whole class of rational numbers, nor any irrational number.
The effects of rounding in printf (by default, to 6 decimal places here). That is to say, the error when using a double occurs somewhere to the right of the 6th decimal place.
If you print more digits of the double, you'll see that even a double cannot represent 101.1 exactly:
printf ("b: %.16f\n", b);
b: 101.0999999999999943
The thing is, float and double use a binary format, and not all decimal numbers can be represented exactly in a binary format.
Unfortunately, most decimal floating point numbers cannot be accurately represented in (machine) floating point. This is just how things work.
For instance, the number 101.1 in binary will be represented like 1100101.0(0011) ( the 0011 part will be repeated forever), so no matter how many bytes you have to store it, it will never become accurate. Here is a little article about binary representation of floating point, and here you can find some examples of converting floating point numbers to binary.
If you want to learn more on this subject, I could recommend you this article, though it's long and not too easy to read.