double precision in C++, 308 digits or 15 digits? - c++

If double can represent value up to 3.4E308 (308 zeros), then why do we say that double stores only 15 digits? What is point of saying "ten power of 308" ?

We don't say that "double stores only 15 digits". We say that "double has 15 digits of precision". It means that the computed value of double, when printed as a base-10 sequence of digits, is accurate only up to those 15 digits.
double can represent 3.4E308. Yes, to print it you need more than 15 digits of precision. Some particular values have those guarantees thanks to floating-point implementation. But, for example, a number 3.4E308 - 1, which is inside of double's range, cannot be represented accurately by a double.
If you want to be sure, just take the first 15 digits of double. Some values can be correctly represented with more than 15 digits, but some cannot. Every value in double's range will be correctly represented up to the 15th digit of its decimal representation.

To help perceive this in simple terms, consider a number type that can represent following numbers:
-100000
-100
-9
-8
-7
-6
-5
-4
-3
-2
-1
0
+1
+2
+3
+4
+5
+6
+7
+8
+9
+100
+100000
The type can represent numbers up to 100000. Does this type have 6 digits of precision, or 1?

Related

Reverse number -first digit 0 [duplicate]

When an integer is initialized as int a = 010, a is actually set to 8, but for int a = 10, a is set to 10.
Can anyone tell me why a is not set to 10 for int a = 010?
Because it's interpreting 010 as a number in octal format. And in a base-8 system, the number 10 is equal to the number 8 in base-10 (our standard counting system).
More generally, in the world of C++, prefixing an integer literal with 0 specifies an octal literal, so the compiler is behaving exactly as expected.
0 before the number means it's in octal notation. So since octal uses a base of 8, 010 would equal 8.
In the same way 0x is used for hexadecimal notation which uses the base of 16. So 0x10 would equal 16 in decimal.
In C, C++, Objective C and related languages a 0 prefix signifies an octal literal constant, so 010 = 8 in decimal.
Leading 0 in 010 means that this number is in octal form. So 010 means 8 in decimal.

How is float_max + 1 defined in C++?

What is that actual value of f?
float f = std::numeric_limits<float>::max() + 1.0f;
For unsigned integral types it is well defined to overflow to 0, and for signed integral it is undefined/implementation specific if I'm not wrong.
But how is it specified in standard for float/double? Is it std::numeric_limits<float>::max() or does it become std::numeric_limits<float>::infinity()?
On cppreference I didn't find a specification so far, maybe I missed it.
Thanks for help!
In any rounding mode, max + 1 will simply be max with an IEEE-754 single-precision float.
Note that the maximum positive finite 32-bit float is:
3 2 1 0
1 09876543 21098765432109876543210
S ---E8--- ----------F23----------
Binary: 0 11111110 11111111111111111111111
Hex: 7F7F FFFF
Precision: SP
Sign: Positive
Exponent: 127 (Stored: 254, Bias: 127)
Hex-float: +0x1.fffffep127
Value: +3.4028235e38 (NORMAL)
For this number to overflow and become infinity using the default rounding mode of round-nearest-ties-to-even, you have to add at least:
3 2 1 0
1 09876543 21098765432109876543210
S ---E8--- ----------F23----------
Binary: 0 11100110 00000000000000000000000
Hex: 7300 0000
Precision: SP
Sign: Positive
Exponent: 103 (Stored: 230, Bias: 127)
Hex-float: +0x1p103
Value: +1.0141205e31 (NORMAL)
Anything you add less than this particular value will round it back to max value itself. Different rounding modes might have slightly different results, but the order of the number you're looking for is about 1e31, which is pretty darn large.
This is an excellent example of how IEEE floats get sparser and sparser as their magnitude increases.

Why is absolute value of INT_MIN different from INT MAX? [duplicate]

This question already has answers here:
What is “two's complement”?
(24 answers)
Closed 7 years ago.
I'm trying to understand why INT_MIN is equal to -2^31 - 1 and not just -2^31.
My understanding is that an int is 4 bytes = 32 bits. Of these 32 bits, I assume 1 bit is used for the +/- sign, leaving 31 bits for the actual value. As such, INT_MAX is equal to 2^31-1 = 2147483647. On the other hand, why is INT_MIN equal to -2^31 = -2147483648? Wouldn't this exceed the '4 bytes' allotted for int? Based on my logic, I would have expected INT_MIN to equal -2^31 = -2147483647
Most modern systems use two's complement to represent signed integer data types. In this representation, one state in the positive side is used up to represent zero, hence one positive value lesser than the negatives. In fact this is one of the prime advantage this system has over the sign-magnitude system, where zero has two representations, +0 and -0. Since zero has only one representation in two's complement, the other state, now free, is used to represent one more number.
Let's take a small data type, say 4 bits wide, to understand this better. The number of possible states with this toy integer type would be 2⁴ = 16 states. When using two's complement to represent signed numbers, we would have 8 negative and 7 positive numbers and zero; in sign-magnitude system, we'd get two zeros, 7 positive and 7 negative numbers.
Bin Dec
0000 = 0
0001 = 1
0010 = 2
0011 = 3
0100 = 4
0101 = 5
0110 = 6
0111 = 7
1000 = -8
1001 = -7
1010 = -6
1011 = -5
1100 = -4
1101 = -3
1110 = -2
1111 = -1
I think you are confused since you are imagining that sign-magnitude representation is used for signed numbers; although this is also allowed by the language standards, this system is very less likely to be implemented as two's complement system is significantly a better representation.
As of C++20, only two's complement is allowed for signed integers; source.

Minimum Number of Bits Required for Two's Complement Form

On my midterm, there was a question stating:
Given the decimal values, what is the minimum number of bits required to represent each number in Two's Complement form?
The values were: -26, -1, 10, -15, -4.
I did not get this question right whatsoever, and the solutions are quite baffling.
The only part I really understand is finding the range in which the value is located. For example, -15 would be within the range of [-2^5, 2^5), and -4 would be in the range from [-2^2, 2^2). What steps are needed from here in order to find how many bits were necessary?
I tried finding some pattern to solve it, but it only worked for the first two cases. Here's my attempt:
First I found the range. -2^6 < -26 < 2^6
Then I found the value for 2^6 = 32.
Then I found the difference between the "closest" bound, and the value.
-26 - (-32) = 6
Again, this worked for the first two values by chance, and now I'm stumped as to find the actual relation between the number of bits required for an integer to be represented in Two's complement form, and the actual integer.
Thanks in advance!
First off, you're off on your powers of 2. 32 = 25.
Anyway, I followed you through the first two steps. Your last step doesn't make sense.
Find the power-of-two range that brackets the number. You want a power-of-two range of the form [-2N, 2N - 1]. So, for -26, that would be -25 &leq; -26 &leq; 25 - 1. That corresponds to -32 &leq; -26 &leq; 31.
Number of bits for the 2s complement representation will then simply be N plus 1. The "plus 1" accounts for the sign bit. For -26, that's 5 + 1 = 6.
So, for each of the numbers you gave: -26, -1, 10, -15, -4.
-25 &leq; -26 &leq; 25 - 1 becomes -32 &leq; -26 &leq; 31, which gives 5 + 1 = 6.
-20 &leq; -1 &leq; 20 - 1 becomes -1 &leq; -1 &leq; 0, which gives 0 + 1 = 1.
-24 &leq; 10 &leq; 24 - 1 becomes -16 &leq; 10 &leq; 15, which gives 4 + 1 = 5.
-24 &leq; -15 &leq; 24 - 1 becomes -16 &leq; -15 &leq; 15, which gives 4 + 1 = 5.
-22 &leq; -4 &leq; 22 - 1 becomes -4 &leq; -4 &leq; 3, which gives 2 + 1 = 3.
Got it?
The -1 one is tricky...

How to represent a negative number with a fraction in 2's complement?

So I want to represent the number -12.5. So 12.5 equals to:
001100.100
If I don't calculate the fraction then it's simple, -12 is:
110100
But what is -12.5? is it 110100.100? How can I calculate this negative fraction?
With decimal number systems, each number position (or column) represents (reading a number from right to left): units (which is 10^0), tens (i.e. 10^1),hundreds (i.e. 10^2), etc.
With unsigned binary numbers, the base is 2, thus each position becomes (again, reading from right to left): 1 (i.e. 2^0) ,2 (i.e. 2^1), 4 (i.e. 2^2), etc.
For example
2^2 (4), 2^1 (2), 2^0 (1).
In signed twos-complement the most significant bit (MSB) becomes negative. Therefore it represent the number sign: '1' for a negative number and '0' for a positive number.
For a three bit number the rows would hold these values:
-4, 2, 1
0 0 1 => 1
1 0 0 => -4
1 0 1 => -4 + 1 = -3
The value of the bits held by a fixed-point (fractional) system is unchanged. Column values follow the same pattern as before, base (2) to a power, but with power going negative:
2^2 (4), 2^1 (2), 2^0 (1) . 2^-1 (0.5), 2^-2 (0.25), 2^-3 (0.125)
-1 will always be 111.000
-0.5 add 0.5 to it: 111.100
In your case 110100.10 is equal to -32+16+4+0.5 = -11.5. What you did was create -12 then add 0.5 rather than subtract 0.5.
What you actually want is -32+16+2+1+0.5 = -12.5 = 110011.1
you can double the number again and again until it's negative integer or reaches a defined limit and then set the decimal point correspondingly.
-25 is 11100111, so -12.5 is 1110011.1
So;U want to represent -12.5 in 2's complement representation
12.5:->> 01100.1
2's complement of (01100.1):->>10011.1
verify the ans by checking the weighted code property of 2's complement representation(MSB weight is -ve). we will get -16+3+.5=-12.5