C++: double vs unsigned int. Why does it behave like this?

I'm trying to multiply three numbers, but I get a strange result. Why do I get such different results?
unsigned int a = 7;
unsigned int b = 8;
double d1 = -2 * a * b;
double d2 = -2 * (double) a * (double) b;
double d3 = -2 * ( a * b );
// outputs:
// d1 = 4294967184.000000
// d2 = -112.000000
// d3 = 4294967184.000000

In your first example, the number -2 is converted to unsigned int, and the whole multiplication is carried out in unsigned arithmetic. The mathematical result, -112, is not representable, so it wraps to 2^32 - 112 = 4294967184. This result is finally converted to double for the assignment.
In the second example, all the math is done on doubles, leading to the correct result. You would get the same result with:
double d2 = -2.0 * a * b;
as -2.0 is a double literal.
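For reference, here is a minimal complete program putting the three variants together, plus the double-literal form (d4 is just an extra name for that last variant; the output assumes a typical platform with 32-bit unsigned int):

#include <cstdio>

int main()
{
    unsigned int a = 7;
    unsigned int b = 8;
    double d1 = -2 * a * b;                   // unsigned math throughout: wraps modulo 2^32
    double d2 = -2 * (double) a * (double) b; // double math throughout
    double d3 = -2 * ( a * b );               // unsigned math again: wraps modulo 2^32
    double d4 = -2.0 * a * b;                 // the double literal forces double math
    std::printf("%f %f %f %f\n", d1, d2, d3, d4);
    // 4294967184.000000 -112.000000 4294967184.000000 -112.000000
    return 0;
}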

double is signed, which means that its most significant bit (the sign bit) determines whether the number is positive or negative.
unsigned int cannot hold negative values because it uses that most significant bit to extend the range of positive numbers it can express. So in
double d1 = -2 * a * b;
your machine computes the whole (-2 * a * b) as an unsigned int (like a and b), producing the binary pattern 1111 1111 1111 1111 1111 1111 1001 0000 (the two's complement of 112, which is 0000 0000 0000 0000 0000 0000 0111 0000). The problem is that an unsigned int treats this as a very big positive integer (4294967184), because it doesn't treat the leading 1 as a sign bit.
That value is then converted to a double, which is why you see the .000000 printed.
The other example works because you cast a and b to double, so when multiplying -2 by a double your computer carries the result in a double, and the sign is honored.
double d3 = -2 * (double) (a * b);
will work as well.

double d1 = -2 * a * b;
Everything on the right-hand side is an integral type, so the right-hand side will be computed as an integral type. a and b are unsigned, so that dictates the specific type of the result. What about that -2? It's converted to an unsigned int. Negative integers convert to unsigned integers modulo 2^n, which on a two's-complement machine simply reinterprets the bit pattern: that -2 becomes a very large positive unsigned integer.
double d2 = -2 * (double) a * (double) b;
Now the right-hand side mixes integers and floating-point numbers, so it will be computed as a floating-point type. What about that -2? It's converted to a double, and this conversion is straightforward: -2 converted to a double becomes -2.0.
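A two-line sketch contrasting the two conversions (the first printed value assumes 32-bit unsigned int):

#include <iostream>

int main()
{
    std::cout << (unsigned int) -2 << '\n'; // 4294967294
    std::cout << (double) -2 << '\n';       // -2
}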

In C and C++ the built-in operators are always applied to two operands of the same type. A very precise set of rules governs the conversion of one (or both) of the two operands when they initially differ (or are too small).
In this precise case, -2 is by default of type int (signed int is a synonym), while a and b are of type unsigned int. Here the rules state that -2 is converted to unsigned int, and because your system almost certainly has 32-bit ints with a two's-complement representation, this ends up being 2^32 - 2 (4 294 967 294). This number is then multiplied by a and the result taken modulo 2^32 (4 294 967 282), then by b, modulo 2^32 once again (4 294 967 184).
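Here is a sketch that retraces those intermediate values, using 64-bit arithmetic so the modulo-2^32 reductions are explicit (assuming 32-bit unsigned int):

#include <cstdint>
#include <iostream>

int main()
{
    const std::uint64_t MOD = 1ull << 32; // 2^32
    std::uint64_t v = MOD - 2;            // -2 converted to unsigned int
    std::cout << v << '\n';               // 4294967294
    v = v * 7 % MOD;                      // multiply by a, reduce modulo 2^32
    std::cout << v << '\n';               // 4294967282
    v = v * 8 % MOD;                      // multiply by b, reduce modulo 2^32
    std::cout << v << '\n';               // 4294967184
}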
It's a weird system really, and it has led to countless bugs. Overflow itself, for example, led to the Linux leap-second bug of June 30, 2012, which hung many computers around the world. I hear it also crashed a couple of Java systems.

Related

Is a negative integer summed with a greater unsigned integer promoted to unsigned int?

After being advised to read C++ Primer, 5th ed., by Stanley B. Lippman, I don't understand this:
Page 66, "Expressions Involving Unsigned Types":
unsigned u = 10;
int i = -42;
std::cout << i + i << std::endl; // prints -84
std::cout << u + i << std::endl; // if 32-bit ints, prints 4294967264
He said:
In the second expression, the int value -42 is converted to unsigned before the addition is done. Converting a negative number to unsigned behaves exactly as if we had attempted to assign that negative value to an unsigned object. The value “wraps around” as described above.
But if I do something like this:
unsigned u = 42;
int i = -10;
std::cout << u + i << std::endl; // Why the result is 32?
As you can see, -10 does not appear to have been converted to unsigned int. Does this mean a comparison occurs before a signed integer is promoted to an unsigned integer?
-10 is being converted to an unsigned integer with a very large value; the reason you get a small number is that the addition wraps around again. With 32-bit unsigned integers, -10 is the same as 4294967286. When you add 42 to that you get 4294967328, but the values wrap modulo 4294967296, and 4294967328 modulo 4294967296 is 32.
Well, I guess this is an exception to "two wrongs don't make a right" :)
What's happening is that there are actually two wrap arounds (unsigned overflows) under the hood and the final result ends up being mathematically correct.
First, i is converted to unsigned and as per the wrap around behavior the value is std::numeric_limits<unsigned>::max() - 9.
When this value is summed with u the mathematical result would be std::numeric_limits<unsigned>::max() - 9 + 42 == std::numeric_limits<unsigned>::max() + 33 which is an overflow and we get another wrap around. So the final result is 32.
As a general rule, in an arithmetic expression, if you only have unsigned overflows (no matter how many) and the final mathematical result is representable in the expression's data type, then the value of the expression will be the mathematically correct one. This is a consequence of the fact that unsigned integers in C++ obey the laws of arithmetic modulo 2^n (see below).
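A small sketch making both wrap-arounds visible (the printed values assume 32-bit unsigned):

#include <iostream>
#include <limits>

int main()
{
    unsigned u = 42;
    int i = -10;
    unsigned wrapped = i; // first wrap-around: max - 9
    std::cout << wrapped << '\n';                                  // 4294967286
    std::cout << std::numeric_limits<unsigned>::max() - 9 << '\n'; // the same value
    std::cout << u + i << '\n';                                    // second wrap-around: 32
}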
Important notice: according to C++, unsigned arithmetic does not overflow:
§6.9.1 Fundamental types [basic.fundamental]
Unsigned integers shall obey the laws of arithmetic modulo 2^n, where n is the number of bits in the value representation of that particular size of integer.⁴⁹
49) This implies that unsigned arithmetic does not overflow because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting unsigned integer type.
I will however leave "overflow" in my answer to express values that cannot be represented in regular arithmetic.
Also what we colloquially call "wrap around" is in fact just the arithmetic modulo nature of the unsigned integers. I will however use "wrap around" also because it is easier to understand.
i is in fact converted to unsigned int (via the usual arithmetic conversions; colloquially, "promoted").
Unsigned integers in C and C++ implement arithmetic in ℤ/2^nℤ, where n is the number of bits in the unsigned integer type. Thus we get
[42] + [-10] ≡ [42] + [2^n - 10] ≡ [2^n + 32] ≡ [32],
with [x] denoting the equivalence class of x in ℤ/2^nℤ.
Of course, the intermediate step of picking only non-negative representatives of each equivalence class, while it formally occurs, is not necessary to explain the result; the immediate
[42] + [-10] ≡ [32]
would also be correct.
"In the second expression, the int value -42 is converted to unsigned before the addition is done"
Yes, this is true.
unsigned u = 42;
int i = -10;
std::cout << u + i << std::endl; // Why the result is 32?
Supposing we are on a 32-bit system (nothing changes in 64-bit; this is just to make the example concrete), this is computed as 42u + ((unsigned) -10), i.e. 42u + 4294967286u, and the result is 4294967328u, truncated to 32 bits, which gives 32. All of it was done in unsigned arithmetic.
This is part of what is wonderful about two's-complement representation. The processor doesn't know or care whether a number is signed or unsigned; the operations are the same. In both cases the calculation is correct. It's only how the binary number is interpreted after the fact, when printing, that actually matters (there may be other cases, as with the comparison operators).
-10 in 32-bit two's complement is 0xFFFFFFF6
42 in 32 bits is 0x0000002A
Adding them together, it doesn't matter to the processor whether they are signed or unsigned; the result is 0x100000020. In 32 bits, the 1 at the front ends up in the carry flag, and in C++ it just disappears. You get 0x20 as the result, which is 32.
In the first case, it is basically the same:
-42 in 32-bit two's complement is 0xFFFFFFD6
10 in 32 bits is 0x0000000A
Add those together and you get 0xFFFFFFE0.
0xFFFFFFE0 as a signed int is -32 (decimal). The calculation is correct! But because it is being printed as unsigned, it shows up as 4294967264. It's all about interpreting the result.
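To see the "interpretation only" point in code, here is a sketch that views one 32-bit pattern both ways (memcpy is used for the reinterpretation so the code stays well-defined):

#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
    std::uint32_t bits = 0xFFFFFFE0u; // the pattern produced by -42 + 10
    std::int32_t as_signed;
    std::memcpy(&as_signed, &bits, sizeof bits); // same bits, signed view
    std::cout << as_signed << '\n';              // -32
    std::cout << bits << '\n';                   // 4294967264
}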

Squaring a number in C++ yields wrong value

If I do
int n = 100000;
long long x = n * n;
then x == 1410065408
1410065408 is 2^31, yet I expect x to be 64 bit
What is going on?
I'm using the default Visual Studio C++ compiler.
n*n is too big for an int, because mathematically it equals 10^10. The multiplication is performed in int and overflows; only the (erroneous) wrapped result is then stored in the long long.
Try:
long long n = 100000;
long long x = n*n;
Here's a reference to the standard specifying that the operation long long x = (long long)n * n, where n is an int, will not cause data loss. Specifically:
If both operands have signed integer types or both have unsigned integer types, the operand with the type of lesser integer conversion rank shall be converted to the type of the operand with greater rank.
Since the cast binds more tightly than the multiplication, it converts the left operand to a long long. The right operand, of type int, is then converted to long long according to the rule above. No loss occurs.
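As a runnable sketch of that form; casting one operand is enough, since the other is then converted to long long by the rule quoted above:

#include <iostream>

int main()
{
    int n = 100000;
    long long x = (long long) n * n; // cast one operand; the other converts to match
    std::cout << x << '\n';          // 10000000000
}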
Declaring n as a long long is the best solution as mentioned previously.
Just as a quick clarification to the original post, 1410065408 is not 2^31; the value comes about as follows:
100,000^2 = 10,000,000,000, which in binary is:
10 0101 0100 0000 1011 1110 0100 0000 0000
An int is 32 bits on this platform, so the front two bits are discarded and the value is stored as the binary:
0101 0100 0000 1011 1110 0100 0000 0000
In decimal, this is equal to exactly 1410065408.
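The chop can be reproduced explicitly with 64-bit arithmetic (a sketch; the mask keeps only the low 32 bits, mimicking what happens to the int result here):

#include <cstdint>
#include <iostream>

int main()
{
    std::uint64_t full = 100000ull * 100000ull; // 10000000000, needs 34 bits
    std::uint32_t low = full & 0xFFFFFFFFull;   // keep only the low 32 bits
    std::cout << full << '\n';                  // 10000000000
    std::cout << low << '\n';                   // 1410065408
}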
Edit - This is another solution to the problem; it casts the integer values to long long before the multiplication, so you don't get truncation of bits.
Original Posting
int n = 100000;
long long x = static_cast<long long>( n ) * static_cast<long long>( n );
Edit - The original answer provided by Jossie Calderon was already accepted as a valid answer and this answer adds another valid solution.

C++ char arithmetic overflow

#include <stdio.h>

int main()
{
    char a = 30;
    char b = 40;
    char c = 10;
    printf("%d ", char(a * b));
    char d = (a * b) / c;
    printf("%d ", d);
    return 0;
}
The above code yields the normal int value if x is in the range -128 to 127, and a wrapped-around value otherwise. I can't understand how the overflow value, -80 in this case, is calculated.
The trick here is how numbers are represented. Look into two's complement. 30 * 40 is 1200, or 10010110000 in base 2. But our char is only 8 bits, so we chop off the leading 100 (and all the implied 0s before that). This leaves us with 10110000.
Note the leading 1. In two's complement, which is how your computer almost certainly stores the values, this indicates a negative number: 11111111 is -1, 11111110 is -2, and so on. Counting down to 10110000 we reach -80.
That is, interpreting 10110000 as two's complement leaves us with -80.
You can decode two's complement by hand: invert all the bits and add one. Here 10110000 inverts to 01001111, which is 79; adding one gives 80, so the pattern represents -80.
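The chopping and reinterpreting can also be spelled out in code; a sketch using int8_t to make the 8-bit reinterpretation explicit:

#include <cstdint>
#include <cstdio>

int main()
{
    int product = 30 * 40;    // 1200 == 10010110000 in binary
    int low = product & 0xFF; // low 8 bits: 10110000 == 176
    // Reinterpreting 176 as a signed 8-bit value gives -80. (Pre-C++20 this
    // conversion is implementation-defined, but two's-complement machines
    // all behave this way.)
    std::int8_t as_signed = (std::int8_t) low;
    std::printf("%d %d\n", low, as_signed); // 176 -80
}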
A char has only 1 byte. In this case 1200 is 0100 1011 0000 in binary.
One byte can hold only 8 bits, so in your case the first 4 bits are dropped, leaving 1011 0000. Now you have -80 (the first bit shows whether the number is negative (1) or positive (0)).
Try it with your calculator in programmer mode: type 1200 in decimal and switch from Qword to Byte, and you can see what happens to your number.

C++ modulus requires cast of subtraction between two *un*signed bytes to work, why?

The following Arduino (C++) code
void setup()
{
    Serial.begin(115200);
    byte b1 = 12;
    byte b2 = 5;
    const byte RING_BUFFER_SIZE = 64;
    byte diff = b2 - b1;
    byte diff2 = (byte)(b2 - b1) % RING_BUFFER_SIZE; // <-- NOTE: the (byte) cast is REQUIRED to get the right result
    Serial.println(b1);
    Serial.println(b2);
    Serial.println(RING_BUFFER_SIZE);
    Serial.println(diff);
    Serial.println(diff2);
}

void loop()
{
}
produces the expected:
12
5
64
249
57 //<--correct answer
Whereas without the "(byte)" cast as shown here:
void setup()
{
    Serial.begin(115200);
    byte b1 = 12;
    byte b2 = 5;
    const byte RING_BUFFER_SIZE = 64;
    byte diff = b2 - b1;
    byte diff2 = (b2 - b1) % RING_BUFFER_SIZE; // <-- (byte) cast removed
    Serial.println(b1);
    Serial.println(b2);
    Serial.println(RING_BUFFER_SIZE);
    Serial.println(diff);
    Serial.println(diff2);
}

void loop()
{
}
it produces:
12
5
64
249
249 //<--wrong answer
Why the difference? Why does the modulo operator ONLY work with the explicit cast?
Note: "byte" = "uint8_t"
5 - 12 gives -7 (an int). So your code does -7 % 64.
Mathematically we might expect this to give 57. However, in C and C++, % doesn't do what you might expect mathematically for negative operands. Instead it satisfies the following identity:
(a/b) * b + a%b == a
Now, (-7)/64 gives 0, because C and C++ use truncation toward zero for integer division. Therefore -7 % 64 evaluates to -7.
Finally, converting -7 to uint8_t gives 249.
When you write (byte)(b2 - b1) % 64 you are actually doing 249 % 64, giving the expected answer.
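Both evaluations side by side, as a standard C++ sketch (uint8_t standing in for Arduino's byte):

#include <cstdint>
#include <iostream>

int main()
{
    std::uint8_t b1 = 12, b2 = 5;
    std::cout << (b2 - b1) % 64 << '\n';               // -7: the subtraction is done in int
    std::cout << (std::uint8_t)(b2 - b1) % 64 << '\n'; // 57: -7 becomes 249 before %
}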
Regarding the behaviour of b2 - b1: all integer arithmetic is done in at least int precision. Each operand of -, if it has a narrower integer type than int, is first promoted to int (leaving the value unchanged). Further conversions may occur if the types still differ after this promotion (they don't in this case).
In code, b2 - b1 means (int)b2 - (int)b1, yielding an int; there is no way to request that the operation be performed at lower precision.
Arithmetic operations want to operate on int or larger. So your bytes are promoted to int before they are subtracted, and plain int is fine here because it can hold the entire range of byte.
If the result of the subtraction is cast back down to byte, it gives you the expected wraparound behavior. However, if you omit the cast in the diff2 calculation, you're taking the modulus of a negative int, and because C and C++ signed division rounds toward zero, the signed modulus has the same sign as the dividend.
The first misstep here is to expect subtraction to act directly on your byte type, or to translate your unsigned byte into an unsigned int. The cascading problem is to overlook the behavior of C++ signed division (which is understandable if you don't know that you should expect signed arithmetic to be an issue in the first place).
Note that, if your RING_BUFFER_SIZE were not a power of two, the modulus wouldn't give the right answer for cases like this anyway. And, since it is a power of two, note that:
(b2 - b1) & (RING_BUFFER_SIZE - 1)
should work correctly.
And finally (as suggested in the comment), the right way to do a ring-buffer subtraction is to make sure b1 < RING_BUFFER_SIZE (which makes sense for a ring-buffer operation) and use something like:
(b2 >= b1) ? b2 - b1 : RING_BUFFER_SIZE + b2 - b1
(note >= rather than >, so that b2 == b1 yields 0 rather than RING_BUFFER_SIZE).

How does floating-point arithmetic work when one is added to a big number?

If we run this code:
#include <iostream>

int main()
{
    using namespace std;
    float a = 2.34E+22f;
    float b = a + 1.0f;
    cout << "a=" << a << endl;
    cout << "b-a=" << b - a << endl;
    return 0;
}

Then the result will be 0, because a float has only 6 or 7 significant decimal digits, while the 1.0 would have to be added at the 23rd digit of the number. So how does the program realize that there is no place for the 1? What is the algorithm?
Step by step:
IEEE-754 32-bit binary floating-point format:
sign 1 bit
significand 23 bits
exponent 8 bits
I) float a = 23400000000.f;
Convert 23400000000.f to float:
23,400,000,000 = 101 0111 0010 1011 1111 1010 1010 0000 0000₂
= 1.0101110010101111111010101000000000₂ · 2³⁴.
But the significand can store only 23 bits after the point. So we must round:
1.01011100101011111110101 01000000000₂ · 2³⁴
≈ 1.01011100101011111110101₂ · 2³⁴
So, after
float a = 23400000000.f;
a is equal to 23,399,999,488.
II) float b = a + 1;
a = 10101110010101111111010100000000000₂
b = 10101110010101111111010100000000001₂
= 1.0101110010101111111010100000000001₂ · 2³⁴.
But, again, the significand can store only 23 binary digits after the point. So we must round:
1.01011100101011111110101 00000000001₂ · 2³⁴
≈ 1.01011100101011111110101₂ · 2³⁴
So, after
float b = a + 1;
b is equal to 23,399,999,488.
III) float c = b - a;
10101110010101111111010100000000000₂ - 10101110010101111111010100000000000₂ = 0
This value can be stored in a float without rounding.
So, after
float c = b - a;
c is equal to 0.
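The rounded values can be observed directly by printing with enough digits (a sketch, assuming IEEE-754 floats):

#include <cstdio>

int main()
{
    float a = 23400000000.f;
    float b = a + 1.0f;
    std::printf("a   = %.1f\n", a);     // 23399999488.0: rounded already on conversion
    std::printf("b   = %.1f\n", b);     // 23399999488.0: the +1 was rounded away
    std::printf("b-a = %.1f\n", b - a); // 0.0
}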
The basic principle is that the two numbers are aligned so that the decimal point is in the same place. I'm using a decimal example to make it a little easier to read:
a = 1.234E+10f;
b = a+1.0f;
When calculating a + 1.0f, the decimal points need to be lined up:
1.234E+10f becomes 12340000000.0
1.0f becomes 1.0
+
= 12340000001.0
But since it's a float, with only about 7 significant digits, the 1 on the right is outside the representable range, so the number stored will be 1.234000E+10; any digits beyond that are lost, because there are just not enough digits.
[Note that if you do this with an optimizing compiler, it may still show 1.0 as the difference, because the floating-point unit may use a 64- or 80-bit internal representation; if the calculation is done without storing the intermediate results in a variable (and a decent compiler can certainly achieve that here), the 1.0 can survive. With 2.34E+22f it is guaranteed not to fit in a 64-bit double, and probably not in an 80-bit format either.]
When adding two FP numbers, they're first brought to the same exponent. In decimal:
2.34000E+22 + 1.00000E+0 = 2.34000E+22 + 0.00000E+22. In this step, the 1.0 is lost to rounding.
Binary floating point works pretty much the same, except that the power of ten is replaced by a power of two (2.34E+22 is roughly 1.24 · 2⁷⁴).
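And the original example in code (a sketch; with IEEE-754 floats the added 1.0 is far below one ulp of 2.34E+22, so a is left unchanged):

#include <iostream>

int main()
{
    float a = 2.34e22f;
    float b = a + 1.0f;
    std::cout << std::boolalpha << (a == b) << '\n'; // true: the 1.0 was rounded away
    std::cout << b - a << '\n';                      // 0
}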