Overflow if carry out and carry in are both 1? (two's complement)

I am confused because the definitions of overflow I see say we get overflow when carry in and carry out differ. But what if we add two negative numbers such that the carry out of the MSB is 1 (meaning the number extends past the number of bits we have to represent it) but the carry in is also 1? Technically the number would still be negative, but overflow would have occurred, right?
I think adding 1100 to itself works as an example, since 1100 + 1100 = 1000 (after the carry out is dropped), which does not look correct algebraically but would still pass the overflow "test"?

Related

Overflow and carry flag

The context
I read in a textbook that adding a positive and a negative number cannot cause overflow. To quote:
"An overflow cannot occur after an addition if one number is positive and the other negative, since adding a positive number to a negative number produces a result whose magnitude is smaller (...)".
However, by going through some problems it didn't seem to be the case and I want to confirm what I calculated isn't some mistake.
For example, a context in which this applies: a 4-bit adder-subtractor with M = 1 (meaning subtraction of B),
A = 0101 (+5) and B = 1010 (+10).
By taking the 2's complement of B, 0110 (-10), and adding the numbers, the subtraction can be made,
e.g. (5) + (-10):
  +5    0101
 -10    0110
        ----
result: 1011
result's 2's complement: 0101, so the result is -5.
The carry into the MSB is 1 and the carry out is 0, so C = 0 and V = 1.
A couple of questions arise just from performing this problem:
The overflow bit is set despite the fact that there is no overflow (the result, -5, is in range).
Given that the range is -8 to 7, wouldn't adding a signed integer and an unsigned integer also cause overflow, e.g. (-1 + 9)?
e.g. (-1 + 9):
  -1    1111
  +9    1001
        ----
result: 1000 (-8)
The carry into the MSB is 1 and the carry out is 1, so C = 1 and V = 0.
I noticed that when C = 0 there is no overflow and when C = 1 there is an overflow.
I read that overflow between two unsigned integers is indicated by the C carry flag,
while overflow between two signed integers is indicated by the V overflow flag. Could this be related?
Finally, notice that there is overflow between an unsigned and a signed integer, even though the statement I quoted says otherwise.
TL;DR
Is overflow possible when adding an unsigned integer and a signed integer? If so, which flag (C or V) indicates overflow between an unsigned and a signed integer?
The "overflow bit" is usually defined for adding or subtracting two signed numbers. When dealing with signed numbers the first bit is the sign, so for a 4-bit adder 7 is the biggest integer available. When you choose 10 you have already chosen a number bigger than your adder supports: 1010 does not mean 10 but -6. You are, in fact, subtracting -6 from 5, which causes overflow.
I think EduardoS's answer is fine, and Harold's comment is cleaner. You have 4 bits. That represents
[ -(2**3) , (2**3 - 1) ]
Or, more simply,
[ -8, 7 ]
You pick -10. But that doesn't exist in that range, so most architectures will wrap it, and that wrapping sets the overflow flag. Let's expand the bitfield to 8 bits.
00001010 = 10
Now we invert it and add one:
11110110
Now you can trim off a few of those leading 1s, since the value is effectively sign-extended, but you must keep the sign bit:
10110
So you must start out with 5 bits, not 4. Otherwise you're dropping the sign bit, and that's what overflows.
Now that that's fixed, what happens if we add
00101 (+5)
10110 (-10)
-----
11011
Now the sign bit is set, so to read off the magnitude, invert and add 1:
  11011
  00100 (inverted)
+ 00001 (add 1)
  -----
  00101 (= 5)
So the answer is -5, and you did it without overflowing.
The takeaway here is: to make a positive number overflow you need to add to it; to make a negative number overflow you have to subtract from it. If the two numbers have different signs (and both are valid), their sum has to fit in the same space without overflow.

How are Overflow situations dealt with? [duplicate]

This question already has answers here:
Why is unsigned integer overflow defined behavior but signed integer overflow isn't?
(6 answers)
Closed 7 years ago.
I just simply wanted to know: who is responsible for dealing with mathematical overflow cases in a computer?
For example, in the following C++ code:
short x = 32768;
std::cout << x;
Compiling and running this code on my machine gave me a result of -32767.
A short variable's size is 2 bytes, and we know 2 bytes can hold a maximum signed decimal value of 32767. So when I assigned 32768 to x, after exceeding its max value of 32767, it started counting from -32767 all over again toward 32767, and so on.
What exactly happened so that the value -32767 was given in this case?
I.e., what are the binary calculations done in the background that resulted in this value?
So, who decided that this happens? I mean, who is responsible for deciding that when a mathematical overflow happens in my program, the value of the variable simply starts again from its minimum value, or an exception is thrown, or the program simply freezes, etc.?
Is it the language standard, the compiler, my OS, my CPU, or something else?
And how does it deal with that overflow situation? (A simple explanation or a link explaining it in detail would be appreciated :) )
And by the way, who decides what the size of a short int would be on my machine? Is that also the language standard, compiler, OS, CPU, etc.?
Thanks in advance! :)
Edit:
OK, so I understood from here: Why is unsigned integer overflow defined behavior but signed integer overflow isn't?
that it's the processor that defines what happens in an overflow situation (for example, on my machine it started from -32767 all over again), depending on the processor's representation of signed values, i.e. sign magnitude, one's complement or two's complement...
Is that right?
And in my case (when the result looked like it started from the min value -32767 again), how do you suppose my CPU is representing signed values, and how did the value -32767 come up? (Again, the binary calculations that lead to this, please :) )
It doesn't start at its min value per se. It just truncates the value. For a 4-bit number, you can count up to 1111 (binary, = 15 decimal). If you increment by one, you get 10000, but there is no room for that, so the leading digit is dropped and 0000 remains. If you calculate 1111 + 10, you get 1.
You can add them up as you would on paper:
1111
0010
---- +
10001
But instead of adding up the entire number, the processor will only add up to (in this case) 4 bits. After that there is no more room, but if there is still a 1 to carry, it sets the carry flag, so you can check whether the last addition it did overflowed.
Processors have basic instructions to add numbers, for both smaller and larger values. A 64-bit processor can add 64-bit numbers (actually, they usually don't add two numbers into a third; they add the second number to the first, modifying it, but that's not really important for the story).
But apart from 64 bits, they can often also add 32-, 16- and 8-bit numbers. That's partly because it can be efficient to add only 8 bits if you don't need more, but also to stay backwards compatible with older programs written for a previous version of the processor that could add 32-bit but not 64-bit numbers.
Such a program uses an instruction to add 32-bit numbers, and the same instruction must also exist on the 64-bit processor, with the same behavior on overflow; otherwise the program wouldn't run properly on the newer processor.
Apart from adding with the core instructions of the processor, you can also add in software. You could make an inc function that treats a big chunk of bits as a single value. To increment it, you let the processor increment the first 64 bits; the result is stored in the first part of your chunk. If the carry flag is set, you take the next 64 bits and increment those too. This way you can extend past the limitation of the processor and handle large numbers in software.
And the same goes for the way an overflow is handled. The processor just sets the flag; your application can decide whether to act on it or not. If you want a counter that just increments to 65535 and then wraps to 0, you (your program) don't need to do anything with the flag.

How to perform a sum between doubles with bitwise operations [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I'd like to know how floating-point addition works.
How can I sum two double (or float) numbers using bitwise operations?
Short answer: if you need to ask, you are not going to implement floating-point addition from bitwise operators. It is completely possible, but there are a number of subtle points that you would need to have asked about first. You could start by implementing a double → float conversion function; it is simpler but would introduce you to many of the concepts. You could also do double → nearest integer as an exercise.
Nevertheless, here is the naive version of addition:
Use large arrays of bits for each of the two operands (254 + 23 bits for float, 2046 + 52 for double). Then:
1. Place each significand at the right place in its array according to the exponent. Assuming the arguments are both normalized, do not forget to place the implicit leading 1.
2. Add the two arrays of bits with the usual rules of binary addition.
3. Convert the resulting array back to floating-point format: look for the leftmost 1; the position of this leftmost 1 determines the exponent. The significand of the result starts right after this leading 1 and is respectively 23 or 52 bits wide. The bits after that determine whether the value should be rounded up or down.
Although this is the naive version, it is already quite complicated.
The non-naive version does not use 2100-bit wide arrays, but takes advantage of a couple of “guard bits” instead (see section “on rounding” in this document).
The additional subtleties include:
The sign bits of the arguments can mean that the magnitudes should be subtracted for an addition, or added for a subtraction.
One of the arguments can be NaN. Then the result is NaN.
One of the arguments can be an infinity. If the other argument is finite or the same infinity, the result is the same infinity. Otherwise, the result is NaN.
One of the arguments can be a denormalized number. In this case there is no leading 1 when transferring the number to the array of bits for addition.
The result of the addition can be an infinity: depending on the details of the implementation, this would be recognized as an exponent too large to fit the format, or an overflow during the addition of the binary arrays (the overflow can also occur during the rounding step).
The result of the addition can be a denormalized number. This is recognized as the absence of a leading 1 in the first 2046 bits of the array of bits. In this case the last 52 bits of the array should be transferred to the significand of the result, and the exponent should be set to zero, to indicate a denormalized result.

Getting a value of -0.000000 when multiplying big values [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See SSCCE.org for guidance.
Closed 9 years ago.
Why do I get a value of -0.000000? Does negative zero even exist?
I am multiplying two big double values. Why do I get a result like this?
Is it overflowing? Should I use a bigger data type than this?
From Wiki:
Does negative zero even exist?
Signed zero is zero with an associated sign. In ordinary arithmetic, −0 = +0 = 0. However, in computing, some number representations allow for the existence of two zeros, often denoted by −0 (negative zero) and +0 (positive zero). This occurs in the sign-and-magnitude and ones' complement signed number representations for integers, and in most floating-point number representations. The number 0 is usually encoded as +0, but can be represented by either +0 or −0.
Is it overflowing? Should I use a bigger data type than this?
In IEEE 754 binary floating-point numbers, zero values are represented by the biased exponent and significand both being zero. Negative zero has the sign bit set to one. One may obtain negative zero as the result of certain computations, for instance as the result of arithmetic underflow on a negative number, or −1.0*0.0, or simply as −0.0.
It could be a sign-magnitude thing: there exist two distinct values of zero in floating-point types, +0.0 and -0.0.
It could also be a precision thing: -0.000000000009 might be being printed as -0.000000, which is perfectly reasonable.
As is evident from your other question, the value you have is not a negative zero but is a small negative value that is displayed as “-0.000000” because of the format specification used to display it.

C++ bitwise addition: calculating the final number of representative bits

I am currently developing a utility that handles all arithmetic operations on bitsets.
The bitset can auto-resize to fit any number, so it can perform addition/subtraction/division/multiplication and modulo on very big bitsets (I've gone as far as loading a 700 MB movie into one to treat it just as a primitive integer).
I'm facing one problem, though: I need my addition to resize the bitset to fit the exact number of bits needed afterwards, but I couldn't come up with an absolute rule to know exactly how many bits would be needed to store everything, knowing only the number of bits that the two numbers occupy (whether the representation is positive or negative doesn't matter).
I have the whole code that I can share with you to point out the problem if my question is not clear enough.
Thanks in advance.
jav974
but I couldn't come up with an absolute rule to know exactly how many bits would be needed to store everything, knowing only the number of bits that the two numbers occupy (whether the representation is positive or negative doesn't matter)
Nor will you: there's no way given only the number of bits each number occupies.
In the case of same-signed numbers, you may need one extra bit. You can start at the most significant bit of the smaller number and scan for 0s that would absorb the impact of a carry. For example:
1010111011101 +
..10111010101
..^ start here
As both numbers have a 1 here, you need to scan left until you hit a 0 (in which case the result has the same number of digits as the larger input) or until you reach the most significant bit of the larger number (in which case the result has one more digit).
1001111011101 +
..10111010101
..^ start here
In this case, where the longer input has a 0 at the starting location, you first need to do a right-moving scan to establish whether there will be a carry coming from the right of that starting position, before launching into the left-moving scan above.
When signs differ:
if one value has 2 or more digits fewer than the other, then the number of digits required for the result will be either the same as, or one less than, the number of digits in the larger input;
otherwise, you'll have to do most of the work of the addition just to work out how many digits the result needs.
This is assuming the sign bit is stored separately from the magnitude bits.
Finally, the number of representative bits after an addition is at most the number of bits of the wider operand, plus 1.
Here is an explanation, using an unsigned char:
For max unsigned char :
11111111 (255)
+ 11111111 (255)
= 111111110 (510)
Naturally, if max + max fits in (bits of max) + 1, then for any x and y between 0 and max the result needs at most (bits of max) + 1 bits (the very maximum).
This works the same way with signed integers.