Getting a value as -0.000000 on multiplying bigger values [closed] - c++

Why do I get a value of -0.000000? Does negative zero even exist?
I am multiplying two large double values. Why do I get a result like this?
Is it overflowing? Should I use a bigger data type?

From Wiki:
Does negative zero even exist?
Signed zero is zero with an associated sign. In ordinary arithmetic, −0 = +0 = 0. However, in computing, some number representations allow for the existence of two zeros, often denoted by −0 (negative zero) and +0 (positive zero). This occurs in the sign-and-magnitude and ones' complement signed number representations for integers, and in most floating point number representations. The number 0 is usually encoded as +0, but can be represented by either +0 or −0.
Is it overflowing? Should I use a bigger data type?
In IEEE 754 binary floating point numbers, zero values are represented by the biased exponent and significand both being zero. Negative zero has the sign bit set to one. One may obtain negative zero as the result of certain computations, for instance as the result of arithmetic underflow on a negative number, or −1.0 * 0.0, or simply as −0.0.

It could be a sign-magnitude thing. There exist two distinct zero values in floating point types: +0.0 and -0.0.
It could also be a precision thing: -0.000000000009 might be printed as -0.000000, which is perfectly reasonable.

As is evident from your other question, the value you have is not a negative zero but is a small negative value that is displayed as “-0.000000” because of the format specification used to display it.

Related

How to perform sum between double with bitwise operations [closed]

I'd like to know how floating-point addition works.
How can I sum two double (or float) numbers using bitwise operations?
Short answer: if you need to ask, you are not going to implement floating-point addition from bitwise operators. It is entirely possible, but there are a number of subtle points you would have needed to ask about first. You could start by implementing a double → float conversion function; it is simpler but would introduce you to many of the concepts. You could also do double → nearest integer as an exercise.
Nevertheless, here is the naive version of addition:
1) Use large arrays of bits for each of the two operands (254 + 23 bits for float, 2046 + 52 bits for double).
2) Place the significand at the right position in the array according to the exponent. Assuming both arguments are normalized, do not forget to place the implicit leading 1.
3) Add the two arrays of bits with the usual rules of binary addition.
4) Convert the resulting array back to floating-point format: look for the leftmost 1; its position determines the exponent. The significand of the result starts right after this leading 1 and is 23 or 52 bits wide, respectively. The bits after that determine whether the value should be rounded up or down.
Although this is the naive version, it is already quite complicated.
The non-naive version does not use 2100-bit wide arrays, but takes advantage of a couple of “guard bits” instead (see section “on rounding” in this document).
The additional subtleties include:
The sign bits of the arguments can mean that the magnitudes should be subtracted for an addition, or added for a subtraction.
One of the arguments can be NaN. Then the result is NaN.
One of the arguments can be an infinity. If the other argument is finite or the same infinity, the result is the same infinity. Otherwise, the result is NaN.
One of the arguments can be a denormalized number. In this case there is no leading 1 when transferring the number to the array of bits for addition.
The result of the addition can be an infinity: depending on the details of the implementation, this would be recognized as an exponent too large to fit the format, or an overflow during the addition of the binary arrays (the overflow can also occur during the rounding step).
The result of the addition can be a denormalized number. This is recognized as the absence of a leading 1 in the first 2046 bits of the array of bits. In this case the last 52 bits of the array should be transferred to the significand of the result, and the exponent should be set to zero, to indicate a denormalized result.

Representation of a Gradual underflow program

I was reading about the gradual underflow concept and how it is important in the music industry: Gradual overflow Application in Music.
I understand the problem of a buffer overflow well, but I don't know how to represent an underflow.
Can you please give me an example(a program preferably in c or c++) as in how a computer handles gradual underflow?
Gradual underflow is related to what IEEE 754 calls "subnormal" numbers.
Consider using scientific notation in which you have (say) 10 digits of precision and exponents that can range from -99 through 99.
Under normal circumstances, you treat everything as scientific notation, so if you want to represent 1000, you represent it as 1e3 -- that is, 1 * 10^3.
Now, consider a number like 1.234e-102. The smallest exponent you can represent is -99. So, if you do your job the simplest possible way, you simply say that since it has an exponent smaller than that, it's just 0. That would be "flush to zero" (abrupt underflow).
In IEEE 754 (and related standards) you can store that as (essentially) 0.001234 * 10^-99. In doing so, you may lose some precision compared to a normal number that has an exponent in the -99...99 range. On the other hand, you lose less than if you just rounded it to zero because its exponent is smaller than -99. In fact, in this case it started with 4 significant digits, and as represented it retains all 4.
On a computer, the numbers are represented in binary, so the numbers of significant digits and/or maximum range of exponents aren't round numbers when converted to decimal, but the same basic idea applies--when we have a number that's too small to represent in the normal format, we can still store it with the smallest exponent that can be represented, but also includes some leading zeros.
This does lead to one difficulty: numbers are normally stored in what's called normalized form. The "significand" part is normalized by shifting it left until the first digit is a 1 (keep in mind that since it's binary it can only be 0 or 1). Since we know it's a 1, we cheat a little: we don't normally store that 1 in the number as it's stored. So, a double precision floating point number normally has 53 bits of precision, but only actually stores 52 bits of significand.
With a subnormal number, that's no longer the case. That's not terribly difficult to deal with, but it still introduces a special case--and one that's only rarely used, so CPU designers (and such) rarely try to optimize for it. As a result, the exact same code can suddenly run a lot slower when executing on data that contains subnormals.

What does -1.#IND00 mean? [duplicate]

I'm messing around with some C code using floats, and I'm getting 1.#INF00, -1.#IND00 and -1.#IND when I try to print floats to the screen. What do those values mean?
I believe that 1.#INF00 means positive infinity, but what about -1.#IND00 and -1.#IND? I also sometimes saw this value: 1.$NaN, which is Not a Number, but what causes those strange values and how can they help me with debugging?
I'm using MinGW which I believe uses IEEE 754 representation for float point numbers.
Can someone list all those invalid values and what they mean?
From IEEE floating-point exceptions in C++ :
This page will answer the following questions.
My program just printed out 1.#IND or 1.#INF (on Windows) or nan or inf (on Linux). What happened?
How can I tell if a number is really a number and not a NaN or an infinity?
How can I find out more details at runtime about kinds of NaNs and infinities?
Do you have any sample code to show how this works?
Where can I learn more?
These questions have to do with floating point exceptions. If you get some strange non-numeric output where you're expecting a number, you've either exceeded the finite limits of floating point arithmetic or you've asked for some result that is undefined. To keep things simple, I'll stick to working with the double floating point type. Similar remarks hold for float types.
Debugging 1.#IND, 1.#INF, nan, and inf
If your operation would generate a larger positive number than could be stored in a double, the operation will return 1.#INF on Windows or inf on Linux. Similarly your code will return -1.#INF or -inf if the result would be a negative number too large to store in a double. Dividing a positive number by zero produces a positive infinity and dividing a negative number by zero produces a negative infinity. Example code at the end of this page will demonstrate some operations that produce infinities.
Some operations don't make mathematical sense, such as taking the square root of a negative number. (Yes, this operation makes sense in the context of complex numbers, but a double represents a real number and so there is no double to represent the result.) The same is true for logarithms of negative numbers. Both sqrt(-1.0) and log(-1.0) would return a NaN, the generic term for a "number" that is "not a number". Windows displays a NaN as -1.#IND ("IND" for "indeterminate") while Linux displays nan. Other operations that would return a NaN include 0/0, 0*∞, and ∞/∞. See the sample code below for examples.
In short, if you get 1.#INF or inf, look for overflow or division by zero. If you get 1.#IND or nan, look for illegal operations. Maybe you simply have a bug. If it's more subtle and you have something that is difficult to compute, see Avoiding Overflow, Underflow, and Loss of Precision. That article gives tricks for computing results that have intermediate steps overflow if computed directly.
For anyone wondering about the difference between -1.#IND00 and -1.#IND (which the question specifically asked, and none of the answers address):
-1.#IND00
This specifically means a non-zero number divided by zero, e.g. 3.14 / 0 (source)
-1.#IND (a synonym for NaN)
This means one of four things (see wiki from source):
1) sqrt or log of a negative number
2) operations where both variables are 0 or infinity, e.g. 0 / 0
3) operations where at least one variable is already NaN, e.g. NaN * 5
4) out of range trig, e.g. arcsin(2)
Your question "what are they" is already answered above.
As far as debugging (your second question) though, and in developing libraries where you want to check for special input values, you may find the following functions useful in Windows C++:
_isnan(), _isfinite(), and _fpclass()
On Linux/Unix you should find isnan(), isfinite(), isnormal(), isinf(), fpclassify() useful (and you may need to link with libm by using the compiler flag -lm).
For those of you in a .NET environment the following can be a handy way to filter non-numbers out (this example is in VB.NET, but it's probably similar in C#):
If Double.IsNaN(MyVariableName) Then
MyVariableName = 0 ' Or whatever you want to do here to "correct" the situation
End If
If you try to use a variable that has a NaN value you will get the following error:
Value was either too large or too small for a Decimal.

Exponent in IEEE 754

Why is the exponent in a float displaced by 127?
Well, the real question is: what is the advantage of such notation compared to 2's complement notation?
Since the exponent as stored is unsigned, it is possible to use integer instructions to compare floating point values. The entire floating point value can be treated as a sign-magnitude integer for purposes of comparison (not two's complement).
Just to correct some misinformation: it is 2^n * 1.mantissa; the 1 in front of the fraction is implicitly stored.
Note that there is a slight difference in the representable range of the exponent between biased and 2's complement. The IEEE standard supports exponents in the range (-127 to +128), while if it were 2's complement, it would be (-128 to +127). I don't really know why the standard chose the biased form, but maybe the committee members thought it more useful to allow extremely large numbers rather than extremely small numbers.
@Stephen Canon, in response to ysap's answer (sorry, this should have been a follow-up comment to my answer, but the original answer was entered as an unregistered user, so I cannot comment on it yet).
Stephen, obviously you are right: the exponent range I mentioned is incorrect, but the spirit of the answer still applies. Assuming it were 2's complement instead of a biased value, and assuming 0x00 and 0xFF would still be special values, the biased exponents allow for (2x) bigger numbers than the 2's complement exponents would.
The exponent in a 32-bit float consists of 8 bits, but without a sign bit. So the range is effectively [0;255]. In order to represent numbers < 2^0, that range is shifted by 127, becoming [-127;128].
That way, very small numbers can be represented very precisely. With a [0;255] range, small numbers would have to be represented as 2^0 * 0.mantissa with lots of zeroes in the mantissa. But with a [-127;128] range, small numbers are more precise because they can be represented as 2^-126 * 0.mantissa (with fewer unnecessary zeroes in the mantissa). Hope you get the point.

Questions about two's complement and IEEE 754 representations

How would I go about finding the value of the two-byte two's complement value 0xFF72?
Would I start by converting 0xFF72 to binary?
invert the bits.
add 1 in binary notation. // lost here.
write the decimal value.
I just don't know..
Also, what about an 8-byte double that has the value 0x7FF8000000000000? What is its value as a floating point number?
I would think this was homework, except for the particular double listed: 0x7FF8000000000000 is a quiet NaN per the IEEE-754 spec, which is not a very interesting value to put on a homework assignment:
The sign bit is clear.
The exponent field is 0x7ff, the largest possible exponent, which means that the number is either an infinity or a NaN.
The significand field is 0x8000000000000. Since it isn't zero, the number is not an infinity, and must be a NaN. Since the leading bit is set, it is a quiet NaN, not a so-called "signaling NaN".
Step 3 just means add 1 to the value. It's really as simple as it sounds. :-)
Example with 0xFF72 (assumed 16-bits here):
First, invert it: 0x008D (each hex digit is simply 0xF minus the original digit)
Then add 1: 0x008E
This sounds like homework; if it is, you should tag it as such.
As for interpreting an 8 byte (double) floating point number, take a look at this Wikipedia article.