I'd like to know how floating-point addition works.
How can I add two double (or float) numbers using bitwise operations?
Short answer: if you need to ask, you are not going to implement floating-point addition from bitwise operators. It is completely possible, but there are a number of subtle points that you would need to have asked about first. You could start by implementing a double → float conversion function, which is simpler but would introduce you to many of the same concepts. You could also do double → nearest integer as an exercise.
Nevertheless, here is the naive version of addition:
Use large arrays of bits for each of the two operands (254 + 23 for float, 2046 + 52 for double). Place the significand at the right place in the array according to the exponent. Assuming the arguments are both normalized, do not forget to place the implicit leading 1. Add the two arrays of bits with the usual rules of binary addition. Then convert the resulting array back to floating-point format: first look for the leftmost 1; the position of this leftmost 1 determines the exponent. The significand of the result starts right after this leading 1 and is 23 or 52 bits wide, respectively. The bits after that determine whether the value should be rounded up or down.
Although this is the naive version, it is already quite complicated.
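To make the procedure concrete, here is a minimal C++ sketch of that naive version, under heavy assumptions: both inputs are positive and normalized, the result neither overflows nor is denormal, and the final step truncates instead of rounding. It is an illustration of the bit-array idea, not a reference implementation:

#include <cstdint>
#include <cstdio>
#include <cstring>

float naive_float_add(float a, float b) {
    // 254 possible normalized exponents + 23 fraction bits, plus one carry bit.
    const int BITS = 254 + 23 + 1;
    unsigned char acc[BITS] = {0};          // acc[0] is the least significant bit

    uint32_t bits[2];
    std::memcpy(&bits[0], &a, sizeof(float));
    std::memcpy(&bits[1], &b, sizeof(float));

    for (int k = 0; k < 2; k++) {
        uint32_t exp = (bits[k] >> 23) & 0xFF;             // biased exponent, 1..254
        uint32_t sig = (bits[k] & 0x7FFFFF) | (1u << 23);  // implicit leading 1
        // Bit i of the 24-bit significand has weight 2^(exp - 150 + i); the
        // smallest weight over all normalized floats is 2^-149, so bit i goes
        // at array index exp - 1 + i.
        for (int i = 0; i < 24; i++)
            acc[exp - 1 + i] += (sig >> i) & 1;
    }

    // Usual rules of binary addition: propagate the carries.
    for (int i = 0; i < BITS - 1; i++) {
        acc[i + 1] += acc[i] >> 1;
        acc[i] &= 1;
    }

    // The position of the leftmost 1 determines the exponent of the result.
    int top = BITS - 1;
    while (top > 0 && acc[top] == 0) top--;
    uint32_t rexp = (uint32_t)(top - 22);   // inverse of the placement rule above

    // The 23 bits right after the leading 1 form the significand; the bits
    // below them would decide the rounding, but this sketch just truncates.
    uint32_t rfrac = 0;
    for (int i = 0; i < 23; i++)
        rfrac |= (uint32_t)acc[top - 1 - i] << (22 - i);

    uint32_t rbits = (rexp << 23) | rfrac;  // sign bit stays 0: inputs are positive
    float r;
    std::memcpy(&r, &rbits, sizeof r);
    return r;
}

int main() {
    std::printf("%f\n", naive_float_add(1.5f, 2.25f));   // 3.750000
}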
The non-naive version does not use 2100-bit wide arrays, but takes advantage of a couple of “guard bits” instead (see section “on rounding” in this document).
The additional subtleties include (a bit-level classification sketch follows this list):
The sign bits of the arguments can mean that the magnitudes should be subtracted for an addition, or added for a subtraction.
One of the arguments can be NaN. Then the result is NaN.
One of the arguments can be an infinity. If the other argument is finite or the same infinity, the result is the same infinity. Otherwise, the result is NaN.
One of the arguments can be a denormalized number. In this case there is no leading 1 when transferring the number to the array of bits for addition.
The result of the addition can be an infinity: depending on the details of the implementation, this would be recognized as an exponent too large to fit the format, or an overflow during the addition of the binary arrays (the overflow can also occur during the rounding step).
The result of the addition can be a denormalized number. This is recognized as the absence of a leading 1 in the first 2046 bits of the array of bits. In this case the last 52 bits of the array should be transferred to the significand of the result, and the exponent should be set to zero, to indicate a denormalized result.
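As a companion to the list above, here is a short C++ sketch that recognizes those special cases from the raw bits of a binary64 double (1 sign bit, 11 exponent bits, 52 fraction bits); the function name is just for illustration:

#include <cstdint>
#include <cstdio>
#include <cstring>

const char *classify(double d) {
    uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    uint64_t exp  = (u >> 52) & 0x7FF;       // 11 exponent bits
    uint64_t frac = u & 0xFFFFFFFFFFFFFULL;  // 52 fraction bits

    if (exp == 0x7FF)                        // all-ones exponent: infinity or NaN
        return frac ? "NaN" : "infinity";
    if (exp == 0)                            // zero exponent: zero or denormal
        return frac ? "denormal (no implicit leading 1)" : "zero";
    return "normalized (implicit leading 1)";
}

int main() {
    double zero = 0.0;
    std::printf("%s\n", classify(1.5));          // normalized
    std::printf("%s\n", classify(zero / zero));  // NaN
    std::printf("%s\n", classify(1.0 / zero));   // infinity
    std::printf("%s\n", classify(5e-324));       // denormal (smallest positive double)
}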
Related
I have the following function which has to convert numbers of any size to strings:
string NumberToString(float number) {
    ostringstream NumberInStream;
    NumberInStream.str("");
    NumberInStream.clear();
    NumberInStream << setprecision(0);
    NumberInStream << fixed << number;
    return NumberInStream.str();
}
The function works very well for numbers up to 9 digits long. But when I input a 10-digit number, e.g. 1234567890, it returns a wrong result.
Some examples:
1494978929 became 1494978944
1494979474 became 1494979456
1494979487 became 1494979456
1494979498 became 1494979456
1494979500 became 1494979456
1494979529 became 1494979584
1494979540 became 1494979584
However,
2 became 2
120 became 120
44567 became 44567
456.45 became 456 because of setprecision(0)
Welcome to floating-point precision. Try the same code using double in the function prototype instead and you will see that you get the results you want. However, this will still fail for integers beyond a certain length.
Just look at the output of this:
printf("%f\n", 1494978929.f);
And you'll see that you cannot represent that int as a float with total precision. Call the same with .0 instead of .f at the end and you'll see a different result.
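For instance, a sketch of that comparison (the printed values assume the usual IEEE 754 binary32/binary64 formats):

#include <cstdio>

int main() {
    std::printf("%f\n", 1494978929.f);  // float:  prints 1494978944.000000
    std::printf("%f\n", 1494978929.0);  // double: prints 1494978929.000000
}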
This is a problem with IEEE floating-point representation: https://en.wikipedia.org/wiki/IEEE_floating_point. Your question says you are converting a long long, but your function takes a float as an argument. Floats are stored in 32 bits in memory. The bits are split into three sets: one bit is used to determine the sign (s), a group of bits encodes the exponent (q), and the rest of the bits form the significand (c). The number represented by this notation is (-1)^s * c * b^q, where b is the base (2 for binary formats, 10 for decimal ones). Exactly how all of this is laid out depends on your compiler and the standard it follows. What this means is that every number a float can hold has to fit this formula. All of the relatively small integers you would want to represent work with it, but in your situation some 10-digit integers require more significand bits than a float has. I recommend that you use a double or long double for these, or better, keep the parameter a long long instead of using a floating-point type.
Yep, float is 32 bits, and part of that is the exponent, so float has less precision than a normal int, but it has a much wider range of values, due to sacrificing some bits of precision to the exponent.
Either use double, which gives you more precision (but still not enough for a full long long), or make it a template:
template <typename T>
string NumberToString(T number) {
    ostringstream NumberInStream;
    NumberInStream.str("");
    NumberInStream.clear();
    NumberInStream << setprecision(0);
    NumberInStream << fixed << number;
    return NumberInStream.str();
}
That might need some tweaking now, but it won't lose precision by passing your value through a data type like float or double that has fewer bits of precision than the number you started with.
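For reference, here is the template assembled into a self-contained program with the headers it needs, plus a quick check against the problematic value from the question:

#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
using namespace std;

template <typename T>
string NumberToString(T number) {
    ostringstream NumberInStream;
    NumberInStream << setprecision(0);
    NumberInStream << fixed << number;
    return NumberInStream.str();
}

int main() {
    cout << NumberToString(1494978929LL) << "\n";  // "1494978929": exact, no float conversion
    cout << NumberToString(456.45) << "\n";        // "456" because of fixed and setprecision(0)
}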
Question:
I have a large number of floating-point numbers (~10,000 of them), each with 6 digits after the decimal point. The product of all these numbers would have about 60,000 digits, but the double type holds only about 15 significant digits. The final product has to be accurate to 6 digits after the decimal point.
My approach:
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
I also thought of multiplying these numbers using arrays to store their digits and later converting them back to decimal. But this also appears cumbersome and may not yield a correct result.
Is there an alternate easier way to do this?
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
This would only achieve further loss of accuracy. In floating-point, large numbers are represented approximately just like small numbers are. Making your numbers bigger only means you are doing 19999 multiplications (and one division) instead of 9999 multiplications; it does not magically give you more significant digits.
This manipulation would only be useful if it prevented the partial product from reaching into subnormal territory (and in that case, multiplying by a power of two would be recommended to avoid loss of accuracy due to the scaling). There is no indication in your question that this happens, no example data set, no code, so it is only possible to provide the generic explanation below:
Floating-point multiplication is very well behaved when it does not underflow or overflow. At the first order, you can assume that relative inaccuracies add up, so that multiplying 10000 values produces a result that's 9999 machine epsilons away from the mathematical result in relative terms(*).
The solution to your problem as stated (no code, no data set) is to use a wider floating-point type for the intermediate multiplications. This solves both the problems of underflow or overflow and leaves you with a relative accuracy on the end result such that once rounded to the original floating-point type, the product is wrong by at most one ULP.
Depending on your programming language, such a wider floating-point type may be available as long double. For 10000 multiplications, the 80-bit “extended double” format, widely available in x86 processors, would improve things dramatically and you would barely see any performance difference, as long as your compiler does map this 80-bit format to a floating-point type. Otherwise, you would have to use a software implementation such as MPFR's arbitrary-precision floating-point format or the double-double format.
(*) In reality, relative inaccuracies compound, so that the real bound on the relative error is more like (1 + ε)^9999 − 1, where ε is the machine epsilon. Also, in practice, relative errors often cancel each other, so that you can expect the actual relative error to grow like the square root of the theoretical maximum error.
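A sketch of the wider-intermediate-type approach, assuming long double maps to the 80-bit extended format (as it does with most x86 toolchains) and a hypothetical data set values:

#include <vector>

double product(const std::vector<double>& values) {
    long double p = 1.0L;               // wider significand and exponent range
    for (double v : values)
        p *= v;                         // intermediate roundings happen in long double
    return static_cast<double>(p);      // one final rounding back to double
}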
Why do I get a value such as -0.000000? Does negative zero even exist?
I am multiplying two large double values. Why do I get a result like this?
Is it overflowing? Should I use a bigger data type than this?
From Wikipedia:
Does negative zero even exist?
Signed zero is zero with an associated sign. In ordinary arithmetic, −0 = +0 = 0. However, in computing, some number representations allow for the existence of two zeros, often denoted by −0 (negative zero) and +0 (positive zero). This occurs in the sign and magnitude and ones' complement signed number representations for integers, and in most floating point number representations. The number 0 is usually encoded as +0, but can be represented by either +0 or −0.
Is it overflowing? Should I use a bigger data type than this?
In IEEE 754 binary floating point numbers, zero values are represented by the biased exponent and significand both being zero. Negative zero has the sign bit set to one. One may obtain negative zero as the result of certain computations, for instance as the result of arithmetic underflow on a negative number, or −1.0*0.0, or simply as −0.0.
It could be a sign-magnitude thing: there exist two distinct zero values in floating-point types, +0.0 and -0.0.
It could also be a precision thing: -0.000000000009 might be printed as -0.000000, which is perfectly reasonable.
As is evident from your other question, the value you have is not a negative zero but is a small negative value that is displayed as “-0.000000” because of the format specification used to display it.
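Both effects are easy to reproduce; a minimal sketch (output assumes the usual IEEE 754 double):

#include <cstdio>

int main() {
    std::printf("%f\n", -0.0);    // -0.000000: a genuine negative zero
    std::printf("%f\n", -9e-12);  // -0.000000: a tiny negative value rounded by %f
    std::printf("%g\n", -9e-12);  // -9e-12:    %g reveals the actual value
}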
I'm learning about the representation of floating-point IEEE 754 numbers, and my textbook says:
To pack even more bits into the significand, IEEE 754 makes the leading 1-bit of normalized binary numbers implicit. Hence, the number is actually 24 bits long in single precision (implied 1 and 23-bit fraction), and 53 bits long in double precision (1 + 52).
I don't get what "implicit" means here... what's the difference between an explicit bit and an implicit bit? Don't all numbers have the bit, regardless of their sign?
Yes, all normalised numbers (other than the zeroes) have that bit set to one (a), so they make it implicit to prevent wasting space storing it.
In other words, that bit is not stored at all, and the space saved is reused to increase the precision of your numbers.
Keep in mind that this is the first bit of the fraction, not the first bit of the binary pattern. The first bit of the binary pattern is the sign, followed by a few bits of exponent, followed by the fraction itself.
For example, a single precision number is (sign, exponent, fraction):
<1> <--8---> <---------23----------> <- bit widths
s eeeeeeee fffffffffffffffffffffff
If you look at the way the number is calculated, it's:
(-1)^sign x 1.fraction x 2^(exponent - bias)
So the fractional part used for calculating that value is 1.fffff...fff (in binary).
(a) There is actually a class of numbers (the denormalised ones and the zeroes) for which that property does not hold true. These numbers all have a biased exponent of zero but the vast majority of numbers follow the rule.
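To see the implicit bit in action, here is a small C++ sketch that pulls a single-precision float apart field by field and re-attaches the leading 1 before reconstructing the value (the variable names are just for illustration):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float f = 6.5f;                     // 1.101 (binary) * 2^2
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);

    uint32_t sign = u >> 31;
    uint32_t exp  = (u >> 23) & 0xFF;   // biased exponent (bias 127)
    uint32_t frac = u & 0x7FFFFF;       // the 23 stored fraction bits

    uint32_t sig = frac | (1u << 23);   // re-attach the implicit 1: 24 real bits
    double value = (sign ? -1.0 : 1.0) *
                   std::ldexp((double)sig, (int)exp - 127 - 23);
    std::printf("sign=%u exp=%u frac=0x%06X value=%g\n", sign, exp, frac, value);
    // prints: sign=0 exp=129 frac=0x500000 value=6.5
}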
Here is what they are saying. The first non-zero bit is always going to be 1. So there is no need for the binary representation to include that bit, since you know what it is. So they don't. They tell you where that first 1 is, and then they give the bits after it. So there is a 1 that is not explicitly in the binary representation, whose location is implicit from the fact that they told you where it was.
It may also be helpful to note that we are dealing in binary representations of a number. The reason that the first digit of a normalized binary number (that is, no leading zeroes) has to be 1 is that 1 is the only non-zero value available to us in this representation. So, the same would not be true for, say, base-three representations.
How would I go about finding the value of the two-byte two's complement value 0xFF72?
Would I start like this?
1. Convert 0xFF72 to binary.
2. Reverse the bits.
3. Add 1 in binary notation. // lost here
4. Write the decimal value.
I just don't know.
Also,
What about an 8-byte double that has the value 0x7FF8000000000000? What is its value as a floating-point number?
I would think that this was homework, but for the particular double that is listed. 0x7FF8000000000000 is a quiet NaN per the IEEE-754 spec, not a very interesting value to put on a homework assignment:
The sign bit is clear.
The exponent field is 0x7ff, the largest possible exponent, which means that the number is either an infinity or a NaN.
The significand field is 0x8000000000000. Since it isn't zero, the number is not an infinity, and must be a NaN. Since the leading bit is set, it is a quiet NaN, not a so-called "signaling NaN".
Step 3 just means add 1 to the value. It's really as simple as it sounds. :-)
Example with 0xFF72 (assuming 16 bits here):
First, invert it: 0x008D (each digit is simply 0xF minus the original value)
Then add 1: 0x008E. So 0xFF72 interpreted as a 16-bit two's complement value is −0x8E, which is −142 in decimal.
This sounds like homework, and in fairness you should tag it as such if it is.
As for interpreting an 8 byte (double) floating point number, take a look at this Wikipedia article.
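Putting both parts of the question into code, a sketch that lets the hardware do the reinterpretation (assuming the usual 16-bit two's complement and IEEE 754 binary64 representations):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    // Two's complement: reinterpret the 16-bit pattern as a signed integer.
    uint16_t raw = 0xFF72;
    int16_t value;
    std::memcpy(&value, &raw, sizeof value);
    std::printf("0xFF72 as int16_t: %d\n", value);          // -142

    // The 8-byte pattern from the question, reinterpreted as a double.
    uint64_t nanbits = 0x7FF8000000000000ULL;
    double d;
    std::memcpy(&d, &nanbits, sizeof d);
    std::printf("0x7FF8000000000000 as double: %f\n", d);   // nan (quiet NaN)
}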