I am not sure what's happening. I have a simple loop with a subtraction and an addition. At some point during the loop c becomes zero and stays zero until the end, so it shouldn't have any effect on the subtraction. However, after c becomes zero, array1 keeps decreasing. I know about floating-point problems, but shouldn't 0 be 0? Also, when I try to print the value of c I get 0.0000000E+00. Array b holds positive and negative real numbers on the order of 0.00001; array c holds only positive numbers on the order of 0.00002. These arrays have 3 dimensions because they represent a 2-d map over time; think daily climate data, for example.
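INTEGER, PARAMETER :: ntime = 100000
INTEGER :: itime
REAL, DIMENSION(1,1,ntime) :: array1, b, c
array1(:,:,:) = 0
! start at 2 so that index itime-1 stays inside the array bounds
DO itime = 2, ntime
   array1(1,1,itime) = array1(1,1,itime-1) + b(1,1,itime) - c(1,1,itime-1)
   IF (itime .ge. 15000) THEN
      c(1,1,itime) = 0
   ENDIF
ENDDO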
When I add to the loop an IF condition that only includes c in the operation if c is greater than 0, then the values of array1 look better but I still don't get the correct result. The value of array1 at the end of the loop should be the initial value of array1 + the sum of all values in b - the sum of all values in c. So the biggest problem must be the subtraction of 0, but something else is also not working. I should add that b is an array made by subtracting REAL numbers. Is there some compounding floating point error in additions?
I am a university lecturer, and I will teach the Numerical Methods course this semester using Fortran 90/95 as the programming language. The beginning of the course starts with the representation of numbers, and I would like to talk about the limits of numbers that can be represented with REAL(4), REAL(8) and REAL(16). I intend to use the following code on OnlineGDB (so that students won't have to install anything on their computers, which may be a pain in times of remote learning):
Program declare_reals
implicit none
real(kind = 4) :: a_huge, a_tiny ! single precision ; default if kind not specified
!real(4) :: a ! Equivalent to real(kind = 4) :: a
a_huge = huge(a_huge)
print*, "Max positive for real(4) : ", a_huge
a_tiny = tiny(a_tiny)
print*, "Min positive for real(4) : ", a_tiny
print *
End Program declare_reals
With this code, I get
Max positive for real(4) : 3.40282347E+38
Min positive for real(4) : 1.17549435E-38
However, if I write a_tiny = tiny(a_tiny)/2.0, the output becomes
Min positive for real(4) : 5.87747175E-39
Looking at the documentation for gfortran (which OnlineGDB uses as the f95 compiler), I had the impression that anything below tiny(x) could result in an underflow and zero would show instead of a non-zero number. Could anyone help me understand what is happening here? If tiny(x) doesn't yield the smallest positive representable number, what is being shown due to the function call?
The Fortran Standard states the following about a real value:
The model set for real x is defined by

$$x = \begin{cases} 0 & \text{or} \\ s \times b^{e} \times \sum_{k=1}^{p} f_k \times b^{-k}, \end{cases}$$

where $b$ and $p$ are integers exceeding one; each $f_k$ is a non-negative integer less than $b$, with $f_1$ nonzero; $s$ is +1 or −1; and $e$ is an integer that lies between some integer maximum $e_{\max}$ and some integer minimum $e_{\min}$ inclusively. For $x = 0$, its exponent $e$ and digits $f_k$ are defined to be zero. The integer parameters $b$, $p$, $e_{\min}$, and $e_{\max}$ determine the set of model floating-point numbers.
Real values which satisfy this definition are referred to as model numbers or normal floating point numbers. The floating point numbers your system can represent, i.e. the machine-representable numbers, are a superset of the model numbers. They can, but need not, include the values with $f_1$ zero (also known as subnormal floating point numbers), which are there to fill the underflow gap around zero.
The Fortran functions tiny(x), huge(x), epsilon(x), spacing(x) are all defined for model numbers.
The value of tiny(x) is given by $b^{e_{\min}-1}$, which for a single-precision floating-point number (binary32) is $2^{-126}$ and is the smallest model (normal) number. When your system follows IEEE 754, the machine-representable numbers will also contain the subnormal numbers. The smallest positive subnormal number is given by tiny(x)*epsilon(x), which in binary32 is $2^{-126} \times 2^{-23}$. This explains why you can divide tiny(x) by two: the result crosses from the normal into the subnormal range.
# smallest normal number
0 00000001 00000000000000000000000 (base 2) = 0080 0000 (base 16) = $2^{-126}$ ≈ $1.1754943508 \times 10^{-38}$
# smallest subnormal number
0 00000000 00000000000000000000001 (base 2) = 0000 0001 (base 16) = $2^{-126} \times 2^{-23}$ ≈ $1.4012984643 \times 10^{-45}$
Note: when you divide tiny(x)*epsilon(x) by two, gfortran returns an arithmetic underflow error.
Ref: values taken from Wikipedia: Single precision floating-point format
Recently I encountered a problem while trying to subtract the .size() values of two strings in C++. As far as I know, size() returns the number of characters in a string. So let's say I have 2 strings p and q; abs(p.size()-q.size()) should return the difference in length of the two strings. But when I ran this code, it returned an unexpectedly large value. When I print the length of each string individually, or if I store their lengths in separate integers and subtract them, I get the correct answer. I am not yet able to figure out why.
size() returns an unsigned value. Subtracting a larger unsigned value from a smaller one wraps around, resulting in a very large positive value. Think of the rolling counter of miles or km in a car: if you roll it back past 0, it shows 99999, which is a big number.
The solution, assuming you care about negative differences, is to do static_cast<int>(p.size() - q.size()) (and pass that to abs).
The return value of size() has type size_t (an unsigned integral type).
So if you subtract a greater number from a smaller number, you'll run into this problem and get that big value as the result of the subtraction.
Reference: std::string::size
std::string member function size() returns an unsigned value, so if p.size() < q.size(), the expression p.size()-q.size() will not evaluate to a negative number (it's unsigned, cannot be negative) but to a (often) very very big (unsigned) number.
std::string reports its size as some width of unsigned integer; such types are a bit like the second hand on a watch: you can wind it forward from 0 up to 59 but if you keep going clockwise it drops to 0 before incrementing again, while if you wind counterclockwise you count down to 0 then jump to 59 and count down from there, ad infinitum.
Say you are subtracting a string length of 6 from a string length of 4: it's much like saying "start the second hand at 4 and wind counterclockwise by 6 seconds". When you've wound back 4 seconds the second hand is already at 0, winding another second gets you to 59, and the final second brings you to 58. For std::string::size_type the maximum isn't 59 (it's much larger) but the problem's the same. The result is always positive so is unaffected by abs, but regardless, it's not what you wanted!
The actual maximum value can be accessed after #include <limits> with std::numeric_limits<std::string::size_type>::max(), for whatever that's worth.
There are many ways to solve this problem. David Schwartz's comment on Zola's answer lists one good one: std::max(p.size(),q.size())-std::min(p.size(),q.size()), which you can think of as "subtract the smaller value from the larger value". Another option is...
p.size() > q.size() ? p.size() - q.size() : q.size() - p.size()
...which means "if p's larger, subtract q from it, otherwise subtract it (i.e. p) from q".
Q = (a_i + b_i) / (2^s)
-10^10 ≤ s ≤ 10^10
1 ≤ a_i, b_i ≤ 10^9
It is guaranteed that -10^10 ≤ Q ≤ 10^10.
Here s, a_i, b_i are integers and Q is a decimal number.
When we calculate Q, there is overflow due to the large value of 2^s. I am using pow(2,s) to calculate 2^s. How can I calculate Q, given the range of Q as in the statement?
I assume by your statement that Q is decimal, that this involves floating point operations rather than integer arithmetic.
If you can't use logarithms for some reason, the slower approach would be to calculate a floating point value with value equal to a_i + b_i. If s is positive, simply divide that value s times by 2 (in a loop). If s is negative, multiply instead of divide.
For arbitrary a_i and b_i, you will still have the risk of overflow (when s is negative) or underflow (s positive) and will need to manage that. However, you claim to have a guarantee that is not the case .....
The program asks the user for the number of times to flip a coin (n; the number of trials).
A success is considered a heads.
Flawlessly, the program creates a random number between 0 and 1. 0's are considered heads and success.
Then, the program is supposed to output the expected values of getting x amount of heads. For example if the coin was flipped 4 times, what are the following probabilities using the formula
nCk * p^k * (1-p)^(n-k)
Expected 0 heads with n flips: xxx
Expected 1 heads with n flips: xxx
...
Expected n heads with n flips: xxx
When doing this with "larger" numbers, the numbers come out to weird values. It happens if 15 or twenty are put into the input. I have been getting 0's and negative values for the value that should be xxx.
Debugging, I have noticed that the nCk has come out to be negative and not correct towards the upper values, and I believe this is the issue. I use this formula for my combination:
double combo = fact(n)/fact(r)/fact(n-r);
Here is the pseudocode for my fact function:
long fact(int x)
{
int e; // local counter
factor = 1;
for (e = x; e != 0; e--)
{
factor = factor * e;
}
return factor;
}
Any thoughts? My guess is my factorial or combo functions are exceeding the max values or something.
You haven't mentioned how factor is declared. I think you are getting integer overflows. I suggest you use double: since you are calculating expected values and probabilities, you shouldn't need to be concerned much about precision.
Try changing your fact function to:
double fact(double x)
{
    int e; // local counter
    double factor = 1;
    // multiply by x, x-1, ..., 1; the e > 0 test also terminates for negative input
    for (e = x; e > 0; e--)
    {
        factor = factor * e;
    }
    return factor;
}
EDIT:
Also to calculate nCk, you need not calculate factorials 3 times. You can simply calculate this value in the following way.
if k > n/2, k = n-k.
n(n-1)(n-2)...(n-k+1)
nCk = -----------------------
factorial(k)
You're exceeding the maximum value of a long. Factorial grows so quickly that you need the right type of number--what type that is will depend on what values you need.
A long is a signed integer, and where it is 32 bits wide, as soon as you pass 2^31 - 1, the value will become negative (it's using 2's complement math).
Using an unsigned long will buy you a little time (one more bit), but for factorial, it's probably not worth it. If your compiler supports long long, then try an "unsigned long long". That will (usually, depends on compiler and CPU) double the number of bits you're using.
You can also try switching to use double. The problem you'll face there is that you'll lose accuracy as the numbers increase. A double is a floating point number, so you'll have a fixed number of significant digits. If your end result is an approximation, this may work okay, but if you need exact values, it won't work.
If none of these solutions will work for you, you may need to resort to using an "infinite precision" math package, which you should be able to search for. You didn't say if you were using C or C++; this is going to be a lot more pleasant with C++ as it will provide a class that acts like a number and that would use standard arithmetic operators.
What is the most efficient way to convert a decimal number into its binary form, i.e. with the best time complexity?
Normally, to convert a decimal number into binary, we keep dividing the number by 2 and storing the remainders. But this would take a really long time if the number in decimal form is very large. The time complexity in this case would turn out to be O(log n).
So I want to know if there is any other approach that can do the job with better time complexity?
The problem is essentially that of evaluating a polynomial using binary integer arithmetic, so the result is in binary. Suppose
p(x) = a₀xⁿ + a₁xⁿ⁻¹ + ⋯ + aₙ₋₁x + aₙ
Now if a₀,a₁,a₂,⋯,aₙ are the decimal digits of the number (each implicitly represented by binary numbers in the range 0 through 9) and we evaluate p at x=10 (implicitly in binary) then the result is the binary number that the decimal digit sequence represents.
The best way to evaluate a polynomial at a single point given also the coefficients as input is Horner's Rule. This amounts to rewriting p(x) in a way easy to evaluate as follows.
p(x) = ((⋯((a₀x + a₁)x + a₂)x + ⋯ )x + aₙ₋₁)x + aₙ
This gives the following algorithm. Here the array a[] contains the digits of the decimal number, left to right, each represented as a small integer in the range 0 through 9. Pseudocode for an array indexed from 0:
toNumber(a[])
    const x = 10
    total = a[0]
    for i = 1 to a.length - 1 do
        total *= x      //multiply the total by x=10
        total += a[i]   //add on the next digit
    return total
Running this code on a machine where numbers are represented in binary gives a binary result. Since that's what we have on this planet, this gives you what you want.
If you want to get the actual bits, now you can use efficient binary operations to get them from the binary number you have constructed, for example, mask and shift.
The complexity of this is linear in the number of digits, because arithmetic operations on machine integers are constant time, and it does two operations per digit (apart from the first). This is a tiny amount of work, so this is supremely fast.
If you need very large numbers, bigger than 64 bits, just use some kind of large integer. Implemented properly this will keep the cost of arithmetic down.
To avoid as much large integer arithmetic as possible if your large integer implementation needs it, break the array of digits into slices of 19 digits, with the leftmost slice potentially having fewer. 19 is the maximum number of digits that can be converted into an (unsigned) 64-bit integer.
Convert each block as above into binary without using large integers and make a new array of those 64-bit values in left to right order. These are now the coefficients of a polynomial to be evaluated at x=10¹⁹. The same algorithm as above can be used, only now with large integer arithmetic operations and with 10 replaced by 10¹⁹, which should be computed with large integer arithmetic in advance of its use.