I am doing numerical analysis with a solver (written in object-oriented C++) compiled with double precision, on a 64-bit machine. My problem is that when the solver computes a large number - say -1.45e21, to take an actual example - and stores it in the allocated memory by assigning this value to an existing variable, it is converted to 0. So of course, when I later use this variable in a division, I get a segmentation fault. I do not understand how this happens, and since I already use double precision, I do not see how to fix the issue. Could anyone give me a hand with this, please?
In case it helps: I just ran a test where I set a = -1.45e+21 and "print" the value, which is returned correctly by the solver. But when I do not use the "e" exponent and enter the full value (with 19 zeros), I get 0 back. So I guess the issue/limitation comes from the number of digits. Any ideas? Thanks!
Edit: here is a summary of the steps I go through to compute one of the variables that poses an issue; the others are defined similarly.
First I initialise the field pointer lists:
PtrList<volScalarField> fInvFluids(fluidRegions.size());
Where the volScalarField class is essentially an array of doubles. Then I populate the field pointer lists:
fInvFluids.set
(
    i,
    new volScalarField
    (
        IOobject
        (
            "fInv",
            runTime.timeName(),
            fluidRegions[i],
            IOobject::NO_READ,
            IOobject::AUTO_WRITE
        ),
        fluidRegions[i],
        dimensionedScalar
        (
            "fInv",
            dimensionSet(3, 1, -9, -1, 0, 0, 0),
            scalar(0)
        )
    )
);
After this, I set the field regions:
volScalarField& fInv = fInvFluids[i];
And finally I compute the value:
// Info<< " ** Calculating fInv **\n";
fInv = gT*pow(pow(T/Tlambda, 5.7)*(1 - pow(T/Tlambda, 5.7)), 3);
Where T is a field variable and Tlambda a scalar value defined at run time.
A double variable probably can't hold a number with 19 zeroes exactly either. A (decimal) digit takes more than 3 bits, so 19 zeroes take at least 57 bits, while a double typically has a mantissa of only 53 bits.
However, that doesn't sound like the problem you have. In C++, expressions have a type as well. 1 is not the same as 1.0: the first is an int and the second a double. While you can convert an int to a double, they are not the same. An int can most likely hold values up to about 2 billion, but formally its limit may be as low as 32767. The solution to your problem might be as simple as appending .0 to the literal - 1450000000000000000000.0 - making it a double.
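To illustrate the difference, here is a minimal standalone sketch (plain C++, unrelated to the solver in the question):

#include <cstdio>

int main()
{
    // Both literals below are floating-point literals, so the value is stored
    // as a double (with roughly 15-16 significant decimal digits of precision).
    double a = -1.45e21;                   // exponent form
    double b = -1450000000000000000000.0;  // trailing .0 makes it a double literal
    std::printf("%g\n%g\n", a, b);         // both print -1.45e+21

    // Without the .0 or the exponent the token is an *integer* literal; 22 digits
    // do not fit in a 64-bit long long, so the compiler rejects or warns about it.
    // double c = -1450000000000000000000;
    return 0;
}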
I have an IloBoolVarArray in a MIP problem. When the solver has finished, I cast these variables to double, but sometimes I get very small values such as 1.3E-008 instead of 0. My question is: why? Is it only a parsing problem? The solver has used this value internally, so is the result not trustworthy?
Thanks a lot.
CPLEX works with double-precision floating-point data internally. It has a tolerance parameter EpInt. If a variable x has a value satisfying
0 <= x <= EpInt, or
1-EpInt <= x <= 1
then CPLEX considers the value to be binary. The default value of EpInt is 10^-6, so seeing solution values around 10^-8 is consistent with the default behaviour of CPLEX. Unless you really need exact integer values, you should account for this when you pull solutions from CPLEX. One particularly bad thing you could do in C++ is
IloBoolVar x(env);
// ...
cplex.solve();
int bad_value = cplex.getValue(x); // BAD
int ok_value = cplex.getValue(x) + 0.5; // OK
Here, bad_value could be set to 0 even if the CPLEX solution has an effective value of 1: CPLEX could report 0.999999, which the conversion to int truncates to 0. The second assignment, with the added 0.5, will reliably store the solution.
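The truncation can be reproduced without CPLEX at all; a minimal sketch:

#include <cstdio>

int main()
{
    double reported = 0.999999;       // what the solver might return for a "1"
    int bad_value = reported;         // conversion truncates toward zero -> 0
    int ok_value  = reported + 0.5;   // adding 0.5 first makes the truncation round -> 1
    std::printf("bad = %d, ok = %d\n", bad_value, ok_value);
    return 0;
}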
In the latest versions of CPLEX, you can set EpInt to 0, which makes CPLEX accept only exact 0.0 and 1.0 as binary values. If you really do need exact values of 0 or 1, keep in mind the domains CPLEX is designed to work in. If you are trying to use it to solve cryptology problems, for example, you might not get good results, even with small instances.
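For reference, a hedged sketch of setting that tolerance from the C++ Concert API (the exact parameter identifier varies between CPLEX versions, so treat it as an assumption; in older APIs the same parameter is exposed as IloCplex::EpInt):

#include <ilcplex/ilocplex.h>

int main()
{
    IloEnv env;
    IloModel model(env);
    IloBoolVar x(env);
    model.add(IloMaximize(env, x));   // trivial model, just for illustration

    IloCplex cplex(model);
    // Integrality tolerance (EpInt); 0 asks CPLEX to accept only exact 0/1 values.
    cplex.setParam(IloCplex::Param::MIP::Tolerances::Integrality, 0.0);
    cplex.solve();

    // Still round when extracting, in case the tolerance is ever relaxed again.
    int value = (int)(cplex.getValue(x) + 0.5);
    env.out() << "x = " << value << "\n";
    env.end();
    return 0;
}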
Short version of the question: overflow or timeout with the current settings when calculating with large int64_t and double values; is there any way to avoid these?
Test case:
If the only demand is 80,000,000,000, the problem is solved with the correct result. But if it is 800,000,000,000, an incorrect 0 is returned.
If the input has two or more demands (meaning more inequalities have to be handled), smaller values also cause incorrect results; e.g., three equal demands of 20,000,000,000 trigger the problem.
I'm using the COIN-OR CLP linear programming solver to solve some network flow problems. I use int64_t to represent the link bandwidth, but CLP works with double most of the time and cannot easily be switched to other types.
When the values of the variables are not that large (typically smaller than 10,000,000,000) and the constraints (inequalities) are relatively few, it gives the solution I want. But if either of those factors increases, the tool stops and returns an all-zero solution. I think the reason is that the calculation exceeds some limit, so the program breaks at some trivial point (it uses the LP simplex method).
The inequality is of the form:
totalFlowSum <= usePercentage * demand
I changed it to
totalFlowSum - usePercentage * demand <= 0
Since totalFlowSum and demand are very large int64_t values and usePercentage is a double, if there are too many constraints like this (several or even more), or if the demand is larger than 100,000,000,000, the returned solution will be wrong.
Is there any way to correct this, for example by increasing some break threshold or avoiding calculations of this magnitude?
Losing some accuracy is acceptable. One possible workaround is to scale the inputs down by 1,000 and the outputs up by 1,000, but that is rather naïve and may require too many code modifications in the program.
Update:
I have changed the formulation to
totalFlowSum / demand - usePercentage <= 0
but the problem still exists.
Update 2:
I divided usePercentage by 1000, changing its coefficient from 1 to 0.001, and it worked. But if I also divide totalFlowSum/demand by 1000 at the same time, there is still no result. I don't know why...
I then changed the rhs of the inequalities from 0 to 0.1, and the problem is solved! Since the inputs are very large, a 0.1 offset won't impact the solution at all.
I think the reason is that the previous coefficients were badly scaled, so the solver failed to find an exact answer.
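To see why magnitudes around 10^11 clash with the solver's default tolerances, here is a small standalone C++ sketch (it does not use CLP; the numbers are only illustrative):

#include <cstdio>

int main()
{
    // Near 8e11, adjacent doubles are about 1.2e-4 apart, so a quantity at the
    // level of a typical LP feasibility tolerance (~1e-7) simply vanishes when
    // added to a coefficient or right-hand side of that magnitude.
    double demand = 800000000000.0;                     // 8e11, as in the failing case
    std::printf("%.17g\n", (demand + 1e-7) - demand);   // prints 0: the 1e-7 is lost
    std::printf("%.17g\n", (demand + 1e-3) - demand);   // survives, approximately
    return 0;
}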
I have some code:
float distance = pos + (screenSpeed * (float)(lastUpdateTimeMS - actualTimeMS));
This line should calculate the distance the character will move (the delta of position from the last check to the current time). But I've discovered it gives ridiculous results, e.g. for:
screenSpeed = 0.0001f;
lastUpdateTimeMS = 106;
actualTimeMS = 106;
I get 429497, and the individual parts of my formula give me:
(float)(lastUpdateTimeMS - actualTimeMS) = 4.29497e+009
(screenSpeed * (float)(lastUpdateTimeMS - actualTimeMS)) = 429497
And I get that magic 429497 (which is not the float/int range's end or anything familiar to me) for other arguments too (screenSpeed is always 0.0001f; lastUpdateTimeMS and actualTimeMS vary - sometimes they are equal, sometimes not).
Both lastUpdateTimeMS and actualTimeMS are of unsigned int type.
I am aware that floating-point numbers have some inaccuracy, but with such big differences I don't understand it.
I am working on an x64 machine with Visual Studio C++ 2013 (I build for 32-bit anyway), and my project includes some libraries (maybe there are some build options I should be aware of, or that, when set differently in a .lib and in my code, lead to such problems)?
I am aware that floating-point numbers have some inaccuracy, but with such big differences I don't understand it.
First, floating-point inaccuracy is proportional to the magnitude of the values you manipulate.
Second, you are likely getting a wrap-around in the subtraction of the two unsigned 32-bit values in your computation lastUpdateTimeMS - actualTimeMS, giving a result near 2^32. This value is then converted to float and multiplied by 0.0001f, producing 429497.
In other words, your problem is that actualTimeMS is slightly larger than lastUpdateTimeMS. Also, if the names of the variables can be trusted, shouldn't the subtraction be the other way round?
And I get that magic 429497 (which is not the float/int range's end or anything familiar to me)
It is exactly 2^32 * 0.0001.
(i) You seem to have your variables the wrong way round: surely it should be actualTimeMS - lastUpdateTimeMS?
(ii) unsigned - unsigned is unsigned, which you don't want here. Instead of casting to float, which happens automatically anyway (to double, in fact), you should cast to int to make the result signed.
(iii) Your diagnostics are faulty: 106U - 106U is zero, so screenSpeed * (float)(lastUpdateTimeMS - actualTimeMS) is zero too. Perhaps those variables are being updated in another thread (in which case you have a whole new set of problems to solve)? Or perhaps the problem is in that pos variable?
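A short standalone sketch of the wrap-around and of the signed-cast fix suggested above (the time values are made up; only the variable names come from the question):

#include <cstdio>

int main()
{
    unsigned int lastUpdateTimeMS = 106;
    unsigned int actualTimeMS     = 107;   // the "actual" time is one tick ahead
    float screenSpeed = 0.0001f;

    // unsigned - unsigned stays unsigned: 106u - 107u wraps around to 4294967295
    float wrong = screenSpeed * (float)(lastUpdateTimeMS - actualTimeMS);

    // cast the difference to a signed int (and/or swap the operands) instead
    float right = screenSpeed * (float)(int)(lastUpdateTimeMS - actualTimeMS);

    std::printf("wrong = %g, right = %g\n", wrong, right);   // ~429497 vs -0.0001
    return 0;
}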
The following C++ code is almost the same as the example provided with the Vector Statistical Library (VSL) in Intel's Math Kernel Library (MKL). However, once the variable 'total' is larger than, say, 3*10^9, it outputs
MKL ERROR: Parameter 3 was incorrect on entry to vdRngUniform.
So I guess this implies that 'total' might be too large. But the manual says that the 'SFMT19937 method has a period length equal to 2^19937-1 of the produced sequence', which is far larger than 10^9.
I'd like to know what the upper limit really is, e.g. for double (I can't find any relevant information in the manual). And of course, any means of overcoming this issue will be appreciated. Thanks in advance!
......
MKL_LONG status;
VSLStreamStatePtr stream;
......
int main(){
    vslNewStream(&stream, VSL_BRNG_SFMT19937, 777);
    rnd_data = (double*)malloc(total * sizeof(double));
    status = vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, total, rnd_data, 0.0, 1.0);
    vslDeleteStream(&stream);
    ......
}
Looking at https://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-D7AD317E-34EC-4789-8027-01D0E194FAC1.htm, vdRngUniform takes a const int for its length argument, which (on most modern platforms) can represent 2^31 non-negative values. 3*10^9 is larger than that, so you are likely passing a negative value to the function (i.e. signed integer overflow).
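If you really need more than about 2*10^9 numbers in one buffer, one hedged workaround (a sketch, not tested against a specific MKL version) is to fill the buffer in chunks whose size always fits in a signed int, reusing the same stream so the sequence stays continuous:

#include <mkl.h>
#include <climits>
#include <cstdlib>

int main()
{
    const long long total = 3000000000LL;   // 3e9: too large for a single call
    double* rnd_data = (double*)malloc(total * sizeof(double));

    VSLStreamStatePtr stream;
    vslNewStream(&stream, VSL_BRNG_SFMT19937, 777);

    long long done = 0;
    while (done < total) {
        long long left = total - done;
        int chunk = (int)(left > INT_MAX ? INT_MAX : left);
        vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, chunk, rnd_data + done, 0.0, 1.0);
        done += chunk;
    }

    vslDeleteStream(&stream);
    free(rnd_data);
    return 0;
}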
The problem I'm attempting to tackle at the moment involves computing the order of 10 modulo n, where n can be any number less than 1000. I have a function to do exactly that; however, I am unable to obtain accurate results as the value of the order increases.
The function works correctly as long as the order is sufficiently small, but returns incorrect values for large orders. So I added some output to the terminal to locate the problem, and discovered that when I use exponentiation, the accuracy of my reals is being compromised.
I declared ALL variables in the function, and in the program I tested it from, as real(kind=nkind) where nkind = selected_real_kind(p=18, r=308). Any numbers referenced explicitly are also declared as, for example, 1.0_nkind. However, when I print out 10**n for n counting up from 1, I find that the value at 10**27 is correct, but 10**28 gives 9999999999999999999731564544. All higher powers are similarly distorted, and this inaccuracy is the source of my problem.
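For what it's worth, the limit is easy to reproduce outside Fortran. Assuming selected_real_kind(p=18, r=308) maps to the x86 80-bit extended type (64-bit mantissa, about 18-19 decimal digits), a C++ long double on x86 Linux has the same layout:

#include <cstdio>

int main()
{
    // 10**28 has 66 significant bits, two more than a 64-bit mantissa can hold,
    // so the nearest representable value is 10**28 - 2**28.
    long double p = 1e28L;        // the literal is rounded on conversion
    std::printf("%.0Lf\n", p);    // on x86/glibc: 9999999999999999999731564544
    return 0;
}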
So, my question is: is there a way to work around this error? I don't know of any way to use more extended precision than I'm already using in the calculations.
Thanks,
Sean
EDIT: There's not much to see in the code, but here you go:
integer, parameter :: nkind = selected_real_kind(p=18, r=308)

real(kind=nkind) function order_ten_modulo(n)
    real(kind=nkind) :: n, power
    power = 1.0_nkind
    if (mod(n, 5.0_nkind) == 0 .or. mod(n, 2.0_nkind) == 0) then
        order_ten_modulo = 0
        return
    end if
    do
        if (power > 300.0) then ! just picked this number as a safeguard against endless looping
            exit
        end if
        if (mod(10.0_nkind**power, n) == 1.0_nkind) then
            order_ten_modulo = power
            exit
        end if
        power = power + 1.0_nkind
    end do
    return
end function order_ten_modulo