I am trying to solve a LP in which one of my constraints is of the form
A(x)/B(x) == 1
I transform this constraint into a linear one:
A(x) - B(x) == 0
However when CPLEX solves the problem, this constraint is of the order of ~1.e-5, but the ratio is of the order of ~1.4, which is a large deviation for my needs. Unfortunately the kappa value (condition number of the optimal basis) of the problem is kappa ~10^10, and this doesn't make things easier. I suspect this is the source of the problem.
My question: is it possible (given the value of kappa) to increase the precision of this constraint such as the ratio to be accurate within 2 decimal digits (at least) ~ 1.00something ?
Related
I'm writing a C++ program that takes the FFT of a real input signal containing double values and returns a vector X containing std::complex<double> values. Once I have the vector of results I then attempt to calculate the magnitude and phase of the result.
I am running into an issue with calculating the phase angle when one of the outputs is "zero". Zero is in quotes because when a calculation that results in 0 returns a double, the returned value will be very near zero, but not quite exactly zero.
For example, at index 3 my output array has the calculated "zero" value:
X[3] = 3.0531133177191805e-16 - i*5.5511151231257827e-17
I am trying to use the standard library std::arg function that is supposed to return the phase angle of a complex number. std::arg(X[3])
While X[3] is essentially 0, it is not EXACTLY 0 and the way phase is calculated this causes a problem because the calculation uses the ratio of the imaginary part divided by the ratio of the real part which is far from 0!
Doing the actual calculation results in a far from desirable result.
How can I make C++ realize that the result is really 0 so I can get the correct phase angle?
I'm looking for a more elegant solution than using an arbitrary hard-coded "epsilon" value to compare the double to, but so far searching online I haven't had any luck coming up with something better.
If you are computing the floating-point FFT of an input signal, then that signal will include noise, thus have a signal-to-noise ratio, including sensor noise, thermal noise, quantization noise, timing jitter noise, etc.
Thus the threshold for discarding FFT results as below your noise floor most likely isn't a matter of computational mathematics, but part of your physical or electronic data acquisition analysis. You will have to plug that number in, and set the phase to 0.0 or NaN or whatever your default flagging value is for a non-useful (at or below the noise floor) FFT result.
It was brought to my attention that my original answer will not work when the input to the FFT has been scaled. I believe I have an actual valid solution now... The original answer is kept below so that the comments still make sense.
From the comments on this answer and others, I've gathered that calculating the exact rounding error in the language may technically be possible but it is definitely not practical. The best practical solution seems to be to allow the user to provide their own noise threshold (in dB) and ignore any data points whose power level falls below that threshold. It would be impossible to come up with a generic threshold for all situations, but the user can provide a reasonable threshold based on the signal-to-noise ratio of the signal being analyzed and pass that in.
A generic phase calculation function is shown below that calculates the phase angles for a vector of complex data points.
std::vector<double> Phase(std::vector<std::complex<double>> X, double threshold, double amplitude)
{
size_t N = X.size();
std::vector<double> X_phase(N);
std::transform(X.begin(), X.end(), X_phase.begin(), [threshold, amplitude](const std::complex<double>& value) {
double level = 10.0 * std::log10(std::norm(value) / std::pow(amplitude, 2.0));
return level > threshold ? std::arg(value) : 0.0;
});
return X_phase;
}
This function takes 3 arguments:
The vector of complex signal data you want to calculate the phase of.
A sensible threshold -- Can be calculated from the signal-to-noise ratio of whatever measurement device was used to capture the signal. If your signal contains no noise other than the rounding errors of the language itself you can set this to some arbitrary really low value, like -120dB.
The maximum possible amplitude of your input signal. If your signal is calculated, this should simply be set to the amplitude of your signal. If your signal is measured, this should be set to the maximum amplitude the measuring device is capable of measuring (If your signal comes from reading an audio file, often its data will be normalized between -1.0 and 1.0. In this case you would just set the amplitude value to 1.0).
This new implementation still provides me with the correct results, but is much more robust. By leaving the threshold calculation to the user they can set the most sensible value themselves based on the characteristics of the measurement device used to measure their input signal.
Please let me know if you notice any errors or any ways I can further improve the design!
Original Answer
I found a solution that seems generic enough.
In the #include <limits> header, there is a constant value for std::numeric_limits<double>::digits10.
According the the documentation:
The value of std::numeric_limits<T>::digits10 is the number of base-10 digits that can be represented by the type T without change, that is, any number with this many significant decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow.
Using this I can filter out any output values that have a magnitude lower than this limit:
Calculate the phase of X[3]:
int N = X.size();
auto tmp = std::abs(X[3])/N > std::pow(10, -std::numeric_limits<double>::digits10)
? value
: 0.0
double phase = std::arg(tmp);
This effectively filters out any values that are not precisely zero due to rounding errors within the C++ language itself. It will NOT however filter out garbage data caused by noise in the input signal.
After adding this to my phase calculation I get the expected results.
The map from complex numbers to magnitude and phase is discontinuous at 0.
This is a discontinuity caused by the choice of coordinates you are using.
The solution will depend on why you chose those coordinates in a situation where values near the discontinuity are possible.
It isn't "really" zero. If you factored in error bars properly, your answer would really be a small magnitude (hopefully) and a unconstrained angle.
Is there a method in Pyomo for controlling the number of significant digits in the objective function and constraint matrix coefficients that get sent to the solver?
In my case, the form of the objective and the problem data is obtained from files or created in a program, and the resulting problem is then solved sequentially. As a result, using the round function or the decimal package only partially allows for the number of significant digits to be controlled. Ideally, some global method in Pyomo would be able to control the number of significant digits in the final model that gets sent to the solver in, say, an LP file.
Which solver are you using. For instance IP-OPT has many settings. These two may help you:
tiny_step_tol 0 <= (2.22045e-15) < +inf
Tolerance for detecting numerically insignificant steps.
If the search direction in the primal variables (x and s) is, in relative
terms for each component, less than this value, the algorithm accepts the
full step without line search. If this happens repeatedly, the algorithm
will terminate with a corresponding exit message. The default value is 10
times machine precision.
tiny_step_y_tol 0 <= ( 0.01) < +inf
Tolerance for quitting because of numerically insignificant steps.
If the search direction in the primal variables (x and s) is, in relative
terms for each component, repeatedly less than tiny_step_tol, and the
step in the y variables is smaller than this threshold, the algorithm
will terminate.
To play with these you need to set their new values in a ipopt.opt file in the working directory.
I read that the expm1 function is suited for computing 1 - exp(x) for small x, without losing precision from truncation of 1.0 at ~15 digits (for doubles). Is there such a function for 1 - sqrt(x)? At the moment I am simply using a very large Taylor expansion, because I require as many (preferably all) digits of precision as doubles/long doubles can offer.
Edit: I badly confused my intention here: I am looking to compute 1 - sqrt(1-x) for x anywhere between 10^-12 and 1.
The question here seems badly motivated. While exp(x) converges to 1 as x goes to 0, meaning that given the same floating point precision exp(x)-1 has more significant figures than exp(x) for small x, this is not true for sqrt(x), which converges to 0 as x goes to 0. In other words exp(x)-1 can be made fractionally more precise than exp(x) for small x, but the same is not true for 1-sqrt(x) -- which would in fact get worse, since you're taking it from something near 0 (1e-6) to something near 1 (0.999999).
If on the other hand you instead wanted to calculate sqrt(1+x) for very small x (as an accurate measurement of sqrt(x) very near x=1), sqrt(1+x)-1 would be a more accurate floating point computation. And its Taylor series would work very well; I find that for |x| < 1e-9, x/2 - x^2/8 + x^3/16 is a good approximation of sqrt(1+x)-1 to within an RMS fractional error of 3e-29 (with a maximum of 8e-29 on the edges) -- twice as many digits as are accurate in a double. Even the quadratic approximation is probably good enough (with roughly 20 digits of accuracy)
the expm1 function is suited for computing 1 - exp(x).
As can read in the ref:
For small magnitude values of x, expm1 may be more accurate than exp(x)-1.
Is there such a function for 1 - sqrt(x)?
No, at least not in the standard headers.
Using spherical harmonics for lighting I faced a problem for a big enough bandwidths. The correctness of an approximation by first n^2 terms became worse and worse starting from n=7. I look into associated Legendre polynomials definition and found out, that there is a ratio of factorials (l - m)! / (l + m)! in normalization constant. For n = 7 (l + m)! can be up to 12!. I have to use float (IEEE-754 32-bit floating-point type), due to GPUs nature.
Now I think, that tgamma from C/C++ might be more appropriate, then naive calculation of factorial by definition. Even more: maybe there is a good (approximation) formula for ratio of gamma functions (of two big arguments).
Is there a good stable approach to calculate gamma function (for big positive integers) in shaders?
Surely I just can save a lookup table (matrix) for all the possible combinations of values in numerator and denominator, but I want to have alternative (space-efficient) approach.
I need to evaluate the sum of the row: 1/1+1/2+1/3+...+1/n. Considering that in C++ evaluations are not complete accurate, the order of summation plays important role. 1/n+1/(n-1)+...+1/2+1/1 expression gives the more accurate result.
So I need to find out the order of summation, which provides the maximum accuracy.
I don't even know where to begin.
Preferred language of realization is C++.
Sorry for my English, if there are any mistakes.
For large n you'd better use asymptotic formulas, like the ones on http://en.wikipedia.org/wiki/Harmonic_number;
Another way is to use exp-log transformation. Basically:
H_n = 1 + 1/2 + 1/3 + ... + 1/n = log(exp(1 + 1/2 + 1/3 + ... + 1/n)) = log(exp(1) * exp(1/2) * exp(1/3) * ... * exp(1/n)).
Exponents and logarithms can be calculated pretty quickly and accuratelly by your standard library. Using multiplication you should get much more accurate results.
If this is your homework and you are required to use simple addition, you'll better add from the smallest one to the largest one, as others suggested.
The reason for the lack of accuracy is the precision of the float, double, and long double types. They only store so many "decimal" places. So adding a very small value to a large value has no effect, the small term is "lost" in the larger one.
The series you're summing has a "long tail", in the sense that the small terms should add up to a large contribution. But if you sum in descending order, then after a while each new small term will have no effect (even before that, most of its decimal places will be discarded). Once you get to that point you can add a billion more terms, and if you do them one at a time it still has no effect.
I think that summing in ascending order should give best accuracy for this kind of series, although it's possible there are some odd corner cases where errors due to rounding to powers of (1/2) might just so happen to give a closer answer for some addition orders than others. You probably can't really predict this, though.
I don't even know where to begin.
Here: What Every Computer Scientist Should Know About Floating-Point Arithmetic
Actually, if you're doing the summation for large N, adding in order from smallest to largest is not the best way -- you can still get into a situation where the numbers you're adding are too small relative to the sum to produce an accurate result.
Look at the problem this way: You have N summations, regardless of ordering, and you wish to have the least total error. Thus, you should be able to get the least total error by minimizing the error of each summation -- and you minimize the error in a summation by adding values as nearly close to each other as possible. I believe that following that chain of logic gives you a binary tree of partial sums:
Sum[0,i] = value[i]
Sum[1,i/2] = Sum[0,i] + Sum[0,i+1]
Sum[j+1,i/2] = Sum[j,i] + Sum[j,i+1]
and so on until you get to a single answer.
Of course, when N is not a power of two, you'll end up with leftovers at each stage, which you need to carry over into the summations at the next stage.
(The margins of StackOverflow are of course too small to include a proof that this is optimal. In part because I haven't taken the time to prove it. But it does work for any N, however large, as all of the additions are adding values of nearly identical magnitude. Well, all but log(N) of them in the worst not-power-of-2 case, and that's vanishingly small compared to N.)
http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic
You can find libraries with ready for use implementation for C/C++.
For example http://www.apfloat.org/apfloat/
Unless you use some accurate closed-form representation, a small-to-large ordered summation is likely to be most accurate simple solution (it's not clear to me why a log-exp would help - that's a neat trick, but you're not winning anything with it here, as far as I can tell).
You can further gain precision by realizing that after a while, the sum will become "quantized": Effectively, when you have 2 digits of precision, adding 1.3 to 41 results in 42, not 42.3 - but you achieve almost a precision doubling by maintaining an "error" term. This is called Kahan Summation. You'd compute the error term (42-41-1.3 == -0.3) and correct that in the next addition by adding 0.3 to the next term before you add it in again.
Kahan Summation in addition to a small-to-large ordering is liable to be as accurate as you'll ever need to get. I seriously doubt you'll ever need anything better for the harmonic series - after all, even after 2^45 iterations (crazy many) you'd still only be dealing with a numbers that are at least 1/2^45 large, and a sum that's on the order of 45 (<2^6), for an order of magnitude difference of 51 powers-of-two - i.e. even still representable in a double precision variable if you add in the "wrong" order.
If you go small-to-large, and use Kahan Summation, the sun's probably going to extinguish before today's processors reach a percent of error - and you'll run into other tricky accuracy issues just due to the individual term error on that scale first anyhow (being that a number of the order of 2^53 or larger cannot be represented accurately as a double at all anyhow.)
I'm not sure about the order of summation playing an important role, I havent heard that before. I guess you want to do this in floating point arithmetic so the first thing is to think more inline of (1.0/1.0 + 1.0/2.0+1.0/3.0) - otherwise the compiler will do integer division
to determine order of evaluation, maybe a for loop or brackets?
e.g.
float f = 0.0;
for (int i=n; i>0; --i)
{
f += 1.0/static_cast<float>(i);
}
oh forgot to say, compilers will normally have switches to determine floating point evaluation mode. this is maybe related to what you say on order of summation - in visual C+ these are found in code-generation compile settings, in g++ there're options -float that handle this
actually, the other guy is right - you should do summation in order of smallest component first; so
1/n + 1/(n-1) .. 1/1
this is because the precision of a floating point number is linked to the scale, if you start at 1 you'll have 23 bits of precision relative to 1.0. if you start at a smaller number the precision is relative to the smaller number, so you'll get 23 bits of precision relative to 1xe-200 or whatever. then as the number gets bigger rounding error will occur, but the overall error will be less than the other direction
As all your numbers are rationals, the easiest (and also maybe the fastest, as it will have to do less floating point operations) would be to do the computations with rationals (tuples of 2 integers p,q), and then do just one floating point division at the end.
update to use this technique effectively you will need to use bigints for p & q, as they grow quite fast...
A fast prototype in Lisp, that has built in rationals shows:
(defun sum_harmonic (n acc)
(if (= n 0) acc (sum_harmonic (- n 1) (+ acc (/ 1 n)))))
(sum_harmonic 10 0)
7381/2520
[2.9289682]
(sum_harmonic 100 0)
14466636279520351160221518043104131447711/278881500918849908658135235741249214272
[5.1873775]
(sum_harmonic 1000 0)
53362913282294785045591045624042980409652472280384260097101349248456268889497101
75750609790198503569140908873155046809837844217211788500946430234432656602250210
02784256328520814055449412104425101426727702947747127089179639677796104532246924
26866468888281582071984897105110796873249319155529397017508931564519976085734473
01418328401172441228064907430770373668317005580029365923508858936023528585280816
0759574737836655413175508131522517/712886527466509305316638415571427292066835886
18858930404520019911543240875811114994764441519138715869117178170195752565129802
64067621009251465871004305131072686268143200196609974862745937188343705015434452
52373974529896314567498212823695623282379401106880926231770886197954079124775455
80493264757378299233527517967352480424636380511370343312147817468508784534856780
21888075373249921995672056932029099390891687487672697950931603520000
[7.485471]
So, the next better option could be to mantain the list of floating points and to reduce it summing the two smallest numbers in each step...