finding maximum of a function with least probes taken - c++

I have some code, a function basically, that returns a value. This function takes a long time to run and takes a double as a parameter:
double estimate(double factor);
My goal is to find the parameter factor at which this estimate function returns its maximum value. I could simply brute-force it and iterate over different factor inputs to get what I need, but the function takes a long time to run, so I'd like to minimize the number of "probes" I take (i.e., call the estimate function as few times as possible).
Usually, the maximum is returned for factor values between 0.5 and 3.5. If I graph the returned values, I get something that looks like a bell curve. What's the most efficient approach to partition the possible inputs so that I can discover the maximum faster?

The previous answer suggested a 2-point approach. This is a good idea for functions that are approximately linear, because a line is defined by 2 parameters: y=ax+b.
However, the actual bell-shaped curve is more like a parabola, which is defined by ax²+bx+c (so 3 parameters). You therefore should take 3 points {x1,x2,x3} and solve for {a,b,c}. This will give you an estimate for the xtop at -b/2a. (The linked answer uses the name x0 here).
You'll need to iteratively approximate the actual top if the function isn't a real parabola, but this process converges really fast. The easiest solution is to take the original triplet x1,x2,x3, add xtop and remove the xn value which is furthest away from xtop. The advantage of this is that you can reuse 2 of the old f(x) values. This reuse helps a lot with the stated goal of "minimal samples".
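A minimal sketch of this scheme, assuming the estimate() function from the question; the bracket (0.5 to 3.5 as suggested there), the probe budget and the tolerance are illustrative values only:

#include <cmath>

double estimate(double factor);   // the expensive black-box function from the question

// x-coordinate of the vertex of the parabola through three points.
static double parabola_vertex(double x1, double y1, double x2, double y2,
                              double x3, double y3)
{
    double den = (x1 - x2) * (y1 - y3) - (x1 - x3) * (y1 - y2);
    if (den == 0.0) return x2;                       // collinear points; bail out
    double num = (x1 - x2) * (x1 - x2) * (y1 - y3)
               - (x1 - x3) * (x1 - x3) * (y1 - y2);
    return x1 - 0.5 * num / den;
}

// Successive parabolic interpolation: keep a triplet of probes, add xtop,
// and drop whichever old probe is farthest from it, so two of the three
// expensive f(x) values are reused on every iteration.
double find_max(double lo, double hi, int max_probes = 20, double tol = 1e-3)
{
    double x[3] = { lo, 0.5 * (lo + hi), hi };
    double y[3] = { estimate(x[0]), estimate(x[1]), estimate(x[2]) };
    double xprev = x[1];
    for (int probes = 3; probes < max_probes; ++probes) {
        double xtop = parabola_vertex(x[0], y[0], x[1], y[1], x[2], y[2]);
        if (std::fabs(xtop - xprev) < tol) return xtop;
        xprev = xtop;
        double ytop = estimate(xtop);                // the only new probe per iteration
        int worst = 0;                               // replace the probe farthest from xtop
        for (int i = 1; i < 3; ++i)
            if (std::fabs(x[i] - xtop) > std::fabs(x[worst] - xtop)) worst = i;
        x[worst] = xtop;
        y[worst] = ytop;
    }
    return xprev;                                    // e.g. find_max(0.5, 3.5)
}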

If your function indeed has a bell-shaped curve then you can use binary search as follows:
Choose an initial x1 (say x1 = 2, midway between 0.5 and 3.5) and find f(x1) and f(x1 + delta) where delta is small enough. If f(x1 + delta) > f(x1) it means that the peak is towards the right of x1 otherwise it is towards the left.
Carry out binary search and come to a close enough value of the peak as you want.
You can modify the above approach by choosing the next x_t according to the difference f(x1 + delta) - f(x1).
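A rough sketch of that approach, again assuming the question's estimate() function; delta and the stopping tolerance are values you would have to tune, and each iteration costs two probes:

#include <cmath>

double estimate(double factor);   // the expensive black-box function from the question

// Binary search for the peak of a unimodal (bell-shaped) function: the sign of
// f(mid + delta) - f(mid) tells us which side of the peak mid is on.
double find_peak(double lo, double hi, double delta, double tol)
{
    while (hi - lo > tol) {
        double mid = 0.5 * (lo + hi);
        if (estimate(mid + delta) > estimate(mid))
            lo = mid;          // still rising here, so the peak is to the right
        else
            hi = mid;          // falling (or flat), so the peak is to the left
    }
    return 0.5 * (lo + hi);    // e.g. find_peak(0.5, 3.5, 1e-3, 1e-2)
}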

Related

Optimization of cbrt() in C++

I am trying to improve the speed of my code, written in C++. Based on profilers, cbrt()/cbrtf32x() is the function I spend the most time in (more specifically):
double test_func(const double &test_val){
    double cbrt_test_val = cbrt(test_val);
    return (1 - 1e-10*cbrt_test_val);
}
According to the data, I spend more than three times as much time in cbrt()/cbrtf32x() as in the next most expensive function. So I was wondering how to improve this function and speed it up. The input values range from 1e18 to 1e30.
There is little that can be done if you are doing the cubic roots one at a time, and you want the exact result.
As is, I would be surprised if you can improve the cubic root calculation more than 10-20% - if that - while getting the same result numerically. (Note: I got that 10%-20% number out of thin air; it's an opinion, not a scientific number at all.)
If you can batch up the calculations, you might be able to SIMD the operation, or multi-thread them; or, if you know more about the distribution of the data (or can find out more), you might be able to sort them and - I don't know - maybe calculate an incremental cubic root or something.
If you can get away with an approximation, then there are more things that you can do. For example, you are calculating the function f(x) = 1 - cbrt(x) / 1e10, which is the same as 1 - cbrt(x / 1e30) which is a strictly decreasing function that maps the domain [1e18..1e30] to the range [0..0.9999]. With y = x / 1e30 it becomes f(y) = 1 - cbrt(y) and now y is in the range [1e-12..1] and it can be pre-calculated and approximated using a look-up table.
Depending on the number of times you need a cubic root, how much accuracy loss you can get away with (which determines the size of the table), and whether you can sort or bucket your input (to improve the CPU cache utilization for your LUT look-ups), you might get a nice speed boost out of this.
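As a hedged sketch of the look-up-table idea: the table size, the log2-spaced indexing and the linear interpolation below are my own choices, and whether this actually beats std::cbrt on a given machine, with acceptable accuracy, has to be measured:

#include <cmath>
#include <cstddef>
#include <vector>

// Table of f(y) = 1 - cbrt(y) for y = x/1e30 in [1e-12, 1], indexed on log2(y)
// so that the 12 decades of input are covered evenly.
class CbrtApprox {
public:
    explicit CbrtApprox(std::size_t n = 4096)
        : table_(n + 1), lo_(std::log2(1e-12)), hi_(0.0)
    {
        for (std::size_t i = 0; i <= n; ++i) {
            double y = std::exp2(lo_ + (hi_ - lo_) * i / n);
            table_[i] = 1.0 - std::cbrt(y);
        }
    }
    // x is the original input in [1e18, 1e30]; returns roughly 1 - 1e-10*cbrt(x).
    double operator()(double x) const
    {
        double t = (std::log2(x / 1e30) - lo_) / (hi_ - lo_) * (table_.size() - 1);
        if (t < 0.0) t = 0.0;                              // clamp below 1e18
        std::size_t i = static_cast<std::size_t>(t);
        if (i >= table_.size() - 1) i = table_.size() - 2; // clamp at 1e30
        double frac = t - static_cast<double>(i);
        return table_[i] + frac * (table_[i + 1] - table_[i]);  // linear interpolation
    }
private:
    std::vector<double> table_;
    double lo_, hi_;
};

Sorting or bucketing the inputs before querying, as suggested above, would then also help keep the table hot in the cache.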

Another way to calculate double type variables in c++?

Short version of the question: overflow or timeout in the current settings when calculating with large int64_t and double values; any way to avoid these?
Test case:
If the only demand is 80,000,000,000, it is solved with the correct result. But if it's 800,000,000,000, it incorrectly returns 0.
If the input has two or more demands (meaning more inequalities need to be calculated), smaller values will also cause incorrect results; e.g., three equal demands of 20,000,000,000 will cause the problem.
I'm using the COIN-OR CLP linear programming solver to solve some network flow problems. I use int64_t to represent the link bandwidth, but CLP uses double most of the time and cannot easily be switched to other types.
When the values of the variables are not that large (typically smaller than 10,000,000,000) and the constraints (inequalities) are relatively few, it gives the solution I want. But if either of the above factors increases, the tool stops and returns a 0-value solution. I think the reason is that the calculation exceeds some internal limit, so the program breaks at some trivial point (it uses the LP simplex method).
The inequality is some kind of:
totalFlowSum <= usePercentage * demand
I changed it to
totalFlowSum - usePercentage * demand <= 0
Since totalFlowSum and demand are very large int64_t values and usePercentage is a double, if there are too many constraints like this (several or more), or if the demand is larger than 100,000,000,000, the returned solution will be wrong.
Is there any way to correct this, like increasing the break threshold or avoiding this level of calculation magnitude?
Losing some accuracy is acceptable. One possible solution is to make the inputs 1,000 times smaller and the outputs 1,000 times larger, but this is kind of naïve and may require too much code modification in the program.
Update:
I have changed the formulation to
totalFlowSum / demand - usePercentage <= 0
but the problem still exists.
Update 2:
I divided usePercentage by 1000, changing its coefficient from 1 to 0.001, and it worked. But if I also divide totalFlowSum/demand by 1000 at the same time, there is still no result. I don't know why...
I changed the rhs of the inequalities from 0 to 0.1, and the problem is solved! Since the inputs are very large, a 0.1 offset won't impact the solution at all.
I think the reason is that the previous coefficients were badly scaled, so the solver failed to find an exact answer.

What's the origin of this GLSL rand() one-liner?

I've seen this pseudo-random number generator for use in shaders referred to here and there around the web:
float rand(vec2 co){
    return fract(sin(dot(co.xy, vec2(12.9898, 78.233))) * 43758.5453);
}
It's variously called "canonical", or "a one-liner I found on the web somewhere".
What's the origin of this function? Are the constant values as arbitrary as they seem or is there some art to their selection? Is there any discussion of the merits of this function?
EDIT: The oldest reference to this function that I've come across is this archive from Feb '08, the original page now being gone from the web. But there's no more discussion of it there than anywhere else.
Very interesting question!
I am trying to figure this out while typing the answer :)
First an easy way to play with it: http://www.wolframalpha.com/input/?i=plot%28+mod%28+sin%28x*12.9898+%2B+y*78.233%29+*+43758.5453%2C1%29x%3D0..2%2C+y%3D0..2%29
Then let's think about what we are trying to do here: For two input coordinates x,y we return a "random number". Now this is not a random number though. It's the same every time we input the same x,y. It's a hash function!
The first thing the function does is go from 2D to 1D. That is not interesting in itself, but the numbers are chosen so that the result typically does not repeat. We also have a floating-point addition there; there will be a few more bits from y or x, but the numbers might just be chosen right so that they mix.
Then we sample a black box sin() function. This will depend a lot on the implementation!
Lastly it amplifies the error in the sin() implementation by multiplying and taking the fraction.
I don't think this is a good hash function in the general case. The sin() is a black box, numerically, on the GPU. It should be possible to construct a much better one by taking almost any hash function and converting it. The hard part is to turn the typical integer operations used in CPU hashing into float (half or 32-bit) or fixed-point operations, but it should be possible.
Again, the real problem with this as a hash function is that sin() is a black box.
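To illustrate what "taking almost any hash function and converting it" might look like, here is a sketch on the CPU side using a published integer bit-mixing hash (the specific constants are one common choice, not canonical); on the GPU the same thing needs integer support in the shading language, which is the real obstacle alluded to above:

#include <cstdint>

// Integer bit-mixing hash ("lowbias32"-style); well distributed, cheap, and
// not dependent on any black-box transcendental function.
static std::uint32_t hash_u32(std::uint32_t x)
{
    x ^= x >> 16;
    x *= 0x7feb352dU;
    x ^= x >> 15;
    x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

// Hash a 2D integer coordinate to a float in [0, 1): combine the coordinates
// into one word, mix it, and keep the top 24 bits as the mantissa.
float hash2d(std::uint32_t x, std::uint32_t y)
{
    std::uint32_t h = hash_u32(x * 0x9e3779b9U ^ hash_u32(y));
    return (h >> 8) * (1.0f / 16777216.0f);   // 16777216 = 2^24 buckets in [0, 1)
}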
The origin is probably the paper: "On generating random numbers, with help of y= [(a+x)sin(bx)] mod 1", W.J.J. Rey, 22nd European Meeting of Statisticians and the 7th Vilnius Conference on Probability Theory and Mathematical Statistics, August 1998
EDIT: Since I can't find a copy of this paper and the "TestU01" reference may not be clear, here's the scheme as described in TestU01 in pseudo-C:
#define A1 ???
#define A2 ???
#define B1 pi*(sqrt(5.0)-1)/2
#define B2 ???

uint32_t n;   // position in the stream

double next() {
    double t = fract(A1 * sin(B1*n));
    double u = fract((A2+t) * sin(B2*t));
    n++;
    return u;
}
where the only recommended constant value is B1.
Notice that this is for a stream. Converting it to a 1D hash, 'n' becomes the integer grid coordinate. So my guess is that someone saw this and converted 't' into a simple function f(x,y). Using the original constants above, that would yield:
float hash(vec2 co){
    float t = 12.9898*co.x + 78.233*co.y;
    return fract((A2+t) * sin(t));  // any B2 is folded into the 't' computation
}
The constant values are arbitrary; notably, they are very large and a couple of decimals away from prime numbers.
Taking a high-amplitude sine modulo 1 (here the sine is multiplied by roughly 43758) gives a periodic function. It's like a window blind or corrugated metal, made very fine because of the large multiplier and turned at an angle by the dot product.
Since the function is 2D, the dot product has the effect of turning the periodic function at an oblique angle relative to the X and Y axes, by roughly a 13/79 ratio. It is inefficient; you could achieve roughly the same thing by taking the sine of (13x + 79y), I think, with less math.
If you find the period of the function in both X and Y, you can sample it so that it looks like a simple sine wave again.
I don't know the origin, but it is similar to many others: if you used it in graphics at regular intervals, it would tend to produce moiré patterns, and you could see that it eventually wraps around.
Maybe it's some non-recurrent chaotic mapping; then it could explain many things. But it can also just be some arbitrary manipulation with large numbers.
EDIT: Basically, the function fract(sin(x) * 43758.5453) is a simple hash-like function. The sin(x) provides smooth interpolation between -1 and 1, so sin(x) * 43758.5453 will be an interpolation from -43758.5453 to 43758.5453. This is quite a huge range, so even a small step in x will produce a large step in the result and a really large variation in the fractional part. The "fract" is needed to get values into the range [0, 1).
Now that we have something like a hash function, we should create a function that produces a hash from a vector. The simplest way is to call "hash" separately for the x and y components of the input vector, but then we would get some symmetrical values. So we should get a single value from the vector; the approach is to pick some fixed "random" vector and take the dot product with it, and here we go: fract(sin(dot(co.xy ,vec2(12.9898,78.233))) * 43758.5453);
Also, the selected vector's length should be long enough that there are several periods of the "sin" function after the "dot" product is computed.
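For completeness, a small CPU-side translation you can use to experiment with these constants; note that the results will not match a GPU exactly because, as discussed above, sin() is implementation-dependent:

#include <cmath>

// CPU translation of the shader one-liner; GLSL's fract(v) is v - floor(v).
static double fract(double v) { return v - std::floor(v); }

double rand2d(double x, double y)
{
    return fract(std::sin(x * 12.9898 + y * 78.233) * 43758.5453);
}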
I do not believe this to be the true origin, but the OP's code is presented as a code example in "The Book of Shaders" by Patricio Gonzalez Vivo and Jen Lowe ( https://thebookofshaders.com/10/ ). In their code, Patricio Gonzalez Vivo is cited as the author, i.e., "// Author #patriciogv - 2015".
Since the OP's research dates back even further (to '08), the source might at least explain its popularity, and the author might be able to shed some light on his source.

Solving floating-point rounding issues C++

I develop a scientific application (simulation of chromosomes moving in a cell nucleus). The chromosomes are divided in small fragments that rotate around a random axis using 4x4 rotation matrices.
The problem is that the simulation performs hundreds of billions of rotations, therefore the floating-point rounding errors stack up and grow exponentially, so the fragments tend to "float away" and detach from the rest of the chromosome as time passes.
I use double precision with C++. The software runs on the CPU for the moment but will be ported to CUDA, and simulations can last for 1 month at most.
I have no idea how I could somehow renormalize the chromosome, because all fragments are chained together (you can see it as a doubly linked list), but I think that would be the best idea, if possible.
Do you have any suggestions? I feel a bit lost.
Thank you very much,
H.
EDIT:
Added a simplified sample code.
You can assume all matrix math are classical implementations.
// Rotate 1000000 times
for (int i = 0; i < 1000000; ++i)
{
    // Pick a random section start
    int istart = rand() % chromosome->length;
    // Pick the end 20 segments further (cyclic)
    int iend = (istart + 20) % chromosome->length;
    // Build rotation axis
    Vector4 axis = chromosome->segments[istart].position - chromosome->segments[iend].position;
    axis.normalize();
    // Build rotation matrix and translation vector
    Matrix4 rotm(axis, rand() / float(RAND_MAX));
    Vector4 oldpos = chromosome->segments[istart].position;
    // Rotate each segment between istart and iend using rotm
    for (int j = (istart + 1) % chromosome->length; j != iend; ++j, j %= chromosome->length)
    {
        chromosome->segments[j].position -= oldpos;
        chromosome->segments[j].position.transform(rotm);
        chromosome->segments[j].position += oldpos;
    }
}
You need to find some constraint for your system and work to keep that within some reasonable bounds. I've done a bunch of molecular collision simulations and in those systems the total energy is conserved, so every step I double check the total energy of the system and if it varies by some threshold, then I know that my time step was poorly chosen (too big or too small) and I pick a new time step and rerun it. That way I can keep track of what's happening to the system in real time.
For this simulation, I don't know what conserved quantity you have, but if you have one, you can try to keep that constant. Remember, making your time step smaller doesn't always increase the accuracy, you need to optimize the step size with the amount of precision you have. I've had numerical simulations run for weeks of CPU time and conserved quantities were always within 1 part in 10^8, so it is possible, you just need to play around some.
Also, as Tomalak said, maybe try to always reference your system to the start time rather than to the previous step. So rather than always moving your chromosomes, keep the chromosomes at their start place and store with them a transformation matrix that gets you to the current location. When you compute your new rotation, just modify the transformation matrix. It may seem silly, but sometimes this works well because the errors average out to 0.
For example, let's say I have a particle that sits at (x,y) and every step I calculate (dx, dy) and move the particle. The step-wise way would do this:
t0 (x0,y0)
t1 (x0,y0) + (dx,dy) -> (x1, y1)
t2 (x1,y1) + (dx,dy) -> (x2, y2)
t3 (x2,y2) + (dx,dy) -> (x3, y3)
t4 (x3,y3) + (dx,dy) -> (x4, y4)
...
If you always reference to t0, you could do this
t0 (x0, y0) (0, 0)
t1 (x0, y0) (0, 0) + (dx, dy) -> (x0, y0) (dx1, dy1)
t2 (x0, y0) (dx1, dy1) + (dx, dy) -> (x0, y0) (dx2, dy2)
t3 (x0, y0) (dx2, dy2) + (dx, dy) -> (x0, y0) (dx3, dy3)
So at any time, tn, to get your real position you have to do (x0, y0) + (dxn, dyn)
Now for simple translation like my example, you're probably not going to win very much. But for rotation, this can be a life saver. Just keep a matrix with the Euler angles associated with each chromosome and update that rather than the actual position of the chromosome. At least this way they won't float away.
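A rough sketch of that idea in code; this is a sketch only: plain structs instead of the question's Matrix4/Vector4, a rotation-only transform, and the translation involved in rotating about a moving pivot is left out for brevity:

struct Vec3 { double x, y, z; };
struct Mat3 { double m[3][3]; };          // rotation part only, for brevity

static Mat3 identity()
{
    return {{{1, 0, 0}, {0, 1, 0}, {0, 0, 1}}};
}

static Mat3 mul(const Mat3 &a, const Mat3 &b)   // a * b
{
    Mat3 r{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

static Vec3 apply(const Mat3 &a, const Vec3 &v)
{
    return { a.m[0][0] * v.x + a.m[0][1] * v.y + a.m[0][2] * v.z,
             a.m[1][0] * v.x + a.m[1][1] * v.y + a.m[1][2] * v.z,
             a.m[2][0] * v.x + a.m[2][1] * v.y + a.m[2][2] * v.z };
}

struct Fragment {
    Vec3 position_at_t0;                  // never modified after initialisation
    Mat3 transform = identity();          // accumulated rotation from t0 to now
};

// Each step composes the new rotation into the stored transform...
void apply_step(Fragment &f, const Mat3 &step_rotation)
{
    f.transform = mul(step_rotation, f.transform);
}

// ...and the current position is only materialised when needed,
// always starting from the exact t0 data.
Vec3 current_position(const Fragment &f)
{
    return apply(f.transform, f.position_at_t0);
}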
Write your formulae so that the data for timestep T does not derive solely from the floating-point data in timestep T-1. Try to ensure that the production of floating-point errors is limited to a single timestep.
It's hard to say anything more specific here without a more specific problem to solve.
The problem description is rather vague, so here are some rather vague suggestions.
Option 1:
Find some set of constraints such that (1) they should always hold, (2) if they fail, but only just, it's easy to tweak the system so that they do, (3) if they do all hold then your simulation isn't going badly crazy, and (4) when the system starts to go crazy the constraints start failing but only slightly. For instance, perhaps the distance between adjacent bits of chromosome should be at most d, for some d, and if a few of the distances are just slightly greater than d then you can (e.g.) walk along the chromosome from one end, fixing up any distances that are too big by moving the next fragment towards its predecessor, along with all its successors. Or something.
Then check the constraints often enough to be sure that any violation will still be small when caught; and when you catch a violation, fix things up. (You should probably arrange that when you fix things up, you "more than satisfy" the constraints.)
If it's cheap to check the constraints all the time, then of course you can do that. (Doing so may also enable you to do the fixup more cheaply, e.g. if it means that any violations are always tiny.)
Option 2:
Find a new way of describing the state of the system that makes it impossible for the problem to arise. For instance, maybe (I doubt this) you can just store a rotation matrix for each adjacent pair of fragments, and force it always to be an orthogonal matrix, and then let the positions of the fragments be implicitly determined by those rotation matrices (see the sketch after the options below).
Option 3:
Instead of thinking of your constraints as constraints, supply some small "restoring forces" so that when something gets out of line it tends to get pulled back towards the way it should be. Take care that when nothing is wrong the restoring forces are zero or at least very negligible, so that they don't perturb your results more badly than the original numeric errors did.
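Regarding Option 2, one concrete way to "force it always to be an orthogonal matrix" is to re-orthonormalize its columns every so often (Gram-Schmidt); a minimal sketch with plain 3-vectors, independent of whatever matrix type the simulation actually uses:

#include <cmath>

struct V3 { double x, y, z; };

static V3 sub(V3 a, V3 b)        { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static V3 scale(V3 a, double s)  { return { a.x * s, a.y * s, a.z * s }; }
static double dot(V3 a, V3 b)    { return a.x * b.x + a.y * b.y + a.z * b.z; }
static V3 normalize(V3 a)        { return scale(a, 1.0 / std::sqrt(dot(a, a))); }
static V3 cross(V3 a, V3 b)
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

// Re-orthonormalize the three column vectors of a drifting rotation matrix:
// after many composed rotations the columns slowly stop being unit-length and
// perpendicular, and this pushes them back onto a proper rotation.
void reorthonormalize(V3 &c0, V3 &c1, V3 &c2)
{
    c0 = normalize(c0);
    c1 = normalize(sub(c1, scale(c0, dot(c0, c1))));   // remove the c0 component
    c2 = cross(c0, c1);                                // orthogonal and unit-length by construction
}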
I think it depends on the compiler you are using.
The Visual Studio compiler supports the /fp switch, which controls the behavior of floating-point operations.
You can read more about it; basically, /fp:strict is the strictest mode.
I guess it depends on the required precision, but you could use 'integer' based floating point numbers. With this approach, you use an integer and provide your own offset for the number of decimals.
For example, with a precision of 4 decimal points, you would have
float value -> int value
1.0000 -> 10000
1.0001 -> 10001
0.9999 -> 09999
You have to be careful when you do your multiplies and divides, and be careful when you apply your precision offsets. Otherwise you can quickly get overflow errors.
1.0001 * 1.0001 becomes 10001 * 10001 / 10000
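A minimal sketch of that scheme with 4 implied decimal places (scale 10000); the choice of scale and of int64_t for the raw value are mine, and multiply/divide truncate:

#include <cstdint>

// Fixed-point value with 4 implied decimal places: 1.0001 is stored as 10001.
struct Fixed4 {
    static constexpr std::int64_t SCALE = 10000;
    std::int64_t raw;

    static Fixed4 from_double(double d)
    {
        return { static_cast<std::int64_t>(d * SCALE + (d < 0 ? -0.5 : 0.5)) };
    }
    double to_double() const { return static_cast<double>(raw) / SCALE; }

    Fixed4 operator+(Fixed4 o) const { return { raw + o.raw }; }
    Fixed4 operator-(Fixed4 o) const { return { raw - o.raw }; }
    // Multiply then rescale: exactly the 10001 * 10001 / 10000 from the example above.
    Fixed4 operator*(Fixed4 o) const { return { raw * o.raw / SCALE }; }
    Fixed4 operator/(Fixed4 o) const { return { raw * SCALE / o.raw }; }
};

So Fixed4::from_double(1.0001) * Fixed4::from_double(1.0001) holds raw 10002, i.e. 1.0002 after truncation; the products are still where overflow can bite first, as warned above.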
If I read this code correctly, at no time is the distance between any two adjacent chromosome segments supposed to change. In that case, before the main loop compute the distance between each pair of adjacent points, and after the main loop, move each point if necessary to have the proper distance from the previous point.
You may need to enforce this constraint several times during the main loop, depending on circumstances.
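A sketch of that fix-up pass; it uses its own plain Point type rather than the question's Vector4, and rest_length is assumed to hold the adjacent-pair distances measured before the main loop, as described above:

#include <cmath>
#include <vector>

struct Point { double x, y, z; };

// Walk the chain once and restore the recorded distance between each pair of
// adjacent segments by moving each point back onto a sphere of that radius
// around its predecessor (rest_length[i] = original distance from point i-1 to i).
void enforce_distances(std::vector<Point> &pts, const std::vector<double> &rest_length)
{
    for (std::size_t i = 1; i < pts.size(); ++i) {
        double dx = pts[i].x - pts[i - 1].x;
        double dy = pts[i].y - pts[i - 1].y;
        double dz = pts[i].z - pts[i - 1].z;
        double len = std::sqrt(dx * dx + dy * dy + dz * dz);
        if (len > 0.0) {
            double s = rest_length[i] / len;   // rescale the offset to the original length
            pts[i].x = pts[i - 1].x + dx * s;
            pts[i].y = pts[i - 1].y + dy * s;
            pts[i].z = pts[i - 1].z + dz * s;
        }
    }
}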
Basically, you need to avoid the accumulation of error from these (inexact) matrix operators and there are two major ways of doing so in most applications.
Instead of writing the position as some initial position operated on many times, you can write out what the operator would be explicitly after N operations. For instance, imagine you had a position x and you were adding a value e (that you couldn't represent exactly). Much better than computing x += e; a large number of times would be to compute x + E_N, where E_N is some more accurate representation of what the operation produces after N repetitions. You should think about whether you have some way of representing the action of many rotations more accurately.
Slightly more artificial is to take your newly found point and project away any discrepancy from the expected radius around your center of rotation. This will guarantee that it doesn't drift off (but won't necessarily guarantee that the rotation angle is accurate).

parallel calculation of infinite series

I just have a quick question, on how to speed up calculations of infinite series.
This is just one of the examples:
arctan(x) = x - x^3/3 + x^5/5 - x^7/7 + ....
Let's say you have some library which allows you to work with big numbers. The first obvious solution would be to start adding/subtracting each element of the sequence until you reach some target N.
You can also cache x^n, so for each next element, instead of calculating x^(n+2) from scratch you can do lastX*(x^2).
But overall it seems to be a very sequential task. What can you do to utilize multiple processors (8+)?
Thanks a lot!
EDIT:
I will need to calculate something like 100k to 1m iterations. This is a C++-based application, but I am looking for an abstract solution, so it shouldn't matter.
Thanks for reply.
You need to break the problem down to match the number of processors or threads you have. In your case you could have for example one processor working on the even terms and another working on the odd terms. Instead of precalculating x^2 and using lastX*(x^2), you use lastX*(x^4) to skip every other term. To use 8 processors, multiply the previous term by x^16 to skip 8 terms.
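A sketch of that striding scheme, with plain doubles standing in for the big-number type and std::thread used purely for illustration: worker w handles terms w, w+P, w+2P, ... of the arctan series and hops its running power of x forward by x^(2P) each step:

#include <cmath>
#include <thread>
#include <vector>

// Partial sums of arctan(x) = sum over i of (-1)^i * x^(2i+1) / (2i+1),
// split across `workers` threads by striding over the term index i.
double atan_series(double x, int terms, int workers = 8)
{
    std::vector<double> partial(workers, 0.0);
    std::vector<std::thread> pool;
    double x2 = x * x;
    double skip = std::pow(x2, workers);            // x^(2P): jumps over the other workers' terms
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            double power = x * std::pow(x2, w);     // x^(2w+1), this worker's first term
            double sum = 0.0;
            for (int i = w; i < terms; i += workers) {
                double term = power / (2 * i + 1);
                sum += (i % 2 == 0) ? term : -term; // alternate signs
                power *= skip;
            }
            partial[w] = sum;                       // each worker writes its own slot
        });
    }
    for (auto &t : pool) t.join();
    double total = 0.0;
    for (double p : partial) total += p;            // combine the partial sums
    return total;                                   // e.g. atan_series(0.5, 1000000)
}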
P.S. Most of the time when presented with a problem like this, it's worthwhile to look for a more efficient way of calculating the result. Better algorithms beat more horsepower most of the time.
If you're trying to calculate the value of pi to millions of places or something, you first want to pay close attention to choosing a series that converges quickly and is amenable to parallelization. Then, if you have enough digits, it will eventually become cost-effective to split them across multiple processors; you will have to find or write a bignum library that can do this.
Note that you can factor out the variables in various ways; e.g.:
atan(x)= x - x^3/3 + x^5/5 - x^7/7 + x^9/9 ...
= x*(1 - x^2*(1/3 - x^2*(1/5 - x^2*(1/7 - x^2*(1/9 ...
Although the second line is more efficient than a naive implementation of the first line, it still has a linear chain of dependencies from beginning to end. You can improve your parallelism by combining terms in pairs:
= x*(1-x^2/3) + x^5*(1/5-x^2/7) + x^9*(1/9 ...
= x*( (1-x^2/3) + x^4*((1/5-x^2/7) + x^4*(1/9 ...
= [yet more recursive computation...]
However, this speedup is not as simple as you might think, since the time taken by each computation depends on the precision needed to hold it. In designing your algorithm, you need to take this into account; your algebra is also intimately involved. For instance, in the above case you'll get infinitely repeating fractions if you do regular divisions by your constant numbers, so you need to figure out some way to deal with that, one way or another.
Well, for this example, you might sum the series (if I've got the brackets in the right places):
(-1)^i * (x^(2i + 1))/(2i + 1)
Then on processor 1 of 8 compute the sum of the terms for i = 1, 9, 17, 25, ...
Then on processor 2 of 8 compute the sum of the terms for i = 2, 10, 18, 26, ...
and so on, finally adding up the partial sums.
Or, you could do as you (nearly) suggest, give i = 1..16 (say) to processor 1, i = 17..32 to processor 2 and so on, and they can compute each successive power of x from the previous one. If you want more than 8x16 elements in the series, then assign more to each processor in the first place.
I doubt whether, for this example, it is worth parallelising at all; I suspect that you will reach double-precision accuracy on 1 processor while the parallel threads are still waking up. But that's just a guess for this example, and there are probably many series for which parallelisation is worth the effort.
And, as @Mark Ransom has already said, a better algorithm ought to beat brute force and a lot of processors every time.