Optimization of cbrt() in C++

I am trying to improve the speed of my code, written in C++. According to the profiler, cbrt() (reported as cbrtf32x) is the function I spend the most time in, more specifically inside this function:
double test_func(const double &test_val){
    double cbrt_test_val = cbrt(test_val);
    return (1 - 1e-10*cbrt_test_val);
}
According to the profile, I spend more than three times as long in cbrt()/cbrtf32x() as in the next most expensive function. How can I speed this function up? The input values range from 1e18 to 1e30.

There is little that can be done if you are computing the cube roots one at a time and you want the exact result.
As is, I would be surprised if you could improve the cube root calculation by more than 10-20%, if that, while getting the same result numerically. (Note: I got that 10-20% number out of thin air; it's an opinion, not a scientific number at all.)
If you can batch up the calculations, you might be able to use SIMD or multiple threads. If you know more about the distribution of the data (or can find out more), you might be able to sort the values and, I don't know, maybe calculate an incremental cube root or something.
If you can get away with an approximation, then there are more things that you can do. For example, you are calculating the function f(x) = 1 - cbrt(x) / 1e10, which is the same as 1 - cbrt(x / 1e30), a strictly decreasing function that maps the domain [1e18..1e30] to the range [0..0.9999]. With y = x / 1e30 it becomes f(y) = 1 - cbrt(y), where y is in the range [1e-12..1], and it can be pre-calculated and approximated using a look-up table.
Depending on the number of times you need a cube root, how much accuracy loss you can get away with (which determines the size of the table), and whether you can sort or bucket your input (to improve CPU cache utilization for your LUT look-ups), you might get a nice speed boost out of this.
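One possible shape for such a look-up table, as a rough sketch rather than a tuned implementation. Instead of tabulating f(y) directly over [1e-12..1], this variant tabulates cbrt over the mantissa returned by std::frexp and handles the exponent separately, so a small table covers the whole input range. The class name CbrtTable, the table size of 1024 and the linear interpolation are my own assumptions, not something from the question, and the code assumes positive, finite, normal inputs.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the question): approximate cbrt(x) for
// positive, finite, normal x with a table over the mantissa of x.
class CbrtTable {
public:
    CbrtTable() {
        // Tabulate cbrt(m) at kSize + 1 evenly spaced points in [0.5, 1].
        for (std::size_t i = 0; i <= kSize; ++i) {
            const double m = 0.5 + 0.5 * static_cast<double>(i) / kSize;
            table_[i] = std::cbrt(m);
        }
    }

    double operator()(double x) const {
        assert(x > 0.0 && std::isfinite(x));
        int e = 0;
        const double m = std::frexp(x, &e);  // x = m * 2^e, m in [0.5, 1)

        // Write e = 3*q + r with r in {0, 1, 2}, so that
        // cbrt(x) = cbrt(m) * cbrt(2^r) * 2^q.
        int q = e / 3;
        int r = e % 3;
        if (r < 0) { r += 3; --q; }

        // Linear interpolation between neighbouring table entries.
        const double pos = (m - 0.5) * 2.0 * kSize;
        const std::size_t i = static_cast<std::size_t>(pos);
        const double t = pos - static_cast<double>(i);
        const double c = table_[i] + t * (table_[i + 1] - table_[i]);

        static const double kRoot[3] = {1.0, std::cbrt(2.0), std::cbrt(4.0)};
        return std::ldexp(c * kRoot[r], q);
    }

private:
    static constexpr std::size_t kSize = 1024;
    std::array<double, kSize + 1> table_;
};

// Same formula as test_func in the question, with the approximate cube root.
double test_func_approx(const CbrtTable& table, double test_val) {
    return 1.0 - 1e-10 * table(test_val);
}
```

Build the table once and reuse it for every call. With this table size the relative error of the cube root should stay well below 1e-6, but whether it actually beats the library cbrt() on your machine is something only a benchmark can tell.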

Related

finding maximum of a function with least probes taken

I have some code, basically a function, that returns a value. This function takes a long time to run. It takes a double as a parameter:
double estimate(double factor);
My goal is to find the value of factor at which estimate returns its maximum. I could simply brute-force it by iterating over different factor inputs, but the function takes a long time to run, so I'd like to minimize the number of "probes" that I take (i.e. call estimate as few times as possible).
Usually, the maximum is returned for factor values between 0.5 and 3.5. If I graph the returned values, I get something that looks like a bell curve. What's the most efficient way to partition the possible inputs so that I can find the maximum faster?

The previous answer suggested a 2 point approach. This is a good idea for functions that are approximately lines, because lines are defined by 2 parameters: y=ax+b.
However, the actual bell-shaped curve is more like a parabola, which is defined by y=ax²+bx+c (so 3 parameters). You should therefore take 3 points {x1,x2,x3} and solve for {a,b,c}. This gives you an estimate for xtop at -b/(2a). (The linked answer uses the name x0 here.)
You'll need to iteratively approximate the actual top if the function isn't a real parabola, but this process converges really fast. The easiest solution is to take the original triplet x1,x2,x3, add xtop and remove the xn value which is furthest away from xtop. The advantage of this is that you can reuse 2 of the old f(x) values. This reuse helps a lot with the stated goal of "minimal samples".
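The iteration described above could be sketched like this. The function names, the tolerance and the probe cap are illustrative assumptions; there are no safeguards here for the case where the fitted parabola opens upwards (which would give a minimum instead of a maximum), so treat it as a starting point only.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <functional>
#include <limits>
#include <utility>

// Abscissa of the vertex of the parabola through three points.
// Returns NaN when the points are collinear (no unique vertex).
double parabola_vertex(double x1, double f1, double x2, double f2,
                       double x3, double f3) {
    const double a = (x2 - x1) * (f2 - f3);
    const double b = (x2 - x3) * (f2 - f1);
    const double den = a - b;
    if (den == 0.0) return std::numeric_limits<double>::quiet_NaN();
    return x2 - 0.5 * ((x2 - x1) * a - (x2 - x3) * b) / den;
}

// Hypothetical driver: refine a peak estimate, reusing two of the three
// previous samples on every step. `max_probes` caps the calls to f.
double find_peak(const std::function<double(double)>& f,
                 double x1, double x2, double x3,
                 double tol = 1e-6, int max_probes = 20) {
    std::array<std::pair<double, double>, 3> p = {{
        {x1, f(x1)}, {x2, f(x2)}, {x3, f(x3)}}};
    for (int probes = 3; probes < max_probes; ++probes) {
        const double xtop = parabola_vertex(p[0].first, p[0].second,
                                            p[1].first, p[1].second,
                                            p[2].first, p[2].second);
        if (!std::isfinite(xtop)) break;

        // Find the samples nearest to and furthest from the new estimate.
        std::size_t nearest = 0, furthest = 0;
        for (std::size_t i = 1; i < p.size(); ++i) {
            const double d = std::fabs(p[i].first - xtop);
            if (d < std::fabs(p[nearest].first - xtop)) nearest = i;
            if (d > std::fabs(p[furthest].first - xtop)) furthest = i;
        }
        if (std::fabs(p[nearest].first - xtop) < tol) break;  // converged

        // One new probe per iteration replaces the furthest sample.
        p[furthest] = std::make_pair(xtop, f(xtop));
    }
    // Return the best sample in the current triplet.
    std::size_t best = 0;
    for (std::size_t i = 1; i < p.size(); ++i)
        if (p[i].second > p[best].second) best = i;
    return p[best].first;
}
```

For the question's setup, something like find_peak(estimate, 0.5, 2.0, 3.5) would be the obvious first call: three probes up front, then one extra probe per refinement.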

If your function indeed has a bell-shaped curve, then you can use binary search as follows:
Choose an initial x1 (say x1 = 2, midway between 0.5 and 3.5) and find f(x1) and f(x1 + delta), where delta is small enough. If f(x1 + delta) > f(x1), the peak is to the right of x1; otherwise it is to the left.
Carry out the binary search until you are as close to the peak as you want.
You can modify the above approach by choosing the next x_t according to the difference f(x1 + delta) - f(x1).
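A minimal sketch of that binary search, assuming a single peak inside the starting interval. The name bisect_peak and the delta and tol parameters are placeholders I made up; note that every halving costs two calls to the expensive function.

```cpp
#include <cassert>
#include <functional>

// Bisection on the sign of a finite-difference slope. Assumes f has a
// single peak inside [lo, hi]. Each iteration costs two calls to f.
double bisect_peak(const std::function<double(double)>& f,
                   double lo, double hi, double delta, double tol) {
    assert(lo < hi && delta > 0.0 && tol > 0.0);
    while (hi - lo > tol) {
        const double mid = 0.5 * (lo + hi);
        if (f(mid + delta) > f(mid)) {
            lo = mid;  // still climbing: the peak is to the right
        } else {
            hi = mid;  // descending: the peak is to the left
        }
    }
    return 0.5 * (lo + hi);
}
```

Starting from [0.5, 3.5] with tol = 1e-3 this takes 12 halvings, i.e. 24 probes, so it only makes sense if the curve is too irregular for a parabola fit.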

Another way to calculate double type variables in c++?

Short version of the question: I get an overflow or a timeout with the current settings when calculating with large int64_t and double values. Is there any way to avoid this?
Test case:
If the only demand is 80,000,000,000, the problem is solved with the correct result. But if it is 800,000,000,000, an incorrect 0 is returned.
If the input has two or more demands (which means more inequalities need to be calculated), smaller values also cause incorrect results. E.g., three equal demands of 20,000,000,000 trigger the problem.
I'm using the COIN-OR CLP linear programming solver to solve some network flow problems. I use int64_t to represent the link bandwidth, but CLP uses double most of the time and cannot easily be switched to other types.
When the values of the variables are not that large (typically smaller than 10,000,000,000) and the constraints (inequalities) are relatively few, it gives the solution I want. But if either of these factors increases, the tool stops and returns a 0-valued solution. I think the reason is that the calculation complexity exceeds its maximum, so the program breaks at some trivial point (it uses the LP simplex method).
The inequality is some kind of:
totalFlowSum <= usePercentage * demand
I changed it to
totalFlowSum - usePercentage * demand <= 0
Since totalFlowSum and demand are very large int64_t values and usePercentage is a double, the returned solution is wrong if there are too many constraints like this (several or more), or if the demand is larger than 100,000,000,000.
Is there any way to correct this, such as raising the break threshold or avoiding calculations of this magnitude?
Some loss of accuracy is acceptable. One possible solution is to make the inputs 1,000 times smaller and the outputs 1,000 times larger, but this is rather naïve and may require too much code modification in the program.
Update:
I have changed the formulation to
totalFlowSum / demand - usePercentage <= 0
but the problem still exists.
Update 2:
I divided usePercentage by 1000, changing its coefficient from 1 to 0.001, and it worked. But if I also divide totalFlowSum/demand by 1000 at the same time, there is still no result. I don't know why...

I changed the rhs of the inequalities from 0 to 0.1, and the problem was then solved! Since the inputs are very large, a 0.1 offset won't impact the solution at all.
I think the reason is that the previous coefficients were badly scaled, so the solver failed to find an exact answer.
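The rescaling idea from the question (smaller inputs, outputs scaled back up) can be kept in one place so it does not spread through the program. This is only a sketch of the unit conversion; nothing here calls CLP, and the helper names choose_unit, to_units and from_units are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: pick a power-of-ten unit so that the largest demand,
// expressed in that unit, falls in [1, 10).
double choose_unit(const std::vector<std::int64_t>& demands) {
    std::int64_t largest = 1;
    for (std::int64_t d : demands)
        if (d > largest) largest = d;
    double unit = 1.0;
    while (static_cast<double>(largest) / unit >= 10.0) unit *= 10.0;
    return unit;
}

// Convert raw int64_t bandwidths into solver-friendly doubles.
std::vector<double> to_units(const std::vector<std::int64_t>& values,
                             double unit) {
    assert(unit > 0.0);
    std::vector<double> scaled;
    scaled.reserve(values.size());
    for (std::int64_t v : values)
        scaled.push_back(static_cast<double>(v) / unit);
    return scaled;
}

// Convert a solver result back into the original bandwidth unit.
double from_units(double value, double unit) { return value * unit; }
```

With demands of 800,000,000,000 and 20,000,000,000 the unit comes out as 1e11, so the model sees 8 and 0.2 instead of twelve-digit coefficients. Whether that alone is enough for CLP on your problem is something you would have to try.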

Formula for PI-regulation Proportional Integral algorithm

I've been reading this website: http://www.csimn.com/CSI_pages/PIDforDummies.html and I'm confused about the proportional integral part. Here's what it says.
Proportional control
Here’s a diagram of the controller when we have enabled only P control:
In Proportional Only mode, the controller simply multiplies the Error by the Proportional Gain (Kp) to get the controller output.
The Proportional Gain is the setting that we tune to get our desired performance from a “P only” controller.
A match made in heaven: The P + I Controller
If we put Proportional and Integral Action together, we get the humble PI controller. The Diagram below shows how the algorithm in a PI controller is calculated.
The tricky thing about Integral Action is that it will really screw up your process unless you know exactly how much Integral action to apply.
A good PID Tuning technique will calculate exactly how much Integral to apply for your specific process - but how is the Integral Action adjusted in the first place?
As you can see, the proportional part is easy to understand: it says that you multiply the error by a tuning variable. The part that I don't get is where the P and I in the second part come from, and what mathematical operation you do with them. I don't have a degree in mathematics or advanced calculus knowledge, so I would appreciate it if you would try to keep it at algebra level.

There is a big part missing from the text, the actual physical system that turns the control into a process and the actual physical variable.
Think of the integral as some kind of averaging operation that filters out small oscillations in the PV input. It also represents some kind of memory of the immediate past of the process.
A moving exponential average, for instance, can be thought of as a mix of integral and proportional action.
Staying with the car driving example, if you come to a curve where you need the steering wheel in a certain position to go in a circle, you don't just yank the wheel to that position, you move it gradually (most of the time). Exactly such ramp-up and ramp-down actions are effects of using the integral action part.

The I (integral) part is just a running sum, also multiplied by some constant.
Analogue integration is done by nonlinear gain and amplifier.
Digital integration of first order is just:
output += input*dt;
second order is:
temp += input*dt;
output += temp*dt;
dt is the duration of one iteration of the control loop (timer period or whatever).
Do not forget that a PI regulator can have a more complicated response:
i1 += input*dt;
i2 += i1*dt;
i3 += i2*dt;
output = a0*input + a1*i1 + a2*i2 +a3*i3 ...;
where a0 is the P part.
The I regulator keeps adding to the control value until the controlled value is the same as the preset value. The longer it takes to match, the faster it controls.
Compared to a P regulator with the same gain, this creates fast oscillations around the preset value, but on average the control time is smaller than with just a P regulator.
Therefore the I gain is usually much, much smaller, which creates the memory and smoothing effect LutzL mentioned (while the regulation time is similar to or smaller than for P-only regulation).
The controlled device has its own response, which can be represented as a differential equation.
There is a lot of theory in cybernetics about obtaining the right regulator response to match your process needs, such as:
quality of control
reaction times
max oscillation amplitude
stability
But for all of that you need differential math, like solving systems of differential equations of any order.
I strongly recommend using the Laplace transform, though many people use the Z transform instead.
So the I regulator adds speed to the regulation, but it also creates bigger oscillations and, when it does not match the regulated system properly, instability.
Integration adds overflow risks to regulation (analog integration is very sensitive to this).
Also keep in mind that you can subtract the I part from the control value instead, which does the exact opposite.
Sometimes a combination of several I parts is used to match the desired regulation response shape.
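Putting the P part and a first-order I part together at algebra level, a discrete PI step might look like the sketch below. The struct name and the member names are my own; kp and ki are the two tuning constants, and anti-windup and output limits are deliberately left out.

```cpp
#include <cassert>

// Minimal discrete PI controller:
//   output = kp * error + ki * (sum of error * dt so far)
struct PiController {
    double kp;        // proportional gain
    double ki;        // integral gain
    double integral;  // running sum of error * dt (the "memory")

    PiController(double kp_, double ki_) : kp(kp_), ki(ki_), integral(0.0) {}

    double update(double setpoint, double measured, double dt) {
        assert(dt > 0.0);
        const double error = setpoint - measured;
        integral += error * dt;             // I part: accumulate the error
        return kp * error + ki * integral;  // P part + I part
    }
};
```

So the "P" and "I" in the diagram are just two numbers that get added: the current error times one constant, plus the accumulated error times another. Once the error reaches zero the P term vanishes, and the accumulated I term is what keeps the output where it needs to be.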

Test of lot of math operations in a class

Is there an easy way to test the functions inside a class for correct results? I have been looking at Google Test for unit testing, but it seems aimed more at finding failures in working classes and functions than at checking the expected result.
For example, from math theory one knows the square root of every number. Now you want to check a sqrt function for floating-point precision errors, and then you also want to check a lot of other functions that use floats for precision errors. Is there a way to make this easy and fast?

I can think of 2 direct solutions.
1)
One of the easiest ways to test the accuracy of mathematical functions is similar to the definition work for limits in calculus: take the value to be tested, and also a value that is "close" on each side. I have heard analogies drawn between limit analysis and unit testing, but keep in mind that if you're looking for speed this will not be your best option, that it only works on continuous operations, and that the analogy is for definition work only.
So what you would do is define a "limitDomain" variable per function (because some operations are more accurate than others; for the reasoning, look up the Taylor approximation of [function]) and use that as your limiter. Then test low, high, and the value itself, and take the average of all three within a given margin of error:
// OpX and limitDomainOpX are assumed to be defined elsewhere for the operation under test.
float testMathOpX(float _input){
    // evaluate the operation just below, just above, and at the input
    float low = OpX(_input - limitDomainOpX);
    float high = OpX(_input + limitDomainOpX);
    _input = OpX(_input);
    // doing 3 separate averages with division by 2 means the worst decimal you will have is a trailing 5, or in some cases a trailing 25
    low = (low + _input)/2;
    high = (_input + high)/2;
    _input = (low + high)/2;
    return _input;
}
2)
The other method that I can think of is more of a table-of-values approach: you take the input, check where in the domain of the operation it lies, and if it lies within certain values you use value replacement. The thing to realize is that you need to do a lot of work ahead of time to get this table of values; after that it becomes just domain testing of the value you're taking in, in the form of:
if( (_input > valLow) && (_input < valHigh)){
... replace the value with an empirically found value
}
The problem with this is that you need to find those empirically found values.

Do you have requirements on the precision or do you want to find the precision?
If it is the former, then it is not hard to create test cases using any test framework.
y = myfunc(x);
if (y > expected_y + allowed_error || y < expected_y - allowed_error) {
// Test failed
...
}
Edit:
There are two routes to finding the precision, through testing and through algorithm analysis.
Testing should be straightforward: Compare the output with the correct values (which you have to obtain in some way).
Algorithm analysis is when you calculate the expected size of the error by combining the error of the algorithm itself with the error caused by the limited precision of floating-point arithmetic.
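For the testing route, a relative tolerance usually travels better across magnitudes than a fixed allowed_error. The sketch below is framework-independent; approx_equal, my_sqrt and the chosen tolerances are placeholders, and the reference values are ones known exactly from theory.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// True when a and b agree to within a relative tolerance, with an optional
// absolute floor so that comparisons near zero do not fail spuriously.
bool approx_equal(double a, double b, double rel_tol, double abs_tol = 0.0) {
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= std::max(rel_tol * scale, abs_tol);
}

// Hypothetical function under test; replace with your own implementation.
double my_sqrt(double x) { return std::sqrt(x); }

// Compare against inputs whose exact results are known from theory.
void test_my_sqrt() {
    const double cases[][2] = {
        {4.0, 2.0}, {9.0, 3.0}, {0.25, 0.5}, {1e10, 1e5}};
    for (const auto& c : cases)
        assert(approx_equal(my_sqrt(c[0]), c[1], 1e-12));

    // Round trip for an input with no exactly representable root.
    const double r = my_sqrt(2.0);
    assert(approx_equal(r * r, 2.0, 1e-12));
}
```

Inside a test framework the assert calls would simply become that framework's own checks.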

parallel calculation of infinite series

I just have a quick question on how to speed up the calculation of infinite series.
This is just one of the examples:
arctan(x) = x - x^3/3 + x^5/5 - x^7/7 + ....
Let's say you have some library which allows you to work with big numbers. The first obvious solution would be to start adding/subtracting each element of the sequence until you reach some target N.
You can also cache x^n, so that for each next element, instead of calculating x^(n+2), you can do lastX*(x^2).
But overall it seems to be a very sequential task. What can you do to utilize multiple processors (8+)?
Thanks a lot!
EDIT:
I will need to calculate somewhere from 100k to 1m iterations. This is a C++-based application, but I am looking for an abstract solution, so it shouldn't matter.
Thanks for reply.

You need to break the problem down to match the number of processors or threads you have. In your case you could have for example one processor working on the even terms and another working on the odd terms. Instead of precalculating x^2 and using lastX*(x^2), you use lastX*(x^4) to skip every other term. To use 8 processors, multiply the previous term by x^16 to skip 8 terms.
P.S. Most of the time when presented with a problem like this, it's worthwhile to look for a more efficient way of calculating the result. Better algorithms beat more horsepower most of the time.
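A sketch of that striding scheme, using plain double instead of a big-number library to keep it short. The function names are made up, and to stay free of threading boilerplate the partial sums are computed in a simple loop; since they do not depend on each other, each call could be handed to its own thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Sum of the terms i = worker, worker + workers, worker + 2*workers, ...
// (all i < terms) of atan(x) = sum over i of (-1)^i * x^(2i+1) / (2i+1).
double atan_partial(double x, std::size_t worker, std::size_t workers,
                    std::size_t terms) {
    assert(workers > 0 && worker < workers);
    // First term handled by this worker: (-1)^worker * x^(2*worker + 1).
    double power = std::pow(x, static_cast<double>(2 * worker + 1));
    if (worker % 2 == 1) power = -power;
    // Jumping `workers` terms ahead multiplies by (-1)^workers * x^(2*workers).
    double stride = std::pow(x, static_cast<double>(2 * workers));
    if (workers % 2 == 1) stride = -stride;

    double sum = 0.0;
    for (std::size_t i = worker; i < terms; i += workers) {
        sum += power / static_cast<double>(2 * i + 1);
        power *= stride;
    }
    return sum;
}

// Combine the partial sums. The calls are independent of one another, so
// each could run on its own thread; they run in a loop here for brevity.
double atan_series(double x, std::size_t terms, std::size_t workers) {
    double total = 0.0;
    for (std::size_t w = 0; w < workers; ++w)
        total += atan_partial(x, w, workers, terms);
    return total;
}
```

With 8 workers the stride is x^16, matching the description above. With big numbers the same structure applies, only the arithmetic type changes.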

If you're trying to calculate the value of pi to millions of places or something, you first want to pay close attention to choosing a series that converges quickly, and which is amenable to parallelization. Then, if you have enough digits, it will eventually become cost-effective to split them across multiple processors; you will have to find or write a bignum library that can do this.
Note that you can factor out the variables in various ways; e.g.:
atan(x)= x - x^3/3 + x^5/5 - x^7/7 + x^9/9 ...
= x*(1 - x^2*(1/3 - x^2*(1/5 - x^2*(1/7 - x^2*(1/9 ...
Although the second line is more efficient than a naive implementation of the first line, the latter calculation still has a linear chain of dependencies from beginning to end. You can improve your parallelism by combining terms in pairs:
= x*(1-x^2/3) + x^5*(1/5-x^2/7) + x^9*(1/9 ...
= x*( (1-x^2/3) + x^4*((1/5-x^2/7) + x^4*(1/9 ...
= [yet more recursive computation...]
However, this speedup is not as simple as you might think, since the time taken by each computation depends on the precision needed to hold it. In designing your algorithm, you need to take this into account; also, your algebra is intimately involved; i.e., for the above case, you'll get infinitely repeating fractions if you do regular divisions by your constant numbers, so you need to figure some way to deal with that, one way or another.
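To make the pairing concrete, here is a small double-precision sketch (the function name is made up). Consecutive terms are combined into a_k = 1/(4k+1) - x^2/(4k+3); those values do not depend on each other, so that loop is the part that could be distributed, and the final Horner pass in x^4 is half as long as the unpaired one.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// atan(x) from `pairs` pairs of terms:
//   x * (a_0 + x^4 * (a_1 + x^4 * (a_2 + ...))),
// where a_k = 1/(4k+1) - x^2/(4k+3).
double atan_paired(double x, std::size_t pairs) {
    assert(pairs > 0);
    const double x2 = x * x;
    const double x4 = x2 * x2;

    // Each a_k is independent of the others (parallelizable step).
    std::vector<double> a(pairs);
    for (std::size_t k = 0; k < pairs; ++k)
        a[k] = 1.0 / static_cast<double>(4 * k + 1)
             - x2 / static_cast<double>(4 * k + 3);

    // Sequential Horner evaluation in x^4, innermost pair first.
    double acc = 0.0;
    for (std::size_t k = pairs; k-- > 0;)
        acc = a[k] + x4 * acc;
    return x * acc;
}
```

In double precision the divisions are cheap; with a bignum library they are exactly the repeating-fraction problem mentioned above, so the real cost model will look different.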

Well, for this example, you might sum the series with terms (if I've got the brackets in the right places):
(-1)^i * (x^(2i + 1))/(2i + 1), for i = 0, 1, 2, ...
Then on processor 1 of 8 compute the sum of the terms for i = 0, 8, 16, 24, ...
Then on processor 2 of 8 compute the sum of the terms for i = 1, 9, 17, 25, ...
and so on, finally adding up the partial sums.
Or, you could do as you (nearly) suggest: give i = 0..15 (say) to processor 1, i = 16..31 to processor 2 and so on, and they can compute each successive power of x from the previous one. If you want more than 8x16 elements in the series, then assign more to each processor in the first place.
I doubt whether, for this example, it is worth parallelising at all. I suspect that you will get to double-precision accuracy on 1 processor while the parallel threads are still waking up; but that's just a guess for this example, and you can probably find many series for which parallelisation is worth the effort.
And, as @Mark Ransom has already said, a better algorithm ought to beat brute force and a lot of processors every time.
