I am writing an application in which, in one block, I need to exponentiate reals around 3*500*500 times. With the exp(y*log(x)) algorithm the program noticeably lags. It is significantly faster if I use another algorithm based on playing with data types, but that algorithm isn't very precise, although it provides decent results for the simulation, and it's still not perfect in terms of speed.
Is there any precise exponentiation algorithm for real powers faster than exp(y*log(x))?
Thank you in advance.
If you need good accuracy, and you don't know anything about the distribution of bases (x values) a priori, then pow(x, y) is the best portable answer (on many -- not all -- platforms, this will be faster than exp(y*log(x)), and is also better behaved numerically). If you do know something about what ranges x and y can lie in, and with what distribution, that would be a big help for people trying to offer advice.
The usual way to do it faster while keeping good accuracy is to use a library routine designed to do many of these computations simultaneously for an array of x values and an array of y values. The catch is that such library implementations tend to cost money (like Intel's MKL) or be platform-specific (vvpowf in the Accelerate.framework on OS X, for example). I don't know much about MinGW, so someone else will need to tell you what's available there. The GSL may have something along these lines.
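To make the shape of such a batched interface concrete, here is a minimal portable sketch. The batch_pow name and its plain-loop body are my own stand-ins, not any particular library's API; the point is that a vendor routine (MKL, vvpowf, and so on) would replace the loop body with a vectorized implementation:

    #include <cmath>
    #include <cstddef>

    // Hypothetical batched pow: out[i] = pow(x[i], y[i]) for a whole
    // array at once. The portable fallback is just a loop; on platforms
    // with a vector math library you would dispatch to it here instead,
    // amortizing per-call overhead and enabling SIMD.
    void batch_pow(double* out, const double* x, const double* y, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = std::pow(x[i], y[i]);
    }

Structuring the caller around one batched call like this also makes it trivial to swap in a fast library version later.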
Depending on your algorithm (in particular if you have few or no additions), you can sometimes get away with working, at least partially, in log-space. You've probably already considered this, but if your intermediate representation stores log_x and log_y, then log(x^y) = exp(log_y) * log_x, which needs one exp and no log per operation. If you can be selective about it, then obviously computing log(x^y) as y * log_x is cheaper still. If you can avoid even just a few exponentiations, you may win a lot of performance. And if there is any way to rewrite your loops so the exponentiation moves out of the innermost loop, that's a fairly certain performance win.
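A small sketch of the log-space idea, assuming all values are positive so the logarithms exist (the names are illustrative):

    #include <cmath>

    // If the intermediate representation carries log_x = log(x) and
    // log_y = log(y), the log of x^y needs one exp and no log:
    double log_pow(double log_x, double log_y)
    {
        return std::exp(log_y) * log_x;   // = log(x^y) = y * log(x)
    }

    // If y itself is at hand, it is cheaper still -- no transcendentals:
    double log_pow_direct(double log_x, double y)
    {
        return y * log_x;                 // = log(x^y)
    }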
I'm trying to wrap my head around how people who write math functions for games and rendering engines use an optimized math function efficiently; let me explain that further.
There is a high need for fast trigonometric functions in those fields. At times you can optimize a sin, a cos or another function by rewriting it in a different form that is valid only over a given interval; often this means that your approximation of f(x) is just for the first quadrant, i.e. 0 <= x <= pi/2.
Now the input to your f(x) still spans all 4 quadrants, but the real formula only covers 1/4 of that interval; the straightforward solution is to detect the quadrant by analyzing the input and seeing in which range it lies, then adjust the result of the formula accordingly if the input comes from a quadrant other than the first.
This is good in theory, but it also presents a couple of really bad problems, especially considering that you are doing all this to steal a couple of cycles from your CPU (you also get a consistent implementation that is not platform dependent, unlike a hardcoded fsin in Intel x86 asm, which only works on x86 and has a certain error range; all of this may differ on other platforms with other asm instructions), so you need to keep things working concurrently and at a high performance level.
The reasons I can't wrap my head around the "switch case" with quadrants solution are:
it just prevents possible optimizations, namely memoization, considering that you usually want to put that switch-case inside the same function that actually computes f(x); the situation can probably be improved by implementing the formula for f(x) outside said function, but this will double the number of functions to maintain for any given math library
it increases the probability of branching, which hurts concurrent execution
generally speaking it doesn't lead to better, clean, DRY code, and conditional statements are often a potential source of bugs; I don't really like switch-cases and similar constructs.
Assuming that I can implement my cross-platform f(x) in C or C++, how do programmers in this field usually address the problem of translating and mapping the inputs, the quadrants, to the result via the actual implementation?
Note: In the below answer I am speaking very generally about code.
Assuming that I can implement my cross-platform f(x) in C or C++, how do programmers in this field usually address the problem of translating and mapping the inputs, the quadrants, to the result via the actual implementation?
The general answer to this is: In the most obvious and simplest way possible that achieves your purpose.
I'm not entirely sure I follow most of your arguments/questions but I have a feeling you are looking for problems where really none exist. Do you truly have the need to re-implement the trigonometric functions? Don't fall into the trap of NIH (Not Invented Here).
the straightforward solution is to detect the quadrant
Yes! I love straightforward code. Code that makes it perfectly obvious at a glance what it does. Now, sometimes, just sometimes, you have to do some crazy things to get it to do what you want: for performance, or to avoid bugs out of your control. But the first version should be the most obvious and simple code that solves your problem. From there you do testing, profiling, and benchmarking, and if (and only if) you find performance or other issues, then you go into the crazy stuff.
This is good in theory, but it also presents a couple of really bad problems,
I would say that this is good in theory and in practice for most cases and I definitely don't see any "bad" problems. Minor issues in specific corner cases or design requirements at most.
A few things on a few of your specific comments:
approximation of f(x) is just for the first quadrant: Yes, and there are multiple reasons for this. One is simply that most trigonometric functions have identities, so you can easily use these to reduce the range of the input parameters. This is important, as many numerical techniques only work over a specific range of inputs, or are more accurate/performant for small inputs. Next, for very large inputs you'll have to trim the range anyway for most numerical techniques to work, or at least to work in a decent amount of time with sufficient accuracy. For example, look at the Taylor expansion for cos() and see how long it takes to converge sufficiently for large vs. small inputs. (A sketch of quadrant-based range reduction for cosine follows this list.)
it just prevents possible optimizations: Chances are your C++ compiler these days is way better at optimizations than you are. Sometimes it isn't, but the general procedure is to let the compiler do its optimization and only optimize manually where you have measured and proven that you need it. These days it is very non-intuitive to tell which code is faster just by looking at it (you can read all the questions on SO about performance issues and how crazy some of the root causes are).
namely memoization: I've never seen memoization in place for a double function. Just think how many doubles there are between 0 and 1. In reduced-accuracy situations you can take advantage of it, but that is easily implemented as a custom function tailored for that exact situation. Thinking about it, I'm not exactly sure how to implement memoization for a double function in a way that actually means anything and doesn't lose accuracy or performance in the process.
increase probability of more branching with a concurrent execution: I'm not sure I'd implement trigonometric functions in a concurrent manner, but I suppose it's entirely possible to get some performance benefits. But again, the compiler is generally better at optimization than you are, so let it do its job and then benchmark/profile to see if you really need to do better.
doesn't lead to better, clean, dry code: I'm not sure what exactly you mean here, or what "DRY code" is for that matter. Yes, sometimes you can get into trouble with too many or too complex if/switch blocks, but I can't see a simple check for 4 quadrants applying here... it's a pretty basic and simple case.
So for any platform I get the same y for the same values of x: My guess is that getting "exact" values for all 53 bits of a double across multiple platforms and systems is not going to be possible. What is the result if you only have 52 bits correct? This would be a great area to do some tests in and see what you get.
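To make the range-reduction point concrete, here is a minimal sketch for cosine. cos_first_quadrant is a deliberately crude Taylor polynomial, a placeholder for whatever approximation you would actually use on [0, pi/2]:

    #include <cmath>

    // Placeholder first-quadrant approximation (illustration only; a
    // real implementation would use a minimax polynomial or similar).
    double cos_first_quadrant(double x)
    {
        double x2 = x * x;
        return 1.0 - x2 / 2.0 + x2 * x2 / 24.0 - x2 * x2 * x2 / 720.0;
    }

    // Reduce any input to the first quadrant via three identities:
    // cos(-x) = cos(x), cos(2*pi - x) = cos(x), cos(pi - x) = -cos(x).
    double my_cos(double x)
    {
        const double PI = 3.14159265358979323846;
        x = std::fabs(x);               // evenness
        x = std::fmod(x, 2.0 * PI);     // periodicity: now in [0, 2*pi)
        if (x > PI) x = 2.0 * PI - x;   // reflection: now in [0, pi]
        return (x <= PI / 2.0) ? cos_first_quadrant(x)
                               : -cos_first_quadrant(PI - x);
    }

The whole "switch case with quadrants" amounts to those two branches; there isn't much there for a compiler or a maintainer to choke on.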
I've used trigonometric functions in C for over 20 years and 99% of the time I just use whatever built-in function is supplied. In the rare case I need more performance (or accuracy) as proven by testing or benchmarking, only then do I actually roll my own custom implementation for that specific case. I don't rewrite the entire gamut of <math.h> functions in the hope that one day I might need them.
I would suggest trying to code a few of these functions in as many ways as you can find, and doing some accuracy and benchmark tests. This will give you some practical knowledge and some hard data on whether you actually need to reimplement these functions. At the very least it should give you practical experience implementing these types of functions, and chances are it will answer a lot of your questions in the process.
I'm writing a program that depends a lot on complex additions and multiplications. I wanted to know whether I should use gsl_complex or std::complex.
I don't seem to find a comparison online of how much better GSL complex arithmetic is as compared to std::complex. A rudimentary google search didn't help me find a benchmarks page for GSL complex either.
I wrote a 20-line program that generates two random arrays of complex numbers (1e7 of them) and then measured how long addition and multiplication took using clock() from <ctime>. Using this method (without compiler optimisation) I found that gsl_complex_add and gsl_complex_mul are almost twice as fast as std::complex<double>'s + and * respectively. But I've never done this sort of thing before, so is this even the way you check which is faster?
Any links or suggestions would be helpful. Thanks!
EDIT:
Okay, so I tried again with a -O3 flag, and now the results are extremely different! std::complex<float>::operator+ is more than twice as fast as gsl_complex_add, while gsl_complex_mul is about 1.25 times as fast as std::complex<float>::operator*. If I use double, gsl_complex_add is about 30% faster than std::complex<double>::operator+ while std::complex<double>::operator* is about 10% faster than gsl_complex_mul. I only need float-level precision, but I've heard that double is faster (and memory is not an issue for me)! So now I'm really confused!
Turn on optimisations.
Any library or set of functions that you link with will be compiled WITH optimisation (unless the names of the developers are Kermit, Swedish Chef, Miss Piggy (project manager) and Cookie Monster (tester) - in other words, unless the development team is a bunch of Muppets).
Since std::complex uses templates, it is compiled with whatever compiler settings you give, so without optimisation flags the code will be unoptimized. Your question is really "Why is function X faster than function Y that does the same thing, when function X is compiled with optimisation and Y is compiled without?" - which should really be obvious to answer: "Optimisation works nearly all of the time!" (If it didn't work most of the time, compiler developers would have a MUCH easier time.)
Edit: So my above point has just been proven. Note that since templates can inline the code, it is often more efficient than an external library (because the compiler can just insert the instructions straight into the flow, rather than calling out to another function).
As to float vs. double, the only time that float is slower than double is if there is ONLY double hardware available, with two functions added to "shorten" and "lengthen" between float and double. I'm not aware of any such hardware. double has more bits, so it SHOULD take longer.
Edit2:
When it comes to choosing "one solution over another", there are so many factors. Performance is one (and in some cases, the most important, in other cases not). Other aspects are "ease of use", "availability", "fit for the project", etc.
If you look ONLY at performance, you can sometimes run simple benchmarks to determine that one solution is better or worse than another. But for complex libraries [not "real & imaginary" type complex numbers, but rather "complicated" ones], there are sometimes optimisations to deal with large amounts of data; if you use a less sophisticated solution, the "large data" case will not achieve the same performance, because less effort has been spent on solving the "big data" type problems. So if you have a "simple" benchmark that does some basic calculations on a small set of data, and you are in reality going to run much bigger datasets, the small benchmark MAY not reflect reality.
And there is no way that I, or anyone else, can tell you which solution will give you the best performance on YOUR system with YOUR datasets, unless we have access to your datasets, know exactly which calculations you are performing (that is, pretty much have your code), and have experience running that with both "packages".
And going on to the rest of the criteria ("ease of use", etc.), those are much more "personal opinion" based, so they wouldn't be a good fit for an SO question in the first place.
The answer depends not only on the optimization flags, but also on the compiler used to compile the GSL library and your particular code. Example: if you compile gsl with gcc and your program with icc, you may see a (significant) difference (I have done this test with std::pow vs gsl_pow). Also, the standard makefile generated by ./configure does not compile GSL with aggressive floating-point optimizations (for example, it does not pass gcc's -ffast-math flag), because some GSL routines (the differential equation solver, for example) fail their stringent accuracy tests when those optimizations are present.
One of the great points about GSL is the modularity of the library. If you don't need double accuracy, you can compile gsl_complex.h, gsl_complex_math.h and math.c separately with aggressive floating-point optimizations (however, you need to delete the line #include <config.h> in math.c). Another strategy is to compile a separate version of the whole library with aggressive floating-point optimizations and test whether accuracy is an issue for your particular problem (that is my favorite approach).
EDIT: I forgot to mention that gsl_complex.h also has a float version of gsl_complex:
typedef struct
{
    float dat[2];
} gsl_complex_float;
I'm implementing a compression algorithm. The thing is, it is taking a second for a 20 KiB file, and that's not acceptable. I think it's slow because of the calculations.
I need suggestions on how to make it faster. I already have some tips, like shifting bits instead of multiplying, but I really want to be sure which changes actually help, given the complexity of the program. I also welcome suggestions concerning compiler options; I've heard there is a way to make the program do faster mathematical calculations.
Common operations are:
the pow(...) function from the math library
large number % 2
multiplication of large numbers
Edit: the program has no floating point numbers
The question of how to make things faster should not be asked here of other people, but rather, in your environment, of a profiler. Use the profiler to determine where most of the time is spent; that will hint at which operations need to be improved, and if you then don't know how to improve them, ask about those specific operations. It is almost impossible to say what you need to change without knowing your original code, and the question does not provide enough information:
pow(...) function: what are the arguments to the function? Is the exponent fixed? How much precision do you need? Can you replace the function with something that yields a similar result?
large number: how large is large? What is "number" in this context - integers? Floating point?
Your question is very broad; without enough information to give you concrete advice, we have to make do with a general roadmap.
What platform, what compiler? What is "large number"? What have you done already, what do you know about optimization?
Test a release build with optimization (/Ox /LTCG in Visual C++, -O3 IIRC for gcc)
Measure where time is spent - disk access, or your actual compression routine?
Is there a better algorithm, and code flow? The fastest operation is the one not executed.
for 20K files, memory working set should not be an issue (unless your compression requires large data structures), so code optimization is indeed the next step
a modern compiler already implements a lot of optimizations, e.g. replacing a division by a power-of-two constant with a bit shift.
pow is very slow for native integers
if your code is well written, you may try to post it; maybe someone's up to the challenge.
Hints:
1) Modulo 2 depends only on the last bit, so it can be computed with a single bit test (see the sketch below).
2) Power functions can be implemented in O(log n) time, where n is the power (the math library should be fast enough, though). Also, for fast powers you may check this out.
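A minimal sketch of both hints, assuming the numbers fit in a machine word (for genuinely large numbers the same ideas apply inside whatever bignum representation you use):

    #include <cstdint>

    // Hint 1: n % 2 only looks at the last bit; n & 1 says so directly
    // (compilers do this transformation anyway, but it documents intent).
    bool is_odd(std::uint64_t n) { return (n & 1) != 0; }

    // Hint 2: exponentiation by squaring -- O(log exp) multiplications
    // instead of a call to pow(), which goes through floating point.
    // Overflow is not handled; this is a sketch.
    std::uint64_t ipow(std::uint64_t base, unsigned exp)
    {
        std::uint64_t result = 1;
        while (exp != 0) {
            if (exp & 1) result *= base;  // odd exponent: fold in one factor
            base *= base;                 // square the base
            exp >>= 1;                    // halve the exponent
        }
        return result;
    }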
If nothing works, just check if there exists some fast algorithm.
I'm coding a game, and fast calculations in the render code are very important.
How can I find out the speed of individual operations?
For example, how do I know whether multiplication is faster than sqrt, etc.? Or do I have to run tests and measure the time?
The programming language is C++. Thanks.
This kind of micro-optimisation is just the thing to waste your time on for minimal gain.
Use a profiler and start by improving your own algorithms and code wherever the profiler tells you that the game is spending most of its time.
Note that in some cases you may have to overhaul the whole software - or a major part of it - in order to implement a more efficient design. In that case the profiler results can be misleading to the inexperienced; e.g. optimising a complex computation may yield minimal gain compared to caching its result once and for all.
See also this somewhat related thread.
Determining the speed of a particular operation is often known as profiling. The best solution for profiling an operation is to use a profiler. Visual Studio has a good profiler; Linux has gprof. If your compiler doesn't come with a profiler, it might be worthwhile purchasing one that does, if you will be profiling your code often.
If you have to get by without a professional profiler, you can usually embed your own into your program.
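For example, a minimal embedded timer using only the standard library (the ScopedTimer name and output format are my own invention, not any particular profiler's API):

    #include <chrono>
    #include <cstdio>

    // Times the enclosing scope and prints the result on destruction.
    // Coarse, but enough for "where does the time go" checks.
    struct ScopedTimer {
        const char* label;
        std::chrono::steady_clock::time_point start =
            std::chrono::steady_clock::now();
        explicit ScopedTimer(const char* l) : label(l) {}
        ~ScopedTimer() {
            auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::steady_clock::now() - start).count();
            std::printf("%s: %lld us\n", label, static_cast<long long>(us));
        }
    };

    // Usage: { ScopedTimer t("render"); render_frame(); }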
Check this out for the code of some simple profilers.
Your best bet is to use a tool like AQTime and do a profiling run. Then you will know where to spend your time optimizing. Doing it prematurely or based on guesswork likely won't get you much, and will just complicate your code or break something. The best thing is to take any floating-point calculations (especially sin, cos and the like, and sqrt) out of any loops if you can.
I once had something like this:
for (int i = 0; i < nc; ++i)
    for (int j = 0; j < nc; ++j)
        aij[i][j] = std::sqrt(a[i] * b[j]);
which calculates nc*nc square roots. But since sqrt(a*b) equals sqrt(a)*sqrt(b), you can precompute the square roots of all the a's and b's beforehand, after which the loop becomes the one shown below. So instead of nc*nc square roots, you have 2*nc square roots.
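The precomputation step itself is just the following (assuming asqrt and bsqrt are arrays of length nc):

    // 2*nc square roots, done once, outside the nested loops:
    for (int i = 0; i < nc; ++i) {
        asqrt[i] = std::sqrt(a[i]);
        bsqrt[i] = std::sqrt(b[i]);
    }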
for (int i = 0; i < nc; ++i)
    for (int j = 0; j < nc; ++j)
        aij[i][j] = asqrt[i] * bsqrt[j];
The question you are asking is highly dependent on the platform you are developing for at the hardware level. Not only will there be variation between different chipsets (Intel / AMD) but there will also be variations on the platform (I suspect that the iPhone doesn't have as many instructions for doing certain things quicker).
You state in your question that you are talking about 'render code'. The rules change massively if you're talking about code that will actually run on the GPU (shader code) instead of the CPU.
As @thkala states, I really wouldn't worry about this before you start. I've found it not only easier but quicker to code it first in a way that works, and then (only if it needs improving) rewrite the bits that profiling shows to be slow. Better algorithms will usually provide better performance than micro-optimising specific functions.
In the game(s) that we're developing for the iPhone, the only things I've kept in mind are that big math operations (sqrt) are slow (not basic maths), and that loops that run every frame can quickly eat up CPU. Keeping that in mind, we've hardly had to optimise any code - it's all running at 60fps anyway - so I'm glad I didn't worry about it at the start.
I have recently read an article about fast sqrt calculation. Therefore, I have decided to ask the SO community and its experts to help me find out which STL algorithms or mathematical calculations can be implemented faster with programming hacks.
It would be great if you could give examples or links.
Thanks in advance.
System library developers have more concerns than just performance in mind:
Correctness and standards compliance: Critical!
General use: No optimisations are introduced, unless they benefit the majority of users.
Maintainability: Good hand-written assembly code can be faster, but you don't see much of it. Why?
Portability: Decent libraries should be portable to more than just Windows/x86/32bit.
Many optimisation hacks that you see around violate one or more of the requirements above.
In addition, optimisations that will be useless or even break when the next generation CPU comes around the corner are not a welcome thing.
If you don't have profiler evidence on it being really useful, don't bother optimising the system libraries. If you do, work on your own algorithms and code first, anyway...
EDIT:
I should also mention a couple of other all-encompassing concerns:
The cost/effort to profit/result ratio: Optimisations are an investment. Some of them are seemingly-impressive bubbles. Others are deeper and more effective in the long run. Their benefits must always be considered in relation to the cost of developing and maintaining them.
The marketing people: No matter what you think, you'll end up doing whatever they want - or think they want.
Probably all of them can be made faster for a specific problem domain.
Now the real question is, which ones should you hack to make faster? None, until the profiler tells you to.
Several of the algorithms in <algorithm> can be optimized for vector<bool>::[const_]iterator. These include:
find
count
fill
fill_n
copy
copy_backward
move // C++0x
move_backward // C++0x
swap_ranges
rotate
equal
I've probably missed some. But all of the above algorithms can be optimized to work on many bits at a time, instead of just one bit at a time (as a naive implementation would).
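To illustrate the bit-parallel idea behind, say, an optimized count, here is a sketch that works on a raw array of 64-bit words; a real library implementation would do the same thing on vector<bool>'s internal storage, which the standard does not expose portably:

    #include <bit>        // std::popcount (C++20)
    #include <cstddef>
    #include <cstdint>

    // Count set bits 64 at a time instead of one at a time. On most
    // CPUs std::popcount compiles to a single instruction, so each
    // iteration does the work of 64 naive iterations.
    std::size_t count_set_bits(const std::uint64_t* words, std::size_t nwords)
    {
        std::size_t total = 0;
        for (std::size_t i = 0; i < nwords; ++i)
            total += std::popcount(words[i]);
        return total;
    }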
This is an optimization that I suspect is sorely missing from most STL implementations. It is not missing from this one:
http://libcxx.llvm.org/
This is where you really need to listen to project managers and MBAs. What you're suggesting is re-implementing parts of the STL and/or the standard C library. There is an associated cost in implementation time and maintenance burden, so you shouldn't do it unless you really, genuinely need to, as John points out. The rule is simple: is the calculation you're doing slowing you down (i.e. are you CPU bound)? If not, don't create your own implementation just for the sake of it.
Now, if you're really interested in fast maths, there are a few places you can start. The GNU multiple-precision library (GMP) implements many algorithms from modern computer arithmetic and seminumerical algorithms, all about doing maths on arbitrary-precision integers and floats insanely fast. The people who write it optimise in assembly per build platform - it is about as fast as you can get in single-core mode. This is the most general case of optimised maths I can think of, i.e. one that isn't specific to a certain domain.
Bringing my first and second paragraphs together with what thkala has said: consider that GMP/MPIR have optimised assembly versions per CPU architecture and OS they support. Really. It's a big job, but it is what makes those libraries so fast on the specific subset of problems they target.
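As a tiny illustration of the kind of work GMP makes cheap, using its C++ interface (the link flags are typically -lgmpxx -lgmp, but check your system):

    #include <gmpxx.h>    // GNU MP C++ interface
    #include <iostream>

    int main()
    {
        mpz_class result;
        // 3^2000 computed exactly in arbitrary precision; GMP's
        // hand-tuned kernels make this far faster than a naive bignum.
        mpz_ui_pow_ui(result.get_mpz_t(), 3, 2000);
        std::cout << "3^2000 has " << result.get_str().size()
                  << " decimal digits\n";
    }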
Sometimes domain-specific enhancements can be made. This is about understanding the problem in question. For example, when doing finite-field arithmetic in Rijndael's field, you can, based on the knowledge that the field has characteristic 2 and that its elements are polynomials with 8 binary coefficients, store your elements in a uint8_t and implement addition/subtraction as xor operations. How does this work? If you add or subtract two elements of the field, each coefficient is either zero or one; if both are zero or both are one, the result is zero, and if they differ, the result is one. That is addition modulo 2, and term by term it is equivalent to xor across an 8-bit binary string, where each bit represents a term in the polynomial. Multiplication is also relatively efficient. You can bet that Rijndael was designed to take advantage of this kind of result.
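A sketch of that arithmetic, assuming AES's reduction polynomial x^8 + x^4 + x^3 + x + 1 (which appears as the byte 0x1B once the leading term has been shifted out):

    #include <cstdint>

    // Addition in GF(2^8): coefficients add modulo 2, which is xor.
    std::uint8_t gf_add(std::uint8_t a, std::uint8_t b) { return a ^ b; }

    // Multiplication modulo the AES polynomial, shift-and-xor style.
    std::uint8_t gf_mul(std::uint8_t a, std::uint8_t b)
    {
        std::uint8_t p = 0;
        while (b != 0) {
            if (b & 1) p ^= a;                // low term of b set: add a
            bool overflow = (a & 0x80) != 0;  // shifting past degree 7?
            a <<= 1;
            if (overflow) a ^= 0x1B;          // reduce by the polynomial
            b >>= 1;
        }
        return p;
    }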
That's a very specific result. It depends entirely on what you're doing to make things efficient. I can't imagine many STL functions being purely optimised for CPU speed, because amongst other things the STL provides collections via templates (which are about memory), file access (which is about storage), exception handling, and so on. In short, being really fast covers only a narrow subset of what the STL does and aims to achieve. Also, note that optimisation has different aspects: if your app is heavy on IO, you are IO bound, and a massively efficient square-root calculation isn't really helpful, since "slowness" really means waiting on the disk/OS/your file-parsing routine.
In short, you as a developer of an STL library are trying to build an "all round" library for many different use cases.
But, since these things are always interesting, you might well be interested in bit twiddling hacks. I can't remember where I saw that, but I've definitely stolen that link from somebody else on here.
Almost none. The standard library is designed the way it is for a reason.
Taking sqrt, which you mention as an example, the standard library version is written to be as fast as possible, without sacrificing numerical accuracy or portability.
The article you mention is really beyond useless. There are some good articles floating around the 'net, describing more efficient ways to implement square roots. But this article isn't among them (it doesn't even measure whether the described algorithms are faster!) Carmack's trick is slower than std::sqrt on a modern CPU, as well as being less accurate.
It was used in a game something like 12 years ago, when CPUs had very different performance characteristics. It was faster then, but CPUs have changed, and today it's both slower and less accurate than the CPU's built-in sqrt instruction.
You can implement a square-root function which is faster than std::sqrt without losing accuracy, but then you lose portability, as it'll rely on CPU features not present on older CPUs.
Speed, accuracy, portability: choose any two. The standard library tries to balance all three, which means that the speed isn't as good as it could be if you were willing to sacrifice accuracy or portability, and accuracy is good, but not as good as it could be if you were willing to sacrifice speed, and so on.
In general, forget any notion of optimizing the standard library. The question you should be asking is whether you can write more specialized code.
The standard library has to cover every case. If you don't need that, you might be able to speed up the cases that you do need. But then it is no longer a suitable replacement for the standard library.
Now, there are no doubt parts of the standard library that could be optimized. The C++ IOStreams library in particular comes to mind. It is often naively, and very inefficiently, implemented. The C++ committee's technical report on C++ performance has an entire chapter dedicated to exploring how IOStreams could be implemented to be faster.
But that's I/O, where performance is often considered to be "unimportant".
For the rest of the standard library, you're unlikely to find much room for optimization.