I'm writing a program that depends a lot on complex additions and multiplications. I wanted to know whether I should use gsl_complex or std::complex.
I don't seem to find a comparison online of how much better GSL complex arithmetic is as compared to std::complex. A rudimentary google search didn't help me find a benchmarks page for GSL complex either.
I wrote a 20-line program that generates two random arrays of complex numbers (1e7 of them) and then checked how long addition and multiplication took using clock() from <ctime>. Using this method (without compiler optimisation) I got to know that gsl_complex_add and gsl_complex_mul are almost twice as fast as std::complex<double>'s + and * respectively. But I've never done this sort of thing before, so is this even the way you check which is faster?
Any links or suggestions would be helpful. Thanks!
EDIT:
Okay, so I tried again with a -O3 flag, and now the results are extremely different! std::complex<float>::operator+ is more than twice as fast as gsl_complex_add, while gsl_complex_mul is about 1.25 times as fast as std::complex<float>::operator*. If I use double, gsl_complex_add is about 30% faster than std::complex<double>::operator+ while std::complex<double>::operator* is about 10% faster than gsl_complex_mul. I only need float-level precision, but I've heard that double is faster (and memory is not an issue for me)! So now I'm really confused!
Turn on optimisations.
Any library or set of functions that you link with will be compiled WITH optimisation (unless the names of the developer are Kermit, Swedish Chef, Miss Peggy (project manager) and Cookie Monster (tester) - in other words, the development team is a bunch of Muppets).
Since std::complex uses templates, it is compiled by the compiler settings you give, so the code will be unoptimized. So your question is really "Why is function X faster than function Y that does the same thing, when function X is compiled with optimisation and Y is compiled without optimisation?" - which should really be obvious to answer: "Optimisation works nearly all of the time!" (If optimisation wasn't working most of the time, compiler developers would have a MUCH easier time)
Edit: So my above point has just been proven. Note that since templates can inline the code, it is often more efficient than an external library (because the compiler can just insert the instructions straight into the flow, rather than calling out to another function).
As to float vs. double, the only time that float is slower than double is if there is ONLY double hardware available, with two functions added to "shorten" and "lengthen" between float and double. I'm not aware of any such hardware. double has more bits, so it SHOULD take longer.
Edit2:
When it comes to choosing "one solution over another", there are so many factors. Performance is one (and in some cases, the most important, in other cases not). Other aspects are "ease of use", "availability", "fit for the project", etc.
If you look at ONLY performance, you can sometimes run simple benchmarks to determine that one solution is better or worse than another, but for complex libraries [not "real&imaginary" type complex numbers, but rather "complicated"], there are sometimes optimisations to deal with large amounts of data, where if you use a less sophisticated solution, the "large data" will not achieve the same performance, because less effort has been spent on solving the "big data" type problems. So, if you have a "simple" benchmark that does some basic calculations on a small set of data, and you are, in reality, going to run some much bigger datasets, the small benchmark MAY not reflect reality.
And there is no way that I, or anyone else, can tell you which solution will give you the best performance on YOUR system with YOUR datasets, unless we have access to your datasets, know exactly which calculations you are performance (that is, pretty much have your code), and have experience with running that with both "packages".
And going on to the rest of the criteria ("ease of use", etc), those are much more "personal opinion" based, so wouldn't be a good fit for an SO question in the first place.
This answer depends not only on the optimization flags, but also on the compiler used to compile GSL library and your particular code. Example: if you compile gsl with gcc and your program with icc, then you may see a (significant) difference (I have done this test with std::pow vs gsl_pow). Also, the standard makefile generated by ./configure does not compile GSL with aggressive float point optimizations (example: it does not include fast-math flag in gcc) because some GSL routines (differential equation solver for example) fail their stringent accuracy tests when these optimizations are present.
One of the great points about GSL is the modularity of the library. If you don't need double accuracy, then you can compile gsl_complex.h, gsl_complex_math.h and math.c separately with aggressive float number optimizations (however you need to delete the line #include <config.h> in math.c). Another strategy is to compile a separate version of the whole library with aggressive float number optimizations and test if accuracy is not an issue for your particular problem (that is my favorite approach).
EDIT: I forgot to mention that gsl_complex.h also has a float version of gsl_complex
typedef struct
{
float dat[2];
}
gsl_complex_float;
Related
I'm planning on writing a game with C++, and it will be extremely CPU-intensive (pathfinding,genetic algorithms, neural networks, ...)
So I've been thinking about how to tackle this situation best so that it would run smoothly.
(let this top section of this question be side information, I don't want it to restrict the main question, but it would be nice if you could give me side notes as well)
Is it worth it to learn how to work with ASM, so I can make ASM calls in C++,
can it give me a significant/notable performance advantage?
In what situations should I use it?
Almost never:
You only want to be using it once you've profiled your C++ code and have identified a particular section as a bottleneck.
And even then, you only want to do it once you've exhausted all C++ optimization options.
And even then, you only want to be using ASM for tight, inner loops.
And even then, it takes quite a lot of effort and skill to beat a C++ compiler on a modern platform.
If your not an experienced assembly programmer, I doubt you will be able to optimize assembly code better than your compiler.
Also note that assembly is not portable. If you decide to go this way, you will have to write different assembly for all the architectures you decide to support.
Short answer: it depends, most likely you won't need it.
Don't start optimizing prematurely. Write code that is also easy to read and to modify. Separate logical sections into modules. Write something that is easy to extend.
Do some profiling.
You can't tell where your bottlenecks are unless you profile your code. 99% of the time you won't get that much performance gain by writing asm. There's a high chance you might even worsen your performance. Optimizers nowadays are very good at what they do. If you do have a bottleneck, it will most probably be because of some poorly chosen algorithm or at least something that can be remedied at a high-level.
My suggestion is, even if you do learn asm, which is a good thing, don't do it just so you can optimize.
Profile profile profile....
A legitimate use case for going low-level (although sometimes a compiler can infer it for you) is to make use of SIMD instructions such as SSE. I would assume that at least some of the algorithms you mention will benefit from parallel processing.
However, you don't need to write actual assembly, instead you can simply use intrinsic functions. See, e.g. this.
Don't get ahead of yourself.
I've posted a sourceforge project showing how a simulation program was massively speeded up (over 700x).
This was not done by assuming in advance what needed to be made fast.
It was done by "profiling", which I put in quotes because the method I use is not to employ a profiler.
Rather I rely on random pausing, a method known and used to good effect by some programmers.
It proceeds through a series of iterations.
In each iteration a large source of time-consumption is identified and fixed, resulting in a certain speedup ratio.
As you proceed through multiple iterations, these speedup ratios multiply together (like compound interest).
That's how you get major speedup.
If, and only if, you get to a point where some code is taking a large fraction of time, and it doesn't contain any function calls, and you think you can write assembly code better than the compiler does, then go for it.
P.S. If you're wondering, the difference between using a profiler and random pausing is that profilers look for "bottlenecks", on the assumption that those are localized things. They look for routines or lines of code that are responsible for a large percent of overall time.
What they miss is problems that are diffuse.
For example, you could have 100 routines, each taking 1% of time.
That is, no bottlenecks.
However, there could be an activity being done within many or all of those routines, accounting for 1/3 of the time, that could be done better or not at all.
Random pausing will see that activity with a small number of samples, because you don't summarize, you examine the samples.
In other words, if you took 9 samples, on average you would notice the activity on 3 of them.
That tells you it's big.
So you can fix it and get your 3/2 speedup ratio.
"To understand recursion, you must first understand recursion." That quote comes to mind when I consider my response to your question, which is "until you understand when to use assembly, you should never use assembly." After you have completely implemented your soution, extensively profiled its performance and determined precise bottlenecks, and experimented with several alternative solutions, then you can begin to consider using assembly. If you code a single line of assembly before you have a working and extensively profiled program, you have made a mistake.
If you need to ask than you don't need it.
I'm implementing a compression algorithm. Thing is, it is taking a second for a 20 Kib files, so that's not acceptable. I think it's slow because of the calculations.
I need suggestions on how to make it faster. I have some tips already, like shifting bits instead of multiplying, but I really want to be sure of which changes actually help because of the complexity of the program. I also accept suggestions concerning compiler options, I've heard there is a way to make the program do faster mathematical calculations.
Common operations are:
pow(...) function of math library
large number % 2
large number multiplying
Edit: the program has no floating point numbers
The question of how to make things faster should not be asked here to other people, but rather in your environment to a profiler. Use the profiler to determine where most of the time is spent, and that will hint you into which operations need to be improved, then if you don't know how to do it, ask about specific operations. It is almost impossible to say what you need to change without knowing what your original code is, and the question does not provide enough information: pow(...) function: what are the arguments to the function, is the exponent fixed? how much precision do you need? can you change the function for something that will yield a similar result? large number: how large is large in large number? what is number in this context? integers? floating point?
Your question is very broad, without enough informaiton to give you concrete advise, we have to do with a general roadmap.
What platform, what compiler? What is "large number"? What have you done already, what do you know about optimization?
Test a release build with optimization (/Ox /LTCG in Visual C++, -O3 IIRC for gcc)
Measure where time is spent - disk access, or your actual compression routine?
Is there a better algorithm, and code flow? The fastest operation is the one not executed.
for 20K files, memory working set should not be an issue (unless your copmpression requries large data structures), so so code optimization are the next step indeed
a modern compiler implements a lot of optimizations already, e.g replacing a division by a power-of-two constant with a bit shift.
pow is very slow for native integers
if your code is well written, you may try to post it, maybe someone's up to the challenge.
Hints :-
1) modulo 2 works only on the last bit.
2) power functions can be implemented in logn time, where n is the power. (Math library should be fast enough though). Also for fast power you may check this out
If nothing works, just check if there exists some fast algorithm.
I have a very large program which I have been compiling under visual studio (v6 then migrated to 2008). I need the executable to run as fast as possible. The program spends most of its time processing integers of various sizes and does very little IO.
Obviously I will select maximum optimization, but it seems that there are a variety of things that can be done which don't come under the heading of optimization which do still affect the speed of the executable. For example selecting the __fastcall calling convention or setting structure member alignment to a large number.
So my question is: Are there other compiler/linker options I should be using to make the program faster which are not controlled from the "optimization" page of the "properties" dialog.
EDIT: I already make extensive use of profilers.
Another optimization option to consider is optimizing for size. Sometimes size-optimized code can run faster than speed-optimized code due to better cache locality.
Also, beyond optimization operations, run the code under a profiler and see where the bottlenecks are. Time spent with a good profiler can reap major dividends in performance (especially it if gives feedback on the cache-friendliness of your code).
And ultimately, you'll probably never know what "as fast as possible" is. You'll eventually need to settle for "this is fast enough for our purposes".
Profile-guided optimization can result in a large speedup. My application runs about 30% faster with a PGO build than a normal optimized build. Basically, you run your application once and let Visual Studio profile it, and then it is built again with optimization based on the data collected.
1) Reduce aliasing by using __restrict.
2) Help the compiler in common subexpression elimination / dead code elimination by using __pure.
3) An introduction to SSE/SIMD can be found here and here. The internet isn't exactly overflowing with articles about the topic, but there's enough. For a reference list of intrinsics, you can search MSDN for 'compiler intrinsics'.
4) For 'macro parallelization', you can try OpenMP. It's a compiler standard for easy task parallelization -- essentially, you tell the compiler using a handful of #pragmas that certain sections of the code are reentrant, and the compiler creates the threads for you automagically.
5) I second interjay's point that PGO can be pretty helpful. And unlike #3 and #4, it's almost effortless to add in.
You're asking which compiler options can help you speed up your program, but here's some general optimisation tips:
1) Ensure your algorithms are appropriate for the job. No amount of fiddling with compiler options will help you if you write an O(shit squared) algorithm.
2) There's no hard and fast rules for compiler options. Sometimes optimise for speed, sometimes optimise for size, and make sure you time the differences!
3) Understand the platform you are working on. Understand how the caches for that CPU operate, and write code that specifically takes advantage of the hardware. Make sure you're not following pointers everywhere to get access to data which will thrash the cache. Understand the SIMD operations available to you and use the intrinsics rather than writing assembly. Only write assembly if the compiler is definitely not generating the right code (i.e. writing to uncached memory in bad ways). Make sure you use __restrict on pointers that will not alias. Some platforms prefer you to pass vector variables by value rather than by reference as they can sit in registers - I could go on with this but this should be enough to point you in the right direction!
Hope this helps,
-Tom
Forget micro-optimization such as what you are describing. Run your application through a profiler (there is one included in Visual Studio, at least in some editions). The profiler will tell you where your application is spending its time.
Micro-optimization will rarely give you more than a few percentage points increase in performance. To get a really big boost, you need to identify areas in your code where inefficient algorithms and/or data structures are being used. Focus on those, for example by changing algorithms. The profiler will help identify these problem areas.
Check which /precision mode you are using. Each one generates quite different code and you need to choose based on what accuracy is required in your app. Our code needs precision (geometry, graphics code) but we still use /fp:fast (C/C++ -> Code generation options).
Also make sure you have /arch:SSE2, assuming your deployment covers processors that all support SSE2. This will result is quite a big difference in performance, as compile will use very few cycles. Details are nicely covered in the blog SomeAssemblyRequired
Since you are already profiling, I would suggest loop unrolling if it is not happening. I have seen VS2008 not doing it more frequently (templates, references etc..)
Use __forceinline in hotspots if applicable.
Change hotspots of your code to use SSE2 etc as your app seems to be compute intense.
You should always address your algorithm and optimise that before relying on compiler optimisations to get you significant improvements in most cases.
Also you can throw hardware at the problem. Your PC may already have the necessary hardware lying around mostly unused: the GPU! One way of improving performance of some types of computationally expensive processing is to execute it on the GPU. This is hardware specific but NVIDIA provide an API for exactly that: CUDA. Using the GPU is likely to get you far greater improvement than using the CPU.
I agree with what everyone has said about profiling. However you mention "integers of various sizes". If you are doing much arithmetic with mismatched integers a lot of time can be wasted in changing sizes, shorts to ints for example, when the expressions are evaluated.
I'll throw in one more thing too. Probably the most significant optimisation is in choosing and implementing the best algorithm.
You have three ways to speed up your application:
Better algorithm - you've not specified the algorithm or the data types (is there an upper limit to integer size?) or what output you want.
Macro parallelisation - split the task into chunks and give each chunk to a separate CPU, so, on a two core cpu divide the integer set into two sets and give half to each cpu. This depends on the algorithm you're using - not all algorithms can be processed like this.
Micro parallelisation - this is like the above but uses SIMD. You can combine this with point 2 as well.
You say the program is very large. That tells me it probably has many classes in a hierarchy.
My experience with that kind of program is that, while you are probably assuming that the basic structure is just about right, and to get better speed you need to worry about low-level optimization, chances are very good that there are large opportunities for optimization that are not of the low-level kind.
Unless the program has already been tuned aggressively, there may be room for massive speedup in the form of mid-stack operations that can be done differently. These are usually very innocent-looking and would never grab your attention. They are not cases of "improve the algorithm". They are usually cases of "good design" that just happen to be on the critical path.
Unfortunately, you cannot rely on profilers to find these things, because they are not designed to look for them.
This is an example of what I'm talking about.
I need to write a program (a project for university) that solves (approx) an NP-hard problem.
It is a variation of Linear ordering problems.
In general, I will have very large inputs (as Graphs) and will try to find the best solution
(based on a function that will 'rate' each solution)
Will there be a difference if I write this in C-style code (one main, and functions)
or build a Solver class, create an instance and invoke a 'run' method from a main (similar to Java)
Also, there will be alot of floating point math going on in each iteration.
Thanks!
No.
The biggest performance gains/flaws will be on the algorithm you implement, and how much unneeded work you perform (Unneeded work could be everything from recalculating a previous value that could have been cached, to using too many malloc/free's vs using memory pools,
passing large immutable data by value instead of reference)
The biggest roadblock to optimal code is no longer the language (for properly compiled languages), but rather the programmer.
No, unless you are using virtual functions.
Edit: If you have a case where you need run-time dynamism, then yes, virtual functions are as fast or faster than a manually constructed if-else statement. However, if you drop in the virtual keyword in front of a method, but you don't actually need the polymorphism, then you will be paying an unnecessary overhead. The compiler won't optimize it away at compile time. I am just pointing this out because it's one of the features of C++ that breaks the 'zero-overhead principle` (quoting Stroustrup).
As a side note, since you mention heavy use of fp math:
The following gcc flags may help you speed things up (I'm sure there are equivalent ones for visual C++, but I don't use it): -mfpmath=sse, -ffast-math and -mrecip (The last two are 'slightly dangerous', meaning that they could give you weird results in edge cases in exchange for the speed. The first one reduces precision by a bit -- you have 64-bit doubles instead of 80-bit ones -- but this extra precision is often unneeded.) These flags would work equally well for C and C++ compilers.
Depending on your processor, you may also find that simulating true INFINITY with a large-but-not-infinite value gives you a good speed boost. This is because true INFINITY has to be handled as a special case by the processor.
Rule of thumb - do not optimize until you know what to optimize. So start with C++ and have some working prototype. Then profile it and rewrite bottle necks in assembly. But as others noted, chosen algorithm will have much greater impact than the language.
When speaking of performance, anything you can do in C can be done in C++.
For example, virtual methods are known to be “slow”, but if it's really a problem, you can still resort to C idioms.
C++ also brings templates, which lead to better performance than using void* for generic programming.
The Solver class will be constructed once, I take it, and the run method executed once... in that kind of environment, you won't see a difference. Instead, here are things to watch out for:
Memory management is hellishly expensive. If you need to do lots of little malloc()s, the operating system will eat your lunch. Make a determined effort to re-use whatever data structures you create if you know you'll be doing the same kind of thing again soon!
Instantiating classes generally means... allocating memory! Again, there's practically no cost for instantiating a handful of objects and re-using them. But beware of creating objects only to tear them down and rebuild them soon after!
Choose the right flavor of floating point for your architecture, insofar as the problem permits. It's possible that double will end up being faster than float, although it will need more memory. You should experiment to fine-tune this. Ideally, you'll use a #define or typedef to specify the type so you can change it easily in one place.
Integer calculations are probably faster than floating point. Depending on the numeric range of your data, you may also consider doing it with integers treated as fixed-point decimals. If you need 3 decimal places, you could use ints and just consider them "milli-somethings". You'll have to remember to shift decimals after division and multiplication... but no big deal. If you use any math functions beyond the basic arithmetic, of course, that would of course kill this possibility.
Since both are compiled, and the compilers now are very good at how to handle C++, I think the only problem would come from how well optimized your code is. I think it would be easier to write slower code in C++, but that depends on which style your model fits into best.
When it comes down to it, I doubt there will be any real difference, assuming both are well-written, any libraries you use, how well written they are, if you are measuring on the same computer.
Function call vs. member function call overhead is unlikely to be the limiting factor, compared to file input and the algorithm itself. C++ iostreams are not necessarily super high speed. C has 'restrict' if you're really optimizing, in C++ it's easier to inline function calls. Overall, C++ offers more options for organizing your code clearly, but if it's not a big program, or you're just going to write it in a similar manner whether it's C or C++, then the portability of C libraries becomes more important.
As long as you don't use any virtual functions etc. you won't note any considerable performance differences. Early C++ was compiled to C, so as long as you know the pinpoints where this creates any considerable overhead (such as with virtual functions) you can clearly calculate for the differences.
In addition I want to note that using C++ can give you a lot to gain if you use the STL and Boost Libraries. Especially the STL provides very efficient and proven implementations of the most important data structures and algorithms, so you can save a lot of development time.
Effectively it also depends on the compiler you will be using and how it will optimize the code.
first, writing in C++ doesn't imply using OOP, look at the STL algorithms.
second, C++ can be even slightly faster at runtime (the compilation times can be terrible compared to C, but that's because modern C++ tends to rely heavily on abstractions that tax the compiler).
edit: alright, see Bjarne Stroustrup's discussion of qsort and std::sort, and the article that FAQ mentions (Learning Standard C++ as a New Language), where he shows that C++-style code can be not only shorter and more readable (because of higher abstractions), but also somewhat faster.
Another aspect:
C++ templates can be an excellent tool to generate type-specific /
optimized code variations.
For example, C qsort requires a function call to the comparator, whereas std::sort can inline the functor passed. This can make a significant difference when compare and swap themselves are cheap.
Note that you could generate "custom qsorts" optimized for various types with a barrage of defines or a code generator, or by hand - you could do these optimizations in C, too, but at much higher cost.
(It's not a general weapon, templates help only in sepcific scenarios - usually a single algorithm applied to different data types or with differing small pieces of code injected.)
Good answers. I would put it like this:
Make the algorithm as efficient as possible in terms of its formal structure.
C++ will be as fast as C, except that it will tempt you to do dumb things, like constructing objects that you don't have to, so don't take the bait. Things like STL container classes and iterators may look like the latest-and-greatest thing, but they will kill you in a hotspot.
Even so, single-step it at the disassembly level. You should see it very directly working on your problem. If it is spending lots of cycles getting in and out of routines, try some in-lining (or macros). If it is wandering off into memory allocation and freeing, for much of the time, put a stop to that. If it's got inner loops where the loop overhead is a large percentage, try unrolling the loop.
That's how you can make it about as fast as possible.
I would go with C++ definitely. If you are careful about your design and avoid creating heavy objects inside hotspots you should not see any performance difference but the code is going to be much simpler to understand, maintain, and expand.
Use templates and classes judiciously. avoid unnecessary object creation by passing objects by reference. Avoid excessive memory allocation, if needed, allocate memory in advance of hotspots. Use restrict keyword on memory pointers to tell compiler whenever pointers overlap or not.
As far as optimization, pay careful attention to memory alignment. Assuming you are working on Intel processor, you can make use of vector instructions, provided you tell the compiler through pragma's about your memory alignment and aliased pointers. you can also use vector instructions directly via intrinsics.
you can also automatically create hotspot code using templates and let compiler optimize it out if you have things like short loops of different sizes. To find out performance and to drill down to your bottlenecks, Intel vtune or oprofile are extremely helpful.
hope that helps
I do some DSP coding, where it still pays off to go to assembly language sometimes. I'd say use C or C++, either one, and be prepared to go to assembly language when you need to, especially to exploit SIMD instructions.
I am writing an application where in a certain block I need to exponentiate reals around 3*500*500 times. When I use the exp(y*log(x)) algorithm, the program noticeably lags. It is significantly faster if I use another algorithm based on playing with data types, but that algorithm isn't very precise, although provides decent results for the simulation, and it's still not perfect in terms of speed.
Is there any precise exponentiation algorithm for real powers faster than exp(y*log(x))?
Thank you in advance.
If you need good accuracy, and you don't know anything about the distribution of bases (x values) a priori, then pow(x, y) is the best portable answer (on many -- not all -- platforms, this will be faster than exp(y*log(x)), and is also better behaved numerically). If you do know something about what ranges x and y can lie in, and with what distribution, that would be a big help for people trying to offer advice.
The usual way to do it faster while keeping good accuracy is to use a library routine designed to do many of these computations simultaneously for an array of x values and an array of y values. The catch is that such library implementations tend to cost money (like Intel's MKL) or be platform-specific (vvpowf in the Accelerate.framework on OS X, for example). I don't know much about MinGW, so someone else will need to tell you what's available there. The GSL may have something along these lines.
Depending on your algorithm (in particular if you have few to no additions), sometimes you can get away with working (at least partially) in log-space. You've probably already considered this, but if your intermediate representation is log_x and log_y then log(x^y) = exp(log_y) * log_x, which will be faster. If you can even be selective about it, then obviously computing log(x^y) as y * log_x is even cheaper. If you can avoid even just a few exponentiations, you may win a lot of performance. If there's any way to rewrite whatever loops you have to get the exponentiation operations outside of the innermost loop, that's a fairly certain performance win.