C++ Eigen Library - Quadratic Programming, Fixed vs Dynamic, Performance - c++

I am looking into doing some quadratic programming and have been comparing libraries. I have seen various Eigen variants of QuadProg++ (KDE Forums, Benjamin Stephens, StackOverflow posts). Just as a test, I forked wingsit's Eigen variant, available on GitHub, to implement compile-time-sized problems via templates so I could measure the performance difference.
I'm finding that I get worse performance in the templated case than in the dynamically sized (MatrixXd / VectorXd) case from wingsit's code. I know this is not a simple question, but can anyone see a reason why this might be?
Note: I do need to increase the problem size / iteration count, will post that once I can.
EDIT: I am using GCC 4.6.3 on Ubuntu 12.04. These are the flags that I am using (modified from wingsit's code):
CFLAGS = -O4 -Wall -msse2 -fopenmp # option for obj
LFLAGS = -O4 -Wall -msse2 -fopenmp # option for exe (-lefence ...)

The benefits of static-sized code generally diminish as the sizes grow. The typical benefits of static-sized code include (but are not limited to) the following:
Stack-based allocations, which are faster than heap allocations. However, at large sizes stack-based allocations are no longer feasible (stack overflow), or even beneficial from the point of view of prefetching and locality of reference.
Loop unrolling: when the compiler sees a small static-sized loop, it can simply unroll it, and possibly use SSE instructions. This doesn't apply at larger sizes.
In other words, for small sizes (up to maybe N=12 or so), static-sized code can be much better and faster than the equivalent dynamically-sized code, as long as the compiler is fairly aggressive about inlining and loop unrolling. But when sizes are larger, there is no point.
Additionally, there are a number of drawbacks to static-sized code:
No efficient move semantics / swapping / copy-on-write strategies, since such code is usually implemented with static arrays (in order to have the benefits mentioned above), which cannot simply be swapped (as in, swapping the internal pointer).
Larger executables, with functions that are spread out. Say you have one function template to multiply two matrices; a new compiled function (inlined or not) is generated for each size combination. Then you have some algorithm that does a lot of matrix multiplications, and each multiplication has to jump to (or execute inline) the specialization for that size combination. In the end you have a lot more code that needs to be cached, cache misses become much more frequent, and that can completely destroy the performance of your algorithm.
So the lesson to draw from this is to use static-sized vectors and matrices only for small things, like 2- to 6-dimensional vectors or matrices. Beyond that, it's preferable to go with dynamically-sized code (or maybe try static-sized code, but verify that it actually performs better, because that is not guaranteed). So I would advise you to reconsider using static-sized code for larger problems.
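As a rough illustration of the fixed-vs-dynamic distinction (a standalone sketch, not wingsit's code; the names and sizes here are made up), only the small fixed-size case is likely to benefit from stack storage and unrolling:

#include <Eigen/Dense>
#include <iostream>

// x^T G x -- the kind of term a QP solver evaluates over and over.
// The same template compiles for both fixed- and dynamic-sized arguments.
template <typename Mat, typename Vec>
double quad_form(const Mat& G, const Vec& x) {
    return x.dot(G * x);
}

int main() {
    // Fixed-size: dimensions known at compile time, storage on the stack,
    // small loops can be fully unrolled and vectorized.
    Eigen::Matrix<double, 6, 6> G_small = Eigen::Matrix<double, 6, 6>::Identity();
    Eigen::Matrix<double, 6, 1> x_small = Eigen::Matrix<double, 6, 1>::Ones();

    // Dynamic-size: dimensions chosen at run time, storage on the heap.
    // At n = 500 this is typically at least as fast as a fixed-size version
    // and does not risk overflowing the stack.
    Eigen::MatrixXd G_large = Eigen::MatrixXd::Identity(500, 500);
    Eigen::VectorXd x_large = Eigen::VectorXd::Ones(500);

    std::cout << quad_form(G_small, x_small) << " "
              << quad_form(G_large, x_large) << "\n";
}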

Related

Basic ways to speed up a simple Eigen program

I'm looking for the fastest way to do simple operations using Eigen. There are so many data structures available that it's hard to tell which is the fastest.
I've tried to predefine my data structures, but even then my code is being outperformed by similar Fortran code. I've guessed Eigen::Vector3d is the fastest for my needs (since it's predefined), but I could easily be wrong. Using -O3 optimization at compile time gave me a big boost, but I'm still running 4x slower than a Fortran implementation of the same code.
I make use of an 'Atom' structure, which is then stored in an 'atoms' vector defined by the following:
struct Atom {
std::string element;
//double x, y, z;
Eigen::Vector3d coordinate;
};
std::vector<Atom> atoms;
The slowest part of my code is the following:
distance = atoms[i].coordinate - atoms[j].coordinate;
distance_norm = distance.norm();
Is there a faster data structure I could use? Or is there a faster way to perform these basic operations?
As you pointed out in your comment, adding the -fno-math-errno compiler flag gives you a huge increase in speed. As to why that happens: your code snippet shows that you're doing a sqrt via distance_norm = distance.norm();.
This flag makes the compiler not set errno after each sqrt (that saves a write to a thread-local variable), which is faster and enables vectorization of any loop that does this repeatedly. The only disadvantage is that IEEE adherence is lost. See the gcc man page.
Another thing you might want to try is adding -march=native and adding -mfma if -march=native doesn't turn it on for you (I seem to remember that in some cases it wasn't turned on by native and had to be turned on by hand - check here for details). And as always with Eigen, you can disable bounds checking with -DNDEBUG.
SoA instead of AoS!!! If performance is actually a real problem, consider using a single 4xN matrix to store the positions (and have Atom keep the column index instead of the Eigen::Vector3d). It shouldn't matter too much in the small code snippet you showed, but depending on the rest of your code, may give you another huge increase in performance.
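For what it's worth, a minimal sketch of such an SoA layout (the System and col names are made up, and this uses a 3xN matrix rather than 4xN):

#include <Eigen/Dense>
#include <string>
#include <vector>

// Structure-of-arrays: all coordinates live in one contiguous 3 x N matrix;
// each Atom only remembers which column holds its coordinate.
struct Atom {
    std::string element;
    int col;                       // index into 'positions'
};

struct System {
    Eigen::Matrix3Xd positions;    // 3 x N, column j = coordinate of atom j
    std::vector<Atom> atoms;

    double distance(int i, int j) const {
        // One expression over two contiguous columns; keeps the coordinate
        // data dense in cache instead of scattered across Atom objects.
        return (positions.col(atoms[i].col) - positions.col(atoms[j].col)).norm();
    }
};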
Given you are ~4x off, it might be worth checking that you have enabled vectorization such as AVX or AVX2 at compile time. There are of course also SSE2 (~2x) and AVX512 (~8x) when dealing with doubles.
Either try another compiler like the Intel C++ compiler (free for academic and non-profit usage), or use other libraries like Intel MKL (far faster than your own code), other BLAS/LAPACK implementations for dense matrices, or PARDISO or SuperLU (not sure if it still exists) for sparse matrices.

Dynamic arrays vs. std::vector

I've written a small program to calculate prime numbers using the naive division algorithm. In order to improve the performance, I thought it should only check divisibility against previously detected primes less than or equal to the number's square root.
In order to do that, I need to keep track of the primes. I've implemented it using dynamic arrays. (e.g. Using new and delete). Should I use std::vector instead? Which is better in terms of performance? (Maintenance is not an issue.)
Any help would be appreciated. 😊
The ideal answer:
How should any of us know? It depends on your compiler, your OS, your architecture, your standard library implementation, the alignment of the planets...
Benchmark it. Possibly with this. (Haven't used it, but it seems simple enough to use.)
The practical answer:
Use std::vector. Every new and delete you make is a chance for a memory leak, or a double-delete, or otherwise forgetting to do something. std::vector basically does this under the hood anyway. You're more likely to get a sizeable performance boost by maxing out your optimization flags (if you're using gcc, try -Ofast and -march=native).
Also:
Maintenance is not an issue.
Doubt it. Trust me on this one. If nothing else, at least comment your code (but that's another can of worms).
For your purpose, vector might be better, so that you don't need to worry about memory management (e.g. growing your array and copying previous results) or about reserving too much memory to store your results.
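For reference, a minimal sketch of the std::vector version of the approach described in the question (the function name is made up); the vector grows as primes are found and is freed automatically:

#include <cstdint>
#include <vector>

// Collect all primes up to 'limit' by trial division against the primes
// found so far, stopping once p * p exceeds the candidate.
std::vector<std::uint64_t> primes_up_to(std::uint64_t limit) {
    std::vector<std::uint64_t> primes;
    for (std::uint64_t n = 2; n <= limit; ++n) {
        bool is_prime = true;
        for (std::uint64_t p : primes) {
            if (p * p > n) break;               // only need divisors <= sqrt(n)
            if (n % p == 0) { is_prime = false; break; }
        }
        if (is_prime) primes.push_back(n);
    }
    return primes;
}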

gsl_complex vs. std::complex performance

I'm writing a program that depends a lot on complex additions and multiplications. I wanted to know whether I should use gsl_complex or std::complex.
I don't seem to find a comparison online of how much better GSL complex arithmetic is as compared to std::complex. A rudimentary google search didn't help me find a benchmarks page for GSL complex either.
I wrote a 20-line program that generates two random arrays of complex numbers (1e7 of them) and then checked how long addition and multiplication took using clock() from <ctime>. Using this method (without compiler optimisation) I found that gsl_complex_add and gsl_complex_mul are almost twice as fast as std::complex<double>'s + and * respectively. But I've never done this sort of thing before, so is this even the right way to check which is faster?
Any links or suggestions would be helpful. Thanks!
EDIT:
Okay, so I tried again with a -O3 flag, and now the results are extremely different! std::complex<float>::operator+ is more than twice as fast as gsl_complex_add, while gsl_complex_mul is about 1.25 times as fast as std::complex<float>::operator*. If I use double, gsl_complex_add is about 30% faster than std::complex<double>::operator+ while std::complex<double>::operator* is about 10% faster than gsl_complex_mul. I only need float-level precision, but I've heard that double is faster (and memory is not an issue for me)! So now I'm really confused!
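For reference, a minimal sketch of the kind of harness described above, using std::chrono instead of clock() (this assumes the GSL headers are installed and you link with -lgsl -lgslcblas; compile with and without -O3 to see the effect discussed below):

#include <chrono>
#include <complex>
#include <cstdio>
#include <random>
#include <vector>
#include <gsl/gsl_complex.h>
#include <gsl/gsl_complex_math.h>

int main() {
    const std::size_t n = 10000000;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);

    std::vector<std::complex<double>> a(n), b(n), c(n);
    std::vector<gsl_complex> ga(n), gb(n), gc(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = {dist(rng), dist(rng)};
        b[i] = {dist(rng), dist(rng)};
        ga[i] = gsl_complex_rect(a[i].real(), a[i].imag());
        gb[i] = gsl_complex_rect(b[i].real(), b[i].imag());
    }

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] * b[i];
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) gc[i] = gsl_complex_mul(ga[i], gb[i]);
    auto t2 = std::chrono::steady_clock::now();

    std::printf("std::complex mul: %.3f s, gsl_complex_mul: %.3f s\n",
                std::chrono::duration<double>(t1 - t0).count(),
                std::chrono::duration<double>(t2 - t1).count());
    // Print something derived from the results so the optimizer cannot
    // remove the loops entirely.
    std::printf("checksum: %f %f\n", c[n / 2].real(), GSL_REAL(gc[n / 2]));
}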
Turn on optimisations.
Any library or set of functions that you link with will be compiled WITH optimisation (unless the developers are named Kermit, Swedish Chef, Miss Piggy (project manager) and Cookie Monster (tester) - in other words, unless the development team is a bunch of Muppets).
Since std::complex uses templates, it is compiled with whatever compiler settings you give, so your side of the comparison was unoptimized. So your question is really "Why is function X faster than function Y that does the same thing, when X is compiled with optimisation and Y is compiled without?" - which should really be obvious to answer: "Optimisation works nearly all of the time!" (If optimisation didn't work most of the time, compiler developers would have a MUCH easier time.)
Edit: So my point above has just been proven. Note that since templates allow the code to be inlined, they are often more efficient than an external library (because the compiler can insert the instructions straight into the flow, rather than calling out to another function).
As to float vs. double, the only time that float is slower than double is if there is ONLY double hardware available, with two functions added to "shorten" and "lengthen" between float and double. I'm not aware of any such hardware. double has more bits, so it SHOULD take longer.
Edit2:
When it comes to choosing "one solution over another", there are so many factors. Performance is one (and in some cases, the most important, in other cases not). Other aspects are "ease of use", "availability", "fit for the project", etc.
If you look at ONLY performance, you can sometimes run simple benchmarks to determine whether one solution is better or worse than another. But for complex libraries [not "real & imaginary" complex numbers, but rather "complicated" libraries], there are sometimes optimisations for dealing with large amounts of data; if you use a less sophisticated solution, the "large data" case will not achieve the same performance, because less effort has been spent on solving the "big data" type of problem. So if you have a "simple" benchmark that does some basic calculations on a small dataset, and you are in reality going to run much bigger datasets, the small benchmark MAY not reflect reality.
And there is no way that I, or anyone else, can tell you which solution will give you the best performance on YOUR system with YOUR datasets, unless we have access to your datasets, know exactly which calculations you are performing (that is, pretty much have your code), and have experience running that with both "packages".
And going on to the rest of the criteria ("ease of use", etc), those are much more "personal opinion" based, so wouldn't be a good fit for an SO question in the first place.
The answer depends not only on the optimization flags, but also on the compiler used to compile the GSL library and your particular code. Example: if you compile GSL with gcc and your program with icc, you may see a (significant) difference (I have done this test with std::pow vs gsl_pow). Also, the standard makefile generated by ./configure does not compile GSL with aggressive floating-point optimizations (for example, it does not include the -ffast-math flag for gcc), because some GSL routines (the differential-equation solver, for example) fail their stringent accuracy tests when these optimizations are present.
One of the great points about GSL is the modularity of the library. If you don't need double accuracy, you can compile gsl_complex.h, gsl_complex_math.h and math.c separately with aggressive floating-point optimizations (however, you need to delete the line #include <config.h> in math.c). Another strategy is to compile a separate version of the whole library with aggressive floating-point optimizations and test whether accuracy is an issue for your particular problem (that is my favourite approach).
EDIT: I forgot to mention that gsl_complex.h also has a float version of gsl_complex
typedef struct
{
    float dat[2];
} gsl_complex_float;

What is the optimization level (g++) you use while comparing two different algorithms written in C++?

I have two algorithms written in C++. As far as I know, it is conventional to compile with
-O0 -DNDEBUG (g++) when comparing the performance of two algorithms (asymptotically they are the same).
But I think that optimization level is unfair to one of them, because it uses the STL everywhere. The program that uses plain arrays is about five times faster than the STL-heavy one when compiled with -O0, but the difference largely disappears when I compile them with -O2 -DNDEBUG.
Is there any way to get the best out of the STL (I am taking a heavy performance hit in the vector [] operator) at optimization level -O0?
What optimization level (and possibly defines like -DNDEBUG) do you use when comparing two algorithms?
It would also be a great help if someone could give some idea of the trend in academic research on comparing the performance of algorithms written in C++.
OK, to isolate the problem of the optimization level, I am now using one algorithm but two different implementations.
I have changed one of the functions from raw pointers (int and boolean) to std::vector<int> and std::vector<bool>. With -O0 -DNDEBUG the times are 5.46 s (raw pointers) and 11.1 s (std::vector). With -O2 -DNDEBUG, the times are 2.02 s (raw pointers) and 2.21 s (std::vector). It is the same algorithm; one implementation uses 4 or 5 dynamic arrays of int and boolean, and the other uses std::vector<int> and std::vector<bool> instead. They are the same in every other respect.
You can see that at -O0 raw pointers are about twice as fast as std::vector, while at -O2 they are almost the same.
But I am really confused, because in academic fields, when they publish the running times of algorithms, they compile the programs with -O0.
Are there some compiler options I am missing?
It depends on what you want to optimize for.
Speed
I suggest using -O2 -DNDEBUG -ftree-vectorize, and if your code is designed to run specifically on x86 or x86_64, add -msse2. This will give you a broad idea of how it will perform with GIMPLE.
Size
I believe you should use -Os -fno-rtti -fno-exceptions -fomit-frame-pointer. This will minimize the size of the executable to a degree (assuming C++).
In both cases, an algorithm's speed is not compiler-dependent, but a compiler can drastically change the way the code behaves if it can "prove" the change is safe.
GCC detects 'common' code such as hand-coded min() and max() and turns them into one SSE instruction (on x86/x86_64, when -msse is set) or into a cmov when i686 is available (SSE has higher priority). GCC will also take the liberty of reordering loops, unrolling and inlining functions if it wants to, and even removing useless code.
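As a tiny illustration of what "common code" means here (whether you actually get minsd or cmov depends on your compiler version and flags, so treat this as a sketch):

// With optimization on and SSE available, GCC typically compiles these
// branches into a single minsd (double) or a cmov (int) rather than a jump.
inline double my_min(double a, double b) { return (a < b) ? a : b; }
inline int    my_min(int a, int b)       { return (a < b) ? a : b; }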
As for your latest edit:
You can see that at -O0 raw pointers are about twice as fast as std::vector, while at -O2 they are almost the same.
That's because std::vector still has code that throws exceptions and may use RTTI. Try comparing with -O2 -DNDEBUG -ftree-vectorize -fno-rtti -fno-exceptions -fomit-frame-pointer, and you'll see that std::vector is slightly better than your code. GCC knows what the 'built-in' types are and how to exploit them in real-world use, and it will gladly do so - just as it knows what memset() and memcpy() do and how to optimize accordingly when the copy size is known.
The compiler optimizations usually won't change the complexity order of an algorithm, just the constant and the linear scale factor. Compilers are fairly smart, but they're not that smart.
Are you going to be compiling your code for release with just -O0? Probably not. You might as well compare the performance of the algorithms when compiled with whatever compilation flags you actually intend to use.
You have two algorithms implemented in C++. If you want to compare the relative performance of the two implementations then you should use the optimization level that you are going to use in your final product. For me, that's -O3.
If you want to analyse the complexity of an algorithm, then that's more of an analysis problem where you look at the overall count of operations that must be performed for different sizes and characteristics of inputs.
As a developer writing code where performance is an issue, it is a good idea to be aware of the range of optimizations that a compiler can, and is likely to, apply to your code. Not optimizing unfairly penalises code that is written clearly but designed to be easily optimized, against code that is already 'micro-optimized'.
I see no reason not to compile and run them both at -O2. Unless you're doing it as a purely academic exercise (and even then it's very unlikely the optimizations would produce fundamental changes in the properties of the algorithm - though I think I'd be happy if GCC started turning O(N) source into O(lg N) assembly), you'll want information that's consistent with what you would get when actually running the final program. You most likely won't be releasing the program built with -O0, so you don't want to compare the algorithms under -O0 either.
Such a comparison is less about fairness than producing useful information. You should use the optimization level that you plan to use when/if the code is put into production use. If you're basically doing research, so you don't personally plan to put it into production use, you're stuck with the slightly more difficult job of guessing what somebody who would put it into production would probably do.
Realistically, even if you are doing development, not research, you're stuck with a little of that anyway -- it's nearly impossible to predict what optimization level you might eventually use with this particular code.
Personally, I usually use -O2 with gcc. My general rule of thumb is to use the lowest level of optimization that turns on automatic inlining. I write a lot of my code with the expectation that small functions will be inlined by the compiler -- and write the code specifically to assist in doing that (e.g. often using functors instead of functions). If the compiler isn't set to inline them, you're not getting what I really intended. The performance of the code when it's compiled that way doesn't really mean anything -- I certainly would not plan on ever really using it that way.
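To make "functors instead of functions" concrete, here is a small sketch (the names are made up): the callable is a template parameter, so the compiler sees its body at the call site and can inline it once inlining is enabled.

#include <cstddef>

// A functor: its type carries the operation, so when accumulate_with is
// instantiated below, the call f(xs[i]) can be inlined.
struct Square {
    double operator()(double x) const { return x * x; }
};

// Written with inlining in mind: the callable is a template parameter,
// not a function pointer, so no indirect call is needed.
template <typename F>
double accumulate_with(F f, const double* xs, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += f(xs[i]);
    return s;
}

// Usage: double sum_sq = accumulate_with(Square{}, data, count);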

I need high performance. Will there be a difference if I use C or C++?

I need to write a program (a project for university) that solves (approx) an NP-hard problem.
It is a variation of the linear ordering problem.
In general, I will have very large inputs (as graphs) and will try to find the best solution (based on a function that 'rates' each solution).
Will there be a difference if I write this in C-style code (one main, and functions), or if I build a Solver class, create an instance, and invoke a 'run' method from main (similar to Java)?
Also, there will be a lot of floating point math going on in each iteration.
Thanks!
No.
The biggest performance gains/losses will come from the algorithm you implement and from how much unneeded work you perform (unneeded work could be anything from recalculating a previous value that could have been cached, to using too many malloc/free calls instead of memory pools, to passing large immutable data by value instead of by reference).
The biggest roadblock to optimal code is no longer the language (for properly compiled languages), but rather the programmer.
No, unless you are using virtual functions.
Edit: If you have a case where you need run-time dynamism, then yes, virtual functions are as fast as or faster than a manually constructed if-else chain. However, if you drop in the virtual keyword in front of a method but you don't actually need the polymorphism, then you will be paying an unnecessary overhead; the compiler won't optimize it away at compile time. I am just pointing this out because it's one of the features of C++ that breaks the 'zero-overhead principle' (quoting Stroustrup).
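A minimal sketch of that last point (the class names are made up): if the dynamic type is never anything other than the base class, the virtual dispatch is pure overhead.

struct RaterVirtual {
    virtual ~RaterVirtual() = default;
    virtual double rate(double x) const { return x * x; }  // called through the vtable
};

struct RaterPlain {
    double rate(double x) const { return x * x; }           // call can be inlined
};

// Through a reference, the compiler generally cannot prove the dynamic type,
// so each r.rate() is an indirect call and the body is not inlined.
double score_all(const RaterVirtual& r, const double* xs, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += r.rate(xs[i]);
    return s;
}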
As a side note, since you mention heavy use of fp math:
The following gcc flags may help you speed things up (I'm sure there are equivalents for Visual C++, but I don't use it): -mfpmath=sse, -ffast-math and -mrecip (the last two are 'slightly dangerous', meaning they could give you weird results in edge cases in exchange for the speed; the first one reduces precision a bit -- you get 64-bit doubles instead of 80-bit ones -- but this extra precision is often unneeded). These flags work equally well for C and C++ compilers.
Depending on your processor, you may also find that simulating true INFINITY with a large-but-not-infinite value gives you a good speed boost. This is because true INFINITY has to be handled as a special case by the processor.
Rule of thumb - do not optimize until you know what to optimize. So start with C++ and get a working prototype. Then profile it and rewrite the bottlenecks in assembly. But as others noted, the chosen algorithm will have a much greater impact than the language.
When speaking of performance, anything you can do in C can be done in C++.
For example, virtual methods are known to be “slow”, but if it's really a problem, you can still resort to C idioms.
C++ also brings templates, which lead to better performance than using void* for generic programming.
The Solver class will be constructed once, I take it, and the run method executed once... in that kind of environment, you won't see a difference. Instead, here are things to watch out for:
Memory management is hellishly expensive. If you need to do lots of little malloc()s, the operating system will eat your lunch. Make a determined effort to re-use whatever data structures you create if you know you'll be doing the same kind of thing again soon!
Instantiating classes generally means... allocating memory! Again, there's practically no cost for instantiating a handful of objects and re-using them. But beware of creating objects only to tear them down and rebuild them soon after!
Choose the right flavor of floating point for your architecture, insofar as the problem permits. It's possible that double will end up being faster than float, although it will need more memory. You should experiment to fine-tune this. Ideally, you'll use a #define or typedef to specify the type so you can change it easily in one place.
Integer calculations are probably faster than floating point. Depending on the numeric range of your data, you may also consider doing it with integers treated as fixed-point decimals. If you need 3 decimal places, you could use ints and just consider them "milli-somethings". You'll have to remember to shift the decimal point after division and multiplication... but no big deal. Of course, if you use any math functions beyond basic arithmetic, that would kill this possibility.
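A tiny sketch of the "milli-somethings" idea (the names are made up): values are stored scaled by 1000 in integers, and the scale is shifted back after multiplication or division.

#include <cstdint>

// Fixed point with 3 decimal places: the value 1.234 is stored as 1234.
using milli = std::int64_t;
constexpr milli kScale = 1000;

constexpr milli from_double(double x) { return static_cast<milli>(x * kScale); }
constexpr double to_double(milli m)   { return static_cast<double>(m) / kScale; }

constexpr milli add(milli a, milli b) { return a + b; }            // scales already match
constexpr milli mul(milli a, milli b) { return (a * b) / kScale; } // shift scale back down
constexpr milli div(milli a, milli b) { return (a * kScale) / b; } // shift scale back up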
Since both are compiled, and compilers are now very good at handling C++, I think the only problem would come from how well optimized your code is. I think it would be easier to write slower code in C++, but that depends on which style your model fits best.
When it comes down to it, I doubt there will be any real difference, assuming both are well written, any libraries you use are well written, and you are measuring on the same computer.
Function call vs. member function call overhead is unlikely to be the limiting factor, compared to file input and the algorithm itself. C++ iostreams are not necessarily super high speed. C has 'restrict' if you're really optimizing; in C++ it's easier to inline function calls. Overall, C++ offers more options for organizing your code clearly, but if it's not a big program, or you're just going to write it in a similar manner whether it's C or C++, then the portability of C libraries becomes more important.
As long as you don't use any virtual functions etc., you won't notice any considerable performance differences. Early C++ was compiled to C, so as long as you know the points where this creates considerable overhead (such as virtual functions), you can account for the differences.
In addition, I want to note that C++ gives you a lot to gain if you use the STL and Boost libraries. In particular, the STL provides very efficient and proven implementations of the most important data structures and algorithms, so you can save a lot of development time.
Effectively it also depends on the compiler you will be using and how it will optimize the code.
First, writing in C++ doesn't imply using OOP; look at the STL algorithms.
Second, C++ can even be slightly faster at runtime (compilation times can be terrible compared to C, but that's because modern C++ tends to rely heavily on abstractions that tax the compiler).
Edit: all right, see Bjarne Stroustrup's discussion of qsort and std::sort, and the article that FAQ mentions (Learning Standard C++ as a New Language), where he shows that C++-style code can be not only shorter and more readable (because of higher abstractions), but also somewhat faster.
Another aspect:
C++ templates can be an excellent tool to generate type-specific / optimized code variations.
For example, C qsort requires a function call to the comparator, whereas std::sort can inline the functor passed. This can make a significant difference when compare and swap themselves are cheap.
Note that you could generate "custom qsorts" optimized for various types with a barrage of defines or a code generator, or by hand - you could do these optimizations in C, too, but at much higher cost.
(It's not a general weapon; templates help only in specific scenarios - usually a single algorithm applied to different data types, or with differing small pieces of code injected.)
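A small sketch of the qsort vs std::sort contrast described above: qsort must call the comparator through a function pointer for every comparison, while std::sort is instantiated for the comparator's type and can inline it.

#include <algorithm>
#include <cstdlib>
#include <vector>

// C style: qsort calls this through a function pointer each time.
static int cmp_double(const void* a, const void* b) {
    double da = *static_cast<const double*>(a);
    double db = *static_cast<const double*>(b);
    return (da > db) - (da < db);
}

void sort_c(double* data, std::size_t n) {
    std::qsort(data, n, sizeof(double), &cmp_double);
}

// C++ style: std::sort is instantiated for the lambda's type, so the
// comparison is inlined straight into the sorting loop.
void sort_cpp(std::vector<double>& data) {
    std::sort(data.begin(), data.end(),
              [](double a, double b) { return a < b; });
}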
Good answers. I would put it like this:
Make the algorithm as efficient as possible in terms of its formal structure.
C++ will be as fast as C, except that it will tempt you to do dumb things, like constructing objects that you don't have to, so don't take the bait. Things like STL container classes and iterators may look like the latest-and-greatest thing, but they will kill you in a hotspot.
Even so, single-step it at the disassembly level. You should see it working very directly on your problem. If it is spending lots of cycles getting in and out of routines, try some inlining (or macros). If it is wandering off into memory allocation and freeing for much of the time, put a stop to that. If its inner loops carry a large percentage of loop overhead, try unrolling the loop.
That's how you can make it about as fast as possible.
I would go with C++ definitely. If you are careful about your design and avoid creating heavy objects inside hotspots you should not see any performance difference but the code is going to be much simpler to understand, maintain, and expand.
Use templates and classes judiciously. Avoid unnecessary object creation by passing objects by reference. Avoid excessive memory allocation; if needed, allocate memory ahead of hotspots. Use the restrict keyword on memory pointers to tell the compiler whether pointers overlap or not.
As far as optimization goes, pay careful attention to memory alignment. Assuming you are working on an Intel processor, you can make use of vector instructions, provided you tell the compiler (through pragmas) about your memory alignment and aliased pointers. You can also use vector instructions directly via intrinsics.
You can also automatically create hotspot code using templates and let the compiler optimize it if you have things like short loops of different sizes. To measure performance and drill down to your bottlenecks, Intel VTune or OProfile are extremely helpful.
Hope that helps.
I do some DSP coding, where it still pays off to go to assembly language sometimes. I'd say use C or C++, either one, and be prepared to go to assembly language when you need to, especially to exploit SIMD instructions.