Basic ways to speed up a simple Eigen program - c++

I'm looking for the fastest way to do simple operations using Eigen. There are so many data structures available, it's hard to tell which is the fastest.
I've tried to predefine my data structures, but even then my code is being outperformed by similar Fortran code. I've guessed that Eigen::Vector3d is the fastest for my needs (since it's predefined), but I could easily be wrong. Using -O3 optimization at compile time gave me a big boost, but I'm still running 4x slower than a Fortran implementation of the same code.
I make use of an 'Atom' structure, which is then stored in an 'atoms' vector defined by the following:
struct Atom {
    std::string element;
    //double x, y, z;
    Eigen::Vector3d coordinate;
};
std::vector<Atom> atoms;
The slowest part of my code is the following:
distance = atoms[i].coordinate - atoms[j].coordinate;
distance_norm = distance.norm();
Is there a faster data structure I could use? Or is there a faster way to perform these basic operations?

As you pointed out in your comment, adding the -fno-math-errno compiler flag gives you a huge increase in speed. As to why that happens: your code snippet shows that you're doing a sqrt via distance_norm = distance.norm();.
The flag tells the compiler not to set errno after each sqrt (that's one saved write to a thread-local variable), which is faster and enables vectorization of any loop that does this repeatedly. The only disadvantage is that strict IEEE adherence is lost. See the gcc man page.
Another thing you might want to try is adding -march=native and adding -mfma if -march=native doesn't turn it on for you (I seem to remember that in some cases it wasn't turned on by native and had to be turned on by hand - check here for details). And as always with Eigen, you can disable bounds checking with -DNDEBUG.
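Putting the flags discussed here together, an illustrative compile line (the file and binary names are placeholders) might look like:
g++ -O3 -march=native -fno-math-errno -DNDEBUG main.cpp -o simulation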
SoA instead of AoS!!! If performance is actually a real problem, consider using a single 4xN matrix to store the positions (and have Atom keep the column index instead of the Eigen::Vector3d). It shouldn't matter too much in the small code snippet you showed, but depending on the rest of your code, it may give you another huge increase in performance.
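A rough sketch of what that might look like (the member and variable names here are just illustrative):
#include <Eigen/Dense>
#include <string>
#include <vector>

struct Atom {
    std::string element;
    int col;  // index of this atom's column in `positions`
};

// One column per atom; the 4th row stays at zero so 3D norms are unaffected
// and each column remains nicely aligned for vectorization.
Eigen::Matrix<double, 4, Eigen::Dynamic> positions;
std::vector<Atom> atoms;

double distance_between(int i, int j) {
    return (positions.col(i) - positions.col(j)).norm();
}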

Given you are ~4x off, it might be worth checking that you have enabled vectorization such as AVX or AVX2 at compile time. There are of course also SSE2 (~2x) and AVX512 (~8x) when dealing with doubles.

Either try another compiler like the Intel C++ compiler (free for academic and non-profit usage), or use other libraries like Intel MKL (far faster than your own code) or even other BLAS/LAPACK implementations for dense matrices, or PARDISO or SuperLU (not sure if it still exists) for sparse matrices.

Related

Integer/Floating-point values with SSE

I have to multiply a vector of integers with another vector of integers, and then add the result (so a vector of integers) to a vector of floating-point values.
Should I use MMX or SSE4 for the integers, or can I just use SSE for all of these values (even the integers), putting the integers in __m128 registers?
Indeed, I often end up with integers in __m128 registers, and I don't know if I am wasting time (implicitly converting values) or if it makes no difference.
I am compiling with -O3 option.
You should probably just use SSE for everything (MMX is just a very outdated precursor to SSE). If you're going to be targeting mainly newer CPUs then you might even consider AVX/AVX2.
Start by implementing everything cleanly and robustly in scalar code, then benchmark it. It's possible that a scalar implementation will be fast enough, and you won't need to do anything else. Furthermore, gcc and other compilers (e.g. clang, ICC, even Visual Studio) are getting reasonably good at auto-vectorization, so you may get SIMD-vectorized code "for free" that meets your performance needs. However if you still need better performance at this point then you can start to convert your scalar code to SSE. Keep the original scalar implementation for validation and benchmarking purposes though - it's very easy to introduce bugs when optimising code, and it's useful to know how much faster your optimised code is than the baseline code (you're probably looking for somewhere between 2x and 4x faster for SSE versus scalar code).
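As an illustration only (not a claim about what your final code should look like), a rough SSE4.1 sketch of the operation you describe - multiply two int vectors, then add the products to a float vector - might be:
#include <smmintrin.h>  // SSE4.1, needed for _mm_mullo_epi32

// out[i] = float_vals[i] + (float)(a[i] * b[i]); n is assumed to be a multiple of 4
void mul_int_add_float(const int* a, const int* b, const float* float_vals,
                       float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i ia    = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i ib    = _mm_loadu_si128((const __m128i*)(b + i));
        __m128i prod  = _mm_mullo_epi32(ia, ib);            // 4 x int32 multiply
        __m128  fprod = _mm_cvtepi32_ps(prod);              // explicit int -> float conversion
        __m128  sum   = _mm_add_ps(fprod, _mm_loadu_ps(float_vals + i));
        _mm_storeu_ps(out + i, sum);
    }
}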
While the previous answer is reasonable, there is one significant difference - data organization. For direct SSE use, your data is better organized as Structure-of-Arrays (SoA). Typically, scalar code has its data laid out in an Array-of-Structures (AoS) layout. If that is the case, the conversion from scalar to vectorized form will be difficult.
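To make the layout difference concrete (the field names and sizes here are just an example):
// Array-of-Structures: the fields of one element are adjacent - awkward for SSE loads.
struct ParticleAoS { float x, y, z, w; };
ParticleAoS particles_aos[1024];

// Structure-of-Arrays: like fields are contiguous - one _mm_loadu_ps grabs four x values.
struct ParticlesSoA {
    float x[1024];
    float y[1024];
    float z[1024];
    float w[1024];
};
ParticlesSoA particles_soa;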
More reading: https://software.intel.com/en-us/articles/creating-a-particle-system-with-streaming-simd-extensions

Coding for ARM NEON: How to start?

I'm looking to optimize C++ code (mainly some for loops) using the NEON capability of computing 4 or 8 array elements at a time. Is there some kind of library or set of functions that can be used in C++ environment?
I use Eclipse IDE in Linux Gentoo to write C++ code.
UPDATE
After reading the answers I did some tests with the software. I compiled my project with the following flags:
-O3 -mcpu=cortex-a9 -ftree-vectorize -mfloat-abi=hard -mfpu=neon
Keep in mind that this project includes extensive libraries such as openFrameworks, OpenCV, and OpenNI, and everything was compiled with these flags.
To compile for the ARM board we use a Linaro toolchain cross-compiler, and GCC's version is 4.8.3.
Would you expect this to improve the performance of the project? Because we experienced no changes at all, which is rather weird considering all the answers I read here.
Another question: all the for loops have an apparent number of iterations, but many of them iterate over custom data types (structs or classes). Can GCC optimize these loops even though they iterate over custom data types?
EDIT:
From your update, you may misunderstand what the NEON processor does. It is an SIMD (Single Instruction, Multiple Data) vector processor. That means that it is very good at performing an instruction (say "multiply by 4") on several pieces of data at the same time. It also loves to do things like "add all these numbers together" or "add each element of these two lists of numbers to create a third list of numbers." So if your problem looks like those things, the NEON processor is going to be a huge help.
To get that benefit, you must put your data in very specific formats so that the vector processor can load multiple data simultaneously, process it in parallel, and then write it back out simultaneously. You need to organize things such that the math avoids most conditionals (because looking at the results too soon means a roundtrip to the NEON). Vector programming is a different way of thinking about your program. It's all about pipeline management.
Now, for many very common kinds of problems, the compiler automatically can work all of this out. But it's still about working with numbers, and numbers in particular formats. For example, you almost always need to get all of your numbers into a contiguous block in memory. If you're dealing with fields inside of structs and classes, the NEON can't really help you. It's not a general-purpose "do stuff in parallel" engine. It's an SIMD processor for doing parallel math.
For very high-performance systems, data format is everything. You don't take arbitrary data formats (structs, classes, etc.) and try to make them fast. You figure out the data format that will let you do the most parallel work, and you write your code around that. You make your data contiguous. You avoid memory allocation at all costs. But this isn't really something a simple StackOverflow question can address. High-performance programming is a whole skill set and a different way of thinking about things. It isn't something you get by finding the right compiler flag. As you've found, the defaults are pretty good already.
The real question you should be asking is whether you could reorganize your data so that you can use more of OpenCV. OpenCV already has lots of optimized parallel operations that will almost certainly make good use of the NEON. As much as possible, you want to keep your data in the format that OpenCV works in. That's likely where you're going to get your biggest improvements.
My experience is that it is certainly possible to hand-write NEON assembly that will beat clang and gcc (at least from a couple of years ago, though the compiler certainly continues to improve). Having excellent ARM optimization is not the same as NEON optimization. As @Mats notes, the compiler will generally do an excellent job in the obvious cases, but it does not always handle every case ideally, and it is certainly possible for even a lightly skilled developer to sometimes beat it, sometimes dramatically. (@wallyk is also correct that hand-tuning assembly is best saved for last; but it can still be very powerful.)
That said, given your statement "Assembly, for which I have absolutely no background, and can't possibly afford to learn at this point," then no, you should not even bother. Without first at least understanding the basics (and a few non-basics) of assembly (and specifically vectorized NEON assembly), there is no point in second-guessing the compiler. Step one of beating the compiler is knowing the target.
If you are willing to learn the target, my favorite introduction is Whirlwind Tour of ARM Assembly. That, plus some other references (below), were enough to let me beat the compiler by 2-3x on my particular problems. On the other hand, they were not sufficient: when I showed my code to an experienced NEON developer, he looked at it for about three seconds and said "you have a halt right there." Really good assembly is hard, but half-decent assembly can still be better than optimized C++. (Again, every year this gets less true as the compiler writers get better, but it can still be true.)
ARM Assembly language
A few things iOS developers ought to know about the ARM architecture (iPhone-focused, but the principles are the same for all uses.)
ARM NEON support in the ARM compiler
Coding for NEON
One side note, my experience with NEON intrinsics is that they are seldom worth the trouble. If you're going to beat the compiler, you're going to need to actually write full assembly. Most of the time, whatever intrinsic you would have used, the compiler already knew about. Where you get your power is more often in restructuring your loops to best manage your pipeline (and intrinsics don't help there). It's possible this has improved over the last couple of years, but I would expect the improving vector optimizer to outpace the value of intrinsics more than the other way around.
Here's a "mee too" with some blog posts from ARM. FIRST, start with the following to get the background information, including 32-bit ARM (ARMV7 and below), Aarch32 (ARMv8 32-bit ARM) and Aarch64 (ARMv8 64-bit ARM):
ARM NEON programming quick reference
Second, check out the Coding for NEON series. It's a nice introduction with pictures, so things like interleaved loads make sense at a glance.
Coding for NEON - Part 1: Load and Stores
Coding for NEON - Part 2: Dealing With Leftovers
Coding for NEON - Part 3: Matrix Multiplication
Coding for NEON - Part 4: Shifting Left and Right
Coding for NEON - Part 5: Rearranging Vectors
I also went on Amazon looking for some books on ARM assembly with a treatment of NEON. I could only find two, and neither book's treatment of NEON was impressive. Each reduced it to a single chapter with the obligatory matrix example.
I believe ARM intrinsics are a very good idea. The intrinsics allow you to write code for the GCC, Clang and Visual C/C++ compilers. We have one code base that works for ARM Linux distros (like Linaro), some iOS devices (using -arch armv7) and Microsoft gadgets (like Windows Phone and Windows Store Apps).
If you have access to a reasonably modern GCC (GCC 4.8 and upwards) I would recommend giving intrinsics a go. The NEON intrinsics are a set of functions that the compiler knows about, which can be used from C or C++ programs to generate NEON/Advanced SIMD instructions. To gain access to them in your program, it is necessary to #include <arm_neon.h>. The verbose documentation of all available intrinsics is available at http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf , but you may find more user-friendly tutorials elsewhere online.
Advice on this site is generally against the NEON intrinsics, and certainly there are GCC versions which have done a poor job of implementing them, but recent versions do reasonably well (and if you spot bad code generation, please do raise it as a bug - https://gcc.gnu.org/bugzilla/ )
They are an easy way to program to the NEON/Advanced SIMD instruction set, and the performance you can achieve is often rather good. They are also "portable", in that when you move to an AArch64 system, a superset of the intrinsics you can use from ARMv7-A is available. They are also portable across implementations of the ARM architecture, which can vary in their performance characteristics, but which the compiler will model for performance tuning.
The principal benefit of the NEON intrinsics over hand-written assembly is that the compiler can understand them when performing its various optimization passes. By contrast, hand-written assembler is an opaque block to GCC and will not be optimized. On the other hand, expert assembler programmers can often beat the compiler's register allocation policies, particularly when using the instructions which write to or read from multiple consecutive registers.
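As a minimal illustration of the intrinsics style (the function and variable names are placeholders):
#include <arm_neon.h>

// out[i] = a[i] + b[i], four floats per iteration; n is assumed to be a multiple of 4
void add_floats_neon(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);        // load 4 floats
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));    // add and store 4 at a time
    }
}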
In addition to Wally's answer - and this probably should be a comment, but I couldn't make it short enough: ARM has a team of compiler developers whose entire role is to improve the parts of GCC and Clang/LLVM that do code generation for ARM CPUs, including features that provide "auto-vectorization". I have not looked deeply into it, but from my experience with x86 code generation, I'd expect that for anything that is relatively easy to vectorize, the compiler should do a decent job. Some code is hard for the compiler to understand when it can vectorize or not, and may need some "encouragement" - such as unrolling loops or marking conditions as "likely" or "unlikely", etc.
Disclaimer: I work for ARM, but have very little to do with the compilers or even CPUs, as I work for the group that does graphics (where I have some involvement with compilers for the GPUs in the OpenCL part of the GPU driver).
Edit:
Performance, and the use of various instruction extensions, really depends on EXACTLY what the code is doing. I'd expect that libraries such as OpenCV are already doing a fair amount of clever stuff in their code (such as handwritten assembler, compiler intrinsics, and code generally designed to let the compiler do a good job), so it may not really give you much improvement. I'm not a computer vision expert, so I can't really comment on exactly how much of that work has been done in OpenCV, but I'd certainly expect the "hottest" parts of the code to have been fairly well optimised already.
Also, profile your application. Don't just fiddle with optimisation flags: measure its performance and use a profiling tool (e.g. the Linux "perf" tool) to find WHERE your code is spending time. Then see what can be done to that particular code. Is it possible to write a more parallel version of it? Can the compiler help, or do you need to write assembler? Is there a different algorithm that does the same thing but in a better way? Etc., etc...
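As an illustrative starting point on Linux (the binary name is a placeholder):
perf record -g ./your_app    # sample where time is spent, with call graphs
perf report                  # browse the hotspots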
Although tweaking compiler options CAN help, and often does, it tends to give tens of percent, where a change of algorithm can often lead to 10 or 100 times faster code - assuming, of course, that your algorithm can be improved!
Understanding what part of your application is taking the time, however, is KEY. There is no point in changing things to make the code that takes 5% of the time 10% faster, when a change somewhere else could make a piece of code that is 30 or 60% of the total time 20% faster. Or in optimising some math routine when 80% of the time is spent reading a file, where making the buffer twice the size would make it twice as fast...
Although a long time has passed since I submitted this question, I realize that it has gathered some interest, so I decided to describe what I ended up doing.
My main goal was to optimize a for-loop which was the bottleneck of the project. So, since I don't know anything about Assembly I decided to give NEON intrinsics a go. I ended up having a 40-50% gain in performance (in this loop alone), and a significant overall improvement in performance of the whole project.
The code does some math to transform a bunch of raw distance data into distance to a plane in millimetres. I use some constants (like _constant05, _fXtoZ) that are not defined here, but they are just constant values defined elsewhere.
As you can see, I'm doing the math for 4 elements at a time, talk about real parallelization :)
unsigned short* frameData = frame.ptr<unsigned short>(_depthLimits.y, _depthLimits.x);
unsigned short step = _runWidth - _actWidth; //because a ROI is being processed, not the whole image
cv::Mat distToPlaneMat = cv::Mat::zeros(_runHeight, _runWidth, CV_32F);
float* fltPtr = distToPlaneMat.ptr<float>(_depthLimits.y, _depthLimits.x); //A pointer to the start of the data
for(unsigned short y = _depthLimits.y; y < _depthLimits.y + _depthLimits.height; y++)
{
    for (unsigned short x = _depthLimits.x; x < _depthLimits.x + _depthLimits.width - 1; x += 4)
    {
        float32x4_t projX = {(float)x, (float)(x + 1), (float)(x + 2), (float)(x + 3)};
        float32x4_t projY = {(float)y, (float)y, (float)y, (float)y};
        uint16x4_t framePixels = vld1_u16(frameData); //load 4 raw depth values at once
        float32x4_t floatFramePixels = {(float)framePixels[0], (float)framePixels[1], (float)framePixels[2], (float)framePixels[3]};
        float32x4_t fNormalizedY = vmlsq_f32(_constant05, projY, _yResInv);
        float32x4_t auxfNormalizedX = vmulq_f32(projX, _xResInv);
        float32x4_t fNormalizedX = vsubq_f32(auxfNormalizedX, _constant05);
        float32x4_t realWorldX = vmulq_f32(fNormalizedX, floatFramePixels);
        realWorldX = vmulq_f32(realWorldX, _fXtoZ);
        float32x4_t realWorldY = vmulq_f32(fNormalizedY, floatFramePixels);
        realWorldY = vmulq_f32(realWorldY, _fYtoZ);
        float32x4_t realWorldZ = floatFramePixels;
        realWorldX = vsubq_f32(realWorldX, _tlVecX);
        realWorldY = vsubq_f32(realWorldY, _tlVecY);
        realWorldZ = vsubq_f32(realWorldZ, _tlVecZ);
        float32x4_t distAuxX, distAuxY, distAuxZ;
        distAuxX = vmulq_f32(realWorldX, _xPlane);
        distAuxY = vmulq_f32(realWorldY, _yPlane);
        distAuxZ = vmulq_f32(realWorldZ, _zPlane);
        float32x4_t distToPlane = vaddq_f32(distAuxX, distAuxY); //sum the per-axis products
        distToPlane = vaddq_f32(distToPlane, distAuxZ);
        *fltPtr = (float) distToPlane[0];
        *(fltPtr + 1) = (float) distToPlane[1];
        *(fltPtr + 2) = (float) distToPlane[2];
        *(fltPtr + 3) = (float) distToPlane[3];
        frameData += 4;
        fltPtr += 4;
    }
    frameData += step;
    fltPtr += step;
}
If you don't want to mess with assembly code at all, then tweak the compiler flags to maximally optimize for speed. gcc given the proper ARM target should do this provided the number of loop iterations is apparent.
To check gcc code generation, request assembly output by adding the -S flag.
If after several tries (of reading the gcc documentation and tweaking flags) you still can't get it to produce the code you want, then take the assembly output and edit it to your satisfaction.
Beware of premature optimization. The proper development order is to get the code functional, then see if it needs optimization. Only when the code is stable does it make sense to do so.
Play with some minimal assembly examples on QEMU to understand the instructions
The following setup does not have many examples yet, but it serves as a neat playground:
v7 examples
v8 examples
setup usage
The examples run in QEMU user mode, which dispenses with the need for extra hardware, and GDB works just fine.
The asserts are done through the C standard library.
You should be able to easily extend that setup with new instructions as you learn them.
ARM intrinsics in particular were asked about at: Is there a good reference for ARM Neon intrinsics?

gsl_complex vs. std::complex performance

I'm writing a program that depends a lot on complex additions and multiplications. I wanted to know whether I should use gsl_complex or std::complex.
I don't seem to find a comparison online of how much better GSL complex arithmetic is as compared to std::complex. A rudimentary google search didn't help me find a benchmarks page for GSL complex either.
I wrote a 20-line program that generates two random arrays of complex numbers (1e7 of them) and then checked how long addition and multiplication took using clock() from <ctime>. Using this method (without compiler optimisation) I found that gsl_complex_add and gsl_complex_mul are almost twice as fast as std::complex<double>'s + and * respectively. But I've never done this sort of thing before, so is this even the way you check which is faster?
Any links or suggestions would be helpful. Thanks!
EDIT:
Okay, so I tried again with a -O3 flag, and now the results are extremely different! std::complex<float>::operator+ is more than twice as fast as gsl_complex_add, while gsl_complex_mul is about 1.25 times as fast as std::complex<float>::operator*. If I use double, gsl_complex_add is about 30% faster than std::complex<double>::operator+ while std::complex<double>::operator* is about 10% faster than gsl_complex_mul. I only need float-level precision, but I've heard that double is faster (and memory is not an issue for me)! So now I'm really confused!
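For reference, a minimal sketch of the kind of timing loop being described (array size, seeding, and output are illustrative; both halves should be compiled with the same optimisation flags and linked with -lgsl -lgslcblas):
#include <complex>
#include <cstdio>
#include <ctime>
#include <random>
#include <vector>
#include <gsl/gsl_complex.h>
#include <gsl/gsl_complex_math.h>

int main() {
    const std::size_t n = 10000000;  // 1e7 elements, as in the question
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);

    std::vector<std::complex<double>> a(n), b(n), c(n);
    std::vector<gsl_complex> ga(n), gb(n), gc(n);
    for (std::size_t i = 0; i < n; ++i) {
        double re = dist(gen), im = dist(gen);
        a[i] = {re, im};
        ga[i] = gsl_complex_rect(re, im);
        re = dist(gen); im = dist(gen);
        b[i] = {re, im};
        gb[i] = gsl_complex_rect(re, im);
    }

    std::clock_t t0 = std::clock();
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];          // std::complex add
    std::clock_t t1 = std::clock();
    for (std::size_t i = 0; i < n; ++i) gc[i] = gsl_complex_add(ga[i], gb[i]);  // GSL add
    std::clock_t t2 = std::clock();

    std::printf("std::complex add: %ld ticks\n", (long)(t1 - t0));
    std::printf("gsl_complex add:  %ld ticks\n", (long)(t2 - t1));
    return 0;
}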
Turn on optimisations.
Any library or set of functions that you link with will be compiled WITH optimisation (unless the names of the developers are Kermit, Swedish Chef, Miss Piggy (project manager) and Cookie Monster (tester) - in other words, the development team is a bunch of Muppets).
Since std::complex uses templates, it is compiled with the compiler settings you give, so without optimisation that code is unoptimized. So your question is really "Why is function X faster than function Y that does the same thing, when function X is compiled with optimisation and Y is compiled without optimisation?" - which should really be obvious to answer: "Optimisation works nearly all of the time!" (If optimisation didn't work most of the time, compiler developers would have a MUCH easier time.)
Edit: So my above point has just been proven. Note that since templates can inline the code, it is often more efficient than an external library (because the compiler can just insert the instructions straight into the flow, rather than calling out to another function).
As to float vs. double, the only time that float is slower than double is if there is ONLY double hardware available, with two functions added to "shorten" and "lengthen" between float and double. I'm not aware of any such hardware. double has more bits, so it SHOULD take longer.
Edit2:
When it comes to choosing "one solution over another", there are so many factors. Performance is one (and in some cases, the most important, in other cases not). Other aspects are "ease of use", "availability", "fit for the project", etc.
If you look at ONLY performance, you can sometimes run simple benchmarks to determine that one solution is better or worse than another, but for complex libraries [not "real&imaginary" type complex numbers, but rather "complicated"], there are sometimes optimisations to deal with large amounts of data, where if you use a less sophisticated solution, the "large data" will not achieve the same performance, because less effort has been spent on solving the "big data" type problems. So, if you have a "simple" benchmark that does some basic calculations on a small set of data, and you are, in reality, going to run some much bigger datasets, the small benchmark MAY not reflect reality.
And there is no way that I, or anyone else, can tell you which solution will give you the best performance on YOUR system with YOUR datasets, unless we have access to your datasets, know exactly which calculations you are performing (that is, pretty much have your code), and have experience with running them with both "packages".
And going on to the rest of the criteria ("ease of use", etc), those are much more "personal opinion" based, so wouldn't be a good fit for an SO question in the first place.
The answer depends not only on the optimization flags, but also on the compiler used to compile the GSL library and your particular code. Example: if you compile gsl with gcc and your program with icc, then you may see a (significant) difference (I have done this test with std::pow vs gsl_pow). Also, the standard makefile generated by ./configure does not compile GSL with aggressive floating-point optimizations (for example, it does not include the -ffast-math flag in gcc) because some GSL routines (the differential equation solver, for example) fail their stringent accuracy tests when these optimizations are present.
One of the great points about GSL is the modularity of the library. If you don't need double accuracy, then you can compile gsl_complex.h, gsl_complex_math.h and math.c separately with aggressive floating-point optimizations (however, you need to delete the line #include <config.h> in math.c). Another strategy is to compile a separate version of the whole library with aggressive floating-point optimizations and test that accuracy is not an issue for your particular problem (that is my favorite approach).
EDIT: I forgot to mention that gsl_complex.h also has a float version of gsl_complex
typedef struct
{
    float dat[2];
}
gsl_complex_float;

Using SIMD in a Game Engine Math Library by using function pointers ~ A good idea?

I have been reading Game Engine Books since I was 14 (At that time I didn't understand a thing:P)
Now, quite some years later, I want to start programming the mathematical basis of my game engine. I've been thinking long about how to design this 'library'. (By which I mean an "organized set of files".) Every few years new SIMD instruction sets come out, and I wouldn't want them to go to waste. (Tell me if I am wrong about this.)
I wanted to at least have the following properties:
Making it able to check whether SIMD is available at runtime, and use SIMD if it is, and fall back to the normal C++ version if it isn't. (Might have some call overhead; is this worth it?)
Making it able to compile for SIMD or normal C++ if we already know the target at compile time. The calls can get inlined and become suitable for cross-optimisation because the compiler knows whether SIMD or C++ is used.
EDIT - I want to make the source code portable so it can run on devices other than x86(-64) too.
So I thought it would be a good solution to use function pointers, which I would make static and initialize at the start of the program, and which the suited functions (for example, matrix/vector multiplication) will call.
What do you think are the advantages and disadvantages (which outweighs the other?) of this design, and is it even possible to create it with both properties described above?
Christian
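A minimal sketch of the runtime-dispatch idea being asked about (the CPU-detection builtin and all names here are illustrative assumptions, not a recommendation):
#include <cstddef>
#include <immintrin.h>

static void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

static void add_sse(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; ++i) out[i] = a[i] + b[i];   // scalar tail
}

// Static function pointer, filled in once at program start (e.g. first thing in main).
using AddFn = void (*)(const float*, const float*, float*, std::size_t);
static AddFn g_add = nullptr;

void init_math_library() {
    g_add = __builtin_cpu_supports("sse2") ? add_sse : add_scalar;
}
The answers below explain why dispatching at this fine a granularity is usually a bad idea.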
It's important to get the right granularity at which you make decision on which routine to call. If you do this at too low a level then function dispatch overhead becomes a problem, e.g. a small routine which just has a few instructions could become very inefficient if called via some kind of function pointer dispatch mechanism rather than say just being inlined. Ideally the architecture-specific routines should be processing a reasonable amount of data so that function dispatch cost is negligible, without being so large that you get significant code bloat due to compiling additional non-architecture-specific code for each supported architecture.
The simplest way to do this is to compile your game twice, once with SIMD enabled, once without. Create a tiny little launcher app that performs the _may_i_use_cpu_feature checks, and then runs the correct build.
The double indirection caused by calling a matrix multiply (for example) via a function pointer is not going to be nice. Instead of inlining trivial maths functions, it'll introduce function calls all over the shop, and those calls will be forced to save/restore a lot of registers to boot (because the code behind the pointer is not going to be known until runtime).
At that point, a non-optimised version without the double indirection will massively outperform the SSE version with function pointers.
As for supporting multiple platforms, this can be easy, and it can also be a real bother. ARM NEON is similar enough to SSE4 to make it worth wrapping the instructions behind some macros; however, NEON is also different enough to be really annoying!
#if CPU_IS_INTEL
    #include <immintrin.h>
    typedef __m128 f128;
    #define add4f _mm_add_ps
#else
    #include <arm_neon.h>
    typedef float32x4_t f128;
    #define add4f vaddq_f32   // vector float add (vqadd_* is the integer saturating add)
#endif
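With a wrapper like that, the same source line can compile to either instruction set, e.g.:
f128 c = add4f(a, b);   // _mm_add_ps on x86, vaddq_f32 on ARM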
The MAJOR problem with starting on, say, Intel and porting to ARM later is that a lot of the nice things don't exist. Shuffling is possible on ARM, but it's also a bother. Division, dot product, and sqrt don't exist on ARM (only reciprocal estimates, which you'll need to do your own Newton iteration on).
If you are thinking about SIMD like this:
struct Vec4
{
    float x;
    float y;
    float z;
    float w;
};
Then you may just be able to wrap SSE and NEON behind a semi-ok wrapper. When it comes to AVX512 and AVX2 though, you'll probably be screwed.
If however you are thinking about SIMD using structure-of-array formats:
struct Vec4SOA
{
    float x[BIG_NUM];
    float y[BIG_NUM];
    float z[BIG_NUM];
    float w[BIG_NUM];
};
Then there is a chance you'll be able to produce an AVX2/AVX512 version. However, working with code organised like that is not the easiest thing in the world.
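For instance, a rough sketch of an AVX-style kernel over the Vec4SOA layout above (the same shape extends to AVX2/AVX512; BIG_NUM is assumed to be a multiple of 8 and the function name is illustrative):
#include <immintrin.h>

// out[i] = dot(a[i], b[i]) for the 4-component vectors stored as SoA above.
void dot4_soa_avx(const Vec4SOA& a, const Vec4SOA& b, float* out) {
    for (int i = 0; i + 8 <= BIG_NUM; i += 8) {
        __m256 r = _mm256_mul_ps(_mm256_loadu_ps(a.x + i), _mm256_loadu_ps(b.x + i));
        r = _mm256_add_ps(r, _mm256_mul_ps(_mm256_loadu_ps(a.y + i), _mm256_loadu_ps(b.y + i)));
        r = _mm256_add_ps(r, _mm256_mul_ps(_mm256_loadu_ps(a.z + i), _mm256_loadu_ps(b.z + i)));
        r = _mm256_add_ps(r, _mm256_mul_ps(_mm256_loadu_ps(a.w + i), _mm256_loadu_ps(b.w + i)));
        _mm256_storeu_ps(out + i, r);   // 8 dot products per iteration
    }
}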

C++ Eigen Library - Quadratic Programming, Fixed vs Dynamic, Performance

I am looking into doing some quadratic programming, and have seen different libraries. I have seen various Eigen variants of QuadProg++ (KDE Forums, Benjamin Stephens, StackOverflow posts). Just as a test, I forked wingsit's Eigen variant, available on GitHub, to implement compile-time-sized problems to measure performance via templates.
I'm finding that I have worse performance in the templated case than in the dynamically sized (MatrixXd / VectorXd) case from wingsit's code. I know this is not a simple question, but can anyone see a reason why this might be?
Note: I do need to increase the problem size / iteration count, will post that once I can.
EDIT: I am using GCC 4.6.3 on Ubuntu 12.04. These are the flags that I am using (modified from wingsit's code):
CFLAGS = -O4 -Wall -msse2 -fopenmp # option for obj
LFLAGS = -O4 -Wall -msse2 -fopenmp # option for exe (-lefence ...)
The benefits of static-sized code generally diminish as the sizes grow bigger. The typical benefits of static-sized code mainly include (but are not limited to) the following:
stack-based allocations, which are faster than heap allocations. However, at large sizes, stack-based allocations are no longer feasible (stack overflow), or even beneficial from the point of view of prefetching and locality of reference.
loop unrolling: when the compiler sees a small static-sized loop, it can just unroll it, and possibly use SSE instructions. This doesn't apply at larger sizes.
In other words, for small sizes (up to maybe N=12 or so), static-sized code can be much better and faster than the equivalent dynamically-sized code, as long as the compiler is fairly aggressive about inlining and loop unrolling. But when sizes are larger, there is no point.
Additionally, there are a number of drawbacks about static-sized code too:
No efficient move-semantics / swapping / copy-on-write strategies since such code is usually implemented with static arrays (in order to have the benefits mentioned above) which cannot be simply swapped (as in, swapping the internal pointer).
Larger executables which contain functions that are spread out. Say you have one function template to multiply two matrices, and so, a new compiled function (inlined or not) is generated for each size combination. Then, you have some algorithm that does a lot of matrix multiplications, well, each multiplication will have to jump to (or execute inline) the specialization for that size combination. At the end, you end up with a lot more code that needs to be cached, and then, cache misses become a lot more frequent, and that will completely destroy the performance of your algorithm.
So, the lesson to draw from that is to use static-sized vectors and matrices only for small things like 2 to 6 dimensional vectors or matrices. But beyond that, it's preferable to go with dynamically-sized code (or maybe try static-sized code, but verify that it performs better, because it is not guaranteed to do so). So, I would advise you to reconsider your idea of using static-sized code for larger problems.
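A small illustration of the distinction (the sizes here are arbitrary examples):
#include <Eigen/Dense>

Eigen::Vector3d small_case(const Eigen::Matrix3d& R, const Eigen::Vector3d& v) {
    return R * v;   // fixed-size: no heap allocation, the loop can be fully unrolled
}

Eigen::VectorXd large_case(const Eigen::MatrixXd& A, const Eigen::VectorXd& v) {
    return A * v;   // dynamic-size: heap-allocated, dispatches to cache-friendly kernels
}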