D CTFE and GPU Code Generation - glsl

Could D's mixins be used to map linear algebra operations to both CPU code and OpenCL, or to GPU shader languages such as GLSL? This would be a real killer application for D and would better bridge logic targeted at both CPU and GPU execution. Compare this with glm and D's gl3n, which only compile fixed-size linear algebra to CPU code.
VexCL is a proof of concept of this using OpenCL and C++11 (GCC 4.6 or later): it completely abstracts away backend-dependent (CPU/GPU) implementation details such as memory allocation and code execution, somewhat like C++ AMP. So things can only get better in D, right? Can mixins completely replace the C++ expression templates used in VexCL? Here's a nice tutorial on its use.
CTFE may also play a role here in this discussion.
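For context, here is a minimal sketch of the C++ expression-template pattern that VexCL builds on and that D mixins/CTFE would be replacing; the names (Expr, Add, Vec) are illustrative only, not taken from VexCL.

#include <cstddef>
#include <vector>

// CRTP base so operator+ only matches our expression types.
template <class E> struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Node representing "left + right"; evaluation is deferred until indexing.
template <class L, class R> struct Add : Expr<Add<L, R>> {
    const L& l; const R& r;
    Add(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec : Expr<Vec> {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }
    template <class E> Vec& operator=(const Expr<E>& e) {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e.self()[i];  // one fused loop
        return *this;
    }
};

template <class L, class R>
Add<L, R> operator+(const Expr<L>& l, const Expr<R>& r) { return Add<L, R>(l.self(), r.self()); }

// Usage: Vec a(n), b(n), c(n), x(n); ...; x = a + b + c;
// The whole right-hand side is a type known at compile time; VexCL walks such a
// tree to emit an OpenCL kernel, which is what a D string mixin could generate instead.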

Yes, definitely. In fact it should be quite straightforward. I did a proof-of-concept of this sort of thing back in 2007 (see my presentation at the first D conference).
Hardly anything worked in CTFE in those days, but it was still an order of magnitude easier than doing the equivalent thing in C++.
The desire to do this sort of thing was part of the motivation for development of template value parameters, CTFE, and SIMD operations.

Related

Writing optimal code in FORTRAN using array expressions

I am looking for a way to write fast code while being able to use built-in vector operations (for the sake of readability).
FORTRAN seems to be a good candidate. However, almost all resources I find on the web are about writing code without array expressions, and have only trivial examples of vector operations.
I feel a strong need for a good resource that covers the caveats and gives some insight into optimizing code with vector expressions.
Example:
currently I am not even able to predict the behavior of such code:
! a = [0], indices = [1, 1]
a(indices) = a(indices) + 1
After compiling I get a = [2], but is this correct? If I use OpenMP, will it behave the same way?
Personally, I would be very happy to have something like the following examples for NumPy:
100 numpy exercises
numpy: tips and tricks to work with data
Getting the Best Performance out of NumPy
Your code is not standard conforming:
Fortran 2008 6.5.3.3.2.3:
If a vector subscript has two or more elements with the same value, an array section with that vector subscript shall not appear in a variable definition context (16.6.7). (NOTE 6.15)
Therefore the result of your operation is not defined by the standard.
Other parts of your question appear to be too broad to treat here. There are many books about scientific programming in Fortran 90 and later.
Also be aware that by vectorization most people in Fortran and C or C++ mean the use of SIMD instructions, not the vectorized expressions from NumPy. The latter are just array expressions in Fortran.
I have scanned many sources (~20 books and dozens of web pages); it would be hard luck if I had missed something really important. The question I posted is indeed incorrect and comes from my initially high expectations about array operations in Fortran.
The answer I would expect is: there are no tools to write short, readable code in Fortran with automatic parallelization (to be more precise: there are, but they are proprietary libraries).
The list of intrinsic functions available in Fortran is quite short (link), and consists only of functions easily mapped to SIMD ops.
There are lots of functions that one will be missing.
While this could be resolved by a separate library with a separate implementation for each platform, Fortran doesn't provide one. There are commercial options (see this thread).
Brief examples of missing functions:
no built-in array sort or unique. The proposed way is to use this library, which provides single-threaded code (forget threads and CUDA)
cumulative sum / running sum. One can trivially implement it, but the resulting code will never work well on threads/CUDA/Xeon Phi/whatever comes next.
bincount, numpy.ufunc.at, numpy.ufunc.reduceat (which is very useful in many applications)
In most cases Fortran provides a 2x speedup even with simple implementations, but the code written will always be single-threaded, while MATLAB/NumPy functions can be reimplemented for a GPU or another parallel platform without any effort on the user's side (which has occasionally happened with MATLAB; also see gnumpy, theano and parakeet).
To conclude, this is bad news for me. Fortran developers really care about having fast programs today, not in the future. I also can't lock my code to proprietary software. And I'm still looking for an appropriate tool. (Julia is the current candidate.)
See also:
STL analogue in fortran
where ready-to-use algorithms are asked about.
Numerical Recipes: the art of parallel programming, where the author implements basic MATLAB-like operations to get more expressive code.
I also find these notes useful for recommended ways of optimizing code (and to see that there is no place for vector operations there).
numpy, fortran, blitz++: a case study
discussion about implementing unique in Fortran, where proprietary tools are recommended.

Performance Tradeoff - When is MATLAB better/slower than C/C++

I am aware that C/C++ is a lower-level language and generates relatively optimized machine code compared with any other high-level language. But I guess there is much more to it than that, which is also evident in practice.
When I do simple calculations like Monte Carlo averaging of a Gaussian sample collection or so, I see there is not much of a difference between a C++ implementation and a MATLAB implementation; sometimes, in fact, MATLAB performs a bit better in time.
When I move on to larger-scale simulations with thousands of lines of code, the real picture slowly shows up. C++ simulations show superior performance, like 100x better running time than an equivalent MATLAB implementation.
The C++ code, most of the time, is pretty much serial and no hi-fi optimization is done explicitly, whereas, as far as I am aware, MATLAB inherently does a lot of optimization. This shows up, for example, when I try to generate a huge chunk of random samples, where the equivalent in C++ using a library like IT++/GSL/Boost performs relatively slower (the algorithm used is the same, namely mt19937).
My question is simply to know whether there is a simple performance tradeoff between MATLAB and C++. Is it just as people say, "Whenever you can, C/C++ is better" (as frequently experienced)? From a different perspective, "What is MATLAB good for, other than comfort?"
By the way, I don't see the coding-efficiency parameter as significant here, assuming the same programmer in both cases. Also, I think the other alternatives like Python and R are not relevant here. But dependence on the specific libraries we use should be interesting.
[I am a PhD student in coding theory for communication systems. I do simulations using MATLAB/C++ all the time, and have reasonable experience coding a few tens of thousands of lines in both.]
I have been using Matlab and C++ for about 10 years. For every numerical algorithm implemented for my research, I always start by prototyping with Matlab and then translate the project to C++ to gain a 10x to 100x (I am not kidding) performance improvement. Of course, I am comparing optimized C++ code to fully vectorized Matlab code. On average, the improvement is about 50x.
There are a lot of subtleties behind both languages, and the following are some misunderstandings:
Matlab is a script language but C++ is compiled
Matlab uses a JIT compiler to translate your script to machine code; you can improve your speed by at most a factor of 1.5 to 2 by using the compiler that Matlab provides.
Matlab code might be able to get fully vectorized but you have to optimize your code by hand in C++
Fully vectorized Matlab code can call libraries written in C++/C/Assembly (for example Intel MKL). But plain C++ code can be reasonably vectorized by modern compilers.
Toolboxes and routines that Matlab provides should be very well tuned and should have reasonable performance
No. Other than linear algebra routines, the performance is generally bad.
The reasons why you can gain 10x~100x performance in C++ compared to vectorized Matlab code:
Calling external libraries (MKL) in Matlab costs time.
Memory in Matlab is dynamically allocated and freed. For example, multiplying small matrices:
A = B*C + D*E + F*G
requires Matlab to create 2 temporary matrices. In C++, if you allocate your memory beforehand, you create NONE (see the preallocation sketch after this list). Now imagine you loop that statement 1000 times. Another solution in C++ is provided by C++11 rvalue references. This is one of the biggest improvements in C++; now C++ code can be as fast as plain C code.
If you want to do parallel processing, the Matlab model is multi-process and the C++ way is multi-thread. If you have many small tasks that need to be parallelized, C++ provides a linear gain across many threads, but you might see a negative performance gain in Matlab.
Vectorization in C++ involves using intrinsics/assembly, and sometimes SIMD vectorization is only possible in C++.
In C++, it is possible for an experienced programmer to completely avoid L2 cache misses and even L1 cache misses, pushing the CPU to its theoretical throughput limit. Matlab's performance can lag behind C++ by a factor of 10 for this reason alone.
In C++, computationally intensive instructions can sometimes be grouped according to their latencies (coded carefully in assembly or intrinsics) and dependencies (most of the time this is done automatically by the compiler or the CPU hardware), so that the theoretical IPC (instructions per clock cycle) can be reached and the CPU pipelines stay filled.
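To make the temporaries point above concrete, here is a minimal sketch of preallocating the result and writing the products straight into it; Eigen is used purely for illustration (it is not mentioned in this answer), and the function name is hypothetical.

#include <Eigen/Dense>

// A is allocated once by the caller and reused across the 1000 loop iterations;
// .noalias() tells Eigen the result does not overlap the operands, so each
// product is written directly into A without a temporary matrix.
void accumulate(const Eigen::MatrixXd& B, const Eigen::MatrixXd& C,
                const Eigen::MatrixXd& D, const Eigen::MatrixXd& E,
                const Eigen::MatrixXd& F, const Eigen::MatrixXd& G,
                Eigen::MatrixXd& A)
{
    A.noalias()  = B * C;
    A.noalias() += D * E;
    A.noalias() += F * G;
}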
However, development time in C++ is also a factor of 10x compared to Matlab!
The reasons why you should use Matlab instead of C++:
Data visualization. I think my career can go on without C++ but I won't be able to survive without Matlab just because it can generate beautiful plots!
Low-efficiency but mathematically robust built-in routines and toolboxes. Get the correct answer first and then talk about efficiency. People can make subtle mistakes in C++ (for example, implicitly converting double to int) and get sort-of-correct results.
Express your ideas and present your code to your colleagues. Matlab code is much easier to read and much shorter than C++, and Matlab code can be executed correctly without a compiler. I just refuse to read other people's C++ code. I don't even use the C++ GNU scientific libraries because the code quality is not guaranteed. It is dangerous for a researcher/engineer to use a C++ library as a black box and take its accuracy for granted. Even for commercial C/C++ libraries, I remember the Intel compiler had a sign error in its sin() function last year, and numerical accuracy problems have also occurred in MKL.
Debugging a Matlab script with the interactive console and workspace is a lot more efficient than using a C++ debugger. Finding an index-calculation bug in Matlab can be done within minutes, but it could take hours in C++ to figure out why the program crashes randomly when bounds checking has been removed for the sake of speed.
Last but not least:
Because once Matlab code is vectorized there is not much left for a programmer to optimize, Matlab performance is much less sensitive to code quality than C++ performance is. Therefore it is best to optimize computational algorithms in Matlab, where marginally better algorithms normally show marginally better performance. On the other hand, testing algorithms in C++ requires a decent programmer to write algorithms optimized more or less in the same way, and to make sure the compiler does not optimize the algorithms differently.
My recent experience in C++ and Matlab:
I made several large Matlab data analysis tools in the past year and suffered from the slow speed of Matlab. But I was able to improve my Matlab program speed by 10x through the following techniques:
Run/profile the Matlab script, re-implement critical routines in C/C++ and compile them with MEX (a minimal gateway sketch is shown below). Critical routines are most likely logically simple but numerically heavy. This improves speed by 5x.
Simplify ".m" files shipped with Matlab toolboxes by commenting out all unnecessary safety checks and output-parameter computations. Please be reminded that the modified code cannot be distributed with the rest of the user scripts. This improves speed by another 2x (after C/C++ and MEX).
The improved code is ~98% in Matlab and ~2% in C++.
I believe it is possible to improve the speed by another 2x (20x total) if the entire tool is coded in C++; that would be a ~100x speed improvement of the computation routines. Hard-drive I/O would then dominate the program run time.
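For reference, the MEX route mentioned above boils down to a small C++ gateway like the following; the element-wise kernel is a made-up placeholder, not one of the routines from this answer.

#include "mex.h"

// Gateway called from Matlab as: y = my_kernel(x)
// (build with: mex my_kernel.cpp)
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
    if (nrhs != 1 || !mxIsDouble(prhs[0]))
        mexErrMsgTxt("Expected one double array input.");

    const mwSize n = mxGetNumberOfElements(prhs[0]);
    const double* in = mxGetPr(prhs[0]);

    plhs[0] = mxCreateDoubleMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxREAL);
    double* out = mxGetPr(plhs[0]);

    for (mwSize i = 0; i < n; ++i)
        out[i] = in[i] * in[i];   // the numerically heavy loop goes here
}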
Question for Mathworks engineers:
When Matlab code is fully vectorized, one of the performance-limiting factors is the matrix indexing operation. For instance, a finite-difference operation needs to be performed on matrix A, which has dimensions of 5000x5000:
B = A(:,2:end)-A(:,1:end-1)
The matrix indexing operation makes the Matlab code multiple times slower than the C++ code. Can the matrix indexing performance be improved?
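For comparison, a rough C++ equivalent of that one-liner is just two loops over a preallocated buffer; the column-major layout and the names below are illustrative assumptions, not taken from the post.

#include <cstddef>
#include <vector>

// B(:, j) = A(:, j+1) - A(:, j); A is rows x cols stored column-major.
void column_diff(const std::vector<double>& A, std::vector<double>& B,
                 std::size_t rows, std::size_t cols)
{
    B.resize(rows * (cols - 1));
    for (std::size_t j = 0; j + 1 < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            B[j * rows + i] = A[(j + 1) * rows + i] - A[j * rows + i];
}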
In my experience (several years of Computer Vision and image processing in both languages) there is no simple answer to this question, as Matlab performance depends strongly (and much more than C++ performance) on your coding style.
Generally, Matlab wraps the classic C++/Fortran-based linear algebra libraries. So anything like x = A\b is going to be very fast. Also, Matlab does a good job of choosing the most efficient solver for these types of problems, so for x = A\b Matlab will look at the size of your matrices and choose the appropriate low-level routines.
Matlab also shines in data manipulation of large matrices if you "vectorize" your code, i.e. if you avoid for loops and use index arrays or boolean arrays to access your data. This stuff is highly optimised.
For other routines, some are written in Matlab code, while others point to a C/C++ implementation (e.g. the Delaunay stuff). You can check this yourself by typing edit some_routine.m. This opens the code and you see whether it is all Matlab or just a wrapper for something compiled.
Matlab, I think, is primarily for comfort - but comfort translates to coding time and ultimately money which is why Matlab is used in the industry. Also, it is easy to learn for engineers from other fields than computer science, with little training in programming.
As a PhD student too, and a 10-year Matlab user, I'm glad to share my POV:
Matlab is a great tool for developing and prototyping algorithms, especially when dealing with GUIs and high-level analysis (frequency domain, LS optimization, etc.): fast coding, powerful syntax (think about [], {}, : etc.).
As soon as your processing chain is more stable and defined and the data dimensions grow, move to C/C++.
The main Matlab limit arises from the fact that its language is script-like: as long as you avoid any explicit loop (using arrayfun, cellfun or other matrix procedures), performance is high, since the called subroutine is again in C/C++.
Your question is difficult to answer. In general C++ is faster, but if you make use of the well-written algorithms in Matlab, it can outperform C++. In some cases Matlab can parallelize your code, which in many cases has to be done manually for C++. Matlab can, to some extent, export C++ code.
So my conclusion is that you have to measure the performance of both programs to get an answer. But then you are comparing your two implementations, not Matlab and C++ in general.
Matlab does very well with linear algebra and array/matrix operations, since they seem to have been doing some extra optimizations on the underlying operations - if you want to beat Matlab there, you would need a similarly optimized BLAS/LAPACK library.
As an interpreted language, Matlab loses time whenever a Matlab function is called, due to internal overhead, which traditionally meant that Matlab loops were slow. This has been alleviated somewhat in recent years thanks to significant improvements in the JIT compiler (search for "performance" questions on Matlab on SO for examples). As a consequence of the function call overhead, all Matlab functions that have not been implemented in C/C++ behind the scenes (call edit functionName to see whether it's written in Matlab) risk being slower than a C/C++ counterpart.
Finally, Matlab attempts to be user friendly, and may do "unnecessary" input checking that can take time (due to function call overhead). For example, if you know that ismember gets sorted inputs, you can call ismembc directly (the behind-the-scene compiled function), saving quite a bit of time.
I think you can consider the difference in at least four respects:
Compiled vs Interpreted
Strongly-typed vs Dynamically-typed
Performance vs Fast-prototyping
Special strength
Points 1-3 can easily be generalized to a comparison between two families of programming languages.
For 4, MATLAB is optimized for matrix operations. So if you can vectorize more code in MATLAB, the performance can be drastically boosted. Conversely, if many loops are required, never hesitate to use C++ or create a MEX file.
It is a difficult question after all.
I saw a 5.5x speed improvement when switching from MATLAB to C++. This was for a robot controller - lots of loops and ODE solving. I spent many hours trying to optimize the MATLAB code, and hardly any time optimizing the C++ (I'm sure it could have been 10x faster with a little more effort).
However, it was easy to add a GUI to the MATLAB code, so I still use it more often. Like others have said, it was nice to prototype first in MATLAB. That made the implementation in C++ much simpler.
Besides the speed of the final program, you should also take into account the total development time of your code, i.e., not only the time to write it, but also to debug it, etc. Matlab (and its open-source counterpart, Octave) can be good for quick prototyping due to its visualisation capabilities.
If you're using straight C++ (i.e. no matrix libraries), it may take you much longer to write C++ code that's equivalent to Matlab code (e.g. there might be no point in spending 10 hours writing C++ code that only runs 10 seconds quicker, compared to a Matlab program that took 5 minutes to write).
However, there are dedicated C++ matrix libraries, such as Armadillo, which provide a Matlab-like API. This can be useful for writing performance critical code that can be called from Matlab, or for converting Matlab code into "real" programs.
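As a small illustration of the Matlab-like Armadillo API mentioned above (a sketch, not taken from the answer; it assumes Armadillo with a BLAS/LAPACK backend is installed), the Matlab expression x = A\b becomes a call to solve():

#include <armadillo>

int main()
{
    arma::mat A = arma::randu<arma::mat>(4, 4);             // random 4x4 matrix
    arma::vec b = arma::randu<arma::vec>(4);

    arma::vec x = arma::solve(A, b);                        // Matlab: x = A \ b
    arma::mat C = A.t() * A + arma::eye<arma::mat>(4, 4);   // Matlab: C = A'*A + eye(4)

    x.print("x =");
    return 0;
}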
Some Matlab code uses standard linear algebra functions with multithreading built into them. So it appears that they are faster than sequential C code.

GPGPU programming architecture for HSA in C++ for Matrix Math

GPU Compute Programmers,
I have a C++ program which currently relies on ACML (LAPACK) to invert and multiply fairly large matrices of single-precision fp values (e.g. 4,000 x 4,000). These matrices are very sparse, although they do not always fit nicely into a diagonal matrix, so I cannot presently reduce them. The other thing about this program is that I have to do this invert-and-multiply several times (serially) as part of a Newton-Raphson iteration. However, I have several thousand permutations which can be done in parallel, each with a small change to the matrix before again calculating and inverting the Jacobian. This is all single-precision fp, and seems perfectly suited for the GPU. My question is this...
I suspect I will need to use the AMD Accelerated Parallel Processing Math Libraries (APPML) for OpenCL, as that is the only thing (non-CUDA, I want to be GPU agnostic) I know of which is available with BLAS functionality. My problem is I do not see the LAPACK dgetrf and dgetri functions included in APPML (yes, these are fp64, but I don't need that precision). Would C++ AMP be a better alternative? I am very interested in the HSA feature of passing pointers rather than copying data, as there is a lot of data in flight here and some calculations are still done on the CPU. I believe that copy overhead would kill me otherwise. Ultimately, performance is the key, and I want to make the right architectural decisions to set myself up for the most performance I can wring out of the HSA GPUs coming out over the next 6 months.
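For reference, the CPU-side operation being described is the classic LU-factorize-then-invert pair; a minimal sketch using the open LAPACKE interface (single precision, so sgetrf/sgetri; the function name and row-major layout are illustrative, not from the post) looks like this:

#include <lapacke.h>
#include <vector>

// Invert an n x n single-precision matrix in place; returns false on failure.
bool invert_inplace(std::vector<float>& A, lapack_int n)   // A holds n*n values, row-major
{
    std::vector<lapack_int> ipiv(n);
    if (LAPACKE_sgetrf(LAPACK_ROW_MAJOR, n, n, A.data(), n, ipiv.data()) != 0)
        return false;                                      // singular or bad argument
    return LAPACKE_sgetri(LAPACK_ROW_MAJOR, n, A.data(), n, ipiv.data()) == 0;
}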
I am using VS 2013 Ultimate preview and would be able to take advantage of C++ AMP for these HSA capabilities. I just want to make sure I am making the right long term architectural decision now while my program is in its infancy. Here is a link and snippet from some interesting data I found on Anandtech:
http://anandtech.com/show/7118/windows-81-and-vs2013-bring-gpu-computing-updates-to-direct3d-and-c-amp-
C++ AMP, Microsoft's C++ extension for GPU computing, has also been updated with the upcoming VS2013. I think the biggest feature update is that C++ AMP programs will also gain a shared memory feature on APUs/SoCs where the compiler and runtime will be able to eliminate extra data copies between CPU and GPU. This feature will also be available only on Windows 8.1 and it is likely built on top of the "map default buffer" as Microsoft's AMP implementation uses Direct3D under the hood. C++ AMP also brings some other nice additions including enhanced texture support and better debugging abilities.
Any thoughts, additional questions or discussion would be greatly appreciated!

What are the most widely used C++ vector/matrix math/linear algebra libraries, and their cost and benefit tradeoffs? [closed]

It seems that many projects slowly come upon a need to do matrix math, and fall into the trap of first building some vector classes and slowly adding in functionality until they get caught building a half-assed custom linear algebra library, and depending on it.
I'd like to avoid that while not building in a dependence on some tangentially related library (e.g. OpenCV, OpenSceneGraph).
What are the commonly used matrix math/linear algebra libraries out there, and why would one decide to use one over another? Are there any that would be advised against using for some reason? I am specifically using this in a geometric/time context (2, 3, 4 dim), but may be using higher-dimensional data in the future.
I'm looking for differences with respect to any of: API, speed, memory use, breadth/completeness, narrowness/specificness, extensibility, and/or maturity/stability.
Update
I ended up using Eigen3 which I am extremely happy with.
There are quite a few projects that have settled on the Generic Graphics Toolkit for this. The GMTL in there is nice - it's quite small, very functional, and been used widely enough to be very reliable. OpenSG, VRJuggler, and other projects have all switched to using this instead of their own hand-rolled vector/matrix math.
I've found it quite nice - it does everything via templates, so it's very flexible, and very fast.
Edit:
After the comments discussion, and edits, I thought I'd throw out some more information about the benefits and downsides to specific implementations, and why you might choose one over the other, given your situation.
GMTL -
Benefits: Simple API, specifically designed for graphics engines. Includes many primitive types geared towards rendering (such as planes, AABBs, quaternions with multiple interpolation methods, etc) that aren't in any other packages. Very low memory overhead, quite fast, easy to use.
Downsides: API is very focused specifically on rendering and graphics. Doesn't include general purpose (NxM) matrices, matrix decomposition and solving, etc, since these are outside the realm of traditional graphics/geometry applications.
Eigen -
Benefits: Clean API, fairly easy to use. Includes a Geometry module with quaternions and geometric transforms. Low memory overhead. Full, highly performant solving of large NxN matrices and other general purpose mathematical routines.
Downsides: May be a bit larger scope than you are wanting (?). Fewer geometric/rendering specific routines when compared to GMTL (ie: Euler angle definitions, etc).
IMSL -
Benefits: Very complete numeric library. Very, very fast (supposedly the fastest solver). By far the largest, most complete mathematical API. Commercially supported, mature, and stable.
Downsides: Cost - not inexpensive. Very few geometric/rendering specific methods, so you'll need to roll your own on top of their linear algebra classes.
NT2 -
Benefits: Provides syntax that is more familiar if you're used to MATLAB. Provides full decomposition and solving for large matrices, etc.
Downsides: Mathematical, not rendering focused. Probably not as performant as Eigen.
LAPACK -
Benefits: Very stable, proven algorithms. Been around for a long time. Complete matrix solving, etc. Many options for obscure mathematics.
Downsides: Not as highly performant in some cases. Ported from Fortran, with odd API for usage.
Personally, for me, it comes down to a single question - how are you planning to use this? If your focus is just on rendering and graphics, I like Generic Graphics Toolkit, since it performs well and supports many useful rendering operations out of the box without having to implement your own. If you need general-purpose matrix solving (i.e. SVD or LU decomposition of large matrices), I'd go with Eigen, since it handles that, provides some geometric operations, and is very performant with large matrix solutions. You may need to write more of your own graphics/geometric operations (on top of their matrices/vectors), but that's not horrible.
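For a flavour of the general-purpose solving case, a minimal Eigen sketch (the function and variable names are illustrative) might look like this:

#include <Eigen/Dense>

void decompose_and_solve(const Eigen::MatrixXd& A, const Eigen::VectorXd& b)
{
    // LU-based linear solve of A x = b (A assumed square).
    Eigen::VectorXd x = A.partialPivLu().solve(b);

    // Singular value decomposition with thin U and V factors.
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
    Eigen::VectorXd sigma = svd.singularValues();

    (void)x; (void)sigma;   // use the results here
}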
So I'm a pretty critical person, and figure if I'm going to invest in a library, I'd better know what I'm getting myself into. I figure it's better to go heavy on the criticism and light on the flattery when scrutinizing; what's wrong with it has many more implications for the future than what's right. So I'm going to go overboard here a little bit to provide the kind of answer that would have helped me and I hope will help others who may journey down this path. Keep in mind that this is based on what little reviewing/testing I've done with these libs. Oh and I stole some of the positive description from Reed.
I'll mention up top that I went with GMTL despite its idiosyncrasies, because the Eigen2 unsafeness was too big of a downside. But I've recently learned that the next release of Eigen2 will contain defines that shut off the alignment code and make it safe. So I may switch over.
Update: I've switched to Eigen3. Despite its idiosyncrasies, its scope and elegance are too hard to ignore, and the optimizations which make it unsafe can be turned off with a define.
Eigen2/Eigen3
Benefits: MPL2 (formerly LGPL), clean, well-designed API, fairly easy to use. Seems to be well maintained with a vibrant community. Low memory overhead. High performance. Made for general linear algebra, but good geometric functionality available as well. Header-only library, no linking required.
Idiosyncrasies/downsides: (Some/all of these can be avoided by some defines that are available in the current development branch, Eigen3; a short sketch of the workarounds follows this list.)
Unsafe performance optimizations mean the rules must be followed carefully. Failure to follow the rules causes crashes.
you simply cannot safely pass by value
use of Eigen types as class members requires special allocator customization (or you crash)
use with STL container types and possibly other templates requires special allocator customization (or you will crash)
certain compilers need special care to prevent crashes on function calls (GCC on Windows)
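The member/container items above refer to the aligned-allocation workarounds from the Eigen documentation; roughly (version-dependent, sketched from memory):

#include <Eigen/Dense>
#include <Eigen/StdVector>
#include <vector>

// Fixed-size vectorizable Eigen member: the macro gives the class an aligned
// operator new, so heap-allocated instances keep the 16-byte alignment.
struct Body {
    Eigen::Vector4d position;
    EIGEN_MAKE_ALIGNED_OPERATOR_NEW
};

// STL containers of such types need Eigen's aligned allocator.
using Trajectory = std::vector<Eigen::Vector4d, Eigen::aligned_allocator<Eigen::Vector4d>>;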
GMTL
Benefits: LGPL, fairly simple API, specifically designed for graphics engines. Includes many primitive types geared towards rendering (such as planes, AABBs, quaternions with multiple interpolation methods, etc) that aren't in any other packages. Very low memory overhead, quite fast, easy to use. All header based, no linking necessary.
Idiosyncrasies/downsides:
API is quirky
what might be myVec.x() in another lib is only available via myVec[0] (readability problem)
an array or std::vector of points may cause you to do something like pointsList[0][0] to access the x component of the first point
in a naive attempt at optimization, cross(vec,vec) was removed and replaced with makeCross(vec,vec,vec), even though the compiler eliminates unnecessary temporaries anyway
normal math operations don't return normal types unless you shut off some optimization features, e.g.: vec1 - vec2 does not return a normal vector, so length( vecA - vecB ) fails even though vecC = vecA - vecB works. You must wrap it like: length( Vec( vecA - vecB ) )
operations on vectors are provided by external functions rather than members. This may require you to use the scope resolution everywhere, since common symbol names may collide
you have to do
length( makeCross( vecA, vecB ) )
or
gmtl::length( gmtl::makeCross( vecA, vecB ) )
where otherwise you might try
vecA.cross( vecB ).length()
not well maintained
still claimed as "beta"
documentation is missing basic info, like which headers are needed to use normal functionality
Vec.h does not contain operations for Vectors; VecOps.h contains some, and others are in Generate.h, for example. cross(vec&,vec&,vec&) is in VecOps.h, [make]cross(vec&,vec&) is in Generate.h
immature/unstable API; still changing.
For example, "cross" has moved from "VecOps.h" to "Generate.h", and then the name was changed to "makeCross". Documentation examples fail because they still refer to old versions of functions that no longer exist.
NT2
Can't tell because they seem to be more interested in the fractal image header of their web page than the content. Looks more like an academic project than a serious software project.
Latest release over 2 years ago.
Apparently no documentation in English though supposedly there is something in French somewhere.
Can't find a trace of a community around the project.
LAPACK & BLAS
Benefits: Old and mature.
Downsides:
old as dinosaurs with really crappy APIs
For what it's worth, I've tried both Eigen and Armadillo. Below is a brief evaluation.
Eigen
Advantages:
1. Completely self-contained -- no dependence on external BLAS or LAPACK.
2. Documentation decent.
3. Purportedly fast, although I haven't put it to the test.
Disadvantage:
The QR algorithm returns just a single matrix, with the R matrix embedded in the upper triangle. No idea where the rest of the matrix comes from, and no Q matrix can be accessed.
Armadillo
Advantages:
1. Wide range of decompositions and other functions (including QR).
2. Reasonably fast (uses expression templates), but again, I haven't really pushed it to high dimensions.
Disadvantages:
1. Depends on external BLAS and/or LAPACK for matrix decompositions.
2. Documentation is lacking IMHO (including the specifics wrt LAPACK, other than changing a #define statement).
Would be nice if an open-source library were available that is self-contained and straightforward to use. I have run into this same issue for 10 years, and it gets frustrating. At one point, I used GSL for C and wrote C++ wrappers around it, but with modern C++ -- especially using the advantages of expression templates -- we shouldn't have to mess with C in the 21st century. Just my tuppence ha'penny.
If you are looking for high performance matrix/linear algebra/optimization on Intel processors, I'd look at Intel's MKL library.
MKL is carefully optimized for fast run-time performance - much of it based on the very mature BLAS/LAPACK Fortran standards. And its performance scales with the number of cores available. Hands-free scalability with available cores is the future of computing, and I wouldn't use any math library for a new project that doesn't support multi-core processors.
Very briefly, it includes:
Basic vector-vector, vector-matrix, and matrix-matrix operations
Matrix factorization (LU decomposition, Hermitian, sparse)
Least-squares fitting and eigenvalue problems
Sparse linear system solvers
Non-linear least-squares solver (trust regions)
Plus signal processing routines such as FFT and convolution
Very fast random number generators (Mersenne Twister)
Much more.... see: link text
A downside is that the MKL API can be quite complex depending on the routines that you need. You could also take a look at their IPP (Integrated Performance Primitives) library which is geared toward high performance image processing operations, but is nevertheless quite broad.
Paul
CenterSpace Software, .NET math libraries, centerspace.net
What about GLM?
It's based on the OpenGL Shading Language (GLSL) specification and released under the MIT license.
Clearly aimed at graphics programmers
I've heard good things about Eigen and NT2, but haven't personally used either. There's also Boost.UBLAS, which I believe is getting a bit long in the tooth. The developers of NT2 are building the next version with the intention of getting it into Boost, so that might count for something.
My lin. alg. needs don't extend beyond the 4x4 matrix case, so I can't comment on advanced functionality; I'm just pointing out some options.
I'm new to this topic, so I can't say a whole lot, but BLAS is pretty much the standard in scientific computing. BLAS is actually an API standard, which has many implementations. I'm honestly not sure which implementations are most popular or why.
If you want to also be able to do common linear algebra operations (solving systems, least squares regression, decomposition, etc.) look into LAPACK.
I'll add a vote for Eigen: I ported a lot of code (3D geometry, linear algebra and differential equations) from different libraries to this one - improving both performance and code readability in almost all cases.
One advantage that wasn't mentioned: it's very easy to use SSE with Eigen, which significantly improves performance of 2D-3D operations (where everything can be padded to 128 bits).
Okay, I think I know what you're looking for. It appears that GGT is a pretty good solution, as Reed Copsey suggested.
Personally, we rolled our own little library, because we deal with rational points a lot - lots of rational NURBS and Beziers.
It turns out that most 3D graphics libraries do computations with projective points that have no basis in projective math, because that's what gets you the answer you want. We ended up using Grassmann points, which have a solid theoretical underpinning and decreased the number of point types. Grassmann points are basically the same computations people are using now, with the benefit of a robust theory. Most importantly, it makes things clearer in our minds, so we have fewer bugs. Ron Goldman wrote a paper on Grassmann points in computer graphics called "On the Algebraic and Geometric Foundations of Computer Graphics".
Not directly related to your question, but an interesting read.
FLENS
http://flens.sf.net
It also implements a lot of LAPACK functions.
I found this library quite simple and functional (http://kirillsprograms.com/top_Vectors.php). These are bare-bones vectors implemented via C++ templates. No fancy stuff - just what you need to do with vectors (add, subtract, multiply, dot, etc).

Matrix classes in c++

I'm doing some linear algebra math, and was looking for some really lightweight and simple to use matrix class that could handle different dimensions: 2x2, 2x1, 3x1 and 1x2 basically.
I presume such a class could be implemented with templates, using some specialization in some cases for performance.
Anybody know of any simple implementation available for use? I don't want "bloated" implementations, as I'll be running this in an embedded environment where memory is constrained.
Thanks
You could try Blitz++ -- or Boost's uBLAS
I've recently looked at a variety of C++ matrix libraries, and my vote goes to Armadillo.
The library is heavily templated and header-only.
Armadillo also leverages templates to implement a delayed evaluation framework (resolved at compile time) to minimize temporaries in the generated code (resulting in reduced memory usage and increased performance).
However, these advanced features are only a burden to the compiler and not your implementation running in the embedded environment, because most Armadillo code 'evaporates' during compilation due to its design approach based on templates.
And despite all that, one of its main design goals has been ease of use - the API is deliberately similar in style to Matlab syntax (see the comparison table on the site).
Additionally, although Armadillo can work standalone, you might want to consider using it with LAPACK (and BLAS) implementations available to improve performance. A good option would be for instance OpenBLAS (or ATLAS). Check Armadillo's FAQ, it covers some important topics.
A quick search on Google dug up this presentation showing that Armadillo has already been used in embedded systems.
std::valarray is pretty lightweight.
I use the Newmat library for matrix computations. It's open source and easy to use, although I'm not sure it fits your definition of lightweight (it includes over 50 source files, which Visual Studio compiles into a 1.8 MB static library).
CML matrix is pretty good, but may not be lightweight enough for an embedded environment. Check it out anyway: http://cmldev.net/?p=418
Another option, although it may be too late, is:
https://launchpad.net/lwmatrix
I for one wasn't able to find a simple enough library, so I wrote one myself: http://koti.welho.com/aarpikar/lib/
I think it should be able to handle different matrix dimensions (2x2, 3x3, 3x1, etc) by simply setting some rows or columns to zero. It won't be the fastest approach, since internally all operations will be done with 4x4 matrices. Although in theory there might exist processors that can handle 4x4 operations in one tick. At least I would rather believe in the existence of such processors than go optimizing those low-level matrix calculations. :)
How about just storing the matrix in an array, like
2x3 matrix = {2,3,val1,val2,...,val6}
This is really simple, and addition operations are trivial. However, you need to write your own multiplication function.
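A rough sketch of that suggestion (the layout and function name are just illustrative): the first two entries hold the dimensions, the rest holds the values row by row, and multiplication is hand-written.

#include <cstddef>
#include <vector>

using Mat = std::vector<double>;   // {rows, cols, v00, v01, ..., v(r-1)(c-1)}

Mat multiply(const Mat& a, const Mat& b)   // assumes a's cols == b's rows
{
    const std::size_t ar = static_cast<std::size_t>(a[0]);
    const std::size_t ac = static_cast<std::size_t>(a[1]);
    const std::size_t bc = static_cast<std::size_t>(b[1]);

    Mat c = { double(ar), double(bc) };
    c.resize(2 + ar * bc, 0.0);        // result entries start out zeroed

    for (std::size_t i = 0; i < ar; ++i)
        for (std::size_t k = 0; k < ac; ++k)
            for (std::size_t j = 0; j < bc; ++j)
                c[2 + i * bc + j] += a[2 + i * ac + k] * b[2 + k * bc + j];
    return c;
}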