GPGPU programming architecture for HSA in C++ for Matrix Math

GPGPU programming architecture for HSA in C++ for Matrix Math - c++

GPU Compute Programmers,
I have a C++ program which currently relies on the ACML (LAPACK) to invert and multiple fairly large matrices of single precision fp values (E.g. 4,000 x 4,000). These matrices are very sparse although they do not always fit nicely into a diagonal matrix so I cannot presently reduce them. The other thing about this program is I have to do this invert and multiply several times (serially) as part of a Newton Rapson. However, I have several thousand permutations which can be done in parallel, each with a small change to the matrix before again calculating and inverting the Jacobian. This is all single precision fp, and seems perfectly suited for the GPU. My question is this...
I suspect I will need to use the AMD Accelerated Parallel Processing Math Libraries (APPML) for OpenGL as that is the only thing (non-CUDA, I want to be GPU agnostic) I know of which is available with BLAS functionality. My problem is I do not see the LAPACK dgetrf and dgetri functions included in APPML (yes, these are fp64 but I don't need that precision). Would C++ AMP be a better alternative? I am very interested in HSA features of passing pointers rather than copying data as there is a lot of data in flight here and some calculations still are done on the CPU. I believe that copy overhead would kill me otherwise. Ultimately, performance is the key and I want to make the right architectural decisions to set myself up for the most performance I can wring out of HSA GPUs coming out over the next 6 months.
I am using VS 2013 Ultimate preview and would be able to take advantage of C++ AMP for these HSA capabilities. I just want to make sure I am making the right long term architectural decision now while my program is in its infancy. Here is a link and snippet from some interesting data I found on Anandtech:
http://anandtech.com/show/7118/windows-81-and-vs2013-bring-gpu-computing-updates-to-direct3d-and-c-amp-
C++ AMP, Microsoft's C++ extension for GPU computing, has also been updated with the upcoming VS2013. I think the biggest feature update is that C++ AMP programs will also gain a shared memory feature on APUs/SoCs where the compiler and runtime will be able to eliminate extra data copies between CPU and GPU. This feature will also be available only on Windows 8.1 and it is likely built on top of the "map default buffer" as Microsoft's AMP implementation uses Direct3D under the hood. C++ AMP also brings some other nice additions including enhanced texture support and better debugging abilities.
Any thoughts, additional questions or discussion would be greatly appreciated!

Related

Performance Tradeoff - When is MATLAB better/slower than C/C++

I am aware that C/C++ is a lower-level language and generates relatively optimized machine code when we compare with any other high-level language. But I guess there is pretty much more than that, which is also evident from the practice.
When I do simple calculations like montecarlo averaging of a Gaussian sample collection or so, I see there is not much of a difference between a C++ implementation or MATLAB implementation, sometimes in fact MATLAB performs a bit better in time.
When I move on to larger scale simulations with thousands of lines of code, slowly the real picture shows up. C++ simulations show superior performance like 100x better in time complexity than an equivalent MATLAB implementation.
The code in C++ most of the times, is pretty much serial and no hi-fi optimization is done explicitly. Whereas, as per my awareness, MATLAB inherently does a lot of optimization. This shows up for example when I try to generate a huge chunk of random samples, where as the equivalent in C++ using some library like IT++/GSL/Boost performs relatively slower (the algorithm used is the same namely mt19937).
My question is simply to know if there is a simpler tradeoff between MATLAB/C++ in performance. Is it just like what people say, "Whenever you can, C/C++ is the better"(The frequently experienced)?. In a different perspective, "What is MATLAB good for, other than comfort?"
By the way, I don't see coding efficiency parameter being significant here, thinking of the same programmer in both cases. And also, I think the other alternatives like python,R are not relevant here. But dependence on the specific libraries we use should be interesting.
[I am a phd student in Coding Theory in communication systems. I do simulations using matlab/C++ all the time, and have reasonable experience of coding few 10K's of lines in both cases]

I have been using Matlab and C++ for about 10 years. For every numerical algorithms implemented for my research, I always start from prototyping with Matlab and then translate the project to C++ to gain a 10x to 100x (I am not kidding) performance improvement. Of course, I am comparing optimized C++ code to the fully vectorized Matlab code. On average, the improvement is about 50x.
There are lot of subtleties behind both of the two programming languages, and the following are some misunderstandings:
Matlab is a script language but C++ is compiled
Matlab uses JIT compiler to translate your script to machine code, you can improve your speed at most by a factor 1.5 to 2 by using the compiler that Matlab provides.
Matlab code might be able to get fully vectorized but you have to optimize your code by hand in C++
Fully vectorized Matlab code can call libraries written in C++/C/Assembly (for example Intel MKL). But plain C++ code can be reasonably vectorized by modern compilers.
Toolboxes and routines that Matlab provides should be very well tuned and should have reasonable performance
No. Other than linear algebra routines, the performance is generally bad.
The reasons why you can gain 10x~100x performance in C++ comparing to vectorized Matlab code:
Calling external libraries (MKL) in Matlab costs time.
Memory in Matlab is dynamically allocated and freed. For example, small matrices multiplication:
A = B*C + D*E + F*G
requires Matlab to create 2 temporary matrices. And in C++, if you allocate your memory before hand, you create NONE. And now imagine you loop that statement for 1000 times. Another solution in C++ is provided by C++11 Rvalue reference. This is the one of the biggest improvement in C++, now C++ code can be as fast as plain C code.
If you want to do parallel processing, Matlab model is multi-process and the C++ way is multi-thread. If you have many small tasks needing to be parallelized, C++ provides linear gain up to many threads but you might have negative performance gain in Matlab.
Vectorization in C++ involves using intrinsics/assembly, and sometimes SIMD vectorization is only possible in C++.
In C++, it is possible for an experienced programmer to completely avoid L2 cache miss and even L1 cache miss, hence pushing CPU to its theoretical throughput limit. Performance of Matlab can lag behind C++ by a factor of 10x due to this reason alone.
In C++, computational intensive instructions sometimes can be grouped according to their latencies (code carefully in assembly or intrinsics) and dependencies (most of time is done automatically by compiler or CPU hardware), such that theoretical IPC (instructions per clock cycle) could be reached and CPU pipelines are filled.
However, development time in C++ is also a factor of 10x comparing to Matlab!
The reasons why you should use Matlab instead of C++:
Data visualization. I think my career can go on without C++ but I won't be able to survive without Matlab just because it can generate beautiful plots!
Low efficiency but mathematically robust build-in routines and toolboxes. Get the correct answer first and then talk about efficiency. People can make subtle mistakes in C++ (for example implicitly convert double to int) and get sort of correct results.
Express your ideas and present your code to your colleagues. Matlab code is much easier to read and much shorter than C++, and Matlab code can be correctly executed without compiler. I just refuse to read other people's C++ code. I don't even use C++ GNU scientific libraries because the code quality is not guaranteed. It is dangerous for a researcher/engineer to use a C++ library as a black box and take the accuracy as granted. Even for commercial C/C++ libraries, I remember Intel compiler had a sign error in its sin() function last year and numerical accuracy problems also occurred in MKL.
Debugging Matlab script with interactive console and workspace is a lot more efficient than C++ debugger. Finding an index calculation bug in Matlab could be done within minutes, but it could take hours in C++ figuring out why the program crashes randomly if boundary check is removed for the sake of speed.
Last but not the least:
Because once Matlab code is vectorized, there is not much left for a programmer to optimize, Matlab code performance is much less sensitive to the quality of the code comparing with C++ code. Therefore it is best to optimize computation algorithms in Matlab, and marginally better algorithms normally have marginally better performance in Matlab. On the other hand, algorithm test in C++ requires decent programmer to write algorithms optimized more or less in the same way, and to make sure the compiler does not optimize the algorithms differently.
My recent experience in C++ and Matlab:
I made several large Matlab data analysis tools in the past year and suffered from the slow speed of Matlab. But I was able to improve my Matlab program speed by 10x through the following techniques:
Run/profile the Matlab script, re-implement critical routines in C/C++ and compile with MEX. Critical routines are mostly likely logically simple but numerically heavy. This improves speed by 5x.
Simplify ".m" files shipped with Matlab tool boxes by commenting all unnecessary safety checks and output parameter computations. Please be reminded that the modified code cannot be distributed with the rest of the user scripts. This improves speed by another 2x (after C/C++ and MEX).
The improved code is ~98% in Matlab and ~2% in C++.
I believe it is possible to improve the speed by another 2x (total 20x) if the entire tool is coded in C++, this is ~100x speed improvement of the computation routines. The hard drive I/O will then dominate the program run time.
Question for Mathworks engineers:
When Matlab code is fully vectorized, one of the performance limiting factor is the matrix indexing operation. For instance, a finite difference operation needs to be performed on Matrix A which has a dimension of 5000x5000:
B = A(:,2:end)-A(:,1:end-1)
The matrix indexing operation makes the Matlab code multiple times slower than the C++ code. Can the matrix indexing performance be improved?

In my experience (several years of Computer Vision and image processing in both languages) there is no simple answer to this question, as Matlab performance depends strongly (and much more than C++ performance) on your coding style.
Generally, Matlab wraps the classic C++ / Fortran based linear algebra libraries. So anything like x = A\b is going to be very fast. Also, Matlab does a good job in choosing the most efficient solver for these types of problems, so for x = A\b Matlab will look at the size of your matrices and chose the appropriate low-level routines.
Matlab also shines in data manipulation of large matrices if you "vectorize" your code, i.e. if you avoid for loops and use index arrays or boolean arrays to access your data. This stuff is highly optimised.
For other routines, some are written in Matlab code, while others point to a C/C++ implementation (e.g. the Delaunay stuff). You can check this yourself by typing edit some_routine.m. This opens the code and you see whether it is all Matlab or just a wrapper for something compiled.
Matlab, I think, is primarily for comfort - but comfort translates to coding time and ultimately money which is why Matlab is used in the industry. Also, it is easy to learn for engineers from other fields than computer science, with little training in programming.

As a PhD Student too, and a 10years long Matlab user, I'm glad to share my POV:
Matlab is a great tool for developing and prototyping algorithms, especially when dealing with GUIs, high-level analysis (Frequency Domain, LS Optimization etc.): fast coding, powerful syntaxis (think about [],{},: etc.).
As soon as your processing chain is more stable and defined and data dimensions grows move to C/C++.
The main Matlab limit rises when considering its language is script-like: as long as you avoid any cycle (using arrayfun, cellfun or other matrix procedures) performances are high since the called subroutine is again in C/C++.

Your question is difficult to answer. In general C++ is faster, but if make use of the well written algorithms of Matlab it can outperform C++. In some cases Matlab can parallelize your code which has to be done manually in many cases for C++. Mathlab can kind of export C++ code.
So my conclusion is, that you have to measure the performance of both programs to get an answer. But then you compare your two implementations and not Matlab and C++ in general.

Matlab does very well with linear algebra and array/matrix operations, since they seem to have been doing some extra optimizations on the underlying operations - if you want to beat Matlab there, you would need a similarly optimized BLAS/LAPACK library.
As an interpreted language, Matlab loses time whenever a Matlab function is called, due to internal overhead, which traditionally meant that Matlab loops were slow. This has been alleviated somewhat in recent years thanks to significant improvement in the JIT compiler (search for "performance" questions on Matlab on SO for examples). As a consequence of the function call overhead, all Matlab functions that have not been implemented in C/C++ behind the scenes (call edit functionName to see whether it's written in Matlab) risks being slower than a C/C++ counterpart.
Finally, Matlab attempts to be user friendly, and may do "unnecessary" input checking that can take time (due to function call overhead). For example, if you know that ismember gets sorted inputs, you can call ismembc directly (the behind-the-scene compiled function), saving quite a bit of time.

I think you can consider the difference in four folds at least.
Compiled vs Interpreted
Strongly-typed vs Dynamically-typed
Performance vs Fast-prototyping
Special strength
For 1-3 can be easily generalized into comparison between two family of programming languages.
For 4, MATLAB is optimized for matrix operations. So if you can vectorize more code in MATLAB, the performance can be drastically boosted. Conversely, if many loops are required, never hesitate to use C++ or create a mex file.
It is a difficult quesion after all.

I saw a 5.5x speed improvement when switching from MATLAB to C++. This was for a robot controller- lots of loops and ode solving. I spent many hours trying to optimize the MATLAB code, hardly any time optimizing the C++ (I'm sure it could have been 10x faster with a little more effort).
However, it was easy to add a GUI for the MATLAB code, so I still use it more often. Like others have said, it was nice to prototype first on MATLAB. That made the implementation on C++ much simpler.

Besides the speed of the final program, you should also take into account the total development time of your code, ie., not only the time to write, but also to debug, etc. Matlab (and its open-source counterpart, Octave) can be good for quick prototyping due to its visualisation capabilities.
If you're using straight C++ (ie. no matrix libraries), it may take you much longer to write C++ code that's equivalent to Matlab code (eg. there might be no point in spending 10 hours writing C++ code that only runs 10 seconds quicker, compared to a Matlab program that took 5 minutes to write).
However, there are dedicated C++ matrix libraries, such as Armadillo, which provide a Matlab-like API. This can be useful for writing performance critical code that can be called from Matlab, or for converting Matlab code into "real" programs.

Some Matlab code uses standard linear algebra fictions with multithreading built into it. So, it appears that they are faster than a sequential C code.

BLAS+Multiple Precision+MPI

I am writing a scientific application for my Maths PhD in C++, it's based on some heavy linear algebra, mostly BLAS level 3 routines. The sizes of the matrices employed vary considerably, ideally I would like to be able to deal with very large matrices of order 10000 and higher. So far I have used Intel MKL, multi-threaded, scales nicely onto 8 cores. My algorithm produces the correct results, however is very unstable, in double precision arithmetic, due to the accumulating errors, resulting from high powers being taken. Additionally, as I have access to a large supercomputer cluster, and my algorithm can be easily scaled across multiple nodes, I would like to employ MPI to scale the application across hundreds of nodes.
My goal is to find a templated BLAS library that:
Supports Multiple Precision Arithmetic,
Supports Multi-threading,
Supports MPI
My findings so far:
MTL4 - Matrix Template library 4 seems to do all of the above, however the open source edition will only run on one core, and the supercomputing edition is quite costly.
Eigen - appears not to support multicore? Does it support multicore and MPI if linked with MKL?
Armadillo - does all the above?
I would greatly appreciate any insights and recommendations
Kind Regards,
Maria

Depending on your matrix problem, the Tpetra package of Trilinos might be worth a look. It's templated on the scalar type, so you might use multiple precision types. It targets large scale applications on supercomputers so one can expect good parallel performances.
Hope it helps!
Edit: and it's free!

Large Matrix Inversion

I am looking at taking the inverse of a large matrix, common size of 1000 x 1000, but sometimes exceeds 100000 x 100000 (which is currently failing due to time and memory). I know that the normal sentiment is 'don't take the inverse, find some other way to do it', but that is not possible at the moment. The reason for this is due to the usage of software that is already made that expects to get the matrix inverse. (Note: I am looking into ways of changing this, but that will take a long time)
At the moment we are using an LU decomposition method from numerical recopies, and I am currently in the process of testing the eigen library. The eigen library seems to be more stable and a bit faster, but I am still in testing phase for accuracy. I have taken a quick look at other libraries such as ATLAS and LAPACK but have not done any substantial testing with these yet.
It seems as though the eigen library does not use concurrent methods to compute the inverse (though does for LU factorization part of the inverse) and as far as I can tell ATLAS and LAPACK are similar in this limitation. (I am currently testing the speed difference for eigen with openMP and without.)
First question is can anyone explain how it would be possible to optimize matrix inversion by parallelization. I found an article here that talks about matrix inversion parallel algorithms, but I did not understand. It seems this article talks about another method? I am also not sure if scaLAPACK or PETSc are useful?
Second question, I read this article of using the GPUs to increase performance, but I have never coded for GPUs and so have no idea what is trying to convey, but the charts at the bottom looked rather alarming. How is this even possible, and how where do I start to go about implementing something like this if it is to be true.
I also found this article, have yet had the time to read through it to understand, but it seems promising, as memory is a current issue with our software.
Any information about these articles or the problems in general would be of great help. And again I apologize if this question seems vague, I will try to expand more if necessary.

First question is can anyone explain how it would be possible to optimize matrix inversion by parallelization.
I'd hazard a guess that this, and related topics in linear algebra, is one of the most studied topics in parallel computing. If you're stuck looking for somewhere to start reading, well good old Golub and Van Loan have a chapter on the topic. As to whether Scalapack and Petsc are likely to be useful, certainly the former, probably the latter. Of course, they both depend on MPI but that's kind of taken for granted in this field.
Second question ...
Use GPUs if you've got them and you can afford to translate your code into the programming model supported by your GPUs. If you've never coded for GPUs and have access to a cluster of commodity-type CPUs you'll get up to speed quicker by using the cluster than by wrestling with a novel technology.
As for the last article you refer to, it's now 10 years old in a field that changes very quickly (try finding a 10-year old research paper on using GPUs for matrix inversion). I can't comment on its excellence or other attributes, but the problem sizes you mention seem to me to be well within the capabilities of modern clusters for in-core (to use an old term) computation. If your matrices are very big, are they also sparse ?
Finally, I strongly support your apparent intention to use existing off-the-shelf codes rather than to try to develop your own.

100000 x 100000 is 80GB at double precision. You need a library that supports memory-mapped matrices on disk. I can't recommend a particular library and I didn't find anything with quick Google searches. But code from Numerical Recipes certainly isn't going to be adequate.

Regarding the first question (how to parallellize computing the inverse):
I assume you are computing the inverse by doing an LU decomposition of your matrix and then using the decomposition to solve A*B = I where A is your original matrix, B is the matrix you solve for, and I is the identity matrix. Then B is the inverse.
The last step is easy to parallellize. Divide your identity matrix along the columns. If you have p CPUs and your matrix is n-by-n, then every part has n/p columns and n rows. Lets call the parts I1, I2, etc. On every CPU, solve a system of the form A*B1 = I1, this gives you the parts B1, B2, etc., and you can combine them to form B which is the inverse.

An LU decomp on a GPU can be ~10x faster than on a CPU. Although this is now changing, GPU's have traditionally been designed around single precision arithmetic, and so on older hardware single precision arithmetic is generally much faster than double precision arithmetic. Also, storage requirements and performance will be greatly impacted by the structure of your matrices. A sparse 100,000 x 100,000 matrix LU decomp is a reasonable problem to solve and will not require much memory.
Unless you want to become a specialist and spend a lot of time tuning for hardware updates, I would strongly recommend using a commercial library. I would suggest CULA tools. They have both sparse and dense GPU libraries and in fact their free library offers SGETRF - a single precision (dense) LU decomp routine. You'll have to pay for their double precision libraries.

I know it's old post - but really - OpenCL (you download the relevant one based on your graphics card) + OpenMP + Vectorization (not in that order) is the way to go.
Anyhow - for me my experience with matrix anything is really to do with overheads from copying double double arrays in and out the system and also to pad up or initialize matrices with 0s before any commencement of computation - especially when I am working with creating .xll for Excel usage.
If I were to reprioritize the top -
try to vectorize the code (Visual Studio 2012 and Intel C++ has autovectorization - I'm not sure about MinGW or GCC, but I think there are flags for the compiler to analyse your for loops to generate the right assembly codes to use instead of the normal registers to hold your data, to populate your processor's vector registers. I think Excel is doing that because when I monitored Excel's threads while running their MINVERSE(), I notice only 1 thread is used.
I don't know much assembly language - so I don't know how to vectorize manually... (haven't had time to go learn this yet but sooooo wanna do it!)
Parallelize with OpenMP (omp pragma) or MPI or pthreads library (parallel_for) - very simple - but... here's the catch - I realise that if your matrix class is completely single threaded in the first place - then parallelizing the operation like mat multiply or inverse is scrappable - cuz parallelizing will deteriorate the speed due to initializing or copying to or just accessing the non-parallelized matrix class.
But... where parallelization helps is - if you're designing your own matrix class and you parallelize its constructor operation (padding with 0s etc), then your computation of LU(A^-1) = I will also be faster.
It's also mathematically straightforward to also optimize your LU decomposition, and also optimizing ur forward backward substitution for the special case of identity. (I.e. don't waste time creating any identity matrix - analyse where your for (row = col) and evaluate to be a function with 1 and the rest with 0.)
Once it's been parallelized (on the outer layers) - the matrix operations requiring element by element can be mapped to be computed by GPU(SSSSSS) - hundreds of processors to compute elements - beat that!. There is now sample Monte Carlo code available on ATI's website - using ATI's OpenCL - don't worry about porting code to something that uses GeForce - all u gotta do is recompile there.
For 2 and 3 though - remember that overheads are incurred so no point unless you're handling F*K*G HUGE matrices - but I see 100k^2? wow...
Gene

processing an image using CUDA implementation, python (pycuda) or C++?

I am in a project to process an image using CUDA. The project is simply an addition or subtraction of the image.
May I ask your professional opinion, which is best and what would be the advantages and disadvantages of those two?
I appreciate everyone's opinions and/or suggestions since this project is very important to me.

General answer: It doesn't matter. Use the language you're more comfortable with.
Keep in mind, however, that pycuda is only a wrapper around the CUDA C interface, so it may not always be up-to-date, also it adds another potential source of bugs, …
Python is great at rapid prototyping, so I'd personally go for Python. You can always switch to C++ later if you need to.

If the rest of your pipeline is in Python, and you're using Numpy already to speed things up, pyCUDA is a good complement to accelerate expensive operations. However, depending on the size of your images and your program flow, you might not get too much of a speedup using pyCUDA. There is latency involved in passing the data back and forth across the PCI bus that is only made up for with large data sizes.
In your case (addition and subtraction), there are built-in operations in pyCUDA that you can use to your advantage. However, in my experience, using pyCUDA for something non-trivial requires knowing a lot about how CUDA works in the first place. For someone starting from no CUDA knowledge, pyCUDA might be a steep learning curve.

Take a look at openCV, it contains a lot of image processing functions and all the helpers to load/save/display images and operate cameras.
It also now supports CUDA, some of the image processing functions have been reimplemented in CUDA and it gives you a good framework to do your own.

Alex's answer is right. The amount of time consumed in the wrapper is minimal. Note that PyCUDA has some nice metaprogramming constructs for generating kernels which might be useful.
If all you're doing is adding or subtracting elements of an image, you probably shouldn't use CUDA for this at all. The amount of time it takes to transfer back and forth across the PCI-E bus will dwarf the amount of savings you get from parallelism.
Any time you deal with CUDA, it's useful to think about the CGMA ratio (computation to global memory access ratio). Your addition/subtraction is only 1 float point operation for 2 memory accesses (1 read and 1 write). This ends up being very lousy from a CUDA perspective.

What are the most widely used C++ vector/matrix math/linear algebra libraries, and their cost and benefit tradeoffs? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
It seems that many projects slowly come upon a need to do matrix math, and fall into the trap of first building some vector classes and slowly adding in functionality until they get caught building a half-assed custom linear algebra library, and depending on it.
I'd like to avoid that while not building in a dependence on some tangentially related library (e.g. OpenCV, OpenSceneGraph).
What are the commonly used matrix math/linear algebra libraries out there, and why would decide to use one over another? Are there any that would be advised against using for some reason? I am specifically using this in a geometric/time context*(2,3,4 Dim)* but may be using higher dimensional data in the future.
I'm looking for differences with respect to any of: API, speed, memory use, breadth/completeness, narrowness/specificness, extensibility, and/or maturity/stability.
Update
I ended up using Eigen3 which I am extremely happy with.

There are quite a few projects that have settled on the Generic Graphics Toolkit for this. The GMTL in there is nice - it's quite small, very functional, and been used widely enough to be very reliable. OpenSG, VRJuggler, and other projects have all switched to using this instead of their own hand-rolled vertor/matrix math.
I've found it quite nice - it does everything via templates, so it's very flexible, and very fast.
Edit:
After the comments discussion, and edits, I thought I'd throw out some more information about the benefits and downsides to specific implementations, and why you might choose one over the other, given your situation.
GMTL -
Benefits: Simple API, specifically designed for graphics engines. Includes many primitive types geared towards rendering (such as planes, AABB, quatenrions with multiple interpolation, etc) that aren't in any other packages. Very low memory overhead, quite fast, easy to use.
Downsides: API is very focused specifically on rendering and graphics. Doesn't include general purpose (NxM) matrices, matrix decomposition and solving, etc, since these are outside the realm of traditional graphics/geometry applications.
Eigen -
Benefits: Clean API, fairly easy to use. Includes a Geometry module with quaternions and geometric transforms. Low memory overhead. Full, highly performant solving of large NxN matrices and other general purpose mathematical routines.
Downsides: May be a bit larger scope than you are wanting (?). Fewer geometric/rendering specific routines when compared to GMTL (ie: Euler angle definitions, etc).
IMSL -
Benefits: Very complete numeric library. Very, very fast (supposedly the fastest solver). By far the largest, most complete mathematical API. Commercially supported, mature, and stable.
Downsides: Cost - not inexpensive. Very few geometric/rendering specific methods, so you'll need to roll your own on top of their linear algebra classes.
NT2 -
Benefits: Provides syntax that is more familiar if you're used to MATLAB. Provides full decomposition and solving for large matrices, etc.
Downsides: Mathematical, not rendering focused. Probably not as performant as Eigen.
LAPACK -
Benefits: Very stable, proven algorithms. Been around for a long time. Complete matrix solving, etc. Many options for obscure mathematics.
Downsides: Not as highly performant in some cases. Ported from Fortran, with odd API for usage.
Personally, for me, it comes down to a single question - how are you planning to use this. If you're focus is just on rendering and graphics, I like Generic Graphics Toolkit, since it performs well, and supports many useful rendering operations out of the box without having to implement your own. If you need general purpose matrix solving (ie: SVD or LU decomposition of large matrices), I'd go with Eigen, since it handles that, provides some geometric operations, and is very performant with large matrix solutions. You may need to write more of your own graphics/geometric operations (on top of their matrices/vectors), but that's not horrible.

So I'm a pretty critical person, and figure if I'm going to invest in a library, I'd better know what I'm getting myself into. I figure it's better to go heavy on the criticism and light on the flattery when scrutinizing; what's wrong with it has many more implications for the future than what's right. So I'm going to go overboard here a little bit to provide the kind of answer that would have helped me and I hope will help others who may journey down this path. Keep in mind that this is based on what little reviewing/testing I've done with these libs. Oh and I stole some of the positive description from Reed.
I'll mention up top that I went with GMTL despite it's idiosyncrasies because the Eigen2 unsafeness was too big of a downside. But I've recently learned that the next release of Eigen2 will contain defines that will shut off the alignment code, and make it safe. So I may switch over.
Update: I've switched to Eigen3. Despite it's idiosyncrasies, its scope and elegance are too hard to ignore, and the optimizations which make it unsafe can be turned off with a define.
Eigen2/Eigen3
Benefits: LGPL MPL2, Clean, well designed API, fairly easy to use. Seems to be well maintained with a vibrant community. Low memory overhead. High performance. Made for general linear algebra, but good geometric functionality available as well. All header lib, no linking required.
Idiocyncracies/downsides: (Some/all of these can be avoided by some defines that are available in the current development branch Eigen3)
Unsafe performance optimizations result in needing careful following of rules. Failure to follow rules causes crashes.
you simply cannot safely pass-by-value
use of Eigen types as members requires special allocator customization (or you crash)
use with stl container types and possibly other templates required
special allocation customization (or you will crash)
certain compilers need special care to prevent crashes on function calls (GCC windows)
GMTL
Benefits: LGPL, Fairly Simple API, specifically designed for graphics engines.
Includes many primitive types geared towards rendering (such as
planes, AABB, quatenrions with multiple interpolation, etc) that
aren't in any other packages. Very low memory overhead, quite fast,
easy to use. All header based, no linking necessary.
Idiocyncracies/downsides:
API is quirky
what might be myVec.x() in another lib is only available via myVec[0] (Readability problem)
an array or stl::vector of points may cause you to do something like pointsList[0][0] to access the x component of the first point
in a naive attempt at optimization, removed cross(vec,vec) and
replaced with makeCross(vec,vec,vec) when compiler eliminates
unnecessary temps anyway
normal math operations don't return normal types unless you shut
off some optimization features e.g.: vec1 - vec2 does not return a
normal vector so length( vecA - vecB ) fails even though vecC = vecA -
vecB works. You must wrap like: length( Vec( vecA - vecB ) )
operations on vectors are provided by external functions rather than
members. This may require you to use the scope resolution everywhere
since common symbol names may collide
you have to do
length( makeCross( vecA, vecB ) )
or
gmtl::length( gmtl::makeCross( vecA, vecB ) )
where otherwise you might try
vecA.cross( vecB ).length()
not well maintained
still claimed as "beta"
documentation missing basic info like which headers are needed to
use normal functionalty
Vec.h does not contain operations for Vectors, VecOps.h contains
some, others are in Generate.h for example. cross(vec&,vec&,vec&) in
VecOps.h, [make]cross(vec&,vec&) in Generate.h
immature/unstable API; still changing.
For example "cross" has moved from "VecOps.h" to "Generate.h", and
then the name was changed to "makeCross". Documentation examples fail
because still refer to old versions of functions that no-longer exist.
NT2
Can't tell because they seem to be more interested in the fractal image header of their web page than the content. Looks more like an academic project than a serious software project.
Latest release over 2 years ago.
Apparently no documentation in English though supposedly there is something in French somewhere.
Cant find a trace of a community around the project.
LAPACK & BLAS
Benefits: Old and mature.
Downsides:
old as dinosaurs with really crappy APIs

For what it's worth, I've tried both Eigen and Armadillo. Below is a brief evaluation.
Eigen
Advantages:
1. Completely self-contained -- no dependence on external BLAS or LAPACK.
2. Documentation decent.
3. Purportedly fast, although I haven't put it to the test.
Disadvantage:
The QR algorithm returns just a single matrix, with the R matrix embedded in the upper triangle. No idea where the rest of the matrix comes from, and no Q matrix can be accessed.
Armadillo
Advantages:
1. Wide range of decompositions and other functions (including QR).
2. Reasonably fast (uses expression templates), but again, I haven't really pushed it to high dimensions.
Disadvantages:
1. Depends on external BLAS and/or LAPACK for matrix decompositions.
2. Documentation is lacking IMHO (including the specifics wrt LAPACK, other than changing a #define statement).
Would be nice if an open source library were available that is self-contained and straightforward to use. I have run into this same issue for 10 years, and it gets frustrating. At one point, I used GSL for C and wrote C++ wrappers around it, but with modern C++ -- especially using the advantages of expression templates -- we shouldn't have to mess with C in the 21st century. Just my tuppencehapenny.

If you are looking for high performance matrix/linear algebra/optimization on Intel processors, I'd look at Intel's MKL library.
MKL is carefully optimized for fast run-time performance - much of it based on the very mature BLAS/LAPACK fortran standards. And its performance scales with the number of cores available. Hands-free scalability with available cores is the future of computing and I wouldn't use any math library for a new project doesn't support multi-core processors.
Very briefly, it includes:
Basic vector-vector, vector-matrix,
and matrix-matrix operations
Matrix factorization (LU decomp, hermitian,sparse)
Least squares fitting and eigenvalue problems
Sparse linear system solvers
Non-linear least squares solver (trust regions)
Plus signal processing routines such as FFT and convolution
Very fast random number generators (mersenne twist)
Much more.... see: link text
A downside is that the MKL API can be quite complex depending on the routines that you need. You could also take a look at their IPP (Integrated Performance Primitives) library which is geared toward high performance image processing operations, but is nevertheless quite broad.
Paul
CenterSpace Software ,.NET Math libraries, centerspace.net

What about GLM?
It's based on the OpenGL Shading Language (GLSL) specification and released under the MIT license.
Clearly aimed at graphics programmers

I've heard good things about Eigen and NT2, but haven't personally used either. There's also Boost.UBLAS, which I believe is getting a bit long in the tooth. The developers of NT2 are building the next version with the intention of getting it into Boost, so that might count for somthing.
My lin. alg. needs don't exteed beyond the 4x4 matrix case, so I can't comment on advanced functionality; I'm just pointing out some options.

I'm new to this topic, so I can't say a whole lot, but BLAS is pretty much the standard in scientific computing. BLAS is actually an API standard, which has many implementations. I'm honestly not sure which implementations are most popular or why.
If you want to also be able to do common linear algebra operations (solving systems, least squares regression, decomposition, etc.) look into LAPACK.

I'll add vote for Eigen: I ported a lot of code (3D geometry, linear algebra and differential equations) from different libraries to this one - improving both performance and code readability in almost all cases.
One advantage that wasn't mentioned: it's very easy to use SSE with Eigen, which significantly improves performance of 2D-3D operations (where everything can be padded to 128 bits).

Okay, I think I know what you're looking for. It appears that GGT is a pretty good solution, as Reed Copsey suggested.
Personally, we rolled our own little library, because we deal with rational points a lot - lots of rational NURBS and Beziers.
It turns out that most 3D graphics libraries do computations with projective points that have no basis in projective math, because that's what gets you the answer you want. We ended up using Grassmann points, which have a solid theoretical underpinning and decreased the number of point types. Grassmann points are basically the same computations people are using now, with the benefit of a robust theory. Most importantly, it makes things clearer in our minds, so we have fewer bugs. Ron Goldman wrote a paper on Grassmann points in computer graphics called "On the Algebraic and Geometric Foundations of Computer Graphics".
Not directly related to your question, but an interesting read.

FLENS
http://flens.sf.net
It also implements a lot of LAPACK functions.

I found this library quite simple and functional (http://kirillsprograms.com/top_Vectors.php). These are bare bone vectors implemented via C++ templates. No fancy stuff - just what you need to do with vectors (add, subtract multiply, dot, etc).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js