Does the cuda math function norm3df overflow?


I am working on an n-body simulator in CUDA. I want to use float types for the speed benefits, but this is making my task difficult. What I am worried about is this: say I have a vector <10^20, 10^20, 10^20> and I want to compute its magnitude using the Pythagorean theorem. I would have to square each of the components, giving 10^40, and in 32-bit floating point this would just be infinity. So even though the final result, after taking the square root of the sum, would be in range, the intermediate step would overflow. I came across the following function in the CUDA math API: norm3df(x, y, z). Would this prevent the intermediate overflow I am talking about? Also, I might need to use this function on the host as well as the device. Would the behavior be the same?

The standard C++ math library contains a function hypot() for the computation of 2D norms while avoiding premature underflow and overflow in intermediate computations. Because 3D norms are also commonly encountered, the CUDA math library offers in addition an analogous function norm3d(). The description in the CUDA math API documentation reads:
Calculate the length of three dimensional vector p in euclidean space
without undue overflow or underflow
Further, the CUDA math library offers reciprocal norm functions rhypot() and rnorm3d() that are useful when normalizing 2D and 3D vectors, as they allow replacing an expensive division with a much cheaper multiplication.
As norm3d(), rhypot(), and rnorm3d() are not standard C++ math library functions, they cannot be used in the host portion of CUDA programs, as host code is processed by the host toolchain. NVIDIA provides math library support for the device. You may want to file an enhancement request with the vendor of your host toolchain to add these useful functions as proprietary extensions, and/or lobby the ISO C/C++ committees to have them added to future versions of the standard.
It has previously come to my attention that currently shipping CUDA header files seem to erroneously mark norm3d() and a few other CUDA-specific functions as __host__ __device__, although there is in fact no host implementation. This would appear to be a bug, likely caused by cut & paste application of these attributes to the prototypes.
The norm and reciprocal norm functions do not require higher intermediate precision in their internal computation, meaning there is no negative performance impact on GPUs with low-throughput double precision. Instead, they use clever rearrangements of the mathematics, re-scaling of the operands, and use of FMA to achieve their goal. Not only do they prevent undue overflow and underflow, they should also be more accurate than the equivalent naive computation.
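For host code that needs the same protection, the rescaling idea can be sketched in plain C++. This is only the textbook "divide by the largest component" approach, not NVIDIA's implementation, and the name scaled_norm3f is made up for illustration. Where a C++17 toolchain is available, the three-argument std::hypot overload is another option.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical host-side helper (not part of CUDA or the C++ standard library).
// Computes sqrt(x*x + y*y + z*z) without overflowing or underflowing in the
// intermediate squares, by rescaling with the largest component magnitude.
inline float scaled_norm3f(float x, float y, float z) {
    const float ax = std::fabs(x);
    const float ay = std::fabs(y);
    const float az = std::fabs(z);
    const float m = std::max(ax, std::max(ay, az));
    // All components zero, or the largest one infinite: the norm is m itself.
    if (m == 0.0f || std::isinf(m)) return m;
    // After scaling, every ratio lies in [0, 1], so the squares stay in range.
    const float rx = ax / m;
    const float ry = ay / m;
    const float rz = az / m;
    return m * std::sqrt(rx * rx + ry * ry + rz * rz);
}
```

With the vector from the question, scaled_norm3f(1e20f, 1e20f, 1e20f) stays finite, whereas squaring the components directly in float would overflow. The three divisions make this slower and slightly less accurate than a tuned library routine, so treat it as a fallback rather than a replacement.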
Up to and including CUDA version 6.5, implementation details of the CUDA math library were visible in the CUDA header files math_functions.h and math_functions_dbl_ptx3.h, so anybody who would like to get a better idea of the internal details of norm functions may want to look there.

Related

Fixed size SVD and solver in CUDA (in the device)

I implemented a program on the GPU (CUDA) which only uses the host (in C++) to start new kernels. During the calculation on the device I need SVD and solving systems of 3x3 (dense) matrices, fixed size.
I've got my own SVD and solver implementation, but it is not numerically stable (and thus not usable). Since I am rather new to C++ and CUDA, I would prefer to use a library instead (numerical stuff is very tricky).
Now I have trouble finding that library:
cuSOLVER is not callable from the device
CULA is not callable from the device (and abandoned, so it seems)
Eigen looks promising (should be callable from device?) but it is unclear what the status is on CUDA support (it says experimental). I find people saying it works, others got compile errors?
Preferably, I would also be able to do general matrix operations with the library (transpose, inversion, sum, multiply, ...), as my own implementations will likely be less efficient and less numerically stable.
Any ideas on how to achieve this?
UPDATE:
It seems that Eigen supports basic operations like *, +, transpose, and even eigenvalues, but SVD, inverse, etc. are not yet supported at the time of writing.

According to the website, a subset of features works for fixed size matrices (3x3 in your case) from Eigen 3.3. The current stable release is 3.2.6 while 3.3 is in alpha. I don't know if specifically SVD is supported in CUDA. I would recommend trying a small MCVE to see if it works (as well as the other functions you require), and if so, implementing it in your project.

I'm having a similar problem: I want to generate random vectors within a kernel function, which requires performing Cholesky/eigenvalue decompositions of NxN (N<=5) covariance matrices. Since, as you noted, the MAGMA and CULA libraries are not available from the device, and there seems to be no cuSOLVER device API yet, I've resorted to implementing these myself, following algorithms outlined in, for example, Numerical Recipes in C. As for solving linear systems, I'd suggest checking out cuBLAS (level 2 functions), as it provides some basic functionality. If you want to invert matrices, I'd suggest cublas<t>matinvBatched(). I haven't used it myself and will give it a try during the weekend, but from the description it sounds promising. Hope others will chime in on this thread with better solutions...
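For small fixed-size covariance matrices, a hand-written Cholesky factorization along those textbook lines might look like the sketch below. It is plain host C++ with a made-up Mat alias; a device version would presumably swap std::array for raw arrays and add the appropriate CUDA function qualifiers, which is not shown or tested here.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical fixed-size, row-major N x N matrix type used only for this sketch.
template <std::size_t N>
using Mat = std::array<std::array<double, N>, N>;

// Cholesky factorization A = L * L^T of a symmetric positive-definite matrix.
// Only the lower triangle of A is read. Returns false as soon as a pivot is
// not strictly positive, i.e. A is not (numerically) positive definite.
template <std::size_t N>
bool cholesky(const Mat<N>& A, Mat<N>& L) {
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) L[i][j] = 0.0;
    }
    for (std::size_t j = 0; j < N; ++j) {
        // Diagonal entry: subtract the squares already placed in row j.
        double d = A[j][j];
        for (std::size_t k = 0; k < j; ++k) d -= L[j][k] * L[j][k];
        if (!(d > 0.0)) return false;  // also rejects NaN
        L[j][j] = std::sqrt(d);
        // Entries below the diagonal in column j.
        for (std::size_t i = j + 1; i < N; ++i) {
            double s = A[i][j];
            for (std::size_t k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            L[i][j] = s / L[j][j];
        }
    }
    return true;
}
```

Multiplying the factor L by a vector of independent standard normal samples then gives a sample with covariance A, which is the use case described above. The early return doubles as a cheap check that the input really is positive definite.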

Difference between computation results of MATLAB code and C(C++) with IPP code

I need to increase the computation speed of some MATLAB code. For this purpose I rewrote my program in C, using the Intel IPP library for vector operations. And here I ran into a problem:
after some number of steps of the main computation loop, the MATLAB program and my C program take different paths through the algorithm. This happens because the computations are not exactly equal, so my program accumulates error relative to the MATLAB results. For this reason, my program doesn't compute the correct gradient and the whole optimization algorithm doesn't converge well. So I gained computation speed but lost accuracy: at the 100th step MATLAB reaches an optimization error of 0.004, while the C program reaches 0.05, and this matters for my task.
I checked which functions introduce the error, and this is what I found: basic operations (like ippsAdd_64f_A53, ippsSub_64f_A53, ippsMul_64f_A53, ippsDiv_64f_A53 and the usual C operators +, -, *, /) match the MATLAB results with a summed error of zero, but the math.h hyperbolic functions give a summed error of about -3..-5e-13 on an array of 75699 elements. Intel functions such as ippsCosh_64f_A53 give a summed error of about -1..-5e-14.
Do you know of a library that computes hyperbolic and exponential functions with high precision? Or maybe there are some compiler settings in Visual Studio 2012 that can help me?
All computations are done in the Ipp64f data type (double) in VS 2012 with Intel Parallel Studio XE 2013 installed.
P.S.: The summed error was computed in MATLAB. I saved the arrays from my C program to a level 4 MAT-file, imported them into MATLAB, and summed the difference between the MATLAB array and the imported array: sum(M_cosh - C_cosh);

Not an answer, more of an extended comment:
You write
I need to increase computation speed of MATLAB code
and ask
Do you know a library to compute high precision trigonometric and
exponent functions?
Yes, I know of several such libraries, but they implement floating-point numbers with more bits than are typically provided on current CPUs (mainly 32- and 64-bit), and they implement the arithmetic on these numbers in software. For your purpose of increasing computation speed, such libraries are useless: their increased precision is explicitly bought at the cost of increased execution time. For many other users that's a reasonable trade-off.
I don't know of any widely-used or well-regarded libraries which implement precision-preserving algorithms on machine-numbers. There isn't space here to go into any detail, but for an introduction to the problem you could do worse than start reading about Kahan's summation algorithm.
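As a starting point, Kahan's compensated summation can be sketched in a few lines of C++. The function names are made up for this example, and the sketch assumes the compiler is not allowed to reassociate floating-point arithmetic (for instance, no -ffast-math or /fp:fast), since such options can optimize the correction term away.

```cpp
#include <vector>

// Plain left-to-right accumulation, shown for comparison.
double naive_sum(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum;
}

// Kahan (compensated) summation: c carries the low-order bits that were
// lost when the previous term was added to the running sum.
double kahan_sum(const std::vector<double>& v) {
    double sum = 0.0;
    double c = 0.0;
    for (double x : v) {
        const double y = x - c;    // apply the correction from the last step
        const double t = sum + y;  // low-order bits of y may be lost here
        c = (t - sum) - y;         // recover what was lost (with opposite sign)
        sum = t;
    }
    return sum;
}
```

On typical IEEE-754 double arithmetic, summing 1.0 followed by a million terms of 1e-16 leaves the naive loop stuck at 1.0, because each term is smaller than half a unit in the last place, while the compensated loop ends up very close to 1.0000000001.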
The Mathworks are somewhat coy about revealing what algorithms Matlab implements. However most of the computational kernels of Matlab are written in C (or C++, I believe) and compiled into libraries. Many of them are now multi-threaded too. If you are trying to write code to outperform Matlab you will have to write multi-threaded, high-performance numerical code.
It wouldn't surprise me at all to learn that the algorithms that Matlab implements do have precision-preserving capabilities. The Mathworks are, after all, trying to offer the market a tool which will solve a wide range of problems without the user having to consider low-level issues such as whether or not machine-precision is good enough for a particular combination of problem and dataset.
Finally. It doesn't surprise me that your first attempts were unsuccessful, though beating Matlab for speed is impressive. And I look forward, sceptically, to being pleasantly surprised when you report success, a code of your own which outperforms Matlab in time and produces satisfactory results.

GPGPU programming architecture for HSA in C++ for Matrix Math

GPU Compute Programmers,
I have a C++ program which currently relies on the ACML (LAPACK) to invert and multiply fairly large matrices of single precision fp values (e.g. 4,000 x 4,000). These matrices are very sparse, although they do not always fit nicely into a diagonal matrix, so I cannot presently reduce them. The other thing about this program is that I have to do this invert and multiply several times (serially) as part of a Newton-Raphson iteration. However, I have several thousand permutations which can be done in parallel, each with a small change to the matrix before again calculating and inverting the Jacobian. This is all single precision fp, and seems perfectly suited for the GPU. My question is this...
I suspect I will need to use the AMD Accelerated Parallel Processing Math Libraries (APPML) for OpenCL, as that is the only thing (non-CUDA, I want to be GPU agnostic) I know of which is available with BLAS functionality. My problem is that I do not see the LAPACK dgetrf and dgetri functions included in APPML (yes, these are fp64, but I don't need that precision). Would C++ AMP be a better alternative? I am very interested in the HSA feature of passing pointers rather than copying data, as there is a lot of data in flight here and some calculations are still done on the CPU. I believe that copy overhead would kill me otherwise. Ultimately, performance is the key, and I want to make the right architectural decisions to set myself up for the most performance I can wring out of the HSA GPUs coming out over the next 6 months.
I am using VS 2013 Ultimate preview and would be able to take advantage of C++ AMP for these HSA capabilities. I just want to make sure I am making the right long term architectural decision now while my program is in its infancy. Here is a link and snippet from some interesting data I found on Anandtech:
http://anandtech.com/show/7118/windows-81-and-vs2013-bring-gpu-computing-updates-to-direct3d-and-c-amp-
C++ AMP, Microsoft's C++ extension for GPU computing, has also been updated with the upcoming VS2013. I think the biggest feature update is that C++ AMP programs will also gain a shared memory feature on APUs/SoCs where the compiler and runtime will be able to eliminate extra data copies between CPU and GPU. This feature will also be available only on Windows 8.1 and it is likely built on top of the "map default buffer" as Microsoft's AMP implementation uses Direct3D under the hood. C++ AMP also brings some other nice additions including enhanced texture support and better debugging abilities.
Any thoughts, additional questions or discussion would be greatly appreciated!

BLAS+Multiple Precision+MPI

I am writing a scientific application for my Maths PhD in C++. It is based on some heavy linear algebra, mostly BLAS level 3 routines. The sizes of the matrices employed vary considerably; ideally I would like to be able to deal with very large matrices of order 10000 and higher. So far I have used Intel MKL, multi-threaded, which scales nicely onto 8 cores. My algorithm produces the correct results but is very unstable in double precision arithmetic, due to accumulating errors that result from high powers being taken. Additionally, as I have access to a large supercomputer cluster and my algorithm can be easily scaled across multiple nodes, I would like to employ MPI to scale the application across hundreds of nodes.
My goal is to find a templated BLAS library that:
Supports Multiple Precision Arithmetic,
Supports Multi-threading,
Supports MPI
My findings so far:
MTL4 - Matrix Template library 4 seems to do all of the above, however the open source edition will only run on one core, and the supercomputing edition is quite costly.
Eigen - appears not to support multicore? Does it support multicore and MPI if linked with MKL?
Armadillo - does all the above?
I would greatly appreciate any insights and recommendations
Kind Regards,
Maria

Depending on your matrix problem, the Tpetra package of Trilinos might be worth a look. It's templated on the scalar type, so you might be able to use multiple precision types. It targets large-scale applications on supercomputers, so one can expect good parallel performance.
Hope it helps!
Edit: and it's free!

What are the most widely used C++ vector/matrix math/linear algebra libraries, and their cost and benefit tradeoffs? [closed]

It seems that many projects slowly come upon a need to do matrix math, and fall into the trap of first building some vector classes and slowly adding in functionality until they get caught building a half-assed custom linear algebra library, and depending on it.
I'd like to avoid that while not building in a dependence on some tangentially related library (e.g. OpenCV, OpenSceneGraph).
What are the commonly used matrix math/linear algebra libraries out there, and why would one decide to use one over another? Are there any that would be advised against using for some reason? I am specifically using this in a geometric/time context (2, 3, 4 dimensions) but may be using higher dimensional data in the future.
I'm looking for differences with respect to any of: API, speed, memory use, breadth/completeness, narrowness/specificness, extensibility, and/or maturity/stability.
Update
I ended up using Eigen3 which I am extremely happy with.

There are quite a few projects that have settled on the Generic Graphics Toolkit for this. The GMTL in there is nice - it's quite small, very functional, and has been used widely enough to be very reliable. OpenSG, VRJuggler, and other projects have all switched to using this instead of their own hand-rolled vector/matrix math.
I've found it quite nice - it does everything via templates, so it's very flexible, and very fast.
Edit:
After the comments discussion, and edits, I thought I'd throw out some more information about the benefits and downsides to specific implementations, and why you might choose one over the other, given your situation.
GMTL -
Benefits: Simple API, specifically designed for graphics engines. Includes many primitive types geared towards rendering (such as planes, AABB, quaternions with multiple interpolation, etc) that aren't in any other packages. Very low memory overhead, quite fast, easy to use.
Downsides: API is very focused specifically on rendering and graphics. Doesn't include general purpose (NxM) matrices, matrix decomposition and solving, etc, since these are outside the realm of traditional graphics/geometry applications.
Eigen -
Benefits: Clean API, fairly easy to use. Includes a Geometry module with quaternions and geometric transforms. Low memory overhead. Full, highly performant solving of large NxN matrices and other general purpose mathematical routines.
Downsides: May be a bit larger scope than you are wanting (?). Fewer geometric/rendering specific routines when compared to GMTL (ie: Euler angle definitions, etc).
IMSL -
Benefits: Very complete numeric library. Very, very fast (supposedly the fastest solver). By far the largest, most complete mathematical API. Commercially supported, mature, and stable.
Downsides: Cost - not inexpensive. Very few geometric/rendering specific methods, so you'll need to roll your own on top of their linear algebra classes.
NT2 -
Benefits: Provides syntax that is more familiar if you're used to MATLAB. Provides full decomposition and solving for large matrices, etc.
Downsides: Mathematical, not rendering focused. Probably not as performant as Eigen.
LAPACK -
Benefits: Very stable, proven algorithms. Been around for a long time. Complete matrix solving, etc. Many options for obscure mathematics.
Downsides: Not as highly performant in some cases. Ported from Fortran, with odd API for usage.
Personally, for me, it comes down to a single question - how are you planning to use this? If your focus is just on rendering and graphics, I like Generic Graphics Toolkit, since it performs well and supports many useful rendering operations out of the box without having to implement your own. If you need general purpose matrix solving (i.e. SVD or LU decomposition of large matrices), I'd go with Eigen, since it handles that, provides some geometric operations, and is very performant with large matrix solutions. You may need to write more of your own graphics/geometric operations (on top of their matrices/vectors), but that's not horrible.

So I'm a pretty critical person, and figure if I'm going to invest in a library, I'd better know what I'm getting myself into. I figure it's better to go heavy on the criticism and light on the flattery when scrutinizing; what's wrong with it has many more implications for the future than what's right. So I'm going to go overboard here a little bit to provide the kind of answer that would have helped me and I hope will help others who may journey down this path. Keep in mind that this is based on what little reviewing/testing I've done with these libs. Oh and I stole some of the positive description from Reed.
I'll mention up top that I went with GMTL despite its idiosyncrasies because the Eigen2 unsafeness was too big of a downside. But I've recently learned that the next release of Eigen2 will contain defines that will shut off the alignment code, and make it safe. So I may switch over.
Update: I've switched to Eigen3. Despite its idiosyncrasies, its scope and elegance are too hard to ignore, and the optimizations which make it unsafe can be turned off with a define.
Eigen2/Eigen3
Benefits: MPL2 (previously LGPL), clean, well designed API, fairly easy to use. Seems to be well maintained with a vibrant community. Low memory overhead. High performance. Made for general linear algebra, but good geometric functionality available as well. All header lib, no linking required.
Idiosyncrasies/downsides: (Some/all of these can be avoided by some defines that are available in the current development branch Eigen3)
Unsafe performance optimizations result in needing careful following of rules. Failure to follow rules causes crashes.
you simply cannot safely pass-by-value
use of Eigen types as members requires special allocator customization (or you crash)
use with stl container types and possibly other templates requires special allocation customization (or you will crash)
certain compilers need special care to prevent crashes on function calls (GCC windows)
GMTL
Benefits: LGPL, Fairly Simple API, specifically designed for graphics engines. Includes many primitive types geared towards rendering (such as planes, AABB, quaternions with multiple interpolation, etc) that aren't in any other packages. Very low memory overhead, quite fast, easy to use. All header based, no linking necessary.
Idiosyncrasies/downsides:
API is quirky
what might be myVec.x() in another lib is only available via myVec[0] (Readability problem)
an array or stl::vector of points may cause you to do something like pointsList[0][0] to access the x component of the first point
in a naive attempt at optimization, removed cross(vec,vec) and replaced with makeCross(vec,vec,vec) when compiler eliminates unnecessary temps anyway
normal math operations don't return normal types unless you shut off some optimization features, e.g.: vec1 - vec2 does not return a normal vector, so length( vecA - vecB ) fails even though vecC = vecA - vecB works. You must wrap like: length( Vec( vecA - vecB ) )
operations on vectors are provided by external functions rather than members. This may require you to use the scope resolution everywhere since common symbol names may collide
you have to do
length( makeCross( vecA, vecB ) )
or
gmtl::length( gmtl::makeCross( vecA, vecB ) )
where otherwise you might try
vecA.cross( vecB ).length()
not well maintained
still claimed as "beta"
documentation missing basic info like which headers are needed to use normal functionality
Vec.h does not contain operations for Vectors, VecOps.h contains some, others are in Generate.h for example. cross(vec&,vec&,vec&) in VecOps.h, [make]cross(vec&,vec&) in Generate.h
immature/unstable API; still changing.
For example "cross" has moved from "VecOps.h" to "Generate.h", and then the name was changed to "makeCross". Documentation examples fail because they still refer to old versions of functions that no longer exist.
NT2
Can't tell because they seem to be more interested in the fractal image header of their web page than the content. Looks more like an academic project than a serious software project.
Latest release over 2 years ago.
Apparently no documentation in English though supposedly there is something in French somewhere.
Can't find a trace of a community around the project.
LAPACK & BLAS
Benefits: Old and mature.
Downsides:
old as dinosaurs with really crappy APIs

For what it's worth, I've tried both Eigen and Armadillo. Below is a brief evaluation.
Eigen
Advantages:
1. Completely self-contained -- no dependence on external BLAS or LAPACK.
2. Documentation decent.
3. Purportedly fast, although I haven't put it to the test.
Disadvantage:
The QR algorithm returns just a single matrix, with the R matrix embedded in the upper triangle. No idea where the rest of the matrix comes from, and no Q matrix can be accessed.
Armadillo
Advantages:
1. Wide range of decompositions and other functions (including QR).
2. Reasonably fast (uses expression templates), but again, I haven't really pushed it to high dimensions.
Disadvantages:
1. Depends on external BLAS and/or LAPACK for matrix decompositions.
2. Documentation is lacking IMHO (including the specifics wrt LAPACK, other than changing a #define statement).
It would be nice if an open source library were available that is self-contained and straightforward to use. I have run into this same issue for 10 years, and it gets frustrating. At one point, I used GSL for C and wrote C++ wrappers around it, but with modern C++ -- especially using the advantages of expression templates -- we shouldn't have to mess with C in the 21st century. Just my tuppence ha'penny.

If you are looking for high performance matrix/linear algebra/optimization on Intel processors, I'd look at Intel's MKL library.
MKL is carefully optimized for fast run-time performance - much of it based on the very mature BLAS/LAPACK Fortran standards. And its performance scales with the number of cores available. Hands-free scalability with available cores is the future of computing, and I wouldn't use any math library for a new project that doesn't support multi-core processors.
Very briefly, it includes:
Basic vector-vector, vector-matrix, and matrix-matrix operations
Matrix factorization (LU decomp, Hermitian, sparse)
Least squares fitting and eigenvalue problems
Sparse linear system solvers
Non-linear least squares solver (trust regions)
Plus signal processing routines such as FFT and convolution
Very fast random number generators (Mersenne Twister)
Much more.
A downside is that the MKL API can be quite complex depending on the routines that you need. You could also take a look at their IPP (Integrated Performance Primitives) library which is geared toward high performance image processing operations, but is nevertheless quite broad.
Paul
CenterSpace Software, .NET Math libraries, centerspace.net

What about GLM?
It's based on the OpenGL Shading Language (GLSL) specification and released under the MIT license.
Clearly aimed at graphics programmers

I've heard good things about Eigen and NT2, but haven't personally used either. There's also Boost.UBLAS, which I believe is getting a bit long in the tooth. The developers of NT2 are building the next version with the intention of getting it into Boost, so that might count for something.
My lin. alg. needs don't extend beyond the 4x4 matrix case, so I can't comment on advanced functionality; I'm just pointing out some options.

I'm new to this topic, so I can't say a whole lot, but BLAS is pretty much the standard in scientific computing. BLAS is actually an API standard, which has many implementations. I'm honestly not sure which implementations are most popular or why.
If you want to also be able to do common linear algebra operations (solving systems, least squares regression, decomposition, etc.) look into LAPACK.

I'll add vote for Eigen: I ported a lot of code (3D geometry, linear algebra and differential equations) from different libraries to this one - improving both performance and code readability in almost all cases.
One advantage that wasn't mentioned: it's very easy to use SSE with Eigen, which significantly improves performance of 2D-3D operations (where everything can be padded to 128 bits).

Okay, I think I know what you're looking for. It appears that GGT is a pretty good solution, as Reed Copsey suggested.
Personally, we rolled our own little library, because we deal with rational points a lot - lots of rational NURBS and Beziers.
It turns out that most 3D graphics libraries do computations with projective points that have no basis in projective math, because that's what gets you the answer you want. We ended up using Grassmann points, which have a solid theoretical underpinning and decreased the number of point types. Grassmann points are basically the same computations people are using now, with the benefit of a robust theory. Most importantly, it makes things clearer in our minds, so we have fewer bugs. Ron Goldman wrote a paper on Grassmann points in computer graphics called "On the Algebraic and Geometric Foundations of Computer Graphics".
Not directly related to your question, but an interesting read.
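To give a rough idea of what this looks like in code, here is a sketch that assumes the Grassmann points mentioned above can be represented as mass-points (m*x, m*y, m), which behave like ordinary homogeneous coordinates. All type and function names are made up for illustration; this is not our library. A rational quadratic Bezier curve is evaluated by running plain de Casteljau interpolation on the mass-points and dividing by the mass only at the end.

```cpp
#include <array>

// Hypothetical 2D mass-point: position scaled by mass, plus the mass itself.
struct MassPoint {
    double mx, my, m;
};

// Build a mass-point from an affine position (x, y) and a mass (weight) m.
inline MassPoint mass_point(double x, double y, double m) {
    return {m * x, m * y, m};
}

// Linear interpolation acts on all three coordinates alike.
inline MassPoint lerp(const MassPoint& a, const MassPoint& b, double t) {
    return {(1.0 - t) * a.mx + t * b.mx,
            (1.0 - t) * a.my + t * b.my,
            (1.0 - t) * a.m + t * b.m};
}

// Rational quadratic Bezier: de Casteljau on mass-points, then project back
// to an affine point by dividing out the mass.
inline std::array<double, 2> rational_bezier2(const MassPoint& p0,
                                              const MassPoint& p1,
                                              const MassPoint& p2,
                                              double t) {
    const MassPoint q0 = lerp(p0, p1, t);
    const MassPoint q1 = lerp(p1, p2, t);
    const MassPoint r = lerp(q0, q1, t);
    return {r.mx / r.m, r.my / r.m};
}
```

With control points (1,0), (1,1), (0,1) and masses 1, sqrt(2)/2, 1, this traces a quarter of the unit circle. The point of the example is that no special-case code for weights is needed until the final division.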

FLENS
http://flens.sf.net
It also implements a lot of LAPACK functions.

I found this library quite simple and functional (http://kirillsprograms.com/top_Vectors.php). These are bare-bones vectors implemented via C++ templates. No fancy stuff - just what you need to do with vectors (add, subtract, multiply, dot, etc).
