GLSL reciprocal trigonometric functions

I want to express a formula containing sec(x) in my GLSL shader. I assume that GLSL does not have such reciprocal trigonometric functions, and there seems to be no extension providing them. No sec(x), no csc(x), no cot(x). We are forced to use 1/cos(x), 1/sin(x) and cos(x)/sin(x) respectively.
Am I understanding the situation correctly? Should I rephrase my formulas without reciprocal trigonometric functions to avoid doing a costly 1/x division? Am I overthinking it and doing placebo optimizations?
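For concreteness, here is how those identities might be wrapped as small helper functions. This is only a sketch in C-style code with names of my own choosing; the same one-line bodies also work as GLSL functions (drop the include and the std:: qualifiers).

```
#include <cmath>

// Hypothetical helpers built from the identities above; not built-ins.
inline float sec(float x) { return 1.0f / std::cos(x); }
inline float csc(float x) { return 1.0f / std::sin(x); }
inline float cot(float x) { return std::cos(x) / std::sin(x); }
```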

Related

Does the cuda math function norm3df overflow?

I am working on an n-body simulator in CUDA. I want to use float types for the speed benefits, but this is making my task difficult. What I am worried about is this: say I have a vector <10^20, 10^20, 10^20> and I want to compute its magnitude using the Pythagorean theorem. I would have to square each of the components, which would give 10^40, and in 32-bit floating point that would just be infinity. So even though the final result, once I take the square root of the sum, would be in range, the intermediate step would overflow. I came across the following function in the CUDA math API: norm3df(x, y, z). Would this prevent the intermediate overflow I am talking about? Also, I might need to use this function on the host as well as the device. Would the behavior be the same?
The standard C++ math library contains a function hypot() for the computation of 2D norms while avoiding premature underflow and overflow in intermediate computations. Because 3D norms are also commonly encountered, the CUDA math library offers in addition an analogous function norm3d(). The description in the CUDA math API documentation reads:
Calculate the length of three dimensional vector p in euclidean space without undue overflow or underflow
Further, the CUDA math library offers reciprocal norm functions rhypot() and rnorm3d() that are useful when normalizing 2D and 3D vectors, as they allow replacing an expensive division with a much cheaper multiplication.
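For example, a device-side normalization might look like the sketch below. It assumes the single-precision variant rnorm3df(); this is an illustration of the pattern, not a quote from the CUDA documentation.

```
#include <cuda_runtime.h>

// Sketch: normalize a float3 with one reciprocal-norm call and three
// multiplications, instead of a norm call followed by three divisions.
__device__ float3 normalized(float3 v) {
    float r = rnorm3df(v.x, v.y, v.z);   // 1 / ||v||, without undue overflow
    return make_float3(v.x * r, v.y * r, v.z * r);
}
```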
As norm3d(), rhypot(), and rnorm3d() are not standard C++ math library functions, they cannot be used in the host portion of CUDA programs, as host code is processed by the host toolchain. NVIDIA provides math library support for the device. You may want to file an enhancement request with the vendor of your host toolchain to add these useful functions as proprietary extensions, and/or lobby the ISO C/C++ committees to have them added to future versions of the standard.
It has previously come to my attention that currently shipping CUDA header files seem to erroneously mark norm3d() and a few other CUDA-specific functions as __host__ __device__, although there is in fact no host implementation. This would appear to be a bug, likely caused by cut & paste application of these attributes to the prototypes.
The norm and reciprocal norm functions do not require higher intermediate precision in their internal computation, meaning there is no negative performance impact on GPUs with low-throughput double precision. Instead, they use clever rearrangements of the mathematics, re-scaling of the operands, and use of FMA to achieve their goal. Not only do they prevent undue overflow and underflow, they should also be more accurate than the equivalent naive computation.
Up to and including CUDA version 6.5, implementation details of the CUDA math library were visible in the CUDA header files math_functions.h and math_functions_dbl_ptx3.h, so anybody who would like to get a better idea of the internal details of norm functions may want to look there.
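For host code, the same overflow-avoidance idea can be reproduced by hand with the classic rescaling trick; the sketch below is my own illustration of the technique, not the CUDA library's implementation. (C++17 later added a three-argument std::hypot(x, y, z) overload that covers this case in standard C++.)

```
#include <cmath>
#include <algorithm>

// Scale by the largest-magnitude component so each squared term is <= 1,
// then multiply the result back; this avoids overflow of the squares.
float norm3f_safe(float x, float y, float z) {
    float ax = std::fabs(x), ay = std::fabs(y), az = std::fabs(z);
    float m  = std::max(ax, std::max(ay, az));
    if (m == 0.0f) return 0.0f;                      // zero vector
    ax /= m; ay /= m; az /= m;
    return m * std::sqrt(ax * ax + ay * ay + az * az);
}
```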

C++ armadillo not correctly solving poorly conditioned matrix

I have a relatively simple question regarding the linear solver built into Armadillo. I am a relative newcomer to C++ but have experience coding in other languages. I am solving a fluid flow problem by successive linearization, using the Armadillo function solve(A,b) to get the solution at each iteration.
The issue that I am running into is that my matrix is very ill-conditioned. The determinant is on the order of 10^-20 and the condition number is 75000. I know these are terrible conditions but it's what I've got. Does anyone know if it is possible to specify the precision in my A matrix and in the solve function to something beyond double (long double perhaps)? I know that there are double matrix classes in Armadillo but I haven't found any documentation for higher levels of precision.
To approach this from another angle, I wrote some code in Mathematica and the LinearSolve worked very well and the program converged to the correct answer. My reasoning is that Mathematica variables have higher precision which can handle the higher levels of rounding error.
If anyone has any insight on this, please let me know. I know there are other ways to approach a poorly conditioned matrix (like preconditioning and pivoting), but my work is more in the physics than in the actual numerical solution so I'm trying to steer clear of that.
EDIT: I just limited the precision in the Mathematica version to 15 decimal places and the program still converges. This leads me to believe it is NOT a variable precision question but rather an issue with the method.
As you said "your work is more in the physics": rather than trying to increase the accuracy, I would use the Moore-Penrose Pseudo-Inverse, which in Armadillo can be obtained by the function pinv. You should then experience a bit with the parameter tolerance to set it to a reasonable level.
The geometrical interpretation is as follows: bad condition numbers are due to the fact that the row/column vectors are (nearly) linearly dependent. In physics, such linear dependencies usually have an origin which at least needs to be interpreted. The pseudo-inverse first projects the matrix onto a lower-dimensional space in which the vectors are "less linearly dependent" by dropping all singular vectors with singular values smaller than the parameter tolerance. The resulting matrix has a better condition number, such that the standard inverse can be constructed with fewer problems.
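As a rough sketch of what that might look like in code (the tolerance value here is a placeholder you would need to tune to the singular values of your matrix):

```
#include <armadillo>

// Solve an ill-conditioned system via the Moore-Penrose pseudo-inverse.
arma::vec solve_pinv(const arma::mat& A, const arma::vec& b) {
    double tolerance = 1e-10;                    // assumed cutoff; tune for your problem
    arma::mat A_pinv = arma::pinv(A, tolerance); // drops small singular values
    return A_pinv * b;
}
```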

GLSL Double Precision Angle, Trig and Exponential Functions Workaround

In GLSL there's rudimentary support for double-precision variables and operations, which can be found here. However, they also mention: "Double-precision versions of angle, trigonometry, and exponential functions are not supported."
Is there a simple workaround for this, or do I have to write my own functions from scratch?
This link seems to be the best answer.
So yes, you'll need to make your own implementation for those functions.
glibc source may be your friend.
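To give an idea of what "make your own implementation" can look like, here is a rough, low-accuracy sketch of a double-precision sine built only from arithmetic and floor(), the kind of routine you could port into a shader. The crude range reduction and truncated Taylor series limit its accuracy near the ends of the reduced range; a real implementation (see glibc) does far more careful argument reduction.

```
#include <cmath>

// Rough sketch: double-precision sin from scratch (not production quality).
double sin_approx(double x) {
    const double PI = 3.141592653589793;
    // crude range reduction to [-PI, PI)
    x -= 2.0 * PI * std::floor((x + PI) / (2.0 * PI));
    // truncated Taylor series: x - x^3/3! + x^5/5! - x^7/7! + x^9/9!
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0
             + x2 * (-1.0 / 5040.0 + x2 * (1.0 / 362880.0)))));
}
```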

Can I use OpenGL for general purpose matrix multiplication?

The MultMatrix appears to only multiply 4x4 matrices, which makes sense for OpenGL purposes, but I was wondering if a more general matrix multiplication function existed within OpenGL.
No, as can easily be verified by looking at the documentation, including that of the GL Shading Language. The largest matrix data type is a 4x4.
It is very true there is a whole craft of getting GPUs to do more general-purpose math, including string and text manipulation for e.g. cryptographic purposes, by using OpenGL primitives in very tricky ways. However, you asked for a general-purpose matrix multiplication function.
OpenCL is a somewhat different story. It doesn't have a multiply primitive, but it's designed for general numeric computation, so examples and libraries are very common.
You can also easily code a general matrix multiply for NVIDIA processors in CUDA. Their tutorials include the design of such a routine.
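For reference, a minimal, untiled version of such a kernel might look like the sketch below; the tutorial versions add shared-memory tiling for performance.

```
// Naive CUDA sketch of a general N x N matrix multiply (row-major).
__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
// Launch example: dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
// matmul<<<grid, block>>>(dA, dB, dC, N);
```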
A lot of people think that legacy OpenGL's (up to OpenGL 2.1) matrix multiplication would be in some way faster. This is not the case. The fixed-function pipeline matrix manipulation functions are all executed on the CPU and only update the GPU matrix register on demand before a drawing call.
There's no benefit in using OpenGL for doing matrix multiplication math. If you want to do GPGPU computing, you must do it using either OpenCL or compute shaders, and to actually benefit from it, it must be applied to a very well parallelized problem.

Do the acos and atan functions in the STL use lots of CPU cycles?

I want to calculate the angle between two vectors, but I have seen that inverse trig operations such as acos and atan use lots of CPU cycles. Is there a way I can get this calculation done without using these functions? Also, do these really hurt you when you are optimizing?
There are no such functions in the STL; those are in the math library.
Also, are you sure it's important to be efficient here? Have you profiled to see if there are function calls like this in the hot spots? Do you know that the performance is bad when using these functions? You should always answer these questions before diving into such micro-optimizations.
In order to give advice: what are you going to do with the angle, and how accurate does it have to be?
If you need the actual angle to a high precision, you probably can't do better. If you need it for some comparison, you can use absolute values and the dot product to get the cosine of the angle. If you don't need precision, you can do that and use an acos lookup table. If you're using it as input for another calculation, you might be able to use a little geometry or maybe a trig identity to avoid having to find an arccosine or arctangent.
In any case, once you've done what optimization you're going to do, do before and after timing runs to see if you've made any significant difference.
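To illustrate the "compare without acos" idea from above: since cosine is monotonically decreasing on [0, π], you can compare dot products of normalized vectors directly instead of comparing the angles themselves. A minimal sketch (the types and names are my own):

```
// Which of a and b makes the smaller angle with ref?
// All vectors are assumed to be normalized.
struct Vec2 { float x, y; };

float dot(Vec2 a, Vec2 b) { return a.x * b.x + a.y * b.y; }

bool smaller_angle_than(Vec2 ref, Vec2 a, Vec2 b) {
    return dot(ref, a) > dot(ref, b);   // larger cosine => smaller angle
}
```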
This is totally implementation-defined. Of course, you could use a third-party implementation, or an approximation, but first you should profile and determine what your bottlenecks are.
If these functions are indeed the bottleneck, and you only need an approximation, you can try using the first few terms of the Taylor series expansion of those functions. The magnitude of the first omitted term bounds the error in your approximation.
Arccos Taylor series
Arctan Taylor series
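As a concrete example of the truncated-series idea for atan (valid only for |x| <= 1; because the series alternates, the first omitted term, x^9/9, bounds the error):

```
// Sketch: atan(x) ~= x - x^3/3 + x^5/5 - x^7/7, for |x| <= 1.
float atan_taylor(float x) {
    float x2 = x * x;
    return x * (1.0f - x2 * (1.0f / 3.0f - x2 * (1.0f / 5.0f - x2 * (1.0f / 7.0f))));
}
```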
The implementations of atan and acos depend on the compiler and the optimization settings. Many implementations will use a table and interpolate between the nearest entries.
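A sketch of that table-plus-interpolation approach for acos, with the table size as the accuracy/memory knob (this is an illustration, not any particular library's implementation):

```
#include <cmath>
#include <vector>

// Precompute acos at evenly spaced points in [-1, 1], then interpolate.
class AcosTable {
public:
    explicit AcosTable(int n = 1024) : table_(n + 1) {
        for (int i = 0; i <= n; ++i)
            table_[i] = std::acos(-1.0f + 2.0f * i / n);
    }
    float operator()(float x) const {
        int n = static_cast<int>(table_.size()) - 1;
        float t = (x + 1.0f) * 0.5f * n;         // map [-1, 1] to [0, n]
        if (t <= 0.0f) return table_[0];
        if (t >= n)    return table_[n];
        int i = static_cast<int>(t);
        float frac = t - i;
        return table_[i] * (1.0f - frac) + table_[i + 1] * frac;
    }
private:
    std::vector<float> table_;
};
```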
Try these things first:
- Profile the application to find where most of the execution time is spent.
- Redesign this area for better performance.
- Consider Data-Driven Design techniques to speed up your program.
- Change logic to reduce branches and if statements; consider using Karnaugh maps to simplify the logic.