I need an open source (no restriction on license) implementation of a log function, something with the signature
__m128d _mm_log_pd(__m128d);
It is available in the Intel Short Vector Math Library (part of ICC), but ICC is neither free nor open source. I am looking for an implementation using intrinsics only.
It should use rational function approximations. I need something almost as accurate as cmath's log, say 9-10 decimal digits, but faster.
I believe log2 is easier to compute. You can multiply/divide your number by a power of two (very quick) such that it lies in (0.5, 2], and then use a Pade approximant (take M close to N), which is easy to derive once and for all, and whose order you can choose according to your needs. You only need arithmetic operations, which you can do with SSE intrinsics. Don't forget to add/remove a constant according to the above scaling factor.
If you want the natural log, divide by log2(e), which you can compute once and for all.
It is not rare to see custom log functions in some specific projects. Standard library functions address the general case, but you need something more specific. I sincerely think it is not that hard to do it yourself.
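To make the range-reduction idea concrete, here is a minimal SSE2 sketch (my own illustration, not SVML's routine). It peels the exponent off through the bit pattern and approximates ln(m) on [1, 2); the truncated series used below is only a placeholder, and you would substitute the Pade or minimax fit described above to actually reach 9-10 digits:

#include <emmintrin.h>

// Sketch only: assumes x is positive, finite and normal; special cases
// (x <= 0, infinities, NaNs, denormals) are not handled.
static inline __m128d log_pd_sketch(__m128d x)
{
    // Split x = m * 2^e with m in [1, 2): read the biased exponent from the
    // top bits, then overwrite the exponent field with the bias to get m.
    __m128i xi  = _mm_castpd_si128(x);
    __m128i e64 = _mm_sub_epi64(_mm_srli_epi64(xi, 52), _mm_set1_epi64x(1023));
    // SSE2 has no int64 -> double conversion; the exponents fit in 32 bits,
    // so gather the low dwords of both lanes and convert those.
    __m128d e = _mm_cvtepi32_pd(_mm_shuffle_epi32(e64, _MM_SHUFFLE(3, 3, 2, 0)));

    __m128i mi = _mm_or_si128(
        _mm_and_si128(xi, _mm_set1_epi64x(0x000FFFFFFFFFFFFFLL)),
        _mm_set1_epi64x(0x3FF0000000000000LL));
    __m128d m = _mm_castsi128_pd(mi);

    // Placeholder: ln(1+t) ~ t - t^2/2 + t^3/3 - t^4/4 via Horner's scheme.
    // Replace with a fitted Pade/minimax approximant for real accuracy.
    __m128d t = _mm_sub_pd(m, _mm_set1_pd(1.0));
    __m128d p = _mm_set1_pd(-0.25);
    p = _mm_add_pd(_mm_mul_pd(p, t), _mm_set1_pd(1.0 / 3.0));
    p = _mm_add_pd(_mm_mul_pd(p, t), _mm_set1_pd(-0.5));
    p = _mm_add_pd(_mm_mul_pd(p, t), _mm_set1_pd(1.0));
    __m128d ln_m = _mm_mul_pd(p, t);

    // ln(x) = e * ln(2) + ln(m)
    return _mm_add_pd(_mm_mul_pd(e, _mm_set1_pd(0.6931471805599453)), ln_m);
}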
Take a look at AMD LibM. It isn't open source, but it is free. AFAIK, it works on Intel CPUs. On the same web page you'll find a link to ACML, another free math lib from AMD. It has everything from AMD LibM plus matrix algorithms, FFTs, and distributions.
I don't know of any open source implementation of double precision vectorized math functions. I guess Intel's and AMD's libs are hand-optimised by the CPU manufacturers, and everyone uses them when speed is important. IIRC, there was an attempt to implement intrinsics for vectorized math functions in GCC. I don't know how far they got. Obviously, it isn't a trivial task.
The Framewave project is Apache 2.0 licensed and aims to be the open source equivalent of Intel IPP. It has implementations that are close to what you are looking for.
Check the fixed accuracy arithmetic functions in the documentation.
Here's the counterpart for __m256d: https://stackoverflow.com/a/45898937/1915854 . It should be pretty trivial to cut it down to __m128d. Let me know if you encounter any problems with this.
Alternatively, you can view my implementation as one that produces two __m128d results at once.
If you cannot find an existing open source implementation, it is relatively easy to create your own using the standard method of a Taylor series. See Wikipedia for this and a variety of other methods.
I am working on an n-body simulator in CUDA. I want to use float types for the speed benefits, but this is making my task difficult. What I am worried about: say I have a vector <10^20, 10^20, 10^20> and I want to compute its magnitude using the Pythagorean theorem. I would have to square each of the components, which would be 10^40, and in 32 bits that would just be infinity. So even though the final result, when I take the square root of the sum, would be in range, the intermediate step would overflow. I came across the function norm3df(x, y, z) in the CUDA math API. Would this prevent the intermediate overflow I am talking about? Also, I might need to use this function on the host as well as the device. Would the behavior be the same?
The standard C++ math library contains a function hypot() for the computation of 2D norms while avoiding premature underflow and overflow in intermediate computations. Because 3D norms are also commonly encountered, the CUDA math library offers in addition an analogous function norm3d(). The description in the CUDA math API documentation reads:
Calculate the length of three dimensional vector p in euclidean space
without undue overflow or underflow
Further, the CUDA math library offers reciprocal norm functions rhypot() and rnorm3d() that are useful when normalizing 2D and 3D vectors, as they allow replacing an expensive division with a much cheaper multiplication.
As norm3d(), rhypot(), and rnorm3d() are not standard C++ math library functions, they cannot be used in the host portion of CUDA programs, as host code is processed by the host toolchain. NVIDIA provides math library support for the device. You may want to file an enhancement request with the vendor of your host toolchain to add these useful functions as proprietary extensions, and/or lobby the ISO C/C++ committees to have them added to future versions of the standard.
It has previously come to my attention that currently shipping CUDA header files seem to erroneously mark norm3d() and a few other CUDA-specific functions as __host__ __device__, although there is in fact no host implementation. This would appear to be a bug, likely caused by cut & paste application of these attributes to the prototypes.
The norm and reciprocal norm functions do not require higher intermediate precision in their internal computation, meaning there is no negative performance impact on GPUs with low-throughput double precision. Instead, they use clever rearrangements of the mathematics, re-scaling of the operands, and use of FMA to achieve their goal. Not only do they prevent undue overflow and underflow, they should also be more accurate than the equivalent naive computation.
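As an illustration of the re-scaling idea, here is a hypothetical host-side sketch (not the actual CUDA library code): dividing by the largest component magnitude keeps every square in [0, 1], so the intermediate sum cannot overflow, and the scale factor is re-applied at the end.

#include <algorithm>
#include <cmath>

double norm3d_host(double x, double y, double z)
{
    double ax = std::fabs(x), ay = std::fabs(y), az = std::fabs(z);
    double m = std::max({ax, ay, az});
    if (m == 0.0) return 0.0;      // avoid 0/0 for the zero vector
    ax /= m; ay /= m; az /= m;     // all components now in [0, 1]
    return m * std::sqrt(ax * ax + ay * ay + az * az);
}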
Up to and including CUDA version 6.5, implementation details of the CUDA math library were visible in the CUDA header files math_functions.h and math_functions_dbl_ptx3.h, so anybody who would like to get a better idea of the internal details of norm functions may want to look there.
I am looking for a way to write fast code and be able to use builtin vector operations (for the sake of readability).
FORTRAN seems to be a good candidate. However, almost all resources I find on the web are about writing code without array expressions, and they have only trivial examples of vector operations.
I feel a strong need for a good resource which covers the caveats and gives some insight into optimization of code with vector expressions.
Example:
currently I am not even able to predict the behavior of such code:
! a = [0], indices = [1, 1]
a(indices) = a(indices) + 1
After compiling I get a = [2], but is this correct? If I use OpenMP, will it behave the same way?
Personally, I would be very happy to have something like following examples on numpy:
100 numpy exercises
numpy: tips and tricks to work with data
Getting the Best Performance out of NumPy
Your code is not standard conforming:
Fortran 2008, 6.5.3.3.2.3:
If a vector subscript has two or more elements with the same value, an array section with that vector subscript shall not appear in a variable definition context (16.6.7). (See also NOTE 6.15.)
Therefore the result of your operation is not defined by the standard.
Other parts of your question appear to be too broad to treat them here. There are many books about scientific programming in Fortran 90 and later.
Also be aware that by "vectorization" most people in Fortran and C or C++ mean the usage of SIMD instructions, not the vectorized expressions known from NumPy. The latter are just array expressions in Fortran.
I have scanned many sources (~20 books and dozens of web pages), so it is unlikely that I missed something really important. The question I posted is indeed incorrect and comes from my initially high expectations about array operations in Fortran.
The answer I would expect is: there are no tools to write short, readable code in Fortran with automatic parallelization (to be more precise: there are, but they are proprietary libraries).
The list of intrinsic functions available in Fortran is quite short (link) and consists only of functions easily mapped to SIMD ops. There are lots of functions that one will be missing. While this could be resolved by a separate library with a separate implementation for each platform, Fortran doesn't provide one. There are commercial options (see this thread).
Brief examples of missing functions:
no built-in array sort or unique. The proposed way is to use this library, which provides single-threaded code (forget threads and CUDA)
cumulative sum / running sum. One can trivially implement it, but the resulting code will never work well on threads/CUDA/Xeon Phi/whatever comes next.
bincount, numpy.ufunc.at, numpy.ufunc.reduceat (which is very useful in many applications)
In most cases Fortran provides a 2x speedup even with simple implementations, but the code written will always be single-threaded, while matlab/numpy functions can be reimplemented for a GPU or another parallel platform without any effort on the user's side (which occasionally happened to MATLAB; also see gnumpy, theano and parakeet).
To conclude, this is bad news for me. Fortran developers really care about having fast programs today, not in the future. I also can't lock my code on proprietary software. And I'm still looking for appropriate tool. (Julia is current candidate)
See also:
STL analogue in fortran
where ready-to-use algorithms are asked.
Numerical recipes: the art of parallel programming, where the author implements basic MATLAB-like operations to get more expressive code
I also found these notes useful for seeing the recommended ways of optimizing code (and that there is no place for vector operations there)
numpy, fortran, blitz++: a case study
discussion about implementing unique in Fortran, where proprietary tools are recommended.
Unfortunately, the standard C++ library doesn't have a single call for sincos, which is what motivates this question.
First question:
If I want to calculate both a sin and a cos, is it cheaper to compute the sin and the cos separately, or to compute the sin and then sqrt(1 - sin^2) to get the cos?
Second question:
The Intel Math Kernel Library provides very good functions for computing standard mathematical functions; for example, vdSinCos() solves this problem in a very optimized way, but the Intel compiler isn't free. Is there any open source library (C, C++, Fortran) available in Linux distributions that has these functions, so that I can simply link to it and get the optimal implementations?
Note: I wouldn't like to go into instruction calls, since they're not supported by all CPUs. I would like to link to a general library that would do the job for me on any CPU.
Thanks.
The GNU C library has a sincos() function, which can take advantage of the x87 "FSINCOS" instruction where the hardware provides it. I'd say that's your best bet; it should be just as fast as the Intel library method.
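A minimal usage sketch, assuming glibc (sincos() is a GNU extension, so _GNU_SOURCE must be defined before the headers):

#define _GNU_SOURCE
#include <math.h>
#include <stdio.h>

int main(void)
{
    double s, c;
    sincos(0.5, &s, &c);   // one call computes both values
    printf("sin = %f, cos = %f\n", s, c);
    return 0;
}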
If you don't do that, I'd go with the sqrt(1 - sin(x)^2) route. In every processor architecture document I've looked at so far, the FSQRT instruction is significantly faster than FSIN. Note, though, that sqrt(1 - sin^2) only gives the magnitude of the cosine; you still have to recover its sign from the quadrant of the argument.
The answer to almost every performance question is "measure it in your own code", because a large number of different factors affect the performance of almost any calculation like this. For example: who produces the math functions? Square root is relatively simple to calculate, but I'm not convinced there is a huge difference between sqrt(1 - sin*sin) and calculating cos again. Which processor you use may also be a factor, as well as what other calculations are done around the sin/cos calls.
I wouldn't be surprised if there is a library around somewhere that has this sort of function, but I haven't been looking.
If precision is not critical, the fastest way to get sin or cos is to use tables.
Hold a global const array with sin and cos values for all angles at the step size you need. Then your sin/cos function just needs to convert the angle to an index, and you have the result.
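A minimal sketch of that idea (hypothetical code; the step size is arbitrary, and there is no interpolation, so accuracy is limited to roughly the step):

#include <cmath>
#include <vector>

struct SinCosTable {
    static constexpr double step   = 0.001;             // table resolution
    static constexpr double two_pi = 6.283185307179586;
    std::vector<double> sin_v, cos_v;

    SinCosTable() {
        const int n = static_cast<int>(two_pi / step) + 1;
        sin_v.resize(n);
        cos_v.resize(n);
        for (int i = 0; i < n; ++i) {
            sin_v[i] = std::sin(i * step);   // precompute once
            cos_v[i] = std::cos(i * step);
        }
    }
    // The angle must already be reduced to [0, two_pi).
    double fast_sin(double a) const { return sin_v[static_cast<int>(a / step)]; }
    double fast_cos(double a) const { return cos_v[static_cast<int>(a / step)]; }
};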
I'm doing some linear algebra math and was looking for a really lightweight and simple-to-use matrix class that can handle different dimensions: 2x2, 2x1, 3x1 and 1x2, basically.
I presume such a class could be implemented with templates, using some specialization in certain cases for performance.
Does anybody know of a simple implementation available for use? I don't want "bloated" implementations, as I'll be running this in an embedded environment where memory is constrained.
Thanks
You could try Blitz++, or Boost's uBLAS.
I've recently looked at a variety of C++ matrix libraries, and my vote goes to Armadillo.
The library is heavily templated and header-only.
Armadillo also leverages templates to implement a delayed evaluation framework (resolved at compile time) to minimize temporaries in the generated code (resulting in reduced memory usage and increased performance).
However, these advanced features are only a burden to the compiler, not to your implementation running in the embedded environment, because most Armadillo code 'evaporates' during compilation due to its template-based design.
And despite all that, one of its main design goals has been ease of use - the API is deliberately similar in style to Matlab syntax (see the comparison table on the site).
Additionally, although Armadillo can work standalone, you might want to consider using it with LAPACK (and BLAS) implementations available to improve performance. A good option would be for instance OpenBLAS (or ATLAS). Check Armadillo's FAQ, it covers some important topics.
A quick search on Google dug up this presentation showing that Armadillo has already been used in embedded systems.
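For a taste of the Matlab-like API mentioned above, here is a tiny sketch (standard Armadillo calls; solve() uses the LAPACK backend when one is linked in):

#include <armadillo>

int main()
{
    arma::mat A = {{1.0, 2.0}, {3.0, 4.0}};   // 2x2 matrix
    arma::vec b = {5.0, 6.0};
    arma::vec x = arma::solve(A, b);          // solve A*x = b
    x.print("x =");
    return 0;
}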
std::valarray is pretty lightweight.
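A quick illustration of how little ceremony it needs; arithmetic is element-wise out of the box:

#include <iostream>
#include <valarray>

int main()
{
    std::valarray<double> a = {1.0, 2.0}, b = {3.0, 4.0};
    std::valarray<double> c = 2.0 * a + b;      // element-wise: {5, 8}
    std::cout << c[0] << ' ' << c[1] << '\n';
    return 0;
}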
I use the Newmat libraries for matrix computations. They're open source and easy to use, although I'm not sure they fit your definition of lightweight (the package includes over 50 source files, which Visual Studio compiles into a 1.8MB static library).
CML matrix is pretty good, but may not be lightweight enough for an embedded environment. Check it out anyway: http://cmldev.net/?p=418
Another option, although it may be too late:
https://launchpad.net/lwmatrix
I for one wasn't able to find a simple enough library, so I wrote one myself: http://koti.welho.com/aarpikar/lib/
I think it should be able to handle different matrix dimensions (2x2, 3x3, 3x1, etc.) by simply setting some rows or columns to zero. It won't be the fastest approach, since internally all operations are done on 4x4 matrices. Although in theory there might exist processors that can handle 4x4 operations in one tick, I would much rather believe in the existence of such processors than go optimizing those low-level matrix calculations myself. :)
How about just storing the matrix in an array, like
2x3 matrix = {2,3,val1,val2,...,val6}
This is really simple, and addition operations are trivial. However, you need to write your own multiplication function.
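A hypothetical sketch of that layout and the multiplication routine one would have to write (dimensions in the first two slots, values row-major after them):

#include <cassert>

// C must have room for 2 + rows(A) * cols(B) doubles.
void mat_mul(const double* A, const double* B, double* C)
{
    const int n = static_cast<int>(A[0]);   // rows of A
    const int m = static_cast<int>(A[1]);   // cols of A == rows of B
    const int p = static_cast<int>(B[1]);   // cols of B
    assert(static_cast<int>(B[0]) == m);
    C[0] = n; C[1] = p;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < p; ++j) {
            double s = 0.0;
            for (int k = 0; k < m; ++k)
                s += A[2 + i * m + k] * B[2 + k * p + j];
            C[2 + i * p + j] = s;
        }
}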
Is there a publicly available library that will produce the exact same results for sin, cos, floor, ceil, exp and log on 32-bit and 64-bit Linux, Solaris and possibly other platforms?
I am considering the following alternatives:
a) cephes compiled with gcc -mfpmath=sse and the same optimization levels on each platform... but it's not clear that this would work.
b) MPFR, but I am worried that this would be too slow.
Regarding precision (edited): for this particular application I don't really need something that produces the value that is numerically closest to the exact value. I just need the answers to be exactly the same on all platforms, OSes and "bitnesses". That being said, the values need to be reasonable (5 digits would probably be enough). I apologize for not having made this clear in my initial question.
I guess MAPM or MPFR with a low enough precision setting might do the trick, but I was hoping to find something that did not have the "multiple precision" machinery/flavor to it. In any case, I will try this out.
Would something like crlibm (http://lipforge.ens-lyon.fr/www/crlibm/index.html) be what you are searching for? It is a library whose aim is to be able to replace the standard math library of C99 (so it keeps good enough performance in the normal cases) while ensuring correctly rounded results according to the IEEE 754 rounding modes.
crlibm is the correct tool for this. An earlier poster linked to it. Because it is correctly rounded, it will deliver bit-identical results on all platforms with IEEE-754 compliant hardware if compiled properly. It is much, much faster than MPFR.
You shouldn't need one. floor and ceil will be exact, since their computation is straightforward.
What you are concerned with is the rounding of the last bit for transcendentals like sin, cos, and exp. But these are native to the CPU microcode and can be done with consistently high quality regardless of library; however, the rounding does vary from chip architecture to architecture.
So, if exact answers for the transcendentals is indeed your goal, you do need a portable library, and you will also be giving up huge efficiencies by doing so.
You could use a portable library like MAPM, which gives you not only consistent ULP results but, as a side benefit, lets you define arbitrary precision.
You can check your math precision with tools like this one and this one.
You mention using SSE. If you're planning on only running on x86 chips, then what exactly are the inconsistencies you're expecting?
As for MPFR, don't worry - test it! By the way, if it's good enough to be included in GCC, it's probably good enough for you.
You want to use MPFR. That library has been around for years, has been ported to every platform under the sun, and has been optimized by tons of people.
If MPFR isn't sufficient for your needs, we're talking about full custom ASM implementations, in which case it might be more efficient to consider implementing it in dedicated hardware.
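A minimal sketch of what that looks like with MPFR, pinning the precision so every platform computes with the same significand width:

#include <mpfr.h>
#include <stdio.h>

int main(void)
{
    mpfr_t x, s;
    mpfr_init2(x, 64);               // fixed 64-bit significand everywhere
    mpfr_init2(s, 64);
    mpfr_set_d(x, 0.5, MPFR_RNDN);
    mpfr_sin(s, x, MPFR_RNDN);       // correctly rounded, so bit-identical
    printf("sin(0.5) = %.17g\n", mpfr_get_d(s, MPFR_RNDN));
    mpfr_clear(x);
    mpfr_clear(s);
    return 0;
}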