I am looking for a fast C (or C++) code that does 3D convolution with small kernels. What I have found online so far are based on FFT and require the use of other libraries. However, as pointed out by many, these FFT-based convolution codes are not suitable for small kernels (in my case they are mainly 3×3×3 or 5×5×5). Is there a C/C++ code that directly implements 3D convolution algorithm? I can implement a 3D convolution in a code according to its definition. However, what I write will not be the most efficient and fast implementation. I am actually looking for a code that has been verified in this aspect.
Related
I implemented a program on the GPU (CUDA) which only uses the host (in C++) to start new kernels. During the calculation on the device I need SVD and solving systems of 3x3 (dense) matrices, fixed size.
I've got my own SVD and solver implementation but it is not numerical stable (thus not usable). Due to me being rather new with C++ and CUDA I would prefer to use a library instead. (numerical stuff is very tricky)
Now I have trouble finding that library:
cuSOLVER is not callable from the device
cuLA is not callable form the device (and abandoned so it seems)
Eigen looks promising (should be callable from device?) but it is unclear what the status is on CUDA support (it says experimental). I find people saying it works, others got compile errors?
Preferable I would also being able to do general matrix operations with the library (transpose, inversion, sum, multiply, ...) as my own implementations will likely be less efficient and numerically stable for those.
Any ideas on how to achieve this?
UPDATE:
Seems like Eigen supports basic functions like *,+, transpose and even eigenvalues but SVD, inverse ect is not yet supported. This is at the time of writing.
According to the website, a subset of features works for fixed size matrices (3x3 in your case) from Eigen 3.3. The current stable release is 3.2.6 while 3.3 is in alpha. I don't know if specifically SVD is supported in CUDA. I would recommend trying a small MCVE to see if it works (as well as the other functions you require), and if so, implementing it in your project.
I'm having a similar problem; want to generate random vectors within a kernel function which requires performing cholesky/eigenvalue decompositions of NxN (N<=5) covariance matrices. Since, as you noted, the MAGMA and CULA libraries are not available from the device, and there seems to be no cuSOLVER device API yet, I've resorted to implementing these myself following algorithms outlined in, for example, Numerical Recipes in C. As for solving linear systems, I'd suggest checking out the cuBLAS (level 2 functions), as it provides some basic functionality. If you want to invert matrices, I'd suggest cublasmatinvBatched(). I haven't used it myself, will give it a try during the weekend, but from the description it sounds promising. Hope others will chime into this thread with better solutions...
The MultMatrix appears to only multiply 4x4 matrices, which makes sense for OpenGL purposes, but I was wondering if a more general matrix multiplication function existed within OpenGL.
No, as can be easily verified by looking at the documentation, including the GL Shader Language. The largest matrix data type is a 4x4.
It is very true there is a whole craft of of getting GPUs to do more general purpose math, including string and text manipulation for e.g. cryptographic purposes, by using OpenGL primitives in very tricky ways. However you asked for a general purpose matrix multiplication function.
OpenCL is a somewhat different story. It doesn't have a multiply primitive, but it's designed for general numeric computation, so examples and libraries are very common.
You can also easily code a general matrix multiply for NVIDIA processors in CUDA. Their tutorials include the design of such a routine.
A lot of people think, that legacy OpenGL's (up to OpenGL-2.1) matrix multiplication would be in some way faster. This is not the case. The fixed function pipeline matrix manipulation functions are all executed on the CPU and only update the GPU matrix register on demand before a drawing call.
There's no benefit in using OpenGL for doing matrix math multiplication. If you want do to GPGPU computing you must do this using either OpenCL or compute shaders and to actually benefit from it, it must be applied to a very well parallelized problem.
After some studying, I created a small app that calculates DFTs (Discrete Fourier Transformations) from some input. It works well enough, but it is quite slow.
I read that FFTs (Fast Fourier Transformations) allow quicker calculations, but how are they different? And more importantly, how would I go about implementing them in C++?
If you don't need to manually implement the algorithm, you could take a look at the Fastest Fourier Transform in the West
Even thought it's developed in C, it officially works in C++ (from the FAQ)
Question 2.9. Can I call FFTW from
C++?
Most definitely. FFTW should compile
and/or link under any C++ compiler.
Moreover, it is likely that the C++
template class is
bit-compatible with FFTW's
complex-number format (see the FFTW
manual for more details).
FFT has n*log(n) compexity compared to DFT which has n^2.
There are lot of literature about that, and I strongly advise that you check that first, because such wide topic can not be full explaned here.
http://en.wikipedia.org/wiki/Fast_Fourier_transform (check external links )
If you need library I advise you to use existing one, for instance.
http://www.fftw.org/
This library has efficiently implementation of FFT and is also used in propariaretery software (MATLAB for instance)
Steven Smith's book The Scientist and Engineer's Guide to Digital Signal Processing , specifically Chapter 8 on the DFT and Chapter 12 on the FFT, does a much better job of explaining the two transforms that I ever could.
By the way, the whole book is available for free (link above) and it's a very good introduction to signal processing.
Regarding the C++ code request, I've only used the Fastest Fourier Transform in the West (already cited by superexsl) or DSP libraries such as those from TI or Analog Devices.
The results of a correctly implemented DFT are essentially identical to the results of a correctly implemented FFT (they differ only by rounding errors). As others have pointed out here, the major difference is that of performance. DFT has O(n^2) operations while the FFT has O(nlogn) operations.
The best, most readable publication I have ever found (the one I still refer to) is The Fast Fourier Transform and its Applications by E Oran Brigham. The first few chapters provide a very thorough overview of the continuous and discrete forms of the Fourier Transform. He then uses that to develop the fast version of the DFT based on the Cooley-Tukey Algorithm for the radix-2 (n is a power of 2) and mixed-radix cases (though the latter being somewhat more shallow treatise than the former).
The basic approach in the radix-2 algorithm to perform a linear time operation on the input X and to recursively split the result in half and perform a similar linear time operation on the two halves. The mixed radix case is similar, though you need to divide X into equal portions each time, so it helps if n doesn't have any large prime factors.
I've found this nice explanation with some algorithms described.
FastFourierTransform
About implementation,
first i'd make sure your implementation returns correct results (compare the output from matlab or octave - which have built in fourier transformates)
optimize when necessary, use profilers
don't use unnecesary for loops
i am trying to implement discrete curve evolution algorithm in c++ do any one help me with psudo code or c code or
some simple steps of your understanding
Discrete Curve Evolution is an algorithm to compute an everywhere convex curve from one that is concave. It moves concave sections of the curve outward along their normal in discrete steps until all concavities are eliminated. It is not a genetic algorithm, the term evolution refers to 'evolving' the position of the curve over time.
Having searched on this for quite some time the best source on the internet is here:
https://cis.temple.edu/~latecki/Software/Evo.zip
This is matlab code so it's not quite what you are looking for but you have three good options:
Port it to C++ (usually not to hard with matlab as long as it doesn't use matrix prims.)
Wrap the matlab code so you can call it from C (matlab provides libraries to do this)
Compile it to an executable and call that from C (matlab also allows this)
Option 2 would require anyone that want's to run it to have a copy of the matlab dynamic library on their computer which may be undesirable. I'm guessing option 3 would require this too, but I only have experience with options 1 and 2. Porting matlab to c++ is usually not that bad; it depends on how much the code utilizes matrix primitives and matrix operations which are easy to use in matlab and hard to use in C++ (because they aren't built-in). Still, I'd recommend giving it the old college try!
If you're just looking for DCE, check out the file evolution.m. That's the function that implements DCE. The full skeleton pruning algorithm this comes from can only be described simply at a high level. The individual steps and parts are QUITE complicated and DCE is only a small piece of that.
Hope this helps! I will be working with this code myself so if I do end up using it in C++ in some way that might help you I will let you know.
I'm not exactly sure what you mean by Discrete Curve evolutionary algorithm, but if you mean a Symbolic regression algorithm, you can start by reading about symbolic regression (or genetic programming in general):
http://en.wikipedia.org/wiki/Symbolic_Regression
There's also some nice existing programs. The Eureqa one has an open API:
http://code.google.com/p/eureqa-api/
I'm looking for some source code implementing 3d convolution. Ideally, I need C++ code or CUDA code. I'd appreciate if anybody can point me to a nice and fast implementation :-)
Cheers
you understand that convolution is normally done by using an fft? see, for example, http://en.wikipedia.org/wiki/Convolution
so you need an fft library.
Fastest method to compute convolution suggests http://www.fftw.org/ (for a traditional cpu).
for cuda, use cufft - http://www.gsic.titech.ac.jp/~ccwww/tebiki/tesla_e/tesla6_e.html
Are you a registered developer? If so you should download the 3.0 SDK and check out the FDTD3d sample which shows a 3d convolution as applied for an explicit finite differences app. In the 2.3 SDK there was a sample called 3dfd which was similar (and has now been replaced).
It may be more efficient to use this approach rather than FFT if your impulse response is short.
Intel has a very good example - using SSE + OpenMP and a serial version of it. The code is primarily meant to profile the serial and a parallel approach, but is done in a nice way. http://software.intel.com/en-us/articles/16bit-3d-convolution-sse4openmp-implementation-on-penryn-cpu/