3D convolution in C++

I'm looking for some source code implementing 3d convolution. Ideally, I need C++ code or CUDA code. I'd appreciate if anybody can point me to a nice and fast implementation :-)
Cheers

You understand that convolution is normally done using an FFT? See, for example, http://en.wikipedia.org/wiki/Convolution
So you need an FFT library.
The question "Fastest method to compute convolution" suggests http://www.fftw.org/ (for a traditional CPU).
For CUDA, use cuFFT - http://www.gsic.titech.ac.jp/~ccwww/tebiki/tesla_e/tesla6_e.html

Are you a registered developer? If so you should download the 3.0 SDK and check out the FDTD3d sample which shows a 3d convolution as applied for an explicit finite differences app. In the 2.3 SDK there was a sample called 3dfd which was similar (and has now been replaced).
It may be more efficient to use this approach rather than FFT if your impulse response is short.
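To make the "short impulse response" case concrete, here is a minimal sketch of a direct (non-FFT) 3D convolution in plain C++. It is not the code from the SDK samples mentioned above; the `Volume` type and the `convolve3d` name are hypothetical, and the sketch assumes odd kernel dimensions, zero padding at the borders, and an output the same size as the input. The cost grows with the number of kernel voxels, which is why this approach only pays off for small kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a dense 3D volume stored contiguously, x fastest.
struct Volume {
    std::size_t nx, ny, nz;
    std::vector<float> data;

    Volume(std::size_t x, std::size_t y, std::size_t z)
        : nx(x), ny(y), nz(z), data(x * y * z, 0.0f) {}

    float& at(std::size_t x, std::size_t y, std::size_t z) {
        return data[(z * ny + y) * nx + x];
    }
    float at(std::size_t x, std::size_t y, std::size_t z) const {
        return data[(z * ny + y) * nx + x];
    }
};

// Direct "same"-size convolution with zero padding outside the input.
// Kernel dimensions are assumed to be odd so the kernel has a centre voxel.
Volume convolve3d(const Volume& in, const Volume& k) {
    assert(k.nx % 2 == 1 && k.ny % 2 == 1 && k.nz % 2 == 1);
    using idx = std::ptrdiff_t;
    const idx nx = static_cast<idx>(in.nx);
    const idx ny = static_cast<idx>(in.ny);
    const idx nz = static_cast<idx>(in.nz);
    const idx knx = static_cast<idx>(k.nx);
    const idx kny = static_cast<idx>(k.ny);
    const idx knz = static_cast<idx>(k.nz);
    const idx cx = knx / 2, cy = kny / 2, cz = knz / 2;

    Volume out(in.nx, in.ny, in.nz);
    for (idx z = 0; z < nz; ++z)
    for (idx y = 0; y < ny; ++y)
    for (idx x = 0; x < nx; ++x) {
        float acc = 0.0f;
        for (idx kz = 0; kz < knz; ++kz)
        for (idx ky = 0; ky < kny; ++ky)
        for (idx kx = 0; kx < knx; ++kx) {
            // Convolution flips the kernel: out[p] = sum_q in[p - q] * k[q],
            // with q measured from the kernel centre.
            const idx sx = x - (kx - cx);
            const idx sy = y - (ky - cy);
            const idx sz = z - (kz - cz);
            if (sx < 0 || sy < 0 || sz < 0 || sx >= nx || sy >= ny || sz >= nz)
                continue;  // zero padding
            acc += in.at(static_cast<std::size_t>(sx),
                         static_cast<std::size_t>(sy),
                         static_cast<std::size_t>(sz)) *
                   k.at(static_cast<std::size_t>(kx),
                        static_cast<std::size_t>(ky),
                        static_cast<std::size_t>(kz));
        }
        out.at(static_cast<std::size_t>(x),
               static_cast<std::size_t>(y),
               static_cast<std::size_t>(z)) = acc;
    }
    return out;
}
```

Convolving a single unit impulse with a kernel should reproduce the kernel around the impulse position, which is a quick sanity check. The outer three loops are independent, so they are the natural place to add OpenMP or to map onto CUDA threads.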

Intel has a very good example - using SSE + OpenMP and a serial version of it. The code is primarily meant to profile the serial and a parallel approach, but is done in a nice way. http://software.intel.com/en-us/articles/16bit-3d-convolution-sse4openmp-implementation-on-penryn-cpu/

Related

C++: How do I solve a very large sparse system of linear equations

I am trying to solve a very large and sparse system of linear equations in C++. Currently, I am using BiCGSTAB from Eigen. It works fine for small matrices, but it takes far too much time for matrices of the size I need, which is 40804x40804 (it could be even larger in the future).
I have a very long script, but I simply used the following format:
SparseMatrix<double> sj(40804,40804);
VectorXd c_(40804), sf(40804);
sj.reserve(VectorXi::Constant(40804,36)); //This is a very good estimate of how many non zeros in each column
//...Fill in actual number in sj
sj.makeCompressed();
BiCGSTAB<SparseMatrix<double> > handler;
//...Fill in sj, only in the entries that have been initialized previously
handler.analyzePattern(sj);
handler.factorize(sj);
c_.setZero();
c_=handler.solve(sf);
This takes way too long! And yes, the solution does exist. Sparse function in matlab seems to handle this very well, but I need it in C++ in order to connect to a server.
I would really appreciate it if you could help me!
You should consider using one of the advanced sparse direct solvers: CHOLMOD
Sparse direct solvers are a fundamental tool in computational analysis, providing a very general method for obtaining high-quality results to almost any problem. CHOLMOD is a high performance library for sparse Cholesky factorization (which requires a symmetric positive definite matrix).
I guarantee that this package will definitely help you. Moreover, CHOLMOD has supported GPU acceleration since 2012 with version 4.0.0. In SuiteSparse-4.3.1 performance has been further improved, providing speedups of 3x or greater vs. the CPU for the sparse factorization operation.
If your matrices are representations of graphs, you can also consider METIS in combination with CHOLMOD. That means you can partition the graph (domain decomposition) and then solve the parts in parallel with CHOLMOD.
SuiteSparse is a powerful tool with support for linear (KLU) and direct solvers.
Here are the GitHub link, UserGuide and SuiteSparse's home page

Fixed size SVD and solver in CUDA (in the device)

I implemented a program on the GPU (CUDA) which only uses the host (in C++) to start new kernels. During the calculation on the device I need SVD and solving systems of 3x3 (dense) matrices, fixed size.
I've got my own SVD and solver implementation but it is not numerically stable (thus not usable). As I am rather new to C++ and CUDA, I would prefer to use a library instead. (Numerical stuff is very tricky.)
Now I have trouble finding that library:
cuSOLVER is not callable from the device
cuLA is not callable from the device (and abandoned, so it seems)
Eigen looks promising (should be callable from device?) but it is unclear what the status is on CUDA support (it says experimental). I find people saying it works, others got compile errors?
Preferably I would also be able to do general matrix operations with the library (transpose, inversion, sum, multiply, ...) as my own implementations will likely be less efficient and less numerically stable for those.
Any ideas on how to achieve this?
UPDATE:
Seems like Eigen supports basic functions like *, +, transpose and even eigenvalues, but SVD, inverse, etc. are not yet supported. This is at the time of writing.
According to the website, a subset of features works for fixed size matrices (3x3 in your case) from Eigen 3.3. The current stable release is 3.2.6 while 3.3 is in alpha. I don't know if specifically SVD is supported in CUDA. I would recommend trying a small MCVE to see if it works (as well as the other functions you require), and if so, implementing it in your project.
I'm having a similar problem: I want to generate random vectors within a kernel function, which requires performing Cholesky/eigenvalue decompositions of NxN (N<=5) covariance matrices. Since, as you noted, the MAGMA and CULA libraries are not available from the device, and there seems to be no cuSOLVER device API yet, I've resorted to implementing these myself following algorithms outlined in, for example, Numerical Recipes in C. As for solving linear systems, I'd suggest checking out cuBLAS (level 2 functions), as it provides some basic functionality. If you want to invert matrices, I'd suggest cublas<t>matinvBatched() (for example cublasSmatinvBatched() for single precision). I haven't used it myself and will give it a try during the weekend, but from the description it sounds promising. Hope others will chime into this thread with better solutions...
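For the "solve a 3x3 system on the device" part, a hand-written routine can be small enough to keep numerically reasonable. The following is a minimal sketch, not an Eigen or cuSOLVER API: `solve3x3` and the `HOST_DEVICE` macro are hypothetical names, the matrix is assumed to be row-major, and the `eps` singularity threshold is an arbitrary choice. It uses Gaussian elimination with partial pivoting, which is usually the first thing to add when a naive solver turns out to be unstable. It avoids the standard library inside the function so that, under nvcc, it should be usable from a kernel through the `__host__ __device__` annotation. It does not cover SVD.

```cpp
#include <cassert>

// Under nvcc the function is compiled for both host and device;
// with a plain C++ compiler the macro expands to nothing.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Solve A x = b for a row-major 3x3 matrix A using Gaussian elimination
// with partial pivoting. Returns false if a pivot is smaller than eps,
// in which case the matrix is treated as singular and x is left untouched.
HOST_DEVICE inline bool solve3x3(const double A[9], const double b[3],
                                 double x[3], double eps = 1e-12) {
    assert(eps > 0.0);
    double m[3][4];  // augmented matrix [A | b]
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 3; ++c) m[r][c] = A[3 * r + c];
        m[r][3] = b[r];
    }
    for (int col = 0; col < 3; ++col) {
        // Pick the row with the largest absolute value in this column.
        int piv = col;
        double best = m[col][col] < 0.0 ? -m[col][col] : m[col][col];
        for (int r = col + 1; r < 3; ++r) {
            const double v = m[r][col] < 0.0 ? -m[r][col] : m[r][col];
            if (v > best) { best = v; piv = r; }
        }
        if (best < eps) return false;
        if (piv != col) {
            for (int c = 0; c < 4; ++c) {
                const double t = m[col][c];
                m[col][c] = m[piv][c];
                m[piv][c] = t;
            }
        }
        // Eliminate the entries below the pivot.
        for (int r = col + 1; r < 3; ++r) {
            const double f = m[r][col] / m[col][col];
            for (int c = col; c < 4; ++c) m[r][c] -= f * m[col][c];
        }
    }
    // Back substitution into a temporary, then copy out.
    double sol[3];
    for (int r = 2; r >= 0; --r) {
        double s = m[r][3];
        for (int c = r + 1; c < 3; ++c) s -= m[r][c] * sol[c];
        sol[r] = s / m[r][r];
    }
    for (int r = 0; r < 3; ++r) x[r] = sol[r];
    return true;
}
```

Each thread can call this on its own matrix held in registers or local memory, so no library call from the device is needed. For ill-conditioned matrices a pivoted solve is still not a substitute for an SVD-based solve.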

OpenCV function implementation

I wonder how OpenCV performs operations on matrices. For example, when I write code for
cv::add (Mat mat1, Mat mat2, Mat &result)
using two for loops, it takes around 120-130 ms for a 1000x750 image. But using the OpenCV add function it takes 6-7 ms. Does anyone know what their trick is? I want to learn it to be able to write functions that OpenCV doesn't have.
I have searched inside OpenCV and found these two .cpp files (first, second), but I don't know if I'm looking in the correct place.
I just want to know how to use this power. Could somebody help me?
Thanks,
The two cpp files you provided are for GPU operations (CUDA and OpenCL). From your question, I think you are looking for non-GPU operations, and this is the correct file.
OpenCV is famous for its speed, which comes from the many optimizations in its code. I will just give some hints about a few of them.
1. SIMD Optimization
This is one of the major sources of optimization in OpenCV. Almost all arithmetic operations are SIMD optimized. In your case also, SIMD optimization is the better option (which OpenCV has already done). It improves the performance by several times depending on the level of your implementation. All modern processors come with built-in SIMD support (SSE, AVX etc).
It is a little bit complicated compared to normal C++. Instead of adding only two pixels from both matrices at a time, you add some 16 pixels (it depends on the datatype) simultaneously. Theoretically it provides a 16x speedup. Here is a simple example which I wrote while I was learning SIMD assembly (you can use intrinsics, which are much simpler). It is not much optimized (written just to learn it), yet it still provides a speedup of 20x.
Similarly, for use on ARM platforms, the code is being NEON optimized (contributed mainly by the Nvidia team for their Tegra processors). Example
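To show what "adding 16 pixels at a time" can look like with intrinsics, here is a minimal sketch using SSE2. It is not OpenCV's actual implementation; the function name `add_u8_saturate` is hypothetical. It assumes 8-bit unsigned pixels stored contiguously and uses a saturating add (results clamp at 255), which matches how cv::add treats 8-bit data. When SSE2 is not available at compile time, only the scalar loop is used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Saturating element-wise add of two 8-bit buffers of length n.
void add_u8_saturate(const std::uint8_t* a, const std::uint8_t* b,
                     std::uint8_t* dst, std::size_t n) {
    assert(a != nullptr && b != nullptr && dst != nullptr);
    std::size_t i = 0;
#if defined(__SSE2__)
    // Process 16 pixels per iteration; unaligned loads keep the sketch simple.
    for (; i + 16 <= n; i += 16) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_adds_epu8(va, vb));
    }
#endif
    // Scalar tail (and full fallback when SSE2 is unavailable).
    for (; i < n; ++i) {
        const unsigned s = static_cast<unsigned>(a[i]) + static_cast<unsigned>(b[i]);
        dst[i] = static_cast<std::uint8_t>(s > 255u ? 255u : s);
    }
}
```

For a continuous cv::Mat of type CV_8UC1 the three pointers would be the `data` pointers and `n` would be `rows * cols`. A large part of the gap between a hand-written double loop and cv::add often comes from per-pixel accessor overhead and unoptimized builds, so it is worth timing a plain pointer loop in a release build before reaching for intrinsics.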
2. Multi-threading via TBB
Another important one is the use of TBB. Someone has already mentioned it in his answer, and you have to compile the OpenCV source with TBB to get it. As he mentioned, that may not be an easy task. Many functions, such as face detection, are TBB optimized in OpenCV.
OpenCV also uses some other techniques, such as loop unrolling (Example). It provides a slight improvement. Modern compilers are already very good at this.
You can read Agner Fog's optimization manuals for more details on optimizing C++ code. All those details are relevant.
On this page they say at the end of the document that it is faster because functions of the core are multi-thread enabled via Intel Threading Building Blocks.

GPGPU programming architecture for HSA in C++ for Matrix Math

GPU Compute Programmers,
I have a C++ program which currently relies on the ACML (LAPACK) to invert and multiply fairly large matrices of single precision fp values (e.g. 4,000 x 4,000). These matrices are very sparse, although they do not always fit nicely into a diagonal matrix, so I cannot presently reduce them. The other thing about this program is that I have to do this invert and multiply several times (serially) as part of a Newton-Raphson iteration. However, I have several thousand permutations which can be done in parallel, each with a small change to the matrix before again calculating and inverting the Jacobian. This is all single precision fp, and seems perfectly suited for the GPU. My question is this...
I suspect I will need to use the AMD Accelerated Parallel Processing Math Libraries (APPML) for OpenCL, as that is the only thing (non-CUDA, I want to be GPU agnostic) I know of which is available with BLAS functionality. My problem is I do not see the LAPACK dgetrf and dgetri functions included in APPML (yes, these are fp64 but I don't need that precision). Would C++ AMP be a better alternative? I am very interested in the HSA feature of passing pointers rather than copying data, as there is a lot of data in flight here and some calculations are still done on the CPU. I believe that copy overhead would kill me otherwise. Ultimately, performance is the key and I want to make the right architectural decisions to set myself up for the most performance I can wring out of HSA GPUs coming out over the next 6 months.
I am using VS 2013 Ultimate preview and would be able to take advantage of C++ AMP for these HSA capabilities. I just want to make sure I am making the right long term architectural decision now while my program is in its infancy. Here is a link and snippet from some interesting data I found on Anandtech:
http://anandtech.com/show/7118/windows-81-and-vs2013-bring-gpu-computing-updates-to-direct3d-and-c-amp-
C++ AMP, Microsoft's C++ extension for GPU computing, has also been updated with the upcoming VS2013. I think the biggest feature update is that C++ AMP programs will also gain a shared memory feature on APUs/SoCs where the compiler and runtime will be able to eliminate extra data copies between CPU and GPU. This feature will also be available only on Windows 8.1 and it is likely built on top of the "map default buffer" as Microsoft's AMP implementation uses Direct3D under the hood. C++ AMP also brings some other nice additions including enhanced texture support and better debugging abilities.
Any thoughts, additional questions or discussion would be greatly appreciated!

Discrete Curve evolution algorithm

I am trying to implement the discrete curve evolution algorithm in C++. Can anyone help me with pseudocode, C code, or some simple steps based on your understanding?
Discrete Curve Evolution is an algorithm to compute an everywhere convex curve from one that is concave. It moves concave sections of the curve outward along their normal in discrete steps until all concavities are eliminated. It is not a genetic algorithm; the term evolution refers to 'evolving' the position of the curve over time.
Having searched on this for quite some time the best source on the internet is here:
https://cis.temple.edu/~latecki/Software/Evo.zip
This is MATLAB code, so it's not quite what you are looking for, but you have three good options:
1. Port it to C++ (usually not too hard with MATLAB as long as it doesn't use matrix primitives).
2. Wrap the MATLAB code so you can call it from C (MATLAB provides libraries to do this).
3. Compile it to an executable and call that from C (MATLAB also allows this).
Option 2 would require anyone who wants to run it to have a copy of the MATLAB dynamic library on their computer, which may be undesirable. I'm guessing option 3 would require this too, but I only have experience with options 1 and 2. Porting MATLAB to C++ is usually not that bad; it depends on how much the code uses matrix primitives and matrix operations, which are easy to use in MATLAB and hard to use in C++ (because they aren't built in). Still, I'd recommend giving it the old college try!
If you're just looking for DCE, check out the file evolution.m. That's the function that implements DCE. The full skeleton pruning algorithm this comes from can only be described simply at a high level. The individual steps and parts are QUITE complicated and DCE is only a small piece of that.
Hope this helps! I will be working with this code myself so if I do end up using it in C++ in some way that might help you I will let you know.
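For anyone who wants a starting point in C++ before porting the MATLAB file, here is a minimal sketch. It assumes the formulation from Latecki and Lakämper's papers, in which each evolution step deletes the polygon vertex with the smallest relevance K = beta * l1 * l2 / (l1 + l2), where beta is the turn angle at the vertex and l1, l2 are the lengths of the two adjacent segments. That is a different mechanism from moving sections along their normals as described above, and it has not been checked against evolution.m. The names `Point`, `relevance` and `evolve` are hypothetical, the polygon is treated as closed, and the length normalization used in the papers is omitted because it does not change which vertex is smallest within a step.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Relevance of vertex b between its neighbours a and c:
// turn angle at b times l1 * l2 / (l1 + l2).
double relevance(const Point& a, const Point& b, const Point& c) {
    const double ux = b.x - a.x, uy = b.y - a.y;
    const double vx = c.x - b.x, vy = c.y - b.y;
    const double l1 = std::hypot(ux, uy);
    const double l2 = std::hypot(vx, vy);
    if (l1 == 0.0 || l2 == 0.0) return 0.0;  // duplicate point: least relevant
    // Angle between the incoming and outgoing segment directions, in [0, pi].
    const double turn = std::fabs(std::atan2(ux * vy - uy * vx, ux * vx + uy * vy));
    return turn * l1 * l2 / (l1 + l2);
}

// Repeatedly remove the least relevant vertex of a closed polygon
// until only `target` vertices remain. Naive O(n^2) version.
std::vector<Point> evolve(std::vector<Point> poly, std::size_t target) {
    assert(target >= 3);
    while (poly.size() > target) {
        const std::size_t n = poly.size();
        std::size_t best = 0;
        double bestK = relevance(poly[n - 1], poly[0], poly[1]);
        for (std::size_t i = 1; i < n; ++i) {
            const double k = relevance(poly[i - 1], poly[i], poly[(i + 1) % n]);
            if (k < bestK) { bestK = k; best = i; }
        }
        poly.erase(poly.begin() + static_cast<std::ptrdiff_t>(best));
    }
    return poly;
}
```

Calling `evolve(contour, 20)` on a traced contour would return a 20-vertex simplification. A priority queue keyed on relevance, with the two neighbours updated after each removal, would bring the cost down for long contours.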
I'm not exactly sure what you mean by Discrete Curve evolutionary algorithm, but if you mean a Symbolic regression algorithm, you can start by reading about symbolic regression (or genetic programming in general):
http://en.wikipedia.org/wiki/Symbolic_Regression
There are also some nice existing programs. The Eureqa one has an open API:
http://code.google.com/p/eureqa-api/