Convolution with separable kernel - C++

I'm looking at the CUDA SDK convolution with separable kernels, and I have a simple question but can't find an answer:
Do the vectors whose convolution gives the kernel need to have the same size? Can I first perform a row convolution with a 1x3 vector and then a column convolution with a different 5x1 vector, or do they both need to be the same size? Google isn't helping (or I'm unable to search for an answer).

Yes, the vectors can be different sizes. The only consequence is that the equivalent full kernel will be a rectangular matrix rather than a square one.

Put another way: the vectors of a separable convolution can only have different sizes if the equivalent convolution matrix is not square.
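For example, with OpenCV's cv::sepFilter2D (rather than the CUDA SDK sample) a separable filter with different-length vectors is perfectly legal; here is a minimal sketch, where the kernel values are arbitrary placeholders:

```cpp
// Separable filtering with a 1x3 row kernel and a 5x1 column kernel.
// The equivalent full kernel is their outer product: a 5x3 (rectangular) matrix.
#include <opencv2/imgproc.hpp>

cv::Mat filterSeparable(const cv::Mat& src)
{
    cv::Mat kernelX = (cv::Mat_<float>(1, 3) << 0.25f, 0.5f, 0.25f);
    cv::Mat kernelY = (cv::Mat_<float>(5, 1) << 0.1f, 0.2f, 0.4f, 0.2f, 0.1f);

    cv::Mat dst;
    cv::sepFilter2D(src, dst, -1 /* same depth as src */, kernelX, kernelY);
    return dst;
}
```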


Levenberg-Marquardt in OpenCV calibration

I'm trying to understand the Levenberg-Marquardt algorithm implementation in OpenCV's camera calibration. In W. Burger's paper (here) I saw this matrix on page 24. So what does each cell of this matrix mean theoretically? And how is it implemented in the OpenCV code?
Levenberg-Marquardt is a classical algorithm for non-linear optimization, and OpenCV does indeed use it for camera calibration. You should understand the algorithm in general before diving into its usage in OpenCV.
Specifically, the matrix you've pointed out is the Jacobian, the generalization of the one-dimensional derivative to multidimensional functions.
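To make the roles concrete, here is a minimal sketch of a single Levenberg-Marquardt step using OpenCV types. This is not OpenCV's actual calibration code; the residual vector r and Jacobian J are assumed to be computed elsewhere:

```cpp
// One damped Gauss-Newton (Levenberg-Marquardt) step:
// solve (J^T J + lambda*I) * delta = -J^T r, then update the parameters.
#include <opencv2/core.hpp>

// J: N x M Jacobian (J(i,j) = d r_i / d p_j), r: N x 1 residuals,
// params: M x 1 parameter vector, lambda: damping factor.
void lmStep(const cv::Mat& J, const cv::Mat& r, cv::Mat& params, double lambda)
{
    cv::Mat JtJ = J.t() * J;   // Gauss-Newton approximation of the Hessian
    JtJ += lambda * cv::Mat::eye(JtJ.rows, JtJ.cols, JtJ.type());
    cv::Mat delta;
    cv::solve(JtJ, -(J.t() * r), delta, cv::DECOMP_CHOLESKY);
    params += delta;           // accept/reject logic and lambda updates omitted
}
```

Each cell J(i,j) says how the i-th reprojection residual changes when the j-th calibration parameter is perturbed, which is exactly what the matrix in the paper tabulates.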

Custom 3D (parallel) FFTW transformation possible?

I have a 3D array of real data that I'd like to transform with either a DST or a DCT along one axis and ordinary DFTs along the other two axes. The result should be a 3D complex array that holds this transformation's coefficients.
Do you know if the FFTW3 package offers such a routine - possibly in parallel - out of the box? FFTW3 provides such a routine for a simple 3D DFT in all three directions.
And if not, would you have a hint on how best to achieve it in C/C++?
My naive idea: assemble the DST/DCT with a subsequent 2D real-to-complex transformation (real-to-complex along the first of the remaining axes) inside some wrapping routine. Then one could use a 1D domain decomposition to achieve the parallelism; a 2D decomposition would be nicer but much more work.
PS: This transformation is used in a spectral method for solving the Navier-Stokes equations.
Your naive idea is the way to go: the individual dimensions can be transformed independently.
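For illustration, here's a minimal sketch using FFTW's advanced ("many") interface: a DCT-II along the contiguous z axis of a row-major Nx x Ny x Nz array, followed by complex 2D DFTs over x and y for every z index. For simplicity this copies into a full complex array rather than using the real-to-complex variant; the sizes and the DCT kind are placeholders:

```cpp
#include <fftw3.h>

int main()
{
    const int Nx = 32, Ny = 32, Nz = 32;
    double *real = fftw_alloc_real(Nx * Ny * Nz);
    fftw_complex *cplx = fftw_alloc_complex(Nx * Ny * Nz);

    // Step 1: DCT-II along the contiguous z axis, one 1D transform per (x,y) line.
    int n_z[] = { Nz };
    fftw_r2r_kind kind[] = { FFTW_REDFT10 };
    fftw_plan p_dct = fftw_plan_many_r2r(1, n_z, Nx * Ny,
                                         real, NULL, 1, Nz,
                                         real, NULL, 1, Nz,
                                         kind, FFTW_ESTIMATE);

    // Step 2: complex 2D DFT over (x,y), one transform per z index.
    // Element (ix,iy) of transform iz sits at offset iz + (ix*Ny + iy)*Nz.
    int n_xy[] = { Nx, Ny };
    fftw_plan p_dft = fftw_plan_many_dft(2, n_xy, Nz,
                                         cplx, NULL, Nz, 1,
                                         cplx, NULL, Nz, 1,
                                         FFTW_FORWARD, FFTW_ESTIMATE);

    // ... fill `real` with data, then:
    fftw_execute(p_dct);
    for (int i = 0; i < Nx * Ny * Nz; ++i) { cplx[i][0] = real[i]; cplx[i][1] = 0.0; }
    fftw_execute(p_dft);

    fftw_destroy_plan(p_dct);
    fftw_destroy_plan(p_dft);
    fftw_free(real);
    fftw_free(cplx);
    return 0;
}
```

For the MPI-parallel version, the same two-stage structure maps onto a 1D slab decomposition: keep z local for the r2r stage, transpose, and transform the distributed dimensions.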

In OpenCV, is there any way to combine/separate image channels more efficiently than cv::merge and cv::split?

For a little background, I'm doing a whole lot of image filtering, using a bunch of complex-valued filters.
I generate the real and imaginary parts of the filters separately (it's more efficient that way) and store them in two separate arrays.
I followed this guide on how to do DFTs in OpenCV.
Basically, I have to:
1) loop through my filters
2) call merge to combine the real and imaginary parts
3) perform a DFT
4) call split to separate real and imaginary again
5) compute the magnitude of the response
I do this, and it's fairly slow. I initially thought I needed a faster FFT library, but Visual Studio's profiler shows that cv::split() and cv::merge() take an order of magnitude more time than the actual DFT does. In fact, most of the running time is spent in these two functions.
The whole split/merge thing seems a little redundant to me, and the fact that these are the most time-consuming functions is pretty annoying. Is there a faster way to do what I'm trying to do?
OpenCV matrices are stored in interleaved format. That means that for a two-channel image with channels A and B, a 4x1 matrix is stored in the order ABABABAB. If you have two separate planes, you have AAAA and BBBB.
It would probably be hard to convince the DFT to operate on non-interleaved input, but maybe you can store your filters as a two-channel matrix in the first place? That way you save at least one of the conversions.
To compute the magnitude from the two channels, you can loop through all elements and call cv::norm on them, or do the calculation yourself; you can speed that up further using SSE and/or TBB.
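A minimal sketch of both ideas: building the filter as an interleaved two-channel Mat so no cv::merge is needed before cv::dft, and computing the magnitude directly from the interleaved output so no cv::split is needed. The names realPart/imagPart and the CV_32F type are assumptions:

```cpp
#include <opencv2/core.hpp>
#include <cmath>

// Interleave the two precomputed planes into a CV_32FC2 filter once, up front.
cv::Mat complexFilter(const cv::Mat& realPart, const cv::Mat& imagPart)
{
    cv::Mat filt(realPart.size(), CV_32FC2);
    for (int y = 0; y < filt.rows; ++y) {
        const float* re = realPart.ptr<float>(y);
        const float* im = imagPart.ptr<float>(y);
        cv::Vec2f* dst = filt.ptr<cv::Vec2f>(y);
        for (int x = 0; x < filt.cols; ++x)
            dst[x] = cv::Vec2f(re[x], im[x]);
    }
    return filt;
}

// Magnitude straight from the interleaved (CV_32FC2) DFT result, no cv::split.
cv::Mat magnitudeInterleaved(const cv::Mat& spectrum)
{
    cv::Mat mag(spectrum.size(), CV_32F);
    for (int y = 0; y < spectrum.rows; ++y) {
        const cv::Vec2f* s = spectrum.ptr<cv::Vec2f>(y);
        float* m = mag.ptr<float>(y);
        for (int x = 0; x < spectrum.cols; ++x)
            m[x] = std::sqrt(s[x][0] * s[x][0] + s[x][1] * s[x][1]);
    }
    return mag;
}
```

The inner loops are trivially vectorizable, which is where the SSE/TBB suggestion comes in.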

batch CUDA solution of sparse banded Ax=b for various b's

I have a sparse banded matrix A and I'd like to solve Ax=b directly. I have about 500 vectors b, so I'd like to solve for the corresponding 500 x's.
I'm brand new to CUDA, so I'm a little confused as to what options I have available.
cuSOLVER has a batched direct solver, cuSolverSP, for sparse A_i x_i = b_i using QR (here). (I'd be fine with LU too, since A is decently conditioned.) However, as far as I can tell, I can't exploit the fact that all my A_i's are the same.
Would an alternative option be to first compute a sparse LU (or QR) factorization on the CPU or GPU, then perform the backsubstitution (respectively, backsubstitution and matrix multiplication) in parallel on the GPU? If cusolverSp<t>csrlsvlu() is for a single b_i, is there a standard way to batch-perform this operation for multiple b_i's?
Finally, since I don't have intuition for this: should I expect a speedup on a GPU for either of these options, given the necessary overhead? x has length ~10000-100000. Thanks.
I'm currently working on something similar myself. I decided to wrap the conjugate gradient (CG) and level-0 incomplete-Cholesky preconditioned conjugate gradient (ICCG) solver samples that ship with the CUDA SDK into a small class.
You can find them in your CUDA_HOME directory under the paths:
samples/7_CUDALibraries/conjugateGradient and /Developer/NVIDIA/CUDA-samples/7_CUDALibraries/conjugateGradientPrecond
Basically, you would load the matrix into device memory once (and, for ICCG, compute the corresponding preconditioner / matrix analysis), then call the solve kernel with different b vectors.
I don't know what you anticipate your matrix band structure to look like, but if it is symmetric and either diagonally dominant (the magnitude of each diagonal entry is at least the sum of the magnitudes of the off-diagonal entries in its row) or positive definite (all eigenvalues strictly positive), then CG and ICCG should be useful. Alternatively, the various multigrid algorithms are another option if you are willing to go through coding them up.
If your matrix is only positive semi-definite (i.e. has at least one eigenvector with an eigenvalue of zero), you can still often get away with using CG or ICCG as long as you ensure that:
1) The right-hand sides (the b vectors) are orthogonal to the null space (the span of the eigenvectors with an eigenvalue of zero).
2) The solution you obtain is orthogonal to the null space.
It is interesting to note that if you do have a non-trivial null space, different numerical solvers can give you different answers for the exact same system; the solutions will differ by a linear combination of the null-space vectors. That problem cost me many hours of debugging and frustration before I finally caught on, so it's good to be aware of it.
Lastly, if your matrix has a circulant band structure, you might consider using a fast Fourier transform (FFT) based solver. FFT-based numerical solvers often yield superior performance in cases where they are applicable.
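For intuition: a circulant system C x = b diagonalizes under the DFT, since the eigenvalues of C are the DFT of its first column c, so x = IDFT(DFT(b) / DFT(c)). A minimal single-system sketch with FFTW (assuming real data and a nonsingular C; this is an illustration, not a batched GPU implementation):

```cpp
#include <fftw3.h>

/* Solve C x = b, where C is circulant with first column c (length n). */
void circulant_solve(int n, double *c, double *b, double *x)
{
    int nc = n / 2 + 1;                       /* r2c output length */
    fftw_complex *Cf = fftw_alloc_complex(nc);
    fftw_complex *Bf = fftw_alloc_complex(nc);

    fftw_plan pc = fftw_plan_dft_r2c_1d(n, c, Cf, FFTW_ESTIMATE);
    fftw_plan pb = fftw_plan_dft_r2c_1d(n, b, Bf, FFTW_ESTIMATE);
    fftw_execute(pc);
    fftw_execute(pb);

    for (int k = 0; k < nc; ++k) {            /* Bf[k] /= Cf[k], complex division */
        double re = Cf[k][0], im = Cf[k][1];
        double den = re * re + im * im;
        double br = Bf[k][0], bi = Bf[k][1];
        Bf[k][0] = (br * re + bi * im) / den;
        Bf[k][1] = (bi * re - br * im) / den;
    }

    fftw_plan px = fftw_plan_dft_c2r_1d(n, Bf, x, FFTW_ESTIMATE);
    fftw_execute(px);
    for (int i = 0; i < n; ++i) x[i] /= n;    /* FFTW's inverse is unnormalized */

    fftw_destroy_plan(pc); fftw_destroy_plan(pb); fftw_destroy_plan(px);
    fftw_free(Cf); fftw_free(Bf);
}
```

On the GPU the same structure maps onto cuFFT, and the 500 right-hand sides batch naturally via batched 1D transforms.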
"Is there a standard way to batch perform this operation for multiple b_i's?"
One option is to use the batched refactorization module in CUDA's cuSOLVER, though I am not sure it counts as standard.
The batched refactorization module in cuSOLVER provides an efficient method, based on LU decomposition, for solving batches of linear systems with a fixed left-hand-side sparse matrix (or matrices with a fixed sparsity pattern but varying coefficients) and varying right-hand sides. Only some partially completed code snippets related to it can be found in the official documentation (as of CUDA 10.1). A complete example can be found here.
If you don't mind going with an open-source library, you could also check out CUSP:
CUSP Quick Start Page
It has a fairly decent suite of solvers, including a few preconditioned methods:
CUSP Preconditioner Examples
The smoothed aggregation preconditioner (a variant of algebraic multigrid) seems to work very well as long as your GPU has enough onboard memory for it.
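A sketch of the CUSP route for the 500-b's case (assuming CUSP ~0.5; header paths and the monitor API may differ between versions). A is a CSR matrix already on the device; the preconditioner is built once and reused:

```cpp
#include <cusp/csr_matrix.h>
#include <cusp/array1d.h>
#include <cusp/monitor.h>
#include <cusp/krylov/cg.h>
#include <cusp/precond/aggregation/smoothed_aggregation.h>
#include <vector>

void solveAll(cusp::csr_matrix<int, float, cusp::device_memory>& A,
              std::vector<cusp::array1d<float, cusp::device_memory>>& bs,
              std::vector<cusp::array1d<float, cusp::device_memory>>& xs)
{
    // Build the smoothed-aggregation (algebraic multigrid) preconditioner once;
    // this is the expensive setup, amortized over all right-hand sides.
    cusp::precond::aggregation::smoothed_aggregation<int, float, cusp::device_memory> M(A);

    for (size_t i = 0; i < bs.size(); ++i) {
        cusp::monitor<float> monitor(bs[i], 1000, 1e-6f); // max iterations, rel. tolerance
        cusp::krylov::cg(A, xs[i], bs[i], monitor, M);    // preconditioned CG
    }
}
```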

Confusion about methods of pose estimation

I'm trying to do pose estimation (actually [Edit: 3DOF] rotation is all I need) from a planar marker with 4 corners = 4 coplanar points.
Until today I was under the impression, from everything I had read, that you always compute a homography (e.g. using DLT) and decompose it using one of the various methods available (Faugeras, Zhang, or the analytic method also described in this post here on Stack Exchange), refining it with non-linear optimization if necessary.
First, a minor question: if this is an analytical method (simply taking two columns from a matrix and creating an orthonormal matrix out of them, resulting in the desired rotation matrix), what is there to optimize? I've tried it in Matlab and the result jitters badly, so I can clearly see the result is not perfect or even sufficient; but I also don't understand why one would want to use the rather expensive and complex SVDs of Faugeras and Zhang if this simple method already yields results.
Then there are iterative pose estimation methods like the Orthogonal Iteration (OI) algorithm by Lu et al. or the Robust Pose Estimation algorithm by Schweighofer and Pinz, where the word 'homography' is not even mentioned. All they need is an initial pose estimate, which is then optimized (Schweighofer's reference Matlab implementation uses the OI algorithm, for example, which itself uses some SVD-based method).
My problem is: everything I read so far said '4 points? Homography, homography, homography. Decomposition? Well, tricky, in general not unique, several methods.' Now this iterative world opens up, and I just cannot connect the two worlds in my head; I don't fully understand their relation. I cannot even articulate my problem properly, I just hope someone understands where I am.
I'd be very thankful for a hint or two.
Edit: Is it correct to say: four points on a plane and their image are related by a homography, i.e. 8 parameters. Finding the parameters of the marker's pose can be done by computing and decomposing the homography matrix using Faugeras, Zhang, or a direct solution, each with its drawbacks. It can also be done using iterative methods like OI or Schweighofer's algorithm, which never compute the homography matrix but just use the corresponding points, and which require an initial estimate (for which the guess from a homography decomposition could be used)?
With only four points, your solution will normally be very sensitive to small errors in their locations, particularly when the rectangle is nearly orthogonal to the optical axis (this is because the vanishing points are not observable: they lie outside the image, very far from the measurements, and the pose is given by the cross product of the vectors from the centre of the quadrangle to the vanishing points).
Is your pattern such that the corners can be confidently located with subpixel accuracy? I recommend using "checkerboard-type" patterns for the corners, which allow a good and simple iterative refinement algorithm to achieve subpixel accuracy (look up "iterative saddle point algorithm", or see the OpenCV docs).
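If you go the OpenCV route, that refinement is a one-liner via cv::cornerSubPix; a minimal sketch, where the window size and termination criteria are placeholder values:

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

// Refine initial corner guesses to subpixel accuracy on a grayscale image.
void refineCorners(const cv::Mat& gray, std::vector<cv::Point2f>& corners)
{
    cv::cornerSubPix(gray, corners,
                     cv::Size(5, 5),    // half-size of search window (11x11 total)
                     cv::Size(-1, -1),  // no dead zone in the middle
                     cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT,
                                      30, 0.01)); // stop after 30 iters or 0.01 px shift
}
```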
I will not provide you with a full answer, but it looks like at least one of the points that need to be clarified is this:
A homography is an invertible mapping from P^2 (homogeneous 3-vectors) to itself, which may always be represented by an invertible 3x3 matrix. Having said that, note that if your 3D points are coplanar, you will always be able to use a homography to relate the world points to the image points.
In general, a point in 3-space is represented in homogeneous coordinates as a 4-vector, and a projective transformation acting on P^3 is represented by a non-singular 4x4 matrix (15 degrees of freedom: 16 elements minus one for overall scale).
So the bottom line is: if your model is planar, you can get away with a homography (8 DOF) and an appropriate algorithm, while in the general case you need to estimate a 4x4 matrix and would need a different algorithm for that (see the sketch below).
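As a concrete bridge between the two worlds you describe, OpenCV's cv::solvePnP with the default iterative flag does exactly this pairing, if I recall correctly: for a planar target its initial estimate comes from a homography, which Levenberg-Marquardt then refines by minimizing reprojection error. A minimal sketch (the camera matrix and distortion coefficients are assumed to come from a prior calibration):

```cpp
#include <opencv2/calib3d.hpp>
#include <vector>

// Pose (rotation + translation) of a planar marker from its 4 corners.
void markerPose(const std::vector<cv::Point3f>& objectCorners, // 4 coplanar 3D points
                const std::vector<cv::Point2f>& imageCorners,  // their detected pixels
                const cv::Mat& cameraMatrix, const cv::Mat& distCoeffs,
                cv::Mat& rvec, cv::Mat& tvec)
{
    // SOLVEPNP_ITERATIVE minimizes reprojection error with Levenberg-Marquardt,
    // starting from an internally computed initial pose.
    cv::solvePnP(objectCorners, imageCorners, cameraMatrix, distCoeffs,
                 rvec, tvec, false, cv::SOLVEPNP_ITERATIVE);
    // rvec is an axis-angle vector; cv::Rodrigues(rvec, R) yields the 3x3 rotation.
}
```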
Hope this helps,
Alex