GPU-Computation (CUDA) tex2d/tex3d - How to deal with anisotropic pixel/voxel - c++

I am quite new to cuda programming and i have a question about the texXD function. My goal is to implement a simple GPU-based ray tracer using the optimized CUDA functionality.
See CUDA texture API that is used by NVIDIA.
At my research I have to deal with images that have a different resolution for every dimension (like CT images, (x,y) have a different resolution as (z)). Resampling to an isotropic pixel/voxel size might bring up some problems (especially for medical diagnosis).
For example i have an image with size (100px x 50px) and a resolution of (2px/mm x 1px/mm). The ray enters the image at an arbitrary point and leaves is somewhere else. The ray is sampled in the direction form entrance to leaving point. At each sample point (pos.x,pos.y) the tex2D function carries out an (bicubicbilinear) interpolation taking the neighbour pixel values into account weighted by their distance from the sample point.
example image:
In both shapes the corner points are named the same way(x1,y1),.... The only difference is the physical space between the corner points. The interpolation point is (x,y). I computed an example using the formula for rectangular grids and yield a different results for both grids. But if I use the ratio of areas of the numbered rectangles I got a different result.
My Question: Will CUDA take care of the different resolutions of the dimensions or does CUDA see all pixel in the same distance (and therefore as a squared grid)?
The formula used by CUDA seems to be the one for a squared grid (google:CUDA Texture fetching).
Or can I resample the image to squared grid before using tex2D without a substantial information loss?
Any suggestions are recommended. If you need some more clarification, feel free to ask. I will specifiy my question.

I don't believe what (I think it is) you are trying to do can be achieved using textures. The sole filtering mode supported using textures is described here.
Some salient points:
Textures don't have resolution. The just have dimensions.
Textures data is implicitly uniformly spaced in all dimensions.
Texture interpolation is done in a reduced accuracy fixed point arithmetic format which gives 8 bits of representational accuracy
None of this seems like anything that would be useful for the interpolation on a non-uniform grid which you are describing. At a minimum you would need to perform a coordinate transformation before you could use the uniform filtering mode. The amount of effort and expense would be about the same as just writing an interpolation routine yourself in user code.

Related

GLSL truncated signed distance representation (TSDF) implementation

I am looking forward to implement a model reconstruction of RGB-D images. Preferred on mobile phones. For that I read, it is all done with an TSDF-representation. I read a lot of papers now over hierarchical structures and other ideas to speed this up, but my problem is, that I still hat no clue how to actually implement this representation.
If I have a volume grid of size n, so n x n x n and I want to store in each voxel the signed distance, weight and color information. My only guess is, that I have to build a discrete set of points, for each voxel position. And with GLSL "paint" all these points and calculate the nearest distance. But that don't seem quite good or efficient to calculate this n^3 times.
How can I imagine to implement such a TSDF-representation?
The problem is, my only idea is to render the voxel grid to store in the data of signed distances. But for each depth map I have to render again all voxels and calculate all distances. Is there any way to render it the other way around?
So can't I render the points of the depth map and store informations in the voxel grid?
How is the actual state of art to render such a signed distance representation in an efficient way?
You are on the right track, it's an ambitious project but very cool if you can do it.
First, it's worth getting a good idea how these things work. The original paper identifying a TSDF was by Curless and Levoy and is reasonably approachable - a copy is here . There are many later variations but this is the starting point.
Second, you will need to create nxnxn storage as you have said. This very quickly gets big. For example, if you want 400x400x400 voxels with RGB data and floating point values for distance and weight then that will be 768MB of GPU memory - you may want to check how much GPU memory you have available on a mobile device. Yup, I said GPU because...
While you can implement toy solutions on CPU, you really need to get serious about GPU programming if you want to have any kind of performance. I built an early version on an Intel i7 CPU laptop. Admittedly I spent no time optimising it but it was taking tens of seconds to integrate a single depth image. If you want to get real time (30Hz) then you'll need some GPU programming.
Now you have your TSFD data representation, each frame you need to do this:
1. Work out where the camera is with respect to the TSDF in world coordinates.
Usually you assume that you are the origin at time t=0 and then measure your relative translation and rotation with regard to the previous frame. The most common way to do this is with an algorithm called iterative closest point (ICP) You could implement this yourself or use a library like PCL though I'm not sure if they have a mobile version. I'd suggest you get started without this by just keeping your camera and scene stationary and build up to movement later.
2. Integrate the depth image you have into the TSDF This means updating the TSDF with the next depth image. You don't throw away the information you have to date, you merge the new information with the old.
You do this by iterating over every voxel in your TSDF and :
a) Work out the distance from the voxel centre to your camera
b) Project the point into your depth camera's image plane to get a pixel coordinate (using the extrinsic camera position obtained above and the camera intrinsic parameters which are easily searchable for Kinect)
c) Lookup up the depth in your depth map at that pixel coordinate
d) Project this point back into space using the pixel x and y coords plus depth and your camera properties to get a 3D point corresponding to that depth
e) Update the value for the current voxel's distance with the value distance_from_step_d - distance_from_step_a (update is usually a weighted average of the existing value plus the new value).
You can use a similar approach for the voxel colour.
Once you have integrated all of your depth maps into the TSDF, you can visualise the result by either raytracing it or extracting the iso surface (3D mesh) and viewing it in another package.
A really helpful paper that will get you there is here. This is a lab report by some students who actually implemented Kinect Fusion for themselves on a PC. It's pretty much a step by step guide though you'll still have to learn CUDA or similar to implement it
You can also check out my source code on GitHub for ideas though all normal disclaimers about fitness for purpose apply.
Good luck!
After I posted my other answer I thought of another approach which seems to match the second part of your question but it definitely is not reconstruction and doesn't involve using a TSDF. It's actually a visualisation but it is a lot simpler :)
Each frame you get an RGB and a Depth image. Assuming that these images are registered, that is the pixel at (x,y) in the RGB image represents the same pixel as that at (x,y) in the depth image then you can create a dense point cloud coloured using the RGB data. To do this you would:
For every pixel in the depth map
a) Use the camera's intrinsic matrix (K), the pixel coordinates and the depth value in the map at that point to project the point into a 3D point in camera coordinates
b) Associate the RGB value at the same pixel with that point in space
So now you have an array of (probably 640x480) structures like {x,y,z,r,g,b}
You can render these using on GLES just by creating a set of vertices and rendering points. There's a discussion on how to do this here
With this approach you throw away the data every frame and redo from scratch. Importantly, you don't get a reconstructed surface, and you don't use a TSDF. You can get pretty results but it's not reconstruction.

How to compute texture in an image?

I have segmented 'the thyroid' in an Ultrasound image. In this figure, it is the area under the green contour. Now, I would like to compute the texture of that segmented part. How can I do that? I did a google search on computation of textures in an image but it doesn't seem to help.
I guess that you meant "feature characterization" or "feature descriptors". Then here is a non exhaustive list:
Cooccurrence matrix with Haralick features (the most famous/used)
Local Binary Pattern (LBP). Extremely simple/powerful descriptor used in most of application. It has the advantage of being gray level, rotation, translation invariant.
Run Length Matrix. Useful for sequential/periodic texture (not in your case).
Size Zone Matrix (SZM). Very powerful to discriminate between homogeneous/non homogeneous textures.
Granulometry (see mathematical morphology).
Make your choice!

OpenCV triangulatePoints varying distance

I am using OpenCV's triangulatePoints function to determine 3D coordinates of a point imaged by a stereo camera.
I am experiencing that this function gives me different distance to the same point depending on angle of camera to that point.
Here is a video:
https://www.youtube.com/watch?v=FrYBhLJGiE4
In this video, we are tracking the 'X' mark. In the upper left corner info is displayed about the point that is being tracked. (Youtube dropped the quality, the video is normally much sharper. (2x1280) x 720)
In the video, left camera is the origin of 3D coordinate system and it's looking in positive Z direction. Left camera is undergoing some translation, but not nearly as much as the triangulatePoints function leads to believe. (More info is in the video description.)
Metric unit is mm, so the point is initially triangulated at ~1.94m distance from the left camera.
I am aware that insufficiently precise calibration can cause this behaviour. I have ran three independent calibrations using chessboard pattern. The resulting parameters vary too much for my taste. ( Approx +-10% for focal length estimation).
As you can see, the video is not highly distorted. Straight lines appear pretty straight everywhere. So the optimimum camera parameters must be close to the ones I am already using.
My question is, is there anything else that can cause this?
Can a convergence angle between the two stereo cameras can have this effect? Or wrong baseline length?
Of course, there is always a matter of errors in feature detection. Since I am using optical flow to track the 'X' mark, I get subpixel precision which can be mistaken by... I don't know... +-0.2 px?
I am using the Stereolabs ZED stereo camera. I am not accessing the video frames using directly OpenCV. Instead, I have to use the special SDK I acquired when purchasing the camera. It has occured to me that this SDK I am using might be doing some undistortion of its own.
So, now I wonder... If the SDK undistorts an image using incorrect distortion coefficients, can that create an image that is neither barrel-distorted nor pincushion-distorted but something different altogether?
The SDK provided with the ZED Camera performs undistortion and rectification of images. The geometry model is based on the same as openCV :
intrinsic parameters and distortion parameters for both Left and Right cameras.
extrinsic parameters for rotation/translation between Right and Left.
Through one of the tool of the ZED ( ZED Settings App), you can enter your own intrinsic matrix for Left/Right and distortion coeff, and Baseline/Convergence.
To get a precise 3D triangulation, you may need to adjust those parameters since they have a high impact on the disparity you will estimate before converting to depth.
OpenCV gives a good module to calibrate 3D cameras. It does :
-Mono calibration (calibrateCamera) for Left and Right , followed by a stereo calibration (cv::StereoCalibrate()). It will output Intrinsic parameters (focale, optical center (very important)), and extrinsic (Baseline = T[0], Convergence = R[1] if R is a 3x1 matrix). the RMS (return value of stereoCalibrate()) is a good way to see if the calibration has been done correctly.
The important thing is that you need to do this calibration on raw images, not by using images provided with the ZED SDK. Since the ZED is a standard UVC Camera, you can use opencv to get the side by side raw images (cv::videoCapture with the correct device number) and extract Left and RIght native images.
You can then enter those calibration parameters in the tool. The ZED SDK will then perform the undistortion/rectification and provide the corrected images. The new camera matrix is provided in the getParameters(). You need to take those values when you triangulate, since images are corrected as if they were taken from this "ideal" camera.
hope this helps.
/OB/
There are 3 points I can think of and probably can help you.
Probably the least important, but from your description you have separately calibrated the cameras and then the stereo system. Running an overall optimization should improve the reconstruction accuracy, as some "less accurate" parameters compensate for the other "less accurate" parameters.
If the accuracy of reconstruction is important to you, you need to have a systematic approach to reducing it. Building an uncertainty model, thanks to the mathematical model, is easy and can write a few lines of code to build that for you. Say you want to see if the 3d point is 2 meters away, at a particular angle to the camera system, and you have a specific uncertainty on the 2d projections of the 3d point, it's easy to backproject the uncertainty to the 3d space around your 3d point. By adding uncertainty to the other parameters of the system then you can see which ones are more important and need to have lower uncertainty.
This inaccuracy is inherent in the problem and the method you're using.
First if you model the uncertainty you will see the reconstructed 3d points further away from the center of cameras have a much higher uncertainty. The reason is that the angle <left-camera, 3d-point, right-camera> is narrower. I remember the MVG book had a good description of this with a figure.
Second, if you look at the implementation of triangulatePoints you see that the pseudo-inverse method is implemented using SVD to construct the 3d point. That can lead to many issues, which you probably remember from linear algebra.
Update:
But I consistently get larger distance near edges and several times
the magnitude of the uncertainty caused by the angle.
That's the result of using pseudo-inverse, a numerical method. You can replace that with a geometrical method. One easy method is to back-project the 2d-projections to get 2 rays in 3d space. Then you want to find where the intersect, which doesn't happen due to the inaccuracies. Instead you want to find the point where the 2 rays have the least distance. Without considering the uncertainty you will consistently favor a point from the set of feasible solutions. That's why with pseudo inverse you don't see any fluctuation but a gross error.
Regarding the general optimization, yes, you can run an iterative LM optimization on all the parameters. This is the method used in applications like SLAM for autonomous vehicles where accuracy is very important. You can find some papers by googling bundle adjustment slam.

OpenCV Image stiching when camera parameters are known

We have pictures taken from a plane flying over an area with 50% overlap and is using the OpenCV stitching algorithm to stitch them together. This works fine for our version 1. In our next iteration we want to look into a few extra things that I could use a few comments on.
Currently the stitching algorithm estimates the camera parameters. We do have camera parameters and a lot of information available from the plane about camera angle, position (GPS) etc. Would we be able to benefit anything from this information in contrast to just let the algorithm estimate everything based on matched feature points?
These images are taken in high resolution and the algorithm takes up quite amount of RAM at this point, not a big problem as we just spin large machines up in the cloud. But I would like to in our next iteration to get out the homography from down sampled images and apply it to the large images later. This will also give us more options to manipulate and visualize other information on the original images and be able to go back and forward between original and stitched images.
If we in question 1 is going to take apart the stitching algorithm to put in the known information, is it just using the findHomography method to get the info or is there better alternatives to create the homography when we actually know the plane position and angles and the camera parameters.
I got a basic understanding of opencv and is fine with c++ programming so its not a problem to write our own customized stitcher, but the theory is a bit rusty here.
Since you are using homographies to warp your imagery, I assume you are capturing areas small enough that you don't have to worry about Earth curvature effects. Also, I assume you don't use an elevation model.
Generally speaking, you will always want to tighten your (homography) model using matched image points, since your final output is a stitched image. If you have the RAM and CPU budget, you could refine your linear model using a max likelihood estimator.
Having a prior motion model (e.g. from GPS + IMU) could be used to initialize the feature search and match. With a good enough initial estimation of the feature apparent motion, you could dispense with expensive feature descriptor computation and storage, and just go with normalized crosscorrelation.
If I understand correctly, the images are taken vertically and overlap by a known amount of pixels, in that case calculating homography is a bit overkill: you're just talking about a translation matrix, and using more powerful algorithms can only give you bad conditioned matrixes.
In 2D, if H is a generalised homography matrix representing a perspective transformation,
H=[[a1 a2 a3] [a4 a5 a6] [a7 a8 a9]]
then the submatrixes R and T represent rotation and translation, respectively, if a9==1.
R= [[a1 a2] [a4 a5]], T=[[a3] [a6]]
while [a7 a8] represents the stretching of each axis. (All of this is a bit approximate since when all effects are present they'll influence each other).
So, if you known the lateral displacement, you can create a 3x3 matrix having just a3, a6 and a9=1 and pass it to cv::warpPerspective or cv::warpAffine.
As a criteria of matching correctness you can, f.e., calculate a normalized diff between pixels.

Algorithm to zoom images clearly

I know images can be zoomed with the help of image pyramids. And I know opencv pyrUp() method can zoom images. But, after certain extent, the image gets non-clear. For an example, if we zoom a small image 15 times of its original size, it is definitely not clear.
Are there any method in OpenCV to zoom the images but keep the clearance as it is in the original one? Or else, any algorithm to do this?
One thing to remember: You can't pull extra resolution out of nowhere. When you scale up an image, you can have either a blurry, smooth image, or you can have a sharp, blocky image, or you can have something in between. Better algorithms, that appear to have better performance with specific types of subjects, make certain assumptions about the contents of the image, which, if true, can yield higher apparent performance, but will mess up if those assumptions prove false; there you are trading accuracy for sharpness.
There are several good algorithms out there for zooming specific types of subjects, including pixel art,
faces, or text.
More general algorithms for sharpening images include unsharp masking, edge enhancement, and others, however all of these are assume specific things about the contents of the image, for instance, that the image contains text, or that a noisy area would still be noisy (or not) at a higher resolution.
A low-resolution polka-dot pattern, or a sandy beach's gritty pattern, will not go over very well, and the computer may turn your seascape into something more reminiscent of a mosh pit. Every zoom algorithm or sharpening filter has a number of costs associated with it.
In order to correctly select a zoom or sharpening algorithm, more context, including sample images, are absolutely necessary.
OpenCV has the Super Resolution module. I haven't had a chance to try it yet so not too sure how well it works.
You should check out Super-Resolution From a Single Image:
Methods for super-resolution (SR) can be broadly classified into two families of methods: (i) The classical multi-image super-resolution (combining images obtained at subpixel misalignments), and (ii) Example-Based super-resolution (learning correspondence between low and high resolution image patches from a database). In this paper we propose a unified framework for combining these two families of methods.
You most likely want to experiment with different interpolation schemes for your images. OpenCV provides the resize function that can be used with various different interpolation schemes (docs). You will likely be trading off bluriness (e.g., in bicubic or bilinear interpolation schemes) with jagged aliasing effects (for example, in nearest-neighbour interpolation). I'd recommend experimenting with the different schemes that it provides and see which ones give you the best results.
The supported interpolation schemes are listed as:
INTER_NEAREST nearest-neighbor interpolation
INTER_LINEAR bilinear interpolation (used by default)
INTER_AREA resampling using pixel area relation. It may be the preferred method
for image decimation, as it gives moire-free results. But when the image is
zoomed, it is similar to the INTER_NEAREST method
INTER_CUBIC bicubic interpolation over 4x4 pixel neighborhood
INTER_LANCZOS4 Lanczos interpolation over 8x8 pixel neighborhood
Wikimedia commons provides this nice comparison image for nearest-neighbour, bilinear, and bicubic interpolation:
You can see that you are unlikely to get the same sharpness as the original image when zoomed, but you can trade off "smoothness" for aliasing effects (i.e., jagged edges).
Take a look at quick image scaling algorithms.
First, I will discuss a simple algorithm, dubbed "smooth Bresenham" that can best be described as nearest neighbour interpolation on a zoomed grid, using a Bresenham algorithm. The algorithm is quick, it produces a quality equivalent to that of linear interpolation and it can zoom up and down, but it is only suitable for a zoom factor that is within a fairly small range. To offset this, I next develop a directional interpolation algorithm that can only magnify (scale up) and only with a factor of 2×, but that does so in a way that keeps edges sharp. This directional interpolation method is quite a bit slower than the smooth Bresenham algorithm, and it is therefore practical to cache those 2× images, once computed. Caching images with relative sizes that are powers of 2, combined with simple interpolation, is actually a third image zooming technique: MIP-mapping.
A related question is Image scaling and rotating in C/C++. Also, you can use CImpg.
What your asking goes out of this universe physics: there are simply not enough bits in the original image to represent 15*15 times more details. Whatever algorithm cannot invent the "right information" that is not there. It can just find a suitable interpolation. But it will never increase the details.
Despite what happens in many police fiction, getting a picture of fingerprint on a car door handle stating from a panoramic view of a city is definitively a fake.
You Can easily zoom in or zoom out an image in opencv using the following two functions.
For Zoom In
pyrUp(tmp, dst, Size(tmp.cols * 2, tmp.rows * 2));
For Zoom Out
pyrDown(tmp, dst, Size(tmp.cols / 2, tmp.rows / 2));
You can get details about the method in the following link:
Image Zoom Out and Zoom In using OpenCV