Faster Method for Multiple Bilinear Interpolation? - c++

I am writing a program in C++ to reconstruct a 3D object from a set of projected 2D images, the most computation-intensive part of which involves magnifying and shifting each image via bilinear interpolation. I currently have a pair of functions for this task; "blnSetup" defines a handful of parameters outside the loop, then "bilinear" applies the interpolation point-by-point within the loop:
(NOTE: 'I' is a 1D array containing ordered rows of image data)
//Pre-definition structure (in header)
struct blnData {
    float* X;
    float* Y;
    int* I;
    float X0;
    float Y0;
    float delX;
    float delY;
};
//Pre-definition function (outside the FOR loop)
extern inline blnData blnSetup(float* X, float* Y, int* I)
{
    blnData bln;
    //Create pointers to X, Y, and I vectors
    bln.X = X;
    bln.Y = Y;
    bln.I = I;
    //Store offset and step values for X and Y
    bln.X0 = X[0];
    bln.delX = X[1] - X[0];
    bln.Y0 = Y[0];
    bln.delY = Y[1] - Y[0];
    return bln;
}
//Main interpolation function (inside the FOR loop)
extern inline float bilinear(float x, float y, blnData bln)
{
    float Ixy;
    //Return 0 if the target point is outside the image matrix
    if (x < bln.X[0] || x > bln.X[-1] || y < bln.Y[0] || y > bln.Y[-1])
        Ixy = 0;
    //Otherwise, apply bilinear interpolation
    else
    {
        //Define known image width
        int W = 200;
        //Find nearest indices for interpolation
        int i = (int)floor((x - bln.X0) / bln.delX);
        int j = (int)floor((y - bln.Y0) / bln.delY);
        //Interpolate I at (xi, yj)
        Ixy = 1 / ((bln.X[i + 1] - bln.X[i]) * (bln.Y[j + 1] - bln.Y[j])) *
            (
            bln.I[W*j + i] * (bln.X[i + 1] - x) * (bln.Y[j + 1] - y) +
            bln.I[W*j + i + 1] * (x - bln.X[i]) * (bln.Y[j + 1] - y) +
            bln.I[W*(j + 1) + i] * (bln.X[i + 1] - x) * (y - bln.Y[j]) +
            bln.I[W*(j + 1) + i + 1] * (x - bln.X[i]) * (y - bln.Y[j])
            );
    }
    return Ixy;
}
EDIT: The function calls are below. 'flat.imgdata' is a std::vector containing the input image data and 'proj.imgdata' is a std::vector containing the transformed image.
int Xs = flat.dim[0];
int Ys = flat.dim[1];
int* Iarr = flat.imgdata.data();
float II, x, y;
bln = blnSetup(X, Y, Iarr);
for (int j = 0; j < flat.imgdata.size(); j++)
{
    x = 1.2*X[j % Xs];
    y = 1.2*Y[j / Xs];
    II = bilinear(x, y, bln);
    proj.imgdata[j] = (int)II;
}
Since I started optimizing, I have been able to reduce computation time by ~50x (!) by switching from std::vectors to C arrays within the interpolation function, and another 2x or so by cleaning up redundant computations/typecasting/etc, but assuming O(n) with n being the total number of processed pixels, the full reconstruction (~7e10 pixels) should still take 40min or so--about an order of magnitude longer than my goal of <5min.
According to Visual Studio's performance profiler, the interpolation function call ("II = bilinear(x, y, bln);") is unsurprisingly still the majority of my computation load. I haven't been able to find any linear algebraic methods for fast multiple interpolation, so my question is: is this basically as fast as my code will get, short of applying more or faster CPUs to the task? Or is there a different approach that might speed things up?
P.S. I've also only been coding in C++ for about a month now, so feel free to point out any beginner mistakes I might be making.

I wrote up a long answer suggesting looking at OpenCV (opencv.org), or using Halide (http://halide-lang.org/), and getting into how image warping is optimized, but I think a shorter answer might serve better. If you are really just scaling and translating entire images, OpenCV has code to do that and we have an example for resizing in Halide as well (https://github.com/halide/Halide/blob/master/apps/resize/resize.cpp).
If you really have an algorithm that needs to index an image using floating-point coordinates which result from a computation that cannot be turned into a moderately simple function on integer coordinates, then you really want to be using filtered texture sampling on a GPU. Most techniques for optimizing on the CPU rely on exploiting some regular pattern of access in the algorithm and removing the float-to-integer conversion from the addressing. (For resizing, one uses two integer variables: one indexes the pixel coordinate of the image, and the other holds the fractional part of the coordinate and indexes a kernel of weights.) If this is not possible, the speedups are somewhat limited on CPUs. OpenCV does provide fairly general remapping support, but it likely isn't all that fast.
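A minimal sketch of that two-integer addressing idea for a constant scale factor (the function name and the 8-bit fraction width are my own assumptions, not something from the post):

    // One horizontal pass of a separable resize using fixed-point stepping.
    // The source coordinate advances incrementally, so there is no
    // float-to-int conversion inside the loop.
    void resizeRow(const int* src, int srcW, int* dst, int dstW, float scale)
    {
        const int FRAC_BITS = 8;                 // fractional precision
        const int ONE = 1 << FRAC_BITS;
        int step = (int)(ONE / scale);           // fixed-point source step
        int pos = 0;                             // fixed-point source coordinate
        for (int x = 0; x < dstW; ++x)
        {
            int i = pos >> FRAC_BITS;            // integer pixel index
            int f = pos & (ONE - 1);             // fractional part -> weight
            int i1 = (i + 1 < srcW) ? i + 1 : i; // clamp at the right edge
            dst[x] = (src[i] * (ONE - f) + src[i1] * f) >> FRAC_BITS;
            pos += step;
        }
    }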
Two optimizations that may be applicable here are moving the boundary condition out of the loop and using a two-pass approach in which the horizontal and vertical dimensions are processed separately. The latter may or may not win, and will require tiling the data to fit in cache if the images are very large. Tiling in general is pretty important for large images, but it isn't clear it is the first-order performance problem here, and depending on the values in the inputs, the cache behavior may not be regular enough anyway.
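A sketch of the boundary hoisting, reusing the variables from the question and assuming X and Y increase monotonically (which the delX/delY setup implies): the in-bounds row/column ranges are found once, so the per-pixel test inside bilinear can be dropped.

    // Find the ranges of columns/rows whose scaled coordinate stays in bounds.
    int colLo = 0, colHi = Xs, rowLo = 0, rowHi = Ys;
    while (colLo < Xs && 1.2f * X[colLo] < X[0]) ++colLo;
    while (colHi > colLo && 1.2f * X[colHi - 1] > X[Xs - 1]) --colHi;
    while (rowLo < Ys && 1.2f * Y[rowLo] < Y[0]) ++rowLo;
    while (rowHi > rowLo && 1.2f * Y[rowHi - 1] > Y[Ys - 1]) --rowHi;
    // Out-of-range pixels are zero, as in the original bounds check.
    for (size_t p = 0; p < proj.imgdata.size(); ++p) proj.imgdata[p] = 0;
    // The interior loop now needs no per-pixel boundary test.
    for (int r = rowLo; r < rowHi; ++r)
        for (int c = colLo; c < colHi; ++c)
            proj.imgdata[r * Xs + c] = (int)bilinear(1.2f * X[c], 1.2f * Y[r], bln);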

"vector 50x slower than array". That's a dead giveaway you're in debug mode, where vector::operator[] is not inlined. You will probably get the necessary speedup, and a lot more, simply by switching to release mode.
As a bonus, vector has a .back() method, so you have a proper replacement for that [-1]. Pointers to the beginning of an array don't carry the array size, so you can't find the back of an array that way.

Related

Speeding up a do-while loop with loop unrolling

I am trying to speed up code in a function that may be called many times over (maybe more than a million times). The code sets two variables to random numbers and finds the squared distance from the origin. My first idea is loop unrolling, but I am a bit confused about how to execute it because of the while condition that controls the loop.
To speed up my program I have replaced the C++ built-in rand() function with a custom one, but I am stumped on how to make the program even faster.
do {
    x = customRand();
    y = customRand();
    distance = x * x + y * y; // euclidean square distance
} while (distance >= 1.0);
You shouldn't expect to make your program faster by loop unrolling, because with a properly selected range ([-1, 1]) for the random number generator output your loop's body will be executed just once in more than 3/4 of the cases.
What you might want to do to help the compiler is to mark your while condition as "unlikely". For example, in GCC it would be:
#define unlikely(x) __builtin_expect((x),0)
do {
    x = customRand();
    y = customRand();
    distance = x * x + y * y; // euclidean square distance
} while (unlikely(distance >= 1.0));
Still, even this is unlikely to speed up your code in a measurable way.
If you were after guaranteed time rather than average speed, then for a uniform random distribution within the circle, with customRand() uniformly distributed in [-1, 1],
r = std::sqrt(std::abs(customRand())); // sqrt keeps the area density uniform
t = M_PI * customRand();               // angle in [-pi, pi]
x = r * std::cos(t);
y = r * std::sin(t);
would do the trick.
It smells like a math-based XY problem, but I will first focus on your question about loop speedup. I will still do some math though.
So a basic speedup is to continue early if either x or y is greater than or equal to 1.0; there is no chance that x*x + y*y < 1.0 when one of x or y is already at least 1.
float distance = 1.0f; // must start >= 1.0: 'continue' jumps straight to the condition
do {
    float x = customRand();
    if (x >= 1.0) {
        continue;
    }
    float y = customRand();
    if (y >= 1.0) {
        continue;
    }
    distance = x * x + y * y; // euclidean square distance
} while (distance >= 1.0);
Unfortunately, it most likely screws up the branch prediction mechanism of modern CPUs, and still may not give a big speedup. Benchmarks, as always, would be required.
Also, as the fastest code is the code which does not run, let's discuss the potential XY problem. You seem to generate points bounded by a circle with radius 1.0. In the Cartesian system that is a tough task, but it is really easy in the polar system, where each point in the disk is described by a radius and an angle and can easily be converted to Cartesian coordinates. I don't know what distribution you expect over the disk area, but see the answers for this problem here: Generate a random point within a circle (uniformly)
P.S. There are some valid comments about <random>; using the standard library generators may lead to some speedup as well.
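For reference, a minimal sketch of what those comments suggest, assuming the same [-1, 1] range as above:

    #include <random>

    // One engine per thread; construction is the expensive part,
    // so keep it outside the loop.
    std::mt19937 engine{std::random_device{}()};
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    float x, y, distance;
    do {
        x = dist(engine);
        y = dist(engine);
        distance = x * x + y * y; // euclidean square distance
    } while (distance >= 1.0f);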

Convolutional network filter always negative

I asked a question about a network which I've been building last week, and I iterated on the suggestions, which led me to finding a few problems. I've come back to this project and fixed up all the issues and learnt a lot more about CNNs in the process. Now I'm stuck on an issue where all of my weights move to massively negative values, which coupled with the RELU ends in the output image always being completely black (making it impossible for the classifier to do its job).
On two labeled images:
These are passed into a two-layer network: a classifier (which gets 100% on its own) and a one-filter 3*3 convolutional layer.
On the first iteration the output from the conv layer looks like this (images in the same order as above):
The filter is 3*3*3, due to the images being RGB. The weights are all random numbers between 0.0f and 1.0f. On the next iteration the images are completely black; printing the filters shows that they are now in the range of -49678.5f (the highest I can see) to -61932.3f.
This issue in turn is due to the gradients being passed back from the Logistic Regression/Linear layer being crazy high for the cross (label 0, prediction 0). For the circle (label 1, prediction 0) the values are between roughly -12 and -5, but for the cross they are all in the positive high-1000 to high-2000 range.
The code which sends these back looks something like (some parts omitted):
void LinearClassifier::Train(float* x, float output, float y)
{
    float h = output - y; // prediction error
    float average = 0.0f;
    for (int i = 1; i < m_NumberOfWeights; ++i)
    {
        float error = h * x[i - 1];
        m_pGradients[i - 1] = error;
        average += error;
    }
    average /= static_cast<float>(m_NumberOfWeights - 1);
    for (int theta = 1; theta < m_NumberOfWeights; ++theta)
    {
        m_pWeights[theta] = m_pWeights[theta] - learningRate * m_pGradients[theta - 1];
    }
    // Bias
    m_pWeights[0] -= learningRate * average;
}
This is passed back to the single convolution layer:
// This code is in three nested for loops (for layer, for outWidth, for outHeight)
float gradient = 0.0f;
// ReLu Derivative
if (m_pOutputBuffer[outputIndex] > 0.0f)
{
    gradient = outputGradients[outputIndex];
}
for (int z = 0; z < m_InputDepth; ++z)
{
    for (int u = 0; u < m_FilterSize; ++u)
    {
        for (int v = 0; v < m_FilterSize; ++v)
        {
            int x = outX + u - 1;
            int y = outY + v - 1;
            int inputIndex = x + y * m_OutputWidth + z * m_OutputWidth * m_OutputHeight;
            int kernelIndex = u + v * m_FilterSize + z * m_FilterSize * m_FilterSize;
            m_pGradients[inputIndex] += m_Filters[layer][kernelIndex] * gradient;
            m_GradientSum[layer][kernelIndex] += input[inputIndex] * gradient;
        }
    }
}
This code is iterated over, passing in each image one at a time. The gradients are obviously going in the right direction, but how do I stop the huge gradients from throwing off the prediction function?
RELU activations are notorious for doing this. You usually have to use a low learning rate. The reasoning behind this is that when the RELU returns positive numbers it can continue to learn freely, but if a unit gets in a position where the signal coming into it is always negative it can become a "dead" neuron and never activate again.
Also, initializing your weights is more delicate with RELU. It appears that you are initializing to the range 0-1, which creates a huge bias. Two tips here: use a range centered around 0, and a range that is much smaller. A normal distribution with mean 0 and std 0.02 usually works well.
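A sketch of that initialization with the standard library (the weight buffer and its size are hypothetical names, matching the 3*3*3 filter described above):

    #include <random>
    #include <vector>

    std::mt19937 engine{std::random_device{}()};
    std::normal_distribution<float> init(0.0f, 0.02f); // mean 0, std 0.02

    std::vector<float> filterWeights(3 * 3 * 3); // hypothetical weight buffer
    for (float& w : filterWeights)
        w = init(engine);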
I fixed it by downscaling the gradients in the CNN layer, but now I'm confused as to why this works/is needed, so if anyone has any intuition as to why this works, that'd be great.
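The usual name for that downscaling is gradient clipping, and the intuition matches the answer above: one huge step (those +1000..+2000 gradients) pushes every weight so far negative that the RELU outputs zero everywhere, and a zero output means zero gradient, so the layer can never recover. Clipping keeps the gradient's direction but bounds its magnitude. A minimal sketch (the maxNorm value is an assumption):

    #include <cmath>

    // Scale the gradient vector down so its L2 norm never exceeds maxNorm.
    void clipGradients(float* grads, int n, float maxNorm)
    {
        float sumSq = 0.0f;
        for (int i = 0; i < n; ++i)
            sumSq += grads[i] * grads[i];
        float norm = std::sqrt(sumSq);
        if (norm > maxNorm)
        {
            float scale = maxNorm / norm;
            for (int i = 0; i < n; ++i)
                grads[i] *= scale;
        }
    }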

Fast way to convert ARGB texture to next power of 2 texture

We have some old devices that don't support non-pot textures and we have a function that converts ARGB textures to next power of 2 texture. The problem is that it's quite slow and we're wondering if there is a better approach to convert these textures.
void PotTexture()
{
    size_t u2 = 1; while (u2 < imageData.width) u2 *= 2;
    size_t v2 = 1; while (v2 < imageData.height) v2 *= 2;
    std::vector<unsigned char> pottedImageData;
    pottedImageData.resize(u2 * v2 * 4);
    size_t y, x, c;
    for (y = 0; y < imageData.height; y++)
    {
        for (x = 0; x < imageData.width; x++)
        {
            for (c = 0; c < 4; c++)
            {
                pottedImageData[4 * u2 * y + 4 * x + c] =
                    imageData.convertedData[4 * imageData.width * y + 4 * x + c];
            }
        }
    }
    imageData.width = u2;
    imageData.height = v2;
    std::swap(imageData.convertedData, pottedImageData);
}
On some devices this can easily use 100% of the CPU so any optimizations would be amazing. Are there any existing functions that I could look at that perform this conversion?
Edit:
I've optimized the above loop slightly to:
for (y = 0; y < imageData.height; y++)
{
    memcpy(
        &(pottedImageData[y * u2 * 4]),
        &(imageData.convertedData[y * imageData.width * 4]),
        imageData.width * 4);
}
Even devices that don't support NPOT texture should support NPOT load.
Create the texture as an exact power of 2 and NO CONTENT using glTexImage2D, passing a null pointer for data.
data may be a null pointer. In this case, texture memory is allocated to accommodate a texture of width width and height height. You can then download subtextures to initialize this texture memory. The image is undefined if the user tries to apply an uninitialized portion of the texture image to a primitive.
Then use glTexSubImage2D to upload a NPOT image, which occupies only a portion of the total texture. This can be done without any CPU-side image rearrangement.
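A minimal sketch of that sequence (u2/v2 are the power-of-two sizes computed in the question; the exact format/type enums for ARGB data depend on your byte order, so GL_BGRA with GL_UNSIGNED_BYTE here is an assumption):

    // Allocate a POT texture with no content, then upload the NPOT image
    // into its top-left corner; no CPU-side rearrangement needed.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, u2, v2, 0,
                 GL_BGRA, GL_UNSIGNED_BYTE, nullptr);   // null data: allocate only
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0,
                    imageData.width, imageData.height,
                    GL_BGRA, GL_UNSIGNED_BYTE, imageData.convertedData.data());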
Having had a similar problem in a program I wrote, I took a very different approach. Rather than stretch the source texture, I just copied it into the top left corner of an otherwise empty power-of-two texture.
Then in the pixel shader you use a pair of floats to adjust s,t values so you fetch from just the top left corner.
float sAdjust = static_cast<float>(textureWidth) / static_cast<float>(containerWidth);
float tAdjust = static_cast<float>(textureHeight) / static_cast<float>(containerHeight);
is how you compute them. To use them, you'll get a Vec2 holding the s,t coordinates; just multiply s by sAdjust and t by tAdjust before using them to fetch. If you're using Direct3D, it'd be something akin to this:
D3DXVECTOR4 stAdjust;
stAdjust.x = sAdjust;
stAdjust.y = tAdjust;
// Transfer stAdjust into a float4 inside your pixel shader, call it stAdjust in there
Now in the pixel shader assume you have:
float2 texcoord;
float4 stAdjust;
you just say:
texcoord.x = texcoord.x * stAdjust.x;
texcoord.y = texcoord.y * stAdjust.y;
before using texcoord. Sorry I can't tell you how to do this in GLSL, but you get the general idea.
Okay, the very first optimization can be done here:
size_t u2 = 1; while (u2 < imageData.width) u2 *= 2;
size_t v2 = 1; while (v2 < imageData.height) v2 *= 2;
What you want to do is (for each dimension) find the floor of the base-2 logarithm (log2) n, and compute 2**(n+1). The standard math library has a log2 function, but it operates on floating point. Still, we can use it. 2**n can be written as 1 << n. So this gives
size_t const dim_p2_… = 1 << (int)floor(log2(dim_…)+1);
Better but not yet ideal, because of that float conversion. The Bit Twiddling hacks document has a few functions for integer ilog2: https://graphics.stanford.edu/~seander/bithacks.html#IntegerLog
But we're still not optimal. Let me introduce you to Compiler intrinsics, which translate into machine instructions, if the machine in question can do it on the metal.
GNU GCC: int __builtin_clz (unsigned int x), which returns the number of leading zero bits in x; for a 32-bit unsigned int, the index of the most significant set bit is therefore 31 - __builtin_clz(x). (There is also __builtin_ffs, but that finds the least significant set bit, which is not what we want here.)
MSVC++: _BitScanReverse, which writes the index of the most significant set bit to an output parameter, i.e. it corresponds directly to 31 - __builtin_clz(x).
So we can do
#define ilog2(x) (31 - __builtin_clz(x))
or
static inline int ilog2(unsigned long x) { unsigned long i; _BitScanReverse(&i, x); return (int)i; }
and use that.
size_t const dim_p2_… = (size_t)1 << (ilog2(dim_…) + 1);
While we're at bit twiddling: we can skip that whole ordeal if a texture is already in power-of-two format. A few years ago I (independently) rediscovered the wonderfully portable bit-twiddling trick, exploiting the properties of two's-complement integers. You can also find it in the bit twiddles document. But the type-neutral, concise macro form is rarely seen. So here it is:
#define ISPOW2(x) ( (x) && !( (x) & ((x) - 1) ) )
You're using C++ so templates are in order:
template<typename T> bool ispow2(T const x) { return x && !( x & (x - 1) ); }
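Putting the pieces together, a sketch of a next-power-of-two helper for 32-bit values (GCC/Clang intrinsic shown; substitute the _BitScanReverse variant on MSVC):

    #include <cstdint>

    // Round up to the next power of two; returns x itself if it already
    // is one. __builtin_clz is undefined for x == 0, hence the guard.
    static inline uint32_t next_pow2(uint32_t x)
    {
        if (x <= 1) return 1;
        if ((x & (x - 1)) == 0) return x;     // already a power of two
        return 1u << (32 - __builtin_clz(x)); // 1 << (ilog2(x) + 1)
    }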
Then Ben Voight already told you how to use glTexSubImage2D to load that into the texture. Also have a look at the GL_ARB_texture_rectangle extension, which allows loading NPOT textures, but without the ability for mipmapping and advanced filtering. It might still be a viable choice for you.
If you ever feel the need to scale the texture, it's always worth looking into dual spaces, in this case the spatial frequency domain. Upscaling a signal is essentially an impulse response, and as such it can be described as a convolution. Convolutions are usually O(n²) in complexity. But due to the Fourier convolution theorem, the equivalent operation in Fourier space is a simple multiplication, so it becomes O(n). An FFT can be done in O(n log n), so the total complexity is about O(n + 2n log n), which is much better.

Improving C++ algorithm for finding all points within a sphere of radius r

Language/Compiler: C++ (Visual Studio 2013)
Experience: ~2 months
I am working in a rectangular grid in 3D space (size: xdim by ydim by zdim), where "xgrid", "ygrid", and "zgrid" are 3D arrays of the x-, y-, and z-coordinates, respectively. Now, I am interested in finding all points that lie within a sphere of radius "r" centered about the point "(vi,vj,vk)". I want to store the index locations of these points in the vectors "xidx,yidx,zidx". For a single point this algorithm works and is fast enough, but when I wish to iterate over many points within the 3D space I run into very long run times.
Does anyone have any suggestions on how I can improve the implementation of this algorithm in C++? After running some profiling software I found online (Very Sleepy, Luke Stackwalker), it seems that the "std::vector::size" and "std::vector::operator[]" member functions are bogging down my code. Any help is greatly appreciated.
Note: Since I do not know a priori how many voxels are within the sphere, I set the length of vectors xidx,yidx,zidx to be larger than necessary and then erase all the excess elements at the end of the function.
void find_nv(int vi, int vj, int vk, vector<double> &xidx, vector<double> &yidx, vector<double> &zidx, double*** &xgrid, double*** &ygrid, double*** &zgrid, int r, double xdim, double ydim, double zdim, double pdim)
{
    double xcor, ycor, zcor;
    vector<double> xyz(3);
    xyz[0] = xgrid[vi][vj][vk];
    xyz[1] = ygrid[vi][vj][vk];
    xyz[2] = zgrid[vi][vj][vk];
    int counter = 0;
    // Confine loop to be within boundaries of sphere
    int istart = vi - r;
    int iend = vi + r;
    int jstart = vj - r;
    int jend = vj + r;
    int kstart = vk - r;
    int kend = vk + r;
    if (istart < 0) {
        istart = 0;
    }
    if (iend > xdim - 1) {
        iend = xdim - 1;
    }
    if (jstart < 0) {
        jstart = 0;
    }
    if (jend > ydim - 1) {
        jend = ydim - 1;
    }
    if (kstart < 0) {
        kstart = 0;
    }
    if (kend > zdim - 1) {
        kend = zdim - 1;
    }
    //-----------------------------------------------------------
    // Begin iterating through all points
    //-----------------------------------------------------------
    for (int k = kstart; k < kend + 1; ++k)
    {
        for (int j = jstart; j < jend + 1; ++j)
        {
            for (int i = istart; i < iend + 1; ++i)
            {
                if (i == vi && j == vj && k == vk)
                    continue;
                xcor = pow((xgrid[i][j][k] - xyz[0]), 2);
                ycor = pow((ygrid[i][j][k] - xyz[1]), 2);
                zcor = pow((zgrid[i][j][k] - xyz[2]), 2);
                double rsqr = pow(r, 2);
                double sphere = xcor + ycor + zcor;
                if (sphere <= rsqr)
                {
                    xidx[counter] = i;
                    yidx[counter] = j;
                    zidx[counter] = k;
                    counter = counter + 1;
                }
            }
        }
    }
    // erase all appended zeros that are not voxels within the sphere
    xidx.erase(xidx.begin() + counter, xidx.end());
    yidx.erase(yidx.begin() + counter, yidx.end());
    zidx.erase(zidx.begin() + counter, zidx.end());
}
You already appear to have used my favourite trick for this sort of thing, getting rid of the relatively expensive square root functions and just working with the squared values of the radius and center-to-point distance.
One other possibility which may speed things up (a) is to replace all the:
xyzzy = pow (plugh, 2)
calls with the simpler:
xyzzy = plugh * plugh
You may find the removal of the function call could speed things up, however marginally.
Another possibility, if you can establish the maximum size of the target array, is to use a real array rather than a vector. I know they make the vector code as insanely optimal as possible, but it still won't match a fixed-size array for performance (since it has to do everything the fixed-size array does, plus handle possible expansion).
Again, this may only offer very marginal improvement at the cost of more memory usage but trading space for time is a classic optimisation strategy.
Other than that, ensure you're using the compiler optimisations wisely. The default build in most cases has a low level of optimisation to make debugging easier. Ramp that up for production code.
(a) As with all optimisations, you should measure, not guess! These suggestions are exactly that: suggestions. They may or may not improve the situation, so it's up to you to test them.
One of your biggest problems, and one that is probably preventing the compiler from making a lot of optimisations, is that you are not using the regular nature of your grid.
If you are really using a regular grid then
xgrid[i][j][k] = x_0 + i * dxi + j * dxj + k * dxk
ygrid[i][j][k] = y_0 + i * dyi + j * dyj + k * dyk
zgrid[i][j][k] = z_0 + i * dzi + j * dzj + k * dzk
If your grid is axis aligned then
xgrid[i][j][k] = x_0 + i * dxi
ygrid[i][j][k] = y_0 + j * dyj
zgrid[i][j][k] = z_0 + k * dzk
Replacing these inside your core loop should result in significant speedups.
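A sketch of what the core loop becomes in the axis-aligned case (x_0, dxi, dyj, dzk are the grid origin and spacings from the formulas above; istart/iend etc. are the clamped bounds from the question): no grid arrays are read at all, so the compiler is free to vectorize.

    double rsqr = (double)r * r;
    for (int k = kstart; k <= kend; ++k)
    {
        double dz = (k - vk) * dzk;          // z offset from the center
        for (int j = jstart; j <= jend; ++j)
        {
            double dy = (j - vj) * dyj;      // y offset from the center
            for (int i = istart; i <= iend; ++i)
            {
                if (i == vi && j == vj && k == vk)
                    continue;
                double dx = (i - vi) * dxi;  // x offset from the center
                if (dx * dx + dy * dy + dz * dz <= rsqr)
                {
                    xidx[counter] = i;
                    yidx[counter] = j;
                    zidx[counter] = k;
                    ++counter;
                }
            }
        }
    }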
You could do two things: reduce the number of points you are testing for inclusion, and simplify the problem to multiple 2D tests.
If you take the sphere and look at it down the z axis, you have all the points from y+r to y-r in the sphere. Using each of these points, you can slice the sphere into circles that contain all the points in the x/z plane, limited to the circle's radius at that specific y you are testing. Calculating the radius of each circle is a simple right-angle-triangle problem: solve for the length of the base.
Right now you are testing all the points in a cube, but the upper ranges of the sphere exclude most points. The idea behind the above algorithm is that you can limit the points tested at each level of the sphere to the square containing the radius of the circle at that height.
Here is a simple hand-drawn graphic, showing the sphere from the side view.
Here we are looking at the slice of the sphere that has the radius ab. Since you know the lengths ac and bc of the right-angle triangle, you can calculate ab using Pythagoras' theorem. Now you have a simple circle that you can test the points in; then move down, reduce the length ac, recalculate ab, and repeat.
Now once you have that, you can do a little more optimization. First, you do not need to test every point against the circle; you only need to test one quarter of the points. If you test the points in the upper left quadrant of the circle (the slice of the sphere), then the points in the other three quadrants are just mirror images of that same point, offset to the right, bottom, or diagonally from the point determined to be in the first quadrant.
Then finally, you only need to do the circle slices of the top half of the sphere, because the bottom half is just a mirror of the top half. In the end you have only tested roughly an eighth of the points for containment in the sphere. This should be a huge performance boost.
I hope that makes sense; I am not at a machine now where I can provide a sample.
A simple thing here would be a 3D flood fill from the center of the sphere rather than iterating over the enclosing cube, as you need to visit fewer points. Moreover, you should implement the iterative version of the flood fill to get more efficiency. See the sketch below.
Flood Fill
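A minimal sketch of the iterative version with an explicit queue; it assumes unit grid spacing (an index-space sphere), and the full-volume visited array is a simplification:

    #include <queue>
    #include <vector>

    struct Voxel { int i, j, k; };

    // Iterative 3D flood fill from the sphere's center: only voxels that
    // pass the radius test are expanded, so the traversal never leaves
    // the sphere and visits fewer points than the enclosing cube.
    void floodFillSphere(int vi, int vj, int vk, int r,
                         int xdim, int ydim, int zdim,
                         std::vector<Voxel>& inside)
    {
        std::vector<char> seen((size_t)xdim * ydim * zdim, 0);
        auto idx = [&](int i, int j, int k) {
            return ((size_t)k * ydim + j) * xdim + i;
        };
        const int rsqr = r * r;
        const int d[6][3] = {{1,0,0},{-1,0,0},{0,1,0},{0,-1,0},{0,0,1},{0,0,-1}};
        std::queue<Voxel> q;
        q.push({vi, vj, vk});
        seen[idx(vi, vj, vk)] = 1;
        while (!q.empty())
        {
            Voxel v = q.front(); q.pop();
            inside.push_back(v);
            for (const auto& s : d)
            {
                int i = v.i + s[0], j = v.j + s[1], k = v.k + s[2];
                if (i < 0 || i >= xdim || j < 0 || j >= ydim || k < 0 || k >= zdim)
                    continue;                          // outside the grid
                int di = i - vi, dj = j - vj, dk = k - vk;
                if (di * di + dj * dj + dk * dk > rsqr)
                    continue;                          // outside the sphere
                if (seen[idx(i, j, k)])
                    continue;                          // already visited
                seen[idx(i, j, k)] = 1;
                q.push({i, j, k});
            }
        }
    }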

topology layers separation algorithm

I have the following problem. Suppose you have a big array of Manhattan polygons in the plane (their sides are parallel to the x or y axis). I need to find all polygons placed closer than some value delta. The question is how to do this in the most efficient way, because the number of polygons is very large. I would be glad if you could give me a reference to an implemented solution that is easy to adapt to my case.
The first thing that comes to mind is the sweep and prune algorithm (also known as sort and sweep).
Basically, you first find out the 'bounds' of each shape along each axis. For the x axis, these would be leftmost and rightmost points on a shape. For the y axis, the topmost and bottommost.
Let's say you have a bound structure that looks something like this:
struct Bound
{
    float value;    // The value of the bound, ie, the x or y coordinate.
    bool isLower;   // True for a lower bound (leftmost point or bottommost point).
    int shapeIndex; // The index (into your array of shapes) of the shape this bound is on.
};
Create two arrays of these Bounds, one for the x axis and one for the y.
Bound* xBounds = new Bound[2 * numberOfShapes];
Bound* yBounds = new Bound[2 * numberOfShapes];
You will also need two more arrays. An array that tracks on how many axes each pair of shapes is close to one another, and an array of candidate pairs.
int* closeAxes = new int[numberOfShapes * numberOfShapes];
for (int i = 0; i < numberOfShapes * numberOfShapes; i++)
    closeAxes[i] = 0;
struct Pair
{
    int shapeIndexA;
    int shapeIndexB;
};
Pair* candidatePairs = new Pair[numberOfShapes * numberOfShapes];
int numberOfPairs = 0;
Iterate through your list of shapes and fill the arrays appropriately, with one caveat:
Since you're checking for closeness rather than intersection, add delta to each upper bound.
Then sort each array by value, using whichever algorithm you like.
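For the sort itself, std::sort with a comparator on value is enough (a sketch):

    #include <algorithm>

    auto byValue = [](const Bound& a, const Bound& b) { return a.value < b.value; };
    std::sort(xBounds, xBounds + 2 * numberOfShapes, byValue);
    std::sort(yBounds, yBounds + 2 * numberOfShapes, byValue);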
Next, do the following (and repeat for the Y axis):
for (int i = 0; i + 1 < 2 * numberOfShapes; i++)
{
    if (xBounds[i].isLower && xBounds[i + 1].isLower)
    {
        unsigned int L = xBounds[i].shapeIndex;
        unsigned int R = xBounds[i + 1].shapeIndex;
        closeAxes[L + R * numberOfShapes]++;
        closeAxes[R + L * numberOfShapes]++;
        if (closeAxes[L + R * numberOfShapes] == 2 ||
            closeAxes[R + L * numberOfShapes] == 2)
        {
            candidatePairs[numberOfPairs].shapeIndexA = L;
            candidatePairs[numberOfPairs].shapeIndexB = R;
            numberOfPairs++;
        }
    }
}
All the candidate pairs are less than delta apart on each axis. Now simply check each candidate pair to make sure they're actually less than delta apart. I won't go into exactly how to do that at the moment because, well, I haven't actually thought about it, but hopefully my answer will at least get you started. I suppose you could just check each pair of line segments and find the shortest x or y distance, but I'm sure there's a more efficient way to go about the 'narrow phase' step.
Obviously, the actual implementation of this algorithm can be a lot more sophisticated. My goal was to make the explanation clear and brief rather than elegant. Depending on the layout of your shapes and the sorting algorithm you use, a single run of this is approximately between O(n) and O(n log n) in terms of efficiency, as opposed to O(n^2) to check every pair of shapes.