I am working on implementing image convolution in C++, and I already have naive working code based on the following pseudo-code:
for each image row in input image:
    for each pixel in image row:
        set accumulator to zero
        for each kernel row in kernel:
            for each element in kernel row:
                if element position corresponding* to pixel position then
                    multiply element value corresponding* to pixel value
                    add result to accumulator
                endif
        set output image pixel to accumulator
As this can be a big bottleneck with big images and kernels, I was wondering if there exists some other approach to make things faster, even with additional input info like a sparse image or kernel, an already-known kernel, etc.
I know this can be parallelized, but that's not doable in my case.
if element position corresponding* to pixel position then
I presume this test is meant to avoid a multiplication by 0. Skip the test! Multiplying by 0 is way faster than the delays caused by a conditional jump.
The other alternative (and it's always better to post actual code rather than pseudo-code, here you have me guessing at what you implemented!) is that you're testing for out-of-bounds access. That is terribly expensive as well. It is best to break up your loops so that you don't need to do this testing for the majority of the pixels:
for (row = 0; row < k/2; ++row) {
// inner loop over kernel rows is adjusted so it only loops over part of the kernel
}
for (row = k/2; row < nrows-k/2; ++row) {
// inner loop over kernel rows is unrestricted
}
for (row = nrows-k/2; row < nrows; ++row) {
// inner loop over kernel rows is adjusted
}
Of course, the same applies to loops over columns, leading to 9 repetitions of the inner loop over kernel values. It's ugly but way faster.
To avoid the code repetition, you can create a larger image, copy the image data over, and pad it with zeros on all sides. The loops then don't need to worry about out-of-bounds access, and you end up with much simpler code.
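For illustration, here is a minimal sketch of that padded approach (the names, the float pixel type and the row-major layout are assumptions, not your actual code; note it computes correlation, flip the kernel for true convolution):

#include <vector>

// Convolve an nrows x ncols image with an odd-sized k x k kernel, using a
// zero-padded copy so the inner loops need no bounds checks at all.
std::vector<float> convolve_padded(const std::vector<float>& img, int nrows, int ncols,
                                   const std::vector<float>& kernel, int k) {
    int half = k / 2;
    int pcols = ncols + 2 * half;
    std::vector<float> padded((nrows + 2 * half) * pcols, 0.0f); // zero border
    for (int y = 0; y < nrows; ++y)
        for (int x = 0; x < ncols; ++x)
            padded[(y + half) * pcols + (x + half)] = img[y * ncols + x];
    std::vector<float> out(nrows * ncols);
    for (int y = 0; y < nrows; ++y) {
        for (int x = 0; x < ncols; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx)
                    acc += kernel[ky * k + kx] * padded[(y + ky) * pcols + (x + kx)];
            out[y * ncols + x] = acc;
        }
    }
    return out;
}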
Next, a certain class of kernels can be decomposed into 1D kernels. For example, the well-known Sobel kernel results from the convolution of [1,2,1] and [1,0,-1]^T. For a 3x3 kernel this is not a huge deal, but for larger kernels it is. In general, for an NxN kernel, you go from N^2 to 2N operations per pixel.
In particular, the Gaussian kernel is separable. This is a very important smoothing filter that can also be used for computing derivatives.
Besides the obvious computational cost saving, the code is also much simpler for these 1D convolutions. The 9 repeated blocks of code we had earlier become 3 for a 1D filter. The same code for the horizontal filter can be re-used for the vertical one.
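As a rough sketch of one 1D pass (placeholder names again; borders are clamped here just to keep the example short, zero-padding as above works too):

#include <algorithm>
#include <vector>

// One horizontal pass with a 1D kernel of odd length; the vertical pass is the
// same loop with the roles of x and y swapped, so the code can be reused.
void convolve_rows_1d(const std::vector<float>& in, std::vector<float>& out,
                      int nrows, int ncols, const std::vector<float>& k1d) {
    int half = static_cast<int>(k1d.size()) / 2;
    for (int y = 0; y < nrows; ++y) {
        for (int x = 0; x < ncols; ++x) {
            float acc = 0.0f;
            for (int i = 0; i < static_cast<int>(k1d.size()); ++i) {
                int xs = std::clamp(x + i - half, 0, ncols - 1); // clamp at the border
                acc += k1d[i] * in[y * ncols + xs];
            }
            out[y * ncols + x] = acc;
        }
    }
}

Running this once along rows and once along columns costs 2N multiplications per pixel instead of N^2 for the full NxN kernel.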
Finally, as already mentioned in MBo's answer, you can compute the convolution through the DFT. The DFT can be computed using the FFT in O(MN log MN) (for an image of size MxN). This requires padding the kernel to the size of the image, transforming both to the Fourier domain, multiplying them together, and inverse-transforming the result. 3 transforms in total. Whether this is more efficient than the direct computation depends on the size of the kernel and whether it is separable or not.
For small kernel sizes the simple method might be faster. Also note that separable kernels (for example, the Gaussian kernel is separable), as mentioned, allow filtering by rows and then by columns, resulting in O(N^2 * M) complexity.
For other cases there exists fast convolution based on the FFT (Fast Fourier Transform). Its complexity is O(N^2 * log N) (where N is the image size) compared to O(N^2 * M^2) for the naive implementation (where M is the kernel size).
Of course, there are some peculiarities in applying these techniques, for example edge effects, but one needs to account for them in the naive implementation too (to a lesser degree, though).
FI = FFT(Image)
FK = FFT(Kernel)
Prod = FI * FK (element-by-element complex multiplication)
Conv(I, K) = InverseFFT(Prod)
Note that you can use a fast library intended for image filtering; for example, OpenCV can apply a kernel to a 1024x1024 image in 5-30 milliseconds.
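For instance, a minimal sketch with OpenCV (src and kernel here are placeholders):

#include <opencv2/imgproc.hpp>

// Apply a convolution kernel with OpenCV; -1 keeps the output depth equal to
// the input depth. For sufficiently large kernels OpenCV switches to a
// DFT-based implementation internally.
cv::Mat filter_with_opencv(const cv::Mat& src, const cv::Mat& kernel) {
    cv::Mat dst;
    cv::filter2D(src, dst, -1, kernel);
    return dst;
}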
One way to speed this up, depending on the target platform, might be to take every distinct value in the kernel and store multiple copies of the image in memory, one per distinct kernel value. Multiply each copy of the image by its kernel value, then shift, sum and divide the copies back into one image at the end. This could be done on a graphics processor, for example, where memory is ample and which is better suited to this kind of tight, repetitive processing. The image copies will need to support overflow of the pixel values, or you could use floating-point values.
Related
I am working on a C++ project that needs to perform an FFT on large 2D raster data (10 to 100 GB). In particular, the performance is quite bad when applying the FFT to each column, whose elements are not contiguous in memory (they are placed with a stride equal to the width of the data).
Currently, I'm doing this: since the data does not fit in memory, I read several columns, namely n columns, into memory with their orientation transposed (so that a column in the file becomes a row in memory) and apply the FFT with an external library (MKL). I read (fread) n pixels, move on to the next row (fseek over the remaining width - n pixels), read n pixels, jump to the next row, and so on. When the operation (FFT) is done with the column chunk, I write it back to the file in the same manner: I write n pixels, jump to the next row, and so on. This way of reading and writing the file takes too much time, so I want to find some way of speeding it up.
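For reference, here is a bare-bones sketch of the chunked, strided read described above (float pixels and the chunk width n are assumptions; offsets must be 64-bit because the file is far larger than 2 GB):

#include <cstdint>
#include <cstdio>
#include <vector>

// Read an n-column chunk starting at column x0 from a row-major file of
// width x height float pixels, storing it transposed (file columns become rows).
bool read_column_chunk(std::FILE* f, int64_t width, int64_t height,
                       int64_t x0, int64_t n, std::vector<float>& chunk) {
    chunk.resize(static_cast<size_t>(n * height));
    std::vector<float> row(static_cast<size_t>(n));
    for (int64_t y = 0; y < height; ++y) {
        int64_t offset = (y * width + x0) * static_cast<int64_t>(sizeof(float));
        if (fseeko(f, offset, SEEK_SET) != 0)      // _fseeki64 on Windows
            return false;
        if (std::fread(row.data(), sizeof(float), static_cast<size_t>(n), f) != static_cast<size_t>(n))
            return false;
        for (int64_t i = 0; i < n; ++i)            // transpose while copying
            chunk[static_cast<size_t>(i * height + y)] = row[static_cast<size_t>(i)];
    }
    return true;
}

The write-back loop is the mirror image; most of the time still goes into the many small seeks, which is exactly the cost being asked about.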
I have considered transposing the whole file beforehand, but the entire process includes both row-major and column-major FFT operations, so transposing would not help.
I'd like to hear any experience or ideas about this kind of column-major operation on large data. Any suggestions particularly related to FFT or MKL would help as well.
Why not work with both transposed and non-transposed copies of the data at the same time? That doubles the storage requirement, but it may be worth it.
Consider switching to a Hadamard transform. As a complete IPS, the transform requires no multiplications, since all of the coefficients in the transform are plus or minus one. If you need the result in a Fourier basis, a matrix multiplication will change bases.
So I'm working on a CUDA program, and I'm experiencing some issues when it comes to indexing blocks and threads. Basically, I'm trying to implement the Pixel Sort algorithm in CUDA. (With one modification, we are only processing either rows or columns, not both at the same time)
The way I was visualizing it was to simply run N blocks with 1 thread each (for the number of rows, or columns) and have each block process that row / column independently of each other.
So if we want to sort on columns, we launch the kernel like this (there are a couple extra parameters that are only relevant to our specific processing, so I've left them out for simplicity)
pixel_sort<<<cols, 1>>>(d_image, d_imageSort, rows, cols, ...);
Then in the kernel, I access the block index with
int tid = blockIdx.x;
This allows me to work with one row's / column's data per block, but it has some issues: it runs slower than our serial implementation of the algorithm for smaller images, and it crashes outright when the image size becomes too large.
An alternative thread scheme I was considering would be to map each of the image's pixels to one thread, however I have a couple of questions on this.
If we were to launch N blocks with M threads each, representing N columns of M rows, how do we avoid the 512 (or 1024?) limit on threads per block? Can we just have each thread process multiple pixels of the column in this case? What would the indexing look like in the kernel?
The algorithm basically requires that we work on the entire column, so each thread cannot just do some work on its own pixel; they have to communicate, presumably using shared memory. Would it be a valid strategy to have one "master" thread per block that does the actual sorting calculations, with all of the other threads only contributing to shared memory?
Other Notes:
Our image data is read in through OpenCV, and has the RGBA values stored in a uchar4 array
If you have a single thread per block, you very quickly run into occupancy issues. If your goal is a full per-row sort (for columns you could transpose the image before sending it to the GPU to take advantage of coalesced global accesses), the fastest way to get a decent result is probably a radix or merge sort on a per-row basis, essentially following the steps in http://mgarland.org/files/papers/nvr-2008-001.pdf. You could assign k blocks of m threads each to every row such that k*m >= image width. Then you would be launching k*(image height) blocks, and your grid would be of size (k, height, 1).
As for your specific questions:
You cannot get around the 512/1024 thread-per-block limit; you would have to restructure your algorithm.
A "master" thread would generally be poor design, causing stalls, overhead, and not taking full advantage of the many cores. You may sometimes need to utilize a single thread, say to output/broadcast a result, but mostly you want to avoid it. See the linked article for sample algorithms that mostly avoid this.
Current Design
In my program I have a big 2-D grid (1000 x 1000, or more), and each cell contains a small piece of information.
In order to represent this concept the choice is quite trivial: a matrix data structure.
The correspondent code (in C++) is something like:
int w_size_grid = 1000;
int h_size_grid = 1000;
int* matrix = new int[w_size_grid * h_size_grid];
As you can see, I've used a flat one-dimensional array rather than a nested structure, but the principle is the same.
In order to access an element of the grid, we need a function that, given a cell identified by (x, y), returns the value stored in that cell.
Mathematically:
f(x, y) = z
that is:
f: Z^2 -> Z, where Z is the set of integers.
That can be trivially achieved with a linear indexing function. Here is the C++ representation:
int get_value(int x, int y) {
return matrix[y*w_size_grid + x];
}
Additional Implementation Notes
Actually, the design requires a sort of "circular-continuous grid": the access indices for a cell can go outside the limits of the grid itself.
That means, for example, that the particular case get_value(-1, -1); is still valid: the function will just return the same value as get_value(w_size_grid - 1, h_size_grid - 1);.
This is not a problem in the implementation:
int get_value(int x, int y) {
adjust_xy(&x, &y); // modify x and y in accordance with that rule.
return matrix[y*w_size_grid + x];
}
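For completeness, one possible sketch of adjust_xy, assuming plain wrap-around (modulo) semantics for the rule described above:

// Wrap x and y so that any integer maps back into the grid,
// e.g. -1 becomes w_size_grid - 1 and w_size_grid becomes 0.
void adjust_xy(int* x, int* y) {
    *x = ((*x % w_size_grid) + w_size_grid) % w_size_grid;
    *y = ((*y % h_size_grid) + h_size_grid) % h_size_grid;
}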
Anyway, this is just an additional note to make the scenario clearer.
What is the problem?
The problem presented above is very trivial and simple to design and to implement.
My problem comes from the fact that the matrix is updated at a high frequency: each cell in the matrix is read and possibly updated with a new value.
Obviously, the traversal of the matrix is done with two loops, in accordance with a cache-friendly design:
for (int y = 0; y < h_size_grid; ++y) {
for (int x = 0; x < w_size_grid; ++x) {
int value = get_value(x, y);
}
}
The inner loop is over x, since [x-1], [x], [x+1] are stored contiguously; that loop exploits the principle of locality.
The problem arises because updating the value of a cell depends on the values in the adjacent cells.
Each cell has exactly eight neighbours, which are the cells that are horizontally, vertically, or diagonally adjacent.
(-1,-1) | (0,-1) | (1,-1)
-------------------------
(-1, 0) | (0, 0) | (1, 0)
-------------------------
(-1, 1) | (0, 1) | (1, 1)
So the code is intuitively:
for (int y = 0; y < h_size_grid; ++y) {
for (int x = 0; x < w_size_grid; ++x) {
int value = get_value(x, y);
auto values = get_value_all_neighbours(x, y); // values are 8 integers
}
}
The function get_value_all_neighbours(x, y) accesses one row above and one row below row y in the matrix.
Since a row in the matrix is quite big, those accesses cause cache misses and evict useful data from the cache.
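For clarity, a sketch of what such a neighbour read might look like (the body of the function is an assumption, reusing the wrapping get_value above):

#include <array>

// Reads the 8 neighbours of (x, y); the accesses to rows y-1 and y+1 are the
// ones that touch memory far away from the cache lines of the current row.
std::array<int, 8> get_value_all_neighbours(int x, int y) {
    return {
        get_value(x - 1, y - 1), get_value(x, y - 1), get_value(x + 1, y - 1),
        get_value(x - 1, y),                          get_value(x + 1, y),
        get_value(x - 1, y + 1), get_value(x, y + 1), get_value(x + 1, y + 1)
    };
}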
The Question
Now that I have presented the scenario and the problem, my question is how to "solve" it.
Using some additional data structure, or reorganizing the data, is there a way to exploit the caches and avoid all those misses?
Some Personal Consideration
My feeling is that the answer lies in a smarter data layout.
I've thought about changing the order in which the values are stored in the array, trying to store neighbouring cells at contiguous indices.
That implies a non-linear indexing function for get_value.
After some thinking, though, I don't believe it is possible to find such a non-linear function.
I've also thought about an additional data structure, like a hash table storing the adjacent values of each cell, but I think that would be overkill in space and maybe in CPU cycles as well.
Let's assume you do indeed have a problem with cache misses that can't easily be avoided (referring to the other answers here).
You could use a space-filling curve to organize your data in a cache-friendly way. Essentially, space-filling curves map a volume or plane (such as your matrix) to a linear representation, such that values that are close together in space (mostly) end up close together in the linear representation. In effect, if you store the matrix in a z-ordered array, neighbouring elements have a high likelihood of being on the same memory page.
The best proximity mapping is given by the Hilbert curve, but it is expensive to calculate. A more practical option may be the z-curve (Morton order): it provides good proximity and is fast to calculate.
Z-curve: essentially, to get the ordering, you interleave the bits of your x and y coordinates into a single value, called the 'z-value'. This z-value determines the position in your list of matrix values (or simply the index in an array, if you use an array). The z-values are consecutive for a completely filled matrix (every cell used). Conversely, you can de-interleave a position in the list (= array index) and get your x/y coordinates back.
Interleaving the bits of two values is quite fast; there are even dedicated CPU instructions that do this in a few cycles. If you can't find them (I can't, at the moment), you can simply use some bit-twiddling tricks.
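A minimal sketch of the bit-twiddling variant for 16-bit coordinates (on x86 with BMI2, the _pdep_u32 intrinsic can do the same spreading in a single instruction):

#include <cstdint>

// Spread the lower 16 bits of v so they occupy the even bit positions.
static uint32_t part1by1(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Morton (z-order) index: interleave the bits of x and y.
uint32_t z_index(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}

Note that for a grid whose side is not a power of two (such as 1000 x 1000) the z-values are not dense, so the backing array has to be sized for the next power of two, or the indices remapped.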
Actually, the choice of data structure is not trivial, especially where optimizations are concerned.
There are two main aspects to consider: data content and data usage. Data content means the values in the data; data usage means how the data is stored and retrieved, and how often.
Data Content
Are all the values accessed? Frequently?
Data that is not accessed frequently can be pushed to slower media, including files. Leave the fast memory (such as data caches) for the frequently accessed data.
Is the data similar? Are there patterns?
There are alternative methods for representing matrices where a lot of the data is the same (such as a sparse matrix or a lower triangular matrix). For large matrices, performing a few checks and returning constant values may be faster or more memory-efficient.
Data Usage
Data usage is a key factor in determining an efficient structure for the data. Even with matrices.
For example, for frequently accessed data, a map or associative array may be faster.
Sometimes, using many local variables (i.e. registers) may be more efficient for processing matrix data. For example, load up registers with values first (data fetches), operate using the registers, then store the registers back into memory. For most processors, registers are the fastest media for holding data.
The data may need to be rearranged to make efficient use of data caches and cache lines. The data cache is a high-speed area of memory very close to the processor core; a cache line is one row of data in the data cache. An efficient matrix layout can fit one or more rows per cache line.
The most efficient method is to perform as many accesses to a cache line as possible before it is evicted, and to reduce the need to reload the data cache (for example, because an index was out of range).
Can the operations be performed independently?
For example, scaling a matrix, where each location is multiplied by a value, does not depend on other cells of the matrix. Such operations can be performed in parallel, and if so, they can be delegated to processors with many cores (such as GPUs).
Summary
When a program is data driven, the choice of data structures is not trivial. The content and usage are important factors when choosing a structure for the data and how the data is aligned. Usage and performance requirements will also determine the best algorithms for accessing the data. There are already many articles on the internet for optimizing for data driven applications and best usage of data caches.
What is the most efficient way to shift and scale (linear operation) each channel with different shift/scale values (floats) per channel? The input image may be any data type, but the output matrix needs to be CV_32F.
I could write my own for loop to do this, similar to
out[i,j,k] = scale[i] * in[i,j,k] + shift[i]
but I wonder if that would be slow compared to more optimized routines. Scalar operations are supported, but only over the entire image, so I don't know how to isolate a given channel for this. I looked at split, but the docs don't mention whether memory is copied or not. If it is, it's sure to be slower. There are routines for color conversion, which is similar, but of course only specific conversions are supported.
Might there be some way, similar to
out.channels[i] = scale[i] * in.channels[i] + shift[i]
Preferably, I'd like to skip creating an intermediate matrix.
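For what it's worth, a sketch of the split-based variant being considered (the function name is a placeholder; cv::split copies the channel data, since the channels of a cv::Mat are interleaved in memory, so this does create intermediates):

#include <opencv2/core.hpp>
#include <vector>

// Per-channel linear transform: out_c = scale[c] * in_c + shift[c], output CV_32F.
cv::Mat scale_shift_per_channel(const cv::Mat& in,
                                const std::vector<double>& scale,
                                const std::vector<double>& shift) {
    std::vector<cv::Mat> channels;
    cv::split(in, channels);                      // copies each channel out
    for (int c = 0; c < static_cast<int>(channels.size()); ++c)
        channels[c].convertTo(channels[c], CV_32F, scale[c], shift[c]);
    cv::Mat out;
    cv::merge(channels, out);
    return out;
}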
I need to square root each element of a matrix (which is basically a vector of float values once in memory) using CUDA.
Matrix dimensions are not known a priori and may vary in the range [2, 20000].
I was wondering: I might use (as Jonathan suggested here) one block dimension like this:
int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
and check that thread_id is lower than rows*columns... that's pretty simple and straightforward.
But is there any particular performance reason why should I use two (or even three) block grid dimensions to perform such a calculation (keeping in mind that I have a matrix afterall) instead of just one?
I'm thinking of coalescing issues, like making sure all threads read values sequentially.
The dimensions only exist for convenience; internally everything is linear, so there would be no advantage in terms of efficiency either way. Avoiding the computation of the (contrived) linear index as you've shown above might seem a bit faster, but there would be no difference in how the threads coalesce.
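For concreteness, a sketch of the 1D-indexed version discussed above (names and the launch configuration are placeholders):

// Square root of every element; the matrix is treated as one flat array, so
// adjacent threads touch adjacent elements and global reads/writes coalesce.
__global__ void sqrt_kernel(float* data, int n)
{
    int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    if (thread_id < n)                 // guard the last, partially filled block
        data[thread_id] = sqrtf(data[thread_id]);
}

// Host side, with n = rows * columns:
// int threads = 256;
// sqrt_kernel<<<(n + threads - 1) / threads, threads>>>(d_data, n);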