I have CUDA code in which I have implemented several C2C 2D FFTs. They all use the same plan, but for some reason the times on the 2D FFTs are large and seem to vary quite a bit: FFTs of the same data size take anywhere from 0.4 s to 1.8 s.
This is for a 1920x1080 FFT. Do those times seem reasonable?
Anyhow - I have had good experience with CUDA 1D batched FFTs being fast. Is it equivalent to take a 1D FFT across the rows, and then again across the columns of the matrix, to get the same results as this 2D FFT? I have seen 1D FFTs complete in a few hundredths of a second on larger data sets before, so I was hoping to improve some of these results.
Thanks
A 2D transform of a 1K by 1K image requires 2K 1D transforms (1K along the rows plus 1K along the columns). Therefore those times seem reasonable.
For more information have a look at: http://paulbourke.net/miscellaneous/dft/
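For reference, the row-then-column approach the question asks about can be expressed directly with cuFFT's batched plans; the column pass uses the advanced (strided) layout, so no explicit transpose is needed, and the result should match a single 2D plan. A minimal sketch for the 1920x1080 case (no error checking; d_data is assumed to already hold the row-major image on the device):

    #include <cufft.h>

    // 2D C2C FFT of a 1920x1080 image, done as two batched 1D passes: one plan
    // for the rows, one strided plan for the columns.  The result matches a
    // single cufftPlan2d(&p, height, width, CUFFT_C2C) plan.
    void fft2d_by_rows_then_cols(cufftComplex *d_data)
    {
        const int width = 1920, height = 1080;

        // Rows: 1080 transforms of length 1920, contiguous, one row after another.
        cufftHandle rows;
        int n_row[1] = { width };
        cufftPlanMany(&rows, 1, n_row,
                      n_row, 1, width,        // input:  stride 1, distance = width
                      n_row, 1, width,        // output: same layout (in place)
                      CUFFT_C2C, height);

        // Columns: 1920 transforms of length 1080 with stride = width, so no
        // explicit transpose is needed between the two passes.
        cufftHandle cols;
        int n_col[1] = { height };
        cufftPlanMany(&cols, 1, n_col,
                      n_col, width, 1,
                      n_col, width, 1,
                      CUFFT_C2C, width);

        cufftExecC2C(rows, d_data, d_data, CUFFT_FORWARD);
        cufftExecC2C(cols, d_data, d_data, CUFFT_FORWARD);

        cufftDestroy(rows);
        cufftDestroy(cols);
    }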
I have a problem which involves many matrix multiplications (classical and Kronecker products). I read that GPUs are suited for this task, and since speed is my main objective I was thinking about using CUDA with C++. However, I would have to learn CUDA first, so before I start wasting my time I thought I should ask wiser people. Can CUDA speed up my calculations? The matrices are generally quite small, around 20x50. Sometimes a third dimension is involved, so it becomes a 20x50x10 array. I can only multiply a couple of matrices at one step in time (10-100), but I need to do several million iterations back to back (Monte Carlo simulation). Currently I am using Armadillo and MATLAB.
You would see some speedup if your matrices were bigger; as it stands, you will be facing data bandwidth bottlenecks that are worse than the computation time.
Something worth considering is whether there are mathematical tricks that would allow you (based on your computations) to combine multiple instances into bigger matrices, then transfer and compute those at once. This is usually quite difficult, though, and probably not always doable.
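If you do end up experimenting with CUDA despite that caveat, cuBLAS's strided batched GEMM is the usual way to amortize overhead across many small products. A minimal sketch with illustrative sizes (100 products of 20x50 by 50x20 matrices, stored column-major and packed back to back, as cuBLAS expects):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main()
    {
        // Illustrative sizes: 100 products of a 20x50 matrix with a 50x20 matrix.
        const int m = 20, k = 50, n = 20, batch = 100;

        std::vector<float> A(size_t(m) * k * batch, 1.0f);
        std::vector<float> B(size_t(k) * n * batch, 2.0f);
        std::vector<float> C(size_t(m) * n * batch);

        float *dA, *dB, *dC;
        cudaMalloc(&dA, A.size() * sizeof(float));
        cudaMalloc(&dB, B.size() * sizeof(float));
        cudaMalloc(&dC, C.size() * sizeof(float));
        cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 1.0f, beta = 0.0f;
        // One call multiplies the whole batch, which amortizes launch overhead.
        cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                  m, n, k, &alpha,
                                  dA, m, (long long)m * k,
                                  dB, k, (long long)k * n,
                                  &beta,
                                  dC, m, (long long)m * n,
                                  batch);

        cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }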
I am curious about the most efficient method for processing an image block by block.
At the moment, I apply some vectorization techniques, such as reading one row of pixels (8 pixels per row, each 8 bits deep) from an 8x8 block at a time. But since modern processors support 128/256-bit vector operations, I think loading two rows of pixels from the image block could improve the speed of the code.
But the problem is that an image (for example a 16x16 image, which contains four 8x8 blocks) is stored in memory contiguously from the first pixel to the last. Loading one 8-pixel row is easy, but how should I manipulate the pointer or align the image data so that I can load 2 rows together?
I think this figure can illustrate my problem clearly:
[Figure: pixels' addresses in the image]
So, when we load 8 pixels (one row) together, we simply load 8 bytes of data from the initial pointer position with one instruction. When we load the 2nd row, we simply add 9 to the pointer and load the second row.
So the question is: is there any method by which we could load these two rows (16 pixels) together from the initial pointer position?
Thanks!
To make each row aligned, you can pad the end of each row. Writing your code to support a shorter image width than the stride between rows lets your algorithm work on a subset of an image.
Also, you don't actually need everything to be aligned for SIMD to work well. Contiguous is sufficient. Most SIMD instruction sets (SSE, NEON, etc.) have unaligned load instructions. Depending on the specific implementation, there might not be much penalty.
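For instance, with 8-bit pixels addressed through a separate row stride, SSE2's unaligned loads keep the addressing simple; the helper names below are just for illustration:

    #include <emmintrin.h>   // SSE2
    #include <cstddef>
    #include <cstdint>

    // 16 contiguous pixels starting at (x, y); no alignment requirement, so the
    // row stride does not have to be a multiple of 16.
    static inline __m128i load16(const std::uint8_t *img, std::size_t stride, int x, int y)
    {
        return _mm_loadu_si128(reinterpret_cast<const __m128i *>(img + std::size_t(y) * stride + x));
    }

    // Two 8-pixel rows of the same 8x8 block gathered into one 128-bit register
    // (two 64-bit loads plus an unpack) -- though, as noted below, keeping each
    // row in its own vector is often the better plan.
    static inline __m128i load_two_rows(const std::uint8_t *block, std::size_t stride)
    {
        __m128i r0 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(block));
        __m128i r1 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(block + stride));
        return _mm_unpacklo_epi64(r0, r1);
    }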
You don't load two different rows into the same SIMD vector. For example, to do an 8x8 SAD (sum of absolute differences) using AVX2 VPSADBW, each 32-byte load would get data from one row of four different 8x8 blocks. But that's fine, you just use that to produce four 8x8 SAD results in parallel, instead of wasting a lot of time shuffling to do a single 8x8 SAD.
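A sketch of that approach (requires AVX2; the function name and signature are just for illustration, and the four blocks are assumed to sit side by side in the image):

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // SADs of four horizontally adjacent 8x8 blocks against a reference,
    // computed in parallel.  'cur' and 'ref' point at the top-left of the
    // leftmost block; 'stride' is the image stride in bytes.
    static void sad_4x_8x8(const std::uint8_t *cur, const std::uint8_t *ref,
                           std::size_t stride, std::uint64_t out[4])
    {
        __m256i acc = _mm256_setzero_si256();
        for (int row = 0; row < 8; ++row) {
            __m256i c = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(cur + row * stride));
            __m256i r = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(ref + row * stride));
            // VPSADBW: one 64-bit partial SAD per 8-byte group, i.e. one per block.
            acc = _mm256_add_epi64(acc, _mm256_sad_epu8(c, r));
        }
        _mm256_storeu_si256(reinterpret_cast<__m256i *>(out), acc);  // out[i] = SAD of block i
    }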
For example, Intel's MPSADBW tutorial shows how to implement an exhaustive motion search for 4x4, 8x8, and 16x16 blocks, with C and Intel's SSE intrinsics. Apparently the actual MPSADBW instruction isn't worth using in practice, though, because it's slower than PSADBW and you can get identical results faster with a sequential elimination exhaustive search, as used by x264 (and mentioned by x264 developers in this forum thread about whether SSE4.1 would help x264).
Some SIMD-programming blog posts from the archives of Dark Shikari's blog: Diary Of An x264 Developer:
Cacheline splits, take two: using PALIGNR or other techniques to set up unaligned inputs for motion searches
A curious SIMD assembly challenge: the zigzag
I have searched for an answer to this question but have not found anything that can directly help me.
I am working on a 3D numerical integrator for a non-linear PDE using the parallel FFT library included in MKL.
My arrays consist of 2^30 data points, which is much, much larger than the cache. This results in ~50% of cache references being misses, which appears to add a massive amount of memory-access overhead.
Is there a clever way I can deal with this? Is it expected to have 50% cache misses using an array this large?
Any help would be much appreciated.
Thanks,
Dylan
2^30 data points in a single FFT counts as being quite big!
The data, plus the exponentials and the output array, are several thousand times bigger than the L3 cache, and millions of times bigger than L1.
Given that disparity one might argue that a 50% cache miss rate is actually quite good, especially for an algorithm like an FFT which accesses memory in non-sequential ways.
I don't think that there will be much you can do about it. The MKL is quite good, and I'm sure that they've taken advantage of whatever cache hinting instructions there are.
You might try contacting Mercury Systems Inc. (www.mrcy.com) and ask them about their Scientific Algorithms Library (SAL). They have a habit of writing their own math libraries, and in my experience they are pretty good at it. Their FFT on PowerPC was 30% quicker than the next best one; quite an achievement. You can try an un-optimised version of SAL for free (http://sourceforge.net/projects/opensal/). The real optimised for Intel SAL is definitely not free though.
Also bear in mind that no matter how clever the algorithm is, with a data set that size you're always going to be fundamentally stuck with main memory bandwidths, not cache bandwidths.
GPUs might be worth a look, but you'd need one with a lot of memory to hold 2^30 data points (complex values with 32-bit components come to 8 GB, the same again for the output array, plus exponentials, etc.).
I think the problem of excessive misses is due to a failure of the cache prefetch mechanism, but not knowing the details of the memory accesses I can't tell you exactly why.
It does not matter that your arrays are very large; 50% misses are excessive. The processor should avoid misses by detecting that you are iterating over an array and loading, ahead of time, the data elements you are likely to use.
Either the pattern of array accesses is not regular, so the prefetcher in the processor does not figure out a pattern to prefetch, or you have a cache associativity problem, that is, elements in your iteration are mapped to the same cache slots.
For example, assume a cache size of 1 MB and a set associativity of 4. In this example, the cache maps memory to an internal slot using the lower 20 bits of the address. If you stride by 1 MB, that is, if your iterations step exactly 1 MB apart, then the lower 20 bits are always the same, so every access goes to the same slot and each new element shares that slot with the previous ones. When you get to the fifth element, all four positions are used up, and from then on you get only misses; in that case your cache is effectively a single slot. If you stride by half the cache size, the effective number of slots is 2, which might be enough to avoid misses entirely, or give 100% misses, or anything in between, depending on whether your access pattern requires both slots simultaneously or not.
To convince yourself of this, write a toy program with varying stride sizes; you'll see that strides that divide, or are multiples of, the cache size increase the misses. You can measure it with valgrind --tool=cachegrind.
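One possible shape for such a toy program (all sizes and counts are arbitrary): it re-reads a small, fixed set of addresses spaced a configurable stride apart, so that any steady-state misses are conflict misses rather than capacity misses.

    #include <cstdlib>
    #include <iostream>
    #include <vector>

    // Compare cachegrind's D1/LL miss rates for a power-of-two stride versus the
    // same stride plus one cache line, e.g.
    //   valgrind --tool=cachegrind ./stride 262144
    //   valgrind --tool=cachegrind ./stride 262208
    int main(int argc, char **argv)
    {
        const std::size_t stride = argc > 1 ? std::strtoull(argv[1], nullptr, 10) : 262144;
        const int         slots  = 32;        // more ways than typical caches have
        const long        passes = 200000;

        std::vector<unsigned char> buf(stride * slots, 1);

        unsigned long long sum = 0;
        for (long p = 0; p < passes; ++p)
            for (int s = 0; s < slots; ++s)
                sum += buf[std::size_t(s) * stride];   // same 32 addresses every pass

        std::cout << sum << '\n';   // keep the loops from being optimized away
        return 0;
    }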
You should first make sure you know what is causing the cache misses; they may be the fault of other code you've written rather than the FFT library. In fact, I expect that is very likely the case.
The rest of this post assumes that the FFT is really at fault and we need to optimize.
The standard trick to get data locality out of an FFT is to
Arrange the data in a two-dimensional array
Do an FFT along each row
Apply twiddle factors
Do a matrix transpose
Do an FFT along each row
This is the Cooley-Tukey algorithm, in the case where we factor 2^(m+n) = 2^m * 2^n.
The point of this is that the recursive calls to the FFT are much much smaller, and may very well fit in cache. And if not, you can apply this method recursively until things do fit in cache. And if you're ambitious, you do a lot of benchmarking to figure out the optimal way to do the splitting.
Thus, assuming you also use a good matrix transpose algorithm, the end result is a relatively cache-friendly FFT.
The library you're using really should be doing this already. If it's not, then some options are:
Maybe it exposes enough lower-level functionality that you can tell it to use Cooley-Tukey in an efficient way, even though the high-level routines aren't doing so
You could implement Cooley-Tukey yourself, using the given library to do the smaller FFTs; a structural sketch follows below.
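Following that second option, here is a structural sketch of the decomposition in its "six-step" form, where the "arrange the data" step becomes an explicit transpose so that both FFT passes walk contiguous rows. fft_rows is a hypothetical stand-in for whatever batched 1D FFT routine your library provides, and the transposes are left naive for brevity:

    #include <cmath>
    #include <complex>
    #include <vector>

    using cplx = std::complex<double>;

    // Hypothetical stand-in for the library's batched 1D FFT: transforms each of
    // 'count' contiguous rows of length 'len' in place.
    void fft_rows(cplx *data, std::size_t count, std::size_t len);

    // Naive out-of-place transpose of a rows x cols row-major matrix.
    static std::vector<cplx> transpose(const std::vector<cplx> &a,
                                       std::size_t rows, std::size_t cols)
    {
        std::vector<cplx> t(a.size());
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                t[c * rows + r] = a[r * cols + c];
        return t;
    }

    // Length-N FFT with N = R*C, done as R-point and C-point passes that each
    // walk contiguous rows.
    void big_fft(std::vector<cplx> &x, std::size_t R, std::size_t C)
    {
        const double N  = double(R) * double(C);
        const double pi = std::acos(-1.0);

        // 1) Arrange the data: each new row holds one original length-R column.
        x = transpose(x, R, C);                 // now C rows of length R

        // 2) FFT along each row (R-point transforms).
        fft_rows(x.data(), C, R);

        // 3) Twiddle factors: element (c, k1) gets exp(-2*pi*i * c*k1 / N).
        for (std::size_t c = 0; c < C; ++c)
            for (std::size_t k1 = 0; k1 < R; ++k1)
                x[c * R + k1] *= std::polar(1.0, -2.0 * pi * double(c) * double(k1) / N);

        // 4) Transpose so the second pass is again over contiguous rows.
        x = transpose(x, C, R);                 // now R rows of length C

        // 5) FFT along each row (C-point transforms).
        fft_rows(x.data(), R, C);

        // 6) A final transpose puts the result in natural frequency order.
        x = transpose(x, R, C);
    }

In practice you would replace the naive transposes with a blocked, cache-friendly transpose, since at 2^30 points they are themselves memory-bound.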
I'm working with convolutional deep neural networks (CRBMs in particular). I need to perform several 'valid' convolutions and several 'full' convolutions, and I need to optimize their speed. I'm working in C++.
In the case of the 'valid' convolution I have been able to speed it up by around 50% by rearranging the image (this is the equivalent of im2col in Matlab) and multiplying it (a matrix multiplication) by the matrix of kernels (one kernel per row), replacing several convolutions with one matrix multiplication.
Unfortunately, I haven't been able to gain anything with 'full' convolution. I know that I can also rearrange the image (using an equivalent of convmtx2 in Matlab), but simply rearranging the image matrix into a convolution matrix is already much slower (by almost an order of magnitude) than doing all the convolutions. The convolution matrix becomes very large very quickly.
Is there a good, practical way of speeding up 'full' convolution (with multiple kernels) using matrix multiplication in C++? Perhaps an efficient way of computing convmtx2?
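For reference, a minimal sketch of the im2col-plus-GEMM trick described above for the 'valid' case (single channel, stride 1; all names are illustrative). A 'full' convolution can reuse the same routine by zero-padding the image by kernel_size - 1 on each side first:

    #include <cstddef>
    #include <vector>

    // im2col for 'valid' filtering: each output column holds one k x k patch of
    // the image, so all patches can be hit with a single GEMM against a
    // (num_kernels x k*k) kernel matrix.  For true convolution (rather than
    // correlation), flip the kernels before the multiply.
    std::vector<float> im2col(const std::vector<float> &img,
                              std::size_t h, std::size_t w, std::size_t k)
    {
        const std::size_t oh = h - k + 1, ow = w - k + 1;   // 'valid' output size
        std::vector<float> cols(k * k * oh * ow);

        for (std::size_t y = 0; y < oh; ++y)
            for (std::size_t x = 0; x < ow; ++x)
                for (std::size_t ky = 0; ky < k; ++ky)
                    for (std::size_t kx = 0; kx < k; ++kx)
                        // row = position inside the patch, column = patch index
                        cols[(ky * k + kx) * (oh * ow) + (y * ow + x)] =
                            img[(y + ky) * w + (x + kx)];
        return cols;
    }

    // Afterwards: outputs = kernels (num_kernels x k*k) * cols (k*k x oh*ow),
    // one output feature map per row, via your favourite GEMM.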
I am just beginning to learn some CUDA programming, and I am interested in how to handle calculations on large matrices which surpass the block/thread sizes.
For example, I have seen code which shows how to perform tiled matrix multiplication, but it fails when the block size and grid size are too small. In the mentioned code, if the block size and grid size are each set to 1, then only the first element of the final matrix will be computed.
The answer is simple: call the kernel with larger block and grid sizes. But what happens when I want to perform a matrix multiplication with 8 million rows and 6 million columns - something arbitrarily large for which there cannot be a proper grid and block size on any modern GPU?
Where can I find example code or an algorithm for how to work with this sort of thing? I believe that the simple case should be a matrix multiplication algorithm which works if called with <<<1,1>>>, and any algorithm which can account for that call should be able to account for any larger matrix.
The main problem with a very large matrix is not the number of blocks or the number of threads. The main problem is that you cannot fit the whole matrix in the GPU's DRAM. So to do the multiplication, you need to manually tile the input matrices into pieces that fit in the GPU's memory, run the matrix multiplication for each tile on the GPU with as many threads as you need, and then copy each tile's result back to the host (CPU).
When you are working on these big tiles on the GPU, you need to launch thousands of threads to get the performance you need; launching only one thread does not help you in any way.
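As a rough illustration of that host-side tiling loop, with cuBLAS doing the per-tile multiplication (square tiles assumed to divide the matrix dimensions evenly; error checking, pinned memory and stream overlap are all omitted):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Multiply row-major A (M x K) by B (K x N) into C (M x N) in square tiles
    // of side T, so only three T x T tiles live on the GPU at a time.
    void tiled_gemm(const float *A, const float *B, float *C,
                    int M, int N, int K, int T)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        float *dA, *dB, *dC;
        const size_t tileBytes = size_t(T) * T * sizeof(float);
        cudaMalloc(&dA, tileBytes);
        cudaMalloc(&dB, tileBytes);
        cudaMalloc(&dC, tileBytes);

        const float one = 1.0f;
        for (int i = 0; i < M; i += T)          // tile row of C
            for (int j = 0; j < N; j += T) {    // tile column of C
                cudaMemset(dC, 0, tileBytes);   // C-tile accumulator
                for (int k = 0; k < K; k += T) {
                    // Strided copies pull one T x T tile out of each big matrix.
                    cudaMemcpy2D(dA, T * sizeof(float), A + size_t(i) * K + k,
                                 K * sizeof(float), T * sizeof(float), T,
                                 cudaMemcpyHostToDevice);
                    cudaMemcpy2D(dB, T * sizeof(float), B + size_t(k) * N + j,
                                 N * sizeof(float), T * sizeof(float), T,
                                 cudaMemcpyHostToDevice);
                    // cuBLAS is column-major, so compute B_tile^T * A_tile^T to
                    // get the row-major product; beta = 1 accumulates over k.
                    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, T, T, T,
                                &one, dB, T, dA, T, &one, dC, T);
                }
                cudaMemcpy2D(C + size_t(i) * N + j, N * sizeof(float), dC,
                             T * sizeof(float), T * sizeof(float), T,
                             cudaMemcpyDeviceToHost);
            }

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cublasDestroy(handle);
    }

In practice you would also overlap the tile copies with the GEMM calls using streams; as written, the sketch is fully serialized.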
For more information you can look at this paper:
CUDA Based Fast Implementation of Very Large Matrix Computation
I just found it by googling "large matrix multiplication CUDA"