Calling multiple kernels, global memory performance - CUDA - C++

I have four CUDA kernels working on matrices in the following way:
convolution<<<>>>(A,B);
multiplybyElement1<<<>>>(B);
multiplybyElement2<<<>>>(A);
multiplybyElement3<<<>>>(C);
// A + B + C with CUBLAS' cublasSaxpy
Every kernel (except the first one, the convolution) basically performs an element-wise matrix multiplication by a fixed value hardcoded in its constant memory (to speed things up).
Should I join these kernels into a single one by calling something like
multiplyBbyX_AbyY_CbyZ<<<>>>(B,A,C)
?
The data should already be in global memory on the device, so this probably would not help, but I'm not totally sure.

If I understood correctly, you're asking whether you should merge the three "multiplybyElement" kernels into one, where each of those kernels reads an entire (different) matrix, multiplies each element by a constant, and stores the new scaled matrix.
Given that these kernels will be memory-bandwidth bound (practically no computation, just one multiply for every element), there is unlikely to be any benefit from merging the kernels unless your matrices are small. In that case you would be making inefficient use of the GPU anyway, since the kernels execute in series (same stream).

If merging the kernels means that you can do only one pass over the memory, then you may see a 3x speedup.
Can you combine the fixed values up front and then do a single multiply in a single kernel?
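For illustration, a minimal sketch of a single-pass version, where the scaling and the sum are folded into one kernel (replacing both the three multiply kernels and the cublasSaxpy step). The kernel name, the output array and the constant names are hypothetical:

// Hypothetical single-pass kernel: scales A, B and C by their constants and
// accumulates the sum in one read of each matrix, instead of three scaling
// kernels followed by cublasSaxpy. The scale factors stand in for the values
// the original kernels keep in constant memory.
__constant__ float scaleA, scaleB, scaleC;

__global__ void scaleAndSum(const float *A, const float *B, const float *C,
                            float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = scaleA * A[i] + scaleB * B[i] + scaleC * C[i];
}

Whether this is applicable depends on whether the intermediate scaled matrices are needed elsewhere.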

Related

OpenCL: Single kernel working on independent arrays in parallel

So I have a bunch of Arrays A_i of size M x N_i which are entirely independent and don't need to communicate with each other.
I've designed a kernel that I want to operate on each array separately, however since multiple kernels can't run at the same time, I would like to design a single kernel to operate on either all these arrays at once, or on batches of these arrays at a time.
Since these arrays all have the same number of rows, I could column-stack them into a single mega-array and then use a single kernel with the appropriate offsets to operate on each portion of this mega-array separately. However, I'm looking for a cleaner solution, especially since the number of work-groups and work-items used for each A_i depends on its column dimension N_i.
Hopefully this explanation is clear.
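For what it's worth, the column-stacked approach can be expressed with a per-array column-offset table; a minimal sketch, written as a CUDA kernel for illustration (an OpenCL kernel would be analogous). The column-major storage, the colOffset array and the doubling operation are assumptions for the example:

// Sketch of the "mega-array" idea: columns of A_0, A_1, ... are concatenated,
// stored column-major (each column of M floats is contiguous), and colOffset[i]
// holds the first global column index of A_i. Each thread finds which sub-array
// its column belongs to and then works on that element independently.
__global__ void processAll(float *mega, const int *colOffset,
                           int numArrays, int M, int totalCols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= totalCols || row >= M) return;

    int a = 0;                                   // linear search for clarity
    while (a + 1 < numArrays && col >= colOffset[a + 1]) ++a;
    int localCol = col - colOffset[a];           // column within A_a
    (void)localCol;                              // would be used by the real per-array work

    mega[(size_t)col * M + row] *= 2.0f;         // placeholder per-element work
}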

Reordering of work dimensions may cause a huge performance boost, but why?

I am using OpenCL for stereo image processing on the GPU, and after I ported a C++ implementation to OpenCL I was playing around with optimizations. A very simple experiment was to swap the dimensions around.
Consider a simple kernel which is executed for every pixel of the two-dimensional work space (e.g. 640x480). In my case it was a Census transform.
Swapping from:
int globalU = get_global_id(0);
int globalV = get_global_id(1);
to
int globalU = get_global_id(1);
int globalV = get_global_id(0);
while adjusting the NDRange in the same way, gave a performance boost of about 500%. Other experiments in 3D space reduced the execution time from 72 ms to 2 ms, only by reordering the dimensions.
Can anybody explain to me how this happens?
Is it just an effect of memory pipelines and cache usage?
EDIT: The image has a standard memory layout. That's why I wondered about the effect. I expected the best speed when the iteration follows the order in which the image is stored in memory, which is not the case.
After some reading of the AMD APP SDK documentation, I found some interesting details about the memory channels. That could be a reason.
When you access an element in memory, it is first loaded into the CPU's cache. The CPU does not load a single element (say, 1 byte); instead it loads a whole cache line (for example, 64 adjacent bytes). This is because you will usually access subsequent elements soon after, so the CPU will not need to go back to RAM for them.
This makes a huge difference: to access cache memory, an electrical signal does not even have to leave the CPU chip, while if the CPU needs to load data from RAM, the signal has to travel to a separate chip, and probably more than one signal is required, since the CPU generally needs to specify a row and a column in RAM to access part of it (read "What Every Programmer Should Know About Memory" for more info). In practice, a cache access may take only 0.5 ns while a RAM access costs around 100 ns.
So algorithms should take this into account. If you traverse all elements of a matrix, you should traverse them so that elements located near each other in memory are accessed at approximately the same time. So if your matrix has the following layout in memory:
m_0_0, m_0_1, m_0_2, ... m_1_0, m_1_1 (first column, second column, etc.)
you should access elements in order: m_0_0, m_0_1, m_0_2 (by column)
If you used a different access order (say, by row in this case), the CPU would load part of the first column into cache when you access the first element of the first column, then part of the second column when you access the first element of the second column, and so on. By the time you have traversed the first row and come back for the second element of the first column, the first column's values would no longer be in the cache, since the cache has a limited (and very small) size. Such an algorithm therefore effectively gets no benefit from the cache at all.
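As a concrete illustration (plain C++, using the layout above where consecutive elements of a column are adjacent in memory), the two traversal orders of the same matrix look like this:

// Cache-friendly order: the inner loop walks memory sequentially, so each
// loaded cache line is fully used before moving on.
float sum_friendly(const float *m, int N /* elements per column */, int cols) {
    float s = 0.0f;
    for (int c = 0; c < cols; ++c)
        for (int r = 0; r < N; ++r)
            s += m[c * N + r];
    return s;
}

// Cache-unfriendly order: the inner loop strides by N elements, touching a new
// cache line on every access and evicting lines before they are reused.
float sum_unfriendly(const float *m, int N, int cols) {
    float s = 0.0f;
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < cols; ++c)
            s += m[c * N + r];
    return s;
}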

Is there any workaround to "reserve" a cache fraction?

Assume I have to write a C or C++ computationally intensive function that has 2 arrays as input and one array as output. If the computation uses the 2 input arrays more often than it updates the output array, I'll end up in a situation where the output array seldom gets cached because it's evicted in order to fetch the 2 input arrays.
I want to reserve one fraction of the cache for the output array and enforce somehow that those lines don't get evicted once they are fetched, in order to always write partial results in the cache.
Update1(output[]); // Output gets cached
DoCompute1(input1[]); // Input 1 gets cached
DoCompute2(input2[]); // Input 2 gets cached
Update2(output[]); // Output is not in the cache anymore and has to get cached again
...
I know there are mechanisms to help eviction: clflush, clevict, _mm_clevict, etc. Are there any mechanisms for the opposite?
I am thinking of 3 possible solutions:
Using _mm_prefetch from time to time to fetch the data back if it has been evicted. However, this might generate unnecessary traffic, and I would need to be very careful about when to introduce the prefetches;
Trying to do the processing on smaller chunks of data. However, this would work only if the problem allows it;
Disabling hardware prefetchers where that's possible to reduce the rate of unwanted evictions.
Other than that, is there any elegant solution?
Intel CPUs have something called No Eviction Mode (NEM) but I doubt this is what you need.
While you are attempting to optimise away the second (unnecessary) fetch of output[], have you given thought to using SSE2/3/4 registers to store your intermediate output values, updating them when necessary, and writing them back only when all updates related to that part of output[] are done?
I have done something similar while computing FFTs (Fast Fourier Transforms) where part of the output is in registers and they are moved out (to memory) only when it is known they will not be accessed anymore. Until then, all updates happen to the registers. You'll need to introduce inline assembly to effectively use SSE* registers. Of course, such optimisations are highly dependent on the nature of the algorithm and data placement.
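A rough sketch of the register idea using SSE intrinsics instead of inline assembly (the block structure, names and the multiply-accumulate operation are purely illustrative; whether the updates to a given part of output[] can be grouped like this depends entirely on the algorithm):

#include <immintrin.h>

// Hypothetical example: one 4-float block of output[] lives in an SSE register
// for the whole update loop and is written to memory only once, at the end.
void update_block(float *output, const float *input1, const float *input2,
                  size_t block /* index of a 4-float output block */, size_t n)
{
    __m128 acc = _mm_setzero_ps();                // partial output stays in a register
    for (size_t i = 0; i < n; ++i) {
        __m128 a = _mm_loadu_ps(&input1[i * 4]);
        __m128 b = _mm_loadu_ps(&input2[i * 4]);
        acc = _mm_add_ps(acc, _mm_mul_ps(a, b));  // all updates happen in the register
    }
    _mm_storeu_ps(&output[block * 4], acc);       // single write-back at the end
}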
I am trying to get a better understanding of the question:
If it is true that the 'output' array is strictly for output, and you never do something like
output[i] = Foo(newVal, output[i]);
then all elements of output[] are strictly write-only. If so, all you would ever need to 'reserve' is one cache line. Isn't that correct?
In this scenario, all writes to 'output' generate cache-fills and could compete with the cachelines needed for 'input' arrays.
Wouldn't you want a cap on the cache lines 'output' can consume, as opposed to reserving a certain number of lines?
I see two options, which may or may not work depending on the CPU you are targeting, and on your precise program flow:
If output is only written to and not read, you can use streaming-stores, i.e., a write instruction with a no-read hint, so it will not be fetched into cache.
You can use prefetching with a non-temporally-aligned (NTA) hint for input. I don't know how this is implemented in general, but I know for sure that on some Intel CPUs (e.g., the Xeon Phi) each hardware thread uses a specific way of cache for NTA data, i.e., with an 8-way cache 1/8th per thread.
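A small sketch combining both hints (illustrative names; assumes 16-byte-aligned arrays and a length that is a multiple of 4):

#include <immintrin.h>

// Non-temporal prefetch for the inputs, streaming (no-read-allocate) store for
// the output, so output[] never competes for cache with input1[]/input2[].
void combine(float *output, const float *input1, const float *input2, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        _mm_prefetch((const char *)&input1[i + 64], _MM_HINT_NTA);
        _mm_prefetch((const char *)&input2[i + 64], _MM_HINT_NTA);
        __m128 a = _mm_load_ps(&input1[i]);
        __m128 b = _mm_load_ps(&input2[i]);
        _mm_stream_ps(&output[i], _mm_add_ps(a, b));   // bypasses the cache
    }
    _mm_sfence();   // make the streaming stores visible to other agents
}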
I guess the solution to this is hidden in the algorithm employed and in the L1 cache size and cache line size.
Though I am not sure how much performance improvement we will see with this.
We could probably introduce artificial reads that cleverly dodge the compiler and, during execution, do not hurt the computation either. A single artificial read should fill as many cache lines as are needed to accommodate one page. The algorithm should therefore be modified to compute blocks of the output array, something like the blocked schemes used for multiplying huge matrices on GPUs: they use blocks of the matrices both for the computation and for writing the result.
As pointed out earlier, the writes to the output array should happen as a stream.
To bring in the artificial read, we should initialize the output array at the right places at compile time, once in each block, probably with 0 or 1.

cudaMemcpy2D for shared memory copies

I have some memory that has been allocated on device that is just a single malloc of H*W*sizeof(float) in size.
This is to represent an H*W matrix.
I have code where I need to swap the quadrants of the matrix. Can I use cudaMemcpy2D to accomplish this? Would I just need to specify spitch and dpitch as W*sizeof(float) and use pointers to each quadrant of the matrix?
Also, when the cudaMemcpy functions talk about the memory areas not overlapping, does that mean src and dst cannot overlap at all? As in, if I had a 10-byte-wide array that I wanted to shift left by one, it would fail?
Thanks
You can use cudaMemcpy2D for moving around sub-blocks which are part of larger pitched linear memory allocations. There is no problem in doing that. The non-overlapping requirement is non-negotiable and it will fail if you try it. The source and destination can come from the same allocation, but the address ranges of the source and destination cannot overlap. If you need to do some "in-situ" copying where there is overlap, you might be better served to write a kernel to do it (see the matrix transpose example in the SDK as a sound way to do that kind of thing).
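For reference, a sketch of the call for one quadrant, assuming an H x W row-major float matrix d_mat with no padding (so the pitch is W * sizeof(float)) and even H and W. This copies the top-left quadrant over the bottom-right one; a full swap would need a temporary buffer (or the kernel approach below), and in each individual copy the source and destination ranges must not overlap:

size_t pitch = W * sizeof(float);
float *topLeft     = d_mat;                          // quadrant starting at (0, 0)
float *bottomRight = d_mat + (H / 2) * W + (W / 2);  // quadrant starting at (H/2, W/2)

cudaMemcpy2D(bottomRight, pitch,                     // dst pointer and pitch in bytes
             topLeft,     pitch,                     // src pointer and pitch in bytes
             (W / 2) * sizeof(float),                // width of the block in bytes
             H / 2,                                  // number of rows
             cudaMemcpyDeviceToDevice);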
I suggest writing a simple kernel to do this matrix manipulation. I think it would be easier to write than using cudaMemcpy(2D) and almost definitely faster assuming you write it to get good memory coherence.
It's probably easiest to do an out-of-place transform (i.e. different input and output arrays) to avoid clobbering the input matrix. Each thread would simply read from its input offset and write to the transformed offset.
It would be similar to a matrix transpose. There is a matrix transpose example in the CUDA SDK.
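A minimal out-of-place version of such a kernel might look like this (assumes even H and W and separate in/out arrays; the name is illustrative):

// Each thread reads one element and writes it to the diagonally opposite
// quadrant, which swaps all four quadrants in a single pass (fftshift-style).
__global__ void swapQuadrants(const float *in, float *out, int H, int W)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x >= W || y >= H) return;

    int dx = (x + W / 2) % W;                        // shift by half the width
    int dy = (y + H / 2) % H;                        // shift by half the height
    out[dy * W + dx] = in[y * W + x];
}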

Speed up float 5x5 matrix * vector multiplication with SSE

I need to run a matrix-vector multiplication 240000 times per second. The matrix is 5x5 and is always the same, whereas the vector changes at each iteration. The data type is float. I was thinking of using some SSE (or similar) instructions.
I am concerned that the number of arithmetic operations is too small compared to the number of memory operations involved. Do you think I can get some tangible (e.g. > 20%) improvement?
Do I need the Intel compiler to do it?
Can you point out some references?
The Eigen C++ template library for vectors, matrices, ... has both
optimised code for small fixed size matrices (as well as dynamically sized ones)
optimised code that uses SSE optimisations
so you should give it a try.
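For example, with Eigen's fixed-size types the whole operation is a one-liner, and the library unrolls and vectorises the small product where it can (a sketch, not tuned):

#include <Eigen/Dense>

int main() {
    Eigen::Matrix<float, 5, 5> M = Eigen::Matrix<float, 5, 5>::Random();  // fixed matrix
    Eigen::Matrix<float, 5, 1> x = Eigen::Matrix<float, 5, 1>::Random();  // changing vector
    Eigen::Matrix<float, 5, 1> y = M * x;   // fixed-size matrix-vector product
    return y.sum() > 0.0f ? 0 : 1;          // keep the result from being optimised away
}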
In principle the speedup could be 4 times with SSE (8 times with AVX). Let me explain.
Let's call your fixed 5x5 matrix M, and define the components of a 5D vector as (x,y,z,w,t). Now form a 5x4 matrix U from the first four vectors.
U =
xxxx
yyyy
zzzz
wwww
tttt
Next, do the matrix product MU = V. The matrix V contains the products of M with the first four vectors. The only problem is that for SSE we need to read the rows of U, but in memory U is stored as xyzwtxyzwtxyzwtxyzwt, so we have to transpose it to xxxxyyyyzzzzwwwwtttt. This can be done with shuffles/blends in SSE. Once we have this format, the matrix product is very efficient.
Instead of taking O(5x5x4) operations as scalar code would, it only takes O(5x5) operations, i.e. a 4x speedup. With AVX the matrix U will be 5x8, so instead of taking O(5x5x8) operations it only takes O(5x5), i.e. an 8x speedup.
The matrix V, however, will be in xxxxyyyyzzzzwwwwtttt format so depending on the application it might have to be transposed to xyzwtxyzwtxyzwtxyzwt format.
Repeat this for the next four vectors (8 for AVX) and so forth until done.
If you have control over the vectors, for example if your application generates the vectors on the fly, then you can generate them in xxxxyyyyzzzzwwwwtttt format and avoid the transpose of the array. In that case you should get a 4x speedup with SSE and an 8x speedup with AVX. If you combine this with threading, e.g. OpenMP, your speedup should be close to 16x (assuming four physical cores) with SSE. I think that's the best you can do with SSE.
Edit: Due to instruction-level parallelism (ILP) you can get another factor of 2 in speedup, so the speedup for SSE could be 32x with four cores (64x with AVX), and another factor of 2 again on Haswell due to FMA3.
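A sketch of this scheme for one group of four vectors already laid out in xxxxyyyyzzzzwwwwtttt (SoA) order; M is the fixed 5x5 row-major matrix and the output V comes back in the same SoA order (function and parameter names are illustrative):

#include <immintrin.h>

void matvec5x5_batch4(const float M[5][5],
                      const float *U /* 5 rows of 4 floats: x,y,z,w,t */,
                      float *V /* 5 rows of 4 floats */)
{
    __m128 row[5];
    for (int j = 0; j < 5; ++j)
        row[j] = _mm_loadu_ps(&U[j * 4]);            // one component of 4 vectors

    for (int i = 0; i < 5; ++i) {
        __m128 acc = _mm_mul_ps(_mm_set1_ps(M[i][0]), row[0]);
        for (int j = 1; j < 5; ++j)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(M[i][j]), row[j]));
        _mm_storeu_ps(&V[i * 4], acc);               // i-th component of all 4 results
    }
}

With AVX the same structure works on 8 vectors at a time using __m256 and 256-bit loads/stores.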
I would suggest using Intel IPP and abstracting yourself from the dependency on these techniques.
If you're using GCC, note that the -O3 option will enable auto-vectorization, which will automatically generate SSE or AVX instructions in many cases. In general, if you just write it as a simple for-loop, GCC will vectorize it. See http://gcc.gnu.org/projects/tree-ssa/vectorization.html for more information.
This should be easy, especially when you're on Core 2 or later: you need five _mm_dp_ps, one _mm_mul_ps, two _mm_add_ps, one ordinary multiplication, plus some shuffles, loads and stores (and since the matrix is fixed, you can keep most of it in SSE registers, if you don't need them for anything else).
As for memory bandwidth: we're talking about 2.4 megabytes of vectors, when memory bandwidths are in the single-digit gigabytes per second.
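A sketch of that approach for a single vector (requires SSE4.1 for _mm_dp_ps; names are illustrative, and the fixed matrix rows could of course be kept in registers instead of reloaded):

#include <smmintrin.h>   // SSE4.1

// y = M * x for a 5x5 row-major matrix: a dot product over the first four
// columns of each row, plus a scalar multiply-add for the fifth column.
void matvec5x5_dp(const float M[5][5], const float x[5], float y[5])
{
    __m128 v = _mm_loadu_ps(x);                      // x[0..3]
    for (int i = 0; i < 5; ++i) {
        __m128 row = _mm_loadu_ps(M[i]);             // M[i][0..3]
        __m128 dp  = _mm_dp_ps(row, v, 0xF1);        // dot of 4 elements, result in lane 0
        y[i] = _mm_cvtss_f32(dp) + M[i][4] * x[4];   // add the fifth-column term
    }
}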
What is known about the vector? Since the matrix is fixed, and if there is only a limited set of values that the vector can take, then I'd suggest pre-computing the results and accessing them using a table lookup.
The classic optimization technique to trade memory for cycles...
I would recommend having a look at an optimised BLAS library, such as the Intel MKL or the AMD ACML. Based on your description I would assume that you'd be after the SGEMV level 2 matrix-vector routine, to do y = A*x style operations.
If you really want to implement something yourself, using the (available) SSE..SSE4 and AVX instruction sets can offer significant performance improvements in some cases, although this is exactly what a good BLAS library will already be doing. You also need to think a lot about cache-friendly data access patterns.
I don't know if this is applicable in your case, but can you operate on "chunks" of vectors at a time? So rather than repeatedly doing a y = A*x style operation, can you operate on blocks of [y1 y2 ... yn] = A * [x1 x2 ... xn]? If so, this means that you could use an optimised matrix-matrix routine, such as SGEMM. Due to the data access patterns this may be significantly more efficient than repeated calls to SGEMV. If it were me, I would try to go down this path...
Hope this helps.
If you know the vectors in advance (e.g., doing all 240k at once), you'd get a better speedup by parallelising the loop than by going to SSE. If you've already taken that step, or you don't know them all at once, SSE could be a big benefit.
If the memory is contiguous, then don't worry too much about the memory operations. If you've got a linked list or something then you're in trouble, but it should be able to keep up without too much problem.
5x5 is a funny size, but you could do at least 4 flops in one SSE instruction and try to cut your arithmetic overheads. You don't need the Intel compiler, but it might be better, I've heard legends about how it's much better with arithmetic code. Visual Studio has intrinsics for dealing with SSE2, and I think up to SSE4 depending on what you need. Of course, you'd have to roll it yourself. Grabbing a library might be the smart move here.