ArrayFire Vectorization - C++

I'm trying to speed up the following calculation but have not been able to reach the desired speed. I'm sure the issue is with my code and not a physical limitation of the GPU.
I have a matrix V that is 10,000 x 6 x 6.
And another matrix P that is 6 x 1,000.
Both are complex.
I need to compute V * P (which should result in 10,000 x 6 x 1,000),
take the magnitude (or magnitude squared) of it, and then sum over the 6 dimension,
resulting in a 10,000 x 1,000 matrix of real values.
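In index form (with b = 1..6 indexing the third dimension of V), the result I'm after is:
res(i,k) = sum_b | sum_j V(i,j,b) * P(j,k) |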
I have tried the following:
af::array V{ 10000, 6, 6, c32 };
af::array P{ 6, 1000, c32 };
af::array VP = af::matmul(V, P); // results in 10,000 x 1,000 x 6 - OK, as long as I still sum over the 6 dim
af::array res = af::sum(af::abs(VP), 2);
This was not nearly fast enough. Then I tried splitting V into an array of slices, so I had:
af::array V[6] = { af::array{ 10000, 6, c32 }, af::array{ 10000, 6, c32 },
                   af::array{ 10000, 6, c32 }, af::array{ 10000, 6, c32 },
                   af::array{ 10000, 6, c32 }, af::array{ 10000, 6, c32 } };
af::array VP[6];
af::array res;
for (int i = 0; i < 6; i++)
{
    VP[i] = af::matmul(V[i], P);
}
res = af::abs(VP[0]);
for (int i = 1; i < 6; i++)
{
    res += af::abs(VP[i]);
}
This gave about a 2x speedup. I came up with another solution, but the af::matmul overload that takes in three arrays doesn't support options (like hermitian) and doesn't support gfor, so I couldn't try that route.
Currently, the matrix multiply (in both approaches) takes about 2.2 ms, and it looks like ArrayFire can combine the abs and sum into one JIT kernel that takes about 2 ms.
My knowledge of ArrayFire is limited, so I'm guessing there is something I'm not thinking of. Does anyone have an idea of how I can increase the speed of this algorithm?
Thank you!

I can confirm your finding that the looped version is about twice as fast as the batched matmul. The matmul on its own is not what takes the long runtime in your code snippet; it is the other operation, summing along the third dimension after abs, that is costly. That is due to the following reasons.
1) sum(abs(result)) - abs is again not the issue here. Sum is a reduction algorithm, and reductions are usually quite fast along the fastest-moving dimension. For a reduction along a higher dimension, however, the stride between successive elements is the size of an entire matrix slice, which is expensive compared to reducing over contiguous locations.
2) looped abs additions - This version, however, accesses elements that are contiguous in memory, because we are basically adding the respective elements of six matrices. On top of this, the entire loop (along with the abs op) is converted into a single JIT kernel that does roughly the following, which is very efficient.
res = res + ptr0[i] + ptr1[i] + ptr2[i] + ptr3[i] + ptr4[i] + ptr5[i]
The above line is just an illustration; it is not the exact JIT kernel.
Hence, the looped version is faster than the batched version in this specific case, because of the reduction operation that is being done on the result of the batched matmul.
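For clarity, the pattern I am describing is the following sketch (assuming V is kept as six 10,000 x 6 slices, as in your second attempt); only abs and += reach the JIT, so they fuse into one kernel:

af::array res = af::abs(af::matmul(V[0], P));
for (int i = 1; i < 6; i++)
    res += af::abs(af::matmul(V[i], P));  // abs and += fuse into a single JIT kernel
res.eval();  // force the fused kernel to execute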
My test GPU: GTX 1060
The matmul itself for a single [10k x 6] * [6 x 1k] is about half a millisecond on the GTX 1060, and I would think six such matmuls can't be done in under a millisecond, on my GTX 1060 at least. What is your target runtime?
EDITED (Jan 10, 2020): Actually, this won't work because of the abs operation on the result of each matmul.
You can try looking into our latest addition to the gemm category in the master branch of ArrayFire. However, you would have to build ArrayFire from source until our next feature release, 3.7. You can look at the documentation on the following page.
https://github.com/arrayfire/arrayfire/blob/master/include/af/blas.h#L230
It follows the principle of the C array in the cuBLAS gemm API, which computes C = alpha * op(A) * op(B) + beta * C and so lets you accumulate into existing output memory.

Related

Low performance – Patch Match. Image Processing on GPU (CUDA)

I have a performance problem: CPU and GPU performance are almost the same.
The problem I am dealing with is patch match: I have two matrices, and I want to find where the similarity between the big matrix and the small one is highest.
The matrices hold binary values 0/1 (black and white).
When I check a match between the small matrix and the big one on an i5 CPU, it takes 30 ms (using multithreading).
When I check the same match on a GeForce GT 730, it takes 33 ms.
I would expect the GPU to be faster by at least one order of magnitude; I am pretty disappointed with my current results.
I have two matrices:
1) Big - 300,000 elements (300 rows, 1000 columns)
2) Small - 50,000 elements (50 rows, 1000 columns)
The comparison is done by dividing the big matrix into 250 sub-matrices, comparing each one to the small matrix, and then finding the highest similarity.
The similarity criterion is the count of corresponding black pixels in both matrices (the small and the sub-big) divided by the count of black pixels in the sub-big.
I did the last task using the following CUDA code:
__global__ void matCompare_cuda(uint8_t *D_SUB, uint8_t *D_SMALL, float *D_RSLTS,
                                unsigned int step, int numOfIndentations, int SUB_size, int SMALL_size)
{
    int i = 0, j = 0, success = 0, sumZero = 0;
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int LoopIndex = tid * step;
    if (tid < numOfIndentations)
    {
        for (j = 0; j < SMALL_size; j++)
        {
            i = j + LoopIndex;
            if (D_SUB[i] == 0)
            {
                sumZero++;
                if (D_SMALL[j] == 0)
                    success++;
            }
        }
        if (success > 0 && sumZero > 500)
            D_RSLTS[tid] = 100 * ((float)success / sumZero);
    }
}
The kernel launch:
int numOfIndentations = 300 - 50;  // (big.rows) - (small.rows)
int numBlock = 16;
int threadNumber = numOfIndentations / numBlock;
matCompare_cuda<<<numBlock, threadNumber>>>(D_SUB, D_SMALL, D_RSLTS, step, numOfIndentations, SUB_size, SMALL_size);
The CPU code:
for (i = 0; i < pixelNum; i++)
{
    if (SUB[i] == 0)
    {
        sumDots = sumDots + 1;
        if (SMALL->Image[i] == 0)
        {
            success = success + 1;
        }
    }
}
if (success > 0 && sumDots > 500)
    RSLT = ((float)success / sumDots) * 100;
Do you see any improvement that can be done in the GPU code?
A few things.
Try to avoid the ifs if possible. You can write here:
sumZero += (1 - D_SUB[i]);
success += (1 - D_SUB[i]) * (1 - D_SMALL[j]);
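Applied to your kernel's inner loop, the branchless version would look roughly like this (a sketch; it relies on the pixel values being strictly 0 or 1):

for (j = 0; j < SMALL_size; j++)
{
    i = j + LoopIndex;
    sumZero += (1 - D_SUB[i]);                     // counts black pixels in the sub-matrix
    success += (1 - D_SUB[i]) * (1 - D_SMALL[j]);  // counts positions black in both
}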
However, I don't think you're going to see a huge difference here. I see two reasons.
One is that there's overhead in invoking CUDA. The data needs to be copied to the graphics card and back, which eats some of the speedup you get. I'm not sure how much that is, but since the runtime is so short it could play a role. I hope you didn't time the compilation of the kernel and other one-time costs (exclude them by running the code in a loop and ignoring the first few iterations).
Second, your big matrix is too small and your small matrix is too big. Because the small matrix is so big (1000 columns), I'm guessing it plays really well with the CPU cache lines. If the small matrix were smaller, you would have to go to the next line more often, which would increase the chances of breaking a cache line. The GPU uses rectangles for caching, so that wouldn't be a problem for it. And if the big matrix were bigger, the amount of computation required would grow, so the GPU would start to pull ahead.

Knapsack using dynamic programming

There is a common algorithm for solving the knapsack problem using dynamic programming, but it doesn't work for W = 750,000,000 because the allocation fails with bad_alloc. Any ideas how to solve this problem for my value of W?
int n = this->items.size();
std::vector<std::vector<uint64_t>> dps(this->W + 1, std::vector<uint64_t>(n + 1, 0));
for (int j = 1; j <= n; j++)
    for (int k = 1; k <= this->W; k++) {
        if (this->items[j - 1]->wts <= k)
            dps[k][j] = std::max(dps[k][j - 1], dps[k - this->items[j - 1]->wts][j - 1] + this->items[j - 1]->cost);
        else
            dps[k][j] = dps[k][j - 1];
    }
First of all, you can use only one dimension to solve the knapsack problem. This reduces your memory from dp[W][n] (n*W space) to dp[W] (W space); see the sketch after this paragraph. You can look here: 0/1 Knapsack Dynamic Programming Optimazion, from 2D matrix to 1D matrix
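A minimal sketch of that 1D formulation, using the names from the question (iterate k downward so each item is still used at most once):

std::vector<uint64_t> dp(this->W + 1, 0);
for (int j = 0; j < n; j++)
    for (int64_t k = this->W; k >= (int64_t)this->items[j]->wts; k--)
        dp[k] = std::max(dp[k], dp[k - this->items[j]->wts] + this->items[j]->cost);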
But even if you use only dp[W], your W is really high and it might still be too much memory. If your items are big, you can use an approach that reduces the number of possible weights: realize that you don't need all positions up to W, only those weights that actually occur as a sum of the weight[i].
For example:
W = 500
weights = [100, 200, 400]
You will never use position dp[473] of your matrix, because the items can occupy only positions p = [0, 100, 200, 300, 400, 500]. It is easy to see that this problem is the same as when:
W = 5
weights = [1,2,4]
Another more complicated example:
W = 20
weights = [5, 7, 8]
Using the same approach as before, you don't need all weights from 0 to 20, because the items can only fill the positions
p = [0, 5, 7, 5 + 7, 5 + 8, 7 + 8, 5 + 7 + 8]
p = [0, 5, 7, 12, 13, 15, 20]
and you can reduce your matrix from dp[20] to dp[size of p] = dp[7].
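One way to enumerate those reachable positions is a subset-sum pass over the items (a sketch, assuming a weights vector as in the examples above; the set itself can grow large, so this pays off only when few sums are reachable):

std::set<int64_t> reachable = {0};
for (int64_t w : weights)
{
    std::set<int64_t> next = reachable;
    for (int64_t r : reachable)
        if (r + w <= W)
            next.insert(r + w);
    reachable = next;
}
// dp can then be indexed by position within 'reachable' instead of by raw weight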
You do not show n, but even if we assume it is 1, let's see how much data you are trying to allocate. It would be:
W * 64 * 2 bits  // ignoring the overhead of the vectors themselves
This comes out to be:
750,000,000 * 64 * 2 bits = ~11.18 GB
I am guessing this is more space than your program is allowed. You are going to need a new approach. Perhaps try to handle the problem in multiple blocks: consider the first and second halves separately, then swap.

A better way to access n-d array element with a 1-d index array in C++?

Recently I've been doing some work with C++ pointers, and I ran into this question when I wanted to access elements of a multi-dimensional array through a 1-dimensional array containing the indices.
Say I have an array arr, a 4-dimensional array with all elements set to 0 except that arr[1][2][3][4] is 1, and an array idx which contains the index into every dimension of arr. I can access this element by using arr[idx[0]][idx[1]][idx[2]][idx[3]], or by using *(*(*(*(arr + idx[0]) + idx[1]) + idx[2]) + idx[3]).
The problem is that when n is large this gets unwieldy, so I wonder if there is a better way to work with multi-dimensional accesses?
#include <bits/stdc++.h>
using namespace std;
#define N 10

int main()
{
    int arr[N][N][N][N] = {0};
    int idx[4] = {1, 2, 3, 4};
    arr[1][2][3][4] = 1;
    cout << "Expected: " << arr[1][2][3][4] << " at " << &arr[1][2][3][4] << endl;
    cout << "Got with ****: ";
    cout << *(*(*(*(arr + idx[0]) + idx[1]) + idx[2]) + idx[3]) << endl;
    return 0;
}
output
Expected: 1 at 0x7fff54c61f28
Got with ****: 1
How you construct your algorithm for indexing a multi-dimensional array will vary depending on the language of choice; you have tagged this question with both C and C++. I will stick with the latter, since my answer pertains to C++. For a little while now I've been working on something similar but different, so this becomes an interesting question, as I have been building a multipurpose multidimensional matrix class template.
What I have discovered about higher dimensional vectors and matrices is that the order of 3 repeatedly works miracles in understanding the nature of higher dimensions. Think of this from the geometrical perspective before considering the algorithmic implementation side of it.
Mathematically speaking, let's start at dimension 0, whose only shape is a 0-dimensional object: an arbitrary point. Such a point can carry any number of coordinate properties, such as p0(0), p1(1), p2(2,2), p3(3,3,3), ... pn(n,n,...,n), where each one names a specific locale with the given number of dimensional attributes. A point has no linear distance such as length, width, or height, and hence no magnitude in any direction or dimension; an object with no magnitude defines no area, no volume, and no higher dimensions of volume. A 0-dimensional point also has no sense of direction, which implies there is no angle of rotation that defines magnitude; note as well that any arbitrary point is also the zero vector. Algebraic polynomials help in understanding this: f(x) = mx + b, which is linear, is a one-dimensional equation, shape (in this case a line), or graph; f(x) = x^2 is two-dimensional; f(x) = x^3 is three-dimensional; f(x) = x^4 is four-dimensional; and so on up to f(x) = x^n, which is n-dimensional. Length or magnitude, direction or angle of rotation, area, volume, and the rest cannot be defined until you relate two distinct points, giving you at least one line segment or vector with a specified direction. Once you have an implied direction, you then have slope.
Among mathematical operations, the simplest is addition, which is nothing more than a linear translation; once you introduce addition you also introduce all the other operations such as subtraction, multiplication, division, powers, and radicals. Once you have multiplication and division, you can define rotation, angles of rotation, area, volume, rates of change, and slope (also the tangent function), which gives you geometry and trigonometry and in turn leads to integrals and derivatives. Yes, we have all had our math lessons, but I think this is important for understanding how to construct the relationships of one order of magnitude to another, which will help us work through higher dimensional orders with ease. Once you see that even the higher-order operations are nothing more than expansions of addition and subtraction, you begin to see that their compositions are still linear in nature; they just expand into multiple dimensions.
Earlier I stated that the order of 3 repeatedly works miracles, so let me explain what I mean. Since we perceive things day to day in 3D, we can only visualize three distinct vectors that are orthogonal to each other, giving our natural three dimensions of space: left & right and forward & backward give the horizontal axes and planes, and up & down gives the vertical axis and planes. We cannot visualize anything higher, so dimensions of the order of x^4, x^5, x^6, etc. we cannot visualize, yet they do exist. If we look at the graphs of polynomials we begin to see a pattern between odd and even functions: x^4, x^6, and x^8 are nothing more than expansions of x^2, and x^5, x^7, and x^9 are nothing more than expansions of x^3. So I consider the first few dimensions as normal (zero: point, 1st: linear, 2nd: area, 3rd: volume), and for the 4th and higher dimensions I call them all volumetric.
So if I say volume it relates directly to the 3rd dimension, while volumetric relates to any dimension higher than the 3rd. Now let's consider a matrix such as you have seen in ordinary algebra, where the common matrices are MxN. This is a 2D flat matrix with M * N elements, and it also has an area of M * N. Expanding to a higher dimensional matrix such as MxNxO gives a 3D matrix with M * N * O elements and a volume of M * N * O. To visualize this, think of the MxN 2D part as a page of a book and the O component as each page of the book, or each slice of a box. The elements of these matrices can be anything from a simple value, to an applied operation, an equation, a system of equations, sets, or just an arbitrary object as in a storage container. Now, a matrix of the 4th order such as MxNxOxP has a 4th dimensional aspect, but the easiest way to visualize it is as a 1-dimensional array or vector where each of its P elements is a 3D matrix with volume MxNxO. With a matrix of MxNxOxPxQ you have a 2D area matrix of PxQ where each element is an MxNxO volume matrix. With MxNxOxPxQxR you have a 6th dimensional matrix, and this time a 3D volume matrix where each of the PxQxR elements is in fact a 3D matrix of MxNxO. As you go higher and higher, this pattern repeats and merges again. So the dimensionalities of arbitrary matrices repeat in this order: 1D are linear vectors or matrices, 2D are area or planar matrices, 3D are volume matrices, and anything higher repeats the process by compressing the previous step of volumes, hence the terminology of volumetric matrices. Take a look at this table:
// Order of magnitude and groupings
-----------------------------------
Linear   Area     Volume
x^1      x^2      x^3
x^4      x^5      x^6
x^7      x^8      x^9
x^10     x^11     x^12
...      ...      ...
-----------------------------------
Now it is just a matter of using a little bit of calculus to know which order of magnitude indexes into which higher level of dimensionality. Once you know a specific dimension, it is simple to take multiple derivatives to give you a linear expression, traverse the space, then integrate to the same orders as the derivatives to give the results. This should eliminate a good amount of intermediate work by at first ignoring the least significant lower dimensions of a high dimensional order. If you are working in something that has 12 dimensions, you can assume that the first 3 dimensions defining the first volume are packed tight as an element of another 3D volumetric matrix, and that in turn is itself an element of yet another 3D volumetric matrix. Thus we have a repeating pattern, and now it is just a matter of applying this to construct an algorithm; once you have an algorithm, it should be quite easy to implement in any programming language. You may need a 3-case switch to choose the algorithmic approach based on the overall dimensionality of your matrix or n-d array, where one case handles orders of linearity, another handles area, and the final one handles volumes; if it is 4th order or higher, the overall process becomes recursive in nature.
I figured out a way to solve this myself.
The idea is to use void * pointers. We know that every memory cell holds either a value or the address of another memory cell, so we can directly compute the offset of the target from the base address.
In this case, we use void *p = arr to get the base address of the n-d array, and then loop over the array idx to calculate the offset.
For arr[10][10][10][10], the stride between arr[0] and arr[1] is 10 * 10 * 10 * sizeof(int): since arr is 4-d, arr[0] and arr[1] are 3-d, so there are 10 * 10 * 10 = 1000 elements between them. We also have to remember that the difference between two adjacent void * addresses is 1 byte, so we must multiply by sizeof(int) to get the correct byte offset. With this we finally get the exact address of the memory cell we want to access.
Finally, we cast the void * pointer to an int * pointer and dereference it to get the correct int value. That's it!
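For example, with idx = {1, 2, 3, 4} and N = 10, the offset is (1*1000 + 2*100 + 3*10 + 4) * sizeof(int) = 1234 * 4 = 4936 bytes past the base address (assuming a 4-byte int).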
With void * (not so good)
#include <bits/stdc++.h>
using namespace std;
#define N 10

int main()
{
    int arr[N][N][N][N] = {0};
    int idx[4] = {1, 2, 3, 4};
    arr[1][2][3][4] = 1;
    cout << "Expected: " << arr[1][2][3][4] << " at " << &arr[1][2][3][4] << endl;
    cout << "Got with ****: ";
    cout << *(*(*(*(arr + idx[0]) + idx[1]) + idx[2]) + idx[3]) << endl;
    void *p = arr;
    for (int i = 0; i < 4; i++)
        p += idx[i] * int(pow(10, 3 - i)) * sizeof(int);
    cout << "Got with void *:";
    cout << *((int *)p) << " at " << p << endl;
    return 0;
}
Output
Expected: 1 at 0x7fff5e3a3f18
Got with ****: 1
Got with void *:1 at 0x7fff5e3a3f18
Notice:
There is a warning when compiling it, but I chose to ignore it.
test.cpp: In function 'int main()':
test.cpp:23:53: warning: pointer of type 'void *' used in arithmetic [-Wpointer-arith]
p += idx[i] * int(pow(10, 3-i)) * sizeof(int);
Use char * instead of void * (better)
Since we want to manipulate the pointer byte by byte, it is better to use char * in place of void *.
#include <bits/stdc++.h>
using namespace std;
#define N 10

int main()
{
    int arr[N][N][N][N] = {0};
    int idx[4] = {1, 2, 3, 4};
    arr[1][2][3][4] = 1;
    cout << "Expected: " << arr[1][2][3][4] << " at " << &arr[1][2][3][4] << endl;
    char *p = (char *)arr;
    for (int i = 0; i < 4; i++)
        p += idx[i] * int(pow(10, 3 - i)) * sizeof(int);
    cout << "Got with char *:";
    cout << *((int *)p) << " at " << (void *)p << endl;
    return 0;
}
Output
Expected: 1 at 0x7fff4ffd7f18
Got with char *:1 at 0x7fff4ffd7f18
With int * (in this specific case)
I have been told it's not good practice to use void * in arithmetic; it is better to use int *, so I cast arr to an int * pointer and also replaced pow.
#include <bits/stdc++.h>
using namespace std;
#define N 10

int main()
{
    int arr[N][N][N][N] = {0};
    int idx[4] = {1, 2, 3, 4};
    arr[1][2][3][4] = 1;
    cout << "Expected: " << arr[1][2][3][4] << " at " << &arr[1][2][3][4] << endl;
    cout << "Got with ****: ";
    cout << *(*(*(*(arr + idx[0]) + idx[1]) + idx[2]) + idx[3]) << endl;
    int *p = (int *)arr;
    int offset = 1e3;
    for (int i = 0; i < 4; i++)
    {
        p += idx[i] * offset;
        offset /= 10;
    }
    cout << "Got with int *:";
    cout << *p << " at " << p << endl;
    return 0;
}
Output
Expected: 1 at 0x7fff5eaf9f08
Got with ****: 1
Got with int *:1 at 0x7fff5eaf9f08
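The same idea generalizes to any number of dimensions with a single flat-index helper (a sketch, assuming a contiguous row-major array with extent N in every dimension):

int flatIndex(const int idx[], int ndims, int extent)
{
    // row-major folding: ((i0 * N + i1) * N + i2) * N + i3 for the 4-d case
    int offset = 0;
    for (int i = 0; i < ndims; i++)
        offset = offset * extent + idx[i];
    return offset;
}
// usage for the 4-d example above: *((int *)arr + flatIndex(idx, 4, N))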

Sum of product: can we vectorize the following in C++? (using Eigen or other libraries)

UPDATE: the (sparse) three-dimensional matrix v in my question below is symmetric: v(i1,i2,i3) = v(j1,j2,j3) where (j1,j2,j3) is any of the 6 permutations of (i1,i2,i3), i.e.
v(i1,i2,i3) = v(i1,i3,i2) = v(i2,i3,i1) = v(i2,i1,i3) = v(i3,i1,i2) = v(i3,i2,i1).
Moreover, v(i1,i2,i3) != 0 only when i1 != i2 && i1 != i3 && i2 != i3.
E.g. v(i,i,j) = 0, v(i, k, k) = 0, v(k, j, k) = 0, etc...
I thought that with this additional information, I could already get a significant speed-up by doing the following:
Remark: v contains duplicate values (a triplet (i,j,k) has 6 permutations, and the values of v for these 6 are the same).
So I defined a more compact matrix u that contains only the non-duplicate entries of v. The indices of u are (i1,i2,i3) where i1 < i2 < i3. The length of u is equal to the length of v divided by 6.
I computed the sum by iterating over the new value vector and the new index vectors.
With this I only got a small speed-up. I realized that instead of iterating N times doing one multiplication each time, I was iterating N/6 times doing 6 multiplications each time, which is pretty much the same as before :(
Hope somebody could come up with a better solution.
--- (Original question) ---
In my program I have an expensive operation that is repeated every iteration.
I have three n-dimensional vectors x1, x2 and x3 that are supposed to change every iteration.
I have four N-dimensional vectors I1, I2, I3 and v that are pre-defined and will not change, where:
I1, I2 and I3 contain the indices into respectively x1, x2 and x3 (the elements of I_i are between 0 and n-1)
v is a vector of values.
For example: we can see v as a (reshaped) sparse three-dimensional matrix; each index k of v corresponds to a triplet (i1,i2,i3) of indices into x1, x2, x3.
I want to compute at each iteration three n-dimensional vectors y1, y2 and y3 defined by:
y1[i1] = sum_{i2,i3} v(i1,i2,i3)*x2(i2)*x3(i3)
y2[i2] = sum_{i1,i3} v(i1,i2,i3)*x1(i1)*x3(i3)
y3[i3] = sum_{i1,i2} v(i1,i2,i3)*x1(i1)*x2(i2)
More precisely what the program does is:
Repeat:
Compute y1 then update x1 = f(y1)
Compute y2 then update x2 = f(y2)
Compute y3 then update x3 = f(y3)
where f is some external function.
I would like to know if there is a C++ library that helps me do this as fast as possible. Using plain for loops is just too slow.
Thank you very much for your help!
Update: It looks like it's not easy to do better than the straightforward for loops. If the vector of indices I1 above is sorted in non-decreasing order, can we compute y1 faster?
For example: I1 = [0 0 0 0 1 1 2 2 2 3 3 3 ... n n].
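(For reference, the plain loop I am trying to beat is essentially the following, with y1 zeroed beforehand; y2 and y3 are analogous.)

for (int k = 0; k < N; k++)
    y1[I1[k]] += v[k] * x2[I2[k]] * x3[I3[k]];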
The simple answer is no, at least not trivially. Your access pattern (e.g. x2(i2)*x3(i3)) does not access contiguous memory (at least not in a way known at compile time), but rather goes through a layer of indirection. Because of this, SIMD instructions are pretty useless, as they work on contiguous chunks of memory. What you may want to consider is creating a copy of xM sorted according to iM, removing the layer of indirection. This should reduce the number of cache misses that xM(iM) generates, and since it's accessed N times, that may reduce some of the wall time (assuming N is large).
If maximal accuracy is not critical, you may want to consider using an FFT method instead of the direct convolution (at least, that's how I understood your question; feel free to correct me if I'm wrong).
Assuming you are doing a convolution and the vectors (a and b, same size as in your question) are large, the result (c) can be calculated naïvely as
// O(n^2) direct convolution
for (int i = 0; i < c.size(); i++)
    for (int j = 0; j <= i; j++)
        c(i) += a(j) * b(i - j);
Using the convolution theorem, you can take the Fourier transform of both a and b, perform an element-wise multiplication, and then take the inverse Fourier transform of the result to get c (the result will probably differ a little from the direct version):
// O(n log(n)); note A, B, and C are vectors of complex floating point numbers
fft.fwd(A, a);
fft.fwd(B, b);
C = A.array() * B.array();
fft.inv(c, C);

How good is OpenCV GPU library for matrix operations?

I'm using OpenCV for an application in computer vision. I'd like to accelerate some matrix operations (matrices are fairly large) on GPU and want to avoid coding directly in CUDA C, if possible. OpenCV 2.4.1 has a number of GPU accelerated functions. How well do they perform in your experience? Am I better off using another library (e.g. Thrust) instead?
EDIT
Sample application: Calculate squared Euclidean distance matrix on GPU. Currently, my GPU accelerated (and vectorized) implementation in Matlab using the Parallel Computing Toolbox (PCT) is about 5-10 times faster than my C++ implementation with OpenCV.
Matlab implementation:
function K = sqEuclideanDist(P_cpu,Q_cpu)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
P_gpu = gpuArray(P_cpu);
Q_gpu = gpuArray(Q_cpu);
[nP, d] = size(P_gpu);
[nQ, d] = size(Q_gpu);
pmag = sum(P_gpu .* P_gpu, 2);
qmag = sum(Q_gpu .* Q_gpu, 2);
% note that K is on GPU
K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P_gpu*Q_gpu';
end
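(The vectorization works because the squared distance expands as ||p - q||^2 = ||p||^2 + ||q||^2 - 2*p·q, so the whole distance matrix reduces to two rank-one terms plus one matrix product.)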
UPDATE: Here's another Matlab implementation that accomplishes the same (thanks to https://stackoverflow.com/a/7774323/1121420). But it runs only on the CPU because bsxfun is not supported by PCT. I'm still looking for a C++ alternative though.
function K = sqEuclideanDist(P_cpu,Q_cpu)
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
% Runs on CPU only.
K = bsxfun(@plus, sum(P_cpu.^2,2), sum(Q_cpu.^2,2)') - 2*(P_cpu*Q_cpu');
end
I find ArrayFire to be much faster and have started using it instead of the GPU kernels in OpenCV for image processing. Here are some benchmarks I found comparing ArrayFire (formerly in a different interface called LibJacket) to OpenCV, and it's been true in my benchmarking too that ArrayFire is 2-4X faster than the GPU functions in OpenCV. From what I hear, NVIDIA didn't write the GPU kernels in OpenCV but contracted them out to someone, which may be why they are so slow. Since I'm only using one GPU, I can use ArrayFire for free.
Update, given the new MATLAB code posted by @Alex: I ran the benchmark of this code on my system. I find that the Parallel Computing Toolbox gpuArray is slower than the CPU, but Jacket and ArrayFire kick butt. HW specs are:
Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
NVIDIA Tesla M2090
Results of CPU vs GPU using Parallel Computing Toolbox gpuArray (fully warmed up). CPU is faster than PCT gpuArray:
>> tic; sqEuclideanDist(gpuArray(rand(1581,3)),gpuArray(rand(189,3))); toc;
Elapsed time is 0.006859 seconds.
>> tic; sqEuclideanDist(rand(1581,3),rand(189,3)); toc;
Elapsed time is 0.005712 seconds.
Results of CPU vs GPU using Jacket (fully warmed up). Jacket beats PCT gpuArray by 3.7X and beats the CPU by 3X:
>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001876 seconds.
Here is the modified code that lets you run all of that easily:
function K = sqEuclideanDist(P,Q)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
[nP, d] = size(P);
[nQ, d] = size(Q);
pmag = sum(P .* P, 2);
qmag = sum(Q .* Q, 2);
K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q';
end
Jacket does support BSXFUN on the GPU, and it does improve the speed somewhat:
>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001420 seconds.
Note that the sizes used here are pretty small, so most CUDA code that attempts to run on such small sizes is likely to perform poorly. That's why I like to use AccelerEyes' stuff, because those guys have optimized the heck out of the GPU, unlike PCT gpuArray, Thrust, and OpenCV, each of which I've tried in the past.
Here are the ArrayFire Free C++ results:
Time: 0.0003577 seconds
Speedups: 19.2X faster than PCT gpuArray, 16X faster than the CPU, 5.2X faster than Jacket in MATLAB original version, 4X faster than Jacket in MATLAB using BSXFUN
Here is the ArrayFire code I wrote for this:
static array SqEuclideanDist(array P, array Q)
{
    // 0 based indexing
    array pmag = sum(P * P, 1);
    array qmag = sum(Q * Q, 1);
    int np = P.dims(0);
    int nq = Q.dims(0);
    array K = tile(qmag.T(), np, 1) + tile(pmag, 1, nq) - 2 * matmul(P, Q.T());
    return K;
}

int main(int argc, char **argv)
{
    double *P_cpu = new double[1581 * 3];
    double *Q_cpu = new double[189 * 3];
    array P = array(1581, 3, P_cpu);
    array Q = array(189, 3, Q_cpu);
    af::sync();
    int iter = 1000;
    timer::tic();
    for (int i = 0; i < iter; i++) {
        array K = SqEuclideanDist(P, Q);
        af::eval(K);
    }
    af::sync();
    printf("Time taken: %2.4lfms\n", (1000 * timer::toc()) / iter);
    delete[] P_cpu;
    delete[] Q_cpu;
}
They've been contributed by NVIDIA, so they do have good performance on CUDA-compatible cards.
The real performance depends on the card itself and the function you are using.
In my experience, only cvRotate and cvResize had better performance than a normal Intel CPU.
(Note: I was only interested in image-related functions.)