I have a matrix that is a representation of a higher-dimensional tensor, which could in principle be N-dimensional, but each dimension is the same size. Let's say I want to compute the following:
D_ijkl = sum_mn a_im * C_mnkl * b_jn
and C is stored as a matrix via C_mnkl = C(M,J), where there is some mapping from ij to I and kl to J (and likewise from mn to M).
I can do this with nested for loops where each dimension of my tensor is of size 3 via
for (int i = 0; i < 3; i++){
    for (int j = 0; j < 3; j++){
        int I = map_ij_to_I(i, j);
        for (int k = 0; k < 3; k++){
            for (int l = 0; l < 3; l++){
                int J = map_kl_to_J(k, l);
                D(I, J) = 0.;
                for (int m = 0; m < 3; m++){
                    for (int n = 0; n < 3; n++){
                        int M = map_mn_to_M(m, n);
                        D(I, J) += a(i, m) * C(M, J) * b(j, n);
                    }
                }
            }
        }
    }
}
but that's pretty messy and not very efficient. I'm using the Eigen matrix library so I suspect there is probably a much better way to do this than either a for loop or coding each entry separately. I've tried the unsupported tensor library and found it was slower than my explicit loops. Any thoughts?
As a bonus question, how would I compute something like the following efficiently?
There's a lot of work that your compiler's optimizer will do for you under the hood. For one, loops with a constant number of iterations are unrolled. That may be the reason why your code is faster than the library.
I would suggest taking a look at the assembly produced with optimizations turned on, to get a real grasp of where you can optimize and what your program actually looks like once compiled.
Then of course, you can think about parallel implementations either on the CPU (multiple threads) or on the GPU (CUDA, OpenCL, OpenACC, etc.).
As for the bonus question, if you think about writing it as two nested loops, I would suggest rearranging the expression so that the a_km term sits between the two sums. There is no need to perform that multiplication inside the inner sum, since it doesn't depend on n. Although this will probably give only a slight performance benefit on modern CPUs...
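A minimal sketch of that rearrangement, with placeholder names since the bonus expression itself isn't reproduced above: the point is just that a factor independent of the inner index n is multiplied in once per outer iteration rather than once per inner iteration.
// Placeholder sketch: a(k, m) does not depend on n, so it is hoisted out
// of the inner sum. f(m, n) stands in for whatever the n-dependent part
// of the expression is.
double result = 0.;
for (int m = 0; m < 3; m++){
    double inner = 0.;
    for (int n = 0; n < 3; n++)
        inner += f(m, n);       // inner sum over n only
    result += a(k, m) * inner;  // a(k,m) applied once per m
}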
I've got a program multiplying two sub-matrices residing in the same container matrix. I'm trying to obtain some performance gain by using the OpenMP API for parallelization. Below is the multiplication algorithm I use.
#pragma omp parallel for
for (size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for (size_t k = 0; k < matrixA.m_edgeSize; k++) {
        for (size_t j = 0; j < matrixA.m_edgeSize; j++) {
            resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
        }
    }
}
The algorithm accesses the elements of both input sub-matrices row-wise to improve cache usage through spatial locality.
What other OpenMP directives can be used to obtain better performance from that simple algorithm? Is there any other directive for optimizing the operations on the overlapping areas of two sub-matrices?
You can assume that all the sub-matrices have the same size and they are square-shaped. The resulting sub-matrix resides in another container matrix.
For the matrix-matrix product, any permutation of i,j,k indices computes the right result, sequentially. In parallel, not so. In your original code the k iterations do not write to unique locations, so you can not just collapse the outer two loops. Do a k,j interchange and then it is allowed.
Of course OpenMP gets you from 5 percent efficiency on one core to 5 percent on all cores. You really want to block the loops. But that is a lot harder. See the paper by Goto and van de Geijn.
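A hedged sketch of what that looks like (names as in the question; the collapse clause and the temporary accumulator are additions for illustration, not part of the original code):
// After the k,j interchange, the two outer loops index unique elements of
// resultMatrix, so they can safely be collapsed and run in parallel.
#pragma omp parallel for collapse(2)
for (size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for (size_t j = 0; j < matrixA.m_edgeSize; j++) {
        auto sum = resultMatrix(i, j);
        for (size_t k = 0; k < matrixA.m_edgeSize; k++)
            sum += matrixA(i, k) * matrixB(k, j);
        resultMatrix(i, j) = sum;
    }
}
As the answer says, this fixes correctness under parallelism but does nothing about blocking, which is where the big gains are.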
I'm adding something related to the main matrix. Do you use this code to multiply two bigger matrices? Then one of the sub-matrices is re-used between different iterations and is likely to benefit from the CPU cache. For example, if there are 4 sub-matrices of a matrix, then each sub-matrix is used twice to compute a value of the result matrix.
To benefit most from the cache, the re-used data should be kept in the cache of the same thread (core). To do this, it may be better to move the work-distribution level up to the place where you select the two sub-matrices.
So, something like this:
select sub-matrix A
#pragma omp parallel for
select sub-matrix B
    for (size_t i = 0; i < matrixA.m_edgeSize; i++) {
        for (size_t k = 0; k < matrixA.m_edgeSize; k++) {
            for (size_t j = 0; j < matrixA.m_edgeSize; j++) {
                resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
            }
        }
    }
could work faster, since the whole working set always stays within the same thread (core).
My question is:
I have this code:
#pragma acc parallel loop
for (i = 0; i < bands; i++)
{
    #pragma acc loop seq
    for (j = 0; j < lines_samples; j++)
        r_m[i] += image_vector[i*lines_samples + j];
    r_m[i] /= lines_samples;
    #pragma acc loop
    for (j = 0; j < lines_samples; j++)
        R_o[i*lines_samples + j] = image_vector[i*lines_samples + j] - r_m[i];
}
I'm trying to translate it to SYCL, and I thought about substituting the first parallel loop with a kernel, using the typical "queue.submit(...)" over "i". But then I realized that inside the first big loop there is a loop that must be executed serially. Is there a way to tell SYCL to execute a loop inside a kernel serially?
I can't think of another way to solve this, as I need to make both the first big for loop and the last for loop inside it parallel.
Thank you in advance.
You have a couple of options here. The first one, as you suggest, is to create a kernel with a 1D range over i:
q.submit([&](sycl::handler &cgh){
    // One work-item per band; the kernel lambda must capture by value.
    cgh.parallel_for(sycl::range<1>(bands), [=](sycl::item<1> i){
        for (int j = 0; j < lines_samples; j++)
            r_m[i] += image_vector[i*lines_samples + j];
        r_m[i] /= lines_samples;
        for (int j = 0; j < lines_samples; j++)
            R_o[i*lines_samples + j] = image_vector[i*lines_samples + j] - r_m[i];
    });
});
Note that for the inner loops, the kernel will just iterate serially over j in both cases. SYCL doesn't apply any magic to your loops like a #pragma would - loops are loops.
This is fine, but you're missing out on a higher degree of parallelism which could be achieved by writing a kernel with a 2D range over i and j: sycl::range<2>(bands, lines_samples). This can be made to work relatively easily, assuming your first loop is doing what I think it's doing, which is computing the average of a line of an image. In this case, you don't really need a serial loop - you can achieve this using work-groups.
Work-groups in SYCL have access to fast on-chip shared memory, and are able to synchronise. This means that you can have a work-group load all the pixels from a line of your image, then the work-group can collaboratively compute the average of that line, synchronize, then each member of the work-group uses the computed average to compute a single value of R_o, your output. This approach maximises available parallelism.
The collaborative reduction operation to find the average of the given line is probably best achieved through tree-reduction. Here are a couple of guides which go through this workgroup reduction approach:
https://developer.codeplay.com/products/computecpp/ce/guides/sycl-for-cuda-developers/examples
https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/kernels/reduction.html
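For reference, here is a hedged sketch of that work-group approach, assuming SYCL 2020, USM device pointers image_vector, r_m and R_o as in the snippet above, and an arbitrarily chosen work-group size; a SYCL 2020 group reduction (sycl::reduce_over_group) stands in for the hand-written tree reduction the guides describe:
const size_t wg_size = 128;   // assumed work-group size

q.submit([&](sycl::handler &cgh){
    cgh.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(bands * wg_size), sycl::range<1>(wg_size)),
        [=](sycl::nd_item<1> it){
            const size_t band = it.get_group(0);
            const size_t lid  = it.get_local_id(0);

            // Each work-item accumulates a strided partial sum over one band.
            float partial = 0.f;
            for (size_t j = lid; j < lines_samples; j += wg_size)
                partial += image_vector[band*lines_samples + j];

            // Combine the partial sums across the work-group, then compute the mean.
            const float mean =
                sycl::reduce_over_group(it.get_group(), partial, sycl::plus<float>())
                / lines_samples;
            if (lid == 0)
                r_m[band] = mean;   // keep r_m filled in, as in the original code

            // Every work-item then centres its share of the band's pixels.
            for (size_t j = lid; j < lines_samples; j += wg_size)
                R_o[band*lines_samples + j] = image_vector[band*lines_samples + j] - mean;
        });
});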
I currently have a vector of vectors of a float type, which contain some data:
vector<vector<float> > v1;
vector<vector<float> > v2;
I wanted to know what is the fastest way to square each element in v1 and store it in v2. Currently I am just accessing each element of v1, multiplying it by itself, and storing it in v2, as seen below:
for (int i = 0; i < 10; i++){
    for (int j = 0; j < 10; j++){
        v2[i][j] = v1[i][j] * v1[i][j];
    }
}
With a bit of luck, the compiler you are using understands what you want to do and converts it to use the SSE instructions of the CPU, which do your squaring in parallel. In that case your code is close to optimal speed (on a single core). You could also try the Eigen library (http://eigen.tuxfamily.org/), which provides more reliable means of achieving high performance. You would then get something like
#include <Eigen/Dense>
using Eigen::ArrayXXf;
ArrayXXf v1 = ArrayXXf::Random(10, 10);
ArrayXXf v2 = v1.square();
which also makes your intention more clear.
If you want to stay in the CPU world, OpenMP should help you easily. A single #pragma omp parallel for will divide the load between the available cores, and you could get further gains by telling the compiler to vectorize with the ivdep and simd pragmas.
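A hedged sketch of that, applied to the vector-of-vectors from the question (the pragmas are the only addition; v1 and v2 are assumed to already have matching sizes):
// Parallelise over rows; each row's squaring loop is then a candidate for
// vectorisation via the simd pragma.
#pragma omp parallel for
for (int i = 0; i < (int)v1.size(); i++) {
    #pragma omp simd
    for (int j = 0; j < (int)v1[i].size(); j++) {
        v2[i][j] = v1[i][j] * v1[i][j];
    }
}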
If a GPU is an option, this is a matrix calculation which is perfect for OpenCL. Google for OpenCL matrix multiplication examples. Basically, you can have 2000 threads each executing a single operation, or fewer threads operating on vector chunks, and the kernel is very simple to write.
Here is a matrix (mm, 10 samples × 1000 features), and I want to compute my own distance function between the 10 samples. In other words, there are 10 + 9 + ... + 3 + 2 + 1 calculations (I need the distance of a sample to itself, too).
The serial C++ code looks like this:
for (i = 0; i < 10; i++){
    for (j = 0; j < 10; j++){
        disX = dis(mm[i], mm[j]);
    }
}
How can I use MPI_Gather to collect the disX values? Could you give me an example script similar to this? I googled (MPI nested for loop c++), but got poor search results. The real mm matrix is big and the memory needed by the dis function is small. Thank you.
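For reference, a hedged sketch of one common pattern for this: each rank computes the block of distance rows it owns, and MPI_Gather collects the blocks on rank 0. dis() and mm are as in the question; rows_per_rank and the assumption that the sample count divides evenly by the number of ranks are purely illustrative. The includes go at the top of the file; the rest is assumed to sit inside main(), between MPI_Init and MPI_Finalize.
#include <mpi.h>
#include <vector>

const int N = 10;                        // number of samples
int rank, nprocs;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

const int rows_per_rank = N / nprocs;    // assumes N % nprocs == 0
std::vector<double> local(rows_per_rank * N);

// Each rank fills its own block of rows of the distance matrix.
for (int i = 0; i < rows_per_rank; i++){
    const int gi = rank * rows_per_rank + i;   // global sample index
    for (int j = 0; j < N; j++)
        local[i*N + j] = dis(mm[gi], mm[j]);
}

// Rank 0 receives the full N x N distance matrix, row blocks in rank order.
std::vector<double> disX;
if (rank == 0) disX.resize(N * N);
MPI_Gather(local.data(), rows_per_rank * N, MPI_DOUBLE,
           disX.data(), rows_per_rank * N, MPI_DOUBLE,
           0, MPI_COMM_WORLD);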
I have to do a matrix boolean multiplication of a matrix with itself in a C++ program and I want to optimize it.
The matrix is symmetric, so I am thinking of doing a row-by-row multiplication to reduce cache misses.
I allocated space for the matrix in this way:
matrix = new bool*[dimension];
for (i = 0; i < dimension; i++) {
    matrix[i] = new bool[dimension];
}
And the multiplication is the following:
for (m = 0; m < dimension; m++) {
    for (n = 0; n < dimension; n++) {
        for (k = 0; k < dimension; k++) {
            temp = mat[m][k] && mat[n][k];
            B[m][n] = B[m][n] || temp;
...
I did some computation-time tests with this version and with another version that uses a row-by-column multiplication, like this:
for (m = 0; m < dimension; m++) {
    for (n = 0; n < dimension; n++) {
        for (k = 0; k < dimension; k++) {
            temp = mat[m][k] && mat[k][n];
            B[m][n] = B[m][n] || temp;
...
I did tests on a 1000x1000 matrix. The results showed that the second version (row by column) is faster than the first one.
Could you show me why? Shouldn't the misses in the first algorithm be fewer?
In the first multiplication approach, the rows of the boolean matrices are stored consecutively in memory and also accessed consecutively, so prefetching works flawlessly. In the second approach, the cache line fetched when accessing the element (n,0) can already be evicted by the time you access (n+1,0). Whether this actually happens depends on the architecture you run your code on and the properties of its cache hierarchy. On my machine the first approach is indeed faster for large enough matrices.
As for speeding up the computations: do not use logical operators, since they are evaluated lazily and thus branch misprediction can occur. The inner loop can be exited early as soon as B[m][n] becomes true. Instead of using booleans, you might want to consider using the bits of, say, integers. That way you can combine 32 or 64 elements in your inner loop at once and possibly use vectorization. If your matrices are rather sparse, then you might want to consider switching to sparse matrix data structures. Changing the order of the loops can also help, as can introducing blocking. However, any performance optimization is specific to an architecture and a class of input matrices.
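A hedged sketch of that bit-packing idea (the packed layout and names here are made up for illustration; the row-by-row access pattern of the first version is kept, which is valid because the matrix is symmetric):
#include <cstdint>
#include <vector>

// Each row is packed into 64-bit words: bit k of matBits[m][k / 64] holds
// mat[m][k]. One AND plus a zero test then processes 64 booleans at once,
// and the inner loop exits as soon as the result is known to be true.
void bool_square(const std::vector<std::vector<uint64_t>>& matBits,
                 std::vector<std::vector<bool>>& B,
                 size_t dimension)
{
    const size_t words = (dimension + 63) / 64;
    for (size_t m = 0; m < dimension; m++) {
        for (size_t n = 0; n < dimension; n++) {
            bool any = false;
            for (size_t w = 0; w < words && !any; w++)
                any = (matBits[m][w] & matBits[n][w]) != 0;
            B[m][n] = any;   // true iff rows m and n share a set bit
        }
    }
}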
A speed-up suggestion for the inner loop:
Bmn = false;
for (k = 0; k < dimension; k++) {
    if ((Bmn = mat[m][k] && mat[k][n])) {
        break;   // exit the k loop early once the result is known
    }
}
B[m][n] = Bmn;