How do I translate this OpenACC code to SYCL? - c++

My question is:
I have this code:
#pragma acc parallel loop
for(i=0; i<bands; i++)
{
    #pragma acc loop seq
    for(j=0; j<lines_samples; j++)
        r_m[i] += image_vector[i*lines_samples+j];
    r_m[i] /= lines_samples;
    #pragma acc loop
    for(j=0; j<lines_samples; j++)
        R_o[i*lines_samples+j] = image_vector[i*lines_samples+j] - r_m[i];
}
I'm trying to translate it to SYCL, and I thought about substituting the first parallel loop with a kernel, using the typical "queue.submit(...)" over "i". But then I realized that inside the first big loop there is a loop that must be executed serially. Is there a way to tell SYCL to execute a loop inside a kernel serially?
I can't think of another way to solve this, as I need both the outer loop and the last inner loop to run in parallel.
Thank you in advance.

You have a couple of options here. The first one, as you suggest, is to create a kernel with a 1D range over i:
q.submit([&](sycl::handler &cgh){
    // note: the device lambda must capture by value, and j must be declared
    cgh.parallel_for(sycl::range<1>(bands), [=](sycl::item<1> i){
        for(int j = 0; j < lines_samples; j++)
            r_m[i] += image_vector[i*lines_samples+j];
        r_m[i] /= lines_samples;
        for(int j = 0; j < lines_samples; j++)
            R_o[i*lines_samples+j] = image_vector[i*lines_samples+j] - r_m[i];
    });
});
Note that for the inner loops, the kernel will just iterate serially over j in both cases. SYCL doesn't apply any magic to your loops like a #pragma would - loops are loops.
This is fine, but you're missing out on a higher degree of parallelism which could be achieved by writing a kernel with a 2D range over i and j: sycl::range<2>(bands, lines_samples). This can be made to work relatively easily, assuming your first loop is doing what I think it's doing, which is computing the average of a line of an image. In this case, you don't really need a serial loop - you can achieve this using work-groups.
Work-groups in SYCL have access to fast on-chip shared memory, and are able to synchronize. This means that you can have a work-group load all the pixels from a line of your image, then the work-group can collaboratively compute the average of that line, synchronize, then each member of the work-group uses the computed average to compute a single value of R_o, your output. This approach maximises available parallelism.
The collaborative reduction operation to find the average of the given line is probably best achieved through tree-reduction. Here are a couple of guides which go through this workgroup reduction approach:
https://developer.codeplay.com/products/computecpp/ce/guides/sycl-for-cuda-developers/examples
https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/kernels/reduction.html
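If it helps to see the shape of that collaborative step outside of SYCL syntax, here is a minimal plain-C++ sketch of the tree-reduction pattern each work-group would run over its shared-memory scratch buffer (the function name and the power-of-two size are assumptions for illustration, not part of the question):

```cpp
#include <cstddef>
#include <vector>

// Sketch of the tree-reduction a work-group performs in shared memory.
// Each "work-item" owns one slot of `scratch`; after log2(n) halving rounds
// the sum of the whole line ends up in scratch[0]. In a real SYCL kernel the
// inner loop body runs once per work-item, with a group barrier between
// rounds. Assumes scratch.size() is a power of two.
float tree_reduce_line(std::vector<float> scratch) {
    for (std::size_t stride = scratch.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t id = 0; id < stride; ++id)  // all work-items at once
            scratch[id] += scratch[id + stride];
        // <-- a group barrier would go here in the SYCL kernel
    }
    return scratch[0]; // work-item 0 holds the total; divide by n for the mean
}
```

Dividing the returned sum by `lines_samples` then gives the average that every work-item reads after the final barrier.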

Related

How do I indicate OpenACC to sequentially execute one instruction inside a parallel loop?

I would like the 'r_m[i] /= lines_samples;' line to be executed once, by one thread I mean. Do I have to put a special pragma or do anything for the compiler to understand it?
Here is the code:
#pragma acc parallel loop
for(i=0; i<bands; i++)
{
    #pragma acc loop seq // This may be a reduction, not a seq, who knows? ^^
    for(j=0; j<lines_samples; j++)
        r_m[i] += image_vector[i*lines_samples+j];
    r_m[i] /= lines_samples;
    #pragma acc loop
    for(j=0; j<lines_samples; j++)
        R_o[i*lines_samples+j] = image_vector[i*lines_samples+j] - r_m[i];
}
Thank you a lot!
Assuming the loops are scheduled with the outer loop being "gang" and the inner loop as "vector", this line would be executed once per gang (i.e. by only one thread in the gang). So it will work as you expect.
Depending on the trip count of the first "j" loop, you may or may not want a reduction. Reductions do have overhead, so if the trip count is small, it may be better to leave the loop sequential. Otherwise, I suggest using a temp scalar for the reduction, since as written it would require an array reduction, which incurs more overhead.
Something like:
float rmi = r_m[i];
#pragma acc loop reduction(+:rmi)
for(j=0; j<lines_samples; j++)
    rmi += image_vector[i*lines_samples+j];
r_m[i] = rmi/lines_samples;

Is this the correct use of OpenMP firstprivate?

I need to parallelize the following:
for(i=0; i<n/2; i++)
a[i] = a[i+1] + a[2*i]
In parallel, the output will be different than in sequential, because values still to be read will already have been overwritten. In order to get the sequential output, but parallelized, I want to make use of firstprivate(a), because firstprivate gives each thread a copy of a.
Let's imagine 4 threads and a loop of 100.
1 --> i = 0 till 24
2 --> i = 25 till 49
3 --> i = 50 till 74
4 --> i =75 till 99
That means that each thread will rewrite 25% of the array.
When the parallel region is over, all the threads "merge". Does that mean that you get the same a as if you ran it sequentially?
#pragma omp parallel for firstprivate(a)
for(i=0; i<n/2; i++)
a[i] = a[i+1] + a[2*i]
Question:
Is my way of thinking correct?
Is the code parallelized in the right way to get the sequential output?
As you noted, using firstprivate to copy the data for each thread does not really help you get the data back.
The easiest solution is in fact to separate input and output and have both be shared (the default).
In order to avoid a copy, it would be best to just use out instead of a from there on in the code. Alternatively, you could keep two pointers and swap them.
int out[100];
#pragma omp parallel for
for(i=0; i<n/2; i++)
    out[i] = a[i+1] + a[2*i];
// use out from here on wherever you would have used a.
There is no easy and general way to have private copies of a for each thread and then merge them afterwards. lastprivate just copies one incomplete output array from the thread that executes the last iteration, and reduction doesn't know which elements to take from which array. Even if it did, it would be wasteful to copy the entire array for each thread. Having shared in-/outputs here is much more efficient.
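To illustrate the pointer-swap alternative mentioned above, here is a minimal sketch (the function name, the step count, and the tail copy are illustrative assumptions, not part of the original question):

```cpp
#include <algorithm>
#include <vector>

// Double-buffer sketch: read from `in`, write to `out`, then swap the
// pointers so the next step reads what was just written. No per-thread
// copies of the array are needed; both buffers are shared.
void relax_steps(std::vector<float>& a, int steps) {
    std::vector<float> b(a.size());
    std::vector<float>* in  = &a;
    std::vector<float>* out = &b;
    const int n = static_cast<int>(a.size());
    for (int s = 0; s < steps; ++s) {
        #pragma omp parallel for
        for (int i = 0; i < n / 2; ++i)
            (*out)[i] = (*in)[i + 1] + (*in)[2 * i];
        // copy the untouched tail so `out` is a complete snapshot
        std::copy(in->begin() + n / 2, in->end(), out->begin() + n / 2);
        std::swap(in, out);
    }
    if (in != &a) a = *in; // ensure the caller sees the final buffer
}
```

Each step reads only the previous snapshot, so the result matches the sequential semantics regardless of how the parallel loop is scheduled.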

Efficient Tensor Multiplication

I have a Matrix that is a representation of a higher dimensional tensor which could in principle be N dimensional but each dimension is the same size. Lets say I want to compute the following:
and C is stored as a matrix via
where there is some mapping from ij to I and kl to J.
I can do this with nested for loops where each dimension of my tensor is of size 3 via
for (int i=0; i<3; i++){
    for (int j=0; j<3; j++){
        I = map_ij_to_I(i,j);
        for (int k=0; k<3; k++){
            for (int l=0; l<3; l++){
                J = map_kl_to_J(k,l);
                D(I,J) = 0.;
                for (int m=0; m<3; m++){
                    for (int n=0; n<3; n++){
                        M = map_mn_to_M(m,n);
                        D(I,J) += a(i,m)*C(M,J)*b(j,n);
                    }
                }
            }
        }
    }
}
but that's pretty messy and not very efficient. I'm using the Eigen matrix library so I suspect there is probably a much better way to do this than either a for loop or coding each entry separately. I've tried the unsupported tensor library and found it was slower than my explicit loops. Any thoughts?
As a bonus question, how would I compute something like the following efficiently?
There's a lot of work that your compiler's optimizer will do for you under the hood. For one, loops with a constant number of iterations are unrolled. That may be the reason why your explicit loops are faster than the library.
I would suggest taking a look at the assembly produced with optimizations turned on, to get a real grasp of where you can optimize and of what your program actually looks like once compiled.
Then of course, you can think about parallel implementations, either on the CPU (multiple threads) or on the GPU (CUDA, OpenCL, OpenACC, etc.).
As for the bonus question, if you think about writing it as two nested loops, I would suggest rearranging the expression so that the a_km term sits between the two sums. There is no need to perform that multiplication inside the inner sum, as it doesn't depend on n. Although this will probably give only a slight performance benefit on modern CPUs...
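The exact bonus expression isn't shown above, so purely as an illustration of that hoisting idea, here is a sketch assuming it has the shape d_k = sum_m a_km * (sum_n b_mn * c_n); all names here are hypothetical stand-ins:

```cpp
#include <vector>

// Hoisting sketch: a_km does not depend on n, so apply it once per m
// instead of once per (m, n) pair. This halves the multiplications in the
// inner loop without changing the result.
double reduce_hoisted(const std::vector<std::vector<double>>& a,
                      const std::vector<std::vector<double>>& b,
                      const std::vector<double>& c, int k) {
    double total = 0.0;
    for (std::size_t m = 0; m < b.size(); ++m) {
        double inner = 0.0;
        for (std::size_t n = 0; n < c.size(); ++n)
            inner += b[m][n] * c[n];   // inner sum over n, no a_km here
        total += a[k][m] * inner;      // a_km applied once per m
    }
    return total;
}
```

The same factoring applies however the real indices are mapped, as long as the hoisted factor is genuinely independent of the inner summation variable.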

OpenMP: Is it better to use fewer threads that are long, or maximum available threads that are short?

I have some C++ code that I am running for an optimisation task, and I am trying to parallelise it using OpenMP. I tried using #pragma omp parallel for on both loops, but realised pretty quickly that it didn't work, so I want to set up a conditional to decide whether to parallelise the outer or inner loop, depending on how many outer iterations there are.
Here is the code:
std::vector<Sol> seeds; // vector with initial solutions
int N_OUTER; // typically 1-8
int N_INNER; // typically > 100
int PAR_THRESH; // this is the parameter I am interested in setting
std::vector<Sol> sols (N_OUTER*N_INNER); // vector for output solutions

#pragma omp parallel for if (N_OUTER >= PAR_THRESH)
for (int outer = 0; outer < N_OUTER; ++outer){
    #pragma omp parallel for if (N_OUTER < PAR_THRESH)
    for (int inner = 0; inner < N_INNER; ++inner){
        sols[outer*N_INNER + inner] = solve(seeds[outer]);
    }
}
This works fine to decide which loop (inner or outer) gets parallelised; however, I am trying to determine what is the best value for PAR_THRESH.
My intuition says that if N_OUTER is 1, then it shouldn't parallelise the outer loop, and if N_OUTER is greater than the number of threads available, then the outer loop should be the one to be parallelised; because it uses maximum available threads and the threads are long as possible. My question is about when N_OUTER is either 2 or 3 (4 being the number of threads available).
Is it better to run, say, 2 or 3 threads that are long, in parallel; but not use up all of the available threads? Or is it better to run the 2 or 3 outer loops in serial, while utilising the maximum number of threads for the inner loop?
Or is there a kind of trade off in play, and maybe 2 outer loop iterations might be wasting threads, but if there are 3 outer loop iterations, then having longer threads is beneficial, despite the fact that one thread is remaining unused?
EDIT:
edited code to replace N_ITER with N_INNER in two places
I don't have much experience with OpenMP, but I have found something like the collapse directive:
https://software.intel.com/en-us/articles/openmp-loop-collapse-directive
Understanding the collapse clause in openmp
It seems to be even more appropriate when the number of inner-loop iterations varies.
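A minimal sketch of what collapse(2) looks like on a loop nest of this shape (the function name and the computed value are stand-ins for the real solve call):

```cpp
#include <vector>

// collapse(2) fuses the two loops into a single iteration space of
// n_outer*n_inner, so OpenMP can balance work across threads even when
// n_outer alone is smaller than the thread count.
std::vector<int> run_all(int n_outer, int n_inner) {
    std::vector<int> out(n_outer * n_inner);
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n_outer; ++i)
        for (int j = 0; j < n_inner; ++j)
            out[i * n_inner + j] = i * 100 + j; // stand-in for solve(...)
    return out;
}
```

Note that collapse requires the loops to be perfectly nested, with no statements between the two for headers.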
--
On the other hand:
It seems to me that solve(...) is side-effect free. It seems also that N_ITER is N_INNER.
Currently you calculate solve N_INNER*N_OUTER times.
While reducing that won't change the big-O complexity, assuming solve has a very large constant factor, it should save a lot of time. You cannot cache the result with collapse, so maybe this could be even better:
std::vector<Sol> sols_tmp (N_OUTER); // note: one slot per outer index
#pragma omp parallel for
for (int i = 0; i < N_OUTER; ++i) {
    sols_tmp[i] = solve(seeds[i]);
}
This calls solve only N_OUTER times.
Because solve returns the same value for every element of a row:
#pragma omp parallel for
for (int i = 0; i < N_OUTER*N_INNER; ++i) {
    sols[i] = sols_tmp[i/N_INNER];
}
Of course, you must measure whether parallelization is worthwhile for those loops at all.
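Putting the two snippets together, a self-contained sketch of the two-phase approach might look like this (Sol, solve, and the seed values are stand-ins; it assumes, as above, that solve is deterministic and side-effect free):

```cpp
#include <vector>

// Stand-in for the real, expensive solver (assumed deterministic).
int solve(int seed) { return seed * seed; }

// Phase 1: call solve only once per outer index.
// Phase 2: broadcast each result across its row of the output.
// This avoids the N_OUTER*N_INNER redundant solve() calls of the original.
std::vector<int> fill_solutions(const std::vector<int>& seeds, int n_inner) {
    const int n_outer = static_cast<int>(seeds.size());
    std::vector<int> tmp(n_outer);
    #pragma omp parallel for
    for (int i = 0; i < n_outer; ++i)
        tmp[i] = solve(seeds[i]);          // N_OUTER calls, not N_OUTER*N_INNER
    std::vector<int> sols(n_outer * n_inner);
    #pragma omp parallel for
    for (int i = 0; i < n_outer * n_inner; ++i)
        sols[i] = tmp[i / n_inner];        // copy the row's value
    return sols;
}
```

The first loop carries all the real work, so the thread-count question largely disappears: with only N_OUTER expensive calls, parallelising over the outer index is the natural choice.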

What is the most time efficient way to square each element of a vector of vectors c++

I currently have a vector of vectors of a float type, which contain some data:
vector<vector<float> > v1;
vector<vector<float> > v2;
I wanted to know what is the fastest way to square each element in v1 and store it in v2? Currently I am just accessing each element of v1, multiplying it by itself, and storing it in v2. As seen below:
for(int i = 0; i < 10; i++){
    for(int j = 0; j < 10; j++){
        v2[i][j] = v1[i][j]*v1[i][j];
    }
}
With a bit of luck, the compiler you are using understands what you want to do and converts it to use the CPU's SSE instructions, which do your squaring in parallel. In this case your code is close to optimal speed (on a single core). You could also try the Eigen library (http://eigen.tuxfamily.org/), which provides more reliable means of achieving high performance. You would then get something like
ArrayXXf v1 = ArrayXXf::Random(10, 10);
ArrayXXf v2 = v1.square();
which also makes your intention more clear.
If you want to stay in CPU world, OpenMP should help you easily. A single #pragma omp parallel for will divide the load between available cores and you could get further gains by telling the compiler to vectorize with ivdep and simd pragmas.
If GPU is an option, this is a matrix calculation which is perfect for OpenCL. Google for OpenCL matrix multiplication examples. Basically, you can have 2000 threads executing a single operation, or fewer threads operating on vector chunks, and the kernel is very simple to write.
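Staying in CPU world, a minimal sketch of the OpenMP suggestion could look like this (the function name is mine; the inner loop is left to the compiler's auto-vectorizer, and you could add an omp simd pragma on it if your compiler supports one):

```cpp
#include <cstddef>
#include <vector>

// One parallel-for over the outer vector; each thread squares whole rows.
// The inner loop is a simple contiguous pass, which compilers vectorize well.
void square_into(const std::vector<std::vector<float>>& v1,
                 std::vector<std::vector<float>>& v2) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(v1.size()); ++i) {
        v2[i].resize(v1[i].size());
        for (std::size_t j = 0; j < v1[i].size(); ++j)
            v2[i][j] = v1[i][j] * v1[i][j];
    }
}
```

Since each thread owns disjoint rows, no synchronisation is needed inside the loop.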