Example of C++ code optimization for parallel computing - c++

I'm trying to understand optimization routines. I'm focusing on the most critical part of my code (the code has some loops of length "nc" and one loop of length "np", where "np" is much larger than "nc"). I present part of the code here. The rest of the code accounts for only a small share of the computational time, so I prefer code clarity in the rest of the algorithm. However, the critical loop of length "np" is a pretty simple piece of code and it can be parallelized. So it will not hurt if I rewrite this part into a more efficient but less readable version (maybe using SSE instructions). I'm using the gcc compiler, C++ code, and OpenMP parallelization.
This code is part of the well-known particle-in-cell (PIC) algorithm (and it is a basic version of it). I'm trying to learn code optimization on this version (so my goal is not only an efficient PIC algorithm, because that has already been written in a thousand variants; I also want a demonstrative example of code optimization). I have made an attempt, but I am not sure whether I handled all the optimization aspects correctly.
const int NT = ...; // number of threads (in two versions: about 6 or about 30)
const int np = 10000000; // np is about 1000-10000 times larger than nc commonly
const int nc = 10000;
const int step = 1000;
float u[np], x[np];
float a[nc], a_lin[nc], rho_full[NT][nc], rho_diff[NT][nc] , weight[nc];
int p,num;
for ( i = 0 ; i<step ; i++) {
// ***
// *** some not very time consuming code for calculation
// *** a, a_lin from values of rho_full and rho_diff
#pragma omp for private(p,num)
for ( k = np - 1 ; k >= 0 ; --k ) { // canonical loop form needed by "omp for" (k signed); also visits k = 0
num = omp_get_thread_num();
p = (int) x[k];
u[k] += a[p] + a_lin[p] * (x[k] - p);
x[k] += u[k];
if (x[k]<0 ) {x[k]+=nc;} else
if (x[k]>nc) {x[k]-=nc;};
p = (int) x[k];
rho_full[num][p] += weight[k];
rho_diff[num][p] += weight[k] * (x[k] - p);
}
};
I realize this has problems:
1) (main question) I use a set of arrays rho_full[num][p], where num is the index of each thread. After the computation I just sum these arrays (rho_full[0][p] + rho_full[1][p] + rho_full[2][p] ...). The reason is to avoid two different threads writing into the same part of an array. I am not sure whether this is an efficient solution (note that "nc" is relatively small, so the number of operations over "np" is probably still the dominant cost).
2) (also an important question) I need to read x[k] many times, and it is also modified many times. Maybe it is better to read this value into a register and then forget about the x array, or fix a pointer to it. After all the calculations I can access x[k] again and store the obtained value. I believe the compiler does this for me, but I am not sure, because I modify x[k] in the middle of the algorithm. So the compiler probably does something efficient on its own, but maybe in this version it accesses memory more often than necessary, because I switch between reading and storing this value more than once.
3) (probably not relevant) The code works with the integer part and the fractional part of x, and it needs both values. I obtain the integer part as p = (int) x and the fractional part as x - p. I compute this split at the beginning and again at the end of the loop body. This split could be stored and reused in the next step (I mean the next step of the i index). Do you think the following version is better? I store the integer and fractional parts in separate arrays instead of the whole value x.
int x_int[np];
float x_rem[np];
//...
for ( k = np - 1 ; k >= 0 ; --k ) { // canonical loop form needed by "omp for" (k signed); also visits k = 0
num = omp_get_thread_num();
u[k] += a[x_int[k]] + a_lin[x_int[k]] * x_rem[k];
x_rem[k] += u[k];
p = (int) x_rem[k]; // *** This part is added into code for simplify the rest.
x_int[k] += p; // *** And maybe there is a better way how to realize
x_rem[k] -= p; // *** this "pushing correction".
if (x_int[k]<0 ) {x_int[k]+=nc;} else
if (x_int[k]>nc) {x_int[k]-=nc;};
rho_full[num][x_int[k]] += weight[k];
rho_diff[num][x_int[k]] += weight[k] * x_rem[k];
}
};

You can use an OpenMP reduction for your for loop:
float result = 0;
// with nowait, result is only final after the next barrier
#pragma omp for nowait reduction(+:result) private(p)
for ( k = np - 1 ; k >= 0 ; --k ) {
p = (int) x[k];
u[k] += a[p] + a_lin[p] * (x[k] - p);
x[k] += u[k];
if (x[k]<0 ) {x[k]+=nc;} else
if (x[k]>nc) {x[k]-=nc;};
p = (int) x[k];
result += weight[k] + weight[k] * (x[k] - p);
}
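A scalar reduction like the one above folds every deposit into a single number, whereas the question needs one sum per cell. Since OpenMP 4.5, the reduction clause in C/C++ also accepts array sections, so each thread can work on its own zero-initialized copy of the density arrays and the runtime adds the copies together at the end. The following is only a minimal sketch under assumptions that are not in the original: the function name push_and_deposit is made up, weight holds one entry per particle, every displacement is smaller than nc, and x[k] is cached in a local variable (question 2).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: one push-and-deposit step of the particle loop.
// Assumes weight has one entry per particle and |u[k]| < nc after the update.
void push_and_deposit(std::vector<float>& x, std::vector<float>& u,
                      const std::vector<float>& a,
                      const std::vector<float>& a_lin,
                      const std::vector<float>& weight,
                      std::vector<float>& rho_full,
                      std::vector<float>& rho_diff)
{
    const int nc = static_cast<int>(a.size());
    const long long np = static_cast<long long>(x.size());
    assert(u.size() == x.size() && weight.size() == x.size());
    assert(a_lin.size() == a.size());
    assert(rho_full.size() == a.size() && rho_diff.size() == a.size());

    // Array-section reductions need a pointer (or array) plus a length.
    float* rf = rho_full.data();
    float* rd = rho_diff.data();

    // Each thread accumulates into private copies of rf[0..nc) and rd[0..nc);
    // OpenMP adds them into the shared arrays when the loop finishes.
    #pragma omp parallel for reduction(+ : rf[:nc], rd[:nc])
    for (long long k = 0; k < np; ++k) {
        float xk = x[k];                    // read once, keep in a local
        int p = static_cast<int>(xk);
        u[k] += a[p] + a_lin[p] * (xk - p);
        xk += u[k];
        if (xk < 0.0f) xk += nc;            // periodic wrap into [0, nc)
        else if (xk >= nc) xk -= nc;
        p = static_cast<int>(xk);
        if (p >= nc) { p = 0; xk = 0.0f; }  // guard against rounding up to nc
        x[k] = xk;                          // write back once
        rf[p] += weight[k];
        rd[p] += weight[k] * (xk - p);
    }
}
```

Compile with g++ -fopenmp (GCC 6 or later should accept the array sections); without the flag the pragma is ignored and the loop simply runs serially. The per-thread copies cost nc floats each, which should be negligible next to np as long as nc stays around 10000.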

Related

OpenMP parallel calculating for loop indices

My parallel programming class has the program below demonstrating how to use the parallel construct in OpenMP to calculate array bounds for each thread to be used in a for loop.
#pragma omp parallel
{
int id = omp_get_thread_num();
int p = omp_get_num_threads();
int start = (N * id) / p;
int end = (N * (id + 1)) / p;
if (id == p - 1) end = N;
for (i = start; i < end; i++)
{
A[i] = x * B[i];
}
}
My question is, is the if statement (id == p - 1) necessary? From my understanding, if id = p - 1, then end will already be N, thus the if statement is not necessary. I asked in my class's Q&A board, but wasn't able to get a proper answer that I understood. Assumptions are: N is the size of array, x is just an int, id is between 0 and p - 1.
You are right. Indeed, (N * ((p - 1) + 1)) / p is equivalent to (N * p) / p, assuming p is strictly positive (which is the case, since the number of OpenMP threads is guaranteed to be at least 1). (N * p) / p is equivalent to N, assuming there is no overflow. Such a condition is often useful when the integer division causes truncation, but this is not the case here (it would be the case with something like (N / p) * id).
Note that this code is not very safe for large N, because sizeof(int) is often 4 and the multiplication is likely to overflow (resulting in undefined behaviour). This is especially true on machines with many cores, such as supercomputer nodes. It is better to use the size_t type, which is usually an unsigned 64-bit type on 64-bit platforms and is meant to be able to represent the size of any object (for example, the size of an array).
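As a small illustration of both points, here is a sketch of the bounds computation with std::size_t and without the extra if. The helper name chunk_bounds is hypothetical, not something from the class material.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: half-open range [first, second) handled by thread `id`
// out of `parts` threads when n iterations are split as evenly as possible.
std::pair<std::size_t, std::size_t>
chunk_bounds(std::size_t n, std::size_t id, std::size_t parts)
{
    assert(parts > 0 && id < parts);
    const std::size_t start = (n * id) / parts;
    const std::size_t end   = (n * (id + 1)) / parts;  // equals n when id == parts - 1
    return {start, end};
}
```

For n = 10 and 3 threads this yields the ranges [0, 3), [3, 6) and [6, 10), so the last chunk ends at n on its own.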

What is the most efficient way to repeat elements in a vector and apply a set of different functions across all elements using Eigen?

Say I have a vector containing only positive, real elements defined like this:
Eigen::VectorXd v(1.3876, 8.6983, 5.438, 3.9865, 4.5673);
I want to generate a new vector v2 that has repeated the elements in v some k times. Then I want to apply k different functions to each of the repeated elements in the vector.
For example, if v2 was v repeated 2 times and I applied floor() and ceil() as my two functions, the result based on the above vector would be a column vector with values: [1; 2; 8; 9; 5; 6; 3; 4; 4; 5]. Preserving the order of the original values is important here as well. These values are also a simplified example, in practice, I'm generating vectors v with ~100,000 or more elements and would like to make my code as vectorizable as possible.
Since I'm coming to Eigen and C++ from Matlab, the simplest approach I first took was to just convert this Nx1 vector into an Nx2 matrix, apply floor to the first column and ceil to the second column, take the transpose to get a 2xN matrix and then exploit the column-major nature of the matrix and reshape the 2xN matrix into a 2Nx1 vector, yielding the result I want. However, for large vectors, this would be very slow and inefficient.
This response by ggael effectively addresses how I could repeat the elements in the input vector by generating a sequence of indices and indexing the input vector. I could then generate more sequences of indices to apply my functions to the relevant elements of v2 and copy the results back to their respective places. However, is this really the most efficient approach? I don't fully grasp copy-on-write and move semantics, but I think the second set of indexing expressions would be in a sense redundant?
If that is true, then my guess is that a solution here would be some sort of nullary or unary expression where I could define an expression that accepts the vector, some index k and k expressions/functions to apply to each element and spits out the vector I'm looking for. I've read the Eigen documentation on the subject, but I'm struggling to build a functional example. Any help would be appreciated!
So, if I understand you correctly, you don't want to replicate (in terms of Eigen methods) the vector, you want to apply different methods to the same elements and store the result for each, correct?
In this case, computing it sequentially once per function is the easiest route. Most CPUs can only do one (vector) memory store per clock cycle, anyway. So for simple unary or binary operations, your gains have an upper bound.
Still, you are correct that one load is technically always better than two and it is a limitation of Eigen that there is no good way of achieving this.
Know that even if you manually write a loop that would generate multiple outputs, you should limit yourself in the number of outputs. CPUs have a limited number of line-fill buffers. IIRC Intel recommended using less than 10 "output streams" in tight loops, otherwise you could stall the CPU on those.
Another aspect is that C++'s weak aliasing restrictions make it hard for compilers to vectorize code with multiple outputs. So it might even be detrimental.
How I would structure this code
Remember that Eigen is column-major, just like Matlab. Therefore use one column per output function. Or just use separate vectors to begin with.
Eigen::VectorXd v = ...;
Eigen::MatrixX2d out(v.size(), 2);
out.col(0) = v.array().floor();
out.col(1) = v.array().ceil();
Following the KISS principle, this is good enough. You will not gain much if anything by doing something more complicated. A bit of multithreading might gain you something (less than factor 2 I would guess) because a single CPU thread is not enough to max out memory bandwidth but that's about it.
Some benchmarking
This is my baseline:
int main()
{
int rows = 100013, repetitions = 100000;
Eigen::VectorXd v = Eigen::VectorXd::Random(rows);
Eigen::MatrixX2d out(rows, 2);
for(int i = 0; i < repetitions; ++i) {
out.col(0) = v.array().floor();
out.col(1) = v.array().ceil();
}
}
Compiled with gcc-11, -O3 -mavx2 -fno-math-errno I get ca. 5.7 seconds.
Inspecting the assembler code finds good vectorization.
Plain old C++ version:
double* outfloor = out.data();
double* outceil = outfloor + out.outerStride();
const double* inarr = v.data();
for(std::ptrdiff_t j = 0; j < rows; ++j) {
const double vj = inarr[j];
outfloor[j] = std::floor(vj);
outceil[j] = std::ceil(vj);
}
40 seconds instead of 5! This version actually does not vectorize because the compiler cannot prove that the arrays don't alias each other.
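If aliasing is indeed what blocks vectorization here, one thing to try is to promise the compiler that the three arrays do not overlap. The sketch below uses __restrict, which is a compiler extension (GCC, Clang and MSVC accept it) rather than standard C++; the function name floor_ceil is made up, and this variant was not part of the timings in this answer.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// __restrict tells the compiler that in, out_floor and out_ceil never overlap.
// Calling this with overlapping arrays would be undefined behaviour.
void floor_ceil(const double* __restrict in,
                double* __restrict out_floor,
                double* __restrict out_ceil,
                std::ptrdiff_t n)
{
    assert(n >= 0);
    for (std::ptrdiff_t j = 0; j < n; ++j) {
        const double vj = in[j];        // single load feeding both outputs
        out_floor[j] = std::floor(vj);
        out_ceil[j]  = std::ceil(vj);
    }
}
```

With the pointers from the loop above it would be called as floor_ceil(inarr, outfloor, outceil, rows). Whether the compiler then vectorizes the loop still depends on the flags (for example -fno-math-errno), so it is worth checking the generated assembly.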
Next, let's use fixed size Eigen vectors to get the compiler to generate vectorized code:
double* outfloor = out.data();
double* outceil = outfloor + out.outerStride();
const double* inarr = v.data();
std::ptrdiff_t j;
for(j = 0; j + 4 <= rows; j += 4) {
const Eigen::Vector4d vj = Eigen::Vector4d::Map(inarr + j);
const auto floorval = vj.array().floor();
const auto ceilval = vj.array().ceil();
Eigen::Vector4d::Map(outfloor + j) = floorval;
Eigen::Vector4d::Map(outceil + j) = ceilval;
}
if(j + 2 <= rows) {
const Eigen::Vector2d vj = Eigen::Vector2d::MapAligned(inarr + j);
const auto floorval = vj.array().floor();
const auto ceilval = vj.array().ceil();
Eigen::Vector2d::Map(outfloor + j) = floorval;
Eigen::Vector2d::Map(outceil + j) = ceilval;
j += 2;
}
if(j < rows) {
const double vj = inarr[j];
outfloor[j] = std::floor(vj);
outceil[j] = std::ceil(vj);
}
7.5 seconds. The assembler looks fine, fully vectorized. I'm not sure why performance is lower. Maybe cache line aliasing?
Last attempt: We don't try to avoid re-reading the vector but we re-read it blockwise so that it will be in cache by the time we read it a second time.
const int blocksize = 64 * 1024 / sizeof(double);
std::ptrdiff_t j;
for(j = 0; j + blocksize <= rows; j += blocksize) {
const auto& vj = v.segment(j, blocksize);
auto outj = out.middleRows(j, blocksize);
outj.col(0) = vj.array().floor();
outj.col(1) = vj.array().ceil();
}
const auto& vj = v.tail(rows - j);
auto outj = out.bottomRows(rows - j);
outj.col(0) = vj.array().floor();
outj.col(1) = vj.array().ceil();
5.4 seconds. So there is some gain here but not nearly enough to justify the added complexity.

Iterative Karatsuba algorithm parallelized and vectorized using OpenACC in C++

I'm trying to parallelize an iterative version of the Karatsuba algorithm using OpenACC in C++. I would like to ask how I can vectorize the inner for loop. My compiler shows me this message about that loop:
526, Complex loop carried dependence of result-> prevents parallelization
Loop carried dependence of result-> prevents parallelization
Loop carried backward dependence of result-> prevents vectorization
and here the code of two nested loops:
#pragma acc kernels num_gangs(1024) num_workers(32) copy (result[0:2*size-1]) copyin(A[0:size],$
{
#pragma acc loop gang
for (TYPE position = 1; position < 2 * (size - 1); position++) {
// for even coefficient add Di/2
if (position % 2 == 0)
result[position] += D[position / 2];
TYPE start = (position >= size) ? (position % size ) + 1 : 0;
TYPE end = (position + 1) / 2;
// inner loop: sum (Dst) - sum (Ds + Dt) where s+t=i
#pragma acc loop worker
for(TYPE inner = start; inner < end; inner++){
result[position] += (A[inner] + A[position - inner]) * (B[inner] + B[position - inn$
result[position] -= (D[inner] + D[position - inner]);
}
}
}
Actually, I'm not sure if it is possible to vectorize it. But if it is, I can't figure out what I'm doing wrong. Thank you
The "Complex loop carried dependence of result" problem is due to pointer aliasing. The compiler can't tell if the object that "result" points to overlaps with one of the other pointer's objects.
As a C++ extension, you can add the C99 "restrict" keyword to the declaration of your arrays. This will assert to the compiler that pointers don't alias.
Alternatively, you can add the OpenACC "independent" clause on your loop directives to tell the compiler that the loops do not have any dependencies.
Note that OpenACC does not support array reductions, so you won't be able to parallelize the "inner" loop unless you modify the code to accumulate into a scalar. Something like:
rtmp = result[position];
#pragma acc loop vector reduction(+:rtmp)
for(TYPE inner = start; inner < end; inner++){
rtmp += (A[inner] + A[position - inner]) * (B[inner] + B[position - inn$
rtmp -= (D[inner] + D[position - inner]);
}
result[position] = rtmp;

Implementing iterative autocorrelation process in C++ using for loops

I am implementing pitch tracking using an autocorrelation method in C++ but I am struggling to write the actual line of code which performs the autocorrelation.
I have an array containing a certain number ('values') of amplitude values of a pre-recorded signal, and I am performing the autocorrelation function on a set number (N) of these values.
In order to perform the autocorrelation I have taken the original array and reversed it so that point 0 = point N, point 1 = point N-1 etc, this array is called revarray
Here is what I want to do mathematically:
(array[0] * revarray[0])
(array[0] * revarray[1]) + (array[1] * revarray[0])
(array[0] * revarray[2]) + (array[1] * revarray[1]) + (array[2] * revarray[0])
(array[0] * revarray[3]) + (array[1] * revarray[2]) + (array[2] * revarray[1]) + (array[3] * revarray[0])
...and so on. This will be repeated for array[900]->array[1799] etc until autocorrelation has been performed on all of the samples in the array.
The number of times the autocorrelation is carried out is:
values / N = measurements
Here is the relevant section of my code so far
for (k = 0; k = measurements; ++k){
for (i = k*(N - 1), j = k*N; i >= 0; i--, j++){
revarray[j] = array[i];
for (a = k*N; a = k*(N - 1); ++a){
autocor[a]=0;
for (b = k*N; b = k*(N - 1); ++b){
autocor[a] += //**Here is where I'm confused**//
}
}
}
}
I know that I want to keep iteratively adding new values to autocor[a], but my problem is that the value that needs to be added to will keep changing. I've tried using an increasing count like so:
for (i = (k*N); i = k*(N-1); ++i){
autocor[i] += array[i] * revarray[i-1]
}
But I clearly know this won't work as when the new value is added to the previous autocor[i] this previous value will be incorrect, and when i=0 it will be impossible to calculate using revarray[i-1]
Any suggestions? Been struggling with this for a while now. I managed to get it working on just a single array (not taking N samples at a time) as seen here but I think using the inverted array is a much more efficient approach, I'm just struggling to implement the autocorrelation by taking sections of the entire signal.
It is not very clear to me, but I'll assume that you need to perform your iterations as many times as there are elements in that array (if it is indeed only half that much - adjust the code accordingly).
Also the N is assumed to mean the size of the array, so the index of the last element is N-1.
The loops would look like this:
for(size_t i = 0; i < N; ++i){
autocorr[i] = 0;
for(size_t j = 0; j <= i; ++j){
const size_t idxA = j
, idxR = i - j; // direct and reverse indices in the array
autocorr[i] += array[idxA] * array[idxR];
}
}
Basically you run the outer loop as many times as there are elements in your array and for each of those iterations you run a shorter loop up to the current last index of the outer array.
All that is left to be done now is to properly calculate the indices of the array and revarray to perform the calculations and accumulate a running sum in the current outer loop's index.
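To make that index calculation concrete, here is a minimal sketch that processes the signal in blocks of N samples, as described in the question. It assumes revarray is simply the current block reversed, so revarray[m] equals block[N - 1 - m] and no separate copy is needed; the function name block_autocorrelation is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: processes `signal` in consecutive blocks of N samples.
// For block b, out[b*N + i] = sum over j = 0..i of block[j] * rev[i - j],
// where rev is the block reversed (rev[m] = block[N - 1 - m]).
std::vector<double> block_autocorrelation(const std::vector<double>& signal,
                                          std::size_t N)
{
    assert(N > 0);
    const std::size_t measurements = signal.size() / N;  // incomplete tail is ignored
    std::vector<double> out(measurements * N, 0.0);
    for (std::size_t b = 0; b < measurements; ++b) {
        const double* block = signal.data() + b * N;
        for (std::size_t i = 0; i < N; ++i) {
            double sum = 0.0;
            for (std::size_t j = 0; j <= i; ++j) {
                // rev[i - j] == block[N - 1 - (i - j)]
                sum += block[j] * block[N - 1 - (i - j)];
            }
            out[b * N + i] = sum;
        }
    }
    return out;
}
```

For a block {1, 2, 3} the three sums are 1*3 = 3, 1*2 + 2*3 = 8 and 1*1 + 2*2 + 3*3 = 14, which matches the pattern written out in the question.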

Multithread C++ program to speed up a summatory loop

I have a loop that iterates from 1 to N and takes a modular sum over time. However N is very large and so I am wondering if there is a way to modify it by taking advantage of multithread.
To give sample program
for (long long i = 1; i < N; ++i)
total = (total + f(i)) % modulus;
f(i) in my case isn't an actual function, but a long expression that would take up room here. Putting it there to illustrate purpose.
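Yes, try this:
long long total = 0; // integer type: % is not defined for double
#pragma omp parallel for reduction(+:total)
for (long long i = 1; i < N; ++i)
total = (total + f(i)) % modulus;
total %= modulus; // the per-thread partial sums are combined with a plain +
Compile with:
g++ -fopenmp your_program.cpp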
It's that simple! No headers are required. The #pragma line automatically spins up a couple of threads, divides the iterations of the loop evenly, and then recombines everything after the loop. Note though, that you must know the number of iterations beforehand.
This code uses OpenMP, which provides easy-to-use parallelism that's quite suitable to your case. OpenMP is even built-in to the GCC and MSVC compilers.
This page shows some of the other reduction operations that are possible.
If you need nested for loops, you can just write
long long total = 0;
#pragma omp parallel for reduction(+:total)
for (long long i = 1; i < N; ++i)
for (long long j = 1; j < N; ++j)
total = (total + f(i)*j) % modulus;
total %= modulus;
And the outer loop will be parallelised, with each thread running its own copy of the inner loop.
But you could also use the collapse clause:
#pragma omp parallel for reduction(+:total) collapse(2)
and then the iterations of both loops will be automagically divvied up.
If each thread needs its own copy of a variable defined prior to the loop, use the private clause (the private copies start out uninitialized; use firstprivate if each copy should start with the original value):
long long total = 0;
double cheese = 4;
#pragma omp parallel for reduction(+:total) private(cheese)
for (long long i = 1; i < N; ++i)
total = (total + f(i)) % modulus;
total %= modulus;
Note that you don't need to use private(total) because this is implied by reduction.
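Because the reduction clause combines the per-thread results with a plain +, the modulo is not applied at that step. If that matters (for example when modulus is close to the limit of long long), the partial sums can be merged by hand. The following is only a sketch: term is a made-up stand-in for the question's f(i) and is assumed to be non-negative and free of side effects.

```cpp
#include <cassert>

// Hypothetical stand-in for the question's f(i).
long long term(long long i) { return i * i; }

// Sums term(i) for i in [1, N) modulo `modulus`.
long long modular_sum(long long N, long long modulus)
{
    assert(modulus > 0);
    long long total = 0;
    #pragma omp parallel
    {
        long long local = 0;   // per-thread partial sum, always below modulus
        #pragma omp for nowait
        for (long long i = 1; i < N; ++i)
            local = (local + term(i) % modulus) % modulus;

        #pragma omp critical   // one thread at a time merges its partial sum
        total = (total + local) % modulus;
    }
    return total;
}
```

For example, modular_sum(11, 1000) adds the squares of 1 to 10 and returns 385. Compiled without -fopenmp the pragmas are ignored and the function computes the same value serially.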
As presumably the f(i) are independent but take roughly the same time to run, you could create 4 threads yourself, have each one sum up 1/4 of the range and return its partial sum, and then join them. This isn't a very flexible method, especially if the running time of f(i) varies randomly.
You might also want to consider a thread pool, and make each thread calculate f(i) then get the next i to sum.
Try openMP's parallel for with the reduction clause for your total http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause
If f(long long int) is a function that solely relies on its input and no global state and the abelian properties of addition hold, you can gain a significant advantage like this:
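long long int total1 = 0, total2 = 0;
long long int i = 1; // same range as the original loop: 1 .. N-1
for(; i + 1 < N; i += 2)
{
total1 = (total1 + f(i)) % modulus;
total2 = (total2 + f(i + 1)) % modulus;
}
if(i < N) // odd number of terms: one is left over
total1 = (total1 + f(i)) % modulus;
total = (total1 + total2) % modulus;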
Breaking this out like that should help by allowing the compiler to improve code generation and the CPU to use more resources (the two operations can be handled in parallel) and pump more data out and reduce stalls. [I am assuming an x86 architecture here]
Of course, without knowing what f really looks like, it's hard to be sure if this is possible or if it will really help or make a measurable difference.
There may be other similar tricks that exploit special knowledge of your input and your platform. For example, SSE instructions could allow you to do even more. Platform-specific functionality might also be useful: a modulo operation may not be required at all, and your compiler may provide a special intrinsic function to perform addition modulo N.
I must ask, have you profiled your code and found this to be a hotspot?
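You could use Threading Building Blocks. A plain tbb::parallel_for cannot accumulate into a shared total (a by-value capture is read-only, and a by-reference capture would be a data race), so use tbb::parallel_reduce from <tbb/parallel_reduce.h> with a tbb::blocked_range:
long long total = tbb::parallel_reduce(
tbb::blocked_range<long long>(1, N), 0LL,
[=](const tbb::blocked_range<long long>& r, long long acc) {
for (long long i = r.begin(); i != r.end(); ++i)
acc = (acc + f(i)) % modulus;
return acc;
},
[=](long long a, long long b) { return (a + b) % modulus; });
Or, if the plain sum cannot overflow, without the modulo in every step: use acc += f(i); in the loop and return a + b; in the combine step, then apply the modulo once at the end:
total %= modulus;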