Multithreading alternative to mutex in parallel_for - c++

I'm fairly new to C++, so please pardon me if this is a stupid question, but I didn't find a good example of what I'm looking for on the internet.
Basically I'm using a parallel_for loop to find the maximum inside a 2D array (with a bunch of other operations in between). First of all, I don't even know if this is the best approach, but given the size of this 2D array, I thought splitting the calculations would be faster.
My code:
vector<vector<double>> InterpU(1801, vector<double>(3601, 0));
Concurrency::parallel_for(0, 1801, [&](int i) {
    long k = 0; long l = 0;
    pair<long, long> Normalized;
    double InterpPointsU[4][4];
    double jRes;
    double iRes = i * 0.1;
    double RelativeY, RelativeX;
    int p, q;
    while (iRes >= (k + 1) * DeltaTheta) k++;
    RelativeX = iRes / DeltaTheta - k;
    for (long j = 0; j < 3600; j++)
    {
        jRes = j * 0.1;
        while (jRes >= (l + 1) * DeltaPhi) l++;
        RelativeY = jRes / DeltaPhi - l;
        p = 0;
        for (long m = k - 1; m < k + 3; m++)
        {
            q = 0;
            for (long n = l - 1; n < l + 3; n++)
            {
                Normalized = Normalize(m, n, PointsTheta, PointsPhi);
                InterpPointsU[p][q] = U[Normalized.first][Normalized.second];
                q++;
            }
            p++;
        }
        InterpU[i][j] = bicubicInterpolate(InterpPointsU, RelativeX, RelativeY);
        if (InterpU[i][j] > MaxU)
        {
            SharedDataLock.lock();
            MaxU = InterpU[i][j];
            SharedDataLock.unlock();
        }
    }
    InterpU[i][3600] = InterpU[i][0];
});
You can see here that I'm using a mutex called SharedDataLock to protect multiple threads accessing the same resource. MaxU is a variable that should contain only the maximum of the InterpU vector.
The code works well, but since I'm having performance problems, I began to look into atomics and some other options.
Is there any good example on how to modify a similar code to make it faster?

As mentioned by VTT, you can simply find the local maximum of each thread and merge those afterwards with the use of combinable:
Concurrency::combinable<double> CombinableMaxU;
Concurrency::parallel_for(0, 1801, [&](int i) {
    ...
    CombinableMaxU.local() = std::max(CombinableMaxU.local(), InterpU[i][j]);
});
MaxU = std::max(MaxU, CombinableMaxU.combine(std::max<double>));
Note that your current code is actually wrong (unless MaxU is atomic): you read MaxU outside of the lock while it can be written simultaneously by other threads. Generally, you must not read a value that is being written to simultaneously unless both sides are protected by atomic semantics or by locks and memory fences. One reason is that a variable access may very well consist of multiple memory accesses, depending on how the type is supported by hardware.
But in your case, you even have a classic race condition:
MaxU == 1
Thread a | Thread b
InterpU[i][j] = 3 | InterpU[i][j] = 2
if (3 > MaxU) | if (2 > MaxU)
SharedDataLock.lock(); | SharedDataLock.lock();
(gets the lock) | (waiting for lock)
MaxU = 3 | ...
SharedDataLock.unlock(); | ...
... | (gets the lock)
| MaxU = 2
| SharedDataLock.unlock();
MaxU == 2
Locks are hard.
You can also use an atomic and compute the maximum on that. However, I would guess[1] that it still doesn't perform well inside the loop[2], and outside the loop it doesn't matter whether you use atomics or locks.
[1]: When in doubt, don't guess - measure!
[2]: Just because something is atomic and supported by hardware doesn't mean it is as efficient as accessing local data. First, atomic instructions are often much more costly than their non-atomic counterparts; second, you have to deal with very bad cache effects, because cores/caches will fight for ownership of the data. While atomics may be more elegant in many cases (not this one IMHO), reduction is faster most of the time.
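For comparison, here is a minimal sketch of what the atomic approach could look like, using a compare-and-swap loop (the update_max helper is hypothetical, not part of the original code):
#include <atomic>

std::atomic<double> MaxU{0.0};

// Hypothetical helper: raise max_val to candidate if candidate is larger.
void update_max(std::atomic<double>& max_val, double candidate)
{
    double current = max_val.load();
    // On failure, compare_exchange_weak reloads `current`, so the loop
    // re-checks the condition against the latest value.
    while (candidate > current &&
           !max_val.compare_exchange_weak(current, candidate))
    { }
}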

Related

OpenMP parallel calculating for loop indices

My parallel programming class has the program below, demonstrating how to use the parallel construct in OpenMP to calculate array bounds for each thread to be used in a for loop.
#pragma omp parallel
{
    int id = omp_get_thread_num();
    int p = omp_get_num_threads();
    int start = (N * id) / p;
    int end = (N * (id + 1)) / p;
    if (id == p - 1) end = N;
    for (i = start; i < end; i++)
    {
        A[i] = x * B[i];
    }
}
My question is, is the if statement (id == p - 1) necessary? From my understanding, if id = p - 1, then end will already be N, so the if statement is not necessary. I asked on my class's Q&A board, but wasn't able to get an answer that I understood. Assumptions are: N is the size of the array, x is just an int, and id is between 0 and p - 1.
You are right. Indeed, (N * ((p - 1) + 1)) / p is equivalent to (N * p) / p, assuming p is strictly positive (which is the case, since the number of OpenMP threads is guaranteed to be at least 1). (N * p) / p is equivalent to N assuming there is no overflow. Such a condition is often useful when the integer division causes some truncation, but this is not the case here (it would be the case with something like (N / p) * id).
Note that this code is not very safe for large N because sizeof(int) is often 4 and the multiplication is likely to cause overflows (resulting in undefined behaviour). This is especially true on machines with many cores, like supercomputer nodes. It is better to use the size_t type, which is usually an unsigned 64-bit type meant to be able to represent the size of any object (for example the size of an array).
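For illustration, a minimal sketch of the same bounds computation using size_t (assuming N is declared with a suitably wide type; everything else is as in the original snippet):
#pragma omp parallel
{
    size_t id = (size_t) omp_get_thread_num();
    size_t p = (size_t) omp_get_num_threads();
    size_t start = ((size_t) N * id) / p;
    size_t end = ((size_t) N * (id + 1)) / p; // already N when id == p - 1
    for (size_t i = start; i < end; i++)
    {
        A[i] = x * B[i];
    }
}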

What is the most efficient way to repeat elements in a vector and apply a set of different functions across all elements using Eigen?

Say I have a vector containing only positive, real elements defined like this:
Eigen::VectorXd v(5);
v << 1.3876, 8.6983, 5.438, 3.9865, 4.5673;
I want to generate a new vector v2 that has repeated the elements in v some k times. Then I want to apply k different functions to each of the repeated elements in the vector.
For example, if v2 was v repeated 2 times and I applied floor() and ceil() as my two functions, the result based on the above vector would be a column vector with values: [1; 2; 8; 9; 5; 6; 3; 4; 4; 5]. Preserving the order of the original values is important here as well. These values are also a simplified example, in practice, I'm generating vectors v with ~100,000 or more elements and would like to make my code as vectorizable as possible.
Since I'm coming to Eigen and C++ from Matlab, the simplest approach I first took was to just convert this Nx1 vector into an Nx2 matrix, apply floor to the first column and ceil to the second column, take the transpose to get a 2xN matrix and then exploit the column-major nature of the matrix and reshape the 2xN matrix into a 2Nx1 vector, yielding the result I want. However, for large vectors, this would be very slow and inefficient.
This response by ggael effectively addresses how I could repeat the elements in the input vector by generating a sequence of indices and indexing the input vector. I could then generate more sequences of indices to apply my functions to the relevant elements of v2 and copy the results back to their respective places. However, is this really the most efficient approach? I don't fully grasp copy-on-write and move semantics, but I think the second set of indexing expressions would be, in a sense, redundant?
If that is true, then my guess is that a solution here would be some sort of nullary or unary expression where I could define an expression that accepts the vector, some index k and k expressions/functions to apply to each element and spits out the vector I'm looking for. I've read the Eigen documentation on the subject, but I'm struggling to build a functional example. Any help would be appreciated!
So, if I understand you correctly, you don't want to replicate (in terms of Eigen methods) the vector, you want to apply different methods to the same elements and store the result for each, correct?
In this case, computing it sequentially once per function is the easiest route. Most CPUs can only do one (vector) memory store per clock cycle, anyway. So for simple unary or binary operations, your gains have an upper bound.
Still, you are correct that one load is technically always better than two and it is a limitation of Eigen that there is no good way of achieving this.
Know that even if you manually write a loop that would generate multiple outputs, you should limit yourself in the number of outputs. CPUs have a limited number of line-fill buffers. IIRC Intel recommended using less than 10 "output streams" in tight loops, otherwise you could stall the CPU on those.
Another aspect is that C++'s weak aliasing restrictions make it hard for compilers to vectorize code with multiple outputs. So it might even be detrimental.
How I would structure this code
Remember that Eigen is column-major, just like Matlab. Therefore use one column per output function. Or just use separate vectors to begin with.
Eigen::VectorXd v = ...;
Eigen::MatrixX2d out(v.size(), 2);
out.col(0) = v.array().floor();
out.col(1) = v.array().ceil();
Following the KISS principle, this is good enough. You will not gain much if anything by doing something more complicated. A bit of multithreading might gain you something (less than factor 2 I would guess) because a single CPU thread is not enough to max out memory bandwidth but that's about it.
Some benchmarking
This is my baseline:
int main()
{
    int rows = 100013, repetitions = 100000;
    Eigen::VectorXd v = Eigen::VectorXd::Random(rows);
    Eigen::MatrixX2d out(rows, 2);
    for(int i = 0; i < repetitions; ++i) {
        out.col(0) = v.array().floor();
        out.col(1) = v.array().ceil();
    }
}
Compiled with gcc-11, -O3 -mavx2 -fno-math-errno I get ca. 5.7 seconds.
Inspecting the assembler code finds good vectorization.
Plain old C++ version:
double* outfloor = out.data();
double* outceil = outfloor + out.outerStride();
const double* inarr = v.data();
for(std::ptrdiff_t j = 0; j < rows; ++j) {
    const double vj = inarr[j];
    outfloor[j] = std::floor(vj);
    outceil[j] = std::ceil(vj);
}
40 seconds instead of 5! This version actually does not vectorize because the compiler cannot prove that the arrays don't alias each other.
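As an aside, a hedged sketch of the same loop with the non-standard but widely supported __restrict qualifier, which promises the compiler that the arrays don't overlap and may therefore allow vectorization (not benchmarked here):
double* __restrict outfloor = out.data();
double* __restrict outceil = outfloor + out.outerStride();
const double* __restrict inarr = v.data();
for(std::ptrdiff_t j = 0; j < rows; ++j) {
    outfloor[j] = std::floor(inarr[j]);
    outceil[j] = std::ceil(inarr[j]);
}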
Next, let's use fixed size Eigen vectors to get the compiler to generate vectorized code:
double* outfloor = out.data();
double* outceil = outfloor + out.outerStride();
const double* inarr = v.data();
std::ptrdiff_t j;
for(j = 0; j + 4 <= rows; j += 4) {
    const Eigen::Vector4d vj = Eigen::Vector4d::Map(inarr + j);
    const auto floorval = vj.array().floor();
    const auto ceilval = vj.array().ceil();
    Eigen::Vector4d::Map(outfloor + j) = floorval;
    Eigen::Vector4d::Map(outceil + j) = ceilval;
}
if(j + 2 <= rows) {
    const Eigen::Vector2d vj = Eigen::Vector2d::MapAligned(inarr + j);
    const auto floorval = vj.array().floor();
    const auto ceilval = vj.array().ceil();
    Eigen::Vector2d::Map(outfloor + j) = floorval;
    Eigen::Vector2d::Map(outceil + j) = ceilval;
    j += 2;
}
if(j < rows) {
    const double vj = inarr[j];
    outfloor[j] = std::floor(vj);
    outceil[j] = std::ceil(vj);
}
7.5 seconds. The assembler looks fine, fully vectorized. I'm not sure why performance is lower. Maybe cache line aliasing?
Last attempt: We don't try to avoid re-reading the vector but we re-read it blockwise so that it will be in cache by the time we read it a second time.
const int blocksize = 64 * 1024 / sizeof(double);
std::ptrdiff_t j;
for(j = 0; j + blocksize <= rows; j += blocksize) {
    const auto& vj = v.segment(j, blocksize);
    auto outj = out.middleRows(j, blocksize);
    outj.col(0) = vj.array().floor();
    outj.col(1) = vj.array().ceil();
}
const auto& vj = v.tail(rows - j);
auto outj = out.bottomRows(rows - j);
outj.col(0) = vj.array().floor();
outj.col(1) = vj.array().ceil();
5.4 seconds. So there is some gain here but not nearly enough to justify the added complexity.

Example of C++ code optimization for parallel computing

I'm trying to understand optimization routines. I'm focusing on the most critical part of my code (the code has some cycles of length "nc" and one cycle of length "np", where "np" is much larger than "nc"). I present that part of the code here. The rest of the code is not very significant as a percentage of computational time, so I prefer to keep it readable. However, the critical cycle of length "np" is a pretty simple piece of code and it can be parallelized, so it will not hurt to rewrite this part into some more effective and less clear version (maybe with SSE instructions). I'm using the gcc compiler, C++ code, and OpenMP parallelization.
This code is part of the well known particle-in-cell algorithm (and a basic variant of it at that). I'm trying to learn code optimization on this version (so my goal is not just to have an effective PIC algorithm, which has already been written in a thousand variants, but to build a demonstrative example for code optimization). I have done some work already, but I am not very sure that I have handled all the optimization concerns correctly.
const int NT = ...;      // number of threads (in two versions: about 6 or about 30)
const int np = 10000000; // np is about 1000-10000 times larger than nc commonly
const int nc = 10000;
const int step = 1000;
float u[np], x[np];
float a[nc], a_lin[nc], rho_full[NT][nc], rho_diff[NT][nc], weight[nc];
int p, num;
for ( i = 0 ; i < step ; i++) {
    // ***
    // *** some not very time consuming code for calculation of
    // *** a, a_lin from values of rho_full and rho_diff
    #pragma omp for private(p,num)
    for ( k = np ; --k ; ) {
        num = omp_get_thread_num();
        p = (int) x[k];
        u[k] += a[p] + a_lin[p] * (x[k] - p);
        x[k] += u[k];
        if (x[k] < 0) { x[k] += nc; } else
        if (x[k] > nc) { x[k] -= nc; }
        p = (int) x[k];
        rho_full[num][p] += weight[k];
        rho_diff[num][p] += weight[k] * (x[k] - p);
    }
}
I realize this has problems:
1) (main question) I use a set of arrays rho_full[num][p], where num is an index for each thread. After the computation I just sum these arrays (rho_full[0][p] + rho_full[1][p] + rho_full[2][p] + ...). The reason is to avoid two different threads writing into the same part of the array. I am not very sure whether this is an effective solution (note that the number "nc" is relatively small, so the operations over "np" probably still dominate).
2) (also an important question) I need to read x[k] many times, and it is also changed many times. Maybe it's better to read this value into a register, forget the whole x array during the computation, and store the obtained value back into x[k] only after all calculations. I believe the compiler does this work for me, but I am not very sure, because I modify x[k] in the middle of the algorithm. So the compiler probably does some effective work on its own, but maybe in this version it loads and stores the value more times than necessary, because I switch between reading and writing it more than once.
3) (probably not relevant) The code works with the integer part of a value and the remainder below the decimal point, and it needs both. I extract the integer part as p = (int) x and the remainder as x - p. I compute this at the beginning and also at the end of the cycle body. One can see that this splitting could be stored somewhere and reused at the next step (I mean the step in the i index). Do you think the following version is better? I store the integer and remainder parts in arrays instead of the whole value x.
int x_int[np];
float x_rem[np];
//...
for ( k = np ; --k ; ) {
    num = omp_get_thread_num();
    u[k] += a[x_int[k]] + a_lin[x_int[k]] * x_rem[k];
    x_rem[k] += u[k];
    p = (int) x_rem[k]; // *** This part is added into the code to simplify the rest.
    x_int[k] += p;      // *** And maybe there is a better way to realize
    x_rem[k] -= p;      // *** this "pushing correction".
    if (x_int[k] < 0) { x_int[k] += nc; } else
    if (x_int[k] > nc) { x_int[k] -= nc; }
    rho_full[num][x_int[k]] += weight[k];
    rho_diff[num][x_int[k]] += weight[k] * x_rem[k];
}
You can use OMP reduction for your for loop:
float result = 0;
#pragma omp for nowait reduction(+:result) private(p)
for ( k = np ; --k ; ) {
    p = (int) x[k];
    u[k] += a[p] + a_lin[p] * (x[k] - p);
    x[k] += u[k];
    if (x[k] < 0) { x[k] += nc; } else
    if (x[k] > nc) { x[k] -= nc; }
    p = (int) x[k];
    result += weight[k] + weight[k] * (x[k] - p);
}
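If you want to keep the per-cell histograms rather than a scalar, note that OpenMP 4.5+ also supports reductions over array sections, which removes the need for the manual per-thread rho arrays. A sketch under the assumption that rho_full and rho_diff are collapsed to plain float arrays of length nc:
float rho_full[nc] = {0}, rho_diff[nc] = {0};

#pragma omp parallel for reduction(+:rho_full[:nc], rho_diff[:nc])
for (long k = 1; k < np; ++k) {   // canonical loop form required by OpenMP
    int p = (int) x[k];
    u[k] += a[p] + a_lin[p] * (x[k] - p);
    x[k] += u[k];
    if (x[k] < 0) { x[k] += nc; } else
    if (x[k] > nc) { x[k] -= nc; }
    p = (int) x[k];
    rho_full[p] += weight[k];  // each thread accumulates into a private copy
    rho_diff[p] += weight[k] * (x[k] - p);
}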

Multithread C++ program to speed up a summatory loop

I have a loop that iterates from 1 to N and accumulates a modular sum. However, N is very large, so I am wondering if there is a way to speed it up by taking advantage of multithreading.
To give sample program
for (long long i = 1; i < N; ++i)
    total = (total + f(i)) % modulus;
f(i) in my case isn't an actual function, but a long expression that would take up room here. Putting it there to illustrate purpose.
Yes, try this:
long long total = 0;
#pragma omp parallel for reduction(+:total)
for (long long i = 1; i < N; ++i)
    total = (total + f(i)) % modulus;
Compile with:
g++ -fopenmp your_program.c
It's that simple! No headers are required. The #pragma line automatically spins up a number of threads, divides the iterations of the loop evenly among them, and then recombines everything after the loop. Note, though, that you must know the number of iterations beforehand, and since the per-thread partial sums are combined with a plain +, you should apply % modulus to total once more after the loop.
This code uses OpenMP, which provides easy-to-use parallelism that's quite suitable to your case. OpenMP is even built into the GCC and MSVC compilers.
This page shows some of the other reduction operations that are possible.
If you need nested for loops, you can just write
long long total = 0;
#pragma omp parallel for reduction(+:total)
for (long long i = 1; i < N; ++i)
    for (long long j = 1; j < N; ++j)
        total = (total + f(i)*j) % modulus;
And the outer loop will be parallelised, with each thread running its own copy of the inner loop.
But you could also use the collapse directive:
#pragma omp parallel for reduction(+:total) collapse(2)
and then the iterations of both loops will be automagically divvied up.
If each thread needs its own copy of a variable defined prior to the loop, use the private command:
long long total = 0;
double cheese = 4;
#pragma omp parallel for reduction(+:total) private(cheese)
for (long long i = 1; i < N; ++i)
    total = (total + f(i)) % modulus;
Note that you don't need to use private(total) because this is implied by reduction.
As the f(i) are presumably independent but take roughly the same time to run, you could create four threads, get each to sum up a quarter of the range, return the partial sum from each, and join them (a sketch follows below). This isn't a very flexible method, especially if the running times of the individual f(i) can be random.
You might also want to consider a thread pool, and make each thread calculate f(i) then get the next i to sum.
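A minimal sketch of that fixed four-way split with std::thread (it assumes f(i) is thread-safe and visible at this scope, as in the question):
#include <thread>
#include <vector>

long long parallel_modular_sum(long long N, long long modulus)
{
    const int T = 4;                        // fixed number of threads
    std::vector<long long> partial(T, 0);   // one accumulator per thread
    std::vector<std::thread> threads;
    for (int t = 0; t < T; ++t)
        threads.emplace_back([&partial, t, N, modulus] {
            // strided split: thread t handles i = 1+t, 1+t+T, ...
            for (long long i = 1 + t; i < N; i += T)
                partial[t] = (partial[t] + f(i)) % modulus;
        });
    for (auto& th : threads)
        th.join();
    long long total = 0;
    for (long long p : partial)
        total = (total + p) % modulus;
    return total;
}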
Try openMP's parallel for with the reduction clause for your total http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause
If f(long long int) is a function that solely relies on its input and no global state and the abelian properties of addition hold, you can gain a significant advantage like this:
for(long long int i = 0, j = 1; i < N; i += 2, j += 2)
{
    total1 = (total1 + f(i)) % modulus;
    total2 = (total2 + f(j)) % modulus;
}
total = (total1 + total2) % modulus;
Breaking this out like that should help by allowing the compiler to improve code generation and the CPU to use more resources (the two operations can be handled in parallel) and pump more data out and reduce stalls. [I am assuming an x86 architecture here]
Of course, without knowing what f really looks like, it's hard to be sure if this is possible or if it will really help or make a measurable difference.
There may be other similar tricks that you can exploit special knowledge of your input and your platform - for example, SSE instructions could allow you to do even more. Platform-specific functionality might also be useful. For example, a modulo operation may not be required at all and your compiler may provide a special intrinsic function to perform addition modulo N.
I must ask, have you profiled your code and found this to be a hotspot?
You could use Threading Building Blocks. Note that a plain tbb::parallel_for writing to a shared total would be a data race (and a by-value capture of total would not even compile), so the idiomatic tool for a sum is tbb::parallel_reduce:
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

long long total = tbb::parallel_reduce(
    tbb::blocked_range<long long>(1, N), 0LL,
    [&](const tbb::blocked_range<long long>& r, long long partial) {
        for (long long i = r.begin(); i != r.end(); ++i)
            partial = (partial + f(i)) % modulus;
        return partial;
    },
    [](long long a, long long b) { return (a + b) % modulus; });
Or, without overflow checks, drop the % modulus inside the loop and finish with:
total %= modulus;

Optimizing this code block

for (int i = 0; i < 5000; i++)
    for (int j = 0; j < 5000; j++)
    {
        for (int ii = 0; ii < 20; ii++)
            for (int jj = 0; jj < 20; jj++)
            {
                int num = matBigger[i+ii][j+jj];
                // Extract range from this.
                int low = num & 0xff;
                int high = num >> 8;
                if (low < matSmaller[ii][jj] && matSmaller[ii][jj] > high)
                {
                    // match found
                }
            }
    }
The machine is x86_64, 32 KB L1 cache, 256 KB L2 cache.
Any pointers on how can I possibly optimize this code?
EDIT Some background to the original problem : Fastest way to Find a m x n submatrix in M X N matrix
First thing I'd try is to move the ii and jj loops outside the i and j loops. That way you're using the same elements of matSmaller for 25 million iterations of the i and j loops, meaning that you (or the compiler if you're lucky) can hoist the access to them outside those loops:
for (int ii = 0; ii < 20; ii++)
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
            for (int j = 0; j < 5000; j++) {
                int num = matBigger[i+ii][j+jj];
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            }
    }
This might be faster (thanks to less access to the matSmaller array), or it might be slower (because I've changed the pattern of access to the matBigger array, and it's possible that I've made it less cache-friendly). A similar alternative would be to move the ii loop outside i and j and hoist matSmaller[ii], but leave the jj loop inside. The rule of thumb is that it's more cache-friendly to increment the last index of a multi-dimensional array in your inner loops, than earlier indexes. So we're "happier" to modify jj and j than we are to modify ii and i.
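A rough sketch of that alternative, hoisting the matSmaller[ii] row and keeping jj innermost (untested, and assuming plain int arrays as in the other snippets):
for (int ii = 0; ii < 20; ii++)
{
    const int *smallRow = matSmaller[ii];  // hoisted row
    for (int i = 0; i < 5000; i++)
        for (int j = 0; j < 5000; j++)
            for (int jj = 0; jj < 20; jj++)
            {
                int num = matBigger[i+ii][j+jj];
                int low = num & 0xff;
                if (low < smallRow[jj] && smallRow[jj] > (num >> 8)) {
                    // match found
                }
            }
}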
Second thing I'd try - what's the type of matBigger? Looks like the values in it are only 16 bits, so try it both as int and as (u)int16_t. The former might be faster because aligned int access is fast. The latter might be faster because more of the array fits in cache at any one time.
There are some higher-level things you could consider with some early analysis of smaller: for example, if it's 0 then you needn't examine matBigger for that value of ii and jj, because (num & 0xff) < 0 is always false.
To do better than "guess things and see whether they're faster or not" you need to know for starters which line is hottest, which means you need a profiler.
Some basic advice:
Profile it, so you can learn where the hot-spots are.
Think about cache locality, and the addresses resulting from your loop order.
Use more const in the innermost scope, to hint more to the compiler.
Try breaking it up so you don't compute high if the low test is failing.
Try maintaining the offsets into matBigger and matSmaller explicitly, reducing the innermost stepping to a simple increment.
The best thing is to understand what the code is supposed to do, then check whether another algorithm exists for this problem.
Apart from that:
if you are just interested in whether a matching entry exists, make sure to break out of all of the nested loops at the position of // match found.
make sure the data is stored in an optimal way. It all depends on your problem, but e.g. it could be more efficient to have just one array of size 5000*5000*20 and overload operator()(int,int,int) for accessing elements; a sketch follows.
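A minimal sketch of that flat-storage idea (the Mat3D name and exact dimensions are hypothetical):
#include <cstddef>
#include <vector>

struct Mat3D {
    std::vector<int> data;
    int d2, d3;
    Mat3D(int d1, int d2, int d3)
        : data((std::size_t)d1 * d2 * d3), d2(d2), d3(d3) {}
    int& operator()(int i, int j, int k) {
        return data[((std::size_t)i * d2 + j) * d3 + k];  // row-major, contiguous
    }
};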
What are matSmaller and matBigger?
Try changing them to matBigger[(i+ii) * COL_COUNT + (j+jj)] (note the parentheses: the row index is i+ii and the column index is j+jj).
I agree with Steve about rearranging your loops to have the higher count as the inner loop. Since your code is only doing loads and compares, I believe a significant portion of the time is used for pointer arithmetic. Try an experiment to change Steve's answer into this:
for (int ii = 0; ii < 20; ii++)
{
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
        {
            int *pI = &matBigger[i+ii][jj];
            for (int j = 0; j < 5000; j++)
            {
                int num = *pI++;
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            } // for j
        } // for i
    } // for jj
} // for ii
Even in 64-bit mode, the C compiler doesn't necessarily do a great job of keeping everything in register. By changing the array access to be a simple pointer increment, you'll make the compiler's job easier to produce efficient code.
Edit: I just noticed #unwind suggested basically the same thing. Another issue to consider is the statistics of your comparison. Is the low or high comparison more probable? Arrange the conditional statement so that the less probable test is first.
Looks like there is a lot of repetition here. One optimization is to reduce the amount of duplicate effort. Using pen and paper, I'm showing the matBigger "i" index iterating as:
[0 + 0], [0 + 1], [0 + 2], ..., [0 + 19],
[1 + 0], [1 + 1], ..., [1 + 18], [1 + 19]
[2 + 0], ..., [2 + 17], [2 + 18], [2 + 19]
As you can see there are locations that are accessed many times.
Also, multiplying the iteration counts indicates how many times the inner content is accessed: 20 * 20 * 5000 * 5000 = 10,000,000,000 (10^10) times. That's a lot!
So rather than trying to speed up the execution of 10^10 instructions (such as execution (pipeline) cache or data cache optimization), try reducing the number of iterations.
The code is searching the matrix for a number that is within a range: larger than the minimal value and less than the maximum range value.
Based on this, try a different approach:
Find and remember all coordinates where the search value is greater than the low value. Let us call these anchor points.
For each anchor point, find the coordinates of the first value after the anchor point that is outside the range.
The objective is to reduce the number of duplicate accesses. Anchor points allow for a one pass scan and allow other decisions such as finding a range or determining an MxN matrix that contains the anchor value.
Another idea is to create new data structures containing the matBigger and matSmaller that are more optimized for searching.
For example, create a {value, coordinate list} entry for each unique value in matSmaller:
Value coordinate list
26 -> (2,3), (6,5), ..., (1007, 75)
31 -> (4,7), (2634, 5), ...
Now you can use this data structure to find values in matSmaller and immediately know their locations. So you could search matBigger for each unique value in this data structure. This again reduces the number of access to the matrices.
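A hedged sketch of that {value -> coordinate list} structure, assuming matSmaller is a 20x20 int array as in the question:
#include <unordered_map>
#include <utility>
#include <vector>

std::unordered_map<int, std::vector<std::pair<int, int>>> valueIndex;
for (int ii = 0; ii < 20; ii++)
    for (int jj = 0; jj < 20; jj++)
        valueIndex[matSmaller[ii][jj]].push_back({ii, jj});
// valueIndex[v] now lists every coordinate in matSmaller that holds value v.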