OpenMP parallel calculating for loop indices - c++

My parallel programming class has the program below demonstrating how to use the parallel construct in OpenMP to calculate array bounds for each thread, to be used in a for loop.
#pragma omp parallel
{
    int id = omp_get_thread_num();
    int p = omp_get_num_threads();
    int start = (N * id) / p;
    int end = (N * (id + 1)) / p;
    if (id == p - 1) end = N;
    for (int i = start; i < end; i++)
    {
        A[i] = x * B[i];
    }
}
My question is: is the if statement (id == p - 1) necessary? From my understanding, if id == p - 1, then end will already be N, so the if statement is not necessary. I asked on my class's Q&A board, but wasn't able to get an answer that I understood. The assumptions are: N is the size of the array, x is just an int, and id is between 0 and p - 1.

You are right. Indeed, (N * ((p - 1) + 1)) / p is equivalent to
(N * p) / p assuming p is strictly positive (which is the case, since the number of OpenMP threads is guaranteed to be at least 1). (N * p) / p is equivalent to N assuming there is no overflow. Such a condition is often useful when the integer division causes some truncation, but that is not the case here (it would be the case with something like (N / p) * id: for example, with N = 10 and p = 4, the last thread would get end = (N / p) * p = 2 * 4 = 8 instead of 10).
Note that this code is not very safe for large N because sizeof(int) is often 4 and the multiplication N * id is likely to overflow (resulting in undefined behaviour). This is especially true on machines with many cores, like supercomputer nodes. It is better to use the size_t type, which is usually an unsigned 64-bit type meant to be able to represent the size of any object (for example, the size of an array).
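For illustration, a sketch of the same bounds computation using size_t (this follows the advice above; it is not from the original course material, and it assumes N is declared as a std::size_t):
#include <cstddef> // std::size_t

#pragma omp parallel
{
    std::size_t id = (std::size_t) omp_get_thread_num();
    std::size_t p  = (std::size_t) omp_get_num_threads();
    std::size_t start = (N * id) / p;       // N * p cannot overflow a 64-bit size_t in practice
    std::size_t end   = (N * (id + 1)) / p; // no fix-up needed for the last thread, as explained
    for (std::size_t i = start; i < end; i++)
    {
        A[i] = x * B[i];
    }
}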

Related

Efficiently randomly shuffling the bits of a sequence of words

Consider the following algorithm from the C++ standard library: std::shuffle, which has the following signature:
template <class RandomIt, class URBG>
void shuffle(RandomIt first, RandomIt last, URBG&& g);
It reorders the elements in the given range [first, last) such that each possible permutation of those elements has equal probability of appearance.
I am trying to implement the same algorithm, but one that works at the bit level, randomly shuffling the bits of the words of the input sequence. Considering a sequence of 64-bit words, I am trying to implement:
template <class URBG>
void bit_shuffle(std::uint64_t* first, std::uint64_t* last, URBG&& g);
Question: How to do that as efficiently as possible (using compiler intrinsics if necessary)? I am not necessarily looking for an entire implementation, but more for suggestions/directions of research, because it's really not clear to me if it's even feasible to implement that efficiently.
It's obvious that asymptotically the speed is O(N), where N is the number of bits. Our goal is to improve the constants involved in it.
Disclaimer: the description of the proposed algorithm is a rough sketch. There is a lot of stuff that needs to be added and, especially, a lot of details that need to be taken care of in order to make it work correctly. The approximate execution time will not differ from what is claimed here, though.
Baseline Algorithm
The most obvious one is the textbook approach, which takes N operations, each of which involves calling the random generator (which takes R milliseconds), reading the values of two different bits, and writing new values to them (4 * A milliseconds in total, where A is the time to read or write one bit). Suppose that an array lookup operation takes C milliseconds. So the total time of this algorithm is N * (R + 4 * A + 2 * C) milliseconds (approximately). It is also reasonable to assume that the random number generation takes more time, i.e. R >> A == C.
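For concreteness, here is a minimal sketch of this baseline: a bit-level Fisher-Yates over the words. The helper lambdas and the function name are just for illustration; g is any URBG:
#include <cstddef>
#include <cstdint>
#include <random>

template <class URBG>
void bit_shuffle_baseline(std::uint64_t* first, std::uint64_t* last, URBG&& g)
{
    const std::size_t nbits = 64 * std::size_t(last - first);
    if (nbits == 0) return;
    // Read / write the bit at global position i.
    auto get = [&](std::size_t i) -> std::uint64_t {
        return (first[i / 64] >> (i % 64)) & 1u;
    };
    auto set = [&](std::size_t i, std::uint64_t b) {
        first[i / 64] = (first[i / 64] & ~(std::uint64_t(1) << (i % 64)))
                      | (b << (i % 64));
    };
    // Classic Fisher-Yates, one bit at a time:
    // N random numbers, plus 2 bit reads and 2 bit writes per step.
    for (std::size_t i = nbits - 1; i > 0; --i) {
        std::size_t j = std::uniform_int_distribution<std::size_t>(0, i)(g);
        const std::uint64_t bi = get(i), bj = get(j);
        set(i, bj);
        set(j, bi);
    }
}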
Proposed Algorithm
Suppose the bits are stored in a byte storage, i.e. we will work with blocks of bytes.
constexpr std::size_t field_size = N / 8;
unsigned char bit_field[field_size];
First, let's count the number of 1 bits in our bitset. For that, we can use a lookup table and iterate through the bitset as a byte array:
// Generate the lookup table; you may rework it with `constexpr`
// to have it computed at compile time.
int bitcount_lookup[256];
for (int i = 0; i < 256; ++i) {
    bitcount_lookup[i] = 0;
    for (int b = 0; b < 8; ++b)
        bitcount_lookup[i] += (i >> b) & 1;
}
We can treat this as preprocessing overhead (as it may as well be calculated at compile time) and say that it takes 0 milliseconds. Now, counting the number of 1 bits is easy (the following will take (N / 8) * C milliseconds):
int bitcount = 0;
for (auto *it = bit_field; it != bit_field + field_size; ++it)
    bitcount += bitcount_lookup[*it];
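As an aside, on GCC/Clang the same count can be done a word at a time with a builtin popcount, which is usually faster than a per-byte lookup (a sketch; handling of the tail bytes when field_size is not a multiple of 8 is omitted):
#include <cstddef>
#include <cstdint>
#include <cstring>

int bitcount = 0;
for (std::size_t i = 0; i + 8 <= field_size; i += 8) {
    std::uint64_t w;
    std::memcpy(&w, bit_field + i, 8);   // safe unaligned load
    bitcount += __builtin_popcountll(w); // GCC/Clang builtin
}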
Now, we randomly generate N / 8 numbers (let's call the resulting array gencnt[N / 8]), each in the range [0..8], such that they sum up to bitcount. This is a bit tricky and kind of hard to do uniformly (the "correct" algorithm to generate a uniform distribution is quite slow compared to the baseline algo). A quite uniform-ish but quick solution is roughly the following (sketched in code after the list):
1) Fill the gencnt[N / 8] array with the value v = bitcount / (N / 8).
2) Randomly choose N / 16 "black" cells. The rest are "white". The algorithm is similar to a random permutation, but only over half of the array.
3) Generate N / 16 random numbers in the range [0..v]. Let's call them tmp[N / 16].
4) Increase the "black" cells by the tmp[i] values, and decrease the "white" cells by tmp[i]. This ensures that the overall sum is bitcount.
After that, we will have a uniform-ish, random-ish array gencnt[N / 8], whose values are the numbers of 1 bits in each particular byte "cell". It was all generated in:
(N / 8) * C + (N / 16) * (4 * C) + (N / 16) * (R + 2 * C)
^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^
filling step random coloring filling
milliseconds (this estimation is done with a concrete implementation in mind). Lastly, we can have a lookup table of the bytes with a specified number of bits set to 1 (this can be precomputed overhead, or even done at compile time as constexpr, so let's assume it takes 0 milliseconds):
std::vector<std::vector<unsigned char>> random_lookup(9);
for (int c = 0; c <= 8; c++)
    random_lookup[c] = { /* the bytes with `c` bits set to `1` */ };
(note that a byte can have anywhere from 0 to 8 bits set, hence the 9 buckets)
Then, we can fill our bit_field as follows (which takes roughly (N / 8) * (R + 3 * C) milliseconds):
for (int i = 0; i < field_size; i++) {
    auto& options = random_lookup[gencnt[i]];
    bit_field[i] = options[rand() % options.size()];
}
Summing everything up, we have the total execution time:
T = (N / 8) * C +
(N / 8) * C + (N / 16) * (4 * C) + (N / 16) * (R + 2 * C) +
(N / 8) * (R + 3 * C)
= N * (C + (3/16) * R) < N * (R + 4 * A + 2 * C)
^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^
proposed algorithm naive baseline algo
Although it's not truly uniformly random, it does spread the bits out quite evenly and randomly, and it's quite fast, so hopefully it gets the job done in your use case.
Observe that actually shuffling the bits, which involves swapping via Fisher-Yates, is not required to produce the exact equivalent: a random distribution of the bits.
#include <iostream>
#include <vector>
#include <random>
// shuffle a vector of bools. This requires only counting the number of trues in the vector
// followed by clearing the vector and inserting bool trues to produce an equivalent to
// a bit shuffle. This is cache line friendly and doesn't require swapping.
std::vector<bool> DistributeBitsRandomly(std::vector<bool> bvector)
{
    std::random_device rd;
    static std::mt19937 gen(rd()); // mersenne_twister_engine seeded with rd()
    // count the number of set bits and clear bvector
    int set_bits_count = 0;
    for (int i = 0; i < bvector.size(); i++)
        if (bvector[i])
        {
            set_bits_count++;
            bvector[i] = false;
        }
    // set a bit if a random value in range bvector.size()-bit_loc-1 is
    // less than the number of bits remaining to be placed. This produces exactly the same
    // distribution as a random shuffle but only does an insertion of a 1 bit rather than
    // a swap. It requires counting the number of 1 bits. There are efficient ways
    // of doing this. See https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
    for (int bit_loc = 0; set_bits_count; bit_loc++)
    {
        std::uniform_int_distribution<int> dis(0, bvector.size() - bit_loc - 1);
        auto x = dis(gen);
        if (x < set_bits_count)
        {
            bvector[bit_loc] = true;
            set_bits_count--;
        }
    }
    return bvector;
}
This performs the equivalent of shuffling the bools in a vector<bool>. It is cache-line friendly and involves no swapping. It's presented in executable but simple algorithmic form, as requested by the OP. Much can be done to optimize this, such as improving the speed of bit counting and clearing the array.
This sets 4 bits out of 10, calls the "shuffle" routine 100,000 times, and prints the number of times a 1 bit occurs in each of the 10 locations. It should be around 40,000 in each position.
int main()
{
    std::vector<bool> initial{ 1,1,1,1,0,0,0,0,0,0 };
    std::vector<int> totals(initial.size());
    for (int i = 0; i < 100000; i++)
    {
        auto a_distribution = DistributeBitsRandomly(initial);
        for (int ii = 0; ii < totals.size(); ii++)
            if (a_distribution[ii])
                totals[ii]++;
    }
    for (auto cnt : totals)
        std::cout << cnt << "\n";
}
Possible Output:
40116
39854
40045
39917
40105
40074
40214
39963
39946
39766

Multithreading alternative to mutex in parallel_for

I'm fairly new to C++, so please pardon me if this is a stupid question, but I didn't find a good example of what I'm looking for on the internet.
Basically, I'm using a parallel_for loop to find the maximum inside a 2D array (with a bunch of other operations in between). First of all, I don't even know if this is the best approach, but given the size of this 2D array, I thought splitting the calculations would be faster.
My code:
vector<vector<double>> InterpU(1801, vector<double>(3601, 0));
Concurrency::parallel_for(0, 1801, [&](int i) {
    long k = 0; long l = 0;
    pair<long, long> Normalized;
    double InterpPointsU[4][4];
    double jRes;
    double iRes = i * 0.1;
    double RelativeY, RelativeX;
    int p, q;
    while (iRes >= (k + 1) * DeltaTheta) k++;
    RelativeX = iRes / DeltaTheta - k;
    for (long j = 0; j < 3600; j++)
    {
        jRes = j * 0.1;
        while (jRes >= (l + 1) * DeltaPhi) l++;
        RelativeY = jRes / DeltaPhi - l;
        p = 0;
        for (long m = k - 1; m < k + 3; m++)
        {
            q = 0;
            for (long n = l - 1; n < l + 3; n++)
            {
                Normalized = Normalize(m, n, PointsTheta, PointsPhi);
                InterpPointsU[p][q] = U[Normalized.first][Normalized.second];
                q++;
            }
            p++;
        }
        InterpU[i][j] = bicubicInterpolate(InterpPointsU, RelativeX, RelativeY);
        if (InterpU[i][j] > MaxU)
        {
            SharedDataLock.lock();
            MaxU = InterpU[i][j];
            SharedDataLock.unlock();
        }
    }
    InterpU[i][3600] = InterpU[i][0];
});
You can see here that I'm using a mutex called SharedDataLock to protect multiple threads accessing the same resource. MaxU is a variable that should contain only the maximum of the InterpU vector.
The code works well, but since I'm having speed performance problems, I began to look into atomics and some other stuff.
Is there any good example of how to modify similar code to make it faster?
As mentioned by VTT, you can simply find the local maximum of each thread and merge those afterwards with the use of combinable:
Concurrency::combinable<double> CombinableMaxU;
Concurrency::parallel_for(0, 1801, [&](int i) {
    ...
    CombinableMaxU.local() = std::max(CombinableMaxU.local(), InterpU[i][j]);
});
MaxU = std::max(MaxU, CombinableMaxU.combine(std::max<double>));
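For reference, a self-contained sketch of the same pattern on made-up data (using a lambda for the combine step, which avoids any ambiguity in taking std::max's address):
#include <ppl.h>
#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
    std::vector<double> data(1000000, 1.0);
    data[1234] = 42.0; // plant a known maximum

    // Each thread accumulates into its own .local() copy; no lock is needed.
    Concurrency::combinable<double> localMax([] { return -1e308; });
    Concurrency::parallel_for(0, (int)data.size(), [&](int i) {
        localMax.local() = std::max(localMax.local(), data[i]);
    });
    // Merge the per-thread maxima once, at the end.
    double maxVal = localMax.combine([](double a, double b) { return std::max(a, b); });
    std::cout << maxVal << "\n"; // prints 42
}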
Note that your current code is actually wrong (unless MaxU is atomic): you read MaxU outside of the lock while it can be written simultaneously by other threads. Generally, you must not read a value that is being written to simultaneously unless both sides are protected by atomic semantics or by locks and memory fences. One reason is that a variable access may very well consist of multiple memory accesses, depending on how the type is supported by hardware.
But in your case, you even have a classic race condition:
MaxU == 1
Thread a | Thread b
InterpU[i][j] = 3 | InterpU[i][j] = 2
if (3 > MaxU) | if (2 > MaxU)
SharedDataLock.lock(); | SharedDataLock.lock();
(gets the lock) | (waiting for lock)
MaxU = 3 | ...
SharedDataLock.unlock(); | ...
... | (gets the lock)
| MaxU = 2
| SharedDataLock.unlock();
MaxU == 2
Locks are hard.
You can also use an atomic and compute the maximum on that. However, I would guess1 that it still doesn't perform well inside the loop2, and outside the loop it doesn't matter whether you use atomics or locks.
1: When in doubt, don't guess - measure!
2: Just because something is atomic and supported by hardware, doesn't mean it is as efficient as accessing local data. First, atomic instructions are often much more costly than their non-atomic counterparts, second you have to deal with very bad cache effects, because cores/caches will fight for the ownership of the data. While atomics may be more elegant in many cases (not this one IMHO), reduction is faster most of the time.
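For completeness, a sketch of the atomic variant mentioned above: a "max" on a std::atomic via a compare-exchange loop (measure before using it in the hot loop, per footnote 1):
#include <atomic>

// Atomically raise 'target' to at least 'value'.
void atomicMax(std::atomic<double>& target, double value)
{
    double current = target.load();
    // On failure, compare_exchange_weak refreshes 'current', so the loop
    // exits as soon as another thread has stored something >= 'value'.
    while (value > current && !target.compare_exchange_weak(current, value))
    {
    }
}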

Example of C++ code optimization for parallel computing

I'm trying to understand optimization routines. I'm focusing on the most critical part of my code (the code has some cycles of length "nc" and one cycle of length "np", where the number "np" is much larger than "nc"). I present part of the code here. The rest of the code is not very essential in terms of computational time, so I prefer to keep the rest of the algorithm clean. However, the critical cycle of length "np" is a pretty simple piece of code and can be parallelized. So it will not hurt if I rewrite this part into some more effective, if less clear, version (maybe with SSE instructions). I'm using the gcc compiler, C++ code, and OpenMP parallelization.
This code is part of the well-known particle-in-cell algorithm (and a basic variant at that). I'm trying to learn code optimization on this version (so my goal is not only to have an effective PIC algorithm, which has already been written in a thousand variants, but also to provide a demonstrative example for code optimization). I've done some work on it, but I'm not very sure that I've handled all the optimization concerns correctly.
const int NT = ...; // number of threads (in two versions: about 6 or about 30)
const int np = 10000000; // np is about 1000-10000 times larger than nc commonly
const int nc = 10000;
const int step = 1000;
float u[np], x[np];
float a[nc], a_lin[nc], rho_full[NT][nc], rho_diff[NT][nc], weight[nc];
int p, num;
for (i = 0; i < step; i++) {
    // ***
    // *** some not very time consuming code for calculation of
    // *** a, a_lin from the values of rho_full and rho_diff
    #pragma omp for private(p,num)
    for (k = np; --k; ) {
        num = omp_get_thread_num();
        p = (int) x[k];
        u[k] += a[p] + a_lin[p] * (x[k] - p);
        x[k] += u[k];
        if (x[k] < 0)  { x[k] += nc; } else
        if (x[k] > nc) { x[k] -= nc; }
        p = (int) x[k];
        rho_full[num][p] += weight[k];
        rho_diff[num][p] += weight[k] * (x[k] - p);
    }
}
I realize this has problems:
1) (main question) I use a set of arrays rho_full[num][p], where num is the index of each thread. After the computation I just sum these arrays (rho_full[0][p] + rho_full[1][p] + rho_full[2][p] + ...). The reason is to avoid two different threads writing into the same part of the array. I am not very sure whether this is an effective solution (note that the number "nc" is relatively small, so the number of operations with "np" is probably still the most essential part).
2) (also an important question) I need to read x[k] many times, and it is also changed many times. Maybe it's better to read this value into some register and then forget the whole x array, or fix some pointer here. After all the calculation I can write the obtained value back to x[k]. I believe the compiler does this work for me, but I am not very sure, because I modify x[k] in the middle of the algorithm. So the compiler probably does some effective work on its own, but maybe in this version it accesses memory more often than necessary, because I switch between loading and storing this value more than once.
3) (probably not relevant) The code works with the integer part and the fractional part below the decimal point. It needs both of these values. I obtain the integer part as p = (int) x and the remainder as x - p. I calculate this at the beginning and also at the end of the cycle body. One can see that this splitting could be stored somewhere and reused at the next step (I mean the step in the i index). Do you think the following version is better? I store the integer and remainder parts in arrays instead of the whole value x.
int x_int[np];
float x_rem[np];
//...
for (k = np; --k; ) {
    num = omp_get_thread_num();
    u[k] += a[x_int[k]] + a_lin[x_int[k]] * x_rem[k];
    x_rem[k] += u[k];
    p = (int) x_rem[k]; // *** This part is added to the code to simplify the rest.
    x_int[k] += p;      // *** And maybe there is a better way to realize
    x_rem[k] -= p;      // *** this "pushing correction".
    if (x_int[k] < 0)  { x_int[k] += nc; } else
    if (x_int[k] > nc) { x_int[k] -= nc; }
    rho_full[num][x_int[k]] += weight[k];
    rho_diff[num][x_int[k]] += weight[k] * x_rem[k];
}
You can use an OpenMP reduction for your for loop:
int result = 0;
#pragma omp for nowait reduction(+:result)
for (k = np; --k; ) {
    num = omp_get_thread_num();
    p = (int) x[k];
    u[k] += a[p] + a_lin[p] * (x[k] - p);
    x[k] += u[k];
    if (x[k] < 0)  { x[k] += nc; } else
    if (x[k] > nc) { x[k] -= nc; }
    p = (int) x[k];
    result += weight[k] + weight[k] * (x[k] - p);
}

Why does M = L + ((R - L) / 2) instead of M = (L + R) / 2 avoid overflow in C++?

Hello, I was looking at the C++ solution to the question "Suppose a sorted array is rotated at some pivot unknown to you beforehand. (i.e., 0 1 2 4 5 6 7 might become 4 5 6 7 0 1 2). How do you find an element in the rotated array efficiently? You may assume no duplicate exists in the array."
int rotated_binary_search(int A[], int N, int key) {
    int L = 0;
    int R = N - 1;
    while (L <= R) {
        // Avoid overflow, same as M = (L + R) / 2
        int M = L + ((R - L) / 2);
        if (A[M] == key) return M;
        // the bottom half is sorted
        if (A[L] <= A[M]) {
            if (A[L] <= key && key < A[M])
                R = M - 1;
            else
                L = M + 1;
        }
        // the upper half is sorted
        else {
            if (A[M] < key && key <= A[R])
                L = M + 1;
            else
                R = M - 1;
        }
    }
    return -1;
}
and saw that the comment says that using M = L + ((R - L) / 2) instead of M = (L + R) / 2 avoids overflow. Why is that? Thx ahead
Because it does...
Let's assume for a minute you're using unsigned chars (same applies to larger integers of course).
If L is 100 and R is 200, the first version is:
M = (100 + 200) / 2 = 44 / 2 = 22
because 100 + 200 overflows (the largest unsigned char is 255), and you get 100 + 200 = 44 (unsigned wrap-around addition).
The second, on the other hand:
M = 100 + (200-100) / 2 = 100 + 100 / 2 = 150
No overflow.
As #user2357112 pointed out in a comment, there are no free lunches. If L is negative, the second version might not work while the first will.
Not sure, but suppose the max limit of int is 100, with
R = 80 & L = 40
then,
M = (L + R) / 2
M = (120) / 2; here 120 is outside the limits of our integer type, so this causes overflow.
However,
M = L + ((R - L) / 2)
M = 40 + ((40) / 2)
M = 40 + 20
M = 60.
So in this case we never encounter a value that exceeds the limits of our integer type, so this approach will never encounter an overflow, THEORETICALLY.
I hope this analogy will help
It avoids overflow in this specific implementation, which operates under the guarantees that L and R are non-negative and L <= R. Under these guarantees it should be obvious that R - L does not overflow and L + ((R - L) / 2) does not overflow either.
In the general case (i.e. for arbitrary values of L and R), R - L is as prone to overflow as L + R, meaning that this trick does not achieve anything.
The comment is wrong, for a number of reasons.
For the particular problem the risk of overflow is probably nil.
Reordering calculations does not guarantee that the compiler will perform them in that order.
If there is a range of values for which an ordering can cause overflow, then there is another range of values for which the reordered calculation will cause overflow.
If overflow could be a problem then it should be controlled explicitly, not implicitly.
This is an excellent place for an assert. In this case the algorithm is only valid if N is less than half the maximum positive range of int, so say it in an assert.
If the algorithm is required to work for the whole positive range of signed int then the range should be explicitly tested in an assert, and the calculation should be ordered by introducing a sequence point (eg broken into two statements).
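For example, a minimal sketch of such an assert on the midpoint computation (assuming <cassert> and <climits>):
#include <cassert>
#include <climits>

int midpoint(int L, int R)
{
    // Make the precondition explicit: under it, L + R cannot overflow.
    assert(L >= 0 && L <= R && R <= INT_MAX / 2);
    return (L + R) / 2;
}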
Doing this right is hard. Numerical computation is full of this stuff. Best to avoid, if possible. And don't accept random advice (even this!) without doing your own research.

Is there an expression using modulo to do backwards wrap-around ("reverse overflow")?

For any whole number input W restricted by the range R = [x,y], the "overflow," for lack of a better term, of W over R is W % (y-x+1) + x. This causes it to wrap back around if W exceeds y.
As an example of this principle, suppose we iterate over a calendar's months:
int this_month = 5;
int next_month = (this_month + 1) % 12;
where both integers will be between 0 and 11, inclusive. Thus, the expression above "clamps" the integer to the range R = [0,11]. This approach of using an expression is simple, elegant, and advantageous as it omits branching.
Now, what if we want to do the same thing, but backwards? The following expression works:
int last_month = ((this_month - 1) % 12 + 12) % 12;
but it's abstruse. How can it be beautified?
tl;dr - Can the expression ((x-1) % k + k) % k be simplified further?
Note: C++ tag specified because other languages handle negative operands for the modulo operator differently.
Your expression should be ((x-1) + k) % k. This will properly wrap x=0 around to 11. In general, if you want to step back more than 1, you need to make sure that you add enough so that the first operand of the modulo operation is >= 0.
Here is an implementation in C++:
int wrapAround(int v, int delta, int minval, int maxval)
{
    const int mod = maxval + 1 - minval;
    if (delta >= 0) { return (v + delta - minval) % mod + minval; }
    else            { return ((v + delta) - delta * mod - minval) % mod + minval; }
}
This also allows using months labeled from 0 to 11 or from 1 to 12, setting minval and maxval accordingly.
Since this answer is so highly appreciated, here is an improved version without branching, which also handles the case where the initial value v is smaller than minval. I keep the other example because it is easier to understand:
int wrapAround(int v, int delta, int minval, int maxval)
{
    const int mod = maxval + 1 - minval;
    v += delta - minval;
    v += (1 - v / mod) * mod;
    return v % mod + minval;
}
The only issue remaining is if minval is larger than maxval. Feel free to add an assertion if you need it.
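For example (hypothetical values, months labeled 0 to 11):
int next = wrapAround(11, 3, 0, 11);  // 11 + 3 wraps around to 2
int prev = wrapAround(0, -1, 0, 11);  // 0 - 1 wraps back to 11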
k % k will always be 0. I'm not 100% sure what you're trying to do but it seems you want the last month to be clamped between 0 and 11 inclusive.
(this_month + 11) % 12
Should suffice.
The general solution is to write a function that computes the value that you want:
// Returns floor(a/n) (with the division done exactly).
// Let ÷ be mathematical division, and / be C++ division.
// We know
//   a÷n = a/n + f (f is the remainder; not all
//   divisions have exact Integral results)
// and
//   (a/n)*n + a%n == a (from the standard).
// Together, these imply (through algebraic manipulation):
//   sign(f) == sign(a%n)*sign(n)
// We want the remainder (f) to always be >= 0 (by definition of flooredDivision),
// so when sign(f) < 0, we subtract 1 from a/n to make f > 0.
template<typename Integral>
Integral flooredDivision(Integral a, Integral n) {
    Integral q(a / n);
    if ((a % n < 0 && n > 0) || (a % n > 0 && n < 0)) --q;
    return q;
}

// flooredModulo: Modulo function for use in the construction of
// looping topologies. The result will always be between 0 and the
// denominator, and will loop in a natural fashion (rather than swapping
// the looping direction over the zero point (as in C++11),
// or being unspecified (as in earlier C++)).
// Returns x such that:
//
//   Real a = Real(numerator)
//   Real n = Real(denominator)
//   Real r = a - n*floor(a/n)
//   x = Integral(r)
template<typename Integral>
Integral flooredModulo(Integral a, Integral n) {
    return a - n * flooredDivision(a, n);
}
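For instance, applied to the month example from the question:
int a = flooredModulo(5 - 1, 12); // == 4
int b = flooredModulo(0 - 1, 12); // == 11 (wraps backwards)
int c = flooredModulo(-25, 12);   // == 11 (keeps wrapping, unlike plain %)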
Easy peasy: do not use the first modulo operator, it is superfluous:
int last_month = (this_month - 1 + 12) % 12;
which is the general case.
In this instance you can write 11, but I would still do the -1 + 12, as it more clearly states what you want to achieve.
Note that normal mod causes the pattern 0...11 to repeat at 12...23, 24...35, etc., but doesn't wrap on -11...-1. In other words, it has two sets of behaviors: one from -infinity...-1, and a different one from 0...infinity.
The expression ((x-1) % k + k) % k fixes -11...-1 but has the same problem as normal mod with -23...-12. That is, while it fixes 12 additional numbers, it doesn't wrap around infinitely. It still has one behavior from -infinity...-12, and a different behavior from -11...+infinity.
This means that if you're using the function for offsets, it could lead to buggy code.
If you want a truly wrap around mod, it should handle the entire range, -infinity...infinity in exactly the same way.
There is probably a better way to implement this, but here is an easy-to-understand implementation (in Swift):
// n must be greater than 0
func wrapAroundMod(a: Int, n: Int) -> Int {
    var offsetTimes: Int = 0
    if a < 0 {
        offsetTimes = (-a / n) + 1
    }
    return (a + n * offsetTimes) % n
}
Not sure if you were having the same problem as me, but my problem was essentially that I wanted to constrain all numbers to a certain range. Say that range was 0-6; using % 7 means that any number higher than 6 will wrap back around to 0 or above. The actual problem is that numbers less than zero didn't wrap back around to 6. I have a solution for that (where X is one past the upper limit of your number range, 7 in this example, and 0 is the minimum):
int result;
if (inputNumber < 0) // if this is a negative number
{
    result = (X - (-inputNumber) % X) % X; // the inner % X also handles values below -X
}
else
{
    result = inputNumber % X;
}