How to parallelize a loop? - c++

I'm using OpenMP with C++ and I want to parallelize a very simple loop, but I can't get it right: every time I get the wrong result.
for (i = 2; i < N; i++)
    for (j = 2; j < N; j++)
        A[i][j] = A[i-2][j] + A[i][j-2];
Code:
int const N = 10;
int arr[N][N];

#pragma omp parallel for
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        arr[i][j] = 1;

#pragma omp parallel for
for (int i = 2; i < N; i++)
    for (int j = 2; j < N; j++)
    {
        arr[i][j] = arr[i-2][j] + arr[i][j-2];
    }

for (int i = 0; i < N; i++)
{
    for (int j = 0; j < N; j++)
        printf_s("%d ", arr[i][j]);
    printf("\n");
}
Do you have any suggestions for how I can do this correctly? Thank you!

The serial and parallel runs will give different results because in

#pragma omp parallel for
for (int i = 2; i < N; i++)
    for (int j = 2; j < N; j++)
    {
        arr[i][j] = arr[i-2][j] + arr[i][j-2];
    }
...

you update arr[i][j], so you change data that is being read by other threads. This leads to a read-over-write data race!

This
#pragma omp parallel for
for (int i = 2; i < N; i++)
    for (int j = 2; j < N; j++)
    {
        arr[i][j] = arr[i-2][j] + arr[i][j-2];
    }
is always going to be a source of grief and unpredictable output. The OpenMP runtime is going to hand each thread a range of values for i and leave them to it. There will be no determinism in the relative order in which threads update arr. For example, while thread 1 is updating elements with i = 2, 3, 4, 5, ..., 100 (or whatever) and thread 2 is updating elements with i = 102, 103, 104, ..., 200, the program does not determine whether thread 1 finishes updating row i = 100 before or after thread 2 wants to use those updated values. You have written code with a classic data race.
You have a number of options to fix this:
You could tie yourself in knots trying to ensure that the threads update arr in the right (i.e. sequential) order. The end result would be an OpenMP program that runs more slowly than the sequential program. DO NOT TAKE THIS OPTION.
You could make 2 copies of arr and always update from one to the other, then from the other to the one. Something like (very pseudo-code)
old = 0; new = 1;
for (each sweep)
{
    arr[i][j][new] = arr[i-2][j][old] + arr[i][j-2][old];   // for all i, j
    swap(old, new);   // the buffers exchange roles for the next sweep
}
Of course, this second approach trades space for time but that's often a reasonable trade-off.
You may find that adding an extra plane to arr doesn't immediately speed things up because it wrecks the spatial locality of values pulled into cache. Experiment a bit with this, possibly make [old] the first index element rather than the last.
Since updating each element in the array depends on the values found in elements 2 rows/columns away you're effectively splitting the array up like a chess-board, into white and black elements. You could use 2 threads, one on each 'colour', without the threads racing for access to the same data. Again, though, the disruption of spatial locality in the cache might have a bad impact on speed.
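A minimal sketch of that two-colour idea, reusing arr and N from the question; colouring by the parity of i + j is my reading of the chess-board split, and omp_get_thread_num() needs <omp.h>:

#include <omp.h>

#pragma omp parallel num_threads(2)
{
    int colour = omp_get_thread_num();   // 0 or 1: this thread's colour
    for (int i = 2; i < N; i++)
        for (int j = 2; j < N; j++)
            if (((i + j) & 1) == colour) // only touch cells of my own colour
                arr[i][j] = arr[i-2][j] + arr[i][j-2];
}

Both arr[i-2][j] and arr[i][j-2] have the same parity of i + j as arr[i][j], so each thread reads and writes only its own colour, in the same order as the sequential code; the threads never race. (Each thread sweeps the whole index space and skips the other colour, which wastes some loop overhead but keeps the sketch short.)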
If any other options occur to me I'll edit them in.

To parallelize the loop nest in the question is tricky, but doable. Lamport's paper "The Parallel Execution of DO Loops" covers the technique. Basically you have to rotate your (i,j) coordinates by 45 degrees into a new coordinate system (k,l), where k=i+j and l=i-j.
Though to actually get speedup, the iterations likely have to be grouped into tiles, which makes the code even uglier (four nested loops).
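For reference, a minimal untiled sketch of the wavefront in the original coordinates, sweeping anti-diagonals k = i + j (std::max/std::min are from <algorithm>):

for (int k = 4; k <= 2 * (N - 1); k++)
{
    // every cell on diagonal k depends only on diagonals k - 2 and earlier,
    // so all cells on one diagonal can be updated in parallel
    int lo = std::max(2, k - (N - 1));
    int hi = std::min(N - 1, k - 2);
    #pragma omp parallel for
    for (int i = lo; i <= hi; i++)
    {
        int j = k - i;
        arr[i][j] = arr[i-2][j] + arr[i][j-2];
    }
}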
A completely different approach is to solve the problem recursively, using OpenMP tasking. The recursion is:
if (too small to be worth parallelizing) {
    do serially
} else {
    // Recursively:
    do upper-left quadrant
    do lower-left and upper-right quadrants in parallel
    do lower-right quadrant
}
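A minimal sketch of that recursion with OpenMP tasks; arr and N are the globals from the question, CUTOFF is a tuning parameter I made up, and the ranges are half-open:

void sweep(int i0, int i1, int j0, int j1) // updates arr on [i0,i1) x [j0,j1)
{
    if ((i1 - i0) * (j1 - j0) < CUTOFF) {
        // too small to be worth parallelizing: do serially
        for (int i = i0; i < i1; i++)
            for (int j = j0; j < j1; j++)
                arr[i][j] = arr[i-2][j] + arr[i][j-2];
        return;
    }
    int im = (i0 + i1) / 2, jm = (j0 + j1) / 2;
    sweep(i0, im, j0, jm);      // upper-left quadrant first
    #pragma omp task            // lower-left depends only on upper-left...
    sweep(im, i1, j0, jm);
    #pragma omp task            // ...as does upper-right, so run both in parallel
    sweep(i0, im, jm, j1);
    #pragma omp taskwait
    sweep(im, i1, jm, j1);      // lower-right last
}

// The top-level call:  #pragma omp parallel
//                      #pragma omp single
//                      sweep(2, N, 2, N);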
As a practical matter, the ratio of arithmetic operations to memory accesses is so low that it is going to be difficult to get speedup out of the example.

If you are asking about parallelism in general, then one more possible answer is vectorization. You could achieve some relatively modest vector parallelism (something like a 2x speedup or so) without changing the data structure or codebase. This is possible using the OpenMP 4.0 or Cilk Plus pragma simd or similar (with safelen/vectorlength(2)).
Well, you really do have data dependences (carried by both the inner and outer loops), but they are loop-carried dependences with distance 2, which blocks «omp parallel for» used «as is» but is not necessarily a problem for «pragma omp simd» loops.
To make this work you will need an x86 compiler supporting pragma simd, either via OpenMP 4.0 or via Cilk Plus (a very recent gcc or the Intel compiler).
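A sketch of what that could look like on the loop from the question, keeping the outer loop serial; safelen(2) promises the compiler that concurrently executed iterations are at most 2 apart, which matches the dependence distance:

for (int i = 2; i < N; i++)
{
    #pragma omp simd safelen(2)
    for (int j = 2; j < N; j++)
        arr[i][j] = arr[i-2][j] + arr[i][j-2];
}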

Related

filling only half of Matrix using OpenMp in C++

I have quite a big matrix. I would like to fill half of the matrix in parallel.
m_matrix is a 2D std::vector. Any suggestion for the type of container is appreciated as well. The work the _fill(i,j) function does is not heavy compared to the size of the matrix.
// i: row
// j: column
for (size_t i = 1; i < num_row; ++i)
{
    for (size_t j = 0; j < i; ++j)
    {
        m_matrix[i][j] = _fill(i, j);
    }
}
What would be a nice OpenMP structure for that? I tried the dynamic scheduling strategy, but I actually saw the time increase compared to the sequential mode.
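For what it's worth, here is a minimal sketch of one structure to try; the schedule(static, 1) interleaving is my assumption for balancing the triangular workload (row i does i units of work), and unsigned loop variables require OpenMP 3.0 or later:

#pragma omp parallel for schedule(static, 1)
for (size_t i = 1; i < num_row; ++i)
{
    for (size_t j = 0; j < i; ++j)
    {
        m_matrix[i][j] = _fill(i, j);
    }
}

If _fill really is cheap, a parallel version can still lose to the sequential one on small matrices, because thread start-up and scheduling overhead dominate.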

CUDA C++ shared memory and if-condition

I have a question I couldn't find an answer to myself, and I was hoping some of you could offer me some insight regarding a possible solution. Within a kernel call, I would like to insert an if-condition regarding access to shared memory.
__global__ void GridFillGPU(int *gridGLOB, int n) {
    __shared__ int grid[SIZE]; // ... initialized to zero
    int tid = threadIdx.x;
    if (tid < n) {
        for (int k = 0; k < SIZE; k++) {
            if (grid[k] == 0) {
                grid[k] = tid + 1;
                break;
            }
        }
    }
    // ... here write grid to global memory gridGLOB
}
The idea is that, if the element grid[k] has already been written by one thread (with index tid), it should not be written by another one. My question is: can this even be done in parallel? Since all parallel threads perform the same for-loop, how can I be sure that the if-condition is evaluated correctly? I am guessing this will lead to certain race conditions. I am quite new to CUDA, so I hope this question is not stupid. I know that grid needs to be in shared memory, and that one should avoid if-statements, but I can find no other way around it at the moment.
I am thankful for any help.
EDIT: here is the explicit version, which explains why the array is called grid
__global__ void GridFillGPU(int *pos, int *gridGLOB, int n) {
    __shared__ int grid[SIZE*7]; // ... initialized to zero
    int tid = threadIdx.x;
    if (tid < n) {
        int jmin = pos[tid] - 3;
        int jmax = pos[tid] + 3;
        for (int j = jmin; j <= jmax; j++) {
            for (int k = 0; k < SIZE; k++) {
                if (grid[(j-jmin)*SIZE + k] == 0) {
                    grid[(j-jmin)*SIZE + k] = tid + 1;
                    break;
                }
            }
        }
    }
    // ... here write grid to global memory gridGLOB
}
You should model your problem in a way that you don't need to worry about "has it been written already", not least because CUDA offers no guarantees about the order in which threads will be executed, so the order might not be the way you expect.
CUDA does give some minor ordering guarantees within a warp, but that does not help here.
There are synchronization barriers and similar constructs you can use, but I don't think they fit your case.
If you are processing a grid, you should model it so that each thread has its own region of memory to work on, one that does not overlap with any other thread's region (at least for writes; for reads you can go outside the boundaries). Also, I would not worry about shared memory yet: make the algorithm work first, then think about optimizations such as loading a tile into shared memory with the warp.
In that case, if you want to split your domain into a grid, you should set up the kernel so that you have as many threads as there are grid "cells" (or pixels, if it is an image). Then you use the thread and block coordinates that CUDA provides to compute where you should read and write in memory.
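A minimal sketch of that mapping (my illustration, not your algorithm): a 2D domain of width x height int cells stored row-major in gridGLOB, one thread per cell, each thread writing only its own cell:

__global__ void fillCells(int *gridGLOB, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // this thread's row
    if (x < width && y < height)                     // guard the ragged edge
        gridGLOB[y * width + x] = y * width + x + 1; // placeholder per-cell work
}

// e.g. launched as: dim3 block(16, 16);
//                   dim3 grid((width+15)/16, (height+15)/16);
//                   fillCells<<<grid, block>>>(d_grid, width, height);

Because no two threads share an output cell, there is nothing to check and no race.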
There is a really good course about CUDA on udacity.com; you might want to have a look at it:
https://www.udacity.com/courses/cs344
There is also another one on coursera.com but I don't know if it is open right now.
Anyway, dividing the domain into a grid is a really common and solved problem; you can find a lot of material on it.

Fastest way to calculate the abs()-values of a complex array

I want to calculate the absolute values of the elements of a complex array in C or C++. The easiest way would be
for (int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}
But for large vectors that will be slow. Is there a way to speed that up (by using parallelization, for example)? Language can be either C or C++.
Given that all loop iterations are independent, you can use the following code for parallelization:
#pragma omp parallel for
for (int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}
Of course, to use this you should enable OpenMP support when compiling your code (usually via the /openmp flag with MSVC or -fopenmp with GCC/Clang, or by setting the corresponding project options).
You can find several examples of OpenMP usage in wiki.
Or use Concurrency::parallel_for (from Microsoft's Parallel Patterns Library) like this:

Concurrency::parallel_for(0, N, [&a, &b](int i)
{
    b[i] = cabs(a[i]);
});
Use vector operations.
If you have glibc 2.22 (pretty recent), you can use the SIMD capabilities of OpenMP 4.0 to operate on vectors/arrays.
Libmvec is a vector math library added in Glibc 2.22.
Vector math library was added to support SIMD constructs of OpenMP4.0
(#2.8 in http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) by adding
vector implementations of vector math functions.
Vector math functions are vector variants of corresponding scalar math
operations implemented using SIMD ISA extensions (e.g. SSE or AVX for
x86_64). They take packed vector arguments, perform the operation on
each element of the packed vector argument, and return a packed vector
result. Using vector math functions is faster than repeatedly calling
the scalar math routines.
Also, see Parallel for vs omp simd: when to use each?
If you're running on Solaris, you can explicitly use vhypot() from the math vector library libmvec.so to operate on a vector of complex numbers to obtain the absolute value of each:
Description
These functions evaluate the function hypot(x, y) for an entire vector
of values at once. ...
The source code for libmvec can be found at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/ and the vhypot() code specifically at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/common/__vhypot.c I don't recall if Sun Microsystems ever provided a Linux version of libmvec.so or not.
Using #pragma simd (even with -Ofast) or relying on the compiler's auto-vectorization is one more example of why it's a bad idea to blindly expect your compiler to implement SIMD efficiently. To use SIMD efficiently for this you need to use an array of structs of arrays. For example, for single-precision float with a SIMD width of 4 you could use
// struct of arrays of four complex numbers
struct c4 {
    float x[4]; // real values of four complex numbers
    float y[4]; // imaginary values of four complex numbers
};
Here is code showing how you could do this with SSE for the x86 instruction set.
#include <stdio.h>
#include <x86intrin.h>

#define N 10

struct c4 {
    float x[4];
    float y[4];
};

static inline void cabs_soa4(struct c4 *a, float *b) {
    __m128 x4 = _mm_loadu_ps(a->x);
    __m128 y4 = _mm_loadu_ps(a->y);
    __m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4, x4), _mm_mul_ps(y4, y4)));
    _mm_storeu_ps(b, b4);
}

int main(void)
{
    int n4 = ((N+3)&-4)/4; // round N up to the next multiple of 4, divide by 4
    printf("%d\n", n4);
    struct c4 a[n4]; // array of struct of arrays
    for (int i = 0; i < n4; i++) {
        for (int j = 0; j < 4; j++) { a[i].x[j] = 1, a[i].y[j] = -1; }
    }
    float b[4*n4];
    for (int i = 0; i < n4; i++) {
        cabs_soa4(&a[i], &b[4*i]);
    }
    for (int i = 0; i < N; i++) printf("%.2f ", b[i]);
    puts("");
}
It may help to unroll the loop a few times. In any case, all this is moot for large N because the operation is memory-bandwidth bound. For large N (meaning when the memory usage is much larger than the last-level cache), although #pragma omp parallel may help some, the best solution is not to do this for large N at all. Instead, do this in chunks which fit in the lowest-level cache along with other compute operations. I mean something like this:
for (int i = 0; i < nchunks; i++) {
    for (int j = 0; j < chunk_size; j++) {
        b[i*chunk_size+j] = cabs(a[i*chunk_size+j]);
    }
    foo(&b[i*chunk_size]); // foo is computationally intensive.
}
I did not use an array of structs of arrays here, but it should be easy to adjust the code for that.
If you are using a modern compiler (GCC 5, for example), you can use Cilk Plus, which will give you nice array notation, automatic usage of SIMD instructions, and parallelisation.
So, if you want to run them in parallel you would do:
#include <cilk/cilk.h>

cilk_for (int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}
or if you want to test SIMD:
#pragma simd
for (int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}
But, the nicest part of Cilk is that you can just do:
b[:] = cabs(a[:])
In this case, the compiler and the runtime environment will decide at which level it should be SIMD-ed and what should be parallelised (the optimal way is applying SIMD on large-ish chunks in parallel).
Since this is decided by a work scheduler at runtime, Intel claims it is capable of providing near-optimal scheduling, and that it should be able to make optimal use of the cache.
Also, you can use std::future and std::async (they are part of C++11); maybe they are a clearer way of achieving what you want to do:
#include <future>
...
int main()
{
    ...
    // Create async calculations
    std::future<void> *futures = new std::future<void>[N];
    for (int i = 0; i < N; ++i)
    {
        futures[i] = std::async([&a, &b, i]
        {
            b[i] = std::sqrt(a[i]);
        });
    }
    // Wait for all async procedures to finish
    for (int i = 0; i < N; ++i)
    {
        futures[i].get();
    }
    ...
    return 0;
}
We first create the asynchronous procedures and then wait until everything is calculated.
Here I use sqrt instead of cabs because I just don't know what cabs is; I'm sure it doesn't matter.
Also, maybe you'll find this link useful: cplusplus.com

Matrix Multiplication optimization via matrix transpose

I am working on an assignment where I transpose a matrix to reduce cache misses for a matrix multiplication operation. From what I understand from a few classmates, I should get an 8x improvement. However, I am only getting 2x... what might I be doing wrong?
Full Source on GitHub
void transpose(int size, matrix m) {
    int i, j;
    for (i = 0; i < size; i++)
        for (j = 0; j < size; j++)
            std::swap(m.element[i][j], m.element[j][i]);
}

void mm(matrix a, matrix b, matrix result) {
    int i, j, k;
    int size = a.size;
    long long before, after;
    before = wall_clock_time();
    // Do the multiplication
    transpose(size, b); // transpose the matrix to reduce cache misses
    for (i = 0; i < size; i++)
        for (j = 0; j < size; j++) {
            int tmp = 0; // save memory writes
            for (k = 0; k < size; k++)
                tmp += a.element[i][k] * b.element[j][k];
            result.element[i][j] = tmp;
        }
    after = wall_clock_time();
    fprintf(stderr, "Matrix multiplication took %1.2f seconds\n", ((float)(after - before)) / 1000000000);
}
Am I doing things right so far?
FYI: The next optimization I need to do is use SIMD/Intel SSE3
Am I doing things right so far?
No. You have a problem with your transpose. You should have seen this problem before you started worrying about performance. When you are doing any kind of hacking around for optimizations, it is always a good idea to use the naive but suboptimal implementation as a test. An optimization that achieves a factor of 100 speedup is worthless if it doesn't yield the right answer.
Another optimization that will help is to pass by reference. You are passing copies. In fact, your matrix result may never get out because you are passing copies. Once again, you should have tested.
Yet another optimization that will help the speedup is to cache some pointers. This is still quite slow:
for (k = 0; k < size; k++)
    tmp += a.element[i][k] * b.element[j][k];
result.element[i][j] = tmp;
An optimizer might see a way around the pointer problems, but probably not. At least not if you don't use the nonstandard __restrict__ keyword to tell the compiler that your matrices don't overlap. Cache pointers so you don't have to do a.element[i], b.element[j], and result.element[i]. And it still might help to tell the compiler that these arrays don't overlap with the __restrict__ keyword.
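For illustration, a sketch of the cached-pointer inner loop; the names are mine, element is the int** from the question's matrix struct, and __restrict__ is a GCC/Clang extension rather than standard C++:

static int row_dot(const int * __restrict__ ai, const int * __restrict__ bj, int size)
{
    int tmp = 0;                // save memory writes, as in the original
    for (int k = 0; k < size; k++)
        tmp += ai[k] * bj[k];   // both rows are walked contiguously
    return tmp;
}

// in mm(): result.element[i][j] = row_dot(a.element[i], b.element[j], size);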
Addendum
After looking over the code, it needs help. A minor comment first. You aren't writing C++. Your code is C with a tiny hint of C++. You're using struct rather than class, malloc rather than new, typedef struct rather than just struct, C headers rather than C++ headers.
Because of your implementation of your struct matrix, my comment on slowness due to copy constructors was incorrect. That it was incorrect is even worse! Using the implicitly-defined copy constructor in conjunction with classes or structs that contain naked pointers is playing with fire. You will get burned very badly if someone calls mm(a, a, a_squared) to get the square of matrix a. You will get burned even worse if someone expects mm(a, a, a) to do an in-place computation of a squared.
Mathematically, your code only covers a tiny portion of the matrix multiplication problem. What if someone wants to multiply a 100x1000 matrix by a 1000x200 matrix? That's perfectly valid, but your code doesn't handle it because your code only works with square matrices. On the other hand, your code will let someone multiply a 100x100 matrix by a 200x200 matrix, which doesn't make a bit of sense.
Structurally, your code has close to a 100% guarantee that it will be slow because of your use of ragged arrays. malloc can spritz the rows of your matrices all across memory. You'll get much better performance if the matrix is internally represented as a contiguous array but is accessed as if it were a NxM matrix. C++ provides some nice mechanisms for doing just that.
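A minimal sketch of that idea; this is an illustration under my own names, not a drop-in replacement for the question's struct:

#include <vector>

struct Matrix {
    int rows, cols;
    std::vector<int> data;               // one contiguous allocation
    Matrix(int r, int c) : rows(r), cols(c), data(r * c) {}
    int&       operator()(int i, int j)       { return data[i * cols + j]; }
    const int& operator()(int i, int j) const { return data[i * cols + j]; }
};

Rows are now adjacent in memory, so walking along a row (or along k in the multiplication) streams through consecutive cache lines instead of chasing separately malloc'd row pointers.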
If your assignment implies that you MUST transpose, then, of course, you should correct your transpose procedure. As it stands, it does the transpose TWICE, resulting in no transpose at all. The j-loop should not read
    j = 0; j < size; j++
but
    j = 0; j < i; j++
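In other words, the corrected transpose (same matrix type as the question) sweeps only the lower triangle, so each off-diagonal pair is swapped exactly once:

void transpose(int size, matrix m) {
    for (int i = 0; i < size; i++)
        for (int j = 0; j < i; j++)   // lower triangle only
            std::swap(m.element[i][j], m.element[j][i]);
}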
Transposing is not necessary to avoid processing the elements of one of the factor-matrices in the "wrong" order. Just interchange the j-loop and the k-loop. Leaving aside for the moment any (other) performance-tuning, the basic loop-structure should be:
for (int i = 0; i < size; i++)
{
    for (int k = 0; k < size; k++)
    {
        double tmp = a[i][k];
        for (int j = 0; j < size; j++)
        {
            result[i][j] += tmp * b[k][j];
        }
    }
}

C++: Improving cache performance in a 3d array

I don't know how to optimize cache performance at a really low level, thinking about cache-line size or associativity. That's not something you can learn overnight. Considering my program will run on many different systems and architectures, I don't think it would be worth it anyway. But still, there are probably some steps I can take to reduce cache misses in general.
Here is a description of my problem:
I have a 3d array of integers, representing values at points in space, like [x][y][z]. Each dimension is the same size, so it's like a cube. From that I need to make another 3d array, where each value in this new array is a function of 7 parameters: the corresponding value in the original 3d array, plus the 6 indices that "touch" it in space. I'm not worried about the edges and corners of the cube for now.
Here is what I mean in C++ code:
void process3DArray(int input[LENGTH][LENGTH][LENGTH],
                    int output[LENGTH][LENGTH][LENGTH])
{
    // The for loops start at 1 and stop before LENGTH-1,
    // otherwise I'll get out-of-bounds errors.
    // I'm not concerned with the edges and corners of the
    // 3d array "cube" at the moment.
    for (int i = 1; i < LENGTH-1; i++)
        for (int j = 1; j < LENGTH-1; j++)
            for (int k = 1; k < LENGTH-1; k++)
            {
                int value = input[i][j][k];
                // I am expecting crazy cache misses here:
                int posX = input[i+1][j][k];
                int negX = input[i-1][j][k];
                int posY = input[i][j+1][k];
                int negY = input[i][j-1][k];
                int posZ = input[i][j][k+1];
                int negZ = input[i][j][k-1];
                output[i][j][k] =
                    process(value, posX, negX, posY, negY, posZ, negZ);
            }
}
However, it seems like if LENGTH is large enough, I'll get tons of cache misses when I'm fetching the parameters for process. Is there a cache-friendlier way to do this, or a better way to represent my data other than a 3d array?
And if you have the time to answer these extra questions: do I have to consider the value of LENGTH? That is, does it matter whether LENGTH is 20 vs 100 vs 10000? Also, would I have to do something else if I used something other than integers, like maybe a 64-byte struct?
@ildjarn: Sorry, I did not think that the code that generates the arrays I am passing into process3DArray mattered. But if it does, I would like to know why.
int main() {
    int data[LENGTH][LENGTH][LENGTH];
    for (int i = 0; i < LENGTH; i++)
        for (int j = 0; j < LENGTH; j++)
            for (int k = 0; k < LENGTH; k++)
                data[i][j][k] = rand() * (i + j + k);

    int result[LENGTH][LENGTH][LENGTH];
    process3DArray(data, result);
}
There's an answer to a similar question here: https://stackoverflow.com/a/7735362/6210 (by me!)
The main goal of optimizing a multi-dimensional array traversal is to make sure you visit the array such that you tend to reuse the cache lines accessed from the previous iteration step. For visiting each element of an array once and only once, you can do this just by visiting in memory order (as you are doing in your loop).
Since you are doing something more complicated than a simple element traversal (visiting an element plus 6 neighbors), you need to break up your traversal such that you don't access too many cache lines at once. Since the cache thrashing is dominated by traversing along j and k, you just need to modify the traversal such that you visit blocks at a time rather than rows at a time.
E.g.:
const int CACHE_LINE_STEP = 8;

void process3DArray(int input[LENGTH][LENGTH][LENGTH],
                    int output[LENGTH][LENGTH][LENGTH])
{
    for (int i = 1; i < LENGTH-1; i++)
        for (int k_start = 1, k_next = CACHE_LINE_STEP; k_start < LENGTH-1; k_start = k_next, k_next += CACHE_LINE_STEP)
        {
            int k_end = std::min(k_next, LENGTH - 1); // std::min from <algorithm>
            // The for loops start at 1 and stop before LENGTH-1,
            // otherwise I'll get out-of-bounds errors.
            // I'm not concerned with the edges and corners of the
            // 3d array "cube" at the moment.
            for (int j = 1; j < LENGTH-1; j++)
            {
                for (int k = k_start; k < k_end; ++k)
                {
                    int value = input[i][j][k];
                    int posX = input[i+1][j][k];
                    int negX = input[i-1][j][k];
                    int posY = input[i][j+1][k];
                    int negY = input[i][j-1][k];
                    int posZ = input[i][j][k+1];
                    int negZ = input[i][j][k-1];
                    output[i][j][k] =
                        process(value, posX, negX, posY, negY, posZ, negZ);
                }
            }
        }
}
What this does is ensure that you don't thrash the cache, by visiting the grid in a block-oriented fashion (actually, more like a fat-column-oriented fashion, bounded by the cache-line size). It's not perfect, as there are overlaps that cross cache lines between columns, but you can tweak it to make it better.
The most important thing you already have right. If you were using Fortran, you'd be doing it exactly wrong, but that's another story. What you have right is that you are processing in the inner loop along the direction where memory addresses are closest together. A single memory fetch (beyond the cache) will pull in multiple values, corresponding to a series of adjacent values of k. Inside your loop the cache will contain some number of values from (i, j); a similar number from (i±1, j) and from (i, j±1). So you basically have five disjoint sections of memory active. For small values of LENGTH these will only be one or three sections of memory. It is in the nature of how caches are built that you can have more than this many disjoint sections of memory in your active set.
I hope process() is small, and inline. Otherwise this may well be insignificant. Also, it will affect whether your code fits in the instruction cache.
Since you're interested in performance, it is almost always better to initialize five pointers (you only need one for value, posZ and negZ), and then take *(p++) inside the loop.
input[i+1] [j] [k];
is asking the compiler to generate 3 adds and two multiplies, unless you have a very good optimizer. If your compiler is particularly lazy about register allocation, you also get four memory accesses; otherwise one.
*inputIplusOneJK++
is asking for one add and a memory reference.
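To illustrate, here is a sketch of the k loop rewritten that way, using the question's declarations; the pointer names are mine, and as noted, one pointer serves value, posZ and negZ because those are adjacent elements of the same row:

const int *pC  = &input[i][j][1];     // centre row, starting at k = 1
const int *pXp = &input[i+1][j][1];   // +x neighbour row
const int *pXm = &input[i-1][j][1];   // -x neighbour row
const int *pYp = &input[i][j+1][1];   // +y neighbour row
const int *pYm = &input[i][j-1][1];   // -y neighbour row
int *pOut = &output[i][j][1];
for (int k = 1; k < LENGTH-1; k++)
{
    // pC[1] and pC[-1] are the k+1 and k-1 neighbours of *pC
    *pOut++ = process(*pC, *pXp++, *pXm++, *pYp++, *pYm++, pC[1], pC[-1]);
    ++pC;
}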