Does the compiler detect a wrongly shared variable? - c++

While preparing some code samples for a small presentation of OpenMP to my teammates, I found a weird case. First I wrote a classic loop:
void sequential(int *a, int size, int *b)
{
    int i;
    for (i = 0; i < size; i++) {
        b[i] = a[i] * i;
    }
}
A correct OpenMP usage of the for directive is simple: we just have to move the declaration of int i into the for-loop scope to make it private.
void parallel_for(int *a, int size, int *b)
{
    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        b[i] = a[i] * i;
    }
}
But when I wrote the following function, I expected a different result from the other two, because the int j is shared: it is declared outside the for-loop scope. Yet with my test framework I don't see the error I expected, i.e. an incorrect value in the output of this function.
void parallel_for_with_an_usage_error(int *a, int size, int *b)
{
    int j;
    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        /*int*/ j = a[i]; // To be correct, j should be declared here, in-loop, to be private!
        j *= i;
        b[i] = j;
    }
}
I have complete source code for testing; it builds with VS 2012 and gcc (with C++11 enabled): http://pastebin.com/NJ4L0cbV
Do you have an idea of what the compiler does? Does it detect the incorrect sharing, or does it move int j into the loop as an optimization heuristic?
Thanks.

In my opinion, what might be happening is that the compiler is doing some optimization. Since in the pasted code (without the cout) the variable j is not used anywhere other than inside the loop, the compiler can effectively move the declaration of j into the loop in the generated assembly code.
Another possibility is that the compiler converts the three statements inside the loop into a single statement, i.e. from
/*int*/ j = a[i]; //To be correct j should be declared here, in-loop to be private !
j *= i;
b[i] = j;
to,
b[i] = a[i] * i;
The compiler will do this optimization regardless of whether it is OpenMP code or not.
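Either way, the code only happens to work; to be correct by construction, j must be made private explicitly. A minimal sketch of one way to do that with a private clause (the function name is just for illustration):

void parallel_for_private_j(int *a, int size, int *b)
{
    int j;
    // private(j) gives each thread its own copy of j, so correctness no
    // longer depends on what the optimizer happens to do with the shared j.
    #pragma omp parallel for private(j)
    for (int i = 0; i < size; i++) {
        j = a[i];
        j *= i;
        b[i] = j;
    }
}

Declaring j inside the loop body, as the comment in the question suggests, is equivalent and usually preferable.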

Related

OpenMP Segmentation Fault in C++

I have a very straightforward function that counts how many inner entries of an N by N 2D matrix (represented by a pointer arr) are below a certain threshold, and updates a counter below_threshold that is passed by reference:
void count(float *arr, const int N, const float threshold, int &below_threshold) {
    below_threshold = 0; // make sure it is reset
    bool comparison;
    float temp;
    #pragma omp parallel for shared(arr, N, threshold) private(temp, comparison) reduction(+:below_threshold)
    for (int i = 1; i < N-1; i++) // count only the inner N-2 rows
    {
        for (int j = 1; j < N-1; j++) // count only the inner N-2 columns
        {
            temp = *(arr + i*N + j);
            comparison = (temp < threshold);
            below_threshold += comparison;
        }
    }
}
When I do not use OpenMP, it runs fine (thus, the allocation and initialization were done correctly already).
When I use OpenMP with an N that is less than around 40000, it runs fine.
However, once I start using a larger N with OpenMP, it keeps giving me a segmentation fault (I am currently testing with N = 50000 and would like to eventually get it up to ~100000).
Is there something wrong with this at a software level?
P.S. The allocation was done dynamically ( float *arr = new float [N*N] ), and here is the code used to randomly initialize the entire matrix, which didn't have any issues with OpenMP with large N:
void initialize(float *arr, const int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            *(arr + i*N + j) = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        }
    }
}
UPDATE:
I have tried changing i, j, and N to long long int, and it still has not fixed my segmentation fault. If this was the issue, why has it already worked without OpenMP? It is only once I add #pragma omp ... that it fails.
I think it is because your value (50000*50000 = 2500000000) exceeds INT_MAX (2147483647) in C++, so the index expression i*N + j overflows and the array access behaviour is undefined.
So you should do the index arithmetic in an unsigned or wider type that suits your use case (unsigned int is enough for 50000*50000, but for N around 100000 you need something like size_t or long long).
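A minimal sketch of the counting loop with the index arithmetic done in a 64-bit type (this sketch also accumulates into a long long local and widens below_threshold, since for N around 100000 even the count itself can exceed INT_MAX; the allocation should likewise be new float[static_cast<size_t>(N) * N]):

void count(float *arr, const int N, const float threshold, long long &below_threshold) {
    long long local_count = 0;
    #pragma omp parallel for reduction(+:local_count)
    for (int i = 1; i < N-1; i++) {
        for (int j = 1; j < N-1; j++) {
            // Cast before multiplying so i*N + j is computed in 64 bits, not in (overflowing) int.
            float temp = arr[static_cast<long long>(i) * N + j];
            if (temp < threshold)
                local_count++;
        }
    }
    below_threshold = local_count;
}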

Declaring and Initializing C++ Static variables

Note: This is coming from a beginner.
Why does the first one output 333 and the second one 345?
In the second code, does it skip over the declaration? I mean, it should initialise the "j" variable again, like in the first one.
int main() {
    static int j = 12;
    for (int i = 0; i <= 2; i++) {
        j = 2;
        j += 1;
        std::cout << j;
    }
    return 0;
}
int main() {
    for (int i = 0; i <= 2; i++) {
        static int j = 2;
        j += 1;
        std::cout << j;
    }
    return 0;
}
static variables are initialized only once (when first used). After that, the declaration behaves as if the variable had been defined elsewhere (i.e. the initialization is skipped). In the second main, the initialization static int j = 2; is executed only on the first iteration; after that you are simply incrementing the same j each time, so it prints 3, 4, 5.
In the first loop, you assign j's value on each iteration; assignment is not the same as initialization, so it runs every time and the output is 3, 3, 3.
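For illustration, a minimal sketch (a hypothetical example, not taken from the question) showing that a function-local static is initialized only once across calls:

#include <iostream>

int next_id() {
    static int id = 100; // the initialization runs only on the first call
    return ++id;
}

int main() {
    std::cout << next_id() << next_id() << next_id(); // prints 101102103
    return 0;
}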

Add OpenMP to a program that calculates the determinant of an n x n matrix

Here is code to find the determinant of an n x n matrix.
#include <iostream>
using namespace std;

int determinant(int *matrix[], int size);
void ijMinor(int *matrix[], int *minorMatrix[], int size, int row, int column);

int main()
{
    int size;
    cout << "What is the size of the matrix for which you want to find the determinant?:\t";
    cin >> size;

    int **matrix;
    matrix = new int*[size];
    for (int i = 0; i < size; i++)
        matrix[i] = new int[size];

    cout << "\nEnter the values of the matrix separated by spaces:\n\n";
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            cin >> matrix[i][j];

    cout << "\nThe determinant of the matrix is:\t" << determinant(matrix, size) << endl;
    return 0;
}

int determinant(int *matrix[], int size){
    if (size == 1) return matrix[0][0];
    else {
        int result = 0, sign = -1;
        for (int j = 0; j < size; j++){
            int **minorMatrix;
            minorMatrix = new int*[size-1];
            for (int k = 0; k < size-1; k++)
                minorMatrix[k] = new int[size-1];
            ijMinor(matrix, minorMatrix, size, 0, j);
            sign *= -1;
            result += sign * matrix[0][j] * determinant(minorMatrix, size-1);
            for (int i = 0; i < size-1; i++){
                delete minorMatrix[i];
            }
        }
        return result;
    }
}

void ijMinor(int *matrix[], int *minorMatrix[], int size, int row, int column){
    for (int i = 0; i < size; i++){
        for (int j = 0; j < size; j++){
            if (i < row){
                if (j < column) minorMatrix[i][j] = matrix[i][j];
                else if (j == column) continue;
                else minorMatrix[i][j-1] = matrix[i][j];
            }
            else if (i == row) continue;
            else {
                if (j < column) minorMatrix[i-1][j] = matrix[i][j];
                else if (j == column) continue;
                else minorMatrix[i-1][j-1] = matrix[i][j];
            }
        }
    }
}
After adding OpenMP pragmas, I've changed the determinant function and now it looks like this:
int determinant(int *matrix[], int size){
    if (size == 1) return matrix[0][0];
    else {
        int result = 0, sign = -1;
        #pragma omp parallel for default(none) shared(size,matrix,sign) private(j,k) reduction(+ : result)
        for (int j = 0; j < size; j++){
            int **minorMatrix;
            minorMatrix = new int*[size-1];
            for (int k = 0; k < size-1; k++)
                minorMatrix[k] = new int[size-1];
            ijMinor(matrix, minorMatrix, size, 0, j);
            sign *= -1;
            result += sign * matrix[0][j] * determinant(minorMatrix, size-1);
            for (int i = 0; i < size-1; i++){
                delete minorMatrix[i];
            }
        }
        return result;
        delete [] matrix;
    }
}
My problem is that the result is different every time. Sometimes it gives the correct value, but most often it is wrong. I think it's because of the sign variable. I am following the cofactor (Laplace) expansion along the first row, det(A) = sum over j of (-1)^j * a[0][j] * det(M[0][j]).
As you can see, in every iteration of my for loop there should be a different sign, but when I use OpenMP something goes wrong. How can I make this program run correctly with OpenMP?
Finally, my second issue is that using OpenMP does not make the program run quicker than without OpenMP. I also tried to make a 100,000 x 100,000 matrix, but my program reports an error about allocating memory. How can I run this program with very large matrices?
Your issues, as I see them, are as follows:
1) As noted by Hristo, your threads are stomping over each other's data with respect to the sign variable. It should be private to each thread so that they have full read/write access to it without having to worry about race conditions. Then, you simply need an algorithm to compute whether sign is plus or minus 1 depending on the iteration j independently from the other iterations. With a little thinking, you'll see that Hristo's suggestion is correct: sign = (j % 2) ? -1 : 1; should do the trick.
2) Your determinant() function is recursive. As written, that means that in every iteration of the loop, after forming your minors, you call the function again on that minor. Therefore, a single thread performs its iteration, enters the recursive function, and then tries to split itself into nthreads more threads. You can now see how you are oversubscribing your system by launching many more threads than you physically have cores. Two easy solutions:
Call your original serial function from within the omp parallel code (as sketched after this list). This is the fastest way to do it, because it avoids any OpenMP start-up overhead.
Turn off nested parallelism, either by calling omp_set_nested(0); before your first call to determinant(), or by adding an if clause to your parallel for directive so that it only goes parallel at the top level: if(!omp_in_parallel())
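A minimal sketch of the first option, combined with the per-iteration sign computation from point 1 (determinant_serial is a hypothetical name standing for your original, unparallelized determinant function):

int determinant(int *matrix[], int size){
    if (size == 1) return matrix[0][0];
    int result = 0;
    #pragma omp parallel for reduction(+ : result)
    for (int j = 0; j < size; j++){
        int **minorMatrix = new int*[size-1];
        for (int k = 0; k < size-1; k++)
            minorMatrix[k] = new int[size-1];
        ijMinor(matrix, minorMatrix, size, 0, j);
        int sign = (j % 2) ? -1 : 1;   // depends only on j, so no race between iterations
        result += sign * matrix[0][j] * determinant_serial(minorMatrix, size-1); // recurse serially
        for (int i = 0; i < size-1; i++)
            delete [] minorMatrix[i];
        delete [] minorMatrix;
    }
    return result;
}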
3) Your memory issues arise because in every iteration of your recursion you allocate more memory. If you fix problem #2, then you should be using comparable amounts of memory in the serial and parallel cases. That being said, it would be much better to allocate all the memory you want before entering your algorithm. Allocating large chunks of memory (and then freeing them!), especially in parallel, is a terrible bottleneck in your code.
Compute the amount of memory you would need (on paper) before entering the first loop and allocate it all at once. I would also strongly suggest you consider allocating your memory contiguously (aka in 1D) to take better advantage of caching as well. Remember that each thread should have its own separate area to work with. Then, change your function to:
int determinant(int *matrix, int *startOfMyWorkspace, int size).
Instead of allocating a new (size-1)x(size-1) matrix inside of your loop, you would simply utilize the next (size-1)*(size-1) integers of your workspace, update what startOfMyWorkspace would be for the next recursive call, and continue along.
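A rough sketch of that idea, assuming the matrix is stored contiguously in row-major order and workspace was pre-allocated large enough for one recursion chain (the sum of k*k integers for k = 1 .. size-1; in the parallel version each thread would get its own disjoint slice):

int determinant(int *matrix, int *workspace, int size){
    if (size == 1) return matrix[0];
    int result = 0;
    for (int j = 0; j < size; j++){
        int *minor = workspace; // the (size-1) x (size-1) minor, row-major, built in place
        for (int r = 1; r < size; r++)
            for (int c = 0, cc = 0; c < size; c++)
                if (c != j) minor[(r-1)*(size-1) + cc++] = matrix[r*size + c];
        int sign = (j % 2) ? -1 : 1;
        // the recursive call uses the workspace past the region holding this minor
        result += sign * matrix[j]
                * determinant(minor, workspace + (size-1)*(size-1), size-1);
    }
    return result;
}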

2D Vector Optimization

This is only a compact test case but I have a vector of doubles and I want to populate a square matrix (2D vector) of all pairwise differences. When compiled with -O3 optimization, this takes about 1.96 seconds on my computer (computed only from the nested double for-loop).
#include <vector>
using namespace std;

int main(){
    vector<double> a;
    vector<vector<double> > b;
    unsigned int i, j;
    unsigned int n;
    double d;

    n = 10000; // In practice, this value is MUCH bigger
    a.resize(n);
    for (i = 0; i < n; i++){
        a[i] = static_cast<double>(i);
    }

    b.resize(n);
    for (i = 0; i < n; i++){
        b[i].resize(n);
        b[i][i] = 0.0; // Zero diagonal
    }

    for (i = 0; i < n; i++){
        for (j = i+1; j < n; j++){
            d = a[i] - a[j];
            // Commenting out the next two lines makes the code significantly faster
            b[i][j] = d;
            b[j][i] = d;
        }
    }
    return 0;
}
However, when I comment out the two lines:
b[i][j]=d;
b[j][i]=d;
The program completes in about 0.000003 seconds (computed only from the nested double for-loop)! I really didn't expect these two lines to be the rate-limiting step. I've been staring at this code for a while and I'm out of ideas. Can anybody please offer any suggestions as to how I could optimize this simple piece of code so that the time can be significantly reduced?
When you comment out those two lines, all that's left in the nested loop is to keep computing d and then throwing away the result. Since this can't have any effect on the behaviour of the program, the compiler will just optimize out the nested loop. That's why the program finishes almost instantaneously.
In fact, I confirmed this by compiling the code twice with g++ -O3, once with only the d=a[i]-a[j] statement in the nested loop, and once with the nested loop deleted entirely. The code emitted was identical.
Nevertheless, your code is currently slower than it has to be, because it misses the cache. When you access a two-dimensional array in a nested loop like this, you should always arrange for the iteration to move contiguously through memory if possible. This means that the second index should be the one that varies fastest. The access to b[j][i] violates this rule and misses the cache. So let's rewrite.
Before:
for (i = 0; i < n; i++){
    for (j = i+1; j < n; j++){
        d = a[i] - a[j];
        b[i][j] = d;
        b[j][i] = d;
    }
}
Timing:
real 0m1.026s
user 0m0.824s
sys 0m0.196s
After:
for (i = 0; i < n; i++) {
    for (j = 0; j < i; j++) {
        b[i][j] = a[j] - a[i];
    }
    for (j = i+1; j < n; j++) {
        b[i][j] = a[i] - a[j];
    }
}
Timing:
real 0m0.335s
user 0m0.164s
sys 0m0.164s

I want to optimize this short loop

I would like to optimize this simple loop:
unsigned int i;
while (j-- != 0) { // j is an unsigned int with a start value of about N = 36,000,000
    float sub = 0;
    i = 1;
    unsigned int c = j + s[1];
    while (c < N) {
        sub += d[i][j] * x[c]; // d[][] and x[] are arrays of float
        i++;
        c = j + s[i];          // s[] is an array of unsigned int with 6 entries
    }
    x[j] -= sub; // only one memory write per j
}
The loop has an execution time of about one second on a 4000 MHz AMD Bulldozer. I thought about SIMD and OpenMP (which I normally use to get more speed), but this loop carries a dependence: each x[j] depends on x values updated in earlier iterations.
Any suggestions?
I think you may want to transpose the matrix d -- that is, store it in such a way that the indices can be exchanged, so that the inner-loop variable i indexes the contiguous dimension:
sub += d[j][i]*x[c];
instead of
sub += d[i][j]*x[c];
This should result in better cache performance.
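A minimal sketch of building such a transposed, flattened copy once up front (transpose_diagonals and dt are hypothetical names; this assumes every d[i] holds N valid entries):

#include <vector>
#include <cstddef>

// dt[j*M + i] == d[i][j], so for a fixed j the inner loop over i walks
// through memory contiguously.
std::vector<float> transpose_diagonals(float **d, unsigned int M, unsigned int N) {
    std::vector<float> dt((std::size_t)N * M);
    for (unsigned int i = 0; i < M; i++)
        for (unsigned int j = 0; j < N; j++)
            dt[(std::size_t)j * M + i] = d[i][j];
    return dt;
}

The hot loop would then read dt[(std::size_t)j*M + i] instead of d[i][j].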
I agree with transposing for better caching (but see my comments on that at the end), and there's more to do, so let's see what we can do with the full function...
Original function, for reference (with some tidying for my sanity):
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *x, float *b){
    // We want to solve L D Lt x = b, where D is a diagonal matrix described by Diagonals[0]
    // and L is a unit lower triangular matrix described by the rest of the diagonals.
    // Let D Lt x = y. Then, first solve L y = b.
    float *y = new float[n];
    float **d = IncompleteCholeskyFactorization->Diagonals;
    unsigned int *s = IncompleteCholeskyFactorization->StartRows;
    unsigned int M = IncompleteCholeskyFactorization->m;
    unsigned int N = IncompleteCholeskyFactorization->n;
    unsigned int i, j;

    for (j = 0; j != N; j++){
        float sub = 0;
        for (i = 1; i != M; i++){
            int c = (int)j - (int)s[i];
            if (c < 0) break;
            if (c == j) {
                sub += d[i][c] * b[c];
            } else {
                sub += d[i][c] * y[c];
            }
        }
        y[j] = b[j] - sub;
    }

    // Now, solve x from D Lt x = y  ->  Lt x = D^-1 y
    // Took this one out of the while loop, so it can be parallelized now,
    // which speeds things up, because division is expensive.
    #pragma omp parallel for
    for (j = 0; j < N; j++){
        x[j] = y[j] / d[0][j];
    }

    while (j-- != 0){
        float sub = 0;
        for (i = 1; i != M; i++){
            if (j + s[i] >= N) break;
            sub += d[i][j] * x[j + s[i]];
        }
        x[j] -= sub;
    }
    delete[] y;
}
Because of the comment about parallel divide giving a speed boost (despite being only O(N)), I'm assuming the function itself gets called a lot. So why allocate memory? Just mark x as __restrict__ and change y to x everywhere (__restrict__ is a GCC extension, taken from C99. You might want to use a define for it. Maybe the library already has one).
Similarly, though I guess you can't change the signature, you can make the function take only a single parameter and modify it. b is never used when x or y have been set. That would also mean you can get rid of the branch in the first loop which runs ~N*M times. Use memcpy at the start if you must have 2 parameters.
And why is d an array of pointers? Must it be? This seems too deep in the original code, so I won't touch it, but if there's any possibility of flattening the stored array, it will be a speed boost even if you can't transpose it (multiply, add, dereference is faster than dereference, add, dereference).
So, new code:
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *__restrict__ x){
    // comments removed so that suggestions are more visible. Don't remove them in the real code!
    // these definitions got long. Feel free to remove const; it does nothing for the optimiser
    const float *const __restrict__ *const __restrict__ d = IncompleteCholeskyFactorization->Diagonals;
    const unsigned int *const __restrict__ s = IncompleteCholeskyFactorization->StartRows;
    const unsigned int M = IncompleteCholeskyFactorization->m;
    const unsigned int N = IncompleteCholeskyFactorization->n;
    unsigned int i;
    unsigned int j;

    for (j = 0; j < N; j++){ // don't use != as an optimisation; compilers can do more with <
        float sub = 0;
        for (i = 1; i < M && j >= s[i]; i++){
            const unsigned int c = j - s[i];
            sub += d[i][c] * x[c];
        }
        x[j] -= sub;
    }

    // Consider using processor-specific optimisations for this
    #pragma omp parallel for
    for (j = 0; j < N; j++){
        x[j] /= d[0][j];
    }

    for (j = N; (j--) > 0; ){ // changed for clarity
        float sub = 0;
        for (i = 1; i < M && j + s[i] < N; i++){
            sub += d[i][j] * x[j + s[i]];
        }
        x[j] -= sub;
    }
}
Well it's looking tidier, and the lack of memory allocation and reduced branching, if nothing else, is a boost. If you can change s to include an extra UINT_MAX value at the end, you can remove more branches (both the i<M checks, which again run ~N*M times).
Now we can't make any more loops parallel, and we can't combine loops. The boost now will be, as suggested in the other answer, to rearrange d. Except… the work required to rearrange d has exactly the same cache issues as the work to do the loop. And it would need memory allocated. Not good. The only options to optimise further are: change the structure of IncompleteCholeskyFactorization->Diagonals itself, which will probably mean a lot of changes, or find a different algorithm which works better with data in this order.
If you want to go further, your optimisations will need to impact quite a lot of the code (not a bad thing; unless there's a good reason for Diagonals being an array of pointers, it seems like it could do with a refactor).
I want to give an answer to my own question: the bad performance was caused by cache conflict misses, due to the fact that (at least) Win7 aligns big memory blocks to the same boundary. In my case, all buffers had the same alignment (buffer address % 4096 was the same for all buffers), so they fell into the same cache sets of the L1 cache. I changed the memory allocation to align the buffers to different boundaries, which avoids the conflict misses and gives a speedup of a factor of 2. Thanks for all the answers, especially the answers from Dave!
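For reference, a minimal sketch of one way to stagger the buffer start addresses (the 64-byte step, the modulus, and the function name are illustrative assumptions, not taken from the original code):

#include <cstdlib>
#include <cstddef>
#include <new>

// Give each buffer a different offset within a 4096-byte page so that the
// buffers do not all map to the same L1 cache sets.
float *alloc_staggered(std::size_t count, int bufferIndex) {
    const std::size_t offset = 64 * static_cast<std::size_t>(bufferIndex % 16);
    char *raw = static_cast<char*>(std::malloc(count * sizeof(float) + offset));
    if (!raw) throw std::bad_alloc();
    // Keep raw (or recompute it from the returned pointer) in order to free the buffer later.
    return reinterpret_cast<float*>(raw + offset);
}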