Multi-source BFS multithreading - C++

I have a graph represented with an adjacency matrix arr, and a vector source holding the multiple starting vertices.
My idea is to split the source vector into roughly "equal" pieces depending on the number of threads (if it doesn't split evenly, I add the remainder to the last piece), and to create threads that each run the function below; a rough sketch of how I split and launch the threads follows the function. bool used[] is a global array.
I am trying to get (I think it's called) "linear" scaling. I assume the number of starting vertices is at least equal to the number of threads.
If I use a mutex to synchronise the threads it is very inefficient, and if I don't, some vertices get traversed more than once.
Question: is there a data structure that would let me remove the mutex, or another way to implement this algorithm?
mutex m;

void msBFS(bool** arr, int n, vector<int> s, atomic<bool>* used) // s is a different
                                                                 // piece of the original source
{
    queue<int> que;
    for (auto i = 0; i < s.size(); ++i)
    {
        que.push(s[i]);
        used[s[i]] = true;
    }
    while (!que.empty())
    {
        int curr = que.front();
        que.pop();
        cout << curr << " ";
        for (auto i = 0; i < n; ++i)
        {
            lock_guard<mutex> guard(m);
            if (arr[curr][i] == 1 && !used[i] && curr != i)
            {
                que.push(i);
                used[i] = true;
            }
        }
    }
}
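For reference, the splitting and thread launching looks roughly like this (a sketch, not my exact code; source and workers are just illustrative names):

unsigned nThreads = std::thread::hardware_concurrency();
std::size_t chunk = source.size() / nThreads;

std::vector<std::thread> workers;
for (unsigned t = 0; t < nThreads; ++t)
{
    auto first = source.begin() + t * chunk;
    auto last  = (t + 1 == nThreads) ? source.end() : first + chunk; // last piece gets the remainder
    workers.emplace_back(msBFS, arr, n, std::vector<int>(first, last), used);
}
for (auto& w : workers)
    w.join();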

With the atomic<bool> I think you are almost there. The only piece missing is the atomic exchange operation, which lets you read-modify-write as a single atomic operation. For an atomic bool there is usually hardware support for it.
void msBFS(bool** arr, int n, vector<int> s, atomic<bool>* used) // s is a different
                                                                 // piece of the original source
{
    // used[i] initialized to 'false' for all i
    queue<int> que;
    for (auto i = 0; i < s.size(); ++i)
    {
        que.push(s[i]);
        // we don't change used just yet!
    }
    while (!que.empty())
    {
        int curr = que.front();
        que.pop();
        bool alreadyUsed = used[curr].exchange(true);
        if (alreadyUsed) {
            continue; // some other thread is already processing it
        }
        cout << curr << " ";
        for (auto i = 0; i < n; ++i) {
            if (arr[curr][i] == 1 && !used[i] && curr != i) {
                que.push(i);
            }
        }
    }
}
Note that there is one logical change: used[i] is set to true when a thread starts processing the node, not when the node is added to the queue.
At the first attempt to process a node, when used[i] is set to true, alreadyUsed will hold the previous value (false), indicating that no one else started processing the node earlier. At subsequent attempts to process the node, alreadyUsed will already be true and the node will be skipped.
The above approach is not ideal: it is possible for a node to be added many times to a queue before it is processed. Depending on the shape of your graph it may or may not be a problem.
If this is a problem - I would suggest using a three-value used state: not_visited, queued and processed.
static constexpr int not_visited = 0;
static constexpr int queued = 1;
static constexpr int processed = 2;
Now, every time we try to push onto the queue, and every time we try to process a node, we advance the state accordingly. These advancements need to be performed atomically, via compare_exchange_strong, so that each change can happen exactly once. The call to compare_exchange_strong returns true if it succeeded (i.e. the previously contained value actually matched the expected one).
void msBFS(bool** arr, int n, vector<int> s, atomic<int>* used) // s is a different
                                                                // piece of the original source
{
    // used[i] initialized to '0' (not_visited) for all i
    queue<int> que;
    for (auto i = 0; i < s.size(); ++i) {
        int expected = not_visited;
        // we check it even here, because one thread may be seriously lagging
        // behind the others, which have already entered the while loop
        if (used[s[i]].compare_exchange_strong(expected, queued)) {
            que.push(s[i]);
        }
    }
    while (!que.empty())
    {
        int curr = que.front();
        que.pop();
        int expected = queued;
        if (!used[curr].compare_exchange_strong(expected, processed)) {
            continue;
        }
        cout << curr << " ";
        for (auto i = 0; i < n; ++i) {
            if (arr[curr][i] == 1 && curr != i) {
                int expected = not_visited;
                if (used[i].compare_exchange_strong(expected, queued)) {
                    que.push(i);
                }
            }
        }
    }
}
Check the performance. There are many atomic operations, but those are generally cheaper than a mutex. Internally, a mutex also performs atomic operations similar to these, but in addition it may completely block a thread. The code I have shown never blocks (a thread is never suspended); all synchronization is done on the atomic variables only.
Edit: Some possible optimisations for second approach:
I realized that if the transition not_visited -> queued is guaranteed to occur exactly once, the other transition does not even have to be performed, because the node is then present exactly once in exactly one queue anyway. So you may save a few atomic operations and use bool again. Since these cases are rare, though, I don't think it will have much of an impact.
When iterating over the neighbors there is an if statement: if (arr[curr][i] == 1 && curr != i) -- this does not check whether the neighbor was already visited. That is checked only later, through the atomic compare-and-swap. However, you may want to see whether checking it within that if helps anyway: if you find early that used[i] is already queued or processed, you skip the branch and the no-longer-needed atomic compare-and-swap.
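In code, that early check could look roughly like this (sketch only, using the same names as the second approach above):

if (arr[curr][i] == 1 && curr != i && used[i].load() == not_visited) {
    int expected = not_visited;
    if (used[i].compare_exchange_strong(expected, queued))
        que.push(i);
}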
If you want to squeeze every tick from your processor, consider using bitfields, instead of bools for your adjacency matrix and used array. I think the iteration over neighbors, and the conditions can be evaluated with bitwise operations, for 32 bits/neighbors at once. There is even std::atomic<uint32_t>::fetch_or to aid you with updating 32 used at once.
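As a rough sketch of that bit-packed idea (illustrative only; it assumes the simplification from the first point above, i.e. a vertex is claimed exactly once when it is enqueued, and adjRow/usedBits are hypothetical packed versions of the adjacency row and the used array):

void visitNeighborsBitwise(const uint32_t* adjRow,            // packed adjacency row of curr
                           std::atomic<uint32_t>* usedBits,   // packed used flags
                           int nBlocks,
                           std::queue<int>& que)
{
    for (int b = 0; b < nBlocks; ++b)
    {
        uint32_t row = adjRow[b];
        if (row == 0) continue;                       // no neighbours in this 32-bit block
        uint32_t prev  = usedBits[b].fetch_or(row);   // claim all of them in one atomic op
        uint32_t newly = row & ~prev;                 // bits this thread actually claimed
        while (newly)
        {
            int bit = __builtin_ctz(newly);           // lowest set bit (GCC/Clang builtin)
            que.push(b * 32 + bit);
            newly &= newly - 1;                       // clear that bit
        }
    }
}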
Edit2: Possible optimisation for first approach:
Each thread can hold its own localUsed array, which is checked and set to true when pushing to the queue (similarly to what you did in your original code). This array is local to the thread, so no atomics, mutexes etc. are needed. With this simple check you have a guarantee that a given node appears in the queue of each thread at most once; so a node will appear at most N times in total, where N is the number of threads.
I think this is a compromise worth considering between scalability and the memory footprint, and may perform better than the second approach.
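A rough sketch of that combination, building on the first (atomic<bool>) version above; localUsed only prevents a node from entering this thread's queue twice:

void msBFS(bool** arr, int n, vector<int> s, atomic<bool>* used)
{
    vector<char> localUsed(n, 0);      // thread-local, no synchronization needed
    queue<int> que;
    for (auto i = 0; i < s.size(); ++i)
    {
        que.push(s[i]);
        localUsed[s[i]] = 1;
    }
    while (!que.empty())
    {
        int curr = que.front();
        que.pop();
        if (used[curr].exchange(true)) // someone already processed it
            continue;
        cout << curr << " ";
        for (auto i = 0; i < n; ++i)
        {
            if (arr[curr][i] == 1 && curr != i && !localUsed[i])
            {
                localUsed[i] = 1;      // at most once per thread
                que.push(i);
            }
        }
    }
}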


Threads fighting over global variables in class using OpenMP task

I am currently trying to find out whether a clique of size k exists in an undirected graph, using OpenMP to make the code run faster.
This is the code which I am trying to parallelize:
bool Graf::doesCliqueSizeKExistParallel(int k) {
    if (k > n) return false;
    clique_parallel.resize(k);
    bool foundClique = false;
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; i++) {
        if (degree[i] >= k - 1) {
            clique_parallel[0] = i;
            #pragma omp task
            doesCliqueSizeKExistParallelRecursive(i + 1, 1, k, foundClique);
        }
    }
    return foundClique;
}

void Graf::doesCliqueSizeKExistParallelRecursive(int node, int currentLength, int k, bool& foundClique) {
    for (int j(node); j < n; j++) {
        if (degree[j] >= k - 1) {
            clique_parallel[currentLength - 1] = j;
            bool isClique = true;
            for (int i(0); i < currentLength; i++) {
                for (int l(i + 1); l < currentLength; l++) {
                    if (!neighbors[clique_parallel[i]][clique_parallel[l]]) { isClique = false; break; }
                }
                if (!isClique) break;
            }
            if (isClique) {
                if (currentLength < k)
                    doesCliqueSizeKExistParallelRecursive(j + 1, currentLength + 1, k, foundClique);
                else {
                    foundClique = true;
                    return;
                }
            }
        }
    }
}
The problem here, I suppose, is that the variables degree, neighbors and clique_parallel are all shared, and when one thread is writing to one of them, another thread comes along and writes to it as well. The only solution I tried was to pass these three variables by copy to the function, so that each thread has its own, and it didn't work. I am trying not to use #pragma omp taskwait because that would just give the sequential algorithm and there wouldn't be any speedup. Currently I am lost, don't know how to fix this issue (if it is an issue), and don't know what else to try or how to avoid sharing these variables between threads.
Here is the class Graf:
class Graf {
    int n;                          // number of nodes
    vector<vector<int>> neighbors;  // adjacency matrix
    vector<int> degree;             // number of nodes each node is adjacent to
    vector<int> clique_parallel;
    bool directGraph;
    void doesCliqueSizeKExistParallelRecursive(int node, int currentLength, int k, bool& foundClique);
public:
    Graf(int n, bool directGraph = false);
    void addEdge(int i, int j);
    void printGraf();
    bool doesCliqueSizeKExistParallel(int k);
};
So my question is: is the problem in this code that the threads are fighting over the shared variables, or could it be something else? Any help is useful, and if you have any questions regarding the code, I'll answer.
Your observation that omp taskwait turns this into the sequential algorithm is sort of correct. It's in fact worse: it turns your depth-first search into effectively a breadth-first one, which would traverse the whole space.
OK. First of all, use a taskgroup, which has an implicit taskwait at the end of the for loop that generates the tasks.
Next, let your tasks return either false, or the values of the clique they found.
Now for the big trick: once one task has found a solution, call omp cancel taskgroup, which makes you leave the for loop while keeping the values you found. This cancel will kill all other tasks (and their recursively spawned tasks) at that level. And now the magic of recursion kicks in and all groups get cancelled higher up the tree.
I once figured this out for another recursive search problem, but I'm sure you can translate it to your problem: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-examples.html#Treetraversal
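As a rough sketch of how that could look for this clique search (not drop-in code: tryExtendClique is a hypothetical recursive helper that works on a per-task clique buffer instead of the shared clique_parallel, std::atomic needs <atomic>, and OpenMP cancellation has to be enabled at runtime, e.g. with OMP_CANCELLATION=true):

bool Graf::doesCliqueSizeKExistParallel(int k) {
    if (k > n) return false;
    std::atomic<bool> found{false};
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup
        {
            for (int i = 0; i < n && !found; i++) {
                if (degree[i] < k - 1) continue;
                #pragma omp task firstprivate(i) shared(found)
                {
                    std::vector<int> clique(k);        // per-task buffer, nothing shared
                    clique[0] = i;
                    if (tryExtendClique(clique, i + 1, 1, k))
                        found = true;
                    if (found) {
                        #pragma omp cancel taskgroup   // stop the remaining tasks
                    }
                    #pragma omp cancellation point taskgroup
                }
            }
        } // implicit wait for all tasks of the taskgroup
    }
    return found;
}

Giving each task its own clique buffer also removes the data race on clique_parallel that the question worries about.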

Implementation of a lock-free vector

After several searches, I cannot find a lock-free vector implementation.
There is a document that talks about it, but nothing concrete (at least I have not found it): http://pirkelbauer.com/papers/opodis06.pdf
There are currently 2 threads dealing with the arrays; there may be more in a while.
One thread updates the different vectors and another thread accesses them to do calculations, etc. Each thread accesses the arrays a large number of times per second.
I implemented a lock with a mutex on the different vectors, but when the reading or writing thread takes too long to unlock, all further updates are delayed.
I then thought of copying the array each time to go faster, but copying an array of thousands of elements thousands of times per second doesn't seem great to me.
So I thought of using one mutex per value in each array, to lock only the value I am working on.
A lock-free structure could be better, but I cannot find a solution and I wonder whether the performance would really be better.
EDIT:
I have a thread that receives data and stores it in vectors.
When I instantiate the structure, I use a fixed size.
I have to do 2 different things for the updates:
- Update vector elements (a 1D vector which simulates a 2D vector).
- Add a line at the end of the vector and remove the first line. The array always remains sorted. Adding elements is much, much rarer than updating.
The thread that is read-only walks the array and performs calculations.
To limit the time spent on the array and do as little calculation as possible, I use arrays that store the results of my calculations. Despite this, I often have to scan the array to do new calculations or just to update them. (The application is real-time, so the calculations to be made vary according to the requests.)
When a new element is added to the vector, the reading thread must use it directly to update the calculations.
When I say calculation, it is not necessarily only arithmetic; it is more a processing step to be done.
There is no perfect implementation for concurrency; each task has its own "good enough". My go-to method for finding a decent implementation is to only allow what is needed, and then check whether I will need something more in the future.
You described a quite simple scenario: one thread, one action on a shared vector at a time. The vector only needs to tell whether the action is allowed, so std::atomic_flag is good enough.
This example should give you an idea of how it works and what to expect. Mainly, I just attached a flag to each array and checked it beforehand to see whether it is safe to do something; some people like to add a guard around the flag, just in case.
#include <iostream>
#include <thread>
#include <atomic>
#include <chrono>
const int vector_size = 1024;
struct Element {
void some_yield(){
std::this_thread::yield();
};
void some_wait(){
std::this_thread::sleep_for(
std::chrono::microseconds(1)
);
};
};
Element ** data;
std::atomic_flag * vector_safe;
bool alive = true;
uint32_t c_down_time = 0;
uint32_t p_down_time = 0;
uint32_t c_intinerations = 0;
uint32_t p_intinerations = 0;
std::chrono::high_resolution_clock::time_point c_time_point;
std::chrono::high_resolution_clock::time_point p_time_point;
int simple_consumer_work(){
Element a_read;
uint16_t i, e;
while (alive){
// Loops thru the vectors
for (i=0; i < vector_size; i++){
// spins (blocks this thread) until the vector
// at index i is free to read
while (vector_safe[i].test_and_set()) {}
// Do whatever work is needed
for (e=0; e < vector_size; e++){
a_read = data[i][e];
}
// And signal that this vector is done
vector_safe[i].clear();
}
}
return 0;
};
int simple_producer_work(){
uint16_t i;
while (alive){
for (i=0; i < vector_size; i++){
while (vector_safe[i].test_and_set()) {}
data[i][i].some_wait();
vector_safe[i].clear();
}
p_intinerations++;
}
return 0;
};
int consumer_work(){
Element a_read;
uint16_t i, e;
bool waiting;
while (alive){
for (i=0; i < vector_size; i++){
waiting = false;
c_time_point = std::chrono::high_resolution_clock::now();
while (vector_safe[i].test_and_set(std::memory_order_acquire)){
waiting = true;
}
if (waiting){
c_down_time += (uint32_t)std::chrono::duration_cast<std::chrono::nanoseconds>
(std::chrono::high_resolution_clock::now() - c_time_point).count();
}
for (e=0; e < vector_size; e++){
a_read = data[i][e];
}
vector_safe[i].clear(std::memory_order_release);
}
c_intinerations++;
}
return 0;
};
int producer_work(){
bool waiting;
uint16_t i;
while (alive){
for (i=0; i < vector_size; i++){
waiting = false;
p_time_point = std::chrono::high_resolution_clock::now();
while (vector_safe[i].test_and_set(std::memory_order_acquire)){
waiting = true;
}
if (waiting){
p_down_time += (uint32_t)std::chrono::duration_cast<std::chrono::nanoseconds>
(std::chrono::high_resolution_clock::now() - p_time_point).count();
}
data[i][i].some_wait();
vector_safe[i].clear(std::memory_order_release);
}
p_intinerations++;
}
return 0;
};
void print_time(uint32_t down_time){
if ( down_time <= 1000) {
std::cout << down_time << " [nanosecods] \n";
} else if (down_time <= 1000000) {
std::cout << down_time / 1000 << " [microseconds] \n";
} else if (down_time <= 1000000000) {
std::cout << down_time / 1000000 << " [miliseconds] \n";
} else {
std::cout << down_time / 1000000000 << " [seconds] \n";
}
};
int main(){
std::uint16_t i;
std::thread consumer;
std::thread producer;
vector_safe = new std::atomic_flag [vector_size] {ATOMIC_FLAG_INIT};
data = new Element * [vector_size];
for(i=0; i < vector_size; i++){
data[i] = new Element[vector_size]; // each row holds vector_size elements, matching the data[i][e] accesses above
}
consumer = std::thread(consumer_work);
producer = std::thread(producer_work);
std::this_thread::sleep_for(
std::chrono::seconds(10)
);
alive = false;
producer.join();
consumer.join();
std::cout << " Consumer loops > " << c_intinerations << std::endl;
std::cout << " Consumer time lost > "; print_time(c_down_time);
std::cout << " Producer loops > " << p_intinerations << std::endl;
std::cout << " Producer time lost > "; print_time(p_down_time);
for(i=0; i < vector_size; i++){
delete [] data[i];
}
delete [] vector_safe;
delete [] data;
return 0;
}
And don't forget that the compiler can and will rearrange portions of the code; spaghetti code is really, really buggy in multithreading.

Fill an array from different threads concurrently c++

First of all, I think it is important to say that I am new to multithreading and know very little about it. I was trying to write some programs in C++ using threads and ran into a problem (question) that I will try to explain to you now.
I wanted to use several threads to fill an array; here is my code:
static const int num_threads = 5;
int A[50], n;
//------------------------------------------------------------
void ThreadFunc(int tid)
{
    for (int q = 0; q < 5; q++)
    {
        A[n] = tid;
        n++;
    }
}
//------------------------------------------------------------
int main()
{
    thread t[num_threads];
    n = 0;
    for (int i = 0; i < num_threads; i++)
    {
        t[i] = thread(ThreadFunc, i);
    }
    for (int i = 0; i < num_threads; i++)
    {
        t[i].join();
    }
    for (int i = 0; i < n; i++)
        cout << A[i] << endl;
    return 0;
}
As a result of this program I get:
0
0
0
0
0
1
1
1
1
1
2
2
2
2
2
and so on.
As I understand it, the second thread starts writing elements to the array only when the first thread has finished writing all of its elements.
The question is: why don't the threads work concurrently? I mean, why don't I get something like this:
0
1
2
0
3
1
4
and so on.
Is there any way to solve this problem?
Thank you in advance.
Since n is accessed from more than one thread, those accesses need to be synchronized so that changes made in one thread don't conflict with changes made in another. There are (at least) two ways to do this.
First, you can make n an atomic variable. Just change its definition, and do the increment where the value is used:
std::atomic<int> n;
...
A[n++] = tid;
Or you can wrap all the accesses inside a critical section:
std::mutex mtx;

int next_n() {
    std::unique_lock<std::mutex> lock(mtx);
    return n++;
}
And in each thread, instead of directly incrementing n, call that function:
A[next_n()] = tid;
This is much slower than the atomic access, so not appropriate here. In more complex situations it will be the right solution.
The worker function is so short, i.e., finishes executing so quickly, that it's possible that each thread is completing before the next one even starts. Also, you may need to link with a thread library to get real threads, e.g., -lpthread. Even with that, the results you're getting are purely by chance and could appear in any order.
There are two corrections you need to make for your program to be properly synchronized. Change:
int n;
// ...
A[n] = tid; n++;
to
std::atomic_int n;
// ...
A[n++] = tid;
Often it's preferable to avoid synchronization issues altogether and split the workload across threads. Since the work done per iteration is the same here, it's as easy as dividing the work evenly:
void ThreadFunc(int tid, int first, int last)
{
    for (int i = first; i < last; i++)
        A[i] = tid;
}
Inside main, modify the thread create loop:
for (int first = 0, i = 0; i < num_threads; i++) {
    // num_threads may not evenly divide the array size,
    // so the last thread takes whatever remains.
    int last = (i != num_threads - 1) ? std::size(A) / num_threads * (i + 1) : std::size(A);
    t[i] = thread(ThreadFunc, i, first, last);
    first = last;
}
Of course by doing this, even though the array may be written out of order, the values will be stored to the same locations every time.

Negligible Perfomance Boost from p_thread c++

I've been using Mac OS, gcc 4.2.1, and Eclipse to write a program that sorts numbers using a simple merge sort. I've tested the sort extensively and I know it works. I thought, maybe somewhat naively, that because of the way the algorithm divides up the list, I could simply have a thread sort one half and the main thread sort the other half, and then it would take half the time. Unfortunately, it doesn't seem to be working.
Here's the main code:
float x = clock(); //timing
int half = (int)size/2; // size is the length of the list
status = pthread_create(thready,NULL,voidSort,(void *)datay); //start the thread sorting
sortStep(testArray,tempList,half,0,half); //sort using the main thread
int join = pthread_join(*thready,&someptr); //wait for the thread to finish
mergeStep(testArray,tempList,0,half,half-1); //merge the two sublists
if (status != 0) { std::cout << "Could not create thread.\nError: " << status << "\n"; }
if (join != 0) { std::cout << "Could not join thread.\nError: " << join << "\n"; }
float y = clock() - x; //timing
sortStep is the main sorting function, mergeStep merges two sublists within one array (it uses a placeholder array to switch the numbers around), and voidSort is a function I use to pass a struct containing all the arguments for sortStep to the thread. I feel like maybe the main thread is waiting until the new thread is done, but I'm not sure how to overcome that. I'm extremely grateful for any and all help, thank you in advance!
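For reference, the wrapper is roughly like this (a sketch; my real struct and field names differ):

struct SortArgs            // bundles the arguments for sortStep
{
    int *array;
    int *tempList;
    int  length;
    int  start;
    int  half;
};

void *voidSort(void *p)    // pthread entry point: unpack the struct and sort
{
    SortArgs *a = (SortArgs *)p;
    sortStep(a->array, a->tempList, a->length, a->start, a->half);
    return NULL;
}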
EDIT:
Here's the merge step
void mergeStep(int *array, int *tempList, int start, int lengthOne, int lengthTwo) // the merge step of a merge sort
{
    int i = start;
    int j = i + lengthOne;
    int k = 0; // index for the entire templist
    while (k < lengthOne + lengthTwo)
    {
        if (i - start == lengthOne)
        { // list one exhausted
            for (int n = 0; n + j < lengthTwo + lengthOne + start; n++) // add the rest
            {
                tempList[k++] = array[j+n];
            }
            break;
        }
        if (j - (lengthOne + lengthTwo) - start == 0)
        { // list two exhausted
            for (int n = i; n < start + lengthOne; n++) // add the rest
            {
                tempList[k++] = array[n];
            }
            break;
        }
        if (array[i] > array[j]) // figure out which variable should go first
        {
            tempList[k] = array[j++];
        }
        else
        {
            tempList[k] = array[i++];
        }
        k++;
    }
    for (int s = 0; s < lengthOne + lengthTwo; s++) // add the templist into the original
    {
        array[start + s] = tempList[s];
    }
}
-Will
The overhead of creating threads is quite large, so unless you have a large amount (to be determined) of data to sort, you're better off sorting it in the main thread.
The mergeStep also counts towards the part of the code that can't be parallelized; remember Amdahl's law.
If you don't have a coarsening step as the last part of your sortStep, then once you get below 8-16 elements much of your performance will go up in function-call overhead. The coarsening step should be done by a simpler sort, such as insertion sort or a sorting network; a small sketch follows.
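A possible coarsening step (the cutoff value is something to tune, and how it plugs into your sortStep signature is up to you):

static const int CUTOFF = 16;   // try values between about 8 and 32

// simple insertion sort on array[start, start+length)
void insertionSort(int *array, int start, int length)
{
    for (int i = start + 1; i < start + length; ++i)
    {
        int key = array[i];
        int j = i - 1;
        while (j >= start && array[j] > key)
        {
            array[j + 1] = array[j];
            --j;
        }
        array[j + 1] = key;
    }
}

// at the top of sortStep, bail out early for small pieces:
//     if (length <= CUTOFF) { insertionSort(array, start, length); return; }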
Unless you are sorting a large enough amount of data, the actual timing difference could drown in measurement uncertainty.

C++ openmp much slower than serial implementation

I am doing a thermodynamic simulation on a two-dimensional array. The array is 1024x1024. The while loop iterates a specified number of times or until goodTempChange is false. goodTempChange is set true or false based on whether the change in temperature of a block is greater than a defined EPSILON value. If every block in the array is below that value, then the plate is in stasis.

The program works; I have no problems with the code. My problem is that the serial code is absolutely blowing the OpenMP code out of the water, and I don't know why. I have tried removing everything except the average calculation, which is just the average of the 4 blocks up, down, left and right of the desired square, and it is still getting destroyed by the serial code. I've never used OpenMP before and I looked up some things online to do what I have. I have the variables within critical regions in the most efficient way I could see possible, and I have no race conditions. I really don't see what is wrong. Any help would be greatly appreciated. Thanks.
while(iterationCounter < DESIRED_ITERATIONS && goodTempChange) {
goodTempChange = false;
if((iterationCounter % 1000 == 0) && (iterationCounter != 0)) {
cout << "Iteration Count Highest Change Center Plate Temperature" << endl;
cout << "-----------------------------------------------------------" << endl;
cout << iterationCounter << " "
<< highestChange << " " << newTemperature[MID][MID] << endl;
cout << endl;
}
highestChange = 0;
if(iterationCounter != 0)
memcpy(oldTemperature, newTemperature, sizeof(oldTemperature));
for(int i = 1; i < MAX-1; i++) {
#pragma omp parallel for schedule(static)
for(int j = 1; j < MAX-1; j++) {
bool tempGoodChange = false;
double tempHighestChange = 0;
newTemperature[i][j] = (oldTemperature[i-1][j] + oldTemperature[i+1][j] +
oldTemperature[i][j-1] + oldTemperature[i][j+1]) / 4;
if((iterationCounter + 1) % 1000 == 0) {
if(abs(oldTemperature[i][j] - newTemperature[i][j]) > highestChange)
tempHighestChange = abs(oldTemperature[i][j] - newTemperature[i][j]);
if(tempHighestChange > highestChange) {
#pragma omp critical
{
if(tempHighestChange > highestChange)
highestChange = tempHighestChange;
}
}
}
if(abs(oldTemperature[i][j] - newTemperature[i][j]) > EPSILON
&& !tempGoodChange)
tempGoodChange = true;
if(tempGoodChange && !goodTempChange) {
#pragma omp critical
{
if(tempGoodChange && !goodTempChange)
goodTempChange = true;
}
}
}
}
iterationCounter++;
}
Trying to get rid of those critical sections may help. For example:
#pragma omp critical
{
if(tempHighestChange > highestChange)
{
highestChange = tempHighestChange;
}
}
Here, you can store the highestChange computed by each thread in a local variable and, when the parallel section finishes, take the maximum of the per-thread values.
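A minimal sketch of that idea using OpenMP's built-in max reduction (requires an OpenMP 3.1 or newer compiler; the loop bounds and names follow the question's code, and std::abs needs <cmath>):

double highestChange = 0;
#pragma omp parallel for reduction(max:highestChange)
for (int i = 1; i < MAX-1; i++) {
    for (int j = 1; j < MAX-1; j++) {
        newTemperature[i][j] = (oldTemperature[i-1][j] + oldTemperature[i+1][j] +
                                oldTemperature[i][j-1] + oldTemperature[i][j+1]) / 4;
        double change = std::abs(oldTemperature[i][j] - newTemperature[i][j]);
        if (change > highestChange)   // thread-local copy until the implicit join
            highestChange = change;
    }
}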
Here is my attempt (not tested).
double**newTemperature;
double**oldTemperature;
while(iterationCounter < DESIRED_ITERATIONS && goodTempChange) {
if((iterationCounter % 1000 == 0) && (iterationCounter != 0))
std::cout
<< "Iteration Count Highest Change Center Plate Temperature\n"
<< "---------------------------------------------------------------\n"
<< iterationCounter << " "
<< highestChange << " "
<< newTemperature[MID][MID] << '\n' << std::endl;
goodTempChange = false;
highestChange = 0;
// swap pointers to arrays (but not the arrays themselves!)
if(iterationCounter != 0)
    std::swap(newTemperature, oldTemperature);
bool CheckTempChange = (iterationCounter + 1) % 1000 == 0;
#pragma omp parallel
{
bool localGoodChange = false;
double localHighestChange = 0;
#pragma omp for
for(int i = 1; i < MAX-1; i++) {
//
// note that putting a second
// #pragma omp for
// here has (usually) zero effect. this is called nested parallelism and
// usually not implemented, thus the new nested team of threads has only
// one thread.
//
for(int j = 1; j < MAX-1; j++) {
newTemperature[i][j] = 0.25 * // multiply is faster than divide
(oldTemperature[i-1][j] + oldTemperature[i+1][j] +
oldTemperature[i][j-1] + oldTemperature[i][j+1]);
if(CheckTempChange)
localHighestChange =
std::max(localHighestChange,
std::abs(oldTemperature[i][j] - newTemperature[i][j]));
localGoodChange = localGoodChange ||
std::abs(oldTemperature[i][j] - newTemperature[i][j]) > EPSILON;
// shouldn't this be < EPSILON? in the previous line?
}
}
//
// note that we have moved the critical sections out of the loops to
// avoid any potential issues with contentions (on the mutex used to
// implement the critical section). Also note that I named the sections,
// allowing simultaneous update of goodTempChange and highestChange
//
if(!goodTempChange && localGoodChange)
#pragma omp critical(TempChangeGood)
goodTempChange = true;
if(CheckTempChange && localHighestChange > highestChange)
#pragma omp critical(TempChangeHighest)
highestChange = std::max(highestChange,localHighestChange);
}
iterationCounter++;
}
There are several changes to your original:
The outer instead of the inner of the nested for loops is performed in parallel. This should make a significant difference.
added in edit: It appears from the comments that you don't understand the significance of this, so let me explain. In your original code, the outer loop (over i) was executed only by the master thread. For every i, a team of threads was created to perform the inner loop over j in parallel. This creates a synchronisation overhead (with significant imbalance) at every i! If one instead parallelises the outer loop over i, this overhead is encountered only once and each thread runs the entire inner loop over j for its share of i. Thus, always parallelising the outermost loop possible is basic wisdom for multi-threaded coding.
The double for loop is inside a parallel region, to reduce the critical-region calls to one per thread per while iteration. You may also consider putting the whole while loop inside a parallel region.
I also swap between two arrays (similar to what is suggested in other answers) to avoid the memcpy, but this shouldn't really be performance critical.
added in edit: std::swap(newTemperature,oldTemperature) only swaps the pointer values and not the memory pointed to, of course, that's the point.
Finally, don't forget that the proof of the pudding is in the eating: just try what difference it makes to have the #pragma omp for in front of the inner or the outer loop. Always do experiments like this before asking on SO -- otherwise you can rightly be accused of not having done sufficient research.
I assume that you are concerned with the time taken by the entire code inside the while loop, not just by the time taken by the loop beginning for(int i = 1; i < MAX-1; i++).
This operation
if(iterationCounter != 0)
{
memcpy(oldTemperature, newTemperature, sizeof(oldTemperature));
}
is unnecessary and, for large arrays, may be enough to kill performance. Instead of maintaining 2 arrays, old and new, maintain one 3D array with two planes. Create two integer variables, let's call them oldIdx and newIdx (new is a reserved keyword in C++), and set them to 0 and 1 initially. Replace
newTemperature[i][j] = ((oldTemperature[i-1][j] + oldTemperature[i+1][j] + oldTemperature[i][j-1] + oldTemperature[i][j+1]) / 4);
by
temperature[newIdx][i][j] =
    (temperature[oldIdx][i-1][j] +
     temperature[oldIdx][i+1][j] +
     temperature[oldIdx][i][j-1] +
     temperature[oldIdx][i][j+1]) / 4;
and, at the end of the update, swap the values of oldIdx and newIdx so that the updates go the other way round. I'll leave it to you to determine whether the old/new plane index should be the first index into your array or the last. This approach eliminates the need to move (large amounts of) data around in memory.
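For concreteness, a small sketch of the two-plane layout and the index swap (oldIdx/newIdx are illustrative names; not tested against the rest of the code):

double temperature[2][MAX][MAX];  // two planes: one "old", one "new"
int oldIdx = 0, newIdx = 1;

while (/* iteration and stasis conditions as before */) {
    for (int i = 1; i < MAX-1; i++)
        for (int j = 1; j < MAX-1; j++)
            temperature[newIdx][i][j] =
                (temperature[oldIdx][i-1][j] + temperature[oldIdx][i+1][j] +
                 temperature[oldIdx][i][j-1] + temperature[oldIdx][i][j+1]) / 4;

    std::swap(oldIdx, newIdx);    // O(1) index swap instead of copying the whole array
}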
Another possible cause of serious slowdown, or failure to accelerate, is covered in this SO question and answer. Whenever I see arrays with sizes of 2^n I suspect cache issues.