C++ openmp much slower than serial implementation

I am doing a thermodynamic simulation on a two-dimensional array. The array is 1024x1024. The while loop iterates a specified number of times or until goodTempChange is false. goodTempChange is set true or false based on whether the change in temperature of a block is greater than a defined EPSILON value; if every block in the array is below that value, the plate is in stasis. The program works and I have no problems with the code itself. My problem is that the serial code is absolutely blowing the OpenMP code out of the water, and I don't know why. I have tried removing everything except the average calculation, which is just the average of the 4 blocks up, down, left, and right of the desired square, and it is still getting destroyed by the serial code. I've never used OpenMP before and I looked up some things online to do what I have. I have the variables within critical regions in the most efficient way I could see possible, and I have no race conditions. I really don't see what is wrong. Any help would be greatly appreciated. Thanks.
while(iterationCounter < DESIRED_ITERATIONS && goodTempChange) {
    goodTempChange = false;

    if((iterationCounter % 1000 == 0) && (iterationCounter != 0)) {
        cout << "Iteration Count Highest Change Center Plate Temperature" << endl;
        cout << "-----------------------------------------------------------" << endl;
        cout << iterationCounter << " "
             << highestChange << " " << newTemperature[MID][MID] << endl;
        cout << endl;
    }

    highestChange = 0;

    if(iterationCounter != 0)
        memcpy(oldTemperature, newTemperature, sizeof(oldTemperature));

    for(int i = 1; i < MAX-1; i++) {
        #pragma omp parallel for schedule(static)
        for(int j = 1; j < MAX-1; j++) {
            bool tempGoodChange = false;
            double tempHighestChange = 0;

            newTemperature[i][j] = (oldTemperature[i-1][j] + oldTemperature[i+1][j] +
                                    oldTemperature[i][j-1] + oldTemperature[i][j+1]) / 4;

            if((iterationCounter + 1) % 1000 == 0) {
                if(abs(oldTemperature[i][j] - newTemperature[i][j]) > highestChange)
                    tempHighestChange = abs(oldTemperature[i][j] - newTemperature[i][j]);

                if(tempHighestChange > highestChange) {
                    #pragma omp critical
                    {
                        if(tempHighestChange > highestChange)
                            highestChange = tempHighestChange;
                    }
                }
            }

            if(abs(oldTemperature[i][j] - newTemperature[i][j]) > EPSILON
               && !tempGoodChange)
                tempGoodChange = true;

            if(tempGoodChange && !goodTempChange) {
                #pragma omp critical
                {
                    if(tempGoodChange && !goodTempChange)
                        goodTempChange = true;
                }
            }
        }
    }

    iterationCounter++;
}

Trying to get rid of those critical sections may help. For example:
#pragma omp critical
{
    if(tempHighestChange > highestChange)
    {
        highestChange = tempHighestChange;
    }
}
Here, you can store the highestChange computed by each thread in a local variable and, when the parallel section finishes, take the maximum of the per-thread highestChange values.
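An alternative (not in the original answer) is to let OpenMP do this bookkeeping itself: since OpenMP 3.1, C/C++ supports max reductions. A minimal sketch using the variable names from the question, assuming highestChange is a double, with the parallelism moved to the outer loop:
// needs <algorithm> for std::max and <cmath> for std::abs
#pragma omp parallel for schedule(static) reduction(max:highestChange)
for(int i = 1; i < MAX-1; i++)
    for(int j = 1; j < MAX-1; j++) {
        newTemperature[i][j] = 0.25 * (oldTemperature[i-1][j] + oldTemperature[i+1][j] +
                                       oldTemperature[i][j-1] + oldTemperature[i][j+1]);
        highestChange = std::max(highestChange,
                                 std::abs(oldTemperature[i][j] - newTemperature[i][j]));
    }
// goodTempChange can be handled the same way with reduction(||:goodTempChange)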

Here is my attempt (not tested).
double** newTemperature;
double** oldTemperature;

while(iterationCounter < DESIRED_ITERATIONS && goodTempChange) {

    if((iterationCounter % 1000 == 0) && (iterationCounter != 0))
        std::cout
            << "Iteration Count Highest Change Center Plate Temperature\n"
            << "---------------------------------------------------------------\n"
            << iterationCounter << " "
            << highestChange << " "
            << newTemperature[MID][MID] << '\n' << std::endl;

    goodTempChange = false;
    highestChange = 0;

    // swap pointers to the arrays (but not the arrays themselves!)
    if(iterationCounter != 0)
        std::swap(newTemperature, oldTemperature);
    bool CheckTempChange = (iterationCounter + 1) % 1000 == 0;

    #pragma omp parallel
    {
        bool localGoodChange = false;
        double localHighestChange = 0;

        #pragma omp for
        for(int i = 1; i < MAX-1; i++) {
            //
            // note that putting a second
            //   #pragma omp for
            // here has (usually) zero effect. This is called nested parallelism and
            // is usually not implemented, thus the new nested team of threads has
            // only one thread.
            //
            for(int j = 1; j < MAX-1; j++) {
                newTemperature[i][j] = 0.25 * // multiply is faster than divide
                    (oldTemperature[i-1][j] + oldTemperature[i+1][j] +
                     oldTemperature[i][j-1] + oldTemperature[i][j+1]);
                if(CheckTempChange)
                    localHighestChange =
                        std::max(localHighestChange,
                                 std::abs(oldTemperature[i][j] - newTemperature[i][j]));
                localGoodChange = localGoodChange ||
                    std::abs(oldTemperature[i][j] - newTemperature[i][j]) > EPSILON;
                // shouldn't this be < EPSILON in the previous line?
            }
        }

        //
        // note that we have moved the critical sections out of the loops to
        // avoid any potential issues with contention (on the mutex used to
        // implement the critical section). Also note that I named the sections,
        // allowing simultaneous update of goodTempChange and highestChange.
        //
        if(!goodTempChange && localGoodChange)
            #pragma omp critical(TempChangeGood)
            goodTempChange = true;

        if(CheckTempChange && localHighestChange > highestChange)
            #pragma omp critical(TempChangeHighest)
            highestChange = std::max(highestChange, localHighestChange);
    }

    iterationCounter++;
}
There are several changes to your original:
The outer instead of the inner of the nested for loops is performed in parallel. This should make a significant difference.
added in edit: It appears from the comments that you don't understand the significance of this, so let me explain. In your original code, the outer loop (over i) was done only by the master thread. For every i, a team of threads was created to perform the inner loop over j in parallel. This creates a synchronisation overhead (with significant imbalance) at every i! If one instead parallelises the outer loop over i, this overhead is encountered only once and each thread runs the entire inner loop over j for its share of i. Thus, always parallelising the outermost loop possible is a basic rule of multi-threaded coding.
The double for loop is inside a parallel region to minimise the critical-region calls to one per thread per while iteration. You may also consider putting the whole while loop inside the parallel region.
I also swap between two arrays (similar to what is suggested in other answers) to avoid the memcpy, but this shouldn't really be performance critical.
added in edit: std::swap(newTemperature,oldTemperature) only swaps the pointer values and not the memory pointed to; of course, that's the point.
Finally, don't forget that the proof of the pudding is in the eating: just try what difference it makes to have the #pragma omp for in front of the inner or the outer loop. Always do experiments like this before asking on SO -- otherwise you can be rightly accused of not having done sufficient research.

I assume that you are concerned with the time taken by the entire code inside the while loop, not just by the time taken by the loop beginning for(int i = 1; i < MAX-1; i++).
This operation
if(iterationCounter != 0)
{
    memcpy(oldTemperature, newTemperature, sizeof(oldTemperature));
}
is unnecessary and, for large arrays, may be enough to kill performance. Instead of maintaining 2 arrays, old and new, maintain one 3D array with two planes. Create two integer variables, let's call them old and new, and set them to 0 and 1 initially. Replace
newTemperature[i][j] = ((oldTemperature[i-1][j] + oldTemperature[i+1][j] + oldTemperature[i][j-1] + oldTemperature[i][j+1]) / 4);
by
temperature[new][i][j] =
(temperature[old][i-1][j] +
temperature[old][i+1][j] +
temperature[old][i][j-1] +
temperature[old][i][j+1])/4;
and, at the end of the update swap the values of old and new so that the updates go the other way round. I'll leave it to you to determine whether old/new should be the first index into your array or the last. This approach eliminates the need to move (large amounts of) data around in memory.
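As a side note, new is a keyword in C++, so the actual code needs different names for these two indices. A minimal sketch of the two-plane idea (not tested), assuming MAX is the 1024 from the question and using prev/cur for the plane indices:
#include <utility>   // std::swap

const int MAX = 1024;
static double temperature[2][MAX][MAX];   // two planes: previous and current

void relax(int iterations)
{
    int prev = 0, cur = 1;                // which plane is old, which is new
    for (int it = 0; it < iterations; ++it) {
        for (int i = 1; i < MAX-1; i++)
            for (int j = 1; j < MAX-1; j++)
                temperature[cur][i][j] =
                    (temperature[prev][i-1][j] + temperature[prev][i+1][j] +
                     temperature[prev][i][j-1] + temperature[prev][i][j+1]) / 4;
        std::swap(prev, cur);             // flip the roles; no data is moved
    }
}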
Another possible cause of serious slowdown, or failure to accelerate, is covered in this SO question and answer. Whenever I see arrays with sizes of 2^n I suspect cache issues.
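If the power-of-two size does turn out to be the problem, one cheap experiment (a sketch of the padding idea, not something from the original post) is to give each row a few unused columns so that consecutive rows no longer map onto the same cache sets:
const int MAX = 1024;
const int PAD = 8;                               // a few unused doubles per row
static double oldTemperature[MAX][MAX + PAD];    // only columns 0..MAX-1 are ever used
static double newTemperature[MAX][MAX + PAD];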

Why does omp_set_dynamic(1) never adjust the number of threads (in Visual C++)?

If we look at the Visual C++ documentation of omp_set_dynamic, it is literally copy-pasted from the OMP 2.0 standard (section 3.1.7 on page 39):
If [the function argument] evaluates to a nonzero value, the number of threads that are used for executing upcoming parallel regions may be adjusted automatically by the run-time environment to best use system resources. As a consequence, the number of threads specified by the user is the maximum thread count. The number of threads in the team executing a parallel region stays fixed for the duration of that parallel region and is reported by the omp_get_num_threads function.
It seems clear that omp_set_dynamic(1) allows the implementation to use fewer than the current maximum number of threads for a parallel region (presumably to prevent oversubscription under high loads). Any reasonable reading of this paragraph would suggest that said reduction should be observable by querying omp_get_num_threads inside parallel regions.
(Both documentations also show the signature as void omp_set_dynamic(int dynamic_threads);. It appears that "the number of threads specified by the user" does not refer to dynamic_threads but instead means "whatever the user specified using the remaining OpenMP interface").
However, no matter how high I push my system load under omp_set_dynamic(1), the return value of omp_get_num_threads (queried inside the parallel regions) never changes from the maximum in my test program. Yet I can still observe clear performance differences between omp_set_dynamic(1) and omp_set_dynamic(0).
Here is a sample program to reproduce the issue:
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <cstdlib>
#include <cmath>
#include <omp.h>

#define UNDER_LOAD true

const int SET_DYNAMIC_TO = 1;
const int REPEATS = 3000;
const unsigned MAXCOUNT = 1000000;

std::size_t threadNumSum = 0;
std::size_t threadNumCount = 0;

void oneRegion(int i)
{
    // Pseudo-randomize the number of iterations.
    unsigned ui = static_cast<unsigned>(i);
    int count = static_cast<int>(((MAXCOUNT + 37) * (ui + 7) * ui) % MAXCOUNT);

    #pragma omp parallel for schedule(guided, 512)
    for (int j = 0; j < count; ++j)
    {
        if (j == 0)
        {
            threadNumSum += omp_get_num_threads();
            threadNumCount++;
        }

        if ((j + i + count) % 16 != 0)
            continue;

        // Do some floating point math.
        double a = j + i;
        for (int k = 0; k < 10; ++k)
            a = std::sin(i * (std::cos(a) * j + std::log(std::abs(a + count) + 1)));
        volatile double out = a;
    }
}

int main()
{
    omp_set_dynamic(SET_DYNAMIC_TO);

#if UNDER_LOAD
    for (int i = 0; i < 10; ++i)
    {
        std::thread([]()
        {
            unsigned x = 0;
            float y = static_cast<float>(std::sqrt(2));
            while (true)
            {
                //#pragma omp parallel for
                for (int i = 0; i < 100000; ++i)
                {
                    x = x * 7 + 13;
                    y = 4 * y * (1 - y);
                }
                volatile unsigned xx = x;
                volatile float yy = y;
            }
        }).detach();
    }
#endif

    std::chrono::high_resolution_clock clk;
    auto start = clk.now();

    for (int i = 0; i < REPEATS; ++i)
        oneRegion(i);

    std::cout << (clk.now() - start).count() / 1000ull / 1000ull
              << " ms for " << REPEATS << " iterations" << std::endl;

    double averageThreadNum = double(threadNumSum) / threadNumCount;
    std::cout << "Entered " << threadNumCount << " parallel regions with "
              << averageThreadNum << " threads each on average." << std::endl;

    std::getchar();
    return 0;
}
Compiler version: Microsoft (R) C/C++ Optimizing Compiler Version 19.16.27024.1 for x64
On e.g. gcc, this program will print a significantly lower averageThreadNum for omp_set_dynamic(1) than for omp_set_dynamic(0). But on MSVC, the same value is shown in both cases, despite a 30% performance difference (170s vs 230s).
How can this be explained?
In Visual C++, the number of threads executing the loop does get reduced with omp_set_dynamic(1) in this example, which explains the performance difference.
However, contrary to any good-faith interpretation of the standard (and Visual C++ docs), omp_get_num_threads does not report this reduction.
The only way to figure out how many threads MSVC actually uses for each parallel region is to inspect omp_get_thread_num on every loop iteration (or parallel task). The following would be one way to do it with little in-loop performance overhead:
#include <new>      // std::hardware_destructive_interference_size
#include <vector>
#include <omp.h>

// std::hardware_destructive_interference_size is not available in gcc or clang,
// also see the comments by Peter Cordes:
// https://stackoverflow.com/questions/39680206/understanding-stdhardware-destructive-interference-size-and-stdhardware-cons
struct alignas(2 * std::hardware_destructive_interference_size) NoFalseSharing
{
    int flagValue = 0;
};

void foo(int count)   // count: number of loop iterations, as in the sample program above
{
    std::vector<NoFalseSharing> flags(omp_get_max_threads());

    #pragma omp parallel for
    for (int j = 0; j < count; ++j)
    {
        flags[omp_get_thread_num()].flagValue = 1;
        // Your real loop body
    }

    int realOmpNumThreads = 0;
    for (auto flag : flags)
        realOmpNumThreads += flag.flagValue;
}
Indeed, you will find realOmpNumThreads to yield significantly different values from the omp_get_num_threads() inside the parallel region with omp_set_dynamic(1) on Visual C++.
One could argue that, technically, "the number of threads in the team executing a parallel region" and "the number of threads that are used for executing upcoming parallel regions" are not literally the same.
This is a nonsensical interpretation of the standard in my view, because the intent is very clear and there is no reason for the standard to say "The number of threads in the team executing a parallel region stays fixed for the duration of that parallel region and is reported by the omp_get_num_threads function" in this section if this number is unrelated to the functionality of omp_set_dynamic.
However, it could be that MSVC decided to keep the number of threads in a team unaffected and just assign no loop iterations for execution to a subset of them under omp_set_dynamic(1) for ease of implementation.
Whatever the case may be: Do not trust omp_get_num_threads in Visual C++.

Optimization of a large array sum (multi-threaded)

So I want to optimize the sum of a really big array, and in order to do that I have written a multi-threaded code. The problem is that with this code I'm getting better timing results using only one thread instead of 2 or 3 or 4 threads...
Can someone explain to me why this happens?
(Also I've only started coding in C++ this semester, until then I only knew C, so I'm sorry for possible dumb mistakes)
This is the thread code
*localSum = 0.0;
for (size_t i = 0; i < stop; i++)
    *localSum += v[i];
Main process code
int numThreads = atoi(argv[1]);
int N = 100000000;

// create the input vector v and put some values in v
vector<double> v(N);
for (int i = 0; i < N; i++)
    v[i] = i;

// this vector will contain the partial sum for each thread
vector<double> localSum(numThreads, 0);

// create threads. Each thread will compute part of the sum and store
// its result in localSum[threadID] (threadID = 0, 1, ... numThread-1)
startChrono();

vector<thread> myThreads(numThreads);
for (int i = 0; i < numThreads; i++) {
    int start = i * v.size() / numThreads;
    myThreads[i] = thread(threadsum, i, numThreads, &v[start], &localSum[i], v.size()/numThreads);
}
for_each(myThreads.begin(), myThreads.end(), mem_fn(&thread::join));

// calculate global sum
double globalSum = 0.0;
for (int i = 0; i < numThreads; i++)
    globalSum += localSum[i];

cout.precision(12);
cout << "Sum = " << globalSum << endl;
cout << "Runtime: " << stopChrono() << endl;

exit(EXIT_SUCCESS);
}
There are a few things:
1- The array just isn't big enough. Vectorized streaming add will be really hard to beat. You need a more complex function than add to really see results. Or a very large array.
2- Related, the overhead of all the thread creation and joining is going to swamp any performance gains from the threading. Adding is really fast, and you can easily saturate the CPU's functional units. For the thread to help, it can't even be a hyperthread on the same core; it would need to be on a different core entirely (as the hyperthreads would both compete for the floating point units).
To test this, you can try to create all the threads before you start the timer and stop them all after you stop the timer (have them set a done flag instead of waiting on the join).
3- All your localSum entries are sharing the same cache line. It would be better to accumulate into a local variable on the stack and only write the result into the array at the end, instead of adding directly into the array element (a sketch of this follows after the struct below): https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html
If, for some reason, you need to keep the sum observable to others in that array, pad the localSum vector entries like this so they don't share the same cache line:
struct localsumentry {
    double sum;
    char pad[56];
};
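For point 3, here is a minimal sketch of the local-accumulator version of the thread function; the exact threadsum signature is assumed from the call in the question:
// Hypothetical thread function matching the call in the question:
// accumulate into a stack variable, write the shared slot exactly once.
void threadsum(int threadID, int numThreads, const double *v, double *localSum, size_t stop)
{
    double sum = 0.0;                 // thread-local, no false sharing
    for (size_t i = 0; i < stop; i++)
        sum += v[i];
    *localSum = sum;                  // single write into localSum[threadID]
}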

openMP C++ simple parallel region - inconsistent output

As stated above, I have been trying to craft a simple parallel loop, but it has inconsistent behaviour for different numbers of threads. Here is my code (testable!):
#include <iostream>
#include <stdio.h>
#include <vector>
#include <utility>
#include <string>
using namespace std;

int row = 5, col = 5;
int token = 1;
int ar[20][20] = {0};

int main (void)
{
    unsigned short j_end = 1, k = 1;
    unsigned short mask;

    for (unsigned short i=1; i<=(row + col -1); i++)
    {
        #pragma omp parallel default(none) shared(ar) firstprivate(k, row, col, i, j_end, token) private(mask)
        {
            if(i > row) {
                mask = row;
            }
            else {
                mask = i;
            }

            #pragma omp for schedule(static, 2)
            for(unsigned short j=k; j<=j_end; j++)
            {
                ar[mask][j] = token;
                if(mask > 1) {
                    #pragma omp critical
                    {
                        mask--;
                    }
                }
            } //inner loop - barrier
        }//end parallel

        token++;

        if(j_end == col) {
            k++;
            j_end = col;
        }
        else {
            j_end++;
        }
    } // outer loop

    // print the array
    for (int i = 0; i < row + 2; i++)
    {
        for (int j = 0; j < col + 2; j++)
        {
            cout << ar[i][j] << " ";
        }
        cout << endl;
    }

    return 0;
} // main
I believe most of the code is self-explanatory, but to sum it up: I have 2 loops, and the inner one iterates through the inverse diagonals of the square matrix ar[row][col] (the row and col variables can be used to change the total size of ar).
Visual aid: desired output for 5x5 ar (serial version)
(Note: the desired output is also produced when OMP_NUM_THREADS=1.)
But when OMP_NUM_THREADS=2 or OMP_NUM_THREADS=4 the output looks like this:
The serial (and for 1 thread) code is consistent so I don't think the implementation is problematic. Also, given the output of the serial code, there shouldn't be any dependencies in the inner loop.
I have also tried:
Vectorizing
threadprivate counters for the inner loop
But nothing seems to work so far...
Is there a fault in my approach, or did I miss something API-wise that led to this behavior?
Thanks for your time in advance.
Analyzing the algorithm
As you noted, the algorithm itself has no dependencies in the inner or outer loop. An easy way to show this is to move the parallelism "up" to the outer loop so that you can iterate across all the different inverse diagonals simultaneously.
Right now, the main problem with the algorithm you've written is that it's presented as a serial algorithm in both the inner and outer loop. If you're going to parallelize across the inner loop, then mask needs to be handled specially. If you're going to parallelize across the outer loop, then j_end, token, and k need to be handled specially. By "handled specially," I mean they need to be computed independently of the other threads. If you try adding critical regions into your code, you will kill all performance benefits of adding OpenMP in the first place.
Fixing the problem
In the following code, I parallelize over the outer loop. i corresponds to what you call token. That is, it is both the value to be added to the inverse diagonal and the assumed starting length of this diagonal. Note that for this to parallelize correctly, length, startRow, and startCol must be calculated as a function of i independently from other iterations.
Finally note that once the algorithm is re-written this way, the actual OpenMP pragma is incredibly simple. Every variable is assumed to be shared by default because they're all read-only. The only exception is ar in which we are careful never to overwrite another thread's value of the array. All variables that must be private are only created inside the parallel loop and thus are thread-private by definition. Lastly, I've changed the schedule to dynamic to showcase that this algorithm exhibits load-imbalance. In your example if you had 9 threads (the worst case scenario), you can see how the thread assigned to i=5 has to do much more work than the thread assigned to i=1 or i=9.
Example code
#include <iostream>
#include <omp.h>

int row = 5;
int col = 5;

#define MAXSIZE 20
int ar[MAXSIZE][MAXSIZE] = {0};

int main(void)
{
    // What an easy pragma!
    #pragma omp parallel for default(shared) schedule(dynamic)
    for (unsigned short i = 1; i < (row + col); i++)
    {
        // Calculates the length of the current diagonal to consider
        // INDEPENDENTLY from other i iterations!
        unsigned short length = i;
        if (i > row) {
            length -= (i-row);
        }
        if (i > col) {
            length -= (i-col);
        }

        // Calculates the starting coordinate to start at
        // INDEPENDENTLY from other i iterations!
        unsigned short startRow = i;
        unsigned short startCol = 1;
        if (startRow > row) {
            startCol += (startRow-row);
            startRow = row;
        }

        for (unsigned short offset = 0; offset < length; offset++) {
            ar[startRow-offset][startCol+offset] = i;
        }
    } // outer loop

    // print the array
    for (int i = 0; i <= row; i++)
    {
        for (int j = 0; j <= col; j++)
        {
            std::cout << ar[i][j] << " ";
        }
        std::cout << std::endl;
    }

    return 0;
} // main
Final points
I want to leave with a few last points.
If you are only adding parallelism on a small array (row,col < 1e6), you will most likely not get any benefits from OpenMP. On a small array, the algorithm itself will take microseconds, while setting up the threads could take milliseconds... slowing down execution time considerably from your original serial code!
While I did rewrite this algorithm and change around variable names, I tried to keep the spirit of your implementation as best as I could. Thus, the inverse-diagonal scanning and nested loop pattern remains.
There is a better way to parallelize this algorithm to avoid load imbalance, though; see the sketch after this list. If you instead give each thread a row and have it iterate over that row's token values (i.e. row/thread 2 places the numbers 2 through 6), then each thread works on exactly the same amount of numbers and you can change the pragma to schedule(static).
As I mentioned in the comments above, don't use firstprivate when you mean shared. A good rule of thumb is that all read-only variables should be shared.
It is erroneous to assume that getting correct output when running parallel code on 1 thread implies the implementation is correct. In fact, barring disastrous use of OpenMP, you are incredibly unlikely to get the wrong output with only 1 thread. Testing with multiple threads reveals that your previous implementation was not correct.
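A sketch of the row-based variant mentioned in the third point above (not tested): row r of the result holds the values r, r+1, ..., r+col-1, so every thread writes exactly col cells and a static schedule is perfectly balanced.
#pragma omp parallel for schedule(static)
for (int r = 1; r <= row; r++)
    for (int c = 1; c <= col; c++)
        ar[r][c] = r + c - 1;   // the token of the inverse diagonal through (r, c)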
Hope this helps.
EDIT: The output I get is the same as yours for a 5x5 matrix.

How to apply openMP to a C++ function to validate all rows of a sudoku puzzle solution?

I am designing a program that will test to see whether a valid sudoku puzzle solution is given to the program or not. I first designed it in C++ but now I want to try to make it parallel. The program compiles fine without errors.
First I had to figure out a way to deal with using a return statement inside a structured block. I just decided to make an array of bools that are initialized to true. However, the output from this function is false, and I know for a fact that the solution I am submitting is valid. I am new to OpenMP and was wondering if anyone could help me out?
I have a feeling the issue is with my variable a getting set back to 0 and maybe also with my other variable nextSudokuNum getting set back to 1.
bool test_rows(int sudoku[9][9])
{
    int i, j, a;
    int nextSudokuNum = 1;
    bool rowReturn[9];

    #pragma omp parallel for private(i)
    for(i = 0; i < 9; i++)
    {
        rowReturn[i] = true;
    }

    #pragma omp parallel for private(i,j) \
        reduction(+: a, nextSudokuNum)
    for(i = 0; i < 9; i++)
    {
        for(j = 0; j < 9; j++)
        {
            a = 0;
            while(sudoku[i][a] != nextSudokuNum) {
                a++;
                if(a > 9) {
                    rowReturn[i] = false;
                }
            }
            nextSudokuNum++;
        }
        nextSudokuNum = 1;
    }

    for(i = 0; i < 9; i++)
    {
        if(rowReturn[i] == false) {
            cout << "Invalid Sudoku Solution(Next Valid Sudoku Number Not Found)" << endl;
            cout << "Check row " << (i+1) << endl;
            return false;
        }
    }

    cout << "Valid sudoku rows(Returning true)" << endl;
    return true;
}
Disclaimer:
First off, do not parallelize very small loops or loops which execute nearly instantaneously. The overhead of creating the threads will dominate the benefit you would get by executing the inner statements of the loop in parallel. So unless each iteration you are parallelizing performs thousands to millions of FLOPs, the serial version of the code will run faster than the parallel version of the code.
Therefore, a better plan for parallelizing your (probable) tasks is to parallelize at a higher level. That is, presumably you are calling test_rows(sudoku), test_columns(sudoku), and test_box(sudoku) from one function somewhere else. What you can do is call these three serial functions in parallel using OpenMP sections, where calling each of these three functions is a separate OpenMP section. This will only benefit from 3 cores of your CPU, but presumably you are doing this on a laptop that only has 2 or 4 cores anyway.
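A sketch of that higher-level approach, using the three function names mentioned above and keeping their bodies completely serial:
bool rowsOK, colsOK, boxesOK;
#pragma omp parallel sections
{
    #pragma omp section
    rowsOK = test_rows(sudoku);

    #pragma omp section
    colsOK = test_columns(sudoku);

    #pragma omp section
    boxesOK = test_box(sudoku);
}
bool validSolution = rowsOK && colsOK && boxesOK;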
Now to your actual problems:
You are not parallelizing over j, but merely over i. Therefore, you can see that your variable nextSudokuNum is not being reduced; for every i iteration, nextSudokuNum is self-contained. Thus it should be initialized inside the loop and made private in the #pragma omp parallel clause.
Likewise, you are not performing a reduction over a either. For every iteration of i, a is set, compared to, and incremented internally. Again it should be a private variable.
Therefore, your new code should look like:
#pragma omp parallel for private(i,j,a,nextSudokuNum)
for(i = 0; i < 9; i++)
{
    // all private variables must be set internal to the parallel region before being used
    nextSudokuNum = 1;
    for(j = 0; j < 9; j++)
    {
        a = 0;
        while(sudoku[i][a] != nextSudokuNum) {
            a++;
            if(a > 9) {
                rowReturn[i] = false;
            }
        }
        nextSudokuNum++;
    }
}

Negligible Performance Boost from pthread c++

I've been using Mac OS gcc 4.2.1 and Eclipse to write a program that sorts numbers using a simple merge sort. I've tested the sort extensively and I know it works. I thought, maybe somewhat naively, that because of the way the algorithm divides up the list, I could simply have a thread sort one half and the main thread sort the other half, and then it would take half the time, but unfortunately it doesn't seem to be working.
Here's the main code:
float x = clock(); //timing
int half = (int)size/2; // size is the length of the list
status = pthread_create(thready,NULL,voidSort,(void *)datay); //start the thread sorting
sortStep(testArray,tempList,half,0,half); //sort using the main thread
int join = pthread_join(*thready,&someptr); //wait for the thread to finish
mergeStep(testArray,tempList,0,half,half-1); //merge the two sublists
if (status != 0) { std::cout << "Could not create thread.\nError: " << status << "\n"; }
if (join != 0) { std::cout << "Could not join thread.\nError: " << join << "\n"; }
float y = clock() - x; //timing
sortStep is the main sorting function, mergeStep is used to merge two sublists within one array (it uses a placeholder array to switch the numbers around), and voidSort is a function I use to pass a struct containing all the arguments for sortStep to the thread. I feel like maybe the main thread is waiting until the new thread is done, but I'm not sure how to overcome that. I'm extremely, unimaginably grateful for any and all help, thank you in advance!
EDIT:
Here's the merge step
void mergeStep (int *array, int *tempList, int start, int lengthOne, int lengthTwo) //the merge step of a merge sort
{
    int i = start;
    int j = i+lengthOne;
    int k = 0; // index for the entire templist

    while (k < lengthOne+lengthTwo) // a C++ while loop
    {
        if (i - start == lengthOne)
        { //list one exhausted
            for (int n = 0; n+j < lengthTwo+lengthOne+start; n++) //add the rest
            {
                tempList[k++] = array[j+n];
            }
            break;
        }
        if (j-(lengthOne+lengthTwo)-start == 0)
        { //list two exhausted
            for (int n = i; n < start+lengthOne; n++) //add the rest
            {
                tempList[k++] = array[n];
            }
            break;
        }

        if (array[i] > array[j]) // figure out which value should go first
        {
            tempList[k] = array[j++];
        }
        else
        {
            tempList[k] = array[i++];
        }
        k++;
    }

    for (int s = 0; s < lengthOne+lengthTwo; s++) // copy the templist back into the original
    {
        array[start+s] = tempList[s];
    }
}
-Will
The overhead of creating threads is quite large, so unless you have a large amount (to be determined) of data to sort, you're better off sorting it in the main thread.
The mergeStep also counts against the part of the code that can't be parallelized; remember Amdahl's law.
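To put a rough number on that: if the merge step and the other serial parts account for, say, 25% of the runtime, Amdahl's law limits the speedup with two threads to 1 / (0.25 + 0.75/2) ≈ 1.6x, no matter how perfectly the two sorting halves overlap.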
If you don't have a coarsening step as the last part of your sortStep, then once you get below 8-16 elements much of your performance will go up in function-call overhead. The coarsening step should be done with a simpler sort, such as an insertion sort or a sorting network (a sketch follows below).
Unless you are sorting a large enough array, the actual timing differences could drown in measurement uncertainty.
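A sketch of the coarsening idea (the helper names are hypothetical, mergeStep is the function from the question, and CUTOFF is a tuning knob; tmp must be able to hold at least hi - lo elements):
const int CUTOFF = 16;   // below this, recursion and call overhead dominate

void insertionSort(int *a, int lo, int hi)          // sorts a[lo..hi) in place
{
    for (int i = lo + 1; i < hi; ++i) {
        int key = a[i];
        int j = i - 1;
        while (j >= lo && a[j] > key) { a[j + 1] = a[j]; --j; }
        a[j + 1] = key;
    }
}

void mergeSortCoarse(int *a, int *tmp, int lo, int hi)
{
    if (hi - lo <= CUTOFF) {                        // coarsening step
        insertionSort(a, lo, hi);
        return;
    }
    int mid = lo + (hi - lo) / 2;
    mergeSortCoarse(a, tmp, lo, mid);
    mergeSortCoarse(a, tmp, mid, hi);
    mergeStep(a, tmp, lo, mid - lo, hi - mid);      // merge the two sorted halves
}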