How to dynamically allocate work to threads - C++

I am trying to write code that checks whether pairwise sums are even or not (among all possible pairs from 0 to 100000). I have written code using pthreads where the work allocation is done statically. Here is the code:
#include <iostream>
#include <chrono>
#include <iomanip>
#include <vector>
#include <pthread.h>
using namespace std;

#define MAX_THREAD 4

vector<long long> cnt(MAX_THREAD, 0);
long long n = 100000;
int work_per_thread;

void *count_array(void* arg)
{
    int t = *((int*)arg);
    long long sum = 0;
    int counter = 0;
    // static allocation: thread t handles i in (t*work_per_thread, (t+1)*work_per_thread]
    for(int i = t*work_per_thread + 1; i <= (t+1)*work_per_thread; i++)
        for(int j = i-1; j >= 0; j--)
        {
            sum = i + j;
            if(sum%2 == 0)
                counter++;
        }
    cnt[t] = counter;
    cout << "thread" << t << " finished work" << endl;
    return NULL;
}

int main()
{
    pthread_t threads[MAX_THREAD];
    vector<int> arr;
    for(int i = 0; i < MAX_THREAD; i++)
        arr.push_back(i);
    long long total_count = 0;
    work_per_thread = n/MAX_THREAD;
    auto start = chrono::high_resolution_clock::now();
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_create(&threads[i], NULL, count_array, &arr[i]);
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_join(threads[i], NULL);
    for (int i = 0; i < MAX_THREAD; i++)
        total_count += cnt[i];
    cout << "count is " << total_count << endl;
    auto end = chrono::high_resolution_clock::now();
    double time_taken = chrono::duration_cast<chrono::nanoseconds>(end - start).count();
    time_taken *= 1e-9;
    cout << "Time taken by program is : " << fixed << setprecision(9) << time_taken << " secs" << endl;
    return 0;
}
Now I want to do the work allocation dynamically. To be specific: say I have 5 threads. Initially I give each thread a certain range to work with, e.g. thread1 works on all pairs from 0-1249, thread2 from 1250-2499, and so on. As soon as a thread completes its work, I want to give it a new range to work on. This way no thread is idle for most of the time, as happens with static allocation.

This is the classic use case for a thread pool. Typically you set up a synchronized queue that any number of threads can push work into and pull work from. Then you start N threads, the "thread pool". These threads wait on a condition variable that guards the queue's mutex. When the main thread has work to hand out, it pushes it into the queue (a work item can be as simple as a struct holding a range) and then signals the condition variable, which wakes one waiting thread.
See this answer: https://codereview.stackexchange.com/questions/221617/thread-pool-c-implementation
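As a rough illustration of that queue, here is a minimal sketch (C++11; Task, WorkQueue, and the range fields are illustrative names, not a fixed API):
#include <condition_variable>
#include <mutex>
#include <queue>

struct Task { long long begin, end; };         // a half-open range [begin, end)

class WorkQueue {
    std::queue<Task> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
public:
    void push(Task t)
    {
        { std::lock_guard<std::mutex> lk(m); tasks.push(t); }
        cv.notify_one();                       // wake one waiting worker
    }
    bool pop(Task& t)                          // blocks until a task arrives or shutdown
    {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return done || !tasks.empty(); });
        if (tasks.empty()) return false;       // shut down and drained: worker exits
        t = tasks.front();
        tasks.pop();
        return true;
    }
    void shutdown()
    {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
};
Each pool thread loops on something like Task t; while (q.pop(t)) { /* count pairs for i in [t.begin, t.end) */ }. The main thread pushes many small ranges, calls shutdown(), and joins the workers. If a full pool feels heavyweight here, the same dynamic behavior can be had with a shared std::atomic<long long> counter: each thread claims its next range with fetch_add(chunkSize) whenever it finishes one.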

Related

Thread not improving the code performance

I am trying to split a basic long loop across threads to improve the loop's performance.
Here is the threaded version:
#include <iostream>
#include <thread>
#include <chrono>
using namespace std;
using namespace std::chrono;

void funcSum(long long int start, long long int end, long long int *sum)
{
    for(auto i = start; i <= end; ++i)
    {
        *sum += i;
    }
}

int main()
{
    long long int start = 10, end = 1900000000;
    long long int sum = 0;

    auto startTime = high_resolution_clock::now();
    thread t1(funcSum, start, end / 2, &sum);
    thread t2(funcSum, end / 2 + 1, end, &sum);
    t1.join();
    t2.join();
    auto stopTime = high_resolution_clock::now();
    auto duration = duration_cast<seconds>(stopTime - startTime);

    cout << "Sum: " << sum << endl;
    cout << duration.count() << " Seconds";
    return 0;
}
And here is the normal code (Without threads):
#include <iostream>
#include <thread>
#include <chrono>
using namespace std;
using namespace std::chrono;

void funcSum(long long int start, long long int end, long long int *sum)
{
    for(auto i = start; i <= end; ++i)
    {
        *sum += i;
    }
}

int main()
{
    long long int start = 10, end = 1900000000;
    long long int sum = 0;

    auto startTime = high_resolution_clock::now();
    funcSum(start, end, &sum);
    auto stopTime = high_resolution_clock::now();
    auto duration = duration_cast<seconds>(stopTime - startTime);

    cout << "Sum: " << sum << endl;
    cout << duration.count() << " Seconds";
    return 0;
}
Sum: 1805000000949999955
5 Seconds
Process finished with exit code 0
In both cases the time spent is 5 seconds.
Why does the threaded version not improve performance? How do I decrease the time for this range sum using threads?
Fixed version of threaded code:
// Compute the sum of start ... end
class Summer {
public:
    long long int start;
    long long int end;
    long long int sum = 0;

    Summer(long long int aStart, long long int aEnd)
        : start(aStart),
          end(aEnd)
    {
    }

    void funcSum()
    {
        sum = 0;
        for (auto i = start; i <= end; ++i)
        {
            sum += i;
        }
    }
};

class SummerFunctor {
    Summer& mSummer;
public:
    SummerFunctor(Summer& aSummer)
        : mSummer(aSummer)
    {
    }
    void operator()()
    {
        mSummer.funcSum();
    }
};

// Version with n thread objects reports
// 1 threads, sum = 1805000000949999955, 1587 ms
// 2 threads, sum = 1805000000949999955, 2547 ms
// 4 threads, sum = 1805000000949999955, 1251 ms
// 6 threads, sum = 1805000000949999955, 916 ms
int main()
{
    long long int start = 10, end = 1900000000;
    long long int sum = 0;

    auto startTime = high_resolution_clock::now();
    const size_t threadCount = 6;
    if (threadCount < 2) {
        funcSum(start, end, &sum);
    } else {
        Summer* summers[threadCount];
        std::thread* threads[threadCount];
        // Start threads, each owning one partition of the range
        auto partitionSize = (end - start) / (long long int)threadCount;
        for (size_t i = 0; i < threadCount; ++i) {
            auto partitionEnd = std::min(start + partitionSize, end); // needs <algorithm>
            summers[i] = new Summer(start, partitionEnd);
            start = partitionEnd + 1;
            SummerFunctor functor(*summers[i]);
            threads[i] = new std::thread(functor);
        }
        // Join threads and combine the partial sums
        for (size_t i = 0; i < threadCount; ++i) {
            threads[i]->join();
            sum += summers[i]->sum;
            delete threads[i];
            delete summers[i];
        }
    }
    auto stopTime = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stopTime - startTime);

    cout << threadCount << " threads, sum = " << sum << ", " << duration.count() << " ms" << std::endl;
    return 0;
}
I had to wrap the Summer object with a functor because std::thread insists on making a copy of any functor handed to it, and we could not access that copy afterwards to read the result. Execution improves as more threads are used (running times are in the comments above), though not uniformly. Possible reasons why the speedup is less than linear:
The CPU has to synchronize access to memory even though the threads use separate variables, because those variables likely lie in the same cache line or page.
If only one thread is running on a CPU, that thread may run at a boosted CPU frequency, while several threads may run only at the normal frequency.
CPU cores often share arithmetic units (as with hyper-threading).
Without threads, the compiler can make optimizations that are not possible with threads. In theory, the compiler could unroll the loop and directly print the result.
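For comparison, the same fix (a private accumulator per thread, combined only after joining) can be written more compactly with std::async; this is a minimal sketch, and the partitioning arithmetic is illustrative:
#include <future>
#include <vector>

long long int rangeSum(long long int s, long long int e)
{
    long long int local = 0;                   // private accumulator: nothing shared
    for (auto i = s; i <= e; ++i)
        local += i;
    return local;
}

long long int parallelSum(long long int s, long long int e, size_t threadCount)
{
    std::vector<std::future<long long int>> parts;
    const auto step = (e - s + 1) / (long long int)threadCount;
    for (size_t t = 0; t < threadCount; ++t) {
        auto first = s + step * (long long int)t;
        auto last = (t + 1 == threadCount) ? e : first + step - 1; // last chunk takes the remainder
        parts.push_back(std::async(std::launch::async, rangeSum, first, last));
    }
    long long int total = 0;
    for (auto& f : parts)
        total += f.get();                      // combine partial sums after all finish
    return total;
}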

read/write to large array using large loop - execution time concerns

So recently I ran into a problem that I thought was interesting and I couldn't fully explain. I've highlighted the nature of the problem in the following code:
#include <cstring>
#include <chrono>
#include <iostream>

#define NLOOPS 10

void doWorkFast(int total, int *write, int *read)
{
    for (int j = 0; j < NLOOPS; j++) {
        for (int i = 0; i < total; i++) {
            write[i] = read[i] + i;
        }
    }
}

void doWorkSlow(int total, int *write, int *read, int innerLoopSize)
{
    for (int i = 0; i < NLOOPS; i++) {
        for (int j = 0; j < total/innerLoopSize; j++) {
            for (int k = 0; k < innerLoopSize; k++) {
                write[j*k + k] = read[j*k + k] + j*k + k;
            }
        }
    }
}

int main(int argc, char *argv[])
{
    int n = 1000000000;
    int *heapMemoryWrite = new int[n];
    int *heapMemoryRead = new int[n];

    for (int i = 0; i < n; i++)
    {
        heapMemoryRead[i] = 1;
    }
    std::memset(heapMemoryWrite, 0, n * sizeof(int));

    auto start1 = std::chrono::high_resolution_clock::now();
    doWorkFast(n, heapMemoryWrite, heapMemoryRead);
    auto finish1 = std::chrono::high_resolution_clock::now();
    auto duration1 = std::chrono::duration_cast<std::chrono::microseconds>(finish1 - start1);

    for (int i = 0; i < n; i++)
    {
        heapMemoryRead[i] = 1;
    }
    std::memset(heapMemoryWrite, 0, n * sizeof(int));

    auto start2 = std::chrono::high_resolution_clock::now();
    doWorkSlow(n, heapMemoryWrite, heapMemoryRead, 10);
    auto finish2 = std::chrono::high_resolution_clock::now();
    auto duration2 = std::chrono::duration_cast<std::chrono::microseconds>(finish2 - start2);

    std::cout << "Small inner loop:" << duration1.count() << " microseconds.\n" <<
                 "Large inner loop:" << duration2.count() << " microseconds." << std::endl;

    delete[] heapMemoryWrite;
    delete[] heapMemoryRead;
}
Looking at the two doWork* functions: on every iteration we are reading the same addresses, adding the same value, and writing to the same addresses. I understand that the doWorkSlow implementation does one or two extra operations to resolve j*k + k; however, I think it is reasonably safe to assume that, relative to the time taken by the loads and stores for the memory reads and writes, the contribution of those operations is negligible.
Nevertheless, doWorkSlow takes about twice as long (46.8 s) as doWorkFast (25.5 s) on my i7-3700 using g++ 7.5.0. While things like cache prefetching and branch prediction come to mind, I don't have a great explanation as to why doWorkFast is so much faster. Does anyone have insight?
Thanks
Looking at the two doWork* functions, for every iteration, we are reading the same addresses adding the same value and writing to the same addresses.
This is not true!
In doWorkFast, you index each integer incrementally, as array[i].
array[0]
array[1]
array[2]
array[3]
In doWorkSlow, you index each integer as array[j*k + k], which jumps around and repeats.
When j is 10, for example, and you iterate k from 0 onwards, you are accessing
array[0] // 10*0+0
array[11] // 10*1+1
array[22] // 10*2+2
array[33] // 10*3+3
This will prevent your optimizer from using instructions that can operate on many adjacent integers at once.
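To see that effect in isolation, here is a small sketch (the array size and stride are arbitrary illustrative values) that sums the same elements once contiguously and once with a stride; on typical hardware the strided pass is noticeably slower because it defeats vectorization and hardware prefetching, though the exact ratio varies by machine and compiler:
#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    const int n = 1 << 24, stride = 11;           // illustrative sizes
    std::vector<int> v(n, 1);
    long long sum = 0;

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i)                   // contiguous: prefetch- and SIMD-friendly
        sum += v[i];
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int s = 0; s < stride; ++s)              // same elements, visited with a stride
        for (int i = s; i < n; i += stride)
            sum += v[i];
    auto t2 = std::chrono::high_resolution_clock::now();

    std::cout << sum << '\n'
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms contiguous\n"
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms strided\n";
}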

C++ : Passing threadID to function anomaly

I implemented a concurrent queue with two methods: add (enqueue) & remove (dequeue).
To test my implementation using 2 threads, I generated 10 (NUMBER_OF_OPERATIONS) random numbers between 0 and 1 in a method called getRandom(). This allows me to create different distribution of add and remove operations.
The doWork method splits up the work done by the number of threads.
PROBLEM: The threadID that I am passing in from the main function does not match the threadID that the doWork method receives.
#define NUMBER_OF_THREADS 2
#define NUMBER_OF_OPERATIONS 10

int main () {
    BoundedQueue<int> bQ;
    std::vector<double> temp = getRandom();
    double* randomNumbers = &temp[0];
    std::thread myThreads[NUMBER_OF_THREADS];

    for(int i = 0; i < NUMBER_OF_THREADS; i++) {
        cout << "Thread " << i << " created.\n";
        myThreads[i] = std::thread ( [&] { bQ.doWork(randomNumbers, i); });
    }

    cout << "Main Thread\n";

    for(int i = 0; i < NUMBER_OF_THREADS; i++) {
        if(myThreads[i].joinable()) myThreads[i].join();
    }
    return 0;
}

template <class T> void BoundedQueue<T>::doWork (double randomNumbers[], int threadID) {
    cout << "Thread ID is " << threadID << "\n";
    srand(time(NULL));
    int split = NUMBER_OF_OPERATIONS / NUMBER_OF_THREADS;
    for (int i = threadID * split; i < (threadID * split) + split; i++) {
        if(randomNumbers[i] <= 0.5) {
            int numToAdd = rand() % 10 + 1;
            add(numToAdd);
        }
        else {
            int numRemoved = remove();
        }
    }
}
In this line you're capturing i by reference:
myThreads[i] = std::thread ( [&] { bQ.doWork(randomNumbers, i); });
This means that when the other thread runs the lambda, it will see whatever value i holds at that moment, not the value it had when the thread was created. Capture it by value instead:
myThreads[i] = std::thread ( [&, i] { bQ.doWork(randomNumbers, i); });
What's worse, because the read and write of i are unordered, your current code has undefined behavior, and i may even have gone out of scope on the main thread before the other thread reads it. The fix above resolves all of these issues.
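Here is a self-contained sketch of the difference (the thread count is arbitrary): with [&, i] each lambda gets its own snapshot of i at creation time, which is exactly what the loop intends:
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        // [&, i] copies i at the moment the thread is created; a plain [&]
        // would read i later, racing with the loop (undefined behavior).
        threads.emplace_back([&, i] { std::printf("worker %d\n", i); });
    for (auto& t : threads)
        t.join();
}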

OpenMP function calls in parallel

I'm looking for a way to call a function in parallel.
For example, if I have 4 threads, I want to each of them to call the same function with their own thread id as an argument.
Because of the argument, no thread will work on the same data.
#pragma omp parallel
{
    for(int p = 0; p < numberOfThreads; p++)
    {
        if(p == omp_get_thread_num())
            parDF(p);
    }
}
Thread 0 should run parDF(0)
Thread 1 should run parDF(1)
Thread 2 should run parDF(2)
Thread 3 should run parDF(3)
All this should be done at the same time...
This (obviously) doesn't work, but what is the right way to do parallel function calls?
EDIT: The actual code (This might be too much information... But it was asked for...)
From the function that calls parDF():
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
    numberOfThreads = omp_get_num_threads();

    //split nodeQueue
    #pragma omp master
    {
        splitNodeQueue(numberOfThreads);
    }

    int tid = omp_get_thread_num();
    //printf("Hello World from thread = %d\n", tid);

    #pragma omp parallel for private(tid)
    for(int i = 0; i < numberOfThreads; ++i)
    {
        parDF(tid, originalQueueSize, DFlevel);
    }
}
The parDF function:
bool Tree::parDF(int id, int originalQueueSize, int DFlevel)
{
    double possibilities[20];
    double sequence[3];
    double workingSequence[3];
    int nodesToExpand = originalQueueSize/omp_get_num_threads();
    int tenthsTicks = nodesToExpand/10;
    int numPossibilities = 0;
    int percentage = 0;
    list<double>::iterator i;
    list<TreeNode*>::iterator n;

    cout << "My ID is: " << omp_get_thread_num() << endl;

    while(parNodeQueue[id].size() > 0 and parNodeQueue[id].back()->depth == DFlevel)
    {
        if(parNodeQueue[id].size()%tenthsTicks == 0)
        {
            cout << endl;
            cout << percentage*10 << "% done..." << endl;
            if(percentage == 10)
            {
                percentage = 0;
            }
            percentage++;
        }
        //countStartPoints++;
        depthFirstQueue.push_back(parNodeQueue[id].back());
        numPossibilities = 0;
        for(i = parNodeQueue[id].back()->content.sortedPoints.begin(); i != parNodeQueue[id].back()->content.sortedPoints.end(); i++)
        {
            for(int j = 0; j < deltas; j++)
            {
                if(parNodeQueue[id].back()->content.doesPointExist((*i) + delta[j]))
                {
                    for(int k = 0; k <= numPossibilities; k++)
                    {
                        if(fabs((*i) + delta[j] - possibilities[k]) < 0.01)
                        {
                            goto pointAlreadyAdded;
                        }
                    }
                    possibilities[numPossibilities] = ((*i) + delta[j]);
                    numPossibilities++;
                    pointAlreadyAdded:
                    continue;
                }
            }
        }
        // Out of the list of possible points, all combinations of 3 are added, building small subtrees in from the node.
        // If a subtree successfully breaks the lower bound, true is returned.
        for(int i = 0; i < numPossibilities; i++)
        {
            for(int j = 0; j < numPossibilities; j++)
            {
                for(int k = 0; k < numPossibilities; k++)
                {
                    if( k != j and j != i and i != k)
                    {
                        sequence[0] = possibilities[i];
                        sequence[1] = possibilities[j];
                        sequence[2] = possibilities[k];
                        //countSeq++;
                        if(addSequence(sequence, id))
                        {
                            //successes++;
                            workingSequence[0] = sequence[0];
                            workingSequence[1] = sequence[1];
                            workingSequence[2] = sequence[2];
                            parNodeQueue[id].back()->workingSequence[0] = sequence[0];
                            parNodeQueue[id].back()->workingSequence[1] = sequence[1];
                            parNodeQueue[id].back()->workingSequence[2] = sequence[2];
                            parNodeQueue[id].back()->live = false;
                            succesfulNodes.push_back(parNodeQueue[id].back());
                            goto nextNode;
                        }
                        else
                        {
                            destroySubtree(parNodeQueue[id].back());
                        }
                    }
                }
            }
        }
        nextNode:
        parNodeQueue[id].pop_back();
    }
Is this what you are after?
Live On Coliru
#include <omp.h>
#include <cstdio>

int main()
{
    int nthreads, tid;
    #pragma omp parallel private(tid)
    {
        tid = ::omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);
        /* Only master thread does this */
        if (tid == 0) {
            nthreads = ::omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and terminate */
}
Output:
Hello World from thread = 0
Number of threads = 8
Hello World from thread = 4
Hello World from thread = 3
Hello World from thread = 5
Hello World from thread = 2
Hello World from thread = 1
Hello World from thread = 6
Hello World from thread = 7
You should be doing something like this:
#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    parDF(tid);
}
I think it's quite straightforward.
There are two ways to achieve what you want:
Exactly the way you are describing it: each thread calls the function with its own thread id:
#pragma omp parallel
{
    int threadId = omp_get_thread_num();
    parDF(threadId);
}
The parallel block starts as many threads as the system reports it supports, and each of them executes the block. Since they differ in threadId, they will process different data. To force starting more threads, you can add a num_threads(100) clause (or whatever count you need) to the pragma.
The correct way to do what you want is to use a parallel for block.
#pragma omp parallel for
for (int i = 0; i < numThreads; ++i) {
    parDF(i);
}
This way each iteration of the loop (each value of i) gets assigned to a thread that executes it. As many iterations run in parallel as there are available threads.
Method 1 is not very general and is inefficient, because you have to have as many threads as you want function calls. Method 2 is the canonical (right) way to solve your problem.
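To make method 2 concrete, here is a self-contained sketch (parDF is reduced to a stub here; compile with -fopenmp):
#include <omp.h>
#include <cstdio>

// Stand-in stub for the real parDF: each call receives a distinct id.
static void parDF(int id)
{
    std::printf("parDF(%d) ran on thread %d\n", id, omp_get_thread_num());
}

int main()
{
    const int numThreads = 4;                  // illustrative count
    omp_set_num_threads(numThreads);
    #pragma omp parallel for
    for (int i = 0; i < numThreads; ++i)
        parDF(i);                              // each iteration is handled by some thread
    return 0;
}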

Questions About Threads

I am new to thread programming and I have a conceptual problem. I am doing matrix multiplication as a project for my class. I do it three ways: without threads; using threads to compute the scalar product for each cell of the answer matrix; and by splitting the first matrix into portions so that each thread has an equal portion to compute. My problem is that the scalar-product implementation finishes very quickly, which is what I expect, but the third implementation doesn't compute the answer much faster than the non-threaded implementation. For instance, if it were to use 2 threads, it should compute in roughly half the time, because it can work on both halves of the matrix at the same time, but that is not the case at all. I feel like there is an issue in the third implementation; I don't think it operates in parallel. The code is below. Can anyone set me straight on this? Not all of the code is relevant to the question, but I included it in case the problem is not local.
Thanks,
Main Program:
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <fstream>
#include <string>
#include <sstream>
#include <matrix.h>
#include <timer.h>
#include <random_generator2.h>

const float averager = 2.0; //used to find the average of the time taken to multiply the matrices.

//Precondition: The matrix has been manipulated in some way and is ready to output the statistics
//Outputs the size of the matrix along with the user elapsed time.
//Postcondition: The stats are outputted to the file that is specified with the number of threads used
//file name example: "Nonparallel2.dat"
void output(string file, int numThreads, long double time, int n);

//argv[1] = the size of the matrix
//argv[2] = the number of threads to be used.
//argv[3] =
int main(int argc, char* argv[])
{
    random_generator rg;
    timer t, nonparallel, scalar, variant;
    int n, total = 0, numThreads = 0;
    long double totalNonP = 0, totalScalar = 0, totalVar = 0;
    n = 100;

    /*
     * check arguments
     */
    n = atoi(argv[1]);
    n = (n < 1) ? 1 : n;
    numThreads = atoi(argv[2]);

    /*
     * allocate and generate random matrices
     */
    int** C;
    int** A;
    int** B;

    cout << "**NOW STARTING ANALYSIS FOR " << n << " X " << n << " MATRICES WITH " << numThreads << " THREADS!**" << endl;

    for (int timesThrough = 0; timesThrough < averager; timesThrough++)
    {
        cout << "Creating the matrices." << endl;
        t.start();
        C = create_matrix(n);
        A = create_random_matrix(n, rg);
        B = create_random_matrix(n, rg);
        t.stop();
        cout << "Timer (generate): " << t << endl;

        //---------------------------------------------------------Starts non parallel-----------------------------
        /*
         * run algorithms
         */
        cout << "Running non-parallel matrix multiplication: " << endl;
        nonparallel.start();
        multiply(C, A, B, n);
        nonparallel.stop();
        //-----------------------------------------Ends non parallel----------------------------------------------
        //cout << "The correct matrix" << endl;
        //output_matrix(C, n);
        cout << "Timer (multiplication): " << nonparallel << endl;
        totalNonP += nonparallel.user();

        //D is the transpose of B so that the p_scalarproduct function does not have to be rewritten
        int** D = create_matrix(n);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                D[i][j] = B[j][i];

        //---------------------------------------------------Starts Threaded Scalar Product--------------------------
        cout << "Running scalar product in parallel" << endl;
        scalar.start();
        //Does the scalar product in parallel to multiply the two matrices.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++){
                C[i][j] = 0;
                C[i][j] = p_scalarproduct(A[i], D[j], n, numThreads);
            }//ends the for loop with j
        scalar.stop();
        cout << "Timer (scalar product in parallel): " << scalar << endl;
        totalScalar += scalar.user();
        //---------------------------------------------------Ends Threaded Scalar Product------------------------

        //---------------------------------------------------Starts Threaded Variant For Loop---------------
        cout << "Running the variation on the for loop." << endl;
        boost::thread** thrds;
        //create threads and bind to p_variantforloop_t
        thrds = new boost::thread*[numThreads];
        variant.start();
        for (int i = 1; i <= numThreads; i++)
            thrds[i-1] = new boost::thread(boost::bind(&p_variantforloop_t,
                C, A, B, ((i)*n - n)/numThreads, (i * n)/numThreads, numThreads, n));
        cout << "before join" << endl;
        // join threads
        for (int i = 0; i < numThreads; i++)
            thrds[i]->join();
        variant.stop();
        // cleanup
        for (int i = 0; i < numThreads; i++)
            delete thrds[i];
        delete[] thrds;
        cout << "Timer (variation of for loop): " << variant << endl;
        totalVar += variant.user();
        //---------------------------------------------------Ends Threaded Variant For Loop------------------------

        // output_matrix(A, n);
        // output_matrix(B, n);
        // output_matrix(E, n);

        /*
         * free allocated storage
         */
        cout << "Deleting Storage" << endl;
        delete_matrix(A, n);
        delete_matrix(B, n);
        delete_matrix(C, n);
        delete_matrix(D, n);
        //avoids dangling pointers
        A = NULL;
        B = NULL;
        C = NULL;
        D = NULL;
    }//ends the timesThrough for loop

    //output the results to .dat files
    output("Nonparallel", numThreads, (totalNonP / averager), n);
    output("Scalar", numThreads, (totalScalar / averager), n);
    output("Variant", numThreads, (totalVar / averager), n);

    cout << "Nonparallel = " << (totalNonP / averager) << endl;
    cout << "Scalar = " << (totalScalar / averager) << endl;
    cout << "Variant = " << (totalVar / averager) << endl;
    return 0;
}

void output(string file, int numThreads, long double time, int n)
{
    ofstream dataFile;
    stringstream ss;
    ss << numThreads;
    file += ss.str();
    file += ".dat";
    dataFile.open(file.c_str(), ios::app);
    if(dataFile.fail())
    {
        cout << "The output file didn't open." << endl;
        exit(1);
    }//ends the if statement.
    dataFile << n << " " << time << endl;
    dataFile.close();
}//ends optimalOutput function
Matrix file:
#include <matrix.h>
#include <stdlib.h>
using namespace std;

int** create_matrix(int n)
{
    int** matrix;
    if (n < 1)
        return 0;
    matrix = new int*[n];
    for (int i = 0; i < n; i++)
        matrix[i] = new int[n];
    return matrix;
}

int** create_random_matrix(int n, random_generator& rg)
{
    int** matrix;
    if (n < 1)
        return 0;
    matrix = new int*[n];
    for (int i = 0; i < n; i++)
    {
        matrix[i] = new int[n];
        for (int j = 0; j < n; j++)
            //rg >> matrix[i][j];
            matrix[i][j] = rand() % 100;
    }
    return matrix;
}

void delete_matrix(int** matrix, int n)
{
    for (int i = 0; i < n; i++)
        delete[] matrix[i];
    delete[] matrix;
    //avoids dangling pointers.
    matrix = NULL;
}

/*
 * non-parallel matrix multiplication
 */
void multiply(int** C, int** A, int** B, int n)
{
    if ((C == A) || (C == B))
    {
        cout << "ERROR: C equals A or B!" << endl;
        return;
    }
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
        {
            C[i][j] = 0;
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

void p_scalarproduct_t(int* c, int* a, int* b,
                       int s, int e, boost::mutex* lock)
{
    int tmp;
    tmp = 0;
    for (int k = s; k < e; k++){
        tmp += a[k] * b[k];
        //cout << "a[k]= " << a[k] << " b[k]= " << b[k] << " " << k << endl;
    }
    lock->lock();
    *c = *c + tmp;
    lock->unlock();
}

int p_scalarproduct(int* a, int* b, int n, int m)
{
    int c;
    boost::mutex lock;
    boost::thread** thrds;
    c = 0;

    /* create threads and bind to p_scalarproduct_t */
    thrds = new boost::thread*[m];
    for (int i = 0; i < m; i++)
        thrds[i] = new boost::thread(boost::bind(&p_scalarproduct_t,
            &c, a, b, i*n/m, (i+1)*n/m, &lock));
    /* join threads */
    for (int i = 0; i < m; i++)
        thrds[i]->join();
    /* cleanup */
    for (int i = 0; i < m; i++)
        delete thrds[i];
    delete[] thrds;
    return c;
}

void output_matrix(int** matrix, int n)
{
    cout << "[";
    for (int i = 0; i < n; i++)
    {
        cout << "[ ";
        for (int j = 0; j < n; j++)
            cout << matrix[i][j] << " ";
        cout << "]" << endl;
    }
    cout << "]" << endl;
}

void p_variantforloop_t(int** C, int** A, int** B, int s, int e, int numThreads, int n)
{
    //cout << "s= " << s << endl << "e= " << e << endl;
    for(int i = s; i < e; i++)
        for(int j = 0; j < n; j++){
            C[i][j] = 0;
            //cout << "i " << i << " j " << j << endl;
            for (int k = 0; k < n; k++){
                C[i][j] += A[i][k] * B[k][j];
            }
        }
}//ends the function
My guess is that you're running into False Sharing. Try to use a local variable in p_variantforloop_t:
void p_variantforloop_t(int** C, int** A, int** B, int s, int e, int numThreads, int n)
{
    for(int i = s; i < e; i++)
        for(int j = 0; j < n; j++){
            int accu = 0;
            for (int k = 0; k < n; k++)
                accu += A[i][k] * B[k][j];
            C[i][j] = accu;
        }
}
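Since false sharing is named as the suspect, here is a generic illustration of the usual cure: keep each thread's hot data on its own cache line. This is a minimal sketch under stated assumptions (C++17 for over-aligned allocation, a typical 64-byte cache line; PaddedSum and sumParts are my names, not from the post):
#include <thread>
#include <vector>

struct alignas(64) PaddedSum {                 // one slot per cache line (64 bytes assumed)
    long long value = 0;
};

long long sumParts(const std::vector<int>& data, int parts)
{
    std::vector<PaddedSum> partial(parts);     // no two slots share a cache line
    std::vector<std::thread> workers;
    const int chunk = (int)data.size() / parts; // remainder handling omitted for brevity
    for (int t = 0; t < parts; ++t)
        workers.emplace_back([&, t] {
            long long local = 0;               // accumulate locally, write once at the end
            for (int i = t * chunk; i < (t + 1) * chunk; ++i)
                local += data[i];
            partial[t].value = local;
        });
    long long total = 0;
    for (int t = 0; t < parts; ++t) {
        workers[t].join();
        total += partial[t].value;
    }
    return total;
}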
Based on your responses in the comments: in theory, because you only have a single CPU core available, all the threaded versions should take the same time as the single-threaded version or longer, because of thread-management overhead. You shouldn't be seeing any speedup at all, since a time slice taken to solve one part of the matrix is a time slice stolen from another parallel task. With a single core you're only time-sharing the CPU's resources; there is no real parallel work going on in any given slice of time.
I would suspect the reason your second implementation runs faster is that you're doing less pointer dereferencing and memory access in your inner loop. For example, in the main operation C[i][j] += A[i][k] * B[k][j]; from both multiply and p_variantforloop_t, you're looking at a lot of operations at the assembly level, many of them memory-related. It would look something like the following in "assembly pseudo-code":
1) Move pointer value from address referenced by A on the stack into register R1
2) Increment the address in register R1 by the value off the stack referenced by the variable i, j, or k
3) Move the pointer address value from the address pointed to by R1 into R1
4) Increment the address in R1 by the value off the stack referenced by the variable i, j, or k
5) Move the value from the address pointed to by R1 into R1 (so R1 now holds the value of A[i][k])
6) Do steps 1-5 for the address referenced by B on the stack into register R2 (so R2 now holds the value of B[k][j])
7) Do steps 1-4 for the address referenced by C on the stack into register R3
8) Move the value from the address pointed to by R3 into R4 (i.e., R4 holds the actual value at C[i][j])
9) Multiply registers R1 and R2 and store in register R5
10) Add registers R4 and R5 and store in R4
11) Move the final value from R4 back into the memory address pointed to by R3 (now C[i][j] has the final result)
And that's assuming we have 5 general-purpose registers to play with and the compiler properly optimized your C code to take advantage of them. I left the loop index variables i, j, and k on the stack, so accessing those takes even more time than if they were in registers; how much depends on how many registers your compiler has to play with on your platform. Additionally, if you compiled without any optimizations, you could be doing a lot more memory access off the stack, where some of these temporary values are stored on the stack rather than in registers and are then re-accessed off the stack, which takes a lot longer than moving values between registers. Either way, the code above is a lot harder to optimize. It works, but if you're on a 32-bit x86 platform you're not going to have that many general-purpose registers to play with (you should have at least 6, though). x86_64 has more registers, but there are still all the memory accesses to contend with.
On the other hand, an operation like tmp += a[k] * b[k] from p_scalarproduct_t in a tight inner loop is going to move MUCH faster ... here is the above operation in assembly pseudo-code:
There would be a small initialization step for the loop
1) Make tmp a register R1 rather than a stack variable, and initialize its value to 0
2) Move the address value referenced by a on the stack into R2
3) Add the value of s off the stack to R2 and save resulting address in R2
4) Move the address value referenced by b on the stack into R3
5) Add the value of s off the stack to R3 and save resulting address in R3
6) Setup a counter in R6 initialized to e - s
After the one-time initialization we would begin the actual inner loop
7) Move the value from the address pointed to by R2 into R4
8) Move the value from the address pointed to by R3 into R5
9) Multiply R4 and R5 and store the results in R5
10) Add R5 to R1 and store the results in R1
11) Increment R2 and R3
12) Decrement counter in R6 until it reaches zero, where we terminate loop
I can't guarantee this is exactly how your compiler would set up this loop, but you can see that, in general, your scalar example requires fewer steps in the inner loop and, more importantly, fewer memory accesses. Therefore more can be done with operations that use only registers rather than operations that involve memory locations and require a memory fetch, which is much slower than register-only operations. So in general it's going to move a lot faster, and that has nothing to do with threads.
Finally, I notice you have only two nested loops visible for the scalar-product version, but each p_scalarproduct call still does O(N) work inside, so all three methods are O(N^3) overall; the difference comes from the tighter, register-friendly inner loop rather than from asymptotic complexity.