C++: trying to improve the performance of a pthread program

I need help improving the speed of my multithreaded C++ program that uses pthreads.
std::vector<double> solve_progon(std::vector<std::vector<double> > A, std::vector <double> B) {
// solving
}
std::vector<double> solve(std::vector<double> left, std::vector<double> mid, std::vector<double> right, int j) {
//solving
}
void * calc(void *thread) {
long t = (long) thread;
int start_index = t * (X_SIZE / THREADS);
int end_index = (t != THREADS - 1)?(t + 1) * (X_SIZE / THREADS) - 1: X_SIZE - 1;
std::vector<std::vector<double> > local, next;
std::vector<double> zeros;
for (int i = 0; i < Y_SIZE; i++) {
zeros.push_back(0);
}
double cur_time = 0;
while (cur_time < T) {
for (int i = start_index; i <= end_index; i ++) {
next.push_back(solve(phi[i - 1], phi[i], phi[i + 1], i - start_index));
}
cur_time += dt;
pthread_barrier_wait(&bar);
for (int i = start_index; i <=end_index; i++) {
phi[i] = next[i - start_index];
}
next.clear();
pthread_barrier_wait(&syn);
}
pthread_exit(NULL);
}
int main(int argc, char **argv) {
//Some init
pthread_barrier_init(&bar, NULL, THREADS);
pthread_barrier_init(&syn, NULL, THREADS);
pthread_t *threads = new pthread_t[THREADS];
unsigned long long start = clock_time();
for (long i = 0; i < THREADS; i++) {
if (pthread_create(&threads[i], NULL, calc, (void *)i) != 0) {
std::cout << "Can't create thread " << i << std::endl;
}
}
for (int i = 0; i < THREADS; i++) {
pthread_join(threads[i], NULL);
}
std::cout << "It takes " << (double)(clock_time() - start) / 1e9 << std::endl;
return 0;
}
Full version at https://github.com/minaevmike/fedlab_pthread/blob/master/main.cpp
For example, with 4 threads the calculation takes 118.288 s, while with 1 thread it takes 101.993 s. How can I improve the speed? Thank you.

Related

Efficiency comparison: function pointer vs function object vs branched code. Why is function object worst performing?

I was hoping to improve performance by passing a function pointer or a function object to a function called within a nested loop, in order to avoid branching inside the loop. Below are three programs: one with a function object, one with a function pointer, and one with branching. For every compiler optimization option and every problem size, the function pointer and function object versions both perform worst. This is surprising to me; why would the overhead due to a function pointer or object scale with problem size?
Second question: why does the function object perform worse than the function pointer?
Update
Finally, I am also adding a lambda expression version of the same code. Again, brute force wins. The lambda expression version takes more than twice as long as the corresponding brute-force code, with or without optimization and for different problem sizes.
The code is below. Run it with ./a.out [SIZE] [function choice]
Function Object:
#include <iostream>
#include <chrono>
#include <cstdlib> // atoi
class Interpolator
{
public:
Interpolator(){};
virtual double operator()(double left, double right) = 0;
};
class FirstOrder : public Interpolator
{
public:
FirstOrder(){};
virtual double operator()(double left, double right) { return 2.0 * left * left * left + 3.0 * right; }
};
class SecondOrder : public Interpolator
{
public:
SecondOrder(){};
virtual double operator()(double left, double right) { return 2.0 * left * left + 3.0 * right * right; }
};
double kernel(double left, double right, Interpolator *int_func) { return (*int_func)(left, right); }
int main(int argc, char *argv[])
{
double *a;
int SIZE = atoi(argv[1]);
int it = atoi(argv[2]);
//initialize
a = new double[SIZE];
for (int i = 0; i < SIZE; i++)
a[i] = (double)i;
std::cout << "Initialized" << std::endl;
Interpolator *first;
switch (it)
{
case 1:
first = new FirstOrder();
break;
case 2:
first = new SecondOrder();
break;
}
std::cout << "function" << std::endl;
auto start = std::chrono::high_resolution_clock::now();
//loop
double g;
for (int i = 0; i < SIZE; i++)
{
g = 0.0;
for (int j = 0; j < SIZE; j++)
{
g += kernel(a[i], a[j], first);
}
a[i] += g;
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "Finalized in " << duration.count() << " us" << std::endl; // duration is in microseconds
return 0;
}
Function Pointer:
#include <iostream>
#include <chrono>
#include <cstdlib> // atoi
double firstOrder(double left, double right) { return 2.0 * left * left * left + 3.0 * right; }
double secondOrder(double left, double right) { return 2.0 * left * left + 3.0 * right * right; }
double kernel(double left, double right, double (*f)(double, double))
{
return (*f)(left, right);
}
int main(int argc, char *argv[])
{
double *a;
int SIZE = atoi(argv[1]);
int it = atoi(argv[2]);
a = new double[SIZE];
for (int i = 0; i < SIZE; i++)
a[i] = (double)i; // initialization
std::cout << "Initialized" << std::endl;
//Func func(it);
double (*func)(double, double);
switch (it)
{
case 1:
func = &firstOrder;
break;
case 2:
func = &secondOrder;
break;
}
std::cout << "function" << std::endl;
auto start = std::chrono::high_resolution_clock::now();
//loop
double g;
for (int i = 0; i < SIZE; i++)
{
g = 0.0;
for (int j = 0; j < SIZE; j++)
{
g += kernel(a[i], a[j], func);
}
a[i] += g;
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "Finalized in " << duration.count() << " us" << std::endl; // duration is in microseconds
return 0;
}
Branching:
#include <iostream>
#include <chrono>
#include <cstdlib> // atoi
double firstOrder(double left, double right) { return 2.0 * left * left * left + 3.0 * right; }
double secondOrder(double left, double right) { return 2.0 * left * left + 3.0 * right * right; }
int main(int argc, char *argv[])
{
double *a;
int SIZE = atoi(argv[1]); // array size
int it = atoi(argv[2]); // function choice
//initialize
a = new double[SIZE];
double g;
for (int i = 0; i < SIZE; i++)
a[i] = (double)i; // initialization
std::cout << "Initialized" << std::endl;
auto start = std::chrono::high_resolution_clock::now();
//loop
for (int i = 0; i < SIZE; i++)
{
g = 0.0;
for (int j = 0; j < SIZE; j++)
{
if (it == 1)
{
g += firstOrder(a[i], a[j]);
}
else if (it == 2)
{
g += secondOrder(a[i], a[j]);
}
}
a[i] += g;
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "Finalized in " << duration.count() << " us" << std::endl; // duration is in microseconds
return 0;
}
Lambda expression
#include <iostream>
#include <chrono>
#include <cstdlib> // atoi
#include <functional>
std::function<double(double, double)> makeLambda(int kind){
return [kind] (double left, double right){
if (kind == 0) return 2.0 * left * left * left + 3.0 * right;
return 2.0 * left * left + 3.0 * right * right; // kind == 1; every path now returns a value
};
}
int main(int argc, char *argv[])
{
double *a;
int SIZE = atoi(argv[1]);
int it = atoi(argv[2]);
//initialize
a = new double[SIZE];
for (int i = 0; i < SIZE; i++)
a[i] = (double)i;
std::cout << "Initialized" << std::endl;
std::function<double(double,double)> interp ;
switch (it)
{
case 1:
interp = makeLambda(0);
break;
case 2:
interp = makeLambda(1);
break;
}
std::cout << "function" << std::endl;
auto start = std::chrono::high_resolution_clock::now();
//loop
double g;
for (int i = 0; i < SIZE; i++)
{
g = 0.0;
for (int j = 0; j < SIZE; j++)
{
g += interp(a[i], a[j]);
}
a[i] += g;
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "Finalized in " << duration.count() << " us" << std::endl; // duration is in microseconds
return 0;
}
Although this question is about performance, I have a few comments to improve the code:
Branching could be potentially more error-prone
Imagine that you want to add another interpolation function. Then you need to define a new function and add a new case (for switch) or a new if/else. One solution is to create a vector of lambdas:
std::vector<std::function<double(double, double)>> Interpolate {
[](double left, double right) {return 2.0*left*left*left + 3.0*right;}, //first order
[](double left, double right) {return 2.0*left*left + 3.0*right*right;} //second order
};
or alternatively:
double firstOrder(double left, double right) {return 2.0*left*left*left + 3.0*right;}
double secondOrder(double left, double right) {return 2.0*left*left + 3.0*right*right;}
std::array<double(*)(double, double), 2> Interpolate {firstOrder, secondOrder};
With this change, you do not need an if or switch statement. You simply write:
g += Interpolate[it - 1](a[i], a[j]);
instead of
if (it == 1)
g += firstOrder(a[i], a[j]);
else if (it == 2)
g += secondOrder(a[i], a[j]);
This requires less maintenance, and you are less likely to forget an if/else branch.
Better to avoid bare new
Instead of writing double *a = new double[SIZE]; prefer std::vector<double> a(SIZE);. This way there is no resource to release manually, and a potential memory leak is avoided.
Back to the question: I do not see a reason why lambdas should result in better performance, especially in this case, where the lambda is wrapped in std::function (which adds a type-erased indirect call) and we do not benefit from the constexpr nature of lambdas.

Adding threads increases time needed to perform the same task

I have been struggling with this for the past 3 days. I am doing some image processing. I came to a point where I could distribute the workload across more threads, since I had "patches" of the image that I could pass to different threads. Unfortunately, the total time it took to process the image was the same whether I used 1 thread or more.
So I started digging: I made copies of the patches so that every thread has its own local data, and I stopped writing the result to an array, but it was still the same. So I made the most minimal program I could. After a thread was created, it would make a 10x10 matrix and write its determinant to the console. So nothing was shared between threads; the only thing passed was the index of the thread.
But it was still the same. I ran tests on both Linux and Windows. These show the time required to compute one determinant, so when using two threads, each one took the same amount of time unless stated otherwise:
Windows:
1 thread = 4479 ms
2 threads = 7500 ms
3 threads = 11300 ms
4 threads = 15800 ms
Linux:
1 thread = 490 ms
2 threads = 478 ms
3 threads = first: 503 ms; other two: 1230 ms
4 threads = 1340 ms
The first thing is obvious: Linux computes the same thing 10x faster. Never mind that. On Windows, however, it is not only that single-thread performance is worse; it gets worse with every thread I add. Linux seems to slow down only when the work is done on a logical (hyper-threaded) core. That is why 1 and 2 threads are fine, since I have a 2-core CPU with HT; with 3 threads it slows down on the core that also uses HT, while the other core is fine. Windows, however, performs badly no matter what.
The funny thing is that on Windows it takes roughly the same amount of time whether I compute 4 determinants on one core or 1 determinant on each core.
This is the code I used to get these results. I was able to compile it with g++ and MSVC without problems. Only the last few methods are important; there are some constructors that I am not sure are even used.
#include <iostream>
#include <cmath>
#include <thread>
#include <chrono>
#include <float.h>
#include <cstdio> // getchar
class FVector
{
public:
FVector();
FVector(int length);
FVector(const FVector &vec);
FVector(FVector &&vec);
FVector &operator=(const FVector &vec);
FVector &operator=(FVector &&vec);
~FVector();
void setLength(int length);
int getLength() const;
double *getData();
const double* getConstData() const;
private:
double *data;
int length;
void allocateDataArray(int length);
void deallocateDataArray();
};
FVector::FVector() {
data = nullptr;
length = 0;
}
FVector::FVector(int length) {
data = nullptr;
this->length = length;
allocateDataArray(length);
for (int i = 0; i < length; i++) {
data[i] = 0.;
}
}
FVector::FVector(const FVector &vec) {
allocateDataArray(vec.length);
length = vec.length;
for (int i = 0; i < length; i++) {
data[i] = vec.data[i];
}
}
FVector::FVector(FVector &&vec) {
data = vec.data;
vec.data = nullptr;
length = vec.length;
}
FVector &FVector::operator=(const FVector &vec) {
deallocateDataArray();
if (data == nullptr) {
allocateDataArray(vec.length);
for (int i = 0; i < vec.length; i++) {
data[i] = vec.data[i];
}
length = vec.length;
}
return *this;
}
FVector &FVector::operator=(FVector &&vec) {
deallocateDataArray();
if (data == nullptr) {
data = vec.data;
vec.data = nullptr;
length = vec.length;
}
return *this;
}
FVector::~FVector() {
deallocateDataArray();
}
void FVector::allocateDataArray(int length) {
data = new double[length];
}
void FVector::deallocateDataArray() {
if (data != nullptr) {
delete[] data;
}
data = nullptr;
}
int FVector::getLength() const {
return length;
}
double *FVector::getData() {
return data;
}
void FVector::setLength(int length) {
deallocateDataArray();
allocateDataArray(length);
this->length = length;
}
const double* FVector::getConstData() const {
return data;
}
class FMatrix
{
public:
FMatrix();
FMatrix(int columns, int rows);
FMatrix(const FMatrix &mat);
FMatrix(FMatrix &&mat);
FMatrix& operator=(const FMatrix &mat);
FMatrix& operator=(FMatrix &&mat);
~FMatrix();
FVector *getData();
const FVector* getConstData() const;
void makeIdentity();
int determinant() const;
private:
FVector *data;
int columns;
int rows;
void deallocateDataArray();
void allocateDataArray(int count);
};
FMatrix::FMatrix() {
data = nullptr;
columns = 0;
rows = 0;
}
FMatrix::FMatrix(int columns, int rows) {
data = nullptr;
allocateDataArray(columns);
for (int i = 0; i < columns; i++) {
data[i].setLength(rows);
}
this->columns = columns;
this->rows = rows;
}
FMatrix::FMatrix(const FMatrix &mat) {
data = nullptr;
allocateDataArray(mat.columns);
for (int i = 0; i < mat.columns; i++) {
data[i].setLength(mat.data[i].getLength());
data[i] = mat.data[i];
}
columns = mat.columns;
rows = mat.rows;
}
FMatrix::FMatrix(FMatrix &&mat) {
data = mat.data;
mat.data = nullptr;
columns = mat.columns;
rows = mat.rows;
}
FMatrix &FMatrix::operator=(const FMatrix &mat) {
deallocateDataArray();
if (data == nullptr) {
allocateDataArray(mat.columns);
for (int i = 0; i < mat.columns; i++) {
data[i].setLength(mat.rows);
data[i] = mat.data[i];
}
}
columns = mat.columns;
rows = mat.rows;
return *this;
}
FMatrix &FMatrix::operator=(FMatrix &&mat) {
deallocateDataArray();
data = mat.data;
mat.data = nullptr;
columns = mat.columns;
rows = mat.rows;
return *this;
}
FMatrix::~FMatrix() {
deallocateDataArray();
}
void FMatrix::deallocateDataArray() {
if (data != nullptr) {
delete[] data;
}
data = nullptr;
}
void FMatrix::allocateDataArray(int count) {
data = new FVector[count];
}
FVector *FMatrix::getData() {
return data;
}
void FMatrix::makeIdentity() {
for (int i = 0; i < columns; i++) {
for (int j = 0; j < rows; j++) {
if (i == j) {
data[i].getData()[j] = 1.;
}
else {
data[i].getData()[j] = 0.;
}
}
}
}
int FMatrix::determinant() const {
int det = 0;
FMatrix subMatrix(columns - 1, rows - 1);
int subi;
if (columns == rows && rows == 1) {
return data[0].getData()[0];
}
if (columns != rows) {
//throw EXCEPTIONS::SINGULAR_MATRIX;
}
if (columns == 2)
return ((data[0].getConstData()[0] * data[1].getConstData()[1]) - (data[1].getConstData()[0] * data[0].getConstData()[1]));
else {
for (int x = 0; x < columns; x++) {
subi = 0;
for (int i = 0; i < columns; i++) {
for (int j = 1; j < columns; j++) {
if (x == i) {
continue;
}
subMatrix.data[subi].getData()[j - 1] = data[i].getConstData()[j];
}
if (x != i) {
subi++;
}
}
det += (pow(-1, x) * data[x].getConstData()[0] * subMatrix.determinant());
}
}
return det;
}
const FVector* FMatrix::getConstData() const {
return data;
}
class FCore
{
public:
FCore();
~FCore();
void process();
private:
int getMaxThreads() const;
void joinThreads(std::thread *threads, int max);
};
void parallelTest(int i) {
auto start = std::chrono::high_resolution_clock::now();
FMatrix m(10, 10);
m.makeIdentity();
std::cout << "Det: " << i << "= " << m.determinant() << std::endl;
auto finish = std::chrono::high_resolution_clock::now();
auto microseconds = std::chrono::duration_cast<std::chrono::microseconds>(finish - start);
std::cout << "Time: " << microseconds.count() / 1000. << std::endl;
}
FCore::FCore()
{
}
FCore::~FCore()
{
}
void FCore::process() {
/*********************************************/
/*Set this to limit number of created threads*/
int threadCount = getMaxThreads();
/*********************************************/
/*********************************************/
std::cout << "Thread count: " << threadCount;
std::thread *threads = new std::thread[threadCount];
for (int i = 0; i < threadCount; i++) {
threads[i] = std::thread(parallelTest, i);
}
joinThreads(threads, threadCount);
delete[] threads;
getchar();
}
int FCore::getMaxThreads() const {
int count = std::thread::hardware_concurrency();
if (count == 0) {
return 1;
}
else {
return count;
}
}
void FCore::joinThreads(std::thread *threads, int max) {
for (int i = 0; i < max; i++) {
threads[i].join();
}
}
int main() {
FCore core;
core.process();
return 0;
}
Obviously I've done some testing with more primitive workloads, as simple as adding numbers, and the result was the same. So I just wanted to ask whether any of you have ever stumbled on something remotely similar to this. I know that I won't get times on Windows as good as on Linux, but at least the scaling could be better.
Tested on Win7/Linux with an Intel 2C+2T and on Win10 with a Ryzen 8C+8T. The times posted are from the 2C+2T machine.

Segmentation fault , caused by wrong interval (2D array)

I wanted to learn how threads work, so I tried to write a program that uses 2 threads to copy a picture (just to test my newly acquired threading skills). But I ran into an error, probably because my interval (created by the interval function) only works, I believe, with one-dimensional arrays. How can I change my program to correctly create intervals that work on two-dimensional arrays, such as pictures?
#include <iostream>
#include <vector>
#include <time.h>
#include <thread>
#include <mutex>
#include <png++/png.hpp>
std::mutex my_mutex;
std::vector<int> interval(int max, int n_threads)
{
std::vector<int> intervallum;
int ugras = max / n_threads;
int maradek = max % n_threads;
int n1 = 0;
int n2;
intervallum.push_back(n1);
for (int i = 0; i < n_threads; i++)
{
n2 = n1 + ugras;
if (i == n_threads - 1)
n2 += maradek;
intervallum.push_back(n2);
n1 = n2;
}
return intervallum;
}
void create_image(png::image<png::rgb_pixel> image, png::image<png::rgb_pixel> new_image, int start, int end)
{
std::lock_guard<std::mutex> lock(my_mutex);
for (int i = start; i < end; i++)
for (int j = start; j < end; j++)
{
new_image[i][j].red = image[i][j].red;
new_image[i][j].blue = image[i][j].blue;
new_image[i][j].green = image[i][j].green;
}
}
int main()
{
png::image<png::rgb_pixel> png_image("mandel.png");
int image_size = png_image.get_width() * png_image.get_height();
png::image<png::rgb_pixel> new_image(png_image.get_width(), png_image.get_height());
time_t start, end;
time(&start);
int size = 2;
std::vector<std::thread> threads;
std::vector<int> stuff_interval = interval(image_size, size);
for (int i = 0; i < size-1; i++)
threads.push_back(std::thread(create_image, std::ref(png_image), std::ref(new_image), stuff_interval[i], stuff_interval[i + 1]));
for (auto& i : threads)
i.join();
create_image(png_image,new_image,stuff_interval[size-2],stuff_interval[size-1]);
new_image.write("test.png");
time(&end);
std::cout << (end - start) << std::endl; // elapsed seconds: end minus start
return 0;
}
Okay, I found a way around it. This way I do not get the segmentation fault, but the image is not copied correctly: the new image is completely black. Here is the code:
EDIT: It seems I was passing the pictures incorrectly; that is why the picture was black.
#include <iostream>
#include <vector>
#include <time.h>
#include <thread>
#include <mutex>
#include <png++/png.hpp>
std::mutex my_mutex;
std::vector<int> interval(int max, int n_threads)
{
std::vector<int> intervallum;
int ugras = max / n_threads;
int maradek = max % n_threads;
int n1 = 0;
int n2;
intervallum.push_back(n1);
for (int i = 0; i < n_threads; i++)
{
n2 = n1 + ugras;
if (i == n_threads - 1)
n2 += maradek;
intervallum.push_back(n2);
n1 = n2;
}
return intervallum;
}
void create_image(png::image<png::rgb_pixel>& image, png::image<png::rgb_pixel>& new_image, int start, int end)
{
std::lock_guard<std::mutex> lock(my_mutex);
for (int i = start; i < end; i++)
for (int j = 0; j < image.get_height(); j++)
{
new_image[i][j].red = image[i][j].red;
new_image[i][j].blue = image[i][j].blue;
new_image[i][j].green = image[i][j].green;
}
}
int main()
{
png::image<png::rgb_pixel> png_image("mandel.png");
int image_size = png_image.get_width() * png_image.get_height();
png::image<png::rgb_pixel> new_image(png_image.get_width(), png_image.get_height());
time_t start, end;
time(&start);
int size = 2;
std::vector<std::thread> threads;
std::vector<int> stuff_interval = interval(png_image.get_width(), size);
new_image.write("test2.png");
for (int i = 0; i < size - 1; i++)
threads.push_back(std::thread(create_image, std::ref(png_image), std::ref(new_image), stuff_interval[i], stuff_interval[i + 1]));
for (auto &i : threads)
i.join();
create_image(std::ref(png_image), std::ref(new_image), stuff_interval[size - 1], stuff_interval[size]);
new_image.write("test.png");
time(&end);
std::cout << (end - start) << std::endl; // elapsed seconds: end minus start
return 0;
}

measured runtime from c++ "time.h" is double than real

I am running this pthread C++ program (Gaussian elimination) on my laptop to measure its runtime.
The program actually runs for about 10 seconds, but my output shows about 20 seconds. What is wrong with this program?
I built and ran it with:
g++ -pthread main.c
./a.out 32 2048
#include <stdio.h>
#include <stdlib.h>
#include <ctime>
#include <cstdlib>
#include <pthread.h>
#include <iostream>
typedef float Type;
void mat_rand (Type**, int, int);
Type** mat_aloc (int, int);
void mat_free (Type**);
void mat_print (Type**, int, int);
void* eliminate(void*);
unsigned int n, max_threads, active_threads, thread_length;
Type** A;
int current_row;
struct args
{
int start;
int end;
};
typedef struct args argument;
void *print_message_function( void *ptr );
int main(int argc, char *argv[])
{
if (argc < 3)
{
printf ("Error! Please enter the number of threads and the matrix dimension!\n");
return 0;
} else
{
n = atoi(argv[2]);
max_threads = atoi(argv[1]);
if (n > 4096)
{
printf ("The maximum allowed size is 4096!\n");
return 0;
}
if (max_threads > 32)
{
printf ("The maximum allowed Threads Count is 32!\n");
return 0;
}
}
A = mat_aloc(n , n+1);
mat_rand (A, n, n+1);
//mat_print (A, n, n+1);
std::clock_t start;
double exe_time;
start = std::clock();
pthread_attr_t attr;
pthread_attr_init(&attr);
argument* thread_args = new argument[max_threads];
pthread_t* thread = new pthread_t[max_threads];
for (int i=0; i<n-1; i++)
{
current_row = i;
if (max_threads >= n-i)
active_threads = n-i-1;
else
active_threads = max_threads;
thread_length = (n-i-1)/active_threads;
for (int j=0; j<active_threads-1; j++)
{
thread_args[j].start = i+1+j*thread_length;
thread_args[j].end = i+(j+1)*thread_length; // inclusive end: one row before the next thread's start
pthread_create( &thread[j], &attr, eliminate, (void*) &thread_args[j]);
}
thread_args[active_threads-1].start = i+1+(active_threads-1)*thread_length;
thread_args[active_threads-1].end = n-1;
pthread_create(&thread[active_threads-1], &attr, eliminate, (void*) &thread_args[active_threads-1]);
for (int j=0; j<active_threads; j++)
{
pthread_join(thread[j], NULL);
}
}
exe_time = (clock() - start) / (double) CLOCKS_PER_SEC;
printf("Execution time for Matrix of size %i: %f\n", n, exe_time);
//mat_print (A, n, n+1);
return 0;
}
void* eliminate(void* arg)
{
Type k, row_constant;
argument* info = (argument*) arg;
row_constant = A[current_row][current_row];
for (int i=info->start; i<=info->end; i++)
{
k = A[i][current_row] / row_constant;
A[i][current_row] = 0;
for (int j=current_row+1; j<n+1; j++)
{
A[i][j] -= k*A[current_row][j];
}
}
return NULL; // a thread function declared to return void* must return a value
}
// matrix random values
void mat_rand (Type** matrix, int row, int column)
{
for (int i=0; i<row; i++)
for (int j=0; j<column; j++)
{
matrix[i][j] = (float)(1) + ((float)rand()/(float)RAND_MAX)*256;
}
}
// allocates a 2d matrix
Type** mat_aloc (int row, int column)
{
Type* temp = new Type [row*column];
if (temp == NULL)
{
delete [] temp;
return 0;
}
Type** mat = new Type* [row];
if (mat == NULL)
{
delete [] temp;
return 0;
}
for (int i=0; i<row; i++)
{
mat[i] = temp + i*column;
}
return mat;
}
// free memory of matrix
void mat_free (Type** matrix)
{
delete[] (*matrix);
delete[] matrix;
}
// print matrix
void mat_print (Type** matrix, int row, int column)
{
for (int i=0; i<row; i++)
{
for (int j=0; j<column; j++)
{
std::cout<< matrix[i][j] << "\t\t";
}
printf("\n");
}
printf(".................\n");
}
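clock reports the CPU time used by the whole process, summed over all of its threads, not elapsed wall-clock time. If you have 2 CPUs and run a thread on each one for 10 seconds, clock will report about 20 seconds. To measure elapsed time, use a wall clock such as std::chrono::steady_clock.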

Implementation of Lamport's Bakery Algorithm has seg faults with more than 1 thread

I'm implementing Lamport's Bakery Algorithm using pthreads and a Processor class to act as shared memory. With a single thread it works fine; with 2 threads I get a seg fault after thread 2 runs through all 30 attempts to access the 'bakery':
dl-tls.c: No such file or directory.
With 3 or more threads I get the seg fault after outputting "here" twice from the bakeryAlgo function:
0x0804ae52 in Processor::getNumber (this=0x5b18c483) at Processor.cpp:33
bakery.cpp
struct argStruct {
vector<Processor>* processors;
Processor* processor;
};
int findMax(vector<Processor>* processors) {
int max = -99;
for (int i = 0; i < processors->size(); i++) {
if (processors->at(i).getNumber() > max) {
max = processors->at(i).getNumber();
}
}
return max;
}
void* bakeryAlgo(void* arg) {
struct argStruct* args = static_cast<struct argStruct *>(arg);
cout << "here" << endl;
for (int i = 0; i < 30; i++) {
args->processor->setChoosing(1);
args->processor->setNumber(findMax(args->processors));
args->processor->setChoosing(0);
for (int j = 0; j < args->processors->size(); j++) {
int jChoosing = args->processors->at(j).getChoosing();
int jNumber = args->processors->at(j).getNumber();
int jId = args->processors->at(j).getId();
int pNumber = args->processor->getNumber();
int pId = args->processor->getId();
if (jId != pId) {
while (jChoosing != 0) {}
while (jNumber != 0 && ((jNumber < pNumber) || ((jNumber == pNumber) && (jId < pId)))) { }
}
}
cout << "Processor: " << args->processor->getId() << " executing critical section!" << endl;
args->processor->setNumber(0);
}
}
int main(int argc, char *argv[]) {
// Check that a command line argument was provided
if (2 == argc) {
int numProcessors = atoi(argv[1]);
vector<Processor> processors;
vector<argStruct> argVect;
vector < pthread_t > threads;
for (int i = 0; i < numProcessors; i++) {
Processor p = Processor(i);
processors.push_back(p);
}
for (int i = 0; i < numProcessors; i++) {
pthread_t processorThread;
struct argStruct args;
args.processors = &processors;
args.processor = &processors.at(i);
argVect.push_back(args);
threads.push_back(processorThread);
pthread_create(&threads.at(i), NULL, &bakeryAlgo, &argVect.at(i));
}
for (int i = 0; i < numProcessors; i++) {
pthread_join(threads.at(i), NULL);
}
}
else {
cout << "Usage: bakery num, num is number of threads." << endl;
}
return 0;
}
The code in Processor.cpp / Processor.h is simple: it is just a few getters and setters on the values id, choosing, and number, with a default constructor and a constructor that takes an int id.
Processor::Processor() {
}
Processor::Processor(int idval) {
id = idval;
choosing = 0;
number = 0;
}
Processor::~Processor() {
}
int Processor::getChoosing() {
return choosing;
}
int Processor::getNumber() {
return number;
}
int Processor::getId() {
return id;
}
void Processor::setChoosing(int c) {
choosing = c;
}
void Processor::setNumber(int n) {
number = n;
}
Does anyone have any idea why these seg faults are occurring? The places where gdb says they occur look like innocent lines of code to me.
Are you using a pointer to a vector defined in main as your data? Threads of one process share the same address space, so other threads may access objects on main's stack as long as main keeps them alive, which it does here because it joins every thread before returning. That by itself should not be the source of your troubles.
You are taking the address of an element of a vector that is still growing:
for (int i = 0; i < numProcessors; i++) {
pthread_t processorThread;
struct argStruct args;
args.processors = &processors;
args.processor = &processors.at(i);
argVect.push_back(args);
threads.push_back(processorThread);
// danger!
pthread_create(&threads.at(i), NULL, &bakeryAlgo, &argVect.at(i));
}
Each time a new element is pushed onto the threads or argVect vector, the vector may be reallocated in memory, so the pointers that you passed to pthread_create (&threads.at(i) and &argVect.at(i)) may be pointing to garbage when the next thread is added.
You want to do this instead:
for (int i = 0; i < numProcessors; i++) {
pthread_t processorThread;
struct argStruct args;
args.processors = &processors;
args.processor = &processors.at(i);
argVect.push_back(args);
threads.push_back(processorThread);
}
for (int i = 0; i < numProcessors; i++) {
pthread_create(&threads.at(i), NULL, &bakeryAlgo, &argVect.at(i));
}
By waiting until all the elements have been added to the vectors before you create your threads, you are now passing pointers that stay valid while the threads are running.