I'm struggling to write a threaded program in C++ that is both accurate and faster than my non-threaded version. I'm finding the largest entry in a 2D array of random doubles.
Here is the general code:
void getLargest(double** anArray, double largestEntry, int dimLower, int dimUpper, int dim) {
    for (int i = dimLower; i < dimUpper; i++) {
        for (int j = 0; j < dim; j++) {
            if (anArray[i][j] > largestEntry) {
                largestEntry = anArray[i][j];
            }
        }
    }
}
int main(){
    // Seed the random number generator
    srand(time(NULL));
    // 2D array dimension
    int dim = 30000;
    // Specify max values
    double max = (double) (dim * dim * dim);
    double min = (double) (dim * dim * dim * -1.0);
    double t1 = get_wallTime();
    // Create a 2D array
    double **myArray = new double*[dim];
    for (int i = 0; i < dim; i++){
        myArray[i] = new double[dim];
        for (int j = 0; j < dim; j++){
            // generate random number
            myArray[i][j] = genRandNum(min, max);
        }
    }
    double largestEntry = 0.0;
    int portion = dim / 5;
    std::future<void> thread1 = std::async(std::launch::async, getLargest, myArray, largestEntry, 0, portion, dim);
    thread1.get();
    std::future<void> thread2 = std::async(std::launch::async, getLargest, myArray, largestEntry, portion, (portion * 2), dim);
    thread2.get();
    std::future<void> thread3 = std::async(std::launch::async, getLargest, myArray, largestEntry, (portion * 2), (portion * 3), dim);
    thread3.get();
    std::future<void> thread4 = std::async(std::launch::async, getLargest, myArray, largestEntry, (portion * 3), (portion * 4), dim);
    thread4.get();
    std::future<void> thread5 = std::async(std::launch::async, getLargest, myArray, largestEntry, (portion * 4), dim, dim);
    thread5.get();
    double t2 = get_wallTime();
    double t3 = t2 - t1;
    cout << " The largest entry is " << largestEntry << endl;
    cout << "runtime : " << t3 << "\n";
}
I have the appropriate #includes.
My understanding of the code is that each thread updates largestEntry if its portion of the 2D array contains a larger entry than the threads before it found. Then I output the largest entry and the runtime.
Here is the output:
The largest entry is 0
runtime : 14.7113
This runs way faster than I'm expecting it to, and the largest entry should not be zero. Basically, I'm having trouble finding out why that is. I'm not very comfortable with async, but this method has worked well for me before. I know I'm not updating largestEntry correctly, though I'm unsure of where I've made the mistake.
Thanks for any advice you guys could give.
You're passing largestEntry into getLargest by value, so when it is updated only the value within the function is updated, not the value in main.
Two other notes: first, the thread1.get() etc. calls should all come after every thread has been launched. As written, each get() blocks until that thread finishes before the next one is even created, so the threads run one at a time instead of simultaneously.
Second, each thread should return its own value for largestEntry (it can be the value of the future), and you then compare those to find the largest. If they all reference the same variable you're going to get race conditions between the threads, CPU cache thrashing, and possibly bad answers depending on how the optimizer handles the updates to largestEntry (it could avoid writing the value out until all the looping was done).
Related
My task is to parallelize the creation, doubling, and summation of the array seen in my code below using C++ and OpenMP. However, I cannot get the summation to work properly in parallel. This is my first time using OpenMP, and I am also quite new to C++. I have tried what can be seen in my code below, as well as other variations (putting the sum outside of the for loop, defining a per-thread sum in the parallel region and adding it to the global sum, what is suggested here, etc). The sum should be 4.15362e-14, but when I use multiple threads I get different, incorrect results each time. What is the proper way to achieve this?
P.S. We have only been taught the critical, master, barrier, and single constructs thus far so I would appreciate if answers would not include any others. Thanks!
#include <iostream>
#include <cmath>
#include <omp.h>
using namespace std;

int main()
{
    const int size = 256;
    double* sinTable = new double[256];
    double sum = 0.0;

    // parallelized
    #pragma omp parallel
    {
        for (int n = 0; n < size; n++)
        {
            sinTable[n] = std::sin(2 * M_PI * n / size); // calculate and insert element into array
            sinTable[n] = sinTable[n] * 2;               // double current element in array
            #pragma omp critical
            sum += sinTable[n];                          // add element to total sum (one thread at a time)
        }
    }

    // print sum and exit
    cout << "Sum: " << sum << endl;
    return 0;
}
Unfortunately your code is not OK: with a plain #pragma omp parallel, every thread executes the entire for loop, so each element is computed and added once per thread instead of the work being distributed. You should use:
#pragma omp parallel for
to distribute the work among threads.
Another alternative is to use reduction:
int main()
{
    const int size = 256;
    const double step = (2.0 * M_PI) / static_cast<double>(size);
    double* sinTable = new double[size];
    double sum = 0.0;

    // parallelized
    #pragma omp parallel for reduction(+:sum)
    for (int n = 0; n < size; n++)
    {
        sinTable[n] = std::sin(static_cast<double>(n) * step); // calculate and insert element into array
        sinTable[n] = sinTable[n] * 2.0;                       // double current element in array
        sum += sinTable[n];                                    // each thread sums a partial result; OpenMP combines them at the end
    }

    // print sum and exit
    cout << "Sum: " << sum << endl;
    delete[] sinTable;
    return 0;
}
Note that in theory the sum should be zero. The value you obtain depends on the order of the additions, so slight differences can be observed due to rounding errors:
size=256 sum(openmp)=2.84217e-14 sum(no openmp)= 4.15362e-14
size=512 sum(openmp)=5.68434e-14 sum(no openmp)= 5.68434e-14
size=1024 sum(openmp)=0 sum(no openmp)=-2.83332e-14
I'm trying to implement a naive threaded matrix multiplication. I'm creating one thread for each linear combination (each entry of the result), using a manually allocated result array and writing to the respective position from each thread. However, my code runs slower than the single-threaded version. Is it the use of memory that slows down the code?
I used heap allocation to avoid any memory copying, but could that be the problem?
#define rows first
#define columns second

void linear_combination(double const *arr_1, std::pair<int, int> sp_1,
                        double const *arr_2, std::pair<int, int> sp_2,
                        double *arr_3, std::pair<int, int> sp_3,
                        int base_row, int base_col){
    double sum = 0;
    for (int i = 0; i < sp_1.columns; i++){
        int idx_1 = base_row * sp_1.columns + i;
        int idx_2 = i * sp_2.columns + base_col;
        sum += arr_1[idx_1] * arr_2[idx_2];
    }
    int idx_3 = base_row * sp_3.columns + base_col;
    arr_3[idx_3] = sum;
}
auto matmul(double *m1, std::pair<int, int> sp_1, double *m2, std::pair<int, int> sp_2){
    // "sp_n" stands for shape of the n-th matrix
    if (sp_1.second == sp_2.first){
        auto *m3 = (double *) malloc(sp_1.first * sp_2.second * sizeof(double));
        std::pair sp_3 = {sp_1.first, sp_2.second};
        for (int k = 0; k < sp_3.rows; k++){
            std::vector<std::thread> thread_list(sp_2.columns);
            for (int j = 0; j < sp_2.columns; j++){
                // will automatically save linear combination sum into m3
                thread_list[j] = std::thread(linear_combination,
                                             m1, sp_1,
                                             m2, sp_2,
                                             m3, sp_3,
                                             k, j);
            }
            // join threads and use calculation
            std::for_each(thread_list.begin(), thread_list.end(), std::mem_fn(&std::thread::join));
        }
        return std::make_tuple(m3, sp_3);
    } else {
        puts("Size mismatch");
        printf("%d %d\n", sp_1.second, sp_2.first);
        double m3 = 0;
        return std::make_tuple(&m3, std::make_pair(0, 0));
    }
}
I am carrying out a 3D matrix by 1D vector multiplication within a class in C++. All variables are contained within the class. When I create one instance of the class on a single thread and carry out the multiplication 100 times, the multiplication operation takes ~0.8ms each time.
When I create 4 instances of the class, each on a separate thread, and run the multiplication operation 25 times on each, the operation takes ~1.7ms each time. The operations on each thread are being carried out on separate data, and are running on separate cores.
As expected, however, the overall time to complete the 100 matrix multiplications is reduced with 4 threads over a single thread.
My questions are:
1) What is the cause of the slowdown in the multiplication operation when multiple threads are used?
2) Is there any way in which the operation can be sped up?
EDIT:
To clarify the problem:
The overall time to carry out 100 matrix products does decrease when I split them over 4 threads - threading does make the overall program faster.
The timing in question is the actual matrix multiplication within the already created threads (see code). This time excludes thread creation, and memory allocation & deletion. This is the time that doubles when I use 4 threads rather than 1. The overall time to carry out all multiplications halves when I use 4 threads. My question is why are the individual matrix products slower when running on 4 threads rather than 1.
Below is a code sample. It is not my actual code, but a simplified example I have written to demonstrate the problem.
Multiply.h
class Multiply
{
public:
    Multiply ();
    ~Multiply ();
    void DoProduct ();
private:
    double *a;
};
Multiply.cpp
Multiply::Multiply ()
{
    a = new double[100 * 100 * 100];
    std::memset(a, 1, 100 * 100 * 100 * sizeof(double));
}

void Multiply::DoProduct ()
{
    double *result = new double[100 * 100];
    double *b = new double[100];
    std::memset(result, 0, 100 * 100 * sizeof(double));
    std::memset(b, 1, 100 * sizeof(double));

    // Timer starts here, i.e. excluding memory allocation and thread creation and the rest
    auto start_time = std::chrono::high_resolution_clock::now ();
    // matrix product
    for (int i = 0; i < 100; ++i)
        for (int j = 0; j < 100; ++j)
        {
            double t = 0;
            for (int k = 0; k < 100; ++k)
                t = t + a[k + j * 100 + i * 100 * 100] * b[k];
            result[j + 100 * i] = result[j + 100 * i] + t;
        }
    // Timer stops here, i.e. before memory deletion
    int time = std::chrono::duration_cast<std::chrono::microseconds> (std::chrono::high_resolution_clock::now () - start_time).count ();
    std::cout << "Time: " << time << std::endl;

    delete[] result;
    delete[] b;
}

Multiply::~Multiply ()
{
    delete[] a;
}
Main.cpp
void threadWork (int iters)
{
    Multiply *m = new Multiply ();
    for (int i = 0; i < iters; i++)
    {
        m->DoProduct ();
    }
}

int main ()
{
    int numProducts = 100;
    int numThreads = 1; //4;
    std::thread t[numThreads];
    auto start_time = std::chrono::high_resolution_clock::now ();
    for (int i = 0; i < numThreads; i++)
        t[i] = std::thread (threadWork, numProducts / numThreads);
    for (int i = 0; i < numThreads; i++)
        t[i].join ();
    int time = std::chrono::duration_cast<std::chrono::microseconds> (std::chrono::high_resolution_clock::now () - start_time).count ();
    std::cout << "Time total: " << time << std::endl;
}
Async and thread calls are quite expensive compared to ordinary function calls, so pre-launch the threads and create a thread pool: push your functions as tasks and let the pool's worker threads pull those tasks from a queue.
The tasks can be given priorities so they execute in the proper order, which avoids delays arising from the use of mutexes and locks.
You are also launching too many threads; keep the count at or below the hardware concurrency of your system to avoid bottlenecks.
I am studying pthreads but am confused about how to use them to coordinate functions.
For example, I have some simple code that does a few operations on an array, like the following:
float add(int numbers[5]){
    float sum;
    for(int i = 0; i < 5; i++){
        sum = sum + numbers[i] + 5;
    }
    return sum/5;
}

float subtract(int numbers[5]){
    float sum;
    for(int i = 0; i < 5; i++){
        sum = sum + numbers[i] - 10;
    }
    return sum/5;
}

float mul(int numbers[5]){
    float sum;
    for(int i = 0; i < 5; i++){
        sum = sum + (float)numbers[i] * 1.5;
    }
    return sum/5;
}

float div(int numbers[5]){
    float sum;
    for(int i = 0; i < 5; i++){
        sum = sum + (float)numbers[i] / 2;
    }
    return sum/5;
}
int main(){
    int numbers [5] = { 34, 2, 77, 40, 12 };
    float addition = add(numbers);
    float subtraction = subtract(numbers);
    float multiplication = mul(numbers);
    float division = div(numbers);
    cout << addition + subtraction + multiplication + division << endl;
    return -1;
}
Since all four functions are independent of each other and use the same input, how can I put each operation into its own thread and let the functions (or threads) run at the same time?
I think if one day I have a very large array, a program like the one above will take a lot of time, but if I can make the functions run simultaneously it will save a lot of time.
First of all, I suspect you are not clear on how arrays are passed into functions. float subtract(int numbers[5]) does not tell the function anything about the size of the passed array. It is the equivalent of float subtract(int numbers[]) (no size), which, in turn, is equivalent to float subtract(int* numbers) (a pointer to int). You also have a bug in your function, since you do not initialize sum before first use (the same applies to the other functions).
With this in mind, the whole subtract function is better written like this:
float subtract(int* numbers, const size_t size) {
    float sum = 0;
    for(size_t i = 0; i < size; i++) {
        sum = sum + numbers[i] - 10;
    }
    return sum/5;
}
Now, once we've cleared up the function itself, we can tackle multithreading. I really suggest ditching pthreads and instead using C++11's thread capabilities. That is especially true when you need to get the result back as the return value of the function; doing it with pthreads would require too much typing. The relevant code in C++ would look similar to this:
int numbers[] = {34, 2, 77, 40, 12}; // No need to provide the array size when initialized
auto sub_result = std::async(std::launch::async, &subtract, numbers, sizeof(numbers) / sizeof(*numbers));
auto div_result = ....
// rest of functions
std::cout << "Result of subtraction: " << sub_result.get();
Now, this is a lot to grasp :) std::async is a way to run a function asynchronously without worrying about multithreading at all; the task of threading is delegated to the library. It is a much cleaner way than using pthreads - see, the invocation is not much different from a normal function invocation! The only thing to keep in mind is that it returns a so-called std::future object - a special object on which you can wait until the function that was launched completes execution. It also has a get function, which waits until the function is completed and returns its result. Nice, eh?
I'm doing an assignment that involves calculating pi with threads. I've done this using mutex and it works fine, but I would like to get this version working as well. Here is my code.
#include <iostream>
#include <stdlib.h>
#include <iomanip>
#include <vector>
#include <pthread.h>
using namespace std;

typedef struct{
    int iterations; // How many iterations this thread is going to do
    int offset;     // The offset multiplier for the calculations (makes sure each thread calculates a different part of the formula)
} threadParameterList;

vector<double> partialSumList;

void* pi_calc(void* param){
    threadParameterList* _param = static_cast<threadParameterList*>(param);
    double k = 1.0;
    for(int i = _param->iterations * _param->offset + 1; i < _param->iterations * (_param->offset + 1); ++i){
        partialSumList[_param->offset] += (double)k*(4.0/((2.0*i)*(2.0*i+1.0)*(2.0*i+2.0)));
        k *= -1.0;
    }
    pthread_exit(0);
}
int main(int argc, char* argv[]){
    // Error checking
    if(argc != 3){
        cout << "error: two parameters required [iterations][threadcount]" << endl;
        return -1;
    }
    if(atoi(argv[1]) <= 0 || atoi(argv[2]) <= 0){
        cout << "error: invalid parameter supplied - parameters must be > 0." << endl;
        return -1;
    }

    partialSumList.resize(atoi(argv[2]));
    vector<pthread_t> threadList (atoi(argv[2]));
    vector<threadParameterList> parameterList (atoi(argv[2]));

    int iterations = atoi(argv[1]),
        threadCount = atoi(argv[2]);

    // Calculate workload for each thread
    if(iterations % threadCount == 0){ // Threads divide evenly
        for(int i = 0; i < threadCount; ++i){
            parameterList[i].iterations = iterations/threadCount;
            parameterList[i].offset = i;
            pthread_create(&threadList[i], NULL, pi_calc, &parameterList[i]);
        }
        void* status;
        for(int i = 0; i < threadCount; ++i){
            pthread_join(threadList[i], &status);
        }
    }
    else{ // Threads do not divide evenly
        for(int i = 0; i < threadCount - 1; ++i){
            parameterList[i].iterations = iterations/threadCount;
            parameterList[i].offset = i;
            pthread_create(&threadList[i], NULL, pi_calc, &parameterList[i]);
        }
        // Add the remainder to the last thread
        parameterList[threadCount].iterations = (iterations % threadCount) + (iterations / threadCount);
        parameterList[threadCount].offset = threadCount - 1;
        pthread_create(&threadList[threadCount], NULL, pi_calc, &parameterList[threadCount]);
        void* status;
        for(int i = 0; i < threadCount-1; ++i){
            pthread_join(threadList[i], &status);
            cout << status << endl;
        }
    }

    // calculate pi
    double pi = 3.0;
    for(int i = 0; i < partialSumList.size(); ++i){
        pi += partialSumList[i];
    }

    cout << "Value of pi: " << setw(15) << setprecision(15) << pi << endl;
    return 0;
}
The code works fine in most cases. There are certain combinations of parameters that cause me to get a double free or corruption error on return 0. For example, if I use the parameters 100 and 10 the program creates 10 threads and does 10 iterations of the formula on each thread, works fine. If I use the parameters 10 and 4 the program creates 4 threads that do 2 iterations on 3 threads and 4 on the 4th thread, works fine. However, if I use 5 and 3, the program will correctly calculate the value and even print it out, but I get the error immediately after. This also happens for 17 and 3, and 10 and 3. I tried 15 and 7, but then I get a munmap_chunk(): invalid pointer error when the threads are trying to be joined - although i think that's something for another question.
If I had to guess, it has something to do with pthread_exit deallocating memory and then the same memory being deallocated again on return, since I'm passing the parameter struct as a pointer. I tried a few different things, like creating a local copy and defining parameterList as a vector of pointers, but that didn't solve anything. I've also tried erasing and clearing the vector before return, but that didn't help either.
I see this issue:
You are writing beyond the vector's bounds:
vector<threadParameterList> parameterList (atoi(argv[2]));
//...
int threadCount = atoi(argv[2]);
//...
parameterList[threadCount].iterations = (iterations % threadCount) + (iterations / threadCount);
parameterList[threadCount].offset = threadCount - 1;
Accessing parameterList[threadCount] is out of bounds.
I don't see in the code where threadCount is adjusted, so it remains the same value throughout that snippet.
Tip: If the goal is to access the last item in a container, use vector::back(). It works all the time for non-empty vectors.
parameterList.back().iterations = (iterations % threadCount) + (iterations / threadCount);
parameterList.back().offset = threadCount - 1;
One thing I can see is you might be going past the end of the vector here:
for(int i = 0; i < partialSumList.capacity(); ++i)
capacity() returns how many elements the vector can hold. This can be more than the size() of the vector. You can change you call to capacity() to size() to make sure you don't go past the end of the vector
for(int i = 0; i < partialSumList.size(); ++i)
The second thing I spot is that when iterations % threadCount != 0 you have:
parameterList[threadCount].iterations = (iterations % threadCount) + (iterations / threadCount);
parameterList[threadCount].offset = threadCount - 1;
pthread_create(&threadList[threadCount], NULL, pi_calc, ¶meterList[threadCount]);
Which is writing past the end of the vector. Then when you join all of the threads you don't join the last thread as you do:
for(int i = 0; i < threadCount-1; ++i){
^^^ uh oh. we missed the last thread
pthread_join(threadList[i], &status);
cout << status << endl;
}