Using OpenMP to get the index of the minimum element in parallel - C++

I tried to write this code:
float* theArray; // the array to find the minimum value
int index, i;
float thisValue, min;

index = 0;
min = theArray[0];
#pragma omp parallel for reduction(min:min_dist)
for (i = 1; i < size; i++) {
    thisValue = theArray[i];
    if (thisValue < min)
    { /* find the min and its array index */
        min = thisValue;
        index = i;
    }
}
return(index);
However, this is not producing correct answers. The min value seems OK, but the corresponding index gets clobbered by other threads.
I also tried some approaches suggested on the Internet and here (using parallel for for the outer loop and a critical section for the final comparison), but this caused a slowdown rather than a speedup.
What should I do to make both the min value and its index correct? Thanks!

I don't know of an elegant way to do a minimum reduction and save an index at the same time. I do this by finding the local minimum and index for each thread, and then the global minimum and index in a critical section.
index = 0;
min = theArray[0];
#pragma omp parallel
{
    int index_local = index;
    float min_local = min;
    #pragma omp for nowait
    for (int i = 1; i < size; i++) {
        if (theArray[i] < min_local) {
            min_local = theArray[i];
            index_local = i;
        }
    }
    #pragma omp critical
    {
        if (min_local < min) {
            min = min_local;
            index = index_local;
        }
    }
}
With OpenMP 4.0 it's possible to use user-defined reductions. A user-defined minimum reduction can be defined like this:
struct Compare { float val; size_t index; };
#pragma omp declare reduction(minimum : struct Compare : omp_out = omp_in.val < omp_out.val ? omp_in : omp_out)
Then the reduction can be done like this:
struct Compare min;
min.val = theArray[0];
min.index = 0;
#pragma omp parallel for reduction(minimum:min)
for (int i = 1; i < size; i++) {
    if (theArray[i] < min.val) {
        min.val = theArray[i];
        min.index = i;
    }
}
That works for C and C++. User-defined reductions have other advantages besides simplified code. There are multiple algorithms for doing reductions: for example, the merging can be done in O(number of threads) or O(log(number of threads)). The first solution I gave does this in O(number of threads), whereas a user-defined reduction lets OpenMP choose the algorithm.
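For reference, here is a minimal complete program wrapping the snippets above (my sketch: the array contents are made up, and I've added an initializer clause, initializer(omp_priv = omp_orig), so each thread's private copy starts from the original value instead of being default-initialized):
#include <cstdio>
#include <cstddef>

struct Compare { float val; size_t index; };

// Combiner keeps the pair with the smaller value; the initializer copies the
// original (omp_orig) into each private copy so no indeterminate value is read.
#pragma omp declare reduction(minimum : struct Compare : \
        omp_out = omp_in.val < omp_out.val ? omp_in : omp_out) \
        initializer(omp_priv = omp_orig)

int main() {
    float theArray[] = { 3.0f, 1.0f, 4.0f, 1.5f, -2.0f, 9.0f };
    int size = sizeof(theArray) / sizeof(theArray[0]);

    struct Compare min;
    min.val = theArray[0];
    min.index = 0;
    #pragma omp parallel for reduction(minimum:min)
    for (int i = 1; i < size; i++) {
        if (theArray[i] < min.val) {
            min.val = theArray[i];
            min.index = i;
        }
    }
    printf("min = %f at index %zu\n", min.val, min.index); // min = -2.000000 at index 4
    return 0;
}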

Basic Idea
This can be accomplished without any parallelization-breaking critical or atomic sections by creating a custom reduction. Basically, define an object that stores both the index and value, and then create a function that selects between two of these objects using only the value, not the index.
Details
An object to store an index and value together:
typedef std::pair<unsigned int, float> IndexValuePair;
You can access the index by accessing the first property and the value by accessing the second property, i.e.,
IndexValuePair obj(0, 2.345);
unsigned int ix = obj.first; // 0
float val = obj.second; // 2.345
Define a function to choose between two IndexValuePair objects:
IndexValuePair myMin(IndexValuePair a, IndexValuePair b) {
    return a.second < b.second ? a : b;
}
Then, construct a custom reduction following the guidelines in the OpenMP documentation:
#pragma omp declare reduction \
(minPair:IndexValuePair:omp_out=myMin(omp_out, omp_in)) \
initializer(omp_priv = IndexValuePair(0, 1000))
In this case, I've chosen to initialize the index to 0 and the value to 1000. The value should be initialized to some number larger than the largest value you expect to sort.
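If you don't know a safe upper bound in advance, one option (my suggestion, not part of the original recipe) is to seed the initializer with the largest representable float, replacing the initializer clause above:
#include <limits> // std::numeric_limits

#pragma omp declare reduction \
    (minPair:IndexValuePair:omp_out=myMin(omp_out, omp_in)) \
    initializer(omp_priv = IndexValuePair(0, std::numeric_limits<float>::max()))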
Functional Example
Finally, combine all these pieces with the parallel for loop!
// Compile with g++ -std=c++11 -fopenmp demo.cpp
#include <cstdlib> // EXIT_SUCCESS
#include <iostream>
#include <utility>
#include <vector>

typedef std::pair<unsigned int, float> IndexValuePair;

IndexValuePair myMin(IndexValuePair a, IndexValuePair b) {
    return a.second < b.second ? a : b;
}

int main() {
    std::vector<float> vals {10, 4, 6, 2, 8, 0, -1, 2, 3, 4, 4, 8};
    unsigned int i;

    IndexValuePair minValueIndex(0, 1000);

    #pragma omp declare reduction \
        (minPair:IndexValuePair:omp_out=myMin(omp_out, omp_in)) \
        initializer(omp_priv = IndexValuePair(0, 1000))

    #pragma omp parallel for reduction(minPair:minValueIndex)
    for (i = 0; i < vals.size(); i++) {
        if (vals[i] < minValueIndex.second) {
            minValueIndex.first = i;
            minValueIndex.second = vals[i];
        }
    }

    std::cout << "minimum value = " << minValueIndex.second << std::endl; // Should be -1
    std::cout << "index = " << minValueIndex.first << std::endl;          // Should be 6
    return EXIT_SUCCESS;
}

Because you're not only trying to find the minimal value (reduction(min:___)) but also to retain its index, you need to make the check critical. This can significantly slow down the loop (as reported). In general, make sure there is enough work so you don't run into overhead, as in this question. An alternative is to have each thread find the minimum and its index and save them to a unique variable, and have the master thread do a final check on those, as in the following program.
#include <iostream>
#include <vector>
#include <ctime>
#include <random>
#include <limits>
#include <omp.h>

using std::cout;
using std::vector;

void initializeVector(vector<double>& v)
{
    std::mt19937 generator(time(NULL));
    std::uniform_real_distribution<double> dis(0.0, 1.0);
    v.resize(100000000);
    for (int i = 0; i < v.size(); i++)
    {
        v[i] = dis(generator);
    }
}

int main()
{
    vector<double> vec;
    initializeVector(vec);

    float minVal = vec[0];
    int minInd = 0;

    int startTime = clock();
    for (int i = 1; i < vec.size(); i++)
    {
        if (vec[i] < minVal)
        {
            minVal = vec[i];
            minInd = i;
        }
    }
    int elapsedTime1 = clock() - startTime;

    // Change the number of threads accordingly
    vector<float> threadRes(4, std::numeric_limits<float>::max());
    vector<int> threadInd(4);

    startTime = clock();
    #pragma omp parallel for
    for (int i = 0; i < vec.size(); i++)
    {
        if (vec[i] < threadRes[omp_get_thread_num()])
        {
            threadRes[omp_get_thread_num()] = vec[i];
            threadInd[omp_get_thread_num()] = i;
        }
    }

    float minVal2 = threadRes[0];
    int minInd2 = threadInd[0];
    for (int i = 1; i < threadRes.size(); i++)
    {
        if (threadRes[i] < minVal2)
        {
            minVal2 = threadRes[i];
            minInd2 = threadInd[i];
        }
    }
    int elapsedTime2 = clock() - startTime;

    cout << "Min " << minVal << " at " << minInd << " took " << elapsedTime1 << std::endl;
    cout << "Min " << minVal2 << " at " << minInd2 << " took " << elapsedTime2 << std::endl;
}
Please note that with optimizations on and nothing else to be done in the loop, the serial version seems to remain king. With optimizations turned off, OMP gains the upper hand.
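Part of the gap is likely false sharing: all four slots of threadRes (and threadInd) sit on the same cache line, so every update by one thread invalidates that line for the others. Here is a sketch of a drop-in replacement for the timed parallel loop above that accumulates in thread-local variables and touches the shared vectors only once per thread:
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    float localMin = std::numeric_limits<float>::max();
    int localInd = 0;
    #pragma omp for nowait
    for (int i = 0; i < (int)vec.size(); i++)
    {
        if (vec[i] < localMin)
        {
            localMin = vec[i];
            localInd = i;
        }
    }
    // One write per thread; false sharing is now negligible.
    threadRes[tid] = localMin;
    threadInd[tid] = localInd;
}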
P.S. You wrote reduction(min:min_dist) and then proceeded to use min instead of min_dist.

Actually, we can use the omp critical directive so that only one thread runs the code inside the critical region at a time. That way only one thread can update the minimum at once, and the index value won't be clobbered by other threads.
About the omp critical directive:
The omp critical directive identifies a section of code that must be executed by a single thread at a time.
This code solves your issue:
#include <stdio.h>
#include <omp.h>

int main() {
    int arr[10] = {11, 42, 53, 64, 55, 46, 47, 68, 59, 510};
    int index = 0;
    float min = arr[0];
    int size = 10;

    #pragma omp parallel for
    for (int i = 1; i < size; i++) {
        float thisValue = arr[i]; // private to each thread
        #pragma omp critical
        if (thisValue < min)
        { /* find the min and its array index */
            min = thisValue;
            index = i;
        }
    }
    printf("min:%f index:%d", min, index);
    return 0;
}
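Note that this serializes every comparison, which is exactly the slowdown the question complains about. A common mitigation (my sketch, not part of the answer above) is to pre-check outside the critical section and re-check inside it, so the lock is only taken when a new minimum looks plausible; the unsynchronized pre-check may read a stale min, which is technically a data race, but the re-check under the lock keeps the final result correct:
#pragma omp parallel for
for (int i = 1; i < size; i++) {
    float thisValue = arr[i];
    if (thisValue < min) {             // unsynchronized pre-check (may see a stale min)
        #pragma omp critical
        {
            if (thisValue < min) {     // re-check under the lock
                min = thisValue;
                index = i;
            }
        }
    }
}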

Related

Lazy vector access in parallel loops

Inside a performance-critical, parallel code I have a vector whose elements are:
Very expensive to compute, with a deterministic result (the value of the element at a given position depends on the position only)
Randomly accessed (typically the number of accesses is larger or much larger than the size of the vector)
Accessed in clusters (many accesses request the same value)
Shared by different threads (race condition?)
To avoid heap fragmentation, the object should never be recreated, but whenever possible reset and recycled
The value to be placed in the vector is provided by a polymorphic object
Currently, I precompute all possible values of the vector, so race conditions should not be an issue.
In order to improve performance, I am considering creating a lazy vector, such that the code performs computations only when an element of the vector is requested.
In a parallel region, it might happen that more than one thread requests, and perhaps calculates, the same element at the same time.
How do I take care of this possible race condition?
Below is an example of what I want to achieve. It compiles and runs properly under Windows 10 with Visual Studio 2017. I use C++17.
// Lazy.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include <vector>
#include <iostream>
#include <stdlib.h>
#include <ctime>
#include <chrono>
#include <math.h>

const double START_SUM = 1;
const double END_SUM = 1000;

//base object responsible for providing the values
class Evaluator
{
public:
    Evaluator() {};
    ~Evaluator() {};
    //Function with deterministic output, depending on the position
    virtual double expensiveFunction(int pos) const = 0;
};
//
class EvaluatorA : public Evaluator
{
public:
    //expensive evaluation
    virtual double expensiveFunction(int pos) const override {
        double t = 0;
        for (int j = START_SUM; j++ < END_SUM; j++)
            t += log(exp(log(exp(log(j + pos)))));
        return t;
    }
    EvaluatorA() {};
    ~EvaluatorA() {};
};
class EvaluatorB : public Evaluator
{
public:
    //even more expensive evaluation
    virtual double expensiveFunction(int pos) const override {
        double t = 0;
        for (int j = START_SUM; j++ < 10 * END_SUM; j++)
            t += log(exp(log(exp(log(j + pos)))));
        return t;
    }
    EvaluatorB() {};
    ~EvaluatorB() {};
};
class LazyVectorTest //vector that contains N possible results
{
public:
    LazyVectorTest(int N, const Evaluator& eval) : N(N), innerContainer(N, 0), isThatComputed(N, false), eval_ptr(&eval)
    {};
    ~LazyVectorTest() {};
    //reset, to generate a new table of values
    //the size of the vector stays constant
    void reset(const Evaluator& eval) {
        this->eval_ptr = &eval;
        for (int i = 0; i < N; i++)
            isThatComputed[i] = false;
    }
    int size() { return N; }
    //accessing the same position should yield the same result
    //unless the object is reset
    const inline double& operator[](int pos) {
        if (!isThatComputed[pos]) {
            innerContainer[pos] = eval_ptr->expensiveFunction(pos);
            isThatComputed[pos] = true;
        }
        return innerContainer[pos];
    }
private:
    const int N;
    const Evaluator* eval_ptr;
    std::vector<double> innerContainer;
    std::vector<bool> isThatComputed;
};
//the parallel access will take place here
template <typename T>
double accessingFunction(T& A, const std::vector<int>& elementsToAccess) {
    double tsum = 0;
    int size = elementsToAccess.size();
    //#pragma omp parallel for
    for (int i = 0; i < size; i++)
        tsum += A[elementsToAccess[i]];
    return tsum;
}
std::vector<int> randomPos(int sizePos, int N) {
    std::vector<int> elementsToAccess;
    for (int i = 0; i < sizePos; i++)
        elementsToAccess.push_back(rand() % N);
    return elementsToAccess;
}
int main()
{
    srand(time(0));
    int minAccessNumber = 1;
    int maxAccessNumber = 100;
    int sizeVector = 50;

    auto start = std::chrono::steady_clock::now();
    double res = 0;
    float numberTest = 100;
    typedef LazyVectorTest container;
    EvaluatorA eval;
    for (int i = 0; i < static_cast<int>(numberTest); i++) {
        res = eval.expensiveFunction(i);
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> diff(end - start);
    double benchmark = diff.count() / numberTest;
    std::cout << "Average time to compute expensive function:" << benchmark << " ms" << std::endl;
    std::cout << "Value of the function:" << res << std::endl;

    std::vector<std::vector<int>> indexs(numberTest);
    container A(sizeVector, eval);
    for (int accessNumber = minAccessNumber; accessNumber < maxAccessNumber; accessNumber++) {
        indexs.clear();
        for (int i = 0; i < static_cast<int>(numberTest); i++) {
            indexs.emplace_back(randomPos(accessNumber, sizeVector));
        }
        auto start_lazy = std::chrono::steady_clock::now();
        for (int i = 0; i < static_cast<int>(numberTest); i++) {
            A.reset(eval);
            double res_lazy = accessingFunction(A, indexs[i]);
        }
        auto end_lazy = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::milli> diff_lazy(end_lazy - start_lazy);
        std::cout << accessNumber << "," << diff_lazy.count() / numberTest << ", " << diff_lazy.count() / (numberTest * benchmark) << std::endl;
    }
    return 0;
}
Rather than roll your own locking, I'd first see if you get acceptable performance with std::call_once.
class LazyVectorTest //vector that contains N possible results
{
    //Function with deterministic output, depending on the position
    void expensiveFunction(int pos) {
        double t = 0;
        for (int j = START_SUM; j++ < END_SUM; j++)
            t += log(exp(log(exp(log(j + pos)))));
        values[pos] = t;
    }
public:
    LazyVectorTest(int N) : values(N), flags(N)
    {};
    int size() { return values.size(); }
    //accessing the same position should yield the same result
    double operator[](int pos) {
        std::call_once(flags[pos], &LazyVectorTest::expensiveFunction, this, pos);
        return values[pos];
    }
private:
    std::vector<double> values;
    std::vector<std::once_flag> flags;
};
call_once is pretty transparent. It allows exactly one thread to run a function to completion. The only potential drawback is that a second thread will block while waiting for the first to finish (in case it terminates with an exception), rather than immediately doing nothing. In this case that is desirable, as you want the modification values[pos] = t; to be sequenced before the read return values[pos];.
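A minimal usage sketch (my example, not part of the original answer; it assumes the class above together with <mutex>, <vector>, <cmath>, and the START_SUM/END_SUM constants, compiled with -fopenmp):
#include <iostream>
// ... LazyVectorTest with std::call_once, as above ...

int main() {
    LazyVectorTest v(50);
    double tsum = 0;
    // Concurrent first accesses to the same position are safe:
    // call_once guarantees expensiveFunction runs exactly once per slot.
    #pragma omp parallel for reduction(+:tsum)
    for (int i = 0; i < 1000; i++)
        tsum += v[i % v.size()];
    std::cout << "tsum = " << tsum << std::endl;
    return 0;
}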
Your current code is problematic, mainly because std::vector<bool> is horrible, but also because atomicity and memory consistency are missing. Here is a sketch of a solution based entirely on OpenMP. I would suggest actually using a special marker value for missing entries instead of a separate vector<bool> - it makes everything much easier:
class LazyVectorTest //vector that contains N possible results
{
public:
    LazyVectorTest(int N, const Evaluator& eval) : N(N), innerContainer(N, invalid), eval_ptr(&eval)
    {};
    ~LazyVectorTest() {};
    //reset, to generate a new table of values
    //the size of the vector stays constant
    void reset(const Evaluator& eval) {
        this->eval_ptr = &eval;
        for (int i = 0; i < N; i++) {
            // Use atomic if this could possibly be done in parallel
            // omit it for performance if you don't ever run it in parallel
            #pragma omp atomic write
            innerContainer[i] = invalid;
        }
        // Flush to make sure the invalidation is visible to all threads
        #pragma omp flush
    }
    int size() { return N; }
    // Don't return a reference here
    double operator[](int pos) {
        double value;
        #pragma omp atomic read
        value = innerContainer[pos];
        if (std::isnan(value)) {
            value = eval_ptr->expensiveFunction(pos);
            #pragma omp atomic write
            innerContainer[pos] = value;
        }
        return value;
    }
private:
    // NaN marks a missing entry. Note NaN == NaN is false, so test with
    // std::isnan rather than ==; quiet_NaN() is constexpr (needs <limits> and <cmath>)
    static constexpr double invalid = std::numeric_limits<double>::quiet_NaN();
    const int N;
    const Evaluator* eval_ptr;
    std::vector<double> innerContainer;
};
In case of a collision, the other threads will just redundantly compute the value, exploiting the deterministic nature. By using omp atomic on both the read and the write of the elements, you ensure that no inconsistent "half-written" values are ever read.
This solution may create some additional latency for the rare bad cases. In turn, the good cases are optimal, with just a single atomic read. You don't even need any memory flushes / seq_cst - the worst case is a redundant computation. You would need those (sequential consistency) if you wrote the flag and the value separately, to ensure that the order in which the changes become visible is correct.

Getting speed improvement with OpenMP in nested for loops with dependencies

I am trying to implement a procedure in parallel form with OpenMP. It contains four levels of nested for loops (dependent) and a variable sum_p that is updated in the innermost loop. In short, my question concerns the parallel implementation of the following code snippet:
for (int i = (test_map.size() - 1); i >= 1; --i) {
    bin_i = test_map.at(i); //test_map is an "STL map of vectors"
    len_rank_bin_i = bin_i.size(); // bin_i is a vector
    for (int j = (i - 1); j >= 0; --j) {
        bin_j = test_map.at(j);
        len_rank_bin_j = bin_j.size();
        for (int u_i = 0; u_i < len_rank_bin_i; u_i++) {
            node_u = bin_i[u_i]; //node_u is a scalar
            for (int v_i = 0; v_i < len_rank_bin_j; v_i++) {
                node_v = bin_j[v_i];
                if (node_u > node_v)
                    sum_p += 1;
            }
        }
    }
}
The full program is given below:
#include <iostream>
#include <vector>
#include <omp.h>
#include <random>
#include <unordered_map>
#include <algorithm>
#include <functional>
#include <time.h>

int main(int argc, char* argv[]) {
    double time_temp;
    int test_map_size = 5000;
    std::unordered_map<unsigned int, std::vector<unsigned int> > test_map(test_map_size);

    // Fill the test map with random integers ---------------------------------
    std::random_device rd;
    std::mt19937 gen1(rd());
    std::uniform_int_distribution<int> dist(1, 5);
    auto gen = std::bind(dist, gen1);
    for (int i = 0; i < test_map_size; i++)
    {
        int vector_len = dist(gen1);
        std::vector<unsigned int> tt(vector_len);
        std::generate(begin(tt), end(tt), gen);
        test_map.insert({i, tt});
    }

    // Sequential implementation -----------------------------------------------
    time_temp = omp_get_wtime();
    std::vector<unsigned int> bin_i, bin_j;
    unsigned int node_v, node_u;
    unsigned int len_rank_bin_i;
    unsigned int len_rank_bin_j;
    int sum_s = 0;
    for (unsigned int i = (test_map_size - 1); i >= 1; --i) {
        bin_i = test_map.at(i);
        len_rank_bin_i = bin_i.size();
        for (unsigned int j = i; j-- > 0; ) {
            bin_j = test_map.at(j);
            len_rank_bin_j = bin_j.size();
            for (unsigned int u_i = 0; u_i < len_rank_bin_i; u_i++) {
                node_u = bin_i[u_i];
                for (unsigned int v_i = 0; v_i < len_rank_bin_j; v_i++) {
                    node_v = bin_j[v_i];
                    if (node_u > node_v)
                        sum_s += 1;
                }
            }
        }
    }
    std::cout << "Estimated sum (seq): " << sum_s << std::endl;
    time_temp = omp_get_wtime() - time_temp;
    printf("Time taken for sequential implementation: %.2fs\n", time_temp);

    // Parallel implementation -----------------------------------------------
    time_temp = omp_get_wtime();
    int sum_p = 0;
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        std::vector<unsigned int> bin_i, bin_j;
        unsigned int node_v, node_u;
        unsigned int len_rank_bin_i;
        unsigned int len_rank_bin_j;
        unsigned int i, u_i, v_i;
        int j;
        #pragma omp parallel for private(j,u_i,v_i) reduction(+:sum_p)
        for (i = (test_map_size - 1); i >= 1; --i) {
            bin_i = test_map.at(i);
            len_rank_bin_i = bin_i.size();
            #pragma omp parallel for private(u_i,v_i)
            for (j = (i - 1); j >= 0; --j) {
                bin_j = test_map.at(j);
                len_rank_bin_j = bin_j.size();
                #pragma omp parallel for private(v_i)
                for (u_i = 0; u_i < len_rank_bin_i; u_i++) {
                    node_u = bin_i[u_i];
                    #pragma omp parallel for
                    for (v_i = 0; v_i < len_rank_bin_j; v_i++) {
                        node_v = bin_j[v_i];
                        if (node_u > node_v)
                            sum_p += 1;
                    }
                }
            }
        }
    }
    std::cout << "Estimated sum (parallel): " << sum_p << std::endl;
    time_temp = omp_get_wtime() - time_temp;
    printf("Time taken for parallel implementation: %.2fs\n", time_temp);
    return 0;
}
Running the code with the command g++-7 -fopenmp -std=c++11 -O3 -Wall -o so_qn so_qn.cpp on macOS 10.13.3 (an i5 processor with four logical cores) gives the following output:
Estimated sum (seq): 38445750
Time taken for sequential implementation: 0.49s
Estimated sum (parallel): 38445750
Time taken for parallel implementation: 50.54s
The time taken by the parallel implementation is many times higher than that of the sequential implementation. Do you think the code or logic can be adapted to a proper parallel implementation? I have spent a few days trying to improve the terrible performance of my code, but to no avail. Any help is greatly appreciated.
Update
With the changes suggested by JimCownie, i.e., "using omp for, not omp parallel for" and removing the parallelism of the inner loops, the performance is greatly improved.
Estimated sum (seq): 42392944
Time taken for sequential implementation: 0.48s
Estimated sum (parallel): 42392944
Time taken for parallel implementation: 0.27s
My CPU has four logical cores (and I am using four threads); now I am wondering whether there is any way to get four times better performance than the sequential implementation.
I see a different problem when my map of vectors test_map is short but fat at each level, i.e., when the map size is small but the vector size at each key is very large. In such a case the performance of the sequential and parallel implementations is comparable, without much difference. It seems we need to parallelize the inner loops too. Do you know how to achieve that in this context?
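For reference, a sketch of the restructured parallel region after those changes (a single omp for on the outermost loop, with a reduction for sum_p; the schedule(dynamic) clause and the const references are my additions, to balance the triangular workload and avoid copying the vectors):
int sum_p = 0;
omp_set_num_threads(4);
#pragma omp parallel reduction(+:sum_p)
{
    // One omp for on the outermost loop; the inner loops stay sequential.
    // schedule(dynamic) helps because iteration i does O(i) work.
    #pragma omp for schedule(dynamic)
    for (int i = test_map_size - 1; i >= 1; --i) {
        const std::vector<unsigned int>& bin_i = test_map.at(i); // const ref: no copy
        for (int j = i - 1; j >= 0; --j) {
            const std::vector<unsigned int>& bin_j = test_map.at(j);
            for (unsigned int u_i = 0; u_i < bin_i.size(); u_i++) {
                for (unsigned int v_i = 0; v_i < bin_j.size(); v_i++) {
                    if (bin_i[u_i] > bin_j[v_i])
                        sum_p += 1;
                }
            }
        }
    }
}
For the short-but-fat case, one option along the same lines is to move the omp for down onto the u_i loop (keeping the outer loops sequential), so that the long inner vectors supply the parallelism.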

Threads failing to affect performance

Below is a small program meant to parallelize the approximation of the 1/(n^2) series. Note the global parameter NUM_THREADS.
My issue is that increasing the number of threads from 1 to 4 (my computer has four processors) does not significantly affect the results of the timing experiments. Do you see a logical flaw in ThreadFunction? Is there false sharing or misplaced blocking that ends up serializing the execution?
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
#include <string>
#include <future>
#include <chrono>

std::mutex sum_mutex;        // This mutex is for the sum vector
std::vector<double> sum_vec; // This is the sum vector
int NUM_THREADS = 1;
int UPPER_BD = 1000000;

/* Thread function */
void ThreadFunction(std::vector<double>& l, int beg, int end, int thread_num)
{
    double sum = 0;
    for (int i = beg; i < end; i++) sum += (1 / (l[i] * l[i]));
    std::unique_lock<std::mutex> lock1(sum_mutex, std::defer_lock);
    lock1.lock();
    sum_vec.push_back(sum);
    lock1.unlock();
}

void ListFill(std::vector<double>& l, int z)
{
    for (int i = 0; i < z; ++i) l.push_back(i);
}

int main()
{
    std::vector<double> l;
    std::vector<std::thread> thread_vec;
    ListFill(l, UPPER_BD);
    int len = l.size();

    int lower_bd = 1;
    int increment = (UPPER_BD - lower_bd) / NUM_THREADS;
    for (int j = 0; j < NUM_THREADS; ++j)
    {
        thread_vec.push_back(std::thread(ThreadFunction, std::ref(l), lower_bd, lower_bd + increment, j));
        lower_bd += increment;
    }
    for (auto& t : thread_vec) t.join();

    double big_sum = 0;
    for (double z : sum_vec) big_sum += z;
    std::cout << big_sum << std::endl;
    return 0;
}
From looking at your code, I suspect that ListFill is taking longer than ThreadFunction. Why pass a list of values to the thread instead of the bounds each thread should loop over? Something like:
void ThreadFunction(int beg, int end) {
    double sum = 0.0;
    for (double i = beg; i < end; i++)
        sum += (1.0 / (i * i));
    std::unique_lock<std::mutex> lock1(sum_mutex);
    sum_vec.push_back(sum);
}
To maximize parallelism, you need to push as much work as possible onto the threads. See Amdahl's Law.
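For concreteness, Amdahl's law bounds the speedup with n threads at S(n) = 1 / ((1 - p) + p/n), where p is the parallelizable fraction of the runtime; with p = 0.9 and n = 4, for example, S = 1 / (0.1 + 0.225) ≈ 3.1, so even a mostly-parallel program falls short of a 4x speedup.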
In addition to dohashi's nice improvement, you can remove the need for the mutex by populating the sum_vec in advance in the main thread:
sum_vec.resize(4);
then writing directly to it in ThreadFunction:
sum_vec[thread_num] = sum;
Since each thread writes to a distinct element and doesn't modify the vector itself, there is no need to lock anything.
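Putting dohashi's rewrite and the pre-sized vector together, a complete sketch (my assembly of the two suggestions, here with NUM_THREADS = 4) could look like this:
#include <iostream>
#include <thread>
#include <vector>

int NUM_THREADS = 4;
int UPPER_BD = 1000000;
std::vector<double> sum_vec;

void ThreadFunction(int beg, int end, int thread_num)
{
    double sum = 0.0;
    for (double i = beg; i < end; i++)
        sum += 1.0 / (i * i);
    sum_vec[thread_num] = sum; // distinct slot per thread: no mutex needed
}

int main()
{
    sum_vec.resize(NUM_THREADS); // pre-size in the main thread
    std::vector<std::thread> thread_vec;
    int lower_bd = 1;
    int increment = (UPPER_BD - lower_bd) / NUM_THREADS;
    for (int j = 0; j < NUM_THREADS; ++j)
    {
        thread_vec.push_back(std::thread(ThreadFunction, lower_bd, lower_bd + increment, j));
        lower_bd += increment;
    }
    for (auto& t : thread_vec) t.join();

    double big_sum = 0.0;
    for (double z : sum_vec) big_sum += z;
    std::cout << big_sum << std::endl; // ~pi^2/6 ≈ 1.6449
    return 0;
}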

std::inner_product with omp

Is it possible to parallelize std::inner_product() from C++ with the omp.h library? Unfortunately I can't use __gnu_parallel::inner_product(), which is available in newer versions of gcc. I know that I can implement my own inner_product and parallelize it, but I would like to use standard means.
Short answer: no.
The whole point of algorithms like inner_product is that they abstract the loop away from you. But in order to parallelise the algorithm you need to parallelise that loop – either via #pragma omp parallel for or via parallel sections. Both methods are inherently linked to the loop in the code structure so even if the loop were trivially parallelisable (which it might well be), you need to put the OpenMP pragmas inside the function to apply parallelism to it.
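So the practical route is the one the question already mentions: a hand-rolled loop with the pragma inside. A minimal sketch:
#include <cstddef>

// Hand-rolled inner product with an OpenMP reduction - a sketch of the
// "implement my own and parallelize it" option mentioned in the question.
long int omp_inner_product(const long int* a, const long int* b, std::size_t n)
{
    long int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long int i = 0; i < (long int)n; i++)
        sum += a[i] * b[i];
    return sum;
}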
Following up on Hristo's comment, you can kind of do this by decomposing the arrays over threads, calling inner_product on each subarray, and then using some sort of reduction operation to combine the sub-results:
#include <iostream>
#include <numeric>
#include <omp.h>
#include <sys/time.h>

void tick(struct timeval* t);
double tock(struct timeval* t);

int main(int argc, char** argv) {
    const long int nelements = 1000000;
    long int* a = new long int[nelements];
    long int* b = new long int[nelements];
    int nthreads;
    long int sum = 0;
    struct timeval t;
    double time;

    #pragma omp parallel for
    for (long int i = 0; i < nelements; i++) {
        a[i] = i + 1;
        b[i] = 1;
    }

    tick(&t);
    #pragma omp parallel
    #pragma omp single
    nthreads = omp_get_num_threads();

    #pragma omp parallel default(none) reduction(+:sum) shared(a,b,nthreads)
    {
        int tid = omp_get_thread_num();
        int nitems = nelements / nthreads;
        int start = tid * nitems;
        int end = start + nitems;
        if (tid == nthreads - 1) end = nelements;

        sum += std::inner_product(&(a[start]), a + end, &(b[start]), 0L);
    }
    time = tock(&t);
    std::cout << "using omp: sum = " << sum << " time = " << time << std::endl;

    delete[] a;
    delete[] b;

    a = new long int[nelements];
    b = new long int[nelements];
    sum = 0;
    for (long int i = 0; i < nelements; i++) {
        a[i] = i + 1;
        b[i] = 1;
    }

    tick(&t);
    sum = std::inner_product(a, a + nelements, b, 0L);
    time = tock(&t);
    std::cout << "single threaded: sum = " << sum << " time = " << time << std::endl;
    std::cout << "correct answer: sum = " << (nelements) * (nelements + 1) / 2 << std::endl;

    delete[] a;
    delete[] b;
    return 0;
}

void tick(struct timeval* t) {
    gettimeofday(t, NULL);
}

/* returns time in seconds from now to time described by t */
double tock(struct timeval* t) {
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec) / 1000000.);
}
Running this gets better speedup than I would have expected:
$ for NT in 1 2 4 8; do export OMP_NUM_THREADS=${NT}; echo; echo "NTHREADS=${NT}";./inner; done
NTHREADS=1
using omp: sum = 500000500000 time = 0.004675
single threaded: sum = 500000500000 time = 0.004765
correct answer: sum = 500000500000
NTHREADS=2
using omp: sum = 500000500000 time = 0.002317
single threaded: sum = 500000500000 time = 0.004773
correct answer: sum = 500000500000
NTHREADS=4
using omp: sum = 500000500000 time = 0.001205
single threaded: sum = 500000500000 time = 0.004758
correct answer: sum = 500000500000
NTHREADS=8
using omp: sum = 500000500000 time = 0.000617
single threaded: sum = 500000500000 time = 0.004784
correct answer: sum = 500000500000

OpenMP - using functions

When I use OpenMP without functions, with reduction(+ : sum), the OpenMP version works fine.
#include <iostream>
#include <omp.h>
using namespace std;

int sum = 0;

void summation()
{
    sum = sum + 1;
}

int main()
{
    int i, sum;
    #pragma omp parallel for reduction (+ : sum)
    for (i = 0; i < 1000000000; i++)
        summation();
    #pragma omp parallel for reduction (+ : sum)
    for (i = 0; i < 1000000000; i++)
        summation();
    #pragma omp parallel for reduction (+ : sum)
    for (i = 0; i < 1000000000; i++)
        summation();
    std::cerr << "Sum is=" << sum << std::endl;
}
But when I am calling a function summation over a global variable, the OpenMP version is taking even more time than the sequential version.
I would like to know the reason for this, and what changes should be made.
The summation function doesn't use the OMP shared variable that you are reducing to. Fix it:
#include <iostream>
#include <omp.h>

void summation(int& sum) { sum++; }

int main()
{
    int sum = 0; // must be initialized: the reduction combines into this value
    #pragma omp parallel for reduction (+ : sum)
    for (int i = 0; i < 1000000000; ++i)
        summation(sum);
    std::cerr << "Sum is=" << sum << '\n';
}
The time taken to synchronize access to this one variable will be far in excess of what you gain by using multiple cores - they will all be endlessly waiting on each other, because there is only one variable and only one core can access it at a time. This design is not capable of concurrency, and all the synchronization you're paying for will just increase the run-time.