How to implement argmax with OpenMP? - c++

I am trying to implement an argmax with OpenMP. In short, I have a function that computes a floating-point value:
double toOptimize(int val);
I can get the integer maximizing the value with:
double best = 0;
#pragma omp parallel for reduction(max: best)
for(int i = 2 ; i < MAX ; ++i)
{
double v = toOptimize(i);
if(v > best) best = v;
}
Now, how can I get the value i corresponding to the maximum?
Edit:
I am trying this, but would like to make sure it is valid:
double best_value = 0;
int best_arg = 0;
#pragma omp parallel
{
double local_best = 0;
int bn = 0;
#pragma omp for reduction(max: best_value)
for(size_t n = 2 ; n <= MAX ; ++n)
{
double v = toOptimize(n);
if(v > best_value)
{
best_value = v;
local_best = v;
bn = n;
}
}
#pragma omp barrier
#pragma omp critical
{
if(local_best == best_value)
best_arg = bn;
}
}
And in the end, I should have best_arg the argmax of toOptimize.

Your solution conforms fully to the standard. That said, if you are willing to add a bit of syntactic sugar, you could try something like the following:
#include<iostream>
using namespace std;
double toOptimize(int arg) {
return arg * (arg%100);
}
class MaximumEntryPair {
public:
MaximumEntryPair(size_t index = 0, double value = 0.0) : index_(index), value_(value){}
void update(size_t arg) {
double v = toOptimize(arg);
if( v > value_ ) {
value_ = v;
index_ = arg;
}
}
bool operator<(const MaximumEntryPair& other) const {
if( value_ < other.value_ ) return true;
return false;
}
size_t index_;
double value_;
};
int main() {
MaximumEntryPair best;
#pragma omp parallel
{
MaximumEntryPair thread_best; // renamed: "thread_local" is a keyword since C++11
#pragma omp for
for(size_t ii = 0 ; ii < 1050 ; ++ii) {
thread_best.update(ii);
} // implicit barrier
#pragma omp critical
{
if ( best < thread_best ) best = thread_best;
}
} // implicit barrier
cout << "The maximum is " << best.value_ << " obtained at index " << best.index_ << std::endl;
cout << "\t toOptimize(" << best.index_ << ") = " << toOptimize(best.index_) << std::endl;
return 0;
}

I would just create a separate buffer for each thread to store a val and idx and then select the max out of the buffer afterwards.
std::vector<double> thread_maxes(omp_get_max_threads());
std::vector<int> thread_max_ids(omp_get_max_threads());
#pragma omp parallel for
for(size_t n = 2 ; n <= MAX ; ++n)
{
int thread_num = omp_get_thread_num(); // this thread's slot (not the team size)
double v = toOptimize(n);
if(v > thread_maxes[thread_num])
{
thread_maxes[thread_num] = v;
thread_max_ids[thread_num] = n;
}
}
std::vector<double>::iterator max =
std::max_element(thread_maxes.begin(), thread_maxes.end());
best.val = *max;
best.idx = thread_max_ids[max - thread_maxes.begin()];

Your solution is fine. It has O(nthreads) convergence with the critical section. However, it's possible to do this with O(Log(nthreads)) convergence.
For example imagine there were 32 threads.
You would first find the local max for the 32 threads. Then you could combine pairs with 16 threads, then 8, then 4, then 2, then 1. In five steps you could merge the local max values without a critical section and free threads in the process. But your method would merge the local max values in 32 steps in a critical section and uses all threads.
The same logic goes for a reduction. That's why it's best to let OpenMP do the reduction rather than do it manually with an atomic section. But at least in the C/C++ implementation of OpenMP there is no easy way to get the max/min in O(Log(nthreads)). It might be possible using tasks but I have not tried that.
In practice this might not make a difference, since the time to merge the local values, even with a critical section, is probably negligible compared to the time spent in the parallel loop. It probably makes more of a difference on a GPU, though, where the number of "threads" is much larger.
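Since OpenMP 4.0, a user-defined reduction can hand the merge step to the runtime instead of a hand-written critical section. The following is only a sketch under assumptions: the ArgMax struct, the better() combiner and find_argmax() are illustrative names that do not appear in the posts above, toOptimize() is the dummy function from the earlier answer, and the range is assumed to be non-empty. Whether the runtime actually combines the partial results in a tree (and so in O(log(nthreads)) steps) is up to the implementation.

```cpp
// Dummy objective borrowed from the earlier answer; any pure function works.
double toOptimize(int arg) { return arg * (arg % 100); }

// Illustrative value/index pair (not part of the original posts).
struct ArgMax {
    double value;
    int index;
};

// Combiner: keep the larger value; on ties prefer the smaller index so the
// result does not depend on how iterations were split between threads.
inline ArgMax better(const ArgMax& a, const ArgMax& b) {
    if (b.value > a.value) return b;
    if (b.value == a.value && b.index < a.index) return b;
    return a;
}

// Teach OpenMP how to merge two partial results. Each private copy starts
// from the original variable's value.
#pragma omp declare reduction(argmax : ArgMax : omp_out = better(omp_out, omp_in)) \
    initializer(omp_priv = omp_orig)

// Returns the maximum of toOptimize over [first, last); assumes first < last.
ArgMax find_argmax(int first, int last) {
    ArgMax best = {toOptimize(first), first};
    #pragma omp parallel for reduction(argmax : best)
    for (int i = first + 1; i < last; ++i) {
        ArgMax candidate = {toOptimize(i), i};
        best = better(best, candidate);
    }
    return best;
}
```

Calling find_argmax(0, 1050) covers the same range as the loop in the earlier answer. If the code is compiled without OpenMP support, the pragmas are ignored and the loop simply runs serially with the same result.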

Related

parallel programming multiplying two arrays of numbers

I have the following C++ code that multiplies the elements of two large arrays (of size count) pairwise and accumulates the products:
double* pA1 = { large array };
double* pA2 = { large array };
for(register int r = mm; r <= count; ++r)
{
lg += *pA1-- * *pA2--;
}
Is there a way that I can implement parallelism for the code?
Here is an alternative OpenMP implementation that is simpler (and a bit faster on many-core platforms):
double dot_prod_parallel(double* v1, double* v2, int dim)
{
TimeMeasureHelper helper;
double sum = 0.;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < dim; ++i)
sum += v1[i] * v2[i];
return sum;
}
GCC and ICC are able to vectorize this loop at -O3. Clang 13.0 fails to do so, even with -ffast-math, with explicit OpenMP SIMD directives, and with loop tiling. This appears to be a bug in Clang's optimizer related to OpenMP. Note that you can pass -mavx to use the AVX instruction set, which can be up to twice as fast as SSE (the default). It is available on almost all recent x86-64 PC processors.
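For reference, an explicit OpenMP SIMD variant of the loop could look like the sketch below. The function name dot_prod_simd is made up for illustration, and the timing helper is left out. The combined parallel for simd construct requires OpenMP 4.0 or later; whether the compiler really emits vector instructions still depends on the compiler and flags, as described above.

```cpp
// Sketch: dot product that asks for both thread-level and SIMD-level
// parallelism. The reduction clause gives each thread (and SIMD lane) its
// own partial sum, which OpenMP adds up at the end.
double dot_prod_simd(const double* v1, const double* v2, int dim)
{
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum)
    for (int i = 0; i < dim; ++i)
        sum += v1[i] * v2[i];
    return sum;
}
```

Because the partial sums are combined in an unspecified order, the result can differ from the serial loop in the last bits for general floating-point data.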
I wanted to answer my own question. It looks like we can use OpenMP as follows. However, the speed gain is modest (about 2x), even though my computer has 16 cores.
// need to use compile flag /openmp
double dot_prod_parallel(double* v1, double* v2, int dim)
{
TimeMeasureHelper helper;
double sum = 0.;
int i;
# pragma omp parallel shared(sum)
{
int num = omp_get_num_threads();
int id = omp_get_thread_num();
printf("I am thread # % d of % d.\n", id, num);
double priv_sum = 0.;
# pragma omp for
for (i = 0; i < dim; i++)
{
priv_sum += v1[i] * v2[i];
}
#pragma omp critical
{
cout << "priv_sum = " << priv_sum << endl;
sum += priv_sum;
}
}
return sum;
}

Compute the minimum value of every row in a matrix using loops parallel with openmp C++

I want to compute the minimum value of every row in a matrix in parallel using openmp c++ as follows:
// matrix Distf (float) of size n by n is declared before.
vector<float> minRows;
#pragma omp parallel for
for (i=0; i < n; ++i){
float minValue = Distf[i][0];
#pragma omp parallel for reduction(min : minValue)
for (j=1; j < n; ++j){
if (Distf[i][j] < minValue){
minValue = Distf[i][j];
}
}
minRows.push_back(minValue);
}
So far the compiler does not raise any errors, but I wonder whether this gives the correct answer as expected. Thanks.
What we talked about in the comments as an answer: Since I had to write some boilerplate anyway, I used ints as the type and avoided thinking about float problems at all:
#include <vector>
#include <iostream>
using namespace std;
int main(){
constexpr size_t n = 3;
// dummy Distf (int) declared in lieu of matrix Distf
int Distf[n][n] = {{1,2,3},{6,5,4},{7,8,8}};
//could be an array<int,n> instead
vector<int> minRows(n);
#pragma omp parallel for
for (size_t i = 0; i < n; ++i){
int minValue = Distf[i][0];
// Alain Merigot argues this is a performance drag
//#pragma omp parallel for reduction(min : minValue)
for (size_t j = 1; j < n; ++j){
if (Distf[i][j] < minValue){
minValue = Distf[i][j];
}
}
//minRows.push_back(minValue) is a race condition!
minRows[i] = minValue;
}
int k = 0;
for(auto el: minRows){
cout << "row " << k++ << ": " << el << '\n';
}
cout << '\n';
}
The inner loop normally doesn't need to be parallelized. I don't know how many cores you can use, but unless you're on a massively parallel system, think GPU-level of parallelism, the outer loop should either utilize all available cores already, or the problem just isn't big enough to matter. Starting more threads in either situation is a pessimization.
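The same idea, with only the outer loop parallelized, can be sketched for a float matrix as follows. This is an illustration under assumptions: the matrix is taken to be a std::vector of rows, every row is assumed to be non-empty, and the name row_minima is made up here. The serial inner scan is delegated to std::min_element.

```cpp
#include <algorithm>
#include <vector>

// Returns the minimum of every row. Each iteration writes only to its own
// slot mins[i], so no synchronization is needed.
std::vector<float> row_minima(const std::vector<std::vector<float>>& m)
{
    std::vector<float> mins(m.size());
    const int rows = static_cast<int>(m.size());
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i) {
        // Assumes m[i] is non-empty; dereferencing end() would be undefined.
        mins[i] = *std::min_element(m[i].begin(), m[i].end());
    }
    return mins;
}
```

Pre-sizing mins and indexing by row keeps the output in row order, which push_back from several threads could not guarantee even if it were made thread-safe.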

Simple task-based OpenMP application hangs

The following small program (online version) attempts to calculate the area of a 64 by 64 square by recursively dividing it into four squares until the smallest square has unit length (hardly optimal). But for some reason the program hangs. What am I doing wrong?
#include <iostream>
unsigned compute( unsigned length )
{
if( length == 1 ) return length * length;
unsigned a[4] , area = 0 , len = length/2;
for( unsigned i = 0; i < 4; ++i )
{
#pragma omp task
{
a[i] = compute( len );
}
#pragma omp single
{
area += a[i];
}
}
return area;
}
int main()
{
unsigned area , length = 64;
#pragma omp parallel
{
area = compute( length );
}
std::cout << area << std::endl;
}
The single construct acts as an implicit barrier for all threads in the team. However, not all threads in the team do encounter this single block, because different threads are working at different recursion depths. This is why your application hangs.
In any case your code is not correct. After your task block, a[i] is not yet assigned, so you cannot immediately use it! You must wait for the task to be completed. Of course you shouldn't do that inside the loop, otherwise the tasking wouldn't exploit any parallelism. The solution is to do this at the end of the loop. Also you must specify a as shared for the output to become visible:
for( unsigned i = 0; i < 4; ++i )
{
#pragma omp task shared(a)
{
a[i] = compute( len );
}
}
#pragma omp taskwait
for( unsigned i = 0; i < 4; ++i )
{
area += a[i];
}
Note that the reduction is not wrapped in a single construct! compute is executed by a task, so only one thread ever works on a given local area. However, you need one single construct before you first spawn any tasks:
#pragma omp parallel
#pragma omp single
{
area = compute( length );
}
Simply speaking this opens a parallel region with a team of threads, and only one thread begins the initial computation. The other threads will pick up the tasks that are later spawned by this initial thread with the task construct. This is what tasking is all about.
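Putting the pieces of this answer together gives roughly the following. It is a sketch, not a tuned implementation: the wrapper name area_of_square is made up for illustration, length is assumed to be a power of two, and firstprivate(i) is spelled out only for clarity, since a local variable like i is already firstprivate in a task by default.

```cpp
// Recursively sums the areas of the four quadrants using OpenMP tasks.
unsigned compute(unsigned length)
{
    if (length == 1) return 1;

    unsigned a[4];
    unsigned len = length / 2;
    for (unsigned i = 0; i < 4; ++i) {
        // a must be shared so the task's result is visible to the parent.
        #pragma omp task shared(a) firstprivate(i)
        {
            a[i] = compute(len);
        }
    }
    // Wait for the four child tasks before reading a[].
    #pragma omp taskwait

    unsigned area = 0;
    for (unsigned i = 0; i < 4; ++i) {
        area += a[i];
    }
    return area;
}

// One thread starts the recursion; the rest of the team executes the tasks.
unsigned area_of_square(unsigned length)
{
    unsigned area = 0;
    #pragma omp parallel
    #pragma omp single
    {
        area = compute(length);
    }
    return area;
}
```

With length = 64 this returns 4096. Spawning one task per unit square is far too fine-grained to be fast, but it keeps the structure of the original program.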
Motivated by the discussion about taskwait and how it can be avoided, I show below a slightly modified version of the original code. Please note that the implied barrier at the end of the single construct is really necessary in this case.
unsigned tp_area = 0;
#pragma omp threadprivate(tp_area)
void compute (unsigned length)
{
if (length == 1)
{
tp_area += 1;
return;
}
unsigned len = length / 2;
for (unsigned i = 0; i < 4; ++i)
{
#pragma omp task
{
compute (len);
}
}
}
int main ()
{
unsigned area = 0, length = 64; // area must start at zero before the atomic adds
#pragma omp parallel
{
#pragma omp single
{
compute (length);
}
#pragma omp atomic
area += tp_area;
}
std::cout << area << std::endl;
}

Is this for-loop valid using OpenMP

I am in the process of learning OpenMP. This is a for loop I am using:
std::string result;
#pragma omp parallel
{
#pragma omp parallel for public(local_arg) reduction(+:result)
for(int i=0 ; i<Myvector.size();i++)
{
result = result + someMethod(urn,Myvector[i]);
}
}
Now someMethod(urn,Myvector[i]), which will be called by multiple threads in the code above, returns a string. This string needs to be appended to the result string. My question is: do I need to put a lock on the statement in the for loop? Is there a better approach? Any suggestions?
This isn't perfect (and it's been a while since I've used OpenMP), but the idea is basic divide-and-conquer.
std::vector<std::string> results;
int n = omp_get_max_threads(); // upper bound on the team size; omp_get_num_threads() returns 1 outside a parallel region
results.reserve(n); // For reliability, ask OS about # of cores, double that.
// Reserve a small string for each prospective worker
for(int i = 0; i < n; ++i){
std::string str{};
str.reserve(worker_reserve);
results.push_back(move(str));
}
// Let each worker grab and mutate the string
// corresponding to its worker ID
//
#pragma omp parallel for
for(int i = 0; i < Myvector.size(); ++i)
{
auto &str = results[omp_get_thread_num()];
str.append(someMethod(urn, Myvector[i]));
}
// Measure the total size of the result
std::string end_result;
size_t total_len = 0;
for(auto &res : results){
total_len += res.length();
}
// Reserve and combine
end_result.reserve(total_len + 1);
for(auto &res : results){
end_result.append(res);
}
However, there is still the issue of heap contention.
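Also, omp_get_num_threads() returns 1 when it is called outside a parallel region, so the buffer should be sized with omp_get_max_threads(), which gives an upper bound on the size of the team.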
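If the pieces must appear in the same order as the elements of Myvector, a variation on the same idea is to keep one slot per element rather than one per thread and to concatenate serially afterwards. The sketch below rests on assumptions: this someMethod is a hypothetical stand-in with a made-up signature, the element type is taken to be int, and concat_in_order is an illustrative name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the question's someMethod(urn, Myvector[i]).
std::string someMethod(const std::string& urn, int item)
{
    return urn + std::to_string(item) + ";";
}

std::string concat_in_order(const std::string& urn, const std::vector<int>& items)
{
    // One slot per element: every iteration writes only to parts[i],
    // so no lock is needed.
    std::vector<std::string> parts(items.size());
    const int n = static_cast<int>(items.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        parts[i] = someMethod(urn, items[i]);
    }

    // Measure, reserve once, then append serially in element order.
    std::size_t total_len = 0;
    for (const auto& p : parts) total_len += p.size();
    std::string result;
    result.reserve(total_len);
    for (const auto& p : parts) result += p;
    return result;
}
```

This trades some extra memory for a deterministic result. The per-thread buffers above only preserve element order within each thread's share of the iterations.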

Dijkstra's algorithm openmp strange behavior

I'm trying to run an OpenMP implementation of Dijkstra's algorithm, which I downloaded from heather.cs.ucdavis.edu/~matloff/OpenMP/Dijkstra.c
If I add, for example, one more vertex with an edge from 5 to 6, so that the path from vertex 0 goes through two vertices, my program fails to give the correct result and reports that the distance between vertices 0 and 6 is infinite.
What can be the reason?
#define LARGEINT 2<<30-1 // "infinity"
#define NV 6
// global variables, all shared by all threads by default
int ohd[NV][NV], // 1-hop distances between vertices
mind[NV], // min distances found so far
notdone[NV], // vertices not checked yet
nth, // number of threads
chunk, // number of vertices handled by each thread
md, // current min over all threads
mv; // vertex which achieves that min
void init(int ac, char **av)
{ int i,j;
for (i = 0; i < NV; i++)
for (j = 0; j < NV; j++) {
if (j == i) ohd[i][i] = 0;
else ohd[i][j] = LARGEINT;
}
ohd[0][1] = ohd[1][0] = 40;
ohd[0][2] = ohd[2][0] = 15;
ohd[1][2] = ohd[2][1] = 20;
ohd[1][3] = ohd[3][1] = 10;
ohd[1][4] = ohd[4][1] = 25;
ohd[2][3] = ohd[3][2] = 100;
ohd[1][5] = ohd[5][1] = 6;
ohd[4][5] = ohd[5][4] = 8;
for (i = 1; i < NV; i++) {
notdone[i] = 1;
mind[i] = ohd[0][i];
}
}
// finds closest to 0 among notdone, among s through e
void findmymin(int s, int e, int *d, int *v)
{ int i;
*d = LARGEINT;
for (i = s; i <= e; i++)
if (notdone[i] && mind[i] < *d) {
*d = ohd[0][i];
*v = i;
}
}
// for each i in [s,e], ask whether a shorter path to i exists, through
// mv
void updateohd(int s, int e)
{ int i;
for (i = s; i <= e; i++)
if (mind[mv] + ohd[mv][i] < mind[i])
mind[i] = mind[mv] + ohd[mv][i];
}
void dowork()
{
#pragma omp parallel // Note 1
{ int startv,endv, // start, end vertices for this thread
step, // whole procedure goes NV steps
mymd, // min value found by this thread
mymv, // vertex which attains that value
me = omp_get_thread_num(); // my thread number
#pragma omp single // Note 2
{ nth = omp_get_num_threads(); chunk = NV/nth;
printf("there are %d threads\n",nth); }
// Note 3
startv = me * chunk;
endv = startv + chunk - 1;
for (step = 0; step < NV; step++) {
// find closest vertex to 0 among notdone; each thread finds
// closest in its group, then we find overall closest
#pragma omp single
{ md = LARGEINT; mv = 0; }
findmymin(startv,endv,&mymd,&mymv);
// update overall min if mine is smaller
#pragma omp critical // Note 4
{ if (mymd < md)
{ md = mymd; mv = mymv; }
}
// mark new vertex as done
#pragma omp single
{ notdone[mv] = 0; }
// now update my section of ohd
updateohd(startv,endv);
#pragma omp barrier
}
}
}
int main(int argc, char **argv)
{ int i;
init(argc,argv);
dowork();
// back to single thread now
printf("minimum distances:\n");
for (i = 1; i < NV; i++)
printf("%d\n",mind[i]);
}
There are two problems here:
If the number of threads doesn't evenly divide the number of values, then this division of work
startv = me * chunk;
endv = startv + chunk - 1;
is going to leave the last (NV - nth*(NV/nth)) elements undone, which will mean the distances are left at LARGEINT. This can be fixed any number of ways; the easiest for now is to give all remaining work to the last thread
if (me == (nth-1)) endv = NV-1;
(This leads to more load imbalance than is necessary, but is a reasonable start to get the code working.)
The other issue is that a barrier has been left out before setting notdone[]
#pragma omp barrier
#pragma omp single
{ notdone[mv] = 0; }
This makes sure notdone is updated and updateohd() is started only after everyone has finished their findmymin() and updated md and mv.
Note that it's very easy to introduce errors into the original code you started with; the global variables used make it very difficult to reason about. John Burkardt has a nicer version of this same algorithm for teaching up on his website here, which is almost excessively well commented and easier to trace through.
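As for the load imbalance mentioned in the first point, one common alternative is to spread the leftover vertices over the first few threads instead of giving them all to the last one. The sketch below is an illustration only; the Range struct and chunk_for are made-up names, and the returned bounds are inclusive to match the startv/endv convention of the original code.

```cpp
#include <cassert>

// Inclusive range of vertices owned by one thread; start > end means empty.
struct Range {
    int start;
    int end;
};

// Splits nv vertices over nth threads so that chunk sizes differ by at most
// one: the first (nv % nth) threads each get one extra vertex.
Range chunk_for(int me, int nth, int nv)
{
    assert(nth > 0 && me >= 0 && me < nth);
    const int base = nv / nth;
    const int rem = nv % nth;
    const int start = me * base + (me < rem ? me : rem);
    const int len = base + (me < rem ? 1 : 0);
    return Range{start, start + len - 1};
}
```

For NV = 6 and four threads this yields the ranges [0,1], [2,3], [4,4] and [5,5], whereas the last-thread fix would give the final thread three vertices and the others one each.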