I'm trying to run an OpenMP implementation of Dijkstra's algorithm, which I downloaded here: heather.cs.ucdavis.edu/~matloff/OpenMP/Dijkstra.c
If I add, for example, one more vertex from 5 to 6, so that the path from vertex 0 goes through two intermediate vertices, my program fails to give me a correct result, saying that the distance between vertex 0 and vertex 6 is infinite :^(
What could be the reason?
#define LARGEINT 2<<30-1  // "infinity"
#define NV 6

// global variables, all shared by all threads by default
int ohd[NV][NV],  // 1-hop distances between vertices
    mind[NV],     // min distances found so far
    notdone[NV],  // vertices not checked yet
    nth,          // number of threads
    chunk,        // number of vertices handled by each thread
    md,           // current min over all threads
    mv;           // vertex which achieves that min
void init(int ac, char **av)
{  int i,j;
   for (i = 0; i < NV; i++)
      for (j = 0; j < NV; j++) {
         if (j == i) ohd[i][i] = 0;
         else ohd[i][j] = LARGEINT;
      }
   ohd[0][1] = ohd[1][0] = 40;
   ohd[0][2] = ohd[2][0] = 15;
   ohd[1][2] = ohd[2][1] = 20;
   ohd[1][3] = ohd[3][1] = 10;
   ohd[1][4] = ohd[4][1] = 25;
   ohd[2][3] = ohd[3][2] = 100;
   ohd[1][5] = ohd[5][1] = 6;
   ohd[4][5] = ohd[5][4] = 8;
   for (i = 1; i < NV; i++) {
      notdone[i] = 1;
      mind[i] = ohd[0][i];
   }
}
// finds closest to 0 among notdone, among s through e
void findmymin(int s, int e, int *d, int *v)
{  int i;
   *d = LARGEINT;
   for (i = s; i <= e; i++)
      if (notdone[i] && mind[i] < *d) {
         *d = ohd[0][i];
         *v = i;
      }
}
// for each i in [s,e], ask whether a shorter path to i exists, through mv
void updateohd(int s, int e)
{  int i;
   for (i = s; i <= e; i++)
      if (mind[mv] + ohd[mv][i] < mind[i])
         mind[i] = mind[mv] + ohd[mv][i];
}
void dowork()
{
   #pragma omp parallel  // Note 1
   {  int startv,endv,  // start, end vertices for this thread
          step,   // whole procedure goes NV steps
          mymd,   // min value found by this thread
          mymv,   // vertex which attains that value
          me = omp_get_thread_num();  // my thread number
      #pragma omp single  // Note 2
      {  nth = omp_get_num_threads(); chunk = NV/nth;
         printf("there are %d threads\n",nth); }
      // Note 3
      startv = me * chunk;
      endv = startv + chunk - 1;
      for (step = 0; step < NV; step++) {
         // find closest vertex to 0 among notdone; each thread finds
         // closest in its group, then we find overall closest
         #pragma omp single
         {  md = LARGEINT; mv = 0; }
         findmymin(startv,endv,&mymd,&mymv);
         // update overall min if mine is smaller
         #pragma omp critical  // Note 4
         {  if (mymd < md)
            {  md = mymd; mv = mymv; }
         }
         // mark new vertex as done
         #pragma omp single
         {  notdone[mv] = 0; }
         // now update my section of ohd
         updateohd(startv,endv);
         #pragma omp barrier
      }
   }
}
int main(int argc, char **argv)
{  int i;
   init(argc,argv);
   dowork();
   // back to single thread now
   printf("minimum distances:\n");
   for (i = 1; i < NV; i++)
      printf("%d\n",mind[i]);
}
There are two problems here:
If the number of threads doesn't evenly divide the number of values, then this division of work:
startv = me * chunk;
endv = startv + chunk - 1;
is going to leave the last (NV - nth*(NV/nth)) elements undone, which means those distances are left at LARGEINT. This can be fixed any number of ways; the easiest for now is to give all of the remaining work to the last thread:
if (me == (nth-1)) endv = NV-1;
(This leads to more load imbalance than is necessary, but is a reasonable start to get the code working.)
The other issue is that a barrier has been left out before setting notdone[]:
#pragma omp barrier
#pragma omp single
{ notdone[mv] = 0; }
This makes sure notdone is updated and updateohd() is started only after everyone has finished their findmymin() and updated md and mv.
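Putting both fixes together, the per-thread portion of dowork() would look something like this (a sketch assembled from the code above, not tested):

startv = me * chunk;
endv = startv + chunk - 1;
if (me == (nth-1)) endv = NV-1;  // last thread picks up the leftover vertices
for (step = 0; step < NV; step++) {
   #pragma omp single
   {  md = LARGEINT; mv = 0; }
   findmymin(startv,endv,&mymd,&mymv);
   #pragma omp critical
   {  if (mymd < md)
      {  md = mymd; mv = mymv; }
   }
   #pragma omp barrier  // everyone must finish updating md and mv first
   #pragma omp single
   {  notdone[mv] = 0; }
   updateohd(startv,endv);
   #pragma omp barrier
}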
Note that it's very easy to introduce errors into the original code you started with; the global variables used make it very difficult to reason about. John Burkardt has a nicer version of this same algorithm for teaching up on his website here, which is almost excessively well commented and easier to trace through.
Related
I have tried to solve the problem Rerouting at HackerRank. I am posting here for help as the competition is over.
https://www.hackerrank.com/contests/hack-the-interview-v-asia-pacific/challenges/rerouting
I tried to solve the problem using strongly connected components, but the test cases failed. I can see that we have to remove cycles, but I am stuck on how to approach the problem. Below is the solution I have written. I am looking for guidance on how to move forward, so that I can apply what I learn from my mistakes here in the future. Thanks for your time and help.
int getMinConnectionChange(vector<int> connection) {
    // Idea: Get the number of strongly connected components.
    int numberOfVertices = connection.size();
    for (int idx = 0; idx < numberOfVertices; idx++) {
        cout << idx+1 << ":" << connection[idx] << endl;
    }
    stack<int> stkVertices;
    map<int, bool> mpVertexVisited; // is the vertex visited; think of this as a chalk mark for visited nodes.
    int numOFSCCs = 0;
    int currTime = 1;
    for (int vertexId = 0; vertexId < numberOfVertices; vertexId++) {
        // check if the node is already visited.
        if (mpVertexVisited.find(vertexId+1) == mpVertexVisited.end()) {
            numOFSCCs++;
            mpVertexVisited.insert(make_pair(vertexId+1, true));
            stkVertices.push(vertexId+1);
            currTime++;
            while (!stkVertices.empty()) {
                int iCurrentVertex = stkVertices.top();
                stkVertices.pop();
                // get adjacent vertices. In this exercise each node has only one neighbour, i.e., one edge.
                int neighbourVertexId = connection[iCurrentVertex-1];
                // if the vertex is already visited, don't insert it into the stack.
                if (mpVertexVisited.find(neighbourVertexId) != mpVertexVisited.end()) {
                    continue;
                }
                mpVertexVisited.insert(make_pair(neighbourVertexId, true));
                stkVertices.push(neighbourVertexId);
            } // while loop
        } // if vertex not yet visited
    } // for loop over vertices
    return numOFSCCs - 1;
}
This is a problem that I just solved and would like to share the solution.
The problem can be solved with union-find.
Two main observations:
The number of edges that have to be changed is the number of components - 1 (not necessarily strongly connected). Thus union-find is handy here for finding the number of components.
The second observation is that some components don't have a terminating node; consider 1<->2, in other words, a cycle exists. We can detect whether a terminating node exists by checking whether some node has no outgoing edge.
If all components have a cycle, it means that we need to change every component instead of (number of components - 1). This makes sure the graph ends up with a terminating point.
Code:
struct UF {
    vector<int> p, rank, size;
    int cnt;
    UF(int N) {
        p = rank = size = vector<int>(N, 1);
        for (int i = 0; i < N; i++) p[i] = i;
        cnt = N;
    }
    int find(int i) {
        return p[i] == i ? i : p[i] = find(p[i]);
    }
    bool connected(int i, int j) {
        return find(i) == find(j);
    }
    void join(int i, int j) {
        if (connected(i, j)) return;
        int x = find(i), y = find(j);
        cnt--;
        if (rank[x] > rank[y]) {
            p[y] = x;
            size[x] += size[y];
        } else {
            p[x] = y;
            size[y] += size[x];
            if (rank[x] == rank[y]) rank[y]++;
        }
    }
};

int getMinConnectionChange(vector<int> connection) {
    int nonCycle = 0;
    int n = connection.size();
    UF uf(n);
    for (int i = 0; i < n; i++) {
        int to = connection[i] - 1;
        if (to == i) nonCycle++;
        else uf.join(i, to);
    }
    int components = uf.cnt;
    int countCycle = uf.cnt - nonCycle;
    int res = components - 1;
    if (countCycle == components) res++; // all components have a cycle
    return res;
}
TL;DR: you can view this as a minimum spanning arborescence problem.
More precisely, add a node for each server, and another one called "Terminate".
Make a complete graph (each node is linked to every other one) and set the cost to 0 for the edges corresponding to your input, and 1 for the other ones.
You can use, for example, Edmonds' algorithm to solve this.
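To make the construction concrete, here is a minimal sketch of building that cost graph. The function name buildCostMatrix is made up for this example, and treating a server that points to itself as having a free edge into "Terminate" is my assumption; solving the resulting instance is left to an existing Edmonds/Chu-Liu implementation.

#include <vector>
using namespace std;

// Node n is the extra "Terminate" node; every edge costs 1 by default,
// and edges already present in the input cost 0.
vector<vector<int>> buildCostMatrix(const vector<int>& connection) {
    int n = connection.size();
    vector<vector<int>> cost(n + 1, vector<int>(n + 1, 1));
    for (int i = 0; i < n; i++) {
        int to = connection[i] - 1;      // the input is 1-based
        cost[i][to == i ? n : to] = 0;   // keeping an existing edge is free
    }
    return cost;                         // hand this to an arborescence solver
}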
I have a question about entry substitution. Let's say we have a square matrix of fixed size MATRIX_SIZE (an unsorted 2D array), a list of numbers replacementPolicy, and another list of numbers substitutedNUMBER. We loop over the matrix, and if an entry has the same value as an element of replacementPolicy, we remember that element's position i and substitute the entry with the i-th element of substitutedNUMBER. It sounds a little complicated; the code is as follows:
void substitute_entry() {
    // For each entry in the matrix
    for (int column = 0; column < MATRIX_SIZE; ++column) {
        for (int row = 0; row < MATRIX_SIZE; ++row) {
            // Search for the entry in the original number list
            // and replace it with the corresponding element in the substituted number list
            int index = -1;
            for (int i = 0; i < LIST_SIZE; i++) {
                if (replacementPolicy[i] == MATRIX[row][column]) {
                    index = i;
                }
            }
            MATRIX[row][column] = substitutedNUMBER[index];
        }
    }
}
However, I would like to optimize this code to achieve a faster runtime. My first idea was to swap the two for loops, iterating first over rows and then over columns, but this does not affect the runtime significantly. My second thought was to use a better algorithm to replace the entries, but unfortunately I messed it up when testing. Is there any better way to do this?
Thank you!
I think your loops are a perfect fit for a multithreading solution, for example using OpenMP, and with its capabilities you can expect a significant improvement in performance. I've made a few changes to your code, as follows:
#include <iostream>
#include <chrono>
#include <cstdint>
#include <omp.h>

#define MATRIX_SIZE 1000
#define LIST_SIZE 1000

int arr[MATRIX_SIZE][MATRIX_SIZE];
int replacementPolicy[LIST_SIZE];
int substitutedNUMBER[MATRIX_SIZE];

void substitute_entry() {
    // For each entry in the matrix
    #pragma omp parallel for
    for (int column = 0; column < MATRIX_SIZE; ++column) {
        #pragma omp parallel for
        for (int row = 0; row < MATRIX_SIZE; ++row) {
            // Search for the entry in the original number list
            // and replace it with the corresponding element in the substituted number list
            int index = -1;
            for (int i = 0; i < LIST_SIZE; i++) {
                if (replacementPolicy[i] == arr[row][column]) {
                    index = i;
                }
            }
            arr[row][column] = substitutedNUMBER[index];
        }
    }
}

int main()
{
    omp_set_num_threads(4);
    for (int i = 0; i < MATRIX_SIZE; i++)
    {
        replacementPolicy[i] = i;
        substitutedNUMBER[i] = i;
        for (int j = 0; j < MATRIX_SIZE; j++)
        {
            arr[i][j] = i+j;
        }
    }
    auto start = std::chrono::high_resolution_clock::now();
    substitute_entry();
    auto end = std::chrono::high_resolution_clock::now();
    uint64_t diff = std::chrono::duration_cast<std::chrono::microseconds>(end-start).count();
    std::cerr << diff << '\n';
    return 0;
}
You can comment out the OpenMP-specific lines (the #include <omp.h>, the two #pragma omp parallel for directives, and the omp_set_num_threads(4) call) to get the single-threaded version of the code.
In this example, with a MATRIX_SIZE of 1000 and on my personal computer, which has only four cores, the single-threaded version finishes in 3731737 us and the multithreaded version in 718039 us.
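As a side note, the inner #pragma omp parallel for only spawns a nested parallel region if nested parallelism is enabled; otherwise it runs serially inside each outer thread. Here is a sketch of an arguably more idiomatic variant, using a single collapse(2) directive (available since OpenMP 3.0) to fuse the two loops into one parallel iteration space:

void substitute_entry() {
    // one parallel region; the two outer loops are fused and their
    // combined iteration space is divided among the threads
    #pragma omp parallel for collapse(2)
    for (int column = 0; column < MATRIX_SIZE; ++column) {
        for (int row = 0; row < MATRIX_SIZE; ++row) {
            int index = -1;
            for (int i = 0; i < LIST_SIZE; i++) {
                if (replacementPolicy[i] == arr[row][column]) {
                    index = i;
                }
            }
            arr[row][column] = substitutedNUMBER[index];
        }
    }
}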
I'm new to Pthreads and C++ and am trying to parallelize an image-flipping program. Obviously it isn't working. I'm told I need to port some code from an Image class, but I'm not really sure what porting means. I just copied and pasted the code, but I guess that's wrong.
I get the general idea: allocate the workload, initialize the threads, create the threads, join the threads, and define a callback function.
I'm not totally sure what cells_per_thread should be. I'm pretty sure it should be the image width * height / threads. Does that seem correct?
I'm getting multiple errors when compiling with cmake.
It's saying m_thread_number, getWidth, getHeight, getPixel, and temp are not declared in the scope. I assume that's because the Image class code isn't ported?
PthreadImage.cxx
// Declare a callback function for horizontal flip
void* H_flip_callback_function(void* aThreadData);

PthreadImage PthreadImage::flipHorizontally() const
{
    if (m_thread_number == 0 || m_thread_number == 1)
    {
        return PthreadImage(Image::flipHorizontally(), m_thread_number);
    }
    else
    {
        PthreadImage temp(getWidth(), getHeight(), m_thread_number);

        // Workload allocation
        // Create a vector of type ThreadData, which is defined at the top of the class as struct ThreadData. Pass in the number of threads.
        vector<ThreadData> p_thread_data(m_thread_number);
        // Create an integer to hold the last element; initialize it to -1.
        int last_element = -1;
        // Create an unsigned int to hold how many cells we need per thread. For the image we want the width times the height divided by the number of threads.
        unsigned int cells_per_thread = getHeight() * getWidth() / m_thread_number;
        // Next create a variable to hold the remainder of the division.
        unsigned int remainder = getHeight() * getWidth() % m_thread_number;
        // Print the number of cells per thread to the console.
        cout << "Default number for cells per thread: " << cells_per_thread << endl;
        // Initialize the threads with a for loop to iterate through each thread and populate it.
        for (int i = 0; i < m_thread_number; i++)
        {
            // Thread ids correspond to the for-loop index values.
            p_thread_data[i].thread_id = i;
            // Start is last element + 1, i.e. -1 + 1 gives start = 0.
            p_thread_data[i].start_id = ++last_element;
            p_thread_data[i].end_id = last_element + cells_per_thread - 1;
            p_thread_data[i].input = this;
            p_thread_data[i].output = &temp;
            // If the remainder is > 0, add 1 to the end and then remove 1 from the remainder.
            if (remainder > 0)
            {
                p_thread_data[i].end_id++;
                --remainder;
            }
            // Make last_element equal not -1 but the end of this thread's range.
            last_element = p_thread_data[i].end_id;
            // Print to the console which numbers the thread starts and ends on.
            cout << "Thread[" << i << "] starts with " << p_thread_data[i].start_id << " and stops on " << p_thread_data[i].end_id << endl;
        }
        // Create the threads with another for loop.
        for (int i = 0; i < m_thread_number; i++)
        {
            pthread_create(&p_thread_data[i].thread_id, NULL, H_flip_callback_function, &p_thread_data[i]);
        }
        // Wait for each thread to complete.
        for (int i = 0; i < m_thread_number; i++)
        {
            pthread_join(p_thread_data[i].thread_id, NULL);
        }
        return temp;
    }
}
Callback function
// Define the callback function for horizontal flip
void* H_flip_callback_function(void* aThreadData)
{
    // Convert void* to ThreadData
    ThreadData* p_thread_data = static_cast<ThreadData*>(aThreadData);

    int tempHeight = temp(getHeight());
    int tempWidth = temp(getWidth());

    for (int i = p_thread_data->start_id; i <= p_thread_data->end_id; i++)
    {
        // Process every row of the image
        for (unsigned int j = 0; j < m_height; ++j)
        {
            // Process every column of the image
            for (unsigned int i = 0; i < m_width / 2; ++i)
            {
                (*(p_thread_data->output))(i, j) = getPixel(m_width - i - 1, j);
                (*(p_thread_data->output))(m_width - i - 1, j) = getPixel(i, j);
            }
        }
    }
}
Image class
#include <sstream>   // Header file for stringstream
#include <fstream>   // Header file for filestream
#include <algorithm> // Header file for min/max/fill
#include <numeric>   // Header file for accumulate
#include <cmath>     // Header file for abs and pow
#include <vector>

#include "Image.h"

//-----------------
Image::Image():
//-----------------
    m_width(0),
    m_height(0)
//-----------------
{}

//----------------------------------
Image::Image(const Image& anImage):
//----------------------------------
    m_width(anImage.m_width),
    m_height(anImage.m_height),
    m_p_image(anImage.m_p_image)
//----------------------------------
{}
Image class code to be ported
//-----------------------------------
Image Image::flipHorizontally() const
//-----------------------------------
{
    // Create an image of the right size
    Image temp(getWidth(), getHeight());

    // Process every row of the image
    for (unsigned int j = 0; j < m_height; ++j)
    {
        // Process every column of the image
        for (unsigned int i = 0; i < m_width / 2; ++i)
        {
            temp(i, j) = getPixel(m_width - i - 1, j);
            temp(m_width - i - 1, j) = getPixel(i, j);
        }
    }
    return temp;
}
I feel like it's pretty close. Any help greatly appreciated!
EDIT
OK, so this is the correct code for anyone wasting their time on this.
There were obviously a fair few things wrong.
I don't know why there were 3 for loops. There should be 2: 1 for rows and 1 for columns.
cells_per_thread should really be pixels_per_thread, and it should be rows/threads as @Larry B suggested, not ALL the pixels per thread.
You can use -> to get members through a pointer, i.e. setPixel(), getPixel(), etc. Who knew!?
There was a data structure that was pretty important for you guys, but I forgot to include it:
struct ThreadData
{
    pthread_t thread_id;
    unsigned int start_id;
    unsigned int end_id;
    const Image* input;
    Image* output;
};
Correct Callback
void* H_flip_callback_function(void* aThreadData)
{
    // Convert void* to ThreadData
    ThreadData* p_thread_data = static_cast<ThreadData*>(aThreadData);

    int width = p_thread_data->input->getWidth();

    // Process every row assigned to this thread
    for (unsigned int j = p_thread_data->start_id; j <= p_thread_data->end_id; ++j)
    {
        // Process every column of the image
        for (unsigned int i = 0; i < width / 2; ++i)
        {
            p_thread_data->output->setPixel(i, j, p_thread_data->input->getPixel(width - i - 1, j));
            p_thread_data->output->setPixel(width - i - 1, j, p_thread_data->input->getPixel(i, j));
        }
    }
    return 0;
}
So now this code compiles and flips.
Thanks!
The general strategy for porting single-threaded code to a multi-threaded version is essentially rewriting the existing code to divide the work into self-contained units that you can hand off to threads for execution.
With that in mind, I don't agree with your implementation of H_flip_callback_function:
void* H_flip_callback_function(void* aThreadData)
{
    // Convert void* to ThreadData
    ThreadData* p_thread_data = static_cast<ThreadData*>(aThreadData);

    // Create an image of the right size
    PthreadImage temp(getWidth(), getHeight(), m_thread_number);
    int tempHeight = temp(getHeight());
    int tempWidth = temp(getWidth());

    for (int i = p_thread_data->start_id; i <= p_thread_data->end_id; i++)
    {
        // Process every row of the image
        for (unsigned int j = 0; j < tempHeight; ++j)
        {
            // Process every column of the image
            for (unsigned int i = 0; i < tempWidth / 2; ++i)
            {
                temp(i, j) = getPixel(tempWidth - i - 1, j);
                temp(tempWidth - i - 1, j) = getPixel(i, j);
            }
        }
    }
}
At face value, it looks like all your threads will be operating on the whole image. If that is the case, there is no real difference between your single-threaded and multi-threaded versions, as you're just doing the same work multiple times in the multi-threaded version.
I would argue that the smallest self-contained unit of work is horizontally flipping a single row of the image. However, if you have fewer threads than rows, you can allocate (num rows / num threads) rows to each thread, as sketched below. Each thread then flips the rows assigned to it, and the main thread collects the results and assembles the final image.
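For illustration, a minimal sketch of that row-based split, reusing the ThreadData struct from the question (start_id and end_id now index rows rather than pixels; untested):

// divide the rows among the threads, spreading any leftover rows
// one per thread so that no thread gets more than one extra row
int rows_per_thread = getHeight() / m_thread_number;
int remainder = getHeight() % m_thread_number;
int next_row = 0;
for (int i = 0; i < m_thread_number; i++)
{
    p_thread_data[i].start_id = next_row;
    p_thread_data[i].end_id = next_row + rows_per_thread - 1;
    if (i < remainder) p_thread_data[i].end_id++;
    next_row = p_thread_data[i].end_id + 1;
    p_thread_data[i].input = this;
    p_thread_data[i].output = &temp;
}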
With regards to your build warnings and errors, you'll have to provide the complete source code, build settings, environment, etc..
I'm new here and this is my first question on this site.
I am writing a simple program to find the maximum value of a vector c that is a function of two other vectors a and b. I'm doing it in Microsoft Visual Studio 2013, and the problem is that it only supports OpenMP 2.0, so I cannot do a reduction to find the max or min value of a vector directly, because OpenMP 2.0 does not support that operation.
I'm trying to do it without the reduction construct, with the following code:
for (i = 0; i < NUM_THREADS; i++){
    cMaxParcial[i] = -FLT_MAX;
}

omp_set_num_threads(NUM_THREADS);

#pragma omp parallel for private (i,j,indice)
for (i = 0; i < N; i++){
    for (j = 0; j < N; j++){
        indice = omp_get_thread_num();
        if (c[i*N + j] > cMaxParcial[indice]){
            cMaxParcial[indice] = c[i*N + j];
            bMaxParcial[indice] = b[j];
            aMaxParcial[indice] = a[i];
        }
    }
}

cMax = -FLT_MAX;
for (i = 0; i < NUM_THREADS; i++){
    if (cMaxParcial[i] > cMax){
        cMax = cMaxParcial[i];
        bMax = bMaxParcial[i];
        aMax = aMaxParcial[i];
    }
}
I'm getting the error "the expression must have integral or unscoped enum type" on the line cMaxParcial[indice] = c[i*N + j];.
Can anybody help me with this error?
Normally, that error is caused by one of the indices not being of integer type. Since you haven't shown the code where i, j, N, and indice are declared, my guess is that either N or indice is a float or double, though it would be easier to answer if you had provided an MCVE. However, the line above it seems to use the same indices correctly. This leads me to believe that it's an IntelliSense error, and those are often false positives. Try compiling the code and running it.
Now, on to an issue that you haven't (yet) asked about: why is my parallel code slower than my serial code? You're causing false sharing by using (presumably) contiguous arrays to hold each thread's a, b, and c values. Instead of using a single pragma for parallel and for, split it up like so:
cMax = -FLT_MAX;
#pragma omp parallel
{
    float aMaxParcialPerThread;
    float bMaxParcialPerThread;
    float cMaxParcialPerThread = -FLT_MAX;  // initialize the per-thread max

    #pragma omp for nowait private (i,j)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            if (c[i*N + j] > cMaxParcialPerThread){
                cMaxParcialPerThread = c[i*N + j];
                bMaxParcialPerThread = b[j];
                aMaxParcialPerThread = a[i];
            } // if
        } // for j
    } // for i

    #pragma omp critical
    {
        // update the global max if this thread's max is larger
        if (cMaxParcialPerThread > cMax) {
            cMax = cMaxParcialPerThread;
            bMax = bMaxParcialPerThread;
            aMax = aMaxParcialPerThread;
        }
    }
}
I don't know what is wrong with your compiler, since (as far as I can see with only the partial code you gave) the code seems valid. However, it is a bit convoluted and not so good.
What about the following:
#include <omp.h>
#include <float.h>

extern int N, NUM_THREADS;
extern float aMax, bMax, cMax, *a, *b, *c;

void foo() {
    cMax = -FLT_MAX;
    #pragma omp parallel num_threads( NUM_THREADS )
    {
        float localAMax, localBMax, localCMax = -FLT_MAX;
        #pragma omp for
        for ( int i = 0; i < N; i++ ) {
            for ( int j = 0; j < N; j++ ) {
                float pivot = c[i*N + j];
                if ( pivot > localCMax ) {
                    localAMax = a[i];
                    localBMax = b[j];
                    localCMax = pivot;
                }
            }
        }
        #pragma omp critical
        {
            if ( localCMax > cMax ) {
                aMax = localAMax;
                bMax = localBMax;
                cMax = localCMax;
            }
        }
    }
}
It compiles, but I haven't tested it...
Anyway, I avoided using the [a-c]MaxParcial arrays, since they would generate false sharing between the threads, leading to poor performance. The final reduction is done with a critical construct. It is not ideal, but it will perform perfectly well as long as you have a "moderate" number of threads. If you see a hot spot there, or if you need to use a "large" number of threads, it can be optimised better with a proper parallel reduction later.
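For reference, a sketch of what such a reduction could look like on a compiler supporting OpenMP 4.0's user-declared reductions (so not MSVC 2013). Triple, argmaxc, and bar are names made up for this example, and the same extern declarations as above are assumed:

#include <float.h>

typedef struct { float a, b, c; } Triple;

// merge rule: keep whichever partial result has the larger c;
// each thread's private copy starts out with c = -FLT_MAX
#pragma omp declare reduction(argmaxc : Triple : \
        omp_out = (omp_in.c > omp_out.c ? omp_in : omp_out)) \
        initializer(omp_priv = (Triple){ 0.0f, 0.0f, -FLT_MAX })

void bar() {
    Triple best = { 0.0f, 0.0f, -FLT_MAX };
    #pragma omp parallel for reduction(argmaxc : best)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (c[i*N + j] > best.c) {
                Triple t = { a[i], b[j], c[i*N + j] };
                best = t;
            }
    aMax = best.a; bMax = best.b; cMax = best.c;
}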
I am trying to implement an argmax with OpenMP. In short, I have a function that computes a floating-point value:
double best = 0;
#pragma omp parallel for reduction(max: best)
for(int i = 2 ; i < MAX ; ++i)
{
    double v = toOptimize(i);
    if(v > best) best = v;
}
Now, how can I get the value i corresponding to the maximum?
Edit:
I am trying this, but would like to make sure it is valid:
double best_value = 0;
int best_arg = 0;

#pragma omp parallel
{
    double local_best = 0;
    int ba = 0;

    #pragma omp for reduction(max: best_value)
    for(size_t n = 2 ; n <= MAX ; ++n)
    {
        double v = toOptimize(n);
        if(v > best_value)
        {
            best_value = v;
            local_best = v;
            ba = n;
        }
    }

    #pragma omp barrier
    #pragma omp critical
    {
        if(local_best == best_value)
            best_arg = ba;
    }
}
And in the end, best_arg should hold the argmax of toOptimize.
Your solution is completely standard conformant. Anyhow, if you are willing to add a bit of syntactic sugar, you may try something like the following:
#include <iostream>
using namespace std;

double toOptimize(int arg) {
    return arg * (arg%100);
}

class MaximumEntryPair {
public:
    MaximumEntryPair(size_t index = 0, double value = 0.0) : index_(index), value_(value){}
    void update(size_t arg) {
        double v = toOptimize(arg);
        if( v > value_ ) {
            value_ = v;
            index_ = arg;
        }
    }
    bool operator<(const MaximumEntryPair& other) const {
        if( value_ < other.value_ ) return true;
        return false;
    }
    size_t index_;
    double value_;
};

int main() {
    MaximumEntryPair best;
    #pragma omp parallel
    {
        // note: this cannot be named thread_local, which is a C++11 keyword
        MaximumEntryPair local_pair;
        #pragma omp for
        for(size_t ii = 0 ; ii < 1050 ; ++ii) {
            local_pair.update(ii);
        } // implicit barrier
        #pragma omp critical
        {
            if ( best < local_pair ) best = local_pair;
        }
    } // implicit barrier
    cout << "The maximum is " << best.value_ << " obtained at index " << best.index_ << std::endl;
    cout << "\t toOptimize(" << best.index_ << ") = " << toOptimize(best.index_) << std::endl;
    return 0;
}
I would just create a separate buffer for each thread to store its max value and index, and then select the max out of the buffers afterwards.
std::vector<double> thread_maxes(omp_get_max_threads());
std::vector<int> thread_max_ids(omp_get_max_threads());

#pragma omp parallel for
for(size_t n = 2 ; n <= MAX ; ++n)
{
    int thread_num = omp_get_thread_num();  // this thread's slot in the buffers
    double v = toOptimize(n);
    if(v > thread_maxes[thread_num])
    {
        thread_maxes[thread_num] = v;
        thread_max_ids[thread_num] = n;
    }
}

// std::max_element from <algorithm>
std::vector<double>::iterator max =
    std::max_element(thread_maxes.begin(), thread_maxes.end());
best.val = *max;
best.idx = thread_max_ids[max - thread_maxes.begin()];
Your solution is fine. It has O(nthreads) convergence with the critical section. However, it's possible to do this with O(log(nthreads)) convergence.
For example, imagine there were 32 threads.
You would first find the 32 local max values, then combine them pairwise with 16 threads, then 8, then 4, then 2, then 1. In five steps you could merge the local max values without a critical section, freeing threads in the process. Your method instead merges the local max values in 32 steps inside a critical section, using all threads.
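For illustration, a minimal sketch of that pairwise merge, assuming the thread count is a power of two and that the hypothetical arrays local_max[] and local_arg[] (one slot per thread) already hold each thread's local result:

#pragma omp parallel
{
    int me = omp_get_thread_num();
    int nth = omp_get_num_threads();
    for (int stride = nth / 2; stride > 0; stride /= 2) {
        #pragma omp barrier  // wait for the previous merge level to finish
        if (me < stride && local_max[me + stride] > local_max[me]) {
            local_max[me] = local_max[me + stride];
            local_arg[me] = local_arg[me + stride];
        }
    }
}
// local_max[0] and local_arg[0] now hold the global max and its argmax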
The same logic goes for a reduction. That's why it's best to let OpenMP do the reduction rather than doing it manually with an atomic section. At least in the C/C++ implementation of OpenMP, though, there is no easy way to get the max/min in O(log(nthreads)). It might be possible using tasks, but I have not tried that.
In practice this might not make a difference, since the time to merge the local values, even with a critical section, is probably negligible compared to the time to do the parallel loop. It probably makes more of a difference on the GPU, where the number of "threads" is much larger.