misusing OpenMP? - c++

I have a program that uses OpenMP to parallelize a for-loop. Inside the loop, the threads write to a shared variable, so I need to synchronize them. However, I sometimes get either a segmentation fault or a "double free or corruption" error. Does anyone know what is happening? Thanks and regards! Here is the code:
void KNNClassifier::classify_various_k(int dim, double *feature, int label, int *ks, double *errors, int nb_ks, int k_max) {
    ANNpoint queryPt = 0;
    ANNidxArray nnIdx = 0;
    ANNdistArray dists = 0;

    queryPt = feature;
    nnIdx = new ANNidx[k_max];
    dists = new ANNdist[k_max];

    if (strcmp(_search_neighbors, "brutal") == 0) { // search
        _search_struct->annkSearch(queryPt, k_max, nnIdx, dists, _eps);
    } else if (strcmp(_search_neighbors, "kdtree") == 0) {
        _search_struct->annkSearch(queryPt, k_max, nnIdx, dists, _eps); // double free or corruption
    }

    for (int j = 0; j < nb_ks; j++)
    {
        scalar_t result = 0.0;
        for (int i = 0; i < ks[j]; i++) {
            result += _labels[ nnIdx[i] ]; // Segmentation fault
        }
        if (result * label < 0)
        {
            #pragma omp critical
            {
                errors[j]++;
            }
        }
    }

    delete [] nnIdx;
    delete [] dists;
}
void KNNClassifier::tune_complexity(int nb_examples, int dim, double **features, int *labels, int fold, char *method, int nb_examples_test, double **features_test, int *labels_test) {
    int nb_try = (_k_max - _k_min) / scalar_t(_k_step);
    scalar_t *error_validation = new scalar_t [nb_try];
    int *ks = new int [nb_try];
    for(int i=0; i < nb_try; i ++){
        ks[i] = _k_min + _k_step * i;
    }

    if (strcmp(method, "ct")==0)
    {
        train(nb_examples, dim, features, labels );// train once for all nb of nbs in ks
        for(int i=0; i < nb_try; i ++){
            if (ks[i] > nb_examples){nb_try=i; break;}
            error_validation[i] = 0;
        }

        int i = 0;
        #pragma omp parallel shared(nb_examples_test, error_validation,features_test, labels_test, nb_try, ks) private(i)
        {
            #pragma omp for schedule(dynamic) nowait
            for (i=0; i < nb_examples_test; i++)
            {
                classify_various_k(dim, features_test[i], labels_test[i], ks, error_validation, nb_try, ks[nb_try - 1]); // where error occurs
            }
        }

        for (i=0; i < nb_try; i++)
        {
            error_validation[i]/=nb_examples_test;
        }
    }
    ......
}
UPDATE:
As in my last post, double free or corruption, the code runs fine with a single thread but gives runtime errors with multiple threads. The error changes from run to run: if I run it twice, one run gives a segfault and the other gives double free or corruption.

Let's take a look at your segmentation fault line:
result+=_labels[ nnIdx[i] ];
result is local -- OK.
nnIdx is local -- also OK.
i is local -- still OK.
_labels ... what is it?
Is it global? Did you specify access to it with a shared clause?
The same goes for the earlier call:
_search_struct->annkSearch(queryPt, k_max, nnIdx, dists, _eps);
It seems we have a problem here that is not easily solvable: _search_struct is not thread-safe -- values in it are probably modified by several threads at once. You would need a dedicated _search_struct per thread, probably by allocating it in classify_various_k.
The really bad news however is that ANN is probably completely non-threadable:
The library allocates a small amount of storage, which is shared by all search structures built during the program's lifetime. Because the data is shared, it is not deallocated, even when all the individual structures are deleted.
As seen above, there will always be problems with parallel data modification, because the library itself holds some shared data -- hence it is not thread-safe itself :/
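If the shared search structure really cannot be made safe, one possible workaround (only a rough sketch, not tested against the real ANN library; it reuses the names _search_struct, queryPt, nnIdx, dists and _eps from the question's code) is to serialize just the search call inside classify_various_k and keep the rest of the per-example work parallel:

// Sketch: only one thread at a time may enter the ANN search, so the
// library's shared internal data is never touched concurrently.
#pragma omp critical(ann_search)
{
    _search_struct->annkSearch(queryPt, k_max, nnIdx, dists, _eps);
}

This obviously serializes the most expensive part of each iteration, so the parallel speedup will be limited; the clean fix is a search library whose query path is thread-safe, or one search structure per thread if the library allows it.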

Related

Parallelize for-loop in c++ - memory error

I am trying to parallelize a for-loop in C++, but every time I try to use this loop for a larger data set, I get this error:
Process returned -1073741819 (0xC0000005)
For small data sets the loop works; for larger sets the initialization works, but after that I get memory errors.
I am using Code::Blocks and the GNU GCC compiler.
In this loop I want to run several iterations of an evolutionary optimization heuristic.
I am using OpenMP and tried to make the variables that are used by several threads private.
#include <omp.h>

void search_framework(Data &data, Solution &best_s)
{
    vector<Solution> pop(data.p_size);
    vector<Solution> child(data.p_size);
    for (int i = 0; i < data.p_size; i++)
    {
        pop[i].reserve(data);
        child[i].reserve(data);
    }
    // parent index in pop
    vector<tuple<int, int>> p_indice(data.p_size);
    bool time_exhausted = false;
    int run = 1;
    #pragma omp parallel for firstprivate(pop, pop_fit, pop_argrank, child, child_fit, child_argrank, p_indice)
    for (int run = 1; run <= data.runs; run++)
    {
        run++;
        int no_improve = 0;
        int gen = 0;
        initialization(pop, pop_fit, pop_argrank, data);
        local_search(pop, pop_fit, pop_argrank, data);
        while (!termination(no_improve, data))
        {
            gen++;
            // printf("---------------------------------Gen %d---------------------------\n", gen);
            no_improve++;
            // select parents
            select_parents(pop, pop_fit, p_indice, data);
            // do local search for children
            local_search(child, child_fit, child_argrank, data);
            // replacement
            replacement(pop, p_indice, child, pop_fit, pop_argrank, child_fit, child_argrank, data);
            // update best
            argsort(pop_fit, pop_argrank, data.p_size);
            update_best_solution(pop[pop_argrank[0]], best_s, used, run, gen, data);
            if (data.tmax != NO_LIMIT && used > clock_t(data.tmax))
            {
                time_exhausted = true;
                break;
            }
        }
        if (time_exhausted) run = data.runs;
    }
}
Edited: This is the part where pop etc. is initialized:
void initialization(vector<Solution> &pop, vector<double> &pop_fit, vector<int> &pop_argrank, Data &data)
{
    int len = int(pop.size());
    for (int i = 0; i < len; i++)
    {
        pop[i].clear(data);
    }
    for (int i = 0; i < len; i++)
    {
        data.lambda_gamma = data.latin[i];
        new_route_insertion(pop[i], data);
    }
    for (int i = 0; i < len; i++)
    {
        pop_fit[i] = pop[i].cost;
    }
    argsort(pop_fit, pop_argrank, len);
}
You increment run twice: once in the loop header,
for (int run = 1; run <= data.runs; run++)
and again with run++ right below it.
I don't know what 'Data' is in this case, but I guess this makes the behaviour unstable.
If not, and the type of data.runs is unsigned long, be careful with
for (int run = 1; run <= data.runs; run++)
The range of int is -2147483648 to 2147483647; if the value of data.runs is outside the int range, this is very dangerous and may create an infinite loop.
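As a small, self-contained illustration of that pitfall (data.runs is not shown in the question, so the bound here is made up):

#include <climits>
#include <cstdio>

int main() {
    // Hypothetical value larger than INT_MAX, standing in for data.runs.
    unsigned long runs = 3000000000UL;

    // An int counter compared against this bound could never legally reach it:
    // incrementing an int past INT_MAX (2147483647) is undefined behaviour,
    // which the compiler is free to turn into an endless loop.
    // A counter whose type matches the bound has no such problem:
    unsigned long visited = 0;
    for (unsigned long run = 1000000000UL; run <= runs; run += 1000000000UL) {
        visited = run; // ends up at 3000000000, well past INT_MAX
    }
    std::printf("visited up to %lu (INT_MAX is %d)\n", visited, INT_MAX);
    return 0;
}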
Try increasing the stack size for each OMP thread using the OMP_STACKSIZE environment variable.
https://gcc.gnu.org/onlinedocs/gcc-12.1.0/libgomp/OMP_005fSTACKSIZE.html
I think the private data structures get put on the stack. So increasing the problem size will eventually exceed the reserved stack space.
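For reference, OMP_STACKSIZE is an environment variable that the OpenMP runtime reads when the program starts, so it is normally set in the shell before launching (for example OMP_STACKSIZE=512M ./program). A tiny sketch, just to confirm what the runtime was given:

#include <cstdio>
#include <cstdlib>

int main() {
    // getenv only reports what the environment contains; the OpenMP runtime
    // itself reads OMP_STACKSIZE once, at program start-up.
    const char *size = std::getenv("OMP_STACKSIZE");
    std::printf("OMP_STACKSIZE = %s\n", size ? size : "(not set, library default applies)");
    return 0;
}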

Usage of OpenMP reduction clause with nested loops

I have the current version of a function:
void*
function(const Input_st *Data, Output_st *Image)
{
    int i,j,r,Offset;
    omp_set_num_threads(24);
    #pragma omp parallel for schedule(static) shared(Data,Image),\
        private(i,j,r,Offset)
    for (i = 0; i < Data->NX; i++)
    {
        for (j = 0; j < (Data->NZ); j++)
        {
            for (r = 0; r < Data->NR; r++)
            {
                Offset = i*Data->NR*Data->NZ + j*Data->NR + r;
                Image->pTime[Offset] = function2();
            }
        }
    }
    return NULL;
}
It works very well; however, I wanted to remove the calculation of the variable Offset and instead use a pointer to the member Image->pTimeR and then increment it, which could look like the following:
void*
function(const Input_st *Data, Output_st *Image)
{
    int i, j, r;
    double *pTime = Image->pTime;
    omp_set_num_threads(24);
    #pragma omp parallel for schedule(static) shared(Data,Image),\
        private(i,j,r)
    for (i = 0; i < Data->NX; i++)
    {
        for (j = 0; j < (Data->NZ); j++)
        {
            for (r = 0; r < Data->NR; r++)
            {
                *pTime = function2();
                pTime++;
            }
        }
    }
    return NULL;
}
I get a segmentation fault. I assume I need to use a reduction clause, something like reduction(+:pTime).
First, the purpose here is to speed up the function, and I am wondering whether such a change would give a significant speedup (for example, through less cache memory use)?
Second, I tried to benchmark it and failed to do so! I think the problem can be solved by using a reduction clause, but since the loops are nested, it is not that straightforward to me.
There's no need for any sort of reduction clause here. However, at the moment, all threads use the same pointer and update the same memory locations (with race conditions on the value of pTime itself, hence the crashes, I suspect).
So you need to make your pointer private (typically by declaring it within the parallel region) and set it individually per thread to a meaningful value. Then it can be incremented the way you want.
Here is what the code could look like once fixed (not tested obviously):
void* function( const Input_st *Data, Output_st *Image ) {
    #pragma omp parallel for schedule( static ) num_threads( 24 )
    for ( int i = 0; i < Data->NX; i++ ) {
        double *pTime = Image->pTime + i * Data->NR * Data->NZ;
        for ( int j = 0; j < Data->NZ; j++ ) {
            for ( int r = 0; r < Data->NR; r++ ) {
                *pTime = function2();
                pTime++;
            }
        }
    }
    return NULL;
}

Max Reduction Open MP 2.0 Visual Studio 2013 C/C++

I'm new here and this is my first question on this site.
I am writing a simple program to find the maximum value of a vector c that is a function of two other vectors a and b. I'm doing it in Microsoft Visual Studio 2013, and the problem is that it only supports OpenMP 2.0, so I cannot use a reduction operation to find the max or min value of a vector directly, because OpenMP 2.0 does not support this operation.
I'm trying to do it without the reduction construct, using the following code:
for (i = 0; i < NUM_THREADS; i++){
    cMaxParcial[i] = -FLT_MAX;
}
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for private (i,j,indice)
for (i = 0; i < N; i++){
    for (j = 0; j < N; j++){
        indice = omp_get_thread_num();
        if (c[i*N + j] > cMaxParcial[indice]){
            cMaxParcial[indice] = c[i*N + j];
            bMaxParcial[indice] = b[j];
            aMaxParcial[indice] = a[i];
        }
    }
}
cMax = -FLT_MAX;
for (i = 0; i < NUM_THREADS; i++){
    if (cMaxParcial[i] > cMax){
        cMax = cMaxParcial[i];
        bMax = bMaxParcial[i];
        aMax = aMaxParcial[i];
    }
}
I'm getting the error: "The expression must have integral or unscoped enum type"
on the command cMaxParcial[indice] = c[i*N + j];
Can anybody help me with this error?
Normally, this error is caused by one of the indices not being an integer type. Since you haven't shown the code where i, j, N and indice are declared, my guess is that either N or indice is a float or double, but it would be simpler to answer if you had provided an MCVE. However, the line above it seems to use the same indices correctly. This leads me to believe that it's an IntelliSense error, which are often false positives. Try compiling the code and running it.
Now, on to issues that you haven't (yet) asked about (why is my parallel code slower than my serial code?). You're causing false sharing by using (presumably) contiguous arrays to find the a, b, and c values of each thread. Instead of using a single pragma for parallel and for, split it up like so:
cMax = -FLT_MAX;
#pragma omp parallel
{
    float aMaxParcialPerThread;
    float bMaxParcialPerThread;
    float cMaxParcialPerThread = -FLT_MAX; // each thread starts from the smallest value
    #pragma omp for nowait private (i,j)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            if (c[i*N + j] > cMaxParcialPerThread){
                cMaxParcialPerThread = c[i*N + j];
                bMaxParcialPerThread = b[j];
                aMaxParcialPerThread = a[i];
            } // if
        } // for j
    } // for i
    #pragma omp critical
    {
        if (cMaxParcialPerThread > cMax) { // keep this thread's max only if it beats the global one
            cMax = cMaxParcialPerThread;
            bMax = bMaxParcialPerThread;
            aMax = aMaxParcialPerThread;
        }
    }
}
I don't know what is wrong with your compiler since, as far as I can see from the partial code you gave, the code seems valid. However, it is a bit convoluted and not so good.
What about the following:
#include <omp.h>
#include <float.h>

extern int N, NUM_THREADS;
extern float aMax, bMax, cMax, *a, *b, *c;

int foo() {
    cMax = -FLT_MAX;
    #pragma omp parallel num_threads( NUM_THREADS )
    {
        float localAMax, localBMax, localCMax = -FLT_MAX;
        #pragma omp for
        for ( int i = 0; i < N; i++ ) {
            for ( int j = 0; j < N; j++ ) {
                float pivot = c[i*N + j];
                if ( pivot > localCMax ) {
                    localAMax = a[i];
                    localBMax = b[j];
                    localCMax = pivot;
                }
            }
        }
        #pragma omp critical
        {
            if ( localCMax > cMax ) {
                aMax = localAMax;
                bMax = localBMax;
                cMax = localCMax;
            }
        }
    }
    return 0;
}
It compiles but I haven't tested it...
Anyway, I avoided using the [a-c]MaxParcial arrays since they will generate false sharing between the threads, leading to poor performance. The final reduction is done with a critical section. It is not ideal, but will perform perfectly well as long as you have a moderate number of threads. If you see a hot spot there or you need to use a large number of threads, it can be optimised better with a proper parallel reduction later.
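For reference only, since the question is limited to OpenMP 2.0: with a compiler supporting OpenMP 4.0 or later, the proper parallel reduction mentioned above could be written as a user-defined reduction over the (a, b, c) triple. A sketch, not tested, with illustrative names (ArgMax, bar):

#include <float.h>
#include <omp.h>

extern int N, NUM_THREADS;
extern float aMax, bMax, cMax, *a, *b, *c;

// Bundle the three values so they can be reduced together.
struct ArgMax { float a, b, c; };

// The combiner keeps whichever partial result has the larger c.
#pragma omp declare reduction(argmax : ArgMax : \
        omp_out = (omp_in.c > omp_out.c ? omp_in : omp_out)) \
        initializer(omp_priv = { 0.0f, 0.0f, -FLT_MAX })

void bar() {
    ArgMax best = { 0.0f, 0.0f, -FLT_MAX };
    #pragma omp parallel for reduction(argmax : best) num_threads(NUM_THREADS)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float pivot = c[i*N + j];
            if (pivot > best.c) {
                best.a = a[i];
                best.b = b[j];
                best.c = pivot;
            }
        }
    }
    aMax = best.a;
    bMax = best.b;
    cMax = best.c;
}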

openMp optimisation of dynamic array access

I am trying to measure the speedup of a parallel section using one or four threads. As my parallel section is relatively simple, I expect a nearly fourfold speedup. (This is a follow-up to my question:
openMp: severe perfomance loss when calling shared references of dynamic arrays )
As my parallel section runs only twice as fast on four cores compared to one, I believe I have still not found the reason for the performance loss.
I want to parallelise my function iter as well as possible. The function uses entries of dynamic arrays and private quantities to change the entries of other dynamic arrays. Because every iteration step only uses the array entries of the respective loop step, I don't have different threads accessing the same array entry. Furthermore, I put some thought into false sharing due to accessing entries in the same cache line. My guess is that this is a minor effect: my double arrays are 5*10^5 entries long, and by choosing a reasonable chunk size for the schedule(dynamic,chunk) clause, I don't expect the very few entries in a given cache line to be accessed at the same time by different threads. In my simulation I have about 80 such arrays, so allocating them on the stack is not comfortable, and making private copies for every thread is out of the question too.
Does anybody have an idea how to improve this? I want to fully understand why this is so slow before starting with compiler optimisations.
What also surprised me was: calling iter(parallel), with parallel = false, is slower than calling it with parallel = true and omp_set_num_threads(1).
main.cpp:
int main(){
    mathClass m;
    m.fillArrays();
    double timeCount = 0.0;
    for(int j = 0; j<1000; j++){
        timeCount += m.iter(true);
    }
    printf("mean time difference = %fms\n",timeCount);
    return 0;
}
mathClass.h:
class mathClass{
    private:
        double* A;
        double* B;
        double* C;
        int length;
    public:
        double* D;
        mathClass();
        double iter(bool parallel);
        void fillArrays();
};
mathClass.cpp:
mathClass::mathClass(){
    length = 5000000;
    A = new double[length];
    B = new double[length];
    C = new double[length];
    D = new double[length];
}

void mathClass::fillArrays(){
    int temp;
    for ( int i=0; i<length; i++){
        temp = rand() % 100;
        A[i] = double(temp);
        temp = rand() % 100;
        B[i] = double(temp);
        temp = rand() % 100;
        C[i] = double(temp);
    }
}

double mathClass::iter(bool parallel){
    double startTime;
    double endTime;
    omp_set_num_threads(4);
    startTime = omp_get_wtime();
    #pragma omp parallel if(parallel)
    {
        int alpha; // private in all threads
        #pragma omp for schedule(static)
        for (int i=0; i<length; i++){
            alpha = 15*A[i];
            D[i] = C[i]*alpha + B[i]*alpha*alpha;
        }
    }
    endTime = omp_get_wtime();
    return endTime - startTime;
}

#pragma omp parallel for schedule crashes my program

I am building a plugin for Autodesk Maya 2013 in C++. I have to solve a set of optimization problems as fast as I can. I am using OpenMP for this task. The problem is that I don't have very much experience with parallel computing. I tried to use:
#pragma omp parallel for schedule (static)
on my for loops (without enough understanding of how it's supposed to work) and it worked very well for some of my code, but crashed another portion of my code.
Here is an example of a function that crashes because of the omp directive:
void PlanarizationConstraint::fillSparseMatrix(const Optimizer& opt, vector<T>& elements, double mu)
{
    int size = 3;
    #pragma omp parallel for schedule (static)
    for(int i = 0; i < opt.FVIc.outerSize(); i++)
    {
        int index = 3*i;
        Eigen::Matrix<double,3,3> Qxyz = Eigen::Matrix<double,3,3>::Zero();
        for(SpMat::InnerIterator it(opt.FVIc,i); it; ++it)
        {
            int face = it.row();
            for(int n = 0; n < size; n++)
            {
                Qxyz.row(n) += N(face,n)*N.row(face);
                elements.push_back(T(index+n,offset+face,(1 - mu)*N(face,n)));
            }
        }
        for(int n = 0; n < size; n++)
        {
            for(int k = 0; k < size; k++)
            {
                elements.push_back(T(index+n,index+k,(1-mu)*Qxyz(n,k)));
            }
        }
    }
    #pragma omp parallel for schedule (static)
    for(int j = 0; j < opt.VFIc.outerSize(); j++)
    {
        elements.push_back(T(offset+j,offset+j,opt.fvi[j]));
        for(SpMat::InnerIterator it(opt.VFIc,j); it; ++it)
        {
            int index = 3*it.row();
            for(int n = 0; n < size; n++)
            {
                elements.push_back(T(offset+j,index+n,N(j,n)));
            }
        }
    }
}
And here is an example of code that works very well with those directives (and is faster because of it)
Eigen::MatrixXd Optimizer::OptimizeLLGeneral()
{
    ConstraintsManager manager;
    SurfaceConstraint surface(1,true);
    PlanarizationConstraint planarization(1,true,3^Nv,Nf);
    manager.addConstraint(&surface);
    manager.addConstraint(&planarization);
    double mu = mu0;
    for(int k = 0; k < iterations; k++)
    {
        #pragma omp parallel for schedule (static)
        for(int j = 0; j < VFIc.outerSize(); j++)
        {
            manager.calcVariableMatrix(*this,j);
        }
        #pragma omp parallel for schedule (static)
        for(int i = 0; i < FVIc.outerSize(); i++)
        {
            Eigen::MatrixXd A = Eigen::Matrix<double, 3, 3>::Zero();
            Eigen::MatrixXd b = Eigen::Matrix<double, 1, 3>::Zero();
            manager.addLocalMatrixComponent(*this,i,A,b,mu);
            Eigen::VectorXd temp = b.transpose();
            Q.row(i) = A.colPivHouseholderQr().solve(temp);
        }
        mu = r*mu;
    }
    return Q;
}
My question is: what makes one function work so well with the omp directive, and what makes the other function crash? What is the difference that makes the omp directive act differently?
Before using OpenMP, you pushed data into the vector elements one element at a time. With OpenMP, however, several threads run the body of the for loop in parallel. When more than one thread pushes data into the vector elements at the same time, and there is no code to ensure that one thread finishes pushing before another starts, problems will happen. That's why your code crashes.
To solve this problem, you could use local buffer vectors. Each thread first pushes data into its own private buffer vector; then you concatenate these buffer vectors into a single vector.
Note that this method does not maintain the original order of the data in the vector elements. If you need that, you could calculate the expected index of each data element and assign the data to the right position directly.
Update
OpenMP provides APIs to let you know how many threads are in use and which thread you are on. See omp_get_max_threads() and omp_get_thread_num() for more info.
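A minimal, self-contained sketch of that idea (using plain ints rather than the question's Eigen triplets; the function name parallel_collect is just illustrative):

#include <omp.h>
#include <vector>

// Each thread fills its own buffer, so no two threads ever push_back into the
// same vector; the buffers are then merged sequentially.
std::vector<int> parallel_collect(int n)
{
    std::vector<std::vector<int>> buffers(omp_get_max_threads());

    #pragma omp parallel
    {
        std::vector<int> &local = buffers[omp_get_thread_num()];
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
        {
            local.push_back(i * i); // stand-in for the real per-element work
        }
    }

    // Merge: the result is grouped by thread, not in the original loop order.
    std::vector<int> result;
    for (const std::vector<int> &buf : buffers)
    {
        result.insert(result.end(), buf.begin(), buf.end());
    }
    return result;
}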