Multithreaded Program for Sparse Matrices - c++

I am a newbie to multithreading. I am trying to design a program that solves a sparse matrix. In my code I call Vector Vector dot product and Matix vector product as subroutines many times to arrive at the final solution. I am trying to parallelise the code using open MP (Especially the above two sub routines.)
I also have sequential codes in between which i donot intend to parallelise.
My question is how do I handle the threads created when the sub routine is called. Should I put a barrier at the end of every sub routine call.
Also where should I set the number of threads?
Mat_Vec_Mult(MAT,x0,rm);
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
rm[i] = b[i] - rm[i];
#pragma omp barrier
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
xm[i] = x0[i];
#pragma omp barrier
double* pm = (double*) malloc(numcols*sizeof(double));
#pragma omp parallel for schedule(static)
for(int i=0;i<numcols;i++)
pm[i] = rm[i];
#pragma omp barrier
scalarProd(rm,rm,numcols);
Thanks
EDIT:
for the scalar dotproduct, I am using the following piece of code:
double scalarProd(double* vec1, double* vec2, int n){
double prod = 0.0;
int chunk = 10;
int i;
//double* c = (double*) malloc(n*sizeof(double));
omp_set_num_threads(4);
// #pragma omp parallel shared(vec1,vec2,c,prod) private(i)
#pragma omp parallel
{
double pprod = 0.0;
#pragma omp for
for(i=0;i<n;i++) {
pprod += vec1[i]*vec2[i];
}
//#pragma omp for reduction (+:prod)
#pragma omp critical
for(i=0;i<n;i++) {
prod += pprod;
}
}
return prod;
}
I have now added the time calculation code in my ConjugateGradient function as below:
start_dotprod = omp_get_wtime();
rm_rm_old = scalarProd(rm,rm,MAT->ncols);
run_dotprod = omp_get_wtime() - start_dotprod;
fprintf(timing,"Time taken by rm_rm dot product : %lf \n",run_dotprod);
Observed results : Time taken for the dot product Sequential Version : 0.000007s Parallel Version : 0.002110
I am doing a simple compile using gcc -fopenmp command on Linux OS on my Intel I7 laptop.
I am currently using a matrix of size n = 5000.
I am getting huge speed down overall since the same dot product gets called multiple times till convergence is achieved( around 80k times).
Please suggest some improvements. Any help is much appreciated!

Honestly, I would suggest parallelizing at a higher level. By this I mean trying to minimize the number of #pragma omp parallels you are using. Every time you try and split up the work among your threads, there is an OpenMP overhead. Try and avoid this whenever possible.
So in your case at the very least I would try:
Mat_Vec_Mult(MAT,x0,rm);
double* pm = (double*) malloc(numcols*sizeof(double)); // must be performed once outside of parallel region
// all threads forked and created once here
#pragma omp parallel for schedule(static)
for(int i = 0; i < numcols; i++) {
rm[i] = b[i] - rm[i]; // (1)
xm[i] = x0[i]; // (2) does not require (1)
pm[i] = rm[i]; // (3) requires (1) at this i, not (2)
}
// implicit barrier at the end of omp for
// implicit join of all threads at the end of omp parallel
scalarProd(rm,rm,numcols);
Notice how I show that no barriers are actually necessary between your loops anyway.
If the majority of your time had been spent in this computation stage, you will surely be seeing considerable improvement. However, I'm reasonably confident that the majority of your time is being spent in Mat_Vec_Mult() and maybe also scalarProd(), so the amount of time you'll be saving is probably minimal.
** EDIT **
And as per your edit, I am seeing a few problems. (1) Always compile with -O3 when you are testing performance of your algorithm. (2) You won't be able to improve the runtime of something that takes .000007 sec to complete; that's nearly instantaneous. This goes back to what I said previously: try and parallelize at a higher level. CG Method is inherently a sequential algorithm, but there are certainly research papers developed detailing parallel CG. (3) Your implementation of scalar product is not optimal. Indeed, I suspect your implementation of matrix-vector product is not either. I would personally do the following:
double scalarProd(double* vec1, double* vec2, int n) {
double prod = 0.0;
int i;
// omp_set_num_threads(4); this should be done once during initialization somewhere previously in your program
#pragma omp parallel for private(i) reduction(+:prod)
for (i = 0; i < n; ++i) {
prod += vec1[i]*vec2[i];
}
return prod;
}
(4) There are entire libraries (LAPACK, BLAS, etc) that have highly optimized matrix-vector, vector-vector, etc operations. Any Linear Algebra library must be built upon them. Therefore, I'd suggest looking at using one of those libraries to do your two operations before you start re-creating the wheel here and trying to implement your own.

Related

warning #2901: [omp] OpenMP is not active; all OpenMP directives will be ignored

I'm currently trying to use OpenMP for parallel computing.
I've written the following basic code.
However it returns the following warning:
warning #2901: [omp] OpenMP is not active; all OpenMP directives will be ignored.
Changing the number of threads does not change the required running time since omp.h is ignored for some reason which is unclear to me.
Can someone help me out?
#include <stdio.h>
#include <omp.h>
#include <math.h>
int main(void)
{
double ts;
double something;
clock_t begin = clock();
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for (int i = 0; i<pow(10,7);i++)
{
something=sqrt(123456);
}
clock_t end = clock();
ts = (double)(end - begin) / CLOCKS_PER_SEC;
printf("Time elpased is %f seconds", ts);
}
In order to get OpenMP support you need to explicitly tell your compiler.
g++, gcc and clang need the option -fopenmp
mvsc needs the option /openmp (more info here if you use visual studio)
Aside from the obvious having to compile with -fopenmp flag your code has some problem worth pointing out, namely:
To measure time use omp_get_wtime() instead of clock() (it will give you the number of clock ticks accumulated across all threads).
The other problem is:
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for (int i = 0; i<pow(10,7);i++)
{
something=sqrt(123456);
}
the iterations of the loop are not being assigned to threads as you wanted. Because you have added again the clause parallel to #pragma omp for, and assuming that you have nested parallelism disabled, which by default it is, each of the threads created in the outer parallel region will execute "sequentially" the code within that region. Consequently, for a n = 6 (i.e., pow(10,7) = 6) and number of threads = 4, you would have the following block of code:
for (int i=0; i<n; i++) {
something=sqrt(123456);
}
being executed 6 x 4 = 24 times (i.e., the total number of loop iterations multiple by the total number of threads). For a more in depth explanation check this SO Thread about a similar issue. Nevertheless, the image below provides a visualization of the essential:
To fix this adapt your code to the following:
#pragma omp parallel for num_threads(4)
for (int i = 0; i<pow(10,7);i++)
{
something=sqrt(123456);
}

openMP excessive synchronization

I am trying to add an openMP parallelization into quite a big Project and I found out the openMP does too much synchronization outside the parallel blocks.
This synchronization is done for all of the variables, even those not used in the parallel block and it is done continuously, not only before entering the block.
I made an example proving this:
#include <cmath>
int main()
{
double dummy1 = 1.234;
int const size = 1000000;
int const size1 = 2500;
int const size2 = 500;
for(unsigned int i=0; i<size; ++i){
//for (unsigned int j=0; j<size1; j++){
// dummy1 = pow(dummy1/2 + 1, 1.5);
//}
#pragma omp parallel for
for (unsigned int j=0; j<size2; j++){
double dummy2 = 2.345;
dummy2 = pow(dummy2/2 + 1, 1.5);
}
}
}
If I run this code (with the for cycle commented), the runtimes are 6.75s with parallelization and 30.6s without. Great.
But if I uncomment the for cycle and run it again, the excessive synchronization kicks in and I get results 67.9s with parallelization and 73s without. If I increase size1 I even get slower results with parallelization than without it.
Is there a way to disable this synchronization and force it only before the second for cycle? Or any other way how to improve the speed?
Note that the outer neither the first for cycle are in the real example parallelizable. The outer one is in fact a ODE solver and the first inner one updating of loads of inner values.
I am using gcc (SUSE Linux) 4.8.5
Thanks for Your answers.
In the end the solution for my problem was specifying number of threads = number of processor cores. It seems the hyperthreading was causing the problems. So using (my processor has 4 real cores)
#pragma omp parallel for num_threads(4)
I get times 8.7s without the first for loop and 51.9s with it. There is still about 1.2s overhead, but that is acceptable. Using default (8 threads)
#pragma omp parallel for
the times are 6.65s and 68s. Here the overhead is about 19s.
So the hyperthreading helps if no other code is present, but when it is it might not always be a good idea to use it.

How should I interpreter these VTune results?

I'm trying to parallelyzing this code using OpenMP. OpenCV (built using IPP for best efficiency) is used as external library.
I'm having problems unbalanced CPU usage in parallel fors, but it seems that there is no load imbalance. As you will see, this could be because of KMP_BLOCKTIME=0, but this could be necessary because of external libraries (IPP, TBB, OpenMP, OpenCV). In the rest of the questions you will find more details and data that you can download.
These are the Google Drive links to my VTune results:
c755823 basic KMP_BLOCKTIME=0 30 runs : basic hotspot with environment variable KMP_BLOCKTIME set to 0 on 30 runs of the same input
c755823 basic 30 runs : same as above, but with default KMP_BLOCKTIME=200
c755823 advanced KMP_BLOCKTIME=0 30 runs : same as first, but advanced hotspot
For those who are interested, I can send you the original code somehow.
On my Intel i7-4700MQ the actual wall-clock time of the application on average on 10 runs is around 0.73 seconds. I compile the code with icpc 2017 update 3 with the following compiler flags:
INTEL_OPT=-O3 -ipo -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2 -fma -align -finline-functions
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl
In addition I set KMP_BLOCKTIME=0 because the default value (200) was generating an huge overhead.
We can divide the code in 3 parallel regions (wrapped in only one #pragma parallel for efficiency) and a previous serial one, which is around 25% of the algorithm (and it can't be parallelized).
I'll try to describe them (or you can skip to the code structure directly):
We create a parallel region in order to avoid the overhead to create a new parallel region. The final result is to populate the rows of a matrix obejct, cv::Mat descriptor. We have 3 shared std::vector objects: (a) blurs which is a chain of blurs (not parallelizable) using GuassianBlur by OpenCV (which uses the IPP implementation of guassian blurs) (b) hessResps (size known, say 32) (c) findAffineShapeArgs (unkown size, but in order of thousands of elements, say 2.3k) (d) cv::Mat descriptors (unkown size, final result). In the serial part, we populate `blurs, which is a read only vector.
In the first parallel region,hessResps is populated using blurs without any synchronization mechanism.
In the second parallel region findLevelKeypoints is populated using hessResps as read only. Since findAffineShapeArgs size is unkown, we need a local vector localfindAffineShapeArgs which will be appended to findAffineShapeArgs in the next step
Since findAffineShapeArgs is shared and its size is unkown, we need a critical section where each localfindAffineShapeArgs is appended to it.
In the third parallel region, each findAffineShapeArgs is used to generate the rows of the final cv::Mat descriptor. Again, since descriptors is shared, we need a local version cv::Mat localDescriptors.
A final critical section push_back each localDescriptors to descriptors. Notice that this is extremely fast since cv::Mat is "kinda" of a smart pointer, so we push_back pointers.
This is the code structure:
cv::Mat descriptors;
std::vector<Mat> blurs(blursSize);
std::vector<Mat> hessResps(32);
std::vector<FindAffineShapeArgs> findAffineShapeArgs;//we don't know its tsize in advance
#pragma omp parallel
{
//compute all the hessianResponses
#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
for (int j = 1; j <= scaleCycles; j++)
{
hessResps[/**/] = hessianResponse(/*...*/);
}
std::vector<FindAffineShapeArgs> localfindAffineShapeArgs;
#pragma omp for collapse(2) schedule(dynamic) nowait
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
findLevelKeypoints(localfindAffineShapeArgs, hessResps[/*...*], /*...*/); //populate localfindAffineShapeArgs with push_back
}
#pragma omp critical{
findAffineShapeArgs.insert(findAffineShapeArgs.end(), localfindAffineShapeArgs.begin(), localfindAffineShapeArgs.end());
}
#pragma omp barrier
#pragma omp for schedule(dynamic) nowait
for(int i=0; i<findAffineShapeArgs.size(); i++){
{
findAffineShape(findAffineShapeArgs[i]);
}
#pragma omp critical{
for(size_t i=0; i<localRes.size(); i++)
descriptors.push_back(localRes[i].descriptor);
}
}
At the end of the question, you can find FindAffineShapeArgs.
I'm using Intel Amplifier to see hotspots and evaluate my application.
The OpenMP Potential Gain analsysis says that the Potential Gain if there would be perfect load balancing would be 5.8%, so we can say that the workload is balanced between different CPUs.
This i the CPU usage histogram for the OpenMP region (remember that this is the result of 10 consecutive runs):
So as you can see, the Average CPU Usage is 7 cores, which is good.
This OpenMP Region Duration Histogram shows that in these 10 runs the parallel region is executed always with the same time (with a spread around 4 milliseconds):
This is the Caller/Calee tab:
For you knowledge:
interpolate is called in the last parallel region
l9_ownFilter* functions are all called in the last parallel region
samplePatch is called in the last parallel region.
hessianResponse is called in the second parallel region
Now, my first question is: how should I interpret the data above? As you can see, in many of the functions half of the time the "Effective Time by Utilization` is "ok", which would probably become "Poor" with more cores (for example on a KNL machine, where I'll test the application next).
Finally, this is the Wait and Lock analysis result:
Now, this is the first weird thing: line 276 Join Barrier (which corresponds to the most expensive wait object) is#pragma omp parallel`, so the beginning of the parallel region. So it seems that someone spawned threads before. Am I wrong? In addition, the wait time is longer than the program itself (0.827s vs 1.253s of the Join Barrier that I'm talking about)! But maybe that refers to the waiting of all threads (and not wall-clock time, which is clearly impossible since it's longer than the program itself).
Then, the Explicit Barrier at line 312 is #pragma omp barrier of the code above, and its duration is 0.183s.
Looking at the Caller/Callee tab:
As you can see, most of wait time is poor, so it refers to one thread. But I'm sure that I'm understanding this. My second question is: can we interpret this as "all the threads are waiting just for one thread who is staying behind?".
FindAffineShapeArgs definition:
struct FindAffineShapeArgs
{
FindAffineShapeArgs(float x, float y, float s, float pixelDistance, float type, float response, const Wrapper &wrapper) :
x(x), y(y), s(s), pixelDistance(pixelDistance), type(type), response(response), wrapper(std::cref(wrapper)) {}
float x, y, s;
float pixelDistance, type, response;
std::reference_wrapper<Wrapper const> wrapper;
};
The Top 5 Parallel Regions by Potential Gain in the summary view shows only one region (the only one)
Look at the "/OpenMP Region/OpenMP Barrier-to-Barrier" grouping, this is the order of the most expensives loops:
The 3th loop:
pragma omp for schedule(dynamic) nowait
for(int i=0; i
is the most expensive one (as I already knew) and here's a screenshot of the expended view:
As you can see, many functions are from OpenCV, which exploits IPP and is (should be) already optimized. Expanding the two other functions (interpolate and samplePatch) shows a [No call stack information]. Same for all the other functions (in other regions too).
The 2nd most expensive region is the second parallel for:
#pragma omp for collapse(2) schedule(dynamic) nowait
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
findLevelKeypoints(localfindAffineShapeArgs, hessResps[/*...*], /*...*/); //populate localfindAffineShapeArgs with push_back
}
Here's the expanded view:
And finally the 3th most expensive is the first loop:
#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
for (int j = 1; j <= scaleCycles; j++)
{
hessResps[/**/] = hessianResponse(/*...*/);
}
Here's the expended view:
If you want to know more, please use my attached VTune files or just ask!
Try reading the info in this link, especially the part about "Nested OpenMP" since Intel IPP already uses OpenMP in their implementation. From my experience with Intel IPP and OpenMP, if you are doing some other type of multi-threading and when each of the created threads gets to the OpenMP calls performance was really bad. Also, you can try having #pragma omp parallel for for each of the parallel regions instead of #pragma omp for and getting rid of the outer #pragma omp parallel

pragma omp for inside pragma omp master or single

I'm sitting with some stuff here trying to make orphaning work, and reduce the overhead by reducing the calls of #pragma omp parallel.
What I'm trying is something like:
#pragma omp parallel default(none) shared(mat,mat2,f,max_iter,tol,N,conv) private(diff,k)
{
#pragma omp master // I'm not against using #pragma omp single or whatever will work
{
while(diff>tol) {
do_work(mat,mat2,f,N);
swap(mat,mat2);
if( !(k%100) ) // Only test stop criteria every 100 iteration
diff = conv[k] = do_more_work(mat,mat2);
k++;
} // end while
} // end master
} // end parallel
The do_work depends on the previous iteration so the while-loop is has to be run sequential.
But I would like to be able to run the ´do_work´ parallel, so it would look something like:
void do_work(double *mat, double *mat2, double *f, int N)
{
int i,j;
double scale = 1/4.0;
#pragma omp for schedule(runtime) // Just so I can test different settings without having to recompile
for(i=0;i<N;i++)
for(j=0;j<N;j++)
mat[i*N+j] = scale*(mat2[(i+1)*N+j]+mat2[(i-1)*N+j] + ... + f[i*N+j]);
}
I hope this can be accomplished some way, I'm just not sure how. So any help I can get is greatly appreciated (also if you're telling me this isn't possible). Btw I'm working with open mp 3.0, the gcc compiler and the sun studio compiler.
The outer parallel region in your original code contains only a serial piece (#pragma omp master), which makes no sense and effectively results in purely serial execution (no parallelism). As do_work() depends on the previous iteration, but you want to run it in parallel, you must use synchronisation. The openmp tool for that is an (explicit or implicit) synchronisation barrier.
For example (code similar to yours):
#pragma omp parallel
for(int j=0; diff>tol; ++j) // must be the same condition for each thread!
#pragma omp for // note: implicit synchronisation after for loop
for(int i=0; i<N; ++i)
work(j,i);
Note that the implicit synchronisation ensures that no thread enters the next j if any thread is still working on the current j.
The alternative
for(int j=0; diff>tol; ++j)
#pragma omp parallel for
for(int i=0; i<N; ++i)
work(j,i);
should be less efficient, as it creates a new team of threads at each iteration, instead of merely synchronising.

openmp reduce technique

I have this for loop that finds minimum and maximum length, as you can see I have two values to reduce here while looking at OpenMP I can only notice that it provides reduction technique for only one value.
for (size_t i = 0; i < m_patterns.size(); ++i)
{// start for loop
if (m_patterns[i].size() < m_lmin)
m_lmin = m_patterns[i].size();
else if (m_patterns[i].size() > m_lmax)
m_lmax = m_patterns[i].size();
}// end for loop
can I do the following
#pragma omp parallel for reduction (min:m_lmin,max:m_lmax)
or should I rewrite the for loop to two for loops one for the minimum and one for the maximum
another question .. can I use tbb containers like concurrent_vector in OpenMP
From OpenMP 3.1 they started support of min & max reduction operation. OpenMP 3.1 is available from GCC 4.7. You can refer this link for further details of min max reduction.
You can roll-your-own concurrent vector as well as min and max reductions by filling private versions of the variables in parallel and then merging them in a critical section. This will work in MSVC which only supports OpenMP 2.5 (which does not support min and max reductions). But irrespective of whether your version of OpenMP supports min and max reductions this is a useful technique to learn.
This method is efficient as long as the number of items you loop over is much larger than the number of threads (or the time run over the items is large compared to the merging).
#pragma parallel
{
int m_lmin_private = m_lmin;
int m_max_private = m_max_private;
#pragma omp for nowait
for (size_t i = 0; i < m_patterns.size(); ++i) {
if (m_patterns[i].size() < m_lmin_private)
m_lmin_private = m_patterns[i].size();
else if (m_patterns[i].size() > m_lmax_private)
m_lmax_private = m_patterns[i].size();
}
#pragma omp critical
{
if (m_lmin_private<m_lmin)
m_lmin = m_lmin_private;
if (m_lmax_private>m_lmax)
m_lmax = m_lmax_private;
}
}
For concurrent vectors us the same method:
std::vector<int> vec;
#pragma omp parallel
{
std::vector<int> vec_private;
#pragma omp for nowait //fill vec_private in parallel
for(int i=0; i<n; i++) {
vec_private.push_back(i);
}
#pragma omp critical
vec.insert(vec.end(), vec_private.begin(), vec_private.end());
}
As far as openmp is concerned - an official specification is available ( www.openmp.org ). But finally your compiler is doing all the work. So the answer to your question may be compiler related...
However Microsoft is offering http://msdn.microsoft.com/de-de/library/2etkydkz(v=vs.80).aspx
suggesting
#pragma omp parallel for reduction(min:m_lmin) reduction(max:m_lmax)