I have implemented the Jacobi eigenvalue algorithm based on the routine described in the book Numerical Recipes, but since I plan to work with very large matrices I am trying to parallelize it using OpenMP.
void ROTATE(MatrixXd &a, int i, int j, int k, int l, double s, double tau)
{
    double g,h;
    g=a(i,j);
    h=a(k,l);
    a(i,j)=g-s*(h+g*tau);
    a(k,l)=h+s*(g-h*tau);
}

void jacobi(int n, MatrixXd &a, MatrixXd &v, VectorXd &d )
{
    int j,iq,ip,i;
    double tresh,theta,tau,t,sm,s,h,g,c;
    VectorXd b(n);
    VectorXd z(n);
    v.setIdentity();
    z.setZero();
    #pragma omp parallel for
    for (ip=0;ip<n;ip++)
    {
        d(ip)=a(ip,ip);
        b(ip)=d(ip);
    }
    for (i=0;i<50;i++)
    {
        sm=0.0;
        for (ip=0;ip<n-1;ip++)
        {
            #pragma omp parallel for reduction (+:sm)
            for (iq=ip+1;iq<n;iq++)
                sm += fabs(a(ip,iq));
        }
        if (sm == 0.0) {
            break;
        }
        if (i < 3)
            tresh=0.2*sm/(n*n);
        else
            tresh=0.0;
        #pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
        for (ip=0;ip<n-1;ip++)
        {
            //#pragma omp parallel for private (g,h,t,theta,c,s,tau)
            for (iq=ip+1;iq<n;iq++)
            {
                g=100.0*fabs(a(ip,iq));
                if (i > 3 && (fabs(d(ip))+g) == fabs(d(ip)) && (fabs(d(iq))+g) == fabs(d(iq)))
                    a(ip,iq)=0.0;
                else if (fabs(a(ip,iq)) > tresh)
                {
                    h=d(iq)-d(ip);
                    if ((fabs(h)+g) == fabs(h))
                    {
                        t=(a(ip,iq))/h;
                    }
                    else
                    {
                        theta=0.5*h/(a(ip,iq));
                        t=1.0/(fabs(theta)+sqrt(1.0+theta*theta));
                        if (theta < 0.0)
                        {
                            t = -t;
                        }
                        c=1.0/sqrt(1+t*t);
                        s=t*c;
                        tau=s/(1.0+c);
                        h=t*a(ip,iq);
                        #pragma omp critical
                        {
                            z(ip)=z(ip)-h;
                            z(iq)=z(iq)+h;
                            d(ip)=d(ip)-h;
                            d(iq)=d(iq)+h;
                            a(ip,iq)=0.0;
                            for (j=0;j<ip;j++)
                                ROTATE(a,j,ip,j,iq,s,tau);
                            for (j=ip+1;j<iq;j++)
                                ROTATE(a,ip,j,j,iq,s,tau);
                            for (j=iq+1;j<n;j++)
                                ROTATE(a,ip,j,iq,j,s,tau);
                            for (j=0;j<n;j++)
                                ROTATE(v,j,ip,j,iq,s,tau);
                        }
                    }
                }
            }
        }
    }
}
I wanted to parallelize the loop that does most of the calculations, and the two directives inserted in the code:
//#pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
//#pragma omp parallel for private (g,h,t,theta,c,s,tau)
are my attempts at it. Unfortunately both of them end up producing incorrect results. I suspect the problem may be in this block:
z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;
because usually this sort of accumulation would need a reduction, but since each thread accesses a different part of the array, I am not certain of this.
I am not really sure whether I am doing the parallelization correctly, because I have only recently started working with OpenMP, so any suggestion or recommendation would be welcome.
Sidenote: I know there are faster algorithms for eigenvalue and eigenvector determination including the SelfAdjointEigenSolver in Eigen, but those are not giving me the precision I need in the eigenvectors and this algorithm is.
My thanks in advance.
Edit: I consider the correct answer to be the one provided by The Quantum Physicist, because what I did does not reduce the computation time for systems of size up to 4096x4096. In any case I corrected the code in order to make it work, and maybe for big enough systems it could be of some use. I would advise using timers to test whether the
#pragma omp for
actually decreases the computation time.
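For example, a minimal timing sketch of my own (not part of the original code), using OpenMP's wall-clock timer so runs with different thread counts can be compared:

#include <omp.h>
#include <iostream>

// ... set up n, a, v, d as in the code above ...
double t0 = omp_get_wtime();
jacobi(n, a, v, d);                       // routine under test
double t1 = omp_get_wtime();
std::cout << "jacobi: " << (t1 - t0) << " s using "
          << omp_get_max_threads() << " threads" << std::endl;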
I'll try to help, but I'm not sure this is the answer to your question.
There are tons of problems with your code. My friendly advice for you is: Don't do parallel things if you don't understand the implications of what you're doing.
For some reason, it looks like you think that putting everything in a parallel #pragma for will make it faster. This is VERY wrong, because spawning threads is an expensive thing to do and costs (relatively) lots of memory and time. So if you redo that #pragma for for every loop, you'll respawn threads for every loop, which will significantly reduce the speed of your program... UNLESS your matrices are REALLY huge and the computation time is far greater than the cost of spawning them.
I ran into a similar issue when I wanted to multiply huge matrices element-wise (and then I needed the sum for some expectation value in quantum mechanics). To use OpenMP for that, I had to flatten the matrices into linear arrays, distribute the array chunks to the threads, and then run a single for loop in which every iteration uses elements that are guaranteed to be independent of the others, letting them all evolve independently. This was quite fast. Why? Because I never had to respawn threads twice.
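For illustration only (this is a sketch written for this answer, not the original code): the element-wise product of two flattened matrices with a reduction for the sum, all inside a single parallel region:

#include <omp.h>
#include <vector>

// Sketch: element-wise product of two flattened matrices plus a reduced sum.
// The threads are spawned once for the whole loop.
double expectation(const std::vector<double>& A, const std::vector<double>& B)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)A.size(); ++i)
        sum += A[i] * B[i];               // each i is touched by exactly one thread
    return sum;
}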
Why are you getting wrong results? I believe the reason is that you're not respecting the shared-memory rules. You have some variable(s) being modified by multiple threads simultaneously. It's hiding somewhere, and you have to find it! For example, what does the function z do? Does it take stuff by reference? What I see here:
z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;
Looks VERY thread-unsafe, and I don't understand what you're doing. Are you returning a reference that you have to modify? This is a recipe for thread non-safety. Why don't you create clean arrays and deal with them instead of this?
How to debug: Start with a small example (2x2 matrix, maybe), and use only 2 threads, and try to understand what's going on. Use a debugger and define break points, and check what information is shared between threads.
Also consider using a mutex to check what data gets ruined when it becomes shared. Here is how to do it.
My recommendation: Don't use OpenMP unless you plan to spawn the threads ONLY ONCE. I actually believe that OpenMP is going to die very soon because of C++11. OpenMP was beautiful back then when C++ didn't have any native multi-threading implementation. So learn how to use std::thread, and use it, and if you need to run many things in threads, then learn how to create a Thread Pool with std::thread. This is a good book for learning multithreading.
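To make the std::thread suggestion concrete, here is a minimal chunking sketch (my own illustration, not code from the question): each thread sums one contiguous chunk, and the partial results are combined serially afterwards:

#include <thread>
#include <vector>
#include <numeric>

// Sketch: split the data into one contiguous chunk per thread, sum each chunk
// in its own std::thread, then combine the partial results serially.
double parallel_sum(const std::vector<double>& data, unsigned nthreads)
{
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == nthreads) ? data.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

The same idea extends to a thread pool: create the workers once, and feed them chunks of work instead of respawning them.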
Related
I am running my code on an Intel Xeon CPU X5680 @ 3.33GHz × 12. Here is a fairly simple OpenMP pseudo code (the OpenMP parts are exact; the normal code in between is changed for compactness and clarity):
vector<int> myarray(arraylength,something);
omp_set_num_threads(3);
#pragma omp parallel
{
    #pragma omp for schedule(dynamic)
    for(int j=0;j<pr.max_iteration_limit;j++)
    {
        vector<int> temp_array(updated_array(a,b,myarray));
        for(int i=0;i<arraylength;i++)
        {
            #pragma omp atomic
            myarray[i]+=temp_array[i];
        }
    }
}
All parameters taken by the updated_array function are copied, so that there would be no clashes. Basic structure of updated_array:
vector<int> updated_array(myClass1 a, vector<myClass2> b, vector<int> myarray)
{
    // lots of preparations, but obviously there are only local variables, since
    // the function only takes copies

    // the core code taking most of the time, which I will be measuring:
    double time_s=time(NULL);
    while(waiting_time<t_wait) // as long as needed
    {
        // a fairly short computation
        // generates variable: vector<int> another_array
        waiting_time++;
    }
    double time_f=time(NULL);
    cout<<"Thread "<<omp_get_thread_num()<<" / "<<omp_get_num_threads()
        << " runtime "<<time_f-time_s<<endl;

    // a few more changes to another_array
    return another_array;
}
Questions and my attempts to resolve it:
Adding more threads (with omp_set_num_threads(3);) does create more threads, but each thread does the job more slowly. E.g. 1: 6s, 2: 10s, 3: 15s ... 12: 60s.
(By "job" I refer to the exact part of the code I pointed out as the core (NOT the whole omp loop), since it takes most of the time, and to make sure I am not missing anything additional.)
There are no rand() calls happening inside the core code.
Dynamic or static schedule doesn't make a difference here, of course (and I tried both).
There seems to be no sharing possible in any way or form, so I am running out of ideas completely... What can it be? I would be extremely grateful if you could help me with this (even with just ideas)!
p.s. The point of the code is to take myarray, do a bit of Monte Carlo on it with a single thread, and then collect the tiny changes and add/subtract them to the original array.
OpenMP may implement the atomic access using a mutex, in which case your code will suffer from heavy contention on that mutex. This results in a significant performance hit.
If the work in updated_array() dominates the cost of the parallel loop, you'd be better off putting the whole of the second loop inside a critical section:
{ // body of parallel loop
    vector<int> temp_array = updated_array(a,b,myarray);
    #pragma omp critical(UpDateMyArray)
    for(int i=0;i<arraylength;i++)
        myarray[i]+=temp_array[i];
}
However, your code looks broken (essentially not threadsafe), see my comment.
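If the per-element accumulation itself turns out to be the bottleneck, another option (a sketch of mine, assuming your compiler supports OpenMP 4.5 array-section reductions; contribution() is a hypothetical stand-in for however each iteration produces its values) is to let the runtime keep one private copy of the array per thread and merge the copies once at the end, so there is no per-element atomic or critical traffic:

#include <omp.h>
#include <vector>

// Sketch: each thread accumulates into its own private copy of the array,
// and the runtime merges the copies once after the loop.
void accumulate(std::vector<int>& myarray, int iterations)
{
    int* acc = myarray.data();
    const int len = (int)myarray.size();
    #pragma omp parallel for reduction(+:acc[0:len])
    for (int j = 0; j < iterations; ++j)
        for (int i = 0; i < len; ++i)
            acc[i] += contribution(j, i);   // hypothetical per-iteration value
}

The price is one full copy of the array per thread, which is usually acceptable for arrays of this size.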
My serial code for the convolution between a matrix and a kernel works like this:
int index1, index2, a, b;
for(int x=0;x<rows;++x){
    for(int y=0;y<columns;++y){
        for(int i=0;i<krows;++i){
            for(int j=0;j<kcolumns;++j){
                a=x+i-krows/2;
                b=y+j-kcolumns/2;

                if(a<0)
                    index1=rows+a;
                else if(a>rows-1)
                    index1=a-rows;
                else
                    index1=a;

                if(b<0)
                    index2=columns+b;
                else if(b>columns-1)
                    index2=b-columns;
                else
                    index2=b;

                output[x*columns+y]+=input[index1*columns+index2]*kernel[i*kcolumns+j];
            }
        }
    }
}
The convolution uses cyclic (periodic) treatment of the borders. Now I want to parallelize the code with OpenMP. I thought about collapsing the first two for loops into a single one and using this syntax:
#pragma omp parallel
#pragma omp for private(x,y,a, b, index1, index2)
for(int z=0;z<rows*columns;z++){
x=z/columns;
y=z%columns;
...
I see that parallelizing like this reduces the CPU time, but I'm not an OpenMP expert, so I was wondering whether there are other, more efficient solutions. I don't think it is a good idea to also parallelize the other two nested for loops.
With an input matrix of dimensions 1000*10000 and a square kernel matrix 9*9 I obtain these times:
4823 ms for 1 thread
2696 ms for 2 threads
2513 ms for 4 threads.
I hope someone can give me some useful suggestions. What about the for reduction syntax?
My suggestion is to change approach altogether. If you are using cyclic treatment for the border (i.e. your problem is periodic), the fast way to do it is an FFT-based spectral approach:
- Fourier transform the matrix and the kernel
- compute the product
- inverse Fourier transform the product (you have the convolution)
This is (1) much more efficient (unless the dimensions of the kernel are much smaller than those of the matrix) and (2) you can use an FFT library that supports multithreading (like FFTW) and let it deal with the threading.
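A rough sketch of that route with FFTW (my own, untested; it assumes the kernel has already been zero-padded, and circularly shifted if needed, to the same rows × columns size as the input, and that the threaded FFTW library is linked, e.g. with -lfftw3_threads):

#include <fftw3.h>
#include <vector>

// Sketch: circular convolution via forward r2c transforms, a pointwise
// product in frequency space, and an inverse c2r transform.
void fft_convolve(const std::vector<double>& input,
                  const std::vector<double>& kernel_padded,
                  std::vector<double>& output,
                  int rows, int columns, int nthreads)
{
    const int nc = columns/2 + 1;                 // width of the r2c spectrum
    fftw_init_threads();
    fftw_plan_with_nthreads(nthreads);

    std::vector<double> in1(input), in2(kernel_padded);
    fftw_complex* F1 = fftw_alloc_complex(rows * nc);
    fftw_complex* F2 = fftw_alloc_complex(rows * nc);

    fftw_plan p1 = fftw_plan_dft_r2c_2d(rows, columns, in1.data(), F1, FFTW_ESTIMATE);
    fftw_plan p2 = fftw_plan_dft_r2c_2d(rows, columns, in2.data(), F2, FFTW_ESTIMATE);
    fftw_execute(p1);
    fftw_execute(p2);

    // pointwise complex product, including the 1/(rows*columns) normalization
    const double scale = 1.0 / (double(rows) * columns);
    for (int k = 0; k < rows * nc; ++k) {
        const double re = F1[k][0]*F2[k][0] - F1[k][1]*F2[k][1];
        const double im = F1[k][0]*F2[k][1] + F1[k][1]*F2[k][0];
        F1[k][0] = re * scale;
        F1[k][1] = im * scale;
    }

    output.resize((size_t)rows * columns);
    fftw_plan pb = fftw_plan_dft_c2r_2d(rows, columns, F1, output.data(), FFTW_ESTIMATE);
    fftw_execute(pb);

    fftw_destroy_plan(p1); fftw_destroy_plan(p2); fftw_destroy_plan(pb);
    fftw_free(F1); fftw_free(F2);
    fftw_cleanup_threads();
}

FFTW then parallelizes the transforms internally, and the pointwise product loop could be given its own #pragma omp parallel for as well.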
You don't need to change the for loops. You can make each thread iterate through all rows in a column or through all columns in a row. Also, bear in mind that if the number of threads is higher than the number of physical cores, the performance won't change much.
OpenMP already takes care of the number of threads that it should create, using the logical cores count - which might be a problem on Intel i3 and i7, since they have hyperthreading and thus the performance gain per extra thread won't be big.
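If you want to cap the thread count at the physical core count yourself, a rough sketch (my addition; omp_get_num_procs() reports logical processors, so the halving assumes 2-way hyperthreading):

#include <omp.h>

int logical_cores = omp_get_num_procs();   // logical processors seen by the runtime
omp_set_num_threads(logical_cores / 2);    // assumes 2-way hyperthreading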
In summary, you can either:
#pragma omp parallel for private (x,y,a,b,index1,index2)
for(int x=0;x<rows;++x){
for(int y=0;y<columns;++y){
// ...
}
}
Or:
for(int x=0;x<rows;++x){
#pragma omp parallel for private (y,a,b,index1,index2)
for(int y=0;y<columns;++y){
// ...
}
}
If you are using OpenMP 3.0 or greater you may exploit the collapse clause of the loop work-sharing construct:
The collapse clause may be used to specify how many loops are
associated with the loop construct. The parameter of the collapse
clause must be a constant positive integer expression. If no collapse
clause is present, the only loop that is associated with the loop
construct is the one that immediately follows the loop directive
This means that you may write the following:
#pragma omp parallel for collapse(2)
for(int x=0;x<rows;++x){
for(int y=0;y<columns;++y){
/* Work here */
}
}
and obtain exactly the same result as your linearized loop:
#pragma omp parallel for
for(int z=0;z<rows*columns;z++){
x=z/columns;
y=z%columns;
/* Work here */
}
As you may see, with the collapse clause no modification is needed to your serial code and you may easily experiment further loop collapsing changing the positive number in the clause.
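For instance, applied to the convolution above (an untested sketch; declaring the index variables inside the loop body means no private clause is needed, and each (x, y) pair writes to a distinct output element, so there is no race):

#pragma omp parallel for collapse(2)
for (int x = 0; x < rows; ++x) {
    for (int y = 0; y < columns; ++y) {
        for (int i = 0; i < krows; ++i) {
            for (int j = 0; j < kcolumns; ++j) {
                int a = x + i - krows/2;
                int b = y + j - kcolumns/2;
                int index1 = (a < 0) ? rows + a : (a > rows - 1 ? a - rows : a);
                int index2 = (b < 0) ? columns + b : (b > columns - 1 ? b - columns : b);
                output[x*columns + y] += input[index1*columns + index2] * kernel[i*kcolumns + j];
            }
        }
    }
}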
I am writing C++ code using OpenMP. I have a huge global array (100,000+ elements) that will be modified by adding values in a for loop. Is there a way that I can efficiently have each thread created by the OpenMP parallel for maintain its own local copy of the array and then join them after the loop? Since the number of threads is variable, I cannot create the local copies of the array beforehand. If I use a single global copy and address the race condition with a synchronization lock, the performance is terrible.
Thanks!
Edited:
Sorry for not being clear. Here's some pseudo-code hopefully could clarify the scenario:
int* huge_array=new int[N];
memset(huge_array, 0, N*sizeof(int));

#pragma omp parallel for
for (i=0; i<n; i++)
{
    // get a value v independently
    // get a position p independently

    // I have to set a lock here
    omp_set_lock(&lock);
    huge_array[p] += v;
    omp_unset_lock(&lock);
}
Is there a way to improve the performance of the code above?
Okay, I finally understood what you want to do. Yes, you do it the same way as with pthreads.
std::vector<int> A(N,0);
std::vector<int*> local(omp_max_num_threads());
#pragma omp parallel
{
    int np = omp_get_num_threads();
    std::vector<int> localA(N);
    local[omp_get_thread_num()] = localA.data();

    // add values to local array
    #pragma omp for
    for(int i=0; i<num_values; ++i)
        localA[position()] += value();              // (1)
    // implicit barrier ensures all local copies are ready for aggregation

    // aggregate local copies into global array
    #pragma omp for
    for(int k=0; k<N; ++k)
        for(int p=0; p<np; ++p)
            A[k] += local[p][k];                    // (2)
    // implicit barrier ensures no local copy is deleted before aggregation is done
}
but it is important to do the aggregation in parallel as well.
In Walter's answer, I believe instead of
std::vector<int*> local(omp_max_num_threads());
It should be
std::vector<int*> local(omp_get_max_threads());
omp_max_num_threads() is not a routine in OpenMP.
What about using the directive
#pragma omp parallel for private(VARIABLE)
for your program (with an actual # sign, without the quotes)?
EDIT:
For your code I would use my directive; you won't lose so much time locking and unlocking your variable...
EDIT 2:
Sorry, you cannot use my suggestion for your problem as it stands; it only works if you first create a temporary array where you store your data temporarily...
As far as I can tell you are essentially filling a histogram, where position is the bin of the histogram to fill and value is the weight/value that you will add to that bin. Filling a histogram in parallel is equivalent to doing an array reduction. The C++ implementation of OpenMP does not have direct support for this; however, as far as I understand, some versions of the Fortran implementation do. To do an array reduction in C++ with OpenMP I have two suggestions.
1.) If the number of bins of the histogram (array) is much less than the number of values that will fill the histogram (which is often the preferred case, since one wants reasonable statistics in each bin), then you can fill private versions of the histogram in parallel and merge them in a critical section in serial. Since the number of bins is much less than the number of values, this should be efficient.
2.) However, if the number of bins is large (as your example seems to imply), then it's possible to merge the private histograms in parallel as well, but this is a bit more tricky. Additionally, one needs to be careful with cache alignment and false sharing.
I showed how to do both these methods and discuss some of the cache issues in the following question:
Fill histograms (array reduction) in parallel with openmp without using a critical section.
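A minimal sketch of suggestion 1 (my illustration; value() and position() are hypothetical stand-ins for however each value and bin index are obtained): each thread fills a private histogram without any synchronization, and the merge happens once per thread inside a critical section:

#include <omp.h>
#include <vector>

// Sketch of method 1: thread-private histograms, merged once per thread.
std::vector<int> histogram(N, 0);
#pragma omp parallel
{
    std::vector<int> priv(N, 0);            // private histogram for this thread
    #pragma omp for nowait
    for (int i = 0; i < n; ++i)
        priv[position(i)] += value(i);      // no synchronization needed here
    #pragma omp critical
    for (int k = 0; k < N; ++k)
        histogram[k] += priv[k];            // one merge per thread
}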
I'm writing the following code, which constructs a distance matrix between each point and all the other points that I have in the map dat[]. Although the code works perfectly, its running time does not improve: it takes the same time whether I set the number of threads to 1 or to 10 on an 8-core machine. Therefore, I'd appreciate it if anyone can help me find what is wrong with my code, and any suggestion to help make the code run faster would be very helpful too.
The following is the code:
map< int,string >::iterator datIt;
map <int, map< int, double> > dist;
int mycont=0;
datIt=dat.begin();
int size=dat.size();
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel // construct the distance matrix
{
    map< int,string >::iterator datItLocal=datIt;
    int lastIdx = 0;
    #pragma omp for
    for(int i=0;i<size;i++)
    {
        std::advance(datItLocal, i - lastIdx);
        lastIdx = i;
        map< int,string >::iterator datIt2=datItLocal;
        datIt2++;
        while(datIt2!=dat.end())
        {
            double ecl=0;
            int c=count((*datItLocal).second.begin(),(*datItLocal).second.end(),delm);
            string line1=(*datItLocal).second;
            string line2=(*datIt2).second;
            for (int i=0;i<c;i++)
            {
                double num1=atof(line1.substr(0,line1.find_first_of(delm)).c_str());
                line1=line1.substr(line1.find_first_of(delm)+1).c_str();
                double num2=atof(line2.substr(0,line2.find_first_of(delm)).c_str());
                line2=line2.substr(line2.find_first_of(delm)+1).c_str();
                ecl += (num1-num2)*(num1-num2);
            }
            ecl=sqrt(ecl);
            omp_set_lock(&lock);
            dist[(*datItLocal).first][(*datIt2).first]=ecl;
            dist[(*datIt2).first][(*datItLocal).first]=ecl;
            omp_unset_lock(&lock);
            datIt2++;
        }
    }
}
omp_destroy_lock(&lock);
My guess is that using a single lock for protecting 'dist' serializes your program.
Option 1:
Consider using a fine-grained locking strategy. Typically, you benefit from this if dist.size() is much larger than the number of threads.
map <int, omp_lock_t > locks;
...
int key1 = (*datItLocal).first;
int key2 = (*datIt2).first;
omp_set_lock(&(locks[key1]));
omp_set_lock(&(locks[key2]));
dist[(*datItLocal).first][(*datIt2).first]=ecl;
dist[(*datIt2).first][(*datItLocal).first]=ecl;
omp_unset_lock(&(locks[key2]));
omp_unset_lock(&(locks[key1]));
Option 2:
Your OpenMP implementation might already apply the optimization mentioned in option 1 internally, so you can try to drop your lock and use the built-in critical section:
#pragma omp critical
{
dist[(*datItLocal).first][(*datIt2).first]=ecl;
dist[(*datIt2).first][(*datItLocal).first]=ecl;
}
I'm a bit unsure of exactly what you're trying to do with your loops, but it looks like it will result in a quadratic nested loop over the map. Assuming that's intended, I think the following line will perform poorly when parallelised:
std::advance(datItLocal, i - lastIdx);
If OpenMP were disabled, that's advancing by one step each time, which is fine. But with OpenMP, there are going to be multiple threads doing chunks of that loop at random. So one of them might start at i=100000, so it has to advance 100000 steps through the map to begin. That might happen quite a lot if there are lots of threads being given relatively small chunks of the loop at a time. It might even be that you end up being memory/cache constrained since you're constantly having to walk over all of this presumably large map. It seems like this might be (part of) your culprit since it may get worse as more threads are available.
Fundamentally I guess I'm a bit suspicious of trying to parallelise iteration over a sequential data structure. You may know more about which parts of it really are slow or not though if you've profiled it.
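One way around this (a sketch of mine, not from the original answer; process_pair() is a hypothetical stand-in for the distance computation and the writes to dist): snapshot the map entries into a random-access vector once, so every chunk of the parallel loop starts in O(1) instead of walking the map with std::advance:

#include <map>
#include <string>
#include <vector>

// Sketch: copy pointers to the map entries into a vector before the parallel
// loop, so each iteration can index its starting element directly.
std::vector<const std::pair<const int, std::string>*> entries;
entries.reserve(dat.size());
for (auto& kv : dat)
    entries.push_back(&kv);

#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < (int)entries.size(); ++i)
    for (size_t j = i + 1; j < entries.size(); ++j)
        process_pair(*entries[i], *entries[j]);   // hypothetical per-pair work

Note that process_pair() would still need the lock or a critical section around any writes to the shared dist map.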
I am trying OpenMP on a particular code snippet. Not sure if the snippet needs a revamp, perhaps it is set up too rigidly for sequential implementation. Anyway here is the (pseudo-)code that I'm trying to parallelize:
#pragma omp parallel for private(id, local_info, current_local_cell_id, local_subdomain_size) shared(cells, current_global_cell_id, global_id)
for(id = 0; id < grid_size; ++id) {
    local_info = cells.get_local_subdomain_info(id);
    local_subdomain_size = local_info.size();
    // ...do other stuff...
    do {
        current_local_cell_id = cells.get_subdomain_cell_id(id);
        global_id.set(id, current_global_cell_id + current_local_cell_id);
    } while(id < local_subdomain_size && ++id);
    current_global_cell_id += local_subdomain_size;
}
This makes complete sense (after staring at it for some time) when executed sequentially, which might also mean that it needs to be rewritten for OpenMP. My concern is that current_local_cell_id and local_subdomain_size are private, but current_global_cell_id and global_id are shared.
Hence the statement current_global_cell_id += local_subdomain_size after the inner loop:
do {
...
} while(...)
current_global_cell_id += local_subdomain_size;
might lead to errors in the OpenMP setting, I suspect. I would greatly appreciate if any of the OpenMP experts out there can provide some pointers on any of the special OMP directives I can use to make minimum changes to the code but still avail of OpenMP for such a type of for loop.
I'm not sure I understand your code. However, I think you really want some kind of parallel accumulation.
You could use a pattern like
size_t total = 0;
#pragma omp parallel for shared(total) reduction (+:total)
for (int i=0; i<MAXITEMS; i++)
{
total += getvalue(i); // TODO replace with your logic
}
// total has been 'magically' combined by OMP
On a related note, when you use gcc you can just use the __gnu_parallel::accumulate drop-in replacement for std::accumulate, which does exactly the same. See Chapter 18. Parallel Mode
size_t total = __gnu_parallel::accumulate(c.begin(), c.end(), 0, &myvalue_accum);
You can even compile with -D_GLIBCXX_PARALLEL, which will make all uses of std algorithms automatically parallelized where possible. Don't use that unless you know what you're doing! Frequently, performance just suffers, and the chance of introducing bugs due to unexpected parallelism is real.
Changing id inside the loop is not correct. There is no way to dispatch the loop iterations to different threads, because the loop step does not produce a predictable sequence of id values.
Why are you using the id inside that do-while loop?