I'm writing a spectral PDE code that I'd like to parallelise, using FFTW to do the FFTs. The main loop in the code looks like the snippet below. Let's say I have a real-space array, u, and a Fourier-space array, uhat, and I've already constructed plans to go between them outside of this for loop (and outside of any parallel region), as well as calling the FFTW initialisation functions needed to parallelise fftw_execute.
for (...)   // main time-stepping loop
{
    fftw_execute(u_to_uhat);

    // do some things with uhat, the Fourier space array - for
    // example, scale the transform in a for loop
    for (int n = 0; n < Nmax; n++)
    {
        uhat[n] = scale*uhat[n];
    }

    // transform back
    fftw_execute(uhat_to_u);
}
I want to parallelise everything here, including the manipulations of uhat within that second for loop. My question is how I should use OpenMP #pragmas to do so. Currently I have a parallel region around the inner loop:
fftw_init_threads();
fftw_plan_with_nthreads(omp_get_max_threads());
for (...)   // main time-stepping loop
{
    fftw_execute(u_to_uhat);

    // do some things with uhat, the Fourier space array - for
    // example, scale the transform in a for loop
    #pragma omp parallel for
    for (int n = 0; n < Nmax; n++)
    {
        uhat[n] = scale*uhat[n];
    }

    // transform back
    fftw_execute(uhat_to_u);
}
But as I understand it, this creates and destroys a team of threads every time execution reaches the inner loop, which is expensive. I'd rather construct the parallel region once, outside the main loop:
fftw_init_threads();
fftw_plan_with_nthreads(omp_get_max_threads());
#pragma omp parallel
for (...)   // main time-stepping loop
{
    fftw_execute(u_to_uhat);

    // do some things with uhat, the Fourier space array - for
    // example, scale the transform in a for loop
    #pragma omp for
    for (int n = 0; n < Nmax; n++)
    {
        uhat[n] = scale*uhat[n];
    }

    // transform back
    fftw_execute(uhat_to_u);
}
But then I have to worry about the behaviour of fftw_execute when it is called from within a parallel region. I believe the documentation (section 5.4) says fftw_execute is thread safe, i.e. safe to call from within a parallel region. But it doesn't tell me whether fftw_execute creates its own team of threads when called - i.e. by putting it in a parallel region, am I just making every existing thread construct a load more threads within the fftw_execute function? Alternatively, does it know to use the threads that are already there?
In short: can I call fftw_execute() from inside a parallel region, and can I make it work as one would hope - i.e. use the existing threads to do the work rather than spawning new ones?
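To make the question concrete, here is roughly the structure I am hoping is valid. The #pragma omp single around each fftw_execute is only my guess at how one might stop every thread launching its own transform - I don't know whether it interacts sensibly with FFTW's internal threading, which is exactly what I'm asking:

#pragma omp parallel
{
    for (...)   // main time-stepping loop, executed identically by every thread
    {
        // guess: let a single thread trigger the (internally threaded) transform
        #pragma omp single
        fftw_execute(u_to_uhat);

        // the existing team shares the work on uhat
        #pragma omp for
        for (int n = 0; n < Nmax; n++)
        {
            uhat[n] = scale*uhat[n];
        }

        // guess: same for the inverse transform
        #pragma omp single
        fftw_execute(uhat_to_u);
    }
}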
Sorry for the long question; I'd really appreciate some advice on this!
I want to parallelize a for loop with many iterations using OpenMP. The results should be stored in a vector.
for (int i=0; i<n; i++)
{
// not every iteration produces a result
if (condition)
{
results.push_back (result_value);
}
}
This does not work properly with #pragma omp parallel for.
So what's the best practice to achieve that?
Is it somehow possible to use a separate results vector for each thread and then combine all the result vectors at the end? The ordering of the results is not important.
Something like the following is not practical because it consumes too much space:
int *results = new int[n];
for (int i=0; i<n; i++)
{
// not every iteration produces a result
if (condition)
{
results[i] = result_value;
}
}
// remove all unused slots in results array
Option 1: If each iteration takes a significant amount of time before adding the element to the vector, you can keep the push_back in a critical region:
#pragma omp parallel for
for (int i=0; i<n; i++)
{
    // not every iteration produces a result
    if (condition)
    {
        #pragma omp critical
        results.push_back (result_value);
    }
}
If threads are mostly busy with other things than the push_back, there will be little overhead from the critical region.
Option 2: If iterations are too cheap compared to the synchronization overhead, you can have each thread fill a thread-private vector and then merge them at the end:
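A minimal sketch of that idea (this is illustrative code, with a made-up condition standing in for the question's condition and result_value):

#include <omp.h>
#include <vector>

std::vector<int> collect_results(int n)
{
    // one private vector per potential thread
    std::vector<std::vector<int>> per_thread(omp_get_max_threads());

    #pragma omp parallel
    {
        std::vector<int>& mine = per_thread[omp_get_thread_num()];

        #pragma omp for nowait
        for (int i = 0; i < n; i++)
        {
            if (i % 3 == 0)          // stand-in for "condition"
                mine.push_back(i);   // stand-in for result_value
        }
    }

    // serial merge; the ordering of the results is not important here
    std::vector<int> results;
    for (const std::vector<int>& v : per_thread)
        results.insert(results.end(), v.begin(), v.end());
    return results;
}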
There is a good duplicate for this here and here.
The "naive" way:
You can init several vectors (call omp_get_max_threads() to know how many threads may be used), then call omp_get_thread_num() inside the parallel region to get the current thread ID, and let each thread write into its own vector.
Then, outside the parallel region, merge the vectors together. This can be worth it or not, depending on how "heavy" your processing is compared to the time required to merge the vectors.
If you know the maximum final size of the vector, you can reserve it before processing (so that push_back calls won't resize the vector and you gain processing time), then call push_back from inside a critical section (#pragma omp critical). Critical sections are horribly slow, though, so this is only worth it if the processing you do inside the loop is time consuming. In your case the "processing" looks to be only the if-clause check, so it's probably not worth it.
Finally, this is a fairly well-known problem. You should read this for more detailed information:
C++ OpenMP Parallel For Loop - Alternatives to std::vector
I have implemented the Jacobi algorithm based on the routine described in the book Numerical Recipes, but since I plan to work with very large matrices I am trying to parallelize it using OpenMP.
void ROTATE(MatrixXd &a, int i, int j, int k, int l, double s, double tau)
{
double g,h;
g=a(i,j);
h=a(k,l);
a(i,j)=g-s*(h+g*tau);
a(k,l)=h+s*(g-h*tau);
}
void jacobi(int n, MatrixXd &a, MatrixXd &v, VectorXd &d )
{
int j,iq,ip,i;
double tresh,theta,tau,t,sm,s,h,g,c;
VectorXd b(n);
VectorXd z(n);
v.setIdentity();
z.setZero();
#pragma omp parallel for
for (ip=0;ip<n;ip++)
{
d(ip)=a(ip,ip);
b(ip)=d(ip);
}
for (i=0;i<50;i++)
{
sm=0.0;
for (ip=0;ip<n-1;ip++)
{
#pragma omp parallel for reduction (+:sm)
for (iq=ip+1;iq<n;iq++)
sm += fabs(a(ip,iq));
}
if (sm == 0.0) {
break;
}
if (i < 3)
tresh=0.2*sm/(n*n);
else
tresh=0.0;
#pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
for (ip=0;ip<n-1;ip++)
{
//#pragma omp parallel for private (g,h,t,theta,c,s,tau)
for (iq=ip+1;iq<n;iq++)
{
g=100.0*fabs(a(ip,iq));
if (i > 3 && (fabs(d(ip))+g) == fabs(d[ip]) && (fabs(d[iq])+g) == fabs(d[iq]))
a(ip,iq)=0.0;
else if (fabs(a(ip,iq)) > tresh)
{
h=d(iq)-d(ip);
if ((fabs(h)+g) == fabs(h))
{
t=(a(ip,iq))/h;
}
else
{
theta=0.5*h/(a(ip,iq));
t=1.0/(fabs(theta)+sqrt(1.0+theta*theta));
if (theta < 0.0)
{
t = -t;
}
c=1.0/sqrt(1+t*t);
s=t*c;
tau=s/(1.0+c);
h=t*a(ip,iq);
#pragma omp critical
{
z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;
a(ip,iq)=0.0;
for (j=0;j<ip;j++)
ROTATE(a,j,ip,j,iq,s,tau);
for (j=ip+1;j<iq;j++)
ROTATE(a,ip,j,j,iq,s,tau);
for (j=iq+1;j<n;j++)
ROTATE(a,ip,j,iq,j,s,tau);
for (j=0;j<n;j++)
ROTATE(v,j,ip,j,iq,s,tau);
}
}
}
}
}
}
}
I wanted to parallelize the loop that does most of the calculations and both comments inserted in the code:
//#pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
//#pragma omp parallel for private (g,h,t,theta,c,s,tau)
are my attempts at it. Unfortunately both of them end up producing incorrect results. I suspect the problem may be in this block:
z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;
because usually this sort of accumulation would need a reduction, but since each thread accesses a different part of the array, I am not certain of this.
I am not really sure if I am doing the parallelization correctly, because I have only recently started working with OpenMP, so any suggestion or recommendation would also be welcome.
Sidenote: I know there are faster algorithms for eigenvalue and eigenvector determination including the SelfAdjointEigenSolver in Eigen, but those are not giving me the precision I need in the eigenvectors and this algorithm is.
My thanks in advance.
Edit: I consider the correct answer to be the one provided by The Quantum Physicist, because what I did does not reduce the computation time for systems of size up to 4096x4096. In any case I corrected the code in order to make it work, and maybe for big enough systems it could be of some use. I would advise using timers to test whether the
#pragma omp for
actually decreases the computation time.
I'll try to help, but I'm not sure this is the answer to your question.
There are tons of problems with your code. My friendly advice for you is: Don't do parallel things if you don't understand the implications of what you're doing.
For some reason, it looks like you think that putting everything inside a parallel #pragma for will make it faster. This is VERY wrong. Spawning threads is an expensive thing to do and costs (relatively) lots of memory and time. So if you redo that #pragma for for every loop, you'll respawn the threads for every loop, which will significantly reduce the speed of your program... UNLESS your matrices are REALLY huge and the computation time is much greater than the cost of spawning the threads.
I ran into a similar issue when I wanted to multiply huge matrices element-wise (and then I needed the sum for some expectation value in quantum mechanics). To use OpenMP for that, I had to flatten the matrices into linear arrays, distribute the array chunks to the threads, and then run a for loop in which every iteration uses elements that are guaranteed to be independent of all the others, letting them all evolve independently. This was quite fast. Why? Because I never had to respawn the threads.
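Roughly, the pattern I am describing looks like the sketch below (illustrative names, not my original code): flatten the matrices into contiguous arrays, let OpenMP chunk a single loop across the threads, and reduce the sum.

#include <cstddef>
#include <vector>

// Element-wise product of two flattened matrices plus a parallel sum.
double expectation_value(const std::vector<double>& A,
                         const std::vector<double>& B)
{
    double sum = 0.0;
    const std::ptrdiff_t N = static_cast<std::ptrdiff_t>(A.size());

    #pragma omp parallel for reduction(+:sum)
    for (std::ptrdiff_t i = 0; i < N; ++i)
        sum += A[i] * B[i];   // each iteration touches independent elements only

    return sum;
}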
Why are you getting wrong results? I believe the reason is that you're not respecting shared memory rules. You have some variable(s) that are being modified by multiple threads simultaneously. It's hiding somewhere, and you have to find it! For example, what does the function z do? Does it take stuff by reference? What I see here:
z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;
This looks VERY much not thread-safe, and I don't understand what you're doing. Are you returning a reference that you then have to modify? This is a recipe for thread non-safety. Why don't you create clean arrays and deal with them instead of this?
How to debug: Start with a small example (a 2x2 matrix, maybe), use only 2 threads, and try to understand what's going on. Use a debugger, set breakpoints, and check what information is shared between threads.
Also consider using a mutex to check what data gets ruined when it becomes shared. Here is how to do it.
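For example, something along these lines (a fragment sketched around your update block for debugging, not code from your program):

#include <mutex>

std::mutex update_mutex;   // serializes the suspicious updates while debugging

// ... inside the threaded loop ...
{
    std::lock_guard<std::mutex> guard(update_mutex);
    z(ip) = z(ip) - h;
    z(iq) = z(iq) + h;
    d(ip) = d(ip) - h;
    d(iq) = d(iq) + h;
}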
My recommendation: Don't use OpenMP unless you plan to spawn the threads ONLY ONCE. I actually believe that OpenMP is going to die very soon because of C++11. OpenMP was beautiful back when C++ didn't have any native multi-threading implementation. So learn how to use std::thread, and use it, and if you need to run many things in threads, then learn how to create a thread pool with std::thread. This is a good book for learning multithreading.
I am trying to achieve something contrived using OpenMP.
I have a multi-core system with N available processors. I want a vector of objects of length k*P to be populated in batches of P by a single thread (by reading a file), i.e. a single thread reads this file and writes vecObj[0 to P-1], then vecObj[P to 2P-1], etc. To keep things simple, this vector is pre-resized (i.e. inserting with the = operator, no push_backs, constant length as far as we are concerned).
After a batch is written into the vector, I want the remaining N-1 threads to work on the available data. Since every object can take a different amount of time to be worked on, it would be good to have dynamic scheduling for the remaining threads. The snippet below works really well when all the threads are working on the data.
#pragma omp parallel for schedule(dynamic, per_thread)
for(size_t i = 0; i < dataLength(); ++i) {
threadWorkOnElement(vecObj, i);
}
Now, as I see it, the main issue in coming up with a solution is this: how can I have N-1 threads dynamically scheduled over the range of available data, while another thread just keeps on reading and populating the vector with data?
I am guessing that writing the new data and signalling the remaining threads can be achieved using std::atomic.
I think that what I am trying to achieve is along the lines of the following pseudo code
std::atomic<size_t> freshDataEnd;
size_t dataWorkStart = 0;
size_t dataWorkEnd;
#pragma omp parallel
{
#pragma omp task
{
//increment freshDataEnd atomically upon reading every P objects
//return when end of file is reached
readData(vecObj, freshDataEnd);
}
#pragma omp task
{
omp_set_num_threads(N-1);
while(freshDataEnd <= MAX_VEC_LEN) {
if (dataWorkStart < freshDataEnd) {
dataWorkEnd = freshDataEnd;
#pragma omp parallel for schedule(dynamic, per_thread)
for(size_t i = dataWorkStart; i < dataWorkEnd; ++i) {
threadWorkOnElement(vecObj, i);
}
dataWorkStart = dataWorkEnd;
}
}
}
}
Is this the correct approach to achieve what I am trying to do? How can I handle this sort of nested parallelism? Less important: I would have preferred to stick with OpenMP directives and not use std::atomic - is that possible? How?
I am running my code on an Intel Xeon CPU X5680 @ 3.33GHz × 12. Here is a fairly simple OpenMP pseudo code (the OpenMP parts are exact; the normal code in between has been changed for compactness and clarity):
vector<int> myarray(arraylength,something);
omp_set_num_threads(3);
#pragma omp parallel
{
#pragma omp for schedule(dynamic)
for(int j=0;j<pr.max_iteration_limit;j++)
{
vector<int> temp_array(updated_array(a,b,myarray));
for(int i=0;i<arraylength;i++)
{
#pragma omp atomic
myarray[i]+=temp_array[i];
}
}
}
All parameters taken by the updated_array function are copied, so that there are no clashes. Basic structure of the updated_array function:
vector<int> updated_array(myClass1 a, vector<myClass2> b, vector<int> myarray)
{
//lots of preparations, but obviously there are only local variables, since
//function only takes copies
//the core code taking most of the time, which I will be measuring:
double time_s=time(NULL);
while(waiting_time<t_wait) //as long as needed
{
//a fairly short computation
//generates variable: vector<int> another_array
waiting_time++;
}
double time_f=time(NULL);
cout<<"Thread "<<omp_get_thread_num()<<" / "<<omp_get_num_threads()
<< " runtime "<<time_f-time_s<<endl;
//few more changes to the another_array
return another_array;
}
Questions and my attempts to resolve it:
adding more threads (with omp_set_num_threads(3);) does create more threads, but each thread does the job slower. E.g. 1: 6s, 2: 10s, 3: 15s ... 12: 60s.
(where to "job" I refer to the exact part of the code I pointed out as core, (NOT the whole omp loop or so) since it takes most of the time, and makes sure I am not missing anything additional)
There are no rand() things happening inside the core code.
Dynamic or static schedule doesn't make a difference here, of course (and I tried both).
There seems to be no sharing possible in any way or form, so I am running out of ideas completely... What can it be? I would be extremely grateful if you could help me with this (even with just ideas)!
p.s. The point of the code is to take myarray, do a bit of Monte Carlo on it with a single thread, and then collect the tiny changes and add/subtract them to/from the original array.
OpenMP may implement the atomic access using a mutex, in which case your code will suffer from heavy contention on that mutex. This will result in a significant performance hit.
If the work in updated_array() dominates the cost of the parallel loop, you'd better put the whole of the second loop inside a critical section:
{ // body of parallel loop
vector<int> temp_array = updated_array(a,b,myarray);
#pragma omp critical(UpDateMyArray)
for(int i=0;i<arraylength;i++)
myarray[i]+=temp_array[i];
}
However, your code looks broken (essentially not threadsafe), see my comment.
I am writing C++ code using OpenMP. I have a huge global array (100,000+ elements) that will be modified by adding values in a for loop. Is there a way that I can efficiently have each thread created by OpenMP maintain its own local copy of the array and then merge them after the loop? Since the number of threads is a variable, I could not create the local copies of the array beforehand. If I use a single global copy and address the race condition with a synchronization lock, the performance is terrible.
Thanks!
Edited:
Sorry for not being clear. Here's some pseudo-code that will hopefully clarify the scenario:
int* huge_array=new int[N];
memset(huge_array, 0, N*sizeof(int));
#pragma omp parallel for
for (i=0; i<n; i++)
{
// get a value v independently
// get a position p independently
// I have to set a lock here
omp_set_lock(&lock);
huge_array[p] += v;
omp_unset_lock(&lock);
}
Is there a way to improve the performance of the code above?
Okay, I finally understood what you want to do. Yes, you do it the same way as with pthreads.
std::vector<int> A(N,0);
std::vector<int*> local(omp_max_num_threads());
#pragma omp parallel
{
int np = omp_get_num_threads();
std::vector<int> localA(N);
local[omp_get_thread_num()] = localA.data();
// add values to local array
#pragma omp for
for(int i=0; i<num_values; ++i)
localA[position()] += value(); // (1)
// implicit barrier ensures all local copies are ready for aggregation
// aggregate local copies into global array
#pragma omp for
for(int k=0; k<N; ++k)
for(int p=0; p<np; ++p)
A[k] += local[p][k]; // (2)
// implicit barrier ensures no local copy is deleted before aggregation is done
}
It is important to do the aggregation in parallel as well.
In Walter's answer, I believe instead of
std::vector<int*> local(omp_max_num_threads());
It should be
std::vector<int*> local(omp_get_max_threads());
omp_max_num_threads() is not a routine in OpenMP.
What about using the directive
#pragma omp parallel for private(VARIABLE)
for your program?
EDIT:
For your code I would use my directive; you won't lose so much time locking and unlocking your variable...
EDIT 2:
Sorry, you cannot use my code for your problem unless you first create a temporary array where you store your data temporarily...
As far as I can tell, you are essentially filling a histogram, where position is the bin of the histogram to fill and value is the weight/value that you will add to that bin. Filling a histogram in parallel is equivalent to doing an array reduction. The C++ implementation of OpenMP does not have direct support for this; however, as far as I understand, some versions of the Fortran implementation do. To do an array reduction in C++ with OpenMP I have two suggestions.
1.) If the number of bins of the histogram (array) is much less than the number of values that will fill the histogram (which is often the preferred case, since one wants reasonable statistics in each bin), then you can fill private versions of the histogram in parallel and merge them in a critical section in serial. Since the number of bins is much less than the number of values, this should be efficient (see the sketch after this list).
2.) However, if the number of bins is large (as your example seems to imply), then it's possible to merge the private histograms in parallel as well, but this is a bit more tricky. Additionally, one needs to be careful with cache alignment and false sharing.
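A sketch of suggestion 1.) (the function and argument names here are mine, just to illustrate the pattern):

#include <vector>

// Fill a shared histogram of N bins from (position, value) pairs.
void fill_histogram(int* huge_array, int N,
                    const std::vector<int>& positions,
                    const std::vector<int>& values)
{
    #pragma omp parallel
    {
        std::vector<int> private_hist(N, 0);   // private histogram per thread

        #pragma omp for nowait
        for (int i = 0; i < (int)positions.size(); i++)
            private_hist[positions[i]] += values[i];

        // merge the private histograms into the shared one, one thread at a time
        #pragma omp critical
        for (int k = 0; k < N; k++)
            huge_array[k] += private_hist[k];
    }
}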
I showed how to do both of these methods and discussed some of the cache issues in the following question:
Fill histograms (array reduction) in parallel with openmp without using a critical section.