Join array results in OpenMP - C++

I am writing C++ code using OpenMP. I have a huge global array (100,000+ elements) that is modified by adding values in a for loop. Is there a way to efficiently have each thread created by OpenMP maintain its own local copy of the array and then join them after the loop? Since the number of threads is variable, I cannot create the local copies of the array beforehand. If I use a single global copy and address the race condition with a synchronization lock, the performance is terrible.
Thanks!
Edited:
Sorry for not being clear. Here's some pseudo-code that will hopefully clarify the scenario:
int* huge_array = new int[N];
memset(huge_array, 0, N * sizeof(int));
#pragma omp parallel for
for (int i = 0; i < n; i++)
{
    // get a value v independently
    // get a position p independently
    // I have to set a lock here
    omp_set_lock(&lock);
    huge_array[p] += v;
    omp_unset_lock(&lock);
}
Is there a way to improve the performance of the code above?

Okay, I finally understood what you want to do. Yes, you do it the same way as with pthreads.
std::vector<int> A(N,0);
std::vector<int*> local(omp_max_num_threads());
#pragma omp parallel
{
    int np = omp_get_num_threads();
    std::vector<int> localA(N);
    local[omp_get_thread_num()] = localA.data();
    // add values to local array
    #pragma omp for
    for(int i=0; i<num_values; ++i)
        localA[position()] += value(); // (1)
    // implicit barrier ensures all local copies are ready for aggregation
    // aggregate local copies into global array
    #pragma omp for
    for(int k=0; k<N; ++k)
        for(int p=0; p<np; ++p)
            A[k] += local[p][k]; // (2)
    // implicit barrier ensures no local copy is deleted before aggregation is done
}
But it is important to do the aggregation in parallel as well.

In Walter's answer, I believe instead of
std::vector<int*> local(omp_max_num_threads());
It should be
std::vector<int*> local(omp_get_max_threads());
omp_max_num_threads() is not a routine in OpenMP.

What about using the directive
#pragma omp parallel for private(VARIABLE)
for your program?
EDIT:
For your code I would use my directive; you won't lose so much time locking and unlocking your variable...
EDIT 2:
Sorry, you cannot use my directive for your problem as-is; it only works if you first create a temporary array where you store your data temporarily...

As far as I can tell, you are essentially filling a histogram, where position is the bin of the histogram to fill and value is the weight/value that you will add to that bin. Filling a histogram in parallel is equivalent to doing an array reduction. The C++ implementation of OpenMP does not have direct support for this; however, as far as I understand, some versions of the Fortran implementation do. To do an array reduction in C++ with OpenMP, I have two suggestions.
1.) If the number of bins of the histogram (array) is much less than the number of values that will fill the histogram (which is often the preferred case, since one wants reasonable statistics in each bin), then you can fill private versions of the histogram in parallel and merge them in a critical section in serial (a minimal sketch is shown below). Since the number of bins is much less than the number of values, this should be efficient.
2.) However, if the number of bins is large (as your example seems to imply), then it's possible to merge the private histograms in parallel as well, but this is a bit more tricky. Additionally, one needs to be careful with cache alignment and false sharing.
I showed how to do both these methods and discuss some of the cache issues in the following question:
Fill histograms (array reduction) in parallel with openmp without using a critical section.
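For reference, here is a minimal sketch of suggestion 1.): private histograms filled in parallel and merged serially in a critical section. The position() and value() helpers are hypothetical placeholders for the independent per-iteration work described in the question.
#include <vector>

// Hypothetical placeholders for the independent per-iteration work:
int position(long long i);   // which bin iteration i falls into
int value(long long i);      // the weight added to that bin

std::vector<long long> fill_histogram(int nbins, long long num_values)
{
    std::vector<long long> hist(nbins, 0);              // shared result
    #pragma omp parallel
    {
        std::vector<long long> hist_private(nbins, 0);  // per-thread histogram
        #pragma omp for nowait                          // fill private copies in parallel
        for (long long i = 0; i < num_values; ++i)
            hist_private[position(i)] += value(i);
        #pragma omp critical                            // serial merge, one thread at a time
        for (int b = 0; b < nbins; ++b)
            hist[b] += hist_private[b];
    }
    return hist;
}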

Related

Multithreading is slower than no threading C++

I am new to multi-thread programming and I am aware several similar questions have been asked on SO before; however, I would like to get an answer specific to my code.
I have two vectors of objects (v1 & v2) that I want to loop through and, depending on whether they meet some criteria, add these objects to a single vector, like so:
Non-Multithread Case
std::vector<hobj> validobjs;
int length = 70;
for(auto i = this->v1.begin(); i < this->v1.end(); ++i) {
    if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
        hobj obj(*i, length);
        validobjs.push_back(obj);
    }
}
for(auto j = this->v2.begin(); j < this->v2.end(); ++j) {
    if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
        hobj obj(*j, length);
        validobjs.push_back(obj);
    }
}
Multithread Case
std::vector<hobj> validobjs;
int length = 70;
#pragma omp parallel
{
    std::vector<hobj> threaded1; // Each thread has own local vector
    #pragma omp for nowait firstprivate(length)
    for(auto i = this->v1.begin(); i < this->v1.end(); ++i) {
        if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
            hobj obj(*i, length);
            threaded1.push_back(obj);
        }
    }
    std::vector<hobj> threaded2; // Each thread has own local vector
    #pragma omp for nowait firstprivate(length)
    for(auto j = this->v2.begin(); j < this->v2.end(); ++j) {
        if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
            hobj obj(*j, length);
            threaded2.push_back(obj);
        }
    }
    #pragma omp critical // Insert local vectors to main vector one thread at a time
    {
        validobjs.insert(validobjs.end(), threaded1.begin(), threaded1.end());
        validobjs.insert(validobjs.end(), threaded2.begin(), threaded2.end());
    }
}
In the non-multithreaded case the total time spent on the operation is around 4x lower than in the multithreaded case (~1.5 s vs ~6 s).
I am aware that the #pragma omp critical directive is a performance hit but since I do not know the size of the validobjs vector beforehand I cannot rely on random insertion by index.
So questions:
1) Is this kind of operation suited for multi-threading?
2) If yes to 1) - does the multithreaded code look reasonable?
3) Is there anything I can do to improve the performance to get it faster than the no-thread case?
Additional info:
The above code is nested within a much larger codebase that performs 10,000 - 100,000s of iterations (this outer loop is not using multithreading). I am aware that spawning threads also incurs a performance overhead, but as far as I am aware these threads are kept alive until the above code is executed again on the next iteration.
omp_set_num_threads is set to 32 (I'm on a 32 core machine).
Ubuntu, gcc 7.4
Cheers!
I'm no expert on multithreading, but I'll give it a try:
Is this kind of operation suited for multi-threading?
I would say yes. Especially if you have huge datasets, you could split them even further, running any number of filtering operations in parallel. But it depends on the amount of data you want to process; thread creation and synchronization are not free.
Neither is the merging at the end of the threaded version.
Does the multithreaded code look reasonable?
I think you're on the right path in letting each thread work on independent data.
Is there anything I can do to improve the performance to get it faster than the no-thread case?
I see a few points that might improve performance:
The vectors will need to resize often, which is expensive. You can use reserve() to, well, reserve memory beforehand and thus reduce the number of reallocations (to 0 in the optimal case).
The same goes for the merging of the two vectors at the end, which is a critical point. First reserve:
validobjs.reserve(v1.size() + v2.size());
then merge.
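A minimal sketch combining both points (per-thread vectors plus reserve before merging), assuming the hobj type and the v1/v2 member vectors from the question, with the iterator loops rewritten as index loops for brevity:
std::vector<hobj> validobjs;
validobjs.reserve(v1.size() + v2.size());   // worst case: every object is valid
const int length = 70;
#pragma omp parallel
{
    std::vector<hobj> threaded;             // per-thread result vector
    // optionally: threaded.reserve(...) with an estimate to cut reallocations further
    #pragma omp for nowait
    for (std::size_t i = 0; i < v1.size(); ++i)
        if (!v1[i]->get_IgnoreFlag() && !v1[i]->get_ErrorFlag())
            threaded.emplace_back(v1[i], length);
    #pragma omp for nowait
    for (std::size_t j = 0; j < v2.size(); ++j)
        if (!v2[j]->get_IgnoreFlag() && !v2[j]->get_ErrorFlag())
            threaded.emplace_back(v2[j], length);
    #pragma omp critical                    // merge, one thread at a time
    validobjs.insert(validobjs.end(), threaded.begin(), threaded.end());
}
With the outer reserve in place, the merge inside the critical section never reallocates, so the serialized part is only the element copies.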
Copying objects from one vector to another can be expensive, depending on the size of the objects you copy and on whether a custom copy constructor executes additional code. Consider storing only indices of the valid elements, or pointers to valid elements.
You could also try to replace elements in parallel in the resulting vector. That could be useful if default-constructing an element is cheap and copying is a bit expensive.
Filter the data in two threads as you do now.
Synchronise them and allocate a vector with enough elements:
validobjs.resize(v1.size() + v2.size());
Let each thread insert elements into independent parts of the vector. For example, thread one writes to indices 0 to x and thread two writes to indices x + 1 to validobjs.size() - 1.
Although I'm not sure if this is entirely legal or if it is undefined behaviour.
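For what it's worth, writing to distinct elements of the same std::vector from different threads is well-defined as long as no thread resizes it concurrently. Here is a minimal sketch of the resize-and-partition idea, again assuming the hobj type (now also default-constructible and move-assignable) and the v1/v2 vectors from the question; each part counts how many valid objects it wrote so the vector can be compacted afterwards (the three-argument std::move needs <algorithm>):
std::vector<hobj> validobjs(v1.size() + v2.size());   // default-constructed slots
std::size_t n1 = 0, n2 = 0;                           // valid objects written by each part
const int length = 70;
#pragma omp parallel sections
{
    #pragma omp section                               // first thread: v1 -> front part
    {
        for (std::size_t i = 0; i < v1.size(); ++i)
            if (!v1[i]->get_IgnoreFlag() && !v1[i]->get_ErrorFlag())
                validobjs[n1++] = hobj(v1[i], length);
    }
    #pragma omp section                               // second thread: v2 -> back part
    {
        for (std::size_t j = 0; j < v2.size(); ++j)
            if (!v2[j]->get_IgnoreFlag() && !v2[j]->get_ErrorFlag())
                validobjs[v1.size() + n2++] = hobj(v2[j], length);
    }
}
// compact: move the second block directly behind the first, then drop the unused slots
if (n1 < v1.size())
    std::move(validobjs.begin() + v1.size(),
              validobjs.begin() + v1.size() + n2,
              validobjs.begin() + n1);
validobjs.resize(n1 + n2);
Whether this beats the reserve-and-merge version depends on how cheap default-constructing an hobj is, as noted above.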
You could also think about using std::list (linked list). Concatenating linked lists or removing elements happens in constant time; however, adding elements is a bit slower than with a std::vector with reserved memory.
Those were my thoughts on this; I hope there was something useful in it.
IMHO, you copy each element twice: into threaded1/threaded2 and after that into validobjs. That can make your code slower. You could instead add elements to a single vector using synchronization.

OpenMP - store results in vector [duplicate]

This question already has answers here:
OpenMP multiple threads update same array
I want to parallelize a for loop with many iterations using OpenMP. The results should be stored in a vector.
for (int i=0; i<n; i++)
{
    // not every iteration produces a result
    if (condition)
    {
        results.push_back (result_value);
    }
}
This does not work properly with #pragma omp parallel for.
So what's the best practice to achieve that?
Is it somehow possible to use a separate results vector for each thread and then combine all result vectors at the end? The ordering of the results is not important.
Something like the following is not practical because it consumes too much space:
int *results = new int[n];
for (int i=0; i<n; i++)
{
    // not every iteration produces a result
    if (condition)
    {
        results[i] = result_value;
    }
}
// remove all unused slots in results array
Option 1: If each iteration takes a significant amount of time before adding the element to the vector, you can keep the push_back in a critical region:
#pragma omp parallel for
for (int i=0; i<n; i++)
{
    // not every iteration produces a result
    if (condition)
    {
        #pragma omp critical
        results.push_back (result_value);
    }
}
If threads are mostly busy with other things than the push_back, there will be little overhead from the critical region.
Option 2: If iterations are too cheap compared to the synchronization overhead, you can have each thread fill a thread-private vector and then merge them at the end.
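A minimal sketch of this approach, assuming n, condition, and result_value stand for the per-iteration work from the question (the order of results across threads is unspecified):
std::vector<int> results;
#pragma omp parallel
{
    std::vector<int> local_results;        // thread-private results
    #pragma omp for nowait                 // distribute the iterations
    for (int i = 0; i < n; i++)
    {
        // not every iteration produces a result
        if (condition)
            local_results.push_back(result_value);
    }
    #pragma omp critical                   // merge once per thread
    results.insert(results.end(), local_results.begin(), local_results.end());
}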
There is a good duplicate for this here and here.
The "naive" way:
You can init several vectors (call omp_get_max_threads() to get an upper bound on the number of threads the parallel region will use), then call omp_get_thread_num() inside the parallel region to know the current thread ID, and let each thread write into its own vector.
Then, outside the parallel region, merge the vectors together. This can be worth it or not, depending on how "heavy" your processing is compared to the time required to merge the vectors.
If you know the maximum final size of the vector, you can reserve it before processing (so that push_back calls won't resize the vector and you gain processing time) and then call push_back from inside a critical section (#pragma omp critical). But critical sections are horribly slow, so it's worth it only if the processing you do inside the loop is time-consuming. In your case, the "processing" looks to be only checking the if-clause, so it's probably not worth it.
Finally, this is a fairly well-known problem. You should read this for more detailed information:
C++ OpenMP Parallel For Loop - Alternatives to std::vector

Parallelize some nested for loops in OpenMP C++

My serial code for the convolution between a matrix and a kernel works like this:
int index1, index2, a, b;
for(int x=0;x<rows;++x){
    for(int y=0;y<columns;++y){
        for(int i=0;i<krows;++i){
            for(int j=0;j<kcolumns;++j){
                a=x+i-krows/2;
                b=y+j-kcolumns/2;
                if(a<0)
                    index1=rows+a;
                else if(a>rows-1)
                    index1=a-rows;
                else
                    index1=a;
                if(b<0)
                    index2=columns+b;
                else if(b>columns-1)
                    index2=b-columns;
                else
                    index2=b;
                output[x*columns+y]+=input[index1*columns+index2]*kernel[i*kcolumns+j];
            }
        }
    }
}
The convolution uses cyclic treatment for the borders. Now I want to parallelize the code with OpenMP. I thought about reducing the first two for loops to just one and using the syntax:
#pragma omp parallel
#pragma omp for private(x,y,a, b, index1, index2)
for(int z=0;z<rows*columns;z++){
x=z/columns;
y=z%columns;
...
I see that parallelizing like that reduces the CPU time, but I'm not a big expert in OpenMP, so I was asking myself if there are other, more efficient solutions. I don't think it is a good idea to also parallelize the other two nested for loops.
With an input matrix of dimensions 1000*10000 and a square kernel matrix 9*9 I obtain these times:
4823 ms for 1 thread
2696 ms for 2 threads
2513 ms for 4 threads.
I hope someone can give me some useful suggestions. What about the for reduction syntax?
My suggestion is to change approach altogether. If you are using cyclic treatment for the border (i.e. your problem is periodic), the fast way to do it is based on the FFT-based spectral approach:
- Fourier transform the matrix and the kernel
- compute the product
- inverse Fourier transform the product (you have the convolution)
This is (1) much more efficient (unless the dimensions of the kernel are much smaller than those of the matrix) and (2) you can use an FFT library that supports multithreading (like FFTW) and let it deal with the threading.
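To make the pipeline concrete, here is a minimal sketch using FFTW3 (assumed available; link with -lfftw3_threads -lfftw3 -lm). The kernel is zero-padded to the matrix size and circularly shifted so its centre sits at index (0,0), which reproduces the cyclic border treatment; depending on the convolution/correlation convention you want, you may additionally need to flip the kernel, so treat the index handling as illustrative rather than a drop-in replacement for the loop above:
#include <fftw3.h>
#include <cstring>

// Cyclic 2D convolution of input (rows x columns) with kernel (krows x kcolumns),
// written into output (rows x columns), using nthreads FFTW threads.
void fft_convolve(const double* input, const double* kernel, double* output,
                  int rows, int columns, int krows, int kcolumns, int nthreads)
{
    const int nc = columns / 2 + 1;               // width of the r2c output
    fftw_init_threads();
    fftw_plan_with_nthreads(nthreads);            // let FFTW parallelize the transforms

    double*       in   = fftw_alloc_real(rows * columns);
    double*       kpad = fftw_alloc_real(rows * columns);
    fftw_complex* F_in = fftw_alloc_complex(rows * nc);
    fftw_complex* F_k  = fftw_alloc_complex(rows * nc);

    std::memcpy(in, input, sizeof(double) * rows * columns);
    std::memset(kpad, 0, sizeof(double) * rows * columns);
    // zero-pad the kernel and centre it at (0,0) with cyclic wrap-around
    for (int i = 0; i < krows; ++i)
        for (int j = 0; j < kcolumns; ++j) {
            int r = (i - krows / 2 + rows) % rows;
            int c = (j - kcolumns / 2 + columns) % columns;
            kpad[r * columns + c] = kernel[i * kcolumns + j];
        }

    fftw_plan pf_in = fftw_plan_dft_r2c_2d(rows, columns, in,   F_in, FFTW_ESTIMATE);
    fftw_plan pf_k  = fftw_plan_dft_r2c_2d(rows, columns, kpad, F_k,  FFTW_ESTIMATE);
    fftw_execute(pf_in);
    fftw_execute(pf_k);

    // pointwise complex product, folding in FFTW's 1/(rows*columns) normalisation
    const double scale = 1.0 / (rows * columns);
    for (int k = 0; k < rows * nc; ++k) {
        const double re = F_in[k][0] * F_k[k][0] - F_in[k][1] * F_k[k][1];
        const double im = F_in[k][0] * F_k[k][1] + F_in[k][1] * F_k[k][0];
        F_in[k][0] = re * scale;
        F_in[k][1] = im * scale;
    }

    fftw_plan pb = fftw_plan_dft_c2r_2d(rows, columns, F_in, output, FFTW_ESTIMATE);
    fftw_execute(pb);                             // output now holds the cyclic convolution

    fftw_destroy_plan(pf_in);
    fftw_destroy_plan(pf_k);
    fftw_destroy_plan(pb);
    fftw_free(in); fftw_free(kpad); fftw_free(F_in); fftw_free(F_k);
}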
You don't need to change the for loops. You can make each thread iterate through all rows in a column or through all columns in a row. Also, bear in mind that if the number of threads is higher than the number of physical cores, the performance won't change much.
OpenMP already takes care of the number of threads that it should create, using the logical core count - which might be a problem on Intel i3 and i7, since they have hyperthreading and thus the performance gain per extra thread won't be big.
In summary, you can either:
#pragma omp parallel for private(a, b, index1, index2)
for(int x=0;x<rows;++x){
    for(int y=0;y<columns;++y){
        // ...
    }
}
Or:
for(int x=0;x<rows;++x){
    #pragma omp parallel for private(a, b, index1, index2)
    for(int y=0;y<columns;++y){
        // ...
    }
}
If you are using OpenMP 3.0 or greater you may exploit the collapse clause of the loop work-sharing construct:
The collapse clause may be used to specify how many loops are
associated with the loop construct. The parameter of the collapse
clause must be a constant positive integer expression. If no collapse
clause is present, the only loop that is associated with the loop
construct is the one that immediately follows the loop directive
This means that you may write the following:
#pragma omp parallel for collapse(2)
for(int x=0;x<rows;++x){
    for(int y=0;y<columns;++y){
        /* Work here */
    }
}
and obtain exactly the same result as your linearized loop:
#pragma omp parallel for
for(int z=0;z<rows*columns;z++){
    x=z/columns;
    y=z%columns;
    /* Work here */
}
As you may see, with the collapse clause no modification is needed to your serial code, and you may easily experiment with further loop collapsing by changing the positive number in the clause.

Converting for loop to OpenMP

So I have a loop where I iterate over elements of a vector, call a function on each element, and if it meets a certain criterion, I push it onto a list.
my_list li;
for (auto itr = Obj.begin(); itr != Obj.end(); ++itr) {
    if ((*itr).function_call())
        li.push_back(*itr);
}
I've been thinking of ways to optimize my program, and I came across OpenMP, but a lot of the sample code is hard to follow.
Could someone walk me through how to convert the above loop to utilize multiple cores in parallel?
Thanks.
There are a few points you need to take care of to parallelize that code snippet:
If you're using OpenMP 3.0 (or above) you can parallelize your iterator-based for loop with #pragma omp for; if you're using an older version of OpenMP, you need to use a for loop that accesses the vector by index.
You need to guard the li.push_back(*itr); statement with a lock or put it in a critical section.
If function_call is not a really slow function or your vector does not contain that many items, it may not be worth parallelizing, as thread creation will introduce overhead.
So a pseudo-code implementation would be
my_list li;
#pragma omp parallel for
for (auto itr = Obj.begin(); itr != Obj.end(); ++itr) {
    if ((*itr).function_call())
    {
        #pragma omp critical (CRIT_1)
        {
            li.push_back(*itr);
        }
    }
}
The time has come to discuss efficient ways to use container classes such as std::list or std::vector with OpenMP (since the OP wants to optimize his code using lists with OpenMP). Let me list four ways in increasing level of efficiency.
Fill the container in a parallel section in a critical block
Make private versions of the container for each thread, fill them in parallel, and then merge them in a critical section
Don't use STL containers. STL was not designed with efficiency in mind. Instead either write your own or use something like Agner Fog's containers, which are designed for efficiency. For example, instead of using a heap for memory allocation they use a memory pool.
In some special cases it's possible to merge the private versions of the containers in parallel as well.
Example code for the first case is given in the accepted answer. This defeats most of the purpose of using threaded code, since each iteration fills the container in a critical section.
Example code for the second case can be found at C++ OpenMP Parallel For Loop - Alternatives to std::vector. Rather than re-post the code here, let me give an example for the third case using Agner Fog's container classes:
DynamicArray<int> vec;
#pragma omp parallel
{
    DynamicArray<int> vec_private;
    #pragma omp for nowait //fill vec_private in parallel
    for(int i=0; i<100; i++) {
        vec_private.Push(i);
    }
    //merging here is probably not optimal
    //DynamicArray needs an append function
    //vec should reserve a size equal to the sum of the sizes of each vec_private
    //then use memcpy to append vec_private into vec in a critical section
    #pragma omp critical
    {
        for(int i=0; i<vec_private.GetNum(); i++) {
            vec.Push(vec_private[i]);
        }
    }
}
Finally, in special cases, for example with histograms (probably the most common data structure in experimental particle physics), it's possible to merge the private arrays in parallel as well. For histograms this is equivalent to an array reduction. This is a bit tricky. An example showing how to do this can be found at Fill histograms (array reduction) in parallel with OpenMP without using a critical section

OpenMP Performance impact: private directive vs. declaring variable inside for construct

Performance-wise, which of the following is more efficient?
Assigning in the master thread and copying the value to all threads:
int i = 0;
#pragma omp parallel for firstprivate(i)
for( ; i < n; i++){
...
}
Declaring and assigning the variable in each thread
#pragma omp parallel for
for(int i = 0; i < n; i++){
...
}
Declaring the variable in the master thread but assigning it in each thread.
int i;
#pragma omp parallel for private(i)
for(i = 0; i < n; i++){
...
}
It may seem a silly question and/or the performance impact may be negligible. But I'm parallelizing a loop that does a small amount of computation and is called a large number of times, so any optimization I can squeeze out of this loop is helpful.
I'm looking for a more low-level explanation of how OpenMP handles this.
For example, when parallelizing for a large number of threads, I assume the second implementation would be more efficient, since initializing a variable (e.g. zeroing a register with xor) is far more efficient than copying the variable to all the threads.
There is not much of a difference in terms of performance among the 3 versions you presented, since each one of them is using #pragma omp parallel for. Hence, OpenMP will automatically assign each for iteration to different threads. Thus, the variable i will become private to each thread, and each thread will have a different range of for iterations to work with. The variable i is automatically made private in order to avoid race conditions when updating it. Since the variable i will be private in the parallel for anyway, there is no need to put private(i) on the #pragma omp parallel for.
Nevertheless, your first version will produce an error, since OpenMP expects the loop right underneath #pragma omp parallel for to have the following format:
for(init-expr; test-expr; incr-expr)
in order to precompute the range of work.
The for directive places restrictions on the structure of all
associated for-loops. Specifically, all associated for-loops must
have the following canonical form:
for (init-expr; test-expr; incr-expr) structured-block (OpenMP Application Program Interface, pp. 39-40).
Edit: I tested your last two versions and inspected the generated assembly. Both versions produce the same assembly, as you can see: version 2 and version 3.