OpenMP: parallelize std::map iteration - C++

There are some posts about this issue, but none of them satisfies me.
I don't have OpenMP 3.0 support and I need to parallelize an iteration over a map. I want to know whether this solution would work or not:
auto element = myMap.begin();
#pragma omp parallel for shared(element)
for(int i = 0; i < myMap.size(); ++i) {
    MyKeyObject* current_first = nullptr;
    MyValueObject* current_second = nullptr;
    #pragma omp critical
    {
        current_first = element->first;
        current_second = element->second;
        ++element;
    }
    // Here I can use 'current' as in a usual loop
}
So I am using the for loop just to make sure each thread handles roughly the same number of map elements. Is that a correct approach, or would it fail?
PS: I am working in Visual Studio 2012, so if you have a hint about how to make my compiler support OpenMP 3.0, that would also solve my problem.

This is not a direct answer to your question, but I will try to save you some of the future bad "OpenMP with Visual Studio" experience.
The Microsoft C/C++ Compiler only supports OpenMP 2.0. There is no way to make it support OpenMP 3.0 or higher since OpenMP is built into the compiler core and is not an add-on package (unless someone comes up with an external source-to-source transformation engine) and Microsoft seems not to be interested in providing further OpenMP support while pushing their own solutions (see below). You should therefore either get the Intel C/C++ Compiler that integrates with Visual Studio or a standalone compiler like GCC or the PGI C/C++ compiler.
If you are developing specifically for Windows, then you might want to abandon OpenMP and use the Concurrency Runtime and specifically PPL instead. PPL comes with Visual Studio 2012 and newer and provides data- and task-parallel equivalents to some of the algorithms in STL. What you are interested in is concurrency::parallel_for_each(), which is the parallel version of std::for_each(). It works with forward iterators, although not as efficiently as with random-access iterators. But you have to make sure that processing one element of the map takes at least a thousand instructions, otherwise the parallelisation won't be beneficial.
If you aim for cross-platform compatibility, then Intel Threading Building Blocks (Intel TBB for short) is the alternative to PPL. It provides the tbb::parallel_do() algorithm, which is specifically designed to work with forward iterators. The same warning about the amount of work per map element applies.

Your method will work, since you access and iterate the shared object element in a critical section. Whether or not this is good for performance you will have to test. Here is an alternative method you may want to consider. Let me call this the "fast-forward" method.
Let's assume you want to do this in parallel
for(auto element = myMap.begin(); element != myMap.end(); ++element) {
    foo(element->first, element->second);
}
You can do this with OpenMP 2.0
#pragma omp parallel
{
    size_t cnt = 0;
    int ithread = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    for(auto element = myMap.begin(); element != myMap.end(); ++element, cnt++) {
        if(cnt % nthreads != ithread) continue;
        foo(element->first, element->second);
    }
}
Every thread runs through myMap.size() iterators. However, each thread only calls foo for myMap.size()/nthreads of them. Your method only runs through myMap.size()/nthreads iterators, but it requires a critical section on every iteration.
The fast-forward method is efficient as long as the time to "fast-forward" through nthreads iterators is much less than the time for foo, i.e.:
nthreads*time(++element) << time(foo)
If, however, the time for foo is on the order of the time to iterate, and foo is reading/writing memory, then foo is likely memory-bandwidth bound and won't scale with the number of threads anyway.

Your approach will not work - because of a mix of a conceptual problem and a few bugs.
[bug] you will always miss the first element, since the first thing that you do is increment the element iterator.
[bug] all threads will iterate over the whole map, since the element iterator is not shared. BTW, it's not clear what the shared variable 'part' is in your code.
If you make element shared, then the code that is accessing it (outside of the critical section) will see whatever it is currently pointing to, regardless of the thread. You will end up processing some elements more than once and some - not at all.
There is no easy way to parallelize access to a map using an iterator, since the map iterator is not random-access. You may want to split the keys up manually and then use different parts of the key set on different threads.

Related

Can I use OpenMP parallel for with an object-oriented loop iterator that I want to have saved?

I am working with a simulation that consists of cells: objects that interact with each other and evolve over time. I am looking at parallelizing my code using the OpenMP parallel for directive, and I want my cells (objects) to be updated in parallel at each timestep; below you can see part of the code.
#pragma omp parallel for default(none) shared(cells, maxCellReach, parameters, cellCellJunctions, timestepDistance) firstprivate(timestep), reduction(+ : centerOfMass)
for (auto it = cells.begin(); it != cells.end(); it++) {
    maxCellReach = std::max(it->getNucleus()->getCellReach(), maxCellReach);
    it->ApplyIntercellularForce(cellCellJunctions, Constants::junctionCreateDistance, parameters);
    it->step(timestep, parameters);
    centerOfMass += it->GetCenterOfMass();
}
I have read that the loop iterator variable of an OpenMP parallel for loop is always private. I, however, want all the cells to be updated together. Does this code achieve that, or do I need a different approach with OpenMP?
I looked around online, but could not find anything on making loop iterators shared variables or something similar in a for loop.
To be able to use OpenMP on a for loop, it has to fulfill the "Canonical Loop Form" as specified by the standard. For example it is described in the OpenMP 4.5 Specification section 2.6 (starting on page 53) and in the OpenMP 5.0 Specification Section 2.9.1 (starting on page 95).
Both standards specify the loop form to be
for (init-expr; test-expr; incr-expr) structured-block
with differing restrictions on init-expr, test-expr and inc-expr.
While 4.5 already allows random-access-iterator-type to be used here specifically from C++, it doesn't allow for != in the test-expr. This was fixed by the 5.0 standard which is almost completely implemented by the newest versions of gcc and clang at the time of writing this answer. For compatibility with older versions of these compilers you might have to use < instead if possible. For more information on which versions of which compiler implements which standard, see here.
If your iterators are not allowing random access, things get more complicated. See for example the tasking code in this answer.
On page 99 the 5.0 standard even allows for range-based for loops from C++11, so you could write your loop even more elegantly.
As of OpenMP 5.2 there is still at least one special case (#pragma omp simd) which does not work on non-pointer random access iterators.

Multithreaded function performance worse than single threaded

I wrote an update() function which ran on a single thread; then I wrote the function updateMP() below, which does the same thing except that I divide the work in my two for loops amongst some threads:
void GameOfLife::updateMP()
{
    std::vector<Cell> toDie;
    std::vector<Cell> toLive;

    #pragma omp parallel
    {
        // private, per-thread variables
        std::vector<Cell> myToDie;
        std::vector<Cell> myToLive;

        #pragma omp for
        for (int i = 0; i < aliveCells.size(); i++) {
            auto it = aliveCells.begin();
            std::advance(it, i);

            int liveCount = aliveCellNeighbors[*it];
            if (liveCount < 2 || liveCount > 3) {
                myToDie.push_back(*it);
            }
        }

        #pragma omp for
        for (int i = 0; i < aliveCellNeighbors.size(); i++) {
            auto it = aliveCellNeighbors.begin();
            std::advance(it, i);

            if (aliveCells.find(it->first) != aliveCells.end()) // is this cell alive?
                continue; // if so, skip because we already updated aliveCells

            if (aliveCellNeighbors[it->first] == 3) {
                myToLive.push_back(it->first);
            }
        }

        #pragma omp critical
        {
            toDie.insert(toDie.end(), myToDie.begin(), myToDie.end());
            toLive.insert(toLive.end(), myToLive.begin(), myToLive.end());
        }
    }

    for (const Cell& deadCell : toDie) {
        setDead(deadCell);
    }
    for (const Cell& liveCell : toLive) {
        setAlive(liveCell);
    }
}
I noticed that it performs worse than the single threaded update() and seems like it's getting slower over time.
I think I might be doing something wrong by having two uses of omp for? I am new to OpenMP so I am still figuring out how to use it.
Why am I getting worse performance with my multithreaded implementation?
EDIT: Full source here: https://github.com/k-vekos/GameOfLife/tree/hashing?files=1
Why am I getting worse performance with my multithreaded implementation?
Classic question :)
You loop through only the alive cells. That's actually pretty interesting. A naive implementation of Conway's Game of Life would look at every cell. Your version is optimized for the case where alive cells are far fewer than dead ones, which I think is common later in the game. I can't tell from your excerpt, but I assume it trades off by possibly doing redundant work when the ratio of alive to dead cells is higher.
A caveat of omp parallel is that there's no guarantee that the threads won't be created/destroyed during entry/exit of the parallel section. It's implementation dependent. I can't seem to find any information on MSVC's implementation. If anyone knows, please weigh in.
So that means that your threads could be created/destroyed every update loop, which is heavy overhead. For this to be worth it, the amount of work should be orders of magnitude more expensive than the overhead.
You can profile/measure the code to determine overhead and work time. It should also help you see where the real bottlenecks are.
Visual Studio has a profiler with a nice GUI. You need to compile your code with release optimizations but also include the debug symbols (those are excluded in the default release configuration). I haven't looked into how to set this up manually because I usually use CMake and it sets up a RelWithDebInfo configuration automatically.
Use high_resolution_clock to time sections that are hard to measure with the profiler.
If you can use C++17, it has a standard parallel for_each (the ExecutionPolicy overload). Most of the standard algorithms have one. https://en.cppreference.com/w/cpp/algorithm/for_each. They're so new that I barely know anything about them (they might also have the same issues as OpenMP).
seems like it's getting slower over time.
Could it be that you're not cleaning up one of your vectors?
First off, if you want any kind of performance, you must do as little work as possible in a critical section. I'd start by changing the following:
std::vector<Cell> toDie;
std::vector<Cell> toLive;
to
std::vector<std::vector<Cell>> toDie;
std::vector<std::vector<Cell>> toLive;
Then, in your critical section, you can just do:
toDie.push_back(std::move(myToDie));
toLive.push_back(std::move(myToLive));
Arguably, a vector of vectors isn't cute, but this prevents deep copies inside the CS, which would be unnecessary time consumption.
[Update]
IMHO there's no point in using multithreading if you are using non-contiguous data structures, at least not in that way. The fact is you'll spend most of your time waiting on cache misses, because that's what associative containers do, and little time doing the actual work.
I don't know how this game works. It feels like if I had to do something with numerous updates and renders, I would have the updates done as quickly as possible on the 'main' thread, and I would have another (detached) thread for the renderer. You could then just give the renderer the results after every "update", and perform another update while it's rendering.
Also, I'm definitely not an expert in hashing, but hash<int>()(k.x * 3 + k.y * 5) seems like a high-collision hashing. You can certainly try something else like what's proposed here

Parallel tasks get better performances with boost::thread than with ppl or OpenMP

I have a C++ program which could be parallelized. I'm using Visual Studio 2010, 32-bit compilation.
In short the structure of the program is the following
#define num_iterations 64 // some number

struct result
{
    // some stuff
};

result best_result = initial_bad_result;

for(i = 0; i < many_times; i++)
{
    result *results[num_iterations];
    for(j = 0; j < num_iterations; j++)
    {
        some_computations(results + j);
    }
    // update best_result;
}
Since each some_computations() is independent (some global variables are read, but none are modified), I parallelized the inner for loop.
My first attempt was with boost::thread,
thread_group group;
for(j = 0; j < num_iterations; j++)
{
    group.create_thread(boost::bind(&some_computation, this, results + j));
}
group.join_all();
The results were good, but I decided to try more.
I tried the OpenMP library
#pragma omp parallel for
for(j = 0; j < num_iterations; j++)
{
    some_computations(results + j);
}
The results were worse than the boost::thread's ones.
Then I tried the ppl library and used parallel_for():
Concurrency::parallel_for(0, num_iterations, [=](int j) {
    some_computations(results + j);
});
The results were the worst.
I found this behaviour quite surprising. Since OpenMP and ppl are designed for parallelization, I would have expected better results than boost::thread. Am I wrong?
Why is boost::thread giving me better results?
Neither OpenMP nor PPL is being pessimistic. They just do as they are told; however, there are some things you should take into consideration when you try to parallelize loops.
Without seeing how you implemented these things, it's hard to say what the real cause may be.
Also, if the operations in each iteration have some dependency on any other iterations of the same loop, then this will create contention, which will slow things down. You haven't shown what your some_computations function actually does, so it's hard to tell whether there are data dependencies.
A loop that can be truly parallelized has to be able to have each iteration run totally independent of all other iterations, with no shared memory being accessed in any of the iterations. So preferably, you'd write stuff to local variables and then copy at the end.
Not all loops can be parallelized, it is very dependent on the type of work being done.
For example, something that is good for parallelizing is work being done on each pixel of a screen buffer. Each pixel is totally independent from all other pixels, and therefore, a thread can take one iteration of a loop and do the work without needing to be held up waiting for shared memory or data dependencies within the loop between iterations.
Also, if you have a contiguous array, this array may be partly in a cache line, and if you are editing element 5 in thread A and then changing element 6 in thread B, you may get cache contention, which will also slow down things, as these would be residing in the same cache line. A phenomenon known as false sharing.
There are many aspects to think about when parallelizing a loop.
In short, OpenMP is mainly based on shared memory, with the additional cost of task management and memory management. PPL is designed to handle generic patterns over common data structures and algorithms, and it brings additional complexity cost. Both of them have extra CPU cost, but your plain boost threads do not (boost threads are just a thin API wrapper). That's why both of them are slower than your boost version. And since the example computations are independent of each other, with no synchronization, OpenMP should be close to the boost version.
That holds in simple scenarios, but for complicated scenarios, with complicated data layouts and algorithms, the outcome is context dependent.

Very Basic for loop using TBB

I am a very new programmer, and I have some trouble with the examples from Intel. I think it would be helpful if I could see how the most basic possible loop is implemented in tbb.
for (n = 0; n < songinfo.frames; ++n) {
    sli[n] = songin[n*2];
    sri[n] = songin[n*2+1];
}
Here is a loop I am using to de-interleave audio data. Would this loop benefit from tbb? How would you implement it?
First of all, for the following code I assume your arrays are of type mytype*; otherwise the code needs some modifications. Furthermore, I assume that your ranges don't overlap; otherwise parallelization attempts won't work correctly (at least not without more work).
Since you asked for it in tbb:
First you need to initialize the library somewhere (typically in your main). For the code assume I put a using namespace tbb somewhere.
int main(int argc, char *argv[]) {
    task_scheduler_init init;
    ...
}
Then you will need a functor which captures your arrays and executes the body of the for loop:
struct apply_func {
    const mytype* songin; // whatever type you are operating on
    mytype* sli;
    mytype* sri;

    apply_func(const mytype* sin, mytype* sl, mytype* sr)
        : songin(sin), sli(sl), sri(sr) {}

    void operator()(const blocked_range<size_t>& range) const {
        for(size_t n = range.begin(); n != range.end(); ++n) {
            sli[n] = songin[n*2];
            sri[n] = songin[n*2+1];
        }
    }
};
Now you can use parallel_for to parallelize this loop:
size_t grainsize = 1000; //or whatever you decide on (testing required for best performance);
apply_func func(songin, sli, sri);
parallel_for(blocked_range<size_t>(0, songinfo.frames, grainsize), func);
That should do it (if I remember correctly; I haven't looked at tbb in a while, so there might be small mistakes).
If you use c++11, you can simplify the code by using lambda:
size_t grainsize = 1000; // or whatever you decide on (testing required for best performance)
parallel_for(blocked_range<size_t>(0, songinfo.frames, grainsize),
    [&](const blocked_range<size_t>& range) {
        for(size_t n = range.begin(); n != range.end(); ++n) {
            sli[n] = songin[n*2];
            sri[n] = songin[n*2+1];
        }
    });
That being said, tbb is not exactly what I would recommend for a new programmer. I would really suggest parallelizing only code which is trivial to parallelize until you have a very firm grip on threading. For this I would suggest using openmp, which is quite a bit simpler to start with than tbb while still being powerful enough to parallelize a lot of stuff (depending on the compiler supporting it, though). For your loop it would look like the following:
#pragma omp parallel for
for(size_t n = 0; n < songinfo.frames; ++n) {
    sli[n] = songin[n*2];
    sri[n] = songin[n*2+1];
}
Then you have to tell your compiler to compile and link with openmp (-fopenmp for gcc, /openmp for visual c++). As you can see, it is quite a bit simpler to use than tbb (for such easy use cases; more complex scenarios are a different matter), and it has the added benefit of still working on platforms which don't support openmp or tbb (since unknown #pragmas are ignored by the compiler). Personally I'm using openmp in favor of tbb for some projects, since I couldn't use its open source license and buying tbb was a bit too steep for those projects.
Now that we have the how-to out of the way, let's get to the question of whether it's worth it. This is a question which really can't be answered easily, since it completely depends on how many elements you process and what kind of platform your program is expected to run on. Your problem is very bandwidth-heavy, so I wouldn't count on too much of an increase in performance.
If you are only processing 1000 elements, the parallel version of the loop is very likely to be slower than the single-threaded version due to overhead.
If your data is not in the cache (because it doesn't fit) and your system is very bandwidth-starved, you might not see much of a benefit (although it's likely that you will see some benefit; just don't be surprised if it's on the order of 1.X even if you use a lot of processors).
If your system is ccNUMA (likely for multi-socket systems), your performance might decrease regardless of the number of elements, due to additional transfer costs.
The compiler might miss optimizations regarding pointer aliasing (since the loop body is moved to a different function). Using __restrict (for gcc, no clue for vs) might help with that problem.
...
Personally I think the situation where you are most likely to see a significant performance increase is if your system has a single multi-core CPU for which the dataset fits into the L3 cache (but not the individual L2 caches). For bigger datasets your performance will probably increase, but not by much (and correctly using prefetching might get similar gains). Of course this is pure speculation.

Concurrency and optimization using OpenMP

I'm learning OpenMP. To do so, I'm trying to make an existing code parallel. But I seem to get a worse time when using OpenMP than when I don't.
My inner loop:
#pragma omp parallel for
for(unsigned long j = 0; j < c_numberOfElements; ++j)
{
    //int th_id = omp_get_thread_num();
    //printf("thread %d, j = %d\n", th_id, (int)j);

    Point3D current;
    #pragma omp critical
    {
        current = _points[j];
    }

    Point3D next = getNext(current);
    if (!hasConstraint(next))
    {
        continue;
    }

    #pragma omp critical
    {
        _points[j] = next;
    }
}
_points is a pointMap_t, defined as:
typedef boost::unordered_map<unsigned long, Point3D> pointMap_t;
Without OpenMP my running time is 44.904s. With OpenMP enabled, on a computer with two cores, it is 64.224s. What am I doing wrong?
Why have you wrapped your reads and writes to _points[j] in critical sections? I'm not much of a C++ programmer, but it doesn't look to me as if you need those sections at all. As you've written it (unnamed critical sections), each thread is going to wait while the other goes through each of the sections. This could easily make the program slower.
It seems possible that the lookup and write to _points in critical sections is dragging down the performance when you use OpenMP. Single-threaded, this will not result in any contention.
Sharing seed data like this seems counterproductive in a parallel programming context. Can you restructure to avoid these contention points?
You need to show the rest of the code. From a comment to another answer, it seems you are using a map. That is really a bad idea, especially if you are mapping 0..n numbers to values: why don't you use an array?
If you really need to use containers, consider using the ones from the Intel's Thread Building Blocks library.
I agree that it would be best to see some working code.
The ultimate issue here is that there are criticals within a parallel region, and criticals are (a) enormously expensive in and of themselves, and (b) by definition, kill parallelism. The assignment to current certainly doesn't need to be inside a critical, as it is private; I wouldn't have thought the _points[j] assignment would be, either, but I don't know what the map stuff does, so there you go.
But you have a loop in which you have a huge amount of overhead, which grows linearly in the number of threads (the two critical regions) in order to do a tiny amount of actual work (walk along a linked list, it looks like). That's never going to be a good trade...