I have a relatively simple for loop that I would like to parallelize with OpenMP. For some reason, the parallel version is slower than the serial version. Here is the function and loop in question:
template<typename T, class UnaryOperation>
void State:apply_diagmatrix(const std::vector<T>& matrix, const UnaryOperation& transformation)
{
#pragma omp parallel for
for (uint64_t i=0; i<dim_; ++i) {
state_[i] *= transformation(matrix[i]);
}
}
where state_ is std::vector<std::complex<double>>. This function is called by another function where the UnaryOperation transformation is defined as
auto exp_func = [dt](double matrix_term) {
return std::complex<double>{std::cos(dt*matrix_term), -std::sin(dt*matrix_term)};
};
However, running the program with this parallel section is actually slower than running the serial version (without the #pragma omp directive). This is the case even when dim_ > 4096, where I would expect the overhead to be much smaller compared to the benefit of parallelizing 4096 executions of exp_func.
Of course, the entire program does much more than just this loop. Nevertheless, from profiling my code I know that the apply_diagmatrix function has significant relative cost. Hence, I would expect measurable benefit after parallelizing it.
I also tried to benchmark the parallel vs the serial version of just the loop in question. In that case, there is a clear benefit of the parallel version once dim_ > 512. Why am I seeing this benefit only when running the loop in isolation. What is the difference when the loop is part of a larger code? Any ideas?
Edit: From the comments it seems clear that the problem is not obvious without some more context. I will go back and look whether I can extract some min. reproducible example without the need to share the entire code.
Related
I wrote an update() function which ran on a single thread, then I wrote the below function updateMP() which does the same thing except I divide the work in my two for loops here amongst some threads:
void GameOfLife::updateMP()
{
std::vector<Cell> toDie;
std::vector<Cell> toLive;
#pragma omp parallel
{
// private, per-thread variables
std::vector<Cell> myToDie;
std::vector<Cell> myToLive;
#pragma omp for
for (int i = 0; i < aliveCells.size(); i++) {
auto it = aliveCells.begin();
std::advance(it, i);
int liveCount = aliveCellNeighbors[*it];
if (liveCount < 2 || liveCount > 3) {
myToDie.push_back(*it);
}
}
#pragma omp for
for (int i = 0; i < aliveCellNeighbors.size(); i++) {
auto it = aliveCellNeighbors.begin();
std::advance(it, i);
if (aliveCells.find(it->first) != aliveCells.end()) // is this cell alive?
continue; // if so skip because we already updated aliveCells
if (aliveCellNeighbors[it->first] == 3) {
myToLive.push_back(it->first);
}
}
#pragma omp critical
{
toDie.insert(toDie.end(), myToDie.begin(), myToDie.end());
toLive.insert(toLive.end(), myToLive.begin(), myToLive.end());
}
}
for (const Cell& deadCell : toDie) {
setDead(deadCell);
}
for (const Cell& liveCell : toLive) {
setAlive(liveCell);
}
}
I noticed that it performs worse than the single threaded update() and seems like it's getting slower over time.
I think I might be doing something wrong by having two uses of omp for? I am new to OpenMP so I am still figuring out how to use it.
Why am I getting worse performance with my multithreaded implementation?
EDIT: Full source here: https://github.com/k-vekos/GameOfLife/tree/hashing?files=1
Why am I getting worse performance with my multithreaded implementation?
Classic question :)
You loop through only the alive cells. That's actually pretty interesting. A naive implementation of Conway's Game of Life would look at every cell. Your version optimizes for a fewer number of alive than dead cells, which I think is common later in the game. I can't tell from your excerpt, but I assume it trades off by possibly doing redundant work when the ratio of alive to dead cells is higher.
A caveat of omp parallel is that there's no guarantee that the threads won't be created/destroyed during entry/exit of the parallel section. It's implementation dependent. I can't seem to find any information on MSVC's implementation. If anyone knows, please weight in.
So that means that your threads could be created/destroyed every update loop, which is heavy overhead. For this to be worth it, the amount of work should be orders of magnitude more expensive than the overhead.
You can profile/measure the code to determine overhead and work time. It should also help you see where the real bottlenecks are.
Visual Studio has a profiler with a nice GUI. You need to compile your code with release optimizations but also include the debug symbols (those are excluded in the default release configuration). I haven't looked into how to set this up manually because I usually use CMake and it sets up a RelWithDebInfo configuration automatically.
Use high_resolution_clock to time sections that are hard to measure with the profiler.
If you can use C++17, it has a standard parallel for_each (the ExecutionPolicy overload). Most of the algorithmy standards functions do. https://en.cppreference.com/w/cpp/algorithm/for_each. They're so new that I barely know anything about them (they might also have the same issues as OpenMP).
seems like it's getting slower over time.
Could it be that you're not cleaning up one of your vectors?
First off, if you want any kind of performance, you must do as little work as possible in a critical section. I'd start by changing the following:
std::vector<Cell> toDie;
std::vector<Cell> toLive;
to
std::vector<std::vector<Cell>> toDie;
std::vector<std::vector<Cell>> toLive;
Then, in your critical section, you can just do:
toDie.push_back(std::move(myToDie));
toLive.push_back(std::move(myToLive));
Arguably, a vector of vector isn't cute, but this will prevent deep-copying inside the CS which is unnecessary time consumption.
[Update]
IMHO there's no point in using multithreading if you are using non-contiguous data structures, at least not in that way. The fact is you'll spend most of your time waiting on cache misses because that's what associative containers do, and little doing the actual work.
I don't know how this game works. It feels like if I had to do something with numerous updates and render, I would have the updates done as quickly as possible on the 'main' thread', and I would have another (detached) thread for the renderer. You could then just give the renderer the results after every "update", and perform another update while it's rendering.
Also, I'm definitely not an expert in hashing, but hash<int>()(k.x * 3 + k.y * 5) seems like a high-collision hashing. You can certainly try something else like what's proposed here
I'm using the following code, which contains an OpenMP parallel for loop nested in another for-loop. Somehow the performance of this code is 4 Times slower than the sequential version (omitting #pragma omp parallel for).
Is it possible that OpenMp has to create Threads every time the method is called? In my test it is called 10000 times directly after each other.
I heard that sometimes OpenMP will keep the threads spinning. I also tried setting OMP_WAIT_POLICY=active and GOMP_SPINCOUNT=INFINITE. When I remove the openMP pragmas, the code is about 10 times faster. Note that the method containing this code will be called 10000 times.
for (round k = 1; k < processor.max; ++k) {
initialise_round(k);
for (std::vector<int> bucket : color_buckets) {
#pragma omp parallel for schedule (dynamic)
for (int i = 0; i < bucket.size(); ++i) {
if (processor.mark.is_marked_item(bucket[i])) {
processor.process(k, bucket[i]);
}
}
processor.finish_round(k);
}
}
You say that your sequential code is much faster so this makes me think that your processor.process function has too few instructions and duration. This leads to the case where passing the data to each thread does not pay off (the data exchange overhead is simply larger than the actual computation on that thread).
Other than that, I think that parallelizing the middle loop won't affect the algorithm but increase the amount of work per thread/
I think you are creating a team of threads on each iteration of the loop... (although I'm not sure what for alone does - I thought it should be parallel for). In this case, it would probably be better to separate the parallel from the for so the work of forking and creating the threads is done just once rather than being repeated in the other loops. So you could try to put a parallel pragma before your outermost loop so the overhead of forking and thread creation is just done once.
The actual problem was not related to OpenMP directly.
As the system has two CPUs, half of the threads where spawned on one and the other half on the other CPU. Therefore there was not shared L3 Cache. This lead in combination that the algorithm doesn't scale well to a performance decrease especially when using 2-4 Threads.
The solution was to use thread pinning for example via the linux tool: taskset
I'm running an algorithm at the moment that is very heavy but extremely parallel.
I've been looking at ways to speed it up and I've noticed that the slowest operation I have is my VecAdd function (It gets called thousands of times on a 6000 or so wide vector).
It is implemented as follows:
bool VecAdd( float* pOut, const float* pIn1, const float* pIn2, unsigned int num )
{
for( int idx = 0; idx < num; idx++ )
{
pOut[idx] = pIn1[idx] + pIn2[idx];
}
return true;
}
Its a very simple loop but all the additions can be performed in parallel. My first optimisation option is to move over to using SIMD as I can easily get a near 4 times speed up doing this.
However I'm also interested in the possibility of using OpenMP and having it automatically thread the for loop (potentially giving me a further 4x speedup for a total of 16x with SIMD).
However it really runs slowly. With the loop straight it takes around 3.2 seconds to process my example data. If I insert
#pragma omp parallel for
before the for loop I was assuming it would farm out several blocks of additions to other threads.
Unfortunately the result is that it takes ~7 seconds to process my example data.
Now I understand that a lot of my problem here will be caused by overheads with setting up threads and so forth but I'm still surprised just how much slower it makes things run.
Is it possible to speed this up by somehow setting up the thread pool in advance or will I never be able to combat these overheads?
Any thoughts on advice as to whether I can thread this nicely with OpenMP will be much appreciated!
Your loop should parallelize fine with the #pragma omp parallel for.
However, I think the problem is that you shouldn't parallelize at that level. You said that the function gets called thousands of times, but only operates on 6000 floats. Parallelize at the higher level, so that each thread is responsible for thounsands/4 calls to VecAdd. Right now you have this algorithm:
List item
serial execution
(re) start threads
do short computation
synchronize threads (at the end of the for loop)
back to serial code
Change it so that it's parallel at the highest possible level.
Memory bandwidth of course matters, but there is no way it would result in slower than serial execution.
I am a very new programmer, and I have some trouble with the examples from intel. I think it would be helpful if I could see how the most basic possible loop is implemented in tbb.
for (n=0 ; n < songinfo.frames; ++n) {
sli[n]=songin[n*2];
sri[n]=songin[n*2+1];
}
Here is a loop I am using to de-interleave audio data. Would this loop benefit from tbb? How would you implement it?
First of all for the following code I assume your arrays are of type mytype*, otherwise the code need some modifications. Furthermore I assume that your ranges don't overlap, otherwise parallelization attemps won't work correctly (at least not without more work)
Since you asked for it in tbb:
First you need to initialize the library somewhere (typically in your main). For the code assume I put a using namespace tbb somewhere.
int main(int argc, char *argv[]){
task_scheduler_init init;
...
}
Then you will need a functor which captures your arrays and executes the body of the forloop:
struct apply_func {
const mytype* songin; //whatever type you are operating on
mytype* sli;
mytype* sri;
apply_func(const mytype* sin, mytype* sl, mytype* sr):songin(sin), sli(sl), sri(sr)
{}
void operator()(const blocked_range<size_t>& range) {
for(size_t n = range.begin(); n !=range.end(); ++n){
sli[n]=songin[n*2];
sri[n]=songin[n*2+1];
}
}
}
Now you can use parallel_for to parallelize this loop:
size_t grainsize = 1000; //or whatever you decide on (testing required for best performance);
apply_func func(songin, sli, sri);
parallel_for(blocked_range<size_t>(0, songinfo.frames, grainsize), func);
That should do it (if I remember correctly haven't looked at tbb in a while, so there might be small mistakes).
If you use c++11, you can simplify the code by using lambda:
size_t grainsize = 1000; //or whatever you decide on (testing required for best performance);
parallel_for(blocked_range<size_t>(0, songinfo.frames, grainsize),
[&](const blocked_range<size_t>&){
for(size_t n = range.begin(); n !=range.end(); ++n){
sli[n]=songin[n*2];
sri[n]=songin[n*2+1];
}
});
That being said tbb is not exactly what I would recommend for a new programmer. I would really suggest parallelizing only code which is trivial to parallelize until you have a very firm grip on threading. For this I would suggest using openmp which is quiet a bit simpler to start with then tbb, while still being powerfull enough to parallelize a lot of stuff (Depends on the compiler supporting it,though). For your loop it would look like the following:
#pragma omp prallel for
for(size_t n = 0; n < songinfo.frames; ++n) {
sli[n]=songin[n*2];
sri[n]=songin[n*2+1];
}
Then you have to tell your compiler to compile and link with openmp (-fopenmp for gcc, /openmp for visual c++). As you can see it is quite a bit simpler to use (for such easy usecases, more complex scenarious are a different matter) then tbb and has the added benefit of workingon plattforms which don't support openmp or tbb too (since unknown #pragmas are ignored by the compiler). Personally I'm using openmp in favor of tbb for some projects since I couldn't use it's open source license and buying tbb was a bit to steep for the projects.
Now that we have the how to parallize the loop out of the way, lets get to the question if it's worth it. This is a question which really can't be answered easily, since it completely depends on how many elements you process and what kind of platform your program is expected to run on. Your problem is very bandwidth heavy so I wouldn't count on to much of an increase in performance.
If you are only processing 1000 elements the parallel version of the loop is very likely to be slower then the single threaded version due to overhead.
If your data is not in the cache (because it doesn't fit) and your system is very bandwidth starved you might not see much of a benefit (although it's likely that you will see some benefit, just don't be supprised if its in the order of 1.X even if you use a lot of processors)
If your system is ccNUMA (likely for multisocket systems) your performance might decrease regardless of the amount of elements, due to additional transfercosts
The compiler might miss optimizations regarding pointer aliasing (since the loop body is moved to a different dunction). Using __restrict (for gcc, no clue for vs) might help with that problem.
...
Personally I think the situation where you are most likely to see a significant performance increase is if your system has a single multi-core cpu, for which the dataset fit's into the L3-Cache (but not the individual L2 Caches). For bigger datasets your performance will probably increase, but not by much (and correctly using prefetching might get similar gains). Of course this is pure speculization.
I'm learning OpenMP. To do so, I'm trying to make an existing code parallel. But I seems to get an worse time when using OpenMP than when I don't.
My inner loop:
#pragma omp parallel for
for(unsigned long j = 0; j < c_numberOfElements; ++j)
{
//int th_id = omp_get_thread_num();
//printf("thread %d, j = %d\n", th_id, (int)j);
Point3D current;
#pragma omp critical
{
current = _points[j];
}
Point3D next = getNext(current);
if (!hasConstraint(next))
{
continue;
}
#pragma omp critical
{
_points[j] = next;
}
}
_points is a pointMap_t, defined as:
typedef boost::unordered_map<unsigned long, Point3D> pointMap_t;
Without OpenMP my running time is 44.904s. With OpenMP enabled, on a computer with two cores, it is 64.224s. What I am doing wrong?
Why have you wrapped your reads and writes to _points[j] in critical sections ? I'm not much of a C++ programmer, but it doesn't look to me as if you need those sections at all. As you've written it (uunamed critical sections) each thread is going to wait while the other goes through each of the sections. This could easily make the program slower.
It seems possible that the lookup and write to _points in critical sections is dragging down the performance when you use OpenMP. Single-threaded, this will not result in any contention.
Sharing seed data like this seems counterproductive in a parallel programming context. Can you restructure to avoid these contention points?
You need to show the rest of the code. From a comment to another answer, it seems you are using a map. That is really a bad idea, especially if you are mapping 0..n numbers to values: why don't you use an array?
If you really need to use containers, consider using the ones from the Intel's Thread Building Blocks library.
I agree that it would be best to see some working code.
The ultimate issue here is that there are criticals within a parallel region, and criticals are (a) enormously expensive in and of themselves, and (b) by definition, kill parallelism. The assignment to current certainl doesn't need to be inside a critical, as it is private; I wouldn't have thought the _points[j] assignment would be, either, but I don't know what the map stuff does, so there you go.
But you have a loop in which you have a huge amount of overhead, which grows linearly in the number of threads (the two critical regions) in order to do a tiny amount of actual work (walk along a linked list, it looks like). That's never going to be a good trade...