Multithreaded function performance worse than single threaded - c++

I wrote an update() function which ran on a single thread, then I wrote the below function updateMP() which does the same thing except I divide the work in my two for loops here amongst some threads:
void GameOfLife::updateMP()
{
std::vector<Cell> toDie;
std::vector<Cell> toLive;
#pragma omp parallel
{
// private, per-thread variables
std::vector<Cell> myToDie;
std::vector<Cell> myToLive;
#pragma omp for
for (int i = 0; i < aliveCells.size(); i++) {
auto it = aliveCells.begin();
std::advance(it, i);
int liveCount = aliveCellNeighbors[*it];
if (liveCount < 2 || liveCount > 3) {
myToDie.push_back(*it);
}
}
#pragma omp for
for (int i = 0; i < aliveCellNeighbors.size(); i++) {
auto it = aliveCellNeighbors.begin();
std::advance(it, i);
if (aliveCells.find(it->first) != aliveCells.end()) // is this cell alive?
continue; // if so skip because we already updated aliveCells
if (aliveCellNeighbors[it->first] == 3) {
myToLive.push_back(it->first);
}
}
#pragma omp critical
{
toDie.insert(toDie.end(), myToDie.begin(), myToDie.end());
toLive.insert(toLive.end(), myToLive.begin(), myToLive.end());
}
}
for (const Cell& deadCell : toDie) {
setDead(deadCell);
}
for (const Cell& liveCell : toLive) {
setAlive(liveCell);
}
}
I noticed that it performs worse than the single threaded update() and seems like it's getting slower over time.
I think I might be doing something wrong by having two uses of omp for? I am new to OpenMP so I am still figuring out how to use it.
Why am I getting worse performance with my multithreaded implementation?
EDIT: Full source here: https://github.com/k-vekos/GameOfLife/tree/hashing?files=1

Why am I getting worse performance with my multithreaded implementation?
Classic question :)
You loop through only the alive cells. That's actually pretty interesting. A naive implementation of Conway's Game of Life would look at every cell. Your version optimizes for a fewer number of alive than dead cells, which I think is common later in the game. I can't tell from your excerpt, but I assume it trades off by possibly doing redundant work when the ratio of alive to dead cells is higher.
A caveat of omp parallel is that there's no guarantee that the threads won't be created/destroyed during entry/exit of the parallel section. It's implementation dependent. I can't seem to find any information on MSVC's implementation. If anyone knows, please weight in.
So that means that your threads could be created/destroyed every update loop, which is heavy overhead. For this to be worth it, the amount of work should be orders of magnitude more expensive than the overhead.
You can profile/measure the code to determine overhead and work time. It should also help you see where the real bottlenecks are.
Visual Studio has a profiler with a nice GUI. You need to compile your code with release optimizations but also include the debug symbols (those are excluded in the default release configuration). I haven't looked into how to set this up manually because I usually use CMake and it sets up a RelWithDebInfo configuration automatically.
Use high_resolution_clock to time sections that are hard to measure with the profiler.
If you can use C++17, it has a standard parallel for_each (the ExecutionPolicy overload). Most of the algorithmy standards functions do. https://en.cppreference.com/w/cpp/algorithm/for_each. They're so new that I barely know anything about them (they might also have the same issues as OpenMP).
seems like it's getting slower over time.
Could it be that you're not cleaning up one of your vectors?

First off, if you want any kind of performance, you must do as little work as possible in a critical section. I'd start by changing the following:
std::vector<Cell> toDie;
std::vector<Cell> toLive;
to
std::vector<std::vector<Cell>> toDie;
std::vector<std::vector<Cell>> toLive;
Then, in your critical section, you can just do:
toDie.push_back(std::move(myToDie));
toLive.push_back(std::move(myToLive));
Arguably, a vector of vector isn't cute, but this will prevent deep-copying inside the CS which is unnecessary time consumption.
[Update]
IMHO there's no point in using multithreading if you are using non-contiguous data structures, at least not in that way. The fact is you'll spend most of your time waiting on cache misses because that's what associative containers do, and little doing the actual work.
I don't know how this game works. It feels like if I had to do something with numerous updates and render, I would have the updates done as quickly as possible on the 'main' thread', and I would have another (detached) thread for the renderer. You could then just give the renderer the results after every "update", and perform another update while it's rendering.
Also, I'm definitely not an expert in hashing, but hash<int>()(k.x * 3 + k.y * 5) seems like a high-collision hashing. You can certainly try something else like what's proposed here

Related

how can I get good speedup for a parallel write to memory?

I'm new to OpenMP and trying to get some very basic loops in my code parallelized with OpenMP, with good speedup on multiple cores. Here's a function in my program:
bool Individual::_SetFitnessScaling_1(double source_value, EidosObject **p_values, size_t p_values_size)
{
if ((source_value < 0.0) || (std::isnan(source_value)))
return true;
#pragma omp parallel for schedule(static) default(none) shared(p_values_size) firstprivate(p_values, source_value) if(p_values_size >= EIDOS_OMPMIN_SET_FITNESS_S1)
for (size_t value_index = 0; value_index < p_values_size; ++value_index)
((Individual *)(p_values[value_index]))->fitness_scaling_ = source_value;
return false;
}
So the goal is to set the fitnessScaling ivar of every object pointed to by pointers in the buffer that p_values points to, to the same double value source_value. Those various objects might be more or less anywhere in memory, so each write probably hits a different cache line; that's an aspect of the code that would be difficult to change, but I'm hoping that by spreading it across multiple cores I can at least divide that pain by a good speedup factor. The cast to (Individual *) is safe, by the way; checks were already done external to this function that guarantee its safety.
You can see my first attempt at parallelizing this, using the default static schedule (so each thread gets its own contiguous block in p_values), making the loop limit shared, and making p_values and source_value be firstprivate so each thread gets its own private copy of those variables, initialized to the original value. The threshold for parallelization, EIDOS_OMPMIN_SET_FITNESS_S1, is set to 900. I test this with a script that passes in a million values, with between 1 and 8 cores (and a max thread count to match), so the loop should certainly run in parallel. I have followed these same practices in some other places in the code and have seen a good speedup. [EDIT: I should say that the speedup I observe for this, for 2/4/6/8 cores/threads, is always about 1.1x-1.2x the single-threaded performance, so there's a very small win but it is realized already with 2 cores and does not get any better with 8 cores.] The notable difference with this code is that this loop spends its time writing to memory; the other loops I have successfully parallelized spend their time doing things like reading values from a buffer and summing across them, so they might be limited by memory read speeds, but not by memory write speeds.
It occurred to me that with all of this writing through a pointer, my loop might be thrashing due to things like aliasing (making the compiler force a flush of the cache after each write), or some such. I attempted to solve that kind of issue as follows, using const and __restrict:
bool Individual::_SetFitnessScaling_1(double source_value, EidosObject **p_values, size_t p_values_size)
{
if ((source_value < 0.0) || (std::isnan(source_value)))
return true;
#pragma omp parallel default(none) shared(p_values_size) firstprivate(p_values, source_value) if(p_values_size >= EIDOS_OMPMIN_SET_FITNESS_S1)
{
EidosObject * const * __restrict local_values = p_values;
#pragma omp for schedule(static)
for (size_t value_index = 0; value_index < p_values_size; ++value_index)
((Individual *)(local_values[value_index]))->fitness_scaling_ = source_value;
}
return false;
}
This made no difference to the performance, however. I still suspect that some kind of memory contention, cache thrash, or aliasing issue is preventing the code from parallelizing effectively, but I don't know how to solve it. Or maybe I'm barking up the wrong tree?
These tests are done with Xcode 13 (i.e., using Apple clang 13.0.0) on macOS, on an M1 Mac mini (2020).
[EDIT: In reply to comments below, a few points. (1) There is nothing fancy going on inside the class here, no operator= or similar; the assignment of source_value into fitness_scaling_ is, in effect, simply the assignment of a double into a field in a struct. (2) The use of firstprivate(p_values, source_value) is to ensure that repeated reading from those values across threads doesn't introduce some kind of between-thread contention that slows things down. It is recommended in Mattson, He, & Koniges' book "The OpenMP Common Core"; see section 6.3.2, figure 6.10 with the corrected Mandelbrot code using firstprivate, and the quote on p. 111: "An easy solution is to change the storage attribute for eps to firstprivate. This gives each thread its own copy of the variable but with a specified value. Notice that eps is read-only. It is not updated inside the parallel region. Therefore, another solution is to let it be shared (shared(eps)) or not specify eps in a data environment clause and let its default, shared behavior be used. While this would result in correct code, it would potentially increase overhead. If eps is shared, every thread will be reading the same address in memory... Some compilers will optimize for such read-only variables by putting them into registers, but we should not rely on that behavior." I have observed this change speeding up parallelized loops in other contexts, so I have adopted it as my standard practice in such cases; if I have misunderstood, please do let me know. (3) No, keeping the fitness_scaling_ values in their own buffer is not a workable solution for several reasons. Most importantly, this method may be called with any arbitrary buffer of pointers to Individual; it is not necessarily setting the fitness_scaling_ of all Individual objects, just an effectively random subset of them, so this operation will never be reducible to a simple memset(). Also, I am going to need to similarly optimize the setting of many other properties on Individual and on other classes in my code, so a general solution is needed; I can't very well put all of the ivars of all of my classes into separately allocated buffers external to the objects themselves. And third, Individual objects are being dynamically allocated and deallocated independently of each other, so an external buffer of fitness_scaling_ values for the objects would have big implementation problems.]

Parallel tasks get better performances with boost::thread than with ppl or OpenMP

I have a C++ program which could be parallelized. I'm using Visual Studio 2010, 32bit compilation.
In short the structure of the program is the following
#define num_iterations 64 //some number
struct result
{
//some stuff
}
result best_result=initial_bad_result;
for(i=0; i<many_times; i++)
{
result *results[num_iterations];
for(j=0; j<num_iterations; j++)
{
some_computations(results+j);
}
// update best_result;
}
Since each some_computations() is independent(some global variables read, but no global variables modified) I parallelized the inner for-loop.
My first attempt was with boost::thread,
thread_group group;
for(j=0; j<num_iterations; j++)
{
group.create_thread(boost::bind(&some_computation, this, result+j));
}
group.join_all();
The results were good, but I decided to try more.
I tried the OpenMP library
#pragma omp parallel for
for(j=0; j<num_iterations; j++)
{
some_computations(results+j);
}
The results were worse than the boost::thread's ones.
Then I tried the ppl library and used parallel_for():
Concurrency::parallel_for(0,num_iterations, [=](int j) {
some_computations(results+j);
})
The results were the worst.
I found this behaviour quite surprising. Since OpenMP and ppl are designed for the parallelization, I would have expected better results, than boost::thread. Am I wrong?
Why is boost::thread giving me better results?
OpenMP or PPL do no such thing as being pessimistic. They just do as they are told, however there's some things you should take into consideration when you do try to paralellize loops.
Without seeing how you implemented these things, it's hard to say what the real cause may be.
Also if the operations in each iteration have some dependency on any other iterations in the same loop, then this will create contention, which will slow things down. You haven't shown what your some_operation function actually does, so it's hard to tell if there is data dependencies.
A loop that can be truly parallelized has to be able to have each iteration run totally independent of all other iterations, with no shared memory being accessed in any of the iterations. So preferably, you'd write stuff to local variables and then copy at the end.
Not all loops can be parallelized, it is very dependent on the type of work being done.
For example, something that is good for parallelizing is work being done on each pixel of a screen buffer. Each pixel is totally independent from all other pixels, and therefore, a thread can take one iteration of a loop and do the work without needing to be held up waiting for shared memory or data dependencies within the loop between iterations.
Also, if you have a contiguous array, this array may be partly in a cache line, and if you are editing element 5 in thread A and then changing element 6 in thread B, you may get cache contention, which will also slow down things, as these would be residing in the same cache line. A phenomenon known as false sharing.
There is many aspects to think about when doing loop parallelization.
In short words, openMP is mainly based on shared memory, with additional cost of tasking management and memory management. ppl is designed to handle generic patterns of common data structures and algorithms, it brings additional complexity cost. Both of them have additional CPU cost, but your simple falling down boost threads do not (boost threads are just simple API wrapping). That's why both of them are slower than your boost version. And, since the exampled computation is independent for each other, without synchronization, openMP should be close to the boost version.
It occurs in simple scenarios, but, for complicated scenarios, with complicated data layout and algorithms, it should be context dependent.

Splitting up a program into 4 threads is slower than a single thread

I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4. I.e. It's the same compiled code on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real, which is unusual when using multiple cores- user should be larger than real as several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {
// let the camera know to use this raytracer for probing the scene
cam.setSamplingFunc(getSamplingFunction());
int i, j;
#pragma omp parallel private(i, j)
{
// Construct a ray for each pixel.
#pragma omp for schedule(dynamic, 4)
for (i = 0; i < cam.height(); ++i) {
for (j = 0; j < cam.width(); ++j) {
cam.computePixel(i, j);
}
}
}
}
When reading this question I thought I had found my answer. It talks about the implementation of gclib rand() synchronizing calls to itself to preserve state for random number generation between threads. I am using rand() quite a lot for monte carlo sampling, so i thought that was the problem. I got rid of calls to rand, replacing them with a single value, but using multiple threads is still slower. EDIT: oops turns out I didn't test this correctly, it was the random values!
Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree, with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection, however, no writes are done to this tree or any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread will try to write to to the same member variable. There is no synchronization anywhere since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, was stupid not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M Quad core 2.1Ghz (on my laptop at the moment)
Code for compute pixel:
class Camera {
// constructors destructors
private:
// this is the array that is being written to, but not read from.
Colour* _sensor; // allocated using new at construction.
}
void Camera::computePixel(int i, int j) const {
Colour col;
// simple code to construct appropriate ray for the pixel
Ray3D ray(/* params */);
col += _sceneSamplingFunc(ray); // calls a const method that traverses scene.
_sensor[i*_scrWidth+j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays)- could this cause these problems?
Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works- there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_
// max number of thread states to store
const int maxThreadNum = 100;
void init_threadrand();
// requires openmp, for thread number
int threadrand();
#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
for (int i = 0; i < maxThreadNum; ++i) {
randThreadStates[i].reset(new unsigned int(std::rand()));
}
}
// requires openmp, for thread number, to index into array of states.
int threadrand() {
int i = omp_get_thread_num();
return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
The answer is, without knowing what machine you're running this on, and without really seeing the code of your computePixel function, that it depends.
There is quite a few factors that could affect the performance of your code, one thing that comes to mind is the cache alignment. Perhaps your data structures, and you did mention a tree, are not really ideal for caching, and the CPU ends up waiting for the data come from the RAM, since it cannot fit things into the cache. Wrong cache-line alignments could cause something like that. If the CPU has to wait for things to come from RAM, it is likely, that the thread will be context-switched out, and another will be run.
Your OS thread scheduler is non-deterministic, therefore, when a thread will run is not a predictable thing, so if it so happens that your threads are not running a lot, or are contending for CPU cores, this could also slow things down.
Thread affinity, also plays a role. A thread will be scheduled on a particular core, and normally it will be attempted to keep this thread on the same core. If more then one of your threads are running on a single core, they will have to share the same core. Another reason things could slow down. For performance reasons, once a particular thread has run on a core, it is normally kept there, unless there's a good reason to swap it to another core.
There's some other factors, which I don't remember off the top of my head, however, I suggest doing some reading on threading. It's a complicated and extensive subject. There's lots of material out there.
Is the data being written at the end, data that other threads need to be able to do computePixel ?
One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread is trying to write the value of a pixel beside one written in another thread (they all write to the sensor array). If these two output values share the same CPU cache-line this forces the CPU to flush the cache between the processors. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are -- but still the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and have each thread work on a series of sequential rows (for like 1..10 11..20 21..30 31..40). This would greatly reduce the sharing.
Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.
I just looked and the Intel i3-2310M doesn't actually have 4 cores, it has 2 cores and hyper-threading. Try running your code with just 2 threads and see it that helps. I find in general hyper-threading is totally useless when you have a lot of calculations, and on my laptop I turned it off and got much better compilation times of my projects.
In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.

how do i properly design a worker thread? (avoid for example Sleep(1))

i am still a beginner at multi-threading, so bear with me please:
i am currently writing an application that does some FVM calculation on a grid. it's a time-explicit model, so at every timestep i need to calculate new values for the whole grid. my idea was to distribute this calculation to 4 worker-threads, which then deal with the cells of the grid (first thread calculating 0, 4, 8... second thread 1, 5, 9... and so forth).
i create those 4 threads at program start.
they look something like this:
void __fastcall TCalculationThread::Execute()
{
bool alive = true;
THREAD_SIGNAL ts;
while (alive)
{
Sleep(1);
if (TryEnterCriticalSection(&TMS))
{
ts = thread_signal;
LeaveCriticalSection(&TMS);
alive = !ts.kill;
if (ts.go && !ts.done.at(this->index))
{
double delta_t = ts.dt;
for (unsigned int i=this->index; i < cells.size(); i+= this->steps)
{
calculate_one_cell();
}
EnterCriticalSection(&TMS);
thread_signal.done.at(this->index)=true;
LeaveCriticalSection(&TMS);
}
}
}
they use a global struct, to communicate with the main thread (main thread sets ts.go to true when the workers need to start.
now i am sure this is not the way to do it! not only does it feel wrong, it also doesn't perform very well...
i read for example here that a semaphore or an event would work better. the answer to this guy's question talks about a lockless queue.
i am not very familiar with these concepts would like some pointers how to continue.
could you line out any of the ways to do this better?
thank you for your time. (and sorry for the formatting)
i am using borland c++ builder and its thread-object (TThread).
The definitely more effective algorithm would be to calculate yields for 0,1,2,3 on one thread, 4,5,6,7 on another, etc. Interleaving memory accesses like that is very bad, even if the variables are completely independent- you'll get false sharing problems. This is the equivalent of the CPU locking every write.
Calling Sleep(1) in a calculation thread can't be a good solution to any problem. You want your threads to be doing useful work rather than blocking for no good reason.
I think your basic problem can be expressed as a serial algorithm of this basic form:
for (int i=0; i<N; i++)
cells[i]->Calculate();
You are in the happy position that calls to Calculate() are independent of each other—what you have here is a parallel for. This means that you can implement this without a mutex.
There are a variety of ways to achieve this. OpenMP would be one; a threadpool class another. If you are going to roll your own thread based solution then use InterlockedIncrement() on a shared variable to iterate through the array.
You may hit some false sharing problems, as #DeadMG suggests, but quite possibly not. If you do have false sharing then yet another approach is to stride across larger sub-arrays. Essentially the increment (i.e. stride) passed to InterlockedIncrement() would be greater than one.
The bottom line is that the way to make the code faster is to remove both the the critical section (and hence the contention on it) and the Sleep(1).

Concurrency and optimization using OpenMP

I'm learning OpenMP. To do so, I'm trying to make an existing code parallel. But I seems to get an worse time when using OpenMP than when I don't.
My inner loop:
#pragma omp parallel for
for(unsigned long j = 0; j < c_numberOfElements; ++j)
{
//int th_id = omp_get_thread_num();
//printf("thread %d, j = %d\n", th_id, (int)j);
Point3D current;
#pragma omp critical
{
current = _points[j];
}
Point3D next = getNext(current);
if (!hasConstraint(next))
{
continue;
}
#pragma omp critical
{
_points[j] = next;
}
}
_points is a pointMap_t, defined as:
typedef boost::unordered_map<unsigned long, Point3D> pointMap_t;
Without OpenMP my running time is 44.904s. With OpenMP enabled, on a computer with two cores, it is 64.224s. What I am doing wrong?
Why have you wrapped your reads and writes to _points[j] in critical sections ? I'm not much of a C++ programmer, but it doesn't look to me as if you need those sections at all. As you've written it (uunamed critical sections) each thread is going to wait while the other goes through each of the sections. This could easily make the program slower.
It seems possible that the lookup and write to _points in critical sections is dragging down the performance when you use OpenMP. Single-threaded, this will not result in any contention.
Sharing seed data like this seems counterproductive in a parallel programming context. Can you restructure to avoid these contention points?
You need to show the rest of the code. From a comment to another answer, it seems you are using a map. That is really a bad idea, especially if you are mapping 0..n numbers to values: why don't you use an array?
If you really need to use containers, consider using the ones from the Intel's Thread Building Blocks library.
I agree that it would be best to see some working code.
The ultimate issue here is that there are criticals within a parallel region, and criticals are (a) enormously expensive in and of themselves, and (b) by definition, kill parallelism. The assignment to current certainl doesn't need to be inside a critical, as it is private; I wouldn't have thought the _points[j] assignment would be, either, but I don't know what the map stuff does, so there you go.
But you have a loop in which you have a huge amount of overhead, which grows linearly in the number of threads (the two critical regions) in order to do a tiny amount of actual work (walk along a linked list, it looks like). That's never going to be a good trade...