Threads in C++ not generating speedup on Mandelbrot image processing

So, I wrote a program that generates a Mandelbrot image. Then, I decided to write it in a way that would use a specified number of threads to speed it up. This is what I came up with:
void mandelbrot_all(std::vector<std::vector<int>>& pixels, int X, int Y, int threadCount) {
    using namespace std;
    vector<thread> threads;
    int numThreads = threadCount;
    // give each thread a vertical slice of the image:
    // columns [i*X/numThreads, (i+1)*X/numThreads)
    for (int i = 0; i < numThreads; i++) {
        threads.push_back(thread(mandelbrot_range, std::ref(pixels),
                                 i * X / numThreads, 0,
                                 X * (i + 1) / numThreads, Y, X));
    }
    for (int i = 0; i < numThreads; i++) {
        threads[i].join();
    }
}
The intention was to split the processing into chunks and process each one in its own thread. When I run the program, it takes a number as an argument, which is used as the number of threads for that run. Unfortunately, I get similar times for any number of threads.
Is there something about threading in C++ that I'm missing? Do I have to add some kind of boilerplate to make the threads run simultaneously? Or is the way I'm creating threads just silly?
I've tried running this code on a Raspberry Pi and on my quad-core laptop, with the same results.
Any help would be appreciated.

I'm a little late coming back to this question, but looking back, I remember the solution: I was programming on a single-core Raspberry Pi. One core means no speedup from threading.

I think spawning the threads is too expensive. You could try PPL or TBB, which both have parallel_for and parallel_for_each, and use those to loop through the pixels instead of raw threads. They manage the threads internally, so you get less overhead and better throughput.
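For illustration, here's a minimal sketch of the TBB route. The mandelbrot_pixel kernel below is an assumed stand-in for whatever the question's mandelbrot_range computes per pixel, not code from the question:
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

// Assumed escape-time kernel standing in for the question's mandelbrot_range;
// maps pixel (x, y) into the complex plane and counts iterations to divergence.
int mandelbrot_pixel(int x, int y, int X, int Y, int maxIter = 256) {
    double cr = (x - 0.75 * X) * 3.0 / X;  // rough viewport mapping
    double ci = (y - 0.5 * Y) * 2.0 / Y;
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < maxIter && zr * zr + zi * zi < 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

void mandelbrot_tbb(std::vector<std::vector<int>>& pixels, int X, int Y) {
    // TBB chops the row range into chunks and schedules them onto its own
    // worker pool, so unevenly expensive rows get balanced automatically.
    tbb::parallel_for(tbb::blocked_range<int>(0, Y),
        [&](const tbb::blocked_range<int>& rows) {
            for (int y = rows.begin(); y != rows.end(); ++y)
                for (int x = 0; x < X; ++x)
                    pixels[x][y] = mandelbrot_pixel(x, y, X, Y);
        });
}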

Solving one problem at a time, why not give it a try and hardcode the use of 2 threads, then 3? Thread startup is expensive; however, if you start only 2 threads and compute a fairly large Mandelbrot image, the startup time will be negligible in comparison.
If you don't achieve roughly 2x and 3x speedups, then you have other problems that you need to debug and solve separately.
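Given the accepted resolution above (a single-core Raspberry Pi), one cheap sanity check worth doing first is asking the standard library how many hardware threads are actually available:
#include <iostream>
#include <thread>

int main() {
    // Reports the number of concurrent threads the hardware supports,
    // or 0 if it cannot be determined. A single-core Raspberry Pi
    // prints 1, which would explain the missing speedup immediately.
    std::cout << std::thread::hardware_concurrency() << '\n';
}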

Without looking at your code and playing with it, it's hard to pinpoint the problem exactly. Here's a guess, though: some portions of the Mandelbrot set image are much easier to compute than others. Your code cuts the image into equal slices along the x-axis, but the majority of the work (say 70%) could fall into one slice. In that case, the best you can do is roughly a 30% reduction in run time, since the other threads still have to wait for the last one to finish. For example, if you run with four threads and cut the image into four pieces, the third piece certainly looks more compute-intensive than the rest. Of course, the 70% is just an estimate.
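One common mitigation (not from the original answer, just a sketch of the idea) is to interleave columns across threads instead of handing each thread one contiguous slice, so every thread touches both cheap and expensive regions. This assumes the same hypothetical mandelbrot_pixel helper as the TBB sketch above:
#include <thread>
#include <vector>

void mandelbrot_interleaved(std::vector<std::vector<int>>& pixels,
                            int X, int Y, int numThreads) {
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        // Thread t handles columns t, t + numThreads, t + 2*numThreads, ...
        // so an expensive region of the set is spread over all threads.
        threads.emplace_back([&pixels, t, X, Y, numThreads] {
            for (int x = t; x < X; x += numThreads)
                for (int y = 0; y < Y; ++y)
                    pixels[x][y] = mandelbrot_pixel(x, y, X, Y);
        });
    }
    for (auto& th : threads)
        th.join();
}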

Related

Incorrect measurement of the code execution time inside OpenMP thread

So I need to measure the execution time of some code inside a for loop. Originally, I needed to measure several different activities, so I wrote a timer class to help with that. After that, I tried to speed things up by parallelizing the for loop using OpenMP. The problem is that when running my code in parallel, my time measurements become really different - the values increase by up to a factor of about 10. To rule out a flaw inside the timer class, I started measuring the execution time of the whole loop iteration, so structurally my code looks something like this:
#pragma omp parallel for num_threads(20)
for (size_t j = 0; j < entries.size(); ++j)
{
    auto t1 = std::chrono::steady_clock::now();
    // do stuff
    auto t2 = std::chrono::steady_clock::now();
    std::cout << "Execution time is "
              << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count()
              << std::endl;
}
Here are some examples of the difference between measurements in parallel and measurements in a single thread (times in seconds, from the code above):
Single-threaded    Multi-threaded
11.363545868021    94.154685442
4.8963048650184    16.618173163
4.939025568        25.4751074
18.447368772       110.709813843
Even though these are only a couple of examples, this behaviour seems to hold across all loop iterations. I also tried boost's chrono library and thread_clock, but got the same result. Am I misunderstanding something? What may be the cause of this? Maybe I'm getting the cumulative time of all threads?
Inside the for loop, during each iteration I read a different file. Based on this file I create and solve a multitude of mixed-integer optimisation models. I solve them with a MIP solver, which I set to run in one thread. The instance of the solver is created on each iteration. The only variables shared between iterations are constant strings representing paths to some directories.
My machine has 32 threads (16 cores, 2 threads per core).
Also here are the timings of the whole application in single-threaded mode:
real 23m17.763s
user 21m46.284s
sys 1m28.187s
and in multi-threaded mode:
real 12m47.657s
user 156m20.479s
sys 2m34.311s
A few points here.
What you're measuring corresponds (roughly) to what time reports as the user time--that is, the total CPU time consumed by all threads. But when we look at the real time reported by time, we see that your multithreaded code is running close to twice as fast as the single-threaded code (23m18s vs. 12m48s). So it is scaling to some degree--but not very well.
Reading a file in the parallel region may well be part of this. Even at best, the fastest NVMe SSDs can only support reading from a few (e.g., around three or four) threads concurrently before you're using the drive's entire available bandwidth (and if you're doing I/O efficiently, it may well be closer to two). If you're using an actual spinning hard drive, it's usually pretty easy for a single thread to saturate the drive's bandwidth. A PCIe 5 SSD should keep up with more threads, but I rather doubt even it has the bandwidth to feed 20 threads.
Depending on what parts of the standard library you're using, it's pretty easy to have some "invisible" shared variables. For one common example, code that uses Monte Carlo methods will frequently call rand(). Even though it looks like a normal function call, rand() will typically end up using a seed variable that's shared between threads, and every call to rand() not only reads but also writes that shared variable--so the calls to rand() all end up serialized.
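A common fix, sketched here, is to give each thread its own generator state with <random> instead of calling rand(), so nothing hidden is shared:
#include <random>

// Each thread gets an independent generator, seeded once per thread,
// so calls never touch shared state and never serialize.
double uniform01() {
    thread_local std::mt19937 gen(std::random_device{}());
    thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(gen);
}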
You mention your MIP solver running in a single thread, but say there's a separate instance per iteration, leaving it unclear whether the MIP solving code is really one thread shared between the 20 other threads, or whether you have one MIP solver instance running in each of the 20 threads. I'd guess the latter, but if it's really the former, then its being a bottleneck wouldn't seem surprising at all.
Without code to look at, it's impossible to get really specific though.

Using multiple OMP parallel sections in parallel -> Performance Issue?

I am trying to understand a huge performance problem with one of our C++ applications using OpenMP (on Windows). The structure of the application is as follows:
I have an algorithm which basically consists of a couple of for-loops which are parallelized using OpenMP:
void algorithm()
{
    #pragma omp parallel for num_threads(12)
    for (int i = 0; ...)
    {
        // do some heavy computation (pure memory and CPU work, no I/O, no waiting)
    }
    // ... some more for-loops of this kind
}
The application executes this algorithm n times in parallel from n different threads:
std::thread t1(algorithm);
std::thread t2(algorithm);
//...
std::thread tn(algorithm);
t1.join();
t2.join();
//...
tn.join();
// end of application
Now, the problem is as follows:
when I run the application with n=1 (only one call to algorithm()) on my system with 32 physical CPU cores (no hyperthreading), it takes about 5s and loads the CPU to about 30% as expected (given that I have told OpenMP to only use 12 threads).
when I run with n=2, the CPU load goes up to about 60%, but the application takes almost 10 seconds. This means that it is almost impossible to run multiple instances of algorithm() in parallel.
This alone, of course, can have many reasons (including cache misses, RAM bandwidth limitations, etc.), but there is one thing that strikes me:
if I run my application twice in two parallel processes, each with n=1, both processes complete after about 5 seconds, meaning that I was well able to run two of my algorithms in parallel, as long as they live in different processes.
This seems to exclude many possible reasons for this performance bottleneck. And indeed, I have been unable to understand the cause of this, even after profiling the code. One of my suspicions is that there might be some excessive synchronization in OpenMP between different parallel sections.
Has anyone ever seen an effect like this before? Or can anyone give me advice on how to approach this? I have really come to a point where I have tried everything I can imagine, but without any success so far. I thus appreciate any help I can get!
Thanks a lot,
Da
PS:
I have been using both MS Visual Studio 2015 and Intel's 2017 compiler; both show basically the same effect.
I have a very simple reproducer showing this problem, which I can provide if needed. It is really not much more than the above, just with some real work to be done inside the for-loops.

How can I run tasks on the CPU and a GPU device concurrently?

I have this piece of code that is as profiled, optimised and cache-efficient as I am likely to get it with my level of knowledge. Conceptually, it runs on the CPU like this:
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < numberOfTasks; ++i)
{
    result[i] = RunTask(i); // result is some array where I store the result of RunTask
}
It just so happens that RunTask() is essentially a set of linear algebra operations that operate repeatedly on the same, very large dataset every time, so it's suitable to run on a GPU. So I would like to achieve the following:
1. Offload some of the tasks to the GPU.
2. While the GPU is busy, process the rest of the tasks on the CPU.
3. For the CPU-level operations, keep my super-duper RunTask() function without having to modify it to comply with restrict(amp). I could, of course, design a restrict(amp)-compliant lambda for the GPU tasks.
Initially I thought of doing the following:
// assume we know exactly how much time the GPU/CPU needs per task, and this is the
// most time-efficient combination:
int numberOfTasks = 1000;
int ampTasks = 800;

// RunTasksAMP(start, end) sends a restrict(amp) kernel to the GPU, and stores the result
// in the returned array_view on the GPU
Concurrency::array_view<ResultType, 1> concurrencyResult = RunTasksAMP(0, ampTasks);

// perform the rest of the tasks on the CPU while we wait
#pragma omp parallel for schedule(dynamic)
for (int i = ampTasks; i < numberOfTasks; ++i)
{
    result[i] = RunTask(i); // this is thread-safe
}

// do something to wait for the parallel_for_each in RunTasksAMP to finish
concurrencyResult.synchronize();
// ... now load the concurrencyResult array into the first elements of "result"
But I doubt you could do something like this, because
"A call to parallel_for_each behaves as though it's synchronous"
(http://msdn.microsoft.com/en-us/library/hh305254.aspx)
So is it possible to achieve requests 1-3, or do I have to ditch number 3? Even then, how would I implement it?
See my answer to "will array_view.synchronize_asynch wait for parallel_for_each completion?" for an explanation as to why parallel_for_each can be thought of as a queuing or scheduling operation rather than a synchronous one. This explains why your code should satisfy requirements 1 & 2. It should also meet requirement 3, although you might want to consider having one function that is restrict(cpu, amp), as this will give you less code to maintain.
However, you may want to consider some of the performance implications of your approach.
Firstly, parallel_for_each only queues work; the data copies between host and GPU memory use host resources (assuming your GPU is discrete and/or does not support direct copy). If your work on the host saturates all the resources required to keep the GPU working, then you may actually slow down your GPU calculation.
Secondly, many calculations that are data-parallel and amenable to running on a GPU are so much faster there that the additional overhead of trying to run work on the CPU doesn't result in an overall speedup. Overhead includes item one (above) and the additional overhead of coordinating work on the host (scheduling threads, merging the results, etc.).
Finally, your implementation above does not take into account any variability in the time taken to run tasks on the GPU and CPU. It assumes that 800 AMP tasks will take as long as 200 CPU tasks. This may be true on some hardware but not on others. If one set of tasks takes longer than expected, your application will block and wait for the slower set of tasks to complete. You can avoid this using a master/worker pattern, pulling tasks from a queue until there are no more available tasks. With this approach, in the worst case your application will have to wait for the final task to complete, not for a block of tasks. Using the master/worker approach also means that your application will run with equal efficiency regardless of the relative CPU/GPU performance.
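A minimal sketch of the CPU side of such a master/worker scheme, using a shared atomic counter as the task queue (ResultType and RunTask here are trivial stand-ins for the question's types; a GPU feeder thread would pull chunks of indices from the same counter):
#include <atomic>
#include <thread>
#include <vector>

using ResultType = double;                     // stand-in for the question's result type
ResultType RunTask(int i) { return i * 0.5; }  // trivial stand-in for the question's RunTask

void run_cpu_workers(std::vector<ResultType>& result, int numberOfTasks,
                     std::atomic<int>& next, int numWorkers) {
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w) {
        // Each worker claims the next unprocessed task index until the
        // counter passes numberOfTasks, so no fixed-size block can straggle.
        workers.emplace_back([&] {
            for (int i = next.fetch_add(1); i < numberOfTasks;
                 i = next.fetch_add(1))
                result[i] = RunTask(i);
        });
    }
    for (auto& t : workers)
        t.join();
}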
My book discusses examples of scheduling work across multiple GPUs using a master/worker (n-body) and parallel queue (cartoonizer). You can download the source code from CodePlex. Note that it deliberately does not cover sharing work on both CPU and GPU for the reasons outlined above based on discussions with the C++ AMP product team.

Improving image processing speed

I am using C++ and OpenCV to process some images taken from a Webcam in realtime and I am looking to get the best speed I can from my system.
Other than changing the processing algorithm (assume, for now, that you can't change it), is there anything I should be doing to maximize the speed of processing?
I am thinking maybe multithreading could help here, but I'm ashamed to say I don't really know the ins and outs (although obviously I have used multithreading before, just not in C++).
Assuming I have an x-core processor, does splitting the processing into x threads actually speed things up? Or would the management overhead of these threads negate it, given that I am looking for a throughput of 20fps? (I assume that will affect the answer you give, as it should give you an indication of how much processing will be done per thread.)
Would multithreading help here?
Are there any tips for increasing the speed of OpenCV specifically, or any pitfalls that I might be falling into that reduce speed?
Thanks.
The easiest way, I think, would be to pipeline the frame operations.
You could work with a thread pool, sequentially allocating a frame memory buffer to the first available thread, and releasing the buffer back to the pool when the algorithm step on the associated frame has completed.
This could leave your current (debugged :) algorithm practically unchanged, but it will require substantially more memory for buffering intermediate results.
Of course, without details about your task, it's hard to say whether this is appropriate...
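A minimal sketch of such a queue-fed pool, under the assumption of a simple producer pushing cv::Mat frames (process_frame is a trivial stand-in for the real algorithm step):
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <opencv2/opencv.hpp>

std::queue<cv::Mat> frames;       // frames waiting for a worker
std::mutex m;
std::condition_variable ready;
bool done = false;

void process_frame(cv::Mat& f) {  // stand-in for the real algorithm step
    cv::GaussianBlur(f, f, cv::Size(5, 5), 1.5);
}

void push_frame(const cv::Mat& f) {  // called by the capture thread
    { std::lock_guard<std::mutex> lock(m); frames.push(f.clone()); }
    ready.notify_one();
}

void finish() {  // called once capture ends; releases all workers
    { std::lock_guard<std::mutex> lock(m); done = true; }
    ready.notify_all();
}

void worker() {  // run one of these per pool thread
    for (;;) {
        cv::Mat frame;
        {
            std::unique_lock<std::mutex> lock(m);
            ready.wait(lock, [] { return done || !frames.empty(); });
            if (frames.empty()) return;  // done and queue drained
            frame = frames.front();
            frames.pop();
        }
        process_frame(frame);  // heavy work happens outside the lock
    }
}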
There is one important thing about increasing speed in OpenCV that is related neither to the processor nor to the algorithm: avoiding extra copying when dealing with matrices. I will give you an example taken from the documentation:
"...by constructing a header for a part of another matrix. It can be a
single row, single column, several rows, several columns, rectangular
region in the matrix (called a minor in algebra) or a diagonal. Such
operations are also O(1), because the new header will reference the
same data. You can actually modify a part of the matrix using this
feature, e.g."
// add 5-th row, multiplied by 3 to the 3rd row
M.row(3) = M.row(3) + M.row(5)*3;
// now copy 7-th column to the 1-st column
// M.col(1) = M.col(7); // this will not work
Mat M1 = M.col(1);
M.col(7).copyTo(M1);
Maybe you already knew this, but I think it is important to highlight matrix headers in OpenCV as an important and efficient coding tool.
Assuming I have an x-core processor, does splitting the processing into x threads actually speed things up?
Yes, although it depends very heavily on the particular algorithm being used, as well as on your skill in writing threaded code to handle things like synchronization. You didn't really provide enough detail to make a better assessment than that.
Some algorithms are extremely easy to parallelize, like ones that have the form:
for (i = 0; i < DATA_SIZE; i++)
{
    output[i] = f(input[i]);
}
for some function f. These are known as embarrassingly parallel: you can simply split the data into N blocks and have N threads process each block individually. Libraries like OpenMP make this kind of threading extremely simple.
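For instance, here is a self-contained sketch of that loop with OpenMP (f, input, and output are placeholders matching the snippet above). Built with -fopenmp or /openmp, the single pragma is the entire threading effort:
#include <vector>

int f(int x) { return x * x; }  // placeholder for the real per-element work

int main() {
    const int DATA_SIZE = 1 << 20;
    std::vector<int> input(DATA_SIZE, 2), output(DATA_SIZE);

    // OpenMP splits the iteration space into blocks, one per thread,
    // exactly as described above; the loop body is unchanged.
    #pragma omp parallel for
    for (int i = 0; i < DATA_SIZE; i++)
        output[i] = f(input[i]);
}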
Unless the particular algorithm you are using is already optimized for a multithreaded/parallel platform, throwing it at an x-core processor will do nothing for you. The algorithm has to be inherently threadable to benefit from multiple threads; if it wasn't designed with that in mind, it would have to be altered. On the other hand, many image processing algorithms are "embarrassingly parallel", at least in concept. Can you share more details about the algorithm you have in mind?
If your threads can operate on different data, it would seem reasonable to thread it off, perhaps queueing each frame object to a thread pool. You may have to add sequence numbers to the frame objects to ensure that the processed frames emerging from the pool are delivered in the same order they went in; see the re-sequencing sketch below.
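A hedged sketch of that re-sequencing step: buffer out-of-order results by sequence number and release them only once the next expected frame is present (callers must serialize access, e.g. under a mutex):
#include <cstdint>
#include <map>
#include <utility>

template <typename Frame>
class Resequencer {
    std::map<std::uint64_t, Frame> pending_;  // finished but not yet deliverable
    std::uint64_t next_ = 0;                  // next sequence number to emit
public:
    // Store one finished frame, then emit every frame that is now in order.
    template <typename Emit>
    void deliver(std::uint64_t seq, Frame frame, Emit emit) {
        pending_.emplace(seq, std::move(frame));
        for (auto it = pending_.find(next_); it != pending_.end();
             it = pending_.find(next_)) {
            emit(std::move(it->second));
            pending_.erase(it);
            ++next_;
        }
    }
};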
As example code for multi-threaded image processing with OpenCV, you might want to check out my code:
https://github.com/vmlaker/sherlock-cpp
It's what I came up with wanting to take advantage of an x-core CPU to improve the performance of object detection. The detect program is basically a parallel algorithm that distributes tasks among multiple threads, with a separate pipelined thread for every task:
Allocation of frame memory and video capture.
Object detection (one thread per Haar classifier).
Augmenting the output with the detection result and displaying the frame.
Memory deallocation.
With the memory for every captured frame shared between all threads, I got great performance and CPU utilization.

Multidimensional Array Initialization: Any benefit from Threading?

say I have the following code:
char array[5][5];
for (int i = 0; i < 5; ++i)
{
    for (int j = 0; j < 5; ++j)
    {
        array[i][j] = /* random char */;
    }
}
Would there be a benefit for initializing each row in this array in a separate thread?
Imagine that instead of a 5x5 array, we have a 10x10?
Or n x n?
Also, this is done once, during application startup.
You're joking, right?
If not: The answer is certainly no!!!
You'd incur a lot of overhead putting together enough synchronization to dispatch the work via a message queue and to know that all the threads had finished their rows and the array was ready. That would far outstrip the time it takes one CPU core to fill 25 bytes with known values. So for almost any simple initialization like this, you do not want to use threads.
Also bear in mind that threads provide concurrency but not speedup on a single-core machine. If you have an operation which has to be completed synchronously--like an array initialization--then, in theory, you'll only get value from adding threads up to the number of CPU cores available.
So if you're on a multi-core system, and if what you were putting in each cell took a long time to calculate... then sure, it may be worth exploiting some kind of parallelism. I like genpfault's suggestion: write it multithreaded for a multi-core system and time it, as an educational exercise, just to get a feel for where the crossover of benefit happens...
Unless you're doing a significant amount of computation, no, there will not be any benefit. It's possible you might even see worse performance due to caching effects.
This type of initialization is memory-bound, not CPU-bound. The time it takes to initialize the array depends on the speed of your memory; your CPU will just waste cycles waiting for the memory operations to commit. Adding more threads still leaves them all waiting on memory, and if they're all fighting over the same cache lines, performance will be worse, because the caches of the separate CPUs now have to synchronize with each other to avoid incoherency.
On modern hardware? Probably none, since you're not doing any significant computation. You'll most likely be limited by your memory bandwidth.
Pretty easy to test though. Whip up some OpenMP and give it a whirl!
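Something like this would do as a first test (a sketch; build once with and once without OpenMP enabled and compare the printed times):
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const int n = 5000;  // bump n up to look for the crossover point
    std::vector<char> a(static_cast<std::size_t>(n) * n);

    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            a[static_cast<std::size_t>(i) * n + j] = 'x';
    auto t1 = std::chrono::steady_clock::now();

    std::printf("init took %f s\n",
                std::chrono::duration<double>(t1 - t0).count());
}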
Doubtful, but for some value of n, maybe... though I'd imagine it's a really high n, and you'd probably already be multithreading the processing of this data. Remember that these threads will be writing back to the same area, which may also lead to cache contention.
If you want to know for sure, try it and profile.
Also, this is done once, during application startup.
For this kind of thing, the cost of creating the threads is probably greater than what you save by using them, especially if you only need to do it once.
I did something similar, but in my case the 2D array represented pixels on the screen. I was doing pretty expensive stuff: colour lerping, Perlin noise calculation... When running it all in a single thread, I got around 40 fps, but when I added slave threads responsible for calculating rows of pixels, I managed to double that result. So yes, there are situations where multithreading helps to speed up whatever you do with the array, provided that what you do is expensive enough to justify multiple threads.
You can download a live demo where you can adjust the number of threads and watch the fps counter change: http://umbrarumregnum.110mb.com/download/mnd (the multithreading test is the "Noise Demo 3").