Multithreading slower than single threading - C++

I'm writing a program that does heavy calculations on large arrays in real time. The work can be split into several sub-arrays for multithreading. However, I cannot get it to run any faster using threads.
Here is some dummy code created to demonstrate the same problem.
The two-thread version ends up taking 39 seconds, which is a couple of seconds longer than if the arrays were computed one after another(!). It doesn't matter whether the arrays are global, etc. I also tested calling the thread constructors only once, but with the same result.
I'm using Xcode (5.1.1) on a MacBook Air (2013 model, Core i5, OS X 10.8.5). Yes, it's an old computer; I rarely program these days...
So, can you find any mistake in the logic of my code, or could the problem be somewhere in the Xcode settings, etc.?
#include <ctime>
#include <iostream>
#include <thread>
using namespace std;

class Value
{
public:
    float a[3000000];
};

// n is a global shared by both threads (the "culprit" some comments suspected)
float n = 0;

void cycle(Value *val)
{
    for (int i = 0; i < 3000000; i++)
    {
        val->a[i] = n;
        n += 0.0001f;
    }
}
int main()
{
    Value *val1 = new Value, *val2 = new Value;
    clock_t start, stop;

    start = clock();
    for (int i = 0; i < 1000; i++)
    {
        thread first (cycle, val1);
        thread second (cycle, val2);
        first.join();
        second.join();
    }
    stop = clock();

    float tdiff = (((float)stop - (float)start) / 1000000.0F);
    std::cout << endl << "This took " << tdiff << " seconds...";
    return 0;
}

There is a joke that goes something like this:
If one programmer needs 1 day to finish a program, how many days do 10 programmers need for the same program? - 10 days.
The work in your code is done in this loop:
for (int i=0; i<1000; i++)
{
    thread first (cycle,val1);
    thread second (cycle,val2);
    first.join();
    second.join();
}
Now consider that spawning and joining threads is overhead. In total, your parallel code does more work than a sequential version would have to do; in general there is no way around that. And you are not creating and joining threads once, but 1000 times, i.e. you pay that overhead 1000 times.
Don't expect code to run faster simply because you add more threads to it. I refer you to Amdahl's law or Gustafson's law (which states basically the same thing, just a bit more positively).
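For reference, Amdahl's law says that if a fraction p of the program can be parallelized over n threads, the overall speedup is at most 1 / ((1 - p) + p/n). Even with p = 0.95, sixteen threads give you less than a 10x speedup, and the per-iteration thread overhead described above effectively shrinks p further.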
I suggest you experiment with a sequential version versus a threaded version that uses only a single thread, to get a feeling for the overhead. You can compare this:
for (int i=0; i<1000; i++)
{
    thread first (cycle,val1);
    first.join();
}
With a sequential version that does not use any threads. You will be surprised by the difference.
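For concreteness, a sequential baseline could look like this (a minimal sketch reusing cycle, val1 and val2 from the question, doing the same total work with no threads):

for (int i = 0; i < 1000; i++)
{
    cycle(val1);   // same work as the two-thread version,
    cycle(val2);   // but without any thread creation or joining
}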
You get the most out of multithreading when the threads do lots of work (cf. Amdahl/Gustafson) and when there is little synchronisation between them. Joining the threads 1000 times is basically 1000 barriers: whichever thread finishes first sits idle until the other one is done. Such barriers are best avoided.
Last but not least, as mentioned in a comment, your benchmark is rather questionable because you never use the result of the computations. Either you didn't turn on optimizations, which makes the timings rather meaningless, or you did turn them on and the compiler may have optimized things away without you noticing. I am also not sure whether you are comparing two versions that do the same amount of work, or whether your parallel version is doing twice the work. Moreover, when measuring time, take care to measure wall-clock time and not CPU time: CPU time adds up the time spent on all cores, whereas what you want to compare is wall-clock time.
TL;DR: More threads != automatically less runtime.

If you read the documentation for clock, you might notice that it says time can appear to go faster if the process executes on multiple cores; clock approximates total CPU use, not "wall clock time", and one "CPU tick" on each of two cores running in parallel counts as the same amount of "time" as two sequential "ticks" on one core.
(By the way: in order to get the time in seconds, you should be dividing by CLOCKS_PER_SEC.)
Using a more appropriate timer, like std::chrono::steady_clock, will show that the sequential variant takes almost twice as long as the multithreaded version.
The remaining difference (it is not quite a full 2x) can be explained entirely by the overhead of creating and destroying threads.
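A minimal sketch of the measurement with std::chrono::steady_clock (the commented section stands for whichever variant of the loop you are timing):

#include <chrono>
#include <iostream>

int main()
{
    auto t0 = std::chrono::steady_clock::now();

    // ... run the threaded (or sequential) loop from the question here ...

    auto t1 = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = t1 - t0;
    std::cout << "This took " << elapsed.count() << " seconds\n";
    return 0;
}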

Contrary to what the other comments suggest, n is not the culprit, and neither are the sequential joins.
Each thread performs the same operation, cycle, so how could there be any improvement?
You have to manually split the workload between the two threads, so that each one works on half the data.
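As a minimal sketch of that idea (cycle_range is a hypothetical range-based variant of the question's cycle, and the split point is illustrative):

// hypothetical variant of cycle that only fills the range [begin, end)
void cycle_range(Value *val, int begin, int end)
{
    float n = begin * 0.0001f;
    for (int i = begin; i < end; i++)
    {
        val->a[i] = n;
        n += 0.0001f;
    }
}

// each thread works on half of the same array
std::thread first (cycle_range, val1, 0, 1500000);
std::thread second(cycle_range, val1, 1500000, 3000000);
first.join();
second.join();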


counting up to billions for multicore processor C++11 or higher

#include <cstdio>
#include <iostream>
using namespace std;

volatile int counter = 0;

int main(int argc, char** argv){
    size_t nb_iter = 1000000000;
    for( size_t i = 0; i < nb_iter; i++ ){
        counter++;
    }
    printf("counter: %d / %zu\n", counter, nb_iter);
    return 0;
}
This takes about 2.19 sec to build and run the code.
How can we optimize this on a multicore processor?
This takes about 2.19 sec to build and run the code
That's like saying "My car is slow, it does 0-100 km/h in 5 days and 15 seconds", where the 5 days is the time it took the factory to build the car. Build time and execution time are two completely separate things and should not be lumped together.
how can we optimize
The only purpose volatile serves in this code is to prevent optimization, probably because someone wanted the loop to actually execute rather than be optimized away. If you remove volatile, the loop gets optimized out and everyone is happy.
optimize this on a multicore processor
You can't do that in a sensible way, since the nature of counter++ is that it adds based on the previous value. To split the work over a number of worker threads, they should ideally be able to operate without knowing a thing about each other's results.
Sure, you can create some number of threads doing dummy counting, each thread with its own counter. But I don't see much purpose in that - it doesn't necessarily create any performance benefit, since you have to take the thread creation overhead into account.
It's important to understand that threads aren't a magic performance boost in every situation. You could start by creating the threads manually and benchmarking from there, with or without thread creation overhead included. Only once you have tried that and understood it should you consider playing with things like OpenMP.
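If you do want to see what the per-thread-counter approach looks like, here is a minimal sketch with plain std::thread (the splitting scheme and names are illustrative, and as noted above the thread overhead can easily eat the gain for work this trivial):

#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const std::size_t nb_iter = 1000000000;
    const unsigned nthreads = std::max(2u, std::thread::hardware_concurrency());

    std::vector<long long> partial(nthreads, 0);   // one private counter per thread
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&partial, t, nthreads, nb_iter] {
            long long local = 0;                   // count in a thread-local variable
            for (std::size_t i = t; i < nb_iter; i += nthreads)
                ++local;
            partial[t] = local;                    // publish the result once at the end
        });
    }
    for (auto& th : pool) th.join();

    long long total = 0;
    for (long long p : partial) total += p;        // combine the partial counts
    std::printf("counter: %lld / %zu\n", total, nb_iter);
    return 0;
}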

Threads seem to alternate rather than run parallel

Here is a little program that runs two for loops on separate threads:
#include <iostream>
#include <thread>

void print_one() {
    for (int i = 0; i < 1000; i++) {
        std::cout << i << "\n";
    }
}

void print_two() {
    for (int i = 0; i > -1000; i--) {
        std::cout << i << "\n";
    }
}

int main(){
    std::thread t1(print_one);
    std::thread t2(print_two);
    t1.join();
    t2.join();
}
The program follows a pattern in which one for loop runs about 30-400 times (usually about 30-200), after which the other loop gets its "turn" and does the same. The behavior continues until the program exits.
I left the mutex out on purpose to show that it's not about locks. In all the examples I've seen on YouTube, the loops usually alternate after every 1-3 iterations. On my PC it's as if the two threads simply can't run simultaneously, and one thread has to take time off while the other is working. I'm not experiencing any performance issues in general, so it doesn't seem like a hardware failure to me.
A thread not doing anything while another one has time to execute std::cout and add a new line hundreds of times in a row just doesn't sound right to me. Is this normal behavior?
I have a Ryzen 5 1600 processor, if that makes any difference.
A thread not doing anything while another one has time to execute std::cout and add a new line hundreds of times in a row just doesn't sound right to me. Is this normal behavior?
There is no other sensible possibility here. Each thread only needs to do a single integer increment before it needs access to the terminal, something that only one thread can access at a time.
There are only two other possibilities, and they're both obviously terrible:
A single core runs the two threads, switching threads after every single output. This provides atrocious performance as 90+% of the time, the core is switching from one thread to the other.
The threads run on two cores. Each core does a single increment, then waits for the other core to finish writing to the terminal, and then writes to the terminal. Each core spends at least half its time waiting for the other core to release the terminal. This takes up two cores and provides fairly poor performance because the threads are spending a lot of time stopping and starting each other.
What you are seeing is the best possible behavior. The only sensible thing to do is to allow each thread to run long enough that the cost of switching is drowned out.
Answer to the first part:
The behaviour you observed is normal.
Each thread gets a specific amount of time (to be executed on the core) that is dynamically decided by the OS, based on current situations.
Answer to the second part:
The cout operation is buffered.
Even if the calls to write (or whatever it is that accomplishes that effect in that particular implementation) are guaranteed to be mutually exclusive, the buffer might be shared by the different threads. This will quickly lead to corruption of the internal state of the stream. [1]
That is why you sometimes see only a newline for part of the output.
The solution to this issue is to use printf(); its behaviour is atomic.
EDITS:
The claim that printf()'s behaviour is atomic is based on my experiments. The code at [2] demonstrates that printf() behaves atomically.
You can actually test it.
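A minimal sketch of the suggested change applied to the question's loops (whether each printf call really appears as one uninterrupted line in the output is something to verify on your own implementation):

#include <cstdio>

void print_one() {
    for (int i = 0; i < 1000; i++) {
        std::printf("%d\n", i);   // the whole line is handed over in a single call
    }
}

void print_two() {
    for (int i = 0; i > -1000; i--) {
        std::printf("%d\n", i);
    }
}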
References:
https://stackoverflow.com/a/6374525/7422352
https://www.researchgate.net/publication/351979208_Code_to_demonstrate_sequential_execution_of_2_parallel_regions_created_using_OpenMP_API

OpenMP and optimising vector operations

I'm running an algorithm at the moment that is very heavy but extremely parallel.
I've been looking at ways to speed it up, and I've noticed that the slowest operation I have is my VecAdd function (it gets called thousands of times on a vector around 6000 elements wide).
It is implemented as follows:
bool VecAdd( float* pOut, const float* pIn1, const float* pIn2, unsigned int num )
{
    for( unsigned int idx = 0; idx < num; idx++ )
    {
        pOut[idx] = pIn1[idx] + pIn2[idx];
    }
    return true;
}
It's a very simple loop, but all the additions can be performed in parallel. My first optimisation option is to move over to SIMD, as I can easily get nearly a 4x speed-up that way.
However I'm also interested in the possibility of using OpenMP and having it automatically thread the for loop (potentially giving me a further 4x speedup for a total of 16x with SIMD).
However, it actually runs more slowly. With the plain loop it takes around 3.2 seconds to process my example data. If I insert
#pragma omp parallel for
before the for loop, I assumed it would farm out several blocks of additions to other threads.
Unfortunately the result is that it takes ~7 seconds to process my example data.
Now I understand that a lot of my problem here will be caused by overheads with setting up threads and so forth but I'm still surprised just how much slower it makes things run.
Is it possible to speed this up by somehow setting up the thread pool in advance or will I never be able to combat these overheads?
Any thoughts or advice on whether I can thread this nicely with OpenMP would be much appreciated!
Your loop should parallelize fine with the #pragma omp parallel for.
However, I think the problem is that you shouldn't parallelize at that level. You said that the function gets called thousands of times, but it only operates on 6000 floats. Parallelize at a higher level, so that each thread is responsible for thousands/4 calls to VecAdd. Right now your algorithm is:
- serial execution
- (re)start threads
- do a short computation
- synchronize threads (at the end of the for loop)
- back to serial code
Change it so that it's parallel at the highest possible level.
Memory bandwidth of course matters, but there is no way it would make the parallel version slower than serial execution.
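A minimal sketch of what parallelizing at the higher level could look like, assuming the thousands of VecAdd calls are independent of each other (numCalls, out, in1 and in2 are placeholders for whatever data drives those calls in the real program):

// keep VecAdd itself serial; parallelize the loop that issues the calls,
// so the threads are started once and each one does a substantial chunk of work
#pragma omp parallel for schedule(static)
for (int call = 0; call < numCalls; ++call)
{
    VecAdd(out[call], in1[call], in2[call], 6000);
}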

Splitting up a program into 4 threads is slower than a single thread

I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However, running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4, i.e. it is the same compiled binary on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real time, which is unusual when using multiple cores: user should be larger than real, since several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {
    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());
    int i, j;

    #pragma omp parallel private(i, j)
    {
        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}
When reading this question I thought I had found my answer. It talks about the glibc implementation of rand() synchronizing calls to itself to preserve the random-number state between threads. I am using rand() quite a lot for Monte Carlo sampling, so I thought that was the problem. I got rid of the calls to rand(), replacing them with a single value, but using multiple threads was still slower. EDIT: oops, it turns out I didn't test this correctly; it was the random values!
Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection; however, no writes are done to this tree or to any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread will try to write to the same member variable. There is no synchronization anywhere, since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, was stupid not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M Quad core 2.1Ghz (on my laptop at the moment)
Code for computePixel:
class Camera {
    // constructors, destructors
private:
    // this is the array that is being written to, but not read from.
    Colour* _sensor; // allocated using new at construction.
};

void Camera::computePixel(int i, int j) const {
    Colour col;

    // simple code to construct the appropriate ray for the pixel
    Ray3D ray(/* params */);

    col += _sceneSamplingFunc(ray); // calls a const method that traverses the scene.

    _sensor[i*_scrWidth+j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays)- could this cause these problems?
Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works- there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_
// max number of thread states to store
const int maxThreadNum = 100;
void init_threadrand();
// requires openmp, for thread number
int threadrand();
#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
for (int i = 0; i < maxThreadNum; ++i) {
randThreadStates[i].reset(new unsigned int(std::rand()));
}
}
// requires openmp, for thread number, to index into array of states.
int threadrand() {
int i = omp_get_thread_num();
return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
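A minimal usage sketch (the Monte Carlo loop is only illustrative):

#include "threadrand.h"
#include <cstdio>
#include <cstdlib>
#include <omp.h>

int main() {
    std::srand(1234);        // seed the seeder
    init_threadrand();       // one independent state per possible thread

    const long long samples = 10000000;
    long long inside = 0;

    #pragma omp parallel for reduction(+:inside)
    for (long long i = 0; i < samples; ++i) {
        double x = threadrand() / (double)RAND_MAX;   // per-thread state, no locking
        double y = threadrand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0) ++inside;
    }

    std::printf("pi is roughly %f\n", 4.0 * (double)inside / (double)samples);
    return 0;
}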
The answer is: without knowing what machine you're running this on, and without really seeing the code of your computePixel function, it depends.
There are quite a few factors that could affect the performance of your code; one that comes to mind is cache alignment. Perhaps your data structures (you did mention a tree) are not really ideal for caching, and the CPU ends up waiting for data to come from RAM since it cannot fit things into the cache. Wrong cache-line alignment could cause something like that. If the CPU has to wait for things to come from RAM, it is likely that the thread will be context-switched out and another one will run.
Your OS thread scheduler is non-deterministic, so when a thread runs is not predictable; if it so happens that your threads are not running much, or are contending for CPU cores, this could also slow things down.
Thread affinity also plays a role. A thread gets scheduled on a particular core, and normally the scheduler tries to keep it there. If more than one of your threads runs on a single core, they have to share that core, which is another reason things could slow down. For performance reasons, once a particular thread has run on a core, it is normally kept there unless there's a good reason to swap it to another core.
There are other factors that I don't remember off the top of my head; I suggest doing some reading on threading. It's a complicated and extensive subject, and there's lots of material out there.
Is the data written at the end something that the other threads need in order to run computePixel?
One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread is trying to write the value of a pixel beside one written in another thread (they all write to the sensor array). If these two output values share the same CPU cache-line this forces the CPU to flush the cache between the processors. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are -- but still the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and have each thread work on a series of sequential rows (for like 1..10 11..20 21..30 31..40). This would greatly reduce the sharing.
Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.
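A minimal sketch of that idea applied to the render loop from the question: the default static schedule hands each thread one contiguous block of rows, so the threads write to well-separated parts of the _sensor array.

#pragma omp parallel for schedule(static)
for (int i = 0; i < cam.height(); ++i) {
    for (int j = 0; j < cam.width(); ++j) {
        cam.computePixel(i, j);
    }
}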
I just looked, and the Intel i3-2310M doesn't actually have 4 cores; it has 2 cores and hyper-threading. Try running your code with just 2 threads and see if that helps. I find that in general hyper-threading is totally useless when you have a lot of computation, and on my laptop I turned it off and got much better compilation times for my projects.
In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.

Why are all iterations in a loop parallelized using OpenMP schedule(dynamic) given to one thread? (MSVS 2010)

Direct question: I've got a simple loop around what can be a computationally intensive function. Let's assume that each iteration takes the same amount of time (so load balancing should be easy).
#pragma omp parallel
{
    #pragma omp for schedule(dynamic)
    for ( int i=0; i < 30; i++ )
    {
        MyExpensiveFunction();
    }
} // parallel block
Why are all of the iterations assigned to a single thread? I can add a:
std::cout << "tID = " << omp_get_thread_num() << "\n\n";
and it prints a bunch of zeros with only the last iteration assigned to thread 1.
My system: I must support cross-compiling, so I'm using gcc 4.4.3 and 4.5.0; both work as expected. But with MSVS 2010 I see the above behavior, where 29 iterations are assigned to thread 0 and one iteration is assigned to thread 1.
Really odd: it took me a bit to realize that this might simply be a scheduling problem. I googled and found this website, which, if you skip to the bottom, has an example with what must be auto-generated output. All iterations using dynamic and guided scheduling are assigned to thread zero??!?
Any guidance would be greatly appreciated!!
Most likely, this is because the OMP implementation in Visual Studio decided that you did nowhere near enough work to merit putting it on more than one thread. If you simply increase the quantity of iterations, then you may well find that the other threads have more utilization. Dynamic scheduling means that the implementation only forks new threads if it needs them, so if it doesn't need them, it doesn't make them or assign them work.
If each iteration takes the same amount of time, then you actually don't need dynamic scheduling, which causes more scheduling overhead than static scheduling policies; schedule(static, 1) and schedule(static) should be fine.
Could you let me know how long each iteration takes? Regarding the example you cited (MSDN's scheduling example), the amount of work in each iteration is so small that the first thread simply grabbed almost all of it. If you really increase the work of each iteration (to at least on the order of a millisecond), then you will see the difference.
I have done a lot of experiments with OpenMP scheduling policies, and MSVC's implementation of dynamic scheduling works well. I'm pretty sure the work in each of your iterations was too small.
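A minimal sketch combining both suggestions: a static schedule (since the iterations are uniform) and enough per-iteration work to make the thread distribution visible. The sleep is only a stand-in for MyExpensiveFunction().

#include <chrono>
#include <iostream>
#include <thread>
#include <omp.h>

int main()
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 30; i++)
    {
        // a few milliseconds of "work" per iteration
        std::this_thread::sleep_for(std::chrono::milliseconds(5));

        #pragma omp critical
        std::cout << "i = " << i << " ran on thread " << omp_get_thread_num() << "\n";
    }
    return 0;
}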