I have some code that can make use of parallelism for an efficiency gain. Since my PC has a dual-core processor, I tried running the code on two threads. So I wrote the code below (this is a very simplified version of it):
void Evaluator::evaluate(vector<inpType> &input, bool parallel) {
    std::thread t0;
    size_t half = input.size() / 2 + input.size() % 2;
    if(parallel) {
        // Evaluate the first half of the input in a spawned thread
        t0 = std::thread(&Evaluator::evaluatePart, this, std::ref(input), size_t(0), half);
        // Evaluate the other half of the input in this thread
        evaluatePart(input, half, input.size());
    } else {
        // Sequentially evaluate all of the input
        evaluatePart(input, 0, input.size());
    }
    // some other code
    // after finishing everything, join the thread
    if(parallel) t0.join();
}

void Evaluator::evaluatePart(vector<inpType> &input, size_t start, size_t end) {
    for(size_t i = start; i < end; i++) {
        evaluateSingle(input[i]);
    }
}

void Evaluator::evaluateSingle(inpType &input) {
    // do stuff with input
    // note: this reads a vector<int> belonging to the Evaluator object,
    // but it only alters the single input element it was given
}
Running sequentially takes around 3 ms, but running in parallel takes around 6 ms. Does that mean spawning a thread takes so much time that it is more efficient to just evaluate sequentially? Or am I doing something wrong?
Note that I don't make use of any locking mechanisms because the evaluations are independent of each other. Every evaluateSingle reads from a vector that is a member of the Evaluator object but only alters the single input that was given to it. Hence there is no need for any locking.
Update
I apologize that I didn't make this clear. This is more of a pseudocode describing in the abstract what my code looks like. It will not work or compile as-is, but mine does, so that is not the issue. Anyway, I fixed the t0 scope issue in this code.
Also, the input size is around 38,000, which I think is large enough to make use of parallelism.
Update
I tried increasing the size of the input to 5,000,000, but that didn't help. Sequential is still faster than multi-threaded.
Update
I tried increasing the number of threads while splitting the vector evenly between them for evaluation, and got some interesting results (a simplified sketch of that split is shown after the questions below).
Note that I have an i7-7500U CPU that can run 4 threads in parallel. This leaves me with two questions:
Why does creating 4 or more threads start to show a performance improvement compared with 2 or 3?
Why is creating more than 4 threads more efficient than just 4 threads (the maximum the CPU can run concurrently)?
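For clarity, here is a simplified, self-contained sketch of the even split and the timing (generic int data and a dummy per-element operation, not my real Evaluator code):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// stand-in for evaluateSingle applied to a half-open range [start, end)
void evaluate_range(std::vector<int>& input, std::size_t start, std::size_t end) {
    for (std::size_t i = start; i < end; ++i)
        input[i] *= 2;
}

// split the input evenly across num_threads threads and join them all
void evaluate_parallel(std::vector<int>& input, unsigned num_threads) {
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    const std::size_t n = input.size();
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t start = n * t / num_threads;
        std::size_t end   = n * (t + 1) / num_threads;
        threads.emplace_back(evaluate_range, std::ref(input), start, end);
    }
    for (auto& th : threads) th.join();
}

int main() {
    std::vector<int> input(5000000, 1);
    const unsigned hw = std::max(1u, std::thread::hardware_concurrency()); // 4 on an i7-7500U
    auto t0 = std::chrono::steady_clock::now();
    evaluate_parallel(input, hw);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%u threads: %.3f ms\n", hw,
                std::chrono::duration<double, std::milli>(t1 - t0).count());
}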
Related
I am quite new to C++ and could really use some advice on multithreading using std::thread.
I have the following piece of code, which basically splits a for loop of N = 8^L iterations (up to 8^14) across threads:
void Lanczos::Hamil_vector_multiply(vec& initial_vec, vec& result_vec) {
    result_vec.zeros();
    std::vector<arma::vec> result_threaded(num_of_threads);
    std::vector<std::thread> threads;
    threads.reserve(num_of_threads);
    for (int t = 0; t < num_of_threads; t++) {
        u64 start = t * N / num_of_threads;
        u64 stop = ((t + 1) == num_of_threads ? N : N * (t + 1) / num_of_threads);
        result_threaded[t] = arma::vec(stop - start, fill::zeros);
        threads.emplace_back(&Lanczos::Hamil_vector_multiply_kernel, this, start, stop,
                             ref(initial_vec), ref(result_vec));
    }
    for (auto& t : threads) t.join();
}
where Lanczos is my general class (actually it is not necessary to know what it contains), while the member function Hamil_vector_multiply_kernel is of the form:
void Lanczos::Hamil_vector_multiply_kernel(u64 start, u64 stop, vec& initial_vec, vec& result_vec_threaded) {
    // some declarations
    for (u64 k = start; k < stop; k++) {
        // some preliminary work
        for (int j = 0; j <= L - 1; j++) {
            // a bunch of if-else statements, where result_vec_threaded(k) += something
        }
    }
}
(The code is quite long, so I didn't paste the whole thing here.) My problem is that I call the function Hamil_vector_multiply 100-150 times in another function, so each time I create a new vector of threads, which then destroys itself. My questions:
1) Is it better to create the threads in the function which calls Hamil_vector_multiply and then pass a vector of threads to Hamil_vector_multiply, in order to avoid creating new threads each time?
2) Would it be better to attack the loop asynchronously (for instance, the first thread to finish an iteration starts the next available one)? If yes, can you point me to any literature describing asynchronous threading?
3) Are there maybe better ways of multithreading such a loop? (Without multithreading I have a loop from k=0 to k=N=8^14, which takes up a lot of time.)
4) I found several attempts to create a thread pool and job queue; would it be useful to use, for instance, a work pool like this one: https://codereview.stackexchange.com/questions/221617/thread-pool-c-implementation
My code works as it is supposed to (it gives the correct result), and it speeds up the program something like 10 times with 16 cores. But if you have other helpful comments not regarding multithreading, I would be grateful for every piece of advice.
Thank you very much in advance!
PS: The function which calls Hamil_vector_multiply 100-150 times is of the form:
void Lanczos::Build_Lanczos_Hamil(vec& initial_vec) {
    vec tmp(N);
    Hamil_vector_multiply(initial_vec, tmp);
    // some calculations
    for (int j = 0; j < 100; j++) {
        // something
        vec tmp2 = ...
        Hamil_vector_multiply(tmp2, tmp);
        // do something else -- not related
    }
}
Is it better to create threads in the function which calls Hamil_vector_multiply and then pass a vector of threads to Hamil_vector_multiply in order to avoid creating new threads each time?
If you're worried about performance, yes, it would help. What you're doing right now is essentially allocating a new heap block on every function call (I'm talking about the vector of threads). If you can allocate it beforehand and reuse it, you'll gain a little performance. There isn't anything wrong with the current approach, but you could save that repeated allocation.
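For illustration, a minimal sketch (a hypothetical Worker class, not the asker's Lanczos code) of keeping that vector alive across calls. Note that the std::thread objects themselves still have to be constructed and joined on every call; only the vector's heap block is reused:

#include <thread>
#include <vector>

class Worker {
    std::vector<std::thread> threads_;   // reused buffer, reserved once
    unsigned num_threads_;
public:
    explicit Worker(unsigned n) : num_threads_(n) { threads_.reserve(n); }

    void run_parallel() {
        threads_.clear();                // keeps the reserved capacity, no reallocation
        for (unsigned t = 0; t < num_threads_; ++t)
            threads_.emplace_back([t] { /* per-thread slice of the work */ (void)t; });
        for (auto& th : threads_) th.join();
    }
};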
Would it be better to attack the loop asynchronously (for instance, the first thread to finish an iteration starts the next available one)? If yes, can you point me to any literature describing asynchronous threading?
This might not be a good idea. If taking the next chunk of work requires the threads to share state, you will have to protect that state with a mutex. If they contend on it heavily, you can end up with performance close to the single-threaded case, because the other thread(s) will have to wait until the resource is unlocked and ready to be used.
Are there maybe better ways of multithreading such a loop? (Without multithreading I have a loop from k=0 to k=N=8^14, which takes up a lot of time.)
If your goal is to improve performance, if the work can be put into multiple threads, and most importantly if multithreading actually helps, then there is no reason not to do it. From what I can see, your implementation looks pretty neat. But keep in mind that starting a thread is itself a little costly (negligible compared to your performance gain), and load balancing will definitely improve performance even further.
But if you have other helpful comments not regarding multithreading, I would be grateful for every piece of advice.
If your load per thread might vary, it'll be a good investment to think about load balancing. Other than that, I don't see an issue. The major place to improve would be the logic itself; threads can only do so much if the logic takes a very long time.
Optional:
You can use std::async and std::future to implement the same thing. A future returned by std::async blocks in its destructor until the task has finished, so when the vector of futures goes out of scope, all outstanding work is guaranteed to be complete. But then it might interfere with your first question.
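A minimal, self-contained sketch of that std::async approach (placeholder kernel and names, not the Lanczos code), assuming C++11:

#include <cstdint>
#include <functional>
#include <future>
#include <vector>

// stand-in for the real kernel working on the half-open range [start, stop)
void process_range(std::uint64_t start, std::uint64_t stop, std::vector<double>& out) {
    for (std::uint64_t k = start; k < stop; ++k)
        out[k] = static_cast<double>(k) * 0.5;
}

void parallel_process(std::vector<double>& out, unsigned num_tasks) {
    std::vector<std::future<void>> futures;
    futures.reserve(num_tasks);
    const std::uint64_t n = out.size();
    for (unsigned t = 0; t < num_tasks; ++t) {
        std::uint64_t start = n * t / num_tasks;
        std::uint64_t stop  = n * (t + 1) / num_tasks;
        futures.emplace_back(std::async(std::launch::async,
                                        process_range, start, stop, std::ref(out)));
    }
    for (auto& f : futures) f.get();   // wait for completion, propagate exceptions
}

int main() {
    std::vector<double> out(1u << 20);
    parallel_process(out, 4);
}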
I am new to multithreaded programming, and I am aware that several similar questions have been asked on SO before; however, I would like to get an answer specific to my code.
I have two vectors of objects (v1 & v2) that I want to loop through and, depending on whether they meet some criteria, add these objects to a single vector, like so:
Non-Multithread Case
std::vector<hobj> validobjs;
int length = 70;
for(auto i = this->v1.begin(); i < this->v1.end(); ++i) {
    if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
        hobj obj(*i, length);
        validobjs.push_back(obj);
    }
}
for(auto j = this->v2.begin(); j < this->v2.end(); ++j) {
    if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
        hobj obj(*j, length);
        validobjs.push_back(obj);
    }
}
Multithread Case
std::vector<hobj> validobjs;
int length = 70;
#pragma omp parallel
{
    std::vector<hobj> threaded1; // Each thread has its own local vector
    #pragma omp for nowait firstprivate(length)
    for(auto i = this->v1.begin(); i < this->v1.end(); ++i) {
        if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
            hobj obj(*i, length);
            threaded1.push_back(obj);
        }
    }
    std::vector<hobj> threaded2; // Each thread has its own local vector
    #pragma omp for nowait firstprivate(length)
    for(auto j = this->v2.begin(); j < this->v2.end(); ++j) {
        if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
            hobj obj(*j, length);
            threaded2.push_back(obj);
        }
    }
    #pragma omp critical // Insert local vectors into the main vector one thread at a time
    {
        validobjs.insert(validobjs.end(), threaded1.begin(), threaded1.end());
        validobjs.insert(validobjs.end(), threaded2.begin(), threaded2.end());
    }
}
The non-multithreaded case is around 4x faster than the multithreaded case (~1.5 s vs ~6 s total time spent on the operation).
I am aware that the #pragma omp critical directive is a performance hit, but since I do not know the size of the validobjs vector beforehand, I cannot rely on random insertion by index.
So questions:
1) Is this kind of operation suited for multi-threading?
2) If yes to 1) - does the multithreaded code look reasonable?
3) Is there anything I can do to improve the performance to get it faster than the no-thread case?
Additional info:
The above code is nested within a much larger codebase that performs 10,000 to 100,000s of iterations (this outer loop does not use multithreading). I am aware that spawning threads also incurs a performance overhead, but as far as I am aware these threads are kept alive until the above code is executed again on the next iteration.
omp_set_num_threads is set to 32 (I'm on a 32-core machine).
Ubuntu, gcc 7.4
Cheers!
I'm no expert on multithreading, but I'll give it a try:
Is this kind of operation suited for multi-threading?
I would say yes. Especially if you have huge datasets, you could split them even further, running any number of filtering operations in parallel. But it depends on the amount of data you want to process: thread creation and synchronization are not free, and neither is the merging at the end of the threaded version.
Does the multithreaded code look reasonable?
I think you're on the right path in letting each thread work on independent data.
Is there anything I can do to improve the performance to get it faster than the no-thread case?
I see a few points that might improve performance:
The vectors will need to resize often, which is expensive. You can use reserve() to, well, reserve memory beforehand and thus reduce the number of reallocations (to 0 in the optimal case).
The same goes for the merging of the two local vectors at the end, which happens inside the critical section; first reserve:
validobjs.reserve(v1.size() + v2.size());
then merge.
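Putting the reserve points together, a minimal self-contained sketch (placeholder int data and filter, not the original hobj code; compile with -fopenmp):

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> input(1000000);
    for (std::size_t i = 0; i < input.size(); ++i) input[i] = static_cast<int>(i);

    std::vector<int> valid;
    valid.reserve(input.size());                 // worst case: every element passes

    #pragma omp parallel
    {
        std::vector<int> local;
        local.reserve(input.size() / omp_get_num_threads() + 1);

        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(input.size()); ++i)
            if (input[i] % 3 == 0)               // placeholder filter
                local.push_back(input[i]);

        #pragma omp critical                     // one merge per thread, no per-element locking
        valid.insert(valid.end(), local.begin(), local.end());
    }
    std::printf("%zu elements kept\n", valid.size());
}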
Copying objects from one vector to another can be expensive, depending on the size of the objects and on whether there is a custom copy constructor that executes additional code. Consider storing only indices of the valid elements, or pointers to them.
You could also try to replace elements in parallel in the resulting vector. That could be useful if default-constructing an element is cheap and copying is a bit expensive:
Filter the data in two threads as you do now.
Synchronise them and resize a vector to the required number of elements:
validobjs.resize(v1.size() + v2.size());
Let each thread write its elements to an independent part of the vector. For example, thread 1 writes to indices 0 to x and thread 2 writes to indices x + 1 to validobjs.size() - 1.
Although I'm not sure if this is entirely legal or if it is undefined behaviour.
You could also think about using std::list (a linked list). Concatenating linked lists or removing elements happens in constant time; however, adding elements is a bit slower than with a std::vector that has reserved memory.
Those were my thoughts on this; I hope there was something useful in there.
IMHO, you copy each element twice: first into threaded1/threaded2 and after that into validobjs. That can make your code slower. You could instead add elements directly into the single vector, using synchronization.
I am new to TBB and am trying to do a simple experiment. My data for the functions are:
int n = 9000000;
int *data = new int[n];
First I created a function without using TBB:
void _array(int* &data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = busyfunc(data[i])*123;
    }
}
It takes 0.456635 seconds.
And I also created a second function, this one using TBB:
void parallel_change_array(int* &data, int list_count) {
    // Instructional example - parallel version
    parallel_for(blocked_range<int>(0, list_count),
        [=](const blocked_range<int>& r) {
            for (int i = r.begin(); i < r.end(); i++) {
                data[i] = busyfunc(data[i])*123;
            }
        });
}
It takes 0.584889 seconds.
As for busyfunc(int m):
int busyfunc(int m)
{
    m *= 32;
    return m;
}
Can you tell me why the function without TBB takes less time than the one with TBB?
I think the problem is that the functions are simple, and it's easy to do the calculation without TBB.
First, busyfunc() seems not so busy, because 9M elements are computed in just half a second, which makes this example rather memory-bound (uncached memory operations take orders of magnitude more cycles than arithmetic operations). Memory-bound computations do not scale as well as compute-bound ones; e.g. plain memory copying usually scales up to no more than, say, 4x even when running on a much bigger number of cores/processors.
Also, memory-bound programs are more sensitive to NUMA effects, and since you allocated this array as contiguous memory using standard C++, by default it will be allocated entirely on the same memory node where the initialization occurs. This default can be altered by running with numactl -i all --.
And the last, but most important, thing is that TBB initializes its threads lazily and pretty slowly. I guess you do not intend to write an application which exits after 0.5 seconds spent on parallel computation. Thus a fair benchmark should take into account all the warm-up effects that are expected in the real application. At the very least, it has to wait until all the threads are up and running before starting measurements. This answer suggests one way to do that.
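To illustrate the warm-up point, here is a hedged sketch (a simplified stand-in for busyfunc, not the original benchmark) that runs the parallel loop once before timing it, so the worker threads already exist when the measurement starts:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 9000000;
    std::vector<int> data(n, 1);

    auto run = [&] {
        tbb::parallel_for(tbb::blocked_range<int>(0, n),
            [&](const tbb::blocked_range<int>& r) {
                for (int i = r.begin(); i < r.end(); ++i)
                    data[i] = data[i] * 32 * 123;        // stand-in for busyfunc(...)*123
            });
    };

    run();                                               // warm-up: spawns the TBB workers
    auto t0 = std::chrono::steady_clock::now();
    run();                                               // measured run
    auto t1 = std::chrono::steady_clock::now();
    std::printf("warm run: %.3f s\n", std::chrono::duration<double>(t1 - t0).count());
}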
[update] Please also refer to Alexey's answer for another possible reason lurking in compiler optimization differences.
In addition to Anton's answer, I recommend checking whether the compiler was able to optimize the code equivalently.
For a start, check the performance of the TBB version executed by a single thread, without real parallelism. You can use tbb::global_control or tbb::task_scheduler_init to limit the number of threads to 1, e.g.
tbb::global_control ctl(tbb::global_control::max_allowed_parallelism, 1);
The overheads of thread creation, as well as cache locality or NUMA effects, should not play a role when all the code is executed by one thread. Therefore you should see approximately the same performance as for the no-TBB version. If you do, then you have a scalability issue, and Anton explained possible reasons.
However, if you see that performance drops a lot, then it is a serial optimization issue. One known reason is that some compilers cannot optimize the loop over a blocked_range as well as they optimize the original loop; it has also been observed that storing r.end() in a local variable may help:
int rend = r.end();
for (int i = r.begin(); i < rend; i++) {
    data[i] = busyfunc(data[i])*123;
}
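For completeness, a hedged, self-contained harness for that single-thread comparison (oneTBB headers assumed; busyfunc is the questioner's trivial placeholder):

#include <tbb/blocked_range.h>
#include <tbb/global_control.h>
#include <tbb/parallel_for.h>
#include <chrono>
#include <cstdio>

static int busyfunc(int m) { return m * 32; }

int main() {
    const int n = 9000000;
    int* data = new int[n]();

    // Force TBB to run the loop on a single worker thread.
    tbb::global_control ctl(tbb::global_control::max_allowed_parallelism, 1);

    auto t0 = std::chrono::steady_clock::now();
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        [=](const tbb::blocked_range<int>& r) {
            int rend = r.end();                          // hoisted, as suggested above
            for (int i = r.begin(); i < rend; ++i)
                data[i] = busyfunc(data[i]) * 123;
        });
    auto t1 = std::chrono::steady_clock::now();
    std::printf("TBB, 1 thread: %.3f s\n", std::chrono::duration<double>(t1 - t0).count());

    delete[] data;
}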
I am running my code on an Intel Xeon X5680 CPU @ 3.33 GHz × 12. Here is fairly simple OpenMP pseudocode (the OpenMP parts are exact; the normal code in between is changed for compactness and clarity):
vector<int> myarray(arraylength, something);
omp_set_num_threads(3);
#pragma omp parallel
{
    #pragma omp for schedule(dynamic)
    for(int j=0; j<pr.max_iteration_limit; j++)
    {
        vector<int> temp_array(updated_array(a, b, myarray));
        for(int i=0; i<arraylength; i++)
        {
            #pragma omp atomic
            myarray[i] += temp_array[i];
        }
    }
}
All parameters taken by the updated_array function are copied so that there are no clashes. The basic structure of updated_array:
vector<int> updated_array(myClass1 a, vector<myClass2> b, vector<int> myarray)
{
    // lots of preparations, but obviously there are only local variables, since
    // the function only takes copies

    // the core code taking most of the time, which I will be measuring:
    double time_s = time(NULL);
    while(waiting_time < t_wait) // as long as needed
    {
        // a fairly short computation
        // generates variable: vector<int> another_array
        waiting_time++;
    }
    double time_f = time(NULL);
    cout << "Thread " << omp_get_thread_num() << " / " << omp_get_num_threads()
         << " runtime " << time_f - time_s << endl;

    // a few more changes to another_array
    return another_array;
}
Questions and my attempts to resolve it:
adding more threads (with omp_set_num_threads(3);) does create more threads, but each thread does the job slower. E.g. 1: 6s, 2: 10s, 3: 15s ... 12: 60s.
(By "job" I refer to the exact part of the code I pointed out as the core, NOT the whole omp loop, since it takes most of the time; this makes sure I am not missing anything additional.)
There are no rand() things happening inside the core code.
Dynamic or static scheduling doesn't make a difference here, of course (and I tried both).
There seems to be no sharing possible in any way or form, so I am running out of ideas completely... What can it be? I would be extremely grateful if you could help me with this (even with just ideas)!
P.S. The point of the code is to take myarray, do a bit of Monte Carlo on it within a single thread, and then collect the tiny changes and add/subtract them to/from the original array.
OpenMP may implement the atomic access using a mutex, in which case your code will suffer from heavy contention on that mutex. This will result in a significant performance hit.
If the work in updated_array() dominates the cost of the parallel loop, you're better off putting the whole of the second loop inside a critical section:
{   // body of parallel loop
    vector<int> temp_array = updated_array(a, b, myarray);

    #pragma omp critical(UpDateMyArray)
    for(int i = 0; i < arraylength; i++)
        myarray[i] += temp_array[i];
}
However, your code looks broken (essentially not thread-safe); see my comment.
It is safe to read an STL container from multiple parallel threads. However, the performance is terrible. Why?
I create a small object that stores some data in a multiset. This makes the constructors fairly expensive (about 5 usecs on my machine). I store hundreds of thousands of the small objects in a large multiset. Processing these objects is an independent business, so I split the work between threads running on a multi-core machine. Each thread reads the objects it needs from the large multiset and processes them.
The problem is that the reading from the big multiset does not proceed in parallel. It looks like the reads in one thread block the reads in the other.
The code below is the simplest I can make it and still show the problem. First it creates a large multiset containing 100,000 small objects, each containing its own empty multiset. Then it calls the multiset copy constructor twice in series, and then twice again in parallel.
A profiling tool shows that the serial copy constructors take about 0.23 secs, whereas the parallel ones take twice as long. Somehow the parallel copies are interfering with each other.
// a trivial class with a significant ctor and the ability to populate an associative container
class cTest
{
    multiset<int> mine;
    int id;
public:
    cTest( int i ) : id( i ) {}
    bool operator<(const cTest& o) const { return id < o.id; }
};

// add 100,000 objects to the multiset
void Populate( multiset<cTest>& m )
{
    for( int k = 0; k < 100000; k++ )
    {
        m.insert(cTest(k));
    }
}

// copy construct multiset, called from mainline
void Copy( const multiset<cTest>& m )
{
    cRavenProfile profile("copy_main");
    multiset<cTest> copy( m );
}

// copy construct multiset, called from thread
void Copy2( const multiset<cTest>& m )
{
    cRavenProfile profile("copy_thread");
    multiset<cTest> copy( m );
}

int _tmain(int argc, _TCHAR* argv[])
{
    cRavenProfile profile("test");
    profile.Start();

    multiset<cTest> master;
    Populate( master );

    // two calls to copy ctor from mainline
    Copy( master );
    Copy( master );

    // call copy ctor in parallel
    boost::thread* pt1 = new boost::thread( boost::bind( Copy2, master ));
    boost::thread* pt2 = new boost::thread( boost::bind( Copy2, master ));
    pt1->join();
    pt2->join();

    // display profiler results
    cRavenProfile print_profile;

    return 0;
}
Here is the output:
Scope          Calls    Mean (secs)    Total
copy_thread    2        0.472498       0.944997
copy_main      2        0.233529       0.467058
You mentioned copy constructors. I assume that these also allocate memory from the heap?
Allocating heap memory in multiple threads is a big mistake.
The standard allocator is probably a single-pool, locked implementation. You either need to avoid using heap memory (allocate on the stack) or you need a thread-optimized heap allocator.
OK, after spending the majority of the week on this issue, I have the fix.
There were two problems with the code I posted in the question:
boost::bind makes a copy of its parameters, even if the underlying function uses call by reference. Copying the container is expensive, and so the multi-threaded version was working too hard. (No-one noticed this!) To pass the container by reference, I needed to use this code:
boost::thread* pt1 = new boost::thread( boost::bind( Copy2, boost::cref(master) ));
As Zan Lynx pointed out, the default container allocates memory for its contents on the global heap using a thread-safe singleton memory allocator, resulting in great contention between the threads as they create hundreds of thousands of objects through the same allocator instance. (Since this was the crux of the mystery, I accepted Zan Lynx's answer.)
The fix for #1 is straightforward, as presented above.
The fix for #2 is, as several people pointed out, to replace the default STL allocator with a thread-specific one. This is quite the challenge, and no-one offered a specific source for such an allocator.
I spent some time looking for a thread-specific allocator "off the shelf". The best I found was Hoard (hoard.org). This provided a significant performance improvement; however, Hoard has some serious drawbacks:
I experienced some crashes during testing
Commercial licensing is expensive
It 'hooks' system calls to malloc, a technique I consider dodgy.
So I decided to roll my own thread-specific memory allocator, based on boost::pool and boost::thread_specific_ptr. This required a small amount of, IMHO, seriously advanced C++ code, but it now seems to be working well.
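For anyone curious, here is a hedged sketch of the general idea only (it is not my actual allocator, which used boost::pool together with boost::thread_specific_ptr; this version uses C++11 thread_local for brevity). Each thread gets its own boost::pool, so node allocations never contend on a shared heap lock. The important caveat is that memory must be allocated and freed on the same thread, i.e. each container has to be built and destroyed by the thread that owns it (which is true for the Copy2 calls above).

#include <boost/pool/pool.hpp>
#include <cstddef>
#include <new>
#include <set>

template <class T>
struct per_thread_pool_allocator
{
    using value_type = T;

    per_thread_pool_allocator() = default;
    template <class U>
    per_thread_pool_allocator(const per_thread_pool_allocator<U>&) {}

    T* allocate(std::size_t n)
    {
        if (n == 1)                                   // the common case: one container node
            return static_cast<T*>(pool().malloc());
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }

    void deallocate(T* p, std::size_t n)
    {
        if (n == 1)
            pool().free(p);
        else
            ::operator delete(p);
    }

private:
    static boost::pool<>& pool()
    {
        static thread_local boost::pool<> p(sizeof(T));  // one pool per thread, no locking
        return p;
    }
};

// Instances only interchange memory safely within one thread; see the caveat above.
template <class T, class U>
bool operator==(const per_thread_pool_allocator<T>&, const per_thread_pool_allocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const per_thread_pool_allocator<T>&, const per_thread_pool_allocator<U>&) { return false; }

// usage sketch: build the copy with the per-thread allocator from the original multiset
// std::multiset<cTest, std::less<cTest>, per_thread_pool_allocator<cTest>> copy(m.begin(), m.end());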
What's the scheduling of your threads? If you run two threads, each doing considerable work, the threads most likely start together and end together. Hence the profiler thinks that the execution of each thread took twice as much time, because during the time each thread was executing, the work was done twice. The execution of each of the sequential calls, by contrast, took the normal time.
step:       0 1 2 3 4 5 6 7 8 9
threaded:   1,2,1,2,1,2,1,2,1,2
sequential: 1,1,1,1,1,2,2,2,2,2
Thread 1 started at step 0 and ended at step 8, so its execution time shows as 8; thread 2 started at step 1 and ended at step 9, so its execution time is also 8. Two sequential runs show 5 steps each. So in the resulting table you'll see 16 for the concurrent version and 10 for the sequential one.
Assuming that all of the above is true and there is a considerable number of steps, the ratio of the execution times shown by your profiler should be about two. The experiment does not contradict this hypothesis.
Since I am not sure how your profiler works, it is hard to tell.
What I would prefer to see is some explicit timing around the code:
Do the work a number of times to average out anything that causes a context switch:
for(int loop = 0; loop < 100; ++loop)
{
    ts = timer();
    Copy( master );
    Copy( master );
    te = timer();
    tt += te - ts;
}
tt /= 100;
etc
Compare this with your profiler results.
To answer Pavel Shved with more detail, here is how the majority of my code runs:
step:       0 1 2 3 4 5 6 7 8 9
core1:      1,1,1,1,1
core2:      2,2,2,2,2
sequential: 1,1,1,1,1,2,2,2,2,2
Only the parallel reads interfere with each other.
As an experiment, I replaced the big multiset with an array of pointers to cTest. The code now has huge memory leaks, but never mind. The interesting thing is that the relative performance is worse: running the copy constructors in parallel slows them down 4 times!
Scope                Calls    Mean (secs)    Total
copy_array_thread    2        0.454432       0.908864
copy_array_main      2        0.116905       0.233811