I am quite new to C++ and would really appreciate some advice on multithreading using std::thread.
I have the following piece of code, which splits a for loop of N = 8^L iterations (up to 8^14) across threads:
void Lanczos::Hamil_vector_multiply(vec& initial_vec, vec& result_vec) {
    result_vec.zeros();
    std::vector<arma::vec> result_threaded(num_of_threads);
    std::vector<std::thread> threads;
    threads.reserve(num_of_threads);
    for (int t = 0; t < num_of_threads; t++) {
        u64 start = t * N / num_of_threads;
        u64 stop = ((t + 1) == num_of_threads ? N : N * (t + 1) / num_of_threads);
        result_threaded[t] = arma::vec(stop - start, fill::zeros);
        threads.emplace_back(&Lanczos::Hamil_vector_multiply_kernel, this,
                             start, stop, ref(initial_vec), ref(result_vec));
    }
    for (auto& t : threads) t.join();
}
where Lanczos is my general class (its contents are not important here), and the member function Hamil_vector_multiply_kernel has the form:
void Lanczos::Hamil_vector_multiply_kernel(u64 start, u64 stop, vec& initial_vec, vec& result_vec_threaded) {
    // some declarations
    for (u64 k = start; k < stop; k++) {
        // some preliminary work
        for (int j = 0; j <= L - 1; j++) {
            // a bunch of if-else statements, where result_vec_threaded(k) += something
        }
    }
}
(The code is quite long, so I didn't paste the whole thing here.) My problem is that I call Hamil_vector_multiply 100-150 times in another function, so each call creates a new vector of threads, which is then destroyed. My questions:
1) Is it better to create the threads in the function that calls Hamil_vector_multiply and pass a vector of threads into Hamil_vector_multiply, in order to avoid creating new threads each time?
2) Would it be better to attack the loop asynchronously (for instance, the first thread to finish an iteration starts on the next available one)? If yes, can you point me to any literature on asynchronous threading?
3) Are there better ways to multithread such a loop? (Without multithreading, the loop from k = 0 to k = N = 8^14 takes up a lot of time.)
I also found several attempts at a thread pool and job queue; would it be useful to use a worker pool like this one: https://codereview.stackexchange.com/questions/221617/thread-pool-c-implementation
My code works as it is supposed to (it gives the correct result), and it boosts the speed of the program something like 10 times with 16 cores. But if you have other helpful comments, even not regarding multithreading, I would be grateful for every piece of advice.
Thank you very much in advance!
PS: The function which calls Hamil_vector_multiply 100-150 times has the form:
void Lanczos::Build_Lanczos_Hamil(vec& initial_vec) {
    vec tmp(N);
    Hamil_vector_multiply(initial_vec, tmp);
    // some calculations
    for (int j = 0; j < 100; j++) {
        // something
        vec tmp2 = ...
        Hamil_vector_multiply(tmp2, tmp);
        // do something else -- not related
    }
}
Is it better to create the threads in the function that calls Hamil_vector_multiply and pass a vector of threads into Hamil_vector_multiply, in order to avoid creating new threads each time?
If you're worried about performance, yes, it would help. What you're doing right now is essentially allocating a new heap block on every function call (I'm talking about the vector). There is no correctness issue with doing it as you do, but if you can allocate beforehand, you will gain some performance.
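For illustration, a minimal sketch of that idea (all names are hypothetical, not from the question's code): the caller owns the thread vector, so its heap buffer is allocated once and only cleared between calls; the threads themselves are still created and joined on every call.
#include <thread>
#include <vector>

// Reuse the vector's capacity across calls: clear() keeps the allocation.
void multiply_once(std::vector<std::thread>& workers, int num_threads) {
    workers.clear();                        // no deallocation, capacity kept
    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([t] { /* per-thread kernel work */ });
    for (auto& w : workers) w.join();
}

int main() {
    std::vector<std::thread> workers;
    workers.reserve(16);                    // one heap allocation up front
    for (int call = 0; call < 100; ++call)  // the 100-150 repeated calls
        multiply_once(workers, 16);
}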
Would it be better to attack the loop asynchronously (for instance, the first thread to finish an iteration starts on the next available one)? If yes, can you point me to any literature on asynchronous threading?
This might not be a good idea. You would have to guard shared data with mutexes, and if the threads contend heavily on those locks you can end up with roughly single-thread performance, because each thread has to wait until the resource is unlocked and ready to be used.
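That said, "first free thread takes the next work item" does not have to be mutex-based: a common low-overhead pattern is a shared atomic counter that hands out chunk indices, so the only shared write is one fetch_add per chunk. A minimal sketch (all names hypothetical):
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Threads repeatedly grab the next fixed-size chunk until the range is used up.
void process_dynamically(std::size_t N, int num_threads) {
    const std::size_t chunk = 1024;
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t)
        workers.emplace_back([&] {
            for (;;) {
                std::size_t start = next.fetch_add(chunk);
                if (start >= N) break;
                std::size_t stop = std::min(start + chunk, N);
                for (std::size_t k = start; k < stop; ++k) {
                    // per-element work, writing only to index k
                }
            }
        });
    for (auto& w : workers) w.join();
}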
Are there better ways to multithread such a loop? (Without multithreading, the loop from k = 0 to k = N = 8^14 takes up a lot of time.)
If your goal is to improve performance, the work can be split across threads, and (most importantly) multithreading actually helps, then there is no reason not to do it. From what I can see, your implementation looks pretty neat. Keep in mind that starting a thread has some cost of its own (negligible compared to your performance gain), and that load balancing can improve performance even further.
But if you have other helpful comments, even not regarding multithreading, I would be grateful for every piece of advice.
If your load per thread can vary, it is a good investment to think about load balancing. Other than that, I don't see an issue. The main place left to improve is the logic itself; threads can only do so much if the per-iteration work takes a very long time.
Optional:
You can implement the same thing with std::async and std::future, with the added bonus of exception propagation through get(). Note that a future returned by std::async blocks in its destructor until its task has finished, so letting the vector of futures go out of scope implicitly waits for all the tasks. (This doesn't directly help with reusing threads across calls, though, which relates to your first question.)
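A sketch of that approach, with the same start/stop splitting as in the question (names hypothetical):
#include <future>
#include <vector>

// One task per sub-range; get() waits for completion and would rethrow
// any exception raised inside the task.
void multiply_async(std::size_t N, int num_threads) {
    std::vector<std::future<void>> tasks;
    for (int t = 0; t < num_threads; ++t) {
        std::size_t start = t * N / num_threads;
        std::size_t stop  = (t + 1) * N / num_threads;
        tasks.push_back(std::async(std::launch::async, [start, stop] {
            for (std::size_t k = start; k < stop; ++k) { /* kernel body */ }
        }));
    }
    for (auto& f : tasks) f.get();
}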
Related
I am new to multithreaded programming and I am aware several similar questions have been asked on SO before; however, I would like to get an answer specific to my code.
I have two vectors of objects (v1 & v2) that I want to loop through and depending on if they meet some criteria, add these objects to a single vector like so:
Non-Multithread Case
std::vector<hobj> validobjs;
int length = 70;
for (auto i = this->v1.begin(); i < this->v1.end(); ++i) {
    if (!(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag()) {
        hobj obj(*i, length);
        validobjs.push_back(obj);
    }
}
for (auto j = this->v2.begin(); j < this->v2.end(); ++j) {
    if (!(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag()) {
        hobj obj(*j, length);
        validobjs.push_back(obj);
    }
}
Multithread Case
std::vector<hobj> validobjs;
int length = 70;
#pragma omp parallel
{
    std::vector<hobj> threaded1; // Each thread has its own local vector
    #pragma omp for nowait firstprivate(length)
    for (auto i = this->v1.begin(); i < this->v1.end(); ++i) {
        if (!(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag()) {
            hobj obj(*i, length);
            threaded1.push_back(obj);
        }
    }
    std::vector<hobj> threaded2; // Each thread has its own local vector
    #pragma omp for nowait firstprivate(length)
    for (auto j = this->v2.begin(); j < this->v2.end(); ++j) {
        if (!(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag()) {
            hobj obj(*j, length);
            threaded2.push_back(obj);
        }
    }
    #pragma omp critical // Insert local vectors into the main vector one thread at a time
    {
        validobjs.insert(validobjs.end(), threaded1.begin(), threaded1.end());
        validobjs.insert(validobjs.end(), threaded2.begin(), threaded2.end());
    }
}
The non-multithreaded case ends up around 4x faster overall than the multithreaded case (~1.5s vs ~6s).
I am aware that the #pragma omp critical directive is a performance hit, but since I do not know the size of the validobjs vector beforehand, I cannot rely on random insertion by index.
So questions:
1) Is this kind of operation suited for multi-threading?
2) If yes to 1) - does the multithreaded code look reasonable?
3) Is there anything I can do to improve the performance to get it faster than the no-thread case?
Additional info:
The above code is nested within a much larger codebase that performs 10,000-100,000s of iterations (that outer loop is not multithreaded). I am aware that spawning threads also incurs a performance overhead, but as far as I am aware these threads are kept alive until the above code is executed again on the next iteration.
omp_set_num_threads is set to 32 (I'm on a 32 core machine).
Ubuntu, gcc 7.4
Cheers!
I'm no expert on multithreading, but I'll give it a try:
Is this kind of operation suited for multi-threading?
I would say yes. Especially if you have huge datasets, you could split them even further, running any number of filtering operations in parallel. But it depends on the amount of data you want to process; thread creation and synchronization are not free, and neither is the merging at the end of the threaded version.
Does the multithreaded code look reasonable?
I think you're on the right path in letting each thread work on independent data.
Is there anything I can do to improve the performance to get it faster than the no-thread case?
I see a few points that might improve performance:
The vectors will need to resize often, which is expensive. You can use reserve() to, well, reserve memory beforehand and thus reduce the number of reallocations (to 0 in the optimal case).
The same goes for merging the two vectors at the end, which happens inside the critical section: first reserve,
validobjs.reserve(v1.size() + v2.size());
then merge.
Copying objects from one vector to another can be expensive, depending on the size of the objects you copy and if there is a custom copy-constructor that executes some more code or not. Consider storing only indices of the valid elements or pointers to valid elements.
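For instance, a sketch of the index-based variant, using the names from the question (and assuming the elements of v1 are pointer-like, as the double dereference suggests):
// Store indices of the valid elements instead of copy-constructed hobj
// objects; construct or read the objects later, only where needed.
std::vector<std::size_t> validIdx;
validIdx.reserve(v1.size());
for (std::size_t i = 0; i < v1.size(); ++i)
    if (!v1[i]->get_IgnoreFlag() && !v1[i]->get_ErrorFlag())
        validIdx.push_back(i);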
You could also try to fill the resulting vector in parallel. That can be useful if default-constructing an element is cheap and copying is somewhat expensive:
Filter the data in two threads as you do now.
Synchronize them and allocate a vector with the right number of elements:
validobjs.resize(v1.size() + v2.size());
Let each thread write elements to independent parts of the vector. For example, thread 1 writes to indices 0 to x and thread 2 writes to indices x + 1 to validobjs.size() - 1.
(This is in fact well-defined: the standard allows concurrent access to different elements of the same vector, as long as no two threads touch the same element. A sketch follows below.)
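A sketch of those steps, assuming for simplicity just two filtered vectors (one per input loop) and that hobj is default-constructible so resize() works:
validobjs.resize(threaded1.size() + threaded2.size());
#pragma omp parallel sections
{
    #pragma omp section   // thread 1: the front part
    std::copy(threaded1.begin(), threaded1.end(), validobjs.begin());
    #pragma omp section   // thread 2: the back part, disjoint from the front
    std::copy(threaded2.begin(), threaded2.end(),
              validobjs.begin() + threaded1.size());
}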
You could also think about using std::list (a linked list). Concatenating linked lists or removing elements happens in constant time, but adding elements is a bit slower than on a std::vector with reserved memory.
Those were my thoughts on this; I hope there was something useful in it.
IMHO:
You copy each element twice: into threaded1/threaded2, and after that into validobjs.
That can make your code slower.
Alternatively, you can add elements to a single shared vector using synchronization.
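A sketch of that alternative in the question's OpenMP style; whether it beats the two-copy version depends on how often the condition holds, since every push enters the critical section:
std::vector<hobj> validobjs;
#pragma omp parallel for firstprivate(length)
for (long i = 0; i < (long)v1.size(); ++i) {
    if (!v1[i]->get_IgnoreFlag() && !v1[i]->get_ErrorFlag()) {
        hobj obj(v1[i], length);
        #pragma omp critical   // one writer at a time
        validobjs.push_back(obj);
    }
}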
I am new to TBB and tried a simple experiment.
My data for the functions are:
int n = 9000000;
int *data = new int[n];
First, I created a function without using TBB:
void _array(int* &data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = busyfunc(data[i]) * 123;
    }
}
It takes 0.456635 seconds.
Then I created a second function, this time using TBB:
void parallel_change_array(int* &data, int list_count) {
    // Instructional example - parallel version
    parallel_for(blocked_range<int>(0, list_count),
        [=](const blocked_range<int>& r) {
            for (int i = r.begin(); i < r.end(); i++) {
                data[i] = busyfunc(data[i]) * 123;
            }
        });
}
It takes 0.584889 seconds.
As for busyfunc(int m):
int busyfunc(int m)
{
    m *= 32;
    return m;
}
Can you tell me why the function without TBB takes less time than the one with TBB?
I suspect the problem is that the function is so simple that it is not worth parallelizing with TBB.
First, busyfunc() seems not so busy, because 9M elements are computed in just half a second, which makes this example rather memory-bound (uncached memory operations take orders of magnitude more cycles than arithmetic operations). Memory-bound computations do not scale as well as compute-bound ones; e.g. plain memory copying usually scales up to no more than, say, 4x even when running on a much bigger number of cores/processors.
Also, memory-bound programs are more sensitive to NUMA effects, and since you allocated this array as contiguous memory using standard C++, it will by default be allocated entirely on the same memory node where the initialization occurs. This default can be altered by running with numactl -i all --.
And the last but most important thing is that TBB initializes its threads lazily and pretty slowly. I guess you do not intend to write an application which exits after spending 0.5 seconds on parallel computation. Thus, a fair benchmark should take into account all the warm-up effects that are expected in a real application. At the very least, it has to wait until all the threads are up and running before starting measurements. This answer suggests one way to do that.
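For example, a sketch of such a warm-up using the function from the question:
#include <chrono>
#include <iostream>

parallel_change_array(data, n);   // warm-up: creates TBB's worker threads
auto t0 = std::chrono::steady_clock::now();
parallel_change_array(data, n);   // measure the second, warmed-up run only
auto t1 = std::chrono::steady_clock::now();
std::cout << std::chrono::duration<double>(t1 - t0).count() << " s\n";
Running the function twice transforms data twice, which is fine here since only the timing matters.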
[update] Please also refer to Alexey's answer for another possible reason lurking in compiler optimization differences.
In addition to Anton's answer, I recommend checking whether the compiler was able to optimize the code equivalently.
To start, check the performance of the TBB version executed by a single thread, without real parallelism. You can use tbb::global_control or tbb::task_scheduler_init to limit the number of threads to 1, e.g.
tbb::global_control ctl(tbb::global_control::max_allowed_parallelism, 1);
The overheads of thread creation, as well as cache locality or NUMA effects, should not play a role when all the code is executed by one thread. Therefore you should see approximately the same performance as for the no-TBB version. If you do, then you have a scalability issue, and Anton explained possible reasons.
However, if you see that performance drops a lot, then you have a serial optimization issue. One known reason is that some compilers cannot optimize the loop over a blocked_range as well as they optimize the original loop; it has also been observed that storing r.end() in a local variable may help:
int rend = r.end();
for (int i = r.begin(); i < rend; i++) {
    data[i] = busyfunc(data[i]) * 123;
}
I'm trying to parallelize a simulator written in C++ using OpenMP pragmas.
I have a basic understanding of it but no experience.
The code below shows the main method to parallelize:
void run(long long end) {
    while (now + dt <= end) {
        now += dt;
        for (unsigned int i = 0; i < populations.size(); i++) {
            populations[i]->update(now);
        }
    }
}
where populations is a std::vector of instances of the class Population. Each population updates its own elements as follows:
void Population::update(long long ts) {
    for (unsigned int j = 0; j < this->size(); j++) {
        if (check(j, ts)) {
            doit(ts, j);
        }
    }
}
Since each population has a different size, the loop in Population::update() takes a varying amount of time, leading to suboptimal speedups. By adding #pragma omp parallel for schedule(static) to the loop in run(), I get a 2x speedup with 4 threads; however, it drops off at 8 threads.
I am aware of the schedule(dynamic) clause, which allows balancing the computation between the threads. However, when I tried to dispatch the iterations dynamically, I did not observe any improvement.
Am I going in the right direction? Do you think playing with the chunk size would help? Any suggestion is appreciated!
There are two things to distinguish: the influence of the number of threads, and the scheduling policy.
For the number of threads: having more threads than cores usually slows down performance because of context switches, so it depends on the number of cores on your machine.
The difference in the generated code (at least as far as I remember) between static and dynamic is that with static scheduling the loop iterations are divided equally among the threads up front, while with dynamic scheduling the distribution is computed at runtime (after every iteration, the OpenMP runtime is queried via __builtin_GOMP_loop_dynamic_next).
The likely reason for the slowdown observed when switching to dynamic is that the loop doesn't contain enough iterations/computation, so the overhead of distributing the iterations dynamically is not covered by the performance gain.
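If you still want to experiment, an explicit chunk size hands out several iterations per dispatch and amortizes that overhead; schedule(guided) is another option worth trying. A sketch on the question's loop:
// Hand out 4 populations per dispatch instead of 1.
#pragma omp parallel for schedule(dynamic, 4)
for (unsigned int i = 0; i < populations.size(); i++)
    populations[i]->update(now);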
(I assumed that every population instance doesn't share data with others)
Just throwing out ideas, hope this helps =)
I have some code that can make use of parallelism for an efficiency gain. Since my PC has a dual-core processor, I tried running the code on two threads. So I wrote the code below (this is a very simplified version of it):
void Evaluator::evaluate(vector<inpType> input, bool parallel) {
    std::thread t0;
    if (parallel) {
        // Evaluate half of the input in a spawned thread
        t0 = std::thread(&Evaluator::evaluatePart, this, std::ref(input),
                         0, input.size()/2 + input.size()%2);
        // Evaluate the other half of the input in this thread
        evaluatePart(input, input.size()/2 + input.size()%2, input.size()/2);
    } else {
        // Sequentially evaluate all of the input
        evaluatePart(input, 0, input.size());
    }
    // some other code
    // after finishing everything, join the thread
    if (parallel) t0.join();
}
void Evaluator::evaluatePart(vector<inpType>& input, int start, int count) {
    for (int i = start; i < start + count; i++) {
        evaluateSingle(input[i]);
    }
}

void Evaluator::evaluateSingle(inpType& input) {
    // do stuff with input
    // note: this reads a vector<int> member of the Evaluator object; not sure it matters though
}
Running sequentially takes around 3 ms, but running in parallel takes around 6 ms. Does that mean spawning a thread takes so much time that it is more efficient to just evaluate sequentially? Or am I doing something wrong?
Note that I don't use any locking mechanisms, because the evaluations are independent of each other. Every evaluateSingle reads from a vector that is a member of the Evaluator object but only alters the single input that was given to it, hence there is no need for locking.
Update
I apologize that I didn't make this clear: the above is more pseudocode describing in the abstract what my code looks like. It will not work or compile, but mine does, so that is not the issue. Anyway, I fixed the t0 scope issue in this code.
Also, the input size is around 38,000, which I think is large enough to make use of parallelism.
Update
I tried increasing the size of input to 5,000,000 but that didn't help. Sequential is still faster than multi-threaded.
Update
I tried increasing the number of threads while splitting the vector evenly between them for evaluation, and got some interesting results.
Note that I have an i7-7500U CPU that can run 4 threads in parallel. This leaves me with two questions:
1) Why does creating 4 or more threads start to show a performance improvement compared with 2 or 3 threads?
2) Why is creating more than 4 threads more efficient than exactly 4 (the maximum the CPU can run concurrently)?
I wrote the following parallel code for examining all elements in a vector of vectors. I store only those elements from the vector<vector<int> > which satisfy a given condition. However, some of the inner vectors are pretty large while others are pretty small, so my code spends a long time in thread.join(). Can someone please suggest how I can improve its performance?
#include <algorithm>
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>
using std::vector;

void check_if_condition(vector<int>& a, vector<int>& satisfyingElements)
{
    for (vector<int>::iterator i1 = a.begin(), l1 = a.end(); i1 != l1; ++i1)
        if (some_check_condition(*i1))
            satisfyingElements.push_back(*i1);
}

std::atomic<size_t> numPassed{0};

void doWork(std::vector<vector<int> >& myVec, std::vector<vector<int> >& results, size_t current, size_t end)
{
    end = std::min(end, myVec.size());
    for (; current < end; ++current) {
        vector<int> satisfyingElements;
        check_if_condition(myVec[current], satisfyingElements);
        if (!satisfyingElements.empty()) {
            numPassed += satisfyingElements.size();
            results[current] = satisfyingElements;
        }
    }
}
int main()
{
    std::vector<std::vector<int> > myVec(1000000);
    std::vector<std::vector<int> > results(myVec.size());
    unsigned numparallelThreads = std::thread::hardware_concurrency();
    std::vector<std::thread> parallelThreads;
    auto blockSize = myVec.size() / numparallelThreads;
    for (size_t i = 0; i < numparallelThreads - 1; ++i) {
        parallelThreads.emplace_back(doWork, std::ref(myVec), std::ref(results),
                                     i * blockSize, (i + 1) * blockSize);
    }
    // also do work in this thread
    doWork(myVec, results, (numparallelThreads - 1) * blockSize, myVec.size());
    for (auto& thread : parallelThreads)
        thread.join();
    std::vector<int> storage;
    storage.reserve(numPassed.load());
    for (auto itRes = results.begin(), endRes = results.end(); itRes != endRes; ++itRes) {
        if (!itRes->empty())
            storage.insert(storage.end(), itRes->begin(), itRes->end());
    }
    std::cout << "Done" << std::endl;
}
It would be nice if you could give some idea of the scale of those 'large' inner vectors, just to see how bad the problem is.
I think, however, that your problem is this:
for (auto& thread : parallelThreads)
    thread.join();
This goes through all the threads sequentially and waits until each finishes, and only then looks at the next one. For a thread pool, you want to wait until every thread is done. This can be done by having each thread notify a condition_variable when it finishes, on which you can wait.
Looking at your implementation, the bigger issue here is that your worker threads are not balanced in their workload.
To get a more balanced load across all of your threads, you need to flatten your data structure, so that the different worker threads can process similarly sized chunks of data. I am not sure where your data is coming from, but having a vector of vectors in an application that deals with large data sets doesn't sound like a great idea. Either process the existing vector of vectors into a single one, or read the data in that way if possible. If you need the row number for your processing, you can keep a vector of start-end ranges from which you can recover the row number, as sketched below.
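A sketch of such a flattening (rowStart is a hypothetical name):
// One contiguous data vector plus row offsets: element j of row r lives
// at flat[rowStart[r] + j], and threads can split flat into equal ranges.
std::vector<int> flat;
std::vector<size_t> rowStart;
rowStart.reserve(myVec.size() + 1);
for (const auto& row : myVec) {
    rowStart.push_back(flat.size());
    flat.insert(flat.end(), row.begin(), row.end());
}
rowStart.push_back(flat.size());   // sentinel: total element count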
Once you have a single big vector, you can break it down into equal-sized chunks to feed to the worker threads. Second, you don't want to build vectors on the stack and push them into another vector, because chances are you will run into memory-allocation contention while your threads are working. Allocating memory is a global state change and as such requires some level of locking (with proper address partitioning it could be avoided, though). As a rule of thumb, whenever you are looking for performance you should remove dynamic allocation from the performance-critical parts.
In this case, your threads could rather 'mark' elements as satisfying the condition, instead of building vectors of the satisfying elements. Once that's done, you can iterate over only the good ones without pushing or copying anything. Such a solution would be less wasteful, as sketched below.
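A sketch of the marking idea on the flattened data (keep is a hypothetical name; each worker touches only its own disjoint range of it):
std::vector<char> keep(flat.size(), 0);
// inside each worker, for its own [begin, end) range:
//     for (size_t i = begin; i < end; ++i)
//         keep[i] = some_check_condition(flat[i]) ? 1 : 0;
// afterwards, a single-threaded pass collects the survivors, no locking:
std::vector<int> storage;
for (size_t i = 0; i < flat.size(); ++i)
    if (keep[i]) storage.push_back(flat[i]);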
In fact, if I were you, I would first try to solve this issue on a single thread, following the suggestions above. If you get rid of the vector-of-vectors structure and iterate through elements conditionally (this can be as simple as using one of the *_if algorithms the C++11 standard library provides), you could end up with decent enough performance. Only at that point is it worth looking at delegating chunks of this work to worker threads; as the code stands, there is very little justification for using worker threads just to filter. Do as little writing and moving as you can, and you will gain a lot of performance. Parallelization only works well in certain circumstances.
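On the flattened vector, that single-threaded baseline can be as short as:
#include <algorithm>
#include <iterator>

std::vector<int> storage;
std::copy_if(flat.begin(), flat.end(), std::back_inserter(storage),
             some_check_condition);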