I wrote the following parallel code to examine all elements in a vector of vectors, keeping only those elements of the vector<vector<int> > that satisfy a given condition. However, some of the inner vectors are quite large while others are quite small, so my code spends a long time in thread.join(). Can someone please suggest how I can improve the performance of my code?
#include <algorithm>
#include <atomic>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>
using std::vector;

std::atomic<int> numPassed{0}; // total number of satisfying elements, shared across threads

void check_if_condition(vector<int>& a, vector<int>& satisfyingElements)
{
    for (vector<int>::iterator i1 = a.begin(), l1 = a.end(); i1 != l1; ++i1)
        if (some_check_condition(*i1))
            satisfyingElements.push_back(*i1);
}

void doWork(vector<vector<int> >& myVec, vector<vector<int> >& results, size_t current, size_t end)
{
    end = std::min(end, myVec.size());
    for (; current < end; ++current) {
        vector<int> satisfyingElements;
        check_if_condition(myVec[current], satisfyingElements);
        if (!satisfyingElements.empty()) {
            numPassed += (int)satisfyingElements.size();
            results[current] = satisfyingElements;
        }
    }
}
int main()
{
    std::vector<std::vector<int> > myVec(1000000);
    std::vector<std::vector<int> > results(myVec.size());
    unsigned numparallelThreads = std::thread::hardware_concurrency();
    std::vector<std::thread> parallelThreads;
    auto blockSize = myVec.size() / numparallelThreads;
    for (size_t i = 0; i < numparallelThreads - 1; ++i) {
        parallelThreads.emplace_back(doWork, std::ref(myVec), std::ref(results), i * blockSize, (i + 1) * blockSize);
    }
    // also do work in this thread
    doWork(myVec, results, (numparallelThreads - 1) * blockSize, myVec.size());
    for (auto& thread : parallelThreads)
        thread.join();
    std::vector<int> storage;
    storage.reserve(numPassed.load());
    auto itRes = results.begin();
    auto itmyVec = myVec.begin();
    auto endRes = results.end();
    for (; itRes != endRes; ++itRes, ++itmyVec) {
        if (!itRes->empty())
            storage.insert(storage.end(), itRes->begin(), itRes->end()); // append at the end, not the front
    }
    std::cout << "Done" << std::endl;
}
It would be nice to know the scale of those 'large' inner vectors, just to see how bad the problem is.
I suspect, however, that your problem is this:
for(auto& thread : parallelThreads)
    thread.join();
This loops over all the threads sequentially, waiting for each one to finish before looking at the next. For a thread pool, you want to wait until every thread is done. That can be done with a condition_variable: before each thread finishes, it notifies the condition_variable, on which you wait.
Looking at your implementation, the bigger issue here is that your worker threads are not balanced in the amount of work they consume.
To get a more balanced load on all of your threads, you need to flatten your data structure so that the different worker threads can process similarly sized chunks of data. I am not sure where your data is coming from, but having a vector of vectors in an application that deals with large data sets doesn't sound like a great idea. Either process the existing vector of vectors into a single one, or read the data in like that in the first place, if possible. If you need the row number for your processing, you can keep a vector of start-end ranges from which you can recover the row number.
Once you have a single big vector, you can break it into equal-sized chunks to feed to the worker threads. Second, you don't want to build vectors on the stack and push them into yet another vector, because chances are you are running into memory-allocation issues while your threads are working. Allocating memory is a global state change, and as such it will require some level of locking (though with proper address partitioning it could be avoided). As a rule of thumb, whenever you are looking for performance, you should remove dynamic allocation from the performance-critical parts.
In this case, your threads could rather 'mark' elements as satisfying the condition, instead of building vectors of the satisfying elements. Once that's done, you can iterate through only the good ones without pushing or copying anything. Such a solution would be less wasteful.
In fact, if I were you, I would first try to solve this issue on a single thread, following the suggestions above. If you get rid of the vector-of-vectors structure and iterate through elements conditionally (this might be as simple as using one of the xxx_if algorithms the C++11 standard library provides), you could end up with decent performance. Only at that point is it worth looking at delegating chunks of the work to worker threads; right now there is very little in your code to justify worker threads just for filtering. Do as little writing and moving as you can, and you will gain a lot of performance. Parallelization only works well in certain circumstances.
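To illustrate, here is a minimal single-threaded sketch of the flatten-then-filter idea using std::copy_if; the predicate some_check_condition below is only a placeholder for whatever check the question actually performs:

#include <algorithm>
#include <iterator>
#include <vector>

// Placeholder for the question's predicate.
bool some_check_condition(int x) { return x % 2 == 0; }

std::vector<int> filter_flat(const std::vector<std::vector<int> >& myVec)
{
    // Flatten once, reserving the exact total size to avoid reallocations.
    std::size_t total = 0;
    for (const auto& row : myVec) total += row.size();
    std::vector<int> flat;
    flat.reserve(total);
    for (const auto& row : myVec)
        flat.insert(flat.end(), row.begin(), row.end());

    // One pass over contiguous memory; no per-row vectors, no locking.
    std::vector<int> satisfying;
    satisfying.reserve(flat.size()); // worst case: everything passes
    std::copy_if(flat.begin(), flat.end(),
                 std::back_inserter(satisfying), some_check_condition);
    return satisfying;
}

Once this single-threaded version is fast, the flat vector splits naturally into equal-sized chunks for worker threads.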
I am quite new to C++ and I would really appreciate some advice on multithreading using std::thread.
I have the following piece of code, which basically splits a for loop of N = 8^L iterations (up to 8^14) across threads:
void Lanczos::Hamil_vector_multiply(vec& initial_vec, vec& result_vec) {
    result_vec.zeros();
    std::vector<arma::vec> result_threaded(num_of_threads);
    std::vector<std::thread> threads;
    threads.reserve(num_of_threads);
    for (int t = 0; t < num_of_threads; t++) {
        u64 start = t * N / num_of_threads;
        u64 stop = ((t + 1) == num_of_threads ? N : N * (t + 1) / num_of_threads);
        result_threaded[t] = arma::vec(stop - start, fill::zeros);
        threads.emplace_back(&Lanczos::Hamil_vector_multiply_kernel, this, start, stop, ref(initial_vec), ref(result_vec));
    }
    for (auto& t : threads) t.join();
}
where Lanczos is my general class (actually it is not necessary to know what it contains), while the member function Hamil_vector_multiply_kernel is of the form:
void Lanczos::Hamil_vector_multiply_kernel(u64 start, u64 stop, vec& initial_vec, vec& result_vec_threaded) {
    // some declarations
    for (u64 k = start; k < stop; k++) {
        // some preliminary work
        for (int j = 0; j <= L - 1; j++) {
            // a bunch of if-else statements, where result_vec_threaded(k) += something
        }
    }
}
(The code is quite long, so I didn't paste the whole thing here.) My problem is that I call the function Hamil_vector_multiply 100-150 times in another function, so each time I create a new vector of threads, which then destroys itself. My questions:
1) Is it better to create the threads in the function which calls Hamil_vector_multiply and pass a vector of threads into it, so as to avoid creating new threads each time?
2) Would it be better to attack the loop asynchronously (for instance, the first thread to finish its iterations starts on the next available chunk)? If yes, can you point me to any literature describing asynchronous threading?
3) Are there better ways of multithreading such a loop? (Without multithreading I have a loop from k=0 to k=N=8^14, which takes up a lot of time.)
I found several attempts to create a thread pool and job queue; would it be useful to use some worker pool like this one: https://codereview.stackexchange.com/questions/221617/thread-pool-c-implementation
My code works as it is supposed to (it gives the correct result) and boosts the speed of the program roughly 10 times with 16 cores. But if you have other helpful comments not regarding multithreading, I would be grateful for every piece of advice.
Thank you very much in advance!
PS: The function which calls Hamil_vector_multiply 100-150 times is of the form:
void Lanczos::Build_Lanczos_Hamil(vec& initial_vec) {
    vec tmp(N);
    Hamil_vector_multiply(initial_vec, tmp);
    // some calculations
    for (int j = 0; j < 100; j++) {
        // something
        vec tmp2 = ...
        Hamil_vector_multiply(tmp2, tmp);
        // do something else -- not related
    }
}
Is it better to create the threads in the function which calls Hamil_vector_multiply and pass a vector of threads into it, so as to avoid creating new threads each time?
If you're worried about performance, yes, it would help. What you're doing right now essentially allocates a new heap block (the vector of threads) on every call. There is no correctness issue with that, but doing the allocation beforehand would gain you some performance.
Would it be better to attack the loop asynchronously (for instance, the first thread to finish its iterations starts on the next available chunk)? If yes, can you point me to any literature describing asynchronous threading?
This might not be a good idea. You would have to lock shared resources with mutexes when the same data is shared between multiple threads, which means you could end up with roughly single-threaded performance, because the other thread(s) would have to wait until the resource is unlocked and ready to be used.
Are there better ways of multithreading such a loop? (Without multithreading I have a loop from k=0 to k=N=8^14, which takes up a lot of time.)
If your goal is to improve performance, if you can split the work into multiple threads, and most importantly if multithreading actually helps, then there is no reason not to do it. From what I can see, your implementation looks pretty neat. But keep in mind: starting a thread is itself a little costly (negligible compared to your performance gain), and load balancing will definitely improve performance even further.
But if you have other helpful comments not regarding multithreading, I would be grateful for every piece of advice.
If your load per thread can vary, it would be a good investment to think about load balancing. Other than that, I don't see an issue. The main place to improve is your logic itself; threads can only do so much if the logic takes a very long time.
Optional:
You can use std::async and std::future to implement the same thing. One property worth knowing: a std::future returned by std::async blocks in its destructor until the task has finished, so when your vector of futures goes out of scope, the tasks are effectively joined. But then it might interfere with your first question.
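As a sketch (not a drop-in replacement), the question's function might look like this with std::async; the names Lanczos, vec, u64, N and num_of_threads are taken from the question, and <future> must be included:

#include <future>

void Lanczos::Hamil_vector_multiply(vec& initial_vec, vec& result_vec) {
    result_vec.zeros();
    std::vector<std::future<void>> tasks;
    tasks.reserve(num_of_threads);
    for (int t = 0; t < num_of_threads; t++) {
        u64 start = t * N / num_of_threads;
        u64 stop = ((t + 1) == num_of_threads ? N : N * (t + 1) / num_of_threads);
        tasks.emplace_back(std::async(std::launch::async,
            &Lanczos::Hamil_vector_multiply_kernel, this,
            start, stop, std::ref(initial_vec), std::ref(result_vec)));
    }
    // get() waits for each task and propagates any exception it threw.
    for (auto& f : tasks) f.get();
}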
I am new to multi-threaded programming and I am aware that several similar questions have been asked on SO before; however, I would like an answer specific to my code.
I have two vectors of objects (v1 & v2) that I want to loop through and, if they meet some criteria, add these objects to a single vector, like so:
Non-multithreaded case
std::vector<hobj> validobjs;
int length = 70;
for (auto i = this->v1.begin(); i < this->v1.end(); ++i) {
    if (!(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag()) {
        hobj obj(*i, length);
        validobjs.push_back(obj);
    }
}
for (auto j = this->v2.begin(); j < this->v2.end(); ++j) {
    if (!(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag()) {
        hobj obj(*j, length);
        validobjs.push_back(obj);
    }
}
Multithreaded case
std::vector<hobj> validobjs;
int length = 70;
#pragma omp parallel
{
    std::vector<hobj> threaded1; // Each thread has its own local vector
    #pragma omp for nowait firstprivate(length)
    for (auto i = this->v1.begin(); i < this->v1.end(); ++i) {
        if (!(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag()) {
            hobj obj(*i, length);
            threaded1.push_back(obj);
        }
    }
    std::vector<hobj> threaded2; // Each thread has its own local vector
    #pragma omp for nowait firstprivate(length)
    for (auto j = this->v2.begin(); j < this->v2.end(); ++j) {
        if (!(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag()) {
            hobj obj(*j, length);
            threaded2.push_back(obj);
        }
    }
    #pragma omp critical // Insert local vectors into the main vector one thread at a time
    {
        validobjs.insert(validobjs.end(), threaded1.begin(), threaded1.end());
        validobjs.insert(validobjs.end(), threaded2.begin(), threaded2.end());
    }
}
In the non-multithreaded case, the total time spent on the operation is around 4x shorter than in the multithreaded case (~1.5s vs ~6s).
I am aware that the #pragma omp critical directive is a performance hit, but since I do not know the size of the validobjs vector beforehand, I cannot rely on random insertion by index.
So questions:
1) Is this kind of operation suited for multi-threading?
2) If yes to 1) - does the multithreaded code look reasonable?
3) Is there anything I can do to improve the performance to get it faster than the no-thread case?
Additional info:
The above code is nested within a much larger codebase that performs 10,000-100,000s of iterations (this outer loop is not multithreaded). I am aware that spawning threads also incurs a performance overhead, but as far as I am aware these threads are kept alive until the above code is executed again on the next iteration.
omp_set_num_threads is set to 32 (I'm on a 32 core machine).
Ubuntu, gcc 7.4
Cheers!
I'm no expert on multithreading, but I'll give it a try:
Is this kind of operation suited for multi-threading?
I would say yes. Especially if you have huge datasets, you could split them even further, running any number of filtering operations in parallel. But it depends on the amount of data you want to process; thread creation and synchronization are not free, and neither is the merging at the end of the threaded version.
Does the multithreaded code look reasonable?
I think you're on the right path in letting each thread work on independent data.
Is there anything I can do to improve the performance to get it faster than the no-thread case?
I see a few points that might improve performance:
The vectors will need to resize often, which is expensive. You can use reserve() to, well, reserve memory beforehand and thus reduce the number of reallocations (to 0 in the optimal case).
The same goes for merging the two vectors at the end, which is a critical section. First reserve:
validobjs.reserve(v1.size() + v2.size());
then merge.
Copying objects from one vector to another can be expensive, depending on the size of the objects and whether there is a custom copy constructor that executes extra code. Consider storing only indices of the valid elements, or pointers to them (see the sketch after this list).
You could also try to replace elements in parallel in the resulting vector. That could be useful if default-constructing an element is cheap and copying is a bit expensive.
Filter the data in two threads as you do now.
Synchronise them and allocate a vector with a number of elements:
validobjs.resize(v1.size() + v2.size());
Let each thread insert elements into independent parts of the vector. For example, thread one writes to indices 0 to x and thread two writes to indices x + 1 to validobjs.size() - 1.
(Writing to distinct elements of the same std::vector from different threads is legal; only concurrent access to the same element, or operations that change the vector's size, are undefined.)
You could also think about using std::list (a linked list). Concatenating linked lists or removing elements happens in constant time; however, adding elements is a bit slower than on a std::vector with reserved memory.
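Putting the reserve() and store-pointers suggestions together, a sketch could look like this; it assumes v1 and v2 hold raw pointers (as the (**i) dereference in the question suggests):

std::vector<const hobj*> validptrs;       // pointers: no hobj copies at all
validptrs.reserve(v1.size() + v2.size()); // upper bound, reserved once
#pragma omp parallel
{
    std::vector<const hobj*> local;
    local.reserve(v1.size() + v2.size());
    #pragma omp for nowait
    for (size_t i = 0; i < v1.size(); ++i)
        if (!v1[i]->get_IgnoreFlag() && !v1[i]->get_ErrorFlag())
            local.push_back(v1[i]);
    #pragma omp for nowait
    for (size_t j = 0; j < v2.size(); ++j)
        if (!v2[j]->get_IgnoreFlag() && !v2[j]->get_ErrorFlag())
            local.push_back(v2[j]);
    #pragma omp critical // one cheap merge of small pointer vectors per thread
    validptrs.insert(validptrs.end(), local.begin(), local.end());
}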
Those were my thoughts on this; I hope there was something useful in it.
IMHO, you copy each element twice: into threaded1/threaded2, and after that into validobjs. That can make your code slower.
Instead, you can add elements into a single vector directly, using synchronization.
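For instance, a sketch of the synchronized single-vector variant, reusing the question's names; whether it beats the two-copy version depends on how often the condition passes:

#pragma omp parallel for
for (long i = 0; i < (long)v1.size(); ++i) {
    if (!v1[i]->get_IgnoreFlag() && !v1[i]->get_ErrorFlag()) {
        hobj obj(v1[i], length);
        #pragma omp critical // serializes every insertion
        validobjs.push_back(std::move(obj));
    }
}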
I have a vector that is modified in one thread, and I need to use its contents in another. Locking between these threads is unacceptable due to performance requirements. Since iterating over the vector while it is changing will cause a crash, I thought to copy the vector and then iterate over the copy. My question is, can this way also crash?
struct Data
{
    int A;
    double B;
    bool C;
};

std::vector<Data> DataVec;

void ModifyThreadFunc()
{
    // Here the vector is changed, which includes adding and erasing elements
    ...
}

void ReadThreadFunc()
{
    auto temp = DataVec; // Will this crash?
    for (auto& data : temp)
    {
        // Do stuff with the data
        ...
    }

    // This definitely can crash
    /*for (auto& data : DataVec)
    {
        // Do stuff with the data
        ...
    }*/
}
The basic thread safety guarantee for vector::operator= is:
"if an exception is thrown, the container is in a valid state."
What types of exceptions are possible here?
EDIT:
I solved this using double buffering, and posted my answer below.
As has been pointed out by the other answers, what you ask for is not doable. If you have concurrent access, you need synchronization, end of story.
That being said, it is not unusual to have requirements like yours where synchronization is not an option. In that case, what you can still do is get rid of the concurrent access. For example, you mentioned that the data is accessed once per frame in a game-loop like execution. Is it strictly required that you get the data from the current frame or could it also be the data from the last frame?
In that case, you could work with two vectors, one that is being written to by the producer thread and one that is being read by all the consumer threads. At the end of the frame, you simply swap the two vectors. Now you no longer need *(1) fine-grained synchronization for the data access, since there is no concurrent data access any more.
This is just one example how to do this. If you need to get rid of locking, start thinking about how to organize data access so that you avoid getting into the situation where you need synchronization in the first place.
*(1): Strictly speaking, you still need a synchronization point that ensures that when you perform the swapping, all the writer and reader threads have finished working. But this is far easier to do (usually you have such a synchronization point at the end of each frame anyway) and has a far lesser impact on performance than synchronizing on every access to the vector.
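A minimal sketch of that two-vector scheme, with hypothetical names:

std::vector<Data> bufA, bufB;
std::vector<Data>* front = &bufA; // read by the consumer threads during the frame
std::vector<Data>* back  = &bufB; // written only by the producer thread

// Called at the end-of-frame synchronization point, when no thread is
// touching either buffer any more.
void swapBuffers()
{
    std::swap(front, back);
}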
My question is, can this way also crash?
Yes, you still have a data race. If thread A modifies the vector while thread B is creating a copy, all iterators to the vector are invalidated.
What types of exceptions are possible here?
std::vector::operator=(const vector&) will throw on memory allocation failure, or if the contained elements throw on copy. The same thing applies to copy construction, which is what the line in your code marked "Will this crash?" is actually doing.
The fundamental problem here is that std::vector is not thread-safe. You have to either protect it with a lock/mutex, or replace it with a thread-safe container (such as the lock-free containers in Boost.Lockfree or libcds).
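For reference, a minimal mutex-based sketch of the copy approach from the question; ModifyThreadFunc would take the same dataMutex around every modification:

#include <mutex>

std::mutex dataMutex; // guards every access to DataVec

void ReadThreadFunc()
{
    std::vector<Data> temp;
    {
        std::lock_guard<std::mutex> guard(dataMutex);
        temp = DataVec; // copy while no writer can interfere
    }
    for (auto& data : temp)
    {
        // Work on the private copy without holding the lock
    }
}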
I have a vector that is modified in one thread, and I need to use its contents in another. Locking between these threads is unacceptable due to performance requirements.
this is an impossible requirement to meet.
Anyway, any sharing of data between 2 threads will require some kind of locking, be it explicit or provided by the implementation (possibly by the hardware). You must examine your actual requirements again: it may be unacceptable to suspend one thread until the other one ends, but you could still lock around short sequences of instructions. And/or possibly use a different architecture. For example, erasing an item in a vector is a costly operation (linear time, because you have to move all the data above the removed item), while marking it as invalid is much quicker (constant time, a single write). If you really have to erase in the middle of a vector, maybe a list would be more appropriate.
But if you can put a lock around the copy of the vector in ReadThreadFunc and around any vector modification in ModifyThreadFunc, it could be enough. To give priority to the modifying thread, you could merely try to take the lock in the reading thread and immediately give up if you cannot.
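A sketch of that priority scheme with std::mutex::try_lock; vecMutex is a hypothetical mutex that ModifyThreadFunc locks unconditionally around its modifications:

std::mutex vecMutex;

void ReadThreadFunc()
{
    if (!vecMutex.try_lock())
        return; // writer is busy: skip this pass and reuse old data
    std::vector<Data> temp = DataVec;
    vecMutex.unlock();
    for (auto& data : temp)
    {
        // Use the data
    }
}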
Maybe you should rethink your design!
Each thread should have its own vector (list, queue, whatever fits your needs) to work on. Thread A can then do some work and pass the result to thread B; you simply have to lock when writing the data from thread A into thread B's queue.
Without some kind of locking it's not possible.
So I solved this using double buffering, which guarantees no crashes; the reading thread will always have usable data, even if it might not be current:
struct Data
{
    int A;
    double B;
    bool C;
};

const int MAXSIZE = 100;
Data Buffer[MAXSIZE];
std::vector<Data> DataVec;

void ModifyThreadFunc()
{
    // Here the vector is changed, which includes adding and erasing elements
    ...

    // Copy from the vector to the buffer
    size_t numElements = DataVec.size();
    memcpy(Buffer, DataVec.data(), sizeof(Data) * numElements);
    memset(&Buffer[numElements], 0, sizeof(Data) * (MAXSIZE - numElements));
}

void ReadThreadFunc()
{
    Data* p = Buffer;
    for (int i = 0; i < MAXSIZE; ++i)
    {
        // Use the data
        ...
        ++p;
    }
}
I am a bit curious about vector optimization and have a couple of questions about it. (I am still a beginner in programming.)
example:
struct GameInfo{
    EnumType InfoType;
    bool bhasbeenAdded; // set once the info has been consumed
    // Other info...
};

int _lastPosition;
// _gameInfoV is sorted beforehand
std::vector<GameInfo> _gameInfoV;

// The tick function is called every game frame (in "perfect" condition it's every 1.0/60 second)
void BaseClass::tick()
{
    for (unsigned int i = _lastPosition; i < _gameInfoV.size(); i++) {
        auto& info = _gameInfoV[i];
        if (!info.bhasbeenAdded) {
            if (DoWeNeedNow()) {
                _lastPosition++;
                info.bhasbeenAdded = true;
                _otherPointer->DoSomething(info.InfoType);
                // Do something more with "info"....
            }
            else return; // Break the cycle since we don't need the other "info" now
        }
    }
}
The _gameInfoV vector size can be between 2000 and 5000.
My main 2 questions are:
Is it better to leave it the way it is, or is it better to split it into smaller chunks, one checked for each different GameInfo.InfoType?
Is it worth the hassle of storing the last start position index of the vector, instead of iterating from the beginning?
Note that when using smaller vectors there will be 3 to 6 of them.
The third thing is that I am probably not using vector iterators, but is it safe to use them like this?
std::vector<GameInfo>::iterator it = _gameInfoV.begin() + _lastPosition;
for (; it != _gameInfoV.end(); ++it) {
    // Do something
}
Note: it will be used on smartphones, so every optimization will be appreciated when targeting weaker phones.
-Thank you
Don't split it up, except if you frequently move memory around.
It is no hassle if you do it correctly:
std::vector<GameInfo>::iterator _lastPosition(_gameInfoV.begin());
// ...
for (std::vector<GameInfo>::iterator info = _lastPosition; info != _gameInfoV.end(); ++info)
{
    if (!info->bhasbeenAdded)
    {
        if (DoWeNeedNow())
        {
            ++_lastPosition;
            info->bhasbeenAdded = true;
            _otherPointer->DoSomething(info->InfoType);
            // Do something more with "info"....
        }
        else return; // Break the cycle since we don't need the other "info" now
    }
}
Breaking one vector up into several smaller vectors generally doesn't improve performance. It could even degrade performance slightly, because the compiler has to manage more variables, which take up more CPU registers, etc.
I don't know about gaming, so I don't understand the implication of GameInfo.InfoType. Your processing time and CPU resource requirements are going to increase if you do more total loop iterations (where each iteration performs the same type of operation). So if separating the vectors lets you avoid some iterations because you can skip entire vectors, that is going to increase the performance of your app.
Iterators are the most secure way to iterate through containers, but for a vector I often just use the index operator [] and my own indexer (a plain old unsigned integer).
I'm writing the following code that constructs a distance matrix between each point and all the other points that I have in the map dat[]. Although the code works correctly, its running time doesn't improve: it takes the same time whether I set the number of threads to 1 or 10 on an 8-core machine. I'd therefore appreciate it if anyone could help me find what is wrong in my code, and any suggestion to make the code run faster would be very helpful too.
The following is the code:
map<int, string>::iterator datIt;
map<int, map<int, double> > dist;
int mycont = 0;
datIt = dat.begin();
int size = dat.size();
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel // construct the distance matrix
{
    map<int, string>::iterator datItLocal = datIt;
    int lastIdx = 0;
    #pragma omp for
    for (int i = 0; i < size; i++)
    {
        std::advance(datItLocal, i - lastIdx);
        lastIdx = i;
        map<int, string>::iterator datIt2 = datItLocal;
        datIt2++;
        while (datIt2 != dat.end())
        {
            double ecl = 0;
            int c = count((*datItLocal).second.begin(), (*datItLocal).second.end(), delm);
            string line1 = (*datItLocal).second;
            string line2 = (*datIt2).second;
            for (int j = 0; j < c; j++) // renamed from 'i' to avoid shadowing the outer loop variable
            {
                double num1 = atof(line1.substr(0, line1.find_first_of(delm)).c_str());
                line1 = line1.substr(line1.find_first_of(delm) + 1).c_str();
                double num2 = atof(line2.substr(0, line2.find_first_of(delm)).c_str());
                line2 = line2.substr(line2.find_first_of(delm) + 1).c_str();
                ecl += (num1 - num2) * (num1 - num2);
            }
            ecl = sqrt(ecl);
            omp_set_lock(&lock);
            dist[(*datItLocal).first][(*datIt2).first] = ecl;
            dist[(*datIt2).first][(*datItLocal).first] = ecl;
            omp_unset_lock(&lock);
            datIt2++;
        }
    }
}
omp_destroy_lock(&lock);
My guess is that using a single lock to protect 'dist' serializes your program.
Option 1:
Consider using a fine-grained locking strategy. Typically, you benefit from this if dist.size() is much larger than the number of threads.
map<int, omp_lock_t> locks;
// Note: create and omp_init_lock() every entry of 'locks' (and the outer
// entries of 'dist') before the parallel region starts; inserting new map
// keys concurrently is itself a data race, even with per-key locks.
...
int key1 = (*datItLocal).first;
int key2 = (*datIt2).first;
if (key2 < key1) std::swap(key1, key2); // acquire in a fixed order to avoid deadlock
omp_set_lock(&(locks[key1]));
omp_set_lock(&(locks[key2]));
dist[(*datItLocal).first][(*datIt2).first] = ecl;
dist[(*datIt2).first][(*datItLocal).first] = ecl;
omp_unset_lock(&(locks[key2]));
omp_unset_lock(&(locks[key1]));
Option 2:
Your compiler might already implement the optimization mentioned in option 1, so you can try dropping your lock and using the built-in critical section:
#pragma omp critical
{
    dist[(*datItLocal).first][(*datIt2).first] = ecl;
    dist[(*datIt2).first][(*datItLocal).first] = ecl;
}
I'm a bit unsure of exactly what you're trying to do with your loops etc, which looks rather like it's going to do a quadratic nested loop over the map. Assuming that's expected though, I think the following line will perform poorly when parallelised:
std::advance(datItLocal, i - lastIdx);
If OpenMP were disabled, that's advancing by one step each time, which is fine. But with OpenMP, there are going to be multiple threads doing chunks of that loop at random. So one of them might start at i=100000, so it has to advance 100000 steps through the map to begin. That might happen quite a lot if there are lots of threads being given relatively small chunks of the loop at a time. It might even be that you end up being memory/cache constrained since you're constantly having to walk over all of this presumably large map. It seems like this might be (part of) your culprit since it may get worse as more threads are available.
Fundamentally, I guess I'm a bit suspicious of trying to parallelise iteration over a sequential data structure. If you've profiled it, though, you may know more about which parts of it really are slow.
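One way to avoid the std::advance cost is to materialize the map into a random-access vector once, then parallelise over indices. A sketch, where euclidean() is a hypothetical helper wrapping the question's string-parsing distance loop:

std::vector<std::pair<int, std::string> > rows(dat.begin(), dat.end());

#pragma omp parallel for schedule(dynamic) // dynamic: the inner loop shrinks as i grows
for (int i = 0; i < (int)rows.size(); i++) {
    for (size_t j = i + 1; j < rows.size(); j++) {
        double ecl = euclidean(rows[i].second, rows[j].second);
        #pragma omp critical
        {
            dist[rows[i].first][rows[j].first] = ecl;
            dist[rows[j].first][rows[i].first] = ecl;
        }
    }
}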