It is safe to read an STL container from multiple parallel threads. However, the performance is terrible. Why?
I create a small object that stores some data in a multiset. This makes the constructors fairly expensive (about 5 usecs on my machine). I store hundreds of thousands of these small objects in a large multiset. Processing the objects is independent work, so I split it between threads running on a multi-core machine. Each thread reads the objects it needs from the large multiset and processes them.
The problem is that the reading from the big multiset does not proceed in parallel. It looks like the reads in one thread block the reads in the other.
The code below is the simplest I can make it and still show the problem. First it creates a large multiset containing 100,000 small objects each containing its own empty multiset. Then it calls the multiset copy constructor twice in series, then twice again in parallel.
A profiling tool shows that the serial copy constructors take about 0.23 secs, whereas the parallel ones take twice as long. Somehow the parallel copies are interfering with each other.
#include <set>
#include <boost/bind.hpp>
#include <boost/thread.hpp>
using namespace std;

// a trivial class with a significant ctor and the ability to populate an associative container
class cTest
{
    multiset<int> mine;
    int id;
public:
    cTest( int i ) : id( i ) {}
    bool operator<(const cTest& o) const { return id < o.id; }
};
// add 100,000 objects to multiset
void Populate( multiset<cTest>& m )
{
    for( int k = 0; k < 100000; k++ )
    {
        m.insert(cTest(k));
    }
}
// copy construct multiset, called from mainline
void Copy( const multiset<cTest>& m )
{
    cRavenProfile profile("copy_main");   // cRavenProfile is my scope profiler
    multiset<cTest> copy( m );
}
// copy construct multiset, called from thread
void Copy2( const multiset<cTest>& m )
{
    cRavenProfile profile("copy_thread");
    multiset<cTest> copy( m );
}
int _tmain(int argc, _TCHAR* argv[])
{
    cRavenProfile profile("test");
    profile.Start();
    multiset<cTest> master;
    Populate( master );
    // two calls to copy ctor from mainline
    Copy( master );
    Copy( master );
    // call copy ctor in parallel
    boost::thread* pt1 = new boost::thread( boost::bind( Copy2, master ));
    boost::thread* pt2 = new boost::thread( boost::bind( Copy2, master ));
    pt1->join();
    pt2->join();
    delete pt1;
    delete pt2;
    // display profiler results
    cRavenProfile print_profile;
    return 0;
}
Here is the output:
Scope          Calls   Mean (secs)      Total
copy_thread        2      0.472498   0.944997
copy_main          2      0.233529   0.467058
You mentioned copy constructors. I assume these also allocate memory from the heap?
Allocating heap memory from multiple threads is a big mistake.
The standard allocator is probably a single-pool, locked implementation. You need either to not use heap memory (allocate on the stack) or to use a thread-optimized heap allocator. A hypothetical micro-benchmark illustrating the effect follows.
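To see the contention in isolation, here is a hypothetical micro-benchmark (the names are made up) that does nothing but allocate and free through the default heap from two threads. With a single locked heap, running two of these in parallel will not be faster than running them back-to-back.

#include <boost/thread.hpp>
#include <set>

void churn()
{
    for (int i = 0; i < 100000; ++i) {
        std::multiset<int>* p = new std::multiset<int>;  // heap allocation
        p->insert(i);                                    // node allocation
        delete p;
    }
}

int main()
{
    // time this pair...
    churn(); churn();
    // ...against this pair
    boost::thread a(churn), b(churn);
    a.join(); b.join();
}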
OK, after spending the majority of the week on this issue, I have the fix.
There were two problems with the code I posted in the question:
boost::bind makes a copy of its parameters, even if the underlying function takes them by reference. Copying the container is expensive, so the multi-threaded version was doing extra work. (No-one noticed this!) To pass the container by reference, I needed to use boost::cref:
boost::thread* pt1 = new boost::thread( boost::bind( Copy2, boost::cref(master) ));
As Zan Lynx pointed out, the default container allocates memory for its contents on the global heap through a thread-safe singleton allocator, which causes great contention between the threads as they create hundreds of thousands of objects through the same allocator instance. (Since this was the crux of the mystery, I accepted Zan Lynx's answer.)
The fix for #1 is straightforward, as presented above.
The fix for #2 is, as several people pointed out, to replace the default STL allocator with a thread-specific one. This is quite a challenge, and no-one offered a specific source for such an allocator.
I spent some time looking for an off-the-shelf thread-specific allocator. The best I found was Hoard (hoard.org). It provided a significant performance improvement; however, Hoard has some serious drawbacks:
I experienced some crashes during testing
Commercial licensing is expensive
It 'hooks' system calls to malloc, a technique I consider dodgy
So I decided to roll my own thread-specific memory allocator, based on boost::pool and boost::thread_specific_ptr. This required a small amount of, IMHO, seriously advanced C++ code, but it now seems to be working well. A sketch of the idea follows.
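A minimal sketch of the idea (simplified relative to the real thing, and the names here are made up; a C++11 container fills in the remaining allocator boilerplate via std::allocator_traits). Each thread lazily gets its own boost::pool, so allocations never contend across threads. The caveat is that a container must be created and destroyed on the same thread, since memory must go back to the pool that issued it.

#include <boost/pool/pool.hpp>
#include <boost/thread/tss.hpp>
#include <cstddef>

template <class T>
struct per_thread_allocator
{
    typedef T value_type;

    per_thread_allocator() {}
    template <class U>
    per_thread_allocator(const per_thread_allocator<U>&) {}

    T* allocate(std::size_t n)
    {
        return static_cast<T*>(pool().ordered_malloc(n));
    }
    void deallocate(T* p, std::size_t n)
    {
        pool().ordered_free(p, n);
    }

private:
    static boost::pool<>& pool()
    {
        // one pool per (thread, T); thread_specific_ptr deletes each
        // thread's pool when that thread exits
        static boost::thread_specific_ptr<boost::pool<> > tls;
        if (!tls.get())
            tls.reset(new boost::pool<>(sizeof(T)));
        return *tls;
    }
};

template <class T, class U>
bool operator==(const per_thread_allocator<T>&, const per_thread_allocator<U>&)
{ return true; }    // see the same-thread caveat above
template <class T, class U>
bool operator!=(const per_thread_allocator<T>&, const per_thread_allocator<U>&)
{ return false; }

// usage:
// std::multiset<cTest, std::less<cTest>, per_thread_allocator<cTest> > copy( m );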
What is the scheduling of your threads? If you run two threads doing comparable work, they most likely start at once and end at once. Hence the profiler thinks that the execution of each thread took twice as long: during the interval in which each thread was executing, the work was done twice. The execution of each sequential call, by contrast, is charged only its own, normal time.
step:       0 1 2 3 4 5 6 7 8 9
threaded:   1 2 1 2 1 2 1 2 1 2
sequential: 1 1 1 1 1 2 2 2 2 2
Thread one started at step 0 and ended at step 8, so its execution time shows as 8; thread two started at 1 and ended at 9, again an execution time of 8. Two sequential runs show 5 steps each. So in the resulting table you'll see 16 for the concurrent version and 10 for the sequential one.
Assuming all of the above is true and there is a considerable number of steps, the ratio of the execution times shown by your profiler should be about two. Your experiment does not contradict this hypothesis.
Since I am not sure how your profiler works, it is hard to tell.
What I would prefer to see is some explicit timing around the code, repeating the work a number of times to average out anything that causes a context switch:
double ts, te, tt = 0;
for( int loop = 0; loop < 100; ++loop )
{
    ts = timer();      // timer() stands for whatever high-resolution clock you have
    Copy( master );
    Copy( master );
    te = timer();
    tt += te - ts;
}
tt /= 100;             // mean time for two sequential copies
// etc
Compare this with your profiler results.
To answer Pavel Shved with more detail, here is how the majority of my code runs:
step:       0 1 2 3 4 5 6 7 8 9
core1:      1 1 1 1 1
core2:      2 2 2 2 2
sequential: 1 1 1 1 1 2 2 2 2 2
Only the parallel reads interfere with each other.
As an experiment, I replaced the big multiset with an array of pointers to cTest. The code now has huge memory leaks, but never mind. The interesting thing is that the relative performance is worse: running the copy constructors in parallel slows them down by a factor of four!
Scope                Calls   Mean (secs)      Total
copy_array_thread        2      0.454432   0.908864
copy_array_main          2      0.116905   0.233811
I am quite new to C++ and I would really appreciate some advice on multithreading using std::thread.
I have the following piece of code, which splits a for loop of N = 8^L iterations (up to 8^14) across threads:
void Lanczos::Hamil_vector_multiply(vec& initial_vec, vec& result_vec) {
    result_vec.zeros();
    // note: result_threaded is filled below but never read; each thread
    // actually writes to a disjoint range of result_vec
    std::vector<arma::vec> result_threaded(num_of_threads);
    std::vector<std::thread> threads;
    threads.reserve(num_of_threads);
    for (int t = 0; t < num_of_threads; t++) {
        u64 start = t * N / num_of_threads;
        u64 stop = ((t + 1) == num_of_threads ? N : N * (t + 1) / num_of_threads);
        result_threaded[t] = arma::vec(stop - start, fill::zeros);
        threads.emplace_back(&Lanczos::Hamil_vector_multiply_kernel, this,
                             start, stop, ref(initial_vec), ref(result_vec));
    }
    for (auto& t : threads) t.join();
}
where Lanczos is my general class (it is not necessary to know exactly what it contains), while the member function Hamil_vector_multiply_kernel has the form:
void Lanczos::Hamil_vector_multiply_kernel(u64 start, u64 stop, vec& initial_vec, vec& result_vec_threaded){
    // some declarations
    for (u64 k = start; k < stop; k++) {
        // some preliminary work
        for (int j = 0; j <= L - 1; j++) {
            // a bunch of if-else statements, where result_vec_threaded(k) += something
        }
    }
}
(the code is quite long, so I didn't paste the whole thing here). My problem is that I call the function Hamil_vector_multiply 100-150 times in another function, so each time I create a new vector of threads, which then destroys itself. My questions:
1) Is it better to create the threads in the function that calls Hamil_vector_multiply and pass a vector of threads into Hamil_vector_multiply, to avoid creating new threads each time?
2) Would it be better to attack the loop asynchronously (for instance, the first thread to finish an iteration starts the next available one)? If yes, can you point to any literature describing asynchronous threading?
3) Are there maybe better ways to multithread such a loop? (Without multithreading I have a loop from k=0 to k=N=8^14, which takes up a lot of time.)
4) I found several attempts to create a thread pool and job queue; would it be useful to use a work pool like this one: https://codereview.stackexchange.com/questions/221617/thread-pool-c-implementation
My code works as it is supposed to (it gives the correct result) and speeds the program up roughly 10 times on 16 cores. But if you have other helpful comments, not necessarily about multithreading, I would be grateful for every piece of advice.
Thank you very much in advance!
PS: The function which calls Hamil_vector_multiply 100-150 times is of the form:
void Lanczos::Build_Lanczos_Hamil(vec& initial_vec) {
    vec tmp(N);
    Hamil_vector_multiply(initial_vec, tmp);
    // some calculations
    for (int j = 0; j < 100; j++) {
        // something
        vec tmp2 = ...
        Hamil_vector_multiply(tmp2, tmp);
        // do something else -- not related
    }
}
Is it better to create the threads in the function that calls Hamil_vector_multiply and pass a vector of threads into it, to avoid creating new threads each time?
If you're worried about performance, yes, it would help. What you're doing right now essentially allocates a new heap block on every function call (the vector of threads). If you can do that beforehand, it will gain you some performance. There is no correctness issue with the current approach; it just costs a little.
Would it be better to attack the loop asynchronously (for instance, the first thread to finish an iteration starts the next available one)? If yes, can you point to any literature describing asynchronous threading?
This might not be a good idea. If the threads share the same data, you will have to lock it with mutexes, and then you can end up with roughly single-threaded performance, because each thread has to wait until the resource is unlocked and ready to be used.
Are there maybe better ways to multithread such a loop? (Without multithreading I have a loop from k=0 to k=N=8^14, which takes up a lot of time.)
If your goal is to improve performance, if you can put the work into multiple threads, and most importantly if multithreading actually helps, then there is no reason not to do it. From what I can see, your implementation looks pretty neat. But keep in mind that starting a thread is itself a little costly (negligible compared with your performance gain), and load balancing will definitely improve performance even further; one self-scheduling approach is sketched below.
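For illustration, a rough self-scheduling sketch (not from the original code; the chunk size is a made-up tuning parameter): each thread pulls the next chunk of indices from a shared atomic counter when it finishes its current one, so faster threads naturally take more chunks. Because the chunks are disjoint, no mutex around the data is needed.

// headers needed by the sketch
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// drop-in replacement for the loop-splitting body of Hamil_vector_multiply
std::atomic<u64> next_start{0};
const u64 chunk = 4096;                            // assumption: tune for your workload

auto worker = [&]() {
    for (;;) {
        u64 start = next_start.fetch_add(chunk);   // claim the next chunk
        if (start >= N) break;                     // all chunks taken
        u64 stop = std::min<u64>(start + chunk, N);
        Hamil_vector_multiply_kernel(start, stop, initial_vec, result_vec);
    }
};

std::vector<std::thread> threads;
for (int t = 0; t < num_of_threads; t++) threads.emplace_back(worker);
for (auto& t : threads) t.join();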
But if you have other helpful comments, not necessarily about multithreading, I would be grateful for every piece of advice.
If the load per thread may vary, it is a good investment to think about load balancing. Other than that, I don't see an issue. The main place to improve would be the logic itself: threads can only do so much if the work inside them takes a very long time.
Optional:
You can use std::async and std::future to implement the same thing. One detail to be aware of: a future returned by std::async blocks in its destructor until the task has finished, so when your vector of futures goes out of scope, the work is implicitly joined. But then it might interfere with your first question.
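For example, a minimal sketch with std::async, reusing the names from the question (untested against the real class):

// replace the vector<std::thread> with futures; the get() calls below
// (or the futures' destructors) join the work and propagate exceptions
#include <future>
#include <vector>

std::vector<std::future<void>> futures;
futures.reserve(num_of_threads);
for (int t = 0; t < num_of_threads; t++) {
    u64 start = t * N / num_of_threads;
    u64 stop = ((t + 1) == num_of_threads ? N : N * (t + 1) / num_of_threads);
    futures.emplace_back(std::async(std::launch::async,
        &Lanczos::Hamil_vector_multiply_kernel, this,
        start, stop, std::ref(initial_vec), std::ref(result_vec)));
}
for (auto& f : futures) f.get();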
I am new to multi-threaded programming, and I am aware that several similar questions have been asked on SO before; however, I would like an answer specific to my code.
I have two vectors of objects (v1 and v2) that I want to loop through; when an object meets some criteria, I add it to a single vector, like so:
Non-Multithread Case
std::vector<hobj> validobjs;
int length = 70;
for(auto i = this->v1.begin(); i < this->v1.end(); ++i) {
    if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
        hobj obj(*i, length);
        validobjs.push_back(obj);   // was push_back(hobj), which doesn't compile
    }
}
for(auto j = this->v2.begin(); j < this->v2.end(); ++j) {
    if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
        hobj obj(*j, length);
        validobjs.push_back(obj);
    }
}
Multithread Case
std::vector<hobj> validobjs;
int length = 70;
#pragma omp parallel
{
    std::vector<hobj> threaded1; // Each thread has own local vector
    #pragma omp for nowait firstprivate(length)
    for(auto i = this->v1.begin(); i < this->v1.end(); ++i) {
        if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
            hobj obj(*i, length);
            threaded1.push_back(obj);
        }
    }
    std::vector<hobj> threaded2; // Each thread has own local vector
    #pragma omp for nowait firstprivate(length)
    for(auto j = this->v2.begin(); j < this->v2.end(); ++j) {
        if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
            hobj obj(*j, length);
            threaded2.push_back(obj);
        }
    }
    #pragma omp critical // Insert local vectors to main vector one thread at a time
    {
        validobjs.insert(validobjs.end(), threaded1.begin(), threaded1.end());
        validobjs.insert(validobjs.end(), threaded2.begin(), threaded2.end());
    }
}
In the non-multithreaded case, the total time spent on the operation is around 4x shorter than in the multithreaded case (~1.5 s vs ~6 s).
I am aware that the #pragma omp critical directive is a performance hit, but since I do not know the size of the validobjs vector beforehand, I cannot rely on writing to it at a known index.
So questions:
1) Is this kind of operation suited for multi-threading?
2) If yes to 1) - does the multithreaded code look reasonable?
3) Is there anything I can do to improve the performance to get it faster than the no-thread case?
Additional info:
The above code is nested within a much larger codebase that performs 10,000 to 100,000s of iterations (this outer loop is not multithreaded). I am aware that spawning threads also incurs a performance overhead, but as far as I am aware, the OpenMP threads are kept alive between parallel regions, so they are reused each time the above code executes.
omp_set_num_threads is set to 32 (I'm on a 32 core machine).
Ubuntu, gcc 7.4
Cheers!
I'm no expert on multithreading, but I'll give it a try:
Is this kind of operation suited for multi-threading?
I would say yes. Especially if you have huge datasets, you could split them even further, running any number of filtering operations in parallel. But it depends on the amount of data you want to process: thread creation and synchronization are not free, and neither is the merging at the end of the threaded version.
Does the multithreaded code look reasonable?
I think you're on the right path in letting each thread work on independent data.
Is there anything I can do to improve the performance to get it faster than the no-thread case?
I see a few points that might improve performance:
The vectors need to resize often, which is expensive. You can use reserve() to, well, reserve memory beforehand and thus reduce the number of reallocations (to 0 in the optimal case).
The same goes for the merging of the two vectors at the end, which happens inside the critical section. First reserve:
validobjs.reserve(v1.size() + v2.size());
then merge.
Copying objects from one vector to another can be expensive, depending on the size of the objects and on whether there is a custom copy constructor that executes extra code. Consider storing only indices of the valid elements, or pointers to valid elements.
You could also try to fill elements of the resulting vector in parallel. That can be useful if default-constructing an element is cheap and copying is a bit expensive:
Filter the data in two threads as you do now.
Synchronise the threads, then allocate a vector with the full number of elements:
validobjs.resize(v1.size() + v2.size());
Let each thread write to an independent part of the vector. For example, thread one writes to indices 0 to x and thread two writes to indices x + 1 to validobjs.size() - 1.
Although writing to the same vector from several threads sounds suspicious, it is well-defined as long as each thread touches distinct elements; a sketch of this approach follows below.
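A rough sketch of that approach (hypothetical: it assumes hobj is cheaply default-constructible, and it leaves gaps for filtered-out elements that are compacted at the end; names are reused from the question):

#include <algorithm>
#include <vector>

std::vector<hobj> validobjs(v1.size() + v2.size());  // default-constructed slots
size_t count1 = 0, count2 = 0;                       // matches per section
#pragma omp parallel sections
{
    #pragma omp section
    {   // this thread owns slots [0, v1.size())
        size_t k = 0;
        for (auto i = v1.begin(); i < v1.end(); ++i)
            if (!(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag())
                validobjs[k++] = hobj(*i, length);
        count1 = k;
    }
    #pragma omp section
    {   // this thread owns slots [v1.size(), v1.size() + v2.size())
        size_t k = v1.size();
        for (auto j = v2.begin(); j < v2.end(); ++j)
            if (!(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag())
                validobjs[k++] = hobj(*j, length);
        count2 = k - v1.size();
    }
}
// compact: move the second section's matches directly behind the first's
std::move(validobjs.begin() + v1.size(),
          validobjs.begin() + v1.size() + count2,
          validobjs.begin() + count1);
validobjs.resize(count1 + count2);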
You could also think about using std::list (a linked list). Concatenating linked lists or removing elements happens in constant time, but adding elements is a bit slower than on a std::vector with reserved memory.
Those were my thoughts on this; I hope there was something useful in it.
IMHO,
You copy each element twice: into threaded1/threaded2 and after that into validobjs.
That can make your code slower.
You could instead add elements to the single vector directly, using synchronization only around the insertion; a sketch follows below.
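A minimal sketch of that suggestion (names reused from the question; shown for v1 only, v2 is analogous): construct each hobj in parallel, and serialize only the push_back into the single shared vector.

std::vector<hobj> validobjs;
validobjs.reserve(v1.size() + v2.size());     // upper bound, so no reallocation
int length = 70;
#pragma omp parallel for
for (auto i = v1.begin(); i < v1.end(); ++i) {
    if (!(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag()) {
        hobj obj(*i, length);                 // constructed in parallel
        #pragma omp critical
        validobjs.push_back(std::move(obj));  // only the insert is serialized
    }
}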
I am new to TBB and tried to do a simple experiment.
My data for the functions are:
int n = 9000000;
int *data = new int[n];
First I created a function that does not use TBB:
void _array(int* &data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = busyfunc(data[i])*123;
    }
}
It takes 0.456635 seconds.
Then I created a second version that uses TBB:
void parallel_change_array(int* &data, int list_count) {
    // Instructional example - parallel version
    parallel_for(blocked_range<int>(0, list_count),
        [=](const blocked_range<int>& r) {
            for (int i = r.begin(); i < r.end(); i++) {
                data[i] = busyfunc(data[i])*123;
            }
        });
}
It takes 0.584889 seconds.
As for busyfunc(int m):
int busyfunc(int m)
{
    m *= 32;
    return m;
}
Can you tell me why the function without TBB takes less time than the one with TBB?
I think the problem is that the function is so simple that it is cheap to compute without TBB.
First, busyfunc() seems not so busy, because 9M elements are computed in just half a second, which makes this example memory-bound (uncached memory operations take orders of magnitude more cycles than arithmetic operations). Memory-bound computations do not scale as well as compute-bound ones; for example, plain memory copying usually scales up to no more than about 4x, even when running on a much bigger number of cores/processors.
Also, memory-bound programs are more sensitive to NUMA effects, and since you allocated this array as contiguous memory using standard C++, it will by default be allocated entirely on the memory node where the initialization occurs. This default can be altered by running with numactl -i all --.
And the last, but most important, thing is that TBB initializes threads lazily and pretty slowly. I guess you do not intend to write an application which exits after 0.5 seconds spent on parallel computation. Thus, a fair benchmark should take into account all the warm-up effects that are expected in a real application. At the very least, it has to wait until all the threads are up and running before starting measurements. This answer suggests one way to do that; a simple warm-up is sketched below.
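One simple, if imperfect, warm-up (an assumption on my part, not necessarily the linked answer's exact method): run a throwaway parallel loop before timing, so TBB's worker threads are created outside the measured region.

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

void warm_up()
{
    // trivial parallel work; forces the TBB scheduler and workers to start
    tbb::parallel_for(tbb::blocked_range<int>(0, 1000000),
        [](const tbb::blocked_range<int>& r) {
            volatile int sink = 0;          // defeat the optimizer
            for (int i = r.begin(); i < r.end(); i++) sink += i;
        });
}
// call warm_up() once, then start timing parallel_change_array()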
[update] Please also refer to Alexey's answer for another possible reason lurking in compiler optimization differences.
In addition to Anton's answer, I recommend checking whether the compiler was able to optimize the code equivalently.
For a start, check the performance of the TBB version executed by a single thread, without real parallelism. You can use tbb::global_control or tbb::task_scheduler_init to limit the number of threads to 1, e.g.
tbb::global_control ctl(tbb::global_control::max_allowed_parallelism, 1);
The overheads of thread creation, as well as cache locality or NUMA effects, should not play a role when all the code is executed by one thread. Therefore, you should see approximately the same performance as for the no-TBB version. If you do, then you have a scalability issue, and Anton explained the possible reasons.
However, if you see that performance drops a lot, then it is a serial optimization issue. One known reason is that some compilers cannot optimize the loop over a blocked_range as well as they optimize the original loop; it has also been observed that storing r.end() in a local variable may help:
int rend = r.end();
for (int i = r.begin(); i < rend; i++) {
    data[i] = busyfunc(data[i])*123;
}
I have some code that can make use of parallelism for an efficiency gain. Since my PC has a dual processor, I tried running the code on two threads, so I wrote the code below (this is a very simplified version of it):
void Evaluator::evaluate(vector<inpType> input, bool parallel) {
    std::thread t0;
    size_t half = input.size()/2 + input.size()%2;
    if(parallel) {
        // Evaluate the first half of the input in a spawned thread
        t0 = std::thread(&Evaluator::evaluatePart, this, std::ref(input), 0, half);
        // Evaluate the other half of the input in this thread
        evaluatePart(input, half, input.size() - half);
    } else {
        // sequentially evaluate all of the input
        evaluatePart(input, 0, input.size());
    }
    // some other code
    // after finishing everything join the thread
    if(parallel) t0.join();
}

void Evaluator::evaluatePart(vector<inpType> &input, size_t start, size_t count) {
    for(size_t i = start; i < start + count; i++) {
        evaluateSingle(input[i]);
    }
}

void Evaluator::evaluateSingle(inpType &input) {
    // do stuff with input
    // note: this reads a vector<int> member of the Evaluator object,
    // but writes only to the single input element it was given
}
Running sequentially takes around 3 ms, but running in parallel takes around 6 ms. Does that mean spawning a thread takes so much time that it is more efficient to just evaluate sequentially? Or am I doing something wrong?
Note that I don't use any locking mechanisms, because the evaluations are independent of each other. Every evaluateSingle reads from a vector that is a member of the Evaluator object but alters only the single input element that was given to it, hence there is no need for locking.
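For scale, a minimal stand-alone micro-benchmark of the spawn cost (a rough sketch; absolute numbers vary by platform, but spawning and joining an empty thread is typically in the tens of microseconds, far less than the 3 ms gap observed):

#include <chrono>
#include <iostream>
#include <thread>

int main() {
    const int runs = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) {
        std::thread t([] {});   // no work: measures pure spawn/join overhead
        t.join();
    }
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "avg spawn+join: "
              << std::chrono::duration<double, std::micro>(t1 - t0).count() / runs
              << " us\n";
}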
Update
I apologize that I didn't make this clear: this is pseudo code describing in the abstract what my code looks like. It will not work or compile, but mine does, so that is not the issue. Anyway, I fixed the t0 scope issue in this code.
Also, the input size is around 38,000, which I think is sufficient to make use of parallelism.
Update
I tried increasing the size of the input to 5,000,000, but that didn't help. Sequential is still faster than multi-threaded.
Update
I tried increasing the number of threads, splitting the vector evenly between them for evaluation, and got some interesting results. Note that I have an i7-7500U CPU that can run 4 threads in parallel. This leaves me with two questions:
1) Why does creating 4 or more threads show a performance improvement compared with 2 or 3?
2) Why is creating more than 4 threads more efficient than just 4 (the maximum the CPU can run concurrently)?
I wrote the following parallel code for examining all elements in a vector of vectors, storing only those elements that satisfy a given condition. However, some of the vectors within vector<vector<int> > are pretty large while others are pretty small, so my code takes a long time in thread.join(). Can someone please suggest how I can improve its performance?
#include <algorithm>
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>
using namespace std;

std::atomic<int> numPassed{0};   // shared count of elements that passed

void check_if_condition(vector<int>& a, vector<int>& satisfyingElements)
{
    for(vector<int>::iterator i1=a.begin(), l1=a.end(); i1!=l1; ++i1)
        if(some_check_condition(*i1))   // some_check_condition is defined elsewhere
            satisfyingElements.push_back(*i1);
}

void doWork(std::vector<vector<int> >& myVec, std::vector<vector<int> >& results, size_t current, size_t end)
{
    end = std::min(end, myVec.size());
    for(; current < end; ++current) {
        vector<int> satisfyingElements;
        check_if_condition(myVec[current], satisfyingElements);
        if(!satisfyingElements.empty()){
            numPassed += (int)satisfyingElements.size();   // count for the final reserve
            results[current] = satisfyingElements;
        }
    }
}

int main()
{
    std::vector<std::vector<int> > myVec(1000000);
    std::vector<std::vector<int> > results(myVec.size());
    unsigned numparallelThreads = std::thread::hardware_concurrency();

    std::vector<std::thread> parallelThreads;
    auto blockSize = myVec.size() / numparallelThreads;
    for(size_t i = 0; i < numparallelThreads - 1; ++i) {
        parallelThreads.emplace_back(doWork, std::ref(myVec), std::ref(results), i * blockSize, (i+1) * blockSize);
    }
    //also do work in this thread
    doWork(myVec, results, (numparallelThreads-1) * blockSize, myVec.size());

    for(auto& thread : parallelThreads)
        thread.join();

    std::vector<int> storage;
    storage.reserve(numPassed.load());
    auto itRes = results.begin();
    auto endRes = results.end();
    for(; itRes != endRes; ++itRes) {
        if(!(*itRes).empty())
            storage.insert(storage.end(), (*itRes).begin(), (*itRes).end());   // was storage.begin(): O(n) per insert
    }
    std::cout << "Done" << std::endl;
}
It would be nice if you could give some idea of the scale of those 'large' inner vectors, just to see how bad the problem is.
I think, however, that your problem is this:
for(auto& thread : parallelThreads)
    thread.join();
This goes through all the threads sequentially and waits for each one to finish before looking at the next. For a thread pool, you want to wait until every thread is done. This can be done by having each thread notify a condition_variable when it finishes, and waiting on that before processing the results.
Looking at your implementation, the bigger issue here is that your worker threads are not balanced in their consumption.
To get a more balanced load on all of your threads, you need to flatten your data structure, so that the different worker threads can process relatively similar-sized chunks of data. I am not sure where your data is coming from, but having a vector of vectors in an application that is dealing with large data sets doesn't sound like a great idea. Either process the existing vector of vectors into a single one, or read the data in like that if possible. If you need the row number for your processing, you can keep a vector of start-end ranges from which you can find your row number.
Once you have a single big vector, you can break it down into equal-sized chunks to feed to the worker threads. Second, you don't want to build vectors on the stack and push them into another vector, because chances are you will run into memory-allocation issues while your threads are working. Allocating memory is a global state change, and as such it will require some level of locking (with proper address partitioning it could be avoided, though). As a rule of thumb, whenever you are looking for performance, you should remove dynamic allocation from performance-critical parts.
In this case, your threads would perhaps rather 'mark' the elements that satisfy the condition than build vectors of the satisfying elements. Once that's done, you can iterate through only the good ones without pushing or copying anything. Such a solution is less wasteful.
In fact, if I were you, I would first try to solve this issue on a single thread, following the suggestions above. If you get rid of the vector-of-vectors structure and iterate through the elements conditionally (this might be as simple as using one of the xxxx_if algorithms the C++11 standard library provides), you could end up with decent enough performance. Only at that point is it worth looking at delegating chunks of this work to worker threads. At this point in your code there is very little justification to use worker threads just to filter the elements. Do as little writing and moving as you can, and you gain a lot of performance. Parallelization only works well in certain circumstances; a single-threaded sketch follows below.
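For instance, a single-threaded sketch of the flatten-then-filter idea (reusing some_check_condition from the question as the placeholder predicate):

#include <algorithm>
#include <iterator>
#include <vector>

// flatten the vector-of-vectors into one contiguous vector, one allocation up front
std::vector<int> flatten(const std::vector<std::vector<int> >& vv)
{
    size_t total = 0;
    for (const auto& v : vv) total += v.size();
    std::vector<int> flat;
    flat.reserve(total);
    for (const auto& v : vv) flat.insert(flat.end(), v.begin(), v.end());
    return flat;
}

// keep only the satisfying elements, without any per-row vectors
std::vector<int> filter(const std::vector<int>& flat)
{
    std::vector<int> good;
    std::copy_if(flat.begin(), flat.end(), std::back_inserter(good),
                 some_check_condition);   // the OP's predicate, defined elsewhere
    return good;
}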