I have a loop that I want to process in parallel. Each thread needs an (independent) chunk of memory, but it can be overwritten in every iteration and needn't be reallocated. See the following example:
vector<int> scratch(size);
for(int i=0; i < count; i++){
f(arguments, scratch);
g(scratch);
}
where f takes scratch as an output parameter. To make this parallelizable, I could do
#pragma omp parallel for
for(int i=0; i < count; i++){
vector<int> scratch(size);
f(arguments, scratch);
g(scratch);
}
or
#pragma omp parallel
{
vector<int> scratch(size);
#pragma omp for
for(int i=0; i < count; i++){
f(arguments, scratch);
g(scratch);
}
}
Will I be wasting time constructing and destructing scratch in the first version? Or will the compiler (with optimization) most likely reuse the memory and refrain from reallocation?
On a mainstream PC, the second code is inefficient. Indeed, it generally causes the vector to be reallocated and filled with zeros on every iteration. Depending on your system, the default allocator may not scale (AFAIK this is typically the case on Windows with MSVC, but it should be fine on Linux with jemalloc), and this will reduce the performance of your application. The eager zero-filling of the vector can cause the same issue if size is big, since RAM is a limited shared resource. Compilers like Clang are able to optimize out some allocations, but in this case neither GCC nor Clang is able to do this optimization (and the overhead of the memset would still be present anyway).
The third example is quite efficient since the vector is allocated and filled only once per thread. Each thread has its own vector, so the locality is good. This solution is only worse than the second if the number of iterations is smaller than the number of threads. However, this is not much of an issue: if the f and g calls are short, both versions are inefficient anyway (because of the overhead of distributing the work between threads), and if the f and g calls are long, the overhead of the vector is negligible in both.
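For reference, here is a minimal sketch of the per-thread scratch version as a complete function. The bodies of f and g and the name process_all are hypothetical stand-ins (the question does not show them); g returns a value only so that the result can be checked. The pragmas are ignored when OpenMP is not enabled, in which case the function simply runs serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's f: overwrites scratch completely.
void f(int i, std::vector<int>& scratch)
{
    for (std::size_t k = 0; k < scratch.size(); ++k)
        scratch[k] = i + static_cast<int>(k);
}

// Hypothetical stand-in for the question's g: consumes scratch.
long long g(const std::vector<int>& scratch)
{
    long long sum = 0;
    for (int v : scratch)
        sum += v;
    return sum;
}

long long process_all(int count, int size)
{
    long long total = 0;
    #pragma omp parallel
    {
        // Constructed (and zero-filled) once per thread, then reused.
        std::vector<int> scratch(size);

        #pragma omp for reduction(+ : total)
        for (int i = 0; i < count; ++i) {
            f(i, scratch);
            total += g(scratch);
        }
    }
    return total;
}
```

Calling process_all(count, size) allocates at most one scratch vector per thread, regardless of count.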
I am new to multi-threaded programming. I am aware that several similar questions have been asked on SO before; however, I would like an answer specific to my code.
I have two vectors of objects (v1 & v2) that I want to loop through and depending on if they meet some criteria, add these objects to a single vector like so:
Non-Multithread Case
std::vector<hobj> validobjs;
int length = 70;
for(auto i = this->v1.begin(); i < this->v1.end() ;++i) {
if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
hobj obj(*i, length);
validobjs.push_back(obj);
}
}
for(auto j = this->v2.begin(); j < this->v2.end() ;++j) {
if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
hobj obj(*j, length);
validobjs.push_back(obj);
}
}
Multithread Case
std::vector<hobj> validobjs;
int length = 70;
#pragma omp parallel
{
std::vector<hobj> threaded1; // Each thread has own local vector
#pragma omp for nowait firstprivate(length)
for(auto i = this->v1.begin(); i < this->v1.end() ;++i) {
if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
hobj obj(*i, length);
threaded1.push_back(obj);
}
}
std::vector<hobj> threaded2; // Each thread has own local vector
#pragma omp for nowait firstprivate(length)
for(auto j = this->v2.begin(); j < this->v2.end() ;++j) {
if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
hobj obj(*j, length);
threaded2.push_back(obj);
}
}
#pragma omp critical // Insert local vectors to main vector one thread at a time
{
validobjs.insert(validobjs.end(), threaded1.begin(), threaded1.end());
validobjs.insert(validobjs.end(), threaded2.begin(), threaded2.end());
}
}
In the non-multithreaded case my total time spent doing the operation is around 4x faster than the multithreaded case (~1.5s vs ~6s).
I am aware that the #pragma omp critical directive is a performance hit but since I do not know the size of the validobjs vector beforehand I cannot rely on random insertion by index.
So questions:
1) Is this kind of operation suited for multi-threading?
2) If yes to 1) - does the multithreaded code look reasonable?
3) Is there anything I can do to improve the performance to get it faster than the no-thread case?
Additional info:
The above code is nested within a much larger codebase that performs 10,000 to 100,000s of iterations (this outer loop is not multithreaded). I am aware that spawning threads also incurs a performance overhead, but as far as I am aware these threads are kept alive until the above code is executed again in the next iteration.
omp_set_num_threads is set to 32 (I'm on a 32 core machine).
Ubuntu, gcc 7.4
Cheers!
I'm no expert on multithreading, but I'll give it a try:
Is this kind of operation suited for multi-threading?
I would say yes. Especially if you have huge datasets, you could split them even further, running any number of filtering operations in parallel. But it depends on the amount of data you want to process: thread creation and synchronization are not free.
Neither is the merging at the end of the threaded version.
Does the multithreaded code look reasonable?
I think you're on the right path in letting each thread work on independent data.
Is there anything I can do to improve the performance to get it faster than the no-thread case?
I see a few points that might improve performance:
The vectors will need to resize often, which is expensive. You can use reserve() to, well, reserve memory beforehand and thus reduce the number of reallocations (to 0 in the optimal case).
Same goes for the merging of the two vectors at the end, which is a critical point, first reserve:
validobjs.reserve(v1.size() + v2.size());
then merge.
Copying objects from one vector to another can be expensive, depending on the size of the objects you copy and if there is a custom copy-constructor that executes some more code or not. Consider storing only indices of the valid elements or pointers to valid elements.
You could also try to replace elements in parallel in the resulting vector. That could be useful if default-constructing an element is cheap and copying is a bit expensive.
Filter the data in two threads as you do now.
Synchronise them and allocate a vector with a number of elements:
validobjs.resize(v1.size() + v2.size());
Let each thread insert elements into independent parts of the vector. For example, thread one writes to indices 0 to x and thread two writes to indices x + 1 to validobjs.size() - 1.
Although I'm not sure if this is entirely legal or if it is undefined behaviour.
You could also think about using std::list (linked list). Concatenating linked lists, or removing elements happens in constant time, however adding elements is a bit slower than on a std::vector with reserved memory.
Those were my thoughts on this; I hope there was something useful in them.
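To make a couple of the points above concrete, here is a rough sketch that stores only the indices of the valid elements, reserves the result up front, and merges the per-thread results in a critical section. Item and valid_indices are hypothetical names; Item just models the two flags from the question. Without OpenMP enabled the pragmas are ignored and the code runs serially.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's objects: only the two flags matter here.
struct Item {
    bool ignore;
    bool error;
};

std::vector<std::size_t> valid_indices(const std::vector<Item>& items)
{
    std::vector<std::size_t> valid;
    valid.reserve(items.size());  // upper bound, so the merge never reallocates
    const long n = static_cast<long>(items.size());

    #pragma omp parallel
    {
        // Per-thread result: no locking is needed while filtering.
        std::vector<std::size_t> mine;

        #pragma omp for nowait
        for (long i = 0; i < n; ++i)
            if (!items[i].ignore && !items[i].error)
                mine.push_back(static_cast<std::size_t>(i));

        // One thread at a time appends its (cheap to copy) indices.
        #pragma omp critical
        {
            valid.insert(valid.end(), mine.begin(), mine.end());
        }
    }

    // Threads reach the critical section in arbitrary order; sort to restore input order.
    std::sort(valid.begin(), valid.end());
    return valid;
}
```

Whether this beats the serial loop still depends on how expensive the per-element check is; for two flag reads per element the serial version may well remain faster.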
IMHO,
You copy each element twice: into threaded1/2 and after that into validobjs.
This can make your code slower.
You could instead add elements directly to a single vector by using synchronization.
I am new to TBB and am trying to do a simple experiment.
My data for functions are:
int n = 9000000;
int *data = new int[n];
I created two functions. The first one does not use TBB:
void _array(int* &data, int n) {
for (int i = 0; i < n; i++) {
data[i] = busyfunc(data[i])*123;
}
}
It takes 0.456635 seconds.
I also created a second function, this one using TBB:
void parallel_change_array(int* &data,int list_count) {
//Instructional example - parallel version
parallel_for(blocked_range<int>(0, list_count),
[=](const blocked_range<int>& r) {
for (int i = r.begin(); i < r.end(); i++) {
data[i] = busyfunc(data[i])*123;
}
});
}
It takes me 0.584889 seconds.
As for busyfunc(int m):
int busyfunc(int m)
{
m *= 32;
return m;
}
Can you tell me why the function without TBB takes less time than the one with TBB?
I think the problem is that the functions are simple, so they are easy to compute without TBB.
First, busyfunc() does not seem so busy: 9M elements are computed in just half a second, which makes this example rather memory-bound (uncached memory operations take orders of magnitude more cycles than arithmetic operations). Memory-bound computations do not scale as well as compute-bound ones; e.g. plain memory copying usually scales up to no more than, say, 4 times, even when running on a much larger number of cores/processors.
Also, memory bound programs are more sensitive to NUMA effects and since you allocated this array as contiguous memory using standard C++, it will be allocated by default entirely on the same memory node where the initialization occurs. This default can be altered by running with numactl -i all --.
Last, but most important, TBB initializes threads lazily and pretty slowly. I guess you do not intend to write an application that exits after 0.5 seconds of parallel computation. Thus, a fair benchmark should take into account all the warm-up effects that are expected in the real application. At the very least, it has to wait until all the threads are up and running before starting measurements. This answer suggests one way to do that.
[update] Please also refer to Alexey's answer for another possible reason lurking in compiler optimization differences.
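As a rough illustration of the warm-up point, here is a small timing helper in plain C++ that runs the workload a few times untimed before measuring. The name best_time_after_warmup and its parameters are made up for this sketch; it does not wait for TBB's worker threads explicitly, it only assumes that a few untimed runs give them time to start.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Runs work() warmup_runs times without timing it, then timed_runs times with
// timing, and returns the best (smallest) wall-clock time in seconds.
template <class F>
double best_time_after_warmup(F&& work, int warmup_runs, int timed_runs)
{
    for (int i = 0; i < warmup_runs; ++i)
        work();  // lets thread pools, caches and page mappings settle

    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < timed_runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

In the question's setting one would pass a lambda such as [&]{ parallel_change_array(data, n); } for the TBB version and [&]{ _array(data, n); } for the serial one, and compare the two results.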
In addition to Anton's answer, I recommend checking whether the compiler was able to optimize the code equivalently.
To start, check the performance of the TBB version executed by a single thread, without real parallelism. You can use tbb::global_control or tbb::task_scheduler_init to limit the number of threads to 1, e.g.
tbb::global_control ctl(tbb::global_control::max_allowed_parallelism, 1);
The overheads of thread creation, as well as cache locality or NUMA effects, should not play a role when all the code is executed by one thread. Therefore you should see approximately the same performance as for the no-TBB version. If you do, then you have a scalability issue, and Anton explained possible reasons.
However, if you see that performance drops a lot, then it is a serial optimization issue. One known reason is that some compilers cannot optimize the loop over a blocked_range as well as they optimize the original loop; it was also observed that storing r.end() into a local variable may help:
int rend = r.end();
for (int i = r.begin(); i < rend; i++) {
data[i] = busyfunc(data[i])*123;
}
I have run into a rather frustrating problem with OpenMP: it seems that if OpenMP is used in parallel mode somewhere in the code (for more than one thread), then dynamic memory allocation/de-allocation becomes slower even in non-parallel portions of code. Here is an example program (just an illustration):
int main()
{
#pragma omp parallel
{
// Just to get OpenMP going
}
double wtime0, wtime;
wtime0 = omp_get_wtime();
double **stuff;
const int N = 1000000;
stuff = new double*[N];
for (int i=0; i < N; i++) stuff[i] = new double;
for (int i=0; i < N; i++) *(stuff[i]) = sqrt(i);
for (int i=0; i < N; i++) delete stuff[i]; // allocated with plain new, so plain delete
delete[] stuff;
wtime = omp_get_wtime() - wtime0;
cout << "Total CPU time: " << wtime << endl;
}
When I run this code with one thread on my laptop (which is an Intel Core 2 Duo), I get a CPU time of 0.093. On the other hand, if I run it with two threads, the CPU time increases to 0.13. The more pointer allocations there are, the worse the discrepancy becomes. In the above code, if I were to replace "stuff" by a simple array, e.g.
double stuff2[N];
for (int i=0; i < N; i++) stuff2[i] = sqrt(i);
then there is no discrepancy. Can someone tell me why this problem exists when pointers are allocated/de-allocated, even though it's not done in parallel? The reason why this is a problem is because in the real code I am working with, dynamic memory allocation is essential. There are sections that can be sped up by running in parallel, but (with two threads versus one) this is more than overcompensated by the fact that the memory allocation/de-allocation is slowed down considerably, even in the non-parallel sections. If someone with extensive OpenMP experience can tell me how to get around this problem I would really appreciate it. (Worst case scenario, I can just use MPI instead, but I would love it if this can be solved within OpenMP.)
Thanks in advance for the help.
Yes, this is conceivable. In general, one should avoid naive dynamic allocations in a multi-threaded environment, as the default allocator typically uses a single lock. MT-aware allocators provide much better performance and should be preferred in allocation-heavy scenarios.
This is exactly why I always frown on code that just uses vectors, strings or shared pointers as class members without letting users specify an allocation policy.
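Independently of which allocator is used, the number of allocator calls can often be reduced. As a minimal sketch, assuming the data layout allows it (which may not hold for the asker's real code), the million separate new double calls from the benchmark can be replaced by a single contiguous block. The name sqrt_table is hypothetical.

```cpp
#include <cmath>
#include <vector>

// One allocation for all n values instead of n separate heap allocations.
std::vector<double> sqrt_table(int n)
{
    std::vector<double> stuff(n);
    for (int i = 0; i < n; ++i)
        stuff[i] = std::sqrt(static_cast<double>(i));
    return stuff;  // freed in one go when the vector goes out of scope
}
```

This takes the allocator lock once rather than N times, so any per-call locking overhead in a multi-threaded process matters far less.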
I am writing C++ code using OpenMP. I have a huge global array (100,000+ elements) that is modified by adding values in a for loop. Is there a way to efficiently have each thread created by an OpenMP parallel for maintain its own local copy of the array, and then join the copies after the loop? Since the number of threads is variable, I cannot create the local copies of the array beforehand. If I use a global copy and address the race condition with a synchronization lock, the performance is terrible.
Thanks!
Edited:
Sorry for not being clear. Here's some pseudo-code that hopefully clarifies the scenario:
int* huge_array=new int[N];
memset(huge_array, 0, N*sizeof(int));
#pragma omp parallel for
for (int i=0; i<n; i++)
{
// get a value v independently
// get a position p independently
// I have to set a lock here
omp_set_lock(&lock);
huge_array[p] += v;
omp_unset_lock(&lock);
}
Is there a way to improve the performance of the code above?
Okay, I finally understood what you want to do. Yes, you do it the same way as with pthreads.
std::vector<int> A(N,0);
std::vector<int*> local(omp_max_num_threads());
#pragma omp parallel
{
int np = omp_get_num_threads();
std::vector<int> localA(N);
local[omp_get_thread_num()] = localA.data();
// add values to local array
#pragma omp for
for(int i=0; i<num_values; ++i)
localA[position()] += value(); // (1)
// implicit barrier ensures all local copies are ready for aggregation
// aggregate local copies into global array
#pragma omp for
for(int k=0; k<N; ++k)
for(int p=0; p<np; ++p)
A[k] += local[p][k]; // (2)
// implicit barrier ensures no local copy is deleted before aggregation is done
}
Note that it is important to do the aggregation in parallel as well.
In Walter's answer, I believe instead of
std::vector<int*> local(omp_max_num_threads());
It should be
std::vector<int*> local(omp_get_max_threads());
omp_max_num_threads() is not a routine in OpenMP.
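For completeness, here is a self-contained sketch of that approach with the corrected routine name. The function name accumulate_parallel and the two input vectors (positions and values, standing in for the independently computed p and v) are assumptions for illustration. The serial fallbacks only exist so that the sketch also builds when OpenMP is not enabled.

```cpp
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without OpenMP enabled.
inline int omp_get_max_threads() { return 1; }
inline int omp_get_num_threads() { return 1; }
inline int omp_get_thread_num() { return 0; }
#endif

// positions[i] is the slot to update and values[i] the amount to add (hypothetical inputs).
std::vector<int> accumulate_parallel(const std::vector<int>& positions,
                                     const std::vector<int>& values,
                                     int n_slots)
{
    std::vector<int> global(n_slots, 0);
    std::vector<int*> local(omp_get_max_threads(), nullptr);
    const int n_values = static_cast<int>(positions.size());

    #pragma omp parallel
    {
        const int np = omp_get_num_threads();
        std::vector<int> mine(n_slots, 0);          // this thread's private copy
        local[omp_get_thread_num()] = mine.data();  // publish it for the merge step

        #pragma omp for
        for (int i = 0; i < n_values; ++i)
            mine[positions[i]] += values[i];
        // implicit barrier: every private copy is complete from here on

        #pragma omp for
        for (int k = 0; k < n_slots; ++k)
            for (int p = 0; p < np; ++p)
                global[k] += local[p][k];
        // implicit barrier: no private copy is destroyed before the merge is done
    }
    return global;
}
```

Each thread needs its own n_slots-sized copy, so memory use grows with the thread count; that is the price for avoiding the lock.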
What about using the directive
#pragma omp parallel for private(VARIABLE)
for your program?
EDIT:
For your code I would use my directive; you won't lose so much time locking and unlocking your variable...
EDIT 2:
Sorry, you cannot use my code for your problem unless you first create a temporary array in which you store your data temporarily...
As far as I can tell, you are essentially filling a histogram, where position is the bin of the histogram to fill and value is the weight/value that you add to that bin. Filling a histogram in parallel is equivalent to doing an array reduction. The C++ implementation of OpenMP does not have direct support for this; however, as far as I understand, some versions of the Fortran implementation do. To do an array reduction in C++ with OpenMP I have two suggestions.
1.) If the number of bins of the histogram (array) is much less than the number of values that will fill the histogram (which is often the preferred case, since one wants reasonable statistics in each bin), then you can fill private versions of the histogram in parallel and merge them serially in a critical section. Since the number of bins is much less than the number of values, this should be efficient.
2.) However, if the number of bins is large (as your example seems to imply), then it's possible to merge the private histograms in parallel as well, but this is a bit more tricky. Additionally, one needs to be careful with cache alignment and false sharing.
I showed how to do both these methods and discuss some of the cache issues in the following question:
Fill histograms (array reduction) in parallel with openmp without using a critical section.
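A minimal sketch of the first suggestion (private histograms merged in a critical section) could look like the following. The function name fill_histogram and its inputs are assumptions for illustration: bins_hit[i] is the bin for the i-th value and weights[i] its weight. Without OpenMP enabled the pragmas are ignored and the code runs serially.

```cpp
#include <vector>

std::vector<int> fill_histogram(const std::vector<int>& bins_hit,
                                const std::vector<int>& weights,
                                int n_bins)
{
    std::vector<int> hist(n_bins, 0);
    const int n = static_cast<int>(bins_hit.size());

    #pragma omp parallel
    {
        // Private histogram: filled without any synchronization.
        std::vector<int> priv(n_bins, 0);

        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            priv[bins_hit[i]] += weights[i];

        // Serial merge, one thread at a time; cheap when n_bins is much smaller than n.
        #pragma omp critical
        {
            for (int b = 0; b < n_bins; ++b)
                hist[b] += priv[b];
        }
    }
    return hist;
}
```

The merge costs n_bins additions per thread, executed one thread after another, which is why this variant is only attractive when the number of bins is small relative to the number of values.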
I'm writing a function where I need a significant amount of heap memory. Is it possible to tell the compiler that those data will be accessed frequently within a specific for loop, so as to improve performance (through compile options or similar)?
The reason I cannot use the stack is that the number of elements I need to store is big, and I get segmentation fault if I try to do it.
Right now the code is working but I think it could be faster.
UPDATE:
I'm doing something like this
vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
for(uint j = i+1; j < node_vec.size(); j++)
// some computation, basic math, store the result in variable x
if( x > threshold ) {
vec[i].insert(j);
vec[j].insert(i);
}
some details:
- I tried hash_set: little improvement, besides the fact that hash_set is not available on all the machines I use for simulation
- I tried to allocate vec on the stack using arrays but, as I said, I might get segmentation fault if the number of elements is too big
If node_vec.size() is, say, equal to k, where k is of the order of a few thousands, I expect vec to be 4 or 5 times bigger than node_vec. With this order of magnitude the code appears to be slow, considering the fact that I have to run it many times. Of course, I am using multithreading to parallelize these calls, but I can't get the function per se to run much faster than what I'm seeing right now.
Would it be possible, for example, to have vec allocated in the cache memory for fast data retrieval, or something similar?
I'm writing a function where I need a significant amount of heap memory ... will be accessed frequently within a specific for loop
This isn't something you can really optimize at a compiler level. I think your concern is that you have a lot of memory that may be "stale" (paged out) but at a particular point in time you will need to iterate over all of it, maybe several times and you don't want the memory pages to be paged out to disk.
You will need to investigate platform-specific strategies to improve performance. Keeping the pages in memory can be achieved with mlockall or VirtualLock, but you really shouldn't need to do this. Make sure you know what the implications of locking your application's memory pages into RAM are, however: you're hogging memory from other processes.
You might also want to investigate a low fragmentation heap (however it may not be relevant at all to this problem) and this page which describes cache lines with respect to for loops.
The latter page is about the nitty-gritty of how CPUs work (a detail you normally shouldn't have to be concerned with) with respect to memory access.
Example 1: Memory accesses and performance
How much faster do you expect Loop 2 to run, compared to Loop 1?
int[] arr = new int[64 * 1024 * 1024];
// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16-th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine.
UPDATE
vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
for(uint j = i+1; j < node_vec.size(); j++)
// some computation, basic math, store the result in variable x
if( x > threshold ) {
vec[i].insert(j);
vec[j].insert(i);
}
That still doesn't show much, because we cannot know how often the condition x > threshold will be true. If x > threshold is very frequently true, then the std::set might be the bottleneck, because it has to do a dynamic memory allocation for every uint you insert.
Also we don't know what "some computation" actually means/does/is. If it does much, or does it in the wrong way that could be the bottleneck.
And we don't know how you need to access the result.
Anyway, on a hunch:
vector<pair<int, int> > vec1;
vector<pair<int, int> > vec2;
for (uint i = 0; i < node_vec.size(); i++)
{
for (uint j = i+1; j < node_vec.size(); j++)
{
// some computation, basic math, store the result in variable x
if (x > threshold)
{
vec1.push_back(make_pair(i, j));
vec2.push_back(make_pair(j, i));
}
}
}
If you can use the result in that form, you're done. Otherwise you could do some post-processing. Just don't copy it into a std::set again (obviously). Try to stick to std::vector<POD>. E.g. you could build an index into the vectors like this:
// ...
vector<int> index1 = build_index(node_vec.size(), vec1);
vector<int> index2 = build_index(node_vec.size(), vec2);
// ...
}
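vector<int> build_index(size_t count, vector<pair<int, int> > const& vec)
{
    vector<int> index(count, -1);
    // Walk backwards so that index[n] ends up at the first entry of vec whose
    // .first is n. The loop form also copes with an empty vec, where the
    // original do/while would have underflowed i.
    for (size_t i = vec.size(); i-- > 0; )
    {
        assert(vec[i].first >= 0);
        assert(static_cast<size_t>(vec[i].first) < count);
        index[vec[i].first] = static_cast<int>(i);
    }
    return index;
}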
ps.: I'm almost sure your loop is not memory-bound. Can't be sure though... if the "nodes" you're not showing us are really big it might still be.
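If the result is needed as "all neighbours of node n", one possible post-processing step that sticks to plain vectors is a compressed adjacency layout. This is a variation on the idea above rather than the code from this answer; the names Adjacency and build_adjacency are hypothetical, and edges is assumed to hold each (i, j) pair once with both values below node_count.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Adjacency {
    std::vector<std::size_t> offsets;  // size node_count + 1
    std::vector<unsigned> neighbors;   // neighbors of n: [offsets[n], offsets[n + 1])
};

Adjacency build_adjacency(std::size_t node_count,
                          const std::vector<std::pair<unsigned, unsigned> >& edges)
{
    Adjacency adj;
    adj.offsets.assign(node_count + 1, 0);

    // Pass 1: count the degree of every node (stored one slot to the right).
    for (const auto& e : edges) {
        ++adj.offsets[e.first + 1];
        ++adj.offsets[e.second + 1];
    }
    // Prefix sum turns the degrees into start offsets.
    for (std::size_t n = 0; n < node_count; ++n)
        adj.offsets[n + 1] += adj.offsets[n];

    // Pass 2: scatter both directions of every edge into its node's range.
    adj.neighbors.resize(2 * edges.size());
    std::vector<std::size_t> cursor(adj.offsets.begin(), adj.offsets.end() - 1);
    for (const auto& e : edges) {
        adj.neighbors[cursor[e.first]++] = e.second;
        adj.neighbors[cursor[e.second]++] = e.first;
    }
    return adj;
}
```

This needs exactly two allocations regardless of the number of edges, in contrast to one node allocation per inserted element with std::set.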
Original answer:
There is no easy I_will_access_this_frequently_so_make_it_fast(void* ptr, size_t len)-kind-of solution.
You can do some things though.
Make sure the compiler can "see" the implementation of every function that's called inside critical loops. What is necessary for the compiler to be able to "see" the implementation depends on the compiler. There is one way to be sure though: define all relevant functions in the same translation unit before the loop, and declare them as inline.
This also means you should not by any means call "external" functions in those critical loops. By "external" functions I mean things like system calls, runtime-library stuff or stuff implemented in a DLL/SO. Also don't call virtual functions and don't use function pointers. And of course don't allocate or free memory (inside the critical loops).
Make sure you use an optimal algorithm. Linear optimization is moot if the complexity of the algorithm is higher than necessary.
Use the smallest possible types. E.g. don't use int if signed char will do the job. That's something I wouldn't normally recommend, but when processing a large chunk of memory it can increase performance quite a lot. Especially in very tight loops.
If you're just copying or filling memory, use memcpy or memset. Disable the intrinsic version of those two functions if the chunks are larger than about 50 to 100 bytes.
Make sure you access the data in a cache-friendly manner. The optimum is "streaming" - i.e. accessing the memory with ascending or descending addresses. You can "jump" ahead some bytes at a time, but don't jump too far. The worst is random access to a big block of memory. E.g. if you have to work on a 2 dimensional matrix (like a bitmap image) where p[0] to p[1] is a step "to the right" (x + 1), make sure the inner loop increments x and the outer increments y. If you do it the other way around performance will be much much worse.
If your pointers are alias-free, you can tell the compiler (how that's done depends on the compiler). If you don't know what alias-free means, I recommend searching the net and your compiler's documentation, since an explanation would be beyond the scope of this answer.
Use intrinsic SIMD instructions if appropriate.
Use explicit prefetch instructions if you know which memory locations will be needed in the near future.
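To illustrate the cache-friendly access point from the list above, here is a small sketch of the recommended loop order for a row-major 2D array (x in the inner loop, y in the outer loop). The function name sum_row_major and the flat std::vector layout are assumptions for this example.

```cpp
#include <cstddef>
#include <vector>

// pixels is stored row by row: element (x, y) lives at pixels[y * width + x].
long long sum_row_major(const std::vector<int>& pixels,
                        std::size_t width, std::size_t height)
{
    long long sum = 0;
    for (std::size_t y = 0; y < height; ++y) {       // outer loop: rows
        const int* row = pixels.data() + y * width;
        for (std::size_t x = 0; x < width; ++x)      // inner loop: ascending addresses
            sum += row[x];
    }
    return sum;
}
```

Swapping the two loops would compute the same sum but step through memory with a stride of width elements, which is the access pattern the advice warns against.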
You can't do that with compiler options. Depending on your usage (insertion, random-access, deleting, sorting, etc.), you could maybe get a better suited container.
The compiler can already see that the data is accessed frequently within the loop.
Assuming you're only allocating the data from the heap once before doing the looping, note, as @lvella did, that memory is memory, and if it's accessed frequently it should be effectively cached during execution.