Increasing array index in openMP - c++

I am new to using OpenMP. I am trying to parallelize a nested loop, and so far I have something of this form...
#pragma omp parallel for
for (j=0;j <m; j++) {
some work;
for (i= 0; i < n ; i++) {
p =b[i];
if (P< 0 && k < m) {
a[k] = c[i]; k++ ;
} else {
x=c[i];
}
}
some work
}
The outer loop is in parallel, and the inner loop updates k. The current value of k is needed for the other threads to update a[k] correctly. The problem is that all of the threads are updating a[k], but the proper order of k is not kept.
Some threads will update k and a[k], and some will not. How do I communicate the latest k between threads to update a[k] properly, since c[i] will have different values for each thread?
For example, when it runs serially, the program might set the first seven values of a to {1,3,5,7,3,9,13} and terminate with k equal to 7, but when done parallel, produces different results, or results in a different (therefore wrong) order.
How do I keep the same order and ensure parallelism at the same time?

Note: this answer was completely rewritten in light of OP clarifications. The original answer text is at the end.
How do I keep the same order and ensure parallelism at the same time?
Order dependency is antithetical to parallelism, as running operations in parallel inherently entails relaxing the relative order in which they are performed. Not all computations can be effectively parallelized.
Your case is not an exception. The second and each subsequent iteration of your outer loop needs to use the final value of k (among other things) computed by the previous iteration. How can it get that? Only by performing the previous iteration first. What room does that leave for concurrent operation? None. Concurrency is not the same thing as parallelism, but it is one of the main motivations for parallelism, because that's how parallelism yields improvements in elapsed time.
With no scope for concurrency, parallelism is actively counterproductive for you. Suppose you made the whole body of the outer loop a critical section, so that there was no concurrency in fact (as your present code requires) and no data races involving k. Then you would still pay the overhead for parallelism, get no speedup in return, and probably still get the wrong results because of evaluations of the outer-loop body being performed in the wrong order.
It may be that the whole thing can be rewritten to reduce or remove the data dependencies that prevent effective parallelization of the computation, or it may not. We haven't enough information to determine, as it depends in part on the details of "some work" and on the significance of the data. Probably you would need an altogether different algorithm for producing the desired results.
> Instead of giving a[n]={0,1,2,3,.......n} , it gives me garbage values for a when I use the reduction clause. I need the total sum of K, hence the reduction clause.
There is a closed-form equation for the sum of consecutive integers, and it has especially simple form when the first integer in the list is 0 or 1. In particular, the sum of the integers from 0 to n, inclusive, is n * (n + 1) / 2. You do not need a reduction for this.
If you wanted to use a reduction anyway, then you need to understand that it doesn't work the way you seem to think it does. What you get is a separate, private copy of the reduction variable for each thread executing the parallel construct, with the per thread (not per iteration) final values of those independant variables combined according to the reduction operator. Thus, if you really want to do the computation via an OpenMP reduction, then you would need to restructure the loop something like this:
#pragma omp parallel for reduction (+:k)
for (i = 0; i < 10; i++) {
a[i] = i;
k += i;
}
That assumes that the value of k is 0 immediately prior to the loop, as you indeed seem to be doing. If that were not a safe assumption then you would need something like
type_of_k k0 = k;
k = 0;
#pragma omp parallel for reduction (+:k)
for (i = 0; i < 10; i++) {
a[k0 + i] = i;
k += k0 + i;
}
Note that in either case, not only does that set up the reduction correctly, but it also breaks the data dependency between loop iterations that was previously carried by the expression k++.

It sounds like you're essentially filling in a with a filter of entries from c, and want to preserve their order. If this is the only use k has, some other methods spring to mind:
Always write a[i], but use a mark indicating unused values where the P predicate wasn't satisfied. This preserves order, but requires a larger a you can compact in a second pass.
Write an a_i array storing which index each entry belonged to. This still requires a #pragma omp atomic k_local = k++ access to k, and a second sort to restore order. And you'd need both a and a_i to be the full size again, or you might miss entries, so in all a terrible workaround.
Even with some sequential dependencies you can do optimizations, e.g. a scan to calculate what k would be for each i could be done in O(log n) rather than O(n). E.g. parallel prefix sum, openmp discussion on stack overflow. This sort of thing is what OpenMP's ordered depend is for, I believe. Anyhow, this leads to the third solution:
Generate a k array, holding the values k will have for each iteration, such that those threads that will write write to the correct places. This requires scanning the predicate.
It is useful to have higher level constructs like map, scan and reduce when planning out algorithms.

Related

Multithreading is slower than no threading C++

I am new to multi-thread programming and I am aware several similar questions have been asked on SO before however I would like to get an answer specific to my code.
I have two vectors of objects (v1 & v2) that I want to loop through and depending on if they meet some criteria, add these objects to a single vector like so:
Non-Multithread Case
std::vector<hobj> validobjs;
int length = 70;
for(auto i = this->v1.begin(); i < this->v1.end() ;++i) {
if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
hobj obj(*i, length);
validobjs.push_back(hobj);
}
}
for(auto j = this->v2.begin(); j < this->v2.end() ;++j) {
if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
hobj obj(*j, length);
validobjs.push_back(hobj);
}
}
Multithread Case
std::vector<hobj> validobjs;
int length = 70;
#pragma omp parallel
{
std::vector<hobj> threaded1; // Each thread has own local vector
#pragma omp for nowait firstprivate(length)
for(auto i = this->v1.begin(); i < this->v1.end() ;++i) {
if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
hobj obj(*i, length);
threaded1.push_back(obj);
}
}
std::vector<hobj> threaded2; // Each thread has own local vector
#pragma omp for nowait firstprivate(length)
for(auto j = this->v2.begin(); j < this->v2.end() ;++j) {
if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
hobj obj(*j, length);
threaded2.push_back(obj);
}
}
#pragma omp critical // Insert local vectors to main vector one thread at a time
{
validobjs.insert(validobjs.end(), threaded1.begin(), threaded1.end());
validobjs.insert(validobjs.end(), threaded2.begin(), threaded2.end());
}
}
In the non-multithreaded case my total time spent doing the operation is around 4x faster than the multithreaded case (~1.5s vs ~6s).
I am aware that the #pragma omp critical directive is a performance hit but since I do not know the size of the validobjs vector beforehand I cannot rely on random insertion by index.
So questions:
1) Is this kind of operation suited for multi-threading?
2) If yes to 1) - does the multithreaded code look reasonable?
3) Is there anything I can do to improve the performance to get it faster than the no-thread case?
Additional info:
The above code is nested within a much larger codebase that is performing 10,000 - 100,000s of iterations (this loop is not using multithreading). I am aware that spawning threads also incurs a performance overhead but as afar as I am aware these threads are being kept alive until the above code is once again executed every iteration
omp_set_num_threads is set to 32 (I'm on a 32 core machine).
Ubuntu, gcc 7.4
Cheers!
I'm no expert on multithreading, but I'll give it a try:
Is this kind of operation suited for multi-threading?
I would say yes. Especially if you got huge datasets, you could split them even further, running any number of filtering operations in parallel. But it depends on the amount of data you want to process, thread creation and synchronization is not free.
As is the merging at the end of the threaded version.
Does the multithreaded code look reasonable?
I think you'r on the right path to let each thread work on independent data.
Is there anything I can do to improve the performance to get it faster than the no-thread case?
I see a few points that might improve performance:
The vectors will need to resize often, which is expensive. You can use reserve() to, well, reserve memory beforehand and thus reduce the number of reallocations (to 0 in the optimal case).
Same goes for the merging of the two vectors at the end, which is a critical point, first reserve:
validobjs.reserve(v1.size() + v2.size());
then merge.
Copying objects from one vector to another can be expensive, depending on the size of the objects you copy and if there is a custom copy-constructor that executes some more code or not. Consider storing only indices of the valid elements or pointers to valid elements.
You could also try to replace elements in parallel in the resulting vector. That could be useful if default-constructing an element is cheap and copying is a bit expensive.
Filter the data in two threads as you do now.
Synchronise them and allocate a vector with a number of elements:
validobjs.resize(v1.size() + v2.size());
Let each thread insert elements on independent parts of the vector. For example, thread one will write to indices 1 to x and thread 2 writes to indices x + 1 to validobjs.size() - 1
Allthough I'm not sure if this is entirely legal or if it is undefined behaviour
You could also think about using std::list (linked list). Concatenating linked lists, or removing elements happens in constant time, however adding elements is a bit slower than on a std::vector with reserved memory.
Those were my thoughts on this, I hope there was something usefull in it.
IMHO,
You copy each element twice: into threaded1/2 and after that into validobjs.
It can make your code slower.
You can add elements into single vector by using synchronization.

Are there workarounds for using OpenMP reduction on floating point addition?

I am currently working on parallelizing a nested for loop using C++ and OpenMP. Without going into the actual details of the program, I have constructed a basic example on the concepts I am using below:
float var = 0.f;
float distance = some float array;
float temp[] = some float array;
for(int i=0; i < distance.size; i++){
\\some work
for(int j=0; j < temp.size; j++){
var += temp[i]/distance[j]
}
}
I attempted to parallelize the above code in the following way:
float var = 0.f;
float distance = some float array;
float temp[] = some float array;
#pragma omp parallel for default(shared)
for(int i=0; i < distance.size; i++){
\\some work
#pragma omp parallel for reduction(+:var)
for(int j=0; j < temp.size; j++){
var += temp[i]/distance[j]
}
}
I then compared the serial program output with the parallel program output and I got incorrect result. I know that this is mainly due to the fact that floating point arithmetic is not associative. But are there any workarounds to this that give exact results?
Although the lack of associativity of floating point arithmetic might be an issue in some cases, the code you show here exposes a much more essential problem which you need to address first: the status of the var variable in the outer loop.
Indeed, since var is modified inside the i loop, even if only in the j part of the i loop, it needs to be "privatized" somehow. Now the exact status it needs to get depends on the value you expect it to store upon exit of the enclosing parallel region:
If you don't care about its value at all, just declare it private (or better, declare it inside the parallel region.
If you need its final value at the end of the i loop, and considering it accumulates a sum of values, most likely you'll need to declare it reduction(+:), although lastprivate might also be what you want (impossible to say without further details)
If private or lastprivate was all you needed, but you also need its initial value upon entrance of the parallel region, then you'll have to consider adding firstprivate too (no need of that if you went for reduction as it is already been taken care of)
That should be enough for fixing your issue.
Now, in your snippet, you also parallelized the inner loop. That is usually a bad idea to go for nested parallelism. So unless you have a very compelling reason for doing so, you will likely get much better performance by only parallelizing the outer loop, and leaving the inner loop alone. That won't mean the inner loop won't benefit from the parallelization, but rather that several instances of the inner loop will be computed in parallel (each one being sequential admittedly, but the whole process is parallel).
A nice side effect of removing the inner loop's parallelization (in addition to making the code faster) is that now all accumulations inside the privates var variables are done in the same order as when not in parallel. Therefore, your (hypothetical) floating point arithmetic issues inside the outer loop will now have disappeared, and only if you needed the final reduction upon exit of the parallel region might you still face them there.

Which one is more optimized for accessing array?

Solving the following exercise:
Write three different versions of a program to print the elements of
ia. One version should use a range for to manage the iteration, the
other two should use an ordinary for loop in one case using subscripts
and in the other using pointers. In all three programs write all the
types directly. That is, do not use a type alias, auto, or decltype to
simplify the code.[C++ Primer]
a question came up: Which of these methods for accessing array is optimized in terms of speed and why?
My Solutions:
Foreach Loop:
int ia[3][4]={{1,2,3,4},{5,6,7,8},{9,10,11,12}};
for (int (&i)[4]:ia) //1st method using for each loop
for(int j:i)
cout<<j<<" ";
Nested for loops:
for (int i=0;i<3;i++) //2nd method normal for loop
for(int j=0;j<4;j++)
cout<<ia[i][j]<<" ";
Using pointers:
int (*i)[4]=ia;
for(int t=0;t<3;i++,t++){ //3rd method. using pointers.
for(int x=0;x<4;x++)
cout<<(*i)[x]<<" ";
Using auto:
for(auto &i:ia) //4th one using auto but I think it is similar to 1st.
for(auto j:i)
cout<<j<<" ";
Benchmark result using clock()
1st: 3.6 (6,4,4,3,2,3)
2nd: 3.3 (6,3,4,2,3,2)
3rd: 3.1 (4,2,4,2,3,4)
4th: 3.6 (4,2,4,5,3,4)
Simulating each method 1000 times:
1st: 2.29375 2nd: 2.17592 3rd: 2.14383 4th: 2.33333
Process returned 0 (0x0) execution time : 13.568 s
Compiler used:MingW 3.2 c++11 flag enabled. IDE:CodeBlocks
I have some observations and points to make and I hope you get your answer from this.
The fourth version, as you mention yourself, is basically the same as the first version. auto can be thought of as only a coding shortcut (this is of course not strictly true, as using auto can result in getting different types than you'd expected and therefore result in different runtime behavior. But most of the time this is true.)
Your solution using pointers is probably not what people mean when they say that they are using pointers! One solution might be something like this:
for (int i = 0, *p = &(ia[0][0]); i < 3 * 4; ++i, ++p)
cout << *p << " ";
or to use two nested loops (which is probably pointless):
for (int i = 0, *p = &(ia[0][0]); i < 3; ++i)
for (int j = 0; j < 4; ++j, ++p)
cout << *p << " ";
from now on, I'm assuming this is the pointer solution you've written.
In such a trivial case as this, the part that will absolutely dominate your running time is the cout. The time spent in bookkeeping and checks for the loop(s) will be completely negligible comparing to doing I/O. Therefore, it won't matter which loop technique you use.
Modern compilers are great at optimizing such ubiquitous tasks and access patterns (iterating over an array.) Therefore, chances are that all these methods will generate exactly the same code (with the possible exception of the pointer version, which I will talk about later.)
The performance of most codes like this will depend more on the memory access pattern rather than how exactly the compiler generates the assembly branch instructions (and the rest of the operations.) This is because if a required memory block is not in the CPU cache, it's going to take a time roughly equivalent of several hundred CPU cycles (this is just a ballpark number) to fetch those bytes from RAM. Since all the examples access memory in exactly the same order, their behavior in respect to memory and cache will be the same and will have roughly the same running time.
As a side note, the way these examples access memory is the best way for it to be accessed! Linear, consecutive and from start to finish. Again, there are problems with the cout in there, which can be a very complicated operation and even call into the OS on every invocation, which might result, among other things, an almost complete deletion (eviction) of everything useful from the CPU cache.
On 32-bit systems and programs, the size of an int and a pointer are usually equal (both are 32 bits!) Which means that it doesn't matter much whether you pass around and use index values or pointers into arrays. On 64-bit systems however, a pointer is 64 bits but an int will still usually be 32 bits. This suggests that it is usually better to use indexes into arrays instead of pointers (or even iterators) on 64-bit systems and programs.
In this particular example, this is not significant at all though.
Your code is very specific and simple, but the general case, it is almost always better to give as much information to the compiler about your code as possible. This means that you must use the narrowest, most specific device available to you to do a job. This in turn means that a generic for loop (i.e. for (int i = 0; i < n; ++i)) is worse than a range-based for loop (i.e. for (auto i : v)) for the compiler, because in the latter case the compiler simply knows that you are going to iterate over the whole range and not go outside of it or break out of the loop or something, while in the generic for loop case, specially if your code is more complex, the compiler cannot be sure of this and has to insert extra checks and tests to make sure the code executes as the C++ standard says it should.
In many (most?) cases, although you might think performance matters, it does not. And most of the time you rewrite something to gain performance, you don't gain much. And most of the time the performance gain you get is not worth the loss in readability and maintainability that you sustain. So, design your code and data structures right (and keep performance in mind) but avoid this kind of "micro-optimization" because it's almost always not worth it and even harms the quality of the code too.
Generally, performance in terms of speed is very hard to reason about. Ideally you have to measure the time with real data on real hardware in real working conditions using sound scientific measuring and statistical methods. Even measuring the time it takes a piece of code to run is not at all trivial. Measuring performance is hard, and reasoning about it is harder, but these days it is the only way of recognizing bottlenecks and optimizing the code.
I hope I have answered your question.
EDIT: I wrote a very simple benchmark for what you are trying to do. The code is here. It's written for Windows and should be compilable on Visual Studio 2012 (because of the range-based for loops.) And here are the timing results:
Simple iteration (nested loops): min:0.002140, avg:0.002160, max:0.002739
Simple iteration (one loop): min:0.002140, avg:0.002160, max:0.002625
Pointer iteration (one loop): min:0.002140, avg:0.002160, max:0.003149
Range-based for (nested loops): min:0.002140, avg:0.002159, max:0.002862
Range(const ref)(nested loops): min:0.002140, avg:0.002155, max:0.002906
The relevant numbers are the "min" times (over 2000 runs of each test, for 1000x1000 arrays.) As you see, there is absolutely no difference between the tests. Note that you should turn on compiler optimizations or test 2 will be a disaster and cases 4 and 5 will be a little worse than 1 and 3.
And here are the code for the tests:
// 1. Simple iteration (nested loops)
unsigned sum = 0;
for (unsigned i = 0; i < gc_Rows; ++i)
for (unsigned j = 0; j < gc_Cols; ++j)
sum += g_Data[i][j];
// 2. Simple iteration (one loop)
unsigned sum = 0;
for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
sum += g_Data[i / gc_Cols][i % gc_Cols];
// 3. Pointer iteration (one loop)
unsigned sum = 0;
unsigned * p = &(g_Data[0][0]);
for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
sum += *p++;
// 4. Range-based for (nested loops)
unsigned sum = 0;
for (auto & i : g_Data)
for (auto j : i)
sum += j;
// 5. Range(const ref)(nested loops)
unsigned sum = 0;
for (auto const & i : g_Data)
for (auto const & j : i)
sum += j;
It has many factors affecting it:
It depends on the compiler
It depends on the compiler flags used
It depends on the computer used
There is only one way to know the exact answer: measuring the time used when dealing with huge arrays (maybe from a random number generator) which is the same method you have already done except that the array size should be at least 1000x1000.

Why is the != operator not allowed with OpenMP?

I was trying to compile the following code:
#pragma omp parallel shared (j)
{
#pragma omp for schedule(dynamic)
for(i = 0; i != j; i++)
{
// do something
}
}
but I got the following error: error: invalid controlling predicate.
The OpenMP standard states that for parallel for constructor it "only" allows one of the following operators: <, <=, > >=.
I do not understand the rationale for not allowing i != j. I could understand, in the case of the static schedule, since the compiler needs to pre-compute the number of iterations assigned to each thread. But I can't understand why this limitation in such case for example. Any clues?
EDIT: even if I make for(i = 0; i != 100; i++), although I could just have put "<" or "<=" .
.
I sent an email to OpenMP developers about this subject, the answer I got:
For signed int, the wrap around behavior is undefined. If we allow !=, programmers may get unexpected tripcount. The problem is whether the compiler can generate code to compute a trip count for the loop.
For a simple loop, like:
for( i = 0; i < n; ++i )
the compiler can determine that there are 'n' iterations, if n>=0, and zero iterations if n < 0.
For a loop like:
for( i = 0; i != n; ++i )
again, a compiler should be able to determine that there are 'n' iterations, if n >= 0; if n < 0, we don't know how many iterations it has.
For a loop like:
for( i = 0; i < n; i += 2 )
the compiler can generate code to compute the trip count (loop iteration count) as floor((n+1)/2) if n >= 0, and 0 if n < 0.
For a loop like:
for( i = 0; i != n; i += 2 )
the compiler can't determine whether 'i' will ever hit 'n'. What if 'n' is an odd number?
For a loop like:
for( i = 0; i < n; i += k )
the compiler can generate code to compute the trip count as floor((n+k-1)/k) if n >= 0, and 0 if n < 0, because the compiler knows that the loop must count up; in this case, if k < 0, it's not a legal OpenMP program.
For a loop like:
for( i = 0; i != n; i += k )
the compiler doesn't even know if i is counting up or down. It doesn't know if 'i' will ever hit 'n'. It may be an infinite loop.
Credits: OpenMP ARB
Contrary to what it may look like, schedule(dynamic) does not work with dynamic number of elements. Rather the assignment of iteration blocks to threads is what is dynamic. With static scheduling this assignment is precomputed at the beginning of the worksharing construct. With dynamic scheduling iteration blocks are given out to threads on the first come, first served basis.
The OpenMP standard is pretty clear that the amount of iteratons is precomputed once the workshare construct is encountered, hence the loop counter may not be modified inside the body of the loop (OpenMP 3.1 specification, ยง2.5.1 - Loop Construct):
The iteration count for each associated loop is computed before entry to the outermost
loop. If execution of any associated loop changes any of the values used to compute any
of the iteration counts, then the behavior is unspecified.
The integer type (or kind, for Fortran) used to compute the iteration count for the
collapsed loop is implementation defined.
A worksharing loop has logical iterations numbered 0,1,...,N-1 where N is the number of
loop iterations, and the logical numbering denotes the sequence in which the iterations
would be executed if the associated loop(s) were executed by a single thread. The
schedule clause specifies how iterations of the associated loops are divided into
contiguous non-empty subsets, called chunks, and how these chunks are distributed
among threads of the team. Each thread executes its assigned chunk(s) in the context of
its implicit task. The chunk_size expression is evaluated using the original list items of any variables that are made private in the loop construct. It is unspecified whether, in what order, or how many times, any side-effects of the evaluation of this expression occur. The use of a variable in a schedule clause expression of a loop construct causes an implicit reference to the variable in all enclosing constructs.
The rationale behind these relational operator restriction is quite simple - it provides clear indication on what is the direction of the loop, it alows easy computation of the number of iterations, and it provides similar semantics of the OpenMP worksharing directive in C/C++ and Fortran. Also other relational operations would require close inspection of the loop body in order to understand how the loop goes which would be unaceptable in many cases and would make the implementation cumbersome.
OpenMP 3.0 introduced the explicit task construct which allows for parallelisation of loops with unknown number of iterations. There is a catch though: tasks introduce some severe overhead and the one task per loop iteration only makes sense if these iterations take quite some time to be executed. Otherwise the overhead would dominate the execution time.
The answer is simple.
OpenMP does not allow premature termination of a team of threads.
With == or !=, OpenMP has no way of determining when the loop stops.
1. One or more threads could hit the termination condition, which might not be unique.
2. OpenMP has no way to shut down the other threads that might never detect the condition.
If I were to see the statement
for(i = 0; i != j; i++)
used instead of the statement
for(i = 0; i < j; i++)
I would be left wondering why the programmer had made that choice, never mind that it can mean the same thing. It may be that OpenMP is making a hard syntactic choice in order to force a certain clarity of code.
Here's code which raises challenges for the use of != and may help explain why it isn't allowed.
#include <cstdio>
int main(){
int j=10;
#pragma omp parallel for
for(int i = 0; i < j; i++){
printf("%d\n",i++);
}
}
notice that i is incremented in both the for statement as well as within the loop itself leading to the possibility (but not the guarantee) of an infinite loop.
If the predicate is < then the loop's behavior can still be well-defined in a parallel context without the compiler having to check within the loop for changes to i and determining how those changes will affect the loop's bounds.
If the predicate is != then the loop's behavior is no longer well-defined and it may be infinite in extent, preventing easy parallel subdivision.
I think there is perhaps no good reason other than having extended existing functionality to get this far.
IIRC originally these had to be static so that it could determine at compile time how to generate the loop code... it could just be a hangover from that.

C++ heap memory performance improvement

I'm writing a function where I need a significant amount of heap memory. Is it possible to tell the compiler that those data will be accessed frequently within a specific for loop, so as to improve performance (through compile options or similar)?
The reason I cannot use the stack is that the number of elements I need to store is big, and I get segmentation fault if I try to do it.
Right now the code is working but I think it could be faster.
UPDATE:
I'm doing something like this
vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
for(uint j = i+1; j < node_vec.size(); j++)
// some computation, basic math, store the result in variable x
if( x > threshold ) {
vec[i].insert(j);
vec[j].insert(i);
}
some details:
- I used hash_set, little improvement, beside the fact that hash_set is not available in all machines I have for simulation purposes
- I tried to allocate vec on the stack using arrays but, as I said, I might get segmentation fault if the number of elements is too big
If node_vec.size() is, say, equal to k, where k is of the order of a few thousands, I expect vec to be 4 or 5 times bigger than node_vec. With this order of magnitude the code appears to be slow, considering the fact that I have to run it many times. Of course, I am using multithreading to parallelize these calls, but I can't get the function per se to run much faster than what I'm seeing right now.
Would it be possible, for example, to have vec allocated in the cache memory for fast data retrieval, or something similar?
I'm writing a function where I need a significant amount of heap memory ... will be accessed frequently within a specific for loop
This isn't something you can really optimize at a compiler level. I think your concern is that you have a lot of memory that may be "stale" (paged out) but at a particular point in time you will need to iterate over all of it, maybe several times and you don't want the memory pages to be paged out to disk.
You will need to investigate strategies that are platform specific to improve performance. Keeping the pages in memory can be achieved with mlockall or VirtualLock but you really shouldn't need to do this. Make sure you know what the implications of locking your application's memory pages into RAM is, however. You're hogging memory from other processes.
You might also want to investigate a low fragmentation heap (however it may not be relevant at all to this problem) and this page which describes cache lines with respect to for loops.
The latter page is about the nitty-gritty of how CPUs work (a detail you normally shouldn't have to be concerned with) with respect to memory access.
Example 1: Memory accesses and performance
How much faster do you expect Loop 2 to run, compared Loop 1?
int[] arr = new int[64 * 1024 * 1024];
// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16-th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine.
UPDATE
vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
for(uint j = i+1; j < node_vec.size(); j++)
// some computation, basic math, store the result in variable x
if( x > threshold ) {
vec[i].insert(j);
vec[j].insert(i);
}
That still doesn't show much, because we cannot know how often the condition x > threshold will be true. If x > threshold is very frequently true, then the std::set might be the bottleneck, because it has to do a dynamic memory allocation for every uint you insert.
Also we don't know what "some computation" actually means/does/is. If it does much, or does it in the wrong way that could be the bottleneck.
And we don't know how you need to access the result.
Anyway, on a hunch:
vector<pair<int, int> > vec1;
vector<pair<int, int> > vec2;
for (uint i = 0; i < node_vec.size(); i++)
{
for (uint j = i+1; j < node_vec.size(); j++)
{
// some computation, basic math, store the result in variable x
if (x > threshold)
{
vec1.push_back(make_pair(i, j));
vec2.push_back(make_pair(j, i));
}
}
}
If you can use the result in that form, you're done. Otherwise you could do some post-processing. Just don't copy it into a std::set again (obviously). Try to stick to std::vector<POD>. E.g. you could build an index into the vectors like this:
// ...
vector<int> index1 = build_index(node_vec.size(), vec1);
vector<int> index2 = build_index(node_vec.size(), vec2);
// ...
}
vector<int> build_index(size_t count, vector<pair<int, int> > const& vec)
{
vector<int> index(count, -1);
size_t i = vec.size();
do
{
i--;
assert(vec[i].first >= 0);
assert(vec[i].first < count);
index[vec[i].first] = i;
}
while (i != 0);
return index;
}
ps.: I'm almost sure your loop is not memory-bound. Can't be sure though... if the "nodes" you're not showing us are really big it might still be.
Original answer:
There is no easy I_will_access_this_frequently_so_make_it_fast(void* ptr, size_t len)-kind-of solution.
You can do some things though.
Make sure the compiler can "see" the implementation of every function that's called inside critical loops. What is necessary for the compiler to be able to "see" the implementation depends on the compiler. There is one way to be sure though: define all relevant functions in the same translation unit before the loop, and declare them as inline.
This also means you should not by any means call "external" functions in those critical loops. And by "external" functions I mean things like system-calls, runtime-library stuff or stuff implemented in a DLL/SO. Also don't call virtual functions and don't use function pointers. And or course don't allocate or free memory (inside the critical loops).
Make sure you use an optimal algorithm. Linear optimization is moot if the complexity of the algorithm is higher than necessary.
Use the smallest possible types. E.g. don't use int if signed char will do the job. That's something I wouldn't normally recommend, but when processing a large chunk of memory it can increase performance quite a lot. Especially in very tight loops.
If you're just copying or filling memory, use memcpy or memset. Disable the intrinsic version of those two functions if the chunks are larger then about 50 to 100 bytes.
Make sure you access the data in a cache-friendly manner. The optimum is "streaming" - i.e. accessing the memory with ascending or descending addresses. You can "jump" ahead some bytes at a time, but don't jump too far. The worst is random access to a big block of memory. E.g. if you have to work on a 2 dimensional matrix (like a bitmap image) where p[0] to p[1] is a step "to the right" (x + 1), make sure the inner loop increments x and the outer increments y. If you do it the other way around performance will be much much worse.
If your pointers are alias-free, you can tell the compiler (how that's done depends on the compiler). If you don't know what alias-free means I recommend searching the net and your compiler's documentation, since an explanation would be beyond the scope.
Use intrinsic SIMD instructions if appropriate.
Use explicit prefetch instructions if you know which memory locations will be needed in the near future.
You can't do that with compiler options. Depending on your usage (insertion, random-access, deleting, sorting, etc.), you could maybe get a better suited container.
The compiler can already see that the data is accessed frequently within the loop.
Assuming you're only allocating the data from the heap once before doing the looping, note, as #lvella, that memory is memory and if it's accessed frequently it should be effectively cached during execution.