I have a struct:
struct A
{
    int v[10000000];
};
If I have A a[2]; and want to calculate the total sum of values, which of these two methods is the fastest?
int method_1(const A a[], int length)
{
    int total = 0;
    for(int i = 0; i < length; i++)
        for(int j = 0; j < 10000000; j++)
            total += a[i].v[j];
    return total;
}
int method_2(const A a[], int length)
{
    int total = 0;
    for(int j = 0; j < 10000000; j++)
        for(int i = 0; i < length; i++)
            total += a[i].v[j];
    return total;
}
a[2] is laid out as two consecutive blocks of struct A, like so:
----a[0]---- /--- a[1]----
[][][][][][][][]/[][][][][][][][]
So I might be tempted to say that method_1 is faster, based on the intuition that the blocks are consecutive and the iteration through each block's v is also consecutive.
What I am really interested in is how the memory is actually accessed, and what the most efficient way to access it is.
EDIT
I have changed the size of v from 32 to 10000000, because apparently it wasn't clear that I was asking about the general case.
Each time a memory location is read, a whole cache line is fetched from main memory into the CPU cache; on today's CPUs a cache line is typically 64 bytes. This is largely why reading consecutive memory blocks is fast.
Now, there is more than one cache line...
In your case, both methods may have similar performance, because the two arrays will most probably not collide on the same cache line, so both may sit in the cache on different lines; I suspect performance will be similar.
One related thing you might consider to improve performance is to avoid the [] operators in favor of pointer-based "iterators", like this:
int method_1(const A a[], int length)
{
    int total = 0;
    for(const A* aIt = a; aIt < a + length; ++aIt)
        for(const int* vIt = aIt->v; vIt < aIt->v + 10000000; ++vIt)
            total += *vIt;
    return total;
}
This way you avoid the double [] indexing, which involves a multiplication by the sizeof of an array element (this may be optimized away, but if it isn't, it is costly when executed millions of times). Your compiler may be smart enough to optimize the indexed code into exactly what I've shown, using only additions, but it very well may not be. I've seen this make a big difference when the operation performed on each element is as trivial as an increment, so you're best off measuring how these options work out in your environment.
Accessing elements in the order they appear in memory will improve performance in most cases, since it allows the prefetcher to load data before you even use it. Besides, if you access data non-contiguously, you might load and evict the same cache line many times, and that has a cost.
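To make that concrete, here is a minimal sketch (my own example, not from the question) contrasting address-order traversal with strided traversal of the same buffer:
#include <vector>

// Two ways to sum the same 2-D data stored in one flat buffer. The row-major
// version walks memory in address order, which the prefetcher handles well;
// the column-major version jumps COLS elements per step.
constexpr int ROWS = 1024, COLS = 1024;

long sum_row_major(const std::vector<int>& m) {
    long total = 0;
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            total += m[r * COLS + c];   // consecutive addresses
    return total;
}

long sum_col_major(const std::vector<int>& m) {
    long total = 0;
    for (int c = 0; c < COLS; ++c)
        for (int r = 0; r < ROWS; ++r)
            total += m[r * COLS + c];   // strides of COLS elements
    return total;
}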
With the original size of 32, the data was small enough to fit entirely in cache on a modern CPU. I'm also not sure whether the compiler would vectorize this code.
I don't think method_2 is slower than method_1. The chunk of memory will be brought into the CPU's cache, and then accessing a[0] and a[1] will both take the same time.
To be on the safe side, method_1 can always be considered better than method_2.
Related
I noticed that sometimes a program runs very slowly at first, but later the performance is good. For example, I have some code which I run in a loop: the first iteration takes ages, but the other iterations of the same code run pretty fast. It's hard to pin down the circumstances because I can't figure it out; it seems that even a single literal can affect this behavior. I prepared a small code snippet:
#include <chrono>
#include <vector>
#include <iostream>

using namespace std;

int main()
{
    const int num{ 100000 };

    vector<vector<int>> octs;
    for (int i{ 0 }; i < num; ++i)
    {
        octs.emplace_back(vector<int>{ 42 });
    }

    vector<int> datas;
    for (int i{ 0 }; i < num; ++i)
    {
        datas.push_back(42);
    }

    for (int n{ 0 }; n < 10; ++n)
    {
        cout << "start" << '\n';
        //cout << 0 << "start" << '\n';
        auto start = chrono::high_resolution_clock::now();
        for (int i{ 0 }; i < num; ++i)
        {
            vector<int> points{ 42 };
        }
        auto end = chrono::high_resolution_clock::now();
        auto time = chrono::duration_cast<chrono::milliseconds>(end - start);
        cout << time.count() << '\n';
    }

    cin.get();
    return 0;
}
The first two vectors are essential, at least with Visual Studio. Though they're not used, they affect the performance a lot. Moreover, tweaking them also has a performance effect (changing the order of initialization, or removing push_back and allocating the necessary size in the constructor). This code as it is gives me the following results:
with gcc there are no problems at all
with clang the first iteration takes two times longer than the others
with vs2013 the first iteration is 100 (yes, one hundred) times slower.
Moreover, with vs2013 if I uncomment the line cout << 0 << "start" << '\n'; the performance problem goes away and all iterations are equal!
What's going on?
For your first two loops, probably the biggest performance consideration is going to be the allocation of memory, and the copying of the vector contents to the larger buffer. In this case, the fact that the loops appear to be 'gaining speed' is not surprising.
This is due to the implementation details of the vector class. Let's look at the documentation:
Internally, vectors use a dynamically allocated array to store their
elements. This array may need to be reallocated in order to grow in
size when new elements are inserted, which implies allocating a new
array and moving all elements to it. This is a relatively expensive
task in terms of processing time, and thus, vectors do not reallocate
each time an element is added to the container.
Instead, vector containers may allocate some extra storage to
accommodate for possible growth, and thus the container may have an
actual capacity greater than the storage strictly needed to contain
its elements (i.e., its size). Libraries can implement different
strategies for growth to balance between memory usage and
reallocations, but in any case, reallocations should only happen at
logarithmically growing intervals of size so that the insertion of
individual elements at the end of the vector can be provided with
amortized constant time complexity (see push_back).
So under the hood, the actual memory allocated for your vector might be much more than what you are actually using. So the vector only needs to do the costly re-allocation and copy when you add a new element to the vector which wouldn't fit into its current buffer. Moreover, since it says that re-allocations should only happen at logarithmically growing intervals, you can expect that the vector class is roughly doubling the buffer size every time it needs to re-allocate. But note that the vector implementations on various platforms are highly tuned to be optimal for the most common usage patterns for the class, which could be one factor in the different performance you are seeing across tool chains and platforms.
So you should see the loops be slow on the first several executions, and then gain more speed as push_back and emplace operations need to do fewer re-allocations and copies to accommodate the new elements.
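As a quick illustration (a sketch of my own, not from the question), you can watch those reallocations happen by printing capacity() as elements are appended:
#include <iostream>
#include <vector>

// Watch capacity jump at roughly geometric intervals as push_back triggers
// reallocations. The exact growth factor is implementation-defined.
int main() {
    std::vector<int> v;
    std::size_t last_cap = 0;
    for (int i = 0; i < 1000; ++i) {
        v.push_back(i);
        if (v.capacity() != last_cap) {
            last_cap = v.capacity();
            std::cout << "size " << v.size() << " -> capacity " << last_cap << '\n';
        }
    }
}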
So I think this is the main fact you can use to reason about how long your first two loops should take to execute. But for your specific examples, due to the simplicity of the program, the compiler may be taking some liberties with what code it generates. So we could imagine that a sufficiently clever optimizing compiler might be able to see that your vectors will only be growing to a size which it knows at compile time, num. And this is the biggest issue I suspect with your last loop, which seems like an arbitrary and useless test. For example, the nested loop in loop 3 can be optimized away entirely. I think this is the main reason why you are seeing such different run-time behavior across the different compilers.
If you want to get the real story, take a look at the assembly code that your compiler is generating.
#include <array>
#include <cstddef>

template<std::size_t size>
class Objects{
    std::array<int,size> a;
    std::array<int,size> b;
    std::array<int,size> c;
public:
    void update(){
        for (std::size_t i = 0; i < size; ++i){
            c[i] = a[i] + b[i];
        }
    }
};
I have been gathering information on how to write cache-friendly code for a week now, and I have read through several articles, but I still haven't understood the basics.
Code like the above is used in most of the examples, but to me this is not cache friendly at all.
As far as I can tell, the memory layout should look like this:
aaaabbbbcccc
and on the first loop iteration it will access
[a]aaa[b]bbb[c]ccc
If I understand correctly, the CPU prefetches elements that are nearby in memory. I am not sure how intelligent this mechanism is, but I assume it's primitive and just fetches the n nearest elements.
The problem is that [a]aaa[b]bbb[c]ccc does not access the elements in order at all. So it might fetch the next three elements, a[aaa]bbbbcccc, which is nice for the next access to a because it will be a cache hit, but not for b.
Is the example above cache-friendly code?
I suggest you use an array of structures:
struct Cache_Item
{
    int a;
    int b;
    int c;
};

Cache_Item cache_line[size];  // size as in the question's template parameter

for (unsigned int i = 0; i < size; ++i)
{
    cache_line[i].c = cache_line[i].a + cache_line[i].b;
}
The structure arrangement allows all the variables in use to be next to each other on the cache line, or very close.
In your array-based version, element b[0] is most likely at location a[size], so a[0] and b[0] are size items apart. This could mean they are on different cache lines, and the result location, c[0], at a[size + size], could be two cache lines away.
Your code is not particularly unfriendly. It requires three active cache lines at a time instead of one, but that isn't too much to ask. Your code would be a lot more cache-unfriendly if instead of
std::array<int,size> a;
you had
std::array<struct { int x; char description[5000]; }, size> a;
because then the CPU would have to pick out the lone x among the thousands of bytes of description (which your loop never uses).
Your example would also be more cache-unfriendly if you had not just a, b, and c, but also d-z and aa-az and maybe a few more. (How far you have to go depends on the sophistication of your cache - how many way-associative it is, etc.)
Have you profiled yours vs Thomas Matthews' code?
You should trust the compiler optimization work (and of course enable optimizations); it probably deals quite well with the CPU cache (perhaps by issuing appropriate prefetch instructions).
Sometimes you can hint the compiler through builtins or pragmas. For example, with GCC on x86-64 you might (with care) use __builtin_prefetch. Usually it is not worth the effort (and if you misuse it, performance will suffer). See this answer to a related question.
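If you do experiment with it, a minimal sketch might look like the following; PREFETCH_DIST is a made-up tuning constant, and only measurement on your machine can tell you a good value (or whether the prefetch helps at all):
#include <cstddef>

// GCC/Clang builtin: prefetch a fixed distance ahead while summing.
// PREFETCH_DIST is in elements, not bytes; a bad value can easily make the
// loop slower, so measure before adopting this.
long sum_with_prefetch(const int* data, std::size_t n) {
    constexpr std::size_t PREFETCH_DIST = 64;
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&data[i + PREFETCH_DIST], /*rw=*/0, /*locality=*/3);
        total += data[i];
    }
    return total;
}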
This question is about the trade-off between memory and performance.
I am doing C++ on Linux.
The for_loop() is on the time-critical path. I am trying to reduce its run time as much as possible.
myArray.assignMemory(); // the memory will be 50KB.

if (myFlag)
    myArray is assigned meaningful values
else
    myArray is assigned NotNumber (a very small negative number)

for_loop ( iterationNumber = N ) { // N will be very large
    myF1( myArray[i] );
}

myF1(double j){
    if (myFlag)
        use j
    else
        doNothing
}
Here, memory is assigned to myArray even when myFlag is false, in which case the memory is wasted. But if I put if(myFlag) inside the for_loop, there is a performance overhead.
I could move if(myFlag) out of the for_loop, so that if myFlag is true we run myF1(myArray[i]) and otherwise we run myF1(notNumber), but this duplicates code.
So my question is: are there other, better ways that avoid the performance overhead without wasting any memory?
Thanks
To me, from what I can see:
myArray.assignMemory(); // the memory will be 50KB.

if (myFlag)
    myArray is assigned meaningful values
else
    myArray is assigned NotNumber (a very small negative number)

for_loop ( iterationNumber = N ) { // N will be very large
    myF1( myArray[i] );
}

myF1(double j){
    if (myFlag)
        use j
    else
        doNothing
}
is the same as
if (myFlag)
{
    myArray.assignMemory(); // the memory will be 50KB.
    myArray is assigned meaningful values
    for_loop ( iterationNumber = N ) { // N will be very large
        myF1( myArray[i] );
    }
}

myF1(double j){
    use j
}
Of course, it could be that your code does more things than what you describe, in which case this part of the answer is completely useless (but not really my fault - I can only go by what you have posted, and the posted code doesn't do anything else with myArray).
As to your DIRECT question, it really depends on what you are trying to achieve. 50KB is not a very large allocation (as long as you are not doing it several times). But allocating memory that you don't actually need is also completely meaningless AND takes time.
The title of your question is about "tradeoff between memory and performance", which is typically about "do I store something in lots of memory that is fast to access, or work out a more memory efficient way to store it, but taking more time." For example, if we have a telephone directory, we could have a very large array with all telephone numbers from 000000000 to 999999999 in one large, directly addressed array, or we can use a map or hash_map that stores only the items we actually need in the table. The directly addressed array is faster to access, but it's so much larger that it may not fit in the memory in most machines [if each record is large as well]. So it's a choice, do we make it "fast, using lots of memory", or do we make it "small memory, but not so fast". And like so many things, there's no directly right or wrong answer - it depends on which is more important, speed or memory space.
Call myArray.assignMemory() after checking myFlag, but before the for_loop(). Depending on what myArray is assigned meaningful values and myArray is assigned NotNumber (a very small negative number) actually do, you might need to change the implementation of the class that myArray belongs to.
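A sketch of that restructuring, using the question's placeholder names (myF1's body here is a stand-in, since the real computation isn't shown):
#include <cstddef>
#include <vector>

double myF1(double j) { return j * 2.0; }  // stand-in for the real work

// The flag is tested once, the 50KB buffer is allocated only when needed,
// and the hot loop carries no per-iteration branch on myFlag.
double run(bool myFlag, std::size_t N) {
    if (!myFlag)
        return 0.0;                          // nothing meaningful to do
    std::vector<double> myArray(50 * 1024 / sizeof(double), 1.0);
    double acc = 0.0;
    for (std::size_t i = 0; i < N; ++i)
        acc += myF1(myArray[i % myArray.size()]);  // wrap-around indexing for the sketch
    return acc;
}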
I am writing some code that needs to be as fast as possible without sucking up all of my research time (in other words, no hand optimized assembly).
My systems primarily consist of a bunch of 3D points (atomic systems) and so the code I write does lots of distance comparisons, nearest-neighbor searches, and other types of sorting and comparisons. These are large, million or billion point systems, and the naive O(n^2) nested for loops just won't cut it.
It would be easiest for me to just use a std::vector to hold the point coordinates. At first I thought it would probably be about as fast as an array, so that's great! However, this question (Is std::vector so much slower than plain arrays?) has left me with a very uneasy feeling. I don't have time to write all of my code using both arrays and vectors and benchmark them, so I need to make a good decision right now.
I am sure that someone who knows the detailed implementation behind std::vector could use those functions with very little speed penalty. However, I primarily program in C, and so I have no clue what std::vector is doing behind the scenes, and I have no clue if push_back is going to perform some new memory allocation every time I call it, or what other "traps" I could fall into that make my code very slow.
An array is simple, though; I know exactly when memory is being allocated, what the order of all my algorithms will be, etc. There are no black-box unknowns that I may have to suffer through. Yet I so often see people criticized on the internet for using arrays over vectors that I can't help but wonder if I am missing some information.
EDIT: To clarify, someone asked "Why would you be manipulating such large datasets with arrays or vectors"? Well, ultimately, everything is stored in memory, so you need to pick some bottom layer of abstraction. For instance, I use kd-trees to hold the 3D points, but even so, the kd-tree needs to be built off an array or vector.
Also, I'm not implying that compilers cannot optimize (I know the best compilers can outperform humans in many cases), but simply that they cannot optimize better than what their constraints allow, and I may be unintentionally introducing constraints simply due to my ignorance of the implementation of vectors.
It all depends on how you implement your algorithms. std::vector is such a general container concept that it gives us flexibility, but it leaves us the freedom and responsibility of structuring the algorithm's implementation deliberately. Most of the efficiency overhead we observe with std::vector comes from copying. std::vector provides a constructor which lets you initialize N elements with value X, and when you use that, the vector is just as fast as an array.
I ran a test of std::vector vs. array, described here:
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
public:
    TestTimer(const std::string & name) : name(name),
        start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
    {
    }

    ~TestTimer()
    {
        using namespace std;
        using namespace boost;

        posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
        posix_time::time_duration d = now - start;

        cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
            " seconds" << endl;
    }

private:
    std::string name;
    boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }

    unsigned char r, g, b;
};

void UseVector()
{
    TestTimer t("UseVector");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.reserve(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        free(pixels);
    }
}

void UseVectorCtor()
{
    TestTimer t("UseConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;
        std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
    }
}

int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorCtor();
    UseVectorPushBack();

    return 0;
}
and here are the results (compiled on Ubuntu amd64 with g++ -O3):
UseArray completed in 0.325 seconds
UseVector completed in 1.23 seconds
UseConstructor completed in 0.866 seconds
UseVectorPushBack completed in 8.987 seconds
The whole thing completed in 11.411 seconds
Clearly push_back wasn't a good choice here; using the constructor is still more than twice as slow as the array.
Now, providing Pixel with an empty copy constructor:
Pixel(const Pixel&) {}
(note that this no longer copies the pixel data at all, so the comparison is no longer apples to apples) gives us the following results:
UseArray completed in 0.331 seconds
UseVector completed in 0.306 seconds
UseConstructor completed in 0 seconds
UseVectorPushBack completed in 2.714 seconds
The whole thing completed in 3.352 seconds
So in summary: re-think your algorithm first; otherwise, perhaps resort to a custom wrapper around new[]/delete[]. In any case, the STL implementation isn't slower for some unknown reason; it just does exactly what you ask, hoping you know best.
If you have just started with vectors, it might be surprising how they behave; for example, this code:
#include <iostream>
#include <vector>
using namespace std;

class U{
    int i_;
public:
    U(){}  // deliberately leaves i_ uninitialized
    U(int i) : i_(i) {cout << "consting " << i_ << endl;}
    U(const U& ot) : i_(ot.i_) {cout << "copying " << i_ << endl;}
};

int main(int argc, char** argv)
{
    std::vector<U> arr(2, U(3));
    arr.resize(4);
    return 0;
}
results in:
consting 3
copying 3
copying 3
copying 548789016
copying 548789016
copying 3
copying 3
Vectors guarantee that the underlying data is a contiguous block in memory. The only sane way to guarantee this is by implementing it as an array.
Memory reallocation on pushing new elements can happen, because the vector can't know in advance how many elements you are going to add to it. But when you know it in advance, you can call reserve with the appropriate number of entries to avoid reallocation when adding them.
Vectors are usually preferred over arrays because they allow bounds checking when accessing elements with .at(). That means accessing indices outside the vector doesn't cause undefined behavior, as it would with an array. The bounds checking does, however, require additional CPU cycles. When you use the [] operator to access elements, no bounds checking is done and access should be as fast as an array, though it risks undefined behavior when your code is buggy.
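A small sketch of the difference (my own example):
#include <iostream>
#include <stdexcept>
#include <vector>

// .at() pays for a bounds check and throws on bad indices; operator[]
// skips the check, and out-of-range access is undefined behavior.
int main() {
    std::vector<int> v{1, 2, 3};
    std::cout << v[1] << '\n';          // unchecked, as fast as an array
    try {
        std::cout << v.at(10) << '\n';  // checked: throws
    } catch (const std::out_of_range& e) {
        std::cout << "caught: " << e.what() << '\n';
    }
}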
People who invented STL, and then made it into the C++ standard library, are expletive deleted smart. Don't even let yourself imagine for one little moment you can outperform them because of your superior knowledge of legacy C arrays. (You would have a chance if you knew some Fortran though).
With std::vector, you can allocate all memory in one go, just like with C arrays. You can also allocate incrementally, again just like with C arrays. You can control when each allocation happens, just like with C arrays. Unlike with C arrays, you can also forget about it all and let the system manage the allocations for you, if that's what you want. This is all absolutely necessary, basic functionality. I'm not sure why anyone would assume it is missing.
Having said all that, go with arrays if you find them easier to understand.
I am not really advising you to go either for arrays or for vectors, because I think that neither may be a perfect fit for your needs.
You need to be able to organize your data efficiently, so that queries would not need to scan the whole memory range to get the relevant data. So you want to group the points which are more likely to be selected together close to each other.
If your dataset is static, then you can do that sorting offline, and make your array nice and tidy to be loaded into memory at application start-up time; either vector or array would work (provided you call reserve up front for the vector, since the default growth scheme doubles the size of the underlying array whenever it gets full, and you wouldn't want to use up 16Gb of memory for only 9Gb worth of data).
But if your dataset is dynamic, it will be difficult to do efficient inserts in your set with a vector or an array. Recall that each insert within the array would shift all the successor elements by one place. Of course, an index, like the kd-tree you mention, will help by avoiding a full scan of the array, but if the selected points are scattered across the array, the effect on memory and cache will essentially be the same. The shift would also mean that the index needs to be updated.
My solution would be to cut the array into pages (either linked in a list or indexed in an array) and store data in the pages. That way, it would be possible to group relevant elements together, while still retaining the speed of contiguous memory access within pages. The index would then refer to a page and an offset in that page. Pages wouldn't be filled automatically, which leaves room to insert related elements, or makes shifts really cheap operations.
Note that if pages are always full (except for the last one), you still have to shift every single one of them in case of an insert, while if you allow incomplete pages, you can limit a shift to a single page, and if that page is full, insert a new page right after it to contain the supplementary element.
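To make the idea concrete, here is a rough sketch of such a paged store; all the names and the page capacity are mine, and a real implementation would need iteration, index maintenance, and tuning:
#include <cstddef>
#include <vector>

struct Point3D { float x, y, z; };

constexpr std::size_t PAGE_CAPACITY = 512;  // tune to the OS/cache page size

struct Page {
    std::vector<Point3D> items;             // kept at or below PAGE_CAPACITY
};

struct PagedStore {
    std::vector<Page> pages;

    // Insert p at (page_idx, offset); only one page's contents ever shift.
    void insert(std::size_t page_idx, std::size_t offset, const Point3D& p) {
        if (pages[page_idx].items.size() == PAGE_CAPACITY) {
            // Split: move the upper half into a fresh page right after this one.
            Page overflow;
            Page& pg = pages[page_idx];
            overflow.items.assign(pg.items.begin() + PAGE_CAPACITY / 2, pg.items.end());
            pg.items.resize(PAGE_CAPACITY / 2);
            pages.insert(pages.begin() + page_idx + 1, overflow);  // pg is now invalid
            if (offset > PAGE_CAPACITY / 2) {   // target slot moved to the new page
                offset -= PAGE_CAPACITY / 2;
                ++page_idx;
            }
        }
        Page& target = pages[page_idx];
        target.items.insert(target.items.begin() + offset, p);
    }
};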
Some things to keep in mind:
array and vector allocations have upper limits, which are OS-dependent (and the two limits might differ)
On my 32-bit system, the maximum allowed allocation for a vector of 3D points is at around 180 million entries, so for larger datasets, one would have to find a different solution. Granted, on a 64-bit OS, that amount might be significantly larger (on 32-bit Windows, the maximum memory space for a process is 2Gb - I think they added some tricks on more advanced versions of the OS to extend that amount). Admittedly, memory will be even more problematic for solutions like mine.
resizing a vector requires allocating a new block on the heap and copying the elements from the old memory chunk to the new one.
So for adding just one element to the sequence, you will need twice the memory during the resizing. This issue may not come up with plain arrays, which can be reallocated using the ad hoc OS memory functions (realloc on unices for instance, though as far as I know that function doesn't make any guarantee that the same memory chunk will be reused). The problem might be avoided in vector as well if a custom allocator which uses the same functions is used.
C++ doesn't make any assumption about the underlying memory architecture.
vectors and arrays are meant to represent contiguous memory chunks provided by an allocator, and wrap that memory chunk with an interface to access it. But C++ doesn't know how the OS is managing that memory. In most modern OSes, that memory is actually cut into pages, which are mapped in and out of physical memory. So my solution is essentially to reproduce that mechanism at the process level. In order to make the paging efficient, it is necessary to have our page fit the OS page, so a bit of OS-dependent code will be necessary. On the other hand, this is not a concern at all for a vector or array based solution.
So in essence my answer is concerned with the efficiency of updating the dataset in a manner which favors clustering points close to each other. It supposes that such clustering is possible. If that is not the case, then just pushing a new point at the end of the dataset would be perfectly alright.
Although I do not know the exact implementation of std::vector, most dynamic containers like this are slower than arrays because they must allocate memory when they are resized, normally doubling the current capacity, although this is not always the case.
So if the vector contains 16 items and you add another, it needs memory for another 16 items. As vectors are contiguous in memory, this means that it will allocate a solid block of memory for 32 items and update the vector. You can get some performance improvement by constructing the std::vector with an initial capacity that is roughly the size you think your data set will be, although this isn't always an easy number to arrive at.
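For example (a sketch, assuming you can estimate the final size):
#include <cstddef>
#include <vector>

// Reserving up front removes the reallocate-and-copy cycles that repeated
// push_back would otherwise trigger.
std::vector<int> build(std::size_t expected_count) {
    std::vector<int> v;
    v.reserve(expected_count);  // one allocation, no later copies
    for (std::size_t i = 0; i < expected_count; ++i)
        v.push_back(static_cast<int>(i));
    return v;
}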
For operations that are common to vectors and arrays (hence not push_back or pop_back, since arrays are fixed in size), they perform exactly the same, because, by specification, they are the same.
vector access methods are so trivial that even the simplest compiler optimizations will wipe them out.
If you know the size of a vector in advance, just construct it by specifying the size, or call resize, and you get the same as you would with new[].
If you don't know the size, but you know how much you will need to grow, just call reserve, and you pay no penalty on push_back, since all the required memory is already allocated.
In any case, reallocations are not so "dumb": the capacity and the size of a vector are two distinct things, and the capacity is typically doubled upon exhaustion, so that reallocations of large amounts become less and less frequent.
Also, if you know all the sizes, need no dynamic memory, and want the same vector-like interface, consider std::array.
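For instance (a minimal sketch):
#include <array>
#include <numeric>

// When the size is a compile-time constant and no growth is needed,
// std::array gives a vector-like interface with zero dynamic allocation.
std::array<int, 64> values{};  // fixed size, no heap involved

int sum_values() {
    return std::accumulate(values.begin(), values.end(), 0);
}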
Sounds like you need gigs of RAM so you're not paging. I tend to go along with @Philipp's answer, because you really, really want to make sure it's not re-allocating under the hood,
but
what's this about a tree that needs rebalancing?
and you're even thinking about compiler optimization?
Please take a crash course in how to optimize software.
I'm sure you know all about Big-O, but I bet you're used to ignoring the constant factors, right? They might be out of whack by 2 to 3 orders of magnitude, doing things you never would have thought costly.
If that translates to days of compute time, maybe it'll get interesting.
And no compiler optimizer can fix these things for you.
If you're academically inclined, this post goes into more detail.
I'm writing a function where I need a significant amount of heap memory. Is it possible to tell the compiler that this data will be accessed frequently within a specific for loop, so as to improve performance (through compile options or similar)?
The reason I cannot use the stack is that the number of elements I need to store is big, and I get a segmentation fault if I try to do it.
Right now the code is working but I think it could be faster.
UPDATE:
I'm doing something like this
vector< set<uint> > vec(node_vec.size());
for (uint i = 0; i < node_vec.size(); i++)
{
    for (uint j = i+1; j < node_vec.size(); j++)
    {
        // some computation, basic math, store the result in variable x
        if (x > threshold)
        {
            vec[i].insert(j);
            vec[j].insert(i);
        }
    }
}
some details:
- I used hash_set: little improvement, besides the fact that hash_set is not available on all the machines I use for simulation purposes
- I tried to allocate vec on the stack using arrays but, as I said, I may get a segmentation fault if the number of elements is too big
If node_vec.size() is, say, equal to k, where k is of the order of a few thousand, I expect vec to be 4 or 5 times bigger than node_vec. With this order of magnitude the code appears to be slow, considering that I have to run it many times. Of course, I am using multithreading to parallelize these calls, but I can't get the function per se to run much faster than what I'm seeing right now.
Would it be possible, for example, to have vec allocated in the cache memory for fast data retrieval, or something similar?
I'm writing a function where I need a significant amount of heap memory ... will be accessed frequently within a specific for loop
This isn't something you can really optimize at a compiler level. I think your concern is that you have a lot of memory that may be "stale" (paged out) but at a particular point in time you will need to iterate over all of it, maybe several times and you don't want the memory pages to be paged out to disk.
You will need to investigate strategies that are platform specific to improve performance. Keeping the pages in memory can be achieved with mlockall or VirtualLock, but you really shouldn't need to do this. Make sure you know what the implications of locking your application's memory pages into RAM are, however: you're hogging memory from other processes.
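On POSIX systems, a minimal sketch might look like this; whether it's appropriate depends entirely on your privileges and how much memory you can afford to pin:
#include <sys/mman.h>  // POSIX only; Windows would use VirtualLock instead

// Pin the process's pages into RAM so the loop never takes a major page
// fault. Needs appropriate privileges/ulimits, and it deprives other
// processes of memory, so treat it as a last resort.
bool lock_all_pages() {
    return mlockall(MCL_CURRENT | MCL_FUTURE) == 0;
}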
You might also want to investigate a low fragmentation heap (however it may not be relevant at all to this problem) and this page which describes cache lines with respect to for loops.
The latter page is about the nitty-gritty of how CPUs work (a detail you normally shouldn't have to be concerned with) with respect to memory access.
Example 1: Memory accesses and performance
How much faster do you expect Loop 2 to run, compared to Loop 1?
int[] arr = new int[64 * 1024 * 1024];
// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16-th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine.
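The quoted snippet is C#; a rough C++ equivalent, if you want to reproduce the experiment locally, might look like this (exact timings will of course differ):
#include <cstddef>
#include <vector>

// Both loops touch the same set of cache lines, so they cost roughly the
// same despite Loop 2 doing about 1/16th of the multiplications: the
// runtime is dominated by memory traffic, not arithmetic.
int main() {
    std::vector<int> arr(64 * 1024 * 1024, 1);

    // Loop 1: every element
    for (std::size_t i = 0; i < arr.size(); ++i) arr[i] *= 3;

    // Loop 2: every 16th element (one per 64-byte cache line for 4-byte ints)
    for (std::size_t i = 0; i < arr.size(); i += 16) arr[i] *= 3;
}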
UPDATE
vector< set<uint> > vec(node_vec.size());
for (uint i = 0; i < node_vec.size(); i++)
{
    for (uint j = i+1; j < node_vec.size(); j++)
    {
        // some computation, basic math, store the result in variable x
        if (x > threshold)
        {
            vec[i].insert(j);
            vec[j].insert(i);
        }
    }
}
That still doesn't show much, because we cannot know how often the condition x > threshold will be true. If x > threshold is very frequently true, then the std::set might be the bottleneck, because it has to do a dynamic memory allocation for every uint you insert.
Also we don't know what "some computation" actually means/does/is. If it does much, or does it in the wrong way that could be the bottleneck.
And we don't know how you need to access the result.
Anyway, on a hunch:
vector<pair<int, int> > vec1;
vector<pair<int, int> > vec2;

for (uint i = 0; i < node_vec.size(); i++)
{
    for (uint j = i+1; j < node_vec.size(); j++)
    {
        // some computation, basic math, store the result in variable x
        if (x > threshold)
        {
            vec1.push_back(make_pair(i, j));
            vec2.push_back(make_pair(j, i));
        }
    }
}
If you can use the result in that form, you're done. Otherwise you could do some post-processing. Just don't copy it into a std::set again (obviously). Try to stick to std::vector<POD>. E.g. you could build an index into the vectors like this:
// ...
vector<int> index1 = build_index(node_vec.size(), vec1);
vector<int> index2 = build_index(node_vec.size(), vec2);
// ...

vector<int> build_index(size_t count, vector<pair<int, int> > const& vec)
{
    // Walk backwards so that index[k] ends up pointing at the first pair
    // in vec whose .first equals k (or stays -1 if there is none).
    vector<int> index(count, -1);

    size_t i = vec.size();
    while (i != 0)
    {
        i--;
        assert(vec[i].first >= 0);
        assert(static_cast<size_t>(vec[i].first) < count);
        index[vec[i].first] = static_cast<int>(i);
    }
    return index;
}
PS: I'm almost sure your loop is not memory-bound. I can't be certain, though... if the "nodes" you're not showing us are really big, it might still be.
Original answer:
There is no easy I_will_access_this_frequently_so_make_it_fast(void* ptr, size_t len)-kind-of solution.
You can do some things though.
Make sure the compiler can "see" the implementation of every function that's called inside critical loops. What is necessary for the compiler to be able to "see" the implementation depends on the compiler. There is one way to be sure though: define all relevant functions in the same translation unit before the loop, and declare them as inline.
This also means you should not by any means call "external" functions in those critical loops. By "external" functions I mean things like system calls, runtime-library routines, or functions implemented in a DLL/SO. Also don't call virtual functions and don't use function pointers. And of course don't allocate or free memory inside the critical loops.
Make sure you use an optimal algorithm. Linear optimization is moot if the complexity of the algorithm is higher than necessary.
Use the smallest possible types. E.g. don't use int if signed char will do the job. That's something I wouldn't normally recommend, but when processing a large chunk of memory it can increase performance quite a lot. Especially in very tight loops.
If you're just copying or filling memory, use memcpy or memset. Disable the intrinsic version of those two functions if the chunks are larger than about 50 to 100 bytes.
Make sure you access the data in a cache-friendly manner. The optimum is "streaming" - i.e. accessing the memory with ascending or descending addresses. You can "jump" ahead some bytes at a time, but don't jump too far. The worst is random access to a big block of memory. E.g. if you have to work on a 2 dimensional matrix (like a bitmap image) where p[0] to p[1] is a step "to the right" (x + 1), make sure the inner loop increments x and the outer increments y. If you do it the other way around performance will be much much worse.
If your pointers are alias-free, you can tell the compiler (how that's done depends on the compiler; a restrict-style sketch follows after this list). If you don't know what alias-free means, I recommend searching the net and your compiler's documentation, since an explanation would be beyond the scope.
Use intrinsic SIMD instructions if appropriate.
Use explicit prefetch instructions if you know which memory locations will be needed in the near future.
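As promised above, a sketch of the alias-free hint (GCC/Clang spell it __restrict__, MSVC uses __restrict; plain restrict is C, not standard C++):
// Promising the compiler that dst and src never overlap lets it vectorize
// without runtime overlap checks. Breaking the promise is undefined behavior.
void scale(float* __restrict__ dst, const float* __restrict__ src, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0f;  // no aliasing possible, so this can vectorize freely
}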
You can't do that with compiler options. Depending on your usage (insertion, random-access, deleting, sorting, etc.), you could maybe get a better suited container.
The compiler can already see that the data is accessed frequently within the loop.
Assuming you're only allocating the data from the heap once before doing the looping, note, as @lvella said, that memory is memory: if it's accessed frequently it should be effectively cached during execution.