a trade-off between memory and performance - c++

This question is about trade-off between memory and performance.
I am doing C++ on Linux.
The for_loop () is on time critical path. I am tring to reduce its run time as much as possible.
myArray.assignMemory(); // the memory will be 50KB.
if (myFlag)
myArray is assigned meaningful values
else
myArray is assigned NotNumber (a very small negative number)
for_loop ( iterationNumber = N) { // N will be very large
myF1 ( myArray[i] ) ;
}
myF1(double j){
if(myFlag)
use j
else
doNothing
}
Here, assign memory to myArray even though myFlag is false, in which case the memory is wasted. But, if I put if(myFlag) in the for_loop, this will have performance overhead.
I can put if(myFlag) out of for_loop so that if myFlag is true we run myF1( myArray[i]), otherwise, we run myF1(notNumber). But, this will have duplication.
So, my question is : are there other better ways that do not pull in performace overhead while not wasting any memory ?
Thanks

To me, from what I can see:
myArray.assignMemory(); // the memory will be 50KB.
if (myFlag)
myArray is assigned meaningful values
else
myArray is assigned NotNumber (a very small negative number)
for_loop ( iterationNumber = N) { // N will be very large
myF1 ( myArray[i] ) ;
}
myF1(double j){
if(myFlag)
use j
else
doNothing
}
is the same as
if (myflag)
{
myArray.assignMemory(); // the memory will be 50KB.
myArray is assigned meaningful values
for_loop ( iterationNumber = N) { // N will be very large
myF1 ( myArray[i] ) ;
}
}
myF1(double j){
use j
}
Of course, it could be that your code does more things than what you describe, in which case this part of the answer is completely useless (but not really my fault - I can only go by what you have posted, and the posted code don't do anything else with myArray.
As to your DIRECT question, it really depends on what you are trying to achieve. 50KB is not a very large allocation (as long as you are not doing it several times). But allocating memory that you don't actually need is also completely meaningless AND takes time.
The title of your question is about "tradeoff between memory and performance", which is typically about "do I store something in lots of memory that is fast to access, or work out a more memory efficient way to store it, but taking more time." For example, if we have a telephone directory, we could have a very large array with all telephone numbers from 000000000 to 999999999 in one large, directly addressed array, or we can use a map or hash_map that stores only the items we actually need in the table. The directly addressed array is faster to access, but it's so much larger that it may not fit in the memory in most machines [if each record is large as well]. So it's a choice, do we make it "fast, using lots of memory", or do we make it "small memory, but not so fast". And like so many things, there's no directly right or wrong answer - it depends on which is more important, speed or memory space.

Call myArray.assignMemory() after checking MyFlag, but before the for_loop(). Depending on what myArray is assigned meaningful values and myArray is assigned NotNumber (a very small negative number) do, you might need to change the implementation of the class that myArray belongs to.

Related

Why does my program keep on giving me heap buffer overflow even though I am not going out of bounds on my array or overwriting any values

I am trying to solve Leetcode problem 746 which is a basic Dynamic programming problem. Even though my logic is little complex, it should work perfectly fine. I have had 2 sleepless nights trying to find what the problem in my code is. Can someone point exactly where am I doing heap overflow and how to avoid it ?
Also, I forgot to add, the size of cost is always greater than 3.
I have already tried inserting comments into my solution. I have realized that the problem lies with the costing[1] updating code but what the problem is, I have no idea.
class Solution {
public:
int minCostClimbingStairs(vector<int>& cost) {
if(cost.size() < 3)
return 0;
int costing[2];
costing[0]=cost[0];
costing[1]=cost[1];
int i=1;
while(i<cost.size()-2)
{
if(costing[0]+cost[i]>costing[0]+cost[i+1])
{
costing[0]=costing[0]+cost[i+1];
i++;
}
else if (costing[0]+cost[i]<=costing[0]+cost[i+1])
{
costing[0]=costing[0]+cost[i];
i+=2;
}
if(costing[1]+cost[i+1]>=costing[1]+cost[i+2])
{
cout<<costing[1]+cost[i+1]<<"is greater than " <<costing[1]+cost[i+2]<<"\n";
costing[1]+=cost[i+2];
i+=2;
}
else if (costing[1]+cost[i+1]<costing[1]+cost[i+2])
{
cout<<costing[1]+cost[i+1]<<"is less than " <<costing[1]+cost[i+2]<<"\n";
costing[1]+=cost[i+1];
i++;
}
}
return min(costing[0],costing[1]);
}
};
The initial value of i is 1. It can increment by 4 in one iteration of the while under different conditions (if the first else if and the second if are true).
So in the second iteration of the while, the value of i can become 9.
So in the last else if, cost[i+2] is cost[11]. Since your dataset has only 10 elements (as mentioned in the comment), this results in an out-of-bounds access.
Couldn't it be that your heap has not enough space? Vector allocate continuous memory blocks on the heap. If a vector grows for a long time, the heap may get fragmented. So at some point there is no free block available.
Maybe using std::vector::reserve can help you prevent this behaviour.
Also keep in mind that the indexed access ([i]) to vectors doesn't include range checking as opposed to the at function.

Fastest way to allocate temporary elements (knowing maximum number) in a vector?

in a function I need to store some integers in a vector. The function is called a lot of times. I know that they are less then 10 but the number is variable for each call of the function. What is the choice to have better performances?
In example I found that this:
std::vector<int> list(10)
std::vector<int>::iterator it=list.begin();
unsigned int nume_of_elements_stored;
for ( ... iterate on some structures ... ){
if (... a specific condition ...){
*it= integer from structures ;
it++;
nume_of_elements_stored++;
}
}
is slower than:
std::vector<int> list;
unsigned int num_of_elements_stored(0);
for ( ... iterate on some structures ... ){
if (... a specific condition ...){
list.push_back( integer from structures )
}
}
num_of_elements_stored=list.size();
I'm going to go down an extremely uncool route here. At the risk of being crucified, I would suggest that std::vector isn't so great here. An exception would be if you get lucky with the memory allocator and get that temporal locality through the allocator that creating and destroying a bunch of teeny vectors normally wouldn't provide.
Wait!
Before people kill me, I want to say that vector is awesome, generally speaking, as one of the most well-rounded data structures available. But when you're looking at a hotspot like this (hopefully with a profiler) as a result of creating a bunch of teeny vectors repeatedly in a tight loop, that's where this kind of straightforward usage of vector can bite you.
The trouble is that it's a heap-allocated structure (basically a dynamic array), and when we're dealing with a boatload of teeny arrays like this, we really want to use that often-cached memory at the top of the stack that's so cheap to allocate/free when we can.
One way to mitigate this is to reuse the same vector across repeated calls. Store it in the outside caller function's scope and pass it in by reference, clear it, do your push_backs, rinse and repeat. It's worth noting that clear doesn't free any memory in the vector, so it keeps that former capacity around (useful here when we want to reuse the same memory and play to temporal locality).
But here we can play to that stack. As a simplified example (using C-style code that isn't very kosher in C++ or even bothers with exception-safety, but easier to illustrate):
int stack_mem[32];
int num = 0;
int cap = 32;
int* ptr = stack_mem;
for ( ... iterate on some structures ... )
{
if (... a specific condition ...)
{
if (num == cap)
{
cap *= 2;
int* new_ptr = static_cast<int*>(malloc(cap * sizeof(int)));
memcpy(new_ptr, ptr, num * sizeof(int));
if (ptr != stack_mem)
free(ptr);
ptr = new_ptr;
}
ptr[num++] = your_int;
}
}
if (ptr != stack_mem)
free(ptr);
Of course if you use something like this, you should properly wrap it in a reusable class template that does bounds-checking, doesn't use memcpy, has exception-safety, a formal push_back method, emplace_back, copy ctor, move ctor, swap, possibly a fill ctor, range ctor, erase, range erase, insert, range insert, size, empty, iterators/begin/end, uses placement new to avoid requiring copy assignment or default ctor, etc.
The solution uses the stack when N <= 32 (can use a different number suited for your common-case needs) and then switches to heap when exceeded. This allows it to handle your common case scenarios efficiently but also not just go kablooey in those rare case scenarios when N might be huge in some pathological case. That makes it somewhat comparable to variable-length arrays in C (something I actually wish we had in C++, at least until std::dynarray is available) but without the stack overflow tendencies VLAs could have since it switches to heap in rare case scenarios.
I applied all these standard-compliant formalities with a structure based on this idea with a class template that accepts <T, FixedN>, and now use it almost as much as vector since I work with so many cases like this with teeny arrays being repeatedly created that should, in the vast majority of common cases, fit on the stack (but always with those ultra rare exceptional possibilities). It wiped off many profiler hotspots I was getting related to memory off the map.
... but applying this basic idea might give you quite a boost. You can apply that kind of effort above of wrapping it into a safe container preserving C++ object semantics if it pays off in your measurements, and I think it should quite a bit in your case.
I would probably go with sort of a middle ground:
std::vector<int> list;
list.reserve(10);
...and the rest could be pretty much like your second version. To be honest, however, it's probably open to question whether this will really make a lot of difference though.
If you use static vector it will be allocated only once.
First example works slower because it allocates and destroys vector each call.

efficiency when handling consecutive blocks vs. non consecutive blocks of memory

i have a struct
struct A
{
int v[10000000];
};
if i have A a[2]; and wish to calculate the total sum of values which of these 2 methods is the fastest?
int method_1(const A &a[],int length)
{
int total = 0;
for(int i=0;i<length;i++)
for(int j=0;j<10000000;j++)
total+=a[i][j];
return total;
}
int method_2(const A &a[],int length)
{
int total = 0;
for(int j=0;j<10000000;j++)
for(int i=0;i<length;i++)
total+=a[i][j];
return total;
}
a[2] is declared as two consective blocks of struct A as so:
----a[0]---- /--- a[1]----
[][][][][][][][]/[][][][][][][][]
so, i might be tempted to say that method_1 is faster, based on intuition that the blocks are consecutive and the iteration through each block's v is also consecutive.
What i am really interested in is how the memory is really accessed and how is the most efficient way to access it.
EDIT
i have changed the v size from 32 to 10000000, because apparently it wasn't understood that i was referring to a general case
Each time a memory fragment is read a whole cache line will be read from main memory to the CPU cache, today you'll probably have a 32byte long cache lines. Mostly because of this reading consecutive memory blocks is fast.
Now there is more then one cache line...
In your case both cases may have similar performance as both arrays will most probably not collide into the same cache line and so both may be in the cache on different lines so I suspect performance will be similar.
One related thing you may consider in this case to change the performance is NOT using the [] operators in favor of iterating more using "iterators" like this:
int method_1(const A &a[],int length)
{
int total = 0;
for(const A* aIt=a;aIt<a+length;++aIt)
for(const v* vIt=aIt->v;vIt<aIt->v+10000000;++vIt)
total+=*vIt;
return total;
}
This way you avoid a double [] which is simply a multiplication by the sizeof of an array element (which may be optimized but may not and if not it will be costly when called millions of times). Your compiler may be smart enough to optimize the code just as I've shown to only use additions but... it very well may not be and I've seen this make a big difference when the operation performed for each of the elements is as trivial as an incrementation - you're best to measure this and see how these options work out in your environment.
Accessing elements in the order they appear in memory will improve performance in most cases since it allows the prefetcher to load data before you even use it. Besides, if you use data in a non-contiguous way, you might load and discard the same cache line many times and this has a cost.
Data size is small enough to be fit completely in a single cache line on modern CPUs. I'm not sure about vertorizing this code by compiler
I don't think method_2 is slower than method_1. The chunk of memory will be taken to CPUs main memory and then accessing a[0] and a[1] both will be take same time.
For a safer side, method_1 can always be considered better than method_2.

If I want maximum speed, should I just use an array over a std::vector? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I am writing some code that needs to be as fast as possible without sucking up all of my research time (in other words, no hand optimized assembly).
My systems primarily consist of a bunch of 3D points (atomic systems) and so the code I write does lots of distance comparisons, nearest-neighbor searches, and other types of sorting and comparisons. These are large, million or billion point systems, and the naive O(n^2) nested for loops just won't cut it.
It would be easiest for me to just use a std::vector to hold point coordinates. And at first I thought it will probably be about as fast an array, so that's great! However, this question (Is std::vector so much slower than plain arrays?) has left me with a very uneasy feeling. I don't have time to write all of my code using both arrays and vectors and benchmark them, so I need to make a good decision right now.
I am sure that someone who knows the detailed implementation behind std::vector could use those functions with very little speed penalty. However, I primarily program in C, and so I have no clue what std::vector is doing behind the scenes, and I have no clue if push_back is going to perform some new memory allocation every time I call it, or what other "traps" I could fall into that make my code very slow.
An array is simple though; I know exactly when memory is being allocated, what the order of all my algorithms will be, etc. There are no blackbox unknowns that I may have to suffer through. Yet so often I see people criticized for using arrays over vectors on the internet that I can't but help wonder if I am missing some more information.
EDIT: To clarify, someone asked "Why would you be manipulating such large datasets with arrays or vectors"? Well, ultimately, everything is stored in memory, so you need to pick some bottom layer of abstraction. For instance, I use kd-trees to hold the 3D points, but even so, the kd-tree needs to be built off an array or vector.
Also, I'm not implying that compilers cannot optimize (I know the best compilers can outperform humans in many cases), but simply that they cannot optimize better than what their constraints allow, and I may be unintentionally introducing constraints simply due to my ignorance of the implementation of vectors.
all depends on this how you implement your algorithms. std::vector is such general container concept that gives us flexibility but leaves us with freedom and responsibility of structuring implementation of algorithm deliberately. Most of the efficiency overhead that we will observe from std::vector comes from copying. std::vector provides a constructor which lets you initialize N elements with value X, and when you use that, the vector is just as fast as an array.
I did a tests std::vector vs. array described here,
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>
class TestTimer
{
public:
TestTimer(const std::string & name) : name(name),
start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
{
}
~TestTimer()
{
using namespace std;
using namespace boost;
posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
posix_time::time_duration d = now - start;
cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
" seconds" << endl;
}
private:
std::string name;
boost::posix_time::ptime start;
};
struct Pixel
{
Pixel()
{
}
Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
{
}
unsigned char r, g, b;
};
void UseVector()
{
TestTimer t("UseVector");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels;
pixels.resize(dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
}
void UseVectorPushBack()
{
TestTimer t("UseVectorPushBack");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels;
pixels.reserve(dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
pixels.push_back(Pixel(255, 0, 0));
}
}
void UseArray()
{
TestTimer t("UseArray");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);
for(int i = 0 ; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
free(pixels);
}
}
void UseVectorCtor()
{
TestTimer t("UseConstructor");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
}
}
int main()
{
TestTimer t1("The whole thing");
UseArray();
UseVector();
UseVectorCtor();
UseVectorPushBack();
return 0;
}
and here are results (compiled on Ubuntu amd64 with g++ -O3):
UseArray completed in 0.325 seconds
UseVector completed in 1.23 seconds
UseConstructor completed in 0.866 seconds
UseVectorPushBack completed in 8.987 seconds
The whole thing completed in 11.411 seconds
clearly push_back wasn't good choice here, using constructor is still 2 times slower than array.
Now, providing Pixel with empty copy Ctor:
Pixel(const Pixel&) {}
gives us following results:
UseArray completed in 0.331 seconds
UseVector completed in 0.306 seconds
UseConstructor completed in 0 seconds
UseVectorPushBack completed in 2.714 seconds
The whole thing completed in 3.352 seconds
So in summary: re-think your algorithm, otherwise, perhaps resort to a custom wrapper around New[]/Delete[]. In any case, the STL implementation isn't slower for some unknown reason, it just does exactly what you ask; hoping you know better.
In the case when you just started with vectors it might be surprising how they behave, for example this code:
class U{
int i_;
public:
U(){}
U(int i) : i_(i) {cout << "consting " << i_ << endl;}
U(const U& ot) : i_(ot.i_) {cout << "copying " << i_ << endl;}
};
int main(int argc, char** argv)
{
std::vector<U> arr(2,U(3));
arr.resize(4);
return 0;
}
results with:
consting 3
copying 3
copying 3
copying 548789016
copying 548789016
copying 3
copying 3
Vectors guarantee that the underlying data is a contiguous block in memory. The only sane way to guarantee this is by implementing it as an array.
Memory reallocation on pushing new elements can happen, because the vector can't know in advance how many elements you are going to add to it. But when you know it in advance, you can call reserve with the appropriate number of entries to avoid reallocation when adding them.
Vectors are usually preferred over arrays because they allow bound-checking when accessing elements with .at(). That means accessing indices outside of the vector doesn't cause undefined behavior like in an array. This bound-checking does however require additional CPU cycles. When you use the []-operator to access elements, no bound-checking is done and access should be as fast as an array. This however risks undefined behavior when your code is buggy.
People who invented STL, and then made it into the C++ standard library, are expletive deleted smart. Don't even let yourself imagine for one little moment you can outperform them because of your superior knowledge of legacy C arrays. (You would have a chance if you knew some Fortran though).
With std::vector, you can allocate all memory in one go, just like with C arrays. You can also allocate incrementally, again just like with C arrays. You can control when each allocation happens, just like with C arrays. Unlike with C arrays, you can also forget about it all and let the system manage the allocations for you, if that's what you want. This is all absolutely necessary, basic functionality. I'm not sure why anyone would assume it is missing.
Having said all that, go with arrays if you find them easier to understand.
I am not really advising you to go either for arrays or vectors, because I think that for your needs they may not be totally fit.
You need to be able to organize your data efficiently, so that queries would not need to scan the whole memory range to get the relevant data. So you want to group the points which are more likely to be selected together close to each other.
If your dataset is static, then you can do that sorting offline, and make your array nice and tidy to be loaded up in memory at your application start up time, and either vector or array would work (provided you do the reserve call up front for vector, since the default allocation growth scheme double up the size of the underlying array whenever it gets full, and you wouldn't want to use up 16Gb of memory for only 9Gb worth of data).
But if your dataset is dynamic, it will be difficult to do efficient inserts in your set with a vector or an array. Recall that each insert within the array would create a shift of all the successor elements of one place. Of course, an index, like the kd tree you mention, will help by avoiding a full scan of the array, but if the selected points are scattered accross the array, the effect on memory and cache will essentially be the same. The shift would also mean that the index needs to be updated.
My solution would be to cut the array in pages (either list linked or array indexed) and store data in the pages. That way, it would be possible to group relevant elements together, while still retaining the speed of contiguous memory access within pages. The index would then refer to a page and an offset in that page. Pages wouldn't be filled automatically, which leaves rooms to insert related elements, or make shifts really cheap operations.
Note that if pages are always full (excepted for the last one), you still have to shift every single one of them in case of an insert, while if you allow incomplete pages, you can limit a shift to a single page, and if that page is full, insert a new page right after it to contain the suplementary element.
Some things to keep in mind:
array and vector allocation have upper limits, which is OS dependent (these limits might be different)
On my 32bits system, the maximum allowed allocation for a vector of 3D points is at around 180 millions entries, so for larger datasets, on would have to find a different solution. Granted, on 64bits OS, that amount might be significantly larger (On windows 32bits, the maximum memory space for a process is 2Gb - I think they added some tricks on more advanced versions of the OS to extend that amount). Admittedly memory will be even more problematic for solutions like mine.
resizing of a vector requires allocating the new size of the heap, copy the elements from the old memory chunck to the new one.
So for adding just one element to the sequence, you will need twice the memory during the resizing. This issue may not come up with plain arrays, which can be reallocated using the ad hoc OS memory functions (realloc on unices for instance, but as far as I know that function doesn't make any guarantee that the same memory chunck will be reused). The problem might be avoided in vector as well if a custom allocator which would use the same functions is used.
C++ doesn't make any assumption about the underlying memory architecture.
vectors and arrays are meant to represent contiguous memory chunks provided by an allocator, and wrap that memory chunk with an interface to access it. But C++ doesn't know how the OS is managing that memory. In most modern OS, that memory is actually cut in pages, which are mapped in and out of physical memory. So my solution is somehow to reproduce that mechanism at the process level. In order to make the paging efficient, it is necessary to have our page fit the OS page, so a bit of OS dependent code will be necessary. On the other hand, this is not a concern at all for a vector or array based solution.
So in essence my answer is concerned by the efficiency of updating the dataset in a manner which will favor clustering points close to each others. It supposes that such clustering is possible. If not the case, then just pushing a new point at the end of the dataset would be perfectly alright.
Although I do not know the exact implementation of std:vector, most list systems like this are slower than arrays as they allocate memory when they are resized, normally double the current capacity although this is not always the case.
So if the vector contains 16 items and you added another, it needs memory for another 16 items. As vectors are contiguous in memory, this means that it will allocate a solid block of memory for 32 items and update the vector. You can get some performance improvements by constructing the std:vector with an initial capacity that is roughly the size you think your data set will be, although this isn't always an easy number to arrive at.
For operation that are common between vectors and arrays (hence not push_back or pop_back, since array are fixed in size) they perform exactly the same, because -by specification- they are the same.
vector access methods are so trivial that the simpler compiler optimization will wipe them out.
If you know in advance the size of a vector, just construct it by specifyinfg the size or just call resize, and you will get the same you can get with a new [].
If you don't know the size, but you know how much you will need to grow, just call reserve, and you get no penality on push_back, since all the required memory is already allocated.
In any case, relocation are not so "dumb": the capacity and the size of a vector are two distinct things, and the capacity is typically doubled upon exhaustion, so that relocation of big amounts are less and less frequent.
Also, in case you know everything about sizes, and you need no dynamic memory and want the same vector interface, consider also std::array.
Sounds like you need gigs of RAM so you're not paging. I tend to go along with #Philipp's answer, because you really really want to make sure it's not re-allocating under the hood
but
what's this about a tree that needs rebalancing?
and you're even thinking about compiler optimization?
Please take a crash course in how to optimize software.
I'm sure you know all about Big-O, but I bet you're used to ignoring the constant factors, right? They might be out of whack by 2 to 3 orders of magnitude, doing things you never would have thought costly.
If that translates to days of compute time, maybe it'll get interesting.
And no compiler optimizer can fix these things for you.
If you're academically inclined, this post goes into more detail.

C++ heap memory performance improvement

I'm writing a function where I need a significant amount of heap memory. Is it possible to tell the compiler that those data will be accessed frequently within a specific for loop, so as to improve performance (through compile options or similar)?
The reason I cannot use the stack is that the number of elements I need to store is big, and I get segmentation fault if I try to do it.
Right now the code is working but I think it could be faster.
UPDATE:
I'm doing something like this
vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
for(uint j = i+1; j < node_vec.size(); j++)
// some computation, basic math, store the result in variable x
if( x > threshold ) {
vec[i].insert(j);
vec[j].insert(i);
}
some details:
- I used hash_set, little improvement, beside the fact that hash_set is not available in all machines I have for simulation purposes
- I tried to allocate vec on the stack using arrays but, as I said, I might get segmentation fault if the number of elements is too big
If node_vec.size() is, say, equal to k, where k is of the order of a few thousands, I expect vec to be 4 or 5 times bigger than node_vec. With this order of magnitude the code appears to be slow, considering the fact that I have to run it many times. Of course, I am using multithreading to parallelize these calls, but I can't get the function per se to run much faster than what I'm seeing right now.
Would it be possible, for example, to have vec allocated in the cache memory for fast data retrieval, or something similar?
I'm writing a function where I need a significant amount of heap memory ... will be accessed frequently within a specific for loop
This isn't something you can really optimize at a compiler level. I think your concern is that you have a lot of memory that may be "stale" (paged out) but at a particular point in time you will need to iterate over all of it, maybe several times and you don't want the memory pages to be paged out to disk.
You will need to investigate strategies that are platform specific to improve performance. Keeping the pages in memory can be achieved with mlockall or VirtualLock but you really shouldn't need to do this. Make sure you know what the implications of locking your application's memory pages into RAM is, however. You're hogging memory from other processes.
You might also want to investigate a low fragmentation heap (however it may not be relevant at all to this problem) and this page which describes cache lines with respect to for loops.
The latter page is about the nitty-gritty of how CPUs work (a detail you normally shouldn't have to be concerned with) with respect to memory access.
Example 1: Memory accesses and performance
How much faster do you expect Loop 2 to run, compared Loop 1?
int[] arr = new int[64 * 1024 * 1024];
// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16-th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine.
UPDATE
vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
for(uint j = i+1; j < node_vec.size(); j++)
// some computation, basic math, store the result in variable x
if( x > threshold ) {
vec[i].insert(j);
vec[j].insert(i);
}
That still doesn't show much, because we cannot know how often the condition x > threshold will be true. If x > threshold is very frequently true, then the std::set might be the bottleneck, because it has to do a dynamic memory allocation for every uint you insert.
Also we don't know what "some computation" actually means/does/is. If it does much, or does it in the wrong way that could be the bottleneck.
And we don't know how you need to access the result.
Anyway, on a hunch:
vector<pair<int, int> > vec1;
vector<pair<int, int> > vec2;
for (uint i = 0; i < node_vec.size(); i++)
{
for (uint j = i+1; j < node_vec.size(); j++)
{
// some computation, basic math, store the result in variable x
if (x > threshold)
{
vec1.push_back(make_pair(i, j));
vec2.push_back(make_pair(j, i));
}
}
}
If you can use the result in that form, you're done. Otherwise you could do some post-processing. Just don't copy it into a std::set again (obviously). Try to stick to std::vector<POD>. E.g. you could build an index into the vectors like this:
// ...
vector<int> index1 = build_index(node_vec.size(), vec1);
vector<int> index2 = build_index(node_vec.size(), vec2);
// ...
}
vector<int> build_index(size_t count, vector<pair<int, int> > const& vec)
{
vector<int> index(count, -1);
size_t i = vec.size();
do
{
i--;
assert(vec[i].first >= 0);
assert(vec[i].first < count);
index[vec[i].first] = i;
}
while (i != 0);
return index;
}
ps.: I'm almost sure your loop is not memory-bound. Can't be sure though... if the "nodes" you're not showing us are really big it might still be.
Original answer:
There is no easy I_will_access_this_frequently_so_make_it_fast(void* ptr, size_t len)-kind-of solution.
You can do some things though.
Make sure the compiler can "see" the implementation of every function that's called inside critical loops. What is necessary for the compiler to be able to "see" the implementation depends on the compiler. There is one way to be sure though: define all relevant functions in the same translation unit before the loop, and declare them as inline.
This also means you should not by any means call "external" functions in those critical loops. And by "external" functions I mean things like system-calls, runtime-library stuff or stuff implemented in a DLL/SO. Also don't call virtual functions and don't use function pointers. And or course don't allocate or free memory (inside the critical loops).
Make sure you use an optimal algorithm. Linear optimization is moot if the complexity of the algorithm is higher than necessary.
Use the smallest possible types. E.g. don't use int if signed char will do the job. That's something I wouldn't normally recommend, but when processing a large chunk of memory it can increase performance quite a lot. Especially in very tight loops.
If you're just copying or filling memory, use memcpy or memset. Disable the intrinsic version of those two functions if the chunks are larger then about 50 to 100 bytes.
Make sure you access the data in a cache-friendly manner. The optimum is "streaming" - i.e. accessing the memory with ascending or descending addresses. You can "jump" ahead some bytes at a time, but don't jump too far. The worst is random access to a big block of memory. E.g. if you have to work on a 2 dimensional matrix (like a bitmap image) where p[0] to p[1] is a step "to the right" (x + 1), make sure the inner loop increments x and the outer increments y. If you do it the other way around performance will be much much worse.
If your pointers are alias-free, you can tell the compiler (how that's done depends on the compiler). If you don't know what alias-free means I recommend searching the net and your compiler's documentation, since an explanation would be beyond the scope.
Use intrinsic SIMD instructions if appropriate.
Use explicit prefetch instructions if you know which memory locations will be needed in the near future.
You can't do that with compiler options. Depending on your usage (insertion, random-access, deleting, sorting, etc.), you could maybe get a better suited container.
The compiler can already see that the data is accessed frequently within the loop.
Assuming you're only allocating the data from the heap once before doing the looping, note, as #lvella, that memory is memory and if it's accessed frequently it should be effectively cached during execution.