Performace of large 3D arrays: contiguous 1D storage vs T*** - c++

I wonder if anyone could advise on storage of large (say 2000 x 2000 x 2000) 3D arrays for finite difference discretization computations. Does contiguous storage float* give better performance then float*** on modern CPU architectures?
Here is a simplified example of computations, which are done over entire arrays:
for i ...
for j ...
for k ...
u[i][j][k] += v[i][j][k+1] + v[i][j][k-1]
+ v[i][j+1][k] + v[i][j-1][k] + v[i+1][j][k] + v[i-1][j][k];
Vs
u[i * iStride + j * jStride + k] += ...
PS:
Considering size of problems, storing T*** is a very small overhead. Access is not random. Moreover, I do loop blocking to minimize cache misses. I am just wondering how triple dereferencing in T*** case compares to index computation and single dereferencing in case of 1D array.

These are not apples-to-apples comparisons: a flat array is just that - a flat array, which your code partitions into segments according to some logic of linearizing a rectangular 3D array. You access an element of an array with a single dereference, plus a handful of math operations.
float***, on the other hand, lets you keep a "jagged" array of arrays or arrays, so the structure that you can represent inside such an array is a lot more flexible. Naturally, you need to pay for that flexibility with additional CPU cycles required for dereferencing pointers to pointers to pointers, then a pointer to pointer, and finally a pointer (the three pairs of square brackets in the code).
Naturally, access to the individual elements of float*** is going to be a little slower, if you access them in truly random order. However, if the order is not random, the difference that you see may be tiny, because the values of pointers would be cached.
float*** will also require more memory, because you need to allocate two additional levels of pointers.

The short answer is: benchmark it. If the results are inconclusive, it means it doesn't matter. Do what makes your code most readable.
As #dasblinkenlight has pointed out, the structures are nor equivalent because T*** can be jagged.
At the most fundamental level, though, this comes down to the arithmetic and memory access operations.
For your 1D array, as you have already (almost) written, the calculation is:
ptr = u + (i * iStride) + (j * jStride) + k
read *ptr
With T***:
ptr = u + i
x = read ptr
ptr = x + j
y = read ptr
ptr = y + k
read ptr
So you are trading two multiplications for two memory accesses.

In computer go, where people are very performance-sensitive, everyone (AFAIK) uses T[361] rather than T[19][19] (*). This decision is based on benchmarking, both in isolation and the whole program. (It is possible everyone did those benchmarks years and years ago, and have never done them again on the latest hardware, but my hunch is that a single 1-D array will still be better.)
However your array is huge, in comparison. As the code involved in each case is easy to write, I would definitely try it both ways and benchmark.
*: Aside: I think it is actually T[21][21] vs. t[441], in most programs, as an extra row all around is added to speed up board-edge detection.

One issue that has not been mentioned yet is aliasing.
Does your compiler support some type of keyword like restrict to indicate that you have no aliasing? (It's not part of C++11 so would have to be an extension.) If so, performance may be very close to the same. If not, there could be significant differences in some cases. The issue will be with something like:
for (int i = ...) {
for (int j = ...) {
a[j] = b[i];
}
}
Can b[i] be loaded once per outer loop iteration and stored in a register for the entire inner loop? In the general case, only if the arrays don't overlap. How does the compiler know? It needs some type of restrict keyword.

Related

data locality for implementing 2d array in c/c++

Long time ago, inspired by "Numerical recipes in C", I started to use the following construct for storing matrices (2D-arrays).
double **allocate_matrix(int NumRows, int NumCol)
{
double **x;
int i;
x = (double **)malloc(NumRows * sizeof(double *));
for (i = 0; i < NumRows; ++i) x[i] = (double *)calloc(NumCol, sizeof(double));
return x;
}
double **x = allocate_matrix(1000,2000);
x[m][n] = ...;
But recently noticed that many people implement matrices as follows
double *x = (double *)malloc(NumRows * NumCols * sizeof(double));
x[NumCol * m + n] = ...;
From the locality point of view the second method seems perfect, but has awful readability... So I started to wonder, is my first method with storing auxiliary array or **double pointers really bad or the compiler will optimize it eventually such that it will be more or less equivalent in performance to the second method? I am suspicious because I think that in the first method two jumps are made when accessing the value, x[m] and then x[m][n] and there is a chance that each time the CPU will load first the x array and then x[m] array.
p.s. do not worry about extra memory for storing **double, for large matrices it is just a small percentage.
P.P.S. since many people did not understand my question very well, I will try to re-shape it: do I understand right that the first method is kind of locality-hell, when each time x[m][n] is accessed first x array will be loaded into CPU cache and then x[m] array will be loaded thus making each access at the speed of talking to RAM. Or am I wrong and the first method is also OK from data-locality point of view?
For C-style allocations you can actually have the best of both worlds:
double **allocate_matrix(int NumRows, int NumCol)
{
double **x;
int i;
x = (double **)malloc(NumRows * sizeof(double *));
x[0] = (double *)calloc(NumRows * NumCol, sizeof(double)); // <<< single contiguous memory allocation for entire array
for (i = 1; i < NumRows; ++i) x[i] = x[i - 1] + NumCols;
return x;
}
This way you get data locality and its associated cache/memory access benefits, and you can treat the array as a double ** or a flattened 2D array (array[i * NumCols + j]) interchangeably. You also have fewer calloc/free calls (2 versus NumRows + 1).
No need to guess whether the compiler will optimize the first method. Just use the second method which you know is fast, and use a wrapper class that implements for example these methods:
double& operator(int x, int y);
double const& operator(int x, int y) const;
... and access your objects like this:
arr(2, 3) = 5;
Alternatively, if you can bear a little more code complexity in the wrapper class(es), you can implement a class that can be accessed with the more traditional arr[2][3] = 5; syntax. This is implemented in a dimension-agnostic way in the Boost.MultiArray library, but you can do your own simple implementation too, using a proxy class.
Note: Considering your usage of C style (a hardcoded non-generic "double" type, plain pointers, function-beginning variable declarations, and malloc), you will probably need to get more into C++ constructs before you can implement either of the options I mentioned.
The two methods are quite different.
While the first method allows for easier direct access to the values by adding another indirection (the double** array, hence you need 1+N mallocs), ...
the second method guarantees that ALL values are stored contiguously and only requires one malloc.
I would argue that the second method is always superior. Malloc is an expensive operation and contiguous memory is a huge plus, depending on the application.
In C++, you'd just implement it like this:
std::vector<double> matrix(NumRows * NumCols);
matrix[y * numCols + x] = value; // Access
and if you're concerned with the inconvenience of having to compute the index yourself, add a wrapper that implements operator(int x, int y) to it.
You are also right that the first method is more expensive when accessing the values. Because you need two memory lookups as you described x[m] and then x[m][n]. There is no way the compiler will "optimize this away". The first array, depending on its size, will be cached, and the performance hit may not be that bad. In the second case, you need an extra multiplication for direct access.
In the first method you use, the double* in the master array point to logical columns (arrays of size NumCol).
So, if you write something like below, you get the benefits of data locality in some sense (pseudocode):
foreach(row in rows):
foreach(elem in row):
//Do something
If you tried the same thing with the second method, and if element access was done the way you specified (i.e. x[NumCol*m + n]), you still get the same benefit. This is because you treat the array to be in row-major order. If you tried the same pseudocode while accessing the elements in column-major order, I assume you'd get cache misses given that the array size is large enough.
In addition to this, the second method has the additional desirable property of being a single contiguous block of memory which further improves the performance even when you loop through multiple rows (unlike the first method).
So, in conclusion, the second method should be much better in terms of performance.
If NumCol is a compile-time constant, or if you are using GCC with language extensions enabled, then you can do:
double (*x)[NumCol] = (double (*)[NumCol]) malloc(NumRows * sizeof (double[NumCol]));
and then use x as a 2D array and the compiler will do the indexing arithmetic for you. The caveat is that unless NumCol is a compile-time constant, ISO C++ won't let you do this, and if you use GCC language extensions you won't be able to port your code to another compiler.

Optimizing a quadruple nested "for" loop

I'm developing a 2D numerical model in c++, and I would like to speed up a specific member function that is slowing down my code. The function is required to loop over every i,j grid point in the model and then perform a double summation at every grid point over l and m. The function is as follows:
int Class::Function(void) {
double loadingEta;
int i,j,l,m;
//etaLatLen=64, etaLonLen=2*64
//l_max = 12
for (i=0; i<etaLatLen; i++) {
for (j=0; j < etaLonLen; j++) {
loadingEta = 0.0;
for (l=0; l<l_max+1; l++) {
for (m=0; m<=l; m++) {
loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);
}
}
etaNewArray[i][j] = loadingEta;
}
}
return 1;
}
I've been trying to change the loop order to speed things up, but to no avail. Any help would be much appreciated. Thank you!
EDIT 1:
All five arrays are allocated in the constructor of my class as follows:
etaLegendreArray = new double**[etaLatLen];
for (int i=0; i<etaLatLen; i++) {
etaLegendreArray[i] = new double*[l_max+1];
for (int l=0; l<l_max+1; l++) {
etaLegendreArray[i][l] = new double[l_max+1];
}
}
SH_C = new double*[l_max+1];
SH_S = new double*[l_max+1];
for (int i=0; i<l_max+1; i++) {
SH_C[i] = new double[l_max+1];
SH_S[i] = new double[l_max+1];
}
etaCosMLon = new double*[etaLonLen];
etaSinMLon = new double*[etaLonLen];
for (int j=0; j<etaLonLen; j++) {
etaCosMLon[j] = new double[l_max+1];
etaSinMLon[j] = new double[l_max+1];
}
Perhaps it would be better if these were 1D arrays instead of multidimensional?
Hopping off into X-Y territory here. Rather than speeding up the algorithm, let's try and speed up data access.
etaLegendreArray = new double**[etaLatLen];
for (int i=0; i<etaLatLen; i++) {
etaLegendreArray[i] = new double*[l_max+1];
for (int l=0; l<l_max+1; l++) {
etaLegendreArray[i][l] = new double[l_max+1];
}
}
Doesn't create a 3D array of doubles. It creates an array of pointers to arrays of pointers to arrays of doubles. Each array is its own block of memory and who knows where it's going to sit in storage. This results in a data structure that has what is called "poor spacial locality." All of the pieces of the structure may be scattered all over the place. In the 3D array you are hopping to three different places just to find out where your value is.
Because the many blocks of storage required to simulate the 3D array may be nowhere near each other, the CPU may not be able to effectively load the cache (high-speed memory) ahead of time and has to stop the useful work it's doing and wait to access slower storage, probably RAM much more frequently. Here is a good, high-level article on how much this can hurt performance.
On the other hand, if the whole array is in one block of memory, is "contiguous", the CPU can read larger chunks of the memory, maybe all of it, it needs into cache all at once. Plus if the compiler knows the memory the program will use is all in one big block it can perform all sorts of groovy optimizations that will make your program even faster.
So how do we get a 3D array that's all one memory block? If the sizes are static, this is easy
double etaLegendreArray[SIZE1][SIZE2][SIZE3];
This doesn't look to be your case, so what you want to do is allocate a 1D array, because it will be one contiguous block of memory.
double * etaLegendreArray= new double [SIZE1*SIZE2*SIZE3];
and do the array indexing math by hand
etaLegendreArray[(x * SIZE2 + y) * SIZE3 + z] = data;
Looks like that ought to be slower with all the extra math, huh? Turns out the compiler is hiding math that looks a lot like that from you every time you use a []. You lose almost nothing, and certainly not as much as you lose with one unnecessary cache miss.
But it is insane to repeat that math all over the place, sooner or later you will screw up even if the drain on readability doesn't have you wishing for death first, so you really want to wrap the 1D array in a class to helper handle the math for you. And once you do that, you might as well have that class handle the allocation and deallocation so you can take advantage of all that RAII goodness. No more for loops of news and deletes all over the place. It's all wrapped up and tied with a bow.
Here is an example of a 2D Matrix class easily extendable to 3D. that will take care of the basic functionality you probably need in a nice predictable, and cache-friendly manner.
If the CPU supports it and the compiler is optimizing enough, you might get some small gain out of the C99 fma (fused multiply-add) function, to convert some of your two step operations (multiply, then add) to one step operations. It would also improve accuracy, since you only suffer floating point rounding once for fused operation, not once for multiplication and once for addition.
Assuming I'm reading it right, you could change your innermost loop's expression from:
loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);
to (note no use of += now, it's incorporated in fma):
loadingEta = fma(etaLegendreArray[i][l][m], fma(SH_C[l][m], etaCosMLon[j][m], SH_S[l][m]*etaSinMLon[j][m]), loadingEta);
I wouldn't expect anything magical performance-wise, but it might help a little (again, only with optimizations turned up enough for the compiler to inline hardware instructions to do the work; if it's calling a library function, you'll lose any improvements to the function call overhead). And again, it should improve accuracy a bit, by avoiding two rounding steps you were incurring.
Mind you, on some compilers with appropriate compilation flags, they'll convert your original code to hardware FMA instructions for you; if that's an option, I'd go with that, since (as you can see) the fma function tends to reduce code readability.
Your compiler may offer vectorized versions of floating point instructions as well, which might meaningfully improve performance (see previous link on automatic conversion to FMA).
Most other improvements would require more information about the goal, the nature of the input arrays being used, etc. Simple threading might gain you something, OpenMP pragmas might be something to look at as a way to simplify parallelizing the loop(s).

How not to mess up the cache when working on some long vectors in memory?

Premise
I want to do some kind of computation involving k long vectors of data (each of length n), which I receive in main memory, and write some kind of result back to main memory. For simplicity suppose also that the computation is merely
for(i = 0; i < n; i++)
v_out[i] = foo(v_1[i],v_2[i], ... ,v_k[i])
or perhaps
for(i = 0; i < n; i++)
v_out[i] = bar_k(...bar_2(bar_1(v_1[i]),v_2[i]), ... ),v_k[i])
(this is not code, it's pseudocode.) foo() and bar_i() functions have no side-effects. k is constant (known at compile-time), n is known only right before this computation occurs (and it is relatively big - at least a few times larger than the entire L2 cache size and maybe more).
Suppose I am on an single thread on a single core of an x86_64 processor (Intel, or AMD, or what-have-you; the choice does matter, probably). Finally, suppose foo() (resp. bar_i()) is not an intensive computation, i.e. the time to read data from memory and write it back is significant or even dominant relative to the n (resp. kxn) invocations of foo() (resp. bar_i()).
Question
How do I arrange this computation, so as to avoid:
The data from one input vector clearing out cached data for another vector.
Input vector data clearing out cached output vector data.
Intermediary results of bar_j(...bar_1(v_1[i])...) remaining in registers or L1 cache if there's enough capacity to hold them there until the data for v_{j+1}[i] ... v_k[i] arrives and allows us to complete the computation. Same for L2.
The L1 cache lines of the output vectors being cleared while we intend to continue working on elements in that cache line. Same for L2.
Under-utilization of memory bandwidth.
core idle time, to the extent possible.
Write-read-write-read-write sequences on v_out (which might be very expensive if those writes need to be updated into main memory; the motivation here is that it might be tempting to read just one vector, update the output and repeat).
Notes:
Any rearrangement of the input data counts towards the total computation time. The vectors will not be re-used in the alternate arrangement so it's basically a waste of time.
If it makes things easier for you to assume alignment, or lack of alignment, that's fine, just say so.
Computing with the bar_i functions allows more flexibility with the access patters, but creates additional challenges w.r.t. the caching of v_out values.
The data from one input vector clearing out cached data for another
vector.
If you are using v_1[i], v_2[i], ..., v_k[i] in the same function call, the input vectors will not clear out data cached for other vectors. For each element you read in a vector, the CPU will fetch just a cache line, not the entire vector. So if you read k elements, you will bring k cache lines from each vector.
Input vector data clearing out cached output vector data.
You have the same case as the above. That won't be the case.
Output vector data remaining in registers or L1 cache for as long as
we need it (in the bar case).
You could try to use _mm_prefetch intrinsics to fetch the data before writing it.
Under-utilization of memory bandwidth.
To do that you need to maximize the number of full width transactions. Basically you need that when the CPU fetches a cache line, all the elements are used right away. To do this you must rearrange your data. I would consider all the k vectors as a matrix of k x n elements, stored in a column major format.
type* pMat = (type*)aligned_alloc(CACHE_LINE_SIZE, n * k * sizeof(type));
v_0[i] = pMat[i * k + 0];
v_1[i] = pMat[i * k + 1];
// ...
v_k-1[i] = pMat[i * k + k-1];
That will put the v_0, ... v_k elements in SIMD registers and you might have the chance of better vectorization.
core idle time, to the extent possible.
Less cache misses, less transcedental instructions will lead to less idle time.
Write-read-write-read-write sequences on v_out (which might be very
expensive if those writes need to be updated into main memory; the
motivation here is that it might be tempting to read just one vector,
update the output and repeat).
You could decrease the price of the sequences using prefetching (_mm_prefetch).
To reduce the clearing out of cached data, you could rearrange your k vectors to 1 vector holding a structure with k members. That way, the loop will access those elments in sequence and not jump around in memory.
struct VectorData
{
Type1 Var1;
Type2 Var2;
// ...
TypeK VarK;
};
std::vector<VectorData> v_in;
for (i = 0; i < n; i++){
v_out[i] = foo(v_in[i].Var1, v_in[i].Var2, ... , v_in[i].VarK);
// Or just pass the whole element:
v_out[i] = foo(v_in[i]);
}

If I want maximum speed, should I just use an array over a std::vector? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I am writing some code that needs to be as fast as possible without sucking up all of my research time (in other words, no hand optimized assembly).
My systems primarily consist of a bunch of 3D points (atomic systems) and so the code I write does lots of distance comparisons, nearest-neighbor searches, and other types of sorting and comparisons. These are large, million or billion point systems, and the naive O(n^2) nested for loops just won't cut it.
It would be easiest for me to just use a std::vector to hold point coordinates. And at first I thought it will probably be about as fast an array, so that's great! However, this question (Is std::vector so much slower than plain arrays?) has left me with a very uneasy feeling. I don't have time to write all of my code using both arrays and vectors and benchmark them, so I need to make a good decision right now.
I am sure that someone who knows the detailed implementation behind std::vector could use those functions with very little speed penalty. However, I primarily program in C, and so I have no clue what std::vector is doing behind the scenes, and I have no clue if push_back is going to perform some new memory allocation every time I call it, or what other "traps" I could fall into that make my code very slow.
An array is simple though; I know exactly when memory is being allocated, what the order of all my algorithms will be, etc. There are no blackbox unknowns that I may have to suffer through. Yet so often I see people criticized for using arrays over vectors on the internet that I can't but help wonder if I am missing some more information.
EDIT: To clarify, someone asked "Why would you be manipulating such large datasets with arrays or vectors"? Well, ultimately, everything is stored in memory, so you need to pick some bottom layer of abstraction. For instance, I use kd-trees to hold the 3D points, but even so, the kd-tree needs to be built off an array or vector.
Also, I'm not implying that compilers cannot optimize (I know the best compilers can outperform humans in many cases), but simply that they cannot optimize better than what their constraints allow, and I may be unintentionally introducing constraints simply due to my ignorance of the implementation of vectors.
all depends on this how you implement your algorithms. std::vector is such general container concept that gives us flexibility but leaves us with freedom and responsibility of structuring implementation of algorithm deliberately. Most of the efficiency overhead that we will observe from std::vector comes from copying. std::vector provides a constructor which lets you initialize N elements with value X, and when you use that, the vector is just as fast as an array.
I did a tests std::vector vs. array described here,
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>
class TestTimer
{
public:
TestTimer(const std::string & name) : name(name),
start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
{
}
~TestTimer()
{
using namespace std;
using namespace boost;
posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
posix_time::time_duration d = now - start;
cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
" seconds" << endl;
}
private:
std::string name;
boost::posix_time::ptime start;
};
struct Pixel
{
Pixel()
{
}
Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
{
}
unsigned char r, g, b;
};
void UseVector()
{
TestTimer t("UseVector");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels;
pixels.resize(dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
}
void UseVectorPushBack()
{
TestTimer t("UseVectorPushBack");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels;
pixels.reserve(dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
pixels.push_back(Pixel(255, 0, 0));
}
}
void UseArray()
{
TestTimer t("UseArray");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);
for(int i = 0 ; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
free(pixels);
}
}
void UseVectorCtor()
{
TestTimer t("UseConstructor");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
}
}
int main()
{
TestTimer t1("The whole thing");
UseArray();
UseVector();
UseVectorCtor();
UseVectorPushBack();
return 0;
}
and here are results (compiled on Ubuntu amd64 with g++ -O3):
UseArray completed in 0.325 seconds
UseVector completed in 1.23 seconds
UseConstructor completed in 0.866 seconds
UseVectorPushBack completed in 8.987 seconds
The whole thing completed in 11.411 seconds
clearly push_back wasn't good choice here, using constructor is still 2 times slower than array.
Now, providing Pixel with empty copy Ctor:
Pixel(const Pixel&) {}
gives us following results:
UseArray completed in 0.331 seconds
UseVector completed in 0.306 seconds
UseConstructor completed in 0 seconds
UseVectorPushBack completed in 2.714 seconds
The whole thing completed in 3.352 seconds
So in summary: re-think your algorithm, otherwise, perhaps resort to a custom wrapper around New[]/Delete[]. In any case, the STL implementation isn't slower for some unknown reason, it just does exactly what you ask; hoping you know better.
In the case when you just started with vectors it might be surprising how they behave, for example this code:
class U{
int i_;
public:
U(){}
U(int i) : i_(i) {cout << "consting " << i_ << endl;}
U(const U& ot) : i_(ot.i_) {cout << "copying " << i_ << endl;}
};
int main(int argc, char** argv)
{
std::vector<U> arr(2,U(3));
arr.resize(4);
return 0;
}
results with:
consting 3
copying 3
copying 3
copying 548789016
copying 548789016
copying 3
copying 3
Vectors guarantee that the underlying data is a contiguous block in memory. The only sane way to guarantee this is by implementing it as an array.
Memory reallocation on pushing new elements can happen, because the vector can't know in advance how many elements you are going to add to it. But when you know it in advance, you can call reserve with the appropriate number of entries to avoid reallocation when adding them.
Vectors are usually preferred over arrays because they allow bound-checking when accessing elements with .at(). That means accessing indices outside of the vector doesn't cause undefined behavior like in an array. This bound-checking does however require additional CPU cycles. When you use the []-operator to access elements, no bound-checking is done and access should be as fast as an array. This however risks undefined behavior when your code is buggy.
People who invented STL, and then made it into the C++ standard library, are expletive deleted smart. Don't even let yourself imagine for one little moment you can outperform them because of your superior knowledge of legacy C arrays. (You would have a chance if you knew some Fortran though).
With std::vector, you can allocate all memory in one go, just like with C arrays. You can also allocate incrementally, again just like with C arrays. You can control when each allocation happens, just like with C arrays. Unlike with C arrays, you can also forget about it all and let the system manage the allocations for you, if that's what you want. This is all absolutely necessary, basic functionality. I'm not sure why anyone would assume it is missing.
Having said all that, go with arrays if you find them easier to understand.
I am not really advising you to go either for arrays or vectors, because I think that for your needs they may not be totally fit.
You need to be able to organize your data efficiently, so that queries would not need to scan the whole memory range to get the relevant data. So you want to group the points which are more likely to be selected together close to each other.
If your dataset is static, then you can do that sorting offline, and make your array nice and tidy to be loaded up in memory at your application start up time, and either vector or array would work (provided you do the reserve call up front for vector, since the default allocation growth scheme double up the size of the underlying array whenever it gets full, and you wouldn't want to use up 16Gb of memory for only 9Gb worth of data).
But if your dataset is dynamic, it will be difficult to do efficient inserts in your set with a vector or an array. Recall that each insert within the array would create a shift of all the successor elements of one place. Of course, an index, like the kd tree you mention, will help by avoiding a full scan of the array, but if the selected points are scattered accross the array, the effect on memory and cache will essentially be the same. The shift would also mean that the index needs to be updated.
My solution would be to cut the array in pages (either list linked or array indexed) and store data in the pages. That way, it would be possible to group relevant elements together, while still retaining the speed of contiguous memory access within pages. The index would then refer to a page and an offset in that page. Pages wouldn't be filled automatically, which leaves rooms to insert related elements, or make shifts really cheap operations.
Note that if pages are always full (excepted for the last one), you still have to shift every single one of them in case of an insert, while if you allow incomplete pages, you can limit a shift to a single page, and if that page is full, insert a new page right after it to contain the suplementary element.
Some things to keep in mind:
array and vector allocation have upper limits, which is OS dependent (these limits might be different)
On my 32bits system, the maximum allowed allocation for a vector of 3D points is at around 180 millions entries, so for larger datasets, on would have to find a different solution. Granted, on 64bits OS, that amount might be significantly larger (On windows 32bits, the maximum memory space for a process is 2Gb - I think they added some tricks on more advanced versions of the OS to extend that amount). Admittedly memory will be even more problematic for solutions like mine.
resizing of a vector requires allocating the new size of the heap, copy the elements from the old memory chunck to the new one.
So for adding just one element to the sequence, you will need twice the memory during the resizing. This issue may not come up with plain arrays, which can be reallocated using the ad hoc OS memory functions (realloc on unices for instance, but as far as I know that function doesn't make any guarantee that the same memory chunck will be reused). The problem might be avoided in vector as well if a custom allocator which would use the same functions is used.
C++ doesn't make any assumption about the underlying memory architecture.
vectors and arrays are meant to represent contiguous memory chunks provided by an allocator, and wrap that memory chunk with an interface to access it. But C++ doesn't know how the OS is managing that memory. In most modern OS, that memory is actually cut in pages, which are mapped in and out of physical memory. So my solution is somehow to reproduce that mechanism at the process level. In order to make the paging efficient, it is necessary to have our page fit the OS page, so a bit of OS dependent code will be necessary. On the other hand, this is not a concern at all for a vector or array based solution.
So in essence my answer is concerned by the efficiency of updating the dataset in a manner which will favor clustering points close to each others. It supposes that such clustering is possible. If not the case, then just pushing a new point at the end of the dataset would be perfectly alright.
Although I do not know the exact implementation of std:vector, most list systems like this are slower than arrays as they allocate memory when they are resized, normally double the current capacity although this is not always the case.
So if the vector contains 16 items and you added another, it needs memory for another 16 items. As vectors are contiguous in memory, this means that it will allocate a solid block of memory for 32 items and update the vector. You can get some performance improvements by constructing the std:vector with an initial capacity that is roughly the size you think your data set will be, although this isn't always an easy number to arrive at.
For operation that are common between vectors and arrays (hence not push_back or pop_back, since array are fixed in size) they perform exactly the same, because -by specification- they are the same.
vector access methods are so trivial that the simpler compiler optimization will wipe them out.
If you know in advance the size of a vector, just construct it by specifyinfg the size or just call resize, and you will get the same you can get with a new [].
If you don't know the size, but you know how much you will need to grow, just call reserve, and you get no penality on push_back, since all the required memory is already allocated.
In any case, relocation are not so "dumb": the capacity and the size of a vector are two distinct things, and the capacity is typically doubled upon exhaustion, so that relocation of big amounts are less and less frequent.
Also, in case you know everything about sizes, and you need no dynamic memory and want the same vector interface, consider also std::array.
Sounds like you need gigs of RAM so you're not paging. I tend to go along with #Philipp's answer, because you really really want to make sure it's not re-allocating under the hood
but
what's this about a tree that needs rebalancing?
and you're even thinking about compiler optimization?
Please take a crash course in how to optimize software.
I'm sure you know all about Big-O, but I bet you're used to ignoring the constant factors, right? They might be out of whack by 2 to 3 orders of magnitude, doing things you never would have thought costly.
If that translates to days of compute time, maybe it'll get interesting.
And no compiler optimizer can fix these things for you.
If you're academically inclined, this post goes into more detail.

C++ heap memory performance improvement

I'm writing a function where I need a significant amount of heap memory. Is it possible to tell the compiler that those data will be accessed frequently within a specific for loop, so as to improve performance (through compile options or similar)?
The reason I cannot use the stack is that the number of elements I need to store is big, and I get segmentation fault if I try to do it.
Right now the code is working but I think it could be faster.
UPDATE:
I'm doing something like this
vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
for(uint j = i+1; j < node_vec.size(); j++)
// some computation, basic math, store the result in variable x
if( x > threshold ) {
vec[i].insert(j);
vec[j].insert(i);
}
some details:
- I used hash_set, little improvement, beside the fact that hash_set is not available in all machines I have for simulation purposes
- I tried to allocate vec on the stack using arrays but, as I said, I might get segmentation fault if the number of elements is too big
If node_vec.size() is, say, equal to k, where k is of the order of a few thousands, I expect vec to be 4 or 5 times bigger than node_vec. With this order of magnitude the code appears to be slow, considering the fact that I have to run it many times. Of course, I am using multithreading to parallelize these calls, but I can't get the function per se to run much faster than what I'm seeing right now.
Would it be possible, for example, to have vec allocated in the cache memory for fast data retrieval, or something similar?
I'm writing a function where I need a significant amount of heap memory ... will be accessed frequently within a specific for loop
This isn't something you can really optimize at a compiler level. I think your concern is that you have a lot of memory that may be "stale" (paged out) but at a particular point in time you will need to iterate over all of it, maybe several times and you don't want the memory pages to be paged out to disk.
You will need to investigate strategies that are platform specific to improve performance. Keeping the pages in memory can be achieved with mlockall or VirtualLock but you really shouldn't need to do this. Make sure you know what the implications of locking your application's memory pages into RAM is, however. You're hogging memory from other processes.
You might also want to investigate a low fragmentation heap (however it may not be relevant at all to this problem) and this page which describes cache lines with respect to for loops.
The latter page is about the nitty-gritty of how CPUs work (a detail you normally shouldn't have to be concerned with) with respect to memory access.
Example 1: Memory accesses and performance
How much faster do you expect Loop 2 to run, compared Loop 1?
int[] arr = new int[64 * 1024 * 1024];
// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16-th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine.
UPDATE
vector< set<uint> > vec(node_vec.size());
for(uint i = 0; i < node_vec.size(); i++)
for(uint j = i+1; j < node_vec.size(); j++)
// some computation, basic math, store the result in variable x
if( x > threshold ) {
vec[i].insert(j);
vec[j].insert(i);
}
That still doesn't show much, because we cannot know how often the condition x > threshold will be true. If x > threshold is very frequently true, then the std::set might be the bottleneck, because it has to do a dynamic memory allocation for every uint you insert.
Also we don't know what "some computation" actually means/does/is. If it does much, or does it in the wrong way that could be the bottleneck.
And we don't know how you need to access the result.
Anyway, on a hunch:
vector<pair<int, int> > vec1;
vector<pair<int, int> > vec2;
for (uint i = 0; i < node_vec.size(); i++)
{
for (uint j = i+1; j < node_vec.size(); j++)
{
// some computation, basic math, store the result in variable x
if (x > threshold)
{
vec1.push_back(make_pair(i, j));
vec2.push_back(make_pair(j, i));
}
}
}
If you can use the result in that form, you're done. Otherwise you could do some post-processing. Just don't copy it into a std::set again (obviously). Try to stick to std::vector<POD>. E.g. you could build an index into the vectors like this:
// ...
vector<int> index1 = build_index(node_vec.size(), vec1);
vector<int> index2 = build_index(node_vec.size(), vec2);
// ...
}
vector<int> build_index(size_t count, vector<pair<int, int> > const& vec)
{
vector<int> index(count, -1);
size_t i = vec.size();
do
{
i--;
assert(vec[i].first >= 0);
assert(vec[i].first < count);
index[vec[i].first] = i;
}
while (i != 0);
return index;
}
ps.: I'm almost sure your loop is not memory-bound. Can't be sure though... if the "nodes" you're not showing us are really big it might still be.
Original answer:
There is no easy I_will_access_this_frequently_so_make_it_fast(void* ptr, size_t len)-kind-of solution.
You can do some things though.
Make sure the compiler can "see" the implementation of every function that's called inside critical loops. What is necessary for the compiler to be able to "see" the implementation depends on the compiler. There is one way to be sure though: define all relevant functions in the same translation unit before the loop, and declare them as inline.
This also means you should not by any means call "external" functions in those critical loops. And by "external" functions I mean things like system-calls, runtime-library stuff or stuff implemented in a DLL/SO. Also don't call virtual functions and don't use function pointers. And or course don't allocate or free memory (inside the critical loops).
Make sure you use an optimal algorithm. Linear optimization is moot if the complexity of the algorithm is higher than necessary.
Use the smallest possible types. E.g. don't use int if signed char will do the job. That's something I wouldn't normally recommend, but when processing a large chunk of memory it can increase performance quite a lot. Especially in very tight loops.
If you're just copying or filling memory, use memcpy or memset. Disable the intrinsic version of those two functions if the chunks are larger then about 50 to 100 bytes.
Make sure you access the data in a cache-friendly manner. The optimum is "streaming" - i.e. accessing the memory with ascending or descending addresses. You can "jump" ahead some bytes at a time, but don't jump too far. The worst is random access to a big block of memory. E.g. if you have to work on a 2 dimensional matrix (like a bitmap image) where p[0] to p[1] is a step "to the right" (x + 1), make sure the inner loop increments x and the outer increments y. If you do it the other way around performance will be much much worse.
If your pointers are alias-free, you can tell the compiler (how that's done depends on the compiler). If you don't know what alias-free means I recommend searching the net and your compiler's documentation, since an explanation would be beyond the scope.
Use intrinsic SIMD instructions if appropriate.
Use explicit prefetch instructions if you know which memory locations will be needed in the near future.
You can't do that with compiler options. Depending on your usage (insertion, random-access, deleting, sorting, etc.), you could maybe get a better suited container.
The compiler can already see that the data is accessed frequently within the loop.
Assuming you're only allocating the data from the heap once before doing the looping, note, as #lvella, that memory is memory and if it's accessed frequently it should be effectively cached during execution.