A long time ago, inspired by "Numerical Recipes in C", I started using the following construct for storing matrices (2D arrays).
double **allocate_matrix(int NumRows, int NumCol)
{
    double **x;
    int i;
    x = (double **)malloc(NumRows * sizeof(double *));   /* one allocation for the row pointers */
    for (i = 0; i < NumRows; ++i)
        x[i] = (double *)calloc(NumCol, sizeof(double)); /* one allocation per row */
    return x;
}
double **x = allocate_matrix(1000,2000);
x[m][n] = ...;
But recently I noticed that many people implement matrices as follows:
double *x = (double *)malloc(NumRows * NumCols * sizeof(double));
x[NumCols * m + n] = ...;
From the locality point of view the second method seems perfect, but it has awful readability... So I started to wonder: is my first method, with its auxiliary array of double pointers, really bad, or will the compiler eventually optimize it to be more or less equivalent in performance to the second method? I am suspicious because I think the first method makes two jumps for each access, x[m] and then x[m][n], and there is a chance that each time the CPU will first load the x array and then the x[m] array.
P.S. Do not worry about the extra memory for storing the double pointers; for large matrices it is just a small percentage.
P.P.S. Since many people did not understand my question very well, I will try to rephrase it: do I understand correctly that the first method is a kind of locality hell, where each time x[m][n] is accessed the x array is first loaded into the CPU cache and then the x[m] array, making each access run at the speed of talking to RAM? Or am I wrong, and the first method is also OK from a data-locality point of view?
For C-style allocations you can actually have the best of both worlds:
double **allocate_matrix(int NumRows, int NumCol)
{
    double **x;
    int i;
    x = (double **)malloc(NumRows * sizeof(double *));
    x[0] = (double *)calloc(NumRows * NumCol, sizeof(double)); /* <<< single contiguous allocation for the entire array */
    for (i = 1; i < NumRows; ++i) x[i] = x[i - 1] + NumCol;
    return x;
}
This way you get data locality and its associated cache/memory-access benefits, and you can treat the array as a double ** or as a flattened 2D array (array[i * NumCols + j]) interchangeably. You also have fewer allocation/free calls (2 versus NumRows + 1).
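The matching cleanup is correspondingly simple. A minimal sketch (assuming <stdlib.h> is included and both allocations succeeded):

void free_matrix(double **x)
{
    free(x[0]); /* releases the single contiguous block holding all the elements */
    free(x);    /* releases the array of row pointers */
}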
No need to guess whether the compiler will optimize the first method. Just use the second method, which you know is fast, and use a wrapper class that implements, for example, these methods:
double& operator()(int x, int y);
double const& operator()(int x, int y) const;
... and access your objects like this:
arr(2, 3) = 5;
Alternatively, if you can bear a little more code complexity in the wrapper class(es), you can implement a class that can be accessed with the more traditional arr[2][3] = 5; syntax. This is implemented in a dimension-agnostic way in the Boost.MultiArray library, but you can do your own simple implementation too, using a proxy class.
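For illustration, here is a minimal sketch of that proxy idea (class and member names are made up; bounds checking and const overloads are omitted):

#include <cstddef>
#include <vector>

class Matrix {
    std::vector<double> data_; // flat, contiguous storage
    std::size_t cols_;
public:
    Matrix(std::size_t rows, std::size_t cols) : data_(rows * cols), cols_(cols) {}

    // Lightweight proxy representing one row; its operator[] finishes the lookup.
    class Row {
        double *row_;
    public:
        explicit Row(double *row) : row_(row) {}
        double &operator[](std::size_t j) { return row_[j]; }
    };

    // arr[i][j] first calls this, then Row::operator[].
    Row operator[](std::size_t i) { return Row(&data_[i * cols_]); }
};

With this, arr[2][3] = 5; works while the storage stays one contiguous block.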
Note: Considering your usage of C style (a hardcoded non-generic "double" type, plain pointers, function-beginning variable declarations, and malloc), you will probably need to get more into C++ constructs before you can implement either of the options I mentioned.
The two methods are quite different.
While the first method gives you convenient element access at the cost of another indirection (the double ** array, hence you need 1+N mallocs), ...
the second method guarantees that ALL values are stored contiguously and only requires one malloc.
I would argue that the second method is always superior. Malloc is an expensive operation and contiguous memory is a huge plus, depending on the application.
In C++, you'd just implement it like this:
std::vector<double> matrix(NumRows * NumCols);
matrix[y * NumCols + x] = value; // access
and if you're concerned about the inconvenience of having to compute the index yourself, add a wrapper that implements operator()(int x, int y), as sketched below.
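Such a wrapper could look like this (a sketch; the names are illustrative):

#include <cstddef>
#include <vector>

class Matrix2D {
    std::vector<double> data_; // one contiguous allocation
    std::size_t numCols_;
public:
    Matrix2D(std::size_t numRows, std::size_t numCols)
        : data_(numRows * numCols), numCols_(numCols) {}

    // operator()(x, y) hides the y * numCols + x index math.
    double &operator()(std::size_t x, std::size_t y) { return data_[y * numCols_ + x]; }
    const double &operator()(std::size_t x, std::size_t y) const { return data_[y * numCols_ + x]; }
};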
You are also right that the first method is more expensive when accessing the values, because you need two memory lookups, as you described: x[m] and then x[m][n]. There is no way the compiler will "optimize this away". The first array, depending on its size, will be cached, and the performance hit may not be that bad. In the second case, you need an extra multiplication for direct access.
In the first method you use, the double * entries in the master array point to logical rows (arrays of size NumCol).
So, if you write something like below, you get the benefits of data locality in some sense (pseudocode):
foreach (row in rows):
    foreach (elem in row):
        // Do something
If you tried the same thing with the second method, with element access done the way you specified (i.e. x[NumCol*m + n]), you would still get the same benefit, because you are treating the array as row-major. If you instead traversed the elements in column-major order with the same pseudocode, I assume you'd get cache misses, given a large enough array.
In addition to this, the second method has the additional desirable property of being a single contiguous block of memory which further improves the performance even when you loop through multiple rows (unlike the first method).
So, in conclusion, the second method should be much better in terms of performance.
If NumCol is a compile-time constant, or if you are using GCC with language extensions enabled, then you can do:
double (*x)[NumCol] = (double (*)[NumCol]) malloc(NumRows * sizeof (double[NumCol]));
and then use x as a 2D array and the compiler will do the indexing arithmetic for you. The caveat is that unless NumCol is a compile-time constant, ISO C++ won't let you do this, and if you use GCC language extensions you won't be able to port your code to another compiler.
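Usage then looks like ordinary 2D indexing, and the whole matrix is released with a single call (a sketch, assuming the allocation succeeded):

for (int i = 0; i < NumRows; ++i)
    for (int j = 0; j < NumCol; ++j)
        x[i][j] = 0.0; /* the compiler computes i * NumCol + j for you */
free(x);               /* one allocation, one free */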
I am using some legacy C code that passes around lots of raw pointers. To interface with the code, I have to provide a function of the form:
const int N = ...;

T * func(T * x) {
    // TODO Put N elements in x
    return x + N;
}
where this function should write the result into x, and then return x.
Internally, in this function, I am using Eigen extensively to perform some calculations. Then I write the result back to the raw pointer using the Map class. A simple example which mimics what I am doing is this:
const int N = 5;

T * func(T * x) {
    // Do a lot of operations that result in some matrices like
    Eigen::Matrix<T, N, 1> A = ...
    Eigen::Matrix<T, N, 1> B = ...

    Eigen::Map<Eigen::Matrix<T, N, 1>> constraint(x);
    constraint = A - B;
    return x + N;
}
Obviously, there is much more complicated stuff going on internally, but that is the gist of it... Do some calculations with Eigen, then use the Map class to write the result back to the raw pointer.
Now the problem is that when I profile this code with Callgrind, and then view the results with KCachegrind, the lines
constraint = A - B;
are almost always the bottleneck. This is somewhat understandable, because such a line is potentially doing three things:
Constructing the Map object
Performing the calculation
Writing the result to the pointer
So it is understandable that this line might have the longest runtime. But I am a little bit worried that perhaps I am somehow doing an extra copy in that line before the data gets written to the raw pointer.
So is there a better way of writing the result to the raw pointer? Or is that the idiom I should be using?
In the back of my mind, I am wondering if using the placement new syntax would buy me anything here.
Note: This code is mission critical and should run in realtime, so I really need to squeeze every ounce of speed out of it. For instance, getting this call from a runtime of 0.12 seconds to 0.1 seconds would be huge for us. But code legibility is also a huge concern since we are constantly tweaking the model used in the internal calculations.
These two lines of code:
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
are essentially compiled by Eigen as:
for (int i = 0; i < N; ++i)
    x[i] = A[i] - B[i];
The reality is a bit more complicated because of explicit unrolling, and explicit vectorization (both depends on T), but that's essentially it. So the construction of the Map object is essentially a no-op (it is optimized away by any compiler) and no, there is no extra copy going on here.
Actually, if your profiler is able to tell you that the bottleneck lies in this simple expression, then that very likely means that this piece of code has not been inlined, meaning that you did not enable compiler optimization flags (like -O3 with gcc/clang).
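For gcc or clang, a typical invocation would look something like the following (the file name is a placeholder; -DNDEBUG additionally disables Eigen's runtime assertions):

g++ -O3 -DNDEBUG -march=native my_code.cpp -o my_code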
I'm developing a 2D numerical model in C++, and I would like to speed up a specific member function that is slowing down my code. The function must loop over every (i, j) grid point in the model and, at every grid point, perform a double summation over l and m. The function is as follows:
int Class::Function(void) {
    double loadingEta;
    int i, j, l, m;

    // etaLatLen = 64, etaLonLen = 2*64
    // l_max = 12
    for (i = 0; i < etaLatLen; i++) {
        for (j = 0; j < etaLonLen; j++) {
            loadingEta = 0.0;
            for (l = 0; l < l_max + 1; l++) {
                for (m = 0; m <= l; m++) {
                    loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);
                }
            }
            etaNewArray[i][j] = loadingEta;
        }
    }
    return 1;
}
I've been trying to change the loop order to speed things up, but to no avail. Any help would be much appreciated. Thank you!
EDIT 1:
All five arrays are allocated in the constructor of my class as follows:
etaLegendreArray = new double**[etaLatLen];
for (int i=0; i<etaLatLen; i++) {
    etaLegendreArray[i] = new double*[l_max+1];
    for (int l=0; l<l_max+1; l++) {
        etaLegendreArray[i][l] = new double[l_max+1];
    }
}

SH_C = new double*[l_max+1];
SH_S = new double*[l_max+1];
for (int i=0; i<l_max+1; i++) {
    SH_C[i] = new double[l_max+1];
    SH_S[i] = new double[l_max+1];
}

etaCosMLon = new double*[etaLonLen];
etaSinMLon = new double*[etaLonLen];
for (int j=0; j<etaLonLen; j++) {
    etaCosMLon[j] = new double[l_max+1];
    etaSinMLon[j] = new double[l_max+1];
}
Perhaps it would be better if these were 1D arrays instead of multidimensional?
Hopping off into X-Y territory here. Rather than speeding up the algorithm, let's try and speed up data access.
etaLegendreArray = new double**[etaLatLen];
for (int i=0; i<etaLatLen; i++) {
    etaLegendreArray[i] = new double*[l_max+1];
    for (int l=0; l<l_max+1; l++) {
        etaLegendreArray[i][l] = new double[l_max+1];
    }
}
This doesn't create a 3D array of doubles. It creates an array of pointers to arrays of pointers to arrays of doubles. Each array is its own block of memory, and who knows where it is going to sit in storage. The result is a data structure with what is called "poor spatial locality": all of the pieces of the structure may be scattered all over the place, and in this simulated 3D array you are hopping to three different places just to find out where your value is.
Because the many blocks of storage required to simulate the 3D array may be nowhere near each other, the CPU may not be able to effectively load the cache (high-speed memory) ahead of time, and it has to stop the useful work it's doing and wait to access slower storage, probably RAM, much more frequently. Here is a good, high-level article on how much this can hurt performance.
On the other hand, if the whole array is in one block of memory, if it is "contiguous", the CPU can read larger chunks of the memory it needs, maybe all of it, into cache at once. Plus, if the compiler knows the memory the program will use is all in one big block, it can perform all sorts of groovy optimizations that will make your program even faster.
So how do we get a 3D array that's all one memory block? If the sizes are static, this is easy
double etaLegendreArray[SIZE1][SIZE2][SIZE3];
This doesn't look to be your case, so what you want to do is allocate a 1D array, because it will be one contiguous block of memory.
double * etaLegendreArray= new double [SIZE1*SIZE2*SIZE3];
and do the array indexing math by hand
etaLegendreArray[(x * SIZE2 + y) * SIZE3 + z] = data;
Looks like that ought to be slower with all the extra math, huh? Turns out the compiler is hiding math that looks a lot like that from you every time you use a []. You lose almost nothing, and certainly not as much as you lose with one unnecessary cache miss.
But it is insane to repeat that math all over the place; sooner or later you will screw up, even if the drain on readability doesn't have you wishing for death first. So you really want to wrap the 1D array in a class that handles the math for you. And once you do that, you might as well have that class handle the allocation and deallocation so you can take advantage of all that RAII goodness. No more for loops of news and deletes all over the place. It's all wrapped up and tied with a bow.
Here is an example of a 2D matrix class, easily extendable to 3D, that will take care of the basic functionality you probably need in a nice, predictable, cache-friendly manner.
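A bare-bones 3D version of that idea might look like this (a sketch; names are illustrative and bounds checking is omitted):

#include <cstddef>
#include <vector>

class Array3D {
    std::vector<double> data_; // one contiguous block; RAII handles allocation and cleanup
    std::size_t size2_, size3_;
public:
    Array3D(std::size_t s1, std::size_t s2, std::size_t s3)
        : data_(s1 * s2 * s3), size2_(s2), size3_(s3) {}

    // The same (x * SIZE2 + y) * SIZE3 + z math as above, hidden behind operator().
    double &operator()(std::size_t x, std::size_t y, std::size_t z) {
        return data_[(x * size2_ + y) * size3_ + z];
    }
};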
If the CPU supports it and the compiler is optimizing enough, you might get some small gain out of the C99 fma (fused multiply-add) function, which converts some of your two-step operations (multiply, then add) into one-step operations. It would also improve accuracy, since you only suffer floating point rounding once for the fused operation, not once for the multiplication and once for the addition.
Assuming I'm reading it right, you could change your innermost loop's expression from:
loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);
to (note no use of += now, it's incorporated in fma):
loadingEta = fma(etaLegendreArray[i][l][m], fma(SH_C[l][m], etaCosMLon[j][m], SH_S[l][m]*etaSinMLon[j][m]), loadingEta);
I wouldn't expect anything magical performance-wise, but it might help a little (again, only with optimizations turned up enough for the compiler to inline hardware instructions to do the work; if it's calling a library function, you'll lose any improvements to the function call overhead). And again, it should improve accuracy a bit, by avoiding two rounding steps you were incurring.
Mind you, on some compilers with appropriate compilation flags, they'll convert your original code to hardware FMA instructions for you; if that's an option, I'd go with that, since (as you can see) the fma function tends to reduce code readability.
Your compiler may offer vectorized versions of floating point instructions as well, which might meaningfully improve performance (see previous link on automatic conversion to FMA).
Most other improvements would require more information about the goal, the nature of the input arrays being used, etc. Simple threading might gain you something; OpenMP pragmas might be something to look at as a way to simplify parallelizing the loop(s). A sketch follows.
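Since every (i, j) output element is computed independently, the outer loops are a natural fit. A minimal sketch of the loop from the question with an OpenMP pragma added (compile with your compiler's OpenMP flag, e.g. -fopenmp; declaring the loop indices and loadingEta inside the loops keeps them private to each thread):

#pragma omp parallel for collapse(2)
for (int i = 0; i < etaLatLen; i++) {
    for (int j = 0; j < etaLonLen; j++) {
        double loadingEta = 0.0; // thread-private accumulator
        for (int l = 0; l < l_max + 1; l++) {
            for (int m = 0; m <= l; m++) {
                loadingEta += etaLegendreArray[i][l][m]
                            * (SH_C[l][m] * etaCosMLon[j][m] + SH_S[l][m] * etaSinMLon[j][m]);
            }
        }
        etaNewArray[i][j] = loadingEta;
    }
}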
I wonder if anyone could advise on storage of large (say 2000 x 2000 x 2000) 3D arrays for finite difference discretization computations. Does contiguous storage (float *) give better performance than float *** on modern CPU architectures?
Here is a simplified example of computations, which are done over entire arrays:
for i ...
    for j ...
        for k ...
            u[i][j][k] += v[i][j][k+1] + v[i][j][k-1]
                        + v[i][j+1][k] + v[i][j-1][k] + v[i+1][j][k] + v[i-1][j][k];
vs.
u[i * iStride + j * jStride + k] += ...
PS:
Considering the size of the problems, storing T *** is a very small overhead. Access is not random. Moreover, I do loop blocking to minimize cache misses. I am just wondering how triple dereferencing in the T *** case compares to index computation plus a single dereference in the 1D-array case.
These are not apples-to-apples comparisons: a flat array is just that, a flat array, which your code partitions into segments according to some logic of linearizing a rectangular 3D array. You access an element of a flat array with a single dereference, plus a handful of math operations.
float ***, on the other hand, lets you keep a "jagged" array of arrays of arrays, so the structure you can represent inside such an array is a lot more flexible. Naturally, you need to pay for that flexibility with the additional CPU cycles required for dereferencing a pointer to pointer to pointer, then a pointer to pointer, and finally a pointer (the three pairs of square brackets in the code).
Naturally, access to the individual elements of float*** is going to be a little slower, if you access them in truly random order. However, if the order is not random, the difference that you see may be tiny, because the values of pointers would be cached.
float*** will also require more memory, because you need to allocate two additional levels of pointers.
The short answer is: benchmark it. If the results are inconclusive, it means it doesn't matter. Do what makes your code most readable.
As @dasblinkenlight has pointed out, the structures are not equivalent because T *** can be jagged.
At the most fundamental level, though, this comes down to the arithmetic and memory access operations.
For your 1D array, as you have already (almost) written, the calculation is:
ptr = u + (i * iStride) + (j * jStride) + k
read *ptr
With T***:
ptr = u + i
x = read *ptr
ptr = x + j
y = read *ptr
ptr = y + k
read *ptr
So you are trading two multiplications for two memory accesses.
In computer go, where people are very performance-sensitive, everyone (AFAIK) uses T[361] rather than T[19][19] (*). This decision is based on benchmarking, both in isolation and the whole program. (It is possible everyone did those benchmarks years and years ago, and have never done them again on the latest hardware, but my hunch is that a single 1-D array will still be better.)
However your array is huge, in comparison. As the code involved in each case is easy to write, I would definitely try it both ways and benchmark.
*: Aside: I think it is actually T[21][21] vs. T[441] in most programs, as an extra row all around is added to speed up board-edge detection.
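If you do benchmark it, a skeleton like the following is enough to get started (illustrative only; repeat the timed section with a float *** or double ** layout and compare the printed times):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const int R = 2000, C = 2000;
    std::vector<double> flat((std::size_t)R * C, 1.0); // flat, contiguous layout

    auto t0 = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (int i = 0; i < R; ++i)
        for (int j = 0; j < C; ++j)
            sum += flat[(std::size_t)i * C + j]; // index math + one dereference
    auto t1 = std::chrono::steady_clock::now();

    std::printf("flat: %.6f s (sum=%g)\n",
                std::chrono::duration<double>(t1 - t0).count(), sum);
    return 0;
}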
One issue that has not been mentioned yet is aliasing.
Does your compiler support some type of keyword like restrict to indicate that you have no aliasing? (It's not part of C++11 so would have to be an extension.) If so, performance may be very close to the same. If not, there could be significant differences in some cases. The issue will be with something like:
for (int i = ...) {
    for (int j = ...) {
        a[j] = b[i];
    }
}
Can b[i] be loaded once per outer loop iteration and stored in a register for the entire inner loop? In the general case, only if the arrays don't overlap. How does the compiler know? It needs some type of restrict keyword.
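With GCC/Clang the extension is spelled __restrict__ (MSVC has __restrict). A sketch of the promise in action (the function name is made up):

// a and b are promised never to overlap.
void copy_to_all(double *__restrict__ a, const double *__restrict__ b, int n) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            // Thanks to the no-aliasing promise, the compiler may load b[i]
            // once per outer iteration and keep it in a register.
            a[j] = b[i];
        }
    }
}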
I'm developing a CT reconstruction algorithm using C++. I'm using C++ because I need to use a library written in C++ that will let me read/write a specific file format.
This reconstruction algorithm involves working with 3D and 2D images. I've written similar algorithms in C and MATLAB using arrays. However, I've read that, in C++, arrays are "evil" (see http://www.parashift.com/c++-faq-lite/containers.html). The way I use arrays to manipulate images (in C) is the following (this creates a 3D array that will be used as a 3D image):
int i, j;
int ***image; /* another way to make a 5x12x27 array */
image = (int ***) malloc(depth * sizeof(int **));
for (i = 0; i < depth; ++i) {
    image[i] = (int **) malloc(height * sizeof(int *));
    for (j = 0; j < height; ++j) {
        image[i][j] = (int *) malloc(width * sizeof(int));
    }
}
or I use 1-dimensional arrays and do index arithmetic to simulate 3D data. At the end, I free the necessary memory.
I have read that there are equivalent ways of doing this in C++. I've seen that I could create my own matrix class that uses vectors of vectors (from STL) or that I could use the boost-matrix library. The problem is that this makes my code look bloated.
My questions are:
1) Is there a reason to not use arrays for this purpose? Why should I use the more complicated data structures?
2) I don't think I'll use the advantages of containers (as seen in the C++ FAQ lite link I posted). Is there something I'm not seeing?
3) The C++ FAQ lite mentions that arrays will make me less productive. I don't really see how that applies to my case. What do you guys think?
Working at a higher level always saves you time given equal familiarity with both types of code. It's usually simpler and you might not need to bother with some tasks like deleting.
That said, if you already have the C code and are basically converting malloc to new (or leaving it as-is) then it makes perfect sense to leave it. No reason to duplicate work for no advantage. If you're going to be extending it and adding more features you might want to think about a rewrite. Image manipulation is often an intensive process and I see straight code like yours all the time for performance reasons.
Arrays have a purpose, vectors have a purpose, and so on. You seem to understand the tradeoffs so I won't go into that. Understanding the context of what you're doing is necessary; anyone who says that arrays are always bad or vectors are always too much overhead (etc.) probably doesn't know what they're talking about.
I know it looks difficult at first, and your code seems simple - but eventually yours is going to hurt.
Use a library like boost, or consider a custom 3D image toolkit like vtk
If the 3D canvas has a fixed size you won't win much by using containers. I would avoid allocating the matrix in small chunks as you do, though, and instead just do:
#define DIM_X 5
#define DIM_Y 12
#define DIM_Z 27
#define SIZE (DIM_X * DIM_Y * DIM_Z)
#define OFFS(x, y, z) (x + y * DIM_X + z * (DIM_Y * DIM_X))
and then
class Image3D {
private:
    unsigned int pixel_data[SIZE];

public:
    unsigned int & operator()(int x, int y, int z) { return pixel_data[OFFS(x, y, z)]; }
};
after which you can do e.g.
Image3D img;
img(1, 1, 1) = 10;
img(2, 2, 2) = img(1, 1, 1) + 2;
without having any memory allocation or algorithm overhead. But as some others have noted, the choice of the data structure also depends on what kind of algorithms you are planning to run on the images. You can always however adapt a third-party algorithm e.g. for matrix inversion with a proper facade class if needed; and this flat representation is much faster than the nested arrays of pointers you wrote.
If the dimensions are not fixed at compile time, you can obviously still use exactly the same approach; you just need to allocate pixel_data dynamically and store the dimensions in the Image3D object itself. Here's that version:
class Image3D {
private:
    unsigned int *pixel_data;
    unsigned int dim_x, dim_y, dim_z;

public:
    Image3D(int xd, int yd, int zd) : dim_x(xd), dim_y(yd), dim_z(zd) {
        pixel_data = new unsigned int[dim_x * dim_y * dim_z];
    }
    virtual ~Image3D() { delete[] pixel_data; }

    unsigned int & operator()(int x, int y, int z) {
        return pixel_data[x + y * dim_x + z * dim_y * dim_x];
    }
};
My questions are:
1) Is there a reason to not use arrays for this purpose? Why should I use the more complicated data structures?
I personally prefer to use basic arrays. By basic I mean a 1D linear array. Say you have a 512 x 512 image with 5 slices; then the image array looks like the following:
int sizeX = 512;
int sizeY = 512;
int sizeZ = 5;
float* img = new float[sizeX * sizeY * sizeZ];
To access of a pixel/voxel at location (x,y,z), you would need to do:
float val = img[z*sizeX*sizeY + y*sizeX + x];
2) I don't think I'll use the advantages of containers (as seen in the C++ FAQ lite link I posted). Is there something I'm not seeing?
Using containers is more of a programming concern (easier, safer, exception handling...). If you are an algorithm guy, it might not be your concern at all. However, as one example of using <vector> in C++, you can always do this:
int sizeX = 512;
int sizeY = 512;
int sizeZ = 5;
std::vector<float> img(sizeX * sizeY * sizeZ);
float* p = &img[0];
3) The C++ FAQ lite mentions that arrays will make me less productive. I don't really see how that applies to my case. What do you guys think?
I don't see why arrays make you less productive. Of course, C++ guys would prefer vectors to raw arrays. But again, it is just a programming thing.
Hope this helps.
Supplement:
The easiest way to do a 2D/3D CT recon is MATLAB/Python + C/C++, but again, this requires enough experience to know when to use which. MATLAB has built-in FFT/IFFT, so you don't have to write C/C++ code for that. I remember I used KissFFT before, and it was no problem.
I was wondering whether (apart from the obvious syntax differences) there would be any efficiency difference between having a class containing multiple instances of an object (of the same type) or a fixed size array of objects of that type.
In code:
struct A {
    double x;
    double y;
    double z;
};

struct B {
    double xvec[3];
};
In reality I would be using boost::array, which is a better C++ alternative to C-style arrays.
I am mainly concerned with construction/destruction and reading/writing such doubles, because these classes will often be constructed just to invoke one of their member functions once.
Thank you for your help/suggestions.
Typically the representation of those two structs would be exactly the same. It is, however, possible to have poor performance if you pick the wrong one for your use case.
For example, if you need to access each element in a loop, with an array you could do:
for (int i = 0; i < 3; i++)
    dosomething(xvec[i]);
However, without an array, you'd either need to duplicate code:
dosomething(x);
dosomething(y);
dosomething(z);
This means code duplication, which can go either way: on the one hand there's less loop code; on the other hand, very tight loops can be quite fast on modern processors, and code duplication can blow away the I-cache.
The other option is a switch:
for (int i = 0; i < 3; i++) {
    double *r;
    switch (i) {
        case 0: r = &x; break;
        case 1: r = &y; break;
        case 2: r = &z; break;
    }
    dosomething(*r); // assume this is some big inlined code
}
This avoids the possibly-large i-cache footprint, but has a huge negative performance impact. Don't do this.
On the other hand, it is, in principle, possible for array accesses to be slower, if your compiler isn't very smart:
xvec[0] = xvec[1] + 1;
dosomething(xvec[1]);
Since xvec[0] and xvec[1] are distinct, in principle the compiler ought to be able to keep the value of xvec[1] in a register, so it doesn't have to reload it at the next line. However, it's possible some compilers might not be smart enough to notice that xvec[0] and xvec[1] don't alias. In this case, using separate fields might be a very tiny bit faster.
In short, it's not about one or the other being fast in all cases. It's about matching the representation to how you use it.
Personally, I would suggest going with whatever makes the code working on xvec most natural. It's not worth spending a lot of human time worrying about something that, at best, will probably only produce such a small performance difference that you'll only catch it in micro-benchmarks.
MSVC++ 2010 generated exactly the same code for reading/writing the two POD structs in your example. Since the offsets to read/write are computable at compile time, this is not surprising. The same goes for construction and destruction.
As for the actual performance, the general rule applies: profile it if it matters, if it doesn't - why care?
Indexing into an array member is perhaps a bit more work for the user of your struct, but then again, he can more easily iterate over the elements.
In case you can't decide and want to keep your options open, you can use an anonymous union:
#include <iostream>

struct Foo
{
    union
    {
        struct
        {
            double x;
            double y;
            double z;
        } xyz;
        double arr[3];
    };
};

int main()
{
    Foo a;
    a.xyz.x = 42;
    std::cout << a.arr[0] << std::endl;
}
Some compilers also support anonymous structs, in that case you can leave the xyz part out.
It depends. For instance, the example you gave is a classic one in favor of 'old-school' arrays: a math point/vector (or matrix)
has a fixed number of elements
the data itself is usually kept private in an object
since (if?) it has a class as an interface, you can properly initialize the elements in the constructor (otherwise, classic array initialization is something I don't really like, syntax-wise)
In such cases (going with the math vector/matrix examples), I always ended up using C-style arrays internally, as you can loop over them instead of writing copy/pasted code for each component.
But this is a special case -- for me, in C++ nowadays arrays == STL vector, it's fast and I don't have to worry about nuthin' :)
The difference can be in how the variables are stored in memory. In the first example the compiler can add padding to align the data. But in your particular case it doesn't matter.
Raw arrays offer better cache locality than C++ containers. As presented, however, the array example's only advantage over the multiple objects is the ability to iterate over the elements.
The real answer is of course, create a test case and measure.