C structure pointer dereferencing speed - c++

I have a question regarding the speed of pointer dereferencing. I have a structure like so:
typedef struct _TD_RECT TD_RECT;
struct _TD_RECT {
    double left;
    double top;
    double right;
    double bottom;
};
My question is, which of these would be faster and why?
CASE 1:
TD_RECT *pRect;
...
for(i = 0; i < m; i++)
{
    if(p[i].x < pRect->left) ...
    if(p[i].x > pRect->right) ...
    if(p[i].y < pRect->top) ...
    if(p[i].y > pRect->bottom) ...
}
CASE 2:
TD_RECT *pRect;
double left = pRect->left;
double top = pRect->top;
double right = pRect->right;
double bottom = pRect->bottom;
...
for(i = 0; i < m; i++)
{
    if(p[i].x < left) ...
    if(p[i].x > right) ...
    if(p[i].y < top) ...
    if(p[i].y > bottom) ...
}
So in case 1, the loop dereferences the pRect pointer directly to obtain the comparison values. In case 2, new variables were made in the function's local space (on the stack) and the values were copied from pRect into those locals. Over the course of the loop there will be many comparisons.
In my mind, they would be equally slow, because the local variable is also a memory reference on the stack, but I'm not sure...
Also, would it be better to keep referencing p[] by index, or to increment p by one element each iteration and dereference it directly without an index?
Any ideas? Thanks :)

You'll probably find it won't make a difference with modern compilers. Most of them would probably perform common subexpression elimination of the expressions that don't change within the loop. It's not wise to assume that there's a simple one-to-one mapping between your C statements and assembly code. I've seen gcc pump out code that would put my assembler skills to shame.
But this is neither a C nor a C++ question, since the ISO standard doesn't mandate how it's done. The best way to check for sure is to generate the assembler code with something like gcc -S and examine the two cases in detail.
You'll also get more return on your investment if you steer away from this sort of micro-optimisation and concentrate more on the macro level, such as algorithm selection and such.
And, as with all optimisation questions, measure, don't guess! There are too many variables which can affect it, so you should be benchmarking different approaches in the target environment, and with realistic data.

It is not likely to make a hugely performance-critical difference. You could profile each option multiple times and see. Ensure you have compiler optimisations enabled in the test.
With regards to storing the doubles, you might squeeze out some performance by declaring them const. How big is your array?
With regards to using pointer arithmetic, this can be faster, yes.
You can instantly optimise if you know left < right in your rect (surely it must be). If x < left it can't also be > right so you can put in an "else".
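A minimal sketch of both suggestions combined (assuming, hypothetically, that p points to an array of m elements of some POINT type with x and y members, as implied by the question):
/* Walk p with a pointer instead of an index, and chain the comparisons
   with "else" so at most two x tests and two y tests run per point.
   POINT is a hypothetical stand-in for the real element type. */
const POINT *pt = p;
for (i = 0; i < m; i++, pt++)
{
    if (pt->x < left) { /* ... */ }
    else if (pt->x > right) { /* ... */ }
    if (pt->y < top) { /* ... */ }
    else if (pt->y > bottom) { /* ... */ }
}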
Your big optimisation, if there is one, would come from not having to loop through all the items in your array and not having to perform 4 checks on every one of them.
For example, if you indexed or sorted your array on x and y, you would be able, using binary search, to find all values that have x < left and loop through just those.
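As a rough illustration with the standard library (assuming, hypothetically, that the array can be kept sorted by x, and reusing the POINT stand-in from above; the lambda needs C++11):
#include <algorithm>

// std::lower_bound finds the first point with x >= left in O(log m),
// so only the points strictly left of the rectangle are visited.
const POINT *first = p;
const POINT *last = p + m;
const POINT *cut = std::lower_bound(first, last, left,
    [](const POINT &pt, double value) { return pt.x < value; });
for (const POINT *it = first; it != cut; ++it)
{
    // every point in this range has it->x < left
}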

I think the second case is likely to be faster because you are not dereferencing the pointer to pRect on every loop iteration.
Practically, a compiler doing optimisation may notice this and there might be no difference in the code that is generated, but the possibility of pRect being an alias of an item in p[] could prevent this.

An optimizing compiler will see that the structure accesses are loop-invariant and perform loop-invariant code motion, making your two cases look the same.

I will be surprised if even a totally non-optimized compile (-O0) produces different code for the two cases presented. In order to perform any operation on a modern processor, the data needs to be loaded into registers. So even when you declare automatic variables, these variables will not exist in main memory but rather in one of the processor's floating-point registers. This will be true even when you do not declare the variables yourself, and therefore I expect no difference in the generated machine code even when you declare the temporary variables in your C++ code.
But as others have said, compile the code into assembler and see for yourself.

Related

Optimizing a quadruple nested "for" loop

I'm developing a 2D numerical model in c++, and I would like to speed up a specific member function that is slowing down my code. The function is required to loop over every i,j grid point in the model and then perform a double summation at every grid point over l and m. The function is as follows:
int Class::Function(void) {
    double loadingEta;
    int i, j, l, m;
    // etaLatLen=64, etaLonLen=2*64
    // l_max = 12
    for (i=0; i<etaLatLen; i++) {
        for (j=0; j<etaLonLen; j++) {
            loadingEta = 0.0;
            for (l=0; l<l_max+1; l++) {
                for (m=0; m<=l; m++) {
                    loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);
                }
            }
            etaNewArray[i][j] = loadingEta;
        }
    }
    return 1;
}
I've been trying to change the loop order to speed things up, but to no avail. Any help would be much appreciated. Thank you!
EDIT 1:
All five arrays are allocated in the constructor of my class as follows:
etaLegendreArray = new double**[etaLatLen];
for (int i=0; i<etaLatLen; i++) {
    etaLegendreArray[i] = new double*[l_max+1];
    for (int l=0; l<l_max+1; l++) {
        etaLegendreArray[i][l] = new double[l_max+1];
    }
}
SH_C = new double*[l_max+1];
SH_S = new double*[l_max+1];
for (int i=0; i<l_max+1; i++) {
    SH_C[i] = new double[l_max+1];
    SH_S[i] = new double[l_max+1];
}
etaCosMLon = new double*[etaLonLen];
etaSinMLon = new double*[etaLonLen];
for (int j=0; j<etaLonLen; j++) {
    etaCosMLon[j] = new double[l_max+1];
    etaSinMLon[j] = new double[l_max+1];
}
Perhaps it would be better if these were 1D arrays instead of multidimensional?
Hopping off into X-Y territory here. Rather than speeding up the algorithm, let's try and speed up data access.
etaLegendreArray = new double**[etaLatLen];
for (int i=0; i<etaLatLen; i++) {
    etaLegendreArray[i] = new double*[l_max+1];
    for (int l=0; l<l_max+1; l++) {
        etaLegendreArray[i][l] = new double[l_max+1];
    }
}
This doesn't create a 3D array of doubles. It creates an array of pointers to arrays of pointers to arrays of doubles. Each array is its own block of memory, and who knows where it's going to sit in storage. This results in a data structure that has what is called "poor spatial locality": all of the pieces of the structure may be scattered all over the place, and in the simulated 3D array you are hopping to three different places just to find out where your value is.
Because the many blocks of storage required to simulate the 3D array may be nowhere near each other, the CPU may not be able to effectively load the cache (high-speed memory) ahead of time, and has to stop the useful work it's doing and wait to access slower storage, probably RAM, much more frequently. Here is a good, high-level article on how much this can hurt performance.
On the other hand, if the whole array is in one block of memory, if it is "contiguous", the CPU can read larger chunks of the memory it needs, maybe all of it, into cache at once. Plus, if the compiler knows the memory the program will use is all in one big block, it can perform all sorts of groovy optimizations that will make your program even faster.
So how do we get a 3D array that's all one memory block? If the sizes are static, this is easy:
double etaLegendreArray[SIZE1][SIZE2][SIZE3];
This doesn't look to be your case, so what you want to do is allocate a 1D array, because it will be one contiguous block of memory.
double *etaLegendreArray = new double[SIZE1 * SIZE2 * SIZE3];
and do the array indexing math by hand
etaLegendreArray[(x * SIZE2 + y) * SIZE3 + z] = data;
Looks like that ought to be slower with all the extra math, huh? Turns out the compiler is hiding math that looks a lot like that from you every time you use a []. You lose almost nothing, and certainly not as much as you lose with one unnecessary cache miss.
But it is insane to repeat that math all over the place; sooner or later you will screw up, even if the drain on readability doesn't have you wishing for death first. So you really want to wrap the 1D array in a helper class that handles the math for you. And once you do that, you might as well have that class handle the allocation and deallocation so you can take advantage of all that RAII goodness. No more for loops of news and deletes all over the place. It's all wrapped up and tied with a bow.
Here is an example of a 2D Matrix class, easily extendable to 3D, that will take care of the basic functionality you probably need in a nice, predictable, and cache-friendly manner.
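For reference, a minimal sketch of the 3D version (the class and member names are illustrative, not from the question; a real version would add bounds checking and so on):
#include <vector>
#include <cstddef>

// One contiguous allocation wrapped in a class that does the index math;
// std::vector gives us RAII, so there are no manual news and deletes.
class Array3D {
public:
    Array3D(std::size_t n1, std::size_t n2, std::size_t n3)
        : d2_(n2), d3_(n3), data_(n1 * n2 * n3) {}
    double& operator()(std::size_t x, std::size_t y, std::size_t z) {
        return data_[(x * d2_ + y) * d3_ + z]; // same math as above
    }
    const double& operator()(std::size_t x, std::size_t y, std::size_t z) const {
        return data_[(x * d2_ + y) * d3_ + z];
    }
private:
    std::size_t d2_, d3_;
    std::vector<double> data_;
};

// usage: Array3D etaLegendre(etaLatLen, l_max + 1, l_max + 1);
//        etaLegendre(i, l, m) = someValue;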
If the CPU supports it and the compiler is optimizing enough, you might get some small gain out of the C99 fma (fused multiply-add) function, converting some of your two-step operations (multiply, then add) into one-step operations. It would also improve accuracy, since you only suffer floating point rounding once for the fused operation, not once for the multiplication and once for the addition.
Assuming I'm reading it right, you could change your innermost loop's expression from:
loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);
to (note no use of += now, it's incorporated in fma):
loadingEta = fma(etaLegendreArray[i][l][m], fma(SH_C[l][m], etaCosMLon[j][m], SH_S[l][m]*etaSinMLon[j][m]), loadingEta);
I wouldn't expect anything magical performance-wise, but it might help a little (again, only with optimizations turned up enough for the compiler to inline hardware instructions to do the work; if it's calling a library function, you'll lose any improvements to the function call overhead). And again, it should improve accuracy a bit, by avoiding two rounding steps you were incurring.
Mind you, on some compilers with appropriate compilation flags, they'll convert your original code to hardware FMA instructions for you; if that's an option, I'd go with that, since (as you can see) the fma function tends to reduce code readability.
Your compiler may offer vectorized versions of floating point instructions as well, which might meaningfully improve performance (see previous link on automatic conversion to FMA).
Most other improvements would require more information about the goal, the nature of the input arrays being used, etc. Simple threading might gain you something, OpenMP pragmas might be something to look at as a way to simplify parallelizing the loop(s).
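For instance, a minimal OpenMP sketch of the loop from the question (assuming the input arrays are read-only here and each (i, j) pair writes its own element of etaNewArray, so iterations are independent; compile with something like -fopenmp):
// Parallelize the two perfectly nested outer loops across threads.
#pragma omp parallel for collapse(2)
for (int i = 0; i < etaLatLen; i++) {
    for (int j = 0; j < etaLonLen; j++) {
        double loadingEta = 0.0; // private to each iteration
        for (int l = 0; l < l_max+1; l++) {
            for (int m = 0; m <= l; m++) {
                loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);
            }
        }
        etaNewArray[i][j] = loadingEta;
    }
}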

For loop or no loop? (dataset is small and not subject to change)

Let's say I have a situation where I have a matrix of a small, known size where the size is unlikely to change over the life of the software. If I need to examine each matrix element, would it be more efficient to use a loop or to manually index into each matrix location?
For example, let's say I have a system made up of 3 windows, 2 panes per window. I need to keep track of state for each window pane. In my system, there will only ever be 3 windows, 2 panes per window.
static const int NUMBER_OF_WINDOWS = 3;
static const int NUMBER_OF_PANES = 2;
static const int WINDOW_LEFT = 0;
static const int WINDOW_MIDDLE = 1;
static const int WINDOW_RIGHT = 2;
static const int PANE_TOP = 0;
static const int PANE_BOTTOM = 1;
paneState windowPanes[NUMBER_OF_WINDOWS][NUMBER_OF_PANES];
Which of these accessing methods would be more efficient?
loop version:
for (int ii=0; ii<NUMBER_OF_WINDOWS; ii++)
{
    for (int jj=0; jj<NUMBER_OF_PANES; jj++)
    {
        doSomething(windowPanes[ii][jj]);
    }
}
vs.
manual access version:
doSomething(windowPanes[WINDOW_LEFT][PANE_TOP]);
doSomething(windowPanes[WINDOW_MIDDLE][PANE_TOP]);
doSomething(windowPanes[WINDOW_RIGHT][PANE_TOP]);
doSomething(windowPanes[WINDOW_LEFT][PANE_BOTTOM]);
doSomething(windowPanes[WINDOW_MIDDLE][PANE_BOTTOM]);
doSomething(windowPanes[WINDOW_RIGHT][PANE_BOTTOM]);
Will the loop code generate branch instructions, and will those be more costly than the instructions that would be generated on the manual access?
The classic Efficiency vs Organization. The for loops are much more human readable and the manual way is more machine readable.
I recommend you use the loops, because the compiler, if optimizing is enabled, will actually generate the manual code for you when it sees that the upper bounds are constant. That way you get the best of both worlds.
First of all: how complex is your function doSomething? If it is at all complex (and most likely it is), then you will not notice any difference.
In general, calling your function sequentially will be slightly more efficient than the loop. But once again, the gain will be so tiny that it is not worth discussing it.
Bear in mind that optimizing compilers do loop unrolling. This essentially means generating code that runs your loop a smaller number of times while doing more work in each iteration (calling your function 2-4 times in sequence). When the number of iterations is small and fixed, the compiler may easily eliminate the loop completely.
Look at your code from the point of view of clarity and ease of modification. In many cases compiler will do a lot of useful tricks related to performance.
You may linearize your multi-dimensional array
paneState windowPanes[NUMBER_OF_WINDOWS * NUMBER_OF_PANES];
and then
for (auto& pane : windowPanes) {
    doSomething(pane);
}
This avoids the extra inner loop if the compiler doesn't optimize it away.

C++ fixed size arrays vs multiple objects of same type

I was wondering whether (apart from the obvious syntax differences) there would be any efficiency difference between having a class containing multiple instances of an object (of the same type) or a fixed size array of objects of that type.
In code:
struct A {
    double x;
    double y;
    double z;
};

struct B {
    double xvec[3];
};
In reality I would be using boost::arrays which are a better C++ alternative to C-style arrays.
I am mainly concerned with construction/destruction and reading/writing such doubles, because these classes will often be constructed just to invoke one of their member functions once.
Thank you for your help/suggestions.
Typically the representation of those two structs would be exactly the same. It is, however, possible to have poor performance if you pick the wrong one for your use case.
For example, if you need to access each element in a loop, with an array you could do:
for (int i = 0; i < 3; i++)
    dosomething(xvec[i]);
However, without an array, you'd either need to duplicate code:
dosomething(x);
dosomething(y);
dosomething(z);
This means code duplication - which can go either way. On the one hand there's less loop code; on the other hand very tight loops can be quite fast on modern processors, and code duplication can blow away the I-cache.
The other option is a switch:
for (int i = 0; i < 3; i++) {
    double *r;
    switch(i) {
        case 0: r = &x; break;
        case 1: r = &y; break;
        case 2: r = &z; break;
    }
    dosomething(*r); // assume this is some big inlined code
}
This avoids the possibly-large i-cache footprint, but has a huge negative performance impact. Don't do this.
On the other hand, it is, in principle, possible for array accesses to be slower, if your compiler isn't very smart:
xvec[0] = xvec[1] + 1;
dosomething(xvec[1]);
Since xvec[0] and xvec[1] are distinct, in principle, the compiler ought to be able to keep the value of xvec[1] in a register, so it doesn't have to reload the value at the next line. However, it's possible some compilers might not be smart enough to notice that xvec[0] and xvec[1] don't alias. In this case, using separate fields might be a very tiny bit faster.
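If that worry turns out to be real for a given compiler, caching the value in a local sidesteps it:
// A local copy lets the compiler keep the value in a register even if
// it can't prove that xvec[0] and xvec[1] don't alias.
double x1 = xvec[1];
xvec[0] = x1 + 1;
dosomething(x1);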
In short, it's not about one or the other being fast in all cases. It's about matching the representation to how you use it.
Personally, I would suggest going with whatever makes the code working on xvec most natural. It's not worth spending a lot of human time worrying about something that, at best, will probably only produce such a small performance difference that you'll only catch it in micro-benchmarks.
MSVC++ 2010 generated exactly the same code for reading/writing from two POD structs like in your example. Since the offsets to read/write to are computable at compile time, this is not surprising. Same goes for construction and destruction.
As for the actual performance, the general rule applies: profile it if it matters, if it doesn't - why care?
Indexing into an array member is perhaps a bit more work for the user of your struct, but then again, he can more easily iterate over the elements.
In case you can't decide and want to keep your options open, you can use an anonymous union:
struct Foo
{
    union
    {
        struct
        {
            double x;
            double y;
            double z;
        } xyz;
        double arr[3];
    };
};
#include <iostream>

int main()
{
    Foo a;
    a.xyz.x = 42;
    std::cout << a.arr[0] << std::endl;
}
Some compilers also support anonymous structs, in that case you can leave the xyz part out.
It depends. For instance, the example you gave is a classic one in favor of 'old-school' arrays: a math point/vector (or matrix)
- has a fixed number of elements
- the data itself is usually kept private in an object
- since (if?) it has a class as an interface, you can properly initialize them in the constructor (otherwise, classic array initialization is something I don't really like, syntax-wise)
In such cases (going with the math vector/matrix examples), I always ended up using C-style arrays internally, as you can loop over them instead of writing copy/pasted code for each component.
But this is a special case -- for me, in C++ nowadays arrays == STL vector, it's fast and I don't have to worry about nuthin' :)
The difference can be in how the variables are stored in memory. In the first example the compiler can add padding to align the data. But in your particular case it doesn't matter.
Raw arrays offer better cache locality than C++ containers; as presented, however, the array example's only advantage over the multiple objects is the ability to iterate over the elements.
The real answer is of course, create a test case and measure.

Is repeatedly calling size() on a container (during loop) bad?

For efficiency reasons, I always avoid writing loops like this:
for(std::size_t i = 0; i < vec.size(); ++i) { ... }
where vec is an STL container. Instead, I either do
const std::size_t vec_size = vec.size();
for(std::size_t i = 0; i < vec_size; ++i) { ... }
or use the container iterators.
But how bad is the first solution really? I remember reading in Meyers that it will be quadratic instead of linear because the vector doesn't know its size and repeatedly has to count. But won't modern compilers detect this and optimize it away?
vector::size() is constant-time and usually implemented as a trivial inline function that is optimised away. Don't bother hand-optimising it.
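Typically it amounts to little more than a pointer subtraction or a stored count, something like this sketch (not any real library's actual source):
#include <cstddef>

// Toy illustration only: a typical vector keeps begin/end pointers,
// so size() is a single subtraction the compiler inlines away.
template <typename T>
struct toy_vector {
    T* start_ = nullptr;
    T* finish_ = nullptr;
    std::size_t size() const {
        return static_cast<std::size_t>(finish_ - start_);
    }
};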
I remember reading in Meyers that it will be quadratic instead of linear because the vector doesn't know its size and repeatedly has to count.
You're getting vector and list confused. vector's size value is held in the vector; list's requires traversal of the actual list.
The easiest way to tell if something is being optimized out by the compiler is to compare the assembly-language compiler output.
That said, the two chunks of code are not actually equivalent. What if the size of the vector changes while you're iterating over it? The compiler would have to be very, very smart to prove conclusively that the vector's size could not change.
Now, in the real world, is this tiny optimization really worth the extra effort? The vec.size() just returns a stored value. It doesn't re-count the length.
Consider the following stupid function:
void sum (vector<int>& vec, int* sumOut)
{
    *sumOut = 0;
    for(std::size_t i = 0; i < vec.size(); ++i)
    {
        *sumOut += vec[i];
    }
}
The actual assembly generated will depend on the compiler and implementation of vector, but I think in most cases, the compiler has to re-read the vector's size from memory each time through the loop. This is because the sumOut pointer could potentially overlap (alias) the vector's internal storage of the size (assuming the vector stores its size in an int), so the size could be changed by the loop. If you call a function like this a lot, it could add up to a lot of cycles because you're touching memory more than you need.
Three possible solutions are:
1. Store the size in a local variable. Ideally, the size will get stored in a register and avoid touching memory altogether. Even if it has to get put on the stack, the compiler should be able to order the loads/stores more efficiently.
2. Use __restrict on the output pointer. This tells the compiler that the pointer can't possibly overlap anything else, so writes to it don't require reloading anything else.
3. Reverse the loop. The termination condition now checks against 0 instead, so vec.size() is never called again.
Of those, I think #1 is the cleanest, but some people might prefer #3. #2 is the probably least reader-friendly, but might be faster than the others (because it means the vector's data could be read more efficiently).
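For illustration, rough sketches of the three variants (note that __restrict is a common compiler extension, spelled __restrict__ on some compilers, not standard C++):
#include <vector>
#include <cstddef>

// 1. Hoist the size into a local; it is read once and likely kept
//    in a register for the whole loop.
void sum1(std::vector<int>& vec, int* sumOut)
{
    *sumOut = 0;
    const std::size_t n = vec.size();
    for (std::size_t i = 0; i < n; ++i)
        *sumOut += vec[i];
}

// 2. Promise the compiler that sumOut aliases nothing else.
void sum2(std::vector<int>& vec, int* __restrict sumOut)
{
    *sumOut = 0;
    for (std::size_t i = 0; i < vec.size(); ++i)
        *sumOut += vec[i];
}

// 3. Count down so size() is only read once, at loop entry.
void sum3(std::vector<int>& vec, int* sumOut)
{
    *sumOut = 0;
    for (std::size_t i = vec.size(); i-- > 0; )
        *sumOut += vec[i];
}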
For more info on aliasing, see Christer Ericson's GDC presentation on memory optimization; there's an example almost identical to this in there.

c++ variable declaration

I'm wondering if this code:
int main(){
    int p;
    for(int i = 0; i < 10; i++){
        p = ...;
    }
    return 0;
}
is exactly the same as this one:
int main(){
    for(int i = 0; i < 10; i++){
        int p = ...;
    }
    return 0;
}
in terms of efficiency?
I mean, will the p variable be recreated 10 times in the second example?
It is the same in terms of efficiency.
It's not the same in terms of readability. The second is better in this aspect, isn't it?
It's a semantic difference which the code keeps hidden because it's not making a difference for int, but it makes a difference to the human reader. Do you want to carry the value of whatever calculation you do in ... outside of the loop? You don't, so you should write code that reflects your intention.
A human reader will need to seek out the function and look for other uses of p to convince himself that what you did was just premature "optimization" and didn't have any deeper purpose.
Assuming it makes a difference for the type you use, you can help the human reader by commenting your code
/* p is only used inside the for-loop, to keep it from reallocating */
std::vector<int> p;
p.reserve(10);
for(int i = 0; i < 10; i++){
    p.clear();
    /* ... */
}
In this case, it's the same. Use the smallest scope possible for the most readable code.
If int were a class with a significant constructor and destructor, then the first (declaring it outside the loop) can be a significant savings - but inside you usually need to recreate the state anyway... so oftentimes it ends up being no savings at all.
One instance where it might make a difference is containers. A string or vector uses internal storage that gets grown to fit the size of the data it is storing. You may not want to reconstruct this container each time through the loop; instead, just clear its contents, and it may not need as many reallocations inside the loop. This can (in some cases) result in a significant performance improvement.
The bottom-line is write it clearly, and if profiling shows it matters, move it out :)
They are equal in terms of efficiency - you should trust your compiler to get rid of the immeasurably small difference. The second is better design.
Edit: This isn't necessarily true for custom types, especially those that deal with memory. If you were writing a loop for any T, I'd sure use the first form just in case. But if you know that it's an inbuilt type, like int, pointer, char, float, bool, etc. I'd go for the second.
In the second example, p is visible only inside the for loop; you cannot use it further in your code.
In terms of efficiency they are equal.