Efficiency of manually written loops vs operator overloads - c++

in the program I'm working on I have 3-element arrays, which I use as mathematical vectors for all intents and purposes.
Through the course of writing my code, I was tempted to just roll my own Vector class with simple arithmetic overloads (+, -, * /) so I can simplify statements like:
// old:
for (int i = 0; i < 3; i++)
r[i] = r1[i] - r2[i];
// new:
r = r1 - r2;
Which should be more or less identical in generated code. But when it comes to more complicated things, could this really impact my performance heavily? One example that I have in my code is this:
Manually written version:
for (int j = 0; j < 3; j++)
{
p.vel[j] = p.oldVel[j] + (p.oldAcc[j] + p.acc[j]) * dt2 + (p.oldJerk[j] - p.jerk[j]) * dt12;
p.pos[j] = p.oldPos[j] + (p.oldVel[j] + p.vel[j]) * dt2 + (p.oldAcc[j] - p.acc[j]) * dt12;
}
Using the Vector class with operator overloads:
p.vel = p.oldVel + (p.oldAcc + p.acc) * dt2 + (p.oldJerk - p.jerk) * dt12;
p.pos = p.oldPos + (p.oldVel + p.vel) * dt2 + (p.oldAcc - p.acc) * dt12;
I am attempting to optimize my code for speed, since this sort of code runs inside of inner loops. Will using the overloaded operators for these things affect performance? I'm doing some numerical integration of a system of n mutually gravitating bodies. These vector operations are extremely common so having this run fast is important.
Any insight would be appreciated, as would any idioms or tricks I'm unaware of.

If the operations are inlined and optimised well by your compiler you shouldn't usually see any difference between writing the code well (using operators to make it readable and maintainable) and manually inlining everything.
Manual inlining also considerably increases the risk of bugs because you won't be re-using a single piece of well-tested code, you'll be writing the same code over and over. I would recommend writing the code with operators, and then if you can prove you can speed it up by manually inlining, duplicate the code and manually inline the second version. Then you can run the two variants of the code off against each other to prove (a) that the manual inlining is effective, and (b) that the readable and manually-inlined code both produce the same result.
Before you start manually inlining, though, there's an easy way for you to answer your question for yourself: Write a few simple test cases both ways, then execute a few million iterations and see which approach executes faster. This will teach you a lot about what's going on and give you a definite answer for your particular implementation and compiler that you will never get from the theoretical answers you'll receive here.

I would like to look at it the other way around; starting with the Vector class, and if you get performance problems with that you can see if manually inlining the calculations is faster.
Aside from the performance you also mention that the calculations has to be accurate. Having the vector specific calculations in a class means that it's easier to test those individually, and also that the code using the class gets shorter and easier to maintain.

Check out the ConCRT code samples
http://code.msdn.microsoft.com/concrtextras/Release/ProjectReleases.aspx?ReleaseId=4270
There's a couple (including an NBody sample) which do a bunch of tricks like this with Vector types and templates etc.

Related

Will an operation done several times in sequence be simplified by compiler?

I've had this question for a long time but never knew where to look. If a certain operation is written many times will the compiler simplify it or will it run the exact same operation and get the exact same answer?
For example, in the following c-like pseudo-code (i%3)*10 is repeated many times.
for(int i=0; i<100; i++) {
array[(i%3)*10] = someFunction((i%3)*10);
int otherVar = (i%3)*10 + array[(i%3)*10];
int lastVar = (i%3)*10 - otherVar;
anotherFunction(lastVar);
}
I understand a variable would be better for visual purposes, but is it also faster? Is (i%3)*10 calculated 5 times per loop?
There are certain cases where I don't know if its faster to use a variable or just leave the original operation.
Edit: using gcc (MinGW.org GCC-8.2.0-3) 8.2.0 on win 10
Which optimizations are done depends on the compiler, the compiler optimization flag(s) you specify, and the architecture.
Here are a few possible optimizations for your example:
Loop Unrolling This makes the binary larger and thus is a trade-off; for example you may not want this on a tiny microprocessor with very little memory.
Common Subexpression Elimination (CSE) you can be pretty sure that your (i % 3) * 10 will only be executed once per loop iteration.
About your concern about visual clarity vs. optimization: When dealing with a 'local situation' like yours, you should focus on code clarity.
Optimization gains are often to be made at a higher level; for example in the algorithm you use.
There's a lot to be said about optimization; the above are just a few opening remarks. It's great that you're interested in how things work, because this is important for a good (C/C++) programmer.
As a matter of course, you should remove the obfuscation present in your code:
for (int i = 0; i < 100; ++i) {
int i30 = i % 3 * 10;
int r = someFunction(i30);
array[i30] = r;
anotherFunction(-r);
}
Suddenly, it looks quite a lot simpler.
Leave it to the compiler (with appropriate options) to optimize your code unless you find you actually have to take a hand after measuring.
In this case, unrolling three times looks like a good idea for the compiler to pursue. Though inlining might always reveal even better options.
Yes, operations done several times in sequence will be optimized by a compiler.
To go into more detail, all major compilers (GCC, Clang, and MSVC) store the value of (i%3)*10 into a temporary (scratch, junk) register, and then use that whenever an equivalent expression is used again.
This optimization is called GCSE (GNU Common Subexpression Elimination) for GCC, and just CSE otherwise.
This takes a decent chunk out of the time that it takes to compute the loop.

How to generate computation intensive code in C++ that will not be removed by compiler? [duplicate]

This question already has an answer here:
How to prevent optimization of busy-wait
(1 answer)
Closed 7 years ago.
I am doing some experiments on CPU's performance. I wonder if anyone know a formal way or a tool to generate simple code that can run for a period of time (several seconds) and consumes significant computation resource of a CPU.
I know there are a lot of CPU benchmarks but the code of them is pretty complicated. What I want is a program more straight forward.
As the compiler is very smart, writing some redundant code as following will not work.
for (int i = 0; i < 100; i++) {
int a = i * 200 + 100;
}
Put the benchmark code in a function in a separate translation unit from the code that calls it. This prevents the code from being inlined, which can lead to aggressive optimizations.
Use parameters for the fixed values (e.g., the number of iterations to run) and return the resulting value. This prevents the optimizer from doing too much constant folding and it keeps it from eliminating calculations for a variable that it determines you never use.
Building on the example from the question:
int TheTest(int iterations) {
int a;
for (int i = 0; i < iterations; i++) {
a = i * 200 + 100;
}
return a;
}
Even in this example, there's still a chance that the compiler might realize that only the last iteration matters and completely omit the loop and just return 200*(iterations - 1) + 100, but I wouldn't expect that to happen in many real-life cases. Examine the generated code to be certain.
Other ideas, like using volatile on certain variables can inhibit some reasonable optimizations, which might make your benchmark perform worse that actual code.
There are also frameworks, like this one, for writing benchmarks like these.
It's not necessarily your optimiser that removes the code. CPU's these days are very powerful, and you need to increase the challenge level. However, note that your original code is not a good general benchmark: you only use a very subset of a CPU's instruction set. A good benchmark will try to challenge the CPU on different kinds of operations, to predict the performance in real world scenarios. Very good benchmarks will even put load on various components of your computer, to test their interplay.
Therefore, just stick to a well known published benchmark for your problem. There is a very good reason why they are more involved. However, if you really just want to benchmark your setup and code, then this time, just go for higher counter values:
double j=10000;
for (double i = 0; i < j*j*j*j*j; i++)
{
}
This should work better for now. Note that there a just more iterations. Change j according to your needs.

Will matrix multiplication using for loops decrease performance?

Currently I'm working on a program that uses matrices. I came up with this nested loop to multiply two matrices:
// The matrices are 1-dimensional arrays
for (int i = 0; i < 4; i++)
for (int j = 0; j < 4; j++)
for (int k = 0; k < 4; k++)
result[i * 4 + j] += M1[i * 4 + k] * M2[k * 4 + j];
The loop works. My question is: will this loop be slower compared to writing it all out manually like this:
result[0] = M1[0]*M2[0] + M1[1]*M2[4] + M1[2]*M2[8] + M1[3]*M2[12];
result[1] = M1[0]*M2[1] + M1[1]*M2[5] + M1[2]*M2[9] + M1[4]*M2[13];
result[2] = ... etc.
Because in the nested loop, the array positions are calculated and in the second method, they do not.
Thanks.
As with so many things, "it depends", but in this instance I would tend toward the second, expanded form performing just about the same. Any modern compiler will unroll appropriate loops for you, and take care of it.
Two points perhaps worth making:
The second approach is uglier, is more prone to errors and tedious to write/maintain.
This is a nice example of 'premature optimization' (AKA the root of all evil). Do you know if this section is a bottleneck? Is this really the most intensive part of the code? By optimizing so early we incur everything in point #1 for what amounts to a hunch if we haven't bench marked our code.
Your compiler might already do this, take a look at loop unrolling.
Let the compiler do the guessing and the heavy work, stick to the clean code, and as always, measure your performance.
I don't think the loop will be slower. You are accessing the memory of the M1 and M2 arrays in the same way in both instances i.e. . If you want to make the "manual" version faster then use scalar replacement and do the computation on registers e.g.
double M1_0 = M1[0];
double M2_0 = M2[0];
result[0] = M1_0*M2_0 + ...
but you can use scalar replacement within the loop as well. You can do it if you do blocking and loop unrolling (in fact your triple loop looks like a blocking version of the MMM).
What you are trying to do is to speed up the program by improving locality i.e. better use of the memory hierarchy and better locality.
Assuming that you are running code on Intel processors or compatible (AMD) you may actually want to switch to assembly language to do heavy matrix computations. Luckily, you have the Intel-IPP library that does the actual work for you using advanced processor technology and selecting what is expected to be the fastest algorithm depending on your processor.
The IPP includes all the necessary matrix computations that you'd possibly need. The only problem you may encounter is the order in which you created your matrices. You may have to re-organize the order to make it easier to use the IPP functions you'd like to use.
Note that in regard to your two code examples, the second one will be faster because you avoid the += operator which is a read / modify / write cycle and that's generally slow (not only that, it requires the result matrix to be all zeroes to start with whereas the second example does not require clearing the output first), although your matrices are likely to fit in the cache... but, the processors are optimized to read input data in sequence (a[0], a1, a[2], a[3], ...) and also to write that data back in sequence. If you can write your algorithm to be as close as possible to such a sequence, all the better. Don't get me wrong, I know that matrix multiplications cannot be done in sequence. But if you think of that to do your optimization, you'll achieve better results (i.e. change the order in which your matrices are saved in memory could be one of them).

How do I force the compiler not to skip my function calls?

Let's say I want to benchmark two competing implementations of some function double a(double b, double c). I already have a large array <double, 1000000> vals from which I can take input values, so my benchmarking would look roughly like this:
//start timer here
double r;
for (int i = 0; i < 1000000; i+=2) {
r = a(vals[i], vals[i+1]);
}
//stop timer here
Now, a clever compiler could realize that I can only ever use the result of the last iteration and simply kill the rest, leaving me with double r = a(vals[999998], vals[999999]). This of course defeats the purpose of benchmarking.
Is there a good way (bonus points if it works on multiple compilers) to prevent this kind of optimization while keeping all other optimizations in place?
(I have seen other threads about inserting empty asm blocks but I'm worried that might prevent inlining or reordering. I'm also not particularly fond of the idea of adding the results sum += r; during each iteration because that's extra work that should not be included in the resulting timings. For the purposes of this question, it would be great if we could focus on other alternative solutions, although for anyone interested in this there is a lively discussion in the comments where the consensus is that += is the most appropriate method in many cases. )
Put a in a separate compilation unit and do not use LTO (link-time optimizations). That way:
The loop is always identical (no difference due to optimizations based on a)
The overhead of the function call is always the same
To measure the pure overhead and to have a baseline to compare implementations, just benchmark an empty version of a
Note that the compiler can not assume that the call to a has no side-effect, so it can not optimize the loop away and replace it with just the last call.
A totally different approach could use RDTSC, which is a hardware register in the CPU core that measures the clock cycles. It's sometimes useful for micro-benchmarks, but it's not exactly trivial to understand the results correctly. For example, check out this and goggle/search SO for more information on RDTSCs.

C++ valarray vs. vector

I like vectors a lot. They're nifty and fast. But I know this thing called a valarray exists. Why would I use a valarray instead of a vector? I know valarrays have some syntactic sugar, but other than that, when are they useful?
valarray is kind of an orphan that was born in the wrong place at the wrong time. It's an attempt at optimization, fairly specifically for the machines that were used for heavy-duty math when it was written -- specifically, vector processors like the Crays.
For a vector processor, what you generally wanted to do was apply a single operation to an entire array, then apply the next operation to the entire array, and so on until you'd done everything you needed to do.
Unless you're dealing with fairly small arrays, however, that tends to work poorly with caching. On most modern machines, what you'd generally prefer (to the extent possible) would be to load part of the array, do all the operations on it you're going to, then move on to the next part of the array.
valarray is also supposed to eliminate any possibility of aliasing, which (at least theoretically) lets the compiler improve speed because it's more free to store values in registers. In reality, however, I'm not at all sure that any real implementation takes advantage of this to any significant degree. I suspect it's rather a chicken-and-egg sort of problem -- without compiler support it didn't become popular, and as long as it's not popular, nobody's going to go to the trouble of working on their compiler to support it.
There's also a bewildering (literally) array of ancillary classes to use with valarray. You get slice, slice_array, gslice and gslice_array to play with pieces of a valarray, and make it act like a multi-dimensional array. You also get mask_array to "mask" an operation (e.g. add items in x to y, but only at the positions where z is non-zero). To make more than trivial use of valarray, you have to learn a lot about these ancillary classes, some of which are pretty complex and none of which seems (at least to me) very well documented.
Bottom line: while it has moments of brilliance, and can do some things pretty neatly, there are also some very good reasons that it is (and will almost certainly remain) obscure.
Edit (eight years later, in 2017): Some of the preceding has become obsolete to at least some degree. For one example, Intel has implemented an optimized version of valarray for their compiler. It uses the Intel Integrated Performance Primitives (Intel IPP) to improve performance. Although the exact performance improvement undoubtedly varies, a quick test with simple code shows around a 2:1 improvement in speed, compared to identical code compiled with the "standard" implementation of valarray.
So, while I'm not entirely convinced that C++ programmers will be starting to use valarray in huge numbers, there are least some circumstances in which it can provide a speed improvement.
Valarrays (value arrays) are intended to bring some of the speed of Fortran to C++. You wouldn't make a valarray of pointers so the compiler can make assumptions about the code and optimise it better. (The main reason that Fortran is so fast is that there is no pointer type so there can be no pointer aliasing.)
Valarrays also have classes which allow you to slice them up in a reasonably easy way although that part of the standard could use a bit more work. Resizing them is destructive and they lack iterators they have iterators since C++11.
So, if it's numbers you are working with and convenience isn't all that important use valarrays. Otherwise, vectors are just a lot more convenient.
During the standardization of C++98, valarray was designed to allow some sort of fast mathematical computations. However, around that time Todd Veldhuizen invented expression templates and created blitz++, and similar template-meta techniques were invented, which made valarrays pretty much obsolete before the standard was even released. IIRC, the original proposer(s) of valarray abandoned it halfway into the standardization, which (if true) didn't help it either.
ISTR that the main reason it wasn't removed from the standard is that nobody took the time to evaluate the issue thoroughly and write a proposal to remove it.
Please keep in mind, however, that all this is vaguely remembered hearsay. Take this with a grain of salt and hope someone corrects or confirms this.
I know valarrays have some syntactic sugar
I have to say that I don't think std::valarrays have much in way of syntactic sugar. The syntax is different, but I wouldn't call the difference "sugar." The API is weird. The section on std::valarrays in The C++ Programming Language mentions this unusual API and the fact that, since std::valarrays are expected to be highly optimized, any error messages you get while using them will probably be non-intuitive.
Out of curiosity, about a year ago I pitted std::valarray against std::vector. I no longer have the code or the precise results (although it shouldn't be hard to write your own). Using GCC I did get a little performance benefit when using std::valarray for simple math, but not for my implementations to calculate standard deviation (and, of course, standard deviation isn't that complex, as far as math goes). I suspect that operations on each item in a large std::vector play better with caches than operations on std::valarrays. (NOTE, following advice from musiphil, I've managed to get almost identical performance from vector and valarray).
In the end, I decided to use std::vector while paying close attention to things like memory allocation and temporary object creation.
Both std::vector and std::valarray store the data in a contiguous block. However, they access that data using different patterns, and more importantly, the API for std::valarray encourages different access patterns than the API for std::vector.
For the standard deviation example, at a particular step I needed to find the collection's mean and the difference between each element's value and the mean.
For the std::valarray, I did something like:
std::valarray<double> original_values = ... // obviously I put something here
double mean = original_values.sum() / original_values.size();
std::valarray<double> temp(mean, original_values.size());
std::valarray<double> differences_from_mean = original_values - temp;
I may have been more clever with std::slice or std::gslice. It's been over five years now.
For std::vector, I did something along the lines of:
std::vector<double> original_values = ... // obviously, I put something here
double mean = std::accumulate(original_values.begin(), original_values.end(), 0.0) / original_values.size();
std::vector<double> differences_from_mean;
differences_from_mean.reserve(original_values.size());
std::transform(original_values.begin(), original_values.end(), std::back_inserter(differences_from_mean), std::bind1st(std::minus<double>(), mean));
Today I would certainly write that differently. If nothing else, I would take advantage of C++11 lambdas.
It's obvious that these two snippets of code do different things. For one, the std::vector example doesn't make an intermediate collection like the std::valarray example does. However, I think it's fair to compare them because the differences are tied to the differences between std::vector and std::valarray.
When I wrote this answer, I suspected that subtracting the value of elements from two std::valarrays (last line in the std::valarray example) would be less cache-friendly than the corresponding line in the std::vector example (which happens to also be the last line).
It turns out, however, that
std::valarray<double> original_values = ... // obviously I put something here
double mean = original_values.sum() / original_values.size();
std::valarray<double> differences_from_mean = original_values - mean;
Does the same thing as the std::vector example, and has almost identical performance. In the end, the question is which API you prefer.
valarray was supposed to let some FORTRAN vector-processing goodness rub off on C++. Somehow the necessary compiler support never really happened.
The Josuttis books contains some interesting (somewhat disparaging) commentary on valarray (here and here).
However, Intel now seem to be revisiting valarray in their recent compiler releases (e.g see slide 9); this is an interesting development given that their 4-way SIMD SSE instruction set is about to be joined by 8-way AVX and 16-way Larrabee instructions and in the interests of portability it'll likely be much better to code with an abstraction like valarray than (say) intrinsics.
I found one good usage for valarray.
It's to use valarray just like numpy arrays.
auto x = linspace(0, 2 * 3.14, 100);
plot(x, sin(x) + sin(3.f * x) / 3.f + sin(5.f * x) / 5.f);
We can implement above with valarray.
valarray<float> linspace(float start, float stop, int size)
{
valarray<float> v(size);
for(int i=0; i<size; i++) v[i] = start + i * (stop-start)/size;
return v;
}
std::valarray<float> arange(float start, float step, float stop)
{
int size = (stop - start) / step;
valarray<float> v(size);
for(int i=0; i<size; i++) v[i] = start + step * i;
return v;
}
string psstm(string command)
{//return system call output as string
string s;
char tmp[1000];
FILE* f = popen(command.c_str(), "r");
while(fgets(tmp, sizeof(tmp), f)) s += tmp;
pclose(f);
return s;
}
string plot(const valarray<float>& x, const valarray<float>& y)
{
int sz = x.size();
assert(sz == y.size());
int bytes = sz * sizeof(float) * 2;
const char* name = "plot1";
int shm_fd = shm_open(name, O_CREAT | O_RDWR, 0666);
ftruncate(shm_fd, bytes);
float* ptr = (float*)mmap(0, bytes, PROT_WRITE, MAP_SHARED, shm_fd, 0);
for(int i=0; i<sz; i++) {
*ptr++ = x[i];
*ptr++ = y[i];
}
string command = "python plot.py ";
string s = psstm(command + to_string(sz));
shm_unlink(name);
return s;
}
Also, we need python script.
import sys, posix_ipc, os, struct
import matplotlib.pyplot as plt
sz = int(sys.argv[1])
f = posix_ipc.SharedMemory("plot1")
x = [0] * sz
y = [0] * sz
for i in range(sz):
x[i], y[i] = struct.unpack('ff', os.read(f.fd, 8))
os.close(f.fd)
plt.plot(x, y)
plt.show()
The C++11 standard says:
The valarray array classes are defined to be free of certain forms of
aliasing, thus allowing operations on these classes to be optimized.
See C++11 26.6.1-2.
With std::valarray you can use the standard mathematical notation like v1 = a*v2 + v3 out of the box. This is not possible with vectors unless you define your own operators.
std::valarray is intended for heavy numeric tasks, such as Computational Fluid Dynamics or Computational Structure Dynamics, in which you have arrays with millions, sometimes tens of millions of items, and you iterate over them in a loop with also millions of timesteps. Maybe today std::vector has a comparable performance but, some 15 years ago, valarray was almost mandatory if you wanted to write an efficient numeric solver.