For example, I have three float arrays, a, b and c, and I want to add a and b element-wise, storing the result in c. A naive way looks like this:
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}
As far as I know, OpenMP can parallelize this piece of code. In OpenCV code, I see some flags like CV_SSE2 and CV_NEON which are related to optimization.
What's the common way to optimize these kinds of code, if I want my code highly efficient?
There is no common strategy. You should be sure that it is a bottleneck (which it might not be, if the size n of your arrays is small enough).
Some compilers are able to optimize that (at least in some simple cases) by using vector machine instructions. With GCC, try compiling with gcc -O3 -mtune=native (or other -mtune=... or -mfpu=... arguments, in particular if you are cross-compiling), and possibly -ffast-math.
You could consider OpenMP, OpenCL (with a GPGPU), OpenACC, MPI, explicit threading with e.g. pthreads or C++11 std::thread-s, etc... (and a clever mix of several approaches)
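For the loop above, OpenMP alone is roughly a one-line change. A minimal sketch (my code, not from the question; assumes you compile with e.g. gcc -O3 -fopenmp, and that a, b, c and n are the arrays and size from the question):

void add_arrays(const float *a, const float *b, float *c, int n) {
    // Each iteration is independent, so the iterations can be split across
    // threads; "simd" additionally invites the compiler to vectorize each chunk.
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}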
I would leave the optimization to the compiler, and only consider improving this if you measure that it is a bottleneck. You could spend months or years of developer time improving it (or even specialize in that for your whole working life).
You could also use some numerical computation library (e.g. LAPACK, GSL, etc...) or specialized software like Scilab, Octave, R, etc...
Read also http://floating-point-gui.de/
You should continue looking into parallel options. But for single-threaded, it's generally faster to do it like this:
int i = 0;
for (; i < n - 3; i += 4) {
    c[i] = a[i] + b[i];
    c[i + 1] = a[i + 1] + b[i + 1];
    c[i + 2] = a[i + 2] + b[i + 2];
    c[i + 3] = a[i + 3] + b[i + 3];
}
for (; i < n; i++) {
    c[i] = a[i] + b[i];
}
Sometimes unrolling can be done by the compiler, but in my experience at least (I use MSVC), the compiler rarely attempts partial unrolling like this on its own, and doing it by hand can sometimes help. It can be beneficial when each of the four statements inside the loop can be pipelined and run in parallel, and it saves comparisons/jumps.
So I would use this as a starting point, and measure it. Then, only apply the parallelization if you measure a gain over this. Or, if you make your threads by hand, each thread should probably do the unrolled variant.
Update: I'm not personally seeing any gain from this. I think it's because a full 12 floats are accessed inside the unrolled loop, and the float operations are likely slow enough to negate any savings from the cmp/jge instructions that unrolling eliminates.
Still, whenever you have a similar problem with lighter, independent operations, I recommend at least trying this. Unrolling in the source generates clearly different assembly, gives you different performance characteristics, and reduces the number of cmp/jmp instructions by a factor of 4, which can help - but here I think the floating-point operations are simply too significant for it to matter.
As already mentioned by others, there is no single "common strategy"; it really depends on your particular use case: Are the arrays very large? Are they rather small, but you have to call this function very frequently? These are questions you will have to ask yourself. And before trying to optimize anything, you should always profile your code. In most applications, more than 90% of the time is spent in less than 10% of the code. Unless you know exactly where to find that 10%, optimizing other parts of the application can have little to no effect.
However, when it is about arithmetic computations, I think it is always a good start to rely on the optimized standard algorithms. When concerned about efficiency, I would add two arrays (after putting a and b in a std::vector or std::array and preallocating c) via
std::transform(a.begin(), a.end(), b.begin(),c.begin(), std::plus<float>());
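For example, a sketch of the full setup (my code; it assumes n is the element count from the question):

#include <algorithm>
#include <functional>
#include <vector>

std::vector<float> a(n), b(n);   // ... filled with your data ...
std::vector<float> c(n);         // preallocate the destination
std::transform(a.begin(), a.end(), b.begin(), c.begin(), std::plus<float>());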
Depending on your compiler's optimization stage, an array index a[i] may be slower than a pointer dereference *p (with p incremented in each iteration, so p = a + i).
So without relying on the optimizer this may be faster with some compilers:
float* pa = a;
float* pb = b;
float* pc = c;
for (int i = 0; i < n; i++)
    *pc++ = *pa++ + *pb++;
While it may seem trivial in this case, this basic technique can result in large gains in more complicated cases, where things are too convoluted for the optimizer to do the job.
Recently, I've been thinking about all the ways that one could iterate through an array and wondered which of these is the most (and least) efficient. I've written a hypothetical problem and five possible solutions.
Problem
Given an int array arr with len elements, what would be the most efficient way of assigning an arbitrary number (here, 42) to every element?
Solution 0: The Obvious
for (unsigned i = 0; i < len; ++i)
    arr[i] = 42;
Solution 1: The Obvious in Reverse
// With "i >= 0" an unsigned i would loop forever, so count down like this:
for (unsigned i = len; i-- > 0; )
    arr[i] = 42;
Solution 2: Address and Iterator
for (unsigned i = 0; i < len; ++i)
{
    *arr = 42;
    ++arr;
}
Solution 3: Address and Iterator in Reverse
for (unsigned i = len; i; --i)
{
    *arr = 42;
    ++arr;
}
Solution 4: Address Madness
int* end = arr + len;
for (; arr < end; ++arr)
    *arr = 42;
Conjecture
The obvious solutions are almost always used, but I wonder whether the subscript operator could result in a multiplication instruction, the address being computed as if it were *(arr + i * sizeof(int)) = 42.
The reverse solutions try to take advantage of the fact that comparing i against 0 instead of len might save an instruction. Because of this, I prefer Solution 3 over Solution 2. Also, I've read that arrays are optimized for forward access because of how they're prefetched into the cache, which could present an issue with Solution 1.
I don't see why Solution 4 would be any less efficient than Solution 2. Solution 2 increments the address and the iterator, while Solution 4 only increments the address.
In the end, I'm not sure which of these solutions I prefer. I think the answer also varies with the target architecture and the optimization settings of your compiler.
Which of these do you prefer, if any?
Just use std::fill.
std::fill(arr, arr + len, 42);
Out of your proposed solutions, on a good compiler, none should be faster than the others.
The ISO standard doesn't mandate the efficiency of the different ways of doing things in code (other than certain big-O type stuff for some collection algorithms), it simply mandates how it functions.
Unless your arrays are billions of elements in size, or you're wanting to set them millions of times per minute, it generally won't make the slightest difference which method you use.
If you really want to know (and I still maintain it's almost certainly unnecessary), you should benchmark the various methods in the target environment. Measure, don't guess!
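For example, a minimal harness along these lines (a sketch of mine; pass each variant as a lambda):

#include <chrono>

// Runs one fill variant `reps` times and returns the average seconds per run.
template <typename F>
double time_fill(F fill, int reps) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        fill();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / reps;
}

// Usage: time_fill([&]{ std::fill(arr, arr + len, 42); }, 1000);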
As to which I prefer, my first inclination is to optimise for readability. Only if there's a specific performance problem do I then consider other possibilities. That would be simply something like:
for (size_t idx = 0; idx < len; idx++)
    arr[idx] = 42;
I don't think performance is an issue here - these are micro-optimizations that are hardly ever necessary (I could imagine the compiler producing identical assembly for most of them).
Go with the solution that is most readable; the standard library provides you with std::fill, or for more complex assignments
for (unsigned k = 0; k < len; ++k)
{
    // whatever
}
so it is obvious to other people looking at your code what you are doing. With C++11 (and assuming arr is an actual array object or a container, not a raw pointer) you could also
for (auto & elem : arr)
{
    // whatever
}
just don't try to obfuscate your code without any necessity.
For nearly all meaningful cases, the compiler will optimize all of the suggested ones to the same thing, and it's very unlikely to make any difference.
There used to be a trick where you could avoid the automatic prefetching of data if you ran the loop backwards, which under some bizarre set of circumstances actually made it more efficient. I can't recall the exact circumstances, but I expect modern processors will identify backwards loops as well as forwards loops for automatic prefetching anyway.
If it's REALLY important for your application to do this over a large number of elements, then looking at blocked access and using non-temporal storage will be the most efficient. But before you do that, make sure you have identified the filling of the array as an important performance point, and then make measurements for the current code and the improved code.
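For illustration only, a rough sketch of what non-temporal stores can look like with SSE2 intrinsics (my code, not a tested recommendation; the assumptions are in the comments):

#include <emmintrin.h>  // SSE2
#include <cstddef>

// Assumes arr is 16-byte aligned and len is a multiple of 4.
void fill42_nontemporal(int *arr, std::size_t len) {
    const __m128i v = _mm_set1_epi32(42);
    for (std::size_t i = 0; i < len; i += 4)
        _mm_stream_si128((__m128i *)(arr + i), v);  // store bypassing the cache
    _mm_sfence();  // make the streaming stores visible before continuing
}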
I may come back with some actual benchmarks to prove that "it makes little difference" in a bit, but I've got an errand to run before it gets too late in the day...
Inspired by Herb Sutter's compelling lecture Not your father's C++, I decided to take another look at the latest version of C++ using Microsoft's Visual Studio 2010. I was particularly interested by Herb's assertion that C++ is "safe and fast" because I write a lot of performance-critical code.
As a benchmark, I decided to try to write the same simple FFT algorithm in a variety of languages.
I came up with the following C++11 code that uses the built-in complex type and vector collection:
#include <cmath>
#include <complex>
#include <vector>
using namespace std;
// Must provide type or MSVC++ barfs with "ambiguous call to overloaded function"
double pi = 4 * atan(1.0);
void fft(int sign, vector<complex<double>> &zs) {
    unsigned int j = 0;
    // Warning about signed vs unsigned comparison
    for (unsigned int i = 0; i < zs.size() - 1; ++i) {
        if (i < j) {
            auto t = zs.at(i);
            zs.at(i) = zs.at(j);
            zs.at(j) = t;
        }
        int m = zs.size() / 2;
        j ^= m;
        while ((j & m) == 0) { m /= 2; j ^= m; }
    }
    for (unsigned int j = 1; j < zs.size(); j *= 2)
        for (unsigned int m = 0; m < j; ++m) {
            auto t = pi * sign * m / j;
            auto w = complex<double>(cos(t), sin(t));
            for (unsigned int i = m; i < zs.size(); i += 2 * j) {
                complex<double> zi = zs.at(i), t = w * zs.at(i + j);
                zs.at(i) = zi + t;
                zs.at(i + j) = zi - t;
            }
        }
}
Note that this function only works for n-element vectors where n is an integral power of two. Anyone looking for fast FFT code that works for any n should look at FFTW.
As I understand it, the traditional xs[i] syntax from C for indexing a vector does not do bounds checking and, consequently, is not memory safe and can be a source of memory errors such as non-deterministic corruption and memory access violations. So I used xs.at(i) instead.
Now, I want this code to be "safe and fast", but I am not a C++11 expert, so I'd like to ask: what improvements to this code would make it more idiomatic or efficient?
I think you are being overly "safe" in your use of at(). In most of your cases, the index used is trivially verifiable as being constrained by the size of the container in the for loop.
e.g.
for (unsigned int i = 0; i < zs.size() - 1; ++i) {
    ...
    auto t = zs.at(i);
The only ones I'd leave as at()s are the (i + j)s. It's not immediately obvious whether they would always be constrained (although if I was really unsure I'd probably manually check - but I'm not familiar with FFTs enough to have an opinion in this case).
There are also some fixed computations being repeated for each loop iteration:
int m=zs.size()/2;
pi * sign
2*j
And the zs.at(i + j) is computed twice.
It's possible that the optimiser may catch these - but if you are treating this as performance critical, and you have your timers testing it, I'd hoist them out of the loops (or, in the case of zs.at(i + j), just take a reference on first use) and see if that impacts the timer.
Talking of second-guessing the optimiser: I'm sure that the calls to .size() will be inlined as, at worst, a direct read of an internal member variable - but given how many times you call it, I'd also experiment with introducing local variables for zs.size() and zs.size()-1 upfront. They're more likely to be put into registers that way, too.
I don't know how much of a difference (if any) all of this will have on your total runtime - some of it may already be caught by the optimiser, and the differences may be small compared to the computations involved - but worth a shot.
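To make the suggestions concrete, here is a sketch of the second block with the invariants hoisted (this goes inside fft(); the local names are mine, not the OP's, and the exact floating-point results may differ slightly from the original expression order):

const size_t n = zs.size();      // hoisted: zs.size()
const double ps = pi * sign;     // hoisted: pi * sign
for (size_t j = 1; j < n; j *= 2) {
    const size_t step = 2 * j;   // hoisted: 2*j
    for (size_t m = 0; m < j; ++m) {
        const auto t = ps * m / j;
        const auto w = complex<double>(cos(t), sin(t));
        for (size_t i = m; i < n; i += step) {
            auto &zij = zs.at(i + j);   // take the reference once, not twice
            const auto zi = zs.at(i), wt = w * zij;
            zs.at(i) = zi + wt;
            zij = zi - wt;
        }
    }
}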
As for being idiomatic, my only real comment is that size() returns a std::size_t (usually a typedef for an unsigned integer type - and it's more idiomatic to use that type instead). If you did want to use auto but avoid the warning, you could try adding the ul suffix to the 0 - not sure I'd call that idiomatic, though. I suppose you're already less than idiomatic in not using iterators here, but I can see why you can't do that (easily).
Update
I gave all my suggestions a try and they all had a measurable performance improvement - except the i+j and 2*j precalcs - they actually caused a slight slowdown! I presume they either prevented a compiler optimisation or prevented it from using registers for some things.
Overall I got a >10% perf. improvement with those suggestions.
I suspect more could be had if the second block of loops was refactored a little to avoid the jumps - and, having done so, enabling the SSE2 instruction set may give a significant boost (I did try it as-is and saw a slight slowdown).
I think that refactoring, along with using something like MKL for the cos and sin calls should give greater, and less brittle, improvements. And neither of those things would be language dependent (I know this was originally being compared to an F# implementation).
Update 2
I forgot to mention that pre-calculating zs.size() did make a difference.
Update 3
Also forgot to say (until reminded by #xeo in a comment to the OP) that the block following the i < j check can be boiled down to a std::swap. This is more idiomatic and at least as performant - in the worst case it should inline to the same code as written. Indeed, when I did it I saw no change in performance. In other cases it can lead to a performance gain if move constructors are available.
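That is, assuming <utility> (or <algorithm> pre-C++11) is included:

if (i < j)
    std::swap(zs.at(i), zs.at(j));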
I am a newbie at optimization. I have been reading some references on how to optimize C++ code, but I have a hard time applying them to real code. Therefore I just want to gather some real-world optimization techniques on how to squeeze as much juice from the CPU/memory as possible from the loop below
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
for (int t = 0; t < T; ++t) {
    sum += fun(a,b,c,d,e,f,sum);
    *(array + t) = sum;
}
where a,b,c,d,e,f are doubles and T is an int. Anything including but not limited to memory alignment, parallelism, OpenMP/MPI, and SSE instructions is welcome. The compiler is standard gcc, Microsoft, or any commonly available compiler. If the solution is compiler-specific, please specify the compiler and any option flags associated with your solution.
Thanks!
PS: Forgot to mention the properties of fun. Please assume that it is a simple function with no loop inside, composed only of basic arithmetic operations. Simply think of it as an inline function.
EDIT2: since the details of fun are important, please forget the parameters c, d, e, f and assume fun is defined as
inline double fun(double a, double b, double sum) {
    return sum + a * (b - sum);
}
Since sum depends on its previous values in a non-trivial way, it is impossible to parallelize the code (so OpenMP and MPI are out).
Memory alignment and SSE should be enforced/used automatically with appropriate compiler settings.
Besides inlining fun and unrolling the loop (by compiling in -O3) there is not much we can do if fun is completely general.
Since fun(a,b,sum) = sum + a*(b-sum), we have the closed form
array[t] = a*b/(a-1) * (1 - (2-a)^(t+1))
which can be vectorized and parallelized, but the division and exponentiation can be very expensive.
Nevertheless, with the closed form we can start the loop from any index: e.g. create 2 threads, one running the original loop from t = 0 to T/2-1, the other from t = T/2 to T-1, with each thread's initial sum computed using the closed-form solution above (a sketch follows). Also, if only a few values from the array are needed, they can be computed lazily.
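A sketch of that two-thread scheme (my code; fun is inlined manually, and array is the malloc'd buffer from the question):

#include <cmath>
#include <thread>

// Sum after iteration t, from the closed form above.
static double closed_sum(double a, double b, long t) {
    return a * b / (a - 1) * (1 - std::pow(2 - a, t + 1));
}

static void fill_range(double *array, double a, double b, long begin, long end) {
    // Seed with the sum just before index `begin`.
    double sum = (begin == 0) ? 0.0 : closed_sum(a, b, begin - 1);
    for (long t = begin; t < end; ++t) {
        sum += sum + a * (b - sum);   // sum += fun(a, b, sum)
        array[t] = sum;
    }
}

void fill_parallel(double *array, double a, double b, long T) {
    std::thread th(fill_range, array, a, b, 0L, T / 2);  // first half
    fill_range(array, a, b, T / 2, T);                   // second half on this thread
    th.join();
}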
And for SSE, you could first fill the array with (2-a)^(t+1), then apply x -> x - 1 to the whole array, and then apply x -> x * c to the whole array, where c = a*b/(1-a) - but there may be automatic vectorization already.
Unless fun() is very trivial (in which case, consider inlining it), it is likely to dominate anything else you can do to the loop.
You probably want to look at the algorithm inside fun()
one (very) minor optimization that can be done is:
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
double* pStart = array;
double* pEnd = array + T;
while (pStart < pEnd)
{
    sum += fun(a,b,c,d,e,f,sum);
    *pStart++ = sum;
}
This eliminates the addition of t to array on every iteration of the loop; the incrementation of t is replaced by the incrementation of pStart. For small iteration counts (think fewer than 3 - in which case the loop should be fully unrolled), there is no real gain. The compiler should do this automatically, but sometimes it needs a little encouragement.
Also, depending on the size range of T, it might be possible to gain performance by using a variable-length array (which would be stack-allocated) or an aligned _alloca.
The code you have above is about as fast as you can make it.
Memory alignment is normally handled well enough by malloc anyway.
The code cannot be parallelized because f is a function of previous sums (so you can't break up the computation into chunks).
The computations are not specified so it's unclear whether SSE or CUDA or anything like that would be applicable.
Likewise, you can't perform any useful loop-unrolling based on the properties of f because we don't know what it does.
(Stylistically, I'd use array[t] since it's clearer what's going on and it is no slower.)
Edit: now that we have f(a,b,sum) = sum + a*(b-sum), we can try loop unrolling by hand to see if there's some pattern. Like so (where I'm using ** to mean "to the power of"):
sum(n) = sum(n-1) + sum(n-1) + a*(b-sum(n-1)) = (2-a)*sum(n-1) + a*b
sum(n) = (2-a)*( (2-a)*sum(n-2) + a*b ) + a*b
. . .
sum(n) = a*b*(2-a)**n + a*b*(2-a)**(n-1) + ... + a*b
sum(n) = a*b*( (2-a)**0 + (2-a)**1 + ... + (2-a)**n )
Well, now, isn't that interesting! We've converted from a recurrent formula to a geometric series! And, you may recall that the geometric series
SUM( x**n , n = 0..N ) = (x**(N+1) - 1) / (x - 1)
so that
sum(n) = a*b*( (pow(2-a,n+1) - 1) / (1-a) )
Now that you've done that math, you can start in on the sum anywhere you want (using the somewhat expensive pow computation). If you have M free processors, and your array is long, you can split it into M equal pieces, use the above computation to find the first sum, and then use the recurrence formula you were using before (with the function) to fill the rest in.
At the very least, you could calculate a*b and 2-a separately and use those instead of the existing function:
sum = ab + twominusa*sum
That cuts the math in your inner loop in half, approximately.
Accept #KennyTM's answer. He is wrong to state that the computation is not parallelisable - as he himself goes on to show, it is. In showing that you can rewrite your recurrence relation in closed form, he illustrates a very general principle of optimising programs: choose the best algorithm you can find and implement that. None of the micro-optimisations that the other answers suggest will come close to computing the closed form and spreading the computation across many processors in parallel.
And, lest anyone offer the suggestion that this is just an example for learning parallelisation, I contend that #KennyTM's answer still holds good -- don't learn to optimise fragments of code, learn to optimise computations. Choose the best algorithm for your purposes, implement it well and only then worry about performance.
Have a look at callgrind, part of the valgrind toolset. Run your code through that and see if anything sticks out as taking an unusually large amount of time. Then you'll know what needs optimizing. Otherwise, you're just guessing, and you (as well as the rest of us) will likely guess wrong.
Just a couple of suggestions that haven't come up yet. I'm a bit out of date when it comes to modern PC-style processors, so they may make no significant difference.
Using float might be faster than double if you can tolerate the lower precision. Integer-based fixed point might be faster still, depending on how well floating point operations are pipelined.
Counting down from T to zero (and incrementing array each iteration) might be slightly faster - certainly, on an ARM processor this would save one cycle per loop.
Another very minor optimization would be to turn the for() into
while (T-- > 0)
as comparing with zero is usually faster than comparing two arbitrary integers. (Note that --T as the condition would run one iteration too few.)
I would enable vector processing on the compiler. You could rewrite the code to open up the loops yourself, but the compiler will do it for you if it is a recent enough version. You could also use array + t as a pointer instead of the loop index, which means the array access won't use a multiply - though again, the optimizer might do this for you. Finally, you could use the switch that dumps the generated assembler code, and use that to see what you can change in the code to make it run faster.
Following the excellent answer by #KennyTM, I'd say the fastest way to do it sequentially should be:
double acc = 1, *array;
array = (double*) malloc(T * sizeof(double));
// the compiler should be able to derive those constant expressions, but let's do it explicitly anyway
const double k = a*b/(a-1);
const double twominusa = 2 - a;
for (int t = 0; t < T; ++t) {
    acc *= twominusa;
    array[t] = k*(1-acc);
}
Rather than having the compiler unroll the loop, you could unroll it yourself and add some data prefetching. Search the web for data-driven design in C++. Here is an example of loop unrolling and prefetching data:
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
// Calculate the number of unrolled blocks and the
// remaining iterations.
unsigned int iterations = T / 4;
unsigned int remaining_iterations = T % 4;
double sum1;
double sum2;
double sum3;
double sum4;
double * p_array = array;
for (unsigned int block = 0; block < iterations; ++block)
{
    // Do some data precalculation
    sum += fun(a,b,c,d,e,f,sum);
    sum1 = sum;
    sum += fun(a,b,c,d,e,f,sum);
    sum2 = sum;
    sum += fun(a,b,c,d,e,f,sum);
    sum3 = sum;
    sum += fun(a,b,c,d,e,f,sum);
    sum4 = sum;
    // Do a "block" transfer to the array.
    p_array[0] = sum1;
    p_array[1] = sum2;
    p_array[2] = sum3;
    p_array[3] = sum4;
    p_array += 4;
}
// Handle the few remaining calculations
for (unsigned int t = 0; t < remaining_iterations; ++t)
{
    sum += fun(a,b,c,d,e,f,sum);
    p_array[t] = sum;
}
The big hit here is the call to the fun function. There are hidden setup and restore instructions involved in executing a function. Also, the call forces a branch, which will cause instruction pipelines to be flushed and reloaded (or cause the processor to waste time on branch prediction).
Another time performance hit is the number of variables passed to the function. These variables have to be placed on the stack and copied into the function, which takes time.
Many computers have a dedicated multiply-accumulate unit implemented in the processor's hardware. Depending on your ultimate algorithm and target platform, you may be able to use this if the compiler isn't already using it when it optimizes.
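For what it's worth, the fun from EDIT2 is exactly one multiply-accumulate: a*(b - sum) + sum. A sketch using C++11's std::fma (assuming hardware FMA support and that you compile with it enabled, e.g. -mfma on gcc):

#include <cmath>

inline double fun(double a, double b, double sum) {
    return std::fma(a, b - sum, sum);   // a*(b - sum) + sum, with a single rounding
}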
The following may not be worth it, but ....
The routine fun() takes seven (7) parameters.
Changing the order of the parameters to fun(sum, a, b, c, d, e, f) could help, IF the compiler can take advantage of the following scenario. Parameters a through f appear to be invariant; only sum appears to be changing at this level in the code. As parameters are pushed onto the stack in C/C++ from right to left, if parameters a through f truly are invariant, then the compiler could in theory optimize the pushing of the stack variables. In other words, a through f would only need to be pushed onto the stack once, and sum could in theory be the only parameter pushed and popped while in the loop.
I do not know if the compiler would take advantage of such a scenario, but I am tossing it out there as a possibility. Disassembling could verify it as true or false, and profiling would indicate how much of a benefit that may be if true.
Is it a good idea to vectorize the code? What are good practices in terms of when to do it? What happens underneath?
Vectorization means that the compiler detects that your independent instructions can be executed as one SIMD instruction. Usual example is that if you do something like
for (i = 0; i < N; i++) {
    a[i] = a[i] + b[i];
}
It will be vectorized as (using vector notation)
for (i = 0; i < (N - N % VF); i += VF) {
    a[i : i + VF] = a[i : i + VF] + b[i : i + VF];
}
Basically the compiler picks one operation that can be done on VF elements of the array at the same time and does this N/VF times instead of doing the single operation N times.
It increases performance, but places more requirements on the architecture.
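Written by hand, the same transformation looks roughly like this with SSE intrinsics (a sketch of mine, with VF = 4 floats per 128-bit register and unaligned loads/stores for generality):

#include <xmmintrin.h>  // SSE

void add_sse(float *a, const float *b, int N) {
    int i = 0;
    for (; i < N - 3; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb));   // 4 additions at once
    }
    for (; i < N; i++)   // scalar epilogue for the N % VF leftovers
        a[i] = a[i] + b[i];
}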
As mentioned above, vectorization is used to make use of SIMD instructions, which perform identical operations on different data elements packed into wide registers.
A generic guideline for enabling a compiler to autovectorize a loop is to ensure that there are no flow- and anti-dependencies between data elements in different iterations of the loop.
http://en.wikipedia.org/wiki/Data_dependency
Some compilers, like the Intel C++/Fortran compilers, are capable of autovectorizing code. In case it was not able to vectorize a loop, the Intel compiler can report why it could not. These reports can be used to modify the code so that it becomes vectorizable (assuming that's possible).
Dependencies are covered in depth in the book 'Optimizing Compilers for Modern Architectures: A Dependence-based Approach'
Vectorization need not be limited to a single register that can hold wide data, like using a '128' bit register to hold '4 x 32' bit data. It depends on architectural limitations. Some architectures have separate execution units with registers of their own; in that case, a part of the data can be fed to such an execution unit, and the result taken from a register corresponding to that execution unit.
For example, consider the below case.
for (i = 0; i < N; i++) {
    a[i] = a[i] + b[i];
}
If I am working on an architecture which has two execution units, then my vector size is defined as two. The loop mentioned above will be reframed as
for (i = 0; i < N; i += 2) {
    a[i] = a[i] + b[i];
    a[i+1] = a[i+1] + b[i+1];
}
NOTE: The 2 inside the for statement is derived from the vector size.
As I have two execution units, the two statements inside the loop will be fed to the two execution units, which compute their additions separately; the results from both units are then written back to the array.
The good practices are:
1. Constraints like dependencies (between different iterations of the loop) need to be checked before vectorizing the loop.
2. Function calls inside the loop need to be avoided.
3. Pointer accesses can create aliasing and need to be avoided (see the sketch below).
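On point 3, a common idiom is to promise the compiler that the pointers do not overlap - a sketch (the __restrict spelling is accepted by GCC, Clang and MSVC; the promise is yours to keep):

// Without this, the compiler may refuse to vectorize because a and b could alias.
void add(float *__restrict a, const float *__restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}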
It's SSE code generation.
If you have a loop with float matrix code in it, e.g. matrix1[i][j] + matrix2[i][j], the compiler generates SSE code for it.
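For instance, a loop of this shape (a sketch; built with e.g. gcc -O3, optionally -fopt-info-vec to see what was vectorized, or MSVC /O2):

void add_matrices(const float m1[64][64], const float m2[64][64], float out[64][64]) {
    for (int i = 0; i < 64; i++)
        for (int j = 0; j < 64; j++)
            out[i][j] = m1[i][j] + m2[i][j];   // contiguous, independent: SSE-friendly
}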
Maybe also have a look at libSIMDx86 (source code).
A nice example well explained is:
Choosing to Avoid Branches: A Small Altivec Example