How much can this c/c++ loop be optimized? [closed] - c++

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I am a newbie on optimization. I have been reading some reference on how to optimize c++ code but I have a hard time applying it to real code. Therefore I just want to gather some real world optimization technique on how to squeeze as much juice from the CPU/Memory as possible from the loop below
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
for(int t = 0; t < T; ++t){
sum += fun(a,b,c,d,e,f,sum);
*(array+t) = sum;
}
where a,b,c,d,e,f are double and T is int. Anything including but not limited to memory alignment, parallelism, openmp/MPI, and SSE instructions are welcome. Compiler is standard gcc, microsoft, or commonly available compiler. If the solution is compiler specific, please specific compiler and any option flag associate with your solution.
Thanks!
PS: Forgot to mention properties fun. Please assume that it is a simple function with no loop inside, and compose of only basic arithmetic operation. Simply think of it as an inline function.
EDIT2: since the details of fun is important, please forget the parameter c, d, e, f and assume fun is defined as
inline double fun(a,b, sum){
return sum + a* ( b - sum);
}

Since sum depends on its previous values in a non-trivial way, it is impossible to parallelize the code (so OpenMP and MPI are out).
Memory alignment and SSE should be enforced/used automatically with appropriate compiler settings.
Besides inlining fun and unrolling the loop (by compiling in -O3) there is not much we can do if fun is completely general.
Since fun(a,b,sum) = sum + a*(b-sum), we have the closed form
ab t+1
array[t] = ———— ( 1 - (2-a) )
a-1
which can be vectorized and parallelized, but the division and exponentiation can be very expensive.
Nevertheless, with the closed form we can start the loop from any index, e.g. create 2 threads one from t = 0 to T/2-1, another from t = T/2 to T-1, which perform the original loop, but the initial sum is computed using the above closed form solution. Also, if only a few values from the array is needed this can be computed lazily.
And for SSE, you could first fill the array first with (2-a)^(t+1), and then apply x :-> x - 1 to the whole array, and then apply x :-> x * c to the whole array where c = a*b/(1-a), but there may be automatic vectorization already.

Unless fun() is very trivial - in which case consider inline, it is likely to dominate anything else you can do to the loop.
You probably want to look at the algorithm inside fun()

one (very) minor optimization that can be done is:
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
double* pStart = array;
double* pEnd = array + T;
while(pStart < pEnd)
{
sum += fun(a,b,c,d,e,f,sum);
*pStart++ = sum;
}
this eliminates the addition of t to array for every iteration of the loop, the incrementation of t is replaced by the incrementation of pStart, for small sets of iterations(think less than 3, in that case the loop should be derolled), there is no real gain. the compiler should do this automatically, but sometimes it needs a little encouragement.
also depending on the size ranges of T it might be possible to gain performance by using a variable sized array(which would be stack allocated) or aligned _alloca

The code you have above is about as fast as you can make it.
Memory alignment is normally handled well enough by malloc anyway.
The code cannot be parallelized because f is a function of previous sums (so you can't break up the computation into chunks).
The computations are not specified so it's unclear whether SSE or CUDA or anything like that would be applicable.
Likewise, you can't perform any useful loop-unrolling based on the properties of f because we don't know what it does.
(Stylistically, I'd use array[t] since it's clearer what's going on and it is no slower.)
Edit: now that we have f(a,b,sum) = sum + a*(b-sum), we can try loop unrolling by hand to see if there's some pattern. Like so (where I'm using ** to mean "to the power of"):
sum(n) = sum(n-1) + sum(n-1) + a*(b-sum(n-1)) = (2-a)*sum(n-1) + a*b
sum(n) = (2-a)*( (2-a)*sum(n-2) + a*b ) + a*b
. . .
sum(n) = a*b*(2-a)**n + a*b*(2-a)**(n-1) + ... + a*b
sum(n) = a*b*( (2-a)**0 + (2-a)**1 + ... + (2-a)**n )
Well, now, isn't that interesting! We've converted from a recurrent formula to a geometric series! And, you may recall that the geometric series
SUM( x^n , n = 0..N ) = (x**(n+1) - 1) / (x - 1)
so that
sum(n) = a*b*( (pow(2-a,n+1) - 1) / (1-a) )
Now that you've done that math, you can start in on the sum anywhere you want (using the somewhat expensive pow computation). If you have M free processors, and your array is long, you can split it into M equal pieces, use the above computation to find the first sum, and then use the recurrence formula you were using before (with the function) to fill the rest in.
At the very least, you could calculate a*b and 2-a separately and use those instead of the existing function:
sum = ab + twonega*sum
That cuts the math in your inner loop in half, approximately.

Accept #KennyTM's answer. He is wrong to state that the computation is not parallelisable, as he goes on to show. In showing that you can rewrite your recurrence relation in closed form he illustrates a very general principle of optimising programs -- choose the best algorithm you can find and implement that. None of the micro-optimisations that the other answers suggest will come close to computing the closed form and spreading the computation across many processors in parallel.
And, lest anyone offer the suggestion that this is just an example for learning parallelisation, I contend that #KennyTM's answer still holds good -- don't learn to optimise fragments of code, learn to optimise computations. Choose the best algorithm for your purposes, implement it well and only then worry about performance.

Have a look at callgrind, part of the valgrind toolset. Run your code through that and see if anything sticks out as taking an unusually large amount of time. Then you'll know what needs optimizing. Otherwise, you're just guessing, and you (as well as the rest of us) will likely guess wrong.

Just a couple of suggestions that haven't come up yet. I'm a bit out of date when it comes to modern PC-style processors, so they may make no significant difference.
Using float might be faster than double if you can tolerate the lower precision. Integer-based fixed point might be faster still, depending on how well floating point operations are pipelined.
Counting down from T to zero (and incrementing array each iteration) might be slightly faster - certainly, on an ARM processor this would save one cycle per loop.

another very minor optimization would be to turn the for() into
while (--T)
as comparing with zero is usually faster than comparing two random integers.

I would enable vector processing on the compiler. You could rewrite the code to open up the loops but the compiler will do it for you. If it is a later version.
You could use t+array as the for loop increment... again the optimizer might do this.
means your array index won't use a multiply again optimizer might do this.
You could use the switch to dump the generated assembler code and using that see what you can change in the code to make it run faster.

Following the excellent answer by #KennyTM, I'd say the fastest way to do it sequentially should be:
double acc = 1, *array;
array = (double*) malloc(T * sizeof(double));
// the compiler should be able to derive those constant expressions, but let's do it explicitly anyway
const double k = a*b/(a-1);
const double twominusa = 2 - a;
for(int t = 0; t < T; ++t){
acc *= twominusa;
array[t] = k*(1-acc);
}

Rather than having the compiler unroll the loop, you could unroll the loop and and some data prefetching. Search the web for data driven design c++. Here is an example of loop unrolling and prefetching data:
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
// Calculate the number iterations and the
// remaining iterations.
unsigned int iterations = T / 4;
unsigned int remaining_iterations = T % 4;
double sum1;
double sum2;
double sum3;
double sum4;
double * p_array = array;
for(int t = 0; t < T; T += 4)
{
// Do some data precalculation
sum += fun(a,b,c,d,e,f,sum);
sum1 = sum;
sum += fun(a,b,c,d,e,f,sum);
sum2 = sum;
sum += fun(a,b,c,d,e,f,sum);
sum3 = sum;
sum += fun(a,b,c,d,e,f,sum);
sum4 = sum;
// Do a "block" transfer to the array.
p_array[0] = sum1;
p_array[1] = sum2;
p_array[2] = sum3;
p_array[3] = sum4;
p_array += 4;
}
// Handle the few remaining calculations
for (t = 0; t < remaining_iterations; ++t)
{
sum += fun(a,b,c,d,e,f,sum);
p_array[t] = sum;
}
The big hit here is the call to the fun function. There are hidden setup and restore instructions involved when executing a function. Also, the call forces a branch which will cause instruction pipelines to be flushed and reloaded (or cause to processor to waste time in branch prediction).
Another time performance hit is the number of variables passed to the function. These variables have to be placed on the stack and copied into the function, which takes time.

Many computers have a dedicated multiply-accumulate unit implemented in the processor's hardware. Depending on your ultimate algorithm and target platform, you may be able to use this if the compiler isn't already using it when it optimizes.

The following may not be worth it, but ....
The routine fun() takes seven (7) parameters.
Changing the order of the parameters to fun (sum, a, b, c, d, e, f) could help, IF the compiler can take advantage of the following scenario. Parameters through appear to be invariant, and only appears to be changing at this level in the code. As parameters are pushed onto the stack in C/C++ from right to left, if parameters through truly are invariant, then the compiler could in theory optimize the pushing of the stack variables. In other words, through would only need to be pushed onto the stack once, and could in theory be the only parameter pushed and popped while in the loop.
I do not know if the compiler would take advantage of such a scenario, but I am tossing it out there as a possibility. Disassembling could verify it as true or false, and profiling would indicate how much of a benefit that may be if true.

Related

Speeding Up Arithmetic Operation in C++ [duplicate]

For example, I have three float arrays, a, b and c, and I want to add a and b element-wisely up to c. A naive way is like
for(int i = 0; i < n; i++){
c[i] = a[i] + b[i];
}
As far as I know, OpenMP can parallelize this piece of code. In OpenCV code, I see some flags like CV_SSE2 and CV_NEON which are related to optimization.
What's the common way to optimize these kinds of code, if I want my code highly efficient?
There is no common strategy. You should be sure that it is a bottleneck (which it might not be, if the size n of your arrays is small enough).
Some compilers are able to optimize that (at least in some simple cases) by using vector machine instructions. With GCC try to compile with gcc -O3 -mtune=native (or other -mtune=... or -mfpu=... arguments, in particular if you are cross-compiling) and possibly -ffast-math
You could consider OpenMP, OpenCL (with a GPGPU), OpenACC, MPI, explicit threading with e.g. pthreads or C++11 std::thread-s, etc... (and a clever mix of several approaches)
I would leave the optimization to the compiler, and only consider improving that if you measure that it is a bottleneck. You could spend months or years (or even specialize in that for your whole work life) of your developer time to improve it ....
You could also use some numerical computation library (e.g. LAPACK, GSL, etc...) or specialized software like Scilab, Octave, R, etc...
Read also http://floating-point-gui.de/
You should continue looking into parallel options. But for single-threaded, it's generally faster to do it like this:
int i = 0;
for (; i < n - 3; i += 4) {
c[i] = a[i] + b[i];
c[i + 1] = a[i + 1] + b[i + 1];
c[i + 2] = a[i + 2] + b[i + 2];
c[i + 3] = a[i + 3] + b[i + 3];
}
for (; i < n; i++) {
c[i] = a[i] + b[i];
}
Sometimes unrolling can be done by the compiler, but at least in my experience (I use MSC), the compiler typically never tries to perform any partial unrolling like this, and sometimes it can help. This can be beneficial when each of the 4 things inside the loop can be pipelined and running in parallel and it saves comparisons/jumps.
So I would use this as a starting point, and measure it. Then, only apply the parallelization if you measure a gain over this. Or, if you make your threads by hand, each thread should probably do the unrolled variant.
Update: I'm not personally seeing any gain from this. I think it's because inside the unrolled loop, a full 12 floats are accessed. And the float operations are likely slow enough to negate any savings from the jge/cmp operations that are eliminated by unrolling it.
Still, whenever you have a similar problem, with lighter, independent operations, I still recommend at least trying this, because it generates clearly different assembly when you unroll it in the code and you'll get some different perf characteristics and reduce the number of cmp/jmp by a factor of 4, which can help but I think the floating point operations are just too significant for this to matter here.
As already mentioned by others, there is not the "common strategy" but it really depends on your particular use case: Are the arrays very large? Are they rather small but you have to call this function very frequently? Such question you will have to ask yourself. And before trying to optimize anything, you should always profile your code. In most applications more than 90% of the time is spend in only less than 10% of the code. Unless you know exactly where to find this 10% it can have little to no effect to optimize parts of the application.
However, when it is about arithmetic computations, I think it is always a good start to rely on the optimized standard algorithms. When concerned about efficiency, I would add two arrays (after putting a and b in a std::vector or std::array and preallocating c) via
std::transform(a.begin(), a.end(), b.begin(),c.begin(), std::plus<float>());
Depending on your compiler's optimization stage an array index a[i]may be slower than a pointer dereference *p (with p incremented in each iteration so p = a+i)
So without relying on the optimizer this may be faster with some compilers:
float* pa = a;
float* pb = b;
float* pc = c;
for(int i = 0; i < n; i++)
*pc++ = *pa++ + *pb++;
While it may seem trivial in this case, this basic technique can result in large gains in more complicated cases where things are too complicated for the optimizer to do the job.

What's the common strategy to optimize c++ arithmetic computation for arrays?

For example, I have three float arrays, a, b and c, and I want to add a and b element-wisely up to c. A naive way is like
for(int i = 0; i < n; i++){
c[i] = a[i] + b[i];
}
As far as I know, OpenMP can parallelize this piece of code. In OpenCV code, I see some flags like CV_SSE2 and CV_NEON which are related to optimization.
What's the common way to optimize these kinds of code, if I want my code highly efficient?
There is no common strategy. You should be sure that it is a bottleneck (which it might not be, if the size n of your arrays is small enough).
Some compilers are able to optimize that (at least in some simple cases) by using vector machine instructions. With GCC try to compile with gcc -O3 -mtune=native (or other -mtune=... or -mfpu=... arguments, in particular if you are cross-compiling) and possibly -ffast-math
You could consider OpenMP, OpenCL (with a GPGPU), OpenACC, MPI, explicit threading with e.g. pthreads or C++11 std::thread-s, etc... (and a clever mix of several approaches)
I would leave the optimization to the compiler, and only consider improving that if you measure that it is a bottleneck. You could spend months or years (or even specialize in that for your whole work life) of your developer time to improve it ....
You could also use some numerical computation library (e.g. LAPACK, GSL, etc...) or specialized software like Scilab, Octave, R, etc...
Read also http://floating-point-gui.de/
You should continue looking into parallel options. But for single-threaded, it's generally faster to do it like this:
int i = 0;
for (; i < n - 3; i += 4) {
c[i] = a[i] + b[i];
c[i + 1] = a[i + 1] + b[i + 1];
c[i + 2] = a[i + 2] + b[i + 2];
c[i + 3] = a[i + 3] + b[i + 3];
}
for (; i < n; i++) {
c[i] = a[i] + b[i];
}
Sometimes unrolling can be done by the compiler, but at least in my experience (I use MSC), the compiler typically never tries to perform any partial unrolling like this, and sometimes it can help. This can be beneficial when each of the 4 things inside the loop can be pipelined and running in parallel and it saves comparisons/jumps.
So I would use this as a starting point, and measure it. Then, only apply the parallelization if you measure a gain over this. Or, if you make your threads by hand, each thread should probably do the unrolled variant.
Update: I'm not personally seeing any gain from this. I think it's because inside the unrolled loop, a full 12 floats are accessed. And the float operations are likely slow enough to negate any savings from the jge/cmp operations that are eliminated by unrolling it.
Still, whenever you have a similar problem, with lighter, independent operations, I still recommend at least trying this, because it generates clearly different assembly when you unroll it in the code and you'll get some different perf characteristics and reduce the number of cmp/jmp by a factor of 4, which can help but I think the floating point operations are just too significant for this to matter here.
As already mentioned by others, there is not the "common strategy" but it really depends on your particular use case: Are the arrays very large? Are they rather small but you have to call this function very frequently? Such question you will have to ask yourself. And before trying to optimize anything, you should always profile your code. In most applications more than 90% of the time is spend in only less than 10% of the code. Unless you know exactly where to find this 10% it can have little to no effect to optimize parts of the application.
However, when it is about arithmetic computations, I think it is always a good start to rely on the optimized standard algorithms. When concerned about efficiency, I would add two arrays (after putting a and b in a std::vector or std::array and preallocating c) via
std::transform(a.begin(), a.end(), b.begin(),c.begin(), std::plus<float>());
Depending on your compiler's optimization stage an array index a[i]may be slower than a pointer dereference *p (with p incremented in each iteration so p = a+i)
So without relying on the optimizer this may be faster with some compilers:
float* pa = a;
float* pb = b;
float* pc = c;
for(int i = 0; i < n; i++)
*pc++ = *pa++ + *pb++;
While it may seem trivial in this case, this basic technique can result in large gains in more complicated cases where things are too complicated for the optimizer to do the job.

Which one is more optimized for accessing array?

Solving the following exercise:
Write three different versions of a program to print the elements of
ia. One version should use a range for to manage the iteration, the
other two should use an ordinary for loop in one case using subscripts
and in the other using pointers. In all three programs write all the
types directly. That is, do not use a type alias, auto, or decltype to
simplify the code.[C++ Primer]
a question came up: Which of these methods for accessing array is optimized in terms of speed and why?
My Solutions:
Foreach Loop:
int ia[3][4]={{1,2,3,4},{5,6,7,8},{9,10,11,12}};
for (int (&i)[4]:ia) //1st method using for each loop
for(int j:i)
cout<<j<<" ";
Nested for loops:
for (int i=0;i<3;i++) //2nd method normal for loop
for(int j=0;j<4;j++)
cout<<ia[i][j]<<" ";
Using pointers:
int (*i)[4]=ia;
for(int t=0;t<3;i++,t++){ //3rd method. using pointers.
for(int x=0;x<4;x++)
cout<<(*i)[x]<<" ";
Using auto:
for(auto &i:ia) //4th one using auto but I think it is similar to 1st.
for(auto j:i)
cout<<j<<" ";
Benchmark result using clock()
1st: 3.6 (6,4,4,3,2,3)
2nd: 3.3 (6,3,4,2,3,2)
3rd: 3.1 (4,2,4,2,3,4)
4th: 3.6 (4,2,4,5,3,4)
Simulating each method 1000 times:
1st: 2.29375 2nd: 2.17592 3rd: 2.14383 4th: 2.33333
Process returned 0 (0x0) execution time : 13.568 s
Compiler used:MingW 3.2 c++11 flag enabled. IDE:CodeBlocks
I have some observations and points to make and I hope you get your answer from this.
The fourth version, as you mention yourself, is basically the same as the first version. auto can be thought of as only a coding shortcut (this is of course not strictly true, as using auto can result in getting different types than you'd expected and therefore result in different runtime behavior. But most of the time this is true.)
Your solution using pointers is probably not what people mean when they say that they are using pointers! One solution might be something like this:
for (int i = 0, *p = &(ia[0][0]); i < 3 * 4; ++i, ++p)
cout << *p << " ";
or to use two nested loops (which is probably pointless):
for (int i = 0, *p = &(ia[0][0]); i < 3; ++i)
for (int j = 0; j < 4; ++j, ++p)
cout << *p << " ";
from now on, I'm assuming this is the pointer solution you've written.
In such a trivial case as this, the part that will absolutely dominate your running time is the cout. The time spent in bookkeeping and checks for the loop(s) will be completely negligible comparing to doing I/O. Therefore, it won't matter which loop technique you use.
Modern compilers are great at optimizing such ubiquitous tasks and access patterns (iterating over an array.) Therefore, chances are that all these methods will generate exactly the same code (with the possible exception of the pointer version, which I will talk about later.)
The performance of most codes like this will depend more on the memory access pattern rather than how exactly the compiler generates the assembly branch instructions (and the rest of the operations.) This is because if a required memory block is not in the CPU cache, it's going to take a time roughly equivalent of several hundred CPU cycles (this is just a ballpark number) to fetch those bytes from RAM. Since all the examples access memory in exactly the same order, their behavior in respect to memory and cache will be the same and will have roughly the same running time.
As a side note, the way these examples access memory is the best way for it to be accessed! Linear, consecutive and from start to finish. Again, there are problems with the cout in there, which can be a very complicated operation and even call into the OS on every invocation, which might result, among other things, an almost complete deletion (eviction) of everything useful from the CPU cache.
On 32-bit systems and programs, the size of an int and a pointer are usually equal (both are 32 bits!) Which means that it doesn't matter much whether you pass around and use index values or pointers into arrays. On 64-bit systems however, a pointer is 64 bits but an int will still usually be 32 bits. This suggests that it is usually better to use indexes into arrays instead of pointers (or even iterators) on 64-bit systems and programs.
In this particular example, this is not significant at all though.
Your code is very specific and simple, but the general case, it is almost always better to give as much information to the compiler about your code as possible. This means that you must use the narrowest, most specific device available to you to do a job. This in turn means that a generic for loop (i.e. for (int i = 0; i < n; ++i)) is worse than a range-based for loop (i.e. for (auto i : v)) for the compiler, because in the latter case the compiler simply knows that you are going to iterate over the whole range and not go outside of it or break out of the loop or something, while in the generic for loop case, specially if your code is more complex, the compiler cannot be sure of this and has to insert extra checks and tests to make sure the code executes as the C++ standard says it should.
In many (most?) cases, although you might think performance matters, it does not. And most of the time you rewrite something to gain performance, you don't gain much. And most of the time the performance gain you get is not worth the loss in readability and maintainability that you sustain. So, design your code and data structures right (and keep performance in mind) but avoid this kind of "micro-optimization" because it's almost always not worth it and even harms the quality of the code too.
Generally, performance in terms of speed is very hard to reason about. Ideally you have to measure the time with real data on real hardware in real working conditions using sound scientific measuring and statistical methods. Even measuring the time it takes a piece of code to run is not at all trivial. Measuring performance is hard, and reasoning about it is harder, but these days it is the only way of recognizing bottlenecks and optimizing the code.
I hope I have answered your question.
EDIT: I wrote a very simple benchmark for what you are trying to do. The code is here. It's written for Windows and should be compilable on Visual Studio 2012 (because of the range-based for loops.) And here are the timing results:
Simple iteration (nested loops): min:0.002140, avg:0.002160, max:0.002739
Simple iteration (one loop): min:0.002140, avg:0.002160, max:0.002625
Pointer iteration (one loop): min:0.002140, avg:0.002160, max:0.003149
Range-based for (nested loops): min:0.002140, avg:0.002159, max:0.002862
Range(const ref)(nested loops): min:0.002140, avg:0.002155, max:0.002906
The relevant numbers are the "min" times (over 2000 runs of each test, for 1000x1000 arrays.) As you see, there is absolutely no difference between the tests. Note that you should turn on compiler optimizations or test 2 will be a disaster and cases 4 and 5 will be a little worse than 1 and 3.
And here are the code for the tests:
// 1. Simple iteration (nested loops)
unsigned sum = 0;
for (unsigned i = 0; i < gc_Rows; ++i)
for (unsigned j = 0; j < gc_Cols; ++j)
sum += g_Data[i][j];
// 2. Simple iteration (one loop)
unsigned sum = 0;
for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
sum += g_Data[i / gc_Cols][i % gc_Cols];
// 3. Pointer iteration (one loop)
unsigned sum = 0;
unsigned * p = &(g_Data[0][0]);
for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
sum += *p++;
// 4. Range-based for (nested loops)
unsigned sum = 0;
for (auto & i : g_Data)
for (auto j : i)
sum += j;
// 5. Range(const ref)(nested loops)
unsigned sum = 0;
for (auto const & i : g_Data)
for (auto const & j : i)
sum += j;
It has many factors affecting it:
It depends on the compiler
It depends on the compiler flags used
It depends on the computer used
There is only one way to know the exact answer: measuring the time used when dealing with huge arrays (maybe from a random number generator) which is the same method you have already done except that the array size should be at least 1000x1000.

Safe and fast FFT

Inspired by Herb Sutter's compelling lecture Not your father's C++, I decided to take another look at the latest version of C++ using Microsoft's Visual Studio 2010. I was particularly interested by Herb's assertion that C++ is "safe and fast" because I write a lot of performance-critical code.
As a benchmark, I decided to try to write the same simple FFT algorithm in a variety of languages.
I came up with the following C++11 code that uses the built-in complex type and vector collection:
#include <complex>
#include <vector>
using namespace std;
// Must provide type or MSVC++ barfs with "ambiguous call to overloaded function"
double pi = 4 * atan(1.0);
void fft(int sign, vector<complex<double>> &zs) {
unsigned int j=0;
// Warning about signed vs unsigned comparison
for(unsigned int i=0; i<zs.size()-1; ++i) {
if (i < j) {
auto t = zs.at(i);
zs.at(i) = zs.at(j);
zs.at(j) = t;
}
int m=zs.size()/2;
j^=m;
while ((j & m) == 0) { m/=2; j^=m; }
}
for(unsigned int j=1; j<zs.size(); j*=2)
for(unsigned int m=0; m<j; ++m) {
auto t = pi * sign * m / j;
auto w = complex<double>(cos(t), sin(t));
for(unsigned int i = m; i<zs.size(); i+=2*j) {
complex<double> zi = zs.at(i), t = w * zs.at(i + j);
zs.at(i) = zi + t;
zs.at(i + j) = zi - t;
}
}
}
Note that this function only works for n-element vectors where n is an integral power of two. Anyone looking for fast FFT code that works for any n should look at FFTW.
As I understand it, the traditional xs[i] syntax from C for indexing a vector does not do bounds checking and, consequently, is not memory safe and can be a source of memory errors such as non-deterministic corruption and memory access violations. So I used xs.at(i) instead.
Now, I want this code to be "safe and fast" but I am not a C++11 expert so I'd like to ask for improvements to this code that would make it more idiomatic or efficient?
I think you are being overly "safe" in your use of at(). In most of your cases the index used is trivially verifable as being constrained by the size of the container in the for loop.
e.g.
for(unsigned int i=0; i<zs.size()-1; ++i) {
...
auto t = zs.at(i);
The only ones I'd leave as at()s are the (i + j)s. It's not immediately obvious whether they would always be constrained (although if I was really unsure I'd probably manually check - but I'm not familiar with FFTs enough to have an opinion in this case).
There are also some fixed computations being repeated for each loop iteration:
int m=zs.size()/2;
pi * sign
2*j
And the zs.at(i + j) is computed twice.
It's possible that the optimiser may catch these - but if you are treating this as performance critical, and you have your timers testing it, I'd hoist them out of the loops (or, in the case of zs.at(i + j), just take a reference on first use) and see if that impacts the timer.
Talking of second-guessing the optimiser: I'm sure that the calls to .size() will be inlined as, at least, a direct call to an internal member variable - but given how many times you call it I'd also experiment with introducing local variables for zs.size() and zs.size()-1 upfront. They're more likely to be put into registers that way too.
I don't know how much of a difference (if any) all of this will have on your total runtime - some of it may already be caught by the optimiser, and the differences may be small compared to the computations involved - but worth a shot.
As for being idiomatic my only comment, really, is that size() returns a std::size_t (which is usually a typedef for an unsigned int - but it's more idiomatic to use that type instead). If you did want to use auto but avoid the warning you could try adding the ul suffix to the 0 - not sure I'd say that is idiomatic, though. I suppose you're already less than idiomatic in not using iterators here, but I can see why you can't do that (easily).
Update
I gave all my suggestions a try and they all had a measurable performance improvement - except the i+j and 2*j precalcs - they actually caused a slight slowdown! I presume they either prevented a compiler optimisation or prevented it from using registers for some things.
Overall I got a >10% perf. improvement with those suggestions.
I suspect more could be had if the second block of loops was refactored a little to avoid the jumps - and having done so enabling SSE2 instruction set may give a significant boost (I did try it as is and saw a slight slowdown).
I think that refactoring, along with using something like MKL for the cos and sin calls should give greater, and less brittle, improvements. And neither of those things would be language dependent (I know this was originally being compared to an F# implementation).
Update 2
I forgot to mention that pre-calculating zs.size() did make a difference.
Update 3
Also forgot to say (until reminded by #xeo in comment to OP) that the block following the i < j check can be boiled down to a std::swap. This is more idiomatic and at least as performant - in the worst case should inline to the same code as written. Indeed when I did it I saw no change in the performance. In other cases it can lead to a performance gain if move constructors are available.

Multiply Large Complex Number Vector by Scalar efficiently C++

I'm currently trying to most efficiently do an in-place multiplication of an array of complex numbers (memory aligned the same way the std::complex would be but currently using our own ADT) by an array of scalar values that is the same size as the complex number array.
The algorithm is already parallelized, i.e. the calling object splits the work up into threads. This calculation is done on arrays in the 100s of millions - so, it can take some time to complete. CUDA is not a solution for this product, although I wish it was. I do have access to boost and thus have some potential to use BLAS/uBLAS.
I'm thinking, however, that SIMD might yield much better results, but I'm not familiar enough with how to do this with complex numbers. The code I have now is as follows (remember this is chunked up into threads which correspond to the number of cores on the target machine). The target machine is also unknown. So, a generic approach is probably best.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
for (register int idx = start; idx < end; ++idx)
{
values[idx].real *= scalar[idx];
values[idx].imag *= scalar[idx];
}
}
fcomplex is defined as follows:
struct fcomplex
{
float real;
float imag;
};
I've tried manually unrolling the loop, as my finally loop count will always be a power of 2, but the compiler is already doing that for me (I've unrolled as far as 32). I've tried a const float reference to the scalar - in thinking I'd save one access - and that proved to be equal to the what the compiler was already doing. I've tried STL and transform, which game close results, but still worse. I've also tried casting to std::complex and allow it to use the overloaded operator for scalar * complex for the multiplication but this ultimately produced the same results.
So, anyone with any ideas? Much appreciation is given for your time in considering this! Target platform is Windows. I'm using Visual Studio 2008. Product cannot contain GPL code as well! Thanks so much.
You can do this fairly easily with SSE, e.g.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
for (int idx = start; idx < end; idx += 2)
{
__m128 vc = _mm_load_ps((float *)&values[idx]);
__m128 vk = _mm_set_ps(scalar[idx + 1], scalar[idx + 1], scalar[idx], scalar[idx]);
vc = _mm_mul_ps(vc, vk);
_mm_store_ps((float *)&values[idx], vc);
}
}
Note that values and scalar need to be 16 byte aligned.
Or you could just use the Intel ICC compiler and let it do the hard work for you.
UPDATE
Here is an improved version which unrolls the loop by a factor of 2 and uses a single load instruction to get 4 scalar values which are then unpacked into two vectors:
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
for (int idx = start; idx < end; idx += 4)
{
__m128 vc0 = _mm_load_ps((float *)&values[idx]);
__m128 vc1 = _mm_load_ps((float *)&values[idx + 2]);
__m128 vk = _mm_load_ps(&scalar[idx]);
__m128 vk0 = _mm_shuffle_ps(vk, vk, 0x50);
__m128 vk1 = _mm_shuffle_ps(vk, vk, 0xfa);
vc0 = _mm_mul_ps(vc0, vk0);
vc1 = _mm_mul_ps(vc1, vk1);
_mm_store_ps((float *)&values[idx], vc0);
_mm_store_ps((float *)&values[idx + 2], vc1);
}
}
Your best bet will be to use an optimised BLAS which will take advantage of whatever is available on your target platform.
One problem I see is that in the function it's hard for the compiler to understand that the scalar pointer is not indeed pointing in the middle of the complex array (scalar could in theory be pointing to the complex or real part of a complex).
This actually forces the order of evaluation.
Another problem I see is that here the computation is so simple that other factors will influence the raw speed, therefore if you really care about performance the only solution is in my opinion to implement several variations and test them at runtime on the user machine to discover what is the fastest.
What I'd consider is using different unrolling sizes, and also playing with the alignment of scalar and values (the memory access pattern can have a big influence of caching effects).
For the problem of the unwanted serialization an option is to see what is the generated code for something like
float r0 = values[i].real, i0 = values[i].imag, s0 = scalar[i];
float r1 = values[i+1].real, i1 = values[i+1].imag, s1 = scalar[i+1];
float r2 = values[i+2].real, i2 = values[i+2].imag, s2 = scalar[i+2];
values[i].real = r0*s0; values[i].imag = i0*s0;
values[i+1].real = r1*s1; values[i+1].imag = i1*s1;
values[i+2].real = r2*s2; values[i+2].imag = i2*s2;
because here the optimizer has in theory a little bit more freedom.
Do you have access to Intel's Integrated Performance Primitives?
Integrated Performance Primitives They have a number of functions that handle cases like this with pretty decent performance. You might have some success with your particular problem, but I would not be surprised if your compiler already does a decent job of optimizing the code.