C++ loop unrolling for small compile-time constant values

I have these 2 functions:
template<int N>
void fun()
{
    for(int i = 0; i < N; ++i)
    {
        std::cout << i << " ";
    }
}

void gun(int N)
{
    for(int i = 0; i < N; ++i)
    {
        std::cout << i << " ";
    }
}
May I assume that in the first version the compiler will optimize the loop for every small N (by small I mean N = {1, 2, 3, 4})?

May I assume that in the first version the compiler will optimize the loop for every small N
That is a typical optimization, although "assume" is a strong word: if a particular optimization is imperative for you, relying on the optimizer will eventually disappoint you, so check the generated code.
Your second version may receive the same optimization if the compiler is able to inline the call with a constant argument.

You never have any guarantees as to what the optimizer will do, but given a suitable optimization level, you can usually rely on it making better choices than you would make optimizing manually.
If you really want to know what code is produced, you can always take a look at the resulting assembly.

If the compiler can inline either of the functions, it will also unroll the loop if it thinks that's the right thing to do. When and how a compiler decides there is a benefit in unrolling a loop is quite a complex matter that depends on many factors, such as the number of available registers and what happens inside the loop. I doubt the example above would gain much from removing the five or so instructions involved in the loop control, given that the cout call will probably consume several thousand times as much time; whether the compiler can figure that out is another matter, but it isn't entirely unknown for compilers to have some understanding of whether a called function is small or not.
On the other hand, if the code looks something like this:
extern int arr[]; // Global array, defined elsewhere.

template<int N>
int fun()
{
    int sum = 0;
    for(int i = 0; i < N; ++i)
    {
        sum += arr[i];
    }
    return sum;
}
Then I would expect the compiler to unroll the loop to be something like this:
int *tmp = arr;
sum += *tmp++;
sum += *tmp++;
sum += *tmp++;
sum += *tmp++;
sum += *tmp++;
Assuming N = 5.
And this applies to any function that is "visible" to the compiler and where N is known at compile time. So, assuming gun isn't in a different source file, I would expect it to be inlined and unrolled exactly the same as fun (which, being a function template, HAS to be visible in this compilation unit).

It depends on your optimization level and flags. There's a big difference between -O0 -g (no optimization, debugging enabled), -O3 (aggressively optimize for speed), and -Os (optimize for space).
These days loop unrolling isn't necessarily a win, even when optimizing for speed. Too much code can cause an instruction cache miss which will greatly outweigh the speedup of inlining a simple loop. And the cost of the conditional branch in a loop like this is almost negligible since branch prediction will correctly anticipate all but the last iteration.

If you want to be a little more explicit, you can use Duff's Device, which uses switch-case fallthrough to unroll loops. I can't speak to how well it works in practice, but I would imagine that if you can instead hint to the compiler to unroll the loop, that would be faster.
Compilers are also pretty smart, and while they're not infallible, their optimization choices are generally better than our own intuition.
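For illustration, here is a sketch of Duff's Device applied to the cout loop from the question (the function name dun and the guard against N <= 0 are my own additions; it assumes #include <iostream>, and a modern compiler's own unrolling is usually at least as good):
void dun(int N)
{
    if (N <= 0) return;              // the classic device misbehaves for a zero count
    int i = 0;
    int passes = (N + 3) / 4;        // number of trips through the unrolled-by-4 body
    switch (N % 4)
    {
    case 0: do { std::cout << i++ << " ";
    case 3:      std::cout << i++ << " ";
    case 2:      std::cout << i++ << " ";
    case 1:      std::cout << i++ << " ";
            } while (--passes > 0);
    }
}
The switch jumps into the middle of the do-while to handle the remainder, and the loop then runs four iterations per pass.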

Related

For loop or no loop? (dataset is small and not subject to change)

Let's say I have a situation where I have a matrix of a small, known size where the size is unlikely to change over the life of the software. If I need to examine each matrix element, would it be more efficient to use a loop or to manually index into each matrix location?
For example, let's say I have a system made up of 3 windows, 2 panes per window. I need to keep track of state for each window pane. In my system, there will only ever be 3 windows, 2 panes per window.
static const int NUMBER_OF_WINDOWS = 3;
static const int NUMBER_OF_PANES = 2;
static const int WINDOW_LEFT = 0;
static const int WINDOW_MIDDLE = 1;
static const int WINDOW_RIGHT = 2;
static const int PANE_TOP = 0;
static const int PANE_BOTTOM = 1;
paneState windowPanes[NUMBER_OF_WINDOWS][NUMBER_OF_PANES];
Which of these accessing methods would be more efficient?
loop version:
for (int ii = 0; ii < NUMBER_OF_WINDOWS; ii++)
{
    for (int jj = 0; jj < NUMBER_OF_PANES; jj++)
    {
        doSomething(windowPanes[ii][jj]);
    }
}
vs.
manual access version:
doSomething(windowPanes[WINDOW_LEFT][PANE_TOP]);
doSomething(windowPanes[WINDOW_MIDDLE][PANE_TOP]);
doSomething(windowPanes[WINDOW_RIGHT][PANE_TOP]);
doSomething(windowPanes[WINDOW_LEFT][PANE_BOTTOM]);
doSomething(windowPanes[WINDOW_MIDDLE][PANE_BOTTOM]);
doSomething(windowPanes[WINDOW_RIGHT][PANE_BOTTOM]);
Will the loop code generate branch instructions, and will those be more costly than the instructions that would be generated on the manual access?
The classic efficiency vs. organization trade-off. The for loops are much more human-readable, while the manual version is closer to what the machine will execute.
I recommend you use the loops, because the compiler, when optimization is enabled, will actually generate the manual version for you once it sees that the upper bounds are constant. That way you get the best of both worlds.
First of all: how complex is your function doSomething? If it is at all complex (and it most likely is), then you will not notice any difference.
In general, calling your function sequentially will be slightly more efficient than the loop. But once again, the gain will be so tiny that it is not worth discussing.
Bear in mind that optimizing compilers do loop unrolling. This essentially generates code that runs the loop a smaller number of times while doing more work in each iteration (calling your function 2-4 times in sequence, for example). When the number of iterations is small and fixed, the compiler may eliminate the loop completely.
Look at your code from the point of view of clarity and ease of modification. In many cases the compiler will do a lot of useful tricks related to performance.
You may linearize your multi-dimensional array:
paneState windowPanes[NUMBER_OF_WINDOWS * NUMBER_OF_PANES];
and then
for (auto& pane : windowPanes) {
    doSomething(pane);
}
which avoids the extra loop in case the compiler doesn't optimize it away.
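A minimal self-contained sketch of that linearized layout, restating the constants from the question; paneState, doSomething, and the index helper paneIndex are placeholders for illustration:
#include <cstddef>

struct paneState { int value; };           // placeholder type
static void doSomething(paneState&) {}     // placeholder function

static const int NUMBER_OF_WINDOWS = 3;
static const int NUMBER_OF_PANES   = 2;

paneState windowPanes[NUMBER_OF_WINDOWS * NUMBER_OF_PANES];

// Map (window, pane) to the flat index; row-major, matching windowPanes[w][p].
inline std::size_t paneIndex(int window, int pane)
{
    return static_cast<std::size_t>(window) * NUMBER_OF_PANES + pane;
}

int main()
{
    for (auto& pane : windowPanes)                  // iterates all 6 panes in one loop
        doSomething(pane);

    doSomething(windowPanes[paneIndex(1, 0)]);      // WINDOW_MIDDLE, PANE_TOP
}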

The simple task of iterating through an array. Which of these solutions is the most efficient?

Recently, I've been thinking about all the ways that one could iterate through an array and wondered which of these is the most (and least) efficient. I've written a hypothetical problem and five possible solutions.
Problem
Given an int array arr with len number of elements, what would be the most efficient way of assigning an arbitrary number 42 to every element?
Solution 0: The Obvious
for (unsigned i = 0; i < len; ++i)
arr[i] = 42;
Solution 1: The Obvious in Reverse
for (unsigned i = len - 1; i >= 0; --i)  // note: with an unsigned index this condition is always true, so the loop never terminates
    arr[i] = 42;
Solution 2: Address and Iterator
for (unsigned i = 0; i < len; ++i)
{
    *arr = 42;
    ++arr;
}
Solution 3: Address and Iterator in Reverse
for (unsigned i = len; i; --i)
{
    *arr = 42;
    ++arr;
}
Solution 4: Address Madness
int* end = arr + len;
for (; arr < end; ++arr)
*arr = 42;
Conjecture
The obvious solutions are almost always used, but I wonder whether the subscript operator could result in a multiplication instruction, as if it had been written like *(arr + i * sizeof(int)) = 42.
The reverse solutions try to take advantage of how comparing i to 0 instead of len might mitigate a subtraction operation. Because of this, I prefer Solution 3 over Solution 2. Also, I've read that arrays are optimized to be accessed forwards because of how they're stored in the cache, which could present an issue with Solution 1.
I don't see why Solution 4 would be any less efficient than Solution 2. Solution 2 increments the address and the iterator, while Solution 4 only increments the address.
In the end, I'm not sure which of these solutions I prefer. I think the answer also varies with the target architecture and the optimization settings of your compiler.
Which of these do you prefer, if any?
Just use std::fill.
std::fill(arr, arr + len, 42);
Out of your proposed solutions, on a good compiler, none should be faster than the others.
The ISO standard doesn't mandate the efficiency of the different ways of doing things in code (other than certain big-O type stuff for some collection algorithms), it simply mandates how it functions.
Unless your arrays are billions of elements in size, or you're wanting to set them millions of times per minute, it generally won't make the slightest difference which method you use.
If you really want to know (and I still maintain it's almost certainly unnecessary), you should benchmark the various methods in the target environment. Measure, don't guess!
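A minimal sketch of such a benchmark using <chrono>, timing one of the variants; the array size and repetition count are arbitrary choices, and the other variants would be timed the same way:
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> arr(1000000);

    auto start = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep)             // repeat so the time is measurable
        for (std::size_t i = 0; i < arr.size(); ++i)
            arr[i] = 42;
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> ms = stop - start;
    std::printf("subscript loop: %.3f ms (arr[0] = %d)\n", ms.count(), arr[0]);
}
Printing a value from the array at the end keeps the compiler from discarding the writes entirely, though a careful benchmark needs more safeguards than this.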
As to which I prefer, my first inclination is to optimise for readability. Only if there's a specific performance problem do I then consider other possibilities. That would be simply something like:
for (size_t idx = 0; idx < len; idx++)
arr[idx] = 42;
I don't think that performance is an issue here - these are micro-optimizations that are hardly ever necessary, and I could imagine the compiler producing identical assembly for most of them.
Go with the solution that is most readable; the standard library provides you with std::fill, or for more complex assignments
for(unsigned k = 0; k < len; ++k)
{
    // whatever
}
so it is obvious to other people looking at your code what you are doing. With C++11 you could also
for(auto & elem : arr)
{
    // whatever
}
just don't try to obfuscate your code without any necessity.
For nearly all meaningful cases, the compiler will optimize all of the suggested ones to the same thing, and it's very unlikely to make any difference.
There used to be a trick where you could avoid the automatic prefetching of data if you ran the loop backwards, which under some bizarre set of circumstances actually made it more efficient. I can't recall the exact circumstances, but I expect modern processors will identify backwards loops as well as forwards loops for automatic prefetching anyway.
If it's REALLY important for your application to do this over a large number of elements, then looking at blocked access and using non-temporal storage will be the most efficient. But before you do that, make sure you have identified the filling of the array as an important performance point, and then make measurements for the current code and the improved code.
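A minimal sketch of the non-temporal-store idea for filling a large int array on x86 with SSE2; the function name is made up, and real code would want to verify that the buffer is large enough for this to pay off:
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

void fill42_nontemporal(int* arr, std::size_t len)
{
    std::size_t i = 0;
    // Scalar stores until the destination is 16-byte aligned, as _mm_stream_si128 requires.
    while (i < len && (reinterpret_cast<std::uintptr_t>(arr + i) % 16) != 0)
        arr[i++] = 42;

    const __m128i v = _mm_set1_epi32(42);
    for (; i + 4 <= len; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i*>(arr + i), v);   // store bypassing the cache
    for (; i < len; ++i)                                            // scalar remainder
        arr[i] = 42;

    _mm_sfence();   // make the streaming stores visible before subsequent loads/stores
}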
I may come back with some actual benchmarks to prove that "it makes little difference" in a bit, but I've got an errand to run before it gets too late in the day...

Loop optimisation techniques in C++

In order to increase the performance of our applications, we have to consider loop optimisation techniques during the development phase.
I'd like to show you some different ways to iterate over a simple std::vector<uint32_t> v:
Unoptimized loop with index:
uint64_t sum = 0;
for (unsigned int i = 0; i < v.size(); i++)
sum += v[i];
Unoptimized loop with iterator:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator it;
for (it = v.begin(); it != v.end(); it++)
sum += *it;
Cached std::vector::end iterator:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator it, end(v.end());
for (it = v.begin(); it != end; it++)
sum += *it;
Pre-increment iterators:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator it, end(v.end());
for (it = v.begin(); it != end; ++it)
sum += *it;
Range-based loop:
uint64_t sum = 0;
for (auto const &x : v)
sum += x;
There are also other means to build a loop in C++; for instance by using std::for_each, BOOST_FOREACH, etc...
In your opinion, which is the best approach to increase the performance and why?
Furthermore, in performance-critical applications it could be useful to unroll the loops: again, which approach would you suggest?
There's no hard and fast rule, since it depends on the implementation. If the measurements I did some years back are typical, however, about the only thing which makes a difference is caching the end iterator. Pre- or post-increment makes no difference, regardless of the container and iterator type.
At the time, I didn't measure indexing (because I was comparing iterators of different types of container as well, and not all support indexing). But I would guess that if you use indexes, you should cache the result of v.size() as well.
Of course, these measurements were for one compiler (g++) on one system, with specific hardware. The only way you can know for your environment is to measure yourself.
RE your note: are you sure you have full optimization turned on? My measurements showed no difference between 3 and 4, and I doubt that compilers optimize less well today.
It's very important for the optimizations here that the functions are actually inlined. If they're not, post-incrementation does require some extra copying, and will typically require an extra function call (to the copy constructor of the iterator) as well. Once the functions are inlined, however, the compiler can easily see that all of this is unnecessary, and (at least when I tried it) will generate exactly the same code in both cases. (I'd use pre-incrementation anyway. Not because it makes a difference, but because if you don't, some idiots will come along claiming it will, despite your measurements. Or maybe they're not idiots, but are just using a particularly stupid compiler.)
To tell the truth, when I did the measurements, I was surprised that caching the end iterator made a difference, even for vector, whereas there was no difference between pre- and post-incrementation, even for a reverse iterator into a map. After all, end() was inlined as well; in fact, every single function used in my tests was inlined.
As to unrolling the loops: I'd probably do something like this:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator current = v.begin();
std::vector<uint32_t>::const_iterator end = v.end();
switch ( (end - current) % 4 ) {
case 3:
    sum += *current++;
case 2:
    sum += *current++;
case 1:
    sum += *current++;
case 0:
    break;
}
while ( current != end ) {
    sum += current[0] + current[1] + current[2] + current[3];
    current += 4;
}
(This unrolls by a factor of 4. You can easily increase it if necessary.)
I'm going on the assumption that you are well aware of the evils of premature micro-optimization, and that you have identified hotspots in your code by profiling and all the rest. I'm also going on the assumption that you're only concerned about performance with respect to speed. That is, you don't care deeply about the size of the resulting code or memory use.
The code snippets you have provided will yield largely the same results, with the exception of the cached end() iterator. Aside from caching and inlining as much as you can, there is not much you can do to tweak the structure of the loops above to realize significant gains in performance.
Writing performant code in critical paths relies first and foremost on selecting the best algorithm for the job. If you have a performance problem, look first and hard at the algorithm. The compiler will generally do a much better job at micro-optimizing the code you wrote than you could ever hope to.
All that being said, there are a few things you can do to give your compiler a little help.
Cache everything you can
Keep small allocations to a minimum, especially within a loop
Make as many things const as you can. This gives the compiler additional opportunities to micro-optimize.
Learn your toolchain well and leverage that knowledge
Learn your architecture well and leverage that knowledge
Learn to read assembly code and examine the assembly output from your compiler
Learning your toolchain and architecture is going to yield the most benefits. For example, GCC has many options you can enable to increase performance, including loop unrolling (see -funroll-loops and the related optimization options in the GCC documentation). When iterating over datasets, it is often beneficial to keep each item aligned to the size of a cache line. On modern architectures this often means 64 bytes, but learn your architecture.
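As a small illustration of the cache-line point, a sketch assuming a 64-byte line (the struct name is made up, and the right alignment value depends on your actual target):
#include <cstdint>

// Keep each item aligned to (and padded out to) a 64-byte cache line,
// so iterating the dataset touches whole lines at a time.
struct alignas(64) Accumulator {
    uint64_t sum = 0;
};

// Typical build for this kind of experiment (illustrative):
//   g++ -O3 -funroll-loops -march=native main.cpp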
Intel publishes an excellent guide to writing performant C++ for its platforms.
Once you have learned your architecture and toolchain, you might find that the algorithm you originally selected is not optimal in your real world. Be open to change in the face of new data.
It's very likely that modern compilers will produce the same assembly for the approaches you give above. You should look at the actual assembly (after enabling optimizations) to see.
When you're down to worrying about the speed of your loops, you should really think about whether your algorithm is truly optimal. If you're convinced it is, then you need to think about (and make use of) the underlying implementation of the data structures. std::vector uses an array underneath, and, depending on the compiler and the other code in the function, pointer aliasing may prevent the compiler from fully optimizing your code.
There's a fair amount of information out there on pointer aliasing (including What is the strict aliasing rule?), but Mike Acton has some wonderful information about pointer aliasing.
The restrict keyword (see What does the restrict keyword mean in C++? or, again, Mike Acton), available through compiler extensions for many years and codified in C99 (currently only available as a compiler extension in C++), is meant to deal with this. The way to use this in your code is far more C-like, but may allow the compiler to better optimize your loop, at least for the examples you've given:
uint64_t sum = 0;
uint32_t * __restrict velt = &v[0];            // __restrict (or __restrict__) is the usual compiler-extension spelling in C++
uint32_t * __restrict vend = velt + v.size();
while (velt < vend) {
    sum += *velt;
    velt++;
}
However, to see whether this provides a difference, you really need to profile different approaches for your actual, real-life problem, and possibly look at the underlying assembly produced. If you're summing simple data types, this may help you. If you're doing anything more complicated, including calling a function that cannot be inlined in the loop, it's unlikely to make any difference at all.
If you're using clang, then pass it these flags:
-Rpass-missed=loop-vectorize
-Rpass-analysis=loop-vectorize
In Visual C++ add this to the build:
/Qvec-report:2
These flags will tell you if a loop fails to vectorise (and give you an often cryptic message explaining why).
In general though, prefer options 4 and 5 (or std::for_each). Whilst clang and gcc will typically do a decent job in most cases, Visual C++ sadly tends to err on the side of caution. If the scope of the variable is unknown (e.g. a reference or pointer passed into a function, or the this pointer), then vectorisation often fails (containers in local scope will almost always vectorise).
#include <vector>
#include <cmath>

// fails to vectorise in Visual C++ with /O2
void func1(std::vector<float>& v)
{
    for(size_t i = 0; i < v.size(); ++i)
    {
        v[i] = std::sqrt(v[i]);
    }
}

// this will vectorise with sqrtps
void func2(std::vector<float>& v)
{
    for(std::vector<float>::iterator it = v.begin(), e = v.end(); it != e; ++it)
    {
        *it = std::sqrt(*it);
    }
}
Clang and gcc aren't immune to these issues either. If you always take a copy of begin/end, then it cannot be a problem.
Here's another classic that affects many compilers sadly (clang 3.5.0 fails this trivial test, but it's fixed in clang 4.0). It crops up a LOT!
struct Foo
{
    void func3();
    void func4();
    std::vector<float> v;
    float f;
};

// will not vectorise
void Foo::func3()
{
    // this->v.end() !!
    for(std::vector<float>::iterator it = v.begin(); it != v.end(); ++it)
    {
        *it *= f; // this->f !!
    }
}

void Foo::func4()
{
    // you need to take a local copy of v.end(), and 'f'.
    const float temp = f;
    for(std::vector<float>::iterator it = v.begin(), e = v.end(); it != e; ++it)
    {
        *it *= temp;
    }
}
In the end, if it's something you care about, use the vectorisation reports from the compiler to fix up your code. As mentioned above, this is basically an issue of pointer aliasing. You can use the restrict keyword to help fix some of these issues (but I've found that applying restrict to 'this' is often not that useful).
Use range-based for by default, as it gives the compiler the most direct information to optimize (the compiler knows it can cache the end iterator, for example). Then profile, and only optimize further if you identify a significant bottleneck. There will be very few real-world situations where these different loop variants make a meaningful performance difference. Compilers are pretty good at loop optimization, and it is far more likely that you should focus your optimization effort elsewhere (like choosing a better algorithm or optimizing the loop body).
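For completeness, a minimal sketch of the std::for_each variant mentioned above, assuming the same std::vector<uint32_t> v as in the question (the function name sum_v is made up):
#include <algorithm>
#include <cstdint>
#include <vector>

uint64_t sum_v(const std::vector<uint32_t>& v)
{
    uint64_t sum = 0;
    // The algorithm receives the whole range up front, much like range-based for.
    std::for_each(v.begin(), v.end(), [&sum](uint32_t x) { sum += x; });
    return sum;
}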

Which one is more optimized for accessing array?

Solving the following exercise:
Write three different versions of a program to print the elements of
ia. One version should use a range for to manage the iteration, the
other two should use an ordinary for loop in one case using subscripts
and in the other using pointers. In all three programs write all the
types directly. That is, do not use a type alias, auto, or decltype to
simplify the code.[C++ Primer]
a question came up: which of these methods of accessing the array is the most optimized in terms of speed, and why?
My Solutions:
Foreach Loop:
int ia[3][4] = {{1,2,3,4},{5,6,7,8},{9,10,11,12}};
for (int (&i)[4] : ia)              // 1st method, using a range-based for loop
    for (int j : i)
        cout << j << " ";
Nested for loops:
for (int i = 0; i < 3; i++)         // 2nd method, a normal for loop
    for (int j = 0; j < 4; j++)
        cout << ia[i][j] << " ";
Using pointers:
int (*i)[4] = ia;
for (int t = 0; t < 3; i++, t++) {  // 3rd method, using pointers
    for (int x = 0; x < 4; x++)
        cout << (*i)[x] << " ";
}
Using auto:
for (auto &i : ia)                  // 4th method, using auto, but I think it is similar to the 1st
    for (auto j : i)
        cout << j << " ";
Benchmark result using clock()
1st: 3.6 (6,4,4,3,2,3)
2nd: 3.3 (6,3,4,2,3,2)
3rd: 3.1 (4,2,4,2,3,4)
4th: 3.6 (4,2,4,5,3,4)
Simulating each method 1000 times:
1st: 2.29375 2nd: 2.17592 3rd: 2.14383 4th: 2.33333
Process returned 0 (0x0) execution time : 13.568 s
Compiler used: MinGW 3.2 with the C++11 flag enabled. IDE: Code::Blocks.
I have some observations and points to make and I hope you get your answer from this.
The fourth version, as you mention yourself, is basically the same as the first version. auto can be thought of as only a coding shortcut (this is of course not strictly true, as using auto can result in getting different types than you'd expected and therefore result in different runtime behavior. But most of the time this is true.)
Your solution using pointers is probably not what people mean when they say that they are using pointers! One solution might be something like this:
for (int i = 0, *p = &(ia[0][0]); i < 3 * 4; ++i, ++p)
    cout << *p << " ";
or to use two nested loops (which is probably pointless):
for (int i = 0, *p = &(ia[0][0]); i < 3; ++i)
    for (int j = 0; j < 4; ++j, ++p)
        cout << *p << " ";
From now on, I'm assuming this is the pointer solution you've written.
In such a trivial case as this, the part that will absolutely dominate your running time is the cout. The time spent in bookkeeping and checks for the loop(s) will be completely negligible comparing to doing I/O. Therefore, it won't matter which loop technique you use.
Modern compilers are great at optimizing such ubiquitous tasks and access patterns (iterating over an array.) Therefore, chances are that all these methods will generate exactly the same code (with the possible exception of the pointer version, which I will talk about later.)
The performance of most code like this will depend more on the memory access pattern than on exactly how the compiler generates the branch instructions (and the rest of the operations). This is because if a required memory block is not in the CPU cache, it's going to take roughly several hundred CPU cycles (this is just a ballpark number) to fetch those bytes from RAM. Since all the examples access memory in exactly the same order, their behavior with respect to memory and the cache will be the same, and they will have roughly the same running time.
As a side note, the way these examples access memory is the best way for it to be accessed: linearly, consecutively, and from start to finish. Again, there are complications from the cout in there, which can be a very complicated operation and may even call into the OS on every invocation, which might result in, among other things, an almost complete eviction of everything useful from the CPU cache.
On 32-bit systems and programs, the size of an int and a pointer are usually equal (both are 32 bits!) Which means that it doesn't matter much whether you pass around and use index values or pointers into arrays. On 64-bit systems however, a pointer is 64 bits but an int will still usually be 32 bits. This suggests that it is usually better to use indexes into arrays instead of pointers (or even iterators) on 64-bit systems and programs.
In this particular example, this is not significant at all though.
Your code is very specific and simple, but in the general case it is almost always better to give the compiler as much information about your code as possible. This means that you should use the narrowest, most specific tool available to you for the job. This in turn means that a generic for loop (i.e. for (int i = 0; i < n; ++i)) is worse for the compiler than a range-based for loop (i.e. for (auto i : v)), because in the latter case the compiler knows that you are going to iterate over the whole range without going outside of it or breaking out of the loop, while in the generic for loop case, especially if your code is more complex, the compiler cannot be sure of this and has to insert extra checks and tests to make sure the code executes as the C++ standard says it should.
In many (most?) cases, although you might think performance matters, it does not. And most of the time you rewrite something to gain performance, you don't gain much. And most of the time the performance gain you get is not worth the loss in readability and maintainability that you sustain. So, design your code and data structures right (and keep performance in mind) but avoid this kind of "micro-optimization" because it's almost always not worth it and even harms the quality of the code too.
Generally, performance in terms of speed is very hard to reason about. Ideally you have to measure the time with real data on real hardware in real working conditions using sound scientific measuring and statistical methods. Even measuring the time it takes a piece of code to run is not at all trivial. Measuring performance is hard, and reasoning about it is harder, but these days it is the only way of recognizing bottlenecks and optimizing the code.
I hope I have answered your question.
EDIT: I wrote a very simple benchmark for what you are trying to do. The code is here. It's written for Windows and should be compilable on Visual Studio 2012 (because of the range-based for loops.) And here are the timing results:
Simple iteration (nested loops): min:0.002140, avg:0.002160, max:0.002739
Simple iteration (one loop): min:0.002140, avg:0.002160, max:0.002625
Pointer iteration (one loop): min:0.002140, avg:0.002160, max:0.003149
Range-based for (nested loops): min:0.002140, avg:0.002159, max:0.002862
Range(const ref)(nested loops): min:0.002140, avg:0.002155, max:0.002906
The relevant numbers are the "min" times (over 2000 runs of each test, for 1000x1000 arrays.) As you see, there is absolutely no difference between the tests. Note that you should turn on compiler optimizations or test 2 will be a disaster and cases 4 and 5 will be a little worse than 1 and 3.
And here is the code for the tests:
// 1. Simple iteration (nested loops)
unsigned sum = 0;
for (unsigned i = 0; i < gc_Rows; ++i)
    for (unsigned j = 0; j < gc_Cols; ++j)
        sum += g_Data[i][j];

// 2. Simple iteration (one loop)
unsigned sum = 0;
for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
    sum += g_Data[i / gc_Cols][i % gc_Cols];

// 3. Pointer iteration (one loop)
unsigned sum = 0;
unsigned * p = &(g_Data[0][0]);
for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
    sum += *p++;

// 4. Range-based for (nested loops)
unsigned sum = 0;
for (auto & i : g_Data)
    for (auto j : i)
        sum += j;

// 5. Range (const ref) (nested loops)
unsigned sum = 0;
for (auto const & i : g_Data)
    for (auto const & j : i)
        sum += j;
Many factors affect the answer:
It depends on the compiler
It depends on the compiler flags used
It depends on the computer used
There is only one way to know the exact answer: measure the time taken when dealing with huge arrays (perhaps filled from a random number generator), which is essentially the method you have already used, except that the array size should be at least 1000x1000.
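A minimal sketch of that measurement for a 1000x1000 dataset filled from a random number generator; the variable names and the loop variant under test are illustrative only:
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    const std::size_t rows = 1000, cols = 1000;
    std::vector<int> data(rows * cols);

    std::mt19937 gen(12345);                          // fixed seed for repeatability
    std::uniform_int_distribution<int> dist(0, 100);
    for (auto& x : data)
        x = dist(gen);

    auto start = std::chrono::steady_clock::now();
    long long sum = 0;
    for (std::size_t i = 0; i < rows * cols; ++i)     // swap in whichever loop variant you are testing
        sum += data[i];
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> ms = stop - start;
    std::printf("sum = %lld, time = %.3f ms\n", sum, ms.count());
}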

Safe and fast FFT

Inspired by Herb Sutter's compelling lecture Not your father's C++, I decided to take another look at the latest version of C++ using Microsoft's Visual Studio 2010. I was particularly interested by Herb's assertion that C++ is "safe and fast" because I write a lot of performance-critical code.
As a benchmark, I decided to try to write the same simple FFT algorithm in a variety of languages.
I came up with the following C++11 code that uses the built-in complex type and vector collection:
#include <cmath>
#include <complex>
#include <vector>
using namespace std;

// Must provide type or MSVC++ barfs with "ambiguous call to overloaded function"
double pi = 4 * atan(1.0);

void fft(int sign, vector<complex<double>> &zs) {
    unsigned int j = 0;
    // Warning about signed vs unsigned comparison
    for (unsigned int i = 0; i < zs.size() - 1; ++i) {
        if (i < j) {
            auto t = zs.at(i);
            zs.at(i) = zs.at(j);
            zs.at(j) = t;
        }
        int m = zs.size() / 2;
        j ^= m;
        while ((j & m) == 0) { m /= 2; j ^= m; }
    }
    for (unsigned int j = 1; j < zs.size(); j *= 2)
        for (unsigned int m = 0; m < j; ++m) {
            auto t = pi * sign * m / j;
            auto w = complex<double>(cos(t), sin(t));
            for (unsigned int i = m; i < zs.size(); i += 2 * j) {
                complex<double> zi = zs.at(i), t = w * zs.at(i + j);
                zs.at(i) = zi + t;
                zs.at(i + j) = zi - t;
            }
        }
}
Note that this function only works for n-element vectors where n is an integral power of two. Anyone looking for fast FFT code that works for any n should look at FFTW.
As I understand it, the traditional xs[i] syntax from C for indexing a vector does not do bounds checking and, consequently, is not memory safe and can be a source of memory errors such as non-deterministic corruption and memory access violations. So I used xs.at(i) instead.
Now, I want this code to be "safe and fast" but I am not a C++11 expert so I'd like to ask for improvements to this code that would make it more idiomatic or efficient?
I think you are being overly "safe" in your use of at(). In most of your cases the index used is trivially verifiable as being constrained by the size of the container in the for loop.
e.g.
for(unsigned int i=0; i<zs.size()-1; ++i) {
...
auto t = zs.at(i);
The only ones I'd leave as at()s are the (i + j)s. It's not immediately obvious whether they would always be constrained (although if I was really unsure I'd probably manually check - but I'm not familiar with FFTs enough to have an opinion in this case).
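Concretely, the first loop's swap block with that change looks something like this (a fragment of the function above, not a standalone snippet):
if (i < j) {
    auto t = zs[i];    // i and j are both in range here, so the unchecked [] is fine
    zs[i] = zs[j];
    zs[j] = t;
}
The zs.at(i + j) accesses in the second block would stay as they are, per the comment above.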
There are also some fixed computations being repeated for each loop iteration:
int m=zs.size()/2;
pi * sign
2*j
And the zs.at(i + j) is computed twice.
It's possible that the optimiser may catch these - but if you are treating this as performance critical, and you have your timers testing it, I'd hoist them out of the loops (or, in the case of zs.at(i + j), just take a reference on first use) and see if that impacts the timer.
Talking of second-guessing the optimiser: I'm sure that the calls to .size() will be inlined, at the very least down to a direct read of an internal member variable - but given how many times you call it, I'd also experiment with introducing local variables for zs.size() and zs.size()-1 upfront. They're more likely to be kept in registers that way too.
I don't know how much of a difference (if any) all of this will have on your total runtime - some of it may already be caught by the optimiser, and the differences may be small compared to the computations involved - but worth a shot.
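A sketch of what that hoisting might look like for the second block of loops; it is a fragment using the same names as the function above, the locals n, pi_sign, and j2 are made up, and whether each change actually helps has to be measured:
const size_t n = zs.size();             // hoisted out of all the loop conditions
const double pi_sign = pi * sign;       // loop-invariant product
for (unsigned int j = 1; j < n; j *= 2) {
    const unsigned int j2 = 2 * j;      // hoisted 2*j
    for (unsigned int m = 0; m < j; ++m) {
        auto t = pi_sign * m / j;
        auto w = complex<double>(cos(t), sin(t));
        for (unsigned int i = m; i < n; i += j2) {
            complex<double> &zij = zs.at(i + j);    // compute zs.at(i + j) once, reuse the reference
            complex<double> zi = zs.at(i), t = w * zij;
            zs.at(i) = zi + t;
            zij = zi - t;
        }
    }
}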
As for being idiomatic, my only comment, really, is that size() returns a std::size_t (which may just be a typedef for an unsigned integer type - but it's more idiomatic to use that type instead). If you did want to use auto but avoid the warning, you could try adding the ul suffix to the 0 - not sure I'd say that is idiomatic, though. I suppose you're already less than idiomatic in not using iterators here, but I can see why you can't do that (easily).
Update
I gave all my suggestions a try and they all had a measurable performance improvement - except the i+j and 2*j precalcs - they actually caused a slight slowdown! I presume they either prevented a compiler optimisation or prevented it from using registers for some things.
Overall I got a >10% perf. improvement with those suggestions.
I suspect more could be had if the second block of loops was refactored a little to avoid the jumps - and having done so enabling SSE2 instruction set may give a significant boost (I did try it as is and saw a slight slowdown).
I think that refactoring, along with using something like MKL for the cos and sin calls should give greater, and less brittle, improvements. And neither of those things would be language dependent (I know this was originally being compared to an F# implementation).
Update 2
I forgot to mention that pre-calculating zs.size() did make a difference.
Update 3
Also forgot to say (until reminded by #xeo in a comment to the OP) that the block following the i < j check can be boiled down to a std::swap. This is more idiomatic and at least as performant - in the worst case it should inline to the same code as written. Indeed, when I did it I saw no change in performance. In other cases it can lead to a performance gain, if move constructors are available.
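For reference, that change is just a minimal swap of the two elements (std::swap lives in <utility> in C++11, <algorithm> before that):
if (i < j) {
    std::swap(zs.at(i), zs.at(j));   // swaps the two elements in place
}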