C++ conversion optimization

I would like to ask if there is a quicker way to do my audio conversion than by iterating through all the values one by one and dividing each by 32768.
void CAudioDataItem::Convert(const vector<int>&uIntegers, vector<double> &uDoubles)
{
for ( int i = 0; i <=uIntegers.size()-1;i++)
{
uDoubles[i] = uIntegers[i] / 32768.0;
}
}
My approach works fine, but it seems like it could be quicker; however, I did not find any way to speed it up.
Thank you for the help!

If your array is large enough it may be worthwhile to parallelize this for loop. OpenMP's parallel for statement is what I would use.
The function would then be:
void CAudioDataItem::Convert(const vector<int>&uIntegers, vector<double> &uDoubles)
{
#pragma omp parallel for
for (int i = 0; i < uIntegers.size(); i++)
{
uDoubles[i] = uIntegers[i] / 32768.0;
}
}
With gcc you need to pass -fopenmp when you compile for the pragma to take effect; on MSVC it is /openmp. Since spawning threads has a noticeable overhead, this will only be faster if you are processing large arrays, YMMV.
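If the input sizes vary a lot, one way to keep that overhead in check (a sketch of my own, not from the answer; the 4096-element threshold is made up and worth tuning) is OpenMP's if clause, which falls back to serial execution for small arrays:
void CAudioDataItem::Convert(const vector<int> &uIntegers, vector<double> &uDoubles)
{
    // only spin up the thread team when the array is big enough to amortize the cost
    #pragma omp parallel for if(uIntegers.size() > 4096)
    for (int i = 0; i < (int)uIntegers.size(); i++)
    {
        uDoubles[i] = uIntegers[i] / 32768.0;
    }
}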

For maximum speed you want to convert more than one value per loop iteration. The easiest way to do that is with SIMD. Here's roughly how you'd do it with SSE2:
#include <emmintrin.h> // SSE2 intrinsics

void CAudioDataItem::Convert(const vector<int>&uIntegers, vector<double> &uDoubles)
{
__m128d scale = _mm_set_pd( 1.0 / 32768.0, 1.0 / 32768.0 );
int i = 0;
for ( ; i + 4 <= (int)uIntegers.size(); i += 4) // avoids unsigned wrap-around for tiny vectors
{
__m128i x = _mm_loadu_si128((const __m128i*)&uIntegers[i]);
__m128i y = _mm_shuffle_epi32(x, _MM_SHUFFLE(3,2,3,2) ); // move elements 2 and 3 into the low lanes
__m128d dx = _mm_cvtepi32_pd(x);
__m128d dy = _mm_cvtepi32_pd(y);
dx = _mm_mul_pd(dx, scale);
dy = _mm_mul_pd(dy, scale);
_mm_storeu_pd(&uDoubles[i], dx);
_mm_storeu_pd(&uDoubles[i + 2], dy);
}
// Finish off the last 0-3 elements the slow way
for ( ; i < uIntegers.size(); i ++)
{
uDoubles[i] = uIntegers[i] / 32768.0;
}
}
We process four integers per loop iteration. As we can only fit two doubles in an SSE register, there's some extra unpacking work, but the unrolling will help performance unless the arrays are tiny.
Changing the data types to smaller ones (say short and float) should also help performance, because it cuts down on memory bandwidth, and you can fit four floats in an SSE register. For audio data you shouldn't need the precision of a double.
Note that I've used unaligned loads and stores. Aligned ones will be slightly quicker if the data is actually aligned (which it won't be by default, and it's hard to make stuff aligned inside a std::vector).
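To illustrate the smaller-data-types suggestion, here is a rough sketch of my own (not part of the answer above) that converts 16-bit samples straight to float with SSE2; it assumes n is a multiple of 8 and uses the classic unpack-and-shift trick to sign-extend:
#include <emmintrin.h> // SSE2

void ConvertShortToFloat(const short *in, float *out, int n)
{
    const __m128 scale = _mm_set1_ps(1.0f / 32768.0f);
    for (int i = 0; i < n; i += 8)
    {
        __m128i s  = _mm_loadu_si128((const __m128i*)&in[i]);   // 8 x int16
        // duplicate each 16-bit lane into a 32-bit lane, then arithmetic-shift
        // right by 16 to sign-extend it
        __m128i lo = _mm_srai_epi32(_mm_unpacklo_epi16(s, s), 16);
        __m128i hi = _mm_srai_epi32(_mm_unpackhi_epi16(s, s), 16);
        _mm_storeu_ps(&out[i],     _mm_mul_ps(_mm_cvtepi32_ps(lo), scale));
        _mm_storeu_ps(&out[i + 4], _mm_mul_ps(_mm_cvtepi32_ps(hi), scale));
    }
}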

Your function is highly parallelizable. On a modern Intel CPU there are three independent ways to parallelize: instruction level parallelism (ILP), thread level parallelism (TLP), and SIMD. I was able to use all three to get big boosts in your function. The results are compiler dependent, though. The boost is much smaller with GCC since it already vectorizes the function. See the table of numbers below.
However, the main limiting factor in your function is that its time complexity is only O(n), so there is a drastic drop in efficiency when the size of the array you're running over crosses each cache-level boundary. If you look, for example, at large dense matrix multiplication (GEMM), it's an O(n^3) operation, so if one does things right (using e.g. loop tiling) the cache hierarchy is not a problem: you can get close to the maximum flops/s even for very large matrices (which seems to indicate that GEMM is one of the things Intel thinks about when they design the CPU). The way to fix this in your case is to find a way to do your function on an L1-cache block right after/before you do a more complex operation (for example one that goes as O(n^2)) and then move on to another L1 block. Of course I don't know what you're doing, so I can't do that for you.
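As a purely illustrative sketch of that blocking idea (heavy_kernel and the 2048-element block size are hypothetical stand-ins, and convert is the plain conversion function from the code further down): convert one L1-sized chunk, then immediately run the expensive pass over it while it is still in cache, instead of making two full sweeps over the whole array.
#include <algorithm> // std::min

void convert(const int *uIntegers, double *uDoubles, const int n); // from the code below
void heavy_kernel(double *block, int len);                         // hypothetical O(len^2)-ish work

void convert_blocked(const int *uIntegers, double *uDoubles, const int n) {
    const int block = 2048; // roughly an L1-sized chunk of int + double data
    for (int b = 0; b < n; b += block) {
        int len = std::min(block, n - b);
        convert(&uIntegers[b], &uDoubles[b], len); // cheap O(len) conversion pass
        heavy_kernel(&uDoubles[b], len);           // heavy work while the block is still in L1
    }
}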
ILP is partially done for you by the CPU hardware. However, loop-carried dependencies often limit the ILP, so it often helps to do loop unrolling to take full advantage of it. For TLP I use OpenMP, and for SIMD I used AVX (however the code below works for SSE as well). I used 32-byte aligned memory and made sure the array was a multiple of 8 so that no clean-up was necessary.
Here are the results from Visual Studio 2012 64-bit with AVX and OpenMP (release mode obviously) on a Sandy Bridge EP, 4 cores (8 HW threads) @ 3.6 GHz. The variable n is the number of items. I repeat the function several times as well, so the total time includes that. The function convert_vec4_unroll2_openmp gives the best results except in the L1 region. You can also clearly see that the efficiency drops significantly each time you move to a new cache level, but even for main memory it's still better.
l1 cache, n 2752, repeat 300000
convert time 1.34, error 0.000000
convert_vec4 time 0.16, error 0.000000
convert_vec4_unroll2 time 0.16, error 0.000000
convert_vec4_unroll2_openmp time 0.31, error 0.000000
l2 cache, n 21856, repeat 30000
convert time 1.14, error 0.000000
convert_vec4 time 0.24, error 0.000000
convert_vec4_unroll2 time 0.24, error 0.000000
convert_vec4_unroll2_openmp time 0.12, error 0.000000
l3 cache, n 699072, repeat 1000
convert time 1.23, error 0.000000
convert_vec4 time 0.44, error 0.000000
convert_vec4_unroll2 time 0.45, error 0.000000
convert_vec4_unroll2_openmp time 0.14, error 0.000000
main memory, n 8738144, repeat 100
convert time 1.56, error 0.000000
convert_vec4 time 0.95, error 0.000000
convert_vec4_unroll2 time 0.89, error 0.000000
convert_vec4_unroll2_openmp time 0.51, error 0.000000
Results with g++ foo.cpp -mavx -fopenmp -ffast-math -O3 on an i5-3317 (Ivy Bridge) @ 2.4 GHz, 2 cores (4 HW threads). GCC seems to vectorize this and the only benefit comes from OpenMP (which, however, gives a worse result in the L1 region).
l1 cache, n 2752, repeat 300000
convert time 0.26, error 0.000000
convert_vec4 time 0.22, error 0.000000
convert_vec4_unroll2 time 0.21, error 0.000000
convert_vec4_unroll2_openmp time 0.46, error 0.000000
l2 cache, n 21856, repeat 30000
convert time 0.28, error 0.000000
convert_vec4 time 0.27, error 0.000000
convert_vec4_unroll2 time 0.27, error 0.000000
convert_vec4_unroll2_openmp time 0.20, error 0.000000
l3 cache, n 699072, repeat 1000
convert time 0.80, error 0.000000
convert_vec4 time 0.80, error 0.000000
convert_vec4_unroll2 time 0.80, error 0.000000
convert_vec4_unroll2_openmp time 0.83, error 0.000000
main memory, n 8738144, repeat 100
convert time 1.10, error 0.000000
convert_vec4 time 1.10, error 0.000000
convert_vec4_unroll2 time 1.10, error 0.000000
convert_vec4_unroll2_openmp time 1.00, error 0.000000
Here is the code. I use the vectorclass http://www.agner.org/optimize/vectorclass.zip to do SIMD. This will use either AVX to write 4 doubles at once or SSE to write 2 doubles at once.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "vectorclass.h"
void convert(const int *uIntegers, double *uDoubles, const int n) {
for ( int i = 0; i<n; i++) {
uDoubles[i] = uIntegers[i] / 32768.0;
}
}
void convert_vec4(const int *uIntegers, double *uDoubles, const int n) {
Vec4d div = 1.0/32768;
for ( int i = 0; i<n; i+=4) {
Vec4i u4i = Vec4i().load(&uIntegers[i]);
Vec4d u4d = to_double(u4i);
u4d*=div;
u4d.store(&uDoubles[i]);
}
}
void convert_vec4_unroll2(const int *uIntegers, double *uDoubles, const int n) {
Vec4d div = 1.0/32768;
for ( int i = 0; i<n; i+=8) {
Vec4i u4i_v1 = Vec4i().load(&uIntegers[i]);
Vec4d u4d_v1 = to_double(u4i_v1);
u4d_v1*=div;
u4d_v1.store(&uDoubles[i]);
Vec4i u4i_v2 = Vec4i().load(&uIntegers[i+4]);
Vec4d u4d_v2 = to_double(u4i_v2);
u4d_v2*=div;
u4d_v2.store(&uDoubles[i+4]);
}
}
void convert_vec4_openmp(const int *uIntegers, double *uDoubles, const int n) {
#pragma omp parallel for
for ( int i = 0; i<n; i+=4) {
Vec4i u4i = Vec4i().load(&uIntegers[i]);
Vec4d u4d = to_double(u4i);
u4d/=32768.0;
u4d.store(&uDoubles[i]);
}
}
void convert_vec4_unroll2_openmp(const int *uIntegers, double *uDoubles, const int n) {
Vec4d div = 1.0/32768;
#pragma omp parallel for
for ( int i = 0; i<n; i+=8) {
Vec4i u4i_v1 = Vec4i().load(&uIntegers[i]);
Vec4d u4d_v1 = to_double(u4i_v1);
u4d_v1*=div;
u4d_v1.store(&uDoubles[i]);
Vec4i u4i_v2 = Vec4i().load(&uIntegers[i+4]);
Vec4d u4d_v2 = to_double(u4i_v2);
u4d_v2*=div;
u4d_v2.store(&uDoubles[i+4]);
}
}
double compare(double *a, double *b, const int n) {
double diff = 0.0;
for(int i=0; i<n; i++) {
double tmp = a[i] - b[i];
//printf("%d %f %f \n", i, a[i], b[i]);
if(tmp<0) tmp*=-1;
diff += tmp;
}
return diff;
}
void loop(const int n, const int repeat, const int ifunc) {
void (*fp[4])(const int *uIntegers, double *uDoubles, const int n);
int *a = (int*)_mm_malloc(sizeof(int)* n, 32);
double *b1_cmp = (double*)_mm_malloc(sizeof(double)*n, 32);
double *b1 = (double*)_mm_malloc(sizeof(double)*n, 32);
double dtime;
const char *fp_str[] = {
"covert",
"convert_vec4",
"convert_vec4_unroll2",
"convert_vec4_unroll2_openmp",
};
for(int i=0; i<n; i++) {
a[i] = rand(); // fill with random test data
}
fp[0] = convert;
fp[1] = convert_vec4;
fp[2] = convert_vec4_unroll2;
fp[3] = convert_vec4_unroll2_openmp;
convert(a, b1_cmp, n);
dtime = omp_get_wtime();
for(int i=0; i<repeat; i++) {
fp[ifunc](a, b1, n);
}
dtime = omp_get_wtime() - dtime;
printf("\t%s time %.2f, error %f\n", fp_str[ifunc], dtime, compare(b1_cmp,b1,n));
_mm_free(a);
_mm_free(b1_cmp);
_mm_free(b1);
}
int main() {
double dtime;
int l1 = (32*1024)/(sizeof(int) + sizeof(double));
int l2 = (256*1024)/(sizeof(int) + sizeof(double));
int l3 = (8*1024*1024)/(sizeof(int) + sizeof(double));
int lx = (100*1024*1024)/(sizeof(int) + sizeof(double));
int n[] = {l1, l2, l3, lx};
int repeat[] = {300000, 30000, 1000, 100};
const char *cache_str[] = {"l1", "l2", "l3", "main memory"};
for(int c=0; c<4; c++ ) {
int lda = ((n[c]+7) & -8); //make sure array is a multiple of 8
printf("%s chache, n %d\n", cache_str[c], lda);
for(int i=0; i<4; i++) {
loop(lda, repeat[c], i);
} printf("\n");
}
}
Lastly, if anyone who has read this far feels like reminding me that my code looks more like C than C++, please read this first before you decide to comment: http://www.stroustrup.com/sibling_rivalry.pdf

You might also try:
uDoubles[i] = ldexp((double)uIntegers[i], -15);
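For completeness, a minimal sketch of dropping that into the original function (my wording, not the answerer's): ldexp(x, -15) scales by 2^-15, which is exactly 1/32768, so the result is identical to the division.
#include <cmath>

void CAudioDataItem::Convert(const vector<int> &uIntegers, vector<double> &uDoubles)
{
    for (size_t i = 0; i < uIntegers.size(); i++)
        uDoubles[i] = std::ldexp((double)uIntegers[i], -15); // multiply by 2^-15
}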

Edit: See Adam's answer above for a version using SSE intrinsics. Better than what I had here ...
To make this more useful, let's look at compiler-generated code here. I'm using gcc 4.8.0 and yes, it is worth checking your specific compiler (version) as there are quite significant differences in output for, say, gcc 4.4, 4.8, clang 3.2 or Intel's icc.
Your original, using g++ -O8 -msse4.2 ... translates into the following loop:
.L2:
cvtsi2sd (%rcx,%rax,4), %xmm0
mulsd %xmm1, %xmm0
addl $1, %edx
movsd %xmm0, (%rsi,%rax,8)
movslq %edx, %rax
cmpq %rdi, %rax
jbe .L2
where %xmm1 holds 1.0/32768.0, so the compiler automatically turns the division into a multiplication by the reciprocal.
On the other hand, using g++ -msse4.2 -O8 -funroll-loops ..., the code created for the loop changes significantly:
[ ... ]
leaq -1(%rax), %rdi
movq %rdi, %r8
andl $7, %r8d
je .L3
[ ... insert a duff's device here, up to 6 * 2 conversions ... ]
jmp .L3
.p2align 4,,10
.p2align 3
.L39:
leaq 2(%rsi), %r11
cvtsi2sd (%rdx,%r10,4), %xmm9
mulsd %xmm0, %xmm9
leaq 5(%rsi), %r9
leaq 3(%rsi), %rax
leaq 4(%rsi), %r8
cvtsi2sd (%rdx,%r11,4), %xmm10
mulsd %xmm0, %xmm10
cvtsi2sd (%rdx,%rax,4), %xmm11
cvtsi2sd (%rdx,%r8,4), %xmm12
cvtsi2sd (%rdx,%r9,4), %xmm13
movsd %xmm9, (%rcx,%r10,8)
leaq 6(%rsi), %r10
mulsd %xmm0, %xmm11
mulsd %xmm0, %xmm12
movsd %xmm10, (%rcx,%r11,8)
leaq 7(%rsi), %r11
mulsd %xmm0, %xmm13
cvtsi2sd (%rdx,%r10,4), %xmm14
mulsd %xmm0, %xmm14
cvtsi2sd (%rdx,%r11,4), %xmm15
mulsd %xmm0, %xmm15
movsd %xmm11, (%rcx,%rax,8)
movsd %xmm12, (%rcx,%r8,8)
movsd %xmm13, (%rcx,%r9,8)
leaq 8(%rsi), %r9
movsd %xmm14, (%rcx,%r10,8)
movsd %xmm15, (%rcx,%r11,8)
movq %r9, %rsi
.L3:
cvtsi2sd (%rdx,%r9,4), %xmm8
mulsd %xmm0, %xmm8
leaq 1(%rsi), %r10
cmpq %rdi, %r10
movsd %xmm8, (%rcx,%r9,8)
jbe .L39
[ ... out ... ]
So it blocks the operations up, but still converts one-value-at-a-time.
If you change your original loop to operate on a few elements per iteration:
size_t i;
for (i = 0; i + 4 <= uIntegers.size(); i += 4)
{
uDoubles[i] = uIntegers[i] / 32768.0;
uDoubles[i+1] = uIntegers[i+1] / 32768.0;
uDoubles[i+2] = uIntegers[i+2] / 32768.0;
uDoubles[i+3] = uIntegers[i+3] / 32768.0;
}
for (; i < uIntegers.size(); i++)
uDoubles[i] = uIntegers[i] / 32768.0;
the compiler, gcc -msse4.2 -O8 ... (i.e. even without requesting unrolling), identifies the potential to use CVTDQ2PD/MULPD and the core of the loop becomes:
.p2align 4,,10
.p2align 3
.L4:
movdqu (%rcx), %xmm0
addq $16, %rcx
cvtdq2pd %xmm0, %xmm1
pshufd $238, %xmm0, %xmm0
mulpd %xmm2, %xmm1
cvtdq2pd %xmm0, %xmm0
mulpd %xmm2, %xmm0
movlpd %xmm1, (%rdx,%rax,8)
movhpd %xmm1, 8(%rdx,%rax,8)
movlpd %xmm0, 16(%rdx,%rax,8)
movhpd %xmm0, 24(%rdx,%rax,8)
addq $4, %rax
cmpq %r8, %rax
jb .L4
cmpq %rdi, %rax
jae .L29
[ ... duff's device style for the "tail" ... ]
.L29:
rep ret
I.e. now the compiler recognizes the opportunity to put two doubles per SSE register, and do parallel multiply / conversion. This is pretty close to the code that Adam's SSE intrinsics version would generate.
The code in total (I've shown only about 1/6th of it) is much more complex than the "direct" intrinsics, due to the fact that, as mentioned, the compiler tries to prepend/append unaligned / not-block-multiple "heads" and "tails" to the loop. It largely depends on the average/expected sizes of your vectors whether this will be beneficial or not; for the "generic" case (vectors more than twice the size of the block processed by the "innermost" loop), it'll help.
The result of this exercise is, largely ... that, if you coerce (by compiler options/optimization) or hint (by slightly rearranging the code) your compiler to do the right thing, then for this specific kind of copy/convert loop, it comes up with code that's not going to be much behind hand-written intrinsics.
Final experiment ... make the code:
#include <algorithm> // std::transform
#include <vector>

static double c(int x) { return x / 32768.0; }
void Convert(const std::vector<int>& uIntegers, std::vector<double>& uDoubles)
{
std::transform(uIntegers.begin(), uIntegers.end(), uDoubles.begin(), c);
}
and (for the nicest-to-read assembly output, this time using gcc 4.4 with gcc -O8 -msse4.2 ...) the generated assembly core loop (again, there's a pre/post bit) becomes:
.p2align 4,,10
.p2align 3
.L8:
movdqu (%r9,%rax), %xmm0
addq $1, %rcx
cvtdq2pd %xmm0, %xmm1
pshufd $238, %xmm0, %xmm0
mulpd %xmm2, %xmm1
cvtdq2pd %xmm0, %xmm0
mulpd %xmm2, %xmm0
movapd %xmm1, (%rsi,%rax,2)
movapd %xmm0, 16(%rsi,%rax,2)
addq $16, %rax
cmpq %rcx, %rdi
ja .L8
cmpq %rbx, %rbp
leaq (%r11,%rbx,4), %r11
leaq (%rdx,%rbx,8), %rdx
je .L10
[ ... ]
.L10:
[ ... ]
ret
With that, what do we learn? If you want to use C++, really use C++ ;-)

Let me try another way:
If multiplying is seriously better from the perspective of assembly instructions, then writing the reciprocal explicitly should guarantee that a multiplication gets emitted.
void CAudioDataItem::Convert(const vector<int>&uIntegers, vector<double> &uDoubles)
{
for ( int i = 0; i <=uIntegers.size()-1;i++)
{
uDoubles[i] = uIntegers[i] * 0.000030517578125;
}
}
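As a sanity check (my addition, not part of the answer): 0.000030517578125 is exactly 1.0/32768.0, since both are powers of two and representable without rounding, so this produces bit-identical results to the division.
static_assert(0.000030517578125 == 1.0 / 32768.0,
              "the literal is the exact reciprocal of 32768");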

Related

How to calculate 2x2 matrix multiplied by 2D vector using SSE intrinsics (32 bit floating points)? (C++, Mac and Windows)

I need to calculate a 2x2 matrix multiplied with a 2D vector. Both use 32 bit floats. I'm hoping to do this using SSE (any version really) for speed optimization purposes, as I'm going to be using it for realtime audio processing.
So the formula I would need is the following:
left = L*A + R*B
right = L*C + R*D
I was thinking of reading the whole matrix from memory as a 128 bit floating point SIMD (4 x 32 bit floating points) if it makes sense. But if it's a better idea to process this in smaller pieces, then that's fine too.
L & R variables will be in their own floats when the processing begin, so they would need to be moved into the SIMD register/variable and when the calculation is done, moved back into regular variables.
The IDEs I'm hoping to get it compiled on are Xcode and Visual Studio. So I guess that'll be Clang and Microsoft's own compilers then which this would need to run properly on.
All help is welcome. Thank you in advance!
I already tried reading SSE instruction sets, but there seems to be so much content in there that it would take a very long time to find the suitable instructions and then the corresponding intrinsics to get anything working.
ADDITIONAL INFORMATION BASED ON YOUR QUESTIONS:
The L & R data comes from their own arrays of data. I have pointers to each of the two arrays (L & R) and then go through them at the same time. So the left/right audio channel data is not interleaved but have their own pointers. In other words, the data is arranged like: LLLLLLLLL RRRRRRRRRR.
Some really good points have been made in the comments about modern compilers being able to optimize the code really well. This is especially true when multiplication is quite fast and shuffling data inside the SIMD registers might be needed: using more multiplications might still be faster than having to shuffle the data multiple times. I didn't realise that modern compilers can be that good these days. I have to experiment with Godbolt using std::array and see what kind of results I'll get for my particular case.
The data needs to be in 32 bit floats, as that is used all over the application. So 16 bit doesn't work for my case.
MORE INFORMATION BASED ON MY TESTS:
I used Godbolt.org to test how the compiler optimizes my code. What I found is that if I do the following, I don't get optimal code:
using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;
Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
Vec2 result;
result[0] = v[0]*m[0] + v[1]*m[1];
result[1] = v[0]*m[2] + v[1]*m[3];
return result;
}
But if I do the following, I do get quite nice code:
using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;
Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
Vec2 result;
result[0] = v[0]*m[0] + v[1]*m[2];
result[1] = v[0]*m[1] + v[1]*m[3];
return result;
}
Meaning that if I transpose the 2D matrix, the compiler seems to output pretty good results as is. I believe I should go with this method since the compiler seems to be able to handle the code nicely.
You are better off leaving the assembly generation to your compiler. It is doing a great job from what I could gather.
Additionally, GCC and CLANG (not sure about the others) have extension attributes that allow you to compile code for several different architectures, one of which will be picked at runtime.
For example, consider the following code:
using Vector = std::array<float, 2>;
using Matrix = std::array<float, 4>;
namespace detail {
Vector multiply(const Matrix& m, const Vector& v) {
Vector r;
r[0] = v[0] * m[0] + v[1] * m[2];
r[1] = v[0] * m[1] + v[1] * m[3];
return r;
}
} // namespace detail
__attribute__((target("default")))
Vector multiply(const Matrix& m, const Vector& v) {
return detail::multiply(m, v);
}
__attribute__((target("avx")))
Vector multiply(const Matrix& m, const Vector& v) {
return detail::multiply(m, v);
}
Assume that you compile it with
g++ -O3 -march=x86-64 main.cpp -o main
For AVX it creates perfectly optimized AVX SIMD code
multiply(Matrix const&, Vector const&) [clone .avx]: # #multiply(Matrix const&, Vector const&) [clone .avx]
vmovsd (%rdi), %xmm0 # xmm0 = mem[0],zero
vmovsd 8(%rdi), %xmm1 # xmm1 = mem[0],zero
vbroadcastss 4(%rsi), %xmm2
vmulps %xmm1, %xmm2, %xmm1
vbroadcastss (%rsi), %xmm2
vmulps %xmm0, %xmm2, %xmm0
vaddps %xmm1, %xmm0, %xmm0
retq
While the default implementation uses SSE instructions only:
multiply(Matrix const&, Vector const&): # #multiply(Matrix const&, Vector const&)
movsd (%rdi), %xmm1 # xmm1 = mem[0],zero
movsd 8(%rdi), %xmm2 # xmm2 = mem[0],zero
movss (%rsi), %xmm0 # xmm0 = mem[0],zero,zero,zero
movss 4(%rsi), %xmm3 # xmm3 = mem[0],zero,zero,zero
shufps $0, %xmm3, %xmm3 # xmm3 = xmm3[0,0,0,0]
mulps %xmm2, %xmm3
shufps $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
mulps %xmm1, %xmm0
addps %xmm3, %xmm0
retq
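Usage is transparent; a hypothetical caller (not from the answer) just calls multiply() and the resolver emitted by the compiler picks the best clone at load time based on the CPU it is running on:
#include <cstdio>

int main() {
    Matrix m{1.0f, 2.0f, 3.0f, 4.0f};
    Vector v{5.0f, 6.0f};
    Vector r = multiply(m, v); // dispatches to the .avx clone on AVX-capable CPUs
    std::printf("%f %f\n", r[0], r[1]);
}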
You might want to check out this post
Godbolt link: https://godbolt.org/z/fcKvchvcb
It's much better to let compilers vectorize over the whole arrays of LLLL and RRRR samples, not one left/right sample pair at a time.
With the same mixing matrix for a whole array of audio samples, you get nice asm with no shuffles. Borrowing code from Nole's answer just to illustrate the auto-vectorization (you might want to simplify the structs):
struct Vector {
std::array<float,2> coef;
};
struct Matrix {
std::array<float,4> coef;
};
static Vector multiply( const Matrix& m, const Vector& v ) {
Vector r;
r.coef[0] = v.coef[0]*m.coef[0] + v.coef[1]*m.coef[2];
r.coef[1] = v.coef[0]*m.coef[1] + v.coef[1]*m.coef[3];
return r;
}
// The per-element functions need to inline into the loop,
// so target attributes need to match or be a superset.
// Or better, just don't use target options on the per-sample function
__attribute__ ((target ("avx,fma")))
void intermix(float *__restrict left, float *__restrict right, const Matrix &m)
{
for (int i=0 ; i<10240 ; i++){
Vector v = {left[i], right[i]};
v = multiply(m, v);
left[i] = v.coef[0];
right[i] = v.coef[1];
}
}
GCC -O3 (without any target options) compiles this to nice AVX1 + FMA code, as per the __attribute__((target("avx,fma"))) (Similar to -march=x86-64-v3). (Godbolt)
# GCC (trunk) -O3
intermix(float*, float*, Matrix const&):
vbroadcastss (%rdx), %ymm5
vbroadcastss 8(%rdx), %ymm4
xorl %eax, %eax
vbroadcastss 4(%rdx), %ymm3
vbroadcastss 12(%rdx), %ymm2 # broadcast each matrix element separately
.L2:
vmulps (%rsi,%rax), %ymm4, %ymm1 # a whole vector of 8 R*B
vmulps (%rsi,%rax), %ymm2, %ymm0
vfmadd231ps (%rdi,%rax), %ymm5, %ymm1
vfmadd231ps (%rdi,%rax), %ymm3, %ymm0
vmovups %ymm1, (%rdi,%rax)
vmovups %ymm0, (%rsi,%rax)
addq $32, %rax
cmpq $40960, %rax
jne .L2
vzeroupper
ret
main:
movl $13, %eax
ret
Note how there are zero shuffle instructions because the matrix coefficients each get broadcast to a separate vector, so one vmulps can do R*B for 8 R samples in parallel, and so on.
Unfortunately GCC and clang both use indexed addressing modes, so the memory source vmulps and vfma instructions un-laminate into 2 uops for the back-end on Intel CPUs. And the stores can't use the port 7 AGU on HSW/SKL. -march=skylake or any other specific Intel SnB-family uarch doesn't fix that for either of them. Clang unrolls by default, so the extra pointer increments to avoid indexed addressing modes would be amortized. (It would actually just be 1 extra add instruction, since we're modifying L and R in-place. You could of course change the function to copy-and-mix.)
If the data is hot in L1d cache, it'll bottleneck on the front-end rather than load+FP throughput, but it still comes relatively close to 2 loads and 2 FMAs per clock.
Hmm, GCC is saving instructions but costing extra loads by loading the same L and R data twice, as memory-source operands for vmulps and vfmadd...ps. With -march=skylake it doesn't do that, instead using a separate vmovups (which has no problem with an indexed addressing mode, but the later store still does.)
I haven't looked at tuning choices from other GCC versions.
# GCC (trunk) -O3 -march=skylake (which implies -mtune=skylake)
.L2:
vmovups (%rsi,%rax), %ymm1
vmovups (%rdi,%rax), %ymm0 # separate loads
vmulps %ymm1, %ymm5, %ymm2 # FP instructions using only registers
vmulps %ymm1, %ymm3, %ymm1
vfmadd231ps %ymm0, %ymm6, %ymm2
vfmadd132ps %ymm4, %ymm1, %ymm0
vmovups %ymm2, (%rdi,%rax)
vmovups %ymm0, (%rsi,%rax)
addq $32, %rax
cmpq $40960, %rax
jne .L2
This is 10 uops, so can issue 2 cycles per iteration on Ice Lake, 2.5c on Skylake. On Ice Lake, it will sustain 2x 256-bit mul/FMA per clock cycle.
On Skylake, it doesn't bottleneck on AGU throughput since it's 4 uops for ports 2,3 every 2.5 cycles. So that's fine. No need for indexed addressing modes.
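If you did want to hand-hold the compiler here, a rough intrinsics sketch of my own (made-up function name; assumes the sample count is a multiple of 8) that bumps the pointers themselves, so every load and store uses a simple one-register addressing mode:
#include <immintrin.h>

void intermix_ptr(float *__restrict left, float *__restrict right,
                  const float m[4], int n)
{
    __m256 A = _mm256_set1_ps(m[0]), B = _mm256_set1_ps(m[2]);
    __m256 C = _mm256_set1_ps(m[1]), D = _mm256_set1_ps(m[3]);
    for (float *end = left + n; left != end; left += 8, right += 8) {
        __m256 L = _mm256_loadu_ps(left);
        __m256 R = _mm256_loadu_ps(right);
        __m256 newL = _mm256_fmadd_ps(L, A, _mm256_mul_ps(R, B)); // L*A + R*B
        __m256 newR = _mm256_fmadd_ps(L, C, _mm256_mul_ps(R, D)); // L*C + R*D
        _mm256_storeu_ps(left,  newL);
        _mm256_storeu_ps(right, newR);
    }
}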

Why does sorting make this branchless code faster?

-Edit2- Seems like occasionally my CPU runs the unsorted case as fast as the sorted one. On other machines they're consistently the same speed. I guess I'm not benchmarking correctly or there are subtle things going on behind the scenes.
-Edit- ASM code below. The generated code is the same in the main loop. Can someone confirm if it's faster or the same? I'm using clang 10 and I'm on a Ryzen 3600. According to Compiler Explorer there are no branches; the adding uses a SIMD instruction that adds either A or B based on the mask. The mask comes from the >= 128 comparison and B is a vector of 0's. So no, there is no hidden branch from what I can see.
I thought this very popular question was silly and tried it in clang Why is processing a sorted array faster than processing an unsorted array?
With g++ -O2 it takes 2 seconds, with clang it took 0.32. I tried making it branchless like the below and found that even though it's branchless, sorting still makes it faster. What's going on?! (Also it seems like -O2 vectorizes the code, so I didn't try to do more.)
#include <algorithm>
#include <ctime>
#include <stdio.h>
int main()
{
// Generate data
const unsigned arraySize = 32768;
int data[arraySize];
for (unsigned c = 0; c < arraySize; ++c)
data[c] = std::rand() % 256;
// !!! With this, the next loop runs faster.
std::sort(data, data + arraySize);
// Test
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i)
{
// Primary loop
for (unsigned c = 0; c < arraySize; ++c)
{
bool v = data[c] >= 128;
sum += data[c] * v;
}
}
double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
printf("%f\nsum = %lld\n", elapsedTime, sum);
}
.LBB0_4: # Parent Loop BB0_3 Depth=1
# => This Inner Loop Header: Depth=2
vmovdqa (%rsp,%rcx,4), %xmm5
vmovdqa 16(%rsp,%rcx,4), %xmm6
vmovdqa 32(%rsp,%rcx,4), %xmm7
vmovdqa 48(%rsp,%rcx,4), %xmm0
addq $16, %rcx
vpcmpgtd %xmm8, %xmm5, %xmm9
vpcmpgtd %xmm8, %xmm6, %xmm10
vpcmpgtd %xmm8, %xmm7, %xmm11
vpcmpgtd %xmm8, %xmm0, %xmm12
vpand %xmm5, %xmm9, %xmm5
vpand %xmm0, %xmm12, %xmm0
vpand %xmm6, %xmm10, %xmm6
vpand %xmm7, %xmm11, %xmm7
vpmovsxdq %xmm6, %ymm9
vpmovsxdq %xmm5, %ymm5
vpmovsxdq %xmm7, %ymm6
vpmovsxdq %xmm0, %ymm0
vpaddq %ymm5, %ymm1, %ymm1
vpaddq %ymm2, %ymm9, %ymm2
vpaddq %ymm6, %ymm3, %ymm3
vpaddq %ymm0, %ymm4, %ymm4
cmpq $32768, %rcx # imm = 0x8000
jne .LBB0_4
It's almost certainly branch prediction. The hardware will try to figure out how your conditions will evaluate and get that branch of code ready, and one method it leverages is previous iterations through loops. Since your sorted array means it's going to have a bunch of false paths, then a bunch of true paths, this means the only time it's going to miss is when the loop starts and when data[c] first becomes >= 128.
In addition, it's possible the compiler is smart enough to optimize on a sorted array. I haven't tested, but it could save the magic c value, do a lookup on that once, and then have fewer iterations of your inner for. This would be an optimization you could write yourself as well.
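A rough sketch of that hand-written optimization (mine, untested against the benchmark): since the data is sorted, find the first element >= 128 once with std::lower_bound and sum only the tail, with no per-element condition at all.
#include <algorithm>
#include <numeric>

long long sumLargeSorted(const int *data, unsigned arraySize, unsigned repeats)
{
    const int *first = std::lower_bound(data, data + arraySize, 128); // the "magic c" position
    long long sum = 0;
    for (unsigned i = 0; i < repeats; ++i)
        sum += std::accumulate(first, data + arraySize, 0LL);
    return sum;
}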

Normalize lower triangular matrix more quickly

The code below does not seem to be the bottleneck.
I am just curious to know if there is a faster way to get this done on a cpu with SSE4.2.
The code works on the lower triangular entries of a matrix stored as a 1d array in the following form in ar_tri:
[ (1,0),
(2,0),(2,1),
(3,0),(3,1),(3,2),
...,
(n,0)...(n,n-1) ]
where (x,y) is the entry of the matrix at the xth row and yth column.
And also the reciprocal square root (rsqrt) of the diagonal of the matrix of the following form in ar_rdia:
[ rsqrt(0,0), rsqrt(1,1), ... ,rsqrt(n,n) ]
gcc6.1 -O3 on the Godbolt compiler explorer auto-vectorizes both versions using SIMD instructions (mulps). The triangular version has cleanup code at the end of each row, so there are some scalar instructions, too.
Would using a rectangular matrix stored as a 1D array in contiguous memory improve the performance?
// Triangular version
#include <iostream>
#include <stdlib.h>
#include <stdint.h>
using namespace std;
int main(void){
size_t n = 10000;
size_t n_tri = n*(n-1)/2;
size_t repeat = 10000;
// test 10000 cycles of the code
float* ar_rdia = (float*)aligned_alloc(16, n*sizeof(float));
//reciprocal square root of diagonal
float* ar_triangular = (float*)aligned_alloc(16, n_tri*sizeof(float));
//lower triangular matrix
size_t i,j,k;
float a,b;
k = 0;
for(i = 0; i < n; ++i){
for(j = 0; j < i; ++j){
ar_triangular[k] *= ar_rdia[i]*ar_rdia[j];
++k;
}
}
cout << k;
free((void*)ar_rdia);
free((void*)ar_triangular);
}
// Square version
#include <iostream>
#include <stdlib.h>
#include <stdint.h>
using namespace std;
int main(void){
size_t n = 10000;
size_t n_sq = n*n;
size_t repeat = 10000;
// test 10000 cycles of the code
float* ar_rdia = (float*)aligned_alloc(16, n*sizeof(float));
//reciprocal square root of diagonal
float* ar_square = (float*)aligned_alloc(16, n_sq*sizeof(float));
//lower triangular matrix
size_t i,j,k;
float a,b;
k = 0;
for(i = 0; i < n; ++i){
for(j = 0; j < n; ++j){
ar_square[k] *= ar_rdia[i]*ar_rdia[j];
++k;
}
}
cout << k;
free((void*)ar_rdia);
free((void*)ar_square);
}
assembly output:
## Triangular version
main:
...
call aligned_alloc
movl $1, %edi
movq %rax, %rbp
xorl %esi, %esi
xorl %eax, %eax
.L2:
testq %rax, %rax
je .L3
leaq -4(%rax), %rcx
leaq -1(%rax), %r8
movss (%rbx,%rax,4), %xmm0
shrq $2, %rcx
addq $1, %rcx
cmpq $2, %r8
leaq 0(,%rcx,4), %rdx
jbe .L9
movaps %xmm0, %xmm2
leaq 0(%rbp,%rsi,4), %r10
xorl %r8d, %r8d
xorl %r9d, %r9d
shufps $0, %xmm2, %xmm2 # broadcast ar_rdia[i]
.L6: # vectorized loop
movaps (%rbx,%r8), %xmm1
addq $1, %r9
mulps %xmm2, %xmm1
movups (%r10,%r8), %xmm3
mulps %xmm3, %xmm1
movups %xmm1, (%r10,%r8)
addq $16, %r8
cmpq %rcx, %r9
jb .L6
cmpq %rax, %rdx
leaq (%rsi,%rdx), %rcx
je .L7
.L4: # scalar cleanup
movss (%rbx,%rdx,4), %xmm1
leaq 0(%rbp,%rcx,4), %r8
leaq 1(%rdx), %r9
mulss %xmm0, %xmm1
cmpq %rax, %r9
mulss (%r8), %xmm1
movss %xmm1, (%r8)
leaq 1(%rcx), %r8
jnb .L7
movss (%rbx,%r9,4), %xmm1
leaq 0(%rbp,%r8,4), %r8
mulss %xmm0, %xmm1
addq $2, %rdx
addq $2, %rcx
cmpq %rax, %rdx
mulss (%r8), %xmm1
movss %xmm1, (%r8)
jnb .L7
mulss (%rbx,%rdx,4), %xmm0
leaq 0(%rbp,%rcx,4), %rcx
mulss (%rcx), %xmm0
movss %xmm0, (%rcx)
.L7:
addq %rax, %rsi
cmpq $10000, %rdi
je .L16
.L3:
addq $1, %rax
addq $1, %rdi
jmp .L2
.L9:
movq %rsi, %rcx
xorl %edx, %edx
jmp .L4
.L16:
... print and free
ret
The interesting part of the assembly for the square case:
main:
... allocate both arrays
call aligned_alloc
leaq 40000(%rbx), %rsi
movq %rax, %rbp
movq %rbx, %rcx
movq %rax, %rdx
.L3: # loop over i
movss (%rcx), %xmm2
xorl %eax, %eax
shufps $0, %xmm2, %xmm2 # broadcast ar_rdia[i]
.L2: # vectorized loop over j
movaps (%rbx,%rax), %xmm0
mulps %xmm2, %xmm0
movups (%rdx,%rax), %xmm1
mulps %xmm1, %xmm0
movups %xmm0, (%rdx,%rax)
addq $16, %rax
cmpq $40000, %rax
jne .L2
addq $4, %rcx # no scalar cleanup: gcc noticed that the row length is a multiple of 4 elements
addq $40000, %rdx
cmpq %rsi, %rcx
jne .L3
... print and free
ret
The loop that stores to the triangular array should vectorize ok, with inefficiencies at the end of each row. gcc actually did auto-vectorize both, according to the asm you posted. I wish I'd looked at that first instead of taking your word for it that it needed to be manually vectorized. :(
.L6: # from the first asm dump.
movaps (%rbx,%r8), %xmm1
addq $1, %r9
mulps %xmm2, %xmm1
movups (%r10,%r8), %xmm3
mulps %xmm3, %xmm1
movups %xmm1, (%r10,%r8)
addq $16, %r8
cmpq %rcx, %r9
jb .L6
This looks exactly like the inner loop that my manual vectorized version would compile to. The .L4 is fully-unrolled scalar cleanup for the last up-to-3 elements of a row. (So it's probably not quite as good as my code). Still, it's quite decent, and auto-vectorization will let you take advantage of AVX and AVX512 with no source changes.
I edited your question to include a link to the code on godbolt, with both versions as separate functions. I didn't take the time to convert them to taking the arrays as function args, because then I'd have to take time to get all the __restrict__ keywords right, and to tell gcc that the arrays are aligned on a 4B * 16 = 64 byte boundary, so it can use aligned loads if it wants to.
Within a row, you're using the same ar_rdia[i] every time, so you broadcast that into a vector once at the start of the row. Then you just do vertical operations between the source ar_rdia[j + 0..3] and destination ar_triangular[k + 0..3].
To handle the last few elements at the end of a row that aren't a multiple of the vector size, we have a few options:
scalar (or narrower vector) fallback / cleanup after the vectorized loop, handling the last up-to-3 elements of each row.
unroll the loop over i by 4, and use an optimal sequence for handling the odd 0, 1, 2, and 3 elements left at the end of a row. So the loop over j will be repeated 4 times, with fixed cleanup after each one. This is probably the most optimal approach.
have the final vector iteration overshoot the end of a row, instead of stopping after the last full vector. So we overlap the start of the next row. Since your operation is not idempotent, this option doesn't work well. Also, making sure k is updated correctly for the start of the next row takes a bit of extra code.
Still, this would be possible by having the final vector of a row blend the multiplier so elements beyond the end of the current row get multiplied by 1.0 (the multiplicative identity). This should be doable with a blendvps with a vector of 1.0 to replace some elements of ar_rdia[i] * ar_rdia[j + 0..3]. We'd also have to create a selector mask (maybe by indexing into an array of int32_t row_overshoot_blend_window[] = {0, 0, 0, 0, -1, -1, -1} using j-i as the index, to take a window of 4 elements). Another option is branching to select either no blend or one of three immediate blends (blendps is faster, doesn't require a vector control mask, and the branches will have an easily predictable pattern).
This causes a store-forwarding failure at the start of 3 of every 4 rows, when the load from ar_triangular overlaps with the store from the end of the last row. IDK which will perform best.
Another maybe even better option would be to do loads that overshoot the end of the row, and do the math with packed SIMD, but then conditionally store 1 to 4 elements.
Not reading outside the memory you allocate can require leaving padding at the end of your buffer, e.g. if the last row wasn't a multiple of 4 elements.
/****** Normalize a triangular matrix using SIMD multiplies,
handling the ends of rows with narrower cleanup code *******/
// size_t i,j,k; // don't do this in C++ or C99. Put declarations in the narrowest scope possible. For types without constructors/destructors, it's still a style / human-readability issue
size_t k = 0;
for(size_t i = 0; i < n; ++i){
// maybe put this inside the for() loop and let the compiler hoist it out, to avoid doing it for small rows where the vector loop doesn't even run once.
__m128 vrdia_i = _mm_set1_ps(ar_rdia[i]); // broadcast-load: very efficient with AVX, load+shuffle without. Only done once per row anyway.
size_t j = 0;
for(j = 0; j < (i-3); j+=4){ // vectorize over this loop
__m128 vrdia_j = _mm_loadu_ps(ar_rdia + j);
__m128 scalefac = _mm_mul_ps(vrdia_j, vrdia_i);
__m128 vtri = _mm_loadu_ps(ar_triangular + k);
__m128 normalized = _mm_mul_ps(scalefac , vtri);
_mm_storeu_ps(ar_triangular + k, normalized);
k += 4;
}
// scalar fallback / cleanup for the ends of rows. Alternative: blend scalefac with 1.0 so it's ok to overlap into the next row.
/* Fine in theory, but gcc likes to make super-bloated code by auto-vectorizing cleanup loops. Besides, we can do better than scalar
for ( ; j < i; ++j ){
ar_triangular[k] *= ar_rdia[i]*ar_rdia[j]; ++k; }
*/
if ((i-j) >= 2) { // load 2 floats (using movsd to zero the upper 64 bits, so mulps doesn't slow down or raise exceptions on denormals or NaNs
__m128 vrdia_j = _mm_castpd_ps( _mm_load_sd(static_cast<const double*>(ar_rdia+j)) );
__m128 scalefac = _mm_mul_ps(vrdia_j, v_rdia_i);
__m128 vtri = _mm_castpd_ps( _mm_load_sd(static_cast<const double*>(ar_triangular + k) ));
__m128 normalized = _mm_mul_ps(scalefac , vtri);
_mm_storel_pi(reinterpret_cast<__m64*>(ar_triangular + k), normalized); // movlps. Agner Fog's table indicates that Nehalem decodes this to 2 uops, instead of 1 for movsd. Bizarre!
j+=2;
k+=2;
}
if (j<i) { // last single element
ar_triangular[k] *= ar_rdia[i]*ar_rdia[j];
++k;
//++j; // end of the row anyway. A smart compiler would still optimize it away...
}
// another possibility: load 4 elements and do the math, then movss, movsd, movsd + extractps (_mm_extractmem_ps), or movups to store the last 1, 2, 3, or 4 elements of the row.
// don't use maskmovdqu; it bypasses cache
}
movsd and movlps are equivalent as stores, but not as loads. See this comment thread for discussion of why it makes some sense that the store forms have separate opcodes. Update: Agner Fog's insn tables indicate that Nehalem decodes MOVH/LPS/D to 2 fused-domain uops. They also say that SnB decodes it to 1, but IvB decodes it to 2 uops. That's got to be wrong. For Haswell, his table splits things to separate entries for movlps/d (1 micro-fused uop) and movhps/d (also 1 micro-fused uop). It makes no sense for the store form of movlps to be 2 uops and need the shuffle port on anything; it does exactly the same thing as a movsd store.
If your matrices are really big, don't worry too much about the end-of-row handling. If they're small, more of the total time is going to be spent on the ends of rows, so it's worth trying multiple ways, and having a careful look at the asm.
You could easily compute rsqrt on the fly here if the source data is contiguous. Otherwise yeah, copy just the diagonal into an array (and compute rsqrt while doing that copy, rather than with another pass over that array like your previous question). Either with scalar rsqrtss and no NR step while copying from the diagonal of a matrix into an array, or manually gather elements into a SIMD vector (with _mm_set_ps(a[i][i], a[i+1][i+1], a[i+2][i+2], a[i+3][i+3]) to let the compiler pick the shuffles) and do rsqrtps + an NR step, then store the vector of 4 results to the array.
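A rough sketch of the rsqrtps + one Newton-Raphson step mentioned above (my code, not from the answer): refine the roughly 12-bit estimate y0 with y1 = 0.5*y0*(3 - x*y0*y0), which gets you to roughly 23-bit accuracy.
#include <xmmintrin.h>

static inline __m128 rsqrt_nr_ps(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y   = _mm_rsqrt_ps(x);                  // ~12-bit reciprocal-sqrt estimate
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);  // x * y0 * y0
    return _mm_mul_ps(_mm_mul_ps(half, y),         // 0.5 * y0 * (3 - x*y0*y0)
                      _mm_sub_ps(three, xyy));
}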
Small problem sizes: avoiding waste from not doing full vectors at the ends of rows
The very start of the matrix is a special case, because three "ends" are contiguous in the first 6 elements. (The 4th row has 4 elements). It might be worth special-casing this and doing the first 3 rows with two SSE vectors. Or maybe just the first two rows together, and then the third row as a separate group of 3. Actually, a group of 4 and a group of 2 is much more optimal, because SSE can do those 8B and 16B loads/stores, but not 12B.
The first 6 scale factors are products of the first three elements of ar_rdia, so we can do a single vector load and shuffle it a couple ways.
ar_rdia[0]*ar_rdia[0]
ar_rdia[1]*ar_rdia[0], ar_rdia[1]*ar_rdia[1],
ar_rdia[2]*ar_rdia[0], ar_rdia[2]*ar_rdia[1], ar_rdia[2]*ar_rdia[2]
^
end of first vector of 4 elems, start of 2nd.
It turns out compilers aren't great at spotting and taking advantage of the patterns here, so to get optimal code for the first 10 elements here, we need to peel those iterations and optimize the shuffles and multiplies manually. I decided to do the first 4 rows, because the 4th row still reuses that SIMD vector of ar_rdia[0..3]. That vector even still gets used by the first vector-width of row 4 (the fifth row).
Also worth considering: doing 2, 4, 4 instead of this 4, 2, 4.
void triangular_first_4_rows_manual_shuffle(float *tri, const float *ar_rdia)
{
__m128 vr0 = _mm_load_ps(ar_rdia); // we know ar_rdia is aligned
// elements 0-3 // row 0, row 1, and the first element of row 2
__m128 vi0 = _mm_shuffle_ps(vr0, vr0, _MM_SHUFFLE(2, 1, 1, 0));
__m128 vj0 = _mm_shuffle_ps(vr0, vr0, _MM_SHUFFLE(0, 1, 0, 0));
__m128 sf0 = vi0 * vj0; // equivalent to _mm_mul_ps(vi0, vj0); // gcc defines __m128 in terms of GNU C vector extensions
__m128 vtri = _mm_load_ps(tri);
vtri *= sf0;
_mm_store_ps(tri, vtri);
tri += 4;
// elements 4 and 5, last two of third row
__m128 vi4 = _mm_shuffle_ps(vr0, vr0, _MM_SHUFFLE(3, 3, 2, 2)); // can compile into unpckhps, saving a byte. Well spotted by clang
__m128 vj4 = _mm_movehl_ps(vi0, vi0); // save a mov by reusing a previous shuffle output, instead of a fresh _mm_shuffle_ps(vr0, vr0, _MM_SHUFFLE(2, 1, 2, 1)); // also saves a code byte (no immediate)
// actually, a movsd from ar_rdia+1 would get these two elements with no shuffle. We aren't bottlenecked on load-port uops, so that would be good.
__m128 sf4 = vi4 * vj4;
//sf4 = _mm_movehl_ps(sf4, sf4); // doesn't save anything compared to shuffling before multiplying
// could use movhps to load and store *tri to/from the high half of an xmm reg, but each of those takes a shuffle uop
// so we shuffle the scale-factor down to the low half of a vector instead.
__m128 vtri4 = _mm_castpd_ps(_mm_load_sd((const double*)tri)); // elements 4 and 5
vtri4 *= sf4;
_mm_storel_pi((__m64*)tri, vtri4); // 64bit store. Possibly slower than movsd if Agner's tables are right about movlps, but I doubt it
tri += 2;
// elements 6-9 = row 4, still only needing elements 0-3 of ar_rdia
__m128 vi6 = _mm_shuffle_ps(vr0, vr0, _MM_SHUFFLE(3, 3, 3, 3)); // broadcast. clang puts this ahead of earlier shuffles. Maybe we should put this whole block early and load/store this part of tri, too.
//__m128 vi6 = _mm_movehl_ps(vi4, vi4);
__m128 vj6 = vr0; // 3, 2, 1, 0 already in the order we want
__m128 vtri6 = _mm_loadu_ps(tri);   // tri already points at element 6 here
vtri6 *= vi6 * vj6;
_mm_storeu_ps(tri, vtri6);
tri += 4;
// ... first 4 rows done
}
gcc and clang compile this very similarly with -O3 -march=nehalem (to enable SSE4.2 but not AVX). See the code on Godbolt, with some other versions that don't compile as nicely:
# gcc 5.3
movaps xmm0, XMMWORD PTR [rsi] # D.26921, MEM[(__v4sf *)ar_rdia_2(D)]
movaps xmm1, xmm0 # tmp108, D.26921
movaps xmm2, xmm0 # tmp111, D.26921
shufps xmm1, xmm0, 148 # tmp108, D.26921,
shufps xmm2, xmm0, 16 # tmp111, D.26921,
mulps xmm2, xmm1 # sf0, tmp108
movhlps xmm1, xmm1 # tmp119, tmp108
mulps xmm2, XMMWORD PTR [rdi] # vtri, MEM[(__v4sf *)tri_5(D)]
movaps XMMWORD PTR [rdi], xmm2 # MEM[(__v4sf *)tri_5(D)], vtri
movaps xmm2, xmm0 # tmp116, D.26921
shufps xmm2, xmm0, 250 # tmp116, D.26921,
mulps xmm1, xmm2 # sf4, tmp116
movsd xmm2, QWORD PTR [rdi+16] # D.26922, MEM[(const double *)tri_5(D) + 16B]
mulps xmm1, xmm2 # vtri4, D.26922
movaps xmm2, xmm0 # tmp126, D.26921
shufps xmm2, xmm0, 255 # tmp126, D.26921,
mulps xmm0, xmm2 # D.26925, tmp126
movlps QWORD PTR [rdi+16], xmm1 #, vtri4
movups xmm1, XMMWORD PTR [rdi+48] # tmp129,
mulps xmm0, xmm1 # vtri6, tmp129
movups XMMWORD PTR [rdi+48], xmm0 #, vtri6
ret
Only 22 total instructions for the first 4 rows, and 4 of them are movaps reg-reg moves. (clang manages with only 3, with a total of 21 instructions). We'd probably save one by getting [ x x 2 1 ] into a vector with a movsd from ar_rdia+1, instead of yet another movaps + shuffle. And reduce pressure on the shuffle port (and ALU uops in general).
With AVX, clang uses vpermilps for most shuffles, but that just wastes a byte of code-size. Unless it saves power (because it only has 1 input), there's no reason to prefer its immediate form over shufps, unless you can fold a load into it.
I considered using palignr to always go 4-at-a-time through the triangular matrix, but that's almost certainly worse. You'd need those palignrs all the time, not just at the ends.
I think extra complexity / narrower loads/stores at the ends of rows is just going to give out-of-order execution something to do. For large problem sizes, you'll spend most of the time doing 16B at a time in the inner loop. This will probably bottleneck on memory, so less memory-intensive work at the ends of rows is basically free as long as out-of-order execution keeps pulling cache-lines from memory as fast as possible.
So triangular matrices are still good for this use case; keeping your working set dense and in contiguous memory seems good. Depending on what you're going to do next, this might or might not be ideal overall.

simd vectorlength and unroll factor for fortran loop

I want to vectorize the fortran below with SIMD directives
!DIR$ SIMD
DO IELEM = 1 , NELEM
X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
ENDDO
And I used the AVX2 instruction set. The program is compiled with
ifort main_vec.f -simd -g -pg -O2 -vec-report6 -o vec.out -xcore-avx2 -align array32byte
Then I'd like to add VECTORLENGTH(n) clause after SIMD.
If there's no such clause, or n = 2 or 4, the vectorization report gives no information about the unroll factor;
if n = 8 or 16, it reports: vectorization support: unroll factor set to 2.
I've read Intel's article about "vectorization support: unroll factor set to xxxx", so I guess the loop is unrolled to something like:
DO IELEM = 1 , NELEM, 2
X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
X(IKLE(IELEM+1)) = X(IKLE(IELEM+1)) + W(IELEM+1)
ENDDO
Then 2 values of X go into a vector register, 2 values of W go into another, and the addition is done.
But how does the value of VECTORLENGTH work? Or maybe I don't really understand what the vector length means.
And since I use the AVX2 instruction set, for the DOUBLE PRECISION type X, what's the maximum vector length that can be reached?
Here's part of the assembly of the loop with SSE2, VL=8 and the compiler told me that unroll factor is 2. However it used 4 registers instead of 2.
.loc 1 114 is_stmt 1
movslq main_vec_$IKLE.0.1(,%rdx,4), %rsi #114.9
..LN202:
movslq 4+main_vec_$IKLE.0.1(,%rdx,4), %rdi #114.9
..LN203:
movslq 8+main_vec_$IKLE.0.1(,%rdx,4), %r8 #114.9
..LN204:
movslq 12+main_vec_$IKLE.0.1(,%rdx,4), %r9 #114.9
..LN205:
movsd -8+main_vec_$X.0.1(,%rsi,8), %xmm0 #114.26
..LN206:
movslq 16+main_vec_$IKLE.0.1(,%rdx,4), %r10 #114.9
..LN207:
movhpd -8+main_vec_$X.0.1(,%rdi,8), %xmm0 #114.26
..LN208:
movslq 20+main_vec_$IKLE.0.1(,%rdx,4), %r11 #114.9
..LN209:
movsd -8+main_vec_$X.0.1(,%r8,8), %xmm1 #114.26
..LN210:
movslq 24+main_vec_$IKLE.0.1(,%rdx,4), %r14 #114.9
..LN211:
addpd main_vec_$W.0.1(,%rdx,8), %xmm0 #114.9
..LN212:
movhpd -8+main_vec_$X.0.1(,%r9,8), %xmm1 #114.26
..LN213:
..LN214:
movslq 28+main_vec_$IKLE.0.1(,%rdx,4), %r15 #114.9
..LN215:
movsd -8+main_vec_$X.0.1(,%r10,8), %xmm2 #114.26
..LN216:
addpd 16+main_vec_$W.0.1(,%rdx,8), %xmm1 #114.9
..LN217:
movhpd -8+main_vec_$X.0.1(,%r11,8), %xmm2 #114.26
..LN218:
..LN219:
movsd -8+main_vec_$X.0.1(,%r14,8), %xmm3 #114.26
..LN220:
addpd 32+main_vec_$W.0.1(,%rdx,8), %xmm2 #114.9
..LN221:
movhpd -8+main_vec_$X.0.1(,%r15,8), %xmm3 #114.26
..LN222:
..LN223:
addpd 48+main_vec_$W.0.1(,%rdx,8), %xmm3 #114.9
..LN224:
movsd %xmm0, -8+main_vec_$X.0.1(,%rsi,8) #114.9
..LN225:
.loc 1 113 is_stmt 1
addq $8, %rdx #113.7
..LN226:
.loc 1 114 is_stmt 1
psrldq $8, %xmm0 #114.9
..LN227:
.loc 1 113 is_stmt 1
cmpq $26000, %rdx #113.7
..LN228:
.loc 1 114 is_stmt 1
movsd %xmm0, -8+main_vec_$X.0.1(,%rdi,8) #114.9
..LN229:
movsd %xmm1, -8+main_vec_$X.0.1(,%r8,8) #114.9
..LN230:
psrldq $8, %xmm1 #114.9
..LN231:
movsd %xmm1, -8+main_vec_$X.0.1(,%r9,8) #114.9
..LN232:
movsd %xmm2, -8+main_vec_$X.0.1(,%r10,8) #114.9
..LN233:
psrldq $8, %xmm2 #114.9
..LN234:
movsd %xmm2, -8+main_vec_$X.0.1(,%r11,8) #114.9
..LN235:
movsd %xmm3, -8+main_vec_$X.0.1(,%r14,8) #114.9
..LN236:
psrldq $8, %xmm3 #114.9
..LN237:
movsd %xmm3, -8+main_vec_$X.0.1(,%r15,8) #114.9
..LN238:
1) Vector Length N is the number of elements/iterations you can execute in parallel after "vectorizing" your loop (normally by putting N elements of array X into a single vector register and processing them all together with a vector instruction). For simplification, think of Vector Length as the value given by this formula:
Vector Length (abbreviated VL) = Vector Register Width / Sizeof (data type)
For AVX2 , Vector Register Width = 256 bit. Sizeof (double precision) = 8 bytes = 64 bits. Thus:
Vector Length (double FP, avx2) = 256 / 64 = 4
!DIR$ SIMD VECTORLENGTH(N) basically forces the compiler to use the specified vector length (and to put N elements of array X into a single vector register). That's it.
2) Unrolling and vectorization relationship. For simplification, think of unrolling and vectorization as normally unrelated (somewhat "orthogonal") optimization techniques.
If your loop is unrolled by a factor of M (M could be 2, 4, ...), then it doesn't necessarily mean that vector registers were used at all, and it does not mean that your loop was parallelized in any sense. What it means instead is that M instances of the original loop iterations have been grouped together into a single iteration; and within the new "unwound"/"unrolled" iteration the old ex-iterations are executed sequentially, one by one (so your guessed example is absolutely correct).
The purpose of unrolling is normally to make the loop more "micro-architecture/memory-friendly". In more detail: by making loop iterations "fatter" you normally improve the balance between pressure on your CPU resources vs. pressure on your memory/cache resources, especially since after unrolling you can normally reuse some data in registers more effectively.
3) Unrolling + Vectorization. It's not uncommon for compilers to simultaneously vectorize (with VL=N) and unroll (by M) certain loops. As a result, the number of iterations in the optimized loop is smaller than the number of iterations in the original loop by approximately a factor of NxM; however, the number of elements processed in parallel (simultaneously at a given moment in time) will only be N.
Thus, in your example, if loop is vectorized with VL=4, and unrolled by 2, then the pseudo-code for it might look like:
DO IELEM = 1 , NELEM, 8
[X(IKLE(IELEM)),X(IKLE(IELEM+2)), X(IKLE(IELEM+4)), X(IKLE(IELEM+6))] = ...
[X(IKLE(IELEM+1)),X(IKLE(IELEM+3)), X(IKLE(IELEM+5)), X(IKLE(IELEM+7))] = ...
ENDDO
where the square brackets "correspond" to vector register contents.
4) Vectorization vs. Unrolling:
For loops with a relatively small number of iterations (especially in C++) it may happen that unrolling is not desirable, since it partially blocks efficient vectorization (not enough iterations to execute in parallel) and (as you see from my artificial example) may somehow impact the way the data has to be loaded from memory. Different compilers have different heuristics wrt balancing trip counts, VL and unrolling against each other; that's probably why unrolling was disabled in your case when VL was smaller than 8.
Runtime and compile-time trade-offs between trip counts, unrolling and vector length, as well as appropriate automatic suggestions (especially when using a recent Intel C++ or Fortran compiler), can be explored using the Intel (Vectorization) Advisor.
5) P.S. There is a third dimension (I don't really like to talk about it).
When the vector length requested by the user is bigger than the possible vector length on the given hardware (say, specifying vectorlength(16) for an AVX2 platform with double FP), or when you mix different types, then the compiler can (or cannot) start using the notion of a "virtual vector register" and start doing double-/quad-pumping. M-pumping is a kind of unrolling, but only for a single instruction (i.e. pumping leads to repeating the single instruction, while unrolling leads to repeating the whole loop body). You may try to read about m-pumping in recent OpenMP books like the given one. So in some cases you may end up with a superposition of a) vectorization, b) unrolling and c) double-pumping, but it's not a common case and I'd avoid enforcing vectorlength > 2*ISA_VectorLength.

For loop performance difference, and compiler optimization

I chose David's answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what happens when setting the optimization flags on.
Jerry Coffin's answer explained what happens when setting the optimization flags for this example. What remains unanswered is why superCalculationA runs slower than superCalculationB, when B performs one extra memory reference and one addition for each iteration. Nemo's post shows the assembler output. I confirmed this by compiling with the -S flag on my PC, a 2.9 GHz Sandy Bridge (i5-2310), running Ubuntu 12.04 64-bit, as suggested by Matteo Italia.
I was experimenting with for-loops performance when I stumbled upon the following case.
I have the following code that does the same computation in two different ways.
#include <cstdint>
#include <chrono>
#include <cstdio>
using std::uint64_t;
uint64_t superCalculationA(int init, int end)
{
uint64_t total = 0;
for (int i = init; i < end; i++)
total += i;
return total;
}
uint64_t superCalculationB(int init, int todo)
{
uint64_t total = 0;
for (int i = init; i < init + todo; i++)
total += i;
return total;
}
int main()
{
const uint64_t answer = 500000110500000000;
std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
double elapsed;
std::printf("=====================================================\n");
start = std::chrono::high_resolution_clock::now();
uint64_t ret1 = superCalculationA(111, 1000000111);
end = std::chrono::high_resolution_clock::now();
elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
start = std::chrono::high_resolution_clock::now();
uint64_t ret2 = superCalculationB(111, 1000000000);
end = std::chrono::high_resolution_clock::now();
elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
if (ret1 == answer)
{
std::printf("The first method, i.e. superCalculationA, succeeded.\n");
}
if (ret2 == answer)
{
std::printf("The second method, i.e. superCalculationB, succeeded.\n");
}
std::printf("=====================================================\n");
return 0;
}
Compiling this code with
g++ main.cpp -o output --std=c++11
leads to the following result:
=====================================================
Elapsed time: 2.859 s | 2859.441 ms | 2859440.968 us
Elapsed time: 2.204 s | 2204.059 ms | 2204059.262 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
My first question is: why is the second loop running 23% faster than the first?
On the other hand, if I compile the code with
g++ main.cpp -o output --std=c++11 -O1
The results improve a lot,
=====================================================
Elapsed time: 0.318 s | 317.773 ms | 317773.142 us
Elapsed time: 0.314 s | 314.429 ms | 314429.393 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
and the difference in time almost disappears.
But I could not believe my eyes when I set the -O2 flag,
g++ main.cpp -o output --std=c++11 -O2
and got this:
=====================================================
Elapsed time: 0.000 s | 0.000 ms | 0.328 us
Elapsed time: 0.000 s | 0.000 ms | 0.208 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
So, my second question is: What is the compiler doing when I set -O1 and -O2 flags that leads to this gigantic performance improvement?
I checked Optimized Option - Using the GNU Compiler Collection (GCC), but that did not clarify things.
By the way, I am compiling this code with g++ (GCC) 4.9.1.
EDIT to confirm Basile Starynkevitch's assumption
I edited the code, now main looks like this:
int main(int argc, char **argv)
{
int start = atoi(argv[1]);
int end = atoi(argv[2]);
int delta = end - start + 1;
std::chrono::time_point<std::chrono::high_resolution_clock> t_start, t_end;
double elapsed;
std::printf("=====================================================\n");
t_start = std::chrono::high_resolution_clock::now();
uint64_t ret1 = superCalculationB(start, delta);
t_end = std::chrono::high_resolution_clock::now();
elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
t_start = std::chrono::high_resolution_clock::now();
uint64_t ret2 = superCalculationA(start, end);
t_end = std::chrono::high_resolution_clock::now();
elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
std::printf("Results were %s\n", (ret1 == ret2) ? "the same!" : "different!");
std::printf("=====================================================\n");
return 0;
}
These modifications really increased the computation time, both for -O1 and -O2; both are giving me around 620 ms now, which proves that -O2 really was doing some of the computation at compile time.
I still do not understand what these flags are doing to improve performance, and -Ofast does even better, at about 320 ms.
Also notice that I have changed the order in which functions A and B are called, to test Jerry Coffin's assumption. Compiling this code with no optimizer flags still gives me around 2.2 s in B and 2.8 s in A, so I figure that it is not a cache thing. Just to reinforce: I am not talking about optimization in the first case (the one with no flags); I just want to know what makes the second loop run faster than the first.
My immediate guess would be that the second is faster, not because of the changes you made to the loop, but because it's second, so the cache is already primed when it runs.
To test the theory, I re-arranged your code to reverse the order in which the two calculations were called:
#include <cstdint>
#include <chrono>
#include <cstdio>
using std::uint64_t;
uint64_t superCalculationA(int init, int end)
{
uint64_t total = 0;
for (int i = init; i < end; i++)
total += i;
return total;
}
uint64_t superCalculationB(int init, int todo)
{
uint64_t total = 0;
for (int i = init; i < init + todo; i++)
total += i;
return total;
}
int main()
{
const uint64_t answer = 500000110500000000;
std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
double elapsed;
std::printf("=====================================================\n");
start = std::chrono::high_resolution_clock::now();
uint64_t ret2 = superCalculationB(111, 1000000000);
end = std::chrono::high_resolution_clock::now();
elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
start = std::chrono::high_resolution_clock::now();
uint64_t ret1 = superCalculationA(111, 1000000111);
end = std::chrono::high_resolution_clock::now();
elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
if (ret1 == answer)
{
std::printf("The first method, i.e. superCalculationA, succeeded.\n");
}
if (ret2 == answer)
{
std::printf("The second method, i.e. superCalculationB, succeeded.\n");
}
std::printf("=====================================================\n");
return 0;
}
The result I got was:
=====================================================
Elapsed time: 0.286 s | 286.000 ms | 286000.000 us
Elapsed time: 0.271 s | 271.000 ms | 271000.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
So, when version A runs first, it's slower. When version B runs first, it's slower.
To confirm, I added an extra call to superCalculationB before doing the timing on either version A or B. After that, I tried running the program three times. For those three runs, I'd judge the results a tie (version A was faster once and version B was faster twice, but neither won dependably nor by a wide enough margin to be meaningful).
That doesn't prove that it's actually a cache situation as such, but does give a pretty strong indication that it's a matter of the order in which the functions are called, not the difference in the code itself.
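To make that kind of order-controlled comparison easier to repeat, here is a minimal sketch of a harness that does one untimed warm-up call and then interleaves several timed runs. The helper name timeIt is my own, and the snippet assumes the superCalculationA/B definitions from the listing above are linked in:
#include <chrono>
#include <cstdint>
#include <cstdio>
using std::uint64_t;
// Declarations for the two functions from the listing above.
uint64_t superCalculationA(int init, int end);
uint64_t superCalculationB(int init, int todo);
// Hypothetical helper: time one call and return the elapsed seconds,
// writing the result through `out` so it can still be checked or printed.
template <typename F>
double timeIt(F f, int a, int b, uint64_t &out)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    out = f(a, b);
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
int main()
{
    uint64_t r = 0;
    // One untimed warm-up call, so neither function benefits from going second.
    timeIt(superCalculationB, 111, 1000000000, r);
    // Interleave several timed runs instead of timing each function once.
    for (int run = 0; run < 3; ++run)
    {
        double tA = timeIt(superCalculationA, 111, 1000000111, r);
        double tB = timeIt(superCalculationB, 111, 1000000000, r);
        std::printf("run %d: A = %.3f s, B = %.3f s\n", run, tA, tB);
    }
    return 0;
}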
As far as what the compiler does to make the code faster: the main thing it does is unroll a few iterations of the loop. We can get pretty much the same effect if we unroll a few iterations by hand:
uint64_t superCalculationC(int init, int end)
{
int f_end = end - ((end - init) & 7);
int i;
uint64_t total = 0;
for (i = init; i < f_end; i += 8) {
total += i;
total += i + 1;
total += i + 2;
total += i + 3;
total += i + 4;
total += i + 5;
total += i + 6;
total += i + 7;
}
for (; i < end; i++)
total += i;
return total;
}
This has a property that some might find rather odd: it's actually faster when compiled with -O2 than with -O3. When compiled with -O2, it's also about five times faster than either of the other two are when compiled with -O3.
The primary reason for the ~5x speed gain compared to the compiler's loop unrolling is that we've unrolled the loop somewhat differently (and more intelligently, IMO) than the compiler does. We compute f_end to tell us how many times the unrolled loop should execute. We execute those iterations, then we execute a separate loop to "clean up" any odd iterations at the end.
The compiler instead generates code that's roughly equivalent to something like this:
for (i = init; i < end; i += 8) {
total += i;
if (i + 1 >= end) break;
total += i + 1;
if (i + 2 >= end) break;
total += i + 2;
// ...
}
Although this is quite a bit faster than when the loop hasn't been unrolled at all, it's quite a bit faster still to eliminate those extra checks from the main loop, and execute a separate loop for any odd iterations.
Given such a trivial loop body being executed such a large number of times, you can also improve speed (when compiled with -O2) still further by unrolling more iterations of the loop. With 16 iterations unrolled, it was about twice as fast as the code above with 8 iterations unrolled:
uint64_t superCalculationC(int init, int end)
{
int first_end = end - ((end - init) & 0xf);
int i;
uint64_t total = 0;
for (i = init; i < first_end; i += 16) {
total += i + 0;
total += i + 1;
total += i + 2;
// code for `i+3` through `i+13` goes here
total += i + 14;
total += i + 15;
}
for (; i < end; i++)
total += i;
return total;
}
I haven't tried to explore the limit of gains from unrolling this particular loop, but unrolling 32 iterations nearly doubles the speed again. Depending on the processor you're using, you might get some small gains by unrolling 64 iterations, but I'd guess we're starting to approach the limits--at some point, performance gains will probably level off, then (if you unroll still more iterations) probably drop off, quite possibly dramatically.
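If you want to experiment with different unroll factors without writing every total += i + k line by hand, a small compile-time helper is one option. This is only a sketch of the same manual-unrolling idea (the Unroll template is mine, not something from the answer above), and it relies on the compiler flattening the recursion at -O2:
#include <cstdint>
using std::uint64_t;
// Unroll<N>::step adds i+0 .. i+(N-1) to total; the recursion is resolved at
// compile time and flattens into N straight-line additions.
template <int N>
struct Unroll
{
    static void step(uint64_t &total, int i)
    {
        Unroll<N - 1>::step(total, i);
        total += i + (N - 1);
    }
};
template <>
struct Unroll<0>
{
    static void step(uint64_t &, int) {}
};
uint64_t superCalculationUnrolled(int init, int end)
{
    const int kUnroll = 16;                          // factor to experiment with
    int first_end = end - ((end - init) % kUnroll);  // limit for the main loop
    uint64_t total = 0;
    int i = init;
    for (; i < first_end; i += kUnroll)
        Unroll<kUnroll>::step(total, i);
    for (; i < end; i++)                             // cleanup loop for leftovers
        total += i;
    return total;
}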
Summary: with -O3 the compiler unrolls a number of iterations of the loop. This is extremely effective in this case, primarily because we have many executions of nearly the most trivial possible loop body. Unrolling the loop by hand is even more effective than letting the compiler do it--we can unroll more intelligently, and we can simply unroll more iterations than the compiler does. The extra intelligence can give us an improvement of around 5:1, and the extra iterations another 4:1 or so¹ (at the expense of somewhat longer, slightly less readable code).
Final caveat: as always with optimization, your mileage may vary. Differences in compilers and/or processors mean you're likely to get at least somewhat different results than I did. I'd expect my hand-unrolled loop to be substantially faster than the other two in most cases, but exactly how much faster is likely to vary.
¹ But note that this is comparing the hand-unrolled loop with -O2 to the original loop with -O3. When compiled with -O3, the hand-unrolled loop runs much more slowly.
Checking the assembly output is really the only way to illuminate such things.
Compiler optimisations will do a great deal of things, including things that are not strictly "standard compliant" (although that is not the case with -O1 and -O2, to my knowledge) - for instance, check the -Ofast switch.
I have found this helpful: http://gcc.godbolt.org/, and with your demo code here
-O2
Explaining the -O2 result is easy: look at the code from godbolt with the flag changed to -O2:
main:
pushq %rbx
movl $.LC2, %edi
call puts
call std::chrono::_V2::system_clock::now()
movq %rax, %rbx
call std::chrono::_V2::system_clock::now()
pxor %xmm0, %xmm0
subq %rbx, %rax
movsd .LC4(%rip), %xmm2
movl $.LC6, %edi
movsd .LC5(%rip), %xmm1
cvtsi2sdq %rax, %xmm0
movl $3, %eax
mulsd .LC3(%rip), %xmm0
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm1
call printf
call std::chrono::_V2::system_clock::now()
movq %rax, %rbx
call std::chrono::_V2::system_clock::now()
pxor %xmm0, %xmm0
subq %rbx, %rax
movsd .LC4(%rip), %xmm2
movl $.LC6, %edi
movsd .LC5(%rip), %xmm1
cvtsi2sdq %rax, %xmm0
movl $3, %eax
mulsd .LC3(%rip), %xmm0
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm1
call printf
movl $.LC7, %edi
call puts
movl $.LC8, %edi
call puts
movl $.LC2, %edi
call puts
xorl %eax, %eax
popq %rbx
ret
There is no call to the two functions; furthermore, there is no comparison of the results.
Now why can that be? It is, of course, the power of optimization; the program is simply too simple ...
First, inlining is applied, after which the compiler can see that all the parameters are in fact literal values (111, 1000000111, 1000000000, 500000110500000000) and therefore constants.
It finds out that init + todo is a loop invariant and replaces it with end, defining end before the loop from B as end = init + todo = 111 + 1000000000 = 1000000111.
Both loops are now known to contain only compile-time values, and they are furthermore completely identical:
uint64_t total = 0;
for (int i = 111; i < 1000000111; i++)
total += i;
return total;
The compiler sees that it is a summation: total is the accumulator of a sum with a constant stride of 1, so it does the ultimate loop unrolling, namely all of it, because it knows that this form has a closed-form sum.
Rewriting with Gauss's pairing trick (s = n*(first+last)/2):
111+1000000110
112+1000000109
...
1000000109+112
1000000110+111 = 1000000221
loops = 1000000111-111 = 1E9
halve it, as pairing the terms this way counts the sum twice:
1000000221 * 1E9 / 2 = 500000110500000000
which is the result we were looking for, 500000110500000000.
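Just to double-check the arithmetic, here is a tiny sketch (not part of the original answer) that evaluates the same closed form at compile time:
#include <cstdint>
#include <cstdio>
// Closed form of the sum over i = init .. end-1: there are (end - init) terms,
// each pair above adds up to (init + end - 1), and pairing counts the sum twice.
constexpr std::uint64_t gaussSum(std::uint64_t init, std::uint64_t end)
{
    return (end - init) * (init + end - 1) / 2;
}
static_assert(gaussSum(111, 1000000111) == 500000110500000000ULL,
              "matches the constant the question checks against");
int main()
{
    std::printf("%llu\n",
                static_cast<unsigned long long>(gaussSum(111, 1000000111)));
    return 0;
}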
Now that it has the result, which is a compile-time constant, it can compare it with the wanted result, note that the comparison is always true, and remove it.
The time noted is the minimum time for system_clock on your PC.
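As an aside, if you want the computation to survive optimization without feeding the inputs through argv the way the question's edit does, an optimization barrier in the style of Google Benchmark's DoNotOptimize is another option. This is a GCC-specific sketch and the helper name is my own:
#include <cstdint>
using std::uint64_t;
uint64_t superCalculationB(int init, int todo);   // from the listing above
// Empty inline asm that claims to read and possibly modify `value`, so the
// optimizer can no longer treat it as a known constant or discard it.
template <typename T>
inline void doNotOptimize(T &value)
{
    asm volatile("" : "+r,m"(value) : : "memory");
}
uint64_t timedCall()
{
    int init = 111, todo = 1000000000;
    doNotOptimize(init);                           // inputs become opaque,
    doNotOptimize(todo);                           // much like reading them from argv
    uint64_t ret = superCalculationB(init, todo);
    doNotOptimize(ret);                            // the result cannot be thrown away
    return ret;
}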
-O0
The -O0 timing is more difficult to explain and is most likely an artifact of missing alignment of functions and jumps; both the µop cache and the loop buffer like 32-byte alignment. You can test that if you add some
asm("nop");
in front of A's loop; 2-3 might do the trick.
Store-forwarding also prefers its values to be naturally aligned.
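A sketch of what that experiment might look like; how many nops actually shift the loop onto a friendlier boundary is guesswork and has to be found by re-measuring:
#include <cstdint>
using std::uint64_t;
uint64_t superCalculationA(int init, int end)
{
    uint64_t total = 0;
    // Padding that shifts the code address of the loop that follows; try
    // 1, 2, 3, ... nops and re-measure the -O0 build each time.
    asm("nop");
    asm("nop");
    asm("nop");
    for (int i = init; i < end; i++)
        total += i;
    return total;
}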
EDIT: After learning more about dependencies in processor pipelining, I revised my answer, removing some unnecessary details and offering a more concrete explanation of the slowdown.
It appears that the performance difference in the -O0 case is due to processor pipelining.
First, the assembly (for the -O0 build), copied from Nemo's answer, with some of my own comments inline:
superCalculationA(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp) # init
movl %esi, -24(%rbp) # end
movq $0, -8(%rbp) # total = 0
movl -20(%rbp), %eax # copy init to register rax
movl %eax, -12(%rbp) # i = [rax]
jmp .L7
.L8:
movl -12(%rbp), %eax # copy i to register rax
cltq
addq %rax, -8(%rbp) # total += [rax]
addl $1, -12(%rbp) # i++
.L7:
movl -12(%rbp), %eax # copy i to register rax
cmpl -24(%rbp), %eax # [rax] < end
jl .L8
movq -8(%rbp), %rax
popq %rbp
ret
superCalculationB(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp) # init
movl %esi, -24(%rbp) # todo
movq $0, -8(%rbp) # total = 0
movl -20(%rbp), %eax # copy init to register rax
movl %eax, -12(%rbp) # i = [rax]
jmp .L11
.L12:
movl -12(%rbp), %eax # copy i to register rax
cltq
addq %rax, -8(%rbp) # total += [rax]
addl $1, -12(%rbp) # i++
.L11:
movl -20(%rbp), %edx # copy init to register rdx
movl -24(%rbp), %eax # copy todo to register rax
addl %edx, %eax # [rax] += [rdx] (so [rax] = init+todo)
cmpl -12(%rbp), %eax # i < [rax]
jg .L12
movq -8(%rbp), %rax
popq %rbp
ret
In both functions, the stack layout looks like this:
Addr Content
24 end/todo
20 init
16 <empty>
12 i
08 total
04
00 <base pointer>
(Note that total is a 64-bit int and so occupies two 4-byte slots.)
These are the key lines of superCalculationA():
addl $1, -12(%rbp) # i++
.L7:
movl -12(%rbp), %eax # copy i to register rax
cmpl -24(%rbp), %eax # [rax] < end
The stack address -12(%rbp) (which holds the value of i) is written to in the addl instruction, and then it is immediately read in the very next instruction. The read instruction cannot begin until the write has completed. This represents a block in the pipeline, causing superCalculationA() to be slower than superCalculationB().
You might be curious why superCalculationB() doesn't have this same pipeline block. It's really just an artifact of how gcc compiles the code in -O0 and doesn't represent anything fundamentally interesting. Basically, in superCalculationA(), the comparison i<end is performed by reading i from a register, while in superCalculationB(), the comparison i<init+todo is performed by reading i from the stack.
To demonstrate that this is just an artifact, let's replace
for (int i = init; i < end; i++)
with
for (int i = init; end > i; i++)
in superCalculationA(). The generated assembly then looks the same, with just the following change to the key lines:
addl $1, -12(%rbp) # i++
.L7:
movl -24(%rbp), %eax # copy end to register rax
cmpl -12(%rbp), %eax # i < [rax]
Now i is read from the stack, and the pipeline block is gone. Here are the performance numbers after making this change:
=====================================================
Elapsed time: 2.296 s | 2295.812 ms | 2295812.000 us
Elapsed time: 2.368 s | 2367.634 ms | 2367634.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
It should be noted that this is really a toy example, since we are compiling with -O0. In the real world, we compile with -O2 or -O3. In that case, the compiler orders the instructions in such a way so as to minimize pipeline blocks, and we don't need to worry about whether to write i<end or end>i.
(This is not exactly an answer, but it does include more data, including some that conflicts with Jerry Coffin's.)
The interesting question is why the unoptimized routines perform so differently and counter-intuitively. The -O2 and -O3 cases are relatively simple to explain, and others have done so.
For completeness, here is the assembly (thanks @Rutan Kax) for superCalculationA and superCalculationB produced by GCC 4.9.1:
superCalculationA(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp)
movl %esi, -24(%rbp)
movq $0, -8(%rbp)
movl -20(%rbp), %eax
movl %eax, -12(%rbp)
jmp .L7
.L8:
movl -12(%rbp), %eax
cltq
addq %rax, -8(%rbp)
addl $1, -12(%rbp)
.L7:
movl -12(%rbp), %eax
cmpl -24(%rbp), %eax
jl .L8
movq -8(%rbp), %rax
popq %rbp
ret
superCalculationB(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp)
movl %esi, -24(%rbp)
movq $0, -8(%rbp)
movl -20(%rbp), %eax
movl %eax, -12(%rbp)
jmp .L11
.L12:
movl -12(%rbp), %eax
cltq
addq %rax, -8(%rbp)
addl $1, -12(%rbp)
.L11:
movl -20(%rbp), %edx
movl -24(%rbp), %eax
addl %edx, %eax
cmpl -12(%rbp), %eax
jg .L12
movq -8(%rbp), %rax
popq %rbp
ret
It sure looks to me like B is doing more work.
My test platform is a 2.9GHz Sandy Bridge EP processor (E5-2690) running Red Hat Enterprise 6 Update 3. My compiler is GCC 4.9.1 and produces the assembly above.
To make sure Turbo Boost and related CPU-frequency-diddling technologies are not interfering with the measurement, I ran:
pkill cpuspeed # if you have it running
grep MHz /proc/cpuinfo # to see where you start
modprobe acpi_cpufreq # if you do not have it loaded
cd /sys/devices/system/cpu
for cpuN in cpu[0-9]* ; do
echo userspace > $cpuN/cpufreq/scaling_governor
echo 2000000 > $cpuN/cpufreq/scaling_setspeed
done
grep MHz /proc/cpuinfo # to see if it worked
This pins the CPU frequency to 2.0 GHz and disables Turbo Boost.
Jerry observed these two routines running faster or slower depending on the order in which he executed them. I could not reproduce that result. For me, superCalculationB consistently runs 25-30% faster than superCalculationA, regardless of the Turbo Boost or clock speed settings. That includes running them multiple times in arbitrary order. For example, at 2.0 GHz superCalculationA consistently takes a little over 4500 ms and superCalculationB consistently takes a little under 3600 ms.
I have yet to see any theory that even begins to explain this.
Processors are complicated. Execution time depends on many things, many of which are outside your control. Just a few possibilities:
a. Your computer probably doesn't have a constant clock speed. It could be that the clock speed is usually set rather low to avoid wasting energy / battery life / producing excessive heat. When your program starts running, the OS figures out that power is needed and increases the clock speed. To verify, change the order of the calls - if the second loop executed is always faster than the first one, that may be the reason.
b. The exact execution speed, especially for a tight loop like yours, depends on how instructions are aligned in memory. Some processors may run a loop faster if it is completely contained within one cache line instead of two, or in two cache lines instead of three. Some compilers will add nop instructions to align loops on cache lines to optimise for this; most don't. It is quite possible that one of the loops was aligned better by pure luck and therefore runs faster (see the alignment sketch after this list).
c. The exact execution speed may depend on the exact order in which instructions are dispatched. Slightly different code may run at different speeds due to subtle differences in the code which may be processor dependent, and anyway may be hard for the compiler to consider.
d. There is some evidence that Intel processors may have problems with artificially short loops which may happen only with artificial benchmarks. Your code is quite close to "artificial". There have been cases discussed in other threads where very short loops ran unexpectedly slow, and adding instructions made them run faster.
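For point (b), one GCC-specific way to take alignment out of the hands of luck is to request it explicitly, either with -falign-functions/-falign-loops at compile time or with an attribute on the function. This is a sketch for experimentation, not a recommendation:
#include <cstdint>
// GCC's `aligned` attribute on a function aligns its first instruction, so at
// least the starting address of the code is no longer decided by accident.
__attribute__((aligned(32)))
std::uint64_t superCalculationA_aligned(int init, int end)
{
    std::uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}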
Answer to the first question:
1- The for loop gets faster after it has already run once, but I am not sure; I am only commenting according to my experiment results. (Experiment 1: swap the function names (B->A, A->B). Experiment 2: run a function containing a for loop before the timed checks. Experiment 3: run one for loop before the timed checks.)
2- The first function should be faster; the reason is that the second function does 2 operations (the addition and the comparison) when the first function does only 1 operation (the comparison).
I leave the updated code below, which explains my answer.
Answer to the second question:
I am not sure, but two ways come to mind:
The compiler may reformulate your function in some way and get rid of the loops, because the difference can be destroyed that way (something like "return end-init" or "return todo"; I dunno, I'm not sure).
It has -fauto-inc-dec, and that could make the difference, because these functions are all about increments and decrements.
I hope this helps.
#include <cstdint>
#include <ctime>
#include <cstdio>
using std::uint64_t;
uint64_t superCalculationA(int init, int end)
{
uint64_t total = 0;
for (int i = init; i < end; i++)
total += i;
return total;
}
uint64_t superCalculationB(int init, int todo)
{
uint64_t total = 0;
for (int i = init; i < init+todo; i++)
total += i;
return total;
}
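// add() prints on every call, demonstrating that the loop condition
// "i < add(init, todo)" in superCalculationC is re-evaluated on every iteration.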
int add(int a1,int a2){printf("multiple times added\n");return a1+a2;}
uint64_t superCalculationC(int init, int todo)
{
uint64_t total = 0;
for (int i = init; i < add(init , todo); i++)
total += i;
return total;
}
int main()
{
const uint64_t answer = 500000110500000000;
std::clock_t start=clock();
double elapsed;
std::printf("=====================================================\n");
superCalculationA(111, 1000000111);
start = clock();
uint64_t ret1 = superCalculationA(111, 1000000111);
elapsed = ((std::clock()-start)*1.0/CLOCKS_PER_SEC);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
start = clock();
uint64_t ret2 = superCalculationB(111, 1000000000);
elapsed = ((std::clock()-start)*1.0/CLOCKS_PER_SEC);
std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);
if (ret1 == answer)
{
std::printf("The first method, i.e. superCalculationA, succeeded.\n");
}
if (ret2 == answer)
{
std::printf("The second method, i.e. superCalculationB, succeeded.\n");
}
std::printf("=====================================================\n");
return 0;
}