Why is this for loop not faster using OpenMP? - c++

I have extracted this simple member function from a larger 2D program; all it does is a for loop accessing three different arrays and doing a math operation (1D convolution). I have been testing whether OpenMP can make this particular function faster:
void Image::convolve_lines()
{
  const int *ptr0 = tmp_bufs[0];
  const int *ptr1 = tmp_bufs[1];
  const int *ptr2 = tmp_bufs[2];
  const int width = Width;
#pragma omp parallel for
  for ( int x = 0; x < width; ++x )
  {
    const int sum = 0
      + 1 * ptr0[x]
      + 2 * ptr1[x]
      + 1 * ptr2[x];
    output[x] = sum;
  }
}
If I use gcc 4.7 on debian/wheezy amd64, the overall program performs a lot slower on an 8-CPU machine. If I use gcc 4.9 on debian/jessie amd64 (only 4 CPUs on this machine), the overall program performs with very little difference.
Using time to compare:
single core run:
$ ./test black.pgm out.pgm 94.28s user 6.20s system 84% cpu 1:58.56 total
multi core run:
$ ./test black.pgm out.pgm 400.49s user 6.73s system 344% cpu 1:58.31 total
Where:
$ head -3 black.pgm
P5
65536 65536
255
So Width is set to 65536 during execution.
If that matters, I am using CMake for compilation:
add_executable(test test.cxx)
set_target_properties(test PROPERTIES COMPILE_FLAGS "-fopenmp" LINK_FLAGS "-fopenmp")
And CMAKE_BUILD_TYPE is set to:
CMAKE_BUILD_TYPE:STRING=Release
which implies -O3 -DNDEBUG
My question: why is this for loop not faster using multiple cores? There is no overlap on the arrays, and OpenMP should split the memory accesses equally. I do not see where the bottleneck is coming from.
EDIT: as suggested in the comments, I changed my input file to:
$ head -3 black2.pgm
P5
33554432 128
255
So Width is now set to 33554432 during execution (which should be big enough). Now the timing reveals:
single core run:
$ ./test ./black2.pgm out.pgm 100.55s user 5.77s system 83% cpu 2:06.86 total
multi core run (for some reason cpu% was always below 100%, which would indicate no threads at all):
$ ./test ./black2.pgm out.pgm 117.94s user 7.94s system 98% cpu 2:07.63 total
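A minimal, separate check of how many threads the OpenMP runtime actually hands out (not part of the original program; the count is controlled by OMP_NUM_THREADS and requires building with -fopenmp) looks like this:
#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("OpenMP is using %d threads\n", omp_get_num_threads());
    }
    return 0;
}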

I have some general comments:
1. Before optimizing your code, make sure the data is 16-byte aligned. This is extremely important for whatever optimization one wants to apply. If the data is separated into 3 pieces, it is better to add some dummy elements so that the starting addresses of the 3 pieces are all 16-byte aligned. That way, the CPU can load your data into cache lines easily.
2. Make sure the simple function is vectorized before adding OpenMP. In most cases, using the AVX/SSE instruction sets should give you a decent 2 to 8x single-thread improvement. It is very simple for your case: create a constant __m256i register set to the value 2, and load 8 integers at a time from the three arrays into __m256i registers (a sketch follows this list). On a Haswell processor, one addition and one multiplication can be done together, so theoretically the loop should speed up by a factor of 12 if the AVX pipeline can be kept filled!
3. Sometimes parallelization can degrade performance: a modern CPU needs several hundred to a few thousand clock cycles to warm up, entering high-performance states and scaling up its frequency. If the task is not large enough, it is very likely finished before the CPU warms up, and you cannot get a speed boost by going parallel. And don't forget that OpenMP has overhead as well: thread creation, synchronization and destruction. Another case is poor memory management: if data accesses are scattered, all CPU cores sit idle waiting for data from RAM.
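As a sketch of points 1 and 2 (assuming AVX2 is available; the function arguments mirror the pointers in the question, the multiplication by 2 is done as a shift, and a scalar tail handles the leftover elements):
#include <immintrin.h>

// Compile with -mavx2. With 32-byte-aligned buffers (point 1), the unaligned
// loads/stores below can be replaced by _mm256_load_si256/_mm256_store_si256.
void convolve_lines_avx2(const int *ptr0, const int *ptr1, const int *ptr2,
                         int *output, int width)
{
    int x = 0;
    for (; x + 8 <= width; x += 8) {
        __m256i a = _mm256_loadu_si256((const __m256i *)(ptr0 + x));
        __m256i b = _mm256_loadu_si256((const __m256i *)(ptr1 + x));
        __m256i c = _mm256_loadu_si256((const __m256i *)(ptr2 + x));
        // 1*a + 2*b + 1*c, with the factor of 2 done as a left shift
        __m256i sum = _mm256_add_epi32(_mm256_add_epi32(a, c),
                                       _mm256_slli_epi32(b, 1));
        _mm256_storeu_si256((__m256i *)(output + x), sum);
    }
    for (; x < width; ++x)   // scalar tail for the remaining elements
        output[x] = ptr0[x] + 2 * ptr1[x] + ptr2[x];
}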
My Suggestion:
You might want to try Intel MKL rather than reinventing the wheel. The library is optimized to the extreme and wastes no clock cycles. You can link against either the serial library or the parallel version; if a parallel speedup is possible, it is guaranteed.

Related

Execution time inconsistency in a program with high priority in the scheduler using RT Kernel

Problem
We are trying to implement a program that sends commands to a robot in a given cycle time, so this program should be a real-time application. We set up a PC with a PREEMPT_RT Linux kernel and launch our programs with chrt -f 98 or chrt -rr 99 to define the scheduling policy and priority. Loading the kernel and launching the program seem to be fine and to work (see details below).
Now we were measuring the time (CPU ticks) it takes our program to be computed. We expected this time to be constant with very little variation. What we measured though, were quite significant differences in computation time. Of course, we thought this could be undefined behavior in our rather complex program, so we created a very basic program and measured the time as well. The behavior was similarly bad.
Question
Why are we not measuring a (close to) constant computation time even for our basic program?
How can we solve this problem?
Environment Description
First of all, we installed an RT Linux Kernel on the PC using this tutorial. The main characteristics of the PC are:
CPU: Intel(R) Atom(TM) Processor E3950 @ 1.60GHz, 4 cores
Memory (RAM): 8 GB
Operating System: Ubuntu 20.04.1 LTS
Kernel: Linux 5.9.1-rt20 SMP PREEMPT_RT
Architecture: x86-64
Tests
The first time we detected this problem was when we were measuring the time it takes to execute this "complex" program with a single thread. We ran a few tests with this program, and also with a simpler one, measuring:
The CPU execution time
The wall time (the real elapsed time)
The difference (Wall time - CPU time) between them and the ratio (CPU time / Wall time).
We also did a latency test on the PC.
Latency Test
For this one, we followed this tutorial, and these are the results:
(Screenshots of the results: Latency Test Generic Kernel, Latency Test RT Kernel)
The processes are shown in htop with a priority of RT
Test Program - Complex
We called the function multiple times in the program and measured the time each call takes. From the results of the two tests (plots not reproduced here) we observed that:
The first execution (around 0.28 ms) always takes longer than the second one (around 0.18 ms), but most of the time it is not the longest iteration.
The mode is around 0.17 ms.
For those that take around 0.17 ms, the difference is usually 0 and the ratio 1, although this is not exclusive to this time. For these, it seems like only one CPU is being used and it is saturated (there is no waiting time).
When the difference is not 0, it is usually negative. This, from what we have read here and here, is because more than 1 CPU is being used.
Test Program - Simple
We did the same test but this time with a simpler program:
#include <vector>
#include <iostream>
#include <time.h>
int main(int argc, char** argv) {
  int iterations = 5000;
  double a = 5.5;
  double b = 5.5;
  double c = 4.5;
  std::vector<double> wallTime(iterations, 0);
  std::vector<double> cpuTime(iterations, 0);
  struct timespec beginWallTime, endWallTime, beginCPUTime, endCPUTime;
  std::cout << "Iteration | WallTime | cpuTime" << std::endl;
  for (unsigned int i = 0; i < iterations; i++) {
    // Start measuring time
    clock_gettime(CLOCK_REALTIME, &beginWallTime);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &beginCPUTime);
    // Function
    a = b + c + i;
    // Stop measuring time and calculate the elapsed time
    clock_gettime(CLOCK_REALTIME, &endWallTime);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endCPUTime);
    wallTime[i] = (endWallTime.tv_sec - beginWallTime.tv_sec) + (endWallTime.tv_nsec - beginWallTime.tv_nsec)*1e-9;
    cpuTime[i] = (endCPUTime.tv_sec - beginCPUTime.tv_sec) + (endCPUTime.tv_nsec - beginCPUTime.tv_nsec)*1e-9;
    std::cout << i << " | " << wallTime[i] << " | " << cpuTime[i] << std::endl;
  }
  return 0;
}
Final Thoughts
We understand that:
If the ratio == number of CPUs used, they are saturated and there is no waiting time.
If the ratio < number of CPUs used, it means that there is some waiting time (theoretically we should only be using 1 CPU, although in practice we use more).
Of course, we can give more details.
Thanks a lot for your help!
Your function will almost certainly be optimized away, so you are just measuring how long it takes to read the clocks. And as you can see, that doesn't take very long, with some exceptions:
The very first time you run the code (unless you just compiled it) the pages need to be loaded from disk. If you are unlucky the code spans pages and you include the loading of the next page in the measured time. Quite unlikely given the code size.
On the first iteration, the code and any data need to be loaded into cache, so that takes longer to execute. The branch predictor might also need a few iterations to predict the loop correctly, so the second and third iterations might be slightly longer too.
For everything else I think you can blame scheduling:
an IRQ happens but nothing gets rescheduled
the process gets paused while another process runs
the process gets moved to another CPU thread leaving the caches hot
the process gets moved to another CPU core making L1 cache cold but leaving L2/L3 caches hot (if your L2 is shared)
the process gets moved to a CPU on another socket making L1/L2 caches cold but L3 cache hot (if L3 is shared)
You can do little about IRQs. Some you can pin to specific cores, but others are just essential (like the timer interrupt for the scheduler itself). You kind of just have to live with that.
But you can pin your program to a specific CPU and pin everything else to the other cores, basically reserving that core for the real-time code. I guess you would have to use cgroups for this, to keep everything else off the chosen core. You might still get some kernel threads running on the reserved core; there is nothing you can do about that. But this should eliminate most of the large execution times.
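As an aside on the first point (the measured statement being optimized away): a minimal way to keep it alive is an empty inline-asm barrier. do_not_optimize below is a hypothetical helper sketched for this purpose, not something from the original program:
// Tells the compiler the in-memory value is read and may be modified here,
// so the statement that produced it cannot be deleted.
static inline void do_not_optimize(double &value) {
    asm volatile("" : "+m"(value) : : "memory");
}

// Usage inside the measured region of the simple test program:
//   a = b + c + i;
//   do_not_optimize(a);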

Understanding Linux perf FP counters and computation of FLOPS in a C++ program

I am trying to measure the number of floating-point computations performed in a C++ program (FLOPS). I am using a Broadwell-based CPU and no GPU. I have tried the following command, in which I included all the FP-related events I found:
perf stat -e fp_arith_inst_retired.128b_packed_double,fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.256b_packed_double,fp_arith_inst_retired.256b_packed_single,fp_arith_inst_retired.double,fp_arith_inst_retired.packed,fp_arith_inst_retired.scalar,fp_arith_inst_retired.scalar_double,fp_arith_inst_retired.scalar_single,fp_arith_inst_retired.single,inst_retired.x87 ./test_exe
I got something as follows:
Performance counter stats for './test_exe':
0 fp_arith_inst_retired.128b_packed_double (36.36%)
0 fp_arith_inst_retired.128b_packed_single (36.36%)
0 fp_arith_inst_retired.256b_packed_double (36.37%)
0 fp_arith_inst_retired.256b_packed_single (36.37%)
4,520,439,602 fp_arith_inst_retired.double (36.37%)
0 fp_arith_inst_retired.packed (36.36%)
4,501,385,966 fp_arith_inst_retired.scalar (36.36%)
4,493,140,957 fp_arith_inst_retired.scalar_double (36.37%)
0 fp_arith_inst_retired.scalar_single (36.36%)
0 fp_arith_inst_retired.single (36.36%)
82,309,806 inst_retired.x87 (36.36%)
65.861043789 seconds time elapsed
65.692904000 seconds user
0.164997000 seconds sys
Questions:
Although the C++ program is a large project, I did not use any SSE/AVX instructions explicitly; I am not familiar with the SSE/AVX instruction sets, and the project is written in "ordinary" C++. Why does it report so many fp_arith_inst_retired.double, fp_arith_inst_retired.scalar and fp_arith_inst_retired.scalar_double counts? These counters are related to SSE/AVX computations, right?
What do the percentages in brackets, such as (36.37%), mean?
How can I compute the FLOPS in my C++ program based on the perf results?
Thanks.
The normal way for C++ compilers to do FP math on x86-64 is with scalar versions of SSE instructions, e.g. addsd xmm0, [rdi] (https://www.felixcloutier.com/x86/addsd). Only legacy 32-bit builds default to using the x87 FPU for scalar math.
If your compiler didn't manage to auto-vectorize anything (e.g. you didn't use g++ -O3 -march=native), and the only math you do is with double not float, then all the math operations will be done with scalar-double instructions.
Each such instruction will be counted by the fp_arith_inst_retired.double, .scalar, and .scalar_double events. They overlap; they are basically sub-filters of the same event. (FMA operations count as two, even though they're still only one instruction, so these are FLOP counts, not uops or instructions.)
So you have 4,493,140,957 FLOPs over 65.86 seconds.
4493140957 / 65.86 / 1e9 ~= 0.0682 GFLOP/s, i.e. very low.
If you had had any counts for 128b_packed_double, you'd multiply those by 2. As noted in the perf list description: "each count represents 2 computation operations, one for each element", because a 128-bit vector holds two 64-bit double elements. So each count for this event is 2 FLOPs. Similarly for the others, follow the scale factors described in the perf list output, e.g. times 8 for 256b_packed_single.
So you do need to separate the SIMD events by type and width, but you could just look at .scalar without separating single and double.
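As a small sketch of that bookkeeping (the variable names are just illustrative; the counts and elapsed time are copied from the perf output above):
#include <cstdint>
#include <cstdio>

int main() {
    // Event counts from the perf run above.
    uint64_t scalar_single = 0, scalar_double = 4493140957ULL;
    uint64_t p128_single = 0, p128_double = 0;
    uint64_t p256_single = 0, p256_double = 0;
    double seconds = 65.861043789;

    // Scale each packed event by its elements per vector (see perf list).
    double flops = double(scalar_single) + double(scalar_double)
                 + 4.0 * p128_single + 2.0 * p128_double
                 + 8.0 * p256_single + 4.0 * p256_double;

    std::printf("%.4f GFLOP/s\n", flops / seconds / 1e9);   // ~0.0682 here
    return 0;
}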
See also FLOP measurement, one of the duplicates of FLOPS in Python using a Haswell CPU (Intel Core Processor (Haswell, no TSX)) which was linked on your previous question
(36.37%) is how much of the total time that event was programmed on a HW counter. You used more events than there are counters, so perf multiplexed them for you, swapping every so often and extrapolating based on that statistical sampling to estimate the total over the run time. See Perf tool stat output: multiplex and scaling of "cycles".
You could get exact counts for the non-zero non-redundant events by leaving out the ones that are zero for a given build.
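For example, a second run restricted to the events that were non-zero and non-redundant here would fit in the hardware counters and avoid multiplexing:
perf stat -e fp_arith_inst_retired.scalar_double,inst_retired.x87 ./test_exe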

Low performance with CLWB (cacheline write-backs) to same location vs. cycling through a few lines

Why does the running time of the code below decrease when I increase kNumCacheLines?
In every iteration, the code modifies one of kNumCacheLines cachelines, writes the line to the DIMM with the clwb instruction, and blocks until the store hits the memory controller with sfence. This example requires Intel Skylake-server or newer Xeon, or IceLake client processors.
#include <stdlib.h>
#include <stdint.h>
#define clwb(addr) \
asm volatile(".byte 0x66; xsaveopt %0" : "+m"(*(volatile char *)(addr)));
static constexpr size_t kNumCacheLines = 1;
int main() {
  uint8_t *buf = new uint8_t[kNumCacheLines * 64];
  size_t data = 0;
  for (size_t i = 0; i < 10000000; i++) {
    size_t buf_offset = (i % kNumCacheLines) * 64;
    buf[buf_offset] = data++;
    clwb(&buf[buf_offset]);
    asm volatile("sfence" ::: "memory");
  }
  delete [] buf;
}
(editor's note: _mm_sfence() and _mm_clwb(void*) would avoid needing inline asm, but this inline asm looks correct, including the "memory" clobber).
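For reference, a minimal intrinsics version of the same loop would look like the sketch below (it needs a compiler and flags with CLWB support, e.g. gcc -O2 -mclwb, and is otherwise identical in behavior to the program above):
#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>   // _mm_clwb, _mm_sfence

static constexpr size_t kNumCacheLines = 1;

int main() {
    uint8_t *buf = new uint8_t[kNumCacheLines * 64];
    size_t data = 0;
    for (size_t i = 0; i < 10000000; i++) {
        size_t buf_offset = (i % kNumCacheLines) * 64;
        buf[buf_offset] = data++;
        _mm_clwb(&buf[buf_offset]);   // write the dirty line back toward memory
        _mm_sfence();                 // order the write-back before later stores
    }
    delete [] buf;
}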
Here are some performance numbers on my Skylake Xeon machine, reported by running time ./bench with different values of kNumCacheLines:
kNumCacheLines Time (seconds)
1 2.00
2 2.14
3 1.74
4 1.82
5 1.00
6 1.17
7 1.04
8 1.06
Intuitively, I would expect kNumCacheLines = 1 to give the best performance because of hits in the memory controller's write pending queue. But, it is one of the slowest.
As an explanation for the unintuitive slowdown, it is possible that while the memory controller is completing a write to a cache line, it blocks other writes to the same cache line. I suspect that increasing kNumCacheLines increases performance because of higher parallelism available to the memory controller. The running time jumps from 1.82 seconds to 1.00 seconds when kNumCacheLines goes from four to five. This seems to correlate with the fact that the memory controller's write pending queue has space for 256 bytes from a thread [https://arxiv.org/pdf/1908.03583.pdf, Section 5.3].
Note that because buf is smaller than 4 KB, all accesses use the same DIMM. (Assuming it's aligned so it doesn't cross a page boundary)
This is probably fully explained by Intel's CLWB instruction invalidating cache lines - turns out SKX runs clwb the same as clflushopt, i.e. it's a stub implementation for forward compatibility so persistent-memory software can start using it without checking CPU feature levels.
More cache lines means more memory-level parallelism in reloading the invalidated lines for the next store. Or that the flush part is finished before we try to reload. One or the other; there are a lot of details I don't have a specific explanation for.
In each iteration, you store a counter value into a cache line and clwb it. (and sfence). The previous activity on that cache line was kNumCacheLines iterations ago.
We were expecting that these stores could just commit into lines that were already in Exclusive state, but in fact they're going to be Invalid with eviction probably still in flight down the cache hierarchy, depending on exactly when sfence stalls, and for how long.
So each store needs to wait for an RFO (Read For Ownership) to get the line back into cache in Exclusive state before it can commit from the store buffer to L1d.
It seems that you're only getting a factor of 2 speedup from using more cache lines, even though Skylake(-X) has 12 LFBs (i.e. can track 12 in-flight cache lines incoming or outgoing). Perhaps sfence has something to do with that.
The big jump from 4 to 5 is surprising. (Basically two levels of performance, not a continuous transition). That lends some weight to the hypothesis that it's something to do with the store having made it all the way to DRAM before we try to reload, rather than having multiple RFOs in flight. Or at least casts doubt on the idea that it's just MLP for RFOs. CLWB forcing eviction is pretty clearly key, but the specific details of exactly what happens and why there's any speedup is just pure guesswork on my part.
A more detailed analysis might tell us something about microarchitectural details if anyone wants to do one. This hopefully isn't a very normal access pattern so probably we can just avoid doing stuff like this most of the time!
(Possibly related: apparently repeated writes to the same line of Optane DC PM memory are slower than sequential writes, so you don't want write-through caching or an access pattern like this on that kind of non-volatile memory either.)

Cost of OpenMPI in C++

I have the following C++ program, which uses no communication, and the same identical work is done on all cores; I know that this doesn't use parallel processing at all:
unsigned n = 130000000;
std::vector<double> vec1(n, 1.0);
std::vector<double> vec2(n, 1.0);
double t1, t2, dt;
t1 = MPI_Wtime();
for (unsigned i = 0; i < n; i++)
{
  // Do something so it's not a trivial loop
  vec1[i] = vec2[i] + i;
}
t2 = MPI_Wtime();
dt = t2 - t1;
I'm running this program on a single node with two Intel® Xeon® E5-2690 v3 processors, so I have 24 cores altogether. This is a dedicated node; no one else is using it.
Since there is no communication, and each processor is doing the same amount of (identical) work, running it on multiple processors should give the same time. However, I get the following times (averaged time over all cores):
1 core: 0.237
2 cores: 0.240
4 cores: 0.241
8 cores: 0.261
16 cores: 0.454
What could cause the increase in time? Particularly for 16 cores.
I have run callgrind and I get roughly the same amount of data/instruction misses on all cores (the percentage of misses is the same).
I have repeated the same test on a node with two Intel® Xeon® E5-2628L v2 processors (16 cores altogether), and I observe the same increase in execution times. Is this something to do with the MPI implementation?
Considering you are using ~2 GiB of memory per rank, your code is memory-bound. Apart from what the prefetchers can hide, you are not operating within the cache but in main memory, and you simply saturate the memory bandwidth at a certain number of active cores.
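As a rough back-of-the-envelope check (assuming ~24 bytes of memory traffic per element: the read of vec2[i] plus the read-for-ownership and write-back of vec1[i]): one pass over 130e6 elements moves about 3.1 GB, so a single rank at ~0.24 s per pass already demands roughly 13 GB/s. Sixteen ranks would ask for over 200 GB/s, while two E5-2690 v3 sockets with four DDR4-2133 channels each top out at roughly 68 GB/s per socket, about 136 GB/s combined. That is consistent with the slowdown appearing around 8 ranks and becoming severe at 16.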
Another aspect can be turbo mode, if enabled. Turbo mode can increase the core frequency to higher levels when fewer cores are utilized. As long as the memory bandwidth is not saturated, the higher turbo frequency increases the bandwidth each core gets. This paper discusses the available aggregate memory bandwidth on Haswell processors depending on the number of active cores and the frequency (Fig. 7/8).
Please note that this has nothing to do with MPI / OpenMPI. You might as well launch the same program X times by any other means.
I suspect that there are shared resources used by your program, so as the number of processes increases, there are delays while each waits for a resource to be freed by another process.
You see, you may have 24 cores, but that doesn't mean that your whole system allows every core to do everything concurrently. As mentioned in the comments, memory access is one thing that might cause delays (due to traffic); the same goes for disk.
Also consider the interconnection network, which can also suffer from many concurrent accesses. In conclusion, these hardware delays are enough to overwhelm the processing time.
General note: Remember how Efficiency of a program is defined:
E = S/p, where S is the speedup and p the number of nodes/processes/threads
Now take scalability into account. Usually programs are weakly scalable, i.e. you have to increase the problem size at the same rate as p to keep Efficiency constant. A program that keeps Efficiency constant while only p increases, with the problem size (n in your case) held fixed, is strongly scalable.
Your program is not using parallel processing at all. Just because you have compiled it with OpenMP does not make it parallel.
To parallelize the for loop, for example, you need to use the different #pragmas OpenMP offers:
unsigned n = 130000000;
std::vector<double> vec1(n, 1.0);
std::vector<double> vec2(n, 1.0);
double t1, t2, dt;
t1 = MPI_Wtime();
#pragma omp parallel for
for (unsigned i = 0; i < n; i++)
{
  // Do something so it's not a trivial loop
  vec1[i] = vec2[i] + i;
}
t2 = MPI_Wtime();
dt = t2 - t1;
However, take into account that for large values of n, the impact of cache misses may hide the performance gained with multiple cores.

How to reach AVX computation throughput for a simple loop

Recently I have been working on a numerical solver for computational electrodynamics using the finite difference method.
The solver was very simple to implement, but it is very difficult to reach the theoretical throughput of modern processors, because there is only 1 math operation on the loaded data, for example:
#pragma ivdep
for(int ii=0;ii<Large_Number;ii++)
{ Z[ii] = C1*Z[ii] + C2*D[ii];}
Large_Number is about 1,000,000, but not bigger than 10,000,000
I have tried to manually unroll the loop and write AVX code but failed to make it faster:
int Vec_Size = 8;
int Unroll_Num = 6;
int remainder = Large_Number % (Vec_Size * Unroll_Num);
int iter = Large_Number / (Vec_Size * Unroll_Num);
int addr_incr = Vec_Size * Unroll_Num;
__m256 AVX_Div1, AVX_Div2, AVX_Div3, AVX_Div4, AVX_Div5, AVX_Div6;
__m256 AVX_Z1, AVX_Z2, AVX_Z3, AVX_Z4, AVX_Z5, AVX_Z6;
__m256 AVX_Zb = _mm256_set1_ps(Zb);
__m256 AVX_Za = _mm256_set1_ps(Za);
for (int it = 0; it < iter; it++)
{
  const int addr = it * addr_incr;   // first element of this unrolled block (48 floats)
  AVX_Div1 = _mm256_loadu_ps(&Div1[addr]);
  AVX_Z1 = _mm256_loadu_ps(&Z[addr]);
  AVX_Z1 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb, AVX_Div1), _mm256_mul_ps(AVX_Za, AVX_Z1));
  _mm256_storeu_ps(&Z[addr], AVX_Z1);
  AVX_Div2 = _mm256_loadu_ps(&Div1[addr+8]);
  AVX_Z2 = _mm256_loadu_ps(&Z[addr+8]);
  AVX_Z2 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb, AVX_Div2), _mm256_mul_ps(AVX_Za, AVX_Z2));
  _mm256_storeu_ps(&Z[addr+8], AVX_Z2);
  AVX_Div3 = _mm256_loadu_ps(&Div1[addr+16]);
  AVX_Z3 = _mm256_loadu_ps(&Z[addr+16]);
  AVX_Z3 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb, AVX_Div3), _mm256_mul_ps(AVX_Za, AVX_Z3));
  _mm256_storeu_ps(&Z[addr+16], AVX_Z3);
  AVX_Div4 = _mm256_loadu_ps(&Div1[addr+24]);
  AVX_Z4 = _mm256_loadu_ps(&Z[addr+24]);
  AVX_Z4 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb, AVX_Div4), _mm256_mul_ps(AVX_Za, AVX_Z4));
  _mm256_storeu_ps(&Z[addr+24], AVX_Z4);
  AVX_Div5 = _mm256_loadu_ps(&Div1[addr+32]);
  AVX_Z5 = _mm256_loadu_ps(&Z[addr+32]);
  AVX_Z5 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb, AVX_Div5), _mm256_mul_ps(AVX_Za, AVX_Z5));
  _mm256_storeu_ps(&Z[addr+32], AVX_Z5);
  AVX_Div6 = _mm256_loadu_ps(&Div1[addr+40]);
  AVX_Z6 = _mm256_loadu_ps(&Z[addr+40]);
  AVX_Z6 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb, AVX_Div6), _mm256_mul_ps(AVX_Za, AVX_Z6));
  _mm256_storeu_ps(&Z[addr+40], AVX_Z6);
}
// the last `remainder` elements still need a scalar (or masked) tail loop
The above AVX loop is actually a bit slower than the Intel compiler-generated code.
The compiler-generated code reaches about 8 GFLOP/s, about 25% of the single-thread theoretical throughput of a 3 GHz Ivy Bridge processor. I wonder if it is even possible to reach that throughput for a simple loop like this.
Thank you!
Improving performance for codes like yours is a well-explored and still popular area. Take a look at dot-product optimization (a perfect link was already provided by Z Boson) or at some (D)AXPY optimization discussions (https://scicomp.stackexchange.com/questions/1932/are-daxpy-dcopy-dscal-overkills)
In general, key topics to explore and consider applying are:
AVX2 advantage over AVX, due to FMA and the better load/store port microarchitecture on Haswell (see the FMA sketch after this list)
Pre-Fetching. "Streaming stores", "non-temporal stores" for some platforms.
Threading parallelism to reach max sustained bandwidth
Unrolling (already done by you; the Intel Compiler is also capable of doing that with #pragma unroll(X)). Not a big difference for "streaming" codes.
Finally deciding what is a set of hardware platforms you want to optimize your code for
The last bullet is particularly important, because for "streaming" and generally memory-bound codes it's important to know more about the target memory subsystems; for example, with existing and especially future high-end HPC servers (2nd-gen Xeon Phi, codenamed Knights Landing, as an example) you may have a very different "roofline model" balance between bandwidth and compute, and even different techniques than when optimizing for an average desktop machine.
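A sketch of the FMA point from the first bullet (this assumes AVX2/FMA hardware such as Haswell, not the Ivy Bridge in the question, and reuses the Za/Zb naming from the posted code); the three arithmetic operations per 8 elements become two, one multiply and one FMA:
#include <immintrin.h>

// Compile with -march=haswell (or -mavx2 -mfma).
void update_fma(float *Z, const float *D, float Za, float Zb, int n)
{
    const __m256 vZa = _mm256_set1_ps(Za);
    const __m256 vZb = _mm256_set1_ps(Zb);
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 z = _mm256_loadu_ps(&Z[i]);
        __m256 d = _mm256_loadu_ps(&D[i]);
        // Za*Z[i] + Zb*D[i]: one multiply feeding one fused multiply-add
        __m256 r = _mm256_fmadd_ps(vZa, z, _mm256_mul_ps(vZb, d));
        _mm256_storeu_ps(&Z[i], r);
    }
    for (; i < n; ++i)   // scalar tail for the remaining elements
        Z[i] = Za * Z[i] + Zb * D[i];
}
Note that this only helps the arithmetic side; if the loop is memory-bandwidth bound, as discussed below, the FMA version will run at the same speed.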
Are you sure that 8 GFLOP/s is about 25% of the maximum throughput of a 3 GHz Ivy Bridge processor? Let's do the calculations.
Every 8 elements require two single-precision AVX multiplications and one AVX addition. An Ivy Bridge processor can only execute one 8-wide AVX addition and one 8-wide AVX multiplication per cycle. Also, since the addition depends on the two multiplications, 3 cycles are required to process 8 elements. Since the addition can be overlapped with the next multiplication, we can reduce this to 2 cycles per 8 elements. For one billion elements, 2*10^9/8 = 10^9/4 cycles are required. Considering a 3 GHz clock, we get 10^9/4 * 10^-9/3 = 1/12 ≈ 0.08 seconds. So the maximum theoretical throughput is 12 GFLOP/s and the compiler-generated code is reaching 66%, which is fine.
One more thing, by unrolling the loop 8 times, it can be vectorized efficiently. I doubt that you'll gain any significant speed up if you unroll this particular loop more than that, especially more than 16 times.
I think the real bottleneck is that there are 2 load and 1 store instructions for every 2 multiplications and 1 addition. Maybe the calculation is memory-bandwidth limited. Every element requires transferring 12 bytes of data, and if 2G elements are processed every second (which is 6 GFLOP/s) that is 24 GB/s of data transfer, reaching the theoretical bandwidth of Ivy Bridge. I wonder if this argument holds and there is indeed no solution to this problem.
The reason why I am answering my own question is the hope that someone can correct me before I give up on the optimization. This simple loop is EXTREMELY important for many scientific solvers; it is the backbone of the finite element and finite difference methods. If one cannot even feed one processor core because the computation is memory-bandwidth limited, why bother with multicore? A high-bandwidth GPU or Xeon Phi should be a better solution.