Does anyone see anything obvious about the loop code below that I'm not seeing as to why this cannot be auto-vectorized by VS2012's C++ compiler?
All the compiler gives me is info C5002: loop not vectorized due to reason '1200' when I use the /Qvec-report:2 command-line switch.
Reason 1200 is documented in MSDN as:
Loop contains loop-carried data dependences that prevent vectorization. Different iterations of the loop interfere with each other such that vectorizing the loop would produce wrong answers, and the auto-vectorizer cannot prove to itself that there are no such data dependences.
I know (or I'm pretty sure that) there aren't any loop-carried data dependencies but I'm not sure what's preventing the compiler from realizing this.
The source and dest pointers never overlap nor alias the same memory, and I'm trying to give the compiler that hint via __restrict.
pitch is always a positive integer value, something like 4096, depending on the screen resolution, since this is an 8bpp->32bpp rendering/conversion function operating column-by-column.
byte * __restrict source;
DWORD * __restrict dest;
int pitch;

for (int i = 0; i < count; ++i) {
    dest[(i*2*pitch)+0] = (source[(i*8)+0]);
    dest[(i*2*pitch)+1] = (source[(i*8)+1]);
    dest[(i*2*pitch)+2] = (source[(i*8)+2]);
    dest[(i*2*pitch)+3] = (source[(i*8)+3]);
    dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
    dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
    dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
    dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
The parens around each source[] are remnants of a function call which I have elided here, because the loop still won't auto-vectorize even without the function call, in its simplest form.
EDIT:
I've simplified the loop to the most trivial form I can:
for (int i = 0; i < 200; ++i) {
    dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
This still produces the same 1200 reason code.
EDIT (2):
This minimal test case with local allocations and identical pointer types still fails to auto-vectorize. I'm simply baffled at this point.
const byte * __restrict source;
byte * __restrict dest;

source = (const byte * __restrict) new byte[1600];
dest = (byte * __restrict) new byte[1600];

for (int i = 0; i < 200; ++i) {
    dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
Let's just say there's more than just a couple of things preventing this loop from vectorizing...
Consider this:
int main(){
    byte *source = new byte[1000];
    DWORD *dest = new DWORD[1000];

    for (int i = 0; i < 200; ++i) {
        dest[(i*2*4096)+0] = (source[(i*8)+0]);
    }

    for (int i = 0; i < 200; ++i) {
        dest[i*2*4096] = source[i*8];
    }

    for (int i = 0; i < 200; ++i) {
        dest[i*8192] = source[i*8];
    }

    for (int i = 0; i < 200; ++i) {
        dest[i] = source[i];
    }
}
Compiler Output:
main.cpp(10) : info C5002: loop not vectorized due to reason '1200'
main.cpp(13) : info C5002: loop not vectorized due to reason '1200'
main.cpp(16) : info C5002: loop not vectorized due to reason '1203'
main.cpp(19) : info C5002: loop not vectorized due to reason '1101'
Let's break this down:
The first two loops are the same. So they give the original reason 1200 which is the loop-carried dependency.
The 3rd loop is the same as the 2nd loop. Yet the compiler gives a different reason 1203:
Loop body includes non-contiguous accesses into an array
Okay... Why a different reason? I dunno. But this time, the reason is correct.
The 4th loop gives 1101:
Loop contains a non-vectorizable conversion operation (may be implicit)
So VC++ isn't smart enough to issue the SSE4.1 pmovzxbd instruction.
That's a pretty niche case; I wouldn't have expected any modern compiler to be able to do this. And if it could, you'd need to tell it to target SSE4.1.
So the only thing that's out of the ordinary is why the initial loop reports a loop-carried dependency. Well, that's a tough call... I would go so far as to say that the compiler just isn't issuing the correct reason. (When it really should be non-contiguous access.)
Getting back to the point, I wouldn't expect MSVC or any compiler to be able to vectorize your original loop. Your original loop has accesses grouped in chunks of 4 - which makes it contiguous enough to vectorize. But it's a long shot to expect the compiler to be able to recognize that.
So if it matters, I suggest manually vectorizing this loop. The intrinsic that you will need is _mm_cvtepu8_epi32().
Your original loop:
for (int i = 0; i < count; ++i) {
    dest[(i*2*pitch)+0] = (source[(i*8)+0]);
    dest[(i*2*pitch)+1] = (source[(i*8)+1]);
    dest[(i*2*pitch)+2] = (source[(i*8)+2]);
    dest[(i*2*pitch)+3] = (source[(i*8)+3]);
    dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
    dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
    dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
    dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
vectorizes as follows:
for (int i = 0; i < count; ++i) {
    __m128i s0 = _mm_loadl_epi64((__m128i*)(source + i*8));     // load 8 bytes, upper half zeroed
    __m128i s1 = _mm_srli_si128(s0, 4);                         // shift bytes 4..7 down into the low dword
    *(__m128i*)(dest + (i*2 + 0)*pitch) = _mm_cvtepu8_epi32(s0); // zero-extend 4 bytes to 4 dwords
    *(__m128i*)(dest + (i*2 + 1)*pitch) = _mm_cvtepu8_epi32(s1);
}
Disclaimer: This is untested and ignores alignment.
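If the destination rows can't be guaranteed 16-byte aligned, the stores can be swapped for their unaligned counterparts. A sketch of the same loop (equally untested) using _mm_storeu_si128:

#include <smmintrin.h>  // SSE4.1: _mm_cvtepu8_epi32

for (int i = 0; i < count; ++i) {
    __m128i s0 = _mm_loadl_epi64((const __m128i*)(source + i*8)); // 8 source bytes
    __m128i s1 = _mm_srli_si128(s0, 4);                           // bytes 4..7 into the low dword
    _mm_storeu_si128((__m128i*)(dest + (i*2 + 0)*pitch), _mm_cvtepu8_epi32(s0));
    _mm_storeu_si128((__m128i*)(dest + (i*2 + 1)*pitch), _mm_cvtepu8_epi32(s1));
}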
From the MSDN documentation, a case in which reason code 1203 would be reported:
void code_1203(int *A)
{
    // Code 1203 is emitted when non-vectorizable memory references
    // are present in the loop body. Vectorization of some non-contiguous
    // memory access is supported - for example, the gather/scatter pattern.
    for (int i = 0; i < 1000; ++i)
    {
        A[i] += A[0] + 1;     // constant memory access not vectorized
        A[i] += A[i*2+2] + 2; // non-contiguous memory access not vectorized
    }
}
It really could be the computations at the indices that are confusing the auto-vectorizer. Funny that the reason code shown is not 1203, though.
MSDN Parallelizer and Vectorizer Messages
Related
I need to test the performance of a getter function (it returns a double) in our codebase, which is compiled with optimizations turned on (the build system is a bit complicated and I don't want to touch it unless I really have to).
I want to test it using a loop such as
for (int i = 0; i < 200000; ++i) {
    auto scale = input.get_double();
}
but I think this loop will get optimized away since it's doing the same thing every iteration. Is there a trick to make sure the loop doesn't get optimized away? I was considering doing
for (int i = 0; i < 200000; ++i) {
    auto scale = input.get_double() + i;
}
but I don't want the addition to be included in the runtime profiling.
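One trick that works on GCC and Clang (a sketch; do_not_optimize is a hypothetical helper modeled on what benchmark libraries such as Google Benchmark do, not a standard function, and input is the object from the question) is to funnel the result through an empty inline-asm statement, which forces the compiler to keep the call without adding measurable work:

// Hypothetical helper: tells the optimizer the value is "used" without
// emitting any instructions for the use itself (GCC/Clang inline asm).
template <typename T>
inline void do_not_optimize(T& value) {
    asm volatile("" : "+m"(value) : : "memory");
}

for (int i = 0; i < 200000; ++i) {
    auto scale = input.get_double();
    do_not_optimize(scale);  // keeps the loop body alive in the optimizer's eyes
}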
I have the following C++ code. It should print 1. There is only one illegal memory access, and it is a read: for i=5 in the for loop, we obtain c[5]=cdata[5], and cdata[5] does not exist. Still, all writes are legal, i.e., the program never tries to write to a non-allocated array cell. If I compile it with g++ -O3, it prints 0. If I compile without -O3, it prints 1. It's the strangest error I've ever seen with g++.
Later edit, to address some answers: the undefined behavior should be limited to reading cdata[5] and writing it to c[5]. The assignment c[5]=cdata[5] may read some unknown garbage or something undefined from cdata[5]. Okay, but that should only copy the garbage to c[5], without writing anything whatsoever anywhere else!
#include <iostream>
#include <cstdlib>

using namespace std;

int n, d;
int *f, *c;

void loadData(){
    int fdata[] = {7, 2, 2, 7, 7, 1};
    int cdata[] = {66, 5, 4, 3, 2};
    n = 6;
    d = 3;
    f = new int[n+1];
    c = new int[n];
    f[0] = fdata[0];
    c[0] = cdata[0];
    for (int i = 1; i < n; i++){
        f[i] = fdata[i];
        c[i] = cdata[i];
    }
    cout << f[5] << endl;
}

int main(){
    loadData();
}
It will be very hard to find out exactly what is happening to your code before and after the optimisation. As you already knew and pointed out yourself, you are trying to go out of bound of an array, which leads to undefined behaviour.
However, you are curious about why it is (apparently) the -O3 flag which "causes" the issue!
Let's start by saying that it is actually -fpeel-loops (in combination with -O) which causes your code to be re-organised in a way that makes your error apparent. -O3 enables several optimisation flags, among which is -fpeel-loops.
You can read here about what the compiler flags are at which stage of optimisation.
In a nutshell, -fpeel-loops will re-organise the loop so that the first and last couple of iterations are actually taken out of the loop itself, and some variables are cleared from memory. Small loops may even be taken apart completely!
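To illustrate (this is a conceptual sketch of the transformation, not the compiler's actual output), peeling the last iteration of the loop in question produces something like:

int fdata[] = {7, 2, 2, 7, 7, 1};
int cdata[] = {66, 5, 4, 3, 2};
int *f = new int[7];
int *c = new int[6];

// Original: for (int i = 1; i < 6; i++) { f[i] = fdata[i]; c[i] = cdata[i]; }
// After peeling off the last iteration:
for (int i = 1; i < 5; i++) {
    f[i] = fdata[i];
    c[i] = cdata[i];
}
f[5] = fdata[5];
c[5] = cdata[5];  // the out-of-bounds read now stands alone, and the optimizer
                  // may reorder or fold it on the assumption it cannot happen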
With this said and considered, try running this piece of code, with -O -fpeel-loops or even with -O3:
#include <iostream>
#include <cstdlib>

using namespace std;

int n, d;
int *f, *c;

void loadData(){
    int fdata[] = {7, 2, 2, 7, 7, 1};
    int cdata[] = {66, 5, 4, 3, 2};
    n = 6;
    d = 3;
    f = new int[n+1];
    c = new int[n];
    f[0] = fdata[0];
    c[0] = cdata[0];
    for (int i = 1; i < n; i++){
        cout << f[i];
        f[i] = fdata[i];
        c[i] = cdata[i];
    }
    cout << "\nFINAL F[5]:" << f[5] << endl;
}

int main(){
    loadData();
}
You will see that it will print 1 regardless of your optimisation flags.
That is because of the statement cout << f[i], which changes the way -fpeel-loops operates.
Also, experiment with this block of code:
f[0] = fdata[0];
c[0] = cdata[0];
c[1] = cdata[1];
c[2] = cdata[2];
c[3] = cdata[3];
c[4] = cdata[4];
for (int i = 1; i < n; i++) {
    f[i] = fdata[i];
    c[5] = cdata[5];
}
cout << "\nFINAL F[5]:" << f[5] << endl;
You will see that even in this case, with all your optimisation flags, the output is 1 and not 0. Even in this case:
for (int i = 1; i < n; i++) {
    c[i] = cdata[i];
}

for (int i = 1; i < n; i++) {
    f[i] = fdata[i];
}
The produced output is actually 1. This is, again, because we have changed the structure of the loop, and -fpeel-loops is not able to reorganise it as before, in the way that produced the error. The same happens when wrapping it in a while loop:
int i = 1;
while (i < 6) {
    f[i] = fdata[i];
    c[i] = cdata[i];
    i++;
}
Although with the while loop, -O3 will prevent compilation here because of -Waggressive-loop-optimizations, so you should test it with -O -fpeel-loops.
So, we can't really know for sure how your compiler is reorganising that loop for you; however, it is using the so-called as-if rule to do so, and you can read more about it here.
Of course, your compiler takes the freedom of refactoring that code for you, based on the assumption that you abide by the rules. The as-if rule will always produce the desired output, provided the programmer has not caused undefined behaviour. In our case, we did indeed break the rules by going out of bounds of that array, so the strategy with which the loop was refactored was built on the premise that this could not happen - but we made it happen.
So no, it is not as simple and straightforward as saying that all your problems are created by reading an unallocated memory address. Of course, that is ultimately why the problem manifested itself, but you were reading out of bounds of cdata, which is a different array! So it is not as simple as saying that the mere out-of-bounds read is causing the issue, because if you do it like this:
int i = 0;
f[i] = fdata[i]; c[i] = cdata[i]; i++;
f[i] = fdata[i]; c[i] = cdata[i]; i++;
f[i] = fdata[i]; c[i] = cdata[i]; i++;
f[i] = fdata[i]; c[i] = cdata[i]; i++;
f[i] = fdata[i]; c[i] = cdata[i]; i++;
f[i] = fdata[i]; c[i] = cdata[i]; i++;
It will work and print 1! Even if we are definitely going out of bounds with that cdata array! So it is not the mere fact of reading at an unallocated memory address that is causing the issue.
Once again, it is actually the loop-refactoring strategy of -fpeel-loops, which assumed that you would not go out of bounds of that array and changed your loop accordingly.
Lastly, the fact that the optimisation flags lead to an output of 0 is strictly tied to your machine. That 0 has no meaning; it is not the product of an actual logical operation, but simply a junk value found at a non-allocated memory address, or the product of an operation performed on such a junk value.
If you want to discover why this happens, you can examine the generated assembly code. Here it is in Compiler Explorer: https://godbolt.org/z/bs7ej7
Note that with -O3, the generated assembly code is not exactly easy to read.
As you mentioned, you have a memory problem in your code.
cdata has 5 elements, so its maximum valid index is 4.
But in the for loop, when i == 5, invalid data is read, which causes undefined behavior:
c[i] = cdata[i];
Please modify your code like this and try again:
for (int i = 1; i < n; i++)
    f[i] = fdata[i];

for (int i = 1; i < 5; i++)
    c[i] = cdata[i];
The answer is simple - you are accessing memory out of bounds of the array. Who knows what is there? It could be some other variable in your program, it could be random garbage, it is completely undefined.
So, when you compile, sometimes you just so happen to get valid data there, and other times the data is not what you expect.
It is purely coincidental whether the data you expect is there or not; what the compiler does determines what data ends up there, i.e. it is completely unpredictable.
As for memory access, reading invalid memory is still a big no-no. So is reading potentially uninitialized values. You are doing both. They may not cause a segmentation fault, but they are still big no-no's.
My recommendation is use valgrind or a similar memory checking tool. If you run this code through valgrind, it will yell at you. With this, you can catch memory errors in your code that the compiler might not catch. This memory error is obvious, but some are hard to track down and almost all lead to undefined behavior, which makes you want to pull your teeth out with a rusty pair of pliers.
In short, don't access out of bounds elements and pray that the answer you are looking for just so happens to exist at that memory address.
USELESS QUESTION - ASKED TO BE DELETED
I have to run a piece of code that manages a video stream from a camera.
I am trying to speed it up, and I ran into some weird C++ behaviour. (I have to admit I am realizing I do not know C++ as well as I thought.)
The first piece of code runs faster than the second two. Why? Could it be that the stack is almost full?
Faster version
double* temp = new double[N];
for (int i = 0; i < N; i++) {
    temp[i] = operation(x[i], y[i]);
    res = res + (temp[i]*temp[i])*coeff[i];
}
Slower version1
double temp;
for (int i = 0; i < N; i++) {
    temp = operation(x[i], y[i]);
    res = res + (temp*temp)*coeff[i];
}
Slower version2
for (int i = 0; i < N; i++) {
    double temp = operation(x[i], y[i]);
    res = res + (temp*temp)*coeff[i];
}
EDIT
I realized the compiler was optimizing away the product between elements of coeff and temp. I beg your pardon for the useless question. I will delete this post.
This has obviously nothing to do with "writing vs overwriting".
Assuming your results are indeed correct, I can guess that your "faster" version can be vectorized (i.e. pipelined) by the compiler more efficiently.
The difference is that in this version you allocate storage for every temp, so each iteration uses its own member of the array; hence all the iterations can be executed independently.
Your "slow 1" version creates a (sort of) false dependence on a single temp variable. A primitive compiler might "buy" it, producing non-pipelined code.
Your "slow 2" version seems to be ok actually, loop iterations are independent.
Why is this still slower?
I can guess that this is due to the use of the same CPU registers. That is, arithmetic on double is usually implemented via the FPU stack registers, and this causes interference between loop iterations.
This is my first time posting on this site and I hope I get some help/hints. I have an assignment where I need to optimize the performance of the inner for loop, but I have no idea how to do that. The code was given in the assignment. I need to measure the time (which I was able to do) and improve the performance.
Here is the code:
//header files

#define N_TIMES 200 // This is originally 200000 but changed it to test the program faster
#define ARRAY_SIZE 9973

int main (void) {
    int *array = (int*)calloc(ARRAY_SIZE, sizeof(int));
    int sum = 0;
    int checksum = 0;
    int i;
    int j;
    int x;

    // Initialize the array with random values 0 to 13.
    srand(time(NULL));
    for (j = 0; j < ARRAY_SIZE; j++) {
        x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
        array[j] = x;
        checksum += x;
    }
    //printf("Checksum is %d.\n",checksum);

    for (i = 0; i < N_TIMES; i++) {
        // Do not alter anything above this line.
        // Need to optimize this for loop----------------------------------------
        for (j = 0; j < ARRAY_SIZE; j++) {
            sum += array[j];
            printf("Sum is now: %d\n", sum);
        }
        // Do not alter anything below this line.
        // ---------------------------------------------------------------
        // Check each iteration.
        if (sum != checksum) {
            printf("Checksum error!\n");
        }
        sum = 0;
    }
    return 0;
}
The code takes about 695 seconds to run. Any help on how to optimize it, please?
Thanks a lot.
The bottleneck in that loop is obviously the IO done by printf; since you are probably writing the output to a console, the output is line-buffered, which means that the stdio buffer is flushed at each iteration, and that slows things down a lot.
If you have to do all those prints, you can greatly improve performance by forcing the stream to do block buffering: before the for, add
setvbuf(stdout, NULL, _IOFBF, 0);
Alternatively, if this approach is not considered valid, you can do your own buffering: allocate a big buffer, write into it using sprintf, and periodically empty it into the output stream with fwrite.
Also, you can use the poor man's approach to buffering - just use a buffer big enough to write all that stuff (you can calculate how big it must be quite easily), write into it without worrying about when it's full or when to empty it, and just empty it at the end of the loop. Edit: see @paxdiablo's answer for an example of this.
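A minimal sketch of the periodic-flush variant (the 64 KiB buffer and the flush threshold are illustrative choices, not taken from the question):

char buf[65536];
int used = 0;

for (int j = 0; j < ARRAY_SIZE; j++) {
    sum += array[j];
    used += sprintf(buf + used, "Sum is now: %d\n", sum);
    if (used > (int)sizeof(buf) - 64) {  // nearly full: flush and reuse
        fwrite(buf, 1, used, stdout);
        used = 0;
    }
}
fwrite(buf, 1, used, stdout);            // flush whatever is left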
Applying just the first optimization, what I get with time is
real 0m6.580s
user 0m0.236s
sys 0m2.400s
vs the original
real 0m8.451s
user 0m0.700s
sys 0m3.156s
So we got down about 2 seconds in real time, half a second in user time and about 0.7 seconds in system time. But what we can see here is the huge difference between user+sys and real, which means that the time is not spent doing something inside the process, but waiting.
Thus, the real bottleneck here is not in our process, but in the process of the virtual terminal emulator: sending huge quantities of text to the console is going to be slow no matter what optimizations we do in our program; in other words, your task is not CPU-bound, but IO-bound, so CPU-targeted optimizations won't be of much benefit, since at the end you have to wait anyway for your IO device to do its slow stuff.
The real way to speed up such a program would be much simpler: avoid the slow IO device (the console) and just write the data to a file (which, by the way, is block-buffered by default).
matteo@teokubuntu:~/cpp/test$ time ./a.out > test
real 0m0.369s
user 0m0.240s
sys 0m0.068s
Since there's absolutely no variation in that loop based on i (the outer loop), you don't need to calculate it each time.
In addition, the printing of the data should be outside the inner loop so as not to impose I/O costs on the calculation.
With those two things in mind, one possibility is:
static int sumCalculated = 0;

if (!sumCalculated) {
    for (j = 0; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    sumCalculated = 1;
}
printf("Sum is now: %d\n", sum);
although that has different output from the original, which may be an issue (one line at the end rather than one line per addition).
If you do need to print the accumulating sum within the loop, I'd simply buffer that as well (since it doesn't vary each time through the i loop).
The string Sum is now: 999999999999\n (12 digits, it may vary depending on your int size) takes up 25 bytes (excluding terminating NUL). Multiply that by 9973 and you need a buffer of about 250K (including a terminating NUL). So something like this:
static char buff[250000];
static int sumCalculated = 0;

if (!sumCalculated) {
    int offset = 0;
    for (j = 0; j < ARRAY_SIZE; j++) {
        sum += array[j];
        offset += sprintf(buff + offset, "Sum is now: %d\n", sum);
    }
    sumCalculated = 1;
}
printf("%s", buff);
Now that sort of defeats the whole intent of the outer loop as a benchmarking tool, but loop-invariant removal is a valid approach to optimisation.
Move the printf outside the for loop.
// Do not alter anything above this line.
// Need to optimize this for loop----------------------------------------
for (j = 0; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
printf("Sum is now: %d\n", sum);
// Do not alter anything below this line.
// ---------------------------------------------------------------
Getting the I/O out of the loop is a big help.
Depending on the compiler and machine, you might get a tiny increase in speed by using pointers rather than indexing (though on modern hardware, it generally doesn't make a difference).
Loop unrolling might help to increase the ratio of useful work to loop overhead.
You could use vector instructions (e.g., SIMD) to do a bunch of calculation in parallel (see the sketch after the unrolling example below).
Are you allowed to pack the array? Can you use an array of a smaller type than int (given that all the values are very small)? Making the array physically shorter improves locality.
Loop unrolling might look something like this:
for (int j = 0; j < ARRAY_SIZE; j += 2) {
    sum += array[j] + array[j+1];
}
You'd have to figure out what to do if the array isn't an exact multiple of the unrolling size (which is probably why the assignment uses a prime number).
You would have to experiment to see how much unrolling would be the right amount.
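For the SIMD suggestion, here is a sketch assuming SSE2 (the function name and unaligned loads are illustrative; the scalar tail handles the leftover elements, which a prime ARRAY_SIZE guarantees there will be):

#include <emmintrin.h>  // SSE2

int simd_sum(const int *array, int n) {
    __m128i vsum = _mm_setzero_si128();
    int j = 0;
    for (; j + 4 <= n; j += 4) {                  // four ints per iteration
        __m128i v = _mm_loadu_si128((const __m128i*)(array + j));
        vsum = _mm_add_epi32(vsum, v);
    }
    int lanes[4];
    _mm_storeu_si128((__m128i*)lanes, vsum);      // horizontal add of the 4 lanes
    int sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; j < n; j++)                            // scalar tail
        sum += array[j];
    return sum;
}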
I have the following C++ code:
const int N = 1000000;
int id[N];      // Value can range from 0 to 9
float value[N];

// load id and value from an external source...

int size[10] = { 0 };
float sum[10] = { 0 };

for (int i = 0; i < N; ++i)
{
    ++size[id[i]];
    sum[id[i]] += value[i];
}
How should I optimize the loop?
I considered using SSE to add every 4 floats to a sum and then after N iterations, the sum is just the sum of the 4 floats in the xmm register but this doesn't work when the source is indexed like this and needs to write out to 10 different arrays.
This kind of loop is very hard to optimize using SIMD instructions. Not only is there no easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"); even if there were, this particular loop still has the problem that two values might map to the same id in one SIMD register, e.g. when
id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0
in this case, the obvious approach (pseudocode here)
x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);
won't work either!
You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:
int size[10] = { 0 }, size2[10] = { 0 };
float sum[10] = { 0 }, sum2[10] = { 0 };

for (int i = 0; i < N/2; i++) {
    int id0 = id[i*2+0], id1 = id[i*2+1];
    ++size[id0];
    ++size2[id1];
    sum[id0] += value[i*2+0];
    sum2[id1] += value[i*2+1];
}

// if N was odd, process the last element
if (N & 1) {
    ++size[id[N-1]];
    sum[id[N-1]] += value[N-1];
}

// add the partial sums together
for (int i = 0; i < 10; i++) {
    size[i] += size2[i];
    sum[i] += sum2[i];
}
Whether this helps or not depends on the target CPU though.
Well, you are reading id[i] twice in your loop. You could store it in a variable, or a register int, if you wanted to.
register int index;
for (int i = 0; i < N; ++i)
{
    index = id[i];
    ++size[index];
    sum[index] += value[i];
}
The MSDN docs state this about register:
The register keyword specifies that the variable is to be stored in a machine register. Microsoft Specific: The compiler does not accept user requests for register variables; instead, it makes its own register choices when global register-allocation optimization (the /Oe option) is on. However, all other semantics associated with the register keyword are honored.
Something you can do is compile it with the -S flag (or the equivalent if you aren't using gcc) and compare the various assembly outputs using the -O, -O2, and -O3 flags. One common way to optimize a loop is to do some degree of unrolling; for (a very simple, naive) example:
int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
    index = 2 * i;
    ++size[id[index]];
    sum[id[index]] += value[index];
    index++;
    ++size[id[index]];
    sum[id[index]] += value[index];
}
which will cut the number of cmp instructions in half. However, any half-decent optimizing compiler will do this for you.
Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.
Do not optimise until you KNOW where the bottleneck is.
Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk, then the fractions of a second spent on processing the list are immaterial in the grander scheme of things. Let's say it takes 10 seconds to load and 1 second to process:
You optimise the processing loop so it takes 0 seconds (almost impossible, but it's to illustrate a point); then it is STILL taking 10 seconds. 11 seconds really isn't that bad a performance hit, and you would be better off focusing your optimisation time on the actual data load, as this is far more likely to be the slow part.
In fact it can be quite optimal to do double-buffered data loads: i.e., you load buffer 0, then you start the load of buffer 1. While buffer 1 is loading, you process buffer 0. When finished, start the load of the next buffer while processing buffer 1, and so on. This way you can completely amortise the cost of processing; a sketch of the pattern is below.
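A minimal sketch of that double-buffering pattern (load_chunk and process are hypothetical stand-ins for the real IO and the real processing loop):

#include <future>
#include <numeric>
#include <vector>

std::vector<int> load_chunk(int index) {        // stand-in for the slow IO
    return std::vector<int>(1024, index);       // dummy data for the sketch
}

long long total = 0;
void process(const std::vector<int>& buf) {     // stand-in for the processing
    total += std::accumulate(buf.begin(), buf.end(), 0LL);
}

void run(int chunks) {
    // Kick off the load of chunk 0 before entering the loop.
    auto pending = std::async(std::launch::async, load_chunk, 0);
    for (int i = 0; i < chunks; ++i) {
        std::vector<int> current = pending.get();    // wait for chunk i
        if (i + 1 < chunks)                          // start loading chunk i+1...
            pending = std::async(std::launch::async, load_chunk, i + 1);
        process(current);                            // ...while processing chunk i
    }
}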
Further edit: In fact, your best optimisation would probably come from loading things into a set of buckets that eliminates the "id[i]" part of the calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you have at least a triple-core machine, process the whole data in a tenth of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.
Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win, as long as you were pre-fetching a few loop iterations ahead of when you needed the data.
See here (Page 95 in the PDF) for more info from Intel.
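A sketch of how that might look in this loop (the prefetch distance of 64 elements is a guess that would need tuning per machine):

#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

const int PREFETCH_DIST = 64;  // tune experimentally
for (int i = 0; i < N; ++i) {
    if (i + PREFETCH_DIST < N) {  // stay inside the arrays
        _mm_prefetch((const char*)&id[i + PREFETCH_DIST], _MM_HINT_T0);
        _mm_prefetch((const char*)&value[i + PREFETCH_DIST], _MM_HINT_T0);
    }
    ++size[id[i]];
    sum[id[i]] += value[i];
}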
This computation is trivially parallelizable; just add
#pragma omp parallel for reduction(+:size,sum) schedule(static)
immediately above the loop if you have OpenMP support (-fopenmp in GCC.) However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
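One caveat: reductions over C arrays need the array-section syntax introduced in OpenMP 4.5, so the exact clause depends on your compiler. Assuming an OpenMP 4.5 compiler, a sketch of the parallelized loop:

// Assuming OpenMP 4.5+ (array sections in reduction clauses):
#pragma omp parallel for reduction(+: size[:10], sum[:10]) schedule(static)
for (int i = 0; i < N; ++i)
{
    ++size[id[i]];
    sum[id[i]] += value[i];
}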
If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:
int i = 0, j, k = 0;
float tmp;

// after sorting, bucket j's values are contiguous: value[k .. k+size[j]-1]
for (j = 0; j < 10; sum[j] += tmp, j++)
    for (k += size[j], tmp = 0; i < k; i++)
        tmp += value[i];