for loop optimization c ++

This is my first time posting on this site and I hope I get some help or a hint. I have an assignment where I need to optimize the performance of the inner for loop, but I have no idea how to do that. The code was given in the assignment. I need to measure the run time (which I was able to do) and improve the performance.
Here is the code:
//header files
#define N_TIMES 200 //This is originally 200000 but changed it to test the program faster
#define ARRAY_SIZE 9973
int main (void) {
int *array = (int*)calloc(ARRAY_SIZE, sizeof(int));
int sum = 0;
int checksum = 0;
int i;
int j;
int x;
// Initialize the array with random values 0 to 13.
srand(time(NULL));
for (j=0; j < ARRAY_SIZE; j++) {
x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
array[j] = x;
checksum += x;
}
//printf("Checksum is %d.\n",checksum);
for (i = 0; i < N_TIMES; i++) {
// Do not alter anything above this line.
// Need to optimize this for loop----------------------------------------
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
printf("Sum is now: %d\n",sum);
}
// Do not alter anything below this line.
// ---------------------------------------------------------------
// Check each iteration.
//
if (sum != checksum) {
printf("Checksum error!\n");
}
sum = 0;
}
return 0;
}
The code takes about 695 seconds to run. Any help on how to optimize it please?
thanks a lot.

The bottleneck in that loop is clearly the I/O done by printf. Since you are probably writing the output to a console, stdout is line buffered, which means the stdio buffer is flushed at every iteration, and that slows things down a lot.
If you have to do all those prints, you can greatly improve performance by forcing the stream to use block buffering: before the for loop add
setvbuf(stdout, NULL, _IOFBF, 0);
Alternatively, if this approach is not considered valid, you can do your own buffering: allocate a big buffer, write into it using sprintf, and periodically empty it to the output stream with fwrite.
Also, you can use the poor man's approach to buffering: just use a buffer big enough to hold all the output (you can calculate how big it must be quite easily) and write into it without worrying about when it's full or when to empty it; just empty it once at the end of the loop. Edit: see @paxdiablo's answer for an example of this.
Applying just the first optimization, what I get with time is
real 0m6.580s
user 0m0.236s
sys 0m2.400s
vs the original
real 0m8.451s
user 0m0.700s
sys 0m3.156s
So, we shaved off ~2 seconds of real time, about half a second of user time and ~0.75 seconds of system time. But what we can see here is the huge difference between user+sys and real, which means that the time is not spent doing something inside the process, but waiting.
Thus, the real bottleneck here is not in our process, but in the virtual terminal emulator's process: sending huge quantities of text to the console is going to be slow no matter what optimizations we make in our program. In other words, your task is not CPU-bound but I/O-bound, so CPU-targeted optimizations won't be of much benefit, since in the end you have to wait for your I/O device to do its slow work anyway.
The real way to speed up such a program is much simpler: avoid the slow I/O device (the console) and just write the data to a file (which, by the way, is block-buffered by default).
matteo@teokubuntu:~/cpp/test$ time ./a.out > test
real 0m0.369s
user 0m0.240s
sys 0m0.068s

Since there's absolutely no variation in that loop based on i (the outer loop), you don't need to calculate it each time.
In addition, the printing of the data should be outside the inner loop so as not to impose I/O costs on the calculation.
With those two things in mind, one possibility is:
static int sumCalculated = 0;
if (!sumCalculated) {
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
sumCalculated = 1;
}
printf("Sum is now: %d\n",sum);
although that has different output to the original which may be an issue (one line at the end rather than one line per addition).
If you do need to print the accumulating sum within the loop, I'd simply buffer that as well (since it doesn't vary each time through the i loop).
The string Sum is now: 999999999999\n (12 digits, it may vary depending on your int size) takes up 25 bytes (excluding terminating NUL). Multiply that by 9973 and you need a buffer of about 250K (including a terminating NUL). So something like this:
static char buff[250000];
static int sumCalculated = 0;
if (!sumCalculated) {
int offset = 0;
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
offset += sprintf (&buff[offset], "Sum is now: %d\n",sum);
}
sumCalculated = 1;
}
printf ("%s", buff);
Now that sort of defeats the whole intent of the outer loop as a benchmark tool but loop-invariant removal is a valid approach to optimisation.

Move the printf outside the for loop.
// Do not alter anything above this line.
//Need to optimize this for loop----------------------------------------
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
printf("Sum is now: %d\n",sum);
// Do not alter anything below this line.
// ---------------------------------------------------------------

Getting the I/O out of the loop is a big help.
Depending on the compiler and machine, you might get a tiny increase in speed by using pointers rather than indexing (though on modern hardware, it generally doesn't make a difference).
Loop unrolling might help to increase the ratio of useful work to loop overhead.
You could use vector instructions (e.g., SIMD) to do a bunch of calculation in parallel.
Are you allowed to pack the array? Can you use an array of a smaller type than int (given that all the values are very small)? Making the array physically shorter improves locality.
Loop unrolling might look something like this:
for (int j = 0; j < ARRAY_SIZE; j += 2) {
sum += array[j] + array[j+1];
}
You'd have to figure out what to do if the array isn't an exact multiple of the unrolling size (which is probably why the assignment uses a prime number).
You would have to experiment to see how much unrolling would be the right amount.
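As a sketch of how the leftover elements might be handled, the version below unrolls by four and finishes with a plain loop for the remainder. The function name sum_unrolled and the factor of four are illustrative choices, not something the assignment prescribes; the right factor would have to be measured.

```cpp
#include <cstddef>

// Sums `count` ints, unrolled by four with independent accumulators.
// Works for any count, including ones that are not a multiple of four.
int sum_unrolled(const int* array, std::size_t count) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t j = 0;

    // Main body: runs only while four full elements remain, so it never
    // reads past the end of the array.
    for (; j + 4 <= count; j += 4) {
        s0 += array[j];
        s1 += array[j + 1];
        s2 += array[j + 2];
        s3 += array[j + 3];
    }

    int sum = s0 + s1 + s2 + s3;

    // Remainder: at most three elements are left over.
    for (; j < count; ++j) {
        sum += array[j];
    }
    return sum;
}
```

With ARRAY_SIZE equal to 9973, the main body would cover 9972 elements and the remainder loop would pick up the last one.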

Related

Fastest way to create a vector of indices from distance matrix in C++

I have a distance matrix D of size n by n and a constant L as input. I need to create a vector v that contains all entries of D whose value is at most L. Here v must be in a specific order v = [v1 v2 .. vn], where vi contains the entries in the ith row of D with value at most L. The order of entries within each vi is not important.
I wonder whether there is a fast way to create v using vector, array or any other data structure plus parallelization. What I did is use for loops, and it is very slow for large n.
vector<int> v;
for (int i=0; i < n; ++i){
for (int j=0; j < n; ++j){
if (D(i,j) <= L) v.push_back(j);
}
}

The best way depends mostly on the context. If you are looking for GPU parallelization, you should take a look at OpenCL.
For CPU-based parallelization, the C++ standard <thread> library is probably your best bet, but you need to be careful:
Threads take time to create, so if n is relatively small (<1000 or so) it will slow you down
D(i,j) has to be readable by multiple threads at the same time
v has to be writable by multiple threads; a single standard vector won't cut it
v may be a 2d vector with vi as its subvectors, but these have to be initialized before the parallelization:
std::vector<std::vector<int>> v;
v.reserve(n);
for(size_t i = 0; i < n; i++)
{
v.push_back(std::vector<int>());
}
You need to decide how many threads you want to use. If this is for one machine only, hardcoding is a valid option. There is a function in the thread library that reports the number of supported threads, but it is a hint rather than a guarantee.
size_t threadAmount = std::thread::hardware_concurrency(); // How many threads should run; hardware_concurrency() is only a hint and may return 0
std::vector<std::thread> t; //to store the threads in
t.reserve(threadAmount-1); //you need threadAmount-1 extra threads (we already have the main-thread)
To start a thread you need a function it can execute. In this case this is to read through part of your matrix.
void CheckPart(size_t start, size_t amount, int L, std::vector<std::vector<int>>& vec)
{
for(size_t i = start; i < amount+start; i++)
{
for(size_t j = 0; j < n; j++)
{
if(D(i,j) <= L)
{
vec[i].push_back(j);
}
}
}
}
Now you need to split your matrix in parts of about n/threadAmount rows and start the threads. The thread constructor needs a function and its parameter, but it will always try to copy the parameters, even if the function wants a reference. To prevent this, you need to force using a reference with std::ref()
int i = 0;
int rows;
for(size_t a = 0; a < threadAmount-1; a++)
{
rows = n/threadAmount + ((n%threadAmount>a)?1:0);
t.push_back(std::thread(CheckPart, i, rows, L, std::ref(v)));
i += rows;
}
The threads are now running, and all that is left is to run the last block on the main thread:
CheckPart(i, n/threadAmount, L, v);
After that you need to wait for the threads to finish and clean them up:
for(unsigned int a = 0; a < threadAmount-1; a++)
{
if(t[a].joinable())
{
t[a].join();
}
}
Please note that this is just a quick and dirty example. Different problems might need different implementation, and since I can't guess the context the help I can give is rather limited.
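For reference, the pieces above can be put together into one self-contained sketch. It assumes D is stored as a std::vector of rows (the question does not say how D(i,j) is implemented), and the names Matrix and indices_within are invented for this example. Each thread writes only to the rows it owns, and the outer vector is sized before any thread starts, so no locking is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<int>>;  // assumption: D stored as rows

// For each row i, collects the column indices j with D[i][j] <= L.
std::vector<std::vector<int>> indices_within(const Matrix& D, int L,
                                             unsigned threadAmount) {
    const std::size_t n = D.size();
    std::vector<std::vector<int>> v(n);         // one sub-vector per row, created up front
    threadAmount = std::max(1u, threadAmount);  // hardware_concurrency() may report 0

    // Processes rows [start, end); every row is touched by exactly one thread.
    auto checkPart = [&](std::size_t start, std::size_t end) {
        for (std::size_t i = start; i < end; ++i) {
            for (std::size_t j = 0; j < D[i].size(); ++j) {
                if (D[i][j] <= L) {
                    v[i].push_back(static_cast<int>(j));
                }
            }
        }
    };

    std::vector<std::thread> workers;
    workers.reserve(threadAmount - 1);
    std::size_t start = 0;
    for (unsigned a = 0; a < threadAmount; ++a) {
        // Spread the n % threadAmount leftover rows over the first blocks.
        const std::size_t rows = n / threadAmount + (a < n % threadAmount ? 1 : 0);
        const std::size_t end = start + rows;
        if (a + 1 == threadAmount) {
            checkPart(start, end);  // last block runs on the calling thread
        } else {
            workers.emplace_back(checkPart, start, end);
        }
        start = end;
    }
    for (std::thread& t : workers) {
        t.join();
    }
    return v;
}
```

A call might look like indices_within(D, L, std::thread::hardware_concurrency()). If a single flat vector is required, the sub-vectors can be concatenated in row order afterwards.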

In consideration of the comments, I made the appropriate corrections (in emphasis).
Have you searched tips for writing performance code, threading, asm instructions (if your assembly is not exactly what you want) and OpenCL for parallel-processing? If not, I strongly recommend!
In some cases, declaring all for loop variables out of the for loop (to avoid declaring they a lot of times) will make it faster, but not in this case (comment from our friend Paddy).
Also, using new insted of vector can be faster, as we see here: Using arrays or std::vectors in C++, what's the performance gap? - and I tested, and with vector it's 6 seconds slower than with new,which only takes 1 second. I guess that the safety and ease of management guarantees that come with std::vector is not desired when someone is searching for performance, even because using new is not so difficult, just avoid heap overflow with calculations and remember using delete[]
user4581301 is correct here, and the following statement is untrue: Finally, if you build D in a array instead of matrix (or maybe if you copy D into a constant array, maybe...), it will be much mor cache-friendly and will save one for loop statement.

My C++ program gets slower as computation proceeds

I wrote a neural network program in C++ to test something, and I found that my program gets slower as computation proceeds. Since this is something I've never seen before, I checked possible causes. The memory used by the program did not change when it got slower. RAM and CPU status were fine when I ran the program.
Fortunately, the previous version of the program did not have this problem, so I finally found the single statement that makes the program slow. The program does not get slower when I use this statement:
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi;
However, the program gets slower and slower as soon as I replace above statement with:
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi - lambda*w[k][i][j];
To solve this problem, I did my best to find and remove the cause but I failed... The below is the simple code structure. For the case that this is not the problem that is related to local statement, I uploaded my code to google drive. The URL is located at the end of this question.
MLP.h
class MLP
{
private:
...
double lambda;
double ***w;
double ***dw;
neuron **hidden;
...
MLP.cpp
...
for(k = n_depth - 1; k > 0; k--)
{
if(k == n_depth - 1)
...
else
{
...
for(j = 1; n_neuron > j; j++)
{
for(i = 0; n_neuron > i; i++)
{
//dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi;
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi - lambda*w[k][i][j];
}
}
}
}
...
Full source code: https://drive.google.com/open?id=1A8Uw0hNDADp3-3VWAgO4eTtj4sVk_LZh

I'm not sure exactly why it gets slower and slower, but I do see where you can gain some performance.
Two and higher dimensional arrays are still stored in one dimensional memory. This means (for C/C++ arrays) array[i][j] and array[i][j+1] are adjacent to each other, whereas array[i][j] and array[i+1][j] may be arbitrarily far apart.
Accessing data in a more-or-less sequential fashion, as stored in physical memory, can dramatically speed up your code (sometimes by an order of magnitude, or more)!
When modern CPUs load data from main memory into processor cache, they fetch more than a single value. Instead they fetch a block of memory containing the requested data and adjacent data (a cache line). This means after array[i][j] is in the CPU cache, array[i][j+1] has a good chance of already being in cache, whereas array[i+1][j] is likely to still be in main memory.
Source: https://people.cs.clemson.edu/~dhouse/courses/405/papers/optimize.pdf
With your current code, w[k][i][j] will be read, and on the next iteration, w[k][i+1][j] will be read. You should invert i and j so that w is read in sequential order:
for(j = 1; n_neuron > j; ++j)
{
for(i = 0; n_neuron > i; ++i)
{
dw[k][j][i] = hidden[k-1][j].y * hidden[k][i].phi - lambda*w[k][j][i];
}
}
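Note that renaming the indices as in the snippet above also changes which elements are combined. An alternative that keeps the question's indexing, dw[k][i][j], and still reads w sequentially is to swap the order of the two loops instead. The sketch below shows that for a single layer; the Grid alias, the update_layer name and the use of std::vector in place of the question's raw pointers are assumptions made to keep it self-contained.

```cpp
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<double>>;  // assumption: one square layer

// Hypothetical stand-in for one layer of the question's update:
//   dw[i][j] = y[i] * phi[j] - lambda * w[i][j]
// The row index i is in the outer loop, so the inner loop walks each row of
// w and dw from left to right, which matches how a row is laid out in memory.
void update_layer(Grid& dw, const Grid& w,
                  const std::vector<double>& y,
                  const std::vector<double>& phi,
                  double lambda) {
    const std::size_t n = w.size();
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 1; j < n; ++j) {  // j starts at 1, as in the question
            dw[i][j] = y[i] * phi[j] - lambda * w[i][j];
        }
    }
}
```

Whether this explains the progressive slowdown in the question is a separate matter; the interchange only addresses the memory access pattern.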
Also note that ++x can be slightly faster than x++ for class types such as iterators, since x++ has to create a temporary containing the old value of x as the expression result. For built-in types such as int, compilers generate the same code when the value is unused, so here it is mostly a matter of habit.

Is using a for loop to iterate a fixed array slower than manually going through it?

Consider these two pieces of code:
float arr1[4], arr2[4];
//Do something here with arr1 and arr2
for (int i = 0; i < 4; i++)
arr1[i] += arr2[i];
-
float arr1[4], arr2[4];
//Do something here with arr1 and arr2
arr1[0] += arr2[0];
arr1[1] += arr2[1];
arr1[2] += arr2[2];
arr1[3] += arr2[3];
Assuming I'm working with larger arrays of a known fixed size, would the first have any performance impact over the second?

Assuming no compiler optimizations, then the for loop is unavoidably 'slower'. Although both approaches are O(n), the for loop has a larger constant because of the loop overhead.
Loop unrolling is a reasonable time-space trade-off for small arrays, and may actually be a space gain for really small arrays.
But doing it manually introduces many, roughly n, opportunities for human error both during the initial (inevitable) cut and paste of creating the many lines of code needed to do it manually, and then when changes need to be made later to the "loop body".
Generally, loops are preferable for reasons of maintenance and readability. They also more clearly convey the intent of the code.
Finally, for large arrays, small loop bodies, and particular target architectures, the processor's cache comes into play. In many cases the entire loop will fit in the cache, making it much faster than a long list of instructions.
Let the compiler worry about optimizing.
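One way to keep the loop while still giving the optimizer everything it needs is to make the trip count a compile-time constant. The sketch below does that with std::array; the name add_into is made up for illustration. Whether the compiler actually unrolls or vectorizes it depends on the compiler and the optimization flags, so inspecting the generated assembly is the only way to be sure.

```cpp
#include <array>
#include <cstddef>

// Adds src into dst element by element. N is known at compile time, so an
// optimizing compiler is free to unroll or vectorize this loop on its own.
template <std::size_t N>
void add_into(std::array<float, N>& dst, const std::array<float, N>& src) {
    for (std::size_t i = 0; i < N; ++i) {
        dst[i] += src[i];
    }
}
```

The call site stays a single line, add_into(arr1, arr2), no matter how large the fixed size becomes.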

It depends how the loop is constructed. In case of short loops, that don't have much code inside, this can be done automatically by the compiler. It is known as loop unrolling.
Is it slower? Faster? There is no one, right answer - always profile your code. It may be faster to do it manually, because loops are implemented as conditional jumps. So it may be faster to manually go through it, because code will be executed "in order", instead of jumping to the same location multiple times.
Consider following code:
int main()
{
int sum = 0;
int values[4] = { 1, 2, 3, 4 };
for(int n = 0; n < 4; ++n)
sum += values[n];
return 0;
}
The following assembly will be generated for the for loop:
[assembly listing not preserved]
Now, let's change it to the manual approach:
int main()
{
int sum = 0;
int values[4] = { 1, 2, 3, 4 };
sum += values[0];
sum += values[1];
sum += values[2];
sum += values[3];
return 0;
}
Result:
[assembly listing not preserved]
Which one is better? Which one is faster? Hard to say. Code without jumps and conditions may be faster, but unrolling loops that are too complex may result in code bloat.
So, what is my answer? Find out yourself.

Why is processing multiple streams of data slower than processing one?

I'm testing how reading multiple streams of data affects a CPU's cache performance. I'm using the following code to benchmark this. The benchmark reads integers stored sequentially in memory and writes partial sums back sequentially. The number of sequential blocks that are read from is varied. Integers from the blocks are read in a round-robin manner.
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib> // malloc, free
using std::vector;
void test_with_split(int num_arrays) {
int num_values = 100000000;
// Fix up the number of values. The effect of this should be insignificant.
num_values -= (num_values % num_arrays);
int num_values_per_array = num_values / num_arrays;
// Initialize data to process
auto results = vector<int>(num_values);
auto arrays = vector<vector<int>>(num_arrays);
for (int i = 0; i < num_arrays; ++i) {
arrays.emplace_back(num_values_per_array);
}
for (int i = 0; i < num_values; ++i) {
arrays[i%num_arrays].emplace_back(i);
results.emplace_back(0);
}
// Try to clear the cache
const int size = 20*1024*1024; // Allocate 20M. Set much larger then L2
char *c = (char *)malloc(size);
for (int i = 0; i < 100; i++)
for (int j = 0; j < size; j++)
c[j] = i*j;
free(c);
auto start = std::chrono::high_resolution_clock::now();
// Do the processing
int sum = 0;
for (int i = 0; i < num_values; ++i) {
sum += arrays[i%num_arrays][i/num_arrays];
results[i] = sum;
}
std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}
int main() {
int num_arrays = 1;
while (true) {
test_with_split(num_arrays++);
}
}
Here are the timings for splitting 1-80 ways on an Intel Core 2 Quad CPU Q9550 @ 2.83GHz:
The bump in the speed soon after 8 streams makes sense to me, as the processor has an 8-way associative L1 cache. The 24-way associative L2 cache in turn explains the bump at 24 streams. These especially hold if I'm getting the same effects as in Why is one loop so much slower than two loops?, where multiple big allocations always end up in the same associativity set. To compare I've included at the end timings when the allocation is done in one big block.
However, I don't fully understand the bump from one stream to two streams. My own guess would be that it has something to do with prefetching to L1 cache. Reading the Intel 64 and IA-32 Architectures Optimization Reference Manual it seems that the L2 streaming prefetcher supports tracking up to 32 streams of data, but no such information is given for the L1 streaming prefetcher. Is the L1 prefetcher unable to keep track of multiple streams, or is there something else at play here?
Background
I'm investigating this because I want to understand how organizing entities in a game engine as components in the structure-of-arrays style affects performance. For now it seems that the data required by a transformation being in two components vs. it being in 8-10 components won't matter much with modern CPUs. However, the testing above suggests that sometimes it might make sense to avoid splitting some data into multiple components if that would allow a "bottlenecking" transformation to only use one component, even if this means that some other transformation would have to read data it is not interested in.
Allocating in one block
Here are the timings if instead allocating multiple blocks of data only one is allocated and accessed in a strided manner. This does not change the bump from one stream to two, but I've included it for sake of completeness.
And here is the modified code for that:
void test_with_split(int num_arrays) {
int num_values = 100000000;
num_values -= (num_values % num_arrays);
int num_values_per_array = num_values / num_arrays;
// Initialize data to process
auto results = vector<int>(num_values);
auto array = vector<int>(num_values);
for (int i = 0; i < num_values; ++i) {
array.emplace_back(i);
results.emplace_back(0);
}
// Try to clear the cache
const int size = 20*1024*1024; // Allocate 20M. Set much larger then L2
char *c = (char *)malloc(size);
for (int i = 0; i < 100; i++)
for (int j = 0; j < size; j++)
c[j] = i*j;
free(c);
auto start = std::chrono::high_resolution_clock::now();
// Do the processing
int sum = 0;
for (int i = 0; i < num_values; ++i) {
sum += array[(i%num_arrays)*num_values_per_array+i/num_arrays];
results[i] = sum;
}
std::cout << "Time with " << num_arrays << " arrays: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count() << " ms\n";
}
Edit 1
I made sure that the 1 vs 2 splits difference was not due to the compiler unrolling the loop and optimizing the first iteration differently. Using the __attribute__ ((noinline)) I made sure the work function is not inlined into the main function. I verified that it did not happen by looking at the generated assembly. The timings after these changed were the same.

To answer the main part of your question: Is the L1 prefetcher able to keep track of multiple streams?
No. This is actually because the L1 cache doesn't have a prefetcher at all. The L1 cache isn't big enough to risk speculatively fetching data that might not be used. It would cause too many evictions and adversely impact any software that isn't reading data in specific patterns suited to that particular L1 cache prediction scheme. Instead L1 caches data that has been explicitly read or written. L1 caches are only helpful when you are writing data and re-reading data that has recently been accessed.
The L1 cache implementation is not the reason for your profile bump from 1X to 2X array depth. On streaming reads like what you've set up, the L1 cache plays little or no factor in performance. Most of your reads are coming directly from the L2 cache. In your first example using nested vectors, some number of reads are probably pulled from L1 (see below).
My guess is your bump from 1X to 2X has a lot to do with the algo and how the compiler is optimizing it. If the compiler knows num_arrays is a constant equal to 1, then it will automatically eliminate a lot of per-iteration overhead for you.
Now for the second part: why is the second version faster?
The reason for the second version being faster is not so much in how the data is arranged in physical memory, but rather what under-the-hood logic change a nested std::vector<std::vector<int>> type implies.
In the nested (first) case, compiled code performs the following steps:
1. Accesses top-level std::vector class. This class contains a pointer to the start of the data array.
2. That pointer value must be loaded from memory.
3. Add current loop offset [i%num_arrays] to that pointer.
4. Access nested std::vector class data. (Likely L1 cache hit)
5. Load pointer to the vector's start of data array. (Likely L1 cache hit)
6. Add loop offset [i/num_arrays]
7. Read data. Finally!
(Note that the chances of getting L1 cache hits on steps #4 and #5 decrease drastically after 24 streams or so, due to the likelihood of evictions before the next iteration through the loop.)
The second version, by comparison:
1. Accesses top-level std::vector class.
2. Load pointer to the vector's start of data array.
3. Add offset [(i%num_arrays)*num_values_per_array+i/num_arrays]
4. Read data!
An entire set of under-the-hood steps are removed. The calculation for offset is slightly longer since it needs an extra multiply by num_values_per_array. But the other steps more than make up for it.
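The two access patterns can be put side by side in a small sketch. The function names sum_nested and sum_flat are invented here, and the timing and cache-flushing code from the question is left out; the point is only to show the extra dependent load in the nested case.

```cpp
#include <cstddef>
#include <vector>

// Nested layout: each element costs two dependent loads, first the inner
// vector's data pointer, then the element itself.
long long sum_nested(const std::vector<std::vector<int>>& arrays,
                     std::size_t per_array) {
    const std::size_t num_arrays = arrays.size();
    long long sum = 0;
    for (std::size_t i = 0; i < num_arrays * per_array; ++i) {
        sum += arrays[i % num_arrays][i / num_arrays];
    }
    return sum;
}

// Flat layout: one data pointer for everything; the stream index becomes a
// multiply-and-add on the offset instead of a second pointer load.
long long sum_flat(const std::vector<int>& array,
                   std::size_t num_arrays, std::size_t per_array) {
    long long sum = 0;
    for (std::size_t i = 0; i < num_arrays * per_array; ++i) {
        sum += array[(i % num_arrays) * per_array + i / num_arrays];
    }
    return sum;
}
```

Both functions visit the same logical elements in the same round-robin order, so any timing difference between them comes from the layout rather than from the amount of work.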

Optimize indexed array summation

I have the following C++ code:
const int N = 1000000;
int id[N]; //Value can range from 0 to 9
float value[N];
// load id and value from an external source...
int size[10] = { 0 };
float sum[10] = { 0 };
for (int i = 0; i < N; ++i)
{
++size[id[i]];
sum[id[i]] += value[i];
}
How should I optimize the loop?
I considered using SSE to add every 4 floats to a sum and then after N iterations, the sum is just the sum of the 4 floats in the xmm register but this doesn't work when the source is indexed like this and needs to write out to 10 different arrays.

This kind of loop is very hard to optimize using SIMD instructions. Not only isn't there an easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"), even if there was, this particular loop still has the problem that you might have two values that map to the same id in one SIMD register, e.g. when
id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0
in this case, the obvious approach (pseudocode here)
x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);
won't work either!
You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:
int size[10] = { 0 }, size2[10] = { 0 };
float sum[10] = { 0 }, sum2[10] = { 0 };
for (int i = 0; i < N/2; i++) {
int id0 = id[i*2+0], id1 = id[i*2+1];
++size[id0];
++size2[id1];
sum[id0] += value[i*2+0];
sum2[id1] += value[i*2+1];
}
// if N was odd, process last element
if (N & 1) {
++size[id[N-1]];
sum[id[N-1]] += value[N-1];
}
// add partial sums together
for (int i = 0; i < 10; i++) {
size[i] += size2[i];
sum[i] += sum2[i];
}
Whether this helps or not depends on the target CPU though.

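Well, you are reading id[i] twice in your loop. You could store it in a local variable, or a register int if you wanted to (note that register is only a hint to the compiler and was removed from the language in C++17).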
register int index;
for(int i = 0; i < N; ++i)
{
index = id[i];
++size[index];
sum[index] += value[i];
}
The MSDN docs state this about register:
The register keyword specifies that the variable is to be stored in a machine register.
Microsoft Specific
The compiler does not accept user requests for register variables; instead, it makes its own register choices when global register-allocation optimization (/Oe option) is on. However, all other semantics associated with the register keyword are honored.

Something you can do is to compile it with the -S flag (or equivalent if you aren't using gcc) and compare the various assembly outputs using -O, -O2, and -O3 flags. One common way to optimize a loop is to do some degree of unrolling, for (a very simple, naive) example:
int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
index = 2 * i;
++size[id[index]];
sum[id[index]] += value[index];
index++;
++size[id[index]];
sum[id[index]] += value[index];
}
which will cut the number of cmp instructions in half. However, any half-decent optimizing compiler will do this for you.

Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.
Do not optimise until you KNOW where the bottleneck is.
Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk then the fractions of a second spent on processing the list are immaterial in the grander scheme of things. Let's say it takes 10 seconds to load and 1 second to process:
If you optimise the processing loop so it takes 0 seconds (almost impossible, but it's to illustrate a point) then it is STILL taking 10 seconds. 11 seconds really isn't that bad a performance hit, and you would be better off focusing your optimisation time on the actual data load, as this is far more likely to be the slow part.
In fact it can be quite optimal to do double-buffered data loads, i.e. you load buffer 0, then you start the load of buffer 1. While buffer 1 is loading you process buffer 0. When finished, start the load of the next buffer while processing buffer 1, and so on. This way you can completely amortise the cost of processing.
Further edit: In fact your best optimisation would probably come from loading things into a set of buckets that eliminate the "id[i]" part of the calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you have at least a triple core machine, process the whole data in a 10th of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.

Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win as long as you were pre-fetching a few loop iterations before you needed the data.
See here (Page 95 in the PDF) for more info from Intel.

This computation is trivially parallelizable; just add
#pragma omp parallel for reduction(+:size,sum) schedule(static)
immediately above the loop if you have OpenMP support (-fopenmp in GCC; reducing over whole arrays needs OpenMP 4.5 or later). However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:
for (i = 0, j = 0, k = 0; j < 10; sum[j] += tmp, j++)
for (k += size[j], tmp = 0; i < k; i++)
tmp += value[i];
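A fuller sketch of the pre-sorting idea follows. It groups the values by id with a counting sort and then sums each contiguous run, as in the loop above. The names Grouped, group_by_id, sum_grouped and kNumIds are hypothetical, and ids are assumed to lie in the range 0 to 9 as stated in the question. In the scenario described, where value[] changes more often than id[], the values would ideally be produced in id order to begin with, so the grouping cost is paid only once.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kNumIds = 10;  // ids are assumed to be in [0, kNumIds)

struct Grouped {
    std::array<std::size_t, kNumIds> size{};  // number of values per id
    std::vector<float> value;                 // values reordered so equal ids are adjacent
};

// Counting sort of `value` by `id`: count, compute start offsets, scatter.
Grouped group_by_id(const std::vector<int>& id, const std::vector<float>& value) {
    Grouped g;
    g.value.resize(value.size());
    for (int k : id) {
        ++g.size[static_cast<std::size_t>(k)];
    }
    std::array<std::size_t, kNumIds> next{};  // next free slot for each id
    for (std::size_t j = 1; j < kNumIds; ++j) {
        next[j] = next[j - 1] + g.size[j - 1];
    }
    for (std::size_t i = 0; i < id.size(); ++i) {
        g.value[next[static_cast<std::size_t>(id[i])]++] = value[i];
    }
    return g;
}

// Sums each contiguous run; no id[] lookup is needed in the inner loop.
std::array<float, kNumIds> sum_grouped(const Grouped& g) {
    std::array<float, kNumIds> sum{};
    std::size_t i = 0;
    for (std::size_t j = 0; j < kNumIds; ++j) {
        float tmp = 0.0f;
        for (const std::size_t end = i + g.size[j]; i < end; ++i) {
            tmp += g.value[i];
        }
        sum[j] = tmp;
    }
    return sum;
}
```

Because the values for one id are added in a different order than in the original loop, the floating-point sums may differ slightly from the unsorted version.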
