I am wondering whether, within the same wave / subgroup (warp?), we need to call memoryBarrierShared and barrier to synchronize shared variables. On NVIDIA I think it is not necessary, but I do not know about other IHVs.
EDIT : ballot
Since I am talking about wave / subgroup, I am talking about the ARB_shader_ballot extension.
Let's say we have such code (1) :
shared uint s_data[128];
uint tid = gl_GlobalInvocationID.x;
// initialization of some s_data
if(tid < gl_SubGroupSizeARB) {
    for(uint i = gl_SubGroupSizeARB; i > 0; i>>=1)
        s_data[tid] += s_data[tid + i];
}
In my opinion, this code is not correct. The correct version, according to the spec, would be (2):
if(tid < gl_SubGroupSizeARB) {
    for(uint i = gl_SubGroupSizeARB; i > 0; i>>=1) {
        s_data[tid] += s_data[tid + i];
        memoryBarrierShared();
        barrier();
    }
}
However, since invocations run in parallel within a wave/subgroup, the barrier function seems to be useless: this one should be correct as well, and faster than the second (3):
if(tid < gl_SubGroupSizeARB) {
    for(uint i = gl_SubGroupSizeARB; i > 0; i>>=1) {
        s_data[tid] += s_data[tid + i];
        memoryBarrierShared();
    }
}
However, since we do not need the barrier function, I wonder whether (1) is correct, even though that seems unlikely to me, and if not, whether (3) is correct (that would mean that my understanding is correct).
EDIT : int to uint, and change = to +=
The execution model shared by OpenGL and Vulkan with regard to compute shaders does not really recognize the concept of a "wave". It has the concept of a work group, but that is not the same thing. A work group can be much bigger than a GPU "wave", and for small work groups, multiple work groups could be executing on the same GPU "wave".
As such, these specifications make no statements about the behavior of any of their functions with regard to a "wave" (with the exception of the shader ballot functions). So if you want synchronization that the standard says will work on all conforming implementations, you must call both functions as dictated by the standard.
Even with ARB_shader_ballot, the execution model of shaders is not modified. The extension only allows cross-communication between invocations within a subgroup, and only via the explicit mechanisms that it provides.
The execution model and memory model of shader invocations is that they are unordered with respect to each other, unless you explicitly order them with barriers.
I'm attempting to create a std::vector<std::set<int>> with one set for each NUMA-node, containing the thread-ids obtained using omp_get_thread_num().
Create data which is larger than L3 cache,
set first touch using thread 0,
perform multiple experiments to determine the minimum access time of each thread,
extract the threads into nodes based on sorted access times and information about the topology.
Code: (Intel compiler, OpenMP)
// create data which will be shared by multiple threads
const auto part_size = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
const auto size = 2 * part_size;
auto container = std::unique_ptr<double[]>(new double[size]); // array form, so that delete[] is used
// open a parallel section
auto thread_count = 0;
auto thread_id_min_duration = std::multimap<double, int>{};
#pragma omp parallel num_threads(std::thread::hardware_concurrency())
#pragma omp parallel
{
    // perform first touch using thread 0
    const auto thread_id = omp_get_thread_num();
    if (thread_id == 0)
    {
        thread_count = omp_get_num_threads();
        for (auto index = std::size_t{}; index < size; ++index)
        {
            container.get()[index] = static_cast<double>(std::rand() % 10 + 1);
        }
    }
    #pragma omp barrier
    // access the data using all threads individually
    #pragma omp for schedule(static, 1)
    for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
    {
        // calculate the minimum access time of this thread
        auto this_thread_min_duration = std::numeric_limits<double>::max();
        for (auto experiment_counter = std::size_t{}; experiment_counter < 250; ++experiment_counter)
        {
            const auto* data = experiment_counter % 2 == 0 ? container.get() : container.get() + part_size;
            const auto start_timestamp = omp_get_wtime();
            for (auto index = std::size_t{}; index < part_size; ++index)
            {
                static volatile auto exceedingly_interesting_value_wink_wink = data[index];
            }
            const auto end_timestamp = omp_get_wtime();
            const auto duration = end_timestamp - start_timestamp;
            if (duration < this_thread_min_duration)
            {
                this_thread_min_duration = duration;
            }
        }
        #pragma omp critical
        thread_id_min_duration.insert(std::make_pair(this_thread_min_duration, thread_id));
    }
} // #pragma omp parallel
Not shown here is code which outputs the minimum access times sorted into the multimap.
Env. and Output
I am attempting to not use SMT by using export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=24. However, I'm getting this output:
What's puzzling me is that I'm having the same access times on all threads. Since I'm trying to spread them across the 2 NUMA nodes, I expect to neatly see 12 threads with access time, say, x and another 12 with access time ~2x.
Why is the above happening?
Additional Information
Even more puzzling are the following environments and their outputs:
Any help in understanding this phenomenon would be much appreciated.
In short, the benchmark is flawed.
perform multiple experiments to determine the minimum access time of each thread
The term "minimum access time" is unclear here. I assume you mean "latency". The thing is, your benchmark does not measure latency. volatile tells the compiler to actually perform the reads and stores against the memory hierarchy; it does not force them to go all the way to RAM. The processor is free to keep the value in its cache, and x86-64 processors actually do that (like almost all modern processors).
You can find the documentation of both here and there. In short, I strongly advise you to set OMP_PROC_BIND=TRUE and OMP_PLACES="{0},{1},{2},..." based on the values retrieved from hwloc. More specifically, you can get this from hwloc-calc, which is a really great tool (consider using --li --po, and PU rather than CORE, because this is what OpenMP runtimes expect). For example, you can query the PU identifiers of a given NUMA node. Note that some machines have very weird non-linear OS PU numbering, and OpenMP runtimes sometimes fail to map the threads correctly. IOMP (the OpenMP runtime of ICC) should use hwloc internally, but I have found some bugs related to that in the past. To check that the mapping is correct, I advise you to use hwloc-ps. Note that OMP_PLACES=cores does not guarantee that threads do not migrate from one core to another (even to one on a different NUMA node) unless OMP_PROC_BIND=TRUE (or a similar setting) is set. Note that you can also use numactl to control the NUMA policies of your process. For example, you can tell the OS not to use a given NUMA node, or to interleave the allocations. The first-touch policy is not the only one and may not be the default on all platforms (on some Linux platforms, the OS can move pages between NUMA nodes to improve locality).
Why is the above happening?
The code takes 4.38 ms to read 50 MiB of memory in each thread. With 24 threads, this means 1200 MiB are read from node 0, assuming the first-touch policy is applied. Thus the throughput should be about 267 GiB/s. While this seems fine at first glance, it is a pretty big throughput for such a processor, especially assuming only 1 NUMA node is used. This is certainly because part of the fetches are served from the L3 cache and not from RAM. Indeed, the cache can hold part of the array, and certainly does, resulting in faster fetches thanks to the cache associativity and a good cache policy. This is especially true as the cache lines are not invalidated, since the array is only read. I advise you to use a significantly bigger array to prevent this complex effect from happening.
You certainly expect the threads on one NUMA node to have a smaller throughput due to remote NUMA memory access. This is not always true in practice. In fact, it is often wrong on modern 2-socket systems, since the socket interconnect is often not a limiting factor (this is the main source of throughput slowdown on NUMA systems).
NUMA effects arise on modern platforms because of unbalanced NUMA memory node saturation and non-uniform latency. The former is not a problem in your application since all the PUs use the same NUMA memory node. The latter is not a problem either because of the linear memory access pattern, the CPU caches and the hardware prefetchers: the latency should be completely hidden.
Even more puzzling are the following environments and their outputs
Using 26 threads on a 24-core machine means that 4 threads have to be executed on two cores. The thing is, hyper-threading should not help much in such a case. As a result, multiple threads sharing the same core will be slowed down. Because IOMP certainly pins threads to cores and the workload is unbalanced, those 4 threads will be about twice as slow.
Having 48 threads causes all the threads to be slower because each core then has to do twice the work.
Let me address your first sentence. A C++ std::vector is different from a C malloc. Malloc'ed space is not "instantiated": only when you touch the memory does the physical-to-logical address mapping get established. This is known as "first touch". And that is why in C-OpenMP you initialize an array in parallel, so that the socket touching the part of the array gets the pages of that part. In C++, the "array" in a vector is created by a single thread, so the pages wind up on the socket of that thread.
Here's a solution:
template<typename T>
struct uninitialized {
    uninitialized() {}   // user-provided and empty: val is deliberately left untouched
    T val;
    constexpr operator T() const { return val; }
    T operator=(const T& v) { val = v; return val; }
};
Now you can create a vector<uninitialized<double>> and the array memory is not touched until you explicitly initialize it:
vector<uninitialized<double>> x(N),y(N);
#pragma omp parallel for
for (int i=0; i<N; i++)
    y[i] = x[i] = 0.;
x[0] = 0; x[N-1] = 1.;
Now, I'm not sure how this goes if you have a vector of sets. Just thought I'd point out the issue.
After more investigation, I note the following:
work-load managers on clusters can and will disregard/reset OMP_PLACES/OMP_PROC_BIND,
memory page migration is a thing on modern NUMA systems.
Following this, I started using the work-load manager's own thread binding/pinning system, and adapted my benchmark to lock the memory page(s) on which my data lay. Furthermore, giving in to my programmer's paranoia, I ditched the std::unique_ptr for fear that it may lay its own first touch after allocating the memory.
// create data which will be shared by multiple threads
const auto size_per_thread = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
const auto total_size = thread_count * size_per_thread;
double* data = nullptr;
posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), total_size * sizeof(double));
if (data == nullptr)
throw std::runtime_error("could_not_allocate_memory_error");
// perform first touch using thread 0
#pragma omp parallel num_threads(thread_count)
{
    if (omp_get_thread_num() == 0)
    {
        #pragma omp simd safelen(8)
        for (auto d_index = std::size_t{}; d_index < total_size; ++d_index)
            data[d_index] = -1.0;
    }
} // #pragma omp parallel
mlock(data, total_size * sizeof(double)); // page migration is a real thing... (mlock takes a length in bytes)
// open a parallel section
auto thread_id_avg_latency = std::multimap<double, int>{};
auto generator = std::mt19937(); // heavy object can be created outside parallel
#pragma omp parallel num_threads(thread_count) private(generator)
{
    // access the data using all threads individually
    #pragma omp for schedule(static, 1)
    for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
    {
        // seed each thread's generator
        generator.seed(thread_counter + 1);
        // calculate the average access latency of this thread
        auto this_thread_avg_latency = 0.0;
        const auto experiment_count = 250;
        for (auto experiment_counter = std::size_t{}; experiment_counter < experiment_count; ++experiment_counter)
        {
            const auto start_timestamp = omp_get_wtime() * 1E+6;
            for (auto counter = std::size_t{}; counter < size_per_thread / 100; ++counter)
            {
                const auto index = std::uniform_int_distribution<std::size_t>(0, size_per_thread-1)(generator);
                auto& datapoint = data[thread_counter * size_per_thread + index];
                datapoint += index;
            }
            const auto end_timestamp = omp_get_wtime() * 1E+6;
            this_thread_avg_latency += end_timestamp - start_timestamp;
        }
        this_thread_avg_latency /= experiment_count;
        #pragma omp critical
        thread_id_avg_latency.insert(std::make_pair(this_thread_avg_latency, omp_get_thread_num()));
    }
} // #pragma omp parallel
With these changes, I am noticing the difference I expected.
Further notes:
this experiment shows that the latency of non-local access is 1.09 - 1.15 times that of local access on the cluster that I'm using,
there is no reliable cross-platform way of doing this (requires kernel-APIs),
OpenMP seems to number the threads exactly as hwloc/lstopo, numactl and lscpu seem to number them (logical ID?)
The most astonishing things are that the difference in latencies is very low, and that memory page migration may happen, which begs the question, why should we care about first-touch and all the rest of the NUMA concerns at all?
I have a C++ code that performs a time evolution of four variables that live on a 2D spatial grid. To save some time, I tried to parallelise my code with OpenMP but I just cannot get it to work: No matter how many cores I use, the runtime stays basically the same or increases. (My code does use 24 cores or however many I specify, so the compilation is not a problem.)
I have the feeling that the runtime for one individual time-step is too short and the overhead of producing threads kills the potential speed-up.
The layout of my code is:
for (int t = 0; t < max_time_steps; t++) {
    // do some book-keeping
    // perform time step
    // (1) calculate righthand-side of ODE:
    for (int i = 0; i < nr; i++) {
        for (int j = 0; j < ntheta; j++) {
            rhs[0][i][j] = A0[i][j] + B0[i][j] + ...;
            rhs[1][i][j] = A1[i][j] + B1[i][j] + ...;
            rhs[2][i][j] = A2[i][j] + B2[i][j] + ...;
            rhs[3][i][j] = A3[i][j] + B3[i][j] + ...;
        }
    }
    // (2) perform Euler step (or Runge-Kutta, ...)
    for (int d = 0; d < 4; d++) {
        for (int i = 0; i < nr; i++) {
            for (int j = 0; j < ntheta; j++) {
                next[d][i][j] = current[d][i][j] + time_step * rhs[d][i][j];
            }
        }
    }
}
I thought this code should be fairly easy to parallelise... I put "#pragma omp parallel for" in front of the (1) and (2) loops, and I also specified the number of cores (e.g. 4 cores for loop (2) since there are four variables) but there is simply no speed-up whatsoever.
I have found that OpenMP is fairly smart about when to create/destroy the threads. I.e. it realises that threads are required soon again and then they're only put asleep to save overhead time.
I think one "problem" is that my time step is coded in a subroutine (I'm using RK4 instead of Euler) and the computation of the righthand-side is again in another subroutine that is called by the time_step() function. So, I believe that due to this, OpenMP cannot see that the threads should be kept open for longer and hence the threads are created and destroyed at every time step.
Would it be helpful to put a "#pragma omp parallel" in front of the time-loop so that the threads are created at the very beginning? And then do the actual parallelisation for the righthand-side (1) and the Euler step (2)? But how do I do that?
I have found numerous examples of how to parallelise nested for loops, but none of them were concerned with a setup where the inner loops have been moved out into separate functions. Would this be an obstacle to parallelising?
I have now removed the d loops (by making the indices explicit) and collapsed the i and j loops (by running over the entire 2D array with one variable only).
The code looks like:
for (int t = 0; t < max_time_steps; t++) {
    // do some book-keeping
    // perform time step
    // (1) calculate righthand-side of ODE:
    #pragma omp parallel for
    for (int i = 0; i < nr*ntheta; i++) {
        rhs[0][0][i] = A0[0][i] + B0[0][i] + ...;
        rhs[1][0][i] = A1[0][i] + B1[0][i] + ...;
        rhs[2][0][i] = A2[0][i] + B2[0][i] + ...;
        rhs[3][0][i] = A3[0][i] + B3[0][i] + ...;
    }
    // (2) perform Euler step (or Runge-Kutta, ...)
    #pragma omp parallel for
    for (int i = 0; i < nr*ntheta; i++) {
        next[0][0][i] = current[0][0][i] + time_step * rhs[0][0][i];
        next[1][0][i] = current[1][0][i] + time_step * rhs[1][0][i];
        next[2][0][i] = current[2][0][i] + time_step * rhs[2][0][i];
        next[3][0][i] = current[3][0][i] + time_step * rhs[3][0][i];
    }
}
The size of nr*ntheta is 400*40=1600 and I make max_time_steps=1000 time steps. Still, the parallelisation does not result in a speed-up:
Runtime without OpenMP (result of time on the command line):
real 0m23.597s
user 0m23.496s
sys 0m0.076s
Runtime with OpenMP (24 cores)
real 0m23.162s
user 7m47.026s
sys 0m0.905s
I do not understand what's happening here.
One peculiarity that I don't show in my code snippet above is that my variables are not actually doubles but a self-defined struct of two doubles which resemble real and imaginary part. But I think this should not make a difference.
Just wanted to report some success after I left the parallelisation alone for a while. The code evolved for a year and now I went back to parallelisation. This time, I can say that OpenMP does its job and reduces the required walltime.
While the code evolved overall, this particular loop that I've shown above did not really change; merely two things: a) The resolution is higher so that it covers about 10 times as many points and b) the number of calculations per loop also is about 10-fold (maybe even more).
My only explanation why it works now and didn't work a little over a year ago, is that, when I tried to parallelise the code last time, it wasn't computationally expensive enough and the speed-up was killed by the OpenMP overhead. One single loop now requires about 200-300ms whereas that time required must have been in the single digit ms last time.
I can see such effect when comparing gcc and the Intel compiler (which are doing a very different job when vectorizing):
a) Using gcc, one loop needs about 300ms without OpenMP, and on two cores only 52% of the time is required --> near perfect optimization.
b) Using icpc, one loop needs about 160ms without OpenMP, and on two cores it needs 60% of the time --> good optimization but about 20% less effective.
When going for more than two cores, the speed-up is not large enough to make it worthwhile.
I am wondering about the execution order of threads in OpenGL.
Say I have a mobile GPU, which often has between 8 and 32 cores (e.g. ARM Mali). That means its execution units are organised differently from NVIDIA warps (AMD wavefronts).
The reason I am asking is because of following toy example
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
shared float a[16];
void main() {
    uint tid = gl_GlobalInvocationID.x; // <-- thread id
    // set all a to 0
    if (tid < 16) {
        a[tid] = 0;
    }
    a[tid % 16] += 1;
    float b = 0;
    b = REDUCE(a); // <-- reduction of array a
}
It happens that b is different from execution to execution (glDispatchCompute(1, 100, 1)) as if there is some race condition.
I am not sure whether threads within a work group are really concurrent (like warps in a streaming multiprocessor).
Also how many cores are mapped to work groups/shaders?
What are your thoughts about that? Thanks
It happens that b is different from execution to execution (glDispatchCompute(1, 100, 1)) as if there is some race condition.
That's because there is one:
a[tid % 16] += 1;
For a workgroup with a local size of 256, there will be at least two invocations in that workgroup that have the same value of tid % 16. Therefore, those invocations will attempt to manipulate the same index of a.
Since there are no barriers or any other mechanism to prevent this, this is a race condition on the elements of a. And therefore, you get undefined behavior.
Now, you could manipulate a through atomic operations:
atomicAdd(a[tid % 16], 1);
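That is well-defined behavior, provided a is declared as an int or uint array: core GLSL's atomicAdd is only defined for integer types.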
I am not sure whether threads within a work group are really concurrent (like warps in a streaming multiprocessor).
This is irrelevant. You must treat them as if they are executed concurrently.
Also how many cores are mapped to work groups/shaders?
Again, essentially irrelevant. This matters in terms of performance, but that's mainly about how big to make your local group size. But in terms of whether your code works or not, it doesn't matter.
I am trying to multithread a piece of code using the boost library. The problem is that each thread has to access and modify a couple of global variables. I am using a mutex to lock the shared resources, but the program ends up taking more time than when it was not multithreaded. Any advice on how to optimize the shared access?
Thanks a lot!
In the example below, the *choose_ecount* variable has to be locked, and I cannot take it out of the loop and lock it for only an update at the end of the loop because it is needed with the newest values by the inside function.
for(int sidx = startStep; sidx <= endStep && sidx < d.sents[lang].size(); sidx ++){
    sentence s = d.sents[lang][sidx];
    int senlen = s.words.size();
    int end_symb = s.words[senlen-1].pos;
    inside(s, lbeta);
    outside(s,lbeta, lalpha);
    long double sen_prob = lbeta[senlen-1][F][NO][0][senlen-1];
    if (lambda[0] == 0){
        d.sents[lang][sidx].prob = sen_prob;
    }
    for(int size = 1; size <= senlen; size++)
        for(int i = 0; i <= senlen - size ; i++)
        {
            int j = i + size - 1;
            for(int k = i; k < j; k++)
            {
                int hidx = i; int head = s.words[hidx].pos;
                for(int r = k+1; r <=j; r++)
                {
                    int aidx = r; int arg = s.words[aidx].pos;
                    for(int kids = ONE; kids <= MAX; kids++)
                    {
                        long double num = lalpha[hidx][R][kids][i][j] * get_choose_prob(s, hidx, aidx) *
                            lbeta[hidx][R][kids - 1][i][k] * lbeta[aidx][F][NO][k+1][j];
                        long double gen_right_prob = (num / sen_prob);
                        choose_ecount[lang][head][arg] += gen_right_prob; //LOCK
                        order_ecount[lang][head][arg][RIGHT] += gen_right_prob; //LOCK
                    }
                }
            }
        }
}
From the code you have posted I can see only writes to choose_ecount and order_ecount. So why not use local per thread buffers to compute the sum and then add them up after the outermost loop and only sync this operation?
If you need to access the intermediate values of choose_ecount, how do you ensure the correct intermediate value is present? One thread might have finished 2 iterations of its loop in the meantime, producing different results in another thread.
It kind of sounds like you need to use a barrier for your computation instead.
It's unlikely you're going to get acceptable performance using a mutex in an inner loop. Concurrent programming is difficult, not just for the programmer but also for the computer. A large portion of the performance of modern CPUs comes from being able to treat blocks of code as sequences independent of external data. Algorithms that are efficient for single-threaded execution are often unsuitable for multi-threaded execution.
You might want to have a look at boost::atomic, which can provide lock-free synchronization, but the memory barriers required for atomic operations are still not free, so you may still run into problems, and you will probably have to re-think your algorithm.
I guess that you divide your complete problem into chunks ranging from startStep to endStep to get processed by each thread.
Since you have that locked mutex there, you're effectively serializing all threads:
You divide your problem into some chunks which are processed serially, in an unspecified order.
That is, the only thing you gain is the overhead of multithreading.
Since you're operating on doubles, using atomic operations is not a choice for you: they're typically implemented for integral types only.
The only possible solution is to follow Kratz' suggestion to have a copy of choose_ecount and order_ecount for each thread and reduce them to a single one after your threads have finished.
I have the following C++ code:
const int N = 1000000;
int id[N]; //Value can range from 0 to 9
float value[N];
// load id and value from an external source...
int size[10] = { 0 };
float sum[10] = { 0 };
for (int i = 0; i < N; ++i)
{
    ++size[id[i]];
    sum[id[i]] += value[i];
}
How should I optimize the loop?
I considered using SSE to add every 4 floats to a sum and then after N iterations, the sum is just the sum of the 4 floats in the xmm register but this doesn't work when the source is indexed like this and needs to write out to 10 different arrays.
This kind of loop is very hard to optimize using SIMD instructions. Not only isn't there an easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"), even if there was, this particular loop still has the problem that you might have two values that map to the same id in one SIMD register, e.g. when
id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0
in this case, the obvious approach (pseudocode here)
x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);
won't work either!
You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:
int size[10] = { 0 }, size2[10] = { 0 };
float sum[10] = { 0 }, sum2[10] = { 0 };
for (int i = 0; i < N/2; i++) {
    int id0 = id[i*2+0], id1 = id[i*2+1];
    ++size[id0];
    ++size2[id1];
    sum[id0] += value[i*2+0];
    sum2[id1] += value[i*2+1];
}
// if N was odd, process last element (index N-1)
if (N & 1) {
    ++size[id[N-1]];
    sum[id[N-1]] += value[N-1];
}
// add partial sums together
for (int i = 0; i < 10; i++) {
    size[i] += size2[i];
    sum[i] += sum2[i];
}
Whether this helps or not depends on the target CPU though.
Well, you are reading id[i] twice in your loop. You could store it in a variable, or a register int if you wanted to (note that register is deprecated since C++11 and was removed in C++17).
register int index;
for(int i = 0; i < N; ++i)
{
    index = id[i];
    ++size[index];
    sum[index] += value[i];
}
The MSDN docs state this about register:
The register keyword specifies that the variable is to be stored in a machine register..
Microsoft Specific
The compiler does not accept user requests for register variables; instead, it makes its own register choices when global register-allocation optimization (/Oe option) is on. However, all other semantics associated with the register keyword are honored.
Something you can do is to compile it with the -S flag (or equivalent if you aren't using gcc) and compare the various assembly outputs using -O, -O2, and -O3 flags. One common way to optimize a loop is to do some degree of unrolling, for (a very simple, naive) example:
int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
    index = 2 * i;
    sum[id[index]] += value[index];
    sum[id[index+1]] += value[index+1];
}
which will cut the number of cmp instructions in half. However, any half-decent optimizing compiler will do this for you.
Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.
Do not optimise until you KNOW where the bottleneck is.
Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk then the fractions of a second spent on processing the list are immaterial in the grander scheme of things. Let's say it takes 10 seconds to load and 1 second to process:
You optimise the processing loop so it takes 0 seconds (almost impossible, but it's to illustrate a point) and it is STILL taking 10 seconds. 11 seconds really isn't that bad a performance hit, and you would be better off focusing your optimisation time on the actual data load, as this is far more likely to be the slow part.
In fact it can be quite optimal to do double-buffered data loads, i.e. you load buffer 0, then you start the load of buffer 1. While buffer 1 is loading you process buffer 0. When finished, start the load of the next buffer while processing buffer 1, and so on. This way you can completely amortise the cost of processing.
Further edit: In fact your best optimisation would probably come from loading things into a set of buckets that eliminate the "id[i]" part of the calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you have at least a triple core machine, process the whole data in a 10th of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.
Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win as long as you were pre-fetching a few loop iterations before you needed the data.
See here (Page 95 in the PDF) for more info from Intel.
This computation is trivially parallelizable; just add
#pragma omp parallel for reduction(+:size[:10],sum[:10]) schedule(static)
immediately above the loop if you have OpenMP support (-fopenmp in GCC); reductions over arrays in C/C++ require OpenMP 4.5 or later. However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:
for (i = 0, j = 0, k = 0; j < 10; sum[j] += tmp, j++)
    for (k += size[j], tmp = 0; i < k; i++)
        tmp += value[i];