Bad optimization of std::fabs()? - c++

Recently I was working on an application that had code similar to:
for (auto x = 0; x < width - 1 - left; ++x)
{
    // store / reset points
    temp = hPoint = 0;
    for (int channel = 0; channel < audioData.size(); channel++)
    {
        if (peakmode) /* find rms of window size */
        {
            for (int z = 0; z < sizeFactor; z++)
            {
                temp += audioData[channel][x * sizeFactor + z + offset];
            }
            hPoint += temp / sizeFactor;
        }
        else /* highest sample in window */
        {
            for (int z = 0; z < sizeFactor; z++)
            {
                temp = audioData[channel][x * sizeFactor + z + offset];
                if (std::fabs(temp) > std::fabs(hPoint))
                    hPoint = temp;
            }
        }
        .. some other code
    }
    ... some more code
}
This is inside a graphical render loop, called some 50-100 times / sec with buffers up to 192kHz in multiple channels. So it's a lot of data running through the innermost loops, and profiling showed this was a hotspot.
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries. It looked something like this:
if ((const float &&)(*((int *)&temp) & ~0x80000000) > (const float &&)(*((int *)&hPoint) & ~0x80000000))
hPoint = temp;
This gave a 12x reduction in render time, while still producing the same, valid output. Note that everything in the audiodata is sanitized beforehand to not include nans/infs/denormals, and only have a range of [-1, 1].
Are there any corner cases where this optimization will give wrong results - or, why is the standard library function not implemented like this? I presume it has to do with handling of non-normal values?
Edit: the floating-point model conforms to IEEE, and sizeof(float) == sizeof(int) == 4

Well, you set the floating-point model to IEEE-conforming. Typically, with switches like -ffast-math the compiler can ignore IEEE corner cases like NaN, INF and denormals. If the compiler also uses intrinsics, it can probably emit the same code.
BTW, if you're going to assume IEEE format, there's no need for the cast back to float prior to the comparison. The IEEE format is nifty: for all positive finite values, a<b if and only if reinterpret_cast<int_type>(a) < reinterpret_cast<int_type>(b)
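To illustrate, a minimal sketch of that idea, assuming sanitized finite inputs as in the question; abs_bits and abs_greater are hypothetical helper names, and std::memcpy is used instead of a pointer cast to sidestep the aliasing issue discussed in the next answer:
#include <cstdint>
#include <cstring>

// Bit pattern of a float with the sign bit cleared. memcpy is legal here, and a
// decent optimizer compiles it down to a single register move.
inline std::uint32_t abs_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u & 0x7FFFFFFFu;
}

// For finite, non-NaN values, |a| > |b| exactly when abs_bits(a) > abs_bits(b),
// so the comparison can stay in the integer domain -- no cast back to float.
inline bool abs_greater(float a, float b) {
    return abs_bits(a) > abs_bits(b);
}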

It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries.
No, you can't, because this violates the strict aliasing rule.
Are there any corner cases where this optimization will give wrong results
Technically, this code results in undefined behavior, so it always gives wrong "results". Not in the sense that the result of the absolute value will always be unexpected or incorrect, but in the sense that you can't possibly reason about what a program does if it has undefined behavior.
or, why is the standard library function not implemented like this?
Your suspicion is justified: handling denormals and other exceptional values is tricky, and the standard library function needs to take those into account. The other reason is, again, the undefined behavior.
One (non-)solution if you care about performance:
Instead of casting and pointers, you can use a union. Unfortunately, that only works in C, not in C++. It won't result in UB, but it's still not portable (although it will likely work on most, if not all, platforms with IEEE-754).
#include <stdio.h>

int main(void)
{
    union {
        float f;
        unsigned u;
    } pun = { .f = -3.14 };
    pun.u &= ~0x80000000;
    printf("abs(-pi) = %f\n", pun.f);
    return 0;
}
But, granted, this may or may not be faster than calling fabs(). Only one thing is sure: it won't be always correct.
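A C++-legal variant of the same pun goes through std::memcpy (or std::bit_cast in C++20), which is allowed to copy object representations; a minimal sketch, with fast_fabs as a hypothetical name, assuming IEEE-754 binary32 and sizeof(float) == sizeof(uint32_t) as stated in the question:
#include <cstdint>
#include <cstring>

inline float fast_fabs(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);   // inspect the object representation legally
    u &= 0x7FFFFFFFu;                // clear the sign bit
    std::memcpy(&f, &u, sizeof f);
    return f;
}
Like the union version, this is only as correct as the IEEE assumption; it merely avoids the undefined behavior.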

You would expect fabs() to be implemented in hardware. There was an 8087 instruction for it in 1980 after all. You're not going to beat the hardware.

How the standard library function is implemented is... well, implementation-dependent, so you may find different implementations of the standard library with different performance.
I imagine that you could have problems on platforms where int is not 32 bits. You'd better use int32_t (<cstdint>).
Out of curiosity, was std::fabs previously inlined? Or is the optimisation you observed mainly due to eliminating the function call?

Some observations on how refactoring may improve performance (a sketch applying several of them follows this list):
as mentioned, x * sizeFactor + offset can be factored out of the inner loops
peakmode is actually a switch changing the function's behaviour - make two functions rather than test the switch mid-loop. This has 2 benefits:
easier to maintain
fewer local variables and code paths to get in the way of optimisation.
The division of temp by sizeFactor can be deferred until outside the channel loop in the peakmode version.
abs(hPoint) can be pre-computed whenever hPoint is updated
if audioData is a vector of vectors you may get some performance benefit by taking a reference to audioData[channel] at the start of the body of the channel loop, reducing the array indexing within the z loop down to one dimension.
finally, apply whatever specific optimisations for the calculation of fabs you deem fit. Anything you do here will hurt portability so it's a last resort.
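Putting several of those observations together, here is a rough sketch of the "highest sample in window" path only (per the two-functions suggestion); renderPeaks is a hypothetical wrapper, the element type is assumed to be float, and the elided code from the question is marked with comments:
#include <cstddef>
#include <vector>

void renderPeaks(const std::vector<std::vector<float>>& audioData,
                 int width, int left, int sizeFactor, int offset)
{
    for (int x = 0; x < width - 1 - left; ++x)
    {
        float hPoint = 0.0f;
        float absHPoint = 0.0f;                     // abs(hPoint), kept up to date
        const int base = x * sizeFactor + offset;   // hoisted out of the inner loops
        for (std::size_t channel = 0; channel < audioData.size(); ++channel)
        {
            const std::vector<float>& channelData = audioData[channel]; // one indexing level
            for (int z = 0; z < sizeFactor; ++z)
            {
                const float temp = channelData[base + z];
                const float absTemp = temp < 0.0f ? -temp : temp;
                if (absTemp > absHPoint)
                {
                    hPoint = temp;
                    absHPoint = absTemp;
                }
            }
            // .. some other code (per channel)
        }
        // ... some more code (per column)
    }
}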

In VS2008, using the following to track the absolute value in hpoint, with hIsNeg remembering whether it is positive or negative, is about twice as fast as using fabs():
int hIsNeg = 0;
...
// Inside loop, replacing
//     if (std::fabs(temp) > std::fabs(hPoint))
//         hPoint = temp;
if (temp < 0)
{
    if (-temp > hpoint)
    {
        hpoint = -temp;
        hIsNeg = 1;
    }
}
else
{
    if (temp > hpoint)
    {
        hpoint = temp;
        hIsNeg = 0;
    }
}
...
// After loop
if (hIsNeg)
    hpoint = -hpoint;

Related

Looking for a way to speed up a function

I'm trying to speed up a big block of code across many files and found out that one function uses about 70% of the total time. This is because this function is called 477+ million times.
The pointer array par can only be one of two presets, either
par[0] = 0.057;
par[1] = 2.87;
par[2] = -3.;
par[3] = -0.03;
par[4] = -3.05;
par[5] = -3.5;
OR
par[0] = 0.043;
par[1] = 2.92;
par[2] = -3.21;
par[3]= -0.065;
par[4] = -3.00;
par[5] = -2.65;
So I've tried plugging in numbers depending on which preset it is but have failed to find any significant time saves.
The pow and exp functions seem to be called about every time and they take up about 40 and 20 percent of the total time respectively, so only 10% of the total time is used by the parts of this function that aren't pow or exp. Finding ways to speed those up would probably be the best but none of the exponents used in pow are integers except -4 and I don't know if 1/(x*x*x*x) is faster than pow(x, -4).
double Signal::Param_RE_Tterm_approx(double Tterm, double *par) {
double value = 0.;
// time after Che angle peak
if (Tterm > 0.) {
if ( fabs(Tterm/ *par) >= 1.e-2) {
value += -1./(*par)*exp(-1.*Tterm/(*par));
}
else {
value += -1./par[0]*(1. - Tterm/par[0] + Tterm*Tterm/(par[0]*par[0]*2.) - Tterm*Tterm*Tterm/(par[0]*par[0]*par[0]*6.) );
}
if ( fabs(Tterm* *(par+1)) >= 1.e-2) {
value += *(par+2)* *(par+1)*pow( 1.+*(par+1)*Tterm, *(par+2)-1. );
}
else {
value += par[2]*par[1]*( 1.+(par[2]-1.)*par[1]*Tterm + (par[2]-1.)*(par[2]-1.-1.)/2.*par[1]*par[1]*Tterm*Tterm + (par[2]-1.)*(par[2]-1.-1.)*(par[2]-1.-2.)/6.*par[1]*par[1]*par[1]*Tterm*Tterm*Tterm );
}
}
// time before Che angle peak
else {
if ( fabs(Tterm/ *(par+3)) >= 1.e-2 ) {
value += -1./ *(par+3) *exp(-1.*Tterm/ *(par+3));
}
else {
value += -1./par[3]*(1. - Tterm/par[3] + Tterm*Tterm/(par[3]*par[3]*2.) - Tterm*Tterm*Tterm/(par[3]*par[3]*par[3]*6.) );
}
if ( fabs(Tterm* *(par+4)) >= 1.e-2 ) {
value += *(par+5)* *(par+4) *pow( 1.+ *(par+4)*Tterm, *(par+5)-1. );
}
else {
value += par[5]*par[4]*( 1.+(par[5]-1.)*par[4]*Tterm + (par[5]-1.)*(par[5]-1.-1.)/2.*par[4]*par[4]*Tterm*Tterm + (par[5]-1.)*(par[5]-1.-1.)*(par[5]-1.-2.)/6.*par[4]*par[4]*par[4]*Tterm*Tterm*Tterm );
}
}
return value * 1.e9;
}
I first rewrote it to be a bit easier to follow:
#include <math.h>
double Param_RE_Tterm_approx(double Tterm, double const* par) {
double value = 0.;
if (Tterm > 0.) {
// time after Che angle peak
if ( fabs(Tterm/ par[0]) >= 1.e-2) {
value += -1./(par[0])*exp(-1.*Tterm/(par[0]));
} else {
value += -1./par[0]*(1. - Tterm/par[0] + Tterm*Tterm/(par[0]*par[0]*2.) - Tterm*Tterm*Tterm/(par[0]*par[0]*par[0]*6.) );
}
if ( fabs(Tterm* par[1]) >= 1.e-2) {
value += par[2]* par[1]*pow( 1.+par[1]*Tterm, par[2]-1. );
} else {
value += par[2]*par[1]*( 1.+(par[2]-1.)*par[1]*Tterm + (par[2]-1.)*(par[2]-1.-1.)/2.*par[1]*par[1]*Tterm*Tterm + (par[2]-1.)*(par[2]-1.-1.)*(par[2]-1.-2.)/6.*par[1]*par[1]*par[1]*Tterm*Tterm*Tterm );
}
} else {
// time before Che angle peak
if ( fabs(Tterm/ par[3]) >= 1.e-2 ) {
value += -1./ par[3] *exp(-1.*Tterm/ par[3]);
} else {
value += -1./par[3]*(1. - Tterm/par[3] + Tterm*Tterm/(par[3]*par[3]*2.) - Tterm*Tterm*Tterm/(par[3]*par[3]*par[3]*6.) );
}
if ( fabs(Tterm* par[4]) >= 1.e-2 ) {
value += par[5]* par[4] *pow( 1.+ par[4]*Tterm, par[5]-1. );
} else {
value += par[5]*par[4]*( 1.+(par[5]-1.)*par[4]*Tterm + (par[5]-1.)*(par[5]-1.-1.)/2.*par[4]*par[4]*Tterm*Tterm + (par[5]-1.)*(par[5]-1.-1.)*(par[5]-1.-2.)/6.*par[4]*par[4]*par[4]*Tterm*Tterm*Tterm );
}
}
return value * 1.e9;
}
We can then look at its structure.
There are two main branches -- Tterm negative (before) and positive (after). These correspond to using 0,1,2 or 3,4,5 in the par array.
Then in each case we do two things to add to value. In both cases, for small cases we use a polynomial, and for big cases we use an exponential/power equation.
As a guess, this is because the polynomial is a decent approximation for the exponential for small values -- the error is acceptable. What you should do is confirm that guess -- take a look at the Taylor series expansion of the "big" power/exponent based equation, and see if it agrees with the polynomials somehow. Or check numerically.
If it is the case, this means that this equation has a known amount of error that is acceptable. Quite often there are faster versions of exp or pow that have a known amount of max error; consider using those.
If this isn't the case, there still could be an acceptable amount of error, but the Taylor series approximation can give you "in code" information about what is an acceptable amount of error.
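A minimal numeric check of that guess for the first term, using par[0] from the first preset in the question and a Tterm just below the 1e-2 cutoff:
#include <cmath>
#include <cstdio>

int main()
{
    const double p0 = 0.057;           // par[0] from the first preset
    const double Tterm = 0.009 * p0;   // just below the |Tterm / par[0]| >= 1e-2 threshold

    // "big" branch: exponential form
    const double exact = -1.0 / p0 * std::exp(-Tterm / p0);
    // "small" branch: the 3rd-order polynomial from the question
    const double approx = -1.0 / p0 * (1.0 - Tterm / p0
                                       + Tterm * Tterm / (p0 * p0 * 2.0)
                                       - Tterm * Tterm * Tterm / (p0 * p0 * p0 * 6.0));

    std::printf("exact = %.12g  approx = %.12g  rel. err = %.3g\n",
                exact, approx, std::fabs((approx - exact) / exact));
    return 0;
}
If the relative error near the cutoff is acceptable here, a faster approximate exp/pow with a comparable error bound is a reasonable swap.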
A next step I'd take is to tear the 8 pieces of this equation apart. There is positive/negative, the first and second value+= in each branch, and then the polynomial/exponential case.
I'm guessing the fact that exp is taking ~1/3 the time of pow is because you have 3 calls to pow to 1 call to exp in your function, but you might find out something interesting like "all of our time is actually in the Tterm > 0. case" or what have you.
Now examine call sites. Is there a pattern in the Tterm you are passing this function? Ie, do you tend to pass Tterms in roughly sorted order? If so, you can do the test for which function to call outside of calling this function, and do it in batches.
Simply doing it in batches and compiling with optimization and inlining the bodies of the functions might make a surprising amount of difference; compilers are getting better at vectorizing work.
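As a sketch of what "doing it in batches" can look like (batch_approx is a hypothetical driver around the rewritten function above; with the body visible and inlined, the compiler gets a chance to vectorize the whole loop):
#include <cstddef>
#include <vector>

std::vector<double> batch_approx(const std::vector<double>& tterms, const double* par)
{
    std::vector<double> out(tterms.size());
    for (std::size_t i = 0; i < tterms.size(); ++i)
        out[i] = Param_RE_Tterm_approx(tterms[i], par);   // the rewrite shown earlier
    return out;
}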
If that doesn't work, you can start threading things off. On a modern computer you can have 4-60 threads solving this problem independently, and this problem looks like you'd get nearly linear speedup. A basic threading library, like TBB, would be good for this kind of task.
For the next step up, if you are getting large batches of data and you need to do a lot of processing, you can stuff it onto a GPU and solve it there. Sadly, GPU<->RAM bandwidth is limited, so simply doing the math in this function on the GPU and reading/writing back and forth with RAM won't give you much if any performance. But if more work than just this can go on the GPU, it might be worth it.
Only 10% of the total time is used by the parts of this function that aren't pow or exp.
If your function's performance bottleneck is exp() and pow() execution, consider using vector instructions in your calculations. All modern processors support at least the SSE2 instruction set, so this approach can easily give a ~2x speed-up, because your calculation vectorizes well.
I recommend using this C++ vectorization library, which contains all the standard mathematical functions (such as exp and pow) and lets you write code in an OOP style without resorting to assembly. I have used it several times, and it should work well for your problem.
If you have a GPU, you should also consider trying the CUDA framework because, again, your problem vectorizes perfectly. Moreover, if this function is called 477+ million times, a GPU can make short work of it...
(Partial optimization:)
The longest expression has
Common subexpressions
Polynomial evaluated the costly way.
Pre-define these (perhaps add them to par[]):
a = par[5]*par[4];
b = (par[5]-1.);
c = b*(par[5]-2.)/2.;
d = c*(par[5]-3.)/3.;
Then, for example, the longest expression becomes:
e = par[4]*Tterm;
value += a*(((d*e + c)*e + b)*e + 1.);
And simplify the rest.
If the expressions are curve-fitting approximations, why not do the same with
value += -1./(*par)*exp(-1.*Tterm/(*par));
You should also ask whether all 477M iterations are needed.
If you want to explore batching and more optimization opportunities, such as fusing in computations that depend on these values, try using Halide.
I've rewritten your program in Halide here:
#include <Halide.h>
using namespace Halide;
class ParamReTtermApproxOpt : public Generator<ParamReTtermApproxOpt>
{
public:
Input<Buffer<float>> tterm{"tterm", 1};
Input<Buffer<float>> par{"par", 1};
Input<int> ncpu{"ncpu"};
Output<Buffer<float>> output{"output", 1};
Var x;
Func par_inv;
void generate() {
// precompute 1 / par[x]
par_inv(x) = fast_inverse(par(x));
// after che peak
Expr after_che_peak = tterm(x) > 0;
Expr first_term = -par_inv(0) * fast_exp(-tterm(x) * par_inv(0));
Expr second_term = par(2) * par(1) * fast_pow(1 + par(1) * tterm(x), par(2) - 1);
// before che peak
Expr third_term = -par_inv(3) * fast_exp(-tterm(x) * par_inv(3));
Expr fourth_term = par(5) * par(4) * fast_pow(1 + par(4) * tterm(x), par(5) - 1);
// final value
output(x) = 1.e9f * select(after_che_peak, first_term + second_term,
third_term + fourth_term);
}
void schedule() {
par_inv.bound(x, 0, 6);
par_inv.compute_root();
Var xo, xi;
// break x into two loops, one for ncpu tasks
output.split(x, xo, xi, output.extent() / ncpu)
// mark the task loop parallel
.parallel(xo)
// vectorize each thread's computation for 8-wide vector lanes
.vectorize(xi, 8);
output.print_loop_nest();
}
};
HALIDE_REGISTER_GENERATOR(ParamReTtermApproxOpt, param_re_tterm_approx_opt)
I can run 477,000,000 iterations in slightly over one second on my Surface Book (with ncpu=4). Batching is hugely important here since it enables vectorization.
Note that the equivalent program written using double arithmetic is much slower (20x) than float arithmetic, though Halide doesn't supply fast_ versions for doubles, so this might not be quite apples-to-apples. Regardless, I would check whether you need the extra precision.

Why is my C++ code three times slower than the C equivalent on LeetCode? [closed]

I've been doing some of the LeetCode problems, and I notice that the C solutions are a couple of times faster than the exact same thing in C++. For example:
Updated with a couple of simpler examples:
Given a sorted array and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order. You may assume no duplicates in the array. (Link to question on LeetCode)
My solution in C, runs in 3 ms:
int searchInsert(int A[], int n, int target) {
int left = 0;
int right = n;
int mid = 0;
while (left<right) {
mid = (left + right) / 2;
if (A[mid]<target) {
left = mid + 1;
}
else if (A[mid]>target) {
right = mid;
}
else {
return mid;
}
}
return left;
}
My C++ solution, exactly the same but as a member function of the Solution class, runs in 13 ms:
class Solution {
public:
int searchInsert(int A[], int n, int target) {
int left = 0;
int right = n;
int mid = 0;
while (left<right) {
mid = (left + right) / 2;
if (A[mid]<target) {
left = mid + 1;
}
else if (A[mid]>target) {
right = mid;
}
else {
return mid;
}
}
return left;
}
};
Even simpler example:
Reverse the digits of an integer. Return 0 if the result will overflow. (Link to question on LeetCode)
The C version runs in 6 ms:
int reverse(int x) {
long rev = x % 10;
x /= 10;
while (x != 0) {
rev *= 10L;
rev += x % 10;
x /= 10;
if (rev>(-1U >> 1) || rev < (1 << 31)) {
return 0;
}
}
return rev;
}
And the C++ version, exactly the same but as a member function of the Solution class, runs in 19 ms:
class Solution {
public:
int reverse(int x) {
long rev = x % 10;
x /= 10;
while (x != 0) {
rev *= 10L;
rev += x % 10;
x /= 10;
if (rev>(-1U >> 1) || rev < (1 << 31)) {
return 0;
}
}
return rev;
}
};
I see how there would be considerable overhead from using a vector of vectors as a 2D array in the original example if the LeetCode testing system doesn't compile the code with optimisation enabled. But the simpler examples above shouldn't suffer from that issue because the data structures are pretty raw, especially in the second case where all you have is long and int arithmetic. That's still slower by a factor of three.
I'm starting to think that there might be something odd happening with the way LeetCode do the benchmarking in general because even in the C version of the integer reversing problem you get a huge bump in running time from just replacing the line
if (rev>(-1U >> 1) || rev < (1 << 31)) {
with
if (rev>INT_MAX || rev < INT_MIN) {
Now, I suppose having to #include<limits.h> might have something to do with that but it seems a bit extreme that this simple change bumps the execution time from just 6 ms to 19 ms.
Lately I've been seeing the vector<vector<int>> suggestion a lot for doing 2d arrays in C++, and I've been pointing out to people why this really isn't a good idea. It's a handy trick to know when slapping together temporary code, but there's (almost) never a reason to use it for real code. The right thing to do is to use a class that wraps a contiguous block of memory.
So my first reaction might be to point to this as a possible source for the disparity. However, you're also using int** in the C version, which is generally a sign of the exact same problem as vector<vector<int>>.
So instead I decided to just compare the two solutions.
http://coliru.stacked-crooked.com/a/fa8441cc5baa0391
6468424
6588511
That's the time taken by the 'C version' vs the 'C++ version' in nanoseconds.
My results don't show anything like the disparity you describe. Then it occurred to me to check a common mistake people make when benchmarking
http://coliru.stacked-crooked.com/a/e57d791876b9252b
18386695
42400612
Notice that the -O3 flag from the first example has become -O0, which disables optimization.
Conclusion: you're probably comparing unoptimized executables.
C++ supports building rich abstractions that don't require overhead, but eliminating the overhead does require certain code transformations that play havoc with the 'debuggability' of code.
That means debug builds avoid those transformations and therefore C++ debug builds are often slower than debug builds of C style code because C style code just doesn't use much abstraction. Seeing a 130% slowdown such as the above is not at all surprising when timing, for example, machine code that uses function calls in place of simple store instructions.
Some code really needs optimizations in order to have reasonable performance even for debugging, so compilers often offer a mode that applies some optimizations which don't cause too much trouble for debuggers. Clang and gcc use -O1 for this, and you can see that even this level of optimization essentially eliminates the gap in this program between C style code and the more C++ style code:
http://coliru.stacked-crooked.com/a/13967ebcfcfa4073
8389992
8196935
Update:
In those later examples optimization shouldn't make a difference, since the C++ is not using any abstraction beyond what the C version is doing. I'm guessing that the explanation for this is that the examples are being compiled with different compilers or with some other different compiler options. Without knowing how the compilation is done I would say it makes no sense to compare these runtime numbers; LeetCode is clearly not producing an apples to apples comparison.
You are using a vector of vectors in your C++ code snippet. Vectors are sequence containers in C++ that behave like arrays but can change in size. Instead of vector<vector<int>>, statically allocated arrays would be better. You could also use your own Array class with operator[] overloaded, but vector has more overhead because it resizes dynamically when you add more elements than its original capacity. In C++ you can also pass by reference to cut down on copying. Well-written C++ should run just as fast, if not faster.
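For reference, a minimal sketch of such a wrapper around one contiguous block (Grid is a hypothetical name); it keeps the convenient indexing without the per-row allocations of vector<vector<int>>:
#include <cstddef>
#include <vector>

class Grid
{
public:
    Grid(std::size_t rows, std::size_t cols)
        : cols_(cols), data_(rows * cols) {}

    int& operator()(std::size_t r, std::size_t c)       { return data_[r * cols_ + c]; }
    int  operator()(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }

private:
    std::size_t cols_;
    std::vector<int> data_;   // one allocation, rows stored back to back
};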

Is continue instant?

In the following two code snippets, is there actually any difference in the speed of compiling or running?
for (int i = 0; i < 50; i++)
{
if (i % 3 == 0)
continue;
printf("Yay");
}
and
for (int i = 0; i < 50; i++)
{
if (i % 3 != 0)
printf("Yay");
}
Personally, in situations where there is a lot more than a print statement, I've been using the first method so as to reduce the amount of indentation for the contained code. I've been wondering for a while, so I figured it was about time to ask whether it actually has an effect beyond the visual one.
Reply to Alf (I couldn't get code working in comments...)
More accurate to my usage is something along the lines of a "handleObjectMovement" function which would include
for each object
if object position is static
continue
deal with velocity and jazz
compared with
for each object
if object position is not static
deal with velocity and jazz
Hence me not using return. Essentially "if it's not relevant to this iteration, move on"
The behaviour is the same, so the runtime speed should be the same unless the compiler does something stupid (or unless you disable optimisation).
It's impossible to say whether there's a difference in compilation speed, since it depends on the details of how the compiler parses, analyses and translates the two variations.
If speed is important, measure it.
If you know which branch of the condition has higher probability you may use GCC likely/unlikely macro
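Those macros are not built into GCC itself; they are usually defined on top of __builtin_expect, roughly like this (hypothetical macro names; C++20 also offers the [[likely]]/[[unlikely]] attributes):
#include <cstdio>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int main()
{
    for (int i = 0; i < 50; i++)
    {
        if (unlikely(i % 3 == 0))   // hint that the skip is the rarer path (illustrative only)
            continue;
        printf("Yay");
    }
    return 0;
}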
How about getting rid of the check altogether?
for (int t = 0; t < 33; t++)
{
int i = t + (t >> 1) + 1;
printf("%d\n", i);
}

Are GLSL compilers well optimized

Are recent GLSL compilers smart/well optimized?
In other words, if I go brainless and write stuff like the following, would recent compilers save my day and optimize away the unnecessary code, or should I always be careful with what I write?
// All of the values are constants
if (3.7 == 3.7) // Will the condition be executed or removed at build time?
x++;
// Will this whole block be entirely removed? (or should I use macros)
if (1 == 2)
x++;
for (i = 0; i < 0; ++i) // Remove this
x++;
for (i = 0; i < varA * varB; ++i) // Compute varA * varB once, outside the loop
x++;
vec3 v = vec3(0);
if (length(v) > 0) // Remove
x++;
float p = mix(varA, varB, 1); // p = varB
float p = 5;
p *= uniform * 0; // Just set p = 0 from the start
float p = 5;
p *= 1; // Remove that
There are a lot more examples I can't think of right now, but you should get the point.
Also, can recent compilers automatically detect less obvious optimizations, like the ones described here:
http://www.opengl.org/wiki/GLSL_Optimizations
Is there a known trade-off between "compilation time" and "time spent optimizing", set by the implementors or the specification?
GLSL compilers are very conservative about floating-point optimizations. You cannot count on any particular optimization being performed. The rule of thumb is: optimize whatever you can and don't hope for any help from the GLSL compiler.
Read Low-Level Thinking in High-Level Shading Languages by Emil Persson for interesting details and case-studies.
P.S: This answer may sound pessimistic. However, GLSL compilers still do a great deal of good work optimizing your code. Just don't count on it and do your best.

Optimize indexed array summation

I have the following C++ code:
const int N = 1000000;
int id[N]; //Value can range from 0 to 9
float value[N];
// load id and value from an external source...
int size[10] = { 0 };
float sum[10] = { 0 };
for (int i = 0; i < N; ++i)
{
++size[id[i]];
sum[id[i]] += value[i];
}
How should I optimize the loop?
I considered using SSE to add every 4 floats to a sum and then after N iterations, the sum is just the sum of the 4 floats in the xmm register but this doesn't work when the source is indexed like this and needs to write out to 10 different arrays.
This kind of loop is very hard to optimize using SIMD instructions. Not only is there no easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"); even if there were, this particular loop still has the problem that you might have two values that map to the same id in one SIMD register, e.g. when
id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0
in this case, the obvious approach (pseudocode here)
x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);
won't work either!
You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:
int size[10] = { 0 }, size2[10] = { 0 };
float sum[10] = { 0 }, sum2[10] = { 0 };
for (int i = 0; i < N/2; i++) {
int id0 = id[i*2+0], id1 = id[i*2+1];
++size[id0];
++size2[id1];
sum[id0] += value[i*2+0];
sum2[id1] += value[i*2+1];
}
// if N was odd, process last element
if (N & 1) {
++size[id[N-1]];
sum[id[N-1]] += value[N-1];
}
// add partial sums together
for (int i = 0; i < 10; i++) {
size[i] += size2[i];
sum[i] += sum2[i];
}
Whether this helps or not depends on the target CPU though.
Well, you are calling id[i] twice in your loop. You could store it in a variable, or a register int if you wanted to.
register int index;
for(int i = 0; i < N; ++i)
{
index = id[i];
++size[index];
sum[index] += value[i];
}
The MSDN docs state this about register:
The register keyword specifies that the variable is to be stored in a machine register.
Microsoft Specific: The compiler does not accept user requests for register variables; instead, it makes its own register choices when global register-allocation optimization (/Oe option) is on. However, all other semantics associated with the register keyword are honored.
Something you can do is to compile it with the -S flag (or equivalent if you aren't using gcc) and compare the various assembly outputs using -O, -O2, and -O3 flags. One common way to optimize a loop is to do some degree of unrolling, for (a very simple, naive) example:
int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
index = 2 * i;
++size[id[index]];
sum[id[index]] += value[index];
index++;
++size[id[index]];
sum[id[index]] += value[index];
}
which will cut the number of cmp instructions in half. However, any half-decent optimizing compiler will do this for you.
Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.
Do not optimise until you KNOW where the bottleneck is.
Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk, then the fractions of a second spent on processing the list are immaterial in the grander scheme of things. Let's say it takes 10 seconds to load and 1 second to process:
You optimise the processing loop so it takes 0 seconds (almost impossible, but it's to illustrate a point); then it is STILL taking 10 seconds. 11 seconds really isn't that bad a performance hit, and you would be better off focusing your optimisation time on the actual data load, as this is far more likely to be the slow part.
In fact it can be quite optimal to do double-buffered data loads: you load buffer 0, then you start the load of buffer 1. While buffer 1 is loading, you process buffer 0. When finished, start the load of the next buffer while processing buffer 1, and so on. This way you can completely amortise the cost of processing.
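A minimal sketch of that double-buffering idea using std::async; loadChunk and processChunk are hypothetical stand-ins for the real file read and the id/sum loop:
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

std::vector<float> loadChunk(int index)                  // pretend this reads one buffer from disk
{
    return std::vector<float>(1000000, 1.0f + index);
}

float processChunk(const std::vector<float>& chunk)      // pretend this is the summation loop
{
    return std::accumulate(chunk.begin(), chunk.end(), 0.0f);
}

int main()
{
    const int nChunks = 4;
    float total = 0.0f;
    auto pending = std::async(std::launch::async, loadChunk, 0);
    for (int i = 0; i < nChunks; ++i)
    {
        std::vector<float> current = pending.get();                       // wait for chunk i
        if (i + 1 < nChunks)
            pending = std::async(std::launch::async, loadChunk, i + 1);   // start loading chunk i+1
        total += processChunk(current);                                   // process while it loads
    }
    std::printf("total = %f\n", total);
    return 0;
}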
Further edit: In fact your best optimisation would probably come from loading things into a set of buckets that eliminates the "id[i]" part of the calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you have at least a triple-core machine, process the whole data in a 10th of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.
Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win as long as you were pre-fetching a few loop iterations before you needed the data.
See here (Page 95 in the PDF) for more info from Intel.
This computation is trivially parallelizable; just add
#pragma omp parallel for reduction(+:size,sum) schedule(static)
immediately above the loop if you have OpenMP support (-fopenmp in GCC.) However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
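Note that reducing into whole arrays in a reduction clause needs a reasonably recent OpenMP; a more portable sketch (reusing N, id, value, size and sum from the question's snippet) merges per-thread partial results by hand:
#pragma omp parallel
{
    int   localSize[10] = { 0 };
    float localSum[10]  = { 0 };

    #pragma omp for nowait
    for (int i = 0; i < N; ++i)
    {
        ++localSize[id[i]];
        localSum[id[i]] += value[i];
    }

    #pragma omp critical
    for (int j = 0; j < 10; ++j)
    {
        size[j] += localSize[j];
        sum[j]  += localSum[j];
    }
}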
If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:
for (i = 0, j = 0, k = 0; j < 10; sum[j] += tmp, j++)
for (k += size[j], tmp = 0; i < k; i++)
tmp += value[i];