C++ - how to improve this method's execution speed

I am trying to improve my programming skills for an assignment that will be released soon; it involves solving a problem while making it run as efficiently and as fast as possible. I know this is a fairly restrained/small piece of code, but what, if anything, would make it run faster?
The method takes an array which holds details of transactions; nTransactions is 100, the number of transactions, and is used to bound the loop. I am getting the average number of shares and then returning it. I'm not fluent in English, so hopefully this makes sense. Thanks.
double Analyser::averageVolume()
{
    // Your code
    double averageNumShares = 0;
    for (int i = 0; i < nTransactions; i++)
    {
        averageNumShares += tArray[i].numShares;
    }
    averageNumShares = averageNumShares / nTransactions;
    return averageNumShares;
    //return 0
}

If you need to compute the average of n numbers, I'm afraid you can't speed it up much past the linear-time approach in your sample code.
Unless this is used as part of another, more complex algorithm where you might be able to get away with not having to compute the average at all (or something along those lines), taking an average is going to be an O(n) operation: it basically involves summing all elements of the array and doing one division by the number of elements. Which is exactly what you have.

Why not have two other values in the object - a running total and the number of items?
Then computing the average can make use of those numbers, quickly and simply (it could even be an inline function!).
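For example, here is a minimal sketch of that idea (the class layout and member names are invented for illustration, not taken from the assignment):
// Hypothetical sketch: keep a running total and a count so the average becomes O(1).
class Analyser {
    double totalShares = 0.0;   // updated once per transaction added
    int nTransactions = 0;

public:
    void addTransaction(double numShares)
    {
        totalShares += numShares;
        ++nTransactions;
    }

    double averageVolume() const   // cheap enough to be an inline function
    {
        return nTransactions ? totalShares / nTransactions : 0.0;
    }
};
The one-off O(n) cost is simply moved to the point where transactions are added, which is a win if the average is queried more often than the data changes.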

Here is one additional approach, similar to that suggested by Ed Heal, that should be less sensitive to roundoff errors. The roundoff error of the average grows with the size of the accumulated sum. This may or may not be an issue for you, but it is something to be aware of.
Here is an iterative algorithm that minimizes roundoff error in the average, which I first came across in an old edition (circa 1998) of Ross:
double Analyser::averageVolume()
{
    double averageNumShares = 0.0;
    for (int i = 0; i < nTransactions; i++)
    {
        double delta = (tArray[i].numShares - averageNumShares) / (i + 1);
        averageNumShares += delta;
    }
    return averageNumShares;
}
This works by deriving a recursive definition of the average. That is, given samples x[1], ..., x[j], ..., x[N], you can calculate the average of the first M+1 samples from sample x[M+1] and the average of the first M samples:
sum(M) = x[1] + x[2] + ... + x[M]
avg(M) = sum(M)/M and avg(M+1) = sum(M+1)/(M+1)
avg(M+1) - avg(M) = sum(M+1)/(M+1) - sum(M)/M
                  = [ M*sum(M+1) - (M+1)*sum(M) ] / [ M*(M+1) ]
                  = [ M*(x[M+1] + sum(M)) - M*sum(M) - sum(M) ] / [ M*(M+1) ]
                  = [ M*x[M+1] - sum(M) ] / [ M*(M+1) ]
                  = [ x[M+1] - avg(M) ] / (M+1)
thus: avg(M+1) = avg(M) + [ x[M+1] - avg(M) ]/(M+1)
To get a sense of the roundoff error for the two approaches, try computing the average of 10^7 samples, each sample equal to 1035.41. Your original approach returns (on my hardware) an average of 1035.40999988683. The iterative approach above returns the exact average, 1035.41.
Both, unfortunately, are O(N). Your original scheme has N additions and one division. The iterative scheme has N additions, N subtractions, and N divisions, so you pay a bit more for the accuracy.

If you use gcc, change the optimization level.
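For example (the file name here is only a placeholder): g++ -O2 analyser.cpp -o analyser, or -O3, and possibly -march=native if the binary only has to run on your own machine.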

Short answer:
This code is as good as it gets with respect to speed. What you can tweak is how you compile it. Or obviously, rewrite it in assembly if that is an option.
"Stretched" answer:
Now... if you really want to try getting better performance, have already tried using all compiler optimization flags and optimizations available, and you are ready to compromise code readability for possibly more speed, you could consider rewriting:
for (int i = 0; i < nTransactions; i++)
{
    averageNumShares += tArray[i].numShares;
}
as
// assuming tArray is an array of some Transaction-like struct with a numShares member
// (auto is used here because the element type is not shown in the question)
auto* pointerValue = &tArray[0];
for (int i = 0; i < nTransactions; i++)
{
    averageNumShares += pointerValue->numShares;
    ++pointerValue;   // one increment jumps over a whole element, i.e. sizeof(tArray[0]) bytes
}
It could be that you get better performance by showing the compiler that all you do is jump over a fixed offset at each loop iteration. A good compiler should be able to see that with your initial code anyway, and the new code could even give you worse performance. It really depends on the specifics of your compiler, and again, I do not recommend that approach unless you are desperate for better performance than what the compiler can offer and don't want to jump to using inline assembly or intrinsics (if any are available).

auto* q = &tArray[0];   // q walks over the transactions (type taken from tArray)
int i = nTransactions;
while (i--) {                                // testing against 0 is faster
    averageNumShares += (q++)->numShares;    // incrementing a pointer is faster than an indexed offset
}

Related

Strange behavior in matrix formation (C++, Armadillo)

I have a while loop that continues as long as an energy variable (of type double) has not converged to below a certain threshold. One of the variables needed to calculate this energy is an Armadillo matrix of doubles, named f_mo. In the while loop, f_mo updates iteratively, so I calculate f_mo at the beginning of each loop as:
arma::mat f_mo = h_core_mo; // h_core_mo is an Armadillo matrix of doubles
for (size_t p = 0; p < n_mo; p++) {              // n_mo is of type size_t
    for (size_t q = 0; q < n_mo; q++) {
        double sum = 0.0;
        for (size_t i = 0; i < n_occ; i++) {     // n_occ is of type size_t
            //f_mo(p,q) += 2.0*g_mat_full_qp1_qp1_mo(p*n_mo + q, i*n_mo + i)-g_mat_full_qp1_qp1_mo(p*n_mo+i,i*n_mo+q); // all g_mat_ are Armadillo matrices of doubles
            sum += 2.0*g_mat_full_qp1_qp1_mo(p*n_mo + q, i*n_mo + i)-g_mat_full_qp1_qp1_mo(p*n_mo+i,i*n_mo+q);
        }
        for (size_t i2 = 0; i2 < n_occ2; i2++) { // n_occ2 is of type size_t
            //f_mo(p,q) -= 1.0*g_mat_full_qp1_qp2_mo(p*n_mo + q, i2*n_mo2 + i2);
            sum -= 1.0*g_mat_full_qp1_qp2_mo(p*n_mo + q, i2*n_mo2 + i2);
        }
        f_mo(p,q) += sum;
    }
}
But say I replace the sum (which I add at the end to f_mo(p,q)) with addition to f_mo(p,q) directly (the commented-out code). The output f_mo matrices are identical to machine precision. Nothing about the code should change. The only variables affected in the loop are sum and f_mo. And YET, the code converges to a different energy and in a vastly different number of while-loop iterations. I am at a loss as to the cause of the difference. When I run the same code 2, 3, 4, 5 times, I get the same result every time. When I recompile with no optimization, I get the same issue. When I run on a different computer (controlling for environment), I yet again get a discrepancy in the number of while loops despite identical f_mo, and the total number of iterations for each method (sum += and f_mo(p,q) +=) differs.
It is worth noting that the point at which the code outputs differ is always g_mat_full_qp1_qp2_mo, which is recalculated later in the while loop. HOWEVER, every variable going into the calculation of g_mat_full_qp1_qp2_mo is identical between the two codes. This leads me to think there is something more profound about C++ that I do not understand. I welcome any ideas as to how you would proceed in debugging this behavior (I am all but certain it is not a typical bug, and I've controlled for environment and optimization).
I'm going to assume this is a Hartree-Fock, or some other kind of electronic structure calculation, where you are adding the two-electron integrals to the core Hamiltonian, and apply some domain knowledge.
Part of that assumption is that the individual elements of the two-electron integrals are very small, in particular compared to the core Hamiltonian. Hence, as 1201ProgramAlarm mentions in their comment, the order of addition will matter. You will get a more accurate result if you add the smaller numbers together first, to avoid losing precision when adding two numbers many orders of magnitude apart. Because you iterate this process until the Fock matrix f_mo has tightly converged, you eventually converge to the same value.
In order to add up the numbers in a more accurate order, and hopefully converge faster, most electronic structure programs have a separate routine to calculate the two-electron integrals and then add them to the core Hamiltonian, which is what you are doing, element by element, in your example code.
Presentation on numerical computing.

Evaluating multiplication with exponential function

I'm trying to come up with a good way to evaluate the following function
#include <cmath>
#include <vector>

double foo(std::vector<double> const& x, double c = 0.95)
{
    auto N = x.size(); // Small power of 2 such as 512 or 1024
    double sum = 0;
    for (auto i = 0; i != N; ++i) {
        sum += x[i] * pow(c, double(i)/N);
    }
    return sum;
}
My two main concerns with this naive implementation are performance and accuracy. So I suspect that the most trivial improvement would be to reverse the loop order: for (auto i = N-1; i != -1; --i) (The -1 wraps around, this is OK). This improves accuracy by adding smaller terms first.
While this is good for accuracy, it keeps the performance problem of pow. Numerically, pow(c, double(i)/N) is pow(c, (i-1)/N) * pow(c, 1/N). And the latter is a constant. So in theory we can replace pow with repeated multiplication. While good for performance, this hurts accuracy - errors will accumulate.
I suspect that there's a significantly better algorithm hiding in here. For instance, the fact that N is a power of two means that there is a middle term x[N/2] that's multiplied with sqrt(c). That hints at a recursive solution.
On a somewhat related numerical observation, this looks like a signal multiplication with an exponential, so I naturally think : "FFT, trivial convolution=shift, IFFT", but that seems to offer no real benefit in terms of accuracy or performance.
So, is this a well-known problem with known solutions?
The task is a polynomial evaluation. The method for a single evaluation with the least operation count is the Horner scheme. In general a low operation count will reduce the accumulation of floating point noise.
As the example value c=0.95 is close to 1, any root of it will be even closer to 1 and will thus lose accuracy. Avoid that by computing the difference to 1 directly, z = 1 - c^(1/N), via
z = -expm1(log(c)/N).
Now you have to evaluate the polynomial
sum of x[i] * (1-z)^i
which can be done by careful modification of the Horner scheme. Instead of
for (i = N; i-- > 0; ) {
    res = res*(1-z) + x[i];
}
use
for (i = N; i-- > 0; ) {
    res = (res + x[i]) - res*z;
}
which is mathematically equivalent but has the loss of digits in 1-z happen as late as possible, without using more involved methods like doubly accurate addition.
In tests, those two methods, contrary to the intent, gave almost the same results; a substantial improvement could be observed by separating the result into its value at c=1 (that is, z=0) and a multiple of z, as in
double res0 = 0, resz = 0;
int i;
for (i = N; i-- > 0; ) {
    /* res0 + z*resz = (res0 + z*resz)*(1-z) + x[i]; */
    resz = resz - res0 - z*resz;
    res0 = res0 + x[i];
}
The test case that showed this improvement was for the coefficient sequence of
f(u) = (1-u/N)^(N-2)*(1-u)
where for N=1000 the evaluations result in
c          z=1-c^(1/N)         f(1-z)               diff for 1st proc        diff for 3rd proc
0.950000   0.000051291978909   0.000018898570629     1.33289104579937e-17     4.43845264361253e-19
0.951000   0.000050239954368   0.000018510931892     1.23765066121009e-16    -9.24959978401696e-19
0.952000   0.000049189034371   0.000018123700958     1.67678642238461e-17    -5.38712954453735e-19
0.953000   0.000048139216599   0.000017736876972    -2.86635949350855e-17    -2.37169225231204e-19
...
0.994000   0.000006018054217   0.000002217256601     1.31645860662263e-17     1.15619997300212e-19
0.995000   0.000005012529261   0.000001846785028    -4.15668713370839e-17    -3.5363625547867e-20
0.996000   0.000004008013365   0.000001476685973     8.48811716443534e-17     8.470329472543e-22
0.997000   0.000003004504507   0.000001106958687     1.44711343873661e-17    -2.92226366802734e-20
0.998000   0.000002002000667   0.000000737602425     5.6734266807093e-18     -6.56450534122083e-21
0.999000   0.000001000499833   0.000000368616443    -3.72557383333555e-17     1.47701370177469e-20
Yves' answer inspired me.
It seems that the best approach is to not calculate pow(c, 1.0/N) directly, but indirectly:
cc[0] = c; cc[1] = sqrt(cc[0]); cc[2] = sqrt(cc[1]); ...; cc[logN] = sqrt(cc[logN-1])
Or, writing the exponents in binary,
cc[0] = c^1, cc[1] = c^0.1, cc[2] = c^0.01, cc[3] = c^0.001, ...
Now if we need x[0b100100] * c^0.100100, we can calculate that as x[0b100100] * c^0.1 * c^0.0001. I don't need to precalculate a table of size N, as geza suggested. A table of size log(N) is probably sufficient, and it can be created by repeatedly taking square roots.
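A rough sketch of that idea, assuming N is a power of two (the function name and layout are mine, not from the thread):
#include <cmath>
#include <cstddef>
#include <vector>

// Builds cc[k] = c^(2^-k) by repeated square roots, then assembles c^(i/N)
// from the bits of i. Only log2(N) sqrt calls are needed instead of a pow per term.
double pow_c_i_over_N(double c, std::size_t i, std::size_t N)
{
    std::size_t logN = 0;
    while ((std::size_t(1) << logN) < N) ++logN;

    std::vector<double> cc(logN + 1);
    cc[0] = c;                               // c^1
    for (std::size_t k = 1; k <= logN; ++k)
        cc[k] = std::sqrt(cc[k - 1]);        // c^0.1, c^0.01, ... (exponents in binary)

    double result = 1.0;
    for (std::size_t k = 1; k <= logN; ++k)
        if (i & (std::size_t(1) << (logN - k)))   // bit of i with weight 2^(logN-k)
            result *= cc[k];
    return result;
}
In the actual summation the cc table would of course be built once and reused for every i.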
[edit]
As pointed out in a comment thread on another answer, pairwise summation is very effective in keeping errors under control. And it happens to combine extremely nicely with this answer.
We start by observing that we sum
x[0] * c^0.0000000
x[1] * c^0.0000001
x[2] * c^0.0000010
x[3] * c^0.0000011
...
So, we run log(N) iterations. In iteration 1, we add the N/2 pairs x[i] + x[i+1]*c^0.0000001 (i.e. the multiplier is c^(1/N)) and store the result in x[i/2]. In iteration 2, we add the pairs x[i] + x[i+1]*c^0.0000010 (multiplier c^(2/N)), etcetera. The chief difference from normal pairwise summation is that this is a multiply-and-add in each step.
We see now that in each iteration, we're using the same multiplier pow(c, 2^i/N), which means we only need to calculate log(N) multipliers. It's also quite cache-efficient, as we're doing only contiguous memory access. It also allows for easy SIMD parallelization, especially when you have FMA instructions.
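Here is a minimal sketch of that multiply-and-add reduction, again assuming N is a power of two (names are mine):
#include <cmath>
#include <cstddef>
#include <vector>

// Pairwise multiply-add reduction: each pass halves the array, and the
// multiplier for pass p is c^(2^(p-1)/N), obtained by squaring the previous one.
double foo_pairwise(std::vector<double> x, double c = 0.95)
{
    if (x.empty()) return 0.0;
    const std::size_t N = x.size();
    double m = std::pow(c, 1.0 / double(N));         // c^(1/N) for the first pass
    for (std::size_t len = N; len > 1; len /= 2) {
        for (std::size_t i = 0; i < len; i += 2)
            x[i / 2] = x[i] + x[i + 1] * m;           // fused pair
        m *= m;                                       // next pass uses c^(2^p/N)
    }
    return x[0];
}
Taking x by value gives the reduction a scratch buffer to overwrite; the initial pow call could itself be replaced by the repeated square roots described above.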
If N is a power of 2, you can replace the evaluations of the powers by geometric means, using
a^((i+j)/2) = √(a^i · a^j)
and recursively subdividing, starting from c^(N/N) and c^(0/N). With preorder recursion, you can make sure to accumulate by increasing weights.
Anyway, the speedup of sqrt vs. pow might be marginal.
You can also stop recursion at a certain level and continue linearly, with mere products.
You could mix repeated multiplication by pow(c, 1./N) with some explicit pow calls. I.e. every 16th iteration or so do a real pow and otherwise move forward with the multiply. This should yield large performance benefits at negligible accuracy cost.
Depending on how much c varies, you might even be able to precompute and replace all pow calls with a lookup, or just the ones needed in the above method (= smaller lookup table = better caching).
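A hedged sketch of that mix (the resync interval of 16 is just the figure mentioned above; the function name is mine):
#include <cmath>
#include <cstddef>
#include <vector>

double foo_resync(std::vector<double> const& x, double c = 0.95)
{
    if (x.empty()) return 0.0;
    const std::size_t N = x.size();
    const double step = std::pow(c, 1.0 / double(N));  // c^(1/N)
    double sum = 0.0;
    double w = 0.0;                                     // current weight c^(i/N)
    for (std::size_t i = N; i-- > 0; ) {                // reversed loop: smaller terms first
        if (i == N - 1 || i % 16 == 15)
            w = std::pow(c, double(i) / double(N));     // periodic exact recomputation
        else
            w /= step;                                  // cheap update between resyncs
        sum += x[i] * w;
    }
    return sum;
}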

Big O calculation

int maxValue = m[0][0];
for (int i = 0; i < N; i++)
{
    for (int j = 0; j < N; j++)
    {
        if (m[i][j] > maxValue)
        {
            maxValue = m[i][j];
        }
    }
}
cout << maxValue << endl;

int sum = 0;
for (int i = 0; i < N; i++)
{
    for (int j = 0; j < N; j++)
    {
        sum = sum + m[i][j];
    }
}
cout << sum << endl;
For the above code I got O(N^2) as the execution-time growth.
The way I got it was:
MAX[ O(1), O(N^2), O(1), O(1), O(N^2), O(1) ]
Both O(N^2) terms are for the for loops. Is this calculation correct?
If I change this code as:
int maxValue = m[0][0];
int sum = 0;
for (int i = 0; i < N; i++)
{
    for (int j = 0; j < N; j++)
    {
        if (m[i][j] > maxValue)
        {
            maxValue = m[i][j];
        }
        sum += m[i][j];
    }
}
cout << maxValue << endl;
cout << sum << endl;
The Big O would still be O(N^2), right?
So does that mean Big O is just an indication of how the time will grow according to the input data size, and not of how the algorithm is written?
This feels a bit like a homework question to me, but...
Big-Oh is about the algorithm, and specifically how the number of steps performed (or the amount of memory used) by the algorithm grows as the size of the input data grows.
In your case, you are taking N to be the size of the input, and it's confusing because you have a two-dimensional array, NxN. So really, since your algorithm only makes one or two passes over this data, you could call it O(n), where in this case n is the size of your two-dimensional input.
But to answer the heart of your question, your first code makes two passes over the data, and your second code does the same work in a single pass. However, the idea of Big-Oh is that it should give you the order of growth, which means independent of exactly how fast a particular computer runs. So, it might be that my computer is twice as fast as yours, so I can run your first code in about the same time as you run the second code. So we want to ignore those kinds of differences and say that both algorithms make a fixed number of passes over the data, so for the purposes of "order of growth", one pass, two passes, three passes, it doesn't matter. It's all about the same as one pass.
It's probably easier to think about this without thinking about the NxN input. Just think about a single list of N numbers, and say you want to do something to it, like find the max value, or sort the list. If you have 100 items in your list, you can find the max in 100 steps, and if you have 1000 items, you can do it in 1000 steps. So the order of growth is linear with the size of the input: O(n). On the other hand, if you want to sort it, you might write an algorithm that makes roughly a full pass over the data each time it finds the next item to be inserted, and it has to do that roughly once for each element in the list, so that's making n passes over your list of length n, which is O(n^2). If you have 100 items in your list, that's roughly 10^4 steps, and if you have 1000 items in your list that's roughly 10^6 steps. So the idea is that those numbers grow really fast in comparison to the size of your input, so even if I have a much faster computer (e.g., a model 10 years better than yours), I might be able to beat you in the max problem even with a list 2 or 10 or even 100 or 1000 times as long. But for the sorting problem with an O(n^2) algorithm, I won't be able to beat you when I try to take on a list that's 100 or 1000 times as long, even with a computer 10 or 20 years better than yours. That's the idea of Big-Oh: to factor out those "relatively unimportant" speed differences and be able to see what amount of work, in a more general/theoretical sense, a given algorithm does on a given input size.
Of course, in real life, it may make a huge difference to you that one computer is 100 times faster than another. If you are trying to solve a particular problem with a fixed maximum input size, and your code is running at 1/10 the speed that your boss is demanding, and you get a new computer that runs 10 times faster, your problem is solved without needing to write a better algorithm. But the point is that if you ever wanted to handle larger (much larger) data sets, you couldn't just wait for a faster computer.
The big-O notation is an upper bound on the maximum amount of time taken to execute the algorithm, as a function of the input size. So basically two algorithms can have slightly varying maximum running times but the same big-O notation.
What you need to understand is that a running-time function that is linear in the input size will have big-O notation O(n), and a quadratic function will always have big-O notation O(n^2).
So if your running time is just n, that is one linear pass, the big-O notation stays O(n); and if your running time is 6n + c, that is 6 linear passes plus a constant time c, it is still O(n).
Now in the above case the second code is more optimized, as the number of jumps to memory locations for the loops is smaller, and hence it will execute faster. But both versions still have the same asymptotic running time, O(n^2).
Yes, it's O(N^2) in both cases. Of course O() time complexity depends on how you have written your algorithm, but both the versions above are O(N^2). However, note that actually N^2 is the size of your input data (it's an N x N matrix), so this would be better characterized as a linear time algorithm O(n) where n is the size of the input, i.e. n = N x N.

What's the numerically best way to calculate the average

What's the best way to calculate the average? With this question I want to know which algorithm for calculating the average is best in a numerical sense. It should have the least rounding error, should not be sensitive to over- or underflows, and so on.
Thank you.
Additional information: incremental approaches are preferred, since the number of values may not fit into RAM (several parallel calculations on files larger than 4 GB).
If you want an O(N) algorithm, look at Kahan summation.
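For reference, a minimal sketch of compensated (Kahan) summation applied to the mean (the function name and interface are mine):
#include <cstddef>

double kahan_mean(const double* data, std::size_t n)
{
    double sum = 0.0;    // running sum
    double comp = 0.0;   // running compensation for lost low-order bits
    for (std::size_t i = 0; i < n; ++i) {
        double y = data[i] - comp;
        double t = sum + y;       // big + small: low-order digits of y are lost here...
        comp = (t - sum) - y;     // ...and recovered here for the next iteration
        sum = t;
    }
    return n ? sum / double(n) : 0.0;
}
Note that unsafe floating-point optimizations (e.g. -ffast-math) can eliminate the compensation term, so compile such code without them.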
You can have a look at http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.3535 (Nick Higham, "The accuracy of floating point summation", SIAM Journal of Scientific Computation, 1993).
If I remember it correctly, compensated summation (Kahan summation) is good if all numbers are positive, at least as good as sorting them and adding them in ascending order (unless there are very, very many numbers). The story is much more complicated if some numbers are positive and some are negative, so that you get cancellation. In that case, there is an argument for adding them in descending order.
Sort the numbers in ascending order of magnitude. Sum them, low magnitude first. Divide by the count.
I always use the following pseudocode:
float mean = 0.0f;  // could use double
int n = 0;          // could use long
for (float x : data) {
    ++n;
    mean += (x - mean) / n;
}
I don't have a formal proof of its stability, but you can see that we won't have problems with numerical overflow, assuming that the data values are well behaved. It's referred to in Knuth's The Art of Computer Programming.
Just to add one possible answer for further discussion:
Incrementally calculate the average for each step:
AVG_n = AVG_(n-1) * (n-1)/n + VALUE_n / n
or pairwise combination
AVG_(n_a + n_b) = (n_a * AVG_a + n_b * AVG_b) / (n_a + n_b)
(I hope the formulas are clear enough)
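As an illustration of the pairwise combination rule, here is a hedged sketch (the struct and function names are mine) that is handy when the data arrives in chunks, such as separate files:
struct RunningAvg {
    double avg = 0.0;
    long long n = 0;
};

// AVG_(n_a + n_b) = (n_a * AVG_a + n_b * AVG_b) / (n_a + n_b)
RunningAvg combine(RunningAvg a, RunningAvg b)
{
    RunningAvg r;
    r.n = a.n + b.n;
    if (r.n > 0)
        r.avg = (double(a.n) * a.avg + double(b.n) * b.avg) / double(r.n);
    return r;
}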
A very late post, but since I don't have enough reputation to comment: @Dave's method is the one used (as of December 2020) by the GNU Scientific Library.
Here is the code, extracted from mean_source.c:
double FUNCTION (gsl_stats, mean) (const BASE data[], const size_t stride, const size_t size)
{
    /* Compute the arithmetic mean of a dataset using the recurrence relation
       mean_(n) = mean(n-1) + (data[n] - mean(n-1))/(n+1) */
    long double mean = 0;
    size_t i;

    for (i = 0; i < size; i++)
    {
        mean += (data[i * stride] - mean) / (i + 1);
    }

    return mean;
}
GSL uses the same algorithm to calculate the variance, which is, after all, just a mean of squared differences from a given number.

C++ - What would be faster: multiplying or adding?

I have some code that is going to be run thousands of times, and was wondering which is faster.
array is a 30-element short array which always holds 0, 1, or 2.
result = (array[29] * 68630377364883.0)
+ (array[28] * 22876792454961.0)
+ (array[27] * 7625597484987.0)
+ (array[26] * 2541865828329.0)
+ (array[25] * 847288609443.0)
+ (array[24] * 282429536481.0)
+ (array[23] * 94143178827.0)
+ (array[22] * 31381059609.0)
+ (array[21] * 10460353203.0)
+ (array[20] * 3486784401.0)
+ (array[19] * 1162261467)
+ (array[18] * 387420489)
+ (array[17] * 129140163)
+ (array[16] * 43046721)
+ (array[15] * 14348907)
+ (array[14] * 4782969)
+ (array[13] * 1594323)
+ (array[12] * 531441)
+ (array[11] * 177147)
+ (array[10] * 59049)
+ (array[9] * 19683)
+ (array[8] * 6561)
+ (array[7] * 2187)
+ (array[6] * 729)
+ (array[5] * 243)
+ (array[4] * 81)
+ (array[3] * 27)
+ (array[2] * 9)
+ (array[1] * 3)
+ (b[0]);
Would it be faster if I use something like:
if (array[29] != 0)
{
    if (array[29] == 1)
    {
        result += 68630377364883.0;
    }
    else
    {
        result += (whatever 68630377364883.0 * 2 is);
    }
}
for each of them. Would this be faster/slower? If so, by how much?
That is a ridiculously premature "optimization". Chances are you'll be hurting performance because you are adding branches to the code. Mispredicted branches are very costly. And it also makes the code harder to read.
Multiplication in modern processors is a lot faster than it used to be; it can be done in a few clock cycles now.
Here's a suggestion to improve readability:
for (i = 1; i < 30; i++) {
    result += array[i] * pow(3, i);
}
result += b[0];
You can pre-compute an array with the values of pow(3, i) if you are really that worried about performance.
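A hedged sketch of that precomputed table (the function name and signature are mine; the constants are the powers of 3 already listed in the question, all of which are exactly representable as doubles):
double base3_value(const short array[30], double b0)
{
    static const double pow3[30] = {
        1.0, 3.0, 9.0, 27.0, 81.0, 243.0, 729.0, 2187.0, 6561.0, 19683.0,
        59049.0, 177147.0, 531441.0, 1594323.0, 4782969.0, 14348907.0,
        43046721.0, 129140163.0, 387420489.0, 1162261467.0, 3486784401.0,
        10460353203.0, 31381059609.0, 94143178827.0, 282429536481.0,
        847288609443.0, 2541865828329.0, 7625597484987.0, 22876792454961.0,
        68630377364883.0
    };
    double result = b0;                 // matches the question's trailing b[0] term
    for (int i = 1; i < 30; ++i)
        result += array[i] * pow3[i];
    return result;
}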
First, on most architectures a mispredicted branch is very costly (depending on the execution pipeline depth), so I bet the non-branching version is better.
A variation on the code may be:
result = array[29];
for (i = 28; i >= 0; i--)
    result = result * 3 + array[i];
Just make sure there are no overflows, so result must be of a type larger than a 32-bit integer.
Even if addition is faster than multiplication, I think that you will lose more because of the branching. In any case, if addition is faster than multiplication, a better solution might be to use a table and index into it.
const double table[3] = {0.0, 68630377364883.0, 68630377364883.0 * 2.0};
result += table[array[29]];
My first attempt at optimisation would be to remove the floating-point ops in favour of integer arithmetic:
uint64_t total = b[0];
uint64_t x = 3;
for (int i = 1; i < 30; ++i, x *= 3) {
    total += array[i] * x;
}
uint64_t is not standard C++, but is very widely available. You just need a version of C99's stdint for your platform.
There's also optimising for comprehensibility and maintainability - was this code a loop at one point, and did you measure the performance difference when you replaced the loop? Fully unrolling like this might even make the program slower (as well as less readable), since the code is larger and hence occupies more of the instruction cache, and hence results in cache misses elsewhere. You just don't know.
This assumes, of course, that your constants actually are the powers of 3 - I haven't bothered checking, which is precisely what I consider to be the readability issue with your code...
This is basically doing what strtoull does. If you don't have the digits handy as an ASCII string to feed to strtoull then I guess you have to write your own implementation. As people point out, branching is what causes a performance hit, so your function is probably best written this way:
#include <tr1/cstdint>

uint64_t base3_digits_to_num(uint8_t digits[30])
{
    uint64_t running_sum = 0;
    uint64_t pow3 = 1;
    for (int i = 0; i < 30; ++i) {
        running_sum += digits[i] * pow3;
        pow3 *= 3;
    }
    return running_sum;
}
It's not clear to me that precomputing your powers of 3 is going to result in a significant speed advantage. You might try it and test yourself. The one advantage a lookup table might give you is that a smart compiler could possibly unroll the loop into a SIMD instruction. But a really smart compiler should then be able to do that anyway and generate the lookup table for you.
Avoiding floating point is also not necessarily a speed win. Floating point and integer operations are about the same on most processors produced in the last 5 years.
Checking to see if digits[i] is 0, 1 or 2 and executing different code for each of these cases is definitely a speed loss on any processor produced in the last 10 years. The Pentium 3/Pentium 4/Athlon Thunderbird days are when branches started to really become a huge hit, and the Pentium 3 is at least 10 years old now.
Lastly, you might think this will be the bottleneck in your code. You're probably wrong. The right implementation is the one that is the simplest and most clear to anybody coming along reading your code. Then, if you want the best performance, run your code through a profiler and find out where to concentrate your optimization efforts. Agonizing this much over a little function when you don't even know that it's a bottleneck is silly.
And almost nobody here recognized that you were basically doing a base 3 conversion. So even your current primitive hand loop unrolling obscured your code enough that most people didn't understand it.
Edit: In fact, I looked at the assembly output. On an x86_64 platform the lookup table buys you nothing and may in fact be counter-productive because of its effect on the cache. The compiler generates leaq (%rdx,%rdx,2), %rdx in order to multiply by 3. Fetching from a table would be something like movq (%rdx,%rcx,8), %rax, which is basically the same speed aside from requiring a fetch from memory (which might be very expensive). So it's almost certain that my code with the gcc option -funroll-loops is significantly faster than your attempt to optimize by hand.
The lesson here is that the compiler does a much, much better job of optimization than you can. Just make your code as clear and readable to others as possible and let the compiler do the work. And making it clear to others has the additional advantage of making it easier for the compiler to do its job.
If you're not sure - why don't you just measure it yourself?
The second example will most likely be much slower, but not because of the addition - mispredicted conditional jumps cost a lot of time.
If you have only 3 possible values, the cheapest way might be to have a static 2D array of precomputed values, static const double vals[29][3] = {{0, 1*3, 2*3}, {0, 1*9, 2*9}, ...}, and just sum vals[0][array[1]] + vals[1][array[2]] + ...
Some SIMD instructions might be faster than anything you can write on your own - look at those. Then again - if you're doing this a lot, handing it off to GPU might be even faster - depending on your other calculations.
Multiply, because branching is awfully slow.