I have some code that is going to be run thousands of times, and was wondering what was faster.
array is a 30-element short array whose entries are always 0, 1 or 2.
result = (array[29] * 68630377364883.0)
+ (array[28] * 22876792454961.0)
+ (array[27] * 7625597484987.0)
+ (array[26] * 2541865828329.0)
+ (array[25] * 847288609443.0)
+ (array[24] * 282429536481.0)
+ (array[23] * 94143178827.0)
+ (array[22] * 31381059609.0)
+ (array[21] * 10460353203.0)
+ (array[20] * 3486784401.0)
+ (array[19] * 1162261467)
+ (array[18] * 387420489)
+ (array[17] * 129140163)
+ (array[16] * 43046721)
+ (array[15] * 14348907)
+ (array[14] * 4782969)
+ (array[13] * 1594323)
+ (array[12] * 531441)
+ (array[11] * 177147)
+ (array[10] * 59049)
+ (array[9] * 19683)
+ (array[8] * 6561)
+ (array[7] * 2187)
+ (array[6] * 729)
+ (array[5] * 243)
+ (array[4] * 81)
+ (array[3] * 27)
+ (array[2] * 9)
+ (array[1] * 3)
+ (b[0]);
Would it be faster if I use something like:
if (array[29] != 0)
{
    if (array[29] == 1)
    {
        result += 68630377364883.0;
    }
    else
    {
        result += (whatever 68630377364883.0 * 2 is);
    }
}
for each of them. Would this be faster/slower? If so, by how much?
That is a ridiculously premature "optimization". Chances are you'll be hurting performance because you are adding branches to the code. Mispredicted branches are very costly. And it also renders the code harder to read.
Multiplication on modern processors is a lot faster than it used to be; it can be done in a few clock cycles now.
Here's a suggestion to improve readability:
for (int i = 1; i < 30; i++) {
    result += array[i] * pow(3, i);
}
result += b[0];
You can pre-compute an array with the values of pow(3, i) if you are really that worried about performance.
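If you go that route, a minimal sketch might look like this (pow3 and init_pow3 are illustrative names; the table is filled once at startup):

static double pow3[30];

static void init_pow3()
{
    double p = 1.0;
    for (int i = 0; i < 30; ++i, p *= 3.0)
        pow3[i] = p; // pow3[i] == 3^i
}

// then the loop above becomes:
//     result += array[i] * pow3[i];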
First, on most architectures a mispredicted branch is very costly (depending on the depth of the execution pipeline), so I'd bet the non-branching version is better.
A variation on the code may be:
result = array[29];
for (i = 28; i >= 0; i--)
    result = result * 3 + array[i];
Just make sure there are no overflows: result must be held in a type wider than a 32-bit integer, since the full value can be as large as 3^30 - 1.
Even if addition is faster than multiplication, I think that you will lose more because of the branching. In any case, if addition is faster than multiplication, a better solution might be to use a table and index into it.
const double table[3] = {0.0, 68630377364883.0, 68630377364883.0 * 2.0};
result += table[array[29]];
My first attempt at optimisation would be to remove the floating-point ops in favour of integer arithmetic:
uint64_t total = b[0];
uint64_t x = 3;
for (int i = 1; i < 30; ++i, x *= 3) {
    total += array[i] * x;
}
uint64_t was not standard C++ before C++11, but it is very widely available; on older compilers you just need a version of C99's stdint.h for your platform.
There's also optimising for comprehensibility and maintainability - was this code a loop at one point, and did you measure the performance difference when you replaced the loop? Fully unrolling like this might even make the program slower (as well as less readable), since the larger code occupies more of the instruction cache and so causes cache misses elsewhere. You just don't know.
This is assuming, of course, that your constants actually are the powers of 3 - I haven't bothered checking, which is precisely what I consider to be the readability issue with your code...
This is basically doing what strtoull does. If you don't have the digits handy as an ASCII string to feed to strtoull then I guess you have to write your own implementation. As people point out, branching is what causes a performance hit, so your function is probably best written this way:
#include <cstdint>

uint64_t base3_digits_to_num(uint8_t digits[30])
{
    uint64_t running_sum = 0;
    uint64_t pow3 = 1;
    for (int i = 0; i < 30; ++i) {
        running_sum += digits[i] * pow3;
        pow3 *= 3;
    }
    return running_sum;
}
It's not clear to me that precomputing your powers of 3 is going to result in a significant speed advantage. You might try it and test yourself. The one advantage a lookup table might give you is that a smart compiler could possibly unroll the loop into a SIMD instruction. But a really smart compiler should then be able to do that anyway and generate the lookup table for you.
Avoiding floating point is also not necessarily a speed win. Floating point and integer operations are about the same on most processors produced in the last 5 years.
Checking to see if digits[i] is 0, 1 or 2 and executing different code for each of these cases is definitely a speed loss on any processor produced in the last 10 years. The Pentium3/Pentium4/Athlon Thunderbird days are when branches started to really become a huge hit, and the Pentium3 is at least 10 years old now.
Lastly, you might think this will be the bottleneck in your code. You're probably wrong. The right implementation is the one that is the simplest and most clear to anybody coming along reading your code. Then, if you want the best performance, run your code through a profiler and find out where to concentrate your optimization efforts. Agonizing this much over a little function when you don't even know that it's a bottleneck is silly.
And almost nobody here recognized that you were basically doing a base 3 conversion. So even your current primitive hand loop unrolling obscured your code enough that most people didn't understand it.
Edit: In fact, I looked at the assembly output. On an x86_64 platform the lookup table buys you nothing and may in fact be counter-productive because of its effect on the cache. The compiler generates leaq (%rdx,%rdx,2), %rdx in order to multiply by 3. Fetching from a table would be something like movq (%rdx,%rcx,8), %rax, which is basically the same speed aside from requiring a fetch from memory (which might be very expensive). So it's almost certain that my code with the gcc option -funroll-loops is significantly faster than your attempt to optimize by hand.
The lesson here is that the compiler does a much, much better job of optimization than you can. Just make your code as clear and readable to others as possible and let the compiler do the work. And making it clear to others has the additional advantage of making it easier for the compiler to do its job.
If you're not sure - why don't you just measure it yourself?
The second example will most likely be much slower, but not because of the addition - mispredicted conditional jumps cost a lot of time.
If you have only 3 values, the cheapest way might be to have a static 2D table, e.g. int64_t vals[29][3] = {{0, 1*3, 2*3}, {0, 1*9, 2*9}, ...}, and just sum vals[0][array[1]] + vals[1][array[2]] + ...
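A rough sketch of that table (names are illustrative, the table is filled once at startup, and a loop stands in for the written-out sum):

#include <cstdint>

static std::int64_t vals[29][3];

// vals[i][d] holds d * 3^(i+1), so digit d at position i+1 costs one load and one add.
static void init_vals()
{
    std::int64_t p = 3; // 3^(i+1)
    for (int i = 0; i < 29; ++i, p *= 3)
        for (int d = 0; d < 3; ++d)
            vals[i][d] = d * p;
}

static std::int64_t convert(const short array[30])
{
    std::int64_t result = array[0];
    for (int i = 0; i < 29; ++i)
        result += vals[i][array[i + 1]];
    return result;
}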
Some SIMD instructions might be faster than anything you can write on your own - look at those. Then again - if you're doing this a lot, handing it off to GPU might be even faster - depending on your other calculations.
Multiply, because branching is awfully slow.
Related
I have a sparse matrix with only zeros and ones as entries (and, for example, with shape 32k x 64k and 0.01% non-zero entries and no patterns to exploit in terms of where the non-zero entries are). The matrix is known at compile time. I want to perform matrix-vector multiplication (modulo 2) with non-sparse vectors (not known at compile time) containing 50% ones and zeros. I want this to be efficient, in particular, I'm trying to make use of the fact that the matrix is known at compile time.
Storing the matrix in an efficient format (saving only the indices of the "ones") will always take a few Mbytes of memory, and directly embedding the matrix into the executable seems like a good idea to me. My first idea was to automatically generate C++ code that assigns each entry of the result vector the sum of the correct input entries. This looks like this:
constexpr std::size_t N = 64'000;
constexpr std::size_t M = 32'000;
template<typename Bit>
void multiply(const std::array<Bit, N> &in, std::array<Bit, M> &out) {
out[0] = (in[11200] + in[21960] + in[29430] + in[36850] + in[44352] + in[49019] + in[52014] + in[54585] + in[57077] + in[59238] + in[60360] + in[61120] + in[61867] + in[62608] + in[63352] ) % 2;
out[1] = (in[1] + in[11201] + in[21961] + in[29431] + in[36851] + in[44353] + in[49020] + in[52015] + in[54586] + in[57078] + in[59239] + in[60361] + in[61121] + in[61868] + in[62609] + in[63353] ) % 2;
out[2] = (in[11202] + in[21962] + in[29432] + in[36852] + in[44354] + in[49021] + in[52016] + in[54587] + in[57079] + in[59240] + in[60362] + in[61122] + in[61869] + in[62610] + in[63354] ) % 2;
out[3] = (in[56836] + in[11203] + in[21963] + in[29433] + in[36853] + in[44355] + in[49022] + in[52017] + in[54588] + in[57080] + in[59241] + in[60110] + in[61123] + in[61870] + in[62588] + in[63355] ) % 2;
// LOTS more of this...
out[31999] = (in[10208] + in[21245] + in[29208] + in[36797] + in[40359] + in[48193] + in[52009] + in[54545] + in[56941] + in[59093] + in[60255] + in[61025] + in[61779] + in[62309] + in[62616] + in[63858] ) % 2;
}
This does in fact work (it takes ages to compile). However, it actually seems to be very slow (more than 10x slower than the same sparse vector-matrix multiplication in Julia) and also to blow up the executable size significantly more than I would have thought necessary. I tried this with both std::array and std::vector, and with the individual entries (represented as Bit) being bool, std::uint8_t and int, with no improvement worth mentioning. I also tried replacing the modulo and addition by XOR. In conclusion, this is a terrible idea. I'm not sure why, though - is the sheer code size slowing it down that much? Does this kind of code rule out compiler optimization?
I haven't tried any alternatives yet. The next idea I have is storing the indices as compile-time constant arrays (still giving me huge .cpp files) and looping over them. Initially, I expected doing this would lead the compiler optimization to generate the same binary as from my automatically generated C++ code. Do you think this is worth trying (I guess I will try anyway on monday)?
Another idea would be to try storing the input (and maybe also output?) vector as packed bits and perform the calculation like that. I would expect one can't get around a lot of bit-shifting or and-operations and this would end up being slower and worse overall.
Do you have any other ideas on how this might be done?
I'm not sure why though - is the sheer codesize slowing it down that much?
The problem is that the executable is big, so the OS will fetch a lot of pages from your storage device. This process is very slow, and the processor will often stall waiting for data to be loaded. Even if the code were already loaded in RAM (OS caching), it would still be inefficient because the speed of RAM (latency + throughput) is quite bad. The main issue here is that all the instructions are executed only once. If you reuse the function, the code needs to be reloaded from the cache, and if it is too big to fit in the cache, it will be loaded from the slow RAM. Thus, the overhead of loading the code is very high compared to its actual execution. To overcome this problem, you need fairly small code with loops iterating over a fairly small amount of data.
Does this kind of code rule out compiler optimization?
This depends on the compiler, but most mainstream compilers (e.g. GCC or Clang) will optimize the code the same way (hence the slow compilation time).
Do you think this is worth trying (I guess I will try anyway on monday)?
Yes, this solution is clearly better, especially if the indices are stored in a compact way. In your case, you can store them using a uint16_t type. All the indices can be put in one big buffer. The starting/ending positions of the indices for each row can be specified in another buffer referencing the first one (or using pointers). This buffer can be loaded into memory once at the beginning of your application from a dedicated file, to reduce the size of the resulting program (and avoid fetches from the storage device in a critical loop). With a probability of 0.01% of having non-zero values, the resulting data structure will take less than 500 KiB of RAM. On an average mainstream desktop processor, it can fit in the L3 cache (which is quite fast), and I think that your computation should not take more than 1 ms, assuming the code of multiply is carefully optimized.
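As a rough sketch of the layout described above (the struct and field names are mine, a row-oriented variant is shown for simplicity, and the buffers would be filled from the dedicated file at startup):

#include <cstddef>
#include <cstdint>
#include <vector>

struct SparseBinaryMatrix {
    std::vector<std::uint32_t> row_start; // size M+1; row r owns [row_start[r], row_start[r+1])
    std::vector<std::uint16_t> col_idx;   // column indices of the ones, row by row
};

// out[r] = (sum of in[col] over the ones of row r) mod 2
void multiply(const SparseBinaryMatrix& m,
              const std::vector<std::uint8_t>& in,
              std::vector<std::uint8_t>& out)
{
    for (std::size_t r = 0; r + 1 < m.row_start.size(); ++r) {
        std::uint8_t acc = 0;
        for (std::uint32_t k = m.row_start[r]; k < m.row_start[r + 1]; ++k)
            acc ^= in[m.col_idx[k]];      // addition mod 2 is XOR
        out[r] = acc;
    }
}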
Another idea would be to try storing the input (and maybe also output?) vector as packed bits and perform the calculation like that.
Bit-packing is good only if your matrix is not too sparse. With a matrix filled with 50% of non-zero values, the bit-packing method is great. With 0.01% of non-zero values, the bit-packing method is clearly bad as it takes too much space.
I would expect one can't get around a lot of bit-shifting or and-operations and this would end up being slower and worse overall.
As previously said, loading data from the storage device or the RAM is very slow. Doing some bit-shifts is very fast on any modern mainstream processor (and much much faster than loading data).
For a rough sense of scale of the various operations a computer can do: simple arithmetic and cache hits cost on the order of a nanosecond, a main-memory access is on the order of 100 ns, and a storage-device access ranges from microseconds (SSD) to milliseconds (HDD).
I implemented the second method (constexpr arrays storing the matrix in compressed column storage format) and it is a lot better. It takes (for a 64'000 x 22'000 binary matrix containing 35'000 ones) <1min to compile with -O3 and performs one multiplication in <300 microseconds on my laptop (Julia takes around 350 microseconds for the same calculation). The total executable size is ~1 Mbyte.
Probably one can still do a lot better. If anyone has an idea, let me know!
Below is a code example (showing a 5x10 matrix) illustrating what I did.
#include <iostream>
#include <array>
#include <cstddef>
#include <cstdint>

// Compressed sparse column storage for binary matrix
constexpr std::size_t M = 5;
constexpr std::size_t N = 10;
constexpr std::size_t num_nz = 5;

constexpr std::array<std::uint16_t, N + 1> colptr = {
    0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5
};

constexpr std::array<std::uint16_t, num_nz> row_idx = {
    0x0, 0x1, 0x2, 0x3, 0x4
};

template<typename Bit>
constexpr void encode(const std::array<Bit, N>& in, std::array<Bit, M>& out) {
    for (std::size_t col = 0; col < N; col++) {
        for (std::size_t j = colptr[col]; j < colptr[col + 1]; j++) {
            out[row_idx[j]] = (static_cast<bool>(out[row_idx[j]]) != static_cast<bool>(in[col]));
        }
    }
}

int main() {
    using Bit = bool;
    std::array<Bit, N> input{1, 0, 1, 0, 1, 1, 0, 1, 0, 1};
    std::array<Bit, M> output{};
    for (auto i : input) std::cout << i;
    std::cout << std::endl;
    encode(input, output);
    for (auto i : output) std::cout << i;
}
I am writing a C++ program to generate Mandelbrot set zooms. All of my complex numbers were originally two doubles (one for the real part, one for the imaginary part). This was working pretty fast; 15 seconds per frame for the type of image I was generating.
Because of the zooming effect, I wanted to increase precision for the more zoomed-in frames, since these frames have such a small difference between the min_x and max_x. I looked to GMP to help me out with this.
Now, it is much much slower; 15:38 minutes per frame. The settings for the image are the same as before and the algorithm is the same. The only thing that has changed is that I am using mpf_class for the decimals which need to be precise (i.e. just the complex numbers). To compare performance, I used the same precision as double: mpf_set_default_prec(64);
Does GMP change the precision of mpf_class to meet the needs of the expression? In other words, if I have two 64 bit mpf_class objects and I do a calculation with them and store the result in another mpf_class, is the precision potentially increased? This would ruin performance over time I would think, but I am not sure that this is what is causing my issue.
My questions: Is this performance drop just the nature of GMP and other arbitrary-precision libraries? What advice would you give?
Edit 1
I am (i.e. have always been) using the -O3 flag for optimizing. I have also run a test to verify that GMP is not automatically increasing the precision of the mpf_class objects. So the question still remains as to the reason for the drastic performance decrease.
Edit 2
As a demonstrative example, I compiled the following code with g++ main.cpp -lgmp -lgmpxx, once as shown below and once with every double replaced by mpf_class. With double it ran in 12.75 seconds and with mpf_class it ran in 24:54 minutes. Why is this the case when they have the same precision?
#include <gmpxx.h>

double linear_map(double d, double a1, double b1, double a2, double b2) {
    double a = (d - a1) / (b1 - a1);
    return (a * (b2 - a2)) + (a2);
}

int iterate(double x0, double y0) {
    double x, y;
    x = 0;
    y = 0;
    int i;
    for (i = 0; i < 1000 && x*x + y*y <= 65536; i++) {
        double xtemp = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xtemp;
    }
    return i;
}

int main() {
    mpf_set_default_prec(64);
    for (int j = 0; j < 3200; j++) {
        for (int i = 0; i < 3200; i++) {
            double x = linear_map(i, 0, 3200, -2, 1);
            double y = linear_map(j, 0, 3200, -1.5, 1.5);
            iterate(x, y);
        }
    }
    return 0;
}
As explained in the comments, this kind of slowdown is entirely expected from a library such as GMP.
Builtin double multiplications are one of the areas where current-day CPUs and compilers are most optimized; CPUs have multiple execution units that manage to execute in parallel multiple floating point operations, often aided by the compilers, which try to auto-vectorize loops (although this isn't particularly applicable to your case, as your innermost loop has a strong dependency from the previous iteration).
On the other hand, in multiple, dynamic precision libraries such as GMP each operation amounts to lots of work - there are multiple branches to examine even just to check if both operands have the same/right amount of precision, and calculation algorithms implemented are generic and tailored towards the "higher precision" end, which means that they aren't particularly optimized for your current use case (using them with the same precision as double); also, GMP values can and do allocate memory when they are created, another costly operation.
I took your program and modified it slightly to make it parametric over the type to use (with a #define), reducing the side of the sampled square (from 3200 to 800, to make tests faster) and adding an accumulator of the return value of iterate to print it at the end, both to check if everything is working the same way between the various versions, and to make sure that the optimizer doesn't drop the loop completely.
The double version on my machine takes roughly 0.16 seconds and, run through a profiler, exhibits a completely flat profile in the flamegraph; everything happens in iterate.
The GMP version instead, as expected, takes 45 seconds (300x; you talked about a 60x slowdown, but you were comparing with an unoptimized base case) and its profile is way more varied.
As before, iterate is taking pretty much the whole time (so we can completely ignore linear_map as far as optimization is concerned). All those "towers" in the flamegraph are calls into GMP code; the __gmp_expr<...> stuff isn't particularly relevant - it is just template boilerplate to make complex expressions evaluate without too many temporaries, and gets completely inlined. The bulk of time is spent on the top of those towers, where the actual calculations are performed.
Indeed, ultimately most of the time is spent in GMP primitives and in memory allocation.
Given that we cannot touch the GMP internals, the only thing we can do is to be more careful using it, as each GMP operation is indeed costly.
Indeed it's important to keep in mind that, while the compiler can avoid calculating the same expressions multiple times for double values, it cannot do the same for GMP values, as they both have side effects (memory allocation, external function calls) and are too complicated for it to examine anyway. In your inner loop we have:
T x, y;
x = 0;
y = 0;
int i;
for (i = 0; i < 1000 && x*x + y*y <= 65536; i++) {
    T xtemp = x*x - y*y + x0;
(T is the generic type I'm using, defined to double or mpf_class)
Here you are calculating x*x and y*y twice at each iteration. We can optimize it as:
T x = 0, y = 0, xsq, ysq;
for (i = 0; i < 1000; i++) {
    xsq = x*x;
    ysq = y*y;
    if (xsq + ysq > 65536) break;
    T xtemp = xsq - ysq + x0;
Re-running the GMP version with this modification we get down to 38 seconds, which is an 18% improvement.
Notice that we kept xsq and ysq out of the loop to avoid re-creating them at each iteration; this is because, unlike double values (which ultimately are just register space or, at worst, stack space, both of which are free and handled statically by the compiler), mpf_class objects aren't free to re-create every time, as was hinted by the prominence of memory allocation functions in the profiler trace above; I'm not entirely aware of the inner workings of the GMP C++ wrapper, but I suspect that it enjoys optimizations similar to std::vector - on assignment, an already allocated value will be able to recycle its storage instead of allocating again.
Hence, we can hoist even the xtemp definition out of the loop
int iterate(T x0, T y0) {
    T x = 0, y = 0, xsq, ysq, xtemp;
    int i;
    for (i = 0; i < 1000; i++) {
        xsq = x*x;
        ysq = y*y;
        if (xsq + ysq > 65536) break;
        xtemp = xsq - ysq + x0;
        y = 2*x*y + y0;
        x = xtemp;
    }
    return i;
}
which brings the runtime down to 33 seconds, which is 27% less than the original time.
The flamegraph is similar to the one before, but it seems more compact - we have shaved off some of the "interstitial" wastes of time, leaving just the core of the calculations. Most importantly, looking at the top hotspots, we can indeed see that multiplication switched places with subtraction, and malloc/free lost several positions.
I don't think that this can be optimized much more from a purely black-box perspective. If those are the calculations to do, I fear there's no easy way to perform them faster using GMP's mpf_class. At this point you should either:
drop GMP and use some other library, with better performance for the fixed-size case; I suspect there is something to gain here - even just avoiding completely allocations and inlining the calculation is a big win;
start applying some algorithmic optimizations; these will provide significantly shorter run time whatever data type you'll ultimately decide to use.
Notes
the full code (in its various iterations) can be found at https://bitbucket.org/mitalia/mandelbrot_gmp_test/
all tests done with g++ 7.3 with optimization level -O3 on 64 bit Linux running on my i7-6700
profiling performed using perf record from linux-tools 4.15 with call stack capture; graphs & tables generated by KDAB Hotspot 1.1.0
I'm trying to come up with a good way to evaluate the following function
#include <cmath>
#include <vector>

double foo(std::vector<double> const& x, double c = 0.95)
{
    auto N = x.size(); // Small power of 2 such as 512 or 1024
    double sum = 0;
    for (auto i = 0; i != N; ++i) {
        sum += (x[i] * pow(c, double(i)/N));
    }
    return sum;
}
My two main concerns with this naive implementation are performance and accuracy. So I suspect that the most trivial improvement would be to reverse the loop order: for (auto i = N-1; i != -1; --i) (The -1 wraps around, this is OK). This improves accuracy by adding smaller terms first.
While this is good for accuracy, it keeps the performance problem of pow. Numerically, pow(c, double(i)/N) is pow(c, (i-1)/N) * pow(c, 1/N). And the latter is a constant. So in theory we can replace pow with repeated multiplication. While good for performance, this hurts accuracy - errors will accumulate.
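For concreteness, the repeated-multiplication variant would look roughly like this (just a sketch of the idea, not something I consider final):

#include <cmath>
#include <cstddef>
#include <vector>

double foo_repeated(std::vector<double> const& x, double c = 0.95)
{
    auto N = x.size();
    const double r = std::pow(c, 1.0 / double(N)); // one pow() call in total
    double w = 1.0;                                // running approximation of c^(i/N)
    double sum = 0;
    for (std::size_t i = 0; i != N; ++i) {
        sum += x[i] * w;
        w *= r;       // rounding error accumulates here over N steps
    }
    return sum;
}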
I suspect that there's a significantly better algorithm hiding in here. For instance, the fact that N is a power of two means that there is a middle term x[N/2] that's multiplied with sqrt(c). That hints at a recursive solution.
On a somewhat related numerical observation, this looks like a signal multiplication with an exponential, so I naturally think : "FFT, trivial convolution=shift, IFFT", but that seems to offer no real benefit in terms of accuracy or performance.
So, is this a well-known problem with known solutions?
The task is a polynomial evaluation. The method for a single evaluation with the least operation count is the Horner scheme. In general a low operation count will reduce the accumulation of floating point noise.
As the example value c=0.95 is close to 1, any root will be still closer to 1 and thus lose accuracy. Avoid that by computing the difference to 1 directly, z=1-c^(1/n), via
z = -expm1(log(c)/N).
Now you have to evaluate the polynomial
sum of x[i] * (1-z)^i
which can be done by careful modification of the Horner scheme. Instead of
for (i = N; i-- > 0; ) {
    res = res*(1-z) + x[i];
}
use
for (i = N; i-- > 0; ) {
    res = (res + x[i]) - res*z;
}
which is mathematically equivalent but has the loss of digits in 1-z happen as late as possible, without using a more involved method like doubly accurate (compensated) addition.
In tests, those two methods, contrary to the intent, gave almost the same results. A substantial improvement could be observed by separating the result into its value at c=1 (that is, z=0) and a multiple of z, as in
double res0 = 0, resz = 0;
int i;
for (i = N; i-- > 0; ) {
    /* res0 + z*resz = (res0 + z*resz)*(1-z) + x[i]; */
    resz = resz - res0 - z*resz;
    res0 = res0 + x[i];
}
The test case that showed this improvement was for the coefficient sequence of
f(u) = (1-u/N)^(N-2)*(1-u)
where for N=1000 the evaluations result in
c z=1-c^(1/N) f(1-z) diff for 1st proc diff for 3rd proc
0.950000 0.000051291978909 0.000018898570629 1.33289104579937e-17 4.43845264361253e-19
0.951000 0.000050239954368 0.000018510931892 1.23765066121009e-16 -9.24959978401696e-19
0.952000 0.000049189034371 0.000018123700958 1.67678642238461e-17 -5.38712954453735e-19
0.953000 0.000048139216599 0.000017736876972 -2.86635949350855e-17 -2.37169225231204e-19
...
0.994000 0.000006018054217 0.000002217256601 1.31645860662263e-17 1.15619997300212e-19
0.995000 0.000005012529261 0.000001846785028 -4.15668713370839e-17 -3.5363625547867e-20
0.996000 0.000004008013365 0.000001476685973 8.48811716443534e-17 8.470329472543e-22
0.997000 0.000003004504507 0.000001106958687 1.44711343873661e-17 -2.92226366802734e-20
0.998000 0.000002002000667 0.000000737602425 5.6734266807093e-18 -6.56450534122083e-21
0.999000 0.000001000499833 0.000000368616443 -3.72557383333555e-17 1.47701370177469e-20
Yves' answer inspired me.
It seems that the best approach is to not calculate pow(c, 1.0/N) directly, but indirectly:
cc[0] = c; cc[1] = sqrt(cc[0]); cc[2] = sqrt(cc[1]); ...; cc[logN] = sqrt(cc[logN-1])
Or in binary,
cc[0]=c, cc[1]=c^0.1, cc[2]=c^0.01, cc[3]=c^0.001, ....
Now if we need x[0b100100] * c^0.100100, we can calculate that as x[0b100100]* c^0.1 * c^0.0001. I don't need to precalculate a table of size N, as geza suggested. A table of size log(N) is probably sufficient, and it can be created by repeatedly taking square roots.
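Putting the idea so far into code (a sketch only; names are illustrative, N is assumed to be a power of two, and this still spends O(log N) multiplies per element - the refinement below removes that):

#include <cmath>
#include <cstddef>
#include <vector>

double foo_sqrt_table(std::vector<double> const& x, double c = 0.95)
{
    const std::size_t N = x.size();        // assumed to be a power of two
    std::size_t logN = 0;
    while ((std::size_t(1) << logN) < N) ++logN;

    // cc[k] = c^(1/2^k), built by repeated square roots
    std::vector<double> cc(logN + 1);
    cc[0] = c;
    for (std::size_t k = 1; k <= logN; ++k)
        cc[k] = std::sqrt(cc[k - 1]);

    double sum = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        double w = 1.0;                    // builds c^(i/N) from the set bits of i
        for (std::size_t b = 0; b < logN; ++b)
            if (i & (std::size_t(1) << b))
                w *= cc[logN - b];         // bit b contributes c^(2^b / N)
        sum += x[i] * w;
    }
    return sum;
}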
[edit]
As pointed out in a comment thread on another answer, pairwise summation is very effective in keeping errors under control. And it happens to combine extremely nicely with this answer.
We start by observing that we sum
x[0] * c^0.0000000
x[1] * c^0.0000001
x[2] * c^0.0000010
x[3] * c^0.0000011
...
So, we run log(N) iterations. In iteration 1, we add the N/2 pairs x[i] + x[i+1]*c^0.0000001 and store the result in x[i/2]. In iteration 2, we add the pairs x[i] + x[i+1]*c^0.0000010, etcetera. The chief difference with normal pairwise summation is that this is a multiply-and-add in each step.
We see now that in each iteration, we're using the same multiplier pow(c, 2^i/N), which means we only need to calculate log(N) multipliers. It's also quite cache-efficient, as we're doing only contiguous memory access. It also allows for easy SIMD parallelization, especially when you have FMA instructions.
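Here is a sketch of that combine pass (illustrative names; N is assumed to be a power of two, and the vector is taken by value so it can be collapsed in place):

#include <cmath>
#include <cstddef>
#include <vector>

double foo_pairwise(std::vector<double> x, double c = 0.95)
{
    const std::size_t N = x.size();              // assumed to be a power of two
    double m = std::pow(c, 1.0 / double(N));     // could also come from repeated square roots
    for (std::size_t n = N; n > 1; n /= 2) {
        for (std::size_t i = 0; i < n; i += 2)
            x[i / 2] = x[i] + x[i + 1] * m;      // one multiply-add per pair
        m *= m;                                  // next level uses c^(2^(k+1) / N)
    }
    return x[0];
}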
If N is a power of 2, you can replace the evaluations of the powers by geometric means, using
a^((i+j)/2) = √(a^i · a^j)
and recursively subdivide from c^(N/N) · c^(0/N). With preorder recursion, you can make sure to accumulate by increasing weights.
Anyway, the speedup of sqrt vs. pow might be marginal.
You can also stop recursion at a certain level and continue linearly, with mere products.
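A minimal recursive sketch of the subdivision (plain divide-and-conquer rather than the preorder accumulation mentioned above; names are illustrative and N must be a power of two):

#include <cmath>
#include <cstddef>
#include <vector>

// Evaluates sum of x[i] * c^(i/N) for i in [lo, lo+len), given clo = c^(lo/N)
// and chi = c^((lo+len)/N); the midpoint power is the geometric mean sqrt(clo*chi).
static double eval_range(const std::vector<double>& x, std::size_t lo, std::size_t len,
                         double clo, double chi)
{
    if (len == 1)
        return x[lo] * clo;
    const double cmid = std::sqrt(clo * chi);    // == c^((lo + len/2)/N)
    return eval_range(x, lo, len / 2, clo, cmid)
         + eval_range(x, lo + len / 2, len / 2, cmid, chi);
}

double foo_recursive(const std::vector<double>& x, double c = 0.95)
{
    return eval_range(x, 0, x.size(), 1.0, c);   // c^(0/N) = 1 and c^(N/N) = c
}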
You could mix repeated multiplication by pow(c, 1./N) with some explicit pow calls. I.e. every 16th iteration or so do a real pow and otherwise move forward with the multiply. This should yield large performance benefits at negligible accuracy cost.
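A sketch of that mix (K = 16 is an arbitrary resync period; names are illustrative):

#include <cmath>
#include <cstddef>
#include <vector>

double foo_resync(const std::vector<double>& x, double c = 0.95)
{
    const std::size_t N = x.size();
    const double r = std::pow(c, 1.0 / double(N));
    const std::size_t K = 16;                         // resync period (arbitrary choice)
    double w = 1.0;                                   // running approximation of c^(i/N)
    double sum = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        if (i % K == 0)
            w = std::pow(c, double(i) / double(N));   // exact value every K steps
        sum += x[i] * w;
        w *= r;                                       // cheap step until the next resync
    }
    return sum;
}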
Depending on how much c varies, you might even be able to precompute and replace all pow calls with a lookup, or just the ones needed in the above method (= smaller lookup table = better caching).
I am trying to construct a summed area table for later use in an adaptive thresholding routine. Since this code is going to be used in time critical software, I am trying to squeeze as many cycles as possible out of it.
For performance, the table holds an unsigned integer for every pixel.
When I attach my profiler, I am showing that my largest performance bottleneck occurs when performing the x-pass.
The simple math expression for the computation is:
sat_[y * width + x] = sat_[y * width + x - 1] + buff_[y * width + x]
where the running sum resets at every new y position.
In this case, sat_ is a 1-D pointer of unsigned integers representing the SAT, and buff_ is an 8-bit unsigned monochrome buffer.
My implementation looks like the following:
uint *pSat = sat_;
char *pBuff = buff_;
for (size_t y = 0; y < height; ++y, pSat += width, pBuff += width)
{
    uint curr = 0;
    for (uint x = 0; x < width; x += 4)
    {
        pSat[x + 0] = curr += pBuff[x + 0];
        pSat[x + 1] = curr += pBuff[x + 1];
        pSat[x + 2] = curr += pBuff[x + 2];
        pSat[x + 3] = curr += pBuff[x + 3];
    }
}
The loop is unrolled manually because my compiler (VC11) didn't do it for me. The problem I have is that the entire segmentation routine is spending an extraordinary amount of time just running through that loop, and I am wondering if anyone has any thoughts on what might speed it up. I have access to all of the SSE instruction sets, and AVX, on any machine this routine will run on, so if there is something there, that would be extremely useful.
Also, once I squeeze out the last cycles, I then plan on extending this to multi-core, but I want to get the single thread computation as tight as possible before I make the model more complex.
You have a dependency chain running along each row; each result depends on the previous one. So you cannot vectorise/parallelise in that direction.
But, it sounds like each row is independent of all the others, so you can vectorise/paralellise by computing multiple rows simultaneously. You'd need to transpose your arrays, in order to allow the vector instructions to access neighbouring elements in memory.*
However, that creates a problem. Walking along rows would now be absolutely terrible from a cache point of view (every iteration would be a cache miss). The way to solve this is to interchange the loop order.
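A rough sketch of what that looks like (assuming transposed, column-major copies of the buffers; the names are illustrative):

#include <cstddef>
#include <vector>

// tbuff/tsat are transposed copies laid out as t[x * height + y], so for a fixed x
// the per-row running sums are contiguous in memory; the inner loop over y has no
// cross-iteration dependency and is straightforward for the compiler to vectorize.
void sat_x_pass_transposed(const unsigned char *tbuff, unsigned *tsat,
                           std::size_t width, std::size_t height)
{
    std::vector<unsigned> curr(height, 0);   // one running sum per row
    for (std::size_t x = 0; x < width; ++x) {
        const unsigned char *col = tbuff + x * height;
        unsigned *out = tsat + x * height;
        for (std::size_t y = 0; y < height; ++y) {
            curr[y] += col[y];
            out[y] = curr[y];
        }
    }
}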
Note, though, that each element is read precisely once. And you're doing very little computation per element. So you'll basically be limited by main-memory bandwidth well before you hit 100% CPU usage.
* This restriction may be lifted in AVX2, I'm not sure...
Algorithmically, I don't think there is anything you can do to optimize this further. Even though you didn't use the term OLAP cube in your description, you are basically just building an OLAP cube. The code you have is the standard approach to building an OLAP cube.
If you give details about the hardware you're working with, there might be some optimizations available. For example, there is a GPU programming approach that may or may not be faster. Note: Another post on this thread mentioned that parallelization is not possible. This isn't necessarily true... Your algorithm can't be implemented in parallel, but there are algorithms that maintain data-level parallelism, which could be exploited with a GPU approach.
I am trying to improve my programming skills for an assignment that will be released soon; it involves solving a problem while making it run as efficiently and as fast as possible. I know this is a fairly restrained/small piece of code, but what, if anything, would make it run faster?
The method takes an array which holds details of transactions; there are 100 transactions, and that count is used to control the loop. So I am getting the average number of shares and then returning it. Not fluent in English, so hopefully this makes sense. Thanks.
double Analyser::averageVolume()
{
    // Your code
    double averageNumShares = 0;
    for (int i = 0; i < nTransactions; i++)
    {
        averageNumShares += tArray[i].numShares;
    }
    averageNumShares = averageNumShares / nTransactions;
    return averageNumShares;
    //return 0
}
If you need to compute the average of n numbers, I'm afraid you can't speed it up much past the linear-time approach in your sample code.
Unless this is used as part of another more complex algorithm where you might be able to get away with not having to compute the average or something along these lines, taking an average is going to be an O(n) operation which basically involves summing all elements of the array and one division by the number of elements. Which is exactly what you have.
Why not have two other values in the object - a running total and the number of items?
Then computing the average can make use of those numbers. Quickly and simply (could be an inline function!).
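A minimal sketch of that idea (the member and method names here are assumptions, not taken from the existing class):

class Analyser
{
public:
    // Called wherever a transaction is recorded; keeps the totals up to date.
    void recordTransaction(double numShares)
    {
        runningTotal_ += numShares;
        ++nItems_;
    }

    // Now O(1) instead of O(n).
    double averageVolume() const
    {
        return nItems_ ? runningTotal_ / nItems_ : 0.0;
    }

private:
    double runningTotal_ = 0.0;
    int nItems_ = 0;
};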
Here is one additional approach, similar to that suggested by Ed Heal, which should be less sensitive to roundoff errors. The roundoff error of the average grows with the size of the accumulated sum. This may or may not be an issue for you, but it is something to be aware of.
Here is an iterative algorithm that minimizes roundoff error in the average, which I first came across in an old edition (circa 1998) of Ross:
double Analyser::averageVolume()
{
    double averageNumShares = 0.0;
    for (int i = 0; i < nTransactions; i++)
    {
        double delta = (tArray[i].numShares - averageNumShares) / (i + 1);
        averageNumShares += delta;
    }
    return averageNumShares;
}
This works by deriving a recursive definition of the average. That is, given samples x[1], ..., x[j], ..., x[N], you can calculate the average of the first M+1 samples from sample x[M+1] and the average of the first M samples:
sum(M) = x[1] + x[2] + ... + x[M]
thus avg(M+1) = sum(M+1)/(M+1) and avg(M) = sum(M)/M
avg(M+1) - avg(M) = sum(M+1)/(M+1) - sum(M)/M
= [ M*sum(M+1) - (M+1)*sum(M) ]/[ M * (M+1) ]
= [ M*(x[M+1] + sum(M)) - M*sum(M) - sum(M) ] / [ M*(M+1) ]
= [ M*x[M+1] - sum(M) ] / [ M*(M+1) ]
= [ x[M+1] - avg(M) ] / (M+1)
thus: avg(M+1) = avg(M) + [ x[M+1] - avg(M) ]/(M+1)
To get a sense of the roundoff error for the two approaches, try computing the average of 10^7 samples, each sample equal to 1035.41. Your original approach returns (on my hardware) an average of 1035.40999988683. The iterative approach above returns the exact average of 1035.41.
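Here is a small test sketch of that comparison (exact figures will vary with hardware and compiler):

#include <cstdio>

int main()
{
    const int n = 10000000;          // 10^7 samples
    const double sample = 1035.41;

    double sum = 0.0;                // plain running sum, then one division
    for (int i = 0; i < n; ++i)
        sum += sample;
    const double naive = sum / n;

    double avg = 0.0;                // iterative update from the answer above
    for (int i = 0; i < n; ++i)
        avg += (sample - avg) / (i + 1);

    std::printf("naive:     %.11f\niterative: %.11f\n", naive, avg);
    return 0;
}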
Both, unfortunately, are O(N) at some point. Your original scheme has N additions and one division. The iterative scheme has N additions, subtractions, and divisions, so you pay a bit more for the accuracy.
If you use gcc, change the optimization level.
Short answer:
This code is as good as it gets with respect to speed. What you can tweak is how you compile it. Or obviously, rewrite it in assembly if that is an option.
"Stretched" answer:
Now... if you really want to try getting better performance, have already tried using all compiler optimization flags and optimizations available, and you are ready to compromise code readability for possibly more speed, you could consider rewriting:
for (int i = 0; i < nTransactions; i++)
{
    averageNumShares += tArray[i].numShares;
}
as
// Walk the array with a pointer so each iteration is a single fixed-size step.
const auto* p = &tArray[0];
for (int i = 0; i < nTransactions; i++, ++p)
{
    averageNumShares += p->numShares;
}
It could be that you get better performance by showing the compiler that all you do is jump over a fixed offset at each loop iteration. A good compiler should be able to see that with your initial code, and the new code could get you worse performance. It really depends on the specifics of your compiler, and again, I do not recommend that approach unless you are desperate for better performance than what the compiler can offer and don't want to jump to using inline assembly or intrinsics (if any are available).
auto q = &tArray[0]; // pointer to the first transaction
int i = nTransactions;
while (i--) { // test for 0 is faster
    averageNumShares += (q++)->numShares; // incrementing a pointer is faster than indexing
}