Sparse matrix-dense vector multiplication with matrix known at compile time - c++

I have a sparse matrix with only zeros and ones as entries (and, for example, with shape 32k x 64k and 0.01% non-zero entries and no patterns to exploit in terms of where the non-zero entries are). The matrix is known at compile time. I want to perform matrix-vector multiplication (modulo 2) with non-sparse vectors (not known at compile time) containing 50% ones and zeros. I want this to be efficient, in particular, I'm trying to make use of the fact that the matrix is known at compile time.
Storing the matrix in an efficient format (saving only the indices of the "ones") will always take a few Mbytes of memory and directly embedding the matrix into the executable seems like a good idea to me. My first idea was to just automatically generate the C++ code that just assigns all the result vector entries to the sum of the correct input entries. This looks like this:
constexpr std::size_t N = 64'000;
constexpr std::size_t M = 32'000;
template<typename Bit>
void multiply(const std::array<Bit, N> &in, std::array<Bit, M> &out) {
out[0] = (in[11200] + in[21960] + in[29430] + in[36850] + in[44352] + in[49019] + in[52014] + in[54585] + in[57077] + in[59238] + in[60360] + in[61120] + in[61867] + in[62608] + in[63352] ) % 2;
out[1] = (in[1] + in[11201] + in[21961] + in[29431] + in[36851] + in[44353] + in[49020] + in[52015] + in[54586] + in[57078] + in[59239] + in[60361] + in[61121] + in[61868] + in[62609] + in[63353] ) % 2;
out[2] = (in[11202] + in[21962] + in[29432] + in[36852] + in[44354] + in[49021] + in[52016] + in[54587] + in[57079] + in[59240] + in[60362] + in[61122] + in[61869] + in[62610] + in[63354] ) % 2;
out[3] = (in[56836] + in[11203] + in[21963] + in[29433] + in[36853] + in[44355] + in[49022] + in[52017] + in[54588] + in[57080] + in[59241] + in[60110] + in[61123] + in[61870] + in[62588] + in[63355] ) % 2;
// LOTS more of this...
out[31999] = (in[10208] + in[21245] + in[29208] + in[36797] + in[40359] + in[48193] + in[52009] + in[54545] + in[56941] + in[59093] + in[60255] + in[61025] + in[61779] + in[62309] + in[62616] + in[63858] ) % 2;
}
This does in fact work (takes ages to compile). However, it actually seems to be very slow (more than 10x slower than the same Sparse vector-matrix multiplication in Julia) and also to blow up the executable size significantly more than I would have thought necessary. I tried this with both std::array and std::vector, and with the individual entries (represented as Bit) being bool, std::uint8_t and int, to no progress worth mentioning. I also tried replacing the modulo and addition by XOR. In conclusion, this is a terrible idea. I'm not sure why though - is the sheer codesize slowing it down that much? Does this kind of code rule out compiler optimization?
I haven't tried any alternatives yet. The next idea I have is storing the indices as compile-time constant arrays (still giving me huge .cpp files) and looping over them. Initially, I expected doing this would lead the compiler optimization to generate the same binary as from my automatically generated C++ code. Do you think this is worth trying (I guess I will try anyway on monday)?
Another idea would be to try storing the input (and maybe also output?) vector as packed bits and perform the calculation like that. I would expect one can't get around a lot of bit-shifting or and-operations and this would end up being slower and worse overall.
Do you have any other ideas on how this might be done?

I'm not sure why though - is the sheer codesize slowing it down that much?
The problem is that the executable is big, so the OS has to fetch a lot of pages from your storage device. This process is very slow, and the processor will often stall waiting for data to be loaded. Even if the code is already in RAM (OS caching), it is inefficient because RAM is slow compared with the processor (both in latency and in throughput). The main issue here is that every instruction is executed only once per call. If you reuse the function, the code needs to be reloaded from the cache, and if it is too big to fit in the cache, it will be loaded from the slow RAM. Thus, the overhead of loading the code is very high compared to its actual execution. To overcome this problem, you need fairly small code with loops iterating over a fairly small amount of data.
Does this kind of code rule out compiler optimization?
This depends on the compiler, but most mainstream compilers (e.g. GCC or Clang) will optimize the code the same way (hence the slow compilation time).
Do you think this is worth trying (I guess I will try anyway on monday)?
Yes, this solution is clearly better, especially if the indices are stored in a compact way. In your case, you can store them using a uint16_t type. All the indices can be put in one big buffer. The starting/ending position of the indices for each row can be specified in another buffer referencing the first one (or using pointers). These buffers can be loaded into memory once at the beginning of your application from a dedicated file to reduce the size of the resulting program (and avoid fetches from the storage device in a critical loop). With 0.01% non-zero values, the resulting data structure will take less than 500 KiB of RAM (about 205,000 indices of 2 bytes each). On an average mainstream desktop processor, it can fit in the L3 cache (which is rather fast), and I think that your computation should not take more than 1 ms, assuming the code of multiply is carefully optimized.
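A minimal sketch of the row-oriented layout described above, assuming a tiny hypothetical 3 x 8 matrix. The names col_idx, row_start and multiply_mod2 are illustrative and do not come from the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 3 x 8 binary matrix, stored row by row:
// row 0 has ones in columns 1 and 4, row 1 in column 7,
// row 2 in columns 0, 2 and 5.
constexpr std::size_t kRows = 3;
constexpr std::size_t kCols = 8;
constexpr std::array<std::uint16_t, 6> col_idx = {1, 4, 7, 0, 2, 5};
// row_start[r] .. row_start[r + 1] delimits the entries of row r in col_idx.
constexpr std::array<std::uint32_t, kRows + 1> row_start = {0, 2, 3, 6};

// out[r] = XOR of in[c] over the ones in row r (the sum modulo 2).
inline void multiply_mod2(const std::array<std::uint8_t, kCols>& in,
                          std::array<std::uint8_t, kRows>& out) {
    for (std::size_t r = 0; r < kRows; ++r) {
        std::uint8_t acc = 0;
        for (std::uint32_t k = row_start[r]; k < row_start[r + 1]; ++k) {
            assert(col_idx[k] < kCols);
            acc = static_cast<std::uint8_t>(acc ^ (in[col_idx[k]] & 1u));
        }
        out[r] = acc;
    }
}
```

In this sketch the column indices fit in 16 bits, but row_start uses 32-bit entries: at the sizes in the question there would be roughly 205,000 non-zeros, which is more than a 16-bit offset can address.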
Another idea would be to try storing the input (and maybe also output?) vector as packed bits and perform the calculation like that.
Bit-packing is good only if your matrix is not too sparse. With a matrix filled with 50% of non-zero values, the bit-packing method is great. With 0.01% of non-zero values, the bit-packing method is clearly bad as it takes too much space.
I would expect one can't get around a lot of bit-shifting or and-operations and this would end up being slower and worse overall.
As previously said, loading data from the storage device or the RAM is very slow. Doing some bit-shifts is very fast on any modern mainstream processor (and much much faster than loading data).
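As a rough illustration of how little work the shifts are, here is a sketch that reads individual bits from an input vector packed 64 bits per word and XORs them together for one row. The helper names (get_bit, set_bit, row_parity) are hypothetical, and only the input vector is packed here, not the matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read element i of a vector packed 64 bits per word
// (bit 0 of word 0 is element 0).
inline unsigned get_bit(const std::vector<std::uint64_t>& packed, std::size_t i) {
    return static_cast<unsigned>((packed[i / 64] >> (i % 64)) & 1u);
}

// Set element i to 1.
inline void set_bit(std::vector<std::uint64_t>& packed, std::size_t i) {
    packed[i / 64] |= std::uint64_t{1} << (i % 64);
}

// One output bit: parity of the input bits selected by one row's column indices.
inline unsigned row_parity(const std::vector<std::uint64_t>& packed_in,
                           const std::vector<std::uint16_t>& cols) {
    unsigned acc = 0;
    for (std::uint16_t c : cols) acc ^= get_bit(packed_in, c);
    return acc;
}
```

Packed this way, a 64,000-entry input takes 8,000 bytes instead of 64,000 bytes with one byte per entry. Whether that pays off for this access pattern would need measuring.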
Here are the approximate timings for various operations that a computer can do:

I implemented the second method (constexpr arrays storing the matrix in compressed column storage format) and it is a lot better. It takes (for a 64'000 x 22'000 binary matrix containing 35'000 ones) <1min to compile with -O3 and performs one multiplication in <300 microseconds on my laptop (Julia takes around 350 microseconds for the same calculation). The total executable size is ~1 Mbyte.
Probably one can still do a lot better. If anyone has an idea, let me know!
Below is a code example (showing a 5x10 matrix) illustrating what I did.
#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>
// Compressed sparse column storage for binary matrix
constexpr std::size_t M = 5;
constexpr std::size_t N = 10;
constexpr std::size_t num_nz = 5;
constexpr std::array<std::uint16_t, N + 1> colptr = {
0x0,0x1,0x2,0x3,0x4,0x5,0x5,0x5,0x5,0x5,0x5
};
constexpr std::array<std::uint16_t, num_nz> row_idx = {
0x0,0x1,0x2,0x3,0x4
};
template<typename Bit>
constexpr void encode(const std::array<Bit, N>& in, std::array<Bit, M>& out) {
for (std::size_t col = 0; col < N; col++) {
for (std::size_t j = colptr[col]; j < colptr[col + 1]; j++) {
out[row_idx[j]] = (static_cast<bool>(out[row_idx[j]]) != static_cast<bool>(in[col]));
}
}
}
int main() {
using Bit = bool;
std::array<Bit, N> input{1, 0, 1, 0, 1, 1, 0, 1, 0, 1};
std::array<Bit, M> output{};
for (auto i : input) std::cout << i;
std::cout << std::endl;
encode(input, output);
for (auto i : output) std::cout << i;
}

Related

How to convert 3 addition and 1 multiply into vectorized SIMD using intrinsic functions C++

I'm working on a problem using a 2D prefix sum, also called a Summed-Area Table S. For a 2D array I (grayscale image/matrix/etc), its definition is:
S[x][y] = S[x-1][y] + S[x][y-1] - S[x-1][y-1] + I[x][y]
Sqr[x][y] = Sqr[x-1][y] + Sqr[x][y-1] - Sqr[x-1][y-1] + I[x][y]^2
Calculating the sum of a sub-matrix with two corners (top,left) and (bot,right) can be done in O(1):
sum = S[bot][right] - S[bot][left-1] - S[top-1][right] + S[top-1][left-1]
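A small sketch of the O(1) query above, assuming the table is stored row-major in a flat vector with w columns. The function name rect_sum and the border handling for top == 0 or left == 0 are assumptions added for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum of the sub-matrix with inclusive corners (top, left) and (bot, right),
// read from a summed-area table S stored row-major with w columns.
inline std::int64_t rect_sum(const std::vector<std::int64_t>& S, std::size_t w,
                             std::size_t top, std::size_t left,
                             std::size_t bot, std::size_t right) {
    assert(top <= bot && left <= right && right < w);
    std::int64_t sum = S[bot * w + right];
    // Terms that would index row -1 or column -1 are treated as zero.
    if (left > 0) sum -= S[bot * w + (left - 1)];
    if (top > 0) sum -= S[(top - 1) * w + right];
    if (top > 0 && left > 0) sum += S[(top - 1) * w + (left - 1)];
    return sum;
}
```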
One of my problems is to calculate all possible sub-matrix sums with a constant size (bot-top == right-left == R), which are then used to calculate their mean/variance. I've vectorized it to the form below.
lineSize is the number of elements to be processed at once. I chose lineSize = 16 because Intel CPU AVX instructions can work on 8 doubles at the same time. It can be 8/16/32/...
#define cell(i, j, w) ((i)*(w) + (j))
const int lineSize = 16;
const int R = 3; // any integer
const int submatArea = (R+1)*(R+1);
const double submatAreaInv = double(1) / submatArea;
void subMatrixVarMulti(int64* S, int64* Sqr, int top, int left, int bot, int right, int w, int h, int diff, double submatAreaInv, double mean[lineSize], double var[lineSize])
{
const int indexCache = cell(top, left, w),
indexTopLeft = cell(top - 1, left - 1, w),
indexTopRight = cell(top - 1, right, w),
indexBotLeft = cell(bot, left - 1, w),
indexBotRight = cell(bot, right, w);
for (int i = 0; i < lineSize; i++) {
mean[i] = (S[indexBotRight+i] - S[indexBotLeft+i] - S[indexTopRight+i] + S[indexTopLeft+i]) * submatAreaInv;
var[i] = (Sqr[indexBotRight + i] - Sqr[indexBotLeft + i] - Sqr[indexTopRight + i] + Sqr[indexTopLeft + i]) * submatAreaInv
- mean[i] * mean[i];
}
}
How can I optimize the above loop to have the highest possible speed? Readability doesn't matter. I heard it can be done using AVX2 and intrinsic functions, but I don't know how.
Edit: the CPU is i7-7700HQ, kabylake = skylake family
Edit 2: forgot to mention that lineSize, R, ... are already const
Your compiler can generate AVX/AVX2/AVX-512 instructions for you, but you need to:
Select the latest available architecture when compiling. For example with GCC you might say -march=skylake if you know your code will run on Skylake and later, but does not need to support older CPUs. Without this, AVX instructions cannot be generated.
Add restrict or __restrict to your pointer inputs to tell the compiler they do not overlap. This applies to S and Sqr, as well as mean and var (both pairs have the same type, so the compiler assumes they might overlap, but you know they do not).
Make sure your data is "over-aligned." For example if you want the compiler to use 256-bit AVX2 instructions, you should align your arrays to 256 bits. There are a few ways to do this, such as making a typedef with the alignment, or using alignas() or std::assume_aligned() (available as a GCC attribute prior to C++20). The point is you need the compiler to know that S, Sqr, mean and var are aligned to the largest SIMD vector size available on your target architecture, so that it does not have to generate as much fixup code.
Use constexpr where possible, such as lineSize.
Most importantly, profile to compare performance as you make changes, and look at the generated code (e.g. g++ -S) to see if it looks the way you want it to.
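The restrict and constexpr hints above could look roughly like this. It is a sketch only: the function name and parameter list are simplified assumptions rather than the asker's exact signature, and __restrict is a compiler extension (GCC, Clang, MSVC), not standard C++.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kLineSize = 16;  // compile-time trip count for the loop

// The __restrict qualifiers promise the compiler that the four buffers
// do not overlap, which removes the need for runtime aliasing checks.
inline void sub_matrix_mean_var(const std::int64_t* __restrict S,
                                const std::int64_t* __restrict Sqr,
                                int idxTopLeft, int idxTopRight,
                                int idxBotLeft, int idxBotRight,
                                double areaInv,
                                double* __restrict mean,
                                double* __restrict var) {
    for (int i = 0; i < kLineSize; ++i) {
        const double m = static_cast<double>(
            S[idxBotRight + i] - S[idxBotLeft + i]
            - S[idxTopRight + i] + S[idxTopLeft + i]) * areaInv;
        const double q = static_cast<double>(
            Sqr[idxBotRight + i] - Sqr[idxBotLeft + i]
            - Sqr[idxTopRight + i] + Sqr[idxTopLeft + i]) * areaInv;
        mean[i] = m;
        var[i] = q - m * m;
    }
}
```

Whether the loop is actually vectorized still depends on the target flags (for example -march=skylake with GCC), so checking the generated assembly remains necessary.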
I don't think you can efficiently perform this type of sum using SIMD, due to the dependencies of the summation.
Instead, you can do the computation differently, in a way that can be trivially optimized with SIMD:
Compute row-only partial sums. You can parallelize this with SIMD by computing multiple rows simultaneously.
Then, with the rows summed up, compute column-only partial sums into the output using the same SIMD optimization; the result is your desired Summed-Area Table.
You can do the same for both summation and summation of squares.
The only issue is you need extra memory and this type of computation requires more memory accesses. The extra memory is probably a minor thing but more memory access perhaps can be improved by storing the temporary data (the sums of rows) in a cache friendly manner. You'll probably need to experiment with this.
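A scalar sketch of the two-pass construction described above, assuming a row-major 8-bit image. The function name build_sat is hypothetical, and this version writes both passes into the output table rather than using a separate temporary buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a summed-area table from a row-major w x h image in two passes.
inline std::vector<std::int64_t> build_sat(const std::vector<std::uint8_t>& img,
                                           std::size_t w, std::size_t h) {
    assert(img.size() == w * h);
    std::vector<std::int64_t> sat(w * h);
    // Pass 1: prefix sums along each row. Rows are independent of each other.
    for (std::size_t y = 0; y < h; ++y) {
        std::int64_t run = 0;
        for (std::size_t x = 0; x < w; ++x) {
            run += img[y * w + x];
            sat[y * w + x] = run;
        }
    }
    // Pass 2: add the row above. Neighbouring x positions do not depend on
    // each other, so the inner loop is a candidate for auto-vectorization.
    for (std::size_t y = 1; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            sat[y * w + x] += sat[(y - 1) * w + x];
    return sat;
}
```

The same function shape would work for the table of squares by accumulating the squared pixel value in the first pass.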

OpenCL crash on big 2d range

In my program, I need to run the kernel once on every item of a large 2D array. The program works correctly for small ranges - up to around 50x50, sometimes up to 100x100.
For bigger datasets however, calling the kernel causes the video card driver to crash.
I have tested this program on two computers with different AMD cards, and they exhibit the exact same behaviour. Other, one-dimensional kernels work properly, even for huge datasets of ~10 000 x 10 000 items.
Also, removing the i variable from the matrix[i + (N + 1) * j] expression causes the kernel to work without errors.
Am I setting the range incorrectly, making a mistake in the kernel, or maybe the problem lies elsewhere?
enqueued range:
cl::EnqueueArgs args(queue,cl::NDRange(offset, offset+1),cl::NDRange(N+1, N),cl::NullRange);
kernel:
void kernel sub(global float* matrix, global const float* vec, int N, int offset) {
int i = get_global_id(0);
int j = get_global_id(1);
matrix[i + (N + 1) * j] -= matrix[i + (N + 1) * offset] * vec[j];
}
One possible reason: if your kernel runs for too long, the driver may drop it. Dice the problem area up into smaller blocks.
Consider this, for a 100x100 input array you will use N=100, hence the maximum value of i in your kernel will be 100 because of the N+1 used in the enqueue args, while the maximum for j will be 99. I have assumed that offset = 0. Therefore i + (N + 1) * j = 100 + 101*99 = 10099 which is outside of your 2D array.
When offset = 1, the minimums for i and j will be 1 and 2 respectively, while the maximums will be 101 and 100. Therefore i + (N + 1) * j = 101 + 101*100 = 10201.
In my experience, GPUs are not very good at catching segmentation faults when accessing global memory. Your attempt at purposefully creating one may work on some cards sometimes but no guarantees.
The problem could be caused by local-work-size and global-work-size. It is important while using two dimensional arrays to properly calculate them. It could be that for big values your global_id(0) is bigger than you specified in clEnqueueNDRangeKernel().

Efficiently Building Summed Area Table

I am trying to construct a summed area table for later use in an adaptive thresholding routine. Since this code is going to be used in time critical software, I am trying to squeeze as many cycles as possible out of it.
For performance, the table is unsigned integers for every pixel.
When I attach my profiler, I am showing that my largest performance bottleneck occurs when performing the x-pass.
The simple math expression for the computation is:
sat_[y * width + x] = sat_[y * width + x - 1] + buff_[y * width + x]
where the running sum resets at every new y position.
In this case, sat_ is a 1-D pointer of unsigned integers representing the SAT, and buff_ is an 8-bit unsigned monochrome buffer.
My implementation looks like the following:
uint *pSat = sat_;
unsigned char *pBuff = buff_; // buff_ holds unsigned 8-bit pixels
for (size_t y = 0; y < height; ++y, pSat += width, pBuff += width)
{
uint curr = 0;
for (uint x = 0; x < width; x += 4)
{
pSat[x + 0] = curr += pBuff[x + 0];
pSat[x + 1] = curr += pBuff[x + 1];
pSat[x + 2] = curr += pBuff[x + 2];
pSat[x + 3] = curr += pBuff[x + 3];
}
}
The loop is unrolled manually because my compiler (VC11) didn't do it for me. The problem I have is that the entire segmentation routine is spending an extraordinary amount of time just running through that loop, and I am wondering if anyone has any thoughts on what might speed it up. I have access to all of the SSE's sets, and AVX for any machine this routine will run on, so if there is something there, that would be extremely useful.
Also, once I squeeze out the last cycles, I then plan on extending this to multi-core, but I want to get the single thread computation as tight as possible before I make the model more complex.
You have a dependency chain running along each row; each result depends on the previous one. So you cannot vectorise/parallelise in that direction.
But, it sounds like each row is independent of all the others, so you can vectorise/parallelise by computing multiple rows simultaneously. You'd need to transpose your arrays, in order to allow the vector instructions to access neighbouring elements in memory.*
However, that creates a problem. Walking along rows would now be absolutely terrible from a cache point of view (every iteration would be a cache miss). The way to solve this is to interchange the loop order.
Note, though, that each element is read precisely once. And you're doing very little computation per element. So you'll basically be limited by main-memory bandwidth well before you hit 100% CPU usage.
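One way the "multiple rows at once" idea with interchanged loops might look, as a plain C++ sketch that leaves the vectorization to the compiler. It assumes the buffers are already stored transposed, with element (x, y) at index x * height + y; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-wise running sums computed on transposed storage: element (x, y) lives
// at index x * height + y, so all rows at a given x are contiguous in memory.
inline void row_prefix_sums_transposed(const std::vector<std::uint8_t>& buff_t,
                                       std::vector<std::uint32_t>& sat_t,
                                       std::size_t width, std::size_t height) {
    assert(buff_t.size() == width * height);
    assert(sat_t.size() == width * height);
    if (width == 0) return;
    for (std::size_t y = 0; y < height; ++y) sat_t[y] = buff_t[y];  // x == 0
    // The outer loop walks along the dependency chain (x). The inner loop runs
    // over independent rows with unit stride, the vectorisable direction.
    for (std::size_t x = 1; x < width; ++x)
        for (std::size_t y = 0; y < height; ++y)
            sat_t[x * height + y] =
                sat_t[(x - 1) * height + y] + buff_t[x * height + y];
}
```

As the answer notes, each element is touched once with very little arithmetic, so memory bandwidth may still dominate.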
* This restriction may be lifted in AVX2, I'm not sure...
Algorithmically, I don't think there is anything you can do to optimize this further. Even though you didn't use the term OLAP cube in your description, you are basically just building an OLAP cube. The code you have is the standard approach to building an OLAP cube.
If you give details about the hardware you're working with, there might be some optimizations available. For example, there is a GPU programming approach that may or may not be faster. Note: Another post on this thread mentioned that parallelization is not possible. This isn't necessarily true... Your algorithm can't be implemented in parallel, but there are algorithms that maintain data-level parallelism, which could be exploited with a GPU approach.

Optimize this function (in C++)

I have some CPU-intensive code, where a function with a loop is executed many times. Every optimization in this loop brings a noticeable performance gain. Question: How would you optimize this loop (there is not much more to optimize though...)?
void theloop(int64_t in[], int64_t out[], size_t N)
{
for(uint32_t i = 0; i < N; i++) {
int64_t v = in[i];
max += v;
if (v > max) max = v;
out[i] = max;
}
}
I tried a few things, e.g. I replaced the arrays with pointers that were incremented in every iteration, but (surprisingly) I lost some performance instead of gaining...
Edit:
changed the name of one variable (itsMaximums, error)
the function is a method of a class
in and out are int64_t, so they can be negative or positive
(v > max) can evaluate to true: consider the situation when the actual max is negative
the code runs on a 32-bit PC (development) and a 64-bit one (production)
N is unknown at compile time
I tried some SIMD, but I failed to increase performance... (the overhead of moving the variables to __m128i, executing and storing back was higher than the SSE speed gain. Yet I am not an expert on SSE, so maybe my code was poor)
Results:
I added some loop unfolding, and a nice hack from Alex's post. Below I paste some results:
1) original: 14.0s
2) unfolded loop (4 iterations): 10.44s
3) Alex's trick: 10.89s
4) 2) and 3) at once: 11.71s
Strange that 4) is not faster than 2) and 3). Below is the code for 4):
for(size_t i = 1; i < N; i+=CHUNK) {
int64_t t_in0 = in[i+0];
int64_t t_in1 = in[i+1];
int64_t t_in2 = in[i+2];
int64_t t_in3 = in[i+3];
max &= -max >> 63;
max += t_in0;
out[i+0] = max;
max &= -max >> 63;
max += t_in1;
out[i+1] = max;
max &= -max >> 63;
max += t_in2;
out[i+2] = max;
max &= -max >> 63;
max += t_in3;
out[i+3] = max;
}
First, you need to look at the generated assembly. Otherwise you have no way of knowing what actually happens when this loop is executed.
Now: is this code running on a 64-bit machine? If not, those 64-bit additions might hurt a bit.
This loop seems an obvious candidate for using SIMD instructions. SSE2 supports a number of SIMD instructions for integer arithmetic, including some that work on two 64-bit values.
Other than that, see if the compiler properly unrolls the loop, and if not, do so yourself. Unroll a couple of iterations of the loop, and then reorder the hell out of it. Put all the memory loads at the top of the loop, so they can be started as early as possible.
For the if line, check that the compiler is generating a conditional move, rather than a branch.
Finally, see if your compiler supports something like the restrict/__restrict keyword. It's not standard in C++, but it is very useful for indicating to the compiler that in and out do not point to the same addresses.
Is the size (N) known at compile-time? If so, make it a template parameter (and then try passing in and out as references to properly-sized arrays, as this may also help the compiler with aliasing analysis)
Just some thoughts off the top of my head. But again, study the disassembly. You need to know what the compiler does for you, and especially, what it doesn't do for you.
Edit
with your edit:
max &= -max >> 63;
max += t_in0;
out[i+0] = max;
what strikes me is that you added a huge dependency chain.
Before the result can be computed, max must be negated, the result must be shifted, the result of that must be and'ed together with its original value, and the result of that must be added to another variable.
In other words, all these operations have to be serialized. You can't start one of them before the previous one has finished. That's not necessarily a speedup. Modern pipelined out-of-order CPUs like to execute lots of things in parallel. Tying them up with a single long chain of dependent instructions is one of the most crippling things you can do. (Of course, if it can be interleaved with other iterations, it might work out better. But my gut feeling is that a simple conditional move instruction would be preferable.)
> #**Announcement** see [chat](https://chat.stackoverflow.com/rooms/5056/discussion-between-sehe-and-jakub-m)
> > _Hi Jakub, what would you say if I have found a version that uses a heuristic optimization that, for random data distributed uniformly will result in ~3.2x speed increase for `int64_t` (10.56x effective using `float`s)?_
>
I have yet to find the time to update the post, but the explanation and code can be found through the chat.
> I used the same test-bed code (below) to verify that the results are correct and exactly match the original implementation from your OP
**Edit**: ironically... that testbed had a fatal flaw, which rendered the results invalid: the heuristic version was in fact skipping parts of the input, but because existing output wasn't being cleared, it appeared to have the correct output... (still editing...)
Ok, I have published a benchmark based on your code versions, and also my proposed use of partial_sum.
Find all the code here https://gist.github.com/1368992#file_test.cpp
Features
For a default config of
#define MAGNITUDE 20
#define ITERATIONS 1024
#define VERIFICATION 1
#define VERBOSE 0
#define LIMITED_RANGE 0 // hide difference in output due to absence of overflows
#define USE_FLOATS 0
It will (see output fragment here):
run 100 x 1024 iterations (i.e. 100 different random seeds)
for data length 1048576 (2^20).
The random input data is uniformly distributed over the full range of the element data type (int64_t)
Verify output by generating a hash digest of the output array and comparing it to the reference implementation from the OP.
Results
There are a number of (surprising or unsurprising) results:
there is no significant performance difference between any of the algorithms whatsoever (for integer data), provided you are compiling with optimizations enabled. (See Makefile; my arch is 64bit, Intel Core Q9550 with gcc-4.6.1)
The algorithms are not equivalent (you'll see hash sums differ): notably the bit fiddle proposed by Alex doesn't handle integer overflow in quite the same way (this can be hidden defining
#define LIMITED_RANGE 1
which limits the input data so overflows won't occur; Note that the partial_sum_incorrect version shows equivalent C++ non-bitwise _arithmetic operations that yield the same different results:
return max<0 ? v : max + v;
Perhaps, it is ok for your purpose?)
Surprisingly, it is not more expensive to calculate both definitions of the max algorithm at once. You can see this being done inside partial_sum_correct: it calculates both 'formulations' of max in the same loop. This is really no more than trivia here, because neither of the two methods is significantly faster...
Even more surprisingly a big performance boost can be had when you are able to use float instead of int64_t. A quick and dirty hack can be applied to the benchmark
#define USE_FLOATS 1
showing that the STL based algorithm (partial_sum_incorrect) runs approximately 2.5x faster when using float instead of int64_t (!!!). Note:
that the naming of partial_sum_incorrect only relates to integer overflow, which doesn't apply to floats; this can be seen from the fact that the hashes match up, so in fact it is partial_sum_float_correct :)
that the current implementation of partial_sum_correct is doing double work that causes it to perform badly in floating point mode. See bullet 3.
(And there was that off-by-1 bug in the loop-unrolled version from the OP I mentioned before)
Partial sum
For your interest, the partial sum application looks like this in C++11:
std::partial_sum(data.begin(), data.end(), output.begin(),
[](int64_t max, int64_t v) -> int64_t
{
max += v;
if (v > max) max = v;
return max;
});
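For anyone wanting to try the snippet above in isolation, here is a self-contained version wrapped in a function. The wrapper name running_max_sum is hypothetical, and it assumes the running value starts from the first input element, which is what std::partial_sum does; the question does not show how max is initialized.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Running value from the question's loop, expressed with std::partial_sum.
// The first output element is a copy of the first input element.
inline std::vector<std::int64_t> running_max_sum(
        const std::vector<std::int64_t>& data) {
    std::vector<std::int64_t> output(data.size());
    std::partial_sum(data.begin(), data.end(), output.begin(),
                     [](std::int64_t max, std::int64_t v) -> std::int64_t {
                         max += v;
                         if (v > max) max = v;
                         return max;
                     });
    return output;
}
```

For the input -3, 5, -2, -4, 1 this produces -3, 5, 3, -1, 1.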
Sometimes, you need to step back and look it over again. The first question is obviously: do you need this? Could there be an alternative algorithm that would perform better?
That being said, and supposing for the sake of this question that you already settled on this algorithm, we can try and reason about what we actually have.
Disclaimer: the method I am describing is inspired by the successful method Tim Peters used to improve the traditional introsort implementation, leading to TimSort. So please bear with me ;)
1. Extracting Properties
The main issue I can see is the dependency between iterations, which will prevent much of the possible optimizations and thwart many attempts at parallelizing.
int64_t v = in[i];
max += v;
if (v > max) max = v;
out[i] = max;
Let us rework this code in a functional fashion:
max = calc(in[i], max);
out[i] = max;
Where:
int64_t calc(int64_t const in, int64_t const max) {
int64_t const bumped = max + in;
return in > bumped ? in : bumped;
}
Or rather, a simplified version (barring overflow since it's undefined):
int64_t calc(int64_t const in, int64_t const max) {
return 0 > max ? in : max + in;
}
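As a quick sanity check that the two formulations agree when no overflow occurs, here is a sketch that compares them over a small range of values. The names calc_direct, calc_simplified and agree_on_range are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Formulation taken directly from the loop body.
inline std::int64_t calc_direct(std::int64_t in, std::int64_t max) {
    const std::int64_t bumped = max + in;
    return in > bumped ? in : bumped;
}

// Simplified formulation: only the sign of max is tested.
inline std::int64_t calc_simplified(std::int64_t in, std::int64_t max) {
    return 0 > max ? in : max + in;
}

// True when both versions agree for every pair of values in [-limit, limit].
// The limit must be small enough that max + in cannot overflow.
inline bool agree_on_range(std::int64_t limit) {
    for (std::int64_t max = -limit; max <= limit; ++max)
        for (std::int64_t in = -limit; in <= limit; ++in)
            if (calc_direct(in, max) != calc_simplified(in, max)) return false;
    return true;
}
```

Both reduce to in + max when max is non-negative and to in otherwise, which is why the sign of max is the only thing that matters.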
Do you notice the tipping point? The behavior changes depending on whether the ill-named(*) max is positive or negative.
This tipping point makes it interesting to watch the values in in more closely, especially according to the effect they might have on max:
max < 0 and in[i] < 0 then out[i] = in[i] < 0
max < 0 and in[i] > 0 then out[i] = in[i] > 0
max > 0 and in[i] < 0 then out[i] = (max + in[i]) ?? 0
max > 0 and in[i] > 0 then out[i] = (max + in[i]) > 0
(*) ill-named because it is also an accumulator, which the name hides. I have no better suggestion though.
2. Optimizing operations
This leads us to discover interesting cases:
if we have a slice [i, j) of the array containing only negative values (which we call negative slice), then we could do a std::copy(in + i, in + j, out + i) and max = out[j-1]
if we have a slice [i, j) of the array containing only positive values, then it's a pure accumulation code (which can easily be unrolled)
max gets positive as soon as in[i] is positive
Therefore, it could be interesting (but maybe not, I make no promise) to establish a profile of the input before actually working with it. Note that the profile could be made chunk by chunk for large inputs, for example tuning the chunk size based on the cache line size.
For reference, the 3 routines:
void copy(int64_t const in[], int64_t out[],
size_t const begin, size_t const end)
{
std::copy(in + begin, in + end, out + begin);
} // copy
void accumulate(int64_t const in[], int64_t out[],
size_t const begin, size_t const end)
{
assert(begin != 0);
int64_t max = out[begin-1];
for (size_t i = begin; i != end; ++i) {
max += in[i];
out[i] = max;
}
} // accumulate
void regular(int64_t const in[], int64_t out[],
size_t const begin, size_t const end)
{
assert(begin != 0);
int64_t max = out[begin - 1];
for (size_t i = begin; i != end; ++i)
{
max = 0 > max ? in[i] : max + in[i];
out[i] = max;
}
}
Now, supposing that we can somehow characterize the input using a simple structure:
struct Slice {
enum class Type { Negative, Neutral, Positive };
Type type;
size_t begin;
size_t end;
};
typedef void (*Func)(int64_t const[], int64_t[], size_t, size_t);
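Func select(Slice::Type t) {
switch(t) {
case Slice::Type::Negative: return &copy;
case Slice::Type::Neutral: return &regular;
case Slice::Type::Positive: return &accumulate;
}
return &regular; // not reached for valid enumerators; avoids a missing-return warning
}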
void theLoop(std::vector<Slice> const& slices, int64_t const in[], int64_t out[]) {
for (Slice const& slice: slices) {
Func const f = select(slice.type);
(*f)(in, out, slice.begin, slice.end);
}
}
Now, unlike introsort, the work in the loop is minimal, so computing the characteristics might be too costly as is... however, it lends itself well to parallelization.
3. Simple parallelization
Note that the characterization is a pure function of the input. Therefore, supposing that you work in a chunk per chunk fashion, it could be possible to have, in parallel:
Slice Producer: a characterizer thread, which computes the Slice::Type value
Slice Consumer: a worker thread, which actually executes the code
Even if the input is essentially random, providing the chunk is small enough (for example, a CPU L1 cache line) there might be chunks for which it does work. Synchronization between the two threads can be done with a simple thread-safe queue of Slice (producer/consumer) and adding a bool last attribute to stop consumption or by creating the Slice in a vector with a Unknown type, and having the consumer block until it's known (using atomics).
Note: because characterization is pure, it's embarrassingly parallel.
4. More Parallelization: Speculative work
Remember this innocent remark: max gets positive as soon as in[i] is positive.
Suppose that we can guess (reliably) that the Slice[j-1] will produce a max value that is negative, then the computation on Slice[j] are independent of what preceded them, and we can start the work right now!
Of course, it's a guess, so we might be wrong... but once we have fully characterized all the Slices, we have idle cores, so we might as well use them for speculative work! And if we're wrong ? Well, the Consumer thread will simply gently erase our mistake and replace it with the correct value.
The heuristic to speculatively compute a Slice should be simple, and it will have to be tuned. It may be adaptive as well... but that may be more difficult!
Conclusion
Analyze your dataset and try to find if it's possible to break dependencies. If it is you can probably take advantage of it, even without going multi-thread.
If values of max and in[] are far away from 64-bit min/max (say, they are always between -2^61 and +2^61), you may try a loop without the conditional branch, which may be causing some perf degradation:
for(uint32_t i = 1; i < N; i++) {
max &= -max >> 63; // assuming >> would do arithmetic shift with sign extension
max += in[i];
out[i] = max;
}
In theory the compiler may do a similar trick as well, but without seeing the disassembly, it's hard to tell if it does it.
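To see what the masking line does, here is a sketch that isolates one loop step and compares it with the explicit comparison from the question. The function names are hypothetical. It assumes max is never INT64_MIN and that >> on a negative value is an arithmetic shift, which C++20 guarantees and earlier standards leave implementation-defined.

```cpp
#include <cassert>
#include <cstdint>

// Branch-free step: -max >> 63 is all ones when max > 0 and zero otherwise,
// so the AND keeps a positive max and clears a non-positive one to 0.
inline std::int64_t step_bit_trick(std::int64_t max, std::int64_t v) {
    max &= -max >> 63;
    return max + v;
}

// Reference step with the explicit comparison from the question.
inline std::int64_t step_compare(std::int64_t max, std::int64_t v) {
    max += v;
    if (v > max) max = v;
    return max;
}
```

The two steps agree as long as max + v does not overflow; they can differ once it does, which matches the overflow caveat raised in the benchmark answer.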
The code already appears pretty fast. Depending on the nature of the in array, you could try special casing: for instance, if you happen to know that in a particular invocation all the input numbers are positive, out[i] will be equal to the cumulative sum, with no need for an if branch.
Ensuring the method isn't virtual, marking it inline or __attribute__((always_inline)), and compiling with -funroll-loops seem like good options to explore.
Only by you benchmarking them can we determine if they were worthwhile optimizations in your bigger program.
The only thing that comes to mind that might help a small bit is to use pointers rather than array indices within your loop, something like
void theloop(int64_t in[], int64_t out[], size_t N)
{
int64_t max = in[0];
out[0] = max;
int64_t *ip = in + 1,*op = out+1;
for(uint32_t i = 1; i < N; i++) {
int64_t v = *ip;
ip++;
max += v;
if (v > max) max = v;
*op = max;
op++;
}
}
The thinking here is that an index into an array is liable to compile as taking the base address of the array, multiplying the size of element by the index, and adding the result to get the element address. Keeping running pointers avoids this. I'm guessing a good optimizing compiler will do this already, so you'd need to study the current assembler output.
int64_t max = 0, i;
for(i=N-1; i > 0; --i) /* Comparing with 0 is faster */
{
max = in[i] > 0 ? max+in[i] : in[i];
out[i] = max;
--i; /* Will reduce checking of i>=0 by N/2 times */
max = in[i] > 0 ? max+in[i] : in[i]; /* Reduce operations v=in[i], max+=v by N times */
out[i] = max;
}
if(0 == i) /* When N is odd */
{
max = in[i] > 0 ? max+in[i] : in[i];
out[i] = max;
}

C++ - What would be faster: multiplying or adding?

I have some code that is going to be run thousands of times, and was wondering what was faster.
array is a 30 value short array which always holds 0, 1 or 2.
result = (array[29] * 68630377364883.0)
+ (array[28] * 22876792454961.0)
+ (array[27] * 7625597484987.0)
+ (array[26] * 2541865828329.0)
+ (array[25] * 847288609443.0)
+ (array[24] * 282429536481.0)
+ (array[23] * 94143178827.0)
+ (array[22] * 31381059609.0)
+ (array[21] * 10460353203.0)
+ (array[20] * 3486784401.0)
+ (array[19] * 1162261467)
+ (array[18] * 387420489)
+ (array[17] * 129140163)
+ (array[16] * 43046721)
+ (array[15] * 14348907)
+ (array[14] * 4782969)
+ (array[13] * 1594323)
+ (array[12] * 531441)
+ (array[11] * 177147)
+ (array[10] * 59049)
+ (array[9] * 19683)
+ (array[8] * 6561)
+ (array[7] * 2187)
+ (array[6] * 729)
+ (array[5] * 243)
+ (array[4] * 81)
+ (array[3] * 27)
+ (array[2] * 9)
+ (array[1] * 3)
+ (b[0]);
Would it be faster if I use something like:
if(array[29] != 0)
{
if(array[29] == 1)
{
result += 68630377364883.0;
}
else
{
result += (whatever 68630377364883.0 * 2 is);
}
}
for each of them. Would this be faster/slower? If so, by how much?
That is a ridiculously premature "optimization". Chances are you'll be hurting performance because you are adding branches to the code. Mispredicted branches are very costly. And it also renders the code harder to read.
Multiplication in modern processors is a lot faster than it used to be; it can be done in a few clock cycles now.
Here's a suggestion to improve readability:
for (i=1; i<30; i++) {
result += array[i] * pow(3, i);
}
result += b[0];
You can pre-compute an array with the values of pow(3, i) if you are really that worried about performance.
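A minimal sketch of the precomputed table, under a few assumptions that are not in the question: the digits live in a std::array<short, 30>, index 0 is the 3^0 digit (standing in for b[0]), and the names are invented. All powers up to 3^29 are exactly representable in a double, so the table loses no precision.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kDigits = 30;  // number of base-3 digits (assumed)

// Builds {3^0, 3^1, ..., 3^29} once.
std::array<double, kDigits> make_pow3_table() {
    std::array<double, kDigits> table{};
    double p = 1.0;
    for (std::size_t i = 0; i < kDigits; ++i) {
        table[i] = p;
        p *= 3.0;
    }
    return table;
}

// Sums array[i] * 3^i using the table instead of calling pow() per digit.
double weighted_sum(const std::array<short, kDigits>& array) {
    static const std::array<double, kDigits> pow3 = make_pow3_table();
    double result = 0.0;
    for (std::size_t i = 0; i < kDigits; ++i) {
        assert(array[i] >= 0 && array[i] <= 2);  // digits are 0, 1 or 2
        result += array[i] * pow3[i];
    }
    return result;
}
```

For example, a digit array with array[1] = 2 and array[29] = 1 (everything else 0) gives 2*3 + 68630377364883.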
First, on most architectures, branch mispredictions are very costly (depending on the execution pipeline depth), so I bet the non-branching version is better.
A variation on the code may be:
result = array[29];
for (i=28; i>=0; i--)
result = result * 3 + array[i];
Just make sure there are no overflows: the largest possible value is 3^30 - 1 (about 2.06e14), so result must be in a type wider than a 32-bit integer, such as a 64-bit integer or a double.
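Spelled out as a complete function, the Horner-style loop above could look like the sketch below. The 64-bit result type, the std::array parameter and the function name are my assumptions; the loop itself is the one shown.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Horner's rule: start from the most significant digit and repeatedly
// multiply by 3 and add the next digit. No table of powers is needed.
std::uint64_t horner_base3(const std::array<short, 30>& array) {
    std::uint64_t result = static_cast<std::uint64_t>(array[29]);
    for (int i = 28; i >= 0; --i) {
        const short digit = array[static_cast<std::size_t>(i)];
        assert(digit >= 0 && digit <= 2);  // digits are 0, 1 or 2
        result = result * 3 + static_cast<std::uint64_t>(digit);
    }
    return result;
}
```

With every digit set to 2 this returns 3^30 - 1 = 205891132094648, which fits comfortably in 64 bits.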
Even if addition is faster than multiplication, I think that you will lose more because of the branching. In any case, if addition is faster than multiplication, a better solution might be to use a table and index by it.
const double table[3] = {0.0, 68630377364883.0, 68630377364883.0 * 2.0};
result += table[array[29]];
My first attempt at optimisation would be to remove the floating-point ops in favour of integer arithmetic:
uint64_t total = b[0];
uint64_t x = 3;
for (int i = 1; i < 30; ++i, x *= 3) {
total += array[i] * x;
}
uint64_t is standard since C++11 (in <cstdint>) and was already widely available before that. On an older compiler you just need a version of C99's stdint.h for your platform.
There's also optimising for comprehensibility and maintainability - was this code a loop at one point, and did you measure the performance difference when you replaced the loop? Fully unrolling like this might even make the program slower (as well as less readable), since the code is larger and hence occupies more of the instruction cache, and hence results in cache misses elsewhere. You just don't know.
This assuming of course that your constants actually are the powers of 3 - I haven't bothered checking, which is precisely what I consider to be the readability issue with your code...
This is basically doing what strtoull does. If you don't have the digits handy as an ASCII string to feed to strtoull then I guess you have to write your own implementation. As people point out, branching is what causes a performance hit, so your function is probably best written this way:
#include <cstdint> // use <tr1/cstdint> on pre-C++11 compilers
uint64_t base3_digits_to_num(uint8_t digits[30])
{
uint64_t running_sum = 0;
uint64_t pow3 = 1;
for (int i = 0; i < 30; ++i) {
running_sum += digits[i] * pow3;
pow3 *= 3;
}
return running_sum;
}
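For the other case, where the digits are already available as an ASCII string, a small sketch of the strtoull route might look like this (the wrapper name is hypothetical). Note that strtoull reads the most significant digit first, whereas the loop above treats digits[0] as the least significant one.

```cpp
#include <cassert>
#include <cstdlib>

// Parses a string of '0', '1', '2' characters as a base-3 number.
// std::strtoull accepts any base from 2 to 36 as its third argument.
unsigned long long base3_string_to_num(const char* text) {
    assert(text != nullptr);
    return std::strtoull(text, nullptr, 3);
}
```

For instance, "102" parses as 1*9 + 0*3 + 2 = 11.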
It's not clear to me that precomputing your powers of 3 is going to result in a significant speed advantage. You might try it and test yourself. The one advantage a lookup table might give you is that a smart compiler could possibly unroll the loop into a SIMD instruction. But a really smart compiler should then be able to do that anyway and generate the lookup table for you.
Avoiding floating point is also not necessarily a speed win. Floating point and integer operations are about the same on most processors produced in the last 5 years.
Checking to see if digits[i] is 0, 1 or 2 and executing different code for each of these cases is definitely a speed loss on any processor produced in the last 10 years. The Pentium3/Pentium4/Athlon Thunderbird days are when branches started to really become a huge hit, and the Pentium3 is at least 10 years old now.
Lastly, you might think this will be the bottleneck in your code. You're probably wrong. The right implementation is the one that is the simplest and most clear to anybody coming along reading your code. Then, if you want the best performance, run your code through a profiler and find out where to concentrate your optimization efforts. Agonizing this much over a little function when you don't even know that it's a bottleneck is silly.
And almost nobody here recognized that you were basically doing a base 3 conversion. So even your current primitive hand loop unrolling obscured your code enough that most people didn't understand it.
Edit: In fact, I looked at the assembly output. On an x86_64 platform the lookup table buys you nothing and may in fact be counter-productive because of its effect on the cache. The compiler generates leaq (%rdx,%rdx,2), %rdx in order to multiply by 3. Fetching from a table would be something like movq (%rdx,%rcx,8), %rax, which is basically the same speed aside from requiring a fetch from memory (which might be very expensive). So it's almost certain that my code with the gcc option -funroll-loops is significantly faster than your attempt to optimize by hand.
The lesson here is that the compiler does a much, much better job of optimization than you can. Just make your code as clear and readable to others as possible and let the compiler do the work. And making it clear to others has the additional advantage of making it easier for the compiler to do its job.
If you're not sure - why don't you just measure it yourself?
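A very rough sketch of how such a measurement could be set up with std::chrono is below. The two functions are loop-based stand-ins for the multiply and branch variants from the question, and all names are made up. Treat the numbers it gives as indicative only: a serious benchmark needs realistic inputs, optimisation flags matching your build, and several repeated runs.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>

using Digits = std::array<short, 30>;

// Variant 1: multiply every digit by its power of 3.
double multiply_version(const Digits& a) {
    double result = 0.0, p = 1.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        result += a[i] * p;
        p *= 3.0;
    }
    return result;
}

// Variant 2: branch on the digit value and add.
double branch_version(const Digits& a) {
    double result = 0.0, p = 1.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != 0) {
            if (a[i] == 1) result += p;
            else result += 2.0 * p;
        }
        p *= 3.0;
    }
    return result;
}

// Average nanoseconds per call over `reps` calls. One digit is changed
// on every iteration so the call cannot simply be hoisted out of the loop.
template <typename F>
double ns_per_call(F f, Digits a, int reps) {
    assert(reps > 0);
    volatile double sink = 0.0;  // keeps the result from being optimised away
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        a[static_cast<std::size_t>(r) % a.size()] = static_cast<short>(r % 3);
        sink = f(a);
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() / reps;
}
```

Calling ns_per_call(multiply_version, digits, 1000000) and ns_per_call(branch_version, digits, 1000000) on the same digits gives two numbers to compare.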
The second example will most likely be much slower, but not because of the addition: mispredicted conditional jumps cost a lot of time.
If you have only 3 values, the cheapest way might be to have a static 2D array of values, static const int64_t vals[][3] = {{0, 1*3, 2*3}, {0, 1*9, 2*9}, ...} (a 64-bit element type, since the larger entries do not fit in an int), and just sum vals[0][array[1]] + vals[1][array[2]] + ...
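Rather than typing the table out by hand, it can be generated once at startup. The sketch below is one way to do that, with my own indexing (row i holds 0, 3^i and 2*3^i, so row 0 covers the 3^0 digit) and made-up names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDigits = 30;  // number of base-3 digits (assumed)
using Table = std::array<std::array<std::uint64_t, 3>, kDigits>;

// table[i][d] == d * 3^i for d in {0, 1, 2}.
Table make_table() {
    Table t{};
    std::uint64_t p = 1;
    for (std::size_t i = 0; i < kDigits; ++i) {
        t[i][0] = 0;
        t[i][1] = p;
        t[i][2] = 2 * p;
        p *= 3;
    }
    return t;
}

// Sums one table entry per digit: no multiplications and no branches
// in the loop body (apart from the debug-only range check).
std::uint64_t lookup_sum(const std::array<short, kDigits>& digits) {
    static const Table table = make_table();
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < kDigits; ++i) {
        assert(digits[i] >= 0 && digits[i] <= 2);
        sum += table[i][static_cast<std::size_t>(digits[i])];
    }
    return sum;
}
```

The table takes 30 * 3 * 8 = 720 bytes, so whether the loads beat a plain multiply depends on how warm the cache is; measure before committing to it.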
Some SIMD instructions might be faster than anything you can write on your own, so look at those. Then again, if you're doing this a lot, handing it off to the GPU might be even faster, depending on your other calculations.
Multiply, because branching is awfully slow.