Efficient stable sum of a sorted array in AVX2 - C++

Consider a sorted (ascending) array of double numbers. For numerical stability the array should be summed up as if iterating it from the beginning till the end, accumulating the sum in some variable.
How to vectorize this efficiently with AVX2?
I've looked into this method: Fastest way to do horizontal vector sum with AVX instructions, but it seems quite tricky to scale it to an array (some divide & conquer approach may be needed), while keeping the floating-point precision by ensuring that small numbers are summed up before adding them to a larger number.
Clarification 1: I think it should be ok to e.g. sum the first 4 items, then add them to the sum of the next 4 items, etc. I'm willing to trade some stability for performance. But I would prefer a method that doesn't ruin the stability completely.
Clarification 2: memory shouldn't be a bottleneck because the array is in L3 cache (but not in L1/L2 cache, because pieces of the array were populated from different threads). I wouldn't like to resort to Kahan summation because I think it's really the number of operations that matters, and Kahan summation would increase it about 4 times.

If you need precision and parallelism, use Kahan summation or another error-compensation technique to let you reorder your sum (into SIMD vector element strides with multiple accumulators).
As Twofold fast summation - Evgeny Latkin points out, if you bottleneck on memory bandwidth, an error-compensated sum isn't much slower than a max-performance sum, since the CPU has lots of computation throughput that goes unused in a simply-parallelized sum that bottlenecks on memory bandwidth.
See also (google results for kahan summation avx)
https://github.com/rreusser/summation-algorithms
https://scicomp.stackexchange.com/questions/10869/which-algorithm-is-more-accurate-for-computing-the-sum-of-a-sorted-array-of-numb
Is this way of processing the tail of an array with SSE overkill? has a sample implementation of SSE Kahan, not unrolled, and a comparison of actual error with it (no error) vs. sequential sum (bad) vs. simple SIMD sum (much less total error), showing that just vectorizing (and/or unrolling) with multiple accumulators tends to help accuracy.
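For illustration, a minimal sketch of compensated (Kahan) summation over doubles with AVX2, keeping one running sum and one compensation term per vector lane. The function name, the 32-byte alignment requirement and the assumption that n is a multiple of 4 are mine, not from the question:

#include <immintrin.h>
#include <cstddef>

// Sketch: Kahan (compensated) summation with one running sum and one
// compensation term per __m256d lane. Assumes n is a multiple of 4 and p is
// 32-byte aligned. Compile without -ffast-math, or the compensation may be
// optimized away.
double KahanSumAvx2(const double* p, std::size_t n) {
    __m256d sum  = _mm256_setzero_pd();   // running lane sums
    __m256d comp = _mm256_setzero_pd();   // running lane compensation (error) terms
    for (std::size_t i = 0; i < n; i += 4) {
        __m256d x = _mm256_load_pd(p + i);
        __m256d y = _mm256_sub_pd(x, comp);              // corrected input
        __m256d t = _mm256_add_pd(sum, y);               // new sum (rounded)
        comp = _mm256_sub_pd(_mm256_sub_pd(t, sum), y);  // recover the rounding error
        sum  = t;
    }
    // Horizontal sum of the 4 lanes, only once at the very end.
    __m128d lo = _mm256_castpd256_pd128(sum);
    __m128d hi = _mm256_extractf128_pd(sum, 1);
    __m128d s  = _mm_add_pd(lo, hi);
    s = _mm_add_sd(s, _mm_unpackhi_pd(s, s));
    return _mm_cvtsd_f64(s);
}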
Re: your idea: Summing groups of 4 numbers in-order will let you hide the FP-add latency, and bottleneck on scalar add throughput.
Doing horizontal sums within vectors takes a lot of shuffling, so it's a potential bottleneck. You might consider loading a0 a1 a2 a3, then shuffling to get [a0+a1, x, a2+a3, x], then (a0+a1) + (a2+a3). You have a Ryzen, right? The last step is just a vextractf128 down to 128b. That's still 3 total ADD uops, and 3 shuffle uops, but with fewer instructions than scalar or 128b vectors.
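For concreteness, one possible form of that shuffle/add sequence for a single __m256d of four doubles (the helper name is mine, and this is a sketch, not necessarily the cheapest uop count on every CPU):

#include <immintrin.h>

// Sum the four doubles [a0, a1, a2, a3] held in one __m256d.
static inline double HSum4(__m256d v) {
    // Swap the doubles within each 128-bit lane and add:
    // result is [a0+a1, a0+a1, a2+a3, a2+a3].
    __m256d swapped = _mm256_permute_pd(v, 0x5);
    __m256d pairs   = _mm256_add_pd(v, swapped);
    // Bring the upper lane down (vextractf128) and add the two pair sums.
    __m128d hi = _mm256_extractf128_pd(pairs, 1);
    __m128d lo = _mm256_castpd256_pd128(pairs);
    return _mm_cvtsd_f64(_mm_add_sd(lo, hi));
}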
Your idea is very similar to Pairwise Summation
You're always going to get some rounding error, but adding numbers of similar magnitude minimizes it.
See also Simd matmul program gives different numerical results for some comments about Pairwise Summation and simple efficient SIMD.
The difference between adding 4 contiguous numbers vs. vertically adding 4 SIMD vectors is probably negligible. SIMD vectors give you small strides (of SIMD vector width) in the array. Unless the array grows extremely quickly, they're still going to have basically similar magnitudes.
You don't need to horizontal sum until the very end to still get most of the benefit. You can maintain 1 or 2 SIMD vector accumulators while you use more SIMD registers to sum short runs (of maybe 4 or 8 SIMD vectors) before adding into the main accumulators.
In fact having your total split more ways (across the SIMD vector elements) means it doesn't grow as large. So it helps with exactly the problem you're trying to avoid, and horizontal summing down to a single scalar accumulator actually makes things worse, especially for a strictly sorted array.
With out-of-order execution, you don't need very many tmp accumulators to make this work and hide the FP-add latency of accumulating into the main accumulators. You can do a couple groups of accumulating into a fresh tmp = _mm_load_ps() vector and adding that to the total, and OoO exec will overlap those executions. So you don't need a huge unroll factor for your main loop.
But it shouldn't be too small: you don't want to bottleneck on the add latency into the main accumulator, waiting for the previous add to produce a result before the next one can start. You want to bottleneck on FP-add throughput. (Or if you care about Broadwell/Haswell and you don't totally bottleneck on memory bandwidth, mix in some FMA with a 1.0 multiplier to take advantage of that throughput.)
e.g. Skylake SIMD FP add has 4 cycle latency, 0.5 cycle throughput, so you need to be doing at least 7 adds that are part of a short dep chain for every add into a single accumulator. Preferably more, and/or preferably with 2 long-term accumulators to better absorb bubbles in scheduling from resource conflicts.
See _mm256_fmadd_ps is slower than _mm256_mul_ps + _mm256_add_ps? for more about multiple accumulators.
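A rough sketch of that multi-accumulator layout for doubles (the function name, the unroll factor, and the assumptions that n is a multiple of 16 and p is 32-byte aligned are all illustrative):

#include <immintrin.h>
#include <cstddef>

// Sum short runs of vectors into fresh temporaries, then fold each temporary
// into one of two long-term accumulators, so each accumulator's dependency
// chain only sees one add per iteration.
double SumMultiAcc(const double* p, std::size_t n) {
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    for (std::size_t i = 0; i < n; i += 16) {
        // Two independent short dep chains of 8 doubles each.
        __m256d t0 = _mm256_add_pd(_mm256_load_pd(p + i),     _mm256_load_pd(p + i + 4));
        __m256d t1 = _mm256_add_pd(_mm256_load_pd(p + i + 8), _mm256_load_pd(p + i + 12));
        acc0 = _mm256_add_pd(acc0, t0);
        acc1 = _mm256_add_pd(acc1, t1);
    }
    __m256d sum = _mm256_add_pd(acc0, acc1);
    // Horizontal sum only once, at the very end.
    __m128d s = _mm_add_pd(_mm256_castpd256_pd128(sum), _mm256_extractf128_pd(sum, 1));
    s = _mm_add_sd(s, _mm_unpackhi_pd(s, s));
    return _mm_cvtsd_f64(s);
}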

Here's my solution so far:
#include <immintrin.h>
#include <cstddef>

double SumVects(const __m256d* pv, size_t n) {
    if (n == 0) return 0.0;
    __m256d sum = pv[0];
    if (n == 1) {
        sum = _mm256_permute4x64_pd(sum, _MM_SHUFFLE(3, 1, 2, 0));
    } else {
        for (size_t i = 1; i + 1 < n; i++) {
            // hadd pairs adjacent elements of sum and pv[i]; the permute restores
            // an interleaved order so the next hadd keeps pairing neighbours.
            sum = _mm256_hadd_pd(sum, pv[i]);
            sum = _mm256_permute4x64_pd(sum, _MM_SHUFFLE(3, 1, 2, 0));
        }
        sum = _mm256_hadd_pd(sum, pv[n - 1]);
    }
    const __m128d laneSums = _mm_hadd_pd(_mm256_extractf128_pd(sum, 1),
                                         _mm256_castpd256_pd128(sum));
    // m128d_f64 is the MSVC-specific element accessor for __m128d.
    return laneSums.m128d_f64[0] + laneSums.m128d_f64[1];
}
Some explanation: it adds adjacent double array items first, such as a[0]+a[1], a[2]+a[3], etc. Then it adds the sums of adjacent items.

The games you want to play are likely counterproductive. Try experimenting by generating a bunch of iid samples from your favourite distribution, sorting them, and comparing "sum in increasing order" with "sum each lane in increasing order, then sum the lane sums."
For 4 lanes and 16 data, summing lanewise gives me smaller error about 28% of the time while summing in increasing order gives me smaller error about 17% of the time; for 4 lanes and 256 data, summing lanewise gives me smaller error about 68% of the time while summing in increasing order gives me smaller error about 12% of the time. Summing lanewise also beats the algorithm you gave in your self-answer. I used a uniform distribution on [0,1] for this.
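A sketch of that experiment in plain C++ (the trial count, seed, sample sizes and distribution are illustrative; errors are measured against a long double reference):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// Compare "sum in increasing order" against "sum each of 4 lanes in increasing
// order, then sum the lane sums", on sorted uniform samples.
int main() {
    std::mt19937_64 rng(12345);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    const int trials = 10000;
    const std::size_t n = 256;
    int lanewiseWins = 0, inOrderWins = 0;
    for (int t = 0; t < trials; ++t) {
        std::vector<double> a(n);
        for (double& x : a) x = dist(rng);
        std::sort(a.begin(), a.end());

        long double ref = 0.0L;                 // higher-precision reference
        for (double x : a) ref += x;

        double seq = 0.0;                       // plain in-order sum
        for (double x : a) seq += x;

        double lane[4] = {0, 0, 0, 0};          // 4 strided "lane" sums
        for (std::size_t i = 0; i < n; ++i) lane[i % 4] += a[i];
        double lanewise = (lane[0] + lane[1]) + (lane[2] + lane[3]);

        long double eSeq  = std::fabs((long double)seq - ref);
        long double eLane = std::fabs((long double)lanewise - ref);
        if (eLane < eSeq) ++lanewiseWins;
        else if (eSeq < eLane) ++inOrderWins;
    }
    std::printf("lanewise wins: %d, in-order wins: %d, ties: %d\n",
                lanewiseWins, inOrderWins, trials - lanewiseWins - inOrderWins);
}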

Related

Speeding up gather

I have a computation that produces a coefficient vector and returns the dot product of this vector with a data vector taken from a large array. To speed things up, I do this for eight vectors at a time using AVX2 SIMD intrinsics. The problem is that the bulk of the time ends up being consumed by the gather operation getting the data for the dot product.
I tried different ways of implementing the gather, and the intrinsic seems to work best. I would greatly appreciate some advice on speeding this up.
Here is a sketch:
__m256 Compute(__m256 input)
{
    __m256 coefficients[56];
    ComputeCoefficients(input, coefficients);   // fills the 56 coefficient vectors
    __m256i indices = ComputeIndices(input);
    __m256 sum = _mm256_setzero_ps();
    for (size_t i = 0; i != 56; ++i)
    {
        // the gather below is where most of the time goes
        __m256 data = _mm256_i32gather_ps(bigArray + i, indices, sizeof(float));
        sum = _mm256_fmadd_ps(coefficients[i], data, sum);
    }
    return sum;
}
I would first make sure that you are using the most recent Intel processor possible. Intel has invested a lot of engineering in improving the gather instruction.
This being said, it is not magical. If there are cache misses, you will pay a price for them.
I would try to write the same code without SIMD instructions. Is it about the same speed? If it is, then chances are good that you are limited by memory access. Vectorization is good for solving computational limitations, and for loading and storing data in vector-size units, but even in principle it cannot be expected to help much with problems bound by random access and cache issues.
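For reference, a scalar version with the same access pattern might look roughly like this (the signature and the coefficient layout are assumptions, since the question only gives a sketch):

// Scalar reference: same memory access pattern, no gather instruction.
// coefficients[i][lane] mirrors the lanes of the __m256 coefficient vectors.
void ComputeScalar(const float* bigArray,
                   const float coefficients[56][8],
                   const int indices[8],
                   float sum[8])
{
    for (int lane = 0; lane < 8; ++lane) sum[lane] = 0.0f;
    for (int i = 0; i != 56; ++i)
        for (int lane = 0; lane < 8; ++lane)
            sum[lane] += coefficients[i][lane] * bigArray[i + indices[lane]];
}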
Your code repeatedly calls VPGATHERDPS. According to Agner Fog, this instruction has a latency of 12 cycles and a throughput of one instruction every 4 cycles. The latency is, of course, a best-case figure; cache misses will increase it.
You should benchmark your code and check whether you are close to 4 cycles per loop iteration. The main loop should complete in about 300 cycles, which is quite fast all things considered.
You do not tell us a lot about your problem but we can guess that it is much slower than 300 cycles. If so, then you are probably having cache issues. If your table is large and you are accessing it randomly, then it is a hard problem. If you need better performance, you may need to reengineer the problem.

CUDA computing a histogram with shared memory

I'm following a Udacity problem set lesson to compute a histogram of numBins elements out of a long series of numElems values. In this simple case each element's value is also its own bin in the histogram, so generating the histogram with CPU code is as simple as
for (i = 0; i < numElems; ++i)
    histo[val[i]]++;
I don't get the video explanation for a "fast histogram computation" according to which I should sort the values by a 'coarse bin id' and then compute the final histogram.
The question is:
why should I sort the values by 'coarse bin indices'?
This is an attempt to break down the work into pieces that can be handled by a single threadblock. There are several considerations here:
On a GPU, it's desirable to have multiple threadblocks so that all SMs can be engaged in solving the problem.
A given threadblock lives and operates on a single SM, so it is confined to the resources available on that SM, the primary limits being the number of threads and the size of available shared memory.
Since shared memory especially is limited, the division of work creates a smaller-sized histogram operation for each threadblock, which may fit in the SM shared memory whereas the overall histogram range may not. For example if I am histogramming over a range of 4 decimal digits, that would be 10,000 bins total. Each bin would probably need an int value, so that is 40Kbytes, which would just barely fit into shared memory (and might have negative performance implications as an occupancy limiter). A histogram over 5 decimal digits probably would not fit. On the other hand, with a "coarse bin sort" of a single decimal digit, I could reduce the per-block shared memory requirement from 40Kbytes to 4Kbytes (approximately).
Shared memory atomics are often considerably faster than global memory atomics, so breaking down the work this way allows for efficient use of shared memory atomics, which may be a useful optimization.
so I will have to sort all the values first? Isn't that more expensive than reading and doing an atomicAdd into the right bin?
Maybe. But the idea of a coarse bin sort is that it may be computationally much less expensive than a full sort. A radix sort is a commonly used, relatively fast sorting operation that can be done in parallel on a GPU. Radix sort has the characteristic that the sorting operation begins with the most significant "digit" and proceeds iteratively to the least significant digit. However a coarse bin sort implies that only some subset of the most significant digits need actually be "sorted". Therefore, a "coarse bin sort" using a radix sort technique could be computationally substantially less expensive than a full sort. If you sort only on the most significant digit out of 3 digits as indicated in the udacity example, that means your sort is only approximately 1/3 as expensive as a full sort.
I'm not suggesting that this is a guaranteed recipe for faster performance in every case. The specifics matter (e.g. size of histogram, range, final number of bins, etc.) The specific GPU you use may impact the tradeoff also. For example, Kepler and newer devices will have substantially improved global memory atomics, so the comparison will be substantially impacted by that. (OTOH, Pascal has substantially improved shared memory atomics, which will once again affect the comparison in the other direction.)

iteration direction on an array

Say we have two arrays a and b of a fundamental type (say, a float) and we need to calculate a[i] + b[i] for every valid index i, as well as store the result. What is the best way to iterate over the arrays to maximize cache hits? Is it front-to-back, back-to-front or something else?
For this kind of operation you should use the auto-vectorization of your compiler. Iterate from small i to large i. Also, the answer depends on what you mean by "store the result" and the number n of items you are going to iterate over.
If you mean c[i] = a[i] + b[i] and n is not too small then your compiler's auto-vectorizer will optimize this best without any more changes. Even MSVC will get that one correct (at least for SSE). Your compiler will have to do some adjustments for n not a multiple of 4 (or 8 for AVX) and alignment but this cost will be amortized across n and this overhead will have a negligible effect except for small n. If n is small then you might want to consider alignment. How small is small has to be determined but I would guess it's much less than 100.
If you mean sum += a[i] + b[i], a reduction, then you do need to think about this. This has a dependency chain, so you need to unroll your loop 3-10 times. Additionally, you need to use a relaxed floating point model, since floating point arithmetic is not associative and the auto-vectorization won't kick in without it, so add -ffast-math to GCC (/fp:fast to MSVC). If you unroll the loop and use a relaxed floating point model then GCC, ICC, Clang, and MSVC should auto-vectorize your reduction efficiently.
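For illustration, a manually unrolled reduction with four independent partial sums might look like this (a sketch, assuming n is a multiple of 4 for brevity; with a relaxed floating point model the compiler may do this transformation itself):

#include <cstddef>

// Four independent partial sums break the single dependency chain, so several
// additions can be in flight at once; combine them only at the end.
float SumPairs(const float* a, const float* b, std::size_t n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += a[i]     + b[i];
        s1 += a[i + 1] + b[i + 1];
        s2 += a[i + 2] + b[i + 2];
        s3 += a[i + 3] + b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}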
In order to utilize the cache pre-fetch capability you need to read the arrays from front to back sequentially.
Furthermore, the arrays should be SSE aligned (16 byte). Even more important is that the items (e.g. floats) will be aligned on their size (4 bytes for floats). This is important so data will not cross cache lines (slower read).
After the arrays are aligned, you can use SSE/AVX to read, add and store the results doing 4 or 8 operations in a single instruction.
Edit:
You can read more on cache prefetching here, and find an in-depth description in the Intel SW Developer Manual.

Amortized analysis of std::vector insertion

How do we do the analysis of insertion at the back (push_back) in a std::vector? Its amortized time is O(1) per insertion. In particular, in a Channel 9 video by Stephan T. Lavavej, and in this one (17:42 onwards), he says that for optimal performance Microsoft's implementation of this method increases the capacity of the vector by around 1.5.
How is this constant determined?
Assuming you mean push_back and not insertion, I believe that the important part is multiplying by some constant (as opposed to grabbing N more elements each time); as long as you do this you'll get amortized constant time. Changing the factor changes the average-case and worst-case performance.
Concretely:
If your constant factor is too large, you'll have good average-case performance, but bad worst-case performance, especially as the arrays get big. For instance, imagine doubling (2x) a 10000-element vector just because the 10001st element is pushed. EDIT: As Michael Burr indirectly pointed out, the real cost here is probably that you'll grow your memory much larger than you need it to be. I would add to this that there are cache issues that affect speed if your factor is too large. Suffice it to say that there are real costs (memory and computation) if you grow much larger than you need.
However, if your constant factor is too small, say 1.1x, then you're going to have good worst-case performance, but bad average performance, because you're going to incur the cost of reallocating too many times.
Also, see Jon Skeet's answer to a similar question previously. (Thanks #Bo Persson)
A little more about the analysis: say you have n items you are pushing back and a multiplication factor of M. Then the number of reallocations will be roughly log base M of n (log_M(n)), and the ith reallocation will have a cost proportional to M^i (M to the ith power). The total time of all the pushbacks will then be M^1 + M^2 + ... + M^(log_M(n)). The number of pushbacks is n, and thus you get this series (which is a geometric series, and reduces to roughly (nM)/(M-1) in the limit) divided by n. This is roughly a constant, M/(M-1).
For large values of M you will overshoot a lot and allocate much more than you need reasonably often (which I mentioned above). For small values of M (close to 1) this constant M/(M-1) becomes large. This factor directly affects the average time.
You can do the math to try to figure how this kind of thing works.
A popular method to work with amortized analysis is the banker's method. What you do is mark up all your operations with an extra cost, "saving" it for later to pay for an expensive operation later on.
Let's make some dumb assumptions to simplify the math:
Writing into an array costs 1. (Same for inserting and moving between arrays)
Allocating a larger array is free.
And our algorithm looks like:
function insert(x){
    if n_elements >= maximum array size:
        move all elements to a new array that
        is K times larger than the current size
    add x to array
    n_elements += 1
}
Obviously, the "worst case" happens when we have to move the elements to the new array. Let's try to amortize this by adding a constant markup of d to the insertion cost, bringing it to a total of (1 + d) per operation.
Just after an array has been resized, we have (1/K) of it filled up and no money saved.
By the time we fill the array up, we can be sure to have at least d * (1 - 1/K) * N saved up. Since this money must be able to pay for all N elements being moved, we can figure out a relation between K and d:
d*(1 - 1/K)*N = N
d*(K-1)/K = 1
d = K/(K-1)
A helpful table:
K     d     1+d (total insertion cost)
1.0   inf   inf
1.1   11.0  12.0
1.5   3.0   4.0
2.0   2.0   3.0
3.0   1.5   2.5
4.0   1.3   2.3
inf   1.0   2.0
So from this you can get a rough mathematician's idea of how the time/memory tradeoff works for this problem. There are some caveats, of course: I didn't go over shrinking the array when it has fewer elements, this only covers the worst case where no elements are ever removed, and the time cost of allocating extra memory wasn't accounted for.
They most likely ran a bunch of experimental tests to figure this out in the end making most of what I wrote irrelevant though.
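If you want to play with the numbers yourself, a tiny simulation under similar assumptions (allocation itself is free; only element copies are counted) could look like this; the constants and output format are purely illustrative:

#include <cstddef>
#include <cstdio>

// Count total element copies and final capacity when pushing n elements with
// growth factor K.
int main() {
    const std::size_t n = 1000000;
    for (double K : {1.1, 1.5, 2.0, 3.0}) {
        std::size_t capacity = 1, size = 0, copies = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (size == capacity) {
                copies += size;                               // move everything over
                capacity = (std::size_t)(capacity * K) + 1;   // grow by factor K
            }
            ++size;                                           // the push itself
        }
        std::printf("K=%.1f  copies per push ~ %.2f  final capacity = %zu\n",
                    K, (double)copies / n, capacity);
    }
}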
Uhm, the analysis is really simple when you're familiar with number systems, such as our usual decimal one.
For simplicity, then, assume that each time the current capacity is reached, a new 10x as large buffer is allocated.
If the original buffer has size 1, then the first reallocation copies 1 element, the second (where now the buffer has size 10) copies 10 elements, and so on. So with five reallocations, say, you have 1+10+100+1000+10000 = 11111 element copies performed. Multiply that by 9, and you get 99999; now add 1 and you have 100000 = 10^5. Or in other words, doing that backwards, the number of element copies performed to support those 5 reallocations has been (10^5-1)/9.
And the buffer size after 5 reallocations, 5 multiplications by 10, is 10^5. Which is roughly a factor of 9 larger than the number of element copy operations. Which means that the time spent on copying is roughly linear in the resulting buffer size.
With base 2 instead of 10 you get (2^5-1)/1 = 2^5-1.
And so on for other bases (or factors to increase buffer size by).
Cheers & hth.

Speed up float 5x5 matrix * vector multiplication with SSE

I need to run a matrix-vector multiplication 240000 times per second. The matrix is 5x5 and is always the same, whereas the vector changes at each iteration. The data type is float. I was thinking of using some SSE (or similar) instructions.
I am concerned that the number of arithmetic operations is too small compared to the number of memory operations involved. Do you think I can get some tangible (e.g. > 20%) improvement?
Do I need the Intel compiler to do it?
Can you point out some references?
The Eigen C++ template library for vectors, matrices, ... has both
optimised code for small fixed size matrices (as well as dynamically sized ones)
optimised code that uses SSE optimisations
so you should give it a try.
In principle the speedup could be 4 times with SSE (8 times with AVX). Let me explain.
Let's call your fixed 5x5 matrix M. Define the components of a 5D vector as (x,y,z,w,t). Now form a 5x4 matrix U from the first four vectors.
U =
xxxx
yyyy
zzzz
wwww
tttt
Next, do the matrix product MU = V. The matrix V contains the product of M and the first four vectors. The only problem is that for SSE we need read in the rows of U but in memory U is stored as xyzwtxyzwtxyzwtxyzwt so we have to transpose it to xxxxyyyyzzzzwwwwtttt. This can be done with shuffles/blends in SSE. Once we have this format the matrix product is very efficient.
Instead of taking O(5x5x4) operations with scalar code it only takes O(5x5) operations, i.e. a 4x speedup. With AVX the matrix U will be 5x8, so instead of taking O(5x5x8) operations it only takes O(5x5), i.e. an 8x speedup.
The matrix V, however, will be in xxxxyyyyzzzzwwwwtttt format so depending on the application it might have to be transposed to xyzwtxyzwtxyzwtxyzwt format.
Repeat this for the next four vectors (8 for AVX) and so forth until done.
If you have control over the vectors, for example if your application generates the vectors on the fly, then you can generate them in xxxxyyyyzzzzwwwwtttt format and avoid the transpose of the array. In that case you should get a 4x speedup with SSE and an 8x with AVX. If you combine this with threading, e.g. OpenMP, your speedup should be close to 16x (assuming four physical cores) with SSE. I think that's the best you can do with SSE.
Edit: Due to instruction level parallelism (ILP) you can get another factor of 2 in speedup, so the speedup for SSE could be 32x with four cores (64x with AVX), and again another factor of 2 on Haswell due to FMA3.
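To make the transposed-layout idea concrete, here is a rough SSE sketch that multiplies the fixed 5x5 matrix by four vectors at once, with inputs and outputs already in the xxxx yyyy zzzz wwww tttt layout (the function name and exact storage format are illustrative, not from the question):

#include <immintrin.h>

// M is the fixed 5x5 matrix, row-major. in[0]=xxxx, in[1]=yyyy, ..., in[4]=tttt
// hold the components of four input vectors; out uses the same layout.
void MulM5x5_4vecs(const float M[5][5], const __m128 in[5], __m128 out[5])
{
    for (int r = 0; r < 5; ++r) {
        // out[r] = M[r][0]*x + M[r][1]*y + M[r][2]*z + M[r][3]*w + M[r][4]*t,
        // computed for four vectors at once (broadcast the matrix element).
        __m128 acc = _mm_mul_ps(_mm_set1_ps(M[r][0]), in[0]);
        for (int c = 1; c < 5; ++c)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(M[r][c]), in[c]));
        out[r] = acc;
    }
}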
I would suggest using Intel IPP and abstracting yourself from the dependency on specific techniques.
If you're using GCC, note that the -O3 option will enable auto-vectorization, which will automatically generate SSE or AVX instructions in many cases. In general, if you just write it as a simple for-loop, GCC will vectorize it. See http://gcc.gnu.org/projects/tree-ssa/vectorization.html for more information.
This should be easy, especially when you're on Core 2 or later: you need five _mm_dp_ps, one _mm_mul_ps, two _mm_add_ps, one ordinary multiplication, plus some shuffles, loads and stores (and if the matrix is fixed, you can keep most of it in SSE registers, if you don't need them for anything else).
As for memory bandwidth: we're talking about 2.4 megabytes of vectors, when memory bandwidths are in single-digit gigabytes per second.
What is known about the vector? Since the matrix is fixed, AND if there is a limited number of values that the vector can take, then I'd suggest that you pre-compute the calculations and access them using a table look-up.
The classic optimization technique to trade memory for cycles...
I would recommend having a look at an optimised BLAS library, such as the Intel MKL or the AMD ACML. Based on your description I would assume that you'd be after the SGEMV level 2 matrix-vector routine, to do y = A*x style operations.
If you really want to implement something yourself, using the (available) SSE..SSE4 and AVX instruction sets can offer significant performance improvements in some cases, although this is exactly what a good BLAS library will be doing. You also need to think a lot about cache-friendly data access patterns.
I don't know if this is applicable in your case, but can you operate on "chunks" of vectors at a time? So rather than repeatedly doing a y = A*x style operation, can you operate on blocks of [y1 y2 ... yn] = A * [x1 x2 ... xn]? If so, this means that you could use an optimised matrix-matrix routine, such as SGEMM (see the sketch after this answer). Due to the data access patterns this may be significantly more efficient than repeated calls to SGEMV. If it were me, I would try to go down this path...
Hope this helps.
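A sketch of that batched approach through a CBLAS interface (the wrapper name and the column-major packing are my assumptions; any CBLAS-compatible library such as MKL or OpenBLAS provides cblas_sgemm):

#include <cblas.h>   // CBLAS interface from MKL, OpenBLAS, etc.

// Pack n input vectors into a 5 x n matrix X (column-major, one vector per
// column) and compute Y = A * X with a single SGEMM call instead of n SGEMV calls.
void MultiplyBatch(const float* A /* 5x5, column-major */,
                   const float* X /* 5xn, column-major */,
                   float* Y       /* 5xn, column-major */,
                   int n)
{
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                /*M=*/5, /*N=*/n, /*K=*/5,
                /*alpha=*/1.0f, A, /*lda=*/5, X, /*ldb=*/5,
                /*beta=*/0.0f, Y, /*ldc=*/5);
}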
If you know the vectors in advance (e.g., doing all 240k at once), you'd get a better speedup by parallelising the loop than by going to SSE. If you've already taken that step, or you don't know them all at once, SSE could be a big benefit.
If the memory is contiguous, then don't worry too much about the memory operations. If you've got a linked list or something then you're in trouble, but it should be able to keep up without too much problem.
5x5 is a funny size, but you could do at least 4 flops in one SSE instruction and try to cut your arithmetic overheads. You don't need the Intel compiler, but it might be better, I've heard legends about how it's much better with arithmetic code. Visual Studio has intrinsics for dealing with SSE2, and I think up to SSE4 depending on what you need. Of course, you'd have to roll it yourself. Grabbing a library might be the smart move here.