Why is this SSE2 code performing inconsistently? - c++

As a learning exercise I'm trying my hand at speeding up matrix multiplication code using SIMD on various architectures. I'm having a weird issue with my 3D matrix multiplication code for SSE2 where its performance jumps between two extremes, either ~5ms (expected) or ~100ms for 1 million operations.
The only thing "bad" that this code is doing is the unaligned stores/loads and the hack at the end to store a vector into memory without the 4th element trampling memory. This would explain some performance variance, but the fact that the performance difference is so large makes me suspect I'm missing something important.
I've tried a couple of things but I'll have another crack at it after some sleep.
See code below. The m_matrix variable is aligned on the 16 byte boundary.
void Matrix3x3::MultiplySSE2(Matrix3x3 &other, Matrix3x3 &output)
{
__m128 a_row, r_row;
__m128 a1_row, r1_row;
__m128 a2_row, r2_row;
const __m128 b_row0 = _mm_load_ps(&other.m_matrix[0]);
const __m128 b_row1 = _mm_loadu_ps(&other.m_matrix[3]);
const __m128 b_row2 = _mm_loadu_ps(&other.m_matrix[6]);
// Perform dot products with first row
a_row = _mm_set1_ps(m_matrix[0]);
r_row = _mm_mul_ps(a_row, b_row0);
a_row = _mm_set1_ps(m_matrix[1]);
r_row = _mm_add_ps(_mm_mul_ps(a_row, b_row1), r_row);
a_row = _mm_set1_ps(m_matrix[2]);
r_row = _mm_add_ps(_mm_mul_ps(a_row, b_row2), r_row);
_mm_store_ps(&output.m_matrix[0], r_row);
// Perform dot products with second row
a1_row = _mm_set1_ps(m_matrix[3]);
r1_row = _mm_mul_ps(a1_row, b_row0);
a1_row = _mm_set1_ps(m_matrix[4]);
r1_row = _mm_add_ps(_mm_mul_ps(a1_row, b_row1), r1_row);
a1_row = _mm_set1_ps(m_matrix[5]);
r1_row = _mm_add_ps(_mm_mul_ps(a1_row, b_row2), r1_row);
_mm_storeu_ps(&output.m_matrix[3], r1_row);
// Perform dot products with third row
a2_row = _mm_set1_ps(m_matrix[6]);
r2_row = _mm_mul_ps(a2_row, b_row0);
a2_row = _mm_set1_ps(m_matrix[7]);
r2_row = _mm_add_ps(_mm_mul_ps(a2_row, b_row1), r2_row);
a2_row = _mm_set1_ps(m_matrix[8]);
r2_row = _mm_add_ps(_mm_mul_ps(a2_row, b_row2), r2_row);
// Store only the first 3 elements in a vector so we dont trample memory
_mm_store_ss(&output.m_matrix[6], _mm_shuffle_ps(r2_row, r2_row, _MM_SHUFFLE(0, 0, 0, 0)));
_mm_store_ss(&output.m_matrix[7], _mm_shuffle_ps(r2_row, r2_row, _MM_SHUFFLE(1, 1, 1, 1)));
_mm_store_ss(&output.m_matrix[8], _mm_shuffle_ps(r2_row, r2_row, _MM_SHUFFLE(2, 2, 2, 2)));
}

A performance hit like that sounds like your data is maybe crossing a page line sometimes, not just a cache-line. If you're testing on a buffer of many different matrices, rather than the same small matrix repeatedly, maybe something else running on another CPU core is pushing your buffer out of L3?
Performance issues in your code (these don't explain the factor-of-20 variance; they should always be slow):
_mm_set1_ps(m_matrix[3]) and so on is going to be a problem. It takes a pshufd or movaps + shufps to broadcast an element. I think this is unavoidable for matmuls, though.
Storing the final 3 elements without writing past the end: Try PALIGNR to get the last element of the previous row into a reg with the last row. Then you can do a single unaligned store, which overlaps with the preceding store. This is a lot fewer shuffles, and is probably faster than movss / extractps / extractps.
If you want to try something with fewer unaligned 16B stores, try movss, shuffle or right-shift by 4 bytes (psrldq aka _mm_bsrli_si128), then movq or movsd to store the last 8 bytes in one go. (byte-wise shift is on the same execution port as shuffles, unlike the per element bit-shifts)
Why do you do three _mm_shuffle_ps (shufps)? The low element is already the one you want, for the first column of the last row. Anyway, I think extractps (SSE4.1) is faster than shuffle + store on non-AVX, where preserving the source from being clobbered by shufps takes a move. (pshufd would work.)
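As a rough illustration of the movss + byte-shift + movq idea above, here is a minimal SSE2-only sketch. The helper name store3 is hypothetical (it is not from the question's code), and this has not been benchmarked against the three-shuffle version:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: write elements 0..2 of v to dst[0..2] and leave dst[3] untouched.
inline void store3(float* dst, __m128 v)
{
    // movss: store element 0.
    _mm_store_ss(dst, v);
    // psrldq: shift the whole register right by 4 bytes, so elements 1 and 2
    // end up in the low 8 bytes.
    const __m128i shifted = _mm_srli_si128(_mm_castps_si128(v), 4);
    // movq: store the low 8 bytes (elements 1 and 2) in one go; no alignment required.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(dst + 1), shifted);
}
```

In the question's function this would replace the three _mm_store_ss lines with something like store3(&output.m_matrix[6], r2_row).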

Related

Efficiently compute max of an array of 8 elements in arm neon

How do I find max element in array of 8 bytes, 8 shorts or 8 ints?
I may need just the position of the max element, value of the max element, or both of them.
For example:
unsigned FindMax8(const uint32_t src[8]) // returns position of max element
{
unsigned ret = 0;
for (unsigned i=0; i<8; ++i)
{
if (src[i] > src[ret])
ret = i;
}
return ret;
}
At -O2 clang unrolls the loop but does not use NEON, which should give a decent perf boost (because it eliminates many data-dependent branches?)
For 8 bytes and 8 shorts the approach should be simpler, as the entire array can be loaded into a single q-register. For arm64 this should be much simpler with vmaxv_u16, but how do I make it efficient in 32-bit NEON?
As noted by Marc in the comments, when the function is changed to return the max value, GCC's auto-vectorizer generates the following for neon64:
ldr q0, [x0, 16]
ld1r {v2.4s}, [x0]
ldr q1, [x0]
umax v0.4s, v0.4s, v2.4s
umax v0.4s, v0.4s, v1.4s
umaxv s0, v0.4s
umov w0, v0.s[0]
I have one function that does quite complex math, and at the end of the computation I end up with a uint32x4_t res result; all I need is the index of the max element. This single piece is the slowest part of the code, by far slower than the rest of this math-heavy function.
I tried three different approaches (from slowest to fastest according to profiler):
full computation using neon with final single 32-bit result transfer from neon to arm.
vst1q_u32(src, res) and then using regular C code to find index of the max element.
vmov to four 32-bit arm registers using vget_lane_u64 two times and then using some bit-shifts to figure out index of the max element.
Here's fastest version that I was able to get:
unsigned compute(unsigned short *input)
{
uint32x4_t result = vld1q_u32((uint32_t*)(input));
// some computations...
// ... and at the end I end up with res01 and res23
// and I need to get index of max element from them:
uint32x2_t res01 = vget_low_u32(result);
uint32x2_t res23 = vget_high_u32(result);
// real code below:
uint64_t xres01 = vget_lane_u64(vreinterpret_u64_u32(res01), 0);
uint64_t xres23 = vget_lane_u64(vreinterpret_u64_u32(res23), 0);
unsigned ret = 0;
uint32_t xmax0 = (uint32_t)(xres01 & 0xffffffff);
uint32_t xmax1 = (uint32_t)(xres01 >> 32);
uint32_t xmax2 = (uint32_t)(xres23 & 0xffffffff);
uint32_t xmax3 = (uint32_t)(xres23 >> 32);
if (xmax1 > xmax0)
{
xmax0 = xmax1;
ret = 1;
}
if (xmax2 > xmax0)
{
xmax0 = xmax2;
ret = 2;
}
if (xmax3 > xmax0)
ret = 3;
return ret;
}
Version using full neon computation does this:
using vmax/vpmax find max element
set u32x4_t to the max element
using vceq set max elements to 0xffffffff
load u32x4_t mask with {1u<<31, 1u<<30, 1u<<29, 1u<<28 }
do vand with the mask
pairwise add or vorr to collapse all 4 values to a single one.
using vclz set all to index of the max element
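The steps above can be modelled in plain C++ to check the bit trick before writing the intrinsics. This is only a scalar sketch in which each array slot stands in for one NEON lane; the function name max_index4 is made up for illustration, and __builtin_clz assumes a GCC or Clang toolchain:

```cpp
#include <algorithm>
#include <cstdint>

// Scalar model of the full-NEON sequence: every array slot plays the role of one lane.
unsigned max_index4(const uint32_t v[4])
{
    // vmax/vpmax: horizontal maximum of the four lanes.
    const uint32_t vmax = std::max(std::max(v[0], v[1]), std::max(v[2], v[3]));

    uint32_t bits = 0;
    for (unsigned lane = 0; lane < 4; ++lane) {
        const uint32_t eq   = (v[lane] == vmax) ? 0xffffffffu : 0u;  // vceq against the broadcast max
        const uint32_t mask = 1u << (31 - lane);                     // {1u<<31, 1u<<30, 1u<<29, 1u<<28}
        bits |= eq & mask;                                           // vand, then vorr to collapse lanes
    }
    // vclz: the highest set bit belongs to the lowest matching lane.
    // bits is never zero because at least one lane equals the maximum.
    return static_cast<unsigned>(__builtin_clz(bits));
}
```

Because lane 0 owns the highest mask bit, ties resolve to the lowest index, which matches the `>` comparison in the scalar loop.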
Maybe the issue is somewhere else; see the actual code that I'm trying to optimize, and my optimized version where only the last piece needs to be improved. Somehow the profiler shows that 80% of the time is spent in the last lines, where I compute the max index. Any ideas? Changing that simple C loop to pairs of regs improves the entire function by 20-30%. Note: according to the profiler, the two vst1_u32 are where the function spends most of its time.
What other approach could I try?
Update:
It seems that the slowdown at the end of the function isn't related to the code. I'm not sure why, but when I ran the different versions of the function, the timings changed by 3-4x depending on the order in which I called them. Also, with different testing it seems that the full NEON version is fastest if there is no stall at the end of the function, and I'm not sure why that stall happens. For that reason I created a new question to figure out why.

Efficiently count number of distinct values in 16-byte buffer in arm neon

Here's the basic algorithm to count number of distinct values in a buffer:
unsigned getCount(const uint8_t data[16])
{
uint8_t pop[256] = { 0 };
unsigned count = 0;
for (int i = 0; i < 16; ++i)
{
uint8_t b = data[i];
if (0 == pop[b])
count++;
pop[b]++;
}
return count;
}
Can this be done somehow in neon efficiently by loading into a q-reg and doing some bit magic? Alternatively, can I efficiently say that data has all elements identical, or contains only two distinct values or more than two?
For example, using vminv_u8 and vmaxv_u8 I can find min and max elements and if they are equal I know that data has identical elements. If not, then I can vceq_u8 with min value and vceq_u8 with max value and then vorr_u8 these results and compare that I have all 1-s in the result. Basically, in neon it can be done this way. Any ideas how to make it better?
unsigned getCountNeon(const uint8_t data[16])
{
uint8x16_t s = vld1q_u8(data);
uint8x16_t smin = vdupq_n_u8(vminvq_u8(s));
uint8x16_t smax = vdupq_n_u8(vmaxvq_u8(s));
uint8x16_t res = vdupq_n_u8(1);
uint8x16_t one = vdupq_n_u8(1);
for (int i = 0; i < 14; ++i) // this obviously needs to be unrolled
{
s = vbslq_u8(vceqq_u8(s, smax), smin, s); // replace max with min
uint8x16_t smax1 = vdupq_n_u8(vmaxvq_u8(s));
res = vaddq_u8(res, vaddq_u8(vceqq_u8(smax1, smax), one));
smax = smax1;
}
res = vaddq_u8(res, vaddq_u8(vceqq_u8(smax, smin), one));
return vgetq_lane_u8(res, 0);
}
With some optimizations and improvements perhaps a 16-byte block can be processed in 32-48 neon instructions. Can this be done better in arm? Unlikely
Some background why I ask this question. As I'm working on an algorithm I'm trying different approaches at processing data and I'm not sure yet what exactly I'll use at the end. Information that might be of use:
count of distinct elements per 16-byte block
value that repeats most per 16-byte block
average per block
median per block
speed of light?.. that's a joke, it cannot be computed in neon from 16-byte block :)
so, I'm trying stuff, and before I use any approach I want to see if that approach can be well optimized. For example, average per block will be memcpy speed on arm64 basically.
If you're expecting a lot of duplicates, and can efficiently get a horizontal min with vminv_u8, this might be better than scalar. Or not, maybe NEON->ARM stalls for the loop condition kill it. >.< But it should be possible to mitigate that with unrolling (and saving some info in registers to figure out how far you overshot).
// pseudo-code because I'm too lazy to look up ARM SIMD intrinsics, edit welcome
// But I *think* ARM can do these things efficiently,
// except perhaps the loop condition. High latency could be ok, but stalling isn't
int count_dups(uint8x16_t v)
{
int dups = (0xFF == vmax_u8(v)); // count=1 if any elements are 0xFF to start
auto hmin = vmin_u8(v);
while (hmin != 0xff) {
auto min_bcast = vdup(hmin); // broadcast the minimum
auto matches = cmpeq(v, min_bcast);
v |= matches; // min and its dups become 0xFF
hmin = vmin_u8(v);
dups++;
}
return dups;
}
This turns unique values into 0xFF, one set of duplicates at a time.
The loop-carried dep chain through v / hmin stays in vector registers; it's only the loop branch that needs NEON->integer.
Minimizing / hiding NEON->integer/ARM penalties
Unroll by 8 with no branches on hmin, leaving results in 8 NEON registers. Then transfer those 8 values; back-to-back transfers of multiple NEON registers to ARM only incurs one total stall (of 14 cycles on whatever Jake tested on.) Out-of-order execution could also hide some of the penalty for this stall. Then check those 8 integer registers with a fully-unrolled integer loop.
Tune the unroll factor to be large enough that you usually don't need another round of SIMD operations for most input vectors. If almost all of your vectors have at most 5 unique values, then unroll by 5 instead of 8.
Instead of transferring multiple hmin results to integer, count them in NEON. If you can use ARM32 NEON partial-register tricks to put multiple hmin values in the same vector for free, it's only a bit more work to shuffle 8 of them into one vector and compare for not-equal to 0xFF. Then horizontally add that compare result to get a -count.
Or if you have values from different input vectors in different elements of a single vector, you can use vertical operations to add results for multiple input vectors at once without needing horizontal ops.
There's almost certainly room to optimize this, but I don't know ARM that well, or ARM performance details. NEON's hard to use for anything conditional because of the big performance penalty for NEON->integer, totally unlike x86. Glibc has a NEON memchr with NEON->integer in the loop, but I don't know if it uses it or if it's faster than scalar.
Speeding up repeated calls to the scalar ARM version:
Zeroing the 256-byte buffer every time would be expensive, but we don't need to do that. Use a sequence number to avoid needing to reset:
Before every new set of elements: ++seq;
For each element in the set:
sum += (histogram[i] == seq);
histogram[i] = seq; // no data dependency on the load result, unlike ++
You might make the histogram an array of uint16_t or uint32_t to avoid needing to re-zero if a uint8_t seq wraps. But then it takes more cache footprint, so maybe just re-zeroing every 254 sequence numbers makes the most sense.
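A minimal sketch of the sequence-number idea, assuming a uint8_t sequence counter that re-zeroes the table only when it wraps. The struct name DistinctCounter is hypothetical, and it counts distinct values (first sightings) rather than repeats:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: counts distinct byte values without clearing the table on every call.
struct DistinctCounter {
    uint8_t stamp[256] = {};  // sequence number of the last block that contained each value
    uint8_t seq = 0;          // 0 means "never seen", so live sequence numbers are 1..255

    unsigned count(const uint8_t* data, std::size_t n)
    {
        if (++seq == 0) {
            // The counter wrapped: stale stamps could now alias, so reset once.
            std::memset(stamp, 0, sizeof(stamp));
            seq = 1;
        }
        unsigned distinct = 0;
        for (std::size_t i = 0; i < n; ++i) {
            const uint8_t b = data[i];
            distinct += (stamp[b] != seq);  // first time this value shows up in this block
            stamp[b] = seq;                 // plain store; does not depend on the loaded value
        }
        return distinct;
    }
};
```

Keep one DistinctCounter alive across calls; constructing a fresh one per block would bring back the 256-byte zeroing cost.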

How to speed up this histogram of LUT lookups?

First, I have an array int a[1000][1000]. All these integers are between 0 and 32767 ,and they are known constants: they never change during a run of the program.
Second, I have an array b[32768], which contains integers between 0 and 32. I use this array to map all arrays in a to 32 bins:
int bins[32]{};
for (auto e : a[i])//mapping a[i] to 32 bins.
bins[b[e]]++;
Each time, array b is reloaded with a new mapping, and I need to hash all 1000 arrays in a (each containing 1000 ints) into 1000 arrays of 32 ints each, where each int counts how many values fall into that bin.
int new_array[32768] = {some new mapping};
copy(begin(new_array), end(new_array), begin(b));//reload array b;
int bins[1000][32]{};//output array to store results .
for (int i = 0; i < 1000;i++)
for (auto e : a[i])//hashing a[i] to 32 bins.
bins[i][b[e]]++;
I can map 1000*1000 values in 0.00237 seconds. Is there any other way that I can speed up my code? (Like SIMD?) This piece of code is the bottleneck of my program.
This is essentially a histogram problem. You're mapping 16-bit values to 5-bit values with a 32k-entry lookup table, but after that it's just histogramming the LUT results. Like ++counts[ b[a[j]] ];, where counts is bins[i]. See below for more about histograms.
First of all, you can use the smallest possible data-types to increase the density of your LUT (and of the original data). On x86, a zero or sign-extending load of 8-bit or 16-bit data into a register is almost exactly the same cost as a regular 32-bit int load (assuming both hit in cache), and an 8-bit or 16-bit store is also just as cheap as a 32-bit store.
Since your data size exceeds L1 d-cache size (32kiB for all recent Intel designs), and you access it in a scattered pattern, you have a lot to gain from shrinking your cache footprint. (For more x86 perf info, see the x86 tag wiki, especially Agner Fog's stuff).
Since a has less than 65536 entries in each plane, your bin counts will never overflow a 16-bit counter, so bins can be uint16_t as well.
Your copy() makes no sense. Why are you copying into b[32768] instead of having your inner loop use a pointer to the current LUT? You use it read-only. The only reason you'd still want to copy is to convert from int to uint8_t, if you can't change the code that produces the different LUTs to produce int8_t or uint8_t in the first place.
This version takes advantage of those ideas and a few histogram tricks, and compiles to asm that looks good (Godbolt compiler explorer: gcc6.2 -O3 -march=haswell (AVX2)):
// untested
//#include <algorithm>
#include <stdint.h>
const int PLANES = 1000;
void use_bins(uint16_t bins[PLANES][32]); // pass the result to an extern function so it doesn't optimize away
// 65536 or higher triggers the static_assert
alignas(64) static uint16_t a[PLANES][1000]; // static/global, I guess?
void lut_and_histogram(uint8_t __restrict__ lut[32768])
{
alignas(16) uint16_t bins[PLANES][32]; // don't zero the whole thing up front: that would evict more data from cache than necessary
// Better would be zeroing the relevant plane of each bin right before using.
// you pay the rep stosq startup overhead more times, though.
for (int i = 0; i < PLANES;i++) {
alignas(16) uint16_t tmpbins[4][32] = {0};
constexpr int a_elems = sizeof(a[0])/sizeof(uint16_t);
static_assert(a_elems > 1, "someone changed a[] into a* and forgot to update this code");
static_assert(a_elems <= UINT16_MAX, "bins could overflow");
const uint16_t *ai = a[i];
for (int j = 0 ; j<a_elems ; j+=4) { //hashing a[i] to 32 bins.
// Unrolling to separate bin arrays reduces serial dependencies
// to avoid bottlenecks when the same bin is used repeatedly.
// This has to be balanced against using too much L1 cache for the bins.
// TODO: load a vector of data from ai[j] and unpack it with pextrw.
// even just loading a uint64_t and unpacking it to 4 uint16_t would help.
tmpbins[0][ lut[ai[j+0]] ]++;
tmpbins[1][ lut[ai[j+1]] ]++;
tmpbins[2][ lut[ai[j+2]] ]++;
tmpbins[3][ lut[ai[j+3]] ]++;
static_assert(a_elems % 4 == 0, "unroll factor doesn't divide a element count");
}
// TODO: do multiple a[i] in parallel instead of slicing up a single run.
for (int k = 0 ; k<32 ; k++) {
// gcc does auto-vectorize this with a short fully-unrolled VMOVDQA / VPADDW x3
bins[i][k] = tmpbins[0][k] + tmpbins[1][k] +
tmpbins[2][k] + tmpbins[3][k];
}
}
// do something with bins. An extern function stops it from optimizing away.
use_bins(bins);
}
The inner-loop asm looks like this:
.L2:
movzx ecx, WORD PTR [rdx]
add rdx, 8 # pointer increment over ai[]
movzx ecx, BYTE PTR [rsi+rcx]
add WORD PTR [rbp-64272+rcx*2], 1 # memory-destination increment of a histogram element
movzx ecx, WORD PTR [rdx-6]
movzx ecx, BYTE PTR [rsi+rcx]
add WORD PTR [rbp-64208+rcx*2], 1
... repeated twice more
With those 32-bit offsets from rbp (instead of 8-bit offsets from rsp, or using another register :/) the code density isn't wonderful. Still, the average instruction length isn't so long that it's likely to bottleneck on instruction decode on any modern CPU.
A variation on multiple bins:
Since you need to do multiple histograms anyway, just do 4 to 8 of them in parallel instead of slicing the bins for a single histogram. The unroll factor doesn't even have to be a power of 2.
That eliminates the need for the bins[i][k] = sum(tmpbins[0..3][k]) loop over k at the end.
Zero bins[i..i+unroll_factor][0..31] right before use, instead of zeroing the whole thing outside the loop. That way all the bins will be hot in L1 cache when you start, and this work can overlap with the more load-heavy work of the inner loop.
Hardware prefetchers can keep track of multiple sequential streams, so don't worry about having a lot more cache misses in loading from a. (Also use vector loads for this, and slice them up after loading).
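Here is a rough sketch of that variation: four planes are histogrammed together, each into its own final row of bins, so there is nothing to sum at the end. The function name, the flat row-major layout and the unroll factor of 4 are assumptions made for the example, not taken from the question:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kBins = 32;  // bin count from the question

// a:    planes x elems values, row-major; every value must be a valid index into lut
// lut:  maps each value to a bin number in [0, kBins)
// bins: planes x kBins counters, row-major; elems must stay below 65536 to avoid overflow
void histogram_planes(const uint16_t* a, std::size_t planes, std::size_t elems,
                      const uint8_t* lut, uint16_t* bins)
{
    std::size_t i = 0;
    for (; i + 4 <= planes; i += 4) {
        uint16_t* b0 = bins + i * kBins;  // four consecutive output rows
        uint16_t* b1 = b0 + kBins;
        uint16_t* b2 = b1 + kBins;
        uint16_t* b3 = b2 + kBins;
        std::memset(b0, 0, 4 * kBins * sizeof(uint16_t));  // zero right before use

        const uint16_t* r0 = a + i * elems;  // four independent input streams
        const uint16_t* r1 = r0 + elems;
        const uint16_t* r2 = r1 + elems;
        const uint16_t* r3 = r2 + elems;
        for (std::size_t j = 0; j < elems; ++j) {
            // Each increment targets a different row, so a repeated bin in one
            // plane does not serialize the increments for the other planes.
            ++b0[lut[r0[j]]];
            ++b1[lut[r1[j]]];
            ++b2[lut[r2[j]]];
            ++b3[lut[r3[j]]];
        }
    }
    for (; i < planes; ++i) {  // leftover planes when the count is not a multiple of 4
        uint16_t* b = bins + i * kBins;
        std::memset(b, 0, kBins * sizeof(uint16_t));
        const uint16_t* r = a + i * elems;
        for (std::size_t j = 0; j < elems; ++j) ++b[lut[r[j]]];
    }
}
```

Whether this beats slicing a single plane into four temporary bin arrays depends on how well the four input streams prefetch, so it is worth timing both.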
Other questions with useful answers about histograms:
Methods to vectorise histogram in SIMD? suggests the multiple-bin-arrays and sum at the end trick.
Optimizing SIMD histogram calculation x86 asm loading a vector of a values and extracting to integer registers with pextrb. (In your code, you'd use pextrw / _mm_extract_epi16). With all the load/store uops happening, doing a vector load and using ALU ops to unpack makes sense. With good L1 hit rates, memory uop throughput may be the bottleneck, not memory / cache latency.
How to optimize histogram statistics with neon intrinsics? some of the same ideas: multiple copies of the bins array. It also has an ARM-specific suggestion for doing address calculations in a SIMD vector (ARM can get two scalars from a vector in a single instruction), and laying out the multiple-bins array the opposite way.
AVX2 Gather instructions for the LUT
If you're going to run this on Intel Skylake, you could maybe even do the LUT lookups with AVX2 gather instructions. (On Broadwell, it's probably a break-even, and on Haswell it would lose; they support vpgatherdd (_mm_i32gather_epi32), but don't have as efficient an implementation. Hopefully Skylake avoids hitting the same cache line multiple times when there is overlap between elements).
And yes, you can still gather from an array of uint16_t (with scale factor = 2), even though the smallest gather granularity is 32-bit elements. It means you get garbage in the high half of each 32-bit vector element instead of 0, but that shouldn't matter. Cache-line splits aren't ideal, since we're probably bottlenecked on cache throughput.
Garbage in the high half of gathered elements doesn't matter because you're extracting only the useful 16 bits anyway with pextrw. (And doing the histogram part of the process with scalar code).
You could potentially use another gather to load from the histogram bins, as long as each element comes from a separate slice/plane of histogram bins. Otherwise, if two elements come from the same bin, it would only be incremented by one when you manually scatter the incremented vector back into the histogram (with scalar stores). This kind of conflict detection for scatter stores is why AVX512CD exists. AVX512 does have scatter instructions, as well as gather (already added in AVX2).
AVX512
See page 50 of Kirill Yukhin's slides from 2014 for an example loop that retries until there are no conflicts; but it doesn't show how get_conflict_free_subset() is implemented in terms of __m512i _mm512_conflict_epi32 (__m512i a) (vpconflictd) (which returns a bitmap in each element of all the preceding elements it conflicts with). As @Mysticial points out, a simple implementation is less simple than it would be if the conflict-detect instruction simply produced a mask-register result, instead of another vector.
I searched for but didn't find an Intel-published tutorial/guide on using AVX512CD, but presumably they think using _mm512_lzcnt_epi32 (vplzcntd) on the result of vpconflictd is useful for some cases, because it's also part of AVX512CD.
Maybe you're "supposed" to do something more clever than just skipping all elements that have any conflicts? Maybe to detect a case where a scalar fallback would be better, e.g. all 16 dword elements have the same index? vpbroadcastmw2d broadcasts a mask register to all 32-bit elements of the result, so that lets you line up a mask-register value with the bitmaps in each element from vpconflictd. (And there are already compare, bitwise, and other operations between elements from AVX512F).
Kirill's slides list VPTESTNM{D,Q} (from AVX512F) along with the conflict-detection instructions. It generates a mask from DEST[j] = (SRC1[i+31:i] BITWISE AND SRC2[i+31:i] == 0)? 1 : 0. i.e. AND elements together, and set the mask result for that element to 1 if they don't intersect.
Possibly also relevant: http://colfaxresearch.com/knl-avx512/ says "For a practical illustration, we construct and optimize a micro-kernel for particle binning particles", with some code for AVX2 (I think). But it's behind a free registration which I haven't done. Based on the diagram, I think they're doing the actual scatter part as scalar, after some vectorized stuff to produce data they want to scatter. The first link says the 2nd link is "for previous instruction sets".
Avoid gather/scatter conflict detection by replicating the count array
When the number of buckets is small compared to the size of the array, it becomes viable to replicate the count arrays and unroll to minimize store-forwarding latency bottlenecks with repeated elements. But for a gather/scatter strategy, it also avoids the possibility of conflicts, solving the correctness problem, if we use a different array for each vector element.
How can we do that when a gather / scatter instruction only takes one array base? Make all the count arrays contiguous, and offset each index vector with one extra SIMD add instruction, fully replacing conflict detection and branching.
If the number of buckets isn't a multiple of 16, you might want to round up the array geometry so each subset of counts starts at an aligned offset. Or not, if cache locality is more important than avoiding unaligned loads in the reduction at the end.
const size_t nb = 32; // number of buckets
const int VEC_WIDTH = 16; // sizeof(__m512i) / sizeof(uint32_t)
alignas(__m512i) uint32_t counts[nb * VEC_WIDTH] = {0};
// then in your histo loop
__m512i idx = ...; // in this case from LUT lookups
idx = _mm512_add_epi32(idx, _mm512_setr_epi32(
0*nb, 1*nb, 2*nb, 3*nb, 4*nb, 5*nb, 6*nb, 7*nb,
8*nb, 9*nb, 10*nb, 11*nb, 12*nb, 13*nb, 14*nb, 15*nb));
// note these are C array indexes, not byte offsets
__m512i vc = _mm512_i32gather_epi32(idx, counts, sizeof(counts[0]));
vc = _mm512_add_epi32(vc, _mm512_set1_epi32(1));
_mm512_i32scatter_epi32(counts, idx, vc, sizeof(counts[0]));
https://godbolt.org/z/8Kesx7sEK shows that the above code actually compiles. (Inside a loop, the vector-constant setup could get hoisted, but not setting mask registers to all-one before each gather or scatter, or preparing a zeroed merge destination.)
Then after the main histogram loop, reduce down to one count array:
// Optionally with size_t nb as an arg
// also optionally use restrict if you never reduce in-place, into the bottom of the input.
void reduce_counts(int *output, const int *counts)
{
for (int i = 0 ; i < nb - (VEC_WIDTH-1) ; i+=VEC_WIDTH) {
__m512i v = _mm512_load_si512(&counts[i]); // aligned load, full cache line
// optional: unroll this and accumulate two vectors in parallel for better spatial locality and more ILP
for (int offset = nb; offset < nb*VEC_WIDTH ; offset+=nb) {
__m512i tmp = _mm512_loadu_si512(&counts[i + offset]);
v = _mm512_add_epi32(v, tmp);
}
_mm512_storeu_si512(&output[i], v);
}
// if nb isn't a multiple of the vector width, do some cleanup here
// possibly using a masked store to write into a final odd-sized destination
}
Obviously this is bad with too many buckets; you end up having to zero way more memory, and loop over a lot of it at the end. Using 256-bit instead of 512-bit gathers helps, you only need half as many arrays, but efficiency of gather/scatter instructions improves with wider vectors. e.g. one vpgatherdd per 5 cycles for 256-bit on Cascade Lake, one per 9.25 for 512-bit. (And both are 4 uops for the front-end)
Or on Ice Lake, one vpscatterdd ymm per 7 cycles, one zmm per 11 cycles. (vs. 14 for 2x ymm). https://uops.info/
In your bins[1000][32] case, you could actually use the later elements of bins[i+0..15] as extra count arrays, if you zero first, at least for the first 1000-15 outer loop iterations. That would avoid touching extra memory: zeroing for the next outer loop would start at the previous counts[32], effectively.
(This would be playing a bit fast and loose with C 2D vs. 1D arrays, but all the actual accesses past the end of the [32] C array type would be via memset (i.e. unsigned char*) or via _mm* intrinsics which are also allowed to alias anything)
Related:
Tiny histograms (like 4 buckets) can use count[0] += (arr[i] == 0) and so on, which you can vectorize with SIMD packed compares - Micro Optimization of a 4-bucket histogram of a large array or list This is interesting when the number of buckets is less than or equal to the number of elements in a SIMD vector.
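To make the packed-compare idea concrete, here is a small SSE2 sketch for a 4-bucket histogram of 32-bit values. The function name histogram4 and the choice of int32_t elements are assumptions for illustration; the linked question may use a different element type and a more tuned loop:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Counts how many elements of arr equal 0, 1, 2 and 3. Other values are ignored.
void histogram4(const int32_t* arr, std::size_t n, uint32_t counts[4])
{
    __m128i acc[4] = { _mm_setzero_si128(), _mm_setzero_si128(),
                       _mm_setzero_si128(), _mm_setzero_si128() };
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(arr + i));
        for (int k = 0; k < 4; ++k) {
            // A compare yields -1 (all-ones) in matching lanes, so subtracting it adds 1.
            const __m128i eq = _mm_cmpeq_epi32(v, _mm_set1_epi32(k));
            acc[k] = _mm_sub_epi32(acc[k], eq);
        }
    }
    for (int k = 0; k < 4; ++k) {
        // Horizontal sum of the four per-lane counters for bucket k.
        uint32_t lanes[4];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc[k]);
        counts[k] = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }
    for (; i < n; ++i) {  // scalar tail for the last n % 4 elements
        if (arr[i] >= 0 && arr[i] < 4) ++counts[arr[i]];
    }
}
```

There is no store-forwarding dependency on a counts array inside the main loop, which is the point of the trick when the bucket count is this small.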

AVX2 Winner-Take-All Disparity Search

I am optimizing the "winner-take-all" portion of a disparity estimation algorithm using AVX2. My scalar routine is accurate, but at QVGA resolution and 48 disparities the runtime is disappointingly slow at ~14 ms on my laptop. I create both LR and RL disparity images, but for simplicity here I will only include code for the RL search.
My scalar routine:
int MAXCOST = 32000;
for (int i = maskRadius; i < rstep-maskRadius; i++) {
// WTA "RL" Search:
for (int j = maskRadius; j+maskRadius < cstep; j++) {
int minCost = MAXCOST;
int minDisp = 0;
for (int d = 0; d < numDisp && j+d < cstep; d++) {
if (asPtr[(i*numDisp*cstep)+(d*cstep)+j] < minCost) {
minCost = asPtr[(i*numDisp*cstep)+(d*cstep)+j];
minDisp = d;
}
}
dRPtr[(i*cstep)+j] = minDisp;
}
}
My attempt at using AVX2:
int MAXCOST = 32000;
int* dispVals = (int*) _mm_malloc( sizeof(int32_t)*16, 32 );
for (int i = maskRadius; i < rstep-maskRadius; i++) {
// WTA "RL" Search AVX2:
for( int j = 0; j < cstep-16; j+=16) {
__m256i minCosts = _mm256_set1_epi16( MAXCOST );
__m128i loMask = _mm_setzero_si128();
__m128i hiMask = _mm_setzero_si128();
for (int d = 0; d < numDisp && j+d < cstep; d++) {
// Grab 16 costs to compare
__m256i costs = _mm256_loadu_si256((__m256i*) (asPtr[(i*numDisp*cstep)+(d*cstep)+j]));
// Get the new minimums
__m256i newMinCosts = _mm256_min_epu16( minCosts, costs );
// Compare new mins to old to build mask to store minDisps
__m256i mask = _mm256_cmpgt_epi16( minCosts, newMinCosts );
__m128i loMask = _mm256_extracti128_si256( mask, 0 );
__m128i hiMask = _mm256_extracti128_si256( mask, 1 );
// Sign extend to 32bits
__m256i loMask32 = _mm256_cvtepi16_epi32( loMask );
__m256i hiMask32 = _mm256_cvtepi16_epi32( hiMask );
__m256i currentDisp = _mm256_set1_epi32( d );
// store min disps with mask
_mm256_maskstore_epi32( dispVals, loMask32, currentDisp ); // RT error, why?
_mm256_maskstore_epi32( dispVals+8, hiMask32, currentDisp ); // RT error, why?
// Set minCosts to newMinCosts
minCosts = newMinCosts;
}
// Write the WTA minimums one-by-one to the RL disparity image
int index = (i*cstep)+j;
for( int k = 0; k < 16; k++ ) {
dRPtr[index+k] = dispVals[k];
}
}
}
_mm_free( dispVals );
The Disparity Space Image (DSI) is of size HxWxD (320x240x48), which I lay out horizontally for better memory accesses, such that each row is of size WxD.
The Disparity Space Image has per-pixel matching costs. This aggregated
with a simple box filter to make another image of the exact same size,
but with costs summed over, say, a 3x3 or 5x5 window. This smoothing makes
the result more 'robust'. When I am accessing with asPtr, I am indexing
into this aggregated costs image.
Also, in an effort to save on unnecessary computation, I have been starting
and ending on rows offset by a mask radius. This mask radius is the radius
of my census mask. I could be doing some fancy border reflection, but it is
simpler and faster just to not bother with the disparity for this border.
This of course applies to the beginning and ending cols too, but messing with
indexing here is not good when I am forcing my entire algorithm to run only
on images whose columns are a multiple of 16 (ex. QVGA: 320x240) so that I
can index simply and hit everything with SIMD (no residual scalar processing).
Also, if you think my code is a mess, I encourage you to check out the
highly optimized OpenCV stereo algorithms. I find them impossible and have been able to make little to no use of them.
My code compiles but fails at runtime. I am using VS 2012 Express Update 4. When I run with the debugger I am unable to gain any insights. I am relatively new to using intrinsics and so I am not sure what information I should expect to see when debugging, number of registers, whether __m256i variables should be visible, etc.
Heeding comment advice below, I improved the scalar time from ~14 to ~8 by using smarter indexing. My CPU is an i7-4980HQ and I successfully use AVX2 intrinsics elsewhere in the same file.
I still haven't found the problem, but I did see some things you might want to change. You're not checking the return value of _mm_malloc, though. If it's failing, that would explain it. (Maybe it doesn't like allocating 32-byte aligned memory?)
If you're running your code under a memory checker or something, then maybe it doesn't like reading from uninitialized memory for dispVals. (_mm256_maskstore_epi32 may count as a read-modify-write even if the mask is all-ones.)
Run your code under a debugger and find out what's going wrong. "runtime error" is not very meaningful.
_mm_set1* functions are slow-ish. VPBROADCASTD needs its source in memory or a vector reg, not a GP reg, so the compiler can either movd from a GP reg to a vector reg and then broadcast, or store to memory and then broadcast. Anyway, it would be faster to do
const __m256i add1 = _mm256_set1_epi32( 1 );
__m256i dvec = _mm256_setzero_si256();
for (d;d...;d++) {
dvec = _mm256_add_epi32(dvec, add1);
}
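A compilable version of the same idea, as a rough sketch. It uses 128-bit SSE2 vectors instead of AVX2 so that it builds without extra compiler flags; the function name fillDisparityVectors and the output layout are made up for the demo.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// Hypothetical demo: for each disparity d, store a vector holding d in every
// lane (4 ints per disparity), without broadcasting d inside the loop.
void fillDisparityVectors(int32_t* out, int numDisp)
{
    assert(numDisp >= 0);
    const __m128i add1 = _mm_set1_epi32(1);  // broadcast once, outside the loop
    __m128i dvec = _mm_setzero_si128();      // holds {d, d, d, d}
    for (int d = 0; d < numDisp; ++d) {
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 4 * d), dvec);
        dvec = _mm_add_epi32(dvec, add1);    // one vector add replaces the broadcast
    }
}
```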
Other stuff:
This will probably run faster if you aren't storing to memory every iteration of the inner loop. Use a blend instruction (_mm256_blendv_epi8), or something like that, to update the vector(s) of displacements that go with the min costs. Blend = masked move with a register destination.
Also, your displacement values should fit in 16b integers, so don't sign-extend them to 32b until AFTER you're done finding them. Intel CPUs can sign- or zero-extend a 16b memory location into a GP register on the fly with no speed penalty (movsx / movzx are as fast as mov), so probably just declare your dRPtr array as uint16_t. Then you don't need the sign-extending stuff in your vector code at all (let alone in your inner loop!). Hopefully _mm256_extracti128_si256( mask, 0 ) compiles to nothing, since the 128 you want is already the low 128, so the compiler can just use the reg as the src for vpmovsx, but still.
You can also save an instruction (and a fused-domain uop) by not loading first, i.e. by letting vpminuw take a memory operand (unless the compiler is already smart enough to elide the vmovdqu and do that for you, even though you used the load intrinsic).
So I'm thinking something like this:
// totally untested, didn't even check that this compiles.
for(i) { for(j) {
// inner loop, compiler can hoist these constants.
const __m256i add1 = _mm256_set1_epi16( 1 );
__m256i dvec = _mm256_setzero_si256();
__m256i minCosts = _mm256_set1_epi16( MAXCOST );
__m256i minDisps = _mm256_setzero_si256();
for (int d=0 ; d < numDisp && j+d < cstep ;
d++, dvec = _mm256_add_epi16(dvec, add1))
{
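// with AVX the compiler can fold this unaligned load into vpminuw's memory operand
__m256i newMinCosts = _mm256_min_epu16( minCosts, _mm256_loadu_si256((const __m256i*)&asPtr[(i*numDisp*cstep)+(d*cstep)+j]) );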
__m256i mask = _mm256_cmpgt_epi16( minCosts, newMinCosts );
minDisps = _mm256_blendv_epi8(minDisps, dvec, mask); // 2 uops, latency=2
minCosts = newMinCosts;
}
// put sign extension here if making dRPtr uint16_t isn't an option.
int index = (i*cstep)+j;
_mm256_storeu_si256((__m256i*)(dRPtr + index), minDisps);
}}
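Since the loop above is untested, here is a smaller sketch of the same min-cost / blend pattern that does compile. It is not the AVX2 code: it uses 128-bit SSE2 (8 columns at a time) and emulates the blend with and/andnot/or, because SSE2 has no blendv. The name minDisparity8, the cost layout costs[d*cstep + j], and the assumption that costs stay below 32768 (so signed 16-bit compares are safe) are all mine; the j+d < cstep bound from the loop above is left out.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// Hypothetical layout: costs[d * cstep + j] is the cost of disparity d at column j.
// Writes the winning disparity for columns j..j+7 to dispOut[j..j+7].
void minDisparity8(const uint16_t* costs, int cstep, int numDisp, int j,
                   uint16_t* dispOut)
{
    assert(j >= 0 && j + 8 <= cstep);
    const __m128i add1 = _mm_set1_epi16(1);
    __m128i dvec = _mm_setzero_si128();        // current disparity in every lane
    __m128i minCosts = _mm_set1_epi16(32000);  // MAXCOST
    __m128i minDisps = _mm_setzero_si128();
    for (int d = 0; d < numDisp; ++d) {
        const __m128i c = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(costs + d * cstep + j));
        // all-ones in lanes where c is strictly below the running minimum
        const __m128i mask = _mm_cmpgt_epi16(minCosts, c);
        // SSE2 stand-in for a blend: (dvec & mask) | (minDisps & ~mask)
        minDisps = _mm_or_si128(_mm_and_si128(mask, dvec),
                                _mm_andnot_si128(mask, minDisps));
        minCosts = _mm_min_epi16(minCosts, c);
        dvec = _mm_add_epi16(dvec, add1);
    }
    // a single store per block of columns, after the disparity loop
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dispOut + j), minDisps);
}
```

The point is the same as in the AVX2 version: nothing is written to memory inside the d loop, only registers are updated.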
You might get better performance having two parallel dependency chains: minCosts0 / minDisps0, and minCosts1 / minDisps1, and then combining them at the end. minDisps is a loop-carried dependency, but the loop only has 5 instructions (including the vpaddw, which looks like loop overhead but can't be reduced by unrolling). They decode to 6 uops (blendv is 2), plus loop overhead. It should run in 1.5 cycles / iteration (not counting loop overhead) on Haswell, but the dep chain would limit it to one iteration per 2 cycles (assuming unrolling to get rid of loop overhead). Doing two dep chains in parallel fixes this, and has the same effect as unrolling the loop: less loop overhead.
Hmm, actually on Haswell,
pminuw can run on p1/p5. (and the load part on p2/p3)
pcmpgtw can run on p1/p5
vpblendvb is 2 uops for p5.
paddw can run on p1/p5
movdqa reg,reg can run on p0/p1/p5 (and may not need an execution unit at all). Unrolling should get rid of any overhead for minCosts = newMinCosts, since the compiler can just end up with newMinCosts from the last unrolled loop body in the right register for the first loop body of the next iteration.
fused sub / jge (loop counter) can run on p6. (using PTEST + jcc on dvec would be slower). add/sub can run on p0/p1/p5/p6 when not fused with a jcc.
Ok, so actually the loop will take 2.5 cycles per iteration, limited by instructions that can only run on p1/p5. Unrolling by 2 or 4 will reduce the loop / movdqa overhead. Since Haswell can issue 4 uops per clock, it can then more efficiently queue up uops for out-of-order execution, since the loop won't have a super-high number of iterations. (48 was your example.) Having lots of uops queued up will give the CPU something to do after leaving the loop, and hide any latencies from cache misses, etc.
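A sketch of the two-dependency-chain idea, again in 128-bit SSE2 rather than AVX2 so it compiles without extra flags. Chain 0 handles even disparities and chain 1 odd ones, and they are merged once at the end. The names select16 and minDisparity8TwoChains and the costs[d*cstep + j] layout are assumptions for the demo, and costs are assumed to stay below 32768. I have not benchmarked it.

```cpp
#include <emmintrin.h>  // SSE2
#include <cassert>
#include <cstdint>

// Take b where mask is set, a elsewhere (SSE2 stand-in for a blend).
static inline __m128i select16(__m128i a, __m128i b, __m128i mask)
{
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}

// costs points at the first of 8 columns; costs[d * cstep + k] is column k, disparity d.
void minDisparity8TwoChains(const uint16_t* costs, int cstep, int numDisp,
                            uint16_t* dispOut)
{
    assert(cstep >= 8 && numDisp >= 0);
    const __m128i add2 = _mm_set1_epi16(2);
    __m128i d0 = _mm_setzero_si128();  // even disparities 0, 2, 4, ...
    __m128i d1 = _mm_set1_epi16(1);    // odd disparities 1, 3, 5, ...
    __m128i minC0 = _mm_set1_epi16(32000), minC1 = minC0;  // MAXCOST
    __m128i minD0 = _mm_setzero_si128(), minD1 = minD0;
    int d = 0;
    for (; d + 1 < numDisp; d += 2) {
        const __m128i c0 = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(costs + d * cstep));
        const __m128i c1 = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(costs + (d + 1) * cstep));
        // the two chains never read each other's registers inside the loop
        minD0 = select16(minD0, d0, _mm_cmpgt_epi16(minC0, c0));
        minD1 = select16(minD1, d1, _mm_cmpgt_epi16(minC1, c1));
        minC0 = _mm_min_epi16(minC0, c0);
        minC1 = _mm_min_epi16(minC1, c1);
        d0 = _mm_add_epi16(d0, add2);
        d1 = _mm_add_epi16(d1, add2);
    }
    if (d < numDisp) {  // odd numDisp: the last (even) disparity goes to chain 0
        const __m128i c0 = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(costs + d * cstep));
        minD0 = select16(minD0, d0, _mm_cmpgt_epi16(minC0, c0));
        minC0 = _mm_min_epi16(minC0, c0);
    }
    // Merge: chain 1 wins where it is strictly cheaper, or where the costs tie
    // and its disparity is smaller (keeps "first minimum wins" behaviour).
    const __m128i cheaper = _mm_cmpgt_epi16(minC0, minC1);
    const __m128i tie = _mm_cmpeq_epi16(minC0, minC1);
    const __m128i earlier = _mm_cmpgt_epi16(minD0, minD1);
    const __m128i take1 = _mm_or_si128(cheaper, _mm_and_si128(tie, earlier));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dispOut),
                     select16(minD0, minD1, take1));
}
```

The merge step costs a few extra instructions, but it runs once per block of columns rather than once per disparity.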
_mm256_min_epu16 (PMINUW) is another loop-carried dependency chain. Using it with a memory operand makes it a 3 or 4-cycle latency. However, the load part of the instruction can start as soon as the address is known, so folding a load into a modify op to take advantage of micro-fusion doesn't make dep chains any longer or shorter than using a separate load.
Sometimes you need to use a separate load, for unaligned data (AVX removed the alignment requirement for memory operands). We're limited more by execution units than the 4 uop / clock issue limit, so it's probably fine to use a dedicated load instruction.
source for insn ports / latencies.
Before you go and do platform specific optimizations, there are plenty of portable optimizations that could be performed. Extract loop invariants, convert index multiplies to increment additions, etc...
This may not be exact, but gets the general idea across:
int MAXCOST = 32000, numDispXcstep = numDisp*cstep;
for (int i = maskRadius; i < rstep - maskRadius; i+=numDispXcstep) {
for (int j = maskRadius; j < cstep - maskRadius; j++) {
int minCost = MAXCOST, minDisp = 0;
for (int d = 0; d < numDispXcstep - j; d+=cstep) {
if (asPtr[i+j+d] < minCost) {
minCost = asPtr[i+j+d];
minDisp = d / cstep; // d advances in steps of cstep, so divide to get the disparity index
}
}
dRPtr[i/numDisp+j] = minDisp;
}
}
Once you have done this it becomes apparent what is actually occurring. It looks like "i" is the largest step, followed by "d" with "j" actually being the variable that operates on sequential data. ... the next step would be to reorder the loops accordingly and if you still need further optimizations, apply platform specific intrinsics.
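A rough sketch of what the reordered loops could look like for one image row, with "d" as the outer loop and "j" as the inner loop over contiguous memory. The layout asPtr[(i*numDisp + d)*cstep + j] is taken from the indexing used earlier in this thread; the function name minDisparityRow, the 16-bit cost type, and the per-row minCost scratch buffer are my assumptions. The mask-radius borders and the j+d < cstep bound are left out to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Finds, for every column j of row i, the disparity with the lowest cost.
void minDisparityRow(const uint16_t* asPtr, int i, int numDisp, int cstep,
                     uint16_t* dispRow)
{
    assert(i >= 0 && numDisp > 0 && cstep > 0);
    std::vector<uint16_t> minCost(static_cast<std::size_t>(cstep), 32000);  // MAXCOST
    for (int j = 0; j < cstep; ++j) dispRow[j] = 0;

    // Start of the cost slice for (row i, disparity 0).
    const uint16_t* slice = asPtr + static_cast<std::size_t>(i) * numDisp * cstep;
    for (int d = 0; d < numDisp; ++d, slice += cstep) {
        // Inner loop walks sequential memory, with no index multiplies.
        for (int j = 0; j < cstep; ++j) {
            if (slice[j] < minCost[j]) {
                minCost[j] = slice[j];
                dispRow[j] = static_cast<uint16_t>(d);
            }
        }
    }
}
```

With the loops in this order the inner loop is a plain element-wise compare-and-select over two arrays, which is the shape that compilers (or hand-written intrinsics) can vectorize.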

SSE instruction within nested for loops

I have several nested for loops in my code and I am trying to use Intel SSE instructions on an Intel Core i7 to speed up the application.
The code structure is as follows (val is set in a higher for loop):
__m128 in1,in2,tmp1,tmp2,out;
float arr[4] __attribute__ ((aligned(16)));
val = ...;
... several higher for loops ...
for(f=0; f<=fend; f=f+4){
index2 = ...;
for(i=0; i<iend; i++){
for(j=0; j<jend; j++){
inputval = ...;
index = ...;
if(f<fend-4){
arr[0] = array[index];
arr[1] = array[index+val];
arr[2] = array[index+2*val];
arr[3] = array[index+3*val];
in1 = _mm_load_ps(arr);
in2 = _mm_set_ps1(inputval);
tmp1 = _mm_mul_ps(in1, in2);
tmp2 = _mm_loadu_ps(&array2[index2]);
out = _mm_add_ps(tmp1,tmp2);
_mm_storeu_ps(&array2[index2], out);
} else {
//if no 4 values available for SSE instruction execution execute serial code
for(int u = 0; u < fend-f; u++ ) array2[index2+u] += array[index+u*val] * inputval;
}
}
}
}
I think there are two main problems: the buffer used for aligning the values from 'array', and the case where fewer than 4 values are left (e.g. when fend = 6, two values are left over, which have to be handled by the sequential code). Is there any other way of loading the values into in1 and/or executing SSE instructions with 3 or 2 values?
Thanks for the answers so far. The loading is as good as it gets, I think, but is there any workaround for the 'leftover' part within the else statement that could be solved using SSE instructions?
I think the bigger problem is that there is so little computation for such a massive amount of data movement:
arr[0] = array[index]; // Data Movement
arr[1] = array[index+val]; // Data Movement
arr[2] = array[index+2*val]; // Data Movement
arr[3] = array[index+3*val]; // Data Movement
in1 = _mm_load_ps(arr); // Data Movement
in2 = _mm_set_ps1(inputval); // Data Movement
tmp1 = _mm_mul_ps(in1, in2); // Computation
tmp2 = _mm_loadu_ps(&array2[index2]); // Data Movement
out = _mm_add_ps(tmp1,tmp2); // Computation
_mm_storeu_ps(&array2[index2], out); // Data Movement
While it might be possible to simplify this, I'm not at all convinced that vectorization is going to be beneficial in this situation.
You'll have to change your data layout to avoid the strided access index + n*val.
Or you can wait until AVX2 gather instructions become available in 2013? (AVX2 adds gathers only; it has no scatter instructions.)
You can express this:
arr[0] = array[index];
arr[1] = array[index+val];
arr[2] = array[index+2*val];
arr[3] = array[index+3*val];
in1 = _mm_load_ps(arr);
more succinctly as:
in1 = _mm_set_ps(array[index+3*val], array[index+2*val], array[index+val], array[index]);
and get rid of arr, which might give the compiler some opportunity to optimise away some redundant loads/stores.
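Put together, the inner step could look roughly like the sketch below. The function names loadStrided4 and accumulate4 are made up; the variable names follow the question.

```cpp
#include <xmmintrin.h>  // SSE
#include <cassert>

// Gathers array[index], array[index+val], array[index+2*val], array[index+3*val]
// into lanes 0..3. Note that _mm_set_ps takes its arguments highest lane first.
__m128 loadStrided4(const float* array, int index, int val)
{
    assert(array != nullptr);
    return _mm_set_ps(array[index + 3 * val], array[index + 2 * val],
                      array[index + val], array[index]);
}

// array2[index2 .. index2+3] += (strided values from array) * inputval
void accumulate4(const float* array, int index, int val, float inputval,
                 float* array2, int index2)
{
    const __m128 in1 = loadStrided4(array, index, val);
    const __m128 in2 = _mm_set1_ps(inputval);
    const __m128 tmp2 = _mm_loadu_ps(array2 + index2);
    _mm_storeu_ps(array2 + index2, _mm_add_ps(_mm_mul_ps(in1, in2), tmp2));
}
```

This only removes the temporary buffer. The four scalar loads are still there, so the data movement cost is largely unchanged.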
However, your data organisation is the main problem, compounded by the fact that you are doing almost no computation relative to the number of loads and stores, two of which are unaligned. If possible you need to re-organise your data structures so that you can load and store 4 elements at a time from aligned contiguous memory in all cases, otherwise any computational benefits will tend to be outweighed by inefficient memory access patterns.
If you want the full benefit from SSE (a factor of 4 or more faster than the best optimised code without explicit usage of SSE), you must ensure that your data layout is such that you only ever need aligned loads and stores. Though using _mm_set_ps(w,z,y,x) in your code snippet may help, you should avoid the need for this, i.e. avoid strided accesses (they are less efficient than a single _mm_load_ps).
As for the problem of the last few (<4) elements, I usually ensure that all my data are not only 16-byte aligned, but also that array sizes are multiples of 16 bytes, so that I never have such spare remaining elements. Of course, the real problem may have spare elements, but those can usually be set so that they don't cause a problem (set to the neutral element, i.e. zero for additive operations). In rare cases, you only want to work on a subset of the array which starts and/or ends at an unaligned position. In this case one may use bitwise operations (_mm_and_ps, _mm_or_ps) to suppress operations on the unwanted elements.
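A minimal sketch of the masking idea for the leftover elements, assuming contiguous data (not the strided layout from the question) and assuming both buffers have room for n rounded up to a multiple of 4 floats. The function name axpyPadded is made up. It uses unaligned loads and stores so that it works on any buffer; with 16-byte aligned data they could be swapped for _mm_load_ps / _mm_store_ps.

```cpp
#include <emmintrin.h>  // SSE2 (integer compare used to build the lane mask)
#include <cassert>

// y[i] += a * x[i] for i < n. Both x and y must have storage for n rounded
// up to a multiple of 4 floats; the padding in y is left unchanged.
void axpyPadded(const float* x, float* y, int n, float a)
{
    assert(n >= 0);
    const __m128 va = _mm_set1_ps(a);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128 prod = _mm_mul_ps(va, _mm_loadu_ps(x + i));
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_loadu_ps(y + i), prod));
    }
    const int remaining = n - i;  // 0..3 leftover elements
    if (remaining > 0) {
        // all-ones in lanes 0..remaining-1, zero in the other lanes
        const __m128i lane = _mm_set_epi32(3, 2, 1, 0);
        const __m128 mask = _mm_castsi128_ps(
            _mm_cmplt_epi32(lane, _mm_set1_epi32(remaining)));
        // masked-off lanes become +0.0f, the neutral element for addition
        const __m128 prod = _mm_and_ps(_mm_mul_ps(va, _mm_loadu_ps(x + i)), mask);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_loadu_ps(y + i), prod));
    }
}
```

The mask is applied after the multiply, so whatever sits in the padding of x (even NaN) cannot leak into y.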