Wrong detection of CPU instruction during DPDK build? - dpdk

I am trying to build DPDK using the following command:
meson -Dcpu_instruction_set=cascadelake -Dprefix=/ build
I have to set -Dcpu_instruction_set because we deploy on different server models.
However, on some of the VMs, AVX512 is selected even though the VM does not support it.
If I do not pass -Dcpu_instruction_set, the build works as expected.
Currently, my application crashes if I build with this parameter.
Any ideas why the DPDK build is detecting AVX512 flags?
DPDK ver : 21.11
Checking if "AVX512 checking" compiles: YES
Fetching value of define "__SSE4_2__" : 1
Fetching value of define "__AES__" : 1
Fetching value of define "__AVX__" : 1
Fetching value of define "__AVX2__" : 1
Fetching value of define "__AVX512BW__" : 1
Fetching value of define "__AVX512CD__" : 1
Fetching value of define "__AVX512DQ__" : 1
Fetching value of define "__AVX512F__" : 1
Fetching value of define "__AVX512VL__" : 1
Fetching value of define "__PCLMUL__" : 1
Fetching value of define "__RDRND__" : 1
Fetching value of define "__RDSEED__" : 1
Fetching value of define "__VPCLMULQDQ__" :
cpuid | grep AVX512
AVX512F: AVX-512 foundation instructions = false
AVX512DQ: double & quadword instructions = false
AVX512IFMA: fused multiply add = false
AVX512PF: prefetch instructions = false
AVX512ER: exponent & reciprocal instrs = false
AVX512CD: conflict detection instrs = false
AVX512BW: byte & word instructions = false
AVX512VL: vector length = false
AVX512VBMI: vector byte manipulation = false
AVX512_VBMI2: byte VPCOMPRESS, VPEXPAND = false
AVX512_VNNI: neural network instructions = false
AVX512_BITALG: bit count/shiffle = false
AVX512: VPOPCNTDQ instruction = false
AVX512_4VNNIW: neural network instrs = false
AVX512_4FMAPS: multiply acc single prec = false
AVX512_VP2INTERSECT: intersect mask regs = false
(the same 16 lines repeat for each of the remaining 11 vCPUs)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 6
NUMA node(s): 6
Vendor ID: GenuineIntel
BIOS Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) Platinum 8253 CPU @ 2.20GHz
BIOS Model name: Intel(R) Xeon(R) Platinum 8253 CPU @ 2.20GHz
Stepping: 2
CPU MHz: 2194.843
BogoMIPS: 4389.68
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 22528K
NUMA node0 CPU(s): 0,1
NUMA node1 CPU(s): 2,3
NUMA node2 CPU(s): 4,5
NUMA node3 CPU(s): 6,7
NUMA node4 CPU(s): 8,9
NUMA node5 CPU(s): 10,11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt aes xsave avx hypervisor lahf_lm pti arat

Question: Any idea why the DPDK build is detecting AVX512 flags?
Answer: The fundamental problem is that you are forcing meson to build the binary for cpu_instruction_set=cascadelake. Cascade Lake has AVX512 SIMD execution units, so the compiler (GCC) builds and optimizes with the AVX512 code path enabled. The "Fetching value of define" lines report the macros the compiler defines for that target, not the features of the CPU the build runs on.
Confusing information: on some of the VMs, AVX512 is selected even though the VM does not have it.
Answer: A VM created with VirtualBox, Xen, or QEMU inherits the host ISA (instruction set) by default, unless the processor type is explicitly tuned or selected. Hence even a binary built for a common ISA level (such as SSE or AVX2) cannot be expected to run on every platform.
Solution:
Do not compile for a specific platform if you cannot be sure which ISA the deployed VM exposes,
or update the application to call rte_cpu_is_supported(), which checks at run time whether the instruction-set flags enabled at compile time (here avx512f, via the cascadelake instruction set) are available in the VM.
Note: see the DPDK documentation for -Dplatform=native.
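As a rough illustration of the run-time check suggested above, here is a minimal sketch that does not use DPDK at all. It relies on the GCC/Clang x86 builtin __builtin_cpu_supports instead of rte_cpu_is_supported(), and the function names host_has_avx512f and check_build_matches_cpu are hypothetical:

```c
#include <stdio.h>

/* Returns 1 if the CPU we are running on reports AVX-512 Foundation, else 0.
 * __builtin_cpu_supports is a GCC/Clang builtin for x86, not a DPDK API. */
int host_has_avx512f(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") ? 1 : 0;
}

/* Hypothetical startup guard: fail early instead of crashing later with an
 * illegal-instruction fault when the binary was compiled with AVX512F
 * enabled but the (virtual) CPU does not offer it. */
int check_build_matches_cpu(void)
{
#ifdef __AVX512F__
    if (!host_has_avx512f()) {
        fprintf(stderr, "binary built with AVX512F, but this CPU lacks it\n");
        return -1;
    }
#endif
    return 0;
}
```

Calling such a guard at the very start of main() would turn the crash into a clear error message, though code that runs before main() could still hit an AVX512 instruction.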

Related

Can I get faster than _mm256_i32gather_epi32?

I made a gamma conversion code for 4k video
/** gamma0
input range : 0 ~ 1,023
output range : 0 ~ ?
*/
v00 = _mm256_unpacklo_epi16(v0, _mm256_setzero_si256());
v01 = _mm256_unpackhi_epi16(v0, _mm256_setzero_si256());
v10 = _mm256_unpacklo_epi16(v1, _mm256_setzero_si256());
v11 = _mm256_unpackhi_epi16(v1, _mm256_setzero_si256());
v20 = _mm256_unpacklo_epi16(v2, _mm256_setzero_si256());
v21 = _mm256_unpackhi_epi16(v2, _mm256_setzero_si256());
v00 = _mm256_i32gather_epi32(csv->gamma0LUT, v00, 4);
v01 = _mm256_i32gather_epi32(csv->gamma0LUT, v01, 4);
v10 = _mm256_i32gather_epi32(csv->gamma0LUTc, v10, 4);
v11 = _mm256_i32gather_epi32(csv->gamma0LUTc, v11, 4);
v20 = _mm256_i32gather_epi32(csv->gamma0LUTc, v20, 4);
v21 = _mm256_i32gather_epi32(csv->gamma0LUTc, v21, 4);
I want to implement a "10-bit input to 10~13-bit output" LUT (look-up table), but AVX2 gathers only support 32-bit (or wider) elements.
So the indices had to be widened to 32 bits and the lookup implemented with _mm256_i32gather_epi32.
This is the most severe performance bottleneck; is there any way to improve it?
Since the context of your question is still a bit vague for me, just some general ideas you could try (some may be just slightly better or even worse compared to what you have at the moment, all code below is untested):
LUT with 16 bit values using _mm256_i32gather_epi32
Even though it loads 32bit values, you can still use a multiplier of 2 as last argument of _mm256_i32gather_epi32. You should make sure that 2 bytes before and after your LUT are readable.
static const int16_t LUT[1024+2] = { 0, val0, val1, ..., val1022, val1023, 0};
__m256i high_idx = _mm256_srli_epi32(v, 16);
__m256i low_idx = _mm256_blend_epi16(v, _mm256_setzero_si256(), 0xAA);
__m256i high_val = _mm256_i32gather_epi32((int const*)(LUT+0), high_idx, 2);
__m256i low_val = _mm256_i32gather_epi32((int const*)(LUT+1), low_idx, 2);
__m256i values = _mm256_blend_epi16(low_val, high_val, 0xAA);
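To see why the padding entries and the LUT+0 / LUT+1 base pointers line up, here is a scalar model of one 32-bit lane of the code above. It is only a sketch: load32_scale2, lookup_pair and fill_demo_lut are hypothetical names, the table contents are placeholder values, and a little-endian target (as on x86) is assumed:

```c
#include <stdint.h>
#include <string.h>

enum { LUT_SIZE = 1024 };

/* lut[0] and lut[LUT_SIZE + 1] are padding; lut[i + 1] holds the value for index i. */
static int16_t lut[LUT_SIZE + 2];

/* Placeholder contents for the demo: value(i) = 3 * i. */
void fill_demo_lut(void)
{
    for (int i = 0; i < LUT_SIZE; i++)
        lut[i + 1] = (int16_t)(3 * i);
}

/* Models one lane of _mm256_i32gather_epi32(base, idx, 2):
 * a 32-bit load from base + 2 * idx bytes. */
static uint32_t load32_scale2(const int16_t *base, uint32_t idx)
{
    uint32_t r;
    memcpy(&r, (const char *)base + 2 * (size_t)idx, sizeof r);
    return r;
}

/* v packs two 16-bit indices; the result packs the two 16-bit LUT values. */
uint32_t lookup_pair(uint32_t v)
{
    uint32_t high_idx = v >> 16;
    uint32_t low_idx  = v & 0xFFFFu;
    /* Loads {lut[h], lut[h+1]}: the wanted value lands in the upper half. */
    uint32_t high_val = load32_scale2(lut + 0, high_idx);
    /* Loads {lut[l+1], lut[l+2]}: the wanted value lands in the lower half. */
    uint32_t low_val  = load32_scale2(lut + 1, low_idx);
    return (low_val & 0x0000FFFFu) | (high_val & 0xFFFF0000u);
}
```

The final mask-and-OR plays the role of _mm256_blend_epi16 with 0xAA.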
Join two values into one LUT-entry
For small-ish LUTs, you could calculate an index from two neighboring indexes as (idx_hi << 10) + idx_low and look up the corresponding tuple directly. However, instead of 2KiB you would have a 4 MiB LUT in your case, which likely hurts caching -- but you only have half the number of gather instructions.
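A possible way to build such a joined table from an existing 1024-entry LUT is sketched below. The name build_pair_lut is hypothetical, and the layout (high value in the upper 16 bits) is just one choice:

```c
#include <stdint.h>
#include <stdlib.h>

/* Builds a 2^20-entry table (4 MiB) whose entry (hi << 10) + lo holds
 * both looked-up values: lut[hi] in the upper 16 bits, lut[lo] in the lower.
 * Returns NULL if allocation fails; the caller frees the result. */
uint32_t *build_pair_lut(const uint16_t lut[1024])
{
    uint32_t *pair = malloc(((size_t)1 << 20) * sizeof *pair);
    if (pair == NULL)
        return NULL;
    for (uint32_t hi = 0; hi < 1024; hi++)
        for (uint32_t lo = 0; lo < 1024; lo++)
            pair[(hi << 10) + lo] = ((uint32_t)lut[hi] << 16) | lut[lo];
    return pair;
}
```

Each 32-bit gather from this table then returns two finished 16-bit results.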
Polynomial approximation
Mathematically, any continuous function on a closed, bounded interval can be approximated by a polynomial. You could either convert your values to float, evaluate the polynomial, and convert back, or do it directly with fixed-point multiplications (note that _mm256_mulhi_epi16/_mm256_mulhi_epu16 compute (a * b) >> 16, which is convenient if one factor is actually in [0, 1)).
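A fixed-point Horner evaluation along those lines could look like the sketch below. It uses the 128-bit SSE2 form (_mm_mulhi_epu16) so that it builds without AVX2, poly2_q16 is a hypothetical name, and the coefficients are placeholders that the caller would have to fit to the actual gamma curve:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Evaluates c0 + x * (c1 + x * c2) on eight unsigned 16-bit lanes.
 * x and the coefficients are Q0.16 fixed point (0..65535 represents [0, 1)).
 * Assumes the coefficients are chosen so that no intermediate sum overflows. */
__m128i poly2_q16(__m128i x, uint16_t c0, uint16_t c1, uint16_t c2)
{
    __m128i acc = _mm_set1_epi16((short)c2);
    acc = _mm_mulhi_epu16(acc, x);                        /* (c2 * x) >> 16 */
    acc = _mm_add_epi16(acc, _mm_set1_epi16((short)c1));
    acc = _mm_mulhi_epu16(acc, x);                        /* ((c1 + c2*x) * x) >> 16 */
    acc = _mm_add_epi16(acc, _mm_set1_epi16((short)c0));
    return acc;
}
```

For 10-bit input you would first shift the values left by 6 so that they fill the Q0.16 range.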
8 bit, 16 entry LUT with linear interpolation
SSSE3/AVX2 provide a pshufb instruction which can be used as an 8-bit LUT with 16 entries (and an implicit 0 entry).
Proof-of-concept implementation:
// LUT holds the 16 table bytes, duplicated in both 128-bit lanes
__m256i idx = _mm256_srli_epi16(v, 6); // shift highest 4 bits to the right
idx = _mm256_mullo_epi16(idx, _mm256_set1_epi16(0x0101)); // duplicate idx, maybe _mm256_shuffle_epi8 is better?
idx = _mm256_sub_epi8(idx, _mm256_set1_epi16(0x0001)); // subtract 1 from lower idx, 0 is mapped to 0xff
__m256i lut_vals = _mm256_shuffle_epi8(LUT, idx); // implicitly: LUT[-1] = 0
// get fractional part of input value:
__m256i dv = _mm256_and_si256(v, _mm256_set1_epi16(0x003f)); // lowest 6 bits (16-bit mask, so bits 8-9 are cleared too)
dv = _mm256_mullo_epi16(dv, _mm256_set1_epi16(0x00ff)); // dv * 255 = (dv << 8) - dv
dv = _mm256_add_epi16(dv, _mm256_set1_epi16(0x0040)); // bytes [low, high] = [0x40-(v&0x3f), (v&0x3f)]
__m256i res = _mm256_maddubs_epi16(lut_vals, dv); // LUT[idx-1]*(0x40-dv) + LUT[idx]*dv; switch order depending on whether LUT values are (un)signed.
// probably shift res to the right, depending on the scale of your LUT values
You could also combine this with first doing a linear or quadratic approximation and just calculating the difference to your target function.
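For checking a vectorized version, a scalar model of the interpolation that the pshufb approach is aiming for may help. This is a sketch under the assumption that table entry j holds the (8-bit scaled) function value at input 64*(j+1), with the value at input 0 being the implicit zero entry; lut16_interp is a hypothetical name:

```c
#include <stdint.h>

/* Linear interpolation over a 16-entry table for a 10-bit input v.
 * T[j] is assumed to hold f(64 * (j + 1)); f(0) = 0 is implicit.
 * The result is scaled by 64, like the pmaddubsw output. */
uint16_t lut16_interp(const uint8_t T[16], uint16_t v)
{
    unsigned k  = (v >> 6) & 0xFu;            /* top 4 bits: table segment */
    unsigned dv = v & 0x3Fu;                  /* low 6 bits: position in segment */
    unsigned lo = (k == 0) ? 0u : T[k - 1];   /* value at the segment start */
    unsigned hi = T[k];                       /* value at the segment end */
    return (uint16_t)(lo * (64u - dv) + hi * dv);
}
```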

How to convert scalar code of the double version of VDT's Pade Exp fast_exp() approx into SSE2?

Here's the code I'm trying to convert: the double version of VDT's Pade Exp fast_exp() approx (here's the old repo resource):
inline double fast_exp(double initial_x){
double x = initial_x;
double px=details::fpfloor(details::LOG2E * x +0.5);
const int32_t n = int32_t(px);
x -= px * 6.93145751953125E-1;
x -= px * 1.42860682030941723212E-6;
const double xx = x * x;
// px = x * P(x**2).
px = details::PX1exp;
px *= xx;
px += details::PX2exp;
px *= xx;
px += details::PX3exp;
px *= x;
// Evaluate Q(x**2).
double qx = details::QX1exp;
qx *= xx;
qx += details::QX2exp;
qx *= xx;
qx += details::QX3exp;
qx *= xx;
qx += details::QX4exp;
// e**x = 1 + 2x P(x**2)/( Q(x**2) - P(x**2) )
x = px / (qx - px);
x = 1.0 + 2.0 * x;
// Build 2^n in double.
x *= details::uint642dp(( ((uint64_t)n) +1023)<<52);
if (initial_x > details::EXP_LIMIT)
x = std::numeric_limits<double>::infinity();
if (initial_x < -details::EXP_LIMIT)
x = 0.;
return x;
}
I got this:
__m128d PExpSSE_dbl(__m128d x) {
__m128d initial_x = x;
__m128d half = _mm_set1_pd(0.5);
__m128d one = _mm_set1_pd(1.0);
__m128d log2e = _mm_set1_pd(1.4426950408889634073599);
__m128d p1 = _mm_set1_pd(1.26177193074810590878E-4);
__m128d p2 = _mm_set1_pd(3.02994407707441961300E-2);
__m128d p3 = _mm_set1_pd(9.99999999999999999910E-1);
__m128d q1 = _mm_set1_pd(3.00198505138664455042E-6);
__m128d q2 = _mm_set1_pd(2.52448340349684104192E-3);
__m128d q3 = _mm_set1_pd(2.27265548208155028766E-1);
__m128d q4 = _mm_set1_pd(2.00000000000000000009E0);
__m128d px = _mm_add_pd(_mm_mul_pd(log2e, x), half);
__m128d t = _mm_cvtepi64_pd(_mm_cvttpd_epi64(px));
px = _mm_sub_pd(t, _mm_and_pd(_mm_cmplt_pd(px, t), one));
__m128i n = _mm_cvtpd_epi64(px);
x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(6.93145751953125E-1)));
x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(1.42860682030941723212E-6)));
__m128d xx = _mm_mul_pd(x, x);
px = _mm_mul_pd(xx, p1);
px = _mm_add_pd(px, p2);
px = _mm_mul_pd(px, xx);
px = _mm_add_pd(px, p3);
px = _mm_mul_pd(px, x);
__m128d qx = _mm_mul_pd(xx, q1);
qx = _mm_add_pd(qx, q2);
qx = _mm_mul_pd(xx, qx);
qx = _mm_add_pd(qx, q3);
qx = _mm_mul_pd(xx, qx);
qx = _mm_add_pd(qx, q4);
x = _mm_div_pd(px, _mm_sub_pd(qx, px));
x = _mm_add_pd(one, _mm_mul_pd(_mm_set1_pd(2.0), x));
n = _mm_add_epi64(n, _mm_set1_epi64x(1023));
n = _mm_slli_epi64(n, 52);
// return?
}
But I'm not able to finish the last lines - i.e. this code:
if (initial_x > details::EXP_LIMIT)
x = std::numeric_limits<double>::infinity();
if (initial_x < -details::EXP_LIMIT)
x = 0.;
return x;
How would you convert this to SSE2?
Then of course I need to check the whole, since I'm not quite sure I've converted it correctly.
EDIT: I found the SSE conversion of float exp - i.e. from this:
/* multiply by power of 2 */
z *= details::uint322sp((n + 0x7f) << 23);
if (initial_x > details::MAXLOGF) z = std::numeric_limits<float>::infinity();
if (initial_x < details::MINLOGF) z = 0.f;
return z;
to this:
n = _mm_add_epi32(n, _mm_set1_epi32(0x7f));
n = _mm_slli_epi32(n, 23);
return _mm_mul_ps(z, _mm_castsi128_ps(n));
Yup, dividing two polynomials can often give you a better tradeoff between speed and precision than one huge polynomial. As long as there's enough work to hide the divpd throughput. (The latest x86 CPUs have pretty decent FP divide throughput. Still bad vs. multiply, but it's only 1 uop so it doesn't stall the pipeline if you use it rarely enough, i.e. mixed with lots of multiplies. Including in the surrounding code that uses exp)
However, _mm_cvtepi64_pd(_mm_cvttpd_epi64(px)); won't work with SSE2. Packed-conversion intrinsics to/from 64-bit integers require AVX512DQ.
To do packed rounding to the nearest integer, ideally you'd use SSE4.1 _mm_round_pd(x, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC), (or truncation towards zero, or floor or ceil towards -+Inf).
But we don't actually need that.
The scalar code ends up with int n and double px both representing the same numeric value. It uses the bad/buggy floor(val+0.5) idiom instead of rint(val) or nearbyint(val) to round to nearest, and then converts that already-integer double to an int (with C++'s truncation semantics, but that doesn't matter because the double value's already an exact integer.)
With SIMD intrinsics, it appears to be easiest to just convert to 32-bit integer and back.
__m128i n = _mm_cvtpd_epi32( _mm_mul_pd(log2e, x) ); // round to nearest
__m128d px = _mm_cvtepi32_pd( n );
Rounding to int with the desired mode, then converting back to double, is equivalent to double->double rounding and then grabbing an int version of that like the scalar version does. (Because you don't care what happens for doubles too large to fit in an int.)
The cvtpd2dq and cvtdq2pd instructions are 2 uops each, and cvtpd2dq packs the 32-bit integers into the low 64 bits of a vector. So to set up for 64-bit integer shifts to stuff the bits into a double again, you'll need to shuffle. The top 64 bits of n will be zeros, so we can use that to create 64-bit integer n lined up with the doubles:
n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3,1,2,0)); // 64-bit integers
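Putting the conversion, the shuffle and the exponent stuffing together, an SSE2-only sketch of the "build 2^n" step might look like this. The function name pow2_of_rounded is hypothetical, and it assumes the default round-to-nearest MXCSR mode and an n that keeps n + 1023 inside the normal exponent range:

```c
#include <emmintrin.h>

/* Returns 2^n per element, where n = x rounded to the nearest integer. */
__m128d pow2_of_rounded(__m128d x)
{
    /* Two int32 results in elements 0 and 1; elements 2 and 3 are zero. */
    __m128i n = _mm_cvtpd_epi32(x);
    /* Spread them out: element 0 -> low half of lane 0, element 1 -> low half of lane 1.
     * This zero-extends rather than sign-extends, which is fine here because
     * the shift below discards everything above the low 12 bits. */
    n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3, 1, 2, 0));
    n = _mm_add_epi64(n, _mm_set1_epi64x(1023));   /* add the exponent bias */
    n = _mm_slli_epi64(n, 52);                     /* move into the exponent field */
    return _mm_castsi128_pd(n);
}
```

The final result of the exp routine would then be the polynomial result multiplied by this vector with _mm_mul_pd.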
But with just SSE2, there are workarounds. Converting to 32-bit integer and back is one option: you don't care about inputs too small or too large. But packed-conversion between double and int costs at least 2 uops on Intel CPUs each way, so a total of 4. But only 2 of those uops need the FMA units, and your code probably doesn't bottleneck on port 5 with all those multiplies and adds.
Or add a very large number and subtract it again: large enough that each double is 1 integer apart, so normal FP rounding does what you want. (This works for inputs that won't fit in 32 bits, but not double > 2^52. So either way that would work.) Also see How to efficiently perform double/int64 conversions with SSE/AVX? which uses that trick. I couldn't find an example on SO, though.
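The add-and-subtract trick can be sketched in a few lines of SSE2. The constant 2^52 + 2^51 is a common choice that also works for negative inputs, round_nearest_sse2 is a hypothetical name, and the code assumes round-to-nearest mode and no -ffast-math style reassociation:

```c
#include <emmintrin.h>

/* Rounds each double to the nearest integer (ties to even) for |x| < 2^51.
 * Adding 2^52 + 2^51 pushes the value into a range where doubles are exactly
 * 1 apart, so the addition itself performs the rounding. */
__m128d round_nearest_sse2(__m128d x)
{
    const __m128d magic = _mm_set1_pd(6755399441055744.0); /* 2^52 + 2^51 */
    return _mm_sub_pd(_mm_add_pd(x, magic), magic);
}
```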
Related:
Fastest Implementation of Exponential Function Using AVX and Fastest Implementation of Exponential Function Using SSE have versions with other speed / precision tradeoffs, for _ps (packed single-precision float).
Fast SSE low precision exponential using double precision operations is at the other end of the spectrum, but still for double.
How many clock cycles does cost AVX/SSE exponentiation on modern x86_64 CPU? discusses some existing libraries like SVML, and Agner Fog's VCL (GPL licensed). And glibc's libmvec.
Then of course I need to check the whole, since I'm not quite sure I've converted it correctly.
Iterating over all 2^64 double bit-patterns is impractical, unlike for float where there are only 4 billion, but maybe iterating over all doubles that have the low 32 bits of their mantissa all zero would be a good start, i.e. check in a loop with
bitpatterns = _mm_add_epi64(bitpatterns, _mm_set1_epi64x( 1ULL << 32 ));
doubles = _mm_castsi128_pd(bitpatterns);
https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/
For those last few lines, correcting the result for out-of-range inputs:
The float version you quote just leaves out the range-check entirely. This is obviously the fastest way, if your inputs will always be in range or if you don't care about what happens for out-of-range inputs.
Alternate cheaper range-checking (maybe only for debugging) would be to turn out-of-range values into NaN by ORing the packed-compare result into the result. (An all-ones bit-pattern represents a NaN.)
__m128d out_of_bounds = _mm_cmplt_pd( limit, abs(initial_x) ); // abs = mask off the sign bit
result = _mm_or_pd(result, out_of_bounds);
In general, you can vectorize simple condition setting of a value using branchless compare + blend. Instead of if(x) y=0;, you have the SIMD equivalent of y = (condition) ? 0 : y;, on a per-element basis. SIMD compares produce a mask of all-zero / all-one elements so you can use it to blend.
e.g. in this case cmppd the input and blendvpd the output if you have SSE4.1. Or with just SSE2, and/andnot/or to blend. See SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation for a _ps version of both, _pd is identical.
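As a concrete SSE2-only sketch of that compare + blend pattern for the two range checks: the names select_pd and clamp_exp_result are hypothetical, and the limit is passed in as a parameter because the value of details::EXP_LIMIT is not shown here:

```c
#include <emmintrin.h>
#include <math.h>

/* Per element: mask ? a : b, where each mask element is all-ones or all-zeros. */
static inline __m128d select_pd(__m128d mask, __m128d a, __m128d b)
{
    return _mm_or_pd(_mm_and_pd(mask, a), _mm_andnot_pd(mask, b));
}

/* Applies the scalar code's two fixups: +Inf above limit, 0 below -limit. */
__m128d clamp_exp_result(__m128d result, __m128d initial_x, double limit)
{
    __m128d too_big   = _mm_cmpgt_pd(initial_x, _mm_set1_pd(limit));
    __m128d too_small = _mm_cmplt_pd(initial_x, _mm_set1_pd(-limit));
    result = select_pd(too_big, _mm_set1_pd(INFINITY), result);
    /* 0.0 is the all-zero bit pattern, so clearing bits is enough here. */
    result = _mm_andnot_pd(too_small, result);
    return result;
}
```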
In asm it will look like this:
; result in xmm0 (in need of fixups for out of range inputs)
; initial_x in xmm2
; constants:
; xmm5 = limit
; xmm6 = +Inf
cmpltpd xmm2, xmm5 ; xmm2 = input_x < limit ? 0xffff... : 0
andpd xmm0, xmm2 ; result = result or 0
andnpd xmm2, xmm6 ; xmm2 = 0 or +Inf (In that order because we used ANDN)
orpd xmm0, xmm2 ; result |= 0 or +Inf
; xmm0 = (input < limit) ? result : +Inf
(In an earlier version of the answer, I thought I was maybe saving a movaps to copy a register, but this is just a bog-standard blend. It destroys initial_x, so the compiler needs to copy that register at some point while calculating result, though.)
Optimizations for this special condition
Or in this case, 0.0 is represented by an all-zero bit-pattern, so do a compare that will produce true if in-range, and AND the output with that. (To leave it unchanged or force it to +0.0). This is better than _mm_blendv_pd, which costs 2 uops on most Intel CPUs (and the AVX 128-bit version always costs 2 uops on Intel). And it's not worse on AMD or Skylake.
+-Inf is represented by a bit-pattern of significand=0, exponent=all-ones. (Any other value in the significand represents +-NaN.) Since too-large inputs will presumably still leave non-zero significands, we can't just AND the compare result and OR that into the final result. I think we need to do a regular blend, or something as expensive (3 uops and a vector constant).
It adds 2 cycles of latency to the final result; both the ANDNPD and ORPD are on the critical path. The CMPPD and ANDPD aren't; they can run in parallel with whatever you do to compute the result.
Hopefully your compiler will actually use ANDPS and so on, not PD, for everything except the CMP, because it's 1 byte shorter but identical because they're both just bitwise ops. I wrote ANDPD just so I didn't have to explain this in comments.
You might be able to shorten the critical path latency by combining both fixups before applying to the result, so you only have one blend. But then I think you also need to combine the compare results.
Or since your upper and lower bounds are the same magnitude, maybe you can compare the absolute value? (mask off the sign bit of initial_x and do _mm_cmplt_pd(abs_initial_x, _mm_set1_pd(details::EXP_LIMIT))). But then you have to sort out whether to zero or set to +Inf.
If you had SSE4.1 for _mm_blendv_pd, you could use initial_x itself as the blend control for the fixup that might need applying, because blendv only cares about the sign bit of the blend control (unlike with the AND/ANDN/OR version where all bits need to match.)
__m128d fixup = _mm_blendv_pd( _mm_set1_pd(INFINITY), _mm_setzero_pd(), initial_x ); // fixup = (initial_x signbit) ? 0 : +Inf
// see below for generating fixup with an SSE2 integer arithmetic-shift
const __m128d signbit_mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7fffffffffffffff)); // ~ set1(-0.0)
__m128d abs_init_x = _mm_and_pd( initial_x, signbit_mask );
__m128d out_of_range = _mm_cmpgt_pd(abs_init_x, _mm_set1_pd(details::EXP_LIMIT));
// Conditionally apply the fixup to result
result = _mm_blendv_pd(result, fixup, out_of_range);
Possibly use cmplt instead of cmpgt and rearrange if you care what happens for initial_x being a NaN. Choosing the compare so false applies the fixup instead of true will mean that an unordered comparison results in either 0 or +Inf for an input of -NaN or +NaN. This still doesn't do NaN propagation. You could _mm_cmpunord_pd(initial_x, initial_x) and OR that into fixup, if you want to make that happen.
Especially on Skylake and AMD Bulldozer/Ryzen, where the legacy-SSE (non-VEX) encoding of blendvpd is only 1 uop, this should be pretty nice. (The VEX encoding, vblendvpd, is 2 uops, having 3 inputs and a separate output.)
You might still be able to use some of this idea with only SSE2, maybe creating fixup by doing a compare against zero and then _mm_and_pd or _mm_andnot_pd with the compare result and +Infinity.
Using an integer arithmetic shift to broadcast the sign bit to every position in the double isn't efficient: psraq doesn't exist, only psraw/d. Only logical shifts come in 64-bit element size.
But you could create fixup with just one integer shift and mask, and a bitwise invert
__m128i ix = _mm_castpd_si128(initial_x);
__m128i ifixup = _mm_srai_epi32(ix, 11); // all 11 bits of exponent field = sign bit
// bitwise invert and clear the other bits in one step (pandn):
// ifixup = the bit pattern for +Inf (non-negative x) or 0 (negative x)
ifixup = _mm_andnot_si128(ifixup, _mm_set1_epi64x(0x7FF0000000000000LL));
__m128d fixup = _mm_castsi128_pd(ifixup);
Then blend fixup into result for out-of-range inputs as normal.
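A complete SSE2-only version of this single-blend idea might look like the sketch below. The name apply_fixup_sse2 is hypothetical, the limit is a parameter standing in for details::EXP_LIMIT, and NaN inputs are not treated specially:

```c
#include <emmintrin.h>

/* One compare on |x| plus one AND/ANDN/OR blend:
 * out-of-range elements become +Inf (x >= 0) or +0.0 (x < 0). */
__m128d apply_fixup_sse2(__m128d result, __m128d initial_x, double limit)
{
    const __m128i exp_mask = _mm_set1_epi64x(0x7FF0000000000000LL);
    const __m128d abs_mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7FFFFFFFFFFFFFFFLL));

    /* Copy the sign bit over the exponent field, then invert and mask:
     * fixup is +Inf for non-negative x and +0.0 for negative x. */
    __m128i sign_spread = _mm_srai_epi32(_mm_castpd_si128(initial_x), 11);
    __m128d fixup = _mm_castsi128_pd(_mm_andnot_si128(sign_spread, exp_mask));

    /* All-ones where |x| > limit. */
    __m128d out_of_range = _mm_cmpgt_pd(_mm_and_pd(initial_x, abs_mask),
                                        _mm_set1_pd(limit));

    /* SSE2 blend: out_of_range ? fixup : result */
    return _mm_or_pd(_mm_and_pd(out_of_range, fixup),
                     _mm_andnot_pd(out_of_range, result));
}
```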
Cheaply checking abs(initial_x) > details::EXP_LIMIT
If the exp algorithm was already squaring initial_x, you could compare against EXP_LIMIT squared. But it's not, xx = x*x only happens after some calculation to create x.
If you have AVX512F/VL, VFIXUPIMMPD might be handy here. It's designed for functions where the special case outputs are from "special" inputs like NaN and +-Inf, negative, positive, or zero, saving a compare for those cases. (e.g. for after a Newton-Raphson reciprocal(x) for x=0.)
But both of your special cases need compares. Or do they?
If you square your input and subtract, it only costs one FMA to do initial_x * initial_x - details::EXP_LIMIT * details::EXP_LIMIT to create a result that's negative for abs(initial_x) < details::EXP_LIMIT, and non-negative otherwise.
Agner Fog reports that vfixupimmpd is only 1 uop on Skylake-X.

Fastest downscaling of 8bit gray image with SSE

I have a function which downscales an 8-Bit image by a factor of two. I have previously optimised the rgb32 case with SSE. Now I would like to do the same for the gray8 case.
At the core, there is a function taking two lines of pixel data, which works like this:
/**
* Calculates the average of two rows of gray8 pixels by averaging four pixels.
*/
void average2Rows(const uint8_t* row1, const uint8_t* row2, uint8_t* dst, int size)
{
for (int i = 0; i < size - 1; i += 2)
*(dst++) = ((row1[i]+row1[i+1]+row2[i]+row2[i+1])/4)&0xFF;
}
Now, I have come up with an SSE variant which is about three times faster, but it does involve a lot of shuffling and I think one might do better. Does anybody see what can be optimised here?
/* row1: 16 8-bit values A-P
* row2: 16 8-bit values a-p
* returns 16 8-bit values (A+B+a+b)/4, (C+D+c+d)/4, ..., (O+P+o+p)/4
*/
__m128i avg16Bytes(const __m128i& row1, const __m128i& row2)
{
static const __m128i zero = _mm_setzero_si128();
__m128i ABCDEFGHIJKLMNOP = _mm_avg_epu8(row1, row2);
__m128i ABCDEFGH = _mm_unpacklo_epi8(ABCDEFGHIJKLMNOP, zero);
__m128i IJKLMNOP = _mm_unpackhi_epi8(ABCDEFGHIJKLMNOP, zero);
__m128i AIBJCKDL = _mm_unpacklo_epi16( ABCDEFGH, IJKLMNOP );
__m128i EMFNGOHP = _mm_unpackhi_epi16( ABCDEFGH, IJKLMNOP );
__m128i AEIMBFJN = _mm_unpacklo_epi16( AIBJCKDL, EMFNGOHP );
__m128i CGKODHLP = _mm_unpackhi_epi16( AIBJCKDL, EMFNGOHP );
__m128i ACEGIKMO = _mm_unpacklo_epi16( AEIMBFJN, CGKODHLP );
__m128i BDFHJLNP = _mm_unpackhi_epi16( AEIMBFJN, CGKODHLP );
return _mm_avg_epu8(ACEGIKMO, BDFHJLNP);
}
/*
* Calculates the average of two rows of gray8 pixels by averaging four pixels.
*/
void average2Rows(const uint8_t* src1, const uint8_t* src2, uint8_t* dst, int size)
{
for(int i = 0;i<size-31; i+=32)
{
__m128i tl = _mm_loadu_si128((__m128i const*)(src1+i));
__m128i tr = _mm_loadu_si128((__m128i const*)(src1+i+16));
__m128i bl = _mm_loadu_si128((__m128i const*)(src2+i));
__m128i br = _mm_loadu_si128((__m128i const*)(src2+i+16));
__m128i l_avg = avg16Bytes(tl, bl);
__m128i r_avg = avg16Bytes(tr, br);
_mm_storeu_si128((__m128i *)(dst+(i/2)), _mm_packus_epi16(l_avg, r_avg));
}
}
Notes:
I realise my function has slight (off by one) rounding errors, but I am willing to accept this.
For clarity I have assumed size is a multiple of 32.
EDIT: There is now a github repository implementing the answers to this question. The fastest solution was provided by user Peter Cordes. See his essay below for details:
__m128i avg16Bytes(const __m128i& row1, const __m128i& row2)
{
// Average the first 16 values of src1 and src2:
__m128i avg = _mm_avg_epu8(row1, row2);
// Unpack and horizontal add:
avg = _mm_maddubs_epi16(avg, _mm_set1_epi8(1));
// Divide by 2:
return _mm_srli_epi16(avg, 1);
}
It works as my original implementation by calculating (a+b)/2 + (c+d)/2 as opposed to (a+b+c+d)/4, so it has the same off-by-one rounding error.
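To make the rounding difference concrete, here is a scalar sketch that models a single output pixel both ways. It is not the SIMD code itself: the helper names (`pavgb_model`, `avg_double_round`, `avg_exact`, `rounding_example`) are made up for illustration, and `pavgb_model` assumes the round-up behaviour of `pavgb`, i.e. (a + b + 1) >> 1. Inputs are expected to be byte values in 0..255.

```cpp
#include <cassert>

// Scalar model of one pavgb lane: average of two bytes, rounded up.
static inline unsigned pavgb_model(unsigned a, unsigned b)
{
    return (a + b + 1) >> 1;
}

// Models pavgb (vertical) followed by pmaddubsw + psrlw 1 (horizontal):
// the vertical averages round up, the horizontal step truncates.
static inline unsigned avg_double_round(unsigned A, unsigned B, unsigned a, unsigned b)
{
    return (pavgb_model(A, a) + pavgb_model(B, b)) >> 1;
}

// Exact truncating average of the four pixels.
static inline unsigned avg_exact(unsigned A, unsigned B, unsigned a, unsigned b)
{
    return (A + B + a + b) / 4;
}

// Small worked example of where the two disagree.
static inline void rounding_example()
{
    assert(avg_exact(1, 1, 0, 0) == 0);        // 2 / 4 truncates to 0
    assert(avg_double_round(1, 1, 0, 0) == 1); // both vertical averages round up to 1
    assert(avg_double_round(10, 20, 30, 40) == avg_exact(10, 20, 30, 40)); // both give 25
}
```

In this model the double-rounded result is never below the exact one and never more than one above it, which is the "off by one" error mentioned in the notes.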
Kudos to user Paul R for implementing a solution which is twice as fast as mine, but exact:
__m128i avg16Bytes(const __m128i& row1_u8, const __m128i& row2_u8)
{
// Unpack and horizontal add:
__m128i row1_avg = _mm_maddubs_epi16(row1_u8, _mm_set1_epi8(1));
__m128i row2_avg = _mm_maddubs_epi16(row2_u8, _mm_set1_epi8(1));
// vertical add:
__m128i avg = _mm_add_epi16(row1_avg, row2_avg);
// divide by 4:
return _mm_srli_epi16(avg, 2);
}
If you're willing to accept double-rounding from using pavgb twice, you can go faster than Paul R's answer by doing the vertical averaging first with pavgb, cutting in half the amount of data that needs to be unpacked to 16-bit elements. (And allowing half the loads to fold into memory operands for pavgb, reducing front-end bottlenecks on some CPUs.)
For horizontal averaging, your best bet is probably still pmaddubsw with set1(1) and shift by 1, then pack.
// SSSE3 version
// I used `__restrict__` to give the compiler more flexibility in unrolling
void average2Rows_doubleround(const uint8_t* __restrict__ src1, const uint8_t*__restrict__ src2,
uint8_t*__restrict__ dst, size_t size)
{
const __m128i vk1 = _mm_set1_epi8(1);
size_t dstsize = size/2;
for (size_t i = 0; i < dstsize - 15; i += 16)
{
__m128i v0 = _mm_load_si128((const __m128i *)&src1[i*2]);
__m128i v1 = _mm_load_si128((const __m128i *)&src1[i*2 + 16]);
__m128i v2 = _mm_load_si128((const __m128i *)&src2[i*2]);
__m128i v3 = _mm_load_si128((const __m128i *)&src2[i*2 + 16]);
__m128i left = _mm_avg_epu8(v0, v2);
__m128i right = _mm_avg_epu8(v1, v3);
__m128i w0 = _mm_maddubs_epi16(left, vk1); // unpack and horizontal add
__m128i w1 = _mm_maddubs_epi16(right, vk1);
w0 = _mm_srli_epi16(w0, 1); // divide by 2
w1 = _mm_srli_epi16(w1, 1);
w0 = _mm_packus_epi16(w0, w1); // pack
_mm_storeu_si128((__m128i *)&dst[i], w0);
}
}
The other option is _mm_srli_epi16(v, 8) to line up the odd elements with the even elements of every horizontal pair. But since there is no horizontal pack with truncation, you have to _mm_and_si128(v, _mm_set1_epi16(0x00FF)) both halves before you pack. It turns out to be slower than using SSSE3 pmaddubsw, especially without AVX where it takes extra MOVDQA instructions to copy registers.
void average2Rows_doubleround_SSE2(const uint8_t* __restrict__ src1, const uint8_t* __restrict__ src2, uint8_t* __restrict__ dst, size_t size)
{
size /= 2;
for (size_t i = 0; i < size - 15; i += 16)
{
__m128i v0 = _mm_load_si128((__m128i *)&src1[i*2]);
__m128i v1 = _mm_load_si128((__m128i *)&src1[i*2 + 16]);
__m128i v2 = _mm_load_si128((__m128i *)&src2[i*2]);
__m128i v3 = _mm_load_si128((__m128i *)&src2[i*2 + 16]);
__m128i left = _mm_avg_epu8(v0, v2);
__m128i right = _mm_avg_epu8(v1, v3);
__m128i l_odd = _mm_srli_epi16(left, 8); // line up horizontal pairs
__m128i r_odd = _mm_srli_epi16(right, 8);
__m128i l_avg = _mm_avg_epu8(left, l_odd); // leaves garbage in the high halves
__m128i r_avg = _mm_avg_epu8(right, r_odd);
l_avg = _mm_and_si128(l_avg, _mm_set1_epi16(0x00FF));
r_avg = _mm_and_si128(r_avg, _mm_set1_epi16(0x00FF));
__m128i avg = _mm_packus_epi16(l_avg, r_avg); // pack
_mm_storeu_si128((__m128i *)&dst[i], avg);
}
}
With AVX512BW, there's _mm_cvtepi16_epi8, but IACA says it's 2 uops on Skylake-AVX512, and it only takes 1 input and produces a half-width output. According to IACA, the memory-destination form is 4 total unfused-domain uops (same as reg,reg + separate store). I had to use _mm_mask_cvtepi16_storeu_epi8(&dst[i+0], -1, l_avg); to get it, because gcc and clang fail to fold a separate _mm_store into a memory destination for vpmovwb. (There is no non-masked store intrinsic, because compilers are supposed to do that for you like they do with folding _mm_load into memory operands for typical ALU instructions).
It's probably only useful when narrowing to 1/4 or 1/8th (cvtepi64_epi8), not just in half. Or maybe useful to avoid needing a second shuffle to deal with the in-lane behaviour of _mm512_packus_epi16. With AVX2, after a _mm256_packus_epi16 on [D C] [B A], you have [D B | C A], which you can fix with an AVX2 _mm256_permute4x64_epi64 (__m256i a, const int imm8) to shuffle in 64-bit chunks. But with AVX512, you'd need a vector shuffle-control for the vpermq. packus + a fixup shuffle is probably still a better option, though.
Once you do this, there aren't many vector instructions left in the loop, and there's a lot to gain from letting the compiler make tighter asm. Your loop is unfortunately difficult for compilers to do a good job with.
(This also helps Paul R's solution, since he copied the compiler-unfriendly loop structure from the question.)
Use the loop-counter in a way that gcc/clang can optimize better, and use types that avoid re-doing sign extension every time through the loop.
With your current loop, gcc/clang actually do an arithmetic right-shift for i/2, instead of incrementing by 16 (instead of 32) and using scaled-index addressing modes for the loads. It seems they don't realize that i is always even.
(full code + asm on Matt Godbolt's compiler explorer):
.LBB1_2: ## clang's inner loop for int i, dst[i/2] version
movdqu xmm1, xmmword ptr [rdi + rcx]
movdqu xmm2, xmmword ptr [rdi + rcx + 16]
movdqu xmm3, xmmword ptr [rsi + rcx]
movdqu xmm4, xmmword ptr [rsi + rcx + 16]
pavgb xmm3, xmm1
pavgb xmm4, xmm2
pmaddubsw xmm3, xmm0
pmaddubsw xmm4, xmm0
psrlw xmm3, 1
psrlw xmm4, 1
packuswb xmm3, xmm4
mov eax, ecx # This whole block is wasted instructions!!!
shr eax, 31
add eax, ecx
sar eax # eax = ecx/2, with correct rounding even for negative `i`
cdqe # sign-extend EAX into RAX
movdqu xmmword ptr [rdx + rax], xmm3
add rcx, 32 # i += 32
cmp rcx, r8
jl .LBB1_2 # }while(i < size-31)
gcc7.1 isn't quite so bad, (just mov/sar/movsx), but gcc5.x and 6.x do separate pointer-increments for src1 and src2, and also for a counter/index for the stores. (Totally braindead behaviour, especially since they still do it with -march=sandybridge. Indexed movdqu stores and non-indexed movdqu loads gives you the maximum loop overhead.)
Anyway, using dstsize and multiplying i inside the loop instead of dividing it gives much better results. Different versions of gcc and clang reliably compile it into a single loop-counter that they use with a scaled-index addressing mode for the loads. You get code like:
movdqa xmm1, xmmword ptr [rdi + 2*rax]
movdqa xmm2, xmmword ptr [rdi + 2*rax + 16]
pavgb xmm1, xmmword ptr [rsi + 2*rax]
pavgb xmm2, xmmword ptr [rsi + 2*rax + 16] # saving instructions with aligned loads, see below
...
movdqu xmmword ptr [rdx + rax], xmm1
add rax, 16
cmp rax, rcx
jb .LBB0_2
I used size_t i to match size_t size, to make sure gcc didn't waste any instructions sign-extending or zero-extending it to the width of a pointer. (zero-extension usually happens for free, though, so unsigned size and unsigned i might have been ok, and saved a couple REX prefixes.)
You could still get rid of the cmp by counting an index up towards 0, which would speed the loop up a little bit more than what I've done. I'm not sure how easy it would be to get compilers to not be stupid and omit the cmp instruction if you do count up towards zero. Indexing from the end of an object is no problem, though: src1+=size;. It does complicate things if you want to use an unaligned-cleanup loop, though.
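As a rough sketch of what counting up towards zero could look like, the following variant points each pointer one past the end of its row and runs a negative index up to 0. The function name `average2Rows_countup` is hypothetical and this version was not part of the benchmarks here. It assumes `size` is a multiple of 32, uses unaligned loads, and uses the SSE2-only horizontal averaging from the SSE2 version above so that it builds without SSSE3. Whether a given compiler then actually drops the cmp is something to check in the generated asm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Illustrative variant: the index runs from -dstsize up to 0, so the loop
// condition becomes a test against zero.
void average2Rows_countup(const uint8_t* src1, const uint8_t* src2,
                          uint8_t* dst, size_t size)
{
    assert(size % 32 == 0);  // assumption: no cleanup loop in this sketch
    const ptrdiff_t dstsize = static_cast<ptrdiff_t>(size / 2);
    src1 += size;            // point one past the end of each row
    src2 += size;
    dst  += dstsize;
    const __m128i lowbyte = _mm_set1_epi16(0x00FF);
    for (ptrdiff_t i = -dstsize; i != 0; i += 16)
    {
        __m128i v0 = _mm_loadu_si128((const __m128i*)(src1 + 2 * i));
        __m128i v1 = _mm_loadu_si128((const __m128i*)(src1 + 2 * i + 16));
        __m128i v2 = _mm_loadu_si128((const __m128i*)(src2 + 2 * i));
        __m128i v3 = _mm_loadu_si128((const __m128i*)(src2 + 2 * i + 16));
        __m128i left  = _mm_avg_epu8(v0, v2);   // vertical averages
        __m128i right = _mm_avg_epu8(v1, v3);
        // horizontal average of each byte pair, kept in the low byte of each word
        __m128i l_avg = _mm_and_si128(_mm_avg_epu8(left,  _mm_srli_epi16(left,  8)), lowbyte);
        __m128i r_avg = _mm_and_si128(_mm_avg_epu8(right, _mm_srli_epi16(right, 8)), lowbyte);
        _mm_storeu_si128((__m128i*)(dst + i), _mm_packus_epi16(l_avg, r_avg));
    }
}
```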
On my Skylake i7-6700k (max turbo 4.4GHz, but look at the clock-cycle counts rather than times), with g++7.1, this makes a difference of ~2.7 seconds for 100M reps of 1024 bytes vs. ~3.3 seconds.
Performance counter stats for './grayscale-dowscale-by-2.inline.gcc-skylake-noavx' (2 runs):
2731.607950 task-clock (msec) # 1.000 CPUs utilized ( +- 0.40% )
2 context-switches # 0.001 K/sec ( +- 20.00% )
0 cpu-migrations # 0.000 K/sec
88 page-faults:u # 0.032 K/sec ( +- 0.57% )
11,917,723,707 cycles # 4.363 GHz ( +- 0.07% )
42,006,654,015 instructions # 3.52 insn per cycle ( +- 0.00% )
41,908,837,143 uops_issued_any # 15342.186 M/sec ( +- 0.00% )
49,409,631,052 uops_executed_thread # 18088.112 M/sec ( +- 0.00% )
3,301,193,901 branches # 1208.517 M/sec ( +- 0.00% )
100,013,629 branch-misses # 3.03% of all branches ( +- 0.01% )
2.731715466 seconds time elapsed ( +- 0.40% )
vs. Same vectorization, but with int i and dst[i/2] creating higher loop overhead (more scalar instructions):
Performance counter stats for './grayscale-dowscale-by-2.loopoverhead-aligned-inline.gcc-skylake-noavx' (2 runs):
3314.335833 task-clock (msec) # 1.000 CPUs utilized ( +- 0.02% )
4 context-switches # 0.001 K/sec ( +- 14.29% )
0 cpu-migrations # 0.000 K/sec
88 page-faults:u # 0.026 K/sec ( +- 0.57% )
14,531,925,552 cycles # 4.385 GHz ( +- 0.06% )
51,607,478,414 instructions # 3.55 insn per cycle ( +- 0.00% )
51,109,303,460 uops_issued_any # 15420.677 M/sec ( +- 0.00% )
55,810,234,508 uops_executed_thread # 16839.040 M/sec ( +- 0.00% )
3,301,344,602 branches # 996.080 M/sec ( +- 0.00% )
100,025,451 branch-misses # 3.03% of all branches ( +- 0.00% )
3.314418952 seconds time elapsed ( +- 0.02% )
vs. Paul R's version (optimized for lower loop overhead): exact but slower
Performance counter stats for './grayscale-dowscale-by-2.paulr-inline.gcc-skylake-noavx' (2 runs):
3751.990587 task-clock (msec) # 1.000 CPUs utilized ( +- 0.03% )
3 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
88 page-faults:u # 0.024 K/sec ( +- 0.56% )
16,323,525,446 cycles # 4.351 GHz ( +- 0.04% )
58,008,101,634 instructions # 3.55 insn per cycle ( +- 0.00% )
57,610,721,806 uops_issued_any # 15354.709 M/sec ( +- 0.00% )
55,505,321,456 uops_executed_thread # 14793.566 M/sec ( +- 0.00% )
3,301,456,435 branches # 879.921 M/sec ( +- 0.00% )
100,001,954 branch-misses # 3.03% of all branches ( +- 0.02% )
3.752086635 seconds time elapsed ( +- 0.03% )
vs. Paul R's original version with extra loop overhead:
Performance counter stats for './grayscale-dowscale-by-2.loopoverhead-paulr-inline.gcc-skylake-noavx' (2 runs):
4154.300887 task-clock (msec) # 1.000 CPUs utilized ( +- 0.01% )
3 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
90 page-faults:u # 0.022 K/sec ( +- 1.68% )
18,174,791,383 cycles # 4.375 GHz ( +- 0.03% )
67,608,724,157 instructions # 3.72 insn per cycle ( +- 0.00% )
66,937,292,129 uops_issued_any # 16112.769 M/sec ( +- 0.00% )
61,875,610,759 uops_executed_thread # 14894.350 M/sec ( +- 0.00% )
3,301,571,922 branches # 794.736 M/sec ( +- 0.00% )
100,029,270 branch-misses # 3.03% of all branches ( +- 0.00% )
4.154441330 seconds time elapsed ( +- 0.01% )
Note that branch-misses is about the same as the repeat count: the inner loop mispredicts at the end every time. Unrolling to keep the loop iteration count under about 22 would make the pattern short enough for Skylake's branch predictors to predict the not-taken condition correctly most of the time. Branch mispredicts are the only reason we're not getting ~4.0 uops per cycle through the pipeline, so avoiding branch misses would raise the IPC from 3.5 to over 4.0 (cmp/jcc macro-fusion puts 2 instructions in one uop).
These branch-misses probably hurt even if you're bottlenecked on L2 cache bandwidth (instead of the front-end). I didn't test that, though: my testing just wraps a for() loop around the function call from Paul R's test harness, so everything's hot in L1D cache. 32 iterations of the inner loop is close to the worst-case here: low enough for frequent mispredicts, but not so low that branch-prediction can pick up the pattern and avoid them.
My version should run in 3 cycles per iteration, bottlenecked only on the frontend, on Intel Sandybridge and later. (Nehalem will bottleneck on one load per clock.)
See http://agner.org/optimize/, and also Can x86's MOV really be "free"? Why can't I reproduce this at all? for more about fused-domain vs. unfused-domain uops and perf counters.
update: clang unrolls it for you, at least when the size is a compile-time constant... Strangely, it unrolls even the non-inline version of the dst[i/2] function (with unknown size), but not the lower-loop-overhead version.
With clang++-4.0 -O3 -march=skylake -mno-avx, my version (unrolled by 2 by the compiler) runs in: 9.61G cycles for 100M iters (2.2s). (35.6G uops issued (fused domain), 45.0G uops executed (unfused domain), near-zero branch-misses.) Probably not bottlenecked on the front-end anymore, but AVX would still hurt.
Paul R's (also unrolled by 2) runs in 12.29G cycles for 100M iters (2.8s). 48.4G uops issued (fused domain), 51.4G uops executed (unfused-domain). 50.1G instructions, for 4.08 IPC, probably still bottlenecked on the front-end (because it needs a couple movdqa instructions to copy a register before destroying it). AVX would help for non-destructive vector instructions, even without AVX2 for wider integer vectors.
With careful coding, you should be able to do about this well for runtime-variable sizes.
Use aligned pointers and aligned loads, so the compiler can use pavgb with a memory operand instead of using a separate unaligned-load instruction. This means fewer instructions and fewer uops for the front-end, which is a bottleneck for this loop.
This doesn't help Paul's version, because only the second operand for pmaddubsw can come from memory, and that's the one treated as signed bytes. If we used _mm_maddubs_epi16(_mm_set1_epi8(1), v0);, the 16-bit multiply result would be sign-extended instead of zero-extended. So 1+255 would come out to 0 instead of 256.
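The operand-order issue can be checked with a scalar model of a single 16-bit lane of pmaddubsw. This is only an illustration of the arithmetic, not the intrinsic itself; `maddubs_lane` and `maddubs_example` are hypothetical names, and the model assumes the usual definition (first operand bytes unsigned, second operand bytes signed, signed 16-bit saturation of the sum).

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one 16-bit lane of _mm_maddubs_epi16(a, b):
// a's bytes are unsigned, b's bytes are signed, and the sum of the two
// products saturates to the int16_t range.
static inline int16_t maddubs_lane(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1)
{
    int sum = a0 * static_cast<int8_t>(b0) + a1 * static_cast<int8_t>(b1);
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return static_cast<int16_t>(sum);
}

static inline void maddubs_example()
{
    // Pixels (1, 255) as the unsigned first operand, ones as the second: 1 + 255.
    assert(maddubs_lane(1, 255, 1, 1) == 256);
    // Operands swapped: 255 is now read as the signed byte -1, so 1 + (-1).
    assert(maddubs_lane(1, 1, 1, 255) == 0);
}
```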
Folding a load requires alignment with SSE, but not with AVX. However, on Intel Haswell/Skylake, indexed addressing modes can only stay micro-fused with instructions which read-modify-write their destination register. vpavgb xmm0, xmm0, [rsi+rax*2] is un-laminated to 2 uops on Haswell/Skylake before it issues into the out-of-order part of the core, but pavgb xmm1, [rsi+rax*2] can stay micro-fused all the way through, so it issues as a single uop. The front-end issue bottleneck is 4 fused-domain uops per clock on mainstream x86 CPUs except Ryzen (i.e. not Atom/Silvermont). Folding half the loads into memory operands helps with that on all Intel CPUs except Sandybridge/Ivybridge, and all AMD CPUs.
gcc and clang will fold the loads when inlining into a test function that uses alignas(32), even if you use _mm_loadu intrinsics. They know the data is aligned, and take advantage.
Weird fact: compiling the 128b-vectorized code with AVX code-gen enabled (-march=native) actually slows it down on Haswell/Skylake, because it would make all 4 loads issue as separate uops even when they're memory operands for vpavgb, and there aren't any movdqa register-copying instructions that AVX would avoid. (Usually AVX comes out ahead anyway even for manually-vectorized code that still only uses 128b vectors, because of the benefit of 3-operand instructions not destroying one of their inputs.) In this case, 13.53G cycles ( +- 0.05% ) or 3094.195773 ms ( +- 0.20% ), up from 11.92G cycles in ~2.7 seconds. uops_issued = 48.508G, up from 41.908G. Instruction count and uops_executed counts are essentially the same.
OTOH, an actual 256b AVX2 version would run a bit less than twice as fast. Some unrolling to reduce the front-end bottleneck will definitely help. An AVX512 version might run close to 4x as fast on Skylake-AVX512 Xeons, but might bottleneck on ALU throughput since SKX shuts down execution port1 when there are any 512b uops in the RS waiting to execute, according to @Mysticial's testing. (That explains why pavgb zmm has 1 per clock throughput while pavgb ymm is 2 per clock.)
To have both input rows aligned, store your image data in a format with a row stride that's a multiple of 16, even if the actual image dimensions are odd. Your storage stride doesn't have to match your actual image dimensions.
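One possible way to get such a layout is sketched below. The `Gray8Image` struct and the helper names are hypothetical, not from any library; the sketch rounds the row stride up to a multiple of 16 and allocates the buffer with `_mm_malloc`, so that every row starts on a 16-byte boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // _mm_malloc / _mm_free

// Round a row width in bytes up to the next multiple of 16.
static inline size_t padded_stride(size_t width)
{
    return (width + 15) & ~static_cast<size_t>(15);
}

// Hypothetical container for a gray8 image with 16-byte-aligned rows.
struct Gray8Image
{
    uint8_t* data;
    size_t width;   // visible pixels per row
    size_t height;
    size_t stride;  // bytes between the starts of consecutive rows
};

static inline Gray8Image alloc_gray8(size_t width, size_t height)
{
    Gray8Image img;
    img.width  = width;
    img.height = height;
    img.stride = padded_stride(width);
    img.data   = static_cast<uint8_t*>(_mm_malloc(img.stride * height, 16));
    return img;
}

static inline uint8_t* row_ptr(const Gray8Image& img, size_t y)
{
    assert(y < img.height);
    return img.data + y * img.stride;  // aligned because base and stride are
}

static inline void free_gray8(Gray8Image& img)
{
    _mm_free(img.data);
    img.data = nullptr;
}
```

The padding bytes at the end of each row are never displayed, so a vector loop can read (and even write) them without special-casing odd widths.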
If you can only align either the source or dest (e.g. because you're downscaling a region that starts at an odd column in the source image), you should probably still align your source pointers.
Intel's optimization manual recommends aligning the destination instead of the source, if you can't align both, but doing 4x as many loads as stores probably changes the balance.
To handle unaligned at the start/end, do a potentially-overlapping unaligned vector of pixels from the start and end. It's fine for stores to overlap other stores, and since dst is separate from src, you can redo a partially-overlapping vector.
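A minimal sketch of the overlapping-vector idea for the end of a row follows. It only covers the tail, not an unaligned start; the names `avg_block` and `average2Rows_anysize` are hypothetical. It assumes `size` is even and at least 32, that `dst` does not alias the sources, and it uses unaligned loads with SSE2-only horizontal averaging (double rounding) to stay self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Produce 16 output pixels from 32 input bytes of each row.
static inline void avg_block(const uint8_t* s1, const uint8_t* s2, uint8_t* d)
{
    const __m128i lowbyte = _mm_set1_epi16(0x00FF);
    __m128i left  = _mm_avg_epu8(_mm_loadu_si128((const __m128i*)s1),
                                 _mm_loadu_si128((const __m128i*)s2));
    __m128i right = _mm_avg_epu8(_mm_loadu_si128((const __m128i*)(s1 + 16)),
                                 _mm_loadu_si128((const __m128i*)(s2 + 16)));
    left  = _mm_and_si128(_mm_avg_epu8(left,  _mm_srli_epi16(left,  8)), lowbyte);
    right = _mm_and_si128(_mm_avg_epu8(right, _mm_srli_epi16(right, 8)), lowbyte);
    _mm_storeu_si128((__m128i*)d, _mm_packus_epi16(left, right));
}

// Full blocks first, then one final block that ends exactly at the end of
// the row and may recompute outputs that were already written.
void average2Rows_anysize(const uint8_t* src1, const uint8_t* src2,
                          uint8_t* dst, size_t size)
{
    assert(size >= 32 && size % 2 == 0);
    const size_t dstsize = size / 2;
    size_t i = 0;
    for (; i + 16 <= dstsize; i += 16)
        avg_block(src1 + 2 * i, src2 + 2 * i, dst + i);
    if (i != dstsize)
    {
        const size_t last = dstsize - 16;  // overlaps the previous block
        avg_block(src1 + 2 * last, src2 + 2 * last, dst + last);
    }
}
```

Because the overlapping block recomputes the same values from the same source bytes, the redundant stores are harmless.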
In Paul's test main(), I just added alignas(32) in front of every array.
AVX2:
Since you compile one version with -march=native, you can easily detect AVX2 at compile time with #ifdef __AVX2__. There's no simple way to use exactly the same code for 128b and 256b manual vectorization. All the intrinsics have different names, so you typically need to copy everything even if there are no other differences.
(There are some C++ wrapper libraries for the intrinsics that use operator-overloading and function overloading to let you write a templated version that uses the same logic on different widths of vector. e.g. Agner Fog's VCL is good, but unless your software is open-source, you can't use it because it's GPL licensed and you want to distribute a binary.)
To take advantage of AVX2 in your binary-distribution version, you'd have to do runtime detection/dispatching. In that case, you'd want to dispatch to versions of a function that loops over rows, so you don't have dispatch overhead inside your loop over rows. Or just let that version use SSSE3.
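A rough sketch of dispatching once, outside the loop over rows, might look like the following. It relies on the GCC/clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`; other compilers would need a different detection mechanism. All function names are hypothetical, and `average2Rows_avx2_placeholder` is only a stand-in that forwards to the scalar code so the example stays self-contained; a real build would put an AVX2 implementation there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef void (*average_rows_fn)(const uint8_t*, const uint8_t*, uint8_t*, size_t);

// Portable scalar fallback (exact rounding).
static void average2Rows_scalar(const uint8_t* src1, const uint8_t* src2,
                                uint8_t* dst, size_t size)
{
    for (size_t i = 0; i + 1 < size; i += 2)
        dst[i / 2] = static_cast<uint8_t>((src1[i] + src1[i + 1] + src2[i] + src2[i + 1]) / 4);
}

// Placeholder for a real AVX2 version (assumption: same signature).
static void average2Rows_avx2_placeholder(const uint8_t* src1, const uint8_t* src2,
                                          uint8_t* dst, size_t size)
{
    average2Rows_scalar(src1, src2, dst, size);
}

// Pick an implementation once, based on what the running CPU reports.
static average_rows_fn select_average2Rows()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return average2Rows_avx2_placeholder;
#endif
    return average2Rows_scalar;
}

// Downscale a whole image by 2 in each direction; the dispatch cost is paid
// once per image, not once per row.
void downscale_image(const uint8_t* src, size_t src_stride,
                     uint8_t* dst, size_t dst_stride,
                     size_t width, size_t height)
{
    assert(width % 2 == 0);
    const average_rows_fn fn = select_average2Rows();
    for (size_t y = 0; y + 1 < height; y += 2)
        fn(src + y * src_stride, src + (y + 1) * src_stride,
           dst + (y / 2) * dst_stride, width);
}
```

Checking the CPU at run time rather than at build time also avoids shipping a binary that uses instructions the target machine does not have.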
Here is an implementation which uses fewer instructions. I haven't benchmarked it against your code though, so it may not be significantly faster:
void average2Rows(const uint8_t* src1, const uint8_t* src2, uint8_t* dst, int size)
{
const __m128i vk1 = _mm_set1_epi8(1);
for (int i = 0; i < size - 31; i += 32)
{
__m128i v0 = _mm_loadu_si128((__m128i *)&src1[i]);
__m128i v1 = _mm_loadu_si128((__m128i *)&src1[i + 16]);
__m128i v2 = _mm_loadu_si128((__m128i *)&src2[i]);
__m128i v3 = _mm_loadu_si128((__m128i *)&src2[i + 16]);
__m128i w0 = _mm_maddubs_epi16(v0, vk1); // unpack and horizontal add
__m128i w1 = _mm_maddubs_epi16(v1, vk1);
__m128i w2 = _mm_maddubs_epi16(v2, vk1);
__m128i w3 = _mm_maddubs_epi16(v3, vk1);
w0 = _mm_add_epi16(w0, w2); // vertical add
w1 = _mm_add_epi16(w1, w3);
w0 = _mm_srli_epi16(w0, 2); // divide by 4
w1 = _mm_srli_epi16(w1, 2);
w0 = _mm_packus_epi16(w0, w1); // pack
_mm_storeu_si128((__m128i *)&dst[i / 2], w0);
}
}
Test harness:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <tmmintrin.h>
void average2Rows_ref(const uint8_t* row1, const uint8_t* row2, uint8_t* dst, int size)
{
for (int i = 0; i < size - 1; i += 2)
{
dst[i / 2] = (row1[i] + row1[i + 1] + row2[i] + row2[i + 1]) / 4;
}
}
void average2Rows(const uint8_t* src1, const uint8_t* src2, uint8_t* dst, int size)
{
const __m128i vk1 = _mm_set1_epi8(1);
for (int i = 0; i < size - 31; i += 32)
{
__m128i v0 = _mm_loadu_si128((__m128i *)&src1[i]);
__m128i v1 = _mm_loadu_si128((__m128i *)&src1[i + 16]);
__m128i v2 = _mm_loadu_si128((__m128i *)&src2[i]);
__m128i v3 = _mm_loadu_si128((__m128i *)&src2[i + 16]);
__m128i w0 = _mm_maddubs_epi16(v0, vk1); // unpack and horizontal add
__m128i w1 = _mm_maddubs_epi16(v1, vk1);
__m128i w2 = _mm_maddubs_epi16(v2, vk1);
__m128i w3 = _mm_maddubs_epi16(v3, vk1);
w0 = _mm_add_epi16(w0, w2); // vertical add
w1 = _mm_add_epi16(w1, w3);
w0 = _mm_srli_epi16(w0, 2); // divide by 4
w1 = _mm_srli_epi16(w1, 2);
w0 = _mm_packus_epi16(w0, w1); // pack
_mm_storeu_si128((__m128i *)&dst[i / 2], w0);
}
}
int main()
{
const int n = 1024;
uint8_t src1[n];
uint8_t src2[n];
uint8_t dest_ref[n / 2];
uint8_t dest_test[n / 2];
for (int i = 0; i < n; ++i)
{
src1[i] = rand();
src2[i] = rand();
}
for (int i = 0; i < n / 2; ++i)
{
dest_ref[i] = 0xaa;
dest_test[i] = 0x55;
}
average2Rows_ref(src1, src2, dest_ref, n);
average2Rows(src1, src2, dest_test, n);
for (int i = 0; i < n / 2; ++i)
{
if (dest_test[i] != dest_ref[i])
{
printf("%u %u %u %u: ref = %u, test = %u\n", src1[2 * i], src1[2 * i + 1], src2[2 * i], src2[2 * i + 1], dest_ref[i], dest_test[i]);
}
}
return 0;
}
Note that the output of the SIMD version exactly matches the output of the scalar reference code (no "off by one" rounding errors).

How efficient/intelligent is Theano in computing gradients?

Suppose I have an artificial neural network with 5 hidden layers. For the moment, forget about the details of the neural network model such as biases, the activation functions used, type of data and so on. Of course, the activation functions are differentiable.
With symbolic differentiation, the following computes the gradients of the objective function with respect to the layers' weights:
w1_grad = T.grad(lost, [w1])
w2_grad = T.grad(lost, [w2])
w3_grad = T.grad(lost, [w3])
w4_grad = T.grad(lost, [w4])
w5_grad = T.grad(lost, [w5])
w_output_grad = T.grad(lost, [w_output])
This way, to compute the gradients w.r.t w1 the gradients w.r.t w2, w3, w4 and w5 must first be computed. Similarly to compute the gradients w.r.t w2 the gradients w.r.t w3, w4 and w5 must be computed first.
However, the following code also computes the gradients w.r.t. each weight matrix:
w1_grad, w2_grad, w3_grad, w4_grad, w5_grad, w_output_grad = T.grad(lost, [w1, w2, w3, w4, w5, w_output])
I was wondering, is there any difference between these two methods in terms of performance? Is Theano intelligent enough to avoid re-computing the gradients using the second method? By intelligent I mean to compute w3_grad, Theano should [preferably] use the pre-computed gradients of w_output_grad, w5_grad and w4_grad instead of computing them again.
Well, it turns out Theano does not reuse previously computed gradients when computing the gradients for lower layers of a computational graph in separate T.grad calls. However, it's not going to be a big deal, since computing the gradients is a once-in-a-lifetime operation unless you have to compute the gradient on each iteration. Theano returns a symbolic expression for the derivatives as a computational graph, and from that point on you simply use the function derived by Theano to compute numerical values and update the weights. Here's a dummy example of a neural network with 3 hidden layers and an output layer:
import time
import numpy as np
import theano
import theano.tensor as T
from theano import shared

class neuralNet(object):
    def __init__(self, examples):
        floatX = theano.config.floatX
        # Weights and biases for 3 hidden layers and an output layer
        self.w = shared(np.random.random((16384, 5000)).astype(floatX), borrow=True, name='w')
        self.w2 = shared(np.random.random((5000, 3000)).astype(floatX), borrow=True, name='w2')
        self.w3 = shared(np.random.random((3000, 512)).astype(floatX), borrow=True, name='w3')
        self.w4 = shared(np.random.random((512, 40)).astype(floatX), borrow=True, name='w4')
        self.b = shared(np.ones(5000, dtype=floatX), borrow=True, name='b')
        self.b2 = shared(np.ones(3000, dtype=floatX), borrow=True, name='b2')
        self.b3 = shared(np.ones(512, dtype=floatX), borrow=True, name='b3')
        self.b4 = shared(np.ones(40, dtype=floatX), borrow=True, name='b4')
        self.x = examples
        L1 = T.nnet.sigmoid(T.dot(self.x, self.w) + self.b)
        L2 = T.nnet.sigmoid(T.dot(L1, self.w2) + self.b2)
        L3 = T.nnet.sigmoid(T.dot(L2, self.w3) + self.b3)
        L4 = T.dot(L3, self.w4) + self.b4
        self.forwardProp = T.nnet.softmax(L4)
        self.predict = T.argmax(self.forwardProp, axis=1)

    def loss(self, y):
        return -T.mean(T.log(self.forwardProp)[T.arange(y.shape[0]), y])

x = T.matrix('x')
y = T.ivector('y')
nnet = neuralNet(x)
loss = nnet.loss(y)

# One T.grad call for all parameters
diffrentiationTime = []
for i in range(100):
    t1 = time.time()
    gw, gw2, gw3, gw4, gb, gb2, gb3, gb4 = T.grad(loss, [nnet.w, nnet.w2, nnet.w3, nnet.w4, nnet.b, nnet.b2, nnet.b3, nnet.b4])
    diffrentiationTime.append(time.time() - t1)
print('Efficient Method: Took %f seconds with std %f' % (np.mean(diffrentiationTime), np.std(diffrentiationTime)))

# One T.grad call per parameter
diffrentiationTime = []
for i in range(100):
    t1 = time.time()
    gw = T.grad(loss, [nnet.w])
    gw2 = T.grad(loss, [nnet.w2])
    gw3 = T.grad(loss, [nnet.w3])
    gw4 = T.grad(loss, [nnet.w4])
    gb = T.grad(loss, [nnet.b])
    gb2 = T.grad(loss, [nnet.b2])
    gb3 = T.grad(loss, [nnet.b3])
    gb4 = T.grad(loss, [nnet.b4])
    diffrentiationTime.append(time.time() - t1)
print('Inefficient Method: Took %f seconds with std %f' % (np.mean(diffrentiationTime), np.std(diffrentiationTime)))
This prints the following:
Efficient Method: Took 0.061056 seconds with std 0.013217
Inefficient Method: Took 0.305081 seconds with std 0.026024
This shows that Theano uses a dynamic-programming approach to compute gradients for the efficient method.

Compare the sign bit in SSE Intrinsics

How would one create a mask using SSE intrinsics which indicates whether the signs of two packed floats (__m128's) are the same? For example, if comparing a and b, where a is [1.0 -1.0 0.0 2.0] and b is [1.0 1.0 1.0 1.0], the desired mask would be [true false true true].
Here's one solution:
const __m128i MASK = _mm_set1_epi32(0xffffffff);
__m128 a = _mm_setr_ps(1,-1,0,2);
__m128 b = _mm_setr_ps(1,1,1,1);
__m128 f = _mm_xor_ps(a,b);
__m128i i = _mm_castps_si128(f);
i = _mm_srai_epi32(i,31);
i = _mm_xor_si128(i,MASK);
f = _mm_castsi128_ps(i);
// i = (0xffffffff, 0, 0xffffffff, 0xffffffff)
// f = (0xffffffff, 0, 0xffffffff, 0xffffffff)
In this snippet, both i and f will have the same bitmask. I assume you want it in the __m128 type so I added the f = _mm_castsi128_ps(i); to convert it back from an __m128i.
Note that this code is sensitive to the sign of the zero. So 0.0 and -0.0 will affect the results.
Explanations:
The way the code works is as follows:
f = _mm_xor_ps(a,b); // xor the sign bits (well all the bits actually)
i = _mm_castps_si128(f); // Convert it to an integer. There's no instruction here.
i = _mm_srai_epi32(i,31); // Arithmetic shift that sign bit into all the bits.
i = _mm_xor_si128(i,MASK); // Invert all the bits
f = _mm_castsi128_ps(i); // Convert back. Again, there's no instruction here.
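For convenience, the same steps can be wrapped in a small helper, sketched below. The names `same_sign_mask` and `same_sign_mask_example` are made up for illustration. The example also shows the sensitivity to the sign of zero: -0.0 counts as negative.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2

// All-ones in each lane where a and b have the same sign bit, zero elsewhere.
static inline __m128 same_sign_mask(__m128 a, __m128 b)
{
    __m128i x = _mm_castps_si128(_mm_xor_ps(a, b)); // sign bit set where signs differ
    x = _mm_srai_epi32(x, 31);                      // broadcast that bit across the lane
    x = _mm_xor_si128(x, _mm_set1_epi32(-1));       // invert: all-ones where signs match
    return _mm_castsi128_ps(x);
}

static inline void same_sign_mask_example()
{
    const __m128 a = _mm_setr_ps(1.0f, -1.0f, 0.0f, 2.0f);
    const __m128 b = _mm_set1_ps(1.0f);
    // Elements 0, 2 and 3 match, element 1 does not: bits 1101.
    assert(_mm_movemask_ps(same_sign_mask(a, b)) == 0xD);
    // -0.0 has its sign bit set, so it does not match +1.0 in any lane.
    assert(_mm_movemask_ps(same_sign_mask(_mm_set1_ps(-0.0f), b)) == 0x0);
}
```

The returned mask can be used directly with `_mm_and_ps` / `_mm_andnot_ps` to select values per lane.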
Have a look at the _mm_movemask_ps intrinsic, which extracts the most significant bit (i.e. sign bit) from 4 floats. See http://msdn.microsoft.com/en-us/library/4490ys29.aspx
For example, if you have [1.0 -1.0 0.0 2.0] (written from the highest element to the lowest, as in _mm_set_ps), then movemask_ps will return 4, or 0100 in binary. So if you get movemask_ps for each vector and compare the results (perhaps with a bitwise NOT of the XOR), that will indicate which signs are the same.
a = [1.0 -1.0 0.0 2.0]
b = [1.0 1.0 1.0 1.0]
movemask_ps a = 4
movemask_ps b = 0
NOT (a XOR b) = 0xB, or binary 1011
Hence signs are the same except in the second vector element.
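A short sketch of this approach is below; `same_sign_bits` and `same_sign_bits_example` are hypothetical names. Note that the numeric value depends on how the elements are ordered: `_mm_set_ps` takes its arguments from the highest element to the lowest, while `_mm_setr_ps` takes them from lowest to highest.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE

// Bit i of the result is 1 when element i of a and b have the same sign bit.
static inline int same_sign_bits(__m128 a, __m128 b)
{
    return ~(_mm_movemask_ps(a) ^ _mm_movemask_ps(b)) & 0xF;
}

static inline void same_sign_bits_example()
{
    const __m128 b = _mm_set1_ps(1.0f);
    // _mm_set_ps order: -1.0 lands in element 2, giving 0100 and then 1011.
    assert(_mm_movemask_ps(_mm_set_ps(1.0f, -1.0f, 0.0f, 2.0f)) == 0x4);
    assert(same_sign_bits(_mm_set_ps(1.0f, -1.0f, 0.0f, 2.0f), b) == 0xB);
    // _mm_setr_ps order: -1.0 lands in element 1, giving 1101 instead.
    assert(same_sign_bits(_mm_setr_ps(1.0f, -1.0f, 0.0f, 2.0f), b) == 0xD);
}
```

Unlike the shift-and-xor version, this produces a 4-bit integer rather than a vector mask, which is handy for branching but less so for per-lane blending.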