I would like to make some vector computation faster, and I believe that SIMD instructions for float comparison and manipulation could help. Here is the operation:
void func(const double* left, const double* right, double* res, const size_t size, const double th, const double drop) {
    for (size_t i = 0; i < size; ++i) {
        res[i] = right[i] >= th ? left[i] : (left[i] - drop);
    }
}
Mainly, it subtracts drop from the left value when the right value is below the threshold.
The size is around 128-256 (not that big), but the computation is called heavily.
I tried starting with loop unrolling, but it did not gain much performance; maybe some compiler flags are needed.
Could you please suggest some improvements to the code for faster computation?
Clang already auto-vectorizes this pretty much the way Soonts suggested doing manually. Use __restrict on your pointers so it doesn't need a fallback version that works when some of the arrays overlap; without it, the code still auto-vectorizes, but the fallback bloats the function.
Unfortunately GCC only auto-vectorizes this with -ffast-math; it turns out only -fno-trapping-math is actually required. That option is generally safe, especially if you aren't using fenv access to unmask any FP exceptions (feenableexcept) or looking at MXCSR sticky FP exception flags (fetestexcept).
With that option, GCC too will use (v)pblendvpd with -march=nehalem or -march=znver1. See it on Godbolt.
Also, the C function in the original version of the question was broken: th and drop were used as scalar double, but declared as const double *.
AVX512F would let you do a !(right[i] >= thresh) compare and use the resulting mask for a merge-masked subtract.
Elements where the predicate was true will get left[i] - drop; other elements will keep their left[i] value, because you merge into a vector of left values.
Unfortunately GCC with -march=skylake-avx512 uses a normal vsubpd and then a separate vmovapd zmm2{k1}, zmm5 to blend, which is obviously a missed optimization. The blend destination is already one of the inputs to the SUB.
Using AVX512VL for 256-bit vectors (in case the rest of your program can't efficiently use 512-bit, so you don't suffer reduced turbo clock speeds):
__m256d left = ...;
__m256d right = ...;
__mmask8 cmp = _mm256_cmp_pd_mask(right, _mm256_set1_pd(th), _CMP_NGE_UQ);
__m256d res = _mm256_mask_sub_pd(left, cmp, left, _mm256_set1_pd(drop));
So (besides the loads and store) it's 2 instructions with AVX512F / VL.
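Putting those two intrinsics into a full loop might look like the following (a minimal untested sketch, assuming size is a multiple of 4, AVX-512F + VL are available, and the pointers don't alias; a scalar tail loop would handle any remainder):

#include <immintrin.h>
#include <cstddef>

void func_avx512vl(const double* __restrict left, const double* __restrict right,
                   double* __restrict res, size_t size, double th, double drop)
{
    const __m256d vth   = _mm256_set1_pd(th);
    const __m256d vdrop = _mm256_set1_pd(drop);
    for (size_t i = 0; i < size; i += 4) {
        __m256d l = _mm256_loadu_pd(left + i);
        __m256d r = _mm256_loadu_pd(right + i);
        // mask is set where !(right >= th), i.e. where we want left - drop
        __mmask8 k = _mm256_cmp_pd_mask(r, vth, _CMP_NGE_UQ);
        // merge-masked subtract: masked-off elements keep their left value
        __m256d out = _mm256_mask_sub_pd(l, k, l, vdrop);
        _mm256_storeu_pd(res + i, out);
    }
}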
If you don't need the specific NaN behaviour of your version, GCC can auto-vectorize too
And it's more efficient with all compilers because you just need an AND, not a variable-blend. So it's significantly better with just SSE2, and also better on most CPUs even when they do support SSE4.1 blendvpd, because that instruction isn't as efficient.
You can subtract 0.0 or drop from left[i] based on the compare result.
Producing 0.0 or a constant based on a compare result is extremely efficient: just an andps instruction. (The bit-pattern for 0.0 is all-zeros, and SIMD compares produce vectors of all-1 or all-0 bits. So AND keeps the old value or zeros it.)
We can also add -drop instead of subtracting drop. This costs an extra negation on input, but with AVX allows a memory-source operand for vaddpd. GCC chooses to use an indexed addressing mode so that doesn't actually help reduce the front-end uop count on Intel CPUs, though; it will "unlaminate". But even with -ffast-math, gcc doesn't do this optimization on its own to allow folding a load. (It wouldn't be worth doing separate pointer increments unless we unroll the loop, though.)
void func3(const double *__restrict left, const double *__restrict right, double *__restrict res,
           const size_t size, const double th, const double drop)
{
    for (size_t i = 0; i < size; ++i) {
        double add = right[i] >= th ? 0.0 : -drop;
        res[i] = left[i] + add;
    }
}
GCC 9.1's inner loop (with -march=skylake but without -ffast-math) from the Godbolt link above:
# func3 main loop
# gcc -O3 -march=skylake (without fast-math)
.L33:
vcmplepd ymm2, ymm4, YMMWORD PTR [rsi+rax]
vandnpd ymm2, ymm2, ymm3
vaddpd ymm2, ymm2, YMMWORD PTR [rdi+rax]
vmovupd YMMWORD PTR [rdx+rax], ymm2
add rax, 32
cmp r8, rax
jne .L33
Or the plain SSE2 version has an inner loop that's basically the same, just with left - zero_or_drop instead of left + zero_or_minus_drop; so unless you can promise the compiler 16-byte alignment, or you're making an AVX version, negating drop is just extra overhead.
Negating drop takes a constant from memory (to XOR the sign bit), and that's the only static constant this function needs, so that tradeoff is worth considering for your case where the loop doesn't run a huge number of times. (Unless th or drop are also compile-time constants after inlining, and are getting loaded anyway. Or especially if -drop can be computed at compile time. Or if you can get your program to work with a negative drop.)
Another difference between adding and subtracting is that subtracting doesn't destroy the sign of zero. -0.0 - 0.0 = -0.0, +0.0 - 0.0 = +0.0. In case that matters.
# gcc9.1 -O3
.L26:
movupd xmm5, XMMWORD PTR [rsi+rax]
movapd xmm2, xmm4 # duplicate th
movupd xmm6, XMMWORD PTR [rdi+rax]
cmplepd xmm2, xmm5 # destroy the copy of th
andnpd xmm2, xmm3 # _mm_andnot_pd
addpd xmm2, xmm6 # _mm_add_pd
movups XMMWORD PTR [rdx+rax], xmm2
add rax, 16
cmp r8, rax
jne .L26
GCC uses unaligned loads, so (without AVX) it can't fold a memory source operand into cmppd or subpd.
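For reference, a rough intrinsics equivalent of that left - zero_or_drop SSE2 loop could look like this (an untested sketch, not from the original answer; assumes size is even and non-overlapping pointers, with a scalar tail otherwise):

#include <emmintrin.h>
#include <cstddef>

void func3_sse2(const double* __restrict left, const double* __restrict right,
                double* __restrict res, size_t size, double th, double drop)
{
    const __m128d vth   = _mm_set1_pd(th);
    const __m128d vdrop = _mm_set1_pd(drop);
    for (size_t i = 0; i < size; i += 2) {
        __m128d r  = _mm_loadu_pd(right + i);
        __m128d l  = _mm_loadu_pd(left + i);
        __m128d ge = _mm_cmpge_pd(r, vth);               // all-ones where right >= th
        __m128d zero_or_drop = _mm_andnot_pd(ge, vdrop); // drop where right < th, else 0.0
        _mm_storeu_pd(res + i, _mm_sub_pd(l, zero_or_drop));
    }
}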
Here you go (untested); I've tried to explain in the comments what the intrinsics do.
void func_sse41( const double* left, const double* right, double* res,
                 const size_t size, double th, double drop )
{
    // Verify the size is even.
    // If it's not, you'll need extra code at the end to process the last value the old way.
    assert( 0 == ( size % 2 ) );

    // Load scalar values into 2 registers.
    const __m128d threshold = _mm_set1_pd( th );
    const __m128d dropVec = _mm_set1_pd( drop );

    for( size_t i = 0; i < size; i += 2 )
    {
        // Load 4 double values into registers, 2 from right, 2 from left
        const __m128d r = _mm_loadu_pd( right + i );
        const __m128d l = _mm_loadu_pd( left + i );
        // Compare ( r >= threshold ) for 2 values at once
        const __m128d comp = _mm_cmpge_pd( r, threshold );
        // Compute ( left[ i ] - drop ), for 2 values at once
        const __m128d dropped = _mm_sub_pd( l, dropVec );
        // Select left where ( r >= threshold ), otherwise ( left - drop ).
        // This is the only instruction here that requires SSE 4.1.
        const __m128d result = _mm_blendv_pd( dropped, l, comp );
        // Store the 2 result values
        _mm_storeu_pd( res + i, result );
    }
}
The code will crash with an "invalid instruction" runtime error if the CPU doesn't have SSE 4.1. For best results, detect support with CPUID so you can fail gracefully. I think now in 2019 it's quite reasonable to assume it's supported: Intel did so in 2008, AMD in 2011, and the Steam survey says 96.3%. If you want to support older CPUs, it's possible to emulate _mm_blendv_pd with 3 other instructions: _mm_and_pd, _mm_andnot_pd, _mm_or_pd.
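That SSE2 emulation could be written like this (a hedged sketch, not part of the original answer): because the compare result is all-ones or all-zeros in each lane, a plain bitwise select behaves the same as blendv here.

#include <emmintrin.h>

// Select b where mask is all-ones, a where mask is all-zeros.
static inline __m128d blendv_pd_sse2(__m128d a, __m128d b, __m128d mask)
{
    return _mm_or_pd(_mm_and_pd(mask, b), _mm_andnot_pd(mask, a));
}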
If you can guarantee the data is aligned, replacing the loads with _mm_load_pd will be slightly faster; _mm_cmpge_pd compiles into CMPPD (https://www.felixcloutier.com/x86/cmppd), which can take one of its arguments directly from RAM.
Potentially, you can get a further 2x improvement by writing an AVX version. But I hope even the SSE version is faster than your code: it handles 2 values per iteration and doesn't have conditions inside the loop. If you're unlucky, AVX will be slower: many CPUs need some time to power on their AVX units, which takes many thousands of cycles. Until powered up, AVX code runs very slowly.
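That AVX version could look roughly like the following (an untested sketch, assuming size is a multiple of 4 and AVX is available; _CMP_GE_OQ is used as the quiet >= predicate in place of _mm_cmpge_pd):

#include <immintrin.h>
#include <cstddef>

void func_avx(const double* left, const double* right, double* res,
              size_t size, double th, double drop)
{
    const __m256d threshold = _mm256_set1_pd(th);
    const __m256d dropVec   = _mm256_set1_pd(drop);
    for (size_t i = 0; i < size; i += 4) {
        const __m256d r = _mm256_loadu_pd(right + i);
        const __m256d l = _mm256_loadu_pd(left + i);
        const __m256d comp    = _mm256_cmp_pd(r, threshold, _CMP_GE_OQ); // r >= threshold
        const __m256d dropped = _mm256_sub_pd(l, dropVec);               // left - drop
        _mm256_storeu_pd(res + i, _mm256_blendv_pd(dropped, l, comp));   // keep left where r >= th
    }
}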
You can use GCC's and Clang's vector extensions to implement a ternary select function (see https://stackoverflow.com/a/48538557/2542702).
#include <stddef.h>
#include <inttypes.h>
#if defined(__clang__)
typedef double double4 __attribute__ ((ext_vector_type(4)));
typedef int64_t long4 __attribute__ ((ext_vector_type(4)));
#else
typedef double double4 __attribute__ ((vector_size (sizeof(double)*4)));
typedef int64_t long4 __attribute__ ((vector_size (sizeof(int64_t)*4)));
#endif
double4 select(long4 s, double4 a, double4 b) {
    double4 c;
#if defined(__GNUC__) && !defined(__INTEL_COMPILER) && !defined(__clang__)
    c = s ? a : b;
#else
    for (int i = 0; i < 4; i++) c[i] = s[i] ? a[i] : b[i];
#endif
    return c;
}

void func(double* left, double* right, double* res, size_t size, double th, double drop) {
    size_t i;
    for (i = 0; i < (size & -4); i += 4) {
        double4 leftv = *(double4*)&left[i];
        double4 rightv = *(double4*)&right[i];
        *(double4*)&res[i] = select(rightv >= th, leftv, leftv - drop);
    }
    for (; i < size; i++) res[i] = right[i] >= th ? left[i] : (left[i] - drop);
}
https://godbolt.org/z/h4OKMl
I have a strange problem with some AVX / AVX2 code that I am working on. I set up a test console application in C++ (Visual Studio 2017 on Windows 7) with the aim of comparing routines written in plain C++ with equivalent routines written with AVX / AVX2 intrinsics; each routine is timed.
A first problem: the measured time of a single routine changes depending on the position of its call in the program.
void TraditionalAVG_UncharToDouble(const unsigned char *vec1, const unsigned char *vec2, double* doubleArray, const unsigned int length) {
int sumTot = 0;
double* ptrDouble = doubleArray;
for (unsigned int packIdx = 0; packIdx < length; ++packIdx) {
*ptrDouble = ((double)(*(vec1 + packIdx) + *(vec2 + packIdx)))/ ((double)2);
ptrDouble++;
}
}
void AVG_uncharToDoubleArray(const unsigned char *vec1, const unsigned char *vec2, double* doubleArray, const unsigned int length) {
//constexpr unsigned int memoryAlignmentBytes = 32;
constexpr unsigned int bytesPerPack = 256 / 16;
unsigned int packCount = length / bytesPerPack;
double* ptrDouble = doubleArray;
__m128d divider=_mm_set1_pd(2);
for (unsigned int packIdx = 0; packIdx < packCount; ++packIdx)
{
auto x1 = _mm_loadu_si128((const __m128i*)vec1);
auto x2 = _mm_loadu_si128((const __m128i*)vec2);
unsigned char index = 0;
while(index < 8) {
index++;
auto x1lo = _mm_cvtepu8_epi64(x1);
auto x2lo = _mm_cvtepu8_epi64(x2);
__m128d x1_pd = int64_to_double_full(x1lo);
__m128d x2_pd = int64_to_double_full(x2lo);
_mm_store_pd(ptrDouble, _mm_div_pd(_mm_add_pd(x1_pd, x2_pd), divider));
ptrDouble = ptrDouble + 2;
x1 = _mm_srli_si128(x1, 2);
x2 = _mm_srli_si128(x2, 2);
}
vec1 += bytesPerPack;
vec2 += bytesPerPack;
}
for (unsigned int ii = 0 ; ii < length % packCount; ++ii)
{
*(ptrDouble + ii) = (double)(*(vec1 + ii) + *(vec2 + ii))/ (double)2;
}
}
... on main ...
timeAvg02 = 0;
Start_TimerMS();
AVG_uncharToDoubleArray(unCharArray, unCharArrayBis, doubleArray, N);
End_TimerMS(&timeAvg02);
std::cout << "AVX2_AVG UncharTodoubleArray:: " << timeAvg02 << " ms" << std::endl;
//printerDouble("AvxDouble", doubleArray, N);
std::cout << std::endl;
timeAvg01 = 0;
Start_TimerMS3();
TraditionalAVG_UncharToDouble(unCharArray, unCharArrayBis, doubleArray, N);
End_TimerMS3(&timeAvg01);
std::cout << "Traditional_AVG UncharTodoubleArray: " << timeAvg01 << " ms" << std::endl;
//printerDouble("TraditionalAvgDouble", doubleArray, N);
std::cout << std::endl;
The second problem is that the routines written with AVX2 intrinsics are slower than the routines written in plain C++; the timing results of the two tests showed this.
How can I overcome this strange behavior? What is the reason behind it?
MSVC doesn't optimize intrinsics (much), so you get an actual vdivpd by 2.0, not a multiply by 0.5. That's a worse bottleneck than scalar, less than one element per clock cycle on most CPUs. (e.g. Skylake / Ice Lake / Alder Lake-P: 4 cycle throughput for vdivpd xmm, or 8 cycles for vdivpd ymm, either way 2 cycles per element. https://uops.info)
From Godbolt, with MSVC 19.33 -O2 -arch:AVX2, with a version that compiles (replacing your undefined int64_to_double_full with efficient 32-bit conversion). Your version is probably even worse.
$LL5@AVG_unchar:
vpmovzxbd xmm0, xmm5
vpmovzxbd xmm1, xmm4
vcvtdq2pd xmm3, xmm0
vcvtdq2pd xmm2, xmm1
vaddpd xmm0, xmm3, xmm2
vdivpd xmm3, xmm0, xmm6 ;; performance disaster
vmovupd XMMWORD PTR [r8], xmm3
add r8, 16
vpsrldq xmm4, xmm4, 2
vpsrldq xmm5, xmm5, 2
sub rax, 1
jne SHORT $LL5@AVG_unchar
Also, AVX2 implies support for 256-bit integer as well as FP vectors, so you can use __m256i. Although with this shift strategy for using the chars of a vector, you wouldn't want to. You'd just want to use __m256d.
Look at how clang vectorizes the scalar C++: https://godbolt.org/z/Yzze98qnY 2x vpmovzxbd-load of __m128i / vpaddd __m128i / vcvtdq2pd to __m256d / vmulpd __m256d (by 0.5) / vmovupd. (Narrow loads as a memory source for vpmovzxbd are good, especially with an XMM destination so they can micro-fuse on Intel CPUs. Writing this with intrinsics relies on compilers optimizing _mm_loadu_si32 into a memory source for _mm_cvtepu8_epi32. Looping to use all bytes of a wider load isn't crazy, but costs more shuffles. clang unrolls that loop, replacing later vpsrldq / vpmovzxbd with vpshufb shuffles to move bytes directly to where they're needed, at the cost of needing more constants.)
IDK what's wrong with MSVC or why it failed to auto-vectorize with -O2 -arch:AVX2, but at least it optimized /2.0 to *0.5. When the reciprocal is exactly representable as a double, that's a well-known safe and valuable optimization.
With a good compiler, there'd be no need for intrinsics. But "good" seems to only include clang; GCC makes a bit of a mess with converting vector widths.
Your scalar C is strangely obfuscated as *ptrDouble = ((double)(*(vec1 + packIdx) + *(vec2 + packIdx)))/ ((double)2); instead of
(vec1[packIdx] + vec2[packIdx]) / 2.0.
Doing integer addition like this scalar code before conversion to FP is a good idea, especially for a vectorized version, so there's only one conversion. Each input already needs to get widened separately to 32-bit elements.
IDK what int64_to_double_full is, but if it's manual emulation of AVX-512 vcvtqq2pd, it makes no sense to use it on values zero-extended from char. That value-range fits comfortably in int32_t, so you can widen only to 32-bit elements, and let the hardware packed int->FP conversion _mm256_cvtepi32_pd (vcvtdq2pd) widen the elements.
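As an illustration of that approach (a hedged sketch, not from the original answer; the function name and loop structure are mine), the conversion can be written with 4-byte loads, an integer add, one int->double conversion, and a multiply by 0.5. If your compiler lacks _mm_loadu_si32, a memcpy into a uint32_t plus _mm_cvtsi32_si128 does the same job.

#include <immintrin.h>

void avg_u8_to_double_avx2(const unsigned char* vec1, const unsigned char* vec2,
                           double* out, unsigned length)
{
    const __m256d half = _mm256_set1_pd(0.5);
    unsigned i = 0;
    for (; i + 4 <= length; i += 4) {
        __m128i a = _mm_cvtepu8_epi32(_mm_loadu_si32(vec1 + i)); // 4 bytes -> 4 x int32
        __m128i b = _mm_cvtepu8_epi32(_mm_loadu_si32(vec2 + i));
        __m128i sum = _mm_add_epi32(a, b);              // integer add before conversion
        __m256d d = _mm256_cvtepi32_pd(sum);            // one int -> double conversion
        _mm256_storeu_pd(out + i, _mm256_mul_pd(d, half)); // *0.5 instead of /2.0
    }
    for (; i < length; ++i)                             // scalar tail
        out[i] = (vec1[i] + vec2[i]) * 0.5;
}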
I have a large in-memory array given as a pointer uint64_t * arr (plus a size), which represents plain bits. I need to shift these bits to the right, as efficiently as possible, by some amount from 0 to 63.
By shifting the whole array I mean not shifting each element on its own (like a[i] <<= Shift), but shifting it as a single large bit vector. In other words, for each intermediate position i (except the first and last element) I can do the following in a loop:
dst[i] = w | (src[i] << Shift);
w = src[i] >> (64 - Shift);
where w is a temporary variable holding the right-shifted value of the previous array element.
This solution is simple and obvious, but I need something more efficient as I have gigabytes of data.
Ideally I would use some SIMD instructions for that, so I'm looking for SIMD suggestions from experts. I need to implement the shifting code for all four popular instruction-set levels: SSE-SSE4.2 / AVX / AVX2 / AVX-512.
But as far as I know, for SSE2 for example there exists only the _mm_slli_si128() intrinsic/instruction, which shifts only by a multiple of 8 bits (in other words, byte shifting). And I need shifting by an arbitrary bit count, not only byte shifts.
Without SIMD I can also shift by 128 bits at once by using the shld reg, reg, cl instruction, which effectively allows a 128-bit shift. It is implemented as the intrinsic __shiftleft128() in MSVC, and produces assembler code that can be seen here.
BTW, I need solutions for all of MSVC/GCC/Clang.
Also, inside a single loop iteration I can shift 4 or 8 words with sequential operations; this will use CPU pipelining to speed up out-of-order execution of several independent instructions.
If needed, my bit vector can be aligned to any number of bytes in memory, if that helps, for example, to improve SIMD speed by doing aligned reads/writes. Also, the source and destination bit-vector memory are different (non-overlapping).
In other words, I'm looking for any suggestions about how to solve my task most efficiently (most performantly) on different Intel CPUs.
Note, to clarify: I actually have to do several shift-ORs, not just a single shift. I have a large bit vector X and several hundred shift sizes s0, s1, ..., sN, where each shift size is different and can also be large (for example a shift by 100K bits), and I want to compute the resulting large bit vector Y = (X << s0) | (X << s1) | ... | (X << sN). I just simplified my question for StackOverflow to shifting a single vector, but this detail about the original task is probably very important.
As requested by @Jake'Alquimista'LEE, I decided to implement a ready-made minimal reproducible toy example of what I want to do, computing shift-ORs of the input bit vector src to produce the OR-ed final dst bit vector. This example is not optimized at all, just a straightforward simple variant of how my task can be solved. For simplicity this example has a small input vector, not gigabytes as in my case. It is a toy example; I didn't check whether it solves the task correctly, and it may contain minor bugs:
#include <cstdint>
#include <vector>
#include <random>
#include <algorithm> // std::min

#define bit_sizeof(x) (sizeof(x) * 8)

using u64 = uint64_t;
using T = u64;

int main() {
    std::mt19937_64 rng{123};
    // Randomly generate the source bit vector
    std::vector<T> src(100'000);
    for (size_t i = 0; i < src.size(); ++i)
        src[i] = rng();
    size_t const src_bitsize = src.size() * bit_sizeof(T);
    // Destination bit vector, for example twice as big
    std::vector<T> dst(src.size() * 2);
    // Randomly generate the shifts
    std::vector<u64> shifts(200);
    for (size_t i = 0; i < shifts.size(); ++i)
        shifts[i] = rng() % src_bitsize;
    // Right-shift that handles overflow
    auto Shr = [](auto x, size_t s) {
        return s >= bit_sizeof(x) ? 0 : (x >> s);
    };
    // Do the actual shift-ORs
    for (auto orig_shift: shifts) {
        size_t const
            word_off = orig_shift / bit_sizeof(T),
            bit_off = orig_shift % bit_sizeof(T);
        if (word_off >= dst.size())
            continue;
        size_t const
            lim = std::min(src.size(), dst.size() - word_off);
        T w = 0;
        for (size_t i = 0; i < lim; ++i) {
            dst[word_off + i] |= w | (src[i] << bit_off);
            w = Shr(src[i], bit_sizeof(T) - bit_off);
        }
        // Special-case handling for the last word
        if (word_off + lim < dst.size())
            dst[word_off + lim] |= w;
    }
}
My real project's current code is different from the toy example above. The project already solves the real-world task correctly; I just need extra optimizations. Some optimizations I already did, like using OpenMP to parallelize the shift-OR operations across all cores. Also, as said in the comments, I created specialized templated functions for each shift size, 64 functions in total, and choose one of the 64 functions to do the actual shift-OR. Each C++ function has a compile-time value of the shift size, so the compiler does extra optimizations taking the compile-time values into account. A sketch of that kind of dispatch is shown below.
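For concreteness, here is a hedged C++17 sketch (not the project's actual code; the names and the scalar kernel are illustrative) of that kind of per-shift-size template dispatch:

#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

// One templated shift-OR kernel per compile-time bit offset (scalar here just to
// keep the sketch short; the real kernels would be the optimized SIMD ones).
template<unsigned Shift>
void shift_or_words(uint64_t* __restrict dst, const uint64_t* __restrict src, size_t n) {
    if constexpr (Shift == 0) {
        for (size_t i = 0; i < n; ++i) dst[i] |= src[i];
    } else {
        uint64_t carry = 0;
        for (size_t i = 0; i < n; ++i) {
            dst[i] |= carry | (src[i] << Shift);
            carry = src[i] >> (64 - Shift);
        }
    }
}

using ShiftFn = void (*)(uint64_t*, const uint64_t*, size_t);

template<size_t... I>
constexpr std::array<ShiftFn, 64> make_table(std::index_sequence<I...>) {
    return { &shift_or_words<I>... };
}

// The runtime bit offset selects the compile-time specialized kernel.
void shift_or_dispatch(uint64_t* dst, const uint64_t* src, size_t n, unsigned bit_off) {
    static constexpr auto table = make_table(std::make_index_sequence<64>{});
    table[bit_off & 63](dst, src, n);
}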
You can, and you possibly don't even need to use SIMD instructions explicitly.
The target compilers GCC, Clang and MSVC, as well as other compilers like ICC, all support auto-vectorization.
While hand-optimized assembly can outperform compiler-generated vectorized instructions, it's generally harder to achieve and you may need several versions for different architectures.
Generic code that leads to efficient auto-vectorized instructions is a solution that may be portable across many platforms.
For instance a simple shiftvec version
void shiftvec(uint64_t* dst, uint64_t* src, int size, int shift)
{
    for (int i = 0; i < size; ++i,++src,++dst)
    {
        *dst = ((*src)<<shift) | (*(src+1)>>(64-shift));
    }
}
compiled with a recent GCC (or CLANG works as well) and -O3 -std=c++11 -mavx2 leads to SIMD instructions in the core loop of the assembly
.L5:
vmovdqu ymm4, YMMWORD PTR [rsi+rax]
vmovdqu ymm5, YMMWORD PTR [rsi+8+rax]
vpsllq ymm0, ymm4, xmm2
vpsrlq ymm1, ymm5, xmm3
vpor ymm0, ymm0, ymm1
vmovdqu YMMWORD PTR [rdi+rax], ymm0
add rax, 32
cmp rax, rdx
jne .L5
See on godbolt.org: https://godbolt.org/z/5TxhqMhnK
This also generalizes if you want to combine multiple shifts in dst:
void shiftvec2(uint64_t* dst, uint64_t* src1, uint64_t* src2, int size1, int size2, int shift1, int shift2)
{
int size = size1<size2 ? size1 : size2;
for (int i = 0; i < size; ++i,++src1,++src2,++dst)
{
*dst = ((*src1)<<shift1) | (*(src1+1)>>(64-shift1));
*dst |= ((*src2)<<shift2) | (*(src2+1)>>(64-shift2));
}
for (int i = size; i < size1; ++i,++src1,++dst)
{
*dst = ((*src1)<<shift1) | (*(src1+1)>>(64-shift1));
}
for (int i = size; i < size2; ++i,++src2,++dst)
{
*dst = ((*src2)<<shift2) | (*(src2+1)>>(64-shift2));
}
}
compiles to a core-loop:
.L38:
vmovdqu ymm7, YMMWORD PTR [rsi+rcx]
vpsllq ymm1, ymm7, xmm4
vmovdqu ymm7, YMMWORD PTR [rsi+8+rcx]
vpsrlq ymm0, ymm7, xmm6
vpor ymm1, ymm1, ymm0
vmovdqu YMMWORD PTR [rax+rcx], ymm1
vmovdqu ymm7, YMMWORD PTR [rdx+rcx]
vpsllq ymm0, ymm7, xmm3
vmovdqu ymm7, YMMWORD PTR [rdx+8+rcx]
vpsrlq ymm2, ymm7, xmm5
vpor ymm0, ymm0, ymm2
vpor ymm0, ymm0, ymm1
vmovdqu YMMWORD PTR [rax+rcx], ymm0
add rcx, 32
cmp r10, rcx
jne .L38
Combining multiple sources in one loop will reduce the total memory bandwidth spent on loading/writing the destination. How many sources you can combine is of course limited by the number of available registers. Note that xmm2 and xmm3 for shiftvec contain the shift values, so having different versions for compile-time-known shift values may free those registers.
Additionally using __restrict (supported by GCC,CLANG,MSVC) for each of the pointers will tell the compiler that the ranges are not overlapping.
I initially had problems getting MSVC to produce properly auto-vectorized code, but it seems adding a more SIMD-like structure makes it work for all three desired compilers, GCC, Clang and MSVC:
void shiftvec(uint64_t* __restrict dst, const uint64_t* __restrict src, int size, int shift)
{
    int i = 0;
    // MSVC: use steps of 2 for SSE, 4 for AVX2, 8 for AVX512
    for (; i+4 < size; i+=4,dst+=4,src+=4)
    {
        for (int j = 0; j < 4; ++j)
            *(dst+j) = (*(src+j))<<shift;
        for (int j = 0; j < 4; ++j)
            *(dst+j) |= (*(src+j+1)>>(64-shift));
    }
    for (; i < size; ++i,++src,++dst)
    {
        *dst = ((*src)<<shift) | (*(src+1)>>(64-shift));
    }
}
I would attempt to rely on the x64 ability to read from unaligned addresses, which comes with almost no visible penalty when the stars are properly (un)aligned. One would then only need to handle a few cases of (shift % 8) or (shift % 16) -- all doable with the SSE2 instruction set -- fixing the remainder with zeros, using an unaligned offset into the data vector, and addressing the UB with memcpy.
That said, the inner loop would look like:
uint16_t const *ptr;
auto a = _mm_loadu_si128((__m128i*)ptr);
auto b = _mm_loadu_si128((__m128i*)(ptr - 1));
a = _mm_srl_epi16(a, c);      // c  = per-element shift count, prepared outside the loop
b = _mm_sll_epi16(b, c2);     // c2 = 16 - c, also prepared outside the loop
_mm_storeu_si128((__m128i*)ptr, _mm_or_si128(a, b));
ptr += 8;
Unrolling this loop a few times, one might be able to use _mm_alignr_epi8 on SSE3+ to relax memory bandwidth (and those pipeline stages that need to combine results from unaligned memory accesses):
auto a0 = w;
auto a1 = _mm_load_si128(m128ptr + 1);
auto a2 = _mm_load_si128(m128ptr + 2);
auto a3 = _mm_load_si128(m128ptr + 3);
auto a4 = _mm_load_si128(m128ptr + 4);
auto b0 = _mm_alignr_epi8(a1, a0, 2);
auto b1 = _mm_alignr_epi8(a2, a1, 2);
auto b2 = _mm_alignr_epi8(a3, a2, 2);
auto b3 = _mm_alignr_epi8(a4, a3, 2);
// ... do the computation as above ...
w = a4; // rotate the context
In other words I'm looking for all the suggestions about how to solve my task most efficiently (most performantly) on different Intel CPUs.
The key to efficiency is to be lazy. The key to being lazy is to lie - pretend you shifted without actually doing any shifting.
For an initial example (to illustrate the concept only), consider:
struct Thingy {
    int ignored_bits;
    uint64_t data[];
};

void shift_right(struct Thingy * thing, int count) {
    thing->ignored_bits += count;
}

void shift_left(struct Thingy * thing, int count) {
    thing->ignored_bits -= count;
}

int get_bit(struct Thingy * thing, int bit_number) {
    bit_number += thing->ignored_bits;
    return !!(thing->data[bit_number / 64] & (1ULL << (bit_number % 64)));
}
For practical code you'll need to take care of various details: you'll probably want to start with spare bits at the start of the array (and a non-zero ignored_bits) so that you can pretend to shift right; for each small shift you'll probably want to clear the "shifted in" bits (otherwise it'll behave like floating point, e.g. ((5.0 << 8) >> 8) == 5.0); if/when ignored_bits goes outside a certain range you'll probably want a large memcpy(); etc.
For more fun; abuse low level memory management - use VirtualAlloc() (Windows) or mmap() (Linux) to reserve a huge space, then put your array in the middle of the space, then allocate/free pages at the start/end of array as needed; so that you only need to memcpy() after the original bits have been "shifted" many billions of bits to the left/right.
Of course the consequence is that it's going to complicate other parts of your code - e.g. to OR 2 bitfields together you'll have to do a tricky "fetch A; shift A to match B; result = A OR B" adjustment. This isn't a deal breaker for performance.
#include <cstdint>
#include <immintrin.h>
template<unsigned Shift>
void foo(uint64_t* __restrict pDst, const uint64_t* __restrict pSrc, intptr_t size)
{
    const uint64_t *pSrc0, *pSrc1, *pSrc2, *pSrc3;
    uint64_t *pDst0, *pDst1, *pDst2, *pDst3;
    __m256i prev, current;
    intptr_t i, stride;

    stride = size >> 2;
    i = stride;

    pSrc0 = pSrc;
    pSrc1 = pSrc + stride;
    pSrc2 = pSrc + 2 * stride;
    pSrc3 = pSrc + 3 * stride;
    pDst0 = pDst;
    pDst1 = pDst + stride;
    pDst2 = pDst + 2 * stride;
    pDst3 = pDst + 3 * stride;

    prev = _mm256_set_epi64x(0, pSrc1[-1], pSrc2[-1], pSrc3[-1]);

    while (i--)
    {
        current = _mm256_set_epi64x(*pSrc0++, *pSrc1++, *pSrc2++, *pSrc3++);
        prev = _mm256_srli_epi64(prev, 64 - Shift);
        prev = _mm256_or_si256(prev, _mm256_slli_epi64(current, Shift));
        *pDst0++ = _mm256_extract_epi64(prev, 3);
        *pDst1++ = _mm256_extract_epi64(prev, 2);
        *pDst2++ = _mm256_extract_epi64(prev, 1);
        *pDst3++ = _mm256_extract_epi64(prev, 0);
        prev = current;
    }
}
You can do the operation on up to four 64bit elements at once on AVX2 (up to eight on AVX512)
If size isn't a multiple of four, there will be up to 3 remaining ones to deal with.
PS: Auto vectorization is never a proper solution.
No, you can't.
Both NEON and AVX(-512) support barrel-shift operations only on elements of up to 64 bits.
You can however "shift" the whole 128-bit vector by n bytes (multiples of 8 bits) with the instruction ext on NEON and alignr on AVX.
And you should avoid std::vector in performance-critical code; its extra indirection and allocation overhead are bad for performance.
Is there any way to convert the following code:
int mask16 = 0b1010101010101010; // int or short, signed or unsigned, it does not matter
to
__uint128_t mask128 = ((__uint128_t)0x0100010001000100 << 64) | 0x0100010001000100;
So to be extra clear something like:
int mask16 = 0b1010101010101010;
__uint128_t mask128 = intrinsic_bits_to_bytes(mask16);
or by applying directly the mask:
int mask16 = 0b1010101010101010;
__uint128_t v = ((__uint128_t)0x2828282828282828 << 64) | 0x2828282828282828;
__uint128_t w = intrinsic_bits_to_bytes_mask(v, mask16); // w = ((__uint128_t)0x2928292829282928 << 64) | 0x2928292829282928;
Bit/byte order: Unless noted, these follow the question, putting the LSB of the uint16_t in the least significant byte of the __uint128_t (lowest memory address on little-endian x86). This is what you want for an ASCII dump of a bitmap for example, but it's opposite of place-value printing order for the base-2 representation of a single 16-bit number.
The discussion of efficiently getting values (back) into RDX:RAX integer registers has no relevance for most normal use-cases since you'd just store to memory from vector registers, whether that's 0/1 byte integers or ASCII '0'/'1' digits (which you can get most efficiently without ever having 0/1 integers in a __m128i, let alone in an unsigned __int128).
Table of contents:
SSE2 / SSSE3 version: good if you want the result in a vector, e.g. for storing a char array.
(SSE2 NASM version, shuffling into MSB-first printing order and converting to ASCII.)
BMI2 pdep: good for scalar unsigned __int128 on Intel CPUs with BMI2, if you're going to make use of the result in scalar registers. Slow on AMD.
Pure C++ with a multiply bithack: pretty reasonable for scalar
AVX-512: AVX-512 has masking as a first-class operation using scalar bitmaps. Possibly not as good as BMI2 pdep if you're using the result as scalar halves, otherwise even better than SSSE3.
AVX2 printing order (MSB at lowest address) dump of a 32-bit integer.
See also is there an inverse instruction to the movemask instruction in intel avx2? for other variations on element size and mask width. (SSE2 and multiply bithack were adapted from answers linked from that collection.)
With SSE2 (preferably SSSE3)
See @aqrit's How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD answer.
Adapting that to work with 16 bits -> 16 bytes, we need a shuffle that replicates the first byte of the mask to the first 8 bytes of the vector, and the 2nd mask byte to the high 8 vector bytes. That's doable with one SSSE3 pshufb, or with punpcklbw same,same + punpcklwd same,same + punpckldq same,same to finally duplicate things up to two 64-bit qwords.
typedef unsigned __int128 u128;
u128 mask_to_u128_SSSE3(unsigned bitmap)
{
const __m128i shuffle = _mm_setr_epi32(0,0, 0x01010101, 0x01010101);
__m128i v = _mm_shuffle_epi8(_mm_cvtsi32_si128(bitmap), shuffle); // SSSE3 pshufb
const __m128i bitselect = _mm_setr_epi8(
1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7,
1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7 );
v = _mm_and_si128(v, bitselect);
v = _mm_min_epu8(v, _mm_set1_epi8(1)); // non-zero -> 1 : 0 -> 0
// return v; // if you want a SIMD vector result
alignas(16) u128 tmp;
_mm_store_si128((__m128i*)&tmp, v);
return tmp; // optimizes to movq / pextrq (with SSE4)
}
(To get 0 / 0xFF instead of 0 / 1, replace _mm_min_epu8 with v= _mm_cmpeq_epi8(v, bitselect). If you want a string of ASCII '0' / '1' characters, do cmpeq and _mm_sub_epi8(_mm_set1_epi8('0'), v). That avoids the set1(1) vector constant.)
Godbolt including test-cases. (For this and other non-AVX-512 versions.)
# clang -O3 for Skylake
mask_to_u128_SSSE3(unsigned int):
vmovd xmm0, edi # _mm_cvtsi32_si128
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI2_0] # xmm0 = xmm0[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI2_1] # 1<<0, 1<<1, etc.
vpminub xmm0, xmm0, xmmword ptr [rip + .LCPI2_2] # set1_epi8(1)
# done here if you return __m128i v or store the u128 to memory
vmovq rax, xmm0
vpextrq rdx, xmm0, 1
ret
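If SSSE3 isn't available, the byte duplication can be done with the SSE2 unpack sequence mentioned above instead of pshufb. A hedged sketch (the helper name is mine; the AND / min steps stay the same as in the SSSE3 version):

#include <emmintrin.h>

static __m128i dup_mask_bytes_sse2(unsigned bitmap)
{
    __m128i v = _mm_cvtsi32_si128(bitmap);
    v = _mm_unpacklo_epi8(v, v);   // punpcklbw: b0 b0 b1 b1 ...
    v = _mm_unpacklo_epi16(v, v);  // punpcklwd: b0 b0 b0 b0 b1 b1 b1 b1 ...
    v = _mm_unpacklo_epi32(v, v);  // punpckldq: eight copies of b0, then eight of b1
    return v;                      // then AND with bitselect and _mm_min_epu8 as above
}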
BMI2 pdep: good on Intel, bad on AMD
BMI2 pdep is fast on Intel CPUs that have it (since Haswell), but very slow on AMD (over a dozen uops, high latency.)
typedef unsigned __int128 u128;
inline u128 assemble_halves(uint64_t lo, uint64_t hi) {
return ((u128)hi << 64) | lo; }
// could replace this with __m128i using _mm_set_epi64x(hi, lo) to see how that compiles
#ifdef __BMI2__
#include <immintrin.h>
auto mask_to_u128_bmi2(unsigned bitmap) {
// fast on Intel, slow on AMD
uint64_t tobytes = 0x0101010101010101ULL;
uint64_t lo = _pdep_u64(bitmap, tobytes);
uint64_t hi = _pdep_u64(bitmap>>8, tobytes);
return assemble_halves(lo, hi);
}
#endif  // __BMI2__
Good if you want the result in scalar registers (not one vector) otherwise probably prefer the SSSE3 way.
# clang -O3
mask_to_u128_bmi2(unsigned int):
movabs rcx, 72340172838076673 # 0x0101010101010101
pdep rax, rdi, rcx
shr edi, 8
pdep rdx, rdi, rcx
ret
# returns in RDX:RAX
Portable C++ with a magic multiply bithack
Not bad on x86-64; AMD since Zen has fast 64-bit multiply, and Intel's had that since Nehalem. Some low-power CPUs still have slowish imul r64, r64
This version may be optimal for __uint128_t results, at least for latency on Intel without BMI2, and on AMD, since it avoids a round-trip to XMM registers. But for throughput it's quite a few instructions
See @phuclv's answer on How to create a byte out of 8 bool values (and vice versa)? for an explanation of the multiply, and for the reverse direction. Use the algorithm from unpack8bools once for each 8-bit half of your mask.
//#include <endian.h> // glibc / BSD
auto mask_to_u128_magic_mul(uint32_t bitmap) {
//uint64_t MAGIC = htobe64(0x0102040810204080ULL); // For MSB-first printing order in a char array after memcpy. 0x8040201008040201ULL on little-endian.
uint64_t MAGIC = 0x0102040810204080ULL; // LSB -> LSB of the u128, regardless of memory order
uint64_t MASK = 0x0101010101010101ULL;
uint64_t lo = ((MAGIC*(uint8_t)bitmap) ) >> 7;
uint64_t hi = ((MAGIC*(bitmap>>8)) ) >> 7;
return assemble_halves(lo & MASK, hi & MASK);
}
If you're going to store the __uint128_t to memory with memcpy, you might want to control for host endianness by using htole64(0x0102040810204080ULL); (from GNU / BSD <endian.h>) or equivalent to always map the low bit of input to the lowest byte of output, i.e. to the first element of a char or bool array. Or htobe64 for the other order, e.g. for printing. Using that function on a constant instead of the variable data allows constant-propagation at compile time.
Otherwise, if you truly want a 128-bit integer whose low bit matches the low bit of the u16 input, the multiplier constant is independent of host endianness; there's no byte access to wider types.
clang 12.0 -O3 for x86-64:
mask_to_u128_magic_mul(unsigned int):
movzx eax, dil
movabs rdx, 72624976668147840 # 0x0102040810204080
imul rax, rdx
shr rax, 7
shr edi, 8
imul rdx, rdi
shr rdx, 7
movabs rcx, 72340172838076673 # 0x0101010101010101
and rax, rcx
and rdx, rcx
ret
AVX-512
This is easy with AVX-512BW; you can use the mask for a zero-masked load from a repeated 0x01 constant.
__m128i bits_to_bytes_avx512bw(unsigned mask16) {
return _mm_maskz_mov_epi8(mask16, _mm_set1_epi8(1));
// alignas(16) unsigned __int128 tmp;
// _mm_store_si128((__m128i*)&u128, v); // should optimize into vmovq / vpextrq
// return tmp;
}
Or avoid a memory constant (because compilers can do set1(-1) with just a vpcmpeqd xmm0,xmm0): Do a zero-masked absolute-value of -1. The constant setup can be hoisted, same as with set1(1).
__m128i bits_to_bytes_avx512bw_noconst(unsigned mask16) {
__m128i ones = _mm_set1_epi8(-1); // extra instruction *off* the critical path
return _mm_maskz_abs_epi8(mask16, ones);
}
But note that if doing further vector stuff, the result of maskz_mov might be able to optimize into other operations. For example vec += maskz_mov could optimize into a merge-masked add. But if not, vmovdqu8 xmm{k}{z}, xmm needs an ALU port like vpabsb xmm{k}{z}, xmm, but vpabsb can't run on port 5 on Skylake/Ice Lake. (A zero-masked vpsubb from a zeroed register would avoid that possible throughput problem, but then you'd be setting up 2 registers just to avoid loading a constant. In hand-written asm, you'd just materialize set1(1) using vpcmpeqd / vpabsb yourself if you wanted to avoid a 4-byte broadcast-load of a constant.)
(Godbolt compiler explorer with gcc and clang -O3 -march=skylake-avx512. Clang sees through the masked vpabsb and compiles it the same as the first version, with a memory constant.)
Even better if you can use a vector 0 / -1 instead of 0 / 1: use return _mm_movm_epi8(mask16). Compiles to just kmovd k0, edi / vpmovm2b xmm0, k0
If you want a vector of ASCII characters like '0' or '1', you could use _mm_mask_blend_epi8(mask, ones, zeroes). (That should be more efficient than a merge-masked add into a vector of set1(1) which would require an extra register copy, and also better than sub between set1('0') and _mm_movm_epi8(mask16) which would require 2 instructions: one to turn the mask into a vector, and a separate vpsubb.)
AVX2 with bits in printing order (MSB at lowest address), bytes in mem order, as ASCII '0' / '1'
With [] delimiters and \t tabs like this output format, from this codereview Q&A:
[01000000] [01000010] [00001111] [00000000]
Obviously if you want all 16 or 32 ASCII digits contiguous, that's easier and doesn't require shuffling the output to store each 8-byte chunk separately. Most of the reason for posting this here is that it has the shuffle and mask constants in the right order for printing, and to show a version optimized for ASCII output after it turned out that's what the question really wanted.
Using How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?, basically a 256-bit version of the SSSE3 code.
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>
#include <string.h>
// https://stackoverflow.com/questions/21622212/how-to-perform-the-inverse-of-mm256-movemask-epi8-vpmovmskb
void binary_dump_4B_avx2(const void *input)
{
char buf[CHAR_BIT*4 + 2*4 + 3 + 1 + 1]; // bits, 4x [], 3x \t, \n, 0
buf[0] = '[';
for (int i=9 ; i<sizeof(buf) - 8; i+=11){ // GCC strangely doesn't unroll this loop
memcpy(&buf[i], "]\t[", 4); // 4-byte store as a single; we overlap the 0 later
}
__m256i v = _mm256_castps_si256(_mm256_broadcast_ss(input)); // aliasing-safe load; use _mm256_set1_epi32 if you know you have an int
const __m256i shuffle = _mm256_setr_epi64x(0x0000000000000000, // low byte first, bytes in little-endian memory order
0x0101010101010101, 0x0202020202020202, 0x0303030303030303);
v = _mm256_shuffle_epi8(v, shuffle);
// __m256i bit_mask = _mm256_set1_epi64x(0x8040201008040201); // low bits to low bytes
__m256i bit_mask = _mm256_set1_epi64x(0x0102040810204080); // MSB to lowest byte; printing order
v = _mm256_and_si256(v, bit_mask); // x & mask == mask
// v = _mm256_cmpeq_epi8(v, _mm256_setzero_si256()); // -1 / 0 bytes
// v = _mm256_add_epi8(v, _mm256_set1_epi8('1')); // '0' / '1' bytes
v = _mm256_cmpeq_epi8(v, bit_mask); // 0 / -1 bytes
v = _mm256_sub_epi8(_mm256_set1_epi8('0'), v); // '0' / '1' bytes
__m128i lo = _mm256_castsi256_si128(v);
_mm_storeu_si64(buf+1, lo);
_mm_storeh_pi((__m64*)&buf[1+8+3], _mm_castsi128_ps(lo));
// TODO?: shuffle first and last bytes into the high lane initially to allow 16-byte vextracti128 stores, with later stores overlapping to replace garbage.
__m128i hi = _mm256_extracti128_si256(v, 1);
_mm_storeu_si64(buf+1+11*2, hi);
_mm_storeh_pi((__m64*)&buf[1+11*3], _mm_castsi128_ps(hi));
// buf[32 + 2*4 + 3] = '\n';
// buf[32 + 2*4 + 3 + 1] = '\0';
// fputs
memcpy(&buf[32 + 2*4 + 2], "]", 2); // including '\0'
puts(buf); // appends a newline
// appending our own newline and using fputs or fwrite is probably more efficient.
}
void binary_dump(const void *input, size_t bytecount) {
}
// not shown: portable version, see Godbolt, or my or @chux's answer on the codereview question
int main(void)
{
int t = 1000000;
binary_dump_4B_avx2(&t);
binary_dump(&t, sizeof(t));
t++;
binary_dump_4B_avx2(&t);
binary_dump(&t, sizeof(t));
}
Runnable Godbolt demo with gcc -O3 -march=haswell.
Note that GCC10.3 and earlier are dumb and duplicate the AND/CMPEQ vector constant, once as bytes and once as qwords. (In that case, comparing against zero would be better, or using OR with an inverted mask and comparing against all-ones). GCC11.1 fixes that with a .set .LC1,.LC2, but still loads it twice, as memory operands instead of loading once into a register. Clang doesn't have either of these problems.
Fun fact: clang -march=icelake-client manages to turn the 2nd part of this into an AVX-512 masked blend between '0' and '1' vectors, but instead of just kmov it uses a broadcast-load, vpermb byte shuffle, then test-into-mask with the bitmask.
For each bit in the mask, you want to move a bit at position n to the low-order bit of the byte at position n, i.e. bit position 8 * n. You can do this with a loop:
__uint128_t intrinsic_bits_to_bytes(uint16_t mask)
{
int i;
__uint128_t result = 0;
for (i=0; i<16; i++) {
result |= (__uint128_t )((mask >> i) & 1) << (8 * i);
}
return result;
}
If you can use AVX512, you can do it in one instruction, no loop:
#include <immintrin.h>
__m128i intrinsic_bits_to_bytes(uint16_t mask16) {
    const __m128i zeroes = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi8(1);
    // blend picks from the third operand where the mask bit is set -> 1 bytes there
    return _mm_mask_blend_epi8(mask16, zeroes, ones);
}
For building with gcc, I use:
g++ -std=c++11 -march=native -O3 src.cpp -pthread
This will build OK, but if your processor doesn't support AVX512, it will throw an illegal instruction at run time.
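One way to fail gracefully instead is a runtime dispatch. A hedged sketch using GCC/Clang's target attribute and __builtin_cpu_supports (the function names are illustrative, and the AVX-512 path uses the maskz-move form shown earlier in this thread):

#include <immintrin.h>
#include <cstdint>

__attribute__((target("avx512bw,avx512vl")))
static __uint128_t bits_to_bytes_avx512(uint16_t mask)
{
    alignas(16) __uint128_t tmp;
    __m128i v = _mm_maskz_mov_epi8(mask, _mm_set1_epi8(1)); // 1 where the bit is set
    _mm_store_si128((__m128i*)&tmp, v);
    return tmp;
}

static __uint128_t bits_to_bytes_portable(uint16_t mask)
{
    __uint128_t result = 0;
    for (int i = 0; i < 16; i++)
        result |= (__uint128_t)((mask >> i) & 1) << (8 * i);
    return result;
}

__uint128_t bits_to_bytes_dispatch(uint16_t mask)
{
    if (__builtin_cpu_supports("avx512bw") && __builtin_cpu_supports("avx512vl"))
        return bits_to_bytes_avx512(mask);
    return bits_to_bytes_portable(mask); // fallback: the loop version shown above
}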
I have following code, after performing a sobel operation:
short* tempBufferVert = new short[width * height];
ippiFilterSobelVertBorder_8u16s_C1R(pImg, width, tempBufferVert, width * 2, dstSize, IppiMaskSize::ippMskSize3x3, IppiBorderType::ippBorderConst, 0, pBufferVert);
for (int i = 0; i < width * height; i++)
tempBufferVert[i] >>= 2;
The frustrating thing is that the bit shift is the longest-running operation of them all; the IPP Sobel is so optimized that it runs faster than my simple bit shift. How can I optimize the bit shift, or are there IPP or other options (AVX?) to perform a bit shift on the whole buffer while preserving the sign of the short (which >>= does in the Visual Studio implementation)?
C++ optimisers perform a lot better with iterator-based loops than with indexing loops.
This is because the compiler can make assumptions about how address arithmetic works at index overflow. To get the same assumptions when using an index into an array, you have to happen to pick the correct datatype for the index.
The shift code can be expressed as:
void shift(short* first, short* last, int bits)
{
    while (first != last) {
        *first++ >>= bits;
    }
}

void test(int width, int height)
{
    short* tempBufferVert = new short[width * height];
    shift(tempBufferVert, tempBufferVert + (width * height), 2);
}
Which will (with correct optimisations enabled) be vectorised: https://godbolt.org/g/oJ8Boj
note how the middle of the loop becomes:
.L76:
vmovdqa ymm0, YMMWORD PTR [r9+rdx]
add r8, 1
vpsraw ymm0, ymm0, 2
vmovdqa YMMWORD PTR [r9+rdx], ymm0
add rdx, 32
cmp rsi, r8
ja .L76
lea rax, [rax+rdi*2]
cmp rcx, rdi
je .L127
vzeroupper
Firstly make sure you are compiling with optimisation enabled (e.g. -O3), and then check whether your compiler is auto-vectorizing the right shift loop. If it's not then you can probably get a significant improvement with SSE:
#include <emmintrin.h> // SSE2
for (int i = 0; i < width * height; i += 8)
{
__m128i v = _mm_loadu_si128((__m128i *)&tempBufferVert[i]);
v = _mm_srai_epi16(v, 2); // v >>= 2
_mm_storeu_si128((__m128i *)&tempBufferVert[i], v);
}
(Note: assumes width*height is a multiple of 8.)
You can probably do even better with some loop unrolling and/or using AVX2, but this may be enough for your needs as it stands.
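An AVX2 variant of the same loop could look like this (a hedged sketch, assuming width*height is a multiple of 16 and AVX2 is available):

#include <immintrin.h> // AVX2

for (int i = 0; i < width * height; i += 16)
{
    __m256i v = _mm256_loadu_si256((__m256i *)&tempBufferVert[i]);
    v = _mm256_srai_epi16(v, 2); // arithmetic shift keeps the sign
    _mm256_storeu_si256((__m256i *)&tempBufferVert[i], v);
}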
I have two functions for 2D array multiplication. One of them uses SSE; the other has no optimization at all. Both functions work well, but the results are slightly different, for example 20.333334 vs 20.333332.
Can you explain why the results are different? And what can I do to the functions to get the same result?
function with SSE
float** sse_multiplication(float** array1, float** array2, float** arraycheck)
{
int i, j, k;
float *ms1, *ms2, result;
float *end_loop;
for( i = 0; i < rows1; i++)
{
for( j = 0; j < columns2; j++)
{
result = 0;
ms1 = array1[i];
ms2 = array2[j];
end_loop = &array1[i][columns1];
__asm{
mov rax, ms1
mov rbx, ms2
mov rdx, end_loop
xorps xmm2, xmm2
loop:
movups xmm0, [rax]
movups xmm1, [rbx]
movups xmm3, [rax+16]
movups xmm4, [rbx+16]
mulps xmm0, xmm1
mulps xmm3, xmm4
addps xmm2, xmm0
add rax, 32
add rbx, 32
cmp rdx, rax
jne loop
haddps xmm2, xmm2
haddps xmm2, xmm2
movups result, xmm2
}
arraycheck[i][j] = result;
}
}
return arraycheck;
}
function without any optimization
float** multiplication(float** array1, float** array2, float** arraycheck)
{
for (int i = 0; i < rows1; i++)
for (int j = 0; j < columns2; j++)
for (int k = 0; k < rows1; k++)
arraycheck[i][j] += array1[i][k] * array2[k][j];
return arraycheck;
}
FP addition is not perfectly associative, so a different order of operations produces slightly different rounding errors.
Your C sums elements in order. (Unless you use -ffast-math to allow the compiler to make the same assumption you did, that FP operations are close enough to associative).
Your asm sums up every 4th element at 4 different offsets, then horizontally sums those. The sum in each vector element is rounded differently at each point.
Your vectorized version doesn't seem to match the C version. The indexing looks different. AFAICT, the only sane way to vectorize arraycheck[i][j] += array1[i][k] * array2[k][j]; is over j. Looping over k would require strided loads from array2, and looping over i would require strided loads from array1.
Am I missing something about your asm? It's loading contiguous values from both arrays. It's also throwing away the mulps result in xmm3 every iteration of loop, so I think it's just buggy.
Since looping over j in the inner vector loop doesn't change array1[i][k], just broadcast-load it once outside the loop (_mm256_set1_ps).
However, that means doing a read-modify-write of arraycheck[i][j] for every different j value. i.e. ac[i][j + 0..3] = fma(a1[i][k], a2[k][j + 0..3], ac[i][j + 0..3]). To avoid this, you'd have to transpose one of the arrays first. (But that's O(N^2) for an NxN matrix, which is still cheaper than multiply).
This way doesn't use a horizontal sums, but see that link if you want better code for that.
It does operations in the same order as the scalar C, so results should match exactly.
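A hedged sketch of that j-vectorized inner loop (not the question's code; the dimensions are passed explicitly here instead of using the question's globals, and columns2 is assumed to be a multiple of 4):

#include <xmmintrin.h>

void matmul_vec_j(float** array1, float** array2, float** arraycheck,
                  int rows1, int columns1, int columns2)
{
    for (int i = 0; i < rows1; i++) {
        for (int k = 0; k < columns1; k++) {
            __m128 a1 = _mm_set1_ps(array1[i][k]);      // broadcast once per k
            for (int j = 0; j < columns2; j += 4) {
                __m128 acc = _mm_loadu_ps(&arraycheck[i][j]);
                __m128 a2  = _mm_loadu_ps(&array2[k][j]);
                acc = _mm_add_ps(acc, _mm_mul_ps(a1, a2)); // same per-element add order as the scalar k loop
                _mm_storeu_ps(&arraycheck[i][j], acc);
            }
        }
    }
}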
Also note that you need to use more than one accumulator to saturate the execution units of a CPU. I'd suggest 8, to saturate Skylake's 4c latency, one per 0.5c throughput addps. Haswell has 3c latency, one per 1c addps, but Skylake dropped the separate FP add unit and does it in the FMA unit. (See the x86 tag wiki, esp. Agner Fog's guides)
Actually, since my suggested change doesn't use a single accumulator at all, every loop iteration accesses independent memory. You'll need a bit of loop unrolling to saturate the FP execution units with two loads and store in the loop (even though you only need two pointers, since the store is back into the same location as one of the loads). But anyway, if your data fits in L1 cache, out-of-order execution should keep the execution units pretty well supplied with work from separate iterations.
If you really care about performance, you'll make an FMA version, and maybe an AVX-without-FMA version for Sandybridge. You could be doing two 256b FMAs per clock instead of one 128b add and mul per clock. (And of course you're not even getting that, because you bottleneck on latency unless the loop is short enough for the out-of-order window to see independent instructions from the next iteration).
You're going to need "loop tiling", aka "cache blocking" to make this not suck for big problem sizes. This is a matrix multiply, right? There are very good libraries for this, which are tuned for cache sizes, and will beat the pants off a simple attempt like this. e.g. ATLAS was good last time I checked, but that was several years ago.
Use intrinsics, unless you write the whole function in asm. Compilers "understand" what they do, so can make nice optimizations, like loop unrolling when appropriate.
According to the IEEE 754 standard formats, a 32-bit float can only guarantee 6-7 significant decimal digits of accuracy. Your error is so marginal that no plausible claim can be made about the compiler's mechanism. If you want better precision, it would be wise to either choose 64-bit double (which guarantees about 15 digits) or implement your own BigDecimal class, as Java does.