Copy bits of uint64_t into two uint64_t at specific location - c++

I have an input uint64_t X and a count N of its least significant bits that I want to write into the target uint64_t values Y and Z, starting at bit index M in Z. Unaffected parts of Y and Z should not be changed. How can I implement this efficiently in C++ for the latest Intel CPUs?
It should be efficient for execution in loops. I guess that it requires to have no branching: the number of used instructions is expected to be constant and as small as possible.
M and N are not fixed at compile time. M can take any value from 0 to 63 (target offset in Z), N is in the range from 0 to 64 (number of bits to copy).

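There's at least a four-instruction sequence available on reasonably modern Intel architecture processors (those with BMI2).
X &= (1ull << N) - 1; // mask off the upper bits, keeping the N low bits (valid for N < 64; bzhi itself also handles N = 64)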
// bzhi rax, rdi, rdx
Z = X << M;
// shlx rax, rax, rsi
Y = X >> (64 - M);
// neg sil
// shrx rax, rax, rsi
The value M=0 causes a bit of pain, as Y would need to be zero in that case and also the expression X >> (64-M) would need sanitation.
One possibility to overcome this is
x = bzhi(x, n);
y = rol(x,m);
y = bzhi(y, m); // y &= ~(~0ull << m);
z = shlx(x, m); // z = x << m;
As OP actually wants to update the bits, one obvious solution would be to replicate the logic for masks:
xm = bzhi(~0ull, n);
ym = rol(xm, m);
ym = bzhi(ym, m);
zm = shlx(xm, m);
However, clang seems to produce something like 24 instructions total with the masks applied:
Y = (Y & ~xm) | y; // |,+,^ all possible
Z = (Z & ~zm) | z;
It is likely then better to change the approach:
x2 = x << (64-N); // align x to the left
y2 = y >> y_shift; // align y to right
y = shld(y2,x2, y_shift); // y fixed
Here y_shift = max(0, M+N-64)
Fixing Z is slightly more involved, as Z can be combined of three parts:
zzzzz.....zzzzXXXXXXXzzzzzz, where m=6, n=7
That should be doable with two double shifts as above.
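For checking any of the BMI2 variants above, a plain C++ reference can be handy. The following is only a sketch, not the instruction sequence from this answer: it assumes Z holds the low 64 bits and Y receives whatever spills past bit 63, and the names low_mask and insert_bits are made up for illustration. The double shift (x >> 1) >> (63 - m) is one way to avoid the undefined 64-bit shift when M = 0.

```cpp
#include <cstdint>

// Mask with the n low bits set, valid for n in [0, 64].
// A compiler targeting BMI2 may turn this into bzhi.
static inline uint64_t low_mask(unsigned n) {
    return n >= 64 ? ~0ull : (1ull << n) - 1;
}

// Write the n low bits of x into the pair (y:z), starting at bit m of z.
// Bits that do not fit into z spill into the low bits of y.
// Assumes m in [0, 63] and n in [0, 64]; all other bits are preserved.
static inline void insert_bits(uint64_t x, unsigned n, unsigned m,
                               uint64_t& y, uint64_t& z) {
    const uint64_t xm = low_mask(n);
    x &= xm;
    const uint64_t zm  = xm << m;                // mask of the bits landing in z
    const uint64_t ym  = (xm >> 1) >> (63 - m);  // mask of the bits landing in y (0 when m == 0)
    const uint64_t xhi = (x >> 1) >> (63 - m);   // same as x >> (64 - m), but defined for m == 0
    z = (z & ~zm) | (x << m);
    y = (y & ~ym) | xhi;
}
```

For example, writing the 8 bits 0xA5 at M = 60 puts the low nibble 0x5 into the top of Z and the high nibble 0xA into the bottom of Y.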


How to convert scalar code of the double version of VDT's Pade Exp fast_exp() approx into SSE2?

Here's the code I'm trying to convert: the double version of VDT's Pade Exp fast_exp() approx (here's the old repo resource):
inline double fast_exp(double initial_x){
double x = initial_x;
double px=details::fpfloor(details::LOG2E * x +0.5);
const int32_t n = int32_t(px);
x -= px * 6.93145751953125E-1;
x -= px * 1.42860682030941723212E-6;
const double xx = x * x;
// px = x * P(x**2).
px = details::PX1exp;
px *= xx;
px += details::PX2exp;
px *= xx;
px += details::PX3exp;
px *= x;
// Evaluate Q(x**2).
double qx = details::QX1exp;
qx *= xx;
qx += details::QX2exp;
qx *= xx;
qx += details::QX3exp;
qx *= xx;
qx += details::QX4exp;
// e**x = 1 + 2x P(x**2)/( Q(x**2) - P(x**2) )
x = px / (qx - px);
x = 1.0 + 2.0 * x;
// Build 2^n in double.
x *= details::uint642dp(( ((uint64_t)n) +1023)<<52);
if (initial_x > details::EXP_LIMIT)
x = std::numeric_limits<double>::infinity();
if (initial_x < -details::EXP_LIMIT)
x = 0.;
return x;
}
I got this:
__m128d PExpSSE_dbl(__m128d x) {
__m128d initial_x = x;
__m128d half = _mm_set1_pd(0.5);
__m128d one = _mm_set1_pd(1.0);
__m128d log2e = _mm_set1_pd(1.4426950408889634073599);
__m128d p1 = _mm_set1_pd(1.26177193074810590878E-4);
__m128d p2 = _mm_set1_pd(3.02994407707441961300E-2);
__m128d p3 = _mm_set1_pd(9.99999999999999999910E-1);
__m128d q1 = _mm_set1_pd(3.00198505138664455042E-6);
__m128d q2 = _mm_set1_pd(2.52448340349684104192E-3);
__m128d q3 = _mm_set1_pd(2.27265548208155028766E-1);
__m128d q4 = _mm_set1_pd(2.00000000000000000009E0);
__m128d px = _mm_add_pd(_mm_mul_pd(log2e, x), half);
__m128d t = _mm_cvtepi64_pd(_mm_cvttpd_epi64(px));
px = _mm_sub_pd(t, _mm_and_pd(_mm_cmplt_pd(px, t), one));
__m128i n = _mm_cvtpd_epi64(px);
x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(6.93145751953125E-1)));
x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(1.42860682030941723212E-6)));
__m128d xx = _mm_mul_pd(x, x);
px = _mm_mul_pd(xx, p1);
px = _mm_add_pd(px, p2);
px = _mm_mul_pd(px, xx);
px = _mm_add_pd(px, p3);
px = _mm_mul_pd(px, x);
__m128d qx = _mm_mul_pd(xx, q1);
qx = _mm_add_pd(qx, q2);
qx = _mm_mul_pd(xx, qx);
qx = _mm_add_pd(qx, q3);
qx = _mm_mul_pd(xx, qx);
qx = _mm_add_pd(qx, q4);
x = _mm_div_pd(px, _mm_sub_pd(qx, px));
x = _mm_add_pd(one, _mm_mul_pd(_mm_set1_pd(2.0), x));
n = _mm_add_epi64(n, _mm_set1_epi64x(1023));
n = _mm_slli_epi64(n, 52);
// return?
}
But I'm not able to finish the last lines - i.e. this code:
if (initial_x > details::EXP_LIMIT)
x = std::numeric_limits<double>::infinity();
if (initial_x < -details::EXP_LIMIT)
x = 0.;
return x;
How would you convert in SSE2?
Then of course I need to check the whole, since I'm not quite sure I've converted it correctly.
EDIT: I found the SSE conversion of float exp - i.e. from this:
/* multiply by power of 2 */
z *= details::uint322sp((n + 0x7f) << 23);
if (initial_x > details::MAXLOGF) z = std::numeric_limits<float>::infinity();
if (initial_x < details::MINLOGF) z = 0.f;
return z;
to this:
n = _mm_add_epi32(n, _mm_set1_epi32(0x7f));
n = _mm_slli_epi32(n, 23);
return _mm_mul_ps(z, _mm_castsi128_ps(n));
Yup, dividing two polynomials can often give you a better tradeoff between speed and precision than one huge polynomial. As long as there's enough work to hide the divpd throughput. (The latest x86 CPUs have pretty decent FP divide throughput. Still bad vs. multiply, but it's only 1 uop so it doesn't stall the pipeline if you use it rarely enough, i.e. mixed with lots of multiplies. Including in the surrounding code that uses exp)
However, _mm_cvtepi64_pd(_mm_cvttpd_epi64(px)); won't work with SSE2. Packed-conversion intrinsics to/from 64-bit integers require AVX512DQ.
To do packed rounding to the nearest integer, ideally you'd use SSE4.1 _mm_round_pd(x, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC), (or truncation towards zero, or floor or ceil towards -+Inf).
But we don't actually need that.
The scalar code ends up with int n and double px both representing the same numeric value. It uses the bad/buggy floor(val+0.5) idiom instead of rint(val) or nearbyint(val) to round to nearest, and then converts that already-integer double to an int (with C++'s truncation semantics, but that doesn't matter because the double value's already an exact integer.)
With SIMD intrinsics, it appears to be easiest to just convert to 32-bit integer and back.
__m128i n = _mm_cvtpd_epi32( _mm_mul_pd(log2e, x) ); // round to nearest
__m128d px = _mm_cvtepi32_pd( n );
Rounding to int with the desired mode, then converting back to double, is equivalent to double->double rounding and then grabbing an int version of that like the scalar version does. (Because you don't care what happens for doubles too large to fit in an int.)
The cvtpd2dq and cvtdq2pd instructions (the packed conversions behind these intrinsics) are 2 uops each, and cvtpd2dq packs the 32-bit integers into the low 64 bits of a vector. So to set up for 64-bit integer shifts to stuff the bits into a double again, you'll need to shuffle. The top 64 bits of n will be zeros, so we can use that to create 64-bit integer n lined up with the doubles:
n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3,1,2,0)); // 64-bit integers
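Putting the conversion, the shuffle and the exponent-stuffing together, a minimal SSE2 sketch could look like this. The function name exp2_of_rounded is made up for illustration, and it assumes the default round-to-nearest mode and a rounded n in the normal exponent range [-1022, 1023]. The shuffle zero-extends rather than sign-extends, which is harmless here because only the low 12 bits of each 64-bit element survive the shift by 52.

```cpp
#include <emmintrin.h>  // SSE2

// Returns { 2^n0, 2^n1 } where n = round-to-nearest(v), built by writing
// n + 1023 directly into the exponent field of each double.
static inline __m128d exp2_of_rounded(__m128d v) {
    __m128i n = _mm_cvtpd_epi32(v);                     // dwords: [n0, n1, 0, 0]
    n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3, 1, 2, 0));  // dwords: [n0, 0, n1, 0]
    n = _mm_add_epi64(n, _mm_set1_epi64x(1023));        // add the exponent bias
    n = _mm_slli_epi64(n, 52);                          // move into the exponent field
    return _mm_castsi128_pd(n);
}
```

Multiplying the polynomial result by this vector corresponds to the scalar x *= uint642dp((n + 1023) << 52) step.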
But with just SSE2, there are workarounds. Converting to 32-bit integer and back is one option: you don't care about inputs too small or too large. But packed-conversion between double and int costs at least 2 uops on Intel CPUs each way, so a total of 4. But only 2 of those uops need the FMA units, and your code probably doesn't bottleneck on port 5 with all those multiplies and adds.
Or add a very large number and subtract it again: large enough that each double is 1 integer apart, so normal FP rounding does what you want. (This works for inputs that won't fit in 32 bits, but not double > 2^52. So either way that would work.) Also see How to efficiently perform double/int64 conversions with SSE/AVX? which uses that trick. I couldn't find an example on SO, though.
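The add-and-subtract trick can be sketched with plain SSE2 as below. The constant 1.5 * 2^52 and the name round_nearest_sse2 are choices made for this illustration; the sketch assumes the default round-to-nearest-even mode, inputs with magnitude below 2^51, and a build without -ffast-math (which would let the compiler cancel the add against the subtract).

```cpp
#include <emmintrin.h>  // SSE2

// Round each double to the nearest integer (ties to even) without any
// conversion instruction: after adding 1.5 * 2^52 the spacing between
// representable doubles is exactly 1, so the FP add itself does the rounding.
static inline __m128d round_nearest_sse2(__m128d v) {
    const __m128d magic = _mm_set1_pd(6755399441055744.0);  // 1.5 * 2^52
    return _mm_sub_pd(_mm_add_pd(v, magic), magic);
}
```

Unlike floor(val + 0.5), this rounds halfway cases to even, so 2.5 becomes 2.0.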
Related:
Fastest Implementation of Exponential Function Using AVX and Fastest Implementation of Exponential Function Using SSE have versions with other speed / precision tradeoffs, for _ps (packed single-precision float).
Fast SSE low precision exponential using double precision operations is at the other end of the spectrum, but still for double.
How many clock cycles does cost AVX/SSE exponentiation on modern x86_64 CPU? discusses some existing libraries like SVML, and Agner Fog's VCL (GPL licensed). And glibc's libmvec.
Then of course I need to check the whole, since I'm not quite sure I've converted it correctly.
Iterating over all 2^64 double bit-patterns is impractical, unlike for float where there are only 4 billion, but maybe iterating over all doubles that have the low 32 bits of their mantissa all zero would be a good start. i.e. check in a loop with
bitpatterns = _mm_add_epi64(bitpatterns, _mm_set1_epi64x( 1ULL << 32 ));
doubles = _mm_castsi128_pd(bitpatterns);
https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/
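A scalar version of such a sweep might look like the sketch below. The helper name max_rel_error_vs_exp is hypothetical, std::exp is used as the reference, and the range is assumed to be positive and finite so that bit patterns are ordered the same way as the values. Passing step = 1ull << 32 visits the doubles whose low 32 mantissa bits are zero, as suggested above.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Largest relative error of f against std::exp over the doubles in [lo, hi],
// visiting every `step`-th bit pattern. Assumes 0 < lo <= hi, both finite.
template <class F>
double max_rel_error_vs_exp(F f, double lo, double hi, uint64_t step) {
    uint64_t bits, last;
    std::memcpy(&bits, &lo, sizeof bits);
    std::memcpy(&last, &hi, sizeof last);
    double worst = 0.0;
    for (; bits <= last; bits += step) {
        double x;
        std::memcpy(&x, &bits, sizeof x);   // reinterpret the bit pattern as a double
        const double ref = std::exp(x);
        const double err = std::fabs(f(x) - ref) / ref;
        if (err > worst) worst = err;
    }
    return worst;
}
```

A vectorized function can be tested the same way by wrapping it in a lambda that broadcasts x and extracts one lane.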
For those last few lines, correcting the input for out-of-range inputs:
The float version you quote just leaves out the range-check entirely. This is obviously the fastest way, if your inputs will always be in range or if you don't care about what happens for out-of-range inputs.
Alternate cheaper range-checking (maybe only for debugging) would be to turn out-of-range values into NaN by ORing the packed-compare result into the result. (An all-ones bit-pattern represents a NaN.)
__m128d out_of_bounds = _mm_cmplt_pd( limit, abs(initial_x) ); // abs = mask off the sign bit
result = _mm_or_pd(result, out_of_bounds);
In general, you can vectorize simple condition setting of a value using branchless compare + blend. Instead of if(x) y=0;, you have the SIMD equivalent of y = (condition) ? 0 : y;, on a per-element basis. SIMD compares produce a mask of all-zero / all-one elements so you can use it to blend.
e.g. in this case cmppd the input and blendvpd the output if you have SSE4.1. Or with just SSE2, and/andnot/or to blend. See SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation for a _ps version of both, _pd is identical.
In asm it will look like this:
; result in xmm0 (in need of fixups for out of range inputs)
; initial_x in xmm2
; constants:
; xmm5 = limit
; xmm6 = +Inf
cmpltpd xmm2, xmm5 ; xmm2 = input_x < limit ? 0xffff... : 0
andpd xmm0, xmm2 ; result = result or 0
andnpd xmm2, xmm6 ; xmm2 = 0 or +Inf (In that order because we used ANDN)
orpd xmm0, xmm2 ; result |= 0 or +Inf
; xmm0 = (input < limit) ? result : +Inf
(In an earlier version of the answer, I thought I was maybe saving a movaps to copy a register, but this is just a bog-standard blend. It destroys initial_x, so the compiler needs to copy that register at some point while calculating result, though.)
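In intrinsics, the same AND/ANDN/OR blend could be sketched as follows. The helper names select_pd and clamp_high_to_inf, and the limit parameter standing in for details::EXP_LIMIT, are assumptions for this example.

```cpp
#include <emmintrin.h>  // SSE2
#include <limits>

// Per-element (mask ? a : b). Each mask lane must be all-ones or all-zero,
// which is what the SSE compare intrinsics produce.
static inline __m128d select_pd(__m128d mask, __m128d a, __m128d b) {
    return _mm_or_pd(_mm_and_pd(mask, a), _mm_andnot_pd(mask, b));
}

// SIMD equivalent of: if (x > limit) result = +Inf;
static inline __m128d clamp_high_to_inf(__m128d result, __m128d x, double limit) {
    const __m128d too_big = _mm_cmpgt_pd(x, _mm_set1_pd(limit));
    const __m128d inf = _mm_set1_pd(std::numeric_limits<double>::infinity());
    return select_pd(too_big, inf, result);
}
```

With SSE4.1 available, select_pd(mask, a, b) could be replaced by _mm_blendv_pd(b, a, mask).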
Optimizations for this special condition
Or in this case, 0.0 is represented by an all-zero bit-pattern, so do a compare that will produce true if in-range, and AND the output with that. (To leave it unchanged or force it to +0.0). This is better than _mm_blendv_pd, which costs 2 uops on most Intel CPUs (and the AVX 128-bit version always costs 2 uops on Intel). And it's not worse on AMD or Skylake.
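A minimal sketch of that AND-only fixup for the too-small case, with flush_small_to_zero as a made-up name and limit standing in for details::EXP_LIMIT:

```cpp
#include <emmintrin.h>  // SSE2

// SIMD equivalent of: if (x < -limit) result = 0.0;
// The compare yields all-ones for in-range lanes, so the AND either keeps the
// result unchanged or forces the all-zero bit pattern, which is +0.0.
// Note: a NaN input compares false and is therefore also flushed to +0.0.
static inline __m128d flush_small_to_zero(__m128d result, __m128d x, double limit) {
    const __m128d in_range = _mm_cmpge_pd(x, _mm_set1_pd(-limit));
    return _mm_and_pd(result, in_range);
}
```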
+-Inf is represented by a bit-pattern of significand=0, exponent=all-ones. (Any other value in the significand represents +-NaN.) Since too-large inputs will presumably still leave non-zero significands, we can't just AND the compare result and OR that into the final result. I think we need to do a regular blend, or something as expensive (3 uops and a vector constant).
It adds 2 cycles of latency to the final result; both the ANDNPD and ORPD are on the critical path. The CMPPD and ANDPD aren't; they can run in parallel with whatever you do to compute the result.
Hopefully your compiler will actually use ANDPS and so on, not PD, for everything except the CMP, because it's 1 byte shorter but identical because they're both just bitwise ops. I wrote ANDPD just so I didn't have to explain this in comments.
You might be able to shorten the critical path latency by combining both fixups before applying to the result, so you only have one blend. But then I think you also need to combine the compare results.
Or since your upper and lower bounds are the same magnitude, maybe you can compare the absolute value? (mask off the sign bit of initial_x and do _mm_cmplt_pd(abs_initial_x, _mm_set1_pd(details::EXP_LIMIT))). But then you have to sort out whether to zero or set to +Inf.
If you had SSE4.1 for _mm_blendv_pd, you could use initial_x itself as the blend control for the fixup that might need applying, because blendv only cares about the sign bit of the blend control (unlike with the AND/ANDN/OR version where all bits need to match.)
__m128d fixup = _mm_blendv_pd( _mm_set1_pd(INFINITY), _mm_setzero_pd(), initial_x ); // fixup = (initial_x signbit) ? 0 : +Inf
// see below for generating fixup with an SSE2 integer arithmetic-shift
const __m128d signbit_mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7fffffffffffffff)); // ~ set1(-0.0)
__m128d abs_init_x = _mm_and_pd( initial_x, signbit_mask );
__m128d out_of_range = _mm_cmpgt_pd(abs_init_x, _mm_set1_pd(details::EXP_LIMIT));
// Conditionally apply the fixup to result
result = _mm_blendv_pd(result, fixup, out_of_range);
Possibly use cmplt instead of cmpgt and rearrange if you care what happens for initial_x being a NaN. Choosing the compare so false applies the fixup instead of true will mean that an unordered comparison results in either 0 or +Inf for an input of -NaN or +NaN. This still doesn't do NaN propagation. You could _mm_cmpunord_pd(initial_x, initial_x) and OR that into fixup, if you want to make that happen.
Especially on Skylake and AMD Bulldozer/Ryzen where SSE4.1 blendvpd is only 1 uop, this should be pretty nice. (The VEX encoding, vblendvpd, is 2 uops, having 3 inputs and a separate output.)
You might still be able to use some of this idea with only SSE2, maybe creating fixup by doing a compare against zero and then _mm_and_pd or _mm_andnot_pd with the compare result and +Infinity.
Using an integer arithmetic shift to broadcast the sign bit to every position in the double isn't efficient: psraq doesn't exist, only psraw/d. Only logical shifts come in 64-bit element size.
But you could create fixup with just one integer shift and mask, and a bitwise invert
__m128i ix = _mm_castpd_si128(initial_x);
__m128i ifixup = _mm_srai_epi32(ix, 11); // all 11 bits of exponent field = sign bit
ifixup = _mm_and_si128(ifixup, _mm_set1_epi64x(0x7FF0000000000000LL) ); // clear other bits
// ifixup = the bit pattern for 0 (non-negative x) or +Inf (negative x)
__m128d fixup = _mm_castsi128_pd(_mm_xor_si128(ifixup, _mm_set1_epi64x(0x7FF0000000000000LL))); // invert the exponent field: +Inf (non-negative x) or 0 (negative x)
Then blend fixup into result for out-of-range inputs as normal.
Cheaply checking abs(initial_x) > details::EXP_LIMIT
If the exp algorithm was already squaring initial_x, you could compare against EXP_LIMIT squared. But it's not, xx = x*x only happens after some calculation to create x.
If you have AVX512F/VL, VFIXUPIMMPD might be handy here. It's designed for functions where the special case outputs are from "special" inputs like NaN and +-Inf, negative, positive, or zero, saving a compare for those cases. (e.g. for after a Newton-Raphson reciprocal(x) for x=0.)
But both of your special cases need compares. Or do they?
If you square your input and subtract, it only costs one FMA to do initial_x * initial_x - details::EXP_LIMIT * details::EXP_LIMIT to create a result that's negative for abs(initial_x) < details::EXP_LIMIT, and non-negative otherwise.
Agner Fog reports that vfixupimmpd is only 1 uop on Skylake-X.

How can you calculate a factor if you have the other factor and the product with overflows?

a * x = b
I have a seemingly rather complicated multiplication / imul problem: if I have a and I have b, how can I calculate x if they're all 32-bit dwords (e.g. 0-1 = FFFFFFFF, FFFFFFFF+1 = 0)?
For example:
0xcb9102df * x = 0x4d243a5d
In that case, x is 0x1908c643. I found a similar question but the premises were different and I'm hoping there's a simpler solution than those given.
Numbers have a modular multiplicative inverse modulo a power of two precisely iff they are odd. Everything else is a bit-shifted odd number (even zero, which might be anything, with all bits shifted out). So there are a couple of cases:
Given a * x = b
tzcnt(a) > tzcnt(b) no solution
tzcnt(a) <= tzcnt(b) solvable, with 2^tzcnt(a) solutions
The second case has a special case with 1 solution, for odd a, namely x = inverse(a) * b
More generally, x = inverse(a >> tzcnt(a)) * (b >> tzcnt(a)) is a solution, because you write a as (a >> tzcnt(a)) * (1 << tzcnt(a)), so we cancel the left factor with its inverse, we leave the right factor as part of the result (cannot be cancelled anyway) and then multiply by what remains to get it up to b. Still only works in the second case, obviously. If you wanted, you could enumerate all solutions by filling in all possibilities for the top tzcnt(a) bits.
The only thing that remains is getting the inverse, you've probably seen it in the other answer, whatever it was, but for completeness you can compute it as follows: (not tested)
; input x
dword y = (x * x) + x - 1;
dword t = y * x;
y *= 2 - t;
t = y * x;
y *= 2 - t;
t = y * x;
y *= 2 - t;
; result y
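As a sketch of how the pieces fit together in C++: the inverse below follows the pseudocode above, while the names inverse_mod_2_32 and solve_mul_mod_2_32, and the portable loop used in place of a tzcnt instruction, are choices made for this example. It returns one solution when one exists.

```cpp
#include <cstdint>

// Multiplicative inverse of an odd a modulo 2^32.
static inline uint32_t inverse_mod_2_32(uint32_t a) {
    uint32_t y = a * a + a - 1;  // correct in the low 4 bits
    y *= 2 - y * a;              // each step doubles the correct bits: 8
    y *= 2 - y * a;              // 16
    y *= 2 - y * a;              // 32
    return y;
}

// Finds one x with a * x == b (mod 2^32); returns false if none exists.
static inline bool solve_mul_mod_2_32(uint32_t a, uint32_t b, uint32_t& x) {
    if (a == 0) { x = 0; return b == 0; }
    unsigned k = 0;
    while (((a >> k) & 1u) == 0) ++k;               // k = tzcnt(a), at most 31 here
    if ((b & ((1u << k) - 1u)) != 0) return false;  // tzcnt(a) > tzcnt(b): no solution
    x = inverse_mod_2_32(a >> k) * (b >> k);
    return true;
}
```

For the example from the question, a = 0xcb9102df is odd, so the solution is unique and the function yields x = 0x1908c643.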

How can I add together two SSE registers

I have two SSE registers (128 bits is one register) and I want to add them up. I know how I can add corresponding words in them, for example I can do it with _mm_add_epi16 if I use 16bit words in registers, but what I want is something like _mm_add_epi128 (which does not exist), which would use register as one big word.
Is there any way to perform this operation, even if multiple instructions are needed?
I was thinking about using _mm_add_epi64, detecting overflow in the right word and then adding 1 to the left word in register if needed, but I would also like this approach to work for 256bit registers (AVX2), and this approach seems too complicated for that.
To add two 128-bit numbers x and y to give z with SSE you can do it like this
z = _mm_add_epi64(x,y);
c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
z = _mm_sub_epi64(z,c);
This is based on this link how-can-i-add-and-subtract-128-bit-integers-in-c-or-c.
The function unsigned_lessthan is defined below. It's complicated without AMD XOP (actually I found a simpler version for SSE4.2 if XOP is not available - see the end of my answer). Probably some of the other people here can suggest a better method. Here is some code showing this works.
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
inline __m128i unsigned_lessthan(__m128i a, __m128i b) {
#ifdef __XOP__ // AMD XOP instruction set
return _mm_comgt_epu64(b,a);
#else // SSE2 instruction set
__m128i sign32 = _mm_set1_epi32(0x80000000); // sign bit of each dword
__m128i bflip = _mm_xor_si128(b,sign32); // b with sign bits flipped
__m128i aflip = _mm_xor_si128(a,sign32); // a with sign bits flipped
__m128i equal = _mm_cmpeq_epi32(b,a); // a == b, dwords
__m128i bigger = _mm_cmpgt_epi32(bflip,aflip); // b > a (unsigned), dwords
__m128i biggerl = _mm_shuffle_epi32(bigger,0xA0); // b > a, low dwords copied to high dwords
__m128i eqbig = _mm_and_si128(equal,biggerl); // high part equal and low part bigger
__m128i hibig = _mm_or_si128(bigger,eqbig); // high part bigger, or high part equal and low part bigger
__m128i big = _mm_shuffle_epi32(hibig,0xF5); // result copied to low part
return big;
#endif
}
int main() {
__m128i x,y,z,c;
x = _mm_set_epi64x(3,0xffffffffffffffffll);
y = _mm_set_epi64x(1,0x2ll);
z = _mm_add_epi64(x,y);
c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
z = _mm_sub_epi64(z,c);
int out[4];
//int64_t out[2];
_mm_storeu_si128((__m128i*)out, z);
printf("%d %d\n", out[2], out[0]);
}
Edit:
The only potentially efficient way to add 128-bit or 256-bit numbers with SSE is with XOP. The only option with AVX would be XOP2 which does not exist yet. And even if you have XOP it may only be efficient to add two 128-bit or 256-bit numbers in parallel (you could do four with AVX if XOP2 existed) to avoid the horizontal instructions such as _mm_unpacklo_epi64.
The best solution in general is to push the registers onto the stack and use scalar arithmetic. Assuming you have two 256-bit registers x4 and y4 you can add them like this:
__m256i x4, y4, z4;
uint64_t x[4], y[4], z[4];
_mm256_storeu_si256((__m256i*)x, x4);
_mm256_storeu_si256((__m256i*)y, y4);
add_u256(x,y,z);
z4 = _mm256_loadu_si256((__m256i*)z);
void add_u256(uint64_t x[4], uint64_t y[4], uint64_t z[4]) {
uint64_t c1 = 0, c2 = 0, tmp;
//add low 128-bits
z[0] = x[0] + y[0];
z[1] = x[1] + y[1];
c1 += z[1]<x[1];
tmp = z[1];
z[1] += z[0]<x[0];
c1 += z[1]<tmp;
//add high 128-bits + carry from low 128-bits
z[2] = x[2] + y[2];
c2 += z[2]<x[2];
tmp = z[2];
z[2] += c1;
c2 += z[2]<tmp;
z[3] = x[3] + y[3] + c2;
}
int main() {
uint64_t x[4], y[4], z[4];
x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
y[0] = 1; y[1] = 1; y[2] = 1; y[3] = 1;
//z = x + y (z3,z2,z1,z0) = (2,3,1,0)
//x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
//y[0] = 1; y[1] = 0; y[2] = 1; y[3] = 1;
//z = x + y (z3,z2,z1,z0) = (2,3,0,0)
add_u256(x,y,z);
for(int i=3; i>=0; i--) printf("%llu ", (unsigned long long)z[i]); printf("\n");
}
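The same carry propagation can also be written as a loop over 64-bit limbs, which extends to any width. This is only a sketch of the idea, with add_limbs as a made-up name; limbs are stored least significant first, as in add_u256 above.

```cpp
#include <cstddef>
#include <cstdint>

// z = x + y for N-limb unsigned integers (limb 0 is the least significant).
// The carry out of the top limb is discarded.
template <std::size_t N>
void add_limbs(const uint64_t (&x)[N], const uint64_t (&y)[N], uint64_t (&z)[N]) {
    uint64_t carry = 0;
    for (std::size_t i = 0; i < N; ++i) {
        const uint64_t s  = x[i] + y[i];
        const uint64_t c1 = s < x[i];   // carry out of x[i] + y[i]
        const uint64_t t  = s + carry;
        const uint64_t c2 = t < s;      // carry out of adding the incoming carry
        z[i] = t;
        carry = c1 | c2;                // at most one of the two can be set
    }
}
```

Compilers can often turn this pattern into an add/adc chain, but that depends on the compiler and optimization level.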
Edit: based on a comment by Stephen Canon at saturated-substraction-avx-or-sse4-2 I discovered there is a more efficient way to compare unsigned 64-bit numbers with SSE4.2 if XOP is not available.
__m128i a,b;
__m128i sign64 = _mm_set1_epi64x(0x8000000000000000L);
__m128i aflip = _mm_xor_si128(a, sign64);
__m128i bflip = _mm_xor_si128(b, sign64);
__m128i cmp = _mm_cmpgt_epi64(aflip,bflip);

C++ convert to assembly language

This is my assignment.
I've written the assembly for it, but is there any way to make the converted code run faster?
Thanks in advance for any help ;D
//Convert this nested for loop to assembly instructions
for (a = 0; a < y; a++)
for (b = 0; b < y; b++)
for (c = 0; c < y; c++)
if ((a + 2 * b - 8 * c) == y)
count++;
My conversion:
_asm {
mov ecx,0
mov ax, 0
mov bx, 0
mov cx, 0
Back:
push cx
push bx
push ax
add bx, bx
mov dx, 8
mul dx
add cx, bx
sub cx, ax
pop ax
pop bx
cmp cx, y
jne increase
inc count
increase : pop cx
inc ax
cmp ax, y
jl Back
inc bx
mov ax, 0
cmp bx, y
jl Back
inc cx
mov ax, 0
mov bx, 0
cmp cx, y
jl Back
}
Some generic tricks:
Make your loop counters count down instead of up. You eliminate a compare that way.
Learn the magic of LEA to compute expressions that include addition and scaling by certain powers of 2. You won't need a MUL in there anywhere.
Hoist loop-invariant work outside the inner loop. a + 2*b is constant for every iteration of the c loop.
Use SI, DI to hold values. That should help you avoid all those push and pop instructions.
If your values fit in 8 bits, use AH, AL, etc. to make more effective use of your registers.
Oh, and you don't need that mov ax, 0 after inc cx, because AX is already 0 there.
Specific to this algorithm: If y is odd, skip iterations where a is even, and vice versa. Nearly 2x speedup awaits... (Work out with pencil and paper if you wonder why.) Hint: You don't need to test every iteration, either. You can simply step by 2s, if you're clever enough.
Or better still, work out a closed form that allows you to calculate the answer directly. ;-)
When you are optimizing, always start high and go low, i.e. start at the algorithm level, and when everything is exhausted, go to the assembly conversion.
First, observe that:
8 * c = (a + 2 * b - y)
Has a unique c solution for each triplet (a,b,y).
What does this mean? Your 3 loops can be collapsed into 2. This is a huge reduction in runtime, from Θ(y^3) to Θ(y^2).
Rewrite the code:
for (a = 0; a < y; a++)
for (b = 0; b < y; b++) {
c = (a+2*b-y);
if (((c%8)==0) && (c >= 0)) count++;
}
Next observe that c>=0 means:
a+2*b-y >= 0
a+2*b >= y
a >= y-2*b
Note that the two loops can be interchanged, which gives:
for (b = 0; b < y; b++) {
for (a = max(y-2*b,0); a < y; a++) {
if (((a+2*b-y)%8)==0) count++;
} }
Which we can split into two:
for (b = 0; b < y/2; b++) {
for (a = y-2*b; a < y; a++) {
if (((a+2*b-y)%8)==0) count++;
} }
for (b = y/2; b < y; b++) {
for (a = 0; a < y; a++) {
if (((a+2*b-y)%8)==0) count++;
} }
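Before moving on, it is worth confirming that the reduced loops count the same thing as the original. The sketch below compares the triple loop with the two-loop form using max(y-2*b, 0) as the lower bound for a; the function names are made up for this check.

```cpp
// Original triple loop, Theta(y^3).
long long count_brute(int y) {
    long long count = 0;
    for (int a = 0; a < y; a++)
        for (int b = 0; b < y; b++)
            for (int c = 0; c < y; c++)
                if (a + 2 * b - 8 * c == y) count++;
    return count;
}

// Two-loop version, Theta(y^2): c is implied by (a, b, y), so it is enough
// to test that a + 2*b - y is a non-negative multiple of 8.
long long count_two_loops(int y) {
    long long count = 0;
    for (int b = 0; b < y; b++) {
        const int start = (y - 2 * b > 0) ? y - 2 * b : 0;  // ensures a + 2*b - y >= 0
        for (int a = start; a < y; a++)
            if ((a + 2 * b - y) % 8 == 0) count++;
    }
    return count;
}
```

The implied c never exceeds (2*y-3)/8, which is below y, so the bound c < y from the original loop needs no separate test.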
Now we have entirely eliminated c. We can't eliminate a or b altogether without coming up with a closed form formula (or at least partial closed form formula), why?
So here are several exercises that will get you "there".
1. How can we get rid of %8? Can we eliminate a or b now?
2. Observe that for each y, there are approximately Θ(y^2) counts. Why is it that there is no single closed form quadratic (i.e. a*y^2+b*y+c) that gives us the correct count?
3. Given 2, how would one go about coming up with a closed form formula?
And now conversion to assembly language will give you a small improvement in the grand scheme of things :p
(I hope all the details are right. Please correct if you see a mistake)
In Assembly Language Step-by-Step Jeff writes on page 230,
Now, speed optimization is a very slippery business in the x86 world, Having instructions in the CPU cache versus having to pull them from memory is a speed difference that swamps most speed differences among the instructions themselves. Other factors come into play in the most recent Pentium-class CPUs that make generalizations about instruction speed almost impossible, and certainly impossible to state with any precision.
Assuming you're on an x86 machine, my advice would be soak up all that Math in the other answers the best you can for optimizations.

Bilinear Interpolation from C to Neon

I'm trying to downsample an Image using Neon. So I tried to exercise neon by writing a function that subtracts two images using neon and I have succeeded.
Now I came back to write the bilinear interpolation using neon intrinsics.
Right now I have two problems, getting 4 pixels from one row and one column and also compute the interpolated value (gray) from 4 pixels or if it is possible from 8 pixels from one row and one column. I tried to think about it, but I think the algorithm should be rewritten at all ?
void resizeBilinearNeon( uint8_t *src, uint8_t *dest, float srcWidth, float srcHeight, float destWidth, float destHeight)
{
int A, B, C, D, x, y, index, gray;
float x_ratio = ((float)(srcWidth-1))/destWidth ;
float y_ratio = ((float)(srcHeight-1))/destHeight ;
float x_diff, y_diff;
for (int i=0;i<destHeight;i++) {
for (int j=0;j<destWidth;j++) {
x = (int)(x_ratio * j) ;
y = (int)(y_ratio * i) ;
x_diff = (x_ratio * j) - x ;
y_diff = (y_ratio * i) - y ;
index = y*srcWidth+x ;
uint8x8_t pixels_r = vld1_u8 (&src[index]);
uint8x8_t pixels_c = vld1_u8 (&src[index+(int)srcWidth]);
// Y = A(1-w)(1-h) + B(w)(1-h) + C(h)(1-w) + Dwh
gray = (int)(
pixels_r[0]*(1-x_diff)*(1-y_diff) + pixels_r[1]*(x_diff)*(1-y_diff) +
pixels_c[0]*(y_diff)*(1-x_diff) + pixels_c[1]*(x_diff*y_diff)
) ;
dest[i*(int)destWidth + j] = gray ;
}
}
}
Neon will definitely help with downsampling in an arbitrary ratio using bilinear filtering. The key being clever use of vtbl.8 instruction, that is able to perform a parallel look-up-table for 8 consecutive destination pixels from pre-loaded array:
d0 = a [b] c [d] e [f] g h, d1 = i j k l m n o p
d2 = q r s t u v [w] x, d3 = [y] z [A] B [C][D] E F ...
d4 = G H I J K L M N, d5 = O P Q R S T U V ...
One can easily calculate the fractional positions for the pixels in brackets:
[b] [d] [f] [w] [y] [A] [C] [D], accessed with vtbl.8 d6, {d0,d1,d2,d3}
The row below would be accessed with vtbl.8 d7, {d2,d3,d4,d5}
Incrementing vadd.8 d6, d30 ; with d30 = [1 1 1 1 1 ... 1] gives lookup indices for the pixels right of the origin etc.
There's no reason for getting the pixels from two rows other than illustrating it's possible and that the method can be used to implement also slight distortions if needed.
In real-time applications, using e.g. Lanczos can be a bit overkill, but it is still feasible using NEON. Downsampling by larger factors of course requires (heavy) filtering, but this can easily be achieved by iteratively averaging and decimating by 2:1 and only using fractional sampling at the end.
For any 8 consecutive pixels to write, one can calculate the vector
x_positions = (X + [0 1 2 3 4 5 6 7]) * source_width / target_width;
y_positions = (Y + [0 0 0 0 0 0 0 0]) * source_height / target_height;
ptr = to_int(x_positions) + y_positions * stride;
x_positions += (ptr & 7); // this pointer arithmetic goes only for 8-bit planar
ptr &= ~7; // this is to adjust read pointer to qword alignment
vld1.8 {d0,d1}, [r0]
vld1.8 {d2,d3}, [r0], r2 // wasn't this possible? (use r2==stride)
d4 = int_part_of (x_positions);
d5 = d4 + 1;
d6 = fract_part_of (x_positions);
d7 = fract_part_of (y_positions);
vtbl.8 d8,{d0,d1},d4 // read top row
vtbl.8 d9,{d0,d1},d5 // read top row +1
MIX(d8,d9,d6) // horizontal mix of ptr[] & ptr[1]
vtbl.8 d10,{d2,d3},d4 // read bottom row
vtbl.8 d11,{d2,d3},d5 // read bottom row +1
MIX(d10,d11,d6) // horizontal mix of ptr[1024] & ptr[1025]
MIX(d8,d10,d7)
// MIX (dst, src, fract) is a macro that somehow does linear blending
// should be doable with ~3-4 instructions
To calculate the integer parts, it's enough to use 8.8 bit resolution (one really doesn't have to calculate 666+[0 1 2 3 .. 7]) and keep all intermediate results in simd register.
Disclaimer -- this is conceptual pseudo C / vector code. In SIMD there are two parallel tasks to be optimized: what's the minimum number of arithmetic operations needed, and how to minimize unnecessary shuffling / copying of data. In this respect too, NEON with its three-register approach is much better suited to serious DSP than SSE. Its second advantage is the set of multiplication instructions, and the third is the interleaving instructions.
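To make the MIX step and the 8.8 fixed-point positions concrete, here is a scalar C++ reference of what each lane would compute. It is a sketch for checking a vector implementation against, not NEON code; the names mix8 and sample_bilinear are made up, and the image is assumed to be 8-bit planar.

```cpp
#include <cstdint>

// Linear blend of a and b with an 8-bit fraction: frac = 0 gives a,
// frac = 255 gives almost b. All intermediate values stay non-negative.
static inline int mix8(int a, int b, int frac) {
    return (a * (256 - frac) + b * frac) >> 8;
}

// One bilinear sample; x_fp and y_fp are positions in 8.8 fixed point
// (integer part in the high bits, fraction in the low 8 bits).
// The caller must keep (x + 1, y + 1) inside the image.
static inline uint8_t sample_bilinear(const uint8_t* src, int stride,
                                      int x_fp, int y_fp) {
    const uint8_t* p = src + (y_fp >> 8) * stride + (x_fp >> 8);
    const int xf = x_fp & 255;
    const int yf = y_fp & 255;
    const int top = mix8(p[0], p[1], xf);                // horizontal mix, top row
    const int bot = mix8(p[stride], p[stride + 1], xf);  // horizontal mix, bottom row
    return static_cast<uint8_t>(mix8(top, bot, yf));     // vertical mix
}
```

Stepping x_fp by (source_width << 8) / target_width per destination pixel gives the x_positions vector from the pseudo code, one lane at a time.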
@MarkRansom is not correct about nearest neighbor versus 2x2 bilinear interpolation; bilinear using 4 pixels will produce better output than nearest neighbor. He is correct that averaging the appropriate number of pixels (more than 4 if scaling by > 2:1) will produce better output still. However, NEON will not help with image downsampling unless the scaling is done by an integer ratio.
The maximum benefit of NEON and other SIMD instruction sets is to be able to process 8 or 16 pixels at once using the same operations. By accessing individual elements the way you are, you lose all the SIMD benefit. Another problem is that moving data from NEON to ARM registers is a slow operation. Downsampling images is best done by a GPU or optimized ARM instructions.