I have an input uint64_t X, and I want to write its N least significant bits into the target uint64_t values Y and Z, starting at bit index M in Z. The unaffected parts of Y and Z must not change. How can I implement this efficiently in C++ for the latest Intel CPUs?
It should be efficient for execution in loops. I guess that it requires to have no branching: the number of used instructions is expected to be constant and as small as possible.
M and N are not fixed at compile time. M can take any value from 0 to 63 (target offset in Z), N is in the range from 0 to 64 (number of bits to copy).
There is at least a four-instruction sequence available on reasonably modern Intel processors (those with BMI2).
X &= (1ull << N) - 1; // keep only the N low bits (as C++ this is valid for N < 64; bzhi also handles N == 64)
// bzhi rax, rdi, rdx
Z = X << M;
// shlx rax, rax, rsi
Y = X >> (64 - M);
// neg sil
// shrx rax, rax, rsi
The value M=0 causes a bit of pain, as Y would need to be zero in that case, and the expression X >> (64-M) would need sanitizing: a shift count of 64 does not produce zero, because shrx masks the count to 6 bits.
One possibility to overcome this is
x = bzhi(x, n);
y = rol(x,m);
y = bzhi(y, m); // y &= ~(~0ull << m);
z = shlx(x, m); // z = x << m;
As OP actually wants to update the bits, one obvious solution would be to replicate the logic for masks:
xm = bzhi(~0ull, n);
ym = rol(xm, m);
ym = bzhi(ym, m);
zm = shlx(xm, m);
However, clang seems to produce something like 24 instructions total with the masks applied:
Y = (Y & ~xm) | y; // |,+,^ all possible
Z = (Z & ~zm) | z;
It is likely then better to change the approach:
x2 = x << (64-N); // align x to the left
y2 = y >> y_shift; // align y to right
y = shld(y2,x2, y_shift); // y fixed
Here y_shift = max(0, M+N-64)
Fixing Z is slightly more involved, as Z can be combined of three parts:
zzzzz.....zzzzXXXXXXXzzzzzz, where m=6, n=7
That should be doable with two double shifts as above.
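The mask-and-merge variant above can be sketched in portable C++ without intrinsics. The name write_bits and its parameter order are illustrative and not taken from the answers; the sketch assumes M is in [0, 63], N is in [0, 64], and that Y holds the bits directly above Z. The right shift by 64 - M is split into two shifts so that M = 0 needs no special case. Whether a given compiler turns this into the bzhi/shlx/shrx sequence is not guaranteed, and the ternary for N = 64 typically becomes a conditional move rather than a branch, though that is also up to the compiler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: writes the n low bits of x into the 128-bit value (y:z),
// starting at bit m of z. Bits outside the written range are left untouched.
inline void write_bits(std::uint64_t x, unsigned n, unsigned m,
                       std::uint64_t& y, std::uint64_t& z)
{
    assert(m < 64 && n <= 64);

    // Mask with the n low bits set. n == 64 is handled separately because
    // shifting a 64-bit value by 64 is undefined behaviour in C++.
    const std::uint64_t xm = (n >= 64) ? ~0ull : ((1ull << n) - 1);
    x &= xm;

    // Low word: value and mask shifted up by m.
    const std::uint64_t zm = xm << m;
    const std::uint64_t zv = x << m;

    // High word: logical shift right by (64 - m), written as >>1 then >>(63 - m)
    // so that m == 0 yields 0 instead of an out-of-range shift.
    const std::uint64_t ym = (xm >> 1) >> (63 - m);
    const std::uint64_t yv = (x >> 1) >> (63 - m);

    z = (z & ~zm) | zv;
    y = (y & ~ym) | yv;
}
```

For example, writing the 8 bits 0xFF at M = 60 into zeroed targets sets the top four bits of Z and the low four bits of Y.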
Consider the following code:
Matrix4x4 perspective(const ViewFrustum &frustum) {
float l = frustum.l;
float r = frustum.r;
float b = frustum.b;
float t = frustum.t;
float n = frustum.n;
float f = frustum.f;
return {
{ 2 * n / (r - l), 0, (r + l) / (r - l), 0 },
{ 0, 2 * n / (t - b), (t + b) / (t - b), 0 },
{ 0, 0, -((f + n) / (f - n)), -(2 * n * f / (f - n)) },
{ 0, 0, -1, 0 }
};
}
In order to improve readability of constructing the matrix, I have to either make a copy of values from the frustum struct, or references to them. However, neither do I actually need copies or indirection.
Is it possible to have some kind of a "reference" that would be resolved at compile time, kind of like a symbolic link. It would have the same effect as:
Matrix4x4 perspective(const ViewFrustum &frustum) {
#define l frustum.l
#define r frustum.r
#define b frustum.b
#define t frustum.t
#define n frustum.n
#define f frustum.f
return {
{ 2 * n / (r - l), 0, (r + l) / (r - l), 0 },
{ 0, 2 * n / (t - b), (t + b) / (t - b), 0 },
{ 0, 0, -((f + n) / (f - n)), -(2 * n * f / (f - n)) },
{ 0, 0, -1, 0 }
};
#undef l
#undef r
#undef b
#undef t
#undef n
#undef f
}
Without the preprocessor (or is it acceptable?). I suppose it isn't really needed, or could be avoided in this particular case by making those 6 values arguments to a function directly (though it would be a bit irritating having to call the function like that - but even then, I could make an inline proxy function).
But I was just wondering if this is somehow possible in general? I could not find anything like it. I think it would come in handy for locally shortening descriptive names that are going to be used a lot, without actually having to lose the original names.
Well, that's what C++ references are for:
const float &l = frustum.l;
const float &r = frustum.r;
const float &b = frustum.b;
const float &t = frustum.t;
const float &n = frustum.n;
const float &f = frustum.f;
Most modern compilers will optimize out the references, and use the values from the frustum object verbatim, in the following expression, by resolving the references at compile-time.
Obligatory disclaimer: do not prematurely optimize.
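If C++17 is available, structured bindings give the same short local names in a single declaration. This is only a sketch under assumptions: ViewFrustum is taken to be a plain aggregate with exactly the six public members l, r, b, t, n, f in that order, and Matrix4x4 is replaced by a nested std::array stand-in, since neither type is shown in the question.

```cpp
#include <array>
#include <cassert>

// Assumed layout; the real ViewFrustum is not shown in the question.
struct ViewFrustum { float l, r, b, t, n, f; };

// Stand-in for the question's Matrix4x4 type.
using Matrix4x4 = std::array<std::array<float, 4>, 4>;

Matrix4x4 perspective(const ViewFrustum& frustum)
{
    // Each name is bound to the corresponding member of frustum, in
    // declaration order; no copies are made because of the reference.
    const auto& [l, r, b, t, n, f] = frustum;
    assert(r != l && t != b && f != n);  // avoid division by zero

    return {{
        { 2 * n / (r - l), 0, (r + l) / (r - l), 0 },
        { 0, 2 * n / (t - b), (t + b) / (t - b), 0 },
        { 0, 0, -((f + n) / (f - n)), -(2 * n * f / (f - n)) },
        { 0, 0, -1, 0 }
    }};
}
```

The binding depends on member order, so it silently changes meaning if the struct is reordered; the explicit references above do not have that weakness.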
Let me compare your naive perspective function, containing
float l = frustum.l;
float r = frustum.r;
float b = frustum.b;
float t = frustum.t;
float n = frustum.n;
float f = frustum.f;
with the #define version and with @Sam Varshavchik's solution using references.
We assume that our compiler is optimizing, and optimizing at least decently.
Assembly output for all three versions: https://godbolt.org/g/G06Bx8.
You can notice that reference and define versions are exactly the same - as expected. But naive differs a lot. It first loads all the values from memory:
movss (%rdi), %xmm2 # xmm2 = mem[0],zero,zero,zero
movss 4(%rdi), %xmm1 # xmm1 = mem[0],zero,zero,zero
movss 8(%rdi), %xmm0 # xmm0 = mem[0],zero,zero,zero
movss %xmm0, 12(%rsp) # 4-byte Spill
movss 12(%rdi), %xmm0 # xmm0 = mem[0],zero,zero,zero
movss %xmm0, 8(%rsp) # 4-byte Spill
movss 16(%rdi), %xmm3 # xmm3 = mem[0],zero,zero,zero
movaps %xmm3, 16(%rsp) # 16-byte Spill
movss 20(%rdi), %xmm0
And then never again references the %rdi (frustum) memory. The reference and define versions, on the other hand, load values as they are needed.
This happens because the implementation of the Vector4 constructor is hidden from the optimizer, which therefore can't assume that the constructor doesn't modify frustum, so it must insert loads even when they are redundant.
So the naive version can even be faster than the "optimized" one under certain circumstances.
In general, you can use plain references, as long as you are in the local scope. Modern compilers "see through them" and just treat them as aliases (notice that this actually applies even to pointers).
However, when dealing with small values, copying to a local variable is, if anything, generally beneficial. frustum.r is one layer of indirection away (frustum is a reference, which is a pointer under the hood), so accessing it is costlier than it may seem, and if there are function calls in the middle of your function the compiler may not be able to prove that the value isn't changing, so the access may need to be repeated.
Local variables instead are normally directly on the stack (cheap) or straight in registers (cheapest), and, most importantly, given that they usually have no interaction with "the outside", the compiler has an easier time reasoning about them, so it can be more aggressive with optimizations; also, when actually performing the computations those values are going to be copied in registers and on the stack anyway.
So go ahead and use copies: at worst the compiler will probably do the same, at best you may have helped it optimize.
I would like to speed up a part of my code but I don't think there is a possible better way to do the following calculation:
float invSum = 1.0f / float(sum);
for (int i = 0; i < numBins; ++i)
{
histVec[i] *= invSum;
}
for (int i = 0; i < numBins; ++i)
{
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
fmean += f * midPoint;
}
for (int i = 0; i < numBins; ++i)
{
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
float diff = midPoint - fmean;
var += f * hwk::sqr(diff);
}
numBins in the for-loops is typically 10, but this bit of code is called very often (80 frames per second, at least 8 times per frame).
I tried to use some SSE methods but it is only slightly speeding up this code. I think I could avoid calculating twice midPoint but I am not sure how. Is there a better way to compute fmean and var?
Here is the SSE code:
// make hist contain a multiple of 4 valid values
for (int i = numBins; i < ((numBins + 3) & ~3); i++)
hist[i] = 0;
// find sum of bins in inHist
__m128i iSum4 = _mm_set1_epi32(0);
for (int i = 0; i < numBins; i += 4)
{
__m128i a = *((__m128i *) &inHist[i]);
iSum4 = _mm_add_epi32(iSum4, a);
}
int iSum = iSum4.m128i_i32[0] + iSum4.m128i_i32[1] + iSum4.m128i_i32[2] + iSum4.m128i_i32[3];
//float stdevB, meanB;
if (iSum == 0)
{
stdev = 0.0;
mean = 0.0;
}
else
{
// Set histVec to normalised values in inHist
__m128 invSum = _mm_set1_ps(1.0f / float(iSum));
for (int i = 0; i < numBins; i += 4)
{
__m128i a = *((__m128i *) &inHist[i]);
__m128 b = _mm_cvtepi32_ps(a);
__m128 c = _mm_mul_ps(b, invSum);
_mm_store_ps(&histVec[i], c);
}
float binSize = 256.0f / (float)numBins;
float halfBinSize = binSize * 0.5f;
float binOffset = halfBinSize;
__m128 binSizeMask = _mm_set1_ps(binSize);
__m128 binOffsetMask = _mm_set1_ps(binOffset);
__m128 fmean4 = _mm_set1_ps(0.0f);
for (int i = 0; i < numBins; i += 4)
{
__m128i idx4 = _mm_set_epi32(i + 3, i + 2, i + 1, i);
__m128 idx_m128 = _mm_cvtepi32_ps(idx4);
__m128 histVec4 = _mm_load_ps(&histVec[i]);
__m128 midPoint4 = _mm_add_ps(_mm_mul_ps(idx_m128, binSizeMask), binOffsetMask);
fmean4 = _mm_add_ps(fmean4, _mm_mul_ps(histVec4, midPoint4));
}
fmean4 = _mm_hadd_ps(fmean4, fmean4); // 01 23 01 23
fmean4 = _mm_hadd_ps(fmean4, fmean4); // 0123 0123 0123 0123
float fmean = fmean4.m128_f32[0];
//fmean4 = _mm_set1_ps(fmean);
__m128 var4 = _mm_set1_ps(0.0f);
for (int i = 0; i < numBins; i+=4)
{
__m128i idx4 = _mm_set_epi32(i + 3, i + 2, i + 1, i);
__m128 idx_m128 = _mm_cvtepi32_ps(idx4);
__m128 histVec4 = _mm_load_ps(&histVec[i]);
__m128 midPoint4 = _mm_add_ps(_mm_mul_ps(idx_m128, binSizeMask), binOffsetMask);
__m128 diff4 = _mm_sub_ps(midPoint4, fmean4);
var4 = _mm_add_ps(var4, _mm_mul_ps(histVec4, _mm_mul_ps(diff4, diff4)));
}
var4 = _mm_hadd_ps(var4, var4); // 01 23 01 23
var4 = _mm_hadd_ps(var4, var4); // 0123 0123 0123 0123
float var = var4.m128_f32[0];
stdev = sqrt(var);
mean = fmean;
}
I might be doing something wrong, since I don't see as much improvement as I was expecting.
Is there something in the SSE code that might possibly slow down the process?
(editor's note: the SSE part of this question was originally asked as https://stackoverflow.com/questions/31837817/foor-loop-optimisation-sse-comparison, which was closed as a duplicate.)
I only just realized that your data array starts out as an array of int, since you didn't have declarations in your code. I can see in the SSE version that you start with integers, and only store a float version of it later.
Keeping everything integer will let us do the loop-counter-vector with a simple ivec = _mm_add_epi32(ivec, _mm_set1_epi32(4)); Aki Suihkonen's answer has some transformations that should let it optimize a lot better. Especially, the auto-vectorizer should be able to do more even without -ffast-math. In fact, it does quite well. You could do better with intrinsics, esp. saving some vector 32bit multiplies and shortening the dependency chain.
My old answer, based on just trying to optimize your code as written, assuming FP input:
You may be able to combine all 3 loops into one, using the algorithm @Jason linked to. It might not be profitable, though, since it involves a division. For small numbers of bins, probably just loop multiple times.
Start by reading the guides at http://agner.org/optimize/. A couple of the techniques in his Optimising Assembly guide will speed up your SSE attempt (which I edited into this question for you).
combine your loops where possible, so you do more with the data for each time it's loaded / stored.
multiple accumulators to hide the latency of loop-carried dependency chains. (Even FP add takes 3 cycles on recent Intel CPUs.) This won't apply for really short arrays like your case.
instead of int->float conversion on every iteration, use a float loop counter as well as the int loop counter. (add a vector of _mm_set1_ps(4.0f) every iteration.) _mm_set... with variable args is something to avoid in loops, when possible. It takes several instructions (esp. when each arg to setr has to be calculated separately.)
gcc -O3 manages to auto-vectorize the first loop, but not the others. With -O3 -ffast-math, it auto-vectorizes more. -ffast-math allows it to do FP operations in a different order than the code specifies. e.g. adding up the array in 4 elements of a vector, and only combining the 4 accumulators at the end.
Telling gcc that the input pointer is aligned by 16 lets gcc auto-vectorize with a lot less overhead (no scalar loops for unaligned portions).
// return mean
float fpstats(float histVec[], float sum, float binSize, float binOffset, long numBins, float *variance_p)
{
numBins += 3;
numBins &= ~3; // round up to multiple of 4. This is just a quick hack to make the code fast and simple.
histVec = (float*)__builtin_assume_aligned(histVec, 16);
float invSum = 1.0f / float(sum);
float var = 0, fmean = 0;
for (int i = 0; i < numBins; ++i)
{
histVec[i] *= invSum;
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
fmean += f * midPoint;
}
for (int i = 0; i < numBins; ++i)
{
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
float diff = midPoint - fmean;
// var += f * hwk::sqr(diff);
var += f * (diff * diff);
}
*variance_p = var;
return fmean;
}
gcc generates some weird code for the 2nd loop.
# broadcasting fmean after the 1st loop
subss %xmm0, %xmm2 # fmean, D.2466
shufps $0, %xmm2, %xmm2 # vect_cst_.16
.L5: ## top of 2nd loop
movdqa %xmm3, %xmm5 # vect_vec_iv_.8, vect_vec_iv_.8
cvtdq2ps %xmm3, %xmm3 # vect_vec_iv_.8, vect__32.9
movq %rcx, %rsi # D.2465, D.2467
addq $1, %rcx #, D.2465
mulps %xmm1, %xmm3 # vect_cst_.11, vect__33.10
salq $4, %rsi #, D.2467
paddd %xmm7, %xmm5 # vect_cst_.7, vect_vec_iv_.8
addps %xmm2, %xmm3 # vect_cst_.16, vect_diff_39.15
mulps %xmm3, %xmm3 # vect_diff_39.15, vect_powmult_53.17
mulps (%rdi,%rsi), %xmm3 # MEM[base: histVec_10, index: _107, offset: 0B], vect__41.18
addps %xmm3, %xmm4 # vect__41.18, vect_var_42.19
cmpq %rcx, %rax # D.2465, bnd.26
ja .L8 #, ### <--- This is insane.
haddps %xmm4, %xmm4 # vect_var_42.19, tmp160
haddps %xmm4, %xmm4 # tmp160, vect_var_42.21
.L2:
movss %xmm4, (%rdx) # var, *variance_p_44(D)
ret
.p2align 4,,10
.p2align 3
.L8:
movdqa %xmm5, %xmm3 # vect_vec_iv_.8, vect_vec_iv_.8
jmp .L5 #
So instead of just jumping back to the top every iteration, gcc decides to jump ahead to copy a register, and then unconditionally jmp back to the top of the loop. The uop loop buffer may remove the front-end overhead of this silliness, but gcc should have structured the loop so it didn't copy xmm5->xmm3 and then xmm3->xmm5 every iteration. It should have the conditional jump just go to the top of the loop.
Also note the technique gcc used to get a float version of the loop counter: start with an integer vector of 1 2 3 4, and add set1_epi32(4). Use that as an input for packed int->float cvtdq2ps. On Intel HW, that instruction runs on the FP-add port, and has 3 cycle latency, same as packed FP add. gcc prob. would have done better to just add a vector of set1_ps(4.0), even though this creates a 3-cycle loop-carried dependency chain, instead of 1 cycle vector int add, with a 3 cycle convert forking off on every iteration.
small iteration count
You say this will often be used on exactly 10 bins? A specialized version for just 10 bins could give a big speedup, by avoiding all the loop overhead and keeping everything in registers.
With that small a problem size, you can have the FP weights just sitting there in memory, instead of re-computing them with integer->float conversion every time.
Also, 10 bins is going to mean a lot of horizontal operations relative to the amount of vertical operations, since you only have 2 and a half vectors worth of data.
If exactly 10 is really common, specialize a version for that. If under-16 is common, specialize a version for that. (They can and should share the const float weights[] = { 0.0f, 1.0f, 2.0f, ...}; array.)
You probably will want to use intrinsics for the specialized small-problem versions, rather than auto-vectorization.
Having zero-padding after the end of the useful data in your array might still be a good idea in your specialized version(s). However, you can load the last 2 floats and clear the upper 64b of a vector register with a movq instruction. (__m128i _mm_cvtsi64_si128 (__int64 a)). Cast this to __m128 and you're good to go.
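As a scalar starting point for such a specialization, the sketch below fixes the bin count at 10 and reads the float bin indices from a constant table instead of converting the loop counter on every call. The names kNumBins, kBinIndex and mean10 are made up for illustration, and the code assumes the histogram has already been normalized so its entries sum to 1; under that assumption binSize and binOffset can be applied once at the end.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kNumBins = 10;

// Bin indices as floats, computed once instead of int->float per iteration.
constexpr float kBinIndex[kNumBins] = {
    0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f
};

// Mean of a normalized 10-bin histogram (entries assumed to sum to 1).
// With a compile-time trip count the compiler is free to unroll fully.
inline float mean10(const float* hist, float binSize, float binOffset)
{
    assert(hist != nullptr);
    float acc = 0.0f;
    for (std::size_t i = 0; i < kNumBins; ++i)
        acc += hist[i] * kBinIndex[i];
    // sum(h[i] * (i * binSize + binOffset)) == binSize * acc + binOffset
    // when the h[i] sum to 1.
    return acc * binSize + binOffset;
}
```

An intrinsics version would load the same table as vectors; this scalar form is mainly useful as a reference to test against.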
As peterchen mentioned, these operations are very trivial for current desktop processors. The function is linear, i.e. O(n). What's the typical size of numBins? If it's rather large (say, over 1000000), parallelization will help. This could be simple using a library like OpenMP. If numBins starts approaching MAXINT, you may consider GPGPU as an option (CUDA/OpenCL).
All that considered, you should try profiling your application. Chances are good that, if there is a performance constraint, it's not in this method. Michael Abrash's definition of "high-performance code" has helped me greatly in determining if/when to optimize:
Before we can create high-performance code, we must understand what high performance is. The objective (not always attained) in creating high-performance software is to make the software able to carry out its appointed tasks so rapidly that it responds instantaneously, as far as the user is concerned. In other words, high-performance code should ideally run so fast that any further improvement in the code would be pointless. Notice that the above definition most emphatically does not say anything about making the software as fast as possible.
Reference:
The Graphics Programming Black Book
The overall function to be calculated is
std = sqrt(SUM_i { hist[i]/sum * (midpoint_i - mean_midpoint)^2 })
Using the identity
Var (aX + b) = Var (X) * a^2
one can reduce the complexity of the overall operation considerably
1) midpoint of a bin doesn't need offset b
2) no need to prescale by bin array elements with bin width
and
3) no need to normalize histogram entries with reciprocal of sum
The optimized calculation goes as follows
#include <cmath>

float calcVariance(int histBin[], float binWidth)
{
    int i;
    int sum = 0;
    int mid = 0;
    long long var = 0; // 64-bit: diff grows with sum, so a 32-bit int overflows quickly
    for (i = 0; i < 10; i++)
    {
        sum += histBin[i];
        mid += i*histBin[i];
    }
    for (i = 0; i < 10; i++)
    {
        long long diff = (long long)i * sum - mid; // (i - mean) scaled by sum, because mid is prescaled by sum
        var += histBin[i] * diff * diff;
    }
    // Divide in floating point: sum * sum * sum overflows a 32-bit int once sum exceeds 1290.
    return std::sqrt((float)var / ((float)sum * sum * sum)) * binWidth;
}
Minor changes are required if it's float histBin[];
Also I second padding histBin size to a multiple of 4 for better vectorization.
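To see why the integer version divides by the cube of the sum, write S for sum, m for mid, mu = m/S for the mean bin index, h_i for histBin[i] and w for binWidth (this notation is introduced here and is not part of the original answer). A short derivation:

```latex
\begin{aligned}
d_i &= iS - m = S\,(i - \mu), \\
\sum_i h_i\, d_i^2 &= S^2 \sum_i h_i\,(i - \mu)^2, \\
\operatorname{Var} &= \frac{1}{S} \sum_i h_i\,(i - \mu)^2
                    = \frac{\sum_i h_i\, d_i^2}{S^3}, \\
\sigma &= w \sqrt{\operatorname{Var}} .
\end{aligned}
```

The bin offset does not appear because Var(aX + b) = a^2 Var(X), and the bin width enters only as the final factor w.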
EDIT
Another way to calculate this with floats in the inner loop:
float inv_sum = 1.0f / (float)sum;
float mid_sum = mid * inv_sum;
float var = 0.0f;
for (i = 0; i < 10; i++)
{
float diff = (float)i - mid_sum;
var += (float)histBin[i] * diff * diff;
}
return sqrt(var * inv_sum) * binWidth;
Perform the scaling on the global results only and keep integers as long as possible.
Group all computation in a single loop, using Σ(X-m)²/N = ΣX²/N - m².
// Accumulate the histogram
int mean= 0, var= 0;
for (int i = 0; i < numBins; ++i)
{
mean+= i * histVec[i];
var+= i * i * histVec[i];
}
// Compute the reduced mean and variance
float fmean= (float(mean) / sum);
float fvar= float(var) / sum - fmean * fmean;
// Rescale
fmean= fmean * binSize + binOffset;
fvar= fvar * binSize * binSize;
The required integer type will depend on the maximum value in the bins. The SSE optimization of the loop can exploit the _mm_madd_epi16 instruction.
If the number of bins is as small as 10, consider fully unrolling the loop. Precompute the i and i² vectors in a table.
In the lucky case that the data fits in 16 bits and the sums in 32 bits, the accumulation is done with something like
static short I[16]= { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0 };
static short I2[16]= { 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 0, 0, 0, 0, 0, 0 };
// First group
__m128i i= _mm_load_si128((__m128i*)&I[0]);
__m128i i2= _mm_load_si128((__m128i*)&I2[0]);
__m128i h= _mm_load_si128((__m128i*)&inHist[0]);
__m128i mean= _mm_madd_epi16(i, h);
__m128i var= _mm_madd_epi16(i2, h);
// Second group
i= _mm_load_si128((__m128i*)&I[8]);
i2= _mm_load_si128((__m128i*)&I2[8]);
h= _mm_load_si128((__m128i*)&inHist[8]);
mean= _mm_add_epi32(mean, _mm_madd_epi16(i, h));
var= _mm_add_epi32(var, _mm_madd_epi16(i2, h));
CAUTION: unchecked
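Since the SSE fragment is marked unchecked, a scalar version of the single-pass formula can serve as a reference to test it against. This is a sketch, not the answer's code: the HistStats struct and the histStats name are invented here, the bins are assumed to be non-negative int counts, and 64-bit accumulators with a double-precision final step are used to sidestep the overflow question rather than to be fast.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch.
struct HistStats { float mean; float variance; };

// Single pass over the bins using sum((X-m)^2)/N = sum(X^2)/N - m^2,
// with scaling by binSize / binOffset applied to the final results only.
inline HistStats histStats(const int* hist, int numBins,
                           float binSize, float binOffset)
{
    assert(numBins >= 0);
    std::int64_t sum = 0, s1 = 0, s2 = 0;
    for (int i = 0; i < numBins; ++i) {
        sum += hist[i];
        s1  += std::int64_t(i) * hist[i];      // sum of i * h[i]
        s2  += std::int64_t(i) * i * hist[i];  // sum of i^2 * h[i]
    }
    if (sum == 0)
        return {0.0f, 0.0f};                   // empty histogram

    const double m = double(s1) / double(sum);          // mean in bin-index units
    const double v = double(s2) / double(sum) - m * m;  // variance in bin-index units
    return { float(m * binSize + binOffset),
             float(v * binSize * binSize) };
}
```

With bins {1, 0, 0, 1}, binSize 2 and binOffset 1, the midpoints are 1 and 7, so the mean is 4 and the variance is 9.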
This is my assignment.
I've written my code for this in assembly, but is there any way to make the converted code faster?
Thanks in advance for any help ;D
//Convert this nested for loop to assembly instructions
for (a = 0; a < y; a++)
for (b = 0; b < y; b++)
for (c = 0; c < y; c++)
if ((a + 2 * b - 8 * c) == y)
count++;
convert
_asm {
mov ecx,0
mov ax, 0
mov bx, 0
mov cx, 0
Back:
push cx
push bx
push ax
add bx, bx
mov dx, 8
mul dx
add cx, bx
sub cx, ax
pop ax
pop bx
cmp cx, y
jne increase
inc count
increase : pop cx
inc ax
cmp ax, y
jl Back
inc bx
mov ax, 0
cmp bx, y
jl Back
inc cx
mov ax, 0
mov bx, 0
cmp cx, y
jl Back
}
Some generic tricks:
Make your loop counters count down instead of up. You eliminate a compare that way.
Learn the magic of LEA to compute expressions that include addition and scaling by certain powers of 2. You won't need a MUL in there anywhere.
Hoist loop-invariant work outside the inner loop. a + 2*b is constant for every iteration of the c loop.
Use SI, DI to hold values. That should help you avoid all those push and pop instructions.
If your values fit in 8 bits, use AH, AL, etc. to make more effective use of your registers.
Oh, and you don't need that mov ax, 0 after inc cx, because AX is already 0 there.
Specific to this algorithm: If y is odd, skip iterations where a is even, and vice versa. Nearly 2x speedup awaits... (Work out with pencil and paper if you wonder why.) Hint: You don't need to test every iteration, either. You can simply step by 2s, if you're clever enough.
Or better still, work out a closed form that allows you to calculate the answer directly. ;-)
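Two of the tricks above (hoisting a + 2*b out of the c loop, and stepping a by 2 because a must have the same parity as y) are easier to verify in C++ before writing them in assembly. The function names below are illustrative, and y is assumed to be non-negative.

```cpp
#include <cassert>

// Straight transcription of the assignment's triple loop.
inline long countNaive(int y)
{
    assert(y >= 0);
    long count = 0;
    for (int a = 0; a < y; a++)
        for (int b = 0; b < y; b++)
            for (int c = 0; c < y; c++)
                if (a + 2 * b - 8 * c == y)
                    count++;
    return count;
}

// Same count with two of the suggested tricks applied.
inline long countHoisted(int y)
{
    assert(y >= 0);
    long count = 0;
    // a + 2*b - 8*c == y forces a and y to have the same parity,
    // so start at y's parity and step by 2.
    for (int a = y & 1; a < y; a += 2)
        for (int b = 0; b < y; b++) {
            const int target = a + 2 * b - y;  // invariant across the c loop
            for (int c = 0; c < y; c++)
                if (8 * c == target)
                    count++;
        }
    return count;
}
```

Comparing the two for a range of y values is a cheap way to catch mistakes before the same transformation is hand-written in assembly.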
When you are optimizing, always start high and go low, i.e. start at the algorithm level, and when everything is exhausted, go to the assembly conversion.
First, observe that:
8 * c = (a + 2 * b - y)
has at most one integer solution c for each triplet (a,b,y).
What does this mean? Your 3 loops can be collapsed into 2. This is a huge reduction in runtime, from Θ(y^3) to Θ(y^2).
Rewrite the code:
for (a = 0; a < y; a++)
for (b = 0; b < y; b++) {
c = (a+2*b-y);
if (((c%8)==0) && (c >= 0)) count++;
}
Next observe that c>=0 means:
a+2*b-y >= 0
a+2*b >= y
a >= y-2b
Note that the two loops can be interchanged, which gives:
for (b = 0; b < y; b++) {
for (a = max(y-2*b,0); a < y; a++) {
if (((a+2*b-y)%8)==0) count++;
} }
Which we can split into two:
for (b = 0; b < y/2; b++) {
for (a = y-2*b; a < y; a++) {
if (((a+2*b-y)%8)==0) count++;
} }
for (b = y/2; b < y; b++) {
for (a = 0; a < y; a++) {
if (((a+2*b-y)%8)==0) count++;
} }
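Because the author asks for corrections, it is worth checking the reduction against the original triple loop. The sketch below compares the brute-force count with the two-loop form that uses max(y-2*b, 0) as the lower bound for a; the function names are invented here and y is assumed to be non-negative. The missing c < y test is harmless, because 8*c = a + 2*b - y is at most 2*y - 3, which keeps c below y.

```cpp
#include <algorithm>
#include <cassert>

// Original triple loop, used as the reference.
inline long countBrute(int y)
{
    assert(y >= 0);
    long count = 0;
    for (int a = 0; a < y; a++)
        for (int b = 0; b < y; b++)
            for (int c = 0; c < y; c++)
                if (a + 2 * b - 8 * c == y)
                    count++;
    return count;
}

// Two-loop version: c is implied by a and b, so only divisibility by 8
// has to be tested. Starting a at max(y - 2*b, 0) keeps a + 2*b - y >= 0.
inline long countTwoLoops(int y)
{
    assert(y >= 0);
    long count = 0;
    for (int b = 0; b < y; b++)
        for (int a = std::max(y - 2 * b, 0); a < y; a++)
            if ((a + 2 * b - y) % 8 == 0)
                count++;
    return count;
}
```

The two functions can be compared over a range of y values, which also gives a reference for any closed form worked out from the exercises.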
Now we have eliminated c entirely. We can't eliminate a or b altogether without coming up with a closed-form formula (or at least a partial one). Why?
So here are several exercises that will get you "there":
1. How can we get rid of %8? Can we eliminate a or b now?
2. Observe that for each y the count is approximately Θ(y^2). Why is there no single closed-form quadratic (i.e. a*y^2+b*y+c) that gives the correct count?
3. Given 2, how would one go about coming up with a closed-form formula?
And now conversion to assembly language will give you a small improvement in the grand scheme of things :p
(I hope all the details are right. Please correct if you see a mistake)
In Assembly Language Step-by-Step Jeff writes on page 230,
Now, speed optimization is a very slippery business in the x86 world, Having instructions in the CPU cache versus having to pull them from memory is a speed difference that swamps most speed differences among the instructions themselves. Other factors come into play in the most recent Pentium-class CPUs that make generalizations about instruction speed almost impossible, and certainly impossible to state with any precision.
Assuming you're on an x86 machine, my advice would be to soak up all the math in the other answers as best you can.
I need to know the sign of the value which has the max absolute value stored in an __m128. This is the solution I have now:
int getMaxSign(__m128 const& vec) {
static const __m128 SIGN_BIT_MASK =
_mm_castsi128_ps(_mm_set1_epi32(0x80000000));
// This creates an int, where sign(a) is 1 if a is negative, 0 o.w.:
// sign(a3)<<3 | sign(a2)<<2 | sign(a1)<<1 | sign(a0)
const int signMask = _mm_movemask_ps(vec);
// Get the absolute value of the vector;
__m128 absValsMMX = _mm_andnot_ps(SIGN_BIT_MASK, vec);
// Figure out the horizontal max
__declspec(align(16)) float absVals[4];
_mm_store_ps(absVals, absValsMMX);
const float maxVal = std::max(std::max(absVals[0], absVals[1]), absVals[2]);
return (maxVal == absVals[0] ? signMask & 0x1 :
(maxVal == absVals[1] ? signMask & 0x2 : signMask & 0x4));
}
In this case, sign will be 1 if the value with the maximum absolute value was negative, and 0 otherwise, but I don't actually care what the convention is. Another thing is that I am representing homogenous vectors using these __m128s, so I know that the last value will always be 0.
This seems like a lot of work to do for a relatively simple task. How can I do this faster?
Thanks!
Here is one possible implementation (in C):
int getMaxSign(const __m128 v)
{
__m128 v1, vmax, vmin, vsign;
int signBits;
v1 = (__m128)_mm_alignr_epi8((__m128i)v, (__m128i)v, 4); // v1 = v rotated by 1 element
vmax = _mm_max_ps(v, v1); // generate horizontal max/min
vmin = _mm_min_ps(v, v1);
vmax = _mm_max_ps(vmax, (__m128)_mm_alignr_epi8((__m128i)vmax, (__m128i)vmax, 8));
vmin = _mm_min_ps(vmin, (__m128)_mm_alignr_epi8((__m128i)vmin, (__m128i)vmin, 8));
vsign = _mm_add_ps(vmax, vmin); // add max and min to get sign of abs max
signBits = _mm_extract_ps(vsign, 0); // _mm_extract_ps returns the raw 32-bit pattern as an int, not a float
return signBits < 0; // sign bit set: return 1 for negative
}
Although this looks like a lot of code it's only about 9 SSE instructions and there are no memory accesses, no branches and very little scalar code.
Note that both SSSE3 and SSE4.1 instructions are used in the above.
Here is a second version which only requires SSSE3:
int getMaxSign(const __m128 v)
{
__m128 v1, vmax, vmin, vsign;
int mask;
v1 = (__m128)_mm_alignr_epi8((__m128i)v, (__m128i)v, 4); // v1 = v rotated by 1 element
vmax = _mm_max_ps(v, v1); // generate horizontal max/min
vmin = _mm_min_ps(v, v1);
vmax = _mm_max_ps(vmax, (__m128)_mm_alignr_epi8((__m128i)vmax, (__m128i)vmax, 8));
vmin = _mm_min_ps(vmin, (__m128)_mm_alignr_epi8((__m128i)vmin, (__m128i)vmin, 8));
vsign = _mm_add_ps(vmax, vmin); // add max and min to get sign of abs max
mask = _mm_movemask_epi8((__m128i)vsign);
return (mask & 8) != 0; // return 1 for negative
}
This generates 12 instructions:
pshufd $57, %xmm0, %xmm1
movdqa %xmm0, %xmm2
minps %xmm1, %xmm2
pshufd $78, %xmm2, %xmm3
minps %xmm3, %xmm2
maxps %xmm1, %xmm0
pshufd $78, %xmm0, %xmm1
maxps %xmm1, %xmm0
addps %xmm2, %xmm0
pmovmskb %xmm0, %eax
shrl $3, %eax
andl $1, %eax
Note how the compiler craftily changes palignr to pshufd and also implements the final scalar test using just a shrl and an andl.
Note for Visual Studio C/C++: to cast between __m128 and __m128i you'll need to use _mm_castps_si128 and _mm_castsi128_ps, e.g.
mask = _mm_movemask_epi8((__m128i)vsign);
would need to be changed to:
mask = _mm_movemask_epi8(_mm_castps_si128(vsign));
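If SSSE3 cannot be assumed, the same max-plus-min idea can be written with SSE1 intrinsics only, which also avoids the casts. This is a variation on the code above rather than the answer's own version: the rotations use _mm_shuffle_ps instead of _mm_alignr_epi8, the sign is read with _mm_movemask_ps, and the name getMaxSignSse is made up here.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _mm_max_ps, _mm_min_ps, _mm_movemask_ps

// Returns 1 if the element with the largest absolute value is negative.
// Relies on sign(max + min), as in the versions above.
inline int getMaxSignSse(__m128 v)
{
    // Rotate by one element: [v1, v2, v3, v0]
    const __m128 v1 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 3, 2, 1));
    __m128 vmax = _mm_max_ps(v, v1);
    __m128 vmin = _mm_min_ps(v, v1);

    // Swap the 64-bit halves to finish the horizontal max / min.
    vmax = _mm_max_ps(vmax, _mm_shuffle_ps(vmax, vmax, _MM_SHUFFLE(1, 0, 3, 2)));
    vmin = _mm_min_ps(vmin, _mm_shuffle_ps(vmin, vmin, _MM_SHUFFLE(1, 0, 3, 2)));

    // Every lane now holds max + min; bit 0 of the mask is lane 0's sign bit.
    const __m128 vsign = _mm_add_ps(vmax, vmin);
    return _mm_movemask_ps(vsign) & 1;
}
```

Whether this is faster than the palignr version depends on the compiler and CPU; the point is only that nothing beyond SSE is required.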
If your numbers are discrete, and properly spaced, and drawing from a limited subset, there are other possibilities.
If you're guaranteed that a, b, and c are integers, for instance, then you can raise each element of the vector to an odd power and then dot it with <1, 1, 1>. Raising to the fifth power, for instance, gives < a^5, b^5, c^5 >. If |a| is the largest and |a|=2, then we know that b and c will be at most 1 in magnitude, so the value of a^5 will dominate and the dot product will have its sign. For instance, if X = < a=-2, b=1, c=0 >, then X^5 = <-32, 1, 0>. When you dot this with <1, 1, 1> you get -31, whose sign reflects that of the largest absolute value. As the absolute value of the largest number increases, the relative gap between it and the other terms shrinks: for instance, if we have <-8, 7, 7>, then the algorithm above gives X^5 = <-32768, 16807, 16807>; dot that with <1, 1, 1> and you get 846, so the algorithm fails with exponent 5. If we bump the exponent up to 7, we get <-2097152, 823543, 823543>, which dotted with <1, 1, 1> gives -450066, the correct sign. Eventually round-off errors will also break this method. But I'm hoping it might give some insight into other alternatives, if you know the limits on your dataset.
As a footnote, remember that X^5 = (X*X) * (X*X) * X, so you do one multiply to get X^2, multiply that by itself to get X^4, and then multiply by X - three multiplies total. You need an odd exponent to preserve sign.
m = min(a,b,c);
M = max(a,b,c);
// return abs(m)>abs(M) ? sign(m): sign(M); // was
return sign(m+M);
As correctly noticed by Paul_R, the sign comes simply from the sum of the min and max values. Whichever has the larger (opposite-signed) absolute value wins.
But the idea can be exploited more: the sum of min/max is the same, as the sum of all the elements, minus the middle one, which can be found by max 3 comparisons.
return sign(a+b+c - middle(a,b,c)); // or
return sign(a*aw + b*bw + c*cw); // where aw, bw, cw are each 0 or 1
aw, bw, cw could be derived from the number of won comparisons (which I think have to be planned carefully for the case when there are 2 or 3 equal values).
And further:
x = abs(b)>abs(a)?b:a;
return sign(x+c);
Possibly even further:
s = sign(a + b); // store the sign of larger of a or b
a = abs(a); b=abs(b);
a = max(a,b) | s; // somehow copy the sign.
return sign(a+c);
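The min/max observation at the start of this answer is easy to check in scalar code before committing to an SSE version. The sketch below is illustrative (the name maxAbsIsNegative is not from the thread) and follows the convention of returning 1 for negative; when the largest positive and negative magnitudes tie, the sum is zero and it returns 0.

```cpp
#include <algorithm>
#include <cassert>

// Returns 1 if the value with the largest magnitude among a, b, c is negative.
// Uses sign(min + max): whichever extreme has the larger magnitude wins.
inline int maxAbsIsNegative(float a, float b, float c)
{
    const float lo = std::min({a, b, c});
    const float hi = std::max({a, b, c});
    return (lo + hi) < 0.0f ? 1 : 0;
}
```

Unlike the odd-power trick, this does not depend on the values being integers or on the gap between the largest and the other elements.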