How to efficiently scan 2 bit masks alternating each iteration - c++

Given are 2 bitmasks that should be accessed alternately (0,1,0,1,...). I am trying to get a runtime-efficient solution, but I find no better way than the following example.
uint32_t mask[2] { ... };
uint8_t mask_index = 0;
uint32_t f = _tzcnt_u32(mask[mask_index]);
while (f < 32) {
// element adding to result vector removed, since not relevant for question itself
mask[0] >>= f + 1;
mask[1] >>= f + 1;
mask_index ^= 1;
f = _tzcnt_u32(mask[mask_index]);
}
The ASM output (MSVC, x64) seems pretty bloated.
inc r9
add r9,rcx
mov eax,esi
mov qword ptr [rdi+rax*8],r9
inc esi
lea rax,[rcx+1]
shrx r11d,r11d,eax
mov dword ptr [rbp],r11d
shrx r8d,r8d,eax
mov dword ptr [rbp+4],r8d
xor r10b,1
movsx rax,r10b
tzcnt ecx,dword ptr [rbp+rax*4]
mov ecx,ecx
cmp rcx,20h
jb main+240h (07FF632862FD0h)
cmp r9,20h
jb main+230h (07FF632862FC0h)
Does anyone have any advice?
(This is a followup to Solve loop data dependency with SIMD - finding transitions between -1 and +1 in an int8_t array of sgn values using SIMD to create the bitmasks)
Update
I wonder if a potential solution could make use of SIMD by loading chunks of both bit streams into a register (AVX2 in my case) like this:
|m0[0]|m1[0]|m0[1]|m1[1]|m0[2]|m1[2]|m0[n+1]|m1[n+1]|
or
1 register with chunks per stream
|m0[0]|m0[1]|m0[2]|m0[n+1]|
|m1[0]|m1[1]|m1[2]|m1[n+1]|
or split the stream into chunks of the same size and deal with as many lanes as fit into the register at once. Let's assume we have 256*10 elements, which might end up in 10 iterations like this:
|m0[0]|m0[256]|m0[512]|...|
|m1[0]|m1[256]|m1[512]|...|
and deal with the join separately
Not sure if this might be a way to achieve more iterations per cycle, limit the need for horizontal bit scans and shift/clear ops, and avoid branches.

It is quite hard to optimize this loop. The main issue is that each iteration of the loop is dependent on the previous one, and even the instructions within an iteration are dependent on each other. This creates a long, nearly sequential chain of instructions to be executed. As a result, the processor cannot execute this efficiently. In addition, some instructions in this chain have quite a high latency: tzcnt has a 3-cycle latency on Intel processors, and an L1 load/store has a 3-cycle latency.
One solution is to work directly with registers instead of an array with indirect accesses, so as to reduce the length of the chain and especially the number of instructions with the highest latency. This can be done by unrolling the loop twice and splitting the problem into two different ones:
uint32_t m0 = mask[0];
uint32_t m1 = mask[1];
uint8_t mask_index = 0;
if(mask_index == 0) {
uint32_t f = _tzcnt_u32(m0);
while (f < 32) {
m1 >>= f + 1;
m0 >>= f + 1;
f = _tzcnt_u32(m1);
if(f >= 32)
break;
m0 >>= f + 1;
m1 >>= f + 1;
f = _tzcnt_u32(m0);
}
}
else {
uint32_t f = _tzcnt_u32(m1);
while (f < 32) {
m0 >>= f + 1;
m1 >>= f + 1;
f = _tzcnt_u32(m0);
if(f >= 32)
break;
m0 >>= f + 1;
m1 >>= f + 1;
f = _tzcnt_u32(m1);
}
}
// If mask is needed, m0 and m1 need to be stored back in mask.
This should be a bit faster, especially because of the smaller critical path, but also because the two shifts can be executed in parallel. Here is the resulting assembly code:
$loop:
inc ecx
shr edx, cl
shr eax, cl
tzcnt ecx, edx
cmp ecx, 32
jae SHORT $end_loop
inc ecx
shr eax, cl
shr edx, cl
tzcnt ecx, eax
cmp ecx, 32
jb SHORT $loop
Note that modern x86 processors can fuse the instruction pairs cmp+jae and cmp+jb, and the branch predictor can assume the loop will continue, so only the last conditional jump is mispredicted. On Intel processors, the critical path is composed of a 1-cycle-latency inc, a 1-cycle-latency shr and a 3-cycle-latency tzcnt, resulting in 5 cycles per round (1 round = 1 iteration of the initial loop). On AMD Zen-like processors, it is 1+1+2 = 4 cycles, which is very good. Optimizing this further appears to be very challenging.
One possible optimization could be to use a lookup table so as to process the lower bits of m0 and m1 in bigger steps. However, a lookup-table fetch has a 3-cycle latency, may cause expensive cache misses in practice, takes more memory and makes the code significantly more complex, since the number of trailing 0 bits can be quite big (e.g. 28 bits). Thus, I am not sure this is a good idea, although it is certainly worth trying.

Here's another way, untested. People all over the internet recommend against using goto, but sometimes, as in your use case, the feature does help.
// Grab 2 more of these masks, or if you don't have any, return false
bool loadMasks( uint32_t& mask1, uint32_t& mask2 );
// Consume the found value
void consumeIndex( size_t index );
void processMasks()
{
size_t sourceOffset = 0;
uint32_t mask0, mask1;
// Skip initial zeros
while( true )
{
if( !loadMasks( mask0, mask1 ) )
return;
if( 0 != ( mask0 | mask1 ) )
break;
sourceOffset += 32;
}
constexpr uint32_t minusOne = ~(uint32_t)0;
uint32_t idx;
// Figure out the initial state, and jump
if( _tzcnt_u32( mask0 ) > _tzcnt_u32( mask1 ) )
goto testMask1;
// Main loop below
testMask0:
idx = _tzcnt_u32( mask0 );
if( idx >= 32 )
{
sourceOffset += 32;
if( !loadMasks( mask0, mask1 ) )
return;
goto testMask0;
}
consumeIndex( sourceOffset + idx );
mask1 &= minusOne << ( idx + 1 );
testMask1:
idx = _tzcnt_u32( mask1 );
if( idx >= 32 )
{
sourceOffset += 32;
if( !loadMasks( mask0, mask1 ) )
return;
goto testMask1;
}
consumeIndex( sourceOffset + idx );
mask0 &= minusOne << ( idx + 1 );
goto testMask0;
}
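For completeness, here is one hypothetical way the two callbacks could be wired up to a pair of mask-word streams; the array names and block count below are assumptions made for this sketch only, not part of the answer above.
#include <cstdint>
#include <cstdio>

// Hypothetical glue code: feeds processMasks() from two parallel arrays of 32-bit
// mask words; words0/words1/blockCount are made up for this sketch.
static const uint32_t words0[] = { 0x0000000Au, 0x00000000u };
static const uint32_t words1[] = { 0x00000004u, 0x80000000u };
static const size_t blockCount = 2;
static size_t blockIndex = 0;

bool loadMasks( uint32_t& mask0, uint32_t& mask1 )
{
    if( blockIndex >= blockCount )
        return false;
    mask0 = words0[ blockIndex ];
    mask1 = words1[ blockIndex ];
    blockIndex++;
    return true;
}

void consumeIndex( size_t index )
{
    // The real code would append the element to the result vector here.
    std::printf( "found bit at overall index %zu\n", index );
}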

Related

Efficiently shift-or large bit vector

I have a large in-memory array given as a pointer uint64_t * arr (plus size), which represents plain bits. I need to shift these bits to the right very efficiently (as fast as possible) by some amount from 0 to 63.
By shifting the whole array I mean not shifting each element on its own (like a[i] <<= Shift), but shifting it as a single large bit vector. In other words, for each intermediate position i (except for the first and last element) I can do the following in a loop:
dst[i] = w | (src[i] << Shift);
w = src[i] >> (64 - Shift);
where w is a temporary variable holding the right-shifted value of the previous array element.
The solution above is simple and obvious. But I need something more efficient, as I have gigabytes of data.
Ideally I would use some SIMD instructions for that, so I'm looking for SIMD suggestions from experts. I need to implement the shifting code for all four popular instruction-set families - SSE-SSE4.2 / AVX / AVX2 / AVX-512.
But as far as I know, for SSE2 for example there exists only the _mm_slli_si128() intrinsic/instruction, which shifts only by a multiple of 8 bits (in other words, byte shifting). And I need shifting by an arbitrary bit size, not only whole bytes.
Without SIMD I can also shift by 128 bits at once using the shld reg, reg, cl instruction, which allows 128-bit shifting. It is implemented as the intrinsic __shiftleft128() in MSVC, and produces assembler code that can be seen here.
BTW, I need solutions for all of MSVC/GCC/CLang.
Also, inside a single loop iteration I can shift 4 or 8 words with sequential operations; this will use CPU pipelining to speed up out-of-order execution of several instructions in parallel.
If needed, my bit vector can be aligned to any number of bytes in memory, if this helps, for example, to improve SIMD speed through aligned reads/writes. Also, the source and destination bit-vector memory are different (non-overlapping).
In other words, I'm looking for any suggestions on how to solve my task as efficiently (as performantly) as possible on different Intel CPUs.
Note, to clarify: I actually have to do several shift-ors, not just a single shift. I have a large bit vector X and several hundred shift sizes s0, s1, ..., sN, where each shift size is different and can also be large (for example, a shift by 100K bits), and I want to compute the resulting large bit vector Y = (X << s0) | (X << s1) | ... | (X << sN). I just simplified my question for StackOverflow to shifting a single vector, but this detail about the original task is probably very important.
As requested by @Jake'Alquimista'LEE, I decided to implement a ready-made, minimal reproducible toy example of what I want to do, computing shift-ors of the input bit vector src to produce the or-ed final dst bit vector. This example is not optimized at all, just a straightforward, simple variant of how my task can be solved. For simplicity the input vector is small, not gigabytes as in my case. It is a toy example; I didn't check whether it solves the task correctly, and it may contain minor bugs:
Try it online!
#include <algorithm>
#include <cstdint>
#include <vector>
#include <random>
#define bit_sizeof(x) (sizeof(x) * 8)
using u64 = uint64_t;
using T = u64;
int main() {
std::mt19937_64 rng{123};
// Random generate source bit vector
std::vector<T> src(100'000);
for (size_t i = 0; i < src.size(); ++i)
src[i] = rng();
size_t const src_bitsize = src.size() * bit_sizeof(T);
// Destination bit vector, for example twice bigger in size
std::vector<T> dst(src.size() * 2);
// Random generate shifts
std::vector<u64> shifts(200);
for (size_t i = 0; i < shifts.size(); ++i)
shifts[i] = rng() % src_bitsize;
// Right-shift that handles overflow
auto Shr = [](auto x, size_t s) {
return s >= bit_sizeof(x) ? 0 : (x >> s);
};
// Do actual Shift-Ors
for (auto orig_shift: shifts) {
size_t const
word_off = orig_shift / bit_sizeof(T),
bit_off = orig_shift % bit_sizeof(T);
if (word_off >= dst.size())
continue;
size_t const
lim = std::min(src.size(), dst.size() - word_off);
T w = 0;
for (size_t i = 0; i < lim; ++i) {
dst[word_off + i] |= w | (src[i] << bit_off);
w = Shr(src[i], bit_sizeof(T) - bit_off);
}
// Special case of handling for last word
if (word_off + lim < dst.size())
dst[word_off + lim] |= w;
}
}
My real project's current code is different from toy example above. This project already solves correctly a real-world task. I just need to do extra optimizations. Some optimizations I already did, like using OpenMP to parallelize shift-or operations on all cores. Also as said in comments, I created specialized templated functions for each shift size, 64 functions in total, and choosing one of 64 functions to do actual shift-or. Each C++ function has compile time value of shift size, hence compiler does extra optimizations taking into account compile time values.
You can, and possibly you don't even need to use SIMD instructions explicitly.
The target compilers GCC, CLANG and MSVC and other compilers like ICC all support auto-vectorization.
While hand-optimized assembly can outperform compiler generated vectorized instructions, it's generally harder to achieve and you may need several versions for different architectures.
Generic code that leads to efficient auto-vectorized instructions is a solution that may be portable across many platforms.
For instance a simple shiftvec version
void shiftvec(uint64_t* dst, uint64_t* src, int size, int shift)
{
for (int i = 0; i < size; ++i,++src,++dst)
{
*dst = ((*src)<<shift) | (*(src+1)>>(64-shift));
}
}
compiled with a recent GCC (or CLANG works as well) and -O3 -std=c++11 -mavx2 leads to SIMD instructions in the core loop of the assembly
.L5:
vmovdqu ymm4, YMMWORD PTR [rsi+rax]
vmovdqu ymm5, YMMWORD PTR [rsi+8+rax]
vpsllq ymm0, ymm4, xmm2
vpsrlq ymm1, ymm5, xmm3
vpor ymm0, ymm0, ymm1
vmovdqu YMMWORD PTR [rdi+rax], ymm0
add rax, 32
cmp rax, rdx
jne .L5
See on godbolt.org: https://godbolt.org/z/5TxhqMhnK
This also generalizes if you want to combine multiple shifts into dst:
void shiftvec2(uint64_t* dst, uint64_t* src1, uint64_t* src2, int size1, int size2, int shift1, int shift2)
{
int size = size1<size2 ? size1 : size2;
for (int i = 0; i < size; ++i,++src1,++src2,++dst)
{
*dst = ((*src1)<<shift1) | (*(src1+1)>>(64-shift1));
*dst |= ((*src2)<<shift2) | (*(src2+1)>>(64-shift2));
}
for (int i = size; i < size1; ++i,++src1,++dst)
{
*dst = ((*src1)<<shift1) | (*(src1+1)>>(64-shift1));
}
for (int i = size; i < size2; ++i,++src2,++dst)
{
*dst = ((*src2)<<shift2) | (*(src2+1)>>(64-shift2));
}
}
compiles to a core-loop:
.L38:
vmovdqu ymm7, YMMWORD PTR [rsi+rcx]
vpsllq ymm1, ymm7, xmm4
vmovdqu ymm7, YMMWORD PTR [rsi+8+rcx]
vpsrlq ymm0, ymm7, xmm6
vpor ymm1, ymm1, ymm0
vmovdqu YMMWORD PTR [rax+rcx], ymm1
vmovdqu ymm7, YMMWORD PTR [rdx+rcx]
vpsllq ymm0, ymm7, xmm3
vmovdqu ymm7, YMMWORD PTR [rdx+8+rcx]
vpsrlq ymm2, ymm7, xmm5
vpor ymm0, ymm0, ymm2
vpor ymm0, ymm0, ymm1
vmovdqu YMMWORD PTR [rax+rcx], ymm0
add rcx, 32
cmp r10, rcx
jne .L38
Combining multiple sources in one loop reduces the total memory bandwidth spent on loading/writing the destination. How many sources you can combine is of course limited by the available registers. Note that xmm2 and xmm3 in shiftvec hold the shift values, so having different versions for compile-time-known shift values may free those registers.
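As a sketch of that last point (my code, names are assumptions): a templated variant of shiftvec makes the shift amount a compile-time constant, and a small dispatcher - here just a switch - picks one of the instantiations at run time, much like the 64 specialized functions the question's author mentions. It keeps the original convention of reading one word past the end of src.
#include <cstdint>

// Sketch only: shift amount as a template parameter, so the compiler no longer needs
// registers to hold it at run time. Shift == 0 degenerates to a plain copy and is
// assumed to be handled separately.
template<int Shift>
void shiftvec_ct(uint64_t* __restrict dst, const uint64_t* __restrict src, int size)
{
    static_assert(Shift > 0 && Shift < 64, "handle Shift == 0 as a plain copy");
    for (int i = 0; i < size; ++i, ++src, ++dst)
        *dst = ((*src) << Shift) | (*(src + 1) >> (64 - Shift));
}

// Minimal dispatcher sketch; a real one would enumerate all 63 cases (or use a table).
void shiftvec_dispatch(uint64_t* __restrict dst, const uint64_t* __restrict src,
                       int size, int shift)
{
    switch (shift)
    {
        case 1:  shiftvec_ct<1>(dst, src, size);  break;
        case 13: shiftvec_ct<13>(dst, src, size); break;
        // ... cases 2..63 generated the same way ...
        default: break; // shift == 0: plain copy, not shown
    }
}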
Additionally, using __restrict (supported by GCC, CLANG and MSVC) on each of the pointers tells the compiler that the ranges are not overlapping.
I initially had problems getting MSVC to emit properly auto-vectorized code, but it seems that adding more SIMD-like structure makes it work for all three desired compilers, GCC, CLANG and MSVC:
void shiftvec(uint64_t* __restrict dst, const uint64_t* __restrict src, int size, int shift)
{
int i = 0;
// MSVC: use steps of 2 for SSE, 4 for AVX2, 8 for AVX512
for (; i+4 < size; i+=4,dst+=4,src+=4)
{
for (int j = 0; j < 4; ++j)
*(dst+j) = (*(src+j))<<shift;
for (int j = 0; j < 4; ++j)
*(dst+j) |= (*(src+j+1)>>(64-shift));
}
for (; i < size; ++i,++src,++dst)
{
*dst = ((*src)<<shift) | (*(src+1)>>(64-shift));
}
}
I would attempt to rely on the x64 ability to read from unaligned addresses, which it does with almost no visible penalty when the stars are properly (un)aligned. One would then only need to handle a few cases of (shift % 8) or (shift % 16) -- all doable with the SSE2 instruction set -- fixing the remainder with zeros, keeping an unaligned byte offset into the data vector, and addressing the UB with memcpy.
That said, the inner loop would look like:
uint16_t const *ptr;
auto a = _mm_loadu_si128((__m128i*)ptr);
auto b = _mm_loadu_si128((__m128i*)(ptr - 1));
a = _mm_srl_epi16(a, c);
b = _mm_sll_epi16(b, 16 - c);
_mm_storeu_si128((__m128i*)ptr, _mm_or_si128(a, b));
ptr += 8;
Unrolling this loop a few times, one might be able to use _mm_alignr_epi8 on SSSE3+ to relax memory bandwidth (and those pipeline stages that need to combine results from unaligned memory accesses):
auto a0 = w;
auto a1 = _mm_load_si128(m128ptr + 1);
auto a2 = _mm_load_si128(m128ptr + 2);
auto a3 = _mm_load_si128(m128ptr + 3);
auto a4 = _mm_load_si128(m128ptr + 4);
auto b0 = _mm_alignr_epi8(a1, a0, 2);
auto b1 = _mm_alignr_epi8(a2, a1, 2);
auto b2 = _mm_alignr_epi8(a3, a2, 2);
auto b3 = _mm_alignr_epi8(a4, a3, 2);
// ... do the computation as above ...
w = a4; // rotate the context
In other words, I'm looking for any suggestions on how to solve my task as efficiently (as performantly) as possible on different Intel CPUs.
The key to efficiency is to be lazy. The key to being lazy is to lie - pretend you shifted without actually doing any shifting.
For an initial example (to illustrate the concept only), consider:
struct Thingy {
int ignored_bits;
uint64_t data[];
};
void shift_right(struct Thingy * thing, int count) {
thing->ignored_bits += count;
}
void shift_left(struct Thingy * thing, int count) {
thing->ignored_bits -= count;
}
int get_bit(struct Thingy * thing, int bit_number) {
bit_number += thing->ignored_bits;
return !!(thing->data[bit_number / 64] & (1ULL << (bit_number % 64)));
}
For practical code you'll need to care about various details - you'll probably want to start with spare bits at the start of the array (and a non-zero ignored_bits) so that you can pretend to shift right; for each small shift you'll probably want to clear the "shifted in" bits (otherwise it'll behave like floating point - e.g. ((5.0 << 8) >> 8) == 5.0); if/when ignored_bits goes outside a certain range you'll probably want a large memcpy(); etc.
For more fun; abuse low level memory management - use VirtualAlloc() (Windows) or mmap() (Linux) to reserve a huge space, then put your array in the middle of the space, then allocate/free pages at the start/end of array as needed; so that you only need to memcpy() after the original bits have been "shifted" many billions of bits to the left/right.
Of course the consequence is that it's going to complicate other parts of your code - e.g. to OR 2 bitfields together you'll have to do a tricky "fetch A; shift A to match B; result = A OR B" adjustment. This isn't a deal breaker for performance.
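A rough sketch of that "fetch A; shift A to match B; result = A OR B" adjustment, done bit by bit for clarity (my code, not the answer's; a real version would of course work a word at a time). Thingy, data and ignored_bits keep the meaning from the example above.
void or_into(struct Thingy * dst, const struct Thingy * src, int bit_count) {
    for (int i = 0; i < bit_count; i++) {
        int sb = i + src->ignored_bits;   /* physical bit position inside src->data */
        int db = i + dst->ignored_bits;   /* physical bit position inside dst->data */
        if (src->data[sb / 64] & (1ULL << (sb % 64)))
            dst->data[db / 64] |= 1ULL << (db % 64);
    }
}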
#include <cstdint>
#include <immintrin.h>
template<unsigned Shift>
void foo(uint64_t* __restrict pDst, const uint64_t* __restrict pSrc, intptr_t size)
{
const uint64_t *pSrc0, *pSrc1, *pSrc2, *pSrc3;
uint64_t *pDst0, *pDst1, *pDst2, *pDst3;
__m256i prev, current;
intptr_t i, stride;
stride = size >> 2;
i = stride;
pSrc0 = pSrc;
pSrc1 = pSrc + stride;
pSrc2 = pSrc + 2 * stride;
pSrc3 = pSrc + 3 * stride;
pDst0 = pDst;
pDst1 = pDst + stride;
pDst2 = pDst + 2 * stride;
pDst3 = pDst + 3 * stride;
prev = _mm256_set_epi64x(0, pSrc1[-1], pSrc2[-1], pSrc3[-1]);
while (i--)
{
current = _mm256_set_epi64x(*pSrc0++, *pSrc1++, *pSrc2++, *pSrc3++);
prev = _mm256_srli_epi64(prev, 64 - Shift);
prev = _mm256_or_si256(prev, _mm256_slli_epi64(current, Shift));
*pDst0++ = _mm256_extract_epi64(prev, 3);
*pDst1++ = _mm256_extract_epi64(prev, 2);
*pDst2++ = _mm256_extract_epi64(prev, 1);
*pDst3++ = _mm256_extract_epi64(prev, 0);
prev = current;
}
}
You can do the operation on up to four 64bit elements at once on AVX2 (up to eight on AVX512)
If size isn't a multiple of four, there will be up to 3 remaining ones to deal with.
PS: Auto vectorization is never a proper solution.
No, you can't.
Both NEON and AVX(-512) support barrel-shift operations only on elements of up to 64 bits.
You can, however, "shift" the whole 128-bit vector by n bytes (multiples of 8 bits) with the ext instruction on NEON and alignr on AVX.
And you should avoid using the vector class in performance-critical code, since its overhead can be bad for performance.

Reducing an integer to 1 if it is not equal to 0

I'm trying to fix a timing leak by removing an if statement from my code, but because of C++'s interpretation of integer values in if statements I am stuck.
Note that I assume the compiler does create a conditional branch, which results in timing information being leaked!
The original code is:
int s;
if (s)
r = A;
else
r = B;
Now I'm trying to rewrite it as:
int s;
r = s*A + (1-s)*B;
Because s is not bounded to [0,1], I run into the problem that it multiplies A and B incorrectly when s is outside [0,1]. What can I do to solve this without using an if statement on s?
Thanks in advance
What evidence do you have that the if statement is resulting in the timing leak?
If you use a modern compiler with optimizations turned on, that code should not produce a branch. You should check what your compiler is doing by looking at the assembly language output.
For instance, g++ 5.3.0 compiles this code:
int f(int s, int A, int B) {
int r;
if (s)
r = A;
else
r = B;
return r;
}
to this assembly:
movl %esi, %eax
testl %edi, %edi
cmove %edx, %eax
ret
Look, ma! No branches! ;)
If you know the number of bits in the integer, it's pretty easy, although there are a few complications making it standards-clean with the possibility of unusual integer representations.
Here's one simple solution for 32-bit integers:
uint32_t mask = s;
mask |= mask >> 1;
mask |= mask >> 2;
mask |= mask >> 4;
mask |= mask >> 8;
mask |= mask >> 16;
mask &= 1;
r = b ^ (-mask & (a ^ b));
The five shift-and-or statements propagate any set bit in mask so that in the end the low-order bit is 1 unless the mask was originally 0. Then we isolate the low-order bit, resulting in a 1 or 0. The last statement is a bit-hacking equivalent of your two multiplies and add.
Here is a faster one based on the observation that if you subtract one from a number and the sign bit changes from 0 to 1, then the number was 0:
uint32_t mask = (((uint32_t(s)-1U) & ~uint32_t(s)) >> 31) - 1U;
Here mask is already all ones when s is non-zero (and zero when s is zero), so it can be used directly as r = b ^ (mask & (a ^ b)). That is essentially the same computation as subtracting 1 and then using the carry bit, but unfortunately the carry bit is not exposed to the C language (except possibly through compiler-specific intrinsics).
Other variations are possible.
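One such variation, as an untested sketch of my own: for any non-zero s, either s or its two's-complement negation has the top bit set, so (s | -s) >> 31 is 1 exactly when s != 0, and it plugs into the same branchless select used above.
uint32_t u = (uint32_t)s;
uint32_t mask = (u | (0u - u)) >> 31;   // 1 if s != 0, else 0
r = b ^ (-mask & (a ^ b));              // selects a when s != 0, b otherwise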
The only way to do it without branches when the optimization is not available is to resort to inline assembly. Assuming 8086:
mov ax, s
neg ax ; CF = (ax != 0)
sbb ax, ax ; ax = (s != 0 ? -1 : 0)
neg ax ; ax = (s != 0 ? 1 : 0)
mov s, ax ; now use s at will, it will be: s = (s != 0 ? 1 : 0)

Accessing three static arrays is quicker than one static array containing 3x data?

I have 700 items and I loop through them; for each one I obtain the item's three attributes and perform some basic calculations. I have implemented this using two techniques:
1) Three 700-element arrays, one array for each of the three attributes. So:
item0.a = array1[0]
item0.b = array2[0]
item0.e = array3[0]
2) One 2100-element array containing data for the three attributes consecutively. So:
item0.a = array[(0*3)+0]
item0.b = array[(0*3)+1]
item0.e = array[(0*3)+2]
Now, the three item attributes a, b and e are used together within the loop - therefore it would make sense that if you store them in one array, the performance should be better than with the three-array technique (due to spatial locality). However:
Three 700-element arrays = 3300 CPU cycles on average for the whole loop
One 2100-element array = 3500 CPU cycles on average for the whole loop
Here is the code for the 2100-array technique:
unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array[2100];
//I have left out code for simplicity. You can assume by now the array is populated.
start = __rdtscp(&x);
for(int i=0; i < 700; i++){
unsigned short j = i * 3;
unsigned int a = array[j + 0];
unsigned int b = array[j + 1];
data_for_all_items = data_for_all_items & (a!= -1 & b != -1);
unsigned int e = array[j + 2];
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
and here is the code for the three 700-element arrays technique:
unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array1[700];
unsigned int array2[700];
unsigned int array3[700];
//I have left out code for simplicity. You can assume by now the arrays are populated.
start = __rdtscp(&x);
for(int i=0; i < 700; i++){
unsigned int a= array1[i]; //Array 1
unsigned int b= array2[i]; //Array 2
data_for_all_items = data_for_all_items & (a!= -1 & b != -1);
unsigned int e = array3[i]; //Array 3
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
Why isn't the technique using one 2100-element array faster? It should be, since the three attributes are used together for each of the 700 items.
I used MSVC 2012, Win 7 64
Assembly for 3x 700-element array technique:
start = __rdtscp(&x);
rdtscp
shl rdx,20h
lea r8,[this]
or rax,rdx
mov dword ptr [r8],ecx
mov r8d,8ch
mov r9,rax
lea rdx,[rbx+0Ch]
for(int i=0; i < 700; i++){
sub rdi,rbx
unsigned int a = array1[i];
unsigned int b = array2[i];
data_for_all_items = data_for_all_items & (a != -1 & b != -1);
cmp dword ptr [rdi+rdx-0Ch],0FFFFFFFFh
lea rdx,[rdx+14h]
setne cl
cmp dword ptr [rdi+rdx-1Ch],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdi+rdx-18h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdi+rdx-10h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdi+rdx-14h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-20h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-1Ch],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-18h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-10h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-14h],0FFFFFFFFh
setne al
and cl,al
and r15b,cl
dec r8
jne 013F26DA53h
unsigned int e = array3[i];
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
rdtscp
shl rdx,20h
lea r8,[y]
or rax,rdx
mov dword ptr [r8],ecx
Assembler for the 2100-element array technique:
start = __rdtscp(&x);
rdtscp
lea r8,[this]
shl rdx,20h
or rax,rdx
mov dword ptr [r8],ecx
for(int i=0; i < 700; i++){
xor r8d,r8d
mov r10,rax
unsigned short j = i*3;
movzx ecx,r8w
add cx,cx
lea edx,[rcx+r8]
unsigned int a = array[j + 0];
unsigned int b = array[j + 1];
data_for_all_items = data_for_all_items & (best_ask != -1 & best_bid != -1);
movzx ecx,dx
cmp dword ptr [r9+rcx*4+4],0FFFFFFFFh
setne dl
cmp dword ptr [r9+rcx*4],0FFFFFFFFh
setne al
inc r8d
and dl,al
and r14b,dl
cmp r8d,2BCh
jl 013F05DA10h
unsigned int e = array[pos + 2];
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
rdtscp
shl rdx,20h
lea r8,[y]
or rax,rdx
mov dword ptr [r8],ecx
Edit: Given your assembly code, the second loop is unrolled five times. The unrolled version could run faster on an out-of-order-execution CPU such as any modern x86/x86-64 CPU.
The second code is vectorisable - two elements of each array could be loaded at each iteration in one XMM register each. Since modern CPUs use SSE for both scalar and vector FP arithmetic, this cuts the number of cycles roughly in half. With an AVX-capable CPU, four doubles could be loaded into a YMM register, and therefore the number of cycles should be cut to a quarter.
The first loop is not vectorisable along i, since the value of a in iteration i+1 comes from a location 3 elements after the one where the value of a in iteration i comes from. In that case vectorisation requires gathered vector loads, and those are only supported in the AVX2 instruction set (see the sketch below).
Using proper data structures is crucial when programming CPUs with vector capabilities. Converting code like your first loop into something like your second loop is 90% of the job that one has to do in order to get good performance on the Intel Xeon Phi, which has very wide vector registers but an awfully slow in-order execution engine.
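To make the gathered-load remark concrete, here is an illustrative AVX2 sketch of my own (not from the answer): _mm256_i32gather_epi32 pulls eight stride-3 elements of the interleaved 2100-element array into one register (integer accumulators are used for brevity). A complete version would still need the 700 % 8 tail and the horizontal reductions, and on most CPUs a gather is slow enough that rearranging the data, as discussed below, remains the better fix.
#include <cstdint>
#include <immintrin.h>

void gather_sketch(const unsigned int* array /* 2100 interleaved a,b,e values */)
{
    const __m256i idx    = _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21);
    const __m256i minus1 = _mm256_set1_epi32(-1);
    __m256i cv  = _mm256_setzero_si256();
    __m256i dv  = _mm256_setzero_si256();
    __m256i bad = _mm256_setzero_si256();
    for (int item = 0; item + 8 <= 700; item += 8)
    {
        const int* base = (const int*)(array + 3 * item);
        __m256i a = _mm256_i32gather_epi32(base,     idx, 4); // eight 'a' fields
        __m256i b = _mm256_i32gather_epi32(base + 1, idx, 4); // eight 'b' fields
        __m256i e = _mm256_i32gather_epi32(base + 2, idx, 4); // eight 'e' fields
        // Remember whether any a or b was -1 (the "missing data" marker).
        bad = _mm256_or_si256(bad, _mm256_or_si256(_mm256_cmpeq_epi32(a, minus1),
                                                   _mm256_cmpeq_epi32(b, minus1)));
        cv = _mm256_add_epi32(cv, _mm256_mullo_epi32(a, e)); // accumulates a*e
        dv = _mm256_add_epi32(dv, _mm256_mullo_epi32(b, e)); // accumulates b*e
    }
    bool data_for_all_items = _mm256_testz_si256(bad, bad) != 0;
    (void)data_for_all_items; (void)cv; (void)dv; // reductions omitted in this sketch
}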
The simple answer is that version 1 is SIMD friendly and version 2 is not. However, it's possible to make version 2, the 2100-element array, SIMD friendly. You need to use a hybrid Struct of Arrays, aka an Array of Structs of Arrays (AoSoA). You arrange the array like this: aaaa bbbb eeee aaaa bbbb eeee ....
Below is code using GCC's vector extensions to do this. Note that now the 2100 element array code looks almost the same as the 700 element array code but it uses one array instead of three. And instead of having 700 elements between a b and e there are only 12 elements between them.
I did not find an easy way to convert uint4 to double4 with the GCC vector extensions, and I don't want to spend the time writing intrinsics to do it right now, so I made the accumulators unsigned int vectors - but for performance I would not want to be converting uint4 to double4 in a loop anyway.
typedef unsigned int uint4 __attribute__ ((vector_size (16)));
//typedef double double4 __attribute__ ((vector_size (32)));
uint4 zero = {};
unsigned int array[2100];
uint4 test = -1 + zero;
//double4 cv = {};
//double4 dv = {};
uint4 cv = {};
uint4 dv = {};
uint4* av = (uint4*)&array[0];
uint4* bv = (uint4*)&array[4];
uint4* ev = (uint4*)&array[8];
for(int i=0; i < 525; i+=3) { //525 = 2100/4 = 700/4*3
test = test & ((av[i]!= -1) & (bv[i] != -1));
cv += (av[i] * ev[i]);
dv += (bv[i] * ev[i]);
}
double c = cv[0] + cv[1] + cv[2] + cv[3];
double d = dv[0] + dv[1] + dv[2] + dv[3];
bool data_for_all_items = test[0] & test[1] & test[2] & test[3];
The concept of 'spatial locality' is throwing you off a little bit. Chances are that with both solutions, your processor is doing its best to cache the arrays.
Unfortunately, the version of your code that uses one array also performs some extra math (the index calculation). This is probably where your extra cycles are being spent.
Spatial locality is indeed useful, but it's actually helping you on the second case (3 distinct arrays) much more.
The cache-line size is 64 bytes (note that it is not divisible by 3), so a single access to a 4- or 8-byte value effectively prefetches the next elements. In addition, keep in mind that the CPU HW prefetcher is likely to go on and prefetch even further elements ahead.
However, when a, b, e are packed together, you're "wasting" this valuable prefetching on elements of the same iteration. When you access a, there's no point in prefetching b and e - the next loads are already going there (and would likely just merge in the CPU with the first load or wait for it to retrieve the data). In fact, when the arrays are merged, you fetch a new memory line only once per 64/(3*4) = ~5.3 iterations. The bad alignment even means that on some iterations you'll have a and maybe b long before you get e; this imbalance is usually bad news.
In reality, since the iterations are independent, your CPU would go ahead and start the second iteration relatively fast thanks to the combination of loop unrolling (in case it was done) and out-of-order execution (calculating the index for the next set of iterations is simple and has no dependencies on the loads sent by the last ones). However, you would have to run ahead pretty far in order to issue the next load every time, and eventually the finite size of the CPU instruction queues will block you, maybe before reaching the full potential memory bandwidth (number of parallel outstanding loads).
The alternative option, on the other hand, where you have 3 distinct arrays, uses the spatial locality / HW prefetching solely across iterations. On each iteration, you'll issue 3 loads, which would fetch a full line once every 64/4 = 16 iterations. The overall data fetched is the same (well, it's the same data), but the timeliness is much better because you fetch ahead for the next 16 iterations instead of ~5. The difference becomes even bigger when HW prefetching is involved, because you have 3 streams instead of one, meaning you can issue more prefetches (and look even further ahead).

Optimizing Bitwise Logic

In my code the following lines are currently the hotspot:
int table1[256] = /*...*/;
int table2[512] = /*...*/;
int table3[512] = /*...*/;
int* result = /*...*/;
for(int r = 0; r < r_end; ++r)
{
std::uint64_t bits = bit_reader.value(); // 64 bits, no assumption regarding bits.
// The get_ functions are table lookups from the highest word of the bits variable.
struct entry
{
int sign_offset : 5;
int r_offset : 4;
int x : 7;
};
// NOTE: We are only interested in the highest word in the bits variable.
entry e;
if(is_in_table1(bits)) // branch prediction should work well here since table1 will be hit more often than 2 or 3, and 2 more often than 3.
e = reinterpret_cast<const entry&>(table1[get_table1_index(bits)]);
else if(is_in_table2(bits))
e = reinterpret_cast<const entry&>(table2[get_table2_index(bits)]);
else
e = reinterpret_cast<const entry&>(table3[get_table3_index(bits)]);
r += e.r_offset; // r is 18 bits, top 14 bits are always 0.
int x = e.x; // x is 14 bits, top 18 bits are always 0.
int sign_offset = e.sign_offset;
assert(sign_offset <= 16 && sign_offset > 0);
// The following is the hotspot.
int sign = 1 - (bits >> (63 - sign_offset) & 0x2);
(*result++) = ((x << 18) * sign) | r; // 32 bits
// End of hotspot
bit_reader.skip(sign_offset); // sign_offset is the last bit used.
}
Though I haven't figured out how to further optimize this, maybe something from intrinsics for Operations at Bit-Granularity, __shiftleft128 or _rot could be useful?
Note that I am also doing processing of the resulting data on the GPU, so the important thing is to get something into result which the GPU can then use to compute the correct values.
Suggestions?
EDIT:
Added table look-up.
EDIT:
int sign = 1 - (bits >> (63 - e.sign_offset) & 0x2);
000000013FD6B893 and ecx,1Fh
000000013FD6B896 mov eax,3Fh
000000013FD6B89B sub eax,ecx
000000013FD6B89D movzx ecx,al
000000013FD6B8A0 shr r8,cl
000000013FD6B8A3 and r8d,2
000000013FD6B8A7 mov r14d,1
000000013FD6B8AD sub r14d,r8d
I overlooked the fact that the sign is +/-1, so I'm correcting my answer.
Assuming that mask is an array with properly defined bitmasks for all possible values of sign_offset, this approach might be faster
bool sign = (bits & mask[sign_offset]) != 0;
__int64 result = r;
if (sign)
result |= -(x << 18);
else
result |= x << 18;
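For reference, "properly defined" presumably means mask[k] isolates the same bit the original expression tests, i.e. bit 64 - k of bits (the question guarantees 0 < sign_offset <= 16); a possible initialisation, my assumption rather than the answerer's code:
#include <cstdint>

uint64_t mask[17]; // indexed directly by sign_offset, entry 0 unused

void init_masks()
{
    // mask[k] selects the bit that (bits >> (63 - k)) & 0x2 would test.
    for (int k = 1; k <= 16; ++k)
        mask[k] = 1ull << (64 - k);
}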
The code generated by VC2010 optimized build
OP code (11 instructions)
; 23 : __int64 sign = 1 - (bits >> (63 - sign_offset) & 0x2);
mov rax, QWORD PTR bits$[rsp]
mov ecx, 63 ; 0000003fH
sub cl, BYTE PTR sign_offset$[rsp]
mov edx, 1
sar rax, cl
; 24 : __int64 result = ((x << 18) * sign) | r; // 32 bits
; 25 : std::cout << result;
and eax, 2
sub rdx, rax
mov rax, QWORD PTR x$[rsp]
shl rax, 18
imul rdx, rax
or rdx, QWORD PTR r$[rsp]
My code (8 instructions)
; 34 : bool sign = (bits & mask[sign_offset]) != 0;
mov r11, QWORD PTR sign_offset$[rsp]
; 35 : __int64 result = r;
; 36 : if (sign)
; 37 : result |= -(x << 18);
mov rdx, QWORD PTR x$[rsp]
mov rax, QWORD PTR mask$[rsp+r11*8]
shl rdx, 18
test rax, QWORD PTR bits$[rsp]
je SHORT $LN2#Test1
neg rdx
$LN2#Test1:
; 38 : else
; 39 : result |= x << 18;
or rdx, QWORD PTR r$[rsp]
EDIT by Skizz
To get rid of branch:
shl rdx, 18
lea rbx,[rdx*2]
xor rcx,rcx ; cmov cannot take an immediate, so keep a zero in a scratch register
test rax, QWORD PTR bits$[rsp]
cmove rbx,rcx
sub rdx,rbx
or rdx, QWORD PTR r$[rsp]
Let's do some equivalent transformations:
int sign = 1 - (bits >> (63 - sign_offset) & 0x2);
int result = ((x << 18) * sign) | r; // 32 bits
Perhaps the processor will find shifting 32-bit values cheaper -- replace the definition of HIDWORD with whatever leads to direct access to the high-order DWORD without shifting. Also, for preparation of the next step, let's rearrange the shifting in the second assignment:
#define HIDWORD(q) ((uint32_t)((q) >> 32))
int sign = 1 - (HIDWORD(bits) >> (31 - sign_offset) & 0x2);
int result = ((x * sign) << 18) | r; // 32 bits
Observe that, in two's complement, q * (-1) equals ~q + 1, or (q ^ -1) - (-1), while q * 1 equals (q ^ 0) - 0. This justifies the second transformation, which gets rid of the nasty multiplication:
int mask = -(HIDWORD(bits) >> (32 - sign_offset) & 0x1);
int result = (((x ^ mask) - mask) << 18) | r; // 32 bits
Now let's rearrange shifting again:
int mask = (-(HIDWORD(bits) >> (32 - sign_offset) & 0x1)) << 18;
int result = (((x << 18) ^ mask) - mask) | r; // 32 bits
Recall the identity concerning - and ~:
int mask = (~(HIDWORD(bits) >> (32 - sign_offset) & 0x1) + 1) << 18;
Shift rearrangement again:
int mask = ((~(HIDWORD(bits) >> (32 - sign_offset) & 0x1)) << 18) + (1 << 18);
Who can finally unfiddle this? (Are the transformations correct anyway?)
(Note that only profiling on a real CPU can assess the performance. Measures like instruction count won't do. I am not even sure that the transformations helped at all.)
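To at least answer the correctness question empirically, here is a small brute-force check of my own (not the original poster's) that compares the original sign computation with the final mask-based form over random inputs respecting the question's ranges; like the original code, it relies on the usual two's-complement wraparound behaviour.
#include <cstdint>
#include <cstdio>
#include <random>

#define HIDWORD(q) ((uint32_t)((q) >> 32))

int main()
{
    std::mt19937_64 rng{1};
    for (int iter = 0; iter < 1000000; ++iter) {
        uint64_t bits = rng();
        int x = (int)(rng() & 0x3FFF);   // x is 14 bits
        int r = (int)(rng() & 0x3FFFF);  // r is 18 bits
        for (int sign_offset = 1; sign_offset <= 16; ++sign_offset) {
            // Original formulation
            int sign = 1 - (int)(bits >> (63 - sign_offset) & 0x2);
            int ref  = ((x << 18) * sign) | r;
            // Final mask-based formulation
            int mask = (-(int)(HIDWORD(bits) >> (32 - sign_offset) & 0x1)) << 18;
            int alt  = (((x << 18) ^ mask) - mask) | r;
            if (ref != alt) {
                std::printf("mismatch at sign_offset=%d\n", sign_offset);
                return 1;
            }
        }
    }
    std::printf("all transformed results match\n");
    return 0;
}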
Memory access is usually the root of all optimisation problems on modern CPUs. You are being misled by the performance tools as to where the slowdown is happening. The compiler is probably re-ordering the code to something like this:-
int sign = 1 - (bits >> (63 - get_sign_offset(bits)) & 0x2);
(*result++) = ((get_x(bits) << 18) * sign) | (r += get_r_offset(bits));
or even:-
(*result++) = ((get_x(bits) << 18) * (1 - (bits >> (63 - get_sign_offset(bits)) & 0x2))) | (r += get_r_offset(bits));
This would highlight the lines you identified as being the hotspot.
I would look at the way you organise your memory and at what the various get_ functions do. Can you post the get_ functions at all?
To calculate the sign, I would suggest this:
int sign = (int)(((int64_t)(bits << sign_offset)) >> 63);
Which is only 2 instructions (shl and sar).
If sign_offset is one bigger than I expected:
int sign = (int)(((int64_t)(bits << (sign_offset - 1))) >> 63);
Which is still not bad. Should be only 3 instructions.
That gives an answer as 0 or -1, with which you can do this:
(*result++) = (((x << 18) ^ sign) - sign) | r;
I think this is the fastest solution:
*result++ = (_rotl64(bits, sign_offset) << 31) | (x << 18) | (r << 0); // 32 bits
And then correct x depending on whether the sign bit is set or not on the GPU.

How can I optimize conversion from half-precision float16 to single-precision float32?

I'm trying to improve the performance of my function. The profiler points to the code in the inner loop. Can I improve the performance of that code, maybe using SSE intrinsics?
void ConvertImageFrom_R16_FLOAT_To_R32_FLOAT(char* buffer, void* convertedData, DWORD width, DWORD height, UINT rowPitch)
{
struct SINGLE_FLOAT
{
union {
struct {
unsigned __int32 R_m : 23;
unsigned __int32 R_e : 8;
unsigned __int32 R_s : 1;
};
struct {
float r;
};
};
};
C_ASSERT(sizeof(SINGLE_FLOAT) == 4); // 4 bytes
struct HALF_FLOAT
{
unsigned __int16 R_m : 10;
unsigned __int16 R_e : 5;
unsigned __int16 R_s : 1;
};
C_ASSERT(sizeof(HALF_FLOAT) == 2);
SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData;
for(DWORD j = 0; j< height; j++)
{
HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
for(DWORD i = 0; i< width; i++)
{
d->R_s = s->R_s;
d->R_e = s->R_e - 15 + 127;
d->R_m = s->R_m << (23-10);
d++;
s++;
}
}
}
Update:
Disassembly
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE Utils.cpp
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC ?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z ; ConvertImageFrom_R16_FLOAT_To_R32_FLOAT
; Function compile flags: /Ogtp
; COMDAT ?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z
_TEXT SEGMENT
_buffer$ = 8 ; size = 4
tv83 = 12 ; size = 4
_convertedData$ = 12 ; size = 4
_width$ = 16 ; size = 4
_height$ = 20 ; size = 4
_rowPitch$ = 24 ; size = 4
?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z PROC ; ConvertImageFrom_R16_FLOAT_To_R32_FLOAT, COMDAT
; 323 : {
push ebp
mov ebp, esp
; 343 : for(DWORD j = 0; j< height; j++)
mov eax, DWORD PTR _height$[ebp]
push esi
mov esi, DWORD PTR _convertedData$[ebp]
test eax, eax
je SHORT $LN4@ConvertIma
; 324 : union SINGLE_FLOAT {
; 325 : struct {
; 326 : unsigned __int32 R_m : 23;
; 327 : unsigned __int32 R_e : 8;
; 328 : unsigned __int32 R_s : 1;
; 329 : };
; 330 : struct {
; 331 : float r;
; 332 : };
; 333 : };
; 334 : C_ASSERT(sizeof(SINGLE_FLOAT) == 4);
; 335 : struct HALF_FLOAT
; 336 : {
; 337 : unsigned __int16 R_m : 10;
; 338 : unsigned __int16 R_e : 5;
; 339 : unsigned __int16 R_s : 1;
; 340 : };
; 341 : C_ASSERT(sizeof(HALF_FLOAT) == 2);
; 342 : SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData;
push ebx
mov ebx, DWORD PTR _buffer$[ebp]
push edi
mov DWORD PTR tv83[ebp], eax
$LL13@ConvertIma:
; 344 : {
; 345 : HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
; 346 : for(DWORD i = 0; i< width; i++)
mov edi, DWORD PTR _width$[ebp]
mov edx, ebx
test edi, edi
je SHORT $LN5@ConvertIma
npad 1
$LL3@ConvertIma:
; 347 : {
; 348 : d->R_s = s->R_s;
movzx ecx, WORD PTR [edx]
movzx eax, WORD PTR [edx]
shl ecx, 16 ; 00000010H
xor ecx, DWORD PTR [esi]
shl eax, 16 ; 00000010H
and ecx, 2147483647 ; 7fffffffH
xor ecx, eax
mov DWORD PTR [esi], ecx
; 349 : d->R_e = s->R_e - 15 + 127;
movzx eax, WORD PTR [edx]
shr eax, 10 ; 0000000aH
and eax, 31 ; 0000001fH
add eax, 112 ; 00000070H
shl eax, 23 ; 00000017H
xor eax, ecx
and eax, 2139095040 ; 7f800000H
xor eax, ecx
mov DWORD PTR [esi], eax
; 350 : d->R_m = s->R_m << (23-10);
movzx ecx, WORD PTR [edx]
and ecx, 1023 ; 000003ffH
shl ecx, 13 ; 0000000dH
and eax, -8388608 ; ff800000H
or ecx, eax
mov DWORD PTR [esi], ecx
; 351 : d++;
add esi, 4
; 352 : s++;
add edx, 2
dec edi
jne SHORT $LL3@ConvertIma
$LN5@ConvertIma:
; 343 : for(DWORD j = 0; j< height; j++)
add ebx, DWORD PTR _rowPitch$[ebp]
dec DWORD PTR tv83[ebp]
jne SHORT $LL13@ConvertIma
pop edi
pop ebx
$LN4@ConvertIma:
pop esi
; 353 : }
; 354 : }
; 355 : }
pop ebp
ret 0
?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z ENDP ; ConvertImageFrom_R16_FLOAT_To_R32_FLOAT
_TEXT ENDS
The x86 F16C instruction-set extension adds hardware support for converting single-precision float vectors to/from vectors of half-precision float.
The format is the same IEEE 754 half-precision binary16 that you describe. I didn't check that the endianness is the same as your struct, but that's easy to fix if needed (with a pshufb).
F16C is supported starting from Intel IvyBridge and AMD Piledriver. (And has its own CPUID feature bit, which your code should check for, otherwise fall back to SIMD integer shifts and shuffles).
The intrinsics for VCVTPS2PH are:
__m128i _mm_cvtps_ph ( __m128 m1, const int imm);
__m128i _mm256_cvtps_ph(__m256 m1, const int imm);
The immediate byte is a rounding control. The compiler can use it as a convert-and-store directly to memory (unlike most instructions that can optionally use a memory operand, where it's the source operand that can be memory instead of a register.)
VCVTPH2PS goes the other way, and is just like most other SSE instructions (can be used between registers or as a load).
__m128 _mm_cvtph_ps ( __m128i m1);
__m256 _mm256_cvtph_ps ( __m128i m1)
F16C is so efficient that you might want to consider leaving your image in half-precision format, and converting on the fly every time you need a vector of data from it. This is great for your cache footprint.
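A minimal conversion loop using these intrinsics might look like the sketch below (my code, not checked against the exact buffer layout in the question; it assumes width is a multiple of 8 and that the F16C CPUID bit has already been verified; compile with -mf16c on GCC/Clang).
#include <cstdint>
#include <immintrin.h>

// Sketch: convert one row of 'width' half-precision floats to single precision
// with F16C. Assumes width % 8 == 0 and a prior CPUID check for F16C support.
void ConvertRow_F16C(const uint16_t* src, float* dst, size_t width)
{
    for (size_t i = 0; i < width; i += 8)
    {
        __m128i half8  = _mm_loadu_si128((const __m128i*)(src + i)); // 8 x binary16
        __m256  float8 = _mm256_cvtph_ps(half8);                     // 8 x binary32
        _mm256_storeu_ps(dst + i, float8);
    }
}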
Accessing bitfields in memory can be really tricky, depending on the architecture, of course.
You might achieve better performance if you made a union of a float and a 32-bit integer, and simply performed all decomposition and composition using local variables. That way the generated code could perform the entire operation using only processor registers.
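As a sketch of that suggestion (my code, not the answerer's): the same rebias-and-shift arithmetic as the original bitfield code, but done on a plain 32-bit integer held in a register, with only the finished word stored. Like the original, it ignores zeros, denormals, Inf and NaN.
#include <cstdint>
#include <cstring>

static inline float HalfToFloat(uint16_t h)
{
    uint32_t sign     = (uint32_t)(h & 0x8000u) << 16;                  // bit 15 -> bit 31
    uint32_t exponent = (((uint32_t)(h >> 10) & 0x1Fu) + 112u) << 23;   // e - 15 + 127
    uint32_t mantissa = ((uint32_t)h & 0x3FFu) << 13;                   // 10 -> 23 bits
    uint32_t bits = sign | exponent | mantissa;
    float f;
    std::memcpy(&f, &bits, sizeof f);   // or the float/int union the answer describes
    return f;
}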
The loop iterations are independent of each other, so you could easily parallelize this code, either by using SIMD or OpenMP; a simple version would split the top half and the bottom half of the image between two threads running concurrently.
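A minimal sketch of that row-splitting idea with OpenMP (my example, not the answerer's): each thread converts a contiguous band of rows, and since rows are independent no synchronisation is needed. SINGLE_FLOAT, HALF_FLOAT and the parameters are reused from the question; compile with /openmp (MSVC) or -fopenmp (GCC/Clang).
void ConvertImageFrom_R16_FLOAT_To_R32_FLOAT_MT(char* buffer, void* convertedData,
                                                DWORD width, DWORD height, UINT rowPitch)
{
    SINGLE_FLOAT* base = (SINGLE_FLOAT*)convertedData;
    #pragma omp parallel for
    for (int j = 0; j < (int)height; j++)          // OpenMP 2.0 needs a signed counter
    {
        HALF_FLOAT* s = (HALF_FLOAT*)(buffer + rowPitch * j);
        SINGLE_FLOAT* d = base + (size_t)j * width; // each row is written independently
        for (DWORD i = 0; i < width; i++)
        {
            d->R_s = s->R_s;
            d->R_e = s->R_e - 15 + 127;
            d->R_m = s->R_m << (23 - 10);
            d++;
            s++;
        }
    }
}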
You're processing the data as a two dimension array. If you consider how it's laid out in memory you may be able to process it as a single dimensional array and you can save a little overhead by having one loop instead of nested loops.
I'd also compile to assembly code and make sure the compiler optimization worked and it isn't recalculating (15 + 127) hundreds of times.
You should be able to reduce this to a single instruction on chips which use the upcoming CVT16 instruction set. According to that Wikipedia article:
The CVT16 instructions allow conversion of floating point vectors between single precision and half precision.
SSE Intrinsics seem to be an excellent idea. Before you go down that road, you should
look at the assembly code generated by the compiler, (is there potential for optimization?)
search your compiler documentation for how to generate SSE code automatically,
search your software library's documentation (or wherever the 16bit float type originated) for a function to bulk convert this type. (a conversion to 64bit floating point could be helpful too.) You are very likely not the first person to encounter this problem!
If all that fails, go and try your luck with some SSE intrinsics. To get some idea, here is some SSE code to convert from 32 to 16 bit floating point. (you want the reverse)
Besides SSE you should also consider multi-threading and offloading the task to the GPU.
Here are some ideas:
Put the constants into const register variables.
Some processors don't like fetching constants from memory; it is awkward and may take many instruction cycles.
Loop Unrolling
Repeat the statements in the loop, and increase the increment.
Processors prefer continuous instructions; jumps and branches anger them.
Data Prefetching (or loading the cache)
Use more variables in the loop, and declare them as volatile so the compiler doesn't optimize them:
SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData;
// Assumes height is a multiple of 4 and the output rows are packed (width floats per row).
for(DWORD j = 0; j < height; j += 4)
{
HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
HALF_FLOAT* s1 = (HALF_FLOAT*)((char*)buffer + rowPitch * (j + 1));
HALF_FLOAT* s2 = (HALF_FLOAT*)((char*)buffer + rowPitch * (j + 2));
HALF_FLOAT* s3 = (HALF_FLOAT*)((char*)buffer + rowPitch * (j + 3));
SINGLE_FLOAT* d1 = d + width;
SINGLE_FLOAT* d2 = d + 2 * width;
SINGLE_FLOAT* d3 = d + 3 * width;
for(DWORD i = 0; i < width; i++)
{
d->R_s = s->R_s;
d->R_e = s->R_e - 15 + 127;
d->R_m = s->R_m << (23-10);
d1->R_s = s1->R_s;
d1->R_e = s1->R_e - 15 + 127;
d1->R_m = s1->R_m << (23-10);
d2->R_s = s2->R_s;
d2->R_e = s2->R_e - 15 + 127;
d2->R_m = s2->R_m << (23-10);
d3->R_s = s3->R_s;
d3->R_e = s3->R_e - 15 + 127;
d3->R_m = s3->R_m << (23-10);
d++;
d1++;
d2++;
d3++;
s++;
s1++;
s2++;
s3++;
}
d += 3 * width; // d already advanced by one row inside the loop; skip the other three rows
}
I don't know about SSE intrinsics but it would be interesting to see a disassembly of your inner loop. An old-school way (that may not help much but that would be easy to try out) would be to reduce the number of iterations by doing two inner loops: one that does N (say 32) repeats of the processing (loop count of width/N) and then one to finish the remainder (loop count of width%N)... with those divs and modulos calculated outside the first loop to avoid recalculating them. Apologies if that sounds obvious!
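A quick sketch of that two-inner-loop structure (my code; convert_one stands for the three bitfield assignments from the question, and N, widthDivN and widthModN are names I made up):
const DWORD N = 32;
const DWORD widthDivN = width / N;   // computed once, outside the inner loops
const DWORD widthModN = width % N;
for (DWORD block = 0; block < widthDivN; ++block)
    for (DWORD k = 0; k < N; ++k, ++d, ++s)
        convert_one(d, s);           // the three R_s/R_e/R_m assignments
for (DWORD k = 0; k < widthModN; ++k, ++d, ++s)
    convert_one(d, s);               // finish the remainder of the row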
The function is only doing a few small things. It is going to be tough to shave much off the time by optimisation, but as somebody already said, parallelisation has promise.
Check how many cache misses you are getting. If the data is paging in and out, you might be able to speed it up by applying more intelligence into the ordering to minimise cache swaps.
Also consider macro-optimisations. Are there any redundancies in the data computation that might be avoided (e.g. caching old results instead of recomputing them when needed)? Do you really need to convert the whole data set or could you just convert the bits you need? I don't know your application so I'm just guessing wildly here, but there might be scope for that kind of optimisation.
My suspicion is that this operation will already be bottlenecked on memory access, and making it more efficient (e.g., using SSE) would not make it execute more quickly. However, this is only a suspicion.
Other things to try, assuming x86/x64, might be:
Don't d++ and s++, but use d[i] and s[i] on each iteration. (Then of course bump d after each scanline.) Since the elements of d are 4 bytes and those of s 2, this operation can be folded into the address calculation. (Unfortunately I can't guarantee that this would necessarily make execution more efficient.)
Remove the bitfield operations and do the operations manually. (When extracting, shift first and mask second, to maximize the likelihood that the mask can fit into a small immediate value.)
Unroll the loop, though with a loop as easily-predicted as this one it might not make much difference.
Count along each line from width down to zero. This stops the compiler having to fetch width each time round. Probably more important for x86, because it has so few registers. (If the CPU likes my "d[i] and s[i]" suggestion, you could make width signed, count from width-1 instead, and walk backwards.)
These would all be quicker to try than converting to SSE and would hopefully make it memory-bound, if it isn't already, at which point you can give up.
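A sketch combining the first two suggestions above - s[i]/d[i] indexing and manual shift-then-mask extraction instead of bitfields (my code, not the answerer's, reusing the question's parameters):
for (DWORD j = 0; j < height; j++)
{
    const uint16_t* s = (const uint16_t*)(buffer + rowPitch * j);
    uint32_t* d = (uint32_t*)convertedData + (size_t)j * width;
    for (DWORD i = 0; i < width; i++)
    {
        uint32_t h = s[i];
        d[i] = ((h & 0x8000u) << 16)                  // sign
             | ((((h >> 10) & 0x1Fu) + 112u) << 23)   // exponent, rebias -15 + 127
             | ((h & 0x3FFu) << 13);                  // mantissa, 10 -> 23 bits
    }
}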
Finally, if the output is in write-combined memory (e.g., it's a texture or vertex buffer or something accessed over AGP, or PCI Express, or whatever it is PCs have these days) then this could well result in poor performance, depending on what code the compiler has generated for the inner loop. So if that is the case, you may get better results by converting each scanline into a local buffer and then using memcpy to copy it to its final destination.