Optimizing Bitwise Logic - C++

In my code the following lines are currently the hotspot:
int table1[256] = /*...*/;
int table2[512] = /*...*/;
int table3[512] = /*...*/;
int* result = /*...*/;
for(int r = 0; r < r_end; ++r)
{
std::uint64_t bits = bit_reader.value(); // 64 bits, no assumption regarding bits.
// The get_ functions are table lookups from the highest word of the bits variable.
struct entry
{
int sign_offset : 5;
int r_offset : 4;
int x : 7;
};
// NOTE: We are only interested in the highest word in the bits variable.
entry e;
if(is_in_table1(bits)) // branch prediction should work well here since table1 will be hit more often than 2 or 3, and 2 more often than 3.
e = reinterpret_cast<const entry&>(table1[get_table1_index(bits)]);
else if(is_in_table2(bits))
e = reinterpret_cast<const entry&>(table2[get_table2_index(bits)]);
else
e = reinterpret_cast<const entry&>(table3[get_table3_index(bits)]);
r += e.r_offset; // r is 18 bits, top 14 bits are always 0.
int x = e.x; // x is 14 bits, top 18 bits are always 0.
int sign_offset = e.sign_offset;
assert(sign_offset <= 16 && sign_offset > 0);
// The following is the hotspot.
int sign = 1 - (bits >> (63 - sign_offset) & 0x2);
(*result++) = ((x << 18) * sign) | r; // 32 bits
// End of hotspot
bit_reader.skip(sign_offset); // sign_offset is the last bit used.
}
I haven't figured out how to optimize this further; maybe something from the intrinsics for bit-granularity operations, such as __shiftleft128 or the _rot family, could be useful?
Note that I am also processing the resulting data on the GPU, so the important thing is to get something into result which the GPU can then use to compute the correct value.
Suggestions?
EDIT:
Added table look-up.
EDIT:
int sign = 1 - (bits >> (63 - e.sign_offset) & 0x2);
000000013FD6B893 and ecx,1Fh
000000013FD6B896 mov eax,3Fh
000000013FD6B89B sub eax,ecx
000000013FD6B89D movzx ecx,al
000000013FD6B8A0 shr r8,cl
000000013FD6B8A3 and r8d,2
000000013FD6B8A7 mov r14d,1
000000013FD6B8AD sub r14d,r8d

I overlooked the fact that the sign is +/-1, so I'm correcting my answer.
Assuming that mask is an array with properly defined bitmasks for all possible values of sign_offset, this approach might be faster
bool sign = (bits & mask[sign_offset]) != 0;
__int64 result = r;
if (sign)
result |= -(x << 18);
else
result |= x << 18;
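The mask table assumed above could be defined, for example, like this (a sketch; indices follow the asserted 1..16 range of sign_offset and select bit 64 - sign_offset, the same bit the original expression tests):
const std::uint64_t mask[17] = {
0, // index 0 unused, sign_offset > 0
1ull << 63, 1ull << 62, 1ull << 61, 1ull << 60,
1ull << 59, 1ull << 58, 1ull << 57, 1ull << 56,
1ull << 55, 1ull << 54, 1ull << 53, 1ull << 52,
1ull << 51, 1ull << 50, 1ull << 49, 1ull << 48,
};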
The code generated by VC2010 optimized build
OP code (11 instructions)
; 23 : __int64 sign = 1 - (bits >> (63 - sign_offset) & 0x2);
mov rax, QWORD PTR bits$[rsp]
mov ecx, 63 ; 0000003fH
sub cl, BYTE PTR sign_offset$[rsp]
mov edx, 1
sar rax, cl
; 24 : __int64 result = ((x << 18) * sign) | r; // 32 bits
; 25 : std::cout << result;
and eax, 2
sub rdx, rax
mov rax, QWORD PTR x$[rsp]
shl rax, 18
imul rdx, rax
or rdx, QWORD PTR r$[rsp]
My code (8 instructions)
; 34 : bool sign = (bits & mask[sign_offset]) != 0;
mov r11, QWORD PTR sign_offset$[rsp]
; 35 : __int64 result = r;
; 36 : if (sign)
; 37 : result |= -(x << 18);
mov rdx, QWORD PTR x$[rsp]
mov rax, QWORD PTR mask$[rsp+r11*8]
shl rdx, 18
test rax, QWORD PTR bits$[rsp]
je SHORT $LN2@Test1
neg rdx
$LN2@Test1:
; 38 : else
; 39 : result |= x << 18;
or rdx, QWORD PTR r$[rsp]
EDIT by Skizz
To get rid of the branch (cmov cannot take an immediate operand, so a zeroed register is used instead):
xor ecx, ecx
shl rdx, 18
lea rbx,[rdx*2]
test rax, QWORD PTR bits$[rsp]
cmove rbx,rcx
sub rdx,rbx
or rdx, QWORD PTR r$[rsp]

Let's do some equivalent transformations:
int sign = 1 - (bits >> (63 - sign_offset) & 0x2);
int result = ((x << 18) * sign) | r; // 32 bits
Perhaps the processor finds shifting 32-bit values cheaper -- replace the definition of HIDWORD with whatever gives direct access to the high-order DWORD without shifting. Also, in preparation for the next step, let's rearrange the shifting in the second assignment:
#define HIDWORD(q) ((uint32_t)((q) >> 32))
int sign = 1 - (HIDWORD(bits) >> (31 - sign_offset) & 0x2);
int result = ((x * sign) << 18) | r; // 32 bits
Observe that, in two's complement, q * (-1) equals ~q + 1, or (q ^ -1) - (-1), while q * 1 equals (q ^ 0) - 0. This justifies the second transformation, which gets rid of the nasty multiplication:
int mask = -(HIDWORD(bits) >> (32 - sign_offset) & 0x1);
int result = (((x ^ mask) - mask) << 18) | r; // 32 bits
Now let's rearrange shifting again:
int mask = (-(HIDWORD(bits) >> (32 - sign_offset) & 0x1)) << 18;
int result = (((x << 18) ^ mask) - mask) | r; // 32 bits
Recall the identity concerning - and ~:
int mask = (~(HIDWORD(bits) >> (32 - sign_offset) & 0x1) + 1) << 18;
Shift rearrangement again:
int mask = ((~(HIDWORD(bits) >> (32 - sign_offset) & 0x1)) << 18) + (1 << 18);
Who can finally unfiddle this? (Are the transformations correct anyway?)
(Note that only profiling on a real CPU can assess the performance; measures like instruction count won't do. I am not even sure that the transformations helped at all.)
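One way to check the algebra (not the speed) is a brute-force comparison of the branch-free form against the original expression. A self-contained sketch, using the bit widths hinted at in the question and a small xorshift generator so the tested sign bits actually vary:
#include <cstdint>
#include <cassert>

static uint64_t rng_state = 0x123456789ABCDEF0ull;
static uint64_t next_u64() // xorshift64, good enough for a test
{
uint64_t v = rng_state;
v ^= v << 13; v ^= v >> 7; v ^= v << 17;
return rng_state = v;
}

int main()
{
for (int iter = 0; iter < 1000000; ++iter)
{
uint64_t bits = next_u64();
int sign_offset = 1 + (int)(next_u64() % 16); // asserted range 1..16
int x = (int)(next_u64() & 0x1FFF); // kept small so x << 18 stays in int range
int r = (int)(next_u64() & 0x3FFFF); // 18-bit r, as in the question

int sign = 1 - (int)((bits >> (63 - sign_offset)) & 0x2);
int expected = ((x << 18) * sign) | r;

uint32_t hi = (uint32_t)(bits >> 32); // HIDWORD(bits)
uint32_t bit = (hi >> (32 - sign_offset)) & 0x1;
int mask = (int)((0u - bit) << 18); // same value as (-(int)bit) << 18, computed in unsigned
int transformed = (((x << 18) ^ mask) - mask) | r;

assert(expected == transformed);
}
return 0;
}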

Memory access is usually the root of all optimisation problems on modern CPUs. You are being misled by the performance tools as to where the slowdown is happening. The compiler is probably re-ordering the code to something like this:-
int sign = 1 - (bits >> (63 - get_sign_offset(bits)) & 0x2);
(*result++) = ((get_x(bits) << 18) * sign) | (r += get_r_offset(bits));
or even:-
(*result++) = ((get_x(bits) << 18) * (1 - (bits >> (63 - get_sign_offset(bits)) & 0x2))) | (r += get_r_offset(bits));
This would highlight the lines you identified as being the hotspot.
I would look at the way you organise your memory and at what the various get_ functions do. Can you post the get_ functions at all?

To calculate the sign, I would suggest this:
int sign = (int)(((int64_t)(bits << sign_offset)) >> 63);
Which is only 2 instructions (shl and sar).
If sign_offset is one bigger than I expected:
int sign = (int)(((int64_t)(bits << (sign_offset - 1))) >> 63);
Which is still not bad. Should be only 3 instructions.
That gives the sign as 0 or -1, with which you can do this:
(*result++) = (((x << 18) ^ sign) - sign) | r;

I think this is the fastest solution:
*result++ = (_rotl64(bits, sign_offset) << 31) | (x << 18) | (r << 0); // 32 bits
And then, on the GPU, correct x depending on whether the sign bit is set or not.


Translating RCL assembly into C/C++

I'm trying to manually translate some assembly into C/C++ in order to migrate a code base to x64, as Visual Studio does not allow __asm blocks to compile for x64.
I've translated some parts, however I am stuck on the following (it's an extract from the full function but should be self-contained):
void Foo()
{
char cTemp = ...;
int iVal = ...;
__asm
{
mov ebx, iVal
mov dl, cTemp
mov al, dl
MOV CL, 3
CLC
MOV AX, BX
ROR AH, CL
XOR DL, AH
SAR DL, 1
RCL BX, 1
RCL DL, 1
}
}
The parts I'm struggling with are:
RCL BX, 1
RCL DL, 1
From what I understand this is the equivalent of the following:
short v = ...;
v = (v << 1) + (CARRY_FLAG ? 1 : 0);
Which, from my understanding, means if the value of (v << 1) overflows then add 1, otherwise add 0 (I may be misunderstanding, though, so please correct me if so).
What I'm struggling to do is detect the overflow in C/C++ when carrying the shift operation. I've looked around and the only thing I can find is detecting addition/subtraction overflow before it happens, but nothing with regards to bit shifting.
Is it possible at all to translate such assembly to C/C++?
RCL is a "rotate left with carry" operation. So you need to take into account the previous instruction, SAR, that sets the carry flag (CF).
Note that SAR is a signed right shift, so it will need a signed operand. It is important to use proper data types that match the instruction precisely in bitness and signedness.
A 1-to-1 translation could look something like this
int8_t dl = /* ... */;
uint16_t bx = /* ... */;
// SAR DL,1
int8_t carry_1 = dl & 1;
dl >>= 1;
// RCL BX,1
uint16_t carry_2 = bx >> 15;
bx = (bx << 1) | carry_1;
// RCL DL,1
dl = (dl << 1) | carry_2;
There is probably a way to simplify these further. There are tools that can do that; they provide a somewhat readable C++ equivalent of a decompiled function.
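Applying the same idea to the whole block in the question, a possible (untested) translation might look like this; cTemp and iVal are shown as parameters purely for illustration, and only the final values of BX and DL are assumed to matter:
#include <cstdint>

void Foo(char cTemp, int iVal)
{
uint16_t bx = static_cast<uint16_t>(iVal); // MOV AX, BX (AX = low word of iVal)
uint8_t ah = static_cast<uint8_t>(bx >> 8);
int8_t dl = static_cast<int8_t>(cTemp); // MOV DL, cTemp

// ROR AH, 3: rotate the high byte right by 3 (the CLC has no effect on ROR)
ah = static_cast<uint8_t>((ah >> 3) | (ah << 5));

// XOR DL, AH
dl = static_cast<int8_t>(dl ^ static_cast<int8_t>(ah));

// SAR DL, 1: the bit shifted out goes into CF
int8_t carry1 = dl & 1;
dl = static_cast<int8_t>(dl >> 1); // arithmetic shift on the signed type (matches SAR on mainstream compilers)

// RCL BX, 1: CF goes into bit 0, old bit 15 goes into CF
uint16_t carry2 = static_cast<uint16_t>(bx >> 15);
bx = static_cast<uint16_t>((bx << 1) | carry1);

// RCL DL, 1: CF goes into bit 0
dl = static_cast<int8_t>((dl << 1) | carry2);

// ... the original function presumably goes on to use bx and dl ...
}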

How to efficiently scan 2 bit masks alternating each iteration

Given are 2 bitmasks that should be accessed alternately (0,1,0,1,...). I'm trying to find a runtime-efficient solution, but have found no better way than the following example.
uint32_t mask[2] { ... };
uint8_t mask_index = 0;
uint32_t f = _tzcnt_u32(mask[mask_index]);
while (f < 32) {
// element adding to result vector removed, since not relevant for question itself
mask[0] >>= f + 1;
mask[1] >>= f + 1;
mask_index ^= 1;
f = _tzcnt_u32(mask[mask_index]);
}
The ASM output (MSVC, x64) looks quite bloated.
inc r9
add r9,rcx
mov eax,esi
mov qword ptr [rdi+rax*8],r9
inc esi
lea rax,[rcx+1]
shrx r11d,r11d,eax
mov dword ptr [rbp],r11d
shrx r8d,r8d,eax
mov dword ptr [rbp+4],r8d
xor r10b,1
movsx rax,r10b
tzcnt ecx,dword ptr [rbp+rax*4]
mov ecx,ecx
cmp rcx,20h
jb main+240h (07FF632862FD0h)
cmp r9,20h
jb main+230h (07FF632862FC0h)
Does anyone have advice?
(This is a follow-up to Solve loop data dependency with SIMD - finding transitions between -1 and +1 in an int8_t array of sgn values, where SIMD is used to create the bitmasks.)
Update
I wonder if a potential solution could make use of SIMD by loading chunks of both bit streams into a register (AVX2 in my case) like this:
|m0[0]|m1[0]|m0[1]|m1[1]|m0[2]|m1[2]|m0[n+1]|m1[n+1]|
or
1 register with chunks per stream
|m0[0]|m0[1]|m0[2]|m0[n+1]|
|m1[0]|m1[1]|m1[2]|m1[n+1]|
or split the stream into chunks of the same size and deal with as many lanes as fit into the register at once. Let's assume we have 256*10 elements, which might end up in 10 iterations like this:
|m0[0]|m0[256]|m0[512]|...|
|m1[0]|m1[256]|m1[512]|...|
and deal with the joins separately.
I'm not sure if this might be a way to achieve more iterations per cycle, limit the need for horizontal bit scans and shift/clear ops, and avoid branches.
This loop is quite hard to optimize. The main issue is that each iteration of the loop depends on the previous one, and even the instructions within an iteration depend on each other. This creates a long, nearly sequential chain of instructions to be executed. As a result, the processor cannot execute this efficiently. In addition, some instructions in this chain have quite a high latency: tzcnt has a 3-cycle latency on Intel processors, and an L1 load/store has a 3-cycle latency.
One solution is to work directly with registers instead of an array with indirect accesses, so as to reduce the length of the chain and especially the instructions with the highest latency. This can be done by unrolling the loop twice and splitting the problem into two different ones:
uint32_t m0 = mask[0];
uint32_t m1 = mask[1];
uint8_t mask_index = 0;
if(mask_index == 0) {
uint32_t f = _tzcnt_u32(m0);
while (f < 32) {
m1 >>= f + 1;
m0 >>= f + 1;
f = _tzcnt_u32(m1);
if(f >= 32)
break;
m0 >>= f + 1;
m1 >>= f + 1;
f = _tzcnt_u32(m0);
}
}
else {
uint32_t f = _tzcnt_u32(m1);
while (f < 32) {
m0 >>= f + 1;
m1 >>= f + 1;
f = _tzcnt_u32(m0);
if(f >= 32)
break;
m0 >>= f + 1;
m1 >>= f + 1;
f = _tzcnt_u32(m1);
}
}
// If mask is needed, m0 and m1 need to be stored back in mask.
This should be a bit faster, especially because of the smaller critical path, but also because the two shifts can be executed in parallel. Here is the resulting assembly code:
$loop:
inc ecx
shr edx, cl
shr eax, cl
tzcnt ecx, edx
cmp ecx, 32
jae SHORT $end_loop
inc ecx
shr eax, cl
shr edx, cl
tzcnt ecx, eax
cmp ecx, 32
jb SHORT $loop
Note that modern x86 processors can fuse the instruction pairs cmp+jae and cmp+jb, and the branch predictor can assume the loop will continue, so it only mispredicts the last conditional jump. On Intel processors, the critical path is composed of a 1-cycle latency inc, a 1-cycle latency shr and a 3-cycle latency tzcnt, resulting in 5 cycles per round (1 round = 1 iteration of the initial loop). On AMD Zen-like processors, it is 1+1+2 = 4 cycles, which is very good. Optimizing this further appears to be very challenging.
One possible optimization could be to use a lookup table so as to consume the lower bits of m0 and m1 in bigger steps. However, a lookup-table fetch has a 3-cycle latency, may cause expensive cache misses in practice, takes more memory, and makes the code significantly more complex, since the number of trailing 0 bits can be quite big (e.g. 28 bits). Thus, I am not sure this is a good idea, although it is certainly worth trying.
Here’s another way, untested. People all over the internet recommend against using goto, but sometimes, as in your use case, the feature does help.
// Grab 2 more of these masks, or if you don't have any, return false
bool loadMasks( uint32_t& mask1, uint32_t& mask2 );
// Consume the found value
void consumeIndex( size_t index );
void processMasks()
{
size_t sourceOffset = 0;
uint32_t mask0, mask1;
// Skip initial zeros
while( true )
{
if( !loadMasks( mask0, mask1 ) )
return;
if( 0 != ( mask0 | mask1 ) )
break;
sourceOffset += 32;
}
constexpr uint32_t minusOne = ~(uint32_t)0;
uint32_t idx;
// Figure out the initial state, and jump
if( _tzcnt_u32( mask0 ) > _tzcnt_u32( mask1 ) )
goto testMask1;
// Main loop below
testMask0:
idx = _tzcnt_u32( mask0 );
if( idx >= 32 )
{
sourceOffset += 32;
if( !loadMasks( mask0, mask1 ) )
return;
goto testMask0;
}
consumeIndex( sourceOffset + idx );
mask1 &= minusOne << ( idx + 1 );
testMask1:
idx = _tzcnt_u32( mask1 );
if( idx >= 32 )
{
sourceOffset += 32;
if( !loadMasks( mask0, mask1 ) )
return;
goto testMask1;
}
consumeIndex( sourceOffset + idx );
mask0 &= minusOne << ( idx + 1 );
goto testMask0;
}
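For completeness, a hypothetical pair of callbacks to drive processMasks() above, assuming the masks live in two plain arrays and the found bit positions are simply collected into a vector (the names and storage are illustrative, not part of the original answer):
#include <cstdint>
#include <vector>

static const uint32_t* g_mask0; // stream 0, g_count words
static const uint32_t* g_mask1; // stream 1, g_count words
static size_t g_count = 0;
static size_t g_pos = 0;
static std::vector<size_t> g_found;

bool loadMasks( uint32_t& mask0, uint32_t& mask1 )
{
if( g_pos >= g_count )
return false;
mask0 = g_mask0[ g_pos ];
mask1 = g_mask1[ g_pos ];
g_pos++;
return true;
}

void consumeIndex( size_t index )
{
g_found.push_back( index );
}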

C overflows inside an equation?

a + b overflows past 255 back to 4 as I would expect, then c / 2 gives 2 as I expect. But why does the last example not overflow when evaluating the same two steps?
I'm guessing the intermediate values are stored with more bits and only truncated down to 8 bits when the assignment is done. In that case, where is the limit? It must overflow at some point.
uint8_t a = 250;
uint8_t b = 10;
uint8_t c = (a + b);
uint8_t d = c / 2;
uint8_t e = (a + b) / 2;
std::cout << unsigned(c) << ", " << unsigned(d) << ", " << unsigned(e) << "\n";
4, 2, 130
It's called integral promotion. The operations themselves are done in your CPU's native integer type, int, which can hold numbers greater than 255. In the a+b case the result must be stored in a uint8_t, and that's where the truncation is done. In the last case there is first a division, which is done as an int, and that result can be stored perfectly well in a uint8_t.
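Written out with the promotions made explicit, the three assignments from the question are effectively:
#include <cstdint>
uint8_t a = 250;
uint8_t b = 10;
uint8_t c = static_cast<uint8_t>(int(a) + int(b)); // int 260, truncated to 4 on assignment
uint8_t d = static_cast<uint8_t>(int(c) / 2); // 4 / 2 == 2
uint8_t e = static_cast<uint8_t>((int(a) + int(b)) / 2); // 260 / 2 == 130, fits, no truncation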
a+b gives the value 260, which is never assigned to a uint8_t in the last case, so you are fine there. Only when you assign a value greater than 255 to a uint8_t is there an overflow.
In the following, (a + b) does not overflow: the compiler recognizes a and b as integer types, so the addition is done as an integer type, and the result of the expression is not restricted by the size of the terms or factors in the expression.
Let's assume the type of a variable like a or b did limit the result to only that type. While possible, a language like that would be almost impossible to use. Imagine five variables that, when no type consideration is made, sum to 500, i.e. this:
uint8_t a = 98;
uint8_t b = 99;
uint8_t c = 100;
uint8_t d = 101;
uint8_t e = 102;
The sum of the above variables == 500. Now... in the following the result of any expression cannot exceed the size of one of the terms...
int incorrect = (a + b + c + d + e);
In this case (a + b + c) == 41, and then (41 + d + e) == 244. This is a nonsensical answer. The alternative is what most people recognize, i.e.
(98 + 99 + 100 + 101 + 102) == 500;
This is one reason why type conversion exists.
Intermediate results in expressions should not be restricted by the terms or factors in the expression but by the resultant type, i.e. the lvalue.
@atturri is correct. Here is what happens to your variables in x86 machine language:
REP STOS DWORD PTR ES:[EDI]
MOV BYTE PTR SS:[a],0FA
MOV BYTE PTR SS:[b],0A
MOVZX EAX,BYTE PTR SS:[a] ; promotion to 32bit integer
MOVZX ECX,BYTE PTR SS:[b] ; promotion to 32bit integer
ADD EAX,ECX
MOV BYTE PTR SS:[c],AL ; ; demotion to 8bit integer
MOVZX EAX,BYTE PTR SS:[c]
CDQ
SUB EAX,EDX
SAR EAX,1
MOV BYTE PTR SS:[d],AL
MOVZX EAX,BYTE PTR SS:[a]
MOVZX ECX,BYTE PTR SS:[b]
ADD EAX,ECX
CDQ
SUB EAX,EDX
SAR EAX,1
MOV BYTE PTR SS:[e],AL

How can I optimize conversion from half-precision float16 to single-precision float32?

I'm trying to improve the performance of my function. The profiler points to the code in the inner loop. Can I improve the performance of that code, maybe using SSE intrinsics?
void ConvertImageFrom_R16_FLOAT_To_R32_FLOAT(char* buffer, void* convertedData, DWORD width, DWORD height, UINT rowPitch)
{
struct SINGLE_FLOAT
{
union {
struct {
unsigned __int32 R_m : 23;
unsigned __int32 R_e : 8;
unsigned __int32 R_s : 1;
};
struct {
float r;
};
};
};
C_ASSERT(sizeof(SINGLE_FLOAT) == 4); // 4 bytes
struct HALF_FLOAT
{
unsigned __int16 R_m : 10;
unsigned __int16 R_e : 5;
unsigned __int16 R_s : 1;
};
C_ASSERT(sizeof(HALF_FLOAT) == 2);
SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData;
for(DWORD j = 0; j< height; j++)
{
HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
for(DWORD i = 0; i< width; i++)
{
d->R_s = s->R_s;
d->R_e = s->R_e - 15 + 127;
d->R_m = s->R_m << (23-10);
d++;
s++;
}
}
}
Update:
Disassembly
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE Utils.cpp
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC ?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z ; ConvertImageFrom_R16_FLOAT_To_R32_FLOAT
; Function compile flags: /Ogtp
; COMDAT ?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z
_TEXT SEGMENT
_buffer$ = 8 ; size = 4
tv83 = 12 ; size = 4
_convertedData$ = 12 ; size = 4
_width$ = 16 ; size = 4
_height$ = 20 ; size = 4
_rowPitch$ = 24 ; size = 4
?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z PROC ; ConvertImageFrom_R16_FLOAT_To_R32_FLOAT, COMDAT
; 323 : {
push ebp
mov ebp, esp
; 343 : for(DWORD j = 0; j< height; j++)
mov eax, DWORD PTR _height$[ebp]
push esi
mov esi, DWORD PTR _convertedData$[ebp]
test eax, eax
je SHORT $LN4@ConvertIma
; 324 : union SINGLE_FLOAT {
; 325 : struct {
; 326 : unsigned __int32 R_m : 23;
; 327 : unsigned __int32 R_e : 8;
; 328 : unsigned __int32 R_s : 1;
; 329 : };
; 330 : struct {
; 331 : float r;
; 332 : };
; 333 : };
; 334 : C_ASSERT(sizeof(SINGLE_FLOAT) == 4);
; 335 : struct HALF_FLOAT
; 336 : {
; 337 : unsigned __int16 R_m : 10;
; 338 : unsigned __int16 R_e : 5;
; 339 : unsigned __int16 R_s : 1;
; 340 : };
; 341 : C_ASSERT(sizeof(HALF_FLOAT) == 2);
; 342 : SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData;
push ebx
mov ebx, DWORD PTR _buffer$[ebp]
push edi
mov DWORD PTR tv83[ebp], eax
$LL13@ConvertIma:
; 344 : {
; 345 : HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
; 346 : for(DWORD i = 0; i< width; i++)
mov edi, DWORD PTR _width$[ebp]
mov edx, ebx
test edi, edi
je SHORT $LN5@ConvertIma
npad 1
$LL3@ConvertIma:
; 347 : {
; 348 : d->R_s = s->R_s;
movzx ecx, WORD PTR [edx]
movzx eax, WORD PTR [edx]
shl ecx, 16 ; 00000010H
xor ecx, DWORD PTR [esi]
shl eax, 16 ; 00000010H
and ecx, 2147483647 ; 7fffffffH
xor ecx, eax
mov DWORD PTR [esi], ecx
; 349 : d->R_e = s->R_e - 15 + 127;
movzx eax, WORD PTR [edx]
shr eax, 10 ; 0000000aH
and eax, 31 ; 0000001fH
add eax, 112 ; 00000070H
shl eax, 23 ; 00000017H
xor eax, ecx
and eax, 2139095040 ; 7f800000H
xor eax, ecx
mov DWORD PTR [esi], eax
; 350 : d->R_m = s->R_m << (23-10);
movzx ecx, WORD PTR [edx]
and ecx, 1023 ; 000003ffH
shl ecx, 13 ; 0000000dH
and eax, -8388608 ; ff800000H
or ecx, eax
mov DWORD PTR [esi], ecx
; 351 : d++;
add esi, 4
; 352 : s++;
add edx, 2
dec edi
jne SHORT $LL3@ConvertIma
$LN5@ConvertIma:
; 343 : for(DWORD j = 0; j< height; j++)
add ebx, DWORD PTR _rowPitch$[ebp]
dec DWORD PTR tv83[ebp]
jne SHORT $LL13@ConvertIma
pop edi
pop ebx
$LN4@ConvertIma:
pop esi
; 353 : }
; 354 : }
; 355 : }
pop ebp
ret 0
?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z ENDP ; ConvertImageFrom_R16_FLOAT_To_R32_FLOAT
_TEXT ENDS
The x86 F16C instruction-set extension adds hardware support for converting single-precision float vectors to/from vectors of half-precision float.
The format is the same IEEE 754 half-precision binary16 that you describe. I didn't check that the endianness is the same as your struct, but that's easy to fix if needed (with a pshufb).
F16C is supported starting from Intel IvyBridge and AMD Piledriver. (And has its own CPUID feature bit, which your code should check for, otherwise fall back to SIMD integer shifts and shuffles).
The intrinsics for VCVTPS2PH are:
__m128i _mm_cvtps_ph ( __m128 m1, const int imm);
__m128i _mm256_cvtps_ph(__m256 m1, const int imm);
The immediate byte is a rounding control. The compiler can use it as a convert-and-store directly to memory (unlike most instructions that can optionally use a memory operand, where it's the source operand that can be memory instead of a register.)
VCVTPH2PS goes the other way, and is just like most other SSE instructions (can be used between registers or as a load).
__m128 _mm_cvtph_ps ( __m128i m1);
__m256 _mm256_cvtph_ps ( __m128i m1)
F16C is so efficient that you might want to consider leaving your image in half-precision format, and converting on the fly every time you need a vector of data from it. This is great for your cache footprint.
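As a concrete illustration (a sketch, not drop-in code for the function above): a row conversion using _mm_cvtph_ps, assuming width is a multiple of 4 and F16C support has already been verified; compile with -mf16c on GCC/Clang or an AVX-enabled MSVC build:
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void ConvertRow_F16C(const uint16_t* src, float* dst, size_t width)
{
for (size_t i = 0; i < width; i += 4)
{
// Load 4 half-precision values (64 bits) into the low half of an XMM register.
__m128i h = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src + i));
__m128 f = _mm_cvtph_ps(h); // VCVTPH2PS: 4 halves -> 4 floats
_mm_storeu_ps(dst + i, f);
}
}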
Accessing bitfields in memory can be really tricky, depending on the architecture, of course.
You might achieve better performance if you made a union of a float and a 32-bit integer, and simply performed all decomposition and composition using local variables. That way the generated code could perform the entire operation using only processor registers.
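For instance, a sketch of the same conversion done with plain shifts and masks on local integers (it handles only the normal-number case, exactly like the bitfield version in the question, and uses memcpy for the final type pun):
#include <cstdint>
#include <cstring>

static inline float HalfToFloat(uint16_t h)
{
uint32_t sign = (uint32_t)(h >> 15) << 31;
uint32_t exponent = (((uint32_t)(h >> 10) & 0x1F) - 15 + 127) << 23; // same bias adjustment as the question
uint32_t mantissa = ((uint32_t)h & 0x3FF) << 13;
uint32_t u = sign | exponent | mantissa;
float f;
std::memcpy(&f, &u, sizeof(f)); // safe type pun
return f;
}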
The loop iterations are independent of each other, so you could easily parallelize this code, either by using SIMD or OpenMP. A simple version would be splitting the top half and the bottom half of the image across two threads running concurrently.
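A sketch of that row-level parallelisation with OpenMP (compile with /openmp or -fopenmp), reusing the types and variables from the question; each thread gets a block of rows:
#pragma omp parallel for
for (int j = 0; j < (int)height; j++)
{
HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * (DWORD)j);
SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData + (size_t)j * width;
for (DWORD i = 0; i < width; i++)
{
d[i].R_s = s[i].R_s;
d[i].R_e = s[i].R_e - 15 + 127;
d[i].R_m = s[i].R_m << (23 - 10);
}
}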
You're processing the data as a two dimension array. If you consider how it's laid out in memory you may be able to process it as a single dimensional array and you can save a little overhead by having one loop instead of nested loops.
I'd also compile to assembly code and make sure the compiler optimization worked and it isn't recalculating (15 + 127) hundreds of times.
You should be able to reduce this to a single instruction on chips which use the upcoming CVT16 instruction set. According to that Wikipedia article:
The CVT16 instructions allow conversion of floating point vectors between single precision and half precision.
SSE Intrinsics seem to be an excellent idea. Before you go down that road, you should
look at the assembly code generated by the compiler (is there potential for optimization?),
search your compiler documentation for how to generate SSE code automatically,
search your software library's documentation (or wherever the 16-bit float type originated) for a function to bulk-convert this type (a conversion to 64-bit floating point could be helpful too). You are very likely not the first person to encounter this problem!
If all that fails, go and try your luck with some SSE intrinsics. To get some idea, here is some SSE code to convert from 32 to 16 bit floating point. (you want the reverse)
Besides SSE you should also consider multi-threading and offloading the task to the GPU.
Here are some ideas:
Put the constants into const register variables.
Some processors don't like fetching constants from memory; it is awkward and may take many instruction cycles.
Loop Unrolling
Repeat the statements in the loop, and increase the increment.
Processors prefer continuous instructions; jumps and branches anger them.
Data Prefetching (or loading the cache)
Use more variables in the loop, and declare them as volatile so the compiler doesn't optimize them:
// Processes four rows per outer iteration so that four independent source
// streams are read concurrently (assumes height is a multiple of 4).
SINGLE_FLOAT* base = (SINGLE_FLOAT*)convertedData;
for(DWORD j = 0; j < height; j += 4)
{
HALF_FLOAT* s0 = (HALF_FLOAT*)((char*)buffer + rowPitch * (j + 0));
HALF_FLOAT* s1 = (HALF_FLOAT*)((char*)buffer + rowPitch * (j + 1));
HALF_FLOAT* s2 = (HALF_FLOAT*)((char*)buffer + rowPitch * (j + 2));
HALF_FLOAT* s3 = (HALF_FLOAT*)((char*)buffer + rowPitch * (j + 3));
SINGLE_FLOAT* d0 = base + (j + 0) * width;
SINGLE_FLOAT* d1 = base + (j + 1) * width;
SINGLE_FLOAT* d2 = base + (j + 2) * width;
SINGLE_FLOAT* d3 = base + (j + 3) * width;
for(DWORD i = 0; i < width; i++)
{
d0->R_s = s0->R_s; d0->R_e = s0->R_e - 15 + 127; d0->R_m = s0->R_m << (23-10);
d1->R_s = s1->R_s; d1->R_e = s1->R_e - 15 + 127; d1->R_m = s1->R_m << (23-10);
d2->R_s = s2->R_s; d2->R_e = s2->R_e - 15 + 127; d2->R_m = s2->R_m << (23-10);
d3->R_s = s3->R_s; d3->R_e = s3->R_e - 15 + 127; d3->R_m = s3->R_m << (23-10);
++d0; ++d1; ++d2; ++d3;
++s0; ++s1; ++s2; ++s3;
}
}
I don't know about SSE intrinsics but it would be interesting to see a disassembly of your inner loop. An old-school way (that may not help much but that would be easy to try out) would be to reduce the number of iterations by doing two inner loops: one that does N (say 32) repeats of the processing (loop count of width/N) and then one to finish the remainder (loop count of width%N)... with those divs and modulos calculated outside the first loop to avoid recalculating them. Apologies if that sounds obvious!
The function is only doing a few small things. It is going to be tough to shave much off the time by optimisation, but as somebody already said, parallelisation has promise.
Check how many cache misses you are getting. If the data is paging in and out, you might be able to speed it up by applying more intelligence into the ordering to minimise cache swaps.
Also consider macro-optimisations. Are there any redundancies in the data computation that might be avoided (e.g. caching old results instead of recomputing them when needed)? Do you really need to convert the whole data set or could you just convert the bits you need? I don't know your application so I'm just guessing wildly here, but there might be scope for that kind of optimisation.
My suspicion is that this operation will be already bottlenecked on memory access, and making it more efficient (e.g., using SSE) would not make it execute more quickly. However this is only a suspicion.
Other things to try, assuming x86/x64, might be:
Don't d++ and s++, but use d[i] and s[i] on each iteration. (Then of course bump d after each scanline.) Since the elements of d are 4 bytes and those of s 2, this operation can be folded into the address calculation. (Unfortunately I can't guarantee that this would necessarily make execution more efficient.)
Remove the bitfield operations and do the operations manually. (When extracting, shift first and mask second, to maximize the likelihood that the mask can fit into a small immediate value.)
Unroll the loop, though with a loop as easily-predicted as this one it might not make much difference.
Count along each line from width down to zero. This stops the compiler having to fetch width each time round. Probably more important for x86, because it has so few registers. (If the CPU likes my "d[i] and s[i]" suggestion, you could make width signed, count from width-1 instead, and walk backwards.)
These would all be quicker to try than converting to SSE and would hopefully make it memory-bound, if it isn't already, at which point you can give up.
Finally if the output is in write-combined memory (e.g., it's a texture or vertex buffer or something accessed over AGP, or PCI Express, or whatever it is PCs have these days) then this could well result in poor performance, depending on what code the compiler has generated for the inner loop. So if that is the case you may get better results converting each scanline into a local buffer then using memcpy to copy it to its final destination.
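A sketch of that last suggestion, converting each scanline into a local buffer and then copying it out in one go (again reusing the question's types and variables):
#include <cstring>
#include <vector>

std::vector<SINGLE_FLOAT> line(width); // reused scratch row in ordinary cacheable memory
for (DWORD j = 0; j < height; j++)
{
HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
for (DWORD i = 0; i < width; i++)
{
line[i].R_s = s[i].R_s;
line[i].R_e = s[i].R_e - 15 + 127;
line[i].R_m = s[i].R_m << (23 - 10);
}
std::memcpy((SINGLE_FLOAT*)convertedData + (size_t)j * width, line.data(), width * sizeof(SINGLE_FLOAT));
}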

How to efficiently de-interleave bits (inverse Morton)

This question: How to de-interleave bits (UnMortonizing?) has a good answer for extracting one of the two halves of a Morton number (just the odd bits), but I need a solution which extracts both parts (the odd bits and the even bits) in as few operations as possible.
For my use I would need to take a 32 bit int and extract two 16 bit ints, where one is the even bits and the other is the odd bits shifted right by 1 bit, e.g.
input, z: 11101101 01010111 11011011 01101110
output, x: 11100001 10110111 // odd bits shifted right by 1
y: 10111111 11011010 // even bits
There seem to be plenty of solutions using shifts and masks with magic numbers for generating Morton numbers (i.e. interleaving bits), e.g. Interleave bits by Binary Magic Numbers, but I haven't yet found anything for doing the reverse (i.e. de-interleaving).
UPDATE
After re-reading the section from Hacker's Delight on perfect shuffles/unshuffles I found some useful examples which I adapted as follows:
// morton1 - extract even bits
uint32_t morton1(uint32_t x)
{
x = x & 0x55555555;
x = (x | (x >> 1)) & 0x33333333;
x = (x | (x >> 2)) & 0x0F0F0F0F;
x = (x | (x >> 4)) & 0x00FF00FF;
x = (x | (x >> 8)) & 0x0000FFFF;
return x;
}
// morton2 - extract odd and even bits
void morton2(uint32_t *x, uint32_t *y, uint32_t z)
{
*x = morton1(z);
*y = morton1(z >> 1);
}
I think this can still be improved on, both in its current scalar form and also by taking advantage of SIMD, so I'm still interested in better solutions (either scalar or SIMD).
If your processor handles 64 bit ints efficiently, you could combine the operations...
uint64_t w = ((uint64_t)(z & 0xAAAAAAAA) << 31) | (z & 0x55555555);
w = (w | (w >> 1)) & 0x3333333333333333;
w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0F;
...
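One possible completion of that idea (an untested sketch): pack the odd bits into the high half and the even bits into the low half, run the unshuffle steps on the whole 64-bit word, then split the result:
#include <stdint.h>

void morton2_64(uint32_t *x, uint32_t *y, uint32_t z)
{
uint64_t w = ((uint64_t)(z & 0xAAAAAAAAu) << 31) | (z & 0x55555555u);
w = (w | (w >> 1)) & 0x3333333333333333ull;
w = (w | (w >> 2)) & 0x0F0F0F0F0F0F0F0Full;
w = (w | (w >> 4)) & 0x00FF00FF00FF00FFull;
w = (w | (w >> 8)) & 0x0000FFFF0000FFFFull;
*y = (uint32_t)w; // even bits
*x = (uint32_t)(w >> 32); // odd bits, already shifted right by 1
}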
Code for the Intel Haswell and later CPUs. You can use the BMI2 instruction set which contains the pext and pdep instructions. These can (among other great things) be used to build your functions.
#include <immintrin.h>
#include <stdint.h>
// on GCC, compile with option -mbmi2, requires Haswell or better.
uint64_t xy_to_morton (uint32_t x, uint32_t y)
{
return _pdep_u64(x, 0x5555555555555555) | _pdep_u64(y, 0xaaaaaaaaaaaaaaaa);
}
void morton_to_xy (uint64_t m, uint32_t *x, uint32_t *y)
{
*x = _pext_u64(m, 0x5555555555555555);
*y = _pext_u64(m, 0xaaaaaaaaaaaaaaaa);
}
In case someone is using Morton codes in 3D, and so needs to read one bit out of every 3 from a 64-bit value, here is the function I used:
uint64_t morton3(uint64_t x) {
x = x & 0x9249249249249249;
x = (x | (x >> 2)) & 0x30c30c30c30c30c3;
x = (x | (x >> 4)) & 0xf00f00f00f00f00f;
x = (x | (x >> 8)) & 0x00ff0000ff0000ff;
x = (x | (x >> 16)) & 0xffff00000000ffff;
x = (x | (x >> 32)) & 0x00000000ffffffff;
return x;
}
uint64_t bits;
uint64_t x = morton3(bits);
uint64_t y = morton3(bits >> 1);
uint64_t z = morton3(bits >> 2);
You can extract 8 interleaved bits by multiplying like so:
uint8_t deinterleave_even(uint16_t x) {
return ((x & 0x5555) * 0xC00030000C0003 & 0x0600180060008001) * 0x0101010101010101 >> 56;
}
uint8_t deinterleave_odd(uint16_t x) {
return ((x & 0xAAAA) * 0xC00030000C0003 & 0x03000C003000C000) * 0x0101010101010101 >> 56;
}
It should be trivial to combine them for 32 bits or larger.
If you need speed, you can use a table lookup to convert one byte at a time (a two-byte table is faster but too big). The procedure below is written in Delphi, but the assembler/algorithm is the same.
const
MortonTableLookup : array[byte] of byte = ($00, $01, $10, $11, $12, ... ;
procedure DeinterleaveBits(Input: cardinal);
//In: eax
//Out: dx = EvenBits; ax = OddBits;
asm
movzx ecx, al //Use 0th byte
mov dl, byte ptr[MortonTableLookup + ecx]
//
shr eax, 8
movzx ecx, ah //Use 2th byte
mov dh, byte ptr[MortonTableLookup + ecx]
//
shl edx, 16
movzx ecx, al //Use 1th byte
mov dl, byte ptr[MortonTableLookup + ecx]
//
shr eax, 8
movzx ecx, ah //Use 3th byte
mov dh, byte ptr[MortonTableLookup + ecx]
//
mov ecx, edx
and ecx, $F0F0F0F0
mov eax, ecx
rol eax, 12
or eax, ecx
rol edx, 4
and edx, $F0F0F0F0
mov ecx, edx
rol ecx, 12
or edx, ecx
end;
I didn't want to be limited to a fixed-size integer or to making lists of similar statements with hardcoded constants, so I developed a C++11 solution which uses template metaprogramming to generate the functions and the constants. The assembly code generated with -O3 seems as tight as it can get without using BMI2:
andl $0x55555555, %eax
movl %eax, %ecx
shrl %ecx
orl %eax, %ecx
andl $0x33333333, %ecx
movl %ecx, %eax
shrl $2, %eax
orl %ecx, %eax
andl $0xF0F0F0F, %eax
movl %eax, %ecx
shrl $4, %ecx
orl %eax, %ecx
movzbl %cl, %esi
shrl $8, %ecx
andl $0xFF00, %ecx
orl %ecx, %esi
TL;DR source repo and live demo.
Implementation
Basically every step in the morton1 function works by shifting and masking with a sequence of constants which look like this:
0b0101010101010101 (alternate 1 and 0)
0b0011001100110011 (alternate 2x 1 and 0)
0b0000111100001111 (alternate 4x 1 and 0)
0b0000000011111111 (alternate 8x 1 and 0)
If we were to use D dimensions, we would have a pattern with D-1 zeros and 1 one. So to generate these it's enough to generate consecutive ones and apply some bitwise or:
/// @brief Generates 0b1...1 with @tparam n ones
template <class T, unsigned n>
using n_ones = std::integral_constant<T, (~static_cast<T>(0) >> (sizeof(T) * 8 - n))>;
/// @brief Performs `input | (input << width)` @tparam repeat times.
template <class T, T input, unsigned width, unsigned repeat>
struct lshift_add :
public lshift_add<T, lshift_add<T, input, width, 1>::value, width, repeat - 1> {
};
/// @brief Specialization for 1 repetition, just does the shift-and-add operation.
template <class T, T input, unsigned width>
struct lshift_add<T, input, width, 1> : public std::integral_constant<T,
(input & n_ones<T, width>::value) | (input << (width < sizeof(T) * 8 ? width : 0))> {
};
Now that we can generate the constants at compile time for arbitrary dimensions with the following:
template <class T, unsigned step, unsigned dimensions = 2u>
using mask = lshift_add<T, n_ones<T, 1 << step>::value, dimensions * (1 << step), sizeof(T) * 8 / (2 << step)>;
With the same type of recursion, we can generate functions for each of the steps of the algorithm x = (x | (x >> K)) & M:
template <class T, unsigned step, unsigned dimensions>
struct deinterleave {
static T work(T input) {
input = deinterleave<T, step - 1, dimensions>::work(input);
return (input | (input >> ((dimensions - 1) * (1 << (step - 1))))) & mask<T, step, dimensions>::value;
}
};
// Omitted specialization for step 0, where there is just a bitwise and
It remains to answer the question "how many steps do we need?". This also depends on the number of dimensions. In general, k steps handle 2^(k-1) output bits per dimension; the maximum number of meaningful bits for each dimension is given by z = sizeof(T) * 8 / dimensions, therefore it is enough to take 1 + log_2 z steps. The problem is now that we need this as a constexpr in order to use it as a template parameter. The best way I found to work around this is to define log2 via metaprogramming:
template <unsigned arg>
struct log2 : public std::integral_constant<unsigned, log2<(arg >> 1)>::value + 1> {};
template <>
struct log2<1u> : public std::integral_constant<unsigned, 0u> {};
/// @brief Helper constexpr which returns the number of steps needed to fully interleave a type @tparam T.
template <class T, unsigned dimensions>
using num_steps = std::integral_constant<unsigned, log2<sizeof(T) * 8 / dimensions>::value + 1>;
And finally, we can perform one single call:
/// @brief Helper function which combines @see deinterleave and @see num_steps into a single call.
template <class T, unsigned dimensions>
T deinterleave_first(T n) {
return deinterleave<T, num_steps<T, dimensions>::value - 1, dimensions>::work(n);
}
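A possible call site for the question's 2-D case, using the helpers above (illustrative):
uint32_t z = 0xED57DB6Eu; // the example input from the question
uint16_t x = (uint16_t)deinterleave_first<uint32_t, 2>(z >> 1); // odd bits shifted right by 1: 0b1110000110110111
uint16_t y = (uint16_t)deinterleave_first<uint32_t, 2>(z); // even bits: 0b1011111111011010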