Translating C++ x86 Inline assembly code to C++ - c++

I've been struggling trying to convert this assembly code to C++ code.
It's a function from an old game that takes pixel data Stmp, and I believe it copies it into the destination void* dest.
void Function(int x, int y, int yl, void* Stmp, void* dest)
{
unsigned long size = 1280 * 2;
unsigned long j = yl;
void* Dtmp = (void*)((char*)dest + y * size + (x * 2));
_asm
{
push es;
push ds;
pop es;
mov edx,Dtmp;
mov esi,Stmp;
mov ebx,j;
xor eax,eax;
xor ecx,ecx;
loop_1:
or bx,bx;
jz exit_1;
mov edi,edx;
loop_2:
cmp word ptr[esi],0xffff;
jz exit_2;
mov ax,[esi];
add edi,eax;
mov cx,[esi+2];
add esi,4;
shr ecx,2;
jnc Next2;
movsw;
Next2:
rep movsd;
jmp loop_2;
exit_2:
add esi,2;
add edx,size;
dec bx;
jmp loop_1;
exit_1:
pop es;
};
}
That's as far as I've gotten (not sure if it's even correct):
while (j > 0)
{
if (*stmp != 0xffff)
{
}
++stmp;
dtmp += size;
--j;
}
Any help is greatly appreciated. Thank you.

It saves / restores ES around setting it equal to DS so rep movsd will use the same addresses for load and store. That instruction is basically memcpy(edi, esi, ecx) but incrementing the pointers in EDI and ESI (by 4 * ecx). https://www.felixcloutier.com/x86/movs:movsb:movsw:movsd:movsq
In a flat memory model, you can totally ignore that. This code looks like it might have been written to run in 16-bit unreal mode, or possibly even real mode, hence the use of 16-bit registers all over the place.
Looks like it's loading some kind of records that tell it how many bytes to copy, and reading until the end of the record, at which point it looks for the next record there. There's an outer loop around that, looping over destination lines.
The records look like this I think:
struct sprite_line {
uint16_t skip_dstbytes, src_bytes;
uint16_t src_data[]; // flexible array member, actual size unlimited but assumed to be a multiple of 2.
};
The inner loop is this:
;; char *dstp; // in EDI
;; struct spriteline *p // in ESI
loop_2:
cmp word ptr[esi],0xffff ; while( p->skip_dstbytes != (uint16_t)-1 ) {
jz exit_2;
mov ax,[esi]; ; EAX was xor-zeroed earlier; some old CPUs maybe had slow movzx loads
add edi,eax; ; dstp += p->skip_dstbytes;
mov cx,[esi+2]; ; bytelen = p->src_len;
add esi,4; ; p->data
shr ecx,2; ; length in dwords = bytelen >> 2
jnc Next2;
movsw; ; one 16-bit (word) copy if bytelen >> 1 is odd, i.e. if last bit shifted out was a 1.
; The first bit shifted out isn't checked, so size is assumed to be a multiple of 2.
Next2:
rep movsd; ; copy in 4-byte chunks
Old CPUs (before IvyBridge) had rep movsd faster than rep movsb, otherwise this code could just have done that.
or bx,bx;
jz exit_1;
That's an obsolete idiom that comes from 8080 for test bx,bx / jz, i.e. jump if BX was zero. So it's a while( bx != 0 ) {} loop, with dec bx in it. It's an inefficient way to write a while (--bx) loop; a compiler would put a dec/jnz .top_of_loop at the bottom, with a test once outside the loop in case it needs to run zero times. See: Why are loops always compiled into "do...while" style (tail jump)?
Some people would say that's what a while loop looks like in asm, if they're picturing totally naive translation from C to asm.
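Putting the pieces together, here is a hedged C++ translation of the whole routine, assuming the record layout guessed above, a flat memory model (so the ES/DS juggling disappears), and an even src_bytes count; treat it as a sketch, not a verified drop-in replacement:
#include <cstdint>
#include <cstring>

void Function(int x, int y, int yl, void* Stmp, void* dest)
{
    const unsigned long size = 1280 * 2;                      // bytes per destination line
    const uint16_t* src = static_cast<const uint16_t*>(Stmp);
    char* dline = static_cast<char*>(dest) + y * size + x * 2;

    for (unsigned long lines = (unsigned long)yl; lines != 0; --lines)
    {
        char* dst = dline;                                    // mov edi,edx
        while (*src != 0xFFFF)                                // 0xFFFF terminates this line's records
        {
            uint16_t skip  = src[0];                          // bytes to skip in the destination
            uint16_t count = src[1];                          // bytes of pixel data to copy
            src += 2;                                         // now pointing at the pixel data
            dst += skip;
            std::memcpy(dst, src, count);                     // movsw + rep movsd (count assumed even)
            dst += count;
            src += count / 2;                                 // src is a uint16_t*
        }
        ++src;                                                // add esi,2: skip the 0xFFFF terminator
        dline += size;                                        // add edx,size: next destination line
    }
}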

C++ HOW can this out-of-range access inside struct go wrong?

#include <iostream>
#include <random>
using namespace std;
struct TradeMsg {
int64_t timestamp; // 0->7
char exchange; // 8
char symbol[17]; // 9->25
char sale_condition[4]; // 26 -> 29
char source_of_trade; // 30
uint8_t trade_correction; // 31
int64_t trade_volume; // 32->39
int64_t trade_price; // 40->47
};
static_assert(sizeof(TradeMsg) == 48);
char buffer[1000000];
template<class T, size_t N=1>
int someFunc(char* buffer, T* output, int& cursor) {
// read + process data from buffer. Return data in output. Set cursor to the last byte read + 1.
return cursor + (rand() % 20) + 1; // dummy code
}
void parseData(TradeMsg* msg) {
int cursor = 0;
cursor = someFunc<int64_t>(buffer, &msg->timestamp, cursor);
cursor = someFunc<char>(buffer, &msg->exchange, cursor);
cursor++;
int i = 0;
// i is GUARANTEED to be <= 17 after this loop,
// edit: the input data in buffer[] guarantee that fact.
while (buffer[cursor + i] != ',') {
msg->symbol[i] = buffer[cursor + i];
i++;
}
msg->symbol[i] = '\n'; // might access symbol[17].
cursor = cursor + i + 1;
for (i=0; i<4; i++) msg->sale_condition[i] = buffer[cursor + i];
cursor += 5;
//cursor = someFunc...
}
int main()
{
TradeMsg a;
a.symbol[17] = '\0';
return 0;
}
I have this struct that is guaranteed to have a predictable size. In the code, there is a case where the program tries to assign a value to an array element past its size: msg->symbol[17] = ... .
However, in that case, the assignment does not cause any harm as long as:
It is done before the next struct members (sale_condition) are assigned (no unexpected code reordering).
It does not modify any previous members (timestamp, exchange).
It does not access any memory outside the struct.
I read that this is undefined behavior. But what kind of compiler optimization/code generation can make this go wrong? symbol[17] is pretty deep inside the middle of the struct, so I don't see how the compiler can generate an access outside it. Assume the platform is x86-64 only.
Various folks have pointed out debug-mode checks that will fire on access outside the bounds of an array member of a struct, with options like gcc -fsanitize=undefined. Separate from that, it's also legal for a compiler to use the assumption of non-overlap between member accesses to reorder two assignments which actually do alias:
@Peter in comments points out that the compiler is allowed to assume that accesses to msg->symbol[i] don't affect other struct members, and potentially delay msg->symbol[i] = '\n'; until after the loop that writes msg->sale_condition[i]. (i.e. sink that store to the bottom of the function).
There isn't a good reason you'd expect a compiler to want to do that in this function alone, but perhaps after inlining into some caller that also stored something there, it could be relevant. Or just because it's a DeathStation 9000 that exists in this thought experiment to break your code.
You could write this safely, although GCC compiles that worse
Since char* is allowed to alias any other object, you could offset a char* relative to the start of the whole struct, rather than to the start of the member array. Use offsetof to find the right start point like this:
#include <cstddef>
...
((char*)msg + offsetof(TradeMsg, symbol))[i] = '\n'; // might access symbol[17].
That's exactly equivalent to *((char*)msg + offsetof(...) + i) = '\n'; by definition of C++'s [] operator, even though it lets you use [i] to index relative to the same position.
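For illustration only, the same idea can be applied to the whole copy, hoisting the aliasing-safe base pointer once (a sketch of one possible restructuring, not necessarily what the Godbolt comparison below compiled):
char* sym = (char*)msg + offsetof(TradeMsg, symbol);  // base of the symbol member, but a pointer into the whole object
int i = 0;
while (buffer[cursor + i] != ',') {
    sym[i] = buffer[cursor + i];
    i++;
}
sym[i] = '\n';   // stays within the TradeMsg object even when i == 17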
However, that does compile to less efficient asm with GCC11.2 -O2 (Godbolt), mostly because int i, cursor are narrower than pointer-width. The "safe" version that redoes indexing from the start of the struct does more indexing work in asm, not using the msg+offsetof(symbol) pointer that it was already using as the base register in the loop.
# original version, with UB if `i` goes past the buffer.
# gcc11.2 -O2 -march=haswell. -O3 fully unrolls into a chain of copy/branch
... partially peeled first iteration
.L3: # do{
mov BYTE PTR [rbx+8+rax], dl # store into msg->symbol[i]
movsx rdi, eax # not read inside the loop
lea ecx, [r8+rax]
inc rax
movzx edx, BYTE PTR buffer[rsi+1+rax] # load from buffer
cmp dl, 44
jne .L3 # }while(buffer[cursor+i] != ',')
## End of copy-and-search loop.
# Loops are identical up to this point except for MOVSX here vs. MOV in the no-UB version.
movsx rcx, ecx # just redo sign extension of this calculation that was done repeatedly inside the loop just for this, apparently.
.L2:
mov BYTE PTR [rbx+9+rdi], 10 # store a newline
mov eax, 1 # set up for next loop
# offsetof version, without UB
# same loop, but with RDI and RSI usage switched.
# And with mov esi, eax zero extension instead of movsx rdi, eax sign extension
cmp dl, 44
jne .L3 # }while(buffer[cursor+i] != ',')
add esi, 9 # offsetof(TradeMsg, symbol)
movsx rcx, ecx # more stuff getting sign extended.
movsx rsi, esi # including something used in the newline store
.L2:
mov BYTE PTR [rbx+rsi], 10
mov eax, 1 # set up for next loop
The RCX calculation seems to just be for use by the next loop, setting sale_conditions.
BTW, the copy-and-search loop is like strcpy but with a ',' terminator. Unfortunately gcc/clang don't know how to optimize that; they compile to a slow byte-at-a-time loop, not e.g. an AVX512BW masked store using mask-1 from a vec == set1_epi8(',') compare, to get a mask selecting the bytes-before-',' instead of the comma element. (Probably needs a bithack to isolate that lowest-set-bit as the only set bit, though, unless it's safe to always copy 16 or 17 bytes separate from finding the ',' position, which could be done efficiently without masked stores or branching.)
Another option might be a union between a char[21] and struct{ char sym[17], sale[4];}, if you use a C++ implementation that allows C99-style union type-punning. (It's a GNU extension, and also supported by MSVC, but not necessarily literally every x86 compiler.)
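A rough sketch of that layout, with made-up member names; the offsets and total size stay the same because every member of the union has alignment 1, and only the read-back through the struct member relies on the non-standard type punning:
#include <cstdint>

struct SymAndSale {                      // hypothetical helper type
    char symbol[17];                     // 9 -> 25
    char sale_condition[4];              // 26 -> 29
};

struct TradeMsgAlt {                     // hypothetical variant of TradeMsg
    int64_t timestamp;                   // 0 -> 7
    char exchange;                       // 8
    union {
        char raw[21];                    // symbol + sale_condition viewed as one array
        SymAndSale f;
    };
    char source_of_trade;                // 30
    uint8_t trade_correction;            // 31
    int64_t trade_volume;                // 32 -> 39
    int64_t trade_price;                 // 40 -> 47
};
static_assert(sizeof(TradeMsgAlt) == 48, "layout unchanged");

// msg->raw[17] = '\n' is now an ordinary in-bounds array write; reading those bytes
// back through msg->f.sale_condition is the (GNU/MSVC extension) type-punning part.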
Also, style-wise, shadowing int i = 0; with for( int i=0 ; i<4 ; i++ ) is poor style. Pick a different var name for that loop, like j. (Or if there is anything meaningful, a better name for i which has to survive across multiple loops.)
In a few cases it can actually go wrong in practice:
When UndefinedBehaviorSanitizer instrumentation is enabled: https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
In a C++ interpreter (yes, they exist): https://root.cern/cling/
Your symbol has a size of 17, yet you are trying to assign a value to the 18th element: a.symbol[17] = '\0';
Remember that your index values start at 0, not 1.
So you have two places that can go wrong: i can reach 17, which will cause an out-of-bounds access, and that last line I showed above will cause one too.

finding average, min and max in assembly

I'm having a problem finding the average, min and max of an array in assembly language. I created a simple array with C++ and created a test.asm file to pass it through. I figured out the average, but now it's the min and max I can't seem to figure out.
#include <iostream>
using namespace std;
extern "C"
int test(int*, int);
int main()
{
const int SIZE = 7;
int arr[SIZE] = { 1,2,3,4,5,6,7 };
int val = test(arr, SIZE);
cout << "The function test returned: " << val << endl;
return 0;
}
This is my test.asm that adds all the values, divides by the count, and returns 4 (the average).
.686
.model flat
.code
_test PROC ;named _test because C automatically prepends an underscore; it is needed to interoperate
push ebp
mov ebp,esp ;stack pointer to ebp
mov ebx,[ebp+8] ; address of first array element
mov ecx,[ebp+12]
mov ebp,0
mov edx,0
mov eax,0
loopMe:
cmp ebp,ecx
je allDone
add eax,[ebx+edx]
add edx,4
add ebp,1
jmp loopMe
allDone:
mov edx,0
div ecx
pop ebp
ret
_test ENDP
END
I am still trying to figure out how to find the min, since the max will be done in a similar way. I assume you use cmp to compare values, but everything I tried so far hasn't been successful. I'm fairly new to assembly language and it's hard for me to grasp. Any help is appreciated.
Any help is appreciated
Ok, so I will show you a refactored average function, even if you didn't ask for it directly. :)
Things you can learn from this:
simplified function prologue/epilogue, when ebp is not modified in code
the input array is of 32b int values, so to get a correct average you should accumulate a 64b sum and do a signed 64b division
subtle "tricks": how to get a zero value (xor), and how inc adds 1 to a value (lowering code size)
handling zero sized array by returning fake average 0 (no crash)
addition of two 64b values composed from 32b registers/instructions
counting an ordinary element index (incremented by 1, so a direct cmp against size is possible), yet addressing 32b values (hence the *4 scaling in the addressing mode)
renamed to getAverage
BTW, this is not optimized for performance; I tried to keep the source "simple", so it's easy to read and understand what it is doing.
_getAverage PROC
; avoiding `ebp` usage, so no need to save/set it
mov ebx,[esp+4] ; address of first array element
mov ecx,[esp+8] ; size of array
xor esi,esi ; array index 0
; 64b sum (edx:eax) = 0
xor eax,eax
cdq
; test for invalid input (zero sized array)
jecxz zeroSizeArray ; arguments validation, returns 0 for 0 size
; here "0 < size", so no "index < size" test needed for first element
; "do { ... } while(index < size);" loop variant
sumLoop:
; extend value from array[esi] to 64b (edi is upper 32b)
mov edi,[ebx+esi*4]
sar edi,31
; edx:eax += edi:array[esi] (64b array value added to 64b sum)
add eax,[ebx+esi*4]
adc edx,edi
; next index and loop while index < size
inc esi
cmp esi,ecx
jb sumLoop
; divide the 64b sum of integers by "size" to get average value
idiv ecx ; signed (!) division (input array is signed "int")
; can't overflow (Divide-error), as the sum value was accumulated
; from 32b values only, so EAX contains full correct result
zeroSizeArray:
ret
_getAverage ENDP
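For the min/max part the question actually asked about, here is a hedged C++ sketch of the compare-and-keep logic that a min routine would mirror (getMin is a made-up name; max just flips the comparison):
// Scan once, keeping the best value seen so far; in asm this is a cmp followed by
// a conditional jump (or cmov) that either skips or performs the update.
extern "C" int getMin(int* arr, int size)
{
    if (size <= 0) return 0;      // same "return 0 for an empty array" convention as above
    int best = arr[0];
    for (int i = 1; i < size; i++)
        if (arr[i] < best)        // for max: arr[i] > best
            best = arr[i];
    return best;
}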

C++ external assembly: where is error in my code?

main.cpp
// Calls the external LongRandom function, written in
// assembly language, that returns an unsigned 32-bit
// random integer. Compile in the Large memory model.
// Procedure called LongRandomArray that fills an array with 32-bit unsigned
// random integers
#include <iostream.h>
#include <conio.h>
extern "C" {
unsigned long LongRandom();
void LongRandomArray(unsigned long * buffer, unsigned count);
}
const int ARRAY_SIZE = 20;
int main()
{
// Allocate array storage and fill with 32-bit
// unsigned random integers.
unsigned long * rArray = new unsigned long[ARRAY_SIZE];
LongRandomArray(rArray,ARRAY_SIZE);
for(unsigned i = 0; i < 20; i++)
{
cout << rArray[i] << ',';
}
cout << endl;
getch();
return 0;
}
LongRandom & LongRandomArray procedure module (longrand.asm)
.model large
.386
Public _LongRandom
Public _LongRandomArray
.data
seed dd 12345678h
; Return an unsigned pseudo-random 32-bit integer
; in DX:AX,in the range 0 - FFFFFFFFh.
.code
_LongRandom proc far, C
mov eax, 214013
mul seed
xor edx,edx
add eax, 2531011
mov seed, eax ; save the seed for the next call
shld edx,eax,16 ; copy upper 16 bits of EAX to DX
ret
_LongRandom endp
_LongRandomArray proc far, C
ARG bufferPtr:DWORD, count:WORD
; fill random array
mov edi,bufferPtr
mov cx, count
L1:
call _LongRandom
mov word ptr [edi],dx
add edi,2
mov word ptr [edi],ax
add edi,2
loop L1
ret
_LongRandomArray endp
end
This code is based on a 16-bit example for MS-DOS from Kip Irvine's assembly book (6th ed.) and explicitly written for Borland C++ 5.01 and TASM 4.0 (see chapter 13.4 "Linking to C/C++ in Real-Address Mode").
Pointers in 16-bit mode consist of a segment and an offset, usually written as segment:offset. This is not the real memory address, which will be calculated by the processor. You cannot load segment:offset into a 32-bit register (EDI) and store a value to memory. So
...
mov edi,bufferPtr
...
mov word ptr [edi],dx
...
is wrong. You have to load the segment part of the pointer into a segment register, e.g. ES, and the offset part into an appropriate general-purpose 16-bit register, e.g. DI, and possibly use a segment override:
...
push es
les di,bufferPtr ; bufferPtr => ES:DI
...
mov word ptr es:[di],dx
...
pop es
...
The ARG directive replaces the name of the variable with the appropriate [bp+x] operand. Therefore you need a prologue (and an epilogue). TASM inserts the right instructions if the PROC header is written properly, which is not the case here. Take a look at the following working function:
_LongRandomArray PROC C FAR
ARG bufferPtr:DWORD, count:WORD
push es
les di,bufferPtr
mov cx, count
L1:
call _LongRandom
mov word ptr es:[di],dx
add di,2
mov word ptr es:[di],ax
add di,2
loop L1
pop es
ret
_LongRandomArray ENDP
Compile your code with BCC (not BCC32):
BCC -ml main.cpp longrand.asm
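As an aside, what _LongRandom computes is a plain 32-bit linear congruential step; a hedged C++ equivalent for reference (the DX:AX pair is just this 32-bit result split into halves):
#include <cstdint>

// Same constants and same 32-bit wraparound as the _LongRandom procedure above.
static uint32_t seed = 0x12345678u;

extern "C" uint32_t LongRandom()
{
    seed = seed * 214013u + 2531011u;   // mul + add, truncated to 32 bits
    return seed;
}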

Accessing three static arrays is quicker than one static array containing 3x data?

I have 700 items and I loop through them; for each one I obtain the item's three attributes and perform some basic calculations. I have implemented this using two techniques:
1) Three 700-element arrays, one array for each of the three attributes. So:
item0.a = array1[0]
item0.b = array2[0]
item0.e = array3[0]
2) One 2100-element array containing data for the three attributes consecutively. So:
item0.a = array[(0*3)+0]
item0.b = array[(0*3)+1]
item0.e = array[(0*3)+2]
Now the three item attributes a, b and e are used together within the loop - therefore it would make sense that if you store them in one array the performance should be better than with the three-array technique (due to spatial locality). However:
Three 700-element arrays = 3300 CPU cycles on average for the whole loop
One 2100-element array = 3500 CPU cycles on average for the whole loop
Here is the code for the 2100-array technique:
unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array[2100];
//I have left out code for simplicity. You can assume by now the array is populated.
start = __rdtscp(&x);
for(int i=0; i < 700; i++){
unsigned short j = i * 3;
unsigned int a = array[j + 0];
unsigned int b = array[j + 1];
data_for_all_items = data_for_all_items & (a!= -1 & b != -1);
unsigned int e = array[j + 2];
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
and here is the code for the three 700-element arrays technique:
unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array1[700];
unsigned int array2[700];
unsigned int array3[700];
//I have left out code for simplicity. You can assume by now the arrays are populated.
start = __rdtscp(&x);
for(int i=0; i < 700; i++){
unsigned int a= array1[i]; //Array 1
unsigned int b= array2[i]; //Array 2
data_for_all_items = data_for_all_items & (a!= -1 & b != -1);
unsigned int e = array3[i]; //Array 3
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
Why isn't the technique using one 2100-element array faster? I expected it to be, because the three attributes are used together for each of the 700 items.
I used MSVC 2012, Win 7 64
Assembly for 3x 700-element array technique:
start = __rdtscp(&x);
rdtscp
shl rdx,20h
lea r8,[this]
or rax,rdx
mov dword ptr [r8],ecx
mov r8d,8ch
mov r9,rax
lea rdx,[rbx+0Ch]
for(int i=0; i < 700; i++){
sub rdi,rbx
unsigned int a = array1[i];
unsigned int b = array2[i];
data_for_all_items = data_for_all_items & (a != -1 & b != -1);
cmp dword ptr [rdi+rdx-0Ch],0FFFFFFFFh
lea rdx,[rdx+14h]
setne cl
cmp dword ptr [rdi+rdx-1Ch],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdi+rdx-18h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdi+rdx-10h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdi+rdx-14h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-20h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-1Ch],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-18h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-10h],0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [rdx-14h],0FFFFFFFFh
setne al
and cl,al
and r15b,cl
dec r8
jne 013F26DA53h
unsigned int e = array3[i];
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
rdtscp
shl rdx,20h
lea r8,[y]
or rax,rdx
mov dword ptr [r8],ecx
Assembler for the 2100-element array technique:
start = __rdtscp(&x);
rdtscp
lea r8,[this]
shl rdx,20h
or rax,rdx
mov dword ptr [r8],ecx
for(int i=0; i < 700; i++){
xor r8d,r8d
mov r10,rax
unsigned short j = i*3;
movzx ecx,r8w
add cx,cx
lea edx,[rcx+r8]
unsigned int a = array[j + 0];
unsigned int b = array[j + 1];
data_for_all_items = data_for_all_items & (best_ask != -1 & best_bid != -1);
movzx ecx,dx
cmp dword ptr [r9+rcx*4+4],0FFFFFFFFh
setne dl
cmp dword ptr [r9+rcx*4],0FFFFFFFFh
setne al
inc r8d
and dl,al
and r14b,dl
cmp r8d,2BCh
jl 013F05DA10h
unsigned int e = array[pos + 2];
c += (a * e);
d += (b * e);
}
finish = __rdtscp(&y);
rdtscp
shl rdx,20h
lea r8,[y]
or rax,rdx
mov dword ptr [r8],ecx
Edit: Given your assembly code, the second loop (the three separate 700-element arrays) is unrolled five times. The unrolled version could run faster on an out-of-order execution CPU such as any modern x86/x86-64 CPU.
The second code is vectorisable - two elements of each array could be loaded at each iteration in one XMM register each. Since modern CPUs use SSE for both scalar and vector FP arithmetic, this cuts the number of cycles roughly in half. With an AVX-capable CPU four doubles could be loaded in a YMM register and therefore the number of cycles should be cut in four.
The first loop is not vectorisable along i since the value of a in iteration i+1 comes from a location 3 elements after the one where the value of a in iteration i comes from. In that case vectorisation requires gathered vector loads, and those are only supported in the AVX2 instruction set.
Using proper data structures is crucial when programming CPUs with vector capabilities. Converting codes like your first loop into something like your second loop is 90% of the job that one has to do in order to get good performance on Intel Xeon Phi, which has very wide vector registers but an awfully slow in-order execution engine.
The simple answer is that version 1 is SIMD friendly and version 2 is not. However, it's possible to make version 2, the 2100 element array, SIMD friendly. You need to use a Hybrid Struct of Arrays, aka an Array of Struct of Arrays (AoSoA). You arrange the array like this: aaaa bbbb eeee aaaa bbbb eeee ....
Below is code using GCC's vector extensions to do this. Note that now the 2100 element array code looks almost the same as the 700 element array code, but it uses one array instead of three. And instead of having 700 elements between a, b and e there are only 12 elements between them.
I did not find an easy solution to convert uint4 to double4 with the GCC vector extensions, and I don't want to spend the time to write intrinsics to do this right now, so I made c and v unsigned int; but for performance I would not want to be converting uint4 to double4 in a loop anyway.
typedef unsigned int uint4 __attribute__ ((vector_size (16)));
//typedef double double4 __attribute__ ((vector_size (32)));
uint4 zero = {};
unsigned int array[2100];
uint4 test = -1 + zero;
//double4 cv = {};
//double4 dv = {};
uint4 cv = {};
uint4 dv = {};
uint4* av = (uint4*)&array[0];
uint4* bv = (uint4*)&array[4];
uint4* ev = (uint4*)&array[8];
for(int i=0; i < 525; i+=3) { //525 = 2100/4 = 700/4*3
test = test & ((av[i]!= -1) & (bv[i] != -1));
cv += (av[i] * ev[i]);
dv += (bv[i] * ev[i]);
}
double c = cv[0] + cv[1] + cv[2] + cv[3];
double v = dv[0] + dv[1] + dv[2] + dv[3];
bool data_for_all_items = test[0] & test[1] & test[2] & test[3];
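One step the code above takes for granted is that the data already sits in the aaaa bbbb eeee layout; here is a hedged helper sketch (names made up) that repacks the original a,b,e-interleaved 2100-element array into that AoSoA layout:
// Repack a0 b0 e0 a1 b1 e1 ... into a a a a  b b b b  e e e e  ... blocks of 4 items.
// Assumes 'items' is a multiple of 4 (700 -> 175 blocks of 12 unsigned ints each).
void repack_aosoa(const unsigned int* interleaved, unsigned int* aosoa, int items)
{
    for (int block = 0; block < items / 4; ++block) {
        for (int lane = 0; lane < 4; ++lane) {
            int item = block * 4 + lane;
            aosoa[block * 12 + 0 + lane] = interleaved[item * 3 + 0]; // a
            aosoa[block * 12 + 4 + lane] = interleaved[item * 3 + 1]; // b
            aosoa[block * 12 + 8 + lane] = interleaved[item * 3 + 2]; // e
        }
    }
}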
The concept of 'spatial locality' is throwing you off a little bit. Chances are that with both solutions, your processor is doing its best to cache the arrays.
Unfortunately, the version of your code that uses one array also has some extra math which is being performed. This is probably where your extra cycles are being spent.
Spatial locality is indeed useful, but it's actually helping you in the second case (3 distinct arrays) much more.
The cache line size is 64 bytes (note that it isn't divisible by 3), so a single access to a 4 or 8 byte value is effectively prefetching the next elements. In addition, keep in mind that the CPU HW prefetcher is likely to go on and prefetch ahead even further elements.
However, when a, b, e are packed together, you're "wasting" this valuable prefetching on elements of the same iteration. When you access a, there's no point in prefetching b and e - the next loads are already going there (and would likely just merge in the CPU with the first load or wait for it to retrieve the data). In fact, when the arrays are merged - you fetch a new memory line only once per 64/(3*4) = ~5.3 iterations. The bad alignment even means that on some iterations you'll have a and maybe b long before you get e; this imbalance is usually bad news.
In reality, since the iterations are independent, your CPU would go ahead and start the second iteration relatively fast thanks to the combination of loop unrolling (in case it was done) and out-of-order execution (calculating the index for the next set of iterations is simple and has no dependencies on the loads sent by the last ones). However, you would have to run ahead pretty far in order to issue the next load every time, and eventually the finite size of CPU instruction queues will block you, maybe before reaching the full potential memory bandwidth (number of parallel outstanding loads).
The alternative option on the other hand, where you have 3 distinct arrays, uses the spatial locality / HW prefetching solely across iterations. On each iteration, you'll issue 3 loads, which would fetch a full line once every 64/4=16 iterations. The overall data fetched is the same (well, it's the same data), but the timeliness is much better because you fetch ahead for the next 16 iterations instead of the 5. The difference becomes even bigger when HW prefetching is involved, because you have 3 streams instead of one, meaning you can issue more prefetches (and look even further ahead).

How can I optimize conversion from half-precision float16 to single-precision float32?

I'm trying to improve the performance of my function. The profiler points to the code in the inner loop. Can I improve the performance of that code, maybe using SSE intrinsics?
void ConvertImageFrom_R16_FLOAT_To_R32_FLOAT(char* buffer, void* convertedData, DWORD width, DWORD height, UINT rowPitch)
{
struct SINGLE_FLOAT
{
union {
struct {
unsigned __int32 R_m : 23;
unsigned __int32 R_e : 8;
unsigned __int32 R_s : 1;
};
struct {
float r;
};
};
};
C_ASSERT(sizeof(SINGLE_FLOAT) == 4); // 4 bytes
struct HALF_FLOAT
{
unsigned __int16 R_m : 10;
unsigned __int16 R_e : 5;
unsigned __int16 R_s : 1;
};
C_ASSERT(sizeof(HALF_FLOAT) == 2);
SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData;
for(DWORD j = 0; j< height; j++)
{
HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
for(DWORD i = 0; i< width; i++)
{
d->R_s = s->R_s;
d->R_e = s->R_e - 15 + 127;
d->R_m = s->R_m << (23-10);
d++;
s++;
}
}
}
Update:
Disassembly
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE Utils.cpp
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC ?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z ; ConvertImageFrom_R16_FLOAT_To_R32_FLOAT
; Function compile flags: /Ogtp
; COMDAT ?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z
_TEXT SEGMENT
_buffer$ = 8 ; size = 4
tv83 = 12 ; size = 4
_convertedData$ = 12 ; size = 4
_width$ = 16 ; size = 4
_height$ = 20 ; size = 4
_rowPitch$ = 24 ; size = 4
?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z PROC ; ConvertImageFrom_R16_FLOAT_To_R32_FLOAT, COMDAT
; 323 : {
push ebp
mov ebp, esp
; 343 : for(DWORD j = 0; j< height; j++)
mov eax, DWORD PTR _height$[ebp]
push esi
mov esi, DWORD PTR _convertedData$[ebp]
test eax, eax
je SHORT $LN4@ConvertIma
; 324 : union SINGLE_FLOAT {
; 325 : struct {
; 326 : unsigned __int32 R_m : 23;
; 327 : unsigned __int32 R_e : 8;
; 328 : unsigned __int32 R_s : 1;
; 329 : };
; 330 : struct {
; 331 : float r;
; 332 : };
; 333 : };
; 334 : C_ASSERT(sizeof(SINGLE_FLOAT) == 4);
; 335 : struct HALF_FLOAT
; 336 : {
; 337 : unsigned __int16 R_m : 10;
; 338 : unsigned __int16 R_e : 5;
; 339 : unsigned __int16 R_s : 1;
; 340 : };
; 341 : C_ASSERT(sizeof(HALF_FLOAT) == 2);
; 342 : SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData;
push ebx
mov ebx, DWORD PTR _buffer$[ebp]
push edi
mov DWORD PTR tv83[ebp], eax
$LL13@ConvertIma:
; 344 : {
; 345 : HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
; 346 : for(DWORD i = 0; i< width; i++)
mov edi, DWORD PTR _width$[ebp]
mov edx, ebx
test edi, edi
je SHORT $LN5@ConvertIma
npad 1
$LL3@ConvertIma:
; 347 : {
; 348 : d->R_s = s->R_s;
movzx ecx, WORD PTR [edx]
movzx eax, WORD PTR [edx]
shl ecx, 16 ; 00000010H
xor ecx, DWORD PTR [esi]
shl eax, 16 ; 00000010H
and ecx, 2147483647 ; 7fffffffH
xor ecx, eax
mov DWORD PTR [esi], ecx
; 349 : d->R_e = s->R_e - 15 + 127;
movzx eax, WORD PTR [edx]
shr eax, 10 ; 0000000aH
and eax, 31 ; 0000001fH
add eax, 112 ; 00000070H
shl eax, 23 ; 00000017H
xor eax, ecx
and eax, 2139095040 ; 7f800000H
xor eax, ecx
mov DWORD PTR [esi], eax
; 350 : d->R_m = s->R_m << (23-10);
movzx ecx, WORD PTR [edx]
and ecx, 1023 ; 000003ffH
shl ecx, 13 ; 0000000dH
and eax, -8388608 ; ff800000H
or ecx, eax
mov DWORD PTR [esi], ecx
; 351 : d++;
add esi, 4
; 352 : s++;
add edx, 2
dec edi
jne SHORT $LL3@ConvertIma
$LN5@ConvertIma:
; 343 : for(DWORD j = 0; j< height; j++)
add ebx, DWORD PTR _rowPitch$[ebp]
dec DWORD PTR tv83[ebp]
jne SHORT $LL13@ConvertIma
pop edi
pop ebx
$LN4@ConvertIma:
pop esi
; 353 : }
; 354 : }
; 355 : }
pop ebp
ret 0
?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z ENDP ; ConvertImageFrom_R16_FLOAT_To_R32_FLOAT
_TEXT ENDS
The x86 F16C instruction-set extension adds hardware support for converting single-precision float vectors to/from vectors of half-precision float.
The format is the same IEEE 754 half-precision binary16 that you describe. I didn't check that the endianness is the same as your struct, but that's easy to fix if needed (with a pshufb).
F16C is supported starting from Intel IvyBridge and AMD Piledriver. (And has its own CPUID feature bit, which your code should check for, otherwise fall back to SIMD integer shifts and shuffles).
The intrinsics for VCVTPS2PH are:
__m128i _mm_cvtps_ph ( __m128 m1, const int imm);
__m128i _mm256_cvtps_ph(__m256 m1, const int imm);
The immediate byte is a rounding control. The compiler can use it as a convert-and-store directly to memory (unlike most instructions that can optionally use a memory operand, where it's the source operand that can be memory instead of a register.)
VCVTPH2PS goes the other way, and is just like most other SSE instructions (can be used between registers or as a load).
__m128 _mm_cvtph_ps ( __m128i m1);
__m256 _mm256_cvtph_ps ( __m128i m1)
F16C is so efficient that you might want to consider leaving your image in half-precision format, and converting on the fly every time you need a vector of data from it. This is great for your cache footprint.
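A minimal sketch of a conversion loop built on these intrinsics (the function name is illustrative; it assumes F16C support has already been verified via CPUID and, for brevity, that width is a multiple of 8 - compile with -mf16c on GCC/Clang):
#include <immintrin.h>
#include <cstdint>

// Convert one row of binary16 values to float using VCVTPH2PS, 8 at a time.
void ConvertRow_F16C(const uint16_t* src, float* dst, size_t width)
{
    for (size_t i = 0; i < width; i += 8) {
        __m128i half = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i)); // 8 half floats
        __m256  sing = _mm256_cvtph_ps(half);                                      // 8 single floats
        _mm256_storeu_ps(dst + i, sing);
    }
}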
Accessing bitfields in memory can be really tricky, depending on the architecture, of course.
You might achieve better performance if you made a union of a float and a 32 bit integer, and simply performed all decomposition and composition using local variables. That way the generated code could perform the entire operation using only processor registers.
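A hedged sketch of that approach, doing the field extraction with shifts and masks on a plain 32-bit integer (memcpy stands in for the union to stay within strict-aliasing rules; no Inf/NaN/denormal special-casing, exactly like the original bitfield code):
#include <cstdint>
#include <cstring>

static inline float HalfToFloat(uint16_t h)
{
    uint32_t sign     = (uint32_t)(h >> 15) & 0x1u;
    uint32_t exponent = (uint32_t)(h >> 10) & 0x1Fu;
    uint32_t mantissa = (uint32_t)h & 0x3FFu;

    uint32_t bits = (sign << 31)
                  | ((exponent - 15 + 127) << 23)   // rebias 15 -> 127, same as the original
                  | (mantissa << 13);               // widen mantissa from 10 to 23 bits

    float f;
    std::memcpy(&f, &bits, sizeof f);               // bit-cast; compiles to a plain register move
    return f;
}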
The loops are independent of each other, so you could easily parallelize this code, either by using SIMD or OpenMP. A simple version would be splitting the top half and the bottom half of the image into two threads, running concurrently; a sketch follows below.
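A minimal sketch of the OpenMP variant, replacing the two loops inside the question's function and reusing its HALF_FLOAT/SINGLE_FLOAT types (compile with -fopenmp or /openmp; each row writes to a disjoint output range, so no synchronization is needed):
#pragma omp parallel for
for (int j = 0; j < (int)height; j++)
{
    HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
    SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData + (size_t)j * width;
    for (DWORD i = 0; i < width; i++)
    {
        d[i].R_s = s[i].R_s;
        d[i].R_e = s[i].R_e - 15 + 127;
        d[i].R_m = s[i].R_m << (23 - 10);
    }
}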
You're processing the data as a two dimension array. If you consider how it's laid out in memory you may be able to process it as a single dimensional array and you can save a little overhead by having one loop instead of nested loops.
I'd also compile to assembly code and make sure the compiler optimization worked and it isn't recalculating (15 + 127) hundreds of times.
You should be able to reduce this to a single instruction on chips which use the upcoming CVT16 instruction set. According to that Wikipedia article:
The CVT16 instructions allow conversion of floating point vectors between single precision and half precision.
SSE Intrinsics seem to be an excellent idea. Before you go down that road, you should
look at the assembly code generated by the compiler, (is there potential for optimization?)
search your compiler documentation how to generate SSE code automatically,
search your software library's documentation (or wherever the 16bit float type originated) for a function to bulk convert this type. (a conversion to 64bit floating point could be helpful too.) You are very likely not the first person to encounter this problem!
If all that fails, go and try your luck with some SSE intrinsics. To get some idea, here is some SSE code to convert from 32 to 16 bit floating point. (you want the reverse)
Besides SSE you should also consider multi-threading and offloading the task to the GPU.
Here are some ideas:
Put the constants into const register variables.
Some processors don't like fetching constants from memory; it is awkward and may take many instruction cycles.
Loop Unrolling
Repeat the statements in the loop, and increase the increment.
Processors prefer continuous instructions; jumps and branches anger them.
Data Prefetching (or loading the cache)
Use more variables in the loop, and declare them as volatile so the compiler doesn't optimize them:
SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData;
SINGLE_FLOAT* d1 = d + 1;
SINGLE_FLOAT* d2 = d + 2;
SINGLE_FLOAT* d3 = d + 3;
for(DWORD j = 0; j< height; j++)
{
HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
HALF_FLOAT* s1 = s + 1; // stay within the current row; width is assumed to be a multiple of 4
HALF_FLOAT* s2 = s + 2;
HALF_FLOAT* s3 = s + 3;
for(DWORD i = 0; i< width; i += 4)
{
d->R_s = s->R_s;
d->R_e = s->R_e - 15 + 127;
d->R_m = s->R_m << (23-10);
d1->R_s = s1->R_s;
d1->R_e = s1->R_e - 15 + 127;
d1->R_m = s1->R_m << (23-10);
d2->R_s = s2->R_s;
d2->R_e = s2->R_e - 15 + 127;
d2->R_m = s2->R_m << (23-10);
d3->R_s = s3->R_s;
d3->R_e = s3->R_e - 15 + 127;
d3->R_m = s3->R_m << (23-10);
d += 4;
d1 += 4;
d2 += 4;
d3 += 4;
s += 4;
s1 += 4;
s2 += 4;
s3 += 4;
}
}
I don't know about SSE intrinsics but it would be interesting to see a disassembly of your inner loop. An old-school way (that may not help much but that would be easy to try out) would be to reduce the number of iterations by doing two inner loops: one that does N (say 32) repeats of the processing (loop count of width/N) and then one to finish the remainder (loop count of width%N)... with those divs and modulos calculated outside the first loop to avoid recalculating them. Apologies if that sounds obvious!
The function is only doing a few small things. It is going to be tough to shave much off the time by optimisation, but as somebody already said, parallelisation has promise.
Check how many cache misses you are getting. If the data is paging in and out, you might be able to speed it up by applying more intelligence into the ordering to minimise cache swaps.
Also consider macro-optimisations. Are there any redundancies in the data computation that might be avoided (e.g. caching old results instead of recomputing them when needed)? Do you really need to convert the whole data set or could you just convert the bits you need? I don't know your application so I'm just guessing wildly here, but there might be scope for that kind of optimisation.
My suspicion is that this operation will be already bottlenecked on memory access, and making it more efficient (e.g., using SSE) would not make it execute more quickly. However this is only a suspicion.
Other things to try, assuming x86/x64, might be:
Don't d++ and s++, but use d[i] and s[i] on each iteration. (Then of course bump d after each scanline.) Since the elements of d are 4 bytes and those of s 2, this operation can be folded into the address calculation. (Unfortunately I can't guarantee that this would necessarily make execution more efficient.)
Remove the bitfield operations and do the operations manually. (When extracting, shift first and mask second, to maximize the likelihood that the mask can fit into a small immediate value.)
Unroll the loop, though with a loop as easily-predicted as this one it might not make much difference.
Count along each line from width down to zero. This stops the compiler having to fetch width each time round. Probably more important for x86, because it has so few registers. (If the CPU likes my "d[i] and s[i]" suggestion, you could make width signed, count from width-1 instead, and walk backwards.)
These would all be quicker to try than converting to SSE and would hopefully make it memory-bound, if it isn't already, at which point you can give up.
Finally if the output is in write-combined memory (e.g., it's a texture or vertex buffer or something accessed over AGP, or PCI Express, or whatever it is PCs have these days) then this could well result in poor performance, depending on what code the compiler has generated for the inner loop. So if that is the case you may get better results converting each scanline into a local buffer then using memcpy to copy it to its final destination.