I need to convert an array of bytes to array of floats. I get the bytes through a network connection, and then need to parse them into floats. the size of the array is not pre-definded.
this is the code I have so far, using unions. Do you have any suggestions on how to make it run faster?
int offset = DATA_OFFSET - 1;
UStuff bb;
//Convert every 4 bytes to float using a union
for (int i = 0; i < NUM_OF_POINTS;i++){
//Going backwards - due to endianness
for (int j = offset + BYTE_FLOAT*i + BYTE_FLOAT ; j > offset + BYTE_FLOAT*i; --j)
{
bb.c[(offset + BYTE_FLOAT*i + BYTE_FLOAT)- j] = sample[j];
}
res.append(bb.f);
}
return res;
This is the union I use
union UStuff
{
float f;
unsigned char c[4];
};
Technically you are not allowed to type-pun through a union in C++, although you are allowed to do it in C. The behaviour of your code is undefined.
Setting aside this important behaviour point: even then, you're assuming that a float is represented in the same way on all machines, which it isn't. You might think a float is a 32 bit IEEE754 little-endian block of data, but it doesn't have to be that.
Sadly the best solution will end up slower. Most serialisations of floating point data are performed by going into and back out from a string which, in your case pretty much solves the problem since you can represent them as an array of unsigned char data. So your only wrangle is down to figuring out the encoding of the data. Job done!
Short Answer
#include <cstdint>
#define NOT_STUPID 1
#define ENDIAN NOT_STUPID
namespace _ {
inline uint32_t UI4Set(char byte_0, char byte_1, char byte_2, char byte_3) {
#if ENDIAN == NOT_STUPID
return byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
#else
return byte_3 | ((uint32_t)byte_2) << 8 | ((uint32_t)byte_1) << 16 |
((uint32_t)byte_0) << 24;
#endif
}
inline float FLTSet(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t flt = UI4Set(byte_0, byte_1, byte_2, byte_3);
return *reinterpret_cast<float*>(&flt);
}
/* Use this function to write directly to RAM and avoid the xmm
registers. */
inline uint32_t FLTSet(char byte_0, char byte_1, char byte_2, char byte_3,
float* destination) {
uint32_t value = UI4Set (byte_0, byte_1, byte_2, byte_3);
*reinterpret_cast<uint32_t*>(destination) = value;
return value;
}
} //< namespace _
using namespace _; //< #see Kabuki Toolkit
static flt = FLTSet (0, 1, 2, 3);
int main () {
uint32_t flt_init = FLTSet (4, 5, 6, 7, &flt);
return 0;
}
//< This uses 4 extra bytes doesn't use the xmm register
Long Answer
It is GENERALLY NOT recommended to use a Union to convert to and from floating-point to integer because Unions to this day do not always generate optimal assembly code and the other techniques are more explicit and may use less typing; and rather then my opinion from other StackOverflow posts on Unions, we shall prove it will disassembly with a modern compiler: Visual-C++ 2018.
The first thing we must know about how to optimize floating-point algorithms is how the registers work. The core of the CPU is exclusively an integer-processing-unit with coprocessors (i.e. extensions) on them for processing floating-point numbers. These Load-Store Machines (LSM) can only work with integers and they must use a separate set of registers for interacting with floating-point coprocessors. On x86_64 these are the xmm registers, which are 128-bits wide and can process Single Instruction Multiple Data (SIMD). In C++ way to load-and-store a floating-point register is:
int Foo(double foo) { return foo + *reinterpret_cast<double*>(&foo); }
int main() {
double foo = 1.0;
uint64_t bar = *reinterpret_cast<uint64_t*>(&foo);
return Foo(bar);
}
Now let's inspect the disassembly with Visual-C++ O2 optimations on because without them you will get a bunch of debug stack frame variables. I had to add the function Foo into the example to avoid the code being optimized away.
double foo = 1.0;
uint64_t bar = *reinterpret_cast<uint64_t*>(&foo);
00007FF7482E16A0 mov rax,3FF0000000000000h
00007FF7482E16AA xorps xmm0,xmm0
00007FF7482E16AD cvtsi2sd xmm0,rax
return Foo(bar);
00007FF7482E16B2 addsd xmm0,xmm0
00007FF7482E16B6 cvttsd2si eax,xmm0
}
00007FF7482E16BA ret
And as described, we can see that the LSM first moves the double value into an integer register, then it zeros out the xmm0 register using a xor function because the register is 128-bits wide and we're loading a 64-bit integer, then loads the contents of the integer register to the floating-point register using the cvtsi2sd instruction, then finally followed by the cvttsd2si instruction that then loads the value back from the xmm0 register to the return register before finally returning.
So now let's address that concern about generating optimal assembly code using this test script and Visual-C++ 2018:
#include <stdafx.h>
#include <cstdint>
#include <cstdio>
static float foo = 0.0f;
void SetFooUnion(char byte_0, char byte_1, char byte_2, char byte_3) {
union {
float flt;
char bytes[4];
} u = {foo};
u.bytes[0] = byte_0;
u.bytes[1] = byte_1;
u.bytes[2] = byte_2;
u.bytes[3] = byte_3;
foo = u.flt;
}
void SetFooManually(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t faster_method = byte_0 | ((uint32_t)byte_1) << 8 |
((uint32_t)byte_2) << 16 | ((uint32_t)byte_3) << 24;
*reinterpret_cast<uint32_t*>(&foo) = faster_method;
}
namespace _ {
inline uint32_t UI4Set(char byte_0, char byte_1, char byte_2, char byte_3) {
return byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
}
inline float FLTSet(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t flt = UI4Set(byte_0, byte_1, byte_2, byte_3);
return *reinterpret_cast<float*>(&flt);
}
inline void FLTSet(char byte_0, char byte_1, char byte_2, char byte_3,
float* destination) {
uint32_t value = byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
*reinterpret_cast<uint32_t*>(destination) = value;
}
} // namespace _
int main() {
SetFooUnion(0, 1, 2, 3);
union {
float flt;
char bytes[4];
} u = {foo};
// Start union read tests
putchar(u.bytes[0]);
putchar(u.bytes[1]);
putchar(u.bytes[2]);
putchar(u.bytes[3]);
// Start union write tests
u.bytes[0] = 4;
u.bytes[2] = 5;
foo = u.flt;
// Start hand-coded tests
SetFooManually(6, 7, 8, 9);
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
putchar((char)(bar >> 8));
putchar((char)(bar >> 16));
putchar((char)(bar >> 24));
_::FLTSet (0, 1, 2, 3, &foo);
return 0;
}
Now after inspecting the O2-optimized disassembly, we have proven the compiler DOES NOT produce optimal code:
int main() {
00007FF6DB4A1000 sub rsp,28h
SetFooUnion(0, 1, 2, 3);
00007FF6DB4A1004 mov dword ptr [rsp+30h],3020100h
00007FF6DB4A100C movss xmm0,dword ptr [rsp+30h]
union {
float flt;
char bytes[4];
} u = {foo};
00007FF6DB4A1012 movss dword ptr [rsp+30h],xmm0
// Start union read tests
putchar(u.bytes[0]);
00007FF6DB4A1018 movsx ecx,byte ptr [u]
SetFooUnion(0, 1, 2, 3);
00007FF6DB4A101D movss dword ptr [foo (07FF6DB4A3628h)],xmm0
// Start union read tests
putchar(u.bytes[0]);
00007FF6DB4A1025 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[1]);
00007FF6DB4A102B movsx ecx,byte ptr [rsp+31h]
00007FF6DB4A1030 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[2]);
00007FF6DB4A1036 movsx ecx,byte ptr [rsp+32h]
00007FF6DB4A103B call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[3]);
00007FF6DB4A1041 movsx ecx,byte ptr [rsp+33h]
00007FF6DB4A1046 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
00007FF6DB4A104C mov ecx,6
// Start union write tests
u.bytes[0] = 4;
u.bytes[2] = 5;
foo = u.flt;
// Start hand-coded tests
SetFooManually(6, 7, 8, 9);
00007FF6DB4A1051 mov dword ptr [foo (07FF6DB4A3628h)],9080706h
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
00007FF6DB4A105B call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 8));
00007FF6DB4A1061 mov ecx,7
00007FF6DB4A1066 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 16));
00007FF6DB4A106C mov ecx,8
00007FF6DB4A1071 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 24));
00007FF6DB4A1077 mov ecx,9
00007FF6DB4A107C call qword ptr [__imp_putchar (07FF6DB4A2160h)]
return 0;
00007FF6DB4A1082 xor eax,eax
_::FLTSet(0, 1, 2, 3, &foo);
00007FF6DB4A1084 mov dword ptr [foo (07FF6DB4A3628h)],3020100h
}
00007FF6DB4A108E add rsp,28h
00007FF6DB4A1092 ret
Here is the raw main disassembly because the inlined functions are missing:
; Listing generated by Microsoft (R) Optimizing Compiler Version 19.12.25831.0
include listing.inc
INCLUDELIB OLDNAMES
EXTRN __imp_putchar:PROC
EXTRN __security_check_cookie:PROC
?foo##3MA DD 01H DUP (?) ; foo
_BSS ENDS
PUBLIC main
PUBLIC ?SetFooManually##YAXDDDD#Z ; SetFooManually
PUBLIC ?SetFooUnion##YAXDDDD#Z ; SetFooUnion
EXTRN _fltused:DWORD
; COMDAT pdata
pdata SEGMENT
$pdata$main DD imagerel $LN8
DD imagerel $LN8+137
DD imagerel $unwind$main
pdata ENDS
; COMDAT xdata
xdata SEGMENT
$unwind$main DD 010401H
DD 04204H
xdata ENDS
; Function compile flags: /Ogtpy
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; COMDAT ?SetFooManually##YAXDDDD#Z
_TEXT SEGMENT
byte_0$dead$ = 8
byte_1$dead$ = 16
byte_2$dead$ = 24
byte_3$dead$ = 32
?SetFooManually##YAXDDDD#Z PROC ; SetFooManually, COMDAT
00000 c7 05 00 00 00
00 06 07 08 09 mov DWORD PTR ?foo##3MA, 151521030 ; 09080706H
0000a c3 ret 0
?SetFooManually##YAXDDDD#Z ENDP ; SetFooManually
_TEXT ENDS
; Function compile flags: /Ogtpy
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; COMDAT main
_TEXT SEGMENT
u$1 = 48
u$ = 48
main PROC ; COMDAT
$LN8:
00000 48 83 ec 28 sub rsp, 40 ; 00000028H
00004 c7 44 24 30 00
01 02 03 mov DWORD PTR u$1[rsp], 50462976 ; 03020100H
0000c f3 0f 10 44 24
30 movss xmm0, DWORD PTR u$1[rsp]
00012 f3 0f 11 44 24
30 movss DWORD PTR u$[rsp], xmm0
00018 0f be 4c 24 30 movsx ecx, BYTE PTR u$[rsp]
0001d f3 0f 11 05 00
00 00 00 movss DWORD PTR ?foo##3MA, xmm0
00025 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0002b 0f be 4c 24 31 movsx ecx, BYTE PTR u$[rsp+1]
00030 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00036 0f be 4c 24 32 movsx ecx, BYTE PTR u$[rsp+2]
0003b ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00041 0f be 4c 24 33 movsx ecx, BYTE PTR u$[rsp+3]
00046 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0004c b9 06 00 00 00 mov ecx, 6
00051 c7 05 00 00 00
00 06 07 08 09 mov DWORD PTR ?foo##3MA, 151521030 ; 09080706H
0005b ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00061 b9 07 00 00 00 mov ecx, 7
00066 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0006c b9 08 00 00 00 mov ecx, 8
00071 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00077 b9 09 00 00 00 mov ecx, 9
0007c ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00082 33 c0 xor eax, eax
00084 48 83 c4 28 add rsp, 40 ; 00000028H
00088 c3 ret 0
main ENDP
_TEXT ENDS
END
So what is the difference?
?SetFooUnion##YAXDDDD#Z PROC ; SetFooUnion, COMDAT
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; Line 7
mov BYTE PTR [rsp+32], r9b
; Line 14
mov DWORD PTR u$[rsp], 50462976 ; 03020100H
; Line 18
movss xmm0, DWORD PTR u$[rsp]
movss DWORD PTR ?foo##3MA, xmm0
; Line 19
ret 0
?SetFooUnion##YAXDDDD#Z ENDP ; SetFooUnion
versus:
?SetFooManually##YAXDDDD#Z PROC ; SetFooManually, COMDAT
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; Line 34
mov DWORD PTR ?foo##3MA, 151521030 ; 09080706H
; Line 35
ret 0
?SetFooManually##YAXDDDD#Z ENDP ; SetFooManually
The first thing to notice is the effect of the Union on inline memory optimizations. Unions are designed specifically to multiplex RAM for different purposes for different time periods in order to reduce RAM usage, so this means the memory must remain coherent in RAM and thus is less inlinable. The Union code forced the compiler to write the Union out to RAM while the non-Union method just throws out your code and replaces with a single mov DWORD PTR ?foo##3MA, 151521030 instruction without the use of the xmm0 register!!! The O2 optimizations automatically inlined the SetFooUnion and SetFooManually functions, but the non-Union method inlined a lot more code used less RAM reads, evidence from the difference between the Union method's line of code:
movsx ecx,byte ptr [rsp+31h]
versus the non-Union method's version:
mov ecx,7
The Union is loading ecx from A POINTER TO RAM while the other is using the faster single-cycle mov instruction. THAT IS A HUGE PERFORMANCE INCREASE!!! However, this actually may be the desired behavior when working with real-time systems and multi-threaded applications because the compiler optimizations may be unwanted and may mess up our timing, or you may want to use a mix of the two methods.
Aside from the potentially suboptimal RAM usage, I tried for several hours to get the compiler to generate suboptimal assembly but I was not able to for most of my toy problems, so it does seem as though this is a rather nifty feature of Unions rather than a reason to shun them. My favorite C++ metaphor is that C++ is like a kitchen full of sharp knives, you need to pick the correct knife for the correct job, and just because there are a lot of sharp knives in a kitchen doesn't mean that you pull out all of the knives at once, or that you leave the knives out. As long as you keep the kitchen tidy then you're not going to cut yourself. A Union is a sharp knife that may help ensure greater RAM coherency but it requires more typing and slows the program down.
Related
I'm checking the Release build of my project done with the latest version of the VS 2017 C++ compiler. And I'm curious why did compiler choose to build the following code snippet:
//ncbSzBuffDataUsed of type INT32
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
pDst[i] = pSrc[i];
}
as such:
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2 movsxd r8,edx
00007FF664412521 4C 2B D1 sub r10,rcx
00007FF664412524 0F 1F 40 00 nop dword ptr [rax]
00007FF664412528 0F 1F 84 00 00 00 00 00 nop dword ptr [rax+rax]
00007FF664412530 41 0F B6 04 0A movzx eax,byte ptr [r10+rcx]
{
pDst[i] = pSrc[i];
00007FF664412535 88 01 mov byte ptr [rcx],al
00007FF664412537 48 8D 49 01 lea rcx,[rcx+1]
00007FF66441253B 49 83 E8 01 sub r8,1
00007FF66441253F 75 EF jne _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)
}
versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?
Edit: First up, there's an intrinsic for rep movsb which Peter Cordes tells us would be much faster here and I believe him (I guess I already did). If you want to force the compiler to do things this way, see: __movsb(): https://learn.microsoft.com/en-us/cpp/intrinsics/movsb.
As to why the compiler didn't do this for you, in the absence of any other ideas the answer might be register pressure. To use rep movsb The compiler would have to:
set up rsi (= source address)
set up rdi (= destination address)
set up rcx (= count)
issue the rep movsb
So now it has had to use up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically rsi and rdi are expected to be preserved across a function call, so if the compiler can get away with using them in the body of any particular function it will, and (on initial entry to the method, at least) rcx holds the this pointer.
Also, with the code that we see the compiler has generated there, the r10 and rcxregisters might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.
In practise, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1 - optimise for size, vs /O2 - optimise for speed) will likely also affect this.
More on the x64 register passing convention here, and on the x64 ABI generally here.
Edit 2 (again inspired by Peter's comments):
The compiler probably decided not to vectorise the loop because it doesn't know if the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.
This is not really an answer, and I can't jam it all into a comment. I just want to share my additional findings. (This is probably relevant to the Visual Studio compilers only.)
What also makes a difference is how you structure your loops. For instance:
Assuming the following struct definitions:
#define PCALLBACK ULONG64
#pragma pack(push)
#pragma pack(1)
typedef struct {
ULONG64 ui0;
USHORT w0;
USHORT w1;
//Followed by:
// PCALLBACK[] 'array' - variable size array
}DPE;
#pragma pack(pop)
(1) The regular way to structure a for loop. The following code chunk is called somewhere in the middle of a larger serialization function:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
{
pDstClbks[i] = info.callbackFuncs[i];
}
As was mentioned somewhere in the answer on this page, it is clear that the compiler was starved of registers to have produced the following monstrocity (see how it reused rax for the loop end limit, or movzx eax,word ptr [r13] instruction that could've been clearly left out of the loop.)
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF7029327CF 48 83 C1 30 add rcx,30h
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
00007FF7029327D3 66 41 3B 5D 00 cmp bx,word ptr [r13]
00007FF7029327D8 73 1F jae 07FF7029327F9h
00007FF7029327DA 4C 8B C1 mov r8,rcx
00007FF7029327DD 4C 2B F1 sub r14,rcx
{
pDstClbks[i] = info.callbackFuncs[i];
00007FF7029327E0 4B 8B 44 06 08 mov rax,qword ptr [r14+r8+8]
00007FF7029327E5 48 FF C3 inc rbx
00007FF7029327E8 49 89 00 mov qword ptr [r8],rax
00007FF7029327EB 4D 8D 40 08 lea r8,[r8+8]
00007FF7029327EF 41 0F B7 45 00 movzx eax,word ptr [r13]
00007FF7029327F4 48 3B D8 cmp rbx,rax
00007FF7029327F7 72 E7 jb 07FF7029327E0h
}
00007FF7029327F9 45 0F B7 C7 movzx r8d,r15w
(2) So if I re-write it into a less familiar C pattern:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
for(PCALLBACK* pScrClbks = info.callbackFuncs;
pDstClbks < pEndDstClbks;
pScrClbks++, pDstClbks++)
{
*pDstClbks = *pScrClbks;
}
this produces a more sensible machine code (on the same compiler, in the same function, in the same project):
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF71D7E27C2 48 83 C1 30 add rcx,30h
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
00007FF71D7E27C6 0F B7 86 88 00 00 00 movzx eax,word ptr [rsi+88h]
00007FF71D7E27CD 48 8D 14 C1 lea rdx,[rcx+rax*8]
for(PCALLBACK* pScrClbks = info.callbackFuncs; pDstClbks < pEndDstClbks; pScrClbks++, pDstClbks++)
00007FF71D7E27D1 48 3B CA cmp rcx,rdx
00007FF71D7E27D4 76 14 jbe 07FF71D7E27EAh
00007FF71D7E27D6 48 2B F1 sub rsi,rcx
{
*pDstClbks = *pScrClbks;
00007FF71D7E27D9 48 8B 44 0E 08 mov rax,qword ptr [rsi+rcx+8]
00007FF71D7E27DE 48 89 01 mov qword ptr [rcx],rax
00007FF71D7E27E1 48 83 C1 08 add rcx,8
00007FF71D7E27E5 48 3B CA cmp rcx,rdx
00007FF71D7E27E8 77 EF jb 07FF71D7E27D9h
}
00007FF71D7E27EA 45 0F B7 C6 movzx r8d,r14w
I'm learning C++ from basics (using Visual Studio Community 2015)
While working on the arrays I came across the following:
int main()
{
int i[10] = {};
}
The assembly code for this is :
18: int i[10] = {};
008519CE 33 C0 xor eax,eax
008519D0 89 45 D4 mov dword ptr [ebp-2Ch],eax
008519D3 89 45 D8 mov dword ptr [ebp-28h],eax
008519D6 89 45 DC mov dword ptr [ebp-24h],eax
008519D9 89 45 E0 mov dword ptr [ebp-20h],eax
008519DC 89 45 E4 mov dword ptr [ebp-1Ch],eax
008519DF 89 45 E8 mov dword ptr [ebp-18h],eax
008519E2 89 45 EC mov dword ptr [ebp-14h],eax
008519E5 89 45 F0 mov dword ptr [ebp-10h],eax
008519E8 89 45 F4 mov dword ptr [ebp-0Ch],eax
008519EB 89 45 F8 mov dword ptr [ebp-8],eax
Here, since initalisation is used every int here is initialised to 0 (xor eax,eax). This is clear.
From what I have learnt, any variable would be allocated memory only if it is used (atleast in modern compilers) and if any one element in an array is initialised the complete array would be allocated memory as follows:
int main()
{
int i[10];
i[0] = 20;
int j = 20;
}
Assembly generated:
18: int i[10];
19: i[0] = 20;
00A319CE B8 04 00 00 00 mov eax,4
00A319D3 6B C8 00 imul ecx,eax,0
00A319D6 C7 44 0D D4 14 00 00 00 mov dword ptr [ebp+ecx-2Ch],14h
20: int j = 20;
00A319DE C7 45 C8 14 00 00 00 mov dword ptr [ebp-38h],14h
Here, the compiler used 4 bytes (to copy the value 20 to i[0]) but from what I have learnt the memory for the entire array should be allocated at line 19. But the compiler haven't produced any relevant machine code for this. And where would it store info (stating that the remaining memory for the other nine elements[1-9] of array i's cannot be used by other variables)
Please help!!!
I decided to benchmark the realization of a swap function for simple types (like int, or struct, or a class that uses only simple types in its fields) with a static tmp variable in it to prevent memory allocation in each swap call. So I wrote this simple test program:
#include <iostream>
#include <chrono>
#include <utility>
#include <vector>
template<typename T>
void mySwap(T& a, T& b) //Like std::swap - just for tests
{
T tmp = std::move(a);
a = std::move(b);
b = std::move(tmp);
}
template<typename T>
void mySwapStatic(T& a, T& b) //Here with static tmp
{
static T tmp;
tmp = std::move(a);
a = std::move(b);
b = std::move(tmp);
}
class Test1 { //Simple class with some simple types
int foo;
float bar;
char bazz;
};
class Test2 { //Class with std::vector in it
int foo;
float bar;
char bazz;
std::vector<int> bizz;
public:
Test2()
{
bizz = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
}
};
#define Test Test1 //choosing class
const static unsigned int NUM_TESTS = 100000000;
static Test a, b; //making it static to prevent throwing out from code by compiler optimizations
template<typename T, typename F>
auto test(unsigned int numTests, T& a, T& b, const F swapFunction ) //test function
{
std::chrono::system_clock::time_point t1, t2;
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
swapFunction(a, b);
}
t2 = std::chrono::system_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
}
int main()
{
std::chrono::system_clock::time_point t1, t2;
std::cout << "Test 1. MySwap Result:\t\t" << test(NUM_TESTS, a, b, mySwap<Test>) << " nanoseconds\n"; //caling test function
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
mySwap<Test>(a, b);
}
t2 = std::chrono::system_clock::now();
std::cout << "Test 2. MySwap2 Result:\t\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //This result slightly better then 1. why?!
std::cout << "Test 3. MySwapStatic Result:\t" << test(NUM_TESTS, a, b, mySwapStatic<Test>) << " nanoseconds\n"; //test function with mySwapStatic
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
mySwapStatic<Test>(a, b);
}
t2 = std::chrono::system_clock::now();
std::cout << "Test 4. MySwapStatic2 Result:\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //And again - it's better then 3...
std::cout << "Test 5. std::swap Result:\t" << test(NUM_TESTS, a, b, std::swap<Test>) << " nanoseconds\n"; //calling test function with std::swap for comparsion. Mostly similar to 1...
return 0;
}
Some results with Test defined as Test1 (g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2 called as g++ main.cpp -O3 -std=c++11):
Test 1. MySwap Result: 625,105,480 nanoseconds
Test 2. MySwap2 Result: 528,701,547 nanoseconds
Test 3. MySwapStatic Result: 338,484,180 nanoseconds
Test 4. MySwapStatic2 Result: 228,228,156 nanoseconds
Test 5. std::swap Result: 564,863,184 nanoseconds
My main question: is it good to use this implementation for swapping of simple types? I know that if you use it for swapping types with vectors, for example, then std::swap is better, and you can see it just by changing the Test define to Test2.
Second question: why are the results in test 1, 2, 3, and 4 so different? What am I doing wrong with the test function implementation?
An answer to your second question first : in your test 2 and 4, the compiler is inlining the functions, thus it gives better performances (there is even more on test 4, but I will cover this later).
Overall, it is probably a bad idea to use a static temp variable.
Why ? First, it should be noted that in x86 assembly, there is no instruction to copy from memory to memory. This means that when you swap, there is, not one, but two temporary variables in CPU registers. And these temporary variables MUST be in CPU registers, you can't copy mem to mem, thus a static variable will add a third memory location to transfer to and from.
One problem with your static temp is that it will hinder inlining. Imagine if the variables you swap are already in CPU registers. In this case, the compiler can inline the swapping, and never copy anything to memory, which is much faster. Now, if you forced a static temp, either the compiler removes it (useless), or is forced to add a memory copy. That's what happen in test 4, in which GCC removed all the reads to the static variable. It just pointlessly write updated values to it because you told it do to so. The read removal explains the good performance gain, but it could be even faster.
Your test cases are flawed because they don't show this point.
Now you may ask : Then why my static functions perform better ? I have no idea. (Answer at the end)
I was curious, so I compiled your code with MSVC, and it turns out MSVC is doing it right, and GCC is doing it weird. At O2 optimization levels, MSVC detects that two swaps is a no-op and shortcut them, but even at O1, the non-inlined generated code is faster than all the test case with GCC at O3. (EDIT: Actually, MSVC is not doing it right either, see explanation at the end.)
The assembly generated by MSVC looks indeed better, but when comparing the static and non-static assembly generated by GCC, I don't know why the static performs better.
Anyway, I think even if GCC is generating weird code, the inlining problem should be worth using the std::swap, because with bigger types the additional memory copy could be costly, and smaller types give better inlining.
Here are the assembly produced by all the test cases, if someone have an idea of why the GCC static performs better than the non-static, despite being longer and using more memory moves. EDIT: Answer at the end
GCC non-static (perf 570ms):
00402F90 44 8B 01 mov r8d,dword ptr [rcx]
00402F93 F3 0F 10 41 04 movss xmm0,dword ptr [rcx+4]
00402F98 0F B6 41 08 movzx eax,byte ptr [rcx+8]
00402F9C 4C 8B 0A mov r9,qword ptr [rdx]
00402F9F 4C 89 09 mov qword ptr [rcx],r9
00402FA2 44 0F B6 4A 08 movzx r9d,byte ptr [rdx+8]
00402FA7 44 88 49 08 mov byte ptr [rcx+8],r9b
00402FAB 44 89 02 mov dword ptr [rdx],r8d
00402FAE F3 0F 11 42 04 movss dword ptr [rdx+4],xmm0
00402FB3 88 42 08 mov byte ptr [rdx+8],al
GCC static and MSVC static (perf 275ms):
00402F10 48 8B 01 mov rax,qword ptr [rcx]
00402F13 48 89 05 66 11 00 00 mov qword ptr [404080h],rax
00402F1A 0F B6 41 08 movzx eax,byte ptr [rcx+8]
00402F1E 88 05 64 11 00 00 mov byte ptr [404088h],al
00402F24 48 8B 02 mov rax,qword ptr [rdx]
00402F27 48 89 01 mov qword ptr [rcx],rax
00402F2A 0F B6 42 08 movzx eax,byte ptr [rdx+8]
00402F2E 88 41 08 mov byte ptr [rcx+8],al
00402F31 48 8B 05 48 11 00 00 mov rax,qword ptr [404080h]
00402F38 48 89 02 mov qword ptr [rdx],rax
00402F3B 0F B6 05 46 11 00 00 movzx eax,byte ptr [404088h]
00402F42 88 42 08 mov byte ptr [rdx+8],al
MSVC non-static (perf 215ms):
00000 f2 0f 10 02 movsdx xmm0, QWORD PTR [rdx]
00004 f2 0f 10 09 movsdx xmm1, QWORD PTR [rcx]
00008 44 8b 41 08 mov r8d, DWORD PTR [rcx+8]
0000c f2 0f 11 01 movsdx QWORD PTR [rcx], xmm0
00010 8b 42 08 mov eax, DWORD PTR [rdx+8]
00013 89 41 08 mov DWORD PTR [rcx+8], eax
00016 f2 0f 11 0a movsdx QWORD PTR [rdx], xmm1
0001a 44 89 42 08 mov DWORD PTR [rdx+8], r8d
std::swap versions are all identical to the non-static versions.
After having some fun investigating, I found the likely reason of the bad performance of the GCC non-static version. Modern processors have a feature called store-to-load forwarding. This feature kicks in when a memory load match a previous memory store, and shortcut the memory operation to use the value already known. In this case GCC somehow use an asymmetric load/store for parameter A and B. A is copied using 4+4+1 bytes, and B is copied using 8+1 bytes. This means the 8 first bytes of the class won't be matched by the store-to-load forwarding, losing a precious CPU optimization. To check this, I manually replaced the 8+1 copy by a 4+4+1 copy, and the performance went up as expected (code below). In the end, GCC is at fault for not considering this.
GCC patched code, longer but taking advantage of store forwarding (perf 220ms) :
00402F90 44 8B 01 mov r8d,dword ptr [rcx]
00402F93 F3 0F 10 41 04 movss xmm0,dword ptr [rcx+4]
00402F98 0F B6 41 08 movzx eax,byte ptr [rcx+8]
00402F9C 4C 8B 0A mov r9,qword ptr [rdx]
00402F9F 4C 89 09 mov qword ptr [rcx],r9
00402F9C 44 8B 0A mov r9d,dword ptr [rdx]
00402F9F 44 89 09 mov dword ptr [rcx],r9d
00402FA2 44 8B 4A 04 mov r9d,dword ptr [rdx+4]
00402FA6 44 89 49 04 mov dword ptr [rcx+4],r9d
00402FAA 44 0F B6 4A 08 movzx r9d,byte ptr [rdx+8]
00402FAF 44 88 49 08 mov byte ptr [rcx+8],r9b
00402FB3 44 89 02 mov dword ptr [rdx],r8d
00402FB6 F3 0F 11 42 04 movss dword ptr [rdx+4],xmm0
00402FBB 88 42 08 mov byte ptr [rdx+8],al
Actually, this copy instructions (symmetric 4+4+1) is the right way to do it. In these test we are only doing copies, in which case MSVC version is the best without doubt. The problem is that in a real case, the class member will be accessed individually, thus generating 4 bytes read/write. MSVC 8 bytes batch copy (also generated by GCC for one argument) will prevent future store forwarding for individual members. A new test I did with member operation beside copies show that the patched 4+4+1 versions indeed outperform all the others. And by a factor of nearly x2. Sadly, no modern compiler is generating this code.
I have a question regarding initialization of fairly large sets of static data.
See my three examples below of initializing sets of static data. I'd like to understand the program load time & memory footprint implications of the methods shown below. I really don't know how to evaluate that on my own at the moment. My build environment is still on the desktop using Visual Studio, however the embedded targets will be compiled for VxWorks using GCC.
Traditionally, I've used basic C-structs for this sort of thing, although there's good reason to move this data into C++ classes moving forward. Dynamic memory allocation is frowned upon in the embedded application and is avoided wherever possible.
As far as I know, the only way to initialize a C++ class is through its constructor, shown below in Method 2. I am wondering how that compares to Method 1. Is there any appreciable additional overhead in terms of ROM (Program footprint), RAM (Memory Footprint), or program load time? It seems to me that the compiler may be able to optimize away the rather trivial constructor, but I'm not sure if that's common behavior.
I listed Method 3, as I've considered it, although it just seems like a bad idea. Is there something else that I'm missing here? Anyone else out there initialize data in a similar manner?
/* C-Style Struct Storage */
typedef struct{
int a;
int b;
}DATA_C;
/* CPP Style Class Storage */
class DATA_CPP{
public:
int a;
int b;
DATA_CPP(int,int);
};
DATA_CPP::DATA_CPP(int aIn, int bIn){
a = aIn;
b = bIn;
}
/* METHOD 1: Direct C-Style Static Initialization */
DATA_C MyCData[5] = { {1,2},
{3,4},
{5,6},
{7,8},
{9,10}
};
/* METHOD 2: Direct CPP-Style Initialization */
DATA_CPP MyCppData[5] = { DATA_CPP(1,2),
DATA_CPP(3,4),
DATA_CPP(5,6),
DATA_CPP(7,8),
DATA_CPP(9,10),
};
/* METHOD 3: Cast C-Struct to CPP class */
DATA_CPP* pMyCppData2 = (DATA_CPP*) MyCData;
In C++11 , you can write this:
DATA_CPP obj = {1,2}; //Or simply : DATA_CPP obj {1,2}; i.e omit '='
instead of
DATA_CPP obj(1,2);
Extending this, you can write:
DATA_CPP MyCppData[5] = { {1,2},
{3,4},
{5,6},
{7,8},
{9,10},
};
instead of this:
DATA_CPP MyCppData[5] = { DATA_CPP(1,2),
DATA_CPP(3,4),
DATA_CPP(5,6),
DATA_CPP(7,8),
DATA_CPP(9,10),
};
Read about uniform initialization.
I've done a bit of research in this, thought I'd post the results. I used Visual Studio 2008 in all cases here.
Here's the disassembly view of the code from Visual Studio in Debug Mode:
/* METHOD 1: Direct C-Style Static Initialization */
DATA_C MyCData[5] = { {1,2},
{3,4},
{5,6},
{7,8},
{9,10},
};
/* METHOD 2: Direct CPP-Style Initialization */
DATA_CPP MyCppData[5] = { DATA_CPP(1,2),
010345C0 push ebp
010345C1 mov ebp,esp
010345C3 sub esp,0C0h
010345C9 push ebx
010345CA push esi
010345CB push edi
010345CC lea edi,[ebp-0C0h]
010345D2 mov ecx,30h
010345D7 mov eax,0CCCCCCCCh
010345DC rep stos dword ptr es:[edi]
010345DE push 2
010345E0 push 1
010345E2 mov ecx,offset MyCppData (1038184h)
010345E7 call DATA_CPP::DATA_CPP (103119Ah)
DATA_CPP(3,4),
010345EC push 4
010345EE push 3
010345F0 mov ecx,offset MyCppData+8 (103818Ch)
010345F5 call DATA_CPP::DATA_CPP (103119Ah)
DATA_CPP(5,6),
010345FA push 6
010345FC push 5
010345FE mov ecx,offset MyCppData+10h (1038194h)
01034603 call DATA_CPP::DATA_CPP (103119Ah)
DATA_CPP(7,8),
01034608 push 8
0103460A push 7
0103460C mov ecx,offset MyCppData+18h (103819Ch)
01034611 call DATA_CPP::DATA_CPP (103119Ah)
DATA_CPP(9,10),
01034616 push 0Ah
01034618 push 9
0103461A mov ecx,offset MyCppData+20h (10381A4h)
0103461F call DATA_CPP::DATA_CPP (103119Ah)
};
01034624 pop edi
01034625 pop esi
01034626 pop ebx
01034627 add esp,0C0h
0103462D cmp ebp,esp
0103462F call #ILT+325(__RTC_CheckEsp) (103114Ah)
01034634 mov esp,ebp
01034636 pop ebp
Interesting thing to note here is that there is definitely some overhead in program memory usage and load time, at least in non-optimized debug mode. Notice that Method 1 has zero assembly instructions, while method two has about 44 instructions.
I also ran compiled the program in Release mode with optimization enabled, here is the abridged assembly output:
?MyCData##3PAUDATA_C##A DD 01H ; MyCData
DD 02H
DD 03H
DD 04H
DD 05H
DD 06H
DD 07H
DD 08H
DD 09H
DD 0aH
?MyCppData##3PAVDATA_CPP##A DD 01H ; MyCppData
DD 02H
DD 03H
DD 04H
DD 05H
DD 06H
DD 07H
DD 08H
DD 09H
DD 0aH
END
Seems like the compiler indeed optimized away the calls to the C++ constructor. I could find no evidence of the constructor ever being called anywhere in the assembly code.
I thought I'd try something a bit more. I changed the constructor to:
DATA_CPP::DATA_CPP(int aIn, int bIn){
a = aIn + bIn;
b = bIn;
}
Again, the compiler optimized this away, resulting in a static dataset:
?MyCppData##3PAVDATA_CPP##A DD 03H ; MyCppData
DD 02H
DD 07H
DD 04H
DD 0bH
DD 06H
DD 0fH
DD 08H
DD 013H
DD 0aH
END
Interesting, the compiler was able to evaluate the constructor code on all the static data during compilation and create a static dataset, still not calling the constructor.
I thought I'd try something still a bit more, operate on a global variable in the constructor:
int globalvar;
DATA_CPP::DATA_CPP(int aIn, int bIn){
a = aIn + globalvar;
globalvar += a;
b = bIn;
}
And in this case, the compiler now generated assembly code to call the constructor during initialization:
??__EMyCppData##YAXXZ PROC ; `dynamic initializer for 'MyCppData'', COMDAT
; 35 : DATA_CPP MyCppData[5] = { DATA_CPP(1,2),
00000 a1 00 00 00 00 mov eax, DWORD PTR ?globalvar##3HA ; globalvar
00005 8d 48 01 lea ecx, DWORD PTR [eax+1]
00008 03 c1 add eax, ecx
0000a 89 0d 00 00 00
00 mov DWORD PTR ?MyCppData##3PAVDATA_CPP##A, ecx
; 36 : DATA_CPP(3,4),
00010 8d 48 03 lea ecx, DWORD PTR [eax+3]
00013 03 c1 add eax, ecx
00015 89 0d 08 00 00
00 mov DWORD PTR ?MyCppData##3PAVDATA_CPP##A+8, ecx
; 37 : DATA_CPP(5,6),
0001b 8d 48 05 lea ecx, DWORD PTR [eax+5]
0001e 03 c1 add eax, ecx
00020 89 0d 10 00 00
00 mov DWORD PTR ?MyCppData##3PAVDATA_CPP##A+16, ecx
; 38 : DATA_CPP(7,8),
00026 8d 48 07 lea ecx, DWORD PTR [eax+7]
00029 03 c1 add eax, ecx
0002b 89 0d 18 00 00
00 mov DWORD PTR ?MyCppData##3PAVDATA_CPP##A+24, ecx
; 39 : DATA_CPP(9,10),
00031 8d 48 09 lea ecx, DWORD PTR [eax+9]
00034 03 c1 add eax, ecx
00036 89 0d 20 00 00
00 mov DWORD PTR ?MyCppData##3PAVDATA_CPP##A+32, ecx
0003c a3 00 00 00 00 mov DWORD PTR ?globalvar##3HA, eax ; globalvar
; 40 : };
00041 c3 ret 0
??__EMyCppData##YAXXZ ENDP ; `dynamic initializer for 'MyCppData''
FYI, I found this page helpful in setting up visual studio to output assembly:
How do I get the assembler output from a C file in VS2005
Which of these would be more computationally efficient, and why?
A) Repeated array access:
for(i=0; i<numbers.length; i++) {
result[i] = numbers[i] * numbers[i] * numbers[i];
}
B) Setting a local variable:
for(i=0; i<numbers.length; i++) {
int n = numbers[i];
result[i] = n * n * n;
}
Would not the repeated array access version have to be calculated (using pointer arithmetic), making the first option slower because it is doing this?:
for(i=0; i<numbers.length; i++) {
result[i] = *(numbers + i) * *(numbers + i) * *(numbers + i);
}
Any sufficiently sophisticated compiler will generate the same code for all three solutions. I turned your three versions into a small C program (with a minor adjustement, I changed the access numbers.length to a macro invocation which gives the length of an array):
#include <stddef.h>
size_t i;
static const int numbers[] = { 0, 1, 2, 4, 5, 6, 7, 8, 9 };
#define ARRAYLEN(x) (sizeof((x)) / sizeof(*(x)))
static int result[ARRAYLEN(numbers)];
void versionA(void)
{
for(i=0; i<ARRAYLEN(numbers); i++) {
result[i] = numbers[i] * numbers[i] * numbers[i];
}
}
void versionB(void)
{
for(i=0; i<ARRAYLEN(numbers); i++) {
int n = numbers[i];
result[i] = n * n * n;
}
}
void versionC(void)
{
for(i=0; i<ARRAYLEN(numbers); i++) {
result[i] = *(numbers + i) * *(numbers + i) * *(numbers + i);
}
}
I then compiled it using optimizations (and debug symbols, for prettier disassembly) with Visual Studio 2012:
C:\Temp>cl /Zi /O2 /Wall /c so19244189.c
Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50727.1 for x86
Copyright (C) Microsoft Corporation. All rights reserved.
so19244189.c
Finally, here's the disassembly:
C:\Temp>dumpbin /disasm so19244189.obj
[..]
_versionA:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 mov dword ptr _result[eax*4],edx
00
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
_versionB:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 mov dword ptr _result[eax*4],edx
00
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
_versionC:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 mov dword ptr _result[eax*4],edx
00
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
Note how the assembly is exactly the same in all cases. So the correct answer to your question
Which of these would be more computationally efficient, and why?
for this compiler is: mu. Your question cannot be answered because it's based on incorrect assumptions. None of the answers is faster than any other.
The theoretical answer:
A reasonably good optimizing compiler should convert version A to version B, and perform only one load from memory. There should be no performance difference if optimization is enabled.
If optimization is disabled, version A will be slower, because the address must be computed 3 times and there are 3 memory loads (2 of them are cached and very fast, but it's still slower than reusing a register).
In practice, the answer will depend on your compiler, and you should check this by benchmarking.
It depends on compiler but all of them should be the same.
First lets look at case B smart compiler will generate code to load value into register only once so it doesn't matter if you use some additional variable or not, compiler generates opcode for mov instruction and has value into register. So B is the same as A.
Now lets compare A and C. We should look at opeators [] inline implementation. a[b] actually is *(a + b) so *(numbers + i) the same as numbers[i] that means cases A and C are the same.
So we have (A==B) && (A==C) all in all (A==B==C) If you know what I mean :).