Today I found sample code that slowed down by 50% after some unrelated code was added. After debugging, I figured out that the problem was loop alignment.
Depending on where the loop code is placed, the execution time differs, e.g.:
Address             Time [us]
00007FF780A01270    980
00007FF7750B1280    1500
00007FF7750B1290    986
00007FF7750B12A0    1500
I didn't expect code alignment to have such a big impact, and I thought my compiler was smart enough to align the code correctly.
What exactly causes such a big difference in execution time? (I suppose it is some processor architecture detail.)
I compiled the test program in Release mode with Visual Studio 2019 and ran it on Windows 10.
I checked the program on two processors: an i7-8700k (the results above) and an Intel i5-3570k, but the problem does not exist there and the execution time is always about 1250us.
I also tried compiling the program with clang, but with clang the result is always ~1500us (on the i7-8700k).
My test program:
#include <chrono>
#include <cstring> // memset
#include <iostream>
#include <intrin.h>
using namespace std;
template<int N>
__forceinline void noops()
{
__nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop();
noops<N - 1>();
}
template<>
__forceinline void noops<0>(){}
template<int OFFSET>
__declspec(noinline) void SumHorizontalLine(const unsigned char* __restrict src, int width, int a, unsigned short* __restrict dst)
{
unsigned short sum = 0;
const unsigned char* srcP1 = src - a - 1;
const unsigned char* srcP2 = src + a;
//some dummy loop,just a few iterations
for (int i = 0; i < a; ++i)
dst[i] = src[i] / (double)dst[i];
noops<OFFSET>();
//the important loop
for (int x = a + 1; x < width - a; x++)
{
unsigned char v1 = srcP1[x];
unsigned char v2 = srcP2[x];
sum -= v1;
sum += v2;
dst[x] = sum;
}
}
template<int OFFSET>
void RunTest(unsigned char* __restrict src, int width, int a, unsigned short* __restrict dst)
{
double minTime = 99999999;
for(int i = 0; i < 20; ++i)
{
auto start = chrono::steady_clock::now();
for (int i = 0; i < 1024; ++i)
{
SumHorizontalLine<OFFSET>(src, width, a, dst);
}
auto end = chrono::steady_clock::now();
auto us = chrono::duration_cast<chrono::microseconds>(end - start).count();
if (us < minTime)
{
minTime = us;
}
}
cout << OFFSET << " : " << minTime << " us" << endl;
}
int main()
{
const int width = 2048;
const int x = 3;
unsigned char* src = new unsigned char[width * 5];
unsigned short* dst = new unsigned short[width];
memset(src, 0, sizeof(unsigned char) * width);
memset(dst, 0, sizeof(unsigned short) * width);
while(true)
RunTest<1>(src, width, x, dst);
}
To verify a different alignment, just recompile the program and change RunTest<0> to RunTest<1>, etc.
The compiler always aligns the code to 16 bytes. In my test code I just insert additional nops to shift the code a bit further.
Assembly code generated for the loop with OFFSET=1 (for other offsets only the number of npads differs):
0007c 90 npad 1
0007d 90 npad 1
0007e 49 83 c1 08 add r9, 8
00082 90 npad 1
00083 90 npad 1
00084 90 npad 1
00085 90 npad 1
00086 90 npad 1
00087 90 npad 1
00088 90 npad 1
00089 90 npad 1
0008a 90 npad 1
0008b 90 npad 1
0008c 90 npad 1
0008d 90 npad 1
0008e 90 npad 1
0008f 90 npad 1
$LL15@SumHorizon:
; 25 :
; 26 : noops<OFFSET>();
; 27 :
; 28 : for (int x = a + 1; x < width - a; x++)
; 29 : {
; 30 : unsigned char v1 = srcP1[x];
; 31 : unsigned char v2 = srcP2[x];
; 32 : sum -= v1;
00090 0f b6 42 f9 movzx eax, BYTE PTR [rdx-7]
00094 4d 8d 49 02 lea r9, QWORD PTR [r9+2]
; 33 : sum += v2;
00098 0f b6 0a movzx ecx, BYTE PTR [rdx]
0009b 48 8d 52 01 lea rdx, QWORD PTR [rdx+1]
0009f 66 2b c8 sub cx, ax
000a2 66 44 03 c1 add r8w, cx
; 34 : dst[x] = sum;
000a6 66 45 89 41 fe mov WORD PTR [r9-2], r8w
000ab 49 83 ea 01 sub r10, 1
000af 75 df jne SHORT $LL15@SumHorizon
; 35 : }
; 36 :
; 37 : }
000b1 c3 ret 0
??$SumHorizontalLine@$00@@YAXPEIBEHHPEIAG@Z ENDP ; SumHorizont
In the slow cases (i.e., 00007FF7750B1280 and 00007FF7750B12A0), the jne instruction crosses a 32-byte boundary. The mitigations for the "Jump Conditional Code" (JCC) erratum (https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf) prevent such instructions from being cached in the DSB (the decoded-uop cache). The JCC erratum only applies to Skylake-based CPUs, which is why the effect does not occur on your i5-3570k CPU.
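As a sanity check, we can reproduce the pattern with a few lines of arithmetic. The sketch below assumes the listed addresses are the address of the loop label ($LL15), which puts the 2-byte jne at offset 0x1f from it (0xaf - 0x90 in the listing); per Intel's document, a jump is penalized when it crosses, or ends on, a 32-byte boundary:
#include <cstdint>
#include <cstdio>

// A jump is penalized when it crosses, or ends on, a 32-byte boundary.
bool jcc_penalized(uint64_t addr, unsigned len) {
    return (addr % 32) + len >= 32;
}

int main() {
    const uint64_t loops[] = {0x00007FF780A01270, 0x00007FF7750B1280,
                              0x00007FF7750B1290, 0x00007FF7750B12A0};
    for (uint64_t loop : loops) {
        uint64_t jne = loop + 0x1f; // the jne is the last 2 bytes of the loop body
        printf("%016llx: %s\n", (unsigned long long)jne,
               jcc_penalized(jne, 2) ? "penalized (slow)" : "ok (fast)");
    }
}
The two penalized addresses are exactly the two slow rows in the table above.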
As Peter Cordes pointed out in a comment, recent compilers have options that try to mitigate this effect. Intel JCC Erratum - should JCC really be treated separately? mentions MSVC's /QIntel-jcc-erratum option; another related question is How can I mitigate the impact of the Intel jcc erratum on gcc?
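If your MSVC is new enough (VS 2019 16.5 or later, if I remember correctly), enabling the mitigation is just a flag; the compiler then pads instructions so that jumps don't touch 32-byte boundaries, at a small code-size cost:
cl /O2 /QIntel-jcc-erratum main.cpp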
I thought my compiler is smart enough to align the code correctly.
As you said, the compiler is always aligning things to a multiple of 16 bytes. This probably does account for the direct effects of alignment. But there are limits to the "smartness" of the compiler.
Besides alignment, code placement has indirect performance effects as well, because of cache associativity. If there is too much contention for the few cache lines that can map to this address, performance will suffer. Moving to an address with less contention makes the problem go away.
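To get a feel for "the few cache lines that can map to this address": in a set-associative cache, only the middle address bits select the set. A toy calculation, assuming a typical 32 KiB, 8-way L1 with 64-byte lines (64 sets; the geometry here is an assumption, not queried from the CPU):
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t kLine = 64, kSets = 64; // assumed L1 geometry: 32 KiB / 8-way
    const uint64_t addrs[] = {0x00007FF780A01270, 0x00007FF7750B1280};
    for (uint64_t a : addrs) // with 64 sets, addresses 4 KiB apart share a set
        printf("%016llx -> set %llu\n", (unsigned long long)a,
               (unsigned long long)((a / kLine) % kSets));
}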
The compiler may be smart enough to handle cache contention effects as well, but only IF you turn on profile-guided optimization. The interactions are far too complex to predict in a reasonable amount of work; it is much easier to watch for cache conflicts by actually running the program and that's what PGO does.
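For reference, MSVC's PGO flow looks roughly like this (flag spellings from memory; check your toolset's documentation):
cl /O2 /GL main.cpp /link /LTCG /GENPROFILE
main.exe
cl /O2 /GL main.cpp /link /LTCG /USEPROFILE
The training run in the middle writes the profile counts that the final build uses for its placement decisions.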
I need to convert an array of bytes to an array of floats. I get the bytes through a network connection and then need to parse them into floats. The size of the array is not predefined.
This is the code I have so far, using unions. Do you have any suggestions on how to make it run faster?
int offset = DATA_OFFSET - 1;
UStuff bb;
//Convert every 4 bytes to float using a union
for (int i = 0; i < NUM_OF_POINTS;i++){
//Going backwards - due to endianness
for (int j = offset + BYTE_FLOAT*i + BYTE_FLOAT ; j > offset + BYTE_FLOAT*i; --j)
{
bb.c[(offset + BYTE_FLOAT*i + BYTE_FLOAT)- j] = sample[j];
}
res.append(bb.f);
}
return res;
This is the union I use
union UStuff
{
float f;
unsigned char c[4];
};
Technically you are not allowed to type-pun through a union in C++, although you are allowed to do it in C. The behaviour of your code is undefined.
Setting aside this important behaviour point: even then, you're assuming that a float is represented the same way on all machines, which it isn't. You might think a float is a 32-bit IEEE 754 little-endian block of data, but it doesn't have to be.
Sadly, the best solution will end up slower. Most serialisations of floating-point data are performed by going into and back out of a string, which in your case pretty much solves the problem, since you can already represent the values as an array of unsigned char data. So your only wrangle comes down to figuring out the encoding of the data. Job done!
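That said, if you are willing to assume the wire format (your code already assumes 32-bit floats with a fixed byte order), the well-defined replacement for the union is std::memcpy. A minimal sketch under that assumption; FloatFromBytes and ParseFloats are names I made up, and the byte order below is little-endian, so swap the indices if your loop's reversal was correct for your data:
#include <cstdint>
#include <cstring>
#include <vector>

// Assumes the sender transmits IEEE 754 binary32 values, little-endian.
inline float FloatFromBytes(const unsigned char* p) {
    uint32_t bits = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    float f;
    std::memcpy(&f, &bits, sizeof f); // well-defined, unlike union type-punning
    return f;
}

std::vector<float> ParseFloats(const unsigned char* sample, std::size_t count) {
    std::vector<float> res;
    res.reserve(count); // avoid reallocations while appending
    for (std::size_t i = 0; i < count; ++i)
        res.push_back(FloatFromBytes(sample + 4 * i));
    return res;
}
Mainstream compilers recognize this fixed-size memcpy pattern and turn it into a single register move, so you pay nothing for the well-defined form.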
Short Answer
#include <cstdint>
#define NOT_STUPID 1
#define ENDIAN NOT_STUPID
namespace _ {
inline uint32_t UI4Set(char byte_0, char byte_1, char byte_2, char byte_3) {
#if ENDIAN == NOT_STUPID
return byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
#else
return byte_3 | ((uint32_t)byte_2) << 8 | ((uint32_t)byte_1) << 16 |
((uint32_t)byte_0) << 24;
#endif
}
inline float FLTSet(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t flt = UI4Set(byte_0, byte_1, byte_2, byte_3);
return *reinterpret_cast<float*>(&flt);
}
/* Use this function to write directly to RAM and avoid the xmm
registers. */
inline uint32_t FLTSet(char byte_0, char byte_1, char byte_2, char byte_3,
float* destination) {
uint32_t value = UI4Set (byte_0, byte_1, byte_2, byte_3);
*reinterpret_cast<uint32_t*>(destination) = value;
return value;
}
} //< namespace _
using namespace _; //< @see Kabuki Toolkit
static float flt = FLTSet (0, 1, 2, 3);
int main () {
uint32_t flt_init = FLTSet (4, 5, 6, 7, &flt);
return 0;
}
//< This uses 4 extra bytes and doesn't use the xmm register
Long Answer
It is GENERALLY NOT recommended to use a Union to convert between floating-point and integer, because to this day Unions do not always generate optimal assembly code, and the other techniques are more explicit and may use less typing. Rather than giving my opinion from other StackOverflow posts on Unions, we shall prove it with disassembly from a modern compiler: Visual-C++ 2018.
The first thing we must know about optimizing floating-point algorithms is how the registers work. The core of the CPU is exclusively an integer-processing-unit, with coprocessors (i.e. extensions) for processing floating-point numbers. These Load-Store Machines (LSM) can only work with integers, and they must use a separate set of registers for interacting with the floating-point coprocessors. On x86_64 these are the xmm registers, which are 128 bits wide and can process Single Instruction Multiple Data (SIMD). In C++, a way to load and store a floating-point register is:
#include <cstdint>
int Foo(double foo) { return foo + *reinterpret_cast<double*>(&foo); }
int main() {
double foo = 1.0;
uint64_t bar = *reinterpret_cast<uint64_t*>(&foo);
return Foo(bar);
}
Now let's inspect the disassembly with Visual-C++ /O2 optimizations on, because without them you get a bunch of debug stack-frame variables. I had to add the function Foo to the example to keep the code from being optimized away.
double foo = 1.0;
uint64_t bar = *reinterpret_cast<uint64_t*>(&foo);
00007FF7482E16A0 mov rax,3FF0000000000000h
00007FF7482E16AA xorps xmm0,xmm0
00007FF7482E16AD cvtsi2sd xmm0,rax
return Foo(bar);
00007FF7482E16B2 addsd xmm0,xmm0
00007FF7482E16B6 cvttsd2si eax,xmm0
}
00007FF7482E16BA ret
As described, we can see that the LSM first moves the double value into an integer register, then zeros out the xmm0 register using an xor, because the register is 128 bits wide and we're loading a 64-bit integer. It then loads the contents of the integer register into the floating-point register using the cvtsi2sd instruction, followed by the cvttsd2si instruction, which loads the value back from the xmm0 register into the return register before finally returning.
So now let's address the concern about generating optimal assembly code, using this test script and Visual-C++ 2018:
#include <stdafx.h>
#include <cstdint>
#include <cstdio>
static float foo = 0.0f;
void SetFooUnion(char byte_0, char byte_1, char byte_2, char byte_3) {
union {
float flt;
char bytes[4];
} u = {foo};
u.bytes[0] = byte_0;
u.bytes[1] = byte_1;
u.bytes[2] = byte_2;
u.bytes[3] = byte_3;
foo = u.flt;
}
void SetFooManually(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t faster_method = byte_0 | ((uint32_t)byte_1) << 8 |
((uint32_t)byte_2) << 16 | ((uint32_t)byte_3) << 24;
*reinterpret_cast<uint32_t*>(&foo) = faster_method;
}
namespace _ {
inline uint32_t UI4Set(char byte_0, char byte_1, char byte_2, char byte_3) {
return byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
}
inline float FLTSet(char byte_0, char byte_1, char byte_2, char byte_3) {
uint32_t flt = UI4Set(byte_0, byte_1, byte_2, byte_3);
return *reinterpret_cast<float*>(&flt);
}
inline void FLTSet(char byte_0, char byte_1, char byte_2, char byte_3,
float* destination) {
uint32_t value = byte_0 | ((uint32_t)byte_1) << 8 | ((uint32_t)byte_2) << 16 |
((uint32_t)byte_3) << 24;
*reinterpret_cast<uint32_t*>(destination) = value;
}
} // namespace _
int main() {
SetFooUnion(0, 1, 2, 3);
union {
float flt;
char bytes[4];
} u = {foo};
// Start union read tests
putchar(u.bytes[0]);
putchar(u.bytes[1]);
putchar(u.bytes[2]);
putchar(u.bytes[3]);
// Start union write tests
u.bytes[0] = 4;
u.bytes[2] = 5;
foo = u.flt;
// Start hand-coded tests
SetFooManually(6, 7, 8, 9);
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
putchar((char)(bar >> 8));
putchar((char)(bar >> 16));
putchar((char)(bar >> 24));
_::FLTSet (0, 1, 2, 3, &foo);
return 0;
}
Now, after inspecting the /O2-optimized disassembly, we have proven the compiler DOES NOT produce optimal code:
int main() {
00007FF6DB4A1000 sub rsp,28h
SetFooUnion(0, 1, 2, 3);
00007FF6DB4A1004 mov dword ptr [rsp+30h],3020100h
00007FF6DB4A100C movss xmm0,dword ptr [rsp+30h]
union {
float flt;
char bytes[4];
} u = {foo};
00007FF6DB4A1012 movss dword ptr [rsp+30h],xmm0
// Start union read tests
putchar(u.bytes[0]);
00007FF6DB4A1018 movsx ecx,byte ptr [u]
SetFooUnion(0, 1, 2, 3);
00007FF6DB4A101D movss dword ptr [foo (07FF6DB4A3628h)],xmm0
// Start union read tests
putchar(u.bytes[0]);
00007FF6DB4A1025 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[1]);
00007FF6DB4A102B movsx ecx,byte ptr [rsp+31h]
00007FF6DB4A1030 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[2]);
00007FF6DB4A1036 movsx ecx,byte ptr [rsp+32h]
00007FF6DB4A103B call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar(u.bytes[3]);
00007FF6DB4A1041 movsx ecx,byte ptr [rsp+33h]
00007FF6DB4A1046 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
00007FF6DB4A104C mov ecx,6
// Start union write tests
u.bytes[0] = 4;
u.bytes[2] = 5;
foo = u.flt;
// Start hand-coded tests
SetFooManually(6, 7, 8, 9);
00007FF6DB4A1051 mov dword ptr [foo (07FF6DB4A3628h)],9080706h
uint32_t bar = *reinterpret_cast<uint32_t*>(&foo);
putchar((char)(bar));
00007FF6DB4A105B call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 8));
00007FF6DB4A1061 mov ecx,7
00007FF6DB4A1066 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 16));
00007FF6DB4A106C mov ecx,8
00007FF6DB4A1071 call qword ptr [__imp_putchar (07FF6DB4A2160h)]
putchar((char)(bar >> 24));
00007FF6DB4A1077 mov ecx,9
00007FF6DB4A107C call qword ptr [__imp_putchar (07FF6DB4A2160h)]
return 0;
00007FF6DB4A1082 xor eax,eax
_::FLTSet(0, 1, 2, 3, &foo);
00007FF6DB4A1084 mov dword ptr [foo (07FF6DB4A3628h)],3020100h
}
00007FF6DB4A108E add rsp,28h
00007FF6DB4A1092 ret
Here is the raw assembly listing for reference, because the inlined functions are missing from the annotated disassembly above:
; Listing generated by Microsoft (R) Optimizing Compiler Version 19.12.25831.0
include listing.inc
INCLUDELIB OLDNAMES
EXTRN __imp_putchar:PROC
EXTRN __security_check_cookie:PROC
?foo@@3MA DD 01H DUP (?) ; foo
_BSS ENDS
PUBLIC main
PUBLIC ?SetFooManually@@YAXDDDD@Z ; SetFooManually
PUBLIC ?SetFooUnion@@YAXDDDD@Z ; SetFooUnion
EXTRN _fltused:DWORD
; COMDAT pdata
pdata SEGMENT
$pdata$main DD imagerel $LN8
DD imagerel $LN8+137
DD imagerel $unwind$main
pdata ENDS
; COMDAT xdata
xdata SEGMENT
$unwind$main DD 010401H
DD 04204H
xdata ENDS
; Function compile flags: /Ogtpy
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; COMDAT ?SetFooManually@@YAXDDDD@Z
_TEXT SEGMENT
byte_0$dead$ = 8
byte_1$dead$ = 16
byte_2$dead$ = 24
byte_3$dead$ = 32
?SetFooManually@@YAXDDDD@Z PROC ; SetFooManually, COMDAT
00000 c7 05 00 00 00
00 06 07 08 09 mov DWORD PTR ?foo@@3MA, 151521030 ; 09080706H
0000a c3 ret 0
?SetFooManually@@YAXDDDD@Z ENDP ; SetFooManually
_TEXT ENDS
; Function compile flags: /Ogtpy
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; COMDAT main
_TEXT SEGMENT
u$1 = 48
u$ = 48
main PROC ; COMDAT
$LN8:
00000 48 83 ec 28 sub rsp, 40 ; 00000028H
00004 c7 44 24 30 00
01 02 03 mov DWORD PTR u$1[rsp], 50462976 ; 03020100H
0000c f3 0f 10 44 24
30 movss xmm0, DWORD PTR u$1[rsp]
00012 f3 0f 11 44 24
30 movss DWORD PTR u$[rsp], xmm0
00018 0f be 4c 24 30 movsx ecx, BYTE PTR u$[rsp]
0001d f3 0f 11 05 00
00 00 00 movss DWORD PTR ?foo@@3MA, xmm0
00025 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0002b 0f be 4c 24 31 movsx ecx, BYTE PTR u$[rsp+1]
00030 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00036 0f be 4c 24 32 movsx ecx, BYTE PTR u$[rsp+2]
0003b ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00041 0f be 4c 24 33 movsx ecx, BYTE PTR u$[rsp+3]
00046 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0004c b9 06 00 00 00 mov ecx, 6
00051 c7 05 00 00 00
00 06 07 08 09 mov DWORD PTR ?foo@@3MA, 151521030 ; 09080706H
0005b ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00061 b9 07 00 00 00 mov ecx, 7
00066 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
0006c b9 08 00 00 00 mov ecx, 8
00071 ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00077 b9 09 00 00 00 mov ecx, 9
0007c ff 15 00 00 00
00 call QWORD PTR __imp_putchar
00082 33 c0 xor eax, eax
00084 48 83 c4 28 add rsp, 40 ; 00000028H
00088 c3 ret 0
main ENDP
_TEXT ENDS
END
So what is the difference?
?SetFooUnion@@YAXDDDD@Z PROC ; SetFooUnion, COMDAT
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; Line 7
mov BYTE PTR [rsp+32], r9b
; Line 14
mov DWORD PTR u$[rsp], 50462976 ; 03020100H
; Line 18
movss xmm0, DWORD PTR u$[rsp]
movss DWORD PTR ?foo@@3MA, xmm0
; Line 19
ret 0
?SetFooUnion@@YAXDDDD@Z ENDP ; SetFooUnion
versus:
?SetFooManually@@YAXDDDD@Z PROC ; SetFooManually, COMDAT
; File c:\workspace\kabuki-toolkit\seams\0_0_experiments\main.cc
; Line 34
mov DWORD PTR ?foo@@3MA, 151521030 ; 09080706H
; Line 35
ret 0
?SetFooManually@@YAXDDDD@Z ENDP ; SetFooManually
The first thing to notice is the effect of the Union on inline memory optimizations. Unions are designed specifically to multiplex RAM for different purposes over different time periods in order to reduce RAM usage, so the memory must remain coherent in RAM and is thus less inlinable. The Union code forced the compiler to write the Union out to RAM, while the non-Union method just throws out your code and replaces it with a single mov DWORD PTR ?foo@@3MA, 151521030 instruction, without the use of the xmm0 register!!! The /O2 optimizations automatically inlined the SetFooUnion and SetFooManually functions, but the non-Union method inlined a lot more code and used fewer RAM reads, as evidenced by the difference between the Union method's line of code:
movsx ecx,byte ptr [rsp+31h]
versus the non-Union method's version:
mov ecx,7
The Union version is loading ecx from A POINTER TO RAM while the other is using the faster single-cycle mov-immediate instruction. THAT IS A HUGE PERFORMANCE INCREASE!!! However, this may actually be the desired behavior when working with real-time systems and multi-threaded applications, because the compiler optimizations may be unwanted and may mess up our timing; or you may want to use a mix of the two methods.
Aside from the potentially suboptimal RAM usage, I tried for several hours to get the compiler to generate suboptimal assembly, but I was not able to for most of my toy problems, so this RAM coherency does seem to be a rather nifty feature of Unions rather than a reason to shun them. My favorite C++ metaphor is that C++ is like a kitchen full of sharp knives: you need to pick the correct knife for the correct job, and just because there are a lot of sharp knives in a kitchen doesn't mean that you pull out all of the knives at once, or that you leave them out. As long as you keep the kitchen tidy, you're not going to cut yourself. A Union is a sharp knife that may help ensure greater RAM coherency, but it requires more typing and slows the program down.
In the program below I expect test1 to run slower because of the dependent instructions. A test run with -O2 seemed to confirm this. But then I tried -O3, and now the timings are more or less equal. How can this be?
#include <iostream>
#include <vector>
#include <cstring>
#include <chrono>
volatile int x = 0; // used for preventing certain optimizations
enum { size = 60 * 1000 * 1000 };
std::vector<unsigned> a(size + x); // `size + x` makes the vector size unknown by compiler
std::vector<unsigned> b(size + x);
void test1()
{
for (auto i = 1u; i != size; ++i)
{
a[i] = a[i] + a[i-1]; // data dependency hinders pipelining(?)
}
}
void test2()
{
for (auto i = 0u; i != size; ++i)
{
a[i] = a[i] + b[i]; // no data dependencies
}
}
template<typename F>
int64_t benchmark(F&& f)
{
auto start_time = std::chrono::high_resolution_clock::now();
f();
auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start_time);
return elapsed_ms.count();
}
int main(int argc, char**)
{
// make sure the optimizer cannot make any assumptions
// about the contents of the vectors:
for (auto& el : a) el = x;
for (auto& el : b) el = x;
test1(); // warmup
std::cout << "test1: " << benchmark(&test1) << '\n';
test2(); // warmup
std::cout << "\ntest2: " << benchmark(&test2) << '\n';
return a[x] * x; // prevent optimization and exit with code 0
}
I get these results:
g++-4.8 -std=c++11 -O2 main.cpp && ./a.out
test1: 115
test2: 48
g++-4.8 -std=c++11 -O3 main.cpp && ./a.out
test1: 29
test2: 38
Because with -O3 gcc effectively eliminates the data dependency by storing the value of a[i] in a register and reusing it in the next iteration, instead of reloading a[i-1].
The result is more or less equivalent to:
void test1()
{
auto x = a[0];
auto end = a.begin() + size;
for (auto it = std::next(a.begin()); it != end; ++it)
{
auto y = *it; // Load
x = y + x;
*it = x; // Store
}
}
Which, compiled with -O2, yields exactly the same assembly as your code compiled with -O3.
The second loop in your question is unrolled and vectorized with -O3, hence the speedup. The two optimizations seem unrelated to me: the first case is faster simply because gcc removed a load instruction, the second because it is unrolled.
In both cases I don't think the optimizer did anything in particular to improve the cache behaviour; both memory access patterns are easily predictable by the CPU.
Optimizers are very sophisticated pieces of software and not always predictable.
With g++ 5.2.0 and -O2, test1 and test2 get compiled to similar machine code:
;;;; test1 inner loop
400c28: 8b 50 fc mov -0x4(%rax),%edx
400c2b: 01 10 add %edx,(%rax)
400c2d: 48 83 c0 04 add $0x4,%rax
400c31: 48 39 c1 cmp %rax,%rcx
400c34: 75 f2 jne 400c28 <_Z5test1v+0x18>
;;;; test2 inner loop
400c50: 8b 0c 06 mov (%rsi,%rax,1),%ecx
400c53: 01 0c 02 add %ecx,(%rdx,%rax,1)
400c56: 48 83 c0 04 add $0x4,%rax
400c5a: 48 3d 00 1c 4e 0e cmp $0xe4e1c00,%rax
400c60: 75 ee jne 400c50 <_Z5test2v+0x10>
however, with -O3 test1 remains more or less the same
;;;; test1 inner loop
400d88: 03 10 add (%rax),%edx
400d8a: 48 83 c0 04 add $0x4,%rax
400d8e: 89 50 fc mov %edx,-0x4(%rax)
400d91: 48 39 c1 cmp %rax,%rcx
400d94: 75 f2 jne 400d88 <_Z5test1v+0x18>
while test2 explodes into what seems to be an unrolled version using xmm registers, and generates completely different machine code. The inner loop becomes
;;;; test2 inner loop (after a lot of preprocessing)
400e30: f3 41 0f 6f 04 00 movdqu (%r8,%rax,1),%xmm0
400e36: 83 c1 01 add $0x1,%ecx
400e39: 66 0f fe 04 07 paddd (%rdi,%rax,1),%xmm0
400e3e: 0f 29 04 07 movaps %xmm0,(%rdi,%rax,1)
400e42: 48 83 c0 10 add $0x10,%rax
400e46: 44 39 c9 cmp %r9d,%ecx
400e49: 72 e5 jb 400e30 <_Z5test2v+0x90>
and does multiple additions per iteration (paddd adds four packed 32-bit integers at once).
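For illustration, here is roughly what that vectorized inner loop does, expressed with SSE2 intrinsics; this is a sketch, since the compiler-generated code additionally peels iterations to reach an aligned store and handles the leftover elements differently:
#include <cstddef>
#include <emmintrin.h> // SSE2

void test2_simd(unsigned* a, const unsigned* b, std::size_t size) {
    std::size_t i = 0;
    for (; i + 4 <= size; i += 4) { // four 32-bit additions per iteration
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        _mm_storeu_si128((__m128i*)(a + i), _mm_add_epi32(va, vb)); // paddd
    }
    for (; i < size; ++i) // scalar remainder
        a[i] += b[i];
}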
If you want to test specific processor behaviour, writing directly in assembly is probably a better idea, as C++ compilers may do some heavy rewriting of what your original source code does.
Which of these options has the best performance:
using two shorts for two sensible pieces of information, or using one int and bit operations to retrieve one half of it for each piece of information?
This may vary depending on architecture and compiler, but generally the version using an int and bit operations on it will have slightly lower performance. The difference will be so minimal, though, that to date I haven't written code that requires that level of optimization; I depend on the compiler to do these kinds of optimizations for me.
Now let us check the C++ code below, which simulates the behaviour:
int main()
{
int x = 100;
short a = 255;
short b = 127;
short p = x >> 16;
short q = x & 0xffff;
short y = a;
short z = b;
return 0;
}
The corresponding assembly code on x86_64 system (from gnu g++) will be as shown below:
00000000004004ed <main>:
int main()
{
4004ed: 55 push %rbp
4004ee: 48 89 e5 mov %rsp,%rbp
int x = 100;
4004f1: c7 45 fc 64 00 00 00 movl $0x64,-0x4(%rbp)
short a = 255;
4004f8: 66 c7 45 f0 ff 00 movw $0xff,-0x10(%rbp)
short b = 127;
4004fe: 66 c7 45 f2 7f 00 movw $0x7f,-0xe(%rbp)
short p = x >> 16;
400504: 8b 45 fc mov -0x4(%rbp),%eax
400507: c1 f8 10 sar $0x10,%eax
40050a: 66 89 45 f4 mov %ax,-0xc(%rbp)
short q = x & 0xffff;
40050e: 8b 45 fc mov -0x4(%rbp),%eax
400511: 66 89 45 f6 mov %ax,-0xa(%rbp)
short y = a;
400515: 0f b7 45 f0 movzwl -0x10(%rbp),%eax
400519: 66 89 45 f8 mov %ax,-0x8(%rbp)
short z = b;
40051d: 0f b7 45 f2 movzwl -0xe(%rbp),%eax
400521: 66 89 45 fa mov %ax,-0x6(%rbp)
return 0;
400525: b8 00 00 00 00 mov $0x0,%eax
}
As we can see, "short p = x >> 16" is the slowest, as it uses an extra right-shift instruction (sar), while all the other assignments are equal in terms of cost.
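For completeness, here is a minimal sketch of the two alternatives being compared; the helper names are mine, not from the question:
#include <cstdint>

struct TwoShorts { int16_t a, b; }; // option 1: two separate shorts

// Option 2: one int plus bit operations.
inline int32_t Pack(int16_t hi, int16_t lo) {
    return ((int32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}
inline int16_t Hi(int32_t v) { return (int16_t)((uint32_t)v >> 16); }
inline int16_t Lo(int32_t v) { return (int16_t)(v & 0xffff); }
With optimizations on, Hi compiles to the same mov-plus-shift pair seen above; that single extra shift per extraction is the whole difference.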
Which of these would be more computationally efficient, and why?
A) Repeated array access:
for(i=0; i<numbers.length; i++) {
result[i] = numbers[i] * numbers[i] * numbers[i];
}
B) Setting a local variable:
for(i=0; i<numbers.length; i++) {
int n = numbers[i];
result[i] = n * n * n;
}
Would the repeated-array-access version not have to compute the address each time (using pointer arithmetic), making the first option slower because it is effectively doing this?
for(i=0; i<numbers.length; i++) {
result[i] = *(numbers + i) * *(numbers + i) * *(numbers + i);
}
Any sufficiently sophisticated compiler will generate the same code for all three solutions. I turned your three versions into a small C program (with a minor adjustment: I changed the access to numbers.length into a macro invocation that gives the length of an array):
#include <stddef.h>
size_t i;
static const int numbers[] = { 0, 1, 2, 4, 5, 6, 7, 8, 9 };
#define ARRAYLEN(x) (sizeof((x)) / sizeof(*(x)))
static int result[ARRAYLEN(numbers)];
void versionA(void)
{
for(i=0; i<ARRAYLEN(numbers); i++) {
result[i] = numbers[i] * numbers[i] * numbers[i];
}
}
void versionB(void)
{
for(i=0; i<ARRAYLEN(numbers); i++) {
int n = numbers[i];
result[i] = n * n * n;
}
}
void versionC(void)
{
for(i=0; i<ARRAYLEN(numbers); i++) {
result[i] = *(numbers + i) * *(numbers + i) * *(numbers + i);
}
}
I then compiled it using optimizations (and debug symbols, for prettier disassembly) with Visual Studio 2012:
C:\Temp>cl /Zi /O2 /Wall /c so19244189.c
Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50727.1 for x86
Copyright (C) Microsoft Corporation. All rights reserved.
so19244189.c
Finally, here's the disassembly:
C:\Temp>dumpbin /disasm so19244189.obj
[..]
_versionA:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 mov dword ptr _result[eax*4],edx
00
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
_versionB:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 mov dword ptr _result[eax*4],edx
00
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
_versionC:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 mov dword ptr _result[eax*4],edx
00
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
Note how the assembly is exactly the same in all cases. So the correct answer to your question
Which of these would be more computationally efficient, and why?
for this compiler is: mu. Your question cannot be answered because it's based on incorrect assumptions: none of the versions is faster than any other.
The theoretical answer:
A reasonably good optimizing compiler should convert version A to version B, and perform only one load from memory. There should be no performance difference if optimization is enabled.
If optimization is disabled, version A will be slower, because the address must be computed 3 times and there are 3 memory loads (2 of them are cached and very fast, but it's still slower than reusing a register).
In practice, the answer will depend on your compiler, and you should check this by benchmarking.
It depends on the compiler, but all of them should be the same.
First let's look at case B: a smart compiler will generate code to load the value into a register only once, so it doesn't matter whether you use an additional variable or not; the compiler generates the opcode for the mov instruction and has the value in a register. So B is the same as A.
Now let's compare A and C. We should look at operator[]'s inline implementation: a[b] actually is *(a + b), so *(numbers + i) is the same as numbers[i], which means cases A and C are the same.
So we have (A==B) && (A==C), all in all (A==B==C). If you know what I mean :).