Using base pointer register in C++ inline asm - c++

I want to be able to use the base pointer register (%rbp) within inline asm. A toy example of this is like so:
void Foo(int &x)
{
asm volatile ("pushq %%rbp;" // 'prologue'
"movq %%rsp, %%rbp;" // 'prologue'
"subq $12, %%rsp;" // make room
"movl $5, -12(%%rbp);" // some asm instruction
"movq %%rbp, %%rsp;" // 'epilogue'
"popq %%rbp;" // 'epilogue'
: : : );
x = 5;
}
int main()
{
int x;
Foo(x);
return 0;
}
I hoped that, since I am using the usual prologue/epilogue function-calling method of pushing and popping the old %rbp, this would be ok. However, it seg faults when I try to access x after the inline asm.
The GCC-generated assembly code (slightly stripped-down) is:
_Foo:
pushq %rbp
movq %rsp, %rbp
movq %rdi, -8(%rbp)
# INLINEASM
pushq %rbp; // prologue
movq %rsp, %rbp; // prologue
subq $12, %rsp; // make room
movl $5, -12(%rbp); // some asm instruction
movq %rbp, %rsp; // epilogue
popq %rbp; // epilogue
# /INLINEASM
movq -8(%rbp), %rax
movl $5, (%rax) // x=5;
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq -4(%rbp), %rax
movq %rax, %rdi
call _Foo
movl $0, %eax
leave
ret
Can anyone tell me why this seg faults? It seems that I somehow corrupt %rbp but I don't see how. Thanks in advance.
I'm running GCC 4.8.4 on 64-bit Ubuntu 14.04.

See the bottom of this answer for a collection of links to other inline-asm Q&As.
Your code is broken because you step on the red-zone below RSP (with push) where GCC was keeping a value.
What are you hoping to learn to accomplish with inline asm? If you want to learn inline asm, learn to use it to make efficient code, rather than horrible stuff like this. If you want to write function prologues and push/pop to save/restore registers, you should write whole functions in asm. (Then you can easily use nasm or yasm, rather than the less-preferred-by-most AT&T syntax with GNU assembler directives1.)
GNU inline asm is hard to use, but allows you to mix custom asm fragments into C and C++ while letting the compiler handle register allocation and any saving/restoring if necessary. Sometimes the compiler will be able to avoid the save and restore by giving you a register that's allowed to be clobbered. Without volatile, it can even hoist asm statements out of loops when the input would be the same. (i.e. unless you use volatile, the outputs are assumed to be a "pure" function of the inputs.)
If you're just trying to learn asm in the first place, GNU inline asm is a terrible choice. You have to fully understand almost everything that's going on with the asm, and understand what the compiler needs to know, to write correct input/output constraints and get everything right. Mistakes will lead to clobbering things and hard-to-debug breakage. The function-call ABI is a much simpler and easier to keep track of boundary between your code and the compiler's code.
Why this breaks
You compiled with -O0, so gcc's code spills the function parameter from %rdi to a location on the stack. (This could happen in a non-trivial function even with -O3).
Since the target ABI is the x86-64 SysV ABI, it uses the "Red Zone" (128 bytes below %rsp that even asynchronous signal handlers aren't allowed to clobber), instead of wasting an instruction decrementing the stack pointer to reserve space.
It stores the 8B pointer function arg at -8(rsp_at_function_entry). Then your inline asm pushes %rbp, which decrements %rsp by 8 and then writes there, clobbering the low 32b of &x (the pointer).
When your inline asm is done,
gcc reloads -8(%rbp) (which has been overwritten with %rbp) and uses it as the address for a 4B store.
Foo returns to main with %rbp = (upper32)|5 (orig value with the low 32 set to 5).
main runs leave: %rsp = (upper32)|5
main runs ret with %rsp = (upper32)|5, reading the return address from virtual address (void*)(upper32|5), which from your comment is 0x7fff0000000d.
I didn't check with a debugger; one of those steps might be slightly off, but the problem is definitely that you clobber the red zone, leading to gcc's code trashing the stack.
Even adding a "memory" clobber doesn't get gcc to avoid using the red zone, so it looks like allocating your own stack memory from inline asm is just a bad idea. (A memory clobber means you might have written some memory you're allowed to write to, e.g. a global variable or something pointed-to by a global, not that you might have overwritten something you're not supposed to.)
If you want to use scratch space from inline asm, you should probably declare an array as a local variable and use it as an output-only operand (which you never read from).
AFAIK, there's no syntax for declaring that you modify the red-zone, so your only options are:
use an "=m" output operand (possibly an array) for scratch space; the compiler will probably fill in that operand with an addressing mode relative to RBP or RSP. You can index into it with constants like 4 + %[tmp] or whatever. You might get an assembler warning from 4 + (%rsp) but not an error.
skip over the red-zone with add $-128, %rsp / sub $-128, %rsp around your code. (Necessary if you want to use an unknown amount of extra stack space, e.g. push in a loop, or making a function call. Yet another reason to deref a function pointer in pure C, not inline asm.)
compile with -mno-red-zone (I don't think you can enable that on a per-function basis, only per-file)
Don't use scratch space in the first place. Tell the compiler what registers you clobber and let it save them.
Here's what you should have done:
void Bar(int &x)
{
int tmp;
long tmplong;
asm ("lea -16 + %[mem1], %%rbp\n\t"
"imul $10, %%rbp, %q[reg1]\n\t" // q modifier: 64bit name.
"add %k[reg1], %k[reg1]\n\t" // k modifier: 32bit name
"movl $5, %[mem1]\n\t" // some asm instruction writing to mem
: [mem1] "=m" (tmp), [reg1] "=r" (tmplong) // tmp vars -> tmp regs / mem for use inside asm
:
: "%rbp" // tell compiler it needs to save/restore %rbp.
// gcc refuses to let you clobber %rbp with -fno-omit-frame-pointer (the default at -O0)
// clang lets you, but memory operands still use an offset from %rbp, which will crash!
// gcc memory operands still reference %rsp, so don't modify it. Declaring a clobber on %rsp does nothing
);
x = 5;
}
Note the push/pop of %rbp in the code outside the #APP / #NO_APP section, emitted by gcc. Also note that the scratch memory it gives you is in the red zone. If you compile with -O0, you'll see that it's at a different position from where it spills &x.
To get more scratch regs, it's better to just declare more output operands that are never used by the surrounding non-asm code. That leaves register allocation to the compiler, so it can be different when inlined into different places. Choosing ahead of time and declaring a clobber only makes sense if you need to use a specific register (e.g. shift count in %cl). Of course, an input constraint like "c" (count) gets gcc to put the count in rcx/ecx/cx/cl, so you don't emit a potentially redundant mov %[count], %%ecx.
If this looks too complicated, don't use inline asm. Either lead the compiler to the asm you want with C that's like the optimal asm, or write a whole function in asm.
When using inline asm, keep it as small as possible: ideally just the one or two instructions that gcc isn't emitting on its own, with input/output constraints to tell it how to get data into / out of the asm statement. This is what it's designed for.
Rule of thumb: if your GNU C inline asm start or ends with a mov, you're usually doing it wrong and should have used a constraint instead.
Footnotes:
You can use GAS's intel-syntax in inline-asm by building with -masm=intel (in which case your code will only work with that option), or using dialect alternatives so it works with the compiler in Intel or AT&T asm output syntax. But that doesn't change the directives, and GAS's Intel-syntax is not well documented. (It's like MASM, not NASM, though.) I don't really recommend it unless you really hate AT&T syntax.
Inline asm links:
x86 wiki. (The tag wiki also links to this question, for this collection of links)
The inline-assembly tag wiki
The manual. Read this. Note that inline asm was designed to wrap single instructions that the compiler doesn't normally emit. That's why it's worded to say things like "the instruction", not "the block of code".
A tutorial
Looping over arrays with inline assembly Using r constraints for pointers/indices and using your choice of addressing mode, vs. using m constraints to let gcc choose between incrementing pointers vs. indexing arrays.
How can I indicate that the memory *pointed* to by an inline ASM argument may be used? (pointer inputs in registers do not imply that the pointed-to memory is read and/or written, so it might not be in sync if you don't tell the compiler).
In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?. Using %q0 to get %rax vs. %w0 to get %ax. Using %g[scalar] to get %zmm0 instead of %xmm0.
Efficient 128-bit addition using carry flag Stephen Canon's answer explains a case where an early-clobber declaration is needed on a read+write operand. Also note that x86/x86-64 inline asm doesn't need to declare a "cc" clobber (the condition codes, aka flags); it's implicit. (gcc6 introduces syntax for using flag conditions as input/output operands. Before that you have to setcc a register that gcc will emit code to test, which is obviously worse.)
Questions about the performance of different implementations of strlen: my answer on a question with some badly-used inline asm, with an answer similar to this one.
llvm reports: unsupported inline asm: input with type 'void *' matching output with type 'int': Using offsetable memory operands (in x86, all effective addresses are offsettable: you can always add a displacement).
When not to use inline asm, with an example of 32b/32b => 32b division and remainder that the compiler can already do with a single div. (The code in the question is an example of how not to use inline asm: many instructions for setup and save/restore that should be left to the compiler by writing proper in/out constraints.)
MSVC inline asm vs. GNU C inline asm for wrapping a single instruction, with a correct example of inline asm for 64b/32b=>32bit division. MSVC's design and syntax require a round trip through memory for inputs and outputs, making it terrible for short functions. It's also "never very reliable" according to Ross Ridge's comment on that answer.
Using x87 floating point, and commutative operands. Not a great example, because I didn't find a way to get gcc to emit ideal code.
Some of those re-iterate some of the same stuff I explained here. I didn't re-read them to try to avoid redundancy, sorry.

In x86-64, the stack pointer needs to be aligned to 8 bytes.
This:
subq $12, %rsp; // make room
should be:
subq $16, %rsp; // make room

Related

Remove needless assembler statements from g++ output

I am investigating some problem with a local binary. I've noticed that g++ creates a lot of ASM output that seems unnecessary to me. Example with -O0:
Derived::Derived():
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp <--- just need 8 bytes for the movq to -8(%rbp), why -16?
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rdi <--- now we have moved rdi onto itself.
call Base::Base()
leaq 16+vtable for Derived(%rip), %rdx
movq -8(%rbp), %rax <--- effectively %edi, does not point into this area of the stack
movq %rdx, (%rax) <--- thus this wont change -8(%rbp)
movq -8(%rbp), %rax <--- so this statement is unnecessary
movl $4712, 12(%rax)
nop
leave
ret
option -O1 -fno-inline -fno-elide-constructors -fno-omit-frame-pointer:
Derived::Derived():
pushq %rbp
movq %rsp, %rbp
pushq %rbx
subq $8, %rsp <--- reserve some stack space and never use it.
movq %rdi, %rbx
call Base::Base()
leaq 16+vtable for Derived(%rip), %rax
movq %rax, (%rbx)
movl $4712, 12(%rbx)
addq $8, %rsp <--- release unused stack space.
popq %rbx
popq %rbp
ret
This code is for the constructor of Derived that calls the Base base constructor and then overrides the vtable pointer at position 0 and sets a constant value to an int member it holds in addition to what Base contains.
Question:
Can I translate my program with as few optimizations as possible and get rid of such stuff? Which options would I have to set? Or is there a reason the compiler cannot detect these cases with -O0 or -O1 and there is no way around them?
Why is the subq $8, %rsp statement generated at all? You cannot optimize in or out a statement that makes no sense to begin with. Why does the compiler generate it then? The register allocation algorithm should never, even with O0, generate code for something that is not there. So why it is done?
is there a reason the compiler cannot detect these cases with -O0 or -O1
exactly because you're telling the compiler not to. These are optimisation levels that need to be turn off or down for proper debugging. You're also trading off compilation time for run-time.
You're looking through the telescope the wrong way, check out the awesome optimisations that you're compiler will do for you when you crank up optimisation.
I don't see any obvious missed optimizations in your -O1 output. Except of course setting up RBP as a frame pointer, but you used -fno-omit-frame-pointer so clearly you know why GCC didn't optimize that away.
The function has no local variables
Your function is a non-static class member function, so it has one implicit arg: this in rdi. Which g++ spills to the stack because of -O0. Function args count as local variables.
How does a cyclic move without an effect improve the debugging experience. Please elaborate.
To improve C/C++ debugging: debug-info formats can only describe a C variable's location relative to RSP or RBP, not which register it's currently in. Also, so you can modify any variable with a debugger and continue, getting the expected results as if you'd done that in the C++ abstract machine. Every statement is compiled to a separate block of asm with no values alive in registers (Fun fact: except register int foo: that keyword does affect debug-mode code gen).
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to G++ and other compilers as well.
Which options would I have to set?
If you're reading / debugging the asm, use at least -Og or higher to disable the debug-mode spill-everything-between-statements behaviour of -O0. Preferably -O2 or -O3 unless you like seeing even more missed optimizations than you'd get with full optimization. But -Og or -O1 will do register allocation and make sane loops (with the conditional branch at the bottom), and various simple optimizations. Although still not the standard peephole of xor-zeroing.
How to remove "noise" from GCC/clang assembly output? explains how to write functions that take args and return a value so you can write functions that don't optimize away.
Loading into RAX and then movq %rax, %rdi is just a side-effect of -O0. GCC spends so little time optimizing the GIMPLE and/or RTL internal representations of the program logic (before emitting x86 asm) that it doesn't even notice it could have loaded into RDI in the first place. Part of the point of -O0 is to compile quickly, as well as consistent debugging.
Why is the subq $8, %rsp statement generated at all?
Because the ABI requires 16-byte stack alignment before a call instruction, and this function did an even number of 8-byte pushes. (call itself pushes a return address). It will go away at -O1 without -fno-omit-frame-pointer because you aren't forcing g++ to push/pop RBP as well as the call-preserved register it actually needs.
Why does System V / AMD64 ABI mandate a 16 byte stack alignment?
Fun fact: clang will often use a dummy push %rcx/pop or something, depending on -mtune options, instead of an 8-byte sub.
If it were a leaf function, g++ would just use the red-zone below RSP for locals, even at -O0. Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?
In un-optimized code it's not rare for G++ to allocate an extra 16 bytes it doesn't ever use. Even sometimes with optimization enabled g++ rounds up its stack allocation size too far when aiming for a 16-byte boundary. This is a missed-optimization bug. e.g. Memory allocation and addressing in Assembly

Clobbered memory in two inline assembly calls vs in one inline assembly call?

This question follows this one, considering a GCC-compliant compiler and a x86-64 architecture.
I am wondering if there is any difference between option 1, option 2 and option 3 below. Would the result be the same in all contexts, or would it be different. And if so what would be the difference?
// Option 1
asm volatile(:::"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):);
and
// Option 2
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):);
asm volatile(:::"memory");
and
// Option 3
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory");
Options 1 & 2 would let the CPUID itself reorder with unrelated non-volatile loads/stores (in one direction or the other). This is very likely not what you want.
You could put a memory barrier on both sides of CPUID, but it's certainly better to just make CPUID a memory barrier itself.
As Jester points out, option 1 would force reload of level from memory, if it had ever had its address passed outside of the function, or if it already is a global or static.
(Or whatever the exact criterion is that decides whether a C variable could be modified read or written by asm that uses a "memory" clobber. I think it's essentially the same as what the optimizer uses to decide whether a variable can be kept in a register across a non-inline function call to an opaque function, so pure local variables that haven't had their address passed anywhere, and that aren't inputs to the asm statement, can still live in registers).
For example (Godbolt compiler explorer):
void foo(int level){
int eax, ebx, ecx, edx;
asm volatile("":::"memory");
asm volatile("CPUID"
: "=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx)
: "0"(level)
:
);
}
# x86-64 gcc7.3 -O3 -fverbose-asm
pushq %rbx # # rbx is call-preserved, but we clobber it.
movl %edi, %eax # level, eax
CPUID
popq %rbx #
ret
Notice the lack of a spill/reload of the function arg.
Normally I'd use Intel syntax, but with inline asm it's a good idea to always use AT&T unless you complete hate AT&T syntax or don't know it.
Even if it started in memory (i386 System V calling convention, with stack args), the compiler still decides that nothing else (including the asm statement with a memory clobber) could reference it. But how do we tell the difference between delaying the load? Modify the function arg before the barrier, then use it after:
void modify_level(int level){
level += 1; // modify level before the barrier
int eax, ebx, ecx, edx;
asm volatile("#mem barrier here":::"memory");
asm volatile("CPUID" // then read it after
: "=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx)
: "0"(level):);
}
The asm output from gcc -m32 -O3 -fverbose-asm is:
modify_level(int):
pushl %ebx #
#mem barrier here
movl 8(%esp), %eax # level, tmp97
addl $1, %eax #, level
CPUID
popl %ebx #
ret
Notice that the compiler let level++ reorder across the memory barrier, because it's a local variable.
Godbolt filters hand-written asm comments along with compiler-generated asm comment-only lines. I disabled the comment filter and found the mem barrier. You might want to remove -fverbose-asm to get less noise. Or use a non-comment string for the mem barrier: it doesn't have to assemble if you're just looking at the compiler's asm output. (Unless you're using clang, which has the assembler built-in).
BTW, the original version of your question didn't compile: you left out the empty string as asm template. asm(:::"memory"). The output, input, and clobber sections can be empty, but the asm instruction string is not optional.
Fun fact, you can put asm comments in the string:
asm volatile("# memory barrier here":::"memory");
gcc fills in any %whatever things in the string template as it writes asm output, so you can even do stuff like "CPUID # %%0 was in %0" and see what gcc chose for your "dummy" args that are otherwise unmentioned in the asm template. (This is more interesting for dummy memory input/output operands to tell the compiler which memory you read/write instead of using a "memory" clobber, when you give the asm statement a pointer.)

Visual C++ optimization options - how to improve the code output?

Are there any options (other than /O2) to improve the Visual C++ code output? The MSDN documentation is quite bad in this regard.
Note that I'm not asking about project-wide settings (link-time optimization, etc...). I'm only interested in this particular example.
The fairly simple C++11 code looks like this:
#include <vector>
int main() {
std::vector<int> v = {1, 2, 3, 4};
int sum = 0;
for(int i = 0; i < v.size(); i++) {
sum += v[i];
}
return sum;
}
Clang's output with libc++ is quite compact:
main: # #main
mov eax, 10
ret
Visual C++ output, on the other hand, is a multi-page mess.
Am I missing something here or is VS really this bad?
Compiler explorer link:
https://godbolt.org/g/GJYHjE
Unfortunately, it's difficult to greatly improve Visual C++ output in this case, even by using more aggressive optimization flags. There are several factors contributing to VS inefficiency, including lack of certain compiler optimizations, and the structure of Microsoft's implementation of <vector>.
Inspecting the generated assembly, Clang does an outstanding job optimizing this code. Specifically, when compared to VS, Clang is able to perform a very effective Constant propagation, Function Inlining (and consequently, Dead Code Elimination), and New/delete optimization.
Constant Propagation
In the example, the vector is statically initialized:
std::vector<int> v = {1, 2, 3, 4};
Normally, the compiler will store the constants 1, 2, 3, 4 in the data memory, and in the for loop, will load one value at one at a time, starting from the low address in which 1 is stored, and add each value to the sum.
Here's the abbreviated VS code for doing this:
movdqa xmm0, XMMWORD PTR __xmm#00000004000000030000000200000001
...
movdqu XMMWORD PTR $T1[rsp], xmm0 ; Store integers 1, 2, 3, 4 in memory
...
$LL4#main:
add ebx, DWORD PTR [rdx] ; loop and sum the values
lea rdx, QWORD PTR [rdx+4]
inc r8d
movsxd rax, r8d
cmp rax, r9
jb SHORT $LL4#main
Clang, however, is very clever to realize that the sum could be calculated in advance. My best guess is that it replaces the loading of the constants from memory to constant mov operations into registers (propagates the constants), and then combines them into the result of 10. This has the useful side effect of breaking dependencies, and since the addresses are no longer loaded from, the compiler is free to remove everything else as dead code.
Clang seems to be unique in doing this - neither VS or GCC were able to precalculate the vector accumulation result in advance.
New/Delete Optimization
Compilers conforming to C++14 are allowed to omit calls to new and delete on certain conditions, specifically when the number of allocation calls is not part of the observable behavior of the program (N3664 standard paper).
This has already generated much discussion on SO:
clang vs gcc - optimization including operator new
Is the compiler allowed to optimize out heap memory allocations?
Optimization of raw new[]/delete[] vs std::vector
Clang invoked with -std=c++14 -stdlib=libc++ indeed performs this optimization and eliminates the calls to new and delete, which do carry side effects, but supposedly do not affect the observable behaviour of the program. With -stdlib=libstdc++, Clang is stricter and keeps the calls to new and delete - although, by looking at the assembly, it's clear they are not really needed.
Now, when inspecting the main code generated by VS, we can find there two function calls (with the rest of vector construction and iteration code inlined into main):
call std::vector<int,std::allocator<int> >::_Range_construct_or_tidy<int const * __ptr64>
and
call void __cdecl operator delete(void * __ptr64)
The first is used for allocating the vector, and the second for deallocating it, and practically all other functions in the VS output are pulled in by these functions calls. This hints that Visual C++ will not optimize away calls to allocation functions (for C++14 conformance we should add the /std:c++14 flag, but the results are the same).
This blog post (May 10, 2017) from the Visual C++ team confirms that indeed, this optimization is not implemented. Searching the page for N3664 shows that "Avoiding/fusing allocations" is at status N/A, and linked comment says:
[E] Avoiding/fusing allocations is permitted but not required. For the time being, we’ve chosen not to implement this.
Combining new/delete optimization and constant propagation, it's easy to see the impact of these two optimizations in this Compiler Explorer 3-way comparison of Clang with -stdlib=libc++, Clang with -stdlib=libstdc++, and GCC.
STL Implementation
VS has its own STL implementation which is very differently structured than libc++ and stdlibc++, and that seems to have a large contribution to VS inferior code generation. While VS STL has some very useful features, such as checked iterators and iterator debugging hooks (_ITERATOR_DEBUG_LEVEL), it gives the general impression of being heavier and to perform less efficiently than stdlibc++.
For isolating the impact of the vector STL implementation, an interesting experiment is to use Clang for compilation, combined with the VS header files. Indeed, using Clang 5.0.0 with Visual Studio 2015 headers, results in the following code generation - clearly, the STL implementation has a huge impact!
main: # #main
.Lfunc_begin0:
.Lcfi0:
.seh_proc main
.seh_handler __CxxFrameHandler3, #unwind, #except
# BB#0: # %.lr.ph
pushq %rbp
.Lcfi1:
.seh_pushreg 5
pushq %rsi
.Lcfi2:
.seh_pushreg 6
pushq %rdi
.Lcfi3:
.seh_pushreg 7
pushq %rbx
.Lcfi4:
.seh_pushreg 3
subq $72, %rsp
.Lcfi5:
.seh_stackalloc 72
leaq 64(%rsp), %rbp
.Lcfi6:
.seh_setframe 5, 64
.Lcfi7:
.seh_endprologue
movq $-2, (%rbp)
movl $16, %ecx
callq "??2#YAPEAX_K#Z"
movq %rax, -24(%rbp)
leaq 16(%rax), %rcx
movq %rcx, -8(%rbp)
movups .L.ref.tmp(%rip), %xmm0
movups %xmm0, (%rax)
movq %rcx, -16(%rbp)
movl 4(%rax), %ebx
movl 8(%rax), %esi
movl 12(%rax), %edi
.Ltmp0:
leaq -24(%rbp), %rcx
callq "?_Tidy#?$vector#HV?$allocator#H#std###std##IEAAXXZ"
.Ltmp1:
# BB#1: # %"\01??1?$vector#HV?$allocator#H#std###std##QEAA#XZ.exit"
addl %ebx, %esi
leal 1(%rdi,%rsi), %eax
addq $72, %rsp
popq %rbx
popq %rdi
popq %rsi
popq %rbp
retq
.seh_handlerdata
.long ($cppxdata$main)#IMGREL
.text
Update - Visual Studio 2017
In Visual Studio 2017, <vector> has seen a major overhaul, as announced on this blog post from the Visual C++ team. Specifically, it mentions the following optimizations:
Eliminated unnecessary EH logic. For example, vector’s copy assignment operator had an unnecessary try-catch block. It just has to provide the basic guarantee, which we can achieve through proper action sequencing.
Improved performance by avoiding unnecessary rotate() calls. For example, emplace(where, val) was calling emplace_back() followed by rotate(). Now, vector calls rotate() in only one scenario (range insertion with input-only iterators, as previously described).
Improved performance with stateful allocators. For example, move construction with non-equal allocators now attempts to activate our memmove() optimization. (Previously, we used make_move_iterator(), which had the side effect of inhibiting the memmove() optimization.) Note that a further improvement is coming in VS 2017 Update 1, where move assignment will attempt to reuse the buffer in the non-POCMA non-equal case.
Curious, I went back to test this. When building the example in Visual Studio 2017, the result is still a multi page assembly listing, with many function calls, so even if code generation improved, it is difficult to notice.
However, when building with clang 5.0.0 and Visual Studio 2017 headers, we get the following assembly:
main: # #main
.Lcfi0:
.seh_proc main
# BB#0:
subq $40, %rsp
.Lcfi1:
.seh_stackalloc 40
.Lcfi2:
.seh_endprologue
movl $16, %ecx
callq "??2#YAPEAX_K#Z" ; void * __ptr64 __cdecl operator new(unsigned __int64)
movq %rax, %rcx
callq "??3#YAXPEAX#Z" ; void __cdecl operator delete(void * __ptr64)
movl $10, %eax
addq $40, %rsp
retq
.seh_handlerdata
.text
Note the movl $10, %eax instruction - that is, with VS 2017's <vector>, clang was able to collapse everything, precalculate the result of 10, and keep only the calls to new and delete.
I'd say that is pretty amazing!
Function Inlining
Function inlining is probably the single most vital optimization in this example. By collapsing the code of called functions into their call sites, the compiler is able to perform further optimizations on the merged code, plus, removing of function calls is beneficial in reducing call overhead and removing of optimization barriers.
When inspecting the generated assembly for VS, and comparing the code before and after inlining (Compiler Explorer), we can see that most vector functions were indeed inlined, except for the allocation and deallocation functions. In particular, there are calls to memmove, which are the result of inlining of some higher level functions, such as _Uninitialized_copy_al_unchecked.
memmove is a library function, and therefore cannot be inlined. However, clang has a clever way around this - it replaces the call to memmove with a call to __builtin_memmove. __builtin_memmove is a builtin/intrinsic function, which has the same functionality as memmove, but as opposed to the plain function call, the compiler generates code for it and embeds it into the calling function. Consequently, the code could be further optimized inside the calling function and eventually removed as dead code.
Summary
To conclude, Clang is clearly superior than VS in this example, both thanks to high quality optimizations, and more efficient vector STL implementation. When using the same header files for Visual C++ and clang (the Visual Studio 2017 headers), Clang beats Visual C++ hands down.
While writing this answer, I couldn't help not to think, what would we do without Compiler Explorer? Thanks Matt Godbolt for this amazing tool!

What exception is raised in C by GCC -fstack-check option

as per gcc documentation
-fstack-check
Generate code to verify that you do not go beyond the boundary of the stack. Note that this switch does not actually cause checking to be done; the operating system must do that. The switch causes generation of code to ensure that the operating system sees the stack being extended.
My assumption is that this extra code will generate exception to let OS know.
When using C language I need to know what exception is being generated by the extra code.
Google is also not helping much. Close I came to know is that it generates Storage_Error exception in case of Ada language (Reference).
I am working on sort of small OS/scheduler where I need to catch this exception. I am using C/C++.
My GCC version 3.4.4
It doesn't generate any exception directly. It generates code which, when the stack is enlarged by more than one page, generates a read-write access to each page in the newly allocated region. That's all it does. Example:
extern void bar(char *);
void foo(void)
{
char buf[4096 * 8];
bar(buf);
}
compiles (with gcc 4.9, on x86-64, at -O2) to:
foo:
pushq %rbp
movq $-32768, %r11
movq %rsp, %rbp
subq $4128, %rsp
addq %rsp, %r11
.LPSRL0:
cmpq %r11, %rsp
je .LPSRE0
subq $4096, %rsp
orq $0, (%rsp)
jmp .LPSRL0
.LPSRE0:
addq $4128, %rsp
leaq -32768(%rbp), %rdi
call bar
leave
ret
orq $0, (%rsp) has no effect on the contents of the memory at (%rsp), but the CPU treats it as a read-write access to that address anyway. (I don't know why GCC offsets %rsp by 4128 bytes during the loop, or why it thinks a frame pointer is necessary.)
The theory is that the OS can notice these accesses and do something appropriate if the stack has become too large. For a POSIX-compliant operating system, that would be delivery of a SIGSEGV signal.
You may be wondering how the OS can notice such a thing. The hardware allows the OS to designate pages of address space as completely inaccessible; any attempt to read or write memory in those pages triggers a hardware fault which the OS can process as it sees fit (again, for a POSIX-compliant OS, delivery of SIGSEGV). This can be used to place a "guard area" immediately past the end of the space reserved for the stack. That's why one access per page is sufficient.
What -fstack-check is meant to protect you from, to be clear, is the situation where the "guard area" is very small - perhaps just one page - so allocating a large buffer on the stack moves the stack pointer past that area and into another region of accessible RAM. If the program then happens never to touch memory within the guard area, you won't get a prompt crash, but you will scribble on whatever that other region is, causing a delayed-action malfunction.

what will be the addressing mode in assembly code generated by the compiler here?

Suppose we've got two integer and character variables:
int adad=12345;
char character;
Assuming we're discussing a platform in which, length of an integer variable is longer than or equal to three bytes, I want to access third byte of this integer and put it in the character variable, with that said I'd write it like this:
character=*((char *)(&adad)+2);
Considering that line of code and the fact that I'm not a compiler or assembly expert, I know a little about addressing modes in assembly and I'm wondering the address of the third byte (or I guess it's better to say offset of the third byte) here would be within the instructions generated by that line of code themselves, or it'd be in a separate variable whose address (or offset) is within those instructions ?
The best thing to do in situations like this is to try it. Here's an example program:
int main(int argc, char **argv)
{
int adad=12345;
volatile char character;
character=*((char *)(&adad)+2);
return 0;
}
I added the volatile to avoid the assignment line being completely optimized away. Now, here's what the compiler came up with (for -Oz on my Mac):
_main:
pushq %rbp
movq %rsp,%rbp
movl $0x00003039,0xf8(%rbp)
movb 0xfa(%rbp),%al
movb %al,0xff(%rbp)
xorl %eax,%eax
leave
ret
The only three lines that we care about are:
movl $0x00003039,0xf8(%rbp)
movb 0xfa(%rbp),%al
movb %al,0xff(%rbp)
The movl is the initialization of adad. Then, as you can see, it reads out the 3rd byte of adad, and stores it back into memory (the volatile is forcing that store back).
I guess a good question is why does it matter to you what assembly gets generated? For example, just by changing my optimization flag to -O0, the assembly output for the interesting part of the code is:
movl $0x00003039,0xf8(%rbp)
leaq 0xf8(%rbp),%rax
addq $0x02,%rax
movzbl (%rax),%eax
movb %al,0xff(%rbp)
Which is pretty straightforwardly seen as the exact logical operations of your code:
Initialize adad
Take the address of adad
Add 2 to that address
Load one byte by dereferencing the new address
Store one byte into character
Various optimizations will change the output... if you really need some specific behaviour/addressing mode for some reason, you might have to write the assembly yourself.
Without knowing anything about the compiler and the underlying CPU architecture no definitive answer can be given. For example, not all CPU architectures allow the addressing of every arbitrary byte in memory (though I believe all the currently popular ones do): on a CPU that's word-addressed, instead of byte-addressed, what the compiler will generate is inevitably going to be the loading into some register of the whole word adad (presumably by an offset from a base pointer register, if the variable in question is on stack [1]), followed by shifting and masking to isolate the byte of interest.
[1] note that, without knowing what CPU architecture we're talking about and how the compiler uses it, we can't even say whether "load a word at a fixed offset from a base register" is something that's done inline within the instruction (as one might hope, and many popular architectures definitely do support;-) or needs separate address arithmetic in an auxiliary register.
IOW, whether it's a good idea or not, it's definitely possible to define a CPU architecture which cannot load / store registers except from other registers or memory addresses defined by other registers or constant, and some such architectures exist (though they may not be all that popular at this time;-).