How to link malloc when compiling with GCC? [duplicate] - c++

I have a function foo written in assembly and compiled with yasm and GCC on Linux (Ubuntu) 64-bit. It simply prints a message to stdout using puts(), here is how it looks:
bits 64
extern puts
global foo
section .data
message:
db 'foo() called', 0
section .text
foo:
push rbp
mov rbp, rsp
lea rdi, [rel message]
call puts
pop rbp
ret
It is called by a C program compiled with GCC:
extern void foo();
int main() {
foo();
return 0;
}
Build commands:
yasm -f elf64 foo_64_unix.asm
gcc -c foo_main.c -o foo_main.o
gcc foo_64_unix.o foo_main.o -o foo
./foo
Here is the problem:
When running the program it prints an error message and immediately segfaults during the call to puts:
./foo: Symbol `puts' causes overflow in R_X86_64_PC32 relocation
Segmentation fault
After disassembling with objdump I see that the call is made with the wrong address:
0000000000000660 <foo>:
660: 90 nop
661: 55 push %rbp
662: 48 89 e5 mov %rsp,%rbp
665: 48 8d 3d a4 09 20 00 lea 0x2009a4(%rip),%rdi
66c: e8 00 00 00 00 callq 671 <foo+0x11> <-- here
671: 5d pop %rbp
672: c3 retq
(671 is the address of the next instruction, not address of puts)
However, if I rewrite the same code in C the call is done differently:
645: e8 c6 fe ff ff callq 510 <puts@plt>
i.e. it references puts through the PLT.
Is it possible to tell yasm to generate similar code?

TL:DR: 3 options:
Build a non-PIE executable (gcc -no-pie -fno-pie call-lib.c libcall.o) so the linker will generate a PLT entry for you transparently when you write call puts.
call puts wrt ..plt like gcc -fPIE would do.
call [rel puts wrt ..got] like gcc -fno-plt would do.
The latter two will work in PIE executables or shared libraries. The 3rd way, wrt ..got, is slightly more efficient.
Your gcc is building PIE executables by default (32-bit absolute addresses no longer allowed in x86-64 Linux?).
I'm not sure why, but when doing so the linker doesn't automatically resolve call puts to call puts@plt. There is still a puts PLT entry generated, but the call doesn't go there.
At runtime, the dynamic linker tries to resolve puts directly to the libc symbol of that name and fixup the call rel32. But the symbol is more than +-2^31 away, so we get a warning about overflow of the R_X86_64_PC32 relocation. The low 32 bits of the target address are correct, but the upper bits aren't. (Thus your call jumps to a bad address).
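To make the overflow concrete, here is a minimal C sketch of the range check that any PC-relative 32-bit fixup has to pass. The function name encode_call_rel32 is hypothetical, and this is only an illustration of the arithmetic, not the dynamic linker's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode `call target` (opcode E8 + rel32) for an instruction located at
   insn_addr. Returns 1 on success, 0 if the displacement does not fit in
   a signed 32-bit field (the situation the loader warns about). */
int encode_call_rel32(uint64_t insn_addr, uint64_t target, uint8_t out[5])
{
    assert(out != NULL);
    /* Displacement is relative to the end of the 5-byte instruction;
       the subtraction wraps modulo 2^64. */
    uint64_t diff = target - (insn_addr + 5);
    /* In range iff diff, read as signed, lies in [-2^31, 2^31 - 1]. */
    if (diff > UINT64_C(0x7fffffff) && diff < UINT64_C(0xffffffff80000000))
        return 0;
    out[0] = 0xE8;
    for (int i = 0; i < 4; i++)
        out[1 + i] = (uint8_t)(diff >> (8 * i));   /* little-endian rel32 */
    return 1;
}
```

For a call at 0x645 targeting a PLT stub at 0x510 this produces e8 c6 fe ff ff. For a target far outside the ±2 GiB window, which is where libc usually sits relative to a PIE, it returns 0.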
Your code works for me if I build with gcc -no-pie -fno-pie call-lib.c libcall.o. The -no-pie is the critical part: it's the linker option. Your YASM command doesn't have to change.
When making a traditional position-dependent executable, the linker turns the puts symbol for the call target into puts@plt for you, because we're linking a dynamic executable (instead of statically linking libc with gcc -static -fno-pie, in which case the call could go directly to the libc function.)
Anyway, this is why gcc emits call puts@plt (GAS syntax) when compiling with -fpie (the default on your desktop, but not the default on https://godbolt.org/), but just call puts when compiling with -fno-pie.
See What does @plt mean here? for more about the PLT, and also Sorry state of dynamic libraries on Linux from a few years ago. (The modern gcc -fno-plt is like one of the ideas in that blog post.)
BTW, a more accurate/specific prototype would let gcc avoid zeroing EAX before calling foo:
extern void foo(); in C means extern void foo(...);
You could declare it as extern void foo(void);, which is what () means in C++. C++ doesn't allow function declarations that leave the args unspecified.
asm improvements
You can also put message in section .rodata (read-only data, linked as part of the text segment).
You don't need a stack frame, just something to align the stack by 16 before a call. A dummy push rax will do it.
Or we can tail-call puts by jumping to it instead of calling it, with the same stack position as on entry to this function. This works with or without PIE. Just replace call with jmp, as long as RSP is pointing at your own return address.
If you want to make PIE executables (or shared libraries), you have two options
call puts wrt ..plt - explicitly call through the PLT.
call [rel puts wrt ..got] - explicitly do an indirect call through the GOT entry, like gcc's -fno-plt style of code-gen. (Using a RIP-relative addressing mode to reach the GOT, hence the rel keyword).
WRT = With Respect To. The NASM manual documents wrt ..plt, and see also section 7.9.3: special symbols and WRT.
Normally you would use default rel at the top of your file so you can actually use call [puts wrt ..got] and still get a RIP-relative addressing mode. You can't use a 32-bit absolute addressing mode in PIE or PIC code.
call [puts wrt ..got] assembles to a memory-indirect call using the function pointer that dynamic linking stored in the GOT. (Early-binding, not lazy dynamic linking.)
NASM documents ..got for getting the address of variables in section 9.2.3. Functions in (other) libraries are identical: you get a pointer from the GOT instead of calling directly, because the offset isn't a link-time constant and might not fit in 32-bits.
YASM also accepts call [puts wrt ..GOTPCREL], like AT&T syntax call *puts@GOTPCREL(%rip), but NASM does not.
; don't use BITS 64. You *want* an error if you try to assemble this into a 32-bit .o
default rel ; RIP-relative addressing instead of 32-bit absolute by default; makes the [rel ...] optional
extern puts ; defined in libc, resolved at link / load time
section .rodata ; .rodata is best for constants, not .data
message:
db 'foo() called', 0
section .text
global foo
foo:
sub rsp, 8 ; align the stack by 16
; pick exactly ONE of the three lea/mov + call pairs below
; PIE with PLT
lea rdi, [rel message] ; needed for PIE
call puts WRT ..plt ; call puts through its PLT entry
;or
; PIE with -fno-plt style code, skips the PLT indirection
lea rdi, [rel message]
call [rel puts wrt ..got]
;or
; non-PIE
mov edi, message ; more efficient, but only works in non-PIE / non-PIC
call puts ; linker will rewrite it into call puts@plt
add rsp,8 ; restore the stack, undoing the sub
ret
In a position-dependent Linux executable, you can use mov edi, message instead of a RIP-relative LEA. It's smaller code-size and can run on more execution ports on most CPUs. (Fun fact: MacOS always puts the "image base" outside the low 4GiB so this optimization isn't possible there.)
In a non-PIE executable, you also might as well use call puts or jmp puts and let the linker sort it out, unless you want more efficient no-plt style dynamic linking. But if you do choose to statically link libc, I think this is the only way you'll get a direct jmp to the libc function.
(I think the possibility of static linking for non-PIE is why ld is willing to generate PLT stubs automatically for non-PIE, but not for PIE or shared libraries. It requires you to say what you mean when linking ELF shared objects.)
If you did use call puts in a PIE (call rel32), it could only work if you statically linked a position-independent implementation of puts into your PIE, so the entire thing was one executable that would get loaded at a random address at runtime (by the usual dynamic-linker mechanism), but simply didn't have a dependency on libc.so.6
Linker "relaxing" calls when the target is present at static-link time
GAS call *bar@GOTPCREL(%rip) uses R_X86_64_GOTPCRELX (relaxable)
NASM call [rel bar wrt ..got] uses R_X86_64_GOTPCREL (not relaxable)
This is less of a problem with hand-written asm; you can just use call bar when you know the symbol will be present in another .o (rather than .so) that you're going to link. But C compilers don't know the difference between library functions and other user functions you declare with prototypes (unless you use stuff like gcc -fvisibility=hidden https://gcc.gnu.org/wiki/Visibility or attributes / pragmas).
Still, you might want to write asm source that the linker can optimize if you statically link a library, but AFAIK you can't do that with NASM. You can export a symbol as hidden (visible at static-link time, but not for dynamic linking in the final .so) with global bar:function hidden, but that's in the source file defining the function, not files accessing it.
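; first file: defines bar
global bar
bar:
mov eax,231 ; __NR_exit_group
syscall
; second file: calls bar through the PLT and through the GOT
call bar wrt ..plt
call [rel bar wrt ..got]
extern bar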
The 2nd file, after assembling with nasm -felf64 and disassembling with objdump -drwc -Mintel to see the relocations:
0000000000000000 <.text>:
0: e8 00 00 00 00 call 0x5 1: R_X86_64_PLT32 bar-0x4
5: ff 15 00 00 00 00 call QWORD PTR [rip+0x0] # 0xb 7: R_X86_64_GOTPCREL bar-0x4
After linking with ld (GNU Binutils) 2.35.1 - ld bar.o bar2.o -o bar
0000000000401000 <_start>:
401000: e8 0b 00 00 00 call 401010 <bar>
401005: ff 15 ed 1f 00 00 call QWORD PTR [rip+0x1fed] # 402ff8 <.got>
40100b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
0000000000401010 <bar>:
401010: b8 e7 00 00 00 mov eax,0xe7
401015: 0f 05 syscall
Note that the PLT form got relaxed to just a direct call bar, PLT eliminated. But the ff 15 call [rel mem] was not relaxed to an e8 rel32
With GAS:
_start:
call bar@plt
call *bar@GOTPCREL(%rip)
gcc -c foo.s && disas foo.o
0000000000000000 <_start>:
0: e8 00 00 00 00 call 5 <_start+0x5> 1: R_X86_64_PLT32 bar-0x4
5: ff 15 00 00 00 00 call QWORD PTR [rip+0x0] # b <_start+0xb> 7: R_X86_64_GOTPCRELX bar-0x4
Note the X at the end of R_X86_64_GOTPCRELX.
ld bar2.o foo.o -o bar && disas bar:
0000000000401000 <bar>:
401000: b8 e7 00 00 00 mov eax,0xe7
401005: 0f 05 syscall
0000000000401007 <_start>:
401007: e8 f4 ff ff ff call 401000 <bar>
40100c: 67 e8 ee ff ff ff addr32 call 401000 <bar>
Both calls got relaxed to a direct e8 call rel32 straight to the target address. The extra byte in indirect call is filled with a 67 address-size prefix (which has no effect on call rel32), padding the instruction to the same length. (Because it's too late to re-assemble and re-compute all relative branches within functions, and alignment and so on.)
That would happen for call *puts@GOTPCREL(%rip) if you statically linked libc, with gcc -static.
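The byte-level rewrite described above can be sketched in C. This is only an illustration of the transformation, under the assumption that the target is known and within rel32 range; relax_got_call is a hypothetical name and this is not code taken from ld.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rewrite the 6-byte `call [rip+disp32]` (ff 15 disp32) located at
   insn_addr into `addr32 call rel32` (67 e8 rel32) of the same length.
   Returns 1 if the instruction was rewritten, 0 if it was left alone. */
int relax_got_call(uint64_t insn_addr, uint64_t target, uint8_t insn[6])
{
    assert(insn != NULL);
    if (insn[0] != 0xFF || insn[1] != 0x15)
        return 0;                                  /* not call [rip+disp32] */
    /* rel32 is relative to the end of the 6-byte instruction. */
    uint64_t diff = target - (insn_addr + 6);
    if (diff > UINT64_C(0x7fffffff) && diff < UINT64_C(0xffffffff80000000))
        return 0;                                  /* target out of range */
    insn[0] = 0x67;                                /* padding prefix */
    insn[1] = 0xE8;                                /* call rel32 */
    for (int i = 0; i < 4; i++)
        insn[2 + i] = (uint8_t)(diff >> (8 * i));
    return 1;
}
```

Applied to the instruction at 0x40100c with bar at 0x401000, this yields 67 e8 ee ff ff ff, matching the linked disassembly shown earlier.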

The 0xe8 opcode is followed by a signed offset to be applied to the PC (which has advanced to the next instruction by that time) to compute the branch target. Hence objdump is interpreting the branch target as 0x671.
YASM is rendering zeros because it has likely put a relocation on that offset, which is how it asks the loader to populate the correct offset for puts during loading. The loader is encountering an overflow when computing the relocation, which may indicate that puts is at a further offset from your call than can be represented in a 32-bit signed offset. Hence the loader fails to fix this instruction, and you get a crash.
66c: e8 00 00 00 00 shows the unpopulated address. If you look in your relocation table, you should see a relocation on 0x66d. It is not uncommon for the assembler to populate addresses/offsets with relocations as all zeros.
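As a small worked example of how objdump arrives at those branch targets, the following C sketch decodes an e8 call. The helper name call_rel32_target is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute the branch target of a 5-byte `call rel32` (opcode E8)
   located at insn_addr. */
uint64_t call_rel32_target(uint64_t insn_addr, const uint8_t insn[5])
{
    assert(insn != NULL && insn[0] == 0xE8);
    uint32_t raw = 0;
    for (int i = 0; i < 4; i++)
        raw |= (uint32_t)insn[1 + i] << (8 * i);   /* little-endian rel32 */
    uint64_t disp = raw;
    if (raw & 0x80000000u)
        disp |= UINT64_C(0xffffffff00000000);      /* sign-extend to 64 bits */
    /* The offset applies to the address of the NEXT instruction. */
    return insn_addr + 5 + disp;
}
```

With the still-unrelocated bytes e8 00 00 00 00 at 0x66c the result is 0x671, and with e8 c6 fe ff ff at 0x645 it is 0x510, the PLT stub from the C version.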
This page suggests that YASM has a WRT directive that can control use of .got, .plt, etc.
Per section 9.2.5 of the NASM documentation, it looks like you can use CALL puts WRT ..plt (presuming YASM has the same syntax).

Related

Is it possible to write asm in C++ with opcode instead of shellcode

I'm curious if there's a way to use __asm in c++ then write that into memory instead of doing something like:
BYTE shell_code[] = { 0x48, 0x03 ,0x1c ,0x25, 0x0A, 0x00, 0x00, 0x00 };
write_to_memory(function, &shell_code, sizeof(shell_code));
So I would like to do:
asm_code = __asm("add rbx, &variable\n\t""jmp rbx") ;
write_to_memory(function, &asm_code , sizeof(asm_code ));
Worst case I can use GCC and objdump externally or something but hoping there's an internal way
You can put an asm(""); statement at global scope, with start/end labels inside it, and declare those labels as extern char start_code[], end_code[0]; so you can access them from C. C char arrays work most like asm labels, in terms of being able to use the C name and have it work as an address.
// compile with gcc -masm=intel
// AFAIK, no way to do that with clang
asm(
".pushsection .rodata \n" // we don't want to run this from here, it's just data
"start_code: \n"
" add rax, OFFSET variable \n" // *absolute* address as 32-bit sign-extended immediate
"end_code: \n"
".popsection"
);
__attribute__((used)) static int variable = 1;
extern char start_code[], end_code[0]; // C declarations for those asm labels
#include <string.h>
void copy_code(void *dst)
{
memcpy(dst, start_code, end_code - start_code);
}
It would be fine to have the payload code in the default .text section, but we can put it in .rodata since we don't want to run it.
Is that the kind of thing you're looking for? asm output on Godbolt (without assembling + disassembling):
start_code:
add rax, OFFSET variable
end_code:
copy_code(void*):
mov edx, OFFSET FLAT:end_code
mov esi, OFFSET FLAT:start_code
sub rdx, OFFSET FLAT:start_code
jmp [QWORD PTR memcpy@GOTPCREL[rip]]
To see if it actually assembles to what we want, I compiled with
gcc -O2 -fno-plt -masm=intel -fno-pie -no-pie -c foo.c to get a .o. objdump -drwC -Mintel shows:
0000000000000000 <copy_code>:
0: ba 00 00 00 00 mov edx,0x0 1: R_X86_64_32 .rodata+0x6
5: be 00 00 00 00 mov esi,0x0 6: R_X86_64_32 .rodata
a: 48 81 ea 00 00 00 00 sub rdx,0x0 d: R_X86_64_32S .rodata
11: ff 25 00 00 00 00 jmp QWORD PTR [rip+0x0] # 17 <end_code+0x11> 13: R_X86_64_GOTPCRELX memcpy-0x4
And with -D to see all sections, the actual payload is there in .rodata, still not linked yet:
Disassembly of section .rodata:
0000000000000000 <start_code>:
0: 48 05 00 00 00 00 add rax,0x0 2: R_X86_64_32S .data
-fno-pie -no-pie is only necessary for the 32-bit absolute address of variable to work. (Without it, we get two RIP-relative LEAs and a sub rdx, rsi. Unfortunately neither way of compiling gets GCC to subtract the symbols at build time with mov edx, OFFSET end_code - start_code, but that's just in the code doing the memcpy, not in the machine code being copied.)
In a linked executable
We can see how the linker filled in those relocations.
(I tested by using -nostartfiles instead of -c - I didn't want to run it, just look at the disassembly, so there was no point in actually writing a main.)
$ gcc -O2 -fno-plt -masm=intel -fno-pie -no-pie -nostartfiles foo.c
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
$ objdump -D -rwC -Mintel a.out
(manually edited to remove uninteresting sections)
Disassembly of section .text:
0000000000401000 <copy_code>:
401000: ba 06 20 40 00 mov edx,0x402006
401005: be 00 20 40 00 mov esi,0x402000
40100a: 48 81 ea 00 20 40 00 sub rdx,0x402000
401011: ff 25 e1 2f 00 00 jmp QWORD PTR [rip+0x2fe1] # 403ff8 <memcpy@GLIBC_2.14>
The linked payload:
0000000000402000 <start_code>:
402000: 48 05 18 40 40 00 add rax,0x404018 # from add rax, OFFSET variable
0000000000402006 <end_code>:
402006: 48 c7 c2 06 00 00 00 mov rdx,0x6
# this was from mov rdx, OFFSET end_code - start_code to see if that would assemble + link
Our non-zero-init dword variable that we're taking the address of:
Disassembly of section .data:
0000000000404018 <variable>:
404018: 01 00 add DWORD PTR [rax],eax
...
Your specific asm instruction is weird
&variable isn't valid asm syntax, but I'm guessing you wanted to add the address?
Since you're going to be copying the machine code somewhere, you must avoid RIP-relative addressing modes and any other relative references to things outside the block you're copying. Only mov can use 64-bit absolute addresses, like movabs rdi, OFFSET variable instead of the usual lea rdi, [rip + variable]. Also, you can even load / store into/from RAX/EAX/AX/AL with 64-bit absolute addresses movabs eax, [variable]. (mov-immediate can use any register, load/store are only the accumulator. https://www.felixcloutier.com/x86/mov)
(movabs is an AT&T mnemonic, but GAS allows it in .intel_syntax noprefix to force using 64-bit immediates, instead of the default 32-bit-sign-extended.)
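If you end up generating such position-independent payload bytes yourself instead of copying assembled ones, the 64-bit mov-immediate form is simple to emit by hand. The following C sketch assumes the standard REX.W + B8+rd encoding of mov r64, imm64; emit_movabs is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emit `movabs reg, imm` (REX.W B8+rd imm64) into out.
   reg is the hardware register number: 0 = rax, 1 = rcx, ... 7 = rdi,
   8..15 = r8..r15. Returns the instruction length (always 10). */
size_t emit_movabs(uint8_t out[10], unsigned reg, uint64_t imm)
{
    assert(out != NULL && reg < 16);
    out[0] = (uint8_t)(0x48 | (reg >> 3));   /* REX.W, plus REX.B for r8..r15 */
    out[1] = (uint8_t)(0xB8 | (reg & 7));    /* B8+rd: mov r64, imm64 */
    for (int i = 0; i < 8; i++)
        out[2 + i] = (uint8_t)(imm >> (8 * i));   /* little-endian imm64 */
    return 10;
}
```

For example, emit_movabs(buf, 7, 0x404018) writes 48 bf 18 40 40 00 00 00 00 00, i.e. movabs rdi, 0x404018. Because the address is absolute, the bytes mean the same thing wherever they are copied.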
This is kind of opposite of normal position-independent code, which works when the whole image is loaded at an arbitrary base. This will make code that works when the image is loaded to a fixed base (or even variable since runtime fixups should work for symbolic references), and then copied around relative to the rest of your code. So all your memory refs have to be absolute, except for within the asm.
So we couldn't have made PIE-compatible machine code by using lea rdx, [RIP+variable] / add rax, rdx - that would only get the right address for variable when run from the linked location in .rodata, not from any copy. (Unless you manually fixup the code when copying it, but it's still only a rel32 displacement.)
Terminology:
An opcode is part of a machine instruction, e.g. add ecx, 123 assembles to 3 bytes: 83 c1 7b. Those are the opcode, modrm, and imm8 respectively. (https://www.felixcloutier.com/x86/add).
"opcode" also gets misused (especially in shellcode usage) to describe the whole instruction represented as bytes.
Text names for instructions like add are mnemonics.
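To illustrate the opcode / ModRM / immediate split, here is a short C sketch that pulls apart a 3-byte instruction of the 83 /r ib form. The struct and function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct op_modrm_imm8 {
    uint8_t opcode;   /* first byte, e.g. 0x83 */
    unsigned mod;     /* ModRM bits 7..6: 3 means register-direct */
    unsigned reg;     /* ModRM bits 5..3: for 0x83, selects the operation (0 = add) */
    unsigned rm;      /* ModRM bits 2..0: register number (1 = ecx) */
    uint8_t imm8;     /* trailing 8-bit immediate */
};

/* Split a 3-byte "opcode, modrm, imm8" instruction into its fields. */
struct op_modrm_imm8 split_op_modrm_imm8(const uint8_t insn[3])
{
    assert(insn != NULL);
    struct op_modrm_imm8 f;
    f.opcode = insn[0];
    f.mod = insn[1] >> 6;
    f.reg = (insn[1] >> 3) & 7;
    f.rm = insn[1] & 7;
    f.imm8 = insn[2];
    return f;
}
```

For 83 c1 7b this gives opcode 0x83, mod 3, reg 0, rm 1 and imm8 123, which reads back as add ecx, 123.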
this is just a guess, i don't know if it will work. i'm sorry in advance for an ugly answer since i don't have much time due to work.
i think you can enclose your asm code inside labels.
get the address of that label and the size. treat it as a blob of data and you can write it anywhere.
void funcA(){
//some code here.
labelStart:
__asm("
;asm code here.
")
labelEnd:
//some code here.
//---make code as movable data.
char* pDynamicProgram = labelStart;
size_t sizeDP = labelEnd - labelStart;
//---writing to some memory.
char* someBuffer = malloc(sizeDP);
memcpy(someBuffer, pDynamicProgram, sizeDP);
//---execute: cast as a function pointer then execute call.
((func*)someBuffer)(/* parameters if any*/);
}
the sample code above is of course not compilable, but the logic is roughly like that. i have seen viruses do it that way in disassemblers, though i haven't seen the actual c++ source. for the "return" logic after the call there are many ad hoc ways to do it; just be creative.
also, i think you first have to change the memory protection so your program can write to otherwise protected memory, in case you want to overwrite an existing function.

Satisfying -Wreturn-type in a function that can't return [duplicate]

It's common for compilers to provide a switch to warn when code is unreachable. I've also seen macros for some libraries, that provide assertions for unreachable code.
Is there a hint, such as through a pragma, or builtin that I can pass to GCC (or any other compilers for that matter), that will warn or error during compilation if it's determined that a line expected to be unreachable can actually be reached?
Here's an example:
if (!conf->devpath) {
conf->devpath = arg;
return 0;
} // pass other opts into fuse
else {
return 1;
}
UNREACHABLE_LINE();
The value of this is in detecting, after changes in conditions above the expected unreachable line, that the line is in fact reachable.
gcc 4.5 and later support the __builtin_unreachable() builtin. Combining it with -Wunreachable-code might do what you want, but will probably cause spurious warnings.
__builtin_unreachable() does not generate any compile time warnings as far as I can see on GCC 7.3.0
Neither can I find anything in the docs that suggest that it would.
For example, the following example compiles without any warning:
#include <stdio.h>
int main(void) {
__builtin_unreachable();
puts("hello");
return 0;
}
with:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -Wunreachable-code main.c
The only thing I think it does do, is to allow the compiler to do certain optimizations based on the fact that a certain line of code is never reached, and give undefined behaviour if you make a programming error and it ever does.
For example, executing the above example appears to exit normally, but does not print hello as expected. Our assembly analysis then shows that the normal looking exit was just an UB coincidence.
The -fsanitize=unreachable flag to GCC converts the __builtin_unreachable(); to an assertion which fails at runtime with:
<stdin>:1:17: runtime error: execution reached a __builtin_unreachable() call
That flag is broken in Ubuntu 16.04 though: ld: unrecognized option '--push-state--no-as-needed'
What does __builtin_unreachable() do to the executable?
If we disassemble both the code with and without __builtin_unreachable with:
objdump -S a.out
we see that the one without it calls puts:
000000000000063a <main>:
#include <stdio.h>
int main(void) {
63a: 55 push %rbp
63b: 48 89 e5 mov %rsp,%rbp
puts("hello");
63e: 48 8d 3d 9f 00 00 00 lea 0x9f(%rip),%rdi # 6e4 <_IO_stdin_used+0x4>
645: e8 c6 fe ff ff callq 510 <puts@plt>
return 0;
64a: b8 00 00 00 00 mov $0x0,%eax
}
64f: 5d pop %rbp
650: c3 retq
651: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
658: 00 00 00
65b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
while the one with it does only:
int main(void) {
5fa: 55 push %rbp
5fb: 48 89 e5 mov %rsp,%rbp
5fe: 66 90 xchg %ax,%ax
and does not even return, so I think it is just an undefined behaviour coincidence that it did not just blow up.
Why isn't GCC able to determine if some code is unreachable?
I gather the following answers:
determining unreachable code automatically is too hard for GCC for some reason, which is why for years now -Wunreachable-code does nothing: gcc does not warn for unreachable code
users may use inline assembly that implies unreachability, but GCC cannot determine that. This is mentioned on the GCC manual:
One such case is immediately following an asm statement that either never terminates, or one that transfers control elsewhere and never returns. In this example, without the __builtin_unreachable, GCC issues a warning that control reaches the end of a non-void function. It also generates code to return after the asm.
int f (int c, int v)
{
if (c)
{
return v;
}
else
{
asm("jmp error_handler");
__builtin_unreachable ();
}
}
Tested on GCC 7.3.0, Ubuntu 18.04.
With gcc 4.4.0 Windows cross compiler to PowerPC compiling with -O2 or -O3 the following works for me:
#define unreachable asm("unreachable\n")
If the compiler cannot prove the line unreachable, and so does not optimise the asm statement away, the assembler fails with an unknown-operation error.
Yes, it is quite probably "highly unpredictable under different optimization options", and likely to break when I finally update the compiler, but for the moment it's better than nothing.
If your compiler does not have the warning that you need, it can be complemented with a static analyzer. The kind of analyzer I am talking about would have its own annotation language and/or recognize C assert, and use these for hints of properties that should be true at specific points of the execution. If there isn't a specific annotation for unreachable statements, you could probably use assert (false);.
I am not personally familiar with them but Klocwork and CodeSonar are two famous analyzers. Goanna is a third one.
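Putting the two suggestions from these answers together, one possible definition of the UNREACHABLE_LINE() macro from the question is sketched below. It assumes GCC or Clang (for __builtin_unreachable) and gives a run-time check in debug builds, not a compile-time diagnostic. The struct conf and take_devpath names are stand-ins modelled on the question's snippet.

```c
#include <assert.h>
#include <stddef.h>

/* Debug builds: the assert fires at run time if the line is ever reached,
   and static analyzers that understand assert can flag it.
   NDEBUG builds: the builtin tells GCC/Clang the line cannot be reached
   (undefined behaviour if that promise is broken). */
#define UNREACHABLE_LINE() \
    do { assert(!"unreachable line reached"); __builtin_unreachable(); } while (0)

struct conf { const char *devpath; };   /* hypothetical config struct */

int take_devpath(struct conf *conf, const char *arg)
{
    if (!conf->devpath) {
        conf->devpath = arg;
        return 0;
    } else {
        return 1;
    }
    UNREACHABLE_LINE();   /* both branches above return */
}
```

If a later edit removes one of the returns, a debug build aborts with the assertion message when the line is hit, instead of silently falling off the end of the function.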

Why is there a locked xadd instruction in this disassembled std::string dtor?

I have a very simple code:
#include <string>
#include <iostream>
int main() {
std::string s("abc");
std::cout << s;
}
Then, I compiled it:
g++ -Wall test_string.cpp -o test_string -std=c++17 -O3 -g3 -ggdb3
And then disassembled it, and the most interesting piece is:
00000000004009a0 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10>:
4009a0: 48 81 ff a0 11 60 00 cmp rdi,0x6011a0
4009a7: 75 01 jne 4009aa <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0xa>
4009a9: c3 ret
4009aa: b8 00 00 00 00 mov eax,0x0
4009af: 48 85 c0 test rax,rax
4009b2: 74 11 je 4009c5 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x25>
4009b4: 83 c8 ff or eax,0xffffffff
4009b7: f0 0f c1 47 10 lock xadd DWORD PTR [rdi+0x10],eax
4009bc: 85 c0 test eax,eax
4009be: 7f e9 jg 4009a9 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x9>
4009c0: e9 cb fd ff ff jmp 400790 <_ZdlPv@plt>
4009c5: 8b 47 10 mov eax,DWORD PTR [rdi+0x10]
4009c8: 8d 50 ff lea edx,[rax-0x1]
4009cb: 89 57 10 mov DWORD PTR [rdi+0x10],edx
4009ce: eb ec jmp 4009bc <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x1c>
Why does _ZNSs4_Rep10_M_disposeERKSaIcE.isra.10 (which is std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_M_dispose(std::allocator<char> const&) [clone .isra.10]) contain a lock-prefixed xadd?
A follow-up question is how I can avoid it?
It looks like code associated with copy on write strings. The locked instruction is decrementing a reference count and then calling operator delete only if the reference count for the possibly shared buffer containing the actual string data is zero (i.e., it is not shared: no other string object refers to it).
Since libstdc++ is open source, we can confirm this by looking at the source!
The function you've disassembled, _ZNSs4_Rep10_M_disposeERKSaIcE, de-mangles [1] to std::basic_string<char>::_Rep::_M_dispose(std::allocator<char> const&). Here's the corresponding source for libstdc++ in the gcc-4.x era [2]:
void
_M_dispose(const _Alloc& __a)
{
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
if (__builtin_expect(this != &_S_empty_rep(), false))
#endif
{
// Be race-detector-friendly. For more info see bits/c++config.
_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&this->_M_refcount);
if (__gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount,
-1) <= 0)
{
_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&this->_M_refcount);
_M_destroy(__a);
}
}
} // XXX MT
Given that, we can annotate the assembly you provided, mapping each instruction back to the C++ source:
00000000004009a0 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10>:
# the next two lines implement the check:
# if (__builtin_expect(this != &_S_empty_rep(), false))
# which is an empty string optimization. The S_empty_rep singleton
# is at address 0x6011a0 and if the current buffer points to that
# we are done (execute the ret)
4009a0: cmp rdi,0x6011a0
4009a7: jne 4009aa <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0xa>
4009a9: ret
# now we are in the implementation of
# __gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount, -1)
# which dispatches either to an atomic version of the add function
# or the non-atomic version, depending on the value of `eax` which
# is always directly set to zero, so the non-atomic version is
# *always called* (see details below)
4009aa: mov eax,0x0
4009af: test rax,rax
4009b2: je 4009c5 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x25>
# this is the atomic version of the decrement you were concerned about
# but we never execute this code because the test above always jumps
# to 4009c5 (the non-atomic version)
4009b4: or eax,0xffffffff
4009b7: lock xadd DWORD PTR [rdi+0x10],eax
4009bc: test eax,eax
# check if the result of the xadd was zero, if not skip the delete
4009be: jg 4009a9 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x9>
# the delete call
4009c0: jmp 400790 <_ZdlPv@plt> # tailcall
# the non-atomic version starts here, this is the code that is
# always executed
4009c5: mov eax,DWORD PTR [rdi+0x10]
4009c8: lea edx,[rax-0x1]
4009cb: mov DWORD PTR [rdi+0x10],edx
# this jumps up to the test eax,eax check which calls operator delete
# if the refcount was zero
4009ce: jmp 4009bc <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x1c>
A key note is that the lock xadd code you were concerned about is never executed. There is a mov eax, 0 followed by a test rax, rax; je - this test always succeeds and the jump always occurs because rax is always zero.
What's happening here is that __gnu_cxx::__exchange_and_add_dispatch is implemented so that it checks whether the process is definitely single threaded. If it is, it doesn't bother with expensive atomic instructions for the reference-count update and simply uses a regular non-atomic addition. It does this by checking the address of a pthreads function, __pthread_key_create: if this is zero, the pthread library hasn't been linked in, and hence the process is definitely single threaded. In your case, the address of this pthread function gets resolved at link time to 0 (you didn't have -lpthread on your compile command line), which is where the mov eax, 0x0 comes from. At link time, it's too late to optimize on this knowledge, so the vestigial atomic decrement code remains but never executes. This mechanism is described in more detail in this answer.
The code that does execute is the last part of the function, starting at 4009c5. This code also decrements the reference count, but in a non-atomic way. The check which decides between these two options is probably based on whether the process is multithreaded or not, e.g., whether -lpthread has been linked. For whatever reason this check, inside __exchange_and_add_dispatch, is implemented in a way that prevents the compiler from actually removing the atomic half of the branch, even though the fact that it will never be taken is known at some point during the build process (after all, the hard-coded mov eax, 0 got there somehow).
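The dispatch logic can be summarised with a simplified C sketch. It is not libstdc++'s implementation: the threads_active parameter is a stand-in for the library's weak-symbol check on __pthread_key_create, release_ref is a made-up helper, and __atomic_fetch_add is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Add val to *mem and return the OLD value. Uses an atomic
   read-modify-write (lock xadd on x86) only when threads may exist. */
int exchange_and_add_dispatch(int *mem, int val, int threads_active)
{
    assert(mem != NULL);
    if (threads_active)
        return __atomic_fetch_add(mem, val, __ATOMIC_ACQ_REL);
    int old = *mem;          /* plain load / add / store */
    *mem = old + val;
    return old;
}

/* Mirrors the test in _M_dispose: returns 1 when the caller dropped the
   last reference (old count <= 0) and the buffer should be freed. */
int release_ref(int *refcount, int threads_active)
{
    return exchange_and_add_dispatch(refcount, -1, threads_active) <= 0;
}
```

In the disassembly above the threads_active condition is the constant 0 loaded by mov eax, 0x0, so only the plain load / add / store path ever runs.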
A follow-up question is how I can avoid it?
Well, you've already avoided the lock xadd part, so if that's what you care about, you're good to go. However, you still have a cause for concern:
Copy on write std::string implementations are not standards compliant due to changes made in C++11, so the question remains why exactly you are getting this COW string behavior even when specifying -std=c++17.
The problem is most likely distribution related: CentOS 7 by default uses an ancient gcc version (< 5) which still uses the non-compliant COW strings. However, you mention that you are using gcc 8.2.1, which in a normal install uses non-COW strings by default. It seems that if you installed 8.2.1 using the RHEL "devtools" method, you get a new gcc which still uses the old ABI and links against the old system libstdc++.
To confirm this, you might want to check the value of _GLIBCXX_USE_CXX11_ABI macro in your test program, and also your libstdc++ version (the version information here might prove useful).
You can avoid this by using an OS other than CentOS, one that doesn't ship ancient gcc and glibc versions. If you need to stick with CentOS for some reason, you'll have to look into whether there is a supported way to use a newer libstdc++ version on that distribution. You could also consider using a containerization technology to build an executable independent of the library versions of your local host.
[1] You can demangle it like so: echo '_ZNSs4_Rep10_M_disposeERKSaIcE' | c++filt.
[2] I'm using gcc-4 era source since I'm guessing that's what you end up using in CentOS 7.

GDB: examine as instruction with opcodes

Is it possible to examine memory as instruction (x/i) the way I can see both asm and raw instructions in hex (like with disassemble /r)?
Sometimes I want to disassemble some part of memory which GDB refuses to disassemble saying: "No function contains specified address".
The only option is then x/i, but I would like to see exactly what hex values are translated to what instructions.
I want to disassemble some part of memory which GDB refuses to disassemble saying: "No function contains specified address".
The disas/r 0x1234,0x1235 will work even when GDB can not determine function boundaries. Example:
(gdb) disas/r 0x0000000000400803
No function contains specified address.
(gdb) disas/r 0x0000000000400803,0x000000000040080f
Dump of assembler code from 0x400803 to 0x40080f:
0x0000000000400803: e8 b8 fd ff ff callq 0x4005c0 <system@plt>
0x0000000000400808: 48 81 45 f0 00 10 00 00 addq $0x1000,-0x10(%rbp)
End of assembler dump.

GCC function padding value

Whenever I compile C or C++ code with optimizations enabled, GCC aligns functions to a 16-byte boundary (on IA-32). If the function is shorter than 16 bytes, GCC pads it with some bytes, which don't seem to be random at all:
19: c3 ret
1a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
It always seems to be either 8d b6 00 00 00 00 ... or 8d 74 26 00.
Do function padding bytes have any significance?
The padding is created by the assembler, not by gcc. The assembler merely sees a .align directive (or equivalent) and doesn't know whether the space to be padded is inside a function (e.g. loop alignment) or between functions, so it must insert NOPs of some sort. Modern x86 assemblers use the longest possible NOP instructions with the intention of spending as few cycles as possible if the padding is for loop alignment.
Personally, I'm extremely skeptical of alignment as an optimization technique. I've never seen it help much, and it can definitely hurt by increasing the total code size (and cache utilization) tremendously. If you use the -Os optimization level, it's off by default, so there's nothing to worry about. Otherwise you can disable all the alignments with the proper -f options.
The assembler first sees an .align directive. Since it doesn't know if this address is within a function body or not, it cannot output NULL 0x00 bytes, and must generate NOPs (0x90).
However:
lea esi,[esi+0x0] ; does nothing, pseudocode: ESI = ESI + 0
executes in fewer clock cycles than
nop
nop
nop
nop
nop
nop
If this code happened to fall within a function body (for instance, loop alignment), the lea version would be much faster, while still "doing nothing."
The instruction lea 0x0(%esi),%esi just loads the value in %esi into %esi - it's no-operation (or NOP), which means that if it's executed it will have no effect.
This just happens to be a single instruction, 6-byte NOP. 8d 74 26 00 is just a 4-byte encoding of the same instruction.
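The number of padding bytes the assembler has to fill is plain alignment arithmetic. A minimal C sketch, with padding_to_alignment as a made-up helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes needed to advance addr to the next multiple of align.
   align must be a power of two. */
size_t padding_to_alignment(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* Unsigned negation wraps, so (-addr) mod align is the distance
       to the next boundary (0 if already aligned). */
    return (size_t)(-addr & (align - 1));
}
```

In the question's listing the ret occupies offset 0x19, so padding starts at 0x1a and padding_to_alignment(0x1a, 16) is 6, exactly the length of the 6-byte lea. A gap of 4 bytes would be filled by the 4-byte encoding instead.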