GCC function padding value - c++

Whenever I compile C or C++ code with optimizations enable,d GCC aligns functions to a 16-byte boundary (on IA-32). If the function is shorter than 16 bytes, GCC pads it with some bytes, which don't seem to be random at all:
19: c3 ret
1a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
It always seems to be either 8d b6 00 00 00 00 ... or 8d 74 26 00.
Do function padding bytes have any significance?

The padding is created by the assembler, not by gcc. It merely sees a .align directive (or equivalent) and doesn't know whether the space to be padded is inside a function (e.g. loop alignment) or between functions, so it must insert NOPs of some sort. Modern x86 assemblers use the largest possible NOP opcodes with the intention of spending as few cycles as possible if the padding is for loop alignment.
Personally, I'm extremely skeptical of alignment as an optimization technique. I've never seen it help much, and it can definitely hurt by increasing the total code size (and cache utilization) tremendously. If you use the -Os optimization level, it's off by default, so there's nothing to worry about. Otherwise you can disable all the alignments with the proper -f options.

The assembler first sees an .align directive. Since it doesn't know if this address is within a function body or not, it cannot output NULL 0x00 bytes, and must generate NOPs (0x90).
However:
lea esi,[esi+0x0] ; does nothing, psuedocode: ESI = ESI + 0
executes in fewer clock cycles than
nop
nop
nop
nop
nop
nop
If this code happened to fall within a function body (for instance, loop alignment), the lea version would be much faster, while still "doing nothing."

The instruction lea 0x0(%esi),%esi just loads the value in %esi into %esi - it's no-operation (or NOP), which means that if it's executed it will have no effect.
This just happens to be a single instruction, 6-byte NOP. 8d 74 26 00 is just a 4-byte encoding of the same instruction.

Related

Is it possible to write asm in C++ with opcode instead of shellcode

I'm curious if there's a way to use __asm in c++ then write that into memory instead of doing something like:
BYTE shell_code[] = { 0x48, 0x03 ,0x1c ,0x25, 0x0A, 0x00, 0x00, 0x00 };
write_to_memory(function, &shell_code, sizeof(shell_code));
So I would like to do:
asm_code = __asm("add rbx, &variable\n\t""jmp rbx") ;
write_to_memory(function, &asm_code , sizeof(asm_code ));
Worst case I can use GCC and objdump externally or something but hoping there's an internal way
You can put an asm(""); statement at global scope, with start/end labels inside it, and declare those labels as extern char start_code[], end_code[0]; so you can access them from C. C char arrays work most like asm labels, in terms of being able to use the C name and have it work as an address.
// compile with gcc -masm=intel
// AFAIK, no way to do that with clang
asm(
".pushsection .rodata \n" // we don't want to run this from here, it's just data
"start_code: \n"
" add rax, OFFSET variable \n" // *absolute* address as 32-bit sign-extended immediate
"end_code: \n"
".popsection"
);
__attribute__((used)) static int variable = 1;
extern char start_code[], end_code[0]; // C declarations for those asm labels
#include <string.h>
void copy_code(void *dst)
{
memcpy(dst, start_code, end_code - start_code);
}
It would be fine to have the payload code in the default .text section, but we can put it in .rodata since we don't want to run it.
Is that the kind of thing you're looking for? asm output on Godbolt (without assembling + disassembling:
start_code:
add rax, OFFSET variable
end_code:
copy_code(void*):
mov edx, OFFSET FLAT:end_code
mov esi, OFFSET FLAT:start_code
sub rdx, OFFSET FLAT:start_code
jmp [QWORD PTR memcpy#GOTPCREL[rip]]
To see if it actually assembles to what we want, I compiled with
gcc -O2 -fno-plt -masm=intel -fno-pie -no-pie -c foo.c to get a .o. objdump -drwC -Mintel shows:
0000000000000000 <copy_code>:
0: ba 00 00 00 00 mov edx,0x0 1: R_X86_64_32 .rodata+0x6
5: be 00 00 00 00 mov esi,0x0 6: R_X86_64_32 .rodata
a: 48 81 ea 00 00 00 00 sub rdx,0x0 d: R_X86_64_32S .rodata
11: ff 25 00 00 00 00 jmp QWORD PTR [rip+0x0] # 17 <end_code+0x11> 13: R_X86_64_GOTPCRELX memcpy-0x4
And with -D to see all sections, the actual payload is there in .rodata, still not linked yet:
Disassembly of section .rodata:
0000000000000000 <start_code>:
0: 48 05 00 00 00 00 add rax,0x0 2: R_X86_64_32S .data
-fno-pie -no-pie is only necessary for the 32-bit absolute address of variable to work. (Without it, we get two RIP-relative LEAs and a sub rdx, rsi. Unfortunately neither way of compiling gets GCC to subtract the symbols at build time with mov edx, OFFSET end_code - start_code, but that's just in the code doing the memcpy, not in the machine code being copied.)
In a linked executable
We can see how the linker filled in those relocations.
(I tested by using -nostartfiles instead of -c - I didn't want to run it, just look at the disassembly, so there was not point to actually writing a main.)
$ gcc -O2 -fno-plt -masm=intel -fno-pie -no-pie -nostartfiles foo.c
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
$ objdump -D -rwC -Mintel a.out
(manually edited to remove uninteresting sections)
Disassembly of section .text:
0000000000401000 <copy_code>:
401000: ba 06 20 40 00 mov edx,0x402006
401005: be 00 20 40 00 mov esi,0x402000
40100a: 48 81 ea 00 20 40 00 sub rdx,0x402000
401011: ff 25 e1 2f 00 00 jmp QWORD PTR [rip+0x2fe1] # 403ff8 <memcpy#GLIBC_2.14>
The linked payload:
0000000000402000 <start_code>:
402000: 48 05 18 40 40 00 add rax,0x404018 # from add rax, OFFSET variable
0000000000402006 <end_code>:
402006: 48 c7 c2 06 00 00 00 mov rdx,0x6
# this was from mov rdx, OFFSET end_code - start_code to see if that would assemble + link
Our non-zero-init dword variable that we're taking the address of:
Disassembly of section .data:
0000000000404018 <variable>:
404018: 01 00 add DWORD PTR [rax],eax
...
Your specific asm instruction is weird
&variable isn't valid asm syntax, but I'm guessing you wanted to add the address?
Since you're going to be copying the machine code somewhere, you must avoid RIP-relative addressing modes and any other relative references to things outside the block you're copying. Only mov can use 64-bit absolute addresses, like movabs rdi, OFFSET variable instead of the usual lea rdi, [rip + variable]. Also, you can even load / store into/from RAX/EAX/AX/AL with 64-bit absolute addresses movabs eax, [variable]. (mov-immediate can use any register, load/store are only the accumulator. https://www.felixcloutier.com/x86/mov)
(movabs is an AT&T mnemonic, but GAS allows it in .intel_syntax noprefix to force using 64-bit immediates, instead of the default 32-bit-sign-extended.)
This is kind of opposite of normal position-independent code, which works when the whole image is loaded at an arbitrary base. This will make code that works when the image is loaded to a fixed base (or even variable since runtime fixups should work for symbolic references), and then copied around relative to the rest of your code. So all your memory refs have to be absolute, except for within the asm.
So we couldn't have made PIE-compatible machine code by using lea rdx, [RIP+variable] / add rax, rdx - that would only get the right address for variable when run from the linked location in .rodata, not from any copy. (Unless you manually fixup the code when copying it, but it's still only a rel32 displacement.)
Terminology:
An opcode is part of a machine instruction, e.g. add ecx, 123 assembles to 3 bytes: 83 c1 7b. Those are the opcode, modrm, and imm8 respectively. (https://www.felixcloutier.com/x86/add).
"opcode" also gets misused (especially in shellcode usage) to describe the whole instruction represented as bytes.
Text names for instructions like add are mnemonics.
this is just a guess, i don't know if it will work. i'm sorry in advance for an ugly answer since i don't have much time due to work.
i think you can enclose your asm code inside labels.
get the address of that label and the size. treat it as a blob of data and you can write it anywhere.
void funcA(){
//some code here.
labelStart:
__asm("
;asm code here.
")
labelEnd:
//some code here.
//---make code as movable data.
char* pDynamicProgram = labelStart;
size_t sizeDP = labelEnd - labelStart;
//---writing to some memory.
char* someBuffer = malloc(sizeDP);
memcpy(someBuffer, pDynamicProgram, sizeDP);
//---execute: cast as a function pointer then execute call.
((func*)someBuffer)(/* parameters if any*/);
}
the sample code above of course is not compilable. but the logic is kind of like that. i see viruses do it that way though i haven't saw the actual c++ code. but we saw it from disassemblers. for the "return" logic after the call, there are many adhoc ways to do that. just be creative.
also, i think you have to enable first some settings for your program to write to some forbidden memory in case you want to override an existing function.

Relative vs Physical addresses in C++

I recently started learning about memory management and I read about relative addresses and physical addresses, and a question appeared in my mind:
When I print a variable's address, is it showing the relative (virtual) address or the physical address in where the variable located in the memory?
And another question regarding memory management:
Why does this code produce the same stack pointer value for each run (from Shellcoder's Handbook, page 28)?
Does any program that I run produce this address?
// find_start.c
unsigned long find_start(void)
{
__asm__("movl %esp, %eax");
}
int main()
{
printf("0x%x\n",find_start());
}
If we compile this and run this a few times, we get:
shellcoders#debian:~/chapter_2$ ./find_start
0xbffffad8
shellcoders#debian:~/chapter_2$ ./find_start
0xbffffad8
shellcoders#debian:~/chapter_2$ ./find_start
0xbffffad8
shellcoders#debian:~/chapter_2$ ./find_start
0xbffffad8
I would appreciate if someone could clarify this topic to me.
When I print a variable's address, is it showing the relative ( virtual ) address or the physical address in where the variable located in the memory ?
The counterpart to a relative address is an absolute address. That has nothing to do with the distinction between virtual and physical addresses.
On most common modern operating systems, such as Windows, Linux and MacOS, unless you are writing a driver, you will never encounter physical addresses. These are handled internally by the operating system. You will only be working with virtual addresses.
Why does this code produces the same stack pointer value for each run ( from shellcoder's handbook , page 28) ?
On most modern operating systems, every process has its own virtual memory address space. The executable is loaded to its preferred base address in that virtual address space, if possible, otherwise it is loaded at another address (relocated). The preferred base address of an executable file is normally stored in its header. Depending on the operating system and CPU, the heap is probably created at a higher address, since the heap normally grows upward (towards higher addresses). Because the stack normally grows downward (towards lower addresses), it will likely be created below the load address of the executable and grow towards the address 0.
Since the preferred load address is the same every time you run the executable, it is likely that the virtual memory addresses are the same. However, this may change if address layout space randomization is used. Also, just because the virtual memory addresses are the same does not mean that the physical memory address are the same, too.
Does any program that I will run produce this address ?
Depending on your operating system, you can set the preferred base address in which your program is loaded into virtual memory in the linker settings. Many programs may still have the same base address as your program, probably because both programs were built using the same linker with default settings.
The virtual addresses are only per program? Let's say I have 2 programs: program1 and program2. Can program2 access program1's memory?
It is not possible for program2 to access program1's memory directly, because they have separate virtual memory address spaces. However, it is possible for one program to ask the operating system for permission to access another process's address space. The operating system will normally grant this permission, provided that the program has sufficient priviledges. On Windows, this is can be accomplished for example with the function WriteProcessMemory. Linux offers similar functionality by using ptrace and writing to /proc/[pid]/mem. See this link for further information.
You get virtual addresses. Your program never gets to see physical addresses. Ever.
can program2 access program1's memory ?
No, because you can don't have addresses that point to program1's memory. If you have virtual address 0xabcd1234 in the program1 process, and you try to read it from the program2 process, you get program2's 0xabcd1234 (or a crash if there is no such address in program2). It's not a permission check - it's not like the CPU goes to the memory and sees "oh, this is program1's memory, I shouldn't access it". It's program2's own memory space.
But yes, if you use "shared memory" to ask the OS to put the same physical memory in both processes.
And yes, if you use ptrace or /proc/<pid>/mem to ask the OS nicely to read from the other process's memory, and you have permission to do that, then it will do that.
why does this code produces the same stack pointer value for each run ( from shellcoder's handbook , page 28) ? does any program that I will run will produce this address ?
Apparently, that program always has that stack pointer value. Different programs might have different stack pointers. And if you put more local variables in main, or call find_start from a different function, you will get a different stack pointer value because there will be more data pushed on the stack.
Note: even if you run the program twice at the same time, the address will be the same, because they are virtual addresses, and every process has its own virtual address space. They will be different physical addresses but you don't see the physical addresses.
In stack overflow example in the book I mentioned, they overwrite the return address in the stack to an address of an exploit in the enviroment variables. how does it work ?
It all works within one process.
Only focusing on a small part of your question.
#include <stdio.h>
// find_start.c
unsigned long find_start(void)
{
__asm__("movl %esp, %eax");
}
unsigned long nest ( void )
{
return(find_start());
}
int main()
{
printf("0x%lx\n",find_start());
printf("0x%lx\n",nest());
}
gcc so.c -o so
./so
0x50e381a0
0x50e38190
There is no magic here. The virtual space allows for programs to be built the same. I don't need to know where my program is going to live, each program can be compiled for the same address space, when loaded and run the can see the same virtual address space because they are all mapped to separate/different physical address spaces.
readelf -a so
(don't but I prefer objdump)
objdump -D so
Disassembly of section .text:
0000000000000540 <_start>:
540: 31 ed xor %ebp,%ebp
542: 49 89 d1 mov %rdx,%r9
545: 5e pop %rsi
....
000000000000064a <find_start>:
64a: 55 push %rbp
64b: 48 89 e5 mov %rsp,%rbp
64e: 89 e0 mov %esp,%eax
650: 90 nop
651: 5d pop %rbp
652: c3 retq
0000000000000653 <nest>:
653: 55 push %rbp
654: 48 89 e5 mov %rsp,%rbp
657: e8 ee ff ff ff callq 64a <find_start>
65c: 5d pop %rbp
65d: c3 retq
000000000000065e <main>:
65e: 55 push %rbp
65f: 48 89 e5 mov %rsp,%rbp
662: e8 e3 ff ff ff callq 64a <find_start>
667: 48 89 c6 mov %rax,%rsi
66a: 48 8d 3d b3 00 00 00 lea 0xb3(%rip),%rdi # 724 <_IO_stdin_used+0x4>
671: b8 00 00 00 00 mov $0x0,%eax
676: e8 a5 fe ff ff callq 520 <printf#plt>
67b: e8 d3 ff ff ff callq 653 <nest>
So, two things or maybe more than two things. Our entry point _start is in ram at a low address. low virtual address. On this system with this compiler I would expect all/most programs to start in a similar place or the same or in some cases it may depend on what is in my program, but it should be somewhere low.
The stack pointer though, if you check above and now as I type stuff:
0x355d38d0
0x355d38c0
it has changed.
0x4ebf1760
0x4ebf1750
0x31423240
0x31423230
0xa63188d0
0xa63188c0
a few times within a few seconds. The stack is a relative thing not absolute so there is no real need to create a fixed address that is the same every time. Needs to be in a space that is related to this user/thread and virtual since it is going through the mmu for protection reasons. There is no reason for a virtual address to not equal the physical address. The kernel code/driver that manages the mmu for a platform is programmed to do it a certain way. You can have the address space for code start at 0x0000 for every program, and you might wish the address space for data to be the same, zero based. but for stack it doesn't matter. And on my machine, my os, this particular version this particular day it isn't consistent.
I originally thought your question was different depending on factors that are specific to your build, and settings. For a specific build a single call to find_start is going to be at a fixed relative address for the stack pointer each function that uses the stack will put it back the way it was found, assuming you can't change the compilation of the program while running the stack pointer for a single instance of the call the nesting will be the same the stack consumption by each function along the way will be the same.
I added another layer and by looking at the disassembly, main, nest and find_start all mess with the stack pointer (unoptimized) so that is why for these runs they are 0x10 apart. if I added/removed more code per function to change the stack usage in one or more of the functions then that delta could change.
But
gcc -O2 so.c -o so
objdump -D so > so.txt
./so
0x0
0x0
Disassembly of section .text:
0000000000000560 <main>:
560: 48 83 ec 08 sub $0x8,%rsp
564: 89 e0 mov %esp,%eax
566: 48 8d 35 e7 01 00 00 lea 0x1e7(%rip),%rsi # 754 <_IO_stdin_used+0x4>
56d: 31 d2 xor %edx,%edx
56f: bf 01 00 00 00 mov $0x1,%edi
574: 31 c0 xor %eax,%eax
576: e8 c5 ff ff ff callq 540 <__printf_chk#plt>
57b: 89 e0 mov %esp,%eax
57d: 48 8d 35 d0 01 00 00 lea 0x1d0(%rip),%rsi # 754 <_IO_stdin_used+0x4>
584: 31 d2 xor %edx,%edx
586: bf 01 00 00 00 mov $0x1,%edi
58b: 31 c0 xor %eax,%eax
58d: e8 ae ff ff ff callq 540 <__printf_chk#plt>
592: 31 c0 xor %eax,%eax
594: 48 83 c4 08 add $0x8,%rsp
598: c3 retq
The optimizer didn't recognize the return value for some reason.
unsigned long fun ( void )
{
return(0x12345678);
}
00000000000006b0 <fun>:
6b0: b8 78 56 34 12 mov $0x12345678,%eax
6b5: c3 retq
calling convention looks fine.
Put find_start in a separate file so the optimizer can't remove it
gcc -O2 so.c sp.c -o so
./so
0xb1192fc8
0xb1192fc8
./so
0x7aa979d8
0x7aa979d8
./so
0x485134c8
0x485134c8
./so
0xa8317c98
0xa8317c98
./so
0x2ba70b8
0x2ba70b8
Disassembly of section .text:
0000000000000560 <main>:
560: 48 83 ec 08 sub $0x8,%rsp
564: e8 67 01 00 00 callq 6d0 <find_start>
569: 48 8d 35 f4 01 00 00 lea 0x1f4(%rip),%rsi # 764 <_IO_stdin_used+0x4>
570: 48 89 c2 mov %rax,%rdx
573: bf 01 00 00 00 mov $0x1,%edi
578: 31 c0 xor %eax,%eax
57a: e8 c1 ff ff ff callq 540 <__printf_chk#plt>
57f: e8 4c 01 00 00 callq 6d0 <find_start>
584: 48 8d 35 d9 01 00 00 lea 0x1d9(%rip),%rsi # 764 <_IO_stdin_used+0x4>
58b: 48 89 c2 mov %rax,%rdx
58e: bf 01 00 00 00 mov $0x1,%edi
593: 31 c0 xor %eax,%eax
595: e8 a6 ff ff ff callq 540 <__printf_chk#plt>
I didn't let it inline those functions it can see nest so it inlined it removing the stack change that came with it. So now the value nested or not is the same.

macOS - Reading part of other app library code and disassembling it to get offset

My applications read other application memory in order to get pointer. I need firstly to read offset from static library to start working with application itself.
Some function in dylib contains offset to pointer "0x41b1110" - i know that this offset works when used manually, but i need to to read that with my application automatically without checking value manually, if i do simple read from memory from address SomeAddressX as uint64_t it get's ridiculous address which is not equal 0x41b1110. im pretty sure what i got is simply this instruction. Then i have tried read this as byte array, and this byte array was equal to byte array from plain binary at this address. Im wondering how to read simply "0x41b1110" not entire instruction? Do i need to disassembly byte code to x64 instruction and then parse it to get address, or is there smarter way ? Im not much experienced with asm.
SomeAddressX - rax, qword [ds:0x41b1110]
Adding Example byte code and instruction
lea rax, qword [ds:0x1043740]
which gives
48 8D 05 6F D9 99 00
first three 48 8D 05 appears to be lea rax, qword but the other part 6F D9 99 00 is not looking like 01 04 37 40 (0x1043740) ?
It's x64 and enforced PIC (position-independent code) code on OSX (doesn't allow non-PIC executables, as it is using ASLR).
So that disassembly is hiding an important bit of information from you. The true nature of that instruction is revealed here (ba dum ts):
lea rax,[rip+0x99d96f]
It's using current instruction pointer rip to relatively address it's data.
The 0x1043740 is result of addressOfInstruction + 7 + 0x99d96f.
The 0x99d96f part is clearly visible in the bytecode itself.
The +7 is instruction opcode size. Now I'm not 100% sure it's added too at that stage, so do your own math, as you know "addressOfInstruction".
And check out your debugger options, to see if you can switch between the friendly absolute memory display vs. true rip+offset disassembly.

Do macros in C++ improve performance?

I'm a beginner in C++ and I've just read that macros work by replacing text whenever needed. In this case, does this mean that it makes the .exe run faster? And how is this different than an inline function?
For example, if I have the following macro :
#define SQUARE(x) ((x) * (x))
and normal function :
int Square(const int& x)
{
return x*x;
}
and inline function :
inline int Square(const int& x)
{
return x*x;
}
What are the main differences between these three and especially between the inline function and the macro? Thank you.
You should avoid using macros if possible. Inline functions are always the better choice, as they are type safe. An inline function should be as fast as a macro (if it is indeed inlined by the compiler; note that the inline keyword is not binding but just a hint to the compiler, which may ignore it if inlining is not possible).
PS: as a matter of style, avoid using const Type& for parameter types that are fundamental, like int or double. Simply use the type itself, in other words, use
int Square(int x)
since a copy won't affect (or even make it worse) performance, see e.g. this question for more details.
Macros translate to: stupid replacing of pattern A with pattern B. This means: everything happens before the compiler kicks in. Sometimes they come in handy; but in general, they should be avoided. Because you can do a lot of things, and later on, in the debugger, you have no idea what is going on.
Besides: your approach to performance is well, naive, to say it friendly. First you learn the language (which is hard for modern C++, because there are a ton of important concepts and things one absolutely need to know and understand). Then you practice, practice, practice. And then, when you really come to a point where your existing application has performance problems; then do profiling to understand the real issue.
In other words: if you are interested in performance, you are asking the wrong question. You should worry much more about architecture (like: potential bottlenecks), configuration (in the sense of latency between different nodes in your system), and so on. Of course, you should apply common sense; and not write code that is obviously wasting memory or CPU cycles. But sometimes a piece of code that runs 50% slower ... might be 500% easier to read and maintain. And if execution time is then 500ms, and not 250ms; that might be totally OK (unless that specific part is called a thousand times per minute).
The difference between a macro and an inlined function is that a macro is dealt with before the compiler sees it.
On my compiler (clang++) without optimisation flags the square function won't be inlined. The code it generates looks like this
4009f0: 55 push %rbp
4009f1: 48 89 e5 mov %rsp,%rbp
4009f4: 89 7d fc mov %edi,-0x4(%rbp)
4009f7: 8b 7d fc mov -0x4(%rbp),%edi
4009fa: 0f af 7d fc imul -0x4(%rbp),%edi
4009fe: 89 f8 mov %edi,%eax
400a00: 5d pop %rbp
400a01: c3 retq
the imul is the assembly instruction doing the work, the rest is moving data around.
code that calls it looks like
400969: e8 82 00 00 00 callq 4009f0 <_Z6squarei>
iI add the -O3 flag to Inline it and that imul shows up in the main function where the function is called from in C++ code
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol#plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
It's a reasonable thing to do to get a basic handle on assembly language for your machine and use gcc -S on your source, or objdump -D on your binary (as I did here) to see exactly what is going on.
Using the macro instead of the inlined function gets something very similar
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol#plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
Note one of the many dangers here with macros: what does this do ?
x = 5; std::cout << SQUARE(++x) << std::endl;
36? nope, 42. It becomes
std::cout << ++x * ++x << std::endl;
which becomes 6 * 7
Don't be put off by people telling you not to care about optimisation. Using C or C++ as your language is an optimisation in itself. Just try to work out if you're wasting time with it and be sensible.
Macros just perform text substitution to modify source code.
As such, macros don't inherently affect performance of code. The techniques you use to design and code obviously affect performance. So the only implication of macros on performance is based on what the macro does (i.e. what code you write the macro to emit).
The big danger of macros is that they do not respect scope. The changes they make are unconditional, cross function boundaries, and things like that. There are a lot of subtleties in writing macros to make them behave as intended (avoid unintended side effects in code, avoid undefined behaviour, etc). This means code which uses macros is harder to understand, and harder to get right.
At best, with modern compilers, the performance gain you can get using macros, is the same as can be achieved with inline functions - at the expense of increasing chances of the code behaving incorrectly. You are therefore better off using inline functions - unlike macros they are typesafe and work consistently with other code.
Modern compilers might choose to not inline a function, even if you have specified it as inline. If that happens, you generally don't need to worry - modern compilers are able to do a better job than most modern programmers in deciding whether a function should be inlined.
Using such a macro only make sense if its argument is itself a #define'd constant, as the computation will then be performed by the preprocessor. Even then, double-check that the result is the expected one.
When working on classic variables, the (inlined) function form should be preferred as:
It is type-safe;
It will handle expressions used as an argument in a consistent way. This not only includes the case of per/post increments as quoted by Peter, but when the argument it itself some computation-intensive expression, using the macro form forces the evaluation of that argument twice (which may not necessarely evaluate to the same value btw) vs. only once for the function.
I have to admit that I used to code such macros for quick prototyping of apparently simple functions, but the time those make me lose over the years finalyl changed my mind !

In C++, which is faster? (2 * i + 1) or (i << 1 | 1)?

I realize that the answer is probably hardware specific, but I'm curious if there was a more general intuition that I'm missing?
I asked this question & given the answer, now I'm wondering if I should alter my approach in general to use "(i << 1|1)" instead of "(2*i + 1)"??
Since the ISO standard doesn't actually mandate performance requirements, this will depend on the implementation, the compiler flags chosen, the target CPU and quite possibly the phase of the moon.
These sort of optimisations (saving a couple of cycles) almost always pale into insignificance in terms of return on investment, against macro-level optimisations like algorithm selection.
Aim for readability of code first and foremost. If your intent is to shift bits and OR, use the bit-shift version. If your intent is to multiply, use the * version. Only worry about performance once you've established there's an issue.
Any decent compiler will optimise it far better than you can anyway :-)
Just an experiment regarding answers given about "... it'll use LEA":
The following code:
int main(int argc, char **argv)
{
#ifdef USE_SHIFTOR
return (argc << 1 | 1);
#else
return (2 * argc + 1);
#endif
}
will, with gcc -fomit-frame-pointer -O8 -m{32|64} (for 32bit or 64bit) compile into the following assembly code:
x86, 32bit:080483a0 <main>:
80483a0: 8b 44 24 04 mov 0x4(%esp),%eax
80483a4: 8d 44 00 01 lea 0x1(%eax,%eax,1),%eax
80483a8: c3 ret
x86, 64bit:00000000004004c0 <main>:
4004c0: 8d 44 3f 01 lea 0x1(%rdi,%rdi,1),%eax
4004c4: c3 retq
x86, 64bit, -DUSE_SHIFTOR:080483a0 <main>:
80483a0: 8b 44 24 04 mov 0x4(%esp),%eax
80483a4: 01 c0 add %eax,%eax
80483a6: 83 c8 01 or $0x1,%eax
80483a9: c3 ret
x86, 32bit, -DUSE_SHIFTOR:00000000004004c0 <main>:
4004c0: 8d 04 3f lea (%rdi,%rdi,1),%eax
4004c3: 83 c8 01 or $0x1,%eax
4004c6: c3 retq
In fact, it's true that most cases will use LEA. Yet the code is not the same for the two cases. There are two reasons for that:
addition can overflow and wrap around, while bit operations like << or | cannot
(x + 1) == (x | 1) only is true if !(x & 1) else the addition carries over to the next bit. In general, adding one only results in having the lowest bit set in half of the cases.
While we (and the compiler, probably) know that the second is necessarily applicable, the first is still a possibility. The compiler therefore creates different code, since the "or-version" requires forcing bit zero to 1.
Any but the most brain-dead compiler will see those expressions as equivalent and compile them to the same executable code.
Typically it's not really worth worrying too much about optimizing simple arithmetic expressions like these, since it's the sort of thing compilers are best at optimizing. (Unlike many other cases in which a "smart compiler" could do the the right thing, but an actual compiler falls flat.)
This will work out to the same pair of instructions on PPC, Sparc, and MIPS, by the way: a shift followed by an add. On the ARM it'll cook down to a single fused shift-add instruction, and on x86 it'll probably be a single LEA op.
Output of gcc with the -S option (no compiler flags given):
.LCFI3:
movl 8(%ebp), %eax
addl %eax, %eax
orl $1, %eax
popl %ebp
ret
.LCFI1:
movl 8(%ebp), %eax
addl %eax, %eax
addl $1, %eax
popl %ebp
ret
I'm not sure which one is which, but I don't believe it matters.
If the compiler does no optimizations at all, then the second would probably translate to faster assembly instructions. How long each instruction takes is completely architecture-dependent. Most compilers will optimize them to be the same assembly-level instructions.
I just tested this with gcc-4.7.1 using the source of FrankH, the code generated is
lea 0x1(%rdi,%rdi,1),%eax
retq
no matter if the shift or the multiplication version is used.
Nobody cares. Nor should they.
Quit worrying about that and get your code correct, simple, and done.
i + i + 1 may be faster than other two,
because addition is faster than multiplication and can be faster than shift.
The faster is the first form (the one with the shift right), in fact the shr instruction takes 4 clock cycles to complete in the worst case, while the mul 10 in the best case. However, the best form should be decided by the compiler because it has a complete view of the others (assembly) instructions.