Does one still need to use -fPIC when compiling with GCC?

On gcc target machines, when one wanted to compile a shared library, one needed to specify -fpic or -fPIC to get things to work correctly. This is because absolute addressing was used by default, which is suitable for executables that have full control of their own address space, but not for shared libraries, which can be loaded anywhere in an executable's address space.
However, modern kernels implement address space layout randomization, and many modern architectures support PC-relative addressing. This all seems to make absolute addressing either unusable (address space randomization) or unneeded (PC-relative addressing).
I have also noticed that clang does not have an -fPIC option which makes me think that it is no longer necessary.
So is -fPIC now redundant or does one need to generate separate .o files, one for static library use, and one for shared library use?

You still need to compile with -fPIC. The problem isn't solvable with PC-relative addressing. The problem is how you resolve external symbols. In a dynamically linked program the resolution follows different rules, and especially with address space randomization it can't be done at link time.
And clang does have the -fPIC flag just like gcc.
$ cat > foo.c
void foo(void);
void bar(void) { foo(); }
$ gcc -S foo.c && grep call.*foo foo.s
call foo
$ gcc -fPIC -S foo.c && grep call.*foo foo.s
call foo@PLT
$ clang -S foo.c && grep call.*foo foo.s
callq foo
$ clang -fPIC -S foo.c && grep call.*foo foo.s
callq foo@PLT
$

It depends on the target. Some targets (like x86_64) are position independent by default, so -fpic is a no-op and has no effect on the generated code. In those cases you can omit it and nothing changes. Other targets (like 32-bit x86) are not position independent by default, so on those machines, if you omit -fpic for the executable, it disables ASLR for that image file (but not for the shared libraries it uses).
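A quick way to check what a particular GCC build defaults to is to look at the predefined macros (a sketch; the values shown are what a distro GCC configured with --enable-default-pie prints, and may differ on your system):
$ echo | gcc -dM -E -xc - | grep -E '__PIC__|__PIE__'
#define __PIC__ 2
#define __PIE__ 2
$ echo | gcc -fno-pie -dM -E -xc - | grep -E '__PIC__|__PIE__'   # prints nothing: position-dependent code-gen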

I agree with you: in many cases the -fpic/-fPIC options are almost redundant. I do, however, use them to ensure:
portability (never sure what particular OS/kernel will be available)
backwards compatibility: with those options it ensures the behaviour you want on older kernels
habit - hard to break :)
compliance with older codebases that may require it

You never needed to generate separate .o files. Always specify the compiler options to generate portable code (typically -fPIC).
On some systems, the compiler may be configured to force this option on, or set it by default. But it doesn't hurt to specify it anyway.
Note: one hopes that where PC-relative addressing is supported and performs well, -fPIC uses that mode rather than dedicating an extra register.
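For instance, on x86-64 a -fPIC access to file-local data is just a RIP-relative load, with no register permanently reserved for a GOT pointer (a sketch; output from one GCC version, details vary):
$ cat > pic.c
static int counter;
int get_counter(void) { return counter; }
$ gcc -O2 -fPIC -S pic.c && grep 'counter(%rip)' pic.s
        movl    counter(%rip), %eax
Contrast with 32-bit x86, where -fPIC traditionally ties up %ebx as the GOT pointer.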

gcc targets a lot of platforms and architectures, and not all of them natively support PIC the way the x86 architecture does. In some cases, creating PIC means additional overhead, which may be undesirable; whether you want or need it depends on your project and the platform you are targeting.

Related

What does DT_TEXTREL mean and how to solve? [duplicate]

64 bit Linux uses the small memory model by default, which puts all code and static data below the 2GB address limit. This makes sure that you can use 32-bit absolute addresses. Older versions of gcc use 32-bit absolute addresses for static arrays in order to save an extra instruction for relative address calculation. However, this no longer works. If I try to make a 32-bit absolute address in assembly, I get the linker error:
"relocation R_X86_64_32S against `.data' can not be used when making a shared object; recompile with -fPIC".
This error message is misleading, of course, because I am not making a shared object and -fPIC doesn't help.
What I have found out so far is this: gcc version 4.8.5 uses 32-bit absolute addresses for static arrays; gcc version 6.3.0 doesn't. Version 5 probably doesn't either. The linker in binutils 2.24 allows 32-bit absolute addresses; version 2.28 does not.
The consequence of this change is that old libraries have to be recompiled and legacy assembly code is broken.
Now I want to ask: When was this change made? Is it documented somewhere? And is there a linker option that makes it accept 32-bit absolute addresses?
Your distro configured gcc with --enable-default-pie, so it's making position-independent executables by default, (allowing for ASLR of the executable as well as libraries). Most distros are doing that, these days.
You actually are making a shared object: PIE executables are sort of a hack using a shared object with an entry-point. The dynamic linker already supported this, and ASLR is nice for security, so this was the easiest way to implement ASLR for executables.
32-bit absolute relocations aren't allowed in an ELF shared object; allowing them would prevent it from being loaded anywhere outside the low 2GiB (for sign-extended 32-bit addresses). 64-bit absolute addresses are allowed, but generally you only want that for jump tables or other static data, not as part of instructions.1
The recompile with -fPIC part of the error message is bogus for hand-written asm; it's written for the case of people compiling with gcc -c and then trying to link with gcc -shared -o foo.so *.o, with a gcc where -fPIE is not the default. The error message should probably change because many people are running into this error when linking hand-written asm.
How to use RIP-relative addressing: basics
Always use RIP-relative addressing for simple cases where there's no downside. See also footnote 1 below and this answer for syntax. Only consider using absolute addressing when it's actually helpful for code-size instead of harmful. e.g. NASM default rel at the top of your file.
AT&T foo(%rip) or in GAS .intel_syntax noprefix use [rip + foo].
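A minimal hand-written example (GAS, Intel syntax; the file and symbol names are made up) showing that a RIP-relative reference to your own data links fine even into a shared object or PIE:
$ cat > rip.s
.intel_syntax noprefix
        .data
msg:    .asciz "hello"
        .text
        .globl get_msg
get_msg:
        lea     rax, [rip + msg]     # AT&T equivalent: leaq msg(%rip), %rax
        ret
$ gcc -c rip.s && gcc -shared -o librip.so rip.o && echo OK
OK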
Disable PIE mode to make 32-bit absolute addressing work
Use gcc -fno-pie -no-pie to override this back to the old behaviour. -no-pie is the linker option, -fno-pie is the code-gen option. With only -fno-pie, gcc will make code like mov eax, offset .LC0 that doesn't link with the still-enabled -pie.
(clang can have PIE enabled by default, too: use clang -fno-pie -nopie. A July 2017 patch made -no-pie an alias for -nopie, for compat with gcc, but clang 4.0.1 doesn't have it.)
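For example, on a distro where GCC defaults to PIE you would typically see something like this (hello.c is any trivial program; the exact error text depends on the binutils version):
$ gcc -O2 -fno-pie -no-pie hello.c -o hello   # works: position-dependent code-gen and link
$ gcc -O2 -fno-pie hello.c -o hello           # usually fails: absolute addresses in the code, but gcc still passes -pie to the linker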
Performance cost of PIE for 64-bit (minor) or 32-bit code (major)
With only -no-pie (but still -fpie), compiler-generated code (from C or C++ sources) will be slightly slower and larger than necessary, but will still be linked into a position-dependent executable which won't benefit from ASLR. "Too much PIE is bad for performance" reports an average slowdown of 3% for x86-64 on SPEC CPU2006 (I don't have a copy of the paper so IDK what hardware that was on :/). But in 32-bit code, the average slowdown is 10%, worst-case 25% (on SPEC CPU2006).
The penalty for PIE executables is mostly for stuff like indexing static arrays, as Agner describes in the question, where using a static address as a 32-bit immediate or as part of a [disp32 + index*4] addressing mode saves instructions and registers vs. a RIP-relative LEA to get an address into a register. Also 5-byte mov r32, imm32 instead of 7-byte lea r64, [rel symbol] for getting a static address into a register is nice for passing the address of a string literal or other static data to a function.
-fPIE still assumes no symbol-interposition for global variables / functions, unlike -fPIC for shared libraries which have to go through the GOT to access globals (which is yet another reason to use static for any variables that can be limited to file scope instead of global). See The sorry state of dynamic libraries on Linux.
Thus -fPIE is much less bad than -fPIC for 64-bit code, but still bad for 32-bit because RIP-relative addressing isn't available. See some examples on the Godbolt compiler explorer. On average, -fPIE has a very small performance / code-size downside in 64-bit code. The worst case for a specific loop might only be a few %. But 32-bit PIE can be much worse.
None of these -f code-gen options make any difference when just linking, or when assembling hand-written .S asm. gcc -fno-pie -no-pie -O3 main.c nasm_output.o is a case where you want both options.
Checking your GCC config
If your GCC was configured this way, gcc -v |& grep -o -e '[^ ]*pie' prints --enable-default-pie. Support for this config option was added to gcc in early 2015. Ubuntu enabled it in 16.10, and Debian around the same time in gcc 6.2.0-7 (leading to kernel build errors: https://lkml.org/lkml/2016/10/21/904).
Related: Build compressed x86 kernels as PIE was also affected by the changed default.
Why doesn't Linux randomize the address of the executable code segment? is an older question about why it wasn't the default earlier, or was only enabled for a few packages on older Ubuntu before it was enabled across the board.
Note that ld itself didn't change its default. It still works normally (at least on Arch Linux with binutils 2.28). The change is that gcc defaults to passing -pie as a linker option, unless you explicitly use -static or -no-pie.
In a NASM source file, I used a32 mov eax, [abs buf] to get an absolute address. (I was testing if the 6-byte way to encode small absolute addresses (address-size + mov eax,moffs: 67 a1 40 f1 60 00) has an LCP stall on Intel CPUs. It does.)
nasm -felf64 -Worphan-labels -g -Fdwarf testloop.asm &&
ld -o testloop testloop.o # works: static executable
gcc -v -nostdlib testloop.o # doesn't work
...
..../collect2 ... -pie ...
/usr/bin/ld: testloop.o: relocation R_X86_64_32 against `.bss' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
gcc -v -no-pie -nostdlib testloop.o # works
gcc -v -static -nostdlib testloop.o # also works: -static implies -no-pie
GCC can also make a "static PIE" with -static-pie; ASLRed but with no dynamic libraries or ELF interpreter. It is not the same thing as -static -pie - those conflict with each other (you get a static non-PIE), although that might possibly change.
related: building static / dynamic executables with/without libc, defining _start or main.
Checking if an existing executable is PIE or not
This has also been asked at: How to test whether a Linux binary was compiled as position independent code?
file and readelf say that PIEs are "shared objects", not ELF executables. ELF-type EXEC can't be PIE.
$ gcc -fno-pie -no-pie -O3 hello.c
$ file a.out
a.out: ELF 64-bit LSB executable, ...
$ gcc -O3 hello.c
$ file a.out
a.out: ELF 64-bit LSB shared object, ...
## Or with a more recent version of file:
a.out: ELF 64-bit LSB pie executable, ...
gcc -static-pie is a special thing that GCC doesn't do by default, even with -nostdlib. It shows up as LSB pie executable, dynamically linked with current versions of file. (See What's the difference between "statically linked" and "not a dynamic executable" from Linux ldd?). It has ELF-type DYN, but readelf shows no .interp, and ldd will tell you it's statically linked. GDB starti and /proc/maps confirms that execution starts at the top of its _start, not in an ELF interpreter.
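A rough way to poke at a static-PIE yourself (a sketch; assumes a GCC and glibc new enough to support -static-pie):
$ gcc -static-pie -O2 hello.c -o hello_spie
$ readelf -l hello_spie | grep INTERP    # prints nothing: no ELF interpreter
$ readelf -h hello_spie | grep Type      # ELF type is DYN
$ ldd hello_spie                         # reported as statically linked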
Semi-related (but not really): another recent gcc feature is gcc -fno-plt. Finally, calls into shared libraries can be just call [rip + symbol@GOTPCREL] (AT&T call *puts@GOTPCREL(%rip)), with no PLT trampoline.
The NASM version of this is call [rel puts wrt ..got]
as an alternative to call puts wrt ..plt. See Can't call C standard library function on 64-bit Linux from assembly (yasm) code. This works in a PIE or non-PIE, and avoids having the linker build a PLT stub for you.
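To see -fno-plt in action from C (a sketch; the exact instructions depend on the GCC version):
$ cat > call.c
#include <stdio.h>
void hi(void) { puts("one"); puts("two"); }
$ gcc -O2 -fPIC -fno-plt -S call.c && grep puts call.s
        call    *puts@GOTPCREL(%rip)
        jmp     *puts@GOTPCREL(%rip)
The first call goes straight through the GOT with no PLT stub; the second becomes a tail-call.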
Some distros have started enabling it. It also avoids needing writeable + executable memory pages, so it's good for security against code-injection. (I think modern PLT implementations don't need that either, just updating a GOT pointer rather than rewriting a jmp rel32 instruction, so there might not be a security difference.)
It's a significant speedup for programs that make a lot of shared-library calls, e.g. x86-64 clang -O2 -g compiling tramp3d goes from 41.6s to 36.8s on whatever hardware the patch author tested on. (clang is maybe a worst-case scenario for shared library calls, making lots of calls to small LLVM library functions.)
It does require early binding instead of lazy dynamic linking, so it's slower for big programs that exit right away. (e.g. clang --version or compiling hello.c). This slowdown could be reduced with prelink, apparently.
This doesn't remove the GOT overhead for external variables in shared library PIC code, though. (See the godbolt link above).
Footnote 1
64-bit absolute addresses actually are allowed in Linux ELF shared objects, with text relocations to allow loading at different addresses (ASLR and shared libraries). This allows you to have jump tables in section .rodata, or static const int *foo = &bar; without a runtime initializer.
So mov rdi, qword msg works (NASM/YASM syntax for 10-byte mov r64, imm64, aka AT&T syntax movabs, the only instruction which can use a 64-bit immediate). But that's larger and usually slower than lea rdi, [rel msg], which is what you should use if you decide not to disable -pie. A 64-bit immediate is slower to fetch from the uop cache on Sandybridge-family CPUs, according to Agner Fog's microarch pdf. (Yes, the same person who asked this question. :)
You can use NASM's default rel instead of specifying it in every [rel symbol] addressing mode. See also Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array for some more description of avoiding 32-bit absolute addressing. OS X can't use 32-bit addresses at all, so RIP-relative addressing is the best way there, too.
In position-dependent code (-no-pie), you should use mov edi, msg when you want an address in a register; 5-byte mov r32, imm32 is even smaller than RIP-relative LEA, and more execution ports can run it.

-fPIC ignored for target (all code is position independent), useless warning

When I compile my library I have switched on -fPIC because I want to be able to compile it as a shared and as a static library.
Using gcc 3.4.4 on cygwin I get this warning on all source files:
-fPIC ignored for target (all code is position independent)
And I really wonder what's the point of it. It tells me that I use a switch which has no effect because what the switch should achieve is already accomplished. Well, it means it's redundant, fine. But what's the point of it and how can I suppress it?
I'm not talking about why using PIC or not, just why it generates that IMO useless warning.
and how can I suppress it?
Not only a useless warning, but a distraction that makes it difficult to follow other warnings and errors.
Given my make output consistently showed 3 related lines, I opted to filter out the 3 "useless" lines using the following:
make 2>&1 | sed '/PIC ignored/{N;N;d;}'
I realize this is not the ideal way to suppress the noise, but maybe it will help to some degree. Be advised that I'm cutting 3 lines where other situations may only require removal of just one line. Note that I'm also routing stderr to stdout.
Here's a snippet of make output without the sed filter:
libavcodec/x86/mlpdsp.c:51:36: warning: taking address of expression of type 'void'
&ff_mlp_iirorder_4 };
^
CC libavcodec/x86/motion_est_mmx.o
libavcodec/x86/motion_est_mmx.c:1:0: warning: -fPIC ignored for target (all code is position independent)
/* */
^
CC libavcodec/x86/mpegaudiodec_mmx.o
libavcodec/x86/mpegaudiodec_mmx.c:1:0: warning: -fPIC ignored for target (all code is position independent)
/* */
^
CC libavcodec/x86/mpegvideo_mmx.o
libavcodec/x86/mpegvideo_mmx.c:1:0: warning: -fPIC ignored for target (all code is position independent)
/* */
^
And the same with the sed filter:
^
libavcodec/x86/mlpdsp.c:51:36: warning: taking address of expression of type 'void'
&ff_mlp_iirorder_4 };
^
CC libavcodec/x86/motion_est_mmx.o
CC libavcodec/x86/mpegaudiodec_mmx.o
CC libavcodec/x86/mpegvideo_mmx.o
CC libavcodec/x86/proresdsp-init.o
Personally, I'd just add os detection to the makefile. Something along the lines of
TARGET_TRIPLE := $(subst -, ,$(shell $(CC) -dumpmachine))
TARGET_ARCH := $(word 1,$(TARGET_TRIPLE))
TARGET_OS := $(word 3,$(TARGET_TRIPLE))
ifeq ($(TARGET_OS),mingw32)
else ifeq ($(TARGET_OS),cygwin)
else
CFLAGS += -fPIC
endif
And I really wonder what's the point of it... I'm not talking about why using PIC or not, just why it generates that IMO useless warning.
That's a good question, and I have not seen a definitive answer. At least one of the GCC devs considers it a pointless warning. Paolo Bonzini called it that in his recent patch Remove pointless -fPIC warning on Windows platforms.
According to Jonathan Wakely on the GCC mailing list at How to suppress "warning: -fPIC ignored for target..." under Cygwin (August 2015):
That warning has been there since well before 2003 (I couldn't be
bothered tracing the history back past a file rename in 2003).
And from Alexander Monakov on the same thread (referencing Bonzini's patch):
A patch was proposed just recently to just remove the warning:
https://gcc.gnu.org/ml/gcc-patches/2015-08/msg00836.html
Related: Windows has /ASLR, which is address space layout randomization. It's optional, but often required as a security gate, meaning all program code must be compiled with it. If you have an SDLC, then you are probably using /ASLR because Microsoft calls it out as a best practice in Writing Secure Code.
The Linux/Unix equivalent of /ASLR is -fPIE for executables.
Under Windows, all DLL code is relocatable. Under Linux/Unix, shared object code can be made relocatable with -fPIC.
-fPIC is a "superset" of -fPIE (some hand waiving). That means -fPIC can be used anywhere you would use -fPIE (but not vice-versa).
The switch does have an effect on Linux (on Windows/Cygwin it would do nothing; perhaps the compiler did not add a platform-specific check here). Code generated with -fPIC is position independent, which means every instruction that refers to a fixed address has to go through an indirection to a memory location that is filled in by the dynamic loader; the result is slightly slower code that takes more time to load. You don't need that for a static library, where all addresses are set by the linker when the executable is created/linked.
The warning probably means that the code of the static library is not quite as fast as you might expect it to be. You can create two object files of the same name in different directories, one with -fPIC for the shared library and the other without it for the static library.
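If you do want both flavours, a common layout is to keep the two sets of objects in separate directories, e.g. (paths and names here are just an illustration):
$ mkdir -p build/static build/shared
$ gcc -O2       -c foo.c -o build/static/foo.o   # position-dependent object for the static lib
$ gcc -O2 -fPIC -c foo.c -o build/shared/foo.o   # PIC object for the shared lib
$ ar rcs libfoo.a build/static/foo.o
$ gcc -shared -o libfoo.so build/shared/foo.o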

g++ Optimization Flags: -fuse-linker-plugin vs -fwhole-program

I am reading:
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
It first suggests:
In combination with -flto using this option (-fwhole-program) should not be used. Instead relying on a linker plugin should provide safer and more precise information.
And then, it suggests:
If the program does not require any symbols to be exported, it is possible to combine -flto and -fwhole-program to allow the interprocedural optimizers to use more aggressive assumptions which may lead to improved optimization opportunities. Use of -fwhole-program is not needed when linker plugin is active (see -fuse-linker-plugin).
Does it mean that in theory, using -fuse-linker-plugin with -flto always gets a better optimized executable than using -fwhole-program with -flto?
I tried to use ld to link with -fuse-linker-plugin and -fwhole-program separately, and the executables' sizes at least are different.
P.S. I am using gcc 4.6.2, and ld 2.21.53.0.1 on CentOS 6.
UPDATE: See @PeterCordes' comment below. Essentially, -fuse-linker-plugin is no longer necessary.
These differences are subtle. First, understand what -flto actually does. It essentially creates an output that can be optimized later (at "link-time").
What -fwhole-program does is assumes "that the current compilation unit represents the whole program being compiled" whether or not that is actually the case. Therefore, GCC will assume that it knows all of the places that call a particular function. As it says, it might use more aggressive inter-procedural optimizers. I'll explain that in a bit.
Lastly, what -fuse-linker-plugin does is actually perform, at link time, the optimizations that would normally be done as each compilation unit is compiled. So this one is designed to pair with -flto: -flto means save enough information to do optimizations later, and -fuse-linker-plugin means actually do those optimizations.
So, where do they differ? Well, as the GCC doc suggests, there is no advantage in principle to using -fwhole-program, because that option assumes something that you then have to ensure is true. To break it, simply define a function in one .cpp file and use it in another. You will get a linker error.
Is there any advantage to -fwhole-program? Well, if you only have one compilation unit then you can use it, but honestly, it won't be any better. I was able to get different sized executables by using equivalent programs, but when checking the actual generated machine code, they were identical. In fact, the only differences that I saw were that line numbers with debugging information were different.
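Putting the flags side by side, a minimal sketch (a.c and b.c are hypothetical sources):
$ gcc -O2 -flto -c a.c b.c                           # save intermediate representation for link time
$ gcc -O2 -flto -fuse-linker-plugin a.o b.o -o prog  # the linker plugin tells the LTO pass what is really referenced
$ gcc -O2 -flto -fwhole-program a.o b.o -o prog2     # you assert that nothing outside these files needs any symbol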

How to remove unused C/C++ symbols with GCC and ld?

I need to optimize the size of my executable severely (ARM development) and
I noticed that in my current build scheme (gcc + ld) unused symbols are not getting stripped.
Using arm-strip --strip-unneeded on the resulting executables / libraries doesn't change the output size of the executable (I have no idea why; maybe it simply can't).
What would be the way (if it exists) to modify my building pipeline, so that the unused symbols are stripped from the resulting file?
I wouldn't even think of this, but my current embedded environment isn't very "powerful" and
saving even 500K out of 2M results in a very nice loading performance boost.
Update:
Unfortunately the current gcc version I use doesn't have the -dead-strip option and the -ffunction-sections... + --gc-sections for ld doesn't give any significant difference for the resulting output.
I'm shocked that this even became a problem, because I was sure that gcc + ld should automatically strip unused symbols (why do they even have to keep them?).
For GCC, this is accomplished in two stages:
First compile the data but tell the compiler to separate the code into separate sections within the translation unit. This will be done for functions, classes, and external variables by using the following two compiler flags:
-fdata-sections -ffunction-sections
Link the translation units together using the linker optimization flag (this causes the linker to discard unreferenced sections):
-Wl,--gc-sections
So if you had one file called test.cpp that had two functions declared in it, but one of them was unused, you could omit the unused one with the following command to gcc(g++):
gcc -Os -fdata-sections -ffunction-sections test.cpp -o test -Wl,--gc-sections
(Note that -Os is an additional compiler flag that tells GCC to optimize for size)
If this thread is to be believed, you need to supply -ffunction-sections and -fdata-sections to gcc, which will put each function and data object in its own section. Then you give --gc-sections to GNU ld to remove the unused sections.
You'll want to check your docs for your version of gcc & ld:
However for me (OS X gcc 4.0.1) I find these for ld
-dead_strip
Remove functions and data that are unreachable by the entry point or exported symbols.
-dead_strip_dylibs
Remove dylibs that are unreachable by the entry point or exported symbols. That is, it suppresses the generation of load commands for dylibs which supplied no symbols during the link. This option should not be used when linking against a dylib which is required at runtime for some indirect reason, such as the dylib having an important initializer.
And this helpful option
-why_live symbol_name
Logs a chain of references to symbol_name. Only applicable with -dead_strip. It can help debug why something that you think should be dead strip removed is not removed.
There's also a note in the gcc/g++ man that certain kinds of dead code elimination are only performed if optimization is enabled when compiling.
While these options/conditions may not hold for your compiler, I suggest you look for something similar in your docs.
Programming habits can help too; e.g. add static to functions that are not accessed outside a specific file; use shorter names for symbols (can help a bit, likely not too much); use const char x[] where possible; ... This paper, though it talks about dynamic shared objects, contains suggestions that, if followed, can help make your final binary output smaller (if your target is ELF).
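A tiny illustration of those habits (hypothetical file):
$ cat > helpers.c
static int double_it(int x) { return 2 * x; }  /* internal linkage: can be inlined or dropped, nothing exported */
const char banner[] = "v1.0";                  /* array, not pointer: no extra pointer object or relocation */
$ gcc -Os -c helpers.c && nm helpers.o          # double_it does not appear as a global symbol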
The answer is -flto. You have to pass it to both your compilation and link steps, otherwise it doesn't do anything.
It actually works very well - reduced the size of a microcontroller program I wrote to less than 50% of its previous size!
Unfortunately it did seem a bit buggy - I had instances of things not being built correctly. It may have been due to the build system I'm using (QBS; it's very new), but in any case I'd recommend you only enable it for your final build if possible, and test that build thoroughly.
While not strictly about symbols, if going for size - always compile with -Os and -s flags. -Os optimizes the resulting code for minimum executable size and -s removes the symbol table and relocation information from the executable.
Sometimes - if small size is desired - playing around with different optimization flags may - or may not - have significance. For example toggling -ffast-math and/or -fomit-frame-pointer may at times save you even dozens of bytes.
It seems to me that the answer provided by Nemo is the correct one. If those instructions do not work, the issue may be related to the version of gcc/ld you're using. As an exercise I compiled an example program using the instructions detailed here:
#include <stdio.h>
void deadcode() { printf("This is d dead codez\n"); }
int main(void) { printf("This is main\n"); return 0 ; }
Then I compiled the code using progressively more aggressive dead-code removal switches:
gcc -Os test.c -o test.elf
gcc -Os -fdata-sections -ffunction-sections test.c -o test.elf -Wl,--gc-sections
gcc -Os -fdata-sections -ffunction-sections test.c -o test.elf -Wl,--gc-sections -Wl,--strip-all
These compilation and linking parameters produced executables of size 8457, 8164 and 6160 bytes, respectively, the most substantial contribution coming from the 'strip-all' declaration. If you cannot produce similar reductions on your platform, then maybe your version of gcc does not support this functionality. I'm using gcc (4.5.2-8ubuntu4) and ld (2.21.0.20110327) on Linux Mint, 2.6.38-8-generic x86_64.
strip --strip-unneeded only operates on the symbol table of your executable. It doesn't actually remove any executable code.
The standard libraries achieve the result you're after by splitting all of their functions into separate object files, which are combined using ar. If you then link the resulting archive as a library (i.e. give the option -l your_library to ld) then ld will only include the object files, and therefore the symbols, that are actually used.
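Splitting your own code the same way helps when linking statically, roughly like this (file names made up):
$ gcc -c f1.c f2.c f3.c                  # one object file per module
$ ar rcs libmine.a f1.o f2.o f3.o
$ gcc main.c -L. -lmine -o app           # ld pulls in only the archive members whose symbols are actually referenced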
You may also find some of the responses to this similar question of use.
I don't know if this will help with your current predicament as this is a recent feature, but you can specify the visibility of symbols in a global manner. Passing -fvisibility=hidden -fvisibility-inlines-hidden at compilation can help the linker to later get rid of unneeded symbols. If you're producing an executable (as opposed to a shared library) there's nothing more to do.
More information (and a fine-grained approach for e.g. libraries) is available on the GCC wiki.
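A rough sketch of the global-visibility approach (names invented for illustration):
$ cat > api.c
__attribute__((visibility("default")))
int api_entry(void) { return 42; }       /* explicitly kept visible */
int internal_helper(void) { return 7; }  /* hidden by -fvisibility=hidden */
$ gcc -O2 -fPIC -fvisibility=hidden -shared api.c -o libapi.so
$ nm -D --defined-only libapi.so         # only api_entry is listed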
From the GCC 4.2.1 manual, section -fwhole-program:
Assume that the current compilation unit represents the whole program being compiled. All public functions and variables, with the exception of main and those merged by attribute externally_visible, become static functions and in effect are optimized more aggressively by interprocedural optimizers. While this option is equivalent to proper use of the static keyword for programs consisting of a single file, in combination with the --combine option this flag can be used to compile most smaller-scale C programs, since the functions and variables become local to the whole combined compilation unit, not to the single source file itself.
You can use the strip binary on an object file (e.g. an executable) to strip all symbols from it.
Note: it modifies the file in place and doesn't create a copy.

How to build in release mode with optimizations in GCC?

What are the specific options I would need to build in "release mode" with full optimizations in GCC? If there are more than one option, please list them all. Thanks.
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
There is no 'one size fits all' - you need to understand your application, your requirements and the optimisation flags to determine the correct subset for your binary.
Or the answer you want:
-O3
Here is a part from a Makefile that I use regularly (in this example, it's trying to build a program named foo).
If you run it like $ make BUILD=debug or $ make debug
then the Debug CFLAGS will be used. These turn off optimization (-O0) and includes debugging symbols (-g).
If you omit these flags (by running $ make without any additional parameters), you'll build the Release CFLAGS version where optimization is turned on (-O2), debugging symbols stripped (-s) and assertions disabled (-DNDEBUG).
As others have suggested, you can experiment with different -O* settings depending on your specific needs.
ifeq ($(BUILD),debug)
# "Debug" build - no optimization, and debugging symbols
CFLAGS += -O0 -g
else
# "Release" build - optimization, and no debug symbols
CFLAGS += -O2 -s -DNDEBUG
endif
all: foo
debug:
	make "BUILD=debug"
foo: foo.o
# The rest of the makefile comes here...
Note that gcc doesn't have a "release mode" and a "debug mode" like MSVC does. All code is just code. The presence of the various optimization options (-O2 and -Os are the only ones you generally need to care about unless you're doing very fine tuning) modifies the generated code, but not in a way to prevent interoperability with other ABI-compliant code. Generally you want optimization on stuff you want to release.
The presence of the "-g" option will cause extended symbol and source code information to be placed in the generated files, which is useful for debugging but increases the size of the file (and reveals your source code), which is something you often don't want in "released" binaries.
But they're not exclusive. You can have a binary compiled with optimization and debug info, or one with neither.
-O2 will turn on all optimizations that don't require a space/speed trade-off and tends to be the one I see used most often. -O3 makes some space-for-speed trade-offs (like function inlining). -Os enables the -O2 optimizations that don't increase code size, plus other changes to reduce code size. This can make things faster than -O3 by improving cache use (test to find out if it works for you). Note there are a large number of options that none of the -O switches touch. The reason they are left out is that they often depend on what kind of code you are writing or are very architecture dependent.
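As a concrete starting point (common choices, not a universal "release mode"):
$ gcc -O2 -DNDEBUG -s main.c -o app        # typical release build: optimize, disable asserts, strip symbols
$ gcc -Os -DNDEBUG -s main.c -o app_small  # when binary size matters more than speed
$ gcc -O2 -g main.c -o app_debuggable      # optimized but with debug info, for profiling or post-mortem debugging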