LLVM bytecode linked with ld.lld gives segmentation fault - llvm

I wrote a simple C program:
example.c:
int main() {
return 0;
}
Then converted it to .ll by using
clang -S -emit-llvm example.c
Which generated an example.ll file that looks like this:
; ModuleID = 'example.c'
source_filename = "example.c"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"
; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @main() #0 {
%1 = alloca i32, align 4
store i32 0, i32* %1, align 4
ret i32 0
}
attributes #0 = { noinline nounwind optnone uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }
!llvm.module.flags = !{!0}
!llvm.ident = !{!1}
!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 8.0.0-3 (tags/RELEASE_800/final)"}
Then I converted .ll file to .o by using:
llc -filetype=obj example.ll
And then I tried to link that file to make it executable by using:
ld.lld example.o -o example -e main
Which created an executable ./example.
Running the example yields a segmentation fault:
29185 segmentation fault (core dumped) ./example
objdump of example.o looks like this:
example.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
b: 31 c0 xor %eax,%eax
d: 5d pop %rbp
e: c3 retq
And the executable looks like this:
example: file format elf64-x86-64
Disassembly of section .text:
0000000000201000 <main>:
201000: 55 push %rbp
201001: 48 89 e5 mov %rsp,%rbp
201004: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
20100b: 31 c0 xor %eax,%eax
20100d: 5d pop %rbp
20100e: c3 retq
I also tried linking the object file with ld, but that didn't work either. Am I missing something? How can I make an LLVM object file executable? Please note that none of the commands yielded any errors or warnings.

Well, this is not how you are supposed to link the executable. For one, the entry point is supposed to be named "_start", and so on; you're missing a bunch of runtime initialization objects / libraries here.
Either link with clang (so, clang example.ll or clang example.o), or pass -v to the clang invocation to obtain the proper linker command line.

Related

Clobber X86 register by modifying LLVM Backend

I am trying to alter the LLVM backend for the X86 target a little bit, to produce some desired behaviour.
More specifically, I would like to emulate a flag like gcc's -fcall-used-reg option, which instructs the compiler to convert a callee-saved register into a clobbered register (meaning that it may be altered during a function call).
Let's focus on r14. I manually clobber the register, like in this answer:
#include <inttypes.h>
uint64_t inc(uint64_t i) {
__asm__ __volatile__(
""
: "+m" (i)
:
: "r14"
);
return i + 1;
}
int main(int argc, char **argv) {
(void)argv;
return inc(argc);
}
Compile and disassemble:
gcc -std=gnu99 -O3 -ggdb3 -Wall -Wextra -pedantic -o main.out main.c
objdump -d main.out
Disassembly contains:
0000000000001150 <inc>:
1150: 41 56 push %r14
1152: 48 89 7c 24 f8 mov %rdi,-0x8(%rsp)
1157: 48 8b 44 24 f8 mov -0x8(%rsp),%rax
115c: 41 5e pop %r14
115e: 48 83 c0 01 add $0x1,%rax
1162: c3 retq
1163: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
116a: 00 00 00
116d: 0f 1f 00 nopl (%rax)
where we can see that r14, because it is tampered with, is pushed to the stack, and then popped to regain its original value.
Now, repeat with the -fcall-used-r14 flag:
gcc -std=gnu99 -O3 -ggdb3 -fcall-used-r14 -Wall -Wextra -pedantic -o main.out main.c
objdump -d main.out
Disassembly contains:
0000000000001150 <inc>:
1150: 48 89 7c 24 f8 mov %rdi,-0x8(%rsp)
1155: 48 8b 44 24 f8 mov -0x8(%rsp),%rax
115a: 48 83 c0 01 add $0x1,%rax
115e: c3 retq
115f: 90 nop
where no push/pop happens.
Now, I have modified some LLVM Target files, compiled the source, and added(?) this functionality to the llc tool:
clang-11 -emit-llvm -S -c main.c -o main.ll
llc-11 main.ll -o main.s
Now, main.s contains:
# %bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
pushq %r14
.cfi_offset %r14, -24
movq %rdi, -16(%rbp)
#APP
#NO_APP
movq -16(%rbp), %rax
addq $1, %rax
popq %r14
popq %rbp
.cfi_def_cfa %rsp, 8
retq
Apparently, r14 is still callee-saved.
Inside llvm/lib/Target/X86/X86CallingConv.td I have modified the following lines (removing R14), because they seemed the only relevant to the System V ABI for Linux and C calling conventions that I was interested in:
def CSR_64 : CalleeSavedRegs<(add R12, R13, R15, RBP)>;
...
def CSR_64_MostRegs : CalleeSavedRegs<(add RBX, RCX, RDX, RSI, RDI, R8, R9, R10,
R11, R12, R13, R15, RBP,
...
def CSR_64_AllRegs_NoSSE : CalleeSavedRegs<(add RAX, RBX, RCX, RDX, RSI, RDI, R8, R9,
R10, R11, R12, R13, R15, RBP)>;
My questions are:
Is X86CallingConv.td the only file I should modify? I think yes, but maybe I'm wrong.
Am I focusing on the correct lines? Maybe this is more difficult to answer, but at least a direction could be helpful.
I am running LLVM 11 inside Debian 10.5.
EDIT:
Changing the line, removing R14 from the "hidden" definition:
def CSR_SysV64_RegCall_NoSSE : CalleeSavedRegs<(add RBX, RBP, RSP,
(sequence "R%u", 12, 13), R15)>;
as Margaret correctly pointed out, did not help either.
Turns out, the minimum modification was the line:
def CSR_64 : CalleeSavedRegs<(add RBX, R12, R13, R15, RBP)>;
The problem was with how I built the source.
Running cmake --build . again after the original installation did not update the llc tool globally (I thought it would, because I was building the default architecture - X86 - but that was irrelevant). So I was calling an unmodified llc-11 tool. When I instead ran:
/path/to/llvm-project/build/bin/llc main.ll -o main.s
main.s contained:
# %bb.0:
movq %rdi, -8(%rsp)
#APP
#NO_APP
movq -8(%rsp), %rax
addq $1, %rax
retq
which is what I wanted in the first place.

LLVM's llc generates seemingly incorrect code

I'm writing a function pass in LLVM which generates an IR file. The problem is that the assembled code does not behave as I expect. Since I'm pretty new to LLVM, I'd like to know whether I misunderstood the LLVM IR semantics or this is incorrect behavior of llc.
The LLVM IR is:
define void @fff(i32*) #0 {
%2 = alloca i32*, align 8
%3 = alloca i32, align 4
%4 = load i8*, i8** @dirty
br label %5
; <label>:5: ; preds = %1
store i32* %0, i32** %2, align 8
%6 = load i32*, i32** %2, align 8
%7 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str.4, i32 0, i32 0), i32* %6)
%8 = load i32*, i32** %2, align 8
%9 = load i32, i32* %8, align 4
%readDirty = load atomic i8, i8* %4 acquire, align 8
%10 = icmp eq i8 %readDirty, 1
br i1 %10, label %Restart, label %11, !prof !3
; <label>:11: ; preds = %5
store i32 %9, i32* %3, align 4
ret void
Restart: ; preds = %5
;EDIT: bug was here. Must include label %5 as a possible destination block
indirectbr i8* blockaddress(@fff, %5), []
}
This corresponds (roughly) to the following C code:
char *dirty=1;
void fff(int *head) ATTR{
restart:
printf("head = %p\n", head);
int r = *head;
if(*dirty)
goto restart; //But using indirect branch
}
Next I assemble, link and run using:
llc -filetype=obj simpleOut.ll -o out.o
gcc -o exe out.o
./exe
If I call the function with address 0x7ffeea51d7a8, it prints:
head = 0x7ffeea51d7a8
head = 0x2e889e825bf4005c
Segmentation fault: 11
The x86_64 assembly code is:
;head reside in rcx
100000d60: 55 pushq %rbp
100000d61: 48 89 e5 movq %rsp, %rbp
100000d64: 53 pushq %rbx
100000d65: 48 83 ec 18 subq $24, %rsp
100000d69: 48 89 f9 movq %rdi, %rcx
100000d6c: 48 8d 3d dd 02 00 00 leaq 733(%rip), %rdi
100000d73: ff 17 callq *(%rdi)
100000d75: 48 8b 18 movq (%rax), %rbx
100000d78: 48 8d 3d c0 01 00 00 leaq 448(%rip), %rdi
100000d7f: 48 89 4d f0 movq %rcx, -16(%rbp)
100000d83: 48 8b 75 f0 movq -16(%rbp), %rsi
100000d87: b0 00 movb $0, %al
100000d89: e8 62 01 00 00 callq 354 ;call to printf, corrupt rcx
100000d8e: 48 8b 45 f0 movq -16(%rbp), %rax
100000d92: 8b 00 movl (%rax), %eax
100000d94: 80 3b 01 cmpb $1, (%rbx)
100000d97: 74 0a je 10 <_fff+0x43>
100000d99: 89 45 ec movl %eax, -20(%rbp)
100000d9c: 48 83 c4 18 addq $24, %rsp
100000da0: 5b popq %rbx
100000da1: 5d popq %rbp
100000da2: c3 retq
100000da3: 48 8d 05 ce ff ff ff leaq -50(%rip), %rax
100000daa: ff e0 jmpq *%rax ;jumps to 100000d78
100000dac: 0f 1f 40 00 nopl (%rax)
The problem seems to be that the LLVM statement store i32* %0, i32** %2, align 8 translates to movq %rcx, -16(%rbp) even after the restart, where the register rcx was already corrupted by printf function.
If this seems like a bug I'll file a bug report with LLVM. Just wanted to check that I don't misunderstand the LLVM IR.
llc version is 5.0.0, installed via homebrew. gcc (used for linking) is clang-900.0.39.2.
Thanks
According to the documentation, the indirectbr instruction must be supplied with the list of all possible destination blocks. Omitting a basic block that is actually jumped to produces undefined behavior.

Understanding llvm-ir to assembly for x86_64-unknown-linux-gnu

I translate the following LLVM IR
; Function Attrs: noinline norecurse nounwind uwtable
define i32 @main() #0 {
entry:
%sub = sub nsw i32 5, 3
%cmp = icmp slt i32 %sub, 3
br i1 %cmp, label %if.then, label %if.else
if.then: ; preds = %entry
%mul = mul nsw i32 %sub, 2
br label %if.end
if.else: ; preds = %entry
%sub1 = sub nsw i32 %sub, 3
br label %if.end
if.end: ; preds = %if.else, %if.then
%y.0 = phi i32 [ %mul, %if.then ], [ %sub1, %if.else ]
%sub2 = sub nsw i32 %sub, %y.0
%add = add nsw i32 %sub, %y.0
ret i32 0
}
to assembly code for x86_64-unknown-linux-gnu
using llc sample.ll
generated assembly code:
.text
.file "phi.cpp"
.globl main # -- Begin function main
.p2align 4, 0x90
.type main,@function
main: # @main
.cfi_startproc
# BB#0: # %entry
pushq %rbp
.Lcfi0:
.cfi_def_cfa_offset 16
.Lcfi1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
.Lcfi2:
.cfi_def_cfa_register %rbp
xorl %eax, %eax
testb %al, %al
xorl %eax, %eax
popq %rbp
retq
.Lfunc_end0:
.size main, .Lfunc_end0-main
.cfi_endproc
# -- End function
In the above code, %rbp is the base pointer, which points to the base of the current stack frame, and %rsp is the stack pointer, which points to the top of the current stack frame. The operands are stored in %eax and %al for the arithmetic operations, but I can't find the instruction where a value is loaded into the %eax or %al register.
I also want to know
How llc handles phi nodes at the assembly level
llc defaults to -O2, and your code starts with a constant expression sub nsw i32 5, 3. Thus, your function does basically nothing, and the only thing LLVM has to keep is zeroing out EAX.
If you run llc -O0 your.ll, you'll get much more verbose code that performs spills to the stack and register loads.
BTW, there is a pair of passes called mem2reg and reg2mem that convert code back and forth between SSA form and memory form. Specifically, reg2mem eliminates phi nodes by introducing explicit stores and loads in the IR, and mem2reg promotes such loads and stores back into SSA values.

Can Using goto Create Optimizations that the Compiler Can't Generate in C++?

Can using gotos instead of loops result in a series of jump instructions that is more efficient than what the compiler would have generated if loops had been used instead?
For example: If I had a while loop nested inside a switch statement, which would be nested in another loop which would be nested inside of another switch case, could using goto actually outsmart the jump instructions the compiler generates when using just loops and no gotos?
It may be possible to gain a small speed advantage by using goto. However, the opposite may also be true. Compilers have become very good at detecting and unrolling loops, and at optimizing loops with SIMD instructions. You will most likely kill all those optimization options for the compiler, since they are not built to optimize goto statements.
You can also write functions to avoid gotos. This way you enable the compiler to inline the function and get rid of the jump.
If you are considering goto for optimization purposes, I would say that it is a very bad idea. Express your algorithm in clean code and optimize later. And if you need more performance, think about your data and how you access it. That is the point where you can gain or lose performance.
Since you wanted code to prove the point, I constructed the following example. I used gcc 6.3.1 with -O3 on an Intel i7-3537U. If you try to reproduce the example, your results may differ depending on your compiler or hardware.
#include <iostream>
#include <array>
#include "ToolBox/Instrumentation/Profiler.hpp"
constexpr size_t kilo = 1024;
constexpr size_t mega = kilo * 1024;
constexpr size_t size = 512*mega;
using container = std::array<char, size>;
enum class Measurements {
jump,
loop
};
// Simple vector addition using for loop
void sum(container& result, const container& data1, const container& data2) {
profile(Measurements::loop);
for(unsigned int i = 0; i < size; ++i) {
result[i] = data1[i] + data2[i];
}
}
// Simple vector addition using jumps
void sum_jump(container& result, const container& data1, const container& data2) {
profile(Measurements::jump);
unsigned int i = 0;
label:
result[i] = data1[i] + data2[i];
i++;
if(i == size) goto label;
}
int main() {
// This segment is just for benchmarking purposes
// Just ignore this
ToolBox::Instrumentation::Profiler<Measurements, std::chrono::nanoseconds, 2> profiler(
std::cout,
{
{Measurements::jump, "jump"},
{Measurements::loop, "loop"}
}
);
// allocate memory to execute our sum functions on
container data1, data2, result;
// run the benchmark 100 times to account for caching of the data
for(unsigned i = 0; i < 100; i++) {
sum_jump(result, data1, data2);
sum(result, data1, data2);
}
}
The output of the program is the following:
Runtimes for 12Measurements
jump: 100x 2972 nanoseconds 29 nanoseconds/execution
loop: 100x 2820 nanoseconds 28 nanoseconds/execution
OK, we see that there is no difference in the runtimes, because we are limited by memory bandwidth and not by CPU instructions. But let's look at the assembler instructions that are generated:
Dump of assembler code for function sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&):
0x00000000004025c0 <+0>: push %r15
0x00000000004025c2 <+2>: push %r14
0x00000000004025c4 <+4>: push %r12
0x00000000004025c6 <+6>: push %rbx
0x00000000004025c7 <+7>: push %rax
0x00000000004025c8 <+8>: mov %rdx,%r15
0x00000000004025cb <+11>: mov %rsi,%r12
0x00000000004025ce <+14>: mov %rdi,%rbx
0x00000000004025d1 <+17>: callq 0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
0x00000000004025d6 <+22>: mov %rax,%r14
0x00000000004025d9 <+25>: lea 0x20000000(%rbx),%rcx
0x00000000004025e0 <+32>: lea 0x20000000(%r12),%rax
0x00000000004025e8 <+40>: lea 0x20000000(%r15),%rsi
0x00000000004025ef <+47>: cmp %rax,%rbx
0x00000000004025f2 <+50>: sbb %al,%al
0x00000000004025f4 <+52>: cmp %rcx,%r12
0x00000000004025f7 <+55>: sbb %dl,%dl
0x00000000004025f9 <+57>: and %al,%dl
0x00000000004025fb <+59>: cmp %rsi,%rbx
0x00000000004025fe <+62>: sbb %al,%al
0x0000000000402600 <+64>: cmp %rcx,%r15
0x0000000000402603 <+67>: sbb %cl,%cl
0x0000000000402605 <+69>: test $0x1,%dl
0x0000000000402608 <+72>: jne 0x40268b <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+203>
0x000000000040260e <+78>: and %cl,%al
0x0000000000402610 <+80>: and $0x1,%al
0x0000000000402612 <+82>: jne 0x40268b <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+203>
0x0000000000402614 <+84>: xor %eax,%eax
0x0000000000402616 <+86>: nopw %cs:0x0(%rax,%rax,1)
0x0000000000402620 <+96>: movdqu (%r12,%rax,1),%xmm0
0x0000000000402626 <+102>: movdqu 0x10(%r12,%rax,1),%xmm1
0x000000000040262d <+109>: movdqu (%r15,%rax,1),%xmm2
0x0000000000402633 <+115>: movdqu 0x10(%r15,%rax,1),%xmm3
0x000000000040263a <+122>: paddb %xmm0,%xmm2
0x000000000040263e <+126>: paddb %xmm1,%xmm3
0x0000000000402642 <+130>: movdqu %xmm2,(%rbx,%rax,1)
0x0000000000402647 <+135>: movdqu %xmm3,0x10(%rbx,%rax,1)
0x000000000040264d <+141>: movdqu 0x20(%r12,%rax,1),%xmm0
0x0000000000402654 <+148>: movdqu 0x30(%r12,%rax,1),%xmm1
0x000000000040265b <+155>: movdqu 0x20(%r15,%rax,1),%xmm2
0x0000000000402662 <+162>: movdqu 0x30(%r15,%rax,1),%xmm3
0x0000000000402669 <+169>: paddb %xmm0,%xmm2
0x000000000040266d <+173>: paddb %xmm1,%xmm3
0x0000000000402671 <+177>: movdqu %xmm2,0x20(%rbx,%rax,1)
0x0000000000402677 <+183>: movdqu %xmm3,0x30(%rbx,%rax,1)
0x000000000040267d <+189>: add $0x40,%rax
0x0000000000402681 <+193>: cmp $0x20000000,%rax
0x0000000000402687 <+199>: jne 0x402620 <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+96>
0x0000000000402689 <+201>: jmp 0x4026d5 <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+277>
0x000000000040268b <+203>: xor %eax,%eax
0x000000000040268d <+205>: nopl (%rax)
0x0000000000402690 <+208>: movzbl (%r15,%rax,1),%ecx
0x0000000000402695 <+213>: add (%r12,%rax,1),%cl
0x0000000000402699 <+217>: mov %cl,(%rbx,%rax,1)
0x000000000040269c <+220>: movzbl 0x1(%r15,%rax,1),%ecx
0x00000000004026a2 <+226>: add 0x1(%r12,%rax,1),%cl
0x00000000004026a7 <+231>: mov %cl,0x1(%rbx,%rax,1)
0x00000000004026ab <+235>: movzbl 0x2(%r15,%rax,1),%ecx
0x00000000004026b1 <+241>: add 0x2(%r12,%rax,1),%cl
0x00000000004026b6 <+246>: mov %cl,0x2(%rbx,%rax,1)
0x00000000004026ba <+250>: movzbl 0x3(%r15,%rax,1),%ecx
0x00000000004026c0 <+256>: add 0x3(%r12,%rax,1),%cl
0x00000000004026c5 <+261>: mov %cl,0x3(%rbx,%rax,1)
0x00000000004026c9 <+265>: add $0x4,%rax
0x00000000004026cd <+269>: cmp $0x20000000,%rax
0x00000000004026d3 <+275>: jne 0x402690 <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+208>
0x00000000004026d5 <+277>: callq 0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
0x00000000004026da <+282>: sub %r14,%rax
0x00000000004026dd <+285>: add %rax,0x202b74(%rip) # 0x605258 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_1EE14totalTimeSpentE>
0x00000000004026e4 <+292>: incl 0x202b76(%rip) # 0x605260 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_1EE10executionsE>
0x00000000004026ea <+298>: add $0x8,%rsp
0x00000000004026ee <+302>: pop %rbx
0x00000000004026ef <+303>: pop %r12
0x00000000004026f1 <+305>: pop %r14
0x00000000004026f3 <+307>: pop %r15
0x00000000004026f5 <+309>: retq
End of assembler dump.
As we can see, the compiler has vectorized the loop and uses SIMD instructions (e.g. paddb).
Now the version with the jumps:
Dump of assembler code for function sum_jump(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&):
0x0000000000402700 <+0>: push %r15
0x0000000000402702 <+2>: push %r14
0x0000000000402704 <+4>: push %r12
0x0000000000402706 <+6>: push %rbx
0x0000000000402707 <+7>: push %rax
0x0000000000402708 <+8>: mov %rdx,%rbx
0x000000000040270b <+11>: mov %rsi,%r14
0x000000000040270e <+14>: mov %rdi,%r15
0x0000000000402711 <+17>: callq 0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
0x0000000000402716 <+22>: mov %rax,%r12
0x0000000000402719 <+25>: mov (%rbx),%al
0x000000000040271b <+27>: add (%r14),%al
0x000000000040271e <+30>: mov %al,(%r15)
0x0000000000402721 <+33>: callq 0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
0x0000000000402726 <+38>: sub %r12,%rax
0x0000000000402729 <+41>: add %rax,0x202b38(%rip) # 0x605268 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_0EE14totalTimeSpentE>
0x0000000000402730 <+48>: incl 0x202b3a(%rip) # 0x605270 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_0EE10executionsE>
0x0000000000402736 <+54>: add $0x8,%rsp
0x000000000040273a <+58>: pop %rbx
0x000000000040273b <+59>: pop %r12
0x000000000040273d <+61>: pop %r14
0x000000000040273f <+63>: pop %r15
0x0000000000402741 <+65>: retq
End of assembler dump.
And here we did not trigger the optimization.
You can compile the program and check the asm code yourself with:
gdb -batch -ex 'file a.out' -ex 'disassemble sum'
I also tried this approach for a matrix multiplication, but gcc was smart enough to detect the matrix multiplication with the goto/label syntax too.
Conclusion:
Even though we did not see a speed loss, we saw that gcc could not apply an optimization that could speed up the computation. In more CPU-bound tasks, that may cause a drop in performance.
Could it? Sure, theoretically, if you were using an especially dumb compiler or were compiling with optimizations disabled.
In practice, absolutely not. An optimizing compiler has very little difficulty optimizing loops and switch statements. You are far more likely to confuse the optimizer with unconditional jumps than if you play by the normal rules and use looping constructs that it is familiar with and programmed to optimize accordingly.
This just goes back to a general rule that optimizers do their best work with standard, idiomatic code because that's what they have been trained to recognize. Given their extensive use of pattern matching, if you deviate from normal patterns, you are less likely to get optimal code output.
For example: If I had a while loop nested inside a switch statement, which would be nested in another loop which would be nested inside of another switch case
Yeah, I'm confused by reading your description of the code, but a compiler would not be. Unlike humans, compilers have no trouble with nested loops. You aren't going to confuse it by writing valid C++ code, unless the compiler has a bug, in which case all bets are off. No matter how much nesting there is, if the logical result is a jump outside of the entire block, then a compiler will emit code that does that, just as if you had written a goto of your own. If you had tried to write a goto of your own, you would be more likely to get sub-optimal code because of scoping issues, which would require the compiler to emit code that saved local variables, adjusted the stack frame, etc. There are many standard optimizations that are normally performed on loops, yet become either impossible or simply not applied if you throw a goto inside of that loop.
Furthermore, it is not the case that jumps always result in faster code, even when jumping over other instructions. In many cases on modern, heavily-pipelined, out-of-order processors, mispredicted branches amount to a significant penalty, far more than if the intervening instructions had simply been executed and their results thrown away (ignored). Optimizing compilers that are targeting a particular platform know about this, and may decide to emit branchless code for performance reasons. If you throw in a jump, you force an unconditional branch and eliminate their ability to make these decisions strategically.
You claim that you want "an actual example instead of theory", but this is setting up a logical fallacy. It makes the assumption that you are correct and that the use of goto can indeed lead to better optimized code than a compiler could generate with looping constructs, but if that is not the case, then there is no code that can prove it. I could show you hundreds of examples of code where a goto resulted in sub-optimal object code being generated, but you could just claim that there still existed some snippet of code where the reverse is true, and there is no way that I (or anyone else) could exhaustively demonstrate that you were wrong.
On the contrary, theoretical arguments are the only way to answer this question (at least if answering in the negative). Furthermore, I would argue that a theoretical understanding is all that is required to inform one's writing of code. You should write code assuming that your optimizer is not broken, and only after determining that it actually is broken should you go back and try to figure out how to revise the code to get it to generate the output you expected. If, in some bizarre circumstance, you find that that involves a goto, then you can use it. Until that point, assume that the compiler knows how to optimize loops and write the code in the normal, readable way, assured that in the vast majority of circumstances (if not actually 100%) that the output will be better than what you could have gotten by trying to outsmart the compiler from the outset.
Goto/while/do are just high-level language constructs. Once they get lowered to an intermediate representation (IR or AST), they disappear and everything is implemented in terms of branches.
Take for example this code
double calcsum( double* val, unsigned int count )
{
double sum = 0;
for ( unsigned int j=0; j<count; ++count ) {
sum += val[j];
}
return sum;
}
Compile it such that it generates IR:
clang++ -S -emit-llvm -O3 test.cpp # will create test.ll
Look at the generated IR language
$ cat test.ll
define double @_Z7calcsumPdj(double* nocapture readonly, i32)
local_unnamed_addr #0 {
%3 = icmp eq i32 %1, 0
br i1 %3, label %26, label %4
; <label>:4: ; preds = %2
%5 = load double, double* %0, align 8, !tbaa !1
%6 = sub i32 0, %1
%7 = and i32 %6, 7
%8 = icmp ugt i32 %1, -8
br i1 %8, label %12, label %9
; <label>:9: ; preds = %4
%10 = sub i32 %6, %7
br label %28
; <label>:11: ; preds = %28
br label %12
; <label>:12: ; preds = %11, %4
%13 = phi double [ undef, %4 ], [ %38, %11 ]
%14 = phi double [ 0.000000e+00, %4 ], [ %38, %11 ]
%15 = icmp eq i32 %7, 0
br i1 %15, label %24, label %16
; <label>:16: ; preds = %12
br label %17
; <label>:17: ; preds = %17, %16
%18 = phi double [ %14, %16 ], [ %20, %17 ]
%19 = phi i32 [ %7, %16 ], [ %21, %17 ]
%20 = fadd double %18, %5
%21 = add i32 %19, -1
%22 = icmp eq i32 %21, 0
br i1 %22, label %23, label %17, !llvm.loop !5
; <label>:23: ; preds = %17
br label %24
; <label>:24: ; preds = %12, %23
%25 = phi double [ %13, %12 ], [ %20, %23 ]
br label %26
; <label>:26: ; preds = %24, %2
%27 = phi double [ 0.000000e+00, %2 ], [ %25, %24 ]
ret double %27
; <label>:28: ; preds = %28, %9
%29 = phi double [ 0.000000e+00, %9 ], [ %38, %28 ]
%30 = phi i32 [ %10, %9 ], [ %39, %28 ]
%31 = fadd double %29, %5
%32 = fadd double %31, %5
%33 = fadd double %32, %5
%34 = fadd double %33, %5
%35 = fadd double %34, %5
%36 = fadd double %35, %5
%37 = fadd double %36, %5
%38 = fadd double %37, %5
%39 = add i32 %30, -8
%40 = icmp eq i32 %39, 0
br i1 %40, label %11, label %28
}
You can see that there are no while/do constructs anymore. Everything is branch/goto. Obviously, this still gets compiled further into assembly language.
But besides this, "fast" today depends on whether the compiler can match a good optimization pass to your code. By using goto you would potentially forfeit some loop-optimization passes such as lcssa, licm, loop deletion, loop reduce, simplify, unroll, unswitch:
http://llvm.org/docs/Passes.html#lcssa-loop-closed-ssa-form-pass
I think that if you are trying to optimize your code this way, you are doing something wrong:
Compiler optimizations are made per platform - you can't match them by hand for every platform.
There are a lot of compilers; how can you be sure your optimization will not be a degradation on a different platform?
There are so many OSes, processors, and other factors that such an optimization is unlikely to survive them all.
Programming languages are made for humans, not machines - otherwise, why not use machine instructions directly? You would have better control and could optimize things (taking point 3 into account).
Anyway, if you are looking for micro-optimizations on the current platform, it is better to take full control and use asm.

selecting address to change value in memory

This question/answer on SO shows how to use GDB to change a value in memory, but in the example given, it chooses an address that doesn't itself appear in the disassembly to set the value.
For example, to change the return value to 22, the author does
set {unsigned char}0x00000000004004b9 = 22
However, why would this address 0x00000000004004b9 be the address to change? If you look at the output of disas/r, the address 0x00000000004004b9 isn't listed, so why use this one to set to 22? I'm trying to understand how to know which address needs to be changed to (in this example) change the return value, if the output of disas/r doesn't show it.
code
$ cat t.c
int main()
{
return 42;
}
$ gcc t.c && ./a.out; echo $?
42
$ gdb --write -q ./a.out
(gdb) disas/r main
Dump of assembler code for function main:
0x00000000004004b4 <+0>: 55 push %rbp
0x00000000004004b5 <+1>: 48 89 e5 mov %rsp,%rbp
0x00000000004004b8 <+4>: b8 2a 00 00 00 mov $0x2a,%eax
0x00000000004004bd <+9>: 5d pop %rbp
0x00000000004004be <+10>: c3 retq
End of assembler dump.
(gdb) set {unsigned char}0x00000000004004b9 = 22
(gdb) disas/r main
Dump of assembler code for function main:
0x00000000004004b4 <+0>: 55 push %rbp
0x00000000004004b5 <+1>: 48 89 e5 mov %rsp,%rbp
0x00000000004004b8 <+4>: b8 16 00 00 00 mov $0x16,%eax <<< ---changed
0x00000000004004bd <+9>: 5d pop %rbp
0x00000000004004be <+10>: c3 retq
End of assembler dump.
(gdb) q
$ ./a.out; echo $?
22 <<<--- Just as desired
I'm trying to understand how to know which address needs to be changed to (in this example) change the return value, if the output of disas/r doesn't show it.
To understand this, you need to understand instruction encoding. The instruction here is "move a 32-bit immediate constant into a register". The constant is part of the instruction (that's what "immediate" means). It may be helpful to compile this instead:
int foo() { return 0x41424344; }
int bar() { return 0x45464748; }
int main() { return foo() + bar(); }
When you do compile it, you should see something similar to:
(gdb) disas/r foo
Dump of assembler code for function foo:
0x00000000004004ed <+0>: 55 push %rbp
0x00000000004004ee <+1>: 48 89 e5 mov %rsp,%rbp
0x00000000004004f1 <+4>: b8 44 43 42 41 mov $0x41424344,%eax
0x00000000004004f6 <+9>: 5d pop %rbp
0x00000000004004f7 <+10>: c3 retq
End of assembler dump.
(gdb) disas/r bar
Dump of assembler code for function bar:
0x00000000004004f8 <+0>: 55 push %rbp
0x00000000004004f9 <+1>: 48 89 e5 mov %rsp,%rbp
0x00000000004004fc <+4>: b8 48 47 46 45 mov $0x45464748,%eax
0x0000000000400501 <+9>: 5d pop %rbp
0x0000000000400502 <+10>: c3 retq
End of assembler dump.
Now you can clearly see where in the instruction stream each byte of the immediate constant resides (and also that x86 uses little-endian encoding for them).
The standard reference on instruction encoding for x86 is the Intel instruction set reference. You can find the 0xB8 instruction on page 3-528.