【SOLVED】LLVM DILocation: extract information from metadata - c++

I want to get the returned value and the !861 debug location from a return instruction, for example ret i32 %3, !dbg !861, whose metadata is !861 = !DILocation(line: 8, column: 5, scope: !857). But it didn't work.
The version of clang and LLVM is 13.0.0.
for (auto &B : F) {
for (auto &I : B) {
// get metadata
if (auto *inst = dyn_cast<ReturnInst>(&I)) {
// ret i32 %3, !dbg !861
// !861 = !DILocation(line: 8, column: 5, scope: !857)
errs() << "!!!return inst: " << *inst << "\n";
DILocation *DILoc = inst->getDebugLoc().get();
errs() << " " << DILoc << "."<< "\n";
Type *instTy = inst->getType();
errs() << " " << *instTy << "."<< "\n";
Value* val = dyn_cast<Value>(inst);
errs() << " val name: " << val->getName().str() << ".\n";
if (auto constant_int = dyn_cast<ConstantInt>(val)) {
int number = constant_int->getSExtValue();
errs() << " val number: " << number << ".\n";
}
}
}
}
and the result:
!!!return inst: ret i32 %3
0x0.
void.
val name: .
I got almost nothing! Problems:
1. getDebugLoc().get() returns 0x0, why? I want the information in !861 = !DILocation(line: 8, column: 5, scope: !857).
Actually, I have now found my real problem.
I used clang++ -O0 -g -S -emit-llvm test1.cpp -o test.ll to get the .ll file I was reading, so that file contained the metadata.
But when I compiled the code that my pass actually ran on, I didn't pass -O0 -g, so no debug metadata was generated.
That is why the llvm::DebugLoc was empty and get() returned a null DILocation.
Now, after adding the two arguments, the code I wrote works!
2. The return instruction's type is void, why? I thought it should be ret.
Nick Lewycky said: The return instruction, locally to the current function, is void; it does not produce a value that subsequent instructions in the same function can consume. %a = add i32 %b, %c makes sense, but %a = ret i32 %b does not.
A ret instruction itself always has void type, no matter what type the function is returning.
If you want the type of the returned value, you could ask inst->getReturnValue()->getType() (getReturnValue() returns null for ret void, so check it first).
Thanks, Nick Lewycky!

Related

When passing by reference, is it possible to pass an address of a variable in the stack? Can it be accessed out of its scope? [duplicate]

I'm just wondering how references are actually implemented across different compilers and debug/release configurations. Does the standard provide recommendations on their implementation? Do implementations differ?
I tried to run a simple program where I return non-const references and pointers to local variables from functions, but they worked out the same way. Does this mean that references are internally just a pointer?
Just to repeat some of the stuff everyone's been saying, let's look at some compiler output:
#include <stdio.h>
#include <stdlib.h>
int byref(int & foo)
{
printf("%d\n", foo);
}
int byptr(int * foo)
{
printf("%d\n", *foo);
}
int main(int argc, char **argv) {
int aFoo = 5;
byref(aFoo);
byptr(&aFoo);
}
We can compile this with LLVM (with optimizations turned off) and we get the following:
define i32 @_Z5byrefRi(i32* %foo) {
entry:
%foo_addr = alloca i32* ; <i32**> [#uses=2]
%retval = alloca i32 ; <i32*> [#uses=1]
%"alloca point" = bitcast i32 0 to i32 ; <i32> [#uses=0]
store i32* %foo, i32** %foo_addr
%0 = load i32** %foo_addr, align 8 ; <i32*> [#uses=1]
%1 = load i32* %0, align 4 ; <i32> [#uses=1]
%2 = call i32 (i8*, ...)* @printf(i8* noalias getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), i32 %1) ; <i32> [#uses=0]
br label %return
return: ; preds = %entry
%retval1 = load i32* %retval ; <i32> [#uses=1]
ret i32 %retval1
}
define i32 @_Z5byptrPi(i32* %foo) {
entry:
%foo_addr = alloca i32* ; <i32**> [#uses=2]
%retval = alloca i32 ; <i32*> [#uses=1]
%"alloca point" = bitcast i32 0 to i32 ; <i32> [#uses=0]
store i32* %foo, i32** %foo_addr
%0 = load i32** %foo_addr, align 8 ; <i32*> [#uses=1]
%1 = load i32* %0, align 4 ; <i32> [#uses=1]
%2 = call i32 (i8*, ...)* @printf(i8* noalias getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), i32 %1) ; <i32> [#uses=0]
br label %return
return: ; preds = %entry
%retval1 = load i32* %retval ; <i32> [#uses=1]
ret i32 %retval1
}
The bodies of both functions are identical.
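The same observation can be made from the source level. This is only a minimal sketch in standard C++; the helper names addr_via_ref, addr_via_ptr and same_object_seen are hypothetical and do not come from the answer above. It shows that a function taking a reference and a function taking a pointer end up looking at the same object.

```cpp
#include <cassert>

// Hypothetical helpers, not part of the original answer.
// Each returns the address of the int object the caller handed over.
const int* addr_via_ref(const int& r) { return &r; }  // &r yields the referent's address
const int* addr_via_ptr(const int* p) { return p; }   // the pointer already is that address

// Returns true when both ways of passing the argument observed the same object.
bool same_object_seen() {
    int aFoo = 5;
    const int* from_ref = addr_via_ref(aFoo);
    const int* from_ptr = addr_via_ptr(&aFoo);
    assert(from_ref == &aFoo);
    return from_ref == from_ptr;
}
```

This says nothing about how a given compiler passes the argument; it only shows that the observable behavior matches what the identical IR bodies suggest.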
Sorry for using assembly to explain this but I think this is the best way to understand how references are implemented by compilers.
#include <iostream>
using namespace std;
int main()
{
int i = 10;
int *ptrToI = &i;
int &refToI = i;
cout << "i = " << i << "\n";
cout << "&i = " << &i << "\n";
cout << "ptrToI = " << ptrToI << "\n";
cout << "*ptrToI = " << *ptrToI << "\n";
cout << "&ptrToI = " << &ptrToI << "\n";
cout << "refToNum = " << refToI << "\n";
//cout << "*refToNum = " << *refToI << "\n";
cout << "&refToNum = " << &refToI << "\n";
return 0;
}
The output of this code looks like this:
i = 10
&i = 0xbf9e52f8
ptrToI = 0xbf9e52f8
*ptrToI = 10
&ptrToI = 0xbf9e52f4
refToNum = 10
&refToNum = 0xbf9e52f8
Let's look at the disassembly (I used GDB for this; 8, 9 and 10 here are source line numbers).
8 int i = 10;
0x08048698 <main()+18>: movl $0xa,-0x10(%ebp)
Here $0xa is the decimal 10 that we are assigning to i. -0x10(%ebp) means the contents of the ebp register minus 16 (decimal).
-0x10(%ebp) is the address of i on the stack.
9 int *ptrToI = &i;
0x0804869f <main()+25>: lea -0x10(%ebp),%eax
0x080486a2 <main()+28>: mov %eax,-0x14(%ebp)
This assigns the address of i to ptrToI. ptrToI is also on the stack, at address -0x14(%ebp), that is, ebp - 20 (decimal).
10 int &refToI = i;
0x080486a5 <main()+31>: lea -0x10(%ebp),%eax
0x080486a8 <main()+34>: mov %eax,-0xc(%ebp)
Now here is the catch! Compare the disassembly of lines 9 and 10 and you will observe that -0x14(%ebp) is replaced by -0xc(%ebp) in line 10. -0xc(%ebp) is the address of refToI. It is allocated on the stack. But you will never be able to get this address from your code, because you are not required to know the address.
So, in this build, a reference does occupy memory. In this case it is stack memory, since we have declared it as a local variable.
How much memory does it occupy?
As much as a pointer occupies.
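One way to check the size claim without a disassembler is sketched below. It assumes a mainstream ABI (for example x86 or x86-64 with GCC or Clang); the standard does not guarantee the result. The struct names HoldsRef and HoldsPtr and the two functions are hypothetical. sizeof applied to a reference type reports the size of the referenced type, so the reference is wrapped in a struct to see how much storage it needs as a member.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrappers: one stores a reference, the other a pointer.
struct HoldsRef { int& ref; };
struct HoldsPtr { int* ptr; };

// Guaranteed by the language: sizeof sees through the reference.
static_assert(sizeof(int&) == sizeof(int), "sizeof(int&) is sizeof(int)");

// Not guaranteed by the standard, but expected on common ABIs:
// a reference member takes as much room as a pointer member.
bool ref_member_is_pointer_sized() {
    return sizeof(HoldsRef) == sizeof(HoldsPtr);
}

// Both wrappers reach the same int, one with reference syntax, one with *.
int read_through_both() {
    int i = 10;
    HoldsRef by_ref{i};
    HoldsPtr by_ptr{&i};
    assert(&by_ref.ref == by_ptr.ptr);
    return by_ref.ref + *by_ptr.ptr;
}
```

A local reference like the one in the answer may still be optimized away entirely; the struct member is just the case where its storage becomes visible.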
Now let's see how we access the reference and the pointer. For simplicity I have shown only part of the assembly snippet.
16 cout << "*ptrToI = " << *ptrToI << "\n";
0x08048746 <main()+192>: mov -0x14(%ebp),%eax
0x08048749 <main()+195>: mov (%eax),%ebx
19 cout << "refToNum = " << refToI << "\n";
0x080487b0 <main()+298>: mov -0xc(%ebp),%eax
0x080487b3 <main()+301>: mov (%eax),%ebx
Now compare the two snippets above; you will see a striking similarity. -0xc(%ebp) is the actual address of refToI, which is never accessible to you.
In simple terms, if you think of a reference as a normal pointer, then accessing a reference is like fetching the value at the address the reference holds. This means the two lines of code below will give you the same result:
cout << "Value if i = " << *ptrToI << "\n";
cout << " Value if i = " << refToI << "\n";
Now compare this
15 cout << "ptrToI = " << ptrToI << "\n";
0x08048713 <main()+141>: mov -0x14(%ebp),%ebx
21 cout << "&refToNum = " << &refToI << "\n";
0x080487fb <main()+373>: mov -0xc(%ebp),%eax
I guess you are able to spot what is happening here.
If you ask for &refToI, the contents of the -0xc(%ebp) location are returned; -0xc(%ebp) is where refToI resides, and its contents are nothing but the address of i.
One last thing: why is this line commented out?
//cout << "*refToNum = " << *refToI << "\n";
Because *refToI is not permitted: refToI is used exactly like an int, so *refToI would apply * to an int, and that is a compile-time error.
The natural implementation of a reference is indeed a pointer. However, do not depend on this in your code.
In Bjarne's words:
Like a pointer, a reference is an alias for an object, is usually implemented to hold a machine address of an object, and does not impose performance overhead compared to pointers, but it differs from a pointer in that:
• You access a reference with exactly the same syntax as the name of an object.
• A reference always refers to the object to which it was initialized.
• There is no ‘‘null reference,’’ and we may assume that a reference refers to an object
Though a reference is in practice usually a pointer, it shouldn't be used like a pointer but as an alias.
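The second point in the quote ("a reference always refers to the object to which it was initialized") can be illustrated with a small sketch. The function names are hypothetical; the code uses only standard C++. Assigning to a reference writes to the object it refers to, while assigning to a pointer redirects the pointer.

```cpp
#include <cassert>

// Assigning through a reference writes to the referent;
// it does not make the reference refer to another object.
int assign_through_reference() {
    int a = 1;
    int b = 2;
    int& r = a;   // r is bound to a for its whole lifetime
    r = b;        // copies b's value into a
    assert(&r == &a);
    return a;     // a now holds 2
}

// The pointer counterpart can be redirected to another object.
int reseat_pointer() {
    int a = 1;
    int b = 2;
    int* p = &a;
    p = &b;       // p now points at b; a is untouched
    assert(*p == 2);
    return a;     // a still holds 1
}
```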
A reference does not need to be a pointer.
In many cases it is, but in other cases it is just an alias and there is no need for a separate memory allocation for a pointer.
Assembly samples are not always representative, because they depend heavily on optimizations and on how "smart" the compiler is.
For example:
int i;
int& j = i;
does not need to generate any additional code or allocate any additional memory.
I can't say this is right for sure, but I did some Googling and found this statement:
The language standard does not require any particular mechanism. Each implementation is free to do it in any way, as long as the behavior is compliant.
Source: Bytes.com
A reference is not a pointer. This is a fact. A pointer can be rebound to another object and has its own operations, such as dereferencing and incrementing/decrementing.
Internally, a reference may be implemented as a pointer, but this is an implementation detail; it does not change the fact that references cannot be interchanged with pointers, and one cannot write code assuming references are implemented as pointers.
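The difference in operations can be sketched as follows, using only standard C++ and hypothetical function names. Applying ++ to a pointer moves the pointer to the next element, whereas applying ++ to a reference increments the object it refers to.

```cpp
#include <cassert>

// ++ on a pointer changes where the pointer points.
int increment_pointer() {
    int values[2] = {10, 20};
    int* p = values;
    ++p;                        // p now points at values[1]
    assert(values[0] == 10);    // the array itself is unchanged
    return *p;
}

// ++ on a reference changes the referenced object.
int increment_reference() {
    int values[2] = {10, 20};
    int& r = values[0];
    ++r;                        // values[0] becomes 11
    assert(&r == &values[0]);   // r still refers to values[0]
    return values[0];
}
```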

A painful bug - LoadInst error in codegen.cpp

Recently, I've been trying to create a toy language based on an old tutorial. I had this idea almost half a year ago, but I didn't have the time to do it until now. Anyway, while following the tutorial, I modified the source code to get rid of a handful of compile errors (I believe that most of the errors are related to backward compatibility), but I'm stuck at feeding the "custom-defined code" to the parser.
The error:
eric@pop-os:~/Desktop/my_toy_compiler-master$ echo 'int do_math(int a){ int x = a * 5 + 3 } do_math(10)' | ./parser
0x55a3e10a7580
Generating code...
Generating code for 20NFunctionDeclaration
Creating variable declaration int a
Generating code for 20NVariableDeclaration
Creating variable declaration int x
Creating assignment for x
Creating binary operation 274
Creating integer: 3
Creating binary operation 276
Creating integer: 5
Creating identifier reference: a
parser: /home/eric/llvm-project/llvm/lib/IR/DataLayout.cpp:740: llvm::Align llvm::DataLayout::getAlignment(llvm::Type*, bool) const: Assertion `Ty->isSized() && "Cannot getTypeInfo() on a type that is unsized!"' failed.
Aborted (core dumped)
The problem starts after 'Creating identifier reference: a', so it makes sense to look at the relevant code. Two functions in codegen.cpp matter for this bug.
Excerpt of codegen.cpp:
...
static Type *typeOf(const NIdentifier& type)
{
if (type.name.compare("int") == 0) {
return Type::getInt64Ty(MyContext);
}
else if (type.name.compare("double") == 0) {
return Type::getDoubleTy(MyContext);
}
return Type::getVoidTy(MyContext);
}
...
Value* NIdentifier::codeGen(CodeGenContext& context)
{
std::cout << "Creating identifier reference: " << name << endl;
if (context.locals().find(name) == context.locals().end()) {
std::cerr << "undeclared variable " << name << endl;
return NULL;
}
return new LoadInst(typeOf(name), context.locals()[name], "", false, context.currentBlock());
}
...
The full version of codegen.cpp:
#include "node.h"
#include "codegen.h"
#include "parser.hpp"
using namespace std;
/* Compile the AST into a module */
void CodeGenContext::generateCode(NBlock& root)
{
std::cout << "Generating code...\n";
/* Create the top level interpreter function to call as entry */
vector<Type*> argTypes;
FunctionType *ftype = FunctionType::get(Type::getVoidTy(MyContext), makeArrayRef(argTypes), false);
mainFunction = Function::Create(ftype, GlobalValue::InternalLinkage, "main", module);
BasicBlock *bblock = BasicBlock::Create(MyContext, "entry", mainFunction, 0);
/* Push a new variable/block context */
pushBlock(bblock);
root.codeGen(*this); /* emit bytecode for the toplevel block */
ReturnInst::Create(MyContext, bblock);
popBlock();
/* Print the bytecode in a human-readable format
to see if our program compiled properly
*/
std::cout << "Code is generated.\n";
// module->dump();
legacy::PassManager pm;
pm.add(createPrintModulePass(outs()));
pm.run(*module);
}
/* Executes the AST by running the main function */
GenericValue CodeGenContext::runCode() {
std::cout << "Running code...\n";
ExecutionEngine *ee = EngineBuilder( unique_ptr<Module>(module) ).create();
ee->finalizeObject();
vector<GenericValue> noargs;
GenericValue v = ee->runFunction(mainFunction, noargs);
std::cout << "Code was run.\n";
return v;
}
/* Returns an LLVM type based on the identifier */
static Type *typeOf(const NIdentifier& type)
{
if (type.name.compare("int") == 0) {
return Type::getInt64Ty(MyContext);
}
else if (type.name.compare("double") == 0) {
return Type::getDoubleTy(MyContext);
}
return Type::getVoidTy(MyContext);
}
/* -- Code Generation -- */
Value* NInteger::codeGen(CodeGenContext& context)
{
std::cout << "Creating integer: " << value << endl;
return ConstantInt::get(Type::getInt64Ty(MyContext), value, true);
}
Value* NDouble::codeGen(CodeGenContext& context)
{
std::cout << "Creating double: " << value << endl;
return ConstantFP::get(Type::getDoubleTy(MyContext), value);
}
Value* NIdentifier::codeGen(CodeGenContext& context)
{
std::cout << "Creating identifier reference: " << name << endl;
if (context.locals().find(name) == context.locals().end()) {
std::cerr << "undeclared variable " << name << endl;
return NULL;
}
return new LoadInst(Type::getInt64Ty(MyContext), context.locals()[name], "", false, context.currentBlock());
}
Value* NMethodCall::codeGen(CodeGenContext& context)
{
Function *function = context.module->getFunction(id.name.c_str());
if (function == NULL) {
std::cerr << "no such function " << id.name << endl;
}
std::vector<Value*> args;
ExpressionList::const_iterator it;
for (it = arguments.begin(); it != arguments.end(); it++) {
args.push_back((**it).codeGen(context));
}
CallInst *call = CallInst::Create(function, makeArrayRef(args), "", context.currentBlock());
std::cout << "Creating method call: " << id.name << endl;
return call;
}
Value* NBinaryOperator::codeGen(CodeGenContext& context)
{
std::cout << "Creating binary operation " << op << endl;
Instruction::BinaryOps instr;
switch (op) {
case TPLUS: instr = Instruction::Add; goto math;
case TMINUS: instr = Instruction::Sub; goto math;
case TMUL: instr = Instruction::Mul; goto math;
case TDIV: instr = Instruction::SDiv; goto math;
/* TODO comparison */
}
return NULL;
math:
return BinaryOperator::Create(instr, lhs.codeGen(context),
rhs.codeGen(context), "", context.currentBlock());
}
Value* NAssignment::codeGen(CodeGenContext& context)
{
std::cout << "Creating assignment for " << lhs.name << endl;
if (context.locals().find(lhs.name) == context.locals().end()) {
std::cerr << "undeclared variable " << lhs.name << endl;
return NULL;
}
return new StoreInst(rhs.codeGen(context), context.locals()[lhs.name], false, context.currentBlock());
}
Value* NBlock::codeGen(CodeGenContext& context)
{
StatementList::const_iterator it;
Value *last = NULL;
for (it = statements.begin(); it != statements.end(); it++) {
std::cout << "Generating code for " << typeid(**it).name() << endl;
last = (**it).codeGen(context);
}
std::cout << "Creating block" << endl;
return last;
}
Value* NExpressionStatement::codeGen(CodeGenContext& context)
{
std::cout << "Generating code for " << typeid(expression).name() << endl;
return expression.codeGen(context);
}
Value* NReturnStatement::codeGen(CodeGenContext& context)
{
std::cout << "Generating return code for " << typeid(expression).name() << endl;
Value *returnValue = expression.codeGen(context);
context.setCurrentReturnValue(returnValue);
return returnValue;
}
Value* NVariableDeclaration::codeGen(CodeGenContext& context)
{
std::cout << "Creating variable declaration " << type.name << " " << id.name << endl;
AllocaInst *alloc = new AllocaInst(typeOf(type), NULL, id.name.c_str(), context.currentBlock());
context.locals()[id.name] = alloc;
if (assignmentExpr != NULL) {
NAssignment assn(id, *assignmentExpr);
assn.codeGen(context);
}
return alloc;
}
Value* NExternDeclaration::codeGen(CodeGenContext& context)
{
vector<Type*> argTypes;
VariableList::const_iterator it;
for (it = arguments.begin(); it != arguments.end(); it++) {
argTypes.push_back(typeOf((**it).type));
}
FunctionType *ftype = FunctionType::get(typeOf(type), makeArrayRef(argTypes), false);
Function *function = Function::Create(ftype, GlobalValue::ExternalLinkage, id.name.c_str(), context.module);
return function;
}
Value* NFunctionDeclaration::codeGen(CodeGenContext& context)
{
vector<Type*> argTypes;
VariableList::const_iterator it;
for (it = arguments.begin(); it != arguments.end(); it++) {
argTypes.push_back(typeOf((**it).type));
}
FunctionType *ftype = FunctionType::get(typeOf(type), makeArrayRef(argTypes), false);
Function *function = Function::Create(ftype, GlobalValue::InternalLinkage, id.name.c_str(), context.module);
BasicBlock *bblock = BasicBlock::Create(MyContext, "entry", function, 0);
context.pushBlock(bblock);
Function::arg_iterator argsValues = function->arg_begin();
Value* argumentValue;
for (it = arguments.begin(); it != arguments.end(); it++) {
(**it).codeGen(context);
argumentValue = &*argsValues++;
argumentValue->setName((*it)->id.name.c_str());
StoreInst *inst = new StoreInst(argumentValue, context.locals()[(*it)->id.name], false, bblock);
}
block.codeGen(context);
ReturnInst::Create(MyContext, context.getCurrentReturnValue(), bblock);
context.popBlock();
std::cout << "Creating function: " << id.name << endl;
return function;
}
Note that the tutorial's LoadInst call originally had only 3 arguments. I checked the llvm::LoadInst class reference and saw that the LoadInst constructors now require at least 4 parameters; I figured out that I (and the author) was missing the Type *Ty parameter. Obviously, typeOf(name) in return new LoadInst(typeOf(name), context.locals()[name], "", false, context.currentBlock()); is not a solution: typeOf compares the identifier text with "int" and "double", and name is the variable name, 'a' according to the log, so typeOf(name) always returns void. I suspect that this is what causes Cannot getTypeInfo() on a type that is unsized!, as stated by the error.
In short, I believe I should be looking for something like this:
Value* NIdentifier::codeGen(CodeGenContext& context)
{
std::cout << "Creating identifier reference: " << name << endl;
if (context.locals().find(name) == context.locals().end()) {
std::cerr << "undeclared variable " << name << endl;
return NULL;
}
return new LoadInst(*some magic that return the llvm::type of name identifier*, context.locals()[name], "", false, context.currentBlock());
}
I'm still a noob in LLVM, so excuse me if my guess isn't correct. Many thanks for any tips or ideas.
P.S. I tried return new LoadInst(Type::getInt64Ty(MyContext), context.locals()[name], "", false, context.currentBlock());. The terminal broke my heart again with the following:
eric@pop-os:~/Desktop/my_toy_compiler-master$ echo 'int do_math(int a){ int x = a * 5 + 3 } do_math(10)' | ./parser
0x562120785580
Generating code...
Generating code for 20NFunctionDeclaration
Creating variable declaration int a
Generating code for 20NVariableDeclaration
Creating variable declaration int x
Creating assignment for x
Creating binary operation 274
Creating integer: 3
Creating binary operation 276
Creating integer: 5
Creating identifier reference: a
Creating block
Creating function: do_math
Generating code for 20NExpressionStatement
Generating code for 11NMethodCall
Creating integer: 10
Creating method call: do_math
Creating block
Code is generated.
; ModuleID = 'main'
source_filename = "main"
@.str = private constant [4 x i8] c"%d\0A\00"
declare i32 @printf(i8*, ...)
define internal void @echo(i64 %toPrint) {
entry:
%0 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @.str, i32 0, i32 0), i64 %toPrint)
ret void
}
define internal void @main() {
entry:
%0 = call i64 @do_math(i64 10)
ret void
}
define internal i64 @do_math(i64 %a1) {
entry:
%a = alloca i64, align 8
store i64 %a1, i64* %a, align 4
%x = alloca i64, align 8
%0 = load i64, i64* %a, align 4
%1 = mul i64 %0, 5
%2 = add i64 %1, 3
store i64 %2, i64* %x, align 4
ret void
}
Running code...
Function context does not match Module context!
void (i64)* @echo
in function echo
LLVM ERROR: Broken function found, compilation aborted!
Aborted (core dumped)
It's sad that my core is dumped anyway.

In LLVM, what "type" is a bitcast of a function inside a call? How to access this function?

In my llvm IR code, I have the following line:
%tmp = call i32 @decf1(void (i8*)* bitcast (void (%a_type*)* @decf2 to void (i8*)*), i8 %x3, i8* @external_type)
I am trying to extract a_type and decf2 programmatically, but I can't seem to get access to them.
bool runOnFunction(Function &F) override {
errs() << "Initializing Test pass\n";
for (BasicBlock &BB : F) {
for (Instruction &I : BB) {
// New Instruction
errs() << "\n\n"
<< "=====================\n"
<< "- - - - - - - - - - -\n"
<< "NewInstruction:\n";
I.dump();
errs() << "\n";
// New Operands
errs() << "- - - - - - - - - - -\n"
<< "Operands:\n";
for (Use &U : I.operands()) {
errs() << "Type: ";
U->getType()->print(errs());
errs() << "\n";
errs() << "Name: " << U->getName() << "\n";
}
errs() << "\n";
}
}
return false; // this pass only inspects the IR
}
This pass produces the following output for the instruction containing the cast:
=====================
- - - - - - - - - - -
NewInstruction:
%tmp = call i32 @decf1(void (i8*)* bitcast (void (%a_type*)* @decf2 to void (i8*)*), i8 %x3, i8* @external_type)
- - - - - - - - - - -
Operands:
Type: void (i8*)*
Name:
Is Instruction: No
Is Function: No
Type: i8
Name: x3
Is Instruction: Yes
%x3 = mul i8 %x2, %x2
Is Function: No
Type: i8*
Name: external_type
Is Instruction: No
Is Function: No
Type: i32 (void (i8*)*, i8, i8*)*
Name: decf1
Is Instruction: No
Is Function: Yes
Is Declaration: Yes
It seems that the first printed operand is the bitcast. How can I get the bitcast and the operand, type, and function it is casting?
It seems that Value::stripPointerCasts() is a way to get the cast decf2 function as a Function *.
I still need to work out how to get a_type from there.

How references are really implemented in C++? [duplicate]

I'm just wondering how references are actually implemented across different compilers and debug/release configurations. Does the standard provide recommendations on their implementation? Do implementations differ?
I tried to run a simple program where I return non-const references and pointers to local variables from functions, but they worked out the same way. Does this mean that references are internally just a pointer?
Just to repeat some of the stuff everyone's been saying, lets look at some compiler output:
#include <stdio.h>
#include <stdlib.h>
int byref(int & foo)
{
printf("%d\n", foo);
}
int byptr(int * foo)
{
printf("%d\n", *foo);
}
int main(int argc, char **argv) {
int aFoo = 5;
byref(aFoo);
byptr(&aFoo);
}
We can compile this with LLVM (with optimizations turned off) and we get the following:
define i32 #_Z5byrefRi(i32* %foo) {
entry:
%foo_addr = alloca i32* ; <i32**> [#uses=2]
%retval = alloca i32 ; <i32*> [#uses=1]
%"alloca point" = bitcast i32 0 to i32 ; <i32> [#uses=0]
store i32* %foo, i32** %foo_addr
%0 = load i32** %foo_addr, align 8 ; <i32*> [#uses=1]
%1 = load i32* %0, align 4 ; <i32> [#uses=1]
%2 = call i32 (i8*, ...)* #printf(i8* noalias getelementptr inbounds ([4 x i8]* #.str, i64 0, i64 0), i32 %1) ; <i32> [#uses=0]
br label %return
return: ; preds = %entry
%retval1 = load i32* %retval ; <i32> [#uses=1]
ret i32 %retval1
}
define i32 #_Z5byptrPi(i32* %foo) {
entry:
%foo_addr = alloca i32* ; <i32**> [#uses=2]
%retval = alloca i32 ; <i32*> [#uses=1]
%"alloca point" = bitcast i32 0 to i32 ; <i32> [#uses=0]
store i32* %foo, i32** %foo_addr
%0 = load i32** %foo_addr, align 8 ; <i32*> [#uses=1]
%1 = load i32* %0, align 4 ; <i32> [#uses=1]
%2 = call i32 (i8*, ...)* #printf(i8* noalias getelementptr inbounds ([4 x i8]* #.str, i64 0, i64 0), i32 %1) ; <i32> [#uses=0]
br label %return
return: ; preds = %entry
%retval1 = load i32* %retval ; <i32> [#uses=1]
ret i32 %retval1
}
The bodies of both functions are identical
Sorry for using assembly to explain this but I think this is the best way to understand how references are implemented by compilers.
#include <iostream>
using namespace std;
int main()
{
int i = 10;
int *ptrToI = &i;
int &refToI = i;
cout << "i = " << i << "\n";
cout << "&i = " << &i << "\n";
cout << "ptrToI = " << ptrToI << "\n";
cout << "*ptrToI = " << *ptrToI << "\n";
cout << "&ptrToI = " << &ptrToI << "\n";
cout << "refToNum = " << refToI << "\n";
//cout << "*refToNum = " << *refToI << "\n";
cout << "&refToNum = " << &refToI << "\n";
return 0;
}
Output of this code is like this
i = 10
&i = 0xbf9e52f8
ptrToI = 0xbf9e52f8
*ptrToI = 10
&ptrToI = 0xbf9e52f4
refToNum = 10
&refToNum = 0xbf9e52f8
Lets look at the disassembly(I used GDB for this. 8,9 and 10 here are line numbers of code)
8 int i = 10;
0x08048698 <main()+18>: movl $0xa,-0x10(%ebp)
Here $0xa is the 10(decimal) that we are assigning to i. -0x10(%ebp) here means content of ebp register –16(decimal).
-0x10(%ebp) points to the address of i on stack.
9 int *ptrToI = &i;
0x0804869f <main()+25>: lea -0x10(%ebp),%eax
0x080486a2 <main()+28>: mov %eax,-0x14(%ebp)
Assign address of i to ptrToI. ptrToI is again on stack located at address -0x14(%ebp), that is ebp – 20(decimal).
10 int &refToI = i;
0x080486a5 <main()+31>: lea -0x10(%ebp),%eax
0x080486a8 <main()+34>: mov %eax,-0xc(%ebp)
Now here is the catch! Compare disassembly of line 9 and 10 and you will observer that ,-0x14(%ebp) is replaced by -0xc(%ebp) in line number 10. -0xc(%ebp) is the address of refToNum. It is allocated on stack. But you will never be able to get this address from you code because you are not required to know the address.
So; a reference does occupy memory. In this case it is the stack memory since we have allocated it as a local variable.
How much memory does it occupy?
As much a pointer occupies.
Now lets see how we access the reference and pointers. For simplicity I have shown only part of the assembly snippet
16 cout << "*ptrToI = " << *ptrToI << "\n";
0x08048746 <main()+192>: mov -0x14(%ebp),%eax
0x08048749 <main()+195>: mov (%eax),%ebx
19 cout << "refToNum = " << refToI << "\n";
0x080487b0 <main()+298>: mov -0xc(%ebp),%eax
0x080487b3 <main()+301>: mov (%eax),%ebx
Now compare the above two lines, you will see striking similarity. -0xc(%ebp) is the actual address of refToI which is never accessible to you.
In simple terms, if you think of reference as a normal pointer, then accessing a reference is like fetching the value at address pointed to by the reference. Which means the below two lines of code will give you the same result
cout << "Value if i = " << *ptrToI << "\n";
cout << " Value if i = " << refToI << "\n";
Now compare this
15 cout << "ptrToI = " << ptrToI << "\n";
0x08048713 <main()+141>: mov -0x14(%ebp),%ebx
21 cout << "&refToNum = " << &refToI << "\n";
0x080487fb <main()+373>: mov -0xc(%ebp),%eax
I guess you are able to spot what is happening here.
If you ask for &refToI, the contents of -0xc(%ebp) address location are returned and -0xc(%ebp) is where refToi resides and its contents are nothing but address of i.
One last thing, Why is this line commented?
//cout << "*refToNum = " << *refToI << "\n";
Because *refToI is not permitted and it will give you a compile time error.
The natural implementation of a reference is indeed a pointer. However, do not depend on this in your code.
In Bjarne's words:
Like a pointer, a reference is an alias for an object, is usually implemented to hold a machine address of an object, and does not impose performance overhead compared to pointers, but it differs from a pointer in that:
• You access a reference with exactly the same syntax as the name of an object.
• A reference always refers to the object to which it was initialized.
• There is no ‘‘null reference,’’ and we may assume that a reference refers to an object
Though a reference is in reality a pointer, but it shouldn't be used like a pointer but as an alias.
There is no need a reference to be a pointer.
In many cases, it is, but in other cases it is just an alias and there is no need of separate memory allocation for a pointer.
assembly samples are not always correct, because they depend heavily on optimizations and how "smart" is the compiler.
for example:
int i;
int& j = i;
does not need to generate any additional code or allocate any additional memory.
I can't say this is right for sure, but I did some Googling and found this statement:
The language standard does not require
any particular mechanism. Each
implementation is free to do it in any
way, as long as the behavior is
compliant.
Source: Bytes.com
Reference is not pointer. This is fact. Pointer can bind to another object, has its own operations like dereferencing and incrementing / decrementing.
Although internally, reference may be implemented as a pointer. But this is an implementation detail which does not change the fact that references cannot be interchanged with pointers. And one cannot write code assuming references are implemented as pointers.

How are references implemented internally?

I'm just wondering how references are actually implemented across different compilers and debug/release configurations. Does the standard provide recommendations on their implementation? Do implementations differ?
I tried to run a simple program where I return non-const references and pointers to local variables from functions, but they worked out the same way. Does this mean that references are internally just a pointer?
Just to repeat some of the stuff everyone's been saying, lets look at some compiler output:
#include <stdio.h>
#include <stdlib.h>
int byref(int & foo)
{
printf("%d\n", foo);
}
int byptr(int * foo)
{
printf("%d\n", *foo);
}
int main(int argc, char **argv) {
int aFoo = 5;
byref(aFoo);
byptr(&aFoo);
}
We can compile this with LLVM (with optimizations turned off) and we get the following:
define i32 #_Z5byrefRi(i32* %foo) {
entry:
%foo_addr = alloca i32* ; <i32**> [#uses=2]
%retval = alloca i32 ; <i32*> [#uses=1]
%"alloca point" = bitcast i32 0 to i32 ; <i32> [#uses=0]
store i32* %foo, i32** %foo_addr
%0 = load i32** %foo_addr, align 8 ; <i32*> [#uses=1]
%1 = load i32* %0, align 4 ; <i32> [#uses=1]
%2 = call i32 (i8*, ...)* #printf(i8* noalias getelementptr inbounds ([4 x i8]* #.str, i64 0, i64 0), i32 %1) ; <i32> [#uses=0]
br label %return
return: ; preds = %entry
%retval1 = load i32* %retval ; <i32> [#uses=1]
ret i32 %retval1
}
define i32 #_Z5byptrPi(i32* %foo) {
entry:
%foo_addr = alloca i32* ; <i32**> [#uses=2]
%retval = alloca i32 ; <i32*> [#uses=1]
%"alloca point" = bitcast i32 0 to i32 ; <i32> [#uses=0]
store i32* %foo, i32** %foo_addr
%0 = load i32** %foo_addr, align 8 ; <i32*> [#uses=1]
%1 = load i32* %0, align 4 ; <i32> [#uses=1]
%2 = call i32 (i8*, ...)* #printf(i8* noalias getelementptr inbounds ([4 x i8]* #.str, i64 0, i64 0), i32 %1) ; <i32> [#uses=0]
br label %return
return: ; preds = %entry
%retval1 = load i32* %retval ; <i32> [#uses=1]
ret i32 %retval1
}
The bodies of both functions are identical
Sorry for using assembly to explain this but I think this is the best way to understand how references are implemented by compilers.
#include <iostream>
using namespace std;
int main()
{
int i = 10;
int *ptrToI = &i;
int &refToI = i;
cout << "i = " << i << "\n";
cout << "&i = " << &i << "\n";
cout << "ptrToI = " << ptrToI << "\n";
cout << "*ptrToI = " << *ptrToI << "\n";
cout << "&ptrToI = " << &ptrToI << "\n";
cout << "refToNum = " << refToI << "\n";
//cout << "*refToNum = " << *refToI << "\n";
cout << "&refToNum = " << &refToI << "\n";
return 0;
}
Output of this code is like this
i = 10
&i = 0xbf9e52f8
ptrToI = 0xbf9e52f8
*ptrToI = 10
&ptrToI = 0xbf9e52f4
refToNum = 10
&refToNum = 0xbf9e52f8
Lets look at the disassembly(I used GDB for this. 8,9 and 10 here are line numbers of code)
8 int i = 10;
0x08048698 <main()+18>: movl $0xa,-0x10(%ebp)
Here $0xa is the decimal 10 that we assign to i. -0x10(%ebp) means the address held in the ebp register minus 16 (decimal).
So -0x10(%ebp) is the location of i on the stack.
9 int *ptrToI = &i;
0x0804869f <main()+25>: lea -0x10(%ebp),%eax
0x080486a2 <main()+28>: mov %eax,-0x14(%ebp)
This assigns the address of i to ptrToI. ptrToI is also on the stack, at -0x14(%ebp), that is, ebp minus 20 (decimal).
10 int &refToI = i;
0x080486a5 <main()+31>: lea -0x10(%ebp),%eax
0x080486a8 <main()+34>: mov %eax,-0xc(%ebp)
Now here is the catch! Compare the disassembly of lines 9 and 10 and you will observe that -0x14(%ebp) is replaced by -0xc(%ebp) in line 10. -0xc(%ebp) is the address of refToI itself, and it is allocated on the stack. But you will never be able to get this address from your code, because you are not required to know it.
So a reference does occupy memory. In this case it is stack memory, since we declared it as a local variable.
How much memory does it occupy?
As much as a pointer occupies.
Now let's see how we access references and pointers. For simplicity I have shown only part of the assembly snippet.
16 cout << "*ptrToI = " << *ptrToI << "\n";
0x08048746 <main()+192>: mov -0x14(%ebp),%eax
0x08048749 <main()+195>: mov (%eax),%ebx
19 cout << "refToNum = " << refToI << "\n";
0x080487b0 <main()+298>: mov -0xc(%ebp),%eax
0x080487b3 <main()+301>: mov (%eax),%ebx
Now compare the two snippets above and you will see a striking similarity. -0xc(%ebp) is the actual address of refToI, which is never accessible to you.
In simple terms, if you think of a reference as a normal pointer, then accessing the reference is like fetching the value at the address it holds. This means the two lines of code below give the same result:
cout << "Value of i = " << *ptrToI << "\n";
cout << "Value of i = " << refToI << "\n";
Now compare this
15 cout << "ptrToI = " << ptrToI << "\n";
0x08048713 <main()+141>: mov -0x14(%ebp),%ebx
21 cout << "&refToNum = " << &refToI << "\n";
0x080487fb <main()+373>: mov -0xc(%ebp),%eax
I guess you can spot what is happening here.
If you ask for &refToI, the contents of the location -0xc(%ebp) are returned. -0xc(%ebp) is where refToI resides, and its contents are nothing but the address of i.
One last thing: why is this line commented out?
//cout << "*refToNum = " << *refToI << "\n";
Because *refToI is not permitted: refToI already stands for the int i, and an int cannot be dereferenced, so the line gives a compile-time error.
The natural implementation of a reference is indeed a pointer. However, do not depend on this in your code.
In Bjarne's words:
Like a pointer, a reference is an alias for an object, is usually implemented to hold a machine address of an object, and does not impose performance overhead compared to pointers, but it differs from a pointer in that:
• You access a reference with exactly the same syntax as the name of an object.
• A reference always refers to the object to which it was initialized.
• There is no ‘‘null reference,’’ and we may assume that a reference refers to an object
Although a reference is in reality a pointer, it should not be used like a pointer but as an alias.
A reference does not need to be a pointer.
In many cases it is, but in other cases it is just an alias, and no separate memory needs to be allocated for a pointer.
Assembly samples do not always tell the whole story, because they depend heavily on optimization settings and on how "smart" the compiler is.
For example:
int i;
int& j = i;
This does not need to generate any additional code or allocate any additional memory.
I can't say this is right for sure, but I did some Googling and found this statement:
The language standard does not require any particular mechanism. Each implementation is free to do it in any way, as long as the behavior is compliant.
Source: Bytes.com
A reference is not a pointer. That is a fact. A pointer can be rebound to another object and has its own operations, such as dereferencing and incrementing or decrementing.
Internally, a reference may be implemented as a pointer, but that is an implementation detail. It does not change the fact that references cannot be interchanged with pointers, and one cannot write code that assumes references are implemented as pointers.