How to differentiate stack/heap addresses in llvm IR code?


I'd like to find a way to determine if a load/store operand in LLVM IR is a stack address or a heap address in an LLVM pass (the pass coded in C++), i.e.
if (inst is a store) {
if (inst->getOperand(1) is a heap address) {
// do something with the heap address
}
}
And operate similarly for loads. Looking in the IR code, they are referenced the same:
store i32 5, i32* %c, align 4 ; storing value to a local variable
store i32 1, i32* %4, align 4 ; storing value to something on the heap, do something with the heap address
Any ideas?

My frontend does this (well, something a little like it). You may not be able to do it well enough to reach your goals, but if you do, this is one approach:
Regard each return value of malloc() (or whatever your allocator is called) as a heap variable and each result of an alloca as a stack variable. For each of those, classify more values by iterating over the users, e.g. for(auto x : y->users()); a getelementptr or cast of a malloc() result is also a heap variable.
However, this doesn't classify every value. Loading a pointer from a struct/array on the heap may return something on the stack and vice versa. Function arguments may be either. But perhaps you don't need to classify every value.
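The propagation described in this answer can be sketched as a small worklist algorithm. The sketch below deliberately does not use the LLVM API: ToyValue, Op, Region and classify are hypothetical names, and the "IR" is just a vector where each value records the index of the pointer it derives from. It only shows the idea of seeding from allocation sites and following getelementptr/cast users; loads and arguments stay unclassified, as the answer warns.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Toy stand-in for IR values; this is NOT the LLVM API.
enum class Op { Malloc, Alloca, GetElementPtr, Cast, Load, Argument };
enum class Region { Unknown, Stack, Heap };

struct ToyValue {
    Op op;
    int base;  // index of the pointer operand this value derives from, or -1
};

// Seed from allocation sites, then propagate the region to users that
// merely compute a new address from the same object (getelementptr, casts).
std::vector<Region> classify(const std::vector<ToyValue>& values) {
    // Build the def-use edges: users[i] lists the values derived from value i.
    std::vector<std::vector<std::size_t>> users(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i].base >= 0) {
            const auto base = static_cast<std::size_t>(values[i].base);
            assert(base < values.size());
            users[base].push_back(i);
        }
    }

    std::vector<Region> region(values.size(), Region::Unknown);
    std::queue<std::size_t> worklist;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i].op == Op::Malloc) {
            region[i] = Region::Heap;
        } else if (values[i].op == Op::Alloca) {
            region[i] = Region::Stack;
        } else {
            continue;
        }
        worklist.push(i);
    }

    while (!worklist.empty()) {
        const std::size_t current = worklist.front();
        worklist.pop();
        for (std::size_t user : users[current]) {
            const Op op = values[user].op;
            const bool derives_address = (op == Op::GetElementPtr || op == Op::Cast);
            if (derives_address && region[user] == Region::Unknown) {
                region[user] = region[current];
                worklist.push(user);
            }
        }
    }
    return region;  // loads and arguments remain Region::Unknown
}
```

In a real pass the same loop would presumably walk actual instructions and their users instead of vector indices; the point of the sketch is only that a pointer loaded from memory never gets a region this way.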

Related

Addressing stack variables

As far as I understand, stack variables are stored using an absolute offset to the stack frame pointer.
But how are those variables addressed later?
Consider the following code:
#include <iostream>
int main()
{
int a = 0;
int b = 1;
int c = 2;
std::cout << b << std::endl;
}
How does the compiler know where to find b? Does it store its offset to the stack frame pointer? And if so, where is this information stored? And does that mean that int needs more than 4 bytes to be stored?

The location (relative to the stack pointer) of stack variables is a compile-time constant.
The compiler always knows how many things it's pushed to the stack since the beginning of the function and therefore the relative position of any one of them within the stack frame. (Unless you use alloca or VLAs; see footnote 1.)
On x86 this is usually achieved by addressing relative to the ebp or esp registers, which are typically used to represent the "beginning" and "end" of the stack frame. The offsets themselves don't need to be stored anywhere as they are built into the instruction as part of the addressing scheme.
Note that local variables are not always stored on the stack.
The compiler is free to put them wherever it wants, so long as it behaves as if it were allocated on the stack.
In particular, small objects like integers may simply stay in a register for the full duration of their lifespans (or until the compiler is forced to spill them onto the stack), constants may be stored in read-only memory, or any other optimization that the compiler deems fit.
Footnote 1: In functions that use alloca or a VLA, the compiler will use a separate register (like RBP in x86-64) as a "frame pointer" even in an optimized build, and address locals relative to the frame pointer, not the stack pointer. The number of named C variables is known at compile time, so they can go at the top of the stack frame where the offset from them to the frame pointer is constant. Multiple VLAs can just work as pointers to space allocated as if by alloca. (That's one typical implementation strategy.)

From LLVM, how can I determine the architecture's maximum alignment?

I have cases where I need to alloca space for an object with size, layout, and alignment that is unknown at compile time. These values are accessible at runtime, but as far as I can tell, the align attribute on an alloca instruction must be compile-time constant, rather than an instruction argument.
How can I safely obtain an align value which will be strict enough to align to any primitive data type on the target platform? (The equivalent of this in C++ would be alignof(std::max_align_t)).

C++ how are variables accessed in memory?

When I create a new variable in a C++ program, eg a char:
char c = 'a';
how does C++ then have access to this variable in memory? I would imagine that it would need to store the memory location of the variable, but then that would require a pointer variable, and this pointer would again need to be accessed.

See the docs:
When a variable is declared, the memory needed to store its value is
assigned a specific location in memory (its memory address).
Generally, C++ programs do not actively decide the exact memory
addresses where its variables are stored. Fortunately, that task is
left to the environment where the program is run - generally, an
operating system that decides the particular memory locations on
runtime. However, it may be useful for a program to be able to obtain
the address of a variable during runtime in order to access data cells
that are at a certain position relative to it.
You can also refer to this article on Variables and Memory:
The Stack
The stack is where local variables and function parameters reside. It
is called a stack because it follows the last-in, first-out principle.
As data is added or pushed to the stack, it grows, and when data is
removed or popped it shrinks. In reality, memory addresses are not
physically moved around every time data is pushed or popped from the
stack, instead the stack pointer, which as the name implies points to
the memory address at the top of the stack, moves up and down.
Everything below this address is considered to be on the stack and
usable, whereas everything above it is off the stack, and invalid.
This is all accomplished automatically by the operating system, and as
a result it is sometimes also called automatic memory. On the
extremely rare occasions that one needs to be able to explicitly
invoke this type of memory, the C++ key word auto can be used.
Normally, one declares variables on the stack like this:
void func () {
int i; float x[100];
...
}
Variables that are declared on the stack are only valid within the
scope of their declaration. That means when the function func() listed
above returns, i and x will no longer be accessible or valid.
There is another limitation to variables that are placed on the stack:
the operating system only allocates a certain amount of space to the
stack. As each part of a program that is being executed comes into
scope, the operating system allocates the appropriate amount of memory
that is required to hold all the local variables on the stack. If this
is greater than the amount of memory that the OS has allowed for the
total size of the stack, then the program will crash. While the
maximum size of the stack can sometimes be changed by compile time
parameters, it is usually fairly small, and nowhere near the total
amount of RAM available on a machine.

Assuming this is a local variable, then this variable is allocated on the stack, i.e. in RAM. The compiler keeps track of the variable's offset on the stack. In the basic scenario, when a computation is performed with the variable, it is loaded into one of the processor's registers and the CPU performs the computation. Afterwards the result is written back to RAM. Modern compilers often keep variables in registers for their whole lifetime, and modern processors have several levels of cache between the registers and RAM, so it can get quite complex.
Please note that the name "c" is no longer mentioned in the binary (unless you have debugging symbols). The binary then only works with memory locations. E.g. it would look like this (simple addition):
a = b + c
take value of memory offset 1 and put it in the register 1
take value of memory offset 2 and put it in the register 2
sum registers 1 and 2 and store the result in register 3
copy the register 3 to memory location 3
The binary doesn't know "a", "b" or "c". The compiler just said "b is in memory 1, c is in memory 2, a is in memory 3". And the CPU just blindly executes the commands the compiler has generated.
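To make the four steps above concrete, here is a minimal sketch of a made-up machine with a few memory cells and registers. ToyMachine, add_via_memory and the kAddr constants are hypothetical names for illustration only; the layout (b in cell 1, c in cell 2, a in cell 3) follows the listed steps, and no real CPU or compiler works exactly like this.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// A made-up machine: a few memory cells and a few registers.
struct ToyMachine {
    std::array<int, 4> memory{};
    std::array<int, 4> reg{};

    void load(std::size_t r, std::size_t address) { reg.at(r) = memory.at(address); }
    void add(std::size_t dst, std::size_t lhs, std::size_t rhs) {
        reg.at(dst) = reg.at(lhs) + reg.at(rhs);
    }
    void store(std::size_t r, std::size_t address) { memory.at(address) = reg.at(r); }
};

// The "compiler's" private bookkeeping: names map to fixed cells.
// These names never appear in the instruction sequence below.
constexpr std::size_t kAddrB = 1;
constexpr std::size_t kAddrC = 2;
constexpr std::size_t kAddrA = 3;

// Executes a = b + c using only cell and register numbers.
int add_via_memory(int b, int c) {
    ToyMachine m;
    m.memory[kAddrB] = b;
    m.memory[kAddrC] = c;

    m.load(1, kAddrB);   // memory 1 -> register 1
    m.load(2, kAddrC);   // memory 2 -> register 2
    m.add(3, 1, 2);      // register 1 + register 2 -> register 3
    m.store(3, kAddrA);  // register 3 -> memory 3

    return m.memory[kAddrA];
}
```

Calling add_via_memory(2, 3) walks through exactly those four instructions; the variable names exist only in the constants the "compiler" used to pick the cells.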

C++ itself (or, the compiler) would have access to this variable in terms of the program structure, represented as a data structure. Perhaps you're asking how other parts in the program would have access to it at run time.
The answer is that it varies. It can be stored either in a register, on the stack, on the heap, or in the data/bss sections (global/static variables), depending on its context and the platform it was compiled for: If you needed to pass it around by reference (or pointer) to other functions, then it would likely be stored on the stack. If you only need it in the context of your function, it would probably be handled in a register. If it's a member variable of an object on the heap, then it's on the heap, and you reference it by an offset into the object. If it's a global/static variable, then its address is determined once the program is fully loaded into memory.
C++ eventually compiles down to machine language, and often runs within the context of an operating system, so you might want to brush up a bit on Assembly basics, or even some OS principles, to better understand what's going on under the hood.

Let's say our program starts with a stack address of 4000000.
When you call a function, it "allocates" stack space depending on how much it uses, like this.
Let's say we have 2 ints (8 bytes):
void function()
{
int a = 0;
int b = 0;
}
Then what happens in assembly is:
MOV EBP, ESP // here we store the original value of the stack address (4000000) in EBP, so it can be restored to 4000000 at the end of the function
SUB ESP, 8 // here we "allocate" 8 bytes on the stack, which just decreases the ESP address by 8
so our ESP address was changed from
4000000
to
3999992
That's how the program knows that the first int starts at stack address 3999992 and the second int occupies 3999996 up to (but not including) 4000000.
Even though this happens below the level of the C++ source, it's really important to know, because once you know how the stack is "allocated" you realize how cheap it is to do things like
char my_array[20000];
since all it does is sub esp, 20000, which is a single assembly instruction.
But if you actually use all those bytes, e.g. memset(my_array, 0, 20000), that's a different story.
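The address arithmetic in this answer can be replayed in a few lines. This is only a sketch of the numbers, not real stack manipulation: Frame and two_int_frame are hypothetical names, ints are assumed to be 4 bytes as in the answer, and the simplified prologue (no PUSH EBP) is the one the answer shows.

```cpp
#include <cassert>
#include <cstdint>

// Assumption taken from the answer: an int is 4 bytes.
constexpr std::uint32_t kIntSize = 4;

struct Frame {
    std::uint32_t ebp;         // copy of ESP at function entry
    std::uint32_t esp;         // ESP after reserving space for the locals
    std::uint32_t first_int;   // address where the first int starts
    std::uint32_t second_int;  // address where the second int starts
};

// Replays "MOV EBP, ESP" followed by "SUB ESP, 8" for a function
// with two int locals.
Frame two_int_frame(std::uint32_t esp_on_entry) {
    const std::uint32_t local_bytes = 2 * kIntSize;
    assert(esp_on_entry >= local_bytes);

    Frame f{};
    f.ebp = esp_on_entry;               // MOV EBP, ESP
    f.esp = esp_on_entry - local_bytes; // SUB ESP, 8
    f.first_int = f.esp;                // same address as EBP - 8
    f.second_int = f.esp + kIntSize;    // same address as EBP - 4
    return f;
}
```

With an entry value of 4000000 this reproduces the numbers above: ESP ends up at 3999992 and the second int starts at 3999996. Reserving 20000 bytes instead would be the same single subtraction, which is why the allocation itself is cheap.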

how does C++ then have access to this variable in memory?
It doesn't!
Your computer does, and it is instructed on how to do that by loading the location of the variable in memory into a register. This is all handled by assembly language. I shan't go into the details here of how such languages work (you can look it up!) but this is rather the purpose of a C++ compiler: to turn an abstract, high-level set of "instructions" into actual technical instructions that a computer can understand and execute. You could sort of say that assembly programs contain a lot of pointers, though most of them are literals rather than "variables".

Confusion between GetElementPtr and C++ API

Looking at the documentation of GetElementPtr:
http://llvm.org/docs/GetElementPtr.html
The examples rely on multiple indexes: the 1st for the struct member and the 2nd for the element in the array. This supposedly returns an offset from the base pointer.
I'm trying to figure out what's the correct way to create a given GetElementPtr instruction with the C++ API. Unfortunately, there are several varieties of the CreateXXXGEP instruction, with a parameter "val" that I presume is one of the indices. No version of it seems to use two indices as in the documentation: http://llvm.org/docs/doxygen/html/classllvm_1_1IRBuilder.html
Even the CreateStructGEP uses a single idx parameter!
I want to do a very simple thing; I want to take a char buffer:
Value* vB = llvm::ConstantDataArray::getString(...)
and use the pointer to the array to pass it to another function that expects i8*.

You're probably looking for the variants taking an array of Value *. Construct ConstantInts and put them in an std::vector and pass them in.

Is it possible for a pointer to point to an address of 0x000000

This pointless question is about pointers; can someone point me in the right direction?
Can the address of a variable ever legitimately have a value of 0x000000? If so, does that mean that, for example:
#define NULL 0x00000000
int example;
int * pointerToObject = &example;
if(pointerToObject != NULL)
will return false?

No, you cannot get a valid (=nonnull) pointer p for which p != NULL returns false.
Note, however, that this does not imply anything about "address 0" (whatever that means). C++ is intentionally very abstract when referring to addresses and pointers. There is a null pointer value, i.e. the value of a pointer which does not point anywhere. The way to create such a pointer is to assign a null pointer constant into a pointer variable. In C++03, a null pointer constant was the literal 0, optionally with a suffix (e.g. 0L). C++11 added another null pointer constant, the literal nullptr.
How the null pointer value is represented internally is beyond the scope of the C++ language. On some (most?) systems, the address value 0x00000000 is used for this, since nothing a user program can point to can legally reside at that address. However, there is nothing stopping a hardware platform from using the value 0xDEADBEEF to represent a null pointer. On such platform, this code:
int *p = 0;
would have to be compiled as assigning the value 0xDEADBEEF into the 4 bytes occupied by the variable p.
On such a platform, you could legally get a pointer to address 0x00000000, but it wouldn't compare equal to 0, NULL or nullptr.
A different way of looking at this is that the following declarations are not equivalent:
int *p = 0;
int *q = reinterpret_cast<int*>(0);
p is initialised with the null pointer value. q is initialised with the address 0 (depending on the mapping used in reinterpret_cast, of course).
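A short sketch of the comparisons discussed in this answer, written so that it never depends on how the platform represents the null pointer value. NullChecks, is_null and run_null_checks are hypothetical helper names.

```cpp
#include <cassert>

// True when p holds the null pointer value, whatever bit pattern
// the platform happens to use for it.
bool is_null(const int* p) { return p == nullptr; }

struct NullChecks {
    bool zero_literal_is_null;
    bool nullptr_is_null;
    bool object_address_is_null;
};

NullChecks run_null_checks() {
    int example = 0;
    int* from_zero = 0;            // null pointer constant (C++03 style)
    int* from_nullptr = nullptr;   // null pointer constant (C++11)
    int* to_object = &example;     // points at a real object

    assert(from_zero == from_nullptr);  // both hold the same null pointer value
    return {is_null(from_zero), is_null(from_nullptr), is_null(to_object)};
}
```

The first two fields come out true and the last one false on any conforming implementation: the address of an existing object never compares equal to the null pointer value, regardless of which numeric address the hardware uses for either.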

Yes, sure, you can have null pointers. And that check in your question will return false for null pointers. For example, malloc() can return a null pointer on failure and you'd better check for that.
0x000000 is interpreted as a zero literal by the compiler and so can yield a null pointer. Same as the code from here:
int* ptr = int();

You first have to understand how memory is allocated when you execute a given program.
There are three kinds of memory:
  Fixed memory
  Stack memory
  Heap memory
Fixed memory
  Executable code
  Global variables
  Constant structures that don’t fit inside a machine instruction (constant arrays, strings, floating points, long integers etc.)
  Static variables
  Subroutine local variables in non-recursive languages (e.g. early FORTRAN)
Stack memory
  Local variables for functions, whose size can be determined at call time
  Information saved at function call and restored at function return:
    Values of callee arguments
    Register values:
      Return address (value of PC)
      Frame pointer (value of FP)
      Other registers
    Static link (to be discussed)
Heap memory
  Structures whose size varies dynamically (e.g. variable length arrays or strings)
  Structures that are allocated dynamically (e.g. records in a linked list)
  Structures created by a function call that must survive after the call returns
  Issues:
    Allocation and free space management
    Deallocation / garbage collection
How is the program loaded into memory?
When a program is loaded into memory, it is organized into three areas of memory, called segments: the text segment, stack segment, and heap segment. The text segment (sometimes also called the code segment) is where the compiled code of the program itself resides. This is the machine language representation of the program steps to be carried out, including all functions making up the program, both user defined and system.
The remaining two areas of system memory are where storage may be allocated by the compiler for data storage.
How is memory organized
Of course the zero you see in that image is relative to the offset.
Conclusion
Even if a program is loaded into memory starting at address 0x00000000, there is no way for the storage mechanism to assign that address to constant, static, or dynamic data that the program uses, since all those elements are stored in the data, heap, or stack sections, and those are allocated after the .text section.
Discussion
I think that NULL pointers are set to 0x000000 because there is no way for a variable (dynamic, local, static or global) to be allocated at that address.

It is not possible for C++ to access the memory with address 0. C++ reserves that value (but not necessarily that physical address) for a null pointer.
So your statement if (pointerToObject != NULL) will never be false.
Pre C++11 you'd refer to that pointer using NULL which, typically, was a macro defined to 0 or (void*)0. In C++11, the keyword nullptr is used for that specific pointer value.
One useful property is that delete p will be benign if p == nullptr. free in C has a similar property.
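That property is what makes "delete, then reset to null" helpers safe to call more than once. A minimal sketch, where destroy and release are hypothetical helper names and not part of any library:

```cpp
#include <cassert>
#include <cstdlib>

// Deletes the object (if any) and resets the pointer, so a second call
// only ever sees a null pointer, on which delete does nothing.
void destroy(int*& p) {
    delete p;
    p = nullptr;
}

// Same idea for memory obtained from malloc: free(NULL) is a no-op.
void release(void*& block) {
    std::free(block);
    block = nullptr;
}
```

Calling destroy(p) twice in a row is fine for this reason: the second call performs delete on a null pointer. Without the reset to nullptr, the second delete would be a double free, which is undefined behaviour.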
