LLVM: Replacing all instances of an address with a constant - llvm

I'm trying to replace all instances of an address with a constant.
I'm getting & testing the address of store with the following (i is an instruction)
//already know it's a store instruction at this point
llvm::Value *addy = i->getOperand(0);
if(llvm::ConstantInt* c = llvm::dyn_cast<llvm::ConstantInt>(addy)){
//replace all uses of the address with the constant
//operand(1) will be the address the const would be stored at
i->getOperand(1)->replaceAllUsesWith(c);
}
I'd think this would work, but I'm getting the error that
"Assertion: New->getType()== getType() && replaceAllUses of value with new value of different type!" failed
and I'm not sure why... My understanding of replaceAllUsesWith is that it would replace every use of the address (i->getOperand(1)) with the constant?

The error message is pretty straightforward: the type of the new value is not identical to the type of the old value that you are replacing.
LLVM IR is strongly typed, and as you can see in the language reference, every instruction has a specific type it expects as each operand. For example, store requires that the address's type will always be a pointer to the type of the value being stored.
As a result, whenever you replace the usage of a value, you must ensure first that they both have the same type - replaceAllUsesWith actually has an assert to verify it, as you can see, and you failed it. It's also simple to see why: operand 1 of a store instruction is always of some pointer type, and a ConstantInt always represents something of some integer type, so surely they can never match.
What exactly are you trying to achieve? Perhaps you are thinking about replacing each load from that store's address with a use of the constant? In that case, you'll have to find all the loads that use that address yourself, and for each of those loads (not for the address) call replaceAllUsesWith with the constant. There are standard LLVM passes that can do those things for you, by the way - check out the pass list. I'm guessing mem2reg followed by some constant propagation pass will take care of this.

Related

How to change returned value of function

There is a function in this program, that currently returns a 1. I would prefer for it to return a 0.
uregs[R_PC] is the program counter.
arg0 is the program counter offset from where we left the function (assembly, "ret").
From this I deduce: we can add the offset to the program counter, uregs[R_PC]+arg0, to find the address of the return value.
I have allocated a 32-bit "0", and I try to write 2 bytes of it into the address where the return value lives (our function expects to return a BOOL16, so we only need 2 bytes of 0):
sudo dtrace -p "$(getpid)" -w -n '
int *zero;
BEGIN { zero=alloca(4); *zero=0; }
pid$target::TextOutA:return {
copyout(zero, uregs[R_PC]+arg0, 2);
}'
Of course I get:
dtrace: error on enabled probe ID 2 (ID 320426: pid60498:gdi32.dll.so:TextOutA:return): invalid address (0x41f21c) in action #1 at DIF offset 60
uregs[R_PC] is presumably a userspace address. Probably copyout() wants a kernel address.
How do I translate the userspace address uregs[R_PC] to kernel-space? I know that with copyin() we can read data stored at user-space address, into kernel-space. But that doesn't give us the kernel address of that memory.
Alternatively: is there some other way to change the return value using DTrace?
DTrace is not the right tool for this. You should instead use a debugger like dbx, mdb or gdb.
In the meantime, I'll try to clarify some of the concepts that you've mentioned.
To begin, you may well see in the source code for a simple function that there is a single return. It is quite possible that the compiled result, i.e. the function's machine-specific implementation, also contains only a single point of exit. Typically, however, the implementation is likely to contain more than one exit point and it may be useful for a developer to know from which specific one a function returned. It is this information, described as an offset from the start of the function, that is given by a return probe's arg0. Your D script, then, is attempting to update part of the program or library itself; although the addition of arg0 makes the destination address somewhat random, the result is most likely still within the text section, which is read-only.
Secondly, in the common case, a function's implementation returns a value by storing it in a specific register; e.g. %rax on amd64. Thus overriding a return value would necessitate overriding a register value. This is impossible because DTrace's access to the user-land registers is read-only.
It is possible that a function is implemented in such a way that, as it returns, it recovers the return value from a specific memory location before writing it into the appropriate register. If this were the case then one could, indeed, modify the value in memory (given its location) just before it is accessed. However, this is going to work for only a subset of cases: the return value might equally be contained in another register or else simply expressed as a constant in the program text itself. In any case, it would be far more trouble than it's worth given the existence of more appropriate debugging tools.

c++ strings and pointers confusion

string * sptemp = (string *) 0x000353E0;
What does this code exactly want to say ?
I know that in the left side we define a string pointer but I couldn't understand the right part.
It means take a numeric value, convert it to a pointer with that value as the address it points to, and then use that value to initialise the variable sptemp.
If the memory at that address contains a valid string object, then you can use the pointer to access it. If not, trying to do so will give undefined behaviour.
string * sptemp = (string *) 0x000353E0;
What does this code exactly want to say ?
It says, treat the data located at address 0x000353E0 as though it holds a string and assign the address to the variable sptemp. The data can be accessed through the pointer variable sptemp after that.
These comments are mostly right, but not completely. We don't actually know that string is std::string here. It could be that string describes a bit of memory-mapped hardware whose address on the OP's embedded SBC is fixed by the hardware at 0x000353E0. In that case, this is completely sensible, and what people do all the time. The pointer "string *sptemp" is set to point to the hardware interface.
But it's probably nonsense.
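To make the memory-mapped-hardware reading concrete, here is a minimal sketch. Everything in it is hypothetical: the DeviceRegs layout and the fake_device object standing in for the hardware are assumptions made only so that the code can run on an ordinary machine. On a real board the address would be a fixed constant taken from the datasheet, like the 0x000353E0 in the question.

```cpp
#include <cstdint>

// Hypothetical register layout of a memory-mapped device (not from the question).
struct DeviceRegs {
    volatile std::uint32_t status;
    volatile std::uint32_t data;
};

// Stand-in for the hardware so the sketch can run on a desktop machine.
static DeviceRegs fake_device = {0u, 0u};

// Address of the stand-in device as a plain integer.
std::uintptr_t fake_device_address() {
    return reinterpret_cast<std::uintptr_t>(&fake_device);
}

// Turn a numeric address into a typed pointer, as the cast in the question does.
DeviceRegs* regs_at(std::uintptr_t address) {
    return reinterpret_cast<DeviceRegs*>(address);
}

// Write a value through the typed pointer and read it back.
std::uint32_t write_then_read(std::uintptr_t address, std::uint32_t value) {
    DeviceRegs* regs = regs_at(address);
    regs->data = value;
    return regs->data;
}
```

Calling write_then_read(fake_device_address(), 42u) goes through the same kind of cast as the question's line. It is only meaningful because an object of the right type really lives at that address; with an arbitrary number the behaviour would be undefined.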

Reading variable pointed by a pointer in llvm

Pointer type can be deduced through:
Value* v= i->getOperand(0);
.......
if(PointerType* pt=dyn_cast<PointerType>(v->getType())){
pt->getElementType()->getTypeID();
}
How can I read the value that this pointer points to?
i is a CallInst.
Given a CallInst, you can get an argument via getArgOperand() or iterate over all of them with arg_operands(). The arguments you get this way are just Values, and you can do anything you can do with other Values on them.
In particular, if those Values are constants, you can get the actual values used in the compiler - see this related Stack Overflow question: LLVM get constant integer back from Value*

trace register value in llvm

In llvm, can one trace back to the instruction that defines the value for a particular register? For example, if I have an instruction as:
%add14 = add i32 %add7, %add5
Is there a way for me to trace back to the instruction where add5 is defined?
First of all, there are no registers in LLVM IR: all those things with % in their names are just names of values. You don't store information inside those things, they are not variables or memory locations, they are just names. I recommend reading about SSA form, which helps explain this further.
In any case, what you need to do is invoke the getOperand(n) method on the instruction to get its nth operand - for example, getOperand(0) in your example will return the value named %add7. You can then check whether that value is indeed an instruction (as opposed to, say, a function argument) by checking its type (isa<Instruction>).
To emphasize - calling the getOperand method will give you the actual place in which the operand is defined, nothing else is required.

What is the use of pointer variables?

I've recently tried to really come to grips with references and pointers in C++, and I'm getting a little bit confused. I understand the * and & operators which can respectively get the value at an address and get the address of a value, however why can't these simply be used with basic types like ints?
I don't understand why you can't, for example, do something like the following and not use any weird pointer variable creation:
string x = "Hello";
int y = &x; //Set 'y' to the memory address of 'x'
cout << *y; //Output the value at the address 'y' (which is the memory address of 'x')
The code above should, theoretically in my mind, output the value of 'x'. 'y' contains the memory address of 'x', and hence '*y' should be 'x'. That is, if it worked; when I try to compile it, it doesn't. The compiler tells me it can't convert from a string to an int, which doesn't make much sense, since you'd think a memory address could be stored in an int just fine.
Why do we need to use special pointer variable declarations (e.g. string *y = &x)?
And inside this, if we take the * operator in the pointer declaration literally in the example in the line above, we are setting the value of 'y' to the memory address of 'x', but then later when we want to access the value at the memory address ('&x') we can use the same '*y' which we previously set to the memory address.
C and C++ resolve type information at compile-time, not runtime. Even runtime polymorphism relies on the compiler constructing a table of function pointers with offsets fixed at compile time.
For that reason, the only way the program can know that cout << *y; is printing a string is because y is strongly typed as a pointer-to-string (std::string*). The program cannot, from the address alone, determine that the object stored at address y is a std::string. (Even C++ RTTI does not allow this, you need enough type information to identify a polymorphic base class.)
In short, C is a typed language. You cannot store arbitrary things in variables.
Check the type safety article on Wikipedia. C/C++ prevents problematic operations and function calls at compilation time by checking the types of the operands and function parameters (but note that with explicit casts you can change the type of an expression).
It doesn't make sense to store a string in an integer; in the same way, it doesn't make sense to store a pointer in one.
Simply put, a memory address has a type, which is pointer. Pointers are not ints, so you can't store a pointer in an int variable. If you're curious why ints and pointers are not fungible, it's because the size of each is implementation defined (with certain restrictions) and there is no guarantee that they will be the same size.
For instance, as @Damien_The_Unbeliever pointed out, pointers on a 64-bit system must be 64 bits long, but it is perfectly legal for an int to be 32 bits, as long as it is no longer than a long and no shorter than a short.
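The size argument can be checked on whatever compiler you have. The following is a small sketch; the helper name pointer_fits_in is made up for illustration, and comparing sizeof values is a simplification that assumes pointers are plain flat addresses.

```cpp
#include <cstdint>

// True when an object pointer is no wider than Integer.
template <typename Integer>
constexpr bool pointer_fits_in() {
    return sizeof(Integer) >= sizeof(void*);
}

// The ordering the language guarantees among these integer types.
static_assert(sizeof(short) <= sizeof(int), "short is no longer than int");
static_assert(sizeof(int) <= sizeof(long), "int is no longer than long");

// std::uintptr_t, where the implementation provides it, can hold any void*.
static_assert(pointer_fits_in<std::uintptr_t>(), "uintptr_t holds a pointer");
```

On a typical 64-bit desktop target pointer_fits_in<int>() evaluates to false, while on many 32-bit targets it is true. That variation is exactly why the language does not let you assume it.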
As to why each data type has its own pointer type, that's because each type (especially user-defined types) is structured differently in memory. If we were to dereference typeless (or void) pointers, there would be no information indicating how that data should be interpreted. If, on the other hand, you were to create a universal pointer and do away with the "inconvenience" of specifying types, each entity in memory would probably have to be stored alongside its type information. While this is doable, it's far from efficient, and efficiency is one of C++'s design goals.
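As a rough illustration of why an untyped address is not enough, the sketch below reads the same four bytes twice through a void pointer, once as a float and once as an unsigned integer. The function names are invented for this example, and the specific bit pattern assumes the common IEEE 754 single-precision float format.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Interpret the bytes behind an untyped pointer as a float.
float read_as_float(const void* p) {
    float f;
    std::memcpy(&f, p, sizeof f);
    return f;
}

// Interpret the very same bytes as a 32-bit unsigned integer.
std::uint32_t read_as_u32(const void* p) {
    std::uint32_t u;
    std::memcpy(&u, p, sizeof u);
    return u;
}

// Convenience wrappers: the same object, two interpretations.
float float_view(float value) { return read_as_float(&value); }
std::uint32_t integer_view(float value) { return read_as_u32(&value); }
```

Nothing stored at the address says which of the two readings is the right one. With a typed pointer such as float*, the compiler carries that information for you.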
Some very low-level languages... like machine language... operate exactly as you describe. A number is a number, and it's up to the programmer to hold it in their heads what it represents. Generally speaking, the hope of higher level languages is to keep you from the concerns and potential for error that comes from that style of development.
You can actually disregard C++'s type-safety, at your peril. For instance, the gcc on a 32-bit machine I have will print "Hello" when I run this:
string x = "Hello";
int y = reinterpret_cast<int>(&x);
cout << *reinterpret_cast<string*>(y) << endl;
But as pretty much every other answerer has pointed out, there's no guarantee it would work on another computer. If I try this on a 64-bit machine, I get:
error: cast from ‘std::string*’ to ‘int’ loses precision
Which I can work around by changing it to a long:
string x = "Hello";
long y = reinterpret_cast<long>(&x);
cout << *reinterpret_cast<string*>(y) << endl;
The C++ standard specifies minimums for these types, but not maximums, so you really don't know what you're going to be dealing with when you face a new compiler. See: What does the C++ standard state the size of int, long type to be?
So the potential for writing non-portable code is high once you start going this route and "casting away" the safeties in the language. reinterpret_cast is the most dangerous type of casting...
When should static_cast, dynamic_cast, const_cast and reinterpret_cast be used?
But that's just technically drilling down into the "why not int" part specifically, in case you were interested. Note that as @BenVoight points out in the comment below, there does exist an integer type as of C99 called intptr_t which is guaranteed to hold any pointer. So there are much larger problems when you throw away type information than losing precision... like accidentally casting back to the wrong type!
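For completeness, here is how the "Hello" experiment might look with intptr_t instead of int or long. This is a sketch only: the helper names are invented, and it assumes the implementation provides std::intptr_t in <cstdint> (mainstream compilers do, although the type is formally optional).

```cpp
#include <cstdint>
#include <string>

// Store the address of a std::string in an integer wide enough to hold it.
std::intptr_t address_of(std::string& s) {
    return reinterpret_cast<std::intptr_t>(&s);
}

// Convert the integer back. Nothing checks that it really came from a std::string*.
std::string& string_at(std::intptr_t address) {
    return *reinterpret_cast<std::string*>(address);
}

// Round trip: pointer -> integer -> pointer, then read the string through it.
std::string round_trip(const std::string& text) {
    std::string x = text;
    std::intptr_t y = address_of(x);
    return string_at(y);
}
```

round_trip("Hello") gives back "Hello" without any loss of precision, but string_at would accept an integer obtained from a double* just as happily. The width problem is solved; the lost type information is not.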
C++ is a strongly typed language, and pointers and integers are different types. By making those separate types the compiler is able to detect misuses and tell you that what you are doing is incorrect.
At the same time, the pointer type maintains information on the type of the pointed-to object. If you obtain the address of a double, you have to store that in a double*, and the compiler knows that by dereferencing that pointer you will get to a double. In your example code, int y = &x; cout << *y; the compiler would lose the information of what y points to, the type of the expression *y would be unknown, and it would not be able to determine which of the different overloads of operator<< to call. Compare that with std::string *y = &x; where the compiler sees y, it knows it is a std::string* and knows that by dereferencing it you get to a std::string (and not a double or any other type), enabling the compiler to statically check all expressions that contain y.
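The overload point can be seen in a few lines. In this sketch the describe overloads are stand-ins for the operator<< overloads mentioned above; all names are made up for illustration.

```cpp
#include <string>

// Two overloads that differ only in their parameter type.
std::string describe(const std::string&) { return "string"; }
std::string describe(double) { return "double"; }

// The overload is chosen at compile time from the static type of *p.
template <typename T>
std::string describe_pointee(const T* p) {
    return describe(*p);
}

std::string via_string_pointer() {
    std::string x = "Hello";
    const std::string* y = &x;   // y remembers that it points to a std::string
    return describe_pointee(y);
}

std::string via_double_pointer() {
    double d = 1.5;
    const double* y = &d;        // y remembers that it points to a double
    return describe_pointee(y);
}
```

If y were a plain integer holding the address, there would be no type for *y and the compiler could not pick either overload.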
Finally, while you think that a pointer is just the address of the object and that it should be representable by an integral type (which on 64-bit architectures would have to be int64 rather than int), that is not always the case. There are architectures on which pointers are not really representable by integral values. For example, in architectures with segmented memory, the address of an object can contain both a segment (integral value) and an offset into the segment (another integral value). On other architectures the size of pointers was different from the size of any integral type.
The language is trying to protect you from conflating two different concepts - even though at the hardware level they are both just sets of bits;
Outside of needing to pass values manually between various parts of a debugger, you never need to know the numerical value.
Outside of archaic uses of arrays, it doesn't make sense to "add 10" to a pointer - so you shouldn't treat them as numeric values.
By the compiler retaining type information, it also prevents you from making mistakes - if all pointers were equal, then the compiler couldn't, helpfully, point out that what you're trying to dereference as an int is a pointer to a string.
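Even where pointer arithmetic is used, it is not ordinary integer arithmetic: the step depends on the pointed-to type. A small sketch, with a helper name invented for illustration:

```cpp
#include <cstddef>

// Number of bytes between p and p + 1 when p points to an element of type T.
template <typename T>
std::size_t step_in_bytes() {
    T items[2] = {};
    const char* first  = reinterpret_cast<const char*>(&items[0]);
    const char* second = reinterpret_cast<const char*>(&items[0] + 1);
    return static_cast<std::size_t>(second - first);
}
```

step_in_bytes<char>() is 1, while step_in_bytes<double>() equals sizeof(double). Adding 1 to a pointer therefore means "next element", not "next number", which is another reason the language keeps pointers and integers apart.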