How to execute a separately compiled binary file from inside a program on an MCU? - c++

I have an MCU (say an STM32) running, and I would like to 'pass' it a separately compiled binary file over UART/USB and use it like a function call, where I can pass it data and collect its output. After it completes, a second, different binary would be sent to be executed, and so on.
How can I do this? Does this require an OS to be running? I'd like to avoid that overhead.
Thanks!

The exact call mechanism is somewhat specific to the MCU, but you are just making a function call. You can try the function pointer approach, but that has been known to fail with Thumb on gcc (the STM32 uses ARM's Thumb instruction set).
First, decide in your overall system design whether you want a specific address for this code, for example 0x20001000, or whether you want several of these resident at the same time, loaded at any one of multiple possible addresses. This determines how you link the code. Is the code standalone, with its own variables, or does it need to call functions in other code? All of this determines how you build it. The easiest option, at least for a first try, is a fixed address. Build it the way you build your normal application, but based at a RAM address such as 0x20001000. Then load the program sent to you at that address.
In any case, the normal way to "call" a function in Thumb (say on an STM32) is the bl or blx instruction. In this situation you would normally use bx, but to make it a call you need a return address. The way ARM/Thumb works is that for bx and related instructions the lsbit of the target address determines the mode you switch to or stay in when branching: lsbit set is Thumb, lsbit clear is ARM. This is all covered in the ARM documentation, which completely answers your question BTW, not sure why you are asking...
Gcc, and I assume llvm, struggles to get this right, and some users know just enough to be dangerous and do the worst thing: ADDing one (rather than ORRing one), or even trying to put the one there by hand. Sometimes putting the one there helps the compiler (this is if you try the function pointer approach and hope the compiler does all the work for you, the *myfun = 0x10000 kind of thing). But it has been shown on this site that, with subtle changes to the code or depending on the exact situation, the compiler will get it right or wrong, and without looking at the generated code you have to help out with the orr one thing. As with most things, when you need an exact instruction, just write it in asm yourself (not inline please, use real asm); it makes your life 10000 times easier and your code significantly more reliable.
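To make the ORR-versus-ADD point concrete, here is a minimal sketch in plain C. The helper names thumb_entry_or and thumb_entry_add are made up for illustration, and the code only computes the value you would hand to bx; it does not branch anywhere.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only. */

/* ORing sets bit 0 and leaves an address that already has it set unchanged. */
uint32_t thumb_entry_or(uint32_t addr)
{
    return addr | 1u;
}

/* Adding one is only right when bit 0 is clear. If it is already set, the
   carry lands in bit 1 and the result is an even address two bytes past
   the intended entry point. */
uint32_t thumb_entry_add(uint32_t addr)
{
    return addr + 1u;
}
```

With 0x20001000 both helpers give 0x20001001. With 0x20001001 (a symbol address the toolchain already marked as Thumb) the OR version still gives 0x20001001, while the ADD version gives 0x20001002, which bx would treat as an ARM-mode target.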
So here is my trivial solution, extremely reliable, port the asm to your assembly language.
.thumb
.thumb_func
.globl HOP
HOP:
bx r0
In C it looks like this
void HOP ( unsigned int );
Now if you loaded to address 0x20001000 then after loading there
HOP(0x20001000|1);
Or you can
.thumb
.thumb_func
.globl HOP
HOP:
orr r0,#1
bx r0
Then
HOP(0x20001000);
The compiler generates a bl to hop which means the return path is covered.
If you want to send say a parameter...
.thumb
.thumb_func
.globl HOP
HOP:
orr r1,#1
bx r1
void HOP ( unsigned int, unsigned int );
HOP(myparameter,0x20001000);
Easy and extremely reliable, compiler cannot mess this up.
If functions and global variables need to be shared between the main app and the downloaded app, there are a few solutions, and they all involve resolving addresses. If the loaded app and the main app are not linked at the same time (doing a copy-and-jump with a single link is generally painful and should be avoided, but...), then, as with any shared library, you need a mechanism for resolving addresses. If the downloaded code has several functions and global variables that the main app needs, and/or the main app has several that the downloaded library needs, you have to solve this. Essentially, one side has to provide a table of addresses in a format both sides agree on. It could be a simple array of addresses, where both sides know which address is which purely from its position. Or you create a list of addresses with labels and search through the list, matching names to addresses for everything you need to resolve. You could, for example, use the above to call a setup function that you pass an array/structure to (structures across compile domains are of course a very bad thing). That function then sets up all the local function pointers and variable pointers into the main app, so that subsequent functions in the downloaded library can call functions in the main app. And/or, vice versa, this first function can pass back an array/structure of all the things in the library.
Alternatively, at a known offset in the downloaded library, for example its first words/bytes, there could be an array/structure. Providing one or the other or both lets the main app find all the function and variable addresses in the library, and/or lets the library be given the main application's function and variable addresses, so that when one calls the other it all works... This of course means function pointers and variable pointers in both directions. Think about how .so files or .dlls work on linux or windows; you have to replicate that yourself.
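A minimal sketch of the positional-table idea, written so it builds on a host PC. Every name here (host_api, host_table, module_init, module_run, and so on) is hypothetical. On a real target module_init would live in the downloaded binary and be reached through something like HOP rather than a direct C call. The sketch uses a struct for readability; given the warning above about structures across compile domains, a plain array of addresses is the more conservative choice.

```c
#include <stdint.h>

/* Table the main app hands to the downloaded code. Both sides must agree on
   the order of the entries; that agreement is the whole "linking" step. */
struct host_api {
    uint32_t (*add)(uint32_t a, uint32_t b);
    void     (*log)(uint32_t value);
};

/* --- main app side --- */
uint32_t host_log_count;   /* counts calls so the effect is observable */

uint32_t host_add(uint32_t a, uint32_t b) { return a + b; }
void     host_log(uint32_t value)         { (void)value; host_log_count++; }

const struct host_api host_table = { host_add, host_log };

/* --- downloaded module side --- */
static const struct host_api *api;   /* filled in by module_init */

/* First call into the module: remember where the main app's functions are. */
void module_init(const struct host_api *table) { api = table; }

/* Later calls reach the main app only through the saved pointers. */
uint32_t module_run(uint32_t x)
{
    uint32_t r = api->add(x, 1u);
    api->log(r);
    return r;
}
```

After module_init(&host_table), a call to module_run(41) goes through the table to host_add and host_log without the module ever knowing their addresses at its own link time.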
Or you go the path of linking at the same time. The downloaded code then has to be built along with the code that is already running, which is probably not desirable, but some folks do this, or they do it to load code from flash to ram for various reasons. It is a way to resolve all the addresses at build time: you extract part of the binary from the final build and pass it around later.
If you do not want a fixed address, then you need to build the downloaded binary as position independent, and you should link it with .text, .bss and .data in the same memory region.
MEMORY
{
hello : ORIGIN = 0x20001000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > hello
.rodata : { *(.rodata*) } > hello
.bss : { *(.bss*) } > hello
.data : { *(.data*) } > hello
}
You should obviously do this anyway, but with position independent code you then have it all packed in along with the GOT (you might need a .got entry, but I think it knows to use .data). Note: with gnu at least, if you put .data after .bss and ensure there is at least one .data item, even a bogus variable you do not use, then .bss is zero padded and allocated for you in the binary, so there is no need to set it up in a bootstrap.
If you build for position independence then you can load it almost anywhere, clearly on arm/thumb at least on a word boundary.
In general, for other instruction sets the function pointer approach works just fine. In ALL cases you simply look at the processor documentation, see which instruction(s) are used for calling, returning or branching, and use them, be it by having the compiler do it or by forcing the right instruction so that it does not fail down the road after a re-compile (leaving you with a very painful debug). arm and mips have 16 bit modes that require specific instructions or solutions for switching modes. x86 has different modes, 32 bit and 64 bit, and ways to switch between them, but normally you do not need to mess with that for something like this. On msp430, pic and avr, a plain function pointer in C should work fine. In general, do the function pointer thing, then look at what the compiler generates and compare that to the processor documentation (and to a non-function-pointer call).
If you do not know these basic concepts (function pointers, linking a bare metal app on an mcu/processor, bootstrap, .text, .data, etc.), you need to go learn all that.
The times you decide to switch to an operating system are... if you need a filesystem, networking, or a few things like this that you just do not want to do yourself. Sure, there is lwip for networking, and there are some embedded filesystem libraries. Multithreading is another reason for an os. But if all you want to do is generate a branch/jump/call instruction, you do not need an operating system for that. Just generate the call/branch/whatever.

Loading and executing a fully linked binary and loading and calling a single function (and returning to the caller) are not really the same thing. The latter is somewhat complicated and involves "dynamic linking", where the code effectively executes in the same execution environment as the caller.
Loading a complete stand-alone executable, on the other hand, is more straightforward and is the function of a bootloader. A bootloader loads and jumps to the loaded executable, which then establishes its own execution environment. Returning to the bootloader requires a processor reset.
In this case it would make sense to have the bootloader load and execute code in RAM if you are going to be frequently loading different code. However, be aware that on Harvard architecture devices like STM32, RAM execution may slow down execution because data and instruction fetch share the same bus.
The actual implementation of a bootloader will depend on the target architecture, but for Cortex-M devices it is fairly straightforward and is dealt with elsewhere.
STM32 actually includes an on-chip bootloader (you need to configure the boot source pins to invoke it), which I believe can load and execute code in RAM. It is normally used to load a secondary bootloader to load and program flash, but it can be used for loading any code.
You do need to build and link your code to run from RAM at the address where the loader places it, or, if supported, build position-independent code that can run from anywhere.

Related

Why this branch instruction of ARM doesn't work

I am writing a library to mock trivial functions in C/C++. It is used like this: MOCK(mocked, substitute)
If you call the mocked function, the substitute function will be called instead.
I modify the attributes of the code page and inject jump code into the function to implement it. I have implemented it for x86 CPUs and I want to port it to ARM CPUs, but I have a problem when I inject the binary code.
For example, the address of the substitute function is 0x91f1, and the address of the function to mock is 0x91d1. So I want to inject an ARM branch instruction at 0x91d1 to jump to the substitute function.
According to the documentation online, the relative offset is
(0x91f1 - (0x91d1 + 8)) / 4 = 6
so the binary instruction is:
0xea000006
Because my ARM emulator (I use the Android ARM v7 emulator) is little endian, the binary code to inject is:
0x060000ea
But when I executed the mocked function after injecting the branch code, a segmentation fault occurred. I don't know why the branch instruction is wrong. I have not learned the ARM architecture, so I don't know whether the ARM branch instruction has some limits.
The addresses you are branching to are odd, meaning the code there is Thumb code.
There is an obvious problem with your approach.
If the target is Thumb code, you either need to be in Thumb state at the point you are branching from, or you need to use a bx (Branch and Exchange) instruction.
Your functions are Thumb code (+1 at the target) but you are using the ARM-state branch encoding (B, encoding A1?), so either you are not in Thumb state or you are executing an ARM instruction in Thumb state.
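For reference, the arithmetic from the question can be sketched in C as follows. The function names are made up for illustration. The encoder produces the ARM-state unconditional B (condition "always", 24-bit word offset relative to the instruction address plus 8) and does no range checking, so it is only meaningful when both the patched code and the target are ARM-state code. The odd addresses in the question say they are not.

```c
#include <stdint.h>

/* Nonzero if bit 0 is set, i.e. the address denotes Thumb code. */
int is_thumb_address(uint32_t addr)
{
    return (addr & 1u) != 0;
}

/* Encode an unconditional ARM-state "B to" placed at address `from`.
   The offset is counted in words from (from + 8). No check is made that
   the target is within the +/-32 MB reach of the instruction. */
uint32_t encode_arm_b(uint32_t from, uint32_t to)
{
    uint32_t offset = to - (from + 8u);            /* wraps for backward branches */
    uint32_t imm24  = (offset >> 2) & 0x00FFFFFFu; /* keep the low 24 bits */
    return 0xEA000000u | imm24;
}

/* Store a 32-bit instruction as little-endian bytes. */
void store_le32(uint8_t out[4], uint32_t insn)
{
    out[0] = (uint8_t)(insn & 0xFFu);
    out[1] = (uint8_t)((insn >> 8) & 0xFFu);
    out[2] = (uint8_t)((insn >> 16) & 0xFFu);
    out[3] = (uint8_t)((insn >> 24) & 0xFFu);
}
```

encode_arm_b(0x91d0, 0x91f0) reproduces the question's 0xea000006, stored in memory as the bytes 06 00 00 ea. is_thumb_address(0x91f1) is true, which is the check that should stop you from injecting this encoding in the first place.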
The ARM family allows loading registers with values, and one of those registers is the PC (Program Counter).
Some alternatives:
Use a function to load the PC register with the (absolute) destination address.
Add an offset to the PC register.
Use a multiply-and-add instruction with the PC register.
Push the destination address onto the stack and pop it into the PC register.
These choices, plus modifying the destination of the branch instruction, are all different options, and none of them is "best". Pick the one that suits you best and is easiest to maintain.

Design elements for inline asm in concurrent usage

I can't find a neat explanation of how I'm supposed to write a piece of inline asm, and what problems can arise from concurrent use of a function foo that contains asm code.
The problem that I see is that in asm the registers are uniquely named, so one name is strictly tied to one precise part of your CPU. That is a big problem if you are writing a piece of code that is supposed to run concurrently, because you can't simply get extra registers with the same name.
The other problem is that asm doesn't really use a calling convention: you simply name registers and/or values, and sometimes using a register implies a silent action on another register that doesn't even show up explicitly in your code. So I can't even expect that my C/C++ function foo will be packed and sealed inside its own stack if it contains asm code.
Now, with what gcc calls extended asm, I can basically declare where the inputs and outputs go, so each function can use its own parameters "as registers". The pattern is the following:
asm ( assembler template
: output
: input
: registers
);
Assuming that my main target for now is mathematical operations, and my function is only supposed to provide a certain functionality and perform some computation (no internal lock), is extended asm good for concurrency? How should I design a piece of asm that is supposed to be used by a concurrent application?
For now I'm using gcc, but I would like a generic answer about the general asm design that I should give to this kind of code snippet.
You seem to be misunderstanding what threading actually is. Let's consider a single-processor system first. The threads don't actually run concurrently, since there is only one unit that can decode and execute them. Your operating system only creates the illusion of running multiple threads (and processes, too) by scheduling: every thread, or process, is allocated a certain amount of time it gets to execute on the processor.
This is why, when threads are executed, they don't overwrite each other's registers. When the currently executing thread or process is switched out, the operating system asks the processor to perform what is called a context switch. In a nutshell, the processor saves the state it had while executing the previous task/thread/process into a memory area controlled by the OS. The new task/thread/process has its context restored from its previously stored state and continues its execution. When this task/thread/process' time slice on the CPU is up, the scheduler decides which task/thread/process to resume next. The time slice is usually very small, which is why you're given the illusion of multiple streams of code running at the same time. Keep in mind that this is a very, very simplified description; refer to CPU manuals or books on operating systems for more detail.
The situation is analogous on multi-processor systems, except that there is more than one unit that can execute instructions. This is also true for multi-core processors: every core has its own set of registers. The basics stay the same: the scheduler in your OS decides whether the code is actually executed at the same time by multiple cores in one processor.
Thus, your concerns in this case are not valid. However, they were raised for very valid reasons. Remember that the only thing that threads share is main memory: each thread has its own registers and its own stack.
Let me come back to the actual question about gcc's extended inline assembly. The compiler itself cannot work out which registers are modified by the assembly you wrote. That's why you need to specify it. However, it is very rare that an instruction modifies a register without you being able to control it, and it happens only with a small number of instructions, assuming that we're talking about x86. Moreover, gcc can work out the destination/source operands by itself when you want to refer to a C/C++ variable from inside the assembly. In fact, this is the preferred method, since it leaves the compiler much more room for optimization.
Consider this piece of code:
unsigned int get_cr0(void)
{
    unsigned int rc;
    __asm__ (
        "movl %%cr0, %0\n"
        : "=r"(rc)
        :
        :
    );
    return rc;
}
This function's purpose is to return the contents of the control register cr0. This is a privileged instruction, so the program will not work when you run it in user mode, but this is not important right now. See how I put %0 in the instruction, and then specified "=r"(rc) in the output list. This means that %0 will be automagically aliased by the compiler to your rc variable. You can do this for every variable you specify on the input/output list. They are numbered starting from zero, as you can see.
I can't really remember the instructions which used registers that were not encoded as operands, so I can't give you an example right now. In this case, you would need to put them on the clobber list (the last one). I'm pretty sure you can refer to this for more information.
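One x86 instruction with implicit register operands is mul: the one-operand form reads rax and writes the full product to rdx:rax, even though neither register appears in the instruction text. A minimal sketch, assuming GCC or Clang on x86-64 (the name mul_low is made up, and other targets fall back to plain C so the snippet still builds):

```c
#include <stdint.h>

/* Low 64 bits of a * b. mulq implicitly multiplies rax by its operand and
   writes the result to rdx:rax. rax is tied to a C variable with "+a";
   rdx is not wanted here, so it goes on the clobber list together with the
   flags ("cc") so the compiler knows not to keep anything live in it. */
uint64_t mul_low(uint64_t a, uint64_t b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t lo = a;
    __asm__ ("mulq %1"
             : "+a"(lo)        /* in/out: rax */
             : "r"(b)          /* in: any general register */
             : "rdx", "cc");   /* implicitly overwritten */
    return lo;
#else
    return a * b;              /* portable fallback for other targets */
#endif
}
```

Each thread that calls mul_low gets its own rax and rdx through the normal context switch, so nothing extra is needed for concurrency as long as the asm only touches registers and its own operands.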
I also can't answer anything regarding "general asm design", since this is a non-standard extension and thus varies between compilers. The 64-bit Visual Studio compilers don't support it at all, for example.

gdb: finding every jump to an address

I'm trying to understand a small binary using gdb, but there is something I can't find a way to achieve: how can I find the list of jumps that point to a specified address?
I have a small set of instructions in the disassembled code and I want to know where it is called from.
I first thought about searching for the corresponding instruction in .text, but since there are many kinds of jumps, and addresses can be relative, this can't work.
Is there a way to do that?
Alternatively, if I put a breakpoint on this address, is there a way to know the address of the previous instruction (in this case, the jump)?
If this is some subroutine being called from other places, then it must respect some ABI while it's called.
Depending on the CPU used, the return address (and therefore the place it was called from) will be stored somewhere (on the stack or in a register). If you replace the original code with code that examines this, you can create a list of return addresses. Or simpler, as you suggested, if you use gdb and put a breakpoint at that routine, you can see where it was called from by using the bt command.
If it was an actual jump (versus a "jump to subroutine") that led you there (which I doubt, if it's called from many places, unless it's a kind of longjmp/setjmp), then you will probably not be able to determine where it came from, unless the CPU you are using allows you to trace the execution in some way.
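If you can rebuild or wrap the routine, the "examine the return address" idea can be sketched with the GCC/Clang builtin __builtin_return_address(0). The names target_routine, caller_log and caller_count are made up for this example, and the routine body is just a stand-in.

```c
#include <stddef.h>

#define MAX_CALLERS 16

/* Return addresses seen so far (bookkeeping invented for this sketch). */
void  *caller_log[MAX_CALLERS];
size_t caller_count;

/* Stand-in for the routine under investigation. noinline keeps it a real
   call, so it has a return address of its own. */
__attribute__((noinline)) int target_routine(int x)
{
    if (caller_count < MAX_CALLERS)
        caller_log[caller_count++] = __builtin_return_address(0);
    return x + 1;
}
```

Each logged address points just past the call instruction, so in gdb something like info symbol on that address tells you which function the call came from.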

Simple process loader memory mapping

I'm writing a very simple process loader for Linux. The executables I'm loading are already compiled, and I know where each one expects to be found in memory. The first approach I tried was using mmap() to manually place each code or data section at the correct location, like
mmap(addr, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0)
which segfaults unless I remove the MAP_FIXED flag because, it seems, the address of one block conflicts with something already in memory, possibly even the loader itself; the address 0x401000 seems to be the problematic one.
I'm not really even sure where to begin with this one. A friend suggested virtualizing memory access operations; I'm not sure what kind of performance hits I'd take for that, and I have no clue how it's done, but it might be an option. What I'd really love to do is create an "empty" process, which would have, as far as it was concerned, full run of the memory, so nothing would be loaded into the user space until I wanted it to be. The whole concept of an "empty" process might be meaningless, but it's the best way to describe what I want. I'm pretty desperate for some references or examples that might help me.
With your process running (maybe snoozing in "sleep(1000);"), look at its /proc/pid/maps. That will tell you what 0x401000 is used for.
~$ sleep 1h &
[3] 2033
~$ cat /proc/2033/maps
00110000-002af000 r-xp 00000000 08:01 1313056 /lib/i386-linux-gnu/libc-2.15.so
...
Here on my box, /bin/sleep doesn't use that block, and neither does my little one-liner program.
You're probably linking in some library which wants to land there?
So one way would be to allocate the block you need way early (long before main() runs -- look elsewhere for that info).
Another way is to link your code to some address you "know" isn't taken (presumably, you're generating the x86 opcodes yourself, or otherwise "linking", so that shouldn't be a stretch).
Another, better, option is to make your code relocatable. The fact that you don't want to replace the entire process's address space (precisely what exec does) more or less says that your code should be just that.
So find a usable address, load the bits there, and, as needed, perform the relocations (so your on-disk file format, if it's not ELF, will need to include reloc info). That's the high road, and the obvious thing you'll want next from your loader.
Of course, that pretty much means reimplementing dlopen() yourself. I assume you're just trying to learn how it works -- if not, man dlopen. Stephane's Rule Zero: it's already there ;-)
Don't forget to support linking other libraries from your code (without duplication), dlclose(), initializers, the various RTLD_* modes, honor MYCUSTOMLD_LIBRARY_PATH, GCC's __thread specifier, etc. ;-)
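The "find a usable address, then relocate" step can be sketched like this, assuming Linux with glibc. map_near and load_bias are hypothetical helper names, and the relocation format itself is left out because it depends on your file format.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Ask for `size` bytes near `hint` WITHOUT MAP_FIXED, so an existing mapping
   is never replaced. The kernel may return a different address than the hint.
   Returns NULL on failure. */
void *map_near(void *hint, size_t size)
{
    void *p = mmap(hint, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Difference between where the image actually landed and where it was linked
   to run. A loader would add this to every address named by a relocation. */
intptr_t load_bias(const void *actual, uintptr_t linked_addr)
{
    return (intptr_t)((uintptr_t)actual - linked_addr);
}
```

If map_near((void *)0x401000, size) hands back 0x401000 the bias is zero and no fix-ups are needed; otherwise the bias is what the relocation pass applies. Code sections would additionally need mprotect with PROT_EXEC once the bytes have been copied in.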

Getting ptr from memory address with c++

I'm trying to get the engine version of a game from a global pointer, but I am fairly new to this. Here is a very small example I found...
http://ampaste.net/mb42243
And this is the disassembly for what I am trying to get, the pointer (gpszVersionString) is the highlighted line (line 5)
http://ampaste.net/m2a8f8887
So what I need to find out is: using the example approach I found, would I need to sig out the first part of the function and then find the offset to that line?
Like...
Memory signature - /x56/x8B/x35/x74/xD5/x29/x10/x68/x00/xA8/x38/x10
Then an offset to reach that line? (not sure how to find the offset)
You can't directly do this. Process address space is completely unique to your process -- 0xDEADBEEF can point to "Dog" in one process, while 0xDEADBEEF can point to "Cat" in another. You would have to make operating system calls that allow you to access another process' address space, and even then you'd have to guess. Many times that location will be different each run of the application -- you can't generally predict what the runtime layout of a process will be in all cases.
Assuming you're on Windows you'll need to (EDIT: You don't need A and B in all cases but you usually need them) A. be an administrator, B. take the SeDebugPrivilege for your process, C. open a handle to the process, and then D. use ReadProcessMemory/WriteProcessMemory to do what you want.
Hope that helps :)
EDIT 2: It looks like you're looking at an address taken from a disassembler. If that's the case, then you can't use that value of the address: the image can be re-based at runtime and the value there would be completely different, particularly on recent versions of Windows, which support Address Space Layout Randomization.