Is Hardware Breakpoint able to write memory? - c++

I have this ASM code on which I would like to set a hardware breakpoint; however, I'm wondering if I could use a hardware breakpoint to write the memory. Can anyone advise?
[ASM]
41A8BA - 68 12345678 [PUSH 78563412]
Is there any way I can use a hardware breakpoint to change it to "68 00000000", for example, in C++?
[C++ Code]
LONG WINAPI ExceptionFilter(PEXCEPTION_POINTERS ExceptionInfo)
{
    if (ExceptionInfo->ExceptionRecord->ExceptionCode == EXCEPTION_SINGLE_STEP)
    {
        if ((DWORD)ExceptionInfo->ExceptionRecord->ExceptionAddress == 0x41A8BA)
        {
            // What do I write here?
            return EXCEPTION_CONTINUE_EXECUTION;
        }
    }
    return EXCEPTION_CONTINUE_SEARCH;
}

I'm very familiar with how x86's implementation of hardware breakpoints works (from the "what it does to the processor" perspective - not how it's actually designed internally), and I have read the descriptions for several others.
Hardware breakpoints do not DO anything to the code in question. They are a set of special registers that can be given a pattern ("address X, trigger on write", "address Y, trigger on execute"), which is checked against during execution of the code. If there is a match (e.g. "address X is being written to" or "address Y is being executed"), the processor stops executing and enters an exception handler. At that point the software in the exception handler takes over, typically by handing control to the debugger to say "Your code did a write to address X, here's where you are" or "Your code executed address Y, here's where we stopped".
Hardware breakpoints can't directly be used to read, write or execute anything - they are just a "match + exception" mechanism. Technically, one could make the exception handler do something like writing to the address being executed, but that would not be "the hardware breakpoint" doing it, and it would still be treated just like any other code executing on the processor, meaning the memory has to be mapped in a way that allows it to be written (code, typically, isn't writable in modern OSes such as Windows and Linux).
You can of course, in the exception handler for the debug break, map the memory as writable (if needed) and write a different value to the part of the code you care about (if it's in another process, you need to use OpenProcess and WriteProcessMemory). Again, this has nothing directly to do with hardware breakpoints, but with the code executed as a consequence, and it will still follow the usual OS rules about what memory you can read and write.
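To make that concrete, here is a minimal sketch of what the handler body could look like when the target bytes live in the current process. This is illustrative only: it assumes 0x41A8BA is mapped in this process, and it skips all error handling.
[C++ Code]
#include <windows.h>
#include <cstring>

LONG WINAPI ExceptionFilter(PEXCEPTION_POINTERS ExceptionInfo)
{
    if (ExceptionInfo->ExceptionRecord->ExceptionCode == EXCEPTION_SINGLE_STEP &&
        (DWORD_PTR)ExceptionInfo->ExceptionRecord->ExceptionAddress == 0x41A8BA)
    {
        BYTE *target = (BYTE *)0x41A8BA + 1;   // skip the 0x68 (PUSH) opcode
        DWORD newImm = 0x00000000;             // new immediate operand
        DWORD oldProtect;

        // Code pages are normally not writable, so remap them first.
        VirtualProtect(target, sizeof(newImm), PAGE_EXECUTE_READWRITE, &oldProtect);
        std::memcpy(target, &newImm, sizeof(newImm));
        VirtualProtect(target, sizeof(newImm), oldProtect, &oldProtect);
        FlushInstructionCache(GetCurrentProcess(), target, sizeof(newImm));

        return EXCEPTION_CONTINUE_EXECUTION;
    }
    return EXCEPTION_CONTINUE_SEARCH;
}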

I'm wondering what this has to do with hardware breakpoints.
As far as I understand, you want to modify a Windows program while it is stopped?
To do this you should use the WriteProcessMemory() API function.
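For completeness, a hedged sketch of what that might look like for another process (the function name and setup are invented for illustration; a real tool would check every return value, and code pages may need VirtualProtectEx first):
[C++ Code]
#include <windows.h>

// Assumes hProcess was obtained with OpenProcess(PROCESS_VM_WRITE |
// PROCESS_VM_OPERATION, ...) and that 0x41A8BA is mapped in the target.
bool PatchPushImmediate(HANDLE hProcess)
{
    DWORD newImm = 0x00000000;
    SIZE_T written = 0;
    // +1 skips the 0x68 PUSH opcode so only the immediate changes.
    return WriteProcessMemory(hProcess, (LPVOID)(0x41A8BA + 1),
                              &newImm, sizeof(newImm), &written)
           && written == sizeof(newImm);
}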

Related

Will reading out-of-bounds of a stack-allocated array cause any problems in real world?

Even though it is bad practice, is there any way the following code could cause trouble in real life? Note that I am only reading out of bounds, not writing:
#include <iostream>

int main() {
    int arr[] = {1, 2, 3};
    std::cout << arr[3] << '\n';
}
As mentioned, it is not "safe" to read beyond the end of the array. But it sounds like you're really asking what could go wrong? and, typically, the answer is "not much". Your program would ideally crash with a segfault, but it might just keep happily running, unaware that it has entered undefined behavior. The results of such a program would be garbage, of course, but nothing's going to catch on fire (probably...).
People mistakenly write code with undefined behavior all the time, and a lot of effort has been spent trying to help them catch such issues and minimize their harm. Programs run in user space cannot affect other programs on the same machine thanks to isolated address spaces and other features, and software like sanitizers can help detect UB and other issues during development. Typically you can just fix the issue and move on to more important things.
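For instance (an illustrative addition, not from the original answer; the file name is made up), compiling the question's snippet with AddressSanitizer enabled usually turns the silent out-of-bounds read into a loud diagnostic:
// Build with: g++ -g -fsanitize=address oob.cpp && ./a.out
// ASan typically aborts with a "stack-buffer-overflow" report pointing
// at the arr[3] read, instead of printing garbage.
#include <iostream>

int main() {
    int arr[] = {1, 2, 3};
    std::cout << arr[3] << '\n';  // one past the end: undefined behavior
}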
That said, UB is, as the name suggests, undefined. Which means your computer is allowed to do whatever it wants once you ask it to execute UB. It could format your hard drive, fry your processor, or even "make demons fly out of your nose". A reasonable computer wouldn't do those things, but it could.
The most significant issue with a program that enters UB is simply that it's not going to do what you wanted it to do. If you are trying to delete /foo but you read off the end of the array, you might end up passing /bar to your delete function instead. And if you access memory that an attacker also has access to, you could wind up executing code on their behalf. A large number of major security vulnerabilities boil down to some line of code that triggers UB in just the wrong way for a malicious user to take advantage of it.
Depends on what you mean by stack. If you mean the whole stack, then no, you can't do that; it will lead to a segmentation fault. Not because the memory of other processes is there (that's not how it works), but because there is NOTHING there. You can see this heuristically by looking at the various addresses the program uses. The stack, for example, is at ~0x7f7d4af48040, which is beyond what any computer would have as physical memory. The memory your program sees is different from the physical memory.
If you mean reading beyond the stack frame of the current function: yes, you can technically do that safely. Here is an example:
#include <cstddef>
#include <cstdlib>
#include <iostream>

void stacktrace() {
    std::cerr << "Received SIGSEGV. Stack trace:\n";
    void **bp;
    // Grab the frame pointer; requires building with -fno-omit-frame-pointer.
    asm("mov %%rbp, %[bp]" : [bp] "=r" (bp));
    size_t i = 0;
    while (true) {
        // bp[0] holds the caller's saved rbp, bp[1] the return address.
        std::cerr << "[" << i++ << "] " << bp[1] << '\n';
        if (bp > (void **)*bp) break;  // stop once the chain no longer walks upward
        bp = (void **)*bp;
    }
    exit(1);
}
This is a very basic function I wrote to see whether I could manually generate a stack trace. It might not be obvious if you are unfamiliar, but on x64 the address contained in rbp is the base of the current stack frame. In C++, the stack frame looks like:
return pointer
previous value of rbp [rbp = base pointer] <- rbp points here
local variables (may be some other stuff like a stack cookie)
...
local variables <- rsp [rsp = stack pointer] points here
The addresses decrease the lower you go. In the example above you can see that I take the value of rbp and walk from there, past the current stack frame. So you can read memory beyond the current stack frame, but you generally shouldn't - and even if you can, why would you want to?
Note: Evg pointed this out. Reading an object beyond the stack may well trigger a segfault, depending on the object type, so this should only be done if you are very sure of what you're doing.
If you don't own the memory, or you do own it but haven't initialized it, you are not allowed to read it. This might seem like a pedantic and useless rule. After all, the memory is there and I am not trying to overwrite anything, right? What is a byte among friends, let me read it.
The point is that C++ is a high-level language. The compiler only tries to interpret what you have coded and translate it to assembly. If you type in nonsense, you will get out nonsense. It's a bit like forcing someone to translate "askjds" from English to German.
But does this ever cause problems in real life? I roughly know what asm instructions are going to be generated. Why bother?
This video talks about a bug in Facebook's string implementation where they read a byte of uninitialized memory that they did own, but it caused a very difficult-to-find bug nevertheless.
The point is that silicon is not intuitive. Do not try to rely on your intuitions.

Design elements for inline asm in concurrent usage

I can't find a neat explanation of how I'm supposed to write a piece of inline asm, and of the problems that can possibly arise from concurrent use of a foo function that contains asm code.
The problem that I see is that in asm, registers are uniquely named, so one name is strictly tied to one precise part of your CPU. That seems like a big problem if you are writing one piece of code that is supposed to run concurrently, because you can't simply conjure up extra registers with the same name.
The other problem is that asm doesn't really use a calling convention: you simply use registers and/or values, and sometimes touching one register implies a silent action on another register that doesn't even show up explicitly in your code; so I can't even expect that my C/C++ function foo will be packed and sealed inside its own stack frame if it contains asm code.
Now, with what gcc calls extended asm, I can basically declare where the input and output go, so each function can use its own parameters "as registers", and the pattern is the following:
asm ( assembler template
    : output operands
    : input operands
    : clobbered registers
);
Assuming that my main target for now is mathematical operations, and my function is only supposed to provide a certain functionality and perform some computation (no internal locking), is extended asm good for concurrency? How should I design a piece of asm that is supposed to be used by a concurrent application?
For now I'm using gcc, but I would like a generic answer about the general asm design I'm supposed to apply to this kind of code snippet.
You seem to be misunderstanding what threading actually is. Let's consider a single-processor system first. The threads don't actually run concurrently, since there is only one unit that can decode and execute them. Your operating system only creates the illusion of running multiple threads (and processes, too) by employing scheduling: every thread, or process, is allocated a certain amount of time during which it gets to execute on the processor.
This is why, when threads are executed, they don't overwrite each other's registers. When the currently executing thread or process is switched out, the operating system asks the processor to perform what is called a context switch. In a nutshell, the processor saves the state it was in while executing the previous task/thread/process into some memory area controlled by the OS. The new task/thread/process has its context restored from a previously stored state and continues its execution. When this task/thread/process's time slice on the CPU is up, the scheduler decides which one to resume next. The time slice is usually very small, which is why you're given the illusion of multiple streams of code running at the same time. Keep in mind that this is a very, very simplified description: refer to CPU manuals or books on operating systems for more detail.
The situation is analogous on multi-processor systems, except that there is more than one unit that can execute instructions. This is also true for multi-core processors: each core has its own set of registers. The basics stay the same - the scheduler in your OS decides whether the code being executed actually runs at the same time on multiple cores of one processor.
Thus, your concerns in this case are not valid, though they were raised for very valid reasons. Remember that the only thing threads share is main memory: each thread has its own registers and its own stack.
Let me come back to the actual question about gcc's extended inline assembly. The compiler itself cannot work out which registers are modified by the assembly you wrote. That's why you need to specify it. However, it is very rare that an instruction modifies a register without you being able to control it, and it happens only with a small number of instructions - assuming that we're talking about x86. Moreover, gcc can work out the destination/source operands by itself when you want to refer to a C/C++ variable from inside the assembly. In fact, this is the preferred method, since it leaves the compiler much more room for optimization.
Consider this piece of code:
unsigned int get_cr0(void)
{
    unsigned int rc;
    __asm__ (
        "movl %%cr0, %0\n"
        : "=r"(rc)
        :
        :
    );
    return rc;
}
This function's purpose is to return the contents of the control register cr0. This is a privileged instruction, so the program will not work when you run it in user mode, but that is not important right now. See how I put %0 in the instruction and then specified "=r"(rc) in the output list: this means that %0 will be automagically aliased by the compiler to your rc variable. You can do this for every variable you specify in the input/output lists. They are numbered starting from zero, as you can see.
I can't remember off-hand which instructions use registers that are not encoded as operands, so I can't give you a list right now. In such a case, you would need to put those registers on the clobber list (the last one). I'm pretty sure you can refer to this for more information.
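As one illustration (my addition, not the original answer's): rdtsc always writes its result into EDX:EAX, and those registers are not encoded in any operand. They can either be claimed as outputs via constraints, as below, or named on the clobber list if they carry no useful result:
// Sketch: binding rdtsc's fixed destination registers with "=a"/"=d"
// tells gcc exactly which registers the instruction touches.
static inline unsigned long long read_tsc(void)
{
    unsigned int lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}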
I also can't answer anything regarding "general asm design", since this is a non-standard extension and thus varies between compilers. The 64-bit Visual Studio compilers don't support it at all, for example.

How to handle this exception? (zero esp)

How to handle this exception?
__asm
{
    mov esp, 0
    mov eax, 0
    div eax
}
This is not handled by __try/__except or SetUnhandledExceptionFilter().
Assuming this is running under an operating system, the operating system will catch the divide by zero and then ATTEMPT to build an exception/signal stack frame for the application code. However, since the user-mode stack is "bad", it can't.
There is really no way for the operating system to deal with this, other than to kill the application. [Theoretically, it could make up a new stack from some dynamically allocated memory, but it's pretty pointless, as there is no (always working) way for the application itself to recover to a sane state.]
Don't set the stack pointer to something that isn't the stack - or if you do store "random" data in the stack pointer register, do not take exceptions. It's the same as "don't aim a gun at your foot and pull the trigger, unless you want to be without your foot".
Edit:
If the code is running in kernel mode rather than user mode, it's even more "game over", since it will double-fault: the processor enters the divide-by-zero exception handler, which tries to write to the stack, and when it does so, it faults. This is now a fault within a fault handler, a.k.a. a "double fault". The typical setup is to give the double-fault handler a separate stack, which allows it to run at all. But it's still game over - we don't know how to return to the original fault handler [or how to find out what the original fault handler was doing].
If there is no "new stack" for the double-fault handler, an x86 processor will triple-fault. Typically, a triple fault makes the processor restart [technically, it halts the processor with a special combination of bits signalled on the address bus to indicate a "triple fault"; the typical PC northbridge then resets the processor, recognizing that a triple fault is an unrecoverable situation - this is why your PC sometimes simply reboots when you have poor-quality drivers].
It's not a good idea to try to interact with a higher-level language's exception mechanism from embedded assembly. The compiler can do "magic" that you cannot match, and there's no (portable) way to tell the compiler that "this assembly code might throw an exception".

From where starts the process' memory space and where does it end?

On Windows platform, I'm trying to dump memory from my application where the variables lie. Here's the function:
#include <iostream>
#include <iomanip>

void MyDump(const void *m, unsigned int n)
{
    const unsigned char *p = reinterpret_cast<const unsigned char *>(m);
    char buffer[16];
    unsigned int mod = 0;

    for (unsigned int i = 0; i < n; ++i, ++mod) {
        if (mod % 16 == 0) {
            mod = 0;
            std::cout << " | ";
            for (unsigned short j = 0; j < 16; ++j) {
                switch (buffer[j]) {
                case 0xa:
                case 0xb:
                case 0xd:
                case 0xe:
                case 0xf:
                    std::cout << " ";
                    break;
                default:
                    std::cout << buffer[j];
                }
            }
            std::cout << "\n0x" << std::setfill('0') << std::setw(8)
                      << std::hex << (long)i << " | ";
        }
        buffer[i % 16] = p[i];
        std::cout << std::setw(2) << std::hex
                  << static_cast<unsigned int>(p[i]) << " ";
        if (i % 4 == 0 && i != 1)
            std::cout << " ";
    }
}
Now, how can I know at which address my process's memory space starts, where all the variables are stored? And how do I know how long the area is?
For instance:
MyDump(0x0000 /* <-- Starts from here? */, 0x1000 /* <-- This much? */);
The short answer to this question is that you cannot approach the problem this way. The way processes are laid out in memory is very much compiler and operating system dependent, and there is no easy way to determine where all of the code and variables lie. To accurately and completely find all of the variables, you'd need to write large portions of a debugger yourself (or borrow them from a real debugger's code).
But, you could perhaps narrow the scope of your question a little bit. If what you really want is just a stack trace, those are not too hard to generate: How can one grab a stack trace in C?
Or if you want to examine the stack itself, it is easy to get a pointer to the current top of the stack (just declare a local variable and then take its address). The easiest way to get the bottom of the stack is to declare a variable in main, store its address in a global variable, and use that address later as the "bottom" (this is easy but not really 'clean').
Getting a picture of the heap is a lot harder, because you need extensive knowledge of the internal workings of the heap to know which pieces of it are currently allocated. Since the heap is basically "unlimited" in size, that's quite a lot of data to print if you just print all of it, even the unused parts. I don't know of a way to do this, and I would highly recommend you not waste time trying.
Getting a picture of static global variables is not as bad as the heap, but also difficult. These live in the data segments of the executable, and unless you want to get into some assembly and parsing of executable formats, just avoid doing this as well.
Overview
What you're trying to do is absolutely possible, and there are even tools to help, but you'll have to do more legwork than I think you're expecting.
In your case, you're particularly interested in "where the variables lie." The system heap API on Windows will be an incredible help to you. The reference is really quite good, and though it won't be a single contiguous region the API will tell you where your variables are.
In general, though, not knowing anything about where your memory is laid out, you're going to have to do a sweep of the entire address space of the process. If you want only data, you'll have to do some filtering of that, too, because code and stack nonsense are also there. Lastly, to avoid seg-faulting while you dump the address space, you may need to add a segfault signal handler that lets you skip unmapped memory while you're dumping.
Process Memory Layout
What you will have, in a running process, is multiple disjoint stretches of memory to print out. They will include:
Compiled code (read-only),
Stack data (local variables),
Static Globals (e.g. from shared libraries or in your program), and
Dynamic heap data (everything from malloc or new).
The key to a reasonable dump of memory is being able to tell which range of addresses belongs to which family. That's your main job, when you're dumping the program. Some of this, you can do by reading the addresses of functions (1) and variables (2, 3 and 4), but if you want to print more than a few things, you'll need some help.
For this, we have...
Useful Tools
Rather than just blindly searching the address space from 0 to 2^64 (which, we all know, is painfully huge), you will want to employ OS and compiler developer tools to narrow down your search. Someone out there needs these tools, maybe even more than you do; it's just a matter of finding them. Here are a few of which I'm aware.
Disclaimer: I don't know many of the Windows equivalents for many of these things, though I'm sure they exist somewhere.
I've already mentioned the Windows system heap API. This is a best-case scenario for you. The more things you can find in this vein, the more accurate and easy your dump will be. Really, the OS and the C runtime know quite a bit about your program. It's a matter of extracting the information.
On Linux, memory types 1 and 3 are accessible through interfaces like /proc/pid/maps, where you can see the ranges of the address space reserved for libraries and program code. You can also see the protection bits; read-only ranges, for instance, are probably code, not data.
For Windows tips, Mark Russinovich has written some articles on how to learn about a Windows process's address space and where different things are stored. I imagine he might have some good pointers in there.
Well, you can't, not really... at least not in a portable manner. For the stack, you could do something like:
void* ptr_to_start_of_stack = 0;

int main(int argc, char* argv[])
{
    int item_at_approximately_start_of_stack;
    ptr_to_start_of_stack = &item_at_approximately_start_of_stack;
    // ...
    // ... do lots of computation
    // ... a function called here can do something similar, and
    // ... attempt to print out from ptr_to_start_of_stack to its own
    // ... approximate start of stack
    // ...
    return 0;
}
In terms of attempting to look at the range of the heap: on many systems you can use the sbrk() function (specifically sbrk(0)) to get a pointer to the current end of the heap. The heap typically grows upward from the end of the program's data segment, while the stack typically grows downward from the top of the address space.
That said, this is a really bad idea. Not only is it platform dependent, but the information you can get from it is really not as useful as good logging. I suggest you familiarize yourself with Log4Cxx.
Good logging practice, in addition to the use of a debugger such as GDB, is really the best way to go. Trying to debug your program by looking at a full memory dump is like trying to find a needle in a haystack, and so it really is not as useful as you might think. Logging where the problem might logically be, is more helpful.
AFAIK, this depends on OS, you should look at e.g. memory segmentation.
Assuming you are running on a mainstream operating system, you'll need help from the operating system to find out which addresses in your virtual memory space have mapped pages. For example, on Windows you'd use VirtualQueryEx(). The memory dump you get can be as large as two gigabytes, and it isn't likely you'll discover anything recognizable quickly.
Your debugger already allows you to inspect memory at arbitrary addresses.
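If you do want to enumerate the mapped ranges yourself, here is a minimal sketch (using VirtualQuery on the current process; VirtualQueryEx works the same way for another process, and all error handling is omitted):
#include <windows.h>
#include <cstdio>

// Walk the virtual address space and print committed regions.
int main()
{
    MEMORY_BASIC_INFORMATION mbi;
    unsigned char *addr = nullptr;
    while (VirtualQuery(addr, &mbi, sizeof(mbi)) != 0) {
        if (mbi.State == MEM_COMMIT)
            std::printf("%p - %p (protect 0x%lx)\n",
                        mbi.BaseAddress,
                        (unsigned char *)mbi.BaseAddress + mbi.RegionSize,
                        mbi.Protect);
        addr = (unsigned char *)mbi.BaseAddress + mbi.RegionSize;
    }
    return 0;
}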
You can't, at least not portably. And you can't make many assumptions either.
Unless you're running this on CP/M or MS-DOS.
But with modern systems, the where and how of your data and code placement, in the generic case, aren't really up to you.
You can play linker games, and such to get better control of the memory map for you executable, but you won't have any control over, say, any shared libraries you may load, etc.
There's no guarantee that any of your code, for example, is even in a continuous space. The Virtual Memory and loader can place code pretty much where it wants. Nor is there any guarantee that your data is anywhere near your code. In fact, there's no guarantee that you can even READ the memory space where your code lives. (Execute, yes. Read, maybe not.)
At a high level, your program is split in to 3 sections: code, data, and stack. The OS places these where it sees fit, and the memory manager controls what and where you can see stuff.
There are all sorts of things that can muddy these waters.
However.
If you want.
You can try having "markers" in your code. For example, put a function at the start of your file called "startHere()" and then one at the end called "endHere()". If you're lucky, for a single file program, you'll have a continuous blob of code between the function pointers for "startHere" and "endHere".
Same thing with static data. You can try the same concept if you're interested in that at all.
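As a toy illustration of that marker idea (doWork is invented; linkers are free to reorder functions, so treat the result as a heuristic at best):
#include <cstdio>

void startHere() {}
void doWork() { /* ... the code you care about ... */ }
void endHere() {}

int main() {
    // Casting a function pointer to void* is only conditionally supported
    // by the standard, but works on mainstream platforms.
    std::printf("code presumably spans %p .. %p\n",
                (void *)&startHere, (void *)&endHere);
}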

Gentle introduction to JIT and dynamic compilation / code generation

The deceptively simple foundation of dynamic code generation within a C/C++ framework has already been covered in another question. Are there any gentle introductions to the topic, with code examples?
My eyes are starting to bleed staring at highly intricate open source JIT compilers when my needs are much more modest.
Are there good texts on the subject that don't assume a doctorate in computer science? I'm looking for well worn patterns, things to watch out for, performance considerations, etc. Electronic or tree-based resources can be equally valuable. You can assume a working knowledge of (not just x86) assembly language.
Well a pattern I've used in emulators goes something like this:
#include <map>

typedef void (*code_ptr)();

extern unsigned long entry_point;           // supplied by the embedding VM
code_ptr generate_code_block();             // emits native code for the current block
unsigned long update_instruction_pointer(); // asks the VM where to go next

unsigned long instruction_pointer = entry_point;
std::map<unsigned long, code_ptr> code_map;

void execute_block() {
    code_ptr f;
    std::map<unsigned long, code_ptr>::iterator it = code_map.find(instruction_pointer);
    if (it != code_map.end()) {
        f = it->second;
    } else {
        f = generate_code_block();
        code_map[instruction_pointer] = f;
    }
    f();
    instruction_pointer = update_instruction_pointer();
}

void execute() {
    while (true) {
        execute_block();
    }
}
This is a simplification, but the idea is there. Basically, every time the engine is asked to execute a "basic block" (usually everything up to the next flow-control op, or a whole function if possible), it will look it up to see if it has already been compiled. If so, execute it; else create it, add it to the map, and then execute it.
rinse repeat :)
As for the code generation, that gets a little complicated, but the idea is to emit a proper "function" which does the work of your basic block in the context of your VM.
EDIT: note that I haven't demonstrated any optimizations either, but you asked for a "gentle introduction"
EDIT 2: I forgot to mention one of the most immediately productive speed-ups you can implement with this pattern. Basically, if you never remove a block from your tree (you can work around it if you do, but it is way simpler if you never do), then you can "chain" blocks together to avoid lookups. Here's the concept: whenever you return from f() and are about to do the "update_instruction_pointer", if the block you just executed ended in either a call, an unconditional jump, or didn't end in flow control at all, then you can "fix up" its ret instruction with a direct jmp to the next block it'll execute (since it'll always be the same one), provided you have already emitted it. This makes it so you are executing more and more often in the VM and less and less in the "execute_block" function.
I'm not aware of any sources specifically related to JITs, but I imagine that it's pretty much like a normal compiler, only simpler if you aren't worried about performance.
The easiest way is to start with a VM interpreter. Then, for each VM instruction, generate the assembly code that the interpreter would have executed.
To go beyond that, I imagine that you would parse the VM byte codes and convert them into some sort of suitable intermediate form (three address code? SSA?) and then optimize and generate code as in any other compiler.
For a stack-based VM, it may help to keep track of the "current" stack depth as you translate the byte codes into intermediate form, and treat each stack location as a variable. For example, if you think that the current stack depth is 4 and you see a "push" instruction, you might generate an assignment to "stack_variable_5" and increment a compile-time stack counter, or something like that. An "add" when the stack depth is 5 might generate the code "stack_variable_4 = stack_variable_4 + stack_variable_5" and decrement the compile-time stack counter.
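A tiny sketch of that counter idea (bytecode format and names invented for illustration):
#include <cstdio>

// Toy bytecode: PUSH takes the next value from 'immediates', ADD pops two.
enum Op { PUSH, ADD };

// Translate stack bytecode into "stack_variable_N" assignments, tracking
// the stack depth at compile time exactly as described above.
void translate(const Op *code, const int *immediates, int n) {
    int depth = 0;   // compile-time stack counter
    int imm = 0;
    for (int i = 0; i < n; ++i) {
        if (code[i] == PUSH) {
            ++depth;
            std::printf("stack_variable_%d = %d;\n", depth, immediates[imm++]);
        } else { // ADD: consumes the top two slots, leaves the result one lower
            std::printf("stack_variable_%d = stack_variable_%d + stack_variable_%d;\n",
                        depth - 1, depth - 1, depth);
            --depth;
        }
    }
}

// Example: translating "2 3 +" emits two assignments and one add.
int main() {
    Op prog[] = { PUSH, PUSH, ADD };
    int imms[] = { 2, 3 };
    translate(prog, imms, 3);
}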
It is also possible to translate stack based code into syntax trees. Maintain a compile-time stack. Every "push" instruction causes a representation of the thing being pushed to be stored on the stack. Operators create syntax tree nodes that include their operands. For example, "X Y +" might cause the stack to contain "var(X)", then "var(X) var(Y)" and then the plus pops both var references off and pushes "plus(var(X), var(Y))".
Get yourself a copy of Joel Pobar's book on Rotor (when it's out), and delve through the source to the SSCLI. Beware, insanity lies within :)