How does the OS know that EIP no longer points to a valid/legal instruction and that the application has crashed? How does it know when to generate the crash dump data?
On an x86-compatible processor, an exception is raised when EIP points to a page that is not mapped or that the code has no permission to access, when EIP points to an invalid instruction, when a valid instruction tries to access a memory page that is not mapped or that it has no permission for, when a divide instruction sees that the denominator is zero, when an INT instruction is executed, and in a number of other cases. When an exception occurs in protected mode while the current privilege level (CPL) is greater than 0, the processor does the following:
Loads the values for SS and ESP from a memory structure called the task state segment (TSS).
Pushes the values of SS, ESP, EFLAGS, CS and EIP onto the stack. The SS and ESP values are the previous ones, not the new ones from the TSS.
Some exceptions also push an error code onto the stack.
Gets the values for CS and EIP from the interrupt descriptor table and puts these values in CS and EIP.
Note that the kernel has set up these tables and segments in advance.
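As a rough illustration of the steps above, the values the processor pushes can be pictured as a C struct laid out from the new stack pointer upward. This is only a sketch: the struct and field names are made up for illustration, and the optional error code (which, when present, sits just below EIP) is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. The 32-bit frame pushed when an exception causes a
   privilege-level change, listed from the lowest address (pushed last)
   to the highest (pushed first). CS and SS are 16-bit selectors stored
   in 32-bit stack slots. */
struct exception_frame {
    uint32_t eip;     /* where the interrupted code was executing */
    uint32_t cs;      /* code segment selector of the interrupted code */
    uint32_t eflags;  /* flags at the time of the exception */
    uint32_t esp;     /* old stack pointer, not the one loaded from the TSS */
    uint32_t ss;      /* old stack segment selector */
};
```

The kernel's handler can read the faulting EIP from such a frame and later return to it (or not) with IRET.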
Then:
The kernel decides what to do with the exception. This depends on the specific kernel. Usually, it decides to kill your program. On Linux, you can override this default using signal handling and on Windows you can override it using Structured Exception Handling.
(This is not an exhaustive reference to x86 exception handling. This is a brief overview of the most common case.)
The detailed answer at https://stackoverflow.com/a/59075911/15304 from @user253751 covers all that you may want to know.
A word of context might help, though: the processor usually proceeds to the next instruction once the current one is finished, but there are cases where it will suddenly start executing completely unrelated code. This is called an interrupt, and it is widely used to support device operations or to get some code called at periodic intervals.
In an interrupt handler, we have to save the full processor state so that the interrupted code can be safely resumed after we're done with device-specific code.
The hardware exception mechanism, which detects that a process is trying to do something impossible or invalid given the current configuration, borrows extensively from the interrupt mechanism, but it also has to take care of a context switch between the (presumably) user-level code of the "faulty" process and the kernel-level code that will handle the fault. That context switch is the reason why we see stack pointers reloaded and the task state segment involved in the description of hardware exceptions, which have much simpler definitions (e.g. execute the instruction at address 0xfffff000) on other architectures.
Note that a hardware exception doesn't necessarily mean that the process crashed. The exception handler in the kernel will usually have to examine some information (what address we tried to access, what object is mapped at this address, etc.) and then either does a useful job (such as bringing one more page of a mapped file into memory) and resumes the process, or calls it an invalid access.
Windows 10, x64 , x86
My current knowledge
Let's say it is a quad-core CPU: there will be 4 individual program counters, which point to 4 different locations in the code for parallel execution.
Each of these program counters indicates where a core is in its program sequence.
The address it points to changes after a context switch, when another thread's program counter is loaded into the program counter for execution.
What I want to do:
I'm in kernel mode, my thread is running on core 1, and I want to read the current instruction pointer of core 2.
Expected Results:
0x203123 is the address of the instruction pointer and this address belongs to this thread and this thread belongs to this process... etc.
Anyone knows how to do it or can give me good book references, links etc...
Although I don't believe it's officially documented, there is a ZwGetContextThread exported from ntdll.dll. Being undocumented, things can change (and I haven't tried it in quite a while) but at least when I last tried it, you called it with a thread handle and a pointer to a CONTEXT structure, and it would return that thread's context.
I'm not certain exactly how up-to-date that is though. It's never mattered to me, so I haven't checked, but my guess would be that the IP in the CONTEXT you get is whatever was saved the last time the thread was suspended. So, if you want something (reasonably) current, you'd use ZwSuspendThread, get the context, then ZwResumeThread to start it running again.
Here I suppose I'm probably supposed to give the standard lines about undocumented functions being subject to change, using them being a bad idea, and that you should generally leave all of this alone. Ah well, I've been disappointing teachers and other authority figures for years, and I guess I'm not changing right now.
On the other hand, there may be a practical problem here. If you really need data that's really current, this probably isn't going to work very well for you. What it gives you will be kind of current at best. On the other hand, really current is almost a meaningless concept with information that goes out of date every clock cycle.
Anyone knows how to do it or can give me good book references, links etc...
For 80x86 hardware (regardless of operating system); there are only 3 ways to do this (that I know of):
a) send an inter-processor interrupt to the other CPU, and have an interrupt handler that stores the "return EIP" (from its stack) at a known address in memory so that your CPU can read "value of EIP immediately before interrupt" (with synchronization so that your CPU doesn't read before the value is written, etc).
b) put the other CPU into some kind of "debug mode" (single-stepping, last branch recording, ...) so that (either code in a debug exception handler or the CPU's hardware itself) is constantly writing EIP values to memory that you can read.
Of course both of these options will ruin performance, and the value you get will probably be useless (because EIP would've changed after you obtain it but before you can use the obtained value). To ensure the value is still useful; you'd need the other CPU to wait until after you've consumed the obtained value (and are ready for the next value); and to do that you'd have to resort to single-step debugging facilities (with the waiting in the debug exception handler), where you'll be lucky if you can get performance better than a thousand times slower (and can probably improve performance by simply disabling other CPUs completely).
Also note that they still won't accurately tell you EIP in all cases (e.g. if the CPU is in SMM/System Management Mode and is beyond the control of the OS); and I doubt Windows kernel supports any of it (e.g. kernel should support single-stepping of user-space processes/threads to allow debuggers to work, but won't support single-stepping of kernel and will probably lock up the computer due to various "waiting for lock to be released for 6 days" problems).
The last of the 3 options is:
c) Run the OS inside an emulator/simulator instead of running it on real hardware. In that case you can probably modify the emulator/simulator's code to inject EIP values somewhere (maybe some kind of virtual "EIP reporting device"?). This will ruin performance of the emulator/simulator, but you may be able to hide that (e.g. "virtual time inside the emulator passes at a rate of one second per 1000 seconds of real time outside the emulator").
When every process has its own private memory space that no external process has access to, how does a debugger access a process' memory space?
For example, I can attach gdb to a running process using gdb -p <pid>
Then I can access all the memory of this process via gdb.
How is gdb able to do this?
I read the relevant questions in SO and no post seems to answer this point.
Since the question is tagged Linux and Unix, I'll expand a little on what David Schwartz says, which in short is "there is an API for that in the OS". The same basic principle applies in Windows as well, but the actual implementation is different. Although I suspect the implementation inside the OS does much the same thing, there's no real way to know that, since we can't inspect the source code for Windows (one can, however, sort of figure out what must be happening by understanding how an OS and a processor work).
Linux has a function called ptrace, that allows one process (following some checking of privileges) to inspect another process in various ways. It is one call, but the first parameter is a "what do you want to do". Here are some of the most basic examples - there are a couple of dozen others for less "common" operations:
PTRACE_ATTACH - connect to the process.
PTRACE_PEEKTEXT - look at the attached process' code memory (for example to disassemble the code)
PTRACE_PEEKDATA - look at the attached process' data memory (to display variables)
PTRACE_POKETEXT - write to process' code memory
PTRACE_POKEDATA - write to process' data memory.
PTRACE_GETREGS - copy the current register values.
PTRACE_SETREGS - change the current register values (e.g. a debug command of set variable x = 7, if x happens to be in a register)
In Linux, since memory is "all the same", PTRACE_PEEKTEXT and PTRACE_PEEKDATA are actually the same functionality, so you can give an address in code for PTRACE_PEEKDATA and an address, say, on the stack for PTRACE_PEEKTEXT and it will perfectly happily copy that back for you. The distinction is made for OS/processor combinations where memory is "split" between DATA memory and CODE memory. Most modern OS's and processors do not make that distinction. Same obviously applies to PTRACE_POKEDATA and PTRACE_POKETEXT.
So, say that the "debugger process" uses:
long data = ptrace(PTRACE_PEEKDATA, pid, (void *)0x12340128, NULL);
When the OS is called with PTRACE_PEEKDATA for address 0x12340128, it looks up the memory mapping that covers 0x12340128 (the page-aligned address is 0x12340000). If the mapping exists, the page is mapped into the kernel, the data is copied out from address 0x12340128 into local memory, the page is unmapped again, and the copied data is passed back as the return value.
The manual describes how tracing is started:
The parent can initiate a trace by calling fork(2) and having the
resulting child do a PTRACE_TRACEME, followed (typically) by an exec(3).
Alternatively, the parent may commence trace of an existing process
using PTRACE_ATTACH.
For several more pages of information, see man ptrace.
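Putting the pieces above together, here is a minimal sketch of one process reading another's memory. It assumes Linux and an environment where ptrace is permitted; peek_child_secret and secret are made-up names. The child changes a variable after the fork and then stops, and the parent reads the child's copy with PTRACE_PEEKDATA while its own copy still holds the old value.

```c
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives at the same virtual address in parent and child after fork(). */
static volatile long secret = 0;

/* Hypothetical helper: returns the value read from the child's memory,
   or -1 if something went wrong. */
long peek_child_secret(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {                       /* child: the process being traced */
        secret = 0x1234;                  /* only the child's copy changes */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);                   /* stop so the parent can inspect us */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFSTOPPED(status))              /* child exited and was reaped */
        return -1;

    /* PEEKDATA returns the word itself, so errors are reported via errno. */
    errno = 0;
    long value = ptrace(PTRACE_PEEKDATA, pid, (void *)&secret, NULL);
    int failed = (value == -1 && errno != 0);

    kill(pid, SIGKILL);                   /* done with the child */
    waitpid(pid, NULL, 0);
    return failed ? -1 : value;
}
```

When ptrace is allowed, the call returns 0x1234 even though the parent's own secret is still 0, which is what a debugger does on a larger scale when it displays a variable.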
When every process has its own private memory space that no external process has access to ...
That's false. External processes with the correct permissions and using the correct APIs can access other processes' memory.
For Linux debugging there is a system call, ptrace, which makes it possible to control another process on the system. You do need the rights to do that, which you typically have if you are the owner of the process and have not removed the permission manually.
The ptrace call itself gives read and write access to memory, the program counter, registers, and nearly everything else related to the process.
Please see man ptrace for details.
If you are interested in how it works in a debugger, please have a look at the file gdb-x.x.x/gdb/linux-nat.c. There you can find the core code for accessing other processes to debug them.
Context:
According to this description, user-space programs cannot perform all of the operations that the processor provides. The description in the link above says that there are different privilege levels inside the CPU.
Question:
How is user-space code prevented from being executed at privileged levels by the CPU? Wouldn't it be possible to switch to a higher level using assembly language, without using system calls?
I am pretty sure it is not, but I do not understand why. Could anyone please point this out or point to some resources which deals with this topic?
When the cpu reaches an instruction which, due to the identity of the instruction to be executed, the memory address to be accessed, or some other condition, is not permitted at the current privilege level, a cpu exception is raised. This essentially saves the current cpu state (register contents, etc.) and transfers execution to a preset kernel address running at kernel privilege level, which can inspect the operation that was to be performed and decide how to proceed. In practice, it will generally end with the kernel killing the process if the operation to be performed is not permitted.
The CPU processes code stored in RAM.
Memory is described by tables that carry protection flags and have a special layout. These so-called descriptor tables and page tables translate virtual addresses into physical ones. First there is a descriptor (segment) check, in which the GDT is read. Each GDT entry contains a value called the descriptor privilege level (DPL). It holds the ring level that the accessing code must meet. If it does not, no access is granted.
Then comes the page directory check, which involves a user/supervisor bit. This must also meet certain conditions. If the bit is zero, only privileged (supervisor-mode) code may access the pages covered by that page directory entry.
If the bit is one, all processes may access the pages covered by the page directory entry being checked.
The last check is the page table check. It works like the previous checks.
If an access passes all checks successfully, access to the memory page is granted. The CPU register CR3, which holds the address of the page directory, should be of interest here.
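The paging part of these checks can be modelled in a few lines of C. This is a simplified sketch, not how any real kernel or CPU is written: the function names are made up, only the present, read/write and user/supervisor bits of a 32-bit entry are considered, and supervisor writes are assumed to honour the read/write bit (as with CR0.WP set).

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bits of a 32-bit x86 page directory or page table entry. */
#define PG_PRESENT  (1u << 0)   /* entry is valid (page is mapped) */
#define PG_WRITABLE (1u << 1)   /* writes are allowed */
#define PG_USER     (1u << 2)   /* 1: ring 3 may access, 0: supervisor only */

/* Hypothetical helper: check a single entry against one access. */
static bool entry_allows(uint32_t entry, int cpl, bool is_write)
{
    if (!(entry & PG_PRESENT))
        return false;                       /* not mapped: page fault */
    if (cpl == 3 && !(entry & PG_USER))
        return false;                       /* user code, supervisor page */
    if (is_write && !(entry & PG_WRITABLE))
        return false;                       /* write to a read-only page */
    return true;
}

/* Hypothetical helper: the access succeeds only if the page directory
   entry and the page table entry both allow it. */
bool walk_allows(uint32_t pde, uint32_t pte, int cpl, bool is_write)
{
    return entry_allows(pde, cpl, is_write) &&
           entry_allows(pte, cpl, is_write);
}
```

For instance, walk_allows(PG_PRESENT | PG_WRITABLE, PG_PRESENT | PG_USER, 3, false) is false, because the page directory entry is supervisor-only even though the page table entry would permit user access.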
How do breakpoints work in C++ code? Are they special instructions inserted in between some assembler instructions when the code is compiled? Or is there something else in place? Also, how is stepping through the code implemented? The same way as breakpoints...?
This depends heavily on the CPU and the debugger.
For example, one possible solution on an x86 CPU:
Insert a one-byte INT3 instruction at the required place
Wait until the breakpoint exception hits
Compare the exception address to the list of breakpoints to determine which one it is
Perform the breakpoint actions
Replace INT3 with the original byte and switch the debugged process into trace mode (step-by-step execution of CPU instructions)
Continue the debugged process
As soon as you catch the trace exception, the instruction has been executed
Put INT3 back
Watchpoints can be implemented in a similar way, but instead of inserting INT3 you make the memory page holding the watched variable read-only or inaccessible, and wait for the segmentation exception.
Stepping through assembly can also be done using trace mode. Stepping through source lines can be done by placing breakpoints on the next instructions, based on debug data.
Some CPUs also have hardware breakpoint support, where you just load the address into a register.
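The INT3 steps listed above can be sketched with ptrace on Linux. This is an illustration under several assumptions, not how gdb is actually written: it only does real work on x86-64 Linux (elsewhere it returns -1), it uses GCC/Clang extensions, it skips most error handling, and target and breakpoint_demo are made-up names. The parent plants 0xCC at the start of target in the child, waits for the trap, restores the original byte, rewinds the instruction pointer, and single-steps once.

```c
#include <signal.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical function we want to break on. */
__attribute__((noinline)) static void target(void)
{
    __asm__ volatile("");   /* keep the call from being optimized away */
}

/* Returns 1 if the child stopped at our INT3, 0 if it did not, and -1 on
   error or when not built for x86-64 Linux. */
int breakpoint_demo(void)
{
#if defined(__linux__) && defined(__x86_64__)
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        raise(SIGSTOP);               /* wait for the parent to plant INT3 */
        target();
        _exit(0);
    }

    int status, hit = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;

    void *addr = (void *)(uintptr_t)target;

    /* Save the original word and overwrite its first byte with INT3 (0xCC). */
    long orig = ptrace(PTRACE_PEEKTEXT, pid, addr, NULL);
    long patched = (orig & ~0xffL) | 0xcc;
    ptrace(PTRACE_POKETEXT, pid, addr, (void *)patched);

    /* Resume the child and wait for the breakpoint exception (SIGTRAP). */
    ptrace(PTRACE_CONT, pid, NULL, NULL);
    waitpid(pid, &status, 0);

    if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP) {
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);
        /* After INT3, RIP points one byte past the breakpoint address. */
        hit = (regs.rip == (uintptr_t)addr + 1);

        /* Restore the original byte, rewind RIP, and single-step. */
        ptrace(PTRACE_POKETEXT, pid, addr, (void *)orig);
        regs.rip = (uintptr_t)addr;
        ptrace(PTRACE_SETREGS, pid, NULL, &regs);
        ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);
        waitpid(pid, &status, 0);     /* trace exception: one instruction ran */
        /* A real debugger would now put INT3 back and continue. */
    }

    if (WIFSTOPPED(status)) {         /* child is still alive: clean up */
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
    }
    return hit;
#else
    return -1;
#endif
}
```

The comparison against addr + 1 reflects that the breakpoint exception is reported after the one-byte INT3 has executed, which is why the debugger has to move the instruction pointer back before re-running the original instruction.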
According to this blog entry on technochakra.com you are correct:
Software breakpoints work by inserting a special instruction in the program being debugged. This special instruction on the Intel platform is "int 3". When executed it calls the debugger's exception handler.
I'm not sure how stepping into or over the next instruction is implemented though. However, the article goes on to add:
For practical reasons, it is unwise to ask for a recompilation whenever a breakpoint is added or deleted. Debuggers change the loaded image of the executable in memory and insert the "int 3" instruction at runtime.
However, this would only be used for the "run to current line option".
Single stepping is implemented at (assembler) code level not at C++ level. The debugger knows how to map the C++ code lines to code addresses.
There are different implementations. There are CPUs that support debugging with breakpoint registers. When execution reaches the address in the breakpoint register, the CPU raises a breakpoint exception.
A different approach is to patch the code for the time of execution with a special instruction, ideally a one-byte instruction. On x86 systems that is usually int 3.
The first approach allows breakpoints in ROM, the second allows more breakpoints at the same time.
AFAIK all debuggers (for whatever compiled language) that allow an unlimited number of breakpoints use a variant of replacing the instruction to be breakpointed with a special value (as described above) and keeping a list of places where these values have been placed.
When the processor tries to execute one of these special values, an exception is raised, the debugger catches it and checks if the address of the exception is on its list of breakpoints.
If it is, the debugger is invoked and the user is given an opportunity to interact.
If it is NOT, then the exception is due to something that was in the program from the outset and the debugger lets the exception 'pass' to whatever error handler might be there.
Note also that debugging self-modifying code can fail precisely because the debugger momentarily modifies the code itself. (Of course, nobody would ever write self-modifying code, now would they? >;-)
For these reasons, it is important that the debugger be given the opportunity to remove all the breakpoints it sets before terminating the debugging session.
I finished homework for a graduate course in operating systems. I got a great score and I only missed one tiny point of a question. It asked which were privileged instructions and which were not. I answered all correctly except one: Adding one register value to another
I answered it was privileged but apparently it's not! How can this be?
I figured the user interacts with registers/memory by using system calls, which in a sense change from user-mode system calls to kernel-mode routines. Therefore the adding of one register value to another could be called by a non-privileged user, but in the end the kernel is doing the work and is in kernel, privileged mode. Therefore it's privileged? A user can't do it by themselves. Am I wrong? Why?!
Thanks!
I'm not sure why you would think that changing a register would require kernel intervention. Some special registers may be privileged (those controlling things like descriptor tables or protection levels, with which user-mode code could bypass system-mode protections) but general purpose registers can be changed freely without a kernel getting involved.
When your code is running, the vast majority of instructions would be things like:
inc %eax
movl $7,%ebx
addl %eax,%ebx
As an aside, I'm just imagining how slow my code would run if it required a system call to the kernel every time I incremented a counter or called a function :-)
The only thing I can think of would be if you thought your execution thread wasn't allowed to change registers arbitrarily since that may affect those registers for other threads. But the kernel would take care of that when switching threads - all your registers would be packed away somewhere for later and the ones for the next thread would be loaded in.
Based on your comments, you seem to think that the time of adding is when the CPU protection mechanism should step in. In fact, it can't at that point because it has no idea what you're going to use the register for. You may just be using it as a counter.
However, if you do use it as an address to access memory, and that memory is invalid somehow (outside of your address space, or swapped to disk), the kernel will step in at that point to rectify the situation (toss your application out on its ear, or bring in the swapped-out memory).
However, even that is not a privileged instruction, it's just the CPU handling page faults.
A privileged instruction is something that you're not allowed to do at all, like change the interrupt descriptor table location registers or deactivate interrupts.