Privileged instructions: adding register values?

I finished a homework assignment for a graduate course in operating systems. I got a great score, and I missed only one tiny point on one question. It asked which instructions were privileged and which were not. I answered all of them correctly except one: adding one register value to another.
I answered it was privileged but apparently it's not! How can this be?
I figured the user interacts with registers/memory by using system calls, which in a sense switch from user-mode calls into kernel-mode routines. Therefore the adding of one register value to another could be requested by a non-privileged user, but in the end the kernel is doing the work, in kernel (privileged) mode. Therefore it's privileged? A user can't do it by themselves. Am I wrong? Why?!
Thanks!

I'm not sure why you would think that changing a register would require kernel intervention. Some special registers may be privileged (those controlling things like descriptor tables or protection levels, with which user-mode code could bypass system-mode protections), but general-purpose registers can be changed freely without the kernel getting involved.
When your code is running, the vast majority of instructions would be things like:
inc %eax
movl $7,%ebx
addl %eax,%ebx
As an aside, I'm just imagining how slow my code would run if it required a system call to the kernel every time I incremented a counter or called a function :-)
The only thing I can think of would be if you thought your execution thread wasn't allowed to change registers arbitrarily since that may affect those registers for other threads. But the kernel would take care of that when switching threads - all your registers would be packed away somewhere for later and the ones for the next thread would be loaded in.
Based on your comments, you seem to think that the time of adding is when the CPU protection mechanism should step in. In fact, it can't at that point because it has no idea what you're going to use the register for. You may just be using it as a counter.
However, if you do use it as an address to access memory, and that memory is invalid somehow (outside of your address space, or swapped to disk), the kernel will step in at that point to rectify the situation (toss your application out on its ear, or bring in the swapped-out memory).
However, even that is not a privileged instruction; it's just the CPU raising a page fault, which the kernel then handles.
A privileged instruction is something that you're not allowed to do at all, like change the interrupt descriptor table location registers or deactivate interrupts.
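To see the difference concretely, here's a minimal sketch (assuming Linux on x86 and GCC-style inline asm; the code is mine, not from the question): register arithmetic runs freely, while a genuinely privileged instruction gets the process killed.
// Plain register arithmetic needs no kernel involvement at all.
// cli (clear interrupt flag) is privileged: the CPU raises #GP,
// which the kernel delivers to the process as SIGSEGV.
int main() {
    int a = 1, b = 2;
    a += b;              // fine: ordinary user-mode add
    asm volatile("cli"); // privileged: the process dies here
    return a;            // never reached
}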

Related

How to Read the program counter / Instruction pointer of a specific core in Kernel Mode?

Windows 10, x64, x86
My current knowledge
Let's say it is a quad core: there will be 4 individual program counters pointing to 4 different locations in the code for parallel execution.
Each of these program counters indicates where a core is in its program sequence.
The address it points to changes after a context switch, when another thread's program counter gets loaded onto the core to execute.
What I want to do:
I'm in kernel mode, my thread is running on core 1, and I want to read the current instruction pointer of core 2.
Expected Results:
0x203123 is the value of the instruction pointer, this address belongs to this thread, and this thread belongs to this process... etc.
Does anyone know how to do it, or can you give me good book references, links, etc.?
Although I don't believe it's officially documented, there is a ZwGetContextThread exported from ntdll.dll. Being undocumented, things can change (and I haven't tried it in quite a while) but at least when I last tried it, you called it with a thread handle and a pointer to a CONTEXT structure, and it would return that thread's context.
I'm not certain exactly how up-to-date that is though. It's never mattered to me, so I haven't checked, but my guess would be that the IP in the CONTEXT you get is whatever was saved the last time the thread was suspended. So, if you want something (reasonably) current, you'd use ZwSuspendThread, get the context, then ZwResumeThread to start it running again.
Here I suppose I'm probably supposed to give the standard lines about undocumented functions being subject to change, using them being a bad idea, and that you should generally leave all of this alone. Ah well, I've been disappointing teachers and other authority figures for years, and I guess I'm not changing right now.
On the other hand, there may be a practical problem here. If you really need data that's really current, this probably isn't going to work very well for you. What it gives you will be kind of current at best. On the other hand, really current is almost a meaningless concept with information that goes out of date every clock cycle.
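If you'd rather stay on documented ground, the Win32 wrappers do the same suspend/get/resume dance. A minimal sketch (x64; the function name is mine):
#include <windows.h>

// Sample another thread's instruction pointer. hThread needs the
// THREAD_SUSPEND_RESUME and THREAD_GET_CONTEXT access rights.
DWORD64 SampleThreadIp(HANDLE hThread) {
    if (SuspendThread(hThread) == (DWORD)-1) // freeze it so the value is stable
        return 0;
    CONTEXT ctx = {};
    ctx.ContextFlags = CONTEXT_CONTROL;      // we only need RIP/RSP/EFLAGS
    DWORD64 rip = 0;
    if (GetThreadContext(hThread, &ctx))
        rip = ctx.Rip;                       // the IP as of the suspension point
    ResumeThread(hThread);
    return rip;
}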
Does anyone know how to do it, or can you give me good book references, links, etc.?
For 80x86 hardware (regardless of operating system), there are only 3 ways to do this (that I know of):
a) send an inter-processor interrupt to the other CPU, and have an interrupt handler that stores the "return EIP" (from its stack) at a known address in memory so that your CPU can read "value of EIP immediately before interrupt" (with synchronization so that your CPU doesn't read before the value is written, etc).
b) put the other CPU into some kind of "debug mode" (single-stepping, last branch recording, ...) so that (either code in a debug exception handler or the CPU's hardware itself) is constantly writing EIP values to memory that you can read.
Of course both of these options will ruin performance, and the value you get will probably be useless (because EIP would've changed after you obtain it but before you can use the obtained value). To ensure the value is still useful; you'd need the other CPU to wait until after you've consumed the obtained value (and are ready for the next value); and to do that you'd have to resort to single-step debugging facilities (with the waiting in the debug exception handler), where you'll be lucky if you can get performance better than a thousand times slower (and can probably improve performance by simply disabling other CPUs completely).
Also note that these still won't accurately tell you EIP in all cases (e.g. if the CPU is in SMM/System Management Mode and is beyond the control of the OS); and I doubt the Windows kernel supports any of it (e.g. the kernel supports single-stepping of user-space processes/threads so that debuggers work, but won't support single-stepping of the kernel itself, and would probably lock up the computer due to various "waiting for a lock to be released for 6 days" problems).
The last of the 3 options is:
c) Run the OS inside an emulator/simulator instead of running it on real hardware. In that case you can probably modify the emulator/simulator's code to inject EIP values somewhere (maybe some kind of virtual "EIP reporting device"?). This will ruin performance of the emulator/simulator, but you may be able to hide that (e.g. "virtual time inside the emulator passes at a rate of one second per 1000 seconds of real time outside the emulator").

Can more than one load/store instruction be executed at the same instant of time in a multiprocessor environment?

I believe that in single-processor systems, multiple stores happen one after the other, but what is the case for multiprocessor systems?
Adding to the question: if the machine is 32-bit and we try to write a long int (64-bit) value to memory, how will the load/store instructions behave?
The reason for the above two questions is: if someone tries to read the same memory (a memory location of 32/64 bits, on a 32-bit system) from another thread, will this be safe, or do I need to consider using locks?
Added:
I want to use as few locks as possible, since ours is a time-critical execution path. Hence I want to understand whether there is ever a possibility of two store/load instructions executing at the same instant of time on the same memory location in a multiprocessor environment.
You are wrong if you only look at load/store CPU instructions.
The compiler, your OS, and your CPU can:
change execution order to optimize the code
hold values in separate caches
store data in CPU registers without accessing cache or other memory
optimize accesses away completely
... and a lot more, I believe!
If you want to access the same variable from different threads, you must use a synchronization mechanism provided by your language, or by a library that fits your OS. Nothing else gives you a guarantee that it will work.
The problem is not the actual access to any kind of memory. You definitely must ensure that your code contains memory barriers as needed for the underlying libraries and OS support. If there are no barriers between multi-threaded accesses, you may never see a change written in one thread while reading it from a second one.
This will also be a problem on a single-core CPU, because the compiler has no idea that you modify a variable from two threads if you don't use any kind of synchronization.
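For instance, a minimal C++ sketch of that advice (the names are illustrative):
#include <mutex>
#include <thread>

int counter = 0;   // shared between threads
std::mutex m;      // the synchronization mechanism

void worker() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(m); // the lock implies the needed barriers
        ++counter;                           // no torn or lost updates, no stale values
    }
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join(); t2.join();
    return counter == 200000 ? 0 : 1;        // always 200000 with the lock
}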
To your add on:
You simply have no control over any kind of memory access without writing your code in assembler. And if you write it in assembler, you have to deal with registers, L1/L2/Lx caching, memory mapping, inter-CPU communication, and so on. Forget all about the load/store instructions. They are only 1% of the job!
If you have time critical jobs:
try to pin the thread to a fixed core (see the detailed description for threading libraries like POSIX pthreads, or whatever library you are running on)
it can be much faster to run a single process with a single thread and program it in a cooperative fashion. No locks, no memory barriers, no IPC. But you have to deal with all the thread-like problems yourself. But it is fast!
Often it is much faster to split your problem into several processes, each with only one thread, and keep the IPC minimal. This needs a deep understanding of how your algorithms can scale.
Often a very simple 8/16-bit CPU runs much faster in special environments than a fat 8-core CPU with a fat OS on it.
But you don't tell us the rest of your environment and requirements, so no answer can fully address your real problem. But keep in mind: load/store was yesterday.
This cannot be answered generically. You have to know which model, of what design, of processor it is. An AMD Opteron will be different from an Intel Pentium, which is different from an Intel Core2, and all of those are different from an ARMv7 design. [They are probably fairly similar, but there are details you may care about if you REALLY want to rely on these operations being performed in a specific way]. And of course, if you share memory between, say, a GPU (graphics processing unit) and a CPU, you have even more possible scenarios of "different design".
There are single-core processors that are "superscalar" (more than one execution unit) and that do "out-of-order execution" (reordering instructions), so there can be more than one execution unit active (including more than one load/store unit), and thus more than one instruction (including loads or stores) can be performed at the same time.
Obviously, once the processor determines that the memory operation needs to go "outside" (that is, the value is not available in the cache), it has to be serialized, but there is no guarantee that a load or store as sequenced by you or the compiler won't be re-ordered between loads and stores. If the processor has instructions to support "data wider than the bus" (e.g. 32-bit processor loading 64-bit word), these are typically atomic to that processor. If the processor does not in itself support 64-bit words, then the load of a 64-bit value would encompass two 32-bit loads.
[When I write "load", the same applies for "store"]
In case of multiprocessor or multicore architectures, it becomes a system architecture question, which makes it even more complicated than "we can't answer this without understanding the processor design", since there are now more components involved: memory design (one lump of memory shared between processors, several lumps of memory that are not directly shared, etc).
In general, if you have multiple threads, you will need to use atomic operations - most processors have a way to say "I want this to happen without someone else interfering" - in the old days, it would be a "lock" pin on the processor(s) that would be wired to anything else that could access the memory bus, and when that pin was active, all other devices had to wait for it to become inactive before accessing the memory bus. These days, it's a fair bit more sophisticated, since there are caches involved. Most systems use an "exclusive cache content" method: the processor signals all its peers that "I want this address to be exclusive in my cache", at which point all other processors will "flush and invalidate" that particular address in their caches. Then the atomic operation is performed in the cache, and the results are available to be read by other processors only when the atomic operation is completed. This is a pretty simplified view of how it works - modern processors are very complex, and there is a lot of work involved in such seemingly simple things as "make sure this value gets updated in a way that doesn't get interrupted by some other processor writing to the same thing".
If there isn't support in the processor for "atomic" operations, then there has to be proper locks (and any processor designed for use in a multicore/multicpu environment will have operations to support locks in some way), where the lock is taken before updating something, and then released after the update. This is clearly more complex than having builtin atomic operations, but it makes the design of the processor simpler. Also, for more complex updates (where more than one 32- or 64-bit value needs updating) this sort of locking is still required - for example, if we have a "queue" where you have a "where we're writing" and "elements in queue" that both need to be updated on write, you can't do that in a single operation [without being VERY clever about it, at least].
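A sketch of that queue example in C++ (the layout is hypothetical): the two fields must change together, so one atomic word is not enough and a lock covers both.
#include <mutex>

struct Queue {
    int data[64];
    int write_pos = 0; // "where we're writing"
    int count = 0;     // "elements in queue"
    std::mutex m;

    bool push(int v) {
        std::lock_guard<std::mutex> lock(m); // both updates happen under one lock
        if (count == 64)
            return false;
        data[write_pos] = v;
        write_pos = (write_pos + 1) % 64;
        ++count;      // no reader sees one field updated without the other
        return true;
    }
};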
In heterogeneous systems, such as GPU + CPU combinations, you can't do atomics between different devices, because the cache of one device doesn't "understand the language" of the other device. So when the CPU says "I want this as exclusive", the GPU sees "Hurdi gurdi meatballs" and thinks "I have no idea what that is about, I'll just ignore it" [or something like that]. In this case, there has to be some other way to access shared data, but it's typically not atomic ever; you have to send commands (via other means than the interprocessor signalling system) to the GPU to say "flush your cache, and tell me when you're done with that", and when the CPU has written something the GPU needs, the CPU will flush its cache before telling the GPU that it can use the data. This can get pretty messy, and takes a fair amount of time.
I believe that in single-processor systems, multiple stores happen one after the other,
False. Most machines are set up like that, but for performance reasons many CPUs can be configured to have a much more relaxed store ordering. This is almost never a problem for an application on a single CPU (because the CPU will make it look like you expect) but it's really critical to understand when talking to hardware.
Here's a wikipedia article: http://en.wikipedia.org/wiki/Memory_ordering
This gets doubly complicated on CPUs with non-coherent local caches. Because then you can have strong ordering as seen from this CPU while other CPUs will see totally different results depending on the cache flush order.
Adding to the question: if the machine is 32-bit and we try to write a long int (64-bit) value to memory, how will the load/store instructions behave?
Some 32 bit CPUs have instructions to do atomic 64 bit writes, others don't. Those that don't will do two separate writes where a partial result can be seen by other CPUs or threads (if you get unlucky with context switching) or signal handlers or interrupt handlers.
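In portable C++ this is exactly what std::atomic is for; a minimal sketch:
#include <atomic>
#include <cstdint>

std::atomic<int64_t> shared{0};  // never observed half-written

void writer() {
    // On a 32-bit CPU with an atomic 64-bit store available, this compiles
    // to a single atomic write; otherwise the library falls back to a lock.
    shared.store(0x1111111122222222LL);
}

int64_t reader() {
    // A plain int64_t could be read here as two 32-bit halves;
    // the atomic load rules that out.
    return shared.load();
}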
The reason for the above two questions is: if someone tries to read the same memory (a memory location of 32/64 bits, on a 32-bit system) from another thread, will this be safe, or do I need to consider using locks?
Yes, no, maybe. If it's just one value and it doesn't tell the other thread that some other memory might be in a certain state, then yes, it can be safe in certain circumstances. You're not guaranteed that the other thread will see the changed value in memory for a long time, but eventually it should see it.
Generally, you can't reason about the behavior of access to shared memory in a threaded environment without strictly following the documentation of the thread model you're using. And most of those say something like without locks the behavior is undefined, with locks everything that happened before the lock is guaranteed to happen before the lock and everything that happens after the lock is guaranteed to happen after the lock. This is not only because of differences between CPUs, but also because the operating system can do something funny and the locking code needs to be designed to convince the compiler to not do something funny either (which is surprisingly hard with modern compilers).

Which registers are changed when we move from user mode to kernel mode, and what is the reason to move to kernel mode?

Which registers are changed when we move from user mode to kernel mode, and what is the reason to move to kernel mode?
Why don't these cause a move to kernel mode:
root (the superuser or admin) creating a new admin
getting a TLB miss (why don't we move to kernel mode?)
writing the modified bit in the page tables
From your questions I can see that you are very weak on operating system concepts.
OK, let me explain (I am assuming you are using Linux, not Windows).
"Which registers are changed when we move from user mode to kernel mode?"
To answer this question you need to learn about process management.
But put simply: Linux uses the system call interface to change from user space to kernel space. The system call interface uses certain registers (depending on your processor) to pass the system call number and the arguments for the system call.
In general, the move to kernel mode happens when
you make an explicit request to the kernel (a system call)
you make an implicit request to the kernel (accessing memory that isn't mapped into your space, whether valid or not)
the kernel decides it needs to do something more important than executing your code (normally as the result of a hardware interrupt).
All registers will be preserved, as it would be rather difficult to write code if your registers could change at random, but how that happens is very CPU specific.
"which registers is changed when we move from user mode to kernel mode ?!"
In a typical x86-based architecture running the Linux kernel, this is what happens:
A software program triggers interrupt 0x80 with the instruction: int $0x80
The CPU changes the program counter register and the code selector to refer to the place where the Linux system call handler resides in memory (Linux applies the virtual memory concept).
The registers affected so far are CS, EIP, and the EFLAGS register. The CPU also changes the stack selector (SS) and stack pointer (ESP) to refer to the top of the kernel stack.
Finally, the kernel changes the data selector and extra data selector (DS & ES) to select a kernel-mode data segment.
The kernel then pushes the program context onto the kernel's stack, and the general-purpose registers (like the accumulators) change as the kernel code executes.
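To make the walkthrough concrete, here's a hypothetical example (GCC inline asm, compiled as 32-bit; in the i386 ABI, system call number 4 is sys_write):
// The int $0x80 trap is the point where CS/EIP and SS/ESP switch to
// kernel values as described above.
long write_demo(const char* msg, unsigned len) {
    long ret;
    asm volatile("int $0x80"
                 : "=a"(ret)                          // eax: return value
                 : "a"(4), "b"(1), "c"(msg), "d"(len) // eax=4 (write), ebx=1 (stdout)
                 : "memory");
    return ret;
}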
So as you can see, it all depends on the operating system and the architecture.
"and what is the reason to move to kernel mode ?"
The CPU works in kernel mode by default, so your question should be "what is the need for user mode?". User mode is necessary because it doesn't give all permissions to the running software. You can run your browser/file manager/shell in user mode without any worries. If full permissions were given to application software, it could access kernel data and damage it, and it might also access the hardware and, for example, destroy the data stored on your hard disk.
The kernel, of course, must work in kernel mode (at least the core of the kernel). Application software, for example, might need to write data to a file on the disk. Application software doesn't have direct access to the disk (because it is running in user mode). The only way to achieve this is to call the kernel (which is running in kernel mode) to do the job. That's why you need to move from user mode to kernel mode and vice versa.

Raise an exception when an address is executed without modifying it directly

I would like to raise an exception when code at a given address is executed, without making it visible in the code.
I know that using a hardware breakpoint is a possibility, but these would get removed if someone attached a debugger that uses them, and I wouldn't have a way of detecting that they are missing and replacing them. What other options are there?
Speed is a concern, i.e., I cannot do PAGE_GUARD single-stepping; the user would lag to death.
I'm on Windows and using VC 2012 w/ C++.
If exception handling is too costly, the only other solution is to emulate the code as the CPU would do.
There are a few caveats, though:
There are a lot of instructions, and decoding and emulating them correctly is a big undertaking.
Switching between emulation and execution will cost you extra CPU cycles.
You won't be able to emulate everything and will have to execute a number of instructions (e.g. FPU/MMX/SSE instructions) in a "playground/sandbox" because of that.
To handle system calls properly, you'll actually have to prepare the CPU state and execute them and then go back into the emulator. You'll probably have to generate code on the fly here.
If the emulated code causes CPU exceptions and uses SEH to handle them (or throws and catches C++ exceptions as CPU exceptions, again via SEH), you are very likely to break the code as stack unwinding won't work on the foreign (emulator's) stack.
Things will get tricky with multi-threaded code, especially so on multi-processor systems. You'll have to catch thread creation/destroying and create/destroy individual instances of the emulator and deal with memory sharing between the threads and deal with atomicity of emulated/executed instructions.
Whatever I've forgotten to think of.
Things may still work too slowly or not work at all.
Another, perhaps more practical, option would be to patch the executable at the address of interest, divert execution to your code (with a jmp instruction), do whatever you need there, and then go back. You'll have to take care of all context preservation/restoration and also emulate the instructions damaged by the jmp instruction written on top of them. There are caveats here as well: those overwritten instructions may be jumped to from elsewhere in the code. You'll have to either choose the address in such a way that there are no jumps into the middle of your jmp, or you'll have to deal with them somehow (not sure how yet).
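A minimal sketch of that patch-and-divert idea on 32-bit Windows (the function name and the 5-byte assumption are illustrative; real detour code must also relocate and re-execute the overwritten instructions):
#include <windows.h>
#include <cstdint>
#include <cstring>

// Overwrite 5 bytes at `target` with `jmp rel32` to `detour`. Assumes the
// first 5+ bytes are whole instructions that nothing jumps into the middle of.
bool InstallJmpHook(void* target, void* detour) {
    DWORD old;
    if (!VirtualProtect(target, 5, PAGE_EXECUTE_READWRITE, &old))
        return false;
    // rel32 is relative to the instruction *after* the 5-byte jmp.
    int32_t rel = (int32_t)((intptr_t)detour - ((intptr_t)target + 5));
    uint8_t* p = (uint8_t*)target;
    p[0] = 0xE9;                  // jmp rel32 opcode
    std::memcpy(p + 1, &rel, 4);
    VirtualProtect(target, 5, old, &old);
    FlushInstructionCache(GetCurrentProcess(), target, 5);
    return true;
}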

Is memory reordering visible to other threads on a uniprocessor?

It's common for modern CPU architectures to employ performance optimizations that can result in out-of-order execution. In single-threaded applications memory reordering may also occur, but it's invisible to programmers, as if memory were accessed in program order. And for SMP, memory barriers come to the rescue, which are used to enforce some sort of memory ordering.
What I'm not sure about is multi-threading on a uniprocessor. Consider the following example: when thread 1 runs, the store to f could take place before the store to x. Let's say a context switch happens after f is written, and right before x is written. Now thread 2 starts to run, and it ends the loop and prints 0, which is undesirable of course.
// Both x, f are initialized w/ 0.
// Thread 1
x = 42;
f = 1;
// Thread 2
while (f == 0)
;
print x;
Is the scenario described above possible? Or is there a guarantee that physical memory is committed during thread context switch?
According to this wiki,
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
Although it didn't explicitly mention uniprocessor multi-threaded applications, it includes this case.
I'm not sure whether it's correct/complete. Note that this may highly depend on the hardware (weak/strong memory model). So you may want to include the hardware you know of in your answers. Thanks.
PS. device I/O, etc are not my concern here. And it's a single-core uniprocessor.
Edit: Thanks Nitsan for the reminder; we assume no compiler reordering here (just hardware reordering), and the loop in thread 2 is not optimized away. Again, the devil is in the details.
As a C++ question the answer must be that the program contains a data race, so the behavior is undefined. In reality that means that it could print something other than 42.
That is independent of the underlying hardware. As has been pointed out, the loop can be optimized away and the compiler can reorder the assignments in thread 1, so that result can occur even on uniprocessor machines.
[I'll assume that with "uniprocessor" machine, you mean processors with a single core and hardware thread.]
You now say that you want to assume compiler reordering or loop elimination does not happen. With this, we have left the realm of C++ and are really asking about the corresponding machine instructions. If you want to eliminate compiler reordering, we can probably also rule out any form of SIMD instructions and consider only instructions operating on a single memory location at a time.
So essentially thread1 has two store instructions in the order store-to-x store-to-f, while thread2 has test-f-and-loop-if-not-zero (this may be multiple instructions, but involves a load-from-f) and then a load-from-x.
On any hardware architecture I am aware of or can reasonably imagine, thread 2 will print 42.
One reason is that, if instructions processed by a single processor were not sequentially consistent among themselves, you could hardly assert anything about the effects of a program.
The only event that could interfere here is an interrupt (as is used to trigger a preemptive context switch). A hypothetical machine that stores the entire state of its current execution pipeline upon an interrupt and restores it upon return from the interrupt could produce a different result, but such a machine is impractical and AFAIK does not exist. These operations would create quite a bit of additional complexity and/or require additional redundant buffers or registers, all for no good reason - except to break your program. Real processors either flush or roll back the current pipeline upon interrupt, which is enough to guarantee sequential consistency for all instructions on a single hardware thread.
And there is no memory model issue to worry about. The weaker memory models originate from the buffers and caches that separate the individual hardware processors from the main memory or nth-level cache they actually share. A single processor has no similarly partitioned resources and no good reason to have them for multiple (purely software) threads. Again, there is no reason to complicate the architecture and waste resources to make the processor and/or memory subsystem aware of something like separate thread contexts if there aren't separate processing resources (processors/hardware threads) to keep those resources busy.
A strong memory ordering executes memory access instructions in the exact order defined by the program; this is often referred to as "program ordering".
A weaker memory ordering may be employed to allow the processor to reorder memory accesses for better performance; this is often referred to as "processor ordering".
AFAIK, the scenario described above is NOT possible on the Intel IA-32 architecture, whose processor ordering outlaws such cases. The relevant rule is (Intel IA-32 Software Developer's Manual, Vol. 3A, §8.2 Memory Ordering):
writes are not reordered with other writes, with the exception of streaming stores, CLFLUSH and string operations.
To illustrate the rule, it gives an example similar to this:
memory locations x, y, initialized to 0;
thread 1:
mov [x], 1
mov [y], 1
thread 2:
mov r1, [y]
mov r2, [x]
r1 == 1 and r2 == 0 is not allowed
In your example, thread 1 cannot store f before storing x.
@Eric, in response to your comments:
The fast string store instruction stosd may store data out of order within its own operation. In a multiprocessor environment, when one processor stores a string str, another processor may observe str[1] being written before str[0], while the logical order is presumed to write str[0] before str[1].
But these instructions are not reordered with any other stores, and they must have precise exception handling. When an exception occurs in the middle of a stosd, the implementation may choose to delay it so that all out-of-order sub-stores (not necessarily the whole stosd instruction) commit before the context switch.
Edited to address the claims made as if this were a C++ question:
Even if this is considered in the context of C++, as I understand it a standard-conforming compiler should NOT reorder the assignments of x and f in thread 1.
§1.9/14:
Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.
This isn't really a C or C++ question, since you've explicitly assumed no load/store re-ordering, which compilers for both languages are perfectly allowed to do.
Allowing that assumption for the sake of argument, note that the loop may never exit anyway, unless you either:
give the compiler some reason to believe f may change (eg, by passing its address to some non-inlineable function which could modify it)
mark it volatile, or
make it an explicitly atomic type and request acquire semantics (a sketch of this follows below)
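A sketch of the atomic version, which makes the original example well-defined regardless of hardware:
#include <atomic>
#include <cstdio>

std::atomic<int> f{0};
int x = 0;

void thread1() {
    x = 42;
    f.store(1, std::memory_order_release);  // publishes the preceding write to x
}

void thread2() {
    while (f.load(std::memory_order_acquire) == 0)
        ;                                   // the load can't be hoisted out of the loop
    std::printf("%d\n", x);                 // guaranteed to print 42
}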
On the hardware side, your worry about physical memory being "committed" during a context switch isn't an issue. Both software threads share the same memory hardware and cache, so there's no risk of inconsistency there whatever consistency/coherence protocol pertains between cores.
Say both stores were issued, and the memory hardware decides to re-order them. What does this really mean? Perhaps f's address is already in cache, so it can be written immediately, but x's store is deferred until that cache line is fetched. Well, a read from x is dependent on the same address, so either:
the load can't happen until the fetch happens, in which case a sane implementation must issue the queued store before the queued load
or the load can peek into the queue and fetch x's value without waiting for the write
Consider anyway that the kernel pre-emption required to switch threads will itself issue whatever load/store barriers are required for consistency of the kernel scheduler state, and it should be obvious that hardware re-ordering can't be a problem in this situation.
The real issue (which you're trying to avoid) is your assumption that there is no compiler re-ordering: this is simply wrong.
You would only need a compiler fence. From the Linux kernel documentation on memory barriers:
SMP memory barriers are reduced to compiler barriers on uniprocessor
compiled systems because it is assumed that a CPU will appear to be
self-consistent, and will order overlapping accesses correctly with
respect to itself.
To expand on that, the reason why synchronization is not required at the hardware level is that:
All threads on a uniprocessor system share the same memory and thus there are no cache-coherency issues (such as propagation latency) that can occur on SMP systems, and
Any out-of-order load/store instructions in the CPU's execution pipeline would either be committed or rolled back in full if the pipeline is flushed due to a preemptive context switch.
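In C++ the compiler-only fence is spelled std::atomic_signal_fence. A sketch (note the plain int accesses still formally race in ISO C++; this only illustrates the barrier itself):
#include <atomic>

int x = 0, f = 0;

void thread1() {
    x = 42;
    // Compiler barrier only: it stops the compiler from reordering the two
    // stores but emits no CPU fence instruction, which, per the kernel docs
    // quoted above, is all a uniprocessor needs.
    std::atomic_signal_fence(std::memory_order_release);
    f = 1;
}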
This code may well never finish (in thread 2), as the compiler can decide to hoist the whole expression out of the loop (this is similar to using an isRunning flag which is not volatile).
That said, you need to worry about 2 types of reordering here: compiler and CPU; both are free to move the stores around. See here: http://preshing.com/20120515/memory-reordering-caught-in-the-act for an example. At this point, the code you describe above is at the mercy of the compiler, compiler flags, and the particular architecture. The wiki quoted is misleading, as it may suggest that internal reordering is not at the mercy of the CPU/compiler, which is not the case.
As far as x86 is concerned, out-of-order stores are made consistent from the viewpoint of the executing code with regard to program flow. In this case, "program flow" is just the flow of instructions that a processor executes, not something constrained to a "program running in a thread". All the instructions necessary for context switching, etc. are considered part of this flow, so the consistency is maintained across threads.
A context switch has to store the complete machine state so that it can be restored before the suspended thread resumes execution. The machine state includes the processor registers but not the processor pipeline.
If you assume no compiler reordering, this means that all hardware instructions that are "in flight" have to be completed before a context switch (i.e. an interrupt) takes place; otherwise they would get lost and not be stored by the context switch mechanism. This is independent of hardware reordering.
In your example, even if the processor swaps the two hardware instructions "x=42" and "f=1", the instruction pointer is already past the second one, and therefore both instructions must be completed before the context switch begins. If it were not so, then, since the contents of the pipeline and of the cache are not part of the "context", they would be lost.
In other words, if the interrupt that causes the context switch happens when the IP register points at the instruction following "f=1", then all instructions before that point must have completed all their effects.
From my point of view, processors fetch instructions one by one.
In your case, if "f = 1" was speculatively executed before "x = 42", that means both of these instructions are already in the processor's pipeline. The only possible way to schedule the current thread out is an interrupt. But the processor (at least on x86) will flush the pipeline's instructions before serving the interrupt.
So there is no need to worry about reordering on a uniprocessor.