Is it necessary to save the FPU state here? - c++

I wrote a simple cooperative multi-threading library. Currently I always save and restore the FPU state with fxsave / fxrstor when switching to a new context. But is this necessary under the cdecl calling convention?
As a simple example:
float thread_using_fpu(float x)
{
    float y = x / 2; // do some FPU operation
    yield();         // context switch, possibly altering FPU state
    y = y / 2;       // another FPU operation
    return y;
}
May the compiler make any assumptions about the FPU state after the call to yield()?

As per the System V Application Binary Interface, Intel386 Architecture Processor Supplement, page 3-12:
%st(0): If the function does not return a floating-point value, then this register must be empty. This register must be empty before entry to a function.
%st(1) through %st(7): Floating-point scratch registers have no specified role in the standard calling sequence. These registers must be empty before entry and upon exit from a function.
Thus, you do not need to context-switch them.
Another, newer version says this:
The CPU shall be in x87 mode upon entry to a function. Therefore, every function that uses the MMX registers is required to issue an emms or femms instruction after using MMX registers, before returning or calling another function. [...]
The control bits of the MXCSR register are callee-saved (preserved across calls), while the status bits are caller-saved (not preserved). The x87 status word register is caller-saved, whereas the x87 control word is callee-saved.
[...] All x87 registers are caller-saved, so callees that make use of the MMX registers may use the faster femms instruction.
So, you may need to save the control word.
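If you do want to preserve just that control state instead of the full fxsave area, here is a minimal sketch using GCC-style inline assembly (the FpuControlState struct and function names are hypothetical):

#include <cstdint>

struct FpuControlState {
    uint16_t x87_control; // x87 control word: precision, rounding, exception masks
    uint32_t mxcsr;       // MXCSR: SSE control (and status) bits
};

inline void save_fpu_control(FpuControlState* s) {
    asm volatile("fnstcw %0"  : "=m"(s->x87_control)); // store x87 control word
    asm volatile("stmxcsr %0" : "=m"(s->mxcsr));       // store MXCSR
}

inline void restore_fpu_control(const FpuControlState* s) {
    asm volatile("fldcw %0"   :: "m"(s->x87_control)); // reload x87 control word
    asm volatile("ldmxcsr %0" :: "m"(s->mxcsr));       // reload MXCSR
}

This is far cheaper than fxsave/fxrstor (which transfer a 512-byte area), at the cost of assuming every thread leaves the x87/MMX register stack empty at each yield(), as the ABI requires.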

No. You don't have to save any of that state. If one thread is in the middle of a floating-point calculation where, for example, a denormalized flag is set, and that thread is interrupted, then when it resumes the OS or kernel will restore the flags, just as it restores the other registers. Likewise, you don't have to worry about it in a yield().
Edit: If you are doing your own context switching, you may need to save the precision and rounding control flags if you set them to non-default values. Otherwise, again, you're fine.

Related

How can I utilize the 'red' and 'atom' PTX instructions in CUDA C++ code?

The CUDA PTX Guide describes the instructions 'atom' and 'red', which perform atomic and non-atomic reductions. This is news to me (at least with respect to non-atomic reductions)... I remember learning how to do reductions with SHFL a while back. Are these instructions reflected or wrapped somehow in CUDA runtime APIs? Or some other way accessible with C++ code without actually writing PTX code?
Most of these instructions are reflected in atomic operations (built-in intrinsics) described in the programming guide. If you compile any of those atomic intrinsics, you will find atom or red instructions emitted by the compiler at the PTX or SASS level in your generated code.
The red instruction type will generally be used when you don't explicitly use the return value from one of the atomic intrinsics. If you use the return value explicitly, then the compiler usually emits the atom instruction.
Thus, it should be clear that this instruction by itself does not perform a complete classical parallel reduction, but certainly could be used to implement one if you wanted to depend on atomic hardware (and associated limitations) for your reduction operations. This is generally not the fastest possible implementation for parallel reductions.
If you want direct access to these instructions, the usual advice would be to use inline PTX where desired.
As requested, to elaborate using atomicAdd() as an example:
If I perform the following:
atomicAdd(&x, data);
perhaps because I am using it for a typical atomic-based reduction into the device variable x, then the compiler would emit a red (PTX) or RED (SASS) instruction taking the necessary arguments (the pointer to x and the variable data, i.e. 2 logical registers).
If I perform the following:
int offset = atomicAdd(&buffer_ptr, buffer_size);
perhaps because I am using it not for a typical reduction but instead to reserve space (buffer_size) in a buffer shared amongst various threads in the grid, where buffer_ptr holds the offset of the next available space in that shared buffer, then the compiler would emit an atom (PTX) or ATOM (SASS) instruction, taking 3 arguments (offset, &buffer_ptr, and buffer_size, in registers).
The red form can be issued by the thread/warp, which may then continue without normally stalling, because this instruction normally creates no dependencies for subsequent instructions. The atom form, on the other hand, implies modification of one of its 3 arguments (one of 3 logical registers). Therefore, subsequent use of the data in that register (i.e. the return value of the intrinsic, offset in this case) can stall the thread/warp until the value is actually returned by the atomic hardware.
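To see both encodings side by side, here is a minimal sketch of a hypothetical kernel (compile with nvcc, then inspect the output with nvcc -ptx or cuobjdump -sass):

__global__ void reduce_and_reserve(int* sum, int* buffer, int* buffer_ptr, int data)
{
    // Return value ignored: the compiler is free to emit red (PTX) / RED (SASS).
    atomicAdd(sum, data);

    // Return value used: the compiler emits atom (PTX) / ATOM (SASS), and the
    // read of offset below creates a dependency that can stall the warp.
    int offset = atomicAdd(buffer_ptr, 1);
    buffer[offset] = data;
}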

6502 emulator in C/C++: how to separate addressing mode code from actual instruction code

In my spare time I'm starting to write a very simple C++ emulator for the 6502 CPU.
I used to write a lot of assembly code for this CPU, so the opcodes, addressing modes and other details are not a big deal.
The 6502 has 56 different instructions plus 13 addressing modes, giving a total of 151 different opcodes. Speed is not an issue for me, so instead of writing a huge switch-case statement and repeating the same code again and again (different opcodes can refer to the same instruction using a different addressing mode), I'd like to separate the actual instruction code from the addressing-mode code. I find this approach very neat, as it requires writing only 13 addressing-mode functions and 56 instruction functions, without repeating myself.
Here are the addressing-mode functions:
// Addressing modes
uint16_t Addr_ACC(); // ACCUMULATOR
uint16_t Addr_IMM(); // IMMEDIATE
uint16_t Addr_ABS(); // ABSOLUTE
uint16_t Addr_ZER(); // ZERO PAGE
uint16_t Addr_ZEX(); // INDEXED-X ZERO PAGE
uint16_t Addr_ZEY(); // INDEXED-Y ZERO PAGE
uint16_t Addr_ABX(); // INDEXED-X ABSOLUTE
uint16_t Addr_ABY(); // INDEXED-Y ABSOLUTE
uint16_t Addr_IMP(); // IMPLIED
uint16_t Addr_REL(); // RELATIVE
uint16_t Addr_INX(); // INDEXED-X INDIRECT
uint16_t Addr_INY(); // INDEXED-Y INDIRECT
uint16_t Addr_ABI(); // ABSOLUTE INDIRECT
They all return the actual 16-bit memory address used by the instruction to read/write the operand/result.
The instruction function prototypes are:
void Op_ADC(uint16_t addr);
void Op_AND(uint16_t addr);
void Op_ASL(uint16_t addr);
...
Each takes the 16-bit address, performs its own operations, updates the status flags and/or registers, and commits the result (if any) to the same memory address.
Given that code framework, I find it difficult to handle the ACCUMULATOR addressing mode, which is the only one that yields the actual value of the internal A register instead of a memory address. I could return the value of A using the uint16_t return type and add a boolean flag for that addressing mode, but I find that an extremely ugly solution.
The instruction functions should be completely addressing-mode agnostic.
In Sharp6502 (my 6502 emulation engine written in C#) I treat internal registers and external memory both as first-class objects - the MemoryManager class instantiates an object for external memory, and another for internal registers, mapped to a different numeric range. Consequently, memory access and register access are identical at the functional level, since they are both referenced through MemoryManager according to what is basically an index.
Address-mode differentiation is simply a matter of filtering the bit-pattern of the instruction under emulation and performing a very simple calculation to determine the index to be passed to the MemoryManager - this might be implied, or require one or two further bytes, but the underlying mechanism is identical for every instruction.
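The same idea translates to the C++ design in the question: extend the address space by one pseudo-address for the accumulator, so Addr_ACC() can return an "address" like every other mode. A minimal sketch (the pseudo-address value and the Bus class are hypothetical):

#include <cstdint>

// One pseudo-address just past the 6502's 64 KiB range stands for register A.
const uint32_t ACC_ADDR = 0x10000;

class Bus {
    uint8_t ram[0x10000];
    uint8_t a; // accumulator
public:
    uint8_t read(uint32_t addr) const {
        return (addr == ACC_ADDR) ? a : ram[addr & 0xFFFF];
    }
    void write(uint32_t addr, uint8_t value) {
        if (addr == ACC_ADDR) a = value;
        else ram[addr & 0xFFFF] = value;
    }
};

The addressing-mode functions then return uint32_t instead of uint16_t, Addr_ACC() simply returns ACC_ADDR, and instruction functions such as Op_ASL() stay completely addressing-mode agnostic: they just call read()/write() on whatever address they are given.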

How to return a complex return value?

Currently I am writing some assembly-language procedures. As the convention says, when I want to return some value to the caller, say an integer, I should return it in the EAX register. Now I am wondering: what if I want to return a float, a double, an enum, or even a complex struct? How do I return these types of values?
I can think of returning an address in the EAX which points to the real value in memory. But is it the standard way?
Many thanks~~~
It is all up to you if the caller is your code. If the caller is not under your control, you have to either follow its existing convention or develop a convention together.
For example, on the x86 platform, when floating-point arithmetic is processed by FPU instructions, the result of a function is returned as the top value on the FPU register stack (x86 FPU registers are organized into a "circular stack" of sorts). At that moment it is neither float nor double; it is a value stored with the FPU's internal precision (which can be higher than float or double), and it is the caller's responsibility to retrieve it from the top of the FPU stack and convert it to whatever type it desires. In fact, that is how a typical FPU instruction works: it takes its arguments from the top of the FPU stack and pushes the result back onto the stack. By implementing your function in the same way, you essentially emulate a "complex" FPU instruction with your function, which is a rather natural way to do it.
When floating-point arithmetic is processed by SSE instructions, you can choose some SSE register for the same purpose (use xmm0 just like you use EAX for integers).
For complex structures (i.e. ones that are larger than a register or a pair of registers), the caller would normally pass a pointer to a reserved buffer to the function. And the function would put the result into the buffer. In other words, under the hood, functions never really "return" large objects, but rather construct them in a caller-provided memory buffer.
Of course, you can use this "memory buffer" method for returning values of any type, but with smaller values, i.e. values of scalar type, it is much more efficient to use registers than a memory location. This applies, BTW, to small structures as well.
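As an illustration of that hidden-pointer mechanism, here is a hand-written sketch of the transformation (make_big_lowered is a hypothetical rendering, not actual compiler output):

#include <cstring>

struct Big { int data[16]; };

// What you write:
Big make_big()
{
    Big b = {};
    b.data[0] = 42;
    return b;
}

// Roughly what the compiler generates: the caller allocates space for the
// result and passes its address as a hidden first argument, and the callee
// constructs the result directly in that buffer.
void make_big_lowered(Big* ret)
{
    std::memset(ret, 0, sizeof *ret);
    ret->data[0] = 42;
}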
Enums are usually just a conceptual wrapper over some integer type. So there's no difference between returning an enum and returning an integer.
A double should be returned as the first item on the FPU stack.
Here is a C++ code example (x86):
double sqrt(double n)
{
    _asm fld n
    _asm fsqrt
}
If you prefer to manage the stack manually (saving some CPU cycles):
double inline __declspec (naked) __fastcall sqrt(double n)
{
    _asm fld qword ptr [esp+4]
    _asm fsqrt
    _asm ret 8
}
For complex types, you should pass a pointer, or return a pointer.
When you have questions about calling conventions or assembly language, write a simple function in a high-level language (in a separate file). Next, have your compiler generate an assembly-language listing, or have your debugger display "interleaved assembly".
Not only will the listing tell you how the compiler implements code, it will also show you the calling conventions. A lot easier than posting to S.O. and usually faster. ;-)
C99 has a complex built-in data type (_Complex). So if you have a C99-compliant compiler, you can just write some function that returns a complex value and compile it to assembler (usually with the -S option). There you can see the convention that is used.
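For instance, a minimal experiment along those lines (file and function names are arbitrary):

/* retc.c -- compile with: gcc -S retc.c, then read retc.s */
#include <complex.h>

double complex make_complex(double re, double im)
{
    return re + im * I;
}

The generated listing shows exactly how the two parts of the complex value come back on your target.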
It depends on the ABI. For example, Linux on x86 uses the Sys V ABI, specified in the Intel386 Architecture Processor Supplement, Fourth Edition.
The Function Calling Sequence section has the information on how values are returned. Briefly, in this ABI:
Functions returning scalars or no value use %eax;
Functions returning floating point values use %st(0);
For functions returning struct or union types, the caller provides space for the return value and passes its address as a hidden, first argument. The callee returns this address in %eax.
Typically you would use the stack
If you're planning to interface with C or another higher-level language, typically you would accept the address of a memory buffer as an argument to your function and return your complex value by populating that buffer. If this is assembly-only, then you can define your own convention using any set of registers you want, although usually you'd only do so if you have a specific reason (e.g., performance).

Why and when should one call _fpreset( )?

The only documentation I can find (on MSDN or otherwise) is that a call to _fpreset() "resets the floating-point package." What is the "floating-point package"? Does this also clear the FPU status word? I see documentation that says to call _fpreset() when recovering from a SIGFPE, but doesn't _clearfp() do this as well? Do I need to call both?
I am working on an application that unmasks some FP exceptions (using _controlfp()). When I want to reset the FPU to the default state (say, when calling into .NET code), should I call _clearfp(), _fpreset(), or both? This is performance-critical code, so I don't want to call both if I don't have to.
_fpreset() resets the state of the floating-point unit: it resets the FPU precision to its default and clears the FPU status word. The two occasions I see for using it are when recovering from an FPE (as you said) and when getting control back from library code (e.g. a DLL you have no control over) that has messed up the FPU state in some way, such as by changing the precision.
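As a concrete pattern, here is a minimal sketch of both situations with the MSVC CRT (<float.h>); the helper names are hypothetical, the CRT calls are the documented ones:

#include <float.h>

// Unmask selected FP exceptions; clear pending flags first so that
// unmasking does not immediately raise a stale exception.
void unmask_fp_exceptions()
{
    _clearfp();                                  // clear the FPU status word
    _controlfp(0, _EM_ZERODIVIDE | _EM_INVALID); // clear mask bits -> unmask
}

// Put the FPU back into the default state before calling code
// (e.g. .NET) that expects default precision, rounding and masks.
void reset_fpu_for_external_code()
{
    _clearfp();  // discard pending exception flags
    _fpreset();  // restore the control word to the CRT default
}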

Is updating a double an atomic operation?

In Java, updating a double or long variable may not be atomic, since a double/long is treated as two separate 32-bit variables.
http://java.sun.com/docs/books/jls/second_edition/html/memory.doc.html#28733
In C++, if I am using a 32-bit Intel processor and the Microsoft Visual C++ compiler, is updating a double (8 bytes) an atomic operation?
I cannot find much in the specifications about this behavior.
When I say "atomic variable", here is what I mean :
Thread A trying to write 1 to variable x.
Thread B trying to write 2 to variable x.
We shall get value 1 or 2 out from variable x, but not an undefined value.
This is hardware specific and depends on the architecture. For x86 and x86_64, 8-byte reads and writes are guaranteed to be atomic if they are aligned. Quoting from the Intel Architecture Memory Ordering White Paper:
Intel 64 memory ordering guarantees that for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access regardless of memory type:
Instructions that read or write a single byte.
Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
All locked instructions (the implicitly locked xchg instruction and other read-modify-write instructions with a lock prefix) are an indivisible and uninterruptible sequence of load(s) followed by store(s) regardless of memory type and alignment.
It's safe to assume that updating a double is never atomic, even if its size is the same as an int with an atomicity guarantee. The reason is that it has a different processing path, since it's a non-critical and expensive data type. For example, even data-barrier descriptions usually mention that they don't apply to floating-point data/operations in general.
Visual C++ will align primitive types (see the article), and while that should guarantee that their bits won't get garbled while being written to memory (an 8-byte-aligned value always sits within one 64- or 128-bit cache line), the rest depends on how the CPU handles non-atomic data in its cache and whether reading/flushing a cache line is interruptible. So if you dig through the Intel docs for the kind of core you are using and they give you that guarantee, then you are safe to go.
The reason the Java spec is so conservative is that it's supposed to run the same way on an old 386 and on a Core i7. Which is of course delusional, but a promise is a promise; therefore it promises less :-)
The reason I'm saying that you have to look up the CPU doc is that your CPU might be an old 386, or alike :-)) Don't forget that on a 32-bit CPU your 8-byte block takes 2 "rounds" to access, so you are down to the mercy of the mechanics of the cache access.
Cache-line flushing, which gives a much stronger data-consistency guarantee, applies only to reasonably recent CPUs with Intel-style automatic cache consistency.
I wouldn't think that on any architecture a thread/context switch would interrupt a register update halfway, leaving you with, for example, 18 bits updated out of the 32 it was going to update. The same goes for updating a memory location (provided it's a basic access unit: 8, 16, 32, 64 bits, etc.).
So has this question been answered? I ran a simple test program changing a double:
#include <stdio.h>

int main(int argc, char** argv)
{
    double i = 3.14159265358979323;
    i += 84626.433;
}
I compiled it without optimizations (gcc -O0), and all assignment operations are performed with single assembler instructions such as fldl .LC0 and faddp %st, %st(1). (The i += 84626.433 is of course done in two operations, faddp and fstpl.)
Can a thread really get interrupted inside a single instruction such as faddp?
On a multicore system, besides atomicity, you have to worry about cache coherence, so that a reading thread sees the new value in its cache once the writer has updated it.
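With a C++11 or newer compiler, std::atomic<double> provides exactly the write-1/write-2 guarantee described in the question, portably. A minimal sketch (on x86/x86_64 this is typically lock-free, compiling to plain aligned loads and stores):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<double> x{0.0};

int main()
{
    std::thread a([] { x.store(1.0); }); // thread A writes 1
    std::thread b([] { x.store(2.0); }); // thread B writes 2
    a.join();
    b.join();
    // The value read is always exactly 1.0 or 2.0, never a torn mix.
    std::printf("x = %f (lock-free: %d)\n", x.load(), (int)x.is_lock_free());
    return 0;
}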