NASM, C++, passing a class object

I have a class in my .cpp file:
class F{
private:
int id;
float o;
float p;
float s;
static int next;
public:
F(double o, double s = 0.23, double p = 0.0):
id(next++), o(o),
p(p), s(s){}
};
int F::next = 0;
extern "C" float pod(F f);
int main(){
F bur(1000, 0.23, 100);
pod(bur);
return 0;
}
and I'm trying to pass the class object bur to the function pod, which is defined in my asm file. However, I'm having a hard time getting the values out of this object.
In asm program I have 0.23 in XMM1, 100 in XMM2 but I can't find where 1000 is stored.

I don't know why you are seeing 100 in xmm2; I suspect that is entirely coincidence. The easiest way to see how your struct is being passed is to compile the C++ code.
With cruft removed, my compiler does this:
main:
.LFB3:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl _ZN1F4nextE(%rip), %edi # load F::next into edi/rdi
movq .LC3(%rip), %xmm0 # load { 0.23, 100 } into xmm0
leal 1(%rdi), %eax # store rdi + 1 into eax
movl %eax, _ZN1F4nextE(%rip) # store eax back into F::next
movabsq $4934256341737799680, %rax # load { 1000.0, 0 } into rax
orq %rax, %rdi # or rax into pre-increment F::next in rdi
call pod
xorl %eax, %eax
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
.LC3:
.quad 4497835022170456064
The constant 4497835022170456064 is 3E6B851F42C80000 in hex, and if you look at the most significant four bytes (3E6B851F), this is 0.23 when interpreted as a single precision float, and the least significant four bytes (42C80000) are 100.0.
Similarly the most significant four bytes of the constant 4934256341737799680 (hex 447A000000000000) are 1000.0.
So, bur.id and bur.o are passed in rdi and bur.p and bur.s are passed in xmm0.
The reason for this is documented in the x86-64 ABI reference. In extreme summary: because the first two fields are small enough, of mixed type, and one of them is an integer, they are passed in a general purpose register (rdi being the first general purpose parameter register); because the next two fields are both float, they are passed in an SSE register.
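You can reproduce that first eightbyte from plain C++. Here is a minimal sketch of my own (assuming the little-endian x86-64 layout described above): it packs {id, o} by hand and prints the 447A000000000000 constant quoted earlier.

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    int id = 0;           // value of F::next before the increment
    float o = 1000.0f;
    uint64_t eightbyte = 0;
    std::memcpy(&eightbyte, &id, sizeof id);                             // low 32 bits: id
    std::memcpy(reinterpret_cast<char*>(&eightbyte) + 4, &o, sizeof o);  // high 32 bits: o
    std::printf("%016llx\n", static_cast<unsigned long long>(eightbyte));
    // prints 447a000000000000: the float 1000.0f sits in the upper half,
    // exactly the constant that main or-ed into rdi above
}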

You want to have a look at Agner Fog's calling convention document here. Depending on the compiler, the operating system and whether you are in 32- or 64-bit mode, different things may happen (see Table 5 in chapter 7).
For 64-bit Linux, for instance, since your object contains values of different types (see Table 6), the R case seems to apply:
Entire object is transferred in integer registers and/or XMM registers if the size is no bigger than 128 bits, otherwise on the stack. Each 64-bit part of the object is transferred in an XMM register if it contains only float or double, or in an integer register if it contains integer types or mixed integer and float. Two consecutive floats can be packed into the lower half of one XMM register.
In your case, the class fits in 128 bits. The experiment from @CharlesBailey illustrates this behavior. According to the convention
... or in an integer register if it contains integer types or mixed integer and float. Two consecutive floats can be packed into the lower half of one XMM register. Examples: int and float: RDI.
the first integer argument register rdi should hold id and o, while xmm0 should hold p and s.
Seeing 100 in xmm2 might be a side effect of initialization: the constructor takes doubles, so if the call is not inlined its arguments arrive in xmm0, xmm1 and xmm2, which would also put 1000.0 in xmm0.

Related

Why more space on the stack frame is reserved than is needed in x86

With reference to the following code
#include <iostream>
using namespace std;
void do_something(int* ptr) {
cout << "Got address " << reinterpret_cast<void*>(ptr) << endl;
}
void func() {
int a;
do_something(&a);
}
int main() {
func();
}
When I disassemble the func function, the x86 (I am not sure whether it is x86 or x86_64) code is
-> 0x100001140 <+0>: pushq %rbp
0x100001141 <+1>: movq %rsp, %rbp
0x100001144 <+4>: subq $0x10, %rsp
0x100001148 <+8>: leaq -0x4(%rbp), %rdi
0x10000114c <+12>: callq 0x100000f90 ; do_something(int*)
0x100001151 <+17>: addq $0x10, %rsp
0x100001155 <+21>: popq %rbp
0x100001156 <+22>: retq
0x100001157 <+23>: nopw (%rax,%rax)
I understand that the first push statement is pushing the base pointer to the previous function call on the stack, and then the stack pointer value is copied over to the base pointer. But then why are 16 bytes reserved for the stack?
Does this have to do with alignment somehow? The variable a needs only 4 bytes..
Also what exactly is the lea instruction doing in this function call? is it just getting the address of the integer relative to the base pointer? Which in this case seems to be 4 bytes off from the base (assuming that the return address is 4 bytes long and is the first thing on the stack)
Other architectures seem to reserve more than 16 bytes and have other things stored on the base of the stack frame..
This is x64 code; note the usage of the rsp register (x86 code uses the esp register). The most important implementation detail of the x64 ABI is that the stack must always be aligned to 16. This is not actually necessary to properly run 64-bit code, but the alignment guarantee ensures that the compiler can safely emit SSE instructions: their aligned memory operands require 16-byte alignment. None are actually used in this snippet, but they might be in do_something.
Upon entry to your function, the caller's CALL instruction has pushed 8 bytes onto the stack to store the return address. The first PUSH instruction brings the stack back to 16-byte alignment; no additional correction is required.
It then creates the stack frame to store the a variable. While only 4 bytes are required, adjusting rsp by only 4 isn't good enough to provide the necessary alignment. So it picks the next suitable value, 16. The extra 12 bytes are simply unused.
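A quick experiment you can try (the exact reservations are an assumption about typical unoptimized x86-64 codegen, so verify on your compiler): the reservation grows in 16-byte steps, not per variable.

void do_something(int* ptr);  // from the question

void one_int()   { int a;    do_something(&a); }  // subq $0x10, %rsp
void five_ints() { int a[5]; do_something(a);  }  // subq $0x20, %rsp (20 bytes, rounded up)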
The LEA instruction is a very handy one that implements &a. LEA = Load Effective Address = "take the address of". The calculation is not particularly involved here; it gets more convoluted when you use something like &array[ix], which can still be done by a single LEA if the array element size is 1, 2, 4 or 8 bytes, pretty common sizes.
The -4 is the offset of the a variable from the start of the stack frame. Four bytes are needed to store an int; your compiler implements the LP64 data model. Keep in mind that the stack grows downward, so the offset is negative.
Then it just makes the function call; the rdi register is used to pass the first argument in the x64 ABI. Finally it destroys the stack frame again by re-adjusting rsp, and restores rbp.
Do keep in mind that you are looking at unoptimized code. Usually none of this is left after the optimizer is done with it, small functions like this almost always get inlined. So this doesn't teach you that much practical knowledge of the code that actually runs. Have a look-see at the -O2 code.
According to the x86-64 ABI, the stack must be 16-byte aligned prior to a subroutine call.
leaq (mem), reg
is equivalent to the following
reg = &(*mem) = mem
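For instance, the &array[ix] case mentioned in the first answer compiles to a single lea. A small sketch (the exact registers assume the SysV x86-64 convention):

int* element_addr(int* array, long ix) {
    return &array[ix];   // one instruction: leaq (%rdi,%rsi,4), %rax
}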

SSE C++ memory commands

SSE assembly has the SQRTPS instruction.
The SQRTPS instruction comes in two forms:
SQRTPS xmm1, xmm2
SQRTPS xmm1, m128
gcc/clang/VS compilers all provide the intrinsic _mm_sqrt_ps.
But _mm_sqrt_ps can work only with a preloaded xmm value (via _mm_set_ps / _mm_load_ps).
From Visual Studio, for example:
http://msdn.microsoft.com/en-us/library/vstudio/8z67bwwk%28v=vs.100%29.aspx
What I expect:
__attribute__((aligned(16))) float data[4];
__attribute__((aligned(16))) float result[4];
asm{
sqrtps xmm0, data // DIRECTLY FROM MEMORY
movaps result, xmm0
}
What I have (in C):
__attribute__((aligned(16))) float data[4];
__attribute__((aligned(16))) float result[4];
auto xmm = _mm_load_ps(data); // or _mm_set_ps
xmm = _mm_sqrt_ps(xmm);
_mm_store_ps(&result[0], xmm);
(in asm):
movaps xmm1, data
sqrtps xmm0, xmm1 // FROM REGISTER
movaps result, xmm0
In other words, I would like to see something like this:
__attribute__((aligned(16))) float data[4];
__attribute__((aligned(16))) float result[4];
auto xmm = _mm_sqrt_ps(data); // DIRECTLY FROM MEMORY, no need to load (because there is such instruction)
_mm_store_ps(&result[0], xmm);
Quick research: I made the following file, called mysqrt.cpp:
#include <pmmintrin.h>
extern "C" __m128 MySqrt(__m128* a) {
return _mm_sqrt_ps(a[1]);
}
Trying gcc, namely g++4.8 -msse3 -O3 -S mysqrt.cpp && cat mysqrt.s:
_MySqrt:
LFB526:
sqrtps 16(%rdi), %xmm0
ret
Clang (clang++3.6 -msse3 -O3 -S mysqrt.cpp && cat mysqrt.s):
_MySqrt: ## #MySqrt
.cfi_startproc
## BB#0: ## %entry
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
sqrtps 16(%rdi), %xmm0
popq %rbp
retq
I don't know about VS, but at least both gcc and clang produce the memory version of sqrtps when it is appropriate.
UPDATE Example of function usage:
#include <iostream>
#include <pmmintrin.h>
extern "C" __m128 MySqrt(__m128* a);
int main() {
__m128 x[2];
x[1] = _mm_set_ps1(4);
__m128 y = MySqrt(x);
std::cout << y[0] << std::endl;
}
// output:
2
UPDATE 2: Regarding your code, you should just do:
auto xmm = _mm_sqrt_ps(*reinterpret_cast<__m128*>(data));
And of course it will be at your own risk: you must guarantee that data contains a valid __m128 and is properly aligned.
I think you misunderstood the interface provided by the intrinsic _mm_sqrt_ps(__m128). Its argument can be a variable held in memory or in a register; the extension type __m128 acts like any normal builtin type (e.g. double) and is not bound to an xmm register, so it can also be stored in memory.
EDIT Unless you use asm, the compiler determines if and when a variable is loaded into a register or left in memory. So, in the following code snippet
__m128 foo(const __m128 x, const __m128* y, std::size_t n)
{
    __m128 result = _mm_set1_ps(1.0f);
    while (n--)
        result = _mm_mul_ps(result, _mm_add_ps(x, _mm_sqrt_ps(*y++)));
    return result;
}
it's up to the compiler which variables are stored in register. I would think that the compiler puts x and result into xmm registers, but gets *y directly from memory.
The answer to your question is that you can't control this, at least for aligned loads, with intrinsics. It's up to the compiler to decide whether it uses SQRTPS xmm1, xmm2 or SQRTPS xmm1, m128. If you want to be 100% certain, you have to write it in assembly. This is, in my opinion, one of the deficiencies of intrinsics (at least as they are currently implemented).
Some code can help explain this.
We can get GCC (64-bit with -O3) to generate both versions, using unaligned and aligned loads.
float x[4], y[4];
__m128 x4 = _mm_loadu_ps(x);
__m128 y4 = _mm_sqrt_ps(x4);
_mm_storeu_ps(y,y4);
This gives (with Intel syntax)
movups xmm0, XMMWORD PTR [rdx]
sqrtps xmm0, xmm0
However, if we do an aligned load we get the other form
float x[4], y[4];
__m128 x4 = _mm_load_ps(x);
__m128 y4 = _mm_sqrt_ps(x4);
_mm_storeu_ps(y,y4);
This combines the load and square root into one instruction
sqrtps xmm0, XMMWORD PTR [rax]
Most people would say "trust the compiler." I disagree. If you're using intrinsics, it should be assumed that YOU know what you're doing, NOT the compiler. Here is an example, difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp, where GCC chose one form and MSVC chose the other (for multiplication instead of sqrt) and it made a difference in performance.
So once again, if you're using aligned loads, you can only pray that the compiler does what you want. And then maybe on the next version of the compiler it does something different...
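If you truly need the memory-operand form, GCC/clang extended inline asm pins it down. A minimal sketch of my own (assumes x86-64 and 16-byte aligned data; the aliasing cast is at your own risk, as discussed above):

#include <xmmintrin.h>

__m128 sqrt_from_mem(const float* data) {  // data must be 16-byte aligned
    __m128 result;
    // The "m" constraint forces a memory operand, so this has to assemble
    // to the SQRTPS xmm, m128 form.
    asm("sqrtps %1, %0"
        : "=x"(result)
        : "m"(*reinterpret_cast<const __m128*>(data)));
    return result;
}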

Does the return value always go into eax register after a method call?

I have written a hooking library, that examines a PE executables dll import table, to create a library that enables changing of parameters and return values. I have a few questions on how the return value is passed from a function.
I have learned that the return value of a function is saved in the accumulator register. Is this always the case? If not, how does the compiler know where to look for the function result?
What about the return type size? An integer will easily fit, but what about a bigger structure? Does the caller reserve stack space so the method it calls could write the result onto stack?
It's all specific to the calling convention.
For most calling conventions floating point numbers are returned either on FPU-stack or in XMM registers.
A call to a function returning a structure
some_struct foo(int arg1, int arg2);
some_struct s = foo(1, 2);
will be compiled into some equivalent of:
some_struct* foo(some_struct* ret_val, int arg1, int arg2);
some_struct s; // constructor isn't called
foo(&s, 1, 2); // constructor will be called in foo
Edit: (add info)
Just to clarify: under such a convention this works for all structs and classes, even when sizeof(some_struct) <= 4. So if you define a small useful class like ip4_type, with a single unsigned field and some constructors/converters to/from unsigned, in_addr and char*, it will be less efficient than using a raw unsigned value.
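A hypothetical ip4_type along those lines (the name and members are mine, for illustration):

struct ip4_type {
    unsigned value;
    ip4_type(unsigned v) : value(v) {}
    operator unsigned() const { return value; }
};

// Under such a convention the call is rewritten to fill a hidden
// out-parameter instead of returning the 4-byte object in eax:
ip4_type make_ip(unsigned v) { return ip4_type(v); }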
If the function gets inlined, the result is not saved in eax; likewise, if results are passed by reference/pointer, that register won't be used.
Look at what happens with a function that returns a double (the listing below is x86-64 code; on a 32-bit machine the result would come back on the x87 FPU stack instead):
double func(){
volatile double val=5.0;
return val;
}
int main(){
double val = func();
return 0;
}
The double is not returned in eax:
func():
pushq %rbp
movq %rsp, %rbp
movabsq $4617315517961601024, %rax
movq %rax, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, -24(%rbp)
movsd -24(%rbp), %xmm0
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
subq $24, %rsp
call func()
movsd %xmm0, -24(%rbp)
movq -24(%rbp), %rax
movq %rax, -8(%rbp)
movl $0, %eax
leave
ret
It really depends on the calling convention used, but typically EAX is used for 32-bit and smaller integral data types, floating point values tend to use x87 FPU or XMM registers, and 64-bit integral types tend to use the EDX:EAX pair instead. Then there is the issue of complex class/struct types, in which case the compiler may decide to optimize away the return value and use an extra output parameter on the call stack to pass the returned object by reference to the caller.
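For example, a 64-bit result on 32-bit x86 typically comes back split across edx:eax. A sketch (the codegen comments show the usual pattern, not something the language guarantees):

long long make64() {
    return 0x1122334455667788LL;
}
// typical -m32 output:
//   movl $0x55667788, %eax   # low half
//   movl $0x11223344, %edx   # high half
//   ret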
You are asking questions about the ABI (Application Binary Interface). This varies depending on the operating system. You should look it up. You can find some good info and links to other documents at http://en.wikipedia.org/wiki/X86_calling_conventions
To answer your question: yes, as far as I know, all of the popular operating systems use the A register (eax/rax) to return integer results.

Is extra time required for float vs int comparisons?

If you have a floating point number double num_float = 5.0; and the following two conditionals.
if(num_float > 3)
{
//...
}
if(num_float > 3.0)
{
//...
}
Q: Would it be slower to perform the former comparison because of the conversion of 3 to a floating point, or would there really be no difference at all?
Obviously I'm assuming the time delay would be negligible at best, but compounded in a while(1) loop I suppose over the long run a decent chunk of time could be lost (if it really is slower).
Because of the "as-if" rule, the compiler is allowed to do the conversion of the literal to a floating point value at compile time. A good compiler will do so if that results in better code.
In order to answer your question definitively for your compiler and your target platform(s), you'd need to check what the compiler emits, and how it performs. However, I'd be surprised if any mainstream compiler did not turn either of the two if statements into the most efficient code possible.
If the value is a constant, then there shouldn't be any difference, since the compiler will convert the constant to float as part of the compilation [unless the compiler decides to use a "compare float with integer" instruction].
If the value is an integer VARIABLE, then there will be an extra instruction to convert the integer value to a floating point [again, unless the compiler can use a "compare float with integer" instruction].
How much, if any, time that adds to the whole process depends HIGHLY on what processor, how the floating point instructions work, etc, etc.
As with anything where performance really matters, measure the alternatives. Preferably on more than one type of hardware (e.g. both AMD and Intel processors if it's a PC), and then decide which is the better choice. Otherwise, you may find yourself tuning the code to work well on YOUR hardware, but worse on some other hardware. Which isn't a good optimisation - unless the ONLY machine you ever run on is your own.
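A minimal way to run such a measurement, as a sketch (std::chrono is standard; the iteration count and the volatile to defeat constant folding are my choices):

#include <chrono>
#include <cstdio>

int main() {
    volatile double num_float = 5.0;   // volatile: force a reload every iteration
    long hits = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000; ++i)
        if (num_float > 3) ++hits;     // swap in "> 3.0" for the other variant
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("%ld hits in %lld ms\n", hits, static_cast<long long>(ms));
}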
Note: This will need to be repeated with your target hardware. The code below just demonstrates nicely what has been said.
with constants:
bool with_int(const double num_float) {
return num_float > 3;
}
bool with_float(const double num_float) {
return num_float > 3.0;
}
g++ 4.7.2 (-O3 -march=native):
with_int(double):
ucomisd .LC0(%rip), %xmm0
seta %al
ret
with_float(double):
ucomisd .LC0(%rip), %xmm0
seta %al
ret
.LC0:
.long 0
.long 1074266112
clang 3.0 (-O3 -march=native):
.LCPI0_0:
.quad 4613937818241073152 # double 3.000000e+00
with_int(double): # #with_int(double)
ucomisd .LCPI0_0(%rip), %xmm0
seta %al
ret
.LCPI1_0:
.quad 4613937818241073152 # double 3.000000e+00
with_float(double): # #with_float(double)
ucomisd .LCPI1_0(%rip), %xmm0
seta %al
ret
Conclusion: No difference if comparing against constants.
with variables:
bool with_int(const double a, const int b) {
return a > b;
}
bool with_float(const double a, const float b) {
return a > b;
}
g++ 4.7.2 (-O3 -march=native):
with_int(double, int):
cvtsi2sd %edi, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
with_float(double, float):
unpcklps %xmm1, %xmm1
cvtps2pd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
clang 3.0 (-O3 -march=native):
with_int(double, int): # #with_int(double, int)
cvtsi2sd %edi, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
with_float(double, float): # #with_float(double, float)
cvtss2sd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
seta %al
ret
Conclusion: the emitted instructions differ when comparing against variables, as Mats Petersson's answer already explained.
Q: Would it be slower to perform the former comparison because of the conversion of 3 to a floating point, or would there really be no difference at all?
A) Just specify it as an integer. Some chips have a special instruction for comparing against an integer at runtime, but that is not important, since the compiler will choose whatever is best. In some cases it might convert the constant to 3.0 at compile time, depending on the target architecture; in other cases it will leave it as an int. But because you mean 3 specifically, write '3'.
Obviously I'm assuming the time delay would be negligible at best, but compounded in a while(1) loop I suppose over the long run a decent chunk of time could be lost (if it really is slower).
A) The compiler will not do anything odd with such code; it will choose the best option, so there should be no time penalty. With numeric constants the compiler is free to do whatever is best, regardless of how the source looks, as long as it produces the same result. However, you would not want to build a hot loop around this kind of comparison at all; use an integer loop counter instead. A floating point loop counter is going to be much slower. If you have to use floats as a loop counter, prefer the single-precision 32-bit type and compare as rarely as possible.
For example, you could break the problem down into multiple loops.
int x = 0;
float y = 0.0f;
float finc = 0.1f;
int total = 1000;
int num_times = static_cast<int>(total / finc);
num_times -= 2; // safety margin against rounding error
// Run most of the loop in a safe zone using integer compares
while (x < num_times) {
    // Do stuff
    y += finc;
    x++;
}
// Now complete the loop using float compares
while (y < total) {
    y += finc;
}
Splitting the loop like this replaces almost all of the float comparisons with integer compares, which can give a significant speedup.

Atomic 64 bit writes with GCC

I've gotten myself into a confused mess regarding multithreaded programming and was hoping someone could come and slap some understanding in me.
After doing quite a bit of reading, I've come to the understanding that I should be able to set the value of a 64-bit int atomically on a 64-bit system [1].
I found a lot of this reading difficult, though, so I thought I would try to make a test to verify this. I wrote a simple program with one thread which would set a variable to one of two values:
bool switcher = false;
while(true)
{
if (switcher)
foo = a;
else
foo = b;
switcher = !switcher;
}
And another thread which would check the value of foo:
while (true)
{
__uint64_t blah = foo;
if ((blah != a) && (blah != b))
{
cout << "Not atomic! " << blah << endl;
}
}
I set a = 1844674407370955161; and b = 1144644202170355111;. I run this program and get no output warning me that blah is not a or b.
Great, looks like it probably is an atomic write...but then, I changed the first thread to set a and b directly, like so:
bool switcher = false;
while(true)
{
if (switcher)
foo = 1844674407370955161;
else
foo = 1144644202170355111;
switcher = !switcher;
}
I re-run, and suddenly:
Not atomic! 1144644203261303193
Not atomic! 1844674406280007079
Not atomic! 1144644203261303193
Not atomic! 1844674406280007079
What's changed? Either way I'm assigning a large number to foo - does the compiler handle a constant number differently, or have I misunderstood everything?
Thanks!
[1]: Intel CPU documentation, section 8.1, Guaranteed Atomic Operations
[2]: GCC development list discussing the fact that GCC doesn't guarantee it in the documentation, but the kernel and other programs rely on it
Disassembling the loop, I get the following code with gcc:
.globl _switcher
_switcher:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl $0, -4(%rbp)
L2:
cmpl $0, -4(%rbp)
je L3
movq _foo@GOTPCREL(%rip), %rax
movl $-1717986919, (%rax)
movl $429496729, 4(%rax)
jmp L5
L3:
movq _foo@GOTPCREL(%rip), %rax
movl $1486032295, (%rax)
movl $266508246, 4(%rax)
L5:
cmpl $0, -4(%rbp)
sete %al
movzbl %al, %eax
movl %eax, -4(%rbp)
jmp L2
LFE2:
So it would appear that gcc uses two 32-bit movl instructions with 32-bit immediate values. There is an instruction, movq, that can move a 64-bit register to memory (or memory to a 64-bit register), but it does not seem to have a form that stores a 64-bit immediate value directly to a memory address, so the compiler is forced either to use a temporary register and then move the value to memory, or to use two movl instructions. You can try to force it to use a register by using a temporary variable, but this may not work.
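The temporary-variable idea would look something like this (a sketch only; without an atomic qualifier the compiler remains free to split the store):

#include <sys/types.h>   // for __uint64_t (GNU), matching the question

extern __uint64_t foo;   // the shared variable from the question

void set_foo(__uint64_t value) {
    __uint64_t tmp = value;  // hope: the constant is materialized in a register
    foo = tmp;               // ... and then stored with a single 64-bit movq
}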
References:
mov
movq
http://www.x86-64.org/documentation/assembly.html
immediate values inside instructions remain 32 bits.
There is no way for the compiler to perform the assignment of a 64-bit constant atomically, except by first filling a register and then moving that register to the variable. That is probably more costly than assigning directly to the variable, and since atomicity is not required by the language, the atomic solution is not chosen.
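If you want the single 64-bit store guaranteed rather than hoped for, std::atomic (C++11) spells it out. A sketch of how the writer loop could look:

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> foo{0};

void writer() {
    bool switcher = false;
    for (;;) {
        // Always one 64-bit store (movabsq into a register, then movq),
        // never two 32-bit halves.
        foo.store(switcher ? 1844674407370955161ULL
                           : 1144644202170355111ULL,
                  std::memory_order_relaxed);
        switcher = !switcher;
    }
}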
The Intel CPU documentation is right: aligned 8-byte reads/writes are always atomic on recent hardware (even under 32-bit operating systems).
What you don't tell us is whether you are running 64-bit hardware with a 32-bit system. If so, the 8-byte write will most likely be split into two 4-byte writes by the compiler.
Just have a look at the relevant section in the object code.