I'm implementing a function as follows:
void Add(list* node)
{
    if (this->next == NULL)
        this->next = node;
    else
        this->next->Add(node);
}
It seems that Add will be tail-called at every step of the recursion.
I could also implement it as:
void Add(list *node)
{
    list *curr = this;
    while (curr->next != NULL) curr = curr->next;
    curr->next = node;
}
This will not use recursion at all.
Which version of this is better (in stack usage or speed)?
Please don't give the "why don't you use STL/Boost/whatever?" comments/answers.
They will probably be the same performance-wise, since the compiler will most likely optimise them into exactly the same code.
However, if you compile with debug settings, the compiler will not optimise the tail recursion, so if the list is long enough, you can get a stack overflow. There is also the (very small) possibility that a bad compiler won't apply tail-call optimisation to the recursive version. There is no such risk in the iterative version.
Pick whichever one is clearer and easier for you to maintain, taking the possibility of non-optimisation into account.
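To make that risk concrete, here is a minimal, self-contained sketch (the struct is a hypothetical stand-in for the question's list): it links a long chain directly, then makes a single Add call that must recurse through the whole chain. Built without optimisation, this can overflow the stack.
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's list type.
struct list {
    list* next = nullptr;
    void Add(list* node) {
        if (next == nullptr) next = node;
        else next->Add(node); // one real call per node in an unoptimised build
    }
};

int main() {
    std::vector<list> nodes(1000000);
    for (std::size_t i = 0; i + 1 < nodes.size(); ++i)
        nodes[i].next = &nodes[i + 1]; // build the chain without recursion
    list tail;
    nodes[0].Add(&tail); // recursion depth of ~1,000,000 if not optimised
}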
I tried it out, making three files to test your code:
node.hh:
struct list {
    list *next;
    void Add(list *);
};
tail.cc:
#include "node.hh"
void list::Add(list* node)
{
    if (!this->next)
        this->next = node;
    else
        this->next->Add(node);
}
loop.cc:
#include "node.hh"
void list::Add(list *node)
{
    list *curr = this;
    while (curr->next) curr = curr->next;
    curr->next = node;
}
I compiled both files with G++ 4.3 for IA32, using -O3 and -S to produce assembly output rather than object files.
Results:
tail.s:
_ZN4list3AddEPS_:
.LFB0:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
movl 8(%ebp), %eax
.p2align 4,,7
.p2align 3
.L2:
movl %eax, %edx
movl (%eax), %eax
testl %eax, %eax
jne .L2
movl 12(%ebp), %eax
movl %eax, (%edx)
popl %ebp
ret
.cfi_endproc
loop.s:
_ZN4list3AddEPS_:
.LFB0:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
movl 8(%ebp), %edx
jmp .L3
.p2align 4,,7
.p2align 3
.L6:
movl %eax, %edx
.L3:
movl (%edx), %eax
testl %eax, %eax
jne .L6
movl 12(%ebp), %eax
movl %eax, (%edx)
popl %ebp
ret
.cfi_endproc
Conclusion: The output is similar enough (the core loop/recursion becomes movl, movl, testl, jne in both) that it really isn't worth worrying about. There's one less unconditional jump in the recursive version, although I wouldn't want to bet either way on which is faster, if the difference is even measurable at all. Pick whichever is most natural for expressing the algorithm at hand. Even if you later decide that was a bad choice, it's not too hard to switch.
Adding -g to the compilation doesn't change the actual implementation with g++ either, although there is the added complication that setting breakpoints no longer behaves as you would expect: a breakpoint on the tail-call line gets hit at most once (but not at all for a one-element list) in my tests with GDB, regardless of how deep the recursion actually goes.
Timings:
Out of curiosity I ran some timings with the same variant of g++. I used:
#include <cstring>
#include "node.hh"

static const unsigned int size = 2048;
static const unsigned int count = 10000;

int main() {
    list nodes[size];
    for (unsigned int i = 0; i < count; ++i) {
        std::memset(nodes, 0, sizeof(nodes));
        for (unsigned int j = 1; j < size; ++j) {
            nodes[0].Add(&nodes[j]);
        }
    }
}
This was run 200 times with each of the loop and the tail-call versions. The results with this compiler on this platform were fairly conclusive: tail had a mean of 40.52 seconds, whereas loop had a mean of 66.93 seconds. (The standard deviations were 0.45 and 0.47 respectively.)
So I certainly wouldn't be scared of using tail-call recursion if it seems the nicer way to express the algorithm, but I probably wouldn't go out of my way to use it either, since I suspect these timing observations would vary across platforms and compiler versions.
I have 2 ways of returning an empty string from a function.
1)
std::string get_string()
{
return "";
}
2)
std::string get_string()
{
return std::string();
}
Which one is more efficient, and why?
With GCC 7.1 at -O3 these are all identical: godbolt.org/z/a-hc1d – jterm
Original answer:
Did some digging. Below is an example program and the relevant assembly:
Code:
#include <string>
std::string get_string1(){ return ""; }
std::string get_string2(){ return std::string(); }
std::string get_string3(){ return {}; } //thanks Kerrek SB
int main()
{
    get_string1();
    get_string2();
    get_string3();
}
Assembly:
__Z11get_string1v:
LFB737:
.cfi_startproc
pushl %ebx
.cfi_def_cfa_offset 8
.cfi_offset 3, -8
subl $40, %esp
.cfi_def_cfa_offset 48
movl 48(%esp), %ebx
leal 31(%esp), %eax
movl %eax, 8(%esp)
movl $LC0, 4(%esp)
movl %ebx, (%esp)
call __ZNSsC1EPKcRKSaIcE
addl $40, %esp
.cfi_def_cfa_offset 8
movl %ebx, %eax
popl %ebx
.cfi_restore 3
.cfi_def_cfa_offset 4
ret $4
.cfi_endproc
__Z11get_string2v:
LFB738:
.cfi_startproc
movl 4(%esp), %eax
movl $__ZNSs4_Rep20_S_empty_rep_storageE+12, (%eax)
ret $4
.cfi_endproc
__Z11get_string3v:
LFB739:
.cfi_startproc
movl 4(%esp), %eax
movl $__ZNSs4_Rep20_S_empty_rep_storageE+12, (%eax)
ret $4
.cfi_endproc
This was compiled with -std=c++11 -O2.
You can see that there is quite a lot more work for the return ""; statement, and comparatively little for return std::string(); and return {}; (those two are identical).
As Frerich Raabe said, when you pass an empty C string, the constructor still has to process it instead of just setting up an empty string. It seems that this can't be optimised away (at least not by GCC).
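As a rough sketch of the mechanism (not the actual libstdc++ code), the const char* constructor cannot assume its argument is empty, so it must measure it before constructing:
#include <cstddef>
#include <string>

// Illustrative only: constructing from a C string measures it first,
// even when the argument is "".
std::string from_cstr(const char* s) {
    std::size_t n = std::char_traits<char>::length(s); // runs even for ""
    return std::string(s, n);
}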
So the answer is to definitely use:
return std::string();
or
return {}; //(c++11)
Although unless you are returning a lot of empty strings in performance-critical code (logging, I guess?), the difference is still going to be insignificant.
The latter version is never slower than the first. The first version calls the std::string constructor taking a C string, which then has to compute the length of the string first. Even though that's fast to do for an empty string, it's certainly not faster than not doing it at all.
Very much like this question, except that instead of vector<int> I have vector<struct myType>.
If I want to reset (or for that matter, set to some value) myType.myVar for every element in the vector, what's the most efficient method?
Right now I'm iterating through:
for(int i=0; i<myVec.size(); i++) myVec.at(i).myVar = 0;
But since vectors are guaranteed to be stored contiguously, there's surely a better way?
Resetting needs to touch every element of the vector, so any solution requires at least O(n) complexity; your current algorithm is already O(n).
In this particular case you can use operator[] instead of at (which does a bounds check and may throw an exception). But I doubt that's the bottleneck of your application.
On this note you should probably use std::fill. Note that the fill value must be an element of the vector's type, so this overwrites whole elements rather than just myVar:
std::fill(myVec.begin(), myVec.end(), myType());
But unless you want to go down to byte level and set a chunk of memory to 0, which will not only cause you headaches but will also cost you portability in most cases, there's nothing to improve here.
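If only the one member should change, a member-wise loop is the direct alternative (a sketch reusing the question's hypothetical myType/myVar names):
#include <vector>

struct myType { int myVar; /* other members... */ };

// Touches only myVar, leaving any other members alone; references
// avoid both copies and bounds checks.
void reset_myVar(std::vector<myType>& myVec) {
    for (auto& elem : myVec)
        elem.myVar = 0;
}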
Instead of the below code
for(int i=0; i<myVec.size(); i++) myVec.at(i).myVar = 0;
do it as follows:
size_t sz = myVec.size();
for (size_t i = 0; i < sz; ++i) myVec[i].myVar = 0;
The at method internally checks whether the index is out of range. Since your loop condition already takes care of that (i < myVec.size()), you can avoid the extra check. Otherwise this is the fastest way to do it.
EDIT
In addition to that, we can store the vector's size() before executing the for loop, as the code above does. This ensures that there is no further call to size() inside the loop.
One of the fastest ways would be to perform loop unrolling and reduce the per-iteration branching overhead of a conventional for loop. In your case, since the size is a run-time quantity, there's no way to apply template metaprogramming, so a variation on good old Duff's device will do the trick:
#include <iostream>
#include <vector>
using namespace std;

struct myStruct {
    int a;
    double b;
};

int main()
{
    std::vector<myStruct> mV(20);
    double val(20); // the new value you want to reset to
    int n = (mV.size() + 7) / 8; // 8 is the size of the unrolled block
    auto to = mV.begin();
    if (!mV.empty()) switch (mV.size() % 8) // guard: the device would write 8 slots on an empty range
    {
    case 0: do { (*to++).b = val;
    case 7:      (*to++).b = val;
    case 6:      (*to++).b = val;
    case 5:      (*to++).b = val;
    case 4:      (*to++).b = val;
    case 3:      (*to++).b = val;
    case 2:      (*to++).b = val;
    case 1:      (*to++).b = val;
            } while (--n > 0);
    }
    // just printing to verify that the value is set
    for (auto i : mV) std::cout << i.b << std::endl;
    return 0;
}
Here I chose to perform an 8-block unroll, to reset (let's say) the member b of a myStruct structure. The block size can be tweaked, and the loop is effectively unrolled. Remember that this is the underlying technique in some memcpy implementations, and loop unrolling in general is one of the optimizations a compiler will attempt (they're actually quite good at this, so we might as well let them do their job).
In addition to what's been said before, you should consider that if you turn on optimizations, the compiler will likely perform loop-unrolling which will make the loop itself faster.
Also, pre-increment (++i) can take a few instructions fewer than post-increment (i++).
Explanation here
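For plain int counters an optimising compiler erases any difference, but for class-type iterators the postfix form has to materialise a copy of the old value. A minimal sketch of the structural difference (hypothetical iterator):
// Why ++it can be cheaper than it++ for non-trivial iterators.
struct Iter {
    int* p;
    Iter& operator++() { ++p; return *this; }                   // prefix: no copy
    Iter operator++(int) { Iter old = *this; ++p; return old; } // postfix: copies old value
};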
Beware of spending a lot of time thinking about optimization details which the compiler will just take care of for you.
Here are four implementations of what I understand the OP to be asking for, along with the code generated using gcc 4.8 with --std=c++11 -O3 -S.
Declarations:
#include <algorithm>
#include <vector>
struct T {
    int irrelevant;
    int relevant;
    double trailing;
};
Explicit loop implementations, roughly from answers and comments provided to OP. Both produced identical machine code, aside from labels.
void clear_relevant(std::vector<T>* vecp) {
    for (unsigned i = 0; i < vecp->size(); i++) {
        vecp->at(i).relevant = 0;
    }
}

void clear_relevant2(std::vector<T>* vecp) {
    std::vector<T>& vec = *vecp;
    auto s = vec.size();
    for (unsigned i = 0; i < s; ++i) {
        vec[i].relevant = 0;
    }
}

Their common generated code:

.cfi_startproc
movq (%rdi), %rsi
movq 8(%rdi), %rcx
xorl %edx, %edx
xorl %eax, %eax
subq %rsi, %rcx
sarq $4, %rcx
testq %rcx, %rcx
je .L1
.p2align 4,,10
.p2align 3
.L5:
salq $4, %rdx
addl $1, %eax
movl $0, 4(%rsi,%rdx)
movl %eax, %edx
cmpq %rcx, %rdx
jb .L5
.L1:
rep ret
.cfi_endproc
Two other versions, one using std::for_each and the other one using the range for syntax. Here there is a subtle difference in the code for the two versions (other than the labels):
void clear_relevant3(std::vector<T>* vecp) {
    for (auto& p : *vecp) p.relevant = 0;
}

Generated code:

.cfi_startproc
movq 8(%rdi), %rdx
movq (%rdi), %rax
cmpq %rax, %rdx
je .L17
.p2align 4,,10
.p2align 3
.L21:
movl $0, 4(%rax)
addq $16, %rax
cmpq %rax, %rdx
jne .L21
.L17:
rep ret
.cfi_endproc

void clear_relevant4(std::vector<T>* vecp) {
    std::for_each(vecp->begin(), vecp->end(),
                  [](T& o){ o.relevant = 0; });
}

Generated code:

.cfi_startproc
movq 8(%rdi), %rdx
movq (%rdi), %rax
cmpq %rdx, %rax
je .L12
.p2align 4,,10
.p2align 3
.L16:
movl $0, 4(%rax)
addq $16, %rax
cmpq %rax, %rdx
jne .L16
.L12:
rep ret
.cfi_endproc
#include <stdint.h>
#include <iostream>
using namespace std;

uint32_t k[] = {0, 1, 17};

template <typename T>
bool f(T *data, int i) {
    return data[0] < (T)(1 << k[i]);
}

int main() {
    uint8_t v = 0;
    cout << f(&v, 2) << endl;
    cout << (0 < (uint8_t)(1 << 17)) << endl;
    return 0;
}
g++ a.cpp && ./a.out
1
0
Why am I getting these results?
It looks like gcc reverses the shift and applies it to the other side, and I guess this is a bug.
In C (instead of C++) the same thing happens, and C translated to asm is easier to read, so I'm using C here; also I reduced the test cases (dropping templates and the k array).
foo() is the original buggy f() function, foo1() is how gcc makes foo() behave (but shouldn't), and bar() shows what foo() should look like, apart from the pointer read.
I'm on 64-bit, but 32-bit is the same apart from the parameter handling and finding k.
#include <stdint.h>
#include <stdio.h>

uint32_t k = 17;

char foo(uint8_t *data) {
    return *data < (uint8_t)(1 << k);
    /*
    with gcc -O3 -S: (gcc version 4.7.2 (Debian 4.7.2-5))
    movzbl (%rdi), %eax
    movl k(%rip), %ecx
    shrb %cl, %al
    testb %al, %al
    sete %al
    ret
    */
}

char foo1(uint8_t *data) {
    return (((uint32_t)*data) >> k) < 1;
    /*
    movzbl (%rdi), %eax
    movl k(%rip), %ecx
    shrl %cl, %eax
    testl %eax, %eax
    sete %al
    ret
    */
}

char bar(uint8_t data) {
    return data < (uint8_t)(1 << k);
    /*
    movl k(%rip), %ecx
    movl $1, %eax
    sall %cl, %eax
    cmpb %al, %dil
    setb %al
    ret
    */
}

int main() {
    uint8_t v = 0;
    printf("All should be 0: %i %i %i\n", foo(&v), foo1(&v), bar(v));
    return 0;
}
If your int is 16 bits wide, you're running into undefined behavior, and either result is "OK".
Shifting N-bit integers by N or more bit positions left or right results in undefined behavior.
Since this happens with 32-bit ints, this is a bug in the compiler.
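For completeness, a tiny example of where the undefined behaviour actually starts, assuming a 32-bit int:
#include <cstdio>

int main() {
    unsigned n = 32;   // runtime value, so the compiler can't reject it outright
    int ok  = 1 << 17; // well-defined on a 32-bit int: 17 < 32
    int bad = 1 << n;  // undefined behaviour: shift count equals the width of int
    std::printf("%d %d\n", ok, bad);
}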
Here are some more data points:
Basically, it looks like gcc optimizes (even when the -O flag is off and -g is on):
[variable] < (type-cast)(1 << [variable2])
to
((type-cast)[variable] >> [variable2]) == 0
and
[variable] >= (type-cast)(1 << [variable2])
to
((type-cast)[variable] >> [variable2]) != 0
where [variable] needs to be an array access.
I guess the advantage here is that it doesn't have to load the literal 1 into a register, which saves a register.
So here are the data points:
changing 1 to a number > 1 forces it to implement the correct version.
changing any of the variables to a literal forces it to implement the correct version.
changing [variable] to a non-array access forces it to implement the correct version.
[variable] > (type-cast)(1 << [variable2]) implements the correct version.
I suspect this is all trying to save a register: when [variable] is an array access, it also needs to keep an index around. Someone probably thought this was very clever, until it turned out to be wrong.
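For reference, here is what the standard requires the expression to mean, written out step by step (a sketch specialised to uint8_t; the function name is mine):
#include <cstdint>

// The required meaning of data[0] < (uint8_t)(1 << k), one step at a time.
bool reference_semantics(uint8_t value, uint32_t k) {
    int shifted = 1 << k;                 // done in int; fine for k < 32
    uint8_t truncated = (uint8_t)shifted; // modulo-256 truncation: (1 << 17) -> 0
    return value < truncated;             // so 0 < 0 is false
}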
Using code from the bug report http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56051
#include <stdio.h>

int main(void)
{
    int a, s = 8;
    unsigned char data[1] = {0};
    a = data[0] < (unsigned char) (1 << s);
    printf("%d\n", a);
    return 0;
}
Compiled with gcc -O2 -S:
.globl main
.type main, #function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $8, %esp
pushl $1 ***** seems it already precomputed the result to be 1
pushl $.LC0
pushl $1
call __printf_chk
xorl %eax, %eax
movl -4(%ebp), %ecx
leave
leal -4(%ecx), %esp
ret
Compiled with just gcc -S:
.globl main
.type main, #function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ebx
pushl %ecx
subl $16, %esp
movl $8, -12(%ebp)
movb $0, -17(%ebp)
movb -17(%ebp), %dl
movl -12(%ebp), %eax
movb %dl, %bl
movb %al, %cl
shrb %cl, %bl ****** (unsigned char)data[0] >> s => %bl
movb %bl, %al %bl => %al
testb %al, %al %al = 0?
sete %dl
movl $0, %eax
movb %dl, %al
movl %eax, -16(%ebp)
movl $.LC0, %eax
subl $8, %esp
pushl -16(%ebp)
pushl %eax
call printf
addl $16, %esp
movl $0, %eax
leal -8(%ebp), %esp
addl $0, %esp
popl %ecx
popl %ebx
popl %ebp
leal -4(%ecx), %esp
ret
I guess the next step is to dig through gcc's source code.
I'm pretty sure the conversion itself is actually well-defined here: converting a "large" integer to a smaller unsigned type truncates it modulo 2^N, as far as I know. 131072 definitely doesn't fit in a uint8_t, so (uint8_t)(1 << 17) becomes 0.
Although looking at the generated code, I'd say that it's probably not quite right, since it does sete rather than setb. That seems very suspicious to me.
If I turn the expression around:
return (T)(1<<k[i]) > data[0];
then it uses a "seta" instruction, which is what I'd expect. I'll do a bit more digging - but something seems a bit wrong.
Which is better?
According to Savitch, each recursive call is saved on the stack in the form of an activation frame, which has overhead. However, the recursive version takes a few fewer lines of code to write. For an interview, which one is better to turn in? The code for both is below.
#include <iostream>
using namespace std;

const int SIZE = 10;
int array[ SIZE ] = { 0,1,2,3,4,5,6,7,8,9 };
int answer = -1; // a sentinel index; NULL is a pointer constant, not an int

// Note: neither version terminates if the value is absent from the array.
void binary_search_recursive( int array[], int start, int end, int value, int& answer )
{
    int mid = ( start + end ) / 2;
    if ( array[ mid ] == value )
    {
        answer = mid;
    }
    else if ( array[ mid ] < value )
    {
        binary_search_recursive( array, mid + 1, end, value, answer );
    }
    else
    {
        binary_search_recursive( array, start, mid - 1, value, answer );
    }
}

void binary_search_iterative( int array[], int start, int end, int value, int& answer )
{
    int mid = ( ( start + end ) / 2 );
    while ( array[ mid ] != value )
    {
        if ( array[ mid ] < value )
        {
            start = mid;
            mid = ( ( ( mid + 1 ) + end ) / 2 );
        }
        else
        {
            end = mid;
            mid = ( ( start + ( mid - 1 ) ) / 2 );
        }
    }
    answer = mid;
}

int main()
{
    binary_search_iterative( array, 0, SIZE - 1, 4, answer );
    cout << answer;
    return 0;
}
Recursive versions of algorithms are often shorter in lines of code but iterative versions of the same algorithm are often faster because of the function call overhead of the recursive version.
Regarding the binary search algorithm, the faster implementations are written iteratively. For example, Jon Bentley's published version of binary search is iterative.
In the case of a binary search, recursion does not help you express your intent any better than iteration does, so an iterative approach is better.
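For reference, a minimal sketch of the classic iterative form, including the not-found handling the question's version lacks:
// Classic iterative binary search over a sorted array; returns the
// index of value, or -1 if it is absent.
int binary_search_classic(const int a[], int n, int value) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2; // avoids overflow of lo + hi
        if (a[mid] == value) return mid;
        if (a[mid] < value)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1; // not found
}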
I think the best approach for an interview would be to submit a solution that calls lower_bound: it shows the interviewer that you not only know some basic-level syntax and how to code a freshman-year algorithm, but that you do not waste time re-writing boilerplate code.
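For instance, a sketch of that interview answer (assuming a sorted std::vector<int>; the helper name is mine):
#include <algorithm>
#include <vector>

// Returns the index of value in the sorted vector v, or -1 if absent.
int find_sorted(const std::vector<int>& v, int value) {
    auto it = std::lower_bound(v.begin(), v.end(), value);
    if (it != v.end() && *it == value)
        return static_cast<int>(it - v.begin());
    return -1;
}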
For an interview, I'd start by mentioning that both recursive and iterative solutions are possible and similarly trivial to write. Recursive versions have a potential issue with nested stack frames using, or even exhausting, stack memory (and faulting different pages into cache), but compilers tend to provide tail-recursion optimisations that effectively create an iterative implementation. Recursive functions tend to be more self-evidently correct and concise, but they aren't as widely applicable in day-to-day C++ programming, so they may be a little less familiar and comfortable for maintenance programmers.
Unless there's a reason not to, in a real project I'd use std::binary_search from <algorithm> (http://www.sgi.com/tech/stl/binary_search.html).
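When a yes/no answer is enough, usage is a one-liner (a sketch; the helper name is mine):
#include <algorithm>
#include <vector>

// The range must already be sorted for std::binary_search.
bool contains(const std::vector<int>& v, int value) {
    return std::binary_search(v.begin(), v.end(), value);
}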
To illustrate the tail recursion, your binary_search_recursive algorithm was compiled into the assembly below by g++ -O4 -S. Notes:
to get an impression of the code, you don't need to understand every line, but the following helps:
movl instructions are moves (assignments) between registers and memory (the trailing "l" for "long" reflects the operand width).
subl, shrl, sarl and cmpl are subtract, shift right (logical), shift arithmetic right, and compare instructions. The important thing to note is that, as a side effect, they set a few flags, such as "equal" if they produce a 0 result, which are then consulted by je (jump if equal), jge (jump if greater or equal), and jne (jump if not equal).
the answer = mid termination condition is handled at L10, while the recursive steps are instead handled by the code at L14 and L4 and may jump back up to L12.
Here's the disassembly of your binary_search_recursive function (the name is mangled in C++ style)...
__Z23binary_search_recursivePiiiiRi:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $4, %esp
movl 24(%ebp), %eax
movl 8(%ebp), %edi
movl 12(%ebp), %ebx
movl 16(%ebp), %ecx
movl %eax, -16(%ebp)
movl 20(%ebp), %esi
.p2align 4,,15
L12:
leal (%ebx,%ecx), %edx
movl %edx, %eax
shrl $31, %eax
leal (%edx,%eax), %eax
sarl %eax
cmpl %esi, (%edi,%eax,4)
je L10
L14:
jge L4
leal 1(%eax), %ebx
leal (%ebx,%ecx), %edx
movl %edx, %eax
shrl $31, %eax
leal (%edx,%eax), %eax
sarl %eax
cmpl %esi, (%edi,%eax,4)
jne L14
L10:
movl -16(%ebp), %ecx
movl %eax, (%ecx)
popl %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.p2align 4,,7
L4:
leal -1(%eax), %ecx
jmp L12
You use iteration if speed is an issue or if the stack size is constraining, because, as you said, recursion involves calling the function repeatedly, which occupies more space on the stack. As for answering in the interview, I would go for whichever I feel is simplest to get correct at the time, for obvious reasons :))
I have a virtual function in a hotspot code that needs to return a structure as a result. I have these two options:
virtual Vec4 generateVec() const = 0; // return value
virtual void generateVec(Vec4& output) const = 0; // output parameter
My question is: is there generally any difference in the performance of these functions? I'd assume the second one is faster, because it does not involve copying data on the stack. However, the first one is often much more convenient to use. If the first one is slightly slower, would that be measurable at all? Am I too obsessed? :)
Let me stress that this function will be called millions of times every second, but also that the structure Vec4 is small: 16 bytes.
As has been said, try them out, but you will quite possibly find that Vec4 generateVec() is actually faster. Return value optimization will elide the copy operation, whereas void generateVec(Vec4& output) may cause an unnecessary initialisation of the output parameter.
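A small sketch of why the by-value form can win. The Vec4 layout here is an assumption standing in for the question's 16-byte struct:
struct Vec4 { float x, y, z, w; }; // 16 bytes, matching the question

// The returned prvalue is constructed directly in the caller's storage,
// so no copy takes place (RVO; guaranteed elision since C++17).
Vec4 makeVec() {
    return Vec4{1.0f, 2.0f, 3.0f, 4.0f};
}

// The output-parameter form requires an already-constructed Vec4 at the
// call site, which may mean a default initialisation that is thrown away.
void makeVec(Vec4& out) {
    out = Vec4{1.0f, 2.0f, 3.0f, 4.0f};
}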
Is there any way you can avoid making the function virtual? If you're calling it millions of times a sec that extra level of indirection is worth looking at.
Code called millions of times per second implies you really do need to optimize for speed.
Depending on how complex the bodies of the derived generateVec implementations are, the difference between the two may be unnoticeable or could be massive.
Best bet is to try them both and profile to see if you need to worry about optimizing this particular aspect of the code.
I was feeling a bit bored, so I came up with this:
#include <iostream>
#include <ctime>
#include <cstdlib>
using namespace std;

struct A {
    int n[4];
    A() {
        n[0] = n[1] = n[2] = n[3] = rand();
    }
};

A f1() {
    return A();
}

void f2( A & a ) { // returns via the output parameter, so the return type is void
    a = A();
}

const unsigned long BIG = 100000000;

int main() {
    unsigned int sum = 0;
    A a;
    clock_t t = clock();
    for ( unsigned int i = 0; i < BIG; i++ ) {
        a = f1();
        sum += a.n[0];
    }
    cout << clock() - t << endl;
    t = clock();
    for ( unsigned int i = 0; i < BIG; i++ ) {
        f2( a );
        sum += a.n[0];
    }
    cout << clock() - t << endl;
    return sum & 1;
}
Results with -O2 optimisation are that there is no significant difference.
Chances are that the first solution is faster.
A very nice article:
http://cpp-next.com/archive/2009/08/want-speed-pass-by-value/
Just out of curiosity, I wrote two similar functions (using an 8-byte data type) to check their assembly code.
long long int ret_val()
{
    long long int tmp(1);
    return tmp;
}
// ret_val() assembly
.globl _Z7ret_valv
.type _Z7ret_valv, #function
_Z7ret_valv:
.LFB0:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
subl $16, %esp
movl $1, -8(%ebp)
movl $0, -4(%ebp)
movl -8(%ebp), %eax
movl -4(%ebp), %edx
leave
ret
.cfi_endproc
Surprisingly, the output-parameter method below required a few more instructions:
void output_val(long long int& value)
{
    long long int tmp(2);
    value = tmp;
}
// output_val() assembly
.globl _Z10output_valRx
.type _Z10output_valRx, #function
_Z10output_valRx:
.LFB1:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
subl $16, %esp
movl $2, -8(%ebp)
movl $0, -4(%ebp)
movl 8(%ebp), %ecx
movl -8(%ebp), %eax
movl -4(%ebp), %edx
movl %eax, (%ecx)
movl %edx, 4(%ecx)
leave
ret
.cfi_endproc
These functions were called in a test code as:
long long val = ret_val();
long long val2;
output_val(val2);
Compiled by gcc.