what is "stack alignment"? - c++

What is stack alignment?
Why is it used?
Can it be controlled by compiler settings?
The details of this question are taken from a problem faced when trying to use ffmpeg libraries with msvc, however what I'm really interested in is an explanation of what is "stack alignment".
The Details:
When runnig my msvc complied program which links to avcodec I get the
following error: "Compiler did not align stack variables. Libavcodec has
been miscompiled", followed by a crash in avcodec.dll.
avcodec.dll was not compiled with msvc, so I'm unable to see what is going on inside.
When running ffmpeg.exe and using the same avcodec.dll everything works well.
ffmpeg.exe was not compiled with msvc, it was complied with gcc / mingw (same as avcodec.dll)
Thanks,
Dan

Alignment of variables in memory (a short history).
In the past computers had an 8 bits databus. This means, that each clock cycle 8 bits of information could be processed. Which was fine then.
Then came 16 bit computers. Due to downward compatibility and other issues, the 8 bit byte was kept and the 16 bit word was introduced. Each word was 2 bytes. And each clock cycle 16 bits of information could be processed. But this posed a small problem.
Let's look at a memory map:
+----+
|0000|
|0001|
+----+
|0002|
|0003|
+----+
|0004|
|0005|
+----+
| .. |
At each address there is a byte which can be accessed individually.
But words can only be fetched at even addresses. So if we read a word at 0000, we read the bytes at 0000 and 0001. But if we want to read the word at position 0001, we need two read accesses. First 0000,0001 and then 0002,0003 and we only keep 0001,0002.
Of course this took some extra time and that was not appreciated. So that's why they invented alignment. So we store word variables at word boundaries and byte variables at byte boundaries.
For example, if we have a structure with a byte field (B) and a word field (W) (and a very naive compiler), we get the following:
+----+
|0000| B
|0001| W
+----+
|0002| W
|0003|
+----+
Which is not fun. But when using word alignment we find:
+----+
|0000| B
|0001| -
+----+
|0002| W
|0003| W
+----+
Here memory is sacrificed for access speed.
You can imagine that when using double word (4 bytes) or quad word (8 bytes) this is even more important. That's why with most modern compilers you can chose which alignment you are using while compiling the program.

Some CPU architectures require specific alignment of various datatypes, and will throw exceptions if you don't honor this rule. In standard mode, x86 doesn't require this for the basic data types, but can suffer performance penalties (check www.agner.org for low-level optimization tips).
However, the SSE instruction set (often used for high-performance) audio/video procesing has strict alignment requirements, and will throw exceptions if you attempt to use it on unaligned data (unless you use the, on some processors, much slower unaligned versions).
Your issue is probably that one compiler expects the caller to keep the stack aligned, while the other expects callee to align the stack when necessary.
EDIT: as for why the exception happens, a routine in the DLL probably wants to use SSE instructions on some temporary stack data, and fails because the two different compilers don't agree on calling conventions.

IIRC, stack alignment is when variables are placed on the stack "aligned" to a particular number of bytes. So if you are using a 16 bit stack alignment, each variable on the stack is going to start from a byte that is a multiple of 2 bytes from the current stack pointer within a function.
This means that if you use a variable that is < 2 bytes, such as a char (1 byte), there will be 8 bits of unused "padding" between it and the next variable. This allows certain optimisations with assumptions based on variable locations.
When calling functions, one method of passing arguments to the next function is to place them on the stack (as opposed to placing them directly into registers). Whether or not alignment is being used here is important, as the calling function places the variables on the stack, to be read off by the calling function using offsets. If the calling function aligns the variables, and the called function expects them to be non-aligned, then the called function won't be able to find them.
It seems that the msvc compiled code is disagreeing about variable alignment. Try compiling with all optimisations turned off.

As far as I know, compilers don't typically align variables that are on the stack. The library may be depending on some set of compiler options that isn't supported on your compiler. The normal fix is to declare the variables that need to be aligned as static, but if you go about doing this in other people's code, you'll want to be sure that they variables in question are initialized later on in the function rather than in the declaration.
// Some compilers won't align this as it's on the stack...
int __declspec(align(32)) needsToBe32Aligned = 0;
// Change to
static int __declspec(align(32)) needsToBe32Aligned;
needsToBe32Aligned = 0;
Alternately, find a compiler switch that aligns the variables on the stack. Obviously the "__declspec" align syntax I've used here may not be what your compiler uses.

Related

Data Alignment: Reason for restriction on memory address being multiple of data type size

I understand the general concept of data alignment, but what I do not understand is the restriction on memory address value, forced to be a multiple of the size of underlying data type.
This answer explains the data alignment well.
Quote:
Let's look at a memory map:
+----+
|0000|
|0001|
+----+
|0002|
|0003|
+----+
|0004|
|0005|
+----+
| .. |
At each address there is a byte which can be accessed individually. But words can only be fetched at even addresses. So if we read a word at 0000, we read the bytes at 0000 and 0001. But if we want to read the word at position 0001, we need two read accesses. First 0000,0001 and then 0002,0003 and we only keep 0001,0002.
Question:
Assuming it's true, why " But words can only be fetched at even addresses. " be true? Can't the memory/stack pointer point to 0001 in the example and then read a word of information starting there?
We know the machine can read memory in blocks of 2 bytes with one read action, (in the example [0000, 0001] or [0002, 0003]). So if my address register is pointing to 0001 (odd address instead of even), then I can read 2 bytes from there (i.e. 0001 and 0002) directly in one read action, right?
The assumption about that statement is not necessarily true. I don't want to re-iterate the answer you linked to describing the reasons for using and highly preferring aligned access, but there are architectures that do support unaligned memory access -- ARM for example (check out this SO answer).
But your question, I think, really comes down to hardware architecture, specifically the data bus design, and the accompanying instructions set that engineers at various silicon manufacturers have designed.
Some Cortex-M cores explicitly allow you to enable a CPU to trigger an exception on un- aligned access by configuring a Usage Fault register, which means that you can "utilize" unaligned memory access in rare use-cases.
Usually a processors internal addresses points to a whole word. This is because you don't want your (simple) processor be able to address a word at a random byte (or even worse: bit) because
You waste addressable memory: presuming the biggest possible address your processor is able to process is the max value of its word size, and you can multiply that by the size of your word to calculate the amount of storage you can address. (Each unique address points to a full word) The "address" I'm talking about here is not necessarily looking like the address which might be stored in a pointer of a higher programming language. The pointer address addresses each byte, which will be interpreted by a compiler or interpreter into the corresponding assembly instructions (discarding unwanted bytes from the loaded word)
A word loaded from memory could be anything, a value or the next instruction of the program you are running on your processor - the previous word loaded into the processor will often give an indication what the following word that gets loaded is used for: another instruction (eg arithmetic operation, load or store instruction) which might be followed by operands (values or addresses). Being able to address unaligned words would complicate a processor a lot in easy words.
Assuming it's true, why " But words can only be fetched at even addresses. " be true?
The memory actually stores words. The processor actually addresses the memory in words, and fetches a word at a time.
When you fetch a byte, it actually fetches a word, then ignores either the first half or the second half.
On 32-bit processors, it fetches a 32-bit word, then ignores three quarters; fetching a 16-bit word on a 32-bit processor ignores half the word.
If the 16-bit word you want to fetch (on a 16-bit processor) isn't aligned, then the processor has to fetch two words, take half of each word and then re-combine them. So even on processor designs where it works, it's often slower.
A lot of processor designs don't bother - either they just won't allow it, or they force the operating system to handle it (which is very slow).
(Not all types of processors work this way - e.g. 8-bit processors usually fetch a byte at a time)
Can't the memory/stack pointer point to 0001 in the example and then read a word of information starting there?
If the processor supports it, yes.

Why does the compiler allocate more space than sizeof(MyClass) for the object on the stack? [duplicate]

What is stack alignment?
Why is it used?
Can it be controlled by compiler settings?
The details of this question are taken from a problem faced when trying to use ffmpeg libraries with msvc, however what I'm really interested in is an explanation of what is "stack alignment".
The Details:
When runnig my msvc complied program which links to avcodec I get the
following error: "Compiler did not align stack variables. Libavcodec has
been miscompiled", followed by a crash in avcodec.dll.
avcodec.dll was not compiled with msvc, so I'm unable to see what is going on inside.
When running ffmpeg.exe and using the same avcodec.dll everything works well.
ffmpeg.exe was not compiled with msvc, it was complied with gcc / mingw (same as avcodec.dll)
Thanks,
Dan
Alignment of variables in memory (a short history).
In the past computers had an 8 bits databus. This means, that each clock cycle 8 bits of information could be processed. Which was fine then.
Then came 16 bit computers. Due to downward compatibility and other issues, the 8 bit byte was kept and the 16 bit word was introduced. Each word was 2 bytes. And each clock cycle 16 bits of information could be processed. But this posed a small problem.
Let's look at a memory map:
+----+
|0000|
|0001|
+----+
|0002|
|0003|
+----+
|0004|
|0005|
+----+
| .. |
At each address there is a byte which can be accessed individually.
But words can only be fetched at even addresses. So if we read a word at 0000, we read the bytes at 0000 and 0001. But if we want to read the word at position 0001, we need two read accesses. First 0000,0001 and then 0002,0003 and we only keep 0001,0002.
Of course this took some extra time and that was not appreciated. So that's why they invented alignment. So we store word variables at word boundaries and byte variables at byte boundaries.
For example, if we have a structure with a byte field (B) and a word field (W) (and a very naive compiler), we get the following:
+----+
|0000| B
|0001| W
+----+
|0002| W
|0003|
+----+
Which is not fun. But when using word alignment we find:
+----+
|0000| B
|0001| -
+----+
|0002| W
|0003| W
+----+
Here memory is sacrificed for access speed.
You can imagine that when using double word (4 bytes) or quad word (8 bytes) this is even more important. That's why with most modern compilers you can chose which alignment you are using while compiling the program.
Some CPU architectures require specific alignment of various datatypes, and will throw exceptions if you don't honor this rule. In standard mode, x86 doesn't require this for the basic data types, but can suffer performance penalties (check www.agner.org for low-level optimization tips).
However, the SSE instruction set (often used for high-performance) audio/video procesing has strict alignment requirements, and will throw exceptions if you attempt to use it on unaligned data (unless you use the, on some processors, much slower unaligned versions).
Your issue is probably that one compiler expects the caller to keep the stack aligned, while the other expects callee to align the stack when necessary.
EDIT: as for why the exception happens, a routine in the DLL probably wants to use SSE instructions on some temporary stack data, and fails because the two different compilers don't agree on calling conventions.
IIRC, stack alignment is when variables are placed on the stack "aligned" to a particular number of bytes. So if you are using a 16 bit stack alignment, each variable on the stack is going to start from a byte that is a multiple of 2 bytes from the current stack pointer within a function.
This means that if you use a variable that is < 2 bytes, such as a char (1 byte), there will be 8 bits of unused "padding" between it and the next variable. This allows certain optimisations with assumptions based on variable locations.
When calling functions, one method of passing arguments to the next function is to place them on the stack (as opposed to placing them directly into registers). Whether or not alignment is being used here is important, as the calling function places the variables on the stack, to be read off by the calling function using offsets. If the calling function aligns the variables, and the called function expects them to be non-aligned, then the called function won't be able to find them.
It seems that the msvc compiled code is disagreeing about variable alignment. Try compiling with all optimisations turned off.
As far as I know, compilers don't typically align variables that are on the stack. The library may be depending on some set of compiler options that isn't supported on your compiler. The normal fix is to declare the variables that need to be aligned as static, but if you go about doing this in other people's code, you'll want to be sure that they variables in question are initialized later on in the function rather than in the declaration.
// Some compilers won't align this as it's on the stack...
int __declspec(align(32)) needsToBe32Aligned = 0;
// Change to
static int __declspec(align(32)) needsToBe32Aligned;
needsToBe32Aligned = 0;
Alternately, find a compiler switch that aligns the variables on the stack. Obviously the "__declspec" align syntax I've used here may not be what your compiler uses.

Strange C++ Memory Allocation

I created a simple class, Storer, in C++, playing with memory allocation. It contains six field variables, all of which are assigned in the constructor:
int x;
int y;
int z;
char c;
long l;
double d;
I was interested in how these variables were being stored, so I wrote the following code:
Storer *s=new Storer(5,4,3,'a',5280,1.5465);
cout<<(long)s<<endl<<endl;
cout<<(long)&(s->x)<<endl;
cout<<(long)&(s->y)<<endl;
cout<<(long)&(s->z)<<endl;
cout<<(long)&(s->c)<<endl;
cout<<(long)&(s->l)<<endl;
cout<<(long)&(s->d)<<endl;
I was very interested in the output:
33386512
33386512
33386516
33386520
33386524
33386528
33386536
Why is the char c taking up four bytes? sizeof(char) returns, of course, 1, so why is the program allocating more memory than it needs? This is confirmed that too much memory is being allocated with the following code:
cout<<sizeof(s->c)<<endl;
cout<<sizeof(Storer)<<endl;
cout<<sizeof(int)+sizeof(int)+sizeof(int)+sizeof(char)+sizeof(long)+sizeof(double)<<endl;
which prints:
1
32
29
confirming that, indeed, 3 bytes are being allocated needlessly. Can anyone explain to me why this is happening? Thanks.
Data alignment and compiler padding say hi!
The CPU has no notion of type, what it gets in its 32-bit (or 64-bit, or 128-bit (SSE), or 256-bit (AVX) - let's keep it simple at 32) registers needs to be properly aligned in order to be processed correctly and efficiently. Imagine a simple scenario, where you have a char, followed by an int. In a 32-bit architecture, that's 1 byte for a char and 4 bytes for an integer.
A 32-bit register would have to break on its boundary, only taking in 3 bytes of the integer and leaving the 4th byte for "a second run". It cannot process the data properly that way, so the compiler will add padding in order to make sure all the stuff is processed efficiently. And that means adding a certain amount of padding depending on the type in question.
Why is misalignment a problem?
The computer is not human, it can't just pick them out with a pair of eyes and a brain. It has to be very deterministic and cautious about how it goes about doing things. First it loads one block which contains n bytes of the given information, shift it around so that it prunes out unrelated information, then another, again, shift out a bunch of unnecessary bytes which do not have anything to do with the operation at hand and only then can it do the necessary operations. And usually you have two operands, that's just one complete. When you do all that work, only then can you actually process it. Way too much performance overhead when you can simply align the data properly (and most of the time, compilers do it for you, if you're not doing anything fancy).
Could you visualize it?
Visually - the first green byte is the mentioned char, and the three green bytes plus the first red one of the second block is the 4-byte int, colorcoded on a 4-byte access boundary (we're talking about a 32-bit register). The "instead part" at the bottom shows an ideal setup where the int hits the register properly (the char getting padded into obedience somewhere off image):
Read more on data alignment, which comes quite handy when you're dealing with fancy extensions of the instruction set like SSE (128-bit regs) or AVX (256-bit regs), so special care must be taken so that the optimizations of vectorization are not defeated ( aligning on a 16-byte boundary for SSE, 16*8 -> 128-bits).
Additional remarks on user defined alignment
phonetagger made a valid point in the comments that there are pragma directives which can be assigned through the preprocessor to force to compiler in order to align the data in a way the user, programmer specifies. But such directives, like #pragma pack(...), are a statement to the compiler that you know what you're doing and what's best for you. Be sure that you do, because if you fail to accomodate your environment, you might experience various penalties - the most obvious being using external libraries you didn't write yourself which differ in the way they pack data.
Things simply explode when they clash. Best is to advise caution in such cases and really being intimate with the issue at hand. If you're not sure, leave it to the defaults. If you are not sure but have to use something like SSE where alignment is king (and not default nor simple by a long shot), consult various resources online or ask an another question here.
I will make an analogy to help you understand.
Assume there is a long loaf of bread and you have a cutting machine that can cut it into slices of equal thickness. Then you are giving out these breads to, let's say children. Every child takes their bread and fairly do what they want to do with them (put Nutella on them and eat, etc.). They can even make thinner slices out of it and use it like that.
If one child comes up to you and says that he does not want that slice everyone is getting, but a thinner slice instead, then you will have difficulties, because your cutting machine is optimized to cut at least a minimum amount, which makes everyone happy. But when one child asks for a thinner slice, then you have to reinvent the machine or put additional complexity to it like introducing two cutting modes. You don't want that. Eventually you give up and just give him a big slice anyway.
This is the same reason why it happens. Hope you could relate to the analogy.
Data alignement is why the char has allocated 4 bytes : Data alignement
char does not take up four bytes: it takes up a single byte as usual. You can check it by printing sizeof(char). The other three bytes are padding that the compiler inserts to optimize access to other members of your class. Depending on hardware, it is often much faster to access multi-byte types, say, 4-byte integers, when they are located at an address divisible by four. A compiler may insert up to three bytes of padding before an int member to align it with a good memory address for faster access.
If you would like to experiment with class layouts, you can use a handy operation called offsetof. It takes two parameters - the name of the member and the name of the class, and it returns the number of bytes from the base address of your struct to the position of the member in memory.
cout << offsetof(Storer, x) << endl;
cout << offsetof(Storer, y) << endl;
cout << offsetof(Storer, z) << endl;
Structure members are aligned in particular ways. In general, if you want the most compact representation, list the members in decreasing order of size.
http://en.wikipedia.org/wiki/Data_structure_alignment#Typical_alignment_of_C_structs_on_x86

What does "BUS_ADRALN - Invalid address alignment" error means?

We are on HPUX and my code is in C++.
We are getting
BUS_ADRALN - Invalid address alignment
in our executable on a function call. What does this error means?
Same function is working many times then suddenly its giving core dump.
in GDB when I try to print the object values it says not in context.
Any clue where to check?
You are having a data alignment problem. This is likely caused by trying to read or write through a bad pointer of some kind.
A data alignment problem is when the address a pointer is pointing at isn't 'aligned' properly. For example, some architectures (the old Cray 2 for example) require that any attempt to read anything other than a single character from memory only occur through a pointer in which the last 3 bits of the pointer's value are 0. If any of the last 3 bits are 1, the hardware will generate an alignment fault which will result in the kind of problem you're seeing.
Most architectures are not nearly so strict, and frequently the required alignment depends on the exact type being accessed. For example, a 32 bit integer might require only the last 2 bits of the pointer to be 0, but a 64 bit float might require the last 3 bits to be 0.
Alignment problems are usually caused by the same kinds of problems that would cause a SEGFAULT or segmentation fault. Usually a pointer that isn't initialized. But it could be caused by a bad memory allocator that isn't returning pointers with the proper alignment, or by the result of pointer arithmetic on the pointer when it isn't of the correct type.
The system implementation of malloc and/or operator new are almost certainly correct or your program would be crashing way before it currently does. So I think the bad memory allocator is the least likely tree to go barking up. I would check first for an uninitialized pointer and then bad pointer arithmetic.
As a side note, the x86 and x86_64 architectures don't have any alignment requirements. But, because of how cache lines work, and for various other reasons, it's often a good idea for performance to align your data on a boundary that's as big as the datatype being stored (i.e. a 4 byte boundary for a 32 bit int).
Most processors (not x86 and friends.. the blacksheep of the family lol) require accesses to certain elements to be aligned on multiples of bytes. I.e. if you read an integer from address 0x04 that is okay, but if you try to do the same from 0x03 you will cause an interrupt to be thrown.
This is because it's easier to implement the load/store hardware if it's always on a multiple of the data size with which you're working.
Since HP-UX runs only on RISC processors, which typically have such constraints, you should see here -> http://en.wikipedia.org/wiki/Data_structure_alignment#RISC.
Actually HP-UX has its own great forum on ITRC and some HP staff members are very helpful. I just took a look at the same topic you are asking and here are some results. For example the similar problem was caused actually by a bad input parameter. I strongly advise you first to read answers to similar question and if necessary to post your question there.
By the way it is likely that you will be asked to post results of these gdb commands:
(gdb) bt
(gdb) info reg
(gdb) disas $pc-16*8 $pc+16*4
Most of these issues are caused by multiple upstream dependencies linking to different versions of the same library.
For example, both the gnustl and stlport provide distinct implementations of the C++ standard library. If you compile and link against gnustl, while one of your dependencies was compiled and linked against stlport, then you will each have a different implementation of standard functions and classes. When your program is launched, the dynamic linker will attempt to resolve all of the exported symbols, and will discover known symbols at incorrect offsets, resulting in the BUS_ADRALN signal.

Can a C compiler rearrange stack variables?

I have worked on projects for embedded systems in the past where we have rearranged the order of declaration of stack variables to decrease the size of the resulting executable. For instance, if we had:
void func()
{
char c;
int i;
short s;
...
}
We would reorder this to be:
void func()
{
int i;
short s;
char c;
...
}
Because of alignment issues the first one resulted in 12 bytes of stack space being used and the second one resulted in only 8 bytes.
Is this standard behavior for C compilers or just a shortcoming of the compiler we were using?
It seems to me that a compiler should be able to reorder stack variables to favor smaller executable size if it wanted to. It has been suggested to me that some aspect of the C standard prevents this, but I haven't been able to find a reputable source either way.
As a bonus question, does this also apply to C++ compilers?
Edit
If the answer is yes, C/C++ compilers can rearrange stack variables, can you give an example of a compiler that definitely does this? I'd like to see compiler documentation or something similar that backs this up.
Edit Again
Thanks everybody for your help. For documentation, the best thing I've been able to find is the paper Optimal Stack Slot Assignment in GCC(pdf), by Naveen Sharma and Sanjiv Kumar Gupta, which was presented at the GCC summit proceedings in 2003.
The project in question here was using the ADS compiler for ARM development. It is mentioned in the documentation for that compiler that ordering declarations like I've shown can improve performance, as well as stack size, because of how the ARM-Thumb architecture calculates addresses in the local stack frame. That compiler didn't automatically rearrange locals to take advantage of this. The paper linked here says that as of 2003 GCC also didn't rearrange the stack frame to improve locality of reference for ARM-Thumb processors, but it implies that you could.
I can't find anything that definitely says this was ever implemented in GCC, but I think this paper counts as proof that you're all correct. Thanks again.
Not only can the compiler reorder the stack layout of the local variables, it can assign them to registers, assign them to live sometimes in registers and sometimes on the stack, it can assign two locals to the same slot in memory (if their live ranges do not overlap) and it can even completely eliminate variables.
As there is nothing in the standard prohibiting that for C or C++ compilers, yes, the compiler can do that.
It is different for aggregates (i.e. structs), where the relative order must be maintained, but still the compiler may insert pad bytes to achieve preferable alignment.
IIRC newer MSVC compilers use that freedom in their fight against buffer overflows of locals.
As a side note, in C++, the order of destruction must be reverse order of declaration, even if the compiler reorders the memory layout.
(I can't quote chapter and verse, though, this is from memory.)
The compiler is even free to remove the variable from the stack and make it register only if analysis shows that the address of the variable is never taken/used.
The stack need not even exist (in fact, the C99 standard does not have a single occurence of the word "stack"). So yes, the compiler is free to do whatever it wants as long as that preserves the semantics of variables with automatic storage duration.
As for an example: I encountered many times a situation where I could not display a local variable in the debugger because it was stored in a register.
The compiler for the Texas instruments 62xx series of DSP's is capable of, and does
"whole program optimization." ( you can turn it off)
This is where your code gets rearranged, not just the locals. So order of execution ends up being not quite what you might expect.
C and C++ don't actually promise a memory model (in the sense of say the JVM), so things can be quite different and still legal.
For those who don't know them, the 62xx family are 8 instruction per clock cycle DSP's; at 750Mhz, they do peak at 6e+9 instructions. Some of the time anyway. They do parallel execution, but instruction ordering is done in the compiler, not the CPU, like an Intel x86.
PIC's and Rabbit embedded boards don't have stacks unless you ask especially nicely.
A compiler might not even be using a stack at all for data. If you're on a platform so tiny that you're worrying about 8 vs 12 bytes of stack, then it's likely that there will be compilers which have pretty specialised approaches. (Some PIC and 8051 compilers come to mind)
What processor are you compiling for?
it is compiler specifics, one can make his own compiler that would do the inverse if he wanted it that way.
A decent compiler will put local variables in registers if it can. Variables should only be placed on the stack if there is excessive register pressure (not enough room) or if the variable's address is taken, meaning it needs to live in memory.
As far as I know, there is nothing that says variables need to be placed at any specific location or alignment on the stack for C/C++; the compiler will put them wherever is best for performance and/or whatever is convenient for compiler writers.
AFAIK there is nothing in the definition of C or C++ specifying how the compiler should order local variables on the stack. I would say that relying on what the compiler may do in this case is a bad idea, because the next version of your compiler may do it differently. If you spend time and effort to order your local variables to save a few bytes of stack, those few bytes had better be really critical for the functioning of your system.
There's no need for idle speculation about what the C standard requires or does not require: recent drafts are freely available online from the ANSI/ISO working group.
This does not answer your question but here is my 2 cents about a related issue...
I did not have the problem of stack space optimization but I had the problem of mis-alignment of double variables on the stack. A function may be called from any other function and the stack pointer value may have any un-aligned value. So I have come up with the idea below. This is not the original code, I just wrote it...
#pragma pack(push, 16)
typedef struct _S_speedy_struct{
double fval[4];
int64 lval[4];
int32 ival[8];
}S_speedy_struct;
#pragma pack(pop)
int function(...)
{
int i, t, rv;
S_speedy_struct *ptr;
char buff[112]; // sizeof(struct) + alignment
// ugly , I know , but it works...
t = (int)buff;
t += 15; // alignment - 1
t &= -16; // alignment
ptr = (S_speedy_struct *)t;
// speedy code goes on...
}