Understanding stack frame of function call in C/C++? [closed] - c++

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 9 years ago.
I am new to C/C++ and to assembly language as well.
This could also be a very basic question.
I am trying to understand how stack frames are built and which variables (parameters) are pushed onto the stack, and in what order.
Some search results showed that the C/C++ compiler decides based on the operations performed within a function. For example, if the function was only supposed to increment a passed int parameter by 1 and return it (similar to the ++ operator), it would put the function's parameter and local variables in registers and just perform the addition. I am wondering which register is used for the return value / pass by value, how references are returned, and what the difference is between eax, ebx, ecx and edx.
I am looking for a book/blog/link or any kind of material that explains how registers, the stack and heap references are used, built and destroyed during function calls, and also how the main function is stored.
Thanks In Advance

Your question is borderline here; Programmers Stack Exchange could be a better place for it.
A good book for understanding the concepts of a stack etc. might be Queinnec's Lisp In Small Pieces (it explains quite well what a stack is for Lisp). SICP is also a good book to read.
D. Knuth's books and MMIX are also a good read.
Read carefully the Wikipedia Call stack page.
In theory, no call stack is needed, and some languages and implementations (e.g. old SML/NJ) did not use any stack (but allocated the call frame in the garbage collected heap). See A.Appel's old paper Garbage Collection Can be Faster than Stack Allocation (and learn more about garbage collection in general).
Usually C and C++ implementations have a stack (and often use the hardware stack). Some C local variables might not have any stack location (because they have been optimized away, or are kept in a register). Sometimes the stack location of a C local variable may change (the compiler would use one call-stack slot for some occurrences of the variable, and another slot for other occurrences of the same local variable). And of course some temporary values may be compiled like your local variables (so they stay in a register, or in one stack slot and then another one, etc.). When optimizing, the compiler can do weird tricks with variables.
On some older machines (IBM/360 or IBM z/Series), there is no hardware stack; the stack used by the C compiler is a software convention (e.g. some register is dedicated to that usage, without specific hardware support).
Think about the execution (or interpretation) of a recursively defined function (like the good old factorial, naively coded). Read about recursion (in general, in computer science), primitive recursive functions, lambda calculus, denotational semantics, stack automata, register allocation, tail calls, continuations, ABIs, interrupts, POSIX signals, sigaltstack(2), getcontext(2), longjmp(3), etc.
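To make that concrete, here is the naive recursive factorial (a minimal sketch, not taken from the question): every call pushes a fresh frame holding its own copy of n and a return address, so computing fact(n) nests n frames before any of them is popped.

#include <iostream>

// Naive recursive factorial: each call gets its own stack frame
// holding the parameter n and the return address, so fact(5)
// nests five frames before any of them returns.
unsigned long long fact(unsigned long long n)
{
    if (n <= 1)
        return 1;              // the deepest frame returns first
    return n * fact(n - 1);    // not a tail call: the multiply runs after the callee returns
}

int main()
{
    std::cout << fact(5) << '\n';   // prints 120
}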
Read also books about Computer Architecture. In practice, the call stack is so important that several hardware resources (including the stack pointer register, often the call frame base pointer register, and perhaps hidden machinery e.g. cache related) are dedicated to it on common processors.
You could also look at the intermediate representations used by the GCC compiler. Then use -fdump-tree-all or the GCC MELT probe. If looking at the generated assembly be sure to pass -S -fverbose-asm to your gcc command.
See also the Linux Assembly HOWTO.
I gave a lot of links. It is difficult to answer better, because I have no idea of your background.

I am trying to understand how stack frames are built and which
variables(params) are pushed to stack in what order?
This depends on the architecture of the processor. However, typically, the stack grows from a high address towards a lower address (if we look at memory addresses as numeric values). One stack frame is "whatever this function puts on the stack".
The "stuff" that gets put on the stack typically is:
Return address back to the calling function.
A frame-pointer, pointing to the stack-frame at the start of the call.
Saved registers that need to be "preserved" for when this function returns.
Local variables.
Parameters to the "next" function in the call-stack.
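As a hedged illustration (the exact layout depends on the ABI, compiler and optimization level, and the function name here is made up), the comments in this small sketch show where those items would typically land for an unoptimized x86-64 System V build:

int callee(int a, int b)        // a and b arrive in registers (edi, esi); an unoptimized
{                               // build usually spills them into the new frame
    int local = a + b;          // local variable: lives in the frame (or a register)
    return local;               // the return value travels back in eax
}
// Conceptually, just after the prologue the stack looks like (high to low addresses):
//   ... caller's frame ...
//   return address           <- pushed by the call instruction
//   saved rbp                <- caller's frame pointer, pushed by this function's prologue
//   spilled a, b, local      <- addressed relative to rbp (e.g. rbp-4, rbp-8, ...)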
compiler of C/C++ decides based on operations performed within a
function. for e.g if the function was suppose to just increment value
by 1 of the passed int param and return (similar to ++ operator) it
would put all ...
the param of the function and local variable within the function in registers
and perform addition ....wondering which register is used for
returned/pass by value ?....how are references returned?
The compiler has rules for how parameters are passed, and for regular function calls [that is, not "inlined" functions], the parameters are always passed in the same order, in the same combination of registers and stack memory. If that wasn't the case, the compiler would have to know exactly what the function does before it could decide how to pass the arguments.
Different processor architectures have different rules. x86-32 calling conventions typically pass parameters on the stack (some conventions use one or two registers for the first arguments), and typically one register holds the return value. x86-64 (System V) uses up to six integer registers for passing the first arguments to the function; any further arguments are passed on the stack.
Returning a reference is no different from returning any other value: the value returned is, in this case, the address of the object being referred to. In x86-32, return values go in EAX. In x86-64, return values go in RAX. On ARM, R0 is used for the return value. On 29K, R96 is used for the return value.
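A minimal sketch of what "returning a reference means returning an address" looks like in practice; the function larger is invented for the example, and the register names in the comments assume the x86-64 convention described above.

#include <iostream>

// Returning a reference is implemented like returning a pointer:
// the address of the referred-to object comes back in the return-value
// register (RAX on x86-64, R0 on ARM), and the caller dereferences it.
int& larger(int& a, int& b)
{
    return (a > b) ? a : b;
}

int main()
{
    int x = 3, y = 7;
    larger(x, y) = 42;                    // writes through the returned address
    std::cout << x << ' ' << y << '\n';   // prints "3 42"
}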


Recently, I came across this question in an interview: How can we determine how much storage on the stack a particular function is consuming?
The "stack" is famously an implementation detail of the platform that is not inspectable or in any way queryable from within the language itself. It is essentially impossible to guarantee within any part of a C or C++ program whether it will be possible to make another function call. The "stack size", or maybe better called "function call and local variable storage depth", is one of the implementation limits whose existence is acknowledged by the language standard but considered out of scope. (E.g. for C++ see [implimits], Annex B.)
Individual platforms may offer APIs to allow programs to introspect the platform limitations, but neither C nor C++ specify that or how this should be possible.
Exceeding the implementation-defined resource limits leads to undefined behaviour, and you cannot know whether you will exceed the limits.
It's completely implementation defined - the standard does not in any way impose requirements on the possible underlying mechanisms used by a program.
On an x86 machine, one stack frame consists of a return address (4/8 bytes), parameters and local variables.
The parameters, if e.g. scalars, may be passed through registers, so we can't say for sure whether they contribute to the storage taken up. The locals may be padded (and often are); we can only deduce a minimum amount of storage for these.
The only way to know for sure is to actually analyze the assembler code a compiler generates, or look at the absolute difference of the stack pointer values at runtime - before and after a particular function was called.
E.g.
#include <iostream>

void f()
{
    register void* foo asm ("esp");
    std::cout << foo << '\n';
}

int main()
{
    register void* foo asm ("esp");
    std::cout << foo << '\n';
    f();
}
Now compare the outputs. GCC on Coliru gives
0x7fffbcefb410
0x7fffbcefb400
A difference of 16 bytes. (The stack grows downwards on x86.)
As stated by other answers, the program stack is a concept which is not specified within the language itself. However, with knowledge of how typical implementations work, you can assume that the address of the first argument of a function marks (roughly) the beginning of its stack frame, and the address of the first argument of the next called function marks the beginning of the next stack frame. So they probably wanted to see code like:
#include <stdio.h>
#include <stdlib.h>

void bar(void *b) {
    /* distance between foo's first argument and bar's first argument */
    printf("Foo stack frame is around %lld bytes\n",
           llabs((long long)b - (long long)&b));
}
void foo(int x) {
    bar(&x);
}
The size increase of the stack, for those implementations that use a stack, is (a short sketch after this list ties several of these items to a concrete function):
size of variables that don't fit in the available registers
size of variables declared up front in the function that live for the life of the function
size of other local variables declared along the way or in statement blocks
the maximum stack size used by functions called by this function
everything above * the number of recursive calls
size of the return address
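Here is the sketch referred to above: an invented function whose comments tie several of those items to concrete storage. The sizes are illustrative only; the real numbers depend on the compiler, flags and ABI.

#include <cstring>
#include <iostream>

// Illustrative only: actual frame sizes are compiler- and flag-dependent.
static int helper(const char* buf, int len)   // callee whose own frame sits on top of example()'s
{
    return static_cast<int>(std::strlen(buf)) + len;
}

static int example(int n)
{
    char buf[256] = "hi";     // local array too big for registers: ~256 bytes of frame
    int sum = 0;              // small local: may live only in a register, costing no stack
    for (int i = 0; i < n; ++i)
        sum += i;             // loop temporaries usually stay in registers as well
    return helper(buf, sum);  // the call pushes a return address and, transiently,
}                             // helper's own frame on top of this one

int main()
{
    std::cout << example(10) << '\n';   // strlen("hi") + (0+1+...+9) = 47
}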
Return Address
Most implementations push the return address on the stack before any other data. So this address takes up space.
Available Registers
Some processors have many registers; however, only a few may be available for passing variables. For example, if the convention allows for 2 variables but there are 5 parameters, 3 parameters will be placed on the stack.
When large objects are passed by value, they will take up space on the stack.
Function Local Variables
This is tricky to calculate, because variables may be pushed onto the stack and then popped off when not used.
Some variables may not be pushed onto the stack until they are declared. So if a function returns midway through, it may not use the remaining variables, so the stack size won't increase for those variables.
The compiler may elect to use registers to hold values or place constants directly into the executable code. In this case, they don't add any length to the stack.
Calling Other Functions
The function may call other functions. Each called function may increase the amount of data on the stack. Those functions that are called may call other functions, and so on.
This again, depends on the snapshot in time of the execution. However, one can produce an approximate maximum increase of the stack by the other called functions.
Recursion
As with calling other functions, a recursive call may increase the size of the stack, and each level of recursion adds another frame. Note that a recursive call in tail position can sometimes be optimized by the compiler into a loop that does not grow the stack at all, whereas a recursive call in the middle of the function always keeps the current frame alive across the call.
Register Value Saving
Sometimes, the compiler may need more space for data than the allocated registers allow. Thus the compiler may push variables on the stack.
The compiler may push registers on the stack for convenience, such as swapping registers or changing the value's order.
Summary
The exact size of stack space required for a function is very difficult to calculate and may depend on where the execution is. There are many items to consider in stack size calculation, such as parameter quantity and size as well as any other functions called. Due to the variability, most stack size measurements are based on a maximum size, or worst case size. Stack allocation is usually based on the worst case scenario.
For an interview question, I would mention all of the above, which usually makes the interviewer want to move on to the next question quickly.

Does stack space required by a function affect inlining decisions in C/C++?

Would a large amount of stack space required by a function prevent it from being inlined? Such as if I had a 10k automatic buffer on the stack, would that make the function less likely to be inlined?
int stringyfunc(int args, char* buf);   // declared so the snippet compiles; defined elsewhere

int inlineme(int args) {
    char svar[10000];
    return stringyfunc(args, svar);
}
I'm more concerned about gcc, but icc and llvm would also be nice to know.
I know this isn't ideal, but I'm very curious. The code is probably also pretty bad for the cache.
Yes, the decision to inline or not depends on the complexity of the function, its stack and register usage, and the context in which the call is made. The rules are compiler- and target-platform-dependent. Always check the generated assembly when performance matters.
Compare this version with a 10000-char array not being inlined (GCC 8.2, x64, -O2):
int stringyfunc(int args, char* buf);   // declared so the snippet compiles; defined elsewhere

inline int inlineme(int args) {
    char svar[10000];
    return stringyfunc(args, svar);
}

int test(int x) {
    return inlineme(x);
}
Generated assembly:
inlineme(int):
        sub     rsp, 10008
        mov     rsi, rsp
        call    stringyfunc(int, char*)
        add     rsp, 10008
        ret
test(int):
        jmp     inlineme(int)
with this one with a much smaller 10-char array, which is inlined:
int stringyfunc(int args, char* buf);   // declared so the snippet compiles; defined elsewhere

inline int inlineme(int args) {
    char svar[10];
    return stringyfunc(args, svar);
}

int test(int x) {
    return inlineme(x);
}
Generated assembly:
test(int):
        sub     rsp, 24
        lea     rsi, [rsp+6]
        call    stringyfunc(int, char*)
        add     rsp, 24
        ret
Such as if I had a 10k automatic buffer on the stack, would that make the function less likely to be inlined?
Not necessarily in general. In fact, inline expansion can sometimes reduce stack space usage due to not having to set up space for function arguments.
Expanding a "wide" call into a single frame which calls other "wide" functions can be a problem though, and unless the optimiser guards against that separately, it may have to avoid expansion of "wide" functions in general.
In case of recursion: Most likely yes.
An example of LLVM source:
if (IsCallerRecursive &&
AllocatedSize > InlineConstants::TotalAllocaSizeRecursiveCaller) {
InlineResult IR = "recursive and allocates too much stack space";
From GCC source:
For stack growth limits we always base the growth in stack usage
of the callers. We want to prevent applications from segfaulting
on stack overflow when functions with huge stack frames gets
inlined.
Controlling the limit, from GCC manual:
--param name=value
large-function-growth
Specifies maximal growth of large function caused by inlining in percents. For example, parameter value 100 limits large function growth to 2.0 times the original size.
large-stack-frame
The limit specifying large stack frames. While inlining the algorithm is trying to not grow past this limit too much.
large-stack-frame-growth
Specifies maximal growth of large stack frames caused by inlining in percents. For example, parameter value 1000 limits large stack frame growth to 11 times the original size.
Yes, partly because compilers do stack allocation for the whole function once in prologue/epilogue, not moving the stack pointer around as they enter/leave block scopes.
and each inlined call to inlineme() would need its own buffer.
No, I'm pretty sure compilers are smart enough to reuse the same stack space for different instances of the same function, because only one instance of that C variable can ever be in-scope at once.
Optimization after inlining can merge some of the operations of the inline function into calling code, but I think it would be rare for the compiler to end up with 2 versions of the array it wanted to keep around simultaneously.
I don't see why that would be a concern for inlineing. Can you give an example of how functions that require a lot of stack would be problematic to inline?
A real example of a problem it could create (which compiler heuristics mostly avoid):
Inlining if (rare_special_case) use_much_stack() into a recursive function that otherwise doesn't use much stack would be an obvious problem for performance (more cache and TLB misses), and even correctness if you recurse deep enough to actually overflow the stack.
(Especially in a constrained environment like Linux kernel stacks, typically 8kiB or 16kiB per thread, up from 4k on 32-bit platforms in older Linux versions. https://elinux.org/Kernel_Small_Stacks has some info and historical quotes about trying to get away with 4k stacks so the kernel didn't have to find 2 contiguous physical pages per task).
Compilers normally make functions allocate all the stack space they'll ever need up front (except for VLAs and alloca). Inlining an error-handling or special-case handling function instead of calling it in the rare case where it's needed will put a large stack allocation (and often save/restore of more call-preserved registers) in the main prologue/epilogue, where it affects the fast path, too. Especially if the fast path didn't make any other function calls.
If you don't inline the handler, that stack space will never be used if there aren't errors (or the special case didn't happen). So the fast-path can be faster, with fewer push/pop instructions and not allocating any big buffers before going on to call another function. (Even if the function itself isn't actually recursive, having this happen in multiple functions in a deep call tree could waste a lot of stack.)
I've read that the Linux kernel does manually do this optimization in a few key places where gcc's inlining heuristics make an unwanted decision to inline: break a function up into fast-path with a call to the slow path, and use __attribute__((noinline)) on the bigger slow-path function to make sure it doesn't inline.
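A sketch of that fast-path/slow-path split, assuming GCC/Clang attribute and builtin syntax; the function names and the 4 KiB buffer are invented for illustration and are not taken from the kernel sources.

#include <cstdio>

// Keep the fast path tiny, and force the rarely-taken path that needs a big
// buffer out of line so its stack space is only allocated when it actually runs.
__attribute__((noinline)) static void handle_rare_case(int err)
{
    char report[4096];                          // the big buffer lives only in this frame
    std::snprintf(report, sizeof report, "error %d\n", err);
    std::fputs(report, stderr);
}

static void fast_path(int err)
{
    if (__builtin_expect(err != 0, 0))          // rarely-taken branch
        handle_rare_case(err);                  // call (not inline) the heavy handler
    // common case: no large stack allocation ever happens in this frame
}

int main()
{
    fast_path(0);   // cheap, common case
    fast_path(7);   // the rare case pays for the big frame only when it actually runs
}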
In some cases not doing a separate allocation inside a conditional block is a missed optimization, but more stack-pointer manipulation makes stack unwinding metadata to support exceptions (and backtraces) more bloated (especially saving/restoring of call-preserved registers that stack unwinding for exceptions has to restore).
If you were doing a save and/or allocate inside a conditional block before running some common code that's reached either way (with another branch to decide which registers to restore in the epilogue), then there'd be no way for the exception-handler machinery to know whether to load just R12, or R13 as well (for example) from where this function saved them, without some kind of insanely complicated metadata format that could signal a register or memory location to be tested for some condition. The .eh_frame section in ELF executables / libraries is bloated enough as it is! (It's non-optional, BTW. The x86-64 System V ABI (for example) requires it even in code that doesn't support exceptions, or in C. In some ways that's good, because it means backtraces usually work, even when an exception has to propagate back up through a function that doesn't itself use exceptions.)
You can definitely adjust the stack pointer inside a conditional block, though. Code compiled for 32-bit x86 (with crappy stack-args calling conventions) can and does use push even inside conditional branches. So as long as you clean up the stack before leaving the block that allocated space, it's doable. That's not saving/restoring registers, just moving the stack pointer. (In functions built without a frame pointer, the unwind metadata has to record all such changes, because the stack pointer is the only reference for finding saved registers and the return address.)
I'm not sure exactly what the details are on why compilers can't / don't want to be smarter about allocating large extra stack space only inside a block that uses it. Probably a good part of the problem is that their internals just aren't set up to be able to even look for this kind of optimization.
Related: Raymond Chen posted a blog about the PowerPC calling convention, and how there are specific requirements on function prologues / epilogues that make stack unwinding work. (And the rules imply / require the existence of a red zone below the stack pointer that's safe from async clobber. A few other calling conventions use red zones, like x86-64 System V, but Windows x64 doesn't. Raymond posted another blog about red zones)

What is the purpose of stack, if you can do the same operations with an array?

How frequently is a "stack" used in programming? In other words, would we lose something if we replaced a stack with an array? Or is there any special case where a stack can't be replaced by anything else?
I'm just a C++ beginner, and all I know about stacks is that they are used to store data, so the subject doesn't seem clear to me.
Any information is appreciated.
A "stack" is a generic name of a data structure that supports first-in last-out. An array is one possible implementation of a stack. A linked list could also implement a stack.
Consider a stack which grows dynamically as more elements are added to it, with no preset limit. A simple C-style array couldn't support this, as it has a size limited at compile-time. A std::vector, which works like an array in some ways but is more sophisticated, would allow this dynamic growth. (A linked-list would too, but would usually be less efficient).
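A minimal sketch of such a dynamically growing stack built on top of std::vector; the class name and interface are chosen for illustration, not taken from any particular library.

#include <iostream>
#include <vector>

// Minimal growable stack built on std::vector: the array-like storage is an
// implementation detail; the interface only allows last-in, first-out access.
template <typename T>
class Stack {
public:
    void push(const T& value) { data_.push_back(value); }
    void pop()                { data_.pop_back(); }
    T&   top()                { return data_.back(); }
    bool empty() const        { return data_.empty(); }
private:
    std::vector<T> data_;     // grows dynamically, no preset limit
};

int main()
{
    Stack<int> s;
    s.push(1);
    s.push(2);
    s.push(3);
    while (!s.empty()) {
        std::cout << s.top() << ' ';   // prints 3 2 1: last in, first out
        s.pop();
    }
    std::cout << '\n';
}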
"First-in, last-out" is not always as descriptive for a new programmer as it might be for someone who has more programming experience. The concept of a stack - not, here, specifically referring to the C++ (or other-language) object - is a concept that permeates programming, at-large (a concept which is well described throughout SO and one I am sure that I would get "incorrect" were I to describe it in great detail, here).
There is indeed a link between the design and behavior of the stack data type you discuss and the programming 'thing' that is referred to as a call stack. In computer hardware, there is also a notion of a stack, briefly discussed below.
To do the Call Stack Some Injustice
A call stack, specifically, is what keeps track of the calls in a program. The more calls that are made, the more calls are added on "top" of the stack. Once a call is finished, its instance is removed from the stack, and control of the program is returned to the calling function.
So, I have three "functions" : A, B, C.
A calls B. B then calls C.
The order of the stack then is:
C
B
A
C takes priority for its resolution over the others (or, "it has control"). The others will not finish their operations until C has been resolved. Once it's resolved, it's "popped" off the stack, and B then does what it needs to do. Here, we find the notion of FILO (First In, Last Out): A is the first call; however, it is not resolved - in this case - until the others are complete.
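In code, that A calls B calls C chain looks like the sketch below; C's frame is the last one pushed and the first one popped, which is exactly the FILO behavior described above.

#include <iostream>

// A calls B, B calls C: frames pile up as A, B, C (C on top),
// and are popped in the reverse order: C, B, A.
void C() { std::cout << "C runs and returns first\n"; }
void B() { C(); std::cout << "B resumes after C\n"; }
void A() { B(); std::cout << "A resumes last\n"; }

int main()
{
    A();
}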
Wikipedia has a great description of the call stack, and a great diagram of it as well (the diagram isn't much help without reading the article, so see the discussion below):
For our very high-level purposes, that diagram shows us one important thing about the concept of a stack at large: it is envisaged as a series of elements (in this case procedure locations, the yellow bars) stacked one on top of the other. Accessing the call stack works as described above: what's immediately needed is added on top, and removed once it is no longer needed.
A stack of cards or a stack of plates is often used as an analogy for the description of a stack's behavior.
This answer to a Stack Overflow question gives a great explanation of the stack.
Equal Injustice to the Hardware Stack
You would be better off reading about the hardware stack elsewhere; however, from my understanding, it performs similarly to the call stack above, but with data rather than procedures. Unfortunately, I do not yet know enough to distinguish between the hardware stack and the call stack. But regardless, the notion of FILO still applies.
Why Is Stack Important?
The purpose of the stack object abstraction is to better mimic the operations of a stack. In fact, if my understanding is correct, the stack was proposed as the way to control procedural calls and is often what is implemented at the "architecture level" for memory management.
The stack data type abstraction forces the programmer to interact with the abstraction in a way that mimics, or is the root of, the call stack and the hardware stack. We could certainly do this with an array, but a stack gives us the ability to do so easily.
This answer assumes that you are interested in the differences between the C++ STL stack<> and vector<> template classes.
A stack provides a sub-set of the functionality of a vector. This means a stack implementation has more freedom than a vector implementation.
A stack guarantees:
Constant time push
Constant time pop
But a vector guarantees in addition to the above:
Constant time random access with an index
To provide constant-time random access with indexes, all elements must be stored contiguously, which is not required for a stack. A vector must be reallocated and copied when its size exceeds its capacity, but a stack does not need that. Hence a stack could be implemented with a linked list, but a vector can't, because of the additional requirements.
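That extra freedom is visible in std::stack, a container adaptor that exposes only push/pop/top and can sit on top of a deque (the default), a vector, or even a list; a short sketch:

#include <deque>
#include <iostream>
#include <list>
#include <stack>
#include <vector>

int main()
{
    std::stack<int> a;                          // default: adapts std::deque<int>
    std::stack<int, std::vector<int>> b;        // contiguous storage underneath
    std::stack<int, std::list<int>> c;          // linked list underneath: no reallocation,
                                                // but also no random access to give up
    for (int i = 1; i <= 3; ++i) { a.push(i); b.push(i); c.push(i); }
    std::cout << a.top() << ' ' << b.top() << ' ' << c.top() << '\n';   // prints "3 3 3"
}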
In other words, would we loose something if we replace a stack with an
array?
Yes, we would lose some freedom in the implementation.

what's on the stack when a function is called?

I can only imagine
1) parameters;
2) local variables;
what else?
1) function return address?
2) function name?
It really depends on platform and architecture, but typically:
Function return address
Saved values of caller's CPU registers - most importantly, caller's stack frame pointer value
Variables allocated with alloca().
Sometimes - extra stuff for exception handling, this is VERY platform-dependent.
Sometimes - guard values to detect stack clobbering
Function name is never in the stack, to the best of my knowledge, unless your code places it there.
It depends on the calling convention; for Unix, you typically look up this information in the SYSV ABI (Application Binary Interface).
You may find:
Return address (if the machine is a popular Intel architecture). On more modern architectures, the return address is passed in a register.
Callee-saves registers—these are registers that "belong" to the caller which the callee has chosen to borrow and must therefore save and restore.
Any incoming parameters that could not be passed in registers. In IA-32, no parameters are passed in registers; they all go on the stack. In x86-64, up to six integer and eight floating-point parameters can be passed in registers, so it is seldom necessary to use the stack for that purpose.
You may or may not find a saved stack pointer or frame pointer. Most modern calling conventions go without a frame pointer in order to save an extra register. In this design, the size of each frame is known at compile time, so restoring the old stack pointer is just a matter of adding a constant. But it makes it harder to implement alloca().
The older Intel calling conventions use both stack pointer and frame pointer, which burns an extra register, but it simplifies alloca() and also stack unwinding.
Local variables with storage class auto are allocated on the stack.
The stack may contain compiler temporaries that hold values which are "spilled" if the hardware does not provide enough registers to hold all the intermediate results of computations. (This happens if at any point the number of live intermediate results—the ones that will be needed later in the program—exceeds the number of registers available to the compiler for storing intermediate results.)
You may find variables allocated with alloca().
You may find metadata that says which PC ranges are in scope for which exception handlers, or other very platform-dependent exception stuff.
C and C++ do not support garbage collection, but in a language that does, you will often find metadata that identifies where in the stack frame you will find pointers.
Finally, the stack may contain "padding" used to ensure that the stack pointer is aligned on an 8-byte or 16-byte boundary.
Calling conventions are complex beasts, and stack-frame layout is not for the faint of heart!

Dot operator cost c/c++ [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 12 years ago.
We all know about the -> vs . speed difference for accessing members in C/C++, but I have a hard time finding any clues about the actual cost of the simple dot operator.
I imagine it's something like address-of-struct + offset, and I also assume the offset is the sum of the sizeofs of all preceding members. Is this (roughly) correct?
Then, compared to ->, how much faster is it? Twice as fast?
(Having seen some asm here on SO about . access being one instruction, I guess there is some magic about it.)
Also, how much slower is it, compared to a local variable?
Thank You
EDIT:
I failed to ask it correctly, I guess.
To try to clear things up:
By "-> vs ." I meant "using pointer to access the struct" vs "direct member access" - (link).
And then I was just curious: "Ok, and what about the dot access itself?It sourly cost something." So I asked the question.
"Dot operator cost c/c++" itself might be absurd/nonsense/naive question, still it does get the answers I was looking for. Can't say it better now.
Thanks
We all know about the -> vs . speed difference for accessing members in C/C++, but I have a hard time finding any clues about the actual cost of the simple dot operator.
The "we all" apparently doesn't include me. I'm not aware of any significant difference (especially in C++) between -> vs. ..
I imagine it's something like address-of-struct + offset, and I also assume the offset is the sum of the sizeofs of all preceding members. Is this (roughly) correct?
Yes.
Then, compared to ->, how much faster is it? Twice as fast? (Having seen some asm here on SO about . access being one instruction, I guess there is some magic about it.)
Both -> and . involve calculating an effective address, and that is the most expensive operation (besides the actual memory access). If the pointer (on the left of ->) is used very often (e.g. this), then it is highly likely to already be cached by the compiler in a CPU register, effectively negating any possible difference between the two operators.
Well, this is a pointer; everything belonging to the object inside a method is effectively prefixed with this->, yet C++ programs haven't slowed to a crawl.
Obviously, if . is applied to a reference, then it is 100% equivalent to ->.
Also, how much slower is it, compared to a local variable?
Hard to evaluate. Essentially, the difference in meta-assembler would be: two asm ops to access a local variable (add the offset of the variable on the stack to the stack pointer; access the value) vs. three asm ops to access an attribute of an object via a pointer (load the pointer to the object; add the offset; access the value). But due to compiler optimizations the difference is rarely noticeable.
Often it is the difference between local and global variables that stands out: the address of a local variable/object attribute has to be computed, while every global variable has a unique global address calculated at link time.
But the overhead of field/attribute access is really negligible, minuscule compared to e.g. the overhead of a system call.
Any decent compiler will calculate the offset of a struct field at compile time, so the cost of . should be zero.
In other words, access to a struct field using . is as cheap as access to a plain variable.
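For example (a hedged sketch; the offsets shown in the comments are typical, not guaranteed), both the . and the -> access below compile to a single load at a compile-time-constant offset from the struct's base address:

#include <iostream>

struct Point {
    int x;   // at offset 0 within the struct (typically)
    int y;   // at offset 4, right after x, on common ABIs
};

// Both accesses compile to a single load at a compile-time-constant offset;
// with -> the base address simply arrives as a pointer in a register.
int get_y_direct(const Point& p)  { return p.y; }
int get_y_pointer(const Point* p) { return p->y; }

int main()
{
    Point p{1, 2};
    std::cout << get_y_direct(p) << ' ' << get_y_pointer(&p) << '\n';   // prints "2 2"
}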
It depends on a huge number of things.
The . operator can vary from being cheaper than accessing a local variable or using -> to being much more expensive than either.
This isn't a sensible question to ask.
I would say the difference in cost isn't in the operators themselves, but in the cost of accessing the left-side object.
ptr->variable
should produce similar asm output to
(*ptr).variable // yeah I am using a '.' because it's faster...
therefore your question is kind of nonsensical as it is.
I think I understand what you meant though, so I'll try to answer that instead.
The operators themselves are nearly costless. They only involve computing an address. This address is expressed as the address of the object plus an offset (likely fixed at compile time and therefore a constant in the program for each field). The real cost comes from actually fetching the bytes at this address.
Now, I don't understand why it would be more costly to use -> than ., since they effectively do the same thing. There can be a difference in access, though, in this case:
struct A { int x; };

void function(A& external)
{
    A local{};   // value-initialized so reading local.x below is well-defined
    external.x = local.x;
}
In this case, accessing external.x is likely to be more costly because it necessitates accessing memory outside of the scope of the function and therefore the compiler cannot know in advance whether or not the memory will have already been fetched and put in the processor cache (or a register) etc...
On the other hand local.x being local and stored on the stack (or a register), the compiler may optimize away the fetch part of the code and access local.x directly.
But as you can see, there is no difference between using -> and .; the difference is in using a local variable (stored on the stack) vs. using a pointer/reference to an externally supplied object, about which the compiler cannot make assumptions.
Finally, it's important to note that if the function was inlined, then the compiler could optimize its use at the caller's site and effectively use a slightly different implementation there, but don't try to inline everything anyway: first of all the compiler will probably ignore your hints, and if you really force it you may actually lose performance.