Why does the stack grow backwards? [duplicate] - c++

I know that in the architectures I'm personally familiar with (x86, 6502, etc), the stack typically grows downwards (i.e. every item pushed onto the stack results in a decremented SP, not an incremented one).
I'm wondering about the historical rationale for this. I know that in a unified address space, it's convenient to start the stack on the opposite end of the data segment (say) so there's only a problem if the two sides collide in the middle. But why does the stack traditionally get the top part? Especially given how this is the opposite of the "conceptual" model?
(And note that in the 6502 architecture, the stack also grows downwards, even though it is bounded to a single 256-byte page, and this direction choice seems arbitrary.)

As to the historic rationale, I can't say for certain (because I didn't design them). My thoughts on the matter are that early CPUs got their original program counter set to 0 and it was a natural desire to start the stack at the other end and grow downwards, since their code naturally grows upward.
As an aside, note that this setting of the program counter to 0 on reset is not the case for all early CPUs. For example, the Motorola 6809 would fetch the program counter from addresses 0xfffe/f so you could start running at an arbitrary location, depending on what was supplied at that address (usually, but by no means limited to, ROM).
One of the first things some historical systems would do was to scan memory from the top until it found a location that read back the same value written, so that it knew how much RAM was actually installed (e.g., a Z80 with a 64K address space didn't necessarily have 64K of RAM; in fact 64K would have been massive in my early days). Once it found the top actual address, it would set the stack pointer appropriately and could then start calling subroutines. This scanning was generally done by the CPU running code in ROM as part of start-up.
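A minimal sketch of that probe-and-read-back idea, simulated in C++ over an ordinary array standing in for the address space (the installed-RAM figure, the bus helpers and the test patterns are all illustrative assumptions; real firmware would run this from ROM against physical addresses):

#include <cstdint>
#include <cstdio>

// Simulated 64K address space; pretend only the first 32K is backed by RAM.
static uint8_t address_space[65536];
static const uint32_t installed_ram = 32768;  // hidden from the "firmware"

// Fake bus: writes above the installed RAM are silently dropped.
static void bus_write(uint32_t addr, uint8_t v) { if (addr < installed_ram) address_space[addr] = v; }
static uint8_t bus_read(uint32_t addr) { return (addr < installed_ram) ? address_space[addr] : 0xFF; }

// Scan down from the top of the address space until a location reads back
// what was written; that address is taken as the top of real RAM.
static uint32_t find_top_of_ram() {
    for (uint32_t addr = 65536; addr-- > 0; ) {
        bus_write(addr, 0xAA);
        if (bus_read(addr) == 0xAA) {
            bus_write(addr, 0x55);            // double-check with a second pattern
            if (bus_read(addr) == 0x55) return addr;
        }
    }
    return 0;  // no RAM found
}

int main() {
    // A real boot ROM would now set the stack pointer to this address.
    printf("top of RAM: 0x%04X\n", (unsigned)find_top_of_ram());
}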
With regard to the stack's growth, not all of them grow downwards, see this answer for details.

One good explanation I heard was that some machines in the past could only have unsigned offsets, so you'd want the stack to grow downward so you could hit your locals without having to lose an extra instruction to fake a negative offset.

Stanley Mazor (4004 and 8080 architect) explains how stack growth direction was chosen for 8080 (and eventually for 8086) in "Intel Microprocessors: 8008 to 8086":
The stack pointer was chosen to run "downhill" (with the stack advancing toward lower memory) to simplify indexing into the stack from the user's program (positive indexing) and to simplify displaying the contents of the stack from a front panel.

One possible reason might be that it simplifies alignment. If you place a local variable on the stack which must be placed on a 4-byte boundary, you can simply subtract the size of the object from the stack pointer, and then zero out the two lower bits to get a properly aligned address. If the stack grows upwards, ensuring alignment becomes a bit trickier.
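A small sketch of the difference, with the stack pointer simulated as a plain integer and 4-byte alignment assumed:

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    uintptr_t sp_down = 0x1000;   // downward-growing stack pointer (simulated)
    uintptr_t sp_up   = 0x2000;   // upward-growing stack pointer (simulated)
    size_t    size    = 7;        // object needing a 4-byte-aligned slot

    // Downward: subtract the size, then clear the low bits -- done.
    uintptr_t obj_down = (sp_down - size) & ~uintptr_t(3);

    // Upward: the object starts at the old pointer, which must first be
    // rounded up, and only then is the size added -- an extra step.
    uintptr_t obj_up    = (sp_up + 3) & ~uintptr_t(3);
    uintptr_t new_sp_up = obj_up + size;

    printf("down: object at 0x%lx\n", (unsigned long)obj_down);
    printf("up:   object at 0x%lx, new sp 0x%lx\n",
           (unsigned long)obj_up, (unsigned long)new_sp_up);
}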

IIRC the stack grows downwards because the heap grows upwards. It could have been the other way around.

I believe it's purely a design decision. Not all of them grow downward -- see this SO thread for some good discussion on the direction of stack growth on different architectures.

I'm not certain, but I did some programming for the VAX/VMS back in the day. I seem to remember one part of memory (the heap??) going up and the stack going down. When the two met, you were out of memory.

I believe the convention began with the IBM 704 and its infamous "decrement register". Modern speech would call it an offset field of the instruction, but the point is they went down, not up.

Just 2c more:
Beyond all the historic rationale mentioned, I'm quite certain there's no reason which is valid in modern processors. All processors can take signed offsets, and maximizing the heap/stack distance is rather moot ever since we started dealing with multiple threads.
I personally consider this a security design flaw. If, say, the designers of the x64 architecture had reversed the stack growth direction, most stack buffer overflows would have been eliminated (since strings grow upward) - which is kind of a big deal.

Because then a POP uses the same addressing mode that is commonly used to scan through strings and arrays
An instruction that pops a value off of a stack needs to do two things: read the value out of memory, and adjust the stack pointer. There are four possible design choices for this operation:
Preincrement the stack pointer first, then read the value. This implies that the stack will grow "downwards" (towards lower memory addresses).
Predecrement the stack pointer first, then read the value. This implies that the stack will grow "upwards" (towards higher memory addresses).
Read the value first, then postincrement the stack pointer. This implies that the stack will grow downwards.
Read the value first, then postdecrement the stack pointer. This implies that the stack will grow upwards.
In many computer languages (particularly C), strings and arrays are passed to functions as pointers to their first element. A very common operation is to read the elements of the string or array in order, starting with the first element. Such an operation needs only the postincrement addressing mode described above.
Furthermore, reading the elements of a string or array is more common than writing the elements. Indeed, there are many standard library functions that perform no writing at all (e.g. strlen(), strchr(), strcmp())!
Therefore, if you have a limited number of addressing modes in your instruction set design, the most useful addressing mode would be a read that postincrements. This results in not only the most useful string and array operations, but also a POP instruction that grows the stack downward.
The second-most-useful addressing mode would then be a post-decrement write, which can be used for the matching PUSH instruction.
Indeed, the PDP-11 had postincrement and predecrement addressing modes, which produced a downward-growing stack. Even the VAX did not have preincrement or postdecrement.
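A sketch of the parallel being drawn, using plain C++ over arrays just to show the access pattern (the software stack here is an assumption for illustration, not how any particular ISA implements POP):

#include <cstdio>

int main() {
    // String scan: read *p, then postincrement -- the common case the ISA wants cheap.
    const char *s = "hello";
    int len = 0;
    for (const char *p = s; *p; p++) len++;

    // Software stack in an array, growing downward (toward index 0).
    int mem[8];
    int *sp = mem + 8;            // empty stack: sp points one past the last slot
    *--sp = 1; *--sp = 2;         // PUSH: predecrement write
    int a = *sp++;                // POP: read, then postincrement -- same mode as the scan
    int b = *sp++;

    printf("len=%d popped %d %d\n", len, a, b);
}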

One advantage of descending stack growth in a minimal embedded system is that a single chunk of RAM can be redundantly mapped into both page 0 and page 1, allowing zero-page variables to be assigned starting at 0x000 and the stack to grow downwards from 0x1FF, maximizing the amount it would have to grow before overwriting variables.
One of the original design goals of the 6502 was that it could be combined with, for example, a 6530, resulting in a two-chip microcontroller system with 1 KB of program ROM, timer, I/O, and 64 bytes of RAM shared between stack and page zero variables. By comparison, the minimal embedded system of that time based on an 8080 or 6800 would be four or five chips.

Related

Why do the addresses of stack variables grow backwards in C++ recursion? [duplicate]

I am preparing some training materials in C and I want my examples to fit the typical stack model.
What direction does a C stack grow in Linux, Windows, Mac OSX (PPC and x86), Solaris, and most recent Unixes?
Stack growth doesn't usually depend on the operating system itself, but on the processor it's running on. Solaris, for example, runs on x86 and SPARC. Mac OSX (as you mentioned) runs on PPC and x86. Linux runs on everything from my big honkin' System z at work to a puny little wristwatch.
If the CPU provides any kind of choice, the ABI / calling convention used by the OS specifies which choice you need to make if you want your code to call everyone else's code.
The processors and their direction are:
x86: down.
SPARC: selectable. The standard ABI uses down.
PPC: down, I think.
System z: in a linked list, I kid you not (but still down, at least for zLinux).
ARM: selectable, but Thumb2 has compact encodings only for down (LDMIA = increment after, STMDB = decrement before).
6502: down (but only 256 bytes).
RCA 1802A: any way you want, subject to SCRT implementation.
PDP11: down.
8051: up.
Showing my age on those last few, the 1802 was the chip used to control the early shuttles (sensing if the doors were open, I suspect, based on the processing power it had :-) and my second computer, the COMX-35 (following my ZX80).
PDP11 details gleaned from here, 8051 details from here.
The SPARC architecture uses a sliding window register model. The architecturally visible details also include a circular buffer of register-windows that are valid and cached internally, with traps when that over/underflows. See here for details. As the SPARCv8 manual explains, SAVE and RESTORE instructions are like ADD instructions plus register-window rotation. Using a positive constant instead of the usual negative would give an upward-growing stack.
The afore-mentioned SCRT technique is another - the 1802 used some of its sixteen 16-bit registers for SCRT (standard call and return technique). One was the program counter; you could use any register as the PC with the SEP Rn instruction. One was the stack pointer and two were always set to point to the SCRT code address, one for call, one for return. No register was treated in a special way by the hardware. Keep in mind these details are from memory, they may not be totally correct.
For example, if R3 was the PC, R4 was the SCRT call address, R5 was the SCRT return address and R2 was the "stack" (quotes as it's implemented in software), SEP R4 would set R4 to be the PC and start running the SCRT call code.
It would then store R3 on the R2 "stack" (I think R6 was used for temp storage), adjusting it up or down, grab the two bytes following R3, load them into R3, then do SEP R3 and be running at the new address.
To return, it would SEP R5 which would pull the old address off the R2 stack, add two to it (to skip the address bytes of the call), load it into R3 and SEP R3 to start running the previous code.
Very hard to wrap your head around initially after all the 6502/6809/z80 stack-based code but still elegant in a bang-your-head-against-the-wall sort of way. Also one of the big selling features of the chip was a full suite of 16 16-bit registers, despite the fact you immediately lost 7 of those (5 for SCRT, two for DMA and interrupts from memory). Ahh, the triumph of marketing over reality :-)
System z is actually quite similar, using its R14 and R15 registers for call/return.
In C++ (adaptable to C) stack.cc:
// Returns +1 if the stack appears to grow upwards, -1 if downwards.
// (Comparing addresses from different stack frames is not strictly portable,
// and an inlining compiler can defeat the trick, but it works in practice.)
static int
find_stack_direction ()
{
  static char *addr = 0;      // address of a local in the first call
  char dummy;
  if (addr == 0)
  {
    addr = &dummy;
    return find_stack_direction ();    // recurse once to get a second frame
  }
  else
  {
    return (&dummy > addr) ? 1 : -1;   // compare the two frames' locals
  }
}
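A minimal usage sketch, assuming the function above is in the same file (on most mainstream desktop ABIs this prints "down"):

#include <cstdio>

int main() {
    printf("stack grows %s\n", find_stack_direction() == 1 ? "up" : "down");
}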
The advantage of growing down is that, in older systems, the stack was typically at the top of memory. Programs typically filled memory starting from the bottom, so this sort of memory management minimized the need to measure and place the bottom of the stack somewhere sensible.
Just a small addition to the other answers, which as far as I can see have not touched this point:
Having the stack grow downwards makes all addresses within the stack have a positive offset relative to the stack pointer. There's no need for negative offsets, as they would only point to unused stack space. This simplifies accessing stack locations when the processor supports stackpointer-relative addressing.
Many processors have instructions that allow accesses with a positive-only offset relative to some register. Those include many modern architectures, as well as some old ones. For example, the ARM Thumb ABI provides for stackpointer-relative accesses with a positive offset encoded within a single 16-bit instruction word.
If the stack grew upwards, all useful offsets relative to the stackpointer would be negative, which is less intuitive and less convenient. It also is at odds with other applications of register-relative addressing, for example for accessing fields of a struct.
Stack grows down on x86 (defined by the architecture, pop increments stack pointer, push decrements.)
In MIPS and many modern RISC architectures (like PowerPC, RISC-V, SPARC...) there are no push and pop instructions. Those operations are done explicitly by manually adjusting the stack pointer and then loading/storing the value relative to the adjusted pointer. All registers (except the zero register) are general purpose, so in theory any register can be a stack pointer, and the stack can grow in any direction the programmer wants.
That said, the stack typically grows down on most architectures, probably to avoid the case where the stack and program data or heap data grow towards each other and clash. There are also the great addressing reasons mentioned in sh-'s answer. Some examples: the MIPS ABIs grow the stack downwards and use $29 (a.k.a. $sp) as the stack pointer; the RISC-V ABI also grows it downwards and uses x2 as the stack pointer.
In the Intel 8051 the stack grows up, probably because the memory space is so tiny (128 bytes in the original version) that there is no heap, and you don't need to put the stack on top so that it is separated from a heap growing from the bottom.
You can find more information about the stack usage in various architectures in https://en.wikipedia.org/wiki/Calling_convention
See also
Why does the stack grow downward?
What are the advantages to having the stack grow downward?
Why do stacks typically grow downwards?
Does stack grow upward or downward?
On most systems, the stack grows down, and my article at https://gist.github.com/cpq/8598782 explains WHY it grows down. It is simple: how do you lay out two growing memory blocks (heap and stack) in a fixed chunk of memory? The best solution is to put them at opposite ends and let them grow towards each other.
It grows down because the memory allocated to the program has the "permanent data" i.e. code for the program itself at the bottom, then the heap in the middle. You need another fixed point from which to reference the stack, so that leaves you the top. This means the stack grows down, until it is potentially adjacent to objects on the heap.
This macro should detect it at runtime without UB:
#include <stdint.h>   /* uintptr_t */

/* The macro passes the address of a compound literal living in the caller's
   frame; the function compares it with the address of its own parameter,
   which lives in the callee's frame. noinline keeps the two frames distinct. */
#define stk_grows_up_eh() stk_grows_up__(&(char){0})

_Bool stk_grows_up__(char *ParentsLocal);

__attribute__((__noinline__))
_Bool stk_grows_up__(char *ParentsLocal) {
    return (uintptr_t)ParentsLocal < (uintptr_t)&ParentsLocal;
}

Is there an order you should declare variables in C++? [duplicate]

I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to
"re-order the member variables of a
class into most used and least used."
I'm not familiar with C++, nor with how it compiles, but I was wondering if
This statement is accurate?
How/Why?
Does it apply to other (compiled/scripting) languages?
I'm aware that the amount of (CPU) time saved by this trick would be minimal, it's not a deal-breaker. But on the other hand, in most functions it would be fairly easy to identify which variables are going to be the most commonly used, and just start coding this way by default.
Two issues here:
Whether and when keeping certain fields together is an optimization.
How to actually do it.
The reason that it might help, is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.
The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.
The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.
Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++0x generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised because they all lock the same memory location even though they're operating on different fields. Had they been operating on different cache lines then they would have worked in parallel, and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on grounds that cache misses usually are a source of lost speed, but be horribly wrong in your particular case.
Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order in which fields must appear in an object, based on the order they are declared. This is to make low-level programming easier. So, if your object contains:
an int (4 bytes, 4-aligned)
followed by a char (1 byte, any alignment)
followed by an int (4 bytes, 4-aligned)
followed by a char (1 byte, any alignment)
then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.
In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
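A quick way to see this is to print both layouts; the exact numbers are platform-dependent, but 16 and 12 are what you would typically get with a 4-byte, 4-aligned int (the struct names are made up for the example):

#include <cstdio>

struct Interleaved { int a; char b; int c; char d; };   // int, char, int, char
struct Grouped     { int a; int c; char b; char d; };   // int, int, char, char

int main() {
    printf("interleaved: %zu bytes\n", sizeof(Interleaved));  // typically 16
    printf("grouped:     %zu bytes\n", sizeof(Grouped));      // typically 12
}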
I said that whether int is 4-aligned is platform-dependent: on ARM it absolutely has to be, since unaligned access throws a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.
The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next largest, and so on down to members with no alignment requirement. For example if I'm trying to write portable code I might come up with this:
struct some_stuff {
    double d;    // I expect double is 64bit IEEE, it might not be
    uint64_t l;  // 8 bytes, could be 8-aligned or 4-aligned, I don't know
    uint32_t i;  // 4 bytes, usually 4-aligned
    int32_t j;   // same
    short s;     // usually 2 bytes, could be 2-aligned or unaligned, I don't know
    char c[4];   // array of 4 chars, 4 bytes big but "never" needs 4-alignment
    char e;      // 1 byte, any alignment
};
If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.
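If you'd rather check than guess, C++11's alignof lets you verify those guesses on the platform at hand; a small sketch (the values it prints will vary by target):

#include <cstdio>
#include <cstdint>

int main() {
    printf("double:   size %zu, align %zu\n", sizeof(double),   alignof(double));
    printf("uint64_t: size %zu, align %zu\n", sizeof(uint64_t), alignof(uint64_t));
    printf("uint32_t: size %zu, align %zu\n", sizeof(uint32_t), alignof(uint32_t));
    printf("short:    size %zu, align %zu\n", sizeof(short),    alignof(short));
    printf("char:     size %zu, align %zu\n", sizeof(char),     alignof(char));
}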
Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.
Depending on the type of program you're running this advice may result in increased performance or it may slow things down drastically.
Doing this in a multi-threaded program means you're going to increase the chances of 'false-sharing'.
Check out Herb Sutter's articles on the subject here
I've said it before and I'll keep saying it. The only real way to get a real performance increase is to measure your code, and use tools to identify the real bottleneck instead of arbitrarily changing stuff in your code base.
It is one of the ways of optimizing the working set size. There is a good article by John Robbins on how you can speed up the application performance by optimizing the working set size. Of course it involves careful selection of most frequent use cases the end user is likely to perform with the application.
We have slightly different guidelines for members here (ARM architecture target, mostly THUMB 16-bit codegen for various reasons):
group by alignment requirements (or, for newbies, "group by size" usually does the trick)
smallest first
"group by alignment" is somewhat obvious, and outside the scope of this question; it avoids padding, uses less memory, etc.
The second bullet, though, derives from the small 5-bit "immediate" field size on the THUMB LDRB (Load Register Byte), LDRH (Load Register Halfword), and LDR (Load Register) instructions.
5 bits means offsets of 0-31 can be encoded. Effectively, assuming "this" is handy in a register (which it usually is):
8-bit bytes can be loaded in one instruction if they exist at this+0 through this+31
16-bit halfwords if they exist at this+0 through this+62;
32-bit machine words if they exist at this+0 through this+124.
If they're outside this range, multiple instructions have to be generated: either a sequence of ADDs with immediates to accumulate the appropriate address in a register, or worse yet, a load from the literal pool at the end of the function.
If we do hit the literal pool, it hurts: the literal pool goes through the d-cache, not the i-cache; this means at least a cacheline worth of loads from main memory for the first literal pool access, and then a host of potential eviction and invalidation issues between the d-cache and i-cache if the literal pool doesn't start on its own cache line (i.e. if the actual code doesn't end at the end of a cache line).
(If I had a few wishes for the compiler we're working with, a way to force literal pools to start on cacheline boundaries would be one of them.)
(Unrelatedly, one of the things we do to avoid literal pool usage is keep all of our "globals" in a single table. This means one literal pool lookup for the "GlobalTable", rather than multiple lookups for each global. If you're really clever you might be able to keep your GlobalTable in some sort of memory that can be accessed without loading a literal pool entry -- was it .sbss?)
While locality of reference to improve the cache behavior of data accesses is often a relevant consideration, there are a couple other reasons for controlling layout when optimization is required - particularly in embedded systems, even though the CPUs used on many embedded systems do not even have a cache.
- Memory alignment of the fields in structures
Alignment considerations are pretty well understood by many programmers, so I won't go into too much detail here.
On most CPU architectures, fields in a structure must be accessed at a native alignment for efficiency. This means that if you mix various sized fields the compiler has to add padding between the fields to keep the alignment requirements correct. So to optimize the memory used by a structure it's important to keep this in mind and lay out the fields such that the largest fields are followed by smaller fields to keep the required padding to a minimum. If a structure is to be 'packed' to prevent padding, accessing unaligned fields comes at a high runtime cost as the compiler has to access unaligned fields using a series of accesses to smaller parts of the field along with shifts and masks to assemble the field value in a register.
- Offset of frequently used fields in a structure
Another consideration that can be important on many embedded systems is to have frequently accessed fields at the start of a structure.
Some architectures have a limited number of bits available in an instruction to encode an offset to a pointer access, so if you access a field whose offset exceeds that number of bits the compiler will have to use multiple instructions to form a pointer to the field. For example, the ARM's Thumb architecture has 5 bits to encode an offset, so it can access a word-sized field in a single instruction only if the field is within 124 bytes from the start. So if you have a large structure an optimization that an embedded engineer might want to keep in mind is to place frequently used fields at the beginning of a structure's layout.
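A sketch of what that looks like in practice; offsetof confirms whether the hot fields land inside the single-instruction offset range (the struct and its field names are invented for illustration, and the 124-byte figure is the Thumb word-load limit mentioned above):

#include <cstddef>
#include <cstdint>
#include <cstdio>

struct Connection {
    // Hot fields first: touched on every packet, so keep them within reach
    // of a single load/store from the base pointer.
    uint32_t state;
    uint32_t bytes_pending;
    uint32_t last_seen_ms;
    // Cold bulk data afterwards; its offsets exceed small immediate ranges,
    // but it is rarely accessed.
    uint8_t  reorder_buffer[512];
    char     peer_name[64];
};

int main() {
    printf("state at offset %zu\n", offsetof(Connection, state));              // 0
    printf("reorder_buffer at offset %zu\n", offsetof(Connection, reorder_buffer));
    // On Thumb, word fields at offsets 0..124 can be addressed in one instruction.
}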
Well the first member doesn't need an offset added to the pointer to access it.
In C#, the order of the member is determined by the compiler unless you put the attribute [LayoutKind.Sequential/Explicit] which forces the compiler to lay out the structure/class the way you tell it to.
As far as I can tell, the compiler seems to minimize padding while aligning the data types on their natural alignment (i.e. a 4-byte int starts on a 4-byte address).
I'm focusing on performance, execution speed, not memory usage.
The compiler, without any optimizing switch, will map the variable storage area using the same order of declarations in code.
Imagine
unsigned char a;
unsigned char b;
long c;
Big mess-up? Without alignment switches, low-memory options, et al, we're going to have one unsigned char using a 64-bit word on your DDR3 DIMM, another 64-bit word for the other, and yet the unavoidable one for the long.
So, that's a fetch per variable.
However, packing it, or re-ordering it, means one fetch plus one AND mask is enough to get at the unsigned chars.
So speed-wise, on a current 64-bit word-memory machine, alignments, reorderings, etc. are largely a waste of effort. I do microcontroller stuff, and there the difference between packed and non-packed is really noticeable (talking about <10MIPS processors, 8-bit word memories).
As an aside, it has long been known that the engineering effort required to tweak code for performance, beyond what a good algorithm gives you and what the compiler is able to optimize, often results in burning rubber with no real effect. That, and a write-only piece of syntactically dubious code.
The last step-forward in optimization I saw (in uPs, don't think it's doable for PC apps) is to compile your program as a single module, have the compiler optimize it (much more general view of speed/pointer resolution/memory packing, etc), and have the linker trash non-called library functions, methods, etc.
In theory, it could reduce cache misses if you have big objects. But it's usually better to group members of the same size together so you have tighter memory packing.
I highly doubt that would have any bearing on CPU performance - maybe readability. You can optimize the executable code if the commonly executed basic blocks within a given frame are in the same set of pages. This is the same idea, but it would not know how to create basic blocks within the code. My guess is the compiler puts the functions in the order it sees them with no optimization here, so you could try to place common functionality together.
Try and run a profiler/optimizer. First you compile with some profiling option then run your program. Once the profiled exe is complete it will dump some profiled information. Take this dump and run it through the optimizer as input.
I have been away from this line of work for years, but not much has changed in how they work.

Ring buffer: Disadvantages by moving through memory backwards?

This is probably language agnostic, but I'm asking from a C++ background.
I am hacking together a ring buffer for an embedded system (AVR, 8-bit). Let's assume:
const uint8_t size = /* something > 0 */;
uint8_t buffer[size];
uint8_t write_pointer;
There's this neat trick of &ing the write and read pointers with size-1 to do an efficient, branchless rollover if the buffer's size is a power of two, like so:
// value = buffer[write_pointer];
write_pointer = (write_pointer+1) & (size-1);
If, however, the size is not a power of two, the fallback would probably be a compare of the pointer (i.e. index) to the size and do a conditional reset:
// value = buffer[write_pointer];
if (++write_pointer == size) write_pointer ^= write_pointer;
Since the reset occurs rather rarely, this should be easy for any branch prediction.
This assumes though that the pointers need to be advancing forward in memory. While this is intuitive, it requires a load of size in every iteration. I assume that reversing the order (advancing backwards) would yield better CPU instructions (i.e. jump if not zero) in the regular case, since size is only required during the reset.
// value = buffer[--write_pointer];
if (write_pointer == 0) write_pointer = size;
so
TL;DR: My question is: Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward) or is this a valid optimization?
You have an 8 bit avr with a cache? And branch prediction?
How does forward or backwards matter as far as caches are concerned? The hit or miss on a cache is anywhere within the cache line - beginning, middle, end, random, sequential, it doesn't matter. You can work from the back to the front or the front to the back of a cache line, it is the same cost (assuming all other things held constant): the first miss causes a fill, then that line is in cache and you can access any of the items in any pattern at a lower latency until it is evicted.
On a microcontroller like that you want to make the effort, even at the cost of throwing away some memory, to align a circular buffer such that you can mask. There is no cache, and instruction fetches are painful because they likely come from flash that may be slower than the processor clock rate, so you do want to reduce the number of instructions executed, or make the execution a little more deterministic (the same number of instructions every loop until that task is done). There might be a pipeline that would appreciate the masking rather than an if-then-else.
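A sketch of the masking approach being recommended, with the capacity padded up to a power of two so the wrap is a single AND; the sizes here are arbitrary examples, and full/empty bookkeeping is omitted:

#include <cstdint>
#include <cstdio>

// Capacity padded up to a power of two so wrap-around is a single mask,
// with no compare-and-branch and no load of a separate size variable.
static const uint8_t SIZE = 16;           // must be a power of two
static const uint8_t MASK = SIZE - 1;

static uint8_t buffer[SIZE];
static uint8_t read_idx = 0, write_idx = 0;

static void put(uint8_t v) {
    buffer[write_idx] = v;
    write_idx = (write_idx + 1) & MASK;   // branchless wrap
}

static uint8_t get() {
    uint8_t v = buffer[read_idx];
    read_idx = (read_idx + 1) & MASK;
    return v;
}

int main() {
    for (uint8_t i = 0; i < 20; i++) put(i);   // wraps past the end without a branch
    printf("%u %u\n", (unsigned)get(), (unsigned)get());
}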
TL;DR: My question is: Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward) or is this a valid optimization?
The cache doesn't care: a miss on any item in the line causes a fill, and once it is in the cache any pattern of access - random, sequential forward or back, or just pounding on the same address - takes less time because it is in faster memory, until the line is evicted. Evictions won't come from neighboring cache lines; they will come from cache lines larger powers of two away, so whether the next cache line you pull is at a higher address or a lower one, the cost is the same.
Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward)
Why do you think that you will have a cache miss? You will have a cache miss if you try to access outside the cache (forward or backward).
There are a number of points which need clarification:
That size needs to be loaded each and every time (it's const, therefore ought to be immutable)
That your code is correct. For example, in 0-based indexing (as used in C/C++ for array access) the value 0 is a valid index into the buffer, and the value size is not. Similarly, there is no need to XOR when you could simply assign 0; equally, a modulo operator will work (write_pointer = (write_pointer + 1) % size).
What happens in the general case with virtual memory (i.e. the logically adjacent addresses might be all over the place in the real memory map), paging (stuff may well be cached on a page-by-page basis) and other factors (pressure from external processes, interrupts)
In short: this is the kind of optimisation that leads to more foot-related injuries than genuine performance improvements. Additionally, it is almost certainly the case that you would get much, much better gains using vectorised code (SIMD).
EDIT: And in interpreted languages or JIT'ed languages it might be a tad optimistic to assume you can rely on the use of JNZ and others at all. At which point the question is, how much of a difference is there really between loading size and comparing versus comparing with 0.
As usual, when performing any form of manual code optimization, you must have extensive in-depth knowledge of the specific hardware. If you don't have that, then you should not attempt manual optimizations, end of story.
Thus, your question is filled with various strange assumptions:
First, you assume that write_pointer = (write_pointer+1) & (size-1) is more efficient than something else, such as the XOR example you posted. You are just guessing here; you will have to disassemble the code and see which yields fewer CPU instructions.
Because, when writing code for a tiny, primitive 8-bit MCU, there is not much going on in the core to speed up your code. I don't know AVR8, but it seems likely that you have a small instruction pipe and that's it. It seems quite unlikely that you have much in the way of branch prediction. It seems very unlikely that you have a data and/or instruction cache. Read the friendly CPU core manual.
As for marching backwards through memory, it will unlikely have any impact at all on your program's performance. On old, crappy compilers you would get slightly more efficient code if the loop condition was a comparison vs zero instead of a value. On modern compilers, this shouldn't be an issue. As for cache memory concerns, I doubt you have any cache memory to worry about.
The best way to write efficient code on 8-bit MCUs is to stick to 8-bit arithmetic whenever possible and to avoid 32-bit arithmetic like the plague. And forget you ever heard about something called floating point. This is what will make your program efficient, you are unlikely to find any better way to manually optimize your code.

How to offload memory offset calculation from runtime in C/C++?

I am implementing a simple VM, and currently I am using runtime arithmetic to calculate individual program object addresses as offsets from base pointers.
I asked a couple of questions on the subject today, but I seem to be going slowly nowhere.
I learned a couple of things though, from question one -
Object and struct member access and address offset calculation -
I learned that modern processors have virtual addressing capabilities, allowing memory offsets to be calculated without any additional cycles devoted to arithmetic.
And from question two - Are address offsets resolved during compile time in C/C++? - I learned that there is no guarantee for this happening when doing the offsets manually.
By now it should be clear that what I want to achieve is to take advantage of the virtual memory addressing features of the hardware and offload those from the runtime.
I am using GCC, as for platform - I am developing on x86 in windows, but since it is a VM I'd like to have it efficiently running on all platforms supported by GCC.
So ANY information on the subject is welcome and will be very appreciated.
Thanks in advance!
EDIT: Some overview of my program's code generation - during the design stage the program is built as a tree hierarchy, which is then recursively serialized into one continuous memory block, along with indexing objects and calculating their offsets from the beginning of the program memory block.
EDIT 2: Here is some pseudo code of the VM:
switch (*instruction) {
case 1: call_fn1(*(instruction+1));
        instruction += 1 + sizeof(parameter1); break;
case 2: call_fn2(*(instruction+1), *(instruction+1+sizeof(parameter1)));
        instruction += 1 + sizeof(parameter1) + sizeof(parameter2); break;
case 3: instruction += *(instruction+1); break;
}
Case 1 is a function that takes one parameter, which is found immediately after the instruction, so it is passed as an offset of 1 byte from the instruction. The instruction pointer is incremented by 1 + the size of the first parameter to find the next instruction.
Case 2 is a function that takes two parameters, same as before, first parameter passed as 1 byte offset, second parameter passed as offset of 1 byte plus the size of the first parameter. The instruction pointer is then incremented by the size of the instruction plus sizes of both parameters.
Case 3 is a goto statement, the instruction pointer is incremented by an offset which immediately follows the goto instruction.
EDIT 3: To my understanding, the OS will provide each process with its own dedicated virtual memory addressing space. If so, does this mean the first address is always ... well zero, so the offset from the first byte of the memory block is actually the very address of this element? If memory address is dedicated to every process, and I know the offset of my program memory block AND the offset of every program object from the first byte of the memory block, then are the object addresses resolved during compile time?
Problem is those offsets are not available during the compilation of the C code, they become known during the "compilation" phase and translation to bytecode. Does this mean there is no way to do object memory address calculation for "free"?
How is this done in Java for example, where only the virtual machine is compiled to machine code, does this mean the calculation of object addresses takes a performance penalty because of runtime arithmetics?
Here's an attempt to shed some light on how the linked questions and answers apply to this situation.
The answer to the first question mixes two different things, the first is the addressing modes in X86 instruction and the second is virtual-to-physical address mapping. The first is something that is done by compilers and the second is something that is (typically) set up by the operating system. In your case you should only be worrying about the first.
Instructions in X86 assembly have great flexibility in how they access a memory address. Instructions that read or write memory have the address calculated according to the following formula:
segment + base + index * size + offset
The segment portion of the address is almost always the default DS segment and can usually be ignored. The base portion is given by one of the general purpose registers or the stack pointer. The index part is given by one of the general purpose registers and the size is either 1, 2, 4, or 8. Finally the offset is a constant value embedded in the instruction. Each of these components is optional, but obviously at least one must be given.
This addressing capability is what is generally meant when talking about computing addresses without explicit arithmetic instructions. There is a special instruction that one of the commenters mentioned: LEA which does the address calculation but instead of reading or writing memory, stores the computed address in a register.
For the code you included in the question, it is quite plausible that the compiler would use these addressing modes to avoid explicit arithmetic instructions.
As an example, the current value of the instruction variable could be held in the ESI register. Additionally, each of sizeof(parameter1) and sizeof(parameter2) are compile time constants. In the standard X86 calling conventions function arguments are pushed in reverse order (so the first argument is at the top of the stack) so the assembly codes might look something like
case1:
    PUSH [ESI+1]
    CALL fn1
    ADD ESP,4        ; drop arguments from stack
    ADD ESI,5
    JMP end_switch
case2:
    PUSH [ESI+5]
    PUSH [ESI+1]
    CALL fn2
    ADD ESP,8        ; drop arguments from stack
    ADD ESI,9
    JMP end_switch
case3:
    ADD ESI,[ESI+1]  ; advance by the offset following the goto
    JMP end_switch
end_switch:
this is assuming that the size of both parameters is 4 bytes. Of course the actual code is up to the compiler, and it is reasonable to expect that the compiler will output fairly efficient code as long as you ask for some level of optimization.
You have a data item X in the VM, at relative address A, and an instruction that says (for instance) push X, is that right? And you want to be able to execute this instruction without having to add A to the base address of the VM's data area.
I have written a VM that solves this problem by mapping the VM's data area to a fixed Virtual Address. The compiler knows this Virtual Address, and so can adjust A at compile time. Would this solution work for you? Can you change the compiler yourself?
My VM runs on a smart card, and I have complete control over the OS, so it's a very different environment from yours. But Windows does have some facilities for allocating memory at a fixed address -- the VirtualAlloc function, for instance. You might like to try this out. If you do try it out, you might find that Windows is allocating regions that clash with your fixed-address data area, so you will probably have to load by hand any DLLs that you use, after you have allocated the VM's data area.
But there will probably be unforeseen problems to overcome, and it might not be worth the trouble.
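For what it's worth, a minimal sketch of that VirtualAlloc approach (the base address 0x40000000 and the size are arbitrary examples, and the call can fail if that region is already occupied, so real code must handle a null return):

#include <windows.h>
#include <cstdio>

int main() {
    const SIZE_T data_area_size = 1 << 20;   // 1 MB VM data area (example size)
    void *base = VirtualAlloc(reinterpret_cast<LPVOID>(0x40000000),  // requested fixed address
                              data_area_size,
                              MEM_RESERVE | MEM_COMMIT,
                              PAGE_READWRITE);
    if (!base) {
        printf("could not get the fixed address, error %lu\n", GetLastError());
        return 1;
    }
    // The VM's compiler can now bake addresses relative to 0x40000000 into the
    // generated bytecode, instead of adding a base pointer at runtime.
    printf("data area at %p\n", base);
    VirtualFree(base, 0, MEM_RELEASE);
}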
Playing with virtual address translation, page tables or TLBs is something that can only be done at the OS kernel level, and is unportable between platforms and processor families. Furthermore hardware address translation on most CPU ISAs is usually supported only at the level of certain page sizes.
To answer my own question, based on the many responses I got.
Turns out what I want to achieve is not really possible in my situation; getting memory address calculations for free is attainable only when specific requirements are met, and it requires compilation to machine-specific instructions.
I am developing a visual, Lego-style drag-and-drop programming environment for educational purposes, which relies on a simple VM to execute the program code. I was hoping to maximize performance, but it is just not possible in my scenario. It is not that big of a deal though, because program elements can also generate their C code equivalents, which can then be compiled conventionally to maximize performance.
Thanks to everyone who responded and clarified a matter that wasn't really clear to me!
