Data Alignment: Reason for restriction on memory address being multiple of data type size - c++

I understand the general concept of data alignment, but what I do not understand is the restriction on the memory address value: why is it forced to be a multiple of the size of the underlying data type?
This answer explains the data alignment well.
Quote:
Let's look at a memory map:
+----+
|0000|
|0001|
+----+
|0002|
|0003|
+----+
|0004|
|0005|
+----+
| .. |
At each address there is a byte which can be accessed individually. But words can only be fetched at even addresses. So if we read a word at 0000, we read the bytes at 0000 and 0001. But if we want to read the word at position 0001, we need two read accesses. First 0000,0001 and then 0002,0003 and we only keep 0001,0002.
Question:
Assuming it's true, why would "words can only be fetched at even addresses" be true? Can't the memory/stack pointer point to 0001 in the example and then read a word of information starting there?
We know the machine can read memory in blocks of 2 bytes with one read action, (in the example [0000, 0001] or [0002, 0003]). So if my address register is pointing to 0001 (odd address instead of even), then I can read 2 bytes from there (i.e. 0001 and 0002) directly in one read action, right?

The assumption in that statement is not necessarily true. I don't want to reiterate the answer you linked to describing the reasons for using and strongly preferring aligned access, but there are architectures that do support unaligned memory access -- ARM, for example (check out this SO answer).
But your question, I think, really comes down to hardware architecture, specifically the data bus design, and the accompanying instructions set that engineers at various silicon manufacturers have designed.
Some Cortex-M cores explicitly allow you to configure whether the CPU triggers a usage fault exception on unaligned access via a control register, which means that you can "utilize" unaligned memory access in rare use-cases.

Usually a processor's internal addresses point to whole words. This is because you don't want your (simple) processor to be able to address a word at an arbitrary byte (or, even worse, bit), because:
You waste addressable memory: presuming the biggest possible address your processor can handle is the maximum value of its word size, you can multiply that by the size of your word to calculate the amount of storage you can address (each unique address points to a full word). For example, 16-bit addresses naming 2-byte words reach 128 KB, whereas 16-bit addresses naming individual bytes reach only 64 KB. The "address" I'm talking about here does not necessarily look like the address stored in a pointer of a higher-level programming language: the pointer addresses each byte, and a compiler or interpreter translates it into the corresponding assembly instructions (discarding unwanted bytes from the loaded word).
A word loaded from memory could be anything: a value, or the next instruction of the program you are running on your processor. The previous word loaded into the processor often indicates what the following word is used for: another instruction (e.g. an arithmetic operation, or a load or store instruction), which might be followed by operands (values or addresses). Put simply, being able to address unaligned words would complicate a processor a lot.

Assuming it's true, why would "words can only be fetched at even addresses" be true?
The memory actually stores words. The processor actually addresses the memory in words, and fetches a word at a time.
When you fetch a byte, it actually fetches a word, then ignores either the first half or the second half.
On 32-bit processors, it fetches a 32-bit word, then ignores three quarters; fetching a 16-bit word on a 32-bit processor ignores half the word.
If the 16-bit word you want to fetch (on a 16-bit processor) isn't aligned, then the processor has to fetch two words, take half of each word and then re-combine them. So even on processor designs where it works, it's often slower.
A lot of processor designs don't bother - either they just won't allow it, or they force the operating system to handle it (which is very slow).
(Not all types of processors work this way - e.g. 8-bit processors usually fetch a byte at a time)
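To make the fetch-two-words-and-recombine idea concrete, here is a small C++ model of a memory that can only be fetched one aligned 16-bit word at a time; the names (WordMemory, fetch_word, read_u16) are invented for illustration, and the model is little-endian:

#include <cstdint>
#include <cstdio>

struct WordMemory {
    uint8_t bytes[8] = {0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88};

    // The only primitive the "hardware" offers: fetch the aligned 16-bit
    // word containing the given address.
    uint16_t fetch_word(unsigned addr) const {
        unsigned base = addr & ~1u;   // round down to the even address
        return static_cast<uint16_t>(bytes[base] | (bytes[base + 1] << 8));
    }

    // Reading a 16-bit value at an arbitrary address: one fetch if
    // aligned, two fetches plus recombination if not.
    uint16_t read_u16(unsigned addr) const {
        if ((addr & 1u) == 0)
            return fetch_word(addr);          // aligned: a single access
        uint16_t lo = fetch_word(addr);       // holds bytes addr-1, addr
        uint16_t hi = fetch_word(addr + 1);   // holds bytes addr+1, addr+2
        return static_cast<uint16_t>((lo >> 8) | (hi << 8)); // keep middle
    }
};

int main() {
    WordMemory mem;
    std::printf("aligned   read at 2: %04x\n", mem.read_u16(2)); // 4433
    std::printf("unaligned read at 1: %04x\n", mem.read_u16(1)); // 3322
}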
Can't the memory/stack pointer point to 0001 in the example and then read a word of information starting there?
If the processor supports it, yes.

Related

Why does the compiler allocate more space than sizeof(MyClass) for the object on the stack? [duplicate]

What is stack alignment?
Why is it used?
Can it be controlled by compiler settings?
The details of this question are taken from a problem faced when trying to use ffmpeg libraries with msvc; however, what I'm really interested in is an explanation of "stack alignment".
The Details:
When running my msvc compiled program which links to avcodec I get the
following error: "Compiler did not align stack variables. Libavcodec has
been miscompiled", followed by a crash in avcodec.dll.
avcodec.dll was not compiled with msvc, so I'm unable to see what is going on inside.
When running ffmpeg.exe and using the same avcodec.dll everything works well.
ffmpeg.exe was not compiled with msvc, it was compiled with gcc / mingw (same as avcodec.dll)
Thanks,
Dan
Alignment of variables in memory (a short history).
In the past, computers had an 8-bit data bus. This means that each clock cycle, 8 bits of information could be processed. Which was fine then.
Then came 16-bit computers. Due to backward compatibility and other issues, the 8-bit byte was kept and the 16-bit word was introduced. Each word was 2 bytes. And each clock cycle, 16 bits of information could be processed. But this posed a small problem.
Let's look at a memory map:
+----+
|0000|
|0001|
+----+
|0002|
|0003|
+----+
|0004|
|0005|
+----+
| .. |
At each address there is a byte which can be accessed individually.
But words can only be fetched at even addresses. So if we read a word at 0000, we read the bytes at 0000 and 0001. But if we want to read the word at position 0001, we need two read accesses. First 0000,0001 and then 0002,0003 and we only keep 0001,0002.
Of course this took some extra time, and that was not appreciated. So that's why they invented alignment: we store word variables at word boundaries and byte variables at byte boundaries.
For example, if we have a structure with a byte field (B) and a word field (W) (and a very naive compiler), we get the following:
+----+
|0000| B
|0001| W
+----+
|0002| W
|0003|
+----+
Which is not fun. But when using word alignment we find:
+----+
|0000| B
|0001| -
+----+
|0002| W
|0003| W
+----+
Here memory is sacrificed for access speed.
You can imagine that when using a double word (4 bytes) or a quad word (8 bytes) this is even more important. That's why with most modern compilers you can choose which alignment to use when compiling the program.
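The byte/word example above translates directly to C++ struct layout. A hedged sketch: the sizes shown are typical rather than guaranteed, and #pragma pack is a widely supported (MSVC, GCC, Clang) but non-standard extension:

#include <cstdint>
#include <cstdio>

struct Aligned            // what a word-aligning compiler produces
{
    uint8_t  b;           // offset 0 (the B byte above)
                          // offset 1: padding (the "-" above)
    uint16_t w;           // offset 2 (the W word above)
};                        // sizeof(Aligned) is typically 4

#pragma pack(push, 1)
struct Naive              // the "very naive compiler" layout above
{
    uint8_t  b;           // offset 0
    uint16_t w;           // offset 1: straddles a word boundary
};
#pragma pack(pop)

int main()
{
    std::printf("%zu %zu\n", sizeof(Aligned), sizeof(Naive)); // typically 4 3
}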
Some CPU architectures require specific alignment of various datatypes, and will throw exceptions if you don't honor this rule. In standard mode, x86 doesn't require this for the basic data types, but can suffer performance penalties (check www.agner.org for low-level optimization tips).
However, the SSE instruction set (often used for high-performance audio/video processing) has strict alignment requirements, and will throw exceptions if you attempt to use it on unaligned data (unless you use the, on some processors, much slower unaligned versions).
Your issue is probably that one compiler expects the caller to keep the stack aligned, while the other expects the callee to align the stack when necessary.
EDIT: as for why the exception happens, a routine in the DLL probably wants to use SSE instructions on some temporary stack data, and fails because the two different compilers don't agree on calling conventions.
IIRC, stack alignment is when variables are placed on the stack "aligned" to a particular number of bytes. So if you are using a 16 bit stack alignment, each variable on the stack is going to start from a byte that is a multiple of 2 bytes from the current stack pointer within a function.
This means that if you use a variable that is < 2 bytes, such as a char (1 byte), there will be 8 bits of unused "padding" between it and the next variable. This allows certain optimisations with assumptions based on variable locations.
When calling functions, one method of passing arguments to the next function is to place them on the stack (as opposed to placing them directly into registers). Whether or not alignment is being used here is important, as the calling function places the variables on the stack, to be read off by the called function using offsets. If the calling function aligns the variables, and the called function expects them to be non-aligned, then the called function won't be able to find them.
It seems that the msvc compiled code is disagreeing about variable alignment. Try compiling with all optimisations turned off.
As far as I know, compilers don't typically align variables that are on the stack. The library may be depending on some set of compiler options that isn't supported on your compiler. The normal fix is to declare the variables that need to be aligned as static, but if you go about doing this in other people's code, you'll want to be sure that the variables in question are initialized later on in the function rather than in the declaration.
// Some compilers won't align this as it's on the stack...
__declspec(align(32)) int needsToBe32Aligned = 0;
// Change to
static __declspec(align(32)) int needsToBe32Aligned;
needsToBe32Aligned = 0;
Alternately, find a compiler switch that aligns the variables on the stack. Obviously the "__declspec" align syntax I've used here may not be what your compiler uses.
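For what it's worth, since C++11 the standard spelling for this is alignas, which works on stack variables too and avoids the compiler-specific __declspec/__attribute__ syntax; a minimal sketch:

#include <cstdio>

int main()
{
    alignas(32) int onStack = 0;   // standard C++11 alignment request
    // The printed address should be a multiple of 32.
    std::printf("%p\n", static_cast<void*>(&onStack));
    return onStack;
}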

Why booleans take a whole byte? [duplicate]

In C++,
Why is a boolean 1 byte and not 1 bit of size?
Why aren't there types like a 4-bit or 2-bit integers?
I miss having the above things when writing an emulator for a CPU
Because the CPU can't address anything smaller than a byte.
From Wikipedia:
Historically, a byte was the number of bits used to encode a single character of text in a computer and it is for this reason the basic addressable element in many computer architectures.
So the byte is the basic addressable unit, below which a computer architecture cannot address. And since there (probably) don't exist computers which support a 4-bit byte, you don't have a 4-bit bool, etc.
However, if you can design an architecture whose basic addressable unit is 4 bits, then you will have a bool of size 4 bits -- on that computer only!
Back in the old days when I had to walk to school in a raging blizzard, uphill both ways, and lunch was whatever animal we could track down in the woods behind the school and kill with our bare hands, computers had much less memory available than today. The first computer I ever used had 6K of RAM. Not 6 megabytes, not 6 gigabytes, 6 kilobytes. In that environment, it made a lot of sense to pack as many booleans into an int as you could, and so we would regularly use operations to take them out and put them in.
Today, when people will mock you for having only 1 GB of RAM, and the only place you could find a hard drive with less than 200 GB is at an antique shop, it's just not worth the trouble to pack bits.
The easiest answer is: it's because the CPU addresses memory in bytes and not in bits, and extracting or updating individual bits takes extra operations.
However, it's possible to use bit-sized allocation in C++. There's a std::vector<bool> specialization that packs bits, and there are structs taking bit-sized entries.
Because a byte is the smallest addressable unit in the language.
But you can make bool take 1 bit, for example if you have a bunch of them in a struct, like this:
struct A
{
    bool a:1, b:1, c:1, d:1, e:1;
};
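How big that struct ends up is implementation-defined, but five 1-bit fields commonly share a single byte; a quick check (repeating the struct so the snippet stands alone):

#include <cstdio>

struct A
{
    bool a:1, b:1, c:1, d:1, e:1;  // five 1-bit fields
};

int main()
{
    // Commonly prints 1: all five flags packed into one byte. Bitfield
    // layout is implementation-defined, so don't rely on the exact
    // packing for file formats or wire protocols.
    std::printf("sizeof(A) = %zu\n", sizeof(A));
}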
You could have 1-bit bools and 4- and 2-bit ints. But that would make for a weird instruction set with no performance gain, because it's an unnatural way to look at the architecture. It actually makes sense to "waste" the better part of a byte rather than trying to reclaim that unused data.
The only app that bothers to pack several bools into a single byte, in my experience, is SQL Server.
You can use bit fields to get integers of sub-byte size.
struct X
{
    int val:4; // 4-bit int
};
Though it is usually used to map structures to the exact bit patterns expected by hardware:
// 1 byte value (on a system where 8 bits is a byte)
struct SomThing
{
    int p1:4; // 4-bit field
    int p2:3; // 3-bit field
    int p3:1; // 1 bit
};
bool can be one byte -- the smallest addressable size on the CPU -- or bigger. It's not unusual to have bool be the size of int for performance purposes. If for specific purposes (say hardware simulation) you need a type with N bits, you can find a library for that (e.g. the GBL library has a BitSet<N> class). If you are concerned with the size of bool (you probably have a big container), then you can pack bits yourself, or use std::vector<bool>, which will do it for you (be careful with the latter, as it doesn't satisfy the container requirements).
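A hedged sketch of the two standard options just mentioned -- std::vector<bool> when the count is dynamic, std::bitset<N> when it is fixed at compile time:

#include <bitset>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<bool> flags(1000);   // ~1000 bools packed into ~125 bytes
    flags[42] = true;

    std::bitset<1000> fixed;         // compile-time size, also bit-packed
    fixed.set(42);

    bool a = flags[42], b = fixed[42];
    std::printf("%d %d\n", a, b);
    // Caveat from above: vector<bool>::operator[] returns a proxy
    // object, not a bool&, which is why it fails the usual container
    // requirements.
}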
Think about how you would implement this at your emulator level...
#include <cassert>

bool a[10] = {false};
bool &rbool = a[3];   // a reference must bind to an addressable object...
bool *pbool = a + 3;  // ...and pointer arithmetic assumes whole-byte elements
assert(pbool == &rbool);
rbool = true;         // a write through the reference...
assert(*pbool);       // ...must be visible through the pointer
*pbool = false;
assert(!rbool);
Because in general, a CPU allocates memory with 1 byte as the basic unit, although some CPUs, such as MIPS, work in 4-byte words.
However, vector deals with bool in a special fashion: with vector<bool>, one bit is allocated for each bool.
The byte is the smallest unit of digital data storage in a computer. In a computer the RAM has millions of bytes, and any one of them has an address. If it had an address for every bit, a computer could manage 8 times less RAM than it can.
More info: Wikipedia
Even when the minimum size possible is 1 byte, you can have 8 bits of boolean information in 1 byte:
http://en.wikipedia.org/wiki/Bit_array
The Julia language has BitArray, for example, and I have read about C++ implementations.
Bitwise operations are not 'slow'. And/Or operations tend to be fast. The problem is alignment and the work of dealing with it. As other answers partially and correctly noted, CPUs are generally aligned to read bytes, and RAM/memory is designed the same way, so any data compression that uses less memory space has to be explicitly arranged.
As one answer suggested, you can declare a specific number of bits per value in a struct. But what does the CPU/memory do afterward if it's not aligned? That would result in unaligned memory where, instead of steps of +1, +2, or +4, there is no +1.5 step if you wanted to use half the size in bits of some value. The hardware must therefore fill in or skip the remaining space as blank and simply read the next aligned space; things are aligned by 1 at minimum and usually by 4 (32-bit) or 8 (64-bit) overall. The CPU will generally grab the byte or int value that contains your flags, and then you check or set the ones you need. So you must still declare memory as an int, short, or byte-sized value, but when accessing and setting it you can explicitly compress the data and store flags in it to save space.
Many people are unaware of how this works, or skip the step even when they have on/off or flag-like values, although saving space in sent/received data is quite useful in mobile and other constrained environments. In the case of splitting an int into bytes it has little value, as you can just define the bytes individually (e.g. int fourBytes; vs. byte byte1; byte byte2; byte byte3; byte byte4;), and in that case using an int is redundant; however, in virtual environments like Java that define most types as int (numbers, booleans, etc.), you can take advantage of dividing an int up and using bytes/bits for an ultra-efficient app that has to send fewer ints of data (aligned by 4).
It could be called redundant to manage bits, but it is one of many optimizations where bitwise operations are superior though not always needed. Often people take advantage of generous memory limits by storing booleans as integers, wasting 500%-1000% or so of the space. Packing still easily has its uses: combined with other optimizations, for streams that carry only bytes or a few KB of data, it can make the difference between something loading fast -- or loading at all. Reducing the bytes sent can ultimately benefit you a lot, even when you could get away with over-sending tons of unneeded data on an everyday internet connection. It is definitely something to do when designing an app for mobile users, and something even big-corporation apps fail at nowadays: the difference between piling on unknown packages/plugins that require at minimum hundreds of KB or 1 MB before loading, and an app designed for speed that requires a few KB, is that the lean one loads and acts faster for the users who do have data constraints, even if wasteful megabytes feel fast to you.
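As a concrete illustration of the explicit packing described above, here is a small sketch that stores eight on/off flags in one byte using masks; the flag names are hypothetical:

#include <cstdint>
#include <cstdio>

// Hypothetical flags packed into a single byte instead of eight bools.
enum : uint8_t {
    FLAG_VISIBLE = 1 << 0,
    FLAG_ENABLED = 1 << 1,
    FLAG_DIRTY   = 1 << 2,
    // ...up to 1 << 7
};

int main()
{
    uint8_t flags = 0;
    flags |= FLAG_VISIBLE | FLAG_DIRTY;          // set two flags
    flags &= static_cast<uint8_t>(~FLAG_DIRTY);  // clear one
    if (flags & FLAG_VISIBLE)                    // test one
        std::printf("visible\n");
}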

Strange C++ Memory Allocation

I created a simple class, Storer, in C++, playing with memory allocation. It contains six field variables, all of which are assigned in the constructor:
int x;
int y;
int z;
char c;
long l;
double d;
I was interested in how these variables were being stored, so I wrote the following code:
Storer *s=new Storer(5,4,3,'a',5280,1.5465);
cout<<(long)s<<endl<<endl;
cout<<(long)&(s->x)<<endl;
cout<<(long)&(s->y)<<endl;
cout<<(long)&(s->z)<<endl;
cout<<(long)&(s->c)<<endl;
cout<<(long)&(s->l)<<endl;
cout<<(long)&(s->d)<<endl;
I was very interested in the output:
33386512
33386512
33386516
33386520
33386524
33386528
33386536
Why is the char c taking up four bytes? sizeof(char) returns, of course, 1, so why is the program allocating more memory than it needs? That too much memory is being allocated is confirmed by the following code:
cout<<sizeof(s->c)<<endl;
cout<<sizeof(Storer)<<endl;
cout<<sizeof(int)+sizeof(int)+sizeof(int)+sizeof(char)+sizeof(long)+sizeof(double)<<endl;
which prints:
1
32
29
confirming that, indeed, 3 bytes are being allocated needlessly. Can anyone explain to me why this is happening? Thanks.
Data alignment and compiler padding say hi!
The CPU has no notion of type, what it gets in its 32-bit (or 64-bit, or 128-bit (SSE), or 256-bit (AVX) - let's keep it simple at 32) registers needs to be properly aligned in order to be processed correctly and efficiently. Imagine a simple scenario, where you have a char, followed by an int. In a 32-bit architecture, that's 1 byte for a char and 4 bytes for an integer.
A 32-bit register would have to break on its boundary, only taking in 3 bytes of the integer and leaving the 4th byte for "a second run". It cannot process the data properly that way, so the compiler will add padding in order to make sure all the stuff is processed efficiently. And that means adding a certain amount of padding depending on the type in question.
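The char-followed-by-int scenario, made concrete with offsetof; the offsets shown assume a typical ABI where int is 4-byte aligned, so they may differ elsewhere:

#include <cstddef>
#include <cstdio>

struct CharThenInt
{
    char c;   // offset 0
              // offsets 1-3: compiler-inserted padding
    int  i;   // offset 4, so the int lands on a 4-byte boundary
};

int main()
{
    std::printf("offsetof i: %zu, sizeof: %zu\n",
                offsetof(CharThenInt, i), sizeof(CharThenInt)); // typically 4, 8
}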
Why is misalignment a problem?
The computer is not human; it can't just pick values out with a pair of eyes and a brain. It has to be very deterministic and cautious about how it goes about doing things. First it loads one block which contains n bytes of the given information, shifts it around to prune out the unrelated bytes, then loads another block, again shifts out a bunch of unnecessary bytes which have nothing to do with the operation at hand, and only then can it do the necessary operation. And usually you have two operands -- that's the work for just one of them. Only when all of that is done can you actually process the data. That is way too much performance overhead when you could simply have aligned the data properly (and most of the time, compilers do it for you, if you're not doing anything fancy).
Could you visualize it?
Visually -- the first green byte is the mentioned char, and the three green bytes plus the first red one of the second block are the 4-byte int, color-coded on a 4-byte access boundary (we're talking about a 32-bit register). The "instead" part at the bottom shows an ideal setup where the int hits the register properly (the char getting padded into obedience somewhere off image):
Read more on data alignment, which comes in quite handy when you're dealing with fancy extensions of the instruction set like SSE (128-bit regs) or AVX (256-bit regs), where special care must be taken so that the optimizations of vectorization are not defeated (aligning on a 16-byte boundary for SSE: 16*8 -> 128 bits).
Additional remarks on user defined alignment
phonetagger made a valid point in the comments that there are pragma directives which can be given through the preprocessor to force the compiler to align the data the way the user/programmer specifies. But such directives, like #pragma pack(...), are a statement to the compiler that you know what you're doing and what's best for you. Be sure that you do, because if you fail to accommodate your environment, you might experience various penalties -- the most obvious arising when you use external libraries you didn't write yourself which pack data differently.
Things simply explode when they clash. Best to advise caution in such cases and to really be intimate with the issue at hand. If you're not sure, leave it at the defaults. If you are not sure but have to use something like SSE, where alignment is king (and neither default nor simple by a long shot), consult various resources online or ask another question here.
I will make an analogy to help you understand.
Assume there is a long loaf of bread and you have a cutting machine that can cut it into slices of equal thickness. Then you are giving out these slices to, let's say, children. Every child takes their slice and does what they want with it (puts Nutella on it and eats it, etc.). They can even make thinner slices out of it and use it like that.
If one child comes up to you and says that he does not want the slice everyone is getting, but a thinner slice instead, then you will have difficulties, because your cutting machine is optimized to cut no less than a minimum thickness, which makes everyone happy. When one child asks for a thinner slice, you would have to reinvent the machine or add complexity to it, such as introducing two cutting modes. You don't want that. Eventually you give up and just give him a big slice anyway.
This is the same reason why it happens. Hope you could relate to the analogy.
Data alignment is why the char has 4 bytes allocated: Data alignment
char does not take up four bytes: it takes up a single byte as usual. You can check it by printing sizeof(char). The other three bytes are padding that the compiler inserts to optimize access to other members of your class. Depending on hardware, it is often much faster to access multi-byte types, say, 4-byte integers, when they are located at an address divisible by four. A compiler may insert up to three bytes of padding before an int member to align it with a good memory address for faster access.
If you would like to experiment with class layouts, you can use a handy macro called offsetof. It takes two parameters -- the name of the class and the name of the member -- and it returns the number of bytes from the base address of your struct to the position of the member in memory.
#include <cstddef> // offsetof lives here

cout << offsetof(Storer, x) << endl;
cout << offsetof(Storer, y) << endl;
cout << offsetof(Storer, z) << endl;
Structure members are aligned in particular ways. In general, if you want the most compact representation, list the members in decreasing order of size.
http://en.wikipedia.org/wiki/Data_structure_alignment#Typical_alignment_of_C_structs_on_x86
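A hedged illustration of that ordering rule; the numbers assume a typical x86-64 ABI where double is 8-byte aligned:

#include <cstdio>

struct Interleaved     // small/large members interleaved: padding after each
{
    char   c1;         // 1 byte + 7 bytes padding
    double d;          // 8 bytes
    char   c2;         // 1 byte + 7 padding (size rounds up to alignment)
};

struct Sorted          // decreasing size: padding only at the tail
{
    double d;          // 8 bytes
    char   c1, c2;     // 2 bytes + 6 bytes padding
};

int main()
{
    // Typically prints 24 and 16 on x86-64.
    std::printf("%zu %zu\n", sizeof(Interleaved), sizeof(Sorted));
}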

Decoding and matching Chip 8 opcodes in C/C++

I'm writing a Chip 8 emulator as an introduction to emulation and I'm kind of lost. Basically, I've read a Chip 8 ROM and stored it in a char array in memory. Then, following a guide, I use the following code to retrieve the opcode at the current program counter (pc):
// Fetch opcode
opcode = memory[pc] << 8 | memory[pc + 1];
Chip 8 opcodes are 2 bytes each. This is code from a guide which I vaguely understand as adding 8 extra bit spaces to memory[pc] (using << 8) and then merging memory[pc + 1] with it (using |) and storing the result in the opcode variable.
Now that I have the opcode isolated however, I don't really know what to do with it. I'm using this opcode table and I'm basically lost in regards to matching the hex opcodes I read to the opcode identifiers in that table. Also, I realize that many of the opcodes I'm reading also contain operands (I'm assuming the latter byte?), and that is probably further complicating my situation.
Help?!
Basically once you have the instruction you need to decode it. For example from your opcode table:
if ((inst & 0xF000) == 0x1000)
{
    write_register(pc, (inst & 0x0FFF) << 1);
}
And, guessing that since you are accessing the ROM two bytes per instruction, the address is probably a (16-bit) word address rather than a byte address, I shifted it left by one. (You need to study how those instructions are encoded; the opcode table you provided is inadequate for that, at least without making assumptions.)
There is a lot more that has to happen, and I don't know if I wrote anything about it in my github samples. I recommend you create a fetch function for fetching instructions at an address, a read-memory function, a write-memory function, a read-register function, and a write-register function. I recommend that your decode-and-execute function decode and execute only one instruction at a time; normal execution is then just calling it in a loop. That gives you the ability to do interrupts and things like that without a lot of extra work, and it modularizes your solution. By creating the fetch(), read_mem_byte(), read_mem_word(), etc. functions, you modularize your code (at a slight cost in performance) and make debugging much easier, since you have a single place where you can watch registers or memory accesses and figure out what is or isn't going on. A bare skeleton of that layout follows.
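This sketch uses merely suggestive names; CHIP-8 really has sixteen 8-bit V registers plus a 16-bit I register and more state, which the sketch glosses over:

#include <cstdint>

// Hypothetical scaffolding; every access funnels through one place,
// which makes watching memory and registers during debugging easy.
uint8_t  ram[4096];        // CHIP-8 has 4 KB of memory
uint8_t  v[16];            // V0..VF, 8-bit registers
uint16_t pc = 0x200;       // programs conventionally load at 0x200

uint8_t  read_mem_byte(uint16_t addr)             { return ram[addr & 0x0FFF]; }
void     write_mem_byte(uint16_t addr, uint8_t d) { ram[addr & 0x0FFF] = d; }
uint8_t  read_register(unsigned r)                { return v[r & 0xF]; }
void     write_register(unsigned r, uint8_t d)    { v[r & 0xF] = d; }

// Fetch one 16-bit big-endian instruction at addr.
uint16_t fetch(uint16_t addr)
{
    return static_cast<uint16_t>((read_mem_byte(addr) << 8) |
                                 read_mem_byte(addr + 1));
}

// Decode and execute exactly one instruction, then return;
// normal execution just calls this in a loop.
void step()
{
    uint16_t inst = fetch(pc);
    pc += 2;
    (void)inst; // ...decode inst and dispatch here...
}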
Based on your question, and where you are in this process, I think the first thing you need to do before writing an emulator is to write a disassembler. Being a fixed instruction length instruction set (16 bits) that makes it much much easier. You can start at some interesting point in the rom, or at the beginning if you like, and decode everything you see. For example:
if ((inst & 0xF000) == 0x1000)
{
    printf("jmp 0x%04X\n", (inst & 0x0FFF) << 1);
}
With only 35 instructions, that shouldn't take but an afternoon, maybe a whole Saturday, this being your first time decoding instructions (I assume that based on your question). The disassembler becomes the core decoder of your emulator. Replace the printf()s with emulation -- or, even better, leave the printf()s in and just add the code that emulates each instruction's execution, so you can follow along. (Same deal: have a disassemble-a-single-instruction function and call it for each instruction; this becomes the foundation of your emulator.)
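A minimal disassembler loop along those lines, decoding just a few of the 35 opcodes to show the shape; the mnemonics are made up, and this version prints the 12-bit address as-is (whether it needs scaling depends on how you index your memory, as discussed above):

#include <cstdint>
#include <cstdio>

// Disassemble count instructions starting at addr, two bytes each.
void disassemble(const uint8_t *rom, unsigned addr, unsigned count)
{
    for (unsigned i = 0; i < count; ++i, addr += 2)
    {
        uint16_t inst = static_cast<uint16_t>((rom[addr] << 8) | rom[addr + 1]);
        std::printf("%04X: %04X  ", addr, inst);
        switch (inst & 0xF000)
        {
            case 0x1000: std::printf("jmp  0x%03X\n", inst & 0x0FFF); break;
            case 0x6000: std::printf("mov  V%X, #0x%02X\n",          // 6XNN
                                     (inst >> 8) & 0xF, inst & 0xFF); break;
            case 0xA000: std::printf("mov  I, 0x%03X\n", inst & 0x0FFF); break;
            default:     std::printf("???\n"); break;  // fill in the rest
        }
    }
}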
Your understanding needs to be more than vague as to what that fetch line of code is doing, in order to pull off this task you are going to have to have a strong understanding of bit manipulation.
Also, I would call that line of code you provided risky. If memory[] is an array of bytes, a conforming compiler must promote each byte to int before the shift, but buggy or lax (often embedded) compilers have been known to perform the left shift using byte-sized math, which results in a zero, and zero ORed with the second byte yields only the second byte.
Such a compiler effectively turns this:
opcode = memory[pc] << 8 | memory[pc + 1];
Into this:
opcode = memory[pc + 1];
Which wont work for you at all, a very quick fix:
opcode = memory[pc + 0];
opcode <<= 8;
opcode |= memory[pc + 1];
This will save you some headaches. Minimal optimization will keep the compiler from storing the intermediate results to RAM for each operation, giving the same (desired) output/performance.
The instruction set simulators I wrote and mentioned above are not intended for performance but for readability and visibility, and hopefully they are educational. I would start with something like that; then, if performance, for example, is of interest, you will have to re-write it. This chip8 emulator, once you are experienced, would be an afternoon task from scratch, so once you get through this the first time you could re-write it maybe three or four times in a weekend; it is not a monumental task to have to re-write. (The thumbulator one took me a weekend for the bulk of it. The msp430 one was probably more like an evening or two worth of work. Getting the overflow flag right, once and for all, was the biggest task, and that came later.)
Anyway, the point being: look at things like the mame sources. Most if not all of those instruction set simulators are designed for execution speed; many are barely readable without a fair amount of study, often heavily table-driven, sometimes full of C programming tricks, etc. Start with something manageable, get it functioning properly, then worry about improving it for speed, size, portability, or whatever.
This chip8 thing looks to be graphics-based, so you are also going to have to deal with a lot of line drawing and other bit manipulation on a bitmap/screen/wherever -- or you could just call API or operating system functions. Basically, this chip8 thing is not your traditional instruction set with registers and a laundry list of addressing modes and ALU operations.
Basically -- Mask out the variable part of the opcode, and look for a match. Then use the variable part.
For example 1NNN is the jump. So:
int a = opcode & 0xF000;
int b = opcode & 0x0FFF;
if (a == 0x1000)
    doJump(b);
Then the game is to make that code fast or small, or elegant, if you like. Good clean fun!
Different CPUs store values in memory differently. Big-endian machines store a number like $FFCC in memory in that order: FF, CC. Little-endian machines store the bytes in reverse order: CC, FF (that is, with the "little end" first).
The CHIP-8 architecture is big endian, so the code you will run has the instructions and data written in big endian.
In your statement "opcode = memory[pc] << 8 | memory[pc + 1];", it doesn't matter whether the host CPU (the CPU of your computer) is little endian or big endian: it always assembles the 16-bit big-endian value into an integer in the correct order.
There are a couple of resources that might help: http://www.emulator101.com gives a CHIP-8 emulator tutorial along with some general emulator techniques. This one is good too: http://www.multigesture.net/articles/how-to-write-an-emulator-chip-8-interpreter/
You're going to have to setup a bunch of different bit masks to get the actual opcode from the 16-bit word in combination with a finite state machine in order to interpret those opcodes since it appears that there are some complications in how the opcodes are encoded (i.e., certain opcodes have register identifiers, etc., while others are fairly straight-forward with a single identifier).
Your finite state machine can basically do the following:
Get the first nibble of the opcode using a mask like 0xF000. This will allow you to "categorize" the opcode.
Based on the function category from step 1, apply more masks to get the register values from the opcode, or whatever other variables might be encoded with the opcode that will narrow down the actual function that needs to be called, as well as its arguments.
Once you have the opcode and the variable information, do a look-up into a fixed-length table of functions that have the appropriate handlers for the opcode's functionality and the variables that go along with it. While you can, in your state machine, hard-code the names of the functions that go with each opcode once you've isolated the proper functionality, a table that you initialize with function pointers for each opcode is a more flexible approach that lets you modify the code's functionality more easily (e.g., you could easily swap between debug handlers and "normal" handlers); see the sketch below.
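One way to realize that function-pointer table in C++, with hypothetical handler names -- indexed by the top nibble, each handler masking out its own operand fields:

#include <cstdint>
#include <cstdio>

using Handler = void (*)(uint16_t opcode);

// Hypothetical handlers; each one extracts its own operand fields.
void op_jump(uint16_t opcode) { std::printf("jump to %03X\n", opcode & 0x0FFF); }
void op_load(uint16_t opcode) { std::printf("V%X = %02X\n",        // 6XNN
                                            (opcode >> 8) & 0xF, opcode & 0xFF); }
void op_todo(uint16_t opcode) { std::printf("unhandled %04X\n", opcode); }

// 16 entries, one per top nibble (0x0..0xF). Swapping in debug handlers
// is just replacing entries in this table.
Handler dispatch[16] = {
    op_todo, op_jump, op_todo, op_todo,   // 0x0NNN..0x3NNN (1NNN = jump)
    op_todo, op_todo, op_load, op_todo,   // 0x4NNN..0x7NNN (6XNN = load)
    op_todo, op_todo, op_todo, op_todo,
    op_todo, op_todo, op_todo, op_todo,
};

void execute(uint16_t opcode)
{
    dispatch[(opcode >> 12) & 0xF](opcode);
}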
