Fetching variable length opcodes and CPU timing - c++

I'm currently trying to write a NES emulator in C++ as a summer programming project to get ready for fall term next school year (I haven't coded in a while). I've already written a Chip8 emulator, so I thought the next step would be to try and write a NES emulator.
Anyways, I'm getting stuck. I'm using this website for my opcode table and I'm running into a roadblock. On the Chip8, all opcodes were two bytes long, so they were easy to fetch. However, the NES seems to have either 2 or 3 byte opcodes depending on what addressing mode the CPU is in. I can't think of any easy way to figure out how many bytes I need to read for each opcode (my only idea was to create really long if statements that check the first byte of the opcode to see how many more bytes to read).
I'm also having trouble figuring out how to count cycles. How do I create a clock within a programming language so that everything is in sync?
On an unrelated side note, since the NES is little-endian, do I need to read programCounter + 1 and then read programCounter to get the correct opcode?

However, the NES seems to have either 2 or 3 byte opcodes depending on what addressing mode the CPU is in. I can't think of any easy way to figure out how many bytes I need to read for each opcode.
The opcode is still only one byte. The extra bytes specify the operands for those instructions that have explicit operands.
To do the decoding, you can create a switch-block with 256 cases (actually it won't be 256 cases, because some opcodes are illegal). It could look something like this:
opcode = ReadByte(PC++);
switch (opcode) {
...
case 0x4C: // JMP abs
    address = ReadByte(PC++);
    address |= (uint16_t)ReadByte(PC) << 8;
    PC = address;
    cycles += 3;
    break;
...
}
The compiler will typically create a jump table for the cases, so you'll end up with fairly efficient (albeit slightly bloated) code.
Another alternative is to create an array with one entry per opcode. This could simply be an array of function pointers, with one function per opcode - or the table could contain a pointer to one function for fetching the operands, one for performing the actual operation, plus information about the number of cycles that the instruction requires. This way you can share a lot of code. An example:
const Instruction INSTRUCTIONS[] =
{
    ...
    // 0x4C: JMP abs
    {&jmp, &abs_operand, 3},
    ...
};
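Fleshed out a little, that table-driven core might look like the sketch below. Everything here is illustrative: the global-state style, ReadByte, and init_instructions are assumptions carried over from the snippets above, not a reference implementation.
#include <cstdint>

// Assumed global CPU state, matching the style of the snippets above.
uint16_t PC;
uint64_t cycles;
uint8_t ReadByte(uint16_t addr);              // your memory read

// One table entry per opcode: operand fetch, operation, base cycle count.
struct Instruction {
    void (*op)(uint16_t address);             // perform the operation
    uint16_t (*operand)();                    // fetch/compute the effective address
    int cycles;                               // base cycles for this opcode
};

// Absolute addressing: low byte first, then high byte.
uint16_t abs_operand() {
    uint16_t addr = ReadByte(PC++);
    addr |= static_cast<uint16_t>(ReadByte(PC++)) << 8;
    return addr;
}

void jmp(uint16_t address) { PC = address; }

Instruction INSTRUCTIONS[256];                // filled in at start-up

void init_instructions() {
    INSTRUCTIONS[0x4C] = { &jmp, &abs_operand, 3 };   // JMP abs
    // ... the remaining opcodes
}

// Fetch, look up, execute, count cycles.
void step() {
    const Instruction& ins = INSTRUCTIONS[ReadByte(PC++)];
    uint16_t address = ins.operand ? ins.operand() : 0;
    if (ins.op) ins.op(address);              // unfilled entries = illegal opcodes
    cycles += ins.cycles;
}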
I'm also having trouble with figuring how to count cycles. How do I create a clock within a programming language so that everything is in sync?
Counting CPU cycles is just a matter of incrementing a counter, like I showed in my code examples above.
To sync video with the CPU, the easiest way would be to run the CPU for the number of cycles corresponding to the active display period of a single scanline, then draw one scanline, then run the CPU for the number of cycles corresponding to the horizontal blanking period, and start over again.
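A minimal sketch of that scanline loop, assuming helper functions you would write yourself (run_cpu(n) meaning "execute instructions until at least n cycles have elapsed"); the cycle constants are rough placeholders, so take the exact values from your timing reference:
// Rough placeholder constants; take the exact values from your timing reference.
const int CYCLES_ACTIVE = 85;    // CPU cycles during the visible part of a scanline
const int CYCLES_HBLANK = 28;    // CPU cycles during horizontal blanking
const int SCANLINES = 262;       // total scanlines per frame (NTSC)

// Assumed helpers: run_cpu(n) executes instructions until at least n cycles
// have been counted; draw_scanline/present_frame belong to your video back end.
void run_cpu(int cycle_budget);
void draw_scanline(int line);
void present_frame();

void run_frame() {
    for (int line = 0; line < SCANLINES; ++line) {
        run_cpu(CYCLES_ACTIVE);  // emulate the CPU through the visible period
        draw_scanline(line);     // render this line with the current PPU state
        run_cpu(CYCLES_HBLANK);  // emulate the CPU through horizontal blank
    }
    present_frame();             // show the finished frame, pacing to ~60 Hz
}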
When you start involving audio, how you sync things can depend a bit on the audio API you're using. For example, some APIs might send you a callback to which you respond by filling a buffer with samples and returning the number of samples generated. In this case you could calculate the number of CPU cycles that have been emulated since the previous callback and determine how many samples to generate based on that.
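In code, that comes down to a ratio of clock rates. A sketch, where the ~1.79 MHz NTSC NES CPU clock and the 44.1 kHz sample rate are illustrative assumptions:
#include <cstdint>

// Illustrative constants: NTSC NES CPU clock (~1.79 MHz) and a common sample rate.
const double CPU_CLOCK_HZ = 1789773.0;
const double SAMPLE_RATE = 44100.0;

// Number of audio samples to produce for the CPU cycles emulated since the last callback.
int samples_for_cycles(uint64_t cycles_since_last_callback) {
    return static_cast<int>(cycles_since_last_callback * SAMPLE_RATE / CPU_CLOCK_HZ);
}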
On an unrelated side note, since the NES is little-endian, do I need to read programCounter + 1 and then read programCounter to get the correct opcode?
Since the opcode is a single byte and instructions on the 6502 aren't packed into a word like on some other CPU architectures, endianness doesn't really matter. It does become relevant for 16-bit operands, but on the other hand PCs and most mobile phones are also based on little-endian CPUs.
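For those 16-bit operands, a small helper keeps the byte order in one place (a sketch; ReadByte stands for whatever memory accessor you already have):
#include <cstdint>

uint8_t ReadByte(uint16_t addr);  // assumed memory accessor

// 6502 16-bit values are stored low byte first.
uint16_t ReadWord(uint16_t addr) {
    return static_cast<uint16_t>(ReadByte(addr) | (ReadByte(addr + 1) << 8));
}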

I wrote an emulator for 6502 some 25+ years back.
It's a pretty simple processor, so either a table of function pointers or a switch will do, with 256 entries for the opcode byte [the switch can be a bit shorter, since not all 256 entries are valid opcodes; only about 150 of them are officially defined].
Now, if you want to write a simulator that exactly simulates the timing of the instructions, then you'll have more fun. You basically will have to simulate much more of how each component works, and "ripple" through the units with a clock. This is quite a lot of work, so I would probably, if at all possible, ignore the timing, and just let the system's speed depend on the emulator's speed.

Related

Using only part of crc32

Will using only 2 upper/lower bytes of a crc32 sum make it weaker than crc16?
Background: I'm currently implementing a wireless protocol. I have chunks of 64 bytes each, and according to Data Length vs CRC Length I would need at most a CRC-16.
Using CRC-16 instead of CRC-32 would free up bandwidth for use in forward error correction (64 bytes is one block in FEC).
However, my hardware is quite low powered but has hardware support for CRC32.
So my idea was to use the hardware crc32 engine and just throw away 2 of the result bytes.
I know that this is not a crc16 sum, but that does not matter because I control both sides of the transmission.
In case it matters: I can use both crc32 (poly 0x04C11DB7) or crc32c (poly 0x1EDC6F41).
Yes, it will be weaker, but only for small numbers of bit errors. By taking half of a CRC-32 instead, you get none of the guarantees of a CRC-16, e.g. the number of bits in a burst that are always detectable.
What is the noise source that you are trying to protect against?
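To make the idea from the question concrete, here is a minimal sketch: a plain bitwise CRC-32 (the reflected form of poly 0x04C11DB7) standing in for the hardware engine, with the result truncated to 16 bits. Whether you keep the low or high half is up to you; the point is only that the truncated value no longer carries the burst-detection guarantees of a real CRC-16.
#include <cstdint>
#include <cstddef>

// Software stand-in for the hardware CRC-32 engine (reflected form of 0x04C11DB7).
uint32_t crc32(const uint8_t* data, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// "Throwing away two bytes": keep only the low 16 bits of the CRC-32.
uint16_t truncated_crc(const uint8_t* chunk, std::size_t len) {
    return static_cast<uint16_t>(crc32(chunk, len) & 0xFFFFu);
}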

What might cause the same SSE code to run a few times slower in the same function?

Edit 3: The images are links to the full-size versions. Sorry for the pictures-of-text, but the graphs would be hard to copy/paste into a text table.
I have the following VTune profile for a program compiled with icc --std=c++14 -qopenmp -axS -O3 -fPIC:
In that profile, two clusters of instructions are highlighted in the assembly view. The upper cluster takes significantly less time than the lower one, in spite of instructions being identical and in the same order. Both clusters are located inside the same function and are obviously both called n times. This happens every time I run the profiler, on both a Westmere Xeon and a Haswell laptop that I'm using right now (compiled with SSE because that's what I'm targeting and learning right now).
What am I missing?
Ignore the poor concurrency, this is most probably due to the laptop throttling, since it doesn't occur on the desktop Xeon machine.
I believe this is not an example of micro-optimisation, since those three added together amount to a decent % of the total time, and I'm really interested about the possible cause of this behavior.
Edit: OMP_NUM_THREADS=1 taskset -c 1 /opt/intel/vtune...
Same profile, albeit with a slightly lower CPI this time.
HW perf counters typically charge stalls to the instruction that had to wait for its inputs, not the instruction that was slow producing outputs.
The inputs for your first group come from your gather. This probably cache-misses a lot, but those costs aren't going to get charged to the SUBPS/MULPS/ADDPS instructions. Their inputs come directly from vector loads of voxel[], so a store-forwarding failure will cause some latency. But that's only ~10 cycles IIRC, small compared to cache misses during the gather. (Those cache misses show up as large bars for the instructions right before the first group that you've highlighted.)
The inputs for your second group come directly from loads that can miss in cache. In the first group, the direct consumers of the cache-miss loads were instructions for lines like the one that sets voxel[0], which has a really large bar.
But in the second group, the time for the cache misses in a_transfer[] is getting attributed to the group you've highlighted. Or if it's not cache misses, then maybe it's slow address calculation as the loads have to wait for RAX to be ready.
It looks like there's a lot you could optimize here.
instead of store/reload for a_pointf, just keep it hot across loop iterations in a __m128 variable. Storing/reloading in the C source only makes sense if you found the compiler was making a poor choice about which vector register to spill (if it ran out of registers).
calculate vi with _mm_cvttps_epi32(vf), so the ROUNDPS isn't part of the dependency chain for the gather indices.
Do the voxel gather yourself by shuffling narrow loads into vectors, instead of writing code that copies to an array and then loads from it. (guaranteed store-forwarding failure, see Agner Fog's optimization guides and other links from the x86 tag wiki).
It might be worth it to partially vectorize the address math (calculation of base_0, using PMULDQ with a constant vector), so instead of a store/reload (~5 cycle latency) you just have a MOVQ or two (~1 or 2 cycle latency on Haswell, I forget.)
Use MOVD to load two adjacent short values, and merge another pair into the second element with PINSRD. You'll probably get good code from _mm_setr_epi32(*(const int*)base_0, *(const int*)(base_0 + dim_x), 0, 0), except that pointer aliasing is undefined behaviour. You might get worse code from _mm_setr_epi16(*base_0, *(base_0 + 1), *(base_0 + dim_x), *(base_0 + dim_x + 1), 0,0,0,0).
Then expand the low four 16-bit elements into 32-bit integers with PMOVSX, and convert them all to float in parallel with _mm_cvtepi32_ps (CVTDQ2PS).
Your scalar LERPs aren't being auto-vectorized, but you're doing two in parallel (and could maybe save an instruction since you want the result in a vector anyway).
Calling floorf() is silly, and a function call forces the compiler to spill all xmm registers to memory. Compile with -ffast-math or whatever to let it inline to a ROUNDSS, or do that manually. Especially since you go ahead and load the float that you calculate from that into a vector!
Use a vector compare instead of scalar prev_x / prev_y / prev_z. Use MOVMASKPS to get the result into an integer you can test. (You only care about the lower 3 elements, so test it with compare_mask & 0b0111: true if any of the low 3 bits of the 4-bit mask are set after a compare for not-equal with _mm_cmpneq_ps. See the double version of the instruction for more tables on how it all works: http://www.felixcloutier.com/x86/CMPPD.html.)
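As a concrete illustration of that last suggestion, a sketch (new_pos and prev_pos are assumed __m128 variables holding x/y/z in the low three elements):
#include <xmmintrin.h>  // SSE

// True if any of x, y, z changed since the previous iteration.
// Only the low three elements of the vectors are meaningful here.
inline bool position_changed(__m128 new_pos, __m128 prev_pos) {
    __m128 neq = _mm_cmpneq_ps(new_pos, prev_pos);  // per-element 0 or all-ones
    int mask = _mm_movemask_ps(neq);                // one bit per element
    return (mask & 0b0111) != 0;                    // ignore the unused 4th lane
}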
Well, when analyzing assembly code, please note that running time is attributed to the next instruction - so the data you're looking at per instruction needs to be interpreted carefully. There is a corresponding note in the VTune Release Notes:
Running time is attributed to the next instruction (200108041)
To collect the data about time-consuming running regions of the
target, the Intel® VTune™ Amplifier interrupts executing target
threads and attributes the time to the context IP address.
Due to the collection mechanism, the captured IP address points to an
instruction AFTER the one that is actually consuming most of the time.
This leads to the running time being attributed to the next
instruction (or, rarely to one of the subsequent instructions) in the
Assembly view. In rare cases, this can also lead to wrong attribution
of running time in the source - the time may be erroneously attributed
to the source line AFTER the actual hot line.
In case the inline mode is ON and the program has small functions
inlined at the hotspots, this can cause the running time to be
attributed to a wrong function since the next instruction can belong
to a different function in tightly inlined code.

Why do booleans take a whole byte? [duplicate]

In C++,
Why is a boolean 1 byte and not 1 bit in size?
Why aren't there types like 4-bit or 2-bit integers?
I miss having the above when writing an emulator for a CPU.
Because the CPU can't address anything smaller than a byte.
From Wikipedia:
Historically, a byte was the number of
bits used to encode a single character
of text in a computer and it is
for this reason the basic addressable
element in many computer
architectures.
So the byte is the basic addressable unit, below which a computer architecture cannot address. And since there (probably) don't exist computers that support a 4-bit byte, you don't have a 4-bit bool, etc.
However, if you could design an architecture whose basic addressable unit is 4 bits, then you would have a 4-bit bool, on that computer only!
Back in the old days when I had to walk to school in a raging blizzard, uphill both ways, and lunch was whatever animal we could track down in the woods behind the school and kill with our bare hands, computers had much less memory available than today. The first computer I ever used had 6K of RAM. Not 6 megabytes, not 6 gigabytes, 6 kilobytes. In that environment, it made a lot of sense to pack as many booleans into an int as you could, and so we would regularly use operations to take them out and put them in.
Today, when people will mock you for having only 1 GB of RAM, and the only place you could find a hard drive with less than 200 GB is at an antique shop, it's just not worth the trouble to pack bits.
The easiest answer is: it's because the CPU addresses memory in bytes and not in bits, and bitwise operations are very slow.
However, it's possible to use bit-size allocation in C++. There's a std::vector<bool> specialization for bit vectors, and also structs taking bit-sized entries (bit fields).
Because a byte is the smallest addressable unit in the language.
But you can make a bool take 1 bit, for example if you have a bunch of them, e.g. in a struct, like this:
struct A
{
    bool a:1, b:1, c:1, d:1, e:1;
};
You could have 1-bit bools and 4- and 2-bit ints. But that would make for a weird instruction set for no performance gain, because it's an unnatural way to look at the architecture. It actually makes sense to "waste" the better part of a byte rather than trying to reclaim that unused data.
The only app that bothers to pack several bools into a single byte, in my experience, is SQL Server.
You can use bit fields to get integers of sub size.
struct X
{
    int val:4;   // 4 bit int.
};
Though it is usually used to map structures to exact hardware-expected bit patterns:
// 1 byte value (on a system where 8 bits is a byte)
struct SomeThing
{
    int p1:4;    // 4 bit field
    int p2:3;    // 3 bit field
    int p3:1;    // 1 bit
};
bool can be one byte -- the smallest addressable size of the CPU -- or it can be bigger. It's not unusual to have bool be the size of int for performance purposes. If for specific purposes (say hardware simulation) you need a type with N bits, you can find a library for that (e.g. the GBL library has a BitSet<N> class). If you are concerned with the size of bool (you probably have a big container), then you can pack bits yourself, or use std::vector<bool>, which will do it for you (be careful with the latter, as it doesn't satisfy container requirements).
Think about how you would implement this at your emulator level...
bool a[10] = {false};
bool &rbool = a[3];
bool *pbool = a + 3;
assert(pbool == &rbool);
rbool = true;
assert(*pbool);
*pbool = false;
assert(!rbool);
Because, in general, the CPU handles memory with 1 byte as the basic unit, although some CPUs, like MIPS, use a 4-byte word.
However, vector deals with bool in a special fashion: with vector<bool>, one bit is allocated for each bool.
The byte is the smallest unit of digital data storage in a computer. In a computer, the RAM has millions of bytes, and any one of them has an address. If it had an address for every bit, a computer could manage 8 times less RAM than it can.
More info: Wikipedia
Even though the minimum possible size is 1 byte, you can pack 8 bits of boolean information into 1 byte:
http://en.wikipedia.org/wiki/Bit_array
The Julia language has BitArray, for example, and I have read about C++ implementations.
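For completeness, a small sketch of doing the packing by hand in C++ (or reach for std::bitset / std::vector<bool> and let the library do it):
#include <cassert>
#include <cstdint>

// Eight boolean flags packed into a single byte.
inline void set_flag(uint8_t& f, int bit)   { f |= static_cast<uint8_t>(1u << bit); }
inline void clear_flag(uint8_t& f, int bit) { f &= static_cast<uint8_t>(~(1u << bit)); }
inline bool test_flag(uint8_t f, int bit)   { return (f >> bit) & 1u; }

int main() {
    uint8_t flags = 0;
    set_flag(flags, 3);
    assert(test_flag(flags, 3));
    clear_flag(flags, 3);
    assert(!test_flag(flags, 3));
}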
Bitwise operations are not 'slow'. AND/OR operations tend to be fast.
The problem is alignment and the work of dealing with it.
CPUs, as the other answers partially note, are generally aligned to read bytes, and RAM/memory is designed the same way.
So packing data to use less memory space has to be arranged explicitly.
As one answer suggested, you can request a specific number of bits per value in a struct. However, what do the CPU and memory do afterwards if the result is not aligned? That would produce unaligned memory: instead of addresses advancing by +1, +2, or +4, there is no +1.5 if you want a value to use half the bits. So the remaining space has to be padded or left blank, and the next aligned slot is simply read; slots are aligned by 1 at minimum and usually by 4 (32-bit) or 8 (64-bit) overall by default. The CPU will generally grab the byte or int value that contains your flags, and then you check or set the ones you need. So you must still define the memory as int, short, byte, or another proper size, but when accessing and setting the value you can explicitly pack the data and store flags within it to save space. Many people are unaware of how this works, or skip the step whenever they have on/off or flag-like values, even though saving space in sent/received memory is quite useful in mobile and other constrained environments.
In the case of splitting an int into bytes it has little value, as you can just define the bytes individually (e.g. int fourBytes; vs. byte byte1; byte byte2; byte byte3; byte byte4;); in that case using an int is redundant. However, in virtual environments like Java that make this easier, most types may be defined as int (numbers, booleans, etc.), so there you could take advantage of dividing an int up and using bytes/bits for an ultra-efficient app that has to send fewer integers of data (aligned by 4). Managing bits could be called redundant, but it is one of many optimizations where bitwise operations are superior even if not always needed; many programs simply store booleans as integers and waste 500%-1000% or so of the memory space anyway.
It still has its uses: combined with other optimizations, for data streams that only carry bytes or a few KB, it can make the difference in whether everything loads at all, or loads fast. Reducing the bytes sent can ultimately benefit you a lot, even if on an everyday internet connection or app you could get away with over-sending tons of data that does not need to be sent. It is definitely something to consider when designing an app for mobile users, and something even big-time corporate apps fail at nowadays, using too much space and imposing loading requirements that could be halved or better. The difference between doing nothing and piling on unknown packages/plugins that require at minimum many hundreds of KB or 1 MB before loading, versus a design built for speed that requires say 1 KB or only a few KB, is that the lean one will load and act faster; you will encounter the users who have data constraints, even if for you loading a wasteful MB or a thousand KB of unneeded data feels fast.

Decoding and matching Chip 8 opcodes in C/C++

I'm writing a Chip 8 emulator as an introduction to emulation and I'm kind of lost. Basically, I've read a Chip 8 ROM and stored it in a char array in memory. Then, following a guide, I use the following code to retrieve the opcode at the current program counter (pc):
// Fetch opcode
opcode = memory[pc] << 8 | memory[pc + 1];
Chip 8 opcodes are 2 bytes each. This is code from a guide which I vaguely understand as adding 8 extra bit spaces to memory[pc] (using << 8) and then merging memory[pc + 1] with it (using |) and storing the result in the opcode variable.
Now that I have the opcode isolated, however, I don't really know what to do with it. I'm using this opcode table and I'm basically lost with regard to matching the hex opcodes I read against the opcode identifiers in that table. Also, I realize that many of the opcodes I'm reading also contain operands (I'm assuming the latter byte?), and that is probably further complicating my situation.
Help?!
Basically once you have the instruction you need to decode it. For example from your opcode table:
if ((inst & 0xF000) == 0x1000)
{
    write_register(pc, (inst & 0x0FFF) << 1);
}
And I'm guessing that, since you are accessing the ROM two bytes per instruction, the address is probably a (16-bit) word address rather than a byte address, so I shifted it left by one (you need to study how those instructions are encoded; the opcode table you provided is inadequate for that without making assumptions).
There is a lot more that has to happen, and I don't know if I wrote anything about it in my github samples. I recommend you create a fetch function for fetching instructions at an address, a read-memory function, a write-memory function, a read-register function, and a write-register function. I recommend your decode-and-execute function decodes and executes only one instruction at a time. Normal execution is then to just call it in a loop; this provides the ability to do interrupts and things like that without a lot of extra work, and it also modularizes your solution. By creating the fetch(), read_mem_byte(), read_mem_word(), etc. functions, you modularize your code (at a slight cost in performance) and make debugging much easier, since you have a single place where you can watch registers or memory accesses and figure out what is or isn't going on.
Based on your question, and where you are in this process, I think the first thing you need to do before writing an emulator is to write a disassembler. Being a fixed-length (16-bit) instruction set, that makes it much, much easier. You can start at some interesting point in the ROM, or at the beginning if you like, and decode everything you see. For example:
if ((inst & 0xF000) == 0x1000)
{
    printf("jmp 0x%04X\n", (inst & 0x0FFF) << 1);
}
With only 35 instructions, that shouldn't take but an afternoon, maybe a whole Saturday, this being your first time decoding instructions (I assume that based on your question). The disassembler becomes the core decoder for your emulator. Replace the printf()s with emulation; even better, leave the printfs and just add code to emulate the instruction execution, so that you can follow the execution. (Same deal: have a disassemble-a-single-instruction function, call it for each instruction; this becomes the foundation for your emulator.)
Your understanding of what that fetch line of code is doing needs to be more than vague; in order to pull off this task you are going to need a strong understanding of bit manipulation.
Also, I would call that line of code you provided buggy, or at least risky. If memory[] is an array of bytes, the compiler might very well perform the left shift using byte-sized math, resulting in a zero; then zero ORed with the second byte results in only the second byte.
Basically a compiler is within its rights to turn this:
opcode = (memory[pc] << 8) | memory[pc + 1];
Into this:
opcode = memory[pc + 1];
Which won't work for you at all. A very quick fix:
opcode = memory[pc + 0];
opcode <<= 8;
opcode |= memory[pc + 1];
Will save you some headaches. Minimal optimization will keep the compiler from storing the intermediate results to RAM for each operation, resulting in the same (desired) output/performance.
The instruction set simulators I wrote and mentioned above are not intended for performance but for readability and visibility, and hopefully to be educational. I would start with something like that; then, if performance, for example, is of interest, you will have to re-write it. This Chip-8 emulator, once you have the experience, would be an afternoon task from scratch, so once you get through it the first time you could re-write it maybe three or four times in a weekend; not a monumental task to have to re-write. (The thumbulator one took me a weekend for the bulk of it. The msp430 one was probably more like an evening or two worth of work. Getting the overflow flag right, once and for all, was the biggest task, and that came later.)
Anyway, the point being: look at things like the MAME sources. Most if not all of those instruction set simulators are designed for execution speed, and many are barely readable without a fair amount of study; often heavily table-driven, sometimes with lots of C programming tricks, etc. Start with something manageable, get it functioning properly, then worry about improving it for speed or size or portability or whatever.
This Chip-8 thing looks to be graphics-based, so you are also going to have to deal with a lot of line drawing and other bit manipulation on a bitmap/screen/wherever, or you could just call API or operating system functions. Basically, Chip-8 is not your traditional instruction set with registers and a laundry list of addressing modes and ALU operations.
Basically -- Mask out the variable part of the opcode, and look for a match. Then use the variable part.
For example 1NNN is the jump. So:
int a = opcode & 0xF000;
int b = opcode & 0x0FFF;
if (a == 0x1000)
    doJump(b);
Then the game is to make that code fast or small, or elegant, if you like. Good clean fun!
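Putting that mask-and-match idea into a switch on the top nibble gives a skeleton like the following sketch; doJump, doCall, and doSkipIfEqual are placeholders for whatever handlers you write:
#include <cstdint>

// Assumed handlers; write these however you like.
void doJump(uint16_t addr);
void doCall(uint16_t addr);
void doSkipIfEqual(int reg, uint8_t value);

void decode(uint16_t opcode) {
    switch (opcode & 0xF000) {                 // top nibble selects the family
    case 0x1000:                               // 1NNN: jump to NNN
        doJump(opcode & 0x0FFF);
        break;
    case 0x2000:                               // 2NNN: call subroutine at NNN
        doCall(opcode & 0x0FFF);
        break;
    case 0x3000:                               // 3XNN: skip next if VX == NN
        doSkipIfEqual((opcode & 0x0F00) >> 8, opcode & 0x00FF);
        break;
    // ... remaining families; 0x8000 etc. need a second switch on the low nibble
    default:
        break;                                 // unknown/unhandled opcode
    }
}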
Different CPUs store values in memory differently. Big endian machines store a number like $FFCC in memory in that order FF,CC. Little-endian machines store the bytes in reverse order CC, FF (that is, with the "little end" first).
The CHIP-8 architecture is big endian, so the code you will run has the instructions and data written in big endian.
In your statement "opcode = memory[pc] << 8 | memory[pc + 1];", it doesn't matter if the host CPU (the CPU of your computer) is little endian or big endian. It will always put a 16-bit big endian value into an integer in the correct order.
There are a couple of resources that might help: http://www.emulator101.com gives a CHIP-8 emulator tutorial along with some general emulator techniques. This one is good too: http://www.multigesture.net/articles/how-to-write-an-emulator-chip-8-interpreter/
You're going to have to setup a bunch of different bit masks to get the actual opcode from the 16-bit word in combination with a finite state machine in order to interpret those opcodes since it appears that there are some complications in how the opcodes are encoded (i.e., certain opcodes have register identifiers, etc., while others are fairly straight-forward with a single identifier).
Your finite state machine can basically do the following:
Get the first nibble of the opcode using a mask like 0xF000. This will allow you to "categorize" the opcode.
Based on the function category from step 1, apply more masks to either get the register values from the opcode, or whatever other variables might be encoded with the opcode that will narrow down the actual function that needs to be called, as well as its arguments.
Once you have the opcode and the variable information, do a look-up into a fixed-length table of functions that have the appropriate handlers to coincide with the opcode functionality and the variables that go along with the opcode. While you can, in your state machine, hard-code the names of the functions that go with each opcode once you've isolated the proper functionality, a table that you initialize with function pointers for each opcode is a more flexible approach that will let you modify the code's functionality more easily (i.e., you could easily swap between debug handlers and "normal" handlers, etc.).
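A minimal sketch of that table-driven approach, with illustrative names only (the handler set and registration are up to you):
#include <cstdint>

// One handler per top nibble; each handler finishes decoding its own family.
using OpcodeHandler = void (*)(uint16_t opcode);

void op_sys(uint16_t);    // 0x0***: CLS, RET, ...
void op_jump(uint16_t);   // 0x1NNN
void op_call(uint16_t);   // 0x2NNN
void op_alu(uint16_t);    // 0x8XYN: does a second lookup on the low nibble

OpcodeHandler handlers[16] = {
    op_sys, op_jump, op_call,
    nullptr, nullptr, nullptr, nullptr, nullptr,                    // 0x3..0x7, fill these in
    op_alu,
    nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr,  // 0x9..0xF, fill these in
};

void execute(uint16_t opcode) {
    OpcodeHandler h = handlers[(opcode & 0xF000) >> 12];
    if (h) h(opcode);      // swap in debug handlers here if desired
}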

Questions Regarding the Implementation of a Simple CPU Emulator

Background Information: Ultimately, I would like to write an emulator of a real machine such as the original Nintendo or Gameboy. However, I decided that I need to start somewhere much, much simpler. My computer science advisor/professor offered me the specifications for a very simple imaginary processor that he created to emulate first. There is one register (the accumulator) and 16 opcodes. Each instruction consists of 16 bits, the first 4 of which contain the opcode, the rest of which is the operand. The instructions are given as strings in binary format, e.g., "0101 0101 0000 1111".
My Question: In C++, what is the best way to parse the instructions for processing? Please keep my ultimate goal in mind. Here are some points I've considered:
I can't just process and execute the instructions as I read them because the code is self-modifying: an instruction can change a later instruction. The only way I can see to get around this would be to store all changes and, for each instruction, to check whether a change needs to be applied. This could lead to a massive amount of comparisons with the execution of each instruction, which isn't good. And so, I think I have to recompile the instructions into another format.
Although I could parse the opcode as a string and process it, there are instances where the instruction as a whole has to be taken as a number. The increment opcode, for example, could modify even the opcode section of an instruction.
If I were to convert the instructions to integers, I'm not sure then how I could parse just the opcode or operand section of the int. Even if I were to recompile each instruction into three parts (the whole instruction as an int, the opcode as an int, and the operand as an int), that still wouldn't solve the problem, as I might have to increment an entire instruction and later parse the affected opcode or operand. Moreover, would I have to write a function to perform this conversion, or is there some library for C++ that has a function to convert a string in "binary format" to an integer (like Integer.parseInt(str1, 2) in Java)?
Also, I would like to be able to perform operations such as shifting bits. I'm not sure how that can be achieved, but that might affect how I implement this recompilation.
Thank you for any help or advice you can offer!
Parse the original code into an array of integers. This array is your computer's memory.
Use bitwise operations to extract the various fields. For instance, this:
unsigned int x = 0xfeed;
unsigned int opcode = (x >> 12) & 0xf;
will extract the topmost four bits (0xf, here) from a 16-bit value stored in an unsigned int. You can then use e.g. switch() to inspect the opcode and take the proper action:
#include <stdio.h>   /* for fprintf */

enum { OP_ADD = 0 };

unsigned int execute(int *memory, unsigned int pc)
{
    const unsigned int opcode = (memory[pc++] >> 12) & 0xf;

    switch (opcode)
    {
    case OP_ADD:
        /* Do whatever the ADD instruction's definition mandates. */
        return pc;
    default:
        fprintf(stderr, "** Non-implemented opcode %x found in location %x\n", opcode, pc - 1);
    }
    return pc;
}
Modifying memory is just a case of writing into your array of integers, perhaps also using some bitwise math if needed.
I think the best approach is to read the instructions, convert them to unsigned integers, and store them into memory, then execute them from memory.
Once you've parsed the instructions and stored them to memory, self-modification is much easier than storing a list of changes for each instruction. You can just change the memory at that location (assuming you don't ever need to know what the old instruction was).
Since you're converting the instructions to integers, this problem is moot.
To parse the opcode and operand sections, you'll need to use bit shifting and masking. For example, to get the opcode, you shift down by 12 bits and mask off everything but the low 4 bits ((instruction >> 12) & 0xF). You can use a mask to get the operand too.
You mean your machine has instructions that shift bits? That shouldn't affect how you store the operands. When you get to executing one of those instructions, you can just use the C++ bit-shifting operators << and >>.
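On the string-conversion question from the original post: the standard library can do it. A sketch, assuming the instruction text looks like "0101 0101 0000 1111" with spaces between nibbles:
#include <algorithm>
#include <cstdint>
#include <string>

// Convert "0101 0101 0000 1111" to the 16-bit value 0x550F.
uint16_t parse_instruction(std::string text) {
    // Strip the spaces, then parse base 2 (std::bitset<16>(text).to_ulong() also works).
    text.erase(std::remove(text.begin(), text.end(), ' '), text.end());
    return static_cast<uint16_t>(std::stoul(text, nullptr, 2));
}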
Just in case it helps, here's the last CPU emulator I wrote in C++. Actually, it's the only emulator I've written in C++.
The spec's language is slightly idiosyncratic but it's a perfectly respectable, simple VM description, possibly quite similar to your prof's VM:
http://www.boundvariable.org/um-spec.txt
Here's my (somewhat over-engineered) code, which should give you some ideas. For instance it shows how to implement mathematical operators, in the Giant Switch Statement in um.cpp:
http://www.eschatonic.org/misc/um.zip
You can maybe find other implementations for comparison with a web search, since plenty of people entered the contest (I wasn't one of them: I did it much later). Although not many in C++ I'd guess.
If I were you, I'd only store the instructions as strings to start with, if that's the way that your virtual machine specification defines operations on them. Then convert them to integers as needed, every time you want to execute them. It'll be slow, but so what? Yours isn't a real VM that you're going to be using to run time-critical programs, and a dog-slow interpreter still illustrates the important points you need to know at this stage.
It's possible though that the VM actually defines everything in terms of integers, and the strings are just there to describe the program when it's loaded into the machine. In that case, convert the program to integers at the start. If the VM stores programs and data together, with the same operations acting on both, then this is the way to go.
The way to choose between them is to look at the opcode which is used to modify the program. Is the new instruction supplied to it as an integer, or as a string? Whichever it is, the simplest thing to start with is probably to store the program in that format. You can always change later once it's working.
In the case of the UM described above, the machine is defined in terms of "platters" with space for 32 bits. Clearly these can be represented in C++ as 32-bit integers, so that's what my implementation does.
I created an emulator for a custom cryptographic processor. I exploited the polymorphism of C++ by creating a tree of base classes:
#include <cstddef>
#include <string>

struct Instruction // Contains common methods & data to all instructions.
{
    virtual ~Instruction() = default;   // virtual destructor so subclasses can be deleted through Instruction*

    virtual void execute(void) = 0;
    virtual size_t get_instruction_size(void) const = 0;
    virtual unsigned int get_opcode(void) const = 0;
    virtual const std::string& get_instruction_name(void) = 0;
};

class Math_Instruction
    : public Instruction
{
    // Operations common to all math instructions;
};

class Branch_Instruction
    : public Instruction
{
    // Operations common to all branch instructions;
};

class Add_Instruction
    : public Math_Instruction
{
};
I also had a couple of factories. At least two would be useful:
Factory to create instruction from text.
Factory to create instruction from opcode.
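A rough sketch of the second kind of factory (create from opcode), assuming the Instruction hierarchy shown above; the names and registration scheme are illustrative only:
#include <map>
#include <memory>

// Maps an opcode to a function that constructs the matching Instruction subclass.
// Assumes the Instruction hierarchy shown above.
class Instruction_Factory {
public:
    using Creator = std::unique_ptr<Instruction> (*)();

    void register_opcode(unsigned int opcode, Creator create) {
        creators_[opcode] = create;
    }

    std::unique_ptr<Instruction> create(unsigned int opcode) const {
        auto it = creators_.find(opcode);
        if (it == creators_.end())
            return nullptr;            // unknown opcode
        return it->second();           // build the matching subclass
    }

private:
    std::map<unsigned int, Creator> creators_;
};
register_opcode would be called once at start-up per opcode, with a small function returning a concrete subclass (e.g. an Add_Instruction that actually implements the pure virtual methods).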
The instruction classes should have methods to load their data from an input source (e.g. std::istream) or text (std::string). The corollary methods of output should also be supported (such as instruction name and opcode).
I had the application create objects from an input file and place them into a vector of Instruction objects. The executor method would run the execute() method of each instruction in the array. This action trickled down to the instruction leaf object, which performed the detailed execution.
There are other global objects that may need emulation as well. In my case some included the data bus, registers, ALU and memory locations.
Please spend more time designing and thinking about the project before you code it. I found it quite a challenge, especially implementing a single-step capable debugger and GUI.
Good Luck!