I would like to create an OpenCL kernel without giving access to it to the end user.
Therefore, I can't use a regular external .cl text file. What are the alternatives, given that I would like to avoid creating a huge text string with the kernel?
And another question: if I put this code in a hardcoded string, won't it be possible to access that code with some disassembler?
Here you have 2 scenarios:
If you are targeting one single device
If you are targeting any OpenCL device
In the first scenario, you can embed the compiled binary into your executable (as a string) and load it when you run the program.
There would be no reverse engineering possible (apart from the already known techniques, like reading the assembly), since the program will contain compiled code rather than the original code you wrote.
The way of doing that would be:
uchar binary_dev1[binarySize] = "...";  // embedded device binary
uchar *binary = binary_dev1;            // pointer to the binary blob
program = clCreateProgramWithBinary(context, 1, &device,
                                    &binarySize,
                                    (const unsigned char **)&binary,
                                    &binaryStatus,
                                    &errNum);
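To obtain such a binary in the first place, one option is to build the program from source once on the target device and dump the result with clGetProgramInfo. A minimal sketch, assuming a single device and an already built program object (dump_program_binary is just an illustrative helper name):
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Dump the device binary of an already built program (single-device case). */
void dump_program_binary(cl_program program, const char *path)
{
    size_t binarySize = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binarySize), &binarySize, NULL);

    unsigned char *binary = (unsigned char *)malloc(binarySize);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binary), &binary, NULL);

    FILE *f = fopen(path, "wb");   // this dump becomes the string embedded above
    fwrite(binary, 1, binarySize, f);
    fclose(f);
    free(binary);
}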
The second alternative involves protecting the kernel source code by some sort of "mangling".
Since the mangler code is going to be compiled, reverse engineering it could be complicated.
You can do any mangling you can think of that is reversible, and even combine them. Some ideas:
Compress the code using a compression format (LZ4, zlib, etc.), but hardcode some parameters of the decompression to make it less straightforward.
Use an XOR operator on the code. Better if it varies over time, and better if it varies using a non-obvious rule.
For example:
char seq = 0x1A;
for (int i = 0; i < len; i++) {
    out[i] = in[i] ^ seq;
    // note the explicit parentheses: + binds tighter than << and >>
    seq = (((seq ^ i) * 78965213) >> 4) + (((seq * i) * 56987) << 4);
}
Encode it using an encoding method that requires a key and is reversible.
Use a program that protects your program binary towards reverse engineering, like Themida.
Use SPIR 1.2 for OpenCL 1.2 or SPIR 2.0 for OpenCL 2.0 until SPIR-V for OpenCL 2.1 is available.
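For the SPIR-V route (OpenCL 2.1+), loading the intermediate representation is a single call. A minimal sketch, assuming spirv_blob and spirv_size hold a pre-compiled SPIR-V module and context/device are set up as in the snippet above:
cl_int err;
cl_program program = clCreateProgramWithIL(context, spirv_blob, spirv_size, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);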
An open source C++/Qt app I'm interested in depends on CUDA. My macbook pro (mid 2014) has the stock Intel Iris Pro, and no NVidia graphics card. Naturally, the pre-built app won't run.
I found this emulator: https://github.com/gtcasl/gpuocelot - but it's only tested against Linux, and there are several open issues about it not compiling on the Mac.
I have the source - can I replace the CUDA dependency with c++ equivalents, at the cost of slower processing? I'm hoping for something like
rename file extensions: .cu to .cpp
remove CUDA references from make file
replace CUDA headers with equivalent c++ std lib headers
adjust makefile, adding missing library references as needed
fix remaining missing function calls (hopefully only one or two) with c++ code (possibly cribbed from Ocelot)
But I'm afraid it's not that simple. I'd like a sanity check before I begin.
In the general case, I don't think there is a specific roadmap to "de-CUDA-fy" an application. Just as I don't think there is a specific "mechanical" roadmap to "CUDA-fy" an application, nor do I find specific roadmaps for programming problems in general.
Furthermore, I think the proposed roadmap has flaws. To pick just one example, a .cu file will normally have CUDA-specific references that will not be tolerated by an ordinary c++ compiler compiling a .cpp file. Some of these references may be items that depend on the CUDA runtime API, such as cudaMalloc and cudaMemcpy; although these could be made to pass through an ordinary c++ compiler (they are just library calls), it would not be sensible to leave them in place in an application that has had its CUDA character removed. Other references may be CUDA-specific language features, such as declaration of device code via __global__ or __device__, or launching of a device "kernel" function with its corresponding <<<...>>> syntax. These cannot be made to pass through an ordinary c++ compiler and would have to be dealt with specifically, and simply deleting those CUDA keywords and syntax would be very unlikely to produce useful results.
In short, the code would have to be refactored; there is no reasonably concise roadmap that explains a more-or-less mechanical process to do so. I suggest the complexity of the refactoring process would be approximately the same complexity as the original process (if there was one) to convert a non-CUDA version of the code to a CUDA version. At a minimum, some non-mechanical knowledge of CUDA programming would be required in order to understand the CUDA constructs.
For very simple CUDA codes, it might be possible to lay out a somewhat mechanical process to de-CUDA-fy the code. To recap, the basic CUDA processing sequence is as follows:
allocate space for data on the device (perhaps with cudaMalloc) and copy data to the device (perhaps with cudaMemcpy)
launch a function that runs on the device (a __global__ or "kernel" function) to process the data and create results
copy results back from the device (perhaps, again, with cudaMemcpy)
Therefore, a straightforward approach would be to:
eliminate the cudaMalloc/cudaMemcpy operations, thus leaving the data of interest in its original form, on the host
convert the cuda processing functions (kernels) to ordinary c++ functions, that perform the same operation on the host data
Since CUDA is a parallel processing architecture, one approach to convert an inherently parallel CUDA "kernel" code to ordinary c++ code (step 2 above) would be to use a loop or a set of loops. But beyond that the roadmap tends to get quite divergent, depending on what the code is actually doing. In addition, inter-thread communication, non-transformational algorithms (such as reductions), and use of CUDA intrinsics or other language specific features will considerably complicate step 2.
For example, let's take a very simple vector ADD code. The CUDA kernel code for this would be distinguished by a number of characteristics that would make it easy to convert to or from a CUDA realization:
There is no inter-thread communication. The problem is "embarrassingly parallel": the work done by each thread is independent of all other threads. This describes only a limited subset of CUDA codes.
There is no need or use of any CUDA specific language features or intrinsics (other than a globally unique thread index variable), so the kernel code is recognizable as almost completely valid c++ code already. Again, this characteristic probably describes only a limited subset of CUDA codes.
So the CUDA version of the vector add code might look like this (drastically simplified for presentation purposes):
#include <stdio.h>
#define N 512
// perform c = a + b vector add
__global__ void vector_add(const float *a, const float *b, float *c){
  int idx = threadIdx.x;
  c[idx] = a[idx] + b[idx];
}

int main(){

  float a[N] = {1};
  float b[N] = {2};
  float c[N] = {0};
  float *d_a, *d_b, *d_c;
  int dsize = N*sizeof(float);

  cudaMalloc(&d_a, dsize); // step 1 of CUDA processing sequence
  cudaMalloc(&d_b, dsize);
  cudaMalloc(&d_c, dsize);
  cudaMemcpy(d_a, a, dsize, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, b, dsize, cudaMemcpyHostToDevice);
  vector_add<<<1,N>>>(d_a, d_b, d_c); // step 2
  cudaMemcpy(c, d_c, dsize, cudaMemcpyDeviceToHost); // step 3
  for (int i = 0; i < N; i++) if (c[i] != a[i]+b[i]) {printf("Fail!\n"); return 1;}
  printf("Success!\n");
  return 0;
}
We see that the above code follows the typical CUDA processing sequence 1-2-3 and the beginning of each step is marked in the comments. So our "de-CUDA-fy" roadmap is, again:
eliminate the cudaMalloc/cudaMemcpy operations, thus leaving the data of interest in its original form, on the host
convert the cuda processing functions (kernels) to ordinary c++ functions, that perform the same operation on the host data
For step 1, we will literally just delete the cudaMalloc and cudaMemcpy lines, and we will instead plan to operate directly on the a[], b[] and c[] variables in the host code. The remaining step, then, is to convert the vector_add CUDA "kernel" function to an ordinary c++ function. Again, some knowledge of CUDA fundamentals is necessary to understand the extent of the operation being performed in parallel. But the kernel code itself (other than the use of the threadIdx.x built-in CUDA variable) is completely valid c++ code, and there is no inter-thread communication or other complicating factors. So an ordinary c++ realization could just be the kernel code, placed into a suitable for-loop iterating over the parallel extent (N in this case), and placed into a comparable c++ function:
void vector_add(const float *a, const float *b, float *c){
  for (int idx = 0; idx < N; idx++)
    c[idx] = a[idx] + b[idx];
}
Combining the above steps, we need to (in this trivial example):
delete the cudaMalloc and cudaMemcpy operations
replace the cuda kernel code with a similar, ordinary c++ function
fixup the kernel invocation in main to be an ordinary c++ function call
Which gives us:
#include <stdio.h>
#define N 512
// perform c = a + b vector add
void vector_add(const float *a, const float *b, float *c){
  for (int idx = 0; idx < N; idx++)
    c[idx] = a[idx] + b[idx];
}

int main(){

  float a[N] = {1};
  float b[N] = {2};
  float c[N] = {0};
  vector_add(a, b, c);
  for (int i = 0; i < N; i++) if (c[i] != a[i]+b[i]) {printf("Fail!\n"); return 1;}
  printf("Success!\n");
  return 0;
}
The point of working through this example is not to suggest the process will be in general this trivially simple. But hopefully it is evident that the process is not a purely mechanical one, but depends on some knowledge of CUDA and also requires some actual code refactoring; it cannot be done simply by changing file extensions and modifying a few function calls.
A few other comments:
Many laptops are available which have CUDA-capable (i.e. NVIDIA) GPUs in them. If you have one of these (I realize you don't but I'm including this for others who may read this), you can probably run CUDA codes on it.
If you have an available desktop PC, it's likely that for less than $100 you could add a CUDA-capable GPU to it.
Trying to leverage emulation technology IMO is not the way to go here, unless you can use it in a turnkey fashion. Cobbling bits and pieces from an emulator into an application of your own is a non-trivial exercise, in my opinion.
I believe in the general case, conversion of a CUDA code to a corresponding OpenCL code will not be trivial either. (The motivation here is that there is a lot of similarity between CUDA and OpenCL, and an OpenCL code could probably be made to run on your laptop, as OpenCL codes can usually be run on a variety of targets, including CPUs and GPUs). There are enough differences between the two technologies that it requires some effort, and this brings the added burden of requiring some level of familiarity with both OpenCL and CUDA, and the thrust of your question seems to be wanting to avoid those learning curves.
I have a 3D volume, represented as a vector of vector of vector of float, that I want to save to a binary file. (It's a density volume reconstructed from X-ray images that come from a CT scanner.)
Now, I could do this in the following way:
//iterate through the volume
for (int x = 0; x < _xSize; ++x){
  for (int y = 0; y < _ySize; ++y){
    for (int z = 0; z < _zSize; ++z){
      //save one float of data
      stream.write((char*)&_volume[x][y][z], sizeof(float));
    }
  }
}
This basically works. However, I'm asking myself to what extent this is platform independent. I would like to produce a file which is identical regardless of the system it was created on. The machines might be running Windows, Linux or Mac, they might have 32-bit or 64-bit word length, and they might use little-endian or big-endian byte order.
I suppose if I did this the way it was done above this wouldn't be the case. Now how could I achieve this? I've heard about serialisation but I haven't found a concrete solution for this instance.
Google Protocol Buffers: free, encodes to binary, available in several languages, and works across most platforms. For your requirements I would seriously consider GPB. Be careful though: Google have released several versions and they've not always been backward compatible, i.e. old data is not necessarily readable by new versions of GPB code. I feel that it's still evolving and further changes will happen, which could be a nuisance if your project is also going to evolve over many years.
ASN.1, the grandpa of them all, has a very good schema language (value and size constraints can be set, which is a terrific way of avoiding buffer overruns and gives automatic validation of data streams, provided the auto-generated code is correct). There are some free tools (see this page), though most cost money. GPB's schema language is kind of a poor imitation of ASN.1's.
I solved the problem using the Qt QDataStream class. Qt is part of my project anyway, so the additional effort is minimal. I can tell the QDataStream object exactly whether I want to save my floats using single precision (32-bit) or double precision (64-bit), and whether I want little-endian or big-endian byte order. This is totally sufficient for what I need; I don't need to serialize objects. The files I save now have exactly the same format on all platforms (at least they should), and this is all I need. They will afterwards be read by 3rd-party applications to which this information (byte order, precision) will be supplied. In other words, what matters is not exactly how my floats are saved, but that I know how they are saved and that this is consistent no matter which platform the program runs on.
Here is how the code looks now:
QDataStream out(&file);
out.setFloatingPointPrecision(QDataStream::SinglePrecision);
out.setByteOrder(QDataStream::LittleEndian);
for (int x = 0; x < _xSize; ++x){
  for (int y = 0; y < _ySize; ++y){
    for (int z = 0; z < _zSize; ++z){
      //save one float of data
      out << _volume[x][y][z];
    }
  }
}
I'm surprised there is no mention of the <rpc/xdr.h> header, for External Data Representation. I believe it is on all unixes, and may even work on Windows: https://github.com/ralight/oncrpc-windows/blob/master/win32/include/rpc/xdr.h
XDR stores all primitive data types in big endian, and takes care of the conversions for you.
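A minimal sketch of writing floats through XDR, assuming the volume has already been flattened into a plain float array (xdrstdio_create, xdr_float and xdr_destroy are the standard ONC RPC XDR calls; the function name is just illustrative):
#include <cstdio>
#include <rpc/xdr.h>

// Write 'count' floats in XDR (big-endian IEEE-754) format.
bool write_xdr_floats(const char *path, float *data, unsigned int count)
{
    FILE *fp = std::fopen(path, "wb");
    if (!fp) return false;

    XDR xdrs;
    xdrstdio_create(&xdrs, fp, XDR_ENCODE);

    bool ok = true;
    for (unsigned int i = 0; i < count && ok; ++i)
        ok = xdr_float(&xdrs, &data[i]);   // converts to the portable representation

    xdr_destroy(&xdrs);
    std::fclose(fp);
    return ok;
}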
For the time being I am developing a C++ program based on some MATLAB code. During development I need to output intermediate results to MATLAB in order to compare the C++ implementation with the MATLAB result. What I am doing now is writing a binary file with C++ and then loading the binary file with MATLAB. The following code shows an example:
int main ()
{
    ofstream abcdef;
    abcdef.open("C:/test.bin", ios::out | ios::trunc | ios::binary);
    for (int i = 0; i < 10; i++)
    {
        float x_cord;
        x_cord = i*1.38;
        float y_cord;
        y_cord = i*10;
        abcdef << x_cord << " " << y_cord << endl;
    }
    abcdef.close();
    return 0;
}
When I have the file test.bin, I can load the file automatically with MATLAB command:
data = load('test.bin');
This method works well when the output is numerical data; however, it could fail if the output is a class with many member variables. I was wondering whether there are better ways to do the job, not only for simple numerical data but also for complicated data structures. Thanks!
I would suggest using the MATLAB engine, through which you can pass data to MATLAB in real time and even visualize the data using the various graph-plotting facilities available in MATLAB.
All you have to do is invoke the MATLAB engine from your C/C++ program; then you can execute MATLAB commands directly from the C/C++ program and/or exchange data between MATLAB and C/C++, in both directions.
You can have a look at a working example for the same as shown here.
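A minimal sketch of this approach using the Engine C API (engOpen, engPutVariable and engEvalString are the documented entry points; the variable name x_cord is just a placeholder):
#include <cstring>
#include "engine.h"   // MATLAB Engine API; link against libeng and libmx

int main()
{
    Engine *ep = engOpen("");           // start or connect to a MATLAB session
    if (!ep) return 1;

    double data[10];
    for (int i = 0; i < 10; i++) data[i] = i * 1.38;

    // Copy the C array into an mxArray and hand it to the MATLAB workspace.
    mxArray *mx = mxCreateDoubleMatrix(1, 10, mxREAL);
    std::memcpy(mxGetPr(mx), data, sizeof(data));
    engPutVariable(ep, "x_cord", mx);

    engEvalString(ep, "plot(x_cord); title('data from C++');");

    mxDestroyArray(mx);
    engClose(ep);
    return 0;
}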
I would suggest using the fread command in MATLAB. I do this all the time for exchanging data between MATLAB and other programs. For instance:
fd = fopen('datafile.bin','r');
a = fread(fd,3,'*uint32');
b = fread(fd,1,'float32');
With fread you have all the flexibility to read any type of data. By placing a * in the type name, as above, you also say that you want to store the values in that data type instead of the default MATLAB data type. So the first call reads in 3 32-bit unsigned integers and stores them as integers. The second reads in a single-precision floating point number, but stores it as the default double precision.
You need to control the way that data is written in your c++ code, but that is inevitable. You can make a class method in c++ that packs the data in a deterministic way.
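As an illustration, a minimal sketch of a C++ writer matching the fread calls above (assuming both machines use the same, typically little-endian, byte order; the file name and values are just placeholders):
#include <cstdint>
#include <fstream>

int main()
{
    std::ofstream out("datafile.bin", std::ios::binary);

    uint32_t header[3] = {1, 2, 3};   // read back with fread(fd,3,'*uint32')
    float value = 1.38f;              // read back with fread(fd,1,'float32')

    out.write(reinterpret_cast<const char*>(header), sizeof(header));
    out.write(reinterpret_cast<const char*>(&value), sizeof(value));
    return 0;
}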
Given the arrays:
int canvas[10][10];
int addon[10][10];
Where all the values range from 0 to 100, what is the fastest way in C++ to add those two arrays so that each cell in canvas equals itself plus the corresponding cell value in addon?
IE, I want to achieve something like:
canvas += another;
So if canvas[0][0] =3 and addon[0][0] = 2 then canvas[0][0] = 5
Speed is essential here as I am writing a very simple program to brute force a knapsack type problem and there will be tens of millions of combinations.
And as a small extra question (thanks if you can help!) what would be the fastest way of checking if any of the values in canvas exceed 100? Loops are slow!
Here is an SSE4 implementation that should perform pretty well on Nehalem (Core i7):
#include <limits.h>
#include <emmintrin.h>
#include <smmintrin.h>
static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i * cp = (__m128i *)&canvas[0][0];
    const __m128i * ap = (__m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    for (i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_loadu_si128(cp);
        __m128i va = _mm_loadu_si128(ap);

        vc = _mm_add_epi32(vc, va);     // element-wise add of 4 ints
        vmax = _mm_max_epi32(vmax, vc); // track running max - SSE4 *

        _mm_storeu_si128(cp, vc);

        cp++;
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit); // compare max against 100 (SSE2)
    cmp = _mm_testz_si128(vcmp, vcmp);    // SSE4 *
    return cmp == 0;                      // 1 if any element exceeds 100
}
Compile with gcc -msse4.1 ... or equivalent for your particular development environment.
For older CPUs without SSE4 (and with much more expensive misaligned loads/stores) you'll need to (a) use a suitable combination of SSE2/SSE3 intrinsics to replace the SSE4 operations (marked with an * above) and ideally (b) make sure your data is 16-byte aligned and use aligned loads/stores (_mm_load_si128/_mm_store_si128) in place of _mm_loadu_si128/_mm_storeu_si128.
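For reference, one possible SSE2-only substitution for the two starred operations (a sketch under the assumption that correctness, not peak speed, is the goal): _mm_max_epi32 can be emulated with a compare-and-select, and the final _mm_testz_si128 test can be replaced with _mm_movemask_epi8.
// SSE2 emulation of _mm_max_epi32: select a where a > b, else b.
static inline __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    __m128i mask = _mm_cmpgt_epi32(a, b);
    return _mm_or_si128(_mm_and_si128(mask, a), _mm_andnot_si128(mask, b));
}

// SSE2 replacement for the final test: any set byte in vcmp means "limit exceeded".
// int exceeded = _mm_movemask_epi8(vcmp) != 0;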
You can't do anything faster than loops in plain C++. You would need to use platform-specific vector instructions; that is, you would need to go down to the assembly-language level. However, there are some C++ libraries that try to do this for you, so you can write at a high level and have the library take care of the low-level SIMD work that is appropriate for whatever architecture you are targeting with your compiler.
MacSTL is a library that you might want to look at. It was originally a Macintosh specific library, but it is cross platform now. See their home page for more info.
The best you're going to do in standard C or C++ is to recast that as a one-dimensional array of 100 numbers and add them in a loop. (Single subscripts will use a bit less processing than double ones, unless the compiler can optimize it out. The only way you're going to know how much of an effect there is, if there is one, is to test.)
You could certainly create a class where the addition would be one simple C++ instruction (canvas += addon;), but that wouldn't speed anything up. All that would happen is that the simple C++ instruction would expand into the loop above.
You would need to get into lower-level processing in order to speed that up. There are additional instructions on many modern CPUs to do such processing that you might be able to use. You might be able to run something like this on a GPU using something like Cuda. You could try making the operation parallel and running on several cores, but on such a small instance you'll have to know how caching works on your CPU.
The alternatives are to improve your algorithm (on a knapsack-type problem, you might be able to use dynamic programming in some way - without more information from you, we can't tell you), or to accept the performance. Tens of millions of operations on a 10 by 10 array turn into billions of operations on numbers, and that's not as intimidating as it used to be. Of course, I don't know your usage scenario or performance requirements.
Two parts: first, consider your two-dimensional array [10][10] as a single array [100]. The layout rules of C++ should allow this. Second, check your compiler for intrinsic functions implementing some form of SIMD instructions, such as Intel's SSE. For example Microsoft supplies a set. I believe SSE has some instructions for checking against a maximum value, and even clamping to the maximum if you want.
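A minimal sketch of the first part (treating the 10x10 arrays as flat arrays of 100 ints), with the limit check folded into the same pass; a decent compiler will often auto-vectorize this loop on its own (the function name is just illustrative):
// Add 'addon' into 'canvas' element-wise and report whether any result exceeds 100.
bool add_and_check(int canvas[10][10], int addon[10][10])
{
    int *c = &canvas[0][0];        // view the 10x10 arrays as flat arrays of 100 ints
    const int *a = &addon[0][0];
    bool exceeded = false;
    for (int i = 0; i < 100; ++i) {
        c[i] += a[i];
        exceeded |= (c[i] > 100);
    }
    return exceeded;
}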
Here is an alternative.
If you are 100% certain that all your values are between 0 and 100, you could change your type from an int to a uint8_t. Then you could add 4 of them together at once using a uint32_t, without worrying about overflow.
That is ...
uint8_t array1[10][10];
uint8_t array2[10][10];
uint8_t dest[10][10];

uint32_t *pArr1 = (uint32_t *) &array1[0][0];
uint32_t *pArr2 = (uint32_t *) &array2[0][0];
uint32_t *pDest = (uint32_t *) &dest[0][0];

for (int i = 0; i < sizeof (dest) / sizeof (uint32_t); i++) {
    pDest[i] = pArr1[i] + pArr2[i];
}
It may not be the most elegant, but it could help keep you from going to architecture specific code. Additionally, if you were to do this, I would strongly recommend you comment what you are doing and why.
You should check out CUDA. This kind of problem is right up CUDA's street. Recommend the Programming Massively Parallel Processors book.
However, this does require CUDA-capable hardware, and CUDA takes a bit of effort to get set up in your development environment, so it would depend on how important this really is!
Good luck!
Background Information: Ultimately, I would like to write an emulator of a real machine such as the original Nintendo or Gameboy. However, I decided that I need to start somewhere much, much simpler. My computer science advisor/professor offered me the specifications for a very simple imaginary processor that he created to emulate first. There is one register (the accumulator) and 16 opcodes. Each instruction consists of 16 bits, the first 4 of which contain the opcode, the rest of which is the operand. The instructions are given as strings in binary format, e.g., "0101 0101 0000 1111".
My Question: In C++, what is the best way to parse the instructions for processing? Please keep my ultimate goal in mind. Here are some points I've considered:
I can't just process and execute the instructions as I read them because the code is self-modifying: an instruction can change a later instruction. The only way I can see to get around this would be to store all changes and, for each instruction, check whether a change needs to be applied. This could lead to a massive amount of comparisons with the execution of each instruction, which isn't good. And so, I think I have to recompile the instructions into another format.
Although I could parse the opcode as a string and process it, there are instances where the instruction as a whole has to be taken as a number. The increment opcode, for example, could modify even the opcode section of an instruction.
If I were to convert the instructions to integers, I'm not sure then how I could parse just the opcode or operand section of the int. Even if I were to recompile each instruction into three parts, the whole instruction as an int, the opcode as an int, and the operand as an int, that still wouldn't solve the problem, as I might have to increment an entire instruction and later parse the affected opcode or operand. Moreover, would I have to write a function to perform this conversion, or is there some library for C++ that has a function to convert a string in "binary format" to an integer (like Integer.parseInt(str1, 2) in Java)?
Also, I would like to be able to perform operations such as shifting bits. I'm not sure how that can be achieved, but that might affect how I implement this recompilation.
Thank you for any help or advice you can offer!
Parse the original code into an array of integers. This array is your computer's memory.
Use bitwise operations to extract the various fields. For instance, this:
unsigned int x = 0xfeed;
unsigned int opcode = (x >> 12) & 0xf;
will extract the topmost four bits (0xf, here) from a 16-bit value stored in an unsigned int. You can then use e.g. switch() to inspect the opcode and take the proper action:
#include <stdio.h>

enum { OP_ADD = 0 };

unsigned int execute(int *memory, unsigned int pc)
{
    const unsigned int opcode = (memory[pc++] >> 12) & 0xf;

    switch(opcode)
    {
    case OP_ADD:
        /* Do whatever the ADD instruction's definition mandates. */
        return pc;
    default:
        fprintf(stderr, "** Non-implemented opcode %x found in location %x\n", opcode, pc - 1);
    }
    return pc;
}
Modifying memory is just a case of writing into your array of integers, perhaps also using some bitwise math if needed.
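For instance, a one-liner along these lines keeps the opcode field intact while rewriting the operand (addr and new_operand are hypothetical names):
// Replace the operand (low 12 bits) of the instruction at 'addr', keeping the opcode.
memory[addr] = (memory[addr] & 0xF000) | (new_operand & 0x0FFF);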
I think the best approach is to read the instructions, convert them to unsigned integers, and store them into memory, then execute them from memory.
Once you've parsed the instructions and stored them to memory, self-modification is much easier than storing a list of changes for each instruction. You can just change the memory at that location (assuming you don't ever need to know what the old instruction was).
Since you're converting the instructions to integers, this problem is moot.
To parse the opcode and operand sections, you'll need to use bit shifting and masking. For example, to get the opcode, you shift down by 12 bits and mask off everything but the lower 4 bits ((instruction >> 12) & 0xF). You can use a mask to get the operand too.
You mean your machine has instructions that shift bits? That shouldn't affect how you store the operands. When you get to executing one of those instructions, you can just use the C++ bit-shifting operators << and >>.
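Regarding the Integer.parseInt(str1, 2) question: the standard library can do this for you. A minimal sketch using std::bitset (std::stoul with base 2 works as well), assuming the spaces are stripped first:
#include <algorithm>
#include <bitset>
#include <cstdint>
#include <string>

// Convert "0101 0101 0000 1111" into a 16-bit instruction word.
uint16_t parse_instruction(std::string text)
{
    text.erase(std::remove(text.begin(), text.end(), ' '), text.end());
    return static_cast<uint16_t>(std::bitset<16>(text).to_ulong());
}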
Just in case it helps, here's the last CPU emulator I wrote in C++. Actually, it's the only emulator I've written in C++.
The spec's language is slightly idiosyncratic but it's a perfectly respectable, simple VM description, possibly quite similar to your prof's VM:
http://www.boundvariable.org/um-spec.txt
Here's my (somewhat over-engineered) code, which should give you some ideas. For instance it shows how to implement mathematical operators, in the Giant Switch Statement in um.cpp:
http://www.eschatonic.org/misc/um.zip
You can maybe find other implementations for comparison with a web search, since plenty of people entered the contest (I wasn't one of them: I did it much later), although I'd guess not many were in C++.
If I were you, I'd only store the instructions as strings to start with, if that's the way that your virtual machine specification defines operations on them. Then convert them to integers as needed, every time you want to execute them. It'll be slow, but so what? Yours isn't a real VM that you're going to be using to run time-critical programs, and a dog-slow interpreter still illustrates the important points you need to know at this stage.
It's possible though that the VM actually defines everything in terms of integers, and the strings are just there to describe the program when it's loaded into the machine. In that case, convert the program to integers at the start. If the VM stores programs and data together, with the same operations acting on both, then this is the way to go.
The way to choose between them is to look at the opcode which is used to modify the program. Is the new instruction supplied to it as an integer, or as a string? Whichever it is, the simplest thing to start with is probably to store the program in that format. You can always change later once it's working.
In the case of the UM described above, the machine is defined in terms of "platters" with space for 32 bits. Clearly these can be represented in C++ as 32-bit integers, so that's what my implementation does.
I created an emulator for a custom cryptographic processor. I exploited the polymorphism of C++ by creating a tree of base classes:
#include <cstddef>
#include <string>

struct Instruction // Contains common methods & data to all instructions.
{
    virtual ~Instruction() = default;   // virtual destructor for safe polymorphic deletion
    virtual void execute(void) = 0;
    virtual size_t get_instruction_size(void) const = 0;
    virtual unsigned int get_opcode(void) const = 0;
    virtual const std::string& get_instruction_name(void) = 0;
};

class Math_Instruction
    : public Instruction
{
    // Operations common to all math instructions;
};

class Branch_Instruction
    : public Instruction
{
    // Operations common to all branch instructions;
};

class Add_Instruction
    : public Math_Instruction
{
};
I also had a couple of factories. At least two would be useful:
A factory to create an instruction from text.
A factory to create an instruction from an opcode.
The instruction classes should have methods to load their data from an input source (e.g. std::istream) or text (std::string). The corresponding output methods should also be supported (such as instruction name and opcode).
I had the application create objects from an input file and place them into a vector of Instruction pointers. The executor would run the execute() method of each instruction in the vector. This action trickled down to the instruction leaf object, which performed the detailed execution.
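A minimal sketch of that executor loop, assuming create_from_opcode is one of the hypothetical factories mentioned above and Instruction is the base class from the earlier snippet:
#include <memory>
#include <vector>

// Hypothetical factory: decode a raw 16-bit word into a concrete Instruction subclass.
std::unique_ptr<Instruction> create_from_opcode(unsigned int word);

void run(const std::vector<unsigned int>& words)
{
    std::vector<std::unique_ptr<Instruction>> program;
    for (unsigned int w : words)
        program.push_back(create_from_opcode(w));

    // The executor: dispatch trickles down to each leaf instruction class.
    for (auto& instr : program)
        instr->execute();
}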
There are other global objects that may need emulation as well. In my case some included the data bus, registers, ALU and memory locations.
Please spend more time designing and thinking about the project before you code it. I found it quite a challenge, especially implementing a single-step capable debugger and GUI.
Good Luck!