I'm working on a program running on a micro controller and need to implement a self-test for the program code integrity.
For this, I let the code calculate a CRC16 checksum over the whole flash memory (program space) and transmit this value to another system via some network. The other system then has to compare the checksum against a pre-calculated value.
However, with each update, the CRC value changes. So the whole process could be simplified, if the program code can be prepared beforehand, such that the CRC16 checksum always matches a predefined value like 0 or better something like 0x1234.
Is there an easy way to achieve this?
Another way to put this: can I easily calculate a byte sequence, that I would have to add to my programs binary code (for example by changing a static array with dummy data included in the program), so that the CRC16 gives my predefined value?
Can this byte sequence be included anywhere in the code or does it have to be exactly at the end?
(If necessary, I could also implement another checksum algorithm besides CRC-16.)
Thanks for your answers!
Yes, easily. For your n bytes of flash, compute the CRC-16 of the first n-2 bytes, and store that CRC in the last two bytes. Those two bytes would be appended in little-endian order for a reflected CRC, and in big-endian order for a non-reflected CRC. Then the CRC-16 of the n bytes will be a constant. That constant is known as the "residue" of the CRC. For CRC's with no exclusive-or at the end, the residue is always zero. You didn't say what CRC you're using, but you can find the residues of known CRC's (before the final exclusive-or) in Greg Cook's catalog. Or you can just see what you get.
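To see the residue in action, here is a minimal sketch. It assumes CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF, non-reflected, no final exclusive-or), which may not be the CRC you are using, and the names crc16 and seal_image are made up for the illustration. The last two bytes of the image are reserved for the CRC and written in big-endian order because this CRC is non-reflected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CRC-16/CCITT-FALSE, bit by bit: poly 0x1021, init 0xFFFF,
// non-reflected, no final exclusive-or (an assumption for this sketch).
std::uint16_t crc16(const std::uint8_t* data, std::size_t len) {
    std::uint16_t crc = 0xFFFF;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<std::uint16_t>(data[i] << 8);
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 0x8000)
                      ? static_cast<std::uint16_t>((crc << 1) ^ 0x1021)
                      : static_cast<std::uint16_t>(crc << 1);
        }
    }
    return crc;
}

// Overwrite the last two bytes with the CRC of everything before them.
// Big-endian order, because this CRC is non-reflected.
void seal_image(std::vector<std::uint8_t>& image) {
    assert(image.size() >= 2);
    const std::size_t n = image.size();
    const std::uint16_t crc = crc16(image.data(), n - 2);
    image[n - 2] = static_cast<std::uint8_t>(crc >> 8);
    image[n - 1] = static_cast<std::uint8_t>(crc & 0xFF);
}
```

After seal_image, running crc16 over all n bytes should give 0 for this particular CRC, whatever the program contents are. For a reflected CRC you would swap the byte order, and for one with a final exclusive-or the constant will be something other than zero.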
In C++,
Why is a boolean 1 byte and not 1 bit of size?
Why aren't there types like a 4-bit or 2-bit integers?
I miss these when writing an emulator for a CPU.
Because the CPU can't address anything smaller than a byte.
From Wikipedia:
Historically, a byte was the number of bits used to encode a single character of text in a computer and it is for this reason the basic addressable element in many computer architectures.
So the byte is the basic addressable unit, below which the computer architecture cannot address. And since there (probably) aren't any computers with a 4-bit byte, you don't have a 4-bit bool etc.
However, if you can design an architecture whose basic addressable unit is 4 bits, then you will have a 4-bit bool, on that computer only!
Back in the old days when I had to walk to school in a raging blizzard, uphill both ways, and lunch was whatever animal we could track down in the woods behind the school and kill with our bare hands, computers had much less memory available than today. The first computer I ever used had 6K of RAM. Not 6 megabytes, not 6 gigabytes, 6 kilobytes. In that environment, it made a lot of sense to pack as many booleans into an int as you could, and so we would regularly use operations to take them out and put them in.
Today, when people will mock you for having only 1 GB of RAM, and the only place you could find a hard drive with less than 200 GB is at an antique shop, it's just not worth the trouble to pack bits.
The easiest answer is; it's because the CPU addresses memory in bytes and not in bits, and bitwise operations are very slow.
However, it's possible to use bit-sized allocation in C++. There's the std::vector<bool> specialization for bit vectors, and also structs with bit-sized fields.
Because a byte is the smallest addressable unit in the language.
But you can make bool take 1 bit for example if you have a bunch of them
eg. in a struct, like this:
struct A
{
bool a:1, b:1, c:1, d:1, e:1;
};
You could have 1-bit bools and 4 and 2-bit ints. But that would make for a weird instruction set for no performance gain because it's an unnatural way to look at the architecture. It actually makes sense to "waste" a better part of a byte rather than trying to reclaim that unused data.
The only app that bothers to pack several bools into a single byte, in my experience, is SQL Server.
You can use bit fields to get integers of sub size.
struct X
{
int val:4; // 4 bit int.
};
Though it is usually used to map structures to exact hardware expected bit patterns:
// 1 byte value (on a system where 8 bits is a byte)
struct SomThing
{
int p1:4; // 4 bit field
int p2:3; // 3 bit field
int p3:1; // 1 bit
};
bool can be one byte, the smallest addressable size of the CPU, or it can be bigger. It's not unusual for bool to be the size of int for performance purposes. If for specific purposes (say hardware simulation) you need a type with N bits, you can find a library for that (e.g. GBL library has BitSet<N> class). If you are concerned with the size of bool (you probably have a big container), then you can pack bits yourself, or use std::vector<bool>, which will do it for you (be careful with the latter, as it doesn't satisfy the container requirements).
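To make the std::vector<bool> caveat concrete, here is a small sketch. Indexing a vector<bool> gives back a proxy object rather than a real bool&, which is one of the ways it departs from the other containers. The helper name flip is made up for the illustration.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// vector<bool> packs its elements, so it cannot hand out a bool&.
static_assert(!std::is_same<std::vector<bool>::reference, bool&>::value,
              "vector<bool>::reference is a proxy class");
static_assert(std::is_same<std::vector<int>::reference, int&>::value,
              "other vectors return a plain reference");

// Flip one packed flag through the proxy and return its new value.
bool flip(std::vector<bool>& flags, std::size_t i) {
    std::vector<bool>::reference bit = flags[i];  // proxy, not bool&
    bit = !bit;
    return flags[i];
}
```

Code such as bool& r = flags[i]; or bool* p = &flags[i]; does not compile for vector<bool>, which is usually where people first trip over the difference.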
Think about how you would implement this at your emulator level...
bool a[10] = {false};
bool &rbool = a[3];
bool *pbool = a + 3;
assert(pbool == &rbool);
rbool = true;
assert(*pbool);
*pbool = false;
assert(!rbool);
Because in general the CPU allocates memory with 1 byte as the basic unit, although some CPUs like MIPS use a 4-byte word.
However, vector handles bool in a special fashion: with vector<bool>, one bit is allocated for each bool.
The byte is the smallest addressable unit of digital data storage in a computer. In a computer the RAM has millions of bytes, and each of them has an address. If it had an address for every bit, a computer could manage 8 times less RAM than it can now.
More info: Wikipedia
Even when the minimum size possible is 1 Byte, you can have 8 bits of boolean information on 1 Byte:
http://en.wikipedia.org/wiki/Bit_array
Julia language has BitArray for example, and I read about C++ implementations.
Bitwise operations are not 'slow'.
And/Or operations tend to be fast.
The problem is alignment, and the simple matter of dealing with it.
As the other answers partly got right, CPUs are generally designed to read whole bytes, and RAM is designed the same way.
So packing data to use less memory has to be requested explicitly.
As one answer suggested, you could request a specific number of bits per value in a struct. But what does the CPU or memory do afterward if the result is not aligned? Addresses advance by +1, +2 or +4; there is no +1.5 if you want a value to take half the bits. So the remaining space has to be filled in or left blank anyway, and the next read starts at the next aligned location, which is aligned by 1 at minimum and usually by 4 (32-bit) or 8 (64-bit) by default. The CPU will generally grab the byte or int that contains your flags, and then you check or set the ones you need. So you must still declare memory as int, short, byte, or another proper size, but when reading and writing the value you can explicitly pack the data and store those flags in it to save space. Many people are unaware of how this works, or skip the step whenever they have on/off values or presence flags, even though saving space in sent and received data is quite useful in mobile and other constrained environments. Splitting an int into bytes has little value, since you can just declare the bytes individually (e.g. int 4Bytes; vs byte Byte1;byte Byte2; byte Byte3; byte Byte4;), so using an int there is redundant. However, in easier virtual environments like Java, most types (numbers, booleans, etc.) might be defined as int, and in that case you could take advantage of an int by dividing it up into bytes or bits, for an ultra-efficient app that has to send fewer integers of data (aligned by 4). Managing bits could be called redundant, but it is one of many optimizations where bitwise operations are superior though not always needed; many times people take advantage of generous memory limits by just storing booleans as integers and wasting 500%-1000% or so of memory space anyway.
It still easily has its uses. If you use this among other optimizations, then for on-the-go use and other data streams that carry only bytes or a few KB of data, having optimized everything can decide whether something loads at all, or loads fast, so reducing the bytes sent could ultimately benefit you a lot, even if you could get away with sending tons of unneeded data over an everyday internet connection or app. It is definitely something you should do when designing an app for mobile users, and it is something even big-time corporate apps fail at nowadays, using space and load time that could be half as much or lower. Compared with doing nothing and piling on unknown packages/plugins that need at minimum many hundreds of KB or 1 MB before anything loads, an app designed for speed that needs 1 KB or only a few KB is going to load and act faster. Users with data constraints will feel that difference, even if loading wasteful MBs or thousands of KB of unneeded data is fast for you.
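For what the explicit packing described above can look like, here is a minimal sketch that keeps several on/off values in one byte and uses AND/OR to set, clear and test them. The flag names and helper functions are made up for the illustration.

```cpp
#include <cstdint>

// Hypothetical flags, one bit each inside a single byte.
enum Flag : std::uint8_t {
    kVisible = 1u << 0,
    kEnabled = 1u << 1,
    kDirty   = 1u << 2,
};

// Turn a flag on, leaving the other bits untouched.
inline std::uint8_t set_flag(std::uint8_t bits, Flag f) {
    return static_cast<std::uint8_t>(bits | f);
}

// Turn a flag off, leaving the other bits untouched.
inline std::uint8_t clear_flag(std::uint8_t bits, Flag f) {
    return static_cast<std::uint8_t>(bits & ~f);
}

// Test whether a flag is on.
inline bool has_flag(std::uint8_t bits, Flag f) {
    return (bits & f) != 0;
}
```

The byte is still what gets loaded and stored; the bitwise operations only pick individual flags out of it, so eight booleans travel in the space one would otherwise take.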
I'm currently trying to write a NES emulator in C++ as a summer programming project to get ready for fall term next school year (I haven't coded in a while). I've already written a Chip8 emulator, so I thought the next step would be to try and write a NES emulator.
Anyways, I'm getting stuck. I'm using this website for my opcode table and I'm running into a road block. On the Chip8, all opcodes were two bytes long, so they were easy to fetch. However, the NES seems to have either 2 or 3 byte opcodes depending on what addressing mode the CPU is in. I can't think of any easy way to figure out how many bytes I need to read for each opcode (my only idea was to create really long if statements that check the first byte of the opcode to see how many more bytes to read).
I'm also having trouble with figuring how to count cycles. How do I create a clock within a programming language so that everything is in sync?
On an unrelated side note, since the NES is little-endian, do I need to read programCounter + 1 and then read programCounter to get the correct opcode?
However, the NES seems to have either 2 or 3 byte opcodes depending on what addressing mode the CPU is in. I can't think of any easy way to figure out how many bytes I need to read for each opcode.
The opcode is still only one byte. The extra bytes specify the operands for those instructions that have explicit operands.
To do the decoding, you can create a switch-block with 256 cases (actually it won't be 256 cases, because some opcodes are illegal). It could look something like this:
opcode = ReadByte(PC++);
switch (opcode) {
...
case 0x4C: // JMP abs
address = ReadByte(PC++);
address |= (uint16_t)ReadByte(PC) << 8;
PC = address;
cycles += 3;
break;
...
}
The compiler will typically create a jump table for the cases, so you'll end up with fairly efficient (albeit slightly bloated) code.
Another alternative is to create an array with one entry per opcode. This could simply be an array of function pointers, with one function per opcode - or the table could contain a pointer to one function for fetching the operands, one for performing the actual operation, plus information about the number of cycles that the instruction requires. This way you can share a lot of code. An example:
const Instruction INSTRUCTIONS[] =
{
...
// 0x4C: JMP abs
{&jmp, &abs_operand, 3},
...
};
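A self-contained sketch of the table approach might look like the following. The Cpu, Decoder and step names are invented for the example, only the JMP abs entry is filled in, and a real core would also need the registers, flags and extra-cycle rules.

```cpp
#include <cstdint>

// Hypothetical minimal CPU state; a real 6502 core also needs A, X, Y, SP and flags.
struct Cpu {
    std::uint8_t  memory[0x10000] = {};
    std::uint16_t pc = 0;
    unsigned long cycles = 0;
};

using OperandFn = std::uint16_t (*)(Cpu&);
using ExecuteFn = void (*)(Cpu&, std::uint16_t);

struct Instruction {
    ExecuteFn execute;  // performs the operation
    OperandFn operand;  // fetches the operand bytes
    unsigned  cycles;   // base cycle count
};

// Absolute addressing: two operand bytes, low byte first.
std::uint16_t abs_operand(Cpu& cpu) {
    const std::uint16_t lo = cpu.memory[cpu.pc++];
    const std::uint16_t hi = cpu.memory[cpu.pc++];
    return static_cast<std::uint16_t>(lo | (hi << 8));
}

void jmp(Cpu& cpu, std::uint16_t address) { cpu.pc = address; }

// One entry per opcode byte; entries left empty have null pointers.
struct Decoder {
    Instruction table[256] = {};
    Decoder() { table[0x4C] = {&jmp, &abs_operand, 3}; }  // 0x4C: JMP abs
};

// Execute one instruction; returns false if the opcode has no table entry.
bool step(Cpu& cpu, const Decoder& decoder) {
    const Instruction& ins = decoder.table[cpu.memory[cpu.pc++]];
    if (!ins.execute) return false;
    ins.execute(cpu, ins.operand(cpu));
    cpu.cycles += ins.cycles;
    return true;
}
```

Because the operand fetch is separate from the operation, one abs_operand function can serve every instruction that uses absolute addressing.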
I'm also having trouble with figuring how to count cycles. How do I create a clock within a programming language so that everything is in sync?
Counting CPU cycles is just a matter of incrementing a counter, like I showed in my code examples above.
To sync video with the CPU, the easiest way would be to run the CPU for the number of cycles corresponding to the active display period of a single scanline, then draw one scanline, then run the CPU for the number of cycles corresponding to the horizontal blanking period, and start over again.
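As a rough sketch of that loop: the cycle counts are passed in as parameters rather than real NES timings, and run_cpu and draw_scanline stand in for whatever your emulator core and renderer provide (all names here are made up).

```cpp
#include <functional>

// Cycle budget for one scanline; the values are supplied by the caller.
struct ScanlineTiming {
    unsigned active_cycles;  // CPU cycles during the visible part of the line
    unsigned hblank_cycles;  // CPU cycles during horizontal blanking
};

// Interleave CPU execution and drawing, one scanline at a time.
void run_frame(unsigned scanlines, ScanlineTiming timing,
               const std::function<void(unsigned)>& run_cpu,
               const std::function<void(unsigned)>& draw_scanline) {
    for (unsigned line = 0; line < scanlines; ++line) {
        run_cpu(timing.active_cycles);   // CPU runs while the line is displayed
        draw_scanline(line);             // then the line is drawn
        run_cpu(timing.hblank_cycles);   // CPU runs through horizontal blanking
    }
}
```

This gives per-scanline accuracy only; effects that depend on changes in the middle of a scanline would need a finer-grained scheme.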
When you start involving audio, how you sync things can depend a bit on the audio API you're using. For example, some APIs might send you a callback to which you respond by filling a buffer with samples and returning the number of samples generated. In this case you could calculate the number of CPU cycles that have been emulated since the previous callback and determine how many samples to generate based on that.
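The cycles-to-samples calculation could be sketched like this. The function name and parameters are assumptions for the example, and the CPU clock and sample rate are whatever your emulator and audio API use.

```cpp
#include <cstdint>

// Number of audio samples owed for the CPU cycles emulated since the
// previous callback. `remainder` carries the fractional part from one
// call to the next so that no samples are lost to rounding.
std::uint64_t samples_for_cycles(std::uint64_t cycles,
                                 std::uint64_t cpu_hz,
                                 std::uint64_t sample_rate,
                                 std::uint64_t& remainder) {
    const std::uint64_t scaled = cycles * sample_rate + remainder;
    remainder = scaled % cpu_hz;
    return scaled / cpu_hz;
}
```

Keeping the remainder in integer form avoids the slow drift you can get from rounding a floating-point ratio on every callback.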
On an unrelated side note, since the NES is little-endian, do I need to read programCounter + 1 and then read programCounter to get the correct opcode?
Since the opcode is a single byte and instructions on the 6502 aren't packed into a word like on some other CPU architectures, endianness doesn't really matter. It does become relevant for 16-bit operands, but on the other hand PCs and most mobile phones are also based on little-endian CPUs.
I wrote an emulator for 6502 some 25+ years back.
It's a pretty simple processor, so either a table of function pointers or a switch with 256 entries for the bytes will do [the switch can be a bit shorter, since there aren't valid opcodes in all 256 entries; only about 150 of the opcodes are officially documented].
Now, if you want to write a simulator that exactly simulates the timing of the instructions, then you'll have more fun. You basically will have to simulate much more of how each component works, and "ripple" through the units with a clock. This is quite a lot of work, so I would probably, if at all possible, ignore the timing, and just let the system's speed depend on the emulator's speed.
I'm writing a Chip 8 emulator as an introduction to emulation and I'm kind of lost. Basically, I've read a Chip 8 ROM and stored it in a char array in memory. Then, following a guide, I use the following code to retrieve the opcode at the current program counter (pc):
// Fetch opcode
opcode = memory[pc] << 8 | memory[pc + 1];
Chip 8 opcodes are 2 bytes each. This is code from a guide which I vaguely understand as adding 8 extra bit spaces to memory[pc] (using << 8) and then merging memory[pc + 1] with it (using |) and storing the result in the opcode variable.
Now that I have the opcode isolated however, I don't really know what to do with it. I'm using this opcode table and I'm basically lost in regards to matching the hex opcodes I read to the opcode identifiers in that table. Also, I realize that many of the opcodes I'm reading also contain operands (I'm assuming the latter byte?), and that is probably further complicating my situation.
Help?!
Basically once you have the instruction you need to decode it. For example from your opcode table:
if ((inst&0xF000)==0x1000)
{
write_register(pc,(inst&0x0FFF)<<1);
}
I am guessing that, since you are accessing ROM two bytes per instruction, the address is probably a (16-bit) word address rather than a byte address, so I shifted it left by one. You need to study how those instructions are encoded; the opcode table you provided is inadequate for that, at least without making assumptions.
There is a lot more that has to happen, and I don't know if I wrote anything about it in my GitHub samples. I recommend you create a fetch function for fetching instructions at an address, a read memory function, a write memory function, a read register function and a write register function. I recommend your decode and execute function decodes and executes only one instruction at a time. Normal execution is to just call it in a loop; this provides the ability to do interrupts and things like that without a lot of extra work. By creating the fetch(), read_mem_byte(), read_mem_word() etc. functions you modularize your code (at a slight cost in performance) and make debugging much easier, as you have a single place where you can watch registers or memory accesses and figure out what is or isn't going on.
Based on your question, and where you are in this process, I think the first thing you need to do before writing an emulator is to write a disassembler. Being a fixed instruction length instruction set (16 bits) that makes it much much easier. You can start at some interesting point in the rom, or at the beginning if you like, and decode everything you see. For example:
if ((inst&0xF000)==0x1000)
{
printf("jmp 0x%04X\n",(inst&0x0FFF)<<1);
}
With only 35 instructions that shouldn't take more than an afternoon, maybe a whole Saturday, this being your first time decoding instructions (I assume that based on your question). The disassembler becomes the core decoder for your emulator. Replace the printf()s with emulation, or even better leave the printfs and just add code to emulate the instruction execution; this way you can follow the execution. (Same deal: have a function that disassembles a single instruction and call it for each instruction; this becomes the foundation for your emulator.)
Your understanding needs to be more than vague as to what that fetch line of code is doing, in order to pull off this task you are going to have to have a strong understanding of bit manipulation.
Also I would call that line of code you provided risky. You said the ROM is stored in a char array, and plain char is signed on many compilers. The integer promotions widen each byte to int before the shift, so the high byte is not lost to byte-sized math, but any byte of 0x80 or above becomes a negative int first. A negative low byte then sets every bit above it when it is ORed in, wiping out the high byte.
Basically, with signed chars and a 16-bit opcode variable, this:
opcode = (memory[pc] << 8) | memory[pc + 1];
can behave like this whenever memory[pc + 1] is 0x80 or above:
opcode = 0xFF00 | memory[pc + 1];
Which won't work for you at all. Declare memory[] as unsigned char (or uint8_t), and as a very quick fix make the steps explicit:
opcode = memory[pc + 0];
opcode <<= 8;
opcode |= memory[pc + 1];
Will save you some headaches. Minimal optimization will save the compiler from storing the intermediate results to RAM for each operation, resulting in the same (desired) output and performance.
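A sketch of the fetch with the byte type made explicit, assuming memory is an array of uint8_t rather than plain char (the function name is made up):

```cpp
#include <cstddef>
#include <cstdint>

// Fetch one big-endian 16-bit opcode from unsigned byte memory.
std::uint16_t fetch_opcode(const std::uint8_t* memory, std::size_t pc) {
    const std::uint16_t hi = memory[pc];      // first byte is the high half
    const std::uint16_t lo = memory[pc + 1];  // second byte is the low half
    return static_cast<std::uint16_t>((hi << 8) | lo);
}
```

With unsigned bytes there is no sign extension to worry about, so a low byte such as 0xFF leaves the high byte intact.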
The instruction set simulators I wrote and mentioned above are not intended for performance but instead readability, visibility, and hopefully educational. I would start with something like that then if performance for example is of interest you will have to re-write it. This chip8 emulator, once experienced, would be an afternoon task from scratch, so once you get through this the first time you could re-write it maybe three or four times in a weekend, not a monumental task (to have to re-write). (the thumbulator one took me a weekend, for the bulk of it. The msp430 one was probably more like an evening or two worth of work. Getting the overflow flag right, once and for all, was the biggest task, and that came later). Anyway, point being, look at things like the mame sources, most if not all of those instruction set simulators are designed for execution speed, many are barely readable without a fair amount of study. Often heavily table driven, sometimes lots of C programming tricks, etc. Start with something manageable, get it functioning properly, then worry about improving it for speed or size or portability or whatever. This chip8 thing looks to be graphics based so you are going to also have to deal with a lot of line drawing and other bit manipulation on a bitmap/screen/wherever. Or you could just call api or operating system functions. Basically this chip8 thing is not your traditional instruction set with registers and a laundry list of addressing modes and alu operations.
Basically -- Mask out the variable part of the opcode, and look for a match. Then use the variable part.
For example 1NNN is the jump. So:
int a = opcode & 0xF000;
int b = opcode & 0x0FFF;
if(a == 0x1000)
doJump(b);
Then the game is to make that code fast or small, or elegant, if you like. Good clean fun!
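One way to keep the masking in a single place is to split every opcode into all the fields it could carry and let each instruction use the ones it needs. This is only a sketch; the Fields struct and split function are names made up for the example.

```cpp
#include <cstdint>

// Every field a CHIP-8 opcode can carry; which ones mean anything
// depends on the opcode family.
struct Fields {
    std::uint8_t  kind;  // top nibble, selects the opcode family
    std::uint8_t  x;     // second nibble, a register index
    std::uint8_t  y;     // third nibble, a register index
    std::uint8_t  n;     // lowest nibble
    std::uint8_t  nn;    // low byte
    std::uint16_t nnn;   // low 12 bits, an address
};

// Mask and shift each field out of the 16-bit opcode.
Fields split(std::uint16_t opcode) {
    Fields f;
    f.kind = static_cast<std::uint8_t>((opcode >> 12) & 0xF);
    f.x    = static_cast<std::uint8_t>((opcode >> 8) & 0xF);
    f.y    = static_cast<std::uint8_t>((opcode >> 4) & 0xF);
    f.n    = static_cast<std::uint8_t>(opcode & 0xF);
    f.nn   = static_cast<std::uint8_t>(opcode & 0xFF);
    f.nnn  = static_cast<std::uint16_t>(opcode & 0x0FFF);
    return f;
}
```

For the jump above you would then check f.kind == 0x1 and call doJump(f.nnn).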
Different CPUs store values in memory differently. Big endian machines store a number like $FFCC in memory in that order FF,CC. Little-endian machines store the bytes in reverse order CC, FF (that is, with the "little end" first).
The CHIP-8 architecture is big endian, so the code you will run has the instructions and data written in big endian.
In your statement "opcode = memory[pc] << 8 | memory[pc + 1];", it doesn't matter if the host CPU (the CPU of your computer) is little endian or big endian. It will always put a 16-bit big endian value into an integer in the correct order.
There are a couple of resources that might help: http://www.emulator101.com gives a CHIP-8 emulator tutorial along with some general emulator techniques. This one is good too: http://www.multigesture.net/articles/how-to-write-an-emulator-chip-8-interpreter/
You're going to have to set up a bunch of different bit masks to get the actual opcode from the 16-bit word, in combination with a finite state machine to interpret those opcodes, since it appears that there are some complications in how the opcodes are encoded (i.e., certain opcodes have register identifiers, etc., while others are fairly straightforward with a single identifier).
Your finite state machine can basically do the following:
1. Get the first nibble of the opcode using a mask like 0xF000. This will allow you to "categorize" the opcode.
2. Based on the function category from step 1, apply more masks to either get the register values from the opcode, or whatever other variables might be encoded with the opcode that will narrow down the actual function that would need to be called, as well as its arguments.
3. Once you have the opcode and the variable information, do a look-up into a fixed-length table of functions that have the appropriate handlers to coincide with the opcode functionality and the variables that go along with the opcode. While you can, in your state machine, hard-code the names of the functions that would go with each opcode once you've isolated the proper functionality, a table that you initialize with function pointers for each opcode is a more flexible approach that will let you modify the code functionality more easily (i.e., you could easily swap between debug handlers and "normal" handlers, etc.).
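The steps above can be sketched as a table of function pointers indexed by the first nibble. Only three handlers (1NNN jump, 6XNN load register, ANNN set index) are filled in, and the Chip8 struct, handler names and the "unknown opcode just skips" behaviour are all assumptions made for the example.

```cpp
#include <cstdint>

// Hypothetical minimal machine state.
struct Chip8 {
    std::uint16_t pc = 0x200;  // CHIP-8 programs are conventionally loaded at 0x200
    std::uint16_t i = 0;
    std::uint8_t  v[16] = {};
};

using Handler = void (*)(Chip8&, std::uint16_t opcode);

// 1NNN: jump to address NNN.
void op_jump(Chip8& c, std::uint16_t op) { c.pc = op & 0x0FFF; }

// 6XNN: load NN into register VX.
void op_load(Chip8& c, std::uint16_t op) {
    c.v[(op >> 8) & 0xF] = static_cast<std::uint8_t>(op & 0xFF);
    c.pc += 2;
}

// ANNN: set the index register to NNN.
void op_index(Chip8& c, std::uint16_t op) { c.i = op & 0x0FFF; c.pc += 2; }

// Placeholder for families not handled in this sketch.
void op_unknown(Chip8& c, std::uint16_t) { c.pc += 2; }

// One entry per value of the first nibble (0x0 to 0xF).
const Handler kHandlers[16] = {
    op_unknown, op_jump,    op_unknown, op_unknown,
    op_unknown, op_unknown, op_load,    op_unknown,
    op_unknown, op_unknown, op_index,   op_unknown,
    op_unknown, op_unknown, op_unknown, op_unknown,
};

// Step 1 picks the category; the handler does the step 2 masking itself.
void execute(Chip8& c, std::uint16_t opcode) {
    kHandlers[opcode >> 12](c, opcode);
}
```

Families such as 8XY_ or FX__ hold several instructions each, so their handlers would mask again on the low nibble or low byte, possibly through a second table.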
So we all agree keys have a fixed length of 128, 192 or 256 bits. If our content were 50 characters (bytes) in size, 50 % 16 = 2 bytes. So we encrypt the content in 3 full blocks, but how will the remaining two bytes be stored in the State block? Should I pad them? The standard doesn't specify how to handle such conditions.
The MixColumns stage is the most complicated aspect of AES, and I have been unable to understand its mathematical representation. I understand matrix multiplication, but the mathematical results surprise me. To multiply a value by 2, shift left 1 position for little endian and shift right for big endian. If the most significant bit was set to 1 (0x80), then we should XOR the shifted result with 0x1B. I thought multiplying by 3 would mean shifting the value 2 positions.
I've checked the various sources on Wikipedia, even the tutorial that provides a C implementation. But I'm more interested to complete my own implementation! Thank you for any possible input.
In the MixColumns stage the bytes are multiplied as polynomials, so it is the exponents that get added.
take this example
AA*3
10101010*00000011
is
(x^7+x^5+x^3+x^1)*(x^1+x^0)
x^1+x^0 is 3 represented in polynomial form
x^7+x^5+x^3+x^1 is AA represented in polynomial form
first take x^1 and dot multiply it by the polynomial for AA.
that results in...
x^8+x^6+x^4+x^2 ... adding one to each exponent
then reduce this to 8 bits by XORing with 11B
11B is x^8+x^4+x^3+x^1+x^0 in polynomial form.
so...
x^8+x^6+x^4+ x^2
x^8+ x^4+x^3+ x^1+x^0
leaves
x^6+x^3+x^2+x^1+x^0 which is AA*2
now take AA and dot multiply by x^0 (basically AA*1)
that gives you
x^7+x^5+x^3+x^1 ... a duplicate of the original value.
then exclusive or AA*2 with AA*1
x^7+ x^5+x^3+ x^1
x^6+ x^3+x^2+x^1+x^0
which leaves
x^7+x^6+x^5+x^2+x^0 or 11100101 or E5
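The same arithmetic can be written as a short sketch in code. The function names xtime and gmul are just the ones chosen here; the reduction constant 0x1B is 11B with the x^8 bit already dropped by the 8-bit shift.

```cpp
#include <cstdint>

// Multiply by x (that is, by 2): shift left one position and, if the
// top bit fell off, reduce by XORing with 0x1B.
std::uint8_t xtime(std::uint8_t a) {
    const std::uint8_t shifted = static_cast<std::uint8_t>(a << 1);
    return (a & 0x80) ? static_cast<std::uint8_t>(shifted ^ 0x1B) : shifted;
}

// General multiply: for each set bit of b, XOR in a correspondingly
// doubled copy of a. XOR plays the role of addition.
std::uint8_t gmul(std::uint8_t a, std::uint8_t b) {
    std::uint8_t product = 0;
    while (b) {
        if (b & 1) product ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return product;
}
```

So multiplying by 3 is not a shift by two positions: gmul(0xAA, 3) is xtime(0xAA) XOR 0xAA, which is 4F XOR AA = E5, matching the result above.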
I hope that helps.
here also is a document detailing the specifics of how mix columns works.
mix_columns.pdf
EDIT:Normal matrix multiplication does not apply to this ..so forget about normal matrices.
In response to your questions:
If you want to encrypt a stream of bytes using AES, do not just break it into individual blocks and encrypt them individually. This is not cryptographically secure and a clever attacker can recover a lot of information from your original plaintext. This is called an electronic code book and if you follow the link and see what happens when you use it to encrypt Tux the Linux Penguin you can visually see its insecurities. Instead, consider using a known secure technique like cipher-block chaining (CBC) or counter mode (CTR). These are a bit more complex to implement, but it's well worth the effort so that you can ensure a clever attacker can't break your encryption indirectly.
As for how the MixColumns stage works, I really don't understand much of the operation myself. It's based on a construction that involves fields of polynomials. If I can find a good explanation as to how it works, I'll let you know.
If you want to implement AES to further your understanding, that's perfectly fine and I encourage you to do so (though you are probably better off reading the mathematical intuition as to where the algorithm comes from). However, you should not use your own implementation for any actual cryptographic purposes. Without extreme care, you will render your implementation vulnerable to a side-channel attack that can compromise its security. The most famous example of this involves RSA encryption, in which without careful planning an attacker can actually watch the power draw of the computer as it does the encryption to recover the bits of the key. If you want to use AES to do encryption, consider using a known, tested, open-source implementation of the algorithm.
Hope this helps!
If you want to test the outcome of your own implementation (any internal state during computation) you can check this page :
http://www.keymolen.com/aes.jsp
It displays all internal states for any given plaintext, key and iv, also for the mixcolumns stage.