Why inline assembly and optimization generates a problem with rbit [duplicate] - c++

recently i checked the Instruction Set for an ARM Cortex-M3 processor.
For example:
ADD <Rd>, <Rn>, <Rm>
What do those abbriviations mean exactly?
I guess they mean different kinds of addresses, like directely addressed, relatively addressed or so.
But what exactly?
Thanks!

Operands of the form <Rx> refer to general-purpose registers, i.e. r0-r15 (or accepted aliases like sp, pc, etc.).
I'm not sure if it's ever called out specifically anywhere but there is a general pattern of "d" meaning destination, "t" meaning target, "n" meaning the first operand or base register, "m" meaning the second operand, and occasionally "a" meaning an accumulator. Hence why you might spot designations like <Rdn> (in the destructive two-operand instructions), or <Rt>, <Rt2> (where a 64-bit value is held across a pair of GP registers). This is consistent across the other types of register too, e.g. VADD.F32 <Sd>, <Sn>, <Sm>.

They are just there to define registers, the lowercase letter just being there to separate them for explanation. Rd is destination, but Rn, Rm etc are just any register you can use. It's the only way to tell which is which when explaining like "Rd equals Rn bitwise anded with Rm", for example, since you can't use numbers.
They could be Rx, Ry etc, or Ra, Rb... as well.

Basics:
Rd is the destination, Rn and Rm are sources. They're all general-purpose integer registers; FP would use Sd / Sn / Sm or Dd / Dn / Dm for single or double.
ARM syntax puts the destination(s) on the left, before read-only source operands.
See Notlikethat's answer for more. Some small additions to that:
t: in this post, an ARM employee comments that "t" might mean "transfer" instead of "target".
Since t generally appears in memory instructions like LDR and STR, I understand that that means "transfer to/from memory", e.g. on ARMARMv8-fa:
LDR <Xt>, [<Xn|SP>, (<Wm>|<Xm>){, <extend> {<amount>}}]
STR <Xt>, [<Xn|SP>, (<Wm>|<Xm>){, <extend> {<amount>}}]
where the t is the source/destination of memory reads and writes.
This is also further suggested in the description of the STR and LDXR instruction registers:
<Xt> Is the 64-bit name of the general-purpose register to be transferred, encoded in the "Rt" field.
The LDR instruction however says "loaded":
<Xt> Is the 64-bit name of the general-purpose register to be loaded, encoded in the "Rt" field.
This terminology is especially meaningful because ARM is RISC-y and so there are relatively few instructions that do memory IO, and they tend to do just that (unlike say add and store to memory as is common in x86).
t1 and t2: these are used for memory instructions that load/store two values at once, e.g. the ARMv8 LDP/STP:
LDP <Xt1>, <Xt2>, [<Xn|SP>], #<imm>
STP <Xt1>, <Xt2>, [<Xn|SP>, #<imm>]!
n and m are just commonly used integer variable/index names in mathematics
s:
the STXR instruction stores to memory fom Xt (like STR), but it also gets a second return value (did the write succeed) to Ws:
STXR <Ws>, <Xt>, [<Xn|SP>{,#0}]
so presumably s was chosen because it comes before t like m comes before n.
Some ARMv7/aarch32 instructions could take a shift in a register, and Rs is the name given to that register, e.g.:
ORR{<c>}{<q>} {<Rd>,} <Rn>, <Rm>, <shift> <Rs>
I couldn't easily find aarch64 ones.
if it were documented, "Chapter C2 About the A64 Instruction Descriptions" might have been a good location for the information, but it's not there

Related

C++ Question about memory pretty basic but this is confusing me

Not really too c++ related I guess but say I have a signed int
int a =50;
This sets aside like 32 bits memory for this right it'll get some bit patternand a memory address, now my question is basically well we created this variable but the computer ITSELF doesn't know what the type is it just sees some bit pattern and memory address it doesn't know this is an int, but my question is how? Does the computer know that those 4 bytes are all connected to a? And also how does the computer not know the type? It set aside 4 bytes for one variable I get that that doesn't automatically make it an int but does the computer really know nothing? The value was 50 the number and that gets turned into binary and stored in the bit pattern how does the computer know nothing
Type information is used by the compiler. It knows the size in bytes of each type and will create an executable that at runtime will correctly access memory for each variable.
C++ is a compiled language and is not run directly by the computer. The computer itself runs machine code. Since machine code is all binary, we will often look at what is called assembly language. This language uses human readable symbols to represent the machine code but also corresponds to it on a line by line basis
When the compiler sees int a = 50; It might (depending on architecture and compiler)
mov r1 #50 // move the literal 50 into the register r1
Registers are placeholders in the CPU that the cpu can use to manipulate memory. In the above statement the compiler will remember that whenever it wants to translate a statement that uses a into machine code, it needs to fetch the value from r1
In the above case, the computer decided to map a to a register, it may well use memory instead.
The type itself is not translated into machine code, it is rather used as a hint as to what types of assembly operations subsequence c++ statements will be translated into. The CPU itself does not understand types. Size is used when loading and storing values from memory to registers. With a char type one 1 byte is read, with a short 2 and an int, typically 4.
When it comes to signedness, the cpu has different instructions for signed and unsigned comparisons.
Lastly float either have to simulated using integer maths or special floating point assembler instructions need to be used.
So once translated into assembler, there is no easy way to know what the original C++ code was.

Where are expressions and constants stored if not in memory?

From C Programming Language by Brian W. Kernighan
& operator only applies to objects in memory: variables and array
elements. It cannot be applied to expressions, constants or register
variables.
Where are expressions and constants stored if not in memory?
What does that quote mean?
E.g:
&(2 + 3)
Why can't we take its address? Where is it stored?
Will the answer be same for C++ also since C has been its parent?
This linked question explains that such expressions are rvalue objects and all rvalue objects do not have addresses.
My question is where are these expressions stored such that their addresses can't be retrieved?
Consider the following function:
unsigned sum_evens (unsigned number) {
number &= ~1; // ~1 = 0xfffffffe (32-bit CPU)
unsigned result = 0;
while (number) {
result += number;
number -= 2;
}
return result;
}
Now, let's play the compiler game and try to compile this by hand. I'm going to assume you're using x86 because that's what most desktop computers use. (x86 is the instruction set for Intel compatible CPUs.)
Let's go through a simple (unoptimized) version of how this routine could look like when compiled:
sum_evens:
and edi, 0xfffffffe ;edi is where the first argument goes
xor eax, eax ;set register eax to 0
cmp edi, 0 ;compare number to 0
jz .done ;if edi = 0, jump to .done
.loop:
add eax, edi ;eax = eax + edi
sub edi, 2 ;edi = edi - 2
jnz .loop ;if edi != 0, go back to .loop
.done:
ret ;return (value in eax is returned to caller)
Now, as you can see, the constants in the code (0, 2, 1) actually show up as part of the CPU instructions! In fact, 1 doesn't show up at all; the compiler (in this case, just me) already calculates ~1 and uses the result in the code.
While you can take the address of a CPU instruction, it often makes no sense to take the address of a part of it (in x86 you sometimes can, but in many other CPUs you simply cannot do this at all), and code addresses are fundamentally different from data addresses (which is why you cannot treat a function pointer (a code address) as a regular pointer (a data address)). In some CPU architectures, code addresses and data addresses are completely incompatible (although this is not the case of x86 in the way most modern OSes use it).
Do notice that while (number) is equivalent to while (number != 0). That 0 doesn't show up in the compiled code at all! It's implied by the jnz instruction (jump if not zero). This is another reason why you cannot take the address of that 0 — it doesn't have one, it's literally nowhere.
I hope this makes it clearer for you.
where are these expressions stored such that there addresses can't be retrieved?
Your question is not well-formed.
Conceptually
It's like asking why people can discuss ownership of nouns but not verbs. Nouns refer to things that may (potentially) be owned, and verbs refer to actions that are performed. You can't own an action or perform a thing.
In terms of language specification
Expressions are not stored in the first place, they are evaluated.
They may be evaluated by the compiler, at compile time, or they may be evaluated by the processor, at run time.
In terms of language implementation
Consider the statement
int a = 0;
This does two things: first, it declares an integer variable a. This is defined to be something whose address you can take. It's up to the compiler to do whatever makes sense on a given platform, to allow you to take the address of a.
Secondly, it sets that variable's value to zero. This does not mean an integer with value zero exists somewhere in your compiled program. It might commonly be implemented as
xor eax,eax
which is to say, XOR (exclusive-or) the eax register with itself. This always results in zero, whatever was there before. However, there is no fixed object of value 0 in the compiled code to match the integer literal 0 you wrote in the source.
As an aside, when I say that a above is something whose address you can take - it's worth pointing out that it may not really have an address unless you take it. For example, the eax register used in that example doesn't have an address. If the compiler can prove the program is still correct, a can live its whole life in that register and never exist in main memory. Conversely, if you use the expression &a somewhere, the compiler will take care to create some addressable space to store a's value in.
Note for comparison that I can easily choose a different language where I can take the address of an expression.
It'll probably be interpreted, because compilation usually discards these structures once the machine-executable output replaces them. For example Python has runtime introspection and code objects.
Or I can start from LISP and extend it to provide some kind of addressof operation on S-expressions.
The key thing they both have in common is that they are not C, which as a matter of design and definition does not provide those mechanisms.
Such expressions end up part of the machine code. An expression 2 + 3 likely gets translated to the machine code instruction "load 5 into register A". CPU registers don't have addresses.
It does not really make sense to take the address to an expression. The closest thing you can do is a function pointer. Expressions are not stored in the same sense as variables and objects.
Expressions are stored in the actual machine code. Of course you could find the address where the expression is evaluated, but it just don't make sense to do it.
Read a bit about assembly. Expressions are stored in the text segment, while variables are stored in other segments, such as data or stack.
https://en.wikipedia.org/wiki/Data_segment
Another way to explain it is that expressions are cpu instructions, while variables are pure data.
One more thing to consider: The compiler often optimizes away things. Consider this code:
int x=0;
while(x<10)
x+=1;
This code will probobly be optimized to:
int x=10;
So what would the address to (x+=1) mean in this case? It is not even present in the machine code, so it has - by definition - no address at all.
Where are expressions and constants stored if not in memory
In some (actually many) cases, a constant expression is not stored at all. In particular, think about optimizing compilers, and see CppCon 2017: Matt Godbolt's talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”
In your particular case of some C code having 2 + 3, most optimizing compilers would have constant folded that into 5, and that 5 constant might be just inside some machine code instruction (as some bitfield) of your code segment and not even have a well defined memory location. If that constant 5 was a loop limit, some compilers could have done loop unrolling, and that constant won't appear anymore in the binary code.
See also this answer, etc...
Be aware that C11 is a specification written in English. Read its n1570 standard. Read also the much bigger specification of C++11 (or later).
Taking the address of a constant is forbidden by the semantics of C (and of C++).

Getting "actual" registers from MCInsts (x86)

I'm using llvm-mc with the goal of making a relatively smart disassembler (identifying and tracking locals, easily following branches, etc), and part of that is creating a string representation of the disassembled instructions.
When I started this, I expected that I would be able to relatively easily identify registers and values used by MCInsts and whip out another representation myself with which I could easily work with. However, after some investigation, I realized that the correlation between the operands shown with the textual representation of an instruction and the operands that are actually present within the MCInst object is fairly low. Here are a few examples (Intel syntax):
Moving, say, 11587 as a 32-bit immediate into eax would be done with the MOV32ri opcode. The textual representation would be mov eax, 11587. The corresponding MCInst would have two operands, a register and an immediate. This works for me. This is great.
Adding 11587 to eax would be done with the ADD32ri opcode. The textual representation would be add eax, 11587. However, this time, the corresponding MCInst has three operands: eax is there twice and the immediate is in the end. This isn't so great. I can assume that this is an artifact of the lowering process, that the first instance of eax is the destination register and that the second one is there to be the source (even though x86 does not distinguish between the two), and I can hack around that.
Moving a 32-bits eip-relative value to eax would be done with the MOV32ao32 opcode. The textual representation would be mov eax, dword ptr [11587]. In this case, the MCInst doesn't even have an operand for eax, it can only be inferred from the operand type present in the opcode name. I can hack around that too, but things are getting less and less pretty and I've only tested 5-6 different instructions out of the 1300+ that x86 supports.
Obviously, for the purpose of showing text, I could get the textual representation with an MCInstPrinter, but the mapping between what's shown there and what the MCInst has is still muddy.
Is there a straightforward way to tell which operands appears in the textual representation of an instruction?
Add having three arguments sounds like a compiler builder preference for Three address code is bleeding through, since there is no justification for that in Intel assembler. (you can't add and store to a different register with the ADD instruction, you can with LEA though).
The opcodes run into the hundreds if you count all extensions (like SSE, FPU etc), and worse there are multiple variants of an opcode due to addressing modes and prefixes.
The NASM assembler has some tables in the source that you could try to mine if your llvm-mc system doesn't provide the functionality.
The MC level is very low and the operand layout depends on the opcode. That said, there are mapping tables that tell you what is where. MCInstrDesc and MCOperandInfo will tell you which operands and sources and destinations, whether they are immediates, registers, etc. and a set of flags.
You'll also need to get familiar with MCRegisterClass and MCRegisterInfo and a bunch of other stuff. It's a complicated interface because the task of representing arbitrary target information is complicated.
I would look at the code for the various MC-based tools to get started. You shouldn't need your own representation, MC should have everything you need.

Fast way to get a valid imm num for arm mov?

Arm mov has a limitation that the immediates must be an 8bit rotated by a multiple of 2, we can write:
mov ip, #0x5000
But we cannot write that:
mov ip, #0x5001
The 0x5000 can be split as 0x5000 + 1, my means, the sum of a valid immediate number and a small number.
So for a given 32bit number, how to find the closest valid immediate number fast? Like this:
uint32 find_imm(uint32 src, bool less_than_src) {
...
}
// x is 0x5000
uint32 x = find_imm(0x5001, true);
It is quite simple, look at the distance between the ones. 0x5001 = 0b101000000000001. 15 significant digits, so it will take you two instructions at 8 bits of immediate per. Also remember to put a rotate in your test, if there are enough zeros 0x80000001 and you rotate that around 0x88000000 or 0x00000003 that is only two significant digits from a distance between the ones measurement. So take the immediate, perform a distance between the ones type test, rotate one step, perform the test again, and repeat until all the possible (counter-)rotations have happened and go with one of the ones with the smallest number of instructions/immediates.
gnu as already does this and gas is open source so you can just go get their code if you prefer. When you use the load address trick:
ldr rd,=const
If that const can be resolved in a single move immediate instruction then it encodes it as a
mov rd,#const
if it cant then it tries to find a location to put the word and encodes it as a pc relative load:
ldr rd,[pc,#offset]
...
.word const
There is not a straightforward rule or function for finding ways to construct values. Once a value exceeds what can be loaded easily from immediate values, you usually load it by defining it in the data section and loading it from memory, rather than constructing it from immediate values.
If you do want to construct a value from two immediate values, you must consider a variety of operations, including:
Adding two immediates.
Subtracting two immediates.
Multiplying two immediates.
More esoteric instructions, such as some of the “SIMD” instructions that split 32-bit registers into multiple lanes.
If you must go to three immediate values, there are more combinations. One can find some patterns in the possibilities that reduce the search, but some portion of it remains a “brute force” search. Generally, there is no point in using complicated instructions sequences, since you can simply load the data from a prepared location in memory.
The ARM assembler has an instruction form to assist this:
LDR Rd, =const
When the assembler sees this, it places the const value in the literal pool and generates an instruction to load the value from the pool. If you are using a different assembler, it might not have the same instruction form, but you can write the necessary code manually.

Decoding and matching Chip 8 opcodes in C/C++

I'm writing a Chip 8 emulator as an introduction to emulation and I'm kind of lost. Basically, I've read a Chip 8 ROM and stored it in a char array in memory. Then, following a guide, I use the following code to retrieve the opcode at the current program counter (pc):
// Fetch opcode
opcode = memory[pc] << 8 | memory[pc + 1];
Chip 8 opcodes are 2 bytes each. This is code from a guide which I vaguely understand as adding 8 extra bit spaces to memory[pc] (using << 8) and then merging memory[pc + 1] with it (using |) and storing the result in the opcode variable.
Now that I have the opcode isolated however, I don't really know what to do with it. I'm using this opcode table and I'm basically lost in regards to matching the hex opcodes I read to the opcode identifiers in that table. Also, I realize that many of the opcodes I'm reading also contain operands (I'm assuming the latter byte?), and that is probably further complicating my situation.
Help?!
Basically once you have the instruction you need to decode it. For example from your opcode table:
if ((inst&0xF000)==0x1000)
{
write_register(pc,(inst&0x0FFF)<<1);
}
And guessing that since you are accessing rom two bytes per instruction, the address is probably a (16 bit) word address not a byte address so I shifted it left one (you need to study how those instructions are encoded, the opcode table you provided is inadequate for that, well without having to make assumptions).
There is a lot more that has to happen and I dont know if I wrote anything about it in my github samples. I recommend you create a fetch function for fetching instructions at an address, a read memory function, a write memory function a read register function, write register function. I recommend your decode and execute function decodes and executes only one instruction at a time. Normal execution is to just call it in a loop, it provides the ability to do interrupts and things like that without a lot of extra work. It also modularizes your solution. By creating the fetch() read_mem_byte() read_mem_word() etc functions. You modularize your code (at a slight cost of performance), makes debugging much easier as you have a single place where you can watch registers or memory accesses and figure out what is or isnt going on.
Based on your question, and where you are in this process, I think the first thing you need to do before writing an emulator is to write a disassembler. Being a fixed instruction length instruction set (16 bits) that makes it much much easier. You can start at some interesting point in the rom, or at the beginning if you like, and decode everything you see. For example:
if ((inst&0xF000)==0x1000)
{
printf("jmp 0x%04X\n",(inst&0x0FFF)<<1);
}
With only 35 instructions that shouldnt take but an afternoon, maybe a whole saturday, being your first time decoding instructions (I assume that based on your question). The disassembler becomes the core decoder for your emulator. Replace the printf()s with emulation, even better leave the printfs and just add code to emulate the instruction execution, this way you can follow the execution. (same deal have a disassemble a single instruction function, call it for each instruction, this becomes the foundation for your emulator).
Your understanding needs to be more than vague as to what that fetch line of code is doing, in order to pull off this task you are going to have to have a strong understanding of bit manipulation.
Also I would call that line of code you provided buggy or at least risky. If memory[] is an array of bytes, the compiler might very well perform the left shift using byte sized math, resulting in a zero, then zero orred with the second byte results in only the second byte.
Basically a compiler is within its rights to turn this:
opcode = memory[pc] << 8) | memory[pc + 1];
Into this:
opcode = memory[pc + 1];
Which wont work for you at all, a very quick fix:
opcode = memory[pc + 0];
opcode <<= 8;
opcode |= memory[pc + 1];
Will save you some headaches. Minimal optimization will save the compiler from storing the intermediate results to ram for each operation resulting in the same (desired) output/performance.
The instruction set simulators I wrote and mentioned above are not intended for performance but instead readability, visibility, and hopefully educational. I would start with something like that then if performance for example is of interest you will have to re-write it. This chip8 emulator, once experienced, would be an afternoon task from scratch, so once you get through this the first time you could re-write it maybe three or four times in a weekend, not a monumental task (to have to re-write). (the thumbulator one took me a weekend, for the bulk of it. The msp430 one was probably more like an evening or two worth of work. Getting the overflow flag right, once and for all, was the biggest task, and that came later). Anyway, point being, look at things like the mame sources, most if not all of those instruction set simulators are designed for execution speed, many are barely readable without a fair amount of study. Often heavily table driven, sometimes lots of C programming tricks, etc. Start with something manageable, get it functioning properly, then worry about improving it for speed or size or portability or whatever. This chip8 thing looks to be graphics based so you are going to also have to deal with a lot of line drawing and other bit manipulation on a bitmap/screen/wherever. Or you could just call api or operating system functions. Basically this chip8 thing is not your traditional instruction set with registers and a laundry list of addressing modes and alu operations.
Basically -- Mask out the variable part of the opcode, and look for a match. Then use the variable part.
For example 1NNN is the jump. So:
int a = opcode & 0xF000;
int b = opcode & 0x0FFF;
if(a == 0x1000)
doJump(b);
Then the game is to make that code fast or small, or elegant, if you like. Good clean fun!
Different CPUs store values in memory differently. Big endian machines store a number like $FFCC in memory in that order FF,CC. Little-endian machines store the bytes in reverse order CC, FF (that is, with the "little end" first).
The CHIP-8 architecture is big endian, so the code you will run has the instructions and data written in big endian.
In your statement "opcode = memory[pc] << 8 | memory[pc + 1];", it doesn't matter if the host CPU (the CPU of your computer) is little endian or big endian. It will always put a 16-bit big endian value into an integer in the correct order.
There are a couple of resources that might help: http://www.emulator101.com gives a CHIP-8 emulator tutorial along with some general emulator techniques. This one is good too: http://www.multigesture.net/articles/how-to-write-an-emulator-chip-8-interpreter/
You're going to have to setup a bunch of different bit masks to get the actual opcode from the 16-bit word in combination with a finite state machine in order to interpret those opcodes since it appears that there are some complications in how the opcodes are encoded (i.e., certain opcodes have register identifiers, etc., while others are fairly straight-forward with a single identifier).
Your finite state machine can basically do the following:
Get the first nibble of the opcode using a mask like `0xF000. This will allow you to "categorize" the opcode
Based on the function category from step 1, apply more masks to either get the register values from the opcode, or whatever other variables might be encoded with the opcode that will narrow down the actual function that would need to be called, as well as it's arguments.
Once you have the opcode and the variable information, do a look-up into a fixed-length table of functions that have the appropriate handlers to coincide with the opcode functionality and the variables that go along with the opcode. While you can, in your state machine, hard-code the names of the functions that would go with each opcode once you've isolated the proper functionality, a table that you initialize with function-pointers for each opcode is a more flexible approach that will let you modify the code functionality easier (i.e., you could easily swap between debug handlers and "normal" handlers, etc.).