Encode asm instructions to opcodes - C++

I need to encode a few instructions like
mov eax, edx
inc edx
to the corresponding x86_64 opcodes. Is there any library (not an entire asm compiler) to accomplish that easily?

You could take open source FASM or NASM and use their parser.

In case you have already compiled it into a binary (from your asm, or C with embedded asm):
objdump -S your_binary will list each instruction along with its encoded bytes.
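If you only need the raw bytes rather than a linkable object file, you can also assemble a small snippet to a flat binary (assuming nasm is installed; the file names here are arbitrary):
nasm -f bin snippet.asm -o snippet.bin
With bits 64 at the top of snippet.asm, snippet.bin will contain nothing but the instruction encodings, and ndisasm -b 64 snippet.bin (or any hex dump) will show them.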

Assuming you are just after translating simple instructions, writing a simple assembler wouldn't be THAT much work. I've done it before - and you probably already have most of the logic and tables in your disassembler component (such as a table mapping opcodes to instruction names and register numbers to names - just use that in reverse). I don't mean you can necessarily use the tables directly in reverse, but their content, re-arranged in a suitable way, will do most of the hard work for you.
What gets difficult is symbols and relocation and such things. But since you probably don't really need that for "find this sequence of code", I guess you could do without those parts. You also don't need to generate object files to some specification - you just need a set of bytes.
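To make that concrete, here is a minimal C++ sketch (the function and register names are made up) of the "tables in reverse" idea for the two instructions in the original question; it only handles the register-to-register forms:

#include <cstdint>
#include <vector>

// Register numbers as they appear in the ModRM byte: eax=0, ecx=1, edx=2,
// ebx=3, esp=4, ebp=5, esi=6, edi=7 (a disassembler's register table, reversed).
enum Reg { EAX = 0, ECX, EDX, EBX, ESP, EBP, ESI, EDI };

// 89 /r : MOV r/m32, r32 -- ModRM mod=11 selects the register-direct form.
std::vector<uint8_t> encode_mov_r32_r32(Reg dst, Reg src) {
    return { 0x89, uint8_t(0xC0 | (src << 3) | dst) };
}

// FF /0 : INC r/m32 (the short 40+r form is gone in 64-bit mode, where those
// bytes are REX prefixes, so the two-byte form is the one to emit).
std::vector<uint8_t> encode_inc_r32(Reg reg) {
    return { 0xFF, uint8_t(0xC0 | reg) };
}

// encode_mov_r32_r32(EAX, EDX) -> 89 D0  (mov eax, edx)
// encode_inc_r32(EDX)          -> FF C2  (inc edx)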
Now, it would get a little bit more tricky if you wanted to find:
here:
inc eax
jnz here
jmp someplace_else
....
...
someplace_else:
....
since you'd have to encode the jumps to their relative locations - at the very least, it would require a two-pass approach: first figure out the lengths of the instructions, then actually fill in the jump targets. If "someplace_else" is far from the jump itself, it may also be an absolute jump, in which case your "search" would have to understand how that relates to the location it's searching at - since that sequence would be different for every single address.
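A rough sketch of that two-pass idea in C++ (the data structures are hypothetical, and it only handles short jumps with 8-bit displacements):

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Fixup { size_t offset; std::string label; };   // where an 8-bit displacement goes

int main() {
    std::vector<uint8_t> out;
    std::map<std::string, size_t> labels;
    std::vector<Fixup> fixups;

    // Pass 1: emit bytes, remember label positions and displacement placeholders.
    labels["here"] = out.size();
    out.insert(out.end(), { 0xFF, 0xC0 });             // inc eax
    out.insert(out.end(), { 0x75, 0x00 });             // jnz rel8 (placeholder displacement)
    fixups.push_back({ out.size() - 1, "here" });

    // Pass 2: patch each displacement, counted from the end of its jump instruction.
    for (const Fixup& f : fixups) {
        int rel = int(labels[f.label]) - int(f.offset + 1);
        out[f.offset] = uint8_t(rel);
    }
    // out now holds FF C0 75 FC: inc eax / jnz -4, i.e. back to "here".
}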
I've written both assemblers and disassemblers, and it's not TERRIBLY hard if you don't have to deal with relocatable addresses and file formats with weird definitions that you don't know [until you've studied the 200 page definition of the format].

Related

How do if statements literally work?

I was wondering how an if statement or conditional statement works behind the scenes when executed.
Consider an example like this:
if (10 > 6) {
// Some code
}
How does the compiler or interpreter know that the number 10 is greater than 6, or that 6 is less than 10?
At some point near the end of the compilation, the compiler will convert the above into assembly language similar to:
start: # This is a label that you can reference
mov ax, 0ah # Store 10 in the ax register
mov bx, 06h # Store 6 in the bx register
cmp ax, bx # Compare ax to bx
jg inside_the_brackets # 10 > 6? Goto `inside_the_brackets`
jmp after_the_brackets # Otherwise skip ahead a little
inside_the_brackets:
# Some code - stuff inside the {} goes here
after_the_brackets:
# The rest of your program. You end up here no matter what.
I haven't written in assembler in years so I know that's a jumble of different varieties, but the above is the gist of it. Now, that's an inefficient way to structure the code, so a smart compiler might write it more like:
start: # This is a label that you can reference
mov ax, 0ah # Store 10 in the ax register
mov bx, 06h # Store 6 in the bx register
cmp ax, bx # Compare ax to bx
jle after_the_brackets # 10 <= 6? Goto `after_the_brackets`
inside_the_brackets:
# Some code - stuff inside the {} goes here
after_the_brackets:
# The rest of your program. You end up here no matter what.
See how that reversed the comparison, so instead of if (10 > 6) it's more like if (10 <= 6)? That removes a jmp instruction. The logic is identical, even if it's no longer exactly what you originally wrote. There -- now you've seen an "optimizing compiler" at work.
Every compiler you're likely to have heard of has a million tricks to convert code you write into assembly language that acts the same, but that the CPU can execute more efficiently. Sometimes the end result is barely recognizable. Some of the optimizations are as simple as what I just did, but others are fiendishly clever and people have earned PhDs in this stuff.
Kirk Strauser's answer is correct. However you ask:
How does the compiler or interpreter know that the number 10 is greater than 6, or that 6 is less than 10?
Some optimizing compilers can see that 10 > 6 is a constant expression equivalent to true, and not emit any check or jump at all. If you are asking how they do that, well…
I'll explain the process in steps that hopefully are easy to understand. I'm covering no advanced topics.
The build process will start by parsing your code document according to the syntax of the language.
The syntax of the language will define how to interpret the text of the document (think a string with your code) as a series of symbols or tokens (e.g. keywords, literals, identifiers, operators…). In this case we have:
an if symbol.
a ( symbol.
a 10 symbol.
a > symbol.
a 6 symbol.
a ) symbol.
a { symbol.
and a } symbol.
I'm assuming comments, newlines and white-space do not generate symbols in this language.
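As a toy C++ sketch of that tokenizing step (the scanner and its token kinds are invented for illustration; this is not how any particular compiler does it):

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// A token is just the piece of text plus a coarse classification.
struct Token { std::string text; std::string kind; };

std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    for (size_t i = 0; i < src.size(); ) {
        if (std::isspace((unsigned char)src[i])) { ++i; continue; }         // skip white-space
        if (std::isdigit((unsigned char)src[i])) {                          // integer literal
            size_t j = i; while (j < src.size() && std::isdigit((unsigned char)src[j])) ++j;
            tokens.push_back({ src.substr(i, j - i), "number" });
            i = j;
        } else if (std::isalpha((unsigned char)src[i])) {                   // keyword or identifier
            size_t j = i; while (j < src.size() && std::isalnum((unsigned char)src[j])) ++j;
            tokens.push_back({ src.substr(i, j - i), "word" });
            i = j;
        } else {                                                            // (, ), {, }, >, ...
            tokens.push_back({ std::string(1, src[i]), "punct" });
            ++i;
        }
    }
    return tokens;
}

int main() {
    for (const Token& t : tokenize("if (10 > 6) { }"))
        std::cout << t.kind << ": " << t.text << "\n";   // prints the eight symbols listed above
}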
From the series of symbols, it will build a tree-like memory structure (see AST) according to the rules of the language.
The tree will say that your code is:
An "if statement", that has two children:
A conditional (a boolean expression), which is a greater than comparison that has two children:
A constant literal integer 10
A constant literal integer 6
A body (a set of statements), in this case empty.
Then the compiler can look at that tree and figure out how to optimize it, and emit code in the target language (let us say machine code).
The optimization process will see that the conditional does not involve variables; it is composed entirely of constants that are known at compile time. Thus it can compute the equivalent value and use that, which leaves us with this:
An "if statement", that has two children:
A conditional (a boolean expression), which is a literal true.
A body (a set of statements), in this case empty.
Then it will see that we have a conditional that is always true, and thus we don't need it. So it replaces the if statement with the set of statements in its body. Since there are none, we have optimized the code away to nothing.
You can imagine how the process would then go over the tree, figuring out what is the equivalent in the target language (again, let us say, machine code), and emitting that code until it has gone over the whole tree.
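Here is a toy C++ sketch of the tree and the folding step described above (the node layout and names are invented for illustration; real compilers use much richer structures):

#include <memory>

// A deliberately tiny expression tree: just enough to represent `10 > 6`.
struct Expr {
    enum Kind { Literal, GreaterThan } kind = Literal;
    int value = 0;                         // used when kind == Literal
    std::unique_ptr<Expr> lhs, rhs;        // used when kind == GreaterThan
};

// Constant folding: a comparison of two literals becomes a literal 0 or 1.
void fold(std::unique_ptr<Expr>& e) {
    if (e->kind == Expr::GreaterThan &&
        e->lhs->kind == Expr::Literal && e->rhs->kind == Expr::Literal) {
        int result = e->lhs->value > e->rhs->value;
        e = std::make_unique<Expr>();
        e->value = result;                 // kind stays Literal; here 10 > 6 folds to 1 (true)
    }
}

int main() {
    // Build the conditional child of the "if statement" node: (10 > 6).
    auto cond = std::make_unique<Expr>();
    cond->kind = Expr::GreaterThan;
    cond->lhs = std::make_unique<Expr>(); cond->lhs->value = 10;
    cond->rhs = std::make_unique<Expr>(); cond->rhs->value = 6;

    fold(cond);   // cond is now a Literal with value 1: the "always true" case,
                  // so the whole if statement can be replaced by its (empty) body.
}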
I want to mention that intermediate languages and JIT (just in time) compilation have become very common. See Understanding the differences: traditional interpreter, JIT compiler, JIT interpreter and AOT compiler.
My description of how the build process works is a toy textbook example. I would like to encourage you to learn more about the topic. I'll suggest, in this order:
Computerphile's Compilers with Professor Brailsford video series.
The good old Dragon Book [pdf], and other books such as "How To Create Pragmatic, Lightweight Languages" and "Parsing with Perl 6 Regexes and Grammars".
Finally, CS 6120: Advanced Compilers: The Self-Guided Online Course, which is not about parsing, because it presumes you already know that.
The ability to actually check that is implemented in hardware. To be more specific, it will subtract 10-6 (which is one of the basic instructions that processors can do), and if the result is less than or equal to 0 then it will jump to the end of the block (comparing numbers to zero and jumping based on the result are also basic instructions). If you want to learn more, the keywords to look for are "instruction set architecture", and "assembly".

Why does pre_c_init access memory outside of the defined program segments?

While looking through the assembly for a console "hello world" program (compiled using the visual c++ compiler), I came across this:
pre_c_init proc near
.text:00401AFE mov eax, 5A4Dh
.text:00401B03 cmp ds:400000h, ax
The code above seems to be accessing memory that isn't filled with anything in particular: All segments start at 0x401000 or even further down in the file. (The image base is at 0x400000, but the first segment is at 0x401000).
I used OllyDbg to see what the actual value at 0x400000 is, and every single time it's the same as in the code (0x5A4D). What's going on here?
5A4D is "MZ" in little-endian ASCII, and MZ is the signature of MS-DOS and, more recently, PE executables.
The comparison checks whether the executable has been mapped at the default base address, 0x400000. This, I believe, is used to determine whether it is necessary to perform relocation.
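In rough C++ terms, the check amounts to something like the sketch below (not the actual CRT source, just an illustration; it also assumes the page at the default base is mapped and readable):

#include <windows.h>

// Roughly what pre_c_init's test amounts to: is there an "MZ" DOS header at
// the default image base (0x400000)? If so, the image was not rebased.
static bool image_at_default_base()
{
    const IMAGE_DOS_HEADER* dos = (const IMAGE_DOS_HEADER*)0x400000;
    return dos->e_magic == IMAGE_DOS_SIGNATURE;    // IMAGE_DOS_SIGNATURE == 0x5A4D, i.e. "MZ"
}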
This is discussed further in the following thread: Why does PE need a relocation table?

Getting "actual" registers from MCInsts (x86)

I'm using llvm-mc with the goal of making a relatively smart disassembler (identifying and tracking locals, easily following branches, etc), and part of that is creating a string representation of the disassembled instructions.
When I started this, I expected that I would be able to relatively easily identify the registers and values used by MCInsts and whip out another representation myself that I could easily work with. However, after some investigation, I realized that the correlation between the operands shown in the textual representation of an instruction and the operands that are actually present within the MCInst object is fairly low. Here are a few examples (Intel syntax):
Moving, say, 11587 as a 32-bit immediate into eax would be done with the MOV32ri opcode. The textual representation would be mov eax, 11587. The corresponding MCInst would have two operands, a register and an immediate. This works for me. This is great.
Adding 11587 to eax would be done with the ADD32ri opcode. The textual representation would be add eax, 11587. However, this time, the corresponding MCInst has three operands: eax is there twice and the immediate is at the end. This isn't so great. I can assume that this is an artifact of the lowering process, that the first instance of eax is the destination register and that the second one is there to be the source (even though x86 does not distinguish between the two), and I can hack around that.
Moving a 32-bit eip-relative value to eax would be done with the MOV32ao32 opcode. The textual representation would be mov eax, dword ptr [11587]. In this case, the MCInst doesn't even have an operand for eax; it can only be inferred from the operand type present in the opcode name. I can hack around that too, but things are getting less and less pretty, and I've only tested 5-6 different instructions out of the 1300+ that x86 supports.
Obviously, for the purpose of showing text, I could get the textual representation with an MCInstPrinter, but the mapping between what's shown there and what the MCInst has is still muddy.
Is there a straightforward way to tell which operands appear in the textual representation of an instruction?
Add having three arguments sounds like a compiler builder's preference for three-address code bleeding through, since there is no justification for that in Intel assembly (you can't add and store to a different register with the ADD instruction, though you can with LEA).
The opcodes run into the hundreds if you count all the extensions (like SSE, FPU etc.), and worse, there are multiple variants of each opcode due to addressing modes and prefixes.
The NASM assembler has some tables in the source that you could try to mine if your llvm-mc system doesn't provide the functionality.
The MC level is very low and the operand layout depends on the opcode. That said, there are mapping tables that tell you what is where. MCInstrDesc and MCOperandInfo will tell you which operands are sources and which are destinations, whether they are immediates or registers, and so on, along with a set of flags.
You'll also need to get familiar with MCRegisterClass and MCRegisterInfo and a bunch of other stuff. It's a complicated interface because the task of representing arbitrary target information is complicated.
I would look at the code for the various MC-based tools to get started. You shouldn't need your own representation, MC should have everything you need.
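As a minimal sketch of using those tables (assuming you already have the target's MCInstrInfo and MCRegisterInfo handy; the details shift a bit between LLVM versions):

#include "llvm/MC/MCInst.h"
#include "llvm/MC/MCInstrDesc.h"
#include "llvm/MC/MCInstrInfo.h"
#include "llvm/MC/MCRegisterInfo.h"
#include "llvm/Support/raw_ostream.h"

// Walk an MCInst's explicit operands and classify them using the static
// per-opcode description. For the ADD32ri example above this reports the
// destination register, the (tied) source register and then the immediate.
void dumpOperands(const llvm::MCInst &Inst,
                  const llvm::MCInstrInfo &MII,
                  const llvm::MCRegisterInfo &MRI) {
  const llvm::MCInstrDesc &Desc = MII.get(Inst.getOpcode());
  unsigned NumDefs = Desc.getNumDefs();            // leading operands are definitions
  for (unsigned i = 0, e = Inst.getNumOperands(); i != e; ++i) {
    const llvm::MCOperand &Op = Inst.getOperand(i);
    const char *Role = (i < NumDefs) ? "def" : "use";
    if (Op.isReg())
      llvm::outs() << Role << " reg " << MRI.getName(Op.getReg()) << "\n";
    else if (Op.isImm())
      llvm::outs() << Role << " imm " << Op.getImm() << "\n";
  }
}

Implicit operands - such as the EAX that accumulator-only forms like MOV32ao32 use without listing it in the MCInst - are typically described by the implicit use/def lists attached to the MCInstrDesc rather than appearing as operands.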

_ftol2_sse, are there faster options?

I have code which calls a lot of
int myNumber = (int)(floatNumber);
which takes up, in total, around 10% of my CPU time (according to the profiler). While I could leave it at that, I wonder if there are faster options, so I tried searching around, and stumbled upon
http://devmaster.net/forums/topic/7804-fast-int-float-conversion-routines/
http://stereopsis.com/FPU.html
I tried implementing the Real2Int() function given there, but it gives me wrong results, and runs slower. Now I wonder, are there faster implementations to floor double / float values to integers, or is the SSE2 version as fast as it gets? The pages I found date back a bit, so it might just be outdated, and newer STL is faster at this.
The current implementation does:
013B1030 call _ftol2_sse (13B19A0h)
013B19A0 cmp dword ptr [___sse2_available (13B3378h)],0
013B19A7 je _ftol2 (13B19D6h)
013B19A9 push ebp
013B19AA mov ebp,esp
013B19AC sub esp,8
013B19AF and esp,0FFFFFFF8h
013B19B2 fstp qword ptr [esp]
013B19B5 cvttsd2si eax,mmword ptr [esp]
013B19BA leave
013B19BB ret
Related questions I found:
Fast float to int conversion and floating point precision on ARM (iPhone 3GS/4)
What is the fastest way to convert float to int on x86
Since both are old, or are ARM based, I wonder if there are current ways to do this. Note that it says the best conversion is one that doesn't happen, but I need to have it, so that will not be possible.
It's going to be hard to beat that if you are targeting generic x86 hardware. The compiler doesn't know for sure that the target machine has an SSE unit. If it did, it could do what the x64 compiler does and inline a cvttss2si opcode. But since the check for an SSE unit has to happen at runtime, you are left with the current implementation. That's exactly what _ftol2_sse does: it checks whether SSE is available and, what's more, it passes the value in an x87 register and then transfers it to an SSE register if an SSE unit is available.
You could tell the x86 compiler to target machines that have SSE units. Then the compiler would indeed emit a simple cvttss2si opcode inline. That's going to be as fast as you can get. But if you run the code on an older machine then it will fail. Perhaps you could supply two versions, one for machines with SSE, and one for those without.
That's not going to gain you all that much. It's just going to avoid all the overhead of ftol2_sse that happens before you actually reach the cvttss2si opcode that does the work.
To change the compiler settings from the IDE, use Project > Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set. On the command line it is /arch:SSE or /arch:SSE2.
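If you can require SSE/SSE2 at build time anyway, another option is to spell the conversion out with intrinsics yourself instead of relying on what the compiler emits for the cast. A minimal sketch (the function names are made up; the truncation semantics match the C-style cast):

#include <emmintrin.h>   // SSE2 intrinsics (pulls in the SSE ones as well)

// Truncating float -> int, compiles to a cvttss2si.
inline int float_to_int(float f)   { return _mm_cvttss_si32(_mm_set_ss(f)); }

// Truncating double -> int, compiles to a cvttsd2si.
inline int double_to_int(double d) { return _mm_cvttsd_si32(_mm_set_sd(d)); }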
For double I don't think you will be able to improve the results much, but if you have a lot of floats to convert, using a packed conversion could help. The following is NASM code:
global _start
section .data
align 16
fv1: dd 1.1, 2.5, 2.51, 3.6
section .text
_start:
cvtps2dq xmm1, [fv1] ; Convert four 32-bit(single precision) floats to 32-bit(double word) integers and place the result in xmm1
There should be intrinsics that let you do the same thing in an easier way, but I am not as familiar with using intrinsics libraries. Although you are not using gcc, this article, Auto-vectorization with gcc 4.7, is an eye opener on how hard it can be to get the compiler to generate good vectorized code.
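For what it's worth, here is a hedged intrinsics sketch of the same packed idea (the helper and its unaligned loads/stores are assumptions for the example; note that cvtps2dq rounds according to the current rounding mode, while the cvttps2dq used below truncates like a C cast):

#include <emmintrin.h>   // SSE2

// Convert four packed floats to four 32-bit integers in one instruction.
void convert_four(const float in[4], int out[4])
{
    __m128  f = _mm_loadu_ps(in);             // unaligned load of four floats
    __m128i i = _mm_cvttps_epi32(f);          // cvttps2dq: truncate, matching (int)x
    _mm_storeu_si128((__m128i*)out, i);       // unaligned store of four ints
}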
If you need speed and a large base of target machines, you'd better introduce a fast SSE version of all your algorithms, as well as a generic one, and choose which one to execute at a much higher level.
This would also mean that the ABI is optimized for SSE, that you can vectorize the calculation when available, and that the control logic is optimized for the architecture.
By the way, even an FLD; FIST sequence should take no longer than ~7 clock cycles on a Pentium.

Interchange 2 variables in C++ with asm code

I have a huge function that sorts a very large amount of int data. The code works fine except for the fact that it's slower than it should be. My first step in solving this is to place some asm code inside the C++. How can I interchange 2 variables using asm? I've tried this:
_asm{ push a[x]; push a[y]; pop a[x]; pop a[y];}
and this:
_asm(mov eax, a[x];mov ebx,a[y]; mov a[x],ebx; mov a[y],eax;}
but both crash. How can I save some time on these interchanges? I use VS 2010.
In general, it is very difficult to do better than your compiler with simple code like this.
A compiler, when faced with a swap operation on integers, will typically issue code like this:
mov eax, [x]
mov ebx, [y]
mov [x], ebx
mov [y], eax
Before you try to do better by hand, first check what the compiler is actually generating. If it's something like this, don't bother going any further; you won't be able to do better than this. Moreover, if you leave it to the compiler, it may, if these variables are used immediately thereafter, choose to reuse one of these registers to save on variable loads/stores as well. This is impossible with hand-coded assembly; the compiler must reload the variables after the black box that is hand-coded asm.
Note that the push/push/pop/pop sequence is likely to be much slower; not only does it add an additional four memory operations to the stack, it also introduces dependencies on the stack pointer, eliminating any possibility of pipelining. With the simple mov sequence, it is at least possible to run the pair of reads and pair of writes in parallel if they are on different memory banks, or one is in cache, etc. It also does not introduce stalls on the stack pointer in later code.
As such, you should not try to micro-optimize the cost of an interchange; instead, reduce the number of interchanges performed. There are many sorting algorithms available, each with slightly different characteristics. You may find some are better (cause fewer swaps) on your dataset than others.
What makes you think you can produce faster assembly than an optimizing compiler?
Even if you get it to work properly, all you're likely to achieve is to confuse the optimizer into producing even slower code.
When you write inline assembly, you can change things so that assumptions the compiler has made about register contents will no longer be true. Oftentimes EAX is used to pass a parameter or return a value, so trashing EAX might not have much effect, but you clobbered EBX and didn't put it back, and that could cause problems. Try pushing EBX before you use it, then pop it when you are done.
You can use variable names, function names and labels as symbols in the assembly code. Note that something like a[x] is not a valid symbol.
Writing more efficient code takes skill and knowledge; using asm does not necessarily help you there.
You can compare the assembly code your compiler produces for the function with and without the inline assembler, to see where you broke it.
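Putting the earlier advice together (push/pop EBX, and don't use a[x] directly as an __asm symbol), a minimal MSVC __asm sketch might look like the following; the function name is made up, and as the first answer explains, the compiler's own mov sequence will very likely be just as good or better:

void swap_elements(int a[], int x, int y)
{
    __asm {
        push ebx                    // preserve ebx, as suggested above
        mov  ecx, a                 // base address of the array
        mov  eax, x
        mov  edx, y
        lea  eax, [ecx + eax*4]     // address of a[x]
        lea  edx, [ecx + edx*4]     // address of a[y]
        mov  ecx, [eax]             // old a[x]
        mov  ebx, [edx]             // old a[y]
        mov  [eax], ebx             // a[x] = old a[y]
        mov  [edx], ecx             // a[y] = old a[x]
        pop  ebx
    }
}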