ARM mov has a limitation: the immediate must be an 8-bit value rotated by a multiple of 2 bits. We can write:
mov ip, #0x5000
But we cannot write this:
mov ip, #0x5001
0x5001 can be split as 0x5000 + 1, that is, the sum of a valid immediate and a small number.
So for a given 32-bit number, how can we quickly find the closest valid immediate? Something like this:
uint32_t find_imm(uint32_t src, bool less_than_src) {
...
}
// x is 0x5000
uint32_t x = find_imm(0x5001, true);
It is quite simple: look at the distance between the ones. 0x5001 = 0b101000000000001, which spans 15 significant bits, so it will take you two instructions at 8 bits of immediate each. Also remember to put a rotate in your test: if there are enough zeros, as with 0x80000001, rotating it around to something like 0x88000000 or 0x00000003 leaves only two significant bits by a distance-between-the-ones measurement. So take the immediate, perform a distance-between-the-ones test, rotate one step, perform the test again, and repeat until all the possible (counter-)rotations have been tried, then go with one of the results that uses the smallest number of instructions/immediates.
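A minimal C++ sketch of that rotate-and-test loop (the function names and the greedy chunk-peeling are my own illustration, not from the original post):

#include <cstdint>

// Rotate a 32-bit value right by n bits.
static uint32_t ror32(uint32_t v, unsigned n) {
    n &= 31;
    return n ? (v >> n) | (v << (32 - n)) : v;
}

// Minimum number of 8-bit, even-rotation chunks (a mov plus orr chain) needed
// to build v: try every even rotation, greedily peel 8-bit groups, keep the best.
unsigned arm_imm_count(uint32_t v) {
    if (v == 0) return 1;                        // mov rd, #0 is still one instruction
    unsigned best = 4;                           // four byte-aligned chunks always suffice
    for (unsigned rot = 0; rot < 32; rot += 2) {
        uint32_t r = ror32(v, rot);
        unsigned count = 0;
        while (r) {
            unsigned low = 0;                    // lowest set bit, rounded down to even
            while (!(r & (1u << low))) ++low;
            low &= ~1u;
            r &= ~(0xFFu << low);                // clear that 8-bit chunk
            ++count;
        }
        if (count < best) best = count;
    }
    return best;                                 // 1 means a single valid immediate
}

With that in hand, a find_imm like the one in the question could step down (or up) from src until arm_imm_count returns 1.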
GNU as already does this, and gas is open source, so you can just go get their code if you prefer. When you use the load-address trick:
ldr rd,=const
If that const can be resolved in a single move immediate instruction then it encodes it as a
mov rd,#const
If it can't, then it tries to find a location to put the word and encodes it as a PC-relative load:
ldr rd,[pc,#offset]
...
.word const
There is not a straightforward rule or function for finding ways to construct values. Once a value exceeds what can be loaded easily from immediate values, you usually load it by defining it in the data section and loading it from memory, rather than constructing it from immediate values.
If you do want to construct a value from two immediate values, you must consider a variety of operations, including:
Adding two immediates.
Subtracting two immediates.
Multiplying two immediates.
More esoteric instructions, such as some of the “SIMD” instructions that split 32-bit registers into multiple lanes.
If you must go to three immediate values, there are more combinations. One can find some patterns in the possibilities that reduce the search, but some portion of it remains a “brute force” search. Generally, there is no point in using complicated instruction sequences, since you can simply load the data from a prepared location in memory.
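As a rough illustration of that brute-force search, here is a sketch covering only the add and subtract cases (the helper names are illustrative, and this is far from an exhaustive list of usable operations):

#include <cstdint>

// True if v is a single valid ARM immediate: an 8-bit value rotated by an even amount.
static bool is_arm_imm(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        uint32_t r = rot ? (v >> rot) | (v << (32 - rot)) : v;
        if ((r & ~0xFFu) == 0) return true;
    }
    return false;
}

// Try to express target as a + b or a - b with both halves valid immediates.
// Enumerating every (imm8, rotation) pair keeps the search at 256 * 16 candidates.
bool split_two_imm(uint32_t target, uint32_t &a, uint32_t &b) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        for (uint32_t imm8 = 0; imm8 <= 0xFF; ++imm8) {
            uint32_t first = rot ? (imm8 >> rot) | (imm8 << (32 - rot)) : imm8;
            if (is_arm_imm(target - first)) { a = first; b = target - first; return true; }  // mov + add
            if (is_arm_imm(first - target)) { a = first; b = first - target; return true; }  // mov + sub
        }
    }
    return false;
}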
The ARM assembler has an instruction form to assist this:
LDR Rd, =const
When the assembler sees this, it places the const value in the literal pool and generates an instruction to load the value from the pool. If you are using a different assembler, it might not have the same instruction form, but you can write the necessary code manually.
Related
I have a function that uses the compiler intrinsic __movsq to copy some data from a global buffer into another global buffer upon every call of the function. I'm trying to nop out those instructions once a flag has been set globally and the same function is called again. Example code:
// compiler: MSVC++ VS 2022 in C++ mode; x64
void DispatchOptimizationLoop()
{
__movsq(g_table, g_localimport, 23);
// hopefully create a nop after movsq?
static unsigned char* ptr = (unsigned char*)(&__nop);
if (!InterlockedExchange8(g_Reduce, 1))
{
// point to movsq in memory
ptr -= 3;
// nop it out
...
}
// rest of function here
...
}
Basically the function places a nop after the movsq, and then tries to get the address of the placed nop and backtrack by the size of the movsq so that a pointer is pointing to the start of the movsq, so then I can simply cover it with 3 0x90s. I am aware that the line (unsigned char*)(&__nop) is not actually creating a nop because I'm not calling the intrinsic; I'm just trying to show what I want to do.
Is this possible, or is there a better way to store the address of the instructions that need to be nop'ed out in the future?
It's not useful to have the address of a 0x90 NOP somewhere else; all you need is the address of the machine code inside your function. Nothing you've written comes remotely close to helping you find that. As you say, &__nop doesn't lead to there being a NOP in your function's machine code which you could offset relative to.
If you want to hard-code offsets that could break with different optimization settings, you could take the address of the start of the function and offset it.
Or you could write the whole function in asm so you can put a label on the address you want to modify. That would actually let you do this safely.
You might get something that happens to work with GNU C labels as values, where you can take the address of C goto labels like &&label. Like put a mylabel: before the intrinsic, and maybe after for good measure so you can check that the difference is the expected 3 bytes. If you're lucky, the compiler won't put any other instructions between your labels.
So you can memset((void*)&&mylabel, 0x90, 3) (after an assert on &&mylabel_end - &&mylabel == 3). But I don't think MSVC supports that GNU extension or anything equivalent.
But you can't actually use memset if another thread could be running this at the same time.
And for efficiency, you want a single 3-byte NOP anyway.
And of course you'd have to VirtualProtect the page of machine code containing that instruction to make it writeable. (Assuming the function is 16-byte aligned, it's hopefully impossible for that one instruction near the start to be split across two pages.)
And if other threads could be running this function at the same time, you'd better use an atomic RMW (on the containing dword or qword) to replace the 3-byte instruction with a single 3-byte NOP, otherwise you could have another thread fetch and decode the first NOP, but then fetch a byte of the movsq machine code that hasn't been replaced yet.
Actually a plain mov store would be atomic if it's 4 bytes not crossing an 8-byte boundary. Since there are no other writers of different data, it's fine to load / AND/OR / store to later store the same surrounding bytes you loaded earlier. Normally a non-atomic load+store is not thread-safe, but no other threads could have written a different value in the meantime.
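A minimal sketch along those lines, assuming the movsq starts at a 4-byte-aligned address so all 3 bytes sit inside one naturally aligned dword, and that no other thread writes this code (PatchWithNop3 and code are illustrative names, not MSVC API; VirtualProtect and FlushInstructionCache are the usual Win32 calls):

#include <windows.h>
#include <cstdint>
#include <cstring>

// Replace a 3-byte instruction at code with a single 3-byte NOP (0F 1F 00)
// using one aligned 4-byte store, which x86 performs atomically.
void PatchWithNop3(unsigned char* code)
{
    DWORD oldProtect;
    VirtualProtect(code, 4, PAGE_EXECUTE_READWRITE, &oldProtect);

    uint32_t word;
    std::memcpy(&word, code, sizeof(word));          // read the surrounding dword
    const unsigned char nop3[3] = { 0x0F, 0x1F, 0x00 };
    std::memcpy(&word, nop3, sizeof(nop3));          // splice in the 3-byte NOP
    *(volatile uint32_t*)code = word;                // single aligned 4-byte store back

    VirtualProtect(code, 4, oldProtect, &oldProtect);
    FlushInstructionCache(GetCurrentProcess(), code, 4);
}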
Atomicity of cross-modifying code
I think cross-modifying code has atomicity rules similar to data. But if the instruction spans a 16-byte boundary, code fetch on another core might have pulled in the first 1 or 2 bytes of it before you atomically replace all 3. The 2nd and 3rd bytes then get treated as either the start of an instruction, or as the 2nd and 3rd bytes of a long NOP. Since long NOPs generally start with the 0F 1F escape sequence, if that's not how __movsq's machine code starts then decoding could desync.
So if cross-modifying code doesn't trigger a pipeline nuke on the other core, it's not safe to do it while another thread might be running the code. Code fetch is usually done in 16-byte chunks but that's not guaranteed. And it's not guaranteed that they're aligned 16-byte chunks.
So you should probably make sure no other threads are running this function while you change the machine code, unless you're very sure of the safety of what you're doing and check each build to make sure the instruction starts at a safe offset, where "safe" accounts for anything that could go wrong.
Recently I checked the instruction set for an ARM Cortex-M3 processor.
For example:
ADD <Rd>, <Rn>, <Rm>
What do those abbreviations mean exactly?
I guess they mean different kinds of addressing, like directly addressed, relatively addressed, or so.
But what exactly?
Thanks!
Operands of the form <Rx> refer to general-purpose registers, i.e. r0-r15 (or accepted aliases like sp, pc, etc.).
I'm not sure if it's ever called out specifically anywhere but there is a general pattern of "d" meaning destination, "t" meaning target, "n" meaning the first operand or base register, "m" meaning the second operand, and occasionally "a" meaning an accumulator. Hence why you might spot designations like <Rdn> (in the destructive two-operand instructions), or <Rt>, <Rt2> (where a 64-bit value is held across a pair of GP registers). This is consistent across the other types of register too, e.g. VADD.F32 <Sd>, <Sn>, <Sm>.
They are just there to denote registers, the lowercase letter serving only to tell them apart in the explanation. Rd is the destination, but Rn, Rm etc. are just any register you can use. It's the only way to tell which is which when explaining something like "Rd equals Rn bitwise ANDed with Rm", since you can't use numbers.
They could just as well be Rx, Ry etc., or Ra, Rb...
Basics:
Rd is the destination, Rn and Rm are sources. They're all general-purpose integer registers; FP would use Sd / Sn / Sm or Dd / Dn / Dm for single or double.
ARM syntax puts the destination(s) on the left, before read-only source operands.
See Notlikethat's answer for more. Some small additions to that:
t: in this post, an ARM employee comments that "t" might mean "transfer" instead of "target".
Since t generally appears in memory instructions like LDR and STR, I understand that to mean "transfer to/from memory", e.g. in the ARMv8 ARM (issue F.a):
LDR <Xt>, [<Xn|SP>, (<Wm>|<Xm>){, <extend> {<amount>}}]
STR <Xt>, [<Xn|SP>, (<Wm>|<Xm>){, <extend> {<amount>}}]
where the t is the source/destination of memory reads and writes.
This is also further suggested in the description of the STR and LDXR instruction registers:
<Xt> Is the 64-bit name of the general-purpose register to be transferred, encoded in the "Rt" field.
The LDR instruction however says "loaded":
<Xt> Is the 64-bit name of the general-purpose register to be loaded, encoded in the "Rt" field.
This terminology is especially meaningful because ARM is RISC-y and so there are relatively few instructions that do memory IO, and they tend to do just that (unlike say add and store to memory as is common in x86).
t1 and t2: these are used for memory instructions that load/store two values at once, e.g. the ARMv8 LDP/STP:
LDP <Xt1>, <Xt2>, [<Xn|SP>], #<imm>
STP <Xt1>, <Xt2>, [<Xn|SP>, #<imm>]!
n and m are just commonly used integer variable/index names in mathematics
s:
the STXR instruction stores to memory from Xt (like STR), but it also returns a second value (whether the write succeeded) in Ws:
STXR <Ws>, <Xt>, [<Xn|SP>{,#0}]
so presumably s was chosen because it comes before t like m comes before n.
Some ARMv7/aarch32 instructions could take a shift in a register, and Rs is the name given to that register, e.g.:
ORR{<c>}{<q>} {<Rd>,} <Rn>, <Rm>, <shift> <Rs>
I couldn't easily find aarch64 ones.
If it were documented, "Chapter C2 About the A64 Instruction Descriptions" might have been a good location for the information, but it's not there.
With the constraint that I can use only SSE and SSE2 instructions, I have a need to replace the least significant (0) element of a 4 element __m128i vector with the 0 element from another vector.
For floating point vectors, the task is simple - one can use the _mm_move_ss() intrinsic to cause the element to be replaced with the 0 element from another vector. It generates one movss instruction, so is quite efficient.
Using two casting intrinsics, it's possible to also convince the compiler to use a single SSE movss instruction to move integer data. The source code ends up looking like this:
__m128i NewVector = _mm_castps_si128(_mm_move_ss(_mm_castsi128_ps(Take3FromThisVector),
_mm_castsi128_ps(Take1FromThisVector)));
It looks a bit messy, but with a suitable amount of commenting it can be acceptable, especially since it generates a minimum of instructions. In its typical use everything's optimized to be in xmm registers.
My question is this:
Since it's a movss instruction, where the "ss" implies single precision floating point, is it okay to have it move integer data that could potentially contain some "special" or "illegal" (for floating point) combo of bits in any of the vector positions?
The obvious alternative - which I also implemented and tested - is to AND the first vector with a mask, then OR in a second vector that contains just one value in the least significant element, with all the others being zero. As you can imagine, this generates more instructions.
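For reference, that masking alternative looks something like this (a sketch with illustrative names, SSE2 only):

#include <emmintrin.h>

// Keep elements 1..3 of take3, take element 0 from take1 (the AND/OR version).
static __m128i merge_low_element(__m128i take3, __m128i take1)
{
    const __m128i keep_high = _mm_set_epi32(-1, -1, -1, 0);   // clear element 0
    const __m128i keep_low  = _mm_set_epi32(0, 0, 0, -1);     // keep only element 0
    return _mm_or_si128(_mm_and_si128(take3, keep_high),
                        _mm_and_si128(take1, keep_low));
}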
I've tested the casting approach I showed above and it doesn't seem to cause any problems, but I note in particular that there's no intrinsic provided that does this same operation for integer data. It seems as though Intel would have provided one if it was just as good for integer data - e.g., _mm_move_epi32 or similar. And so I'm skeptical whether this is a good idea.
I did some searches, e.g., "can a movss instruction cause a floating point exception", but did not find any information that would answer my question.
Thanks in advance for knowledge you're willing to share.
-Noel
Yes, it's fine to use FP shuffles like movss xmm, xmm on integer data. The insn reference manual tells you that it can't raise FP numeric exceptions; only actual FP math instructions do that. So go ahead and cast.
There isn't even a bypass delay for using FP shuffles on integer data in most uarches (but there is extra latency for using integer shuffles between FP math instructions).
Agner Fog's "optimizing assembly" guide has a great section on what instructions are useful for different kinds of data movement (broadcasts, merging, etc.) See also the x86 tag wiki for more good links.
The reason there's no integer intrinsic is that the SSE2 movd integer instruction zeros the upper bytes of the destination, like movss used as a load, but unlike movss between registers.
Intel's vector instruction set is known for its inconsistency and non-orthogonality, especially the earliest versions (like SSE1). SSE4.1 filled many gaps, but there are still obvious missing pieces.
The types __m128 and __m128i are interchangeable. The main reason for the cast is to make your intentions clearer (and keep your compiler happy). The cast itself does not generate any extra assembly.
The _mm_move_ss operation is described directly in terms of which bits end up in your result.
If you end up with an invalid bit combination for single-precision floats, this will only be a problem if you try to use the resulting value in floating-point calculations.
I am in the process of optimizing the code for my n-body simulator, and while profiling it I have seen this:
These two lines,
float diffX = (pNode->CenterOfMassx - pBody->posX);
float diffY = (pNode->CenterOfMassy - pBody->posY);
Where pNode is a pointer to an object of type Node that I have defined, which contains (among other things) two floats, CenterOfMassx and CenterOfMassy,
and pBody is a pointer to an object of type Body that I have defined, which contains (among other things) two floats, posX and posY,
Should take the same amount of time, but do not. In fact the first line accounts for 0.46% of function samples, but the second accounts for 5.20%.
Now I can see the second line has 3 instructions, and the first only has one.
My question is: why do these two lines, which seemingly do the same thing, in practice do different things?
As previously stated, the profiler is listing only one assembly instruction for the first line, but three for the second line. However, because the optimizer can move code around a lot, this isn't very meaningful. It looks like the code is optimized to load all of the values into registers first, and then perform the subtractions. So it performs an action from the first line, then an action from the second line (the loads), followed by an action from the first line and an action from the second line (the subtractions). Since this is difficult to represent, the profiler just does a best approximation of which line corresponds to which assembly code when displaying the disassembly inline with the code.
Take note that the first load is executed and may still be in the CPU pipeline when the next load instruction is executing. The second load has no dependence on the registers used in the first load. However, the first subtraction does. This instruction requires the previous load instruction to be far enough in the pipeline that the result can be used as one of the operands of the subtraction. This will likely cause a stall in the CPU while the pipeline lets the load finish.
All of this really reinforces the concept of memory optimizations being more important on modern CPUs than CPU optimizations. If, for example, you had loaded the required values into registers 15 instructions earlier, the subtractions might have occurred much quicker.
Generally the best thing you can do for optimization is to keep the cache fresh with the memory you are going to be using, and to make sure it gets updated as soon as possible, not right before the memory is needed. Beyond that, optimizations are complicated.
Of course, all of this is further complicated by modern CPUs that might look ahead 40-60 instructions for out of order execution.
To optimize this further, you might consider using a library that does vector and matrix operations in an optimized manner. Using one of these libraries, it might be possible to use two vector instructions instead of 4 scalar instructions.
According to the expanded assembly, the instructions are reordered to load pNode's data members before performing the subtraction with pBody's data members. The purpose may be to take advantage of memory caching.
Thus, the execution order is not the same as the C code anymore. Comparing the 1 movss attributed to the 1st C statement with the 1 movss + 2 subss attributed to the 2nd C statement is unfair.
Performance counters aren't cycle-accurate. Sometimes the wrong instruction gets blamed. But in this case, it's probably pointing the finger at the instruction that was producing the result everything else was waiting for.
So probably it ran out of things it could do while waiting for the result of the memory access and FP sub. If cache misses are happening, look for ways to structure your code for better memory locality, or at least for memory access to happen in-order. Hardware prefetchers can detect sequential access patterns up to some limit of stride length.
Also, your compiler could have vectorized this. It loads two scalars from sequential addresses, then subtracts two scalars from sequential addresses. It would be faster to
movq xmm0, [esi+30h] # or movlps, but that wouldn't break the dep
movq xmm1, [edi] # on the old value of xmm0 / xmm1
subps xmm0, xmm1
That leaves diffX and diffY in element 0 and 1 of xmm0, rather than 2 different regs, so the benefit depends on the surrounding code.
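In intrinsics, that idea would look roughly like this (a sketch; the function name and the assumption that the x/y members are adjacent floats in memory are mine, not from the original code):

#include <emmintrin.h>

// Load {CenterOfMassx, CenterOfMassy} and {posX, posY} as 8-byte pairs (movq),
// then subtract both in one subps; diffX lands in element 0, diffY in element 1.
static __m128 diff_xy(const float* centerOfMass, const float* pos)
{
    __m128 a = _mm_castsi128_ps(_mm_loadl_epi64((const __m128i*)centerOfMass));
    __m128 b = _mm_castsi128_ps(_mm_loadl_epi64((const __m128i*)pos));
    return _mm_sub_ps(a, b);
}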
I'm using llvm-mc with the goal of making a relatively smart disassembler (identifying and tracking locals, easily following branches, etc), and part of that is creating a string representation of the disassembled instructions.
When I started this, I expected that I would be able to relatively easily identify the registers and values used by MCInsts and whip up another representation myself that I could easily work with. However, after some investigation, I realized that the correlation between the operands shown in the textual representation of an instruction and the operands actually present within the MCInst object is fairly low. Here are a few examples (Intel syntax):
Moving, say, 11587 as a 32-bit immediate into eax would be done with the MOV32ri opcode. The textual representation would be mov eax, 11587. The corresponding MCInst would have two operands, a register and an immediate. This works for me. This is great.
Adding 11587 to eax would be done with the ADD32ri opcode. The textual representation would be add eax, 11587. However, this time, the corresponding MCInst has three operands: eax is there twice and the immediate is in the end. This isn't so great. I can assume that this is an artifact of the lowering process, that the first instance of eax is the destination register and that the second one is there to be the source (even though x86 does not distinguish between the two), and I can hack around that.
Moving a 32-bit eip-relative value to eax would be done with the MOV32ao32 opcode. The textual representation would be mov eax, dword ptr [11587]. In this case, the MCInst doesn't even have an operand for eax; it can only be inferred from the operand type present in the opcode name. I can hack around that too, but things are getting less and less pretty and I've only tested 5-6 different instructions out of the 1300+ that x86 supports.
Obviously, for the purpose of showing text, I could get the textual representation with an MCInstPrinter, but the mapping between what's shown there and what the MCInst has is still muddy.
Is there a straightforward way to tell which operands appear in the textual representation of an instruction?
ADD having three arguments sounds like a compiler-builder preference for three-address code bleeding through, since there is no justification for it in Intel assembly (you can't add and store to a different register with the ADD instruction, though you can with LEA).
The opcodes run into the hundreds if you count all extensions (like SSE, FPU etc), and worse there are multiple variants of an opcode due to addressing modes and prefixes.
The NASM assembler has some tables in the source that you could try to mine if your llvm-mc system doesn't provide the functionality.
The MC level is very low and the operand layout depends on the opcode. That said, there are mapping tables that tell you what is where. MCInstrDesc and MCOperandInfo will tell you which operands are sources and which are destinations, whether they are immediates or registers, etc., and a set of flags.
You'll also need to get familiar with MCRegisterClass and MCRegisterInfo and a bunch of other stuff. It's a complicated interface because the task of representing arbitrary target information is complicated.
I would look at the code for the various MC-based tools to get started. You shouldn't need your own representation, MC should have everything you need.
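For example, a rough sketch of walking an MCInst's operands alongside its MCInstrDesc (assuming you already have the MCInstrInfo and MCRegisterInfo set up, as the MC-based tools do; dumpOperands is an illustrative name):

#include "llvm/MC/MCInst.h"
#include "llvm/MC/MCInstrDesc.h"
#include "llvm/MC/MCInstrInfo.h"
#include "llvm/MC/MCRegisterInfo.h"
#include "llvm/Support/raw_ostream.h"

// Print each operand of Inst, marking the leading ones that MCInstrDesc
// declares as definitions (destinations); the rest are uses (sources).
void dumpOperands(const llvm::MCInst &Inst,
                  const llvm::MCInstrInfo &MII,
                  const llvm::MCRegisterInfo &MRI) {
  const llvm::MCInstrDesc &Desc = MII.get(Inst.getOpcode());
  for (unsigned i = 0, e = Inst.getNumOperands(); i != e; ++i) {
    const llvm::MCOperand &Op = Inst.getOperand(i);
    llvm::outs() << (i < Desc.getNumDefs() ? "def " : "use ");
    if (Op.isReg())
      llvm::outs() << MRI.getName(Op.getReg()) << "\n";
    else if (Op.isImm())
      llvm::outs() << Op.getImm() << "\n";
    else
      llvm::outs() << "<other>\n";
  }
}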