How to maintain the atomicity of the pinned code in LLVM - llvm

I used my own pass to insert some instructions into the basic blocks of a program. After generating the executable, the disassembly shows that some of the inserted instructions were reordered by the optimizer; in particular, some inserted instructions are interleaved with the program's original instructions. How can I disable this reordering and keep my instrumentation code together as one contiguous unit? Maybe with optnone?
The instrumentation code looks like this:
/* Load */
LoadInst *MapPtr = IRB.CreateLoad(MAP);
MapPtr->setMetadata(M.getMDKindID("nosanitize"), MDNode::get(C, None));
/* Store */
IRB.CreateStore(Num, MAP)
->setMetadata(M.getMDKindID("nosanitize"), MDNode::get(C, None));

What's the structure of .arm.extab entry in armcc?

I'm trying to understand exactly how the exception table (.arm.extab) works.
I'm aware that this is compiler dependent, so I'll restrict myself to armcc (as I'm using Keil).
A typical entry in the table looks something like:
b0aa0380 2a002c00 01000000 00000000
To my understanding, the first word encodes instructions for the personality routine, while the third word is a R_ARM_PREL31 relocation to the start of the catch block.
What baffles me is the second word - it appears to be split into 2 shorts, the second of which measures some distance from the start of the throwing function, but I'm not sure exactly to what (nor what the first short does).
Is there any place where the structure of these entries is documented?
I've found 2 relevant documents, but as far as I can see they have no compiler-dependent information, so they're not sufficient:
https://github.com/ARM-software/abi-aa/releases/download/2022Q1/aaelf32.pdf
https://github.com/ARM-software/abi-aa/releases/download/2022Q1/ehabi32.pdf
If you happen to have the byte ordering mixed up, the below applies. Some of the information is probably useful even if the byte order in your original example is correct.
extab and exidx are sections added by the AAPCS which is a newer ARM ABI.
For the older APCS, the frame pointer (fp) is the root of a linked list of the active routines, leading back to the main routine (or _start). With AAPCS, records are created and placed in the exidx and extab sections. These are needed to unwind stacks (and release resources) when fp is used as a general-purpose register.
The exidx is an ordered table of routine start addresses, each paired with either an extab index or a "can't unwind" marker. Given a PC (program counter) value, the table can be searched to find the corresponding extab entry.
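The lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: ExidxEntry, find_entry and prel31_to_addr are made-up names, and the table is assumed to have been resolved into absolute start addresses already. In the real section the first word of each entry is a prel31 offset relative to its own location, and a second word equal to 1 means EXIDX_CANTUNWIND.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Resolve a prel31 field: a 31-bit signed offset relative to the address
// ("place") at which the field itself is stored.
inline std::uint32_t prel31_to_addr(std::uint32_t field, std::uint32_t place) {
    std::uint32_t off = field & 0x7FFFFFFFu;
    if (off & 0x40000000u) off |= 0x80000000u;  // sign-extend from bit 30
    return place + off;                          // wraps modulo 2^32
}

// Hypothetical, already-resolved form of one exidx entry.
struct ExidxEntry {
    std::uint32_t fn_start;  // absolute start address of the routine
    std::uint32_t word;      // second word: 1 (can't unwind), inline data, or extab reference
};

constexpr std::uint32_t kCantUnwind = 1u;  // EXIDX_CANTUNWIND

// Return the entry whose routine covers pc, or nullptr if pc lies before the
// first routine. The table must be sorted by fn_start.
inline const ExidxEntry* find_entry(const std::vector<ExidxEntry>& table,
                                    std::uint32_t pc) {
    auto it = std::upper_bound(
        table.begin(), table.end(), pc,
        [](std::uint32_t value, const ExidxEntry& e) { return value < e.fn_start; });
    if (it == table.begin()) return nullptr;
    return &*(it - 1);  // last entry with fn_start <= pc
}
```

Because the table is ordered, a binary search is enough; no per-function end address is stored, so an entry is taken to cover everything up to the next entry's start.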
The ARM EHABI documentation has a section 7 on Exception-handling Table entries. These are extab entries and you can at least start from there to learn more. There are two defined,
Generic (or C++)
ARM compact
The compact model will be used for most 'C' code. There are no objects to be destroyed on the stack as with C++. The hex 8003aab0 gives,
1000b for the leading nibble, so this is compact.
0000b for the index: Su16 (short).
03h - pop 16 bytes, some locals or padding.
aah - pop r4-r6 and r14 (lr)
b0h - finish
Table 4, ARM-defined frame-unwinding instructions gives the unwinding data of each byte.
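As a rough sketch of that decoding, the three opcode bytes of the example word can be walked as below. The function name decode_su16 is made up for illustration, and only the opcode patterns that occur in this example are handled. Note that in Table 4 the pattern 10101nnn pops r14 in addition to r4-r[4+nnn], so 0xaa restores lr along with r4-r6.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decode the three unwind opcodes packed into an ARM EHABI compact-model
// word that uses personality routine index 0 (Su16).
// Only the opcodes needed for the example are recognised.
std::vector<std::string> decode_su16(std::uint32_t word) {
    assert((word >> 31) == 1u);            // bit 31 set: compact model
    assert(((word >> 24) & 0x0Fu) == 0u);  // index 0: Su16
    std::vector<std::string> out;
    for (int shift = 16; shift >= 0; shift -= 8) {
        const std::uint32_t op = (word >> shift) & 0xFFu;
        if ((op & 0xC0u) == 0x00u) {
            // 00xxxxxx: vsp = vsp + (xxxxxx << 2) + 4
            out.push_back("vsp += " + std::to_string(((op & 0x3Fu) << 2) + 4u));
        } else if ((op & 0xF0u) == 0xA0u) {
            // 10100nnn: pop r4-r[4+nnn]; 10101nnn: the same, plus r14
            std::string s = "pop r4-r" + std::to_string(4u + (op & 0x07u));
            if (op & 0x08u) s += ", r14";
            out.push_back(s);
        } else if (op == 0xB0u) {
            out.push_back("finish");  // 10110000: end of the opcode stream
            break;
        } else {
            out.push_back("unhandled opcode");
        }
    }
    return out;
}
```

Calling decode_su16(0x8003aab0) yields the three steps listed above, in order.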
The next is 0x002c002a which is an offset to the generic personality routine. The next four values should be the 8.2 Data Structures, which are a size and should be zero... Next would be stride and then a four byte object type info. The offset 0x2c002a would be to call the objects destructor or some sort of wrapper to do this.
I think all C++ code is intended to use this Generic method. Other methods are for different languages and NOT compilers.
Related Q/A and links:
Arm exidx - about the exidx.
ARM link and frame pointer - situation for older APCS and many AAPCS functions.
Linux ARM Unwind - sample unwinding code for 'C'.
prel31 - SO Q/A on prel31 in Linux code above.
Generating unwind in ARM gnu assembler
gas ARM directives See: .cantunwind, .vsave, etc.

How can I utilize the 'red' and 'atom' PTX instructions in CUDA C++ code?

The CUDA PTX Guide describes the instructions 'atom' and 'red', which perform atomic and non-atomic reductions. This is news to me (at least with respect to non-atomic reductions)... I remember learning how to do reductions with SHFL a while back. Are these instructions reflected or wrapped somehow in CUDA runtime APIs? Or some other way accessible with C++ code without actually writing PTX code?
Most of these instructions are reflected in atomic operations (built-in intrinsics) described in the programming guide. If you compile any of those atomic intrinsics, you will find atom or red instructions emitted by the compiler at the PTX or SASS level in your generated code.
The red instruction type will generally be used when you don't explicitly use the return value from one of the atomic intrinsics. If you use the return value explicitly, then the compiler usually emits the atom instruction.
Thus, it should be clear that this instruction by itself does not perform a complete classical parallel reduction, but certainly could be used to implement one if you wanted to depend on atomic hardware (and associated limitations) for your reduction operations. This is generally not the fastest possible implementation for parallel reductions.
If you want direct access to these instructions, the usual advice would be to use inline PTX where desired.
As requested, to elaborate using atomicAdd() as an example:
If I perform the following:
atomicAdd(&x, data);
perhaps because I am using it for a typical atomic-based reduction into the device variable x, then the compiler would emit a red (PTX) or RED (SASS) instruction taking the necessary arguments (the pointer to x and the variable data, i.e. 2 logical registers).
If I perform the following:
int offset = atomicAdd(&buffer_ptr, buffer_size);
perhaps because I am using it not for a typical reduction but instead to reserve a space (buffer_size) in a buffer shared amongst various threads in the grid, which has an offset index (buffer_ptr) to the next available space in the shared buffer, then the compiler would emit an atom (PTX) or ATOM (SASS) instruction, including 3 arguments (offset, &buffer_ptr, and buffer_size, in registers).
The red form can be issued by the thread/warp, which may then continue without stalling, since the instruction normally creates no dependencies for subsequent instructions. The atom form, on the other hand, modifies one of its 3 arguments (one of the 3 logical registers). Any subsequent use of the data in that register (i.e. the return value of the intrinsic, offset in this case) can therefore stall the thread/warp until the atomic hardware actually returns the value.
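CUDA aside, the two usage patterns can be mimicked on the host with std::atomic. This is only an analogy, not CUDA code: accumulate and reserve are hypothetical helper names, and nothing here says which machine instruction a host compiler emits. The point is the difference at the source level: the first call discards the old value, the second one depends on it.

```cpp
#include <atomic>
#include <cassert>

// Pattern 1: reduction-style accumulation. The previous value returned by
// fetch_add is ignored, so nothing downstream waits on the result.
inline void accumulate(std::atomic<int>& sum, int value) {
    sum.fetch_add(value, std::memory_order_relaxed);
}

// Pattern 2: space reservation. The previous value is the offset of the
// region now owned by the caller, so the result is consumed right away.
inline int reserve(std::atomic<int>& next_free, int size) {
    return next_free.fetch_add(size, std::memory_order_relaxed);
}
```

Two successive reserve calls with sizes 4 and 8 on a counter that starts at 0 hand out offsets 0 and 4 and leave the counter at 12.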

How to recover the C++ try/throw/catch block length and address from machine code?

I'm doing a project that reorders basic blocks inside a function at runtime in C++ under 64-bit Linux. Of course, the reordering process includes updating instructions like "jmp", etc. One problem is that if (I guess) the compiler (clang++ or g++) determines the try{...} block using a range, i.e., from address1 to address2; the reordered code would have problems (some basic blocks are moved out of range and some new basic blocks are swapped in).
My question is: does the compiler/program determine the try{...} block using a range? Either way, how can I find and modify the data that determines it, so that I can recover the try/throw/catch blocks and let the program execute normally after reordering, given that the program has already been loaded into memory?
FYI, here is the relevant document for LLVM's implementation for try-catch. g++ does something very similar.
When you say "by range", I assume you mean the compiler treats the instructions from, say, 0x0010 to 0x0020 as the try body and the instructions from 0x0020 to 0x0024 as the catch block. Going by the LLVM specification, it doesn't rely on such an assumption.
Edit:
Here is some more reading on how g++ and clang implement try-catch.

Why this branch instruction of ARM doesn't work

Now I am writing a library to mock the trivial function for C/C++. It is used like this: MOCK(mocked, substitute)
If you call the mocked function, the substitute function will be called instead.
I modify the attribute of code page and inject the jump code into the function to implement it. I have implemented it for x86 CPU and I want to port it to ARM CPU. But I have a problem when I inject binary code.
For example, the address of substitute function is 0x91f1, and the address of function to mock is 0x91d1. So I want to inject the ARM branch code into 0x91d1 to jump to the substitute function.
According to the document online, the relative address is
(0x91f1 - (0x91d1 + 8)) / 4 = 6
so the binary instruction is:
0xea000006
Because my ARM emulator (an Android ARMv7 emulator) is little-endian, the binary code to inject is:
0x060000ea
But when I executed the mocked function after injecting the branch code, a segmentation fault occurred. I don't know why the branch instruction is wrong. I have not studied the ARM architecture, so I don't know whether the ARM branch instruction has some limitations.
The addresses you are branching to are odd, which means the functions are Thumb code.
There is an obvious problem with your approach.
If the target is in Thumb mode, you either need to already be in Thumb mode at the point you are branching from, or you need to use a bx (Branch and Exchange) instruction.
Your functions are in Thumb mode (+1 at the target), but you are using an ARM-mode branch encoding (B, encoding A1?), so either you are not in Thumb mode, or you are executing an ARM-mode instruction in Thumb mode.
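To make the mismatch concrete, here is a small sketch of the encoding arithmetic from the question. The helper names is_thumb and encode_arm_b are made up for illustration. It assumes the usual interworking convention that bit 0 of a function address marks Thumb code, and it only covers the ARM-state unconditional B (condition AL, opcode 0xEA), where the PC reads as the instruction address plus 8.

```cpp
#include <cassert>
#include <cstdint>

// Bit 0 set in a function address marks Thumb code (interworking convention).
constexpr bool is_thumb(std::uint32_t addr) { return (addr & 1u) != 0; }

// Encode an unconditional ARM-state B placed at 'from' that jumps to 'to'.
// Both addresses must be 4-byte-aligned ARM-state addresses.
inline std::uint32_t encode_arm_b(std::uint32_t from, std::uint32_t to) {
    assert(!is_thumb(from) && !is_thumb(to));
    assert(from % 4 == 0 && to % 4 == 0);
    // Offset is relative to PC, which reads as from + 8 in ARM state.
    const std::int32_t offset = static_cast<std::int32_t>(to - (from + 8u));
    assert(offset >= -(1 << 25) && offset < (1 << 25));  // +/-32 MB reach
    // imm24 holds the byte offset divided by 4, in two's complement.
    const std::uint32_t imm24 = (static_cast<std::uint32_t>(offset) >> 2) & 0x00FFFFFFu;
    return 0xEA000000u | imm24;
}
```

For the even addresses 0x91d0 and 0x91f0 this gives 0xea000006, the same value as in the question, so the arithmetic itself is fine. The catch is that is_thumb() is true for both 0x91d1 and 0x91f1: the code at those addresses is Thumb code, so an ARM-encoded word written there will not be executed as the intended branch.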
The ARM family allows loading of registers with values. One of those registers is the PC (Program Counter).
Some alternatives:
You could use a function to load the PC register with the destination address (absolute).
Add an offset to the PC register.
Use a multiply-and-add instruction with the PC register.
Push the destination address onto the stack and pop it into the PC register.
These choices, plus modifying the destination of the branch instruction, are all different options, and none of them is "best". Pick the one that suits you and is easiest to maintain.

Design elements for inline asm in concurrent usage

I can't find a neat explanation of how I'm supposed to write a piece of inline asm, and of what problems can arise from concurrent use of a function foo that contains asm code.
The problem I see is that in asm the registers are uniquely named, so one name is strictly tied to one precise part of the CPU. That looks like a big problem for code that is supposed to run concurrently, because you can't simply create extra registers with the same name.
The other problem is that asm doesn't really use a calling convention: you simply refer to registers and/or values, and sometimes using a register implies a silent action on another register that doesn't even show up explicitly in your code. So I can't even expect that my C/C++ function foo will be packed and sealed inside its own stack if it contains asm code.
With what gcc calls extended asm, I can basically declare where the inputs and outputs go, so each function can use its own parameters "as registers". The pattern is the following:
asm ( assembler template
: output
: input
: registers
);
Assuming that my main target for now is mathematical operations, and that my function is only supposed to provide some functionality and perform some computation (no internal lock), is extended asm good for concurrency? How should I design a piece of asm that is supposed to be used by a concurrent application?
For now I'm using gcc, but I would like a generic answer about the general asm design that I'm supposed to give to this kind of code snippets.
You seem to be misunderstanding what threading actually is. Let's consider a single-processor system first. The threads don't actually run concurrently, since there is only one unit that can successfully decode and execute them. Your operating system is only creating the illusion of running multiple threads (and processes, too) by employing scheduling inside of it : every thread, or process, is allocated a certain amount of time it gets to execute on the processor.
This is why, when threads are executed, they don't overwrite each other's registers. When a currently executed thread or process is switched, the operating system asks the processor to perform something that's called a context switch. In a nutshell, the processor saves its state when it was executing the previous task/thread/process into some memory area, which is controlled by the OS. The new task/thread/process has its context restored from the previously stored state and continues its execution. When this task/thread/process' time slice on the CPU is up, the scheduler decides which task/thread/process to resume next. The time slice is usually very small, which is why you're given the illusion of multiple streams of code running at the same time. Keep in mind that this is a very, very simplified description : refer to CPU manuals or books on operating systems for more detail.
The situation is analogous on multi-processor systems : only with the exception that, then, there is more than one unit that can execute the instructions. This is also true for multi-core processors : every one of the cores has its own set of registers. The basic stuff stays the same - the scheduler in your OS decides whether the code being executed is actually executed at the same time by multiple cores in one processor.
Thus, your concerns in this case are not valid. However, they were raised for very valid reasons. Remember that the only thing threads share is main memory: each thread has its own registers and its own stack.
Let me come back to the actual question about gcc's extended inline assembly. The compiler itself cannot work out which registers are modified by the assembly you wrote. That's why you need to specify it. However, it is very rare that an instruction modifies a register without you being able to control it, and it happens only with a small number of instructions - assuming that we're talking about x86. Moreover, gcc can work out the destination/source operands by itself when you want to refer to a C/C++ variable from inside the assembly. In fact, this is the preferred method, since it leaves the compiler much more room for optimization.
Consider this piece of code :
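/* 32-bit x86; reading cr0 is privileged and faults in user mode */
unsigned int get_cr0(void)
{
    unsigned int rc;
    /* volatile: the read must not be removed or moved by the optimizer */
    __asm__ volatile (
        "movl %%cr0, %0\n"
        : "=r"(rc)  /* outputs: %0 is bound to rc */
        :           /* inputs: none */
                    /* clobbers: none, so the list is omitted */
    );
    return rc;
}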
This function's purpose is to return the contents of the control register cr0. This is a privileged instruction, so the program will not work when you run it in user mode, but this is not important right now. See how I put %0 in the instruction, and then specified "=r"(rc) in the output list. This means that %0 will be automagically aliased by the compiler to your rc variable. You can do this for every variable you specify on the input/output list. They are numbered starting from zero, as you can see.
I can't really remember the instructions which used registers that were not encoded as operands, so I can't give you an example right now. In this case, you would need to put them on the clobber list (the last one). I'm pretty sure you can refer to this for more information.
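One instruction of that kind, offered here only as an illustration, is the one-operand x86 MUL: it implicitly reads EAX and writes EDX:EAX. The sketch below uses the hypothetical helper name mul_low32 and assumes GCC or Clang with the default AT&T syntax on x86; on other targets it falls back to plain C++ so that it still compiles. EAX is bound to an operand, while EDX is written without being an operand, so it goes on the clobber list.

```cpp
#include <cassert>
#include <cstdint>

// Return the low 32 bits of a * b.
inline std::uint32_t mul_low32(std::uint32_t a, std::uint32_t b) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__("mull %1"
            : "+a"(a)        // in/out: EAX holds a, then the low half of the product
            : "r"(b)         // input: any general register the compiler picks
            : "edx", "cc");  // clobbers: EDX (high half of the product) and the flags
    return a;
#else
    return a * b;  // portable fallback for non-x86 targets
#endif
}
```

Since the function touches only its own operands and registers that the compiler saves and allocates per call, concurrent calls from several threads do not interfere with each other, which matches the point about each thread having its own register context.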
I also can't answer anything regarding "general asm design", since this is a non-standard extension and thus varies between compilers. The 64-bit Visual Studio compilers don't support it at all, for example.