How does LLVM allocate "irregular"-sized virtual registers to physical registers?

I'm just getting started with learning LLVM and have a question about the register allocation process.
What I understand so far:
Registers defined in LLVM are considered "virtual" registers, and may or may not exceed the number of physical registers a machine has
When LLVM assembly is compiled for a specific machine architecture, a register allocation process determines which of the virtual registers can be mapped to physical registers, and which may need to be loaded and unloaded on the stack, instead (ideally this allocation process is optimizing for performance by minimizing memory access)
For the purpose of this question, let's assume that a "regular"-sized virtual register is one that is the same size as a physical register, and that an "irregular"-sized virtual register is one that is either smaller or larger than a physical register. How does LLVM allocate these "irregular"-sized virtual registers?
More specifically:
If I have multiple irregular-sized virtual registers that are smaller than the physical registers, can LLVM allocate them to the same physical register? For example, if a machine had 64-bit registers, but I had 3 i8 virtual registers, could they all be used in the same physical register? The images in this blog post seem to suggest they can, but I'm not sure I'm interpreting that post correctly. Are there any performance or capability limitations if virtual registers did share a single physical register?
If I have an irregular-sized virtual register that's larger than the physical registers, can LLVM split it across multiple physical registers, or would it be forced to use the stack? For example, if a machine had 64-bit registers, but I had an i72 virtual register, could that just be split across two physical registers? Are there any performance or capability limitations from this?
Assuming the answers to the above two questions are "yes", can a smaller virtual register share a physical register with the "overflow" portion of a larger virtual register?

There is a dedicated process within the backend called "legalization". Basically, types / operations that are not legal for a given target are turned into ones the target supports natively.
There are multiple approaches to legalization:
Implementing an operation in terms of others (sometimes even via a library function call)
Promoting the type to a larger one
Splitting the type into smaller pieces
Or the target could choose to lower a particular operation in some custom way.
Note that "register allocation" is not "assigning physical registers to LLVM IR values" (it is better to think in terms of values rather than registers here, since a value is assigned once and cannot be redefined in LLVM IR). Instead, register allocation operates on already-legalized values, so its input is a set of properly-sized values along with their live ranges.

Related

How do atomic variables based on shared memory work in inter-process contexts?

Let's say a process creates a piece of shared memory the size of 2 integers (each 64 bits / 8 bytes).
The shared memory will be available to not only threads of the process, but other processes on the system that have access to that piece of shared memory.
Presumably the shared memory in the first process will be addressed via a virtual address space, so when an atomic operation (compare-exchange) is performed on the first integer, the virtual address in the context of the first process is used.
If another process at the same time is performing some kind of atomic operation on the first integer, it would also be using its own virtual address space.
So what part of the system actually performs the translation into the actual physical address, and from a very general point of view, how does the CPU provide atomicity guarantees in this situation?
Modern CPU caches operate on physical addresses (usually caches are virtually indexed, physically tagged). Basically this means that two virtual addresses in two different processes translated to the same physical address will be cached just once per CPU.
Modern CPU caches are coherent: the caches are kept synchronized among all CPUs in the system, so all CPUs see identical data. On Intel CPUs the MESI protocol (or a variant of it) is usually used.
Modern CPUs have write buffers, so a memory store takes some time to get to the cache.
So, from a very general point of view, an atomic operation on a modern CPU basically reads and locks a cache line for the exclusive use of that CPU until the atomic operation is done, and propagates the changes directly to the cache, avoiding buffering within the CPU.
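As an illustration of the "same physical line, different virtual addresses" point, here is a minimal sketch using POSIX shared memory and C11 atomics; the object name "/demo_shm" and the two-slot layout are just assumptions for the example:

    #include <fcntl.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* Create (or open) a named shared-memory object of two 64-bit slots. */
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        if (fd < 0) return 1;
        if (ftruncate(fd, 2 * sizeof(_Atomic uint64_t)) != 0) return 1;

        /* Each process maps the object at its own virtual address, but the MMU
         * translates both mappings to the same physical page. */
        _Atomic uint64_t *slots = mmap(NULL, 2 * sizeof(_Atomic uint64_t),
                                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (slots == MAP_FAILED) return 1;

        /* The compare-exchange is issued against the virtual address; the CPU
         * performs it on the translated physical cache line, so concurrent
         * processes all see a single, coherent atomic variable. */
        uint64_t expected = 0;
        atomic_compare_exchange_strong(&slots[0], &expected, 42);
        printf("slot 0 is now %llu\n", (unsigned long long)atomic_load(&slots[0]));

        munmap(slots, 2 * sizeof(_Atomic uint64_t));
        close(fd);
        return 0;
    }

Run two instances of this: both processes may map the object at different virtual addresses, yet the compare-exchange stays atomic because both mappings resolve to the same physical cache line. Older glibc versions need -lrt for shm_open.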

How can I see a page-table maintained by each process in Virtual Memory - Linux?

In the virtual memory concept, each process maintains its own page table. This page table maps a virtual address to a kernel virtual address, and that kernel virtual address is translated to physical RAM. I understand that there is a kernel virtual address - the vm_area_struct - and this vm_area_struct finally maps the address to the physical address.
When I do cat /proc/<pid>/maps, I see a direct mapping of virtual addresses to physical storage, because it maps each address to a file - with an inode. So it looks like it is the address on the hard disk, plus the file descriptor and major/minor numbers. There are a few addresses that are in RAM. So I can say that I can't see the table where a virtual address is mapped to a kernel virtual address. I want to see that table. How can I see it? It should not be in kernel space, because when a process accesses, say, memory at 0x1681010, this should be translated to a kernel virtual memory address, and finally that address should be translated to a physical memory address.
No, the Linux kernel maintains the processes' page tables (the processes themselves do not). Processes only see virtual memory through their address space. They use some syscalls, e.g. mmap(2) or execve(2), to change their address space.
Physical addresses and page tables and dealing with and managing the MMU is the business of the kernel, which actually provides some "abstract machine" (with the virtual address spaces, the syscalls as atomic elementary operations, etc...) to user applications. The applications don't see the raw (x86) hardware, but only the user mode as given by the kernel. Some hardware resources and instructions are not available to them (they only run in user space).
The page tables are managed by the kernel, and indeed various processes may use different - or sometimes the same - page tables. (So context switches managed by the kernel may need to reconfigure the MMU.) You don't need to care, and user processes don't see page tables; the kernel will manage them.
And no, /proc/self/maps does not show anything about physical addresses, only about virtual ones. The kernel is permitted to move processes from one core to another, to move pages from one physical (not virtual) address to another, etc., at any time; and applications usually don't see this (they might query some of it with mincore(2), getcpu(2) and through proc(5)).
Applications should not care about physical memory or interrupts such as page faults (only the kernel cares about these, sometimes notifying the application by sending signals).
The virtual-to-physical address translation happens in the MMU. Usually it is successful (perhaps transparently accessing page tables), and the processor sends the translated physical address (corresponding to some virtual address handled by the user-mode machine instruction) on the bus to the RAM. When the MMU cannot handle it, a page fault occurs, which is processed by the kernel (which could swap in some page, send a SIGSEGV, do a context switch, etc.).
See also the processor architecture, instruction set, page table, paging, translation lookaside buffer, cache, x86 and x86-64 wikipages (and follow all the links I gave you).
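As a footnote to "I want to see that table": Linux does expose a read-only, privileged view of the virtual-to-physical mapping through /proc/<pid>/pagemap (see proc(5)). Below is a minimal sketch, assuming a 64-bit Linux; on current kernels the physical frame number is reported as zero unless the program runs with CAP_SYS_ADMIN (e.g. as root):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        static int probe;
        probe = 1;                         /* touch the page so it is resident */
        long page = sysconf(_SC_PAGESIZE);
        uintptr_t vaddr = (uintptr_t)&probe;

        /* /proc/self/pagemap holds one 64-bit entry per virtual page:
         * bit 63 = page present in RAM, bits 0-54 = physical frame number. */
        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0) return 1;

        uint64_t entry;
        off_t offset = (off_t)(vaddr / (uintptr_t)page) * sizeof(entry);
        if (pread(fd, &entry, sizeof(entry), offset) != (ssize_t)sizeof(entry)) return 1;

        if (entry & (1ULL << 63))
            printf("virtual %#lx -> physical frame %#llx\n",
                   (unsigned long)vaddr,
                   (unsigned long long)(entry & ((1ULL << 55) - 1)));
        else
            printf("page not present in RAM\n");

        close(fd);
        return 0;
    }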

why 128bit variables should be aligned to 16Byte boundary

As we know, an x86 CPU has a 64-bit data bus. My understanding is that the CPU can't access arbitrary addresses: the addresses it can access are integral multiples of the width of its data bus. For performance, variables should start at (be aligned to) these addresses to avoid extra memory accesses. 32-bit variables aligned to a 4-byte boundary will be automatically aligned to an 8-byte (64-bit) boundary, which corresponds to the x86 64-bit data bus. But why do compilers align 128-bit variables to a 16-byte boundary, and not an 8-byte boundary?
Thanks
Let me make things more specific. Compilers use the length of a variable to align it. For example, if a variable is 256 bits long, the compiler will align it to a 32-byte boundary. I don't think there is any kind of CPU with a data bus that wide. Furthermore, common DDR memories only transfer 64 bits of data at a time; cache aside, how could memory fill up a CPU's wider data bus? Or only by means of the cache?
One reason is that most SSE2 instructions on x86 require the data to be 128-bit (16-byte) aligned. This design decision would have been made for performance reasons and to avoid overly complex (and hence slow and big) hardware.
There are so many different processor models that I am going to answer this only in theoretical and general terms.
Consider an array of 16-byte objects that starts at an address that is a multiple of eight bytes but not of 16 bytes. Let's suppose the processor has an eight-byte bus, as indicated in the question, even though some processors do not. Regardless, note that at some point in the array, one of the objects must straddle a page boundary: memory mapping commonly works in 4096-byte pages that start on 4096-byte boundaries. With an eight-byte-aligned array, some element of the array will start at byte 4088 of one page and continue up to byte 7 of the next page.
When a program tries to load the 16-byte object that crosses a page boundary, it can no longer do a single virtual-to-physical memory map. It has to do one lookup for the first eight bytes and another lookup for the second eight bytes. If the load/store unit is not designed for this, then the instruction needs special handling. The processor might abort its initial attempt to execute the instruction, divide it into two special microinstructions, and send those back into the instruction queue for execution. This can delay the instruction by many processor cycles.
In addition, as Hans Passant noted, alignment interacts with cache. Each processor has a memory cache, and it is common for cache to be organized into 32-byte or 64-byte “lines”. If you load a 16-byte object that is 16-byte aligned, and the object is in cache, then the cache can supply one cache line that contains the needed data. If you are loading 16-byte objects from an array that is not 16-byte aligned, then some of the objects in the array will straddle two cache lines. When these objects are loaded, two lines must be fetched from the cache. This may take longer. Even if it does not take longer to get two lines, perhaps because the processor is designed to provide two cache lines per cycle, this can interfere with other things that a program is doing. Commonly, a program will load data from multiple places. If the loads are efficient, the processor may be able to perform two at once. But if one of them requires two cache lines instead of the normal one, then it blocks simultaneous execution of other load operations.
Additionally, some instructions explicitly require aligned addresses. The processor might dispatch these instructions more directly, bypassing some of the tests that fix up operations without aligned addresses. When the addresses of these instructions are resolved and are found to be misaligned, the processor must abort them, because the fix-up operations have been bypassed.
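A minimal C sketch of the two access styles, assuming an x86-64 compiler with SSE2 intrinsics available (<emmintrin.h>):

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Force 16-byte alignment with C11 _Alignas; without it the array is
         * only guaranteed the alignment of its element type. */
        _Alignas(16) uint8_t aligned_buf[32] = {0};
        uint8_t plain_buf[32] = {0};

        /* movdqa: requires a 16-byte-aligned address and faults otherwise. */
        __m128i a = _mm_load_si128((const __m128i *)aligned_buf);

        /* movdqu: tolerates any address, but a load that straddles a cache
         * line or page boundary may need two fetches, as described above. */
        __m128i b = _mm_loadu_si128((const __m128i *)(plain_buf + 1));

        /* Keep the results live so the compiler does not drop the loads. */
        printf("%d %d\n", _mm_cvtsi128_si32(a), _mm_cvtsi128_si32(b));
        return 0;
    }

Compilers that auto-vectorize will themselves choose 16-byte alignment for such data (or emit unaligned loads) precisely to avoid the split-line penalties described above.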

What type of address returned on applying ampersand to a variable or a data type in C/C++ or in any other such language?

This is a very basic question that has been boggling my mind since the day I heard about the concept of virtual and physical memory in my OS class. Now I know that at load time and compile time the virtual address and logical address binding schemes are the same, but at execution time they differ.
First of all, why is it beneficial to generate virtual addresses at compile and load time, and what is returned when we apply the ampersand operator to get the address of a variable, native datatypes, user-defined types, and function definitions?
And how exactly does the OS map from virtual to physical addresses when it does so? These questions are just out of curiosity, and I would love some good and deep insights considering modern-day OSes, and how it was in early OSes. I am asking only about C/C++ since I don't know much about other languages.
Physical addresses occur in hardware, not software. A possible/occasional exception is in the operating system kernel. Physical means it's the address that the system bus and the RAM chips see.
Not only are physical addresses useless to software, but exposing them could be a security issue: being able to access any physical memory without address translation, and knowing the addresses of other processes, would allow unfettered access to the machine.
That said, smaller or embedded machines might have no virtual memory, and some older operating systems did allow shared libraries to specify their final physical memory location. Such policies hurt security and are obsolete.
At the application level (e.g. Linux application process), only virtual addresses exist. Local variables are on the stack (or in registers). The stack is organized in call frames. The compiler generates the offset of a local variable within the current call frame, usually an offset relative to the stack pointer or frame pointer register (so the address of a local variable, e.g. in a recursive function, is known only at runtime).
Try stepping through a recursive function in your gdb debugger and displaying the address of some local variable to understand more. Try also the bt command of gdb.
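For illustration, a minimal C sketch of the same experiment outside gdb; the printed values are virtual addresses and will differ between runs (especially with ASLR):

    #include <stdio.h>

    /* Each activation of this function has its own call frame on the stack,
     * so &local is a different virtual address at every recursion depth.
     * The physical pages behind those addresses are invisible to the process. */
    static void descend(int depth) {
        int local = depth;
        printf("depth %d: &local = %p\n", depth, (void *)&local);
        if (depth > 0)
            descend(depth - 1);
    }

    int main(void) {
        descend(3);
        return 0;
    }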
Type
cat /proc/self/maps
to understand the address space (and virtual memory mapping) of the process executing that cat command.
Within the kernel, the mapping from virtual addresses to physical RAM is done by code implementing paging and driving the MMU. Some system calls (notably mmap(2) and others) can change the address space of your process.
Some early computers (e.g. those from the 1950s or early 1960s, like the CAB 500, IBM 1130, or IBM 1620) did not have any MMU; even the original Intel 8086 didn't have any memory protection. At that time (the 1960s), C did not exist. On processors without an MMU you don't have virtual addresses (only physical ones, including in your embedded C code for a washing-machine manufacturer). Some machines could protect writing into some memory banks through physical switches. Today, some low-end cheap processors (those in washing machines) don't have any MMU. Most cheap microcontrollers don't have any MMU. Often (but not always), the program is in some ROM so it cannot be overwritten by buggy code.

How does a compiler know the alignment of a physical address?

I know that some CPU architectures don't support unaligned address access (e.g., ARM architectures prior to ARM 4 had no instructions to access half-word objects in memory). And some compilers (e.g., some versions of GCC) for such architectures will use a series of memory accesses when they find a misaligned address, so that the misaligned access is almost transparent to developers. (Refer to The Definitive Guide to GCC, by William von Hagen.)
But I'm wondering: how does a compiler know whether an address is aligned or not? After all, what a compiler sees is the virtual address (effective address, EA), if it can see anything. When the program is run, the EA could be mapped to any physical address by the OS. Even if the virtual address is aligned, the resulting physical address could be misaligned, couldn't it? The alignment of the physical address is what really matters and is what goes out on the CPU address lines.
Because a compiler is not aware of the physical address at all, how can it be smart enough to know if a variable's address is aligned?
A virtual address is not mapped to just any physical address. Virtual memory comes in pages that are mapped in an aligned manner to physical pages (generally aligned to 4096 bytes).
See: Virtual memory and alignment - how do they factor together?
Alignment is a very useful attribute for object code, partly because some machines insist on "aligned access", but in modern computers mostly because cache lines have a huge impact on performance, and thus cache-alignment of code/loops/data/locks is a requirement from your local friendly compiler.
Virtually all the loaders in the world support loading code at power-of-two aligned boundaries of some modest size and up. (Assemblers and linkers support this too, with various ALIGNMENT directives.) Often linkers and loaders just align the first loaded value to a well-known boundary size anyway; OSes with virtual memory often provide a convenient boundary based on the VM page size (which ties in to the other answer).
So a compiler can essentially know what the alignment of its emitted code/data is. And by keeping track of how much code it has emitted, it can know what the alignment of any emitted value is. If it needs alignment, it can issue a linker directive, or for modest sizes, simply pad until the emitted amount of code is suitably aligned.
Because of this, you can be pretty sure most compilers will not place code or data constructs in ways that cross cache line (or other architecture imposed) boundaries in a way that materially affects performance unless directed to do so.
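To see why page-aligned mapping makes the virtual alignment sufficient, note that the MMU replaces only the page-number bits of an address; the offset within the page passes through translation unchanged. A minimal C sketch, assuming a 4096-byte page size:

    #include <stdint.h>
    #include <stdio.h>

    /* If a virtual address is 16-byte aligned, the physical address it maps
     * to is 16-byte aligned as well, for any alignment that divides the page
     * size, because only the page-number bits change during translation. */
    int main(void) {
        const uintptr_t page_size = 4096;          /* assumed page size */
        _Alignas(16) static char buf[64];

        uintptr_t vaddr = (uintptr_t)buf;
        uintptr_t offset_in_page = vaddr & (page_size - 1);

        printf("virtual address    : %#lx\n", (unsigned long)vaddr);
        printf("offset within page : %#lx (shared with the physical address)\n",
               (unsigned long)offset_in_page);
        printf("16-byte aligned    : %s\n",
               (vaddr % 16 == 0) ? "yes (so the physical address is too)" : "no");
        return 0;
    }

So for any alignment that divides the page size (16, a cache line, and so on), virtual alignment implies the same physical alignment, which is all the compiler needs to guarantee.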