Invalidating a specific area of data cache without flushing its content - c++

I'm currently working on a project using the Zynq-7000 SoC. We have a custom DMA IP in PL to provide faster transactions between peripherals and main memory. The peripherals are generally serial devices such as UART. The data received by the serial device is transferred immediately to the main memory by DMA.
What I am trying to do is read the data stored at a predetermined location in memory. Before reading the data, I invalidate the related cache lines using a function provided by the xil_cache.h library, as below.
Xil_DCacheInvalidateRange(INTPTR adr, u32 len);
The problem here is that this function flushes the related cache lines before invalidating them. Because of the flush, the data the DMA just wrote to memory is overwritten with stale cache contents, so I fetch corrupted bytes every time. The behavior is explained in the library documentation as follows.
If the address to be invalidated is not cache-line aligned, the following choices are available:
1) Invalidate the cache line when required and do not bother much about the side effects. Though it sounds good, it can result in hard-to-debug issues: if some other variables are allocated in the same cache line and have recently been updated (in cache), the invalidation results in loss of data.
2) Flush the cache line first. This ensures that any other variables present in the same cache line that were updated recently are flushed out to memory; then the line can safely be invalidated. Again it sounds good, but this can also cause issues. For example, when the invalidation happens in a typical ISR (after a DMA transfer has updated the memory), flushing the cache line means losing data that was updated just before the ISR was invoked.
As you can guess, I cannot always allocate a memory region with a cache-line-aligned address. Therefore, I took a different approach: I calculate the cache-line-aligned address located in memory right before my buffer and call the invalidation function with that address. Note that the Zynq's L2 cache is an 8-way set-associative 512 KB cache with a fixed 32-byte line size, which is why I mask the last 5 bits of the given memory address. (See section 3.4, L2 Cache, in the Zynq documentation.)
INTPTR invalidationStartAddress = INTPTR(uint32_t(dev2memBuffer) - (uint32_t(dev2memBuffer) & 0x1F)); // equivalent to masking the low 5 bits: addr & ~0x1F
Xil_DCacheInvalidateRange(invalidationStartAddress, BUFFER_LENGTH);
This way I can work around the problem, but I'm not sure whether I'm corrupting whatever is placed right before the buffer allocated for DMA. (The buffer in question is allocated on the heap with operator new.) Is there a way to overcome this issue, or am I overthinking it? I believe the problem could be solved more cleanly if there were a function that invalidates the related cache lines without flushing them.
EDIT: Invalidating cache lines that fall outside the allocated area corrupts variables placed close to the buffer, so the first solution is not applicable. My second idea was to allocate a buffer 32 bytes larger than required and crop its unaligned part. But this can cause the same problem: the last 32-byte block is not guaranteed to belong entirely to the buffer, so invalidating it might corrupt whatever is placed right after it. The library documentation states:
Whenever possible, the addresses must be cache-line aligned. Please
note that not just the start address, even the end address must be
cache-line aligned. If that is taken care of, this will always work.
SOLUTION: As I stated in the last edit, the only way to overcome the problem was to allocate a memory region with a cache-aligned address and length. Since I cannot control the start address the allocator returns, I decided to allocate a region two cache blocks larger than the requested one and crop the unaligned parts; the misalignment can occur at the first and the last block. In order not to break deallocation, I carefully saved the originally allocated address and used the cache-aligned one in all other operations.
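For reference, a minimal sketch of that scheme might look like the code below. The names (AlignedBuffer, allocateAligned, releaseAligned) are illustrative, not from any Xilinx API; on a C++17 toolchain, an over-aligned operator new or std::aligned_alloc could replace the manual arithmetic:

#include <cstddef>
#include <cstdint>

constexpr std::size_t CACHE_LINE = 32; // Zynq-7000 cache line size

struct AlignedBuffer {
    uint8_t*    original; // pointer returned by new[], kept for delete[]
    uint8_t*    aligned;  // cache-line-aligned pointer used for DMA and cache ops
    std::size_t length;   // usable length, a whole number of cache lines
};

AlignedBuffer allocateAligned(std::size_t requested)
{
    // Round the usable length up to a whole number of cache lines, then
    // over-allocate by one extra line so the start can be slid forward.
    std::size_t usable = (requested + CACHE_LINE - 1) & ~(CACHE_LINE - 1);
    uint8_t* raw = new uint8_t[usable + CACHE_LINE];

    uintptr_t addr    = reinterpret_cast<uintptr_t>(raw);
    uintptr_t rounded = (addr + CACHE_LINE - 1) & ~uintptr_t(CACHE_LINE - 1);

    return { raw, reinterpret_cast<uint8_t*>(rounded), usable };
}

void releaseAligned(AlignedBuffer& buf)
{
    delete[] buf.original; // always free the original pointer, never the aligned one
    buf.original = buf.aligned = nullptr;
}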
I believe that there are better solutions to the problem and I keep the question open.

Your solution is correct. There is no way to invalidate (or flush) anything smaller than a whole cache line.
Normally this behavior is transparent to programs but it becomes visible in multithreaded code and when sharing memory with hardware accelerators.

Related

How does a computer 'know' what memory is allocated?

When memory is allocated in a computer, how does it know which bytes are already occupied and can't be overwritten?
So if these are some bytes of memory that aren't being used:
[0|0|0|0]
How does the computer know whether they are in use or not? They could just be an integer that equals zero. Or they could be empty memory. How does it know?
That depends on the way the allocation is performed, but it generally involves manipulation of data belonging to the allocation mechanism.
When you allocate some variable in a function, the allocation is performed by decrementing the stack pointer. Via the stack pointer, your program knows that anything below the stack pointer is not allocated to the stack, while anything above the stack pointer is allocated.
When you allocate something via malloc() etc. on the heap, things are similar but more complicated: all these allocators have some internal data structures which they never expose to the calling application, but which allow them to select which memory addresses to return for an allocation request. Some malloc() implementations, for instance, use a number of memory pools for small objects of fixed size and maintain a linked list of free objects for each size they track. That way, they can quickly pop a memory region off the appropriate list, only doing more expensive computations when they run out of regions to satisfy a certain request size.
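As a toy illustration of that freelist idea, here is a sketch assuming a single 64-byte size class (a real malloc() keeps one list per size class and carves its nodes out of large blocks obtained from the kernel rather than calling operator new):

#include <new>

class FixedPool {
    // Each free block stores the link to the next free block inside
    // itself, so the bookkeeping costs no extra memory.
    union Node { Node* next; unsigned char storage[64]; };
    Node* freelist = nullptr;

public:
    void* allocate()
    {
        if (freelist) {                      // fast path: pop a region off the list
            Node* n = freelist;
            freelist = n->next;
            return n;
        }
        return ::operator new(sizeof(Node)); // slow path: get fresh memory
    }

    void deallocate(void* p)
    {
        Node* n = static_cast<Node*>(p);     // push the region back on the list
        n->next = freelist;
        freelist = n;
    }
};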
In any case, each of these allocators has to request memory from the system kernel from time to time. This mechanism always works on complete memory pages (usually 4 kiB), via the syscalls brk() and mmap(). Again, the kernel keeps track of which pages are visible in which processes, and at which addresses they are mapped, so there is additional memory allocated inside the kernel for this.
These mappings are made available to the processor via the page tables, which it uses to resolve virtual memory addresses to physical addresses. So here, finally, some hardware is involved in the process, but that is really far, far down in the guts of the mechanics, far below anything that a userspace process is ever able to see. Still, even the page tables are managed by the software of the kernel, not by the hardware; the hardware only interprets what the software writes into the page tables.
First of all, I have the impression that you believe there is some unoccupied memory that doesn't hold any value. That's wrong. You can imagine memory as a very large array where each box contains a value whether someone has put something in it or not. If a memory location was never written, it contains an arbitrary value.
Now to answer your question: it's not the computer (meaning the hardware) but the operating system. It holds somewhere in its memory some tables recording which parts of memory are used. Also, any byte of memory can be overwritten.
In general, you cannot tell by looking at content of memory at some location whether that portion of memory is used or not. Memory value '0' does not mean the memory is not used.
To tell what portions of memory are used you need some structure to tell you this. For example, you can divide memory into chunks and keep track of which chunks are used and which are not.
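A toy version of that chunk tracking might look like the sketch below; it is purely illustrative, and the point is that the "used" information lives in a separate bitmap, not in the chunks themselves:

#include <bitset>
#include <cstddef>

constexpr std::size_t kChunkSize = 64;
constexpr std::size_t kChunks    = 1024;

unsigned char arena[kChunkSize * kChunks]; // the memory being managed
std::bitset<kChunks> used;                 // the bookkeeping lives elsewhere

void* takeChunk()
{
    for (std::size_t i = 0; i < kChunks; ++i)
        if (!used[i]) {
            used[i] = true;
            return arena + i * kChunkSize;
        }
    return nullptr; // every chunk is occupied
}

void freeChunk(void* p)
{
    std::size_t i = (static_cast<unsigned char*>(p) - arena) / kChunkSize;
    used[i] = false; // the chunk's contents are left untouched
}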
There are memory blocks, and each is either occupied or not occupied. On the heap, there are very complex data structures which organize this. But your question, as asked, is too broad.

Pointer indirection check for invalid memory access and segmentation fault

struct A { int i; };
...
A *p = (A*) (8); // or A *p = 0;
p->i = 5; // undefined behavior according to the C and C++ standards
However, in practice most systems would crash (segmentation fault) on such code.
Does it mean that all such architectures/systems have a hidden check for pointer indirection (i.e., p->) to verify whether it's accessing a wrong memory location?
If yes, then it implies that even in perfectly working code we are paying the price for that extra check, correct ?
There are generally no extra hidden checks, this is just an effect of using virtual memory.
Some of the potential virtual addresses are just not mapped to physical memory, so translating things like 8 will likely fail.
Yes, you are paying the price for that extra check. It's not just for pointer indirection, but any memory access (other than, say, DMA). However, the cost of the check is very small.
While your process is running, the page table does not change very often. Parts of the page table will be cached in the translation lookaside buffer (TLB), and accessing pages with entries in the buffer incurs no additional penalty.
If your process accesses a page without a TLB entry, then the CPU must make an additional memory access to fetch the page table entry for that page. It will then be cached.
You can see the effect of this in action by writing a test program. Give your test program a big chunk of memory and start randomly reading and writing locations in it, using a command-line parameter to change the size (a sketch follows the list below):
Above the L1 cache size, performance will drop due to L2 cache latency.
Above the L2 cache size, performance will drop to RAM latency.
Above the size of the memory addressed by the TLB, performance will drop due to TLB misses. (This might happen before or after you run out of L2 cache space, depending on a number of factors.)
Above the size of available RAM, performance will drop due to swapping.
Above the size of available swap space and RAM, the application will be terminated by the OS.
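A rough sketch of such a test program (illustrative only; the constants and the timing method are arbitrary):

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <random>
#include <vector>

int main(int argc, char** argv)
{
    // Buffer size comes from the command line, defaulting to 1 MiB.
    std::size_t size = (argc > 1) ? std::strtoull(argv[1], nullptr, 10)
                                  : (1u << 20);
    std::vector<unsigned char> buffer(size);

    std::mt19937_64 rng(12345);
    std::uniform_int_distribution<std::size_t> dist(0, size - 1);

    const int accesses = 10000000;
    unsigned long checksum = 0; // keeps the compiler from eliding the accesses

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < accesses; ++i) {
        std::size_t idx = dist(rng);
        buffer[idx] ^= 1;       // read-modify-write a random byte
        checksum += buffer[idx];
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::cout << size << " bytes: "
              << elapsed.count() * 1e9 / accesses << " ns/access"
              << " (checksum " << checksum << ")\n";
    return 0;
}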
If your operating system allows "big pages", the TLB might be able to cover a very large address space indeed. Perhaps you can sabotage the OS by allocating 4k chunks from mmap, in which case the TLB misses might be felt with only a few megs of working set, depending on your processor.
However: The small performance drop must be weighed against the benefits of virtual memory, which are too numerous to list here.
No, not correct. Those exact same checks are absolutely needed on valid memory accesses for two reasons:
1) Otherwise, how would the system know what physical memory you were accessing and whether the page was already resident?
2) Otherwise, how would the operating system know which pages of physical memory to page out if physical memory became tight?
It's integrated into the entire virtual memory system and part of what makes modern computers perform so amazingly well. It's not any kind of separate check, it's part of the process that determines which page of physical memory the operation is accessing. It's part of what makes copy-on-write work. (The very same check detects when a copy is needed.)
A segmentation fault is an attempt to access memory that the CPU cannot physically address. It occurs when the hardware notifies the operating system about a memory access violation. So I think there is no extra check as such: if an attempt to access a memory location fails, the hardware notifies the OS, which then sends a signal to the process that caused the exception. By default, the process receiving the signal dumps core and terminates.
First of all, you need to read and understand this: http://en.wikipedia.org/wiki/Virtual_memory#Page_tables
So what typically happens is, when a process attempts to dereference an invalid virtual memory location, the OS catches the page fault exception raised by the MMU (see link above) for the invalid virtual address (0x0, 0x8, whatever). The OS then looks up the address in its page table, doesn't find it, and issues a SIGSEGV signal (or similar) to the process which causes the process to crash.
The difference between a valid and invalid address is whether the OS has allocated a page for that address range. Most OSes are designed to never allocate the first page (the one starting at 0x0) so that NULL dereferences will always crash.
So what you're calling an "extra check" is really the same check that occurs for every single page fault, valid address or not -- it's just a matter of whether the page table lookup succeeds.

Fast resize of a mmap file

I need a copy-free re-size of a very large mmap file while still allowing concurrent access to reader threads.
The simple way is to use two MAP_SHARED mappings (grow the file, then create a second mapping that includes the grown region) in the same process over the same file and then unmap the old mapping once all readers that could access it are finished. However, I am curious if the scheme below could work, and if so, is there any advantage to it.
1) mmap a file with MAP_PRIVATE
2) do read-only access to this memory in multiple threads
3) either acquire a mutex for the file and write to the memory (assume this is done in a way that the readers, which may be reading that memory, are not messed up by it)
4) or acquire the mutex, increase the size of the file, and use mremap to move it to a new address (resizing the mapping without copying or unnecessary file I/O)
The crazy part comes in at (4). If you move the memory, the old addresses become invalid, and the readers, which are still reading it, may suddenly get an access violation. What if we modify the readers to trap this access violation and then restart the operation (i.e., don't re-read the bad address; recalculate the address from the offset and the new base address returned by mremap)? Yes, I know that's evil, but to my mind the readers can only either successfully read the data at the old address or fail with an access violation and retry. If sufficient care is taken, that should be safe. Since resizing would not happen often, the readers would eventually succeed and not get stuck in a retry loop.
A problem could occur if that old address space is re-used while a reader still has a pointer to it. Then there will be no access violation, but the data will be incorrect and the program enters the unicorn and candy filled land of undefined behavior (wherein there are usually neither unicorns nor candy).
But if you controlled allocations completely and could make certain that any allocations that happen during this period do not ever re-use that old address space, then this shouldn't be a problem and the behavior shouldn't be undefined.
Am I right? Could this work? Is there any advantage to this over using two MAP_SHARED mappings?
It is hard for me to imagine a case where you don't know the upper bound on how large the file can be. Assuming that's true, you could "reserve" the address space for the maximum size of the file by providing that size when the file is first mapped in with mmap(). Of course, any accesses beyond the actual size of the file will cause an access violation, but that's how you want it to work anyway -- you could argue that reserving the extra address space ensures the access violation rather than leaving that address range open to being used by other calls to things like mmap() or malloc().
Anyway, the point is with my solution, you never move the address range, you only change its size and now your locking is around the data structure that provides the current valid size to each thread.
My solution doesn't work if you have so many files that the maximum mapping for each file runs you out of address space, but this is the age of the 64-bit address space so hopefully your maximum mapping size is no problem.
(Just to make sure I wasn't forgetting something stupid, I did write a small program to convince myself creating the larger-than-file-size mapping gives an access violation when you try to access beyond the file size, and then works fine once you ftruncate() the file to be larger, all with the same address returned from the first mmap() call.)
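For illustration, a minimal sketch of that experiment on a POSIX system might look like the following; error handling is reduced to asserts, and note that on Linux touching a mapped page beyond the end of the file raises SIGBUS rather than SIGSEGV:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstring>

int main()
{
    const size_t max_size = 1ul << 30; // reserve 1 GiB of address space up front
    const size_t page     = 4096;

    int fd = open("data.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
    assert(fd >= 0);
    assert(ftruncate(fd, page) == 0);  // the file itself starts at one page

    // Map the maximum size once; this address never changes afterwards.
    char* base = static_cast<char*>(
        mmap(nullptr, max_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    assert(base != MAP_FAILED);

    std::memcpy(base, "hello", 5);     // fine: within the current file size
    // base[page] = 'x';               // would fault: beyond the end of the file

    assert(ftruncate(fd, 2 * page) == 0); // grow the file...
    base[page] = 'x';                  // ...and the same mapping now covers it

    munmap(base, max_size);
    close(fd);
    return 0;
}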

Does any OS allow moving memory from one address to another without physically copying it?

memcpy/memmove duplicate (copy the data) from source to destination. Does anything exist to move pages from one virtual address to another without doing an actual byte-by-byte copy of the source data? It seems perfectly possible to me, but does any operating system actually allow it? It seems odd to me that dynamic arrays are such a widespread and popular concept, yet growing them by physically copying is such a wasteful operation. It just doesn't scale when you start talking about array sizes in the gigabytes (e.g., imagine growing a 100GB array into a 200GB array; that's a problem that's entirely possible on servers in the < $10K range now).
void* very_large_buffer = VirtualAlloc(NULL, 2GB, MEM_COMMIT);
// Populate very_large_buffer, run out of space.
// Allocate buffer twice as large, but don't actually allocate
// physical memory, just reserve the address space.
void* even_bigger_buffer = VirtualAlloc(NULL, 4GB, MEM_RESERVE);
// Remap the physical memory from very_large_buffer to even_bigger_buffer without copying
// (i.e. don't copy 2GB of data, just copy the mapping of virtual pages to physical pages)
// Does any OS provide support for an operation like this?
MoveMemory(even_bigger_buffer, very_large_buffer, 2GB)
// Now very_large_buffer no longer has any physical memory pages associated with it
VirtualFree(very_large_buffer)
To some extent, you can do that with mremap on Linux.
That call plays with the process's page table to do a zero-copy reallocation if it can. It is not possible in all cases (address space fragmentation, and simply the presence of other existing mappings are an issue).
The man page actually says this:
mremap() changes the mapping between virtual addresses and memory pages. This can be used to implement a very efficient realloc(3).
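A minimal sketch of that call in action (Linux-specific, compiled with g++; MREMAP_MAYMOVE lets the kernel pick a new virtual address while reusing the same physical pages):

#include <sys/mman.h>
#include <cassert>
#include <cstring>

int main()
{
    const size_t old_size = 1ul << 20; // 1 MiB
    const size_t new_size = 1ul << 21; // 2 MiB

    void* p = mmap(nullptr, old_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(p != MAP_FAILED);
    std::memset(p, 0xAB, old_size);

    // Grow the mapping; the kernel rewrites the page tables instead of
    // copying the data, moving the mapping if it cannot grow in place.
    void* q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
    assert(q != MAP_FAILED);
    assert(static_cast<unsigned char*>(q)[0] == 0xAB); // contents preserved

    munmap(q, new_size);
    return 0;
}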
Yes, it's a common use of memory-mapped files to "move" or copy memory between processes by mapping different views of the same file.
Every POSIX system is able to do this. If you use mmap with a file descriptor (obtained by open or shm_open) rather than anonymously, you can unmap it, truncate the file (shrink or grow it), and then map it again. You may, and often will, get a different virtual address for the same pages.
I mean, you'd never be able to absolutely guarantee that there would be no active memory in that next 100GB, so you might not be able to make it contiguous.
On the other hand, you could use a ragged array (an array of arrays) where the arrays do not have to be next to each other (or even the same size). Many of the advantages of dynamic arrays may not scale to the 100GB realm.
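A sketch of that ragged-array idea (illustrative; std::deque makes a similar trade-off internally):

#include <cstddef>
#include <vector>

// Grows in fixed-size chunks: appending never copies existing data,
// at the cost of one extra indirection on each access.
class ChunkedArray {
    static constexpr std::size_t kChunk = 1 << 20; // elements per chunk
    std::vector<std::vector<double>> chunks;
    std::size_t count = 0;

public:
    void push_back(double v)
    {
        if (count % kChunk == 0) {
            chunks.emplace_back();
            chunks.back().reserve(kChunk); // each chunk is allocated exactly once
        }
        chunks.back().push_back(v);
        ++count;
    }

    double& operator[](std::size_t i) { return chunks[i / kChunk][i % kChunk]; }

    std::size_t size() const { return count; }
};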

C++ new / new[], how is it allocating memory?

I would like to know how these instructions allocate memory.
For example, what if I have this code:
int* x = new int[5];
int* y = new int[5];
When those are allocated, what does it actually look like in RAM?
Is a whole block (a memory page, 4 KB on 32-bit systems) reserved for each of the variables, or is one block shared between the two?
I couldn't find an answer to my question in any manual. Thanks for all replies.
I found this on Wikipedia:
Internal fragmentation of pages
Rarely do processes require the use of an exact number of pages. As a result, the last page will likely only be partially full, wasting some amount of memory. Larger page sizes clearly increase the potential for wasted memory this way, as more potentially unused portions of memory are loaded into main memory. Smaller page sizes ensure a closer match to the actual amount of memory required in an allocation.
As an example, assume the page size is 1024KB. If a process allocates 1025KB, two pages must be used, resulting in 1023KB of unused space (where one page fully consumes 1024KB and the other only 1KB).
And that was the answer to my question. Anyway, thanks guys.
A typical allocator implementation will first call the operating system to get a huge block of memory; then, to satisfy your request, it will give you a piece of that memory. This is known as suballocation. If it runs out of memory, it gets more from the operating system.
The allocator must keep track of all the big blocks it got from the operating system as well as all the small blocks it has handed out to its clients, and it must accept blocks back from clients.
A typical suballocation algorithm keeps a list of returned blocks of each size called a freelist and always tries to satisfy a request from the freelist, only going to the main block if the freelist is empty. This particular implementation technique is extremely fast and quite efficient for average programs, though it has woeful fragmentation properties if request sizes are all over the place (which is not usual for most programs).
Modern allocators like GNU's malloc implementation are complex, but have been built with many decades of experience and should be considered so good that it is very rare to need to write your own specialised suballocator.
You didn't find it in the manual because it's not specified by the standard. That said, most of the time x and y will end up side by side (go ahead and cout << hex << their addresses).
But nothing in the standard forces this so you can't rely on it.
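For example (illustrative; the exact gap between the two addresses depends on the allocator's bookkeeping overhead):

#include <cstdint>
#include <iostream>

int main()
{
    int* x = new int[5];
    int* y = new int[5];

    std::cout << std::hex
              << reinterpret_cast<uintptr_t>(x) << '\n'
              << reinterpret_cast<uintptr_t>(y) << '\n';
    // Typically the two addresses differ by a small constant (the block
    // size plus allocator metadata), but the standard guarantees nothing.

    delete[] x;
    delete[] y;
    return 0;
}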
Each process address space is divided into several segments:
1) The text segment :: Where your code is placed
2) Stack Segment :: Process stack
3) Data Segment :: This is where memory allocated by "new" is reserved (the heap traditionally grows out of the data segment). Besides that, it also stores initialized and uninitialized static data (.bss etc.).
So, whenever you use a new expression (which I guess uses malloc internally, though new makes memory handling much safer), it allocates the specified number of bytes in the data segment.
Of course, the address you print while running the program is virtual and needs to be translated to a physical address, but that is not our headache; the OS and the CPU's memory management unit do that for us.