The Get-CacheStatistics command shows that any small object (even a single Integer) takes at least 1024 bytes of cache memory. Is that correct, and is there any way to adjust AppCache to handle memory more efficiently for small objects?
In fact, there is an error in your understanding.
It's true that objects are stored in the cache in serialized form, but other internal data structures also consume cache memory: the key, duration, named cache, regions, tags, notifications, and so on. Take a look at the capacity guide.
There is no way to manage these internal cache data structures.
I'm currently working on a project using the Zynq-7000 SoC. We have a custom DMA IP in PL to provide faster transactions between peripherals and main memory. The peripherals are generally serial devices such as UART. The data received by the serial device is transferred immediately to the main memory by DMA.
What I am trying to do is read the data stored at a predetermined location in memory. Before reading the data, I invalidate the related cache lines using a function provided by the xil_cache.h library, shown below.
Xil_DCacheInvalidateRange(INTPTR adr, u32 len);
The problem here is that this function flushes the related cache lines before invalidating them. Because of that flush, the data written by DMA is overwritten, so every time I fetch corrupted bytes. The behaviour is explained in the library documentation as follows.
If the address to be invalidated is not cache-line aligned, the following choices are available:

Invalidate the cache line when required and do not bother much about the side effects. Though it sounds good, it can result in hard-to-debug issues. The problem is, if some other variables are allocated in the same cache line and had been recently updated (in cache), the invalidation would result in loss of data.

Flush the cache line first. This ensures that if any other variable present in the same cache line has been updated recently, it is flushed out to memory; then the line can safely be invalidated. Again it sounds good, but this can result in issues. For example, when the invalidation happens in a typical ISR (after a DMA transfer has updated the memory), flushing the cache line means losing the data that was updated recently, before the ISR got invoked.
As you can guess, I cannot always allocate a memory region with a cache-line-aligned address. Therefore, I took a different approach: I calculate the cache-line-aligned address located in memory right before my buffer and call the invalidation function with that address. Note that the Zynq's L2 cache is an 8-way set-associative 512 KB cache with a fixed 32-byte line size, which is why I mask the last 5 bits of the given memory address. (See section 3.4: L2 Cache in the Zynq documentation.)
// Round dev2memBuffer down to the previous 32-byte cache-line boundary (clear the low 5 bits).
INTPTR invalidationStartAddress = INTPTR(uint32_t(dev2memBuffer) - (uint32_t(dev2memBuffer) & 0x1F));
Xil_DCacheInvalidateRange(invalidationStartAddress, BUFFER_LENGTH);
This way I can solve the problem, but I'm not sure whether I'm corrupting any of the data placed just before the buffer allocated for DMA. (I would like to add that the buffer in question is allocated on the heap with operator new.) Is there a way to overcome this issue, or am I overthinking it? I believe this problem could be solved more cleanly if there were a function that invalidates the related cache lines without flushing them.
EDIT: Invalidating memory that does not reside inside the allocated area makes the variables placed next to the buffer unreliable, so the first solution is not applicable. My second idea was to allocate a buffer 32 bytes bigger than required and crop its unaligned part. But this can cause the same problem, since its last part (parts = 32-byte blocks) is not guaranteed to span a full 32 bytes and might therefore corrupt whatever is placed right after it. The library documentation states that:
Whenever possible, the addresses must be cache-line aligned. Please note that not just the start address, even the end address must be cache-line aligned. If that is taken care of, this will always work.
SOLUTION: As I stated in the last edit, the only way to overcome the problem was to allocate a memory region with a cache-aligned address and length. Since I cannot control the start address returned by the allocator, I decided to allocate a region two cache lines bigger than requested and crop the unaligned parts; the misalignment can occur in the first or the last block. So that the buffer can still be released correctly, I carefully kept the originally allocated address and used the cache-aligned one in all operations. A sketch of this workaround is shown below.
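A minimal sketch of that workaround, with illustrative names (allocCacheAligned is not part of the Xilinx libraries; only Xil_DCacheInvalidateRange is, and the Zynq-7000 L2 line size of 32 bytes is taken from the question):

#include <cstdint>
#include <cstddef>

constexpr std::uintptr_t CACHE_LINE = 32;  // Zynq-7000 L2 cache line size

// Allocate 'len' usable bytes such that the returned pointer is cache-line
// aligned. The raw pointer is returned through 'rawOut' and is the one that
// must later be passed to delete[].
static std::uint8_t* allocCacheAligned(std::size_t len, std::uint8_t*& rawOut)
{
    rawOut = new std::uint8_t[len + 2 * CACHE_LINE];             // two extra lines, as described above
    std::uintptr_t raw = reinterpret_cast<std::uintptr_t>(rawOut);
    std::uintptr_t aligned = (raw + CACHE_LINE - 1) & ~(CACHE_LINE - 1);
    return reinterpret_cast<std::uint8_t*>(aligned);
}

// Usage sketch: the DMA writes into 'dev2memBuffer'; both the start address and
// the length rounded up to a line boundary are now safe to invalidate without
// touching neighbouring heap objects.
// std::uint8_t* raw = nullptr;
// std::uint8_t* dev2memBuffer = allocCacheAligned(BUFFER_LENGTH, raw);
// Xil_DCacheInvalidateRange(reinterpret_cast<INTPTR>(dev2memBuffer),
//                           (BUFFER_LENGTH + CACHE_LINE - 1) & ~(CACHE_LINE - 1));
// ...
// delete[] raw;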
I believe there are better solutions to this problem, so I am keeping the question open.
Your solution is correct. There is no way to flush or invalidate only part of a cache line.
Normally this behavior is transparent to programs, but it becomes visible in multithreaded code and when sharing memory with hardware accelerators.
I've read a paper called "Last-Level Cache Side-Channel Attacks are Practical". In the paper, the authors construct an eviction set. According to the paper, we should allocate a buffer (backed by large pages) of at least twice the size of the LLC and select a set of potentially conflicting memory lines within the buffer, i.e., lines whose addresses have the same set-index bits.
If I've got the set index, how can I get the address of each cache line within this set?
For example, I've allocated a 6 MB buffer backed by 2 MB pages (using the C mmap() function) and obtained the base address of the page. Following basic cache addressing, I can compute the set index of an address by splitting it into tag, set-index, and line-offset bits, as shown in the picture.
But how do I get the full addresses of the cache lines that fall within the same set?
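One common way to enumerate such addresses, sketched here under the assumption of 64-byte lines, 2048 sets, and a simple (unhashed) set index, is to step through the buffer with a stride of sets × line size; the huge-page backing means the low 21 address bits are the same virtually and physically:

#include <sys/mman.h>
#include <cstdint>
#include <cstddef>
#include <vector>
#include <cstdio>

int main()
{
    const std::size_t LINE     = 64;                 // cache line size (assumption)
    const std::size_t LLC_SETS = 2048;               // sets in the LLC (assumption)
    const std::size_t BUF_SIZE = 6 * 1024 * 1024;    // 6 MB buffer, as in the question

    // Buffer backed by 2 MB huge pages, as in the question (requires configured hugepages on Linux).
    void* buf = mmap(nullptr, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) return 1;

    const std::size_t targetSet = 0x2A;              // the set index we are interested in (example)
    const std::size_t stride    = LLC_SETS * LINE;   // distance between lines sharing the same set-index bits

    // Every address base + targetSet*LINE + k*stride has the same set-index bits,
    // so all of these lines are candidates for an eviction set for that cache set.
    std::vector<std::uintptr_t> candidates;
    for (std::uintptr_t addr = (std::uintptr_t)buf + targetSet * LINE;
         addr < (std::uintptr_t)buf + BUF_SIZE;
         addr += stride)
        candidates.push_back(addr);

    std::printf("%zu candidate lines for set %zu\n", candidates.size(), targetSet);
    munmap(buf, BUF_SIZE);
    return 0;
}

Note that on processors with a sliced, hashed LLC (the case the paper discusses), candidates with matching set-index bits still have to be filtered by slice, e.g. via timing.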
OK, so I can't find much in the way of answers to this; it's a simple question about memory management. I know that when a computer pulls from memory it caches 32-64 bytes of memory in a cache line, depending on your processor. My question is: does it only store one cache line's worth of memory or multiple, and if multiple, how many?
For instance, say we're using C++ and I read a vector<int> in a for loop, then use those integers to pull information out of another vector that is most likely nowhere near it in memory. Would that qualify as two pulls after which everything is cached, or is it going to keep pulling from memory? Basically, would it pull the vector<int> and store it in cache, then pull the other vector and store it as well in a different cache line, thus pulling only twice and reading from cached memory from then on? Assume that each vector is the size of one cache line's worth of data.
EDIT: OK, so along the same lines, I have a second question. Let's assume for a moment that my initial vector<int> is iterated over in a for loop, which then references multiple vectors. When it caches those vectors, it obviously has a limited capacity, so it will start overwriting previously cached data, right? In that case, in what order would it overwrite the previous cache lines: FIFO, FILO, or some other way?
There are different types of cache, and the amount of cache generally depends on the processor. A modern processor has three levels of cache: the fastest (and smallest), L1, is typically a few tens of kilobytes per core, while the slowest (and largest), L3, is typically several megabytes.
Each access to memory is 64 bits wide, regardless of the processor architecture, so accessing memory with 64-bit operands is most efficient.
The cache may store memory from different pages too.
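To make the scenario in the question concrete, here is a small sketch (the interference-size constant from C++17's <new> is used as the cache-line size where available; 64 bytes is an assumed fallback, and the vector contents are illustrative):

#include <vector>
#include <new>
#include <cstddef>

int main()
{
#ifdef __cpp_lib_hardware_interference_size
    constexpr std::size_t lineSize = std::hardware_constructive_interference_size;
#else
    constexpr std::size_t lineSize = 64;              // typical cache-line size, assumed
#endif
    constexpr std::size_t intsPerLine = lineSize / sizeof(int);   // e.g. 16 ints per 64-byte line

    std::vector<int> indices(intsPerLine, 0);   // fits in roughly one cache line
    std::vector<int> data(intsPerLine, 42);     // allocated elsewhere on the heap

    long sum = 0;
    for (int i : indices)   // first read misses and loads a line of 'indices'; later reads hit in cache
        sum += data[i];     // first read misses and loads a line of 'data'; later reads hit in cache

    return static_cast<int>(sum);
}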
The problem is the following: I have a certain number of words (let's say 20M), each containing some bits used as flags, all stored in a single contiguous binary file.
What I would like to do is access those words in a container-like style, so that container_instance[i] gives me the i-th word. To make things more complicated, I cannot keep all the words in memory at once: words not used for a long period have to be written back to the file and their memory freed. To simplify things, the whole sequence is partitioned into 1K fragments, so such 1K blocks are what gets allocated and freed. Memory should be freed after some time, or after the container has been accessed a certain number of times.
Thread safety is nice to have, but I can protect access externally.
The implementation I have currently only allocates blocks on demand (empty, or read from the file if they are available; the file is not sparse, so anything past the last byte in the file is allocated empty) and it is not nicely done. It never frees anything, so unused blocks remain in memory forever.
I started to think about a cleaner solution and would like to know whether any components from the STL or Boost can help me build such a container, rather than carving it out step by step from scratch.
I am not expecting full solutions, rather pointers along the lines of "you can use that for that".
You can use the mmap system call to map your file into memory. You can then use pointer arithmetic on that buffer, so access by index is not a problem.
Mapped pages are virtual and managed by the kernel, which can evict unused blocks and load/flush them transparently to you. Also, using madvise can probably enable some optimisations.
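A rough sketch of what that looks like for the flag-word file (the file name and 32-bit word type are assumptions; the kernel decides which pages stay resident):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstddef>

int main()
{
    int fd = open("flags.bin", O_RDWR);          // assumed file of 32-bit flag words
    if (fd < 0) return 1;

    struct stat st;
    fstat(fd, &st);
    std::size_t nWords = st.st_size / sizeof(std::uint32_t);

    // Map the whole file; pages are loaded on first access and written back by the kernel.
    void* p = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;
    auto* words = static_cast<std::uint32_t*>(p);

    madvise(p, st.st_size, MADV_RANDOM);          // hint: access pattern is not sequential

    // container_instance[i]-style access is now just pointer indexing.
    if (nWords > 12345) words[12345] |= 0x1;      // set a flag bit in word 12345

    munmap(p, st.st_size);
    close(fd);
    return 0;
}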
I have to deal with a huge amount of data that usually doesn't fit into main memory. The way I access this data has high locality, so caching parts of it in memory looks like a good option. Is it feasible to just malloc() a huge array, and let the operating system figure out which bits to page out and which bits to keep?
Assuming the data comes from a file, you're better off memory mapping that file. Otherwise, what you end up doing is allocating your array, and then copying the data from your file into the array -- and since your array is mapped to the page file, you're basically just copying the original file to the page file, and in the process polluting the "cache" (i.e., physical memory) so other data that's currently active has a much better chance of being evicted. Then, when you're done you (typically) write the data back from the array to the original file, which (in this case) means copying from the page file back to the original file.
Memory mapping the file instead just creates some address space and maps it directly to the original file instead. This avoids copying data from the original file to the page file (and back again when you're done) as well as temporarily moving data into physical memory on the way from the original file to the page file. The biggest win, of course, is when/if there are substantial pieces of the original file that you never really use at all (in which case they may never be read into physical memory at all, assuming the unused chunk is at least a page in size).
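For example, assuming the data lives in a file (data.bin is an assumed name), a read-only mapping looks roughly like this; the OS keeps only the actively touched pages in physical memory:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("data.bin", O_RDONLY);          // assumed input file, possibly larger than RAM
    if (fd < 0) return 1;

    struct stat st;
    fstat(fd, &st);

    // Read-only, private mapping: no copy into an anonymous array, no write-back,
    // and untouched regions of the file need never be brought into physical memory.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    const char* data = static_cast<const char*>(p);

    long checksum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)  // touch one byte per page, just as a demo
        checksum += data[i];

    std::printf("checksum: %ld\n", checksum);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}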
If the data are in a large file, look into using mmap to read it. Modern computers have so much RAM that you might not have enough swap space available.