As I understand, _mm_clflush() / _mm_clflushopt() invalidates a cache line while saving it to memory if it has been changed. Is there a way to simply abandon a cache line, without saving to memory any changes made to it?
A use case is before freeing memory: I don't need cache lines or their values anymore.
Related
I'm currently working on a project using the Zynq-7000 SoC. We have a custom DMA IP in PL to provide faster transactions between peripherals and main memory. The peripherals are generally serial devices such as UART. The data received by the serial device is transferred immediately to the main memory by DMA.
I am trying to read the data stored at a pre-determined location in memory. Before reading the data, I invalidate the related cache lines using a function provided by the xil_cache.h library, as below.
Xil_DCacheInvalidateRange(INTPTR adr, u32 len);
The problem is that this function flushes the affected cache lines before invalidating them. The flush overwrites the data that the DMA has stored in memory, so I fetch corrupted bytes every time. The library documentation explains the behavior as follows.
If the address to be invalidated is not cache-line aligned, the
following choices are available:
1. Invalidate the cache line when
required and do not bother much for the side effects. Though it sounds
good, it can result in hard-to-debug issues. The problem is, if some
other variables are allocated in the same cache line and had been
recently updated (in cache), the invalidation would result in loss of
data.
2. Flush the cache line first. This will ensure that if any
other variable presents in the same cache line and updated recently are
flushed out to memory. Then it can safely be invalidated. Again it
sounds good, but this can result in issues. For example, when the
invalidation happens in a typical ISR (after a DMA transfer has
updated the memory), then flushing the cache line means, losing data
that were updated recently before the ISR got invoked.
As you can guess, I cannot always allocate a memory region with a cache-line-aligned address. I therefore work around the problem by computing the cache-line-aligned address located in memory just before my buffer and calling the invalidation function with that address. Note that the Zynq's L2 cache is an 8-way set-associative 512 KB cache with a fixed 32-byte line size, which is why I mask off the lowest 5 bits of the buffer address. (See section 3.4, L2 Cache, in the Zynq documentation.)
INTPTR invalidationStartAddress = INTPTR(uint32_t(dev2memBuffer) - (uint32_t(dev2memBuffer) & 0x1F));
Xil_DCacheInvalidateRange(invalidationStartAddress, BUFFER_LENGTH);
This solves the problem, but I'm not sure whether I'm corrupting any data placed before the buffer allocated for DMA. (The buffer is allocated on the heap with the dynamic allocation operator new.) Is there a way to overcome this issue, or am I overthinking it? I believe this problem could be solved more cleanly if there were a function that invalidates the related cache lines without flushing them.
EDIT: Invalidating memory outside the allocated area endangers variables placed close to the buffer, so the first solution is not applicable. My second solution is to allocate a buffer that is 32 bytes bigger than required and crop its unaligned part. But this can cause the same problem, because its last part (parts = 32-byte blocks) is not guaranteed to span a full 32 bytes, so it might corrupt whatever is placed next to it. The library documentation states that:
Whenever possible, the addresses must be cache-line aligned. Please
note that not just the start address, even the end address must be
cache-line aligned. If that is taken care of, this will always work.
SOLUTION: As I stated in the last edit, the only way to overcome the problem was to allocate a memory region with a cache-aligned address and length. I cannot control the start address of the allocated area, so I decided to allocate a region two cache blocks bigger than the requested one and crop the unaligned parts. The misalignment can occur at the first or the last block. So that the memory can still be freed correctly, I saved the originally allocated address and used the cache-aligned one in all other operations.
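For anyone who wants to see the over-allocate-and-crop idea written out, here is a minimal sketch. It assumes the 32-byte line size mentioned above; DmaBuffer, makeDmaBuffer and kCacheLine are names made up for this illustration and are not part of xil_cache.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Assumption: 32-byte cache lines, as stated for the Zynq-7000 above.
constexpr std::size_t kCacheLine = 32;

// Hypothetical helper type: owns the raw allocation and exposes only the
// cache-line-aligned window inside it.
struct DmaBuffer {
    std::unique_ptr<unsigned char[]> raw;  // original pointer, used for delete[]
    unsigned char* aligned = nullptr;      // first cache-line boundary inside raw
    std::size_t length = 0;                // usable length, multiple of kCacheLine
};

inline DmaBuffer makeDmaBuffer(std::size_t requested) {
    assert(requested > 0);
    // Round the usable length up to whole cache lines.
    const std::size_t rounded =
        (requested + kCacheLine - 1) / kCacheLine * kCacheLine;

    DmaBuffer buf;
    // Two extra lines of slack: one to move the start up to a boundary,
    // one to cover the rounded-up tail.
    buf.raw.reset(new unsigned char[requested + 2 * kCacheLine]);

    const auto addr = reinterpret_cast<std::uintptr_t>(buf.raw.get());
    const std::uintptr_t mask = static_cast<std::uintptr_t>(kCacheLine - 1);
    const std::uintptr_t up = (addr + mask) & ~mask;  // round address up

    buf.aligned = buf.raw.get() + (up - addr);
    buf.length = rounded;
    return buf;
}
```

On the target, the invalidation call from the question would then be given buf.aligned and buf.length, so both ends of the range sit on line boundaries and no neighbouring heap object shares a line with the DMA data.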
I believe that there are better solutions to the problem and I keep the question open.
Your solution is correct. There is no way to flush a subset of a cache line.
Normally this behavior is transparent to programs but it becomes visible in multithreaded code and when sharing memory with hardware accelerators.
Ok, so I can't find much in the way of answers to this; it's a simple question about memory management. I know that when a computer pulls from memory it caches 32-64 bytes of memory in a cache line, depending on your processor. My question is: does it only store one cache line's worth of memory or multiple, and if multiple, how many?
For instance, say we're using C++, and I pull a vector<int> using a for loop, then I use those integers to pull information out of another vector that is most likely nowhere near it in memory. Would that qualify as 2 pulls and then everything is cached, or is that just going to continuously pull from memory? Basically, would it pull the vector<int> and store it in cache, then pull the other vector and store it as well in a different cache line, thus only pulling twice and then reading from its cached memory from then on? Assume that each vector is the size of one cache line's worth of data.
EDIT: Ok, so along the same lines, I have a second question. Let's assume for a moment that my initial vector<int> is iterated over in a for loop, which then references multiple vectors. When it caches those vectors, obviously it's going to have a limit, so it will start writing over previously cached lines, right? In which case, in what order would it write over the previous cache lines: FIFO, FILO, or some other way?
There are different types of cache, and the amount of cache depends on the processor. A modern processor typically has three levels of cache: the fastest and smallest, L1, is usually a few tens of kilobytes per core (32 KB to 64 KB is common), while the slowest and largest, L3, is typically several megabytes or more.
Memory is moved between RAM and the cache in whole cache lines (commonly 64 bytes on current x86 processors), not in individual 64-bit words. Accesses that stay within lines already in the cache are therefore the most efficient.
The cache may store memory from different pages too.
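To make the "one vector = one cache line" assumption from the question concrete, here is a small sketch that counts how many lines a block of memory touches. The 64-byte default is an assumption (check your processor), and cacheLinesSpanned is a name invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of cache lines touched by `bytes` bytes starting at address `addr`.
// lineSize is an assumption (64 is common on x86) and must be a power of two.
inline std::size_t cacheLinesSpanned(std::uintptr_t addr, std::size_t bytes,
                                     std::size_t lineSize = 64) {
    assert(lineSize != 0 && (lineSize & (lineSize - 1)) == 0);
    if (bytes == 0) return 0;
    const std::uintptr_t first = addr / lineSize;                // line of first byte
    const std::uintptr_t last = (addr + bytes - 1) / lineSize;   // line of last byte
    return static_cast<std::size_t>(last - first + 1);
}
```

A vector of 16 four-byte ints occupies 64 bytes, so it fits in one line only if its storage starts on a line boundary; otherwise it straddles two. Either way, a cache holds many lines at once, so both vectors from the question can stay cached together until other accesses evict them.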
SUMMARY:
I have an application which consumes way more memory than it should (roughly 250% of the expected amount) but I can't seem to find any memory leaks. Calling the same function (which does a lot of allocations) keeps increasing memory usage up to some point, after which it stops changing and stays there.
PROGRAM DETAILS:
The application uses a quadtree data structure to store 'Points'. It is possible to specify the maximum number of points to be stored in memory (cache size). The 'Points' are stored in 'PointBuckets' (arrays of points linked to the leaf nodes of the quadtree) which, if the maximum total number of points in the quadtree is reached, are serialized and saved to temporary files, to be retrieved when needed. This all seems to work fine.
Now when a file is loaded a new Quadtree is created and the old one is deleted if it exists, then points are read from the file and inserted into the quadtree one by one. A lot of memory allocations take place as buckets are being created and deleted during node splitting etc.
SYMPTOMS:
If I load a file that is expected to use 300MB of memory once, I get the expected amount of memory consumed. All good. If I keep loading the same file over and over again the memory usage keeps growing (I'm looking at the RES column in top, Linux) till about 700MB. That could indicate a memory leak. However if I then keep loading the files still, memory consumption just stays at 700MB.
Another thing: when I use valgrind massif and look at the memory usage, it always stays within the expected limit. For example, if I specify the cache size to be 1.5 GB and run my program alone, it will eventually consume 4 GB of memory. If I run it in massif, it stays below 2 GB the whole time, and in the produced graphs I can see that it in fact never allocated more than the expected 1.5 GB. My naive assumption is that this happens because massif uses a custom memory pool which somehow prevents fragmentation.
So what do you think is going on here? What kind of solution should I look for to solve this issue, if it is memory fragmentation?
I'd put it down to simple allocator and OS caching behaviours. They retain memory you allocated instead of freeing it, so that it can be returned to you more promptly the next time you request it. However, 250% does sound like a lot for this kind of effect, so you could be looking at fragmentation problems.
Try swapping your allocator for one that resists fragmentation, such as an object pool or a memory arena.
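As a rough idea of what an object pool looks like, here is a minimal fixed-size sketch. ObjectPool is a name made up for this example; it is not thread-safe and not tuned for the quadtree in the question. All slots have the same size and come from one contiguous block, so freeing and reallocating objects cannot leave unusable gaps.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Minimal fixed-size object pool (illustrative sketch, not thread-safe).
template <typename T>
class ObjectPool {
    // Raw, suitably aligned storage for one T.
    struct Slot { alignas(T) unsigned char bytes[sizeof(T)]; };

public:
    explicit ObjectPool(std::size_t capacity) : slots_(capacity) {
        free_.reserve(capacity);
        // Push in reverse so slot 0 is handed out first.
        for (std::size_t i = 0; i < capacity; ++i)
            free_.push_back(&slots_[capacity - 1 - i]);
    }
    ObjectPool(const ObjectPool&) = delete;
    ObjectPool& operator=(const ObjectPool&) = delete;

    // Construct a T in a free slot; returns nullptr when the pool is exhausted.
    template <typename... Args>
    T* create(Args&&... args) {
        if (free_.empty()) return nullptr;
        Slot* s = free_.back();
        free_.pop_back();
        return new (s->bytes) T(std::forward<Args>(args)...);
    }

    // Destroy the object and make its slot available again.
    void destroy(T* p) {
        assert(p != nullptr);
        p->~T();
        free_.push_back(reinterpret_cast<Slot*>(p));
    }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<Slot> slots_;   // one contiguous block for all objects
    std::vector<Slot*> free_;   // stack of currently unused slots
};
```

In the quadtree case, something like one pool per object type (points, buckets, nodes) would be the natural fit, sized from the configured cache size. Objects still alive when the pool is destroyed are not destructed by this sketch.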
I've been trying to track down a memory problem for a couple of days - my program is using around 3GB of memory, when it should be using around 200MB-300MB. Valgrind is actually reporting that it is using ~300MB at its peak, and is not reporting any memory leaks.
The program reads an input file, and stores every unique word in that file. It is multi-threaded, and I've been running it using 4 threads. My major sources of data are:
Constant-size array of wchar_t (4MB total)
Map between words and a list of associated values. This grows with the size of input. If there are 1,000,000 unique words in the input file, there will be 1,000,000 entries in the tree.
I am doing a huge number of allocations and deallocations (using new and delete) -- at least two per unique word. Is it possible that memory I free is not being reused for some reason, causing the program to keep acquiring more and more memory? It consistently grabs more as it continues to run.
In general, any ideas about where I should go from here?
Edit 1 (based on advice from Graham):
One path I'll try is minimizing allocation. I'll work with a single string per thread (which may grow occasionally if a word is longer than this string is), but if I remember my code correctly this will eliminate a huge number of new/delete calls. If all goes well I'll be left with: one-time allocation of input buffer, one-time allocation of string-per-thread (with some reallocs), two allocs per map entry (one for key, one for value).
Thanks!
It's likely to be heap fragmentation. Because you are allocating and releasing small blocks in such huge quantities, it's probable that there are loads of small free chunks which are too small to be reused by subsequent allocations. Since these chunks are effectively wasted, the process has to keep grabbing more and more memory from the system to honour new allocations.
You may be able to mitigate the effect by first reserving a sufficiently large default capacity in each string with string::reserve(), and then clearing strings to empty when you're finished with them (rather than deleting). Then, keep a list of empty strings to be reused instead of allocating new ones all the time.
EDIT: the above suggestion assumes the objects being allocated are std::strings. If they're not, then you can probably still apply the general technique of keeping old empty objects around for reuse.
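A minimal sketch of the reuse idea, assuming the objects are std::strings. StringPool is a hypothetical helper written for this illustration; it is not thread-safe, so with four threads you would keep one per thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: keeps cleared strings around so their buffers get reused.
class StringPool {
public:
    explicit StringPool(std::size_t defaultCapacity)
        : defaultCapacity_(defaultCapacity) {
        assert(defaultCapacity > 0);
    }

    // Hand out an empty string, reusing a released one when available.
    std::string acquire() {
        if (!free_.empty()) {
            std::string s = std::move(free_.back());
            free_.pop_back();
            return s;
        }
        std::string s;
        s.reserve(defaultCapacity_);  // one allocation up front
        return s;
    }

    // Take the string back. clear() empties it and, in the common
    // standard library implementations, keeps the allocated capacity.
    void release(std::string s) {
        s.clear();
        free_.push_back(std::move(s));
    }

    std::size_t idle() const { return free_.size(); }

private:
    std::size_t defaultCapacity_;
    std::vector<std::string> free_;  // empty strings waiting to be reused
};
```

Pick the default capacity so that most words fit without a reallocation; the occasional longer word simply grows its string, and the larger buffer is then reused as well.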
Memory your program frees should be returned to the heap where it can be allocated again.
However, that does not mean it is freed back to the operating system. Often, the app will continue to "own" memory that has been allocated and freed.
Is this a Windows app? How are you allocating and freeing the memory? And how are you determining how much memory the app is using?
You should try wrapping the resource allocations into a class if you can. Call new in the constructor, and delete in the destructor. Try and take advantage of scope so memory management is done more automatically.
http://en.wikipedia.org/wiki/RAII
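A minimal sketch of such a wrapper follows. WordBuffer and its liveCount() counter are made up for this illustration; in real code std::vector or std::unique_ptr already give you the same behaviour.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical RAII wrapper: the array lives exactly as long as the object.
class WordBuffer {
public:
    explicit WordBuffer(std::size_t n)
        : data_(new wchar_t[n]()), size_(n) {  // new in the constructor
        ++liveCount();
    }
    ~WordBuffer() {                             // delete in the destructor
        delete[] data_;
        --liveCount();
    }
    // Copying would lead to a double delete, so forbid it.
    WordBuffer(const WordBuffer&) = delete;
    WordBuffer& operator=(const WordBuffer&) = delete;

    wchar_t& at(std::size_t i) {
        assert(i < size_);
        return data_[i];
    }
    std::size_t size() const { return size_; }

    // Number of WordBuffer objects currently alive (for demonstration only).
    static int& liveCount() {
        static int n = 0;
        return n;
    }

private:
    wchar_t* data_;
    std::size_t size_;
};
```

Declaring a WordBuffer inside a block means the memory is released when the block ends, including on early returns and exceptions, without any explicit delete at the call site.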
I have to deal with a huge amount of data that usually doesn't fit into main memory. The way I access this data has high locality, so caching parts of it in memory looks like a good option. Is it feasible to just malloc() a huge array, and let the operating system figure out which bits to page out and which bits to keep?
Assuming the data comes from a file, you're better off memory mapping that file. Otherwise, what you end up doing is allocating your array, and then copying the data from your file into the array -- and since your array is mapped to the page file, you're basically just copying the original file to the page file, and in the process polluting the "cache" (i.e., physical memory) so other data that's currently active has a much better chance of being evicted. Then, when you're done you (typically) write the data back from the array to the original file, which (in this case) means copying from the page file back to the original file.
Memory mapping the file instead just creates some address space and maps it directly to the original file instead. This avoids copying data from the original file to the page file (and back again when you're done) as well as temporarily moving data into physical memory on the way from the original file to the page file. The biggest win, of course, is when/if there are substantial pieces of the original file that you never really use at all (in which case they may never be read into physical memory at all, assuming the unused chunk is at least a page in size).
If the data are in a large file, look into using mmap to read it. Modern computers have so much RAM that you might not have enough swap space available.
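For reference, a minimal read-only mapping on a POSIX system might look like the sketch below. MappedFile is a name invented for this example and the error handling is deliberately thin; on Windows the equivalent calls are different.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only mapping of a whole file (POSIX only, illustrative sketch).
class MappedFile {
public:
    explicit MappedFile(const char* path) {
        const int fd = ::open(path, O_RDONLY);
        if (fd < 0) return;  // ok() stays false
        struct stat st;
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            void* p = ::mmap(nullptr, static_cast<std::size_t>(st.st_size),
                             PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = static_cast<const unsigned char*>(p);
                size_ = static_cast<std::size_t>(st.st_size);
            }
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }
    ~MappedFile() {
        if (data_) ::munmap(const_cast<unsigned char*>(data_), size_);
    }
    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    bool ok() const { return data_ != nullptr; }
    std::size_t size() const { return size_; }

    // Pages are read from the file on first access, not when mapping.
    unsigned char operator[](std::size_t i) const {
        assert(i < size_);
        return data_[i];
    }

private:
    const unsigned char* data_ = nullptr;
    std::size_t size_ = 0;
};
```

With this, the operating system decides which pages of the file stay in physical memory, which is what the question was hoping malloc() would do, but without going through the page file.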