How to find most frequently used areas of memory? - c++

I'm looking to profile a large C++ application and determine which pieces of data (or memory regions) are fetched the most. Basically, I want to be able to do something like the processor's MFU cache algorithm for determining what to store in L2/L3 caches. There is surprisingly little to no information online on anybody that has tried to accomplish this.
Edit: Changed MRU to MFU
Edit 2: To clarify, I need the addresses, or the data structures that are pointed to at the addresses.

You can use the Pin tool to log all memory accesses and calculate cache hits and misses.

Valgrind can do this. It will need a plugin; I don't know if there is already one.
EDIT: it's called cachegrind
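Once a tool has logged the accessed addresses, ranking them by frequency is mostly bookkeeping. Here is a minimal sketch, assuming the trace is already in memory as a list of 64-bit addresses. The function name hottest_lines and the 64-byte line size are my own assumptions for illustration; they are not part of Pin or Valgrind.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: given a trace of accessed addresses (for example one
// logged by an instrumentation tool), return the `top_n` most frequently
// touched cache lines as (line base address, access count) pairs.
std::vector<std::pair<std::uint64_t, std::size_t>>
hottest_lines(const std::vector<std::uint64_t>& trace,
              std::size_t top_n,
              std::uint64_t line_size = 64) {  // 64 bytes is an assumption
    assert(line_size != 0);
    std::unordered_map<std::uint64_t, std::size_t> counts;
    for (std::uint64_t addr : trace) {
        ++counts[addr - addr % line_size];  // round down to the line base
    }
    std::vector<std::pair<std::uint64_t, std::size_t>> result(counts.begin(),
                                                              counts.end());
    // Sort by descending count; break ties by address so output is deterministic.
    std::sort(result.begin(), result.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;
        return a.first < b.first;
    });
    if (result.size() > top_n) result.resize(top_n);
    return result;
}
```

Mapping the resulting line addresses back to data structures is a separate step (for example by recording allocation ranges), which this sketch does not attempt.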

Related

Is there a way to code data directly to the hard drive (similar to how one can do with RAM)?

My question concerns C/C++. It is possible to manipulate the data on the RAM with pretty great flexibility. You can also give the GPU direct commands using OpenGL, allowing one to manipulate VRAM as well.
My curiosity is whether it is possible to do this to the hard drive (even though this would likely be a horrible idea with many, many possibilities of corrupting existing data). The logic of my question comes from an assumption that the hard drive is similar to RAM and VRAM (bytes of data), but just accesses data slower.
I'm not asking about how to perform file IO, but instead how to directly modify bytes of memory on the hard drive (maybe via some sort of "hard-drive pointer").
If my assumption is totally off, a detailed correction about how the hard drive's data storage is different from RAM or VRAM would be very helpful. Thank you!
Modern operating systems in combination with modern CPUs offer the ability to memory-map disk clusters to memory pages.
The memory pages are initially marked as invalid, and as soon as you try to access them an invalid page "trap" or "interrupt" occurs, which is handled by the operating system, which loads the corresponding cluster into that memory page.
If you write to that page there is either a hardware-supported "dirty" bit, or another interrupt mechanism: the memory page is initially marked as read-only, so the first time you try to write to it there is another interrupt, which simply marks the page as dirty and turns it read-write. Then, you know that the page needs to be flushed to disk at a convenient time.
Note that reading and writing is usually done via Direct Memory Access (DMA) so the CPU is free to do other things while the pages are being transferred.
So, yes, you can do it, either with the help of the operating system, or by writing all that very complex code yourself.
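To illustrate the "with the help of the operating system" route, here is a minimal sketch using POSIX mmap(), assuming a Linux or other POSIX system. The function name poke_byte_via_mmap is made up for this example. It changes one byte of an existing file through an ordinary pointer; the page fault and dirty-page handling described above happen behind the scenes.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Overwrite one byte of a file through a memory mapping instead of write().
// Returns true on success. `path` must name an existing file that is larger
// than `offset` bytes.
bool poke_byte_via_mmap(const std::string& path, std::size_t offset, char value) {
    int fd = open(path.c_str(), O_RDWR);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0 || static_cast<std::size_t>(st.st_size) <= offset) {
        close(fd);
        return false;
    }
    std::size_t length = static_cast<std::size_t>(st.st_size);

    // MAP_SHARED makes stores visible in the file, not just in this process.
    void* addr = mmap(nullptr, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return false;

    // The first touch faults the page in; the store marks it dirty.
    static_cast<char*>(addr)[offset] = value;

    // Ask the OS to flush dirty pages now rather than "at a convenient time".
    bool ok = msync(addr, length, MS_SYNC) == 0;
    munmap(addr, length);
    return ok;
}
```

Note that this still goes through the file system: the pointer addresses a file's contents, not raw disk sectors.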
Not for you. Being able to write directly to the hard drive would give you infinite potential to mess up things beyond all recognition. (The technical term is FUBAR, and the F doesn't stand for Mess).
And if you write hard disk drivers, I sincerely hope you are not trying to ask for help here.

Cache misses / cache miss rate of a 2D array

I am executing my C++ code on Linux. In my code, there is a large 2D array of some structure. The array is accessed randomly. I have to find how many cache misses occur when that 2D array is accessed. Is there any solution other than valgrind (which takes too much time to compute results) that can help me find the cache misses and cache miss rate of this array?
Generally, your program does not just access the 2D array, so memory access includes other variables. It may have some effects, more or less.
If the array access is intensive (or encapsulated), you may be able to use a simple simulator to estimate the miss rate.
Otherwise, you may need some tools. I suggest Pin, a dynamic binary instrumentation tool that works on a similar principle to valgrind. I think its overhead is acceptable, but you have to write the analysis code yourself.
Or a better choice is VTune, a performance analysis tool from Intel; try its cache analysis. You will still need some time to learn it.
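For the simple-simulator route, something along these lines may be enough to get a rough estimate. This is only a sketch of a set-associative cache with LRU replacement; the class name CacheSim and all its parameters are assumptions for illustration, and a real CPU cache will differ in replacement policy, prefetching and more.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

// Minimal set-associative cache model with LRU replacement.
// All parameters are illustrative; real hardware differs in policy and detail.
class CacheSim {
public:
    CacheSim(std::size_t num_sets, std::size_t ways, std::size_t line_size)
        : ways_(ways), line_size_(line_size), sets_(num_sets) {
        assert(num_sets > 0 && ways > 0 && line_size > 0);
    }

    // Record one access; returns true on a hit, false on a miss.
    bool access(std::uint64_t addr) {
        ++accesses_;
        std::uint64_t line = addr / line_size_;
        std::list<std::uint64_t>& set = sets_[line % sets_.size()];
        for (auto it = set.begin(); it != set.end(); ++it) {
            if (*it == line) {
                set.splice(set.begin(), set, it);  // move to most-recently-used slot
                return true;
            }
        }
        ++misses_;
        set.push_front(line);
        if (set.size() > ways_) set.pop_back();  // evict the least recently used line
        return false;
    }

    std::size_t accesses() const { return accesses_; }
    std::size_t misses() const { return misses_; }
    double miss_rate() const {
        return accesses_ == 0 ? 0.0 : static_cast<double>(misses_) / accesses_;
    }

private:
    std::size_t ways_, line_size_;
    std::vector<std::list<std::uint64_t>> sets_;
    std::size_t accesses_ = 0, misses_ = 0;
};
```

To use it, call access() with the address of every array element your code touches, e.g. sim.access(reinterpret_cast<std::uintptr_t>(&grid[i][j])), and read miss_rate() at the end. It only sees the accesses you feed it, so traffic from other variables is ignored.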
Simulating the cache risks not reproducing the exact same behavior due to minor differences (in replacement policy, protocols, etc..).
A better option is profiling the code. The simplest profiling tool for Linux would probably be perf. There's a pretty decent tutorial here - https://perf.wiki.kernel.org/index.php/Tutorial
Also see - Are there any way to profile cache miss in linux kernel? , it includes a list of specific example statistics to check for cache performance

How to build an application layer pre-fetching system

I'm working in a C/C++ mixed project that has the following situation.
I need to iterate through very small chunks (and, rarely, larger ones) in a file, one by one. Ideally, I should read each of them just once, consecutively. I think a better solution in this case is to read a big chunk into a buffer and consume it later, rather than reading each small chunk at the moment I need it.
The problem is, how do I balance the cache size? Is there any well-known algorithm/library that I can take advantage of?
UPDATE: (changes the title)
Thanks for your replies. I understand there are different levels of caching mechanisms working in our boxes, but that is not enough in my case.
I think I missed something important here. Actually, I'm building an application on top of an existing framework, in which frequently requesting reads from the engine costs too much for me. (Yes, I believe the engine does take advantage of OS and disk level caches.) What I'm trying to do is indeed to build an application-level pre-fetching system.
Thoughts?
In general you should try to use what the OS gives you, rather than creating your own cache (because you run the risk of caching twice). For Linux, you can request OS level caching via readahead(); I don't know what the Windows equivalent would be.
Looking into this some more, there is also a block level (i.e. disk) parameter, set via blockdev --setra. It's probably not a good idea to change that on your system (unless it is dedicated to just this one task), but if the value there (blockdev --getra) is already larger than your typical chunk size then you may not need to do anything else.
[And just to address the other point mentioned in the question comments: while an OS will cache file data in free memory, I don't believe that it will pre-emptively read an otherwise unread file (apart from to meet the requirements above). But if anyone knows otherwise, please post details...]
Have you tried mmap()ing the file instead of read()ing from it? In some cases this might be more efficient, in some cases this might not. However it is usually best to let the system optimize for you, since it knows more about the hardware than an application. mmap() will let the system know that you need the whole file, so it might just be more optimal.
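If you still want an application-level read-ahead on top of whatever the OS does, the idea from the question (read a big block once, hand out small chunks from it) can be sketched roughly like this. The class name ReadAheadReader is made up, and I use plain fread() as a stand-in for the framework's expensive read call, which I know nothing about.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical wrapper: serves small reads from a large in-memory buffer that
// is refilled with one big fread() call whenever it runs dry.
class ReadAheadReader {
public:
    ReadAheadReader(std::FILE* file, std::size_t buffer_size)
        : file_(file), buffer_(buffer_size) {
        assert(file != nullptr && buffer_size > 0);
    }

    // Copy up to `n` bytes into `out`; returns the number of bytes delivered
    // (less than `n` only at end of file or on a read error).
    std::size_t read(char* out, std::size_t n) {
        std::size_t delivered = 0;
        while (delivered < n) {
            if (pos_ == filled_) {  // buffer exhausted: fetch the next big block
                filled_ = std::fread(buffer_.data(), 1, buffer_.size(), file_);
                pos_ = 0;
                ++refills_;
                if (filled_ == 0) break;
            }
            std::size_t take = std::min(n - delivered, filled_ - pos_);
            std::memcpy(out + delivered, buffer_.data() + pos_, take);
            pos_ += take;
            delivered += take;
        }
        return delivered;
    }

    // Number of underlying fread() calls issued so far.
    std::size_t refills() const { return refills_; }

private:
    std::FILE* file_;
    std::vector<char> buffer_;
    std::size_t pos_ = 0, filled_ = 0, refills_ = 0;
};
```

The buffer size is the knob the question asks about. I don't know of a universal rule for it; comparing refills() against the number of small reads for a few sizes is one way to pick a value empirically.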

Store huge amount of data in memory

I am looking for a way to store several GB of data in memory. The data is loaded into a tree structure. I want to be able to access this data through my main function, but I'm not interested in reloading the data into the tree every time I run the program. What is the best way to do this? Should I create a separate program for loading the data and then call it from the main function, or are there better alternatives?
thanks
Mads
I'd say the best alternative would be using a database - which would be then your "separate program for loading the data".
If you are using a POSIX compliant system, then take a look into mmap.
I think Windows has another function to memory map a file.
You could probably solve this using shared memory: have one long-lived process build the tree and expose its address, and then other processes that start up can get hold of that same memory for querying. Note that in that case you will need to make sure the tree is safe to read from multiple simultaneous processes. If the reads are really just pure reads, then that should be easy enough.
You should look into a technique called a Memory mapped file.
I think the best solution is to configure a cache server and put data there.
Look into Ehcache:
Ehcache is an open source, standards-based cache used to boost
performance, offload the database and simplify scalability. Ehcache is
robust, proven and full-featured and this has made it the most
widely-used Java-based cache.
It's written in Java, but should support any language you choose:
The Cache Server has two apis: RESTful resource oriented, and SOAP.
Both support clients in any programming language.
You must be running a 64-bit system to use more than 4 GB of memory. If you build the tree and set it as a global variable, you can access the tree and data from any function in the program. I suggest you perhaps try an alternative method that requires less memory consumption. If you post what type of program, and what type of tree you're doing, I can perhaps give you some help in finding an alternative method.
Since you don't want to keep reloading the data...file storage and databases are out of question, but several gigs of memory seem like such a hefty price.
Also note that on Windows systems, you can access the memory of another program using ReadProcessMemory(); you need a handle to the target process (opened with read access) and the address of the memory to read.
You may alternatively implement the data loader as an executable program and the main program as a dll loaded and unloaded on demand. That way you can keep the data in the memory and be able to modify the processing code w/o reloading all the data or doing cross-process memory sharing.
Also, if you can operate on the raw data from the disk w/o making any preprocessing of it (e.g. putting it in a tree, manipulating pointers to its internals), you may want to memory-map the data and avoid loading unused portions of it.
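One way to make a tree usable straight from a mapped file is to store child indices instead of pointers, so the bytes on disk are already in their final form. Below is a rough sketch, assuming a POSIX system; FlatNode, save_tree and contains_key are names I made up, and the file format is just the raw structs, so it is only readable on the same kind of machine that wrote it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

// A binary search tree node that stores child *indices* rather than pointers,
// so an array of nodes is valid no matter where it is mapped in memory.
struct FlatNode {
    std::int64_t key;
    std::int32_t left;   // index of left child, or -1
    std::int32_t right;  // index of right child, or -1
};

// Write the node array to `path` as raw bytes. Returns true on success.
bool save_tree(const char* path, const std::vector<FlatNode>& nodes) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::size_t written = std::fwrite(nodes.data(), sizeof(FlatNode), nodes.size(), f);
    return std::fclose(f) == 0 && written == nodes.size();
}

// Map the file read-only and search it in place; node 0 is the root.
// Returns 1 if found, 0 if not found, -1 on error.
int contains_key(const char* path, std::int64_t key) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return -1; }
    std::size_t length = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (addr == MAP_FAILED) return -1;

    const FlatNode* nodes = static_cast<const FlatNode*>(addr);
    std::int64_t count = static_cast<std::int64_t>(length / sizeof(FlatNode));
    int found = 0;
    for (std::int64_t i = 0; i >= 0 && i < count; ) {
        if (nodes[i].key == key) { found = 1; break; }
        // Only the pages holding nodes on this search path get faulted in.
        i = key < nodes[i].key ? nodes[i].left : nodes[i].right;
    }
    munmap(addr, length);
    return found;
}
```

In a real program you would map the file once at startup and keep the mapping for the lifetime of the process rather than remapping for every lookup; the sketch maps per call only to stay short.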

Direct SPU to SPU DMA requests on the Cell Processor?

Normal DMA requests on the Cell happen between the SPUs and the PPU. However, I have read that it is possible to set up DMA directly between SPUs. Anyone have any idea how this is accomplished?
Have a look at spe_get_ls(). This will help you to set up a list of effective addresses that you can use to transfer data between local stores. You may need some management to map SPE identifiers to physical SPUs.
The trick is essentially what Chris said. The local store of one SPE is memory-mapped into the PPE's memory space. And then you just perform a regular DMA transfer from the other SPE to that address on the PPE.
I'm sorry I don't have the exact code for this. It's been a year or so since I had to do any of this. :)