Efficient cache and BLOB's - profiling cache hits/misses

For a program to be cache efficient, the data it uses should be stored linearly, right?
So instead of dynamic allocation I put my data in a blob using a linear allocator. Is this enough to improve performance? What should I do to improve cache efficiency even more?
I know these questions aren't specific, but I don't know how to explain it...
Which programs can help me profile cache hits/misses?
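
For reference, the "blob plus linear allocator" setup described above can be sketched roughly as follows. This is a minimal illustration only, assuming a fixed-size blob, trivially destructible objects and no per-object deallocation; `LinearAllocator`, `allocate`, `reset` and `used` are hypothetical names, not taken from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#include <type_traits>

// Hypothetical bump allocator: hands out consecutive slices of one blob,
// so objects allocated one after another end up adjacent in memory.
class LinearAllocator {
public:
    explicit LinearAllocator(std::size_t capacity)
        : blob_(new unsigned char[capacity]), capacity_(capacity), offset_(0) {
        assert(capacity > 0);
    }

    // Returns value-initialized storage for `count` objects of type T,
    // or nullptr if the remaining space in the blob is too small.
    template <typename T>
    T* allocate(std::size_t count = 1) {
        static_assert(std::is_trivially_destructible<T>::value,
                      "this sketch never runs destructors");
        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(blob_.get());
        const std::uintptr_t mask = alignof(T) - 1;
        // Round the next free address up to the alignment T requires.
        const std::size_t start =
            static_cast<std::size_t>(((base + offset_ + mask) & ~mask) - base);
        if (start > capacity_ || count > (capacity_ - start) / sizeof(T)) {
            return nullptr;
        }
        T* first = reinterpret_cast<T*>(blob_.get() + start);
        for (std::size_t i = 0; i < count; ++i) {
            ::new (static_cast<void*>(first + i)) T();
        }
        offset_ = start + count * sizeof(T);
        return first;
    }

    // Frees everything at once by rewinding to the start of the blob.
    void reset() { offset_ = 0; }
    std::size_t used() const { return offset_; }

private:
    std::unique_ptr<unsigned char[]> blob_;
    std::size_t capacity_;
    std::size_t offset_;
};
```

Contiguous placement alone does not guarantee good cache behaviour; the order in which the data is then traversed matters just as much, which is what the profilers mentioned in the answers help to check.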

If you're looking for a profiler for Windows, you can try AMD's CodeAnalyst or VerySleepy; both are free. AMD's is the more powerful of the two (it also works on Intel hardware, but IIRC you can't use the hardware-based profiling features there) and includes monitoring of things like branch-prediction misses and cache utilization. Profiling is great, as it tells you what to optimize, but it doesn't always tell you how. For that, have a look at Agner Fog's optimization manuals combined with Intel's optimization manual (which contains a lot on locality and cacheability optimizations).

If you're on Linux you could use Valgrind (specifically the cachegrind tool).
If you're on Windows then VS2010 (2008) Professional edition has a built-in profiler, but
I don't know any details about its cache profiling facilities. There is also the Intel
VTune Analyzer (Amplifier). Both of them are commercial products, although I think you can get 30-day evaluation copies.
Some other questions on SO that might be of help:
What's your favorite profiling tool (for C++)
C and C++ source code profiling tools
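
To get a feel for what cachegrind reports, a small test case with a deliberately good and a deliberately bad access pattern is useful. The sketch below is only an illustration: the function names are made up, and the matrix is assumed to be stored row-major in a single `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums an n x n matrix stored row-major in one contiguous vector.
// Walking row by row touches consecutive addresses.
long long sum_row_major(const std::vector<int>& m, std::size_t n) {
    assert(m.size() == n * n);
    long long total = 0;
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t c = 0; c < n; ++c) {
            total += m[r * n + c];
        }
    }
    return total;
}

// Same sum, but walking column by column: each access jumps n elements
// ahead, so for large n nearly every access lands on a different cache line.
long long sum_column_major(const std::vector<int>& m, std::size_t n) {
    assert(m.size() == n * n);
    long long total = 0;
    for (std::size_t c = 0; c < n; ++c) {
        for (std::size_t r = 0; r < n; ++r) {
            total += m[r * n + c];
        }
    }
    return total;
}
```

Calling each function from a small `main` on a matrix that is much larger than the cache, building with `-g`, and running the result under `valgrind --tool=cachegrind ./program` followed by `cg_annotate` on the generated `cachegrind.out.<pid>` file should show the data-cache miss counts attributed to each loop.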

On Linux, you can use perf mem to sample memory accesses, including misses in a very fine-grained manner (including the miss address), as described here.

Related

What is the fastest instrumentation profiler out there

What is the fastest profiler available for dynamic profiling (like what gprof does). The profiler has to be an instrumentation profiler, or even if it has sampling profiling with it, I'm interested to know the overhead of instrumentation profiling, because sampling profiling can be done with almost 0% overhead anyway.

Any profiler that uses hardware-based sampling (via the CPU's PMSRs) will have the smallest overhead, as it's reading the profiling data the CPU is keeping track of at a hardware level. For more info, see AMD's and Intel's architecture manuals; these should be explained in depth in one of the appendices.
The only profilers I know of using these are VTune for Intel (not free) and CodeAnalyst for AMD (free).
Next in line would be timer-based and event-based profilers. Of these, the ones with the least overhead would probably be ones compiled directly into your code (CodeAnalyst has an API for event-based profiling, and so does VTune). gprof also falls into this category (Clang also has something, but IDK if it's still maintained...). If you have VS Pro or Ultimate, its PG compile mode will do similar things, though I have never found it to compare with a dedicated profiler suite.
Last would be the ones that need to insert probes into your code to gather their profiling data. All the aforementioned ones can do this, as well as other freeware profilers like VerySleepy.

Intel's VTune Amplifier is probably the most complete.

Determine Values AND/OR Address of Values in CPU Cache

Is there a way to determine exactly what values, memory addresses, and/or other information currently resides in the CPU cache (L1, L2, etc.) - for current or all processes?
I've been doing quite a bit of reading which shows how to optimize programs to utilize the CPU cache more effectively. However, I'm looking for a way to truly determine if certain approaches are effective.
Bottom line: is it possible to be 100% certain what does and does not make it into the CPU cache?
Searching for this topic returns several results on how to determine the cache size, but not contents.
Edit: To clarify some of the comments below: Since software would undoubtedly alter the cache, do CPU manufacturers have a tool / hardware diagnostic system (built-in) which provides this functionality?

Without using specialized hardware, you cannot directly inspect what is in the CPU cache. The act of running any software to inspect the CPU cache would alter the state of the cache.
The best approach I have found is simply to identify real hot spots in your application and benchmark alternative algorithms on hardware the code will run on in production (or on a range of likely hardware if you do not have control over the production environment).
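
A minimal sketch of such a benchmark, assuming wall-clock time from `std::chrono::steady_clock` is a good enough measure and that taking the fastest of several runs filters out most noise. `best_time_ns` is a hypothetical helper name.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: runs `candidate` `runs` times and returns the
// fastest observed wall-clock time in nanoseconds.
template <typename F>
long long best_time_ns(F&& candidate, int runs) {
    assert(runs > 0);
    long long best = std::numeric_limits<long long>::max();
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        candidate();
        const auto stop = std::chrono::steady_clock::now();
        const long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (ns < best) {
            best = ns;
        }
    }
    return best;
}
```

Each alternative is wrapped in a lambda and timed on the target hardware, for example `best_time_ns([&] { run_variant_a(data); }, 10)`, where `run_variant_a` stands for whatever implementation is being compared. The results should be written somewhere observable so the compiler cannot optimize the work away.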

In addition to Eric J.'s answer, I'll add that while I'm sure the big chip manufacturers do have such tools, it's unlikely that such a "debug" facility would be made available to regular mortals like you and me. Even if it were, it wouldn't really be of much help.
Why? It's unlikely that you are having performance issues that you've traced to cache and which cannot be solved using the well-known and "common sense" techniques for maintaining high cache-hit ratios.
Have you really optimized all other hotspots in the code and poor cache behavior by the CPU is the problem? I very much doubt that.
Additionally, as food for thought: do you really want to optimize your program's behavior to only one or two particular CPUs? After all, caching algorithms change all the time, as do the parameters of the caches, sometimes dramatically.

If you have a relatively modern processor running Windows then take a look at
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
and see if that might provide some of what you are looking for.

Optimizing for one specific CPU cache size is usually in vain, since the optimization breaks when your assumptions about the cache sizes turn out to be wrong on a different CPU.
But there is a way out. You should optimize for certain access patterns that allow the CPU to easily predict which memory locations will be read next (the most obvious one is a linearly increasing read). To fully utilize a CPU you should read about cache-oblivious algorithms, most of which follow a divide-and-conquer strategy in which a problem is divided into subparts until all memory accesses fit completely into the CPU cache.
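
As an illustration of the divide-and-conquer idea, here is a sketch of a recursive matrix transpose, a textbook cache-oblivious example. It is not taken from the answer's sources; the function names and the base-case cutoff of 16 are assumptions chosen for this sketch (the cutoff only limits recursion overhead and is not tuned to any cache size).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Transposes the sub-block [r0, r1) x [c0, c1) of the n x n row-major
// matrix `in` into `out`, halving the longer side until the block is small.
void transpose_recursive(const std::vector<double>& in, std::vector<double>& out,
                         std::size_t n,
                         std::size_t r0, std::size_t r1,
                         std::size_t c0, std::size_t c1) {
    const std::size_t rows = r1 - r0;
    const std::size_t cols = c1 - c0;
    if (rows <= 16 && cols <= 16) {
        // Base case: the block is small enough to copy directly.
        for (std::size_t r = r0; r < r1; ++r) {
            for (std::size_t c = c0; c < c1; ++c) {
                out[c * n + r] = in[r * n + c];
            }
        }
        return;
    }
    if (rows >= cols) {
        const std::size_t mid = r0 + rows / 2;
        transpose_recursive(in, out, n, r0, mid, c0, c1);
        transpose_recursive(in, out, n, mid, r1, c0, c1);
    } else {
        const std::size_t mid = c0 + cols / 2;
        transpose_recursive(in, out, n, r0, r1, c0, mid);
        transpose_recursive(in, out, n, r0, r1, mid, c1);
    }
}

// Convenience wrapper for a full n x n matrix.
void transpose(const std::vector<double>& in, std::vector<double>& out, std::size_t n) {
    assert(in.size() == n * n && out.size() == n * n);
    transpose_recursive(in, out, n, 0, n, 0, n);
}
```

Because the recursion keeps halving the block, at some depth the working set fits into each cache level, whatever its size happens to be on the machine running the code.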
It is also worth noting that you have a code cache and a data cache, which are separate. Herb Sutter has a nice video online where he talks about the CPU internals in depth.
The Visual Studio Profiler can collect CPU counters dealing with memory and L2 counters. These options are available when you select instrumentation profiling.
Intel also has a paper online which talks in greater detail about these CPU counters, about what the task managers of Windows and Linux show you, and about how wrong that is for today's CPUs, which work internally asynchronously and in parallel at many different levels. Unfortunately there is no tool from Intel to display this stuff directly. The only tool I know of is the VS profiler. Perhaps VTune has similar capabilities.
If you have gone this far to optimize your code, you might as well look into GPU programming. You need at least a PhD to get your head around SIMD instructions, cache locality, ... to get perhaps a factor of 5 over your original design. But by porting your algorithm to a GPU you get a factor of 100 with much less effort on a decent graphics card. NVidia GPUs which support CUDA (all cards sold today support it) can be programmed very nicely in a C dialect. There are even wrappers for managed code (.NET) to take advantage of the full power of GPUs.
You can stay platform agnostic by using OpenCL, but NVidia's OpenCL support is very bad. The OpenCL drivers are at least 8 times slower than their CUDA counterparts.

Almost everything you do will be in the cache at the moment when you use it, unless you are reading memory that has been configured as "uncacheable" - typically, that's frame buffer memory of your graphics card. The other way to "not hit the cache" is to use specific load and store instructions that are "non-temporal". Everything else is read into the L1 cache before it reaches the target registers inside the CPU itself.
In nearly all cases, CPUs have a fairly good system for knowing what to keep and what to throw away in the cache, and the cache is nearly always "full", though not necessarily of useful stuff. If, for example, you are working your way through an enormous array, it will just contain a lot of "old array". [This is where the "non-temporal" memory operations come in handy: they allow you to read and/or write data that won't be stored in the cache, since the next time you get back to the same point it won't be in the cache ANYWAY.]
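
As a rough sketch of what a "non-temporal" write looks like from C++ on x86, the following fills a large buffer with SSE2 streaming stores. It assumes an x86 target with SSE2 and a 16-byte-aligned destination; `fill_streaming` and the `HAVE_SSE2_STREAM` macro are hypothetical names for this illustration, and on other targets the code falls back to ordinary stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence
#define HAVE_SSE2_STREAM 1
#endif

// Fills `count` 32-bit integers at `dst` with `value`. Where SSE2 is
// available, whole 16-byte chunks are written with non-temporal stores,
// which hint to the CPU that the data need not be kept in the cache.
void fill_streaming(std::int32_t* dst, std::size_t count, std::int32_t value) {
    // _mm_stream_si128 requires a 16-byte-aligned destination.
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    std::size_t i = 0;
#ifdef HAVE_SSE2_STREAM
    const __m128i v = _mm_set1_epi32(value);
    for (; i + 4 <= count; i += 4) {
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
    _mm_sfence();  // make the streamed stores globally visible before returning
#endif
    // Remaining elements (or everything, without SSE2) use ordinary stores.
    for (; i < count; ++i) {
        dst[i] = value;
    }
}
```

Whether this helps depends on the workload: it tends to pay off only for buffers that are much larger than the cache and are not read again soon, so it is worth measuring both variants.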
And yes, processors usually have special registers [that can be accessed in kernel drivers] that can inspect the contents of the cache. But they are quite tricky to use without at the same time losing the content of the cache(s). And they are definitely not useful as "how much of array A is in the cache" type checking. They are specifically for "Hmm, it looks like cache-line 1234 is broken, I'd better read the cached data to see if it's really the value it should be" when processors aren't working as they should.
As DanS says, there are performance counters that you can read from suitable software [need to be in the kernel to use those registers too, so you need some sort of "driver" software for that]. In Linux, there's "perf". And AMD has a similar set of performance counters that can be used to find out, for example "how many cache misses have we had over this period of time" or "how many cache hits in L" have we had, etc.

How to profile a C++ function at assembly level?

I have a function that is the bottleneck of my program. It requires no access to memory, only calculation. It is the inner loop and is called many times, so any small gain in this function is a big win for my program.
I come from a background in optimizing SPU code on the PS3, where you take an SPU program and run it through a pipeline analyzer that puts each assembly statement in its own column so you can minimize the number of cycles the function takes. Then you overlay loops so you can minimize pipeline dependencies even more. With that program and a list of the cycles each assembly instruction takes, I could optimize much better than the compiler ever could.
On a different platform it had events I could register (cache misses, cycles, etc.) and I could run the function and track CPU events. That was pretty nice as well.
Now I'm doing a hobby project on Windows using Visual Studio C++ 2010 w/ a Core i7 Intel processor. I don't have the money to justify paying the large cost of VTune.
My question:
How do I profile a function at the assembly level for an Intel processor on Windows?
I want to compile, view disassembly, get performance metrics, adjust my code and repeat.

There are some great free tools available, mainly AMD's CodeAnalyst (in my experience on my i7 vs. my Phenom II, it's a bit handicapped on the Intel processor because it doesn't have access to the hardware-specific counters, though that might have been bad config).
However, a lesser-known tool is the Intel Architecture Code Analyzer (IACA), which is free like CodeAnalyst and similar to the SPU tool you described: it details latency, throughput and port pressure (basically the request dispatches to the ALUs, MMU and the like) line by line for your program's assembly. Stan Melax gave a nice talk on it and x86 optimization at this year's GDC, under the title "hotspots, flops and uops: to-the-metal cpu optimization".
Intel also has a few more tools in the same vein as IACA, available under the performance tuning section of their experimental/what-if code site, such as PTU, which is (or was) an experimental evolution of VTune; from what I can see, it's free.
It's also a good idea to have read the Intel optimization manual before diving into this.
EDIT: as Ben pointed out, the timings might not be correct for older processors, but that can be easily made up for using Agner Fog's Optimization manuals, which also contain many other gems.

You might want to try some of the utilities included in valgrind like callgrind or cachegrind.
Callgrind can do profiling and dump assembly.
And kcachegrind is a nice GUI, and will show the dumps including assembly and number of hits per instruction etc.

From your description it sounds like your problem may be embarrassingly parallel. Have you considered using PPL's parallel_for?

Programmatically counting cache faults

I need to evaluate the time taken by a C++ function under a number of hypotheses about memory hierarchy efficiency (e.g. the time taken when we have a cache miss, a cache hit or a page fault when reading a portion of an array), so I'd like a library that lets me count cache misses / page faults in order to auto-generate a performance summary.
I know there are tools like cachegrind that give some related statistics on a given application execution, but I'd like a library, as I've already said.
edit Oh, I forgot: I'm using Linux and I'm not interested in portability, it's an academic thing.
Any suggestion is welcome!

Most recent CPUs (both AMD and Intel) have performance monitor registers that can be used for this kind of job. For Intel, they're covered in the programmer's reference manual, volume 3B, chapter 30. For AMD, it's in the BIOS and Kernel Developer's Guide.
Either way, you can count things like cache hits, cache misses, memory requests, data prefetches, etc. They have pretty specific selectors, so you could get a count of (for example) the number of reads on the L2 cache to fill lines in the L1 instruction cache (while still excluding L2 reads to fill lines in the L1 data cache).
There is a Linux kernel module to give access to MSRs (Model-specific registers). Offhand, I don't know whether it gives access to the performance monitor registers, but I'd expect it probably does.

It looks like now there is exactly what I was searching for: perf_event_open.
It lets you do interesting things like initializing, enabling and disabling performance counters and subsequently fetching their values through a uniform and intuitive API (it gives you a special file descriptor which hosts a struct containing the previously requested information).
It is a Linux-only solution and the functionality varies depending on the kernel version, so be careful :)
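
A minimal sketch of counting cache misses around a piece of code with perf_event_open, modelled on the pattern shown in the system call's man page. The helper names `sys_perf_event_open` and `count_cache_misses` are made up for this example; glibc provides no wrapper for the call, so it goes through `syscall`. Opening the counter can fail (for instance because of the `perf_event_paranoid` setting or inside a virtual machine), in which case this sketch returns -1 without running the work.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>
#include <cstring>

// Thin wrapper: the system call has to be invoked via syscall().
static long sys_perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                                int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

// Runs `work` and returns the number of hardware cache misses counted for
// the calling thread in user space, or -1 if the counter was unavailable.
template <typename F>
long long count_cache_misses(F&& work) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;        // start stopped; enabled explicitly below
    attr.exclude_kernel = 1;  // count user-space events only
    attr.exclude_hv = 1;

    // pid = 0 and cpu = -1: measure the calling thread on any CPU.
    const int fd = static_cast<int>(sys_perf_event_open(&attr, 0, -1, -1, 0));
    if (fd == -1) {
        return -1;
    }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t count = 0;
    const ssize_t got = read(fd, &count, sizeof(count));
    close(fd);
    assert(got <= static_cast<ssize_t>(sizeof(count)));
    return got == static_cast<ssize_t>(sizeof(count)) ? static_cast<long long>(count)
                                                      : -1;
}
```

The code under test is passed as a callable, for example `count_cache_misses([&] { process(data); })`, where `process` stands for the function being evaluated. Other events, such as `PERF_COUNT_HW_CACHE_REFERENCES`, can be selected through `attr.config` in the same way.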

Intel VTune is a performance tuning tool that does exactly what you are asking for.
Of course it works with Intel processors, as it accesses the internal processor counters, as explained by Jerry Coffin, so it will probably not work on an AMD processor.
It exposes literally hundreds of counters, like cache hits/misses, branch prediction rates, etc. The real issue with it is understanding which counters to check ;)

Cache misses cannot simply be counted. Most tools or profilers simulate the memory access by redirecting memory accesses to a function that provides this feature. That means these kinds of tools instrument the code at every place where a memory access is made, which makes your code run awfully slowly. I guess this is not what you intend.
However, depending on the hardware, you might have some other possibilities. But even then the OS has to support it (because otherwise you would get system-global stats, not ones related to a process or thread).
EDIT: I could find this interesting article that may help you: http://lwn.net/Articles/417979/
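
For the page-fault half of the question, the OS already keeps per-process counts that can be read without any instrumentation. A small sketch using POSIX `getrusage` follows; the `FaultCounts` struct and `current_page_faults` function are hypothetical names for this example.

```cpp
#include <sys/resource.h>

#include <cassert>

// Page faults the OS has recorded for this process so far.
struct FaultCounts {
    long minor;  // faults serviced without disk I/O (ru_minflt)
    long major;  // faults that required disk I/O (ru_majflt)
};

FaultCounts current_page_faults() {
    rusage usage{};
    const int rc = getrusage(RUSAGE_SELF, &usage);
    assert(rc == 0);
    (void)rc;  // silence the unused-variable warning when NDEBUG is defined
    return {usage.ru_minflt, usage.ru_majflt};
}
```

Taking one reading before and one after the function under test and subtracting the two gives the number of faults it caused, as long as no other thread in the process is faulting at the same time.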

Profiling C++ multi-threaded applications

Have you used any profiling tool like Intel Vtune analyzer?
What are your recommendations for a C++ multi threaded application on Linux and windows? I am primarily interested in cache misses, memory usage, memory leaks and CPU usage.
I use valgrind (only on UNIX), but mainly for finding memory errors and leaks.

The following are good tools for multithreaded applications. You can try the evaluation copies.
Runtime sanity check tool
Thread Checker -- Intel Thread checker / VTune, here
Memory consistency-check tools (memory usage, memory leaks)
- Memory Validator, here
Performance Analysis. (CPU usage)
- AQTime , here
EDIT: Intel Thread Checker can be used to diagnose data races, deadlocks, stalled threads, abandoned locks, etc. Have lots of patience when analyzing the results, as it is easy to get confused.
A few tips:
Disable the features that are not required. (When identifying deadlocks, data race detection can be disabled, and vice versa.)
Use the instrumentation level that matches your need. (Levels like "All Function" and "Full Image" are used for data races, whereas "API Imports" can be used for deadlock detection.)
Use the context-sensitive menu "Diagnostic Help" often.

On Linux, try oprofile.
It supports various performance counters.
On Windows, AMD's CodeAnalyst (free, unlike VTune) is worth a look.
It only supports event profiling on AMD hardware, though
(on Intel CPUs it's just a handy timer-based profiler).
A colleague recently tried Intel Parallel Studio (beta) and rated it favourably
(it found some interesting parallelism-related issues in some code).

VTune gives you a lot of detail on what the processor is doing, and sometimes I find it hard to see the wood for the trees. VTune will not report on memory leaks. You'll need PurifyPlus for that, or, if you can run on a Linux box, valgrind is good for memory leaks at a great price.
VTune shows two views: the tabular one is useful; the other, I think, is just for salesmen to impress people with and is not that useful.
For a quick and cheap option I'd go with valgrind. Valgrind also has a cachegrind part, which I've not used but suspect is also very good.
cheers,
Martin.

You can try out AMD CodeXL's CPU profiler. It is free and available for both Windows and Linux.
AMD CodeXL's CPU profiler replaces the no longer supported CodeAnalyst tool (which was mentioned in an answer above given by timday).
For more information and download links, visit: AMD CodeXL web page.

I'll put in another answer for valgrind, especially the callgrind portion with the UI. It can handle multiple threads by profiling each thread for cache misses, etc. They also have a multi-thread error checker called helgrind, but I've never used it and don't know how good it is.

The Rational PurifyPlus suite includes both a well-proven leak detector and pretty good profiler. I'm not sure if it does go down to the level of cache misses, though - you might need VTune for that.
PurifyPlus is available both on various Unices and Windows so it should cover your requirements, but unfortunately in contrast to Valgrind, it isn't free.

For simple profiling, gprof is pretty good.
