Is valgrind's cachegrind still the go-to tool in 2021? - profiling

I'm a long-time user of cachegrind for program profiling, and recently went back to check the official documentation once more: https://valgrind.org/docs/manual/cg-manual.html
In it, there are multiple references to CPU models, implementation decisions and simulation models that are all from the mid-2000s, and there are also statements that some behavior changed on "modern" processors:
the LL cache typically replicates all the entries of the L1 caches [...] This is standard on Pentium chips, but AMD Opterons, Athlons and Durons use an exclusive LL cache [...]
Cachegrind simulates branch predictors intended to be typical of mainstream desktop/server processors of around 2004.
More recent processors have better branch predictors [...] Cachegrind's predictor design is deliberately conservative so as to be representative of the large installed base of processors which pre-date widespread deployment of more sophisticated indirect branch predictors. In particular, late model Pentium 4s (Prescott), Pentium M, Core and Core 2 have more sophisticated indirect branch predictors than modelled by Cachegrind.
Now I'm wondering:
how many of these choices still apply in 2021 when developing on latest-gen CPUs,
whether the implementation of cachegrind has been updated to reflect latest CPUs, but the manual is outdated,
whether cachegrind shows skewed results on modern CPUs due to its simulation of legacy behavior.
Any insight is greatly appreciated!

Related

How is if statement executed in NVIDIA GPUs?

As far as I know, GPU cores are very simple and can only execute basic mathematical instructions.
If I have a kernel with an if statement, then what executes that if statement? The FP32, FP64 and INT32 units can only execute operations on floats, doubles and integers, not a COMPARE instruction, or am I wrong? And what happens if I have a printf call in a kernel? Who executes that?
Compare instructions are arithmetic instructions: you can implement a comparison with a subtraction and a flag register, and GPGPUs do have them.
But they are often not advertised as much as the number-crunching capability of the whole GPU.
NVIDIA doesn't publish the machine code documentation for its GPUs, nor the ISA of the corresponding assembly language (called SASS).
Instead, NVIDIA maintains the PTX language which is designed to be more portable across different generations while still being very close to the actual machine code.
PTX is a predicated architecture. The setp instruction (which, again, is essentially a subtraction with a few caveats) sets the value of the specified predicate registers, and these are used to conditionally execute other instructions, including the bra instruction, which is a branch; this is what makes conditional branches possible.
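For a concrete picture, here is a minimal CUDA sketch; the kernel and its name (clampToZero) are made up for illustration. For a tiny body like this, the compiler will often emit predicated instructions rather than an actual jump, while larger bodies typically get a predicated bra; you can check what your particular toolchain really does with nvcc -ptx (and cuobjdump -sass for the machine code):

    // Each thread takes one of two paths depending on its data.
    // The comments show roughly the kind of PTX nvcc tends to emit;
    // the exact registers and instructions depend on the toolchain.
    __global__ void clampToZero(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {                 // bounds check: setp.lt.s32 %p, %r_i, %r_n
            if (data[i] < 0.0f) {    // setp.lt.f32 on the loaded value
                data[i] = 0.0f;      // executed only where the predicate is true
            }
        }
    }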
One could argue that PTX is not SASS, but predication does seem to be what NVIDIA GPUs do, or at least used to do.
AMD GPUs seem to use the traditional approach to branching: there are comparison instructions (e.g. S_CMP_EQ_U64) and conditional branches (e.g. S_CBRANCH_SCCZ).
Intel GPUs also rely on predication but have different instructions for divergent vs non-divergent branches.
So GPGPUs do have branch instructions; in fact, their SIMT model has to deal with the branch divergence problem.
Before c. 2006, GPUs were not fully programmable and programmers had to rely on other tricks (like data masking or branchless code) to implement their kernels.
Keep in mind that at the time it was not widely accepted that one could execute arbitrary programs or produce arbitrary shading effects on GPUs. GPUs relaxed their programming constraints over time.
Putting a printf in a CUDA kernel probably won't work because there is no C runtime on the GPU (remember, the GPU is an entirely different executor from the CPU), and I guess the linking would fail.
You could theoretically provide a GPU implementation of the CRT and design a mechanism for issuing syscalls from GPU code, but that would be unimaginably slow since GPUs are not designed for this kind of work.
EDIT: Apparently NVIDIA actually did implement a printf on the GPU that prints to a buffer shared with the host.
The problem here is not the presence of branches but the very nature of printf.
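For reference, device-side printf has been available since compute capability 2.0 (Fermi): each thread writes into a device-side buffer that the host flushes at synchronization points such as cudaDeviceSynchronize(). A minimal sketch (kernel name and launch configuration are arbitrary):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void hello()
    {
        // Executed by every thread; the output goes into a device-side
        // buffer and is printed on the host when that buffer is flushed.
        printf("hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

    int main()
    {
        hello<<<2, 4>>>();
        cudaDeviceSynchronize();   // flushes the device printf buffer to stdout
        return 0;
    }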

How to measure read/cycle or instructions/cycle?

I want to thoroughly measure and tune my C/C++ code to perform better with caches on an x86_64 system. I know how to measure time with a counter (QueryPerformanceCounter on my Windows machine), but I'm wondering how one would measure instructions per cycle or reads/writes per cycle with respect to the working set.
How should I proceed to measure these values?
Modern processors (i.e., anything not severely constrained and less than some 20 years old) are superscalar, i.e., they execute more than one instruction at a time (given a suitable instruction ordering). The latest x86 processors translate the CISC instructions into internal RISC-like instructions, reorder them and execute the result, and even have several register banks so that instructions using "the same registers" can be executed in parallel. There isn't any reasonable way to define "the time an instruction takes to execute" today.
Current CPUs are much faster than memory (a few hundred instructions is the typical cost of accessing memory), so they are all heavily dependent on the cache for performance. And then you have all kinds of funny effects from cores sharing (or not sharing) parts of the cache, ...
Tuning code for maximal performance starts with the software architecture, goes on to program organization, algorithm and data structure selection (where a modicum of cache/virtual memory awareness is useful too), then careful programming, and only at the end (as the most extreme measure, to squeeze out the last 2% of performance) considerations like the ones you mention (and the other favorite, "rewrite it in assembly"). The ordering is that way because the earlier levels give more performance for the same cost. Measure before digging in; programmers are notoriously unreliable at finding bottlenecks. And consider the cost of reorganizing code for performance: the work itself, convincing yourself that the more complex code is correct, and maintenance. Given the relative costs of computers and people, extreme performance tuning rarely makes sense (perhaps for heavily travelled code paths in popular operating systems, or for common code sequences generated by a compiler, but almost nowhere else).
If you are really interested in where your code is hitting cache and where it is hitting memory, and the processor is less than about 10-15 years old in its design, then there are performance counters in the processor. You need driver level software to access these registers, so you probably don't want to write your own tools for this. Fortunately, you don't have to.
There are tools like VTune from Intel, CodeAnalyst from AMD and OProfile for Linux (which works with both AMD and Intel processors).
There is a whole range of registers that count things such as the number of instructions actually completed, the number of cycles the processor spends waiting, and so on. You can also get counts of things like "number of memory reads", "number of cache misses", "number of TLB misses" and "number of FPU instructions".
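As an illustration of what these tools read under the hood: on Linux you can query a couple of these counters from your own code via the perf_event_open(2) system call (Windows has no equivalent stock user-level API, hence the tools above). A rough sketch, with error handling omitted and the dummy loop standing in for your own code:

    // Minimal Linux-only sketch: count retired instructions and core cycles
    // around a region of code using perf_event_open(2).
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdint.h>

    static int open_counter(uint64_t config)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        // pid = 0, cpu = -1: measure this thread, on whatever CPU it runs on.
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int instr_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
        int cycle_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);

        ioctl(instr_fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(cycle_fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(instr_fd, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(cycle_fd, PERF_EVENT_IOC_ENABLE, 0);

        // ... the code you want to measure goes here ...
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; ++i) x += i * 0.5;

        ioctl(instr_fd, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(cycle_fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t instructions = 0, cycles = 0;
        read(instr_fd, &instructions, sizeof(instructions));
        read(cycle_fd, &cycles, sizeof(cycles));
        printf("instructions: %llu, cycles: %llu, IPC: %.2f\n",
               (unsigned long long)instructions, (unsigned long long)cycles,
               cycles ? (double)instructions / (double)cycles : 0.0);
        return 0;
    }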
The next, trickier part is of course trying to fix any issues you find, and as mentioned in another answer, programmers aren't always good at tweaking these sorts of things. It's certainly time-consuming, not to mention that what works well on processor model X will not necessarily run fast on model Y (there were some tuning tricks for the early Pentium 4 that worked VERY badly on AMD processors; if, on the other hand, you tuned that code for AMD processors of that age, it ran well on the same-generation Intel processors too!)
You might be interested in the rdtsc x86 instruction, which reads a relative number of cycles.
See http://www.fftw.org/cycle.h for an implementation to read the counter in many compilers.
However, I'd suggest simply measuring with QueryPerformanceCounter. It is rare that the actual number of cycles matters; to tune code you typically only need to compare relative time measurements, and rdtsc has many pitfalls, though they are probably not applicable to the situation you described (a short sketch of both approaches follows this list):
On multiprocessor systems, there is not a single coherent cycle counter value.
Modern processors often adjust their clock frequency, which changes the relationship between elapsed time and elapsed cycles.
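For completeness, here is a minimal Windows/MSVC-style sketch of both approaches (the dummy loop is just a placeholder; note that on recent CPUs the TSC is invariant, i.e. it ticks at a constant reference rate regardless of the actual core frequency, which is precisely the second pitfall above):

    #include <windows.h>
    #include <intrin.h>   // __rdtsc on MSVC
    #include <stdio.h>

    int main(void)
    {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);

        unsigned __int64 c0 = __rdtsc();
        QueryPerformanceCounter(&t0);

        // ... code under test ...
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; ++i) x += i;

        QueryPerformanceCounter(&t1);
        unsigned __int64 c1 = __rdtsc();

        printf("QPC: %.3f ms, rdtsc: %llu ticks (reference cycles, not core cycles)\n",
               1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart,
               (unsigned long long)(c1 - c0));
        return 0;
    }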

Measure how often a branch is mispredicted

Assuming I have an if-else branch in C++, how can I measure (in code) how often the branch is mispredicted? I would like to add some calls or macros around the branch (similar to how you do bottom-up profiling) that would report branch mispredictions.
It would be nice to have a generic method, but let's take the Intel i5 2500K for starters.
If you are using an AMD CPU, AMD's CodeAnalyst is just what you need (it works on Windows and Linux)*.
If you're not, then you may need to fork out for a VTune licence, or build something yourself using the on-CPU performance registers and counters detailed in the instruction manuals.
You can also check out gperf and OProfile (Linux only) and see how well they work (I've never used these, but I see them referred to quite a bit).
*CodeAnalyst should work on an Intel CPU; you just don't get all the nice CPU-level analysis.
I wonder if it would be possible to extract this information from g++ -fprofile-arcs? It has to measure exactly this in order to feed back into the optimizer in order to optimize branching.
OProfile
OProfile is pretty complex, but it can profile anything your CPU tracks.
Look through the Event Type Reference and look for your particular CPU.
For instance, here are the core2 events. After a quick search I don't see any event counters for branch mispredictions on the core2 architecture.
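If running on Linux is an option, you can also count branch mispredictions directly from inside the program with the perf_event_open(2) system call, no custom driver needed. A rough sketch (error handling omitted; the unpredictable loop is just a placeholder for your own branch):

    // Count retired branches and branch misses around a branchy region (Linux).
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    static int open_hw_counter(uint64_t config)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;        // which hardware event to count
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int branches_fd = open_hw_counter(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
        int misses_fd   = open_hw_counter(PERF_COUNT_HW_BRANCH_MISSES);

        ioctl(branches_fd, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(misses_fd, PERF_EVENT_IOC_ENABLE, 0);

        // Placeholder: a data-dependent branch the predictor struggles with.
        long taken = 0;
        for (int i = 0; i < 1000000; ++i)
            if (rand() & 1) ++taken;

        ioctl(branches_fd, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(misses_fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t branches = 0, misses = 0;
        read(branches_fd, &branches, sizeof(branches));
        read(misses_fd, &misses, sizeof(misses));
        printf("branches: %llu, mispredicted: %llu (taken=%ld)\n",
               (unsigned long long)branches, (unsigned long long)misses, taken);
        return 0;
    }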

How to profile a C++ function at assembly level?

I have a function that is the bottleneck of my program. It requires no access to memory and only does calculation. It is the inner loop and is called many times, so any small gain in this function is a big win for my program.
I come from a background of optimizing SPU code on the PS3, where you take an SPU program and run it through a pipeline analyzer that puts each assembly statement in its own column, and you minimize the number of cycles the function takes. Then you overlay loops so you can minimize pipeline dependencies even more. With that program and a list of the cycles each assembly instruction takes, I could optimize much better than the compiler ever could.
On a different platform there were events I could register (cache misses, cycles, etc.), so I could run the function and track CPU events. That was pretty nice as well.
Now I'm doing a hobby project on Windows using Visual Studio C++ 2010 w/ a Core i7 Intel processor. I don't have the money to justify paying the large cost of VTune.
My question:
How do I profile a function at the assembly level for an Intel processor on Windows?
I want to compile, view disassembly, get performance metrics, adjust my code and repeat.
There are some great free tools available, mainly AMD's CodeAnalyst (from my experience on my i7 vs my Phenom II, it's a bit handicapped on the Intel processor because it doesn't have direct access to the hardware-specific counters, though that might have been a bad configuration).
However, a lesser-known tool is the Intel Architecture Code Analyzer (IACA, which is free like CodeAnalyst). It is similar to the SPU tool you described, as it details latency, throughput and port pressure (basically the dispatch of requests to the ALUs, the MMU and the like) line by line for your program's assembly. Stan Melax gave a nice talk on it and on x86 optimization at this year's GDC, under the title "hotspots, flops and uops: to-the-metal cpu optimization".
Intel also has a few more tools in the same vein as IACA, available under the performance tuning section of their experimental/what-if code site, such as PTU, which is (or was) an experimental evolution of VTune and, from what I can see, is free.
It's also a good idea to have read the Intel optimization manual before diving into this.
EDIT: as Ben pointed out, the timings might not be correct for older processors, but that can easily be made up for using Agner Fog's optimization manuals, which also contain many other gems.
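To point IACA at a specific loop, you bracket it with the markers from the iacaMarks.h header that ships with the tool, then run the iaca command-line tool on the compiled object file. A sketch under that assumption (the loop itself is made up; marked binaries are for analysis only, since the markers are special byte sequences that would misbehave if actually executed):

    #include "iacaMarks.h"   // ships with IACA; defines IACA_START / IACA_END

    // Hypothetical hot loop; the markers delimit the block that iaca
    // analyzes for latency, throughput and port pressure.
    float sum_scaled(const float *a, int n, float k)
    {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) {
            IACA_START            // beginning of the analyzed block
            s += a[i] * k;
        }
        IACA_END                  // end of the analyzed block
        return s;
    }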
You might want to try some of the utilities included in Valgrind, like callgrind or cachegrind.
Callgrind can do profiling and dump assembly.
And KCachegrind is a nice GUI that will show the dumps, including the assembly and the number of hits per instruction, etc.
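Callgrind can also be told from inside the program which region to measure, using the client-request macros from valgrind/callgrind.h, typically combined with running callgrind with --instr-atstart=no so that only the interesting part is instrumented. A small sketch (the hot function is just a placeholder):

    #include <valgrind/callgrind.h>

    // Placeholder for the function being tuned.
    static double hot_function(int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += i * 0.5;
        return s;
    }

    int main(void)
    {
        // Run under: valgrind --tool=callgrind --instr-atstart=no ./prog
        // so that only this region is instrumented.
        CALLGRIND_START_INSTRUMENTATION;
        double r = hot_function(1000000);
        CALLGRIND_STOP_INSTRUMENTATION;
        CALLGRIND_DUMP_STATS;      // write a profile dump immediately
        return r > 0 ? 0 : 1;
    }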
From your description it sounds like your problem may be embarrassingly parallel; have you considered using PPL's parallel_for?

What's a good PPC based >MACHINE< for profiling code for in-order processors

I know that older Macs have PPC processors in them, which is perfect, but which specific models are suitable for dropping a Linux distribution onto? I've not used a Mac in over 10 years now, so I have no idea which to go for. In particular, I ask about ones that accept Linux because I believe Apple asks you to pay to develop on their machines; or is it possible to use C++ with GCC and LLVM for free on the Mac?
I need to be able to profile code on an in-order RISC processor, and the PPC seems like the best place to start, but what other CPUs offer a similar coding experience? That is, with a much reduced instruction set, stalls when branching, microcoded instructions and load-hit-store problems when switching between float/int/vector representations.
There is no charge to develop on a Mac. There is a charge to install iOS products on an iPhone, and there is a charge to sell Mac products through the App Store. But you can build C++ apps for free on a Mac; Xcode itself is free.
Any PowerBook G4 is fine for this kind of work, and there are many pages on installing Linux on a PowerBook G4 if you wanted to do that (though I'd probably just use Xcode rather than go through the hassle).
Use Mac OS X and get the free Xcode developer tools from Apple (Xcode 3.x), and also the free CHUD performance tools package, which includes Shark, a very good sampling profiler that you will find extremely useful.
Slightly off-topic, but
in-order
It depends on precisely what you mean by in-order! PowerPC has a variety of synchronizing instructions like sync, lwsync and eieio to enforce (different types of!) memory ordering, and isync, which flushes the instruction pipeline. IBM has a decent summary.
risc processor
I really wouldn't call the PPC "reduced" ;)
stalls when branching
IIRC, a correctly-predicted branch with its target in the instruction cache does not stall the G4 (I forget how the different models of G4 differ). OTOH, the G5 performs better if branch targets are 16-byte aligned (something about the branch target buffer).
microcode instructions
I thought half the point of RISC was to avoid microcode? I'm not aware of microcode updates, at any rate.
load-hit-store problems when switching between float/int/vector representations
I'm not sure what this means...
"Traditional" ARM might is probably closer to what you're looking for, but I suspect the more recent processors have some of the more "modern" processor features. My ARM box of choice is probably the SheevaPlug or similar, though the WZR-HP-G300NH router is cheaper (and comes with Wi-Fi) if you don't mind being constrained to 64 MB.