Easiest way to collect dynamic Instruction execution counts? - profiling

I'd like a simple and fast way to collect the number of times each Instruction in LLVM bitcode was executed in a given run of the application. As far as I can tell, there are a number of approaches I can take:
Use PIN. This would require using DWARF debug info and Instruction debug info to attempt to map instructions in the binary to instructions in the bitcode; not 100% sure how accurate this will be.
Use llvm-prof. Two questions here. First, I've seen on Stack Overflow an option to opt called --insert-edge-profiling. However, that option doesn't seem to be available in LLVM 3.6? Second, it appears that such profiling only records execution counts at the Function level, not at the individual Instruction level. Is that correct?
Write a new tool similar to AddressSanitizer. This may work, but seems like overkill.
Is there an easier way to achieve my goal that I'm missing?

As part of my PhD research, I have written a tool to collect a trace of the basic blocks executed by a program. This tool also records the number of LLVM instructions in each basic block, so an analysis of the trace would give the dynamic Instruction execution count.
Another research tool is Harmony. It will provide the dynamic execution counts of each basic block in the program, which you could extend with the static instruction counts.
Otherwise, I would suggest writing your own tool. For each basic block, (atomically) increment a global counter by the number of instructions in that block.
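For the write-your-own-tool route, a minimal sketch of such an instrumentation pass is below. This assumes LLVM's C++ API; the exact IRBuilder/CreateAtomicRMW signatures differ between LLVM versions (newer ones also take an alignment), the global name instr_count is made up, and you would still need a small runtime hook (e.g. an atexit handler) to print the counter at exit.
    #include "llvm/IR/Constants.h"
    #include "llvm/IR/GlobalVariable.h"
    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/Instructions.h"
    #include "llvm/IR/Module.h"

    using namespace llvm;

    // For every basic block, atomically add its static instruction count to a
    // global 64-bit counter, so at program exit the counter holds the dynamic
    // instruction count.
    static void instrumentModule(Module &M) {
      LLVMContext &Ctx = M.getContext();
      Type *I64 = Type::getInt64Ty(Ctx);
      GlobalVariable *Counter = new GlobalVariable(
          M, I64, /*isConstant=*/false, GlobalValue::ExternalLinkage,
          ConstantInt::get(I64, 0), "instr_count");

      for (Function &F : M) {
        if (F.isDeclaration())
          continue;
        for (BasicBlock &BB : F) {
          uint64_t NumInsts = BB.size();  // count before adding our own instruction
          IRBuilder<> B(&*BB.getFirstInsertionPt());
          // counter += NumInsts, atomically (argument list varies by LLVM version).
          B.CreateAtomicRMW(AtomicRMWInst::Add, Counter,
                            ConstantInt::get(I64, NumInsts),
                            AtomicOrdering::SequentiallyConsistent);
        }
      }
    }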

Related

Looking for a way to detect valgrind/memcheck at runtime without including valgrind headers

Valgrind/Memcheck can be intensive and causes runtime performance to drop significantly. I need a way (at runtime) to detect it, so I can disable all auxiliary services and features and get the checks done in under 24 hours. I would prefer not to pass any explicit flags to the program, but that would be one way.
I explored searching the symbol table (via abi calls) for valgrind or memcheck symbols, but there were none.
I explored checking the stack (via boost::stacktrace), but nothing was there either.
I'm not sure it's a good idea to behave differently when running under Valgrind, since the point of Valgrind is to exercise your software in its expected usage.
Anyway, Valgrind does not change the stack or the symbols, since it (kind of) emulates a CPU running your program. The only way to detect whether you're being run under Valgrind is to observe its effects: everything is slow, and threads are effectively serialized under Valgrind.
So, for example, run a test that spawns 3 threads consuming a common FIFO (with a mutex/lock) and observe the number of items each one received. On a real CPU, you'd expect the 3 threads to have processed roughly the same number of items in time T, but when run under Valgrind one thread will have consumed almost all the items, and in much more than T.
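A minimal sketch of that heuristic is below. The queue size and the 90% threshold are arbitrary choices of mine, not anything Valgrind-specific, and whether that threshold is reliable on a given machine would need testing, since lock contention can skew the split even on real hardware.
    #include <algorithm>
    #include <array>
    #include <mutex>
    #include <queue>
    #include <thread>

    // Heuristic: Valgrind serializes threads, so one worker tends to drain
    // almost the whole queue; on real hardware the split is roughly even.
    static bool looks_like_valgrind() {
        std::queue<int> fifo;
        std::mutex m;
        for (int i = 0; i < 100000; ++i) fifo.push(i);

        std::array<long, 3> consumed{};  // items taken by each worker
        std::array<std::thread, 3> workers;
        for (int t = 0; t < 3; ++t) {
            workers[t] = std::thread([&, t] {
                for (;;) {
                    std::lock_guard<std::mutex> lock(m);
                    if (fifo.empty()) return;
                    fifo.pop();
                    ++consumed[t];
                }
            });
        }
        for (auto &w : workers) w.join();

        long total = consumed[0] + consumed[1] + consumed[2];
        long busiest = std::max({consumed[0], consumed[1], consumed[2]});
        return busiest > total * 9 / 10;  // one thread did >90% of the work
    }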
Another possibility is to rely on some known syscall behavior. Valgrind has rules for observing syscalls. For example, if you allocate memory, Valgrind will track that block and fill the area with some data. Well-behaved software should not read that data before first writing to it (overwriting what Valgrind set). If you read it anyway and observe a non-zero value, you'll get a Valgrind "invalid read of size XXX" message, but your code will know it's being instrumented.
Finally (and I think it's much simpler), you could move the code you need to instrument into a library and have two front ends: the "official" front end, and a test front end, with all the bells and whistles disabled, that is meant to be run under Valgrind.

Performance counter for Halide?

Is there a performance counter available for code written in the Halide language? I would like to know how many loads, stores, and ALU operations are performed by my code.
The Halide tutorial for scheduling multi-stage pipelines compares different schedules by the amount of allocated memory, loads, stores, and calls to Halide Funcs, but I don't see how this information was collected. I suppose it might be possible to use trace_stores, trace_loads, and trace_realizations to print a line to the console every time one of these operations occurs. That isn't a great option, though, because it would greatly slow down the program's execution and would require some kind of script to condense the long list of console output into the desired counts of loads, stores, and ALU operations.
I'm pretty sure they just used the trace_xxx output and ran some scripts/programs on it.
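For example, turning the traces on for one Func looks roughly like this. The Func/Var names are invented; trace_stores() and trace_loads() are the actual Halide calls, though the realize() syntax differs slightly between Halide versions.
    #include "Halide.h"
    using namespace Halide;

    int main() {
        Var x("x");
        Func producer("producer"), consumer("consumer");
        producer(x) = x * 2;
        consumer(x) = producer(x) + 1;
        producer.compute_root();

        // Emit one line to stderr for every store to / load from 'producer',
        // which a post-processing script can then count.
        producer.trace_stores();
        producer.trace_loads();

        consumer.realize({64});  // older Halide versions use realize(64)
        return 0;
    }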
If you're looking for real performance numbers on an x86 platform, I would go with Intel VTune Amplifier. It's pretty expensive, but may be free if you're in academia (student, teacher, researcher) or if it's for an open-source project.
Other than that, look at the lowered statement code by setting HL_DEBUG_CODEGEN=1 in the environment and you can get a better idea of the loop structure and data use. Note that this output goes to stderr, not stdout.
EDIT: For Linux, there's perf.
We do not have any perf counter based support at present. It is fairly difficult to make it portable. (And on mobile devices, often the OS simply doesn't allow access to the hardware.) The support in Profiling.cpp and src/profiling.cpp could likely be used to drive perf counter operation. The profiling lowering pass adds code to call routines in the runtime which update information about Func and Pipeline execution. This information is collected and aggregated by another thread.
If tracing is run to a file (e.g. using HL_TRACE_FILE) a binary format is used and it is a bit more efficient. See utils/HalideTraceViz for a tool to work with the binary format. This is generally how analyses are done within the team.
There was a small amount of investigation of OProfile, which looked promising but I don't think we ever got code working.

Let a task run for a fixed number of instructions?

Is it possible to schedule a given task to run exactly n machine instructions before control is returned to the user?
The motivation for this question is the debugging of multithreaded programs, where this could be helpful to reliably reproduce certain bugs or undefined behaviour.
I'm particularly interested in the case of x86_64-linux running on an Intel CPU, but solutions for other architectures or operating systems would also be interesting.
The documentation for the kernel perf suite says:
Performance counters are special hardware registers available on most modern CPUs. These registers count the number of certain types of hw events: such as instructions executed, cache misses suffered, or branches mis-predicted - without slowing down the kernel or applications. These registers can also trigger interrupts when a threshold number of events have passed.
so it seems like the hardware could support this in principle, but I'm not sure if this is exposed in any way to the user.
Of course it's possible to just use ptrace to single-step the program n times, but that would make all but the most simple programs impossibly slow.
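For reference, that single-stepping approach looks roughly like the sketch below (error handling omitted; as said, it is far too slow for anything beyond toy programs).
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    // Single-step a child process for up to n instructions under ptrace.
    int main(int argc, char **argv) {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <n> <program> [args...]\n", argv[0]);
            return 1;
        }
        long n = atol(argv[1]);

        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);
            execvp(argv[2], &argv[2]);
            _exit(127);
        }

        int status;
        waitpid(child, &status, 0);  // child stops at the exec
        long steps = 0;
        while (steps < n && WIFSTOPPED(status)) {
            ptrace(PTRACE_SINGLESTEP, child, nullptr, nullptr);
            waitpid(child, &status, 0);
            ++steps;
        }
        fprintf(stderr, "stopped after %ld single steps\n", steps);
        // The child is paused here; keep it stopped for inspection,
        // or detach to let it continue.
        ptrace(PTRACE_DETACH, child, nullptr, nullptr);
        return 0;
    }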
One simple option to ensure an exact count of the instructions executed is to instrument the assembly code and maintain an execution counter. I believe the easiest way to do instrumentation is Pin ( https://software.intel.com/en-us/articles/pintool ).
High-level idea:
- interpret the machine code and maintain a counter of the number of instructions executed,
- after each instruction, increment the counter and check whether it is time for a breakpoint,
- reset the counter after each breakpoint.
The interpretation idea would introduce quite a bit of overhead. I see a number of straightforward optimizations:
Instrument the binary statically (create a new binary where all these increments/checks are hard-coded). Such an approach eliminates the instrumentation/interpretation overheads. You can either count the monitoring/breakpoint instructions as extra executed instructions or choose to exclude them from the count.
The increments/checks can be implemented more smartly: for a straight-line run of instructions with no jumps/branches, you can do a single increment and a single check for the whole run. This idea is simple but might prove pretty tricky in practice, especially if you need an absolutely accurate breakpoint.
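As a concrete starting point, a minimal Pin tool along these lines could look roughly like the sketch below. It is modeled on Pin's classic instruction-counting example; the hard-coded limit and the call to PIN_ExitApplication are my own additions and untested, so treat it as a sketch rather than working code.
    #include "pin.H"
    #include <iostream>

    // Stop the program after this many instructions (in a real tool this would
    // probably be a command-line knob rather than a hard-coded constant).
    static const UINT64 kLimit = 1000000;
    static UINT64 icount = 0;

    // Analysis routine: runs before every instruction of the target program.
    static VOID CountAndCheck() {
        if (++icount >= kLimit) {
            std::cerr << "limit reached after " << icount << " instructions" << std::endl;
            PIN_ExitApplication(0);  // or hand control to a debugger here
        }
    }

    // Instrumentation routine: insert the analysis call before each instruction.
    static VOID Instruction(INS ins, VOID *) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountAndCheck, IARG_END);
    }

    int main(int argc, char *argv[]) {
        if (PIN_Init(argc, argv)) return 1;
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();  // never returns
        return 0;
    }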

How can I get the number of instructions executed by a program?

I have written and cross-compiled a small C++ program, and I can run it on an ARM board or on a PC. Since ARM and the PC have different instruction set architectures, I want to compare them. Is it possible for me to get the number of executed instructions of this C++ program on both ISAs?
What you need is a profiler; perf is an easy one to use. It will give you the number of instructions executed, which is the best metric if you want to compare ISA efficiency.
Check the tutorial here.
You need to use: perf stat ./your_binary
Look for the instructions metric. This approach uses a register in your CPU's performance monitoring unit (PMU) that counts the number of instructions executed.
Are you trying to get the number of static instructions or dynamic instructions? So, for instance, if you have the following loop (pseudocode):
for (i = 0 to N):
    a[i] = b[i] + c[i]
Static instruction count will be just under 10 instructions, give or take based on your ISA, but the dynamic count would depend on N, on the branch prediction implementation and so on.
So for static count I would recommend using objdump, as per recommendations in the comments. You can find the entry and exit labels of your subroutine and count the number of instructions in between.
For dynamic instruction count, I would recommend one of two things:
You can simulate running that code using an instruction set simulator. There are open-source ISA simulators for both ARM and x86 out there; gem5, for instance, implements both, and there are others that support one or the other.
Your second option is to run the code natively on the target system and set up performance counters in the CPU to report the dynamic instruction count. You would reset the counter before executing your code and read it afterwards (there may be some noise associated with entering and leaving your subroutine, but you should be able to isolate that).
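On Linux, one way to do the reset/read from inside the program is the perf_event_open interface. A minimal sketch, closely modeled on the example in the perf_event_open(2) man page (the measured loop is just a placeholder):
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Thin wrapper: glibc does not provide one for this syscall.
    static long perf_event_open(perf_event_attr *attr, pid_t pid, int cpu,
                                int group_fd, unsigned long flags) {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // retired instructions
        attr.disabled = 1;
        attr.exclude_kernel = 1;  // count user-space instructions only
        attr.exclude_hv = 1;

        int fd = perf_event_open(&attr, 0 /*this process*/, -1 /*any cpu*/, -1, 0);
        if (fd == -1) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        // --- code under measurement (placeholder) ---
        volatile long sum = 0;
        for (int i = 0; i < 1000000; ++i) sum += i;
        // --------------------------------------------

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        read(fd, &count, sizeof(count));
        printf("instructions: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }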
Hope this helps :)
objdump -dw mybinary | wc -l
On Linux and friends, this gives a good approximation of the number of instructions in an executable, library, or object file. This is a static count, which is of course completely different from runtime behavior.
Linux:
valgrind --tool=callgrind ./program 1 > /dev/null
Callgrind simulates the program and reports the number of executed instructions (the Ir event), both in total and per function in its output file.

Newbie: Performance Analysis through the command line

I am looking for a performance analysis tool with the following properties:
1. Free.
2. Runs on Windows.
3. Does not require using the GUI (i.e. can be run from the command line or by using some library in any programming language).
4. Runs on some x86-based architecture (preferably Intel).
5. Can measure the running time of my C++, mingw-compiled program, except for the time spent in a few certain functions I specify (and all calls emanating from them).
6. Can measure the amount of memory used by my program, except for the memory allocated by the functions specified in (5) and all calls emanating from them.
A tool that has properties (1) to (5) (without (6)) would still be very valuable to me.
My goal is to be able to compare the running time and memory usage of different programs in a consistent way (i.e. the main requirement is that timing the same program twice would return roughly the same results).
MinGW should already come with a gprof tool. To use it you just need to compile with the right flags set; I think they were -g -pg.
For heap analysis there is the free umdh.exe, a full heap dumper; you can also compare consecutive memory snapshots to check for leakage over time. You'd have to filter the output yourself to remove functions that are not of interest, however.
I know that's not exactly what you were asking for in (6), but it might be useful. I don't think this kind of filtering is commonly available in free tools.