I have written and cross-compiled a small C++ program, and I can run it on an ARM board or on a PC. Since ARM and the PC have different instruction set architectures, I want to compare them. Is it possible to get the number of executed instructions in this C++ program for both ISAs?
What you need is a profiler. perf is an easy one to use. It will give you the number of instructions executed, which is the best metric if you want to compare ISA efficiency.
Check the tutorial here.
You need to use: perf stat ./your_binary
Look for the instructions metric. This approach uses a register in your CPU's performance monitoring unit (PMU) that counts the number of instructions.
Are you trying to get the number of static instructions or dynamic instructions? So, for instance, if you have the following loop (pseudocode):
for (i 0 to N):
a[i] = b[i] + c[i]
The static instruction count will be just under 10 instructions, give or take depending on your ISA, but the dynamic count depends on N (and, if speculatively executed instructions are included in the count, on the branch prediction implementation and so on).
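To make the distinction concrete, here is a minimal sketch of the arithmetic. The setup and body counts are hypothetical placeholders, not measurements of any real ISA, and the model only covers instructions that actually complete.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical static instruction counts for the loop above; the real
// numbers depend on the ISA and the compiler.
struct LoopShape {
    std::uint64_t setup;  // instructions executed once, before the loop
    std::uint64_t body;   // instructions executed on every iteration
};

// Static count: every instruction appears exactly once in the binary.
std::uint64_t static_count(const LoopShape& loop) {
    return loop.setup + loop.body;
}

// Dynamic count for n iterations: the setup runs once, the body n times.
std::uint64_t dynamic_count(const LoopShape& loop, std::uint64_t n) {
    // Guard against overflow of the 64-bit result.
    assert(n == 0 ||
           loop.body <= (std::numeric_limits<std::uint64_t>::max() - loop.setup) / n);
    return loop.setup + loop.body * n;
}
```

With an assumed shape of 3 setup instructions and 6 per iteration, the static count stays at 9 no matter what N is, while the dynamic count for N = 1000 is 6003.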
So for static count I would recommend using objdump, as per recommendations in the comments. You can find the entry and exit labels of your subroutine and count the number of instructions in between.
For dynamic instruction count, I would recommend one of two things:
You can simulate running that code using an instruction set simulator. There are open-source ISA simulators for both ARM and x86: gem5, for instance, implements both, and others support one or the other.
Your second option is to run this natively on the target system and set up performance counters in the CPU to report the dynamic instruction count. You would reset the counter before executing your code and read it afterwards. There might be some noise associated with calling your subroutine and exiting, but you should be able to isolate that out.
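On Linux, the reset/run/read sequence from the second option can be sketched with the perf_event_open system call. This is only a sketch under assumptions: count_instructions is a name made up for this example, the counter is restricted to user-space instructions on the calling thread, and the call may simply fail where the PMU is not accessible (some VMs, or a restrictive perf_event_paranoid setting).

```cpp
#include <cstdint>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: counts user-space instructions retired while `work`
// runs on the calling thread. Returns -1 if the counter cannot be opened
// or read.
template <typename Work>
long long count_instructions(Work work) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        // start stopped; enabled explicitly below
    attr.exclude_kernel = 1;  // count user space only
    attr.exclude_hv = 1;

    // glibc provides no wrapper for perf_event_open, so use syscall().
    // pid = 0 and cpu = -1 mean "this thread, on whichever CPU it runs".
    int fd = static_cast<int>(syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0));
    if (fd < 0) return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);    // reset before the code under test
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);  // stop before reading

    std::uint64_t count = 0;
    ssize_t got = read(fd, &count, sizeof(count));
    close(fd);
    if (got != static_cast<ssize_t>(sizeof(count))) return -1;
    return static_cast<long long>(count);
}
```

You would call it as count_instructions([] { my_subroutine(); }), where my_subroutine stands for whatever you want to measure. The result includes the few instructions spent in the ioctl stubs and the call itself, which is the noise mentioned above; measuring an empty callable and subtracting is one way to estimate it.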
Hope this helps :)
objdump -dw mybinary | wc -l
On Linux and friends, this gives a good approximation of the number of instructions in an executable, library or object file. This is a static count, which is of course completely different than runtime behavior.
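The wc -l figure also counts headers, blank lines and symbol labels. A slightly tighter static count can be sketched by keeping only lines that look like disassembled instructions. The line format assumed here (optional whitespace, a hex address, a colon, then a tab) is what GNU objdump -dw prints on the systems I have seen; count_static_instructions and is_instruction_line are names made up for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// True if `line` looks like "  401000:\t55 \tpush %rbp": optional leading
// whitespace, a hex address, a colon, then a tab. Headers, blank lines and
// labels such as "0000000000401000 <main>:" do not match.
bool is_instruction_line(const std::string& line) {
    std::size_t i = 0;
    while (i < line.size() && (line[i] == ' ' || line[i] == '\t')) ++i;
    const std::size_t start = i;
    while (i < line.size() && std::isxdigit(static_cast<unsigned char>(line[i]))) ++i;
    return i > start && i + 1 < line.size() && line[i] == ':' && line[i + 1] == '\t';
}

// Counts instruction lines in disassembly text read from a stream.
std::size_t count_static_instructions(std::istream& disassembly) {
    std::size_t count = 0;
    std::string line;
    while (std::getline(disassembly, line)) {
        if (is_instruction_line(line)) ++count;
    }
    return count;
}

// Convenience overload for disassembly already held in a string.
std::size_t count_static_instructions(const std::string& text) {
    std::istringstream in(text);
    return count_static_instructions(in);
}
```

Piping objdump -dw mybinary into a small program that calls count_static_instructions(std::cin) gives the count. Keep the -w flag: without it, long instructions spill onto continuation lines that this pattern would count a second time. Data embedded in code sections is still disassembled as if it were instructions, so the result remains an approximation.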
Linux:
valgrind --tool=callgrind ./program 1 > /dev/null
We are chasing some weird hardware failures on AMD Threadrippers. I came across some evidence that AVX2/AVX-512 instructions can lead to weird behaviour (https://news.ycombinator.com/item?id=22382946).
Is there a generic way of measuring or profiling the use of AVX2/AVX-512 instructions of a running program or machine? For now it would be enough for me to get a ball-park of how many of these instructions are being used in a given time frame. I do not necessarily need to pin it down to the actual program using them. The more detailed the profiling / attribution of AVX2/AVX-512 instruction use by program or time is the better.
I would prefer tools that run in Linux.
I read an interesting paper, entitled "A High-Resolution Side-Channel Attack on Last-Level Cache", and wanted to find out the index hash function for my own machine—i.e., Intel Core i7-7500U (Kaby Lake architecture)—following the leads from this work.
To reverse-engineer the hash function, the paper mentions the first step as:
for (n=16; ; n++)
{
// ignore any miss on first run
for (fill=0; !fill; fill++)
{
// set pmc to count LLC miss
reset_pmc();
for (a=0; a<n; a++)
// set_count*line_size=2^19
load(a*2^19);
}
// get the LLC miss count
if (read_pmc()>0)
{
min = n;
break;
}
}
How can I code the reset_pmc() and read_pmc() in C++? From all that I read online so far, I think it requires inline assembly code, but I have no clue what instructions to use to get the LLC miss count. I would be obliged if someone can specify the code for these two steps.
I am running Ubuntu 16.04.1 (64-bit) on VMware workstation.
P.S.: I found mention of these LONGEST_LAT_CACHE.REFERENCES and LONGEST_LAT_CACHE.MISSES in Chapter-18 Volume 3B of the Intel Architectures Software Developer's Manual, but I do not know how to use them.
You can use perf as Cody suggested to measure the events from outside the code, but I suspect from your code sample that you need fine-grained, programmatic access to the performance counters.
To do that, you need to enable user-mode reading of the counters, and also have a way to program them. Since those are restricted operations, you need at least some help from the OS kernel to do that. Rolling your own solution is going to be pretty difficult, but luckily there are several existing solutions for Ubuntu 16.04:
Andi Kleen's jevents library, which among other things lets you read PMU events from user space. I haven't personally used this part of pmu-tools, but the stuff I have used has been high quality. It seems to use the existing perf_events syscalls for counter programming, so it doesn't need a kernel module.
The libpfc library is a from-scratch implementation of a kernel module and userland code that allows userland reading of the performance counters. I've used this and it works well. You install the kernel module which allows you to program the PMU, and then use the API exposed by libpfc to read the counters from userspace (the calls boil down to rdpmc instructions). It is the most accurate and precise way to read the counters, and it includes "overhead subtraction" functionality which can give you the true PMU counts for the measured region by subtracting out the events caused by the PMU read code itself. You need to pin to a single core for the counts to make sense, and you will get bogus results if your process is interrupted.
Intel's open-sourced Processor Counter Monitor library. I haven't tried this on Linux, but I used its predecessor library, the very similarly named [1] Performance Counter Monitor on Windows, and it worked. On Windows it needs a kernel driver, but on Linux it seems you can either use a driver or have it go through perf_events.
Use the likwid library's Marker API functionality. Likwid has been around for a while and seems well supported. I have used likwid in the past, but only to measure whole processes in a manner similar to perf stat and not with the marker API. To use the marker API you still need to run your process as a child of the likwid measurement process, but you can programmatically read the counter values within your process, which is what you need (as I understand it). I'm not sure how likwid sets up and reads the counters when the marker API is used.
So you've got a lot of options! I think all of them could work, but I can personally vouch for libpfc since I've used it myself for the same purpose on Ubuntu 16.04. The project is actively developed and probably the most accurate (least overhead) of the above. So I'd probably start with that one.
All of the solutions above should be able to work for Kaby Lake, since the functionality of each successive "Performance Monitoring Architecture" seems to generally be a superset of the prior one, and the API is generally preserved. In the case of libpfc, however, the author has restricted it to only support Haswell's architecture (PMA v3), but you just need to change one line of code locally to fix that.
[1] Indeed, they are both commonly called by their acronym, PCM, and I suspect that the new project is simply the officially open sourced continuation of the old PCM project (which was also available in source form, but without a mechanism for community contribution).
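If none of the libraries fit, the plain perf_events route can be sketched directly. This is not how libpfc or the other libraries work internally; it is a minimal sketch, assuming the generic PERF_COUNT_HW_CACHE_MISSES event is close enough to the LLC miss event you want (the perf_event_open man page describes it as usually indicating last-level cache misses). LlcMissCounter is a name made up for this example, and its two methods are named after the reset_pmc() and read_pmc() calls in the question.

```cpp
#include <cstdint>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: one cache-miss counter attached to the calling thread.
class LlcMissCounter {
public:
    LlcMissCounter() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_MISSES;  // generic cache-miss event
        attr.exclude_kernel = 1;                   // user space only
        attr.exclude_hv = 1;
        // attr.disabled stays 0, so the counter runs as soon as it is opened.
        fd_ = static_cast<int>(syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0));
    }
    ~LlcMissCounter() {
        if (fd_ >= 0) close(fd_);
    }
    LlcMissCounter(const LlcMissCounter&) = delete;
    LlcMissCounter& operator=(const LlcMissCounter&) = delete;

    // False if the kernel refused to open the counter.
    bool ok() const { return fd_ >= 0; }

    // Zeroes the count. Returns false on failure.
    bool reset_pmc() { return ok() && ioctl(fd_, PERF_EVENT_IOC_RESET, 0) == 0; }

    // Misses counted since the last reset, or -1 on failure.
    long long read_pmc() {
        std::uint64_t value = 0;
        if (!ok()) return -1;
        if (read(fd_, &value, sizeof(value)) != static_cast<ssize_t>(sizeof(value))) return -1;
        return static_cast<long long>(value);
    }

private:
    int fd_ = -1;
};
```

Each reset and read here is a full system call, so it disturbs the cache far more than an rdpmc-based read would. For an experiment that looks for a single extra miss, that noise may well be too much, which is a reason to prefer the rdpmc-based libraries above. Check ok() first: inside a virtual machine the hardware events are often not exposed at all.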
I would use PAPI, see http://icl.cs.utk.edu/PAPI/
This is a cross platform solution that has a lot of support, especially from the hpc community.
Is there a performance counter available for code written in the Halide language? I would like to know how many loads, stores, and ALU operations are performed by my code.
The Halide tutorial for scheduling multi-stage pipelines compares different schedules by comparing the amount of allocated memory, loads, stores, and calls to halide Funcs, but I don't see how this information was collected. I suppose it might be possible to use trace_stores, trace_loads, and trace_realizations to print to the console every time one of these operations occurs. This isn't a great option though because it would greatly slow down the program's execution and would require some kind of counting script to compile the long list of console outputs into the desired counts for loads, stores, and ALU operations.
I'm pretty sure they just used the trace_xxx output and ran some scripts/programs on it.
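Such a counting program can be very small. The sketch below tallies trace lines by their first word. It assumes the text trace prints one event per line, starting with the event kind (for example "Store gradient.0(0, 0) = 0"), which is what I remember from the tutorial output; count_trace_events is a name made up for this example.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Tallies trace lines by their first word, e.g. "Load" or "Store".
// Blank lines are skipped.
std::map<std::string, std::size_t> count_trace_events(std::istream& trace) {
    std::map<std::string, std::size_t> counts;
    std::string line;
    while (std::getline(trace, line)) {
        std::istringstream words(line);
        std::string kind;
        if (words >> kind) ++counts[kind];
    }
    return counts;
}
```

Run the pipeline with its trace output redirected to a file, then feed that file (or std::cin) to count_trace_events and print the "Load" and "Store" entries. Two caveats: events that share a first word are merged into one bucket, and tracing does not report ALU operations at all, so those still need another source.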
If you're looking for real performance numbers on an x86 platform, I would go with Intel VTune Amplifier. It's pretty expensive, but may be free if you're in academia (student, teacher, researcher) or it's for an open source project.
Other than that, look at the lowered statement code by setting HL_DEBUG_CODEGEN=1 in the environment and you can get a better idea of the loop structure and data use. Note that this output goes to stderr, not stdout.
EDIT: For Linux, there's perf.
We do not have any perf counter based support at present. It is fairly difficult to make it portable. (And on mobile devices, often the OS simply doesn't allow access to the hardware.) The support in Profiling.cpp and src/profiling.cpp could likely be used to drive perf counter operation. The profiling lowering pass adds code to call routines in the runtime which update information about Func and Pipeline execution. This information is collected and aggregated by another thread.
If tracing is run to a file (e.g. using HL_TRACE_FILE) a binary format is used and it is a bit more efficient. See utils/HalideTraceViz for a tool to work with the binary format. This is generally how analyses are done within the team.
There was a small amount of investigation of OProfile, which looked promising but I don't think we ever got code working.
Is it possible to schedule a given task to run exactly n machine instructions before control is returned to the user?
The motivation for this question is the debugging of multithreaded programs, where this could be helpful to reliably reproduce certain bugs or undefined behaviour.
I'm particularly interested in the case of x86_64-linux running on an Intel CPU, but solutions for other architectures or operating systems would also be interesting.
The documentation for the kernel perf suite says
Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hw events: such
as instructions executed, cachemisses suffered, or branches mis-predicted -
without slowing down the kernel or applications. These registers can also
trigger interrupts when a threshold number of events have passed.
so it seems like the hardware could support this in principle, but I'm not sure if this is exposed in any way to the user.
Of course it's possible to just use ptrace to single-step the program n times, but that would make all but the most simple programs impossibly slow.
One simple option to ensure an exact count of the instructions executed is to instrument the assembly code and maintain an execution counter. I believe the easiest way to do instrumentation is Pin ( https://software.intel.com/en-us/articles/pintool ).
High level idea:
- interpret the machine code and maintain a counter of the number of instructions executed,
- after each instruction, increment the counter and check whether it is time for a breakpoint,
- reset the counter after each breakpoint.
The interpretation idea would introduce quite a bit of overhead. I see a number of straightforward optimizations:
Instrument the binary statically (create a new binary where all these increments/checks are hard coded). Such an approach would eliminate the instrumentation/interpretation overheads. You can either count the instructions related to monitoring/breakpoints as extra instructions executed or choose to exclude them from the count.
The increments/checks can be implemented more smartly. For a run of instructions with no jumps/branches, you can do one increment and one check for the whole run. This idea is simple but might prove pretty tricky in practice, especially if you need an absolutely accurate breakpoint.
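The block-level idea, including the tricky exact stop, can be sketched on a toy model. This is not Pin and not real instrumentation: the program is just a hypothetical list of straight-line blocks, each with a length and a single successor, and Block, Position and run_exactly are names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a program: each basic block is a run of `length`
// straight-line instructions followed by a jump to block `next`.
struct Block {
    std::uint64_t length;
    std::size_t next;
};

struct Position {
    std::size_t block;     // block holding the next instruction to run
    std::uint64_t offset;  // index of that instruction inside the block
};

// Advances exactly `budget` instructions from `pos`. A block that fits in
// the budget costs one subtraction and one comparison; only the last,
// partial block needs instruction-level precision.
Position run_exactly(const std::vector<Block>& program, Position pos,
                     std::uint64_t budget) {
    while (budget > 0) {
        assert(pos.block < program.size());
        const Block& block = program[pos.block];
        assert(pos.offset < block.length);  // also rules out empty blocks
        const std::uint64_t remaining = block.length - pos.offset;
        if (remaining <= budget) {
            budget -= remaining;  // run to the end of the block in one step
            pos = Position{block.next, 0};
        } else {
            pos.offset += budget;  // the breakpoint falls inside this block
            budget = 0;
        }
    }
    return pos;
}
```

The else branch is where the difficulty sits in a real tool: stopping in the middle of a block means falling back to single-stepping or to a second, per-instruction instrumented copy of the block for those last few instructions.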
I'd like a simple and fast way to collect the number of times each Instruction in LLVM bitcode was executed in a given run of the application. As far as I can tell, there are a number of approaches I can take:
Use PIN. This would require using DWARF debug info and Instruction debug info to attempt to map instructions in the binary to instructions in the bitcode; not 100% sure how accurate this will be.
Use llvm-prof. Two questions here. First, I've seen on Stack Overflow an option to opt called --insert-edge-profiling. However, that option doesn't seem to be available in 3.6? Second, it appears that such profiling only records execution counts at the Function level, not at the individual Instruction level. Is that correct?
Write a new tool similar to AddressSanitizer. This may work, but seems like overkill.
Is there an easier way to achieve my goal that I'm missing?
As part of my PhD research, I have written a tool to collect a trace of the basic blocks executed by a program. This tool also records the number of LLVM instructions in each basic block, so an analysis of the trace would give the dynamic Instruction execution count.
Another research tool is Harmony. It will provide the dynamic execution counts of each basic block in the program, which you could extend with the static instruction counts.
Otherwise, I would suggest writing your own tool. For each basic block, (atomically) increment a global counter by the number of instructions in that block.
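What such a pass would emit can be sketched by instrumenting a function by hand. In a real tool the pass would insert the increment into the IR and compute each block's instruction count itself. Here the per-block numbers are made up, and enter_block, g_executed_instructions and sum_to are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>

// Global dynamic instruction counter, shared by all threads.
std::atomic<std::uint64_t> g_executed_instructions{0};

// What the instrumentation would insert at the top of every basic block.
// `instructions_in_block` is a constant known when the block is instrumented.
inline void enter_block(std::uint64_t instructions_in_block) {
    g_executed_instructions.fetch_add(instructions_in_block,
                                      std::memory_order_relaxed);
}

// Hand-instrumented example; the per-block counts are made-up values.
std::uint64_t sum_to(std::uint64_t n) {
    enter_block(2);  // entry block: initialise total and i
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        enter_block(4);  // loop body block
        total += i;
    }
    enter_block(1);  // exit block: return
    return total;
}
```

Relaxed ordering is enough because only the final total matters, not how the increments interleave with other memory operations. Calling sum_to(10) with these made-up numbers adds 2 + 10 * 4 + 1 = 43 to the counter. Keeping one counter per block instead of a single global one would give the per-instruction execution counts the question asks for, since every instruction in a block runs as often as the block does.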