getting providers of a llvm instruction - llvm

When analyzing an LLVM IR file ( .bc or .ll ), it is possible, using the provided functions, to determine for every instruction the list of instructions that consume its result. What about the other way around – are there LLVM functions that, for every instruction ( or better – for every operand of an instruction ), determine the list of instructions that provide it with the necessary values?
Given the ability to know the users of an instruction's result, it is possible to compute such data. The question is whether it is already computed and, if yes, how to obtain it.
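For the record, both directions are available directly on llvm::Instruction: the consumers of a result come from the value's use list, and the "providers" are simply the operands, filtered to those that are themselves instructions. A sketch, assuming the in-tree C++ API (header names per recent LLVM releases; this is a fragment meant to live inside a pass, not a standalone program):

```cpp
#include "llvm/IR/Instruction.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Down the def-use chain: every instruction that consumes I's result.
// Up the use-def chain: for each operand, the instruction (if any) that
// defines it. Operands that are constants, globals, or function arguments
// are not produced by an instruction, so the dyn_cast filters them out.
void printProvidersAndUsers(Instruction &I) {
  for (User *U : I.users())
    errs() << "result consumed by: " << *U << "\n";
  for (Value *Op : I.operands())
    if (auto *Def = dyn_cast<Instruction>(Op))
      errs() << "operand provided by: " << *Def << "\n";
}
```

So nothing needs to be precomputed: the use-def information is maintained by LLVM as part of the IR itself.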

Related

Where is the integer addition and subtraction event count in Intel VTune?

I am using Intel VTune to profile my program.
The CPU I am using is Ivy Bridge.
All the hardware instruction events can be found here:
https://software.intel.com/en-us/node/589933
FP_COMP_OPS_EXE.X87
Number of FP Computational Uops Executed this cycle. The number of FADD, FSUB, FCOM, FMULs, integer MULs and IMULs, FDIVs, FPREMs, FSQRTs, integer DIVs, and IDIVs. This event does not distinguish an FADD used in the middle of a transcendental flow from a s…
FP_COMP_OPS_EXE.X87 seems to include integer multiplication and integer division; however, integer addition and integer subtraction are not there, and I cannot find those two kinds of instruction on the above website either.
Can anyone tell me what is the event that counts integer addition and integer subtraction instructions?
I'm reading a lot into your question, but here goes:
It's possible that if your code is computationally bound you could find ways to infer the significance of integer adds and subs without measuring them directly. For example, UOPS_RETIRED.ALL - FP_COMP_OPS_EXE.ALL would give you a very rough estimate of adds and subs, assuming that you've already done something to establish that your code is compute bound.
Have you? If not, it might help to start with VTune's basic analysis and then eliminate memory, cache and front end bottlenecks. If you've already done this, you have a few more options:
Cross-reference UOPS_DISPATCHED_PORT with an Ivy Bridge block diagram, or even better, a list of which specific types of arithmetic can execute on which ports (which I can't find).
Modify your program source, compiler flags or assembly, rerun a coarser-grained profile like basic analysis, and see whether you see an impact at the level of a measure like INST_RETIRED.ANY / CPU_CLK_UNHALTED.
Sorry there doesn't seem to be a more direct answer.

How can I get the number of instructions executed by a program?

I have written and cross-compiled a small C++ program, and I can run it on an ARM board or a PC. Since ARM and a PC have different instruction set architectures, I want to compare them. Is it possible to get the number of executed instructions in this C++ program for both ISAs?
What you need is a profiler. perf would be one easy to use. It will give you the number of instructions that executed, which is the best metric if you want to compare ISA efficiency.
Check the tutorial here.
You need to use: perf stat ./your_binary
Look for the instructions metric. This approach uses a register in your CPU's performance monitoring unit (PMU) that counts the number of executed instructions.
Are you trying to get the number of static instructions or dynamic instructions? So, for instance, if you have the following loop (pseudocode):
for (i = 0 to N):
    a[i] = b[i] + c[i]
Static instruction count will be just under 10 instructions, give or take based on your ISA, but the dynamic count would depend on N, on the branch prediction implementation and so on.
So for static count I would recommend using objdump, as per recommendations in the comments. You can find the entry and exit labels of your subroutine and count the number of instructions in between.
For dynamic instruction count, I would recommend one of two things:
You can simulate running that code using an instruction set simulator (there are open-source ISA simulators for both ARM and x86 out there; gem5, for instance, implements both of them, and there are others that support one or the other).
Your second option is to run this natively on the target system and set up performance counters in the CPU to report the dynamic instruction count. You would reset the counter before executing your code and read it afterwards (there might be some noise here associated with calling your subroutine and exiting, but you should be able to isolate that out).
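On Linux, that second option can be sketched with the perf_event_open(2) interface (a minimal sketch, Linux-only; the syscall has no glibc wrapper, and it may fail with EACCES or ENOENT under a restrictive perf_event_paranoid setting or inside some VMs/containers, in which case the function below just returns -1):

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstring>

// Count retired instructions executed by `work` on this thread,
// or return -1 if the counter could not be opened or read.
long long count_instructions(void (*work)()) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;          // start stopped; enable around the region
    attr.exclude_kernel = 1;    // user-space instructions only
    attr.exclude_hv = 1;

    int fd = (int)syscall(SYS_perf_event_open, &attr,
                          0 /*this thread*/, -1 /*any cpu*/,
                          -1 /*no group*/, 0);
    if (fd == -1) return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);    // reset before the region...
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);  // ...and read after it
    long long count = -1;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) count = -1;
    close(fd);
    return count;
}
```

This is essentially what perf stat does for you, minus the per-region control.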
Hope this helps :)
objdump -dw mybinary | wc -l
On Linux and friends, this gives a good approximation of the number of instructions in an executable, library or object file. This is a static count, which is of course completely different than runtime behavior.
Linux:
valgrind --tool=callgrind ./program 1 > /dev/null

Easiest way to collect dynamic Instruction execution counts?

I'd like a simple and fast way to collect the number of times each Instruction in LLVM bitcode was executed in a given run of the application. As far as I can tell, there are a number of approaches I can take:
Use PIN. This would require using DWARF debug info and Instruction debug info to attempt to map instructions in the binary to instructions in the bitcode; not 100% sure how accurate this will be.
Use llvm-prof. Two questions here. First, I've seen on Stack Overflow an option to opt called --insert-edge-profiling. However, that option doesn't seem to be available in 3.6? Second, it appears that such profiling only records execution counts at the Function level, not at the individual Instruction level. Is that correct?
Write a new tool similar to AddressSanitizer. This may work, but seems like overkill.
Is there an easier way to achieve my goal that I'm missing?
As part of my PhD research, I have written a tool to collect a trace of the basic blocks executed by a program. This tool also records the number of LLVM instructions in each basic block, so an analysis of the trace would give the dynamic Instruction execution count.
Another research tool is Harmony. It will provide the dynamic execution counts of each basic block in the program, which you could extend with the static instruction counts.
Otherwise, I would suggest writing your own tool. For each basic block, (atomically) increment a global counter by the number of instructions in that block.
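A do-it-yourself version of that last suggestion might look like this (a sketch, not real instrumentation: the block_cost calls stand in for the atomic increments an LLVM pass would insert at the top of each basic block, and the per-block instruction counts here are made up for illustration):

```cpp
#include <atomic>
#include <cstdint>

// Global dynamic-instruction counter shared by all "instrumented" blocks.
std::atomic<uint64_t> g_dyn_insts{0};

// The instrumentation pass would emit one of these per basic block, with
// `n` = the statically known number of LLVM instructions in that block.
inline void block_cost(uint64_t n) {
    g_dyn_insts.fetch_add(n, std::memory_order_relaxed);
}

// Toy "instrumented" function: entry block of 2 instructions, loop body
// block of 4 instructions executed n times.
uint64_t sum_to(uint64_t n) {
    block_cost(2);              // entry block
    uint64_t s = 0;
    for (uint64_t i = 0; i < n; ++i) {
        block_cost(4);          // loop body block, executed n times
        s += i;
    }
    return s;
}
```

Summing the counter at exit gives the dynamic LLVM-instruction count for the run.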

Number of machine instructions for an llvm PHINode on x86/amd64

I'm currently writing a pass in opt that happens to create extra control flow and, as a result, I also need to insert a lot of llvm::PHINode instructions. The end goal of my pass is to reduce code size and, as far as I can tell, the number of LLVM instructions after I run it is lower. However, in most cases I don't see a significant reduction in code size, or sometimes I even see an increase (even though the total number of LLVM instructions is smaller). I've been trying to find a reference on how PHINode instructions are lowered on x86/amd64, but without luck. The obvious solution would be to go through the source and find out myself, but I can't invest that much time in investigating this issue. Any help would be much appreciated.

Is there any faster way to parse than by walking each byte?

Is there any faster way to parse a text than by walking each byte of the text?
I wonder if there is any special CPU (x86/x64) instruction for string operations, used by string libraries, that could be used to optimize the parsing routine.
For example, an instruction for finding a token in a string that could be run by hardware instead of looping over each byte until the token is found.
*edited → note: I am asking more about algorithms than CPU architecture, so my real question is: is there any special algorithm or specific technique that could optimize string-manipulation routines given the current CPU architecture?
The x86 had a few string instructions, but they fell out of favor on modern processors, because they became slower than more primitive instructions which do the same thing.
The processor world is moving more and more towards RISC, i.e., simplistic instruction sets.
Quote from Wikipedia (emphasis mine):
The first highly (or tightly) pipelined x86 implementations, the 486 designs from Intel, AMD, Cyrix, and IBM, supported every instruction that their predecessors did, but achieved maximum efficiency only on a fairly simple x86 subset that resembled only a little more than a typical RISC instruction set (i.e. without typical RISC load-store limitations).
This is still true on today's x86 processors.
You could get marginally better performance processing four bytes at a time, assuming each "token" in the text was four-byte-aligned. Obviously this isn't true for most text... so better to stick with byte-by-byte scanning.
Yes there are special CPU instructions; and the run-time library, which implements functions like strchr, might be written in assembly.
One technique that can be faster than walking bytes is to walk double-words, i.e. to process data 32 bits at a time.
The problem with walking bigger-than-the-smallest-addressable-memory-unit chunks in the context of strings is one of alignment.
You add code at the beginning and end of your function (before and after your loop) to handle the uneven/unaligned byte[s]. (Shrug.) It makes your code faster, not simpler.
The following for example is some source code which claims to be an improved version of strchr. It is using special CPU instructions, but it's not simple (and has extra code for the unaligned bytes):
PATCH: Optimize strchr/strrchr with SSE4.2 -- "This patch adds SSE4.2 optimized strchr/strrchr. It can speed up strchr/strrchr by up to 2X on Intel Core i7"
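The word-at-a-time idea (without the SSE4.2 part) can be sketched portably. This is the classic "haszero" bit trick: scan byte-wise until the pointer is 8-byte aligned, then test 8 bytes per iteration, relying on the fact that an aligned 8-byte load never crosses a page boundary (so touching a few bytes past the terminator cannot fault; note that sanitizers may still object if the buffer itself ends right at the terminator):

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// Word-at-a-time strchr sketch: returns a pointer to the first occurrence
// of c in s (including the terminator if c == '\0'), or nullptr.
const char* wide_strchr(const char* s, char c) {
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    const uint64_t cccc  = ones * (unsigned char)c;  // c in every byte

    // Unaligned head: plain byte loop until s is 8-byte aligned.
    while ((uintptr_t)s % 8 != 0) {
        if (*s == c) return s;
        if (*s == '\0') return nullptr;
        ++s;
    }
    for (;;) {
        uint64_t v;
        std::memcpy(&v, s, 8);             // one aligned 8-byte load
        uint64_t m = v ^ cccc;             // bytes equal to c become 0
        // haszero(x) = (x - ones) & ~x & highs: nonzero iff some byte is 0.
        if (((v - ones) & ~v & highs) || ((m - ones) & ~m & highs)) {
            for (int i = 0; i < 8; ++i) {  // pinpoint the byte in the word
                if (s[i] == c) return s + i;
                if (s[i] == '\0') return nullptr;
            }
        }
        s += 8;
    }
}
```

The real glibc implementations add SSE/AVX on top of this same skeleton; the alignment prologue is exactly the "extra code for the unaligned bytes" mentioned above.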
While (some) processors do have string instructions, they're of little use in producing faster code. First of all, as @zildjohn01 noted, they're often slower than other instructions on current processors. More importantly, it rarely makes much difference anyway: if you're scanning much text at all, the bottleneck will usually be the bandwidth from memory to the CPU, so essentially nothing you do with changing instructions is likely to make a significant difference in any case.
That said, especially if the token you're looking for is long, a better algorithm may be useful. A Boyer-Moore search (or variant) can avoid looking at some of the text, which can give a substantial improvement.
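For illustration, here is a minimal Boyer–Moore–Horspool variant (the simplified form of Boyer–Moore that keeps only the bad-character shift table); on a mismatch it can skip up to needle-length bytes of the haystack without looking at them:

```cpp
#include <string>
#include <array>
#include <cstddef>

// Boyer-Moore-Horspool: index of the first occurrence of needle in
// haystack, or std::string::npos if absent.
size_t bmh_search(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) return 0;
    if (haystack.size() < needle.size()) return std::string::npos;

    // Bad-character table: how far we may shift when the last compared
    // haystack character is c. Characters absent from the needle allow a
    // full needle-length skip.
    std::array<size_t, 256> shift;
    shift.fill(needle.size());
    for (size_t i = 0; i + 1 < needle.size(); ++i)
        shift[(unsigned char)needle[i]] = needle.size() - 1 - i;

    size_t pos = 0;
    while (pos + needle.size() <= haystack.size()) {
        size_t i = needle.size() - 1;        // compare right to left
        while (haystack[pos + i] == needle[i]) {
            if (i == 0) return pos;          // full match
            --i;
        }
        pos += shift[(unsigned char)haystack[pos + needle.size() - 1]];
    }
    return std::string::npos;
}
```

The longer the needle (and the larger the alphabet), the bigger the average skip, which is exactly why the technique pays off for long tokens.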
Well, you have to look at everything to know everything about the text, at some level. Arguably, you could have some sort of structured text, which gives you further information about where to walk at each point in a sort of n-ary space partition. But then, the "parsing" has partially been done back when the file was created. Starting from 0 information, you will need to touch every byte to know everything.
Yes. As your edit specifically asks for algorithms, I'd like to add an example of that.
You're going to know the language that you are parsing, and you can use that knowledge when building a parser. Take for instance a language in which every token must be at least two characters, but any length of whitespace can occur between tokens. Now when you're scanning through whitespace for the next token, you can skip every second character. Only when you hit the first non-whitespace do you need to back up one character.
Example:
01234567890123456789
FOO        BAR
When scanning for the next token, you'd probe 4,6,8,10 and 12. Only when you see the A do you look back to 11 to find B. You never looked at 3,5,7 and 9.
This is a special case of the Boyer–Moore string search algorithm.
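Concretely, the probing scheme above might look like this (a sketch under the stated assumptions: tokens are at least two characters long, and whitespace is a run of plain spaces):

```cpp
#include <string>
#include <cstddef>

// Find the start of the next token at or after `from`. Because a token is
// at least two characters, the scan can probe every second character while
// inside a run of spaces: it cannot jump over a whole token.
size_t next_token_start(const std::string& s, size_t from) {
    size_t i = from;
    while (i < s.size() && s[i] == ' ')
        i += 2;                          // skip ahead two at a time
    if (i >= s.size())
        return s.size();                 // ran off the end: no more tokens
    // We may have landed on the token's second character; back up once if
    // the character just before is also part of the token.
    if (i > from && s[i - 1] != ' ')
        --i;
    return i;
}
```

With the FOO/BAR example, scanning from position 3 or 4 probes only every other space and still lands on B at position 11.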
Even though this question is long gone, I have another answer for you.
The fastest way to parse is to not parse at all. Huh?!
By that I mean, statistically, most source code text, and most code and data generated from that source code, do not change from compile to compile.
Therefore you can use a technique called incremental compilation. The first time you compile the source, you divide it into coarse-grained chunks, for example at the boundaries of global declarations and function bodies. You have to persist the chunks' source, or their signatures or checksums, plus information about the boundaries, what they compiled to, etc.
The next time you have to recompile the same source, given the same environment, you can gallop over the source code looking for changes. One easy way is to compare (longword at a time) the current code to the persisted snapshot of the code you saved from last time. As long as the longwords match, you skip through the source; instead of parsing and compiling, you reuse the snapshotted compilation results for that section from last time.
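The longword-at-a-time comparison described there is just this (a sketch; memcpy keeps it portable and lets the compiler emit plain 8-byte loads):

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// Gallop over two buffers a longword at a time; return the offset of the
// first differing byte, or `len` if the buffers are identical. Unchanged
// prefixes are skipped 8 bytes per comparison instead of 1.
size_t first_difference(const char* a, const char* b, size_t len) {
    size_t i = 0;
    while (i + 8 <= len) {                 // compare 8 bytes per step
        uint64_t wa, wb;
        std::memcpy(&wa, a + i, 8);
        std::memcpy(&wb, b + i, 8);
        if (wa != wb) break;               // mismatch somewhere in this word
        i += 8;
    }
    while (i < len && a[i] == b[i]) ++i;   // narrow down to the exact byte
    return i;
}
```

The incremental compiler uses the returned offset to decide which chunk's snapshotted compilation results can be reused and where reparsing must resume.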
As seen in "C#'88", QuickC 2.0, and VC++ 4.0 Incremental Recompilation.
Happy hacking!
Is there any faster way to parse a text than by walking each byte of the text?
Yes, sometimes you can skip over data. The Boyer-Moore string search algorithm is based on this idea.
Of course, if the result of the parse operation somehow needs to contain all the information in the text, then there is no way around the fact that you have to read everything. If you want to avoid CPU load, you could build hardware which processes data with direct memory access I guess.