What is the LLVM intermediate representation? - llvm

I have tried the LLVM demo from the "Try out LLVM and Clang in your browser!" page.
What kind of IR is this? HIR, MIR, or LIR? The SSA representation is usually used in MIR, I think. So, is it an MIR? But it can store the information for dependence analysis. Hence can it be an HIR?
What filename extension actually represents the LLVM IR, .ll or .bc?
How can I get the symbol table used in LLVM?

I'm not familiar with a single, strict definition of what differentiates one level of IR from another, but from what I know it would fall under the MIR category.
It has SSA form and doesn't have explicit registers, so I would definitely not categorize it as low-level.
It doesn't have named sub-element references or advanced flow-control constructs, so it wouldn't really fit into HIR, either.
In any case, LLVM IR is typically stored on disk either in text files with the .ll extension or in binary files with the .bc extension. Conversion between the two is trivial: use llvm-dis for .bc -> .ll and llvm-as for .ll -> .bc. The binary format is more memory-efficient, while the textual format is human-readable.
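For illustration, a minimal module in the textual (.ll) form might look like this; running `llvm-as hello.ll` would produce the equivalent binary `hello.bc`, and `llvm-dis` converts back:

```llvm
; hello.ll - a tiny module in textual LLVM IR
define i32 @add1(i32 %x) {
entry:
  %r = add i32 %x, 1    ; SSA: %r is assigned exactly once
  ret i32 %r
}
```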

Related

#pragma GCC optimize vs #pragma GCC target: what is the difference?

What is the difference between #pragma GCC optimize() and #pragma GCC target()? Which one should be chosen when, and what are the other options?
At a high level, optimize() is used to control whether certain optimisation techniques are used when compiling code: the big-picture idea is that the compiler can spend more time to produce either faster-executing code or - occasionally - more compact code that needs less memory to run. Optimised code can sometimes be harder to profile or debug, so you may want most of your program un- or less-optimised, but some specific functions that are performance critical to be highly optimised. The pragma gives you that freedom to vary optimisation on a function-by-function basis.
Individual optimisations are listed and explained at https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options
A general overview of the optimize attribute's notation and purpose, from the GCC documentation:
optimize (level, …)
optimize (string, …)
The optimize attribute is used to specify that a function is to be compiled with different optimization options than specified on the command line. Valid arguments are constant non-negative integers and strings. Each numeric argument specifies an optimization level. Each string argument consists of one or more comma-separated substrings. Each substring that begins with the letter O refers to an optimization option such as -O0 or -Os. Other substrings are taken as suffixes to the -f prefix jointly forming the name of an optimization option. See Optimize Options.
‘#pragma GCC optimize’ can be used to set optimization options for more than one function. See Function Specific Option Pragmas, for details about the pragma.
Providing multiple strings as arguments separated by commas to specify multiple options is equivalent to separating the option suffixes with a comma (‘,’) within a single string. Spaces are not permitted within the strings.
Not every optimization option that starts with the -f prefix specified by the attribute necessarily has an effect on the function. The optimize attribute should be used for debugging purposes only. It is not suitable in production code.
target() indicates that the compiler may use machine code instructions specific to certain CPUs. Normally, if you know the people running your code might use different generations of a specific CPU family, and each generation is backwards compatible, you'd compile for the earliest of those generations so everyone can run your program. With target() you can compile different functions that make use of the more advanced features of later generations (extra machine instructions, or tuning for their cache sizes, pipelining features and so on), and your code can then selectively call the faster version on CPUs that support it.
Further details from here
target (string, …)
Multiple target back ends implement the target attribute to specify that a function is to be compiled with different target options than specified on the command line. One or more strings can be provided as arguments. Each string consists of one or more comma-separated suffixes to the -m prefix jointly forming the name of a machine-dependent option. See Machine-Dependent Options.
The target attribute can be used for instance to have a function compiled with a different ISA (instruction set architecture) than the default. ‘#pragma GCC target’ can be used to specify target-specific options for more than one function. See Function Specific Option Pragmas, for details about the pragma.
For instance, on an x86, you could declare one function with the target("sse4.1,arch=core2") attribute and another with target("sse4a,arch=amdfam10"). This is equivalent to compiling the first function with -msse4.1 and -march=core2 options, and the second function with -msse4a and -march=amdfam10 options. It is up to you to make sure that a function is only invoked on a machine that supports the particular ISA it is compiled for (for example by using cpuid on x86 to determine what feature bits and architecture family are used).
int core2_func (void) __attribute__ ((__target__ ("arch=core2")));
int sse3_func (void) __attribute__ ((__target__ ("sse3")));
Providing multiple strings as arguments separated by commas to specify multiple options is equivalent to separating the option suffixes with a comma (‘,’) within a single string. Spaces are not permitted within the strings.
The options supported are specific to each target; refer to x86 Function Attributes, PowerPC Function Attributes, ARM Function Attributes, AArch64 Function Attributes, Nios II Function Attributes, and S/390 Function Attributes for details.

LLVM | check if a value is signed [duplicate]

The LLVM project does not distinguish between signed and unsigned integers as described here. There are situations where you need to know if a particular variable should be interpreted as signed or as unsigned though, for instance when it is size extended or when it is used in a division. My solution to this is to keep a separate type information for every variable that describes whether it is an integer or a cardinal type.
However, I am wondering, isn't there a way to "attribute" a type in LLVM that way? I was looking for some sort of "user data" that could be added to a type but there seems to be nothing. This would have to happen somehow when the type is created since equal types are generated only once in LLVM.
My question therefore is:
Is there a way to track whether an integer variable should be interpreted as signed or unsigned within the LLVM infrastructure, or is the only way indeed to keep separate information like I do?
Thanks
First of all, be sure that you actually need to insert extra type metadata: Clang already handles signed integer operations appropriately, for example by emitting sdiv and srem rather than udiv and urem. You could also exploit that to implement some lightweight type inference based on how variables are used in the IR. Note that an operation like add needs no signedness information, since two's-complement addition behaves identically for signed and unsigned operands.
Otherwise, I think that the best way to do that is to modify the front-end (Clang) to add some custom DWARF debug info. Here is a link that might get you started.
UPDATE:
If your goal is to implement static analysis directly on LLVM IR, this paper offers a thorough discussion:
Navas, J.A., Schachte, P., Søndergaard, H., Stuckey, P.J.:
Signedness-agnostic program analysis: Precise integer bounds for low-level code. In: Jhala, R., Igarashi, A. (eds.) APLAS 2012. LNCS,
vol. 7705, pp. 115–130. Springer, Heidelberg (2012)

GCC Assembly "+t"

I'm currently testing some inline assembly in C++ on an old compiler (GCC circa 2004) and I wanted to perform the square root function on a floating point number. After trying and searching for a successful method, I came across the following code
float r3(float n) {
    __asm__("fsqrt" : "+t" (n));
    return n;
}
which worked. The issue is that, even though I understand the assembly instruction used, I'm unable to find any documentation on what the "+t" constraint on the n variable means. My guess is that it treats the variable n as both the input and the output, but I was unable to find any information confirming that. So, what exactly is the "t" flag and how does it work here?
+
Means that this operand is both read and written by the instruction.
(From here)
t
Top of 80387 floating-point stack (%st(0)).
(From here)
+ means you are reading and writing the register.
t means the value is on the top of the 80387 floating point stack.
References:
GCC manual, Extended Asm has general information about constraints - search for "constraints"
GCC manual, Machine Constraints has information about the specific constraints supported on each architecture - search for "x86 family"

How to get "phi" instruction in llvm without optimization

When I use the command clang -emit-llvm -S test.c -o test.ll, there are no "phi" instructions in the IR file. How can I get them?
I know that I can use the pass "-mem2reg" or "-gvn" to get "phi" instruction. But they would do some optimization. I just want to get "phi" without any optimization.
I'm not sure what you mean by "do some optimization" but it seems to me that mem2reg is exactly what you need. Here is how it's described in the documentation:
This file promotes memory references to be register references. It
promotes alloca instructions which only have loads and stores as uses.
An alloca is transformed by using dominator frontiers to place phi
nodes, then traversing the function in depth-first order to rewrite
loads and stores as appropriate. This is just the standard SSA
construction algorithm to construct “pruned” SSA form.
Clang itself does not produce optimized LLVM IR. It produces fairly straightforward IR wherein locals are kept in memory (using allocas). The optimizations are done by opt on LLVM IR level, and one of the most important optimizations is indeed mem2reg which makes sure that locals are represented in LLVM's SSA values instead of memory.
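To make the effect concrete, here is a hand-written before-and-after sketch for one local variable (illustrative, not actual clang/opt output):

```llvm
; before mem2reg: the local lives in an alloca, no phi needed
define i32 @f(i1 %cond) {
entry:
  %x = alloca i32
  br i1 %cond, label %then, label %else
then:
  store i32 1, ptr %x
  br label %end
else:
  store i32 2, ptr %x
  br label %end
end:
  %v = load i32, ptr %x
  ret i32 %v
}

; after mem2reg: the alloca is gone and a phi merges the two values
define i32 @f.after(i1 %cond) {
entry:
  br i1 %cond, label %then, label %else
then:
  br label %end
else:
  br label %end
end:
  %v = phi i32 [ 1, %then ], [ 2, %else ]
  ret i32 %v
}
```

So the phi only appears once locals are promoted out of memory, which is why there is no way to get it from clang's raw output without running at least mem2reg.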

To what extent does SSA form allow types with non-trivial copying?

I've been looking into IR code which is specified using SSA- especially, generating LLVM IR in this form. However, I'm confused about whether or not this can be effective when presented with a type which has non-trivial copy semantics. For example,
void f() {
    std::string s = "Too long for you, short string optimization!";
    std::string s1 = s + " Also, goodbye SSA.";
    some_other_function(s1);
}
In this SSA form, at least at the most obvious level, this results in a nasty mess of copies (even for C++). Can optimizers such as LLVM's actually optimize this case accurately? Is SSA viable for use even for types with non-trivial copy/assignment/etc semantics?
Edit: The question is: if I use an LLVM SSA register to represent a complex type (in this case, std::string, here represented by manually putting it in SSA form), can LLVM automatically translate this into a mutating += call in the underlying assembly in the general case and avoid a nasty copy?
SSA means static single assignment. It's a way of applying value semantics to registers: each value is the result of exactly one assignment.
LLVM also provides first-class struct and array types, plus memory operations such as load, store, and the llvm.memcpy intrinsic, which is useful because architectures offer many instructions that move 8, 32, N bytes, and because such aggregates can represent wacky high-level machine constructs. The intent is not to model OOP: types with non-trivial copy semantics are kept in memory, not in SSA values.
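To sketch what that means for the example above (hand-simplified IR with invented helper names, not real clang output or real mangled symbols): the strings stay in allocas, the library calls mutate them through pointers, and only the pointers and scalars are SSA values:

```llvm
; invented names; real clang output uses mangled libstdc++/libc++ symbols
%string = type { ptr, i64, [16 x i8] }

@.str  = private constant [45 x i8] zeroinitializer
@.str2 = private constant [20 x i8] zeroinitializer

declare void @string_ctor(ptr, ptr)
declare void @string_concat(ptr, ptr, ptr)
declare void @string_dtor(ptr)
declare void @some_other_function(ptr)

define void @f() {
entry:
  %s  = alloca %string              ; the objects themselves live in memory
  %s1 = alloca %string
  call void @string_ctor(ptr %s, ptr @.str)
  call void @string_concat(ptr %s1, ptr %s, ptr @.str2)   ; s1 = s + "..."
  call void @some_other_function(ptr %s1)
  call void @string_dtor(ptr %s1)   ; destructors become explicit calls too
  call void @string_dtor(ptr %s)
  ret void
}
```

Whether the temporary copies then disappear is up to the front end (e.g. calling += directly, move semantics) and to interprocedural optimizations, not to SSA construction itself.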