Move LLVM IR out of SSA form - llvm

I am attempting to write a pass that eliminates phi nodes in LLVM IR. Although I have converted the IR to CSSA form, I am not sure how to actually remove the phi nodes.
From a broader perspective, I want to do register allocation (RA) directly from LLVM IR. The only option I can think of is to run the SSA deconstruction pass as an analysis pass and then, while doing code generation, use its results to insert the necessary movs.
Is that the only option, or is there another way to move out of SSA form while staying at the IR level?
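For what it's worth, a minimal sketch of the one phi-removal strategy that works inside LLVM IR itself (where registers must stay in SSA form) is to demote each phi to a stack slot: every predecessor stores its incoming value and the phi becomes a load. This is roughly what the -reg2mem pass produces; the function names below are made up for illustration:

```llvm
; before: a value selected by a phi
define i32 @select(i1 %c, i32 %a, i32 %b) {
entry:
  br i1 %c, label %then, label %else
then:
  br label %merge
else:
  br label %merge
merge:
  %r = phi i32 [ %a, %then ], [ %b, %else ]
  ret i32 %r
}

; after: the phi is replaced by a store in each predecessor and a load
define i32 @select.demoted(i1 %c, i32 %a, i32 %b) {
entry:
  %r.slot = alloca i32
  br i1 %c, label %then, label %else
then:
  store i32 %a, ptr %r.slot
  br label %merge
else:
  store i32 %b, ptr %r.slot
  br label %merge
merge:
  %r = load i32, ptr %r.slot
  ret i32 %r
}
```

The mem2reg pass is the exact inverse of this transformation, and for RA purposes the stores in the predecessors sit exactly where the parallel copies (movs) of classical SSA deconstruction belong.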

Related

__builtin_sinf: What is it? Where is it? How do I obtain the disassembly of it?

I was curious this morning to see whether I could obtain and understand the machine instructions generated for calculating a mathematical function such as sin, for which, as far as I am aware, there is no dedicated machine instruction on x86-64.
I first went to Godbolt and tried to obtain the disassembly for this:
    // Type your code here, or load an example.
    #include <cmath>
    int mysin(float num) {
        float result = sin(num);
        return result;
    }
I found that the output sets up some registers for a function call and then does call sin.
I then went and searched my local machine for the cmath headers. I found one in /usr/include/c++/10/cmath.h
This header calls another function: __builtin_sin.
My guess would be that the gcc compiler sees this identifier and substitutes for it some set of machine instructions that is somehow "baked into" the compiler itself. I am not sure whether that is correct, however.
I searched my system for the string __builtin_sinf, and it doesn't look like there is any text file on the system containing source code (C, asm or otherwise) that a compiler might then use.
Can anyone offer an explanation of what __builtin_sinf is, where it is located if it exists in a file on my system, and finally how to obtain the disassembly of this function?
If you do not have the source code for the compiler on your system, then you probably do not have the source code for __builtin_sin because what to do for __builtin_sin is built into the compiler.
__builtin_sin is not an instruction to the compiler to replace it with a call to a sin routine. It tells the compiler that the operation of the standard C library sin function is desired, with all the semantics of the sin function defined by the C standard. This means the compiler does not have to replace a use of __builtin_sin with a call to sin or a call to any other function. It may replace it with whatever code is appropriate. For example, given some __builtin_sin(x), if the compiler can determine the value of x during compilation, it can calculate sin(x) itself and replace __builtin_sin(x) with that value. Alternatively, the compiler can compile __builtin_sin(x) into assembly language that computes sin. Or it can compile __builtin_sin(x) into a call to the sin routine. It could also compile __builtin_sin(x) to a single machine instruction, such as fsin. (However, fsin is a primitive instruction with some limitations, particularly on argument domain, and is generally not suitable to serve as a sin function by itself.)
So there is no source file that contains the implementation of __builtin_sin other than the compiler’s source code. And, while that source code might contain information about an inline implementation of sin that the compiler uses, that information might not be in the form of assembly language for your target processor; it might be in the form of an internal compiler language specifying what operations to perform to calculate sin. (This is called an intermediate representation.)

Calling conventions in LLVM IR

LLVM allows call instructions and function definitions to specify a calling convention. Does the IR itself already need to adhere to the specified convention? For example, when using ccc, I believe the return value would need to fit in the 64-bit rax register on my OS/architecture. Am I allowed to write LLVM IR that returns a struct of 3 i32's? Does LLVM convert that to something that adheres to the C calling convention? Could I change the calling convention without changing any other code?
When I look at the output of compiling a C file with -emit-llvm, the IR generator has already applied the calling convention: it allocates the return value at the call site and passes it as a pointer parameter. Is that absolutely necessary at this stage? What does LLVM do with the information about which calling convention to use at the next stage, -emit-obj?
There are many things mixed here, unfortunately. The calling convention is usually defined in terms of the source language, and many necessary details are already lost when converting to LLVM IR. So, in order to preserve the ABI and calling convention, the frontend is supposed to lay out arguments and return values properly, so that they will be correctly codegen'ed at the LLVM level.
To make a long story short: the calling convention contains both high-level (source language) and low-level requirements. The former are handled by the frontend and the latter by the backend. You can change the LLVM IR, but you need to make sure that the generated code will indeed be compatible with your C code, and on some platforms this can be complicated.
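A sketch of the kind of frontend lowering involved (for x86-64 SysV; the exact rules vary by target, and the names here are illustrative): a struct too large to come back in registers is never returned directly at the IR level by clang; the frontend has already rewritten it as a hidden pointer parameter carrying the sret attribute.

```llvm
; C source:  struct Big { int v[8]; };
;            struct Big make(void);
%struct.Big = type { [8 x i32] }

; what clang emits: the caller allocates the result object and passes
; its address; the "return value" travels through the sret pointer
declare void @make(ptr sret(%struct.Big))

; hand-written IR may instead return a first-class aggregate...
declare { i32, i32, i32 } @three()
; ...which LLVM will codegen in some way, but not necessarily one that
; matches what a C compiler expects for the same struct on every target
```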

LLVM: Constant variables

How can I get all the variables (from a bytecode file or an IR file) that have the const modifier, or that are not changed during execution?
I need to make a list of them for further use.
I'm not sure you can get what you want directly, because const is a C/C++ semantic that's meaningful to Clang but much less so to LLVM. Only some const promises are preserved (for example, the readonly attribute on pointer function arguments - see the language reference for details).
LLVM IR level "constants" are something entirely different, and usually refer to actually constant (compile-time known) values that can be efficiently uniqued, etc. Read this documentation for the full scoop.
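To sketch the distinction (the global names here are made up): a C const at file scope does survive, as a global marked constant, and you can collect those by iterating the module's globals and checking GlobalVariable::isConstant(); a local const qualifier generally leaves no trace in the IR.

```llvm
; C source:  const int answer = 42;   int counter = 0;
@answer  = constant i32 42   ; 'const' survived: an IR constant global
@counter = global i32 0      ; ordinary mutable global
```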

How do I get LLVM types from Clang?

I'm looking to write a tool using Clang. The details are fairly immaterial, but what I'm looking to do is get LLVM entities from Clang. For example, I'd like to go from "printf" to an llvm::Function*, and from "size_t" to an llvm::Type*. But I can't find any functions in Clang that hand out these objects. I've decided that I can ask Clang to mangle the names and then ask the llvm::Module* for the data, but I can't find out how to get an llvm::Module* that corresponds to a Clang invocation.
How can I get at the internal LLVM data from Clang?
Ultimately, the code generation APIs are not part of Clang's public API. This is for good reason, because they are awful.
However, you can create a clang::CodeGen::CodeGenModule, which you can use to codegen a given TU into a provided llvm::Module. Then, you can get the mangled name of a symbol by using the getMangledName function on the CodeGenModule.
However, do not attempt to use the provided functions for converting from a clang::QualType to an llvm::Type* - they are unusably broken. The only viable strategy I have found for reliably performing this conversion is to find a function whose signature involves the type (for example, a member function) and then query the Module for the type of the corresponding parameter (for example, this). But this is pretty ABI-specific and a nasty hack. You can also compute the LLVM type name that Clang generates for a given type and search for it in the module, but this is not always successful.
In general, you cannot. Clang's internals are layered as well, so you cannot simply grab a piece of AST and say 'Hey, give me a function'. The AST is converted to LLVM IR at the IR generation step, one module at a time. This way we can be sure everything is parsed and is semantically correct and complete.
So, if you really need this sort of thing, then you need to hook into Clang really late, after IR generation, and try to operate on the loosely coupled AST and LLVM IR at that point.

Static source code analysis with LLVM

I recently discovered the LLVM (Low Level Virtual Machine) project, and from what I have heard it can be used to perform static analysis on source code. I would like to know whether it is possible to extract the function calls made through function pointers (finding both the caller function and the callee function) in a program.
I could not find this kind of information on the website, so it would be really helpful if you could tell me whether such a library already exists in LLVM, or point me in the right direction for building it myself (existing source code, references, tutorials, examples...).
EDIT:
With my analysis I actually want to extract caller/callee pairs. In the case of a function pointer, I would like to return the set of possible callees. Both caller and callee must be defined in the source code (this excludes third-party functions in libraries).
I think that the Clang analyzer (part of the LLVM project) is geared towards the detection of bugs, which means that it tries to compute the possible values of some expressions (to reduce false positives) but sometimes gives up (in that case emitting no alarm, to avoid a deluge of false positives).
If your program is C only, I recommend you take a look at the Value Analysis in Frama-C. It computes supersets of the possible values of any l-value at each point of the program, under some hypotheses that are explained at length here. More complexity in the analyzed program only means that the returned supersets are more approximate, but they still contain all the possible run-time values (as long as you remain within the aforementioned hypotheses).
EDIT: if you are interested in the possible values of function pointers for the purpose of slicing the analyzed program, you should definitely take a look at the existing dependency and slicing computations in Frama-C. The website doesn't have any nice example of slicing; here is one from a discussion on the mailing list.
You should take a look at Elsa. It is relatively easy to extend, and it handles all of the lexing, parsing and AST generation, then lets you traverse the tree using the Visitor pattern.
    class CallGraphGenerator : public ASTVisitor
    {
        //...
        virtual bool visitFunction(Function *func);
        virtual bool visitExpression(Expression *expr);
    };
You can then detect function declarations and, probably, uses of function pointers. Finally, you could check the function pointers' declarations and generate a list of the declared functions that could have been called through each pointer.
In our project, we perform static source code analysis by converting LLVM bytecode into C code with the help of the llc program that ships with LLVM, and then analyzing the C code with CIL (C Intermediate Language); for the C language a lot of tools are available. The pitfall is that the code generated by llc is AWFUL and suffers a great loss of precision. But still, it's one way to go.
Edit: in fact, I wouldn't recommend that anyone do it like this. But still, just for the record...
I think your question is flawed. The title says "static source code analysis", yet your underlying goal appears to be the construction of (part of) a call graph, including calls through function pointers. The essence of function pointers is that you cannot know their values at compile time, i.e. at the point where you do static source code analysis. Consider this bit of code:
    void (*pFoo)() = GetFoo();
    pFoo();
Static code analysis cannot tell you what GetFoo() returns at runtime, although it might tell you that the result is subsequently used for a function call.
Now, what values could GetFoo() possibly return? You simply can't say this in general (equivalent to solving the halting problem). You will be able to guess some trivial cases. The guessable percentage will of course go up depending on how much effort you are willing to invest.
The DMS Software Reengineering Toolkit provides various types of control, data flow, and global points-to analyzers for large systems of C code, and constructs call graphs using that global points-to analysis (with the appropriate conservative assumptions). More discussion and examples of the analyses can be found at the web site.
DMS has been tested on monolithic systems of C code with 25 million lines. (The call graph for this monster had 250,000 functions in it).
Engineering all this machinery from basic C ASTs and symbol tables is a huge amount of work; been there, done that. You don't want to do this yourself if you have something else to do with your life, like implement other applications.