How can I get all the variables (from bytecode file or IR file) with
const modifier or the variables that are not changed during execution?
I need to make a list for further use.
I'm not sure you can get what you want directly because const is a C/C++ semantic that's useful for Clang but much less so for LLVM. Only some const promises are preserved (for example the readonly attribute on pointer function arguments - see the language reference for details).
LLVM IR level "constants" are something entirely different, and usually refer to actually constant (compile-time known) values that can be efficiently uniqued, etc. Read this documentation for the full scoop.
LLVM allows call instructions and defines to specify a calling convention. Does the IR itself already need to adhere to the specified convention? For example when using ccc, I believe that the return value would need to fit in the 64-bit rax register on my OS/architecture. Am I allowed to write LLVM IR code that returns a struct of three i32s? Does LLVM convert that to something that adheres to the C calling convention? Could I change the calling convention without changing any other code?
When I look at the output of compiling a C file with -emit-llvm, the IR generator already applied the calling convention, and would allocate at the call site, and convert the return value as a pointer parameter. Is that absolutely necessary at this stage? What does LLVM do with the information of which calling convention to use at the next stage, -emit-obj?
There are many things mixed together here, unfortunately. The calling convention is usually defined in terms of the source language, and many of the necessary details are already lost by the time the code is converted to LLVM IR. So, in order to preserve the ABI and calling convention, the frontend is supposed to lay out arguments and return values so that they are code-generated correctly at the LLVM level.
To make a long story short: the calling convention contains both high-level (source language) and low-level requirements. The former are handled by the frontend and the latter by the backend. You can change the LLVM IR, but you need to make sure that the generated code is still compatible with your C code. On some platforms this can be complicated.
I was trying to add a function declaration for something provided by the system.
However, the function prototype returns size_t, which is a 32-bit integer on 32-bit platforms and a 64-bit integer on 64-bit platforms.
I'd like to know if there is a method to detect target platform and add the declaration accordingly?
After a bit of research, my conclusion is that LLVM IR, as a target-neutral language, cannot possibly know target-specific type sizes. Have a look at this relevant discussion where Chris Lattner comments on the subject. See also this relevant SO question.
So, this is the job of the front-end and this causes extra bookkeeping information that front-ends need to "know" for a target and its ABI. So, for example, you might have needs for projects like this in the case of the Loci programming language.
Now, specifically for size_t according to this:
[...] std::size_t can safely store the value of any non-member pointer, in
which case it is synonymous with std::uintptr_t.
So, you could use the getIntPtrType method of the DataLayout class.
For any other data types, I'm not sure how far "guessing" can get you (probably not very far judging from the previous references).
Lastly, another alternative could be extending LLVM with a custom intrinsic (see memcpy for example), which inevitably goes through specific definition per target.
To adapt your integer type creation, you could use the sizeof operator together with CHAR_BIT to compute the correct number of bits to pass to the getIntNTy call.
This gets you as far as using the right integer size for the platform on which your module pass is built.
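The sizeof and CHAR_BIT computation described above can be sketched as follows. This is only a minimal illustration: the names hostBitWidth and kHostSizeTBits are hypothetical, and the value describes the machine that compiles the pass, not necessarily the target of the IR.

```cpp
#include <climits>  // CHAR_BIT
#include <cstddef>  // std::size_t

// Number of bits occupied by T on the host, i.e. the platform on which
// this code (for example, your pass) is compiled.
// hostBitWidth is a hypothetical helper name, not part of LLVM.
template <typename T>
constexpr unsigned hostBitWidth() {
    return static_cast<unsigned>(sizeof(T) * CHAR_BIT);
}

// Hypothetical convenience constant: width of size_t on the host.
constexpr unsigned kHostSizeTBits = hostBitWidth<std::size_t>();
```

The resulting bit count is what you would hand to the integer type factory when building the declaration. If the pass is ever used for cross-compilation, this host-side value may be wrong for the target.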
For detecting a type's size 'dynamically' on the platform where your pass is being run, I know of no other way than providing that information in some sort of configuration file.
However, this can be automated: following the example of various build systems (e.g. cmake, which is also used by LLVM), you can write a small program that is compiled at configure time and generates that file for you.
To that end, and to make this as portable as possible and avoid reinventing the wheel, you can use cmake's CheckTypeSize module.
Using LLVM's C++ API, I'm calling constantInt->setName("name") but on constantInt->getName() it doesn't show up. I always get the empty string. Is ConstantInt not supposed to have a name?
You cannot assign names to constants (and neither can you assign names to void values). Unfortunately this is indeed poorly-documented, but you can see it in the source code of Value::setName. It also makes sense when you consider how constants look in the textual representation of the IR.
What you can do instead is create a global variable and mark it as constant - those can be named.
I want to find out the storage type of variables in a function block. How can I check whether the compiler has promoted an auto variable to register storage, or whether variables declared with register storage are honored by the compiler? I assume that looking at the assembly code of the obj file after optimization would give us an idea. Please list the switch that I need to use with gcc or cl.exe to get this information.
The -S switch in gcc is the one you are looking for.
See §3.2 Options Controlling the Kind of Output (GCC manual)
You can look at the generated assembly, but there's no way to programmatically determine this from within your program. Be aware that GCC generally ignores the register keyword, except to issue errors if you try to take the address of a register-storage variable, and when it is used in combination with GCC-specific extensions to force a variable into a particular register for use with inline asm. No idea what MSVC does.
I recently discovered the LLVM (low level virtual machine) project, and from what I have heard it can be used to perform static analysis on source code. I would like to know if it is possible to extract the different function calls made through function pointers (find the caller function and the callee function) in a program.
I could not find this kind of information on the website, so it would be really helpful if you could tell me whether such a library already exists in LLVM, or point me in the right direction on how to build it myself (existing source code, reference, tutorial, example...).
EDIT:
With my analysis I actually want to extract caller/callee function calls. In the case of a function pointer, I would like to return a set of possible callees. Both caller and callee must be defined in the source code (this does not include third-party functions in a library).
I think that the Clang static analyzer (the analyzer that is part of the LLVM project) is geared towards the detection of bugs, which means that the analyzer tries to compute possible values of some expressions (to reduce false positives) but it sometimes gives up (in this case, emitting no alarm to avoid a deluge of false positives).
If your program is C only, I recommend you take a look at the Value Analysis in Frama-C. It computes supersets of possible values for any l-value at each point of the program, under some hypotheses that are explained at length here. Complexity in the analyzed program only means that the returned supersets are more approximated, but they still contain all the possible run-time values (as long as you remain within the aforementioned hypotheses).
EDIT: if you are interested in possible values of function pointers for the purpose of slicing the analyzed program, you should definitely take a look at the existing dependencies and slicing computations in Frama-C. The website doesn't have any nice example for slicing, but here is one from a discussion on the mailing list.
You should take a look at Elsa. It is relatively easy to extend and lets you parse an AST fairly easily. It handles all of the parsing, lexing and AST generation and then lets you traverse the tree using the Visitor pattern.
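class CallGraphGenerator : public ASTVisitor
{
    //...
    // Called for every function definition encountered in the AST.
    virtual bool visitFunction(Function *func);
    // Called for every expression, including calls through pointers.
    virtual bool visitExpression(Expression *expr);
};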
You can then detect function declarations, and probably detect function pointer usage. Finally you could check the function pointers' declarations and generate a list of the declared functions that could have been called using that pointer.
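The last step, matching a function pointer's declared type against the declared functions, can be sketched independently of any parser. The code below is a toy model under stated assumptions: FunctionDecl, indexBySignature and possibleCallees are hypothetical names, signatures are plain strings, and nothing here is Elsa's API. The result is a conservative over-approximation, since it also lists functions whose address is never taken.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model of a declared function: a name plus a textual signature.
// (Hypothetical type for illustration; a real tool would use AST nodes.)
struct FunctionDecl {
    std::string name;
    std::string signature;  // e.g. "void(int)"
};

using SignatureIndex = std::map<std::string, std::set<std::string>>;

// Group the names of all declared functions by their signature.
SignatureIndex indexBySignature(const std::vector<FunctionDecl>& decls) {
    SignatureIndex index;
    for (const FunctionDecl& d : decls) {
        index[d.signature].insert(d.name);
    }
    return index;
}

// Candidate callees for an indirect call through a pointer whose
// pointee type has the given signature: every function that matches.
std::set<std::string> possibleCallees(const SignatureIndex& index,
                                      const std::string& pointerSignature) {
    auto it = index.find(pointerSignature);
    if (it == index.end()) {
        return {};  // no declared function has this signature
    }
    return it->second;
}
```

Restricting the index to functions whose address is actually taken somewhere in the program would shrink the candidate sets without losing soundness, at the cost of one more pass over the expressions.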
In our project, we perform static source code analysis by converting LLVM bytecode into C code with the help of the llc program that is shipped with LLVM. Then we analyze the C code with CIL (C Intermediate Language), but for the C language a lot of tools are available. The pitfall is that the code generated by llc is AWFUL and suffers from a great loss of precision. But still, it's one way to go.
Edit: in fact, I wouldn't recommend that anyone go this way. But still, just for the record...
I think your question is flawed. The title says "Static source code analysis". Yet your underlying reason appears to be the construction of (part of) a call graph including calls through a function pointer. The essence of function pointers is that you cannot, in general, know their values at compile time, i.e. at the point where you do static source code analysis. Consider this bit of code:
void (*pFoo)() = GetFoo();
pFoo();
Static code analysis cannot tell you what GetFoo() returns at runtime, although it might tell you that the result is subsequently used for a function call.
Now, what values could GetFoo() possibly return? You simply can't say this in general (equivalent to solving the halting problem). You will be able to guess some trivial cases. The guessable percentage will of course go up depending on how much effort you are willing to invest.
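To make the point concrete, here is a small sketch in which the callee genuinely depends on a run-time value. The functions foo and bar, the parameter runtimeInput and the helper callThroughPointer are hypothetical, and getFoo takes an argument here only so that the dependency is visible.

```cpp
#include <string>

std::string foo() { return "foo"; }
std::string bar() { return "bar"; }

using Callee = std::string (*)();

// The returned pointer depends on a value that is only known at run time
// (for example, user input), so no analysis can name the single callee.
Callee getFoo(int runtimeInput) {
    return (runtimeInput % 2 == 0) ? &foo : &bar;
}

// Indirect call: which function runs is decided by runtimeInput.
std::string callThroughPointer(int runtimeInput) {
    Callee pFoo = getFoo(runtimeInput);
    return pFoo();
}
```

This is one of the easy cases: an analysis that follows the two return paths of getFoo could still report the candidate set {foo, bar}, even though it cannot say which of the two is called in a given run.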
The DMS Software Reengineering Toolkit provides various types of control, data flow, and global points-to analyzers for large systems of C code, and constructs call graphs using that global points-to analysis (with the appropriate conservative assumptions). More discussion and examples of the analyses can be found at the web site.
DMS has been tested on monolithic systems of C code with 25 million lines. (The call graph for this monster had 250,000 functions in it).
Engineering all this machinery from basic C ASTs and symbol tables is a huge amount of work; been there, done that. You don't want to do this yourself if you have something else to do with your life, like implement other applications.