I'm planning to create a data structure optimized to hold assembly code, so that I can be fully responsible for the optimization algorithms that work on this structure. If I can compile while running, it will be a kind of dynamic execution. Is this possible? Has anyone seen something like this?
Should I use structs to link the structure into a program flow, or are objects better?
struct asm_code {
    int type;
    int value;
    int optimized;
    struct asm_code *next_to_execute;
} asm_imp;
Update: I think it will turn out like a linked list.
Update: I know there are other compilers out there. But this is a top-secret military project, so we can't trust any outside code. We have to do it all ourselves.
Update: OK, I think I will just generate basic i386 machine code. But how do I jump into my memory blob when it is finished?
It is possible. Dynamic code generation is even mainstream in some areas, such as software rendering and graphics. You find it used a lot in all kinds of scripting languages, and in dynamic compilation of byte-code into machine code (.NET, Java, and as far as I know Perl; recently JavaScript joined the club as well).
You also find it used in very math-heavy applications. It makes a difference if you, for example, remove all multiplications by zero from a matrix multiplication that you plan to execute several thousand times.
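To make the matrix example concrete, here is a small sketch (plain C++, no actual code generation yet) of the underlying idea: inspect the data once, keep only the non-zero terms, and reuse that specialized form many times. Real dynamic code generation goes one step further and emits machine instructions for exactly these terms. All names here are mine, for illustration:

#include <cstddef>
#include <utility>
#include <vector>

// Precompute the non-zero entries of a fixed matrix row once.
std::vector<std::pair<std::size_t, double>> specialize_row(const double* row,
                                                           std::size_t n) {
    std::vector<std::pair<std::size_t, double>> nonzero;
    for (std::size_t i = 0; i < n; ++i)
        if (row[i] != 0.0)                 // skip multiplications by zero
            nonzero.emplace_back(i, row[i]);
    return nonzero;
}

// Reuse the specialized form: only the surviving terms are evaluated.
double apply_row(const std::vector<std::pair<std::size_t, double>>& row,
                 const double* vec) {
    double sum = 0.0;
    for (const auto& [i, coeff] : row)
        sum += coeff * vec[i];
    return sum;
}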
I strongly suggest that you read up on the SSA representation of code. That's a representation where each primitive is turned into the so-called three-operand form, and each variable is assigned only once (hence the name Static Single Assignment form).
You can run high-order optimizations on such code, and it's straightforward to turn that code into executable code. You won't write that code-generation backend on a weekend though...
To get a feeling for what SSA looks like, you can try out the LLVM compiler. On their web-site they have a little "Try Out" widget to play with: you paste C code into a window and get out something close to SSA form.
Here is a little example of how it looks. Let's take this integer square root algorithm in C (an arbitrary example, I just took something simple yet non-trivial):
unsigned int isqrt32 (unsigned int value)
{
    unsigned int g = 0;
    unsigned int bshift = 15;
    unsigned int b = 1<<bshift;
    do {
        unsigned int temp = (g+g+b)<<bshift;
        if (value >= temp) {
            g += b;
            value -= temp;
        }
        b>>=1;
    } while (bshift--);
    return g;
}
LLVM turns it into:
define i32 @isqrt32(i32 %value) nounwind {
entry:
  br label %bb

bb:                                       ; preds = %bb, %entry
  %indvar = phi i32 [ 0, %entry ], [ %indvar.next, %bb ]
  %b.0 = phi i32 [ 32768, %entry ], [ %tmp23, %bb ]
  %g.1 = phi i32 [ 0, %entry ], [ %g.0, %bb ]
  %value_addr.1 = phi i32 [ %value, %entry ], [ %value_addr.0, %bb ]
  %bshift.0 = sub i32 15, %indvar
  %tmp5 = shl i32 %g.1, 1
  %tmp7 = add i32 %tmp5, %b.0
  %tmp9 = shl i32 %tmp7, %bshift.0
  %tmp12 = icmp ult i32 %value_addr.1, %tmp9
  %tmp17 = select i1 %tmp12, i32 0, i32 %b.0
  %g.0 = add i32 %tmp17, %g.1
  %tmp20 = select i1 %tmp12, i32 0, i32 %tmp9
  %value_addr.0 = sub i32 %value_addr.1, %tmp20
  %tmp23 = lshr i32 %b.0, 1
  %indvar.next = add i32 %indvar, 1
  %exitcond = icmp eq i32 %indvar.next, 16
  br i1 %exitcond, label %bb30, label %bb

bb30:                                     ; preds = %bb
  ret i32 %g.0
}
I know it looks horrible at first; it's not even pure SSA form. The more you read about that representation, the more sense it will make. And you will also find out why this representation is so widely used these days.
Encapsulating all the info you need into a data structure is easy. In the end you have to decide whether you want to use enums or strings for opcode names, etc.
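For instance, a minimal enum-based record could be shaped like this (just a sketch; the opcode names are made up, not from any real instruction set):

#include <cstdint>
#include <vector>

// Illustrative opcode enum; a real backend would mirror its target ISA.
enum class Opcode : std::uint8_t { Add, Sub, Shl, Load, Store, Ret };

// One instruction record; operands are kept as plain register numbers here.
struct Instruction {
    Opcode op;
    int    dst;        // destination register / value number
    int    src1, src2; // source operands
};

// A basic block is then simply a sequence of such records.
using BasicBlock = std::vector<Instruction>;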
Btw - I know I didn't give you a data structure so much as a formal yet practical language, plus advice on where to look further.
It's a very nice and interesting research field.
Edit: And before I forget it: don't overlook the built-in features of .NET and Java. These languages allow you to compile byte-code or source code from within the program and execute the result.
Cheers,
Nils
Regarding your edit: how to execute a binary blob of code.
Jumping into your binary blob is OS- and platform-dependent. In a nutshell, you have to invalidate the instruction cache, you may have to write back the data cache, and you may have to enable execution rights on the memory region you've written your code into.
On Win32 it's relatively easy, as flushing the instruction cache seems to be sufficient if you place your code on the heap.
You can use this stub to get started:
#include <stdlib.h>
#include <windows.h>

typedef void (* voidfunc) (void);

void * generate_code (void)
{
    // reserve some space
    unsigned char * buffer = (unsigned char *) malloc (1024);

    // write a single RET-instruction
    buffer[0] = 0xc3;
    return buffer;
}

int main (int argc, char **args)
{
    // generate some code:
    void * code = generate_code();
    voidfunc func = (voidfunc) code;

    // flush instruction cache:
    FlushInstructionCache(GetCurrentProcess(), code, 1024);

    // execute the code (it does nothing atm)
    func();

    // free memory and exit.
    free (code);
    return 0;
}
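One caveat I should add: on newer Windows versions with DEP (Data Execution Prevention) enabled, memory from malloc is typically not executable. A minimal variant using VirtualAlloc, which lets you request execute permission explicitly (the function name generate_code_dep_safe is mine, for illustration):

#include <windows.h>

void * generate_code_dep_safe (void)
{
    // ask the OS for readable/writable/executable memory explicitly
    unsigned char * buffer = (unsigned char *) VirtualAlloc(
        NULL, 1024, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);

    // write a single RET-instruction
    buffer[0] = 0xc3;

    // note: release with VirtualFree(buffer, 0, MEM_RELEASE), not free()
    return buffer;
}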
I assume you want a data structure to hold some kind of instruction template, probably parsed from existing machine code, similar to:
add r1, r2, <int>
You will have an array of this structure, you will perform some optimization on this array (probably changing its size or building a new one), and you will generate the corresponding machine code.
If your target machine uses variable-width instructions (x86, for example), you can't determine your array size without actually finishing parsing the instructions you build the array from. Likewise, you can't determine exactly how much buffer you need before actually generating all the instructions from the optimized array. You can make a good estimate, though.
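For illustration, such a template array might be shaped like this (a sketch with made-up field names); keeping a worst-case encoded size per instruction is what makes the buffer estimate possible:

#include <cstddef>
#include <vector>

// Illustrative template for one parsed instruction, e.g. "add r1, r2, <int>".
struct InsnTemplate {
    int opcode;            // e.g. an enum value standing for "add"
    int dst, src;          // register operands
    int immediate;         // the <int> slot
    std::size_t max_bytes; // worst-case encoded size on the target
};

// Estimate the output buffer before encoding: sum the worst-case sizes.
std::size_t estimate_buffer(const std::vector<InsnTemplate>& insns) {
    std::size_t total = 0;
    for (const auto& t : insns)
        total += t.max_bytes;
    return total;
}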
Check out GNU Lightning. It may be useful to you.
In 99% of the cases, the difference in performance is negligible. The main advantage of classes is that object-oriented code tends to be easier to understand and maintain than procedural code.
I'm not sure which language you're coding in; note that in C# the major difference between classes and structs is that structs are value types while classes are reference types. In this case, you might want to start with structs, but still add behavior (constructors, methods) to them.
Leaving aside the technical value of optimizing your code yourself: in C++ code, choosing between a POD struct and a full object is mostly a question of encapsulation.
Inlining the code will let the compiler optimize (or not) the constructors/accessors used, so there will be no loss of performance.
First, define a constructor
If you're working with a C++ compiler, create at least one constructor:
struct asm_code {
    asm_code()
        : type(0), value(0), optimized(0) {}

    asm_code(int type_, int value_, int optimized_)
        : type(type_), value(value_), optimized(optimized_) {}

    int type;
    int value;
    int optimized;
};
At least, you won't have uninitialized structs in your code.
Is every combination of data possible?
Using the struct the way you do means that any type is possible, with any value and any optimized flag. For example, if I set type = 25, value = 1205 and optimized = -500, it is considered OK.
If you don't want the user to put random values inside your structure, add inline accessors:
struct asm_code {
    int getType() { return type; }
    void setType(int type_) { VERIFY_TYPE(type_); type = type_; }
    // Etc.

private:
    int type;
    int value;
    int optimized;
};
This will let you control what is set inside your structure and debug your code more easily (or even do runtime verification of your code).
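For completeness, VERIFY_TYPE is left undefined above; a possible debug-time definition (the TYPE_COUNT bound is an assumption of mine) could be:

#include <cassert>

enum { TYPE_COUNT = 16 }; // assumed number of valid instruction types

// hypothetical definition: abort in debug builds on an invalid type
#define VERIFY_TYPE(t) assert((t) >= 0 && (t) < TYPE_COUNT)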
After some reading, my conclusion is that Common Lisp is the best fit for this task.
With Lisp macros I have enormous power.
Problem
I am trying to create a vector type in LLVM (version 12) to exploit the SIMD feature associated with this type. However, the required size of the array is stored in an integer variable. The desired LLVM IR code could possibly look like,
;; Pseudo-code for the desired LLVM IR
%0 = load i64, i64* %a
%vec = alloca <%0 x double>, align 16
Generating such IR code seems to be impossible though.
It is possible to generate vector alloca with compile-time constants, e.g. a vector of size 4 could be generated as
%vec = alloca <4 x double>, align 16
using the LLVM C++ API as
llvm::Type* I = llvm::Type::getDoubleTy(TheContext);
auto arr_type = llvm::VectorType::get(I, 4, false);
llvm::AllocaInst* arr_alloc = Builder.CreateAlloca(arr_type, 0, "vec");
however, using a runtime value obtained from a variable seems to be a problem, since the llvm::VectorType::get interface only allows the size to be specified as an unsigned int. I.e. the available interface looks like
static VectorType* llvm::VectorType::get(Type*    ElementType,
                                         unsigned NumElements,
                                         bool     Scalable)
However, if I load the variable's value from %a, I can't create a vector type from it:
llvm::Value *SIZE = Builder.CreateLoad(IntType, Address_Of_Variable_A, "a");
auto arr_type = llvm::VectorType::get(I, SIZE, false); // fails to compile: SIZE is not an unsigned int
I also could not typecast the Value* pointer to an llvm::ConstantInt* pointer to get the integer value back from the Value*, as done in https://stackoverflow.com/a/5315581/2940917.
This is because SIZE is in this case a LoadInst*, as opposed to being created with ConstantInt::get as in the linked question.
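To make that distinction concrete, here is a hedged sketch (reusing SIZE, I and Builder from the snippets above) of how such a cast behaves:

// dyn_cast only succeeds when SIZE is a compile-time constant.
if (auto *CI = llvm::dyn_cast<llvm::ConstantInt>(SIZE)) {
    // Only reached when SIZE was built with ConstantInt::get.
    unsigned n = static_cast<unsigned>(CI->getZExtValue());
    auto *vec_type = llvm::VectorType::get(I, n, false);
    Builder.CreateAlloca(vec_type, nullptr, "vec");
} else {
    // SIZE is a LoadInst here: its value exists only at run time,
    // so no fixed-width VectorType can be constructed from it.
}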
Is there a way to achieve this? This seems like an essential operation for many cases; it would be surprising if there were no way to set a vector's size from a runtime constant.
Could someone point me to the right information source/idea?
In section 20 of Scott Meyers's Effective C++, he states:
some compilers refuse to put objects consisting of only a double into a register
When passing built-in types by value, compilers will happily place the data in registers and quickly send ints/doubles/floats/etc. along. However, not all compilers treat small objects with the same grace. I can easily understand why compilers would treat objects differently: passing an object by value can be a lot more work than copying raw data, once copy constructors and vtables get involved.
But still, this seems like an easy problem for modern compilers to solve: "this class is small, maybe I can treat it differently". Meyers's statement seemed to imply that compilers WOULD make this optimization for objects consisting of only an int (or char or short).
Can someone give further insight as to why this optimization sometimes doesn't happen?
I found this document online, "Calling conventions for different C++ compilers and operating systems" (updated on 2018-04-25), which has a table depicting "Methods for passing structure, class and union objects".
From the table you can see that if an object contains a long double, a copy of the entire object is transferred on the stack by all compilers shown there.
Also from the same resource (with emphasis added):
There are several different methods to transfer a parameter to a function if the parameter is a structure, class or union object. A copy of the object is always made, and this copy is transferred to the called function either in registers, on the stack, or by a pointer, as specified in table 6. The symbols in the table specify which method to use. S takes precedence over I and R. PI and PS take precedence over all other passing methods.
As table 6 tells, an object cannot be transferred in registers if it is too big or too complex. For example, an object that has a copy constructor cannot be transferred in registers because the copy constructor needs an address of the object. The copy constructor is called by the caller, not the callee.
Objects passed on the stack are aligned by the stack word size, even if higher alignment would be desired. Objects passed by pointers are not aligned by any of the compilers studied, even if alignment is explicitly requested. The 64bit Windows ABI requires that objects passed by pointers be aligned by 16.
An array is not treated as an object but as a pointer, and no copy of the array is made, except if the array is wrapped into a structure, class or union.
The 64 bit compilers for Linux differ from the ABI (version 0.97) in the following respects: Objects with inheritance, member functions, or constructors can be passed in registers. Objects with copy constructor, destructor or virtual are passed by pointers rather than on the stack.
The Intel compilers for Windows are compatible with Microsoft. Intel compilers for Linux are compatible with Gnu.
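As a small illustration of the copy-constructor rule quoted above (my own example, assuming the Itanium C++ ABI used on Linux):

// Plain is small and trivially copyable: it can travel in a register.
struct Plain {
    double d;
};

// Tracked has a user-defined copy constructor, so the caller must
// materialize a temporary and pass its address instead.
struct Tracked {
    double d;
    Tracked(double v) : d(v) {}
    Tracked(const Tracked& other) : d(other.d) {} // non-trivial copy
};

double use_plain(Plain p)     { return p.d; } // p arrives in a register (xmm0)
double use_tracked(Tracked t) { return t.d; } // t arrives via hidden pointer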
Here is an example showing that LLVM clang at optimization level -O3 treats a class with a single double data member just as if it were a double:
$ cat main.cpp
#include <stdio.h>

class MyDouble {
public:
    double d;
    MyDouble(double _d):d(_d){}
};

void foo(MyDouble d)
{
    printf("%lg\n",d.d);
}

int main(int argc, char **argv)
{
    if (argc>5)
    {
        double x=(double)argc;
        MyDouble d(x);
        foo(d);
    }
    return 0;
}
When I compile it and view the generated bitcode file, I see that foo behaves as if it operated on a plain double input parameter:
$ clang++ -O3 -c -emit-llvm main.cpp
$ llvm-dis main.bc
Here is the relevant part:
; Function Attrs: nounwind uwtable
define void @_Z3foo8MyDouble(double %d.coerce) #0 {
entry:
  %call = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([5 x i8]* @.str, i64 0, i64 0), double %d.coerce)
  ret void
}
See how foo declares its input parameter as double and moves it around for printing "as is". Now let's compile the exact same code with -O0:
$ clang++ -O0 -c -emit-llvm main.cpp
$ llvm-dis main.bc
When we look at the relevant part, we see that clang uses a getelementptr instruction to access its first (and only) data member d:
; Function Attrs: uwtable
define void @_Z3foo8MyDouble(double %d.coerce) #0 {
entry:
  %d = alloca %class.MyDouble, align 8
  %coerce.dive = getelementptr %class.MyDouble* %d, i32 0, i32 0
  store double %d.coerce, double* %coerce.dive, align 1
  %d1 = getelementptr inbounds %class.MyDouble* %d, i32 0, i32 0
  %0 = load double* %d1, align 8
  %call = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([5 x i8]* @.str, i32 0, i32 0), double %0)
  ret void
}
I want to store data about each piece of stack memory that is allocated and changed.
To do so, I think I need a data structure of some sort in which I can store the related data, such as the value of a local variable and every change to it, e.g. before a store instruction.
Moreover, I should be able to check a value against the one saved in the data structure before any read of the local variable (e.g. a load instruction).
If we assume that the IR looks like this
%1 = alloca i32
[...]
%20 = store i32 0, %1
I need to change the above code to look like the below
%18 = call __Checking
%19 = <storing the return value from __Checking into a data structure>
%20 = store i32 0, %1
My problem is that I cannot figure out how to store the value returned from the __Checking call into a StructType I defined at the beginning.
My relevant code
if (StoreInst *SI = dyn_cast<StoreInst>(&I)) {
    Value* newValue = IRB.CreateCall(....);
    Type* strTy[] = { Type::getInt64Ty(m.getContext()),
                      Type::getInt64Ty(m.getContext()) };
    Type* t = StructType::create(strTy);
}
You didn't provide any specific information on the type returned by __Checking (or even where __Checking comes from) or on the internals of the StructType, so I'm going to suggest abstracting that behaviour away into a runtime library (most likely in C).
This will allow you to insert function calls in those parts of the IR where you need to store (or compare, etc.) a value. The function(s) would take as arguments the value and the StructType (and anything else required) and perform the desired action in plain C (e.g. store the value).
For example, using part of your posted IR (names prefixed with myruntime_ should be defined in your runtime library):
%0 = alloca %myruntime_struct
[...]
%18 = call @__Checking
%19 = call @myruntime_storevalue(%2, %18, /* other arguments? */)
%20 = store i32 0, %1
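A possible shape for that runtime library, in plain C (the struct layout, the field names and the int64_t result type of __Checking are assumptions, since the question does not specify them):

#include <stdint.h>

typedef struct {
    int64_t value;     /* last observed value */
    int64_t timestamp; /* simple bookkeeping counter */
} myruntime_struct;

/* Called from the instrumented IR; records the result of __Checking. */
void myruntime_storevalue(myruntime_struct *slot, int64_t checking_result)
{
    slot->value = checking_result;
    slot->timestamp += 1;
}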
I have a struct which holds values that are used as the arguments of a for loop:
struct ARGS {
    int endValue;
    int step;
    int initValue;
};

ARGS * arg = ...; //get a pointer to an initialized struct

for (int i = arg->initValue; i < arg->endValue; i += arg->step) {
    //...
}
Since the values in the struct are read each iteration, would it be faster to copy them into local variables before using them in the for loop?
int initValue = arg->initValue;
int endValue = arg->endValue;
int step = arg->step;

for (int i = initValue; i < endValue; i += step) {
    //...
}
The clear-cut answer is that in 99.9% of the cases it does not matter, and you should not be concerned with it. There might be micro-level differences, but they won't matter to almost anyone. The gory details depend on the architecture and the optimizer. But bear with me: "does not matter" here means a very, very high probability that there is no difference.
// case 1
ARGS * arg = ...; //get a pointer to an initialized struct
for (int i = arg->initValue; i < endValue; i += arg->step) {
    //...
}

// case 2
initValue = arg->initValue;
step = arg->step;
for (int i = initValue; i < endValue; i += step) {
    //...
}
In the case of initValue, there will be no difference. The value will be loaded through the pointer and stored into the initValue variable, only to be copied into i. Chances are that the optimizer will skip initValue and write directly to i.
The case of step is a bit more interesting, in that the compiler can prove that the local variable step is not shared by any other thread and can only change locally. If the pressure on registers is small, it can keep step in a register and never have to access the real variable. On the other hand, it cannot assume that arg->step is not changed by external means, and is required to go to memory to read the value. Understand that memory here most probably means the L1 cache. An L1 cache hit on a Core i7 takes approximately 4 CPU cycles, which on a 2 GHz processor (0.5 * 10^-9 seconds per cycle) means roughly 2 nanoseconds. And that is under the worst-case assumption that the compiler can keep step in a register, which may not be the case. If step cannot be held in a register, you will pay for the access to memory (cache) in both cases.
Write code that is easy to understand, then measure. If it is slow, profile and figure out where the time is really spent. Chances are that this is not the place where you are wasting cpu cycles.
This depends on your architecture. Whether it is a RISC or CISC processor affects how memory is accessed, and on top of that the addressing modes affect it as well.
On the ARM code I work with, typically the base address of a structure is moved into a register, and then a load executes from that address plus an offset. To access a plain variable, the address of the variable is moved into a register and the load executes without an offset. In that case both take the same amount of time.
Here's what example assembly code might look like on ARM for accessing the second int member of a structure, compared to directly accessing a variable:
ldr r0, =MyStruct ; struct {int x; int y;} MyStruct
ldr r0, [r0, #4]  ; load MyStruct.y into r0
ldr r1, =MyIntY   ; int MyIntX, MyIntY
ldr r1, [r1]      ; directly load MyIntY into r1
If your architecture does not allow addressing with offsets, then it would need to move the address into a register and then perform the addition of the offset.
Additionally, since you've tagged this as C++ as well: if you overload the -> operator for the type, every access will invoke your own code, which could take longer.
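For instance, a contrived wrapper like this (my own sketch, reusing the ARGS struct from the question) would make every arg->step run user code first:

// Hypothetical smart-pointer-ish wrapper around ARGS.
struct LoggingPtr {
    ARGS* p;
    long  hits = 0;
    ARGS* operator->() { ++hits; return p; } // extra work on every access
};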
The problem is that the two versions are not identical. If the code in the ... part modifies the values in arg, then the two options will behave differently (the "optimized" one will use the original step and end values, not the updated ones).
If the optimizer can prove by looking at the code that this cannot happen, then the performance will be the same, because moving loads out of loops is a common optimization performed today. However, it's quite possible that something in ... could POTENTIALLY change the contents of the structure; in that case the optimizer must be paranoid, and the generated code will reload the values from the structure at each iteration. How costly that is depends on the processor.
For example, if the arg pointer is received as a parameter and the code in ... calls any external function whose code is unknown to the compiler (including things like malloc), then the compiler must assume that MAYBE the external code knows the address of the structure and MAYBE it will change the end or step values. The optimizer is then forbidden to move those loads out of the loop, because doing so would change the behavior of the code.
Even if it's obvious to you that malloc is not going to change the contents of your structure, it is not obvious at all to the compiler, for which malloc is just an external function that will be linked in at a later step.
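A small sketch of that aliasing effect (do_work stands in for any external function whose body the compiler cannot see):

extern void do_work(int i); // opaque to the optimizer

void loop_reload(ARGS* arg) {
    // arg->endValue and arg->step must be re-read each iteration:
    // for all the compiler knows, do_work might modify *arg.
    for (int i = arg->initValue; i < arg->endValue; i += arg->step)
        do_work(i);
}

void loop_hoisted(ARGS* arg) {
    // Locals whose address never escapes cannot be changed by do_work,
    // so they can stay in registers for the whole loop.
    int end = arg->endValue, step = arg->step;
    for (int i = arg->initValue; i < end; i += step)
        do_work(i);
}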
Probably everyone uses some kind of optimization switch (in the case of gcc, the most common one is -O2, I believe).
But what do gcc (and other compilers like VS and Clang) really do in the presence of such options?
Of course there is no definite answer, since it depends very much on the platform, compiler version, etc.
However, if possible, I would like to collect a set of "rules of thumb".
When should I think about tricks to speed up the code, and when should I just leave the job to the compiler?
For example, how far will the compiler go in the following (a little bit artificial...) cases, at different optimization levels?
1) sin(3.141592)
// will it be evaluated at compile time, or should I think of a look-up table to speed up the calculation?

2) int a = 0; a = exp(18), cos(1.57), 2;
// will the compiler evaluate exp and cos, although they are not needed, as the value of the expression equals 2?

3) for (size_t i = 0; i < 10; ++i) {
       int a = 10 + i;
   }
// will the compiler skip the whole loop, as it has no visible side effects?
Maybe you can think of other examples.
If you want to know what a compiler does, your best bet is to have a look at the compiler's documentation. For optimizations, you may look at LLVM's Analysis and Transform Passes, for example.
1) sin(3.141592) // will it be evaluated at compile time?
Probably. There are very precise semantics for IEEE float computations. This might be surprising if you change the processor flags at runtime, by the way.
2) int a = 0; a = exp(18), cos(1.57), 2;
It depends on:
whether the functions exp and cos are inline or not
if they are not, whether they are correctly annotated (so the compiler knows they have no side effects)
Functions taken from your C or C++ Standard library should be correctly recognized/annotated.
As for the elimination of the computation, LLVM has several passes for that:
-adce: Aggressive Dead Code Elimination
-dce: Dead Code Elimination
-die: Dead Instruction Elimination
-dse: Dead Store Elimination
Compilers love finding code that is useless :)
3) Similar to 2), actually. The result of the store is not used, and the expression has no side effects.
-loop-deletion: Delete dead loops
And finally: why not put the compiler to the test?
#include <math.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    double d = sin(3.141592);
    printf("%f", d);

    int a = 0; a = (exp(18), cos(1.57), 2); /* need parentheses here */
    printf("%d", a);

    for (size_t i = 0; i < 10; ++i) {
        int a = 10 + i;
    }

    return 0;
}
Clang tries to be helpful already during the compilation:
12814_0.c:8:28: warning: expression result unused [-Wunused-value]
int a = 0; a = (exp(18), cos(1.57), 2);
^~~ ~~~~
12814_0.c:12:9: warning: unused variable 'a' [-Wunused-variable]
int a = 10 + i;
^
And the emitted code (LLVM IR):
@.str = private unnamed_addr constant [3 x i8] c"%f\00", align 1
@.str1 = private unnamed_addr constant [3 x i8] c"%d\00", align 1

define i32 @main(i32 %argc, i8** nocapture %argv) nounwind uwtable {
  %1 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([3 x i8]* @.str, i64 0, i64 0), double 0x3EA5EE4B2791A46F) nounwind
  %2 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([3 x i8]* @.str1, i64 0, i64 0), i32 2) nounwind
  ret i32 0
}
We remark that:
as predicted, the sin computation has been resolved at compile time
as predicted, the exp and cos calls have been stripped completely
as predicted, the loop has been stripped too
If you want to delve deeper into compiler optimizations I would encourage you to:
learn to read IR (it's incredibly easy, really, much more so than assembly)
use the LLVM Try Out page to test your assumptions
The compiler has a number of optimization passes. Every optimization pass is responsible for a number of small optimizations. For example, you may have a pass that calculates arithmetic expressions at compile time (so that you can express 5MB as 5 * (1024*1024) without a penalty, for example). Another pass inlines functions. Another searches for unreachable code and kills it. And so on.
The developers of the compiler then decide which of these passes they want to execute in which order. For example, suppose you have this code:
int foo(int a, int b) {
    return a + b;
}

void bar() {
    if (foo(1, 2) > 5)
        std::cout << "foo is large\n";
}
If you run dead-code elimination on this, nothing happens. Similarly, if you run expression reduction, nothing happens. But the inliner might decide that foo is small enough to be inlined, so it substitutes the call in bar with the function body, replacing arguments:
void bar() {
    if (1 + 2 > 5)
        std::cout << "foo is large\n";
}
If you run expression reduction now, it will first decide that 1 + 2 is 3, and then decide that 3 > 5 is false. So you get:
void bar() {
    if (false)
        std::cout << "foo is large\n";
}
And now the dead-code elimination will see an if(false) and kill it, so the result is:
void bar() {
}
But now bar is suddenly very tiny, when it was larger and more complicated before. So if you run the inliner again, it would be able to inline bar into its callers. That may expose yet more optimization opportunities, and so on.
For compiler developers, this is a trade-off between compile time and generated code quality. They decide on a sequence of optimizers to run, based on heuristics, testing, and experience. But since one size does not fit all, they expose some knobs to tweak this. The primary knob for gcc and clang is the -O option family. -O1 runs a short list of optimizers; -O3 runs a much longer list containing more expensive optimizers, and repeats passes more often.
Aside from deciding which optimizers run, the options may also tweak internal heuristics used by the various passes. The inliner, for example, usually has lots of parameters that decide when it's worth inlining a function. Pass -O3, and those parameters will lean more towards inlining functions whenever there is a chance of improved performance; pass -Os, and the parameters will cause only really tiny functions (or functions provably called exactly once) to be inlined, as anything else would increase executable size.
Compilers do all sorts of optimizations that you cannot even think of. Especially C++ compilers.
They do things like unrolling loops, inlining functions, eliminating dead code, replacing multiple instructions with just one, and so on.
A piece of advice I can give is: have faith that C/C++ compilers will perform a lot of optimizations.
Take a look at [1].
[1] http://en.wikipedia.org/wiki/Compiler_optimization