As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Probably everyone uses some kind of optimization switches (in case of gcc, the most common one is -O2 I believe).
But what does gcc (and other compilers like VS, Clang) really do in presence of such options?
Of course there is no definite answer, since it depends very much on platform, compiler version, etc.
However, if possible, I would like to collect a set of "rules of thumb".
When should I think about some tricks to speed-up the code and when should I just leave the job to the compiler?
For example, how far will compiler go in such (a little bit artifficial...)
cases, for different optimization levels:
1) sin(3.141592)
// will it be evaluated at compile time or should I think of a look-up table to speed-up the calculations?
2) int a = 0; a = exp(18), cos(1.57), 2;
// will the compiler evaluate exp and cos, although not needed, as the value of the expression is equal 2?
3)
for (size_t i = 0; i < 10; ++i) {
int a = 10 + i;
}
// will the compiler skip the whole loop as it has no visible side-effects?
Maybe you can think of other examples.
If you want to know what a compiler does, your best bet is to have a look at the compiler documentation. For optimizations, you may look at the LLVM's Analysis and Transform Passes for example.
1) sin(3.141592) // will it be evaluated at compile time ?
Probably. There are very precise semantics for IEEE float computations. This might be surprising if you change the processor flags at runtime, by the way.
2) int a = 0; a = exp(18), cos(1.57), 2;
It depends:
whether the functions exp and cos are inline or not
if they are not, whether they correctly annotated (so the compiler know they have no side-effect)
For functions taken from your C or C++ Standard library, they should be correctly recognized/annotated.
As for the eliminitation of the computation:
-adce: Aggressive Dead Code Elimination
-dce: Dead Code Elimination
-die: Dead Instruction Elimination
-dse: Dead Store Elimination
compilers love finding code that is useless :)
3)
Similar to 2) actually. The result of the store is not used and the expression as no side-effect.
-loop-deletion: Delete dead loops
And for the final: what not put the compiler to the test ?
#include <math.h>
#include <stdio.h>
int main(int argc, char* argv[]) {
double d = sin(3.141592);
printf("%f", d);
int a = 0; a = (exp(18), cos(1.57), 2); /* need parentheses here */
printf("%d", a);
for (size_t i = 0; i < 10; ++i) {
int a = 10 + i;
}
return 0;
}
Clang tries to be helpful already during the compilation:
12814_0.c:8:28: warning: expression result unused [-Wunused-value]
int a = 0; a = (exp(18), cos(1.57), 2);
^~~ ~~~~
12814_0.c:12:9: warning: unused variable 'a' [-Wunused-variable]
int a = 10 + i;
^
And the emitted code (LLVM IR):
#.str = private unnamed_addr constant [3 x i8] c"%f\00", align 1
#.str1 = private unnamed_addr constant [3 x i8] c"%d\00", align 1
define i32 #main(i32 %argc, i8** nocapture %argv) nounwind uwtable {
%1 = tail call i32 (i8*, ...)* #printf(i8* getelementptr inbounds ([3 x i8]* #.str, i64 0, i64 0), double 0x3EA5EE4B2791A46F) nounwind
%2 = tail call i32 (i8*, ...)* #printf(i8* getelementptr inbounds ([3 x i8]* #.str1, i64 0, i64 0), i32 2) nounwind
ret i32 0
}
We remark that:
as predicted the sin computation has been resolved at compile-time
as predicted the exp and cos have been stripped completely.
as predicted the loop has been stripped too.
If you want to delve deeper into compiler optimizations I would encourage you to:
learn to read IR (it's incredibly easy, really, much more so that assembly)
use the LLVM Try Out page to test your assumptions
The compiler has a number of optimization passes. Every optimization pass is responsible for a number of small optimizations. For example, you may have a pass that calculates arithmetic expressions at compile time (so that you can express 5MB as 5 * (1024*1024) without a penalty, for example). Another pass inlines functions. Another searches for unreachable code and kills it. And so on.
The developers of the compiler then decide which of these passes they want to execute in which order. For example, suppose you have this code:
int foo(int a, int b) {
return a + b;
}
void bar() {
if (foo(1, 2) > 5)
std::cout << "foo is large\n";
}
If you run dead-code elimination on this, nothing happens. Similarly, if you run expression reduction, nothing happens. But the inliner might decide that foo is small enough to be inlined, so it substitutes the call in bar with the function body, replacing arguments:
void bar() {
if (1 + 2 > 5)
std::cout << "foo is large\n";
}
If you run expression reduction now, it will first decide that 1 + 2 is 3, and then decide that 3 > 5 is false. So you get:
void bar() {
if (false)
std::cout << "foo is large\n";
}
And now the dead-code elimination will see an if(false) and kill it, so the result is:
void bar() {
}
But now bar is suddenly very tiny, when it was larger and more complicated before. So if you run the inliner again, it would be able to inline bar into its callers. That may expose yet more optimization opportunities, and so on.
For compiler developers, this is a trade-off between compile time and generated code quality. They decide on a sequence of optimizers to run, based on heuristics, testing, and experience. But since one size does not fit all, they expose some knobs to tweak this. The primary knob for gcc and clang is the -O option family. -O1 runs a short list of optimizers; -O3 runs a much longer list containing more expensive optimizers, and repeats passes more often.
Aside from deciding which optimizers run, the options may also tweak internal heuristics used by the various passes. The inliner, for example, usually has lots of parameters that decide when it's worth inlining a function. Pass -O3, and those parameters will lean more towards inlining functions whenever there is a chance of improved performance; pass -Os, and the parameters will cause only really tiny functions (or functions provably called exactly once) to be inlined, as anything else would increase executable size.
The compilers does all sort of optimization that you event cannot think off. Especially the C++ compilers.
They do things like unrolling loops, make functions inline, eliminating dead code, replacing multiple instruction with just one and so on.
A piece of advice I can give is: In the C/C++ compilers you can have faith that they will perform a lot of optimizations.
Take a look at [1].
[1] http://en.wikipedia.org/wiki/Compiler_optimization
Related
Consider the following code snippet
#include <vector>
#include <cstdlib>
void __attribute__ ((noinline)) calculate1(double& a, int x) { a += x; };
void __attribute__ ((noinline)) calculate2(double& a, int x) { a *= x; };
void wrapper1(double& a, int x) { calculate1(a, x); }
void wrapper2(double& a, int x) { calculate2(a, x); }
typedef void (*Func)(double&, int);
int main()
{
std::vector<std::pair<double, Func>> pairs = {
std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
};
for (auto& [a, wrapper] : pairs)
(*wrapper)(a, 5);
return pairs[0].first + pairs[1].first;
}
With -O3 optimization the latest gcc and clang versions do not optimize the pointers to wrappers to pointers to underlying functions. See assembly here at line 22:
mov ebp, OFFSET FLAT:wrapper2(double&, int) # tmp118,
which results later in call + jmp, instead of just call had the compiler put a pointer to the calculate1 instead.
Note that I specifically asked for no-inlined calculate functions to illustrate; doing it without noinline results in another flavour of non-optimization where compiler will generate two identical functions to be called by pointer (so still won't optimize, just in a different fashion).
What am I missing here? Is there any way to guide the compiler short of manually plugging in the correct functions (without wrappers)?
Edit 1. Following suggestions in the comments, here is a disassembly with all functions declared static, with exactly the same result (call + jmp instead of call).
Edit 2. Much simpler example of the same pattern:
#include <vector>
#include <cstdlib>
typedef void (*Func)(double&, int);
static void __attribute__ ((noinline)) calculate(double& a, int x) { a += x; };
static void wrapper(double& a, int x) { calculate(a, x); }
int main() {
double a = 5.0;
Func f;
if (rand() % 2)
f = &wrapper; // f = &calculate;
else
f = &wrapper;
f(a, 0);
return 0;
}
gcc 8.2 successfully optimizes this code by throwing pointer to wrapper away and storing &calculate directly in its place (https://gcc.godbolt.org/z/nMIBeo). However changing the line as per comment (that is, performing part of the same optimization manually) breaks the magic and results in pointless jmp.
You seem to be suggesting that &calculate1 should be stored in the vector instead of &wrapper1. In general this is not possible: later code might try to compare the stored pointer against &calculate1 and that must compare false.
I further assume that your suggestion is that the compiler might try to do some static analysis and determine that the function pointers values in the vector are never compared for equality with other function pointers, and in fact that none of the other operations done on the vector elements would produce a change in observable behaviour; and therefore in this exact program it could store &calculate1 instead.
Usually the answer to "why does the compiler not perform some particular optimization" is that nobody has conceived of and implemented that idea. Another common reason is that the static analysis involved is, in the general case, quite difficult and might lead to a slowdown in compilation with no benefit in real programs where the analysis could not be guaranteed to succeed.
You are making a lot of assumptions here. Firstly, your syntax. The second is that compilers are perfect in the eye of the beholder and catch everything. The reality is that it is easy to find and hand optimize compiler output, it is not difficult to write small functions to trip up a compiler that you are well in tune with or write a decent size application and there will be places where you can hand tune. This is all known and expected. Then opinion comes in where on my machine my blah is faster than blah so it should have made these instructions instead.
gcc is not a great compiler for performance, on some targets it has been getting worse for a number of major revs. It is pretty good at what it does, better than pretty good, it deals with a number of pre processors/languages has a common middle and a number of backends. Some backends get better optimization applied front to back others are just hanging on for the ride.There were a number of other compilers that could produce code that could easily outperform gcc.
These were mostly pay-for compilers. More than an individual would pay out of pocket: used car prices, sometimes recurring annually.
There are things that gcc can optimize that are simply amazing and times that it totally goes in the wrong direction. Same goes for clang, often they do similar jobs with similar output, sometimes do some impressive things sometimes just go off into the weeds. I now find it more fun to manipulate the optimizer to make it do good or bad things rater than worry about why didn't it do what I "think" it should have done on a particular occasion. If I need that code faster I take the compiled output and hand fix it and use it as an assembly function.
You get what you pay for with gcc, if you were to look deep in its bowels you will find it is barely held together with duct tape and bailing wire (llvm is catching up). But for a free tool it does a simply amazing job, it is so widely used that you can get free support just about anywhere. We are sadly well into a time where folks think that because gcc interprets the language in a certain way that is how the language is defined and sadly that is not remotely true. But so many folks don't try other compilers to find out what "implementation defined" really means.
Last and most important, it's open source, if you want to "fix" an optimization then just do it. Keep that fix for yourself, post it, or try to push it upstream.
Sometimes I write code like
if (ptr)
ASSERT(ptr->member);
instead of
ASSERT(!ptr || ptr->member);
because it's more straightforward IMO. Would the redundant comparison remain in the release build?
I'd say that depends on your compiler.
In release mode, the ASSERT macro won't evaluate ptr->member and will resolve to a trivial expression that the compiler will optimize out, but the if statement and the associated comparison will remain as is.
However, if the compiler is smart enough to determine that the condition does not have any side effect, it might optimize the entire if statement away. Compiling to assembly (using the /FA option) would give you a definite answer.
As long as the compiler is not stupid, yes it would be trimmed.
Try writing this in the compiler:
if (x);
It gives you a warning that statement has no effect and like I said, if it is not stupid, it would remove the code.
If you want to be sure, you could compile it with your compiler and see the assembly.
LLVM removes it when optimization is required (by the user):
int main(int argc, char **argv) {
if (argc) {}
return 0;
}
Becomes:
define i32 #main(i32 %argc, i8** nocapture %argv) nounwind readnone {
ret i32 0
}
I'm doing a bit of hands on research surrounding the speed benefits of making a function inline. I don't have the book with me, but one text I was reading, was suggesting a fairly large overhead cost to making function calls; and when ever executable size is either negligible, or can be spared, a function should be declared inline, for speed.
I've written the following code to test this theory, and from what I can tell, there is no speed benifit from declaring a function as inline. Both functions, when called 4294967295 times, on my computer, execute in 196 seconds.
My question is, what would be your thoughts as to why this is happening? Is it modern compiler optimization? Would it be the lack of large calculations taking place in the function?
Any insight on the matter would be appreciated. Thanks in advance friends.
#include < iostream >
#include < time.h >
// RESEARCH Jared Thomson 2010
////////////////////////////////////////////////////////////////////////////////
// Two functions that preform an identacle arbitrary floating point calculation
// one function is inline, the other is not.
double test(double a, double b, double c);
double inlineTest(double a, double b, double c);
double test(double a, double b, double c){
a = (3.1415 / 1.2345) / 4 + 5;
b = 9.999 / a + (a * a);
c = a *=b;
return c;
}
inline
double inlineTest(double a, double b, double c){
a = (3.1415 / 1.2345) / 4 + 5;
b = 9.999 / a + (a * a);
c = a *=b;
return c;
}
// ENTRY POINT Jared Thomson 2010
////////////////////////////////////////////////////////////////////////////////
int main(){
const unsigned int maxUINT = -1;
clock_t start = clock();
//============================ NON-INLINE TEST ===============================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
clock_t end = clock();
std::cout << maxUINT << " calls to non inline function took "
<< (end - start)/CLOCKS_PER_SEC << " seconds.\n";
start = clock();
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
end = clock();
std::cout << maxUINT << " calls to inline function took "
<< (end - start)/CLOCKS_PER_SEC << " seconds.\n";
getchar(); // Wait for input.
return 0;
} // Main.
Assembly Output
PasteBin
The inline keyword is basically useless. It is a suggestion only. The compiler is free to ignore it and refuse to inline such a function, and it is also free to inline a function declared without the inline keyword.
If you are really interested in doing a test of function call overhead, you should check the resultant assembly to ensure that the function really was (or wasn't) inlined. I'm not intimately familiar with VC++, but it may have a compiler-specific method of forcing or prohibiting the inlining of a function (however the standard C++ inline keyword will not be it).
So I suppose the answer to the larger context of your investigation is: don't worry about explicit inlining. Modern compilers know when to inline and when not to, and will generally make better decisions about it than even very experienced programmers. That's why the inline keyword is often entirely ignored. You should not worry about explicitly forcing or prohibiting inlining of a function unless you have a very specific need to do so (as a result of profiling your program's execution and finding that a bottleneck could be solved by forcing an inline that the compiler has for some reason not done).
Re: the assembly:
; 30 : const unsigned int maxUINT = -1;
; 31 : clock_t start = clock();
mov esi, DWORD PTR __imp__clock
push edi
call esi
mov edi, eax
; 32 :
; 33 : //============================ NON-INLINE TEST ===============================//
; 34 : for(unsigned int i = 0; i < maxUINT; ++i)
; 35 : blank(1.1,2.2,3.3);
; 36 :
; 37 : clock_t end = clock();
call esi
This assembly is:
Reading the clock
Storing the clock value
Reading the clock again
Note what's missing: calling your function a whole bunch of times
The compiler has noticed that you don't do anything with the result of the function and that the function has no side-effects, so it is not being called at all.
You can likely get it to call the function anyway by compiling with optimizations off (in debug mode).
Both the functions could be inlined. The definition of the non-inline function is in the same compilation unit as the usage point, so the compiler is within its rights to inline it even without you asking.
Post the assembly and we can confirm it for you.
EDIT: the MSVC compiler pragma for banning inlining is:
#pragma auto_inline(off)
void myFunction() {
// ...
}
#pragma auto_inline(on)
Two things could be happening:
The compiler may either be inlining both or neither functions. Check your compiler documentation for how to control that.
Your function may be complex enough that the overhead of doing the function call isn't big enough to make a big difference in the tests.
Inlining is great for very small functions but it's not always better. Code bloat can prevent the CPU from caching code.
In general inline getter/setter functions and other one liners. Then during performance tuning you can try inlining functions if you think you'll get a boost.
Your code as posted contains a couple oddities.
1) The math and output of your test functions are completely independent of the function parameters. If the compiler is smart enough to detect that those functions always return the same value, that might give it incentive to optimize them out entirely inline or not.
2) Your main function is calling test for both the inline and non-inline tests. If this is the actual code that you ran, then that would have a rather large role to play in why you saw the same results.
As others have suggested, you would do well to examine the actual assembly code generated by the compiler to determine that you're actually testing what you intended to.
Um, shouldn't
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
test(1.1,2.2,3.3);
be
//============================ INLINE TEST ===================================//
for(unsigned int i = 0; i < maxUINT; ++i)
inlineTest(1.1,2.2,3.3);
?
But if that was just a typo, would recommend that look at a dissassembler or reflector to see if the code is actually inline or still stack-ed.
If this test took 196 seconds for each loop, then you must not have turned optimizations on; with optimizations off, generally compilers don't inline anything.
With optimization on, however, the compiler is free to notice that your test function can be completely evaluated at compile time, and crush it down to "return [constant]" -- at which point, it may well decide to inline both functions since they're so trivial, and then notice that the loops are pointless since the function value is not used, and squash that out too! This is basically what I got when I tried it.
So either way, you're not testing what you thought you tested.
Function call overhead ain't what it used to be, compared to the overhead of blowing out the level-1 instruction cache, which is what aggressive inlining does to you. You can easily find reports online of gcc's -Os option (optimize for size) being a better default choice for large projects than -O2, and the big reason for that is that -O2 inlines a lot more aggressively. I would expect it is much the same with MSVC.
The only way I know of to guarantee a function is inline is to #define it
For example:
#define RADTODEG(x) ((x) * 57.29578)
That said, the only time I would bother with such a function would be in an embedded system. On a desktop/server the performance difference is negligible.
Run it in a debugger and have a look at the generated code to see if your function is always or never inlined. I think it's always a good idea to have a look at the assembler code when you want more knowledge about the optimization the compiler does.
Apologies for a small flame ...
Compilers think in assembly language. You should too. Whatever else you do, just step through the code at the assembler level. Then you'll know exactly what the compiler did.
Don't think of performance in absolute terms like "fast" or "slow". It's all relative, percentage-wise. The way software is made fast is by removing, in successive steps, things that take too large a percent of the time.
Here's the flame: If a compiler can do a pretty good job of inlining functions that clearly need it, and if it can do a really good job of managing registers, I think that's just what it should do. If it can do a reasonable job of unrolling loops that clearly could use it, I can live with that. If it's knocking itself out trying to outsmart me by removing function calls that I clearly wrote and intended to be called, or scrambling my code sanctimoniously trying to save a JMP when that JMP occupies 0.000001% of running time (the way Fortran does), I get annoyed, frankly.
There seems to be a notion in the compiler world that there's no such thing as an unhelpful optimization. No matter how smart the compiler is, real optimization is the programmer's job, and nobody else's.
This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
Can a recursive function be inline?
What are the trade offs of making recursive functions inline.
Recursive functions that can be optimised by tail-end recursion can certainly be inlined. If the last thing a function does is call itself, then it can be converted into a plain loop.
Arbitrary recursive functions can't be inlined for the same reason a snake can't swallow its own tail.
[Edit: just noticed that although your title says "be inlined", your actual question says "making functions inline". The two effectively have nothing to do with one another, they just have confusingly similar names. In modern compilers, the primary effect of inline is the thing that originally in C99 was (I think) just a necessary detail to make inline work at all: to permit multiple definitions of a symbol with external linkage. That's because modern compilers don't pay a whole lot of attention to the programmer's opinion of whether a function should be inlined. They do pay some, though, so the confusion of concepts persists. I've answered the question in the title, which is the decision the compiler makes, not the question in the body, which is the decision the programmer makes.]
Inlining is not necessarily an all-or-nothing deal. One strategy which compilers use to decide whether to inline, is to keep inlining function calls until the resulting code is "too big". "Big" is defined by some hopefully sensible heuristic.
So consider the following recursive function (which deliberately is not simply tail-recursive):
int triangle(int n) {
if (n == 1) return 1;
return n + triangle(n-1);
}
If it's called like this:
int t100() {
return triangle(100);
}
Then there's no particular reason in principle that the usual rules that the compiler uses for inlining shouldn't result in this:
int t100() {
// inline call to triangle(100)
int result;
if (100 == 1) { result = 1; } else {
// inline call to triangle(99)
int t99;
if (100 - 1 == 1) { t99 = 1; } else {
// inline call to triangle(98)
int t98;
if (100 - 1 - 1 == 1) { t98 = 1; } else {
// oops, "too big", no more inlining
t98 = triangle(100 - 1 - 1 - 1) + 98;
}
t99 = t98 + 99;
}
result = t99 + 100;
}
return result;
}
Obviously the optimiser will have a field day with that, so it's much "smaller" than it looks:
int t100() {
return triangle(97) + 297;
}
The code in triangle itself could be "unrolled" a few steps by a few levels of inlining, in exactly the same way, except that it doesn't have the benefits of constants:
int triangle(int n) {
if (n == 1) return 1;
if (n == 2) return 3;
if (n == 3) return 6;
return triangle(n-3) + 3*n - 3;
}
I doubt whether compilers actually do this, though, I don't think I've ever noticed it [Edit: MSVC does if you tell it to, thanks peterchen].
There's an obvious potential benefit in saving call overhead, but as against that people don't really expect recursive functions to get inlined, and there's no particular guarantee that the usual inlining heuristics will perform well with recursive functions (where there are two different places, the call site and the recursive call, that might be inlined, with different benefits in each case). Furthermore, it's difficult at compile time to estimate how deep the recursion will go, and the inline heuristics might like to take account of the call depth to make decisions. So it may be that the compiler just doesn't bother.
Functional language compilers are typically a lot more aggressive dealing with recursion than C or C++ compilers. The relevant trade-off there is that so many functions written in functional languages are recursive, that performance might be hopeless if the compiler couldn't optimise tail-recursion. So Lisp programmers typically rely on good optimisation of recursive functions, whereas C and C++ programmers typically don't.
If your compiler does not support it, you can try manually inlining instead...
int factorial(int n) {
int result = 1;
if (n-- == 0) {
return result;
} else {
result *= 1;
if (n-- == 0) {
return result;
} else {
result *= 2;
if (n-- == 0) {
return result;
} else {
result *= 3;
if (n-- == 0) {
return result;
} else {
result *= 4;
if (n-- == 0) {
return result;
} else {
// ...
}
}
}
}
}
}
See the problem yet?
Tail recursion (a special case of recursion) it's possible to be inlined by smart compilers.
Now, hold on. A tail-recursive function could be unrolled and inlined pretty easily. Apparently there are compilers that do this, but I am not aware of specifics.
Of course. Any function can be inlined if it makes sense to do it:
int f(int i)
{
if (i <= 0) return 1;
else return i * f(i - 1);
}
int main()
{
return f(10);
}
pseudo assembly (f is inlined in main):
main:
mov r0, #10 ; Pass 10 to f
f:
cmp r0, #0 ; arg <= 0? ...
bge 1l
mov r0, #1 ; ... is so, return 1
ret
1:
mov r0, -(sp) ; if not, save arg.
dec r0 ; pass arg - 1 to f
call f ; just because it's inlined doesn't mean I can't call it.
mul r0, (sp)+ ; compute the result
ret ; done.
;-)
When you call an ordinary function when you change command sequential execution order and jump(call or jmp) into some address where the function resides. Inlining mean that you place in all occurences of this function the commands of this function, so you don't have a one place where you could jump, also other types of optimisations can be used, like elemination of pushing/popping function parameters.
When you know, that the recursive chain will in normal cases be not so long, you could do inlining upto a predefined level (I don't know, if any existing compiler is intelligent enough for this today).
Inlining a recursive function is much like unrolling a loop. You will end up with much duplicate code -- but in some cases it could be worthwhile:
The number of recursive calls (the length of the chain) is normally short (in cases it gets longer than predefined, just do normal recursion)
The overhead for the functions calls is relatively big compared to the logic -- so do some "unrolling" for example five instances and end up doing a recursive call again -- this would lead to saving 80% of the call overhead.
Off course the tail-recursive special-case -- but this was mentioned by others.
Of course can be declared inline. The inline keyword is just a hint to the compiler. In many case the compiler just ignore it and depending on the compiler this could be one of this situatios.
Some compilers cna turn tail recursion into plain loops, and thus inline them normally.
Non-tail recursion could be inlined up to a given depth, usually decided by the compiler.
I've never encountered a practical application for that, as the cost of call isn't high enough anymore to offset the increase in code size.
[edit] (to clarify that: even though I like to toy with these things, and often check what code my compiler generates for "funny stuff" just out of curiosity, I haven't encountered a use case where any such unrolling helped significantly. This doesn't mean they don't exist or couldn't be constructed.
The only place where it would help is precalculating low iterations during compile time. However, in my experience this immensely increases compile times for often negligible runtime performance benefits.
Note that Visual Studio 2008 (and earlier) gives you quite some control over this:
#pragma inline_recursion(on)
#pragma inline_depth(N)
__forceinline
Be careful with the latter, it can easily overload the compiler :)
Inline means that on each place a call to a function marked as inline gets done, the compiler places a copy of the said function code there. This avoids function calling mechanisms, and it's usual argument stack pushing-poping, saving time in gazillion-calls-per-second situations. You see the consequences to static variables and stuff like that? all gone...
So, if you had an inlined recursive call, either your compiler is super smart and figures whether the number of copies is deterministic, of it will say "Cannot make it inline", because it wouldn't know when to stop.
I'm planning to create a data structure optimized to hold assembly code. That way I can be totally responsible for the optimization algorithms that will be working on this structure. If I can compile while running. It will be kind of dynamic execution. Is this possible? Have any one seen something like this?
Should I use structs to link the structure into a program flow. Are objects better?
struct asm_code {
int type;
int value;
int optimized;
asm_code *next_to_execute;
} asm_imp;
Update: I think it will turn out, like a linked list.
Update: I know there are other compilers out there. But this is a military top secret project. So we can't trust any code. We have to do it all by ourselves.
Update: OK, I think I will just generate basic i386 machine code. But how do I jump into my memory blob when it is finished?
It is possible. Dynamic code generation is even mainstream in some areas like software rendering and graphics. You find a lot of use in all kinds of script languages, in dynamic compilation of byte-code in machine code (.NET, Java, as far as I know Perl. Recently JavaScript joined the club as well).
You also find it used in very math-heavy applications as well, It makes a difference if you for example remove all multiplication with zero out of a matrix multiplication if you plan to do such a multiplication several thousand times.
I strongly suggest that you read on the SSA representation of code. That's a representation where each primitive is turned into the so called three operand form, and each variable is only assigned once (hence the same Static Single Assignment form).
You can run high order optimizations on such code, and it's straight forward to turn that code into executable code. You won't write that code-generation backend on a weekend though...
To get a feeling how the SSA looks like you can try out the LLVM compiler. On their web-site they have a little "Try Out" widget to play with. You paste C code into a window and you get something out that is close to the SSA form.
Little example how it looks like:
Lets take this integer square root algorithm in C. (arbitrary example, I just took something simple yet non-trivial):
unsigned int isqrt32 (unsigned int value)
{
unsigned int g = 0;
unsigned int bshift = 15;
unsigned int b = 1<<bshift;
do {
unsigned int temp = (g+g+b)<<bshift;
if (value >= temp) {
g += b;
value -= temp;
}
b>>=1;
} while (bshift--);
return g;
}
LLVM turns it into:
define i32 #isqrt32(i32 %value) nounwind {
entry:
br label %bb
bb: ; preds = %bb, %entry
%indvar = phi i32 [ 0, %entry ], [ %indvar.next, %bb ]
%b.0 = phi i32 [ 32768, %entry ], [ %tmp23, %bb ]
%g.1 = phi i32 [ 0, %entry ], [ %g.0, %bb ]
%value_addr.1 = phi i32 [ %value, %entry ], [ %value_addr.0, %bb ]
%bshift.0 = sub i32 15, %indvar
%tmp5 = shl i32 %g.1, 1
%tmp7 = add i32 %tmp5, %b.0
%tmp9 = shl i32 %tmp7, %bshift.0
%tmp12 = icmp ult i32 %value_addr.1, %tmp9
%tmp17 = select i1 %tmp12, i32 0, i32 %b.0
%g.0 = add i32 %tmp17, %g.1
%tmp20 = select i1 %tmp12, i32 0, i32 %tmp9
%value_addr.0 = sub i32 %value_addr.1, %tmp20
%tmp23 = lshr i32 %b.0, 1
%indvar.next = add i32 %indvar, 1
%exitcond = icmp eq i32 %indvar.next, 16
br i1 %exitcond, label %bb30, label %bb
bb30: ; preds = %bb
ret i32 %g.0
}
I know it looks horrible at first. It's not even pure SSA-Form. The more you read on that representation the more sense it will make. And you will also find out why this representation is so widely used these days.
Encapsulating all the info you need into a data-structure is easy. In the end you have to decide if you want to use enums or strings for opcode names ect.
Btw - I know I didn't gave you a data-structure but more a formal yet practical language and the advice where to look further.
It's a very nice and interesting research field.
Edit: And before I forget it: Don't overlook the built in features of .NET and Java. These languates allow you to compile from byte-code or source code from within the program and execute the result.
Cheers,
Nils
Regarding your edit: How to execute a binary blob with code:
Jumping into your binary blob is OS and platform dependent. In a nutshell you have invalide the instruction cache, maybe you have to writeback the data-cache and you may have to enable execution rights on the memory-region you've wrote your code into.
On win32 it's relative easy as instruction cache flushing seems to be sufficient if you place your code on the heap.
You can use this stub to get started:
typedef void (* voidfunc) (void);
void * generate_code (void)
{
// reserve some space
unsigned char * buffer = (unsigned char *) malloc (1024);
// write a single RET-instruction
buffer[0] = 0xc3;
return buffer;
}
int main (int argc, char **args)
{
// generate some code:
voidfunc func = (voidfunc) generate_code();
// flush instruction cache:
FlushInstructionCache(GetCurrentProcess(), func, 1024);
// execute the code (it does nothing atm)
func();
// free memory and exit.
free (func);
}
I assume you want a data structure to hold some kind of instruction template, probably parsed from existing machine code, similar to:
add r1, r2, <int>
You will have an array of this structure and you will perform some optimization on this array, probably changing its size or building a new one, and generate corresponding machine code.
If your target machine uses variable width instructions (x86 for example), you can't determine your array size without actually finishing parsing the instructions which you build the array from. Also you can't determine exactly how much buffer you need before actually generating all the instructions from optimized array. You can make a good estimate though.
Check out GNU Lightning. It may be useful to you.
In 99% of the cases, the difference in performance is negligible. The main advantage of classes is that the code produced by OOP is better and easier to understand than procedural code.
I'm not sure in what language you're coding - note that in C# the major difference between classes and structs is that structs are value types while classes are reference types. In this case, you might want to start with structs, but still add behavior (constructor, methods) to them.
Not discussing the technical value of optimize yourself your code, in a C++ code, choosing between a POD struct or a full object is mostly a point of encapsulation.
Inlining the code will let the compiler optimize (or not) the constructors/accessors used. There will be no loss of performance.
First, set a constructor
If you're working with a C++ compiler, create at least one constructor:
struct asm_code {
asm_code()
: type(0), value(0), optimized(0) {}
asm_code(int type_, int value_, int optimized_)
: type(type_), value(value_), optimized(_optimized) {}
int type;
int value;
int optimized;
};
At least, you won't have undefined structs in your code.
Are every combination of data possible?
Using a struct like you use means that any type is possible, with any value, and any optimized. For example, if I set type = 25, value = 1205 and optimized = -500, then it is Ok.
If you don't want the user to put random values inside your structure, add inline accessors:
struct asm_code {
int getType() { return type ; }
void setType(int type_) { VERIFY_TYPE(type_) ; type = type_ ; }
// Etc.
private :
int type;
int value;
int optimized;
};
This will let you control what is set inside your structure, and debug your code more easily (or even do runtime verification of you code)
After some reading, my conclusion is that common lisp is the best fit for this task.
With lisp macros I have enormous power.