Gentle introduction to JIT and dynamic compilation / code generation - C++

The deceptively simple foundation of dynamic code generation within a C/C++ framework has already been covered in another question. Are there any gentle introductions to the topic, with code examples?
My eyes are starting to bleed staring at highly intricate open source JIT compilers when my needs are much more modest.
Are there good texts on the subject that don't assume a doctorate in computer science? I'm looking for well-worn patterns, things to watch out for, performance considerations, etc. Electronic or tree-based resources can be equally valuable. You can assume a working knowledge of (not just x86) assembly language.

Well, a pattern I've used in emulators goes something like this:
#include <map>

typedef void (*code_ptr)();

extern unsigned long entry_point;            // supplied by the VM
code_ptr generate_code_block();              // emits native code for the current block
unsigned long update_instruction_pointer();  // reads VM state to find the next block

unsigned long instruction_pointer = entry_point;
std::map<unsigned long, code_ptr> code_map;

void execute_block() {
    code_ptr f;
    std::map<unsigned long, code_ptr>::iterator it = code_map.find(instruction_pointer);
    if (it != code_map.end()) {
        f = it->second;                     // block already compiled: reuse it
    } else {
        f = generate_code_block();          // compile the block at instruction_pointer
        code_map[instruction_pointer] = f;  // cache it for next time
    }
    f();
    instruction_pointer = update_instruction_pointer();
}

void execute() {
    while (true) {
        execute_block();
    }
}
This is a simplification, but the idea is there. Basically, every time the engine is asked to execute a "basic block" (usually everything up to the next flow-control op, or a whole function if possible), it will look it up to see if it has already been created. If so, execute it; else create it, add it, and then execute.
rinse repeat :)
As for the code generation, that gets a little complicated, but the idea is to emit a proper "function" which does the work of your basic block in the context of your VM.
EDIT: note that I haven't demonstrated any optimizations either, but you asked for a "gentle introduction"
EDIT 2: I forgot to mention one of the most immediately productive speed-ups you can implement with this pattern. Basically, if you never remove a block from your tree (you can work around it if you do, but it is way simpler if you never do), then you can "chain" blocks together to avoid lookups. Here's the concept: whenever you return from f() and are about to do the update_instruction_pointer, if the block you just executed ended in a call, an unconditional jump, or didn't end in flow control at all, then you can "fix up" its ret instruction with a direct jmp to the next block it'll execute (because it'll always be the same one), provided you have already emitted that block. This makes it so you are executing more and more in the generated code and less and less in the execute_block function. A sketch of the fix-up follows below.
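Here's a minimal sketch of that fix-up, assuming x86, a writable code buffer, and an emitter that leaves at least four spare bytes after each block's one-byte ret so the five-byte jmp fits; chain_blocks and its arguments are hypothetical bookkeeping the emitter would track:
#include <cstdint>
#include <cstring>

// Overwrite the block's trailing 'ret' (0xC3) with a 'jmp rel32' (0xE9)
// to the next block, so future runs bypass the dispatch loop entirely.
void chain_blocks(unsigned char *ret_addr, unsigned char *next_block) {
    int32_t rel = (int32_t)(next_block - (ret_addr + 5)); // rel32 is measured from the end of the 5-byte jmp
    ret_addr[0] = 0xE9;                                   // jmp rel32 opcode
    std::memcpy(ret_addr + 1, &rel, sizeof rel);          // little-endian displacement
}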

I'm not aware of any sources specifically related to JITs, but I imagine that it's pretty much like a normal compiler, only simpler if you aren't worried about performance.
The easiest way is to start with a VM interpreter. Then, for each VM instruction, generate the assembly code that the interpreter would have executed.
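For instance, here is a minimal sketch of that per-instruction translation for a hypothetical two-opcode stack VM, emitting the assembly as text (the opcodes and the emitted sequences are illustrative, not any particular VM's):
#include <cstdio>

enum Op { OP_LOADI, OP_ADD };  // toy VM: push an immediate, add the top two slots

// An interpreter would dispatch on 'op' and do the work directly; a
// template JIT emits the equivalent code instead (printed as text here).
void emit(Op op, int operand = 0) {
    switch (op) {
    case OP_LOADI:
        std::printf("    mov eax, %d\n    push eax\n", operand);
        break;
    case OP_ADD:
        std::printf("    pop ebx\n    pop eax\n    add eax, ebx\n    push eax\n");
        break;
    }
}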
To go beyond that, I imagine that you would parse the VM byte codes and convert them into some sort of suitable intermediate form (three address code? SSA?) and then optimize and generate code as in any other compiler.
For a stack-based VM, it may help to keep track of the "current" stack depth as you translate the byte codes into intermediate form, and treat each stack location as a variable. For example, if you think that the current stack depth is 4 and you see a "push" instruction, you might generate an assignment to "stack_variable_5" and increment a compile-time stack counter, or something like that. An "add" when the stack depth is 5 might generate the code "stack_variable_4 = stack_variable_4 + stack_variable_5" and decrement the compile-time stack counter.
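A sketch of that bookkeeping, using zero-based slot numbers and printing the generated assignments as text (the opcode names and the textual intermediate form are made up for illustration):
#include <cstdio>

enum Opcode { PUSH_CONST, ADD_OP };

int stack_depth = 0;  // compile-time model of the VM stack depth

void translate(Opcode op, int operand = 0) {
    switch (op) {
    case PUSH_CONST:
        // a push becomes an assignment to the next free stack variable
        std::printf("stack_%d = %d\n", stack_depth, operand);
        ++stack_depth;
        break;
    case ADD_OP:
        // an add consumes the top two slots and stores the result one down
        --stack_depth;
        std::printf("stack_%d = stack_%d + stack_%d\n",
                    stack_depth - 1, stack_depth - 1, stack_depth);
        break;
    }
}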
It is also possible to translate stack based code into syntax trees. Maintain a compile-time stack. Every "push" instruction causes a representation of the thing being pushed to be stored on the stack. Operators create syntax tree nodes that include their operands. For example, "X Y +" might cause the stack to contain "var(X)", then "var(X) var(Y)" and then the plus pops both var references off and pushes "plus(var(X), var(Y))".
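In code, that compile-time stack of partial trees might look like this minimal sketch (the Node representation is hypothetical, and ownership/cleanup is ignored):
#include <stack>
#include <string>

struct Node {
    std::string op;       // "var:<name>" for leaves, "plus" for operators
    Node *left, *right;   // operand subtrees (null for leaves)
};

std::stack<Node*> trees;  // the compile-time stack of partial trees

void push_var(const std::string& name) {
    trees.push(new Node{ "var:" + name, nullptr, nullptr });
}

void apply_plus() {                    // for "X Y +": pop Y, pop X, push plus(X, Y)
    Node* rhs = trees.top(); trees.pop();
    Node* lhs = trees.top(); trees.pop();
    trees.push(new Node{ "plus", lhs, rhs });
}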

Get yourself a copy of Joel Pobar's book on Rotor (when it's out), and delve through the source to the SSCLI. Beware, insanity lies within :)

Related

Design elements for inline asm in concurrent usage

I can't find a neat explanation about how I'm supposed to write a piece of inline asm, and what problems can possibly arise from concurrent use of a function foo that contains asm code.
The problem that I see is that in asm the registers are uniquely named: one name is strictly tied to one precise portion of your CPU. That's a big problem if you are writing one piece of code that is supposed to run concurrently, because you can't simply conjure up extra registers with the same name.
The other problem is that asm doesn't really use a calling convention; you simply address registers and/or values, and sometimes touching a register implies a silent action on another register that doesn't even show up explicitly in your code. So I can't even expect that my C/C++ function foo will be packed and sealed inside its own stack frame if it contains asm code.
Now, with what gcc calls extended asm, I can basically declare where the input and the output go, so each function can use its own parameters "as registers", and the pattern is the following:
asm ( assembler template
      : output operands
      : input operands
      : clobbered registers
    );
Assuming that my main target for now is mathematical operations, and my function is only supposed to provide a certain functionality and perform some computation (no internal locking), is extended asm good for concurrency? How should I design a piece of asm that is supposed to be used by a concurrent application?
For now I'm using gcc, but I would like a generic answer about the general asm design that I'm supposed to apply to this kind of code snippet.
You seem to be misunderstanding what threading actually is. Let's consider a single-processor system first. The threads don't actually run concurrently, since there is only one unit that can decode and execute them. Your operating system only creates the illusion of running multiple threads (and processes, too) by employing scheduling: every thread or process is allocated a certain amount of time during which it gets to execute on the processor.
This is why, when threads are executed, they don't overwrite each other's registers. When the currently executing thread or process is switched, the operating system asks the processor to perform what's called a context switch. In a nutshell, the processor saves the state it had while executing the previous task/thread/process into some memory area controlled by the OS. The new task/thread/process has its context restored from the previously stored state and continues its execution. When this task/thread/process' time slice on the CPU is up, the scheduler decides which task/thread/process to resume next. The time slice is usually very small, which is why you're given the illusion of multiple streams of code running at the same time. Keep in mind that this is a very, very simplified description: refer to CPU manuals or books on operating systems for more detail.
The situation is analogous on multi-processor systems, with the one exception that there is more than one unit that can execute instructions. This is also true for multi-core processors: every one of the cores has its own set of registers. The basic picture stays the same - the scheduler in your OS decides whether the code being executed is actually executed at the same time by multiple cores in one processor.
Thus, your concerns in this case are not valid. However, they were raised for understandable reasons. Remember that the only thing threads share is the main memory: each thread has its own registers and its own stack.
Let me come back to the actual question about gcc's extended inline assembly. The compiler itself cannot work out which registers are modified by the assembly you wrote. That's why you need to specify it. However, it is very rare that an instruction modifies a register without you being able to control it, and it happens only with a small number of instructions - assuming that we're talking about x86. Moreover, gcc can work out the destination/source operands by itself when you want to refer to a C/C++ variable from inside the assembly. In fact, this is the preferred method, since it leaves the compiler much more room for optimization.
Consider this piece of code:
unsigned int get_cr0(void)
{
    unsigned int rc;
    __asm__ (
        "movl %%cr0, %0\n"   /* copy cr0 into operand %0 */
        : "=r"(rc)           /* output: any general register, written back to rc */
        :                    /* no inputs */
        :                    /* no clobbers */
    );
    return rc;
}
This function's purpose is to return the contents of the control register cr0. This is a privileged instruction, so the program will not work when you run it in user mode, but this is not important right now. See how I put %0 in the instruction, and then specified "=r"(rc) in the output list. This means that %0 will be automagically aliased by the compiler to your rc variable. You can do this for every variable you specify on the input/output list. They are numbered starting from zero, as you can see.
Instructions that modify registers not encoded as operands do need to be put on the clobber list (the last one): cpuid is the classic example, since it writes eax, ebx, ecx and edx even though none of them appear as explicit operands. I'm pretty sure the gcc documentation on extended asm has more information.
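A sketch of how that looks for cpuid, assuming gcc on x86 (the function name is made up; the constraints capture eax and ebx, and the registers we don't capture go on the clobber list):
unsigned int cpuid_vendor_ebx(void)
{
    unsigned int eax = 0;   /* leaf 0: vendor string query */
    unsigned int ebx;
    __asm__ ("cpuid"
             : "+a"(eax), "=b"(ebx)  /* eax is both input and overwritten; ebx is captured */
             :
             : "ecx", "edx");        /* cpuid also writes these: declare them clobbered */
    return ebx;             /* first four bytes of the vendor string */
}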
I also can't answer anything regarding "general asm design", since this is a non-standard extension and thus varies between compilers. The 64-bit Visual Studio compilers don't support it at all, for example.

Interpreting GPerfTools sample count

I'm struggling a little with reading the textual output GPerfTools generates. I think part of the problem is that I don't fully understand how the sampling method operates.
From Wikipedia I gather that sampling profilers usually work by sending an interrupt to the OS and querying the program's current instruction pointer. Now, my knowledge of assembly is a little rusty, so I'm wondering what it means if the instruction pointer points into method m at any given time. Does it mean that the function is about to be called, or that it's currently being executed, or both?
There's a difference, if I'm not mistaken: in the first case, the sample count (i.e. the number of times m is seen while taking a sample) would translate to the absolute call count of m, while in the latter case it simply counts the times m was seen, i.e. a mere indication of the relative time spent in this method.
Can someone clarify?

Text iteration, Assembly versus C++

I am making a program which is frequently reading chunks of text received from the web, looking for specific characters and parsing the data accordingly. I am becoming fairly skilled with C++ and have made it work well; however, is Assembly going to be faster than a
for (size_t len = 0; len != tstring.length(); len++) {
    if (tstring[len] == ',')
        stuff();
}
Would an inline-assembly routine using cmp and jz/jnz be faster? I don't want to waste my time working with asm just for the sake of saying I used it, but for true speed purposes.
Thank you,
No way. Your loop is so simple, the cost of the optimizer losing the ability to reason about your code is going to be way higher than any performance you could gain. This isn't SSE intrinsics or a bootloader, it's a trivial loop.
An inline assembly routine using "plain old" jz/jnz is unlikely to be faster than what you have; that said, you have a few inefficiencies in your code:
you're retrieving tstring.length() once per loop iteration; that's unnecessary.
you're using random indexing, tstring[len], which might be a more expensive operation than using a forward iterator.
you're calling stuff() during the loop; depending on what exactly that does, it might be faster to just let the loop build a list of locations within the string first (so that the scanned string as well as the scanning code stays cache-hot and is not evicted by whatever stuff() does), and only afterwards iterate over those results. A sketch of all three changes follows.
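A minimal sketch with those changes applied, assuming stuff() can take the match position (the original signature isn't shown, so that parameter is an assumption):
#include <string>
#include <vector>

void stuff(size_t pos);  // assumed handler, now taking the match offset

void scan(const std::string& tstring) {
    std::vector<size_t> hits;
    const size_t len = tstring.length();   // hoisted: computed once, not per iteration
    for (size_t i = 0; i != len; ++i)      // sequential pass stays cache-friendly
        if (tstring[i] == ',')
            hits.push_back(i);
    for (size_t i = 0; i < hits.size(); ++i)
        stuff(hits[i]);                    // process matches after the scan
}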
There's already a likely low-level-optimized standard library function available, strchr(), for exactly that kind of scanning. The C++ STL std::string::find() is also likely to have been optimized for the purpose (and/or might use strchr() in the char specialization).
In particular, strchr() has SSE2 (using pcmpeqb, pmovmskb and bsf) or SSE4.2 (using the string op pcmpistri) implementations; for examples/actual SSE code doing this, check e.g. strchr() in GNU libc (as used on Linux). See also the references and comments here (suitably named website ...).
My advice: check your library implementation / documentation, and/or the actual generated assembly code for your program. You might well be using fast code already ... or would be if you switched from your hand-grown character-by-character search to just using std::string::find() or strchr().
If this is ultra-speed-critical, then inlining assembly code for strchr() as used by known/tested implementations (watch licensing) would eliminate function calls and gain a few cycles. Depends on your requirements ... code, benchmark, vary, benchmark again, ...
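For reference, the strchr()-based version of the scan is short (again assuming a handler that takes the offset; scan and stuff are illustrative names):
#include <cstring>

void stuff(size_t pos);  // assumed handler

void scan(const char* text) {
    // strchr returns a pointer to the next ',' or NULL when none remain
    for (const char* p = std::strchr(text, ','); p; p = std::strchr(p + 1, ','))
        stuff(static_cast<size_t>(p - text));  // report the match offset
}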
Checking characters one by one is not the fastest thing to do. Maybe you should try something like this and find out if it's faster:
string s("xxx,xxxxx,x,xxxx");
string::size_type pos = s.find(',');
while (pos != string::npos) {
    do_stuff(pos);               // handle the match at this position
    pos = s.find(',', pos + 1);  // continue the search just past it
}
Each iteration of the loop gives you the next position of a ',' character, so the program will need only a few iterations to finish the job.
Would an inline-assembly routine using cmp and jz/jnz be faster?
Maybe, maybe not. It depends upon what stuff() does, what the type and scope of tstring is, and what your assembly looks like.
First, measure the speed of the maintainable C++ code. Only if this loop dominates your program's speed should you consider rewriting it.
If you choose to rewrite it, keep both implementations available, and comparatively measure them. Only use the less maintainable version if it is faster, and if the speed increase matters. Also, since you have the original version in place, future readers will be able to understand your intent even if they don't know asm that well.

Profiling a simple, one cycle length operation

We have an assignment where we need to profile a 'simple instruction' (addition or bit-wise and for example). This means performing the same operation a large number of times (100K+) and measuring the average time in microseconds. The result should be presented in cycle-lengths: (totalTime/iterations)*cphMHz.
So, results may vary but all in all we were told that we should get a result close to 1 cycle-length. Actual result doesn't matter as long as programming is correct.
My question is: what is a good operation to profile?
There are two points I need to consider:
I use loop unrolling to be a bit more accurate, so in each iteration I perform 10 simple instructions. This means I have to choose an operation that wouldn't be collapsed into a single one by compiler optimization (we can't use the -O0 flag, as the school staff doesn't).
Bad example: var = i; - the compiler would only perform the last command.
What is a real 'simple instruction'? How do I know the number of operations that are actually performed? I tried reading the assembly output, but I couldn't understand it.
Hope I was clear enough, any idea would be great.
Thanks anyway
P.S. Don't know if it matters, but I write in C++.
1) This sounds (to me) like an impossible task if optimizations are (or might be) enabled. You can never be sure what the compiler will do during optimization. I'd definitely do something like reusing the previous result (see the sketch after point 2). If allowed to/possible, I'd try to include a raw assembler snippet to be profiled (so you can be sure there's no additional overhead; although it could still be optimized).
2) As for instructions: one assembler command is one instruction. E.g. a += i will - depending on the available instruction set and such - most likely result in 4 instructions: read a, read i, add, write a. Reading assembly is pretty much straightforward. Depending on the instruction set/processor, there might be different "directions" for reading (i.e. "from -> to"): x86 assemblers (and those for most other common processors) prefer instruction target, source, while DSPs prefer to use instruction source, target. Just important to know: moving data has to happen through registers, so even a single assignment like a = b will result in two instructions (b to a register, and the register to a).
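A minimal sketch of the "reuse the previous result" idea from point 1, assuming standard C++ timing with std::clock(); a loop-carried multiply is used because a plain additive chain can be folded to a closed form by the optimizer, and printing the result keeps the loop alive:
#include <cstdio>
#include <ctime>

int main() {
    const long iters = 100000000L;
    volatile unsigned seed = 3;      // volatile read blocks compile-time folding
    unsigned x = seed;

    std::clock_t t0 = std::clock();
    for (long i = 0; i < iters; ++i)
        x = x * 3u;                  // each multiply depends on the previous result
    std::clock_t t1 = std::clock();

    std::printf("x=%u\n", x);        // using x prevents dead-code elimination
    std::printf("avg %.3f ns/op\n",
                1e9 * double(t1 - t0) / CLOCKS_PER_SEC / iters);
    return 0;
}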
In general, if this answer goes into the wrong direction, try to elaborate a bit more on your specific task and its requirements (e.g. which compiler is to be used) and drop me a short comment.

How can elusive 64-bit portability issues be detected?

I found a snippet similar to this in some (C++) code I'm preparing for a 64-bit port.
int n;
size_t pos, npos;
/* ... initialization ... */
while ((pos = find(ch, start)) != npos)
{
    /* ... advance start position ... */
    n++; // this will overflow if the loop iterates too many times
}
While I seriously doubt this would actually cause a problem in even memory-intensive applications, it's worth looking at from a theoretical standpoint because similar errors could surface that will cause problems. (Change n to a short in the above example and even small files could overflow the counter.)
Static analysis tools are useful, but they can't detect this kind of error directly. (Not yet, anyway.) The counter n doesn't participate in the while expression at all, so this isn't as simple as other loops (where typecasting errors give the error away). Any tool would need to determine that the loop would execute more than 2^31 times, but that means it needs to be able to estimate how many times the expression (pos = find(ch, start)) != npos will evaluate as true—no small feat! Even if a tool could determine that the loop could execute more than 2^31 times (say, because it recognizes the find function is working on a string), how could it know that the loop won't execute more than 2^64 times, overflowing a size_t value, too?
It seems clear that to conclusively identify and fix this kind of error requires a human eye, but are there patterns that give away this kind of error so it can be manually inspected? What similar errors exist that I should be watchful for?
EDIT 1: Since short, int and long types are inherently problematic, this kind of error could be found by examining every instance of those types. However, given their ubiquity in legacy C++ code, I'm not sure this is practical for a large piece of software. What else gives away this error? Is each while loop likely to exhibit some kind of error like this? (for loops certainly aren't immune to it!) How bad is this kind of error if we're not dealing with 16-bit types like short?
EDIT 2: Here's another example, showing how this error appears in a for loop.
int i = 0;
for (iter = c.begin(); iter != c.end(); iter++, i++)
{
    /* ... */
}
It's fundamentally the same problem: the loop counts in some variable that never directly interacts with a wider type. The variable can still overflow, but no compiler or tool detects a casting error. (Strictly speaking, there is none.)
EDIT 3: The code I'm working with is very large. (10-15 million lines of code for C++ alone.) It's infeasible to inspect all of it, so I'm specifically interested in ways to identify this sort of problem (even if it results in a high false-positive rate) automatically.
Code reviews. Get a bunch of smart people looking at the code.
Use of short, int, or long is a warning sign, because their exact width isn't fixed by the standard. Most usage should be changed to the new int_fastN_t types in <stdint.h>; usage dealing with serialization should use the intN_t types. Well, actually these <stdint.h> types should be used to typedef new application-specific types.
This example really ought to be:
typedef int_fast32_t linecount_appt;
linecount_appt n;
This expresses a design assumption that linecount fits in 32 bits, and also makes it easy to fix the code if the design requirements change.
It's clear that what you need is a smart "range" analyzer tool to determine what the range of computed values is versus the type in which those values are being stored. (Your fundamental objection is to that smart range analyzer being a person.) You might need some additional code annotations (manually well-placed typedefs or assertions that provide explicit range constraints) to enable a good analysis, and to handle otherwise apparently arbitrarily large user input.
You'd need special checks to handle the places where C/C++ says the arithmetic is legal but dumb (e.g., the assumption that you don't want [two's complement] overflows).
For your n++ example (equivalent to n_after = n_before + 1), n_before can be 2^31 - 1 (because of your observations about strings), so n_before + 1 can be 2^31, which overflows a 32-bit int. (Standard C/C++ actually makes signed overflow undefined behavior; in practice, on two's-complement hardware, it wraps around without complaint.)
Our DMS Software Reengineering Toolkit in fact has range analysis machinery built in... but it is not presently connected to DMS's C++ front end; we can only pedal so fast :-{ [We have used it on COBOL programs for different problems involving ranges.]
In the absence of such range analysis, you could probably detect the existence of loops with such dependent flows; the value of n clearly depends on the loop count. I suspect this would flag every loop in the program that had a side effect, which might not be that much help.
Another poster suggests somehow redeclaring all the int-like declarations using application-specific types (e.g., *linecount_appt*) and then typedef'ing those to values that work for your application. To do this, I'd think you'd have to classify each int-like declaration into categories (e.g., "these declarations are all *linecount_appt*"). Doing this by manual inspection for 10M SLOC seems pretty hard and very error-prone. Finding all declarations which receive (by assignment) values from the "same" value sources might be a way to get hints about where such application types are. You'd want to be able to mechanically find such groups of declarations, and then have some tool automatically replace the actual declarations with a designated application type (e.g., *linecount_appt*). This is likely somewhat easier than doing precise range analysis.
There are tools that help find such issues. I won't give any links here because the ones I know of are commercial but should be pretty easy to find.