I wish to implement a simulator for a subset of instructions for the x86 architecture. Given a binary, I wish to disassemble it and run a simulation on the instructions. For that, one would need to look at certain bits of an instruction to decide whether it is a control instruction, arithmetic instruction or a logical instruction and based on that, one must derive the parameters of the operation by looking at the remaining bits. One obvious yet painful way to implement this is by using nested if-else/switch-case statements. Can someone suggest a better methodology for implementing this?
Use a lookup table, perhaps in the form of a std::map.
You can look at the source of an x86 emulator to find an implementation of this idea, already fully written and fleshed out.
Here's one you might try: http://www.dosbox.com/wiki/BuildingDOSBox#1._Grab_the_source
Let me know if this doesn't work out; there are lots to choose from.
In general, with an emulator, I would think that a switch on the opcode would be one way to go. Another good approach would be an 256-entry array of function pointers, corresponding to the first byte of the instruction. That gives a little more separation than a giant switch or if block. Of course you can reuse the functions as needed.
Doing a nested if/else type construct should be fine if you cache the output of the translation. If you are doing simulation you will have relatively few dynamic instructions out of all of the static instructions in the program. So the best performance optimization is to cache the output of the translation and then reuse it when the dynamic instruction executes. Eventually your cache will fill up and you will need to clear it for new entries. But it makes more sense to cache the translation somehow, rather than try to come up with a really fast method of doing the translation in the first place.
As an example QEMU is an emulator that supports a variety of targets that is optimized for performance. You can see how they translate x86 instructions here:
https://github.com/qemu/QEMU/blob/master/target-i386/translate.c#L4076
if QEMU did this for every instruction the performance would be very slow. But since the cache the results it does not matter too much that the first time an instruction is translated there is a complex case statement.
Related
I started writing a DCPU-16 emulator using this v1.7 spec. I started laying down the architecture, and I don't like the fact that I'm using very long switch statements. This is my first time writing an emulator, and so I don't know if there's better way to be doing it. While the switches aren't that large, due to the DCPU's small number of opcodes (and the fact that I haven't actually implemented the instructions yet), I can imagine if I were writing an emulator for a larger instruction set the switch statements would be huge.
Anywhom, here's my code.
EDIT: I forgot to get my question across:
Is there a better way to design an emulator than using a massive switch?
This approach seems reasonable to me. It is certainly how I would do it (I have written a few CPU emulators and similar types of code).
The nearest alternative is a set of function pointers, but some of your cases will probably be rather simple (e.g. cpu_regs.flags &= ~CARRY or if (cpu_regs.flags & CARRY) do_rel_jump(next_byte());, so using function pointers will slow you down.
You can bunch all the "No Operation Specified yet" to one place, that will make it a lot shorter in number of lines, but the number of cases will of course still be the same [unless you put it in default:].
I'm planning on writing a game with C++, and it will be extremely CPU-intensive (pathfinding,genetic algorithms, neural networks, ...)
So I've been thinking about how to tackle this situation best so that it would run smoothly.
(let this top section of this question be side information, I don't want it to restrict the main question, but it would be nice if you could give me side notes as well)
Is it worth it to learn how to work with ASM, so I can make ASM calls in C++,
can it give me a significant/notable performance advantage?
In what situations should I use it?
Almost never:
You only want to be using it once you've profiled your C++ code and have identified a particular section as a bottleneck.
And even then, you only want to do it once you've exhausted all C++ optimization options.
And even then, you only want to be using ASM for tight, inner loops.
And even then, it takes quite a lot of effort and skill to beat a C++ compiler on a modern platform.
If your not an experienced assembly programmer, I doubt you will be able to optimize assembly code better than your compiler.
Also note that assembly is not portable. If you decide to go this way, you will have to write different assembly for all the architectures you decide to support.
Short answer: it depends, most likely you won't need it.
Don't start optimizing prematurely. Write code that is also easy to read and to modify. Separate logical sections into modules. Write something that is easy to extend.
Do some profiling.
You can't tell where your bottlenecks are unless you profile your code. 99% of the time you won't get that much performance gain by writing asm. There's a high chance you might even worsen your performance. Optimizers nowadays are very good at what they do. If you do have a bottleneck, it will most probably be because of some poorly chosen algorithm or at least something that can be remedied at a high-level.
My suggestion is, even if you do learn asm, which is a good thing, don't do it just so you can optimize.
Profile profile profile....
A legitimate use case for going low-level (although sometimes a compiler can infer it for you) is to make use of SIMD instructions such as SSE. I would assume that at least some of the algorithms you mention will benefit from parallel processing.
However, you don't need to write actual assembly, instead you can simply use intrinsic functions. See, e.g. this.
Don't get ahead of yourself.
I've posted a sourceforge project showing how a simulation program was massively speeded up (over 700x).
This was not done by assuming in advance what needed to be made fast.
It was done by "profiling", which I put in quotes because the method I use is not to employ a profiler.
Rather I rely on random pausing, a method known and used to good effect by some programmers.
It proceeds through a series of iterations.
In each iteration a large source of time-consumption is identified and fixed, resulting in a certain speedup ratio.
As you proceed through multiple iterations, these speedup ratios multiply together (like compound interest).
That's how you get major speedup.
If, and only if, you get to a point where some code is taking a large fraction of time, and it doesn't contain any function calls, and you think you can write assembly code better than the compiler does, then go for it.
P.S. If you're wondering, the difference between using a profiler and random pausing is that profilers look for "bottlenecks", on the assumption that those are localized things. They look for routines or lines of code that are responsible for a large percent of overall time.
What they miss is problems that are diffuse.
For example, you could have 100 routines, each taking 1% of time.
That is, no bottlenecks.
However, there could be an activity being done within many or all of those routines, accounting for 1/3 of the time, that could be done better or not at all.
Random pausing will see that activity with a small number of samples, because you don't summarize, you examine the samples.
In other words, if you took 9 samples, on average you would notice the activity on 3 of them.
That tells you it's big.
So you can fix it and get your 3/2 speedup ratio.
"To understand recursion, you must first understand recursion." That quote comes to mind when I consider my response to your question, which is "until you understand when to use assembly, you should never use assembly." After you have completely implemented your soution, extensively profiled its performance and determined precise bottlenecks, and experimented with several alternative solutions, then you can begin to consider using assembly. If you code a single line of assembly before you have a working and extensively profiled program, you have made a mistake.
If you need to ask than you don't need it.
I sunk about a month of full time into a native C++ equation parser. It works, except it is slow (between 30-100 times slower than a hard-coded equation). What can I change to make it faster?
I read everything I could find on efficient code. In broad strokes:
The parser converts a string equation expression into a list of "operation" objects.
An operation object has two function pointers: a "getSource" and a "evaluate".
To evaluate an equation, all I do is a for loop on the operation list, calling each function in turn.
There isn't a single if / switch encountered when evaluating an equation - all conditionals are handled by the parser when it originally assigned the function pointers.
I tried inlining all the functions to which the function pointers point - no improvement.
Would switching from function pointers to functors help?
How about removing the function pointer framework, and instead creating a full set of derived "operation" classes, each with its own virtual "getSource" and "evaluate" functions? (But doesn't this just move the function pointers into the vtable?)
I have a lot of code. Not sure what to distill / post. Ask for some aspect of it, and ye shall receive.
In your post you don't mention that you have profiled the code. This is the first thing I would do if I were in your shoes. It'll give you a good idea of where the time is spent and where to focus your optimization efforts.
It's hard to tell from your description if the slowness includes parsing, or it is just the interpretation time.
The parser, if you write it as recursive-descent (LL1) should be I/O bound. In other words, the reading of characters by the parser, and construction of your parse tree, should take a lot less time than it takes to simply read the file into a buffer.
The interpretation is another matter.
The speed differential between interpreted and compiled code is usually 10-100 times slower, unless the basic operations themselves are lengthy.
That said, you can still optimize it.
You could profile, but in such a simple case, you could also just single-step the program, in the debugger, at the level of individual instructions.
That way, you are "walking in the computer's shoes" and it will be obvious what can be improved.
Whenever I'm doing what you're doing, that is, providing a language to the user, but I want the language to have fast execution, what I do is this:
I translate the source language into a language I have a compiler for, and then compile it on-the-fly into a .dll (or .exe) and run that.
It's very quick, and I don't need to write an interpreter or worry about how fast it is.
The very first thing is: Profile what actually went wrong. Is the bottleneck in parsing or in evaluation? valgrind offers some tools that can help you here.
If it's in parsing, boost::spirit might help you. If in evaluation, remember that virtual functions can be pretty slow to evaluate. I've made pretty good experiences with recursive boost::variant's.
You know, building an expression recursive descent parser is really easy, the LL(1) grammar for expressions is only a couple of rules. Parsing then becomes a linear affair and everything else can work on the expression tree (while parsing basically); you'd collect the data from the lower nodes and pass it up to the higher nodes for aggregation.
This would avoid altogether function/class pointers to determine the call path at runtime, relying instead of proven recursivity (or you can build an iterative LL parser if you wish).
It seems that you're using a quite complicated data structure (as I understand it, a syntax tree with pointers etc.). Thus, walking through pointer dereference is not very efficient memory-wise (lots of random accesses) and could slow you down significantly. As Mike Dunlavey proposed, you could compile the whole expression at runtime using another language or by embedding a compiler (such as LLVM). For what I know, Microsoft .Net provides this feature (dynamic compilation) with Reflection.Emit and Linq.Expression trees.
This is one of those rare times that I'd advise against profiling just yet. My immediate guess is that the basic structure you're using is the real source of the problem. Profiling the code is rarely worth much until you're reasonably certain the basic structure is reasonable, and it's mostly a matter of finding which parts of that basic structure can be improved. It's not so useful when what you really need to do is throw out most of what you have, and basically start over.
I'd advise converting the input to RPN. To execute this, the only data structure you need is a stack. Basically, when you get to an operand, you push it on the stack. When you encounter an operator, it operates on the items at the top of the stack. When you're done evaluating a well-formed expression, you should have exactly one item on the stack, which is the value of the expression.
Just about the only thing that will usually give better performance than this is to do like #Mike Dunlavey advised, and just generate source code and run it through a "real" compiler. That is, however, a fairly "heavy" solution. If you really need maximum speed, it's clearly the best solution -- but if you just want to improve what you're doing now, converting to RPN and interpreting that will usually give a pretty decent speed improvement for a small amount of code.
I have a very large program which I have been compiling under visual studio (v6 then migrated to 2008). I need the executable to run as fast as possible. The program spends most of its time processing integers of various sizes and does very little IO.
Obviously I will select maximum optimization, but it seems that there are a variety of things that can be done which don't come under the heading of optimization which do still affect the speed of the executable. For example selecting the __fastcall calling convention or setting structure member alignment to a large number.
So my question is: Are there other compiler/linker options I should be using to make the program faster which are not controlled from the "optimization" page of the "properties" dialog.
EDIT: I already make extensive use of profilers.
Another optimization option to consider is optimizing for size. Sometimes size-optimized code can run faster than speed-optimized code due to better cache locality.
Also, beyond optimization operations, run the code under a profiler and see where the bottlenecks are. Time spent with a good profiler can reap major dividends in performance (especially it if gives feedback on the cache-friendliness of your code).
And ultimately, you'll probably never know what "as fast as possible" is. You'll eventually need to settle for "this is fast enough for our purposes".
Profile-guided optimization can result in a large speedup. My application runs about 30% faster with a PGO build than a normal optimized build. Basically, you run your application once and let Visual Studio profile it, and then it is built again with optimization based on the data collected.
1) Reduce aliasing by using __restrict.
2) Help the compiler in common subexpression elimination / dead code elimination by using __pure.
3) An introduction to SSE/SIMD can be found here and here. The internet isn't exactly overflowing with articles about the topic, but there's enough. For a reference list of intrinsics, you can search MSDN for 'compiler intrinsics'.
4) For 'macro parallelization', you can try OpenMP. It's a compiler standard for easy task parallelization -- essentially, you tell the compiler using a handful of #pragmas that certain sections of the code are reentrant, and the compiler creates the threads for you automagically.
5) I second interjay's point that PGO can be pretty helpful. And unlike #3 and #4, it's almost effortless to add in.
You're asking which compiler options can help you speed up your program, but here's some general optimisation tips:
1) Ensure your algorithms are appropriate for the job. No amount of fiddling with compiler options will help you if you write an O(shit squared) algorithm.
2) There's no hard and fast rules for compiler options. Sometimes optimise for speed, sometimes optimise for size, and make sure you time the differences!
3) Understand the platform you are working on. Understand how the caches for that CPU operate, and write code that specifically takes advantage of the hardware. Make sure you're not following pointers everywhere to get access to data which will thrash the cache. Understand the SIMD operations available to you and use the intrinsics rather than writing assembly. Only write assembly if the compiler is definitely not generating the right code (i.e. writing to uncached memory in bad ways). Make sure you use __restrict on pointers that will not alias. Some platforms prefer you to pass vector variables by value rather than by reference as they can sit in registers - I could go on with this but this should be enough to point you in the right direction!
Hope this helps,
-Tom
Forget micro-optimization such as what you are describing. Run your application through a profiler (there is one included in Visual Studio, at least in some editions). The profiler will tell you where your application is spending its time.
Micro-optimization will rarely give you more than a few percentage points increase in performance. To get a really big boost, you need to identify areas in your code where inefficient algorithms and/or data structures are being used. Focus on those, for example by changing algorithms. The profiler will help identify these problem areas.
Check which /precision mode you are using. Each one generates quite different code and you need to choose based on what accuracy is required in your app. Our code needs precision (geometry, graphics code) but we still use /fp:fast (C/C++ -> Code generation options).
Also make sure you have /arch:SSE2, assuming your deployment covers processors that all support SSE2. This will result is quite a big difference in performance, as compile will use very few cycles. Details are nicely covered in the blog SomeAssemblyRequired
Since you are already profiling, I would suggest loop unrolling if it is not happening. I have seen VS2008 not doing it more frequently (templates, references etc..)
Use __forceinline in hotspots if applicable.
Change hotspots of your code to use SSE2 etc as your app seems to be compute intense.
You should always address your algorithm and optimise that before relying on compiler optimisations to get you significant improvements in most cases.
Also you can throw hardware at the problem. Your PC may already have the necessary hardware lying around mostly unused: the GPU! One way of improving performance of some types of computationally expensive processing is to execute it on the GPU. This is hardware specific but NVIDIA provide an API for exactly that: CUDA. Using the GPU is likely to get you far greater improvement than using the CPU.
I agree with what everyone has said about profiling. However you mention "integers of various sizes". If you are doing much arithmetic with mismatched integers a lot of time can be wasted in changing sizes, shorts to ints for example, when the expressions are evaluated.
I'll throw in one more thing too. Probably the most significant optimisation is in choosing and implementing the best algorithm.
You have three ways to speed up your application:
Better algorithm - you've not specified the algorithm or the data types (is there an upper limit to integer size?) or what output you want.
Macro parallelisation - split the task into chunks and give each chunk to a separate CPU, so, on a two core cpu divide the integer set into two sets and give half to each cpu. This depends on the algorithm you're using - not all algorithms can be processed like this.
Micro parallelisation - this is like the above but uses SIMD. You can combine this with point 2 as well.
You say the program is very large. That tells me it probably has many classes in a hierarchy.
My experience with that kind of program is that, while you are probably assuming that the basic structure is just about right, and to get better speed you need to worry about low-level optimization, chances are very good that there are large opportunities for optimization that are not of the low-level kind.
Unless the program has already been tuned aggressively, there may be room for massive speedup in the form of mid-stack operations that can be done differently. These are usually very innocent-looking and would never grab your attention. They are not cases of "improve the algorithm". They are usually cases of "good design" that just happen to be on the critical path.
Unfortunately, you cannot rely on profilers to find these things, because they are not designed to look for them.
This is an example of what I'm talking about.
I'm thinking about the tokenizer here.
Each token calls a different function inside the parser.
What is more efficient:
A map of std::functions/boost::functions
A switch case
I would suggest reading switch() vs. lookup table? from Joel on Software. Particularly, this response is interesting:
" Prime example of people wasting time
trying to optimize the least
significant thing."
Yes and no. In a VM, you typically
call tiny functions that each do very
little. It's the not the call/return
that hurts you as much as the preamble
and clean-up routine for each function
often being a significant percentage
of the execution time. This has been
researched to death, especially by
people who've implemented threaded
interpreters.
In virtual machines, lookup tables storing computed addresses to call are usually preferred to switches. (direct threading, or "label as values". directly calls the label address stored in the lookup table) That's because it allows, under certain conditions, to reduce branch misprediction, which is extremely expensive in long-pipelined CPUs (it forces to flush the pipeline). It, however, makes the code less portable.
This issue has been discussed extensively in the VM community, I would suggest you to look for scholar papers in this field if you want to read more about it. Ertl & Gregg wrote a great article on this topic in 2001, The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures
But as mentioned, I'm pretty sure that these details are not relevant for your code. These are small details, and you should not focus too much on it. Python interpreter is using switches, because they think it makes the code more readable. Why don't you pick the usage you're the most comfortable with? Performance impact will be rather small, you'd better focus on code readability for now ;)
Edit: If it matters, using a hash table will always be slower than a lookup table. For a lookup table, you use enum types for your "keys", and the value is retrieved using a single indirect jump. This is a single assembly operation. O(1). A hash table lookup first requires to calculate a hash, then to retrieve the value, which is way more expensive.
Using an array where the function addresses are stored, and accessed using values of an enum is good. But using a hash table to do the same adds an important overhead
To sum up, we have:
cost(Hash_table) >> cost(direct_lookup_table)
cost(direct_lookup_table) ~= cost(switch) if your compiler translates switches into lookup tables.
cost(switch) >> cost(direct_lookup_table) (O(N) vs O(1)) if your compiler does not translate switches and use conditionals, but I can't think of any compiler doing this.
But inlined direct threading makes the code less readable.
STL Map that comes with visual studio 2008 will give you O(log(n)) for each function call since it hides a tree structure beneath.
With modern compiler (depending on implementation) , A switch statement will give you O(1) , the compiler translates it to some kind of lookup table.
So in general , switch is faster.
However , consider the following facts:
The difference between map and switch is that : Map can be built dynamically while switch can't. Map can contain any arbitrary type as a key while switch is very limited to c++ Primitive types (char , int , enum , etc...).
By the way , you can use a hash map to achieve nearly O(1) dispatching (though , depending on the hash table implementation , it can sometimes be O(n) at worst case). Even though , switch will still be faster.
Edit
I am writing the following only for fun and for the matter of the discussion
I can suggest an nice optimization for you but it depends on the nature of your language and whether you can expect how your language will be used.
When you write the code:
You divide your tokens into two groups , one group will be of very High frequently used and the other of low frequently used. You also sort the high frequently used tokens.
For the high frequently tokens you write an if-else series with the highest frequently used coming first. for the low frequently used , you write a switch statement.
The idea is to use the CPU branch prediction in order to even avoid another level of indirection (assuming the condition checking in the if statement is nearly costless).
in most cases the CPU will pick the correct branch without any level of indirection . They will be few cases however that the branch will go to the wrong place.
Depending on the nature of your languege , Statisticly it may give a better performance.
Edit : Due to some comments below , Changed The sentence telling that compilers will allways translate a switch to LUT.
What is your definition of "efficient"? If you mean faster, then you probably should profile some test code for a definite answer. If you're after flexible and easy-to-extend code though, then do yourself a favor and use the map approach. Everything else is just premature optimization...
Like yossi1981 said, a switch could be optimized of beeing a fast lookup-table but there is not guarantee, every compiler has other algorithms to determine whether to implement the switch as consecutive if's or as fast lookup table, or maybe a combination of both.
To gain a fast switch your values should meet the following rule:
they should be consecutive, that is e.g. 0,1,2,3,4. You can leave some values out but things like 0,1,2,34,43 are extremely unlikely to be optimized.
The question really is: is the performance of such significance in your application?
And wouldn't a map which loads its values dynamically from a file be more readable and maintainable instead of a huge statement which spans multiple pages of code?
You don't say what type your tokens are. If they are not integers, you don't have a choice - switches only work with integer types.
The C++ standard says nothing about the performance of its requirements, only that the functionality should be there.
These sort of questions about which is better or faster or more efficient are meaningless unless you state which implementation you're talking about. For example, the string handling in a certain version of a certain implementation of JavaScript was atrocious, but you can't extrapolate that to being a feature of the relevant standard.
I would even go so far as to say it doesn't matter regardless of the implementation since the functionality provided by switch and std::map is different (although there's overlap).
These sort of micro-optimizations are almost never necessary, in my opinion.