I'm writing a performance-critical, number-crunching C++ project where 70% of the time is used by the 200 line core module.
I'd like to optimize the core using inline assembly, but I'm completely new to this. I do, however, know some x86 assembly, including the syntaxes used by GCC and NASM.
All I know:
I have to put the assembler instructions in _asm{} where I want them to be.
Problem:
I have no clue where to start. What is in which register at the moment my inline assembly comes into play?
You can access variables by their name and copy them to registers.
Here's an example from MSDN:
int power2( int num, int power )
{
__asm
{
mov eax, num ; Get first argument
mov ecx, power ; Get second argument
shl eax, cl ; EAX = EAX * ( 2 to the power of CL )
}
// Return with result in EAX
}
Using C or C++ in ASM blocks might also be interesting for you.
The Microsoft compiler is very poor at optimisation once inline assembly gets involved. It has to back up registers around the block: if your block uses eax, the compiler does not move whatever it was keeping in eax to another free register; it keeps using eax and saves and restores it instead. GCC's inline assembler is far more advanced on this front.
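To make the GCC remark concrete, here is a minimal sketch of the MSDN power2 example rewritten with GCC/Clang extended asm. The name power2_gcc is made up for illustration, and the plain C++ fallback for non-x86 targets is my own addition. The point is that the constraints tell the compiler which registers the block needs, so it can plan its register allocation around the block instead of blindly saving and restoring.

```cpp
#include <cassert>

// Hypothetical GCC/Clang counterpart of the MSDN power2 example.
// Returns num * 2^power for small, non-negative inputs.
int power2_gcc(int num, int power) {
    assert(num >= 0 && power >= 0 && power < 31);
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    // "+r": num is read and written, in whatever register the compiler picks.
    // "c" : power must be in ECX, because SHL takes a variable count from CL.
    // "cc": the instruction modifies the flags.
    __asm__("shll %%cl, %0" : "+r"(num) : "c"(power) : "cc");
    return num;
#else
    // Fallback (an assumption for other targets): let the compiler emit the shift.
    return num << power;
#endif
}
```

Calling power2_gcc(3, 5) gives 96, the same as 3 << 5. Nothing here is MSVC syntax; it only compiles the asm path with GCC-compatible compilers on x86.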
To get round this, Microsoft started offering intrinsics. These are a far better way to do your optimisation, as they allow the compiler to work with you. As Chris mentioned, inline assembly doesn't work under x64 with the MS compiler either, so on that platform you REALLY are better off just using the intrinsics.
They are easy to use and give good performance. I will admit I am often able to squeeze a few more cycles out of the code by using an external assembler, but they're bloody good for the productivity improvement they provide.
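As a rough illustration of what intrinsics look like, here is a sketch that adds two float arrays four elements at a time with SSE. The function name add_arrays and the scalar fallback are assumptions for the example, not anything from the original code; _mm_loadu_ps, _mm_add_ps and _mm_storeu_ps are the standard SSE intrinsics from xmmintrin.h.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics
#endif

// Hypothetical example: out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    std::size_t i = 0;
#if defined(__SSE__) || defined(_M_X64)
    // Process four floats per iteration; the compiler allocates the XMM registers.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail (and the whole loop on targets without SSE).
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

No registers are named anywhere, which is exactly the productivity win described above: the same source builds as 32-bit or 64-bit code.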
Nothing you can rely on is in the registers when the __asm block is executed. You need to move your data into the registers yourself. If there is a variable 'a', then you would need to write:
__asm {
mov eax, [a]
}
It is worth pointing out that VS2010 comes with Microsoft's assembler. Right-click on a project, go to build rules, turn on the assembler build rules, and the IDE will then process .asm files.
This is a somewhat better solution, as VS2010 supports both 32-bit AND 64-bit projects and the __asm keyword does NOT work in 64-bit builds. You MUST use an external assembler for 64-bit code :/
I prefer writing entire functions in assembly rather than using inline assembly. This allows you to swap out the high level language function with the assembly one during the build process. Also, you don't have to worry about compiler optimizations getting in the way.
Before you write a single line of assembly, print out the assembly language listing for your function. This gives you a foundation to build upon or modify. Another helpful tool is the interweaving of assembly with source code. This will tell you how the compiler is coding specific statements.
If you need to insert inline assembly for a large function, make a new function for the code that you need to inline. Again replace with C++ or assembly during build time.
These are my suggestions, Your Mileage May Vary (YMMV).
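A minimal sketch of the "swap at build time" idea, assuming a made-up kernel called dot_kernel and a made-up build macro USE_ASM_KERNEL. With the macro defined, the linker is expected to find the symbol in a separately assembled object file; without it, the C++ reference version is compiled in, so you always have something to test the assembly against.

```cpp
#include <cassert>
#include <cstddef>

#ifdef USE_ASM_KERNEL
// Hypothetical: implemented in a separate .asm/.s file and linked in.
extern "C" double dot_kernel(const double* a, const double* b, std::size_t n);
#else
// Reference C++ implementation with the same name and C linkage,
// so callers do not change when the assembly version is swapped in.
extern "C" double dot_kernel(const double* a, const double* b, std::size_t n) {
    assert(n == 0 || (a && b));
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
#endif
```

Using extern "C" avoids C++ name mangling, which keeps the symbol name in the assembly file simple.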
Go for the low hanging fruit first...
As others have said, the Microsoft compiler is pretty poor at optimisation. You may be able to save yourself a lot of effort just by investing in a decent compiler, such as Intel's ICC, and re-compiling the code "as is". You can get a 30 day free evaluation license from Intel and try it out.
Also, if you have the option to build a 64-bit executable, then running in 64-bit mode can yield a performance improvement of up to around 30%, because twice as many general-purpose registers are available.
I really like assembly, so I'm not going to be a nay-sayer here. It appears that you've profiled your code and found the 'hotspot', which is the correct way to start. I also assume that the 200 lines in question don't use a lot of high-level constructs like vector.
I do have to give one bit of warning: if the number-crunching involves floating-point math, you are in for a world of pain, specifically a whole set of specialized instructions, and a college term's worth of algorithmic study.
All that said: if I were you, I'd step through the code in question in the VS debugger, using the Disassembly view. If you feel comfortable reading the code as you go along, that's a good sign. After that, do a Release compile (Debug turns off optimization) and generate an ASM listing for that module. Then if you think you see room for improvement...you have a place to start. Other people's answers have linked to the MSDN documentation, which is really pretty skimpy but still a reasonable start.
Related
I wrote a program that does nothing, in C++
void main(void){
}
and in assembly.
.global _start
.text
_start:
mov $60, %rax
xor %rdi, %rdi
syscall
I compiled the C code, and assembled and linked the assembly code. Then I compared the two executables with the time command.
Assembly
time ./Assembly
real 0m0.001s
user 0m0.000s
sys 0m0.000s
C
time ./C
real 0m0.002s
user 0m0.000s
sys 0m0.000s
Assembly is two times faster than C. I disassembled both programs. In the assembly program there were only four lines of code (the same ones I wrote). In the C program there was a lot of seemingly unnecessary code for linking main to _start. main itself had four lines of code, three of which seem to exist only to make 'local' variables (such as a function's variables) inaccessible from outside their 'block' (such as the function body).
push %rbp ; push the base pointer.
mov %rsp, %rbp ; copy the stack pointer to the base pointer; the stack pointer is used for storing variables.
pop %rbp ; 'local' variables are removed, because we pop the base pointer
retq ; ?
Why is that?
The amount of time required to execute the core of the program you've written is incredibly small. Figure that it consists of three or four assembly instructions; at several gigahertz, those will only require a couple of nanoseconds to run. That's such a small amount of time that it's vastly below the detection threshold of the time program, whose resolution is measured in milliseconds (remember that a millisecond is a million times longer than a nanosecond!). So in that sense, I would be very careful about judging one program to be "twice as fast" as the other; the resolution of your timer isn't high enough to say that for certain. You might just be seeing noise.
Your question, though, was why there is all this automatically generated code if nothing is going to happen. The answer is "it depends." With no optimization turned on, most compilers generate assembly code that faithfully simulates the program you wrote, possibly doing more work than is necessary. Since most C and C++ functions actually contain code that does something, need local variables, and so on, a compiler wouldn't be too wrong in emitting code at the start and end of a function to set up the stack and frame pointer properly to support those variables. With optimization turned up to the max, an optimizing compiler might be smart enough to notice that this isn't necessary and to remove that code, but it's not required to.
In principle, a perfect compiler would always emit the fastest code possible, but it turns out that it's impossible to build a compiler that will always do this (this has to do with things like the undecidability of the halting problem). Therefore, it's somewhat assumed that the code generated will be good - even great - but not optimal. However, it's a tradeoff. Yes, the code might not be as fast as it could possibly be, but by working in languages like C and C++ it's possible to write large and complex programs in a way that's (compared to assembly) easy to read, easy to write, and easy to maintain. We're okay with the slight performance hit because in practice it's not too bad and most optimizing compilers are good enough to make the price negligible (or even negative, if the optimizing compiler finds a better approach to solving a problem than the human!)
To summarize:
Your timing mechanism is probably not sufficient to make the conclusions that you're making. You'll need a higher-precision timer than that.
Compilers often generate unnecessary code in the interest of simplicity. Optimizing compilers often remove that code, but can't always.
We're okay paying the cost of using higher-level languages in terms of raw runtime because of the ease of development. In fact, it might actually be a net win to use a high-level language with a good optimizing compiler, since it offloads the optimization complexity.
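On the first point in the summary, one common way around a coarse timer is to repeat the work many times and divide. A minimal sketch using std::chrono::steady_clock; the helper name average_ns is made up for illustration, and this still includes loop overhead, so treat the result as a rough figure rather than a precise measurement.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs work() `repetitions` times and returns the
// average wall-clock time per call in nanoseconds.
template <typename F>
double average_ns(F&& work, std::uint64_t repetitions) {
    assert(repetitions > 0);
    using clock = std::chrono::steady_clock;  // monotonic, unaffected by clock changes
    const auto start = clock::now();
    for (std::uint64_t i = 0; i < repetitions; ++i) work();
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(repetitions);
}
```

This measures code inside one running process, so it deliberately excludes the process startup cost that dominates the time ./C versus time ./Assembly comparison. Make sure the work has a visible side effect (for example a write to a volatile variable), otherwise an optimizing compiler may delete the loop.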
All the extra time from C is dynamic linker and CRT overhead. The asm program is statically linked, and just calls exit(2) (the syscall directly, not the glibc wrapper). Of course it's faster, but it's just startup overhead and doesn't tell you anything about how fast compiler-emitted code that actually does anything will run.
i.e. if you wrote some code to actually do something in C, and compiled it with gcc -O3 -march=native, you'd expect it to be ~0.001 seconds slower than a statically linked binary with no CRT overhead. (That is, if your hand-written asm and the compiler output were both near-optimal, e.g. if you used the compiler output as a starting point for a hand-optimized version but didn't find anything major. It's usually possible to make some improvements to compiler output, but often just to code size, probably without much effect on speed.)
If you want to call malloc or printf, then the startup overhead is not useless; it's actually necessary to initialize glibc internal data structures so that library functions don't have any overhead of checking that stuff is initialized every time they're called.
From a statically linked hand-written asm program that links glibc, you need to call __libc_init_first, __dl_tls_setup, and __libc_csu_init, in that order, before you can safely use all libc functions.
Anyway, ideally you can expect a constant time difference from the startup overhead, not a factor of 2 difference.
If you're good at writing optimal asm, you can usually do a better job than the compiler on a local scale, but compilers are really good at global optimizations. Moreover, they do it in seconds of CPU time (very cheap) instead of weeks of human effort (very precious).
It can make sense to hand-craft a critical loop, e.g. as part of a video encoder, but even video encoders (like x264, x265, and vpx) have most of the logic written in C or C++, and just call asm functions.
The extra push/mov/pop instructions are because you compiled with optimization disabled, where -fno-omit-frame-pointer is the default, and makes a stack frame even for leaf functions. gcc defaults to -fomit-frame-pointer at -O1 and higher on x86 and x86-64 (since modern debug metadata formats mean it's not needed for debugging or exception-handling stack unwinding).
If you'd told your C compiler to make fast code (-O3), instead of to compile quickly and make dumb code that works well in a debugger (-O0), you would have gotten code like this for main (from the Godbolt compiler explorer):
// this is valid C++ and C99, but C89 doesn't have an implicit return 0 in main.
int main(void) {}
xor eax, eax
ret
To learn more about assembly and how everything works, have a look at some of the links in the x86 tag wiki. Perhaps Programming From the Ground Up would be a good start; it probably explains compilers and dynamic linking.
A much shorter article is A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux, which starts with what you did, and then gets down to having _start overlap with some other ELF headers so the file can be even smaller.
Did you compile with optimizations enabled? If not, then this is invalid.
Did you consider that this is a completely trivial example that will have no real-life performance implications worth writing even a postcard about?
Please write clear maintainable code and (in 99% of cases) leave the optimization to the compiler. Please.
Is it possible to somehow convert simple C or C++ code (by simple I mean: taking some int as input, printing some simple shapes dependent on that int as output) to assembly language? If there isn't, I'll just do it manually, but since I'm going to be doing it for processors like the Intel 8080, it seemed a bit tedious. Can you somehow automate the process?
Also, if there is a way, how good (as in: elegant) would the output assembly file source code be when compared to just translating it manually?
Most compilers will let you produce assembly output. For a couple of obvious examples, Clang and gcc/g++ use the -S flag, and MS VC++ uses the -Fa flag to do so.
A few compilers don't support this directly (e.g., if memory serves Watcom didn't). The ones I've seen like this had you produce an object file, and then included a disassembler that would produce an assembly language file from the object file. I don't remember for sure, but it wouldn't surprise me if this is what you'd need to do with the Digital Mars compiler.
To somebody who's accustomed to writing assembly language, the output from most compilers typically tends to look at least somewhat inelegant, especially on a CPU like an x86 that has quite a few registers that are now really general purpose, but have historically had more specific meanings. For example, if some piece of code needs both a pointer and a counter, a person would probably put the pointer in ESI or EDI, and the counter in ECX. The compiler might easily reverse those. That'll work fine, but an experienced assembly language programmer will undoubtedly find it more readable using ESI for the pointer and ECX for the counter.
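If you want to see the register choices for yourself, a tiny function with one pointer and one counter is enough. A sketch (the function name sum_bytes and the file name sum.cpp are made up; the flags are the ones mentioned above):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical test function: one pointer, one counter.
// To inspect the generated assembly, compile with for example
//   g++ -O2 -S sum.cpp      (writes sum.s)
//   cl /c /Fa sum.cpp       (MS VC++, writes sum.asm)
unsigned sum_bytes(const unsigned char* p, std::size_t count) {
    assert(p != nullptr || count == 0);
    unsigned total = 0;
    while (count-- != 0) {
        total += *p++;  // look for which registers hold p and count
    }
    return total;
}
```

Which registers end up holding the pointer and the counter depends on the compiler, the target and the optimization level, so expect the listing to differ from what you would write by hand.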
Take a look at gcc -S:
gcc -S hello.c # outputs hello.s file
Other compilers that maintain at least partial gcc compatibility may also accept this flag. LLVM's clang, for example, does.
Well, yes there is such a program. It's called "Compiler"
To answer your edit: The elegance of the output depends on the optimization of your compiler. Usually compilers do not generate code we humans would call "elegant".
Most folks here are right, but seem to have missed the note about 8080 (no wonder, it's not in the title :). However, Google comes to the rescue as always - looking for compiler for 8080 produces some nice results like these:
http://www.bdsoft.com/resources/bdsc.html
http://tack.sourceforge.net/
Most of these are pretty old and might be poorly maintained. You might also try the 8085, which is fairly similar.
(by simple I mean: taking some int as input, printing some simple shapes dependent on that int as output) to assembly language?
Looking at the output of an x86 compiler is not going to be very helpful, since inputting and outputting are done by a C or C++ library. With an 8080 there is no such library so you have to develop your own I/O routines for some particular hardware. That's lots and lots of additional work.
I am porting inline assembler that uses SSE instructions to intrinsics. It takes a lot of work to find the appropriate intrinsic for each assembler instruction. Somewhere on the Internet I saw a Python script that simplifies the job, but I cannot find it now.
I don't think you will be happy with such a script.
First, in my opinion intrinsics are only useful for one- or two-liners; if you have more instructions, it is probably better to have a separate assembler file. Also, with a long listing of assembler instructions you will have to check the result anyway, which means understanding each instruction and its effect, so you could basically write it again by hand in the same time.
Second, I think you are looking for something like this because you want to port a piece of software from 32 bit to 64 bit, right? My experience tells me that you will run into some strange errors caused by unexpected type casts if you don't look at every line of code.
Third, are you talking about Visual Studio? Is there any other compiler which supports intrinsics? We had some strange errors while porting our software using intrinsics, because of some ugly compiler bugs with intrinsics, mostly messing up the stack. We had a lot of trouble finding these things and ended up writing these functions in assembler.
So my suggestion is to be careful with intrinsics!
I'm not aware of a script that will do exactly what you're asking. A lot of cases will also have non-SSE instructions interleaved into the assembly, and not every assembly instruction can be mapped to an intrinsic or a primitive C operation.
I suppose you can probably hack your way through it with find-and-replace. (This actually might not be that bad. How much code are you trying to port? Thousands of lines?)
Also, VC++ doesn't allow inline assembly at all on 64-bit. So everything needs to be done using intrinsics or a completely separate assembly file.
I won't go so far as to say that using intrinsics is completely inferior to assembly (assuming you know what you're doing), but writing good intrinsic code that compiles well and runs as fast as optimized assembly is a work of art of its own. But it retains two advantages: portability, and ease of use (no need to manually allocate registers).
I created my own script to convert inline assembler to intrinsics. It does a lot of the rough work.
https://github.com/KindDragon/Asm2Intrinsics
I am porting 32 bit C++ code into 64 bit in VS2008. The old code has mixed code written in assembly.
Example:
__asm
{
mov esi, pbSrc
mov edi, pbDest
...
}
I have read somewhere that I need to remove all the assembly code, put it in a separate project and somehow link to it. Can somebody give me the step-by-step procedure to do this? I know C++ & C# but don't know assembly language. Thanks in advance.
Visual C++ 2005, 2008, or 2010 don't support assembly inlining when compiling for 64-bit platforms. x64 ASM just has way too many intricacies and complexities that can screw up the C(++) code around the __asm block so they just don't allow it. In short, there's no way to "port" it.
I suggest using 64-bit intrinsics with #define and #ifdef when trying to cross-architecture compile such low-level code. Check out: http://msdn.microsoft.com/en-us/library/h2k70f3s.aspx
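The #ifdef approach could look roughly like the sketch below. Since the original __asm block is elided, I am only guessing that it is a byte copy from pbSrc to pbDest; the function name copy_bytes and that interpretation are assumptions. __movsb is the MSVC intrinsic from intrin.h that emits rep movsb, and the std::memcpy branch covers every other compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#if defined(_MSC_VER)
#include <intrin.h>  // MSVC compiler intrinsics
#endif

// Hypothetical replacement for an inline-asm byte copy (esi = source, edi = destination).
void copy_bytes(unsigned char* pbDest, const unsigned char* pbSrc, std::size_t count) {
    assert(count == 0 || (pbDest && pbSrc));
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    // Same source for 32-bit and 64-bit MSVC builds; the compiler picks the registers.
    __movsb(pbDest, pbSrc, count);
#else
    // Portable fallback for other compilers and architectures.
    std::memcpy(pbDest, pbSrc, count);
#endif
}
```

In many cases plain std::memcpy (or ordinary C++) is all that is needed, and the intrinsic branch is only worth keeping if profiling shows it matters.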
Edit: If you really know what you're doing, however, you can save the proper machine code bytes in some sort of buffer array and execute that "raw", via a void* pointer and VirtualProtect() or something similar. Of course, you need to understand the x64 calling/returning conventions, since you'd essentially be calling a function. Note that this is a bad idea 99.9% of the time.
If you can't use intrinsic functions to perform your tasks, or cannot write the equivalent in C/C++, you're going to have to dive into x64 assembly language. From a high level:
1) Assuming that you're starting with x86 assembly language routines, you'll need to port them to x64.
2) Use ML64 (Microsoft x64 assembler) or an equivalent assembler to assemble your routines into .OBJs that you can link into your C/C++ project.
3) Set up a custom build step that invokes the assembler on your .ASM files.
Good luck!
I have the following basic questions:
When should we involve disassembly in debugging?
How do we interpret disassembly? For example, in the listing below, what does each segment stand for?
00637CE3 8B 55 08 mov edx,dword ptr [arItem]
00637CE6 52 push edx
00637CE7 6A 00 push 0
00637CE9 8B 45 EC mov eax,dword ptr [result]
00637CEC 50 push eax
00637CED E8 3E E3 FF FF call getRequiredFields (00636030)
00637CF2 83 C4 0C add esp,0Ch
Language : C++
Platform : Windows
It's quite useful for estimating how efficient the code emitted by the compiler is.
For example, if you use std::vector::operator[] in a loop, without the disassembly it's quite hard to guess that each call to operator[] in fact requires two memory accesses, whereas using an iterator for the same would require one memory access.
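A pair of functions like the following is a convenient way to check that claim in the disassembly for your own compiler and settings. The names sum_indexed and sum_iterated are made up for the example; whether the two loops really differ depends heavily on the optimization level, so treat the "two accesses versus one" figure as something to verify rather than a given.

```cpp
#include <cassert>  // for the sanity check mentioned below
#include <cstddef>
#include <vector>

// Indexed access: each v[i] conceptually loads the data pointer, then the element.
long sum_indexed(const std::vector<int>& v) {
    long total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) total += v[i];
    return total;
}

// Iterator access: the iterator already holds the element address.
long sum_iterated(const std::vector<int>& v) {
    long total = 0;
    for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it) total += *it;
    return total;
}
```

Both must return the same value, e.g. assert(sum_indexed(v) == sum_iterated(v)); the interesting part is comparing the two loop bodies in the Disassembly view or in an assembly listing.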
In your example:
mov edx,dword ptr [arItem] // the value stored at address "arItem" is loaded into the register
push edx // that register is pushed onto the stack
push 0 // zero is pushed onto the stack
mov eax,dword ptr [result] // the value stored at address "result" is loaded into the register
push eax // that register is pushed onto the stack
call getRequiredFields (00636030) // the getRequiredFields function is called
This is a typical sequence for calling a function: the parameters are pushed onto the stack and then control is transferred to that function's code (the call instruction).
Also using disassembly is quite useful when participating in arguments about "how it works after compilation" - like caf points in his answer to this question.
When you should involve disassembly: When you exactly want to know what the CPU is doing when it's executing your program, or when you don't have the source code in whatever higher level language the program was written in (C++ in your case).
How to interpret assembly code: Learn assembly language. You can find an exhaustive reference on Intel x86 CPU instructions in Intel's processor manuals.
The piece of code that you posted prepares arguments for a function call (by getting and pushing some values on the stack and putting a value in the register eax), and then calls the function getRequiredFields.
1 - We (or at least I) should involve disassembly in debugging as a last resort. Generally, an optimizing compiler generates code that is not trivial for a human to understand. Instructions are reordered, some dead code is eliminated, some specific code is inlined, and so on. So understanding disassembled code is rarely necessary, and not easy when it is necessary. For example, I sometimes look at the disassembly to see if constants are part of the opcode or are stored in const variables.
2 - That piece of code calls a function like getRequiredFields(result, 0, arItem). You have to learn assembly language for the processor you want. For x86, go to www.intel.com and get the manuals of the IA32.
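To tie the listing back to source code, here is a sketch of what the call site might look like. The real signature of getRequiredFields is not shown in the question, so the parameter types, the stub body and the callSite wrapper are all assumptions made up for illustration. With a 32-bit cdecl-style call, arguments are pushed right to left, which matches the order in the listing: arItem first, then 0, then result.

```cpp
#include <cassert>

// Hypothetical stand-in for the real function; only the argument order matters here.
int getRequiredFields(int* result, int flags, int arItem) {
    assert(result != nullptr);
    *result = arItem + flags;  // placeholder behaviour, not the real logic
    return 1;
}

// Hypothetical call site corresponding to the disassembly in the question.
int callSite(int arItem) {
    int storage = 0;
    int* result = &storage;
    // 32-bit disassembly: push arItem; push 0; push result; call getRequiredFields
    getRequiredFields(result, 0, arItem);
    // The last instruction in the listing (bytes 83 C4 0C) then adds 12 to the
    // stack pointer: three 4-byte arguments removed by the caller.
    return storage;
}
```

On x64 the same call would pass these arguments in registers instead, so the listing would look quite different.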
I started out in 1982 with assembly debugging of PL/M programs on CP/M-80 and later Digital Research OSes. It was the same in the early days of MS-DOS until Microsoft introduced symdeb, which was a command-line debugger where source and assembly were displayed simultaneously. Symdeb was a leap forward but not that great, since the earlier debuggers had forced me to learn to recognize what assembly code belonged to which source code line. Before CodeView the best debugger was pfix86 from Phoenix Technologies. NuMega's SoftICE was the best debugger (apart from pure hardware ICEs) I've ever come across, in that it not only debugged my application but effortlessly led me through the inner workings of Windows as well. But I digress.
Late in 1990 a consultant in a project I was working in approached me and said he had this (very early) C++ bug he'd been working on for days but couldn't understand what the problem was. He single-stepped through the source code (on a windowed non-graphic DOS debugger) for me while I got all impatient. Finally I interrupted him and looked through the debugger options and sure enough there was the mixed source/assembly mode with registers and everything. This made it easy to realize that the application was trying to free an internal pointer (for local variables) containing NULL. For this problem, the source code mode was of no help at all. Today's C++ compilers will probably no longer contain a bug such as this but there will be others.
Knowing assembly-level debugging allows you to understand the source-compiler-assembly relationship to the extent of being able to predict what code the compiler will generate. Many people here on stackoverflow say "profile-profile-profile" but this goes a step further in that you learn what source-code constructs (I write in C) to use when and which to avoid. I suspect this is even more important with C++ which can generate a lot of code without the developer suspecting anything. For example there is a standard class for handling lists of objects which appears to be without drawbacks - just a few lines of code and this fantastic functionality! - until you look at the scores of strange procedure calls it generates. I'm not saying it's wrong to use them, I'm just saying that the developer should be aware of the pros and cons of using them. Overloading operators may be great functionality (somewhat weird to a WYSIWYG programmer like me) but what is the price in execution speed? If you say "nothing" I say "prove it."
It is never wrong to use mixed or pure assembly mode when debugging. Difficult bugs will usually be easier to find and the developer will learn to write more efficient code. Developers from the interpreted camp (C# and Java) will say that their code is just as efficient as the compiled languages but if you know assembly you will also know why they are wrong, why they are dead wrong. You can smile and think "yeah, tell me about it!"
After you've worked with different compilers you will come across one with the most astonishing code-generation ability. One PowerPC compiler condensed three nested loops into one loop simply through the superior code interpretation of its optimizer. Next to the guy who wrote that I'm ... well, let's just say in a different league.
Up until about ten years ago I wrote quite a bit of pure assembly but with multi-stage pipelines, multiple execution units and now multiple cores to contend with the C compiler beats me hands down. On the other hand I know what the compiler can do a good job with and what it shouldn't have to work with: Garbage In still equals Garbage Out. This is true for any compiler that produces assembly output.