Microarchitectural profiling of C++ and assembly code on MIPS


As part of a course project, I need to analyze a piece of C++ code for performance and find out which parts of the computer architecture (MIPS or x86) are most heavily utilized while running the code and are possibly a bottleneck for performance. I am looking at various profilers for analyzing the performance and came across SimpleScalar, which is a great tool but sadly only works with C code.
Since I am more familiar with the MIPS architecture, it would be great if there were a tool like SimpleScalar for simulating and profiling C++ code for MIPS. I am looking at the performance-critical aspects such as branches, caches, the instruction set, addressing modes, etc. If not, a mention of any tool that can do a similar kind of analysis for x86 architectures would be great as well.
(Just to clarify, I'm not looking for any old profiler, but for one that understands the CPU microarchitecture and knows what parts of the CPU are taken advantage of or underused.)

CACTI has detailed low-level simulation of cache.
SESC is a cycle accurate computer architecture simulator that supports MIPS.
SESC includes CACTI.

I doubt that what you want is possible. C++ is the language, but it still needs to be compiled to the target architecture. The optimisations (or the lack of them) will determine a lot of your performance criteria like cache use, etc. So I guess you need to look for machine level profilers (Hopefully they support the debug format of your compiler, so you see source code context).

My understanding is that SimpleScalar can simulate and profile MIPS machine code, no matter which language it was originally compiled from.
(The source-level debugger "DLite!" that comes with SimpleScalar may only support a few languages, but it sounds like you don't need to "debug" your code.)

Related

Transpiling to C vs C++ : range of CPU instructions

I am considering the question of transpiling a language (home-grown DSL) to C vs to C++.
I haven't done any 'native' programming for over 15 years, so I want to check my assumptions.
Am I right in assuming that transpiling to the newest C++ version (17) would enable the native compiler to use a much wider range of 'modern' Intel/AMD CPU instructions, resulting in a more efficient executable (beyond the multi-threading / memory-model part of C++, which already by itself seems a good enough reason to go for C++)?
Put another way, isn't a large part of 'more recent' CPU instructions never generated by a C compiler, simply because it has too little information about the programmer intent, due to the simpler syntax of C? I know I could access all CPU instructions with assembler, but that is precisely what I don't want to do. Ideally, I would want the generated code to still be as platform-independent as possible.

All of your assumptions about the relationship between programming language and "modern CPU instructions" are incorrect.
Let's consider the GNU Compiler Collection.
The choice of language here doesn't much matter, as the language front-ends all end up generating the same intermediate form called GIMPLE. The optimizing passes then work on that.
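The range of CPU instructions which can be emitted is controlled by the -march option (and the individual -m feature flags such as -mavx2); -mtune only adjusts scheduling and tuning for a given processor without changing which instructions may be used. For x86, GCC is capable of emitting modern AVX-512 instructions when optimizing some very plain-looking C code. Automatic loop vectorisation is a powerful thing. Try it out: implement memcpy and look at the generated assembly.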
My advice: generate clean, un-clever C code, and crank up the optimization level. Just like you would do if writing code by hand.
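As a minimal sketch of the memcpy experiment suggested above, here is a deliberately plain copy loop. The name plain_copy is hypothetical, and __restrict is a common compiler extension (GCC, Clang, MSVC) rather than standard C++; it is used here only to tell the optimizer that the buffers do not overlap.

```cpp
#include <cstddef>

// A deliberately plain byte-copy loop, in the spirit of "implement memcpy".
// `plain_copy` is a hypothetical name used only for this illustration.
// `__restrict` is a compiler extension (not standard C++) promising that
// dst and src do not overlap, which makes vectorisation easier to prove safe.
void plain_copy(unsigned char* __restrict dst,
                const unsigned char* __restrict src,
                std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];  // no cleverness: the optimizer decides how to widen this
    }
}
```

Compiling this with something like g++ -O3 -march=native -S and reading the output shows what the optimizer makes of it. Depending on the compiler version and flags, the loop may be vectorised, or it may be recognised as a copy and replaced by a call to the library memcpy; a loop that does a little arithmetic on each element tends to show the vector instructions more directly.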
You might also consider implementing your language directly as a front-end to GCC or LLVM, without transpiling to C or C++. LLVM was designed for this purpose, intended to make implementing new languages easy, and still taking advantage of modern optimization approaches.

Writing a new jit

I'm interested in starting my own JIT project in C++. I'm not that unfamiliar with assembly, or compiler design etc etc. But, I am very unfamiliar with the resulting machine code format - like, what does a mov instruction actually look like when all is said and done and it's time to call that function pointer. So, what are the best resources for creating such a thing?
Edit: Right now, I'm only interested in x86 on Windows, stretching a tiny bit to 64bit Windows in the future.

You want to have a look at the processor manuals for the architecture you are interested in. Those manuals describe the opcode encoding. For x86 processors, the manuals can be downloaded from this page.
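To give a rough idea of what such an encoding looks like, here is a small sketch that builds the bytes for mov eax, imm32 followed by ret, as documented in the x86 instruction set reference (opcode B8 plus a 32-bit little-endian immediate, then C3). The function name encode_return_constant is hypothetical, and the sketch only produces the bytes; it does not execute them.

```cpp
#include <cstdint>
#include <vector>

// Encodes the x86 sequence:
//     mov eax, <value>
//     ret
// `encode_return_constant` is a hypothetical helper for illustration only.
std::vector<std::uint8_t> encode_return_constant(std::uint32_t value) {
    std::vector<std::uint8_t> code;
    code.push_back(0xB8);  // opcode B8+rd with rd = 0 (EAX): mov eax, imm32
    for (int shift = 0; shift < 32; shift += 8) {
        // the 32-bit immediate is stored least significant byte first
        code.push_back(static_cast<std::uint8_t>((value >> shift) & 0xFF));
    }
    code.push_back(0xC3);  // ret (near return)
    return code;
}
```

For example, encode_return_constant(42) yields the six bytes B8 2A 00 00 00 C3. To actually call them on Windows, the bytes would have to be copied into memory that is executable (for instance a region obtained from VirtualAlloc with PAGE_EXECUTE_READWRITE) and the address cast to a function pointer returning int; that step is left out here.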

Starting your project on top of LLVM might shield you from the platform details.
http://llvm.org/
LLVM is used by several dynamic language JIT compilers.

GNU lightning is a multi-architecture (x86, SPARC, PPC) library for generating code within another program. You'll need to understand general assembly language concepts, but not at a very deep level. You won't have to write anything architecture-specific at all. The down side to lightning (at least last time I used it) is that the interface presented is the intersection of the features available on the supported targets: The small register set of x86, a RISC instruction set like SPARC, and so on. The single-pass code generation is easy to use but has its own quirks, like you can't relocate your output buffer (because of address references) so if you run out of space you generally have to start over. The good thing is that you will probably get a working example going very quickly.

Older versions of NASM come with a fairly concise opcode reference that has x86 instruction encodings. (Looks like there's no 64-bit info, though.) I found this one using google:
http://alien.dowling.edu/~rohit/nasmdocb.html
The official manuals say basically the same thing (and a lot more besides), but not quite so conveniently.

need book & web site suggestion for advanced low-level programming

I want to learn all the advanced details of low-level programming, so I want to be able to:
Learn advanced C/C++
Optimize my code with and without inline assembly
Understand the internals of an exe, dll, thread, process
Efficiently make use of technologies like SSE, 3DNow!, MMX
Debug and disassemble executables/libraries and understand what's going on inside
Understand the differences/features of different CPUs/platforms like x86, MIPS, ARM, PowerPC
My first target is an x86 Windows-based system. After that come Linux-based platforms, and embedded systems follow.
Any books, web sites, tutorials, forums, or communities that give me what I'm looking for DIRECTLY are fine.
Thanks..

What you are asking for cannot be found in a single book. Much of what you have mentioned is best found in User Manuals or Functional Specifications for various processors. I recommend starting with an understanding of the core x86 arch and working up from there. One of the old Intel 386 or 486 manuals might be a good start.
I know of no websites for this type of info.

A few recommendations from among my personal favourites to get you started:
“Effective C++: 55 Specific Ways to Improve Your Programs and Designs (3rd Edition)”
-- Scott Meyers
“Inside the Machine” -- John Stokes
“Hacker’s Delight” -- Henry S. Warren
“The Software Optimization Cookbook” -- Richard Gerber
“Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A: Instruction Set Reference, A-M” (253666-021)
“Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B: Instruction Set Reference, N-Z” (253667-021)

Maybe it's time for you to get an account on http://my.safaribooksonline.com/, unplug the phone for a couple of weeks, load the refrigerator up with Jolt and Funyuns, say goodbye to your family and friends, and then read as many books as you can. They have a pretty substantial library on there that covers most of the topics that you're looking for.

That is a bit too much to want to learn. :)
I would suggest starting with the basic ARM v4 core architecture.
It is simple enough to understand.
Then move on to the 8086, then build up to later versions of ARM and x86.
ARM is of the RISC type, and x86 of the CISC type.
You can never learn all of the processors (just as you will never be able to learn all the programming languages),
but knowing one or two will enable you to grasp any other you come across.
There is nothing much object-oriented about low-level programming,
so it doesn't matter whether you use C++ or C.
Get a full-system simulator like GXemul or QEMU.
Try to execute a hello world assembly program (without using the runtime libraries; you want it hard, right?).
Others might be able to guide you with respect to SSE, MMX, etc.
Check out infocenter.arm.com for the ARM assembly language and architecture specifications.

I've always found Computer Systems: A Programmer's Perspective (http://www.amazon.com/Computer-Systems-Programmers-Randal-Bryant/dp/013034074X) to be a very good book. It's got a large amount of information about computer architecture, and it taught me about memory management, compilation and linking (as well as how to debug linking errors), optimization, relocatable object code, and some lower-level architecture topics, such as what the internals of the processor are like. There are a lot of good exercises, ranging from optimization examples to implementing buffer overflows. It discusses how to write inline assembly code (and make it work). There's even a section on writing code for a fictional (Y86) processor.
One caveat, though, is that it tends to focus too heavily on the Intel processor line (in my opinion). If you want something that's a bit more along the lines of working with, say, the ARM line, then you'll probably want to take the recommendations from others above.

Visual studio compiler flag /arch and performance

I just noticed that our project has the "Enable Enhanced Instruction Set" flag left unset, probably just an oversight.
Before enabling the flag, I would like to ask whether anyone has seen any real-world performance improvement from enabling it.
I guess we will see some improvement, since our application constantly does floating-point calculations, but they are not a major part of it.

So in a nutshell: This setting only enables certain intrinsic functions that map directly on SSE instructions. In normal C++ programs you don't use these intrinsic functions, so this setting won't improve performance.
If you need more performance, you could try to find a compiler that rewrites your code to use SSE instructions (Intel claims its compiler can), but it's probably smarter to go for multicore (with OpenMP or .NET 4.0), or use the GPU, which is faster and more flexible than SSE.

The performance benefit will depend on whether your project uses intensive mathematical computations. For many tasks (networking, text processing, data management) this simply isn't the case, as no (or almost no) floating-point operations are used there. Hence, there will be no performance boost at all.
Relying on the SSE/SSE2 instructions generated by the compiler would not give top performance. First, you won't have any control over the actual code generation. There are scenarios where you need to use legacy (x87) code on an old system and SSE/SSE2-enabled code on a new system. You might also want to take advantage of SSE3 on the newest systems. For that purpose, I'd recommend checking the processor type using the cpuid instruction and then switching to an implementation that takes the most advantage of the processor's capabilities. You can then use compiler intrinsics in the implementations targeting SSE/SSE2. To target SSE3, you'll need a dedicated library, which I'm trying to locate on the internet.
I believe there must be libraries that analyze the processor's capabilities and allow for optimal code switching. I just need some time to look on the net.
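The check-then-switch idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names add_scalar, add_sse2, select_add, and SKETCH_HAVE_SSE2_PATH are hypothetical, and the feature check uses the GCC/Clang builtin __builtin_cpu_supports rather than a hand-written cpuid call (on MSVC one would use the __cpuid intrinsic from <intrin.h> instead). On other compilers or architectures the sketch simply falls back to the scalar version.

```cpp
#include <cstddef>

// Only build the SSE2 path where the intrinsics header and the GCC/Clang
// CPU-detection builtins are known to be available.
#if defined(__GNUC__) && defined(__SSE2__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2_PATH 1
#else
#define SKETCH_HAVE_SSE2_PATH 0
#endif

using AddFn = void (*)(const double*, const double*, double*, std::size_t);

// Portable fallback, usable on any processor.
void add_scalar(const double* a, const double* b, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#if SKETCH_HAVE_SSE2_PATH
// SSE2 version: adds two doubles per instruction, then handles the remainder.
void add_sse2(const double* a, const double* b, double* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);  // unaligned loads keep the sketch simple
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
#endif

// Picks an implementation once, based on what the running CPU reports.
AddFn select_add() {
#if SKETCH_HAVE_SSE2_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2")) return add_sse2;
#endif
    return add_scalar;
}
```

A caller would typically store the result of select_add() in a function pointer at start-up and call through it afterwards, so the feature check is paid only once.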

Intel C++ compiler as an alternative to Microsoft's?

Is anyone here using the Intel C++ compiler instead of Microsoft's Visual c++ compiler?
I would be very interested to hear your experience about integration, performance and build times.

The Intel compiler is one of the most advanced C++ compilers available. It has a number of advantages over, for instance, the Microsoft Visual C++ compiler, and one major drawback. The advantages include:
Very good SIMD support. As far as I've been able to find out, it is the compiler with the best support for SIMD instructions.
Support for both automatic parallelization (multi-core optimizations) and manual parallelization (through OpenMP); it does both very well.
Support for CPU dispatching. This is really important, since it allows the program to select code optimized for the processor it is actually running on. As far as I can tell, this is the only C++ compiler available that does this, unless G++ has introduced it in the meantime.
It is often shipped with optimized libraries, such as math and image libraries.
However, it has one major drawback: the dispatcher mentioned above only works on Intel CPUs, which means that advanced optimizations will be left out on AMD CPUs. There is a workaround for this, but it is still a major problem with the compiler.
To work around the dispatcher problem, it is possible to replace the generated dispatcher code with a version that works on AMD processors. One can, for instance, use Agner Fog's asmlib library, which replaces the compiler-generated dispatcher function. Much more information about the dispatching problem, and more detailed technical explanations of some of the topics, can be found in the "Optimizing software in C++" paper, also by Agner Fog (which is really worth reading).
On a personal note, I have used the Intel C++ compiler with Visual Studio 2005, where it worked flawlessly. I didn't experience any problems with Microsoft-specific language extensions; it seemed to understand those I used, but perhaps the ones mentioned by John Knoeller were different from the ones in my projects.
While I like the Intel compiler, I'm currently working with the Microsoft C++ compiler, simply because of the extra financial investment the Intel compiler requires. I would only use the Intel compiler as an alternative to Microsoft's or the GNU compiler if performance were critical to my project and I had the financial part in order ;)
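For readers who have not seen the manual (OpenMP) route mentioned in the list above, here is a minimal sketch of what it looks like in source code. The function name parallel_sum_of_squares is hypothetical; the pragma is standard OpenMP and is not specific to the Intel compiler.

```cpp
#include <cstddef>
#include <vector>

// Sums the squares of all elements, splitting the loop across threads when
// OpenMP is enabled. `parallel_sum_of_squares` is a hypothetical name.
double parallel_sum_of_squares(const std::vector<double>& v) {
    double sum = 0.0;
    // A signed loop index keeps the loop acceptable to older OpenMP versions.
    const long n = static_cast<long>(v.size());
    // Each thread accumulates a private partial sum; the partial sums are
    // combined when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += v[i] * v[i];
    }
    return sum;
}
```

OpenMP has to be switched on at compile time (for example -fopenmp with GCC or /openmp with Visual C++); without it the pragma is ignored and the loop runs serially with the same result, apart from possible floating-point rounding differences caused by the changed summation order.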

I'm not using the Intel C++ compiler at work or for personal projects (I wish I were).
I would use it because it has:
Excellent inline assembler support. Intel C++ supports both Intel and AT&T (GCC) assembler syntaxes, for x86 and x64 platforms. Visual C++ can handle only Intel assembly syntax and only for x86.
Support for SSE3, SSSE3, and SSE4 instruction sets. Visual C++ has support for SSE and SSE2.
A front end based on EDG C++, which has a complete ISO/IEC 14882:2003 standard implementation. That means you can use / learn every C++ feature.

I've had only one experience with this compiler, compiling STLPort. It took MSVC around 5 minutes to compile it and ICC was compiling for more than an hour. It seems that their template compilation is very slow. Other than this I've heard only good things about it.
Here's something interesting:
Intel's compiler can produce different versions of pieces of code, with each version being optimised for a specific processor and/or instruction set (SSE2, SSE3, etc.). The system detects which CPU it's running on and chooses the optimal code path accordingly; the CPU dispatcher, as it's called.
"However, the Intel CPU dispatcher does not only check which instruction set is supported by the CPU, it also checks the vendor ID string," Fog details, "If the vendor string says 'GenuineIntel' then it uses the optimal code path. If the CPU is not from Intel then, in most cases, it will run the slowest possible version of the code, even if the CPU is fully compatible with a better version."
OSnews article here

I tried using Intel C++ at my previous job. IIRC, it did indeed generate more efficient code at the expense of compilation time. We didn't put it to production use though, for reasons I can't remember.
One important difference compared to MSVC is that the Intel compiler supports C99.

Anecdotally, I've found that the Intel compiler crashes more frequently than Visual C++. Its diagnostics are a bit more thorough and clearly written than VC's. Thus, it's possible that the compiler will give diagnostics that weren't given with VC, or will crash where VC didn't, making your conversion more expensive.
However, I do believe that Intel's compiler allows you to link with Microsoft runtimes like the CRT, easing the transition cost.
If you are interoperating with managed code you should probably stick with Microsoft's compiler.
Recent Intel compilers achieve significantly better performance on floating-point heavy benchmarks, and are similar to Visual C++ on integer heavy benchmarks. However, it varies dramatically based on the program and whether or not you are using link-time code generation or profile-guided optimization. If performance is critical for you, you'll need to benchmark your application before making a choice. I'd only say that if you are doing scientific computing, it's probably worth the time to investigate.
Intel allows you a month-long free trial of its compiler, so you can try these things out for yourself.

I've been using the Intel C++ compiler since the first Release of Intel Parallel Studio, and so far I haven't felt the temptation to go back. Here's an outline of dis/advantages as well as (some obvious) observations.
Advantages
Parallelization (vectorization, OpenMP, SSE) is unmatched in other compilers.
Toolset is simply awesome. I'm talking about the profiling, of course.
Inclusion of optimized libraries such as Threading Building Blocks (okay, so Microsoft replicated TBB with PPL), Math Kernel Library (standard routines, and some implementations have MPI (!!!) support), Integrated Performance Primitives, etc. What's great also is that these libraries are constantly evolving.
Disadvantages
Speed-up is Intel-only. Well duh! It doesn't worry me, however, because on the server side all I have to do is choose Intel machines. I have no problem with that, some people might.
You can't really do OSS or anything like that on this, because the project file format is different. Yes, you can have both VS and IPS file formats, but that's just weird. You'll get lost in synchronising project options whenever you make a change. Intel's compiler has twice the number of options, by the way.
The compiler is a lot more finicky. It is far too easy to set incompatible project settings that will give you a cryptic compilation error instead of a nice meaningful explanation.
It costs additional money on top of Visual Studio.
Neutrals
I think that the performance argument is not a strong one anymore, because plenty of libraries such as Thrust or Microsoft AMP let you use GPGPU which will outgun your cpu anyway.
I recommend anyone interested to get a trial and try out some code, including the libraries. (And yes, the libraries are nice, but C-style interfaces can drive you insane.)

The last time the company I work for compared the two was about a year ago, (maybe 2). The Intel compiler generated faster code, usually only a bit faster, but in some cases quite a bit.
But it couldn't handle some of the MS language extensions that we depended on, so we ended up sticking with MS. It was VS 2005 that we were comparing it to. And I'm wracking my brain to remember exactly what MS extension the Intel compiler couldn't handle. I'll come back and edit this post if I can remember.

Intel C++ Compiler has AMAZING (human) support. Talking to Microsoft can literally take days. My non-trivial issue was solved through chat in under 10 minutes (including membership verification time).
EDIT: I have talked to Microsoft about problems in their products such as Office 2007, even got a bug reported. While I eventually succeeded, the overall size and complexity of their products and organization hierarchy is daunting.
