Visual studio compiler flag /arch and performance - c++

I just noticed that in our project have left the "Enable Enhanced Instruction Set" flag left unset, probably just an oversight.
Before enabling the flag I would like to ask if anyone have seen any real-world performance improvements enabling it ?
I guess we will see some improvement our application constantly do floating point based calucations, but its not a major part,.

So in a nutshell: This setting only enables certain intrinsic functions that map directly on SSE instructions. In normal C++ programs you don't use these intrinsic functions, so this setting won't improve performance.
If you need more performance, you could try to find a compiler that rewrites your code to use SSE instructions (intel claims its compiler can), but its probably smarter to go for multicore (with openMP or .net 4.0), or use the GPU, which is faster and more flexible than SSE.

The performance benefit will depend on whether you project uses intensive mathematical computations. For many tasks (networking, text processing, data management) this simply isn't the case as no (or almost no) floating-point operations are used there. Hence, there will be no performance boost at all.
Using SSE/SSE2 instructions generated by the compiler would not generate top performance. First, you won't have any control on actual code generation. There are scenarios where you need to use legacy (x87) code on an old system and SSE/SSE2-enabled code on a new system. You might also want to take advantage of SSE3 on most newest systems. For that purpose, I'd recommend to check the processor type using the cpuid instruction and then switch to an implementation that could take most advantage of the processor capabilities. You can then use compiler intrinsics in the implementations targeting SSE/SSE2. To target SSE3, you'll need a dedicated library which I'm trying to locate on the internet.
I believe, there must exist libraries that perform the analysis of processor capabilities and allow for optimal code switcing. I just need some time to look on the net also.

Related

Using runtime interface at compile time

I need hardware support (it might be FPGA) at compile time to speed up compile-time computations. To be more specific compile-time training of a neural network. This might use OpenCL to greatly speed up compilation.
Would a compiler provide such an ability?
The best would be ability to call custom dynamic library function at compile-time.
I prefer C++. I see LLVM is moving forward pretty fast.
Does it provide something similar to enable it in Clang?
I need hardware support (it might be FPGA) at compile time to speed up compile-time computations.
Compile-time computations are generally not that intensive. (and it is a quality of implementation issue). So it is unlikely you'll find that.
Perhaps you can use plugins for your compiler (e.g. plugins in C++ for GCC, or extensions using GCC MELT, or plugins in C++ for Clang), and add for example extra compiler builtins thru them.
Or simply, generate some C or C++ code thru some external tool.
(maybe you are looking for hardware support in your compiler to speed up the run time of your compiled program, so you want a compiler able to take advantage of your hardware for the generated code, but that is a very different question)

Transpiling to C vs C++ : range of CPU instructions

I am considering the question of transpiling a language (home-grown DSL) to C vs to C++.
I haven't done any 'native' programming for over 15 years, so I want to check my assumptions.
Am I right into assuming that transpiling to the newest C++ version (17) would enable the native compiler to use a much wider range of 'modern' Intel/AMD CPU instructions, resulting in a more efficient executable (beyond the multi-threading / memory-model part of C++, which already by itself seems a good enough reason to go for C++)?
Put another way, isn't a large part of 'more recent' CPU instructions never generated by a C compiler, simply because it has too little information about the programmer intent, due to the simpler syntax of C? I know I could access all CPU instructions with assembler, but that is precisely what I don't want to do. Ideally, I would want the generated code to still be as platform-independent as possible.
All of your assumptions about the relationship between programming language and "modern CPU instructions" are incorrect.
Let's consider the GNU Compiler Collection.
The choice of language here doesn't much matter, as the language front-ends all end up generating the same intermediate form called GIMPLE. The optimizing passes then work on that.
The range of CPU instructions which can be emitted is controlled by the -mtune option. For x86, GCC is capable of emitting modern AVX 512 instructions when optimizing some very plain-looking C code. Automatic loop vectorisation is a powerful thing. Try it out: implement memcpy and look at the generated assembly.
My advice: generate clean, un-clever C code, and crank up the optimization level. Just like you would do if writing code by hand.
You might also consider implementing your language directly as a front-end to GCC or LLVM, without transpiling to C or C++. LLVM was designed for this purpose, intended to make implementing new languages easy, and still taking advantage of modern optimization approaches.

Microarchitectural profiling of C++ and assembly code on MIPS

As part of course project, I need to analyze a piece of C++ code for performance and find out which parts of the Computer Architecture (MIPS or x86) are mostly utilized while running the code and is possibly a bottleneck for the performance. I am looking at various Profilers for analyzing the performance and came across SimpleScalar which is a great tool but sadly only works with C code.
Since I am more familiar with MIPS architecture it would be great if there's a tool like SimpleScalar for simulating and profiling C++ code for MIPS. I am looking at the performance critical parts like branch, cache, instruction set, addressing modes etc. If not, mention of any tool which can do the similar kind of analysis for x86 architectures would be great as well.
(Just to clarify, I'm not looking for any old profiler, but for one that understands the CPU microarchitecture and knows what parts of the CPU are taken advantage of or underused.)
CACTI has detailed low-level simulation of cache.
SESC is a cycle accurate computer architecture simulator that supports MIPS.
SESC includes CACTI.
I doubt that what you want is possible. C++ is the language, but it still needs to be compiled to the target architecture. The optimisations (or the lack of them) will determine a lot of your performance criteria like cache use, etc. So I guess you need to look for machine level profilers (Hopefully they support the debug format of your compiler, so you see source code context).
My understanding is that SimpleScalar can simulate and profile MIPS machine code, no matter what the original language it was compiled from.
(The source-level debugger "DLite!" that comes with SimpleScalar may only support a few languages, but it sounds like you don't need to "debug" your code.)

Intel C++ compiler as an alternative to Microsoft's?

Is anyone here using the Intel C++ compiler instead of Microsoft's Visual c++ compiler?
I would be very interested to hear your experience about integration, performance and build times.
The Intel compiler is one of the most advanced C++ compiler available, it has a number of advantages over for instance the Microsoft Visual C++ compiler, and one major drawback. The advantages include:
Very good SIMD support, as far as I've been able to find out, it is the compiler that has the best support for SIMD instructions.
Supports both automatic parallelization (multi core optimzations), as well as manual (through OpenMP), and does both very well.
Support CPU dispatching, this is really important, since it allows the compiler to target the processor for optimized instructions when the program runs. As far as I can tell this is the only C++ compiler available that does this, unless G++ has introduced this in their yet.
It is often shipped with optimized libraries, such as math and image libraries.
However it has one major drawback, the dispatcher as mentioned above, only works on Intel CPU's, this means that advanced optimizations will be left out on AMD cpu's. There is a workaround for this, but it is still a major problem with the compiler.
To work around the dispatcher problem, it is possible to replace the dispatcher code produced with a version working on AMD processors, one can for instance use Agner Fog's asmlib library which replaces the compiler generated dispatcher function. Much more information about the dispatching problem, and more detailed technical explanations of some of the topics can be found in the Optimizing software in C++ paper - also from Anger (which is really worth reading).
On a personal note I have used the Intel c++ Compiler with Visual Studio 2005 where it worked flawlessly, I didn't experience any problems with microsoft specific language extensions, it seemed to understand those I used, but perhaps the ones mentioned by John Knoeller were different from the ones I had in my projects.
While I like the Intel compiler, I'm currently working with the microsoft C++ compiler, simply because of the financial extra investment the Intel compiler requires. I would only use the Intel compiler as an alternative to Microsofts or the GNU compiler, if performance were critical to my project and I had a the financial part in order ;)
I'm not using Intel C++ compiler at work / personal (I wish I would).
I would use it because it has:
Excellent inline assembler support. Intel C++ supports both Intel and AT&T (GCC) assembler syntaxes, for x86 and x64 platforms. Visual C++ can handle only Intel assembly syntax and only for x86.
Support for SSE3, SSSE3, and SSE4 instruction sets. Visual C++ has support for SSE and SSE2.
Is based on EDG C++, which has a complete ISO/IEC 14882:2003 standard implementation. That means you can use / learn every C++ feature.
I've had only one experience with this compiler, compiling STLPort. It took MSVC around 5 minutes to compile it and ICC was compiling for more than an hour. It seems that their template compilation is very slow. Other than this I've heard only good things about it.
Here's something interesting:
Intel's compiler can produce different
versions of pieces of code, with each
version being optimised for a specific
processor and/or instruction set
(SSE2, SSE3, etc.). The system detects
which CPU it's running on and chooses
the optimal code path accordingly; the
CPU dispatcher, as it's called.
"However, the Intel CPU dispatcher
does not only check which instruction
set is supported by the CPU, it also
checks the vendor ID string," Fog
details, "If the vendor string says
'GenuineIntel' then it uses the
optimal code path. If the CPU is not
from Intel then, in most cases, it
will run the slowest possible version
of the code, even if the CPU is fully
compatible with a better version."
OSnews article here
I tried using Intel C++ at my previous job. IIRC, it did indeed generate more efficient code at the expense of compilation time. We didn't put it to production use though, for reasons I can't remember.
One important difference compared to MSVC is that the Intel compiler supports C99.
Anecdotally, I've found that the Intel compiler crashes more frequently than Visual C++. Its diagnostics are a bit more thorough and clearly written than VC's. Thus, it's possible that the compiler will give diagnostics that weren't given with VC, or will crash where VC didn't, making your conversion more expensive.
However, I do believe that Intel's compiler allows you to link with Microsoft runtimes like the CRT, easing the transition cost.
If you are interoperating with managed code you should probably stick with Microsoft's compiler.
Recent Intel compilers achieve significantly better performance on floating-point heavy benchmarks, and are similar to Visual C++ on integer heavy benchmarks. However, it varies dramatically based on the program and whether or not you are using link-time code generation or profile-guided optimization. If performance is critical for you, you'll need to benchmark your application before making a choice. I'd only say that if you are doing scientific computing, it's probably worth the time to investigate.
Intel allows you a month-long free trial of its compiler, so you can try these things out for yourself.
I've been using the Intel C++ compiler since the first Release of Intel Parallel Studio, and so far I haven't felt the temptation to go back. Here's an outline of dis/advantages as well as (some obvious) observations.
Advantages
Parallelization (vectorization, OpenMP, SSE) is unmatched in other compilers.
Toolset is simply awesome. I'm talking about the profiling, of course.
Inclusion of optimized libraries such as Threading Building Blocks (okay, so Microsoft replicated TBB with PPL), Math Kernel Library (standard routines, and some implementations have MPI (!!!) support), Integrated Performance Primitives, etc. What's great also is that these libraries are constantly evolving.
Disadvantages
Speed-up is Intel-only. Well duh! It doesn't worry me, however, because on the server side all I have to do is choose Intel machines. I have no problem with that, some people might.
You can't really do OSS or anything like that on this, because the project file format is different. Yes, you can have both VS and IPS file formats, but that's just weird. You'll get lost in synchronising project options whenever you make a change. Intel's compiler has twice the number of options, by the way.
The compiler is a lot more finicky. It is far too easy to set incompatible project settings that will give you a cryptic compilation error instead of a nice meaningful explanation.
It costs additional money on top of Visual Studio.
Neutrals
I think that the performance argument is not a strong one anymore, because plenty of libraries such as Thrust or Microsoft AMP let you use GPGPU which will outgun your cpu anyway.
I recommend anyone interested to get a trial and try out some code, including the libraries. (And yes, the libraries are nice, but C-style interfaces can drive you insane.)
The last time the company I work for compared the two was about a year ago, (maybe 2). The Intel compiler generated faster code, usually only a bit faster, but in some cases quite a bit.
But it couldn't handle some of the MS language extensions that we depended on, so we ended up sticking with MS. It was VS 2005 that we were comparing it to. And I'm wracking my brain to remember exactly what MS extension the Intel compiler couldn't handle. I'll come back and edit this post if I can remember.
Intel C++ Compiler has AMAZING (human) support. Talking to Microsoft can literally take days. My non-trivial issue was solved through chat in under 10 minutes (including membership verification time).
EDIT: I have talked to Microsoft about problems in their products such as Office 2007, even got a bug reported. While I eventually succeeded, the overall size and complexity of their products and organization hierarchy is daunting.

Package for distributing calculations

Do you know of any package for distributing calculations on several computers and/or several cores on each computer? The calculation code is in c++, the package needs to be able to cope with data >2GB and work on a windows x64 machine. Shareware would be nice, but isn't a requirement.
A suitable solution would depend on the type of calculation and data you wish you process, the granularity of parallelism you wish to achieve, and how much effort you are willing to invest in it.
The simplest would be to just use a suitable solver/library that supports parallelism (e.g.
scalapack). Or if you wish to roll your own solvers, you can squeeze out some paralleisation out of your current code using OpenMP or compilers that provide automatic paralleisation (e.g Intel C/C++ compiler). All these will give you a reasonable performance boost without requiring massive restructuring of your code.
At the other end of the spectrum, you have the MPI option. It can afford you the most performance boost if your algorithm parallelises well. It will however require a fair bit of reengineering.
Another alternative would be to go down the threading route. There are libraries an tools out there that will make this less of a nightmare. These are worth a look: Boost C++ Parallel programming library and Threading Building Block
You may want to look at OpenMP
There's an MPI library and the DVM system working on top of MPI. These are generic tools widely used for parallelizing a variety of tasks.