Does anybody have any experience with SSEPlus? - c++

SSEPlus is an open source library from AMD for unified handling of SSE processor extensions.
I'm considering using this library for my next small project and would like to know whether anybody has experience with it. Can I use it on Intel machines? Are there any performance issues compared to direct SSE calls? Any issues on 64-bit machines? What projects other than Framewave use it?

Yes, you can use it on Intel machines too.
Performance should not differ, except that it adds checks for supported processor features, which might cost a little.

Related

Visual studio compiler flag /arch and performance

I just noticed that our project has the "Enable Enhanced Instruction Set" flag left unset, probably just an oversight.
Before enabling the flag, I would like to ask whether anyone has seen any real-world performance improvements from enabling it.
I guess we will see some improvement, since our application constantly does floating-point calculations, but it's not a major part.
So in a nutshell: this setting only enables certain intrinsic functions that map directly onto SSE instructions. In normal C++ programs you don't use these intrinsic functions, so this setting won't improve performance.
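For readers who have not seen the intrinsic functions mentioned above, here is a minimal sketch of what calling them looks like. The function name `add_arrays` and the `EXAMPLE_HAVE_SSE` macro are made up for illustration; `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` are SSE intrinsics declared in `<xmmintrin.h>`. The sketch assumes an x86 or x86-64 target and falls back to a plain loop elsewhere.

```cpp
#include <cassert>   // only for the sanity check in self_test_add_arrays()
#include <cstddef>

// Use SSE intrinsics only when the target is known to support them.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define EXAMPLE_HAVE_SSE 1
#else
  #define EXAMPLE_HAVE_SSE 0
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if EXAMPLE_HAVE_SSE
    // Each _mm_add_ps adds four packed floats in a single instruction.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail (and the whole loop on targets without SSE).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Small sanity check; call it from your own main().
void self_test_add_arrays() {
    const float a[6] = {1, 2, 3, 4, 5, 6};
    const float b[6] = {10, 20, 30, 40, 50, 60};
    float out[6] = {};
    add_arrays(a, b, out, 6);
    for (int i = 0; i < 6; ++i) {
        assert(out[i] == a[i] + b[i]);
    }
}
```

Ordinary C++ code contains no such calls, which is why code that never uses intrinsics looks like the plain scalar loop at the end of `add_arrays`.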
If you need more performance, you could try to find a compiler that rewrites your code to use SSE instructions (Intel claims its compiler can), but it's probably smarter to go for multicore (with OpenMP or .NET 4.0), or use the GPU, which is faster and more flexible than SSE.
The performance benefit will depend on whether your project uses intensive mathematical computations. For many tasks (networking, text processing, data management) this simply isn't the case, as no (or almost no) floating-point operations are used there. Hence, there will be no performance boost at all.
Relying on SSE/SSE2 instructions generated by the compiler will not give top performance. First, you won't have any control over the actual code generation. There are scenarios where you need to use legacy (x87) code on an old system and SSE/SSE2-enabled code on a new system. You might also want to take advantage of SSE3 on the newest systems. For that purpose, I'd recommend checking the processor type using the cpuid instruction and then switching to an implementation that takes the most advantage of the processor's capabilities. You can then use compiler intrinsics in the implementations targeting SSE/SSE2. To target SSE3, you'll need a dedicated library, which I'm trying to locate on the internet.
I believe there must be libraries that analyse the processor's capabilities and allow for optimal code switching. I just need some time to look on the net as well.
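The cpuid-then-switch approach described in this answer could be sketched roughly as follows. All names here (`SimdLevel`, `detect_simd_level`, `select_sum`, `sum`, `EXAMPLE_HAVE_CPUID`) are hypothetical, and the SSE2/SSE3 variants are placeholders that simply forward to the generic loop; in real code each would be written with the matching intrinsics. The sketch assumes MSVC's `__cpuid` from `<intrin.h>` or GCC/Clang's `__get_cpuid` from `<cpuid.h>`, and reports the lowest level on non-x86 targets.

```cpp
#include <cassert>   // only for the sanity check in self_test_dispatch()
#include <cstddef>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #include <intrin.h>
  #define EXAMPLE_HAVE_CPUID 1
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
  #include <cpuid.h>
  #define EXAMPLE_HAVE_CPUID 1
#else
  #define EXAMPLE_HAVE_CPUID 0
#endif

enum class SimdLevel { X87, SSE2, SSE3 };

// Query CPUID leaf 1: EDX bit 26 signals SSE2, ECX bit 0 signals SSE3.
SimdLevel detect_simd_level() {
#if EXAMPLE_HAVE_CPUID
    unsigned ecx = 0, edx = 0;
  #if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);
    ecx = static_cast<unsigned>(regs[2]);
    edx = static_cast<unsigned>(regs[3]);
  #else
    unsigned eax = 0, ebx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return SimdLevel::X87;
  #endif
    if (ecx & (1u << 0))  return SimdLevel::SSE3;
    if (edx & (1u << 26)) return SimdLevel::SSE2;
#endif
    return SimdLevel::X87;
}

using SumFn = float (*)(const float*, std::size_t);

// Generic implementation that works everywhere.
float sum_generic(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}
// Placeholders: real versions would use SSE2 / SSE3 intrinsics.
float sum_sse2(const float* p, std::size_t n) { return sum_generic(p, n); }
float sum_sse3(const float* p, std::size_t n) { return sum_generic(p, n); }

// Map a detected capability level to the best available implementation.
SumFn select_sum(SimdLevel level) {
    switch (level) {
        case SimdLevel::SSE3: return sum_sse3;
        case SimdLevel::SSE2: return sum_sse2;
        default:              return sum_generic;
    }
}

// Public entry point: detection runs once, on first use.
float sum(const float* p, std::size_t n) {
    static const SumFn impl = select_sum(detect_simd_level());
    return impl(p, n);
}

// Small sanity check; call it from your own main().
void self_test_dispatch() {
    const float data[4] = {1, 2, 3, 4};
    assert(sum(data, 4) == 10.0f);
}
```

Callers only ever see `sum`; the choice of implementation is made once and cached in a function pointer, so the per-call overhead is a single indirect call.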

Intel C++ compiler as an alternative to Microsoft's?

Is anyone here using the Intel C++ compiler instead of Microsoft's Visual C++ compiler?
I would be very interested to hear your experience about integration, performance and build times.
The Intel compiler is one of the most advanced C++ compilers available. It has a number of advantages over, for instance, the Microsoft Visual C++ compiler, and one major drawback. The advantages include:
Very good SIMD support, as far as I've been able to find out, it is the compiler that has the best support for SIMD instructions.
Supports both automatic parallelization (multi-core optimizations) and manual parallelization (through OpenMP), and does both very well.
Supports CPU dispatching. This is really important, since it allows the compiler to target the processor for optimized instructions when the program runs. As far as I can tell this is the only C++ compiler available that does this, unless G++ has introduced it in the meantime.
It is often shipped with optimized libraries, such as math and image libraries.
However, it has one major drawback: the dispatcher mentioned above only works on Intel CPUs, which means that advanced optimizations will be left out on AMD CPUs. There is a workaround for this, but it is still a major problem with the compiler.
To work around the dispatcher problem, it is possible to replace the dispatcher code produced with a version that works on AMD processors. One can, for instance, use Agner Fog's asmlib library, which replaces the compiler-generated dispatcher function. Much more information about the dispatching problem, and more detailed technical explanations of some of these topics, can be found in the Optimizing software in C++ paper, also from Agner (which is really worth reading).
On a personal note, I have used the Intel C++ compiler with Visual Studio 2005, where it worked flawlessly. I didn't experience any problems with Microsoft-specific language extensions; it seemed to understand those I used, but perhaps the ones mentioned by John Knoeller were different from the ones I had in my projects.
While I like the Intel compiler, I'm currently working with the Microsoft C++ compiler, simply because of the extra financial investment the Intel compiler requires. I would only use the Intel compiler as an alternative to Microsoft's or the GNU compiler if performance were critical to my project and I had the financial part in order ;)
I'm not using the Intel C++ compiler at work or for personal projects (I wish I were).
I would use it because it has:
Excellent inline assembler support. Intel C++ supports both Intel and AT&T (GCC) assembler syntaxes, for x86 and x64 platforms. Visual C++ can handle only Intel assembly syntax and only for x86.
Support for SSE3, SSSE3, and SSE4 instruction sets. Visual C++ has support for SSE and SSE2.
A front end based on EDG C++, which has a complete ISO/IEC 14882:2003 standard implementation. That means you can use / learn every C++ feature.
I've had only one experience with this compiler, compiling STLPort. It took MSVC around 5 minutes to compile it and ICC was compiling for more than an hour. It seems that their template compilation is very slow. Other than this I've heard only good things about it.
Here's something interesting:
Intel's compiler can produce different versions of pieces of code, with each version being optimised for a specific processor and/or instruction set (SSE2, SSE3, etc.). The system detects which CPU it's running on and chooses the optimal code path accordingly; the CPU dispatcher, as it's called.
"However, the Intel CPU dispatcher does not only check which instruction set is supported by the CPU, it also checks the vendor ID string," Fog details, "If the vendor string says 'GenuineIntel' then it uses the optimal code path. If the CPU is not from Intel then, in most cases, it will run the slowest possible version of the code, even if the CPU is fully compatible with a better version."
OSnews article here
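To make the vendor-string check in the quote concrete, here is a small sketch of how a program can read that string, and how a vendor-gated decision differs from a feature-only one. This is an illustration of the behaviour described in the quote, not Intel's actual dispatcher code; `cpu_vendor`, `fast_path_vendor_gated`, `fast_path_feature_only` and `EXAMPLE_X86` are hypothetical names. It assumes `__cpuid` from `<intrin.h>` on MSVC or `__get_cpuid` from `<cpuid.h>` on GCC/Clang.

```cpp
#include <cassert>   // only for the sanity check in self_test_vendor()
#include <cstring>
#include <string>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #include <intrin.h>
  #define EXAMPLE_X86 1
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
  #include <cpuid.h>
  #define EXAMPLE_X86 1
#else
  #define EXAMPLE_X86 0
#endif

// Returns the 12-character CPUID vendor string (e.g. "GenuineIntel"),
// or an empty string on non-x86 builds.
std::string cpu_vendor() {
#if EXAMPLE_X86
    unsigned regs[4] = {0, 0, 0, 0};  // eax, ebx, ecx, edx
  #if defined(_MSC_VER)
    int r[4] = {0, 0, 0, 0};
    __cpuid(r, 0);
    for (int i = 0; i < 4; ++i) regs[i] = static_cast<unsigned>(r[i]);
  #else
    if (!__get_cpuid(0, &regs[0], &regs[1], &regs[2], &regs[3])) return std::string();
  #endif
    // CPUID leaf 0 stores the vendor text in EBX, EDX, ECX (in that order).
    char text[12];
    std::memcpy(text + 0, &regs[1], 4);
    std::memcpy(text + 4, &regs[3], 4);
    std::memcpy(text + 8, &regs[2], 4);
    return std::string(text, 12);
#else
    return std::string();
#endif
}

// Policy as described in the quote: the fast path requires an Intel vendor string.
bool fast_path_vendor_gated(const std::string& vendor, bool has_feature) {
    return has_feature && vendor == "GenuineIntel";
}

// Alternative policy: only the reported instruction-set support matters.
bool fast_path_feature_only(bool has_feature) {
    return has_feature;
}

// Small sanity check; call it from your own main().
void self_test_vendor() {
    const std::string v = cpu_vendor();
    assert(v.empty() || v.size() == 12);
}
```

With a vendor string of "AuthenticAMD" and the feature present, the first policy rejects the fast path while the second accepts it, which is the difference Fog is pointing at.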
I tried using Intel C++ at my previous job. IIRC, it did indeed generate more efficient code at the expense of compilation time. We didn't put it to production use though, for reasons I can't remember.
One important difference compared to MSVC is that the Intel compiler supports C99.
Anecdotally, I've found that the Intel compiler crashes more frequently than Visual C++. Its diagnostics are a bit more thorough and clearly written than VC's. Thus, it's possible that the compiler will give diagnostics that weren't given with VC, or will crash where VC didn't, making your conversion more expensive.
However, I do believe that Intel's compiler allows you to link with Microsoft runtimes like the CRT, easing the transition cost.
If you are interoperating with managed code you should probably stick with Microsoft's compiler.
Recent Intel compilers achieve significantly better performance on floating-point heavy benchmarks, and are similar to Visual C++ on integer heavy benchmarks. However, it varies dramatically based on the program and whether or not you are using link-time code generation or profile-guided optimization. If performance is critical for you, you'll need to benchmark your application before making a choice. I'd only say that if you are doing scientific computing, it's probably worth the time to investigate.
Intel allows you a month-long free trial of its compiler, so you can try these things out for yourself.
I've been using the Intel C++ compiler since the first release of Intel Parallel Studio, and so far I haven't felt the temptation to go back. Here's an outline of dis/advantages as well as (some obvious) observations.
Advantages
Parallelization (vectorization, OpenMP, SSE) is unmatched in other compilers.
Toolset is simply awesome. I'm talking about the profiling, of course.
Inclusion of optimized libraries such as Threading Building Blocks (okay, so Microsoft replicated TBB with PPL), Math Kernel Library (standard routines, and some implementations have MPI (!!!) support), Integrated Performance Primitives, etc. What's great also is that these libraries are constantly evolving.
Disadvantages
Speed-up is Intel-only. Well duh! It doesn't worry me, however, because on the server side all I have to do is choose Intel machines. I have no problem with that, some people might.
You can't really do OSS or anything like that on this, because the project file format is different. Yes, you can have both VS and IPS file formats, but that's just weird. You'll get lost in synchronising project options whenever you make a change. Intel's compiler has twice the number of options, by the way.
The compiler is a lot more finicky. It is far too easy to set incompatible project settings that will give you a cryptic compilation error instead of a nice meaningful explanation.
It costs additional money on top of Visual Studio.
Neutrals
I think that the performance argument is not a strong one anymore, because plenty of libraries such as Thrust or Microsoft's C++ AMP let you use GPGPU, which will outgun your CPU anyway.
I recommend anyone interested to get a trial and try out some code, including the libraries. (And yes, the libraries are nice, but C-style interfaces can drive you insane.)
The last time the company I work for compared the two was about a year ago (maybe two). The Intel compiler generated faster code, usually only a bit faster, but in some cases quite a bit.
But it couldn't handle some of the MS language extensions that we depended on, so we ended up sticking with MS. It was VS 2005 that we were comparing it to. And I'm wracking my brain to remember exactly what MS extension the Intel compiler couldn't handle. I'll come back and edit this post if I can remember.
Intel C++ Compiler has AMAZING (human) support. Talking to Microsoft can literally take days. My non-trivial issue was solved through chat in under 10 minutes (including membership verification time).
EDIT: I have talked to Microsoft about problems in their products such as Office 2007, even got a bug reported. While I eventually succeeded, the overall size and complexity of their products and organization hierarchy is daunting.

Issues in porting c/c++ code to VxWorks

I need to port a C/C++ codebase that already supports Linux/Mac to VxWorks. I am pretty new to VxWorks. Could you let me know what issues could arise?
We recently did the opposite conversion - we ported code from a PowerPC machine running VxWorks to an Intel system running Linux. I don't remember hitting many snags as far as the differences between the operating systems. Obviously any call to an OS specific API will have to change and we were not making extensive use of these functions.
Our biggest problem was not the difference between the operating systems, but rather the difference between PowerPC and Intel hardware. PowerPC is Big Endian and Intel is Little Endian. Our software is written in C and made many assumptions as to the order of bytes and this was an absolute nightmare to get it working smoothly again. There were literally hundreds of structures that defined bitfields and needed to be re-ordered to work correctly. We ended up implementing a #pragma in GCC that reversed these bitfields at their definition (#pragma reverse_bitfields).
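The answer above solved the problem with a custom GCC pragma. A different, more portable technique for new code is to avoid byte-order assumptions altogether by assembling values with explicit shifts and extracting fields with masks instead of C bitfields. The following is a minimal sketch of that alternative; `load_be32`, `store_be32` and `extract_bits` are hypothetical helper names, and the field positions in the check are made up.

```cpp
#include <cassert>   // only for the sanity check in self_test_endian()
#include <cstdint>

// Read a 32-bit big-endian value from a byte buffer.
// The result is the same on big- and little-endian hosts.
std::uint32_t load_be32(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}

// Write a 32-bit value to a byte buffer in big-endian order.
void store_be32(unsigned char* p, std::uint32_t v) {
    p[0] = static_cast<unsigned char>((v >> 24) & 0xFFu);
    p[1] = static_cast<unsigned char>((v >> 16) & 0xFFu);
    p[2] = static_cast<unsigned char>((v >> 8) & 0xFFu);
    p[3] = static_cast<unsigned char>(v & 0xFFu);
}

// Extract `width` bits starting at bit `lsb` (bit 0 = least significant).
// Requires lsb < 32. Unlike a C bitfield, the layout does not depend on
// the compiler or the target's bit ordering.
std::uint32_t extract_bits(std::uint32_t word, unsigned lsb, unsigned width) {
    const std::uint32_t mask =
        (width >= 32) ? 0xFFFFFFFFu : ((static_cast<std::uint32_t>(1) << width) - 1u);
    return (word >> lsb) & mask;
}

// Small sanity check; call it from your own main().
void self_test_endian() {
    const unsigned char wire[4] = {0x12, 0x34, 0x56, 0x78};
    const std::uint32_t w = load_be32(wire);
    assert(w == 0x12345678u);
    assert(extract_bits(w, 24, 8) == 0x12u);  // top byte as an 8-bit field
    unsigned char back[4] = {0, 0, 0, 0};
    store_be32(back, w);
    assert(back[0] == 0x12 && back[3] == 0x78);
}
```

This does not help with hundreds of existing bitfield structures, which is the situation the answer describes, but it keeps newly written code from acquiring the same dependency.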
Much depends on which version of VxWorks you're targeting, and the actual target processor itself. One thing you will have to deal with is that there is no paged memory system or virtual memory: you have what's there. The environment itself is far more constrained than a Linux system. Sometimes the work involved in porting applications goes all the way back to the architecture level, because resources are not as unlimited as they are in Linux.
Some other tips:
license VxWorks such that you have the source code available
use a real, physical target as soon as possible in the development cycle; do not count on the simulators accurately emulating the target
use TSRs (technical support requests) as necessary; I don't know how they structure the purchase of the right to create TSRs, but don't let anybody cheap out on these
Depending on what processor you are running with VxWorks, endianness, structure packing, and memory alignment could all be issues. The last time I used VxWorks it supported a pthreads, sockets, and mutex layer that mimicked the Unix environments easily enough.
It's difficult to tell without knowing more about the application that you're porting: What Linux libraries and API calls does it use? Is it self-contained, or does it rely on slews of Linux command-line tools and scripts to do its job?
As Average says, endianness can cause you way more problems than you expect - particularly if you're not prepared for it.

Package for distributing calculations

Do you know of any package for distributing calculations on several computers and/or several cores on each computer? The calculation code is in C++, and the package needs to be able to cope with data >2GB and work on a Windows x64 machine. Shareware would be nice, but isn't a requirement.
A suitable solution would depend on the type of calculation and data you wish to process, the granularity of parallelism you wish to achieve, and how much effort you are willing to invest in it.
The simplest would be to just use a suitable solver/library that supports parallelism (e.g. ScaLAPACK). Or, if you wish to roll your own solvers, you can squeeze some parallelisation out of your current code using OpenMP or compilers that provide automatic parallelisation (e.g. the Intel C/C++ compiler). All of these will give you a reasonable performance boost without requiring massive restructuring of your code.
At the other end of the spectrum, you have the MPI option. It can afford you the most performance boost if your algorithm parallelises well. It will however require a fair bit of reengineering.
Another alternative would be to go down the threading route. There are libraries and tools out there that will make this less of a nightmare. These are worth a look: the Boost C++ parallel programming library and Threading Building Blocks.
You may want to look at OpenMP
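As a rough idea of how little restructuring OpenMP needs for the multi-core part of the question, here is a minimal sketch of a parallel reduction. `sum_of_squares` is a hypothetical function; the pragma only takes effect when the compiler's OpenMP switch is on (`/openmp` for Visual C++, `-fopenmp` for GCC/Clang), and otherwise the loop simply runs serially. OpenMP covers the cores of one machine only, not several computers.

```cpp
#include <cassert>   // only for the sanity check in self_test_openmp()
#include <vector>

// Sums v[i]*v[i] over all elements, splitting the loop across cores
// when OpenMP is enabled.
double sum_of_squares(const std::vector<double>& v) {
    double total = 0.0;
    // A signed 64-bit index keeps older OpenMP implementations happy and
    // still covers arrays larger than 2 GB on x64.
    const long long n = static_cast<long long>(v.size());
#if defined(_OPENMP)
    // Each thread accumulates a private partial sum; OpenMP combines
    // the partial sums into `total` at the end of the loop.
    #pragma omp parallel for reduction(+:total)
#endif
    for (long long i = 0; i < n; ++i) {
        total += v[static_cast<std::size_t>(i)] * v[static_cast<std::size_t>(i)];
    }
    return total;
}

// Small sanity check; call it from your own main().
void self_test_openmp() {
    const std::vector<double> v = {1.0, 2.0, 3.0};
    assert(sum_of_squares(v) == 14.0);
}
```

The serial and parallel versions share the same source, so the change can be introduced loop by loop and measured as you go.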
There's an MPI library and the DVM system working on top of MPI. These are generic tools widely used for parallelizing a variety of tasks.

Is it worth learning AMD-specific APIs?

I'm currently learning the APIs related to Intel's parallelization libraries such as TBB, MKL and IPP. I was wondering, though, whether it's also worth looking at AMD's part of the puzzle. Or would that just be a waste of time? (I must confess, I have no clue about AMD's library support - at all - so would appreciate any advice you might have.)
Just to clarify, the reason I'm going the Intel way is because 1) the APIs are very nice; and 2) Intel seems to be taking tool support as seriously as API support. (Once again, I have no clue how AMD is doing in this department.)
The MKL and IPP libraries will perform (nearly) as well on AMD machines. My guess is that TBB will also run just fine on AMD boxes. If I had to suggest a technology that would be beneficial and useful to both, it would be to master the OpenMP libraries. The Intel compiler with the OpenMP extensions is stunningly fast and works with AMD chips also.
Only worth it if you are specifically interested in building something like video games, operating systems, database servers, or virtualization software. In other words: if you have a segment where you care enough about performance to take the time to do it (and do it right) in assembler. The same is true for Intel.
If your company sells packages of just Intel Servers with your software, then you shouldn't bother learning the AMD approach. But if you're going to have to offer software for both (or many) different platforms, then it might be worth looking into the different technologies. It will be very difficult to create the wrappers for the hardware-specific libraries. (Especially since threading is involved.)
And you definitely don't want to write a completely separate implementation for each hardware configuration. In fact, if your software is to be consumed by a generic user, then you may want to abandon the Intel technology and use standard threading techniques. I don't mean to be discouraging, but I believe that the Intel threading libraries are a bit ahead of their time for all intents and purposes.