Using runtime interface at compile time - c++

I need hardware support (it might be FPGA) at compile time to speed up compile-time computations. To be more specific compile-time training of a neural network. This might use OpenCL to greatly speed up compilation.
Would a compiler provide such an ability?
The best would be ability to call custom dynamic library function at compile-time.
I prefer C++. I see LLVM is moving forward pretty fast.
Does it provide something similar to enable it in Clang?

I need hardware support (it might be FPGA) at compile time to speed up compile-time computations.
Compile-time computations are generally not that intensive. (and it is a quality of implementation issue). So it is unlikely you'll find that.
Perhaps you can use plugins for your compiler (e.g. plugins in C++ for GCC, or extensions using GCC MELT, or plugins in C++ for Clang), and add for example extra compiler builtins thru them.
Or simply, generate some C or C++ code thru some external tool.
(maybe you are looking for hardware support in your compiler to speed up the run time of your compiled program, so you want a compiler able to take advantage of your hardware for the generated code, but that is a very different question)

Related

Transpiling to C vs C++ : range of CPU instructions

I am considering the question of transpiling a language (home-grown DSL) to C vs to C++.
I haven't done any 'native' programming for over 15 years, so I want to check my assumptions.
Am I right into assuming that transpiling to the newest C++ version (17) would enable the native compiler to use a much wider range of 'modern' Intel/AMD CPU instructions, resulting in a more efficient executable (beyond the multi-threading / memory-model part of C++, which already by itself seems a good enough reason to go for C++)?
Put another way, isn't a large part of 'more recent' CPU instructions never generated by a C compiler, simply because it has too little information about the programmer intent, due to the simpler syntax of C? I know I could access all CPU instructions with assembler, but that is precisely what I don't want to do. Ideally, I would want the generated code to still be as platform-independent as possible.
All of your assumptions about the relationship between programming language and "modern CPU instructions" are incorrect.
Let's consider the GNU Compiler Collection.
The choice of language here doesn't much matter, as the language front-ends all end up generating the same intermediate form called GIMPLE. The optimizing passes then work on that.
The range of CPU instructions which can be emitted is controlled by the -mtune option. For x86, GCC is capable of emitting modern AVX 512 instructions when optimizing some very plain-looking C code. Automatic loop vectorisation is a powerful thing. Try it out: implement memcpy and look at the generated assembly.
My advice: generate clean, un-clever C code, and crank up the optimization level. Just like you would do if writing code by hand.
You might also consider implementing your language directly as a front-end to GCC or LLVM, without transpiling to C or C++. LLVM was designed for this purpose, intended to make implementing new languages easy, and still taking advantage of modern optimization approaches.

Explanation of CUDA C and C++

Can anyone give me a good explanation as to the nature of CUDA C and C++? As I understand it, CUDA is supposed to be C with NVIDIA's GPU libraries. As of right now CUDA C supports some C++ features but not others.
What is NVIDIA's plan? Are they going to build upon C and add their own libraries (e.g. Thrust vs. STL) that parallel those of C++? Are they eventually going to support all of C++? Is it bad to use C++ headers in a .cu file?
CUDA C is a programming language with C syntax. Conceptually it is quite different from C.
The problem it is trying to solve is coding multiple (similar) instruction streams for multiple processors.
CUDA offers more than Single Instruction Multiple Data (SIMD) vector processing, but data streams >> instruction streams, or there is much less benefit.
CUDA gives some mechanisms to do that, and hides some of the complexity.
CUDA is not optimised for multiple diverse instruction streams like a multi-core x86.
CUDA is not limited to a single instruction stream like x86 vector instructions, or limited to specific data types like x86 vector instructions.
CUDA supports 'loops' which can be executed in parallel. This is its most critical feature. The CUDA system will partition the execution of 'loops', and run the 'loop' body simultaneously across an array of identical processors, while providing some of the illusion of a normal sequential loop (specifically CUDA manages the loop "index"). The developer needs to be aware of the GPU machine structure to write 'loops' effectively, but almost all of the management is handled by the CUDA run-time. The effect is hundreds (or even thousands) of 'loops' complete in the same time as one 'loop'.
CUDA supports what looks like if branches. Only processors running code which match the if test can be active, so a subset of processors will be active for each 'branch' of the if test. As an example this if... else if ... else ..., has three branches. Each processor will execute only one branch, and be 're-synched' ready to move on with the rest of the processors when the if is complete. It may be that some of the branch conditions are not matched by any processor. So there is no need to execute that branch (for that example, three branches is the worst case). Then only one or two branches are executed sequentially, completing the whole if more quickly.
There is no 'magic'. The programmer must be aware that the code will be run on a CUDA device, and write code consciously for it.
CUDA does not take old C/C++ code and auto-magically run the computation across an array of processors. CUDA can compile and run ordinary C and much of C++ sequentially, but there is very little (nothing?) to be gained by that because it will run sequentially, and more slowly than a modern CPU. This means the code in some libraries is not (yet) a good match with CUDA capabilities. A CUDA program could operate on multi-kByte bit-vectors simultaneously. CUDA isn't able to auto-magically convert existing sequential C/C++ library code into something which would do that.
CUDA does provides a relatively straightforward way to write code, using familiar C/C++ syntax, adds a few extra concepts, and generates code which will run across an array of processors. It has the potential to give much more than 10x speedup vs e.g. multi-core x86.
Edit - Plans: I do not work for NVIDIA
For the very best performance CUDA wants information at compile time.
So template mechanisms are the most useful because it gives the developer a way to say things at compile time, which the CUDA compiler could use. As a simple example, if a matrix is defined (instantiated) at compile time to be 2D and 4 x 8, then the CUDA compiler can work with that to organise the program across the processors. If that size is dynamic, and changes while the program is running, it is much harder for the compiler or run-time system to do a very efficient job.
EDIT:
CUDA has class and function templates.
I apologise if people read this as saying CUDA does not. I agree I was not clear.
I believe the CUDA GPU-side implementation of templates is not complete w.r.t. C++.
User harrism has commented that my answer is misleading. harrism works for NVIDIA, so I will wait for advice. Hopefully this is already clearer.
The hardest stuff to do efficiently across multiple processors is dynamic branching down many alternate paths because that effectively serialises the code; in the worst case only one processor can execute at a time, which wastes the benefit of a GPU. So virtual functions seem to be very hard to do well.
There are some very smart whole-program-analysis tools which can deduce much more type information than the developer might understand. Existing tools might deduce enough to eliminate virtual functions, and hence move analysis of branching to compile time. There are also techniques for instrumenting program execution which feeds directly back into recompilation of programs which might reach better branching decisions.
AFAIK (modulo feedback) the CUDA compiler is not yet state-of-the-art in these areas.
(IMHO it is worth a few days for anyone interested, with a CUDA or OpenCL-capable system, to investigate them, and do some experiments. I also think, for people interested in these areas, it is well worth the effort to experiment with Haskell, and have a look at Data Parallel Haskell)
CUDA is a platform (architecture, programming model, assembly virtual machine, compilation tools, etc.), not just a single programming language. CUDA C is just one of a number of language systems built on this platform (CUDA C, C++, CUDA Fortran, PyCUDA, are others.)
CUDA C++
Currently CUDA C++ supports the subset of C++ described in Appendix D ("C/C++ Language Support") of the CUDA C Programming Guide.
To name a few:
Classes
__device__ member functions (including constructors and destructors)
Inheritance / derived classes
virtual functions
class and function templates
operators and overloading
functor classes
Edit: As of CUDA 7.0, CUDA C++ includes support for most language features of the C++11 standard in __device__ code (code that runs on the GPU), including auto, lambda expressions, range-based for loops, initializer lists, static assert, and more.
Examples and specific limitations are also detailed in the same appendix linked above. As a very mature example of C++ usage with CUDA, I recommend checking out Thrust.
Future Plans
(Disclosure: I work for NVIDIA.)
I can't be explicit about future releases and timing, but I can illustrate the trend that almost every release of CUDA has added additional language features to get CUDA C++ support to its current (In my opinion very useful) state. We plan to continue this trend in improving support for C++, but naturally we prioritize features that are useful and performant on a massively parallel computational architecture (GPU).
Not realized by many, CUDA is actually two new programming languages, both derived from C++. One is for writing code that runs on GPUs and is a subset of C++. Its function is similar to HLSL (DirectX) or Cg (OpenGL) but with more features and compatibility with C++. Various GPGPU/SIMT/performance-related concerns apply to it that I need not mention. The other is the so-called "Runtime API," which is hardly an "API" in the traditional sense. The Runtime API is used to write code that runs on the host CPU. It is a superset of C++ and makes it much easier to link to and launch GPU code. It requires the NVCC pre-compiler which then calls the platform's C++ compiler. By contrast, the Driver API (and OpenCL) is a pure, standard C library, and is much more verbose to use (while offering few additional features).
Creating a new host-side programming language was a bold move on NVIDIA's part. It makes getting started with CUDA easier and writing code more elegant. However, truly brilliant was not marketing it as a new language.
Sometimes you hear that CUDA would be C and C++, but I don't think it is, for the simple reason that this impossible. To cite from their programming guide:
For the host code, nvcc supports whatever part of the C++ ISO/IEC
14882:2003 specification the host c++ compiler supports.
For the device code, nvcc supports the features illustrated in Section
D.1 with some restrictions described in Section D.2; it does not
support run time type information (RTTI), exception handling, and the
C++ Standard Library.
As I can see, it only refers to C++, and only supports C where this happens to be in the intersection of C and C++. So better think of it as C++ with extensions for the device part rather than C. That avoids you a lot of headaches if you are used to C.
What is NVIDIA's plan?
I believe the general trend is that CUDA and OpenCL are regarded as too low level techniques for many applications. Right now, Nvidia is investing heavily into OpenACC which could roughly be described as OpenMP for GPUs. It follows a declarative approach and tackles the problem of GPU parallelization at a much higher level. So that is my totally subjective impression of what Nvidia's plan is.

Visual studio compiler flag /arch and performance

I just noticed that in our project have left the "Enable Enhanced Instruction Set" flag left unset, probably just an oversight.
Before enabling the flag I would like to ask if anyone have seen any real-world performance improvements enabling it ?
I guess we will see some improvement our application constantly do floating point based calucations, but its not a major part,.
So in a nutshell: This setting only enables certain intrinsic functions that map directly on SSE instructions. In normal C++ programs you don't use these intrinsic functions, so this setting won't improve performance.
If you need more performance, you could try to find a compiler that rewrites your code to use SSE instructions (intel claims its compiler can), but its probably smarter to go for multicore (with openMP or .net 4.0), or use the GPU, which is faster and more flexible than SSE.
The performance benefit will depend on whether you project uses intensive mathematical computations. For many tasks (networking, text processing, data management) this simply isn't the case as no (or almost no) floating-point operations are used there. Hence, there will be no performance boost at all.
Using SSE/SSE2 instructions generated by the compiler would not generate top performance. First, you won't have any control on actual code generation. There are scenarios where you need to use legacy (x87) code on an old system and SSE/SSE2-enabled code on a new system. You might also want to take advantage of SSE3 on most newest systems. For that purpose, I'd recommend to check the processor type using the cpuid instruction and then switch to an implementation that could take most advantage of the processor capabilities. You can then use compiler intrinsics in the implementations targeting SSE/SSE2. To target SSE3, you'll need a dedicated library which I'm trying to locate on the internet.
I believe, there must exist libraries that perform the analysis of processor capabilities and allow for optimal code switcing. I just need some time to look on the net also.

Intel C++ compiler as an alternative to Microsoft's?

Is anyone here using the Intel C++ compiler instead of Microsoft's Visual c++ compiler?
I would be very interested to hear your experience about integration, performance and build times.
The Intel compiler is one of the most advanced C++ compiler available, it has a number of advantages over for instance the Microsoft Visual C++ compiler, and one major drawback. The advantages include:
Very good SIMD support, as far as I've been able to find out, it is the compiler that has the best support for SIMD instructions.
Supports both automatic parallelization (multi core optimzations), as well as manual (through OpenMP), and does both very well.
Support CPU dispatching, this is really important, since it allows the compiler to target the processor for optimized instructions when the program runs. As far as I can tell this is the only C++ compiler available that does this, unless G++ has introduced this in their yet.
It is often shipped with optimized libraries, such as math and image libraries.
However it has one major drawback, the dispatcher as mentioned above, only works on Intel CPU's, this means that advanced optimizations will be left out on AMD cpu's. There is a workaround for this, but it is still a major problem with the compiler.
To work around the dispatcher problem, it is possible to replace the dispatcher code produced with a version working on AMD processors, one can for instance use Agner Fog's asmlib library which replaces the compiler generated dispatcher function. Much more information about the dispatching problem, and more detailed technical explanations of some of the topics can be found in the Optimizing software in C++ paper - also from Anger (which is really worth reading).
On a personal note I have used the Intel c++ Compiler with Visual Studio 2005 where it worked flawlessly, I didn't experience any problems with microsoft specific language extensions, it seemed to understand those I used, but perhaps the ones mentioned by John Knoeller were different from the ones I had in my projects.
While I like the Intel compiler, I'm currently working with the microsoft C++ compiler, simply because of the financial extra investment the Intel compiler requires. I would only use the Intel compiler as an alternative to Microsofts or the GNU compiler, if performance were critical to my project and I had a the financial part in order ;)
I'm not using Intel C++ compiler at work / personal (I wish I would).
I would use it because it has:
Excellent inline assembler support. Intel C++ supports both Intel and AT&T (GCC) assembler syntaxes, for x86 and x64 platforms. Visual C++ can handle only Intel assembly syntax and only for x86.
Support for SSE3, SSSE3, and SSE4 instruction sets. Visual C++ has support for SSE and SSE2.
Is based on EDG C++, which has a complete ISO/IEC 14882:2003 standard implementation. That means you can use / learn every C++ feature.
I've had only one experience with this compiler, compiling STLPort. It took MSVC around 5 minutes to compile it and ICC was compiling for more than an hour. It seems that their template compilation is very slow. Other than this I've heard only good things about it.
Here's something interesting:
Intel's compiler can produce different
versions of pieces of code, with each
version being optimised for a specific
processor and/or instruction set
(SSE2, SSE3, etc.). The system detects
which CPU it's running on and chooses
the optimal code path accordingly; the
CPU dispatcher, as it's called.
"However, the Intel CPU dispatcher
does not only check which instruction
set is supported by the CPU, it also
checks the vendor ID string," Fog
details, "If the vendor string says
'GenuineIntel' then it uses the
optimal code path. If the CPU is not
from Intel then, in most cases, it
will run the slowest possible version
of the code, even if the CPU is fully
compatible with a better version."
OSnews article here
I tried using Intel C++ at my previous job. IIRC, it did indeed generate more efficient code at the expense of compilation time. We didn't put it to production use though, for reasons I can't remember.
One important difference compared to MSVC is that the Intel compiler supports C99.
Anecdotally, I've found that the Intel compiler crashes more frequently than Visual C++. Its diagnostics are a bit more thorough and clearly written than VC's. Thus, it's possible that the compiler will give diagnostics that weren't given with VC, or will crash where VC didn't, making your conversion more expensive.
However, I do believe that Intel's compiler allows you to link with Microsoft runtimes like the CRT, easing the transition cost.
If you are interoperating with managed code you should probably stick with Microsoft's compiler.
Recent Intel compilers achieve significantly better performance on floating-point heavy benchmarks, and are similar to Visual C++ on integer heavy benchmarks. However, it varies dramatically based on the program and whether or not you are using link-time code generation or profile-guided optimization. If performance is critical for you, you'll need to benchmark your application before making a choice. I'd only say that if you are doing scientific computing, it's probably worth the time to investigate.
Intel allows you a month-long free trial of its compiler, so you can try these things out for yourself.
I've been using the Intel C++ compiler since the first Release of Intel Parallel Studio, and so far I haven't felt the temptation to go back. Here's an outline of dis/advantages as well as (some obvious) observations.
Advantages
Parallelization (vectorization, OpenMP, SSE) is unmatched in other compilers.
Toolset is simply awesome. I'm talking about the profiling, of course.
Inclusion of optimized libraries such as Threading Building Blocks (okay, so Microsoft replicated TBB with PPL), Math Kernel Library (standard routines, and some implementations have MPI (!!!) support), Integrated Performance Primitives, etc. What's great also is that these libraries are constantly evolving.
Disadvantages
Speed-up is Intel-only. Well duh! It doesn't worry me, however, because on the server side all I have to do is choose Intel machines. I have no problem with that, some people might.
You can't really do OSS or anything like that on this, because the project file format is different. Yes, you can have both VS and IPS file formats, but that's just weird. You'll get lost in synchronising project options whenever you make a change. Intel's compiler has twice the number of options, by the way.
The compiler is a lot more finicky. It is far too easy to set incompatible project settings that will give you a cryptic compilation error instead of a nice meaningful explanation.
It costs additional money on top of Visual Studio.
Neutrals
I think that the performance argument is not a strong one anymore, because plenty of libraries such as Thrust or Microsoft AMP let you use GPGPU which will outgun your cpu anyway.
I recommend anyone interested to get a trial and try out some code, including the libraries. (And yes, the libraries are nice, but C-style interfaces can drive you insane.)
The last time the company I work for compared the two was about a year ago, (maybe 2). The Intel compiler generated faster code, usually only a bit faster, but in some cases quite a bit.
But it couldn't handle some of the MS language extensions that we depended on, so we ended up sticking with MS. It was VS 2005 that we were comparing it to. And I'm wracking my brain to remember exactly what MS extension the Intel compiler couldn't handle. I'll come back and edit this post if I can remember.
Intel C++ Compiler has AMAZING (human) support. Talking to Microsoft can literally take days. My non-trivial issue was solved through chat in under 10 minutes (including membership verification time).
EDIT: I have talked to Microsoft about problems in their products such as Office 2007, even got a bug reported. While I eventually succeeded, the overall size and complexity of their products and organization hierarchy is daunting.

dynamic code compilation

I'm working on a program that renders iterated fractal systems. I wanted to add the functionality where someone could define their own iteration process, and compile that code so that it would run efficiently.
I currently don't know how to do this and would like tips on what to read to learn how to do this.
The main program is written in C++ and I'm familiar with C++. In fact given most of the scenarios I know how to convert it to assembly code that would accomplish the goal, but I don't know how to take the extra step to convert it to machine code. If possible I'd like to dynamically compile the code like how I believe many game system emulators work.
If it is unclear what I'm asking, tell me so I can clarify.
Thanks!
Does the routine to be compiled dynamically need to be in any particular language. If the answer to that question is "Yes, it must be C++" you're probably out of luck. C++ is about the worst possible choice for online recompilation.
Is the dynamic portion of your application (the fractal iterator routine) a major CPU bottleneck? If you can afford using a language that isn't compiled, you can probably save yourself an awful lot of trouble. Lua and JavaScript are both heavily optimized interpreted languages that only run a few times slower than native, compiled code.
If you really need the dynamic functionality to be compiled to machine code, your best bet is probably going to be using clang/llvm. clang is the C/Objective-C front end being developed by Apple (and a few others) to make online, dynamic recompilation perform well. llvm is the backend clang uses to translate from a portable bytecode to native machine code. Be advised that clang does not currently support much of C++, since that's such a difficult language to get right.
Some CPU emulators treat the machine code as if it was byte code and they do a JIT compile, almost as if it was Java. This is very efficient, but it means that the developers need to write a version of the compiler for each CPU their emulator runs on and for each CPU emulated.
That usually means it only works on x86 and is annoying to anyone who would like to use something different.
They could also translate it to LLVM or Java byte code or .Net CIL and then compile it, which would also work.
In your case I am not sure that sort of thing is the best way to go. I think that I would do this by using dynamic libraries. Make a directory that is supposed to contain "plugins" and let the user compile their own. Make your program scan the directory and load each DLL or .so it finds.
Doing it this way means you spend less time writing code compilers and more time actually getting stuff done.
If you can write your dynamic extensions in C (not C++), you might find the Tiny C Compiler to be of use. It's available under the LGPL, it's compatible for Windows and Linux, and it's a small executable (or library) at ~100kb for the preprocessor, compiler, linker and assembler, all of which it does very fast. The downside to that, of course, is that it can't compare to the optimizations you can get with GCC. Another potential downside is that it's X86 only AFAIK.
If you did decide to write assembly, TCC can handle that -- the documentation says it supports a gas-like syntax, and it does support X86 opcodes.
TCC also fully supports ANSI C, and it's nearly fully compliant with C99.
That being said, you could either include TCC as an executable with your application or use libtcc (there's not too much documentation of libtcc online, but it's available in the source package). Either way, you can use tcc to generate dynamic or shared libraries, or executables. If you went the dynamic library route, you would just put in a Render (or whatever) function in it, and dlopen or LoadLibrary on it, and call Render to finally run the user-designed rendering. Alternatively, you could make a standalone executable and popen it, and do all your communication through the standalone's stdin and stdout.
Since you're generating pixels to be displayed on a screen, have you considered using HLSL with dynamic shader compile? That will give you access to SIMD hardware designed for exactly this sort of thing, as well as the dynamic compiler built right into DirectX.
LLVM should be able to do what you want to do. It allows you to form a description of the program you'd like to compile in an object-oriented manner, and then it can compile that program description into native machine code at runtime.
Nanojit is a pretty good example of what you want. It generates machine code from an intermediate langauge. It's C++, and it's small and cross-platform. I haven't used it very extensively, but I enjoyed toying around just for demos.
Spit the code to a file and compile it as a dynamically loaded library, then load it and call it.
Is there are reason why you can't use a GPU-based solutions? This seems to be screaming for one.