Fastest way to run a program in a 64 bit environment?

Fastest way to run a program in a 64 bit environment? - c++

It's been a couple of decades since I've done any programming. As a matter of fact the last time I programmed was in an MS-DOS environment before Windows came out. I've had this programming idea that I have wanted to try for a few years now and I thought I would give it a try. The amount of calculations are enormous. Consequently I want to run it in the fastest environment I can available to a general hobby programmer.
I'll be using a 64 bit machine. Currently it is running Windows 7. Years ago a program ran much slower in the windows environment then then in MS-DOS mode. My personal programming experience has been in Fortran, Pascal, Basic, and machine language for the 6800 Motorola series processors. I'm basically willing to try anything. I've fooled around with Ubuntu also. No objections to learning new. Just want to take advantage of speed. I'd prefer to spend no money on this project. So I'm looking for a free or very close to free compiler. I've downloaded Microsoft Visual Studio C++ Express. But I've got a feeling that the completed compiled code will have to be run in the Windows environment. Which I'm sure slows the processing speed considerably.
So I'm looking for ideas or pointers to what is available.
Thank you,
Have a Great Day!
Jim

Speed generally comes with the price of either portability or complexity.
If your programming idea involves lots of computation, then if you're using Intel CPU, you might want to use Intel's compiler, which might benefit from some hidden processor features that might make your program faster. Otherwise, if portability is your goal, then use GCC (GNU Compiler Collection), which can cross-compile well optimized executable to practically any platform available on earth. If your computation can be parallelizable, then you might want to look at SIMD (Single Input Multiple Data) and GPGPU/CUDA/OpenCL (using graphic card for computation) techniques.
However, I'd recommend you should just try your idea in the simpler languages first, e.g. Python, Java, C#, Basic; and see if the speed is good enough. Since you've never programmed for decades now, it's likely your perception of what was an enormous computation is currently miniscule due to the increased processor speed and RAM. Nowadays, there is not much noticeable difference in running in GUI environment and command line environment.

Tthere is no substantial performance penalty to operating under Windows and a large quantity of extremely high performance applications do so. With new compiler advances and new optimization techniques, Windows is no longer the up-and-coming, new, poorly optimized technology it was twenty years ago.
The simple fact is that if you haven't programmed for 20 years, then you won't have any realistic performance picture at all. You should make like most people- start with an easy to learn but not very fast programming language like C#, create the program, then prove that it runs too slowly, then make several optimization passes with tools such as profilers, then you may decide that the language is too slow. If you haven't written a line of code in two decades, the overwhelming probability is that any program that you write will be slow because you're a novice programmer from modern perspectives, not because of your choice of language or environment. Creating very high performance applications requires a detailed understanding of the target platform as well as the language of choice, AND the operations of the program.
I'd definitely recommend Visual C++. The Express Edition is free and Visual Studio 2010 can produce some unreasonably fast code. Windows is not a slow platform - even if you handwrote your own OS, it'd probably be slower, and even if you produced one that was faster, the performance gain would be negligible unless your program takes days or weeks to complete a single execution.

The OS does not make your program magically run slower. True, the OS does eat a few clock cycles here and there, but it's really not enough to be at all noticeable (and it does so in order to provide you with services you most likely need, and would need to re-implement yourself otherwise)
Windows doesn't, as some people seem to believe, eat 50% of your CPU. It might eat 0.5%, but so does Linux and OSX. And if you were to ditch all existing OS'es and instead write your own from scratch, you'd end up with a buggy, less capable OS which also eats a bit of CPU time.
So really, the environment doesn't matter.
What does matter is what hardware you run the program on (and here, running it on the GPU might be worth considering) and how well you utilize the hardware (concurrency is pretty much a must if you want to exploit modern hardware).
What code you write, and how you compile it does make a difference. The hardware you're running on makes a difference. The choice of OS does not.

A digression: that the OS doesn't matter for performance is, in general, obviously false. Citing CPU utilization when idle seems a quite "peculiar" idea to me: of course one hopes that when no jobs are running the OS is not wasting energy. Otherwise one measure the speed/throughput of an OS when it is providing a service (i.e. mediating the access to hardware/resources).
To avoid an annoying MS Windows vs Linux vs Mac OS X battle, I will refer to a research OS concept: exokernels. The point of exokernels is that a traditional OS is not just a mediator for resource access but it implements policies. Such policies does not always favor the performance of your application-specific access mode to a resource. With the exokernel concept, researchers proposed to "exterminate all operating system abstractions" (.pdf) retaining its multiplexer role. In this way:
… The results show that common unmodified UNIX applications can enjoy the benefits of exokernels: applications either perform comparably on Xok/ExOS and the BSD UNIXes, or perform significantly better. In addition, the results show that customized applications can benefit substantially from control over their resources (e.g., a factor of eight for a Web server). …
So bypassing the usual OS access policies they gained, for a customized web server, an increase of about 800% in performance.
Returning to the original question: it's generally true that an application is executed with no or negligible OS overhead when:
it has a compute-intensive kernel, where such kernel does not call the OS API;
memory is enough or data is accessed in a way that does not cause excessive paging;
all inessential services running on the same systems are switched off.
There are possibly other factors, depending by hardware/OS/application.
I assume that the OP is correct in its rough estimation of computing power required. The OP does not specify the nature of such intensive computation, so its difficult to give suggestions. But he wrote:
The amount of calculations are enormous
"Calculations" seems to allude to compute-intensive kernels, for which I think is required a compiled language or a fast interpreted language with native array operators, like APL, or modern variant such as J, A+ or K (potentially, at least: I do not know if they are taking advantage of modern hardware).
Anyway, the first advice is to spend some time in researching fast algorithms for your specific problem (but when comparing algorithms remember that asymptotic notation disregards constant factors that sometimes are not negligible).
For the sequential part of your program a good utilization of CPU caches is crucial for speed. Look into cache conscious algorithms and data structures.
For the parallel part, if such program is amenable to parallelization (remember both Amdahl's law and Gustafson's law), there are different kinds of parallelism to consider (they are not mutually exclusive):
Instruction-level parallelism: it is taken care by the hardware/compiler;
data parallelism:
bit-level: sometimes the acronym SWAR (SIMD Within A Register) is used for this kind of parallelism. For problems (or some parts of them) where it can be formulated a data representation that can be mapped to bit vectors (where a value is represented by 1 or more bits); so each instruction from the instruction set is potentially a parallel instruction which operates on multiple data items (SIMD). Especially interesting on a machine with 64 bits (or larger) registers. Possible on CPUs and some GPUs. No compiler support required;
fine-grain medium parallelism: ~10 operations in parallel on x86 CPUs with SIMD instruction set extensions like SSE, successors, predecessors and similar; compiler support required;
fine-grain massive parallelism: hundreds of operations in parallel on GPGPUs (using common graphic cards for general-purpose computations), programmed with OpenCL (open standard), CUDA (NVIDIA), DirectCompute (Microsoft), BrookGPU (Stanford University) and Intel Array Building Blocks. Compiler support or use of a dedicated API is required. Note that some of these have back-ends for SSE instructions also;
coarse-grain modest parallelism (at the level of threads, not single instructions): it's not unusual for CPUs on current desktops/laptops to have more then one core (2/4) sharing the same memory pool (shared-memory). The standard for shared-memory parallel programming is the OpenMP API, where, for example in C/C++, #pragma directives are used around loops. If I am not mistaken, this can be considered data parallelism emulated on top of task parallelism;
task parallelism: each core in one (or multiple) CPU(s) has its independent flow of execution and possibly operates on different data. Here one can use the concept of "thread" directly or a more high-level programming model which masks threads.
I will not go into details of these programming models here because apparently it is not what the OP needs.
I think this is enough for the OP to evaluate by himself how various languages and their compilers/run-times / interpreters / libraries support these forms of parallelism.

Just my two cents about DOS vs. Windows.
Years ago (something like 1998?), I had the same assumption.
I have some program written in QBasic (this was before I discovered C), which did intense calculations (neural network back-propagation). And it took time.
A friend offered to rewrite the thing in Visual Basic. I objected, because, you know, all those gizmos, widgets and fancy windows, you know, would slow down the execution of, you know, the important code.
The Visual Basic version so much outperformed the QBasic one that it became the default application (I won't mention the "hey, even in Excel's VBA, you are outperformed" because of my wounded pride, but...).
The point here, is the "you know" part.
You don't know.
The OS here is not important. As others explained in their answers, choose your hardware, and choose your language. And write your code in a clear way because now, compilers are better at optimizing code developers, unless you're John Carmack (premature optimization is the root of all evil).
Then, if you're not happy with the result, use a profiler to test your code. Consider multithreading (which will help you if you have multiple cores... TBB comes to mind).

What are you trying to do? I believe all the stuff should be compiled in 64bit mode by default. Computers have gotten a lot faster. Speed should not be a problem for the most part.
Side note: As for computation intense stuff you may want to look into OpenCL or CUDA. OpenCL and CUDA take advantage of the GPU which can transfer lots of information at a time compared to the CPU.

If your last points of reference are M68K and PCs running DOS then I'd suggest that you start with C/C++ on a modern processor and OS. If you run into performance problems and can prove that they are caused by running on Linux / Windows or that the compiler / optimizer generated code isn't sufficient, then you could look at other OSes and/or hand coded ASM. If you're looking for free, Linux / gcc is a good place to start.

I am the original poster of this thread.
I am once again reiterating the emphasis that this program will have enormous number of calculations.
Windows & Ubuntu are multi-tasking environments. There are processes running and many of them are using processor resources. True many of them are seen as inactive. But still the Windows environment by the nature of multi-tasking is constantly monitoring the need to start up each process. For example currently there are 62 processes showing in the Windows Task Manager. According the task manager three are consuming CPU resouces. So we have three ongoing processes that are consuming CPU processing. There are an addition 59 showing active but consuming no CPU processing. So that is 63 being monitored by Windows and then there is the Windows that also is checking on various things.
I was hoping to find some way to just be able to run a program on the bare machine level. Side stepping all the Windows (Ubuntu) involvement.
The idea is very calculation intensive.
Thank you all for taking the time to respond.
Have a Great Day,
Jim

Related

Determine Values AND/OR Address of Values in CPU Cache

Is there a way to determine exactly what values, memory addresses, and/or other information currently resides in the CPU cache (L1, L2, etc.) - for current or all processes?
I've been doing quite a bit a reading which shows how to optimize programs to utilize the CPU cache more effectively. However, I'm looking for a way to truly determine if certain approaches are effective.
Bottom line: is it possible to be 100% certain what does and does not make it into the CPU cache.
Searching for this topic returns several results on how to determine the cache size, but not contents.
Edit: To clarify some of the comments below: Since software would undoubtedly alter the cache, do CPU manufactures have a tool / hardware diagnostic system (built-in) which provides this functionality?

Without using specialized hardware, you cannot directly inspect what is in the CPU cache. The act of running any software to inspect the CPU cache would alter the state of the cache.
The best approach I have found is simply to identify real hot spots in your application and benchmark alternative algorithms on hardware the code will run on in production (or on a range of likely hardware if you do not have control over the production environment).

In addition to Eric J.'s answer, I'll add that while I'm sure the big chip manufacturers do have such tools it's unlikely that such a "debug" facility would be made available to regular mortals like you and I, but even if it were, it wouldn't really be of much help.
Why? It's unlikely that you are having performance issues that you've traced to cache and which cannot be solved using the well-known and "common sense" techniques for maintaining high cache-hit ratios.
Have you really optimized all other hotspots in the code and poor cache behavior by the CPU is the problem? I very much doubt that.
Additionally, as food for thought: do you really want to optimize your program's behavior to only one or two particular CPUs? After all, caching algorithms change all the time, as do the parameters of the caches, sometimes dramatically.

If you have a relatively modern processor running Windows then take a look at
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
and see if that might provide some of what you are looking for.

To optimize for one specific CPU cache size is usually in vain since this optimization will break when your assumptions about the CPU cache sizes are wrong when you execute on a different CPU.
But there is a way out there. You should optimize for certain access patterns to allow the CPU to easily predict what memory locations should be read next (the most obvious one is a linear increasing read). To be able to fully utilize a CPU you should read about cache oblivious algorithms where most of them follow a divide and conquer strategy where a problem is divided into sub parts to a certain extent until all memory accesses fit completly into the CPU cache.
It is also noteworthy to mention that you have a code and data cache which are separate. Herb Sutter has a nice video online where he talks about the CPU internals in depth.
The Visual Studio Profiler can collect CPU counters dealing with memory and L2 counters. These options are available when you select instrumentation profiling.
Intel has also a paper online which talks in greater detail about these CPU counters and what the task manager of Windows and Linux do show you and how wrong it is for todays CPUs which do work internally asynchronous and parallel at many diffent levels. Unfortunatley there is no tool from intel to display this stuff directly. The only tool I do know is the VS profiler. Perhaps VTune has similar capabilities.
If you have gone this far to optimize your code you might look as well into GPU programming. You need at least a PHD to get your head around SIMD instructions, cache locality, ... to get perhaps a factor 5 over your original design. But by porting your algorithm to a GPU you get a factor 100 with much less effort ony a decent graphics card. NVidia GPUs which do support CUDA (all today sold cards do support it) can be very nicely programmed in a C dialect. There are even wrapper for managed code (.NET) to take advantage of the full power of GPUs.
You can stay platform agnostic by using OpenCL but NVidia OpenCL support is very bad. The OpenCL drivers are at least 8 times slower than its CUDA counterpart.

Almost everything you do will be in the cache at the moment when you use it, unless you are reading memory that has been configured as "uncacheable" - typically, that's frame buffer memory of your graphics card. The other way to "not hit the cache" is to use specific load and store instructions that are "non-temporal". Everything else is read into the L1 cache before it reaches the target registers inside the CPU itself.
For nearly all cases, CPU's do have a fairly good system of knowing what to keep and what to throw away in the cache, and the cache is nearly always "full" - not necessarily of useful stuff, if, for example you are working your way through an enormous array, it will just contain a lot of "old array" [this is where the "non-temporal" memory operations come in handy, as they allow you to read and/or write data that won't be stored in the cache, since next time you get back to the same point, it won't be in the cache ANYWAYS].
And yes, processors usually have special registers [that can be accessed in kernel drivers] that can inspect the contents of the cache. But they are quite tricky to use without at the same time losing the content of the cache(s). And they are definitely not useful as "how much of array A is in the cache" type checking. They are specifically for "Hmm, it looks like cache-line 1234 is broken, I'd better read the cached data to see if it's really the value it should be" when processors aren't working as they should.
As DanS says, there are performance counters that you can read from suitable software [need to be in the kernel to use those registers too, so you need some sort of "driver" software for that]. In Linux, there's "perf". And AMD has a similar set of performance counters that can be used to find out, for example "how many cache misses have we had over this period of time" or "how many cache hits in L" have we had, etc.

Increasing C++ Program CPU Use

I have a program written in C++ that runs a number of for loops per second without using anything that would make it wait for any reason. It consistently uses 2-10% of the CPU. Is there any way to force it to use more of the CPU and do a greater number of calculations without making the program more complex? Additionally, I compile with C::B on a Windows computer. Essentially, I'm asking whether there is a way to make my program faster by increasing usage of CPU, and if so, how.

That depends on why it's only using 10% of the CPU. If it's because you're using a multi-CPU machine and your program is using only one CPU, then no, you will have to introduce concurrency into your code to use that additional horsepower.
If it's being limited by something else (e.g. copying data to and from the disk), then you don't need to focus on CPU, you need to focus on whatever the bottleneck is. Most likely, the limiter will be reading from the disk, which you can improve by using better caching mechanisms.

Assuming your application has the power (PROCESS_SET_INFORMATION access right), you can use SetPriorityClass to bump up your priortiy (to the usual detriment of all other processes, of course).
You can go ABOVE_NORMAL_PRIORITY_CLASS (try this one first), HIGH_PRIORITY_CLASS (be very careful with this one) or REALTIME_PRIORITY_CLASS (I would strongly suggest that you probably shouldn't give this one a shot).
If you try the higher priorities and it's still clocking pretty low, then that's probably because you're not CPU-bound (such as if you're writing data to an output file). If that's the case, you'll probably have to find a way to make yourself CPU bound.
Just keep in mind that doing so may not be necessary (or even desirable). If you're running at a higher priority than other threads and you're still not sucking up a lot of CPU, it's probably because Windows has (most likely, rightfully) decided you don't need it.

It's really not the program's right or responsibility to demand additional resources from the system. That's the OS' job, as resource scheduler.
If it is necessary to use more CPU time than the OS sees fit, you should request that from the OS using the platform-dependent API. In this case, that seems to be something along the lines of SetPriorityClass or SetThreadPriority.

Creating a thread & giving higher priority to the thread might be one way.

If you use C++, consider using Intel Threading Building Block. You can find some examples here.

Some profilers give very nice indications of where bottlenecks in your code are. For example - the CodeAnalyst (for AMD chips only) has the instructions per cycle ratio. I'm sure intel profilers are similar.
As Billy O'Neal says though, if your runnning on an 8-core, being stuck on 10 percent of cpu is about right. If this is your problem then Windows msvc++ has a parallel mode (the parallel patterns library) for the standard algorithms. This can give parallelisation for free if have written your loops the c++ way (its still your responsibility to make sure your loops are thread safe). I've not used the msvc version but the gnu::__parallel_for_each etc work a treat.

Low latency trading systems using C++ in Windows?

It seems that all the major investment banks use C++ in Unix (Linux, Solaris) for their low latency/high frequency server applications. Why is Windows generally not used as a platform for this? Are there technical reasons why Windows can't compete?

The performance requirements on the extremely low-latency systems used for algorithmic trading are extreme. In this environment, microseconds count.
I'm not sure about Solaris, but the case of Linux, these guys are writing and using low-latency patches and customisations for the whole kernel, from the network card drivers on up. It's not that there's a technical reason why that couldn't be done on Windows, but there is a practical/legal one - access to the source code, and the ability to recompile it with changes.

Technically, no. However, there is a very simple business reason: the rest of the financial world runs on Unix. Banks run on AIX, the stock market itself runs on Unix, and therefore, it is simply easier to find programmers in the financial world that are used to a Unix environment, rather than a Windows one.

(I've worked in investment banking for 8 years)
In fact, quite a lot of what banks call low latency is done in Java. And not even Real Time Java - just normal Java with the GC turned off. The main trick here is to make sure you've exercised all of your code enough for the jit to have run before you switch a particular VM into prod ( so you have some startup looping that runs for a couple of minutes - and hot failover).
The reasons for using Linux are:
Familiarity
Remote administration is still better, and also low impact - it will have a minimal effect on the other processes on the machine. Remember, these systems are often co-located at the exchange, so the links to the machines (from you/your support team) will probably be worse than those to your normal datacentres.
Tunability - the ability to set swappiness to 0, get the JVM to preallocate large pages, and other low level tricks is quite useful.
I'm sure you could get Windows to work acceptably, but there is no huge advantage to doing so - as others have said, any employees you poached would have to rediscover all their latency busting tricks rather than just run down a checklist.

Linux/UNIX are much more usable for concurrent remote users, making it easier to script around the systems, use standard tools like grep/sed/awk/perl/ruby/less on logs... ssh/scp... all that stuff's just there.
There are also technical issues, for example: to measure elapsed time on Windows you can choose between a set of functions based on the Windows clock tick, and the hardware-based QueryPerformanceCounter(). The former is increments each 10 to 16 milliseconds (note: some documentation implies more precision - e.g. the values from GetSystemTimeAsFileTime() measure to 100ns, but they report the same 100ns edge of the clock tick until it ticks again). The latter - QueryPerformanceCounter() - has show-stopping issues where different cores/cpus can report clocks-since-startup that differ by several seconds due to being warmed up at different times during system boot. MSDN documents this as a possible BIOS bug, but it's common. So, who wants to develop low-latency trading systems on a platform that can't be instrumented properly? (There are solutions, but you won't find any software ones sitting conveniently in boost or ACE).
Many Linux/UNIX variants have lots of easily tweakable parameters to trade off latency for a single event against average latency under load, time slice sizes, scheduling policies etc.. On open source Operating Systems, there's also the assurance that comes with being able to refer to the code when you think something should be faster than it is, and the knowledge that a (potentially huge) community of people have been and are doing so critically - with Windows it's obviously mainly going to be the people who're assigned to look at it.
On the FUD/reputation side - somewhat intangible but an important part of the reasons for OS selection - I think most programmers in the industry would just trust Linux/UNIX more to provide reliable scheduling and behaviour. Further, Linux/UNIX has a reputation for crashing less, though Windows is pretty reliable these days, and Linux has a much more volatile code base than Solaris or FreeBSD.

Reason is simple, 10-20 years ago when such systems emerged, "hardcore" multi-CPU servers were ONLY on some sort of UNIX. Windows NT was in kinder-garden these days. So the reason is "historical".
Modern systems might be developed on Windows, it's just a matter of taste these days.
PS: I am currencly working on one of such systems :-)

I partially agree with most of the answers above. Though what I have realized is the biggest reason to use C++ is becuase it is relatively faster with a very vast STL library.
Apart from it, linux/unix system is also used to boost performance. I know many low latency team which go to a extent of tweaking the linux kernel. Obviously this level of freedom is not provided by windows.
Other reasons like legacy systems, license cost, resources count as well but are lesser driving factors. As "rjw" mentioned, I have seen teams use Java as well with a modified JVM.

There are a variety of reasons, but the reason is not only historical. In fact, it seems as if more and more server-side financial applications run on *nix these days than ever before (including big names like the London Stock Exchange, who switched from a .NET platform). For client-side or desktop apps, it would be silly to target anything other than Windows, as that is the established platform. However, for server-side apps, most places that I have worked at deploy to *nix.

I second the opinions of historical and access to kernel manipulation.
Apart from those reasons I also believe that just like how they turn off garbage collection of .NET and the similar mechanism in Java when using these technologies in some low latency. They might avoid Windows because of the API's at high level which interact with low level os and then the kernel.
So the core is of course the kernel which can be interacted with using the low level os. The high level APIs are provided just to make the common users life easier. But in case of Low latency this turns out to be a fatty layer and fraction seconds loss around each operation. So a lucrative option for gaining few fraction seconds per call.
Apart from this another thing to consider is integration. Most of the servers, data centers, exchanges use UNIX not windows so using the clients of same family makes the integration and communication easier.
Then you have security issues (many people out there might not agree with this point though) hacking UNIX is not easy compared to hacking WINDOWS. I don't agree licensing must be the issue for banks because they shower money on every single piece of hardware and software and the people who customize them, so buying licenses will not be as bigger the issue when considered what they gain by purchasing.

OpenMP & MPI explanation

A few minutes ago I stumbled upon some text, which reminded me of something that has been wondering my mind for a while, but I had nowhere to ask.
So, in hope this may be the place, where people have hands on
experience with both, I was wondering if someone could explain
what is the difference between OpenMP and MPI ?
I've read the Wikipedia articles in whole, understood them in
segments, but am still pondering;
for a Fortran programmer who wishes one day to enter the world of
paralellism (just learning the basics of OpenMP now), what is the more
future-proof way to go ?
I would be grateful on all your comments

OpenMP is primarily for tightly coupled multiprocessing -- i.e., multiple processors on the same machine. It's mostly for things like spinning up a number of threads to execute a loop in parallel.
MPI is primarily for loosely couple multiprocessing -- i.e., a cluster of computers talking to each other via a network. It can be used on a single machine as kind of a degenerate form of a network, but it does relatively little to take advantage of its being a single machine (e.g., having extremely high bandwidth communication between the "nodes").
Edit (in response to comment): for a cluster of 24 machines, MPI becomes the obvious choice. As noted above (and similar to #Mark's comments) OpenMP is primarily for multiple processors that share memory. When you don't have shared memory, MPI becomes the clear choice.
At the same time, assuming you're going to be using multiprocessor machines (is there anything else anymore?) you might want to use OpenMP to spread the load in each machine across all its processors.
Keep in mind, however, that OpenMP is generally quite a lot quicker/easier to put into use than MPI. Depending on how much speedup you need, scaling up instead of out (i.e. a fewer machines with more processors each) can make the software development enough quicker/cheaper that it can be worthwhile even though it rarely gives the lowest price per core.

Another view, not inconsistent with what #Jerry has already written is that OpenMP is for shared-memory parallelisation and MPI is for distributed-memory parallelisation. Emulating shared-memory on distributed systems is rarely convincing or successful, but it's a perfectly reasonable approach to use MPI on a shared-memory system.
Of course, all (?) multicore PCs and servers are shared-memory systems these days so the execution model for OpenMP is widely applicable. MPI tends to come into its own on clusters on which processors communicate with each other over a network (which is sometimes called an interconnect and is often of a higher-spec than office Ethernet).
In terms of applications I would estimate that a large proportion of parallel programs can be successfully implemented with either OpenMP or MPI and that your choice between the two is probably best driven by the availability of hardware. Most of us (parallel-ists) would regard OpenMP as easier to get into than MPI, and it is certainly (I assert) easier to incrementally parallelise an existing program with OpenMP than with MPI.
However, if you need to use more processors than you can get in one box (and how many processors that is is increasing steadily) then MPI is your better choice. You may also stumble across the idea of hybrid programming -- for example if you have a cluster of multicore PCs you might use MPI between PCs, and OpenMP within a PC. I've not seen any evidence that the additional complexity of programming is rewarded by improved performance, and I've seen some evidence that it is definitely not worth the effort.
And, as one of the comments has already stated, I think that Fortran is future-proof enough in the domain of parallel, high-performance, scientific and engineering applications. The latest (2008) edition of the standard incorporates co-arrays (ie arrays which themselves are distributed across a memory system with non-local and local access) right into the language. There are even one or two early implementations of this feature. I don't yet have any experience of them and expect that there will be teething issues for a few years.
EDIT to pick up on a number of points in OP's comments ...
No, I don't think that it's a bad idea to approach parallel computing via OpenMP. I think that OpenMP and MPI (or, more accurately, the models of parallel computing that they implement) are complementary. I certainly use both, and I suspect that most professional parallel programmers do too. I hadn't done much OpenMP since leaving university about 6 years ago until about 2 years ago when multicores really started popping up everywhere. Now I probably do about equal amounts of both.
In terms of your further (self-)education I think that the book Using OpenMP by Chapman et al is better than the one by Chandra, if only because it is much more up to date. I think that the Chandra book pre-dates OpenMP 2, and the Chapman book pre-dates OpenMP 3 which is worth learning.
On the MPI side the books by Gropp et al, Using MPI and Using MPI-2 are indispensable; this is perhaps because they are (as far as I have found) the only tutorial introductions to MPI rather than because of their excellence. I don't think that they are bad, mind you but they don't have a lot of competition. I like Parallel Scientific Computing in C++ and MPI by Karniadakis and Kirby too; depending on your level of scientific computing knowledge though you may find much of the material too basic.
But what I think the field lacks entirely (hope someone can prove me wrong here ?) is a good textbook (or handful of textbooks) on the design of programs for parallel execution, something to help the experienced Fortran (in our case) programmer make the jump from serial to parallel program design. Lots of info on how to parallelise a loop or nest of loops, not so much on options for parallelising computations on structured positive semi-definite matrices (or whatever). For that level of information we have to dig quite hard into the research papers (ACM and IEEE digital libraries are well worth the modest annual costs -- if you are at an academic institution your library probably has subscriptions to these and a lot more, I'm lucky in that my employers pay for my professional society memberships and add-ons, but if they didn't I would pay myself).
As to your plans for a new lab with, say, 24 processors (CPUs ? or cores ?, doesn't really matter, just asking) then the route you take should depend on the depths of your pocket. If you can afford it I'd suggest:
-- Consider a shared-memory computer, certainly a year ago Sun, SGI and IBM could all supply a shared-memory system with that sort of number of cores, I'm not sure of the current state of the market but since you have until Feb to decide it's worth looking into. A shared-memory system gives you the shared-memory parallelism option, which a cluster doesn't, and message-passing on a shared-memory platform should run at light speed. (By the way, if you go this route, benchmark this aspect of the system, there have been some bad MPI implementations on shared-memory computers.) A good MPI implementation on a shared-memory computer (my last experience of this was on a 512 processor SGI Altix) doesn't send any messages, it just moves a few pointers around and is, consequently, blisteringly fast. Trouble with the Altix was that beyond 128 processors the memory bus tended to get overwhelmed by all the traffic; that was the time to switch to MPI on a cluster or an MPP box.
-- Again, if you can afford it, I'd recommend having a system integrator deliver you a working system, and avoid building a cluster (or whatever) yourself. If, like me, you are a programmer first and a reluctant system integrator way second, this is an easier approach and will deliver you a working system on which you can start programming far sooner.
If you can't afford the expensive options, then I'd go for as many rack-mounted servers with 4 or 8 cores per box (choice is price dependent, and maybe even 16 cores per box is worth considering today) and, today, I'd be planning for at least 4GB RAM per core. Then you need the fastest interconnect you can afford; GB Ethernet is fine, but Infiniband (or the other one whose name I forget) is finer, though the jump in price is noticeable. And you'll need a PC to act as head node for your new cluster, running the job management system and other stuff. There's a ton of excellent material on the Internet on building and running clusters, often under the heading of Beowulf, which was the name of what is considered to have been the first 'home-brew' cluster.
Now, since you have until February to get your lab up and running, fire 2 of your colleagues and turn their PCs into a mini-Beowulf. Download and install a likely-looking MPI installation (OpenMPI is good but there are others to consider and your o/s might dictate another choice). Now you can start getting ready for when the lab is ready.
PS You don't have to fire 2 people if you can scavenge 2 PCs some other way. And the PCs can be old and inadequate for desktop use they are just going to be a training platform for you and your colleagues (if you have any left). The more nearly identical they are the better.

As said above, OpenMP is certainly the easier way to program as compared to MPI because of incremental parallelization. OpenMP has been used mostly for fine grain parallelism (loop level), while MPI more of a coarse-grain parallelism (domain decomposition). Both are good ways to obtain parallel performance.
We have an OpenMP and an MPI versions of our software (Fortran) and Customers use both depending on their needs.
With the current trends in multi-core architecture, hybrid OpenMP-MPI is another viable approach.

How are you taking advantage of Multicore?

As someone in the world of HPC who came from the world of enterprise web development, I'm always curious to see how developers back in the "real world" are taking advantage of parallel computing. This is much more relevant now that all chips are going multicore, and it'll be even more relevant when there are thousands of cores on a chip instead of just a few.
My questions are:
How does this affect your software roadmap?
I'm particularly interested in real stories about how multicore is affecting different software domains, so specify what kind of development you do in your answer (e.g. server side, client-side apps, scientific computing, etc).
What are you doing with your existing code to take advantage of multicore machines, and what challenges have you faced? Are you using OpenMP, Erlang, Haskell, CUDA, TBB, UPC or something else?
What do you plan to do as concurrency levels continue to increase, and how will you deal with hundreds or thousands of cores?
If your domain doesn't easily benefit from parallel computation, then explaining why is interesting, too.
Finally, I've framed this as a multicore question, but feel free to talk about other types of parallel computing. If you're porting part of your app to use MapReduce, or if MPI on large clusters is the paradigm for you, then definitely mention that, too.
Update: If you do answer #5, mention whether you think things will change if there get to be more cores (100, 1000, etc) than you can feed with available memory bandwidth (seeing as how bandwidth is getting smaller and smaller per core). Can you still use the remaining cores for your application?

My research work includes work on compilers and on spam filtering. I also do a lot of 'personal productivity' Unix stuff. Plus I write and use software to administer classes that I teach, which includes grading, testing student code, tracking grades, and myriad other trivia.
Multicore affects me not at all except as a research problem for compilers to support other applications. But those problems lie primarily in the run-time system, not the compiler.
At great trouble and expense, Dave Wortman showed around 1990 that you could parallelize a compiler to keep four processors busy. Nobody I know has ever repeated the experiment. Most compilers are fast enough to run single-threaded. And it's much easier to run your sequential compiler on several different source files in parallel than it is to make your compiler itself parallel. For spam filtering, learning is an inherently sequential process. And even an older machine can learn hundreds of messages a second, so even a large corpus can be learned in under a minute. Again, training is fast enough.
The only significant way I have of exploiting parallel machines is using parallel make. It is a great boon, and big builds are easy to parallelize. Make does almost all the work automatically. The only other thing I can remember is using parallelism to time long-running student code by farming it out to a bunch of lab machines, which I could do in good conscience because I was only clobbering a single core per machine, so using only 1/4 of CPU resources. Oh, and I wrote a Lua script that will use all 4 cores when ripping MP3 files with lame. That script was a lot of work to get right.
I will ignore tens, hundreds, and thousands of cores. The first time I was told "parallel machines are coming; you must get ready" was 1984. It was true then and is true today that parallel programming is a domain for highly skilled specialists. The only thing that has changed is that today manufacturers are forcing us to pay for parallel hardware whether we want it or not. But just because the hardware is paid for doesn't mean it's free to use. The programming models are awful, and making the thread/mutex model work, let alone perform well, is an expensive job even if the hardware is free. I expect most programmers to ignore parallelism and quietly get on about their business. When a skilled specialist comes along with a parallel make or a great computer game, I will quietly applaud and make use of their efforts. If I want performance for my own apps I will concentrate on reducing memory allocations and ignore parallelism.
Parallelism is really hard. Most domains are hard to parallelize. A widely reusable exception like parallel make is cause for much rejoicing.
Summary (which I heard from a keynote speaker who works for a leading CPU manufacturer): the industry backed into multicore because they couldn't keep making machines run faster and hotter and they didn't know what to do with the extra transistors. Now they're desperate to find a way to make multicore profitable because if they don't have profits, they can't build the next generation of fab lines. The gravy train is over, and we might actually have to start paying attention to software costs.
Many people who are serious about parallelism are ignoring these toy 4-core or even 32-core machines in favor of GPUs with 128 processors or more. My guess is that the real action is going to be there.

For web applications it's very, very easy: ignore it. Unless you've got some code that really begs to be done in parallel you can simply write old-style single-threaded code and be happy.
You usually have a lot more requests to handle at any given moment than you have cores. And since each one is handled in its own Thread (or even process, depending on your technology) this is already working in parallel.
The only place you need to be careful is when accessing some kind of global state that requires synchronization. Keep that to a minimum to avoid introducing artificial bottlenecks to an otherwise (almost) perfectly scalable world.
So for me multi-core basically boils down to these items:
My servers have less "CPUs" while each one sports more cores (not much of a difference to me)
The same number of CPUs can substain a larger amount of concurrent users
When the seems to be performance bottleneck that's not the result of the CPU being 100% loaded, then that's an indication that I'm doing some bad synchronization somewhere.

At the moment - doesn't affect it that much, to be honest. I'm more in 'preparation stage', learning about the technologies and language features that make this possible.
I don't have one particular domain, but I've encountered domains like math (where multi-core is essential), data sort/search (where divide & conquer on multi-core is helpful) and multi-computer requirements (e.g., a requirement that a back-up station's processing power is used for something).
This depends on what language I'm working. Obviously in C#, my hands are tied with a not-yet-ready implementation of Parallel Extensions that does seem to boost performance, until you start comparing same algorithms with OpenMP (perhaps not a fair comparison). So on .NET it's going to be an easy ride with some for → Parallel.For refactorings and the like.Where things get really interesting is with C++, because the performance you can squeeze out of things like OpenMP is staggering compared to .NET. In fact, OpenMP surprised me a lot, because I didn't expect it to work so efficiently. Well, I guess its developers have had a lot of time to polish it. I also like that it is available in Visual Studio out-of-the-box, unlike TBB for which you have to pay.As for MPI, I use PureMPI.net for little home projects (I have a LAN) to fool around with computations that one machine can't quite take. I've never used MPI commercially, but I do know that MKL has some MPI-optimized functions, which might be interesting to look at for anyone needing them.
I plan to do 'frivolous computing', i.e. use extra cores for precomputation of results that might or might not be needed - RAM permitting, of course. I also intend to delve into costly algorithms and approaches that most end users' machines right now cannot handle.
As for domains not benefitting from parallellization... well, one can always find something. One thing I am concerned about is decent support in .NET, though regrettably I have given up hope that speeds similar to C++ can be attained.

I work in medical imaging and image processing.
We're handling multiple cores in much the same way we handled single cores-- we have multiple threads already in the applications we write in order to have a responsive UI.
However, because we can now, we're taking strong looks at implementing most of our image processing operations in either CUDA or OpenMP. The Intel Compiler provides a lot of good sample code for OpenMP, and is just a much more mature product than CUDA, and provides a much larger installed base, so we're probably going to go with that.
What we tend to do for expensive (ie, more than a second) operations is to fork that operation off into another process, if we can. That way, the main UI remains responsive. If we can't, or it's just far too inconvenient or slow to move that much memory around, the operation is still in a thread, and then that operation can itself spawn multiple threads.
The key for us is to make sure that we don't hit concurrency bottlenecks. We develop in .NET, which means that UI updates have to be done from an Invoke call to the UI in order to have the main thread update the UI.
Maybe I'm lazy, but really, I don't want to have to spend too much time figuring a lot of this stuff out when it comes to parallelizing things like matrix inversions and the like. A lot of really smart people have spent a lot of time making that stuff fast like nitrous, and I just want to take what they've done and call it. Something like CUDA has an interesting interface for image processing (of course, that's what it's defined for), but it's still too immature for that kind of plug-and-play programming. If I or another developer get a lot of spare time, we might give it a try. So instead, we'll just go with OpenMP to make our processing faster (and that's definitely on the development roadmap for the next few months).

So far, nothing more than more efficient compilation with make:
gmake -j
the -j option allows tasks that don't depend on one another to run in parallel.

I'm developing ASP.NET web applications. There is little possibility to use multicore directly in my code, however IIS already scales well for multiple cores/CPU's by spawning multiple worker threads/processes when under load.

We're having a lot of success with task parallelism in .NET 4 using F#. Our customers are crying out for multicore support because they don't want their n-1 cores idle!

I'm in image processing. We're taking advantage of multicore where possible by processing images in slices doled out to different threads.

I said some of this in answer to a different question (hope this is OK!): there is a concept/methodology called Flow-Based Programming (FBP) that has been around for over 30 years, and is being used to handle most of the batch processing at a major Canadian bank. It has thread-based implementations in Java and C#, although earlier implementations were fiber-based (C++ and mainframe Assembler). Most approaches to the problem of taking advantage of multicore involve trying to take a conventional single-threaded program and figure out which parts can run in parallel. FBP takes a different approach: the application is designed from the start in terms of multiple "black-box" components running asynchronously (think of a manufacturing assembly line). Since the interface between components is data streams, FBP is essentially language-independent, and therefore supports mixed-language applications, and domain-specific languages. Applications written this way have been found to be much more maintainable than conventional, single-threaded applications, and often take less elapsed time, even on single-core machines.

My graduate work is in developing concepts for doing bare-metal multicore work & teaching same in embedded systems.
I'm also working a bit with F# to bring up my high-level multiprocess-able language facilities to speed.

We create the VivaMP code analyzer for error detecting in parallel OpenMP programs.
VivaMP is a lint-like static C/C++ code analyzer meant to indicate errors in parallel programs based on OpenMP technology. VivaMP static analyzer adds much to the abilities of the existing compilers, diagnoses any parallel code which has some errors or is an eventual source of such errors. The analyzer is integrated into VisualStudio2005/2008 development environment.
VivaMP – a tool for OpenMP
32 OpenMP Traps For C++ Developers

I believe that "Cycles are an engineers' best friend".
My company provides a commercial tool for analyzing
and transforming very
large software systems in many computer languages.
"Large" means 10-30 million lines of code.
The tool is the DMS Software Reengineering Toolkit
(DMS for short).
Analyses (and even transformations) on such huge systems
take a long time: our points-to analyzer for C
code takes 90 CPU hours on an x86-64 with 16 Gb RAM.
Engineers want answers faster than that.
Consequently, we implemented DMS in PARLANSE,
a parallel programming language of our own design,
intended to harness small-scale multicore shared
memory systems.
The key ideas behind parlanse are:
a) let the programmer expose parallelism,
b) let the compiler choose which part it can realize,
c) keep the context switching to an absolute minimum.
Static partial orders over computations are
an easy to help achieve all 3; easy to say,
relatively easy to measure costs,
easy for compiler to schedule computations.
(Writing parallel quicksort with this is trivial).
Unfortunately, we did this in 1996 :-(
The last few years have finally been a vindication;
I can now get 8 core machines at Fry's for under $1K
and 24 core machines for about the same price as a small
car (and likely to drop rapidly).
The good news is that DMS is now a fairly mature,
and there are a number of key internal mechanisms
in DMS which take advantage of this, notably
an entire class of analyzers call "attribute grammars",
which we write using a domain-specific language
which is NOT parlanse. DMS compiles these
atrribute grammars into PARLANSE and then they
are executed in parallel. Our C++ front
end uses attribute grammars, and is about 100K
sloc; it is compiled into 800K SLOC of parallel
parlanse code that actually works reliably.
Now (June 2009), we are pretty busy making DMS useful, and
don't always have enough time to harness the parallelism
well. Thus the 90 hour points-to analysis.
We are working on parallelizing that, and
have reasonable hope of 10-20x speedup.
We believe that in the long run, harnessing
SMP well will make workstations far more
friendly to engineers asking hard questions.
As well they should.

Our domain logic is based heavily on a workflow engine and each workflow instance runs off the ThreadPool.
That's good enough for us.

I can now separate my main operating system from my development / install whatever I like os using vitualisation setups with Virtual PC or VMWare.
Dual core means that one CPU runs my host OS, the other runs my development OS with a decent level of performance.

Learning a functional programming language might use multiple cores... costly.
I think it's not really hard to use extra cores. There are some trivialities as web apps that does not need to have any extra care as the web server does its work running the queries in parallel. The questions are for long running algorythms (long is what you call long). These need to be split over smaller domains that does not depend each other, or synchronize the dependencies. A lot of algs can do this, but sometimes horribly different implementations needed (costs again).
So, no silver bullet until you are using imperative programming languages, sorry. Either you need skilled programmers (costly) or you need to turn to an other programming language (costly). Or you may have luck simply (web).

I'm using and programming on a Mac. Grand Central Dispatch for the win. The Ars Technica review of Snow Leopard has a lot of interesting things to say about multicore programming and where people (or at least Apple) are going with it.

I've decided to take advantage of multiple cores in an implementation of the DEFLATE algorithm. MArc Adler did something similar in C code with PIGZ (parallel gzip). I've delivered the philosophical equivalent, but in a managed code library, in DotNetZip v1.9. This is not a port of PIGZ, but a similar idea, implemented independently.
The idea behind DEFLATE is to scan a block of data, look for repeated sequences, build a "dictionary" that maps a short "code" to each of those repeated sequences, then emit a byte stream where each instance of one of the repeated sequences is replaced by a "code" from the dictionary.
Because building the dictionary is CPU intensive, DEFLATE is a perfect candidate for parallelization. i've taken a Map+Reduce type approach, where I divide the incoming uncompressed bytestreeam into a set of smaller blocks (map), say 64k each, and then compress those independently. Then I concatenate the resulting blocks together (reduce). Each 64k block is compressed independently, on its own thread, without regard for the other blocks.
On a dual-core machine, this approach compresses in about 54% of the time of the traditional serial approach. On server-class machines, with more cores available, it can potentially deliver even better results; with no server machine, I haven't tested it personally, but people tell me it's fast.
There's runtime (cpu) overhead associated to the management of multiple threads, runtime memory overhead associated to the buffers for each thead, and data overhead associated to concatenating the blocks. So this approach pays off only for larger bytestreams. In my tests, above 512k, it can pay off. Below that, it is better to use a serial approach.
DotNetZip is delivered as a library. My goal was to make all of this transparent. So the library automatically uses the extra threads when the buffer is above 512kb. There's nothing the application has to do, in order to use threads. It just works, and when threads are used, it's magically faster. I think this is a reasonable approach to take for most libbraries being consumed by applications.
It would be nice for the computer to be smart about automatically and dynamically exploiting resources on parallizable algorithms, but the reality today is that apps designers have to explicitly code the parallelization in.

I work in C# with .Net Threads.
You can combine object-oriented encapsulation with Thread management.
I've read some posts from Peter talking about a new book from Packt Publishing and I've found the following article in Packt Publishing web page:
http://www.packtpub.com/article/simplifying-parallelism-complexity-c-sharp
I've read Concurrent Programming with Windows, Joe Duffy's book. Now, I am waiting for "C# 2008 and 2005 Threaded Programming", Hillar's book - http://www.amazon.com/2008-2005-Threaded-Programming-Beginners/dp/1847197108/ref=pd_rhf_p_t_2
I agree with Szundi "No silver bullet"!

You say "For web applications it's very, very easy: ignore it. Unless you've got some code that really begs to be done in parallel you can simply write old-style single-threaded code and be happy."
I am working with Web applications and I do need to take full advantage of parallelism.
I understand your point. However, we must prepare for the multicore revolution. Ignoring it is the same than ignoring the GUI revolution in the 90's.
We are not still developing for DOS? We must tackle multicore or we'll be dead in many years.

I think this trend will first persuade some developers, and then most of them will see that parallelization is a really complex task.
I expect some design pattern to come to take care of this complexity. Not low level ones but architectural patterns which will make hard to do something wrong.
For example I expect messaging patterns to gain popularity, because it's inherently asynchronous, but you don't think about deadlock or mutex or whatever.

How does this affect your software roadmap?
It doesn't. Our (as with almost all other) business related apps run perfectly well on a single core. So long as adding more cores doesn't significantly reduce the performance of single threaded apps, we're happy
...real stories...
Like everyone else, parallel builds are the main benefit we get. The Visual Studio 2008 C# compiler doesn't seem to use more than one core though, which really sucks
What are you doing with your existing code to take advantage of multicore machines
We may look into using the .NET parallel extensions if we ever have a long-running algorithm that can be parallelized, but the odds of this actually occurring are slim. The most likely answer is that some of the developers will play around with it for interest's sake, but not much else
how will you deal with hundreds or thousands of cores?
Head -> Sand.
If your domain doesn't easily benefit from parallel computation, then explaining why is interesting, too.
The client app mostly pushes data around, the server app mostly relies on SQL server to do the heavy lifting

I'm taking advantage of multicore using C, PThreads, and a home brew implementation of Communicating Sequential Processes on an OpenVPX platform with Linux using the PREEMPT_RT patch set's scheduler. It all adds up to nearly 100% CPU utilisation across multiple OS instances with no CPU time used for data exchange between processor cards in the OpenVPX chassis, and very low latency too. Also using sFPDP to join multiple OpenVPX chassis together into a single machine. Am not using Xeon's internal DMA so as to relieve memory pressure inside CPUs (DMA still uses memory bandwidth at the expense of the CPU cores). Instead we're leaving data in place and passing ownership of it around in a CSP way (so not unlike the philosophy of .NET's task parallel data flow library).
1) Software Roadmap - we have pressure to maximise the use real estate and available power. Making the very most of the latest hardware is essential
2) Software domain - effectively Scientific Computing
3) What we're doing with existing code? Constantly breaking it apart and redistributing parts of it across threads so that each core is maxed out doing the most it possibly can without breaking out real time requirement. New hardware means quite a lot of re-thinking (faster cores can do more in the given time, don't want them to be under utilised). Not as bad as it sounds - the core routines are very modular so easily assembled into thread-sized lumps. Although we planned on taking control of thread affinity away from Linux, we've not yet managed to extract significant extra performance by doing so. Linux is pretty good at getting data and code in more or less the same place.
4) In effect already there - total machine already adds up to thousands of cores
5) Parallel computing is essential - it's a MISD system.
If that sounds like a lot of work, it is. some jobs require going whole hog on making the absolute most of available hardware and eschewing almost everything that is high level. We're finding that the total machine performance is a function of CPU memory bandwidth, not CPU core speed, L1/L2/L3 cache size.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js