I'm pretty curious about how Windows and Linux do memory management for C++ programs.
The reason for this curiosity is that I've just made three very simple C++ programs portable between Linux and Windows. The code was exactly the same. The hardware too. But the results were incredibly different! Each test was repeated 10 times and the arithmetic mean was calculated.
I tested sequential insertions into a static array of integers, into the vector class, and onto a stack (built with pointers). The total number of insertions was 10^6.
Windows XP SP2 x86 results:
Static array of integers: 56 ms
Class vector: 686 ms
Stack (with pointers): 2193 ms
Slackware 11 x86 results:
Static array of integers: 100 ms
Class vector: 476 ms
Stack (with pointer): 505 ms
The speed difference between the stack insertion times on Windows and Slackware is impressive. Do these results seem normal? Both programs were compiled using G++ (mingw32-g++ on Windows).
The computer used was a 3.2 GHz dual core with 4 GB RAM, and when the tests were made there were more than 2 GB of free RAM.
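For reference, here is a minimal sketch of what such a benchmark might look like. This is my reconstruction, not the original code: the timing helper uses std::chrono (which postdates the original tests), and the pointer-based stack is assumed to be a simple linked list.

#include <chrono>
#include <cstdio>
#include <vector>

constexpr int N = 1000000;

// One possible reading of "stack (with pointers)": a raw linked list.
struct Node { int value; Node* next; };
struct PtrStack {
    Node* top = nullptr;
    void push(int v) { top = new Node{v, top}; }
    ~PtrStack() { while (top) { Node* n = top->next; delete top; top = n; } }
};

template <typename F>
long long time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}

static int arr[N];  // the static array case

int main() {
    std::printf("array : %lld ms\n", time_ms([] { for (int i = 0; i < N; ++i) arr[i] = i; }));
    std::vector<int> v;
    std::printf("vector: %lld ms\n", time_ms([&] { for (int i = 0; i < N; ++i) v.push_back(i); }));
    PtrStack s;
    std::printf("stack : %lld ms\n", time_ms([&] { for (int i = 0; i < N; ++i) s.push(i); }));
}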
This may have more to do with the C++ standard library implementation (Microsoft's MSVC runtime vs. GNU libstdc++) than with memory management...
C++ sets the spec, but implementations vary greatly.
Update: As I now notice, you used g++ on both platforms, but it's still a different implementation underneath: the GNU code was developed for a Unix environment and ported to Windows later, so it may make assumptions and optimizations that don't hold in a Windows environment.
But your original question is valid. It may have something to do with the underlying memory model of the operating system - but your tests here are too coarse to draw any conclusions.
If you run more tests, or try using different flags / compilers, please post some updated stats, they would be informative.
Two seconds for a million insertions is a little too high. Have you enabled optimizations? Without them, those numbers mean nothing.
Edit: Since you compiled without optimizations, enable them (use -O2 for example) and measure again. The difference will probably be much smaller. The standard libraries tend to be quite defensive and perform a lot of consistency checking, which can skew the measurement a lot.
Edit: If it still takes over 2 seconds even with optimizations enabled, there's something else going on. Try posting some code. The following program runs in about 40 milliseconds on my system under Cygwin.
#include <stack>

int main()
{
    std::stack<int> s;
    for (int i = 0; i < 1000000; ++i)
        s.push(i);
}
Related
I have a problem which I have not been able to solve for a long time now. Since I'm out of ideas, I'm happy for any suggestions.
The program is a physics simulation which works on a huge tree data structure with millions of dynamically allocated nodes which are constructed / reorganized / destructed many times in parallel throughout the simulation, with a lot of pointers involved. Although this might sound very error-prone, I am almost sure that I am doing all this in a thread-safe manner. The program uses only standard libs and classes plus Intel MKL (BLAS / LAPACK optimized for Intel CPUs) for matrix operations.
My code is parallelized using C++11 threads. The program runs fine on my desktop, my laptop and on two different Intel clusters using up to 8 threads. Only on one cluster does the code suffer from random crashes when I use more than 2 threads (it runs absolutely fine with one or two threads).
The crash reports vary from case to case but are mostly connected to heap corruption (segmentation fault, corrupted double-linked list, malloc assertions, ...). Sometimes the program gets caught in an infinite loop as well. In very rare cases the data structure suddenly blows up and the program runs out of memory. Anyway, since the program runs fine on all other machines, I doubt the problem is in my source code. Since the crashes occur randomly, I found all backtrace information relatively useless.
The hardware of the problematic cluster is almost identical to another cluster on which the code runs fine with up to 8 threads (Intel Xeon E5-2630 CPUs). The libs / compilers / OS are all relatively up to date. Note that other OpenMP-parallelized programs run fine on the same cluster.
(Linux version 3.11.10-21-default (geeko@buildhost) (gcc version 4.8.1 20130909 [gcc-4_8-branch revision 202388] (SUSE Linux)) #1 SMP Mon Jul 21 15:28:46 UTC 2014 (9a9565d))
I already tried the following approaches:
adding a lot of assertions to assure that all my pointers are handled correctly
linking against tcmalloc instead of glibc malloc/free
trying different compilers (g++, icpc, clang++) and compiler options (with / without compiler optimizations / debugging options)
using a working binary from another machine, with statically linked libraries
using OpenMP instead of C++11 threads
switching between serial / parallel MKL
using other blas / lapack libraries
Using valgrind is out of the question, since the problem occurs randomly after anywhere from 10 minutes up to several hours, and valgrind gives me a slowdown factor of around 50 - 100 (plus valgrind does not allow real concurrency). Nevertheless, I have run the code in valgrind for several hours without problems.
Also, I cannot see any problem with the resource limits:
RLIMIT_AS: 18446744073709551615
RLIMIT_CORE: 18446744073709551615
RLIMIT_CPU: 18446744073709551615
RLIMIT_DATA: 18446744073709551615
RLIMIT_FSIZE: 18446744073709551615
RLIMIT_LOCKS: 18446744073709551615
RLIMIT_MEMLOCK: 18446744073709551615
RLIMIT_MSGQUEUE: 819200
RLIMIT_NICE: 0
RLIMIT_NOFILE: 1024
RLIMIT_NPROC: 2066856
RLIMIT_RSS: 18446744073709551615
RLIMIT_RTPRIO: 0
RLIMIT_RTTIME: 18446744073709551615
RLIMIT_SIGPENDING: 2066856
RLIMIT_STACK: 18446744073709551615
I found out that for some reason the stack size per thread seems to be only 2 MB, so I increased it using ulimit -s. Anyway, stack size shouldn't be the problem.
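For what it's worth, std::thread has no portable way to set a thread's stack size; if the 2 MB default were the culprit, one workaround is to create the affected threads through pthreads directly. A minimal sketch (the 16 MB figure is an arbitrary example, not a recommendation; compile with -pthread):

#include <pthread.h>

void* worker(void*) {
    // ... simulation work ...
    return nullptr;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 16 * 1024 * 1024);  // 16 MB instead of the default
    pthread_t t;
    pthread_create(&t, &attr, worker, nullptr);
    pthread_join(t, nullptr);
    pthread_attr_destroy(&attr);
}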
Also, the program should not have a problem with allocatable memory on the heap, since the memory size is more than sufficient.
Does anyone have an idea of what could be going wrong here, or where I should look? Maybe there are some environment variables I should check? I think the fact that the error occurs only if I use more than two threads, and that the crash rate is then independent of the number of threads, could be a hint.
Thanks in advance.
I've already posted this question here, but since it's maybe not that Qt-specific, I thought I might try my luck here as well. I hope it's not inappropriate to do that (just tell me if it is).
I’ve developed a small scientific program that performs some mathematical computations. I’ve tried to optimize it so that it’s as fast as possible. Now I’m almost done deploying it for Windows, Mac and Linux users. But I have not been able to test it on many different computers yet.
Here’s what troubles me: To deploy for Windows, I’ve used a laptop which has both Windows 7 and Ubuntu 12.04 installed on it (dual boot). I compared the speed of the app running on these two systems, and I was shocked to observe that it’s at least twice as slow on Windows! I wouldn’t have been surprised if there were a small difference, but how can one account for such a difference?
Here are a few details:
The computations I make the program do are just some brutal and stupid mathematical calculations; basically, it computes products and cosines in a loop that is executed a billion times. On the other hand, the computation is multi-threaded: I launch something like 6 QThreads.
The laptop has two cores @ 1.73 GHz. At first I thought that Windows was probably not using one of the cores, but then I looked at the processor activity and, according to the small graph, both cores are running at 100%.
Then I thought the C++ compiler for Windows didn't use the optimization options (things like -O1 -O2) that the C++ compiler for Linux automatically applies (in a release build), but apparently it does.
I'm bothered that the app is so much slower (2 to 4 times) on Windows; it's really weird. On the other hand, I haven't tried it on other Windows computers yet. Still, do you have any idea where the difference comes from?
Additional info: some data…
Even though Windows seems to be using the two cores, I'm thinking this might have something to do with thread management; here's why:
Sample Computation n°1 (this one launches 2 QThreads):
PC1-windows: 7.33s
PC1-linux: 3.72s
PC2-linux: 1.36s
Sample Computation n°2 (this one launches 3 QThreads):
PC1-windows: 6.84s
PC1-linux: 3.24s
PC2-linux: 1.06s
Sample Computation n°3 (this one launches 6 QThreads):
PC1-windows: 8.35s
PC1-linux: 2.62s
PC2-linux: 0.47s
where:
PC1-windows = my 2-core laptop (@ 1.73 GHz) with Windows 7
PC1-linux = my 2-core laptop (@ 1.73 GHz) with Ubuntu 12.04
PC2-linux = my 8-core laptop (@ 2.20 GHz) with Ubuntu 12.04
(Of course, it's not shocking that PC2 is faster. What's incredible to me is the difference between PC1-windows and PC1-linux).
Note: I've also tried running the program on a recent PC (4 or 8 cores #~3Ghz, don't remember exactly) under Mac OS, speed was comparable to PC2-linux (or slightly faster).
EDIT: I'll answer here a few questions I was asked in the comments.
I just installed the Qt SDK on Windows, so I guess I have the latest version of everything (including MinGW?). The compiler is MinGW. The Qt version is 4.8.1.
I use no optimization flags because I noticed that they are automatically used when I build in release mode (with Qt Creator). It seems to me that if I write something like QMAKE_CXXFLAGS += -O1, this only has an effect in debug build.
Lifetime of threads etc.: this is pretty simple. When the user clicks the "Compute" button, 2 to 6 threads are launched simultaneously (depending on what he is computing), and they are terminated when the computation ends. Nothing too fancy. Every thread just does brutal computations (except one, actually, which makes a (not so) small computation every 30 ms, basically checking whether the error is small enough).
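For context, here is a stripped-down sketch of that kind of worker setup in Qt 4. The Worker class and its loop body are my assumptions, not the actual code; the loop roughly matches "products and cosines in a loop".

#include <QThread>
#include <cmath>
#include <cstdio>

// Hypothetical worker: subclasses QThread and does brute-force math in run().
class Worker : public QThread {
protected:
    void run() {
        double acc = 0.0;
        for (long i = 0; i < 100000000L; ++i)
            acc += std::cos(i * 1e-6) * 1.000001;
        std::printf("worker done, acc = %f\n", acc);
    }
};

int main() {
    const int threadCount = 6;               // 2 to 6 in the real program
    Worker workers[threadCount];
    for (int i = 0; i < threadCount; ++i) workers[i].start();
    for (int i = 0; i < threadCount; ++i) workers[i].wait();
}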
EDIT: latest developments and partial answers
Here are some new developments that provide answers about all this:
I wanted to determine whether the difference in speed really had something to do with threads or not. So I modified the program so that the computation uses only 1 thread; that way we are pretty much comparing the performance of "pure C++ code". It turned out that Windows was now only slightly slower than Linux (something like 15%). So I guess that a small (but not insignificant) part of the difference is intrinsic to the system, but the largest part is due to thread management.
As someone (Luca Carlon, thanks for that) suggested in the comments, I tried building the application with the Microsoft Visual Studio compiler (MSVC) instead of MinGW. And surprise, the computation (with all the threads and everything) was now "only" 20% to 50% slower than on Linux! I think I'm going to go ahead and be content with that. I noticed, weirdly though, that the "pure C++" computation (with only one thread) was now even slower (than with MinGW), which must account for the overall difference. So as far as I can tell, MinGW is slightly better than MSVC except that it handles threads like a moron.
So, I'm thinking either there's something I can do to make MinGW (ideally I'd rather use it than MSVC) handle threads better, or it just can't. I would be amazed; how could that not be well known and documented? Although I guess I should be careful about drawing conclusions too quickly, since I've only compared things on one computer (for the moment).
Another possibility: on Linux, the Qt libraries may already be loaded (this can happen, e.g., if you use KDE), while on Windows the libraries must be loaded from scratch, which slows down the measured computation time. To check how much library loading costs your application, you could write a dummy test with pure C++ code.
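A dummy test along those lines could be as simple as the following sketch (assuming a C++11 compiler for std::chrono; the loop body is an arbitrary stand-in for the real computation):

#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    double acc = 0.0;
    for (long i = 0; i < 100000000L; ++i)    // "products and cosines in a loop"
        acc += std::cos(i * 1e-7) * 1.0000001;
    auto t1 = std::chrono::steady_clock::now();
    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("acc = %f, elapsed = %lld ms\n", acc, ms);
}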
I have noticed exactly the same behavior on my PC.
I am running Windows 7 (64-bit), Ubuntu (64-bit) and OSX (Lion, 64-bit), and my program compares 2 XML files (more than 60 MB each). It uses multithreading too (2 threads):
- Windows: 40 sec
- Linux: 14 sec (!!!)
- OSX: 22 sec
I use a personal class for threads (not the Qt one) which uses pthreads on Linux/OSX and Windows threads on Windows.
I use the Qt/MinGW compiler as I need the XML classes from Qt.
I have found no way (for now) to get the 3 OSes to deliver similar performance... but I hope I will!
I think that another reason may be the memory: my program uses about 500 MB of RAM. So I think that Unix is managing it better, because, in mono-thread, Windows is exactly 1.89 times slower, and I don't think that Linux could be more than 2 times slower!
I have heard of one case where Windows was extremely slow with writing files if you do it wrongly. (This has nothing to do with Qt.)
The problem in that case was that the developer used a SQLite database, wrote some 10000 datasets, and did a SQL COMMIT after each insert. This caused Windows to write the whole DB file to disk each time, while Linux would only update the buffered version of the filesystem inode in the RAM. The speed difference was even worse in that case: 1 second on Linux vs. 1 minute on Windows. (After he changed SQLite to commit only once at the end, it was also 1 second on Windows.)
So if you're writing the results of your computation to disk, you might want to check if you're calling fsync() or fflush() too often. If your writing code comes from a library, you can use strace for this (Linux-only, but should give you a basic idea).
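With SQLite's C API, the fix amounts to wrapping all the inserts in one transaction. A sketch (error handling omitted; the table and file names are made up):

#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("results.db", &db);
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS t(x INTEGER)", nullptr, nullptr, nullptr);

    // One transaction around all inserts: a single COMMIT and a single
    // synchronous write, instead of one flush to disk per row.
    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
    char sql[64];
    for (int i = 0; i < 10000; ++i) {
        std::snprintf(sql, sizeof sql, "INSERT INTO t VALUES(%d)", i);
        sqlite3_exec(db, sql, nullptr, nullptr, nullptr);
    }
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
    sqlite3_close(db);
}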
You might experience performance differences due to how mutexes are implemented on Windows and Linux.
Pure mutex code on Windows can incur a 15 ms wait every time there is contention for a resource when locking. The better-performing synchronization mechanism on Windows is the critical section; it doesn't suffer the locking penalty that regular mutexes experience in most cases.
I have found that on Linux, regular mutexes perform the same as Critical Sections on Windows.
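For illustration, the critical-section variant on Windows looks roughly like this (a sketch; the thread and iteration counts are arbitrary). A CRITICAL_SECTION spins briefly in user mode before falling back to a kernel wait, which is where the win over a full mutex comes from.

#include <windows.h>

CRITICAL_SECTION cs;
long long counter = 0;

DWORD WINAPI worker(LPVOID) {
    for (int i = 0; i < 1000000; ++i) {
        EnterCriticalSection(&cs);   // user-mode fast path when uncontended
        ++counter;
        LeaveCriticalSection(&cs);
    }
    return 0;
}

int main() {
    InitializeCriticalSection(&cs);
    HANDLE h[2];
    for (int i = 0; i < 2; ++i)
        h[i] = CreateThread(nullptr, 0, worker, nullptr, 0, nullptr);
    WaitForMultipleObjects(2, h, TRUE, INFINITE);
    DeleteCriticalSection(&cs);
}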
It's probably the memory allocator; try using jemalloc or tcmalloc from Google. glibc's ptmalloc3 is significantly better than the old, crusty allocator in MSVC's CRT. The comparable option from Microsoft is the Concurrency Runtime, but you cannot simply drop it in as a replacement.
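On Linux, trying tcmalloc requires no source changes; it can be wired in at link time or load time. A sketch (library paths are examples and vary by distribution):

// Link-time:  g++ -O2 app.cpp -o app -ltcmalloc
// Load-time:  LD_PRELOAD=/usr/lib/libtcmalloc.so ./app
//
// A trivial allocation-heavy loop to compare allocators with:
#include <cstdlib>

int main() {
    for (int i = 0; i < 1000000; ++i) {
        void* p = std::malloc(64);   // intercepted by tcmalloc when preloaded
        std::free(p);
    }
}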
I'm developing a C/C++ application to manipulate large quantities of data in a generic way (aggregation/selection/transformation).
I'm using an AMD Phenom II X4 965 Black Edition, so I have a decent amount of cache at several levels.
I've developed both ST and MT versions of the functions that perform each operation and, not surprisingly, in the best case the MT versions are 2x faster than the ST ones, even when using 4 cores.
Given that I'm a fan of using 100% of available resources, I was pissed that it was just 2x; I want 4x.
For this reason I've already spent quite a considerable amount of time with -pg and valgrind, using the cache simulator and callgraph tools. The program works as expected: the cores share the input data (i.e. the operations to apply to the data), and cache misses are reported (as expected) when the different threads load the data to be processed (millions of entities, or rows, if that now gives you an idea of what I'm trying to do :-) ).
Eventually I tried different compilers, g++ and clang++, both with -O3, and performance is identical.
My conclusion is that, due to the large amount of data to process (GBs of it), and given that the data eventually has to be loaded into the CPU, this is real wait time.
Can I further improve my software? Have I hit a limit?
I'm using C/C++ on Linux x86-64, Ubuntu 11.10.
I'm all ears! :-)
What kind of application is it? Could you show us some code?
As I commented, you might have reached some hardware limit like RAM bandwidth. If you did, no software trick could improve it.
You might investigate using MPI, OpenMP, or OpenCL (on GPUs) but without an idea of your application we cannot help.
If compiling with GCC, and if you want to help the processor's cache prefetching, consider using __builtin_prefetch, with care and parsimony (using it too much or badly would decrease performance).
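A sketch of what that might look like in practice (the prefetch distance of 16 elements is a guess that would need tuning per machine):

#include <cstddef>

// Sum an array while hinting the hardware a few iterations ahead.
double sum_prefetch(const double* a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /*read*/, 1 /*low temporal locality*/);
        s += a[i];
    }
    return s;
}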
I am using Visual C++ 2010 to develop 32-bit Windows applications. There are places where I really want to use inline assembly. But I just realized that Visual C++ does not support inline assembly in 64-bit applications, so porting to 64-bit in the future is a big issue.
I have no idea how 64-bit applications differ from 32-bit ones. Is there a chance that 32-bit applications will ALL have to be upgraded to 64-bit in the future? I heard that 64-bit CPUs have more registers. Since performance is not a concern for my applications, using these extra registers is not a concern to me. Are there any other reasons that a 32-bit application needs to be upgraded to 64-bit? Would a 64-bit application process things differently compared with a 32-bit one, apart from the fact that 64-bit applications may use registers or instructions unique to 64-bit CPUs?
My application needs to interact with other OS components, e.g. drivers, which I know must be 64-bit on 64-bit Windows. Would my 32-bit application be compatible with them?
Visual C++ does not support inline assembly for x64 (or ARM) processors, because generally using inline assembly is a bad idea.
Usually compilers produce better assembly than humans.
Even if you can produce better assembly than the compiler, using inline assembly generally defeats code optimizers of any type. Sure, your bit of hand optimized code might be faster, but the fact that code around it can't be optimized will generally lead to a slower overall program.
Compiler intrinsics are available in pretty much every major compiler; they let you access advanced CPU features (e.g. SSE) in a manner that's consistent with the C and C++ languages and does not defeat the optimizer.
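For example, adding four floats at a time with SSE intrinsics might look like this (a sketch; it assumes n is a multiple of 4 and the pointers are 16-byte aligned, to keep it short):

#include <xmmintrin.h>  // SSE intrinsics, available in MSVC, GCC and Clang

void add4(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);            // load 4 floats
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&out[i], _mm_add_ps(va, vb)); // 4 additions in one instruction
    }
}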
I am wondering whether there is a chance that 32-bit applications will ALL have to be upgraded to 64-bit in the future.
That depends on your target audience. If you're targeting servers, then yes, it's reasonable to let users skip installing the WOW64 subsystem, because on a server you know it probably won't be running much 32-bit code. I believe Windows Server 2008 R2 already allows this as an option if you install it as a "server core" instance.
Since performance is not a concern for my applications, using these extra registers is not a concern to me. Are there any other reasons that a 32-bit application needs to be upgraded to 64-bit in the future?
64-bit is not primarily about registers; it's about the size of the addressable virtual memory.
Would a 64-bit application process things differently compared with a 32-bit one, apart from using registers or instructions unique to 64-bit CPUs?
Most likely. 32-bit applications are constrained in that they can't map more than ~2 GB into memory at once. 64-bit applications don't have that problem. Even if they're not using more than 4 GB of physical memory, being able to address more than 4 GB of virtual memory is helpful for mapping files on disk into memory and the like.
My application needs to interact with other OS components, e.g. drivers, which I know must be 64-bit on 64-bit Windows. Would my 32-bit application be compatible with them?
That depends entirely on how you're communicating with those drivers. If it's through something like a "named file interface" then your app could stay as 32 bit. If you try to do something like shared memory (Yikes! Shared memory accessible from user mode with a driver?!?) then you're going to have to build your app as 64 bit.
Apart from @Billy's great write-up, if you really feel the need to use 64-bit assembly, you can use an external assembler like MASM to get that done; see this. (It's also possible to speed this up with pre-build scripts.)
The Intel C++ Compiler 15 has inline assembly capability in 64-bit mode too.
And you could integrate ICC into Visual Studio as a toolset: then you'd have VC++ 64-bit with inline assembly.
One catch though: it's expensive.
cheers
While we're at it, MinGW also supports 64-bit inline assembly; and it's pretty fast, and free. It used to be slow on some math, so I'd start out comparing the performance of MSVC vs. MinGW to see if it's a decent starting place for your application.
Also, as to hand-coded assembly being slower:
Actually, humans very often do write assembly that runs more efficiently than what compilers produce - or at least that was the common wisdom when I was learning programming in the '70s and '80s, and it continued to be the case through roughly 2000.
You can always code it in C or C++, compile that to assembly, and tweak the result to see if you can improve it. That way, you can learn from the compiler's optimizations and see if you can improve on them.
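Concretely, that round trip might look something like this (a sketch using GCC-style tools; the file names are made up):

// 1. Write the hot function in plain C++ (dot.cpp):
double dot(const double* a, const double* b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}
// 2. Emit assembly:   g++ -O2 -S dot.cpp -o dot.s
// 3. Hand-tweak dot.s, then assemble and link it back in:
//                     g++ main.cpp dot.s -o app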
Assembly very much can have a place in code that needs heavy optimization, no matter what M$ says. You won't really know whether assembly will or won't speed up code until you try it. Everything else is just pontificating.
As above, I favor the approach of compiling C++ code into assembly and then hand-optimizing that. It saves you the trouble of writing much of it, and with a little experimentation you may get something that tests out faster. FWIW, I've never needed to with a modern program. Often other things can speed it up just as much or more: multi-threading, look-up tables, moving time-expensive operations out of loops, static analyzers, runtime analyzers such as valgrind (if you're on Linux), etc. However, for performance-critical applications, I see no reason not to try; just use it if it works. M$ is just being lazy by dropping inline assembly.
As to whether 64-bit or 32-bit is faster: this is similar to the 16-bit vs. 32-bit situation. The wider bandwidth can move huge amounts of data faster. If both run on a 64-bit OS, they run at exactly the same clock speed, so the 32-bit program shouldn't be faster. Yet I've observed the CPU clock on 32-bit Win7 to run slightly faster than on 64-bit Win7, so for the same number of threads and more CPU-intensive operations, a 32-bit app on 32-bit Win7 would be faster. However, the difference isn't much, and 64-bit instructions can really make a difference. A given user will only have one OS installed, though, so the 64-bit app will either be faster for that OS, or at best the same speed as a 32-bit app running on a 64-bit OS. It will be a larger download, however. You might as well go for the possibly faster speed with 64 bits, unless you are dealing with a dedicated system running code you know won't be moving large amounts of data.
Also, note that I benchmarked a 64-bit and a 32-bit app on OSes of the respective sizes, using the respective versions of MinGW. The app did a lot of 64-bit floating-point number crunching, and I was sure the 64-bit version would have the edge. It didn't!! My guess is that the floating-point registers in the built-in math coprocessor run in equal numbers of clock cycles on both OSes, and perhaps slightly slower on 64-bit Win7. My benchmarks were so close in both versions that neither was clearly faster. Perhaps long number-crunching operations were slower in 64-bit, but the 64-bit program code ran a little faster, giving nearly equal results.
Basically, the only times 32 bits makes sense, IMHO, are when you think you might have an in-house app that would run faster on a 32-bit OS, when you want a really small executable, when you are delivering to users on 32-bit OS machines (many developers still offer both versions), or for a 32-bit embedded system.
Edited to reflect that some of my remarks pertain to my specific experience with Win7 x86 vs. x64.
It's been a couple of decades since I've done any programming. As a matter of fact, the last time I programmed was in an MS-DOS environment, before Windows came out. I've had a programming idea that I have wanted to try for a few years now, and I thought I would give it a shot. The amount of calculation is enormous. Consequently, I want to run it in the fastest environment available to a general hobby programmer.
I'll be using a 64-bit machine. Currently it is running Windows 7. Years ago a program ran much slower in the Windows environment than in MS-DOS mode. My personal programming experience has been in Fortran, Pascal, Basic, and machine language for the Motorola 6800-series processors. I'm basically willing to try anything. I've fooled around with Ubuntu also. No objections to learning something new; I just want to take advantage of speed. I'd prefer to spend no money on this project, so I'm looking for a free or very nearly free compiler. I've downloaded Microsoft Visual Studio C++ Express, but I've got a feeling that the compiled code will have to run in the Windows environment, which I'm sure slows the processing speed considerably.
So I'm looking for ideas or pointers to what is available.
Thank you,
Have a Great Day!
Jim
Speed generally comes with the price of either portability or complexity.
If your programming idea involves lots of computation and you're using an Intel CPU, you might want to use Intel's compiler, which can exploit processor features that might make your program faster. Otherwise, if portability is your goal, use GCC (the GNU Compiler Collection), which can cross-compile well-optimized executables for practically any platform on earth. If your computation is parallelizable, you might want to look at SIMD (Single Instruction, Multiple Data) and GPGPU/CUDA/OpenCL (using the graphics card for computation) techniques.
However, I'd recommend you just try your idea in a simpler language first, e.g. Python, Java, C#, or Basic, and see if the speed is good enough. Since you haven't programmed in decades, it's likely that what you remember as an enormous computation is minuscule by today's standards, given the increases in processor speed and RAM. Nowadays, there is not much noticeable difference between running in a GUI environment and in a command-line environment.
There is no substantial performance penalty to operating under Windows, and a large number of extremely high-performance applications do so. With compiler advances and new optimization techniques, Windows is no longer the up-and-coming, new, poorly optimized technology it was twenty years ago.
The simple fact is that if you haven't programmed for 20 years, you won't have any realistic performance picture at all. You should do what most people do: start with an easy-to-learn but not very fast programming language like C#, create the program, prove that it runs too slowly, then make several optimization passes with tools such as profilers; only then might you decide the language is too slow. If you haven't written a line of code in two decades, the overwhelming probability is that any program you write will be slow because you're a novice programmer by modern standards, not because of your choice of language or environment. Creating very high-performance applications requires a detailed understanding of the target platform as well as the language of choice, AND the operation of the program.
I'd definitely recommend Visual C++. The Express Edition is free and Visual Studio 2010 can produce some unreasonably fast code. Windows is not a slow platform - even if you handwrote your own OS, it'd probably be slower, and even if you produced one that was faster, the performance gain would be negligible unless your program takes days or weeks to complete a single execution.
The OS does not make your program magically run slower. True, the OS does eat a few clock cycles here and there, but it's really not enough to be at all noticeable (and it does so in order to provide you with services you most likely need, and would need to re-implement yourself otherwise)
Windows doesn't, as some people seem to believe, eat 50% of your CPU. It might eat 0.5%, but so does Linux and OSX. And if you were to ditch all existing OS'es and instead write your own from scratch, you'd end up with a buggy, less capable OS which also eats a bit of CPU time.
So really, the environment doesn't matter.
What does matter is what hardware you run the program on (and here, running it on the GPU might be worth considering) and how well you utilize the hardware (concurrency is pretty much a must if you want to exploit modern hardware).
What code you write, and how you compile it does make a difference. The hardware you're running on makes a difference. The choice of OS does not.
A digression: the idea that the OS doesn't matter for performance is, in general, obviously false. Citing CPU utilization when idle seems a quite "peculiar" argument to me: of course one hopes that when no jobs are running, the OS is not wasting energy. Otherwise, one should measure the speed/throughput of an OS while it is providing a service (i.e. mediating access to hardware/resources).
To avoid an annoying MS Windows vs. Linux vs. Mac OS X battle, I will refer to a research OS concept: exokernels. The point of exokernels is that a traditional OS is not just a mediator for resource access; it also implements policies, and such policies do not always favor your application's specific access patterns for a resource. With the exokernel concept, researchers proposed to "exterminate all operating system abstractions" (.pdf), retaining only the OS's multiplexer role. In this way:
… The results show that common unmodified UNIX applications can enjoy the benefits of exokernels: applications either perform comparably on Xok/ExOS and the BSD UNIXes, or perform significantly better. In addition, the results show that customized applications can benefit substantially from control over their resources (e.g., a factor of eight for a Web server). …
So, by bypassing the usual OS access policies, they gained roughly a factor of eight in performance for a customized web server.
Returning to the original question: it's generally true that an application is executed with no or negligible OS overhead when:
it has a compute-intensive kernel that does not call the OS API;
memory is enough or data is accessed in a way that does not cause excessive paging;
all inessential services running on the same systems are switched off.
There are possibly other factors, depending on the hardware/OS/application.
I assume that the OP is correct in his rough estimate of the computing power required. The OP does not specify the nature of this intensive computation, so it's difficult to give precise suggestions. But he wrote:
The amount of calculation is enormous
"Calculations" seems to allude to compute-intensive kernels, for which I think is required a compiled language or a fast interpreted language with native array operators, like APL, or modern variant such as J, A+ or K (potentially, at least: I do not know if they are taking advantage of modern hardware).
Anyway, the first advice is to spend some time in researching fast algorithms for your specific problem (but when comparing algorithms remember that asymptotic notation disregards constant factors that sometimes are not negligible).
For the sequential part of your program a good utilization of CPU caches is crucial for speed. Look into cache conscious algorithms and data structures.
For the parallel part, if such program is amenable to parallelization (remember both Amdahl's law and Gustafson's law), there are different kinds of parallelism to consider (they are not mutually exclusive):
Instruction-level parallelism: it is taken care by the hardware/compiler;
data parallelism:
bit-level: the acronym SWAR (SIMD Within A Register) is sometimes used for this kind of parallelism. It applies to problems (or parts of them) whose data representation can be mapped onto bit vectors (where a value is represented by 1 or more bits), so that each instruction in the instruction set is potentially a parallel instruction operating on multiple data items (SIMD). Especially interesting on a machine with 64-bit (or larger) registers. Possible on CPUs and some GPUs. No compiler support required;
fine-grain medium parallelism: ~10 operations in parallel on x86 CPUs with SIMD instruction-set extensions like SSE, its successors and predecessors, and similar; compiler support required;
fine-grain massive parallelism: hundreds of operations in parallel on GPGPUs (using common graphic cards for general-purpose computations), programmed with OpenCL (open standard), CUDA (NVIDIA), DirectCompute (Microsoft), BrookGPU (Stanford University) and Intel Array Building Blocks. Compiler support or use of a dedicated API is required. Note that some of these have back-ends for SSE instructions also;
coarse-grain modest parallelism (at the level of threads, not single instructions): it's not unusual for CPUs on current desktops/laptops to have more than one core (2/4) sharing the same memory pool (shared memory). The standard for shared-memory parallel programming is the OpenMP API, where, for example in C/C++, #pragma directives are used around loops (see the sketch after this list). If I am not mistaken, this can be considered data parallelism emulated on top of task parallelism;
task parallelism: each core in one (or multiple) CPU(s) has its own independent flow of execution and possibly operates on different data. Here one can use the concept of "thread" directly, or a higher-level programming model that masks threads.
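As a concrete illustration of the shared-memory case above, a minimal OpenMP loop might look like this (a sketch; compile with -fopenmp on GCC):

#include <omp.h>
#include <cstdio>

int main() {
    const int n = 10000000;
    double sum = 0.0;
    // Iterations are split across the cores; partial sums are
    // combined by the reduction clause.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += 1.0 / (1.0 + i);
    std::printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
}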
I will not go into details of these programming models here because apparently it is not what the OP needs.
I think this is enough for the OP to evaluate by himself how various languages and their compilers/run-times / interpreters / libraries support these forms of parallelism.
Just my two cents about DOS vs. Windows.
Years ago (something like 1998?), I had the same assumption.
I have some program written in QBasic (this was before I discovered C), which did intense calculations (neural network back-propagation). And it took time.
A friend offered to rewrite the thing in Visual Basic. I objected, because, you know, all those gizmos, widgets and fancy windows, you know, would slow down the execution of, you know, the important code.
The Visual Basic version so much outperformed the QBasic one that it became the default application (I won't mention the "hey, even in Excel's VBA, you are outperformed" because of my wounded pride, but...).
The point here, is the "you know" part.
You don't know.
The OS here is not important. As others explained in their answers, choose your hardware and choose your language. And write your code in a clear way, because nowadays compilers are better at optimizing code than developers are, unless you're John Carmack (premature optimization is the root of all evil).
Then, if you're not happy with the result, use a profiler to test your code. Consider multithreading (which will help you if you have multiple cores... TBB comes to mind).
What are you trying to do? I believe all the stuff should be compiled in 64bit mode by default. Computers have gotten a lot faster. Speed should not be a problem for the most part.
Side note: As for computation-intensive stuff, you may want to look into OpenCL or CUDA. OpenCL and CUDA take advantage of the GPU, which can process lots of data at a time compared to the CPU.
If your last points of reference are the M68K and PCs running DOS, then I'd suggest that you start with C/C++ on a modern processor and OS. If you run into performance problems and can prove that they are caused by running on Linux / Windows, or that the compiler-generated code isn't good enough, then you could look at other OSes and/or hand-coded ASM. If you're looking for free, Linux / gcc is a good place to start.
I am the original poster of this thread.
I am once again emphasizing that this program will perform an enormous number of calculations.
Windows and Ubuntu are multi-tasking environments. There are processes running, and many of them are using processor resources. True, many of them are seen as inactive, but the Windows environment, by the nature of multi-tasking, is constantly monitoring the need to start up each process. For example, currently there are 62 processes showing in the Windows Task Manager. According to the Task Manager, three are consuming CPU resources, and an additional 59 show as active but consume no CPU. So that is 62 processes being monitored by Windows, and then there is Windows itself, which is also checking on various things.
I was hoping to find some way to just run a program at the bare-machine level, sidestepping all the Windows (or Ubuntu) involvement.
The idea is very calculation intensive.
Thank you all for taking the time to respond.
Have a Great Day,
Jim