Fast cross-platform cpu profiling of C++ functions - c++

I have written a software in Qt for a company, which works with a hardware device and shows realtime plots. I have problems with its speed and I need to know which parts are cpu intensive, but there are lots of threads and events, and when I use valgrind the application gets so slow that serial handlers won't work as expected and timeouts happen and therefore I can't find what's going on. The code base is huge and simplification is almost impossible, because things depend on each other. I'm developing on MacOSX but the application runs on Linux, I wanted to know if there is a faster profiler than valgrind out there. Preferably one that works on MacOSX, as valgrind doesn't work on MacOSX (It says it does, but there are so many problems that makes it just not practical). Thanks in advance.
P.S. If you need any more info, please comment instead of down voting, I can't offer the code as it is proprietary, but I can say it is well written, and I'm a fairly experienced C++ programmer.

Related

kernel vs user-space audio device driver on macOS

I'm in a need to develop an audio device driver for System Audio Capture(based on Soundflower). But soon a problem appeared that it seems IOAudioFamily stack is being deprecated in OSX 10.10 and later. Looking through the IOAudioDevice and IOAudioEngine header files it seems that apple recommends now using the <CoreAudio/AudioServerPlugIn.h> API which runs in user-space. But I can't find lots of information on this user-space device drivers topic. It seems that the only resource is the Apple provided sample devices from https://developer.apple.com/library/prerelease/content/samplecode/AudioDriverExamples/Introduction/Intro.html
Looking through the examples I find that its a lot harder and more work to develop a user-space driver instead of I/O Kit kernel based.
So the question arises what should motivate to develop a device driver in user-space instead of kernel space?
The "SimpleAudioDriver" example is somewhat misnamed. It demonstrates pretty much every feature of the API. This is handy as a reference if you actually need to use those features. It's also structured in a way that's maybe a little more complicated than necessary.
For a virtual device, the NullAudioDriver is probably a much better base, and much, much easier to understand (single source file, if I remember correctly). SimpleAudioDriver is more useful for dealing with issues such as hotplugging, multiple instances of identical devices, etc.
IOAudioEngine is deprecated as you say, and has been since OS X 10.10. Expect it to go away eventually, so if you build your driver with it, you'll probably need to rewrite it sooner than if you create a Core Audio Server Plugin based one.
Testing and debugging audio drivers is awkward either way (due to being so time sensitive), but I'd say userspace ones are slightly less frustrating to deal with. You'll still want to test on a different machine than your development Mac, because if coreaudiod crashes or hangs, apps usually start locking up too, so being able to just ssh in, delete your plugin and kill coreaudiod is handy. Certainly quicker turnaround than having to reboot.
(FWIW, I've shipped both kernel and userspace OS X audio drivers, and I spend a lot of time working on kexts.)
There is a great book on this subject, available free online here:
http://free-electrons.com/doc/books/ldd3.pdf
See page 37 for a summary of why you might want a user-space driver, copied here for convenience:
The advantages of user-space drivers are:
The full C library can be linked in. The driver can perform many exotic tasks without resorting to external programs (the utility
programs implementing usage policies that are usually distributed
along with the driver itself).
The programmer can run a conventional debugger on the driver code without having to go through contortions to debug a running kernel.
If a user-space driver hangs, you can simply kill it. Problems with the driver are unlikely to hang the entire system, unless the hardware
being controlled is really misbehaving.
User memory is swappable, unlike kernel memory. An infrequently used device with a huge driver won’t occupy RAM that other programs could
be using, except when it is actually in use.
A well-designed driver program can still, like kernel-space drivers, allow concurrent access to a device.
If you must write a closed-source driver, the user-space option makes it easier for you to avoid ambiguous licensing situations and
problems with changing kernel interfaces.

OpenGL stable enough for 24/7 running program?

I am creating a GUI program that will run 24/7. I couldn't find much online on the subject, but is OpenGL stable enough to run 24/7 for weeks on end without leaks, crashes, etc?
Should I have any concerns or anything to look into before delving too deep into using OpenGL?
I know that OpenGL and DirectX are primarily used for games or other programs that aren't used for very long lengths. Hopefully someone here has some experience with this or knowledge on the subject. Thanks.
EDIT: Sorry for the lack of detail. This will only be doing 2D rendering, and nothing too heavy, what I have now (which will be similar to production) already runs at a stable 900-1000 FPS on my i5 laptop with Radeon 6850m
Going into OpenGL just for making a GUI sounds insane. You should be worried more about what language you use if you are concerned about stuff like memory leaks. Remember that in C/C++ you manage memory on your own.
Furthermore, do you really need the GUI to be running 24/7? If you are making a service sort of application, you might as well leave it in the background and make a second application which provides the GUI. These two applications would communicate via soma IPC (sockets?). That's how this sort of thing usually works, not having a window open all the time.
In the end, memory leaks are not caused by some graphical library, but more by the programmer writing bad code. The library should be the last in your list of possible reasons for memory leaks/creashes.
I work for a company that makes (windows based) quality assurance software (machine vision) using Delphi.
The main operator screen shows the camera images at up to 20fps (2 x 10fps) with opengl overlay, and has essentially unbounded uptime (longest uptimes close to an year, longer is hard due to power downs for maintenance). Higher speed cameras have their display rates throttled.
I would avoid integrated video from intel for a while longer though. Since i5 it meets our minimal requirements (non power of 2 textures mostly), but the initial drivers were bad, and while they have improved there are occasional stability and regularity problems still.

Beyond Stack Sampling: C++ Profilers

A Hacker's Tale
The date is 12/02/10. The days before Christmas are dripping away and I've pretty much hit a major road block as a windows programmer. I've been using AQTime, I've tried sleepy, shiny, and very sleepy, and as we speak, VTune is installing. I've tried to use the VS2008 profiler, and it's been positively punishing as well as often insensible. I've used the random pause technique. I've examined call-trees. I've fired off function traces. But the sad painful fact of the matter is that the app I'm working with is over a million lines of code, with probably another million lines worth of third-party apps.
I need better tools. I've read the other topics. I've tried out each profiler listed in each topic. There simply has to be something better than these junky and expensive options, or ludicrous amounts of work for almost no gain. To further complicate matters, our code is heavily threaded, and runs a number of Qt Event loops, some of which are so fragile that they crash under heavy instrumentation due to timing delays. Don't ask me why we're running multiple event loops. No one can tell me.
Are there any options more along the lines of Valgrind in a windows environment?
Is there anything better than the long swath of broken tools I've already tried?
Is there anything designed to integrate with Qt, perhaps with a useful display of events in queue?
A full list of the tools I tried, with the ones that were really useful in italics:
AQTime: Rather good! Has some trouble with deep recursion, but the call graph is correct in these cases, and can be used to clear up any confusion you might have. Not a perfect tool, but worth trying out. It might suit your needs, and it certainly was good enough for me most of the time.
Random Pause attack in debug mode: Not enough information enough of the time.
A good tool but not a complete solution.
Parallel Studios: The nuclear option. Obtrusive, weird, and crazily powerful. I think you should hit up the 30 day evaluation, and figure out if it's a good fit. It's just darn cool, too.
AMD Codeanalyst: Wonderful, easy to use, very crash-prone, but I think that's an environment thing. I'd recommend trying it, as it is free.
Luke Stackwalker: Works fine on small projects, it's a bit trying to get it working on ours. Some good results though, and it definitely replaces Sleepy for my personal tasks.
PurifyPlus: No support for Win-x64 environments, most prominently Windows 7. Otherwise excellent. A number of my colleagues in other departments swear by it.
VS2008 Profiler: Produces output in the 100+gigs range in function trace mode at the required resolution. On the plus side, produces solid results.
GProf: Requires GCC to be even moderately effective.
VTune: VTune's W7 support borders on criminal. Otherwise excellent
PIN: I'd need to hack up my own tool, so this is sort of a last resort.
Sleepy\VerySleepy: Useful for smaller apps, but failing me here.
EasyProfiler: Not bad if you don't mind a bit of manually injected code to indicate where to instrument.
Valgrind: *nix only, but very good when you're in that environment.
OProfile: Linux only.
Proffy: They shoot wild horses.
Suggested tools that I haven't tried:
XPerf:
Glowcode:
Devpartner:
Notes:
Intel environment at the moment. VS2008, boost libraries. Qt 4+. And the wretched humdinger of them all: Qt/MFC integration via trolltech.
Now: Almost two weeks later, it looks like my issue is resolved. Thanks to a variety of tools, including almost everything on the list and a couple of my personal tricks, we found the primary bottlenecks. However, I'm going to keep testing, exploring, and trying out new profilers as well as new tech. Why? Because I owe it to you guys, because you guys rock. It does slow the timeline down a little, but I'm still very excited to keep trying out new tools.
Synopsis
Among many other problems, a number of components had recently been switched to the incorrect threading model, causing serious hang-ups due to the fact that the code underneath us was suddenly no longer multithreaded. I can't say more because it violates my NDA, but I can tell you that this would never have been found by casual inspection or even by normal code review. Without profilers, callgraphs, and random pausing in conjunction, we'd still be screaming our fury at the beautiful blue arc of the sky. Thankfully, I work with some of the best hackers I've ever met, and I have access to an amazing 'verse full of great tools and great people.
Gentlefolk, I appreciate this tremendously, and only regret that I don't have enough rep to reward each of you with a bounty. I still think this is an important question to get a better answer to than the ones we've got so far on SO.
As a result, each week for the next three weeks, I'll be putting up the biggest bounty I can afford, and awarding it to the answer with the nicest tool that I think isn't common knowledge. After three weeks, we'll hopefully have accumulated a definitive profile of the profilers, if you'll pardon my punning.
Take-away
Use a profiler. They're good enough for Ritchie, Kernighan, Bentley, and Knuth. I don't care who you think you are. Use a profiler. If the one you've got doesn't work, find another. If you can't find one, code one. If you can't code one, or it's a small hang up, or you're just stuck, use random pausing. If all else fails, hire some grad students to bang out a profiler.
A Longer View
So, I thought it might be nice to write up a bit of a retrospective. I opted to work extensively with Parallel Studios, in part because it is actually built on top of the PIN Tool. Having had academic dealings with some of the researchers involved, I felt that this was probably a mark of some quality. Thankfully, I was right. While the GUI is a bit dreadful, I found IPS to be incredibly useful, though I can't comfortably recommend it for everyone. Critically, there's no obvious way to get line-level hit counts, something that AQT and a number of other profilers provide, and I've found very useful for examining rate of branch-selection among other things. In net, I've enjoyed using AQTime as well, and I've found their support to be really responsive. Again, I have to qualify my recommendation: A lot of their features don't work that well, and some of them are downright crash-prone on Win7x64. XPerf also performed admirably, but is agonizingly slow for the sampling detail required to get good reads on certain kinds of applications.
Right now, I'd have to say that I don't think there's a definitive option for profiling C++ code in a W7x64 environment, but there are certainly options that simply fail to perform any useful service.
First:
Time sampling profilers are more robust than CPU sampling profilers. I'm not extremely familiar with Windows development tools so I can't say which ones are which. Most profilers are CPU sampling.
A CPU sampling profiler grabs a stack trace every N instructions.
This technique will reveal portions of your code that are CPU bound. Which is awesome if that is the bottle neck in your application. Not so great if your application threads spend most of their time fighting over a mutex.
A time sampling profiler grabs a stack trace every N microseconds.
This technique will zero in on "slow" code. Whether the cause is CPU bound, blocking IO bound, mutex bound, or cache thrashing sections of code. In short what ever piece of code is slowing your application will standout.
So use a time sampling profiler if at all possible especially when profiling threaded code.
Second:
Sampling profilers generate gobs of data. The data is extremely useful, but there is often too much to be easily useful. A profile data visualizer helps tremendously here. The best tool I've found for profile data visualization is gprof2dot. Don't let the name fool you, it handles all kinds of sampling profiler output (AQtime, Sleepy, XPerf, etc). Once the visualization has pointed out the offending function(s), jump back to the raw profile data to get better hints on what the real cause is.
The gprof2dot tool generates a dot graph description that you then feed into a graphviz tool. The output is basically a callgraph with functions color coded by their impact on the application.
A few hints to get gprof2dot to generate nice output.
I use a --skew of 0.001 on my graphs so I can easily see the hot code paths. Otherwise the int main() dominates the graph.
If you're doing anything crazy with C++ templates you'll probably want to add --strip. This is especially true with Boost.
I use OProfile to generate my sampling data. To get good output I need configure it to load the debug symbols from my 3rd party and system libraries. Be sure to do the same, otherwise you'll see that CRT is taking 20% of your application's time when what's really going on is malloc is trashing the heap and eating up 15%.
What happened when you tried random pausing? I use it all the time on a monster app. You said it did not give enough information, and you've suggested you need high resolution. Sometimes people need a little help in understanding how to use it.
What I do, under VS, is configure the stack display so it doesn't show me the function arguments, because that makes the stack display totally unreadable, IMO.
Then I take about 10 samples by hitting "pause" during the time it's making me wait. I use ^A, ^C, and ^V to copy them into notepad, for reference. Then I study each one, to try to figure out what it was in the process of trying to accomplish at that time.
If it was trying to accomplish something on 2 or more samples, and that thing is not strictly necessary, then I've found a live problem, and I know roughly how much fixing it will save.
There are things you don't really need to know, like precise percents are not important, and what goes on inside 3rd-party code is not important, because you can't do anything about those. What you can do something about is the rich set of call-points in code you can modify displayed on each stack sample. That's your happy hunting ground.
Examples of the kinds of things I find:
During startup, it can be about 30 layers deep, in the process of trying to extract internationalized character strings from DLL resources. If the actual strings are examined, it can easily turn out that the strings don't really need to be internationalized, like they are strings the user never actually sees.
During normal usage, some code innocently sets a Modified property in some object. That object comes from a super-class that captures the change and triggers notifications that ripple throughout the entire data structure, manipulating the UI, creating and desroying obects in ways hard to foresee. This can happen a lot - the unexpected consequences of notifications.
Filling in a worksheet row-by-row, cell-by-cell. It turns out if you build the row all at once, from an array of values, it's a lot faster.
P.S. If you're multi-threaded, when you pause it, all threads pause. Take a look at the call stack of each thread. Chances are, only one of them is the real culprit, and the others are idling.
I've had some success with AMD CodeAnalyst.
Do you have an MFC OnIdle function? In the past I had a near real-time app I had to fix that was dropping serial packets when set at 19.2K speed which a PentiumD should have been able to keep up with. The OnIdle function was what was killing things. I'm not sure if QT has that concept, but I'd check for that too.
Re the VS Profiler -- if it's generating such large files, perhaps your sampling interval is too frequent? Try lowering it, as you probably have enough samples anyway.
And ideally, make sure you're not collecting samples until you're actually exercising the problem area. So start with collection paused, get your program to do its "slow activity", then start collection. You only need at most 20 seconds of collection. Stop collection after this.
This should help reduce your sample file sizes, and only capture what is necessary for your analysis.
I have successfully used PurifyPlus for Windows. Although it is not cheap, IBM provides a trial version that is slightly crippled. All you need for profiling with quantify are pdb files and linking with /FIXED:NO. Only drawback: No support for Win7/64.
Easyprofiler - I haven't seen it mentioned here yet so not sure if you've looked at it already. It takes a slightly different approach in how it gathers metric data. A drawback to using its compile-time profile approach is you have to make changes to the code-base. Thus you'll need to have some idea of where the slow might be and insert profiling code there.
Going by your latest comments though, it sounds like you're at least making some headway. Perhaps this tool might provide some useful metrics for you. If nothing else it has some really purdy charts and pictures :P
Two more tool suggestions.
Luke Stackwalker has a cute name (even if it's trying a bit hard for my taste), it won't cost you anything, and you get the source code. It claims to support multi threaded programs, too. So it is surely worth a spin.
http://lukestackwalker.sourceforge.net/
Also Glowcode, which I've had pointed out to me as worth using:
http://www.glowcode.com/
Unfortunately I haven't done any PC work for a while, so I haven't tried either of these. I hope the suggestions are of help anyway.
Checkout XPerf
This is free, non-invasive and extensible profiler offered by MS. It was developed by Microsoft to profile Windows.
If you're suspicious of the event loop, could overriding QCoreApplication::notify() and dosome manual profiling (one or two maps of senders/events to counts/time)?
I'm thinking that you first log the frequency of event types, then examine those events more carefully (which object sends it, what does it contain, etc). Signals across threads are queued implicitly, so they end up in the event loop (as well explicit queued connections too, obviously).
We've done it to trap and report exceptions in our event handlers, so really, every event goes through there.
Just an idea.
Edit: I see now you mentioned this in your first post. Dammit, I never thought I'd be that guy.
You can use Pin to instrument your code with finer granularity. I think Pin would let you create a tool to count how many times you enter a function or how many clockticks you spend there, roughly emulating something like VTune or CodeAnalyst. Then you could strip down which functions get instrumented until your timing issues go away.
I can tell you what I use everyday.
a) AMD Code Analyst
It is easy, and it will give you a quick overview of what is happening. It will be ok for most of the time.
With AMD CPUs, it will tell you info about the cpu pipeline, but you only need this only if you have heavy loops, like in graphic engines, video codecs, etc.
b) VTune.
It is very well integrated in vs2008
after you know the hotspots, you need to sample not only time, but other things like cache misses, and memory usage. This is very important. Setup a sampling session, and edit the properties. I always sample for time, memory read/write, and cache misses (three different runs)
But more than the tool, you need to get experience with profiling. And that means understanding how the CPU/Memory/PCI works... so, this is my 3rd option
c) Unit testing
This is very important if you are developing a big application that needs huge performance. If you cannot split the app in some pieces, it will be difficult to track cpu usage. I dont test all the cases and classes, but I have hardcoded executions and input files with important features.
My advice is using random sampling in several small tests, and try to standardise a profile strategy.
I use xperf/ETW for all of my profiling needs. It has a steep learning curve but is incredibly powerful. If you are profiling on Windows then you must know xperf. I frequently use this profiler to find performance problems in my code and in other people's code.
In the configuration that I use it:
xperf grabs CPU samples from every core that is executing code every
ms. The sampling rate can be increased to 8 KHz and the samples
include user-mode and kernel code. This allows finding out what a
thread is doing while it is running
xperf records every context
switch (allowing for perfect reconstruction of how much time each
thread uses), plus call stacks for when threads are switched in, plus
call stacks for what thread readied another thread, allowing tracing
of wait chains and finding out why a thread is not running
xperf
records all file I/O from all processes
xperf records all disk I/O
from all processes
xperf records what window is active, the CPU
frequency, CPU power state, UI delays, etc.
xperf can also record all
heap allocations from one process, all virtual allocations from all
processes, and much more.
That's a lot of data, all on one timeline, for all processes. No other profiler on Windows can do that.
I have blogged extensively about how to use xperf/ETW. These blog posts, and some professionally quality training videos, can be found here:
http://randomascii.wordpress.com/2014/08/19/etw-training-videos-available-now/
If you want to find out what might happen if you don't use xperf read these blog posts:
http://randomascii.wordpress.com/category/investigative-reporting/
These are tales of performance problems I have found in other people's code, that should have been found by the developers. This includes mshtml.dll being loaded into the VC++ compiler, a denial of service in VC++'s find-in-files, thermal throttling in a surprising number of customer machines, slow single-stepping in Visual Studio, a 4 GB allocation in a hard-disk driver, a powerpoint performance bug, and more.
I just finished the first usable version of CxxProf, a portable manual instrumented profiling library for C++.
It fulfills the following goals:
Easy integration
Easily remove the lib during compile time
Easily remove the lib during runtime
Support for multithreaded applications
Support for distributed systems
Keep impact on a minimum
These points were ripped from the project wiki, have a look there for more details.
Disclaimer: Im the main developer of CxxProf
Just to throw it out, even though it's not a full-blown profiler: if all you're after is hung event loops that take long processing an event, an ad-hoc tool is simple matter in Qt. That approach could be easily expanded to keep track of how long did each event take to process, and what those events were, and so on. It's not a universal profiler, but an event-loop-centric one.
In Qt, all cross-thread signal-slot calls are delivered via the event loop, as are timers, network and serial port notifications, and all user interaction,. Thus, observing the event loops is a big step towards understanding where the application is spending its time.
DevPartner, originally developed by NuMega and now distributed by MicroFocus, was once the solution of choice for profiling and code analysis (memory and resource leaks for example).
I haven't tried it recently, so I cannot assure you it will help you; but I once had excellent results with it, so that this is an alternative I do consider to re-install in our code quality process (they provide a 14 days trial)
though your os is win7,the programm cann't run under xp?
how about profile it under xp and the result should be a hint for win7.
There are lots of profilers listed here and I've tried a few of them myself - however I ended up writing my own based on this:
http://code.google.com/p/high-performance-cplusplus-profiler/
It does of course require that you modify the code base, but it's perfect for narrowing down bottlenecks, should work on all x86s (could be a problem with multi-core boxes, i.e. it uses rdtsc, however - this is purely for indicative timing anyway - so I find it's sufficient for my needs..)
I use Orbit profiler, easy, open source and powerfull ! https://orbitprofiler.com/

Fastest way to run a program in a 64 bit environment?

It's been a couple of decades since I've done any programming. As a matter of fact the last time I programmed was in an MS-DOS environment before Windows came out. I've had this programming idea that I have wanted to try for a few years now and I thought I would give it a try. The amount of calculations are enormous. Consequently I want to run it in the fastest environment I can available to a general hobby programmer.
I'll be using a 64 bit machine. Currently it is running Windows 7. Years ago a program ran much slower in the windows environment then then in MS-DOS mode. My personal programming experience has been in Fortran, Pascal, Basic, and machine language for the 6800 Motorola series processors. I'm basically willing to try anything. I've fooled around with Ubuntu also. No objections to learning new. Just want to take advantage of speed. I'd prefer to spend no money on this project. So I'm looking for a free or very close to free compiler. I've downloaded Microsoft Visual Studio C++ Express. But I've got a feeling that the completed compiled code will have to be run in the Windows environment. Which I'm sure slows the processing speed considerably.
So I'm looking for ideas or pointers to what is available.
Thank you,
Have a Great Day!
Jim
Speed generally comes with the price of either portability or complexity.
If your programming idea involves lots of computation, then if you're using Intel CPU, you might want to use Intel's compiler, which might benefit from some hidden processor features that might make your program faster. Otherwise, if portability is your goal, then use GCC (GNU Compiler Collection), which can cross-compile well optimized executable to practically any platform available on earth. If your computation can be parallelizable, then you might want to look at SIMD (Single Input Multiple Data) and GPGPU/CUDA/OpenCL (using graphic card for computation) techniques.
However, I'd recommend you should just try your idea in the simpler languages first, e.g. Python, Java, C#, Basic; and see if the speed is good enough. Since you've never programmed for decades now, it's likely your perception of what was an enormous computation is currently miniscule due to the increased processor speed and RAM. Nowadays, there is not much noticeable difference in running in GUI environment and command line environment.
Tthere is no substantial performance penalty to operating under Windows and a large quantity of extremely high performance applications do so. With new compiler advances and new optimization techniques, Windows is no longer the up-and-coming, new, poorly optimized technology it was twenty years ago.
The simple fact is that if you haven't programmed for 20 years, then you won't have any realistic performance picture at all. You should make like most people- start with an easy to learn but not very fast programming language like C#, create the program, then prove that it runs too slowly, then make several optimization passes with tools such as profilers, then you may decide that the language is too slow. If you haven't written a line of code in two decades, the overwhelming probability is that any program that you write will be slow because you're a novice programmer from modern perspectives, not because of your choice of language or environment. Creating very high performance applications requires a detailed understanding of the target platform as well as the language of choice, AND the operations of the program.
I'd definitely recommend Visual C++. The Express Edition is free and Visual Studio 2010 can produce some unreasonably fast code. Windows is not a slow platform - even if you handwrote your own OS, it'd probably be slower, and even if you produced one that was faster, the performance gain would be negligible unless your program takes days or weeks to complete a single execution.
The OS does not make your program magically run slower. True, the OS does eat a few clock cycles here and there, but it's really not enough to be at all noticeable (and it does so in order to provide you with services you most likely need, and would need to re-implement yourself otherwise)
Windows doesn't, as some people seem to believe, eat 50% of your CPU. It might eat 0.5%, but so does Linux and OSX. And if you were to ditch all existing OS'es and instead write your own from scratch, you'd end up with a buggy, less capable OS which also eats a bit of CPU time.
So really, the environment doesn't matter.
What does matter is what hardware you run the program on (and here, running it on the GPU might be worth considering) and how well you utilize the hardware (concurrency is pretty much a must if you want to exploit modern hardware).
What code you write, and how you compile it does make a difference. The hardware you're running on makes a difference. The choice of OS does not.
A digression: that the OS doesn't matter for performance is, in general, obviously false. Citing CPU utilization when idle seems a quite "peculiar" idea to me: of course one hopes that when no jobs are running the OS is not wasting energy. Otherwise one measure the speed/throughput of an OS when it is providing a service (i.e. mediating the access to hardware/resources).
To avoid an annoying MS Windows vs Linux vs Mac OS X battle, I will refer to a research OS concept: exokernels. The point of exokernels is that a traditional OS is not just a mediator for resource access but it implements policies. Such policies does not always favor the performance of your application-specific access mode to a resource. With the exokernel concept, researchers proposed to "exterminate all operating system abstractions" (.pdf) retaining its multiplexer role. In this way:
… The results show that common unmodified UNIX applications can enjoy the benefits of exokernels: applications either perform comparably on Xok/ExOS and the BSD UNIXes, or perform significantly better. In addition, the results show that customized applications can benefit substantially from control over their resources (e.g., a factor of eight for a Web server). …
So bypassing the usual OS access policies they gained, for a customized web server, an increase of about 800% in performance.
Returning to the original question: it's generally true that an application is executed with no or negligible OS overhead when:
it has a compute-intensive kernel, where such kernel does not call the OS API;
memory is enough or data is accessed in a way that does not cause excessive paging;
all inessential services running on the same systems are switched off.
There are possibly other factors, depending by hardware/OS/application.
I assume that the OP is correct in its rough estimation of computing power required. The OP does not specify the nature of such intensive computation, so its difficult to give suggestions. But he wrote:
The amount of calculations are enormous
"Calculations" seems to allude to compute-intensive kernels, for which I think is required a compiled language or a fast interpreted language with native array operators, like APL, or modern variant such as J, A+ or K (potentially, at least: I do not know if they are taking advantage of modern hardware).
Anyway, the first advice is to spend some time in researching fast algorithms for your specific problem (but when comparing algorithms remember that asymptotic notation disregards constant factors that sometimes are not negligible).
For the sequential part of your program a good utilization of CPU caches is crucial for speed. Look into cache conscious algorithms and data structures.
For the parallel part, if such program is amenable to parallelization (remember both Amdahl's law and Gustafson's law), there are different kinds of parallelism to consider (they are not mutually exclusive):
Instruction-level parallelism: it is taken care by the hardware/compiler;
data parallelism:
bit-level: sometimes the acronym SWAR (SIMD Within A Register) is used for this kind of parallelism. For problems (or some parts of them) where it can be formulated a data representation that can be mapped to bit vectors (where a value is represented by 1 or more bits); so each instruction from the instruction set is potentially a parallel instruction which operates on multiple data items (SIMD). Especially interesting on a machine with 64 bits (or larger) registers. Possible on CPUs and some GPUs. No compiler support required;
fine-grain medium parallelism: ~10 operations in parallel on x86 CPUs with SIMD instruction set extensions like SSE, successors, predecessors and similar; compiler support required;
fine-grain massive parallelism: hundreds of operations in parallel on GPGPUs (using common graphic cards for general-purpose computations), programmed with OpenCL (open standard), CUDA (NVIDIA), DirectCompute (Microsoft), BrookGPU (Stanford University) and Intel Array Building Blocks. Compiler support or use of a dedicated API is required. Note that some of these have back-ends for SSE instructions also;
coarse-grain modest parallelism (at the level of threads, not single instructions): it's not unusual for CPUs on current desktops/laptops to have more then one core (2/4) sharing the same memory pool (shared-memory). The standard for shared-memory parallel programming is the OpenMP API, where, for example in C/C++, #pragma directives are used around loops. If I am not mistaken, this can be considered data parallelism emulated on top of task parallelism;
task parallelism: each core in one (or multiple) CPU(s) has its independent flow of execution and possibly operates on different data. Here one can use the concept of "thread" directly or a more high-level programming model which masks threads.
I will not go into details of these programming models here because apparently it is not what the OP needs.
I think this is enough for the OP to evaluate by himself how various languages and their compilers/run-times / interpreters / libraries support these forms of parallelism.
Just my two cents about DOS vs. Windows.
Years ago (something like 1998?), I had the same assumption.
I have some program written in QBasic (this was before I discovered C), which did intense calculations (neural network back-propagation). And it took time.
A friend offered to rewrite the thing in Visual Basic. I objected, because, you know, all those gizmos, widgets and fancy windows, you know, would slow down the execution of, you know, the important code.
The Visual Basic version so much outperformed the QBasic one that it became the default application (I won't mention the "hey, even in Excel's VBA, you are outperformed" because of my wounded pride, but...).
The point here, is the "you know" part.
You don't know.
The OS here is not important. As others explained in their answers, choose your hardware, and choose your language. And write your code in a clear way because now, compilers are better at optimizing code developers, unless you're John Carmack (premature optimization is the root of all evil).
Then, if you're not happy with the result, use a profiler to test your code. Consider multithreading (which will help you if you have multiple cores... TBB comes to mind).
What are you trying to do? I believe all the stuff should be compiled in 64bit mode by default. Computers have gotten a lot faster. Speed should not be a problem for the most part.
Side note: As for computation intense stuff you may want to look into OpenCL or CUDA. OpenCL and CUDA take advantage of the GPU which can transfer lots of information at a time compared to the CPU.
If your last points of reference are M68K and PCs running DOS then I'd suggest that you start with C/C++ on a modern processor and OS. If you run into performance problems and can prove that they are caused by running on Linux / Windows or that the compiler / optimizer generated code isn't sufficient, then you could look at other OSes and/or hand coded ASM. If you're looking for free, Linux / gcc is a good place to start.
I am the original poster of this thread.
I am once again reiterating the emphasis that this program will have enormous number of calculations.
Windows & Ubuntu are multi-tasking environments. There are processes running and many of them are using processor resources. True many of them are seen as inactive. But still the Windows environment by the nature of multi-tasking is constantly monitoring the need to start up each process. For example currently there are 62 processes showing in the Windows Task Manager. According the task manager three are consuming CPU resouces. So we have three ongoing processes that are consuming CPU processing. There are an addition 59 showing active but consuming no CPU processing. So that is 63 being monitored by Windows and then there is the Windows that also is checking on various things.
I was hoping to find some way to just be able to run a program on the bare machine level. Side stepping all the Windows (Ubuntu) involvement.
The idea is very calculation intensive.
Thank you all for taking the time to respond.
Have a Great Day,
Jim

How to profile multi-threaded C++ application on Linux?

I used to do all my Linux profiling with gprof.
However, with my multi-threaded application, it's output appears to be inconsistent.
Now, I dug this up:
http://sam.zoy.org/writings/programming/gprof.html
However, it's from a long time ago and in my gprof output, it appears my gprof is listing functions used by non-main threads.
So, my questions are:
In 2010, can I easily use gprof to profile multi-threaded Linux C++ applications? (Ubuntu 9.10)
What other tools should I look into for profiling?
Edit: added another answer on poor man's profiler, which IMHO is better for multithreaded apps.
Have a look at oprofile. The profiling overhead of this tool is negligible and it supports multithreaded applications---as long as you don't want to profile mutex contention (which is a very important part of profiling multithreaded applications)
Have a look at poor man's profiler. Surprisingly there are few other tools that for multithreaded applications do both CPU profiling and mutex contention profiling, and PMP does both, while not even requiring to install anything (as long as you have gdb).
Try modern linux profiling tool, the perf (perf_events): https://perf.wiki.kernel.org/index.php/Tutorial and http://www.brendangregg.com/perf.html:
perf record ./application
# generates profile file perf.data
perf report
Have a look at Valgrind.
A Paul R said, have a look at Zoom. You can also use lsstack, which is a low-tech approach but surprisingly effective, compared to gprof.
Added: Since you clarified that you are running OpenGL at 33ms, my prior recommendation stands. In addition, what I personally have done in situations like that is both effective and non-intuitive. Just get it running with a typical or problematic workload, and just stop it, manually, in its tracks, and see what it's doing and why. Do this several times.
Now, if it only occasionally misbehaves, you would like to stop it only while it's misbehaving. That's not easy, but I've used an alarm-clock interrupt set for just the right delay. For example, if one frame out of 100 takes more than 33ms, at the start of a frame, set the timer for 35ms, and at the end of a frame, turn it off. That way, it will interrupt only when the code is taking too long, and it will show you why. Of course, one sample might miss the guilty code, but 20 samples won't miss it.
I tried valgrind and gprof. It is a crying shame that none of them work well with multi-threaded applications. Later, I found Intel VTune Amplifier. The good thing is, it handles multi-threading well, works with most of the major languages, works on Windows and Linux, and has many great profiling features. Moreover, the application itself is free. However, it only works with Intel processors.
You can randomly run pstack to find out the stack at a given point. E.g. 10 or 20 times.
The most typical stack is where the application spends most of the time (according to experience, we can assume a Pareto distribution).
You can combine that knowledge with strace or truss (Solaris) to trace system calls, and pmap for the memory print.
If the application runs on a dedicated system, you have also sar to measure cpu, memory, i/o, etc. to profile the overall system.
Since you didn't mention non-commercial, may I suggest Intel's VTune. It's not free but the level of detail is very impressive (and the overhead is negligible).
Putting a slightly different twist on matters, you can actually get a pretty good idea as to what's going on in a multithreaded application using ftrace and kernelshark. Collecting the right trace and pressing the right buttons and you can see the scheduling of individual threads.
Depending on your distro's kernel you may have to build a kernel with the right configuration (but I think that a lot of them have it built in these days).
Microprofile is another possible answer to this. It requires hand-instrumentation of the code, but it seems like it handles multi-threaded code pretty well. And it also has special hooks for profiling graphics pipelines, including what's going on inside the card itself.