QueryPerformanceCounter Status? - c++

I've read some about the problems and inaccuracy of the QPC implementation. But all those discussions and articles seem rather out of date.
Anyone know what the current status is of the QPC functionality on the latest Win7 OS and Intel/AMD cpus? Has there been any improvements? Is e.g. Turbo Boost still a big issue?
I'm mainly interested in this since I need to decide between using timeGetTime and QueryPerformanceCounter for timing in my application.

QPC was patched on the software side by Microsoft ages ago (old KB article on the problem: http://support.microsoft.com/kb/895980 ). This fix can be installed on any target system for peace of mind (I've seen a few games doing this), however it shouldn't be needed. All newer processors have a hardware fix/compensation for this, as well as having the TSC correctly synced between cores, so the problems with RDTSC-based timers are eliminated. The hotfix can be downloaded from here: http://support.microsoft.com/?id=896256 . I'm not sure if there is anything for other systems, but from the looks of it this fix is inherent in Windows Vista and Windows 7 based OSes. As for the question of accuracy, that depends on the hardware itself; however, Microsoft says QPC is the most reliably accurate timer you can get (with decent performance), so short of inline __rdtsc(), I doubt you'll get anything more accurate.
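For reference, a minimal sketch of reading QPC and turning a tick interval into seconds. The helper names (read_ticks, ticks_per_second, ticks_to_seconds) are made up for illustration, and the non-Windows branch using std::chrono::steady_clock is only an assumption added so the snippet builds elsewhere.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Raw tick count from a high-resolution monotonic source.
inline std::int64_t read_ticks() {
#ifdef _WIN32
    LARGE_INTEGER t;
    QueryPerformanceCounter(&t);
    return t.QuadPart;
#else
    // Fallback for non-Windows builds (an assumption made for portability).
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch())
        .count();
#endif
}

// Ticks per second of the source used by read_ticks().
inline std::int64_t ticks_per_second() {
#ifdef _WIN32
    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);
    return f.QuadPart;
#else
    return 1000000000; // the fallback counts nanoseconds
#endif
}

// Convert a tick interval to seconds.
inline double ticks_to_seconds(std::int64_t start, std::int64_t end, std::int64_t freq) {
    assert(freq > 0);
    return static_cast<double>(end - start) / static_cast<double>(freq);
}
```

Typical use: call read_ticks() before and after the code being timed and pass both values, plus ticks_per_second(), to ticks_to_seconds(). The frequency does not change while the system is running, so it can be queried once at startup.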

Related

Modern high res timer for periodic calls

A huge amount has been said about high resolution timers on stackoverflow. But it's clear that the solution is a bit of a moving target and best practices are changing.
I need to create a high resolution timer that has a callback every 10ms to achieve a consistent 100Hz. The target platform is Windows 7 and later.
This exact question was asked in 2009, but I believe things have probably moved on.
Multimedia timers looked to be a great solution, but MSDN says they are deprecated, replaced with CreateTimerQueueTimer. But other answers on stackoverflow suggest that CreateTimerQueueTimer is not as accurate as timeSetEvent.
All answers do consistently point out the requirement for setting the windows timer resolution to a low value using timeBeginPeriod.
So with all that said, what is the best approach to achieve the above desired goal in C today?
but MSDN says they are deprecated
It is pretty important to be able to read between the lines when you see a deprecation warning like this. Yes, most certainly Microsoft would like everybody to stop using multimedia timers. They are heavily abused and are very bad for business. Actually getting programmers to stop using them is however a pipe dream, CreateTimerQueueTimer() is not an alternative.
Bad for business because Microsoft likes to be competitive in mobile computing. And multimedia timers are a very poor match, they are murder on battery life. Most programs that use them jack up the clock interrupt rate to the maximum allowed, 1000 times per second. With a backdoor to get to 2000. And it is very hard to stop them from doing that, especially when their competitors give their software away for free. They have no incentive whatsoever to fix that problem since it makes their mobile OS look good. And Microsoft can't kill very popular apps like that.
Microsoft also has a mobile OS, exposed through the WinRT api. Where that deprecation is rock-hard, you cannot get your app approved by the Store validation procedure when you use those timers. But it doesn't get much use, their customers like to keep using their desktop apps.
If you want a 100 Hz update rate then you have to use timeBeginPeriod() and timeSetEvent(), there is no other way. And avoid WinRT. Since it is actually only 1.5 times worse than the default, there is no appreciable reason to worry about power consumption. Set laser to stun and use what works.
Are you willing to burn CPU as you go? Call QueryPerformanceCounter() in a busy loop. You will get microsecond precision, with appropriate detrimental effect on battery life. You can also take it to eleven with jacking up your process priority class to REALTIME_PRIORITY_CLASS and worker thread's priority to THREAD_PRIORITY_TIME_CRITICAL (or a notch or two lower). There are of course negative consequences for doing such things.
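A rough sketch of the busy-loop approach, written against std::chrono::steady_clock rather than calling QueryPerformanceCounter() directly so that it compiles anywhere (as far as I know, recent MSVC standard libraries implement steady_clock on top of QPC, but treat that as an assumption). The function name run_periodic_busy is made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Calls `tick` `count` times, once per `period`, by spinning on steady_clock.
// Each deadline is computed from the start time, so per-iteration error
// does not accumulate into drift.
inline int run_periodic_busy(std::chrono::microseconds period, int count,
                             const std::function<void(int)>& tick) {
    assert(period.count() > 0 && count >= 0);
    using clock = std::chrono::steady_clock;
    const clock::time_point start = clock::now();
    int fired = 0;
    for (int i = 1; i <= count; ++i) {
        const auto deadline = start + i * period;
        while (clock::now() < deadline) {
            // Busy wait: this burns a full core for the whole run.
        }
        tick(i);
        ++fired;
    }
    return fired;
}
```

For the 100 Hz case the period would be std::chrono::microseconds(10000). The thread can still be preempted by the scheduler, which is where the priority tweaks mentioned above come in.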

Speed performance of a Qt program: Windows vs Linux

I've already posted this question here, but since it's maybe not that Qt-specific, I thought I might try my chance here as well. I hope it's not inappropriate to do that (just tell me if it is).
I’ve developed a small scientific program that performs some mathematical computations. I’ve tried to optimize it so that it’s as fast as possible. Now I’m almost done deploying it for Windows, Mac and Linux users. But I have not been able to test it on many different computers yet.
Here’s what troubles me: To deploy for Windows, I’ve used a laptop which has both Windows 7 and Ubuntu 12.04 installed on it (dual boot). I compared the speed of the app running on these two systems, and I was shocked to observe that it’s at least twice as slow on Windows! I wouldn’t have been surprised if there were a small difference, but how can one account for such a difference?
Here are a few precisions:
The computation that I make the program do are just some brutal and stupid mathematical calculations, basically, it computes products and cosines in a loop that is called a billion times. On the other hand, the computation is multi-threaded: I launch something like 6 QThreads.
The laptop has two cores @1.73GHz. At first I thought that Windows was probably not using one of the cores, but then I looked at the processor activity, and according to the small graphic, both cores are running at 100%.
Then I thought the C++ compiler for Windows didn’t use the optimization options (things like -O1 -O2) that the C++ compiler for Linux automatically did (in release build), but apparently it does.
I’m bothered that the app is so much slower (2 to 4 times) on Windows, and it’s really weird. On the other hand I haven’t tried on other computers with Windows yet. Still, do you have any idea why the difference?
Additional info: some data…
Even though Windows seems to be using the two cores, I’m thinking this might have something to do with threads management, here’s why:
Sample Computation n°1 (this one launches 2 QThreads):
PC1-windows: 7.33s
PC1-linux: 3.72s
PC2-linux: 1.36s
Sample Computation n°2 (this one launches 3 QThreads):
PC1-windows: 6.84s
PC1-linux: 3.24s
PC2-linux: 1.06s
Sample Computation n°3 (this one launches 6 QThreads):
PC1-windows: 8.35s
PC1-linux: 2.62s
PC2-linux: 0.47s
where:
PC1-windows = my 2 cores laptop (@1.73GHz) with Windows 7
PC1-linux = my 2 cores laptop (@1.73GHz) with Ubuntu 12.04
PC2-linux = my 8 cores laptop (@2.20GHz) with Ubuntu 12.04
(Of course, it's not shocking that PC2 is faster. What's incredible to me is the difference between PC1-windows and PC1-linux).
Note: I've also tried running the program on a recent PC (4 or 8 cores @~3GHz, don't remember exactly) under Mac OS, speed was comparable to PC2-linux (or slightly faster).
EDIT: I'll answer here a few questions I was asked in the comments.
I just installed Qt SDK on Windows, so I guess I have the latest version of everything (including MinGW?). The compiler is MinGW. Qt version is 4.8.1.
I use no optimization flags because I noticed that they are automatically used when I build in release mode (with Qt Creator). It seems to me that if I write something like QMAKE_CXXFLAGS += -O1, this only has an effect in debug build.
Lifetime of threads etc: this is pretty simple. When the user clicks the "Compute" button, 2 to 6 threads are launched simultaneously (depending on what he is computing), they are terminated when the computation ends. Nothing too fancy. Every thread just does brutal computations (except one, actually, which makes a (not so) small computation every 30ms, basically checking whether the error is small enough).
EDIT: latest developments and partial answers
Here are some new developments that provide answers about all this:
I wanted to determine whether the difference in speed really had something to do with threads or not. So I modified the program so that the computation only uses 1 thread, that way we are pretty much comparing the performance on "pure C++ code". It turned out that now Windows was only slightly slower than Linux (something like 15%). So I guess that a small (but not insignificant) part of the difference is intrinsic to the system, but the largest part is due to threads management.
As someone (Luca Carlon, thanks for that) suggested in the comments, I tried building the application with the compiler for Microsoft Visual Studio (MSVC), instead of MinGW. And surprise, the computation (with all the threads and everything) was now "only" 20% to 50% slower than Linux! I think I'm going to go ahead and be content with that. I noticed that weirdly though, the "pure C++" computation (with only one thread) was now even slower (than with MinGW), which must account for the overall difference. So as far as I can tell, MinGW is slightly better than MSVC except that it handles threads like a moron.
So, I’m thinking either there’s something I can do to make MinGW (ideally I’d rather use it than MSVC) handle threads better, or it just can’t. I would be amazed, how could it not be well known and documented ? Although I guess I should be careful about drawing conclusions too quickly, I’ve only compared things on one computer (for the moment).
Another possibility: on Linux the Qt libraries may already be loaded (for example, if you use KDE), while on Windows they have to be loaded when the program starts, which adds to the measured time. To check how much time library loading costs your application, you could write a dummy test in pure C++.
I have noticed exactly the same behavior on my PC.
I am running Windows 7(64bits), Ubuntu (64bits) and OSX (Lion 64bits) and my program compares 2 XML files (more than 60Mb each). It uses Multithreading too (2 threads) :
-Windows : 40sec
-Linux : 14sec (!!!)
-OSX : 22sec.
I use a personal class for threads (and not Qt one) which uses "pthread" on linux/OSX and "threads" on windows.
I use Qt/mingw compiler as I need the XML class from Qt.
I have found no way (for now) to have the 3 OS having similar performances... but I hope I will !
I think that another reason may be the memory : my program uses about 500Mb of RAM. So I think that Unix is managing it best because, in mono-thread, Windows is exactly 1.89 times slower and I don't think that Linux could be more than 2 times slower !
I have heard of one case where Windows was extremely slow with writing files if you do it wrongly. (This has nothing to do with Qt.)
The problem in that case was that the developer used a SQLite database, wrote some 10000 datasets, and did a SQL COMMIT after each insert. This caused Windows to write the whole DB file to disk each time, while Linux would only update the buffered version of the filesystem inode in the RAM. The speed difference was even worse in that case: 1 second on Linux vs. 1 minute on Windows. (After he changed SQLite to commit only once at the end, it was also 1 second on Windows.)
So if you're writing the results of your computation to disk, you might want to check if you're calling fsync() or fflush() too often. If your writing code comes from a library, you can use strace for this (Linux-only, but should give you a basic idea).
You might experience performance differences by how mutexes run on Windows and Linux.
Pure mutex code on windows can have a 15ms wait every time there is a contention for resource when locking. Better performing synchronization mechanism on Windows is Critical Sections. It doesn't experience the locking penalty that regular mutexes experience in most cases.
I have found that on Linux, regular mutexes perform the same as Critical Sections on Windows.
It's probably the memory allocator, try using jemalloc or tcmalloc from Google. Glibc's ptmalloc3 is significantly better than the old crusty allocator in MSVC's crt. The comparable option from Microsoft is the Concurrency CRT but you cannot simply drop it in as a replacement.

How can I find out how much time is spent on each line in C/C++ code?

I am trying to find a profiling tool with which I can find out how much time is spent on each line of code in a C/C++ program. I am working on Linux platforms (Ubuntu, Gentoo, SL) mainly with gcc. I use gprof but sometimes I need the "per line" information.
Any suggestions? Thank you!
On linux you can use oprofile. This is a sample based profiler which runs on almost any platform and supports the performance monitoring registers if they are available. On x86 it works with both AMD and Intel.
You can use it as a standalone program which will give you an annotated source, but there is a plugin available (linuxtools) for Eclipse which integrates nicely into the IDE.
AMD CodeAnalyst is your best bet, it is totally free, and it works on Windows and Linux, though it's primarily for AMD CPUs, so non-AMD CPUs won't get the MSR-based profiling options. Under Windows it also has great integration for Visual Studio 2008 & 2010.
For non-vendor-specific, free profilers, you can try Very Sleepy, which also happens to be open source.
What Zoom does is take stack samples on wall-clock time.
Then the percent of time any function or line of code is responsible for is the fraction of samples on which it appears.
For example, if a line of code is on 30% of stack samples, and you could avoid executing it, the total execution time would decrease by 30%
This is true regardless of I/O, recursion, competing processes, swapping, all the things that confuse many profilers.
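To make the "fraction of samples" idea concrete, here is a small sketch. It is not how Zoom itself is implemented, just the arithmetic described above; the StackSample type and the "file:line" strings are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One stack sample: one entry per frame, e.g. "solve.cpp:42" (hypothetical format).
using StackSample = std::vector<std::string>;

// Fraction of samples in which `location` appears at least once on the stack.
inline double inclusive_fraction(const std::vector<StackSample>& samples,
                                 const std::string& location) {
    assert(!samples.empty());
    std::size_t hits = 0;
    for (const StackSample& s : samples) {
        if (std::find(s.begin(), s.end(), location) != s.end()) {
            ++hits; // counted once per sample, so recursion does not inflate it
        }
    }
    return static_cast<double>(hits) / static_cast<double>(samples.size());
}
```

A line that shows up on 2 of 4 samples gets 0.5, which is the share of wall-clock time you could save at most by not executing it.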

How to realise long-term high-resolution timing on windows using C++?

I need to get exact timestamps every couple of ms (20, 30, 40ms) over a long period of time (a couple of hours). The function in which the timestamp is taken is invoked as a callback by a 3rd-party library.
Using GetSystemTime() one can get the correct system timestamp but only with milliseconds accuracy, which is not precise enough for me. Using QueryPerformanceTimer() yields more accurate timestamps but is not synchronous to the system timestamp over a long period of time (see http://msdn.microsoft.com/en-us/magazine/cc163996.aspx).
The solution provided at the site linked above somehow works only on older computers; it hangs while synchronizing when I try to use it on newer computers.
It seems to me like boost is also only working on milliseconds accuracy.
If possible, I'd like to avoid using external libraries, but if there's no other choice I'll go with it.
Any suggestions?
Deleted article from CodeProject, this seems to be the copy: DateTimePrecise C# Class. The idea is to use the QueryPerformanceCounter API for accurate small increments and periodically adjust it in order to keep long-term accuracy. This gives about microsecond accuracy ("about" because it's still not exactly precise, but still quite usable).
See also: Microsecond resolution timestamps on Windows
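The same idea in C++ could look roughly like the sketch below. This is not the DateTimePrecise code, only an illustration of "system time as the anchor, monotonic counter for the increments, re-anchor periodically". The class name PreciseClock and the resync interval are assumptions, and std::chrono::steady_clock stands in for QueryPerformanceCounter.

```cpp
#include <cassert>
#include <chrono>

// Wall-clock timestamps with fine-grained increments: take the system time
// once, advance it with the monotonic clock, and re-anchor to the system
// time after `resync_interval` to limit long-term drift.
class PreciseClock {
public:
    explicit PreciseClock(std::chrono::seconds resync_interval)
        : resync_interval_(resync_interval) {
        assert(resync_interval.count() > 0);
        resync();
    }

    std::chrono::system_clock::time_point now() {
        auto elapsed = std::chrono::steady_clock::now() - steady_base_;
        if (elapsed > resync_interval_) {
            resync();
            elapsed = std::chrono::steady_clock::now() - steady_base_;
        }
        return system_base_ +
               std::chrono::duration_cast<std::chrono::system_clock::duration>(elapsed);
    }

private:
    void resync() {
        system_base_ = std::chrono::system_clock::now();
        steady_base_ = std::chrono::steady_clock::now();
    }

    std::chrono::seconds resync_interval_;
    std::chrono::system_clock::time_point system_base_;
    std::chrono::steady_clock::time_point steady_base_;
};
```

Note that a resync can make consecutive timestamps jump (even slightly backwards) if the two clocks have drifted apart, so code that needs strictly increasing values would have to smooth that step.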
Which language are you using?
In Java (1.5 or above) I'd suggest 'System.nanoTime()' which requires no import.
Remember in Windows that time-slice granularity is 1000ms / 64 = 15.625ms.
This will affect inter-process communication, especially on uni-processor machines, or machines that run several heavy CPU usage processes 'concurrently'*.
In fact, I just got DOS 6.22 and Windows for Workgroups 3.11/3.15 via eBay, so I can screenshot the original timeslice configuration for uni-processor Windows machines of the era when I started to get into it. (Although it might not be visible in versions above 3.0).
You'll be hard pressed to find anything better than QueryPerformanceTimer() on Windows.
On modern hardware it uses the HPET as a source which replaces the RTC interrupt controller. I would expect QueryPerformanceTimer() and the System clock to be synchronous.
There is no such QueryPerformanceTimer() on windows. The resource is named QueryPerformanceCounter(). It provides a counter value counting at some higher frequency.
Its incrementing frequency can be retrieved by a call to QueryPerformanceFrequency().
Since this frequency is typically in the MHz range, microsecond resolution can be observed.
There are some implementations around, i.e. this thread or at the Windows Timestamp Project
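As a small illustration of the counter/frequency relationship, a conversion from raw counter ticks to microseconds might look like this. The function name is made up; the inputs are whatever QueryPerformanceCounter() and QueryPerformanceFrequency() returned.

```cpp
#include <cassert>
#include <cstdint>

// Convert a raw counter value to microseconds, given the counter frequency
// in Hz. Splitting into whole seconds and a remainder avoids overflowing
// the product ticks * 1000000 for large tick counts.
inline std::int64_t ticks_to_microseconds(std::int64_t ticks, std::int64_t frequency_hz) {
    assert(ticks >= 0 && frequency_hz > 0);
    const std::int64_t whole_seconds = ticks / frequency_hz;
    const std::int64_t remainder = ticks % frequency_hz;
    return whole_seconds * 1000000 + (remainder * 1000000) / frequency_hz;
}
```

With a 10 MHz counter, 15 ticks are 1.5 microseconds, which this integer version truncates to 1.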

Fastest way to run a program in a 64 bit environment?

It's been a couple of decades since I've done any programming. As a matter of fact the last time I programmed was in an MS-DOS environment before Windows came out. I've had this programming idea that I have wanted to try for a few years now and I thought I would give it a try. The amount of calculations are enormous. Consequently I want to run it in the fastest environment available to a general hobby programmer.
I'll be using a 64 bit machine. Currently it is running Windows 7. Years ago a program ran much slower in the Windows environment than in MS-DOS mode. My personal programming experience has been in Fortran, Pascal, Basic, and machine language for the 6800 Motorola series processors. I'm basically willing to try anything. I've fooled around with Ubuntu also. No objections to learning new. Just want to take advantage of speed. I'd prefer to spend no money on this project. So I'm looking for a free or very close to free compiler. I've downloaded Microsoft Visual Studio C++ Express. But I've got a feeling that the completed compiled code will have to be run in the Windows environment. Which I'm sure slows the processing speed considerably.
So I'm looking for ideas or pointers to what is available.
Thank you,
Have a Great Day!
Jim
Speed generally comes with the price of either portability or complexity.
If your programming idea involves lots of computation, then if you're using an Intel CPU, you might want to use Intel's compiler, which might benefit from some hidden processor features that might make your program faster. Otherwise, if portability is your goal, then use GCC (GNU Compiler Collection), which can cross-compile well-optimized executables for practically any platform available on earth. If your computation is parallelizable, then you might want to look at SIMD (Single Instruction, Multiple Data) and GPGPU/CUDA/OpenCL (using the graphics card for computation) techniques.
However, I'd recommend you should just try your idea in the simpler languages first, e.g. Python, Java, C#, Basic; and see if the speed is good enough. Since you haven't programmed for decades now, it's likely that what you perceive as an enormous computation is minuscule today thanks to increased processor speed and RAM. Nowadays, there is not much noticeable difference between running in a GUI environment and a command line environment.
There is no substantial performance penalty to operating under Windows and a large quantity of extremely high performance applications do so. With new compiler advances and new optimization techniques, Windows is no longer the up-and-coming, new, poorly optimized technology it was twenty years ago.
The simple fact is that if you haven't programmed for 20 years, then you won't have any realistic performance picture at all. You should make like most people- start with an easy to learn but not very fast programming language like C#, create the program, then prove that it runs too slowly, then make several optimization passes with tools such as profilers, then you may decide that the language is too slow. If you haven't written a line of code in two decades, the overwhelming probability is that any program that you write will be slow because you're a novice programmer from modern perspectives, not because of your choice of language or environment. Creating very high performance applications requires a detailed understanding of the target platform as well as the language of choice, AND the operations of the program.
I'd definitely recommend Visual C++. The Express Edition is free and Visual Studio 2010 can produce some unreasonably fast code. Windows is not a slow platform - even if you handwrote your own OS, it'd probably be slower, and even if you produced one that was faster, the performance gain would be negligible unless your program takes days or weeks to complete a single execution.
The OS does not make your program magically run slower. True, the OS does eat a few clock cycles here and there, but it's really not enough to be at all noticeable (and it does so in order to provide you with services you most likely need, and would need to re-implement yourself otherwise)
Windows doesn't, as some people seem to believe, eat 50% of your CPU. It might eat 0.5%, but so does Linux and OSX. And if you were to ditch all existing OS'es and instead write your own from scratch, you'd end up with a buggy, less capable OS which also eats a bit of CPU time.
So really, the environment doesn't matter.
What does matter is what hardware you run the program on (and here, running it on the GPU might be worth considering) and how well you utilize the hardware (concurrency is pretty much a must if you want to exploit modern hardware).
What code you write, and how you compile it does make a difference. The hardware you're running on makes a difference. The choice of OS does not.
A digression: that the OS doesn't matter for performance is, in general, obviously false. Citing CPU utilization when idle seems a quite "peculiar" idea to me: of course one hopes that when no jobs are running the OS is not wasting energy. Otherwise one measures the speed/throughput of an OS when it is providing a service (i.e. mediating the access to hardware/resources).
To avoid an annoying MS Windows vs Linux vs Mac OS X battle, I will refer to a research OS concept: exokernels. The point of exokernels is that a traditional OS is not just a mediator for resource access but it implements policies. Such policies do not always favor the performance of your application-specific access mode to a resource. With the exokernel concept, researchers proposed to "exterminate all operating system abstractions" (.pdf) while retaining its multiplexer role. In this way:
… The results show that common unmodified UNIX applications can enjoy the benefits of exokernels: applications either perform comparably on Xok/ExOS and the BSD UNIXes, or perform significantly better. In addition, the results show that customized applications can benefit substantially from control over their resources (e.g., a factor of eight for a Web server). …
So by bypassing the usual OS access policies they gained, for a customized web server, a factor of eight in performance (roughly a 700% increase).
Returning to the original question: it's generally true that an application is executed with no or negligible OS overhead when:
it has a compute-intensive kernel, where such kernel does not call the OS API;
memory is enough or data is accessed in a way that does not cause excessive paging;
all inessential services running on the same systems are switched off.
There are possibly other factors, depending on hardware/OS/application.
I assume that the OP is correct in his rough estimation of the computing power required. The OP does not specify the nature of such intensive computation, so it's difficult to give suggestions. But he wrote:
The amount of calculations are enormous
"Calculations" seems to allude to compute-intensive kernels, for which I think a compiled language is required, or a fast interpreted language with native array operators, like APL or a modern variant such as J, A+ or K (potentially, at least: I do not know whether they take advantage of modern hardware).
Anyway, the first advice is to spend some time in researching fast algorithms for your specific problem (but when comparing algorithms remember that asymptotic notation disregards constant factors that sometimes are not negligible).
For the sequential part of your program a good utilization of CPU caches is crucial for speed. Look into cache conscious algorithms and data structures.
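A classic small illustration of cache effects (a sketch, not a benchmark): the two functions below compute the same sum over a matrix stored row-major in one contiguous vector, but only the first walks memory in address order. Function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Matrix stored row-major: element (r, c) lives at data[r * cols + c].

// Walks memory in address order, so every cache line fetched is fully used.
inline long long sum_row_order(const std::vector<int>& data, std::size_t rows,
                               std::size_t cols) {
    assert(data.size() == rows * cols);
    long long sum = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            sum += data[r * cols + c];
    return sum;
}

// Same result, but consecutive accesses are `cols` elements apart, which
// tends to miss the cache on every access once the matrix is large.
inline long long sum_column_order(const std::vector<int>& data, std::size_t rows,
                                  std::size_t cols) {
    assert(data.size() == rows * cols);
    long long sum = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            sum += data[r * cols + c];
    return sum;
}
```

Timing both on a matrix much larger than the last-level cache usually shows the difference clearly, though the exact ratio depends on the hardware.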
For the parallel part, if such program is amenable to parallelization (remember both Amdahl's law and Gustafson's law), there are different kinds of parallelism to consider (they are not mutually exclusive):
Instruction-level parallelism: it is taken care of by the hardware/compiler;
data parallelism:
bit-level: sometimes the acronym SWAR (SIMD Within A Register) is used for this kind of parallelism. It applies to problems (or parts of them) whose data can be represented as bit vectors (where a value is represented by 1 or more bits), so that each instruction from the instruction set is potentially a parallel instruction operating on multiple data items (SIMD). Especially interesting on a machine with 64-bit (or larger) registers. Possible on CPUs and some GPUs. No compiler support required;
fine-grain medium parallelism: ~10 operations in parallel on x86 CPUs with SIMD instruction set extensions like SSE, successors, predecessors and similar; compiler support required;
fine-grain massive parallelism: hundreds of operations in parallel on GPGPUs (using common graphic cards for general-purpose computations), programmed with OpenCL (open standard), CUDA (NVIDIA), DirectCompute (Microsoft), BrookGPU (Stanford University) and Intel Array Building Blocks. Compiler support or use of a dedicated API is required. Note that some of these have back-ends for SSE instructions also;
coarse-grain modest parallelism (at the level of threads, not single instructions): it's not unusual for CPUs on current desktops/laptops to have more than one core (2/4) sharing the same memory pool (shared-memory). The standard for shared-memory parallel programming is the OpenMP API, where, for example in C/C++, #pragma directives are used around loops. If I am not mistaken, this can be considered data parallelism emulated on top of task parallelism;
task parallelism: each core in one (or multiple) CPU(s) has its independent flow of execution and possibly operates on different data. Here one can use the concept of "thread" directly or a more high-level programming model which masks threads.
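To give a flavour of the bit-level (SWAR) item in the list above, here is a minimal sketch that adds eight unsigned 8-bit values packed into one 64-bit word with ordinary integer instructions. The function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Adds eight unsigned 8-bit lanes packed into 64-bit words. Each lane wraps
// modulo 256 and no carry leaks into the neighbouring lane.
inline std::uint64_t add_u8x8(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t high = 0x8080808080808080ULL;        // top bit of every lane
    const std::uint64_t low_sum = (a & ~high) + (b & ~high); // add the low 7 bits per lane
    return low_sum ^ ((a ^ b) & high);                       // fold the top bits back in
}

// Extracts lane `index` (0 = least significant byte).
inline std::uint8_t lane(std::uint64_t word, unsigned index) {
    assert(index < 8);
    return static_cast<std::uint8_t>(word >> (8 * index));
}
```

One 64-bit addition plus a few bitwise operations replaces eight separate byte additions, which is the whole point of the technique.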
I will not go into details of these programming models here because apparently it is not what the OP needs.
I think this is enough for the OP to evaluate by himself how various languages and their compilers/run-times / interpreters / libraries support these forms of parallelism.
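For the OpenMP item mentioned above, a loop-level example might look like the following sketch. The function and its workload (a sum of cosines times weights) are invented for illustration. The pragma only takes effect when the compiler is run with OpenMP enabled (e.g. -fopenmp for GCC); otherwise it is ignored and the loop simply runs on one thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sums cos(x[i]) * y[i]. The iterations are independent, so the loop can be
// split across cores; the reduction clause gives each thread a private
// partial sum and combines them at the end.
inline double weighted_cos_sum(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        sum += std::cos(x[k]) * y[k];
    }
    return sum;
}
```

Because floating-point addition is not associative, the parallel result can differ from the serial one in the last few bits for general inputs.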
Just my two cents about DOS vs. Windows.
Years ago (something like 1998?), I had the same assumption.
I have some program written in QBasic (this was before I discovered C), which did intense calculations (neural network back-propagation). And it took time.
A friend offered to rewrite the thing in Visual Basic. I objected, because, you know, all those gizmos, widgets and fancy windows, you know, would slow down the execution of, you know, the important code.
The Visual Basic version so much outperformed the QBasic one that it became the default application (I won't mention the "hey, even in Excel's VBA, you are outperformed" because of my wounded pride, but...).
The point here, is the "you know" part.
You don't know.
The OS here is not important. As others explained in their answers, choose your hardware, and choose your language. And write your code in a clear way because now, compilers are better at optimizing code than developers, unless you're John Carmack (premature optimization is the root of all evil).
Then, if you're not happy with the result, use a profiler to test your code. Consider multithreading (which will help you if you have multiple cores... TBB comes to mind).
What are you trying to do? I believe all the stuff should be compiled in 64bit mode by default. Computers have gotten a lot faster. Speed should not be a problem for the most part.
Side note: As for computation intense stuff you may want to look into OpenCL or CUDA. OpenCL and CUDA take advantage of the GPU which can transfer lots of information at a time compared to the CPU.
If your last points of reference are M68K and PCs running DOS then I'd suggest that you start with C/C++ on a modern processor and OS. If you run into performance problems and can prove that they are caused by running on Linux / Windows or that the compiler / optimizer generated code isn't sufficient, then you could look at other OSes and/or hand coded ASM. If you're looking for free, Linux / gcc is a good place to start.
I am the original poster of this thread.
I am once again reiterating the emphasis that this program will have enormous number of calculations.
Windows & Ubuntu are multi-tasking environments. There are processes running and many of them are using processor resources. True, many of them are seen as inactive. But still the Windows environment by the nature of multi-tasking is constantly monitoring the need to start up each process. For example currently there are 62 processes showing in the Windows Task Manager. According to the task manager three are consuming CPU resources. So we have three ongoing processes that are consuming CPU processing. There are an additional 59 showing active but consuming no CPU processing. So that is 62 being monitored by Windows and then there is Windows itself, which is also checking on various things.
I was hoping to find some way to just be able to run a program on the bare machine level. Side stepping all the Windows (Ubuntu) involvement.
The idea is very calculation intensive.
Thank you all for taking the time to respond.
Have a Great Day,
Jim