Cycle count measurement - profiling

I have a MS Visual Studio 2005 application solution. All the code is in C. I want to measure the number of cycles taken for execution by particular functions. Is there any Win32 API which I can use to get the cycle count?
I have used gettimeofday() to get time in micro seconds, but I want to know the cycles consumed.

Both Intel and AMD offer Windows libraries and tools for accessing the performance counters on their CPUs. These give access not only to cycle counts but also to cache-line hits and misses and TLB flush counts. Intel markets its tools under the name VTune, while AMD calls its tools CodeAnalyst.
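For a quick measurement without a vendor tool, one common approach is to read the CPU's time stamp counter around the call. The sketch below is a minimal illustration, not the method the answer above describes: it assumes an x86/x64 target and the `__rdtsc` compiler intrinsic (from `<intrin.h>` on MSVC, `<x86intrin.h>` on GCC/Clang). The names `cycles_for` and `sample_work` are hypothetical. On most modern CPUs the TSC ticks at a constant reference rate, so the result is closer to "reference ticks" than to actual core clock cycles, and the read is not serialized against out-of-order execution.

```c
#include <assert.h>
#ifdef _MSC_VER
#include <intrin.h>      /* __rdtsc on MSVC */
#else
#include <x86intrin.h>   /* __rdtsc on GCC/Clang */
#endif

/* Hypothetical workload, used only to have something to time. */
void sample_work(void)
{
    volatile int sink = 0;
    int i;
    for (i = 0; i < 1000; ++i)
        sink += i;
}

/* Returns the TSC ticks that elapsed while fn() ran (assumption: x86 TSC). */
unsigned long long cycles_for(void (*fn)(void))
{
    unsigned long long start;
    assert(fn != 0);
    start = __rdtsc();
    fn();
    return __rdtsc() - start;
}
```

Calling `cycles_for(sample_work)` several times and taking the minimum tends to filter out interrupts and context switches. On Windows Vista and later, `QueryThreadCycleTime` is another option that reports cycles charged to a specific thread.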

Related

Profiler shows no time for jumps but much time for compares

While testing a custom heap manager (written to replace the system one), I ran into some slowdowns compared to the system heap.
I used AMD CodeAnalyst to profile an x64 application on Windows 7 with an Intel Xeon CPU E5-1620 v2 @ 3.70 GHz, and got the following results:
This block consumes about 90% of the time for the whole application run. We can see a lot of time spent on "cmp [rsp+18h], rax" and "test eax, eax" but no time spent on jumps right below the compares. Is it ok that jumps take no time? Is it because of branch prediction mechanism?
I inverted the condition, and here is what I got (the absolute numbers differ a bit because I stopped the profiling sessions manually, but a lot of time is still taken by the compares):
There are so many calls to these compares that they become a bottleneck... That is how I interpret these results. And the best optimization is probably to rework the algorithm, right?
Intel and AMD CPUs both macro-fuse cmp/jcc pairs into a single compare-and-branch uop (Intel) or macro-op (AMD). Intel SnB-family CPUs like yours can do this with some instructions that also write an output register, like and, sub/add, inc/dec.
To really understand profiling data, you have to understand something about how the out-of-order pipeline works in the microarch you're tuning on. See the links at the x86 tag wiki, especially Agner Fog's microarch pdf.
You should also beware that profiling cycle counts can get charged to the instruction that's waiting for results, not the instruction that is slow to produce them.

Maximum Windows 8.1 CPU usage <= 30%

I'm writing a C++ application using Visual Studio 2013. The application iterates through an image doing some complicated analysis. To test code efficiency I am running the analysis (say) 100 times and seeing how long it takes. Then I modify the code, re-run the test and see if there is an improvement (or degradation) in performance.
Problem is that while I have a powerful 4-core i5 (i5-4200U @ 1.6 GHz to be specific) and plenty of RAM, the overall CPU utilisation never exceeds about 30%. My process never seems to get beyond about 29.5%. I've tried setting the priority class of my application to "High" (using SetProcessPriority) and this doesn't help. There is zero disk and network access, all in memory (and about 5GB of memory to spare).
Is this some secret Windows 8.1 setting (to preserve performance)? Can I change this programmatically or through some Control Panel applet?
Well how do you expect your application to use 100% cpu when it is (most likely) only running on one core because you aren't using threads?
30% is slightly above the usage for one core (25%) so it is almost certain you aren't using threads here.
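The 25% figure is just the share of total CPU capacity that one fully busy thread can occupy on a machine with four logical processors. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the processor count is passed in rather than queried (on Windows it could be read from `GetSystemInfo`).

```c
#include <assert.h>

/* Upper bound, in percent of total CPU, for a single busy thread. */
double single_thread_ceiling_percent(int logical_processors)
{
    assert(logical_processors > 0);
    return 100.0 / logical_processors;
}
```

With four logical processors this gives 25.0, so a process reading of about 29.5% is consistent with one saturated thread plus a little extra activity.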

Is QueryPerformanceFrequency accurate when using HPET?

I'm playing around with QueryPerformanceFrequency.
It used to return 3.6 MHz, but that was not enough for what I was trying to do.
I've enabled HPET using the command bcdedit /set useplatformclock true. Now it returns 14.3 MHz. It's great, it's more precise... except it's not. I quickly realized that I did not get the granularity I expected.
If I poll QueryPerformanceCounter until it ticks, the smallest increment I can get is 11, which means 1.27 MHz. If I count the number of different values I can get from QueryPerformanceCounter in one second, I get 1.26 MHz.
So I was wondering if there is a way to really use the 14.3 MHz to its full extent?
I'm using windows 7, 64 bit system, visual studio 2008.
Using the HPET hardware as a source for QueryPerformanceCounter (QPC) is known to be associated with large overheads.
QPC is an expensive call when configured with HPET.
It provides 14.3 MHz which suggests high accuracy but as you found, it can't be called fast enough to actually resolve that frequency.
Therefore Microsoft has turned to the CPU's time stamp counter (TSC) as the source for QPC whenever the hardware is capable of doing so. TSC queries have much lower overhead. The associated frequency used for QPC is typically the CPU frequency divided by 1024, which is also typically a few MHz.
The call of QPC in TSC mode is so fast that a lot of consecutive calls may show the same result (typically approx. 20-30 calls or 15 - 20 ns/call).
This way you may obtain typical resolutions of approx. 0.3 us (on a 3.4 GHz CPU).
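The 0.3 us figure follows directly from the divide-by-1024 rule mentioned above. A small sketch of the arithmetic, with a hypothetical function name and the divisor passed as a parameter since it is described as typical rather than guaranteed:

```c
#include <assert.h>

/* Length of one QPC tick in microseconds when QPC runs at cpu_hz / divisor. */
double qpc_tick_period_us(double cpu_hz, double divisor)
{
    assert(cpu_hz > 0.0 && divisor > 0.0);
    return 1.0e6 * divisor / cpu_hz;
}
```

For a 3.4 GHz CPU and a divisor of 1024, the QPC frequency is about 3.32 MHz and the tick period comes out at roughly 0.301 us.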
You observed 3.6 MHz before you switched to HPET. That's likely the signature of the system's ACPI PM timer (3579545 Hz), which indicates that you were not operating on TSC-based QPC before switching to HPET.
So either way, running on HPET or the ACPI PM timer results in a usable resolution in the range of a few MHz. Neither can expose the full resolution given by the performance counter frequency (PCF), because the call to QPC is too expensive. Only the TSC-based QPC is fast enough to actually oversample the QPC.
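The "poll until it ticks" experiment from the question can be sketched as follows, to compare the nominal frequency with the rate you can actually observe. This is an illustration under assumptions, not code from the question: the function names are hypothetical, the Windows branch uses `QueryPerformanceCounter` and `QueryPerformanceFrequency`, and a `clock_gettime` fallback is included only so the sketch also builds on other platforms.

```c
#ifndef _WIN32
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime in the fallback branch */
#endif
#include <assert.h>

#ifdef _WIN32
#include <windows.h>
long long read_ticks(void)
{
    LARGE_INTEGER t;
    QueryPerformanceCounter(&t);
    return t.QuadPart;
}
long long ticks_per_second(void)
{
    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);
    return f.QuadPart;
}
#else
#include <time.h>
long long read_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
long long ticks_per_second(void)
{
    return 1000000000LL;   /* fallback counter is in nanoseconds */
}
#endif

/* Smallest counter step seen when polling until the value changes. */
long long smallest_increment(int samples)
{
    long long best = 0;
    int i;
    assert(samples > 0);
    for (i = 0; i < samples; ++i) {
        long long start = read_ticks();
        long long now;
        do {
            now = read_ticks();
        } while (now == start);
        if (best == 0 || now - start < best)
            best = now - start;
    }
    return best;
}

/* Rate at which distinct counter values can actually be observed. */
double effective_rate_hz(long long frequency, long long increment)
{
    assert(increment > 0);
    return (double)frequency / (double)increment;
}
```

With a nominal frequency of 14.3 MHz and a smallest increment of 11 ticks, `effective_rate_hz(14300000, 11)` gives about 1.3 MHz, in line with the figures reported in the question.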
Microsoft has just recently released more detailed information about this matter:
See Acquiring high-resolution time stamps (MSDN 2014) for the details.
This is a comprehensive article with lots of examples and detailed description. A must read for users of QPC.
...a way to really use the 14.3 Mhz to their full extent ?
Unfortunately not.
You can run the Coreinfo.exe utility from Windows Sysinternals. Sysinternals has moved to Microsoft TechNet. Here is the link: Sysinternals System Information Utilities. This will give you an answer to the question: How can I check if my system has a non-invariant TSC?
Summary: The best resolution/accuracy/granularity is obtained by QPC based on TSC.
BTW: The proper choice of hardware as resource for QPC also influences the call expense of the new GetSystemTimePreciseAsFileTime function (Windows 8 desktop and upwards) because it internally uses QPC.

What is the fastest instrumentation profiler out there

What is the fastest profiler available for dynamic profiling (like what gprof does). The profiler has to be an instrumentation profiler, or even if it has sampling profiling with it, I'm interested to know the overhead of instrumentation profiling, because sampling profiling can be done with almost 0% overhead anyway.
Any profiler that uses hardware-based sampling (via the CPU's PMSRs) will have the smallest overhead, because it reads profiling data that the CPU is already tracking at the hardware level. For more info, see AMD's and Intel's architecture manuals; this should be explained in depth in one of the appendices.
The only profilers I know of using these are VTune for Intel (not free) and CodeAnalyst for AMD (free).
Next in line would be timer-based and event-based profilers. Of these, the ones with the least overhead would probably be those compiled directly into your code (CodeAnalyst has an API for event-based profiling, and so does VTune). gprof also falls into this category (Clang also has something, but I don't know if it's still maintained...). If you have VS Pro or Ultimate, its PG compile mode will do similar things, though I have never found it to compare with a dedicated profiler suite.
Last would be the ones that need to insert probes into your code to gather their profiling data. All the aforementioned ones can do this, as can other freeware profilers like VerySleepy.
Intel's VTune Amplifier is probably the most complete.
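To make the idea of inserted probes concrete, here is a hand-rolled sketch of what an instrumentation probe does around a function body. It is not how any of the profilers named above are implemented. The `probe_t` type and all function names are hypothetical, and the timing again assumes an x86 target with the `__rdtsc` intrinsic.

```c
#include <assert.h>
#ifdef _MSC_VER
#include <intrin.h>      /* __rdtsc on MSVC */
#else
#include <x86intrin.h>   /* __rdtsc on GCC/Clang */
#endif

/* Per-function counters that the probes accumulate into. */
typedef struct {
    const char *name;
    unsigned long long calls;
    unsigned long long total_ticks;
} probe_t;

/* Entry probe: record the time stamp at function entry. */
unsigned long long probe_enter(void)
{
    return __rdtsc();
}

/* Exit probe: add the elapsed ticks and count the call. */
void probe_leave(probe_t *p, unsigned long long t0)
{
    assert(p != 0);
    p->total_ticks += __rdtsc() - t0;
    p->calls += 1;
}

probe_t work_probe = { "do_work", 0, 0 };

/* Hypothetical instrumented function: sums 0..n-1. */
int do_work(int n)
{
    unsigned long long t0 = probe_enter();
    int sum = 0;
    int i;
    for (i = 0; i < n; ++i)
        sum += i;
    probe_leave(&work_probe, t0);
    return sum;
}
```

Every call pays for two counter reads and two memory updates, which is why instrumentation overhead grows with call frequency while sampling overhead does not.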

How to profile a C++ function at assembly level?

I have a function that is the bottleneck of my program. It requires no access to memory, only calculation. It is the inner loop and is called many times, so any small gains in this function are big wins for my program.
I come from a background in optimizing SPU code on the PS3, where you take an SPU program and run it through a pipeline analyzer that puts each assembly statement in its own column, and you minimize the number of cycles the function takes. Then you overlay loops so you can minimize pipeline dependencies even more. With that program and a list of the cycles each assembly instruction takes, I could optimize much better than the compiler ever could.
On a different platform it had events I could register (cache misses, cycles, etc.) and I could run the function and track CPU events. That was pretty nice as well.
Now I'm doing a hobby project on Windows using Visual Studio C++ 2010 w/ a Core i7 Intel processor. I don't have the money to justify paying the large cost of VTune.
My question:
How do I profile a function at the assembly level for an Intel processor on Windows?
I want to compile, view disassembly, get performance metrics, adjust my code and repeat.
There are some great free tools available, mainly AMD's CodeAnalyst (from my experience on my i7 versus my Phenom II, it's a bit handicapped on the Intel processor because it doesn't have access to the hardware-specific counters, though that might have been bad configuration on my part).
However, a lesser-known tool is the Intel Architecture Code Analyzer (IACA, free like CodeAnalyst), which is similar to the SPU tool you described: it details latency, throughput and port pressure (basically the request dispatches to the ALUs, MMU and the like) line by line for your program's assembly. Stan Melax gave a nice talk on it and on x86 optimization at this year's GDC, under the title "hotspots, flops and uops: to-the-metal cpu optimization".
Intel also has a few more tools in the same vein as IACA, available under the performance tuning section of their experimental/what-if code site, such as PTU, which is (or was) an experimental evolution of VTune. From what I can see, it's free.
It's also a good idea to read the Intel optimization manual before diving into this.
EDIT: as Ben pointed out, the timings might not be correct for older processors, but that can be easily made up for using Agner Fog's Optimization manuals, which also contain many other gems.
You might want to try some of the utilities included in valgrind like callgrind or cachegrind.
Callgrind can do profiling and dump assembly.
And kcachegrind is a nice GUI, and will show the dumps including assembly and number of hits per instruction etc.
From your description it sounds like your problem may be embarrassingly parallel. Have you considered using PPL's parallel_for?