Simple yet complicated question:
Which counter should I use to get perf to measure wall clock time?
As a baseline, the first thing I think I need to measure when profiling code is plain wall clock time, to get a first idea of where the code spends most of its time.
I don't care whether it's IO-bound, bandwidth-limited or something else; I just want to know where it is slow.
It sounds like a simple requirement, but with all the tricks modern CPUs use to work efficiently (frequency scaling and so on) and the huge number of different (and not so well documented) performance counters available in perf, it's not easy to be sure I'm measuring the right thing.
Currently I do:
perf record -g -e ref-cycles -F 999 -- <cmd>
I think ref-cycles counts unscaled reference cycles, unaffected by frequency scaling, and is thus proportional to the amount of wall clock time that part of the code is running. But who really knows?
You can use task-clock.
It explicitly measures wall clock time while the process is running, and as a bonus it is portable because it doesn't rely on any PMU event.
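For example, the command from the question should work with just the event swapped (a sketch, keeping the -g and -F 999 options as-is):
perf record -g -e task-clock -F 999 -- <cmd>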
I have a multi-threaded program written in C++ and I want to calculate the CPU usage of each thread on a sub-second basis (maybe every 100 ms).
As you may know, /proc/stat and the like don't have enough accuracy for that time resolution.
I want to know: is there a way to calculate the clock cycles consumed by each thread, in assembly or C/C++?
Starting from Linux 2.6.12 and glibc 2.4, you can use clock_gettime with a clock of type CLOCK_THREAD_CPUTIME_ID to get reliable timing information.
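A minimal sketch (assuming POSIX headers; sample this per thread, e.g. every 100 ms, and diff consecutive readings to get that thread's CPU usage over the interval):

#include <time.h>

/* CPU time consumed so far by the calling thread, in seconds. */
static double thread_cpu_seconds(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0)
        return -1.0;                       /* clock not supported */
    return ts.tv_sec + ts.tv_nsec * 1e-9;  /* nanosecond resolution */
}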
I'm programming on a Keil board and am trying to count the number of clock periods taken to execute a code block inside a C function.
Is there a way to get the time with microsecond precision before and after the code block, so that I can take the difference and multiply it by the number of clock periods per microsecond to compute the clock periods consumed by the block?
The clock() function in time.h gives time in seconds, which will give a difference of 0, since it's only a small code block I'm trying to get the clock periods for.
If this is not a good way to solve the problem, are there alternatives?
Read up on the timers in the chip, find one that the operating system/environment you are using has not already claimed, and use it directly. This takes some practice: you need volatile accesses so the compiler doesn't rearrange your code or skip re-reading the timer, and you need to adjust the prescaler on the timer so that it gives the most practical resolution without rolling over. So start with a big prescale divisor, convince yourself it is not rolling over, then shorten that divisor until you reach divide-by-one or reach the desired accuracy. If divide-by-one doesn't give you enough, call the function many times in a loop and time around that loop.
Remember that any time you change your code to add these measurements you can and will change the performance of your code, sometimes by too little to notice, sometimes by 10%-20% or more. If you are using a cache, then any line of code you add or remove can change performance by double-digit percentages, and at that point you have to understand a lot more about timing your code.
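A minimal sketch of the direct-timer approach (the register address is a placeholder and code_block_under_test stands for the code being measured; substitute the counter register and prescaler setup from your chip's reference manual):

#define TIMER_COUNT_REG  0x40000024u        /* hypothetical counter register address */

static inline unsigned int timer_now(void)
{
    /* volatile forces a real read every time and prevents reordering/caching */
    return *(volatile unsigned int *)TIMER_COUNT_REG;
}

/* ... configure the prescaler first, then: */
unsigned int t0 = timer_now();
code_block_under_test();
unsigned int t1 = timer_now();
unsigned int elapsed_ticks = t1 - t0;       /* scale by the prescale divisor for raw clocks */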
The best way to count the number of clock cycles in the embedded world is to use an oscilloscope. Toggle a GPIO pin before and after your code block and measure the time with the oscilloscope. The measured time multiplied by the CPU frequency is the number of CPU clock cycles spent.
You have omitted to say what processor is on the board (far more important than the brand of board!). If the processor includes ETM and you have a ULINK-Pro or other trace-capable debugger, then uVision can unintrusively profile the executing code directly at the instruction-cycle level.
Similarly, if you run the code in the uVision simulator rather than on real hardware, you can get cycle-accurate profiling and timing without the need for hardware trace support.
Even without the trace capability, uVision's "stopwatch" feature can perform timing between two breakpoints directly. The stopwatch is at the bottom of the IDE in the status bar. You do need to set the clock frequency in the debugger trace configuration to get "real time" from the stopwatch.
A simple approach that requires no special debug or simulator capability is to use an available timer peripheral (or, in the case of Cortex-M devices, the SysTick timer) to timestamp the start and end of execution of a code section; or, if you have no spare timing resource, you could toggle a GPIO pin and monitor it on an oscilloscope. These methods have some software overhead that is not present in hardware or simulator trace, which may make them unsuitable for very short code sections.
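A minimal sketch of the timer-peripheral route, assuming a Cortex-M part with CMSIS headers (short_code_section stands for the code being measured):

SysTick->LOAD = 0x00FFFFFF;                 /* maximum 24-bit reload value     */
SysTick->VAL  = 0;                          /* clear the current count         */
SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk  /* clock SysTick from the core     */
              | SysTick_CTRL_ENABLE_Msk;    /* start the free-running counter  */

uint32_t start  = SysTick->VAL;
short_code_section();                       /* must finish well within 2^24 ticks */
uint32_t cycles = start - SysTick->VAL;     /* SysTick counts down, so subtract this way */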
Programs like CPU-Z are very good at giving in-depth information about the system (bus speed, memory timings, etc.).
However, is there a programmatic way of calculating the per-core (and, on multi-processor systems with multiple cores per CPU, per-processor) frequency without having to deal with CPU-specific information?
I am trying to develop an anti-cheating tool (for use with clock-limited benchmark competitions) which will be able to record the CPU clock during the benchmark run for all the active cores in the system (across all processors).
I'll expand on my comments here. This is too big and in-depth for me to fit in the comments.
What you're trying to do is very difficult - to the point of being impractical for the following reasons:
There's no portable way to get the processor frequency. rdtsc does NOT always give the correct frequency due to effects such as SpeedStep and Turbo Boost.
All known methods to measure frequency require an accurate measurement of time. However, a determined cheater can tamper with all the clocks and timers in the system.
Reading either the processor frequency or the time in a tamper-proof way requires kernel-level access. This implies driver signing for Windows.
There's no portable way to get the processor frequency:
The "easy" way to get the CPU frequency is to call rdtsc twice with a fixed time duration in between. Dividing the difference in tick counts by that duration then gives you the frequency.
The problem is that rdtsc does not give the true frequency of the processor. Because real-time applications such as games rely on it, rdtsc needs to be consistent through CPU throttling and Turbo Boost. So once your system boots, rdtsc will always run at the same rate (unless you start messing with the bus speeds with SetFSB or something).
For example, on my Core i7 2600K, rdtsc will always show the frequency at 3.4 GHz. But in reality, it idles at 1.6 GHz and clocks up to 4.6 GHz under load via the overclocked Turbo Boost multiplier at 46x.
But once you find a way to measure the true frequency (or you're happy enough with rdtsc), you can easily get the frequency of each core using thread affinities.
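A minimal sketch of that rdtsc-over-a-fixed-interval idea on Windows, pinned to one core via a thread affinity (TscGhzOnCore is a made-up helper name; Sleep is only approximately 250 ms, so a careful version would measure the interval with QueryPerformanceCounter, and as explained above this reports the invariant TSC rate rather than the throttled or boosted clock):

#include <windows.h>
#include <intrin.h>

double TscGhzOnCore(DWORD core)
{
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << core);
    unsigned __int64 t0 = __rdtsc();
    Sleep(250);                                   /* fixed, roughly known duration */
    unsigned __int64 t1 = __rdtsc();
    return (double)(t1 - t0) / (0.250 * 1e9);     /* ticks per nanosecond -> GHz   */
}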
Getting the True Frequency:
To get the true frequency of the processor, you need to access either the MSRs (model-specific registers) or the hardware performance counters.
These are kernel-level instructions and therefore require a driver. If you're attempting this on Windows with the intention of distributing it, you will need to go through the proper driver-signing process. Furthermore, the code will differ by processor make and model, so you will need separate detection code for each processor generation.
Once you get to this stage, there are a variety of ways to read the frequency.
On Intel processors, the hardware counters let you count raw CPU cycles. Combined with a method of precisely measuring real time (next section), you can compute the true frequency. The MSRs give you access to other information such as the CPU frequency multiplier.
All known methods to measure frequency require an accurate measurement of time:
This is perhaps the bigger problem. You need a timer to be able to measure the frequency. A capable hacker will be able to tamper with all the clocks that you can use in C/C++.
This includes all of the following:
clock()
gettimeofday()
QueryPerformanceCounter()
etc...
The list goes on and on. In other words, you cannot trust any of the timers as a capable hacker will be able to spoof all of them. For example clock() and gettimeofday() can be fooled by changing the system clock directly within the OS. Fooling QueryPerformanceCounter() is harder.
Getting a True Measurement of Time:
All the clocks listed above are vulnerable because they are usually derived, in one way or another, from the same system base clock, and that base clock can be changed after the system has booted by means of overclocking utilities.
So the only way to get a reliable and tamper-proof measurement of time is to read external clocks such as the HPET or the ACPI timer. Unfortunately, these also seem to require kernel-level access.
To Summarize:
Building any sort of tamper-proof benchmark will almost certainly require writing a kernel-mode driver which requires certificate signing for Windows. This is often too much of a burden for casual benchmark writers.
This has resulted in a shortage of tamper-proof benchmarks which has probably contributed to the overall decline of the competitive overclocking community in recent years.
I realise this has already been answered. I also realise this is basically a black art, so please take it or leave it - or offer feedback.
In a quest to find the clock rate on throttled (thanks Microsoft, HP, and Dell) Hyper-V hosts (unreliable perf counter) and Hyper-V guests (which only report the stock CPU speed, not the current one), I have managed, through trial, error, and fluke, to create a loop that loops exactly once per clock.
Code as follows: C# 5.0, SharpDevelop, 32-bit, targeting .NET 3.5, Optimize on (crucial), no debugger attached (crucial).
long frequency, start, stop;
double multiplier = 1000 * 1000 * 1000;//nano
if (Win32.QueryPerformanceFrequency(out frequency) == false)
throw new Win32Exception();
Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(1);
const int gigahertz= 1000*1000*1000;
const int known_instructions_per_loop = 1;
int iterations = int.MaxValue;
int g = 0;
Win32.QueryPerformanceCounter(out start);
for (int i = 0; i < iterations; i++)
{
g++;
g++;
g++;
g++;
}
Win32.QueryPerformanceCounter(out stop);
// normal ticks differ from the WMI data, e.g. 3125 here when WMI reports 3201 and CPU-Z 3199
var normal_ticks_per_second = frequency * 1000;
var ticks = (double)(stop - start);
var time = (ticks * multiplier) /frequency;
var loops_per_sec = iterations / (time/multiplier);
var instructions_per_loop = normal_ticks_per_second / loops_per_sec;
var ratio = (instructions_per_loop / known_instructions_per_loop);
var actual_freq = normal_ticks_per_second / ratio;
Console.WriteLine( String.Format("Perf counter freq: {0:n}", normal_ticks_per_second));
Console.WriteLine( String.Format("Loops per sec: {0:n}", loops_per_sec));
Console.WriteLine( String.Format("Perf counter freq div loops per sec: {0:n}", instructions_per_loop));
Console.WriteLine( String.Format("Presumed freq: {0:n}", actual_freq));
Console.WriteLine( String.Format("ratio: {0:n}", ratio));
Notes
25 instructions per loop if the debugger is active
Consider running a 2-3 second loop beforehand to spin up the processor (or at least attempt to, knowing how heavily servers are throttled these days)
Tested on a 64-bit Core 2 and a Haswell Pentium, and compared against CPU-Z
One of the simplest ways to do it is using RDTSC, but seeing as this is for an anti-cheating mechanism, I'd put it in a kernel driver or a hypervisor-resident piece of code.
You'd probably also need to roll your own timing code**, which again can be done with RDTSC (QPC, as used in the example below, uses RDTSC and is in fact very simple to reverse engineer and keep a local copy of, which means that to tamper with it you'd need to tamper with your driver).
void GetProcessorSpeed()
{
    CPUInfo* pInfo = this;
    LARGE_INTEGER qwWait, qwStart, qwCurrent;

    QueryPerformanceCounter(&qwStart);
    QueryPerformanceFrequency(&qwWait);
    qwWait.QuadPart >>= 5;                  // busy-wait for 1/32 of a second

    unsigned __int64 Start = __rdtsc();     // TSC at the start of the interval
    do
    {
        QueryPerformanceCounter(&qwCurrent);
    } while (qwCurrent.QuadPart - qwStart.QuadPart < qwWait.QuadPart);

    // TSC ticks during 1/32 s, scaled back up to ticks per second, in MHz
    pInfo->dCPUSpeedMHz = ((__rdtsc() - Start) << 5) / 1000000.0;
}
** I believe this would be for security, as @Mystical mentioned, but as I've never felt the urge to subvert low-level system timing mechanisms, there might be more involved. It would be nice if Mystical could add something on that :)
I've previously posted on this subject (along with a basic algorithm) here. To my knowledge the algorithm (see the discussion) is very accurate. For example, Windows 7 reports my CPU clock as 2.00 GHz, CPU-Z as 1994-1996 MHz, and my algorithm as 1995025-1995075 kHz.
The algorithm performs a lot of loops, which causes the CPU frequency to increase to its maximum (as it also will during benchmarks), so speed-throttling software won't come into play.
Additional info here and here.
On the question of speed throttling, I really don't see it as a problem unless an application uses the speed values to determine elapsed times and those times themselves are extremely important. For example, if a division requires x clock cycles to complete, it doesn't matter whether the CPU is running at 3 GHz or 300 MHz: it will still need x clock cycles, and the only difference is that at 3 GHz it will complete the division in a tenth of the time.
You need to use CallNtPowerInformation. There's a code sample in the putil project.
With this you can get the current and max CPU frequency. As far as I know, it's not possible to get per-CPU frequency.
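A minimal sketch of what such a call might look like (this is not the putil sample; the struct follows the layout documented for the ProcessorInformation level, and you need to link against PowrProf.lib):

#include <windows.h>
#include <powrprof.h>   // CallNtPowerInformation
#include <vector>
#include <cstdio>

// One entry per logical processor, per the documented ProcessorInformation output.
typedef struct _PROCESSOR_POWER_INFORMATION {
    ULONG Number;
    ULONG MaxMhz;
    ULONG CurrentMhz;
    ULONG MhzLimit;
    ULONG MaxIdleState;
    ULONG CurrentIdleState;
} PROCESSOR_POWER_INFORMATION;

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    std::vector<PROCESSOR_POWER_INFORMATION> info(si.dwNumberOfProcessors);

    // Returns STATUS_SUCCESS (0) on success.
    if (CallNtPowerInformation(ProcessorInformation, NULL, 0,
                               info.data(),
                               (ULONG)(info.size() * sizeof(info[0]))) == 0)
    {
        for (const auto& p : info)
            std::printf("CPU %lu: current %lu MHz, max %lu MHz\n",
                        p.Number, p.CurrentMhz, p.MaxMhz);
    }
    return 0;
}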
One should refer to this white paper: Intel® Turbo Boost Technology in Intel® Core™ Microarchitecture (Nehalem) Based Processors. Basically, take several readings of the UCC fixed performance counter over a sample period T.
Relative.Freq = Delta(UCC) / T
where Delta(UCC) = UCC at period T minus UCC at period T-1.
Starting with the Nehalem architecture, UCC counts clock ticks while the core is in the unhalted state, so its rate rises and falls with the actual core frequency.
When SpeedStep or Turbo Boost is active, the frequency estimated from UCC tracks it accordingly, while the TSC rate remains constant. For instance, with Turbo Boost in action, Delta(UCC) is greater than or equal to Delta(TSC).
For an example, see the Core_Cycle function in the CoreFreq project on GitHub (Cyring/CoreFreq).
Hi, I am using QueryPerformanceFrequency to get the number of cycles, i.e. the processor speed.
But it is showing me the wrong value. The specification says the processor is about 400 MHz, but what we get through code is something like 16 MHz.
Please provide any pointers.
The code for the WinCE device is:
LARGE_INTEGER FrequencyCounter;
QueryPerformanceFrequency(&FrequencyCounter);
CString temp;
temp.Format(L"%lld", FrequencyCounter.QuadPart);
AfxMessageBox(temp);
Thanks,
Mukesh
QueryPerformanceFrequency returns the frequency of the counter peripheral, not of the processor. These peripherals typically run at the original crystal clock frequency. 16 MHz should be good enough resolution for you to measure fine-grained intervals.
QPF doesn't return the CPU clock speed. It returns the frequency of a high-performance timer. On a few systems, it might actually measure CPU cycles. On other systems, it might use a separate timer running at the same frequency (but one which is unaffected by things like SpeedStep, which can change the clock speed of the CPU). Often it uses a separate timer entirely, one which may not even be on the CPU itself but may be part of the motherboard.
QueryPerformanceCounter/QueryPerformanceFrequency only promise that they use the best timer available on the system. They make no promises about what that timer might be.
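In other words, the pair is meant for interval timing rather than for reading the CPU clock. A minimal sketch of the intended use (DoWork stands for whatever code you want to time):

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);       // counter ticks per second, e.g. your 16 MHz
QueryPerformanceCounter(&t0);
DoWork();                               // the code being timed
QueryPerformanceCounter(&t1);
double seconds = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;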