I'm trying to measure a function's performance by measuring the time for each iteration.
During the process, I found that even if I do nothing, the results still vary quite a bit.
e.g.
volatile long count = 0;
for (int i = 0; i < N; ++i) {
    measure.begin();
    ++count;
    measure.end();
}
In measure.end(), I compute the time difference and keep an unordered_map to track how often each time value occurs.
I've used clock_gettime as well as rdtsc, but there is always about 1% of the data points lying far away from the mean, by a factor of about 1000.
Here's what the above loop generates:
T: count percentile
18 117563 11.7563%
19 111821 22.9384%
21 201605 43.0989%
22 541095 97.2084%
23 2136 97.422%
24 2783 97.7003%
...
406 1 99.9994%
3678 1 99.9995%
6662 1 99.9996%
17945 1 99.9997%
18148 1 99.9998%
18181 1 99.9999%
22800 1 100%
mean:21
So whether the unit is ticks or ns, the worst case of 22800 is about 1000 times bigger than the mean.
I set isolcpus in GRUB and ran this with taskset. The simple loop does almost nothing, and the hash table used for the time-count statistics is outside of the timed region.
What am I missing?
I'm running this on a laptop with Ubuntu installed; the CPU is an Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz.
Thank you for all the answers.
The main interrupt that I couldn't stop is the local timer interrupt, and it seems the new 3.10 kernel supports full tickless operation. I'll try that.
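For reference, a minimal sketch of the kind of per-iteration measurement described above, assuming measure.begin()/measure.end() wrap clock_gettime(CLOCK_MONOTONIC) and an unordered_map keeps the time-count histogram (the names and N value here are illustrative, not the original code):

#include <time.h>
#include <unordered_map>
#include <cstdint>
#include <cstdio>

int main() {
    std::unordered_map<int64_t, int64_t> histogram;  // ns -> occurrence count
    volatile long count = 0;
    const int N = 1000000;
    for (int i = 0; i < N; ++i) {
        timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);   // measure.begin()
        ++count;
        clock_gettime(CLOCK_MONOTONIC, &t1);   // measure.end()
        int64_t ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                   + (t1.tv_nsec - t0.tv_nsec);
        ++histogram[ns];                       // bookkeeping stays outside the timed region
    }
    for (const auto& kv : histogram)
        std::printf("%lld ns: %lld\n", (long long)kv.first, (long long)kv.second);
    return 0;
}

Even in a loop this tight, the occasional large outliers come from whatever still interrupts the measured core, consistent with the local timer interrupt mentioned in the update above.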
Related
On a few Windows computers I have seen that two consecutive calls to ::GetTickCount() return a difference of 1610619236 ms (around 18 days). This is not due to wrap around or an int/unsigned int mismatch. I use Visual C++ 2015/2017.
Has anybody else seen this behaviour? Does anybody have any idea about what could cause behaviour like this?
Best regards
John
Code sample that shows the bug:
class CLTemp
{
    DWORD nLastCheck;
public:
    CLTemp()
    {
        nLastCheck = ::GetTickCount();
    }

    // Service is called every 200 ms by a timer
    void Service()
    {
        if (::GetTickCount() - nLastCheck > 20000) // check every 20 sec
        {
            // On some Windows machines, after an uptime of 776 days,
            // ::GetTickCount() - nLastCheck gives a value of 1610619236
            // (corresponding to around 18 days)
            nLastCheck = ::GetTickCount();
        }
    }
};
Update - problem description, a way of recreating it, and a solution:
The Windows API function GetTickCount() unexpectedly jumps 18 days forward in time when passing 776 days after a Windows Restart.
We have experienced several times that some of our long-running Windows PC applications coded in Microsoft Visual C++ suddenly reported a time-out error. In many of our applications we call GetTickCount() to perform tasks at certain intervals or to watch for a time-out condition. The example code could go like this:
DWORD dwTimeNow, dwPrevTime = ::GetTickCount();
bool bExit = false;
while (!bExit)
{
    dwTimeNow = ::GetTickCount();
    if (dwTimeNow - dwPrevTime >= 5000)
    {
        dwPrevTime = dwTimeNow;
        // Perform my task
    }
    else
    {
        ::Sleep(10);
    }
}
GetTickCount() returns a DWORD, which is an unsigned 32-bit int. GetTickCount() wraps around from its maximum value of 0xFFFFFFFF to zero after approximately 49 days. The wrap around is easily handled by using unsigned arithmetic and always subtracting the previous value from the new value to calculate the distance; never compare two values from GetTickCount() directly against each other.
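As a minimal sketch of that pattern (our own illustration, not code from the affected applications), unsigned subtraction gives the correct distance even across the normal 32-bit wrap:

#include <windows.h>

// Elapsed milliseconds since 'previous'; correct across the normal wrap from
// 0xFFFFFFFF to 0 because DWORD arithmetic is modulo 2^32.
DWORD ElapsedMs(DWORD previous, DWORD now)
{
    return now - previous;   // e.g. 0x00000005 - 0xFFFFFFFB == 10
}

// Usage: compare the distance against the interval, never the raw values.
// if (ElapsedMs(dwPrevTime, ::GetTickCount()) >= 5000) { ... }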
So, the wrap around at the maximum value every 49 days is expected and handled. But we have experienced an unexpected wrap around to zero of GetTickCount() 776 days after the latest Windows Restart. In this case GetTickCount() wraps from 0x9FFFFFFF to zero, which is 1610612736 milliseconds too early, corresponding to around 18.6 days. When GetTickCount() is used to check for a time-out condition and it suddenly reports that 18 days have elapsed since the last check, the software reports a false time-out condition. Note that it is 776 days after a Windows Restart. A Windows Restart resets the GetTickCount() value to zero. A PC reboot does not; instead, the time elapsed while switched off is added to the initial GetTickCount() value.
We have made a test program that provides evidence of this issue. The test program reads the values of GetTickCount(), GetTickCount64(), InterruptTime(), and UnbiasedInterruptTime() every 5000 milliseconds, scheduled by a Windows timer. Each time, the sample program calculates the distance in time for each of the four time-functions. If the distance in time is 10000 milliseconds or more, it is marked as a time jump event and logged. It also keeps track of the minimum and maximum distance in time for each time-function.
Before starting the test program, a Windows Restart is carried out. Ensure no automatic time synchronization is enabled. Then shut Windows down. Start the PC again and enter its BIOS setup when it boots. In the BIOS, advance the real-time clock by 776 days. Let the PC boot up and start the test program. After about 17 hours the unexpected wraparound of GetTickCount() occurs (at 776 days, 17 hours, and 21 minutes). Only GetTickCount() shows this behavior; the other time-functions do not.
The following excerpt from the logfile of the test program shows the start values reported by the four time-functions. In this example the time has only been advanced to 775 days after Windows Restart. The format of the log entry is the time-function value converted into: days hh:mm:ss.msec. TickCount32 is the plain GetTickCount(). Because it is a 32-bit value it has wrapped around and shows a different value. At GetTickCount64() we can see the 775 days.
2024-05-14 09:13:27.262 Start times
TickCount32 : 029 08:30:11.591
TickCount64 : 775 00:12:01.031
InterruptTime : 775 00:12:01.036
UnbiasedInterruptTime: 000 00:05:48.411
The next excerpt from the logfile shows the unexpected wrap around of GetTickCount() (TickCount32). The format is: the distance between the previous value and the new value (which should always be around 5000 msec), followed by the new value converted into days and time, and finally the previous value converted into days and time. We can see that GetTickCount() jumps 1610617752 milliseconds (approx. 18.6 days) while the other three time-functions only advance approx. 5000 msec as expected. From TickCount64 one can see that it occurs at 776 days, 17 hours, and 21 minutes.
2024-05-16 02:22:30.394 Time jump *****
TickCount32 : 1610617752 - 000 00:00:00.156 - 031 01:39:09.700
TickCount64 : 5016 - 776 17:21:04.156 - 776 17:20:59.140
InterruptTime : 5015 - 776 17:21:04.165 - 776 17:20:59.150
UnbiasedInterruptTime: 5015 - 001 17:14:51.540 - 001 17:14:46.525
If you increase the time that the real-time clock is advanced to two times 776 days and 17 hours (for example 1551 days), the phenomenon shows up once more. It has a cyclic nature.
2026-06-30 06:34:26.663 Start times
TickCount32 : 029 12:41:57.888
TickCount64 : 1551 21:44:51.328
InterruptTime : 1551 21:44:51.334
UnbiasedInterruptTime: 004 21:24:24.593
2026-07-01 19:31:47.641 Time jump *****
TickCount32 : 1610617736 - 000 00:00:04.296 - 031 01:39:13.856
TickCount64 : 5000 - 1553 10:42:12.296 - 1553 10:42:07.296
InterruptTime : 5007 - 1553 10:42:12.310 - 1553 10:42:07.303
UnbiasedInterruptTime: 5007 - 006 10:21:45.569 - 006 10:21:40.562
The only viable solution to this issue seems to be to use GetTickCount64() and abandon GetTickCount() entirely.
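For illustration, the timeout check above rewritten on top of GetTickCount64() might look like this (a sketch with our own names, not the original program):

#include <windows.h>

ULONGLONG ullPrevTime = ::GetTickCount64();   // 64-bit ticks: no practical wraparound

void ServiceTask()
{
    const ULONGLONG ullNow = ::GetTickCount64();
    if (ullNow - ullPrevTime >= 5000)   // 5-second interval
    {
        ullPrevTime = ullNow;
        // Perform my task
    }
}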
I have set up simple tests with the Parallel Boost Graph Library (PBGL), which I have never used before, and observed entirely unexpected behaviour that I would like to have explained.
My steps were as follows:
Dump test data in METIS format (a kind of social graph with 50 million vertices and 100 million edges);
Build a modified PBGL example from graph_parallel\example\dijkstra_shortest_paths.cpp
The example was slightly extended to run the Eager Dijkstra, Crauser, and delta-stepping algorithms.
Note: building the example required an obscure workaround around the MUTABLE_QUEUE define in crauser_et_al_shortest_paths.hpp (the example code is in fact incompatible with the new mutable_queue)
int lookahead = 1;
delta_stepping_shortest_paths(g, start, dummy_property_map(), get(vertex_distance, g), get(edge_weight, g), lookahead);
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)).lookahead(lookahead));
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)));
Run
mpiexec -n 1 mytest.exe mydata.me
mpiexec -n 2 mytest.exe mydata.me
mpiexec -n 4 mytest.exe mydata.me
mpiexec -n 8 mytest.exe mydata.me
The observed behaviour:
-n 1:
mem usage: 35 GB in 1 running process, which utilizes exactly 1 hardware thread (processor load 12.5%)
delta stepping time: about 1 min 20 s
eager time: about 2 min
crauser time: about 3 min 20 s.
-n 2:
crash at the data-loading stage.
-n 4:
mem usage: 40+ GB in roughly equal parts across 4 running processes, each of which utilizes exactly 1 hardware thread
calculation times are unchanged within the margin of observational error.
-n 8:
mem usage: 44+ GB in roughly equal parts across 8 running processes, each of which utilizes exactly 1 hardware thread
calculation times are unchanged within the margin of observational error.
So, apart from the inappropriate memory usage and very low overall performance, the only changes I observe as more MPI processes run are a slight increase in total memory consumption and a linear rise in processor load.
The fact that the initial graph is somehow partitioned between processes (probably by vertex number ranges) is nevertheless evident.
What is wrong with this test (and, probably, with my idea of MPI usage as a whole)?
My environment:
- one Win 10 PC with 64 GB RAM and 8 cores;
- MS MPI 10.0.12498.5;
- MSVC 2017, toolset 141;
- Boost 1.71
N.B. See original example code here.
I have some OpenACC-accelerated C++ code that I've compiled using the PGI compiler. Things seem to be working, so now I want to play efficiency whack-a-mole with profiling information.
I generate some timing info by setting:
export PGI_ACC_TIME=1
And then running the program.
The following output results:
-bash-4.2$ ./a.out
libcupti.so not found
Accelerator Kernel Timing data
PGI_ACC_SYNCHRONOUS was set, disabling async() clauses
/home/myuser/myprogram.cpp
_MyProgram NVIDIA devicenum=1
time(us): 97,667
75: data region reached 2 times
75: data copyin transfers: 3
device time(us): total=101 max=82 min=9 avg=33
76: compute region reached 1000 times
76: kernel launched 1000 times
grid: [1938] block: [128]
elapsed time(us): total=680,216 max=1,043 min=654 avg=680
95: compute region reached 1000 times
95: kernel launched 1000 times
grid: [1938] block: [128]
elapsed time(us): total=487,365 max=801 min=476 avg=487
110: data region reached 2000 times
110: data copyin transfers: 1000
device time(us): total=6,783 max=140 min=3 avg=6
125: data copyout transfers: 1000
device time(us): total=7,445 max=190 min=6 avg=7
real 0m3.864s
user 0m3.499s
sys 0m0.348s
It raises some questions:
1. I see time(us): 97,667 at the top. This seems like a total time, but, at the bottom, I see real 0m3.864s. Why is there such a difference?
2. If time(us): 97,667 is the total, why is it so much smaller than values lower down, such as elapsed time(us): total=680,216?
3. The kernel including this line (elapsed time(us): total=680,216 max=1,043 min=654 avg=680) was run 1000 times. Are max, min, and avg values based on per-run values of the kernel?
4. Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?
5. For data regions (device time(us): total=6,783), is the measurement the transfer time or the entire time spent dealing with the data (preparing to transfer, post-receipt operations)?
6. The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 is a close-brace, and Line 110 is a variable definition. Should line numbers be interpreted as "the loop most closely following the indicated line number", or in some other way?
7. The kernel at 76 contains the kernel at 95. Are the times calculated for 76 inclusive of time spent in 95? If so, is there a convenient way to find the time spent in a kernel minus the times of all the subkernels?
(Some of these questions are a bit anal retentive, but I haven't found documentation for this, so I thought I'd be thorough.)
Part of the issue here is that the runtime can't find the CUDA Profiling library (libcupti.so), hence you're only seeing the PGI CPU-side profiling, not the device profiling. PGI ships the libcupti.so library with the compilers (under $PGI/[linux86-64|linuxpower]/2017/cuda/[7.5|8.0]/lib64), but this is an optional install, so you may not have it installed on the system you're running on. CUPTI also ships with the CUDA SDK, so if the system has CUDA installed, you can try pointing your LD_LIBRARY_PATH there instead. On my system it's installed in "/opt/cuda-8.0/extras/CUPTI/lib64/".
The missing CUPTI library is why you're seeing the bogus time, 97,667, for the file total. Also, since CUPTI is missing, the times you do see are measured from the host. With CUPTI, in addition to the elapsed time, you'd see the device time for each kernel; the difference between the elapsed time and the device time is the launch overhead per kernel.
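For example, pointing LD_LIBRARY_PATH at the CUDA SDK location mentioned above (adjust the path to wherever CUPTI is installed on your system):

export LD_LIBRARY_PATH=/opt/cuda-8.0/extras/CUPTI/lib64:$LD_LIBRARY_PATH
./a.out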
3. Are max, min, and avg values based on per-run values of the kernel?
Yes.
4. Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?
I tend to look at the avg time first, since there are typically more opportunities to tune those loops. If you are varying the amount of work per kernel invocation (i.e. the grid size changes), it might not be as useful, but it is still a good starting point.
If you have a low average but many calls, the elapsed time may be dominated by kernel launch overhead; in that case, I'd look at whether you can combine loops or push more work into each loop.
5. For data regions (device time(us): total=6,783), is the measurement the transfer time or the entire time spent dealing with the data (preparing to transfer, post-receipt operations)?
Just the data transfer time. For the overhead, you would need to use PGPROF/NVPROF.
6. The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 is a close-brace, and Line 110 is a variable definition. Should line numbers be interpreted as "the loop most closely following the indicated line number", or in some other way?
It's because the code has been optimized, so the line numbers may be a bit off, though they should correspond to the line numbers from the compiler feedback messages (-Minfo=accel). So the "the loop most closely following the indicated line number" interpretation should be correct.
I encountered an interesting phenomenon that I fail to explain. I haven't found an answer online, as most of the posts deal with weak scaling and thus communication overhead.
Here is a small piece of code to illustrate the problem. This was tested in different languages with similar results, hence the multiple tags.
#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {
    MPI_Init(NULL, NULL);

    int wsize;
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);
    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    clock_t t;
    MPI_Barrier(MPI_COMM_WORLD);
    t = clock();

    int imax = 10000000;
    int jmax = 1000;
    for (int i = 0; i < imax; i++) {
        for (int j = 0; j < jmax; j++) {
            // nothing
        }
    }

    t = clock() - t;
    printf(" proc %d took %f seconds.\n", wrank, (float)t / CLOCKS_PER_SEC);

    MPI_Finalize();
    return 0;
}
Now as you can see, the only part that is timed here is the loop. Therefore, with similar CPUs, no hyperthreading, and sufficient RAM, increasing the number of CPUs should produce exactly the same time.
However, on my machine, which has 32 cores and 15 GiB of RAM,
mpirun -np 1 ./test
gives
proc 0 took 22.262777 seconds.
but
mpirun -np 20 ./test
gives
proc 18 took 24.440767 seconds.
proc 0 took 24.454365 seconds.
proc 4 took 24.461191 seconds.
proc 15 took 24.467632 seconds.
proc 14 took 24.469728 seconds.
proc 7 took 24.469809 seconds.
proc 5 took 24.461639 seconds.
proc 11 took 24.484224 seconds.
proc 9 took 24.491638 seconds.
proc 2 took 24.484953 seconds.
proc 17 took 24.490984 seconds.
proc 16 took 24.502146 seconds.
proc 3 took 24.513380 seconds.
proc 1 took 24.541555 seconds.
proc 8 took 24.539808 seconds.
proc 13 took 24.540005 seconds.
proc 12 took 24.556068 seconds.
proc 10 took 24.528328 seconds.
proc 19 took 24.585297 seconds.
proc 6 took 24.611254 seconds.
and values ranging in between for varying numbers of CPUs.
htop also shows an increase in RAM consumption (VIRT is ~100M for 1 core, and ~300M for 20). Although that might be related to the size of the mpi communicator?
Finally, it is definitely related to the size of the problem (and thus not a communication overhead causing a constant delay regardless of the size of the loop). Indeed, decreasing imax to, say, 10000 makes the wall times similar.
1 core :
proc 0 took 0.028439 seconds.
20 cores :
proc 1 took 0.027880 seconds.
proc 12 took 0.027880 seconds.
proc 8 took 0.028024 seconds.
proc 16 took 0.028135 seconds.
proc 17 took 0.028094 seconds.
proc 19 took 0.028098 seconds.
proc 7 took 0.028265 seconds.
proc 9 took 0.028051 seconds.
proc 13 took 0.028259 seconds.
proc 18 took 0.028274 seconds.
proc 5 took 0.028087 seconds.
proc 6 took 0.028032 seconds.
proc 14 took 0.028385 seconds.
proc 15 took 0.028429 seconds.
proc 0 took 0.028379 seconds.
proc 2 took 0.028367 seconds.
proc 3 took 0.028291 seconds.
proc 4 took 0.028419 seconds.
proc 10 took 0.028419 seconds.
proc 11 took 0.028404 seconds.
This has been tried on several machines with similar results.
Maybe we're missing something very simple.
Thanks for the help!
This is a processor whose turbo frequencies are bound by temperature and power.
Modern processors are limited by their thermal design power (TDP). When the processor is cool and only a single core is active, that core may speed up to the turbo frequency multipliers. When it is hot, or when multiple cores are non-idle, the cores are slowed down to the guaranteed base speed. The difference between base and turbo speed is often around 400 MHz. AVX or FMA3 workloads may slow cores even below the base speed.
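One way to check whether this is what's happening (a sketch assuming a Linux system where /proc/cpuinfo reports the current "cpu MHz" per core) is to print the per-core frequencies during the -np 1 run and again during the -np 20 run and compare:

#include <fstream>
#include <iostream>
#include <string>

// Print the frequency currently reported for each core; run this while the
// benchmark is active to compare single-core vs. all-core clock speeds.
int main() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    int core = 0;
    while (std::getline(cpuinfo, line)) {
        if (line.compare(0, 7, "cpu MHz") == 0)
            std::cout << "core " << core++ << ":"
                      << line.substr(line.find(':') + 1) << " MHz\n";
    }
    return 0;
}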
I am trying to get the total CPU usage in %. First I should say that "top" simply will not do: there is a delay between CPU dumps, it requires two dumps and several seconds, and that hangs my program (I do not want to give it its own thread).
The next thing I tried is "ps", which is instant but always gives a very high total (20+), and when I actually got my CPU to do something it stayed at about 20...
Is there any other way to get the total CPU usage? It does not matter whether it is over one second or a longer period... a longer period would be more useful, though.
cat /proc/stat
http://www.linuxhowtos.org/System/procstat.htm
I agree with this answer above. The cpu line in this file gives the total number of "jiffies" your system has spent doing different types of processing.
What you need to do is take 2 readings of this file, separated by whatever interval of time you require. The numbers are increasing values (subject to integer rollover), so to get the %cpu you need to calculate how many jiffies have elapsed over your interval, versus how many jiffies were spent doing work.
e.g.
Suppose at 14:00:00 you have
cpu 4698 591 262 8953 916 449 531
total_jiffies_1 = (sum of all values) = 16400
work_jiffies_1 = (sum of user,nice,system = the first 3 values) = 5551
and at 14:00:05 you have
cpu 4739 591 289 9961 936 449 541
total_jiffies_2 = 17506
work_jiffies_2 = 5619
So the %cpu usage over this period is:
work_over_period = work_jiffies_2 - work_jiffies_1 = 68
total_over_period = total_jiffies_2 - total_jiffies_1 = 1106
%cpu = work_over_period / total_over_period * 100 = 6.1%
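A minimal sketch of that calculation in code (assuming the first line of /proc/stat starts with "cpu" followed by at least the user, nice, system, and idle fields; the helper names are ours):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Read the aggregate "cpu" line from /proc/stat and return all jiffy counters.
static std::vector<long long> ReadCpuJiffies() {
    std::ifstream stat("/proc/stat");
    std::string label;
    stat >> label;                       // "cpu"
    std::vector<long long> jiffies;
    long long value;
    while (stat >> value) {
        jiffies.push_back(value);
        if (stat.peek() == '\n') break;  // stop at the end of the cpu line
    }
    return jiffies;
}

static long long Sum(const std::vector<long long>& v, size_t count) {
    long long s = 0;
    for (size_t i = 0; i < count && i < v.size(); ++i) s += v[i];
    return s;
}

int main() {
    const auto first = ReadCpuJiffies();
    std::this_thread::sleep_for(std::chrono::seconds(5));
    const auto second = ReadCpuJiffies();

    // work = user + nice + system (the first three fields), total = all fields
    const long long work_over_period  = Sum(second, 3) - Sum(first, 3);
    const long long total_over_period = Sum(second, second.size()) - Sum(first, first.size());
    if (total_over_period <= 0) return 1;

    std::cout << "%cpu = " << 100.0 * work_over_period / total_over_period << "\n";
    return 0;
}

With the sample readings above, this would print roughly 6.1%, matching the hand calculation.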
Try reading /proc/loadavg. The first three numbers are the number of processes actually running (i.e., using a CPU), averaged over the last 1, 5, and 15 minutes, respectively.
http://www.linuxinsight.com/proc_loadavg.html
Read /proc/cpuinfo to find the number of CPUs/cores available to the system.
Call getloadavg() (or alternatively read /proc/loadavg), take the first value, multiply it by 100 (to convert to a percentage), and divide by the number of CPUs/cores. If the value is greater than 100, truncate it to 100. Done.
Relevant documentation: man getloadavg and man 5 proc
N.B. Load average, as is usual on *NIX systems, can be more than 100% (per CPU/core) because it actually measures the number of processes ready to be run by the scheduler. With a Windows-like CPU metric, when load is at 100% you do not really know whether it is optimal use of CPU resources or the system is overloaded. Under *NIX, optimal use of the CPU would give a loadavg of ~1.0 (or 2.0 for a dual-CPU system). If the value is much greater than the number of CPUs/cores, then you might want to plug extra CPUs into the box.
Otherwise, dig the /proc file system.
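A minimal sketch of that recipe, assuming getloadavg() and sysconf(_SC_NPROCESSORS_ONLN) are available (both are standard on Linux):

#include <algorithm>
#include <cstdlib>     // getloadavg
#include <iostream>
#include <unistd.h>    // sysconf

int main() {
    double loadavg[3];
    if (getloadavg(loadavg, 3) < 1) return 1;   // 1-minute average in loadavg[0]

    const long cores = sysconf(_SC_NPROCESSORS_ONLN);
    // Convert to a percentage per core and truncate at 100, as described above.
    const double usage = std::min(100.0, 100.0 * loadavg[0] / cores);
    std::cout << "approx. cpu usage: " << usage << " %\n";
    return 0;
}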
cpu-stat is a C++ project that lets you read the Linux CPU counters from /proc/stat.
Get the CPUData.* and CPUSnapshot.* files from cpu-stat's src directory.
Quick implementation to get overall cpu usage:
#include "CPUSnapshot.h"
#include <chrono>
#include <thread>
#include <iostream>
int main()
{
CPUSnapshot previousSnap;
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
CPUSnapshot curSnap;
const float ACTIVE_TIME = curSnap.GetActiveTimeTotal() - previousSnap.GetActiveTimeTotal();
const float IDLE_TIME = curSnap.GetIdleTimeTotal() - previousSnap.GetIdleTimeTotal();
const float TOTAL_TIME = ACTIVE_TIME + IDLE_TIME;
int usage = 100.f * ACTIVE_TIME / TOTAL_TIME;
std::cout << "total cpu usage: " << usage << " %" << std::endl;
}
Compile it:
g++ -std=c++11 -o CPUUsage main.cpp CPUSnapshot.cpp CPUData.cpp
cat /proc/stat
http://www.linuxhowtos.org/System/procstat.htm
I suggest two files to start with: /proc/stat and /proc/cpuinfo.
http://www.mjmwired.net/kernel/Documentation/filesystems/proc.txt
Have a look at this C++ lib.
The information is parsed from /proc/stat. It also parses memory usage from /proc/meminfo and ethernet load from /proc/net/dev.
----------------------------------------------
current CPULoad:5.09119
average CPULoad 10.0671
Max CPULoad 10.0822
Min CPULoad 1.74111
CPU: : Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
----------------------------------------------
network load: wlp0s20f3 : 1.9kBit/s : 920Bit/s : 1.0kBit/s : RX Bytes Startup: 15.8mByte TX Bytes Startup: 833.5mByte
----------------------------------------------
memory load: 28.4% maxmemory: 16133792 Kb used: 4581564 Kb Memload of this Process 170408 KB
----------------------------------------------