I am trying to get total cpu usage in %. First I should start by saying that "top" will simply not do, as there is a delay between cpu dumps, it requires 2 dumps and several seconds, which hangs my program (I do not want to give it its own thread)
next thing what I tried is "ps" which is instant but always gives very high number in total (20+) and when I actually got my cpu to do something it stayed at about 20...
Is there any other way that I could get total cpu usage? It does not matter if it is over one second or longer periods of time... Longer periods would be more useful, though.
cat /proc/stat
http://www.linuxhowtos.org/System/procstat.htm
I agree with this answer above. The cpu line in this file gives the total number of "jiffies" your system has spent doing different types of processing.
What you need to do is take 2 readings of this file, seperated by whatever interval of time you require. The numbers are increasing values (subject to integer rollover) so to get the %cpu you need to calculate how many jiffies have elapsed over your interval, versus how many jiffies were spend doing work.
e.g.
Suppose at 14:00:00 you have
cpu 4698 591 262 8953 916 449 531
total_jiffies_1 = (sum of all values) = 16400
work_jiffies_1 = (sum of user,nice,system = the first 3 values) = 5551
and at 14:00:05 you have
cpu 4739 591 289 9961 936 449 541
total_jiffies_2 = 17506
work_jiffies_2 = 5619
So the %cpu usage over this period is:
work_over_period = work_jiffies_2 - work_jiffies_1 = 68
total_over_period = total_jiffies_2 - total_jiffies_1 = 1106
%cpu = work_over_period / total_over_period * 100 = 6.1%
Try reading /proc/loadavg. The first three numbers are the number of processes actually running (i.e., using a CPU), averaged over the last 1, 5, and 15 minutes, respectively.
http://www.linuxinsight.com/proc_loadavg.html
Read /proc/cpuinfo to find the number of CPU/cores available to the systems.
Call the getloadavg() (or alternatively read the /proc/loadavg), take the first value, multiply it by 100 (to convert to percents), divide by number of CPU/cores. If the value is greater than 100, truncate it to 100. Done.
Relevant documentation: man getloadavg and man 5 proc
N.B. Load average, usual to *NIX systems, can be more than 100% (per CPU/core) because it actually measures number of processes ready to be run by scheduler. With Windows-like CPU metric, when load is at 100% you do not really know whether it is optimal use of CPU resources or system is overloaded. Under *NIX, optimal use of CPU loadavg would give you value ~1.0 (or 2.0 for dual system). If the value is much greater than number CPU/cores, then you might want to plug extra CPUs into the box.
Otherwise, dig the /proc file system.
cpu-stat is a C++ project that permits to read Linux CPU counter from /proc/stat .
Get CPUData.* and CPUSnaphot.* files from cpu-stat's src directory.
Quick implementation to get overall cpu usage:
#include "CPUSnapshot.h"
#include <chrono>
#include <thread>
#include <iostream>
int main()
{
CPUSnapshot previousSnap;
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
CPUSnapshot curSnap;
const float ACTIVE_TIME = curSnap.GetActiveTimeTotal() - previousSnap.GetActiveTimeTotal();
const float IDLE_TIME = curSnap.GetIdleTimeTotal() - previousSnap.GetIdleTimeTotal();
const float TOTAL_TIME = ACTIVE_TIME + IDLE_TIME;
int usage = 100.f * ACTIVE_TIME / TOTAL_TIME;
std::cout << "total cpu usage: " << usage << " %" << std::endl;
}
Compile it:
g++ -std=c++11 -o CPUUsage main.cpp CPUSnapshot.cpp CPUData.cpp
cat /proc/stat
http://www.linuxhowtos.org/System/procstat.htm
I suggest two files to starting...
/proc/stat and /proc/cpuinfo.
http://www.mjmwired.net/kernel/Documentation/filesystems/proc.txt
have a look at this C++ Lib.
The information is parsed from /proc/stat. it also parses memory usage from /proc/meminfo and ethernet load from /proc/net/dev
----------------------------------------------
current CPULoad:5.09119
average CPULoad 10.0671
Max CPULoad 10.0822
Min CPULoad 1.74111
CPU: : Intel(R) Core(TM) i7-10750H CPU # 2.60GHz
----------------------------------------------
network load: wlp0s20f3 : 1.9kBit/s : 920Bit/s : 1.0kBit/s : RX Bytes Startup: 15.8mByte TX Bytes Startup: 833.5mByte
----------------------------------------------
memory load: 28.4% maxmemory: 16133792 Kb used: 4581564 Kb Memload of this Process 170408 KB
----------------------------------------------
Related
I need to calculate the overall CPU usage of my Linux device over some time (1-5 seconds) and a list of processes with their respective CPU usage times. The programm should be designed and implemented in C++. My assumption would be that the sum of all process CPU times would be equal to the total value for the whole CPU. For now the CPU I am using is multi-cored (2 cores).
According to How to determine CPU and memory consumption from inside a process? it is possible to calculate all "jiffies" available in the system since startup using the values for "cpu" in /proc/stat. If you now sample the values at two points in time and compare the values for user, nice, system and idle at the two time points, you can calculate the average CPU usage in this interval. The formula would be
totalCPUUsage = ((user_aft - user_bef) + (nice_aft - nice_bef) + (system_aft - system_bef)) /
((user_aft - user_bef) + (nice_aft - nice_bef) + (system_aft - system_bef) + (idle_aft - idle_bef)) * 100 %
According to How to calculate the CPU usage of a process by PID in Linux from C? the used jiffies for a single process can be calculated by adding utime and stime from /proc/${PID}/stat (column 14 and 15 in this file). When I now calculate this sum and divide it by the total amount of jiffies in the analyzed interval, I would assume the formula for one process to be
processCPUUsage = ((process_utime_aft - process_utime_bef) + (process_stime_aft - process_stime_bef)) /
((user_aft - user_bef) + (nice_aft - nice_bef) + (system_aft - system_bef) + (idle_aft - idle_bef)) * 100 %
When I now sum up the values for all processes and compare it to the overall calculated CPU usage, I receive a slightly higher value for the aggregated value most of the time (although the values are quite close for all different CPU loads).
Can anyone explain to me, what's the reason for that? Are there any CPU resources that are used by more than one process and thus accounted twice or more in my accumlation? Or am I simply missing something here? I can not find any further hint in the Linux man page for the proc file system (https://linux.die.net/man/5/proc) as well.
Thanks in advance!
I wanted to test a way to measure the precise execution time of a piece of code in nanoseconds (accuracy upto 100 nanoseconds is ok) in C++.
I tried using chrono::high_resolution_clock for this purpose. In order to test whether it is working properly or not. I do the following:
Get current time in nanoseconds using high_resolution_clock, call it "start"
sleep for "x" nanoseconds using nanosleep(x)
Get current time in nanoseconds using high_resolution_clock, call it "end"
Now "end" - "start" should be roughly same as "x". Lets call this difference "diff"
I ran the above test for x varying from 10 to 1000000. I get the diff to be around 100000 i.e (100 microseconds)
Where as this shouldn't be more than say 100 nanoseconds. Please help me fix this.
#include <ctime>
#include <unistd.h>
#include <iostream>
#include <chrono>
using namespace std;
int main() {
int sleep_ns[] = {10, 50, 100, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000};
int n = sizeof(sleep_ns)/sizeof(int);
for (int i = 0; i < n; i++) {
auto start = std::chrono::high_resolution_clock::now();
timespec tspec = {0, sleep_ns[i]};
nanosleep(&tspec, NULL);
auto end = std::chrono::high_resolution_clock::now();
chrono::duration<int64_t, nano> dur_ns = (end - start);
int64_t measured_ns = dur_ns.count();
int64_t diff = measured_ns - sleep_ns[i];
cout << "diff: " << diff
<< " sleep_ns: " << sleep_ns[i]
<< " measured_ns: " << measured_ns << endl;
}
return 0;
}
Following was the output of this code on my machine. Its running "Ubuntu 16.04.4 LTS"
diff: 172747 sleep_ns: 10 measured_ns: 172757
diff: 165078 sleep_ns: 50 measured_ns: 165128
diff: 164669 sleep_ns: 100 measured_ns: 164769
diff: 163855 sleep_ns: 500 measured_ns: 164355
diff: 163647 sleep_ns: 1000 measured_ns: 164647
diff: 162207 sleep_ns: 2000 measured_ns: 164207
diff: 160904 sleep_ns: 5000 measured_ns: 165904
diff: 155709 sleep_ns: 10000 measured_ns: 165709
diff: 145306 sleep_ns: 20000 measured_ns: 165306
diff: 115915 sleep_ns: 50000 measured_ns: 165915
diff: 125983 sleep_ns: 100000 measured_ns: 225983
diff: 115470 sleep_ns: 200000 measured_ns: 315470
diff: 115774 sleep_ns: 500000 measured_ns: 615774
diff: 116473 sleep_ns: 1000000 measured_ns: 1116473
What you're trying to do is not going to work on every platform, or even most platforms. There's a couple of reasons why.
The first, and biggest, reason is that measuring the precise time that code is executed at/within is, by its very nature, imprecise. It requires a black-box OS call to determine, and if you've ever looked at how those calls are implemented in the first place, it's quickly apparent that there's inherent imprecision in the technique. On Windows, this is done by measuring both the current "tick" of the processor, and its reported frequency, and multiplying one by the other to determine how many nanoseconds have passed between two successive calls. But Windows only reports with accuracy of Microseconds to begin with, and if the CPU changes its frequency, even if only modestly (which is common in modern CPUs, to lower the frequency when the CPU isn't being maxed out, to save power) that can skew results.
Linux also has similar quirks, and every OS is at the mercy of the CPU's ability to accurately report its own tick counter/tick rate.
The second reason you're going to get results like what you've observed is in the fact that, for similar reasons to the first reason, "Sleep"-ing a thread is usually very imprecise. CPUs usually can't sleep with better precision than microsecond precision, and it's often not possible to sleep any faster than half a millisecond at a time. Your particular environment seems to be at least capable of a few hundred microseconds of precision, but it's clearly not more precise than that. Some environments will even drop Nanosecond resolution altogether.
Altogether, it's probably a mistake to presume that, without programming for an explicitly Real-Time OS, using the specific API of that OS, that you can get the kind of precision you're expecting/desiring. If you want reliable information about the timing of individual snippets of code, you'll need to run said code over-and-over, get a sample of the entire execution, and then take the average, to get you a broad idea of the timing for each run. It'll still be imprecise, but it'll help get around these limitations.
Here's part of the description of nanosleep:
If the interval specified in req is not an exact multiple of the granularity underlying clock (see time(7)), then the interval will be rounded up to the next multiple. Furthermore, after the sleep completes, there may still be a delay before the CPU becomes free to once again execute the calling thread.
The behavior you're getting seems to fit pretty well with the description.
For extremely short pauses, you're probably going to have to do some (most?) of the work on your own. The system clock source will often have a granularity of a microsecond or so.
One possible way to pause for less than the system clock time would be to measure how often you can execute a loop before the clock changes. During startup (for example) do that a few times, to get a good idea of how many loops you can execute per microsecond.
Then to pause for some fraction of that time, you can do linear interpolation to guess at a number of times to execute a loop to get about the same length of pause.
Note: this will generally run the CPU at 100% for the duration of the pause, so you only want to do it for really short pauses--up to a microsecond or two is fine, but if you want much more than that, you probably want to fall back to nanosleep.
Even with that, however, you need to be aware that a pause could end up substantially longer than you planned. The OS does time slicing. If your process' time slice expires in the middle of your pause loop, it could easily be tens of milliseconds (or more) before it's scheduled to run again.
If you really need an assurance of response times on this order, you'll probably need to consider another OS (but even that's not a panacea--what you're asking for isn't trivial, regardless of how you approach it).
Reference
nanosleep man page
I have some OpenACC-accelerated C++ code that I've compiled using the PGI compiler. Things seem to be working, so now I want to play efficiency whack-a-mole with profiling information.
I generate some timing info by setting:
export PGI_ACC_TIME=1
And then running the program.
The following output results:
-bash-4.2$ ./a.out
libcupti.so not found
Accelerator Kernel Timing data
PGI_ACC_SYNCHRONOUS was set, disabling async() clauses
/home/myuser/myprogram.cpp
_MyProgram NVIDIA devicenum=1
time(us): 97,667
75: data region reached 2 times
75: data copyin transfers: 3
device time(us): total=101 max=82 min=9 avg=33
76: compute region reached 1000 times
76: kernel launched 1000 times
grid: [1938] block: [128]
elapsed time(us): total=680,216 max=1,043 min=654 avg=680
95: compute region reached 1000 times
95: kernel launched 1000 times
grid: [1938] block: [128]
elapsed time(us): total=487,365 max=801 min=476 avg=487
110: data region reached 2000 times
110: data copyin transfers: 1000
device time(us): total=6,783 max=140 min=3 avg=6
125: data copyout transfers: 1000
device time(us): total=7,445 max=190 min=6 avg=7
real 0m3.864s
user 0m3.499s
sys 0m0.348s
It raises some questions:
I see time(us): 97,667 at the top. This seems like a total time, but, at the bottom, I see real 0m3.864s. Why is there such a difference?
If time(us): 97,667 is the total, why is it so much smaller than values lower down, such as elapsed time(us): total=680,216?
This kernel including the line (elapsed time(us): total=680,216 max=1,043 min=654 avg=680) was run 1000 times. Are max, min, and avg values based on per-run values of the kernel?
Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?
For data regions (device time(us): total=6,783) is the measurement transfer time or the entire time spent dealing with the data (preparing to transfer, post-receipt operations)?
The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 in is a close-brace, and Line 110 is a variable definition. Should line numbers be interpreted as "the loop most closely following the indicated line number", or in some other way?
The kernel at 76 contains the kernel at 95. Are the times calculated for 76 inclusive of time spent in 95? If so, is there a convenient way to find the time spent in a kernel minus the times of all the subkernels?
(Some of these questions are a bit anal retentive, but I haven't found documentation for this, so I thought I'd be thorough.)
Part of the issue here is that the runtime can't find the CUDA Profiling library (libcupti.so), hence you're only seeing the PGI CPU side profiling not the device profiling. PGI ships libcupti.so library with the compilers (under $PGI/[linux86-64|linuxpower]/2017/cuda/[7.5|8.0]/lib64) but this is an optional install so you may not have it install on the system you're running. CUPTI also ships with the CUDA SDK, so if the system has CUDA install, you can try setting you're LD_LIBRARY_PATH there instead. On my system it's installed in "/opt/cuda-8.0/extras/CUPTI/lib64/".
The missing CUPTI library is why you're seeing the bad time, 97,667, for the file time. Also since you're missing CUPTI, the time you're seeing is being measured from the host. With CUPTI, in addition to the elapsed time, you'd see the device time for each kernel. The difference between the elapsed time and the device time is the launch overhead per kernel.
Are max, min, and avg values based on per-run values of the kernel?
Yes.
4.Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?
I tend to first look at the avg time since there's typically more opportunities to tune these loops. If you are varying the amount of work per kernel iteration (i.e the grid size changes), then it might not be as useful, but a good starting point.
Now if you had a low average but many calls, then the elapsed time may be dominated by kernel launch overhead. In which case, I'd look to see if you can combine loops or push more work into each loop.
5.For data regions (device time(us): total=6,783) is the measurement transfer time or the entire time spent dealing with the data
(preparing to transfer, post-receipt operations)?
Just the data transfer time. For the overhead, you would need to use PGPROF/NVPROF.
6.The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 in is a close-brace, and Line 110 is a
variable definition. Should line numbers be interpreted as "the loop
most closely following the indicated line number", or in some other
way?
It's because the code's been optimized so the line number may be a bit off though it should correspond to the line numbers from compiler feedback messages (-Minfo=accel). So "the loop most closely..." option should be correct.
I used the following function to find the time taken by my code.
#include <sys/time.h>
struct timeval start, end;
gettimeofday(&start,NULL);
//mycode
gettimeofday(&end,NULL);
cout<<" time taken by my code: "<<((end.tv_sec - start.tv_sec) * 1000000 + end.tv_usec - start.tv_usec ) / 1000.0<<" msec"<<endl;
I observed that even though my code runs for 2 hours, yet the time reported by the above function is 1213 milliseconds. I am not able to understand as to why is it happened. Also is there a way by which I may record the time taken by my code in hours correctly
My best guess is that time_t (the type of tv_sec) on your system is signed 32 bits and that (end.tv_sec - start.tv_sec) * 1000000 overflows.
You could test that theory by making sure that you don't use 32 bit arithmetic for this computation:
(end.tv_sec - start.tv_sec) * 1000000LL
That being said, I advise use of the C++11 <chrono> library instead:
#include <chrono>
auto t0 = std::chrono::system_clock::now();
//mycode
auto t1 = std::chrono::system_clock::now();
using milliseconds = std::chrono::duration<double, std::milli>;
milliseconds ms = t1 - t0;
std::cout << " time taken by my code: " << ms.count() << '\n';
The <chrono> library has an invariant that none of the "predefined" durations will overflow in less than +/- 292 years. In practice, only nanoseconds will overflow that quickly, and the other durations will have a much larger range. Each duration has static ::min() and ::max() functions you can use to query the range for each.
The original proposal for <chrono> has a decent tutorial section that might be a helpful introduction. It is only slightly dated. What it calls monotonic_clock is now called steady_clock. I believe that is the only significant update it lacks.
On which platform are you doing this? If it's Linux/Unix-like your easiest non-intrusive bet is simply using the time command from the command-line. Is the code you're running single-threaded or not? Some of the functions in time.h (like clock() e.g. ) return the number of ticks against each core, which may or may not be what you want. And the newer stuff in the chrono may not be as exact as you like (a while back I tried to measure time intervals in nanoseconds with chrono, but the lowest time interval I got back back then was 300ns, which was much less exact than I'd hoped).
This part of the bench marking process may help your purpose:
#include<time.h>
#include<cstdlib>
...
...
float begin = (float)clock()/CLOCKS_PER_SEC;
...
//do your bench-marking stuffs
...
float end = (float)clock()/CLOCKS_PER_SEC;
float totalTime = end - begin;
cout<<"Time Req in the stuffs: "<<totalTime<<endl;
NOTE: This process is a simple alternative to the chrono library
If you are on linux and the code that you want to time is largely the program itself, then you can time your program by passing it as an argument to the time command and look at the 'elapsed time' row.
/usr/bin/time -v <your program's executable>
For example:
/usr/bin/time -v sleep 3 .../home/aakashah/workspace/head/src/GroverStorageCommon
Command being timed: "sleep 3"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2176
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 165
Voluntary context switches: 2
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
I'm trying to measure a function's performance by measuring the time for each iteration.
During the process, I found even if I do nothing, the results still vary quite a bit.
e.g.
volatile long count = 0;
for (int i = 0; i < N; ++i) {
measure.begin();
++count;
measure.end();
}
In measure.end(), I measure the time difference and keep an unordered_map to keep track of the time-count.
I've used clock_gettime as well as rdtsc, but there's always about 1% of the data points lie far away from mean, in a 1000 factor.
Here's what the above loop generates:
T: count percentile
18 117563 11.7563%
19 111821 22.9384%
21 201605 43.0989%
22 541095 97.2084%
23 2136 97.422%
24 2783 97.7003%
...
406 1 99.9994%
3678 1 99.9995%
6662 1 99.9996%
17945 1 99.9997%
18148 1 99.9998%
18181 1 99.9999%
22800 1 100%
mean:21
So whether it's ticks or ns, the worst case 22800 is about 1000 times bigger than mean.
I did isolcpus in grub and was running this with taskset. The simple loop almost does nothing, the hash table to do time-count statistics is outside of the time measurements.
What am I missing?
I'm running this on a laptop with ubuntu installed, CPU is Intel(R) Core(TM) i5-2520M CPU # 2.50GHz
Thank you for all the answers.
The main interrupt that I couldn't stop is the local timer interrupt. And it seems new 3.10 kernel would support tickless. I'll try that one.