Why is there a difference between the sum (stime + utime) of all processes' CPU usage, compared to the overall CPU usage from /proc/stat in Linux? - c++

I need to calculate the overall CPU usage of my Linux device over some time (1-5 seconds) and a list of processes with their respective CPU usage times. The programm should be designed and implemented in C++. My assumption would be that the sum of all process CPU times would be equal to the total value for the whole CPU. For now the CPU I am using is multi-cored (2 cores).
According to How to determine CPU and memory consumption from inside a process? it is possible to calculate all "jiffies" available in the system since startup using the values for "cpu" in /proc/stat. If you now sample the values at two points in time and compare the values for user, nice, system and idle at the two time points, you can calculate the average CPU usage in this interval. The formula would be
totalCPUUsage = ((user_aft - user_bef) + (nice_aft - nice_bef) + (system_aft - system_bef)) /
((user_aft - user_bef) + (nice_aft - nice_bef) + (system_aft - system_bef) + (idle_aft - idle_bef)) * 100 %
According to How to calculate the CPU usage of a process by PID in Linux from C? the used jiffies for a single process can be calculated by adding utime and stime from /proc/${PID}/stat (column 14 and 15 in this file). When I now calculate this sum and divide it by the total amount of jiffies in the analyzed interval, I would assume the formula for one process to be
processCPUUsage = ((process_utime_aft - process_utime_bef) + (process_stime_aft - process_stime_bef)) /
((user_aft - user_bef) + (nice_aft - nice_bef) + (system_aft - system_bef) + (idle_aft - idle_bef)) * 100 %
When I now sum up the values for all processes and compare it to the overall calculated CPU usage, I receive a slightly higher value for the aggregated value most of the time (although the values are quite close for all different CPU loads).
Can anyone explain to me, what's the reason for that? Are there any CPU resources that are used by more than one process and thus accounted twice or more in my accumlation? Or am I simply missing something here? I can not find any further hint in the Linux man page for the proc file system (https://linux.die.net/man/5/proc) as well.
Thanks in advance!

Related

Measuring FLOPS and memory traffic in a C++ program

I am trying to profile a C++ program. For the first step, I want to determine whether the program is compute-bound or memory-bound by the Roofline Model. So I need to measure the following 4 things.
W: # of computations performed in the program (FLOPs)
Q: # of bytes of memory accesses incurred in the program (Byte/s)
π: peak performance (FLOPs)
β: peak bandwidth (Byte/s)
I have tried to use Linux perf to measure W. I followed the instructions here, using libpfm4 to determine the available events (by ./showevinfo). I found my CPU supports the INST_RETIREDevent with umask X87, then I used ./check_events INST_RETIRED:X87 to find the code, which is 0x5302c0. Then I tried perf stat -e r5302c0 ./test_exe and I got
Performance counter stats for './test_exe':
83,381,997 r5302c0
20.134717382 seconds time elapsed
74.691675000 seconds user
0.357003000 seconds sys
Questions:
Is it right for my process to measure the W of my program? If yes, then it should be 83,381,997 FLOPs, right?
Why is this FLOPs not stable between repeated executions?
How can I measure the other Q, π and β?
Thanks for your time and any suggestions.

c++ Create time remaining estimate when data calcs get progressively longer?

I'm adding items to a list, so each insert takes just a bit longer than the last (this is a requirement, assume you can't change that). I've manually timed a sample dataset on MY computer but I want a generalized way to predict the time on any computer, and given ANY dataset size.
In my flailing around trying to figure this out, what i have collected is a vector, 100 long, of "how long 1/100th the sample data" took. So in my example data set i have 237,965 objects, which means in the vector of times i collected, each bucket tells how long it took to add 2,379 items.
Here's a link to the sample data of 100 items. So you can see the first 2k items took about 8 seconds, and the last 2k items took about 101 seconds. All together, if you add all the time, that's 4,295 seconds or about 1 hr 11 minutes.
So my question is, given this data set, and using it for future predictions, how do i estimate the remaining time when adding different size data?
In more flailing, i made some plots, wondering if it could help. First plot is just the raw data on a log graph:
I then made a 2nd data set based on first, this time showing accumulated time, rather than just the time for the current slice, and plotted that on a linear graph:
Notice the lovely trend line formula? That MUST be something that i just need to somehow plug into my code but i can't for the life of me figure out how.
Should i have instead gathered the data into time-slices and not index-slices? ie: i KNOW this data takes 1:10 to load, so take snapshots every 1/100th of that duration, instead of snapshotting every 1/100th of the data set?
Or HOW do i figure this out?
the function I need to write has this API:
CFTimeInterval get_estimated_end_time(int maxI, int curI, CFTimeInterval elapsedT);
so given only those three variables (maxI, curI, and elapsedT), and knowing the trend line formula from above, i need to return "duration until maxI" (seconds).
Any ideas?
Update:
well it seems after much futzing around, i can just do this (note "LERP" is just linear interpolate):
#define kDataSetMax 237965
double FunctionX(int in_x)
{
double _x(LERP(0, 100, in_x, 0, i_maxI));
double resultF =
(0.32031139888898874 * math_square(_x))
+ (9.609731568497784 * _x)
- (7.527252350031663);
if (resultF <= 1) {
resultF = 1;
}
return resultF;
}
CFTimeInterval get_estimated_end_time(int maxI, int curI, CFTimeInterval elapsedT)
{
CFTimeInterval endT(FunctionX(maxI));
return remainingT;
}
But that means i'm just ignoring curI and elapsedT?? That doesn't seem... right? What am I missing?
Footnotes:
#define LERP(to_min, to_max, from, from_min, from_max) \
((from_max) == (from_min) ? from : \
(double)(to_min) + ((double)((to_max) - (to_min)) \
* ((double)((from) - (from_min)) \
/ (double)((from_max) - (from_min)))))
#define LERP_PERCENT(from, from_max) \
LERP(0.0f, 1.0f, from, 0.0f, from_max)
Your FunctionX is most of the way there. It's currently calculating expectedTimeToReachMaxIOnMyMachine. What you need to do is figure out how much slower the current time is relative to the expected on your machine to reach this same point, and then extrapolate that same ratio to the maximum time.
CFTimeInterval get_estimated_end_time(int maxI, int curI, CFTimeInterval elapsedT) {
//calculate how long we expected it to take to reach this point
CFTimeInterval expectedTimeToReachCurrentIOnMyMachine = FunctionX(curI);
//calculate how much slower we are than the expectation
//if this machine is faster, the math still works out.
double slowerThanExpectedByRatio
= double(elapsedT) / expectedTimeToReachCurrentIOnMyMachine;
//calculate how long we expected to reach the max
CFTimeInterval expectedTimeToReachMaxIOnMyMachine = FunctionX(maxI);
//if we continue to be the same amount slower, we'll reach the max at:
CFTimeInterval estimatedTimeToReachMaxI
= expectedTimeToReachMaxIOnMyMachine * slowerThanExpectedByRatio;
return estimatedTimeToReachMaxI;
}
Note that a smart implementation can cache and reuse expectedTimeToReachMaxIOnMyMachine and not calculate it every time.
Basically this assumes that after doing X% of the work, we can calculate how much slower we were than the expected curve, and assume we will stay approximately that same amount slower than the expected curve.
In the example below, the expected time taken is the blue line. At 4000 elements, we see that the expected time on your machine was 8,055,826, But the actual time taken on this machine was 10,472,573, which is 30% higher (slowerThanExpectedByRatio=1.3). At that point, we can extrapolate that we'll probably remain 30% higher throughout the entire process (the purple line). So if the total expected time on your machine for 10000 elements was 32,127,229, then our total estimated time on this machine for 10000 will be 41,765,398 (30% higher)

Returning the memory used so I can predict the memory required to compute ML algorithm

I am running a Random Forest ML script using a test size data set 5 k observations with a set number of parameters with a varying number of forests. My real model is closer to 1 million observations with 500+ parameters. I am trying to calculate how much memory this model would require assuming x number of forests.
In order to do this I could use a method of returning how much memory was used in a running of the script. Is it possible to return this, so that I can calculate the RAM required to compute the full model?
I currently use the following to tell me how long it takes to compute:
global starttime
print "The whole routine took %.3f seconds" % (time() - starttime)
Edit Re to my own answer
Feel like I am conversing with myself a little but hey ho, I tried running the following code to find out how much memory is actually being used, and why when I increase the number of n_estimators_value my PC runs out of memory. Unfortunately all of the % memory usage come back the same, I assume this is because it is calculating the memory usage at the incorrect time, it needs to record it at its peak whilst actually fitting the random forest. See code:
psutilpercent = psutil.virtual_memory()
print "\n", " --> Memory Check 1 Percent:", str(psutilpercent.percent) + "%\n"
n_estimators_value = 500
rf = ensemble.RandomForestRegressor(n_estimators = n_estimators_value, oob_score=True, random_state = 1)
psutilpercent = psutil.virtual_memory()
print "\n", " --> Memory Check 1 Percent:", str(psutilpercent.percent) + "%\n"
Any methods to find out the peak memory usage? I am trying to calculate how much memory would be required to fit a rather large RF, and I cant calculate this without knowing how much memory my smaller models require.
/usr/bin/time reports peak memory usage for a program. There's also the memory_profiler for Python.

Fast percentile in C++ - speed more important than precision

This is a follow-up to Fast percentile in C++
I have a sorted array of 365 daily cashflows (xDailyCashflowsDistro) which I randomly sample 365 times to get a generated yearly cashflow. Generating is carried out by
1/ picking a random probability in the [0,1] interval
2/ converting this probability to an index in the [0,364] interval
3/ determining what daily cashflow corresponds to this probability by using the index and some linear aproximation.
and summing 365 generated daily cashflows. Following the previously mentioned thread, my code precalculates the differences of sorted daily cashflows (xDailyCashflowDiffs) where
xDailyCashflowDiffs[i] = xDailyCashflowsDistro[i+1] - xDailyCashflowsDistro[i]
and thus the whole code looks like
double _dIdxConverter = ((double)(365 - 1)) / (double)(RAND_MAX - 1);
for ( unsigned int xIdx = 0; xIdx < _xCount; xIdx++ )
{
double generatedVal = 0.0;
for ( unsigned int xDayIdx = 0; xDayIdx < 365; xDayIdx ++ )
{
double dIdx = (double)fastRand()* _dIdxConverter;
long iIdx1 = (unsigned long)dIdx;
double dFloor = (double)iIdx1;
generatedVal += xDailyCashflowsDistro[iIdx1] + xDailyCashflowDiffs[iIdx1] *(dIdx - dFloor);
}
results.push_back(generatedVal) ;
}
_xCount (the number of simulations) is 1K+, usually 10K.
The problem:
This simulation is being carried out 15M times (compared to 100K when the first thread was written) at the moment, and it takes ~10 minutes on a 3.4GHz machine. Due to the nature of problem, this 15M is unlikely to be significantly lowered in the future, only increased. Having used VTune Analyzer, I am being told that the last but one line (generatedVal += ...) generates 80% of runtime. And my question is why and how I can work with that.
Things I have tried:
1/ getting rid of the (dIdx - dFloor) part to see whether double difference and multiplication is the main culprit - runtime dropped by a couple of percent
2/ declaring xDailyCashflowsDistro and xDailyCashflowDiffs as __restict so as to prevent the compiler thinking they are dependendent on each other - no change
3/ tried using 16 days (as opposed to 365) to see whether it is cache misses that drag my performance - not a slight change
4/ tried using floats as opposed to doubles - no change
5/ compiling with different /fp: - no change
6/ compiling as x64 - has effect on the double <-> ulong conversions, but the line in question is unaffected
What I am willing to sacrifice is resolution - I do not care whether the generatedVal is 100010.1 or 100020.0 at the end if the speed gain is substantial.
EDIT:
The daily/yearly cashflows are related to the whole portfolio. I could divide all daily cashflows by portflio size and would thus (at 99.99% confidence level) ensure that daily cashflows/pflio_size will not reach out of the [-1000,+1000] interval. In this case, though, I would need precision to the hundredths.
Perhaps you could turn your piecewise linear function into a piecewise-linear "histogram" of its values. The number you're sampling appears to be the sum of 365 samples from that histogram. What you're doing is a not-particularly-fast way to sample from the sum of 365 samples from that histogram.
You might try computing a Fourier (or wavelet or similar) transform, keeping only the first few terms, raising it to the 365th power, and computing the inverse transform. You won't get a probability distribution in the end, but there shouldn't be "too much" mass below 0 or above 1 and the total mass shouldn't be "too different" from 1 with this technique. (I don't know what your data looks like; this technique may well be unworkable for good mathematical reasons.)

How to get total cpu usage in Linux using C++

I am trying to get total cpu usage in %. First I should start by saying that "top" will simply not do, as there is a delay between cpu dumps, it requires 2 dumps and several seconds, which hangs my program (I do not want to give it its own thread)
next thing what I tried is "ps" which is instant but always gives very high number in total (20+) and when I actually got my cpu to do something it stayed at about 20...
Is there any other way that I could get total cpu usage? It does not matter if it is over one second or longer periods of time... Longer periods would be more useful, though.
cat /proc/stat
http://www.linuxhowtos.org/System/procstat.htm
I agree with this answer above. The cpu line in this file gives the total number of "jiffies" your system has spent doing different types of processing.
What you need to do is take 2 readings of this file, seperated by whatever interval of time you require. The numbers are increasing values (subject to integer rollover) so to get the %cpu you need to calculate how many jiffies have elapsed over your interval, versus how many jiffies were spend doing work.
e.g.
Suppose at 14:00:00 you have
cpu 4698 591 262 8953 916 449 531
total_jiffies_1 = (sum of all values) = 16400
work_jiffies_1 = (sum of user,nice,system = the first 3 values) = 5551
and at 14:00:05 you have
cpu 4739 591 289 9961 936 449 541
total_jiffies_2 = 17506
work_jiffies_2 = 5619
So the %cpu usage over this period is:
work_over_period = work_jiffies_2 - work_jiffies_1 = 68
total_over_period = total_jiffies_2 - total_jiffies_1 = 1106
%cpu = work_over_period / total_over_period * 100 = 6.1%
Try reading /proc/loadavg. The first three numbers are the number of processes actually running (i.e., using a CPU), averaged over the last 1, 5, and 15 minutes, respectively.
http://www.linuxinsight.com/proc_loadavg.html
Read /proc/cpuinfo to find the number of CPU/cores available to the systems.
Call the getloadavg() (or alternatively read the /proc/loadavg), take the first value, multiply it by 100 (to convert to percents), divide by number of CPU/cores. If the value is greater than 100, truncate it to 100. Done.
Relevant documentation: man getloadavg and man 5 proc
N.B. Load average, usual to *NIX systems, can be more than 100% (per CPU/core) because it actually measures number of processes ready to be run by scheduler. With Windows-like CPU metric, when load is at 100% you do not really know whether it is optimal use of CPU resources or system is overloaded. Under *NIX, optimal use of CPU loadavg would give you value ~1.0 (or 2.0 for dual system). If the value is much greater than number CPU/cores, then you might want to plug extra CPUs into the box.
Otherwise, dig the /proc file system.
cpu-stat is a C++ project that permits to read Linux CPU counter from /proc/stat .
Get CPUData.* and CPUSnaphot.* files from cpu-stat's src directory.
Quick implementation to get overall cpu usage:
#include "CPUSnapshot.h"
#include <chrono>
#include <thread>
#include <iostream>
int main()
{
CPUSnapshot previousSnap;
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
CPUSnapshot curSnap;
const float ACTIVE_TIME = curSnap.GetActiveTimeTotal() - previousSnap.GetActiveTimeTotal();
const float IDLE_TIME = curSnap.GetIdleTimeTotal() - previousSnap.GetIdleTimeTotal();
const float TOTAL_TIME = ACTIVE_TIME + IDLE_TIME;
int usage = 100.f * ACTIVE_TIME / TOTAL_TIME;
std::cout << "total cpu usage: " << usage << " %" << std::endl;
}
Compile it:
g++ -std=c++11 -o CPUUsage main.cpp CPUSnapshot.cpp CPUData.cpp
cat /proc/stat
http://www.linuxhowtos.org/System/procstat.htm
I suggest two files to starting...
/proc/stat and /proc/cpuinfo.
http://www.mjmwired.net/kernel/Documentation/filesystems/proc.txt
have a look at this C++ Lib.
The information is parsed from /proc/stat. it also parses memory usage from /proc/meminfo and ethernet load from /proc/net/dev
----------------------------------------------
current CPULoad:5.09119
average CPULoad 10.0671
Max CPULoad 10.0822
Min CPULoad 1.74111
CPU: : Intel(R) Core(TM) i7-10750H CPU # 2.60GHz
----------------------------------------------
network load: wlp0s20f3 : 1.9kBit/s : 920Bit/s : 1.0kBit/s : RX Bytes Startup: 15.8mByte TX Bytes Startup: 833.5mByte
----------------------------------------------
memory load: 28.4% maxmemory: 16133792 Kb used: 4581564 Kb Memload of this Process 170408 KB
----------------------------------------------