Why is boost::auto_cpu_timer showing > 100% utilization? - c++

To do timing comparisons I wanted to use boost::timer. Here is a simple test case that performs some vector operations:
std::vector<float> hv( 1000*1000 );
std::generate(hv.begin(), hv.end(), rand);
{
boost::timer::auto_cpu_timer t;
std::transform(hv.begin(), hv.end(), hv.begin(), sqrtf);
}
The confusing part is that boost::timer reports this:
0.011577s wall, 0.020000s user + 0.000000s system = 0.020000s CPU (172.8%)
How can my userspace time exceed wall time?

Most likely, if you use threads, it displays the CPU time summed over all threads in the process.

After adding more test code, the user time jumps to 0.03s and then to 0.04s.
So it looks like the user time is only accurate to within 10 ms, which makes the CPU utilization calculation wrong for such a short interval.

Related

Measuring FLOPS and memory traffic in a C++ program

I am trying to profile a C++ program. For the first step, I want to determine whether the program is compute-bound or memory-bound by the Roofline Model. So I need to measure the following 4 things.
W: # of computations performed in the program (FLOPs)
Q: # of bytes of memory accesses incurred in the program (Bytes)
π: peak performance (FLOP/s)
β: peak bandwidth (Byte/s)
I have tried to use Linux perf to measure W. I followed the instructions here, using libpfm4 to determine the available events (with ./showevinfo). I found that my CPU supports the INST_RETIRED event with umask X87, so I used ./check_events INST_RETIRED:X87 to find the code, which is 0x5302c0. Then I ran perf stat -e r5302c0 ./test_exe and got
Performance counter stats for './test_exe':
83,381,997 r5302c0
20.134717382 seconds time elapsed
74.691675000 seconds user
0.357003000 seconds sys
Questions:
Is this the right procedure for measuring the W of my program? If yes, then it should be 83,381,997 FLOPs, right?
Why is this FLOP count not stable between repeated executions?
How can I measure the others: Q, π and β?
Thanks for your time and any suggestions.

Monitor task CPU utilization in VxWorks while program is running

I'm running a VxWorks 6.9 OS embedded system and I need to see when I'm starving low priority tasks. Ideally I'd like to have CPU utilization by task so I know what is eating up all my CPU time.
I know this is a built in feature in many operating systems but have been so far unable to find it for VxWorks 6.9.
If I can't measure by task, I'd like to at least see what percentage of the time the CPU is idle.
To that end I've been trying to make a lowest priority task that will run the function below that would try to measure it indirectly.
float Foo::IdleTime(Foo* f)
{
    bool inIdleTask;
    float startTime;   // was used but never declared
    float returnTime;  // was used but never declared
    float timeIdle;
    float totalTime;
    float percentIdle;
    while(true)
    {
        startTime = _time(); //get time before measurement starts
        inIdleTask = true;
        timeIdle = 0;
        while(inIdleTask) // I have no clue how to detect when the task left and set this to false
        {
            timeIdle += (amount_of_time_for_inner_loop); //measure idle time
        }
        returnTime = _time(); //get time after you return to IdleTime task
        totalTime = ( returnTime - startTime );
        percentIdle = ( timeIdle / totalTime ) * 100; //calculate percentage of idle time
        //logic to report percentIdle
    }
}
The big problem with this concept is I don't know how I would detect when this task is left for a higher priority task.
If you are looking for a one-time measurement done during development, then spyLib is what you are looking for. Simply call spy from the command line to get a per-task CPU usage report at 10s intervals. Call spyHelp to learn how to configure spy. (You might need to include spyLib in the kernel if it is not already included.)
If you want to go the extra mile, taskHookLib is what you need. Simply put, you hook a function to be called on every task switch. The call gives you the TASK_IDs of the tasks going in and out of the CPU. You can either simply monitor the starvation of low-priority tasks or take action and increase their priority temporarily.
From experience, spy adds a little performance overhead, especially if stdout goes to slow I/O (e.g. a 9600 baud serial line), but it is fairly easy to use. Task hooking adds little to no overhead if you are not immediately printing the results on the terminal, but it takes a bit of programming to get it running.
Another thing that might be of interest is WindRiver's remote debugger. I haven't used that one personally; I imagine it would require setting up the Workbench and the target properly.

Best option to profile CPU use in my program?

I am profiling CPU usage on a simple program I am writing. I have different algorithms I want to try, and I also want to know what's the impact on the total system performance.
Currently, I am using ualarm() to execute some instructions at 30Hz; every 15 of those interruptions (every 0.5s) I record the CPU time with getrusage() (in microseconds), so I have an estimate of the total CPU time consumed up to that point. But to get context, I also need to know the total time elapsed in the system over that period, so I can compute what percentage of it is used by my program.
/* Main Loop */
while(1)
{
alarm = 0;
/* Waiting Loop: */
for(i=0; !alarm; i++){
}
count++;
/* Do my things */
/* Check if it's time to store cpu log: */
if ((count%count_max) == 0)
{
getrusage(RUSAGE_SELF, &ru);
store_cpulog(f,
(int64_t) ru.ru_utime.tv_sec,
(int64_t) ru.ru_utime.tv_usec,
(int64_t) ru.ru_stime.tv_sec,
(int64_t) ru.ru_stime.tv_usec);
}
}
I have different options, but I don't know which one will provide the most exact result:
Use ualarm for the timing. Currently it's programmed to signal every 0.5 seconds, so I can take those 0.5 seconds as the elapsed time. It seems the obvious choice, but is it the best option?
Use clock_gettime(CLOCK_MONOTONIC): it provides readings with a nanosec resolution.
Use gettimeofday(): provides readings with a usec resolution. I've found opinions against using it.
Any recommendation? Thanks.
A possible solution is to use the system's time command and to avoid the busy loop in your program (as @Hasturkun says). Run this in a console:
time /path/to/my/program
After it finishes, you get something like:
real 0m1.465s
user 0m0.000s
sys 0m1.210s
I'm not sure whether the precision is enough for you.
Callgrind is possibly the best application for profiling C/C++ code under linux. Use it with pride:)

C++ timing method with lower sys time

I'm trying to make my program self timed and I know two methods
1) using getrusage
struct rusage startu; struct rusage endu;
getrusage(RUSAGE_SELF, &startu);
//Do computation here
getrusage(RUSAGE_SELF, &endu);
double start_sec = startu.ru_utime.tv_sec + startu.ru_utime.tv_usec/1000000.0;
double end_sec = endu.ru_utime.tv_sec + endu.ru_utime.tv_usec/1000000.0;
double duration = end_sec - start_sec;
This fetches the user time of a program segment.
2) using clock(), which gets the processor's executing time
double start_sec = (double)clock()/CLOCKS_PER_SEC;
//Do computation here
double end_sec = (double)clock()/CLOCKS_PER_SEC;
double duration = end_sec - start_sec;
This fetches the real time of a program segment.
However, I get really long sys time for both methods. The user time is also longer than without these timings. System time sometimes even doubles the user time.
For example, I'm working on the Traveling Salesman Problem. For an input that normally runs in around 3 seconds of both user and real time, these two timing methods push the user time to over 5 seconds and the real time to over 15 seconds, which means the sys time is around 10 seconds.
I'd like to know whether there are improvements or other libraries that can shorten the sys time and, if possible, the user time. If I have to use other libraries, I want ones that cover both user-time and real-time measurement.
Thanks for any advice!
I suggest carefully reading the time(7) man page and also considering the clock_gettime(2) syscall.

Strange boost::this_thread::sleep behavior

I was trying to implement small time delays in multithreading code using boost::this_thread::sleep.
Here is code example:
{
boost::timer::auto_cpu_timer t;//just to check sleep interval
boost::this_thread::sleep(boost::posix_time::milliseconds(25));
}
The output generated by auto_cpu_timer confused me a little bit:
0.025242s wall, 0.010000s user + 0.020000s system = 0.030000s CPU (118.9%)
Why is it 0.025242s and not 0.0025242s?
Because 25 milliseconds is 0.025 seconds; 0.0025 seconds would be 2.5 milliseconds.