Does perf reset counter after every sample? - profiling

In Collecting Samples sections of the perf documentation, we can specify the sampling rate using -F option. According to my understanding, if -F 250 is used, perf will sample the counters 250 times per second on average. Does this mean that after sampling the counter, does that counter get reset? For example, consider two samples below(assume these are the only samples generated for the function f1)
sample S1 reporting c1 cycles in function f1
sample S2 reporting c2 cycles in function f1
Suppose I have to calculate approximately how many cycles are spent in all calls of function f1(assume in the application above there is only one call to f1). Is it right to say the answer is c1 + c2(This is true if the counters are reset after every sample)?
Thanks

Related

Measuring FLOPS and memory traffic in a C++ program

I am trying to profile a C++ program. For the first step, I want to determine whether the program is compute-bound or memory-bound by the Roofline Model. So I need to measure the following 4 things.
W: # of computations performed in the program (FLOPs)
Q: # of bytes of memory accesses incurred in the program (Byte/s)
π: peak performance (FLOPs)
β: peak bandwidth (Byte/s)
I have tried to use Linux perf to measure W. I followed the instructions here, using libpfm4 to determine the available events (by ./showevinfo). I found my CPU supports the INST_RETIREDevent with umask X87, then I used ./check_events INST_RETIRED:X87 to find the code, which is 0x5302c0. Then I tried perf stat -e r5302c0 ./test_exe and I got
Performance counter stats for './test_exe':
83,381,997 r5302c0
20.134717382 seconds time elapsed
74.691675000 seconds user
0.357003000 seconds sys
Questions:
Is it right for my process to measure the W of my program? If yes, then it should be 83,381,997 FLOPs, right?
Why is this FLOPs not stable between repeated executions?
How can I measure the other Q, π and β?
Thanks for your time and any suggestions.

Reasons for variations in run-time performance of identical code

when running some benchmarks of my C++ software, I obtained the following picture:
The plot shows the execution time in nanoseconds of one tick of the software.
The exact same tick is ran each time (one data point is one tick).
When testing in the simulated environment of valgrind, there is zero difference between each tick, and I don't have syscalls others than what clock_gettime may do.
I would like to understand what can cause the two "speeds" in which the tick seems to run. I disabled intel CPU sleep states which greatly helped (before that I had 4 lines like this), and what could be the causes for the outlier points. The scheduler used is linux's FIFO scheduler.
An interesting observation is that the tick times alternate between the two values of 9000 and 6700 ns; here's some data points:
9022
6605
9170
6756
9126
6594
9102
6744
9016
6643
8950
6638
9047
6662
edit:
just doing this in a loop in my thread:
auto t0 = std::chrono::high_resolution_clock::now();
auto t1 = std::chrono::high_resolution_clock::now();
measure(std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()));
is enough for me to see the alternating effect, though between 100 and 150 ns.

FIO runtime different than gettimeofday()

I am trying to measure the execution time of FIO benchmark. I am, currently, doing so wrapping the FIO call between gettimeofday():
gettimeofday(&startFioFix, NULL);
FILE* process = popen("fio --name=randwrite --ioengine=posixaio rw=randwrite --size=100M --direct=1 --thread=1 --bs=4K", "r");
gettimeofday(&doneFioFix, NULL);
and calculate the elapsed time as:
double tstart = startFioFix.tv_sec + startFioFix.tv_usec / 1000000.;
double tend = doneFioFix.tv_sec + doneFioFix.tv_usec / 1000000.;
double telapsed = (tend - tstart);
Now, the question(s) is
telapsed time is different (larger) than the runt by FIO output. Can you please help me in understanding Why? as the fact can be seen in FIO output:
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=posixaio, iodepth=1
fio-2.2.8
Starting 1 thread
randwrite: (groupid=0, jobs=1): err= 0: pid=3862: Tue Nov 1 18:07:50 2016
write: io=102400KB, bw=91674KB/s, iops=22918, runt= 1117msec
...
and the telapsed is:
telapsed: 1.76088 seconds
what is the actual time taken by FIO execution:
a) runt given by FIO, or
b) the elapsed time by getttimeofday()
How does FIO measure its runt? (probably, this question linked to 1.)
PS: I have tried to replace the gettimeofday(with std::chrono::high_resolution_clock::now()), but it also behaves the same (by same, I mean it also gives larger elapsed time than runt)
Thank you in advance, for your time and assistance.
A quick point:gettimeofday() on Linux uses a clock that doesn't necessarily tick at a constant interval and can even move backwards (see http://man7.org/linux/man-pages/man2/gettimeofday.2.html and https://stackoverflow.com/a/3527632/4513656 ) - this may make telapsed unreliable (or even negative).
Your gettimeofday/popen/gettimeofday measurement (telapsed) is going to be: the fio process start up (i.e. fork+exec on Linux) elapsed + fio initialisation (e.g. thread creation because I see --thread, ioengine initialisation) + fio job elapsed (runt) + fio stopping elapsed + process stop elapsed). You are comparing this to just runt which is a sub component of telapsed. It is unlikely all the non-runt components are going to happen instantly (i.e. take up 0 usecs) so the expectation is that runt will be smaller than telapsed. Try running fio with --debug=all just to see all the things it does in addition to actually submitting I/O for the job.
This is difficult to answer because it depends on what you want you mean when you say "fio execution" and why (i.e. the question is hard to interpret in an unambiguous way). Are you interested in how long fio actually spent trying to submit I/O for a given job (runt)? Are you interested in how long it takes your system to start/stop a new process that just so happens to try and submit I/O for a given period (telapsed)? Are you interested in how much CPU time was spent submitting I/O (none of the above)? So because I'm confused I'll ask you some questions instead: what are you going to use the result for and why?
Why not look at the source code? https://github.com/axboe/fio/blob/7a3b2fc3434985fa519db55e8f81734c24af274d/stat.c#L405 shows runt comes from ts->runtime[ddir]. You can see it is initialised by a call to set_epoch_time() (https://github.com/axboe/fio/blob/6be06c46544c19e513ff80e7b841b1de688ffc66/backend.c#L1664 ), is updated by update_runtime() ( https://github.com/axboe/fio/blob/6be06c46544c19e513ff80e7b841b1de688ffc66/backend.c#L371 ) which is called from thread_main().

How can I periodically execute some function if this function takes along time to run (less than peroid)

I want to run a function for example func() exactly 1 time per second. However the running time of func() is about 500 ms. How Can I do that? I know if the running time of the function is low, I can write a while loop in func() and sleep() for 1 second after each execution. But now, the running time is high. What should I do to ensure the func() run exactly 1 time per second? Thanks.
Yo do:
Take the current time in start_time.
Perform your job
Take the current time in end_time
Wait for (1 second + start_time - end_time)
That way, you can perform your tasks every seconds reliably. If the task takes less time, you will wait longer and vice versa. Note however that this assumes that your task takes always less than 1 sec. to execute. In the real code, you want to check for that before the sleep statement.
Implementation details depend on the platform.
Note that using this method still results in a small drift due to the time it takes to compute step 4. A more accurate alternative would be to synchronize on integer multiple of one second. That way, over 1000s of cycles you would not drift.
It depends on the level of accuracy you need.
If you want a brute, easy to code solution, you can get the time before first run of the function and save it in some variable (start_time). Create repeat index count variable (repeat_number) that stores next repeat number. Then you can do kinda this:
1) next_run_time = ++repeat_number*1sec + start_time;
2) func();
3) wait_time = next_run_time - current_time;
4) sleep(wait_time)
5) goto 1;
This approach disables accumulation of time error on each iteration.
But for the real application you should find some event framework or library.

Limit iterations per time unit

Is there a way to limit iterations per time unit? For example, I have a loop like this:
for (int i = 0; i < 100000; i++)
{
// do stuff
}
I want to limit the loop above so there will be maximum of 30 iterations per second.
I would also like the iterations to be evenly positioned in the timeline so not something like 30 iterations in first 0.4s and then wait 0.6s.
Is that possible? It does not have to be completely precise (though the more precise it will be the better).
#FredOverflow My program is running
very fast. It is sending data over
wifi to another program which is not
fast enough to handle them at the
current rate. – Richard Knop
Then you should probably have the program you're sending data to send an acknowledgment when it's finished receiving the last chunk of data you sent then send the next chunk. Anything else will just cause you frustrations down the line as circumstances change.
Suppose you have a good Now() function (GetTickCount() is bad example, it's OS specific and has bad precision):
for (int i = 0; i < 1000; i++){
DWORD have_to_sleep_until = GetTickCount() + EXPECTED_ITERATION_TIME_MS;
// do stuff
Sleep(max(0, have_to_sleep_until - GetTickCount()));
};
You can check elapsed time inside the loop, but it may be not an usual solution. Because computation time is totally up to the performance of the machine and algorithm, people optimize it during their development time(ex. many game programmer requires at least 25-30 frames per second for properly smooth animation).
easiest way (for windows) is to use QueryPerformanceCounter(). Some pseudo-code below.
QueryPerformanceFrequency(&freq)
timeWanted = 1.0/30.0 //time per iteration if 30 iterations / sec
for i
QueryPerf(count1)
do stuff
queryPerf(count2)
timeElapsed = (double)(c2 - c1) * (double)(1e3) / double(freq) //time in milliseconds
timeDiff = timeWanted - timeElapsed
if (timeDiff > 0)
QueryPerf(c3)
QueryPerf(c4)
while ((double)(c4 - c3) * (double)(1e3) / double(freq) < timeDiff)
queryPerf(c4)
end for
EDIT: You must make sure that the 'do stuff' area takes less time than your framerate or else it doesn't matter. Also instead of 1e3 for milliseconds, you can go all the way to nanoseconds if you do 1e9 (if you want that much accuracy)
WARNING... this will eat your CPU but give you good 'software' timing... Do it in a separate thread (and only if you have more than 1 processor) so that any guis wont lock. You can put a conditional in there to stop the loop if this is a multi-threaded app too.
#FredOverflow My program is running very fast. It is sending data over wifi to another program which is not fast enough to handle them at the current rate. – Richard Knop
What you might need a buffer or queue at the receiver side. The thread that receives the messages from the client (like through a socket) get the message and put it in the queue. The actual consumer of the messages reads/pops from the queue. Of course you need concurrency control for your queue.
Besides the flow control methods mentioned, if you also have the need to maintain an accurate specific data sending rate in your sender part. Usually it can be done like this.
E.x. if you want to send at 10Mbps, create a timer of interval 1ms so it will call a predefined function every 1ms. Then in the timer handler function, by keep tracking of 2 static variables 1)Time elapsed since beginning of sending data 2)How much data in bytes have been sent up to last call, you can easily calculate how much data is needed to be sent in the current call (or just sleep and wait for next call).
By this way, you can do "streaming" of data in a very stable way with very little jitterness, and this is usually adopted in streaming of videos. Of course it also depends on how accurate the timer is.