Best event counter to use for measuring wall clock time using perf tools - profiling

A simple yet complicated question:
Which counter should I use to get perf tools to measure wall clock time?
As a baseline, the first thing I think I need to measure when profiling code is plain wall clock time, to get a first idea of where the code spends most of its time.
I don't care whether it's I/O-bound, bandwidth-limited, or something else; I just want to know where it is slow.
It sounds like a simple requirement, but with all the tricks modern CPUs use to work efficiently (frequency scaling, etc.) and the huge number of different (and not so well documented) performance counters available in perf, it's not easy to be sure you're measuring the right thing.
Currently I do:
perf record -g -e ref-cycles -F 999 -- <cmd>
I think this counts cycles at the unscaled reference frequency and is thus proportional to the amount of wall clock time during which that part of the code is running. But who the hell knows?

You can use task-clock.
This is explicitly wall clock time while the process is running and, as a bonus, it is portable because it doesn't rely on any PMU event.
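For example, keeping the sampling setup from the question and just swapping the event (the command line is otherwise unchanged):
perf record -g -e task-clock -F 999 -- <cmd>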

Related

Profiling: How to discover where/why my program is sleeping

My company's codebase has a unit test that takes unreasonably long (5 minutes wall clock time, 200 ms CPU time). It is not I/O bound, so it is probably sleeping/waiting somewhere. How would one go about discovering where?
I can't do a textual search: the actual repository is huge and has legitimate reasons to sleep, so I'd get too many false positives.
Profilers I have tried (perf, valgrind, gprof) all seem to focus on finding functions that use a lot of CPU time, not wall clock time.
I ended up doing strace -r -k and then looking for slow system calls, but surely there are more convenient approaches?
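For reference, that approach boils down to something like this (the output file name and test binary are illustrative): -r prefixes each line with the time elapsed since the previous system call, and -k appends a stack trace, so unusually large deltas point at the blocking call:
strace -r -k -o trace.txt ./slow_unit_test
Then search trace.txt for lines with large relative timestamps and read off the stack trace next to them.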

Is there any usage for letting a process "warm up"?

I recently did some digging into memory and how to use it properly. Of course, I also stumbled upon prefetching and how I can make life easier for the CPU.
I ran some benchmarks to see the actual benefits of proper storage/access of data and instructions. These benchmarks showed not only the expected benefits of helping the CPU prefetch, but also that the process speeds up further during runtime. After about 100 program cycles, the CPU seems to have figured things out and optimized the cache accordingly, which saves me up to 200,000 ticks per cycle; the number drops from around 750,000 to 550,000. I got these numbers using QTestLib.
Now to the question: is there a safe way to exploit this runtime speedup by letting the process warm up, so to speak? Or should one not count on it at all and just build faster code from the start?
First of all, there is generally no gain in trying to warm up a process prior to normal execution: that would only speed up the first 100 or so program cycles in your case, gaining a total of at most 100 × 200,000 = 20,000,000 ticks. That's much less than the roughly 100 × 750,000 = 75,000,000 ticks you would have to invest in the warming up.
Second, all these gains from warming up a process/cache/whatever are rather brittle. There are a number of events that destroy the warming effect, and you generally don't control them. Mostly they come from your process not being alone on the system: a process switch can behave pretty much like an asynchronous cache flush, and whenever the kernel needs a page of memory, it may drop a page from the disk cache.
Since these factors make computing time pretty unpredictable, they need to be controlled when running benchmarks that are supposed to produce reliable results. Apart from that, these effects are mostly ignored.
It is important to note that keeping the CPU busy isn't necessarily a bad thing. Ideally you want your CPU to run anywhere from 60% to 100%, because that means your computer is actually doing "work". Granted, if there is a process you are unaware of that is taking up CPU cycles, that isn't good.
In answer to your question, the machine usually takes care of this.

What is the most accurate way to estimate the time performance of a code in C++

I am writing C++ code that runs on Ubuntu, and I use pthreads as well. I am doing research in the area of algorithm performance.
I have an algorithm that I have improved on; it can run for 6-10 hours. The time measurements I am taking also include very fine-grained pieces, on the order of milliseconds.
Also, the computer I am running it on is running other processes as well, so how do I make sure that the measured time does not include time spent processing other processes?
There is no single "most accurate way". As with any measurement, you will have to define what you want to measure first. If you just want to measure execution time, repeatedly running the same task and timing it is appropriate.
If you want to measure the time the CPU(s) is (are) actually busy doing your task, top might be a program of interest to you.
If you need to know how much time is spent in which subroutine, the linux-tools package contains perf, which can measure individual call times.
Often, non-CPU latencies dominate runtime. For example, it doesn't make sense to measure only the time spent by the CPU when you're waiting on network input or hard drive data.
So: your question is actually a question for yourself: What do you want to measure? "Algorithm Performance" suggests you're doing computer science, in which case you should have access to extensive literature that explains what might be of interest. There's no single "solution" to the question "what is the most accurate measurement", unless you define "measurement" more closely; that's your job, and usually it's the harder part of measuring things.
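As a minimal sketch of the "repeatedly run the task and time it" approach from the first paragraph above (run_algorithm and the run count are placeholders):

#include <chrono>
#include <iostream>

// Placeholder for the code under test.
void run_algorithm() {
    volatile long s = 0;
    for (long i = 0; i < 100000000; ++i) s += i;
}

int main() {
    using clock = std::chrono::steady_clock;
    const int runs = 10;                        // repeat to average out noise
    auto start = clock::now();
    for (int i = 0; i < runs; ++i) run_algorithm();
    std::chrono::duration<double> total = clock::now() - start;
    std::cout << "average wall time per run: " << total.count() / runs << " s\n";
}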

What should I check: cpu time or wall time?

I have two algorithms that do the same task. To examine their performance, what should I check: CPU time or wall time? I think it is CPU time, right?
I am parallelizing my code. To check the performance of my parallelization, what should I check: CPU time or wall time? I think it is wall time, right?
Assume I have done an ideal parallelization using multiple threads. I think the CPU time for 1 thread will be the same as for 8 threads, and the wall time for 1 thread will be 8 times longer than for 8 threads. Is that right?
Also, is there an easy way to check those times?
The answer depends on what you're really trying to measure.
If you have a couple of small code sequences where each runs on a single CPU (i.e., it's basically single-threaded) and you want to know which is faster, you probably want CPU time. This will tell you the time taken to execute that code, without counting other things like I/O, task switches, time spent on other processes, interrupt handling, etc. [Note: although it attempts to ignore other factors, you'll still usually get the most accurate results with the system otherwise as quiescent as possible.]
If you're writing multi-threaded code and want to measure how well you're distributing your code across processors/cores, you'll probably measure both CPU time and wall time, and compare the two. If, for example, you have 4 cores available, your ideal would be that the wall time is 1/4th the CPU time.
So, for multithreaded code you'll often end up doing things in two phases: first you look at the time to execute on one thread, using CPU time, and optimize to get that to a (reasonable) minimum. Then, in the second phase, you compare wall time to CPU time to try to use multiple cores efficiently. Since changing one often affects the other, you may well iterate through the two a number of times (and often compromise between the two to some degree).
Just as a really general rule of thumb, you tend to use CPU time for microscopic benchmarks of individual bits of code, and wall time for larger (system-level) benchmarks. In other words, when you want to measure how fast one piece of code runs, and nothing else, CPU time generally makes the most sense. When you want to include the effects of things like disk I/O time, caching, etc., then you're much more likely to care about wall time.
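A small sketch of comparing the two on Linux (as in the earlier pthreads question; the worker function and thread count are placeholders): CLOCK_PROCESS_CPUTIME_ID sums CPU time across all threads, while CLOCK_MONOTONIC gives wall time, so with ideal scaling on 4 cores the wall time comes out at roughly a quarter of the CPU time. Build with -pthread:

#include <time.h>
#include <cstdio>
#include <thread>
#include <vector>

// Placeholder CPU-bound work.
static void worker() {
    volatile double x = 0;
    for (long i = 0; i < 200000000; ++i) x += i * 0.5;
}

static double seconds(clockid_t id) {
    timespec ts;
    clock_gettime(id, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main() {
    double cpu0  = seconds(CLOCK_PROCESS_CPUTIME_ID);
    double wall0 = seconds(CLOCK_MONOTONIC);

    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();

    double cpu  = seconds(CLOCK_PROCESS_CPUTIME_ID) - cpu0;
    double wall = seconds(CLOCK_MONOTONIC) - wall0;
    std::printf("cpu: %.2f s  wall: %.2f s  speedup: %.2f\n", cpu, wall, cpu / wall);
}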
Wall time tells you how long your computer took, but it doesn't tell you how long the execution of your code itself took, since that also depends on other things that were keeping your computer busy.
There are different mechanisms for measuring the CPU time spent executing your code; I personally like getrusage().
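For example, a minimal getrusage() sketch (the workload is a placeholder); RUSAGE_SELF reports on the calling process, with user and system CPU time returned in ru_utime and ru_stime:

#include <sys/resource.h>
#include <cstdio>

static double cpu_seconds() {
    rusage u;
    getrusage(RUSAGE_SELF, &u);
    return u.ru_utime.tv_sec + u.ru_stime.tv_sec
         + (u.ru_utime.tv_usec + u.ru_stime.tv_usec) / 1e6;
}

int main() {
    double before = cpu_seconds();
    volatile long s = 0;                  // placeholder workload
    for (long i = 0; i < 100000000; ++i) s += i;
    std::printf("CPU time used: %.3f s\n", cpu_seconds() - before);
}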

How to limit CPU usage of a specific process?

How can I limit CPU usage to 10%, for example, for a specific process in Windows C++?
You could use Sleep(x), where x is the time in milliseconds; it will slow down your program's execution but free up CPU cycles.
This is rarely needed, and maybe thread priorities are a better solution, but since you asked, what you should do is:
1. Do a small fraction of your "solid" work, i.e. calculations.
2. Measure how much time step 1 took; let's say it's twork milliseconds.
3. Sleep() for (100/percent - 1)*twork milliseconds, where percent is your desired load.
4. Go back to step 1.
In order for this to work well, you have to be careful in selecting how big a "fraction" of the calculation is, and some tasks are hard to split up. A single fraction should take somewhere between 40 and 250 milliseconds or so: if it takes less, the overhead from sleeping and measuring may become significant; if it takes more, the illusion of using 10% CPU will disappear and it will seem like your thread is oscillating between 0% and 100% CPU (which is what happens anyway, but if you alternate fast enough it looks like you only use whatever percentage you chose). Two additional things to note: first, as mentioned above, this works on the thread level, not the process level; second, the work must be real CPU work, since disk/device/network I/O usually involves a lot of waiting and doesn't take as much CPU. A sketch of the loop is below.
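A sketch of that loop for Windows (do_work_chunk is a placeholder for one fraction of the real calculation; percent = 10 matches the 10% target from the question):

#include <windows.h>

// Placeholder: one "fraction" of the real work, ideally 40-250 ms of
// solid CPU time as discussed above.
static void do_work_chunk() {
    volatile double x = 0;
    for (long i = 0; i < 50000000; ++i) x += i * 0.5;
}

int main() {
    const double percent = 10.0;                      // desired CPU load
    for (int i = 0; i < 100; ++i) {                   // steps 1-4, repeated
        ULONGLONG t0 = GetTickCount64();
        do_work_chunk();                              // step 1: do the work
        ULONGLONG twork = GetTickCount64() - t0;      // step 2: measure it
        // step 3: sleep (100/percent - 1)*twork, i.e. 9x twork for 10% load
        Sleep(static_cast<DWORD>((100.0 / percent - 1.0) * twork));
    }
}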
That is the job of the OS; you cannot control it directly.
You can't limit it to exactly 10%, but you can reduce its priority and restrict it to use only one CPU core.