What should I check: cpu time or wall time? - c++

I have two algorithms to do the same task. To examine their performance, what should I check: cpu time or wall time? I think it is cpu time, right?
I am parallelizing my code. To check how well the parallel version performs, what should I check: cpu time or wall time? I think it is wall time, right?
Assume I have achieved ideal parallelism using multiple threads. I think the cpu time for 1 thread will be the same as for 8 threads, and the wall time for 1 thread will be 8 times longer than for 8 threads. Is that right?
Also any easy way to check those times?

The answer depends on what you're really trying to measure.
If you have a couple of small code sequences where each runs on a single CPU (i.e., it's basically single-threaded) and you want to know which is faster, you probably want CPU time. This will tell you the time taken to execute that code, without counting other things like I/O, task switches, time spent on other processes, interrupt handling, etc. [Note: although it attempts to ignore other factors, you'll still usually get the most accurate results with the system otherwise as quiescent as possible.]
If you're writing multi-threaded code and want to measure how well you're distributing your code across processors/cores, you'll probably measure both CPU time and wall time, and compare the two. If, for example, you have 4 cores available, your ideal would be that the wall time is 1/4th the CPU time.
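To make that comparison concrete, here is a minimal sketch of the arithmetic. The helper names `average_cores_used` and `parallel_efficiency` are made up for illustration; they just divide the two measurements as described above.

```cpp
#include <stdexcept>

// Hypothetical helper: average number of cores kept busy during the run.
// With ideal scaling on N cores this approaches N.
double average_cores_used(double cpu_seconds, double wall_seconds) {
    if (wall_seconds <= 0.0) {
        throw std::invalid_argument("wall_seconds must be positive");
    }
    return cpu_seconds / wall_seconds;
}

// Hypothetical helper: fraction of the available cores actually used.
// 1.0 means wall time is exactly CPU time divided by the core count.
double parallel_efficiency(double cpu_seconds, double wall_seconds, unsigned cores) {
    if (cores == 0) {
        throw std::invalid_argument("cores must be positive");
    }
    return average_cores_used(cpu_seconds, wall_seconds) / cores;
}
```

For example, 8 seconds of CPU time in 2 seconds of wall time on 4 cores gives an efficiency of 1.0 (the ideal case), while the same CPU time in 4 seconds of wall time gives 0.5.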
So, for multithreaded code you'll often end up doing things in two phases: first you look at the time to execute on one thread, using CPU time. You optimize to get that to a (reasonable) minimum. Then in the second phase, you compare wall time to CPU time, to try to use multiple cores efficiently. Since changing one often affects the other, you may well iterate through the two a number of times (and often compromise between the two to some degree).
Just as a really general rule of thumb, you tend to use CPU time to measure microscopic benchmarks of individual bits of code, and wall time for larger (system-level) benchmarks. In other words, when you want to measure how fast one piece of code runs, and nothing else, CPU time generally makes the most sense. When you want to include the effects of things like disk I/O time, caching, etc., then you're a lot more likely to care about wall time.

Wall time tells you how long your computer took. But it doesn't tell you anything about how long the execution of your code took as it depends on other things that were keeping your computer busy.
There are different mechanisms for measuring the CPU time spent executing your code. I personally like getrusage().
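A minimal sketch of reading CPU time with getrusage() might look like the following. It is POSIX-only, and the `CpuUsage` struct and `process_cpu_usage` wrapper are names made up here, not part of any library.

```cpp
#include <sys/resource.h>
#include <sys/time.h>

// Hypothetical holder for the two CPU-time figures getrusage() reports.
struct CpuUsage {
    double user_seconds;    // time spent executing in user mode
    double system_seconds;  // time spent in the kernel on the process's behalf
};

// Fills `out` with the CPU time consumed so far by the calling process.
// Returns false if getrusage() fails.
bool process_cpu_usage(CpuUsage& out) {
    struct rusage usage{};
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return false;
    }
    out.user_seconds = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1e6;
    out.system_seconds = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1e6;
    return true;
}
```

Call it before and after the code you care about and subtract; the difference is the CPU time that code consumed, split into user and system time.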

Related

Worse performance with multiple threads in C++ [duplicate]

There is a really interesting note here: http://en.cppreference.com/w/cpp/chrono/c/clock
"Only the difference between two values returned by different calls to std::clock is meaningful, as the beginning of the std::clock era does not have to coincide with the start of the program. std::clock time may advance faster or slower than the wall clock, depending on the execution resources given to the program by the operating system. For example, if the CPU is shared by other processes, std::clock time may advance slower than wall clock. On the other hand, if the current process is multithreaded and more than one execution core is available, std::clock time may advance faster than wall clock."
Why does the clock speed up with multithreading? I'm checking the performance of a C++ program with threading vs without it, and I'm noticing that the reported times are similar with threading (not better) even though the program feels faster (for example, it reports 8 seconds after about 3 seconds of actual runtime).
If more than one core is available, and you are running multiple threads, then potentially multiple threads are executing at the same time on different cores. Since clock() measures processor time, it may advance faster than wallclock time, because multiple threads are advancing it simultaneously.
The example given in the documentation shows exactly this: two threads are created, and the clock() value reported is almost double the wall-clock time reported.
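A rough sketch of the same effect (not the cppreference example itself) is below. `spin_for` and `compare_clocks` are hypothetical names; each thread simply busy-waits so that it burns CPU for the whole interval.

```cpp
#include <chrono>
#include <ctime>
#include <thread>
#include <vector>

// Hypothetical result type for this sketch.
struct ClockComparison {
    double cpu_seconds;   // difference between two std::clock() readings
    double wall_seconds;  // difference between two steady_clock readings
};

// Busy-waits for roughly `duration` of wall time, using CPU the whole while.
void spin_for(std::chrono::milliseconds duration) {
    const auto deadline = std::chrono::steady_clock::now() + duration;
    while (std::chrono::steady_clock::now() < deadline) {
        // keep spinning
    }
}

// Runs `thread_count` spinning threads at once and measures both clocks.
ClockComparison compare_clocks(unsigned thread_count, std::chrono::milliseconds duration) {
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < thread_count; ++i) {
        workers.emplace_back(spin_for, duration);
    }
    for (auto& worker : workers) {
        worker.join();
    }

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    return {
        static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC,
        std::chrono::duration<double>(wall_end - wall_start).count()
    };
}
```

With two threads on a machine that has at least two free cores, `cpu_seconds` should come out close to twice `wall_seconds`. On a single core, or a heavily loaded machine, the two stay much closer together.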

Is there any usage for letting a process "warm up"?

I recently did some digging into memory and how to use it properly. Of course, I also stumbled upon prefetching and how I can make life easier for the CPU.
I ran some benchmarks to see the actual benefits of proper storage/access of data and instructions. These benchmarks showed not only the expected benefits of helping your CPU prefetch, it also showed that prefetching also speeds up the process during runtime. After about 100 program cycles, the CPU seems to have figured it out and has optimized the cache accordingly. This saves me up to 200.000 ticks per cycle, the number drops from around 750.000 to 550.000. I got these Numbers using the qTestLib.
Now to the Question: Is there a safe way to use this runtime-speedup, letting it warm up, so to speak? Or should one not calculate this in at all and just build faster code from the start up?
First of all, there is generally no gain in trying to warm up a process prior to normal execution: That would only speed up the first 100 program cycles in your case, gaining a total of less than 20000 ticks. That's much less than the 75000 ticks you would have to invest in the warming up.
Second, all these gains from warming up a process/cache/whatever are rather brittle. There are a number of events that destroy the warming effect and that you generally do not control. Mostly these come from your process not being alone on the system. A process switch can behave pretty much like an asynchronous cache flush, and whenever the kernel needs a page of memory, it may drop a page from the disk cache.
Since these factors make computing time pretty unpredictable, they need to be controlled when running benchmarks that are supposed to produce results of any reliability. Apart from that, these effects are mostly ignored.
It is important to note that keeping the CPU busy isn't necessarily a bad thing. Ideally you want your CPU to run anywhere from 60% to 100% because that means that your computer is actually doing "work". Granted, if there is a process that you are unaware of and that process is taking up CPU cycles, that isn't good.
In answer to your question, the machine usually takes care of this.

What is the most accurate way to estimate the time performance of a code in C++

I am writing C++ code that runs on Ubuntu. I use pthreads as well. I am doing my research in the area of algorithm performance.
I have an algorithm that I have improved on; it can run for 6 to 10 hours. But the time measurements I am taking also include things that are very small, on the order of milliseconds.
Also, the computer I am running it on is also running other processes, so how do I make sure that the time measured does not include the time processing for other processes.
There is no single "most accurate way". As with any measurement, you will have to define what you want to measure first. If you just want to measure execution time, repeatedly doing the same task and stopping the time is appropriate.
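One way to sketch "repeat the task and time it" is below. `RunStats` and `time_repeated` are names invented for this example, and reporting the minimum and median is just one reasonable choice for damping interference from other processes, not the only one.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical summary of several timed runs of the same task.
struct RunStats {
    double min_seconds;     // fastest run, least disturbed by other processes
    double median_seconds;  // middle run (upper middle for an even run count)
};

// Runs `task` `runs` times and measures the wall time of each run.
template <typename F>
RunStats time_repeated(F&& task, std::size_t runs) {
    if (runs == 0) {
        throw std::invalid_argument("runs must be positive");
    }
    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        task();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return {samples.front(), samples[samples.size() / 2]};
}
```

A large gap between the minimum and the median is a hint that something else on the machine was competing for time during some of the runs.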
If you want to measure the time the CPU(s) is (are) actually busy doing your task, top might be a program of interest to you.
If you need to know how much time is spent in which subroutine, the linux-tools package on Ubuntu contains perf, which can profile where time goes at the function level.
Often, non-CPU latencies dominate runtime. For example, it doesn't make sense to measure only the time spent by the CPU when you're waiting on network input or hard drive data.
So: your question is actually a question for yourself: What do you want to measure? "Algorithm Performance" suggests you're doing computer science, in which case you should have access to extensive literature that explains what might be of interest. There's no single "solution" to the question "what is the most accurate measurement", unless you define "measurement" more closely; that's your job, and usually it's the harder part of measuring things.

How limit cpu usage from specific process?

How can I limit the CPU usage of a specific process (to 10%, for example) on Windows in C++?
You could use Sleep(x) - will slow down your program execution but it will free up CPU cycles
Where x is the time in milliseconds
This is rarely needed, and maybe thread priorities are a better solution, but since you asked, what you should do is:
1. Do a small fraction of your "solid" work, i.e. calculations.
2. Measure how much time step 1 took; let's say it's twork milliseconds.
3. Sleep() for (100/percent - 1)*twork milliseconds, where percent is your desired load.
4. Go back to step 1.
In order for this to work well, you have to be really careful in selecting how big a "fraction" of a calculation is, and some tasks are hard to split up. A single fraction should take somewhere between 40 and 250 milliseconds or so. If it takes less, the overhead from sleeping and measuring might become significant. If it takes more, the illusion of using 10% CPU will disappear and it will seem like your thread is oscillating between 0 and 100% CPU (which is what happens anyway, but if you do it fast enough then it looks like you only take whatever percent you asked for). Two additional things to note: first, as mentioned before me, this works on a thread level and not a process level; second, your work must be real CPU work, since disk/device/network I/O usually involves a lot of waiting and doesn't take as much CPU.
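The work/measure/sleep loop above can be sketched roughly as follows. This version swaps the Windows `Sleep()` call for the portable `std::this_thread::sleep_for`, and `sleep_time_for_load`, `run_throttled` and the `do_chunk` callback are hypothetical names for this example. It assumes `do_chunk` performs one slice of pure CPU work and returns false once everything is done.

```cpp
#include <chrono>
#include <stdexcept>
#include <thread>

using Millis = std::chrono::duration<double, std::milli>;

// Sleep time that makes `work_time` the given percentage of one
// work+sleep cycle: (100/percent - 1) * twork.
Millis sleep_time_for_load(Millis work_time, double percent) {
    if (percent <= 0.0 || percent > 100.0) {
        throw std::invalid_argument("percent must be in (0, 100]");
    }
    return work_time * (100.0 / percent - 1.0);
}

// Repeatedly runs one chunk of work, measures it, then sleeps long enough
// to keep the calling thread near the requested load.
template <typename Chunk>
void run_throttled(Chunk&& do_chunk, double percent) {
    for (;;) {
        const auto start = std::chrono::steady_clock::now();
        const bool more_work = do_chunk();
        const Millis twork = std::chrono::steady_clock::now() - start;
        if (!more_work) {
            break;
        }
        std::this_thread::sleep_for(sleep_time_for_load(twork, percent));
    }
}
```

For a 50 ms chunk and a 10% target, the formula gives a 450 ms sleep, so the thread works for 50 ms out of every 500 ms.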
That is the job of the OS; you cannot control it directly.
You can't limit it to exactly 10%, but you can reduce its priority and restrict it to use only one CPU core.

Profiling program by type of activity

The output of a typical profiler is a list of functions in your code, sorted by the amount of time each function took while the program ran.
This is very good, but sometimes I'm more interested in what the program was doing most of the time than in where EIP was most of the time.
An example output of my hypothetical profiler is:
Waiting for file IO - 19% of execution time.
Waiting for network - 4% of execution time
Cache misses - 70% of execution time.
Actual computation - 7% of execution time.
Is there such a profiler? Is it possible to derive such an output from a "standard" profiler?
I'm using Linux, but I'll be glad to hear any solutions for other systems.
This is Solaris only, but dtrace can monitor almost every kind of I/O, on/off CPU, time in specific functions, sleep time, etc. I'm not sure if it can determine cache misses though, assuming you mean CPU cache - I'm not sure if that information is made available by the CPU or not.
Please take a look at this and this.
Consider any thread. At any instant of time it is doing something, and it is doing it for a reason, and slowness can be defined as the time it spends for poor reasons - it doesn't need to be spending that time.
Take a snapshot of the thread at a point in time. Maybe it's in a cache miss, in an instruction, in a statement, in a function, called from a call instruction in another function, called from another, and so on, up to call _main. Every one of those steps has a reason, that an examination of the code reveals.
If any one of those steps is not a very good reason and could be avoided, that instant of time does not need to be spent.
Maybe at that time the disk is coming around to certain sector, so some data streaming can be started, so a buffer can be filled, so a read statement can be satisfied, in a function, and that function is called from a call site in another function, and that from another, and so on, up to call _main, or whatever happens to be the top of the thread.
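The same point applies here: if any one of those steps does not have a very good reason and could be avoided, that instant of time does not need to be spent.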
So, the way to find bottlenecks is to find when the code is spending time for poor reasons, and the best way to find that is to take snapshots of its state. The EIP, or any other tiny piece of the state, is not going to do it, because it won't tell you why.
Very few profilers "get it". The ones that do are the wall-clock-time stack-samplers that report by line of code (not by function) percent of time active (not amount of time, especially not "self" or "exclusive" time.) One that does is Zoom, and there are others.
Looking at where the EIP hangs out is like trying to tell time on a clock with only a second hand. Measuring functions is like trying to tell time on a clock with some of the digits missing. Profiling only during CPU time, not during blocked time, is like trying to tell time on a clock that randomly stops running for long stretches. Being concerned about measurement precision is like trying to time your lunch hour to the second.
This is not a mysterious subject.