The output of a typical profiler is a list of functions in your code, sorted by the amount of time each function took while the program ran.
This is very useful, but sometimes I'm more interested in what the program was doing most of the time than in where the EIP spent most of the time.
An example output of my hypothetical profiler is:
Waiting for file IO - 19% of execution time.
Waiting for network - 4% of execution time.
Cache misses - 70% of execution time.
Actual computation - 7% of execution time.
Is there such a profiler? Is it possible to derive such an output from a "standard" profiler?
I'm using Linux, but I'll be glad to hear any solutions for other systems.
This is Solaris only, but dtrace can monitor almost every kind of I/O, on/off-CPU time, time in specific functions, sleep time, etc. I'm not sure if it can determine cache misses, though (assuming you mean CPU cache) - I don't know whether that information is made available by the CPU or not.
Please take a look at this and this.
Consider any thread. At any instant of time it is doing something, and it is doing it for a reason, and slowness can be defined as the time it spends for poor reasons - it doesn't need to be spending that time.
Take a snapshot of the thread at a point in time. Maybe it's in a cache miss, in an instruction, in a statement, in a function, called from a call instruction in another function, called from another, and so on, up to call _main. Every one of those steps has a reason that an examination of the code reveals.
If any one of those steps is not a very good reason and could be avoided, that instant of time does not need to be spent.
Maybe at that time the disk is coming around to a certain sector, so some data streaming can be started, so a buffer can be filled, so a read statement can be satisfied, in a function, and that function is called from a call site in another function, and that from another, and so on, up to call _main, or whatever happens to be the top of the thread.
The same reasoning applies: if any one of those steps is happening for a poor reason and could be avoided, that instant of time does not need to be spent.
So, the way to find bottlenecks is to find when the code is spending time for poor reasons, and the best way to find that is to take snapshots of its state. The EIP, or any other tiny piece of the state, is not going to do it, because it won't tell you why.
Very few profilers "get it". The ones that do are the wall-clock-time stack samplers that report, by line of code (not by function), the percent of time active (not the amount of time, and especially not "self" or "exclusive" time). One that does is Zoom, and there are others.
Looking at where the EIP hangs out is like trying to tell time on a clock with only a second hand. Measuring functions is like trying to tell time on a clock with some of the digits missing. Profiling only during CPU time, not during blocked time, is like trying to tell time on a clock that randomly stops running for long stretches. Being concerned about measurement precision is like trying to time your lunch hour to the second.
This is not a mysterious subject.
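For concreteness, here is a minimal sketch of wall-clock stack sampling on Linux, using an interval timer and glibc's backtrace(). A real sampler such as Zoom symbolizes and aggregates the samples for you; the 10 ms interval, the output format, and the placeholder workload here are all just illustrative.

```cpp
// Minimal wall-clock stack sampler sketch (Linux/glibc).
// SIGALRM fires on wall-clock time, so samples land in blocked
// (sleeping / IO-waiting) code as well as in CPU-bound code.
// Build with -g -rdynamic so backtrace_symbols_fd can print names.
#include <csignal>
#include <execinfo.h>
#include <sys/time.h>
#include <unistd.h>

static void sample_handler(int) {
    void* frames[64];
    int n = backtrace(frames, 64);
    // Dump raw frames; aggregate/symbolize offline
    // (backtrace_symbols() allocates and is not async-signal-safe).
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    write(STDERR_FILENO, "----\n", 5);
}

int main() {
    void* warm[1];
    backtrace(warm, 1);          // load libgcc up front, outside the handler

    signal(SIGALRM, sample_handler);

    itimerval tv{};
    tv.it_interval.tv_usec = 10000;   // sample every 10 ms of wall time
    tv.it_value.tv_usec    = 10000;
    setitimer(ITIMER_REAL, &tv, nullptr);

    // ... the program being profiled runs here ...
    for (;;) pause();            // placeholder workload
}
```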
Usually profile data is gathered by randomly sampling the stack of the running program to see which function is executing; over a long enough run it is possible to be statistically confident about which methods/function calls eat the most time and need intervention in case of bottlenecks.
However, this addresses overall application/game performance. Sometimes there are singular, isolated hiccups that cause usability trouble anyway (the user notices them, they introduce lag in some internal mechanism, etc.). With regular profiling over a few seconds of execution it is not possible to tell which code caused them. Even if a hiccup lasts long enough (say 30 ms, which is still not much for a sampler) to detect some method that is called too often, we will still miss the execution of many other methods that are simply "skipped" because of the random sampling.
So are there any techniques to profile these hiccups, in order to keep the framerate more stable after fixing that kind of "rare bottleneck"? I'm assuming the use of languages like C# or C++.
This has been answered before, but I can't find it, so here goes...
The problem is that the DrawFrame routine sometimes takes too long.
Suppose it normally takes less than 1000/30 = 33ms, but once in a while it takes longer than 33ms.
At the beginning of DrawFrame, set a timer interrupt that will expire after, say, 40ms.
Then at the end of DrawFrame, disable the interrupt.
So if it triggers, you know DrawFrame is taking an unusually long time.
Put a breakpoint in the interrupt handler, and when it gets there, examine the stack.
Chances are pretty good that you have caught it in the process of doing the costly thing.
That's a variation on random pausing.
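As a rough illustration of that arm/disarm pattern on POSIX (a sketch only: DrawFrame, its occasional hiccup, and the 40 ms budget are stand-ins, and on Windows you would use a watchdog thread or a timer queue instead):

```cpp
// Watchdog sketch for catching slow frames (POSIX).
// Arm a 40 ms wall-clock timer before DrawFrame(), disarm after.
// If the handler fires, DrawFrame() is still running and over budget:
// break here in a debugger (or raise(SIGTRAP)) and examine the stack.
#include <csignal>
#include <sys/time.h>
#include <unistd.h>

void DrawFrame() {                         // stand-in for the real frame routine
    static int frame = 0;
    usleep(++frame % 100 ? 10000 : 60000); // occasional 60 ms hiccup
}

static void frame_overrun(int) {
    const char msg[] = "DrawFrame exceeded 40 ms\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);   // or raise(SIGTRAP)
}

static void arm_watchdog(long ms) {
    itimerval tv{};
    tv.it_value.tv_sec  = ms / 1000;
    tv.it_value.tv_usec = (ms % 1000) * 1000;
    setitimer(ITIMER_REAL, &tv, nullptr);  // it_value == 0 disarms the timer
}

int main() {
    std::signal(SIGALRM, frame_overrun);
    for (;;) {
        arm_watchdog(40);
        DrawFrame();
        arm_watchdog(0);                   // disarm: the frame finished in time
    }
}
```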
I recently did some digging into memory and how to use it properly. Of course, I also stumbled upon prefetching and how I can make life easier for the CPU.
I ran some benchmarks to see the actual benefits of proper storage/access of data and instructions. These benchmarks showed not only the expected benefit of helping the CPU prefetch, but also that the process speeds up further at runtime: after about 100 program cycles, the CPU seems to have figured the access pattern out and optimized the cache accordingly. This saves me up to 200,000 ticks per cycle; the number drops from around 750,000 to 550,000. I got these numbers using QTestLib.
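For reference, the shape of that measurement was roughly the following (sketched here with std::chrono rather than QTestLib; the working set size and iteration count are arbitrary):

```cpp
// Rough sketch: time the same work repeatedly and watch early iterations
// run slower while caches and predictors warm up.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> data(1 << 20, 1);            // ~4 MB working set (arbitrary)
    for (int cycle = 0; cycle < 200; ++cycle) {
        auto t0 = std::chrono::steady_clock::now();
        volatile long sum = std::accumulate(data.begin(), data.end(), 0L);
        auto t1 = std::chrono::steady_clock::now();
        (void)sum;
        std::printf("cycle %3d: %lld ns\n", cycle,
            (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }
}
```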
Now to the question: is there a safe way to use this runtime speedup, letting the program warm up, so to speak? Or should one not count on it at all and simply write faster code from the start?
First of all, there is generally no gain in trying to warm up a process prior to normal execution: that would only speed up the first 100 or so program cycles in your case, gaining a total of less than 100 × 200,000 = 20,000,000 ticks. That is much less than the roughly 75,000,000 ticks (100 cycles at ~750,000 each) you would have to invest in the warming up.
Second, all the gains from warming up a process/cache/whatever are rather brittle. There are a number of events that destroy the warming effect, and you generally do not control them. Mostly they come from your process not being alone on the system: a process switch can behave pretty much like an asynchronous cache flush, and whenever the kernel needs a page of memory, it may drop a page from the disk cache.
Since these factors make computation time pretty unpredictable, they need to be controlled when running benchmarks that are supposed to produce reliable results. Apart from that, these effects are mostly ignored.
It is important to note that keeping the CPU busy isn't necessarily a bad thing. Ideally you want your CPU to run anywhere from 60% to 100% because that means that your computer is actually doing "work". Granted, if there is a process that you are unaware of and that process is taking up CPU cycles, that isn't good.
In answer to your question, the machine usually takes care of this.
I am writing C++ code that runs on Ubuntu, and I use pthreads. I am doing research in algorithm performance.
I have an algorithm that I have improved on; it can run for roughly 6 to 10 hours, but the time measurements I am taking also involve components that are very small, on the order of milliseconds.
Also, the computer I am running it on is running other processes at the same time, so how do I make sure that the measured time does not include time spent processing other processes?
There is no single "most accurate way". As with any measurement, you will have to define what you want to measure first. If you just want to measure execution time, repeatedly doing the same task and stopping the time is appropriate.
If you want to measure the time the CPU(s) is (are) actually busy doing your task, top might be a program of interest to you.
If you need to know how much time is spent in which subroutine, the linux-tools package contains perf, which can measure individual call times.
Often, non-CPU latencies dominate runtime: it doesn't make sense, for example, to measure only the time spent by the CPU when you are actually waiting on network input or hard drive data.
So: your question is actually a question for yourself: What do you want to measure? "Algorithm Performance" suggests you're doing computer science, in which case you should have access to extensive literature that explains what might be of interest. There's no single "solution" to the question "what is the most accurate measurement", unless you define "measurement" more closely; that's your job, and usually it's the harder part of measuring things.
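If what you want is simply "CPU time consumed by my process, regardless of what else the machine is doing", the POSIX per-process CPU clock is a reasonable starting point. A sketch, with error handling omitted and a dummy workload standing in for the algorithm:

```cpp
// Measure wall-clock time vs. CPU time consumed by this process only.
// Other processes inflate the wall time, but not CLOCK_PROCESS_CPUTIME_ID,
// which sums CPU time over all threads of this process (relevant with pthreads).
#include <cstdio>
#include <time.h>

static double seconds(clockid_t id) {
    timespec ts;
    clock_gettime(id, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main() {
    double wall0 = seconds(CLOCK_MONOTONIC);
    double cpu0  = seconds(CLOCK_PROCESS_CPUTIME_ID);

    // ... run the algorithm here ...
    volatile double x = 0;
    for (long i = 0; i < 100000000L; ++i) x += i * 1e-9;   // dummy workload

    std::printf("wall: %.3f s, cpu: %.3f s\n",
                seconds(CLOCK_MONOTONIC) - wall0,
                seconds(CLOCK_PROCESS_CPUTIME_ID) - cpu0);
}
```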
As the title suggests, I'm interested in obtaining the CPU clock cycles used by a process in kernel mode only. I know there is an API called QueryProcessCycleTime which returns the CPU clock cycles used by the threads of the process, but that value includes cycles spent in both user mode and kernel mode. How can I obtain the cycles spent in kernel mode only? Do I need to get this using performance counters? If so, which one should I use?
Thanks in advance for your answers.
I've just found an interesting article that describes almost what you ask for. It's on MSDN Internals.
They write there that if you were using C# or C++/CLI, you could easily get that information from an instance of the System.Diagnostics.Process class pointed at the right PID. But it would give you a TimeSpan from PrivilegedProcessorTime, so "pretty time" instead of cycles.
However, they also point out that all that .NET code is actually a thin wrapper over unmanaged APIs, so you should be able to get the same thing from native C++ easily too. They ILDASM'ed that class to show what it calls, but the image is missing. I've just done the same, and it uses GetProcessTimes from kernel32.dll.
So, again looking that up on MSDN - it returns LPFILETIME structures. So "pretty time", not cycles, again.
The description of that function points out that if you want clock cycles, you should use the QueryProcessCycleTime function. That does return a number of clock cycles, but with user mode and kernel mode counted together.
Now, summing up:
you can read userTIME
you can read kernelTIME
you can read (user+kernel)CYCLES
So you have almost everything needed. By some simple math:
u_cycles = all_cycles * u_time / (u_time + k_time)
k_cycles = all_cycles * k_time / (u_time + k_time)
Of course this will be some approximation due to rounding etc.
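Put together, the approximation could look like this sketch. FILETIME values are in 100 ns units, the split is done in floating point to avoid overflow on long-running processes, and it assumes the current process (for another process you would need a handle opened with query rights):

```cpp
// Sketch: approximate user- vs kernel-mode cycles by splitting the total
// from QueryProcessCycleTime in proportion to GetProcessTimes.
#include <windows.h>
#include <cstdio>

static unsigned long long to_u64(FILETIME ft) {
    ULARGE_INTEGER v;
    v.LowPart  = ft.dwLowDateTime;
    v.HighPart = ft.dwHighDateTime;
    return v.QuadPart;                      // 100 ns units
}

int main() {
    HANDLE h = GetCurrentProcess();         // or OpenProcess(...) for another PID

    FILETIME ftCreate, ftExit, ftKernel, ftUser;
    ULONG64  totalCycles = 0;
    GetProcessTimes(h, &ftCreate, &ftExit, &ftKernel, &ftUser);
    QueryProcessCycleTime(h, &totalCycles);

    double k = (double)to_u64(ftKernel);    // kernel time
    double u = (double)to_u64(ftUser);      // user time
    if (k + u == 0) return 0;

    // Split the cycle total in proportion to the measured times (approximation).
    unsigned long long kCycles = (unsigned long long)(totalCycles * (k / (k + u)));
    unsigned long long uCycles = totalCycles - kCycles;

    std::printf("kernel cycles ~ %llu, user cycles ~ %llu\n", kCycles, uCycles);
    return 0;
}
```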
Also, there is a gotcha: you have to invoke two functions (GetProcessTimes, QueryProcessCycleTime) to get all the information, so there will be a slight delay between the two readings, and the calculation will therefore drift a little, since the target process keeps running and burning time in between.
If you cannot allow for this (small?) noise in the measurement, I think you can circumvent it by temporarily suspending the process:
suspend the target
wait a little and ensure it is suspended
read first stats
read second stats
then resume the process and calculate values
I think this will ensure that the two readings are consistent, but in turn each such reading will impact the overall performance of the measured process - so, for example, things like "wall time" will no longer be measurable unless you correct for the time spent in suspension.
There may be a better way to get the separate clock cycle counts, but I have not found one, sorry. You could try looking inside QueryProcessCycleTime to see where it reads the data from - maybe you are lucky and it reads A and B and returns A+B, in which case you could read those sources directly. I have not checked.
Take a look at GetProcessTimes. It'll give you the amount of kernel and user time your process has used.
What I want to do
I have a computationally intensive OCaml application and I'd like it to run in the background without disturbing normal computer usage. I'd like to present the users with two options:
(1) the application only runs when CPU usage is virtually 0%;
(2) the application only uses "free" processing power (e.g. if other processes add up to 100%, the OCaml application pauses; if other processes are virtually 0%, then there are no restrictions for the OCaml application; if other processes add up to, say, 50% then OCaml will use up to 50%).
Some thoughts
My idea is to check CPU usage at various check points in the code and pause execution if necessary.
In (1), we just check if CPU is below say 2% and, if not, pause until it becomes lower than 2% again.
In (2), things are trickier. When no restrictions are present the application always consumes 100% of the CPU, and checkpoints will be quite frequent, so to reduce CPU usage to, say, half, I just have to delay execution at every checkpoint by exactly the time that elapsed since the previous checkpoint. If checkpoints are frequent, this should be roughly equivalent to using 50% of the CPU. For other percentages we can do something similar by suspending for appropriate periods of time. However, this looks very contrived and full of overhead, and above all I'm not sure it really does what I want. A better alternative could be to invoke Unix.nice n with some appropriate integer at the start of the application; I suppose that setting n=15 would probably be about right.
My questions
(Q1) How can I know from within my OCaml application what the CPU usage for the application process is? (I'd like to do this with an OCaml function and not by invoking "ps" or something similar on the command line...)
(Q2) Do you see any problems with my idea for achieving (2)? What are the practical differences compared to changing the niceness of the process?
(Q3) Do you have any other suggestions for (2)?
Use Caml's Unix library to periodically capture your CPU times and your elapsed times. Your CPU usage is the ratio. Try Unix.gettimeofday and Unix.times. N.B. You'll need to link with the -lunix option.
I too would just run the process under nice and be done with it.
Get your PID then parse the contents of /proc/<PID>/stat to get info about your process and /proc/stat to get global CPU info. They both have a bunch of statistics that you can use to decide when to do work and when to sleep. Do man proc to see the documentation for all the fields (long). Related question with good info: stackoverflow.com/questions/1420426
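If it helps, the arithmetic looks roughly like this - sketched in C++ for brevity, but the same two reads translate directly to OCaml's Unix/Scanf modules; field positions are as documented in man 5 proc:

```cpp
// Sketch: estimate this process's share of total CPU between two samples,
// using /proc/self/stat (utime+stime, fields 14-15) and /proc/stat
// (the aggregate "cpu" line). 100% here means all cores fully busy.
#include <chrono>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>

static long long self_ticks() {
    std::ifstream f("/proc/self/stat");
    std::string line;
    std::getline(f, line);
    // comm (field 2) may contain spaces, so parse after the closing ')'.
    std::istringstream in(line.substr(line.rfind(')') + 2));
    std::string tok;
    long long utime = 0, stime = 0;
    for (int i = 1; in >> tok; ++i) {
        if (i == 12) utime = std::stoll(tok);             // field 14: utime
        if (i == 13) { stime = std::stoll(tok); break; }  // field 15: stime
    }
    return utime + stime;
}

static long long total_ticks() {
    std::ifstream f("/proc/stat");
    std::string cpu;
    f >> cpu;                        // the aggregate "cpu" label
    long long v, sum = 0;
    while (f >> v) sum += v;         // user+nice+system+idle+iowait+...
    return sum;
}

int main() {
    long long s0 = self_ticks(), t0 = total_ticks();
    std::this_thread::sleep_for(std::chrono::seconds(1));
    long long s1 = self_ticks(), t1 = total_ticks();
    double share = t1 > t0 ? 100.0 * (s1 - s0) / (t1 - t0) : 0.0;
    std::printf("this process used ~%.1f%% of total CPU capacity\n", share);
    // A throttling loop could sleep at its checkpoints whenever `share`
    // exceeds the chosen budget.
}
```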
Setting niceness is easy and reliable. Doing things yourself is much more work but potentially gives you more control. If your actual goal is to just run as a background task, I would go with nice and be done with it.