OCaml: Getting CPU usage of a process

What I want to do
I have a computationally intensive OCaml application and I'd like it to run in the background without disturbing normal computer usage. I'd like to present the users with two options:
(1) the application only runs when CPU usage is virtually 0%;
(2) the application only uses "free" processing power (e.g. if other processes add up to 100%, the OCaml application pauses; if other processes are virtually 0%, then there are no restrictions for the OCaml application; if other processes add up to, say, 50% then OCaml will use up to 50%).
Some thoughts
My idea is to check CPU usage at various check points in the code and pause execution if necessary.
In (1), we just check whether CPU usage is below, say, 2% and, if not, pause until it drops below 2% again.
In (2), things are trickier. Since the application always consumes 100% of the CPU when no restrictions are present, and checkpoints will be quite frequent, to reduce CPU usage to, say, half I just have to delay execution at every checkpoint by exactly the time it took between checkpoints. If checkpoints are frequent, this should be roughly equivalent to using 50% of the CPU. For other percentages we can do something similar by suspending for appropriate periods of time. However, this looks very contrived, full of overhead, and above all I'm not sure it really does what I want. A better alternative could be to invoke Unix.nice n with some appropriate integer at the start of the application. I suppose that setting n=15 would probably be right.
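To make the idea concrete, here is a rough sketch of what I have in mind: a hypothetical checkpoint function (the name is just illustrative) called regularly from the main loop, using only Unix.gettimeofday and Unix.sleepf (the latter needs OCaml 4.03+).

    let last_checkpoint = ref (Unix.gettimeofday ())

    (* target is the desired CPU fraction, e.g. 0.5 for 50% *)
    let checkpoint ~target =
      let now = Unix.gettimeofday () in
      let worked = now -. !last_checkpoint in
      (* sleeping for worked * (1/target - 1) makes work/(work+sleep) ~ target;
         for target = 0.5 this is exactly "delay by the time since the last
         checkpoint", as described above *)
      let delay = worked *. (1.0 /. target -. 1.0) in
      if delay > 0.0 then Unix.sleepf delay;
      last_checkpoint := Unix.gettimeofday ()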
My questions
(Q1) How can I know from within my OCaml application what the CPU usage for the application process is? (I'd like to do this with an OCaml function and not by invoking "ps" or something similar on the command line...)
(Q2) Do you see problems with my idea for achieving (2)? What are the practical differences compared to changing the niceness of the process?
(Q3) Do you have any other suggestions for (2)?

Use OCaml's Unix library to periodically capture your CPU times and your elapsed times; your CPU usage is the ratio of the two. Try Unix.gettimeofday and Unix.times. N.B. You'll need to link against the unix library (e.g. unix.cma / unix.cmxa).
I too would just run the process under nice and be done with it.
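For what it's worth, a small sketch of that Unix.times approach (only the Unix module is needed; tms_utime and tms_stime are the process's own user and system CPU seconds, and the names sample/usage are just illustrative):

    (* CPU usage of this process between two samples:
       (CPU seconds consumed) / (wall-clock seconds elapsed). *)
    type sample = { wall : float; cpu : float }

    let sample () =
      let t = Unix.times () in
      { wall = Unix.gettimeofday ();
        cpu  = t.Unix.tms_utime +. t.Unix.tms_stime }

    let usage s0 s1 = (s1.cpu -. s0.cpu) /. (s1.wall -. s0.wall)

    (* Usage:
         let s0 = sample () in
         (* ... do some work ... *)
         let s1 = sample () in
         Printf.printf "process CPU usage: %.0f%%\n" (100. *. usage s0 s1) *)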

Get your PID, then parse the contents of /proc/<PID>/stat to get info about your process, and /proc/stat to get global CPU info. They both have a bunch of statistics that you can use to decide when to do work and when to sleep. Run man proc to see the documentation for all the fields (it's long). Related question with good info: stackoverflow.com/questions/1420426
Setting niceness is easy and reliable. Doing things yourself is much more work but potentially gives you more control. If your actual goal is to just run as a background task, I would go with nice and be done with it.
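If you do go the /proc route, a rough OCaml sketch looks like this (helper names are just illustrative; field positions are as documented in man proc: fields 14 and 15 of /proc/<PID>/stat are utime and stime, and everything is in clock ticks):

    let read_first_line path =
      let ic = open_in path in
      let line = input_line ic in
      close_in ic;
      line

    let fields line =
      List.filter (fun s -> s <> "") (String.split_on_char ' ' line)

    (* utime + stime of this process, in clock ticks (fields 14 and 15,
       counting from 1; assumes the comm field contains no spaces) *)
    let self_ticks () =
      let f = fields (read_first_line "/proc/self/stat") in
      float_of_string (List.nth f 13) +. float_of_string (List.nth f 14)

    (* (busy, total) clock ticks for the whole machine, from the "cpu" line:
       user nice system idle iowait ... *)
    let global_ticks () =
      match fields (read_first_line "/proc/stat") with
      | "cpu" :: rest ->
          let v = List.map float_of_string rest in
          let total = List.fold_left ( +. ) 0.0 v in
          let idle = List.nth v 3 +. List.nth v 4 in   (* idle + iowait *)
          (total -. idle, total)
      | _ -> failwith "unexpected /proc/stat format"

Sample these counters twice and compare the deltas: the ratio of the self_ticks delta to the total delta is your process's share, and the busy/total delta ratio is the overall machine load needed for options (1) and (2).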

Related

Analyzing spikes in performance measurement

I have a set of C++ functions which do some image-processing-related operations. Generally I see that the final output is delivered in the 5-6 ms range. I am measuring the time taken using the QueryPerformanceCounter Win32 API. But when running in a continuous loop with 100 images, I see that the processing time spikes up to 20 ms for some images. My question is: how do I go about analyzing such issues? Basically I want to determine whether the spikes are caused by some delay in this code, or whether some other task started running on the CPU and caused this operation to take longer. I have tried using the GetThreadTimes API to see how much time my thread spent on the CPU, but I am unable to draw conclusions from those numbers. What is the standard way to troubleshoot these types of issues?
The reason behind sudden spikes during processing could be any of I/O, interrupts, scheduled processes, etc.
It is very common to see such spikes for operations with such low latency/processing times. IMO they can be attributed to any of the above-mentioned reasons (there could be more). The simplest mitigation is to run the same experiment with more inputs multiple times and take the average for the final figure.
To answer your question about checking/confirming the source of the spike, you can try the following:
Check variation in images - already ruled out as per your comment
Monitor resource utilization during processing. Check whether any resource is choking (% utilization is the simplest thing to check; the SAR/NMON utilities on Linux are best, with minimal overhead)
Reserve a few CPUs on the system (CPU affinity) for your experiment, dedicated only to your program, so that no OS task will run on them. taskset is the simplest utility to try out.
Run the experiment with this setting and check the behavior.
That's a nasty thing you are trying to figure out; I wouldn't even attempt to, since coming to concrete conclusions is hard.
In general, one should run a loop of many iterations (100 just seems too small, I think) and then take the average time for an image to be processed.
That will rule out any unexpected exterior events that may have hurt the performance of your program.
A typical way to check whether "some other task started running inside the CPU" would be to run your program once and mark the images that produce the spike. For example, images 2, 4, 5, and 67 take too long to be processed. Run your program again a few times and mark again which images produce the spikes.
If the same images produce these spikes, then it's not something caused by another exterior task.
What is the standard way to go about troubleshooting these types of issues?
There are Real Time Operating Systems (RTOS) which provide guarantees about those kinds of delays. They are a totally different class of operating system from Windows or Linux.
But still, there are things you can do about your delays even on a general-purpose OS.
1. Avoid system calls
Once you ask your OS to read or write something to a disk, there are no guarantees whatsoever about delays. So, avoid any system functions on your critical path:
even functions like gettimeofday() might cause unpredictable delays, so you should really avoid any system calls in time-critical code;
use another thread to perform IO and pass data via a shared buffer to your critical code (see the sketch after this list).
If your code base is big, use tools like strace on Linux or Dr Memory on Windows to trace system calls.
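The "IO in another thread + shared buffer" point is language-agnostic; a rough sketch of the shape, written here in OCaml with the threads library (in C++ the same structure would use std::thread plus a mutex-protected queue; all names are illustrative):

    (* Producer thread does the blocking IO; the time-critical consumer only
       touches an in-memory queue.  (A real low-latency design might spin on
       a lock-free queue instead of using a condition variable.) *)
    let buffer : string Queue.t = Queue.create ()
    let lock = Mutex.create ()
    let nonempty = Condition.create ()

    let io_thread ic =
      try
        while true do
          let line = input_line ic in      (* blocking system call lives here *)
          Mutex.lock lock;
          Queue.push line buffer;
          Condition.signal nonempty;
          Mutex.unlock lock
        done
      with End_of_file -> ()

    let next_item () =
      Mutex.lock lock;
      while Queue.is_empty buffer do Condition.wait nonempty lock done;
      let x = Queue.pop buffer in
      Mutex.unlock lock;
      x

    let () =
      let _producer = Thread.create io_thread stdin in
      print_endline (next_item ())         (* critical path never touches the disk *)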
2. Avoid context switches
Multithreading on Windows is preemptive. That means there is a system scheduler which might stop your thread at any time and schedule another thread on your CPU. As before, there are RTOSes which allow you to avoid such context switches, but there are things you can do about it:
make sure there is at least one CPU core left for system and other tasks;
bind each of your threads to a dedicated CPU with SetThreadAffinityMask() (Windows) or sched_setaffinity() (Linux) -- this effectively hints the system scheduler to avoid scheduling other threads on that CPU;
make sure hardware interrupts go to another CPU; usually interrupts go to CPU 0, so the easiest way would be to bind your thread with CPU 1+;
increase your thread's priority, so the scheduler is less likely to swap your thread out for another one.
There are tools like perf (Linux) and Intel VTune (Windows) to confirm whether context switches are happening.
3. Avoid other non-deterministic features
Few more sources of unexpected delays:
disable swap, so you know for sure your thread's memory will not be swapped out to a slow and unpredictable disk drive;
disable CPU turbo boost -- after a high-performance CPU boosts, there is always a slowdown so the CPU stays within its thermal design power (TDP);
disable hyper-threading -- from the scheduler's point of view hyper-threads are independent CPUs, but in fact the performance of each one depends on what the other thread is doing at the moment.
Hope this helps.

Profiling a multiprocess system

I have a system that I need to profile.
It is composed of tens of processes, mostly C++, some made up of several threads, that communicate with the network and with one another through various system calls.
I know there are performance bottlenecks sometimes, but no one has put in the time/effort to check where they are: they may be in userspace code, inefficient use of syscalls, or something else.
What would be the best way to approach profiling a system like this?
I have thought of the following strategy:
Manually logging the round-trip times of various code sequences (for example processing an incoming packet or a CLI command) and seeing which process takes the most time. After that, profiling that process, fixing the problem, and repeating.
This method seems sorta hacky and guess-worky. I don't like it.
How would you suggest approaching this problem?
Are there tools that would help me out (multi-process profiler?)?
What I'm looking for is more of a strategy than just specific tools.
Should I profile every process separately and look for problems? If so, how do I approach this?
Do I try to isolate the problematic processes and go from there? If so, how do I isolate them?
Are there other options?
I don't think there is a single answer to this sort of question. Every type of issue has its own problems and solutions.
Generally, the first step is to figure out WHERE in the big system is the time spent. Is it CPU-bound or I/O-bound?
If the problem is CPU-bound, a system-wide profiling tool can be useful to determine where in the system the time is spent. The next question is of course whether that time is actually necessary or not: no automated tool can tell the difference between a badly written piece of code that does a million completely useless processing steps and one that very efficiently multiplies matrices with a million elements. It takes the same amount of CPU time to do both, but only one of them is actually achieving anything. However, knowing which program takes most of the time in a multi-program system can be a good starting point for figuring out whether that code is well written or can be improved.
If the system is I/O-bound, such as network or disk I/O, then there are tools for analysing disk and network traffic that can help. But again, expecting the tool to tell you what packet response or disk access time you should expect is a different matter: whether you contact Google to search for "kerflerp" or contact your local webserver a metre away will have a dramatic impact on the time of a reasonable response.
There are lots of other issues - running two pieces of code in parallel that uses LOTS of memory can cause both to run slower than if they are run in sequence - because the high memory usage causes swapping, or because the OS isn't able to use spare memory for caching file-I/O, for example.
On the other hand, two or more simple processes that use very little memory will benefit quite a lot from running in parallel on a multiprocessor system.
Adding logging to your applications such that you can see WHERE it is spending time is another method that works reasonably well. Particularly if you KNOW what the use-case is where it takes time.
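As a concrete starting point for that logging, even a trivial elapsed-time wrapper is often enough; here is a sketch in OCaml (the same few lines exist in any language, and the names timed and handle_packet are purely hypothetical):

    (* Wrap a suspect code path and log how long it took, tagged with a
       label you can grep for across processes later. *)
    let timed label f =
      let t0 = Unix.gettimeofday () in
      let result = f () in
      Printf.eprintf "[timing] %s: %.3f ms\n%!" label
        ((Unix.gettimeofday () -. t0) *. 1000.0);
      result

    (* e.g.  let reply = timed "handle_packet" (fun () -> handle_packet packet) *)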
If you have a use-case where you know "this should take no more than X seconds", running regular pre- or post-commit tests to check that the code behaves as expected, and that no one has added a lot of code that slows it down, would also be useful.

How to get CPU clock cycles used by process in kernel mode on windows?

As the title suggests, I'm interested in obtaining the CPU clock cycles used by a process in kernel mode only. I know there is an API called QueryProcessCycleTime which returns the CPU clock cycles used by the threads of the process, but this value includes cycles spent in both user mode and kernel mode. How can I obtain the cycles spent in kernel mode only? Do I need to get this using performance counters? If yes, which one should I use?
Thanks in advance for your answers.
I've just found an interesting article that describes almost what you ask for. It's on MSDN Internals.
They write there that if you were using C# or C++/CLI, you could easily get that information from an instance of the System.Diagnostics.Process class pointed at the right PID. But it would give you a TimeSpan from PrivilegedProcessorTime, so a "pretty time" instead of 'cycles'.
However, they also point out that all that .NET code is actually a thin wrapper around unmanaged APIs, so you should be able to get it easily from native C++ too. They ILDASM'ed that class to show what it calls, but the image is missing. I've just done the same, and it uses GetProcessTimes from kernel32.dll.
So, again, MSDN'ing it - it returns LPFILETIME structures. So, the 'pretty time', not 'cycles', again.
The description of that function points out that if you want the clock cycles, you should use the QueryProcessCycleTime function. This actually returns the number of clock cycles... but with user and kernel mode counted together.
Now, summing up:
you can read userTIME
you can read kernelTIME
you can read (user+kernel)CYCLES
So you have almost everything needed. By some simple math:
u_cycles = u_time * all_cycles / (u_time + k_time)
k_cycles = k_time * all_cycles / (u_time + k_time)
Of course this will be some approximation due to rounding etc.
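The split is just proportional arithmetic; as a tiny helper it might look like this (sketched in OCaml like the other snippets on this page; the three inputs would come from GetProcessTimes and QueryProcessCycleTime, and split_cycles is an illustrative name):

    (* Approximate split of total cycles in proportion to user vs. kernel time. *)
    let split_cycles ~u_time ~k_time ~all_cycles =
      let total = u_time +. k_time in
      let u_cycles = all_cycles *. u_time /. total in
      let k_cycles = all_cycles -. u_cycles in
      (u_cycles, k_cycles)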
Also, this has a gotcha: you have to invoke two functions (GetProcessTimes, QueryProcessCycleTime) to get all the information, so there will be a slight delay between their readings, and therefore your calculation will probably slip a little, since the target process keeps running and burning time in between.
If you cannot allow for this (small?) noise in the measurement, I think you can circumvent it by temporarily suspending the process:
suspend the target
wait a little and ensure it is suspended
read first stats
read second stats
then resume the process and calculate values
I think this will ensure the two readings are consistent, but in turn each such reading will impact the overall performance of the measured process - e.g. things like "wall time" will no longer be measurable unless you apply some correction for the time spent in suspension.
There may be some better way to get the separate clock cycles, but I have not found it, sorry. You could try looking inside QueryProcessCycleTime to see what source it reads the data from - maybe you are lucky, it reads A and B and returns A+B, and you could peek at the sources separately. I have not checked it.
Take a look at GetProcessTimes. It'll give you the amount of kernel and user time your process has used.

Does Clojure (or JCE, or JVM, or...?) introduce parallelism automatically?

I am running some CPU-intensive Clojure code from within Intellij Idea (I don't think that's important - it seems to just spawn a process). According to both htop and top, it is using all 4 cores (well, 2 + hyperthreading) on my laptop. This is despite me not having any explicit parallelism in the code.
A little more detail: top shows a single process with ~380% CPU use, while htop shows a "parent" process and then 4 "children", each with 1/4 the time and ~100% CPU.
Is this normal? Or does it mean I have got something very wrong somewhere? The code involves many lazy sequences, but at its core modifies a mutable data structure (a mutable - not a Clojure data structure - hash that accumulates results). I am not using any explicit parallelism.
A significant amount of time is likely (I haven't profiled) spent in JCA/JCE (crypto lib) - I am using multiple AES ciphers in CTR mode, each as a stream of secure random bytes (code here), implemented as lazy seqs. Perhaps that is parallelized?
More random ideas: Could this be related to IO? I'm running on an encrypted SSD and this program is processing data from disk, so does a lot of reading. But htop shows system time as red, and these are green.
Sorry for such a vague question. I can post more info if required. This is Clojure 1.4 on 64bit Linux (JDK 1.7.0_05). The code being executed is here but it's pretty messy (more apologies) and spread across various files (most CPU time is spent in nearest-in-dump in the code there). Note - please don't waste time trying to run code to reproduce, as it expects a pre-existing data-dump to be on disk (which isn't in git).
Update (debugger): Running in the debugger (thanks, A-M) shows four threads (if I understand the debugger correctly), but only one is executing the program. They are labelled finalizer, main (the program), reference handler, and signal dispatcher. Finalizer and reference handler are in the wait state; signal dispatcher has no frames available. I tentatively think this means the parallelism is at a lower level, perhaps in the crypto implementation?
Aha - I think it's parallel GC (the JVM's garbage collector is itself multi-threaded). At the start, CPU use jumps way up when the actual process pauses (it prints a regular tick). And since it's churning through lots of data, it's generating a lot of short-lived objects (confirmed by using -XX:+UseSerialGC, which reduces CPU use to 100%).
OK, I feel a bit dumb posting this as it now looks pretty obvious, but it seems to be parallel GC. I am processing a lot of data (sucking it in from an SSD) and generating lots of short-lived objects. And it appears that the JVM has parallel GC. See http://blog.ragozin.info/2011/12/garbage-collection-in-hotspot-jvm.html
It may also be a sign of a problem - "What is going on with Java GC? PermGen space is filling up?" - which I will investigate tomorrow (I didn't mention it, although in retrospect I should have, but this is borderline running out of memory).
Update: Running with -XX:+UseSerialGC reduces the total CPU use to 100% (ie 1 core). But I didn't really mean that the two explanations above were exclusive, only that with better configuration and/or code I could reduce the amount of GC.

C++: Limiting CPU usage intentionally

At my company, we often test the performance of our USB and FireWire devices under CPU strain.
There is a test code we run that loads the CPU, and it is often used in really simple informal tests to see what happens to our device's performance.
I took a look at the code for this, and it's a simple loop that increments a counter and does a calculation based on the new value, storing the result in another variable.
Running a single instance will use 1/X of the CPU, where X is the number of cores.
So, for instance, if we're on a 8-core PC and we want to see how our device runs under 50% CPU usage, we can open four instances of this at once, and so forth...
I'm wondering:
What decides how much of the CPU gets used up? Does a single-threaded application just run everything as fast as it can on a single thread?
Is there a way to voluntarily limit the maximum CPU usage your program can use? I can think of some "sloppy" ways (add sleep commands or something), but is there a way to limit it to, say, some specified percentage of the available CPU?
CPU quotas on Windows 7 and on Linux.
Also on QNX (i.e. Blackberry Tablet OS) and LynuxWorks
In case of broken links, the articles are named:
Windows -- "CPU rate limits in Windows Server 2008 R2 and Windows 7"
Linux -- "CPU Usage Limiter for Linux"
QNX -- "Adaptive Partitioning"
LynuxWorks - "Partitioning Operating Systems" and "ARINC 653"
The OS usually decides how to schedule processes and on which CPUs they should run. It basically keeps a ready queue of processes which are ready to run (not marked for termination and not blocked waiting for some I/O, event, etc.). Whenever a process uses up its timeslice or blocks, it frees a processing core and the OS selects another process to run. Now if you have a process which is always ready to run and never blocks, then this process essentially runs whenever it can, thus pushing the utilization of a processing unit to 100%. Of course this is a somewhat simplified description (there are things like process priorities, for example).
There is usually no generic way to achieve this. The OS you are using might offer some mechanism for it (some kind of CPU quota). You could also measure how much wall-clock time has passed vs. how much CPU time your process has used up, and then put your process to sleep for certain periods to achieve an approximation of the desired CPU utilization.
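A sketch of that measure-and-sleep idea (shown in OCaml with the Unix module, matching the first question on this page; the same arithmetic works in C++ with clock() or GetProcessTimes for the CPU time, and the name throttle is illustrative):

    (* Call throttle regularly from the worker loop: it compares CPU time used
       against a budget of target * elapsed wall time and sleeps off any excess. *)
    let start_wall = Unix.gettimeofday ()

    let cpu_time () =
      let t = Unix.times () in
      t.Unix.tms_utime +. t.Unix.tms_stime

    let throttle ~target =                 (* target = e.g. 0.25 for ~25% *)
      let used = cpu_time () in
      let elapsed = Unix.gettimeofday () -. start_wall in
      let over = used -. target *. elapsed in
      if over > 0.0 then Unix.sleepf (over /. target)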
You've essentially answered your own questions!
The key trait of code that burns a lot of CPU is that it never does anything that blocks (e.g. waiting for network or file I/O), and never voluntarily yields its time slice (e.g. sleep(), etc.).
The other trick is that the code must do something that the compiler cannot optimize away. So, most likely your CPU-burn code outputs something based on the loop calculation at the end, or is simply compiled without optimization so that the optimizer isn't tempted to remove the useless loop. Since you're trying to load the CPU, there's no sense in optimizing anyway.
As you hypothesized, single-threaded code that matches this description will saturate a CPU core, unless the OS has more of these processes than it has cores to run them -- then it will round-robin schedule them and the utilization of each will be some fraction of 100%.
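For reference, a burner of the kind described is only a few lines; a sketch (one instance per core you want to load, with the result printed occasionally so the work stays observable):

    (* Never blocks and never sleeps, so it pins one core at ~100%. *)
    let () =
      let acc = ref 0.0 in
      let i = ref 0 in
      while true do
        incr i;
        acc := !acc +. sqrt (float_of_int !i);
        if !i mod 100_000_000 = 0 then Printf.printf "%f\n%!" !acc
      done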
The issue isn't how much time the CPU spends idle, but rather how long it takes for your code to start executing. Who cares if it's idle or doing low-priority busywork, as long as the latency is low?
Your problem is fundamentally a consequence of using a synthetic benchmark, presumably in an attempt to obtain reproducible results. But synthetic benchmarks tend to produce meaningless results, so reproducibility is moot.
Look at your bug database, find actual customer complaints, and use actual software and test hardware to reproduce a situation that actually made someone dissatisfied. Develop the performance test in parallel with hard, meaningful performance requirements.