Linux Timing across Kernel & User Space - c++

I'm writing a kernel module for a special camera, working through V4L2 to handle the transfer of frames to userspace code, and then doing a lot of further processing in the userspace application.
Timing is critical here, so I've been doing a lot of performance profiling with plain old std::chrono::steady_clock to track timing, but I've reached the point where I also need to collect timing data from the kernel side so that I can analyze the entire path from hardware interrupt through V4L2 DQBUF to userspace.
Can anyone recommend a good way to get high-resolution timing data in the kernel that is consistent with the userspace measurements, so that I can compare the two? Right now I'm measuring activity in microseconds.
Ubuntu 12.04 LTS

At the lowest level, there are the rdtsc and rdtscp instructions if you're on an x86/x86-64 processor. That should provide the lowest overhead, highest possible resolution across the kernel/userspace boundary.
However, there are things you need to worry about: you need to make sure both measurements execute on the same core/CPU, that the process isn't context-switched in between, and that the frequency isn't changing across invocations. If the CPU supports an invariant TSC (constant_tsc in /proc/cpuinfo), it's a little more reliable across CPUs/cores and frequency changes.
This should provide roughly nanosecond accuracy.
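A minimal userspace sketch (assuming GCC/Clang on x86/x86-64) using the __rdtscp intrinsic; the aux value returned alongside the counter lets you check that both reads ran on the same CPU:

    // Minimal sketch: read the TSC with rdtscp in userspace (x86/x86-64, GCC/Clang).
    // rdtscp also returns the TSC_AUX value, which Linux sets per CPU, so comparing
    // the two aux values detects a migration between the reads.
    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>   // __rdtscp

    int main() {
        unsigned int cpu_before = 0, cpu_after = 0;

        std::uint64_t t0 = __rdtscp(&cpu_before);
        // ... code being timed ...
        std::uint64_t t1 = __rdtscp(&cpu_after);

        if (cpu_before != cpu_after) {
            std::fprintf(stderr, "migrated between CPUs, sample unreliable\n");
        }
        std::printf("elapsed: %llu cycles\n",
                    static_cast<unsigned long long>(t1 - t0));
        return 0;
    }

Converting cycles to nanoseconds still requires knowing the TSC frequency, which is where the invariant-TSC caveat above matters.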

There are a lot of kernel-level utilities available that can collect timing-related traces for you, for example ptrace, ftrace, LTTng, and kprobes. Check out this link for more information:
http://elinux.org/Kernel_Trace_Systems
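If you go the ftrace route, one low-overhead way to line up application timestamps with kernel events is to write your own markers into the trace buffer; they show up in the ftrace output on the same timeline as the kernel-side events. A minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and tracing is enabled:

    // Minimal sketch: emit userspace markers into the ftrace buffer so application
    // events can be correlated with kernel-side events on the same clock.
    // Assumes debugfs is mounted at /sys/kernel/debug and you have permission.
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        int fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
        if (fd < 0) {
            std::perror("open trace_marker");
            return 1;
        }
        // Write a marker just before/after the userspace work you care about.
        const char msg[] = "app: frame dequeued\n";
        if (write(fd, msg, sizeof(msg) - 1) < 0) {
            std::perror("write trace_marker");
        }
        close(fd);
        return 0;
    }

The markers then appear interleaved with kernel events when you read /sys/kernel/debug/tracing/trace.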

Related

OpenCL Profiling timestamps are not consistent in duration compared to CPU clock

I am creating a custom tool interface with my application to profile the performance of OpenCL kernels while also integrating CPU profiling points. I'm currently working with this code on Linux using Ubuntu, and am testing using the 3 OpenCL devices in my machine: Intel CPU, Intel IGP, and Nvidia Quadro.
I am using std::chrono::high_resolution_clock::now().time_since_epoch().count() to produce a timestamp on the CPU, and of course the OpenCL profiling time points are 64-bit nanosecond values provided by the OpenCL profiling events API. The purpose of the tool I made is to consume log output (specially formatted and generated so as not to impact performance much) from the program and generate a timeline chart to aid performance analysis.
So far in my visualization interface I had made the assumption that nanoseconds are uniform. After getting my visual interface working and checking a few assumptions, I've realized that this condition more or less does hold, to a standard deviation of 0.4 microseconds, for the CPU OpenCL device (which indicates that the CPU device could be implemented using the same time counter, as it has no drift), but it does not hold for the two GPU devices! This is perhaps not the most surprising thing in the world, but it affects the core design of my tool, so this was an unforeseen risk.
I'll provide some eye candy since it is very interesting and it does prove to me that this is indeed happening.
This is zoomed into the beginning of the profile where the GPU has the corresponding mapBuffer for poses happening around a millisecond before the CPU calls it (impossible!)
Toward the end of the profile we see the same shapes but reversed relationship, clearly showing that GPU seconds seem to count for a little bit less compared to CPU seconds.
The way the visualization currently works, since I had assumed a GPU nanosecond is indeed a CPU nanosecond, is that I compute the average of the delta between the values given to me by the CPU and GPU... Since I implemented this from the start, perhaps it indicates that I was at least subconsciously expecting an issue like this one. Anyway, what I did was establish a sync point at each kernel dispatch by recording a CPU timestamp immediately before calling clEnqueueNDRangeKernel and then comparing it against the CL_PROFILING_COMMAND_QUEUED OpenCL profiling event time. This delta, upon further inspection, showed the time drift:
This screenshot from the Chrome console shows the array of delta values I collected from the two GPU devices (logged as BigInts to avoid losing integer precision): in both cases the GPU-reported timestamp deltas are trending down.
Compare with the numbers from the CPU:
My questions:
What might be a practical way to deal with this issue? I am currently leaning toward the use of sync points when dispatching OpenCL kernels, and these sync points could be used to either locally piecewise stretch the OpenCL Profiling timestamps, or to locally sync at the beginning of, say, a kernel dispatch and just ignore the discrepancy we have, assuming it will be insignificant during the period. In particular it is unclear whether it'd be a good idea to maximize granularity by implementing a sync point for every single profiling event I want to use.
What might be some other time measuring systems I can or should use on the CPU-side to see if maybe they will end up aligning better? I don't really have much hope in this at this point because I can imagine that the profiling times being provided to me are actually generated and timed on the GPU device itself. The fluctuations would then be affected by such things as dynamic GPU clock scaling, and there would be no hope of stumbling upon a different better timekeeping scheme on the CPU.
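For reference, here is a minimal sketch of the sync-point approach described above: take a CPU timestamp immediately before the enqueue, then read CL_PROFILING_COMMAND_QUEUED from the event and watch how the delta drifts over successive dispatches. The queue, kernel, and work size are placeholders, and the queue is assumed to have been created with CL_QUEUE_PROFILING_ENABLE:

    // Sketch only: record a CPU timestamp right before enqueueing a kernel and
    // compare it with the device's CL_PROFILING_COMMAND_QUEUED timestamp.
    // 'queue', 'kernel' and 'global_size' are assumed to exist elsewhere, and the
    // queue must have been created with CL_QUEUE_PROFILING_ENABLE.
    #include <CL/cl.h>
    #include <chrono>
    #include <cstdio>

    void enqueue_with_sync_point(cl_command_queue queue, cl_kernel kernel,
                                 size_t global_size) {
        cl_event evt = nullptr;

        long long cpu_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::high_resolution_clock::now().time_since_epoch()).count();

        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size,
                               nullptr, 0, nullptr, &evt);
        clWaitForEvents(1, &evt);   // profiling info is valid once the event completes

        cl_ulong queued_ns = 0;
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED,
                                sizeof(queued_ns), &queued_ns, nullptr);

        // The absolute delta is meaningless on its own (the two clocks have
        // different epochs); what matters is how it drifts between sync points.
        std::printf("delta (cpu - gpu) = %lld ns\n",
                    static_cast<long long>(cpu_ns) - static_cast<long long>(queued_ns));
        clReleaseEvent(evt);
    }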

Analyzing spikes in performance measurement

I have a set of C++ functions which do some image-processing-related operations. Generally I see that the final output is delivered in the 5-6 ms range. I am measuring the time taken using the QueryPerformanceCounter Win32 API. But when running in a continuous loop with 100 images, I see the performance spike up to 20 ms for some images. My question is how to go about analyzing such issues. Basically I want to determine whether the spikes are caused by some delay in this code or whether some other task started running on the CPU, because of which this operation took longer. I have tried using the GetThreadTimes API to see how much time my thread spent on the CPU, but I am unable to draw conclusions from those numbers. What is the standard way to go about troubleshooting these types of issues?
The reason behind sudden spikes during processing could be any of I/O, interrupts, scheduled processes, etc.
It is very common to see such spikes for operations with such low latency/processing time. IMO you can attribute them to any of the above-mentioned reasons (there could be more). The simplest approach is to run the same experiment with more inputs multiple times and take the average for the final figure.
To answer your question about checking/confirming the source of the spike, you can try the following:
Check variation in images - already ruled out as per your comment
Monitor resource utilization during processing. Check if any resource is choking (% utilization is the simplest way to check, and the sar/nmon utilities on Linux are best, with minimal overhead).
Reserve a few CPUs on the system (CPU affinity) for your experiment, dedicated only to your program so that no OS task will run on them. taskset is the simplest utility to try out. More details are here.
Run the experiment with this setting and check behavior.
That's a nasty thing you are trying to figure out; I wouldn't even attempt to, since coming to concrete conclusions is hard.
In general, one should run a loop of many iterations (100 just seems too small I think), and then take the average time for an image to be processed.
That will rule out any unexpected exterior events that may have hurt performance of your program.
A typical way to check if "some other task started running inside the CPU" would be to run your program once and mark the images that produce that spike. For example, images 2, 4, 5, and 67 take too long to be processed. Run your program again a few times, and mark again which images produce the spikes.
If the same images produce these spikes, then it's not something caused by another exterior task.
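A minimal sketch of that run-and-mark approach, with process_image() standing in for the real processing and the spike threshold chosen arbitrarily at roughly twice the typical 5-6 ms:

    // Sketch: time each image, record which indices spike above a threshold,
    // and compare the flagged indices across repeated runs.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    void process_image(int /*index*/) {
        // placeholder for the real image-processing call
    }

    int main() {
        const int kImages = 100;
        const double kSpikeThresholdMs = 10.0;   // illustrative: ~2x the usual 5-6 ms
        std::vector<int> spiked;

        for (int i = 0; i < kImages; ++i) {
            auto t0 = std::chrono::steady_clock::now();
            process_image(i);
            auto t1 = std::chrono::steady_clock::now();
            double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
            if (ms > kSpikeThresholdMs) {
                spiked.push_back(i);
            }
        }

        for (int idx : spiked) {
            std::printf("spike at image %d\n", idx);
        }
        return 0;
    }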
What is the standard way to go about troubleshooting these types of issues?
There are real-time operating systems (RTOSes) which guarantee those kinds of delays. They are a totally different class of operating system from Windows or Linux.
But still, there are things you can do about your delays even on a general-purpose OS.
1. Avoid system calls
Once you ask your OS to read or write something to disk, there are no guarantees whatsoever about delays. So avoid any system functions on your critical path:
even functions like gettimeofday() might cause unpredictable delays, so you should really avoid any system calls in time-critical code;
use another thread to perform IO and pass data via a shared buffer to your critical code.
If your code base is big, use tools like strace on Linux or Dr. Memory on Windows to find where system calls happen.
2. Avoid context switches
Multithreading on Windows is preemptive. That means there is a system scheduler which might stop your thread at any time and schedule another thread on your CPU. As before, there are RTOSes which allow you to avoid such context switches, but there are things you can do about it on a general-purpose OS:
make sure there is at least one CPU core left for the system and other tasks;
bind each of your threads to a dedicated CPU with SetThreadAffinityMask() (Windows) or sched_setaffinity() (Linux) -- this effectively hints the system scheduler to avoid scheduling other threads on that CPU (see the sketch at the end of this section);
make sure hardware interrupts go to another CPU; interrupts usually go to CPU 0, so the easiest way is to bind your threads to CPU 1 and up;
increase your thread priority, so the scheduler is less likely to switch your thread out for another one.
There are tools like perf (Linux) and Intel VTune (Windows) to confirm whether context switches are happening.
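A minimal Linux sketch of the affinity-pinning suggestion, using sched_setaffinity() to keep the calling thread on CPU 1 (on Windows the equivalent call is SetThreadAffinityMask()):

    // Sketch: pin the calling thread to a given CPU so the scheduler keeps it off
    // CPU 0, where hardware interrupts typically land. Linux-specific.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <sched.h>
    #include <cstdio>

    bool pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        // pid 0 means "the calling thread"
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            std::perror("sched_setaffinity");
            return false;
        }
        return true;
    }

    int main() {
        if (pin_to_cpu(1)) {
            std::printf("pinned to CPU 1\n");
        }
        return 0;
    }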
3. Avoid other non-deterministic features
A few more sources of unexpected delays:
disable swap, so you know for sure your thread's memory will not be swapped out to a slow and unpredictable disk drive (see the mlockall() sketch after this list for a per-process alternative);
disable CPU turbo boost -- after a high-performance CPU boosts, there is always a slowdown so the CPU stays within its thermal design power (TDP);
disable hyper-threading -- from the scheduler's point of view hyper-threads are independent CPUs, but in fact the performance of each one depends on what the other thread on the same core is doing at the moment.
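As a narrower, per-process alternative to disabling swap system-wide, the process can lock its own pages in RAM with mlockall(); a minimal sketch:

    // Sketch: lock all current and future pages of the process into RAM so the
    // time-critical thread's memory cannot be swapped out. Needs CAP_IPC_LOCK or
    // a sufficient RLIMIT_MEMLOCK.
    #include <sys/mman.h>
    #include <cstdio>

    int main() {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            std::perror("mlockall");
            return 1;
        }
        // ... time-critical work ...
        munlockall();
        return 0;
    }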
Hope this helps.

Finding CPU utilization and CPU cycles

My program is in C++ and I have one server listening to a number of clients. Clients send small packets to server. I'm running my code on Ubuntu.
I want to measure the CPU utilization and possibly total number of CPU cycles on both sides, ideally with a breakdown on cycles/utilization spent on networking (all the way from NIC to the user space and vice versa), kernel space, user space, context switches, etc.
I did some searching, but I couldn't figure out whether this should be done inside my C++ code, whether an external profiler should be used, or perhaps some other way.
Your best friend/helper in this case is the /proc file system on Linux. In /proc you will find CPU usage, memory usage, power usage, etc. Have a look at this link:
http://www.linuxhowtos.org/System/procstat.htm
You can even check each process's CPU usage by looking at the file /proc/<process_id>/stat.
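For example, a minimal sketch that pulls the user and kernel CPU time of a process out of /proc/<pid>/stat (utime and stime are the 14th and 15th fields, reported in clock ticks):

    // Sketch: read a process's user and kernel CPU time (in clock ticks) from
    // /proc/<pid>/stat. The comm field (field 2) is parenthesised and may contain
    // spaces, so parsing starts after the last ')'.
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <cstdio>
    #include <unistd.h>   // getpid, sysconf

    bool read_cpu_times(int pid, long& utime_ticks, long& stime_ticks) {
        std::ifstream in("/proc/" + std::to_string(pid) + "/stat");
        std::string line;
        if (!std::getline(in, line)) return false;

        std::istringstream rest(line.substr(line.rfind(')') + 1));
        std::string skip;
        rest >> skip;                        // state (field 3)
        for (int field = 4; field <= 13; ++field)
            rest >> skip;                    // ppid .. cmajflt (fields 4-13)
        rest >> utime_ticks >> stime_ticks;  // utime, stime (fields 14, 15)
        return static_cast<bool>(rest);
    }

    int main() {
        long ut = 0, st = 0;
        if (read_cpu_times(getpid(), ut, st)) {
            const long hz = sysconf(_SC_CLK_TCK);   // clock ticks per second
            std::printf("user: %.2f s  kernel: %.2f s\n",
                        static_cast<double>(ut) / hz,
                        static_cast<double>(st) / hz);
        }
        return 0;
    }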
Take a look at the RDTSCP instruction and some other ways to measure performance metrics. System simulators like SniperSim, gem5, etc. can also give the total cycle count of your running program (however, they may not be very accurate -- there are some conditions that need to be met, e.g. core frequencies being the same).
As I commented, you probably should consider using OProfile. I am not very familiar with it (and it may be complex to use and require system-wide configuration).

Finding performance issue that may be due to thread locking (possibly)

I've spent a little time running valgrind/callgrind to profile a server that does a lot of TCP/IP communications using many threads. After some time improving the performance, I realised that in this particular test scenario, the process is not CPU bound so the performance "improvements" I'd looked at were of no use.
In theory, the CPU should be very busy. I know the TCP/IP device it connects to isn't the limitation, as the server runs on two machines. One is a PC, the other is an embedded device with an ARM processor. Even the embedded device only gets to about 2% CPU usage, but it does far fewer transactions - about a tenth. Both systems only get up to about 2% even though we're trying to get data as fast as possible.
My guess is that some mutex is locked and is holding up a thread. This is a pure guess! There are a few threads in the system with common data. Perhaps there are other possibilities but how do I tell?
Is there any way to use a tool like valgrind/callgrind that might show the time spent in system calls? I can also run it on Windows with Visual Studio 2012 if that's better.
We might have to try walking through the code or something, but I'm not sure we have time.
Any tips appreciated.
Thanks.
Callgrind is a great profiler but it does have some drawbacks. In particular, it assumes that the same instruction always executes in the same amount of time, and it assumes that instruction counts are the most important metric.
This is fine for getting (mostly) reproducible profiling results and for analyzing in detail what instructions are executed, but there are some types of performance problems which Callgrind doesn't detect:
time spent waiting for locks
time spent sleeping (eg. simple sleep()/usleep() calls will effectively slow down your application but won't show up in Callgrind)
time spent waiting for disk I/O or network I/O
time spent waiting for data that was swapped out
influences from CPU cache hits/misses (you can try to use Cachegrind for this particular topic)
influences from CPU pipeline stalls, branch prediction failures and all the other features of modern CPUs that can cause the same instruction to be executed faster or slower depending on the context
These problems can be detected quite well using a statistical (or sample-based) profiler. Examples would be Sysprof and OProfile, or any kind of "poor-man's sampling profiler" as described eg. at https://stackoverflow.com/a/378024. The VS2012 built-in profiler mentioned by WhozCraig appears to be a sampling profiler as well.
While statistical profilers are really useful because they provide "real-world" results instead of simple instruction counts, they have the possible drawback that you don't get reproducible results easily (the results will vary a little bit with every run), and that you need to gather a sufficient number of samples to get detailed results.

Resource recommendations for Windows performance tuning (realtime)

Any recommendations out there for Windows application tuning resources (books web sites etc.)?
I have a C++ console application that needs to feed a hardware device with a considerable amount of data at a fairly high rate. (buffer is 32K in size and gets consumed at ~800k bytes per second)
It will stream data without underruns, except when I perform file I/O like opening a folder, etc. (it seems to be only marginally meeting its timing requirements).
Anyway.. a good book or resource to brush up on realtime performance with windows would be helpful.
Thanks!
The best you can hope for on commodity Windows is "usually meets timing requirements". If the system is running any processes other than your target app, it will occasionally miss deadlines due to scheduling inconsistencies. However, if your app/hardware can handle the rare but occasional misses, there are a few things you can do to reduce the number of misses.
Set your process's priority to REALTIME_PRIORITY_CLASS
Change the scheduler's granularity to 1ms resolution via the timeBeginPeriod() function (part of the Windows Multimedia libraries)
Avoid as many system calls in your main loop as possible (this includes allocating memory). Each syscall is an opportunity for the OS to put the process to sleep and, consequently, is an opportunity for the non-deterministic scheduler to miss the next deadline
If this doesn't get the job done for you, you might consider trying a Linux distribution with realtime kernel patches applied. I've found those to provide near-perfect timing (within tens of microseconds of accuracy over the course of several hours). That said, nothing short of a true realtime OS will actually give you perfection, but the realtime-Linux distros are much closer than commodity Windows.
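A minimal Windows sketch of the first two suggestions above (link with winmm.lib for timeBeginPeriod/timeEndPeriod):

    // Sketch: raise the process to REALTIME_PRIORITY_CLASS and request 1 ms timer
    // resolution for the lifetime of the process (restore it on the way out).
    #include <windows.h>
    #include <mmsystem.h>   // timeBeginPeriod/timeEndPeriod, link with winmm.lib
    #include <cstdio>

    int main() {
        if (!SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS)) {
            std::printf("SetPriorityClass failed: %lu\n", GetLastError());
        }
        if (timeBeginPeriod(1) != TIMERR_NOERROR) {
            std::printf("1 ms timer resolution not supported\n");
        }

        // ... feed the hardware device here ...

        timeEndPeriod(1);   // undo the resolution request before exiting
        return 0;
    }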
The first thing I would do is tune it to where it's as lean as possible. I use this method. For these reasons. Since it's a console app, another option is to try out LTProf, which will show you if there is anything you can fruitfully optimize. When that's done, you will be in the best position to look for buffer timing issues, as #Hans suggested.
"Optimizing software in C++" from agner.org is a great optimization manual.
As Rakis said, you will need to be very careful in the processing loop (a small sketch follows this list):
No memory allocation. Use the stack and preallocated memory instead.
No throws. Exceptions are quite expensive; in Win32 they have a cost even when nothing is thrown.
No polymorphism. You will save some indirections.
Use inline extensively.
No locks. Try lock-free approaches when possible.
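A small illustration of the first point: allocate the working buffer once, outside the loop, and reuse it for every iteration (fill_block() and send_to_device() are placeholders for the real producer and consumer):

    // Sketch: keep the hot loop allocation-free by sizing the working buffer once,
    // outside the loop, and reusing the same storage every iteration.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    void fill_block(std::vector<std::uint8_t>& buf);            // assumed producer
    void send_to_device(const std::vector<std::uint8_t>& buf);  // assumed consumer

    void stream_loop(std::size_t block_size, int iterations) {
        std::vector<std::uint8_t> buffer(block_size);   // single up-front allocation

        for (int i = 0; i < iterations; ++i) {
            fill_block(buffer);        // reuses the same storage every time
            send_to_device(buffer);    // no allocation, no locks in the loop
        }
    }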
The buffer will last for only 40 milliseconds (32 KB at ~800 KB/s). You can't guarantee zero underruns on Windows with such strict timing requirements. In user-mode land you are looking at, potentially, hundreds of milliseconds when kernel threads do what they need to do; they run with higher priorities than you can ever gain. The thread quantum on the workstation version is 3 times the clock tick, already beyond 40 milliseconds (3 x 15.625 ms ≈ 47 ms). You can't even reliably compete with user-mode threads that have boosted their priority and take their sweet old time.
If a bigger buffer is not an option then you are looking at a device driver to get this kind of service guarantee. Or something in between that can provide a larger buffer.